# [ECIR 2022 Tutorial on Hands on advanced machine learning for information extraction from tweets tasks, data, and open source tools](https://socialmediaie.github.io/tutorials/ECIR2022/)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/socialmediaie/tutorials/blob/master/docs/ECIR2022/ECIR_2022_Tutorial_SocialMediaIE.ipynb)


* Author: [Shubhanshu Mishra](http://shubhanshu.com/)
* Contact: [https://twitter.com/TheShubhanshu](https://twitter.com/TheShubhanshu)
* **QnA Page:** [https://slido.com with #905356](https://slido.com/905356)

More details at: https://socialmediaie.github.io/tutorials/ECIR2022/

This notebook demonstrates [SocialMediaIE](https://github.com/socialmediaie/SocialMediaIE), an open-source tool we built that uses multi-task learning to achieve state-of-the-art performance on English-language social media data such as tweets. You can find more details about the tool at: https://github.com/socialmediaie/SocialMediaIE

If you have feedback or requests for features in SocialMediaIE please raise an issue at: https://github.com/socialmediaie/SocialMediaIE/issues

If you would like to use SocialMediaIE on your local machine please follow the instructions at: https://socialmediaie.github.io/tutorials/ECIR2022/#software-setup



# Lexicon Based Information Extraction

### Install dependencies

In [None]:
! pip install flashtext

Collecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9309 sha256=9a20cd89876b706e2ae67e2e0f534523f1eb2fb5b7fe58c5393cf1fcec7a7208
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: flashtext
Successfully installed flashtext-2.7


In [None]:
%%bash
if [ ! -f Enhanced_Morality_Lexicon_V1.1.txt ]; then
  # Source of data: https://databank.illinois.edu/datasets/IDB-3957440
  wget -q https://databank.illinois.edu/datafiles/pjwpj/download -O Enhanced_Morality_Lexicon_V1.1.txt
fi

In [None]:
%%bash
if [ ! -f NRC-Emotion-Lexicon.zip ]; then
  # Source of data: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
  wget -q http://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip -O NRC-Emotion-Lexicon.zip
  unzip NRC-Emotion-Lexicon.zip
fi

Archive:  NRC-Emotion-Lexicon.zip
   creating: NRC-Emotion-Lexicon/
  inflating: __MACOSX/._NRC-Emotion-Lexicon  
  inflating: NRC-Emotion-Lexicon/EmoLex-Ethics-Data-Statement.pdf  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._EmoLex-Ethics-Data-Statement.pdf  
  inflating: NRC-Emotion-Lexicon/.DS_Store  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._.DS_Store  
   creating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/
  inflating: __MACOSX/NRC-Emotion-Lexicon/._NRC-Emotion-Lexicon-v0.92  
  inflating: NRC-Emotion-Lexicon/NRC - Sentiment Lexicon - Research EULA Sept 2017 .pdf  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._NRC - Sentiment Lexicon - Research EULA Sept 2017 .pdf  
  inflating: NRC-Emotion-Lexicon/README.txt  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._README.txt  
  inflating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/.DS_Store  
  inflating: __MACOSX/NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/._.DS_Store  
   creating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-v0.92/

In [None]:
%%bash
if [ ! -f subjectivity_clues_hltemnlp05.zip ]; then
  # Source of data: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
  wget -q https://mpqa.cs.pitt.edu/data/subjectivity_clues_hltemnlp05.zip -O subjectivity_clues_hltemnlp05.zip
  unzip subjectivity_clues_hltemnlp05.zip
fi

Archive:  subjectivity_clues_hltemnlp05.zip
   creating: subjectivity_clues_hltemnlp05/
  inflating: subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.README  
   creating: __MACOSX/subjectivity_clues_hltemnlp05/
  inflating: __MACOSX/subjectivity_clues_hltemnlp05/._subjclueslen1-HLTEMNLP05.README  
  inflating: subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff  


### Create Helpers

In [None]:
import pandas as pd
from flashtext import KeywordProcessor
from pathlib import Path
import seaborn as sns
from spacy import displacy

In [None]:
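class Lexicon(object):
  """Loads a lexicon into a DataFrame and builds a flashtext tagger from it."""

  def __init__(self):
    # Set defaults so display() also works when create_tagger() is
    # called with set_colors=False
    self.lexicon = None
    self.keyword_processor = None
    self.colors = None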

  def read_morality_lexicon(self, file_path):
    with open(file_path) as fp:
      data = []
      for line in fp:
        line = line.strip()
        line = line.split("|")
        line_data = dict([v.split(" = ") for v in line])
        data.append(line_data)
      self.lexicon = pd.DataFrame(data)

  def read_mpqa_lexicon(self, file_path):
    with open(file_path) as fp:
      data = []
      for i, line in enumerate(fp):
        line = line.strip()
        line = line.split(" ")
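        try:
          line_data = dict([v.split("=") for v in line])
        except ValueError as e:
          # Report the malformed line and skip it; otherwise the previous
          # line's data would be appended a second time
          print(i, e)
          print(line)
          continue
        data.append(line_data)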
      self.lexicon = pd.DataFrame(data)

  def read_emolex(self, dir_path):
    file_path = Path(dir_path) / "NRC-Emotion-Lexicon-v0.92/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"
    df = pd.read_csv(file_path, sep="\t", skiprows=1, header=None, names=["word", "affect", "association"])
    self.lexicon = df[df.association == 1]
  
  def create_tagger(self, token_col, label_col, case_sensitive=False, set_colors=False):
    keyword_processor = KeywordProcessor(case_sensitive=case_sensitive)
    keyword_dict = self.lexicon.groupby(label_col)[token_col].agg(list)
    keyword_processor.add_keywords_from_dict(keyword_dict)
    if set_colors:
      self.colors = self.get_colors(keyword_dict.keys(), cmap="hls")
    self.keyword_processor = keyword_processor

  def get_colors(self, labels, cmap="hls"):
    colors = {
      l.upper():c 
        for l,c in zip(
          labels,
          sns.color_palette(cmap, len(labels)).as_hex(),
        )
    }
    return colors

  def display(self, text, annotations):
    ex = [{"text": text,
       "ents": [
                {"start": start, "end": end, "label": label} 
                for label, start, end in annotations
                ],
       "title": None}]
    options = dict()
    if self.colors:
      options = dict(colors=self.colors)
    html = displacy.render(ex, style="ent", manual=True, jupyter=True, options=options)
    return html

### Morality Lexicon

In [None]:
morality_lexicon = Lexicon()
morality_lexicon.read_morality_lexicon("./Enhanced_Morality_Lexicon_V1.1.txt")

In [None]:
morality_lexicon.lexicon.head()

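| | token | pos | syn_label | lemma | m_desc | m_type |
|---|---|---|---|---|---|---|
| 0 | aegis | n | S | no | CareVirtue | 1 |
| 1 | aegises | n | S | no | CareVirtue | 1 |
| 2 | affection | n | H | no | CareVirtue | 1 |
| 3 | affectionate | adj | S | no | CareVirtue | 1 |
| 4 | affectionateness | n | H | no | CareVirtue | 1 |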


In [None]:
morality_lexicon.lexicon.groupby("m_desc")["token"].agg(list)

m_desc
AuthorityVice      [agitated, agitating, agitation, agitations, a...
AuthorityVirtue    [abbess, abbesses, abidance, abidances, abide,...
CareVice           [abandon, abandoned, abandons, abase, abuse, a...
CareVirtue         [aegis, aegises, affection, affectionate, affe...
FairnessVice       [advantage, advantages, banishment, banishment...
FairnessVirtue     [aboveboard, adjudicator, adjudicators, admit,...
GeneralVice        [actus rei, actus reus, amiss, amoral, amorall...
GeneralVirtue      [adjust, admirably, admiration, admirations, a...
IngroupVice        [abandon, abandonment, abandonments, act of te...
IngroupVirtue      [accord, accords, across the country, across t...
PurityVice         [scavenge, abhorrent, abominably, adulterant, ...
PurityVirtue       [ablutionary, abstainer, abstainers, abstemiou...
Name: token, dtype: object
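
The grouped Series above maps each moral category to its list of tokens, which is the shape a dictionary tagger expects. As a rough, dependency-free sketch of the idea (not flashtext's actual algorithm, which is trie-based and also handles multi-word phrases), the snippet below inverts a tiny hand-picked subset of that mapping and tags single words with character offsets. The names `mini_lexicon`, `token_to_label`, and `tag_words` are made up for this illustration.

```python
import re

# Tiny hand-picked subset of the label -> tokens mapping (illustration only)
mini_lexicon = {
    "CareVirtue": ["affection", "love"],
    "AuthorityVice": ["agitation"],
    "PurityVice": ["abhorrent"],
}

# Invert to token -> label for constant-time lookups
token_to_label = {
    token.lower(): label
    for label, tokens in mini_lexicon.items()
    for token in tokens
}

def tag_words(text):
    """Return (label, start, end) for every single word found in the lexicon."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        label = token_to_label.get(match.group().lower())
        if label is not None:
            annotations.append((label, match.start(), match.end()))
    return annotations

sample = "A lot of affection and love"
print(tag_words(sample))
```

The tuples follow the same `(label, start, end)` layout that `extract_keywords(..., span_info=True)` returns, so they could be passed to the `display` helper in the same way.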

In [None]:
morality_lexicon.create_tagger(token_col="token", label_col="m_desc", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
morality_annotations = morality_lexicon.keyword_processor.extract_keywords(text, span_info=True)
morality_annotations

[('CareVirtue', 18, 27),
 ('CareVirtue', 53, 57),
 ('AuthorityVice', 81, 90),
 ('PurityVice', 95, 104)]
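
Because `span_info=True` returns character offsets, the matched surface strings can be recovered by slicing the text, and the labels can be tallied with `collections.Counter`. A small sketch using the annotations printed above (the variable names `matches` and `label_counts` are illustrative):

```python
from collections import Counter

text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
)
# Copied from the output above
annotations = [
    ('CareVirtue', 18, 27),
    ('CareVirtue', 53, 57),
    ('AuthorityVice', 81, 90),
    ('PurityVice', 95, 104),
]

# Slice the text with each span to see which word triggered the label
matches = [(label, text[start:end]) for label, start, end in annotations]
# Count how often each moral category occurs in the text
label_counts = Counter(label for label, _, _ in annotations)

print(matches)
print(label_counts.most_common())
```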

In [None]:
morality_lexicon.lexicon.groupby("token")["pos"].agg(list)

token
Bolshevism               [n]
Church Father            [n]
Church Fathers           [n]
Defense Department       [n]
Defense Departments      [n]
                       ...  
yield                    [v]
yieldingly             [adv]
yuckier                [adj]
yuckiest               [adj]
yucky                  [adj]
Name: pos, Length: 4419, dtype: object

In [None]:
morality_lexicon.display(text, morality_annotations)

### Emolex

In [None]:
emolex = Lexicon()
emolex.read_emolex("./NRC-Emotion-Lexicon")
emolex.lexicon.head()

| | word | affect | association |
|---|---|---|---|
| 18 | abacus | trust | 1 |
| 22 | abandon | fear | 1 |
| 24 | abandon | negative | 1 |
| 26 | abandon | sadness | 1 |
| 29 | abandoned | anger | 1 |
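
`read_emolex` keeps only the rows whose `association` is 1, that is, the word and affect pairs that actually hold. The same filter can be sketched without pandas, assuming a tab-separated layout with `word`, `affect`, and `association` columns as in the `read_csv` call above. The inline sample below is a hand-made excerpt for illustration, not the real file.

```python
import csv
import io

# Hand-made excerpt in the assumed "word<TAB>affect<TAB>association" layout
sample = (
    "abacus\tanger\t0\n"
    "abacus\ttrust\t1\n"
    "abandon\tfear\t1\n"
    "abandon\tjoy\t0\n"
    "abandon\tsadness\t1\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
# Keep only (word, affect) pairs flagged with association == 1
associations = [(word, affect) for word, affect, flag in reader if flag == "1"]
print(associations)
```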


In [None]:
emolex.create_tagger(token_col="word", label_col="affect", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
emolex_annotations = emolex.keyword_processor.extract_keywords(text, span_info=True)
emolex_annotations

[('trust', 18, 27),
 ('sadness', 32, 36),
 ('positive', 53, 57),
 ('sadness', 62, 68),
 ('negative', 81, 90),
 ('negative', 95, 104)]

In [None]:
emolex.display(text, emolex_annotations)

### MPQA Subjectivity Lexicon

In [None]:
! sed -n 5549p subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff

type=strongsubj len=1 word1=pervasive pos1=adj stemmed1=n m priorpolarity=negative
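
The line printed above contains a stray `m` token after `stemmed1=n`, so splitting every token on `=` fails for it. One possible way to tolerate such lines, sketched here as an alternative to what `Lexicon.read_mpqa_lexicon` does and not part of the original class, is to keep only the tokens that contain `=`. The helper name `parse_tff_line` is hypothetical.

```python
def parse_tff_line(line):
    """Parse one 'key=value key=value ...' line, ignoring tokens without '='."""
    fields = {}
    for token in line.strip().split(" "):
        if "=" not in token:
            continue  # stray token such as the lone 'm' above
        key, value = token.split("=", 1)
        fields[key] = value
    return fields

line = "type=strongsubj len=1 word1=pervasive pos1=adj stemmed1=n m priorpolarity=negative"
print(parse_tff_line(line))
```

With this approach the entry for `pervasive` would be kept instead of being reported as an error, at the cost of silently dropping whatever the stray token was meant to encode.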


In [None]:
mpqa = Lexicon()
mpqa.read_mpqa_lexicon("subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff")
mpqa.lexicon.head()

5548 dictionary update sequence element #5 has length 1; 2 is required
['type=strongsubj', 'len=1', 'word1=pervasive', 'pos1=adj', 'stemmed1=n', 'm', 'priorpolarity=negative']
5549 dictionary update sequence element #5 has length 1; 2 is required
['type=strongsubj', 'len=1', 'word1=pervasive', 'pos1=noun', 'stemmed1=n', 'm', 'priorpolarity=negative']


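| | type | len | word1 | pos1 | stemmed1 | priorpolarity | polarity | mpqapolarity |
|---|---|---|---|---|---|---|---|---|
| 0 | weaksubj | 1 | abandoned | adj | n | negative | NaN | NaN |
| 1 | weaksubj | 1 | abandonment | noun | n | negative | NaN | NaN |
| 2 | weaksubj | 1 | abandon | verb | y | negative | NaN | NaN |
| 3 | strongsubj | 1 | abase | verb | y | negative | NaN | NaN |
| 4 | strongsubj | 1 | abasement | anypos | y | negative | NaN | NaN |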


In [None]:
mpqa.create_tagger(token_col="word1", label_col="priorpolarity", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
mpqa_annotations = mpqa.keyword_processor.extract_keywords(text, span_info=True)
mpqa_annotations

[('positive', 18, 27),
 ('negative', 32, 36),
 ('positive', 53, 57),
 ('negative', 62, 68),
 ('negative', 81, 90),
 ('negative', 95, 104)]
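
The three lexicons tag overlapping spans of the same sentence, so it can be handy to line their labels up per span. A small sketch using the three annotation lists printed in this notebook (the grouping code and the name `labels_by_span` are an illustration, not part of SocialMediaIE):

```python
from collections import defaultdict

text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
)
# Annotation lists copied from the outputs of the three taggers
annotations_by_lexicon = {
    "morality": [('CareVirtue', 18, 27), ('CareVirtue', 53, 57),
                 ('AuthorityVice', 81, 90), ('PurityVice', 95, 104)],
    "emolex": [('trust', 18, 27), ('sadness', 32, 36), ('positive', 53, 57),
               ('sadness', 62, 68), ('negative', 81, 90), ('negative', 95, 104)],
    "mpqa": [('positive', 18, 27), ('negative', 32, 36), ('positive', 53, 57),
             ('negative', 62, 68), ('negative', 81, 90), ('negative', 95, 104)],
}

# Group labels by (start, end) span across all lexicons
labels_by_span = defaultdict(dict)
for lexicon_name, annotations in annotations_by_lexicon.items():
    for label, start, end in annotations:
        labels_by_span[(start, end)][lexicon_name] = label

for (start, end), labels in sorted(labels_by_span.items()):
    print(text[start:end], labels)
```

Spans that a lexicon does not cover, such as `hate` for the morality lexicon, simply have no entry for that lexicon.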

In [None]:
mpqa.display(text, mpqa_annotations)

In [None]:
mpqa.create_tagger(token_col="word1", label_col="pos1", case_sensitive=False, set_colors=True)

In [None]:
mpqa.display(text, mpqa.keyword_processor.extract_keywords(text, span_info=True))

## Other Lexicons

* Multilingual Abusive words lexicon - https://github.com/valeriobasile/hurtlex
* Named Entity Lexicon - https://github.com/napsternxg/TwitterNER/tree/master/data/cleaned/custom_lexicons

# Model Based Information Extraction

## Install dependencies and setup system path


In [None]:
# Add -f https://download.pytorch.org/whl/cu100/torch_stable.html on cuda notebook
! pip install torch==1.0.0 allennlp==0.8.3 numpy==1.15.4 scipy==1.2.1 pandas scikit-learn==0.20.2 tqdm twarc flashtext overrides==3.1.0

Collecting torch==1.0.0
  Downloading torch-1.0.0-cp37-cp37m-manylinux1_x86_64.whl (591.8 MB)
Collecting allennlp==0.8.3
  Downloading allennlp-0.8.3-py3-none-any.whl (5.6 MB)
Collecting numpy==1.15.4
  Downloading numpy-1.15.4-cp37-cp37m-manylinux1_x86_64.whl (13.8 MB)
Collecting scipy==1.2.1
  Downloading scipy-1.2.1-cp37-cp37m-manylinux1_x86_64.whl (24.8 MB)
Collecting scikit-learn==0.20.2
  Downloading scikit_learn-0.20.2-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB)
Collecting twarc
  Downloading twarc-2.10.2-py3-none-any.whl (59 kB)
Collecting overrides==3.1.0
  Downloading overrides-3.1.0.tar.gz (11 kB)

In [None]:
!which python # should return /usr/local/bin/python
!python --version
!echo $PYTHONPATH

/usr/local/bin/python
Python 3.7.13
/env/python


In [None]:
%%bash
if [ ! -d "SocialMediaIE" ]; then
  git clone --single-branch --branch tutorial https://github.com/socialmediaie/SocialMediaIE.git
  # conda env update -n base -f ./SocialMediaIE/environment.yml
  pip install -e ./SocialMediaIE/.
  #! pip install -e git+https://github.com/socialmediaie/SocialMediaIE.git#egg=SocialMediaIE
fi

Obtaining file:///content/SocialMediaIE
Installing collected packages: SocialMediaIE
  Running setup.py develop for SocialMediaIE
Successfully installed SocialMediaIE-0.1


Cloning into 'SocialMediaIE'...


### Set up the system path, otherwise the dependencies will not load properly

In [None]:
import sys
print(sys.path)
#index_to_insert = min([i for i, v in enumerate(sys.path) if "dist-packages" in v])
# sys.path.insert(0, "/usr/local/lib/python3.6/site-packages")
# sys.path.insert(0, "./SocialMediaIE/") # Important to have this first else the package will not load

['', '/content', '/env/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.7/dist-packages/IPython/extensions', '/root/.ipython']

### Download the models

In [None]:
%%bash
if [ ! -f ic2s2_data.tar.gz ]; then
  # Source of data: https://databank.illinois.edu/datasets/IDB-0851257
  wget -q https://databank.illinois.edu/datafiles/vodt2/download -O ic2s2_data.tar.gz
  cd SocialMediaIE && tar -xzf ../ic2s2_data.tar.gz
fi

## After the dependencies are installed, restart the kernel and run from here

In [None]:
import torch
print(torch.__version__)
torch.cuda.is_available()

1.0.0


False

In [None]:
import sys
print("You are using Python {}.{}.".format(sys.version_info.major, sys.version_info.minor))

You are using Python 3.7.


In [None]:
%env SOCIALMEDIAIE_PATH /content/SocialMediaIE/

env: SOCIALMEDIAIE_PATH=/content/SocialMediaIE/


### Check the SocialMediaIE folder was cloned properly

In [None]:
%%bash 
echo "${SOCIALMEDIAIE_PATH}"
ls -ltrh "${SOCIALMEDIAIE_PATH}/data"
realpath "${SOCIALMEDIAIE_PATH}"
cd "${SOCIALMEDIAIE_PATH}" && ls -ltrh

/content/SocialMediaIE/
total 16K
-rw-r--r-- 1 root root  11K Apr 10 05:29 databank_api_client_v3.py
-rw-r--r-- 1 root root 2.8K Apr 10 05:29 cleanup_model_folders.py
/content/SocialMediaIE
total 88K
-rw-r--r-- 1 root root 2.8K Apr 10 05:29 README.md
-rw-r--r-- 1 root root  12K Apr 10 05:29 LICENSE
-rw-r--r-- 1 root root 1.7K Apr 10 05:29 TODO.md
drwxr-xr-x 9 root root 4.0K Apr 10 05:29 SocialMediaIE
drwxr-xr-x 3 root root 4.0K Apr 10 05:29 docs
drwxr-xr-x 2 root root 4.0K Apr 10 05:29 data
-rw-r--r-- 1 root root  338 Apr 10 05:29 environment.yml
drwxr-xr-x 2 root root 4.0K Apr 10 05:29 experiments
drwxr-xr-x 4 root root 4.0K Apr 10 05:29 figures
drwxr-xr-x 2 root root 4.0K Apr 10 05:29 notebooks
drwxr-xr-x 3 root root 4.0K Apr 10 05:29 tests
-rw-r--r-- 1 root root  928 Apr 10 05:29 setup.py
-rw-r--r-- 1 root root   69 Apr 10 05:29 run_tests.sh
-rw-r--r-- 1 root root   69 Apr 10 05:29 run_tests.cmd
-rw-r--r-- 1 root root 3.0K Apr 10 05:29 requirements.pinned.txt
-rw-r--r-- 1 root root 

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0


### Imports

In [None]:
from pathlib import Path
import json

from IPython.display import display

import pandas as pd
import torch

from SocialMediaIE.data.tokenization import get_match_iter, get_match_object

## Multi-task Classification

In [None]:
from SocialMediaIE.predictor.model_predictor_classification import run, get_args, PREFIX, get_model_output, output_to_json

In [None]:
SERIALIZATION_DIR = Path("./SocialMediaIE/data/models_classification/all_multitask_shared_bilstm_l2_0_lr_1e-3/")
args = get_args(PREFIX, SERIALIZATION_DIR)
args = args._replace(
    dataset_paths_file = "./SocialMediaIE/experiments/all_classification_dataset_paths.json",
    cuda=False # Very important as not running on GPU
)
args

ModelArgument(task=['founta_abusive', 'waseem_abusive', 'sarcasm_uncertainity', 'veridicality_uncertainity', 'semeval_sentiment', 'clarin_sentiment', 'politics_sentiment', 'other_sentiment'], dataset_paths_file='./SocialMediaIE/experiments/all_classification_dataset_paths.json', dataset_path_prefix='/experiments', model_dir='/content/SocialMediaIE/data/models_classification/all_multitask_shared_bilstm_l2_0_lr_1e-3', clean_model_dir=True, proj_dim=100, hidden_dim=100, encoder_type='bilstm', multi_task_mode='shared', dropout=0.5, lr=0.001, weight_decay=0.0, batch_size=16, epochs=10, patience=3, cuda=False, test_mode=True, residual_connection=False)
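
The `_replace` call above suggests that `args` behaves like a `namedtuple`, whose `_replace` method returns a modified copy instead of changing the object in place, which is why the result is assigned back to `args`. A minimal sketch with a made-up `DemoArgs` type (a stand-in for illustration; the real `ModelArgument` has many more fields):

```python
from collections import namedtuple

# Hypothetical stand-in with only a few of the fields shown above
DemoArgs = namedtuple("DemoArgs", ["model_dir", "batch_size", "cuda"])

defaults = DemoArgs(model_dir="./models", batch_size=16, cuda=True)
# _replace returns a new tuple; the original stays unchanged
cpu_args = defaults._replace(cuda=False)

print(defaults.cuda, cpu_args.cuda)
```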

In [None]:
TASKS, vocab, model, readers, test_iterator = run(args)


In [None]:
def tokenize(text):
    objects = [get_match_object(match) for match in get_match_iter(text)]
    n = len(objects)
    cleaned_objects = []
    for i, obj in enumerate(objects):
        obj["no_space"] = True
        if obj["type"] == "space":
            continue
        if i < n-1 and objects[i+1]["type"] == "space":
            obj["no_space"] = False
        cleaned_objects.append(obj)
    keys = cleaned_objects[0].keys()
    final_sequences = {}
    for k in keys:
        final_sequences[k] = [obj[k] for obj in cleaned_objects]
    return final_sequences

def predict_json(texts=None):
    # Fall back to a few example sentences when no texts are given
    if not texts:
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty cache to ensure larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    # One dict per input text, combining the text, its index, and the model output
    output_json = [
        {
            "classification": dict(
                text=text,
                doc_idx=i,
                **output_to_json(tokens[i], output[i], vocab),
            )
        }
        for i, text in enumerate(texts)
    ]
    return output_json
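
The `no_space` flag recorded by `tokenize` marks tokens that are not followed by whitespace, which is enough to rebuild the text from the token list. The sketch below illustrates this with a simple regex tokenizer standing in for `get_match_iter`; the stand-in `simple_tokenize` and the `detokenize` helper are assumptions made for illustration, and in this simplified version any run of whitespace collapses to a single space.

```python
import re

def simple_tokenize(text):
    """Stand-in tokenizer: words, single punctuation marks, and whitespace runs."""
    objects = [
        {"value": m.group(), "type": "space" if m.group().isspace() else "token"}
        for m in re.finditer(r"\s+|\w+|[^\w\s]", text)
    ]
    cleaned = []
    for i, obj in enumerate(objects):
        if obj["type"] == "space":
            continue
        # A token has no_space=True unless whitespace follows it
        next_is_space = i < len(objects) - 1 and objects[i + 1]["type"] == "space"
        obj["no_space"] = not next_is_space
        cleaned.append(obj)
    return cleaned

def detokenize(tokens):
    """Rebuild the text, adding a space after tokens with no_space=False."""
    return "".join(
        tok["value"] + ("" if tok["no_space"] else " ") for tok in tokens
    )

tokens = simple_tokenize("obama went to Paris.")
print([t["value"] for t in tokens])
print(detokenize(tokens))
```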

In [None]:
output_json = predict_json()
for d in output_json:
  display(d)

{'classification': {'clarin_sentiment': {'negative': 0.3775871992111206,
   'neutral': 0.5517899394035339,
   'positive': 0.07062287628650665},
  'doc_idx': 0,
  'founta_abusive': {'abusive': 0.018016666173934937,
   'hateful': 0.04844215139746666,
   'normal': 0.9298549294471741,
   'spam': 0.003686218522489071},
  'other_sentiment': {'negative': 0.3766862154006958,
   'neutral': 0.5350586175918579,
   'positive': 0.0882551521062851},
  'politics_sentiment': {'negative': 0.4884316921234131,
   'neutral': 0.40015631914138794,
   'positive': 0.11141201853752136},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9953067898750305,
   'sarcasm': 0.004693204071372747},
  'semeval_sentiment': {'negative': 0.39029020071029663,
   'neutral': 0.5236591100692749,
   'positive': 0.08605081588029861},
  'text': 'Barack Obama went to Paris and never returned to the USA.',
  'tokens': ['Barack',
   'Obama',
   'went',
   'to',
   'Paris',
   'and',
   'never',
   'returned',
   'to',
   'the',
   'USA',


{'classification': {'clarin_sentiment': {'negative': 0.09120587259531021,
   'neutral': 0.5898061990737915,
   'positive': 0.3189878761768341},
  'doc_idx': 1,
  'founta_abusive': {'abusive': 0.005538747180253267,
   'hateful': 0.0044571938924491405,
   'normal': 0.9843591451644897,
   'spam': 0.005644865799695253},
  'other_sentiment': {'negative': 0.12104218453168869,
   'neutral': 0.5459958910942078,
   'positive': 0.33296194672584534},
  'politics_sentiment': {'negative': 0.584104597568512,
   'neutral': 0.33796775341033936,
   'positive': 0.0779275968670845},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9936363697052002,
   'sarcasm': 0.0063636587001383305},
  'semeval_sentiment': {'negative': 0.02551230788230896,
   'neutral': 0.48244330286979675,
   'positive': 0.4920443296432495},
  'text': 'Stan Lee was a legend who developed Spiderman and the Avengers movie series.',
  'tokens': ['Stan',
   'Lee',
   'was',
   'a',
   'legend',
   'who',
   'developed',
   'Spiderman',
   'and

{'classification': {'clarin_sentiment': {'negative': 0.014927789568901062,
   'neutral': 0.033872947096824646,
   'positive': 0.9511992335319519},
  'doc_idx': 2,
  'founta_abusive': {'abusive': 0.0244298093020916,
   'hateful': 0.013859059661626816,
   'normal': 0.957321286201477,
   'spam': 0.004389943089336157},
  'other_sentiment': {'negative': 0.012285849079489708,
   'neutral': 0.026221955195069313,
   'positive': 0.9614921808242798},
  'politics_sentiment': {'negative': 0.2683788239955902,
   'neutral': 0.20243778824806213,
   'positive': 0.5291833281517029},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9019794464111328,
   'sarcasm': 0.09802058339118958},
  'semeval_sentiment': {'negative': 0.0021598394960165024,
   'neutral': 0.009135011583566666,
   'positive': 0.9887052178382874},
  'text': 'I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.',
  'tokens': ['I',
   'just',
   'learned',
   'about',
   'donald',
   'drumph',
   'throug

In [None]:
json.dumps(predict_json(["barack obama went to paris"])[0])

'{"classification": {"text": "barack obama went to paris", "doc_idx": 0, "tokens": ["barack", "obama", "went", "to", "paris"], "founta_abusive": {"normal": 0.9031787514686584, "spam": 0.013646047562360764, "abusive": 0.05717331916093826, "hateful": 0.0260018240660429}, "waseem_abusive": {"none": 0.9790737628936768, "sexism": 0.01925569958984852, "racism": 0.001670498983003199}, "sarcasm_uncertainity": {"not_sarcasm": 0.9918217062950134, "sarcasm": 0.008178316056728363}, "veridicality_uncertainity": {"uncertain": 0.5396470427513123, "definitely_yes": 0.2180899977684021, "probably_yes": 0.1622510403394699, "probably_no": 0.055232349783182144, "definitely_no": 0.024779530242085457}, "semeval_sentiment": {"positive": 0.20253653824329376, "neutral": 0.6687340140342712, "negative": 0.1287294328212738}, "clarin_sentiment": {"neutral": 0.5424699783325195, "positive": 0.15769727528095245, "negative": 0.2998327910900116}, "politics_sentiment": {"negative": 0.2542945146560669, "neutral": 0.623954

In [None]:
text = """'You can give our democracy new meaning': Barack Obama urges young Americans to vote."""

json.dumps(predict_json([text])[0])

'{"classification": {"text": "\'You can give our democracy new meaning\': Barack Obama urges young Americans to vote.", "doc_idx": 0, "tokens": ["\'", "You", "can", "give", "our", "democracy", "new", "meaning", "\'", ":", "Barack", "Obama", "urges", "young", "Americans", "to", "vote", "."], "founta_abusive": {"normal": 0.9205254912376404, "spam": 0.017247281968593597, "abusive": 0.02483111247420311, "hateful": 0.03739606589078903}, "waseem_abusive": {"none": 0.917626142501831, "sexism": 0.04094952344894409, "racism": 0.04142434149980545}, "sarcasm_uncertainity": {"not_sarcasm": 0.9846667051315308, "sarcasm": 0.015333329327404499}, "veridicality_uncertainity": {"uncertain": 0.33441463112831116, "definitely_yes": 0.21644815802574158, "probably_yes": 0.3413499593734741, "probably_no": 0.05547914654016495, "definitely_no": 0.0523080937564373}, "semeval_sentiment": {"positive": 0.45415130257606506, "neutral": 0.48010358214378357, "negative": 0.06574511528015137}, "clarin_sentiment": {"neutr
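
Each task in the output maps labels to probabilities, so a per-task prediction is simply the label with the highest score. A short sketch, using a few of the values from the `barack obama went to paris` output above (rounded to four decimals; the `top_labels` helper is illustrative and not part of SocialMediaIE):

```python
def top_labels(classification):
    """Return {task: (best_label, probability)} for every task in the output."""
    return {
        task: max(scores.items(), key=lambda item: item[1])
        for task, scores in classification.items()
        if isinstance(scores, dict)  # skip 'text', 'doc_idx', 'tokens'
    }

classification = {
    "text": "barack obama went to paris",
    "doc_idx": 0,
    "tokens": ["barack", "obama", "went", "to", "paris"],
    "founta_abusive": {"normal": 0.9032, "spam": 0.0136, "abusive": 0.0572, "hateful": 0.0260},
    "waseem_abusive": {"none": 0.9791, "sexism": 0.0193, "racism": 0.0017},
    "semeval_sentiment": {"positive": 0.2025, "neutral": 0.6687, "negative": 0.1287},
}

for task, (label, prob) in top_labels(classification).items():
    print(f"{task}: {label} ({prob:.2f})")
```

The same helper could be applied to `output["classification"]` for each element returned by `predict_json`.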

### Visualizing the output:

Copy the JSON output of the cell above, paste it at https://codepen.io/napsternxg/full/YzwRqEb and click Visualize to see a pretty representation of the output, as shown in the presentation.

If you hover over the output of the cell above, Colab will show you how to copy it to the clipboard.

The next cell embeds the visualization page https://socialmediaie.github.io/PredictionVisualizer/ as an iframe.

In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

In [None]:
%%time
texts = [
    "Beautiful day in Chicago! Nice to get away from the Florida heat.",
    "Barack obama went to New York.",
    "obama went to Paris.",
    "Facebook is a new company.",
    "New york is better than SFO",
    "Urbana Champaign is the best",
    "urbana champaign is the best place to live and study",
    "going to Ibiza"
]
output_json = predict_json(texts)
for text, output in zip(texts, output_json):
  print(f"Text: {text}")
  display(output)

Text: Beautiful day in Chicago! Nice to get away from the Florida heat.


{'classification': {'clarin_sentiment': {'negative': 0.023783288896083832,
   'neutral': 0.02779068984091282,
   'positive': 0.9484260082244873},
  'doc_idx': 0,
  'founta_abusive': {'abusive': 0.019275737926363945,
   'hateful': 0.011243454180657864,
   'normal': 0.9636068344116211,
   'spam': 0.005873945541679859},
  'other_sentiment': {'negative': 0.08314713835716248,
   'neutral': 0.03771388158202171,
   'positive': 0.8791390061378479},
  'politics_sentiment': {'negative': 0.15084771811962128,
   'neutral': 0.17478203773498535,
   'positive': 0.6743702292442322},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9280483722686768,
   'sarcasm': 0.07195160537958145},
  'semeval_sentiment': {'negative': 0.0018827994354069233,
   'neutral': 0.010021764785051346,
   'positive': 0.9880954027175903},
  'text': 'Beautiful day in Chicago! Nice to get away from the Florida heat.',
  'tokens': ['Beautiful',
   'day',
   'in',
   'Chicago',
   '!',
   'Nice',
   'to',
   'get',
   'away',
   'from',

Text: Barack obama went to New York.


{'classification': {'clarin_sentiment': {'negative': 0.1194087415933609,
   'neutral': 0.7646072506904602,
   'positive': 0.11598397046327591},
  'doc_idx': 1,
  'founta_abusive': {'abusive': 0.01673225313425064,
   'hateful': 0.0134650943800807,
   'normal': 0.9632186889648438,
   'spam': 0.00658390810713172},
  'other_sentiment': {'negative': 0.17124077677726746,
   'neutral': 0.7575942873954773,
   'positive': 0.07116501778364182},
  'politics_sentiment': {'negative': 0.1599583476781845,
   'neutral': 0.5975431203842163,
   'positive': 0.24249853193759918},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9978109002113342,
   'sarcasm': 0.002189158694818616},
  'semeval_sentiment': {'negative': 0.05326622724533081,
   'neutral': 0.8259894847869873,
   'positive': 0.12074428051710129},
  'text': 'Barack obama went to New York.',
  'tokens': ['Barack', 'obama', 'went', 'to', 'New', 'York', '.'],
  'veridicality_uncertainity': {'definitely_no': 0.04828564450144768,
   'definitely_yes': 0.30

Text: obama went to Paris.


{'classification': {'clarin_sentiment': {'negative': 0.1185915470123291,
   'neutral': 0.6891473531723022,
   'positive': 0.19226108491420746},
  'doc_idx': 2,
  'founta_abusive': {'abusive': 0.013645026832818985,
   'hateful': 0.011792311444878578,
   'normal': 0.9690179228782654,
   'spam': 0.005544720683246851},
  'other_sentiment': {'negative': 0.21782013773918152,
   'neutral': 0.6783519387245178,
   'positive': 0.10382793098688126},
  'politics_sentiment': {'negative': 0.18480557203292847,
   'neutral': 0.523166298866272,
   'positive': 0.29202812910079956},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9962776303291321,
   'sarcasm': 0.003722313093021512},
  'semeval_sentiment': {'negative': 0.06243203952908516,
   'neutral': 0.7492068409919739,
   'positive': 0.18836118280887604},
  'text': 'obama went to Paris.',
  'tokens': ['obama', 'went', 'to', 'Paris', '.'],
  'veridicality_uncertainity': {'definitely_no': 0.045364461839199066,
   'definitely_yes': 0.3276190757751465,
   'p

Text: Facebook is a new company.


{'classification': {'clarin_sentiment': {'negative': 0.2274392694234848,
   'neutral': 0.5238809585571289,
   'positive': 0.24867978692054749},
  'doc_idx': 3,
  'founta_abusive': {'abusive': 0.02448102831840515,
   'hateful': 0.01304545160382986,
   'normal': 0.9303367137908936,
   'spam': 0.03213689103722572},
  'other_sentiment': {'negative': 0.4741344749927521,
   'neutral': 0.3161902129650116,
   'positive': 0.2096753865480423},
  'politics_sentiment': {'negative': 0.39996397495269775,
   'neutral': 0.48681116104125977,
   'positive': 0.11322487890720367},
  'sarcasm_uncertainity': {'not_sarcasm': 0.961678147315979,
   'sarcasm': 0.0383218377828598},
  'semeval_sentiment': {'negative': 0.09598779678344727,
   'neutral': 0.4928544759750366,
   'positive': 0.41115766763687134},
  'text': 'Facebook is a new company.',
  'tokens': ['Facebook', 'is', 'a', 'new', 'company', '.'],
  'veridicality_uncertainity': {'definitely_no': 0.03843403607606888,
   'definitely_yes': 0.248526319861412

Text: New york is better than SFO


{'classification': {'clarin_sentiment': {'negative': 0.4914599359035492,
   'neutral': 0.20600374042987823,
   'positive': 0.3025363087654114},
  'doc_idx': 4,
  'founta_abusive': {'abusive': 0.04685322567820549,
   'hateful': 0.026506099849939346,
   'normal': 0.9180614948272705,
   'spam': 0.008579263463616371},
  'other_sentiment': {'negative': 0.36646193265914917,
   'neutral': 0.31711050868034363,
   'positive': 0.3164275884628296},
  'politics_sentiment': {'negative': 0.4619922637939453,
   'neutral': 0.4029514789581299,
   'positive': 0.1350562870502472},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9427891373634338,
   'sarcasm': 0.057210784405469894},
  'semeval_sentiment': {'negative': 0.22233836352825165,
   'neutral': 0.20790386199951172,
   'positive': 0.5697577595710754},
  'text': 'New york is better than SFO',
  'tokens': ['New', 'york', 'is', 'better', 'than', 'SFO'],
  'veridicality_uncertainity': {'definitely_no': 0.029801219701766968,
   'definitely_yes': 0.252011835

Text: Urbana Champaign is the best


{'classification': {'clarin_sentiment': {'negative': 0.007115937769412994,
   'neutral': 0.18015988171100616,
   'positive': 0.8127241730690002},
  'doc_idx': 5,
  'founta_abusive': {'abusive': 0.011929438449442387,
   'hateful': 0.004677792079746723,
   'normal': 0.9724303483963013,
   'spam': 0.010962321422994137},
  'other_sentiment': {'negative': 0.01420773658901453,
   'neutral': 0.19019335508346558,
   'positive': 0.7955989241600037},
  'politics_sentiment': {'negative': 0.016615478321909904,
   'neutral': 0.1212422102689743,
   'positive': 0.8621423244476318},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9976256489753723,
   'sarcasm': 0.002374327974393964},
  'semeval_sentiment': {'negative': 0.0017641354352235794,
   'neutral': 0.03472493588924408,
   'positive': 0.9635109901428223},
  'text': 'Urbana Champaign is the best',
  'tokens': ['Urbana', 'Champaign', 'is', 'the', 'best'],
  'veridicality_uncertainity': {'definitely_no': 0.027731962502002716,
   'definitely_yes': 0.407

Text: urbana champaign is the best place to live and study


{'classification': {'clarin_sentiment': {'negative': 0.010827533900737762,
   'neutral': 0.15020863711833954,
   'positive': 0.8389638066291809},
  'doc_idx': 6,
  'founta_abusive': {'abusive': 0.011580286547541618,
   'hateful': 0.0034822355955839157,
   'normal': 0.9680749773979187,
   'spam': 0.016862528398633003},
  'other_sentiment': {'negative': 0.03107861801981926,
   'neutral': 0.17956583201885223,
   'positive': 0.7893555760383606},
  'politics_sentiment': {'negative': 0.037669457495212555,
   'neutral': 0.22055895626544952,
   'positive': 0.7417715787887573},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9887657761573792,
   'sarcasm': 0.011234153062105179},
  'semeval_sentiment': {'negative': 0.002735211979597807,
   'neutral': 0.04287683963775635,
   'positive': 0.954387903213501},
  'text': 'urbana champaign is the best place to live and study',
  'tokens': ['urbana',
   'champaign',
   'is',
   'the',
   'best',
   'place',
   'to',
   'live',
   'and',
   'study'],
  'veri

Text: going to Ibiza


{'classification': {'clarin_sentiment': {'negative': 0.07004306465387344,
   'neutral': 0.6626803874969482,
   'positive': 0.2672765254974365},
  'doc_idx': 7,
  'founta_abusive': {'abusive': 0.0382884219288826,
   'hateful': 0.019051579758524895,
   'normal': 0.8628808259963989,
   'spam': 0.07977917790412903},
  'other_sentiment': {'negative': 0.14210692048072815,
   'neutral': 0.6840463280677795,
   'positive': 0.1738467514514923},
  'politics_sentiment': {'negative': 0.16569136083126068,
   'neutral': 0.654079258441925,
   'positive': 0.18022948503494263},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9933017492294312,
   'sarcasm': 0.006698315497487783},
  'semeval_sentiment': {'negative': 0.02267160452902317,
   'neutral': 0.618860125541687,
   'positive': 0.35846832394599915},
  'text': 'going to Ibiza',
  'tokens': ['going', 'to', 'Ibiza'],
  'veridicality_uncertainity': {'definitely_no': 0.036893464624881744,
   'definitely_yes': 0.2997860610485077,
   'probably_no': 0.062247209

CPU times: user 13.8 s, sys: 1.64 s, total: 15.4 s
Wall time: 18.4 s


In [None]:
json.dumps(output)

'{"classification": {"text": "going to Ibiza", "doc_idx": 7, "tokens": ["going", "to", "Ibiza"], "founta_abusive": {"normal": 0.8628808259963989, "spam": 0.07977917790412903, "abusive": 0.0382884219288826, "hateful": 0.019051579758524895}, "waseem_abusive": {"none": 0.9589920043945312, "sexism": 0.028950253501534462, "racism": 0.012057810090482235}, "sarcasm_uncertainity": {"not_sarcasm": 0.9933017492294312, "sarcasm": 0.006698315497487783}, "veridicality_uncertainity": {"uncertain": 0.4078189432621002, "definitely_yes": 0.2997860610485077, "probably_yes": 0.19325432181358337, "probably_no": 0.06224720925092697, "definitely_no": 0.036893464624881744}, "semeval_sentiment": {"positive": 0.35846832394599915, "neutral": 0.618860125541687, "negative": 0.02267160452902317}, "clarin_sentiment": {"neutral": 0.6626803874969482, "positive": 0.2672765254974365, "negative": 0.07004306465387344}, "politics_sentiment": {"negative": 0.16569136083126068, "neutral": 0.654079258441925, "positive": 0.180
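
Each task in the classification output maps labels to probabilities, so the most likely label per task can be read off with a simple arg-max. The following is a minimal sketch rather than part of SocialMediaIE: `top_labels` is a hypothetical helper, and the `output` dictionary is a hand-copied subset of the result shown above.

```python
# Hand-copied subset of the classification output for "going to Ibiza".
output = {
    "classification": {
        "text": "going to Ibiza",
        "doc_idx": 7,
        "tokens": ["going", "to", "Ibiza"],
        "sarcasm_uncertainity": {
            "not_sarcasm": 0.9933017492294312,
            "sarcasm": 0.006698315497487783,
        },
        "semeval_sentiment": {
            "positive": 0.35846832394599915,
            "neutral": 0.618860125541687,
            "negative": 0.02267160452902317,
        },
        "clarin_sentiment": {
            "neutral": 0.6626803874969482,
            "positive": 0.2672765254974365,
            "negative": 0.07004306465387344,
        },
    }
}


def top_labels(classification):
    """Return {task: (label, probability)} for every entry that holds label scores."""
    return {
        task: max(scores.items(), key=lambda kv: kv[1])
        for task, scores in classification.items()
        if isinstance(scores, dict)  # skips "text", "doc_idx" and "tokens"
    }


best = top_labels(output["classification"])
for task, (label, prob) in best.items():
    print(f"{task}: {label} ({prob:.3f})")
```

Entries that are not dictionaries (`text`, `doc_idx`, `tokens`) are skipped, so the helper only reports the prediction tasks.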

## Multi-task tagging


In [None]:
from SocialMediaIE.predictor.model_predictor import run, get_args, PREFIX, get_model_output, output_to_df

In [None]:
SERIALIZATION_DIR = Path("./SocialMediaIE/data/models/all_multitask_stacked_l2_0_lr_1e-3_no_neel/")
print(SERIALIZATION_DIR.exists())
args = get_args(PREFIX, SERIALIZATION_DIR)
args = args._replace(
    dataset_paths_file="./SocialMediaIE/experiments/all_dataset_paths.json",
    cuda=False # Very important as not running on GPU
)
args

True


ModelArgument(task=['multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner', 'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos', 'ritter_ccg'], dataset_paths_file='./SocialMediaIE/experiments/all_dataset_paths.json', dataset_path_prefix='/experiments', model_dir='/content/SocialMediaIE/data/models/all_multitask_stacked_l2_0_lr_1e-3_no_neel', clean_model_dir=True, proj_dim=100, hidden_dim=100, encoder_type='bilstm', multi_task_mode='stacked', dropout=0.5, lr=0.001, weight_decay=0.0, batch_size=16, epochs=10, patience=3, cuda=False, test_mode=True, residual_connection=False)
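
The `args._replace(...)` call above behaves like the `_replace` method of a Python namedtuple, which returns a modified copy and leaves the original untouched. The sketch below illustrates that behaviour under the assumption that `ModelArgument` is a namedtuple; `DemoArgs` is a hypothetical two-field stand-in, not the real class.

```python
from collections import namedtuple

# Hypothetical stand-in for ModelArgument with only two of its fields.
DemoArgs = namedtuple("DemoArgs", ["dataset_paths_file", "cuda"])

defaults = DemoArgs(dataset_paths_file="paths.json", cuda=True)

# _replace builds a new tuple; `defaults` itself is not modified.
cpu_args = defaults._replace(cuda=False)

print(defaults)
print(cpu_args)
```

This is why the notebook reassigns the result to `args`: calling `_replace` without keeping the return value would have no effect.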

In [None]:
TASKS, vocab, model, readers, test_iterator = run(args)

  "num_layers={}".format(dropout, num_layers))


In [None]:
TASKS

[Task(tag_namespace='multimodal_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='broad_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='wnut17_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ritter_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='yodie_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ritter_chunk', task_type=chunk, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ud_pos', task_type=pos, label_encoding=None, calculate_span_f1=Noneis_classification=False),
 Task(tag_namespace='ark_pos', task_type=pos, label_encoding=None, calculate_span_f1=Noneis_classification=False),
 Task(tag_namespace='ptb_pos', task_type=pos, label_encoding=No

In [None]:
vocab

Vocabulary with namespaces:  broad_ner, Size: 7 || ritter_ccg, Size: 72 || ptb_pos, Size: 46 || yodie_ner, Size: 27 || ud_pos, Size: 18 || tag_namespace, Size: 10 || multimodal_ner, Size: 9 || ark_pos, Size: 25 || ritter_chunk, Size: 18 || wnut17_ner, Size: 13 || ritter_ner, Size: 21 || Non Padded Namespaces: {'*pos', 'tag_namespace', '*ccg', '*chunk', '*ner'}

In [None]:
readers

{'ark_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9dd0>,
 'broad_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9e10>,
 'multimodal_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbadb1d0>,
 'ptb_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9a10>,
 'ritter_ccg': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9b10>,
 'ritter_chunk': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9110>,
 'ritter_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9fd0>,
 'ud_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf91d0>,
 'wnut17_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9ed0>,
 'yodie_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4adbaf9f50>}

In [None]:
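def tokenize(text):
    # Tokenize `text` and return a dict of parallel lists, one list per
    # token attribute (e.g. "value", "span", "no_space").
    objects = [get_match_object(match) for match in get_match_iter(text)]
    n = len(objects)
    cleaned_objects = []
    for i, obj in enumerate(objects):
        # no_space stays True unless the token is directly followed by whitespace
        obj["no_space"] = True
        if obj["type"] == "space":
            # Whitespace tokens are dropped; they survive only via no_space
            continue
        if i < n-1 and objects[i+1]["type"] == "space":
            obj["no_space"] = False
        cleaned_objects.append(obj)
    if not cleaned_objects:
        raise ValueError("text contains no tokens")
    keys = cleaned_objects[0].keys()
    return {k: [obj[k] for obj in cleaned_objects] for k in keys}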

def predict_df(texts=None):
    # Fall back to three example sentences when no texts are given
    if not texts:
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty cache to ensure larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    def _get_data_values(d):
        # Tokenizer metadata (span, is_hashtag, no_space, ...) without the token text
        return {k: v for k, v in d.items() if k != "value"}

    # One DataFrame per input text, tagged with its position in `texts`
    df = pd.concat([
        output_to_df(tokens[i], output[i], vocab).assign(**_get_data_values(d)).assign(data_idx=i)
        for i, d in enumerate(data)
    ])
    return df


def predict_json(texts=None):
    # Fall back to three example sentences when no texts are given
    if not texts:
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty cache to ensure larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    def _get_data_values(d):
        # Tokenizer metadata (span, is_hashtag, no_space, ...) without the token text
        return {k: v for k, v in d.items() if k != "value"}

    # One DataFrame per input text, tagged with its position in `texts`
    dfs = [
        output_to_df(tokens[i], output[i], vocab).assign(**_get_data_values(d)).assign(data_idx=i)
        for i, d in enumerate(data)
    ]
    # Serialize each DataFrame in pandas' "table" orientation (schema + data)
    return [
        dict(tagging=json.loads(df_t.to_json(orient='table')))
        for df_t in dfs
    ]


In [None]:
output_json = predict_json()
json.dumps(output_json[0])

'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in
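
The JSON above uses the pandas `orient='table'` layout: a `schema` block listing the columns and a `data` list with one record per token. As a rough sketch of how that structure can be read back with only the standard library, the example below uses a hand-written miniature payload with two of the columns; the values are illustrative and not actual model output.

```python
import json

# Hand-written miniature of the table layout; only two tag columns are kept.
payload = json.dumps({
    "tagging": {
        "schema": {
            "fields": [
                {"name": "index", "type": "integer"},
                {"name": "tokens", "type": "string"},
                {"name": "multimodal_ner", "type": "string"},
            ],
            "primaryKey": ["index"],
        },
        "data": [
            {"index": 0, "tokens": "Barack", "multimodal_ner": "B-PER"},
            {"index": 1, "tokens": "Obama", "multimodal_ner": "I-PER"},
            {"index": 2, "tokens": "went", "multimodal_ner": "O"},
        ],
    }
})

table = json.loads(payload)["tagging"]

# Column order comes from the schema, values from the per-token records.
columns = [field["name"] for field in table["schema"]["fields"]]
rows = [[record[c] for c in columns] for record in table["data"]]
pairs = [(record["tokens"], record["multimodal_ner"]) for record in table["data"]]

print(columns)
print(pairs)
```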

### Visualizing the output:

Copy the JSON output of the cell above, paste it into https://codepen.io/napsternxg/full/YzwRqEb, and click **Visualize** to see a formatted view of the predictions, as shown in the presentation.

If you hover over the output of the above cell, Colab will show you how to copy it to the clipboard.

In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

In [None]:
df = predict_df()

In [None]:
df.head()

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0


In [None]:
df[df.data_idx==0]

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0
5,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(27, 30)",False,False,False,False,False,False,False,0
6,never,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(31, 36)",False,False,False,False,False,False,False,0
7,returned,O,O,O,O,O,B-VP,VERB,V,VBN,...,token,"(37, 45)",False,False,False,False,False,False,False,0
8,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(46, 48)",False,False,False,False,False,False,False,0
9,the,O,O,B-LOCATION,B-FACILITY,O,B-NP,DET,D,DT,...,token,"(49, 52)",False,False,False,False,False,False,False,0


In [None]:
df.loc[df.data_idx==0, ["tokens", "multimodal_ner"]]

Unnamed: 0,tokens,multimodal_ner
0,Barack,B-PER
1,Obama,I-PER
2,went,O
3,to,O
4,Paris,B-LOC
5,and,O
6,never,O
7,returned,O
8,to,O
9,the,O


In [None]:
def split_tag(tag):
    # "B-PER" -> ("B", "PER"); "O" and any tag without a "-" -> (tag, None)
    boundary, sep, label = tag.partition("-")
    return (boundary, label) if sep else (tag, None)

def extract_entities(tags):
    # Decode BIO tags into (label, start, end) token spans; `end` is exclusive
    tags = list(tags)
    curr_entity = []
    entities = []
    # A dummy "O" tag at the end ensures the last entity is added to entities
    for i, tag in enumerate(tags + ["O"]):
        boundary, label = split_tag(tag)
        if curr_entity:
            # Exit entity on a new "B", an "O", or a change of label
            if boundary in {"B", "O"} or label != curr_entity[-1][1]:
                start = i - len(curr_entity)
                end = i
                entity_label = curr_entity[-1][1]
                entities.append((entity_label, start, end))
                curr_entity = []
            elif boundary == "I":
                # Still inside the current entity
                curr_entity.append((boundary, label))
        if boundary == "B":
            # Enter a new entity
            assert not curr_entity, f"Entity should be empty. Found: {curr_entity}"
            curr_entity.append((boundary, label))
    return entities

In [None]:
df_t = df.loc[df.data_idx==0]
df_t

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0
5,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(27, 30)",False,False,False,False,False,False,False,0
6,never,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(31, 36)",False,False,False,False,False,False,False,0
7,returned,O,O,O,O,O,B-VP,VERB,V,VBN,...,token,"(37, 45)",False,False,False,False,False,False,False,0
8,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(46, 48)",False,False,False,False,False,False,False,0
9,the,O,O,B-LOCATION,B-FACILITY,O,B-NP,DET,D,DT,...,token,"(49, 52)",False,False,False,False,False,False,False,0


In [None]:
entities = extract_entities(df_t["multimodal_ner"])
tokens = list(df_t["tokens"])

In [None]:
for label, start, end in entities:
  print(tokens[start:end], label)

['Barack', 'Obama'] PER
['Paris'] LOC
['USA'] LOC
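
The span extraction above can also be tried in isolation, without loading the model. The sketch below is an independent, compact re-implementation of BIO decoding (the name `bio_spans` is made up for this example and is not part of SocialMediaIE), applied to the `multimodal_ner` tags of the first example sentence.

```python
def bio_spans(tags):
    """Decode BIO tags into (label, start, end) spans with an exclusive end index."""
    spans = []
    start, label = None, None
    # The trailing "O" is a sentinel that closes an entity ending on the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and start is not None and tag_label == label
        if not continues and start is not None:
            spans.append((label, start, i))
            start, label = None, None
        if prefix == "B":
            start, label = i, tag_label
    return spans


tokens = ["Barack", "Obama", "went", "to", "Paris", "and",
          "never", "returned", "to", "the", "USA", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O",
        "O", "O", "O", "O", "B-LOC", "O"]

entities = bio_spans(tags)
for label, start, end in entities:
    print(tokens[start:end], label)
```

Using `partition("-")` splits only on the first hyphen, so labels that themselves contain a hyphen, such as `GEO-LOC`, stay intact.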


In [None]:
df.columns

Index(['tokens', 'multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner',
       'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos',
       'ritter_ccg', 'type', 'span', 'is_hashtag', 'is_mention', 'is_url',
       'is_emoji', 'is_emoticon', 'is_symbol', 'no_space', 'data_idx'],
      dtype='object')

In [None]:
def get_entity_info(bio_labels, tokens, text=None, spans=None):
  entities_info = extract_entities(bio_labels)
  entities = []
  for label, start, end in entities_info:
    entity_phrase = None
    if text and spans:
      start_char_idx = spans[start][0]
      end_char_idx = spans[end-1][1]
      entity_phrase = text[start_char_idx:end_char_idx]
    entities.append(dict(
        tokens=tokens[start:end], 
        label=label, 
        start=start, 
        end=end, 
        entity_phrase=entity_phrase))
  return entities


def get_df_entities(df, text=None):
  span_columns = [
    c for c in df.columns if c.endswith(("_ner", "_chunk", "_ccg"))
  ]
  tokens = list(df["tokens"])
  spans = list(df["span"])
  task_entities = {c: [] for c in span_columns}
  for c in span_columns:
    bio_labels = df[c]
    task_entities[c] = get_entity_info(bio_labels, tokens, text=text, spans=spans)
  return task_entities

In [None]:
text = """Ryan Gosling and Chris Evans will star in the Russo Bros' 'The Gray Man' for Netflix

The film has a $200M+ budget and the goal is to launch a James Bond-level franchise

'For those who were fans of The Winter Soldier this is us moving into that territory in a real-world setting'"""

df = predict_df([text])
df

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Ryan,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 4)",False,False,False,False,False,False,False,0
1,Gosling,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(5, 12)",False,False,False,False,False,False,False,0
2,and,O,O,O,O,O,I-NP,CCONJ,&,CC,...,token,"(13, 16)",False,False,False,False,False,False,False,0
3,Chris,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,I-NP,PROPN,^,NNP,...,token,"(17, 22)",False,False,False,False,False,False,False,0
4,Evans,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(23, 28)",False,False,False,False,False,False,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,real,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(261, 265)",False,False,False,False,False,False,True,0
60,-,O,O,O,O,O,I-NP,PUNCT,",",:,...,token,"(265, 266)",False,False,False,False,False,True,True,0
61,world,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(266, 271)",False,False,False,False,False,False,False,0
62,setting,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(272, 279)",False,False,False,False,False,False,True,0


In [None]:
task_entities = get_df_entities(df, text=text)
for task, entities in task_entities.items():
  print(task)
  for entity in entities:
    print(entity)

multimodal_ner
{'tokens': ['Ryan', 'Gosling'], 'label': 'PER', 'start': 0, 'end': 2, 'entity_phrase': 'Ryan Gosling'}
{'tokens': ['Chris', 'Evans'], 'label': 'PER', 'start': 3, 'end': 5, 'entity_phrase': 'Chris Evans'}
{'tokens': ['Russo', 'Bros'], 'label': 'ORG', 'start': 9, 'end': 11, 'entity_phrase': 'Russo Bros'}
{'tokens': ['The', 'Gray', 'Man'], 'label': 'MISC', 'start': 13, 'end': 16, 'entity_phrase': 'The Gray Man'}
{'tokens': ['Netflix'], 'label': 'ORG', 'start': 18, 'end': 19, 'entity_phrase': 'Netflix'}
{'tokens': ['James', 'Bond'], 'label': 'PER', 'start': 35, 'end': 37, 'entity_phrase': 'James Bond'}
{'tokens': ['The', 'Winter', 'Soldier'], 'label': 'MISC', 'start': 47, 'end': 50, 'entity_phrase': 'The Winter Soldier'}
broad_ner
{'tokens': ['Ryan', 'Gosling'], 'label': 'PER', 'start': 0, 'end': 2, 'entity_phrase': 'Ryan Gosling'}
{'tokens': ['Chris', 'Evans'], 'label': 'PER', 'start': 3, 'end': 5, 'entity_phrase': 'Chris Evans'}
{'tokens': ['Russo', 'Bros'], 'label': 'PER'
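
The `entity_phrase` field is obtained by slicing the original text from the character offset where the first token of the entity starts to the offset where its last token ends. A small stand-alone sketch of that mapping follows, using the first few token spans from the table above; the helper name `entity_phrase` is chosen here for illustration only.

```python
text = "Ryan Gosling and Chris Evans will star"

# (start, end) character offsets per token, as stored in the `span` column.
spans = [(0, 4), (5, 12), (13, 16), (17, 22), (23, 28), (29, 33), (34, 38)]

# (label, first token index, end token index exclusive)
entities = [("PER", 0, 2), ("PER", 3, 5)]


def entity_phrase(text, spans, start, end):
    """Slice the text from the first token's start to the last token's end."""
    return text[spans[start][0]:spans[end - 1][1]]


phrases = [(label, entity_phrase(text, spans, s, e)) for label, s, e in entities]
print(phrases)
```

Slicing the original string, rather than joining tokens with spaces, preserves whatever spacing and punctuation appeared between the tokens of an entity.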

In [None]:
text = """Ryan Gosling and Chris Evans will star in the Russo Bros' 'The Gray Man' for Netflix

The film has a $200M+ budget and the goal is to launch a James Bond-level franchise

'For those who were fans of The Winter Soldier this is us moving into that territory in a real-world setting'"""
json.dumps(predict_json([text])[0])

'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in

In [None]:
text = """'You can give our democracy new meaning': Barack Obama urges young Americans to vote."""
json.dumps(predict_json([text])[0])

'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in

In [None]:
predict_df(["the day is great"]).T

Unnamed: 0,0,1,2,3
tokens,the,day,is,great
multimodal_ner,O,O,O,O
broad_ner,O,O,O,O
wnut17_ner,O,O,O,O
ritter_ner,O,O,O,O
yodie_ner,O,O,O,O
ritter_chunk,B-NP,I-NP,B-VP,B-ADJP
ud_pos,DET,NOUN,VERB,ADJ
ark_pos,D,N,V,A
ptb_pos,DT,NN,VBZ,JJ


In [None]:
predict_df(["barack obama went to paris"])

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,True,0


In [None]:
%%time
texts = [
    "Beautiful day in Chicago! Nice to get away from the Florida heat.",
    "Barack obama went to New York.",
    "obama went to Paris.",
    "Facebook is a new company.",
    "New york is better than SFO",
    "Urbana Champaign is the best",
    "urbana champaign is the best place to live and study",
    "going to Ibiza"
]
df = predict_df(texts)
print(df.columns)
for i in df.data_idx.unique():
  display(df[df.data_idx==i])

Index(['tokens', 'multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner',
       'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos',
       'ritter_ccg', 'type', 'span', 'is_hashtag', 'is_mention', 'is_url',
       'is_emoji', 'is_emoticon', 'is_symbol', 'no_space', 'data_idx'],
      dtype='object')


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Beautiful,O,O,O,O,O,B-NP,ADJ,A,JJ,...,token,"(0, 9)",False,False,False,False,False,False,False,0
1,day,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(10, 13)",False,False,False,False,False,False,False,0
2,in,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(14, 16)",False,False,False,False,False,False,False,0
3,Chicago,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(17, 24)",False,False,False,False,False,False,True,0
4,!,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(24, 25)",False,False,False,False,False,True,False,0
5,Nice,O,O,O,O,O,B-ADJP,ADJ,A,JJ,...,token,"(26, 30)",False,False,False,False,False,False,False,0
6,to,O,O,O,O,O,B-VP,PART,P,TO,...,token,"(31, 33)",False,False,False,False,False,False,False,0
7,get,O,O,O,O,O,I-VP,VERB,V,VB,...,token,"(34, 37)",False,False,False,False,False,False,False,0
8,away,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(38, 42)",False,False,False,False,False,False,False,0
9,from,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(43, 47)",False,False,False,False,False,False,False,0


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,1
1,obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,1
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,1
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,1
4,New,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(21, 24)",False,False,False,False,False,False,False,1
5,York,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-GEO-LOC,I-NP,PROPN,^,NNP,...,token,"(25, 29)",False,False,False,False,False,False,True,1
6,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(29, 30)",False,False,False,False,False,True,True,1


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,obama,B-PER,O,O,O,O,B-VP,PROPN,^,NNP,...,token,"(0, 5)",False,False,False,False,False,False,False,2
1,went,O,O,O,O,O,I-VP,VERB,V,VBD,...,token,"(6, 10)",False,False,False,False,False,False,False,2
2,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(11, 13)",False,False,False,False,False,False,False,2
3,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(14, 19)",False,False,False,False,False,False,True,2
4,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(19, 20)",False,False,False,False,False,True,True,2


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Facebook,B-ORG,B-ORG,B-CORPORATION,B-COMPANY,B-COMPANY,B-NP,PROPN,^,NNP,...,token,"(0, 8)",False,False,False,False,False,False,False,3
1,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(9, 11)",False,False,False,False,False,False,False,3
2,a,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(12, 13)",False,False,False,False,False,True,False,3
3,new,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(14, 17)",False,False,False,False,False,False,False,3
4,company,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(18, 25)",False,False,False,False,False,False,True,3
5,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(25, 26)",False,False,False,False,False,True,True,3


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,New,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(0, 3)",False,False,False,False,False,False,False,4
1,york,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-LOCATION,I-NP,PROPN,^,NNP,...,token,"(4, 8)",False,False,False,False,False,False,False,4
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(9, 11)",False,False,False,False,False,False,False,4
3,better,O,O,O,O,O,B-ADJP,ADJ,A,JJR,...,token,"(12, 18)",False,False,False,False,False,False,False,4
4,than,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(19, 23)",False,False,False,False,False,False,False,4
5,SFO,B-ORG,B-ORG,B-CORPORATION,B-COMPANY,B-COMPANY,B-NP,PROPN,^,NNP,...,token,"(24, 27)",False,False,False,False,False,False,True,4


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Urbana,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,5
1,Champaign,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-LOCATION,I-NP,PROPN,^,NNP,...,token,"(7, 16)",False,False,False,False,False,False,False,5
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(17, 19)",False,False,False,False,False,False,False,5
3,the,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(20, 23)",False,False,False,False,False,False,False,5
4,best,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(24, 28)",False,False,False,False,False,False,True,5


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,urbana,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,6
1,champaign,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 16)",False,False,False,False,False,False,False,6
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(17, 19)",False,False,False,False,False,False,False,6
3,the,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(20, 23)",False,False,False,False,False,False,False,6
4,best,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(24, 28)",False,False,False,False,False,False,False,6
5,place,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(29, 34)",False,False,False,False,False,False,False,6
6,to,O,O,O,O,O,B-VP,PART,P,TO,...,token,"(35, 37)",False,False,False,False,False,False,False,6
7,live,O,O,O,O,O,I-VP,VERB,V,VB,...,token,"(38, 42)",False,False,False,False,False,False,False,6
8,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(43, 46)",False,False,False,False,False,False,False,6
9,study,O,O,O,O,O,B-VP,VERB,V,VB,...,token,"(47, 52)",False,False,False,False,False,False,True,6


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,going,O,O,O,O,O,B-VP,VERB,V,VBG,...,token,"(0, 5)",False,False,False,False,False,False,False,7
1,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(6, 8)",False,False,False,False,False,False,False,7
2,Ibiza,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(9, 14)",False,False,False,False,False,False,True,7


CPU times: user 13.3 s, sys: 350 ms, total: 13.7 s
Wall time: 12.7 s
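
The NER columns in the tables above use BIO tags (`B-` opens an entity, `I-` continues it, `O` is outside any entity), and the `span` column holds character offsets into the input text. The sketch below shows one way to group such tags into entity mentions. The function name `bio_to_entities` is hypothetical and not part of SocialMediaIE, and the code assumes each span is available as a `(start, end)` tuple.

```python
def bio_to_entities(tokens, tags, spans):
    """Group BIO-tagged tokens into entity mentions with character offsets."""
    entities = []
    current = None
    for token, tag, (start, end) in zip(tokens, tags, spans):
        # "B-GEO-LOC" -> prefix "B", label "GEO-LOC"; "O" -> prefix "O", label ""
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["label"] != label)
        )
        if starts_new:
            # Close any open entity and start a new one.
            if current is not None:
                entities.append(current)
            current = {"label": label, "start": start, "end": end, "tokens": [token]}
        elif prefix == "I":
            # Extend the open entity to cover this token.
            current["end"] = end
            current["tokens"].append(token)
        else:
            # An "O" tag closes any open entity.
            if current is not None:
                entities.append(current)
            current = None
    if current is not None:
        entities.append(current)
    return entities


# Values copied from the `multimodal_ner` column of the data_idx 5 table above.
text = "Urbana Champaign is the best"
tokens = ["Urbana", "Champaign", "is", "the", "best"]
tags = ["B-LOC", "I-LOC", "O", "O", "O"]
spans = [(0, 6), (7, 16), (17, 19), (20, 23), (24, 28)]

entities = bio_to_entities(tokens, tags, spans)
mentions = [(e["label"], text[e["start"]:e["end"]]) for e in entities]
```

Here `mentions` holds one `LOC` entity covering `Urbana Champaign`. The same function can be applied to any of the other NER or chunking columns, since they follow the same tagging scheme.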


# Visualizing outputs

The cell below embeds the visualization page https://socialmediaie.github.io/PredictionVisualizer/ as an iframe.

Copy the model output JSON from above, paste it into the text area, and click **Visualize**.


In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

# Data Download

## Twarc

Twarc can hydrate tweet IDs, that is, fetch the full tweet objects for a list of IDs.

Docs: https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#hydrate

In [None]:
import getpass

In [None]:
twarc_bearer_token = getpass.getpass("Bearer Token: ")
! echo {twarc_bearer_token} > ~/.twarc_bearer_token

Bearer Token: ··········
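
The `echo` redirect above creates the token file with default permissions, which on a shared machine may leave it readable by other users. As a minimal sketch, the token could instead be written from Python with owner-only permissions. The helper name `save_token` is hypothetical, and the permission bits only have this effect on POSIX systems.

```python
import os
from pathlib import Path


def save_token(token, path):
    """Write a token to `path` so that only the current user can read it."""
    path = Path(path).expanduser()
    # Create (or truncate) the file with mode 0o600: read/write for the owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fp:
        fp.write(token + "\n")
    # The mode passed to os.open is ignored for an existing file, so tighten it here too.
    os.chmod(path, 0o600)
    return path
```

In the notebook this could be called as `save_token(twarc_bearer_token, "~/.twarc_bearer_token")`.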


In [None]:
%%writefile twarc_tweet_ids.txt
21
22

Overwriting twarc_tweet_ids.txt


In [None]:
! twarc2 --bearer-token {twarc_bearer_token} hydrate ./twarc_tweet_ids.txt ./twarc_tweet_ids.txt.jsonl

100% 2/2 [00:00<00:00, 15.72it/s]


In [None]:
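import json

# twarc2 writes one API response page per line (JSONL); two IDs fit in one page, so read the first line.
with open("./twarc_tweet_ids.txt.jsonl") as fp:
  hydrated_tweets = json.loads(fp.readline())
hydrated_tweets["data"]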

[{'author_id': '13',
  'conversation_id': '21',
  'created_at': '2006-03-21T20:51:43.000Z',
  'id': '21',
  'lang': 'en',
  'possibly_sensitive': False,
  'public_metrics': {'like_count': 4464,
   'quote_count': 267,
   'reply_count': 164,
   'retweet_count': 6105},
  'reply_settings': 'everyone',
  'source': 'Twitter Web Client',
  'text': 'just setting up my twttr'},
 {'author_id': '14',
  'conversation_id': '22',
  'created_at': '2006-03-21T21:00:54.000Z',
  'id': '22',
  'lang': 'en',
  'possibly_sensitive': False,
  'public_metrics': {'like_count': 3495,
   'quote_count': 154,
   'reply_count': 76,
   'retweet_count': 4760},
  'reply_settings': 'everyone',
  'source': 'Twitter Web Client',
  'text': 'just setting up my twttr'}]
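
The cell above handles a single response page. With longer ID lists, twarc2 writes one response page per line, so the file has to be read line by line. The following is a minimal sketch of such a reader. The function name `iter_hydrated_tweets` is hypothetical, and the code assumes every page keeps its tweets in a list under the `data` key, as in the output above.

```python
import json


def iter_hydrated_tweets(path):
    """Yield tweet dicts from a twarc2 JSONL file, one response page per line."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            page = json.loads(line)
            # Assumption: tweets live under "data"; pages without it yield nothing.
            yield from page.get("data", [])
```

For example, `list(iter_hydrated_tweets("./twarc_tweet_ids.txt.jsonl"))` would collect the tweets from every page into one list.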

In [None]:
df_hydrated_tweets = pd.DataFrame(hydrated_tweets["data"])
df_hydrated_tweets

Unnamed: 0,id,conversation_id,lang,public_metrics,text,source,created_at,reply_settings,possibly_sensitive,author_id
0,21,21,en,"{'retweet_count': 6105, 'reply_count': 164, 'l...",just setting up my twttr,Twitter Web Client,2006-03-21T20:51:43.000Z,everyone,False,13
1,22,22,en,"{'retweet_count': 4760, 'reply_count': 76, 'li...",just setting up my twttr,Twitter Web Client,2006-03-21T21:00:54.000Z,everyone,False,14
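
In the table above, `public_metrics` is still a nested dictionary, which makes the counts awkward to sort or filter on. As a sketch, each tweet can be flattened so that every metric becomes its own column. The helper name `flatten_tweet` and the dotted column names are illustrative choices, not part of twarc.

```python
def flatten_tweet(tweet, nested_key="public_metrics"):
    """Return a copy of `tweet` with the nested dict expanded into dotted keys."""
    flat = {key: value for key, value in tweet.items() if key != nested_key}
    for name, value in tweet.get(nested_key, {}).items():
        flat[f"{nested_key}.{name}"] = value
    return flat


# Fields copied from the first hydrated tweet shown above.
tweet = {
    "id": "21",
    "text": "just setting up my twttr",
    "public_metrics": {
        "like_count": 4464,
        "quote_count": 267,
        "reply_count": 164,
        "retweet_count": 6105,
    },
}
flat = flatten_tweet(tweet)
```

Passing `[flatten_tweet(t) for t in hydrated_tweets["data"]]` to `pd.DataFrame` would then give one column per metric. `pd.json_normalize(hydrated_tweets["data"])` is a built-in pandas alternative that produces similar dotted column names.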


## Academic API

* Tool: https://developer.twitter.com/apitools/downloader
* Details can be found at: https://twittercommunity.com/t/introducing-new-developer-tools-for-the-twitter-api-v2/168348




> Upload a file named `ipl04-2022.json`, containing a few tweets, to the working directory.



In [None]:
df_academic_data = pd.read_json("./ipl04-2022.json")
df_academic_data

Unnamed: 0,id,text
0,1509695225777864704,@calheirosmarcus Espero comentários da rodada ...
1,1509694519754776576,Indian Premier League: CSK Capable Of Retainin...
2,1509694421880950784,LSG prodigy Ayush Badoni played two fine knock...
3,1509694237318782976,Statsman: All about the Indian Premier League ...
4,1509693319764348928,Indian Premier League: CSK Capable Of Retainin...
5,1509693310973087744,Indian Premier League: CSK Capable Of Retainin...
6,1509692303308431360,Indian Premier League 2022 | Badoni a great fi...
7,1509692024928325632,Indian Premier League 2022 | Badoni a great fi...
8,1509691826453762048,RT @CricketNDTV: RCB leg-spinner Wanindu Hasar...
9,1509691482864824320,Indian Premier League 2022 | Badoni a great fi...
