# [CIKM 2022 Tutorial on Hands-on advanced machine learning for information extraction from tweets: tasks, data, and open source tools](https://socialmediaie.github.io/tutorials/CIKM2022/)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/socialmediaie/tutorials/blob/master/docs/CIKM2022/CIKM_2022_Tutorial_SocialMediaIE.ipynb)


* Author: [Shubhanshu Mishra](http://shubhanshu.com/)
* Contact: [https://twitter.com/TheShubhanshu](https://twitter.com/TheShubhanshu)
* **QnA Page:** [https://slido.com with #3287167](https://slido.com/3287167)

More details at: https://socialmediaie.github.io/tutorials/CIKM2022/

This notebook demonstrates [SocialMediaIE](https://github.com/socialmediaie/SocialMediaIE), an open-source tool we have built that uses multi-task learning to achieve state-of-the-art performance on English-language social media data such as tweets. You can find more details about the tool at: https://github.com/socialmediaie/SocialMediaIE

If you have feedback or feature requests for SocialMediaIE, please raise an issue at: https://github.com/socialmediaie/SocialMediaIE/issues

If you would like to use SocialMediaIE on your local machine, please follow the instructions at: https://socialmediaie.github.io/tutorials/CIKM2022/#software-setup



# Lexicon-Based Information Extraction

### Install dependencies

In [None]:
! pip install flashtext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting flashtext
  Downloading flashtext-2.7.tar.gz (14 kB)
Building wheels for collected packages: flashtext
  Building wheel for flashtext (setup.py) ... done
  Created wheel for flashtext: filename=flashtext-2.7-py2.py3-none-any.whl size=9309 sha256=030bcff67127bf8129f3651f101795c38fe259f104c79cd104cd0c5aa106c794
  Stored in directory: /root/.cache/pip/wheels/cb/19/58/4e8fdd0009a7f89dbce3c18fff2e0d0fa201d5cdfd16f113b7
Successfully built flashtext
Installing collected packages: flashtext
Successfully installed flashtext-2.7


In [None]:
%%bash
if [ ! -f Enhanced_Morality_Lexicon_V1.1.txt ]; then
  # Source of data: https://databank.illinois.edu/datasets/IDB-3957440
  wget -q https://databank.illinois.edu/datafiles/pjwpj/download -O Enhanced_Morality_Lexicon_V1.1.txt
fi

In [None]:
%%bash
if [ ! -f NRC-Emotion-Lexicon.zip ]; then
  # Source of data: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
  wget -q http://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip -O NRC-Emotion-Lexicon.zip
  unzip NRC-Emotion-Lexicon.zip
fi

Archive:  NRC-Emotion-Lexicon.zip
   creating: NRC-Emotion-Lexicon/
  inflating: NRC-Emotion-Lexicon/Paper-Practical-Ethical-Considerations-Lexicons.pdf  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._Paper-Practical-Ethical-Considerations-Lexicons.pdf  
  inflating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-ForVariousLanguages.txt  
  inflating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._NRC-Emotion-Lexicon-Wordlevel-v0.92.txt  
  inflating: NRC-Emotion-Lexicon/Paper1_NRC_Emotion_Lexicon.pdf  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._Paper1_NRC_Emotion_Lexicon.pdf  
  inflating: NRC-Emotion-Lexicon/README.txt  
  inflating: NRC-Emotion-Lexicon/Paper-Ethics-Sheet-Emotion-Recognition.pdf  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._Paper-Ethics-Sheet-Emotion-Recognition.pdf  
  inflating: NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Senselevel-v0.92.txt  
  inflating: __MACOSX/NRC-Emotion-Lexicon/._NRC-Emotion-Lexicon-Senselevel-

In [None]:
%%bash
if [ ! -f subjectivity_clues_hltemnlp05.zip ]; then
  # Source of data: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/
  wget -q https://mpqa.cs.pitt.edu/data/subjectivity_clues_hltemnlp05.zip -O subjectivity_clues_hltemnlp05.zip
  unzip subjectivity_clues_hltemnlp05.zip
fi

Archive:  subjectivity_clues_hltemnlp05.zip
   creating: subjectivity_clues_hltemnlp05/
  inflating: subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.README  
   creating: __MACOSX/subjectivity_clues_hltemnlp05/
  inflating: __MACOSX/subjectivity_clues_hltemnlp05/._subjclueslen1-HLTEMNLP05.README  
  inflating: subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff  


### Create Helpers

In [None]:
import pandas as pd
from flashtext import KeywordProcessor
from pathlib import Path
import seaborn as sns
from spacy import displacy

In [None]:
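class Lexicon(object):
  """Loads a lexicon into a DataFrame and tags text with a flashtext KeywordProcessor."""
  def __init__(self):
    # Filled in by the read_* methods and by create_tagger.
    self.lexicon = None
    self.keyword_processor = None
    # Only populated by create_tagger(..., set_colors=True); initialised here
    # so that display() also works when no colors were requested.
    self.colors = None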

  def read_morality_lexicon(self, file_path):
    with open(file_path) as fp:
      data = []
      for line in fp:
        line = line.strip()
        line = line.split("|")
        line_data = dict([v.split(" = ") for v in line])
        data.append(line_data)
      self.lexicon = pd.DataFrame(data)

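  def read_mpqa_lexicon(self, file_path):
    with open(file_path) as fp:
      data = []
      for i, line in enumerate(fp):
        line = line.strip()
        line = line.split(" ")
        try:
          line_data = dict([v.split("=") for v in line])
        except ValueError as e:
          # A few lines contain a stray token without "=" (e.g. "m").
          # Report them, then keep only the well-formed key=value tokens so
          # that the previous line's data is not appended a second time.
          print(i, e)
          print(line)
          line_data = dict([v.split("=", 1) for v in line if "=" in v])
        data.append(line_data)
      self.lexicon = pd.DataFrame(data)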

  def read_emolex(self, dir_path, lang="English"):
    if lang == "English":
      file_path = Path(dir_path) / "NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"
      df = pd.read_csv(file_path, sep="\t", skiprows=1, header=None, names=["word", "affect", "association"])
      self.lexicon = df[df.association == 1]
      return
    emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy',
       'negative', 'positive', 'sadness', 'surprise', 'trust',]
    file_path = Path(dir_path) / "NRC-Emotion-Lexicon-ForVariousLanguages.txt"
    df = pd.read_csv(file_path, sep="\t")[["English Word", lang] + emotions]
    df = df.melt(
        id_vars=["English Word", lang], var_name="affect", value_name="association"
    ).rename(columns={lang: "word"}).query("association != 0") # word	affect	association
    self.lexicon = df[df.association == 1]
    
  
  def create_tagger(self, token_col, label_col, case_sensitive=False, set_colors=False):
    keyword_processor = KeywordProcessor(case_sensitive=case_sensitive)
    keyword_dict = self.lexicon.groupby(label_col)[token_col].agg(list)
    keyword_processor.add_keywords_from_dict(keyword_dict)
    if set_colors:
      self.colors = self.get_colors(keyword_dict.keys(), cmap="hls")
    self.keyword_processor = keyword_processor

  def get_colors(self, labels, cmap="hls"):
    colors = {
      l.upper():c 
        for l,c in zip(
          labels,
          sns.color_palette(cmap, len(labels)).as_hex(),
        )
    }
    return colors

  def display(self, text, annotations):
    ex = [{"text": text,
       "ents": [
                {"start": start, "end": end, "label": label} 
                for label, start, end in annotations
                ],
       "title": None}]
    options = dict()
    if self.colors:
      options = dict(colors=self.colors)
    html = displacy.render(ex, style="ent", manual=True, jupyter=True, options=options)
    return html
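To make the idea behind `create_tagger` concrete, here is a minimal, dependency-free sketch of lexicon tagging. It is not how flashtext works internally; it swaps in a plain regular expression, and `build_tagger` and `toy_lexicon` are hypothetical names used only for illustration. The output has the same `(label, start, end)` shape as `extract_keywords(text, span_info=True)`.

```python
import re


def build_tagger(label_to_terms, case_sensitive=False):
    """Return a function mapping text to a list of (label, start, end) spans."""
    term_to_label = {}
    for label, terms in label_to_terms.items():
        for term in terms:
            key = term if case_sensitive else term.lower()
            # If a term appears under several labels, the last one wins.
            term_to_label[key] = label

    # Longest terms first, so multi-word entries win over their prefixes.
    alternatives = sorted(term_to_label, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in alternatives) + r")(?!\w)",
        0 if case_sensitive else re.IGNORECASE,
    )

    def tag(text):
        spans = []
        for match in pattern.finditer(text):
            key = match.group(0) if case_sensitive else match.group(0).lower()
            spans.append((term_to_label[key], match.start(), match.end()))
        return spans

    return tag


# Toy lexicon with a few entries taken from the morality lexicon categories.
toy_lexicon = {"CareVirtue": ["affection", "love"], "PurityVice": ["abhorrent"]}
tag = build_tagger(toy_lexicon)
toy_spans = tag("A lot of Affection, love and abhorrent behaviour")
print(toy_spans)
```

A regular expression is fine for a handful of terms; for lexicons with thousands of entries, a dedicated keyword matcher such as the `KeywordProcessor` used in this notebook is the more practical choice.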

### Morality Lexicon

In [None]:
morality_lexicon = Lexicon()
morality_lexicon.read_morality_lexicon("./Enhanced_Morality_Lexicon_V1.1.txt")

In [None]:
morality_lexicon.lexicon.head()

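|   | token | pos | syn_label | lemma | m_desc | m_type |
|---|---|---|---|---|---|---|
| 0 | aegis | n | S | no | CareVirtue | 1 |
| 1 | aegises | n | S | no | CareVirtue | 1 |
| 2 | affection | n | H | no | CareVirtue | 1 |
| 3 | affectionate | adj | S | no | CareVirtue | 1 |
| 4 | affectionateness | n | H | no | CareVirtue | 1 |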


In [None]:
morality_lexicon.lexicon.groupby("m_desc")["token"].agg(list)

m_desc
AuthorityVice      [agitated, agitating, agitation, agitations, a...
AuthorityVirtue    [abbess, abbesses, abidance, abidances, abide,...
CareVice           [abandon, abandoned, abandons, abase, abuse, a...
CareVirtue         [aegis, aegises, affection, affectionate, affe...
FairnessVice       [advantage, advantages, banishment, banishment...
FairnessVirtue     [aboveboard, adjudicator, adjudicators, admit,...
GeneralVice        [actus rei, actus reus, amiss, amoral, amorall...
GeneralVirtue      [adjust, admirably, admiration, admirations, a...
IngroupVice        [abandon, abandonment, abandonments, act of te...
IngroupVirtue      [accord, accords, across the country, across t...
PurityVice         [scavenge, abhorrent, abominably, adulterant, ...
PurityVirtue       [ablutionary, abstainer, abstainers, abstemiou...
Name: token, dtype: object

In [None]:
morality_lexicon.create_tagger(token_col="token", label_col="m_desc", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
morality_annotations = morality_lexicon.keyword_processor.extract_keywords(text, span_info=True)
morality_annotations

[('CareVirtue', 18, 27),
 ('CareVirtue', 53, 57),
 ('AuthorityVice', 81, 90),
 ('PurityVice', 95, 104)]
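Each annotation is a `(label, start, end)` tuple whose offsets index into the original string. The short sketch below reuses the text and the output shown above to recover the matched words; `matched_words` is just an illustrative variable name.

```python
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
)
# Copied from the output of extract_keywords(text, span_info=True) above.
morality_annotations = [
    ('CareVirtue', 18, 27),
    ('CareVirtue', 53, 57),
    ('AuthorityVice', 81, 90),
    ('PurityVice', 95, 104),
]

# Slice the text with each span to see which word triggered each label.
matched_words = [(label, text[start:end]) for label, start, end in morality_annotations]
print(matched_words)
```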

In [None]:
morality_lexicon.lexicon.groupby("token")["pos"].agg(list)

token
Bolshevism               [n]
Church Father            [n]
Church Fathers           [n]
Defense Department       [n]
Defense Departments      [n]
                       ...  
yield                    [v]
yieldingly             [adv]
yuckier                [adj]
yuckiest               [adj]
yucky                  [adj]
Name: pos, Length: 4419, dtype: object

In [None]:
morality_lexicon.display(text, morality_annotations)

### Emolex

In [None]:
! head ./NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt

aback	anger	0
aback	anticipation	0
aback	disgust	0
aback	fear	0
aback	joy	0
aback	negative	0
aback	positive	0
aback	sadness	0
aback	surprise	0
aback	trust	0


In [None]:
! head ./NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-ForVariousLanguages.txt

English Word	anger	anticipation	disgust	fear	joy	negative	positive	sadness	surprise	trust	Afrikaans	Albanian	Amharic	Arabic	Armenian	Azerbaijani	Basque	Belarusian	Bengali	Bosnian	Bulgarian	Catalan	Cebuano	Chichewa	Chinese-Simplified	Chinese-Traditional	Corsican	Croatian	Czech	Danish	Dutch	Esperanto	Estonian	Filipino	Finnish	French	Frisian	Galician	Georgian	German	Greek	Gujarati	Haitian-Creole	Hausa	Hawaiian	Hebrew	Hindi	Hmong	Hungarian	Icelandic	Igbo	Indonesian	Irish	Italian	Japanese	Javanese	Kannada	Kazakh	Khmer	Kinyarwanda	Korean	Kurdish-Kurmanji	Kyrgyz	Lao	Latin	Latvian	Lithuanian	Luxembourgish	Macedonian	Malagasy	Malay	Malayalam	Maltese	Maori	Marathi	Mongolian	Myanmar-Burmese	Nepali	Norwegian	Odia	Pashto	Persian	Polish	Portuguese	Punjabi	Romanian	Russian	Samoan	Scots-Gaelic	Serbian	Sesotho	Shona	Sindhi	Sinhala	Slovak	Slovenian	Somali	Spanish	Sundanese	Swahili	Swedish	Tajik	Tamil	Tatar	Telugu	Thai	Turkish	Turkmen	Ukrainian	Urdu	Uyghur	Uzbek	Vietnamese	Welsh	Xhosa	Yiddish	Yoruba	Zulu
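The multilingual file is in a wide layout: one row per English word, one 0/1 column per affect, and one column per language. `read_emolex` reshapes it into the long `word, affect, association` layout with `DataFrame.melt`. The following is a plain-Python sketch of the same reshaping, assuming a toy two-row excerpt with only two affect columns. The words and their `anger` association come from the lexicon previews in this notebook, while the `joy` values are illustrative. Unlike `melt`, this sketch emits rows word by word rather than affect by affect.

```python
import csv
import io

# Toy excerpt in the wide layout (tab separated); "joy" values are illustrative.
wide_tsv = (
    "English Word\tanger\tjoy\tHindi\n"
    "abandoned\t1\t0\tत्यागा हुआ\n"
    "abhor\t1\t0\tघृणा करना\n"
)


def wide_to_long(tsv_text, lang, emotions):
    """Turn wide rows into one row per (word, affect) pair with association 1."""
    rows = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for record in reader:
        for affect in emotions:
            if int(record[affect]) == 1:
                rows.append({
                    "English Word": record["English Word"],
                    "word": record[lang],
                    "affect": affect,
                    "association": 1,
                })
    return rows


long_rows = wide_to_long(wide_tsv, lang="Hindi", emotions=["anger", "joy"])
for row in long_rows:
    print(row)
```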

In [None]:
emolex = Lexicon()
emolex.read_emolex("./NRC-Emotion-Lexicon")
emolex.lexicon.head()

|   | word | affect | association |
|---|---|---|---|
| 18 | abacus | trust | 1 |
| 22 | abandon | fear | 1 |
| 24 | abandon | negative | 1 |
| 26 | abandon | sadness | 1 |
| 29 | abandoned | anger | 1 |


In [None]:
emolex.create_tagger(token_col="word", label_col="affect", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
emolex_annotations = emolex.keyword_processor.extract_keywords(text, span_info=True)
emolex_annotations

[('trust', 18, 27),
 ('sadness', 32, 36),
 ('positive', 53, 57),
 ('sadness', 62, 68),
 ('negative', 81, 90),
 ('negative', 95, 104)]
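In the lexicon preview above, `abandon` is associated with `fear`, `negative`, and `sadness`, yet every span in the tagger output carries exactly one label. Presumably each keyword maps to a single label inside the `KeywordProcessor`, so only one affect per word survives. If all affects are needed, one option is to keep a separate word-to-affects lookup and consult it for each matched span. The sketch below assumes a toy list of `(word, affect)` rows copied from that preview; `affects_for_spans` is a hypothetical helper.

```python
from collections import defaultdict

# (word, affect) pairs copied from the first rows of emolex.lexicon.head().
rows = [
    ("abacus", "trust"),
    ("abandon", "fear"),
    ("abandon", "negative"),
    ("abandon", "sadness"),
    ("abandoned", "anger"),
]

word_to_affects = defaultdict(list)
for word, affect in rows:
    word_to_affects[word].append(affect)


def affects_for_spans(text, spans):
    """For each (label, start, end) span, list every affect of the matched word."""
    return [
        (start, end, word_to_affects.get(text[start:end].lower(), []))
        for _, start, end in spans
    ]


all_affects = affects_for_spans("They abandon hope", [("sadness", 5, 12)])
print(all_affects)
```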

In [None]:
emolex.display(text, emolex_annotations)

In [None]:
emolex = Lexicon()
emolex.read_emolex("./NRC-Emotion-Lexicon", lang="Hindi")
emolex.lexicon.head()

|   | English Word | word | affect | association |
|---|---|---|---|---|
| 3 | abandoned | त्यागा हुआ | anger | 1 |
| 4 | abandonment | संन्यास | anger | 1 |
| 17 | abhor | घृणा करना | anger | 1 |
| 18 | abhorrent | घिनौना | anger | 1 |
| 27 | abolish | समाप्त करना | anger | 1 |


In [None]:
emolex.create_tagger(token_col="word", label_col="affect", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'आसामन में बादलों की वजह से अगले तीन दिनों तक बारिश का असर दिखने की उम्मीद जाताई जा रही है।'
    )
emolex_annotations = emolex.keyword_processor.extract_keywords(text, span_info=True)
emolex_annotations

[('anticipation', 0, 1),
 ('sadness', 10, 14),
 ('negative', 45, 48),
 ('anticipation', 67, 73)]

In [None]:
emolex.display(text, emolex_annotations)

In [None]:
emolex = Lexicon()
emolex.read_emolex("./NRC-Emotion-Lexicon", lang="Korean")
emolex.lexicon.head()

|   | English Word | word | affect | association |
|---|---|---|---|---|
| 3 | abandoned | 버려진 | anger | 1 |
| 4 | abandonment | 포기 | anger | 1 |
| 17 | abhor | 혐오하다 | anger | 1 |
| 18 | abhorrent | 혐오스러운 | anger | 1 |
| 27 | abolish | 폐하다 | anger | 1 |


In [None]:
emolex.create_tagger(token_col="word", label_col="affect", case_sensitive=False, set_colors=True)

In [None]:
text = (
    """상대방의 행동과 말을 놓치지않고 
한번더 되새김하면서 존중하는 방법 너무 좋다. 
이런 석진이의 예의바른 모습에 
김석진이라는 사람에 또 반한다💖"""
)
emolex_annotations = emolex.keyword_processor.extract_keywords(text, span_info=True)
emolex_annotations

[('negative', 0, 2),
 ('trust', 9, 10),
 ('trust', 30, 32),
 ('surprise', 35, 36),
 ('sadness', 39, 40),
 ('negative', 46, 47),
 ('negative', 51, 52),
 ('positive', 54, 56),
 ('negative', 67, 68),
 ('trust', 71, 73),
 ('negative', 77, 78)]
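Most of the spans above cover a single character, which suggests that plain keyword matching may be less reliable for text that does not separate words the way English does. If such short matches are not wanted, one simple option is to filter spans by length after extraction. This is only a sketch: `drop_short_spans` and its `min_chars` threshold are hypothetical, and the right threshold depends on the language and the lexicon.

```python
# Copied from the Korean tagger output above.
emolex_annotations = [
    ('negative', 0, 2), ('trust', 9, 10), ('trust', 30, 32), ('surprise', 35, 36),
    ('sadness', 39, 40), ('negative', 46, 47), ('negative', 51, 52),
    ('positive', 54, 56), ('negative', 67, 68), ('trust', 71, 73), ('negative', 77, 78),
]


def drop_short_spans(annotations, min_chars=2):
    """Keep only spans that cover at least `min_chars` characters."""
    return [
        (label, start, end)
        for label, start, end in annotations
        if end - start >= min_chars
    ]


longer_spans = drop_short_spans(emolex_annotations, min_chars=2)
print(longer_spans)
```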

In [None]:
emolex.display(text, emolex_annotations)

### MPQA Subjectivity Lexicon

In [None]:
! sed -n 5549p subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff

type=strongsubj len=1 word1=pervasive pos1=adj stemmed1=n m priorpolarity=negative


In [None]:
mpqa = Lexicon()
mpqa.read_mpqa_lexicon("subjectivity_clues_hltemnlp05/subjclueslen1-HLTEMNLP05.tff")
mpqa.lexicon.head()

5548 dictionary update sequence element #5 has length 1; 2 is required
['type=strongsubj', 'len=1', 'word1=pervasive', 'pos1=adj', 'stemmed1=n', 'm', 'priorpolarity=negative']
5549 dictionary update sequence element #5 has length 1; 2 is required
['type=strongsubj', 'len=1', 'word1=pervasive', 'pos1=noun', 'stemmed1=n', 'm', 'priorpolarity=negative']
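The two reported lines fail because they contain a stray `m` token that is not a `key=value` pair. A tolerant parser can skip such tokens instead of raising. The sketch below uses the exact line printed above; `parse_tff_line` is a hypothetical helper, not part of the MPQA distribution.

```python
def parse_tff_line(line):
    """Parse a space-separated list of key=value tokens, skipping malformed ones."""
    fields = {}
    for token in line.split():
        if "=" not in token:
            # Stray token such as "m": ignore it rather than fail.
            continue
        key, value = token.split("=", 1)
        fields[key] = value
    return fields


line = "type=strongsubj len=1 word1=pervasive pos1=adj stemmed1=n m priorpolarity=negative"
parsed = parse_tff_line(line)
print(parsed)
```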


|   | type | len | word1 | pos1 | stemmed1 | priorpolarity | polarity | mpqapolarity |
|---|---|---|---|---|---|---|---|---|
| 0 | weaksubj | 1 | abandoned | adj | n | negative |  |  |
| 1 | weaksubj | 1 | abandonment | noun | n | negative |  |  |
| 2 | weaksubj | 1 | abandon | verb | y | negative |  |  |
| 3 | strongsubj | 1 | abase | verb | y | negative |  |  |
| 4 | strongsubj | 1 | abasement | anypos | y | negative |  |  |


In [None]:
mpqa.create_tagger(token_col="word1", label_col="priorpolarity", case_sensitive=False, set_colors=True)

In [None]:
text = (
    'There is a lot of affection and hate for people with love and sorrow '
    'this causes agitation and abhorrent behaviour'
    )
mpqa_annotations = mpqa.keyword_processor.extract_keywords(text, span_info=True)
mpqa_annotations

[('positive', 18, 27),
 ('negative', 32, 36),
 ('positive', 53, 57),
 ('negative', 62, 68),
 ('negative', 81, 90),
 ('negative', 95, 104)]
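Span-level polarity labels can be aggregated into a rough document-level summary. The sketch below counts the labels from the output above and computes a net score as positives minus negatives. This score is only an illustrative aggregate, not something defined by the MPQA lexicon.

```python
from collections import Counter

# Copied from the MPQA tagger output above.
mpqa_annotations = [
    ('positive', 18, 27),
    ('negative', 32, 36),
    ('positive', 53, 57),
    ('negative', 62, 68),
    ('negative', 81, 90),
    ('negative', 95, 104),
]

polarity_counts = Counter(label for label, _, _ in mpqa_annotations)
# Illustrative aggregate: more negative than positive cues gives a value below zero.
net_polarity = polarity_counts["positive"] - polarity_counts["negative"]
print(polarity_counts, net_polarity)
```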

In [None]:
mpqa.display(text, mpqa_annotations)

In [None]:
mpqa.create_tagger(token_col="word1", label_col="pos1", case_sensitive=False, set_colors=True)

In [None]:
mpqa.display(text, mpqa.keyword_processor.extract_keywords(text, span_info=True))

## Other Lexicons

* Multilingual Abusive words lexicon - https://github.com/valeriobasile/hurtlex
* Named Entity Lexicon - https://github.com/napsternxg/TwitterNER/tree/master/data/cleaned/custom_lexicons

# Model-Based Information Extraction

## Install dependencies and set up the system path


In [None]:
# Add -f https://download.pytorch.org/whl/cu100/torch_stable.html on cuda notebook
! pip install torch==1.0.0 allennlp==0.8.3 numpy==1.15.4 scipy==1.2.1 pandas scikit-learn==0.20.2 tqdm twarc flashtext overrides==3.1.0
! pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.0.0
  Downloading torch-1.0.0-cp37-cp37m-manylinux1_x86_64.whl (591.8 MB)
Collecting allennlp==0.8.3
  Downloading allennlp-0.8.3-py3-none-any.whl (5.6 MB)
Collecting numpy==1.15.4
  Downloading numpy-1.15.4-cp37-cp37m-manylinux1_x86_64.whl (13.8 MB)
Collecting scipy==1.2.1
  Downloading scipy-1.2.1-cp37-cp37m-manylinux1_x86_64.whl (24.8 MB)
Collecting scikit-learn==0.20.2
  Downloading scikit_learn-0.20.2-cp37-cp37m-manylinux1_x86_64.whl (5.4 MB)
Collecting twarc
  Downloading twarc-2.12.0-py3-none-any.whl (60 kB)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.1.0-py3-none-any.whl size=11074433 sha256=305bdd7dd6f3167096066e0908e66dd9d73f0d1a68c468bfac8145fbb4c64ce2
  Stored in directory: /root/.cache/pip/wheels/59/4f/8c/0dbaab09a776d1fa3740e9465078bfd903cc22f3985382b496
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.4.0
    Uninstalli

In [None]:
!which python # should return /usr/local/bin/python
!python --version
!echo $PYTHONPATH

/usr/local/bin/python
Python 3.7.15
/env/python


In [None]:
%%bash
if [ ! -d "SocialMediaIE" ]; then
  git clone --single-branch --branch tutorial https://github.com/socialmediaie/SocialMediaIE.git
  # conda env update -n base -f ./SocialMediaIE/environment.yml
  pip install -e ./SocialMediaIE/.
  #! pip install -e git+https://github.com/socialmediaie/SocialMediaIE.git#egg=SocialMediaIE
fi

### Set up the system path, otherwise the dependencies will not load properly

In [None]:
import sys
print(sys.path)
#index_to_insert = min([i for i, v in enumerate(sys.path) if "dist-packages" in v])
# sys.path.insert(0, "/usr/local/lib/python3.6/site-packages")
# sys.path.insert(0, "./SocialMediaIE/") # Important to have this first else the package will not load
sys.path

['/content', '/env/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '', '/usr/local/lib/python3.7/dist-packages', '/content/SocialMediaIE', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.7/dist-packages/IPython/extensions', '/root/.ipython']


['/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '',
 '/usr/local/lib/python3.7/dist-packages',
 '/content/SocialMediaIE',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython']
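Rather than reading `sys.path` by eye, one can ask Python directly whether a package is discoverable. The sketch below uses the standard-library `importlib.util.find_spec`; `is_importable` is a hypothetical helper name. In this notebook, `SocialMediaIE` should be reported as importable once `/content/SocialMediaIE` is on the path.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python, so it is always found.
print("json:", is_importable("json"))
# The result for SocialMediaIE depends on whether the editable install succeeded.
print("SocialMediaIE:", is_importable("SocialMediaIE"))
```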

### Download the models

In [None]:
%%bash
if [ ! -f ic2s2_data.tar.gz ]; then
  # Source of data: https://databank.illinois.edu/datasets/IDB-0851257
  wget -q https://databank.illinois.edu/datafiles/vodt2/download -O ic2s2_data.tar.gz
  cd SocialMediaIE && tar -xzf ../ic2s2_data.tar.gz
fi

## After the dependencies are installed, restart the kernel and run from here

In [None]:
import torch
print(torch.__version__)
torch.cuda.is_available()

1.0.0


False

In [None]:
import sys
print("You are using Python {}.{}.".format(sys.version_info.major, sys.version_info.minor))

You are using Python 3.7.


In [None]:
%env SOCIALMEDIAIE_PATH /content/SocialMediaIE/

env: SOCIALMEDIAIE_PATH=/content/SocialMediaIE/


### Check the SocialMediaIE folder was cloned properly

In [None]:
%%bash 
echo "${SOCIALMEDIAIE_PATH}"
ls -ltrh "${SOCIALMEDIAIE_PATH}/data"
realpath "${SOCIALMEDIAIE_PATH}"
cd "${SOCIALMEDIAIE_PATH}" && ls -ltrh

/content/SocialMediaIE/
total 24K
-rw-r--r-- 1 root root 2.8K Oct 20 12:13 cleanup_model_folders.py
-rw-r--r-- 1 root root  11K Oct 20 12:13 databank_api_client_v3.py
drwxr-xr-x 3 root root 4.0K Oct 20 12:13 models
drwxr-xr-x 3 root root 4.0K Oct 20 12:13 models_classification
/content/SocialMediaIE
total 88K
-rw-r--r-- 1 root root 2.8K Oct 20 12:13 README.md
-rw-r--r-- 1 root root  12K Oct 20 12:13 LICENSE
-rw-r--r-- 1 root root 1.7K Oct 20 12:13 TODO.md
drwxr-xr-x 9 root root 4.0K Oct 20 12:13 SocialMediaIE
-rw-r--r-- 1 root root  338 Oct 20 12:13 environment.yml
drwxr-xr-x 3 root root 4.0K Oct 20 12:13 docs
drwxr-xr-x 2 root root 4.0K Oct 20 12:13 experiments
drwxr-xr-x 4 root root 4.0K Oct 20 12:13 figures
drwxr-xr-x 2 root root 4.0K Oct 20 12:13 notebooks
drwxr-xr-x 3 root root 4.0K Oct 20 12:13 tests
-rw-r--r-- 1 root root  928 Oct 20 12:13 setup.py
-rw-r--r-- 1 root root   69 Oct 20 12:13 run_tests.sh
-rw-r--r-- 1 root root   69 Oct 20 12:13 run_tests.cmd
-rw-r--r-- 1 root root 

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0


### Imports

In [None]:
from pathlib import Path
import json

from IPython.display import display

import pandas as pd
import torch

from SocialMediaIE.data.tokenization import get_match_iter, get_match_object

## Multi-task Classification

In [None]:
from SocialMediaIE.predictor.model_predictor_classification import run, get_args, PREFIX, get_model_output, output_to_json

100%|██████████| 336/336 [00:00<00:00, 136996.81B/s]
100%|██████████| 374434792/374434792 [00:07<00:00, 48462880.70B/s]


In [None]:
SERIALIZATION_DIR = Path("./SocialMediaIE/data/models_classification/all_multitask_shared_bilstm_l2_0_lr_1e-3/")
args = get_args(PREFIX, SERIALIZATION_DIR)
args = args._replace(
    dataset_paths_file = "./SocialMediaIE/experiments/all_classification_dataset_paths.json",
    cuda=False # Very important as not running on GPU
)
args

ModelArgument(task=['founta_abusive', 'waseem_abusive', 'sarcasm_uncertainity', 'veridicality_uncertainity', 'semeval_sentiment', 'clarin_sentiment', 'politics_sentiment', 'other_sentiment'], dataset_paths_file='./SocialMediaIE/experiments/all_classification_dataset_paths.json', dataset_path_prefix='/experiments', model_dir='/content/SocialMediaIE/data/models_classification/all_multitask_shared_bilstm_l2_0_lr_1e-3', clean_model_dir=True, proj_dim=100, hidden_dim=100, encoder_type='bilstm', multi_task_mode='shared', dropout=0.5, lr=0.001, weight_decay=0.0, batch_size=16, epochs=10, patience=3, cuda=False, test_mode=True, residual_connection=False)
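The call to `args._replace(...)` and the printed representation suggest that `ModelArgument` is a `namedtuple`. Named tuples are immutable, so `_replace` returns a modified copy and leaves the original untouched, which is why the result is assigned back to `args`. A minimal sketch with a hypothetical `ToyArgs` type:

```python
from collections import namedtuple

# Hypothetical stand-in for ModelArgument with only three fields.
ToyArgs = namedtuple("ToyArgs", ["model_dir", "cuda", "batch_size"])

defaults = ToyArgs(model_dir="./models", cuda=True, batch_size=16)
# _replace builds a new tuple; `defaults` itself is not modified.
cpu_args = defaults._replace(cuda=False)

print(defaults)
print(cpu_args)
```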

In [None]:
TASKS, vocab, model, readers, test_iterator = run(args)

  "num_layers={}".format(dropout, num_layers))


In [None]:
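def tokenize(text):
    """Tokenize `text` and return a dict of parallel lists, one per token attribute."""
    objects = [get_match_object(match) for match in get_match_iter(text)]
    n = len(objects)
    cleaned_objects = []
    for i, obj in enumerate(objects):
        # Assume the token is not followed by whitespace unless the next match is a space.
        obj["no_space"] = True
        if obj["type"] == "space":
            # Whitespace matches are dropped; they are only recorded via "no_space".
            continue
        if i < n-1 and objects[i+1]["type"] == "space":
            obj["no_space"] = False
        cleaned_objects.append(obj)
    # Transpose the list of token dicts into a dict of lists.
    keys = cleaned_objects[0].keys()
    final_sequences = {}
    for k in keys:
        final_sequences[k] = [obj[k] for obj in cleaned_objects]
    return final_sequences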

def predict_json(texts=None):
    """Run the multi-task classifier on `texts` and return one dict per text."""
    if not texts:
        # Fall back to a few demo sentences when no input is given.
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty the CUDA cache so that a larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    output_json = [
        {
            "classification": dict(
                text=text,
                doc_idx=i,
                **output_to_json(tokens[i], output[i], vocab))
        }
        for i, text in enumerate(texts)
    ]
    return output_json
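The `tokenize` helper drops whitespace matches, records on each remaining token whether it is followed by a space, and then turns the list of token dicts into a dict of lists. The sketch below reproduces that logic without SocialMediaIE. It assumes a simple regular expression as a stand-in for `get_match_iter` and `get_match_object`, so the token boundaries will differ from the real tokenizer (for example on hashtags).

```python
import re


def toy_tokenize(text):
    """Regex-based stand-in that mirrors the structure returned by tokenize()."""
    # Each match is either a run of whitespace, a word, or a single punctuation mark.
    objects = [
        {"value": m.group(0), "type": "space" if m.group(0).isspace() else "token"}
        for m in re.finditer(r"\s+|\w+|[^\w\s]", text)
    ]
    cleaned = []
    for i, obj in enumerate(objects):
        if obj["type"] == "space":
            continue
        next_is_space = i + 1 < len(objects) and objects[i + 1]["type"] == "space"
        cleaned.append({**obj, "no_space": not next_is_space})
    if not cleaned:
        return {}
    # Transpose: list of dicts -> dict of lists.
    return {key: [obj[key] for obj in cleaned] for key in cleaned[0]}


toy_tokens = toy_tokenize("going to Ibiza!")
print(toy_tokens["value"])
print(toy_tokens["no_space"])
```

The `no_space` flags make it possible to reassemble the original spacing from the token list.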

In [None]:
output_json = predict_json()
for d in output_json:
  display(d)



{'classification': {'text': 'Barack Obama went to Paris and never returned to the USA.',
  'doc_idx': 0,
  'tokens': ['Barack',
   'Obama',
   'went',
   'to',
   'Paris',
   'and',
   'never',
   'returned',
   'to',
   'the',
   'USA',
   '.'],
  'founta_abusive': {'normal': 0.9298549294471741,
   'spam': 0.003686218522489071,
   'abusive': 0.018016666173934937,
   'hateful': 0.04844215139746666},
  'waseem_abusive': {'none': 0.9588715434074402,
   'sexism': 0.025741683319211006,
   'racism': 0.015386759303510189},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9953067898750305,
   'sarcasm': 0.004693204071372747},
  'veridicality_uncertainity': {'uncertain': 0.4550904631614685,
   'definitely_yes': 0.30803143978118896,
   'probably_yes': 0.11583792418241501,
   'probably_no': 0.07321972399950027,
   'definitely_no': 0.04782041162252426},
  'semeval_sentiment': {'positive': 0.08605081588029861,
   'neutral': 0.5236591100692749,
   'negative': 0.39029020071029663},
  'clarin_sentiment': 

{'classification': {'text': 'Stan Lee was a legend who developed Spiderman and the Avengers movie series.',
  'doc_idx': 1,
  'tokens': ['Stan',
   'Lee',
   'was',
   'a',
   'legend',
   'who',
   'developed',
   'Spiderman',
   'and',
   'the',
   'Avengers',
   'movie',
   'series',
   '.'],
  'founta_abusive': {'normal': 0.9843591451644897,
   'spam': 0.005644865799695253,
   'abusive': 0.005538747180253267,
   'hateful': 0.0044571938924491405},
  'waseem_abusive': {'none': 0.8772116303443909,
   'sexism': 0.11250307410955429,
   'racism': 0.010285232216119766},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9936363697052002,
   'sarcasm': 0.0063636587001383305},
  'veridicality_uncertainity': {'uncertain': 0.4374602735042572,
   'definitely_yes': 0.19472849369049072,
   'probably_yes': 0.29833394289016724,
   'probably_no': 0.03524045646190643,
   'definitely_no': 0.034236788749694824},
  'semeval_sentiment': {'positive': 0.4920443296432495,
   'neutral': 0.48244330286979675,
   'ne

{'classification': {'text': 'I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.',
  'doc_idx': 2,
  'tokens': ['I',
   'just',
   'learned',
   'about',
   'donald',
   'drumph',
   'through',
   'john',
   'oliver',
   '.',
   '#JohnOliverShow',
   'such',
   'an',
   'awesome',
   'show',
   '.'],
  'founta_abusive': {'normal': 0.957321286201477,
   'spam': 0.004389943089336157,
   'abusive': 0.0244298093020916,
   'hateful': 0.013859059661626816},
  'waseem_abusive': {'none': 0.8648111820220947,
   'sexism': 0.13232487440109253,
   'racism': 0.002863895148038864},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9019794464111328,
   'sarcasm': 0.09802058339118958},
  'veridicality_uncertainity': {'uncertain': 0.4389866292476654,
   'definitely_yes': 0.18402868509292603,
   'probably_yes': 0.33069881796836853,
   'probably_no': 0.008736047893762589,
   'definitely_no': 0.037549715489149094},
  'semeval_sentiment': {'positive': 0.9887052178382874,

In [None]:
json.dumps(predict_json(["barack obama went to paris"])[0])

'{"classification": {"text": "barack obama went to paris", "doc_idx": 0, "tokens": ["barack", "obama", "went", "to", "paris"], "founta_abusive": {"normal": 0.9031787514686584, "spam": 0.013646047562360764, "abusive": 0.05717331916093826, "hateful": 0.0260018240660429}, "waseem_abusive": {"none": 0.9790737628936768, "sexism": 0.01925569958984852, "racism": 0.001670498983003199}, "sarcasm_uncertainity": {"not_sarcasm": 0.9918217062950134, "sarcasm": 0.008178316056728363}, "veridicality_uncertainity": {"uncertain": 0.5396470427513123, "definitely_yes": 0.2180899977684021, "probably_yes": 0.1622510403394699, "probably_no": 0.055232349783182144, "definitely_no": 0.024779530242085457}, "semeval_sentiment": {"positive": 0.20253653824329376, "neutral": 0.6687340140342712, "negative": 0.1287294328212738}, "clarin_sentiment": {"neutral": 0.5424699783325195, "positive": 0.15769727528095245, "negative": 0.2998327910900116}, "politics_sentiment": {"negative": 0.2542945146560669, "neutral": 0.623954
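Each task in the output maps labels to probabilities, so the predicted label for a task is simply the label with the highest probability. The sketch below applies this to a subset of the output shown above; `top_labels` is a hypothetical helper, and the probabilities are copied from that output.

```python
def top_labels(classification):
    """Return {task: (label, probability)} for every label-to-probability dict."""
    best = {}
    for task, scores in classification.items():
        # Skip non-task entries such as "text", "doc_idx" and "tokens".
        if isinstance(scores, dict) and scores:
            label = max(scores, key=scores.get)
            best[task] = (label, scores[label])
    return best


example = {
    "text": "barack obama went to paris",
    "doc_idx": 0,
    "tokens": ["barack", "obama", "went", "to", "paris"],
    "founta_abusive": {
        "normal": 0.9031787514686584,
        "spam": 0.013646047562360764,
        "abusive": 0.05717331916093826,
        "hateful": 0.0260018240660429,
    },
    "semeval_sentiment": {
        "positive": 0.20253653824329376,
        "neutral": 0.6687340140342712,
        "negative": 0.1287294328212738,
    },
}

summary = top_labels(example)
for task, (label, probability) in summary.items():
    print(f"{task}: {label} ({probability:.3f})")
```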

In [None]:
text = """'You can give our democracy new meaning': Barack Obama urges young Americans to vote."""

json.dumps(predict_json([text])[0])

'{"classification": {"text": "\'You can give our democracy new meaning\': Barack Obama urges young Americans to vote.", "doc_idx": 0, "tokens": ["\'", "You", "can", "give", "our", "democracy", "new", "meaning", "\'", ":", "Barack", "Obama", "urges", "young", "Americans", "to", "vote", "."], "founta_abusive": {"normal": 0.9205254912376404, "spam": 0.017247281968593597, "abusive": 0.02483111247420311, "hateful": 0.03739606589078903}, "waseem_abusive": {"none": 0.917626142501831, "sexism": 0.04094952344894409, "racism": 0.04142434149980545}, "sarcasm_uncertainity": {"not_sarcasm": 0.9846667051315308, "sarcasm": 0.015333329327404499}, "veridicality_uncertainity": {"uncertain": 0.33441463112831116, "definitely_yes": 0.21644815802574158, "probably_yes": 0.3413499593734741, "probably_no": 0.05547914654016495, "definitely_no": 0.0523080937564373}, "semeval_sentiment": {"positive": 0.45415130257606506, "neutral": 0.48010358214378357, "negative": 0.06574511528015137}, "clarin_sentiment": {"neutr

### Visualizing the output:

You can copy the JSON output of the above cell, paste it at https://codepen.io/napsternxg/full/YzwRqEb, and click Visualize to see a pretty representation of the output, as shown in the presentation.

If you hover over the output of the above cell, Colab will show you how to copy it to clipboard. 

The next cell embeds the visualization page https://socialmediaie.github.io/PredictionVisualizer/ as an iframe.
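As an alternative to copying from the cell output, the prediction can be written to a file and its contents pasted into the visualizer. This is a minimal sketch: `save_prediction` and the file name `prediction.json` are hypothetical, and the example dict is a shortened stand-in for one element of `predict_json(...)`.

```python
import json
import tempfile
from pathlib import Path


def save_prediction(prediction, path):
    """Write a single prediction dict to `path` as UTF-8 JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(prediction, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


# Shortened stand-in for predict_json([...])[0].
prediction = {"classification": {"text": "going to Ibiza", "doc_idx": 0}}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_prediction(prediction, Path(tmp_dir) / "prediction.json")
    reloaded = json.loads(saved_path.read_text(encoding="utf-8"))

print(reloaded)
```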

In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

In [None]:
%%time
texts = [
    "Beautiful day in Chicago! Nice to get away from the Florida heat.",
    "Barack obama went to New York.",
    "obama went to Paris.",
    "Facebook is a new company.",
    "New york is better than SFO",
    "Urbana Champaign is the best",
    "urbana champaign is the best place to live and study",
    "going to Ibiza"
]
output_json = predict_json(texts)
for text, output in zip(texts, output_json):
  print(f"Text: {text}")
  display(output)

Text: Beautiful day in Chicago! Nice to get away from the Florida heat.


{'classification': {'text': 'Beautiful day in Chicago! Nice to get away from the Florida heat.',
  'doc_idx': 0,
  'tokens': ['Beautiful',
   'day',
   'in',
   'Chicago',
   '!',
   'Nice',
   'to',
   'get',
   'away',
   'from',
   'the',
   'Florida',
   'heat',
   '.'],
  'founta_abusive': {'normal': 0.9638862013816833,
   'spam': 0.005775867495685816,
   'abusive': 0.019030630588531494,
   'hateful': 0.011307246051728725},
  'waseem_abusive': {'none': 0.9894468784332275,
   'sexism': 0.0074528660625219345,
   'racism': 0.003100211964920163},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9280757904052734,
   'sarcasm': 0.07192421704530716},
  'veridicality_uncertainity': {'uncertain': 0.31399354338645935,
   'definitely_yes': 0.3354087769985199,
   'probably_yes': 0.27589163184165955,
   'probably_no': 0.014976011589169502,
   'definitely_no': 0.059730030596256256},
  'semeval_sentiment': {'positive': 0.9880387187004089,
   'neutral': 0.01007162407040596,
   'negative': 0.0018896219

Text: Barack obama went to New York.


{'classification': {'text': 'Barack obama went to New York.',
  'doc_idx': 1,
  'tokens': ['Barack', 'obama', 'went', 'to', 'New', 'York', '.'],
  'founta_abusive': {'normal': 0.9632956981658936,
   'spam': 0.006528662983328104,
   'abusive': 0.016656825318932533,
   'hateful': 0.01351882517337799},
  'waseem_abusive': {'none': 0.9844968318939209,
   'sexism': 0.013027030043303967,
   'racism': 0.0024762239772826433},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9977854490280151,
   'sarcasm': 0.0022145439870655537},
  'veridicality_uncertainity': {'uncertain': 0.45956480503082275,
   'definitely_yes': 0.304178386926651,
   'probably_yes': 0.11380760371685028,
   'probably_no': 0.0741635262966156,
   'definitely_no': 0.04828560724854469},
  'semeval_sentiment': {'positive': 0.1205163449048996,
   'neutral': 0.8261353969573975,
   'negative': 0.05334826186299324},
  'clarin_sentiment': {'neutral': 0.76449054479599,
   'positive': 0.11594709008932114,
   'negative': 0.11956237256526947},


Text: obama went to Paris.


{'classification': {'text': 'obama went to Paris.',
  'doc_idx': 2,
  'tokens': ['obama', 'went', 'to', 'Paris', '.'],
  'founta_abusive': {'normal': 0.9691336750984192,
   'spam': 0.005559186916798353,
   'abusive': 0.013510918244719505,
   'hateful': 0.01179617177695036},
  'waseem_abusive': {'none': 0.9905763864517212,
   'sexism': 0.006693535950034857,
   'racism': 0.002730048494413495},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9962672591209412,
   'sarcasm': 0.003732779063284397},
  'veridicality_uncertainity': {'uncertain': 0.4528912305831909,
   'definitely_yes': 0.32768377661705017,
   'probably_yes': 0.1186356246471405,
   'probably_no': 0.05545300990343094,
   'definitely_no': 0.045336417853832245},
  'semeval_sentiment': {'positive': 0.1881154477596283,
   'neutral': 0.7494383454322815,
   'negative': 0.06244628503918648},
  'clarin_sentiment': {'neutral': 0.689195990562439,
   'positive': 0.19216635823249817,
   'negative': 0.11863768845796585},
  'politics_sentiment': {

Text: Facebook is a new company.


{'classification': {'text': 'Facebook is a new company.',
  'doc_idx': 3,
  'tokens': ['Facebook', 'is', 'a', 'new', 'company', '.'],
  'founta_abusive': {'normal': 0.9350468516349792,
   'spam': 0.03364601358771324,
   'abusive': 0.019015135243535042,
   'hateful': 0.012292065657675266},
  'waseem_abusive': {'none': 0.9807422161102295,
   'sexism': 0.014543265104293823,
   'racism': 0.004714523907750845},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9590727090835571,
   'sarcasm': 0.04092726111412048},
  'veridicality_uncertainity': {'uncertain': 0.3898276388645172,
   'definitely_yes': 0.24917195737361908,
   'probably_yes': 0.26808446645736694,
   'probably_no': 0.05484198033809662,
   'definitely_no': 0.038073983043432236},
  'semeval_sentiment': {'positive': 0.41344186663627625,
   'neutral': 0.48928093910217285,
   'negative': 0.09727723151445389},
  'clarin_sentiment': {'neutral': 0.5204728245735168,
   'positive': 0.24901308119297028,
   'negative': 0.23051413893699646},
  'poli

Text: New york is better than SFO


{'classification': {'text': 'New york is better than SFO',
  'doc_idx': 4,
  'tokens': ['New', 'york', 'is', 'better', 'than', 'SFO'],
  'founta_abusive': {'normal': 0.90111243724823,
   'spam': 0.025931205600500107,
   'abusive': 0.041987065225839615,
   'hateful': 0.030969295650720596},
  'waseem_abusive': {'none': 0.9544270038604736,
   'sexism': 0.043916307389736176,
   'racism': 0.0016567305428907275},
  'sarcasm_uncertainity': {'not_sarcasm': 0.946239709854126,
   'sarcasm': 0.053760260343551636},
  'veridicality_uncertainity': {'uncertain': 0.35985174775123596,
   'definitely_yes': 0.25793740153312683,
   'probably_yes': 0.32223159074783325,
   'probably_no': 0.029707206413149834,
   'definitely_no': 0.03027193807065487},
  'semeval_sentiment': {'positive': 0.5676565170288086,
   'neutral': 0.20926181972026825,
   'negative': 0.22308169305324554},
  'clarin_sentiment': {'neutral': 0.2087988704442978,
   'positive': 0.29941919445991516,
   'negative': 0.49178194999694824},
  'pol

Text: Urbana Champaign is the best


{'classification': {'text': 'Urbana Champaign is the best',
  'doc_idx': 5,
  'tokens': ['Urbana', 'Champaign', 'is', 'the', 'best'],
  'founta_abusive': {'normal': 0.9841240048408508,
   'spam': 0.006480109412223101,
   'abusive': 0.00602313969284296,
   'hateful': 0.003372754668816924},
  'waseem_abusive': {'none': 0.9532039165496826,
   'sexism': 0.04004957526922226,
   'racism': 0.006746443919837475},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9982795715332031,
   'sarcasm': 0.0017203980823978782},
  'veridicality_uncertainity': {'uncertain': 0.329534113407135,
   'definitely_yes': 0.434145987033844,
   'probably_yes': 0.1870717704296112,
   'probably_no': 0.02106301486492157,
   'definitely_no': 0.0281850453466177},
  'semeval_sentiment': {'positive': 0.9627512097358704,
   'neutral': 0.035411033779382706,
   'negative': 0.001837826450355351},
  'clarin_sentiment': {'neutral': 0.18252544105052948,
   'positive': 0.8102065324783325,
   'negative': 0.007267965003848076},
  'politic

Text: urbana champaign is the best place to live and study


{'classification': {'text': 'urbana champaign is the best place to live and study',
  'doc_idx': 6,
  'tokens': ['urbana',
   'champaign',
   'is',
   'the',
   'best',
   'place',
   'to',
   'live',
   'and',
   'study'],
  'founta_abusive': {'normal': 0.9668214321136475,
   'spam': 0.017974641174077988,
   'abusive': 0.010682592168450356,
   'hateful': 0.0045213340781629086},
  'waseem_abusive': {'none': 0.9832895994186401,
   'sexism': 0.013638623058795929,
   'racism': 0.0030717398039996624},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9911916851997375,
   'sarcasm': 0.008808319456875324},
  'veridicality_uncertainity': {'uncertain': 0.3851398825645447,
   'definitely_yes': 0.3990635573863983,
   'probably_yes': 0.14199885725975037,
   'probably_no': 0.03076864778995514,
   'definitely_no': 0.043029025197029114},
  'semeval_sentiment': {'positive': 0.9519215822219849,
   'neutral': 0.045599643141031265,
   'negative': 0.0024788370355963707},
  'clarin_sentiment': {'neutral': 0.162

Text: going to Ibiza


{'classification': {'text': 'going to Ibiza',
  'doc_idx': 7,
  'tokens': ['going', 'to', 'Ibiza'],
  'founta_abusive': {'normal': 0.7353984117507935,
   'spam': 0.13337735831737518,
   'abusive': 0.07335348427295685,
   'hateful': 0.057870738208293915},
  'waseem_abusive': {'none': 0.9546189308166504,
   'sexism': 0.031101075932383537,
   'racism': 0.014279968105256557},
  'sarcasm_uncertainity': {'not_sarcasm': 0.9907566905021667,
   'sarcasm': 0.009243348613381386},
  'veridicality_uncertainity': {'uncertain': 0.37934422492980957,
   'definitely_yes': 0.3249450922012329,
   'probably_yes': 0.18138942122459412,
   'probably_no': 0.07097399979829788,
   'definitely_no': 0.04334722459316254},
  'semeval_sentiment': {'positive': 0.376794695854187,
   'neutral': 0.5934746265411377,
   'negative': 0.02973070740699768},
  'clarin_sentiment': {'neutral': 0.660545289516449,
   'positive': 0.26087304949760437,
   'negative': 0.07858168333768845},
  'politics_sentiment': {'negative': 0.1625107

CPU times: user 12.8 s, sys: 346 ms, total: 13.2 s
Wall time: 12.8 s


In [None]:
json.dumps(output)

'{"classification": {"text": "going to Ibiza", "doc_idx": 7, "tokens": ["going", "to", "Ibiza"], "founta_abusive": {"normal": 0.7353984117507935, "spam": 0.13337735831737518, "abusive": 0.07335348427295685, "hateful": 0.057870738208293915}, "waseem_abusive": {"none": 0.9546189308166504, "sexism": 0.031101075932383537, "racism": 0.014279968105256557}, "sarcasm_uncertainity": {"not_sarcasm": 0.9907566905021667, "sarcasm": 0.009243348613381386}, "veridicality_uncertainity": {"uncertain": 0.37934422492980957, "definitely_yes": 0.3249450922012329, "probably_yes": 0.18138942122459412, "probably_no": 0.07097399979829788, "definitely_no": 0.04334722459316254}, "semeval_sentiment": {"positive": 0.376794695854187, "neutral": 0.5934746265411377, "negative": 0.02973070740699768}, "clarin_sentiment": {"neutral": 0.660545289516449, "positive": 0.26087304949760437, "negative": 0.07858168333768845}, "politics_sentiment": {"negative": 0.16251079738140106, "neutral": 0.6454045176506042, "positive": 0.19
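
Each task in the `classification` output maps labels to probabilities, so the predicted label for a task is the key with the highest value. The following is a minimal sketch of that step, using values copied from the "going to Ibiza" output above for two of the tasks; `top_labels` is a hypothetical helper and not part of SocialMediaIE.

```python
# Values copied from the "going to Ibiza" output above (two tasks only).
classification = {
    "text": "going to Ibiza",
    "doc_idx": 7,
    "tokens": ["going", "to", "Ibiza"],
    "founta_abusive": {
        "normal": 0.7353984117507935,
        "spam": 0.13337735831737518,
        "abusive": 0.07335348427295685,
        "hateful": 0.057870738208293915,
    },
    "semeval_sentiment": {
        "positive": 0.376794695854187,
        "neutral": 0.5934746265411377,
        "negative": 0.02973070740699768,
    },
}


def top_labels(classification):
    """Hypothetical helper: pick the most probable label for every task.

    Task entries are recognized as the dict-valued keys; "text", "doc_idx"
    and "tokens" are skipped because their values are not dicts.
    """
    return {
        task: max(probs.items(), key=lambda kv: kv[1])
        for task, probs in classification.items()
        if isinstance(probs, dict)
    }


for task, (label, prob) in top_labels(classification).items():
    print(f"{task}: {label} ({prob:.3f})")
```

Keeping the probability next to the label makes it easy to spot low-confidence predictions, such as the near tie between `neutral` and `positive` sentiment here.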

## Multi-task tagging


In [None]:
from SocialMediaIE.predictor.model_predictor import run, get_args, PREFIX, get_model_output, output_to_df

In [None]:
SERIALIZATION_DIR = Path("./SocialMediaIE/data/models/all_multitask_stacked_l2_0_lr_1e-3_no_neel/")
print(SERIALIZATION_DIR.exists())
args = get_args(PREFIX, SERIALIZATION_DIR)
args = args._replace(
    dataset_paths_file="./SocialMediaIE/experiments/all_dataset_paths.json",
    cuda=False  # Important: this notebook runs on CPU, not GPU
)
args

True


ModelArgument(task=['multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner', 'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos', 'ritter_ccg'], dataset_paths_file='./SocialMediaIE/experiments/all_dataset_paths.json', dataset_path_prefix='/experiments', model_dir='/content/SocialMediaIE/data/models/all_multitask_stacked_l2_0_lr_1e-3_no_neel', clean_model_dir=True, proj_dim=100, hidden_dim=100, encoder_type='bilstm', multi_task_mode='stacked', dropout=0.5, lr=0.001, weight_decay=0.0, batch_size=16, epochs=10, patience=3, cuda=False, test_mode=True, residual_connection=False)
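
The `args._replace(...)` call above does not modify `args` in place. Assuming `ModelArgument` is a `namedtuple` (its printed form and the `_replace` method suggest so, but this is an assumption), the call returns a new object with the given fields changed, which is why the result is assigned back to `args`. A sketch with a hypothetical `DemoArgs` type standing in for `ModelArgument`:

```python
from collections import namedtuple

# DemoArgs is a hypothetical stand-in for ModelArgument with two fields only.
DemoArgs = namedtuple("DemoArgs", ["dataset_paths_file", "cuda"])

args = DemoArgs(dataset_paths_file="paths.json", cuda=True)
cpu_args = args._replace(cuda=False)

print(args.cuda)      # True: the original tuple is unchanged
print(cpu_args.cuda)  # False: _replace returned a modified copy
```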

In [None]:
TASKS, vocab, model, readers, test_iterator = run(args)

In [None]:
TASKS

[Task(tag_namespace='multimodal_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='broad_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='wnut17_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ritter_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='yodie_ner', task_type=ner, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ritter_chunk', task_type=chunk, label_encoding=BIO, calculate_span_f1=Trueis_classification=False),
 Task(tag_namespace='ud_pos', task_type=pos, label_encoding=None, calculate_span_f1=Noneis_classification=False),
 Task(tag_namespace='ark_pos', task_type=pos, label_encoding=None, calculate_span_f1=Noneis_classification=False),
 Task(tag_namespace='ptb_pos', task_type=pos, label_encoding=No

In [None]:
vocab

Vocabulary with namespaces:  ptb_pos, Size: 46 || ritter_ner, Size: 21 || multimodal_ner, Size: 9 || ritter_ccg, Size: 72 || ark_pos, Size: 25 || broad_ner, Size: 7 || yodie_ner, Size: 27 || wnut17_ner, Size: 13 || ritter_chunk, Size: 18 || ud_pos, Size: 18 || tag_namespace, Size: 10 || Non Padded Namespaces: {'tag_namespace', '*chunk', '*pos', '*ner', '*ccg'}

In [None]:
readers

{'multimodal_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f41b34c1850>,
 'broad_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c68d90>,
 'wnut17_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c626d0>,
 'ritter_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c62510>,
 'yodie_ner': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c64e50>,
 'ritter_chunk': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c647d0>,
 'ud_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c60b10>,
 'ark_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c5d910>,
 'ptb_pos': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c5d350>,
 'ritter_ccg': <SocialMediaIE.data.conll_data_reader.ConLLDatasetReader at 0x7f4189c5ad50>}

In [None]:
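def tokenize(text):
    """Tokenize text into parallel lists keyed by token attribute (value, type, span, ...)."""
    objects = [get_match_object(match) for match in get_match_iter(text)]
    n = len(objects)
    cleaned_objects = []
    for i, obj in enumerate(objects):
        # no_space stays True unless the token is directly followed by whitespace
        obj["no_space"] = True
        if obj["type"] == "space":
            # Whitespace is recorded through no_space and not kept as a token
            continue
        if i < n-1 and objects[i+1]["type"] == "space":
            obj["no_space"] = False
        cleaned_objects.append(obj)
    # Convert the list of per-token dicts into a dict of per-key lists
    keys = cleaned_objects[0].keys()
    final_sequences = {}
    for k in keys:
        final_sequences[k] = [obj[k] for obj in cleaned_objects]
    return final_sequences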

def predict_df(texts=None):
    """Tag each text and return one DataFrame with a row per token."""
    if not texts:
        # Fall back to a few example texts when none are given
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty cache to ensure larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    def _get_data_values(d):
        # Tokenizer metadata (type, span, is_hashtag, ...) without the token text
        return {k: v for k, v in d.items() if k != "value"}

    # One DataFrame per text: model tags plus tokenizer metadata,
    # with data_idx recording which input text each row came from
    df = pd.concat([
        output_to_df(tokens[i], output[i], vocab).assign(**_get_data_values(d)).assign(data_idx=i)
        for i, d in enumerate(data)
    ])
    return df


def predict_json(texts=None):
    """Tag each text and return one JSON-serializable dict per text."""
    if not texts:
        # Fall back to a few example texts when none are given
        texts = [
            "Barack Obama went to Paris and never returned to the USA.",
            "Stan Lee was a legend who developed Spiderman and the Avengers movie series.",
            "I just learned about donald drumph through john oliver. #JohnOliverShow such an awesome show.",
        ]
    data = [tokenize(text) for text in texts]
    # Empty cache to ensure larger batch can be loaded for testing
    torch.cuda.empty_cache()
    tokens = [obj["value"] for obj in data]
    output = list(get_model_output(model, tokens, args, readers, vocab, test_iterator))

    def _get_data_values(d):
        # Tokenizer metadata (type, span, is_hashtag, ...) without the token text
        return {k: v for k, v in d.items() if k != "value"}

    # One DataFrame per text: model tags plus tokenizer metadata
    dfs = [
        output_to_df(tokens[i], output[i], vocab).assign(**_get_data_values(d)).assign(data_idx=i)
        for i, d in enumerate(data)
    ]
    # Serialize each DataFrame with the pandas "table" orient (schema + data)
    output = [
        dict(tagging=json.loads(df_t.to_json(orient='table')))
        for df_t in dfs
    ]
    return output
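
The `no_space` flag set in `tokenize` records whether a token was directly followed by another token (`True`) or by whitespace (`False`). That is enough to rebuild a space-normalized version of the text from the tagged tokens. A minimal sketch, with hand-written tokens and flags that follow the logic of `tokenize`; `detokenize` is a hypothetical helper, not part of SocialMediaIE.

```python
def detokenize(tokens, no_space):
    """Hypothetical helper: rebuild text from tokens and their no_space flags."""
    pieces = []
    for token, flag in zip(tokens, no_space):
        pieces.append(token)
        if not flag:
            # The token was followed by whitespace in the original text
            pieces.append(" ")
    return "".join(pieces)


tokens = ["obama", "went", "to", "Paris", "."]
no_space = [False, False, False, True, True]
print(detokenize(tokens, no_space))
```

This sketch writes a single space wherever whitespace followed a token, so the exact original whitespace (for example a newline) is not recovered; the `span` offsets are the better choice when the exact source text is needed.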


In [None]:
output_json = predict_json()
json.dumps(output_json[0])



'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in
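
`predict_json` serializes each DataFrame with the pandas `orient='table'` layout, so every result holds a `schema` describing the columns and a `data` list with one record per token. A sketch of reading token and tag pairs back out of such a result; the `result` dict below is a hand-written miniature of that layout with only two columns kept, and `column_pairs` is a hypothetical helper.

```python
import json

# Hand-written miniature of one predict_json() result (assumed layout:
# pandas "table" orient, i.e. a "schema" plus a list of row records).
result = {
    "tagging": {
        "schema": {
            "fields": [
                {"name": "index", "type": "integer"},
                {"name": "tokens", "type": "string"},
                {"name": "multimodal_ner", "type": "string"},
            ],
            "primaryKey": ["index"],
        },
        "data": [
            {"index": 0, "tokens": "Barack", "multimodal_ner": "B-PER"},
            {"index": 1, "tokens": "Obama", "multimodal_ner": "I-PER"},
            {"index": 2, "tokens": "went", "multimodal_ner": "O"},
        ],
    }
}


def column_pairs(result, column):
    """Hypothetical helper: list (token, tag) pairs for one tagging column."""
    table = result["tagging"]
    names = [field["name"] for field in table["schema"]["fields"]]
    if column not in names:
        raise KeyError(f"{column!r} is not a column; available: {names}")
    return [(row["tokens"], row[column]) for row in table["data"]]


# Round-trip through a JSON string, as when the output is copied and pasted.
restored = json.loads(json.dumps(result))
print(column_pairs(restored, "multimodal_ner"))
```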

### Visualizing the output:

Copy the JSON output of the `json.dumps` cell above, paste it at https://codepen.io/napsternxg/full/YzwRqEb, and click Visualize to see a formatted view of the output, as shown in the presentation.

If you hover over the output of that cell, Colab will show you how to copy it to the clipboard.

In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

In [None]:
df = predict_df()

In [None]:
df.head()

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0


In [None]:
df[df.data_idx==0]

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0
5,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(27, 30)",False,False,False,False,False,False,False,0
6,never,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(31, 36)",False,False,False,False,False,False,False,0
7,returned,O,O,O,O,O,B-VP,VERB,V,VBN,...,token,"(37, 45)",False,False,False,False,False,False,False,0
8,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(46, 48)",False,False,False,False,False,False,False,0
9,the,O,O,B-LOCATION,B-FACILITY,O,B-NP,DET,D,DT,...,token,"(49, 52)",False,False,False,False,False,False,False,0


In [None]:
df.loc[df.data_idx==0, ["tokens", "multimodal_ner"]]

Unnamed: 0,tokens,multimodal_ner
0,Barack,B-PER
1,Obama,I-PER
2,went,O
3,to,O
4,Paris,B-LOC
5,and,O
6,never,O
7,returned,O
8,to,O
9,the,O


In [None]:
def split_tag(tag):
    # Split only on the first hyphen so that labels containing a hyphen stay intact
    return tuple(tag.split("-", 1)) if tag != "O" else (tag, None)

def extract_entities(tags):
    """Return (label, start, end) token spans from BIO tags; end is exclusive."""
    tags = list(tags)
    curr_entity = []
    entities = []
    # Add a dummy "O" tag at the end to ensure the last entity is added to entities
    for i, tag in enumerate(tags + ["O"]):
        boundary, label = split_tag(tag)
        if curr_entity:
            # Exit the current entity on "B", "O", or a change of label
            if boundary in {"B", "O"} or label != curr_entity[-1][1]:
                start = i - len(curr_entity)
                end = i
                entity_label = curr_entity[-1][1]
                entities.append((entity_label, start, end))
                curr_entity = []
            elif boundary == "I":
                # Continue the current entity
                curr_entity.append((boundary, label))
        if boundary == "B":
            # Start a new entity
            assert not curr_entity, f"Entity should be empty. Found: {curr_entity}"
            curr_entity.append((boundary, label))
    return entities
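
`split_tag` passes a maximum split count of 1 to `str.split`. This matters for labels that themselves contain a hyphen, such as `B-GEO-LOC` in the `ritter_ner` column. A small illustration of the difference in plain Python, with no SocialMediaIE code involved:

```python
tag = "B-GEO-LOC"

# Splitting on every hyphen breaks the label apart.
print(tag.split("-"))      # ['B', 'GEO', 'LOC']

# Splitting at most once keeps the label whole.
boundary, label = tag.split("-", 1)
print(boundary, label)     # B GEO-LOC
```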

In [None]:
df_t = df.loc[df.data_idx==0]
df_t

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,Obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,False,0
5,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(27, 30)",False,False,False,False,False,False,False,0
6,never,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(31, 36)",False,False,False,False,False,False,False,0
7,returned,O,O,O,O,O,B-VP,VERB,V,VBN,...,token,"(37, 45)",False,False,False,False,False,False,False,0
8,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(46, 48)",False,False,False,False,False,False,False,0
9,the,O,O,B-LOCATION,B-FACILITY,O,B-NP,DET,D,DT,...,token,"(49, 52)",False,False,False,False,False,False,False,0


In [None]:
entities = extract_entities(df_t["multimodal_ner"])
tokens = list(df_t["tokens"])

In [None]:
for label, start, end in entities:
  print(tokens[start:end], label)

['Barack', 'Obama'] PER
['Paris'] LOC
['USA'] LOC


In [None]:
df.columns

Index(['tokens', 'multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner',
       'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos',
       'ritter_ccg', 'type', 'span', 'is_hashtag', 'is_mention', 'is_url',
       'is_emoji', 'is_emoticon', 'is_symbol', 'no_space', 'data_idx'],
      dtype='object')

In [None]:
def get_entity_info(bio_labels, tokens, text=None, spans=None):
  entities_info = extract_entities(bio_labels)
  entities = []
  for label, start, end in entities_info:
    entity_phrase = None
    if text and spans:
      start_char_idx = spans[start][0]
      end_char_idx = spans[end-1][1]
      entity_phrase = text[start_char_idx:end_char_idx]
    entities.append(dict(
        tokens=tokens[start:end], 
        label=label, 
        start=start, 
        end=end, 
        entity_phrase=entity_phrase))
  return entities


def get_df_entities(df, text=None):
  span_columns = [
    c for c in df.columns if c.endswith(("_ner", "_chunk", "_ccg"))
  ]
  tokens = list(df["tokens"])
  spans = list(df["span"])
  task_entities = {c: [] for c in span_columns}
  for c in span_columns:
    bio_labels = df[c]
    task_entities[c] = get_entity_info(bio_labels, tokens, text=text, spans=spans)
  return task_entities
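
`get_entity_info` builds `entity_phrase` by slicing the original text with the character offsets of the first and last token of an entity, rather than by joining the tokens, so the phrase keeps the spacing of the source text. A sketch of the difference with hand-written tokens and spans; the offsets are assumed to follow the same `(start, end)` convention as the `span` column, with an exclusive end.

```python
text = "a real-world setting"
tokens = ["a", "real", "-", "world", "setting"]
# (start, end) character offsets of each token, end exclusive
spans = [(0, 1), (2, 6), (6, 7), (7, 12), (13, 20)]

# Sanity check: every span really points at its token.
assert all(text[s:e] == tok for tok, (s, e) in zip(tokens, spans))

# Suppose an entity covers tokens 1..3 (token end index 4 is exclusive).
start, end = 1, 4

joined = " ".join(tokens[start:end])
sliced = text[spans[start][0]:spans[end - 1][1]]

print(joined)  # real - world
print(sliced)  # real-world
```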

In [None]:
text = """Ryan Gosling and Chris Evans will star in the Russo Bros' 'The Gray Man' for Netflix

The film has a $200M+ budget and the goal is to launch a James Bond-level franchise

'For those who were fans of The Winter Soldier this is us moving into that territory in a real-world setting'"""

df = predict_df([text])
df

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Ryan,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 4)",False,False,False,False,False,False,False,0
1,Gosling,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(5, 12)",False,False,False,False,False,False,False,0
2,and,O,O,O,O,O,I-NP,CCONJ,&,CC,...,token,"(13, 16)",False,False,False,False,False,False,False,0
3,Chris,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,I-NP,PROPN,^,NNP,...,token,"(17, 22)",False,False,False,False,False,False,False,0
4,Evans,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(23, 28)",False,False,False,False,False,False,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59,real,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(261, 265)",False,False,False,False,False,False,True,0
60,-,O,O,O,O,O,I-NP,PUNCT,",",:,...,token,"(265, 266)",False,False,False,False,False,True,True,0
61,world,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(266, 271)",False,False,False,False,False,False,False,0
62,setting,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(272, 279)",False,False,False,False,False,False,True,0


In [None]:
task_entities = get_df_entities(df, text=text)
for task, entities in task_entities.items():
  print(task)
  for entity in entities:
    print(entity)

multimodal_ner
{'tokens': ['Ryan', 'Gosling'], 'label': 'PER', 'start': 0, 'end': 2, 'entity_phrase': 'Ryan Gosling'}
{'tokens': ['Chris', 'Evans'], 'label': 'PER', 'start': 3, 'end': 5, 'entity_phrase': 'Chris Evans'}
{'tokens': ['Russo', 'Bros'], 'label': 'ORG', 'start': 9, 'end': 11, 'entity_phrase': 'Russo Bros'}
{'tokens': ['The', 'Gray', 'Man'], 'label': 'MISC', 'start': 13, 'end': 16, 'entity_phrase': 'The Gray Man'}
{'tokens': ['Netflix'], 'label': 'ORG', 'start': 18, 'end': 19, 'entity_phrase': 'Netflix'}
{'tokens': ['James', 'Bond'], 'label': 'PER', 'start': 35, 'end': 37, 'entity_phrase': 'James Bond'}
{'tokens': ['The', 'Winter', 'Soldier'], 'label': 'MISC', 'start': 47, 'end': 50, 'entity_phrase': 'The Winter Soldier'}
broad_ner
{'tokens': ['Ryan', 'Gosling'], 'label': 'PER', 'start': 0, 'end': 2, 'entity_phrase': 'Ryan Gosling'}
{'tokens': ['Chris', 'Evans'], 'label': 'PER', 'start': 3, 'end': 5, 'entity_phrase': 'Chris Evans'}
{'tokens': ['Russo', 'Bros'], 'label': 'PER'

In [None]:
text = """Ryan Gosling and Chris Evans will star in the Russo Bros' 'The Gray Man' for Netflix

The film has a $200M+ budget and the goal is to launch a James Bond-level franchise

'For those who were fans of The Winter Soldier this is us moving into that territory in a real-world setting'"""
json.dumps(predict_json([text])[0])

'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in

In [None]:
text = """'You can give our democracy new meaning': Barack Obama urges young Americans to vote."""
json.dumps(predict_json([text])[0])

'{"tagging": {"schema": {"fields": [{"name": "index", "type": "integer"}, {"name": "tokens", "type": "string"}, {"name": "multimodal_ner", "type": "string"}, {"name": "broad_ner", "type": "string"}, {"name": "wnut17_ner", "type": "string"}, {"name": "ritter_ner", "type": "string"}, {"name": "yodie_ner", "type": "string"}, {"name": "ritter_chunk", "type": "string"}, {"name": "ud_pos", "type": "string"}, {"name": "ark_pos", "type": "string"}, {"name": "ptb_pos", "type": "string"}, {"name": "ritter_ccg", "type": "string"}, {"name": "type", "type": "string"}, {"name": "span", "type": "string"}, {"name": "is_hashtag", "type": "boolean"}, {"name": "is_mention", "type": "boolean"}, {"name": "is_url", "type": "boolean"}, {"name": "is_emoji", "type": "boolean"}, {"name": "is_emoticon", "type": "boolean"}, {"name": "is_symbol", "type": "boolean"}, {"name": "no_space", "type": "boolean"}, {"name": "data_idx", "type": "integer"}], "primaryKey": ["index"], "pandas_version": "0.20.0"}, "data": [{"in

In [None]:
predict_df(["the day is great"]).T

Unnamed: 0,0,1,2,3
tokens,the,day,is,great
multimodal_ner,O,O,O,O
broad_ner,O,O,O,O
wnut17_ner,O,O,O,O
ritter_ner,O,O,O,O
yodie_ner,O,O,O,O
ritter_chunk,B-NP,I-NP,B-VP,B-ADJP
ud_pos,DET,NOUN,VERB,ADJ
ark_pos,D,N,V,A
ptb_pos,DT,NN,VBZ,JJ


In [None]:
predict_df(["barack obama went to paris"])

Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,0
1,obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,0
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,0
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,0
4,paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(21, 26)",False,False,False,False,False,False,True,0


In [None]:
%%time
texts = [
    "Beautiful day in Chicago! Nice to get away from the Florida heat.",
    "Barack obama went to New York.",
    "obama went to Paris.",
    "Facebook is a new company.",
    "New york is better than SFO",
    "Urbana Champaign is the best",
    "urbana champaign is the best place to live and study",
    "going to Ibiza"
]
df = predict_df(texts)
print(df.columns)
for i in df.data_idx.unique():
  display(df[df.data_idx==i])

Index(['tokens', 'multimodal_ner', 'broad_ner', 'wnut17_ner', 'ritter_ner',
       'yodie_ner', 'ritter_chunk', 'ud_pos', 'ark_pos', 'ptb_pos',
       'ritter_ccg', 'type', 'span', 'is_hashtag', 'is_mention', 'is_url',
       'is_emoji', 'is_emoticon', 'is_symbol', 'no_space', 'data_idx'],
      dtype='object')


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Beautiful,O,O,O,O,O,B-NP,ADJ,A,JJ,...,token,"(0, 9)",False,False,False,False,False,False,False,0
1,day,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(10, 13)",False,False,False,False,False,False,False,0
2,in,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(14, 16)",False,False,False,False,False,False,False,0
3,Chicago,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(17, 24)",False,False,False,False,False,False,True,0
4,!,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(24, 25)",False,False,False,False,False,True,False,0
5,Nice,O,O,O,O,O,B-ADJP,ADJ,A,JJ,...,token,"(26, 30)",False,False,False,False,False,False,False,0
6,to,O,O,O,O,O,B-VP,PART,P,TO,...,token,"(31, 33)",False,False,False,False,False,False,False,0
7,get,O,O,O,O,O,I-VP,VERB,V,VB,...,token,"(34, 37)",False,False,False,False,False,False,False,0
8,away,O,O,O,O,O,B-ADVP,ADV,R,RB,...,token,"(38, 42)",False,False,False,False,False,False,False,0
9,from,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(43, 47)",False,False,False,False,False,False,False,0


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Barack,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,1
1,obama,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 12)",False,False,False,False,False,False,False,1
2,went,O,O,O,O,O,B-VP,VERB,V,VBD,...,token,"(13, 17)",False,False,False,False,False,False,False,1
3,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(18, 20)",False,False,False,False,False,False,False,1
4,New,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(21, 24)",False,False,False,False,False,False,False,1
5,York,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-GEO-LOC,I-NP,PROPN,^,NNP,...,token,"(25, 29)",False,False,False,False,False,False,True,1
6,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(29, 30)",False,False,False,False,False,True,True,1


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,obama,B-PER,O,O,O,O,B-VP,PROPN,^,NNP,...,token,"(0, 5)",False,False,False,False,False,False,False,2
1,went,O,O,O,O,O,I-VP,VERB,V,VBD,...,token,"(6, 10)",False,False,False,False,False,False,False,2
2,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(11, 13)",False,False,False,False,False,False,False,2
3,Paris,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(14, 19)",False,False,False,False,False,False,True,2
4,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(19, 20)",False,False,False,False,False,True,True,2


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Facebook,B-ORG,B-ORG,B-CORPORATION,B-COMPANY,B-COMPANY,B-NP,PROPN,^,NNP,...,token,"(0, 8)",False,False,False,False,False,False,False,3
1,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(9, 11)",False,False,False,False,False,False,False,3
2,a,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(12, 13)",False,False,False,False,False,True,False,3
3,new,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(14, 17)",False,False,False,False,False,False,False,3
4,company,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(18, 25)",False,False,False,False,False,False,True,3
5,.,O,O,O,O,O,O,PUNCT,",",PUNCT,...,token,"(25, 26)",False,False,False,False,False,True,True,3


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,New,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(0, 3)",False,False,False,False,False,False,False,4
1,york,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-LOCATION,I-NP,PROPN,^,NNP,...,token,"(4, 8)",False,False,False,False,False,False,False,4
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(9, 11)",False,False,False,False,False,False,False,4
3,better,O,O,O,O,O,B-ADJP,ADJ,A,JJR,...,token,"(12, 18)",False,False,False,False,False,False,False,4
4,than,O,O,O,O,O,B-PP,ADP,P,IN,...,token,"(19, 23)",False,False,False,False,False,False,False,4
5,SFO,B-ORG,B-ORG,B-CORPORATION,B-COMPANY,B-COMPANY,B-NP,PROPN,^,NNP,...,token,"(24, 27)",False,False,False,False,False,False,True,4


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,Urbana,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-LOCATION,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,5
1,Champaign,I-LOC,I-LOC,I-LOCATION,I-GEO-LOC,I-LOCATION,I-NP,PROPN,^,NNP,...,token,"(7, 16)",False,False,False,False,False,False,False,5
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(17, 19)",False,False,False,False,False,False,False,5
3,the,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(20, 23)",False,False,False,False,False,False,False,5
4,best,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(24, 28)",False,False,False,False,False,False,True,5


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,urbana,B-PER,B-PER,B-PERSON,B-PERSON,B-PERSON,B-NP,PROPN,^,NNP,...,token,"(0, 6)",False,False,False,False,False,False,False,6
1,champaign,I-PER,I-PER,I-PERSON,I-PERSON,I-PERSON,I-NP,PROPN,^,NNP,...,token,"(7, 16)",False,False,False,False,False,False,False,6
2,is,O,O,O,O,O,B-VP,VERB,V,VBZ,...,token,"(17, 19)",False,False,False,False,False,False,False,6
3,the,O,O,O,O,O,B-NP,DET,D,DT,...,token,"(20, 23)",False,False,False,False,False,False,False,6
4,best,O,O,O,O,O,I-NP,ADJ,A,JJ,...,token,"(24, 28)",False,False,False,False,False,False,False,6
5,place,O,O,O,O,O,I-NP,NOUN,N,NN,...,token,"(29, 34)",False,False,False,False,False,False,False,6
6,to,O,O,O,O,O,B-VP,PART,P,TO,...,token,"(35, 37)",False,False,False,False,False,False,False,6
7,live,O,O,O,O,O,I-VP,VERB,V,VB,...,token,"(38, 42)",False,False,False,False,False,False,False,6
8,and,O,O,O,O,O,O,CCONJ,&,CC,...,token,"(43, 46)",False,False,False,False,False,False,False,6
9,study,O,O,O,O,O,B-VP,VERB,V,VB,...,token,"(47, 52)",False,False,False,False,False,False,True,6


Unnamed: 0,tokens,multimodal_ner,broad_ner,wnut17_ner,ritter_ner,yodie_ner,ritter_chunk,ud_pos,ark_pos,ptb_pos,...,type,span,is_hashtag,is_mention,is_url,is_emoji,is_emoticon,is_symbol,no_space,data_idx
0,going,O,O,O,O,O,B-VP,VERB,V,VBG,...,token,"(0, 5)",False,False,False,False,False,False,False,7
1,to,O,O,O,O,O,B-PP,ADP,P,TO,...,token,"(6, 8)",False,False,False,False,False,False,False,7
2,Ibiza,B-LOC,B-LOC,B-LOCATION,B-GEO-LOC,B-GEO-LOC,B-NP,PROPN,^,NNP,...,token,"(9, 14)",False,False,False,False,False,False,True,7


CPU times: user 12.9 s, sys: 241 ms, total: 13.1 s
Wall time: 12.2 s
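
The tables above hold one row per token, with BIO tags (`B-`, `I-`, `O`) in each NER column and character offsets in `span`. The sketch below shows one possible way to merge those token-level tags into entity mentions. The helper `bio_to_entities` is hypothetical and not part of SocialMediaIE, and the example text is reconstructed from the token spans shown for `data_idx == 1`.

```python
def bio_to_entities(text, spans, tags):
    """Merge BIO-tagged tokens into (mention, label, start, end) tuples.

    Hypothetical helper for illustration; not part of SocialMediaIE.
    """
    entities = []
    start, end, label = None, None, None
    for (tok_start, tok_end), tag in zip(spans, tags):
        # "B-GEO-LOC" -> prefix "B", label "GEO-LOC"; "O" -> prefix "O", label ""
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            # Token continues the currently open entity.
            end = tok_end
            continue
        if label is not None:
            # Any other tag closes the currently open entity.
            entities.append((text[start:end], label, start, end))
            start, end, label = None, None, None
        if prefix in ("B", "I"):
            # Open a new entity (a stray "I-" tag is treated like "B-").
            start, end, label = tok_start, tok_end, tag_label
    if label is not None:
        entities.append((text[start:end], label, start, end))
    return entities


# Values copied from the `span` and `multimodal_ner` columns for data_idx == 1.
text = "Barack obama went to New York."
spans = [(0, 6), (7, 12), (13, 17), (18, 20), (21, 24), (25, 29), (29, 30)]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]

entities = bio_to_entities(text, spans, tags)
print(entities)
```

With the notebook's DataFrame, the same helper could be applied per `data_idx` group by passing `list(group.span)` and the tag column of interest, assuming `span` holds `(start, end)` tuples as displayed.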


# Other Twitter NLP Libraries and Resources

* TweetNLP - https://github.com/cardiffnlp/tweetnlp - Uses transformer models
* TweebankNLP - https://github.com/mit-ccc/TweebankNLP - Uses transformer models + Stanza and supports token-level tasks like NER, POS tagging, and dependency parsing
* TwitterNER - https://github.com/socialmediaie/TwitterNER (more lightweight NER focused on English tweets)
* ConText - https://github.com/uiuc-ischool-scanr/ConText (generate networks from text data)
* BERTweet - https://huggingface.co/vinai/bertweet-base (large-scale pre-trained RoBERTa model)
* BERTweet NER - https://huggingface.co/socialmediaie/bertweet-base_wnut17_ner
* Twitter RoBERTa - https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment (sentiment classification)



# Visualizing outputs

The cell below embeds the visualization page https://socialmediaie.github.io/PredictionVisualizer/ as an iframe.

Copy and paste the model output JSON from above into the text area and click **Visualize**.


In [None]:
%%html
<p>
  Full Screen Display at: <a href="https://socialmediaie.github.io/PredictionVisualizer/">https://socialmediaie.github.io/PredictionVisualizer</a>
</p>
<iframe src="https://socialmediaie.github.io/PredictionVisualizer/" width="80%" height="750"></iframe>

# Data Download

## Twarc

Docs: https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/#hydrate

In [None]:
import getpass

In [None]:
twarc_bearer_token = getpass.getpass("Bearer Token: ")
! echo {twarc_bearer_token} > ~/.twarc_bearer_token

Bearer Token: ··········


In [None]:
%%writefile twarc_tweet_ids.txt
21
22

Writing twarc_tweet_ids.txt


In [None]:
! twarc2 --bearer-token {twarc_bearer_token} hydrate ./twarc_tweet_ids.txt ./twarc_tweet_ids.txt.jsonl

100% 2/2 [00:00<00:00, 10.32it/s]


In [None]:
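import json

# The output is a JSON Lines file. With only two IDs it holds a single
# response page, so parsing the first line is enough here.
with open("./twarc_tweet_ids.txt.jsonl") as fp:
  hydrated_tweets = json.loads(fp.readline())
hydrated_tweets["data"]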

[{'author_id': '13',
  'conversation_id': '21',
  'created_at': '2006-03-21T20:51:43.000Z',
  'id': '21',
  'lang': 'en',
  'possibly_sensitive': False,
  'public_metrics': {'like_count': 4465,
   'quote_count': 270,
   'reply_count': 167,
   'retweet_count': 6093},
  'reply_settings': 'everyone',
  'source': 'Twitter Web Client',
  'text': 'just setting up my twttr'},
 {'author_id': '14',
  'conversation_id': '22',
  'created_at': '2006-03-21T21:00:54.000Z',
  'id': '22',
  'lang': 'en',
  'possibly_sensitive': False,
  'public_metrics': {'like_count': 3503,
   'quote_count': 155,
   'reply_count': 77,
   'retweet_count': 4757},
  'reply_settings': 'everyone',
  'source': 'Twitter Web Client',
  'text': 'just setting up my twttr'}]
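
The cell above reads a single response page. For longer ID lists the hydrated file may contain several pages, so a more general reader can be sketched as below. This assumes that each non-empty line of the file is one JSON response with a `data` list, as in the output above. The function name `read_hydrated_jsonl` and the in-memory sample are illustrative only.

```python
import io
import json


def read_hydrated_jsonl(fp):
    """Collect tweets from a JSON Lines stream, one response page per line.

    Assumes every page stores its tweets under the "data" key.
    """
    tweets = []
    for line in fp:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        tweets.extend(page.get("data", []))
    return tweets


# In-memory stand-in for a two-page hydrated file (illustrative data only).
sample_file = io.StringIO(
    json.dumps({"data": [{"id": "21", "text": "just setting up my twttr"}]}) + "\n"
    + json.dumps({"data": [{"id": "22", "text": "just setting up my twttr"}]}) + "\n"
)

all_tweets = read_hydrated_jsonl(sample_file)
print([tweet["id"] for tweet in all_tweets])
```

To use it on the real file, pass the handle returned by `open("./twarc_tweet_ids.txt.jsonl")` instead of `sample_file`.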

In [None]:
df_hydrated_tweets = pd.DataFrame(hydrated_tweets["data"])
df_hydrated_tweets

Unnamed: 0,lang,created_at,author_id,public_metrics,conversation_id,text,source,id,reply_settings,possibly_sensitive
0,en,2006-03-21T20:51:43.000Z,13,"{'retweet_count': 6093, 'reply_count': 167, 'l...",21,just setting up my twttr,Twitter Web Client,21,everyone,False
1,en,2006-03-21T21:00:54.000Z,14,"{'retweet_count': 4757, 'reply_count': 77, 'li...",22,just setting up my twttr,Twitter Web Client,22,everyone,False
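
In the table above, `public_metrics` is a single column holding one dictionary per tweet, which makes sorting or filtering by likes or retweets awkward. A minimal sketch of one way to lift those counts into top-level fields before building the DataFrame follows. The helper `flatten_tweet` is hypothetical, and the sample records reuse a subset of the fields from the hydrated output shown earlier.

```python
def flatten_tweet(tweet, nested_key="public_metrics"):
    """Return a copy of `tweet` with the nested metrics lifted to the top level."""
    flat = {key: value for key, value in tweet.items() if key != nested_key}
    for metric, value in tweet.get(nested_key, {}).items():
        flat[metric] = value
    return flat


# Subset of the fields from the hydrated tweets above.
tweets = [
    {
        "id": "21",
        "author_id": "13",
        "text": "just setting up my twttr",
        "public_metrics": {"like_count": 4465, "quote_count": 270,
                           "reply_count": 167, "retweet_count": 6093},
    },
    {
        "id": "22",
        "author_id": "14",
        "text": "just setting up my twttr",
        "public_metrics": {"like_count": 3503, "quote_count": 155,
                           "reply_count": 77, "retweet_count": 4757},
    },
]

flat_tweets = [flatten_tweet(tweet) for tweet in tweets]
print(flat_tweets[0])
```

Passing `flat_tweets` to `pd.DataFrame` would then give one column per metric. `pd.json_normalize(hydrated_tweets["data"])` is an alternative that produces dotted column names such as `public_metrics.like_count`.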


## Academic API

* Tool: https://developer.twitter.com/apitools/downloader
* Details can be found at: https://twittercommunity.com/t/introducing-new-developer-tools-for-the-twitter-api-v2/168348




> Upload a file called `./ipl-april2022.json` which has a few tweets in it.



In [None]:
df_academic_data = pd.read_json("./ipl-april2022.json")
df_academic_data

Unnamed: 0,id,text
0,1509695225777864704,@calheirosmarcus Espero comentários da rodada ...
1,1509694519754776576,Indian Premier League: CSK Capable Of Retainin...
2,1509694421880950784,LSG prodigy Ayush Badoni played two fine knock...
3,1509694237318782976,Statsman: All about the Indian Premier League ...
4,1509693319764348928,Indian Premier League: CSK Capable Of Retainin...
5,1509693310973087744,Indian Premier League: CSK Capable Of Retainin...
6,1509692303308431360,Indian Premier League 2022 | Badoni a great fi...
7,1509692024928325632,Indian Premier League 2022 | Badoni a great fi...
8,1509691826453762048,RT @CricketNDTV: RCB leg-spinner Wanindu Hasar...
9,1509691482864824320,Indian Premier League 2022 | Badoni a great fi...
