# Analyzing a Twitter Collection

The goal of this notebook is to use pre-trained NLP models and tools (e.g. [textblob](https://textblob.readthedocs.io/en/dev/), [flair](https://github.com/flairNLP/flair), [spaCy](https://spacy.io/), [transformers pipelines](https://github.com/huggingface/transformers#quick-tour-of-pipelines)) to analyze real-world English texts of two different varieties: Twitter messages, which are expected to contain informal language, and news headlines, which are expected to show formal usage.

It's an open-ended exercise, but here are some tasks you can attempt:

- extract named entities
- extract noun chunks
- identify qualities of entities and actions
- analyze sentiments of texts
- associate sentiment and named entities
- extract facts: WHAT happened? WHO did WHAT to WHOM?


In [0]:
!pip install simpletransformers


In [0]:
!pip install spacy
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [0]:
!git clone https://github.com/vitojph/nlp-exercises.git

Cloning into 'nlp-exercises'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 81 (delta 0), reused 2 (delta 0), pack-reused 78
Unpacking objects: 100% (81/81), done.
Checking out files: 100% (38/38), done.


In [0]:
!pip install wandb


In [0]:
!wandb login XXXXXXXXXXXXX

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[32mSuccessfully logged in to Weights & Biases![0m


## Twitter Messages

In [0]:
import numpy as np
import pandas as pd

In [0]:
tweets = pd.read_csv("/content/nlp-exercises/datasets/superbowl/tweets-superbowl.tsv", sep="\t", dtype=str)
tweets.head(10)

Unnamed: 0,tweet_id,datetime,user_id,text
0,828319872929112064,2017-02-05 19:10:21,ashhar_1,RT @BBCWorld: Astronauts attempt an out-of-thi...
1,https://t.co/bHxzttGXUR #SuperBowl2017 https://…,,,
2,828319872245432320,2017-02-05 19:10:21,RNRMontana,RT @theoptionoracle: Retweet if you think the ...
3,#BoycottNFL #ladygaga #SuperBowl Halftime Show.,,,
4,@AppSame #MAGA…,,,
5,828319872060944384,2017-02-05 19:10:21,DerksFighter,RT @JODYHiGHROLLER: $100 FREE SUPERBOWL GiVE A...
6,$50 iN FREE DELiVERY OF ALL SNACKS &amp; ALCOH...,,,
7,$50 iN FREE LYFT RiDES…,,,
8,828319872010563588,2017-02-05 19:10:21,FamCat,RT @TheBaxterBean: TRUMP'S AMERIKKKA: Texas hi...
9,828319871784120321,2017-02-05 19:10:21,Sydney10005,@DaRealWillPower are you ready for the superbo...


In [0]:
texts = [t for t in list(tweets["text"]) if isinstance(t, str)]
print(len(texts))

49881
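
In the preview above, rows 1, 3, 4, 6 and 7 hold only a piece of text in the `tweet_id` column and nothing in `text`, which suggests that multi-line tweets were split across rows when the TSV was parsed. The `isinstance(t, str)` filter silently drops those pieces. The following is a minimal sketch of how they could be re-attached, assuming a fragment row is one with no string in `text` and a non-numeric `tweet_id`. The helper `merge_continuation_rows` and the toy rows are hypothetical, not part of the dataset.

```python
def merge_continuation_rows(rows):
    """Re-attach text fragments that the parser split onto their own rows.

    `rows` is an iterable of dicts with the keys "tweet_id" and "text".
    Assumption: a fragment row has no string in "text" and a "tweet_id"
    that is a string but not purely digits.
    """
    merged = []
    for row in rows:
        tweet_id, text = row.get("tweet_id"), row.get("text")
        is_fragment = (
            not isinstance(text, str)
            and isinstance(tweet_id, str)
            and not tweet_id.isdigit()
        )
        if is_fragment and merged:
            # Append the fragment to the most recent complete tweet.
            merged[-1]["text"] += "\n" + tweet_id
        elif isinstance(text, str):
            merged.append(dict(row))
        # Anything else (no text, nothing to attach to) is skipped.
    return merged


# Toy rows (made up) that mimic the shape of the preview above.
toy_rows = [
    {"tweet_id": "1001", "text": "first line"},
    {"tweet_id": "second line", "text": float("nan")},
    {"tweet_id": "1002", "text": "other tweet"},
]
full_texts = [row["text"] for row in merge_continuation_rows(toy_rows)]
print(full_texts)
```

With the real data, the rows could be obtained with `tweets.to_dict("records")`. Whether the non-numeric test is strict enough for this file would need to be checked against the raw TSV.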


## News Headlines

AGNews is a collection of news articles grouped into four distinct categories:

- World
- Sports
- Business
- Sci/Tech

Here, we're only interested in the text contents: the headline and the first paragraph.
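
The CSV stores each category as a numeric id rather than a name. If the names are ever needed, a lookup table is enough. The sketch below assumes the ids follow the order of the list above (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech), which is the usual AG News convention but is not stated in this notebook; `CATEGORY_NAMES` and `category_name` are illustrative names.

```python
# Assumption: ids 1-4 follow the order of the category list above.
# Keys are strings because the CSV is read with dtype=str.
CATEGORY_NAMES = {"1": "World", "2": "Sports", "3": "Business", "4": "Sci/Tech"}


def category_name(category_id):
    """Translate a category id (str or int) into its name."""
    return CATEGORY_NAMES.get(str(category_id).strip(), "Unknown")


print(category_name("3"))
```

On a DataFrame column the same table can be applied with `news["category"].map(CATEGORY_NAMES)`.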

In [0]:
news = pd.read_csv("/content/nlp-exercises/datasets/agnews/train.csv", dtype=str, header=None)
news.columns = "category headline text".split()
news.head(10)

Unnamed: 0,category,headline,text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."
5,3,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...
6,3,Money Funds Fell in Latest Week (AP),AP - Assets of the nation's retail money marke...
7,3,Fed minutes show dissent over inflation (USATO...,USATODAY.com - Retail sales bounced back a bit...
8,3,Safety Net (Forbes.com),Forbes.com - After earning a PH.D. in Sociolog...
9,3,Wall St. Bears Claw Back Into the Black,"NEW YORK (Reuters) - Short-sellers, Wall Stre..."
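
Many headlines end with a source tag such as `(Reuters)`, and many bodies start with a dateline such as `Reuters - ` or `NEW YORK (Reuters) - `. These agency names are likely to dominate any entity count. Here is a rough sketch for stripping them, assuming the tag is always a trailing parenthetical and the dateline always ends at the first ` - ` within the first few dozen characters. Both patterns are heuristics and may remove legitimate text; `clean_headline` and `clean_body` are hypothetical helpers.

```python
import re

# Trailing parenthetical in a headline, e.g. "... (Reuters)".
HEADLINE_SOURCE = re.compile(r"\s*\([^()]*\)\s*$")
# Leading dateline in a body: up to 40 hyphen-free characters followed by " - ".
BODY_DATELINE = re.compile(r"^\s*[^-]{1,40}\s-\s+")


def clean_headline(headline):
    """Drop a trailing "(Source)" tag, if any."""
    return HEADLINE_SOURCE.sub("", headline)


def clean_body(text):
    """Drop a leading "Agency - " or "CITY (Agency) - " dateline, if any."""
    return BODY_DATELINE.sub("", text, count=1)


print(clean_headline("Wall St. Bears Claw Back Into the Black (Reuters)"))
print(clean_body("Reuters - Stocks ended slightly higher"))
print(clean_body("NEW YORK (Reuters) - Short-sellers"))
```

The functions could be applied before the NER step with `news["headline"].map(clean_headline)` and `news["text"].map(clean_body)`.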


In [0]:
import spacy

nlp = spacy.load("en")

doc = nlp("Donald Trump is the president. Donald Trump eats apples. Donald Trump gives a letter to her daughter.")

print([(ent.text, ent.label_) for ent in doc.ents])

print(list(doc.noun_chunks))

for sentence in doc.sents:
    who, action, what, whom = None, None, None, None
    for token in sentence:
        if token.dep_ == "ROOT":
            action = token.text
        elif token.dep_ == "nsubj":
            who = token.text
        elif token.dep_ == "dobj" or token.dep_ == "attr":
            what = token.text
        elif token.dep_ == "iobj":
            whom = token.text

    print(f"{who} -> {action} -> {what} -> {whom}")

[('Donald Trump', 'PERSON'), ('Donald Trump', 'PERSON'), ('Donald Trump', 'PERSON')]
[Donald Trump, the president, Donald Trump, apples, Donald Trump, a letter, her daughter]
Trump -> is -> president -> None
Trump -> eats -> apples -> None
Trump -> gives -> letter -> None
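
Note that `whom` is `None` in all three sentences, including the one with a recipient. A likely explanation is that spaCy's English models label recipients `dative` rather than `iobj`, and that in "gives a letter to her daughter" the `dative` token is the preposition "to", with the recipient as its `pobj` child. That labelling is an assumption to verify against the model in use. The sketch below shows how both cases could be handled; `find_recipient` is a hypothetical helper, and it is exercised on hand-built stand-ins for tokens so that it runs without a model.

```python
from types import SimpleNamespace


def find_recipient(sentence):
    """Return the text of the indirect object ("whom") in a sentence, or None.

    Works on any iterable of token-like objects exposing `dep_`, `text` and
    `children`, which is the interface of spaCy tokens.
    Assumption: the parser marks recipients with the `dative` label.
    """
    for token in sentence:
        if token.dep_ not in ("dative", "iobj"):
            continue
        # Prepositional form: the dative token is "to" and the recipient
        # is its prepositional object.
        for child in token.children:
            if child.dep_ == "pobj":
                return child.text
        # Double-object form: the dative token is the recipient itself.
        return token.text
    return None


# Hand-built stand-ins for "gives ... to ... daughter" (not real parser output).
daughter = SimpleNamespace(text="daughter", dep_="pobj", children=[])
to = SimpleNamespace(text="to", dep_="dative", children=[daughter])
gives = SimpleNamespace(text="gives", dep_="ROOT", children=[to])
print(find_recipient([gives, to, daughter]))
```

With a parsed document, the call would be `find_recipient(sentence)` for each `sentence` in `doc.sents`.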


In [0]:
from collections import Counter
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

tweets_entities_counter = Counter()

for text in tqdm(texts):
    doc = nlp(text)
    tweets_entities_counter += Counter([f"{ent.text} {ent.label_}" for ent in doc.ents])

print(tweets_entities_counter)


Counter({'SuperBowl MONEY': 11466, 'SuperBowl ORG': 5149, '# CARDINAL': 3683, 'Superbowl PRODUCT': 2536, 'SuperBowl PRODUCT': 2171, 'first ORDINAL': 1507, '#SuperBowl # MONEY': 1497, 'Sunday DATE': 1387, 'Tom Brady PERSON': 1231, 'Patriots ORG': 1207, 'tonight TIME': 1190, 'today DATE': 1097, 'NFL ORG': 1096, 'Nick Foles PERSON': 778, '#SuperBowl MONEY': 713, 'Philadelphia GPE': 616, '#Eagles MONEY': 546, 'the Super Bowl EVENT': 543, 'Beyoncé PERSON': 512, 'Lady Gaga PERSON': 510, 'Eagles ORG': 506, 'Michael Jackson PERSON': 498, 'Brady PERSON': 469, 'HISTORY ORG': 445, '25 Years DATE': 440, 'Halftime Show WORK_OF_ART': 440, 'Bruno Mars PERSON': 428, 'Philadelphia Eagles ORG': 418, 'PepsiHalftime ORG': 410, 'Chris Long PERSON': 404, '2 years DATE': 401, '6 CARDINAL': 396, 'one CARDINAL': 362, 'last year DATE': 361, 'Superbowl MONEY': 338, 'America GPE': 323, '@Eagles GPE': 322, 'Trump ORG': 309, 'JustinTimberlake ORG': 296, 'Kevin Hart PERSON': 294, 'Superbowl ORG': 291, 'New England 
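
Several of the most frequent entries above look like tagging errors caused by Twitter markup: `#SuperBowl` labelled `MONEY`, a bare `#` labelled `CARDINAL`, `@Eagles` labelled `GPE`. One thing to try is normalizing the tweets before running the model. This is a minimal sketch using only the standard library; `normalize_tweet` is a hypothetical helper, the example tweet is made up, and whether the cleaned text actually yields better labels has not been verified here.

```python
import html
import re

RETWEET_PREFIX = re.compile(r"^RT\s+@\w+:\s*")  # "RT @user: " at the start
URL = re.compile(r"https?://\S+")               # links, including truncated ones
SIGIL = re.compile(r"[#@](?=\w)")               # "#" or "@" glued to a word


def normalize_tweet(text):
    """Remove retweet prefixes, URLs and hashtag/mention symbols."""
    text = html.unescape(text)           # "&amp;" -> "&"
    text = RETWEET_PREFIX.sub("", text)
    text = URL.sub("", text)
    text = SIGIL.sub("", text)           # keep the word, drop the symbol
    return " ".join(text.split())        # collapse leftover whitespace


example = "RT @user: Watching the #SuperBowl with @Eagles fans &amp; friends https://t.co/xyz"
print(normalize_tweet(example))
```

The cleaned strings can then be fed to the loop above, for example with `nlp(normalize_tweet(text))`.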

In [0]:
news_entities_counter = Counter()

for text in tqdm(news["headline"]):
    doc = nlp(text)
    news_entities_counter += Counter([f"{ent.text} {ent.label_}" for ent in doc.ents])

print(news_entities_counter)


In [0]:
from transformers import pipeline

sentiment_classifier = pipeline("sentiment-analysis")
ner = pipeline("ner")


In [0]:
from collections import Counter
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

positives = Counter()
negatives = Counter()

for text in tqdm(news["text"][:1000]):
    sentiment = sentiment_classifier(text)[0]["label"]
    if sentiment == "POSITIVE":
        positives += Counter([item["word"] for item in ner(text)])
    else:
        negatives += Counter([item["word"] for item in ner(text)])



Counter({'##uters': 57, 'Re': 56, 'S': 36, '.': 34, 'U': 33, 'AP': 24, '##K': 21, 'Olympic': 21, 'K': 20, '##S': 19, 'of': 17, 'University': 17, 'Earth': 16, 'W': 16, 'Y': 15, '-': 15, '##OR': 14, 'American': 14, 'NASA': 14, 'Phelps': 13, 'NE': 12, '##W': 12, 'A': 12, 'Michael': 12, 'Championship': 12, '##o': 11, 'Inc': 11, 'Bush': 11, 'C': 11, '##N': 10, '##M': 10, 'Corp': 10, 'United': 10, '##D': 10, '##EN': 10, 'G': 9, 'N': 9, 'O': 9, 'P': 9, '##d': 9, 'Java': 9, 'PGA': 9, '##i': 8, 'New': 8, 'States': 8, 'F': 8, '##A': 8, 'Pro': 8, '##P': 8, 'John': 8, '##co': 8, '##sa': 7, 'National': 7, 'Health': 7, 'V': 7, 'Europe': 7, 'Per': 7, '##sei': 7, '##O': 7, 'Space': 7, 'Washington': 7, 'Iraq': 7, 'Microsoft': 7, 'J': 7, 'US': 7, '##vez': 7, 'Japan': 7, 'Greece': 7, '##is': 7, 'Ian': 7, 'Thorpe': 7, '##e': 7, 'Vijay': 7, 'Singh': 7, '##ar': 7, 'Venezuela': 7, 'T': 6, 'North': 6, 'X': 6, 'Open': 6, 'Ko': 6, '3D': 6, 'Athens': 6, 'The': 6, '##R': 6, 'Olympics': 6, 'Chandra': 6, 'E': 6, '

In [0]:
from pprint import pprint

pprint(positives)

Counter({'##uters': 57,
         'Re': 56,
         'S': 36,
         '.': 34,
         'U': 33,
         'AP': 24,
         '##K': 21,
         'Olympic': 21,
         'K': 20,
         '##S': 19,
         'of': 17,
         'University': 17,
         'Earth': 16,
         'W': 16,
         'Y': 15,
         '-': 15,
         '##OR': 14,
         'American': 14,
         'NASA': 14,
         'Phelps': 13,
         'NE': 12,
         '##W': 12,
         'A': 12,
         'Michael': 12,
         'Championship': 12,
         '##o': 11,
         'Inc': 11,
         'Bush': 11,
         'C': 11,
         '##N': 10,
         '##M': 10,
         'Corp': 10,
         'United': 10,
         '##D': 10,
         '##EN': 10,
         'G': 9,
         'N': 9,
         'O': 9,
         'P': 9,
         '##d': 9,
         'Java': 9,
         'PGA': 9,
         '##i': 8,
         'New': 8,
         'States': 8,
         'F': 8,
         '##A': 8,
         'Pro': 8,
         '##P': 8,
         'John':

In [0]:
pprint(negatives)

Counter({'##uters': 124,
         'Re': 121,
         'S': 60,
         '.': 58,
         'U': 52,
         'Inc': 45,
         'AP': 40,
         'Google': 38,
         '##A': 37,
         '##s': 33,
         '##S': 32,
         'N': 31,
         'A': 30,
         '##K': 28,
         '##O': 27,
         '##W': 26,
         '-': 25,
         '&': 25,
         'NE': 24,
         '##r': 24,
         'Corp': 24,
         'Microsoft': 24,
         'Windows': 24,
         '##i': 23,
         'K': 23,
         'Y': 22,
         'O': 21,
         '##P': 21,
         '##OR': 20,
         'L': 20,
         'Iraq': 19,
         'Florida': 19,
         'Olympic': 19,
         '##FP': 18,
         '##RA': 18,
         '##H': 17,
         'J': 17,
         'Internet': 17,
         'Co': 17,
         'Hugo': 17,
         'Cha': 17,
         '##vez': 17,
         'US': 16,
         'Sudan': 16,
         'F': 16,
         'New': 16,
         'C': 16,
         'Venezuelan': 16,
         '##E': 15,
    

In [0]:
negatives.most_common(20)

[('##uters', 124),
 ('Re', 121),
 ('S', 60),
 ('.', 58),
 ('U', 52),
 ('Inc', 45),
 ('AP', 40),
 ('Google', 38),
 ('##A', 37),
 ('##s', 33),
 ('##S', 32),
 ('N', 31),
 ('A', 30),
 ('##K', 28),
 ('##O', 27),
 ('##W', 26),
 ('-', 25),
 ('&', 25),
 ('NE', 24),
 ('##r', 24)]
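
Most of the top entries are WordPiece fragments rather than words: `Re` and `##uters` are the two halves of "Reuters", and `Cha` and `##vez` of "Chavez". Before counting, the fragments can be glued back together. The sketch below assumes the `ner` pipeline returns its items in text order and that a piece starting with `##` continues the item right before it; this can go wrong when the preceding piece was not tagged as an entity and is therefore missing from the list. `merge_wordpieces` is a hypothetical helper and the toy items, including their entity labels, are made up.

```python
from collections import Counter


def merge_wordpieces(items):
    """Glue WordPiece continuations ("##uters") onto the preceding piece ("Re")."""
    words = []
    for item in items:
        piece = item["word"]
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # strip the "##" marker and append
        else:
            words.append(piece)      # start of a new word
    return words


# Toy items shaped like the pipeline output (labels are illustrative only).
toy_items = [
    {"word": "Re", "entity": "I-ORG"},
    {"word": "##uters", "entity": "I-ORG"},
    {"word": "Hugo", "entity": "I-PER"},
    {"word": "Cha", "entity": "I-PER"},
    {"word": "##vez", "entity": "I-PER"},
]
merged_words = merge_wordpieces(toy_items)
print(Counter(merged_words))
```

In the counting loop, `Counter(merge_wordpieces(ner(text)))` would replace the list comprehension over `item["word"]`.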

In [0]:

print(sentiment_classifier("Donald Trump is the worst president of the United States"))
print(sentiment_classifier("Donald Trump is the best president of the United States"))
print(sentiment_classifier("Donald Trump is not the worst president of the United States"))
print(sentiment_classifier("Donald Trump is not the best president of the United States"))
print(sentiment_classifier("Donald Trump isn't the worst president of the United States"))
print(sentiment_classifier("Donald Trump isn't the best president of the United States"))

[{'word': 'Donald', 'score': 0.9988574385643005, 'entity': 'I-PER'}, {'word': 'Trump', 'score': 0.9992254972457886, 'entity': 'I-PER'}, {'word': 'Spain', 'score': 0.9997214674949646, 'entity': 'I-LOC'}]


In [0]:
from collections import defaultdict

facts = defaultdict(list)

for text in tqdm(news["text"][-1000:]):
    doc = nlp(text)
    for sentence in doc.sents:
        who, action, what, whom = None, None, None, None
        for token in sentence:
            # Keep only the first token found for each role.
            if action is None and token.dep_ == "ROOT":
                action = token.text
            elif who is None and token.dep_ == "nsubj":
                who = token.text
            elif what is None and token.dep_ in ("dobj", "attr", "ccomp"):
                what = token.text
            elif whom is None and token.dep_ == "iobj":
                whom = token.text

        # Record one fact per sentence that has a subject.
        if who:
            facts[who.lower()].append((who, action, what, whom, sentence))



In [0]:
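def search_facts_about(who: str):
    """Return the facts whose subject is `who` (case-insensitive)."""
    # .get() avoids adding an empty entry to the defaultdict on a miss.
    return facts.get(who.lower(), [])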

In [0]:
search_facts_about("Bush")

[('Bush',
  'signs',
  'overhaul',
  None,
  President George Bush signs into law the country's most radical overhaul of its intelligence agencies in nearly 60 years.),
 ('Bush',
  'signed',
  'overhaul',
  None,
  AP - President Bush on Friday signed the largest overhaul of U.S. intelligence-gathering in a half century, aiming to transform a system designed for Cold War threats so it can deal effectively with the post-Sept. 11 scourge of terrorism.),
 ('Bush',
  'creating',
  'committee',
  None,
  AP - President Bush is creating a White House committee to oversee the nation's ocean policies, with plans to improve research, manage fisheries better and regulate pollution caused by boats.),
 ('Bush',
  'ordered',
  'creation',
  None,
  In response to a gloomy assessment of the state of the nation's coastal waters, President Bush ordered the creation of a new federal panel to coordinate oceanic policy.)]