# Analyzing a Twitter Collection

The goal of this notebook is to leverage pre-trained NLP models and tools (e.g. [textblob](https://textblob.readthedocs.io/en/dev/), [flair](https://github.com/flairNLP/flair), [spaCy](https://spacy.io/), [transformers pipelines](https://github.com/huggingface/transformers#quick-tour-of-pipelines)) to analyze real-world English texts of two different varieties: Twitter messages, which are expected to contain informal language, and news headlines, which are expected to show formal usage.

It's an open-ended exercise, but here are some tasks you can attempt:

- extract named entities
- extract noun chunks
- identify qualities of entities and actions
- analyze sentiments of texts
- associate sentiment and named entities
- extract facts: WHAT happened? WHO did WHAT to WHOM?
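
As one way to approach the "associate sentiment and named entities" task, the sketch below averages a text-level polarity score over every entity mentioned in that text. It deliberately uses only the standard library: the function `mean_polarity_by_entity` and the sample data are hypothetical, and the entity lists and polarity values are assumed to come from whichever tool you pick (for instance spaCy's `doc.ents` and TextBlob's `sentiment.polarity`). Crediting the whole text's sentiment to each entity is a crude simplification, not a definitive method.

```python
from collections import defaultdict
from statistics import mean


def mean_polarity_by_entity(records):
    """Average the polarity of every text in which each entity appears.

    `records` is an iterable of (entities, polarity) pairs, one per text:
    `entities` is a list of entity strings, `polarity` a float in [-1, 1].
    """
    scores = defaultdict(list)
    for entities, polarity in records:
        # Count each entity once per text and ignore case differences.
        for entity in {e.strip().lower() for e in entities}:
            scores[entity].append(polarity)
    return {entity: mean(values) for entity, values in scores.items()}


# Hypothetical, hand-written inputs standing in for real model output.
sample = [
    (["Lady Gaga", "NFL"], 0.5),
    (["NFL"], -0.5),
    (["lady gaga"], 1.0),
]
result = mean_polarity_by_entity(sample)
print(result)
```

Lower-casing the entity strings is a simple normalization choice; a real analysis may need something stronger to merge variants such as hashtags and full names.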

## Twitter Messages

In [10]:
import numpy as np
import pandas as pd

In [22]:
tweets = pd.read_csv("../datasets/superbowl/tweets-superbowl.tsv", sep="\t", dtype=str)
tweets.head(10)

Unnamed: 0,tweet_id,datetime,user_id,text
0,828319872929112064,2017-02-05 19:10:21,ashhar_1,RT @BBCWorld: Astronauts attempt an out-of-thi...
1,https://t.co/bHxzttGXUR #SuperBowl2017 https://…,,,
2,828319872245432320,2017-02-05 19:10:21,RNRMontana,RT @theoptionoracle: Retweet if you think the ...
3,#BoycottNFL #ladygaga #SuperBowl Halftime Show.,,,
4,@AppSame #MAGA…,,,
5,828319872060944384,2017-02-05 19:10:21,DerksFighter,RT @JODYHiGHROLLER: $100 FREE SUPERBOWL GiVE A...
6,$50 iN FREE DELiVERY OF ALL SNACKS &amp; ALCOH...,,,
7,$50 iN FREE LYFT RiDES…,,,
8,828319872010563588,2017-02-05 19:10:21,FamCat,RT @TheBaxterBean: TRUMP'S AMERIKKKA: Texas hi...
9,828319871784120321,2017-02-05 19:10:21,Sydney10005,@DaRealWillPower are you ready for the superbo...
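
Rows 1, 3, 4, 6 and 7 above hold text in the `tweet_id` column and nothing elsewhere, which suggests that tweets containing line breaks were split across several rows when the file was parsed. The sketch below shows one possible repair, under the assumption that any row with a non-numeric `tweet_id` is a continuation of the preceding tweet. The function `merge_continuation_rows` and the sample rows are hypothetical, and the code works on plain dictionaries so that it does not depend on how the file was loaded.

```python
def merge_continuation_rows(rows):
    """Fold rows whose `tweet_id` is not numeric into the previous tweet's text."""
    merged = []
    for row in rows:
        tweet_id = row.get("tweet_id") or ""
        if tweet_id.isdigit() or not merged:
            # A regular tweet (or a leading row with nothing to attach to).
            merged.append(dict(row))
        else:
            # Assumption: a non-numeric id is spilled text from the previous tweet.
            previous = merged[-1]
            previous["text"] = f"{previous.get('text') or ''}\n{tweet_id}"
    return merged


# Hypothetical rows mimicking the shape of the output above.
sample_rows = [
    {"tweet_id": "1001", "user_id": "user_a", "text": "first line of a tweet"},
    {"tweet_id": "second line #SuperBowl", "user_id": "", "text": ""},
    {"tweet_id": "1002", "user_id": "user_b", "text": "another tweet"},
]
repaired = merge_continuation_rows(sample_rows)
print(len(repaired), repr(repaired[0]["text"]))
```

To try this on the real data, the rows could be obtained with something like `tweets.fillna("").to_dict("records")`, so that missing values arrive as empty strings rather than NaN.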


In [23]:
# Keep only rows that actually have a text value; missing values load as NaN (a float), not str
texts = [t for t in tweets["text"] if isinstance(t, str)]
print(len(texts))

49881
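
The raw tweets carry retweet prefixes (`RT @user:`), links, and HTML entities such as `&amp;`, which may confuse pre-trained models. A minimal cleaning sketch, using only the standard library, is given below. The function `clean_tweet`, the two regular expressions, and the example string are all illustrative assumptions; whether to strip hashtags or mentions as well depends on the task.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"^RT @\w+:\s*")


def clean_tweet(text):
    """Decode HTML entities, then remove a leading retweet prefix and any URLs."""
    text = html.unescape(text)        # "&amp;" becomes "&"
    text = RETWEET_RE.sub("", text)   # drop a leading "RT @user: "
    text = URL_RE.sub("", text)       # drop links
    return " ".join(text.split())     # collapse runs of whitespace


# Hypothetical example in the style of the tweets above.
example = "RT @someuser: Snacks &amp; drinks for the game https://example.com/x #SuperBowl"
cleaned = clean_tweet(example)
print(cleaned)
```

Applied to the collection, this would look like `[clean_tweet(t) for t in texts]`.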


## News Headlines

AGNews is a collection of news articles classified into 4 distinct categories:

- World
- Sports
- Business
- Sci/Tech

Here, we're only interested in the text contents: the headline and the first paragraph.

In [28]:
news = pd.read_csv("../datasets/agnews/train.csv", dtype=str, header=None)
news.columns = "category headline text".split()
news.head(10)

Unnamed: 0,category,headline,text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."
5,3,"Stocks End Up, But Near Year Lows (Reuters)",Reuters - Stocks ended slightly higher on Frid...
6,3,Money Funds Fell in Latest Week (AP),AP - Assets of the nation's retail money marke...
7,3,Fed minutes show dissent over inflation (USATO...,USATODAY.com - Retail sales bounced back a bit...
8,3,Safety Net (Forbes.com),Forbes.com - After earning a PH.D. in Sociolog...
9,3,Wall St. Bears Claw Back Into the Black,"NEW YORK (Reuters) - Short-sellers, Wall Stre..."
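
Since only the headline and the first paragraph matter here, each row can be turned into a single document before analysis. The sketch below is one way to do that, with two assumptions: the category ids `1` to `4` follow the order of the list above (consistent with the business headlines shown under category `3`), and the backslashes visible in the `text` column stand in for line breaks and can be replaced with spaces. The names `CATEGORY_NAMES` and `prepare_news` and the sample row are hypothetical.

```python
# Assumed mapping: ids 1-4 follow the order World, Sports, Business, Sci/Tech.
CATEGORY_NAMES = {"1": "World", "2": "Sports", "3": "Business", "4": "Sci/Tech"}


def prepare_news(category, headline, text):
    """Return (category_name, document) for one AGNews row."""
    # Join headline and first paragraph; treat backslashes as whitespace.
    document = f"{headline}. {text}".replace("\\", " ")
    return CATEGORY_NAMES.get(category, "Unknown"), " ".join(document.split())


# Hypothetical row in the style of the table above.
label, document = prepare_news(
    "3",
    "Oil prices rise (Example)",
    "Example - Crude prices plus worries\\about supply",
)
print(label, "|", document)
```

For the whole table, the same function could be applied as `[prepare_news(*row) for row in news.itertuples(index=False)]`, because the three columns arrive in the order `category`, `headline`, `text`.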
