# Text Processing

The file `UCSD tweets.csv` has a small number of tweets from August and September 2018 that contained the term "UCSD".  Let's analyze them!

### Steps

Each step is explained in more detail below.
1. Open the CSV and explore the data
2. Clean the data
3. Count word frequency
4. Sentiment analysis

In [13]:
# Imports
%matplotlib inline

import numpy as np
from datascience import *

# 1. Open the CSV and explore the data

### Steps:
* load the data from `data/UCSD tweets.csv`
* Examine the data.  How many tweets are there?  How long is the shortest tweet?  Longest?

In [4]:
# your code here
tweets = Table.read_table("data/UCSD tweets.csv")
tweets

username,date,text
@fox5sandiego,Aug 27,UCSD ranked 7th best school in US by Washington Monthly
@KamaFaye,Sep 1,it physically pains me that UCSD doesn’t have a football ...
@team10news_CA,Sep 2,A card mis-judged a parking space and over-ran a parking ...
@SapnaKmd,Sep 2,Almost 50% increase in #PedsICU engagement from last wee ...
@kabeerthirty,Aug 27T,here’s a tinge of classism in saying SDSU should make do ...
@KNBKSoshihan,Sep 2,This past weekend we observed the 25th anniversary of th ...
@jessiejpg,Aug 29,I HATE UCSD SO MUCH THEYRE SO ANNOYING OHHHHH MY GOD
@Juanickers,Aug 29,LMFAOOO UCSD JUST CALLED ME ASKING IF I COULD MAKE A $25 ...
@savvymunoz,Aug 31,I chose UCSD because it has the best fb meme group
@Yokwellzy,Aug 31,High key excited to be going back to UCSD 💙💛


In [5]:
# your code here
tweets.num_rows
# Length (in characters) of each tweet
length = []
for i in range(tweets.num_rows):
    length.append(len(tweets.column('text')[i]))
# Shortest tweet, then longest tweet (only the last expression is displayed)
tweets.with_column('length', length).sort('length').column('length').item(0)
tweets.with_column('length', length).sort('length', descending=True).column('length').item(0)

278
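
The same questions can also be answered without a table. The following is a minimal sketch in plain Python; the three strings are copied from the preview above, and `sample_texts` is a name introduced only for this illustration.

```python
# Sample tweets copied from the table preview above (not the full file).
sample_texts = [
    "UCSD ranked 7th best school in US by Washington Monthly",
    "I chose UCSD because it has the best fb meme group",
    "I HATE UCSD SO MUCH THEYRE SO ANNOYING OHHHHH MY GOD",
]

# Character length of every tweet.
lengths = [len(text) for text in sample_texts]

num_tweets = len(sample_texts)
shortest = min(lengths)
longest = max(lengths)

# The tweets themselves, picked by length.
shortest_text = min(sample_texts, key=len)
longest_text = max(sample_texts, key=len)

print(num_tweets, shortest, longest)
```

Using `min` and `max` directly avoids sorting the whole table twice just to read off one value.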

# 2. Clean the data

### Filter out tweets that don't have `UCSD` in the text

The Twitter search matches on both username and tweet text.  We want just the ones that have a match in the tweet itself.  The result should be a new table containing only the matching tweets.

* create functions to apply to your table and clean your data
* use where clauses to filter

In [6]:
# your code here
tweets.where('text', are.containing('UCSD'))

username,date,text
@fox5sandiego,Aug 27,UCSD ranked 7th best school in US by Washington Monthly
@KamaFaye,Sep 1,it physically pains me that UCSD doesn’t have a football ...
@team10news_CA,Sep 2,A card mis-judged a parking space and over-ran a parking ...
@SapnaKmd,Sep 2,Almost 50% increase in #PedsICU engagement from last wee ...
@kabeerthirty,Aug 27T,here’s a tinge of classism in saying SDSU should make do ...
@KNBKSoshihan,Sep 2,This past weekend we observed the 25th anniversary of th ...
@jessiejpg,Aug 29,I HATE UCSD SO MUCH THEYRE SO ANNOYING OHHHHH MY GOD
@Juanickers,Aug 29,LMFAOOO UCSD JUST CALLED ME ASKING IF I COULD MAKE A $25 ...
@savvymunoz,Aug 31,I chose UCSD because it has the best fb meme group
@Yokwellzy,Aug 31,High key excited to be going back to UCSD 💙💛
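
The idea behind the filter can be sketched in plain Python. Only the first row below comes from the data; the other two rows (including the `@UCSD_fan` and `@someone` accounts) are made up to show a username-only match and a lowercase match. `mentions_ucsd` is a helper name chosen for this example.

```python
# (username, text) pairs. Only the first row is real; the rest are hypothetical.
rows = [
    ("@savvymunoz", "I chose UCSD because it has the best fb meme group"),
    ("@UCSD_fan", "Great day at the beach"),          # match in username only
    ("@someone", "so excited for ucsd this fall"),    # lowercase match in text
]

def mentions_ucsd(text, ignore_case=False):
    """Return True if the tweet text itself contains 'UCSD'."""
    if ignore_case:
        return "ucsd" in text.lower()
    return "UCSD" in text

# Exact-case matches in the text column only.
exact = [row for row in rows if mentions_ucsd(row[1])]

# Case-insensitive matches in the text column only.
any_case = [row for row in rows if mentions_ucsd(row[1], ignore_case=True)]

print(len(exact), len(any_case))
```

Whether lowercase mentions should count is a judgment call; the sketch only shows that the choice changes which rows survive.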


### Check for duplicates

See if any of the tweets have exactly the same text.  If so, are they true duplicates?  Does it make sense to remove them?

In [7]:
# your code here
tweets.group('text').sort('count', descending=True)

text,count
Great opportunity to take the ATOM & ASSET trauma course ...,2
📈 405.60 parts per million (ppm) #CO2 in the atmosphere ...,1
it physically pains me that UCSD doesn’t have a football ...,1
i can’t believe that in a month summer manar and i will ...,1
here’s a tinge of classism in saying SDSU should make do ...,1
Very interesting work from UCSD Neutrophil nanosponges s ...,1
USC rejected me and UCSD accepted me so disappointed.,1
UCSD women's soccer wins home opener as McManus eats piz ...,1
UCSD ranked 7th best school in US by Washington Monthly,1
UCSD REINSTATED ME IT’S MODELO TIME,1
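
As a rough sketch of the same check outside the table, `collections.Counter` can count identical strings. The sample list below repeats one tweet on purpose; that repetition is artificial and is not a claim about the data.

```python
from collections import Counter

# Sample texts; the first one is repeated here on purpose for illustration.
texts = [
    "UCSD REINSTATED ME IT’S MODELO TIME",
    "USC rejected me and UCSD accepted me so disappointed.",
    "UCSD REINSTATED ME IT’S MODELO TIME",
]

# Count how often each exact text occurs and keep those seen more than once.
counts = Counter(texts)
duplicates = {text: n for text, n in counts.items() if n > 1}

# Drop repeats while keeping the first occurrence and the original order.
unique_texts = list(dict.fromkeys(texts))

print(duplicates)
print(len(unique_texts))
```

Identical text alone does not prove a true duplicate: two different users can post the same announcement, so it is worth checking the username and date before removing anything.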


# 2. Count Words

We want to find out what the most frequent words are, so we need to split the text into individual words.  In text processing this is called tokenizing.

### Steps

1. Make a single long string with all of the tweet text.  Make sure to put spaces between them!
2. Split the tweets into a list of words using `.split()`
3. Print out the first 20 words to make sure it looks like what you think it should.

How many words are there altogether?  How many distinct words? (remember `set()`)

In [22]:
# your code here
# Build one long string from all tweets, separated by spaces
string = ''
for text in tweets.column('text'):
    string = string + text + " "
split = string.split()
print(split[:20])
print(len(split))  # total number of words
len(set(split))    # number of distinct words

['UCSD', 'ranked', '7th', 'best', 'school', 'in', 'US', 'by', 'Washington', 'Monthly', 'it', 'physically', 'pains', 'me', 'that', 'UCSD', 'doesn’t', 'have', 'a', 'football']
875


580

### Remove short words

Short words are really common, and aren't very helpful for comparing word counts.  Usually it is best to remove what are called "stop words", which include things like "of", "a", "in", etc.  In this case we will just remove all words that are fewer than three characters long.

The result should be a new list of words.  How many total?  How many unique?
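
To make the difference between the two approaches concrete, here is a minimal sketch. The `stop_words` set is a tiny hand-picked list made up for this example, not a standard stop word list; the sentence is one of the tweets shown earlier.

```python
# A tiny hand-picked stop word set, for illustration only.
stop_words = {"a", "in", "of", "the", "it", "me", "that", "by", "have"}

words = "it physically pains me that UCSD doesn’t have a football team".split()

# Option 1: drop words shorter than three characters (the approach used here).
by_length = [w for w in words if len(w) >= 3]

# Option 2: drop words found in the stop word set, ignoring case.
by_stop_list = [w for w in words if w.lower() not in stop_words]

print(by_length)
print(by_stop_list)
```

The length rule is simpler but keeps common words such as "that" and "have", which a stop word list would remove.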

In [29]:
# Your code here
# Keep only words that are at least three characters long
long_words = []
for word in split:
    if len(word) >= 3:
        long_words.append(word)
long_words

['UCSD',
 'ranked',
 '7th',
 'best',
 'school',
 'Washington',
 'Monthly',
 'physically',
 'pains',
 'that',
 'UCSD',
 'doesn’t',
 'have',
 'football',
 'team',
 'card',
 'mis-judged',
 'parking',
 'space',
 'and',
 'over-ran',
 'parking',
 'lot',
 'the',
 'process',
 'taking',
 'out',
 'about',
 'about',
 'feet',
 'chain-link',
 'fence',
 'this',
 'morning',
 'the',
 'Nobel',
 'apartments',
 'the',
 'UCSD',
 'area.',
 'Around',
 '9:10',
 'I...',
 'https://www.facebook.com/JAMIESCOTTmobile/posts/2177799288905402',
 'Almost',
 '50%',
 'increase',
 '#PedsICU',
 'engagement',
 'from',
 'last',
 'week',
 '>1000',
 'tweets!👌🏽Special',
 'shoutout',
 '@DrKanaris',
 'for',
 'the',
 'new',
 'Friday',
 '#PedsICU',
 'quiz',
 'tradition',
 '@UCSD_PICU',
 'for',
 'the',
 'great',
 '#meded!',
 '@healthhashtags',
 '#medtwitter',
 'here’s',
 'tinge',
 'classism',
 'saying',
 'SDSU',
 'should',
 'make',
 'with',
 'its',
 '283',
 'acres',
 'while',
 'UCSD',
 'accommodates',
 'roughly',
 'the',
 'same',


# 3. Count word frequency

You can use a dictionary to create a categorical distribution of the words in a sentence:

In [30]:
my_sentence = 'Jack be nimble, Jack be quick, Jack jump over the candlestick'
my_words = my_sentence.split()

categorical_distribution = {} # empty dictionary
for word in my_words:
    if word in categorical_distribution:
        categorical_distribution[word] = categorical_distribution[word] + 1
    else:
        categorical_distribution[word] = 1
        
categorical_distribution

{'Jack': 3,
 'be': 2,
 'candlestick': 1,
 'jump': 1,
 'nimble,': 1,
 'over': 1,
 'quick,': 1,
 'the': 1}

Create a categorical distribution of words for all tweets.  
* Are you surprised by the most common?
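
One possible shortcut, shown here as a sketch rather than the required solution, is `collections.Counter` from the standard library. It builds the same kind of word-to-count mapping and can list the most frequent entries directly. The example reuses the sentence from the cell above.

```python
from collections import Counter

my_sentence = 'Jack be nimble, Jack be quick, Jack jump over the candlestick'

# Counter does the "add one or start at one" bookkeeping for us.
word_counts = Counter(my_sentence.split())

# The two most frequent words, highest count first.
top_two = word_counts.most_common(2)
print(top_two)
```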

In [31]:
# your code here
categorical_distribution_twitter = {} # empty dictionary
for word in long_words:
    if word in categorical_distribution_twitter:
        categorical_distribution_twitter[word] = categorical_distribution_twitter[word] + 1
    else:
        categorical_distribution_twitter[word] = 1
categorical_distribution_twitter
# To list the words by frequency instead (most common first):
# sorted(categorical_distribution_twitter.items(), key=lambda pair: pair[1], reverse=True)

{'#AAST2018': 2,
 '#CO2': 1,
 '#GoFalcons': 1,
 '#GoTritons': 1,
 '#NASPA': 1,
 '#PHMFellowsJC': 1,
 '#PedsICU': 4,
 '#RealTimeChem': 1,
 '#Scripps': 1,
 '#TimeToShine': 1,
 '#TritonInvite.': 1,
 '#TritonPride': 1,
 '#UCSD': 1,
 '#UCSD.': 1,
 '#URDawgs': 1,
 '#art': 1,
 '#california': 1,
 '#chemtwitter': 1,
 '#cysticfibrosis': 1,
 '#goldleaf': 1,
 '#gpnation': 1,
 '#holiday': 1,
 '#laborday': 1,
 '#meded!': 1,
 '#medtwitter': 1,
 '#pedsICU': 1,
 '#pulmonary': 1,
 '#retina': 1,
 '#sandiego': 1,
 '#surgeons': 1,
 '#teamworkmakesthedreamwork': 1,
 '#trandplanttime': 1,
 '#ucsd': 1,
 '#wtc2018': 2,
 '$250': 1,
 '(Torrey': 1,
 '(ppm)': 1,
 '...': 1,
 '101': 1,
 '19th': 1,
 '2005': 1,
 '2012.': 1,
 '2018': 2,
 '2018”': 1,
 '2019': 1,
 '2141-acre': 1,
 '25th': 1,
 '283': 1,
 '405.60': 1,
 '50%': 1,
 '5:30am!': 1,
 '6pm': 1,
 '7th': 1,
 '9:10': 1,
 '>1000': 1,
 '>3900': 1,
 '@AleceAnderson': 1,
 '@CFS_UCSD': 2,
 '@DrKanaris': 1,
 '@GNPS_UCSD!': 1,
 '@JAMA_current': 1,
 '@LBSUWaterPolo!': 1,
 '

# 2.b. Tokenize again (with NLTK this time)

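Why does `UCSD` only have a count of 18?  

Because splitting on whitespace leaves symbols and punctuation attached, so `@UCSD`, `#UCSD`, `#UCSD.` and similar tokens are each counted as separate words.  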

Tokenizing (like most things) is harder than it looks at first!  
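
The problem can be reproduced on a small scale. The sentence below is made up for this illustration, and the regular expression is a simple stand-in chosen for the sketch, not what NLTK does internally.

```python
import re
from collections import Counter

# Hypothetical sentence mixing a hashtag, a handle, and a plain mention.
text = "Go #UCSD. Thanks @UCSD_PICU for the great #meded! UCSD rocks"

# Whitespace splitting keeps punctuation glued to the words.
naive = text.split()

# Optional leading # or @, then letters, digits, or underscores.
token_pattern = re.compile(r"[#@]?\w+")
tokens = token_pattern.findall(text)

# Count every token that mentions UCSD, ignoring case.
mentions = Counter(t for t in tokens if "ucsd" in t.lower())

print(naive)
print(tokens)
print(mentions)
```

Even this pattern has gaps: it would break URLs apart and split contractions such as "doesn’t" at the apostrophe.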

Generally, a good solution is to use a tool built for the job rather than rolling your own.  In this case, we will use the Natural Language Toolkit (NLTK), a Python package.  

You may need to install NLTK and also download its `punkt` tokenizer data.  If so, do this in the terminal:

```
pip install --user nltk
```

Then in Jupyter run this once:
```
import nltk
nltk.download('punkt')
```

Run the code below to use NLTK's tokenizer, and then repeat the process of removing short words and counting.

In [43]:
import nltk
nltk.download('punkt')
from nltk import tokenize

[nltk_data] Downloading package punkt to /home/zhg061/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [32]:
allText = string # pass in a string consisting of all tweets

wordList = tokenize.word_tokenize(allText)
len(wordList)

1050

In [33]:
wordList

['UCSD',
 'ranked',
 '7th',
 'best',
 'school',
 'in',
 'US',
 'by',
 'Washington',
 'Monthly',
 'it',
 'physically',
 'pains',
 'me',
 'that',
 'UCSD',
 'doesn',
 '’',
 't',
 'have',
 'a',
 'football',
 'team',
 'A',
 'card',
 'mis-judged',
 'a',
 'parking',
 'space',
 'and',
 'over-ran',
 'a',
 'parking',
 'lot',
 'in',
 'the',
 'process',
 'taking',
 'out',
 'about',
 'about',
 '40',
 'to',
 '50',
 'feet',
 'of',
 'chain-link',
 'fence',
 'this',
 'morning',
 'at',
 'the',
 'Nobel',
 'apartments',
 'in',
 'the',
 'UCSD',
 'area',
 '.',
 'Around',
 '9:10',
 'AM',
 'I',
 '...',
 'https',
 ':',
 '//www.facebook.com/JAMIESCOTTmobile/posts/2177799288905402',
 '…',
 'Almost',
 '50',
 '%',
 'increase',
 'in',
 '#',
 'PedsICU',
 'engagement',
 'from',
 'last',
 'week',
 '&',
 '>',
 '1000',
 'tweets',
 '!',
 '👌🏽Special',
 'shoutout',
 'to',
 '@',
 'DrKanaris',
 'for',
 'the',
 'new',
 'Friday',
 '#',
 'PedsICU',
 'quiz',
 'tradition',
 '&',
 '@',
 'UCSD_PICU',
 'for',
 'the',
 'great',
 '#',

# 3.b. Count (again)

In [34]:
# Remove short words (keep tokens of at least three characters)
new_long_words = []
for word in wordList:
    if len(word) >= 3:
        new_long_words.append(word)
new_long_words

['UCSD',
 'ranked',
 '7th',
 'best',
 'school',
 'Washington',
 'Monthly',
 'physically',
 'pains',
 'that',
 'UCSD',
 'doesn',
 'have',
 'football',
 'team',
 'card',
 'mis-judged',
 'parking',
 'space',
 'and',
 'over-ran',
 'parking',
 'lot',
 'the',
 'process',
 'taking',
 'out',
 'about',
 'about',
 'feet',
 'chain-link',
 'fence',
 'this',
 'morning',
 'the',
 'Nobel',
 'apartments',
 'the',
 'UCSD',
 'area',
 'Around',
 '9:10',
 '...',
 'https',
 '//www.facebook.com/JAMIESCOTTmobile/posts/2177799288905402',
 'Almost',
 'increase',
 'PedsICU',
 'engagement',
 'from',
 'last',
 'week',
 '1000',
 'tweets',
 '👌🏽Special',
 'shoutout',
 'DrKanaris',
 'for',
 'the',
 'new',
 'Friday',
 'PedsICU',
 'quiz',
 'tradition',
 'UCSD_PICU',
 'for',
 'the',
 'great',
 'meded',
 'healthhashtags',
 'medtwitter',
 'here',
 'tinge',
 'classism',
 'saying',
 'SDSU',
 'should',
 'make',
 'with',
 'its',
 '283',
 'acres',
 'while',
 'UCSD',
 'accommodates',
 'roughly',
 'the',
 'same',
 'number',
 'st

In [36]:
# Count
len(new_long_words)

697

# Sentiment with NLTK

What is sentiment?  Why do we care?

You will need to run this once:
```
nltk.download('vader_lexicon')
```

In [44]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/zhg061/nltk_data...


True

In [45]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [46]:
sid = SentimentIntensityAnalyzer()
sid.polarity_scores("Good test!")

{'compound': 0.4926, 'neg': 0.0, 'neu': 0.239, 'pos': 0.761}
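
The `compound` value summarises the other three scores on a scale from -1 (most negative) to +1 (most positive). A common convention, used here as an assumption rather than something this notebook prescribes, treats values of 0.05 or above as positive and -0.05 or below as negative. `label_sentiment` is a helper name invented for this sketch.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score to a coarse label (threshold is an assumption)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# The score dictionary shown above for "Good test!".
scores = {'compound': 0.4926, 'neg': 0.0, 'neu': 0.239, 'pos': 0.761}

label = label_sentiment(scores['compound'])
print(label)
```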

In [50]:
tweets = Table.read_table('data/UCSD tweets.csv')
tweetList = tweets.column('text')

In [51]:
tweetSentiments = []
for tweet in tweetList:
    tweetSentiment = sid.polarity_scores(tweet)
    tweetSentiment['text'] = tweet
    tweetSentiments.append(tweetSentiment)
    
tweetSentiments

[{'compound': 0.6369,
  'neg': 0.0,
  'neu': 0.682,
  'pos': 0.318,
  'text': 'UCSD ranked 7th best school in US by Washington Monthly '},
 {'compound': -0.4215,
  'neg': 0.237,
  'neu': 0.763,
  'pos': 0.0,
  'text': 'it physically pains me that UCSD doesn’t have a football team'},
 {'compound': 0.0,
  'neg': 0.0,
  'neu': 1.0,
  'pos': 0.0,
  'text': 'A card mis-judged a parking space and over-ran a parking lot in the process taking out about about 40 to 50 feet of chain-link fence this morning at the Nobel apartments in the UCSD area.  Around 9:10 AM I... https://www.facebook.com/JAMIESCOTTmobile/posts/2177799288905402 …'},
 {'compound': 0.8659,
  'neg': 0.0,
  'neu': 0.72,
  'pos': 0.28,
  'text': 'Almost 50% increase in #PedsICU engagement from last week & >1000 tweets!👌🏽Special shoutout to @DrKanaris for the new Friday #PedsICU quiz tradition & @UCSD_PICU for the great #meded! @healthhashtags #medtwitter'},
 {'compound': 0.0772,
  'neg': 0.0,
  'neu': 0.954,
  'pos': 0.046,
  'te

In [57]:
Table.from_records(tweetSentiments)

compound,neg,neu,pos,text
0.6369,0.0,0.682,0.318,UCSD ranked 7th best school in US by Washington Monthly
-0.4215,0.237,0.763,0.0,it physically pains me that UCSD doesn’t have a football ...
0.0,0.0,1.0,0.0,A card mis-judged a parking space and over-ran a parking ...
0.8659,0.0,0.72,0.28,Almost 50% increase in #PedsICU engagement from last wee ...
0.0772,0.0,0.954,0.046,here’s a tinge of classism in saying SDSU should make do ...
0.4588,0.0,0.885,0.115,This past weekend we observed the 25th anniversary of th ...
-0.6801,0.424,0.443,0.133,I HATE UCSD SO MUCH THEYRE SO ANNOYING OHHHHH MY GOD
-0.5319,0.231,0.657,0.112,LMFAOOO UCSD JUST CALLED ME ASKING IF I COULD MAKE A $25 ...
0.6369,0.0,0.682,0.318,I chose UCSD because it has the best fb meme group
0.34,0.0,0.789,0.211,High key excited to be going back to UCSD 💙💛
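
A quick way to get an overall impression is to summarise the `compound` column. The sketch below uses only the ten scores visible in the preview above, so its numbers describe that preview and not the whole file.

```python
from statistics import mean

# Compound scores copied from the ten rows displayed above.
compound_scores = [0.6369, -0.4215, 0.0, 0.8659, 0.0772,
                   0.4588, -0.6801, -0.5319, 0.6369, 0.34]

# Average sentiment and simple counts of negative and positive rows.
average = mean(compound_scores)
num_negative = sum(1 for c in compound_scores if c < 0)
num_positive = sum(1 for c in compound_scores if c > 0)

# The extremes point at the tweets worth reading first.
most_negative = min(compound_scores)
most_positive = max(compound_scores)

print(round(average, 4), num_negative, num_positive)
print(most_negative, most_positive)
```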


# Next Steps

* load the file of "internet research agency" tweets (a small sample) and explore!
    - `data/ira.csv`

In [55]:
internet_research_agency = Table.read_table('data/ira.csv')
internet_research_agency

3906258,ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef974526887856fe299d3f2c0,2016-11-16 09:04,The Best Exercise To Lose Belly Fat In 2 weeks https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38
1051443,8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5fa79c825e2d6f291bf,2016-12-24 04:31,"RT @Philanthropy: Dozens of ‘hate groups’ have charity status, Chronicle study finds https://t.co/FxUBBHNlKy"
2823399,Room Of Rumor,2016-08-18 20:26,"Artificial intelligence can find, map poverty, researchers say #tech"
272878,San Francisco Daily,2016-03-18 19:28,Uber balks at rules proposed by world’s busiest airport #news
7697802,41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed8980383c82378e7518,2016-07-30 15:44,"RT @dirtroaddiva1: #IHatePokemonGoBecause he didn't let me do ""that"" for a Klondike bar. Screw you Pokemon. #PokesAreJokes. http ..."
1409274,New York City Today,2016-01-04 19:02,Chick-fil-A remains closed after health violations #health
2973541,ce7b9f8c86dfbf9b2bd03eda62f0d42ac1c2b1b593ba0b0b052210acf652f896,2016-05-20 14:56,RT @SenSanders: We cannot afford to wait to address this public health crisis. We must quickly fund efforts to stop Zika's spread. ht ...
1042655,Andy Sparks,2016-04-13 14:52,RT @MatthewGellert: #IWouldPreferToForget that the two leading Republican candidates are an ignorant bully and an ignorant preacher.
7838616,40bd0ff013b85c7646ca07ad238bc4dc865ce2cc87034af6e7884e69481f6422,2016-10-08 10:19,"RT @rapstationradio: #NowPlaying: RJ (OMMIO) ""From Nothing (Prod. By Davo)"" #rap #hiphop #music https://t.co/8TJZ3vVCxs"
8005939,0512ea612cfe45a7d9c8c0fd42466e8a8068a6fb3efb34baf7e7be40da578539,2016-08-15 09:57,Hill Street Vida Blues. #AthleticsTVShows @susanslusser
6262477,ef983249ef6ed5de427c4dc19ad6d966c6cf572c2505e44142e7e7261f917ae6,2016-09-09 18:39,RT @c982f7295cf57508a8d39bae6310c9546492d4105cac8d18cc798e56f7573376: All you wanted to know about Hillary #HillaryRottenClinton htt ...
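
The preview above starts directly with a data row, which suggests (this is a guess from the output, not something the file documents) that `data/ira.csv` has no header row and its first tweet was read as column labels. Below is a minimal sketch of reading such a file with Python's `csv` module. The column names `id`, `user`, `date`, and `text` are placeholders chosen for this example, and the two sample rows are copied from the preview.

```python
import csv
import io

# Two rows copied from the preview above, standing in for the real file.
sample = io.StringIO(
    '272878,San Francisco Daily,2016-03-18 19:28,'
    'Uber balks at rules proposed by world’s busiest airport #news\n'
    '1409274,New York City Today,2016-01-04 19:02,'
    'Chick-fil-A remains closed after health violations #health\n'
)

# Placeholder column names; the real file may use different ones.
column_names = ['id', 'user', 'date', 'text']

# Passing fieldnames tells DictReader that the first line is data, not a header.
rows = list(csv.DictReader(sample, fieldnames=column_names))

for row in rows:
    print(row['date'], row['text'])
```

To try this on the actual file, `sample` could be replaced with `open('data/ira.csv', newline='', encoding='utf-8')`, assuming the file is UTF-8 encoded.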


In [None]:
internet_research_agency