## Lab 1: Computational Linguistics

Computational linguistics applies many of the concepts you learned in the previous module to specific real-world challenges. For example, when examining a corpus of text, such as a person's tweets, what can we learn about what is expressed, and how, by examining words and combinations of words?

Is there a story that emerges?

Fortunately, we live in an era where famous people frequently use Twitter, and Twitter is good at producing a large number of small, finite sets of words for us to examine.

Let's take a look at some tweets from Donald Trump over the past year.

In [None]:
## Let's import some libraries from matplotlib; it's helpful for plotting.
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

In [None]:
#All the packages we are using in this project
from __future__ import division
import nltk, re, pprint
from nltk import word_tokenize
from nltk import FreqDist
import json
import pandas as pd
import csv

All tweets from Donald Trump are stored in the file 'realDonaldTrump_tweets.txt' in the current directory.

## Text Analysis
- Text collocations to find common words that go together
- Regular expressions to parse out hashtags and frequently mentioned user accounts

In [None]:
file_path = 'realDonaldTrump_tweets.txt'

## realDonaldTrump_tweets.txt

First, we use a regular expression to strip away symbols and web links in the tweets.
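
To see what this kind of cleanup does, here is a minimal sketch on a made-up tweet. The sample string and the `clean_tweet` helper are hypothetical and not part of the lab data. It removes links first, then replaces every remaining non-word character with a space.

```python
import re

def clean_tweet(text):
    """Replace web links and non-word characters with spaces."""
    # Remove links first so each URL is dropped as a whole
    no_links = re.sub(r'https?://\S+', ' ', text)
    # \w matches letters, digits, and underscore; everything else becomes a space
    return re.sub(r'[^\w]', ' ', no_links)

# Hypothetical sample tweet, for illustration only
sample = "Big rally tonight! #MAGA https://t.co/abc123 Thank you @FoxNews"
print(clean_tweet(sample).split())
```

Note that `#` and `@` are stripped along with the punctuation, so hashtags and mentions have to be extracted from the unmodified text.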

In [None]:
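with open(file_path, 'r') as f:
    # Replace web links, then any remaining non-word character, with a space.
    # The link pattern stops at whitespace so the rest of the line is kept.
    raw = re.sub(r'https?://\S+|[^\w]', ' ', f.read())
    tokens = word_tokenize(raw)
    freqDist = nltk.FreqDist(tokens)
freqDist.plot(20)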

In [None]:
from nltk.util import ngrams

def plot_ngram(tokens, num):
    ngram = ngrams(tokens, num)
    ngram_dist = nltk.FreqDist(ngram)
    ngram_dist.plot(25)

plot_ngram(tokens, 2) ##bigram frequency distribution
plot_ngram(tokens, 3) ##trigram frequency distribution

In [None]:
print("Unique Trump tweet vocab: %i (case-sensitive)" % len(set(tokens)))

In [None]:
text = nltk.Text(tokens)
text.collocations()

Here we can see that "Crooked Hillary" and "FAKE NEWS" are among the slogans that Trump frequently uses in his campaign.
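
Slogans like these stand out because the same adjacent word pairs recur across many tweets. The sketch below shows that counting idea using only the standard library rather than NLTK, on a made-up token list (`count_ngrams` and `toy_tokens` are illustrative names). It counts raw frequency only; NLTK's `collocations()` applies its own filtering and scoring on top of such counts.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the list to pair each token with its successors
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical tokens, for illustration only
toy_tokens = "make america great again we will make america safe again".split()
bigram_counts = count_ngrams(toy_tokens, 2)
print(bigram_counts.most_common(1))
```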

In [None]:
from operator import itemgetter

# Re-read the unmodified file: the cleaned text above no longer contains '#' or '@'
with open(file_path, 'r') as f:
    hasher = f.read()

hashtags = re.findall(r"#(\w+)", hasher)
freqDist_hashtags = nltk.FreqDist(hashtags)
# Build (hashtag, count) pairs and sort by count, most frequent first
hashtag_counts = [(tag, freqDist_hashtags[tag]) for tag in freqDist_hashtags.keys()]
sorted_hashtags = sorted(hashtag_counts, key=itemgetter(1), reverse=True)

# Find the 50 hashtags Donald Trump used most often
print([tag for tag, count in sorted_hashtags[:50]])

In [None]:
freqDist_hashtags.plot(20)

### Comprehension Check:

What do these hashtags tell you about the nature of Trump's tweets? What are the common topics? Do you think they are positive, negative, or some of both? Explain based on the data above.






x

----

### Who is addressed the most in Trump's tweets (via mentions)?

Note that, below, the code is looking for the "@" symbol, which is how people are mentioned on Twitter. 
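
As a minimal sketch of how this pattern matching behaves, the example below runs the same `@(\w+)` and `#(\w+)` patterns over two made-up tweets and tallies the results with the standard library's `Counter`. The sample tweets and variable names are hypothetical and not taken from the data file.

```python
import re
from collections import Counter

# Hypothetical sample tweets, for illustration only
sample_tweets = [
    "Thank you @FoxNews and @foxandfriends! #MAGA",
    "Watching @FoxNews tonight. #MAGA #AmericaFirst",
]
text_blob = "\n".join(sample_tweets)

# The parentheses capture only the name, without the leading '@' or '#'
sample_mentions = Counter(re.findall(r"@(\w+)", text_blob))
sample_hashtags = Counter(re.findall(r"#(\w+)", text_blob))

print(sample_mentions.most_common(1))
print(sample_hashtags.most_common(1))
```

One caveat of such a simple pattern is that it also matches the part after `@` in an email address, so results on real data may include a few false positives.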

In [None]:
tag_users = re.findall(r"@(\w+)", hasher)
freq_users = nltk.FreqDist(tag_users)
# Build (account, count) pairs and sort by count, most frequent first
tag_users = [(user, freq_users[user]) for user in freq_users.keys()]
tag_users = sorted(tag_users, key=itemgetter(1), reverse=True)

# Print the 50 most frequently mentioned accounts
print([user for user, count in tag_users[:50]])

In [None]:
freq_users.plot(10)

#### Comprehension Check:

What types of accounts are most frequently mentioned in Trump's tweets? 





x

----

## Final Comprehension Check:


Based on your analysis of the data from Trump's tweets, what can you say about how Twitter is used? 



----
Does this validate or invalidate how you thought about Trump's tweets previously? 






----