In [None]:
# Initialize OK
from client.api.notebook import Notebook
ok = Notebook('hw3.ok')

# Homework 3: Trump, Twitter, and Movies

## Due Date: Friday, May 31 5:00 pm PST

Welcome to the third homework assignment of INT15! In this assignment, we will work with Twitter data in order to analyze Donald Trump's tweets.

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** below.

**Collaborators**: *list collaborators here*

In [1]:
# Run this cell to set up your notebook
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import zipfile

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('display.max_colwidth', 280)

%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set()
sns.set_context("talk")
import re

Run the cell below to unzip and read tweets from the json file into a list named `all_tweets`.

In [2]:
# Unzip the dataset
dest_path = 'hw3-realdonaldtrump_tweets.json.zip'
my_zip = zipfile.ZipFile(dest_path, 'r')
with my_zip.open('hw3-realdonaldtrump_tweets.json', 'r') as f:
    all_tweets = json.load(f)

Here is what a typical tweet from `all_tweets` looks like:

In [3]:
from pprint import pprint # to get a more easily-readable view.
pprint(all_tweets[-1])

## Question 1

Let's construct a DataFrame called `trump` containing data from all the tweets stored in `all_tweets`. The index of the DataFrame should be the ID of each tweet (looks something like `907698529606541312`). It should have these columns:

- `time`: The time the tweet was created encoded as a datetime object. (Use `pd.to_datetime` to encode the timestamp.)
- `source`: The source device of the tweet.
- `text`: The text of the tweet.
- `retweet_count`: The retweet count of the tweet. 

Finally, **the resulting DataFrame should be sorted by the index.**

**Warning:** *Some tweets will store the text in the `text` field and others will use the `full_text` field.*
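
One hedged way to cope with the two field names is a small helper that falls back from `text` to `full_text`. The sketch below runs on two made-up records (the IDs and text are hypothetical), not on entries from `all_tweets`.

```python
# Two made-up tweet records; the IDs and text are hypothetical.
toy_tweets = [
    {"id": 1, "text": "short tweet"},
    {"id": 2, "full_text": "a longer tweet stored under full_text"},
]

def tweet_text(tweet):
    """Return the tweet body, whichever of the two fields holds it."""
    return tweet["text"] if "text" in tweet else tweet["full_text"]

texts = [tweet_text(t) for t in toy_tweets]
print(texts)
```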

<!--
BEGIN QUESTION
name: q1
points: 2
-->

In [4]:
trump = pd.DataFrame({
    'time': pd.to_datetime([tweet['created_at'] for tweet in all_tweets]).tz_convert(None),
    'source': [tweet['source'] for tweet in all_tweets],
    'text': [tweet['text'] if "text" in tweet else tweet['full_text'] for tweet in all_tweets],
    'retweet_count': [tweet['retweet_count'] for tweet in all_tweets],
}, index=[tweet['id'] for tweet in all_tweets],
   columns=['time', 'source', 'text', 'retweet_count'],
).sort_index()
trump.head()

In [None]:
ok.grade("q1");

---
# Part 2: Tweet Source Analysis

In the following questions, we are going to find out the characteristics of Trump's tweets and the devices used to send them.

First let's examine the `'source'` field:

In [16]:
trump['source'].unique()

## Question 2

Notice how sources like "Twitter for Android" or "Instagram" are surrounded by HTML tags. In the cell below, clean up the `source` field by removing the HTML tags from each `source` entry.

**Hints:** 
* Use `trump['source'].str.replace` along with a regular expression.
* You may find it helpful to experiment with regular expressions at [regex101.com](https://regex101.com/).
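
As a rough illustration of the idea, the sketch below strips tags from a made-up string with Python's `re` module. The sample string and the pattern are assumptions chosen for illustration; the real `source` values may call for a different pattern. In pandas, the same kind of pattern would be passed to `str.replace` (with `regex=True` on recent pandas versions).

```python
import re

# Hypothetical source string shaped like an HTML anchor tag.
raw = '<a href="http://example.com" rel="nofollow">Example Client</a>'

# Match anything from '<' up to the next '>' and delete it.
tag_re = r"<[^>]+>"
cleaned = re.sub(tag_re, "", raw)
print(cleaned)  # Example Client
```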

<!--
BEGIN QUESTION
name: q2
points: 1
-->

In [17]:
## Uncomment and complete
# trump['source'] = ...

In [None]:
ok.grade("q2");

In the following plot, we see that there are two device types that are more commonly used than others.

In [19]:
plt.figure(figsize=(8, 6))
trump['source'].value_counts().plot(kind="bar")
plt.ylabel("Number of Tweets")
plt.title("Number of Tweets by Source");

## Question 3

Now that we have cleaned up the `source` field, let's look at which devices Trump has used over the entire time period of this dataset.

To examine the distribution of dates we will convert the date to a fractional year that can be plotted as a distribution.

(Code borrowed from https://stackoverflow.com/questions/6451655/python-how-to-convert-datetime-dates-to-decimal-years)
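
To see what the conversion produces, here is a small self-contained check that repeats the function from the next cell and applies it to two hand-picked dates. 2016 is a leap year (366 days), so July 2 falls exactly halfway through it.

```python
import datetime

def year_fraction(date):
    start = datetime.date(date.year, 1, 1).toordinal()
    year_length = datetime.date(date.year + 1, 1, 1).toordinal() - start
    return date.year + float(date.toordinal() - start) / year_length

mid_2016 = year_fraction(datetime.date(2016, 7, 2))    # 183 of 366 days elapsed
start_2017 = year_fraction(datetime.date(2017, 1, 1))  # 0 days elapsed
print(mid_2016, start_2017)  # 2016.5 2017.0
```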

In [20]:
import datetime
def year_fraction(date):
    start = datetime.date(date.year, 1, 1).toordinal()
    year_length = datetime.date(date.year+1, 1, 1).toordinal() - start
    return date.year + float(date.toordinal() - start) / year_length

trump['year'] = trump['time'].apply(year_fraction)

Now, use `sns.distplot` to overlay the distributions of Trump's 2 most frequently used web technologies over the years. Your final plot should look like:

<img src="images/source_years_q3.png" width="600px" />

<!--
BEGIN QUESTION
name: q3
points: 2
manual: true
-->
<!-- EXPORT TO PDF -->

In [21]:
# Hint: use sns.distplot(..., label = ...)
top_devices = ...
for device in top_devices:
    ...
plt.title('Distributions of Tweet Sources Over Years')
plt.legend();

## Question 4


Is there a difference between Trump's tweet behavior across these devices? We will attempt to answer this question in our subsequent analysis.

First, we'll take a look at whether Trump's tweets from an Android device come at different times than his tweets from an iPhone. Note that Twitter gives us his tweets in the [UTC timezone](https://www.wikiwand.com/en/List_of_UTC_time_offsets) (notice the `+0000` in the first few tweets).

In [22]:
for tweet in all_tweets[:3]:
    print(tweet['created_at'])

We'll convert the tweet times to US Eastern Time, the timezone of New York and Washington D.C., since those are the places we would expect the most tweet activity from Trump.
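
For intuition, the same conversion can be reproduced on a single timestamp with the standard library. The timestamp below is made up but follows the `created_at` layout printed above; the fixed UTC-5 offset stands in for the `"EST"` zone, which does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical timestamp in the same layout as `created_at`.
created_at = "Wed Oct 12 14:00:55 +0000 2016"
utc_time = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

# "EST" is a fixed UTC-5 offset, so summer tweets are not shifted to daylight time.
est = timezone(timedelta(hours=-5), "EST")
est_time = utc_time.astimezone(est)
print(est_time.isoformat())  # 2016-10-12T09:00:55-05:00
```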

In [23]:
trump['est_time'] = (
    trump['time'].dt.tz_localize("UTC") # Set initial timezone to UTC
                 .dt.tz_convert("EST") # Convert to Eastern Time
)
trump.head()

### Question 4a

Add a column called `hour` to the `trump` table which contains the hour of the day as a floating point number computed by:

$$
\text{hour} + \frac{\text{minute}}{60} + \frac{\text{second}}{60^2}
$$

* **Hint:** See the cell above for an example of working with [dt accessors](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dt-accessors). E.g., use `dt.hour` to get the hour from the `trump['est_time']` series.
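
As a quick sanity check of the formula, the sketch below evaluates it for one hypothetical timestamp using plain `datetime` attributes. The helper name `fractional_hour` is made up for this example; in the notebook you would apply the same arithmetic to whole columns through the `dt` accessors.

```python
from datetime import datetime

def fractional_hour(moment):
    """Hour of day as a float: hour + minute/60 + second/60^2."""
    return moment.hour + moment.minute / 60 + moment.second / 60**2

# Hypothetical time: 14:30:36 is 14 + 0.5 + 0.01 hours.
example = fractional_hour(datetime(2016, 10, 12, 14, 30, 36))
print(round(example, 2))  # 14.51
```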

<!--
BEGIN QUESTION
name: q4a
points: 1
-->

In [24]:
trump['hour'] = ...

In [None]:
ok.grade("q4a");

### Question 4b

Use this data along with the seaborn `distplot` function to examine the distribution over hours of the day (in Eastern Time) of Trump's tweets from each of the 2 most commonly used devices. Your plot should look similar to the following:

<img src="images/device_hour4b.png" width="600px" />

<!--
BEGIN QUESTION
name: q4b
points: 2
manual: true
-->
<!-- EXPORT TO PDF -->

In [26]:
### make your plot here
...

### Question 4c

According to [this Verge article](https://www.theverge.com/2017/3/29/15103504/donald-trump-iphone-using-switched-android), Donald Trump switched from an Android to an iPhone sometime in March 2017.

Let's see if this information significantly changes our plot. Create a figure similar to your figure from question 4b, but this time, only use tweets that were tweeted before 2017. Your plot should look similar to the following:

<img src="images/device_hour4c.png" width="600px" />

<!--
BEGIN QUESTION
name: q4c
points: 2
manual: true
-->
<!-- EXPORT TO PDF -->

In [27]:
### make your plot here 
### remember that you can add a condition like so: trump[trump['year'] < ...]
...

### Question 4d

During the campaign, it was theorized that Donald Trump's tweets from Android devices were written by him personally, and the tweets from iPhones were from his staff. Does your figure give support to this theory? What kinds of additional analysis could help support or reject this claim?

<!--
BEGIN QUESTION
name: q4d
points: 1
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

---
# Part 3: Sentiment Analysis

It turns out that we can use the words in Trump's tweets to calculate a measure of the sentiment of the tweet. For example, the sentence "I love America!" has positive sentiment, whereas the sentence "I hate taxes!" has a negative sentiment. In addition, some words have stronger positive / negative sentiment than others: "I love America." is more positive than "I like America."

We will use the [VADER (Valence Aware Dictionary and sEntiment Reasoner)](https://github.com/cjhutto/vaderSentiment) lexicon to analyze the sentiment of Trump's tweets. VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, which makes it a good fit for our purposes.

The VADER lexicon gives the sentiment of individual words. Run the following cell to show the first few rows of the lexicon:

In [28]:
print(''.join(open("vader_lexicon.txt").readlines()[:10]))

## Question 5

As you can see, the lexicon contains emojis too! Each row contains a word and the *polarity* of that word, measuring how positive or negative the word is.

(How did they decide the polarities of these words? What are the other two columns in the lexicon? See the link above.)

### Question 5a

Read the lexicon into a DataFrame called `sent`. The index of the DataFrame should be the words in the lexicon. `sent` should have one column named `polarity`, storing the polarity of each word.

* **Hint:** The `pd.read_csv` function may help here. Note that `pd.read_csv` can read data in other, similar formats, especially if its parameter `sep` is set correctly. You might also look up which parameters will allow you to select specific columns and set an index column.

<!--
BEGIN QUESTION
name: q5a
points: 1
-->

In [29]:
sent = ...
sent.head()

In [None]:
ok.grade("q5a");

### Question 5b

Now, let's use this lexicon to calculate the overall sentiment for each of Trump's tweets. Here's the basic idea:

1. For each tweet, find the sentiment of each word.
2. Calculate the sentiment of each tweet by taking the sum of the sentiments of its words.
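
The two steps above can be pictured with a plain-Python sketch. The mini-lexicon and its polarity values are invented for illustration and are not taken from VADER; words that are missing from the lexicon simply contribute 0.

```python
# Hypothetical mini-lexicon; the polarity values are invented for illustration.
toy_lexicon = {"love": 3.0, "great": 2.5, "hate": -2.5}

def tweet_polarity(text, lexicon):
    """Sum the polarity of each word; words missing from the lexicon count as 0."""
    return sum(lexicon.get(word, 0.0) for word in text.split())

print(tweet_polarity("i love great america", toy_lexicon))  # 5.5
print(tweet_polarity("i hate taxes", toy_lexicon))          # -2.5
```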

First, let's lowercase the text in the tweets since the lexicon is also lowercase. Set the `text` column of the `trump` DataFrame to be the lowercased text of each tweet.

<!--
BEGIN QUESTION
name: q5b
points: 1
-->

In [34]:
...
trump.head()

In [None]:
ok.grade("q5b");

### Question 5c

Now, let's get rid of punctuation since it will cause us to fail to match words. Create a new column called `no_punc` in the `trump` DataFrame to be the lowercased text of each tweet with all punctuation replaced by a single space. We consider punctuation characters to be *any character that isn't a Unicode word character or a whitespace character* (consult the Python documentation on regexes for how to represent them).
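
To illustrate the definition, the sketch below applies one possible pattern to a made-up fragment with Python's `re` module. Both the sample text and the name `toy_punct_re` are assumptions for this example; check the pattern against the regex documentation before relying on it.

```python
import re

# Hypothetical tweet fragment with punctuation between words.
sample = "fake-news...sad!"

# Any single character that is neither a word character nor whitespace.
toy_punct_re = r"[^\w\s]"

# Each punctuation character becomes one space, so the words stay separated.
spaced = re.sub(toy_punct_re, " ", sample)
print(spaced.split())  # ['fake', 'news', 'sad']
```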

(Why don't we simply remove punctuation instead of replacing with a space? See if you can figure this out by looking at the tweet data.)

<!--
BEGIN QUESTION
name: q5c
points: 1
-->

In [36]:
# Save your regex in punct_re
punct_re = r''
trump['no_punc'] = ...

In [None]:
ok.grade("q5c");

### Question 5d

Now, let's convert the tweets into what's called a [*tidy format*](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) to make the sentiments easier to calculate. Use the `no_punc` column of `trump` to create a table called `tidy_format`. **The index of the table should be the IDs of the tweets, repeated _once for every word_ in the tweet**. It has two columns:

1. `num`: The location of the word in the tweet. For example, if the tweet was "i love america", then the location of the word "i" is 0, "love" is 1, and "america" is 2.
2. `word`: The individual words of each tweet.

The first few rows of our `tidy_format` table look like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>num</th>
      <th>word</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>894661651760377856</th>
      <td>0</td>
      <td>i</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>1</td>
      <td>think</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>2</td>
      <td>senator</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>3</td>
      <td>blumenthal</td>
    </tr>
    <tr>
      <th>894661651760377856</th>
      <td>4</td>
      <td>should</td>
    </tr>
  </tbody>
</table>

**Note that your DataFrame may look different from the one above.** However, you can double check that your tweet with ID `894661651760377856` has the same rows as ours. Our tests don't check whether your table looks exactly like ours.

As usual, try to **avoid using** any `for` loops. Our solution uses a chain of 5 methods on the `trump` DataFrame, albeit using some rather advanced Pandas hacking.

* **Hint 1:** Try looking at the `expand` argument to pandas' `str.split`.

* **Hint 2:** Try looking at the `stack()` method.

* **Hint 3:** Try looking at the `level` parameter of the `reset_index` method.

* **Hint 4:** You might need to `rename(columns={...})`.
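
To make the target shape concrete, the plain-Python sketch below builds the same kind of (ID, `num`, `word`) rows for two made-up tweets. It uses a comprehension instead of the pandas method chain the question asks for, so treat it as a picture of the output and not as a solution.

```python
# Hypothetical tweets keyed by made-up IDs.
toy_tweets = {101: "i love america", 102: "so sad"}

# One row per word: (tweet ID, position of the word, the word itself).
tidy_rows = [
    (tweet_id, num, word)
    for tweet_id, text in toy_tweets.items()
    for num, word in enumerate(text.split())
]
for row in tidy_rows:
    print(row)
```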

<!--
BEGIN QUESTION
name: q5d
points: 2
-->

In [47]:
tidy_format = ...

In [None]:
ok.grade("q5d");

### Question 5e

Now that we have this table in the tidy format, it becomes much easier to find the sentiment of each tweet: we can join the table with the lexicon table. 

Add a `polarity` column to the `trump` table.  The `polarity` column should contain **the sum of the sentiment polarity** of each word in the text of the tweet.

**Hints:** 
* You will need to `merge` the `tidy_format` and `sent` tables and group the final answer on the `index`.
* If certain words are not found in the `sent` table, set their polarities to 0 (use `fillna()`).

<!--
BEGIN QUESTION
name: q5e
points: 2
-->

In [50]:
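trump['polarity'] = (
    tidy_format
    .merge(sent, how='left', left_on='word', right_index=True)
    .reset_index()                 # tweet IDs move into a column named 'index' (assumes an unnamed index)
    .fillna({'polarity': 0})       # words missing from the lexicon contribute 0
    .groupby('index')['polarity']
    .sum()                         # one summed polarity per tweet ID
)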
trump[['text', 'polarity']].head()

In [None]:
ok.grade("q5e");

Now we have a measure of the sentiment of each of his tweets! Note that this calculation is rather basic; you can read over the VADER readme to understand a more robust sentiment analysis.

Now, run the cells below to see the most positive and most negative tweets from Trump in your dataset:

In [57]:
print('Most negative tweets:')
for t in trump.sort_values('polarity').head()['text']:
    print('\n  ', t)

In [58]:
print('Most positive tweets:')
for t in trump.sort_values('polarity', ascending=False).head()['text']:
    print('\n  ', t)

## Question 6

Now, let's try looking at the distributions of sentiments for tweets containing certain keywords.

### Question 6a

In the cell below, create a single plot showing both the distribution of tweet sentiments for tweets containing `nytimes`, as well as the distribution of tweet sentiments for tweets containing `fox`.

**Hint**: You can use `str.contains()`.

<!--
BEGIN QUESTION
name: q6a
points: 1
manual: true
-->
<!-- EXPORT TO PDF -->

In [59]:
...
plt.title('Distributions of Tweet Polarities (nytimes vs. fox)')
plt.legend();

### Question 6b
Comment on what you observe in the plot above. Can you find other pairs of keywords that lead to interesting plots? (If you modify your code in 6a, remember to change the words back to `nytimes` and `fox` before submitting for grading).

<!--
BEGIN QUESTION
name: q6b
points: 1
-->

*Write your answer here, replacing this text.*

# Part 4: Movie Recommendation System

## Question 7

Your TA, Pushkar, has only recently started watching movies. While chatting with you, he asks you to recommend some good movies to add to his queue. Normally, you would have told him the movies that you liked, but since you are taking INT15, you decide to do this in a data-science fashion. You create your own movie recommendation system that Pushkar can later use to find movies he might like.


### Question 7a: Data

You decide to use the IMDB dataset [https://www.imdb.com/interfaces](https://www.imdb.com/interfaces/), which contains the names of the movies along with their respective ratings. 

Let's import the top 1000 movies with their plots stored in the files `top_1000_movies.csv` and `plots.csv`.

<!--
BEGIN QUESTION
name: q7a
points: 1
manual: false
-->

In [60]:
all_movies = ...
plots = ...

In [None]:
ok.grade("q7a");

Let's preview the top entries of both datasets.

In [64]:
all_movies.head()

In [65]:
plots.head()

Let's verify that we do not have any TV series, and if that is true, remove (`drop`) the "titleType" and "endYear" columns. 

In [66]:
all_movies[all_movies.titleType != "movie"] # should return an empty dataframe if we have only movies

In [67]:
# Remove the uninformative columns mentioned above
movies = ...
movies.head()

Check how many movies have a primary title that differs from their original title.

In [68]:
diff_titles = ...
len(diff_titles)

### Question 7b: Merge the datasets
As you can see, both dataframes have a common ID (although the column names do not match). Let's join the two dataframes to combine the movies with their plots. Use the `df.merge` method in Pandas.

<!--
BEGIN QUESTION
name: q7b
-->

In [69]:
plots.rename(columns={'Unnamed: 0':'tconst'}, inplace=True)
# Write your code to merge the two dataframes
movies = ...
movies.head(3)

In [None]:
ok.grade("q7b");

### Question 7c: Text preprocessing 

In [72]:
import re
import string
from nltk.corpus import stopwords
from nltk import word_tokenize

stop_words = set(stopwords.words('english'))
punct = string.punctuation

Let's preprocess the plot of each movie by:

1. Converting all words to lower case

2. Removing numbers and punctuation

3. Tokenizing
    
4. Removing stop words 
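
The four steps above can be sketched on a single made-up plot sentence with the standard library. The tiny stop-word set and the helper name `preprocess` are assumptions for illustration; the assignment itself uses NLTK's English stop words and `word_tokenize`, for which the whitespace split here is only a stand-in.

```python
import re
import string

# Hypothetical, tiny stop-word list; the assignment uses NLTK's English list instead.
TOY_STOP_WORDS = {"a", "the", "in", "of", "and", "his"}

def preprocess(plot):
    text = plot.lower().strip()                                          # 1. lower case
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)  # 2. punctuation
    text = re.sub(r"\d+", " ", text)                                     # 2. numbers
    tokens = text.split()                                                # 3. tokenize (stand-in)
    return [t for t in tokens if t not in TOY_STOP_WORDS]                # 4. stop words

keywords = preprocess("In 1985, a Gotham vigilante battles the Joker!")
print(keywords)  # ['gotham', 'vigilante', 'battles', 'joker']
```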

<!--
BEGIN QUESTION
name: q7c
-->

In [73]:
movies['key_words'] = "" # add a new column
movies['key_words'] = np.nan

# Convert to lowercase and remove leading/trailing whitespace
movies['key_words'] = (movies['plots']
 .str.lower()
 .str.strip()
)

# Remove Punctuation
punct_regex = ...
movies['key_words'] = movies['key_words'].str.replace(punct_regex, ' ', regex=True)

# Remove Numbers
num_regex = ...
...

# Extract the keywords (exclude stop words)
movies['key_words'] = (movies['key_words']
                       .apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words])))

# Tokenize the plots using word_tokenize  (use .apply like above)
...

movies.head()

In [None]:
ok.grade("q7c");

We now have a set of keywords for each of the movies. We can use these keywords to create a dictionary and use it to find movies with similar keywords.

### Question 7d

To analyze the words in a corpus, we associate each word with a number by creating a dictionary. The dictionary can be created using the `Dictionary` class from `gensim`.
 
<!--
BEGIN QUESTION
name: q7d
-->

In [77]:
# Processing Keywords
processed_keywords=[]
for keywords in movies['key_words']:
    processed_keywords.append(keywords)
    

from gensim.corpora.dictionary import Dictionary
# Write your code here
dictionary = ...

In [None]:
ok.grade("q7d");

We can now visualize the frequency distribution of the most frequent words.

In [79]:
from nltk import FreqDist

In [80]:
### concatenate all keywords from all movies together
all_movie_words=[]
for word in processed_keywords:
    all_movie_words.extend(word) 

In [81]:
def plot_freq_words(words, terms = 10):

    fdist = FreqDist(words)
    words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})

    # selecting top most frequent words
    d = words_df.nlargest(columns="count", n = terms) 
    plt.figure(figsize=(20,5))
    ax = sns.barplot(data=d, x= "word", y = "count")
    ax.set(ylabel = 'Count')
    plt.show()

In [82]:
plot_freq_words(all_movie_words)

As you can see, the figure shows the top words and their frequencies in the corpus. We can do the same for the keywords of each individual movie.

In [83]:
plot_freq_words(processed_keywords[70])  # get the frequency distribution for the words from a specific movie

Can you guess which movie we selected above just by looking at the keywords?

### Question 7e

What we plotted above is known as a _bag of words_.
In `gensim`, a bag of words is a list of tuples in which the first element is the ID of the word in the dictionary and the second element is its frequency in the document.

Go through all keywords in `processed_keywords` and convert each list to its bag of words using the dictionary's `doc2bow` method.
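
To show the data structure without depending on `gensim`, here is a plain-Python stand-in. The names `token2id` and `to_bow` are made up for this sketch, and the IDs it assigns need not match the ones `gensim` would choose; only the shape of the result, sorted (ID, count) pairs per document, is the point.

```python
from collections import Counter

# Hypothetical tokenized plots.
toy_docs = [["hero", "city", "hero"], ["city", "villain"]]

# Assign each distinct token an integer ID in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token ID, count) pairs for the tokens of one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)    # {'hero': 0, 'city': 1, 'villain': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```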
 
<!--
BEGIN QUESTION
name: q7e
-->

In [84]:
corpus = ...

In [None]:
ok.grade("q7e");

### Question 7f: TF-IDF and Matrix similarity

In [87]:
from gensim.models.tfidfmodel import TfidfModel
from gensim.similarities import MatrixSimilarity

We can now create a `TfidfModel` from `gensim`. The tf-idf model automatically assigns a weight to each keyword in the corpus. Based on these weights, we will calculate the similarities between movies using the `MatrixSimilarity` class from `gensim`.
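
For intuition about what these two steps compute, the sketch below implements one common tf-idf variant (count times log2 of N/df, scaled to unit length) and cosine similarity in plain Python on a made-up corpus. It is an illustrative stand-in, not `gensim`'s implementation, whose weighting may differ in detail.

```python
import math

# Hypothetical bag-of-words corpus: each document is a list of (term ID, count).
toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 1)], [(2, 3)]]
num_docs = len(toy_corpus)

# Document frequency: in how many documents each term appears.
doc_freq = {}
for bow in toy_corpus:
    for term_id, _ in bow:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

def tfidf_vector(bow):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = {t: count * math.log2(num_docs / doc_freq[t]) for t, count in bow}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

vectors = [tfidf_vector(bow) for bow in toy_corpus]
print(round(cosine(vectors[0], vectors[0]), 3))  # 1.0
print(round(cosine(vectors[0], vectors[2]), 3))  # 0.0
```

Documents that share rarer terms score higher, and documents with no terms in common score 0.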
<!--
BEGIN QUESTION
name: q7f
-->

In [88]:
# Create the tf-idf model for the corpus created above 
tfidf = ...

# Create the similarity data structure. 
# This is the most important part where we get the similarities between the movies.
# Hint: use the length of the dictionary as the num_features
sims = ...

In [None]:
ok.grade("q7f");

### Question 7g: Putting It All Together (Movie Recommendation)
Finally, you have come to the point where you can put together a movie recommendation system. 

Pushkar has recently heard about _Batman_ and asks you what movies he should watch that are similar to it. You fire up your movie recommender and give him the best set of suggestions. 

The recommender would go through the following set of steps:

1. Create the bag of words for the movie 
2. Calculate the tf-idf of the queried movie 
3. Calculate the similarity between the queried movie and every movie in the similarity matrix
4. See which movies are the most similar to the given movie 
5. Sort the current movies by the similarity score
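
The last two steps amount to ranking titles by score and skipping the query itself. A minimal sketch with invented titles and scores (the helper name `top_matches` is hypothetical) might look like this; note that dropping the first entry assumes the queried movie is its own best match.

```python
# Hypothetical similarity scores between a query movie and every title.
toy_scores = {"Query Movie": 1.0, "Movie A": 0.42, "Movie B": 0.07, "Movie C": 0.31}

def top_matches(scores, number_of_hits=2):
    """Sort titles by score, highest first, and drop the first entry (the query itself)."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[1:number_of_hits + 1]

print(top_matches(toy_scores))  # [('Movie A', 0.42), ('Movie C', 0.31)]
```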

<!--
BEGIN QUESTION
name: q7g
-->

In [92]:
def movie_recommendation(movie_title, dictionary, number_of_hits=10):
    top_words = 5
    # We will first start by  getting all the keywords related to the movies
    movie = movies.loc[movies.primaryTitle==movie_title] # get the movie row
    keywords = movie['key_words']# Get all the keywords
    doc=[]
    for word in keywords:
        doc.extend(word)
    
    # Convert the doc into its equivalent bag of words
    query_doc_bow = ...
    
    # convert the bag of words to its tf-idf representation: a list of tuples
    # of (term ID, tf-idf weight) for the queried movie
    query_doc_tfidf = tfidf[query_doc_bow]
    
    # get the array of similarity values between our movie and every other movie. 
    # To do this, we pass our list of tf-idf tuples to sims.
    similarity_array = sims[query_doc_tfidf] 
    # the length is the number of movies we have. 

    similarity_series = pd.Series(similarity_array.tolist(), index=movies.primaryTitle.values) #Convert to a Series
    top_hits = similarity_series.sort_values(ascending=False)[1:number_of_hits+1] 
    #get the top matching results, i.e. most similar movies; 
    # start from index 1 because every movie is most similar to itself

    #print the words with the highest tf-idf values for the provided movie:
    sorted_tfidf_weights = sorted(tfidf[corpus[movie.index.values.tolist()[0]]], key=lambda w: w[1], reverse=True)
    print('Top %s words associated with this movie by tf-idf are: ' % top_words)
    for term_id, weight in sorted_tfidf_weights[:top_words]:
        print(" '%s' (tf-idf score = %.3f)" %(dictionary.get(term_id), weight))
    print("\n")
    
    # Print the top matching movies
    print("Top %s most similar movies for movie %s are:" %(number_of_hits, movie_title))
    top_movies=[]
    for idx, (movie,score) in enumerate(zip(top_hits.index, top_hits)):
        print("%d %s (similarity score = %.3f)" %(idx+1, movie, score))
        top_movies.append(movie)
    return top_movies

top_10_movies = movie_recommendation("Batman", dictionary)


In [None]:
ok.grade("q7g");

In [95]:
terminator = ...

In [96]:
inception = ...

**Looking at the scores for the movies, how do the top recommendations for the movie _Inception_ differ from those for the movie _The Terminator_?** 
* If you have seen any of the movies or looked at the recommendations for other movies, comment on how well the system is able to recommend similar movies. 
* What else could we do to improve our movie recommendations?

<!--
BEGIN QUESTION
name: q7h
points: 2
manual: true
-->
<!-- EXPORT TO PDF -->

*Write your answer here, replacing this text.*

Have fun exploring new movies whose plot keywords are similar to those of other movies.

# Submit
Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output.
**Please save before submitting!**

<!-- EXPECT 6 EXPORTED QUESTIONS -->

In [None]:
# Save your notebook first, then run this cell to submit.
import jassign.to_pdf
jassign.to_pdf.generate_pdf('hw3.ipynb', 'hw3.pdf')
ok.submit()