# Project 6: IMDB

This project involves NLP, decision trees, bagging, boosting, and more!

---

## Load packages

You are likely going to need to install the `imdbpie` package:

    > pip install imdbpie

---

In [4]:
import os
import subprocess
import collections
import re
import csv
import json

import pandas as pd
import numpy as np
import scipy

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import psycopg2
import requests
from imdbpie import Imdb
import nltk

import urllib
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

---

## Part 1: Acquire the Data

You will connect to the IMDB API to query for movies. 

See here for documentation on how to use the package:

https://github.com/richardasaurus/imdb-pie

#### 1. Connect to the IMDB API

In [2]:
imdb = Imdb(anonymize=True)


#### 2. Query the top 250 rated movies in the database

In [3]:
# Query the top 250 movies
# The result is a list of dicts, one per movie
# Dumping it to a JSON string makes it easy to load into a dataframe

top250 = json.dumps(imdb.top_250())


#### 3. Make a dataframe from the movie data

Keep the fields:

    num_votes
    rating
    tconst
    title
    year
    
And discard the rest

In [4]:
# read_json can load the JSON dump from above
movies250 = pd.read_json(top250)

# Subset only desired columns
movies = movies250[['num_votes', 'rating', 'tconst', 'title', 'year']]
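The JSON round trip is one way to get a dataframe; assuming the API result is a plain list of dicts, it can also be passed to `pd.DataFrame` directly. The sketch below uses made-up records (the `tconst` values, titles and the extra `type` field are placeholders, not real API output) to show the column subset.

```python
import pandas as pd

# Made-up records shaped like the entries the notebook loads above;
# every value here is a placeholder, not real IMDB data.
records = [
    {'tconst': 'tt0000001', 'title': 'Example A', 'year': 1994,
     'rating': 9.2, 'num_votes': 1000, 'type': 'feature'},
    {'tconst': 'tt0000002', 'title': 'Example B', 'year': 1972,
     'rating': 9.1, 'num_votes': 900, 'type': 'feature'},
]

# Fields the assignment asks to keep; everything else is discarded
keep = ['num_votes', 'rating', 'tconst', 'title', 'year']
demo_movies = pd.DataFrame(records)[keep]
print(demo_movies)
```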

#### 3. Select only the top 100 movies

In [5]:
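# .ix is deprecated, and its label slice 0:100 is inclusive (101 rows);
# take the first 100 rows by position instead
movies = movies.iloc[:100].copy()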

In [None]:
movies.tail()

#### 4. Get the genres and runtime for each movie and add them to the dataframe


There can be multiple genres per movie, so this will need some finessing.

In [7]:
# Create Empty lists to store genres/runtime
# movies_genres = []
# movies_runtime = []    


# for tconst in movies['tconst']: # iterate through the movie ids
#     title = imdb.get_title_by_id(tconst) # Get title info of id
#     movies_genres.append(title.genres)   # Append genre of id to list
#     movies_runtime.append(title.runtime) # Append runtime of id to list

In [8]:
# movies_genres

[[u'Crime', u'Drama'],
 [u'Crime', u'Drama'],
 [u'Crime', u'Drama'],
 [u'Action', u'Crime', u'Thriller'],
 [u'Biography', u'Drama', u'History'],
 [u'Crime', u'Drama'],
 [u'Crime', u'Drama'],
 [u'Adventure', u'Drama', u'Fantasy'],
 [u'Western'],
 [u'Drama'],
 [u'Adventure', u'Drama', u'Fantasy'],
 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi'],
 [u'Drama', u'Romance'],
 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller'],
 [u'Action', u'Adventure', u'Drama', u'Fantasy'],
 [u'Drama'],
 [u'Biography', u'Crime', u'Drama'],
 [u'Action', u'Sci-Fi'],
 [u'Action', u'Adventure', u'Drama'],
 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi'],
 [u'Crime', u'Drama'],
 [u'Crime', u'Drama', u'Mystery', u'Thriller'],
 [u'Crime', u'Drama', u'Thriller'],
 [u'Drama', u'Family', u'Fantasy', u'Romance'],
 [u'Crime', u'Drama', u'Mystery', u'Thriller'],
 [u'Comedy', u'Drama', u'Romance', u'War'],
 [u'Crime', u'Drama', u'Thriller'],
 [u'Western'],
 [u'Animation', u'Adventure', u'Family', u'Fantasy'],
 [u'Action'

In [10]:
movies['genres'] = movies_genres
movies['runtime'] = movies_runtime
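Because each movie carries a list of genres, the `genres` column cannot be used as a model feature as-is. One possible way to handle it, sketched here on a small hypothetical stand-in for `movies`, is to join each list into a delimited string and expand it into indicator columns with `str.get_dummies`.

```python
import pandas as pd

# Hypothetical stand-in for the `movies` dataframe built above
demo = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0068646', 'tt0060196'],
    'genres': [['Crime', 'Drama'], ['Crime', 'Drama'], ['Western']],
})

# Join each genre list into 'Crime|Drama', then expand to one 0/1 column per genre
genre_dummies = demo['genres'].str.join('|').str.get_dummies(sep='|')

# Replace the list column with the indicator columns
demo_encoded = pd.concat([demo.drop('genres', axis=1), genre_dummies], axis=1)
print(demo_encoded)
```

The indicator columns can then be written to csv and used directly as predictors later in the project.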

#### 4. Write the Results to a csv

In [13]:
# movies.to_csv('movies.csv',     # Filepath
#               encoding='utf-8') # use utf-8 encoding

---

## Part 2: Wrangle the text data

#### 1. Scrape the reviews for the top 100 movies

*Hint*: Use a loop to scrape the pages one after another

In [1]:
import requests
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

url = 'http://www.imdb.com/title/%s/' % 'tt0111161' # Set URL
response = requests.get(url)                        # Get URL Response
HTML = response.text                                # Obtain text from response

In [43]:
# Select Text of Review
Selector(text=HTML).xpath('//span/div[2]/p/text()').extract() 

[u" One of the finest films made in recent years. It's a poignant story about hope. Hope gets me. That's what makes a film like this more than a movie. It tells a lesson about life. Those are the films people talk about 50 or even 100 years from you. It's also a story for freedom. Freedom from isolation,  from rule, from bigotry and hate. Freeman and Robbins are majestic in their performances. Each learns from the other. Their relationship is strong and you feel that from the first moment they make contact with one another. There is also a wonderful performance from legend James Whitmore as Brooks.",
 u'He shines when it is his time to go back into the world, only to find that the world grew up so fast he never even got a chance to blink. Stephen King\'s story is brought to the screen with great elegance and excitement. It is an extraordinary motion that people "will" be talking about in  50 or 100 years.  ']

In [45]:
# Select Review Score
Selector(text=HTML).xpath("//div[@class='ratingValue']/strong/span/text()").extract()

[u'9.3']

#### 2. Extract the reviews and the rating per review for each movie

*Note*: "soup" from BeautifulSoup is the html returned from all 25 pages. You'll need to either address each page individually or break them down by elements.
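The `start` query parameter in the review URL used in this notebook suggests offset-based paging. Assuming 10 reviews per page and 25 pages per movie (both are assumptions, not verified against the live site), the per-page URLs could be generated as below. `review_page_urls` is a hypothetical helper, not part of any library.

```python
def review_page_urls(tconst, n_pages=25, per_page=10):
    """Build one review URL per page for a movie id.

    n_pages and per_page are assumed values; adjust them to what the site serves.
    """
    base = 'http://www.imdb.com/title/%s/reviews?start=%d'
    return [base % (tconst, page * per_page) for page in range(n_pages)]


urls = review_page_urls('tt0111161')
print(urls[0])
print(urls[-1])
```

Each URL can then be fetched and parsed individually inside the scraping loop.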

In [213]:
top250= pd.read_csv('movies.csv')

In [297]:
# Review scores are stored as .gif image files
# The file name corresponds to a score out of 100

# url = 'http://www.imdb.com/title/%s/reviews?start=0' % 'tt0111161' # Set URL
# response = requests.get(url)                        # Get URL Response
# HTML = response.text                                # Obtain text from response

review = []
rank = []
tconsts = []
for tconst in top250['tconst'].unique():
    url = 'http://www.imdb.com/title/%s/reviews?start=0' % tconst
    response = requests.get(url)
    HTML = response.text
    page = Selector(text=HTML)  # parse each page once instead of once per XPath query
    for i in range(1, 11):      # first 10 reviews on the page
        text = page.xpath("//div[@id='tn15content']/p[%s]/text()" % i).extract()
        tconsts.append(tconst)
        # Guard against pages with fewer than 10 reviews
        review.append(text[0].encode('utf-8') if text else '')
        rank.append(page.xpath(
            "//div[@id='tn15']/div[@id='tn15main']/div[@id='tn15content']/div[%s]/img/@src" % i).extract())

In [278]:
reviews = pd.DataFrame(
{'IMDB_Identifier' : tconsts,
'Ranking' : rank,
'Review' : review})

In [279]:
Selector(text=HTML).xpath('//div[1]/img/@src').extract()[0]

u'http://i.media-imdb.com/images/showtimes/100.gif'

In [280]:
reviews.head()

Unnamed: 0,IMDB_Identifier,Ranking,Review
0,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\nWhy do I want to write the 234th comment on ...
1,tt0111161,[http://i.media-imdb.com/images/SF8104186305d9...,"\n\nCan Hollywood, usually creating things for..."
2,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\n\nI have never seen such an amazing film sin...
3,tt0111161,[],"\nIn its Oscar year, Shawshank Redemption (wri..."
4,tt0111161,[],\nThe reason I became a member of this databas...


#### 3. Remove the non AlphaNumeric characters from reviews

In [281]:
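# Cast reviews to plain strings (this step alone does not strip any characters).
# The deprecated convert_objects(convert_numeric=True) call is dropped: it left
# the review text unchanged and only raised a warning.
reviews['Review'] = reviews['Review'].astype(str)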


In [282]:
reviews

Unnamed: 0,IMDB_Identifier,Ranking,Review
0,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\nWhy do I want to write the 234th comment on ...
1,tt0111161,[http://i.media-imdb.com/images/SF8104186305d9...,"\n\nCan Hollywood, usually creating things for..."
2,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\n\nI have never seen such an amazing film sin...
3,tt0111161,[],"\nIn its Oscar year, Shawshank Redemption (wri..."
4,tt0111161,[],\nThe reason I became a member of this databas...
5,tt0111161,[],\n\nI believe that this film is the best story...
6,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\n\nOne of my all time favorites. Shawshank Re...
7,tt0111161,[],\n\nOne of the finest films made in recent yea...
8,tt0111161,[],\nMisery and Stand By Me were the best adaptat...
9,tt0111161,[],\n\nThe Shawshank Redemption is without a doub...
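The reviews shown above still contain punctuation and leading newlines. A minimal sketch of one way to strip non-alphanumeric characters follows; `strip_non_alphanumeric` is a hypothetical helper, the sample strings are shortened stand-ins for real reviews, and keeping only ASCII letters and digits is an assumption about what "alphanumeric" should mean here.

```python
import re

import pandas as pd


def strip_non_alphanumeric(text):
    """Replace anything that is not an ASCII letter, digit or whitespace,
    then collapse runs of whitespace into single spaces."""
    cleaned = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    return re.sub(r'\s+', ' ', cleaned).strip()


# Shortened stand-ins for entries of reviews['Review']
demo_reviews = pd.Series([
    "\nWhy do I want to write the 234th comment?",
    "\n\nCan Hollywood, usually...",
])
demo_clean = demo_reviews.astype(str).apply(strip_non_alphanumeric)
print(demo_clean.tolist())
```

Applied to the notebook's data, the same call would be `reviews['Review'].astype(str).apply(strip_non_alphanumeric)`.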



#### 4. Calculate the top 200 ngrams from the user reviews

Use the `TfidfVectorizer` in sklearn.

Recommended parameters:

    ngram_range = (1, 2)
    stop_words = 'english'
    binary = False
    max_features = 200

In [284]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [285]:
# Join all reviews into one string, separated by spaces so that
# words at review boundaries do not run together
review_corpus = " ".join(reviews.Review)

In [286]:
tfidf  = TfidfVectorizer( 
                ngram_range = (1, 2),
                stop_words = 'english',
                binary = False,
                max_features = 200)

In [287]:
tfidf.fit_transform(reviews.Review)

<1010x200 sparse matrix of type '<type 'numpy.float64'>'
	with 13262 stored elements in Compressed Sparse Row format>

In [288]:
# Sort features by inverse document frequency, highest first
# (a high idf means the term appears in few reviews)
features = tfidf.get_feature_names()
indices = np.argsort(tfidf.idf_)[::-1]
top_n = 200
top_features = [features[i] for i in indices[:top_n]]
print(pd.DataFrame(top_features))

              0
0          book
1         seven
2         fight
3         space
4         leone
5         nolan
6       michael
7       western
8         crime
9     tarantino
10    godfather
11    star wars
12        lives
13    hitchcock
14         city
15         wars
16       family
17          guy
18       robert
19         jack
20       moving
21          men
22         epic
23      theater
24       ending
25     remember
26     released
27      trilogy
28        drama
29        close
..          ...
170   character
171      acting
172        know
173         don
174       world
175        does
176         say
177         man
178        make
179       watch
180          ve
181       years
182       think
183  characters
184      really
185         way
186      people
187        life
188        good
189        seen
190       films
191       story
192      movies
193       great
194        best
195        like
196        just
197        time
198       movie
199        film

[200 rows x 1 columns]
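Sorting by `idf_` puts the rarest terms first. If "top" is meant as "most heavily weighted across the reviews", an alternative is to rank terms by their mean tf-idf weight. The sketch below does this on three made-up sentences; the ranking criterion is an assumption about the intent, and the terms are read from `vocabulary_` so the code does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the review texts
demo_docs = [
    "a great film with a great story",
    "the story of the film is great",
    "boring film",
]

demo_tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                             binary=False, max_features=200)
weights = demo_tfidf.fit_transform(demo_docs)

# vocabulary_ maps term -> column index; sort terms into column order
terms = sorted(demo_tfidf.vocabulary_, key=demo_tfidf.vocabulary_.get)

# Mean tf-idf weight of each term over all documents
mean_weight = np.asarray(weights.mean(axis=0)).ravel()
ranked = sorted(zip(terms, mean_weight), key=lambda pair: pair[1], reverse=True)
print(ranked[:5])
```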

#### 5. Merge the user reviews and ratings

In [289]:
int(reviews.Ranking[0][0].replace('http://i.media-imdb.com/images/showtimes/', 
                              '').replace('.gif',''))

100

In [290]:
reviews.apply(lambda x: x=np.NaN if reviews.Ranking)

ValueError: Length of values does not match length of index

In [291]:
reviews

Unnamed: 0,IMDB_Identifier,Ranking,Review
0,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\nWhy do I want to write the 234th comment on ...
1,tt0111161,[http://i.media-imdb.com/images/SF8104186305d9...,"\n\nCan Hollywood, usually creating things for..."
2,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\n\nI have never seen such an amazing film sin...
3,tt0111161,[],"\nIn its Oscar year, Shawshank Redemption (wri..."
4,tt0111161,[],\nThe reason I became a member of this databas...
5,tt0111161,[],\n\nI believe that this film is the best story...
6,tt0111161,[http://i.media-imdb.com/images/showtimes/100....,\n\nOne of my all time favorites. Shawshank Re...
7,tt0111161,[],\n\nOne of the finest films made in recent yea...
8,tt0111161,[],\nMisery and Stand By Me were the best adaptat...
9,tt0111161,[],\n\nThe Shawshank Redemption is without a doub...


In [295]:
reviews.Ranking.apply(lambda x: np.nan if len(x) == 0 else x)

0       [http://i.media-imdb.com/images/showtimes/100....
1       [http://i.media-imdb.com/images/SF8104186305d9...
2       [http://i.media-imdb.com/images/showtimes/100....
3                                                     NaN
4                                                     NaN
5                                                     NaN
6       [http://i.media-imdb.com/images/showtimes/100....
7                                                     NaN
8                                                     NaN
9                                                     NaN
10      [http://i.media-imdb.com/images/showtimes/100....
11      [http://i.media-imdb.com/images/SF8104186305d9...
12      [http://i.media-imdb.com/images/showtimes/100....
13                                                    NaN
14      [http://i.media-imdb.com/images/showtimes/100....
15                                                    NaN
16      [http://i.media-imdb.com/images/showtimes/100....
17            
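The two steps above (parsing the score out of the `.gif` file name and marking missing scores as NaN) can be combined into one function. This is a sketch under the assumption that only URLs of the form `.../showtimes/<score>.gif` carry a score; `score_from_ranking` is a hypothetical helper.

```python
import re

import numpy as np

# Matches e.g. '.../showtimes/100.gif' and captures the numeric score
SCORE_PATTERN = re.compile(r'/showtimes/(\d+)\.gif$')


def score_from_ranking(srcs):
    """Return the score found in a list of image URLs, or NaN if none matches."""
    for src in srcs:
        match = SCORE_PATTERN.search(src)
        if match:
            return int(match.group(1))
    return np.nan


print(score_from_ranking(['http://i.media-imdb.com/images/showtimes/100.gif']))
print(score_from_ranking([]))
```

On the notebook's data this could be applied as `reviews['Ranking'].apply(score_from_ranking)`, after which the numeric score sits next to each review text.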

#### 6. Save this merged dataframe as a csv

---

## Part 3: Combine Tables in PostgreSQL

#### 1. Import your two .csv data files into your PostgreSQL database as two different tables

For ease, we can call these table1 and table2

#### 2. Connect to database and query the joined set

#### 3. Join the two tables 

#### 4. Select the newly joined table and save two copies of it into dataframes
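The load, join and read-back steps can be sketched end to end. To keep the example runnable without a database server, it uses an in-memory SQLite database as a stand-in for PostgreSQL; with PostgreSQL the connection would come from `psycopg2` or SQLAlchemy instead. The review texts are placeholders, and joining on `tconst` = `IMDB_Identifier` is an assumption based on the column names used earlier.

```python
import sqlite3

import pandas as pd

# Small stand-ins for the two csv files
table1 = pd.DataFrame({
    'tconst': ['tt0111161', 'tt0068646'],
    'title': ['The Shawshank Redemption', 'The Godfather'],
})
table2 = pd.DataFrame({
    'IMDB_Identifier': ['tt0111161', 'tt0111161', 'tt0068646'],
    'Review': ['first review', 'second review', 'third review'],
})

# SQLite stands in for PostgreSQL here
conn = sqlite3.connect(':memory:')
table1.to_sql('table1', conn, index=False)
table2.to_sql('table2', conn, index=False)

query = """
SELECT t1.tconst, t1.title, t2.Review
FROM table1 AS t1
JOIN table2 AS t2 ON t1.tconst = t2.IMDB_Identifier
"""
joined = pd.read_sql_query(query, conn)
conn.close()

# Keep an untouched copy alongside the working copy
joined_backup = joined.copy()
print(joined)
```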

---

## Part 4: Parsing and Exploratory Data Analysis

#### 1. Rename any columns you think should be renamed for clarity

#### 2. Describe anything interesting or suspicious about your data (quality assurance)

#### 3. Make four visualizations of interest to you using the data

---

## Part 5: Decision Tree Classifiers and Regressors

#### 1. What is our target attribute? 

Choose a target variable for the decision tree regressor and the classifier. 

#### 2. Prepare the X and Y matrices and preprocess data as you see fit

#### 3. Build and cross-validate your decision tree classifier

#### 4. Gridsearch optimal parameters for your classifier. Does the performance improve?

#### 5. Build and cross-validate your decision tree regressor

#### 6. Gridsearch the optimal parameters for your regressor. Does performance improve?
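A minimal sketch of the cross-validate-then-gridsearch workflow for the regressor, run on synthetic data because the real predictor matrix is not built yet. The parameter grid is an illustrative assumption, not a recommended setting.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first column only
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

tree = DecisionTreeRegressor(random_state=0)

# Baseline: 5-fold cross-validated R^2 with default parameters
baseline_scores = cross_val_score(tree, X_demo, y_demo, cv=5)

# Grid search over an illustrative grid that includes the defaults
param_grid = {'max_depth': [2, 4, 6, None], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(tree, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(baseline_scores.mean(), search.best_score_, search.best_params_)
```

The classifier case follows the same pattern with `DecisionTreeClassifier` and a categorical target.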

---

## Part 6: Elastic Net


#### 1. Gridsearch optimal parameters for an ElasticNet using the regression target and predictors you used for the decision tree regressor.


#### 2. Is cross-validated performance better or worse than with the decision trees? 

#### 3. Explain why the elastic net may have performed best at that particular l1_ratio and alpha
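The elastic net search can be sketched the same way. The data is synthetic and the `alpha` and `l1_ratio` values are illustrative assumptions; in the project the predictors and target would be those used for the decision tree regressor.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic data: only the first two of six predictors matter
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# alpha sets the overall penalty strength; l1_ratio mixes L1 (Lasso) and L2 (Ridge)
enet_grid = {'alpha': [0.01, 0.1, 1.0], 'l1_ratio': [0.1, 0.5, 0.9]}
enet_search = GridSearchCV(ElasticNet(max_iter=10000), enet_grid, cv=5)
enet_search.fit(X_demo, y_demo)

print(enet_search.best_params_, enet_search.best_score_)
```

An `l1_ratio` near 1 behaves like a Lasso and tends to zero out weak predictors, while a value near 0 behaves like a Ridge and shrinks correlated predictors together.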

---

## Part 7: Bagging and Boosting: Random Forests, Extra Trees, and AdaBoost

#### 1. Load the random forest regressor, extra trees regressor, and adaboost regressor from sklearn

In [None]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor

#### 2. Gridsearch optimal parameters for the three different ensemble methods.

#### 3. Evaluate the performance of the two bagging and one boosting model. Which performs best?

#### 4. Extract the feature importances from the Random Forest regressor and make a DataFrame pairing variable names with their variable importances.

#### 5. Plot the ranked feature importances.
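A sketch of pairing variable names with random forest importances and ranking them, on synthetic data where only the hypothetical column `f0` drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic predictors with hypothetical column names
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(150, 4), columns=['f0', 'f1', 'f2', 'f3'])
y_demo = 5 * X_demo['f0'] + rng.normal(scale=0.1, size=150)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Pair each column name with its importance and rank from most to least important
importances = (
    pd.DataFrame({'feature': X_demo.columns,
                  'importance': forest.feature_importances_})
    .sort_values('importance', ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

For the plot, `importances.plot.barh(x='feature', y='importance')` is one option.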

#### 6.1 [BONUS] Gridsearch an optimal Lasso model and use it for variable selection (make a new predictor matrix with only the variables not zeroed out by the Lasso). 

#### 6.2 [BONUS] Gridsearch your best performing bagging/boosting model from above with the features retained after the Lasso. Does the score improve?

#### 7.1. [BONUS] Select a threshold for variable importance from your Random Forest regressor and use that to perform feature selection, creating a new subset predictor matrix.

#### 7.2 [BONUS] Using BaggingRegressor with a base estimator of your choice, test a model using the feature-selected dataset you made in 7.1

---

## [VERY BONUS] Part 8: PCA

#### 1. Perform a PCA on your predictor matrix

#### 2. Examine the variance explained and determine what components you want to keep based on them.

#### 3. Plot the cumulative variance explained by the ordered principal components.
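A sketch of computing the cumulative explained variance and picking the number of components, on synthetic data. Standardizing first and the 90% cutoff are both assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic predictors; column 1 is made nearly identical to column 0
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] = X_demo[:, 0] + 0.05 * rng.normal(size=100)

# Standardize so no column dominates because of its scale
pca = PCA().fit(StandardScaler().fit_transform(X_demo))

# Cumulative share of variance explained by the first 1, 2, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative, n_keep)
```

Passing `cumulative` to `plt.plot` gives the cumulative variance curve.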

#### 4. Gridsearch an elastic net using the principal components you selected as your predictors. Does this perform better than the elastic net you fit earlier?

#### 5. Gridsearch a bagging ensemble estimator that you fit before, this time using the principal components as predictors. Does this perform better or worse than the original? 

#### 6. Look at the loadings of the original predictor columns on the first 3 principal components. Is there any kind of intuitive meaning here?

Hint: you will probably want to sort by the absolute value of the loadings, and only look at the obviously important (larger) ones!

---

## [Extremely Bonus] Part 9: Clustering

![](https://snag.gy/jPSZ6U.jpg)

***Bonus Bonus:***
This extended bonus question asks you to attempt something we never really covered, based on the assumptions we learned during this week's clustering lesson(s).

#### 1. Import your favorite clustering module

#### 2. Encode categoricals

#### 3. Evaluate cluster metrics solely based on a range of K
If K-Means: SSE/inertia vs. silhouette (i.e. elbow), silhouette average, etc.
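For K-Means, the comparison over a range of K could look like the sketch below. It uses synthetic data with three well-separated groups; the range of K and the cluster layout are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic clusters
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 10], [20, 0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Record inertia (SSE) and average silhouette for each K
results = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    results.append((k, km.inertia_, silhouette_score(X_demo, km.labels_)))

# K with the highest average silhouette
best_k = max(results, key=lambda r: r[2])[0]
for k, inertia, silhouette in results:
    print(k, round(inertia, 2), round(silhouette, 3))
```

Plotting inertia against K gives the elbow curve, and the silhouette column offers a second opinion on the same range.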

#### 4. Look at your data based on the subset of your predicted clusters.
Assign the cluster predictions back to your dataframe in order to see them in context. This makes it easy to group by cluster and get a sense of the data that clumped together.

#### 5. Describe your findings based on the predicted clusters 
_How well did it do?  What's good or bad?  How would you improve this? Does any of it make sense?_