In [None]:
import nltk
import pandas as pd
import string
import numpy as np
import sklearn
import scipy
import omdb
import io, json
import requests
from sklearn.utils import shuffle
from bs4 import BeautifulSoup
import ast
import time

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In this tutorial, we will be using a Logistic Regression Model to predict whether a user will like a movie based on the plot summary. 

Table of Contents
<div id="toc"></div>

# Introduction

For our problem, we intend on dividing the dataset of movies into 2 parts - one that will be liked by the user, and one that will not. To do so, we will be using a very simplistic model of analysis on an Logistic Regression Model - the movie genre. Our goal is to predict the genre of a movie, given the plot summary, and use this to identify whether the movie is suitable for our user. But movies can be of several genres. The following is a list of genres from IMDB.

|                    Genres                   |
|----------|----------|------------|----------|
|Action    |Adventure |Animation   |Biography |
|Comedy    |Crime	  |Documentary |Drama     |
|Family    |Fantasy   |Film Noir   |History	  |
|Horror    |Music     |Musical     |Mystery   | 
|Romance   |Sci-Fi	  |Short       |Sport     |
|Superhero |Thriller  |War         |Western   |


So our first task is to list out the users favorite ones. For the sake of simplicity let us assume that the users genre preferences are as follows.

In [None]:
gen_prefs = ['Animation','Romance','Drama','Mystery']

In order to predict whether or not a movie will be iked by our user, we need to train our model. We will do so by collecting a large dataset of movies from IMDB, and gathering the info of these movies. This can be done using OMDb. 

# OMDb

We will be generating our dataset using the OMDb API. The OMDb API is a RESTful web service to obtain movie information, all content and images on the site are contributed and maintained by our users. You can find out more about the library on their website: http://www.omdbapi.com/

### Authentication

To use the OMDb API, however, we need to go through a few steps to set up an account with a private key and use that private key in this application. This can be done as follows:

1. On the OMDb page, select the API Key option in the top menu
2. Fill out the form to include a valid email ID. If you need more 1,000 requests (as for this program), you may need to become a patron first. A key will be sent to this email ID.
3. Store this key in a file called 'api_key.txt' in the same directory as this file. We will read from this file in order to gain access to the key, as shown below.

In [None]:
def get_key(filename):
    """
    Return the private key used for authentication of OMDb.

    Args:
        filename (string): the name of the file that stores the key 

    Returns:
        key (string): the private key
    """
    with open(filename, 'r') as f:
        return f.read().replace('\n','')

api_key = get_key('api_key.txt')

### Get Movie Information

Using the OMDb library we will now get the first 20,000 movies from IMDB. Each movie on IMDB has a unique IMDB serial ID. In order to gather the first 20,000 movies, we can use the seed Serial ID 'tt00000001' and incrementally gather the first 20,000 results. If the plot summary doesn't exist for that serial number or if that title hasn't been assigned Genre labels, we will skip that one and try again until we get a successful search.

As we read in a result, we will build a Pandas DataFrame to keep track of the data. We just need the imdbID, the movie plot summary and the associated movie genres. We will then write this dataframe into a csv file for easy access later. This way you always have access to the original dataset information.

In [None]:
def get_data(key, startID, numElems):
    """
    Uses the OMDb API to build the dataset. Incrementally goes through ImdbIDs (starting at 'tt0000001') to find 1000 movies that have valid
    Genre labels and Plot Summary.

    Args:
        key (string): Private Key required by the API.
        startID (int): The ImdbID to start with.
        numElems (int): The number of data points required.

    Returns:
        end_ID (int): The last imdbID that the search ended at.
        df (pd.Dataframe): Final Dataframe with imdbID, plot and genres columns.
    """
    count = 0
    imdbID = startID
    df = pd.DataFrame()
    while count<numElems:
        imdbIDStr = str(imdbID)
        response = requests.get('http://www.omdbapi.com/?apikey='+ str(key)+ '&plot=full&tomatoes=false&i=tt' + ('0'*
                    (7-len(imdbIDStr))) + imdbIDStr)
        try:
            data = response.json()
            if 'Genre' in data.keys():
                genres = data['Genre'].split(',')
                if data['Plot']!='N/A':
                    df = df.append({'imdbID':'tt'+ ("0"*(7-len(imdbIDStr))) + imdbIDStr,'plot':data['Plot'],'genres':tuple(genres)},
                                   ignore_index=True)
                    count+=1
        except:
            print('imdbStr', imdbIDStr)
            print('response', response)
        imdbID +=1 
    return imdbID, df
         

In [None]:
def write_df_to_csv(filename, df):
    """
    Writes the dataframe to a csv.

    Args:
        filename (string): The name of the csv file to be written
        df (pd.Dataframe): The dataframe that is to be written to the csv file
    """
    df.to_csv(filename, sep=',', encoding='utf-8')
    

When collecting our data, we need to try and collect as much data as possible, to train our model properly. In the cell below, we try to collect about 17000 data points. We split up the requests in 100-request batches with 10 second pauses in between to keep the request rate under than the maximum limit of the OMDb library. As each batch of requests is completed, we write the dataframe to end our csv file.

In [None]:
movie_df = pd.DataFrame(columns=['imdbID','plot','genres'])
startID = 1
new_beginning = True
for i in range(170):
    api_key = get_key('api_key.txt')
    time.sleep(10)
    startID, new_df = get_data(api_key, startID, 100)
    movie_df = movie_df.merge(new_df,how='outer')
    new_df['genres'] = new_df['genres'].apply(lambda x: list(x))
    new_df = new_df.set_index('imdbID')
    with open('training_data.csv', 'a') as f:
        if new_beginning:
            new_df.to_csv(f, header=True)
            new_beginning = False
        else:
            new_df.to_csv(f, header=False)
movie_df = movie_df.set_index('imdbID')
movie_df['genres'] = movie_df['genres'].apply(lambda x: list(x))


# Cleaning the Data

The next step is to clean up our data in a way that we can use to train our model. In other words, we need to modify the string passed in as the plot summary to convert the string into tokens that satisfy some criteria. Each token must satisfy the following criteria.

1. The tokens must all be in lower case.
2. The tokens should appear in the same order as in the raw text.
3. The tokens must be in their lemmatized form. If a word cannot be lemmatized (i.e, you get an exception), simply catch it and ignore it. These words will not appear in the token list.
4. The tokens must not contain any punctuations. Punctuations should be handled as follows: (a) Apostrophe of the form 's must be ignored. e.g., She's becomes she. (b) Other apostrophes should be omitted. e.g, don't becomes dont. (c) Words must be broken at the hyphen and other punctuations.

This is done to ensure that the pattern of tokens can map to a general meaning and thus a predictable result. But the above are the most common ways to make valid tokens. In addition to this, we will also add one more criterion. As the tokens are made from the plot summary of a movie, it is possible that the tokens contain some proper nouns. We do not want this to affect our prediction, so we will attempt to remove these.

The best way to do so is to use the NLTK libary, we can tag each token with what type of word it is. NLTK tags all proper nouns with one of 2 tags: NNP and NNPS. So we can filter out each of these tags while taking care of each of the other token filters. You can find more information on the other types of tags [here](https://www.nltk.org/_modules/nltk/tag.html#pos_tag).

Let us first begin with reading in the csv into a Pandas DataFrame.

In [None]:
def read_in_data(filename):
    """
    Reads in CSV file with data into a Pandas Dataframe

    Args:
        filename (string): The name of the csv file to be read in
        
    Returns:
        df (pd.Dataframe): The dataframe with the data that was read in
    """
    df = pd.read_csv(filename, sep=',', encoding='utf-8')
    df['genres'] = df['genres'].apply(lambda x: ast.literal_eval(x))
    return df
    
movie_df = read_in_data('training_data.csv')
print(movie_df.head())

The output should be look as follows:
    
```python
>>> print(movie_df.head())
     imdbID                        genres  \
0  tt0000001         [Documentary,  Short]   
1  tt0000002           [Animation,  Short]   
2  tt0000003  [Animation,  Comedy,  Short]   
3  tt0000005                       [Short]   
4  tt0000007               [Short,  Sport]   

                                                plot  
0  Performing on what looks like a small wooden s...  
1     Short film of 300 individually painted images.  
2  One night, Arlequin come to see his lover Colo...  
3  A stationary camera looks at a large anvil wit...  
4  James J. Corbett and Peter Courtney meet in a ...  
```

Now let's begin working with our data. Let's first write a function that tokenizes a string based on the required criteria.

In [None]:
def tokenize_string(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """
    Breaks the string down into its basic tokens by accounting for punctuation, capitalization, etc. and removing
    any proper nouns.

    Args:
        text (string): String that needs to be tokenized
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)

    Returns:
        tokens (string list): The list of all the final tokens
    """
    text = text.lower()
    text = text.replace("'s","")
    text = text.replace("'","")
    for i in string.punctuation:
        text = text.replace(i," ")
    tokens = nltk.tag.pos_tag(nltk.word_tokenize(text))
    result = []
    for (token,token_type) in tokens:
        if token_type!='NNS' and token_type!='NNPS':
            try:
                lemToken = lemmatizer.lemmatize(token)
                result.append(str(lemToken))
            except Exception as e:
                continue
    return result
        

In [None]:
tokens = tokenize_string("One night, Arlequin come to see his lover")
print(tokens)

We'll now apply all the above function to our dataframe in order to convert our plot summaries into tokens.

In [None]:
def apply_tokenizer_to_df(df):
    """
    Apply the tokenizer to each plot summary in the dataframe
    
    Args:
        df (pd.Dataframe): movies dataframe that stores the plot summaries

    Returns:
        df (pd.Dataframe): The dataframe with the tokenized plot summaries
    """
    df['plot_tokens'] = df['plot'].apply(tokenize_string)
    return df

In [None]:
movie_df = apply_tokenizer_to_df(movie_df)
print(movie_df.head())

The output should look as follows:

```python 
>>> print(movie_df.head())
   imdbID                        genres  \
0  tt0000001         [Documentary,  Short]   
1  tt0000002           [Animation,  Short]   
2  tt0000003  [Animation,  Comedy,  Short]   
3  tt0000005                       [Short]   
4  tt0000007               [Short,  Sport]   

                                                plot  \
0  Performing on what looks like a small wooden s...   
1     Short film of 300 individually painted images.   
2  One night, Arlequin come to see his lover Colo...   
3  A stationary camera looks at a large anvil wit...   
4  James J. Corbett and Peter Courtney meet in a ...   

                                         plot_tokens  
0  [performing, on, what, look, like, a, small, w...  
1      [short, film, of, 300, individually, painted]  
2  [one, night, arlequin, come, to, see, his, lov...  
3  [a, stationary, camera, look, at, a, large, an...  
4  [j, corbett, and, peter, courtney, meet, in, a...  
```

# Train the Classifier

The next thing that we will do is to train our Logistic Regression model to make predictions.
Logistic Regression comes up with a probability function that can give the chance for an input to belong to one of the classes. You can read more about the Logistic Regression model [here](https://machinelearningmastery.com/logistic-regression-for-machine-learning/).

### Create Features

For the purpose of the task at hand, we will be constructing a bag-of-words TF-IDF feature vector. While we have taken care of proper nouns in each plot summary, we also need to remove very common words (i.e. stopwords) as they add almost no information regarding similarity of two pieces of text.

Once this has been removed, we can create a sparse matrix of features for each tweet with the help of [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [None]:
def create_features(df):
    """ 
    Creates the feature matrix using the processed movie plot summary text
    
    Inputs:
        df (pd.DataFrame): The dataframe with the tokenized plot summaries
        
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
                                we need this to tranform test plot summaries in the same way as train plot summaries
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    stopwords=nltk.corpus.stopwords.words('english')
    tfidv = sklearn.feature_extraction.text.TfidfVectorizer(input = 'content', stop_words=stopwords, analyzer='word')
    df['modified_plot'] = df['plot_tokens'].apply(lambda x: " ".join(x))
    smatrix = tfidv.fit_transform(list(df['modified_plot']))
    return (tfidv, smatrix)


(tfidf, X) = create_features(movie_df)

### Add Labels

The next thing we will do is add labels to our dataset. Each label represents whether or not the user will like the movie. As mentioned earlier, we do this in a very simplistic manner. Earlier in this tutorial, we had the user select a list a genres that he/she liked, and named this list `gen_prefs`. We will now go through our dataframe and assign each row a label of `1` if the genre labels of that entry contain atleast of the genres from `gen_prefs` and `0` otherwise.

In [None]:
def create_labels(df):
    """ 
    Creates the class labels based genres
    
    Inputs:
        df (pd.DataFrame): The dataframe with the tokenized plot summaries with a column 'genres'
    
    Outputs:
        numpy.ndarray(int): dense binary numpy array of class labels
    """
    labels = []
    for genres in df['genres']:
        found = False
        for genre in genres:
            if genre in gen_prefs:
                labels.append(1)
                found = True
                break
        if found==False:
            labels.append(0)
    return labels


In [None]:
y = create_labels(movie_df)
print(len(y))
print(len(list(movie_df['genres'])))
# Should both be 17000

### Learn the data

Now that we have created our initial training data and labels, we need to feed this to our Logistic Regression model to train it.

In [None]:
def learn_classifier(X_train, y_train):
    """ learns a classifier from the input features and labels using the kernel function supplied
    Inputs:
        X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features_and_labels()
        y_train: numpy.ndarray(int): dense binary vector of class labels, output of create_features_and_labels()
        
    Outputs:
        sklearn.linear_model.LogisticRegression: classifier learnt from data
    """
    logreg = sklearn.linear_model.LogisticRegression()
    return logreg.fit(X_train, y_train)


In [None]:
classifier = learn_classifier(X, y)
print(classifier)

The output should look something like this:

 ```python
 >>> print(classifier)
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
  ```

# Test the Classifier

Now that we have trained our classifier, the next thing we want to do is to test our classifier and find it's accuracy.
Let us begin to do this by creating our test data. We can do so by using our earlier `get_data` function to find 3000 data points from the IMDbID of 57842 onwards (some arbitrary startID after 55,000 to avoid collision with training data). We will use this IMDbID as our starting point to avoid any collision with our training data.

In [None]:
test_df = pd.DataFrame(columns=['imdbID','plot','genres'])
startID = 57842
new_beginning = True
for i in range(30):
    api_key = get_key('api_key.txt')
    time.sleep(5)
    startID, new_df = get_data(api_key, startID, 100)
    test_df = test_df.merge(new_df,how='outer')
test_df = test_df.set_index('imdbID')
test_df['genres'] = test_df['genres'].apply(lambda x: list(x))

print(test_df.head())

The output should look similar to this:

```python
>>> print(test_df.head())
                                  genres  \
imdbID                                     
tt0057842                        [Drama]   
tt0057844                       [Comedy]   
tt0057846  [Adventure,  Drama,  History]   
tt0057851                      [Western]   
tt0057852                [Crime,  Drama]   

                                                        plot  
imdbID                                                        
tt0057842  THIS SPECIAL FRIENDSHIP tells of the tender re...  
tt0057844  "Doctor" Jayne Mansfield is in Italy to show a...  
tt0057846  In this first part of the Angélique cycle, set...  
tt0057851  In the Arizona Territory in 1879, Captain Jeff...  
tt0057852  The Creepiest, Crawliest and Deadliest Film Ev...  

```

### Classify the Test Results

Now let's classify our test data to predict whether or not the user will like the movie or not. 

In [None]:
def classify_movies(tfidf, classifier, test_data):
    """ 
    predicts whether the user will like the movie or not from raw movie plot text
    
    Inputs:
        tfidf (sklearn.feature_extraction.text.TfidfVectorizer): the TfidfVectorizer object used on training data
        classifier (sklearn.linear_model.LogisticRegression): classifier learnt
        test_data  (pd.DataFrame): tweets read from tweets_test.csv
        
    Outputs:
        numpy.ndarray(int): dense binary vector of class labels for unlabeled tweets
    """
    test_df = apply_tokenizer_to_df(test_data)
    test_df['modified_plot'] = test_df['plot_tokens'].apply(lambda x: " ".join(x))
    smatrix = tfidf.transform(test_df['modified_plot'])
    
    return classifier.predict(smatrix)

In [None]:
y_test_pred = classify_movies(tfidf, classifier, test_df)
print(y_test_pred)

### Evaluating Accuracy
Now we need to understand how well our classifier is performing. To do so, we will be using the accuracy measure. Accuracy is the ratio of the number of correct classifications to the total number of (correct or incorrect) classifications.
As we have all the data from IMDb for our test data, we can add labels to this (exactly as we did for our training data), to get the validation data or expected results. Then we can compare our predicted results to our validation data to get the accuracy as shown below.

In [None]:
def evaluate_classifier(classifier, y_pred, y_validation):
    """ evaluates a classifier based on a supplied validation data
    Inputs:
        classifier (sklearn.linear_model.LogisticRegression): classifer to evaluate
        y_pred (numpy.ndarray(int)): final result from predictions
        y_validation (numpy.ndarray(int)): expected outputs
    Outputs:
        double: accuracy of classifier on the validation data
    """
    cnt=0
    for i in range(y_pred.size):
        if y_pred[i]==y_validation[i]:
            cnt=cnt+1 
    return cnt/len(y_validation)


In [None]:
y_test_valid = create_labels(test_df)
accuracy = evaluate_classifier(classifier, y_test_pred, y_test_valid)
print(accuracy) #should get about 0.710666

# Ways to Improve Accuracy
While the accuracy of 0.710666 is not great, it can improved by cleaning the data more accurately. One common tactic is to remove the least common words, or words that will not help to identify the true meaning of the text. Just like stopwords (words that occur too frequently), the least frequent words also tend to take away from the true meaning of the text as they may be names or other unnecessary information. 
It might also help to gather a larger dataset to improve training the model. With regards to collecting training data, one could also change the increment of imdbIDs to be random as well. This way our dataset will be more spreadout within the IMDb database instead of having 17,000 consecutive listings.