# Text Classification Assessment

This assessment is a text classification project: the goal is to classify the genre of a movie from its characteristics, primarily the text of its plot summary. You have a training set that you will use to build your best-performing model, and you will then use that model to predict the classes of the test set. We will compare your predictions with your classmates' predictions using the F1 score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
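
For a multiclass problem, `f1_score` needs an `average` argument. The text above does not say which averaging will be used for the comparison, so the sketch below, on made-up labels, only illustrates how the common choices differ.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only (not taken from movie_train.csv)
y_true = ["drama", "drama", "comedy", "action"]
y_pred = ["drama", "comedy", "comedy", "action"]

# Per-class F1, in the order given by `labels`
per_class = f1_score(y_true, y_pred, average=None, labels=["action", "comedy", "drama"])

# macro: unweighted mean of the per-class scores, so rare genres count as much as common ones
macro_f1 = f1_score(y_true, y_pred, average="macro")

# weighted: per-class scores weighted by the number of true examples of each class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# micro: computed from global counts; equals accuracy for single-label multiclass data
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(per_class, macro_f1, weighted_f1, micro_f1)
```

With imbalanced genres, macro and weighted averages can differ noticeably, so it is worth reporting the same variant every time you compare models.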

The **movie_train.csv** dataset contains information (`Release Year`, `Title`, `Plot`, `Director`, `Cast`) about 10,682 movies, along with the `Genre` label. There are 9 different genres in this dataset, so this is a multiclass problem. You are expected to rely primarily on the `Plot` column, but you can use the additional columns as you see fit.

After you have identified your best-performing model, you will create predictions for the test set. The test set contains 3,561 movies with all of their information except the `Genre`.

Below is a list of tasks that you will definitely want to complete for this challenge, but the list is not exhaustive. It does not cover handling class imbalance or comparing multiple models and their tuning parameters; you should still try both to see whether they help you build a better predictive model.


# Good Luck

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.read_csv('movie_train.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
0,10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
1,7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
2,10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
3,25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
4,16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


### Task #2: Check for missing values:

In [2]:
# Check for NaN values:
df.isnull().sum()

Unnamed: 0        0
Release Year      0
Title             0
Plot              0
Director          0
Cast            169
Genre             0
dtype: int64

In [3]:
# Check for whitespace-only strings in the Plot column (it's OK if there aren't any!):
blanks = []  # start with an empty list

for idx, plot in df['Plot'].items():  # iterate over (index label, plot text) pairs
    if isinstance(plot, str):         # skip NaN values
        if plot.isspace():            # test the plot text for whitespace only
            blanks.append(idx)        # add matching index labels to the list

len(blanks)

0
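
The same kind of check can be written without an explicit loop by using pandas string methods. A minimal sketch on a made-up frame (only the column name `Plot` matches the dataset; `toy` is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd

# Small made-up frame with one normal plot, two whitespace-only plots and one NaN
toy = pd.DataFrame({"Plot": ["A heist goes wrong.", "   ", np.nan, "\t\n"]})

# .str.isspace() is not True for missing values, so .eq(True) leaves them out
is_blank = toy["Plot"].str.isspace().eq(True)

# Index labels of the rows whose plot is whitespace only
blank_index = toy.index[is_blank].tolist()
print(blank_index)
```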

### Task #3: Remove NaN values:

In [4]:
df.dropna(inplace=True)

### Task #4: Take a look at the columns and do some EDA to familiarize yourself with the data. 

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
0,10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
1,7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
2,10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
3,25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
4,16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


In [6]:
df['Genre'].value_counts()

drama        3673
comedy       2703
action        823
horror        810
thriller      680
romance       644
western       525
adventure     329
crime         326
Name: Genre, dtype: int64

In [7]:
len(df)

10513

#### Shortcut: Make an Abridged Version of DF for Sample Testing

In [11]:
df1 = df.sample(2500)
df1.head()

Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
2031,6962,1958,Auntie Mame,"Patrick Dennis (Jan Handzlik), orphaned in 192...",Morton DaCosta,"Rosalind Russell, Coral Browne, Forrest Tucker...",comedy
2730,10038,1982,Visiting Hours,"Deborah Ballin, a feminist activist, inspires ...",Jean-Claude Lord,"Michael Ironside, Lee Grant",thriller
7250,16067,2011,Limitless,Eddie Morra is a struggling author with writer...,Neil Burger,"Bradley Cooper, Abbie Cornish, Robert De Niro,...",thriller
10112,6973,1958,The Bravados,Jim Douglas (Gregory Peck) is a rancher pursui...,Henry King,"Gregory Peck, Stephen Boyd, Albert Salmi, Joan...",western
8757,19378,1958,Dublin Nightmare,Irish nationalists plans to seize a security v...,John Pomeroy,"William Sylvester, Marla Landi, Richard Leech",crime


In [38]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 2031 to 766
Data columns (total 7 columns):
Unnamed: 0      2500 non-null int64
Release Year    2500 non-null int64
Title           2500 non-null object
Plot            2500 non-null object
Director        2500 non-null object
Cast            2500 non-null object
Genre           2500 non-null object
dtypes: int64(2), object(5)
memory usage: 156.2+ KB


In [12]:
df1['Genre'].value_counts()

drama        882
comedy       629
action       213
horror       197
romance      166
thriller     149
western      117
adventure     81
crime         66
Name: Genre, dtype: int64

#### Check percentage of total for each genre for each dataframe

In [22]:
lens = []
for x in df1['Genre'].value_counts():
    length = round((x/len(df1)),3)
    lens.append(length)
    
lens

[0.353, 0.252, 0.085, 0.079, 0.066, 0.06, 0.047, 0.032, 0.026]

In [20]:
lens1 = []
for x in df['Genre'].value_counts():
    length = round((x/len(df)),3)
    lens1.append(length)
    
lens1

[0.349, 0.257, 0.078, 0.077, 0.065, 0.061, 0.05, 0.031, 0.031]
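
pandas can also compute these shares directly: `value_counts(normalize=True)` returns fractions instead of counts. A minimal sketch on made-up labels that lines up a full frame and a sample side by side (`full` and `sample` are hypothetical stand-ins for `df` and `df1`):

```python
import pandas as pd

# Made-up genre labels: 6 drama, 3 comedy, 1 crime
full = pd.DataFrame({"Genre": ["drama"] * 6 + ["comedy"] * 3 + ["crime"] * 1})
sample = full.sample(5, random_state=0)

# Align both sets of shares on the genre name; a genre missing from the sample gets 0
comparison = pd.DataFrame({
    "full": full["Genre"].value_counts(normalize=True),
    "sample": sample["Genre"].value_counts(normalize=True),
}).fillna(0).round(3)

print(comparison)
```

Aligning on the genre name avoids relying on both lists happening to be sorted in the same order.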

In [46]:
df1

Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
2031,6962,1958,Auntie Mame,"Patrick Dennis (Jan Handzlik), orphaned in 192...",Morton DaCosta,"Rosalind Russell, Coral Browne, Forrest Tucker...",comedy
2730,10038,1982,Visiting Hours,"Deborah Ballin, a feminist activist, inspires ...",Jean-Claude Lord,"Michael Ironside, Lee Grant",thriller
7250,16067,2011,Limitless,Eddie Morra is a struggling author with writer...,Neil Burger,"Bradley Cooper, Abbie Cornish, Robert De Niro,...",thriller
10112,6973,1958,The Bravados,Jim Douglas (Gregory Peck) is a rancher pursui...,Henry King,"Gregory Peck, Stephen Boyd, Albert Salmi, Joan...",western
8757,19378,1958,Dublin Nightmare,Irish nationalists plans to seize a security v...,John Pomeroy,"William Sylvester, Marla Landi, Richard Leech",crime
4591,23482,2001,Prison on Fire - Life Sentence,A group of prisoners were doing a variety of a...,Edward Yuen,"Ben Wong, Iris Wong, Tommy Wong, William Ho, L...",crime
3390,24151,2014,The Royal Bengal Tiger,Abhi (Abir Chatterjee) is a typical meek & doc...,Deb Bera,"Jeet, Abir Chatterjee, Priyanka Sarkar, Shradd...",action
1484,15285,2007,Mr. Woodcock,John Farley (Seann William Scott) is a success...,Craig Gillespie,"Billy Bob Thornton, Seann William Scott, Susan...",comedy
4489,12705,1996,The Frighteners,"In 1990, architect Frank Bannister's wife, Deb...",Peter Jackson,"Michael J. Fox, Trini Alvarado",horror
5718,14161,2002,The Sweetest Thing,"In an opening scene, a group of men are interv...",Roger Kumble,"Cameron Diaz, Christina Applegate, Selma Blair",comedy


### Task #5: Split the data into train & test sets:

Yes, we have a holdout set, but you do not know the genres of those movies, so you cannot use it to evaluate your models. You must therefore create your own training and test sets from the labelled data.

#### Handle Class Imbalance

In [23]:
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

In [39]:
# Keep only the plot text as a one-column DataFrame; the target is the genre
features = df1[['Plot']]
target = df1['Genre']

In [45]:
# SMOTE and SMOTENC interpolate between numeric feature vectors, so they cannot be
# applied to a single raw text column. Random oversampling, swapped in here,
# only duplicates rows of the minority classes and therefore works on text.
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(features, target)
print(sorted(Counter(y_resampled).items()))
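
Resampling is not the only way to address the imbalance. Many scikit-learn classifiers accept `class_weight='balanced'`, which weights each class inversely to its frequency. The sketch below shows the weights this would imply for the genre counts printed earlier for the full `df`; the counts are copied from that output, everything else is illustrative.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Genre counts copied from the df['Genre'].value_counts() output
genre_counts = {
    "drama": 3673, "comedy": 2703, "action": 823, "horror": 810, "thriller": 680,
    "romance": 644, "western": 525, "adventure": 329, "crime": 326,
}

# Rebuild a label array with the same class frequencies
labels = np.repeat(list(genre_counts.keys()), list(genre_counts.values()))
classes = np.array(sorted(genre_counts))

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = {str(c): float(w) for c, w in zip(classes, weights)}

print({genre: round(w, 2) for genre, w in class_weights.items()})
```

Passing `class_weight='balanced'` to a classifier such as `LogisticRegression` applies the same weighting during fitting, without changing the data.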

In [47]:
from sklearn.model_selection import train_test_split

X = df1['Plot']
y = df1['Genre']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
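
With genres this imbalanced, a plain random split can leave the rarest classes under-represented in one of the two parts. `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both. A minimal sketch on made-up data (all names ending in `_toy`, `_tr` and `_te` are hypothetical):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up plots and labels: 12 drama, 6 comedy, 3 crime
X_toy = [f"plot {i}" for i in range(21)]
y_toy = ["drama"] * 12 + ["comedy"] * 6 + ["crime"] * 3

# test_size=7 holds out exactly one third of the 21 rows; stratify keeps the 4:2:1 ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=7, random_state=42, stratify=y_toy
)

print(Counter(y_tr))
print(Counter(y_te))
```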

### Task #6: Build a pipeline to vectorize the data, then train and fit your models.
You should train multiple types of models and try different combinations of tuning parameters for each one to find the best. scikit-learn's `GridSearchCV` and `Pipeline` can help automate this process.
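
A minimal sketch of how `Pipeline` and `GridSearchCV` fit together, on a made-up mini corpus. The choice of `LogisticRegression`, the parameter grid and the `f1_macro` scoring are assumptions for illustration, not a recommended configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up mini corpus, repeated so that every class has enough rows for 3-fold CV
plots = [
    "a sheriff rides into a dusty frontier town",
    "outlaws rob a stagecoach near the ranch",
    "a ghost haunts the old house at night",
    "a masked killer stalks teenagers in the woods",
] * 3
genres = ["western", "western", "horror", "horror"] * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a parameter inside the pipeline
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every parameter combination is cross-validated; the best one is refit on all rows
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=3)
search.fit(plots, genres)

print(search.best_params_)
print(search.predict(["a ghost in the house"]))
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary information leaks from the validation fold.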


In [49]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline

In [50]:
import re
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Punctuation characters to filter out
punctuations = string.punctuation

# Load the small English model and use spaCy's built-in stop word set
nlp = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

# Creating our tokenizer function
def spacy_tokenizer(text):
    # remove html tags from all of the text before processing
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', text)
    # Creating our token object, which is used to create documents with linguistic annotations.
    # we disabled the parser and ner parts of the pipeline in order to speed up parsing
    mytokens = nlp(cleantext, disable=['parser', 'ner'])

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

In [13]:
%%capture
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()

In [14]:
df.tail()

Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10677,4652,1948,Fighting Back,Nick Sanders comes home from the war and needs...,Malcolm St. Clair,"Jean Rogers, Paul Langton",drama
10678,23220,1987,The Romance of Book and Sword,The film covers the first half of the novel an...,Ann Hui,"Zhang Duofu, Chang Dashi, Liu Jia",action
10679,15847,2010,Holy Rollers,"Sam Gold (Jesse Eisenberg), is a mild-mannered...",Kevin Asch,"Jesse Eisenberg, Justin Bartha, Ari Graynor, D...",drama
10680,3102,1941,Lady from Louisiana,Yankee lawyer John Reynolds (John Wayne) and S...,Bernard Vorhaus,"John Wayne, Ona Munson",drama
10681,3583,1943,Hitler's Madman,Somewhat fictionalized account of the destruct...,Douglas Sirk,"Patricia Morison, Alan Curtis",drama


In [52]:
df1['tokenized_sents'] = df1.apply(lambda row: spacy_tokenizer(row['Plot']), axis=1)

In [53]:
bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

In [60]:
def spacy_tokenizers(text):
    # remove html tags from all of the text before processing
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', text)
    # Creating our token object, which is used to create documents with linguistic annotations.
    # we disabled the parser and ner parts of the pipeline in order to speed up parsing
    mytokens = nlp(cleantext, disable=['parser', 'ner'])

    # This variant skips lemmatization and only lowercases each token.
    # Converting the spaCy Token objects to plain strings is required:
    # the membership tests below and the vectorizer both expect strings.
    mytokens = [ word.lower_.strip() for word in mytokens ]

    # Removing stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

In [61]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizers, max_features=1000, min_df=5, max_df=0.7)

In [63]:
X = tfidf_vector.fit_transform(df1['Plot']).toarray()

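In [56]:
# 'tokenized_sents' already holds lists of string tokens, so use each list as-is
# instead of letting the vectorizer try to lowercase and tokenize it again
pretokenized_tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens, max_features=1000, min_df=5, max_df=0.7)
X_tokens = pretokenized_tfidf.fit_transform(df1['tokenized_sents'])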
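
Two rules explain most vectorizer failures with custom tokenizers: a `tokenizer` must return plain strings, and the vectorizer expects raw documents to be strings unless `analyzer` is replaced with a callable. A minimal sketch of both input styles, where `simple_tokenizer` is a hypothetical stand-in for a spaCy-based tokenizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(text):
    # Hypothetical stand-in for a spaCy-based tokenizer: it must return a list of str
    return [tok for tok in text.lower().split() if tok.isalpha()]

raw_docs = ["The sheriff rides again", "The ghost rides at night"]

# Case 1: raw strings in, custom tokenizer returns strings.
# token_pattern=None signals that the default regex pattern is not used.
vec_raw = TfidfVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
X_raw = vec_raw.fit_transform(raw_docs)

# Case 2: documents are already lists of tokens, so a callable analyzer
# passes them through and skips lowercasing and tokenizing entirely.
pre_tokenized = [simple_tokenizer(doc) for doc in raw_docs]
vec_tok = TfidfVectorizer(analyzer=lambda tokens: tokens)
X_tok = vec_tok.fit_transform(pre_tokenized)

print(sorted(vec_raw.vocabulary_))
print(X_raw.shape, X_tok.shape)
```

Both routes produce the same matrix here; the second is useful when tokenization is slow and has already been done once.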

### Task #7: Run predictions and analyze the results on the test set to identify the best model.  

In [None]:
# Form a prediction set


In [None]:
# Report the confusion matrix



In [None]:
# Print a classification report


In [None]:
# Print the overall accuracy and F1 score
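
The four steps above can be sketched on made-up predictions as follows. Here `y_true` and `y_pred` are hypothetical stand-ins for `y_test` and the output of your model's `predict` on `X_test`, and macro averaging is an assumption.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

# Made-up true and predicted genres
y_true = ["drama", "drama", "comedy", "comedy", "horror", "horror"]
y_pred = ["drama", "comedy", "comedy", "comedy", "horror", "drama"]
labels = ["comedy", "drama", "horror"]

# Rows are true classes, columns are predicted classes, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall, F1 and support for each genre
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Overall accuracy and macro-averaged F1
acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(acc, macro_f1)
```

Reading the confusion matrix row by row shows which genres are mistaken for which, something a single score cannot reveal.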


### Task #8: Refit the model to all of your data and then use that model to predict the holdout set. 

### Task #9: Save your predictions as a CSV file that you will send to the instructional staff for evaluation.
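
A minimal sketch of the last two tasks on made-up data. The frames `train` and `holdout`, the model choice and the single-column output layout are all assumptions; the required submission format is not specified here, so check it with the instructional staff.

```python
import io
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the labelled training frame and the unlabelled holdout frame
train = pd.DataFrame({
    "Plot": ["a sheriff chases outlaws", "a ghost haunts a house",
             "cowboys defend the ranch", "a demon possesses a child"],
    "Genre": ["western", "horror", "western", "horror"],
})
holdout = pd.DataFrame({"Plot": ["outlaws attack the ranch", "the house is haunted by a ghost"]})

# Refit the chosen model on all labelled rows, not just the training split
final_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
final_model.fit(train["Plot"], train["Genre"])

# One predicted genre per holdout row, in the original row order
predictions = pd.DataFrame({"Genre": final_model.predict(holdout["Plot"])}, index=holdout.index)

# Written to an in-memory buffer here; pass a file path instead to create a real file
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Keeping the holdout row order unchanged matters, because the predictions will be matched to the true labels by position or index.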

## Great job!