# Text Classification Assessment

This assessment is a text classification project: the goal is to classify the genre of a movie from its characteristics, primarily the text of its plot summary. You have a training set that you will use to build your best-performing model, which you will then use to predict the classes of the test set. We will compare your predictions to your classmates' using the F1 score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
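For a multiclass problem the F1 score has to be averaged across classes, and the choice of average matters when classes are imbalanced. The sketch below computes per-class F1 by hand and then the macro and weighted averages. The toy labels and the helper names (`per_class_f1`, `macro_f1`, `weighted_f1`) are illustrative assumptions, and the assessment text does not state which average the instructors use.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores weighted by each class's true count."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)

# Toy labels (made up for illustration).
y_true = ["drama", "drama", "drama", "comedy", "comedy", "crime"]
y_pred = ["drama", "drama", "comedy", "comedy", "drama", "drama"]

print(per_class_f1(y_true, y_pred))
print("macro F1:   ", round(macro_f1(y_true, y_pred), 4))
print("weighted F1:", round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the rare class is never predicted, which pulls the macro average down more than the weighted one.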

The **movie_train.csv** dataset contains information (`Release Year`, `Title`, `Plot`, `Director`, `Cast`) about 10,682 movies and the label of `Genre`. There are 9 different genres in this data set, so this is a multiclass problem. You are expected to primarily use the plot column, but can use the additional columns as you see fit.

After you have identified your best-performing model, you will create predictions for the test set. The test set contains 3,561 movies with all of their information except the `Genre`.

Below is a list of tasks that you will definitely want to complete for this challenge, but the list is not exhaustive. It does not include any tasks for handling class imbalance or for testing multiple models and their tuning parameters, but you should still try those to see whether they help you create a better predictive model.


# Good Luck

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout
import re
from nltk.corpus import stopwords
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))
from bs4 import BeautifulSoup

import cufflinks
from IPython.core.interactiveshell import InteractiveShell
import plotly.figure_factory as ff
InteractiveShell.ast_node_interactivity = 'all'
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

Using TensorFlow backend.


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import svm

In [3]:
import gensim

In [4]:
import nltk


In [5]:
from chart_studio.plotly import iplot

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [44]:
import pandas as pd
train = pd.read_csv('movie_train.csv')

In [45]:
test = pd.read_csv('movie_test.csv')

In [48]:
data=pd.concat([train,test])


Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.





### Task #2: Check for missing values:

In [6]:
train.Genre.value_counts()

drama        3770
comedy       2724
horror        840
action        830
thriller      685
romance       649
western       525
adventure     331
crime         328
Name: Genre, dtype: int64
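The counts above are heavily skewed toward drama and comedy. One common way to compensate, sketched here as an option rather than as the approach this notebook takes, is to weight classes inversely to their frequency with the `n_samples / (n_classes * count)` heuristic that scikit-learn applies for `class_weight='balanced'`. The dictionary simply restates the counts printed above.

```python
# Genre counts copied from the value_counts() output above.
genre_counts = {
    "drama": 3770, "comedy": 2724, "horror": 840, "action": 830,
    "thriller": 685, "romance": 649, "western": 525,
    "adventure": 331, "crime": 328,
}

n_samples = sum(genre_counts.values())
n_classes = len(genre_counts)

# "Balanced" heuristic: rare classes get weights above 1, common ones below 1.
class_weight = {
    genre: n_samples / (n_classes * count)
    for genre, count in genre_counts.items()
}

for genre, weight in class_weight.items():
    share = genre_counts[genre] / n_samples
    print(f"{genre:<10} share={share:.3f}  weight={weight:.2f}")
```

A dictionary like this can be passed as the `class_weight` argument of scikit-learn classifiers that support it, such as `LogisticRegression` or `LinearSVC`.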

In [77]:
train.Genre.value_counts().sort_values(ascending=False).iplot(kind='bar', yTitle='Number of Movies', title='Movies per Genre')

In [28]:
def print_plot(index):
    # Look up one row by index and show its plot text together with its genre label.
    example = train[train.index == index][['Plot', 'Genre']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Genre:', example[1])

In [41]:
# Check for NaN values:
train.isnull().sum().sum()

169

In [34]:
train.shape

(10682, 8)

In [7]:
train.dropna(how = 'all')


Unnamed: 0.1,Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
0,10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
1,7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
2,10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
3,25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
4,16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action
...,...,...,...,...,...,...,...
10677,4652,1948,Fighting Back,Nick Sanders comes home from the war and needs...,Malcolm St. Clair,"Jean Rogers, Paul Langton",drama
10678,23220,1987,The Romance of Book and Sword,The film covers the first half of the novel an...,Ann Hui,"Zhang Duofu, Chang Dashi, Liu Jia",action
10679,15847,2010,Holy Rollers,"Sam Gold (Jesse Eisenberg), is a mild-mannered...",Kevin Asch,"Jesse Eisenberg, Justin Bartha, Ari Graynor, D...",drama
10680,3102,1941,Lady from Louisiana,Yankee lawyer John Reynolds (John Wayne) and S...,Bernard Vorhaus,"John Wayne, Ona Munson",drama


In [9]:
# Check for whitespace-only strings (it's OK if there aren't any!).
# The check runs per plot; testing the Series itself would only inspect its index.
train['Plot'].str.isspace().any()

False

### Task #3: Remove NaN values:

In [8]:
# Raw strings keep the regex escapes intact.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower()  # lowercase the text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace separator symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # drop every character outside 0-9, a-z, space, #, + and _
    # NOTE: this strips every letter 'x' from the plot (e.g. 'text' becomes 'tet'), which is probably unintended.
    text = text.replace('x', '')
#    text = re.sub(r'\W+', '', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords
    return text


In [9]:
train['cleanplot'] = train['Plot'].apply(clean_text)

In [49]:
data['cleanplot'] = data['Plot'].apply(clean_text)

In [50]:
test['cleanplot'] = test['Plot'].apply(clean_text)

In [11]:
# Maximum vocabulary size kept by the tokenizer (most frequent words).
MAX_NB_WORDS = 50000
# Maximum number of words kept from each plot.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the embedding vectors.
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(train['cleanplot'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 113958 unique tokens.


### Task #4: Take a look at the columns and do some EDA to familiarize yourself with the data. 

### Task #5: Split the data into train & test sets:

Yes, we have a holdout set of the data, but you do not know its genres, so you can't use it to evaluate your models. Therefore, you must create your own training and test sets to evaluate your models.
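With classes as imbalanced as the genres here, a purely random split can leave rare genres under-represented in the evaluation set. A minimal sketch of a stratified split with scikit-learn's `train_test_split` follows; the toy `plots` and `genres` lists and the 25% split size are assumptions for illustration, not the notebook's data or settings.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for the Plot and Genre columns (made up for illustration).
plots = [f"plot {i}" for i in range(100)]
genres = ["drama"] * 60 + ["comedy"] * 32 + ["crime"] * 8

# stratify=genres keeps each genre's share the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    plots, genres, test_size=0.25, stratify=genres, random_state=42
)

print("train:", Counter(y_train))
print("val:  ", Counter(y_val))
```

Each genre keeps its 60/32/8 proportion in both parts, so per-class metrics on the validation part are computed on a representative sample.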

In [12]:
def evaluate(test, pred, model):
    return [model, 
            precision_score(test, pred, average = 'weighted'), 
            recall_score(test, pred,average = 'weighted'), 
            accuracy_score(test, pred), 
            f1_score(test, pred, average = 'weighted')]

In [13]:
def print_accuracy_indices(labels, preds):
    print("Precision Score: {}".format(precision_score(labels, preds, average = 'weighted')))
    print("Recall Score: {}".format(recall_score(labels, preds, average = 'weighted')))
    print("Accuracy Score: {}".format(accuracy_score(labels, preds)))
    print("F1 Score: {}".format(f1_score(labels, preds, average = 'weighted')))

In [21]:
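from gensim.models import Word2Vec
# Load the pretrained Google News word2vec vectors (binary format).
wv = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# L2-normalize the vectors in place (gensim 3.x API).
wv.init_sims(replace=True)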

In [12]:
import logging

def word_averaging(wv, words):
    """Return the unit-length mean of the word vectors for the tokens found in `wv`."""
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    """Stack one averaged vector per tokenized document into a 2-D array."""
    return np.vstack([word_averaging(wv, post) for post in text_list])

In [14]:
def w2v_tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) < 2:
                continue
            tokens.append(word)
    return tokens

In [22]:
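# NOTE: this overwrites the holdout `test` frame loaded from movie_test.csv with a validation split.
train, test = train_test_split(train, test_size=0.3, random_state = 42)

# NOTE: 'Plot1' is not among the columns shown above; it has to exist before this cell runs.
test_tokenized = test.apply(lambda r: w2v_tokenize_text(r['Plot1']), axis=1).values
train_tokenized = train.apply(lambda r: w2v_tokenize_text(r['Plot1']), axis=1).values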

X_train_word_average = word_averaging_list(wv,train_tokenized)
X_test_word_average = word_averaging_list(wv,test_tokenized)


Call to deprecated `syn0norm` (Attribute will be removed in 4.0.0, use self.vectors_norm instead).



In [28]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg = logreg.fit(X_train_word_average, train['Genre'])



lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



In [32]:
from sklearn.metrics import classification_report


In [34]:
y_pred = logreg.predict(X_test_word_average)
print('accuracy %s' % accuracy_score(test.Genre, y_pred))
print(classification_report(test.Genre, y_pred))

accuracy 0.556687898089172
              precision    recall  f1-score   support

      action       0.48      0.38      0.43       110
   adventure       0.29      0.05      0.09        37
      comedy       0.59      0.62      0.60       404
       crime       1.00      0.02      0.04        53
       drama       0.53      0.75      0.62       579
      horror       0.66      0.72      0.69       120
     romance       0.42      0.13      0.20        97
    thriller       0.55      0.06      0.10       106
     western       0.71      0.58      0.64        64

    accuracy                           0.56      1570
   macro avg       0.58      0.37      0.38      1570
weighted avg       0.56      0.56      0.52      1570



# Using TF-IDF features with oversampling for class imbalance

In [51]:
test['new'] = test.cleanplot + ' ' + test.Director + ' ' + test.Title + ' ' + test.Title


In [76]:
train['new'] = train.cleanplot + ' ' + train.Director + ' ' + train.Title + ' ' + train.Title


AttributeError: 'DataFrame' object has no attribute 'cleanplot'

In [52]:
Xtestnew = tfidf.transform(test['new'])


In [61]:
newpreds = svclassifier.predict(Xtestnew)


In [62]:

pred_label = label[newpreds]

In [66]:
test=test.assign(Genre = pred_label)

In [71]:
predictedataset = test.drop(['cleanplot','new'],axis =1)

In [73]:
predictedataset.to_csv('predictions.csv')

In [75]:
train['new'] = train.cleanplot + ' ' + train.Director + ' ' + train.Title + ' ' + train.Title

AttributeError: 'DataFrame' object has no attribute 'cleanplot'

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit(train['new'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [74]:
X = tfidf.transform(train['new'])
#train['new'][1]

KeyError: 'new'

In [17]:
print([X[1, tfidf.vocabulary_['hunting']]])

[0.04542228384249389]


In [34]:
# transform target variable
train['new_genre'] = pd.factorize(train.Genre)[0]
y = train.new_genre

In [55]:
y, label = pd.factorize(train["Genre"])


In [57]:
from imblearn.over_sampling import SMOTE

In [35]:
from imblearn.over_sampling import ADASYN 
sm = ADASYN()
# fit_resample is the current name of the deprecated fit_sample method.
Xbal, ybal = sm.fit_resample(X, y)

In [23]:
y.shape

(33576,)

In [22]:
X.shape

(33576, 118620)

In [24]:
X

<33576x118620 sparse matrix of type '<class 'numpy.float64'>'
	with 8358398 stored elements in Compressed Sparse Row format>

In [36]:
X_train, X_test, y_train, y_test = train_test_split(Xbal, ybal, test_size=0.3, random_state = 42)
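Because the split above is taken from the already oversampled `Xbal` and `ybal`, the evaluation rows include synthetic samples derived from rows that also feed training, so scores measured on them may be optimistic. A sketch of the alternative order (hold out evaluation rows first, then oversample only the training part) is shown below. It uses plain random duplication in place of ADASYN so that it runs without extra packages; `random_oversample` and the toy rows are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data (made up): 14 drama rows and 6 crime rows.
rows = [f"plot {i}" for i in range(20)]
labels = ["drama"] * 14 + ["crime"] * 6

# 1) Hold out every third row for validation BEFORE any resampling.
train_rows, train_labels, val_rows, val_labels = [], [], [], []
for i, (row, label) in enumerate(zip(rows, labels)):
    if i % 3 == 0:
        val_rows.append(row)
        val_labels.append(label)
    else:
        train_rows.append(row)
        train_labels.append(label)

# 2) Balance only the training part; validation keeps the real class mix.
bal_rows, bal_labels = random_oversample(train_rows, train_labels)

print("train after oversampling:", Counter(bal_labels))
print("validation (untouched):  ", Counter(val_labels))
```

With this order, no validation row (or copy of one) can appear in the training data, so the validation scores reflect performance on unseen, naturally distributed examples.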

In [26]:
import time

In [27]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score


In [28]:
start = time.time()
rfclassifier = RandomForestClassifier(n_estimators=100)
rfclassifier.fit(X_train, y_train)
rf_runtime = time.time() - start

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [31]:
rf_pred = rfclassifier.predict(X_test)  

In [34]:
print_accuracy_indices(y_test, rf_pred)

Precision Score: 0.9041796466175551
Recall Score: 0.8922215556888622
Accuracy Score: 0.8922215556888622
F1 Score: 0.8878812552531175


In [36]:
# Predicting the Test set results
rfy_pred = rfclassifier.predict(X_test)
# Making the Confusion Matrix
print(pd.crosstab(y_test, rfy_pred, rownames=['Actual'], colnames=['Predicted']))

Predicted     0    1     2     3    4     5     6     7     8
Actual                                                       
0          1182   49     0     0    1     0     0     0     0
1            31  892    24    12   72    58    13     8     9
2             1   54  1000     6    2     9     1     0     1
3             0   15     0  1117    1     0     0     0     0
4            27  484    20     8  380    42     9     5    10
5             0   32     1     0    1  1050     0     0     0
6             5   48     1     1    3     1  1104     0     1
7             0    5     0     0    0     0     0  1130     0
8             0    7     0     0    0     0     0     0  1069


## Linear Support Vector Classification (LinearSVC)

Similar to SVC with a linear kernel, but implemented with liblinear rather than libsvm, so it scales better to large numbers of samples.

In [37]:
start = time.time()
svclassifier = LinearSVC(C = 50)
svclassifier.fit(X_train, y_train)
svc_runtime = time.time() - start


Liblinear failed to converge, increase the number of iterations.



LinearSVC(C=30, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [137]:
y_train

7007     4
3059     1
19647    3
32220    8
27183    7
        ..
16850    3
6265     5
11284    0
860      1
15795    2
Name: new_genre, Length: 23503, dtype: int64

In [38]:
svcpred = svclassifier.predict(X_test)  

In [40]:
print_accuracy_indices(y_test, svcpred)

Precision Score: 0.9336750823027415
Recall Score: 0.937158741189318
Accuracy Score: 0.937158741189318
F1 Score: 0.9336229659104195


In [77]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

params_grid = [{'kernel': ['linear'], 'C': [1, 10, 100]}]

In [78]:
svm_model = GridSearchCV(SVC(), params_grid, cv=5)
svm_model.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [79]:
svm_model.best_estimator_

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

### Task #6: Build a pipeline to vectorize the data, then train and fit your models.
You should train multiple types of models and try different combinations of the tuning parameters for each model to obtain the best one. You can use the SKlearn functions of GridSearchCV and Pipeline to help automate this process.
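One way this task could be approached is sketched below: a `Pipeline` chains the TF-IDF vectorizer and a classifier, and `GridSearchCV` tries parameter combinations for both steps at once. The toy corpus, the parameter grid, and the choice of `LinearSVC` with `class_weight="balanced"` are assumptions for illustration, not recommended settings for the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus: four plots per genre so 2-fold stratified CV works.
texts = [
    "a ghost haunts the old house", "the monster hunts teens in the woods",
    "a haunted doll terrorizes a family", "zombies attack the small town",
    "a sheriff defends the frontier town", "cowboys chase outlaws across the desert",
    "a gunslinger rides into town", "ranchers fight over cattle land",
    "two friends plan a silly wedding", "a clumsy waiter causes chaos",
    "roommates pull pranks on each other", "a family road trip goes hilariously wrong",
]
labels = ["horror"] * 4 + ["western"] * 4 + ["comedy"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # text -> sparse TF-IDF matrix
    ("clf", LinearSVC(class_weight="balanced")),  # linear classifier
])

# Keys use the "<step name>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["outlaws rob a train near the town"]))
```

Because the vectorizer sits inside the pipeline, it is refit on the training folds only during cross-validation, which keeps vocabulary and IDF statistics from leaking out of the held-out fold.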


### Task #7: Run predictions and analyze the results on the test set to identify the best model.  
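The reporting steps for this task can be sketched with scikit-learn's metric functions. The labels below are made up for illustration; in practice `y_true` and `y_pred` would be the genres and predictions for your own held-out split.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Toy true and predicted genres (made up for illustration).
y_true = ["drama", "drama", "drama", "comedy", "comedy", "crime"]
y_pred = ["drama", "drama", "comedy", "comedy", "drama", "drama"]
labels = ["comedy", "crime", "drama"]

# Rows are true genres, columns are predicted genres, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# zero_division=0 reports 0 for classes that were never predicted
# instead of raising a warning.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print("accuracy:", accuracy)
print("weighted F1:", weighted_f1)
```

Reading the matrix row by row shows which genres get confused with each other, which a single accuracy or F1 number hides.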

In [None]:
# Form a prediction set


In [None]:
# Report the confusion matrix



In [None]:
# Print a classification report


In [None]:
# Print the overall accuracy and F1 score


### Task #8: Refit the model to all of your data and then use that model to predict the holdout set. 

### Task #9: Save your predictions as a csv file that you will send to the instructional staff for evaluation. 

## Great job!