# Challenge: IMDB Review Classifier

**Run the Cell to import the packages**

In [1]:
import pandas as pd
import numpy as np
import csv
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

**Fill in the Command to load your CSV dataset "imdb.csv" with pandas**

In [2]:
imdb = pd.read_csv('../data/imdb.csv')
imdb.columns = ["index", "text", "label"]
print(imdb.head(5))

   index                                               text  label
0      0  A very, very, very slow-moving, aimless movie ...      0
1      1  Not sure who was more lost - the flat characte...      0
2      2  Attempting artiness with black & white and cle...      0
3      3       Very little music or anything to speak of.        0
4      4  The best scene in the movie was when Gerardo i...      1


**Data Analysis**

- Get the shape of the dataset and print it.

- Get the column names as a list and print them.

- Group the dataset by **label** and call `describe()` to view basic statistics for each class.

- Print the first three rows of the dataset.

In [3]:
data_size = imdb.shape
print("shape: ", data_size)
imdb_col_names = list(imdb.columns)
print("columns: ", imdb_col_names)
print(imdb.groupby(['label']).describe())
print(imdb.head(3))

shape:  (1000, 3)
columns:  ['index', 'text', 'label']
       index                                                        
       count     mean         std  min     25%    50%     75%    max
label                                                               
0      500.0  466.418  276.272620  0.0  218.75  462.5  700.25  999.0
1      500.0  532.582  297.457084  4.0  297.75  569.5  787.25  993.0
   index                                               text  label
0      0  A very, very, very slow-moving, aimless movie ...      0
1      1  Not sure who was more lost - the flat characte...      0
2      2  Attempting artiness with black & white and cle...      0


**Target Identification**

Execute the cell below to inspect the target variable. A label of `0` marks a bad review and a label of `1` marks a good review.

In [4]:
imdb_target = imdb['label']
print(imdb_target.value_counts())

0    500
1    500
Name: label, dtype: int64


**Tokenization**

- Convert the text to lowercase.
- Tokenize the text using `word_tokenize`.
- Apply the function **split_tokens** to the column **text** of the **imdb** dataset with axis=1.

In [5]:
# nltk.download('all')
def split_tokens(text):
    message = text.lower()
    word_tokens = word_tokenize(message)
    return word_tokens

imdb['tokenized_message'] = imdb.apply(lambda row: split_tokens(row['text']), axis=1)

**Lemmatization**

- Apply the function **split_into_lemmas** to the column **tokenized_message** with axis=1.
- Print the row at index 11 from the column **tokenized_message**.
- Print the row at index 11 from the column **lemmatized_message**.

In [6]:
def split_into_lemmas(text):
    lemma = []
    lemmatizer = WordNetLemmatizer()
    for word in text:
        a = lemmatizer.lemmatize(word)
        lemma.append(a)
    return lemma

imdb['lemmatized_message'] = imdb.apply(lambda row: split_into_lemmas(row['tokenized_message']), axis=1)
print('Tokenized message:',  imdb['tokenized_message'][11])
print('Lemmatized message:', imdb['lemmatized_message'][11])

Tokenized message: ['the', 'movie', 'showed', 'a', 'lot', 'of', 'florida', 'at', 'it', "'s", 'best', ',', 'made', 'it', 'look', 'very', 'appealing', '.']
Lemmatized message: ['the', 'movie', 'showed', 'a', 'lot', 'of', 'florida', 'at', 'it', "'s", 'best', ',', 'made', 'it', 'look', 'very', 'appealing', '.']


**Stop Word Removal**
- Set the stop word language to English in the variable **stop_words**.
- Apply the function **stopword_removal** to the column **lemmatized_message** with axis=1.
- Print the row at index 11 from the column **preprocessed_message**.

In [7]:
def stopword_removal(text):
    # Build the English stop word set, then keep only the tokens outside it
    stop_words = set(stopwords.words('english'))
    filtered_sentence = ' '.join([word for word in text if word not in stop_words])
    return filtered_sentence

imdb['preprocessed_message'] = imdb.apply(lambda row: stopword_removal(row['lemmatized_message']), axis=1)
print('Preprocessed message:', imdb['preprocessed_message'][11])
X = pd.Series(list(imdb['preprocessed_message']))
y = pd.Series(list(imdb['label']))

Preprocessed message: movie showed lot florida 's best , made look appealing .
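
The output above still contains `,` and `.`, because those tokens were not in the stop word set. A minimal sketch of one way to also drop pure-punctuation tokens, assuming a tiny hand-written stop word set (`SAMPLE_STOP_WORDS`, a hypothetical stand-in for `stopwords.words('english')`) and a hypothetical helper `remove_stopwords_and_punct`:

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
SAMPLE_STOP_WORDS = {"the", "a", "of", "at", "it", "very"}


def remove_stopwords_and_punct(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Keep tokens that are not stop words and contain at least one letter or digit."""
    kept = [
        tok for tok in tokens
        if tok not in stop_words and any(ch.isalnum() for ch in tok)
    ]
    return " ".join(kept)


tokens = ['the', 'movie', 'showed', 'a', 'lot', 'of', 'florida', 'at', 'it',
          "'s", 'best', ',', 'made', 'it', 'look', 'very', 'appealing', '.']
cleaned = remove_stopwords_and_punct(tokens)
print(cleaned)
```

Tokens such as `'s` contain a letter and therefore survive this filter; whether to remove them is a separate modelling choice.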


**Term Document Matrix**

- Apply CountVectorizer with the following parameters:
  - ngram_range = (1,2)
  - min_df = (1/len(y))
  - max_df = 0.7
- Fit the **tf_vectorizer** on **X**.
- Transform **X** with the fitted vectorizer **total_dictionary_tdm** to get **message_data_tdm**.

In [8]:
tf_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=(1/len(y)), max_df=0.7)
total_dictionary_tdm = tf_vectorizer.fit(X, y)
message_data_tdm = total_dictionary_tdm.transform(X)
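
To make the parameters above concrete, here is a rough pure-Python sketch of how `ngram_range=(1, 2)` and proportional `min_df`/`max_df` bounds select a vocabulary. It assumes simple whitespace tokenization (scikit-learn uses its own token pattern), and the helpers `ngrams` and `build_vocabulary` as well as the toy corpus are hypothetical:

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined by spaces) for n between n_min and n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def build_vocabulary(docs, min_df, max_df):
    """Keep terms whose document frequency lies within the proportional bounds."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(ngrams(doc.split())))
    min_count = min_df * n_docs
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_count <= df <= max_count)


# Hypothetical toy corpus, not taken from imdb.csv.
docs = ["good movie", "bad movie", "good plot", "movie plot twist"]
vocab = build_vocabulary(docs, min_df=1 / len(docs), max_df=0.7)
print(vocab)
```

With `min_df = 1/len(docs)` the lower bound works out to a single document, so no term is dropped for being rare; only `max_df` removes anything here (`movie`, which appears in 3 of 4 documents).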

**Term Frequency Inverse Document Frequency (TFIDF)**
- Apply TfidfVectorizer with the following parameters:
  - ngram_range = (1,2)
  - min_df = (1/len(y))
  - max_df = 0.7
- Fit the **tfidf_vectorizer** on **X**.
- Transform **X** with the fitted vectorizer **total_dictionary_tfidf** to get **message_data_tfidf**.

In [9]:
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=(1/len(y)), max_df=0.7)
total_dictionary_tfidf = tfidf_vectorizer.fit(X, y)
message_data_tfidf = total_dictionary_tfidf.transform(X)
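
As an illustration of the weighting itself, the sketch below computes TF-IDF by hand for unigrams only. It assumes whitespace tokenization, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row, which correspond to the scikit-learn defaults; `tfidf_rows` and the toy corpus are hypothetical:

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights per document using a smoothed IDF."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed IDF: terms found in fewer documents receive a larger weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Hypothetical toy corpus, not taken from imdb.csv.
docs = ["good movie", "bad movie", "good plot"]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second document, `bad` occurs in one document and `movie` in two, so `bad` ends up with the larger weight even though both appear once.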

**Train and Test Data**

Split the data for training and testing (90% train, 10% test).

- Perform train-test split on `message_data_tdm` and `y` with 90% as train data and 10% as test data.

In [10]:
train_data, test_data, train_label, test_label = train_test_split(message_data_tdm, y, test_size=0.1)
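
The call above does not pass `random_state`, so each run shuffles the rows differently and the scores reported later can change between runs. A minimal pure-Python sketch of a reproducible split, assuming a hypothetical helper `seeded_split` and toy data:

```python
import math
import random


def seeded_split(items, labels, test_size=0.1, seed=9):
    """Shuffle indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Hypothetical toy data: 20 items with alternating labels.
items = [f"review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]
first = seeded_split(items, labels)
second = seeded_split(items, labels)
print(len(first[0]), len(first[1]))
```

With scikit-learn, the same effect comes from passing `random_state=seed` to `train_test_split`.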

**Support Vector Machine**

- Get the shape of the train-data and print the same.

- Get the shape of the test-data and print the same.

- Initialize the SVM classifier with the following parameters
    - kernel = linear
    - C= 0.025
    - random_state=seed

- Train the model with train_data and train_label

- Now predict the output with test_data

- Evaluate the classifier by calling `score` with test_data and test_label

- Print the resulting accuracy score

In [11]:
seed = 9
from sklearn.svm import SVC

train_data_shape = train_data.shape
test_data_shape = test_data.shape
print("The shape of train data: ", train_data_shape)
print("The shape of test data: ", test_data_shape)
classifier = SVC(kernel='linear', C=0.025, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SVM Classifier : ', score)

# with open('output.txt', 'w') as file:
#     file.write(str((imdb['tokenized_message'][55], imdb['lemmatized_message'][55])))

The shape of train data:  (900, 9051)
The shape of test data:  (100, 9051)
SVM Classifier :  0.76
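
The value returned by `score` is the mean accuracy: the fraction of test samples whose predicted label matches the true label. A small sketch of that calculation, with a hypothetical helper `accuracy` and made-up labels:

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted and actual labels agree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels for illustration only.
predicted = [1, 0, 1, 1, 0]
actual = [1, 0, 0, 1, 1]
result = accuracy(predicted, actual)
print(result)  # 3 of 5 predictions match
```

On the 100 test reviews above, a score of 0.76 corresponds to 76 correct predictions.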


**Stochastic Gradient Descent Classifier**

- Perform a fresh train-test split on `message_data_tdm` and `y`, again with 90% as train data and 10% as test data (`test_size=0.1`).

- Get the shape of the train-data and print the same.

- Get the shape of the test-data and print the same.

- Initialize the SGD classifier with the following parameters
    - loss = modified_huber
    - shuffle= True
    - random_state=seed

- Train the model with train_data and train_label

- Now predict the output with test_data

- Evaluate the classifier by calling `score` with test_data and test_label

- Print the resulting accuracy score

In [12]:
seed = 9
from sklearn.linear_model import SGDClassifier

train_data, test_data, train_label, test_label = train_test_split(message_data_tdm, y, test_size=0.1)
train_data_shape = train_data.shape
test_data_shape = test_data.shape
print("The shape of train data: ", train_data_shape)
print("The shape of test data: ", test_data_shape)
classifier = SGDClassifier(loss='modified_huber', shuffle=True, random_state=seed)
classifier = classifier.fit(train_data, train_label)
target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SGD classifier : ', score)

# with open('output1.txt', 'w') as file:
#     file.write(str((imdb['preprocessed_message'][55])))

The shape of train data:  (900, 9051)
The shape of test data:  (100, 9051)
SGD classifier :  0.79


## END