
In [0]:
import nltk
import re
import numpy as np
import pandas as pd
# tokenization of text
from nltk.tokenize import word_tokenize, sent_tokenize
# remove stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
# set the language
all_stopwords = set(stopwords.words('english')) 
from typing import List

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Read the review data into a pandas DataFrame
reviews_data = pd.read_csv('IMDB Dataset.csv')
reviews_data.describe()

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


We have 50,000 reviews, of which 49,582 are unique, and two types of sentiment.
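
The gap between the total and unique counts means some reviews appear more than once. A minimal sketch of how the duplicates could be counted and dropped with pandas is shown here; the `toy` frame and its rows are made up for illustration and stand in for `reviews_data`.

```python
import pandas as pd

# Toy stand-in for the IMDB frame; these rows are invented for illustration.
toy = pd.DataFrame({
    "review": ["great film", "awful plot", "great film", "fine acting"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# duplicated() flags every repeat of an earlier review text
n_duplicates = int(toy["review"].duplicated().sum())

# Keep only the first occurrence of each review text
deduplicated = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Whether duplicates should be removed before splitting is a judgment call; the rest of this notebook keeps them.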

In [0]:
# sentiment counts
reviews_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

The sentiments are either positive or negative and are evenly distributed (25,000 each). Let's preprocess the text using the simple tokenizer we built last time, which we now call preprocess_text.

In [0]:
def preprocess_text(text: str) -> str:
    # The reviews contain "<br /><br />", the HTML line-break tag, which makes a good splitter.
    # A well-structured sentence usually ends with a full stop, so we split on both.
    pattern1 = re.compile(r"<br /><br />|\.")
    lines = re.split(pattern1, text)
    # Break each line into tokens with a whitespace-based split
    tokens = []
    for line in lines:
        tokens += line.split(" ")

    # Normalize: lowercase and remove any non-word characters from the tokens
    normalized_tokens = [re.sub(r"\W+", "", token.lower()) for token in tokens]
    # Drop empty tokens, stopwords and single-character tokens, then rejoin into one string
    return " ".join([
            token
            for token in normalized_tokens
            if token and token not in all_stopwords and len(token) > 1
        ])
    

  
custom_review = "I hated the film. It was a disaster. Poor direction, bad acting."
custom_review_tokens = preprocess_text(custom_review)
print(custom_review_tokens)

hated film disaster poor direction bad acting
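
To see what each stage of the preprocessing does before stopwords are removed, the sketch below replays the split and normalize steps on a made-up sentence (the `sample` string is an assumption, not a row from the dataset). It uses only the standard library.

```python
import re

# Invented sample containing an HTML break, a contraction and a hyphenated word
sample = "Loved it!<br /><br />Don't miss the well-made sequel."

# Stage 1: split on the HTML line break or a full stop
lines = re.split(r"<br /><br />|\.", sample)

# Stage 2: split each line into tokens on single spaces
tokens = [tok for line in lines for tok in line.split(" ")]

# Stage 3: lowercase and strip every non-word character
normalized = [re.sub(r"\W+", "", tok.lower()) for tok in tokens]

print(normalized)
# ['loved', 'it', 'dont', 'miss', 'the', 'wellmade', 'sequel', '']
```

Note that stripping `\W+` merges "Don't" into "dont" and "well-made" into "wellmade", so a contraction may no longer match its entry in a stopword list, and the trailing empty token is why the final filter checks `if token`.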


In [0]:
# apply preprocessing to review data
reviews_data['review'] = reviews_data['review'].apply(preprocess_text)

In [0]:
#split the dataset  
#train dataset
train_reviews=reviews_data.review[:40000]
train_sentiments=reviews_data.sentiment[:40000]
#test dataset
test_reviews=reviews_data.review[40000:45000]
test_sentiments=reviews_data.sentiment[40000:45000]
#validation (blind) dataset
blind_reviews=reviews_data.review[45000:]
blind_sentiments=reviews_data.sentiment[45000:]
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)
print(blind_reviews.shape,blind_sentiments.shape)

(40000,) (40000,)
(5000,) (5000,)
(5000,) (5000,)
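
The positional slicing above relies on the rows already being in a random order. As an alternative sketch (not what this notebook does), scikit-learn's `train_test_split` can shuffle and stratify the same 80/10/10 split; the `toy` frame and the names `train_df`, `test_df` and `blind_df` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for reviews_data with perfectly balanced labels
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive", "negative"] * 50,
})

# First carve off 20% as a holdout, keeping the class balance
train_df, holdout_df = train_test_split(
    toy, test_size=0.2, stratify=toy["sentiment"], random_state=42
)
# Then halve the holdout into test and blind sets
test_df, blind_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["sentiment"], random_state=42
)

print(len(train_df), len(test_df), len(blind_df))
```

Stratifying keeps the positive/negative ratio the same in every split, which the slice-based approach does not guarantee.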


In [0]:
# CountVectorizer implements both tokenization and occurrence counting in a single class. Read more here https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# You can also reuse the from-scratch code we wrote in the previous class
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features. It is equivalent to CountVectorizer followed by TfidfTransformer.
# from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF vectorizer with document-frequency thresholds (integer values are document counts)
lower_count_thr = 100 # drop rare tokens that appear in fewer than 100 documents
upper_count_thr = 5000 # drop frequent/common tokens that appear in more than 5000 documents

tv=TfidfVectorizer(min_df=lower_count_thr,max_df=upper_count_thr,binary=False,ngram_range=(1,1))
#transformed train reviews
tv_train_reviews=tv.fit_transform(train_reviews)
#transformed test reviews
tv_test_reviews=tv.transform(test_reviews)

#transformed validation reviews
tv_blind_reviews=tv.transform(blind_reviews)

print('BOW_cv_train:',tv_train_reviews.shape)
print('BOW_cv_test:',tv_test_reviews.shape)
print('BOW_cv_blind:',tv_blind_reviews.shape)

BOW_cv_train: (40000, 5144)
BOW_cv_test: (5000, 5144)
BOW_cv_blind: (5000, 5144)
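
The effect of the `min_df` and `max_df` thresholds is easier to see on a tiny corpus. The sketch below uses four made-up documents and much smaller thresholds than the notebook; the `docs` list is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Document frequencies: film=3, good=2, plot=2, cast=1, bad=1, rare=1
docs = [
    "good film good cast",
    "bad film plot",
    "good plot",
    "film rare",
]

# Integer thresholds are document counts:
# keep only terms that occur in at least 2 and at most 2 documents
vec = TfidfVectorizer(min_df=2, max_df=2)
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # terms that survived both thresholds
print(X.shape)
```

Here "film" is dropped as too common and "cast", "bad" and "rare" as too rare, leaving "good" and "plot". The last document contains no surviving term, so its row is all zeros, which can also happen to short reviews under the notebook's thresholds.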


In [0]:
#Now generate binary (true, false) labels from sentiment values. positive maps to 1, negative maps to 0
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
#transformed sentiment data
sentiment_data=lb.fit_transform(reviews_data['sentiment'])
print(sentiment_data.shape)

(50000, 1)
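
The mapping of positive to 1 and negative to 0 follows from `LabelBinarizer` sorting the class names alphabetically. A small sketch on made-up labels shows how to confirm this through the `classes_` attribute:

```python
from sklearn.preprocessing import LabelBinarizer

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(["positive", "negative", "negative", "positive"])

# classes_ lists the labels in sorted order; index 0 encodes as 0, index 1 as 1
print(lb_demo.classes_)   # ['negative' 'positive']
print(encoded.ravel())    # [1 0 0 1]

# inverse_transform recovers the original strings from the 0/1 column
decoded = lb_demo.inverse_transform(encoded)
print(decoded)
```

The encoded array has shape `(n, 1)`, which is why `ravel()` is used here to view it as a flat vector.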


In [0]:
#Splitting the sentiment data
train_sentiments=sentiment_data[:40000]
test_sentiments=sentiment_data[40000:45000]
blind_sentiments=sentiment_data[45000:]
print(train_sentiments.shape)
print(test_sentiments.shape)
print(blind_sentiments.shape)

(40000, 1)
(5000, 1)
(5000, 1)


Now that we have both the vectorized data and the binary labels, we are ready to train a classifier model. The objective of a binary classifier is to predict a 0/1 label from the features. We use several types of classifier for comparison.

In [0]:
from sklearn.linear_model import LogisticRegression,SGDClassifier
# Train a logistic regression model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
# Fit on the TF-IDF features; ravel() flattens the (n, 1) label column into the 1-D array scikit-learn expects
lr_tfidf=lr.fit(tv_train_reviews,train_sentiments.ravel())
print(lr_tfidf)


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


Now we use the trained model to predict sentiment labels on both the test and validation data.

In [0]:
# Predict sentiment labels for the test set
lr_tfidf_predict_test=lr.predict(tv_test_reviews)
print(lr_tfidf_predict_test)

[0 0 0 ... 1 1 0]
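
`predict` returns hard 0/1 labels. When a confidence score is more useful, `predict_proba` returns class probabilities instead. The sketch below trains on a made-up four-document corpus (the texts, labels and variable names are assumptions) rather than on the IMDB features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy corpus; 1 = positive, 0 = negative
train_texts = ["great film", "wonderful acting", "terrible plot", "awful film"]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=500, random_state=42)
model.fit(vectorizer.fit_transform(train_texts), train_labels)

new_texts = ["wonderful great acting", "awful terrible plot"]
probabilities = model.predict_proba(vectorizer.transform(new_texts))

# Columns follow model.classes_, so column 1 is the probability of label 1
positive_probability = probabilities[:, 1]
print(model.classes_, positive_probability.round(3))
```

In the notebook the equivalent call would be `lr.predict_proba(tv_test_reviews)`, with the second column giving the estimated probability of a positive review.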


Next we compute the accuracy of the predictions on the test set.

In [0]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
lr_tfidf_score=accuracy_score(test_sentiments,lr_tfidf_predict_test)
print("lr_tfidf_score :",lr_tfidf_score)

# Classification report for TF-IDF features; target_names follow the sorted label order (0 = negative, 1 = positive)
lr_tfidf_report_test=classification_report(test_sentiments,lr_tfidf_predict_test,target_names=['Negative','Positive'])
print(lr_tfidf_report_test)

lr_tfidf_score : 0.882
              precision    recall  f1-score   support

    Negative       0.89      0.87      0.88      2463
    Positive       0.88      0.89      0.88      2537

    accuracy                           0.88      5000
   macro avg       0.88      0.88      0.88      5000
weighted avg       0.88      0.88      0.88      5000
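
The report summarizes precision and recall per class; a confusion matrix shows the raw counts behind those figures. This is a minimal sketch on made-up label vectors (`y_true` and `y_pred` are invented, not the notebook's predictions).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: 0 = negative, 1 = positive
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))  # (tn + tp) / total = 0.75
```

For the notebook's model the same call would be `confusion_matrix(test_sentiments, lr_tfidf_predict_test, labels=[0, 1])`, using the `confusion_matrix` import already present in the cell above.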

