# Lab 5 Text Classification

<h1 align=center><font size = 5> TEXT CLASSIFICATION APPLICATION: SENTIMENT ANALYSIS ON MOVIE REVIEWS</font></h1>




In this lab session, we will use a machine-learning-based approach to predict the polarity of a movie review. Sentiment analysis (SA) can be treated as a text classification problem when we have a sentiment corpus (a collection of texts labelled with sentiment polarity).<br>

In the figure below, part (a) shows a text classification model being built by training on labelled reviews. Part (b) shows the trained model being used to predict the sentiment of a new review. ![text classification in general](https://www.nltk.org/images/supervised-classification.png) <br>

This tutorial uses:<br>

1. Modified code from [Kavita Ganesan](https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.XbeO9egzY2w)<br>
2. A dataset of movie reviews from IMDB
3. The machine learning package [scikit-learn](https://scikit-learn.org/stable/), ref: [Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.](http://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf)<br>

**Credit to Dr. Sabrina**

In this lab tutorial, we aim to classify movie reviews by sentiment polarity, i.e., positive or negative. The tasks are:

- Read in a collection of documents (a corpus)

- Transform text into numerical vector data

- Create a classifier

- Fit/train the classifier

- Test the classifier on new data

- Evaluate performance
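
The six tasks above can be sketched end to end on a tiny example. This is only a minimal illustration under stated assumptions: the eight reviews are made up (they are not from the IMDB data), the names ending in `_toy` are hypothetical, and the accuracy on two test reviews is not meaningful.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Read in a corpus (hypothetical toy reviews, not the IMDB dataset)
corpus_toy = pd.DataFrame({
    "review": [
        "a wonderful and moving film",
        "great acting and a great story",
        "I loved every minute of it",
        "a brilliant, funny production",
        "bad plot and bad dialogue",
        "boring, slow and poorly acted",
        "I hated the terrible ending",
        "an awful waste of time",
    ],
    "sentiment": ["positive"] * 4 + ["negative"] * 4,
})

# Hold out 25% of the reviews for testing, keeping both classes in each part
train_toy, test_toy = train_test_split(
    corpus_toy, test_size=0.25, random_state=0, stratify=corpus_toy["sentiment"]
)

# 2. Transform text into numerical vectors (vocabulary learned from training data only)
vectorizer_toy = TfidfVectorizer()
X_train_toy = vectorizer_toy.fit_transform(train_toy["review"])
X_test_toy = vectorizer_toy.transform(test_toy["review"])

# 3. Create a classifier and 4. fit/train it
classifier_toy = LogisticRegression()
classifier_toy.fit(X_train_toy, train_toy["sentiment"])

# 5. Test the classifier on unseen data and 6. evaluate performance
predicted_toy = classifier_toy.predict(X_test_toy)
accuracy_toy = accuracy_score(test_toy["sentiment"], predicted_toy)
print("Toy accuracy:", accuracy_toy)
```

The rest of the lab follows the same steps with the real dataset.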

### Import Libraries

Let's start by importing the libraries.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

## Load the dataset

Since we are using Google Colab, you can use the following code to upload the dataset to Colab. 

Download it from UKMFolio first to your computer and choose the file when prompted.

In [None]:
from google.colab import files 
  
  
uploaded = files.upload()

Saving IMDB Dataset.csv to IMDB Dataset.csv


Next, we can load the CSV file into a pandas DataFrame to view its contents.

In [None]:
# movie review - provided by IMDB
df= pd.read_csv("IMDB Dataset.csv")
print(df)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


Take a quick look at the 'sentiment' column:

In [None]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

## Feature extractor

We will use term-frequency features (CountVectorizer, either binary or raw counts) and TF-IDF feature weighting (TfidfVectorizer).

In [None]:
#feature extraction
#field = TEXT column
def extract_features(df,field,training_data,testing_data,type="binary"):
  # NOTE: df is unused and type shadows the built-in; both are kept so existing calls still work
  if "binary" in type:
    # binary representation: 1 if the term occurs in the document, else 0
    vectorizer=CountVectorizer(binary=True, max_df=0.95)
  elif "counts" in type:
    # count-based representation: raw term frequencies
    vectorizer=CountVectorizer(binary=False, max_df=0.95)
  else:
    # TF-IDF based feature representation (any other value, e.g. "tf")
    vectorizer=TfidfVectorizer(use_idf=True, max_df=0.95)
  # learn the vocabulary from the training data only, then reuse it for the test data
  train_feature_set=vectorizer.fit_transform(training_data[field].values)
  test_feature_set=vectorizer.transform(testing_data[field].values)
  return train_feature_set,test_feature_set,vectorizer
  

## Split the data into train & test sets:
For a supervised learning approach, we need to split the dataset into a training set and a test set. The training set is used to learn patterns from the data.
The test set is used to evaluate the performance of the classification model.
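
As a small sketch of how the split behaves: when no `test_size` is given, `train_test_split` holds out 25% of the rows, so the 50,000 reviews become 37,500 training and 12,500 test reviews. The toy DataFrame and variable names below are hypothetical. The `stratify` option is not used in this lab; it is shown only as an optional way to keep the class proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows, 4 positive and 4 negative
split_demo_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(8)],
    "sentiment": ["positive", "negative"] * 4,
})

# Default split: 75% training, 25% test; random_state makes the split repeatable
train_part, test_part = train_test_split(split_demo_df, random_state=2000)
print(len(train_part), len(test_part))  # 6 2

# Optional: stratify on the label so both parts keep the same class balance
train_strat, test_strat = train_test_split(
    split_demo_df, test_size=0.25, random_state=2000,
    stratify=split_demo_df["sentiment"],
)
print(test_strat["sentiment"].value_counts().to_dict())
```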

In [None]:
#create features
#field  - column name contains the review text
#feature_rep   - can be binary, counts or tf
field = 'review'
feature_rep = 'tf'
# GET A TRAIN TEST SPLIT (set seed for consistent results)
training_data,testing_data = train_test_split(df,random_state = 2000)
# GET FEATURES
X_train,X_test, feature_transformer=extract_features(df,field,training_data,testing_data,type=feature_rep)
# GET LABELS
Y_train=training_data['sentiment'].values
Y_test=testing_data['sentiment'].values


### Build and Evaluate Classifier model
In this example, two classifiers are used, i.e., Logistic Regression and Multinomial Naive Bayes.

In [None]:
#build the classifier model - logistic regression
from sklearn.linear_model import LogisticRegression
scikit_log_reg = LogisticRegression(verbose=1, solver='liblinear',random_state=0, C=5, penalty='l2',max_iter=1000)
model_LR=scikit_log_reg.fit(X_train,Y_train)
lr_predicted= model_LR.predict(X_test)
print("Logistic Regression with TFIDF:",metrics.accuracy_score(Y_test, lr_predicted))

[LibLinear]Logistic Regression with TFIDF: 0.89976


In [None]:
#build the classifier model - Naive Bayes
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB().fit(X_train, Y_train)
nb_predicted= model_nb.predict(X_test)
print("MultinomialNB Accuracy with TFIDF:",metrics.accuracy_score(Y_test, nb_predicted))

MultinomialNB Accuracy with TFIDF: 0.85696


Which classifier performs better?
Logistic Regression has the higher accuracy on this test split (0.89976, compared with 0.85696 for MultinomialNB).
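
Accuracy is a single number; a confusion matrix and a per-class report show where a classifier makes its mistakes. The sketch below uses made-up labels so that it runs on its own. In the notebook, `Y_test` and `lr_predicted` (or `nb_predicted`) would take the place of the hypothetical `y_true_demo` and `y_pred_demo`.

```python
from sklearn import metrics

# Hypothetical labels for illustration only
y_true_demo = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred_demo = ["positive", "positive", "negative", "negative", "negative", "positive"]

demo_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
print(round(demo_accuracy, 3))  # 0.667 (4 of 6 correct)

# Rows are the true classes, columns the predicted classes, in the order given by labels
demo_cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=["negative", "positive"])
print(demo_cm)  # [[2 1], [1 2]]

# Precision, recall and F1 for each class
print(metrics.classification_report(y_true_demo, y_pred_demo))
```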

## Lab Task

Modify the program in the following ways and report the trend that you observe from the output.

- Use a different corpus with similar labels. You may get the corpus from Kaggle.com or any other source (even from a different domain).


- Select the features to be used (text representation, e.g., TF-IDF).

- Use three different classifiers (e.g., SVM, Naive Bayes, KNN, LR) and observe how this affects the performance.

Write a short report on the trends that you observed. Make sure to include some comments on the changes that you made in the program.

The deadline for this assignment is **9 Dec 2022**. Share the file link in UKMFolio.

## Selective Stock Headlines Sentiment corpus
Source: https://www.kaggle.com/datasets/ryanchan911/selective-stock-headlines-sentiment

This dataset includes social media headlines (Twitter) of selected stocks and their sentiments (positive/negative). All data was collected from the internet using Beautiful Soup, with basic data processing.

### Step 1: Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

### Step 2: Load the dataset

In [None]:
from google.colab import files 

#Upload Stock.csv
uploaded = files.upload() 

Saving Stock.csv to Stock.csv


Next, we can load the CSV file into a pandas DataFrame to view its contents.

In [None]:
df= pd.read_csv("Stock.csv")
print(df)

                 datetime                                           headline  \
0     01/16/2020 05:25 AM  $MMM fell on hard times but could be set to re...   
1           01-11-20 6:43  Wolfe Research Upgrades 3M $MMM to ¡§Peer Perf...   
2           01-09-20 9:37  3M $MMM Upgraded to ¡§Peer Perform¡¨ by Wolfe ...   
3          01-08-20 17:01  $MMM #insideday follow up as it also opened up...   
4           01-08-20 7:44  $MMM is best #dividend #stock out there and do...   
...                   ...                                                ...   
9465        04-11-19 1:24  $WMT - Walmart shifts to remodeling vs. new st...   
9466        04-10-19 6:05  Walmart INC $WMT Holder Texas Permanent School...   
9467        04-09-19 4:38  $WMT $GILD:3 Dividend Stocks Perfect for Retir...   
9468        04-09-19 4:30  Walmart expanding use of #robots to scan shelv...   
9469        04-09-19 4:11  $WMT Walmart plans to add thousands of robot h...   

     ticker  sentiment  
0       MMM   

### Step 3: Drop the datetime and ticker columns
We only need the headlines and the sentiment in this program.

In [None]:
df = df.drop('datetime', axis=1)
df = df.drop('ticker', axis=1)
df

Unnamed: 0,headline,sentiment
0,$MMM fell on hard times but could be set to re...,0
1,Wolfe Research Upgrades 3M $MMM to ¡§Peer Perf...,1
2,3M $MMM Upgraded to ¡§Peer Perform¡¨ by Wolfe ...,1
3,$MMM #insideday follow up as it also opened up...,1
4,$MMM is best #dividend #stock out there and do...,0
...,...,...
9465,$WMT - Walmart shifts to remodeling vs. new st...,1
9466,Walmart INC $WMT Holder Texas Permanent School...,0
9467,$WMT $GILD:3 Dividend Stocks Perfect for Retir...,1
9468,Walmart expanding use of #robots to scan shelv...,1


### Step 4: Replace sentiment (1 to positive) and (0 to negative)
This makes the sentiment labels easier to read.

In [None]:
df['sentiment'] = df['sentiment'].replace([1], 'positive')
df['sentiment'] = df['sentiment'].replace([0], 'negative')
df

Unnamed: 0,headline,sentiment
0,$MMM fell on hard times but could be set to re...,negative
1,Wolfe Research Upgrades 3M $MMM to ¡§Peer Perf...,positive
2,3M $MMM Upgraded to ¡§Peer Perform¡¨ by Wolfe ...,positive
3,$MMM #insideday follow up as it also opened up...,positive
4,$MMM is best #dividend #stock out there and do...,negative
...,...,...
9465,$WMT - Walmart shifts to remodeling vs. new st...,positive
9466,Walmart INC $WMT Holder Texas Permanent School...,negative
9467,$WMT $GILD:3 Dividend Stocks Perfect for Retir...,positive
9468,Walmart expanding use of #robots to scan shelv...,positive


Count the positive and negative labels:

In [None]:
df['sentiment'].value_counts()

positive    5482
negative    3988
Name: sentiment, dtype: int64
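
The two classes are not perfectly balanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the majority class. The short calculation below uses the counts printed above; the variable names are hypothetical, and the ratio in the test split may differ slightly from the ratio in the full dataset.

```python
# Class counts taken from the value_counts() output above
positive_count = 5482
negative_count = 3988

total = positive_count + negative_count
# Accuracy of always predicting the larger class
majority_baseline = max(positive_count, negative_count) / total

print(total)                        # 9470
print(round(majority_baseline, 3))  # 0.579
```

Any classifier in the later steps should be compared against this baseline rather than against 0.5.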

### Step 5: Feature extractor
The features used are term frequency (CountVectorizer) and TF-IDF feature weighting (TfidfVectorizer). In scikit-learn, TfidfVectorizer computes both the term frequencies and the inverse document frequencies in one step, whereas TfidfTransformer only applies the TF-IDF weighting and needs a CountVectorizer to compute the term frequencies first.
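
The relationship between the two implementations can be checked with a short sketch. The headlines below are made up for illustration, and default settings are assumed for all three classes.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy headlines
demo_headlines = [
    "stock upgraded to buy",
    "stock falls on weak earnings",
    "dividend stock upgraded",
]

# One step: term frequency and inverse document frequency together
one_step = TfidfVectorizer().fit_transform(demo_headlines)

# Two steps: raw counts first, then TF-IDF weighting on top of the counts
term_counts = CountVectorizer().fit_transform(demo_headlines)
two_step = TfidfTransformer().fit_transform(term_counts)

same_result = np.allclose(one_step.toarray(), two_step.toarray())
print(same_result)  # True
```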


In [None]:
#feature extraction
#field = TEXT column
def extract_features(df,field,training_data,testing_data,type="binary"):
  # NOTE: df is unused and type shadows the built-in; both are kept so existing calls still work
  if "binary" in type:
    # binary representation: 1 if the term occurs in the document, else 0
    vectorizer=CountVectorizer(binary=True, max_df=0.95)
  elif "counts" in type:
    # count-based representation: raw term frequencies
    vectorizer=CountVectorizer(binary=False, max_df=0.95)
  else:
    # TF-IDF based feature representation (any other value, e.g. "tf")
    vectorizer=TfidfVectorizer(use_idf=True, max_df=0.95)
  # learn the vocabulary from the training data only, then reuse it for the test data
  train_feature_set=vectorizer.fit_transform(training_data[field].values)
  test_feature_set=vectorizer.transform(testing_data[field].values)
  return train_feature_set,test_feature_set,vectorizer
  

### Step 6: Split the data into train & test sets:
For a supervised learning approach, we need to split the dataset into a training set and a test set. The training set is used to learn patterns from the data.
The test set is used to evaluate the performance of the classification model.

In [None]:
#create features
#field  - name of the column that contains the headline text
#feature_rep   - can be binary, counts or tf
field = 'headline'
feature_rep = 'tf' #we will use TF-IDF feature weighting
# GET A TRAIN TEST SPLIT (set seed for consistent results)
training_data,testing_data = train_test_split(df,random_state = 2000)
# GET FEATURES
X_train,X_test, feature_transformer=extract_features(df,field,training_data,testing_data,type=feature_rep)
# GET LABELS
Y_train=training_data['sentiment'].values
Y_test=testing_data['sentiment'].values


### Step 7: Build and Evaluate Classifier model
Three different classifiers will be used:

**1. Logistic Regression**

In [None]:
#build the classifier model - logistic regression
from sklearn.linear_model import LogisticRegression
scikit_log_reg = LogisticRegression(verbose=1, solver='liblinear',random_state=0, C=5, penalty='l2',max_iter=1000)
model_LR=scikit_log_reg.fit(X_train,Y_train)
lr_predicted= model_LR.predict(X_test)
print("Logistic Regression with TFIDF:",metrics.accuracy_score(Y_test, lr_predicted))

[LibLinear]Logistic Regression with TFIDF: 0.981418918918919


**2. Multinomial Naive Bayes**

In [None]:
#build the classifier model - Naive Bayes
from sklearn.naive_bayes import MultinomialNB
model_nb = MultinomialNB().fit(X_train, Y_train)
nb_predicted= model_nb.predict(X_test)
print("MultinomialNB Accuracy with TFIDF:",metrics.accuracy_score(Y_test, nb_predicted))

MultinomialNB Accuracy with TFIDF: 0.8982263513513513


**3. Support Vector Machine**

In [None]:
#Import svm model
from sklearn import svm
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train, Y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy: how often is the classifier correct?
print("SVM Accuracy with TFIDF:",metrics.accuracy_score(Y_test, y_pred))

SVM Accuracy with TFIDF: 0.984375


CountVectorizer converts a collection of strings into a term-frequency representation. TF-IDF is often preferred over raw counts because it considers not only how often a word occurs in a document but also how informative it is: words that appear in most documents receive a lower weight. Uninformative words can then be filtered out (here, max_df=0.95 drops terms that occur in more than 95% of the documents), which reduces the input dimensions and makes model building less complex. Only the TF-IDF representation was evaluated in the steps above.
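
Whether TF-IDF actually beats binary or count features depends on the corpus, so it is worth measuring. The sketch below shows one possible way to compare the three representations with cross-validation. The twelve headlines are made up (they are not from Stock.csv), the variable names are hypothetical, and scores on such a small toy set say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy headlines: 6 positive followed by 6 negative
toy_headlines = [
    "shares upgraded to buy", "strong earnings lift the stock",
    "best dividend stock to own", "profit beats estimates again",
    "analysts raise price target", "stock rallies on strong sales",
    "shares downgraded to sell", "weak earnings sink the stock",
    "dividend cut after heavy losses", "profit misses estimates again",
    "analysts cut price target", "stock drops on weak sales",
]
toy_labels = ["positive"] * 6 + ["negative"] * 6

representations = {
    "binary": CountVectorizer(binary=True),
    "counts": CountVectorizer(binary=False),
    "tf-idf": TfidfVectorizer(use_idf=True),
}

rep_scores = {}
for rep_name, rep_vectorizer in representations.items():
    # The pipeline refits the vectorizer inside each fold, so no test text leaks into training
    rep_pipeline = make_pipeline(rep_vectorizer, LogisticRegression(solver="liblinear"))
    fold_scores = cross_val_score(rep_pipeline, toy_headlines, toy_labels, cv=3)
    rep_scores[rep_name] = fold_scores.mean()

for rep_name, mean_score in rep_scores.items():
    print(f"{rep_name}: {mean_score:.3f}")
```

In the notebook, the same comparison can be made by changing feature_rep to 'binary' or 'counts' in Step 6 and rerunning Step 7.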

The best-performing classifier is **SVM**: its accuracy (0.984) is the highest of the three when using the TF-IDF feature representation. Logistic Regression is close behind (0.981), and Multinomial Naive Bayes is noticeably lower (0.898).
The Naive Bayes algorithm relies on the assumption that features are conditionally independent given the class, which rarely holds for words in text and may explain its lower accuracy. SVMs tend to work well with unstructured and semi-structured data such as text and images, while logistic regression works with already identified independent variables.

NAME: CHONG WEI YI

MATRIC NO: A180497