

# **MIS710 Lab 10 Week 10 Deployment Solution**
Author: Associate Professor Lemai Nguyen

Objectives: 
1. To learn text analytics and NLP basics
2. To apply the basic skills to the well-known IMDb movie review dataset compiled by Stanford researcher Andrew Maas and colleagues.
3. To apply the basic skills to another review dataset.
4. To learn basic MLOps: saving your model, then loading and using it later.


# **1. Import libraries and functions**

In [None]:
# import libraries 
import pandas as pd #for data manipulation and analysis
import numpy as np
 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from sklearn import metrics 
from sklearn.metrics import classification_report, confusion_matrix

# **2. Case One: IMDb**

**Sentiment analysis**

**Context**
IMDb stands for the Internet Movie Database, which is an online database of information related to films, television programs, and video games. It contains a vast collection of data on various aspects of the entertainment industry, including cast and crew information, production details, plot summaries, and user ratings and reviews.

**Content**
The IMDb dataset has been widely used in sentiment analysis research. The dataset contains 50,000 movie reviews. Each review is labeled as either "positive" or "negative" based on the overall sentiment expressed in the review. 

The dataset is split into 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is also available. Both raw text and a preprocessed bag-of-words format are provided.

**Inspiration**
To train and test a sentiment analysis model

**Further information**:
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011). 

http://ai.stanford.edu/~amaas/data/sentiment/


## **3. ML Operationalisation**

### **3.2 Load and use the model**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pickle

In [None]:
#modify the code below to point to your pickled model
#or use my models from here https://github.com/VanLan0/MIS710-ML/tree/main/Pickled
path = '/content/drive/MyDrive/Colab Notebooks/MIS710/IMDB_ann_clf.pickle'

In [None]:
with open(path, 'rb') as f:
    model = pickle.load(f)
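
Loading is the mirror image of the saving step that produced the file. The sketch below shows the full round trip under stated assumptions: a plain dictionary stands in for a trained model, and the path is a hypothetical temporary file rather than the Google Drive path used in this lab.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (assumption: any picklable Python object behaves the same way)
toy_model = {"name": "toy_sentiment_model", "classes": ["negative", "positive"]}

# Hypothetical path in a temporary directory, not the lab's Google Drive path
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_model.pickle")

# Save: open the file in binary write mode and dump the object
with open(toy_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load: open the file in binary read mode and restore the object
with open(toy_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model == toy_model)
```

Note that `pickle.load` can execute arbitrary code, so only unpickle files from sources you trust.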


In [None]:
# Load the vocabulary used to preprocess the training data
#Modify the code below to point to your vocabulary
#or use my pickled vocabulary from here https://github.com/VanLan0/MIS710-ML/tree/main/Pickled
with open('/content/drive/MyDrive/Colab Notebooks/MIS710/IMDB_vocabulary.pkl', 'rb') as file:
    vocabulary = pickle.load(file)

In [None]:
#check the size of the vocabulary
print("Size of vocabulary:", len(vocabulary))

Size of vocabulary: 156040


We load new reviews from the 'production line'.



In [None]:
url='https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/DrCha_reviews.csv'
new_reviews = pd.read_csv(url)
print(new_reviews)

                                               review sentiment
0   I just watched the first episode and it’s 10/1...  positive
1   this is going to be absolutely hilarious I can...  positive
2   Accidentally watched the first episode, it was...  positive
3                          This actress is hilarious!  positive
4                             Looks very entertaining  positive
5                 annoying unnecesarry romantic scene  negative
6                           I love this drama so much  positive
7   Too convoluted and over dramatic. Dr Cha neede...  negative
8   Really good to see Km Byung Chul in a more com...  positive
9   disappointed if the writers keep this marriage...  negative
10  I'm addicted to this movie, but If she doesn't...  positive
11      I love it. Can't wait to see the next episode  positive
12  The movie really disappoint me, his mother wil...  negative
13  Loved the first 2 episodes and excited for the...  positive


In practice, we should build a data pipeline that automates data pre-processing. For now, let's repeat the pre-processing steps that were applied to the training data.

In [None]:
#import the Python module re to work with regular expressions
import re

In [None]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

In [None]:
def lowercasing(text):
  # Convert to lowercase
  text = text.lower()
  return text
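
As a small step towards the data pipeline mentioned earlier, the two helpers can be chained into a single function so that new text always goes through the same steps in the same order. This is only a minimal sketch: `preprocess` and the sample review are hypothetical and not part of the original lab, and the two helpers are repeated here so the cell runs on its own.

```python
import re

def clean_text(text):
    # Remove HTML tags, punctuation and extra whitespace (same logic as above)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def lowercasing(text):
    # Convert to lowercase
    return text.lower()

def preprocess(text, steps=(clean_text, lowercasing)):
    # Hypothetical helper: apply each pre-processing step in order
    for step in steps:
        text = step(text)
    return text

# Hypothetical sample review
sample = "<br />I LOVED it!!  Can't wait   for more."
print(preprocess(sample))
```

Keeping the steps in one place makes it harder to forget a step, or to apply steps in a different order, when the model is used on new data.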

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# define the stopword set, excluding negation words
stop_words = set(stopwords.words('english'))
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
filtered_words = {word for word in stop_words if word not in negation_words}

In [None]:
#define a function to perform tokenization, stemming and lemmatization
def tokenize_stem_lemmatize(text):
  #tokenization
  tokens = nltk.word_tokenize(text.lower())
    
  #initialize stemmer and lemmatizer  
  stemmer = PorterStemmer()
  lemmatizer = WordNetLemmatizer()

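  #perform stemming, then lemmatize the stems that pass the filters
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  #note: filtering runs on stems, so stems such as 'thi' or 'wa' are kept,
  #and the second condition also drops the negation words themselves
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in stemmed_tokens if token not in filtered_words and token.lower() not in negation_words]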
  return ' '.join(lemmatized_tokens)

In [None]:
# Apply the clean_text function to the 'review' column (one possible solution)
new_reviews['review'] = new_reviews['review'].apply(clean_text)


In [None]:
# Apply the lowercasing function to the 'review' column (one possible solution)
new_reviews['review'] = new_reviews['review'].apply(lowercasing)



In [None]:
# Tokenize, stem, and lemmatize the 'review' column (one possible solution)
processed_text = new_reviews['review'].apply(tokenize_stem_lemmatize)



In [None]:
processed_text

0     watch first episod 1010 im love mom also left ...
1         thi go absolut hilari cant wait synopsi funni
2     accident watch first episod wa good couldnt st...
3                                    thi actress hilari
4                                   look veri entertain
5                        annoy unnecesarri romant scene
6                                   love thi drama much
7     convolut dramat dr cha need assert confid fema...
8               realli good see km byung chul comed set
9         disappoint writer keep thi marriag togeth end
10    im addict thi movi doesnt divorc husband watch...
11                       love cant wait see next episod
12    movi realli disappoint hi mother pain find hi ...
13                    love first 2 episod excit remaind
Name: review, dtype: object

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

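# Create a vector representation of the text data using the pickled vocabulary (one possible solution)
vectorizer = CountVectorizer(vocabulary=vocabulary)

# Transform processed_text with the fixed vocabulary and save the result in X_vec
X_vec = vectorizer.transform(processed_text)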
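
The pickled vocabulary matters because the model expects every column of its input to mean the same word it meant during training. The plain-Python sketch below illustrates the idea of counting against a fixed vocabulary. It is not scikit-learn's implementation, and `toy_vocabulary` and `count_vector` are hypothetical names with a tiny made-up vocabulary standing in for the real one of 156,040 entries.

```python
from collections import Counter

# Hypothetical miniature vocabulary: token -> column index
toy_vocabulary = {"love": 0, "disappoint": 1, "episod": 2, "hilari": 3}

def count_vector(text, vocabulary):
    """Return token counts laid out in the column order fixed by `vocabulary`."""
    counts = Counter(text.split())
    row = [0] * len(vocabulary)
    for token, column in vocabulary.items():
        row[column] = counts[token]
    return row

# Two processed reviews taken from the output above
docs = ["love first 2 episod excit remaind", "movi realli disappoint"]
vectors = [count_vector(doc, toy_vocabulary) for doc in docs]
print(vectors)
```

Tokens that are not in the vocabulary (for example `excit` or `movi` here) are simply ignored, so every review becomes a vector of the same length, whatever words it contains.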



In [None]:
# Use the model to make predictions on X_vec (one possible solution)
y_pred = model.predict(X_vec)


In [None]:
#combine the actual labels with the predicted values in a data frame
inspection=pd.DataFrame({'Actual':new_reviews['sentiment'], 'Predicted':y_pred})

#add the review text as the first column of the new data frame
inspection=pd.concat([new_reviews['review'],inspection], axis=1)

inspection

                                              review    Actual  Predicted
0  i just watched the first episode and its 1010 ...  positive   positive
1  this is going to be absolutely hilarious i can...  positive   positive
2  accidentally watched the first episode it was ...  positive   positive
3                          this actress is hilarious  positive   positive
4                            looks very entertaining  positive   positive
5                annoying unnecesarry romantic scene  negative   negative
6                          i love this drama so much  positive   positive
7  too convoluted and over dramatic dr cha needed...  negative   negative
8  really good to see km byung chul in a more com...  positive   positive
9  disappointed if the writers keep this marriage...  negative   negative


In [None]:
#print confusion matrix and evaluation report
y_test=new_reviews['sentiment']
cm=confusion_matrix(y_test, y_pred)

print(cm)
print(classification_report(y_test, y_pred))

[[4 0]
 [1 9]]
              precision    recall  f1-score   support

    negative       0.80      1.00      0.89         4
    positive       1.00      0.90      0.95        10

    accuracy                           0.93        14
   macro avg       0.90      0.95      0.92        14
weighted avg       0.94      0.93      0.93        14
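
To see where the figures in the report come from, the per-class metrics can be recomputed by hand from the confusion matrix. The sketch below assumes the scikit-learn layout, in which rows are actual classes, columns are predicted classes, and labels are sorted alphabetically (negative, then positive). The variable names are illustrative.

```python
labels = ["negative", "positive"]

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = [[4, 0],
      [1, 9]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actual_label = sum(cm[i])                                       # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actual_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

For example, five reviews were predicted as negative and four of them really are negative, which gives the negative precision of 4/5 = 0.80. Overall, 13 of the 14 reviews are classified correctly, hence the accuracy of 0.93.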



You can test the model on another dataset, but make sure you apply the same preprocessing steps and the same vocabulary first.

# **Congratulations**
Well done, you have loaded and used your first pre-trained model!


