
# Activity 1.3.4 Sentiment analysis of movie reviews

## Scenario
As part of a market research exercise for a film studio planning a new science-fiction film, you have been tasked with a data science project to research customer feedback on films in a related genre. One question you will be asked to investigate is whether there’s a relationship between the proportion of feedback that is positive and production budgets. Before you compare sentiment scores between films, however, you need to construct a viable preprocessing pipeline and train a model.


## Objective
In this portfolio activity, you will apply what you have learned about data preprocessing for NLP, as well as a technique for calculating similarity.


## Assessment criteria

By completing this activity, you will be able to provide evidence that you can:
*   construct an NLP classification model capable of predicting binary sentiment class (positive or negative) with high accuracy
*   perform supporting calculations including cosine similarity to enable broad discussion of the techniques, and evaluate model performance
*   report your findings in language that a non-specialist would understand.


## Activity guidance

1. Install the necessary packages that will be useful in this activity.
2. Load the dataset sst2 from Hugging Face (https://huggingface.co/datasets/sst2).
3. Create dataframes of the train and validation splits.
4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
7. Comment on the cosine similarity scores.
8. Create a preprocessing function to perform several processing steps as described below to the train and validation texts:
 - Remove any punctuation and HTML tags.
 - Tokenize the text into tokens.
 - Remove stop words from your text.
 - Perform lemmatisation and stemming on your text (one at a time).

9. Obtain Bag-of-Words and TF-IDF representations for both the train and validation splits.
10. Train a logistic regression model using scikit-learn first with the Bag-of-Words and then TF-IDF, and report the performance of the sentiment classifier.


> Start your activity here. Select the pen from the toolbar to add your entry.

In [None]:
#In this activity, you will be required to download a data set from Hugging Face and perform text classification on the data set.
#You will be required to study the impact of different parameter choices on the classification performance of the sentiment classifier.


#1. Install the necessary packages that will be useful in this activity.
#2. Load the data set sst2 from Hugging Face (https://huggingface.co/datasets/sst2).
#3. Create dataframes of the train and validation split.
#4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
#5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
#6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
#7. Comment on the cosine similarity scores.
#8. Create a preprocessing function to perform several processing steps as described below to the train and validation texts:
     #1. Remove any punctuation and HTML tags.
     #2. Tokenise the text into tokens.
     #3. Remove stop words from your text.
     #4. Perform lemmatisation and stemming on your text (one at a time).

#9.  Obtain Bag-of-Words and TF-IDF representations for both the train and validation splits.
#10. Train a logistic regression model using scikit-learn first with the Bag-of-Words and then TF-IDF, and report the performance of the sentiment classifier.
#11. What is the impact of not removing stop words on the performance of the sentiment classifier?
#12. Which one is more important to the performance of the sentiment classifier (lemmatisation or stemming)?

In [None]:
#1. Install the necessary packages that will be useful in this activity.
!pip install datasets
!pip install nltk
!pip install spacy
!pip install beautifulsoup4
!python -m spacy download en_core_web_lg


In [None]:
#2. Load the data set from Hugging Face. The guidance specifies sst2 (https://huggingface.co/datasets/sst2); this run loads yelp_polarity, which provides train and test splits.
from datasets import load_dataset
dataset = load_dataset("yelp_polarity")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})

In [None]:
#3. Create dataframes of the train and test splits.
text_train = dataset['train']['text']
label_train = dataset['train']['label']
text_test = dataset['test']['text']
label_test = dataset['test']['label']

In [None]:
import pandas as pd
df_train = pd.DataFrame({'text': text_train, 'label': label_train})

In [None]:
df_test = pd.DataFrame({'text': text_test, 'label': label_test})

In [None]:
#4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
import spacy
nlp = spacy.load("en_core_web_lg")

text_5th = df_train['text'][4]
text_100th = df_train['text'][99]

doc1 = nlp(text_5th)
doc2 = nlp(text_100th)

doc1.similarity(doc2)

0.7825863255400825

In [None]:
#5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
text_5th = df_train['text'][4]
text_15000th = df_train['text'][14999]

doc1 = nlp(text_5th)
doc2 = nlp(text_15000th)

doc1.similarity(doc2)

0.8541770249315613

In [None]:
#6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
text_5th = df_train['text'][4]
text_50000th = df_train['text'][49999]

doc1 = nlp(text_5th)
doc2 = nlp(text_50000th)

doc1.similarity(doc2)

0.762340969981936

In [None]:

#7. Comment on the cosine similarity scores.
# The 5th text is most similar to the 15,000th text (0.85), followed by the 100th text (0.78) and the 50,000th text (0.76).
# All three scores are fairly high and close together, likely because averaged word vectors of texts from the same domain tend to point in similar directions.
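
The scores above come from spaCy's `Doc.similarity`, which by default compares the averaged word vectors of the two texts using cosine similarity. The calculation itself can be sketched without any library. The three-dimensional "word vectors" below are hypothetical values chosen for illustration, not vectors from `en_core_web_lg`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 3-dimensional word vectors for two short texts.
text_a = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
text_b = [[0.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

score = cosine_similarity(mean_vector(text_a), mean_vector(text_b))
print(round(score, 4))
```

A score of 1 means the two averaged vectors point in the same direction, and 0 means they are orthogonal. Because averaging smooths out individual words, long texts on similar topics often score high even when their sentiment differs.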

In [None]:
#8. Create a preprocessing function to perform several processing steps as described below on the train and test texts:
     #1. Remove any punctuation and HTML tags.
     #2. Tokenise the text into tokens.
     #3. Remove stop words from your text.
     #4. Perform lemmatisation on your text.
     #5. Perform stemming on your text.



import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import nltk
import string
from bs4 import BeautifulSoup
nltk.download('all')


lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Build the stop-word set and punctuation table once.
# Calling stopwords.words('english') for every token is very slow on a large data set.
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_text(text):

    # Strip HTML tags, then remove punctuation.
    text = BeautifulSoup(text, "html.parser").get_text()
    text = text.translate(PUNCT_TABLE)

    # Lowercase and tokenise the text.
    tokens = word_tokenize(text.lower())

    # Remove stop words.
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatise the tokens.
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Optional stemming step (disabled in this run).
    #stemmatized_tokens = [stemmer.stem(token) for token in  lemmatized_tokens]

    # Join the tokens back into a string.
    processed_text = ' '.join(lemmatized_tokens)

    return processed_text
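
To see what each cleaning step contributes, the sketch below traces one sentence through the same sequence using only the standard library. It is a simplified stand-in, not the function above: a regular expression replaces BeautifulSoup, whitespace splitting replaces `word_tokenize`, and `TOY_STOP_WORDS` is a hypothetical eight-word list far shorter than NLTK's. Lemmatisation is left out because it needs the WordNet data.

```python
import re
import string

# Illustrative stop list only; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "an", "was", "is", "and", "of", "it"}

def trace_preprocessing(text):
    """Return the intermediate result of each cleaning step."""
    # Replace anything that looks like an HTML tag with a space.
    no_html = re.sub(r"<[^>]+>", " ", text)
    # Delete ASCII punctuation characters.
    no_punct = no_html.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and split on whitespace.
    tokens = no_punct.lower().split()
    # Drop stop words.
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]
    return {"no_html": no_html, "no_punct": no_punct, "tokens": tokens, "kept": kept}

steps = trace_preprocessing("<p>The plot was thin, and the acting wooden!</p>")
for name, value in steps.items():
    print(f"{name}: {value!r}")
```

Printing the intermediate values makes it easier to judge whether a step removes information the classifier might need, which is the question behind items 11 and 12 in the task list.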



In [None]:
df_train['cleanText'] = df_train['text'].apply(preprocess_text)

df_test['cleanText'] = df_test['text'].apply(preprocess_text)



In [None]:
df_train

In [None]:
x_train_texts = df_train['cleanText'].tolist()
y_train = df_train['label'].tolist()

x_test_texts = df_test['cleanText'].tolist()
y_test = df_test['label'].tolist()

In [None]:
#9. Obtain Bag-of-Words and TF-IDF representations for both the train and test splits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

## To run Bag-of-Words, keep this line active.
vectorizer = CountVectorizer(max_features = 3000)

## To run TF-IDF, comment out the line above and uncomment this line.
#vectorizer = TfidfVectorizer(max_features = 3000)

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
# Refitting on the test texts would assign words to different columns than the model was trained on.
X_train = vectorizer.fit_transform(x_train_texts)
X_test = vectorizer.transform(x_test_texts)
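
Why the test texts must be encoded with the vocabulary learned from the training texts can be shown with a small pure-Python Bag-of-Words sketch. The helper names `fit_vocabulary` and `transform` are hypothetical and only mimic the idea behind a vectoriser's fit and transform steps. They are not scikit-learn's implementation.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct token in the given texts to a column index."""
    tokens = sorted({tok for text in texts for tok in text.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Count tokens per text, using only columns from the given vocabulary."""
    rows = []
    for text in texts:
        counts = Counter(text.split())
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens unseen during fitting are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_texts = ["good film", "bad film"]
test_texts = ["good good ending"]

vocab = fit_vocabulary(train_texts)
X_train_toy = transform(train_texts, vocab)
X_test_toy = transform(test_texts, vocab)         # same columns as training

refit_vocab = fit_vocabulary(test_texts)          # vocabulary rebuilt from test texts
X_test_refit = transform(test_texts, refit_vocab)

print(vocab, X_test_toy)
print(refit_vocab, X_test_refit)
```

With the training vocabulary, "good" stays in the column the model learned for it. After refitting on the test texts, the same word lands in a different column and the matrix has a different width, so a model trained on the first layout reads the wrong features.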

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
#10. Train a logistic regression model using scikit-learn first with the Bag-of-Words and then TF-IDF, and report the performance of the sentiment classifier.
# Train a classification model.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
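
What a logistic regression classifier learns from count features can be sketched with a few lines of plain Python. This is a minimal illustration under stated assumptions: it uses unregularised stochastic gradient descent on a hypothetical four-row toy matrix with columns `['bad', 'film', 'good']`, whereas scikit-learn's `LogisticRegression` applies regularisation and a different solver by default.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and a bias with plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss with respect to the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    """Return 1 if the predicted probability exceeds 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0

# Hypothetical toy counts; columns are ['bad', 'film', 'good'].
X_toy = [[0, 1, 1], [1, 1, 0], [0, 0, 2], [2, 1, 0]]
y_toy = [1, 0, 1, 0]

w, b = train_logistic(X_toy, y_toy)
print([round(wj, 2) for wj in w], round(b, 2))
print([predict(x, w, b) for x in X_toy])
```

The learned weight for "good" is positive and the weight for "bad" is negative, which is the sense in which the model associates individual columns with a sentiment class.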

In [None]:
# Evaluate the model.
y_pred = clf.predict(X_test)
from sklearn.metrics import classification_report

# classification_report expects the true labels first, then the predictions.
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.43      0.52      0.47       354
           1       0.61      0.53      0.57       518

    accuracy                           0.52       872
   macro avg       0.52      0.52      0.52       872
weighted avg       0.54      0.52      0.53       872
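
The columns in a classification report can be reproduced by hand from the counts of correct and incorrect predictions. The sketch below computes them for the positive class on a hypothetical six-item example. The labels are invented for illustration and are not taken from the run above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical labels: four positive and two negative reviews.
y_true_toy = [1, 1, 1, 1, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found. Swapping the true and predicted labels swaps these two values, so the argument order matters when reading a report.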



In [None]:
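# TF-IDF versus Bag-of-Words.
# TF-IDF achieves an accuracy of 51%, while Bag-of-Words achieves an accuracy of 52%.
# Both scores are close to the 50% expected from chance on a binary task, so neither representation shows a clear advantage here.
# Accuracy this low usually points to a fault in the pipeline, such as test features built from a different vocabulary than the training features, rather than a limit of the representation itself.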
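
The difference between the two representations is the weighting: Bag-of-Words keeps raw counts, while TF-IDF scales each count down for words that appear in many documents. The sketch below is a pure-Python illustration that assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults. The three texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(texts):
    """Return the sorted vocabulary and one L2-normalised TF-IDF row per text."""
    docs = [text.split() for text in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many texts each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(wt * wt for wt in weights)) or 1.0
        rows.append([wt / norm for wt in weights])
    return vocab, rows

tfidf_vocab, tfidf_matrix = tfidf_rows(["good film", "bad film", "good good ending"])
print(tfidf_vocab)
for row in tfidf_matrix:
    print([round(value, 3) for value in row])
```

In the second text, "bad" receives a higher weight than "film" because "film" also occurs in another text. Raw counts would treat both words equally.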