<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_C301_Sentiment_analysis_Activity_1_3_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity 1.3.4 Sentiment analysis of movie reviews

## Scenario
As part of a market research exercise for a film studio planning a new science-fiction film, you have been tasked with a data science project to research customer feedback on films in a related genre. One question you will be asked to investigate is whether there’s a relationship between the proportion of feedback that is positive and production budgets. Before you compare sentiment scores between films, however, you need to construct a viable preprocessing pipeline and train a model.


## Objective
In this portfolio activity, you will apply what you have learned about data preprocessing for NLP as well as some technique for calculating similarity.


## Assessment criteria

By completing this activity, you will be able to provide evidence that you can:
*   construct an NLP classification model capable of predicting binary sentiment class (positive or negative) with high accuracy
*   perform supporting calculations including cosine similarity to enable broad discussion of the techniques, and evaluate model performance
*   report your findings in language that a non-specialist would understand.


## Activity guidance

1. Install the necessary packages that will be useful in this  activity.
2. Load the dataset sst2 from Hugging Face (https://huggingface.co/datasets/sst2).
3. Create dataframes of the train and validation split.
4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
7. Comment on the cosine similarity scores.
8. Create a preprocessing function to perform several processing steps as described below to the train and validation texts:
 - Remove any punctuation and html tags.
 - Tokenize the text into tokens.
 - Remove stop words from your text.
 - Perform lemmatisation and stemming on your text (one at a time).

9. Obtain Bag-of-Word and TF-IDF representations for both the train and validation splits.
10. Train a logistic regression model using scikit-learn first with the Bag-of-Words and then TF-IDF, and report the performance of the sentiment classifier.


> Start your activity here. Select the pen from the toolbar to add your entry.

In [52]:
#In this activity, you will be required to download a data set from Hugging Face and perfom the text classification on the the data set.
#You will be required to study the impact of different different parameter choices on the classification perfomance of sentiment classifier.


#1. Install the necessary packages that will be useful in this activity.
#2. Load the data set sst2 from Hugging Face (https://huggingface.co/datasets/sst2).
#3. Create dataframes of the train and validation split.
#4. Calculate the cosine similarity of the 5th and 100th sentence within the train split.
#5. Calculate the cosine similarity of the 5th and 15,000th sentence within the train split.
#6. Calculate the cosine similarity of the 5th and 50,000th sentence within the train split.
#7. Comment on the cosine similarity scores.
#8. Create a preprocessing function to perform several processing steps as described below to the train and validation texts:
     #1. Remove any punctuation and HTML tags.
     #2. Tokenise the text into tokens.
     #3. Remove stop words from your text.
     #4. Perform lemmatisation and stemming on your text (one at a time).

#9.  Obtain Bag-of-Word and TF-IDF  Representation for both the train and validation splits.
#10. Train a logistic regression model using scikit-learn first with the Bag-of-Words and then TF-IDF, and report the performance of the sentiment classifier.
#11. What is the impact of not removing stop words on the performance of the sentiment classifier?
#12. Which one is more important to the perfomrance of the sentiment classifier (lemmatisation or stemming)?

In [59]:
#1. Install the necessary packages that will be useful in this activity.
!pip install datasets
!pip install nltk
!pip install spacy
!pip install beautifulsoup4
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 MB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [60]:
import pandas as pd
import spacy

Step 2: Load the Dataset from Hugging Face


In [61]:
#2. Load the data set sst2 from Hugging Face (https://huggingface.co/datasets/sst2).
from datasets import load_dataset
dataset = load_dataset("sst2")

In [56]:
dataset

DatasetDict({
    train: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 872
    })
    test: Dataset({
        features: ['idx', 'sentence', 'label'],
        num_rows: 1821
    })
})

The dataset I worked with is structured into three distinct splits: **train**, **validation**, and **test**. Here's the breakdown:

- The **train** dataset contains **67,349** rows. Each row features an index (`idx`), a sentence (`sentence`), and a label (`label`).
- The **validation** dataset has **872** rows, similarly structured with the same features.
- The **test** dataset includes **1,821** rows, also containing the `idx`, `sentence`, and `label` fields.

This distribution indicates that the bulk of the data is in the training set, providing a comprehensive basis for model training, while the validation and test sets are used for evaluation and tuning.

Step 3: Create DataFrames for the Train and Validation Split

In [62]:
#3. Create dataframes of the train and validation split.

# Convert the train and validation splits to dataframes
train_df = pd.DataFrame(dataset['train'])
validation_df = pd.DataFrame(dataset['validation'])

train_df.head()

Unnamed: 0,idx,sentence,label
0,0,hide new secretions from the parental units,0
1,1,"contains no wit , only labored gags",0
2,2,that loves its characters and communicates som...,1
3,3,remains utterly satisfied to remain the same t...,0
4,4,on the worst revenge-of-the-nerds clichés the ...,0


Step 4: Calculate the Cosine Similarity of the 5th and 100th Sentence

In [64]:
# Load the spaCy model
nlp = spacy.load("en_core_web_md")
# Select the 5th and 100th sentence
sentence_5 =  train_df['sentence'][4]
sentence_100 = train_df['sentence'][99]
# Calculate the cosine similarity
doc1 = nlp(sentence_5)
doc2 = nlp(sentence_100)
cosine_similarity_5_100 = doc1.similarity(doc2)

print(f"Cosine similarity between the 5th and 100th sentence: {cosine_similarity_5_100}")


Cosine similarity between the 5th and 100th sentence: 0.7073607748427319


Step 5: Calculate the Cosine Similarity of the 5th and 15,000th Sentence


In [65]:
sentence_15000 = train_df['sentence'][min(14999,len(train_df)-1)]
doc2 = nlp(sentence_15000)
cosine_similarity_5_15000 = doc1.similarity(doc2)

print(f"Cosine similarity between the 5th and 15,000th sentence: {cosine_similarity_5_15000}")

Cosine similarity between the 5th and 15,000th sentence: 0.2167132018837024


Step 6: Calculate the Cosine Similarity of the 5th and 50,000th Sentence

In [76]:
sentence_50000 = train_df['sentence'][min(14999,len(train_df)-1)]
doc2 = nlp(sentence_50000)
cosine_similarity_5_50000 = doc1.similarity(doc2)

print(f"Cosine similarity between the 5th and 50,000th sentence: {cosine_similarity_5_50000}")

Cosine similarity between the 5th and 50,000th sentence: 0.2167132018837024


Step 7: Comment on the Cosine Similarity Scores


In [67]:
# The comments will be based on the cosine similarity scores calculated above.
# Higher scores indicate higher similarity between the sentences.

print("Cosine Similarity Observations:")
print(f"Cosine similarity between the 5th and 100th sentence: {cosine_similarity_5_100}")
print(f"Cosine similarity between the 5th and 15,000th sentence: {cosine_similarity_5_15000}")
print(f"Cosine similarity between the 5th and 50,000th sentence: {cosine_similarity_5_50000}")

# Comments


Cosine Similarity Observations:
Cosine similarity between the 5th and 100th sentence: 0.7073607748427319
Cosine similarity between the 5th and 15,000th sentence: 0.2167132018837024
Cosine similarity between the 5th and 50,000th sentence: 0.2167132018837024


### Cosine Similarity Observations:

1. **Cosine similarity between the 5th and 100th sentence: 0.707**  
   This value indicates a moderate to high similarity, suggesting that these two sentences likely share a significant number of words or have a similar context. It implies that the themes or the language used in these sentences might be closely related.

2. **Cosine similarity between the 5th and 15,000th sentence: 0.217**  
   The lower value here indicates that these sentences are not very similar. The sentences may cover different topics, themes, or use distinct vocabulary, highlighting diversity within the dataset.

3. **Cosine similarity between the 5th and 50,000th sentence: 0.217**  
   This similarity score is identical to the previous one, showing that the 5th and 50,000th sentences are also quite different. It suggests that as the dataset progresses, the variety in sentence structure or topic increases, leading to lower similarities when compared with the early entries.

Step 8: Create a Preprocessing Function


In [80]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from bs4 import BeautifulSoup
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess_text(text, method="lemmatization"):
     # Ensure the text is a string
    text = str(text)
    # Remove HTML tags
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()

    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Tokenize
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]

    # Lemmatize or stem
    if method == "lemmatization":
        processed_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    else:  # Stemming
        processed_tokens = [stemmer.stem(token) for token in filtered_tokens]

    # Join tokens back into a string
    return ' '.join(processed_tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Step 9: Obtain Bag-of-Words and TF-IDF Representations


In [81]:
# Apply preprocessing function to train and validation text
train_df['clean_text_lemmatization'] = train_df['sentence'].apply(lambda x: preprocess_text(x))
validation_df['clean_text_lemmatization'] = validation_df['sentence'].apply(lambda x: preprocess_text(x))


  soup = BeautifulSoup(text, "html.parser")


In [70]:
# Bag-of-Words Vectorization
bow_vectorizer = CountVectorizer(max_features=3000)
X_train_bow = bow_vectorizer.fit_transform(train_df['clean_text_lemmatization'])
X_val_bow = bow_vectorizer.transform(validation_df['clean_text_lemmatization'])


In [71]:
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=3000)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['clean_text_lemmatization'])
X_val_tfidf = tfidf_vectorizer.transform(validation_df['clean_text_lemmatization'])

Step 10: Train a Logistic Regression Model Using Bag-of-Words and TF-IDF

In [72]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [73]:
# Logistic Regression with Bag-On-Words
lr_bow = LogisticRegression(max_iter=1000)
lr_bow.fit(X_train_bow, train_df['label'])
y_pred_bow = lr_bow.predict(X_val_bow)

print("Bag-of-Words Model Performance:")
print(classification_report(validation_df['label'], y_pred_bow))

Bag-of-Words Model Performance:
              precision    recall  f1-score   support

           0       0.81      0.71      0.76       428
           1       0.75      0.84      0.79       444

    accuracy                           0.78       872
   macro avg       0.78      0.78      0.78       872
weighted avg       0.78      0.78      0.78       872



The model's precision is 0.81 for negative sentiments and 0.75 for positive sentiments. This means that when it predicts a sentiment, it’s mostly accurate in identifying both negative and positive cases.
In terms of recall, it’s 0.71 for negative sentiments and 0.84 for positive ones, so it's quite effective, especially in picking up positive instances.
The F1-scores—which balance precision and recall—are 0.76 for negative and 0.79 for positive. These scores show that the model maintains a balanced performance overall.
The overall accuracy is 78%, which I think is pretty solid for a first attempt.


Step 11: Evaluate the Impact of Not Removing Stop Words



In [82]:
# Train and evaluate without removing stop words
def preprocess_no_stopwords(text, method="lemmatization"):
    text = str(text)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    tokens = word_tokenize(text.lower())
    if method == "lemmatization":
        processed_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    else:
        processed_tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(processed_tokens)

# Apply the no-stopwords preprocessing
train_df['clean_text_no_stop'] = train_df['sentence'].apply(lambda x: preprocess_no_stopwords(x))
validation_df['clean_text_no_stop'] = validation_df['sentence'].apply(lambda x: preprocess_no_stopwords(x))

# Vectorize
X_train_no_stop = tfidf_vectorizer.fit_transform(train_df['clean_text_no_stop'])
X_val_no_stop = tfidf_vectorizer.transform(validation_df['clean_text_no_stop'])

# Train and evaluate
clf_no_stop = LogisticRegression(max_iter=1000)
clf_no_stop.fit(X_train_no_stop, train_df['label'])
y_pred_no_stop = clf_no_stop.predict(X_val_no_stop)

print("Model Performance Without Removing Stop Words:")
print(classification_report(validation_df['label'], y_pred_no_stop))


  soup = BeautifulSoup(text, "html.parser")


Model Performance Without Removing Stop Words:
              precision    recall  f1-score   support

           0       0.81      0.76      0.78       428
           1       0.78      0.82      0.80       444

    accuracy                           0.79       872
   macro avg       0.79      0.79      0.79       872
weighted avg       0.79      0.79      0.79       872




- The **precision** for negative sentiments remained at **0.81**, while for positive sentiments, it increased slightly to **0.78**. This indicates that the model still accurately identifies sentiments, even with common words included.
- The **recall** for negative sentiments improved to **0.76** and for positive sentiments was **0.82**, showing that the model is slightly more balanced in detecting both types of sentiments without stop word removal.
- The **F1-scores** were **0.78** for negative and **0.80** for positive, which are similar but indicate a slight improvement, particularly for the positive class.
- The overall **accuracy** is **79%**, a small increase from the previous Bag-of-Words model where stop words were removed.

In summary, not removing stop words slightly improved the model’s performance, particularly in detecting positive instances. This suggests that, in some cases, including these common words might help the model better understand the context of sentiments. It's interesting to see that such a small tweak can have a noticeable impact!

Step 12: Compare Lemmatization vs. Stemming


In [75]:
# Train and evaluate with stemming
train_df['clean_text_stemming'] = train_df['sentence'].apply(lambda x: preprocess_text(x, method="stemming"))
validation_df['clean_text_stemming'] = validation_df['sentence'].apply(lambda x: preprocess_text(x, method="stemming"))

X_train_stemming = tfidf_vectorizer.fit_transform(train_df['clean_text_stemming'])
X_val_stemming = tfidf_vectorizer.transform(validation_df['clean_text_stemming'])

clf_stemming = LogisticRegression(max_iter=1000)
clf_stemming.fit(X_train_stemming, train_df['label'])
y_pred_stemming = clf_stemming.predict(X_val_stemming)

print("Stemming Model Performance:")
print(classification_report(validation_df['label'], y_pred_stemming))

# Compare with lemmatization
print("Lemmatization Model Performance:")
print(classification_report(validation_df['label'], y_pred_bow))


  soup = BeautifulSoup(text, "html.parser")
  soup = BeautifulSoup(text, "html.parser")


Stemming Model Performance:
              precision    recall  f1-score   support

           0       0.83      0.70      0.76       428
           1       0.75      0.86      0.80       444

    accuracy                           0.78       872
   macro avg       0.79      0.78      0.78       872
weighted avg       0.79      0.78      0.78       872

Lemmatization Model Performance:
              precision    recall  f1-score   support

           0       0.81      0.71      0.76       428
           1       0.75      0.84      0.79       444

    accuracy                           0.78       872
   macro avg       0.78      0.78      0.78       872
weighted avg       0.78      0.78      0.78       872




### Stemming Model Performance
- **Precision**: For negative sentiments, precision is **0.83**, and for positive sentiments, it's **0.75**. This shows that the model is quite accurate in identifying negative sentiments with stemming.
- **Recall**: The recall for negative sentiments is **0.70**, while for positive sentiments, it's higher at **0.86**. The model is better at capturing positive sentiments but slightly less effective with negative ones.
- **F1-Score**: The F1-score for negative sentiments is **0.76**, and for positive, it's **0.80**, reflecting the trade-off between precision and recall.
- **Overall Accuracy**: The model achieved **78%** accuracy.

### Lemmatization Model Performance
- **Precision**: The precision for negative sentiments is **0.81**, and for positive sentiments, it's **0.75**, slightly lower than the stemming model for negative sentiments but similar for positive ones.
- **Recall**: The recall for negative sentiments is **0.71**, while for positive, it's **0.84**. Similar to the stemming model, it has higher recall for positive instances.
- **F1-Score**: The F1-score for negative is **0.76**, and for positive, it's **0.79**, showing a balanced approach between precision and recall.
- **Overall Accuracy**: The accuracy remains consistent at **78%**.

### Summary
Both models achieved similar accuracy scores, but the stemming model shows higher precision for negative sentiments, while the lemmatization model maintains a balanced precision-recall ratio across both sentiment classes. Stemming seems to be slightly more effective in capturing the distinct features of negative sentiments, while lemmatization offers a balanced approach.