<a href="https://colab.research.google.com/github/tfysekis/Sentiment-Analysis/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this project, we will conduct sentiment analysis using the Amazon Fine Food Reviews dataset. This dataset contains over 500,000 customer reviews of food products available on Amazon, making it a rich resource for analyzing customer opinions. Each review includes both the text of the review and a numerical rating (1–5), which we will use to classify sentiment into three categories:

- Positive: Ratings of 4 or 5
- Neutral: A rating of 3
- Negative: Ratings of 1 or 2

## Load the Dataset

In [None]:
import pandas as pd  # Library for working with data

# Try to load the dataset
try:
    # Reads the uploaded file into a pandas DataFrame
    df = pd.read_csv('Reviews.csv', encoding='utf-8')

    # Show the first 5 rows
    print(df.head())

    # Display the total number of rows
    print("Total number of rows in the dataset:", df.shape[0])

except FileNotFoundError:
    print("The file was not found. Check if the file is uploaded correctly and the name is typed correctly.")
except Exception as e:
    print("An error occurred:", e)


   Id   ProductId          UserId                      ProfileName  \
0   1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1   2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2   3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3   4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4   5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   

   HelpfulnessNumerator  HelpfulnessDenominator  Score        Time  \
0                     1                       1      5  1303862400   
1                     0                       0      1  1346976000   
2                     1                       1      4  1219017600   
3                     3                       3      2  1307923200   
4                     0                       0      5  1350777600   

                 Summary                                               Text  
0  Good Quality Dog Food  I have bought several of the Vitality canned d...  
1 

## Check the Data

Data Integrity Check: Identifying and Handling Missing Values

In [None]:
# Check for missing values in each column
missing_values = df.isnull().sum()

# Display the count of missing values for each column
print(missing_values)

Id                         0
ProductId                  0
UserId                     0
ProfileName               26
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64


Data Cleansing: Handling Missing Values in Review Data.

As shown above, 26 rows are missing a ProfileName and 27 rows are missing a Summary. We remove these rows.


In [None]:
# Remove rows where any of these columns have missing values
df_cleaned = df.dropna(subset=['ProfileName', 'Summary'])

# Check the shape of the new DataFrame to see how many rows are left
print("New number of rows after removing missing values:", df_cleaned.shape[0])

# Optionally, check again for missing values to ensure they're all handled
print(df_cleaned.isnull().sum())

New number of rows after removing missing values: 568401
Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64


Data Preprocessing: Streamlining and Cleansing the Review Dataset

Next, we remove the columns that are not needed for our sentiment analysis:
- Id
- ProductId
- UserId
- ProfileName
- HelpfulnessNumerator
- HelpfulnessDenominator
- Time

These columns are identifiers and metadata that do not contribute to the sentiment analysis process. We retain the Summary and Text columns, which contain the actual review content, along with the Score column, which we need in the next step to derive the sentiment label.

In [None]:
# Columns to keep that are likely useful for sentiment analysis
columns_to_keep = ['Score', 'Summary', 'Text']

# Reducing the DataFrame to only necessary columns
df_reduced = df_cleaned[columns_to_keep]

# Check the reduced data structure
print("Total rows after cleaning:", df_reduced.shape[0])

Total rows after cleaning: 568401


Transform Ratings into Sentiment Categories:
- Positive: Ratings of 4 or 5
- Neutral: A rating of 3
- Negative: Ratings of 1 or 2


In [None]:
# Explicitly create a copy of the reduced DataFrame
df_reduced = df_cleaned[columns_to_keep].copy()

# Function to categorize ratings
def categorize_rating(score):
    if score >= 4:
        return 'Positive'
    elif score == 3:
        return 'Neutral'
    else:
        return 'Negative'

# Apply function to the 'Score' column
df_reduced['Sentiment'] = df_reduced['Score'].apply(categorize_rating)

# Print the first 10 rows to see the original scores and their corresponding sentiments
print(df_reduced[['Score', 'Sentiment']].head(10))
print("Total rows after cleaning:", df_reduced.shape[0])


   Score Sentiment
0      5  Positive
1      1  Negative
2      4  Positive
3      2  Negative
4      5  Positive
5      4  Positive
6      5  Positive
7      5  Positive
8      5  Positive
9      5  Positive
Total rows after cleaning: 568401
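
The same rating-to-sentiment mapping can also be written without a row-by-row apply. The sketch below is one possible alternative, not the notebook's method: it uses pd.cut on a small made-up `scores` Series (not the real dataset) and then prints the share of each class, which is a quick way to see how imbalanced the labels are.

```python
import pandas as pd

# Toy ratings standing in for df_reduced['Score'] (illustrative values only)
scores = pd.Series([5, 1, 4, 2, 3, 5])

# Bins are right-inclusive: (0, 2] -> Negative, (2, 3] -> Neutral, (3, 5] -> Positive
sentiment = pd.cut(
    scores,
    bins=[0, 2, 3, 5],
    labels=["Negative", "Neutral", "Positive"],
).astype(str)

print(sentiment.tolist())

# Fraction of rows that fall into each class
class_share = sentiment.value_counts(normalize=True)
print(class_share)
```

On the full dataset, the same value_counts call on the Sentiment column would show how strongly the Positive class dominates.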


# 2

Drop the Score column: now that we have the Sentiment label, we no longer need it.

In [None]:
# Drop the 'Score' column from the DataFrame
df_reduced.drop(columns=['Score'], inplace=True)

## Comprehensive Text Cleaning and Preprocessing

This section cleans and preprocesses the review text for the sentiment analysis steps that follow. The cleaning function strips HTML tags, removes URLs and non-letter characters, converts the text to lowercase, and removes English stopwords. We apply it to both the Summary and Text columns so that they are cleaned uniformly.

In [None]:
# Import necessary libraries
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

# Download necessary NLTK resources
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Define a comprehensive cleaning function
def clean_and_preprocess(text):
    # Strip HTML markup, but only when the text looks like it contains tags
    if '<' in text and '>' in text:
        text = BeautifulSoup(text, "lxml").get_text()

    # Remove URLs and special characters
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Special characters

    # Convert text to lowercase
    text = text.lower().strip()

    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])

    return text


# Replace any remaining null values with empty strings before cleaning
df_reduced['Summary'] = df_reduced['Summary'].fillna('')
df_reduced['Text'] = df_reduced['Text'].fillna('')

# Apply the comprehensive cleaning and preprocessing function
df_reduced['Cleaned_Summary'] = df_reduced['Summary'].apply(clean_and_preprocess)
df_reduced['Cleaned_Text'] = df_reduced['Text'].apply(clean_and_preprocess)

# Display the cleaned text to verify
print(df_reduced[['Summary', 'Cleaned_Summary', 'Text', 'Cleaned_Text']].head())


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                 Summary        Cleaned_Summary  \
0  Good Quality Dog Food  good quality dog food   
1      Not as Advertised             advertised   
2  "Delight" says it all           delight says   
3         Cough Medicine         cough medicine   
4            Great taffy            great taffy   

                                                Text  \
0  I have bought several of the Vitality canned d...   
1  Product arrived labeled as Jumbo Salted Peanut...   
2  This is a confection that has been around a fe...   
3  If you are looking for the secret ingredient i...   
4  Great taffy at a great price.  There was a wid...   

                                        Cleaned_Text  
0  bought several vitality canned dog food produc...  
1  product arrived labeled jumbo salted peanutsth...  
2  confection around centuries light pillowy citr...  
3  looking secret ingredient robitussin believe f...  
4  great taffy great price wide assortment yummy ...  
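
Two side effects are visible in the output above: "Not as Advertised" is reduced to "advertised" (both "not" and "as" are English stopwords), and neighbouring words are fused ("peanutsth..."), most likely because punctuation is deleted rather than replaced with a space. The sketch below shows one possible variant that keeps negation words and substitutes spaces. The function name, the tiny stopword list, and the regex-based tag stripping are assumptions made for illustration; the notebook itself uses NLTK's full stopword list and BeautifulSoup.

```python
import re

# Small illustrative stopword list (an assumption for this sketch; the notebook
# uses NLTK's full English list)
STOP_WORDS = {"i", "the", "a", "an", "as", "is", "it", "of", "this", "was", "not", "no"}

# Negation words we choose to keep because they can flip the sentiment
NEGATIONS = {"not", "no"}
KEPT_STOP_WORDS = STOP_WORDS - NEGATIONS

def clean_keep_negations(text):
    """Hypothetical variant of clean_and_preprocess that keeps negations."""
    text = re.sub(r"<[^>]+>", " ", text)                # crude tag stripping
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # non-letters -> space
    words = text.lower().split()
    return " ".join(w for w in words if w not in KEPT_STOP_WORDS)

print(clean_keep_negations("Not as Advertised"))
print(clean_keep_negations("Salted Peanuts...the peanuts<br />were small"))
```

Whether keeping negations actually helps the classifiers would need to be checked by rerunning the models; this sketch only shows the preprocessing change.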


## Text Vectorization Using TF-IDF

This section transforms the preprocessed text into numerical features that machine learning models can work with. Using a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer, we convert the Cleaned_Summary and Cleaned_Text columns into sparse feature matrices. TF-IDF weights the frequency of each word within a document (here, a single review) against the number of documents that contain the word, so words that appear in most reviews are down-weighted and words that are distinctive for a particular review stand out. This matters for tasks such as sentiment analysis or topic modeling, where the relative importance of words plays a key role in understanding the content.
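
To make the weighting concrete, here is a minimal pure-Python sketch on three made-up reviews. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what scikit-learn documents as its default behaviour; the helper names `idf` and `tfidf` are hypothetical and not part of any library.

```python
import math
from collections import Counter

# Three tiny already-cleaned "reviews" (made up for illustration)
docs = [
    "great taffy great price",
    "great dog food",
    "stale taffy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # Raw term counts weighted by idf, then L2-normalised
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for term in ("great", "taffy", "price"):
    print(term, round(idf(term), 3))
print(tfidf(tokenized[0]))
```

Here "price" appears in only one review, so its idf is higher than that of "great", which appears in two.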

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize one TF-IDF vectorizer per column so that each keeps its own fitted vocabulary
summary_vectorizer = TfidfVectorizer()
text_vectorizer = TfidfVectorizer()

# Fit and transform the cleaned text
tfidf_summary = summary_vectorizer.fit_transform(df_reduced['Cleaned_Summary'])
tfidf_text = text_vectorizer.fit_transform(df_reduced['Cleaned_Text'])

# Optionally, display the shape of the vectorized data
print("TF-IDF Summary shape:", tfidf_summary.shape)
print("TF-IDF Text shape:", tfidf_text.shape)


TF-IDF Summary shape: (568401, 41201)
TF-IDF Text shape: (568401, 307176)


## Efficient Replacement of Common Terms with Count Tracking


This code replaces predefined common terms (abbreviations, contractions, and informal expressions) in the Cleaned_Summary and Cleaned_Text columns with their standardized or expanded forms, and reports the total number of replacements made. Note that these columns were already lowercased and stripped of punctuation in the cleaning step, so only keys made up entirely of lowercase letters (here "lol" and "omg") can still match; keys such as "e.g.", "Dr.", or "can't" would have to be applied to the raw text before cleaning.

Key Features:

- Regex-Based Replacement: Uses a single compiled regular expression with word-boundary anchors, so terms are matched as whole words rather than as parts of longer words.
- Replacement and Counting: Counts the replacements made in each column while the replacement runs.
- Single Pass: Replacement and counting happen in the same substitution call, so each text is scanned only once.

In [None]:
# Define a dictionary for common terms and their replacements
replacement_dict = {
    "e.g.": "for example",
    "i.e.": "that is",
    "etc.": "and so on",
    "Dr.": "Doctor",
    "vs.": "versus",
    "can't": "cannot",
    "won't": "will not",
    "n't": " not",
    "it's": "it is",
    "lol": "laugh out loud",
    "omg": "oh my god"
}


In [None]:
import re

# Convert the replacement dictionary into a regex pattern for efficient replacement
pattern = re.compile(r'\b(' + '|'.join(re.escape(term) for term in replacement_dict.keys()) + r')\b')

def replace_and_count_changes(text):
    """
    Replace terms in the text and count the number of replacements made.
    """
    # Initialize count of changes
    changes = 0

    # Function to replace and count replacements
    def replacement_function(match):
        nonlocal changes
        changes += 1
        return replacement_dict[match.group(0)]

    # Apply the replacement using regex
    replaced_text = pattern.sub(replacement_function, text)
    return replaced_text, changes

# Initialize counters
summary_changes_total = 0
text_changes_total = 0

# Apply replacements and count changes for 'Cleaned_Summary'
df_reduced['Cleaned_Summary'], summary_changes_col = zip(*df_reduced['Cleaned_Summary'].apply(replace_and_count_changes))
summary_changes_total = sum(summary_changes_col)

# Apply replacements and count changes for 'Cleaned_Text'
df_reduced['Cleaned_Text'], text_changes_col = zip(*df_reduced['Cleaned_Text'].apply(replace_and_count_changes))
text_changes_total = sum(text_changes_col)

# Total changes
total_changes = summary_changes_total + text_changes_total
print(f"Total replacements made: {total_changes}")
print(f"Replacements in 'Cleaned_Summary': {summary_changes_total}")
print(f"Replacements in 'Cleaned_Text': {text_changes_total}")

Total replacements made: 3578
Replacements in 'Cleaned_Summary': 596
Replacements in 'Cleaned_Text': 2982
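
One caveat with word-boundary anchors (`\b`): a boundary only exists next to a letter, digit, or underscore, so a key that ends in a period, such as "e.g.", does not match when it is followed by a space. The sketch below shows one way around this using lookarounds on raw, lowercased text. The names `raw_pattern` and `expand_terms` are hypothetical, and the dictionary is a shortened copy used only for illustration.

```python
import re

# Subset of the replacement dictionary, for illustration
replacements = {
    "e.g.": "for example",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "lol": "laugh out loud",
}

# Longest keys first so longer terms win over shorter overlapping ones
keys = sorted(replacements, key=len, reverse=True)

# (?<!\w) and (?!\w) require "no word character next to the match", which also
# works when a key starts or ends with punctuation such as "e.g."
raw_pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
)

def expand_terms(text):
    """Hypothetical helper: return (expanded_text, number_of_replacements)."""
    return raw_pattern.subn(lambda m: replacements[m.group(1)], text)

print(expand_terms("lol it's great, e.g. the taffy. can't complain"))
print(expand_terms("lollipop"))
```

Because re's subn returns the replacement count directly, no separate counter variable is needed. To have any effect on contractions and abbreviations, a helper like this would need to run before punctuation is stripped.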


# 3

## Baseline Model Development for Sentiment Analysis

In this task, we'll:

1. Split the dataset into training and testing sets using a 70-30 split.
2. Vectorize the Cleaned_Text column using TF-IDF to prepare it for machine learning models.
3. Train and evaluate two baseline models:
   - Logistic Regression: A simple, effective classifier.
   - Support Vector Machine (SVM): Known for its robustness in text classification.
4. Compare their performance using metrics like accuracy, precision, recall, and F1 score.

1. Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X = df_reduced['Cleaned_Text']  # Feature: Cleaned_Text
y = df_reduced['Sentiment']     # Target: Sentiment

# Train-test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")



Training set size: 397880
Testing set size: 170521
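
The split above is purely random. With classes as imbalanced as these, one optional variant is a stratified split, which keeps the class proportions the same in the training and testing sets. The sketch below demonstrates the `stratify` argument of train_test_split on made-up labels; the 70/10/20 class mix and the variable names are assumptions for illustration, not values from the dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 70/10/20 class mix (made up for illustration)
labels = ["Positive"] * 70 + ["Neutral"] * 10 + ["Negative"] * 20
texts = [f"review {i}" for i in range(len(labels))]

# stratify=labels asks scikit-learn to keep the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(Counter(y_te))
```

In the notebook, the equivalent change would be passing stratify=y to the existing call; the reported metrics would then differ slightly from the ones shown below.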


2. Text Vectorization with TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Using 5000 features for efficiency

# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform testing data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("TF-IDF vectorization complete.")
print(f"Training data shape: {X_train_tfidf.shape}")
print(f"Testing data shape: {X_test_tfidf.shape}")


TF-IDF vectorization complete.
Training data shape: (397880, 5000)
Testing data shape: (170521, 5000)


3. Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Initialize and train Logistic Regression model
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train_tfidf, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test_tfidf)

# Evaluate the model
print("Classification Report for Logistic Regression:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

    Negative       0.74      0.67      0.70     24366
     Neutral       0.52      0.18      0.27     12835
    Positive       0.90      0.97      0.93    133320

    accuracy                           0.87    170521
   macro avg       0.72      0.61      0.63    170521
weighted avg       0.85      0.87      0.85    170521

Confusion Matrix:
[[ 16280    974   7112]
 [  2768   2371   7696]
 [  3089   1254 128977]]
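
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below does this in plain Python, with the matrix values copied from the output above; in scikit-learn's layout, rows are the true classes and columns are the predicted classes.

```python
# Confusion matrix copied from the Logistic Regression output above
# (rows = true class, columns = predicted class)
labels = ["Negative", "Neutral", "Positive"]
cm = [
    [16280, 974, 7112],
    [2768, 2371, 7696],
    [3089, 1254, 128977],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(3)) / total

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(cm[r][i] for r in range(3))  # column sum: predicted as this class
    actual = sum(cm[i])                           # row sum: truly this class
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label:>8}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The Neutral row explains the low recall for that class: of 12,835 truly Neutral reviews, 7,696 are predicted as Positive and only 2,371 as Neutral.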


4. Support Vector Machine (SVM) Model

In [None]:
from sklearn.svm import LinearSVC

# Initialize and train SVM model
svm = LinearSVC(random_state=42)
svm.fit(X_train_tfidf, y_train)

# Predict on the test set
y_pred_svm = svm.predict(X_test_tfidf)

# Evaluate the model
print("Classification Report for SVM:")
print(classification_report(y_test, y_pred_svm))

# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))


Classification Report for SVM:
              precision    recall  f1-score   support

    Negative       0.72      0.67      0.70     24366
     Neutral       0.58      0.11      0.19     12835
    Positive       0.89      0.97      0.93    133320

    accuracy                           0.86    170521
   macro avg       0.73      0.59      0.61    170521
weighted avg       0.84      0.86      0.84    170521

Confusion Matrix:
[[ 16388    457   7521]
 [  3089   1462   8284]
 [  3133    589 129598]]
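
For the comparison step, the two confusion matrices can be summarized side by side. The sketch below computes accuracy and macro-averaged recall for both models in plain Python; the matrix values are copied from the two outputs above, and the helper name `summarize` is hypothetical.

```python
# Confusion matrices copied from the two outputs above
# (rows = true class, columns = predicted class; order: Negative, Neutral, Positive)
matrices = {
    "Logistic Regression": [
        [16280, 974, 7112],
        [2768, 2371, 7696],
        [3089, 1254, 128977],
    ],
    "Linear SVM": [
        [16388, 457, 7521],
        [3089, 1462, 8284],
        [3133, 589, 129598],
    ],
}

def summarize(cm):
    """Return (accuracy, macro-averaged recall) for a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    macro_recall = sum(cm[i][i] / sum(cm[i]) for i in range(n)) / n
    return accuracy, macro_recall

summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, rec) in summary.items():
    print(f"{name:<20} accuracy={acc:.3f}  macro recall={rec:.3f}")
```

On these numbers, Logistic Regression is slightly ahead on both accuracy and macro recall, and both models struggle most with the Neutral class.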
