# Kindle Review Sentiment Analysis


## Dataset Overview  
### Context  
This dataset is a small subset of Amazon Kindle Store book reviews.  

### Content  
The dataset includes 982,619 entries of product reviews from May 1996 to July 2014, where each reviewer and product has at least 5 reviews.  

#### Columns:  
- **asin**: Unique product ID (e.g., `B000FA64PK`).  
- **helpful**: Helpfulness rating of the review (e.g., `2/3`, meaning 2 of 3 voters found the review helpful).  
- **overall**: Product rating.  
- **reviewText**: Full text of the review.  
- **reviewTime**: Review timestamp (raw format).  
- **reviewerID**: Unique reviewer ID (e.g., `A3SPTOKDG7WBLN`).  
- **reviewerName**: Name of the reviewer.  
- **summary**: Summary or title of the review.  
- **unixReviewTime**: Unix timestamp of the review.  
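
As a small illustration of how two of these columns might be decoded, the sketch below parses a `helpful` value and converts a Unix timestamp to a date. It assumes `helpful` is stored as a string of the form `2/3`, as in the example above. The function names and the sample timestamp are hypothetical and not part of the dataset.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_helpful(value: str) -> Optional[float]:
    """Return the helpful-vote ratio for a value like '2/3', or None if nobody voted."""
    helpful_votes, total_votes = (int(part) for part in value.split("/"))
    if total_votes == 0:
        return None  # no votes cast, so the ratio is undefined
    return helpful_votes / total_votes


def unix_to_date(timestamp: int) -> str:
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an ISO date string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()


print(parse_helpful("2/3"))      # ratio of helpful votes
print(unix_to_date(1399248000))  # hypothetical timestamp
```

Returning `None` for `0/0` keeps reviews without votes distinguishable from reviews that were voted unhelpful.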

### Acknowledgements  
This dataset was sourced from the Amazon product data repository by Julian McAuley (UCSD). Full dataset details are available [here](http://jmcauley.ucsd.edu/data/amazon/).  

### Potential Analyses  
- **Sentiment Analysis**: Extract and analyze sentiments from review texts.  
- **Helpfulness Analysis**: Understand factors influencing review helpfulness.  
- **Outlier Detection**: Identify fake or unusual reviews.  
- **Product Insights**: Find top-rated products or analyze product similarities based on reviews.  
- **Other Explorations**: Discover unique patterns or insights from the data.  


### Best Practices

1. Preprocessing and cleaning
2. Train-test split
3. BoW, TF-IDF, Word2Vec
4. Training ML algorithms

---

## Loading the Dataset


In [None]:
# Loading the dataset
import pandas as pd

# Define the path to the dataset
file_path = r"D:\Repositories\Data-Science\NLP_Kindle_Review_Sentiment_Analysis\data\all_kindle_review.csv"

# Load the dataset into a DataFrame
data = pd.read_csv(file_path)

# Display the first five rows
data.head()


## Selecting Relevant Columns


In [None]:
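# Selecting relevant columns: 'reviewText' and 'rating'
# (the column list above names the rating 'overall'; this CSV is assumed to expose it as 'rating')
# .copy() creates an independent DataFrame, so the column assignments made later
# do not trigger pandas' SettingWithCopyWarning
df = data[['reviewText', 'rating']].copy()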

# Display the first five rows of the selected DataFrame
df.head()


## Checking the Shape of the Dataset


In [None]:
# Check the shape of the DataFrame (rows, columns)
df.shape


## Checking for Missing Values


In [None]:
# Check for missing values in the DataFrame
df.isnull().sum()
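
If the check above reports missing review texts, one option is to drop those rows before cleaning, because the string operations in later cells expect text. The sketch below shows this on a hypothetical toy DataFrame rather than on the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the two selected columns
toy = pd.DataFrame({
    "reviewText": ["Loved it", None, "Too slow"],
    "rating": [5, 4, 2],
})

print(toy.isnull().sum())  # counts missing values per column

# Drop rows without review text, then renumber the index
toy_clean = toy.dropna(subset=["reviewText"]).reset_index(drop=True)
print(toy_clean.shape)
```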


## Checking Unique Values in the Rating Column


In [None]:
# Check unique values in the 'rating' column
df['rating'].unique()


## Counting the Frequency of Each Rating Value


In [None]:
# Count the frequency of each value in the 'rating' column
df['rating'].value_counts()

## Preprocessing And Cleaning

### Converting Ratings to Binary Labels (Positive: 1, Negative: 0)


In [None]:
# Convert ratings to binary labels: Positive (1) if rating >= 3, otherwise Negative (0)
df['rating'] = df['rating'].apply(lambda x: 0 if x < 3 else 1)


## Frequency of Binary Rating Labels


In [None]:
# Count the frequency of binary labels in the 'rating' column
df['rating'].value_counts()


## Converting All Review Text to Lowercase


In [None]:
# Convert all text in the 'reviewText' column to lowercase
df['reviewText'] = df['reviewText'].str.lower()
df.head()

## Importing Libraries and Downloading Stopwords


In [None]:
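# Import required libraries
import re
import nltk
from nltk.corpus import stopwords

# Download the NLTK resources used below (skipped if already present)
nltk.download('stopwords')
nltk.download('wordnet')

# Import BeautifulSoup for parsing HTML content
from bs4 import BeautifulSoup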


## Cleaning Review Text
- Removing HTML tags
- Removing URLs
- Removing special characters
- Removing stopwords
- Removing extra spaces


In [None]:
# HTML tags and URLs are removed first: stripping special characters beforehand
# would delete the '<', '>', ':' and '/' characters that these two steps rely on.

# Removing HTML tags (str() guards against non-string values)
df['reviewText'] = df['reviewText'].apply(lambda x: BeautifulSoup(str(x), 'lxml').get_text())

# Removing URLs
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://[\w\.-]+(?:/[\w\./-]*)?', '', x))

# Removing special characters
df['reviewText'] = df['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s-]+', '', x))

# Removing stopwords
stop_words = set(stopwords.words('english'))
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join([word for word in x.split() if word not in stop_words]))

# Removing additional spaces
df['reviewText'] = df['reviewText'].apply(lambda x: " ".join(x.split()))
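
The order of the cleaning steps matters: once special characters are stripped, the `<`, `>`, `:` and `/` characters that tag and URL patterns depend on are gone. The sketch below is a self-contained illustration that removes tags and URLs first. It uses only the standard library, a crude regular expression in place of BeautifulSoup, and a tiny hypothetical stopword list in place of NLTK's, so it is an approximation of the cell above rather than a replacement for it.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "is", "this", "for", "at", "and"}

URL_PATTERN = re.compile(r"(?:http|https|ftp|ssh)://[\w.-]+(?:/[\w./-]*)?")
TAG_PATTERN = re.compile(r"<[^>]+>")  # crude tag matcher, not a full HTML parser


def clean_text(text: str) -> str:
    text = text.lower()
    text = TAG_PATTERN.sub(" ", text)          # 1. strip HTML tags
    text = URL_PATTERN.sub(" ", text)          # 2. strip URLs
    text = re.sub(r"[^a-z0-9\s-]+", "", text)  # 3. drop special characters
    words = [w for w in text.split() if w not in STOP_WORDS]  # 4. drop stopwords
    return " ".join(words)                     # 5. split/join collapses extra spaces


print(clean_text("<p>This book is GREAT!</p> Details at http://example.com/review-1"))
```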


## Text Preprocessing: Lemmatization and WordNet Lemmatizer
- Initializing WordNet Lemmatizer
- Defining Lemmatization Function
- Applying Lemmatization on Review Text


In [None]:
# Import WordNetLemmatizer from NLTK
from nltk.stem import WordNetLemmatizer

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize words in the text
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply the lemmatization function to the 'reviewText' column
df['reviewText'] = df['reviewText'].apply(lemmatize_words)


## Train-Test Split
- Splitting the dataset into training and testing sets
- 80% for training, 20% for testing


In [None]:
# Import train_test_split from scikit-learn
from sklearn.model_selection import train_test_split

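# Split the dataset into training and testing sets (80% train, 20% test)
# random_state fixes the shuffle so the split is reproducible across runs (42 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(df['reviewText'], df['rating'], test_size=0.20, random_state=42)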
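
When the two classes are imbalanced, a plain random split can give the training and test sets different class ratios. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the label proportions the same in both parts. The toy texts and labels are hypothetical, and the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 5 negative and 15 positive reviews
texts = [f"review {i}" for i in range(20)]
labels = [0] * 5 + [1] * 15

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.20,
    stratify=labels,   # preserve the 1:3 class ratio in both parts
    random_state=42,   # arbitrary seed for a reproducible shuffle
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```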


## Bag of Words (BoW) Transformation
- Converting text data into numerical format using CountVectorizer
- Training set and test set transformation


In [None]:
# Import CountVectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer
bow = CountVectorizer()

# Transform the training and testing data using CountVectorizer (Bag of Words)
X_train_bow = bow.fit_transform(X_train).toarray()
X_test_bow = bow.transform(X_test).toarray()


## TF-IDF Transformation
- Converting text data into numerical format using TfidfVectorizer
- Training set and test set transformation


In [None]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf = TfidfVectorizer()

# Transform the training and testing data using TfidfVectorizer (TF-IDF)
X_train_tfidf = tfidf.fit_transform(X_train).toarray()
X_test_tfidf = tfidf.transform(X_test).toarray()
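
The weighting behind TF-IDF can be reproduced by hand. The sketch below assumes scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row) and applies them to a hypothetical pre-tokenized corpus. It is meant as an illustration of the formula, not as the library's implementation.

```python
import math
from collections import Counter

corpus = [["good", "book", "good"], ["bad", "book"]]  # hypothetical tokenized reviews
n_docs = len(corpus)

# Document frequency: number of reviews that contain each term
doc_freq = Counter(term for doc in corpus for term in set(doc))


def tfidf(doc):
    counts = Counter(doc)
    # term count times smoothed inverse document frequency
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(corpus[0])
print(vec)
```

"book" appears in every review, so its idf is the minimum value of 1, while "good" is both rarer and repeated and therefore receives the larger weight.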


## Visualizing the Bag of Words Transformation (Training Set)
- Displaying the numerical representation of the training data


In [None]:
# Display the Bag of Words transformation of the training data
X_train_bow


In [None]:
# Get the feature names (words) from the Bag of Words model
feature_names = bow.get_feature_names_out()

# Convert the matrix to a DataFrame with feature names as columns
# (pandas was already imported as pd when the dataset was loaded)
X_train_bow_df = pd.DataFrame(X_train_bow, columns=feature_names)

# Display the first few rows of the Bag of Words DataFrame
X_train_bow_df.head()


## Training Naive Bayes Classifier
- Training Gaussian Naive Bayes model using Bag of Words (BoW) representation
- Training Gaussian Naive Bayes model using TF-IDF representation


In [None]:
# Import GaussianNB from scikit-learn
from sklearn.naive_bayes import GaussianNB

# Train Naive Bayes model on Bag of Words features (X_train_bow)
nb_model_bow = GaussianNB().fit(X_train_bow, y_train)

# Train Naive Bayes model on TF-IDF features (X_train_tfidf)
nb_model_tfidf = GaussianNB().fit(X_train_tfidf, y_train)
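
`GaussianNB` models every feature as a continuous Gaussian variable and requires dense input, which is why the vectorized features were converted with `.toarray()`. A commonly used alternative for word-count features is `MultinomialNB`, which accepts sparse matrices directly. The sketch below shows that alternative on hypothetical toy reviews. It is a different technique from the one used in this notebook and is included only for comparison.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = positive, 0 = negative
train_texts = ["great story loved it", "wonderful great read",
               "boring plot hated it", "terrible boring waste"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)  # stays sparse

model = MultinomialNB()  # uses additive (Laplace) smoothing with alpha=1.0 by default
model.fit(X_counts, train_labels)

new_counts = vectorizer.transform(["great read", "boring waste"])
print(model.predict(new_counts))
```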


## Model Evaluation - Naive Bayes (BoW)
- Evaluating the Naive Bayes model performance using:
  - Confusion Matrix
  - Accuracy Score
  - Classification Report


In [None]:
# Make sure the prediction is done first
y_pred_bow = nb_model_bow.predict(X_test_bow)  # Making predictions with Naive Bayes model

# Importing evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Confusion Matrix
conf_matrix_bow = confusion_matrix(y_test, y_pred_bow)
print("Confusion Matrix (BoW):\n", conf_matrix_bow)

# Accuracy Score
accuracy_bow = accuracy_score(y_test, y_pred_bow)
print("\nAccuracy Score (BoW):", accuracy_bow)

# Classification Report
class_report_bow = classification_report(y_test, y_pred_bow)
print("\nClassification Report (BoW):\n", class_report_bow)


In [None]:
# Accuracy Score
accuracy_bow = accuracy_score(y_test, y_pred_bow)
print("BoW Accuracy: ", accuracy_bow)


## Model Evaluation - Naive Bayes (TF-IDF)
- Evaluating the Naive Bayes model performance using the Confusion Matrix.
- The confusion matrix will help identify the number of correct and incorrect classifications made by the model.


In [None]:
# Making predictions with Naive Bayes model (TF-IDF)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

# Importing necessary evaluation metrics
from sklearn.metrics import confusion_matrix

# Confusion Matrix
conf_matrix_tfidf = confusion_matrix(y_test, y_pred_tfidf)
print("Confusion Matrix (TF-IDF):\n", conf_matrix_tfidf)


In [None]:
print("TF-IDF Accuracy:", accuracy_score(y_test, y_pred_tfidf))

## Conclusion - Sentiment Analysis Evaluation

This project aimed to evaluate sentiment analysis using **Naive Bayes** models, leveraging different text vectorization techniques: **Bag of Words (BoW)** and **TF-IDF**. Both methods provide insights into the effectiveness of these models in classifying sentiment (positive vs. negative reviews). Below are the findings:

### **Bag of Words (BoW)**:
- **Accuracy Score**: 57.17%
- **Confusion Matrix**:
  - True Negatives (Correctly classified as Negative): 504
  - False Positives (Incorrectly classified as Positive): 309
  - False Negatives (Incorrectly classified as Negative): 719
  - True Positives (Correctly classified as Positive): 868
- **Classification Report**:
  - **Precision** for class 1 (positive sentiment) is 0.74, so most reviews predicted as positive are truly positive. **Recall** for class 1 is only 0.55, however: 719 of the 1,587 positive reviews are misclassified as negative.
  - **Recall** for class 0 (negative sentiment) is 0.62: the model recovers 504 of the 813 negative reviews and misclassifies the remaining 309 as positive.
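
The figures in the classification report follow directly from the four confusion-matrix counts. The sketch below recomputes them for the BoW model. The helper name `binary_metrics` is hypothetical, and the counts are the ones listed above.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),  # of predicted positives, how many are right
        "recall_pos": tp / (tp + fn),     # of actual positives, how many are found
        "recall_neg": tn / (tn + fp),     # of actual negatives, how many are found
    }


bow_metrics = binary_metrics(tn=504, fp=309, fn=719, tp=868)
for name, value in bow_metrics.items():
    print(f"{name}: {value:.4f}")
```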

### **TF-IDF**:
- **Accuracy Score**: 57.46%
- **Confusion Matrix**:
  - True Negatives: 494
  - False Positives: 319
  - False Negatives: 702
  - True Positives: 885
- **Classification Report**:
  - The TF-IDF model has marginally higher accuracy than BoW (57.46% vs. 57.17%). For positive reviews (class 1), recall (885 of 1,587, about 0.56) remains well below precision (about 0.74).
  - For negative reviews (class 0), recall is 494 of 813, about 0.61, which is slightly below the BoW model's 0.62.

### **Insights**:
- Both models reach about 57% accuracy. This is below the roughly 66% that a classifier would score by always predicting the majority class, since 1,587 of the 2,400 test reviews are positive. The models show a **bias towards predicting negative reviews** (class 0): the BoW model labels 1,223 reviews as negative although only 813 are, which explains the large number of false negatives.
- The **TF-IDF model is ahead of BoW by only 0.29 percentage points** (7 of 2,400 test reviews). A gap this small, measured on a single random train-test split, is not enough to conclude that TF-IDF weighting is more effective than raw BoW counts for this task.
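
To put the accuracy figures in context, the sketch below compares them with a majority-class baseline, that is, a classifier that always predicts the most frequent label. It uses only the counts from the two confusion matrices reported above.

```python
# Test-set class counts implied by the BoW confusion matrix
negatives = 504 + 309  # actual class 0 = TN + FP
positives = 719 + 868  # actual class 1 = FN + TP
total = negatives + positives

majority_baseline = max(negatives, positives) / total
bow_accuracy = (504 + 868) / total
tfidf_accuracy = (494 + 885) / total

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"BoW accuracy:            {bow_accuracy:.4f}")
print(f"TF-IDF accuracy:         {tfidf_accuracy:.4f}")
```

A model that scores below this baseline has not yet learned more than the class distribution itself provides, which is a useful sanity check before comparing vectorizers.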

### **Next Steps and Recommendations**:
1. **Advanced Models**: Explore more sophisticated models, such as **Logistic Regression**, **Support Vector Machines (SVM)**, or **Gradient Boosting Machines (GBM)**, which may provide better classification performance.
2. **Deep Learning Models**: Consider leveraging **LSTM (Long Short-Term Memory)** or **BERT (Bidirectional Encoder Representations from Transformers)** for deeper semantic understanding and improved sentiment classification.
3. **Fine-tuning Pretrained Models**: Utilize pre-trained transformer models, such as BERT or RoBERTa, which have shown state-of-the-art results in sentiment analysis tasks and can be fine-tuned for better accuracy.
4. **Data Augmentation**: Use techniques such as **back-translation** or **data synthesis** to balance the dataset and improve the model's ability to generalize across both positive and negative sentiments.

By incorporating these strategies, the overall performance of the sentiment analysis could be significantly improved, resulting in more accurate and reliable predictions.
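
As a minimal sketch of the first recommendation, the example below chains `TfidfVectorizer` and `LogisticRegression` in a scikit-learn pipeline. The toy reviews are hypothetical, and `class_weight="balanced"` is one possible way to counter class imbalance, not a setting taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical imbalanced toy data: 1 = positive, 0 = negative
train_texts = ["great story loved it", "wonderful great read", "loved the characters",
               "boring plot hated it", "terrible boring waste"]
train_labels = [1, 1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(),  # sparse TF-IDF features, no .toarray() needed
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(train_texts, train_labels)

# Probability of the positive class for two unseen reviews
proba_positive = clf.predict_proba(["great read", "boring waste"])[:, 1]
print(proba_positive)
```

Wrapping the vectorizer and the classifier in one pipeline ensures the vocabulary is learned from the training texts only, including when the pipeline is used inside cross-validation.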


## Implementing Support Vector Machine (SVM) for Sentiment Analysis


In [None]:
# Import the required libraries
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Train the SVM model on the TF-IDF features
svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)  # Fit the model on the training data

# Predict on the test set
y_pred_svm = svm_model.predict(X_test_tfidf)

# Evaluate the SVM model
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)
print("Confusion Matrix (SVM):\n", conf_matrix_svm)  # Confusion matrix
print("\nAccuracy Score (SVM):", accuracy_score(y_test, y_pred_svm))  # Accuracy
print("\nClassification Report (SVM):\n", classification_report(y_test, y_pred_svm))  # Detailed report


### Explanation of SVM Implementation

1. **Model Selection**: We use `LinearSVC`, a Support Vector Machine restricted to a linear kernel. It is built on liblinear and scales better to large, high-dimensional text features than the kernelized `SVC`.
2. **Training**: The `fit` method trains the SVM model using the TF-IDF-transformed training data (`X_train_tfidf`) and labels (`y_train`).
3. **Prediction**: The `predict` method generates predictions for the test set (`X_test_tfidf`).
4. **Evaluation**:
   - **Confusion Matrix**: Summarizes the performance by showing true positives, true negatives, false positives, and false negatives.
   - **Accuracy Score**: Measures the proportion of correct predictions.
   - **Classification Report**: Provides precision, recall, and F1-score for each class, along with overall metrics.


## Implementing LSTM (Long Short-Term Memory) for Sentiment Analysis


In [None]:
# Importing necessary libraries
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Tokenizing the text
# Tokenizer splits text into tokens (words) and creates a numerical representation.
tokenizer = Tokenizer(num_words=5000)  # Limit vocabulary to the top 5000 words
tokenizer.fit_on_texts(X_train)  # Learn vocabulary from training data

# Converting text to sequences of integers
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Padding sequences to ensure uniform length
# Padding ensures all input has the same length, either by adding zeros or truncating.
max_length = 100  # Maximum length of each review
X_train_padded = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test_seq, maxlen=max_length, padding='post')

# Creating the LSTM model
model = Sequential([
    Embedding(input_dim=5000, output_dim=100, input_length=max_length),  # Embedding layer
    LSTM(128, return_sequences=False),  # LSTM with 128 units
    Dropout(0.3),  # Dropout to prevent overfitting
    Dense(1, activation='sigmoid')  # Output layer for binary classification
])

# Compiling the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(X_train_padded, y_train, epochs=5, batch_size=64, validation_data=(X_test_padded, y_test))

# Evaluating the model
loss, accuracy = model.evaluate(X_test_padded, y_test)
print(f"LSTM Accuracy: {accuracy:.2f}")


### Explanation of LSTM Implementation

1. **Text Preparation**:
   - **Tokenization**: Converts text into a sequence of integers, where each integer represents a word's index in the vocabulary.
   - **Padding**: Ensures uniform length for input sequences by truncating or adding zeros to sequences.

2. **Model Architecture**:
   - **Embedding Layer**: Maps words to dense vectors of fixed size (`embedding_dim` = 100), capturing semantic meaning.
   - **LSTM Layer**: Processes sequential data and captures temporal dependencies with 128 memory units.
   - **Dropout**: Reduces overfitting by randomly setting some neurons' outputs to zero during training.
   - **Dense Layer**: Final layer with a `sigmoid` activation function for binary sentiment classification.

3. **Model Compilation**:
   - **Optimizer**: `Adam` optimizer adjusts weights to minimize the loss function.
   - **Loss Function**: `binary_crossentropy` is used for binary classification tasks.

4. **Training and Evaluation**:
   - Trains the model for 5 epochs with a batch size of 64.
   - Validates the model on the test data during training.
   - Evaluates the final performance using accuracy and loss on the test dataset.
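
The text preparation step can be illustrated without Keras. The sketch below is a simplified imitation of what `Tokenizer` and `pad_sequences` do with the settings used above (`padding='post'` and the default truncation from the front of the sequence). The function names and toy texts are hypothetical, and this is not Keras's implementation.

```python
from collections import Counter


def build_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: index for index, word in enumerate(most_common, start=1)}


def to_sequence(text, vocab):
    """Replace each known word by its index; out-of-vocabulary words are dropped."""
    return [vocab[word] for word in text.split() if word in vocab]


def pad_post(sequence, maxlen):
    """Keep the last `maxlen` tokens, then right-pad with zeros up to `maxlen`."""
    truncated = sequence[-maxlen:]
    return truncated + [0] * (maxlen - len(truncated))


texts = ["good plot good", "good plot", "bad book"]  # hypothetical toy reviews
vocab = build_vocab(texts, num_words=3)
sequence = to_sequence("bad plot good", vocab)

print(vocab)
print(sequence)
print(pad_post(sequence, maxlen=4))
```

Because index 0 only ever marks padding, the embedding layer can learn a vector for every real word while padded positions carry no vocabulary meaning.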


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Defining parameters
embedding_dim = 100
max_features = 5000  # Vocabulary size; must match Tokenizer(num_words=5000) above
max_len = max_length  # Sequence length; must match the padded inputs (100) built above

# Model architecture
model = Sequential([
    Embedding(input_dim=max_features, output_dim=embedding_dim, input_length=max_len),
    LSTM(128, return_sequences=True, dropout=0.3),
    LSTM(64, dropout=0.3),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

# Compile the model
optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

# Early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Training the model
history = model.fit(
    X_train_padded, y_train,
    validation_data=(X_test_padded, y_test),
    epochs=10,
    batch_size=64,
    callbacks=[early_stopping]
)


In [None]:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test_padded, y_test, verbose=0)
print(f"LSTM Accuracy: {test_accuracy:.2f}")

# Generate predictions (sigmoid probabilities thresholded at 0.5, flattened to 1-D labels)
y_pred_proba = model.predict(X_test_padded)
y_pred = (y_pred_proba > 0.5).astype(int).ravel()

# Metrics
from sklearn.metrics import confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)

print("\nClassification Report:\n", classification_report(y_test, y_pred))
