In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Import libraries
import pandas as pd  # For data manipulation
from sklearn.model_selection import train_test_split  # For splitting data into train and test sets
from keras.preprocessing.text import Tokenizer  # For tokenizing text data
from keras.preprocessing.sequence import pad_sequences  # For padding sequences
from keras.models import Sequential  # For defining a sequential model
from keras.layers import Embedding, LSTM, Dense  # Layers for the LSTM model
from keras.utils import to_categorical  # For one-hot encoding labels

In [5]:
# File path on Google Drive
file_path = '/content/drive/MyDrive/movies.txt'

# Initialize lists to store data
product_ids = []  # List to store Product IDs
user_ids = []  # List to store User IDs
profile_names = []  # List to store Profile Names
helpfulness = []  # List to store Helpfulness
scores = []  # List to store Scores
times = []  # List to store Time
summaries = []  # List to store Summaries
texts = []  # List to store Texts

def append_record(data):
    """Append one parsed review entry to the column lists."""
    product_ids.append(data.get('product/productId', ''))
    user_ids.append(data.get('review/userId', ''))
    profile_names.append(data.get('review/profileName', ''))
    helpfulness.append(data.get('review/helpfulness', ''))
    scores.append(data.get('review/score', ''))
    times.append(data.get('review/time', ''))
    summaries.append(data.get('review/summary', ''))
    texts.append(data.get('review/text', ''))

# Open the file with specified encoding
with open(file_path, 'r', encoding='latin-1') as file:
    data = {}  # Temporary dictionary for the current review entry

    for line in file:  # Iterate through each line in the file
        line = line.strip()  # Remove leading/trailing whitespace

        if line:  # Non-empty line: a field or a continuation of the review text
            field, sep, value = line.partition(': ')  # Split on the first ': ' only
            if sep and field.startswith(('product/', 'review/')):
                data[field] = value  # Store the field and its value
            else:
                # Continuation line: append to 'review/text' instead of overwriting it
                data['review/text'] = (data.get('review/text', '') + ' ' + line).strip()
        elif data:  # Empty line ends a review entry (repeated blank lines are ignored)
            append_record(data)
            data = {}  # Reset the temporary dictionary for the next review entry

    if data:  # Flush the last entry if the file does not end with a blank line
        append_record(data)

# Create a DataFrame
df = pd.DataFrame({
    'ProductID': product_ids,
    'UserID': user_ids,
    'ProfileName': profile_names,
    'Helpfulness': helpfulness,
    'Score': scores,
    'Time': times,
    'Summary': summaries,
    'Text': texts
})

# Display the DataFrame
df.head()  # Show the first few rows of the DataFrame

,ProductID,UserID,ProfileName,Helpfulness,Score,Time,Summary,Text
0,B003AI2VGA,A141HP4LYPWMSR,"Brian E. Erland ""Rainbow Sphinx""",7/7,3.0,1182729600,"""There Is So Much Darkness Now ~ Come For The ...","Synopsis: On the daily trek from Juarez, Mexic..."
1,B003AI2VGA,A328S9RN3U5M68,Grady Harp,4/4,3.0,1181952000,Worthwhile and Important Story Hampered by Poo...,THE VIRGIN OF JUAREZ is based on true events s...
2,B003AI2VGA,A1I7QGUDP043DG,"Chrissy K. McVay ""Writer""",8/10,5.0,1164844800,This movie needed to be made.,The scenes in this film can be very disquietin...
3,B003AI2VGA,A1M5405JH9THP9,golgotha.gov,1/1,3.0,1197158400,distantly based on a real tragedy,THE VIRGIN OF JUAREZ (2006)<br />directed by K...
4,B003AI2VGA,ATXL536YX71TR,"KerrLines ""&#34;Movies,Music,Theatre&#34;""",1/1,3.0,1188345600,"""What's going on down in Juarez and shining a ...","Informationally, this SHOWTIME original is ess..."
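
The record-parsing logic can also be written as a small generator, which makes it easy to try on a handful of lines without the full file. This is a minimal sketch rather than the notebook's code: `iter_reviews` and `KNOWN_PREFIXES` are hypothetical names, the second product ID in the sample is made up, and the sketch assumes that field names start with `product/` or `review/` and that records are separated by blank lines.

```python
KNOWN_PREFIXES = ("product/", "review/")  # assumed field-name prefixes


def iter_reviews(lines):
    """Yield one dict per blank-line-separated record."""
    record = {}
    last_field = None
    for raw in lines:
        line = raw.strip()
        if not line:
            if record:  # ignore repeated blank lines
                yield record
            record, last_field = {}, None
            continue
        field, sep, value = line.partition(": ")
        if sep and field.startswith(KNOWN_PREFIXES):
            record[field] = value
            last_field = field
        elif last_field is not None:
            # Continuation line: append to the most recent field
            record[last_field] += " " + line
    if record:  # flush the final record when there is no trailing blank line
        yield record


sample = [
    "product/productId: B003AI2VGA",
    "review/score: 3.0",
    "review/text: Synopsis: a long review",
    "that wraps onto a second line",
    "",
    "product/productId: B000000000",  # made-up ID for illustration
    "review/score: 5.0",
    "review/text: Short one",
]
reviews = list(iter_reviews(sample))
print(len(reviews), reviews[0]["review/text"])
```

Because the generator only needs an iterable of strings, it could be passed an open file handle directly, and `pd.DataFrame(list(iter_reviews(file)))` would then build the table in one step.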


In [6]:
# Check for missing values (NaN only; fields absent from the file were stored as '' by the parser and are not counted here)
df.isnull().sum()

ProductID      0
UserID         0
ProfileName    0
Helpfulness    0
Score          0
Time           0
Summary        0
Text           0
dtype: int64

In [7]:
# Check data types of all columns
print(df.dtypes)

ProductID      object
UserID         object
ProfileName    object
Helpfulness    object
Score          object
Time           object
Summary        object
Text           object
dtype: object


In [8]:
# Convert 'Score' to Numeric
df['Score'] = pd.to_numeric(df['Score'], errors='coerce')  # 'coerce' handles non-convertible values

In [9]:
# Convert 'Helpfulness' to Numeric

# Split the 'Helpfulness' column into two separate columns
df[['Helpfulness_Numerator', 'Helpfulness_Denominator']] = df['Helpfulness'].str.split('/', expand=True)

# Convert to numeric and fill NaN with 0
df['Helpfulness_Numerator'] = pd.to_numeric(df['Helpfulness_Numerator'], errors='coerce').fillna(0)
df['Helpfulness_Denominator'] = pd.to_numeric(df['Helpfulness_Denominator'], errors='coerce').fillna(0)

# Drop the original 'Helpfulness' column
df.drop('Helpfulness', axis=1, inplace=True)
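
The numerator alone grows with the number of people who voted, so it can be worth looking at the helpful-vote ratio as well. The following is a plain-Python sketch of the same split, assuming the `helpful/total` string format shown in the table above; `parse_helpfulness` and `helpfulness_ratio` are hypothetical helpers, not part of the notebook.

```python
def parse_helpfulness(value):
    """Split a 'helpful/total' string into two ints; malformed input gives (0, 0)."""
    try:
        helpful, total = value.split("/")
        return int(helpful), int(total)
    except (ValueError, AttributeError):
        return 0, 0


def helpfulness_ratio(value):
    """Return helpful/total, or None when nobody voted."""
    helpful, total = parse_helpfulness(value)
    return helpful / total if total else None


print(parse_helpfulness("8/10"), helpfulness_ratio("8/10"), helpfulness_ratio("0/0"))
```

Returning `None` for reviews with no votes keeps them distinguishable from reviews that were voted on and found unhelpful.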

In [10]:
# Convert 'Time' (Unix seconds stored as strings) to DateTime
df['Time'] = pd.to_datetime(pd.to_numeric(df['Time'], errors='coerce'), unit='s')

In [11]:
# Preprocess the 'Score' column into categorical sentiment labels
df['Sentiment'] = df['Score'].apply(lambda score: 'Positive' if score > 3 else 'Negative' if score < 3 else 'Neutral')
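
One detail of the labeling rule is worth spelling out: `errors='coerce'` turns unparseable scores into NaN, and every comparison with NaN is false, so such rows fall through to 'Neutral'. The sketch below shows one way to keep them unlabeled instead; `score_to_sentiment` is a hypothetical name and the thresholds are the ones used above.

```python
import math


def score_to_sentiment(score):
    """Map a star score to a sentiment label; NaN or None stays unlabeled."""
    if score is None or math.isnan(score):
        return None
    if score > 3:
        return "Positive"
    if score < 3:
        return "Negative"
    return "Neutral"


labels = [score_to_sentiment(s) for s in (5.0, 3.0, 1.0, float("nan"))]
print(labels)
```

In the notebook this could be applied with `df['Score'].apply(score_to_sentiment)`, followed by dropping rows whose label is missing.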

In [12]:
# Drop unnecessary columns
df.drop(['Time', 'ProfileName', 'UserID', 'ProductID', 'Score','Helpfulness_Denominator'], axis=1, inplace=True)

In [13]:
# Display first 5 rows
df.head()

,Summary,Text,Helpfulness_Numerator,Sentiment
0,"""There Is So Much Darkness Now ~ Come For The ...","Synopsis: On the daily trek from Juarez, Mexic...",7,Neutral
1,Worthwhile and Important Story Hampered by Poo...,THE VIRGIN OF JUAREZ is based on true events s...,4,Neutral
2,This movie needed to be made.,The scenes in this film can be very disquietin...,8,Positive
3,distantly based on a real tragedy,THE VIRGIN OF JUAREZ (2006)<br />directed by K...,1,Neutral
4,"""What's going on down in Juarez and shining a ...","Informationally, this SHOWTIME original is ess...",1,Neutral


### Training LSTM for Helpfulness and Text

In [14]:
# Define your data for helpfulness prediction
X4 = df['Text'].values  # Features: 'Text'
y4 = df['Helpfulness_Numerator'].values  # Target: Helpfulness numerator

# Binning function to categorize values based on thresholds
def categorize_values(val):
    if val <= 3:
        return 'low'  # Categorize as 'low' if value is less than or equal to 3
    elif val <= 7:
        return 'medium'  # Categorize as 'medium' if value is less than or equal to 7
    else:
        return 'high'  # Categorize as 'high' for other values

# Apply binning to the target labels
y_categorized4 = [categorize_values(label) for label in y4]  # Apply binning to the 'Helpfulness_Numerator' values

# Tokenize the text
tokenizer = Tokenizer(num_words=20000)  # Initialize tokenizer with a vocabulary size of 20,000
tokenizer.fit_on_texts(X4)  # Fit tokenizer on the review text
sequences = tokenizer.texts_to_sequences(X4)  # Convert text to sequences
X_processed4 = pad_sequences(sequences, maxlen=200)  # Pad sequences to a maximum length of 200 tokens

# Split data into train and test sets
X_train4, X_test4, y_train4, y_test4 = train_test_split(X_processed4, y_categorized4, test_size=0.2, random_state=42)
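
By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so with `maxlen=200` a long review keeps its last 200 tokens. The sketch below mimics that default for a single sequence in plain Python; `pad_pre` is a hypothetical helper for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    if maxlen <= 0:
        return []
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


print(pad_pre([1, 2, 3], 5))           # short sequence is left-padded
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # long sequence loses its first items
```

If the opening of a review matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.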

In [15]:
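# Convert categorical labels to one-hot encoded format
# Fixing the category list keeps the column order identical in both splits,
# even if a class happens to be absent from one of them
classes = ['low', 'medium', 'high']
y_train_categorical4 = pd.get_dummies(pd.Categorical(y_train4, categories=classes)).to_numpy(dtype='float32')  # One-hot training labels
y_test_categorical4 = pd.get_dummies(pd.Categorical(y_test4, categories=classes)).to_numpy(dtype='float32')  # One-hot test labels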


# Build LSTM model for helpfulness prediction
model = Sequential()
model.add(Embedding(20000, 128, input_length=200))  # Embedding layer: vocab size 20,000, embedding dimension 128, input length 200
model.add(LSTM(128))  # LSTM layer with 128 units
model.add(Dense(3, activation='softmax'))  # Output layer with softmax activation for 3 categories

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model using one-hot encoded labels

model.fit(X_train4, y_train_categorical4, epochs=5, batch_size=32, validation_data=(X_test4, y_test_categorical4))

# Evaluate the model
loss, accuracy = model.evaluate(X_test4, y_test_categorical4)  # Evaluate model performance on test data
print(f"Test Accuracy for Helpfulness and Text: {accuracy:.4f}")  # Print the test accuracy

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accuracy for Helpfulness and Text: 0.7391
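
An accuracy figure is easier to interpret next to the share of the most frequent class, since a model that always predicts that class already reaches that share. The notebook does not report the class distribution, so the following is only a sketch of how it could be checked; `majority_baseline` is a hypothetical helper and the toy labels are made up.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, fraction of samples carrying it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


toy = ["low"] * 7 + ["medium"] * 2 + ["high"]  # made-up labels for illustration
label, share = majority_baseline(toy)
print(label, share)
```

In the notebook, `majority_baseline(y_test4)` would give the baseline to compare against the reported test accuracy.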


### Training LSTM for Helpfulness and text + summary

In [16]:
# Define your data for helpfulness prediction
# Concatenate 'Summary' and 'Text' columns
df['Concatenated_Text'] = df['Summary'] + ' ' + df['Text']
X1 = df['Concatenated_Text'].values  # Features: Concatenated text of 'Summary' and 'Text'
y1 = df['Helpfulness_Numerator'].values  # Target: Helpfulness numerator

# Binning function to categorize values based on thresholds
def categorize_values(val):
    if val <= 3:
        return 'low'  # Categorize as 'low' if value is less than or equal to 3
    elif val <= 7:
        return 'medium'  # Categorize as 'medium' if value is less than or equal to 7
    else:
        return 'high'  # Categorize as 'high' for other values

# Apply binning to the target labels
y_categorized = [categorize_values(label) for label in y1]  # Apply binning to the 'Helpfulness_Numerator' values

# Tokenize the text
tokenizer = Tokenizer(num_words=20000)  # Initialize tokenizer with a vocabulary size of 20,000
tokenizer.fit_on_texts(X1)  # Fit tokenizer on the concatenated text
sequences = tokenizer.texts_to_sequences(X1)  # Convert text to sequences
X_processed = pad_sequences(sequences, maxlen=200)  # Pad sequences to a maximum length of 200 tokens

# Split data into train and test sets
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_processed, y_categorized, test_size=0.2, random_state=42)

In [17]:
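# Convert categorical labels to one-hot encoded format
# Fixing the category list keeps the column order identical in both splits,
# even if a class happens to be absent from one of them
classes = ['low', 'medium', 'high']
y_train_categorical1 = pd.get_dummies(pd.Categorical(y_train1, categories=classes)).to_numpy(dtype='float32')  # One-hot training labels
y_test_categorical1 = pd.get_dummies(pd.Categorical(y_test1, categories=classes)).to_numpy(dtype='float32')  # One-hot test labels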

# Build LSTM model for helpfulness prediction
model = Sequential()
model.add(Embedding(20000, 128, input_length=200))  # Embedding layer: vocab size 20,000, embedding dimension 128, input length 200
model.add(LSTM(128))  # LSTM layer with 128 units
model.add(Dense(3, activation='softmax'))  # Output layer with softmax activation for 3 categories

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model using one-hot encoded labels
model.fit(X_train1, y_train_categorical1, epochs=5, batch_size=32, validation_data=(X_test1, y_test_categorical1))

# Evaluate the model
loss, accuracy = model.evaluate(X_test1, y_test_categorical1)  # Evaluate model performance on test data
print(f"Test Accuracy for Helpfulness and text + summary: {accuracy:.4f}")  # Print the test accuracy

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accuracy for Helpfulness and text + summary: 0.7359


### Training LSTM for Score(Sentiment) and Text

In [18]:
# Define your data
X3 = df['Text'].values  # Extract the text as input data
y3 = df['Sentiment'].values  # Extract the 'Sentiment' column as the target variable

# Tokenize the text
tokenizer = Tokenizer(num_words=20000)  # Initialize a tokenizer with a maximum of 20,000 words
tokenizer.fit_on_texts(X3)  # Fit the tokenizer on the text data
sequences = tokenizer.texts_to_sequences(X3)  # Convert text to sequences of numbers
X_processed = pad_sequences(sequences, maxlen=200)  # Pad sequences to have a maximum length of 200 words

# Split data into train and test sets
X_train3, X_test3, y_train3, y_test3 = train_test_split(  # Split data into train and test sets
    X_processed,  # Input data (processed and padded sequences)
    y3,  # Target variable (sentiment labels)
    test_size=0.2,  # Split ratio: 80% training, 20% testing
    random_state=42  # Set a random state for reproducibility
)

In [19]:
# Build LSTM model architecture
model = Sequential()  # Initialize a sequential model
model.add(Embedding(20000, 128, input_length=200))  # Add an embedding layer with a vocabulary size of 20,000, embedding dimension of 128, and input length of 200
model.add(LSTM(128))  # Add an LSTM layer with 128 units/neurons

model.add(Dense(3, activation='softmax'))  # Add a dense output layer with 3 units (for 3 classes) using softmax activation

# Mapping sentiment labels to numerical values (ensure uniform lowercase)
sentiment_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}

# Convert string labels to numerical values (ensure uniform lowercase)
y_train_encoded3 = [sentiment_mapping[label.lower()] for label in y_train3]  # Convert training set labels to numerical format
y_test_encoded3 = [sentiment_mapping[label.lower()] for label in y_test3]  # Convert test set labels to numerical format

# Convert numerical labels to one-hot encoded format
y_train_categorical3 = to_categorical(y_train_encoded3)  # Convert training set labels to one-hot encoded format
y_test_categorical3 = to_categorical(y_test_encoded3)  # Convert test set labels to one-hot encoded format

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # Compile the model with Adam optimizer, categorical cross-entropy loss, and accuracy metric

# Train the model using one-hot encoded labels
model.fit(X_train3, y_train_categorical3, epochs=5, batch_size=32, validation_data=(X_test3, y_test_categorical3))  # Fit the model on training data and validate on test data

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test3, y_test_categorical3)  # Calculate loss and accuracy on the test set
print(f"Test Accuracy for Score(Sentiment) and Text: {accuracy:.4f}")  # Display the test accuracy


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accuracy for Score(Sentiment) and Text: 0.8385
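
Overall accuracy does not show how each sentiment class fares, which matters when the classes are unevenly sized. A confusion matrix and per-class recall give that breakdown. The sketch below is plain Python with made-up labels; `confusion_matrix` and `per_class_recall` are hypothetical helpers written for this example, and the class indices follow the `sentiment_mapping` used in the notebook.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    return [row[i] / sum(row) if sum(row) else None for i, row in enumerate(matrix)]


y_true = [0, 0, 1, 1, 2, 2, 2, 2]  # made-up labels: 0=negative, 1=neutral, 2=positive
y_pred = [0, 2, 2, 2, 2, 2, 2, 2]  # made-up predictions
cm = confusion_matrix(y_true, y_pred, 3)
recall = per_class_recall(cm)
print(cm, recall)
```

In the notebook, the predicted indices could be obtained with `model.predict(X_test3).argmax(axis=1)` and compared against `y_test_encoded3`.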


### Training LSTM for Score(Sentiment) and text + Summary

In [20]:
# Concatenate 'Summary' and 'Text' columns
df['Concatenated_Text'] = df['Summary'] + ' ' + df['Text']

# Define your data
X = df['Concatenated_Text'].values  # Extract the concatenated text as input data
y = df['Sentiment'].values  # Extract the 'Sentiment' column as the target variable

# Tokenize the text
tokenizer = Tokenizer(num_words=20000)  # Initialize a tokenizer with a maximum of 20,000 words
tokenizer.fit_on_texts(X)  # Fit the tokenizer on the text data
sequences = tokenizer.texts_to_sequences(X)  # Convert text to sequences of numbers
X_processed = pad_sequences(sequences, maxlen=200)  # Pad sequences to have a maximum length of 200 words

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(  # Split data into train and test sets
    X_processed,  # Input data (processed and padded sequences)
    y,  # Target variable (sentiment labels)
    test_size=0.2,  # Split ratio: 80% training, 20% testing
    random_state=42  # Set a random state for reproducibility
)

In [21]:
# Build LSTM model architecture
model = Sequential()  # Initialize a sequential model
model.add(Embedding(20000, 128, input_length=200))  # Add an embedding layer with a vocabulary size of 20,000, embedding dimension of 128, and input length of 200
model.add(LSTM(128))  # Add an LSTM layer with 128 units/neurons

model.add(Dense(3, activation='softmax'))  # Add a dense output layer with 3 units (for 3 classes) using softmax activation

# Mapping sentiment labels to numerical values (ensure uniform lowercase)
sentiment_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}

# Convert string labels to numerical values (ensure uniform lowercase)
y_train_encoded = [sentiment_mapping[label.lower()] for label in y_train]  # Convert training set labels to numerical format
y_test_encoded = [sentiment_mapping[label.lower()] for label in y_test]  # Convert test set labels to numerical format

# Convert numerical labels to one-hot encoded format
y_train_categorical = to_categorical(y_train_encoded)  # Convert training set labels to one-hot encoded format
y_test_categorical = to_categorical(y_test_encoded)  # Convert test set labels to one-hot encoded format

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])  # Compile the model with Adam optimizer, categorical cross-entropy loss, and accuracy metric

# Train the model using one-hot encoded labels
model.fit(X_train, y_train_categorical, epochs=5, batch_size=32, validation_data=(X_test, y_test_categorical))  # Fit the model on training data and validate on test data

# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test_categorical)  # Calculate loss and accuracy on the test set
print(f"Test Accuracy for Score(Sentiment) and text + Summary: {accuracy:.4f}")  # Display the test accuracy


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Accuracy for Score(Sentiment) and text + Summary: 0.8392
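
The two sentiment runs differ by 0.0007 in accuracy. A rough way to judge whether such a gap is meaningful is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The test-set size is not shown in the notebook, so `n_test` below is a placeholder assumption, and `accuracy_std_error` is a hypothetical helper.

```python
import math


def accuracy_std_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_test = 10_000  # placeholder; replace with len(X_test)
se = accuracy_std_error(0.8392, n_test)
print(f"standard error: {se:.4f}")
```

This estimate covers sampling noise in the test set only; variation between training runs from random weight initialization is not captured and would need repeated runs to measure.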


### Conclusion

**Overview**
This sentiment analysis project used LSTM models to analyze user sentiment in a dataset of one million reviews. The objectives were to predict review helpfulness and sentiment (derived from the score), and to test whether combining the 'Summary' and 'Text' features improves either prediction.

**Key Achievements**

**Data Preprocessing:**

Loaded the dataset, checked for missing values, and converted 'Score', 'Helpfulness', and 'Time' to appropriate data types.
Created a 'Concatenated_Text' feature by combining the 'Summary' and 'Text' columns.


**Model Training and Evaluation:**

**Helpfulness Prediction with 'Text' and 'Summary' + 'Text':**

Trained LSTM models to predict review helpfulness, binned into 'low' (3 or fewer helpful votes), 'medium' (4 to 7), and 'high' (more than 7), from 'Text' alone and from concatenated 'Summary' + 'Text'.
Achieved a test accuracy of 73.91% for 'Text' and 73.59% for 'Summary' + 'Text'.
Adding the summary did not improve helpfulness prediction; the two inputs scored within about 0.3 percentage points of each other.

**Sentiment Analysis (Score prediction) with 'Text' and 'Summary' + 'Text':**

Categorized reviews into 'Negative' (score below 3), 'Neutral' (score of 3), and 'Positive' (score above 3) sentiments.
The LSTM models reached a test accuracy of 83.85% for 'Text' and 83.92% for 'Summary' + 'Text'.
The difference between the two inputs is under 0.1 percentage points, so the summary adds little beyond the review text for this task.

**Consistent Training Parameters:**

Used the same settings for all four models: a 20,000-word vocabulary, sequences padded to 200 tokens, a 128-dimensional embedding, one 128-unit LSTM layer, the Adam optimizer, 5 epochs, and a batch size of 32.
The same 80/20 train/test split (random_state=42) was used throughout, so the accuracies are directly comparable across tasks and inputs.

**Conclusion:**
LSTM models trained on review text reached about 74% test accuracy on binned helpfulness and about 84% on three-class sentiment. Concatenating the summary with the text made little difference for either task. The shared preprocessing and training setup gives a consistent baseline for future work on sentiment analysis and helpfulness prediction with this dataset.