# **Sentiment Analysis of IMDB Movie Reviews**

</br>

**Dataset**
</br>

The IMDb Dataset of 50K Movie Reviews is a popular dataset for sentiment analysis and other natural language processing tasks. It consists of 50,000 movie reviews: 25,000 labeled positive and 25,000 labeled negative.
</br>

Dataset Source: [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?datasetId=134715&searchQuery=pytor)
</br>

**The Problem Statement**
</br>

Classify each movie review as positive or negative, based on the sentiment it expresses, using deep learning techniques.

**To approach this problem, we followed the outline below:**

- **Data preprocessing:** Performed in the notebook _"Data_preprocessing_notebook"_.
</br>

- **Word embedding:** We convert the preprocessed text into a numerical representation that deep learning models can process. Word embeddings such as Word2Vec or GloVe represent words as dense vectors in a continuous vector space; this notebook uses Word2Vec.
</br>

- **Model selection:** We choose a suitable deep learning architecture. Candidates include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs).
</br>

- **Model training:** Split our dataset into training and validation sets.
</br>
- **Model evaluation**
</br>
- **Model refinement**
</br>

**(Initial) Attributes**:

* Review
* Sentiment
 

## All the imports

In [6]:
# import to "ignore" warnings

import warnings
warnings.filterwarnings('ignore')

# imports for data manipulation

import pandas as pd
import numpy as np

# imports for data visualization

import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud  # third-party package; must be installed locally


# import PyTorch (framework for building deep learning models); must be installed locally

import torch 
from torch import nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader

# imports from sklearn

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

import gensim  # third-party package; must be installed locally
import random


## Load the CSV file

In [7]:
# read data

data = pd.read_csv('imdb_clean_dataset.csv')
data.head()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 |
| 1 | wonder littl product film techniqu unassum old... | 1 |
| 2 | thought wonder way spend time hot summer weeke... | 1 |
| 3 | basic famili littl boy jake think zombi closet... | 0 |
| 4 | petter mattei love time money visual stun film... | 1 |
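
The `Unnamed: 0` column in the output above is most likely the DataFrame index that was written out when the cleaned CSV was saved; that is an assumption, since the preprocessing notebook is not shown here. The sketch below reproduces the effect on a tiny hypothetical frame and shows how `index_col=0` avoids the extra column when reading.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned dataset
df = pd.DataFrame({"review": ["good film", "bad plot"], "sentiment": [1, 0]})

# to_csv writes the index as an unnamed first column by default
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back naively surfaces that column as "Unnamed: 0"
buffer.seek(0)
cols_default = pd.read_csv(buffer).columns.tolist()
print(cols_default)   # ['Unnamed: 0', 'review', 'sentiment']

# Telling pandas that the first column is the index removes it
buffer.seek(0)
cols_indexed = pd.read_csv(buffer, index_col=0).columns.tolist()
print(cols_indexed)   # ['review', 'sentiment']
```

Saving with `df.to_csv(path, index=False)` in the preprocessing step would be an alternative that avoids the column at the source.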


## Word embedding using Word2Vec model
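
As background for the `window` and `sg=1` (skip-gram) settings used in this section, here is a minimal pure-Python sketch of how skip-gram training pairs can be formed: each word is paired with every neighbour within `window` positions. The helper `skipgram_pairs` is a hypothetical name for illustration. gensim's real implementation also applies subsampling, negative sampling, and randomized window sizes, all of which this sketch ignores.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Tokens taken from one of the preprocessed reviews shown earlier
tokens = "wonder littl product film".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

With `window=1`, each interior word contributes two pairs and each edge word one, so the four tokens above yield six pairs.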

In [10]:
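# Seed Python's RNG. Note: this does not seed gensim, which uses its own `seed`
# parameter (default 1); with workers=4, training is still not exactly repeatable.
random.seed(42)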

# Split the sentences into words
sentences = [text.split() for text in data['review']]

# Train the Word2Vec model
model = gensim.models.Word2Vec(sentences, vector_size=500, window=5, min_count=2, workers=4, sg=1)

# Get the vocabulary (list of words, not the vectors themselves) from the trained model
word_vectors = model.wv.index_to_key


# Calculate the embeddings for each text in the dataset
embeddings = []
for text in data['review']:
    # Split the text into words
    words = text.split()

    
    # Initialize an empty list to store word vectors for the current text
    text_word_vectors = []
    
    # Iterate over each word in the text
    for word in words:
        # Check if the word is present in the word vectors of the model
        if word in model.wv.key_to_index:
            # Retrieve the word vector for the word
            vector = model.wv.get_vector(word)
            # Append the word vector to the list for the current text
            text_word_vectors.append(vector)
    
    # Check if there are any word vectors for the current text
    if text_word_vectors:
        # Calculate the average vector for the text
        text_embedding = np.mean(text_word_vectors, axis=0)
    else:
        # If no word vectors are found, assign a zero vector of the model's dimension
        text_embedding = np.zeros(model.vector_size)
    
    # Append the text embedding to the list of embeddings
    embeddings.append(text_embedding)

# Assign the calculated embeddings to the 'embedding' column in the DataFrame
data['embedding'] = embeddings


data.head()



| Unnamed: 0 | review | sentiment | embedding |
|---|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 | [0.04266952, 0.15759, 0.08354795, 0.11671276, ... |
| 1 | wonder littl product film techniqu unassum old... | 1 | [0.08034655, 0.116457954, 0.06715829, 0.119105... |
| 2 | thought wonder way spend time hot summer weeke... | 1 | [0.040959258, 0.098045155, 0.033240408, 0.0907... |
| 3 | basic famili littl boy jake think zombi closet... | 0 | [0.004097277, 0.123330146, 0.07372391, 0.11563... |
| 4 | petter mattei love time money visual stun film... | 1 | [0.057226174, 0.09773444, 0.090847254, 0.09383... |
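
The averaging step above can be illustrated on a toy scale. The sketch below uses a hypothetical three-dimensional vocabulary (`toy_vectors`) in place of the trained Word2Vec model, and a helper named `mean_embedding` that is introduced only for illustration. It shows the three cases the loop handles: all words known, some words out of vocabulary, and no known words.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the trained model's vocabulary
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}
dim = 3


def mean_embedding(text, vectors, dim):
    """Average the vectors of in-vocabulary words; return zeros if none are known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


print(mean_embedding("good film", toy_vectors, dim))         # [2. 1. 1.]
print(mean_embedding("good unknownword", toy_vectors, dim))  # [1. 0. 2.]
print(mean_embedding("zzz", toy_vectors, dim))               # [0. 0. 0.]
```

One consequence of this design is that word order is discarded: "good film" and "film good" map to the same vector. That keeps every review at a fixed size, at the cost of sequence information.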


## Split into train/test sets

In [11]:
# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(data['embedding'], data['sentiment'], test_size=0.2, random_state=42)

print(f'Shape of train data: {X_train.shape}')
print(f'Shape of test data: {X_test.shape}')

Shape of train data: (39665,)
Shape of test data: (9917,)
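
The shapes printed above are one-dimensional because each element of the `embedding` column is itself a 500-dimensional array. A model that expects a 2-D feature matrix would need these stacked first. The sketch below shows one way to do that with `np.stack`, using a small object array as a stand-in for the pandas column (an assumption made so the example is self-contained). It also checks that the printed sizes match an 80/20 split in which the test size is rounded up.

```python
import math
import numpy as np

# Stand-in for the 'embedding' column: a 1-D object array whose items are vectors
rng = np.random.default_rng(0)
column = np.empty(10, dtype=object)
for i in range(10):
    column[i] = rng.normal(size=500).astype(np.float32)

print(column.shape)  # (10,)

# Stack the per-review vectors into a single (n_samples, n_features) matrix
X = np.stack(list(column))
print(X.shape, X.dtype)  # (10, 500) float32

# Sanity check on the sizes printed in the notebook output
n_samples = 39665 + 9917
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 39665 9917
```

In the notebook itself the equivalent call would presumably be `np.stack(X_train.values)`, after which the matrix could be converted to a tensor for the `TensorDataset` imported earlier.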
