# **Sentiment Analysis of IMDB Movie Reviews**

**Dataset**

The IMDb Dataset of 50K Movie Reviews is a popular dataset for sentiment analysis and other natural language processing tasks. It consists of 50,000 movie reviews, 25,000 labeled as positive and 25,000 as negative.

Dataset Source: [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?datasetId=134715&searchQuery=pytor)

**The Problem Statement**

Predict whether each movie review expresses positive or negative sentiment, using deep learning techniques.

**To approach this problem, we followed the outline below:**

- **Data preprocessing:** Applied in the notebook called _"Data_preprocessing_notebook"_.

- **Word embedding:** Convert the preprocessed text into a numerical representation that deep learning models can process. Word embeddings such as Word2Vec or GloVe represent words as dense vectors in a continuous vector space; this notebook uses Word2Vec.

- **Model selection:** Choose a suitable deep learning architecture. Candidates include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and convolutional neural networks (CNNs).

- **Model training:** Split the dataset into training and test sets, then train the model.

- **Model evaluation**

- **Model refinement**

**(Initial) Attributes**:

* Review
* Sentiment
 

## All the imports

```python
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Data manipulation
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud  # third-party package, must be installed locally

# PyTorch (framework for building deep learning models), must be installed locally
import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader

# scikit-learn utilities
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

# Word2Vec implementation, must be installed locally
import gensim
```

## Load the CSV file

```python
# Read the preprocessed dataset
data = pd.read_csv('imdb_clean_dataset.csv')
data.head()
```

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 |
| 1 | wonder littl product film techniqu unassum old... | 1 |
| 2 | thought wonder way spend time hot summer weeke... | 1 |
| 3 | basic famili littl boy jake think zombi closet... | 0 |
| 4 | petter mattei love time money visual stun film... | 1 |
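The `Unnamed: 0` column in the output above is most likely the row index that was written out when the cleaned CSV was saved; that is an assumption, since the preprocessing notebook is not shown here. The sketch below reproduces the effect on a small made-up DataFrame (the `toy` data is hypothetical) and shows how `index_col=0` avoids the extra column on load.

```python
import io

import pandas as pd

# Toy stand-in for the cleaned dataset (values are illustrative only)
toy = pd.DataFrame({"review": ["great film", "bad plot"], "sentiment": [1, 0]})

# to_csv() writes the row index as an unnamed first column by default
csv_text = toy.to_csv()

# Default read: the saved index comes back as a column named "Unnamed: 0"
default_read = pd.read_csv(io.StringIO(csv_text))
print(list(default_read.columns))  # ['Unnamed: 0', 'review', 'sentiment']

# index_col=0 treats the first column as the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed_read.columns))  # ['review', 'sentiment']
```

Saving with `to_csv(..., index=False)` in the preprocessing step would be another way to avoid the extra column.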


## Word embedding using Word2Vec model

```python
# Train a Word2Vec model on the tokenized reviews
sentences = [text.split() for text in data['review']]
model = gensim.models.Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv

# Convert a review into a single vector by averaging its word vectors
def text_to_word2vec(text):
    vectors = [word_vectors[word] for word in text.split() if word in word_vectors.key_to_index]
    if vectors:
        return np.mean(vectors, axis=0)
    # No known words: fall back to a zero vector of the embedding size
    return np.zeros(word_vectors.vector_size)

# Compute an embedding for every review in the dataset
data['embedding'] = data['review'].apply(text_to_word2vec)

data.head()
```

| Unnamed: 0 | review | sentiment | embedding |
|---|---|---|---|
| 0 | one review mention watch oz episod hook right ... | 1 | [0.5883391, 0.26740506, -0.4274795, -0.2607590... |
| 1 | wonder littl product film techniqu unassum old... | 1 | [0.44789693, 0.22259189, -0.44325185, -0.09301... |
| 2 | thought wonder way spend time hot summer weeke... | 1 | [0.3831304, 0.2718575, -0.4016498, 0.12928088,... |
| 3 | basic famili littl boy jake think zombi closet... | 0 | [0.51270556, 0.67743295, -0.75755, -0.13768345... |
| 4 | petter mattei love time money visual stun film... | 1 | [0.34306252, 0.2656034, -0.74243355, -0.279398... |
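Each review above is reduced to one vector by averaging the vectors of its words. The sketch below illustrates that averaging step on its own, using a hypothetical three-dimensional vocabulary (`toy_vectors`) and a helper named `mean_pool`; neither is part of the notebook, which uses 100-dimensional vectors learned by Word2Vec.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (the notebook uses 100 dimensions)
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(text, vectors, dim=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Unknown words are skipped, so only "good" and "film" contribute
pooled = mean_pool("good film unknownword", toy_vectors)
# A text with no known words falls back to the zero vector
empty = mean_pool("unknownword", toy_vectors)

print(pooled)  # [2. 1. 1.]
print(empty)   # [0. 0. 0.]
```

Averaging gives every review a fixed-length vector regardless of its length, at the cost of discarding word order.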


## Split into train/test sets

```python
# Split the dataset into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    data['embedding'], data['sentiment'], test_size=0.2, random_state=42
)

print(f'Shape of train data: {X_train.shape}')
print(f'Shape of test data: {X_test.shape}')
```

```
Shape of train data: (39665,)
Shape of test data: (9917,)
```
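The shapes printed above, `(39665,)` and `(9917,)`, are one-dimensional because each element of the `embedding` column is itself an array. Before data in this form can be converted to tensors, the per-row arrays generally need to be stacked into a single 2-D array. The following is a minimal sketch on toy data, not the notebook's own code; the names `embeddings`, `labels`, `X`, and `y` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 4 "reviews", each with a 3-dimensional embedding
embeddings = pd.Series([np.full(3, i, dtype=np.float32) for i in range(4)])
labels = pd.Series([1, 0, 1, 0])

# A Series of arrays is 1-D: one object per row
print(embeddings.shape)  # (4,)

# Stack the per-row arrays into a (n_samples, n_features) matrix
X = np.stack(embeddings.to_numpy())
y = labels.to_numpy(dtype=np.float32)

print(X.shape, X.dtype)  # (4, 3) float32
print(y.shape)           # (4,)
```

Arrays shaped like `X` and `y` can then be passed to `torch.from_numpy` and wrapped in the `TensorDataset` and `DataLoader` classes imported earlier.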
