
Movies Reviews Classification

A neural network model for sentiment analysis of movie reviews using the IMDb dataset. The model is built with PyTorch, using BERT as the feature extractor.

Note: This README.md file contains an overview of the project; it is recommended to open the notebook, which contains the code and further explanation of the results.

Table of Contents

  • Dataset
  • Data Splitting
  • Data Preprocessing
  • Model Architecture
  • Improving the Model
  • Model Evaluation
  • Results
  • Contributors

Dataset

  • The project needs a dataset of movie and TV show reviews. IMDb is a popular website for movies and TV shows, with a database of over 8 million titles, so a dataset drawn from it is a good choice for training and testing our neural network.

IMDB Dataset

  • Instead of using the whole database, we use a subset: a dataset of 50,000 movie and TV show reviews. The dataset is already balanced, meaning it contains an equal number of positive and negative reviews, and it is available on Kaggle (a loading sketch follows).
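As a rough illustration, loading the Kaggle download might look like the following; the file name and the `review`/`sentiment` column names are assumptions based on the standard Kaggle release, not taken from the notebook.

```python
import pandas as pd

# Assumed file and column names from the standard Kaggle release.
df = pd.read_csv("IMDB Dataset.csv")
df["sentiment"] = df["sentiment"].map({"positive": 1, "negative": 0})

print(df.shape)                 # (50000, 2)
print(df["sentiment"].mean())   # ~0.5, i.e. balanced classes
```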

Data Splitting

  • Since the dataset is already balanced, we split it into a 70% training set, a 20% validation set, and a 10% test set (see the sketch below). The training set is used to train the neural network, the validation set is used to tune the hyperparameters, and the test set is used to evaluate the final model.
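A minimal splitting sketch, assuming the `df` dataframe from the loading step above; stratifying on the label preserves the positive/negative balance in every split:

```python
from sklearn.model_selection import train_test_split

# First carve off 30%, then split that 30% into 20% validation
# and 10% test (i.e. 1/3 of the remainder).
train_df, rest_df = train_test_split(
    df, test_size=0.30, stratify=df["sentiment"], random_state=42)
val_df, test_df = train_test_split(
    rest_df, test_size=1/3, stratify=rest_df["sentiment"], random_state=42)

print(len(train_df), len(val_df), len(test_df))  # 35000 10000 5000
```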

Data Preprocessing

  • Text preprocessing is essential for NLP tasks, so we apply the following steps to the data before it is used for classification (sketched after this list):
    • Remove punctuation.
    • Remove stop words.
    • Lowercase all characters.
    • Lemmatization of words.
  • The data preprocessing is done using the NLTK library.
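A minimal NLTK sketch of these four steps; the exact implementation lives in the notebook, and the example sentence is illustrative:

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")   # one-time downloads
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [lemmatizer.lemmatize(tok)                               # lemmatize
              for tok in text.split()
              if tok not in STOP_WORDS]                               # remove stop words
    return " ".join(tokens)

print(preprocess("This movie was absolutely wonderful!"))
# -> "movie absolutely wonderful"
```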

Model Architecture

  • The project uses PyTorch to build the neural network: a simple feedforward network with 5 layers.
  • The input layer takes 768 inputs, corresponding to the vector provided by BERT's pooled (classification) output.
  • The network has 4 hidden layers with 512, 256, 128, and 64 units respectively.
  • The hidden layers use the ReLU activation function.
  • The output layer uses a sigmoid activation function to classify the vector.
  • The network is trained with the Adam optimizer and the binary cross-entropy loss function. (A sketch of this architecture follows.)
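A sketch of the head described above, with BERT used as a frozen feature extractor; the class name, helper names, and learning rate are illustrative rather than the notebook's exact code, and the `dropout` argument anticipates the regularization section below.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentimentClassifier(nn.Module):
    """Feedforward head: 768 -> 512 -> 256 -> 128 -> 64 -> 1."""
    def __init__(self, dropout: float = 0.0):
        super().__init__()
        sizes = [768, 512, 256, 128, 64]
        layers = []
        for n_in, n_out in zip(sizes, sizes[1:]):
            layers += [nn.Linear(n_in, n_out), nn.ReLU(), nn.Dropout(dropout)]
        layers.append(nn.Linear(sizes[-1], 1))   # single logit
        self.net = nn.Sequential(*layers)

    def forward(self, pooled):                    # pooled: (batch, 768)
        return torch.sigmoid(self.net(pooled)).squeeze(-1)

# BERT provides the 768-dim pooled (classification) vector.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

model = SentimentClassifier()
criterion = nn.BCELoss()                                   # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # illustrative lr

enc = tokenizer(["A wonderful movie!"], padding=True,
                truncation=True, return_tensors="pt")
with torch.no_grad():                                      # frozen feature extractor
    pooled = bert(**enc).pooler_output                     # shape (1, 768)
print(model(pooled))                                       # probability in (0, 1)
```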

Improving the Model

  • The model can be improved by tuning hyperparameters and applying regularization. The following techniques are used:

Hyperparameter Tuning

  • The following hyperparameters can be tuned:
    • Learning Rate
    • Batch Size
    • Number of Epochs
    • Number of Hidden Layers
    • Number of Units in each Hidden Layer
    • Activation Function
    • Optimizer
    • Loss Function
  • We tune only the learning rate in this project, since the other hyperparameters have little to no effect on the model's performance.
  • You can find the model's performance for different learning rates in the results folder. (A sketch of the sweep follows.)
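A sketch of the learning-rate sweep; `train_one_epoch`, `evaluate`, `train_loader`, and `val_loader` are hypothetical stand-ins for the notebook's training and validation loops, and the candidate rates are illustrative:

```python
num_epochs = 10                               # illustrative
best_lr, best_acc = None, 0.0
for lr in (1e-2, 1e-3, 1e-4, 1e-5):           # candidate learning rates
    model = SentimentClassifier(dropout=0.4)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        train_one_epoch(model, optimizer, train_loader)
    acc = evaluate(model, val_loader)         # validation accuracy
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print(f"best lr = {best_lr} (val acc = {best_acc:.3f})")
```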

Regularization using Dropout

  • Dropout is a regularization technique that randomly drops some of the neurons in the network during training. This technique is used to prevent overfitting.
  • Dropout is applied to the hidden layers of the network. The dropout rate, i.e. the probability that a neuron is dropped, can be specified when initializing the network; it is set to 0.4 in this project (see the snippet below).
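A tiny demonstration of the mechanics using PyTorch's `nn.Dropout`; note that surviving activations are rescaled during training and that dropout is disabled in eval mode:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.4)   # each unit is zeroed with probability 0.4
x = torch.ones(8)

drop.train()
print(drop(x))   # ~40% zeros; survivors scaled by 1 / (1 - 0.4)

drop.eval()
print(drop(x))   # identity at evaluation time
```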

Model Evaluation

  • The model classifies reviews with 93% accuracy on the raw test data, while accuracy reached 90% on the preprocessed data. This indicates that not all preprocessing steps are necessary for the model to perform well. (A sketch of the evaluation pass follows.)
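A sketch of the test-set pass, assuming a hypothetical `test_loader` that yields batches of (pooled-feature, label) pairs:

```python
model.eval()
correct = total = 0
y_true, y_pred = [], []
with torch.no_grad():
    for feats, labels in test_loader:
        preds = (model(feats) > 0.5).float()     # threshold the sigmoid output
        correct += (preds == labels).sum().item()
        total += labels.numel()
        y_true.extend(labels.tolist())
        y_pred.extend(preds.tolist())
print(f"accuracy: {correct / total:.3f}")
```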

Results

  • The model's performance on the raw test set is as follows:

    • Accuracy: 93.6%
    • Precision: 94.2%
    • Recall: 92.92%
    • F1 Score: 93.5%
  • The confusion matrix:

[Confusion matrix image: see the results folder or the notebook]
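The metrics above can be reproduced from the collected predictions with scikit-learn; this sketch assumes the `y_true`/`y_pred` lists built in the evaluation loop earlier:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows: true class, cols: predicted
```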

Note: See the notebook for more details on the results.

Contributors