# **Project: Text Sentiment Analysis using Naive Bayes Classifier and Neural Network Models**
<hr>

- **Author**: Uyen Nguyen
- **Date**: April 2024
- **Course**: AI Vietnam - Course 2023

<hr>

## **I. Introduction**

***Text Classification*** is a fundamental task in Natural Language Processing (NLP) that involves assigning labels to text units, which can range from individual words to paragraphs or entire documents. This project focuses on sentiment analysis, which classifies text as positive or negative. Sentiment analysis is crucial for understanding public opinion, customer feedback, and social media monitoring. We will build two types of classifiers: a Naive Bayes classifier and a neural network. Building both lets us compare their performance and automate the process of determining the sentiment expressed in textual data, with applications in domains such as marketing, customer service, and political analysis.

## **II. Dataset Information**
For this project, we use the NTC-SCV dataset, which is available in this [GitHub](https://github.com/congnghia0609/ntc-scv) repository. NTC-SCV is a collection of Vietnamese text reviews for sentiment analysis. It contains 50,000 samples, each labeled as either positive or negative, and is designed for training and evaluating text classification models.

The dataset is organized into three main subsets:
- *Training Set*: Contains 30,000 samples used to train the model.
- *Validation Set*: Contains 10,000 samples used to tune the model and prevent overfitting.
- *Test Set*: Contains 10,000 samples used to evaluate the final model's performance.

Each subset is further divided into two categories:
- *Positive Reviews*: Reviews that express a positive sentiment, labeled as 1.
- *Negative Reviews*: Reviews that express a negative sentiment, labeled as 0.

The data is stored in directories with each review saved as a text file. The directory structure is as follows:
```
data/
├── train/
│   ├── pos/
│   └── neg/
├── valid/
│   ├── pos/
│   └── neg/
└── test/
    ├── pos/
    └── neg/
```
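As a quick sanity check on the layout above, the number of reviews per split and label can be counted directly from the folders. This is a minimal sketch, not part of the original notebook: the helper name `count_reviews` is hypothetical, and it assumes every review is a `.txt` file sitting directly inside a `pos/` or `neg/` folder.

```python
from pathlib import Path


def count_reviews(root, splits=("train", "valid", "test"), labels=("neg", "pos")):
    """Count the .txt files under root/<split>/<label>/ for every split and label."""
    root = Path(root)
    counts = {}
    for split in splits:
        for label in labels:
            folder = root / split / label
            # A missing folder is reported as 0 instead of raising an error.
            if folder.is_dir():
                counts[(split, label)] = len(list(folder.glob("*.txt")))
            else:
                counts[(split, label)] = 0
    return counts


# Hypothetical usage, assuming the dataset sits in a local "data" folder:
# for (split, label), n in count_reviews("data").items():
#     print(f"{split}/{label}: {n} reviews")
```

Comparing these counts with the expected subset sizes is a cheap way to catch an incomplete download or a wrong root path before any training starts.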
Examples of positive sentiment reviews might include praise for a product, positive feedback on a service, or general expressions of satisfaction. Negative sentiment reviews, on the other hand, might include complaints, expressions of dissatisfaction, or criticisms.

By leveraging the NTC-SCV dataset, we can develop and evaluate sentiment analysis models tailored to the Vietnamese language, contributing to advancements in NLP for underrepresented languages and providing practical tools for various real-world applications.

## **III. Project Framework**

In this project, we will train sentiment classification models using both Naive Bayes and neural network approaches. The general steps of the project are:

1. **Data Acquisition and Loading**: Initially, we will clone the dataset directly from a GitHub source, unzip it, and load it into our notebook.

2. **Text Data Preprocessing**: We will clean and prepare the text data for modeling by removing special characters, stop words, and other irrelevant data. Additionally, we will normalize the text by lowercasing, correcting typos, and converting abbreviations.

3. **Vector Representation**: After preprocessing, the text data will be converted into a numerical format that machine learning algorithms can process. We will achieve this by transforming the text into vectors using word embeddings.

4. **Building the Classification Models**: With the text data in a usable format, we will construct two models for classification: a Naive Bayes classifier and a Neural Network model.

5. **Model Training**: During this phase, the models learn to associate the input text vectors with the correct labels. The neural network does this by adjusting its weights to minimize a loss function, while the Naive Bayes classifier estimates class and feature probabilities from the training data. We will also use the validation set to monitor for overfitting and fine-tune the model parameters.

6. **Prediction and Model Evaluation**: Finally, we will use the trained models to predict the labels of new, unseen text data. The models' performance will be evaluated using metrics such as accuracy, precision, recall, and F1-score to determine their generalization capabilities on new data.
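The four metrics named in the last step can all be derived from the counts of true and false positives and negatives. The sketch below is illustrative only: the function name `binary_metrics` is hypothetical, the positive class is assumed to be labeled 1 as in the dataset description, and any ratio with a zero denominator is reported as 0.0 by convention.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Precision and recall are worth reporting alongside accuracy because they show whether errors are concentrated in false positives or false negatives, which accuracy alone hides.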

These steps outline the essential framework for training both the Naive Bayes classifier and the neural network to classify text data accurately and robustly.
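To make the flow from raw text to a prediction concrete, here is a toy version of steps 2 to 6 for the Naive Bayes side only. It is a sketch under stated assumptions, not the implementation used later in this notebook: it swaps in plain token counts (bag of words) for the vector representation, uses add-one (Laplace) smoothing, and trains on four invented English sentences. The names `preprocess`, `train_naive_bayes`, and `predict` are hypothetical.

```python
import math
import re
import unicodedata
from collections import Counter


def preprocess(text):
    """Lowercase the text, replace punctuation with spaces, and split into tokens."""
    # NFC keeps accented letters (e.g. Vietnamese) as single characters for \w.
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def train_naive_bayes(docs, labels):
    """Collect per-class document counts and per-class token counts."""
    class_counts = Counter(labels)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
    vocab = {tok for counts in token_counts.values() for tok in counts}
    return class_counts, token_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (token_counts[label][tok] + 1) / (total_tokens + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data for illustration only (1 = positive, 0 = negative).
train_texts = [
    "great food, friendly staff",
    "delicious and cheap",
    "terrible service, cold food",
    "dirty and expensive",
]
train_labels = [1, 1, 0, 0]
model = train_naive_bayes([preprocess(t) for t in train_texts], train_labels)
prediction = predict(model, preprocess("Friendly staff and delicious food!"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a single unseen word-class pair from driving a class probability to zero.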

## **IV. Project Implementation**
### **1. Import Necessary Libraries**
First, we import the basic libraries commonly used in machine learning. We will import additional libraries as needed later in the project.

```python
# Import the libraries needed for the project
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

### **2. Data Cloning and Loading**
For our project, we will clone the [GitHub](https://github.com/congnghia0609/ntc-scv) repository that hosts the NTC-SCV dataset to our local environment and unzip its archives as follows:

```python
# Download the dataset from GitHub
!git clone https://github.com/congnghia0609/ntc-scv
!unzip ./ntc-scv/data/data_test.zip -d ../data
!unzip ./ntc-scv/data/data_train.zip -d ../data
!rm -rf ./ntc-scv
```

```
Cloning into 'ntc-scv'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 39 (delta 0), reused 4 (delta 0), pack-reused 35
Receiving objects: 100% (39/39), 186.94 MiB | 12.92 MiB/s, done.
Resolving deltas: 100% (9/9), done.
Updating files: 100% (11/11), done.
Archive:  ./ntc-scv/data/data_test.zip
   creating: ../data/data_test/
   creating: ../data/data_test/test/
   creating: ../data/data_test/test/neg/
  inflating: ../data/data_test/test/neg/10.txt
  inflating: ../data/data_test/test/neg/10014.txt
  inflating: ../data/data_test/test/neg/1003.txt
  inflating: ../data/data_test/test/neg/10044.txt
  inflating: ../data/data_test/test/neg/10055.txt
  inflating: ../data/data_test/test/neg/1007.txt
  inflating: ../data/data_test/test/neg/10070.txt
  inflating: ../data/data_test/test/neg/10076.txt
  inflating: ../data/data_test/test/neg/10079.txt
  inflating: ../data
```
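The `!unzip` shell command works in notebook environments that provide it. Where it is not available, the same extraction can be done with Python's standard `zipfile` module. This is a minimal sketch, and the helper name `extract_archives` is hypothetical; the archive paths in the commented usage are the ones from the cell above and assume the repository has already been cloned.

```python
import zipfile
from pathlib import Path


def extract_archives(archive_paths, dest_dir):
    """Extract every zip archive into dest_dir and return the extracted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in archive_paths:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            extracted.extend(zf.namelist())
    return extracted


# Hypothetical usage, run after cloning the repository and before deleting it:
# extract_archives(
#     ["./ntc-scv/data/data_test.zip", "./ntc-scv/data/data_train.zip"],
#     "../data",
# )
```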

After unzipping, we have a new `data` folder with two subfolders, `data_test` and `data_train`, which we will use for training and testing our models as we progress. Now, we will define a function to read the files in these folders and store them in variables as follows: