# Sentiment Analysis Challenge


## Introduction

Ever wondered how social media platforms figure out if a post is positive or negative? Or how companies analyze customer reviews to improve their products? That’s where Sentiment Analysis comes in!

In this challenge, you'll step into the world of Natural Language Processing (NLP), the tech behind chatbots, search engines, and AI assistants. Your task? Build a sentiment classification model that can determine whether a given piece of text is positive or negative.

<img src="sentiment_analysis.png" alt="Sentiment Analysis Example" width="600"/>

### Challenge Structure:
**Note: This notebook is a suggested starter template for the challenge. You are free to modify the code as you see fit.**

The challenge is divided into 5 tasks (the last one is optional):
- Task 1: Exploratory Data Analysis (EDA)
- Task 2: Data Preprocessing
- Task 3: Model Selection and Training
- Task 4: Evaluation
- Task 5: Inference (Optional)

The **dataset provided** (amazon.csv) contains **19,396 reviews** and two columns: `Text` and `label`. The label is 0 for negative and 1 for positive.

<img src="dataset_str.png" alt="Dataset Example" width="400"/>

Note: The classification model you choose can range from traditional machine learning algorithms (e.g. Logistic Regression, Random Forest) to deep learning models like RNN, LSTM, and Transformers.

## Environment Setup

In [None]:
## Create virtual environment

# For Python (Windows/Linux/Mac):
# python -m venv myenv
# source myenv/bin/activate  # Linux/Mac
# myenv\Scripts\activate     # Windows

# If you're using Conda (Windows/Linux/Mac):
# conda create -n myenv python
# conda activate myenv

In [1]:
## Place all imports here
import pandas as pd

## Load and display the dataset
Use the cell below to load and display the dataset.

In [7]:
df = pd.read_csv("amazon.csv") # reads the file amazon.csv into a pandas dataframe

## Task 1: Exploratory data analysis (EDA)
EDA helps uncover patterns, detect anomalies, and validate assumptions through data visualization and summary statistics. What insights can you extract from the dataset to better understand its structure and sentiment distribution?

In [8]:
# to display the first 5 rows
df.head()

|   | Text | label |
|---|------|-------|
| 0 | Not only do we enjoy this individually, it ent... | 1 |
| 1 | Over the years I have purchased these radio ap... | 1 |
| 2 | I love this game! There are very few ads. The ... | 1 |
| 3 | It won't let me move anyone and when I try to ... | 0 |
| 4 | Two things in order for me to want this. Get ... | 0 |


In [9]:
# to view the statistics of the dataframe
df.describe()

|       | label |
|-------|-------|
| count | 19396.0 |
| mean  | 0.769746 |
| std   | 0.421006 |
| min   | 0.0 |
| 25%   | 1.0 |
| 50%   | 1.0 |
| 75%   | 1.0 |
| max   | 1.0 |


<span style="color:red">##### TODO: perform more EDA and some visualizations</span>

In [None]:
# Here are a few options to explore for EDA and visualizations (you can choose some of them or add more):
# 1. Check for missing values in the dataset
# 2. Analyze the distribution of sentiment labels (check dataset balance)
# 3. Visualize the length of reviews (character/word count) by sentiment
# 4. Create word clouds for positive and negative reviews
# 5. Examine the most frequent words in each sentiment class
# 6. Analyze the relationship between review length and sentiment
# 7. Check for duplicate reviews
# 8. Visualize the distribution of review lengths
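
A few of the options above (1, 2, 3 and 7) can be sketched with plain pandas. This is a minimal illustration, not a complete EDA: `df_demo` is a made-up stand-in with the same `Text` and `label` columns so the snippet runs on its own, and the variable names are hypothetical. On the real data you would run the same calls on `df`.

```python
import pandas as pd

# Toy stand-in for amazon.csv with the same columns (Text, label); values are made up.
df_demo = pd.DataFrame({
    "Text": [
        "I love this game!",
        "It won't let me move anyone.",
        "Great app, works well.",
        "I love this game!",   # exact duplicate of the first row
        None,                  # missing review text
    ],
    "label": [1, 0, 1, 1, 0],
})

# Option 1: missing values per column
missing_counts = df_demo.isna().sum()

# Option 2: share of each sentiment label (dataset balance)
label_share = df_demo["label"].value_counts(normalize=True)

# Option 7: fully duplicated rows
n_duplicates = int(df_demo.duplicated().sum())

# Option 3: review length in words, averaged per label
clean = df_demo.dropna(subset=["Text"]).copy()
clean["word_count"] = clean["Text"].str.split().str.len()
mean_words_by_label = clean.groupby("label")["word_count"].mean()

print(missing_counts, label_share, n_duplicates, mean_words_by_label, sep="\n\n")
```

The same `word_count` column can feed a histogram or box plot if you want the visual versions of options 3 and 8.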


## Task 2: Data Preprocessing
Raw text data is often noisy and unstructured, containing inconsistencies like typos, slang, abbreviations, and irrelevant information. How can you preprocess the dataset to clean and standardize the text for better sentiment classification?

Suggested preprocessing steps:
- Removing punctuation
- Removing stopwords
- Tokenizing text
- Lemmatizing text
- Representing text numerically (hint: TF-IDF, bag of words, word2vec, BERT, etc.)
- Encoding sentiments as numerical labels (in this dataset, `label` is already 0/1)
- Splitting the dataset




#### **Note: The preprocessing steps you need depend on the model you choose to use.**

### Preprocess the dataset
Apply some of the preprocessing steps suggested above to the dataset.

In [None]:
### TODO: preprocess the dataset
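
As one possible starting point, the first three suggested steps (punctuation removal, stopword removal, tokenization) can be sketched with the standard library alone. The `clean_text` helper and the tiny `STOPWORDS` set are hypothetical and only for illustration. Lemmatization is left out here because it needs an NLP library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stopword list (an assumption); NLP libraries ship fuller ones.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "to", "and", "i", "of"}

def clean_text(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    # Remove every ASCII punctuation character (note: "won't" becomes "wont").
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_text("I love this game! There are very few ads."))
```

With a DataFrame, something like `df["Text"].apply(clean_text)` would apply this to every review. Transformer models usually expect raw text and their own tokenizer, so this kind of cleaning mainly suits TF-IDF or bag-of-words pipelines.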




### Splitting the dataset
Choose an appropriate approach for splitting the dataset into train, validation, and test sets. (Hint: fix the random state so that the same test samples are selected on every run, e.g. `random_state=42`.)

In [None]:
### TODO: choose an appropriate approach to split the dataset into train/validation/test sets.
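
One common approach, sketched here with scikit-learn's `train_test_split`, is to split twice: first hold out the test set, then carve a validation set out of the remainder. The 60/20/20 ratio and the use of `stratify` are assumptions rather than challenge requirements, and the toy `texts` and `labels` stand in for `df["Text"]` and `df["label"]`.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for df["Text"] and df["label"]; values are made up.
texts = [f"review {i}" for i in range(20)]
labels = [1] * 15 + [0] * 5   # imbalanced, like the real dataset

# Hold out 20% as the test set, then 25% of the remainder as validation
# (0.25 * 0.8 = 0.2), giving a 60/20/20 split. stratify keeps the label ratio.
X_temp, X_test, y_temp, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying matters here because the classes are imbalanced. Without it, a small validation or test set could end up with very few negative reviews.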




## Task 3: Model Selection and Training

In this task, you are free to choose any appropriate approach for sentiment classification.
You can experiment with, for example:
- Using pretrained transformer-based models (e.g., BERT, RoBERTa, DistilBERT) via Hugging Face (https://huggingface.co/models?pipeline_tag=text-classification&sort=trending).
- Applying traditional machine learning techniques (e.g., Logistic Regression, Random Forest, SVM) to vectorized features such as TF-IDF or word embeddings.
- Building your own custom deep learning model (e.g., CNN, RNN, LSTM) for the classification task.

In [None]:
### TODO: Define an appropriate model and classify the sentiment of the dataset.
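
As a baseline for the traditional route, the sketch below chains TF-IDF features into a logistic regression with a scikit-learn pipeline. It is one option among those listed above, not the expected solution. The training texts are made up, and `class_weight="balanced"` is an assumption aimed at the label imbalance in the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (made up); in the notebook these would come from the train split.
train_texts = [
    "I love this game", "great app works well", "really fun and useful",
    "terrible keeps crashing", "waste of time", "awful full of ads",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each text into a sparse vector of weighted word counts;
# logistic regression then learns a linear decision boundary on those vectors.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["love this app", "keeps crashing"])
print(predictions)
```

Keeping the vectorizer inside the pipeline means it is fitted on the training texts only, which avoids leaking vocabulary statistics from the validation or test sets.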



## Task 4: Evaluation

Model evaluation helps assess performance by using various metrics to understand a model’s strengths and weaknesses. How well do your trained and pretrained models classify sentiments, and which evaluation metrics best capture their effectiveness?

- Remember that you must use **at least the F1 score** as a metric to evaluate your model's performance.

Evaluate the performance of the model(s) on the test data. Include a visual comparison of each model's training loss and validation loss.

In [None]:
### TODO: Evaluate the performance of the model(s) on test data.
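
The `df.describe()` output in Task 1 reports a label mean of about 0.77, so roughly 77% of the reviews are positive. The sketch below shows why accuracy alone can mislead on such data: a model that always predicts "positive" scores high accuracy but an F1 of zero on the negative class. The `f1_for_class` helper and the labels are hypothetical; in practice `sklearn.metrics.f1_score` computes the same quantity.

```python
def f1_for_class(y_true, y_pred, target):
    """F1 score treating `target` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 8 positive, 2 negative; a model that always predicts positive.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_pos = f1_for_class(y_true, y_pred, target=1)
f1_neg = f1_for_class(y_true, y_pred, target=0)
macro_f1 = (f1_pos + f1_neg) / 2  # unweighted mean over both classes

print(accuracy, f1_pos, f1_neg, macro_f1)
```

Reporting the per-class or macro-averaged F1 alongside accuracy makes this kind of failure visible.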




## Task 5: Inference (Optional)
AI inference applies a trained model to predict outcomes on new, unseen data. How can you use your trained model to classify the sentiment of new, unseen text that you supply yourself, and evaluate its performance?

Note that:
- This task is about applying the model rather than training it: the model(s) created in the previous tasks should be used here on unseen data.
- Unseen data is data that the model has not seen during training. This can include any new content you create independently.
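
A minimal sketch of the inference step might wrap the model's `predict` call and map the numeric labels back to names. `predict_sentiment` and `LABEL_NAMES` are hypothetical helpers, and `KeywordModel` is only a placeholder so the snippet runs on its own; you would pass your trained model from Task 3 instead, assuming it exposes a scikit-learn style `.predict()` method.

```python
LABEL_NAMES = {0: "negative", 1: "positive"}  # mapping from the dataset description

def predict_sentiment(model, texts):
    """Return (text, sentiment name) pairs for any fitted model exposing .predict()."""
    predicted = model.predict(texts)
    return [(text, LABEL_NAMES[int(label)]) for text, label in zip(texts, predicted)]

class KeywordModel:
    """Placeholder for a trained classifier; flags text containing 'bad' as negative."""
    def predict(self, texts):
        return [0 if "bad" in text.lower() else 1 for text in texts]

# New, self-written reviews that were not part of the training data.
new_reviews = ["Works great on my tablet", "Bad update, it crashes now"]
results = predict_sentiment(KeywordModel(), new_reviews)
for text, sentiment in results:
    print(f"{sentiment}: {text}")
```

If you also write down the sentiment you expect for each new review, you can score the predictions with the same metrics used in Task 4.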