# Sentiment Analysis for Customer Reviews Challenge

## Challenge:
Develop a robust Sentiment Analysis classifier for XYZ customer reviews, automating the categorization into positive, negative, or neutral sentiments. Utilize Natural Language Processing (NLP) techniques, exploring different sentiment analysis methods.

## Problem Statement:
XYZ organization, a global online retail giant, accumulates a vast number of customer reviews daily. Extracting sentiments from these reviews offers insights into customer satisfaction, product quality, and market trends. The challenge is to create an effective sentiment analysis model that accurately classifies XYZ customer reviews.

### Important Instructions:

1. Make sure this ipynb file that you have cloned is in the __Project__ folder on the Desktop. The Dataset is also available in the same folder.
2. Ensure that all the cells in the notebook can be executed without any errors.
3. Once the Challenge has been completed, save the SentimentAnalysis.ipynb notebook in the __*Project*__ folder on the Desktop. If the file is not present in that folder, auto-evaluation will fail.
4. Print the evaluation metrics of the model. 
5. Before you submit the challenge for evaluation, please make sure you have assigned the Accuracy score of the model that was created for evaluation.
6. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score*. The solution is to be written between the comments `# code starts here` and `# code ends here`
7. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

### --------------------------------------- CHALLENGE CODE STARTS HERE --------------------------------------------

## Data Collection 

In [1]:
#  Import necessary libraries
import pandas as pd
import re
import nltk
import joblib
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
# Download NLTK stopwords if not already downloaded
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ramwadhwa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Load the dataset
# Assuming your dataset is loaded into a DataFrame named 'data'
data = pd.read_csv('Reviews.csv')  # Change the filename to your dataset

In [4]:
# Select relevant columns
data = data[['Text', 'Score']]

## Data Preprocessing

#### Text Cleaning

In [5]:
# Text Cleaning
def clean_text(text):
    clean = re.compile('<.*?>')
    text = re.sub(clean, '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower()
    return text

In [6]:
data['CleanText'] = data['Text'].apply(clean_text)
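
To make the effect of the cleaning step concrete, here is a small sketch that applies the same three operations to one made-up review. The sample string is invented, and the final whitespace-collapsing step is a hypothetical addition that the notebook does not perform.

```python
import re

def clean_text(text):
    """Same steps as the notebook's clean_text: strip tags, keep letters, lowercase."""
    text = re.sub(r'<.*?>', '', text)        # drop anything that looks like an HTML tag
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits and punctuation
    return text.lower()

# Made-up review text, for illustration only.
sample = "<br />Best coffee I've had, 10/10!"
cleaned = clean_text(sample)
print(repr(cleaned))

# Hypothetical extra step (not in the notebook): collapse repeated whitespace.
normalized = re.sub(r'\s+', ' ', cleaned).strip()
print(repr(normalized))
```

Note that digits and apostrophes are removed outright, so "I've" becomes "ive" and a rating such as "10/10" disappears from the text.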

#### Tokenization

In [7]:
# Tokenization and Stopwords Removal
stop_words = set(stopwords.words('english'))

In [8]:
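def tokenize(text):
    # word_tokenize needs NLTK's Punkt tokenizer data in addition to the
    # stopwords corpus downloaded above
    tokens = nltk.word_tokenize(text)
    # Keep only the tokens that are not English stopwords
    tokens = [token.strip() for token in tokens if token.strip() not in stop_words]
    return tokens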

In [9]:
data['TokenizedText'] = data['CleanText'].apply(tokenize)
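
One side effect of stopword removal matters for sentiment work: NLTK's English stopword list includes negations such as "not" and "no", so a negated review can end up with the same tokens as a positive one. The sketch below shows this with a tiny hand-picked stopword set and a plain whitespace split, both of which are stand-ins chosen so that the example runs without any NLTK downloads.

```python
# Tiny hand-picked stopword set for illustration; the notebook uses NLTK's English list.
demo_stop_words = {"this", "is", "a", "the", "not", "it"}

def remove_stop_words(text, stop_words):
    """Split on whitespace and drop every token found in stop_words."""
    return [token for token in text.split() if token not in stop_words]

tokens_good = remove_stop_words("this is a good product", demo_stop_words)
tokens_bad = remove_stop_words("this is not a good product", demo_stop_words)

print(tokens_good)
print(tokens_bad)
```

Both calls return the same token list. In this notebook the TF-IDF step reads `CleanText`, not `TokenizedText`, so the stopword removal does not affect the trained model.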

#### Handling Missing Values & Duplicates

In [10]:
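# Handling Missing Values (if any)
# Note: clean_text above expects strings, so rows with a missing 'Text'
# would have to be dropped before the cleaning step rather than here.
data.dropna(inplace=True)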

In [11]:
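# Duplicate removal is left disabled. 'Product' is not one of the columns
# selected earlier ('Text', 'Score'), so enabling this line as written would raise a KeyError.
# data = data.drop_duplicates(subset=['Product'], keep='first')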

## Sentiment Analysis Implementation

In [12]:
# Mapping scores to sentiments
def label_sentiment(score):
    if score >= 4:
        return 'Positive'
    elif score <= 2:
        return 'Negative'
    else:
        return 'Neutral'

In [13]:
data['Sentiment'] = data['Score'].apply(label_sentiment)
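
Before training, it can help to check how the three labels are distributed, because the share of the largest class is the accuracy that a model predicting only that class would reach. The sketch below uses a short list of made-up scores; on the real DataFrame, `data['Sentiment'].value_counts(normalize=True)` gives the same kind of summary.

```python
from collections import Counter

def label_sentiment(score):
    """Same mapping as the notebook: 4-5 Positive, 1-2 Negative, 3 Neutral."""
    if score >= 4:
        return 'Positive'
    elif score <= 2:
        return 'Negative'
    else:
        return 'Neutral'

# Made-up scores, for illustration only.
scores = [5, 4, 5, 1, 3, 5, 2, 4]
labels = [label_sentiment(s) for s in scores]

counts = Counter(labels)
# Accuracy of a baseline that always predicts the most frequent label.
majority_share = counts.most_common(1)[0][1] / len(labels)

print(counts)
print(f"Majority-class baseline accuracy: {majority_share:.3f}")
```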

#### Feature Extraction

In [14]:
# Feature Extraction (TF-IDF)
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(data['CleanText'])
y = data['Sentiment']
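
As a rough illustration of what `TfidfVectorizer` computes with its default settings, the sketch below reproduces the weighting by hand on three made-up documents. It assumes the documented defaults (`smooth_idf=True`, `norm='l2'`, raw term counts) and uses a plain whitespace split in place of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only.
docs = ["good coffee", "bad coffee", "good good tea"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc_tokens):
    """Term count times IDF, scaled so the vector has unit Euclidean length."""
    counts = Counter(doc_tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

vec = tfidf_vector(tokenized[2])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.4f}")
```

A rarer term such as "tea" receives a higher IDF than "good", but "good" still gets the larger final weight in the third document because it occurs twice there.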

In [15]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#### Model Selection

In [16]:
# Model Selection and Training (Using Naive Bayes as an example)
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

MultinomialNB()

In [17]:
# Save the trained model
joblib.dump(nb_classifier, 'sentiment_model.pkl')  # Save the model as 'sentiment_model.pkl'

['sentiment_model.pkl']
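
The saved file holds only the classifier. Classifying new reviews later also requires the fitted `TfidfVectorizer`, since the model expects the same feature columns it was trained on. One possible approach, sketched below on toy data, is to bundle both steps in a scikit-learn pipeline and persist that single object. The training texts and the file name `sentiment_pipeline.pkl` are invented for this example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = ["great product love it", "terrible waste of money",
         "love the taste", "awful terrible smell"]
labels = ["Positive", "Negative", "Positive", "Negative"]

# The pipeline fits the vectorizer and the classifier together.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Save and reload in a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_pipeline.pkl")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw text directly.
prediction = restored.predict(["love this great taste"])[0]
print(prediction)
```

A pipeline also keeps TF-IDF fitting inside `fit`, so when it is fit on the training split only, the held-out reviews do not contribute to the vocabulary or the IDF weights.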

#### Training and Testing

In [18]:
# Predictions
predictions = nb_classifier.predict(X_test)

#### Evaluation Metrics

In [19]:
# Evaluation Metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')

In [20]:
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

Accuracy: 0.78, Precision: 0.76, Recall: 0.78, F1 Score: 0.69
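
With `average='weighted'`, each per-class score is weighted by that class's share of the true labels. A consequence is that weighted recall is algebraically equal to accuracy, which is why both print as 0.78 above. The sketch below recomputes the four metrics by hand on made-up labels to show this, and also how a class that is never predicted pulls the weighted F1 below accuracy. A class with no predictions is given precision 0 here.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), the last three support-weighted."""
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = true_pos[label] / predicted[label] if predicted[label] else 0.0
        r = true_pos[label] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / total
    return accuracy, precision, recall, f1

# Made-up labels: the single Neutral review is never predicted as Neutral.
y_true = ["Positive"] * 6 + ["Negative"] * 3 + ["Neutral"]
y_pred = ["Positive"] * 6 + ["Positive", "Negative", "Negative"] + ["Positive"]

accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1 Score: {f1:.2f}")
```

Because the weighted averages can hide a weak minority class, per-class figures (for example from `sklearn.metrics.classification_report`) are a useful complement.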


### --------------------------------------- CHALLENGE CODE ENDS HERE --------------------------------------------

### NOTE:
1. Assign the Accuracy score obtained for the model created in this challenge to the specified variable in the predefined function *submit_accuracy_score* below. The solution is to be written between the comments `# code starts here` and `# code ends here`
2. Please do not make any changes to the variable names and the function name *submit_accuracy_score* as this will be used for automated evaluation of the challenge. Any modification in these names will result in unexpected behaviour.

In [21]:
def submit_accuracy_score(y_test, predictions) -> float:
    # accuracy should be in the range of 0.0 to 1.0
    # code starts here
    accuracy = accuracy_score(y_test, predictions)
    # code ends here
    return accuracy

In [23]:
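# Note: this assignment rebinds the name accuracy_score, hiding the function imported
# from sklearn.metrics; rerun the import cell before calling that function again.
accuracy_score=submit_accuracy_score(y_test,predictions)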

In [29]:
print("Accuracy Score=" + str(accuracy_score))

Accuracy Score=0.7849616212317562
