# Sentiment Analysis (NLP) 

- IMDB Movie Reviews

### Import necessary libraries:

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

### Load the dataset and explore it

In [2]:
# Load the dataset
df = pd.read_csv("IMDB_dataset.csv")

# Display the first few rows
print(df.head())

# Check the shape of the dataset
print("Dataset Shape:", df.shape)

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Dataset Shape: (50000, 2)


### Data Preprocessing:

In [3]:
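# Tokenization and cleaning
# Download the NLTK data used below: the stopword list and the Punkt
# tokenizer models that nltk.word_tokenize relies on
# ('punkt'; recent NLTK releases look for 'punkt_tab' instead)
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))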

def preprocess_text(text):
    # Tokenize the text
    words = nltk.word_tokenize(text)
    
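    # Keep purely alphabetic tokens that are not stopwords, then lowercase them.
    # Note: the stopword check runs before lowercasing, so capitalized
    # stopwords such as "The" are not removed.
    cleaned_words = [word.lower() for word in words if word.isalpha() and word not in stop_words]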
    
    return ' '.join(cleaned_words)

# Apply preprocessing to the 'review' column
df['cleaned_review'] = df['review'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kavit\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
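
The preprocessing cell above checks stopwords before lowercasing and leaves the `<br />` markup visible in the sample reviews to the tokenizer. As a rough sketch of one possible variant, the function below strips HTML tags, lowercases, and only then filters stopwords. Everything in it is an assumption for illustration: `preprocess_text_v2`, `DEMO_STOP_WORDS`, and the regex tokenizer are hypothetical stand-ins (the notebook uses `nltk.word_tokenize` and NLTK's English stopword list), and the results reported later in this notebook were produced with the original `preprocess_text`, not with this variant.

```python
import re

# Hypothetical, tiny stopword list that keeps this sketch self-contained;
# the notebook itself uses stopwords.words('english') from NLTK.
DEMO_STOP_WORDS = {"the", "a", "an", "is", "was", "this", "of", "and", "to", "it"}

HTML_TAG = re.compile(r"<[^>]+>")  # matches tags such as <br />
WORD = re.compile(r"[a-z]+")       # runs of lowercase ASCII letters

def preprocess_text_v2(text, stop_words=DEMO_STOP_WORDS):
    """Strip HTML tags, lowercase, then drop stopwords."""
    text = HTML_TAG.sub(" ", text).lower()
    return " ".join(w for w in WORD.findall(text) if w not in stop_words)

print(preprocess_text_v2("The film was GREAT.<br /><br />This is a wonderful production"))
# prints: film great wonderful production
```

The regex tokenizer is not equivalent to `nltk.word_tokenize`: for example, it splits contractions at the apostrophe, so swapping it in would change the vocabulary and likely the scores.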


### Split the dataset into training and testing sets:

In [4]:
X = df['cleaned_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Feature Extraction:

In [5]:
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
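
To make the bag-of-words step more concrete, here is a simplified pure-Python sketch of the idea behind fitting on the training set and then transforming the test set. It is not how `CountVectorizer` is implemented: the helper names `fit_vocabulary` and `transform` are hypothetical, the toy documents are made up, and tokenization is a plain whitespace split rather than scikit-learn's default token pattern.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Turn each document into a row of word counts; unseen words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Dicts preserve insertion order, so columns follow the vocabulary order.
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows

train_docs = ["great film great cast", "boring film"]
test_docs = ["great plot"]  # "plot" never appeared in the training documents

vocab = fit_vocabulary(train_docs)
print(vocab)                         # {'boring': 0, 'cast': 1, 'film': 2, 'great': 3}
print(transform(train_docs, vocab))  # [[0, 1, 1, 2], [1, 0, 1, 0]]
print(transform(test_docs, vocab))   # [[0, 0, 0, 1]]
```

Because the vocabulary is built from the training documents only, the test word "plot" has no column and is dropped. This mirrors why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.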


### Build and train a sentiment analysis model:

In [6]:
model = MultinomialNB()
model.fit(X_train_bow, y_train)

### Make predictions and evaluate the model:

In [7]:
y_pred = model.predict(X_test_bow)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

Accuracy: 0.8585
Classification Report:
               precision    recall  f1-score   support

    negative       0.84      0.88      0.86      4961
    positive       0.87      0.84      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



## Interpretation:

Accuracy: The model labels about 86% of the 10,000 test reviews correctly (0.8585).

For the "negative" class:

- Precision: when the model predicts a review as "negative", it is right about 84% of the time.
- Recall: of the reviews that are actually negative, it catches 88%.

For the "positive" class:

- Precision: when the model predicts a review as "positive", it is right about 87% of the time.
- Recall: of the reviews that are actually positive, it catches 84%.

F1-score: The harmonic mean of precision and recall, so it is high only when both are high. It is 0.86 for both classes here.

Support: The number of test reviews that actually belong to each class (4,961 negative and 5,039 positive).
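
The metrics above can be recomputed by hand from true and predicted labels. The sketch below does this for a small made-up set of ten labels (not the notebook's data); `per_class_metrics` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn  # support = actual members of the class

# Toy labels, invented for this example only.
y_true = ["negative"] * 4 + ["positive"] * 6
y_pred = ["negative", "negative", "negative", "positive",
          "negative", "negative", "positive", "positive", "positive", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for label in ("negative", "positive"):
    precision, recall, f1, support = per_class_metrics(y_true, y_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

On these toy labels the accuracy is 0.70. The "negative" class has precision 0.60 and recall 0.75 because the model predicts "negative" five times but only three of those reviews are truly negative, while it finds three of the four actual negatives.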