<a href="https://colab.research.google.com/github/shahzadahmad3/Natural-Language-Processing/blob/main/Movie_Review_Sentiment_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Movie Review Sentiment Classification**

**Goal:** Classify movie reviews as either positive or negative based on their text content.

**Dataset:** We'll use the movie reviews corpus that ships with NLTK (`nltk.corpus.movie_reviews`), a set of IMDb reviews labeled `pos` or `neg`.

**Challenge:** Handling text preprocessing, feature extraction, and model selection for optimal accuracy.

In [12]:
import nltk
from nltk.corpus import movie_reviews
import random

nltk.download('movie_reviews')

# Load dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Shuffle data for randomness
random.shuffle(documents)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [13]:
# Step 1: Data Preprocessing
# The corpus is already tokenized, so we lowercase the tokens and remove stopwords and punctuation.
from nltk.corpus import stopwords
import string

nltk.download('punkt_tab')
nltk.download('stopwords')
# Build the stopword and punctuation sets once, so each lookup is a fast set
# membership test instead of reloading the stopword list for every word
stop_words = set(stopwords.words('english'))
punctuations = set(string.punctuation)

def preprocessing(text):
  """Lowercase a list of tokens and drop stopwords and punctuation marks."""
  filtered_text = [word.lower() for word in text if word.lower() not in stop_words]
  processed_text = [word for word in filtered_text if word not in punctuations]
  return processed_text

# Apply preprocessing
processed_documents=[(preprocessing(words), category) for words, category in documents]


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
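
To see what this preprocessing step does to one review, here is a minimal sketch that runs without any downloads. It assumes a tiny hypothetical stopword set (`DEMO_STOP_WORDS`) in place of NLTK's English list, and `preprocess_tokens` is an illustrative name rather than a function from this notebook.

```python
import string

# Hypothetical mini stopword list for illustration only; the notebook itself
# uses NLTK's full English list from stopwords.words('english').
DEMO_STOP_WORDS = {"the", "was", "a", "and", "it", "is", "of"}
PUNCTUATION = set(string.punctuation)

def preprocess_tokens(tokens):
    """Lowercase tokens, then drop stopwords and single-character punctuation."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered
            if tok not in DEMO_STOP_WORDS and tok not in PUNCTUATION]

sample = ["The", "plot", "was", "thin", ",", "and", "the", "acting", "is", "wooden", "."]
cleaned = preprocess_tokens(sample)
print(cleaned)  # ['plot', 'thin', 'acting', 'wooden']
```

Keeping the stopwords in a `set` matters at corpus scale, because membership is checked once per token across every review.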


In [25]:
# Step 2: Feature Engineering with TF-IDF
# We'll use TF-IDF vectorization for feature extraction.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Convert token lists back to strings and collect the labels
texts = [" ".join(words) for words, category in processed_documents]
labels = [category for words, category in processed_documents]

# Vectorize the texts, keeping the 5,000 most frequent terms
vectorize = TfidfVectorizer(max_features=5000)
X = vectorize.fit_transform(texts)

# Encode labels as integers ('neg' -> 0, 'pos' -> 1)
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
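
For intuition about the numbers stored in `X`, the sketch below computes TF-IDF weights by hand on a toy corpus. It assumes scikit-learn's documented defaults (raw term counts, smoothed `idf = ln((1 + n) / (1 + df)) + 1`, L2 normalization per document) and uses plain whitespace tokenization, which is simpler than `TfidfVectorizer`'s default token pattern. The function name `tfidf_vectors` and the toy documents are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector.

    Assumes smoothed idf = ln((1 + n) / (1 + df)) + 1 and raw term counts.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["great film great cast", "dull film", "great fun"]
vocab, rows = tfidf_vectors(toy_docs)
for term, weight in zip(vocab, rows[1]):
    print(f"{term:>6}: {weight:.3f}")
```

In the second toy document, `dull` appears in only one document while `film` appears in two, so `dull` ends up with the larger weight. That is the property that lets distinctive sentiment words stand out from common ones.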


In [26]:
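# Step 3: Train/Test Split
# Hold out 20% of the reviews for testing; random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)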
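
The idea behind this call can be sketched in plain Python: shuffle the indices with a seeded random generator, then hold out a fixed fraction. This is only an illustration under that assumption. `holdout_split` is a hypothetical helper, and it does not reproduce scikit-learn's exact shuffling, so its indices will differ from those of `train_test_split`.

```python
import random
from collections import Counter

def holdout_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a seeded RNG and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data: 10 'neg' and 10 'pos' placeholder reviews
reviews = [f"review_{i}" for i in range(20)]
sentiments = ["neg"] * 10 + ["pos"] * 10
X_tr, X_te, y_tr, y_te = holdout_split(reviews, sentiments)
print(len(X_tr), len(X_te), Counter(y_te))
```

A purely random split does not guarantee equal class counts in the test set. If balanced classes matter, `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that preserves the class proportions in both parts.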

**4. Model Selection:**
We'll train several machine learning models and compare their performance:

1.   Logistic Regression
2.   Naive Bayes
3.   Support Vector Machine (SVM)
4.   Random Forest

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

model_lr = LogisticRegression()
model_nb = MultinomialNB()
model_svm = SVC()
model_rf = RandomForestClassifier()

models = [model_lr, model_nb, model_svm, model_rf]

In [29]:
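# Step 5: Model Evaluation
# Fit each model on the training set, then report accuracy and per-class
# precision, recall, and F1 on the held-out test set.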
for model in models:
  model.fit(X_train, y_train)
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test, y_pred)
  print(f"Accuracy of {model}: {accuracy}")
  print(f"Classification Report of {model}: {classification_report(y_test, y_pred)}")

Accuracy of LogisticRegression(): 0.865
Classification Report of LogisticRegression():               precision    recall  f1-score   support

           0       0.84      0.89      0.86       189
           1       0.89      0.84      0.87       211

    accuracy                           0.86       400
   macro avg       0.87      0.87      0.86       400
weighted avg       0.87      0.86      0.87       400

Accuracy of MultinomialNB(): 0.8175
Classification Report of MultinomialNB():               precision    recall  f1-score   support

           0       0.76      0.89      0.82       189
           1       0.88      0.75      0.81       211

    accuracy                           0.82       400
   macro avg       0.82      0.82      0.82       400
weighted avg       0.83      0.82      0.82       400

Accuracy of SVC(): 0.8575
Classification Report of SVC():               precision    recall  f1-score   support

           0       0.82      0.89      0.86       189
           1  
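
The `0` and `1` rows in these reports are the encoded labels: `LabelEncoder` assigns integer codes in sorted label order, so `0` is `neg` and `1` is `pos`. As a rough guide to how the columns are derived, the sketch below computes accuracy, precision, recall, and F1 for one class from two label lists. The helper `binary_report` and the toy labels are illustrative assumptions and are not taken from the notebook's results.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Toy labels (not from the notebook): 0 = neg, 1 = pos
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
report = binary_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Here 3 of the 5 predicted positives are correct (precision 0.60) and 3 of the 4 true positives are found (recall 0.75), which shows why the two columns can differ for the same class.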

In [31]:
# Step 6: Classify a new movie review as positive or negative based on its text content.
def classify_movie_review(review_text):
  """Classifies a movie review as positive or negative.

    Args:
        review_text: The text content of the movie review.

    Returns:
        'pos' or 'neg', indicating the predicted sentiment.
    """

  # 1. Preprocess the review text
  processed_review = preprocessing(review_text.split())

  # 2. Vectorize using TF-IDF
  review_vector = vectorize.transform([" ".join(processed_review)])

  # 3. Predict using the chosen model (here the SVM; swap in model_lr to use Logistic Regression)
  prediction = model_svm.predict(review_vector)[0]

  # 4. Decode the prediction
  sentiment = encoder.inverse_transform([prediction])[0]

  return sentiment

# Example usage:
review = "An outstanding film with a gripping storyline and phenomenal performances! The direction was top-notch, and every scene kept me engaged."
predicted_sentiment = classify_movie_review(review)
print(f"Predicted sentiment: {predicted_sentiment}")

Predicted sentiment: pos
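
One detail to be aware of: `review_text.split()` splits on whitespace only, so punctuation stays attached to words (`"it."` rather than `"it"`), and such tokens are not matched by the stopword filter. The sketch below contrasts that with a simple regex tokenizer that separates punctuation marks. `simple_tokenize` is a hypothetical stand-in, not the tokenizer used to build the corpus, and whether the difference changes a prediction depends on the fitted vocabulary.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+(?:[-']\w+)* keeps hyphenated or apostrophe-joined words whole;
    # [^\w\s] matches any single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

text = "An outstanding film! The direction was top-notch, and I loved it."
whitespace_tokens = text.split()
regex_tokens = simple_tokenize(text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting the last token is `"it."`, while the regex version yields `"it"` followed by `"."`, so a stopword filter can then remove `"it"` as intended.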
