#**Sarcasm Detection Machine Learning Model for IMDB Movie Reviews**

###**Business Problem**
- The movie industry heavily relies on audience reviews for a film's success in theaters.
- Understanding and analyzing these reviews is crucial for providing accurate ratings and overall opinions on movies.
- However, some reviewers express their thoughts sarcastically, which can mislead traditional sentiment analysis models.
###**Solution Proposed**
- To address the challenge of sarcasm detection, we need to develop and implement Machine Learning models specifically designed to recognize and interpret sarcastic reviews.
- By incorporating sarcasm detection into sentiment analysis, we can significantly enhance the accuracy of review analysis.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
df= pd.read_excel('/content/drive/MyDrive/Colab Notebooks/Dataset.xlsx')

In [None]:
df.head(10)

Unnamed: 0,reviews,Sarcasm
0,One of the other reviewers has mentioned that ...,not sarcastic
1,A wonderful little production. <br /><br />The...,not sarcastic
2,This movie was a groundbreaking experience!<br...,sarcastic
3,I thought this was a wonderful way to spend ti...,not sarcastic
4,Basically there's a family where a little boy ...,sarcastic
5,"Petter Mattei's ""Love in the Time of Money"" is...",not sarcastic
6,"Probably my all-time favorite movie, a story o...",not sarcastic
7,I sure would like to see a resurrection of a u...,not sarcastic
8,"This show was an amazing, fresh & innovative i...",sarcastic
9,Encouraged by the positive comments about this...,sarcastic


#**1) Data Preparation**

We prepare the dataset for further processing with the following steps:

1. Removing duplicate reviews
2. Removing rows that contain null values

**Removing the duplicates and null values**

In [None]:
def prepare_dataset(df, label_column, review_column):
    """Drop duplicate reviews and rows that contain null values."""
    initial_rows = df.shape[0]

    # Remove duplicate reviews, keeping the first occurrence
    df = df.drop_duplicates(subset=review_column, keep='first')
    rows_dropped = initial_rows - df.shape[0]
    print(f"Number of rows dropped due to duplicates: {rows_dropped}")

    # Remove rows with a null value in any column
    # (label_column is accepted for a consistent signature but is not used here)
    df = df.dropna()
    print(f"Shape of the dataset after removing null values: {df.shape}")

    # Print the final shape of the dataset
    print("Final shape of the dataset:", df.shape)
    return df

df = prepare_dataset(df, 'Sarcasm','reviews')


Number of rows dropped due to duplicates: 34
Shape of the dataset after removing null values: (6497, 2)
Final shape of the dataset: (6497, 2)
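`prepare_dataset` calls `df.dropna()` across every column. As a hedged variant, the sketch below restricts the null check to the review and label columns, so a missing value in an unrelated column does not discard a usable row. The toy DataFrame, the extra `source` column, and the helper name `prepare_subset` are invented for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical): one duplicate review, one missing label,
# and one missing value in an unrelated column.
toy = pd.DataFrame({
    "reviews": ["Loved it", "Loved it", "What a plot...", "Fine film"],
    "Sarcasm": ["not sarcastic", "not sarcastic", None, "not sarcastic"],
    "source": ["web", "web", "web", None],
})

def prepare_subset(df, label_column, review_column):
    """Drop duplicate reviews, then rows missing a review or a label."""
    df = df.drop_duplicates(subset=review_column, keep="first")
    df = df.dropna(subset=[review_column, label_column])
    return df.reset_index(drop=True)

prepared = prepare_subset(toy, "Sarcasm", "reviews")
print(prepared)
```

Here the row with the missing label is dropped, while the row that only lacks a `source` value is kept.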


**Dataset analysis**

In [None]:
def analyze_excel_dataset(df, label_column, review_column):
    # Print the shape of the dataset
    print("Shape of the dataset:", df.shape)

    # Print the count of each unique label in the label column
    label_counts = df[label_column].value_counts()
    print("Count of each unique label in '{}':".format(label_column))
    print(label_counts)

    # Calculate and print the average review length in whitespace-separated words
    review_length = df[review_column].apply(lambda x: len(str(x).split()))
    average_length = review_length.mean()
    print("Average length of the reviews in '{}':".format(review_column), average_length)


analyze_excel_dataset(df,'Sarcasm','reviews')


Shape of the dataset: (6497, 2)
Count of each unique label in 'Sarcasm':
Sarcasm
sarcastic        3518
not sarcastic    2979
Name: count, dtype: int64
Average length of the reviews in 'reviews': 115.17115591811606


In [None]:
df

Unnamed: 0,reviews,Sarcasm
0,One of the other reviewers has mentioned that ...,not sarcastic
1,A wonderful little production. <br /><br />The...,not sarcastic
2,This movie was a groundbreaking experience!<br...,sarcastic
3,I thought this was a wonderful way to spend ti...,not sarcastic
4,Basically there's a family where a little boy ...,sarcastic
...,...,...
6539,This movie's idea of character development is ...,sarcastic
6540,I guess they ran out of budget for a decent sc...,sarcastic
6541,Who needs a plot when you have explosions ever...,sarcastic
6542,Is there an award for most generic action movi...,sarcastic


#**2) Cleaning data**

We clean the reviews in the input dataset by performing the following steps:

1.  Removing HTML tags from reviews.

2.  Removing URLs from reviews.

3.  Removing punctuation marks from reviews, except `?`, `!`, and `.`.

4.  Removing extra white spaces from reviews.

In [None]:
import re
import string

In [None]:
def clean_text(text):
    # Compile regular expressions for HTML tags and URLs , defining punctuation marks to remove
    html_pattern = re.compile('<.*?>')
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    punctuation_to_remove = ''.join(p for p in string.punctuation if p not in ['?', '!', '.'])

    # Remove HTML tags,urls,punctuation marks
    text = html_pattern.sub('', text)
    text = url_pattern.sub('', text)
    text = text.translate(str.maketrans('', '', punctuation_to_remove))

    # Remove extra white spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [None]:
df['cleaned_reviews'] = df['reviews'].apply(clean_text)

In [None]:
df.head(5)

Unnamed: 0,reviews,Sarcasm,cleaned_reviews
0,One of the other reviewers has mentioned that ...,not sarcastic,One of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,not sarcastic,A wonderful little production. The filming tec...
2,This movie was a groundbreaking experience!<br...,sarcastic,This movie was a groundbreaking experience! Iv...
3,I thought this was a wonderful way to spend ti...,not sarcastic,I thought this was a wonderful way to spend ti...
4,Basically there's a family where a little boy ...,sarcastic,Basically theres a family where a little boy J...
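To see what each cleaning step does, the sketch below applies the same regular expressions and punctuation rule to a made-up sample string. `clean_text_demo` and its `tag_replacement` parameter are hypothetical names added for this illustration; with the default empty replacement it mirrors `clean_text`.

```python
import re
import string

HTML_PATTERN = re.compile(r"<.*?>")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Remove all punctuation except '?', '!' and '.', as in clean_text
PUNCT_TABLE = str.maketrans(
    "", "", "".join(p for p in string.punctuation if p not in "?!.")
)

def clean_text_demo(text, tag_replacement=""):
    text = HTML_PATTERN.sub(tag_replacement, text)  # strip HTML tags
    text = URL_PATTERN.sub("", text)                # strip URLs
    text = text.translate(PUNCT_TABLE)              # strip other punctuation
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

sample = "Great movie!<br /><br />It's a must-see, visit www.example.com today."
print(clean_text_demo(sample))
print(clean_text_demo(sample, tag_replacement=" "))
```

With an empty replacement, words on either side of a tag are joined when the source has no surrounding space (`movie!Its`). Replacing tags with a space avoids this, at the cost of diverging from the cleaning used in this notebook.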


#**3) Data Preprocessing**

We preprocess the data with the following steps, in order:


1.   Stop word removal
2.   Lemmatization
3.   Checking for dataset imbalance
4.   Encoding the labels
5.   Splitting the data into training and test sets
6.   Tokenization and padding of the input sequences

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Initialize stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def process_review(review):
    # Remove stopwords
    filtered_words = [word for word in review.split() if word.lower() not in stop_words]
    filtered_sentence = ' '.join(filtered_words)

    # Perform lemmatization
    words = filtered_sentence.split()
    pos_tags = pos_tag(words)
    lemmatized_sentence = ' '.join([lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags])

    return lemmatized_sentence


In [None]:
df['lemmatized_reviews'] = df['cleaned_reviews'].apply(process_review)

In [None]:
df.head(5)

Unnamed: 0,reviews,Sarcasm,cleaned_reviews,lemmatized_reviews
0,One of the other reviewers has mentioned that ...,not sarcastic,One of the other reviewers has mentioned that ...,One reviewer mention watch 1 Oz episode youll ...
1,A wonderful little production. <br /><br />The...,not sarcastic,A wonderful little production. The filming tec...,wonderful little production. filming technique...
2,This movie was a groundbreaking experience!<br...,sarcastic,This movie was a groundbreaking experience! Iv...,movie groundbreaking experience! Ive never see...
3,I thought this was a wonderful way to spend ti...,not sarcastic,I thought this was a wonderful way to spend ti...,think wonderful way spend time hot summer week...
4,Basically there's a family where a little boy ...,sarcastic,Basically theres a family where a little boy J...,Basically there family little boy Jake think t...


**Checking for dataset imbalance**

In [None]:
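def check_imbalance(df, label_column, threshold=0.05):
    # The dataset is flagged as imbalanced only if some class makes up
    # less than `threshold` (5% by default) of all rows
    label_counts = df[label_column].value_counts()
    class_proportions = label_counts / label_counts.sum()
    return any(class_proportions < threshold)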

label = 'Sarcasm'
is_imbalanced = check_imbalance(df, label)
if is_imbalanced:
    print("The dataset is imbalanced.")
else:
    print("The dataset is balanced.")


The dataset is balanced.
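The check above depends heavily on the chosen threshold. The sketch below recomputes the class proportions from the label counts printed earlier and compares them with a hypothetical 90/10 split. The helper names and the skewed counts are assumptions for illustration only.

```python
import pandas as pd

def class_proportions(counts):
    """Return each class's share of the total."""
    counts = pd.Series(counts)
    return counts / counts.sum()

def is_imbalanced(counts, threshold=0.05):
    """True if any class falls below the given share of the data."""
    return bool((class_proportions(counts) < threshold).any())

notebook_counts = {"sarcastic": 3518, "not sarcastic": 2979}  # from the output above
skewed_counts = {"sarcastic": 900, "not sarcastic": 100}      # hypothetical

print(class_proportions(notebook_counts).round(3).to_dict())
print(is_imbalanced(notebook_counts))
print(is_imbalanced(skewed_counts))
print(is_imbalanced(skewed_counts, threshold=0.2))
```

The notebook's classes split roughly 54% to 46%. Note that with the default 5% threshold even the 90/10 split is not flagged, so a stricter threshold may be worth considering for other datasets.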


**Encoding the labels**

In [None]:
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

df['Sarcasm'] = label_encoder.fit_transform(df['Sarcasm'])

mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))

print("Mapping of sarcasm labels to numerical values:")
for sarcasm, label in mapping.items():
    print(f"{sarcasm}: {label}")

Mapping of sarcasm labels to numerical values:
not sarcastic: 0
sarcastic: 1
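The mapping above follows from `LabelEncoder` sorting the class names before assigning integers. A minimal sketch on a toy label list (not the notebook's data) shows this and how `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["sarcastic", "not sarcastic", "sarcastic"]  # toy labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so "not sarcastic" gets 0 and "sarcastic" gets 1
print(list(encoder.classes_))
print(encoded.tolist())

# Map numeric predictions back to readable labels
decoded = encoder.inverse_transform([0, 1])
print(decoded.tolist())
```

Keeping the fitted encoder around makes it possible to report predictions with their original label names.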


**Splitting the data into training and test sets**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Prepare the data for training
X = df['lemmatized_reviews']
Y = df['Sarcasm']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [None]:
X_train.shape

(5197,)

In [None]:
X_test.shape

(1300,)
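The split above is purely random. As an optional variant that the notebook does not use, passing `stratify` to `train_test_split` keeps the class proportions the same in both partitions. The sketch uses synthetic labels with a 60/40 split as an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 1 and 40 of class 0
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_tr), len(y_te))      # partition sizes
print(y_tr.mean(), y_te.mean())  # share of class 1 in each partition
```

Both partitions keep the 60% share of class 1, which makes test metrics easier to compare with training metrics.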

**Tokenization and Padding**

In [None]:
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

In [None]:
# Tokenizer and padding
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_train_ = pad_sequences(X_train_seq, padding='post')

X_test_seq = tokenizer.texts_to_sequences(X_test)
X_test_ = pad_sequences(X_test_seq, padding='post', maxlen=X_train_.shape[1])

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1
max_length = X_train_.shape[1]


In [None]:
print(f"vocabulary size: {vocab_size} ")
print(f"maximum length of input review embedding: {max_length}" )

vocabulary size: 29783 
maximum length of input review embedding: 944
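To make the tokenization and padding step concrete, here is a pure-Python imitation of what it produces. This is a simplified sketch, not Keras' implementation: the helper names are hypothetical, words are split on whitespace only, and the truncation rule (keep the last `maxlen` tokens) is an assumption modelled on the `pad_sequences` default.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids by descending word frequency, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its id; unseen words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Truncate to the last maxlen ids, then append zeros up to maxlen."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

train_texts = ["great movie great plot", "boring movie"]
word_index = fit_word_index(train_texts)
train_seq = texts_to_sequences(train_texts, word_index)
maxlen = max(len(seq) for seq in train_seq)
train_padded = pad_post(train_seq, maxlen)

# A test review containing a word never seen in training
test_padded = pad_post(texts_to_sequences(["great unseen plot"], word_index), maxlen)
vocab_size = len(word_index) + 1  # +1 because id 0 is reserved for padding

print(word_index)
print(train_padded)
print(test_padded)
print(vocab_size)
```

As in the notebook, the vocabulary is fitted on the training texts only, so words that appear only in the test set carry no information.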


#**4) Random Forest Model Training**

**About Random Forest model**

- Random Forest is a powerful ensemble learning technique in machine learning.

- It combines the output of multiple decision trees to reach a single result.

- Random Forest handles both classification and regression problems.

**How Does Random Forest Work?**
- **During training:**
  - Multiple decision trees are created, each trained on a random bootstrap sample of the dataset.
  - At each split, a tree considers only a random subset of the features.
  - This randomness makes the trees less correlated with each other, which reduces overfitting.
- **In prediction:**
  - The algorithm aggregates results from all trees:
      - For classification tasks, it uses voting.
      - For regression tasks, it averages predictions.
  - Aggregating the predictions of many trees gives more stable results than relying on a single tree.
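The training and prediction steps above can be sketched by hand with individual decision trees. This is an illustrative sketch on synthetic data, not the code used in this notebook: each tree is fitted on a bootstrap sample, considers a random subset of features at each split, and the final label is a majority vote. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption for illustration)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, y_tr = X_toy[:300], y_toy[:300]
X_te, y_te = X_toy[300:], y_toy[300:]

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # odd number of trees, so votes cannot tie
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt": random feature subset considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# Majority vote across trees
votes = np.stack([tree.predict(X_te) for tree in trees])  # (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_te).mean()
print(f"Majority-vote accuracy: {accuracy:.2f}")
```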

**Advantages:**
- Handles complex data.
- Reduces overfitting.
- Provides reliable forecasts.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

In [None]:
def random_forest(X_train, X_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=5, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)

    return model

model_rf=random_forest(X_train_, X_test_, y_train, y_test)

Evaluation for the given vectors:

Accuracy: 0.75
Confusion Matrix:
[[451 153]
 [175 521]]
Classification Report:
               precision    recall  f1-score   support

not sarcastic       0.72      0.75      0.73       604
    sarcastic       0.77      0.75      0.76       696

     accuracy                           0.75      1300
    macro avg       0.75      0.75      0.75      1300
 weighted avg       0.75      0.75      0.75      1300
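The figures in the classification report follow directly from the confusion matrix printed above. As a worked check, the sketch below recomputes precision, recall, F1-score, and accuracy from those four counts (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above:
# rows = true class, columns = predicted class
# order: [not sarcastic, sarcastic]
conf = np.array([[451, 153],
                 [175, 521]])

true_positives = np.diag(conf)
precision = true_positives / conf.sum(axis=0)  # correct / all predicted as class
recall = true_positives / conf.sum(axis=1)     # correct / all actually in class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / conf.sum()

for name, p, r, f in zip(["not sarcastic", "sarcastic"], precision, recall, f1):
    print(f"{name:>13}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, precision for *not sarcastic* is 451 / (451 + 175) ≈ 0.72, and recall is 451 / (451 + 153) ≈ 0.75, matching the report.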



**Hyperparameter tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Hyperparameter grid to search
parameters = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 5-fold cross-validated grid search on the training set, scored by accuracy
grid_search = GridSearchCV(model_rf, parameters, cv=5, scoring='accuracy')
grid_search.fit(X_train_, y_train)

# Extract best parameters
best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}")

# Train a fresh model with the best parameters
# (no random_state is set here, so results can vary slightly between runs)
rf_model = RandomForestClassifier(**best_params)
rf_model.fit(X_train_, y_train)
rf_preds = rf_model.predict(X_test_)

# Evaluation on the held-out test set
class_report = classification_report(y_test, rf_preds)
print('Classification Report:')
print(class_report)

Best parameters found: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Classification Report:
               precision    recall  f1-score   support

not sarcastic       0.74      0.84      0.79       604
    sarcastic       0.84      0.75      0.79       696

     accuracy                           0.79      1300
    macro avg       0.79      0.79      0.79      1300
 weighted avg       0.80      0.79      0.79      1300
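As an alternative to retraining a new model from `best_params_`, `GridSearchCV` already refits the best configuration on the full training data by default and exposes it as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately small grid. The data, the grid values, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the padded review sequences
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best configuration has already been
# retrained on all of X_tr and keeps the estimator's random_state
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.2f}")
print(f"Test accuracy: {test_accuracy:.2f}")
```

Using `best_estimator_` avoids a second fit and keeps the seed of the base estimator, which makes the reported test metrics reproducible.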



#**5) Predictions and Analysis**

In [None]:
def predict_new_data(text, model, tokenizer, max_length):
    # Apply the same cleaning and preprocessing used for the training data
    text_cleaned = clean_text(text)
    processed_review = process_review(text_cleaned)

    # Tokenize and pad the new review to the training sequence length
    new_data_seq = tokenizer.texts_to_sequences([processed_review])
    new_data_padded = pad_sequences(new_data_seq, padding='post', maxlen=max_length)

    # Predict the class label with the given model (0 = not sarcastic, 1 = sarcastic)
    prediction = model.predict(new_data_padded)[0]
    if prediction == 1:
        return "Sarcastic"
    else:
        return "Not Sarcastic"

In [None]:
review= "Despite its star-studded cast, 'Cats' fails to capture the magic of the stage musical.The awkward CGI and lackluster choreography detract from the experience, resulting in a confusing and unsettling film that left audiences disappointed."
print("actual : Not Sarcastic")
a = predict_new_data(review, rf_model, tokenizer, max_length)
print( a)


actual : Not Sarcastic
Not Sarcastic


In [None]:
review= "Congratulations to the special effects team for making everything look like it came straight out of a video game from the '90s! I didn’t realize I was watching a film; I thought I was playing an outdated arcade game. Truly groundbreaking work!"
print("actual : Sarcastic")
b = predict_new_data(review, rf_model, tokenizer, max_length)
print(b)

actual : Sarcastic
Not Sarcastic


In [None]:
review= "A touching story about a man who travels across borders—too bad he didn’t bother with a map!"
print("actual : Sarcastic")
d = predict_new_data(review, rf_model, tokenizer, max_length)
print(d)


actual : Sarcastic
Sarcastic


#**Conclusion**

After hyperparameter tuning, the Random Forest model achieved an overall accuracy of 79% on the test set, up from 75% for the initial five-tree model. Its performance was balanced between the two classes:

- *Not Sarcastic* reviews: The model achieved a precision of 74%, a recall of 84%, and an F1-score of 79%.
- *Sarcastic* reviews: The model achieved a precision of 84%, a recall of 75%, and an F1-score of 79%.

This Random Forest model serves as a baseline for the deep learning model. Its performance metrics provide a reference point for evaluating and improving the deep learning model's effectiveness in classifying sarcastic and non-sarcastic reviews.