## Sarcasm Detection in Movie Reviews (Tokenization Techniques and Evaluation)

Here we modify the previous implementation to evaluate all the tokenization methods using a Random Forest classifier. This will allow us to compare the performance of each tokenization method and determine which is best for sarcasm detection.

- **Data Preparation** : Lemmatization and tokenization are applied primarily to the "Review" column.
- **Exploring Tokenization Techniques** :
   1. Word Tokenization
   2. Subword Tokenization
   3. Character Tokenization
   4. Sentence Tokenization

- **Label Encoding** : Encode the sentiment and sarcasm columns.
- **Vectorization** : Use TF-IDF for converting tokenized text into features.
- **Model Training**: Train a Random Forest model for each tokenization method.
- **Evaluation** : Evaluate and compare the performance of each model.

### Step 1: Loading the Data

We start by loading the Clean_data dataset, which contains both the text of the reviews and their corresponding labels (indicating whether the review is sarcastic or not).

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# File path
file_path = '/content/drive/MyDrive/IMBD/Clean_data1.csv'

In [3]:
# Read CSV file
import pandas as pd
df = pd.read_csv(file_path)

In [4]:
# Display the first 5 rows of data
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic
1,wonderful little production. filming technique...,positive,non-sarcastic
2,movie groundbreaking experience! I've never se...,positive,sarcastic
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic
4,Basically there's family little boy (Jake) thi...,negative,sarcastic


In [5]:
df.shape

(6497, 3)

### Step 2 : Installing Libraries and Packages  

In [6]:
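# tokenizers and scikit-learn are used in later steps (BPE, TF-IDF, Random Forest)
!pip install pandas nltk spacy tokenizers scikit-learn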
import subprocess

# Define the command as a list of strings
command = ["python", "-m", "spacy", "download", "en_core_web_sm"]

# Execute the command using subprocess
try:
    subprocess.run(command, check=True)
    print("en_core_web_sm model downloaded successfully!")
except subprocess.CalledProcessError as e:
    print(f"Error downloading en_core_web_sm model: {e}")



en_core_web_sm model downloaded successfully!


### Step 3 : Lemmatization
Lemmatization and tokenization are applied only to the "Review" column, because it holds the free text that needs processing for NLP (Natural Language Processing) tasks. The "Sentiment" and "Sarcasm" columns hold categorical labels, which need neither.

**Lemmatization** : Reducing words to their dictionary form (lemma) normalizes the text. For example, "running" becomes "run", which shrinks the vocabulary and makes the analysis more consistent.
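To illustrate the normalization effect without loading a model, the sketch below uses a tiny hand-written lookup table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for illustration only; spaCy's lemmatizer relies on rules and part-of-speech information rather than a fixed table.

```python
# Toy illustration of lemmatization as vocabulary normalization.
# TOY_LEMMAS is a hypothetical hand-written table, not spaCy's lemmatizer.
TOY_LEMMAS = {
    "running": "run", "ran": "run", "runs": "run",
    "movies": "movie", "watched": "watch", "watching": "watch",
}

def toy_lemmatize(text):
    """Lowercase, split on whitespace, and map each word through the table."""
    return " ".join(TOY_LEMMAS.get(word, word) for word in text.lower().split())

raw = "Watching movies running ran runs"
lemmatized = toy_lemmatize(raw)
print(lemmatized)                       # watch movie run run run
print(len(set(raw.lower().split())))    # 5 distinct surface forms
print(len(set(lemmatized.split())))     # 3 distinct lemmas
```

Fewer distinct forms means fewer, denser features once the text is vectorized.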

In [7]:
# Function for lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])

In [8]:
# Load the spaCy model and lemmatize the reviews
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import spacy

# Download NLTK data
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')
df['Lemmatized_Review'] = df['Review'].apply(lemmatize_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [9]:
df.head(5)

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...


In [11]:
# Save the lemmatized data to a new CSV file
output_file_path = '/content/drive/MyDrive/IMBD/lemmatize_dataset.csv'
df.to_csv(output_file_path, index=False)

In [12]:
data = pd.read_csv('/content/drive/MyDrive/IMBD/lemmatize_dataset.csv')

## Exploring Tokenization Techniques
1. Word Tokenization
2. Subword Tokenization
3. Character Tokenization
4. Sentence Tokenization


### Step 4 : Word Tokenization
Using NLTK for word tokenization.
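As a rough picture of what word-level tokenization produces, the sketch below separates words from punctuation with a single regular expression. It is a simplified stand-in, not NLTK's algorithm: `word_tokenize` applies extra rules (for contractions, for example), so its output can differ on real reviews. The name `simple_word_tokenize` is illustrative.

```python
import re

# Simplified stand-in for word tokenization: runs of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Return the word and punctuation tokens found in text."""
    return TOKEN_PATTERN.findall(text)

print(simple_word_tokenize("movie groundbreaking experience ! I have never"))
# ['movie', 'groundbreaking', 'experience', '!', 'I', 'have', 'never']
```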

In [13]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

df = data

# Function for word tokenization
def word_tokenize_reviews(reviews):
    return reviews.apply(lambda x: word_tokenize(x))

# Tokenize the lemmatized reviews
df['word_tokenized'] = word_tokenize_reviews(df['Lemmatized_Review'])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,word_tokenized
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,..."


### Step 5 : Subword Tokenization
Using HuggingFace's Tokenizers for Byte-Pair Encoding (BPE).
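The core of BPE is a loop that repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a minimal character-level version on an invented toy corpus. It is not the byte-level implementation in the `tokenizers` library (which, among other things, works on bytes and marks leading spaces with `Ġ`), and the helper names are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def bpe_encode(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = ["wonder", "wonder", "wonderful", "wander", "under"]  # toy data
merges = learn_bpe_merges(corpus, num_merges=4)
print(merges)
print(bpe_encode("wonderful", merges))
```

Frequent character sequences end up as single tokens, while rarer words stay split into smaller pieces, which is the pattern visible in the `bpe_tokenized` column.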

In [14]:
from tokenizers import ByteLevelBPETokenizer

# Initialize a ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()

# Train the tokenizer on the lemmatized reviews
tokenizer.train_from_iterator(df['Lemmatized_Review'], vocab_size=5000, min_frequency=2)

# Function for BPE tokenization
def bpe_tokenize_reviews(reviews):
    return reviews.apply(lambda x: tokenizer.encode(x).tokens)

# Tokenize the lemmatized reviews
df['bpe_tokenized'] = bpe_tokenize_reviews(df['Lemmatized_Review'])


In [None]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,word_tokenized,bpe_tokenized
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...","[one, Ġreviewer, Ġmention, Ġwatch, Ġ1, Ġo, z, ..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...","[w, onder, ful, Ġlittle, Ġproduction, Ġ., Ġfil..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...","[movie, Ġgroundb, reak, ing, Ġexperience, Ġ!, ..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...","[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...","[b, as, ically, Ġthere, Ġbe, Ġfamily, Ġlittle,..."


### Step 6 : Character Tokenization
Custom function for character tokenization.
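Because `list(x)` splits a string into every single character, spaces and punctuation become tokens too: the vocabulary stays tiny while the sequences get long. A quick illustration on a short snippet in the style of the lemmatized reviews:

```python
text = "wonderful little production ."
char_tokens = list(text)

print(char_tokens[:10])       # ['w', 'o', 'n', 'd', 'e', 'r', 'f', 'u', 'l', ' ']
print(len(text.split()))      # 4 whitespace-separated tokens
print(len(char_tokens))       # 29 character tokens
print(len(set(char_tokens)))  # number of distinct characters
```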

In [15]:
# Function for character tokenization
def char_tokenize_reviews(reviews):
    return reviews.apply(lambda x: list(x))

# Tokenize the lemmatized reviews
df['char_tokenized'] = char_tokenize_reviews(df['Lemmatized_Review'])


In [16]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,word_tokenized,bpe_tokenized,char_tokenized
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...","[one, Ġreviewer, Ġmention, Ġwatch, Ġ1, Ġo, z, ...","[o, n, e, , r, e, v, i, e, w, e, r, , m, e, ..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...","[w, onder, ful, Ġlittle, Ġproduction, Ġ., Ġfil...","[w, o, n, d, e, r, f, u, l, , l, i, t, t, l, ..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...","[movie, Ġgroundb, reak, ing, Ġexperience, Ġ!, ...","[m, o, v, i, e, , g, r, o, u, n, d, b, r, e, ..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...","[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,...","[t, h, i, n, k, , w, o, n, d, e, r, f, u, l, ..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...","[b, as, ically, Ġthere, Ġbe, Ġfamily, Ġlittle,...","[b, a, s, i, c, a, l, l, y, , t, h, e, r, e, ..."


### Step 7 : Sentence Tokenization with spaCy
spaCy, already used for lemmatization, also provides sentence tokenization. With the `en_core_web_sm` pipeline, sentence boundaries come from the dependency parse, which tends to handle complex and non-standard text better than simple punctuation rules.
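For comparison, a purely rule-based splitter can be sketched with one regular expression. This is a simplified stand-in, not what spaCy does: it only splits after `.`, `!` or `?` followed by whitespace, so abbreviations such as "Dr." would be split incorrectly. The example text is invented, in the style of the lemmatized reviews.

```python
import re

# Split at whitespace that directly follows sentence-final punctuation.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def simple_sentence_tokenize(text):
    """Return the sentences found by the punctuation rule above."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

print(simple_sentence_tokenize("the plot be thin . the acting be great ! would I watch it again ?"))
# ['the plot be thin .', 'the acting be great !', 'would I watch it again ?']
```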

In [17]:
import spacy

# Load the spaCy model (its parser provides the sentence boundaries used below)
nlp = spacy.load('en_core_web_sm')

# Function for sentence tokenization
def spacy_sentence_tokenize_reviews(reviews):
    return reviews.apply(lambda x: [sent.text for sent in nlp(x).sents])

# Tokenize the lemmatized reviews into sentences
df['spacy_sentence_tokenized'] = spacy_sentence_tokenize_reviews(df['Lemmatized_Review'])

# Display the tokenized sentences
print(df['spacy_sentence_tokenized'].head())


0    [one reviewer mention watch 1 oz episode hook ...
1    [wonderful little production ., film technique...
2    [movie groundbreaking experience !, I have nev...
3    [think wonderful way spend time hot summer wee...
4    [basically there be family little boy ( Jake )...
Name: spacy_sentence_tokenized, dtype: object


In [18]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,word_tokenized,bpe_tokenized,char_tokenized,spacy_sentence_tokenized
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...","[one, Ġreviewer, Ġmention, Ġwatch, Ġ1, Ġo, z, ...","[o, n, e, , r, e, v, i, e, w, e, r, , m, e, ...",[one reviewer mention watch 1 oz episode hook ...
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...","[w, onder, ful, Ġlittle, Ġproduction, Ġ., Ġfil...","[w, o, n, d, e, r, f, u, l, , l, i, t, t, l, ...","[wonderful little production ., film technique..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...","[movie, Ġgroundb, reak, ing, Ġexperience, Ġ!, ...","[m, o, v, i, e, , g, r, o, u, n, d, b, r, e, ...","[movie groundbreaking experience !, I have nev..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...","[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,...","[t, h, i, n, k, , w, o, n, d, e, r, f, u, l, ...",[think wonderful way spend time hot summer wee...
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...","[b, as, ically, Ġthere, Ġbe, Ġfamily, Ġlittle,...","[b, a, s, i, c, a, l, l, y, , t, h, e, r, e, ...",[basically there be family little boy ( Jake )...


In [20]:
def join_sentences(tokenized_reviews):
    return tokenized_reviews.apply(lambda x: ' '.join(x))

df['joined_spacy_sentences'] = join_sentences(df['spacy_sentence_tokenized'])

In [21]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,word_tokenized,bpe_tokenized,char_tokenized,spacy_sentence_tokenized,joined_spacy_sentences
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...","[one, Ġreviewer, Ġmention, Ġwatch, Ġ1, Ġo, z, ...","[o, n, e, , r, e, v, i, e, w, e, r, , m, e, ...",[one reviewer mention watch 1 oz episode hook ...,one reviewer mention watch 1 oz episode hook ....
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...","[w, onder, ful, Ġlittle, Ġproduction, Ġ., Ġfil...","[w, o, n, d, e, r, f, u, l, , l, i, t, t, l, ...","[wonderful little production ., film technique...",wonderful little production . film technique u...
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...","[movie, Ġgroundb, reak, ing, Ġexperience, Ġ!, ...","[m, o, v, i, e, , g, r, o, u, n, d, b, r, e, ...","[movie groundbreaking experience !, I have nev...",movie groundbreaking experience ! I have never...
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...","[think, Ġwonderful, Ġway, Ġspend, Ġtime, Ġhot,...","[t, h, i, n, k, , w, o, n, d, e, r, f, u, l, ...",[think wonderful way spend time hot summer wee...,think wonderful way spend time hot summer week...
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...","[b, as, ically, Ġthere, Ġbe, Ġfamily, Ġlittle,...","[b, a, s, i, c, a, l, l, y, , t, h, e, r, e, ...",[basically there be family little boy ( Jake )...,basically there be family little boy ( Jake ) ...


In [22]:
# Handle NaNs by filling with an empty string
df['word_tokenized'] = df['word_tokenized'].fillna('')
df['bpe_tokenized'] = df['bpe_tokenized'].fillna('')
df['char_tokenized'] = df['char_tokenized'].fillna('')
df['joined_spacy_sentences'] = df['joined_spacy_sentences'].fillna('')

### Step 8 : Label Encoding
Encode the categorical "Sentiment" and "Sarcasm" columns as integers so they can be used as model targets.


In [25]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


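# Encode the sentiment and sarcasm columns with separate encoders,
# so each fitted mapping stays available (e.g. for inverse_transform)
sentiment_encoder = LabelEncoder()
sarcasm_encoder = LabelEncoder()
df['sentiment_encoded'] = sentiment_encoder.fit_transform(df['Sentiment'])
df['sarcasm_encoded'] = sarcasm_encoder.fit_transform(df['Sarcasm'])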
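scikit-learn's `LabelEncoder` assigns integer codes to the sorted unique class names, which is why `non-sarcastic` maps to 0 and `sarcastic` maps to 1 in the reports further down. The sketch below reproduces that mapping in plain Python; the helper name `encode_labels` is illustrative.

```python
def encode_labels(labels):
    """Map each label to its index in the sorted list of unique labels."""
    classes = sorted(set(labels))
    index = {label: code for code, label in enumerate(classes)}
    return [index[label] for label in labels], classes

codes, classes = encode_labels(["non-sarcastic", "sarcastic", "sarcastic", "non-sarcastic"])
print(classes)  # ['non-sarcastic', 'sarcastic']
print(codes)    # [0, 1, 1, 0]
```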


### Step 9 : Vectorization
Use TF-IDF for converting tokenized text into features.
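As a sketch of what TF-IDF computes, the plain-Python version below follows the formula documented for scikit-learn's default settings: raw term counts multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization per document. It is meant for intuition on an invented toy corpus, not as a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = [["great", "movie"], ["great", "plot"], ["boring", "movie"]]  # toy data
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents ("plot", "boring") receive a higher weight than terms shared across documents ("great", "movie").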

In [28]:
# Function to vectorize tokenized reviews
def vectorize_reviews(tokenized_reviews, vectorizer=None):
    if vectorizer is None:
        vectorizer = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, token_pattern=None)
        vectorizer.fit(tokenized_reviews)
    return vectorizer.transform(tokenized_reviews), vectorizer
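One thing worth checking with this setup: the identity `tokenizer` returns its input unchanged, and the vectorizer then iterates over whatever it receives. For the list-valued columns this yields the intended tokens, but for a plain string (such as the `joined_spacy_sentences` column) iteration yields single characters, so that column is effectively vectorized at character level. The stand-alone sketch below mimics that iteration without scikit-learn; `identity` and `count_features` are illustrative names.

```python
from collections import Counter

def identity(x):
    """Same behavior as the lambda passed as tokenizer and preprocessor."""
    return x

def count_features(doc):
    """Count the items obtained by iterating over the 'tokenized' document."""
    return Counter(identity(doc))

# A list of tokens is counted token by token ...
print(count_features(["great", "movie", "great"]))
# ... but a plain string is counted character by character.
print(count_features("great movie"))
```

If word- or sentence-level features are wanted for a string column, one option is to split it into a list of tokens before vectorizing.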


### Step 10 : Model Training and Evaluation
**Model Training** : Train a Random Forest model for each tokenization method.<br>
**Evaluation** : Evaluate and compare the performance of each model.
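The per-class numbers printed by `classification_report` come from simple counts of true positives, false positives and false negatives. As a sketch of what they mean, the function below computes precision, recall and F1 for one class from lists of true and predicted labels (toy labels, not the notebook's data; the function name is illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```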


In [31]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    df[['word_tokenized', 'bpe_tokenized', 'char_tokenized', 'joined_spacy_sentences']],
    df['sarcasm_encoded'],
    test_size=0.2,
    random_state=42
)

# Train a Random Forest on one tokenized column and print its report
def train_and_evaluate(tokenized_column):
    X_train_tokenized, vectorizer = vectorize_reviews(X_train[tokenized_column])
    X_test_tokenized = vectorize_reviews(X_test[tokenized_column], vectorizer)[0]

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_tokenized, y_train)
    y_pred = model.predict(X_test_tokenized)

    print(f"Evaluation for {tokenized_column}:\n")
    print(classification_report(y_test, y_pred))

# Evaluate each tokenization method
train_and_evaluate('word_tokenized')
train_and_evaluate('bpe_tokenized')
train_and_evaluate('char_tokenized')
train_and_evaluate('joined_spacy_sentences')


Evaluation for word_tokenized:

              precision    recall  f1-score   support

           0       0.77      0.92      0.84       604
           1       0.92      0.77      0.84       696

    accuracy                           0.84      1300
   macro avg       0.85      0.84      0.84      1300
weighted avg       0.85      0.84      0.84      1300

Evaluation for bpe_tokenized:

              precision    recall  f1-score   support

           0       0.77      0.92      0.84       604
           1       0.91      0.76      0.83       696

    accuracy                           0.83      1300
   macro avg       0.84      0.84      0.83      1300
weighted avg       0.85      0.83      0.83      1300

Evaluation for char_tokenized:

              precision    recall  f1-score   support

           0       0.75      0.82      0.78       604
           1       0.83      0.76      0.79       696

    accuracy                           0.79      1300
   macro avg       0.79      0.79

### Analysis of Metrics
**Accuracy** : Word tokenization achieved the highest accuracy (0.84), followed by BPE (0.83); character tokenization and the joined spaCy sentences both reached 0.79.<br>
**Precision and Recall** : For class 1 (sarcastic), word tokenization has the highest precision (0.92) with a recall of 0.77, so its sarcasm predictions are rarely false positives, although it misses roughly a quarter of the sarcastic reviews. BPE tokenization is close behind, with slightly lower precision (0.91) and recall (0.76).<br>
**F1-Score** : Word tokenization scores 0.84 for both classes and BPE scores 0.84 and 0.83, so word tokenization is ahead only marginally.
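These figures can be cross-checked against each other. Using only the rounded values from the word-tokenization report above, the sketch below recomputes the class 1 F1-score as the harmonic mean of precision and recall, and the overall accuracy as the support-weighted average of the per-class recalls:

```python
# Rounded values copied from the word_tokenized classification report.
precision = {0: 0.77, 1: 0.92}
recall = {0: 0.92, 1: 0.77}
support = {0: 604, 1: 696}

# F1 is the harmonic mean of precision and recall.
f1_class1 = 2 * precision[1] * recall[1] / (precision[1] + recall[1])

# Accuracy equals the support-weighted average of per-class recall.
total = sum(support.values())
accuracy = sum(recall[c] * support[c] for c in support) / total

print(round(f1_class1, 2))  # 0.84
print(round(accuracy, 2))   # 0.84
```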

**Higher Accuracy** : Word tokenization provides the highest accuracy among the methods compared, although its lead over BPE is a single point.<br>
**Balanced Performance** : Its F1-score is the same for both classes (0.84). For the sarcasm class (class 1) it combines high precision (0.92) with lower recall (0.77), so its errors lean towards missed sarcasm rather than false alarms.<br>
**Simplicity and Interpretability** : Word tokenization is straightforward and interpretable, making it easier to understand and debug the model’s decisions.<br>

### Conclusion
After evaluating various tokenization methods, word tokenization emerges as the best choice for our sarcasm detection model on movie reviews. It strikes a balance between computational efficiency and the ability to capture contextual nuances, making it well-suited for a dataset of 6497 movie reviews.

### Evaluation Criteria
**Dataset Size** : With 6497 reviews, the choice of tokenization needs to balance simplicity and effectiveness.<br>
**Content Type** : Movie reviews often have rich, descriptive language and context-specific terms.<br>

### Suitability for Movie Reviews
**Rich Vocabulary** : Movie reviews typically contain a wide range of vocabulary and expressions. Word tokenization captures these variations effectively.<br>
**Handling Negations and Idioms** : Sarcasm in movie reviews can be conveyed through negations and idiomatic expressions, which are better captured through word tokenization.