## Sarcasm Detection in Movie Reviews (Implementation Strategy)
### Data Preparation
- **Tokenization** : Apply word tokenization to the lemmatized reviews column.
- **Label Encoding** : Encode the sentiment and sarcasm columns as numerical values.
- **Word Embeddings**: Use Word2Vec to generate word embeddings for the tokenized reviews.

### Loading the Data

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
# File path
file_path = '/content/drive/MyDrive/IMBD/lemmatize_dataset.csv'

In [12]:
# Read CSV file
import pandas as pd
df = pd.read_csv(file_path)

In [13]:
# Display the first 5 rows of data
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...


### Step 1 :  Tokenization
**Tokenization** : Tokenization is applied only to the review text, because that is the data that needs processing for NLP (Natural Language Processing) tasks. In the code below it runs on the `Lemmatized_Review` column, the lemmatized form of `Review`. The `Sentiment` and `Sarcasm` columns contain categorical labels and do not require tokenization.<br>

After evaluating various tokenization methods, word tokenization emerges as the best choice for our sarcasm detection model on movie reviews. It strikes a balance between computational efficiency and the ability to capture contextual nuances, making it well-suited for a dataset of 6510 movie reviews.<br>
### Evaluation Criteria
**Dataset Size** : With 6510 reviews, the choice of tokenization needs to balance simplicity and effectiveness.<br>
**Content Type** : Movie reviews often have rich, descriptive language and context-specific terms.<br>

### Suitability for Movie Reviews
**Rich Vocabulary** : Movie reviews typically contain a wide range of vocabulary and expressions. Word tokenization captures these variations effectively.<br>
**Handling Negations and Idioms** : Sarcasm in movie reviews can be conveyed through negations and idiomatic expressions, which are better captured through word tokenization.<br>  
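
To make the word-level granularity concrete, here is a minimal sketch that compares a plain whitespace split with a simple regex tokenizer. The regex and the helper name `simple_word_tokenize` are assumptions made for illustration. They are a rough stand-in, not NLTK's `word_tokenize` algorithm, which handles contractions and other cases differently.

```python
import re

# Rough stand-in for a word tokenizer: runs of word characters (keeping
# internal apostrophes) or single punctuation marks become separate tokens.
# This is NOT NLTK's algorithm, only an illustration of word-level splitting.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

review = "Oh great, another sequel. I couldn't be happier!"

whitespace_tokens = review.split()
word_tokens = simple_word_tokenize(review)

print(whitespace_tokens)  # punctuation stays glued to words, e.g. 'great,'
print(word_tokens)        # punctuation becomes its own token
```

Separating punctuation keeps `great` and `great,` from being treated as two different vocabulary entries, and the negation in `couldn't` is preserved as a token.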

In [14]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Function for word tokenization
def word_tokenize_reviews(reviews):
    # Apply NLTK's word tokenizer to every review in the Series
    return reviews.apply(word_tokenize)

# Tokenize the lemmatized reviews
df['Tokenized_Review'] = word_tokenize_reviews(df['Lemmatized_Review'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [15]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,..."


### Step 2 : Label Encoding
**Purpose** : Encodes categorical variables into integer labels.<br>
**Usage** : Often used for encoding target variables or categorical variables with ordinal relationships.<br>
**Advantages**:
- Converts categorical labels into integers, making them easier for machine learning algorithms to process as they typically work with numerical data.<br>

**Considerations** :
- `LabelEncoder` assigns integers to the sorted unique values (alphabetical order for strings), not by first appearance in the dataset. The resulting codes (e.g., negative = 0, neutral = 1, positive = 2) can imply an ordinal relationship between categories, which may not be appropriate if the categories have no inherent order.

- Transformation may discard some nuances present in original categorical labels, potentially affecting model performance in tasks relying on subtle differences between categories (e.g., sentiment analysis, sarcasm detection).
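
If the implied ordering is a concern for a categorical input feature, one common alternative is one-hot encoding, where each category gets its own binary slot. The sketch below is written in plain Python so that it runs without extra libraries. The helper name `one_hot` and the sample labels are assumptions for illustration, not part of the notebook's pipeline.

```python
def one_hot(labels):
    """Map each label to a binary indicator vector, one slot per category."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        vectors.append(row)
    return categories, vectors

# Made-up sample labels for illustration
sentiments = ["positive", "negative", "neutral", "positive"]
categories, vectors = one_hot(sentiments)

print(categories)  # column order of the indicator vectors
print(vectors)
```

For target columns such as `Sentiment` and `Sarcasm`, integer labels are usually fine, since classifiers treat them as class identifiers rather than magnitudes.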


In [16]:
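from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so each keeps its own classes_
# and can later decode its labels with inverse_transform
sentiment_encoder = LabelEncoder()
sarcasm_encoder = LabelEncoder()
df['Sentiment_Label'] = sentiment_encoder.fit_transform(df['Sentiment'])
df['Sarcasm_Label'] = sarcasm_encoder.fit_transform(df['Sarcasm'])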

In [17]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review,Sentiment_Label,Sarcasm_Label
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...",2,0
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...",2,0
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...",2,1
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",2,0
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...",0,1


**Output Explanation** :
1. **Encoded Sentiment Labels** : Each unique sentiment category (positive, negative, neutral, etc.) is assigned a unique numerical label.<br>
`LabelEncoder` assigns the labels in sorted (alphabetical) order (**negative** = 0, **neutral** = 1, **positive** = 2).<br>

2. **Encoded Sarcasm Labels** : Each unique sarcasm category (sarcastic, non-sarcastic, etc.) is assigned a unique numerical label.<br>Similarly, labels are assigned in alphabetical order (**non-sarcastic** = 0, **sarcastic** = 1).
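
The mapping described above can be reproduced with a few lines of plain Python. This is a sketch that mimics the sorted-order assignment rather than calling scikit-learn. The helper name `fit_label_mapping` and the sample labels are assumptions for illustration.

```python
def fit_label_mapping(labels):
    """Assign integer codes to the sorted unique labels."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}

# Made-up sample labels for illustration
sentiment_mapping = fit_label_mapping(["positive", "negative", "neutral", "positive"])
sarcasm_mapping = fit_label_mapping(["sarcastic", "non-sarcastic", "sarcastic"])

# Invert a mapping to decode integer predictions back to readable labels
sarcasm_decoder = {code: label for label, code in sarcasm_mapping.items()}

print(sentiment_mapping)
print(sarcasm_mapping)
print(sarcasm_decoder[1])
```

On a fitted scikit-learn `LabelEncoder`, the `classes_` attribute holds the same sorted ordering, and `inverse_transform` performs the decoding step.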

### Step 3 : Word Embeddings
**Purpose** : Represents words as dense vectors in a continuous space.<br>
**Usage** : Converts words into fixed-size dense vectors that capture semantic meanings.<br>
**Advantages** :
- Embeds words into a continuous vector space where similar words have similar vector representations.
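- Learns aspects of word meaning from the contexts in which words co-occur, which can be useful for understanding the subtleties of sarcasm. Each word still receives a single static vector, regardless of the sentence it appears in.<br>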
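
"Similar vector representations" is usually measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors, invented for illustration; real Word2Vec vectors in this notebook have 100 dimensions and their values depend on training.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for illustration
toy_vectors = {
    "great": [0.9, 0.8, 0.1],
    "wonderful": [0.8, 0.9, 0.2],
    "boring": [-0.7, -0.6, 0.3],
}

sim_close = cosine_similarity(toy_vectors["great"], toy_vectors["wonderful"])
sim_far = cosine_similarity(toy_vectors["great"], toy_vectors["boring"])

print(f"great vs wonderful: {sim_close:.3f}")
print(f"great vs boring:    {sim_far:.3f}")
```

With a trained gensim model, `model.wv.similarity(word_a, word_b)` and `model.wv.most_similar(word)` expose the same measure over the learned vectors.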

**Considerations** :
- Requires a large amount of text data to train effectively, so vectors learned from a few thousand reviews may be noisy.
- Cannot represent out-of-vocabulary words: a token not seen during training has no vector.
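
The effect of out-of-vocabulary tokens on an averaged review vector can be sketched in plain Python. The toy vocabulary, its 2-dimensional vectors, and the helper name `average_vector` are assumptions for illustration. The logic mirrors the skip-unknown-tokens approach, without depending on gensim.

```python
def average_vector(tokens, embeddings, vector_size):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * vector_size
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical vocabulary, made up for illustration
toy_embeddings = {
    "movie": [1.0, 0.0],
    "great": [0.0, 1.0],
}

seen = average_vector(["great", "movie"], toy_embeddings, 2)
partly_unseen = average_vector(["great", "masterpiece"], toy_embeddings, 2)
all_unseen = average_vector(["masterpiece"], toy_embeddings, 2)

print(seen)           # both tokens contribute
print(partly_unseen)  # the unknown token is silently ignored
print(all_unseen)     # no known tokens, so the result is all zeros
```

When the model is trained on the same reviews with `min_count=1`, every training token is in the vocabulary, so this case mainly arises when embedding new, unseen reviews.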


In [22]:
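# Train Word2Vec model on the tokenized reviews
# (pandas is already imported as pd in an earlier cell)
from gensim.models import Word2Vec
import numpy as np

# vector_size=100: dimensionality of each word vector
# min_count=1: keep every token, including words that occur only once
word2vec_model = Word2Vec(sentences=df['Tokenized_Review'], vector_size=100, window=5, min_count=1, workers=4)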

# Function to get the average Word2Vec vector for each review
def get_avg_word2vec(tokens_list, model, vector_size):
    vec = np.zeros(vector_size)
    count = 0
    for word in tokens_list:
        if word in model.wv.key_to_index:
            vec += model.wv[word]
            count += 1
    if count != 0:
        vec /= count
    return vec

# Apply the function to get Word2Vec embeddings for the reviews
df['word2vec_vector'] = df['Tokenized_Review'].apply(lambda x: get_avg_word2vec(x, word2vec_model, 100))


In [23]:
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review,Sentiment_Label,Sarcasm_Label,word2vec_vector
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode...",2,0,"[-0.39988263693377524, 0.5321289262994453, 0.2..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn...",2,0,"[-0.17672336481440912, 0.5281532592541355, 0.2..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have...",2,1,"[-0.8143854476511478, 0.732536039625605, -0.00..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ...",2,0,"[-0.358939056317342, 0.5354939161414474, 0.278..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,...",0,1,"[-0.44032156851334386, 0.5738598694025046, 0.2..."


### Summary
**Tokenization** : Applied word tokenization to the lemmatized reviews.<br>
**Label Encoding** : Encoded the sentiment and sarcasm columns using LabelEncoder.<br>
**Word2Vec Embedding**: Trained a Word2Vec model on the tokenized reviews and computed the average word vectors for each review.<br>
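By using word tokenization and averaged Word2Vec embeddings, this approach turns each review into a fixed-length numeric vector that carries semantic information, which a downstream classifier can use as input features for sarcasm detection. Averaging discards word order, so cues that depend on phrasing may be weakened.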