## Sarcasm Detection in Movie Reviews (Exploring Encoding Techniques)

In this notebook, we'll explore encoding techniques for sarcasm detection in movie review data. Sarcasm detection means identifying sarcastic remarks in text, which is challenging because sarcasm is subtle and depends heavily on context.

### Step 1: Loading the Data

We start by loading the Clean_data dataset, which contains both the text of the reviews and their corresponding labels (indicating whether the review is sarcastic or not).





In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [25]:
# File path
file_path = '/content/drive/MyDrive/IMBD/Clean_data.csv'

In [26]:
# Read CSV file
import pandas as pd
df = pd.read_csv(file_path)

In [27]:
# Display the first 5 rows of data
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic
1,wonderful little production. filming technique...,positive,non-sarcastic
2,movie groundbreaking experience! I've never se...,positive,sarcastic
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic
4,Basically there's family little boy (Jake) thi...,negative,sarcastic


### Step 2 : Installing Libraries and Packages  

In [28]:
!pip install pandas nltk spacy
import subprocess
import sys

# Define the command as a list of strings; sys.executable makes sure the model
# is installed for the same interpreter that runs this notebook
command = [sys.executable, "-m", "spacy", "download", "en_core_web_sm"]

# Execute the command and report success or failure
try:
    subprocess.run(command, check=True)
    print("en_core_web_sm model downloaded successfully!")
except subprocess.CalledProcessError as e:
    print(f"Error downloading en_core_web_sm model: {e}")



en_core_web_sm model downloaded successfully!


In [29]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
import spacy

# Download NLTK data
nltk.download('punkt')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Step 3 : Lemmatization and Tokenization
The lemmatization and tokenization processes are applied primarily to the "Review" column because that column contains the textual data that needs to be processed for NLP (Natural Language Processing) tasks. The "Sentiment" and "Sarcasm" columns contain categorical data that do not require lemmatization or tokenization.
1. **Lemmatization** : Reducing words to their base or root form (lemmas) helps in normalizing the text. For example, "running" becomes "run," which helps in reducing the complexity of the text and making the text analysis more consistent.
2. **Tokenization** : Breaking down the text into individual words or tokens helps in analyzing the text more effectively.  

In [30]:
# Function for lemmatization using spaCy
def lemmatize_text(text):
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])


In [31]:
# Apply tokenization and lemmatization
df['Lemmatized_Review'] = df['Review'].apply(lemmatize_text)
df['Tokenized_Review'] = df['Lemmatized_Review'].apply(word_tokenize)

In [32]:
df.head(5)

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"[one, reviewer, mention, watch, 1, oz, episode..."
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"[wonderful, little, production, ., film, techn..."
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"[movie, groundbreaking, experience, !, I, have..."
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"[think, wonderful, way, spend, time, hot, summ..."
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"[basically, there, be, family, little, boy, (,..."


In [33]:
df.shape

(6510, 5)

### Step 4 : Save the data in CSV format for further use or analysis

In [34]:
# Save the token data to a new CSV file
output_file_path = '/content/drive/MyDrive/IMBD/Token_dataset.csv'
df.to_csv(output_file_path, index=False)

In [35]:
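# Reload the saved file; note that the list-valued 'Tokenized_Review' column
# comes back as plain strings such as "['one', 'reviewer', ...]"
data = pd.read_csv('/content/drive/MyDrive/IMBD/Token_dataset.csv')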

## Exploring Encoding Techniques
1. One Hot Encoder : Encodes categorical data into binary vectors.
2. Label Encoder : Encodes categorical data into integer labels.
3. TF-IDF : Represents documents as vectors of term importance.
4. Word2Vec : Represents words as dense vectors in a continuous space.
5. Term Frequency Encoder : Represents documents as vectors of term frequencies.

### Step 5 : One Hot Encoder
**Purpose** : Encodes categorical variables into binary vectors.<br>
**Usage** : Typically used for encoding categorical variables like "Sentiment" and "Sarcasm".<br>
**Advantages** :
- OneHotEncoder can handle categories in test data that were not seen during fitting: when created with ***handle_unknown='ignore'***, it encodes an unseen category as all zeros. With the default setting it raises an error instead.
This reduces the risk of errors and unexpected behavior when applying the model to new data.<br>

**Considerations** :
- One-hot encoding expands the feature space significantly, especially if categorical variables have many unique categories.

- One-hot encoding introduces redundant information because one category can be inferred from the others.
To mitigate this, typically one category is dropped (known as the reference category) to avoid multicollinearity in regression models.
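
The two behaviors discussed above, unseen categories and the dropped reference category, can be sketched with scikit-learn as follows. This is a minimal illustration on made-up toy values rather than the notebook's data; the variable names and the category "mixed" are assumptions introduced for the example.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy training column (hypothetical values, one feature per row)
train = [["negative"], ["neutral"], ["positive"]]

# handle_unknown="ignore": categories unseen during fit become all-zero rows
enc_ignore = OneHotEncoder(handle_unknown="ignore")
enc_ignore.fit(train)
seen = enc_ignore.transform([["positive"]]).toarray()
unseen = enc_ignore.transform([["mixed"]]).toarray()
print(seen)    # exactly one column is set to 1
print(unseen)  # all zeros, because "mixed" was never seen during fit

# drop="first": the first category in sorted order becomes the reference
enc_drop = OneHotEncoder(drop="first")
reduced = enc_drop.fit_transform(train).toarray()
print(reduced.shape)  # one column fewer than the number of categories
```

With `drop="first"`, the reference category ("negative" here) is represented by a row of zeros, so it is still recoverable from the remaining columns.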




In [36]:
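from sklearn.preprocessing import OneHotEncoder
# Instantiated for reference only; the encoding below uses pd.get_dummies
enc = OneHotEncoder()

# data is a DataFrame with 'Sentiment' and 'Sarcasm' columns;
# get_dummies adds one boolean indicator column per category
data_encoded = pd.concat([data, pd.get_dummies(data[['Sentiment', 'Sarcasm']])], axis=1)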


In [37]:
data_encoded.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review,Sentiment_negative,Sentiment_neutral,Sentiment_positive,Sarcasm_non-sarcastic,Sarcasm_sarcastic
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"['one', 'reviewer', 'mention', 'watch', '1', '...",False,False,True,True,False
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"['wonderful', 'little', 'production', '.', 'fi...",False,False,True,True,False
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"['movie', 'groundbreaking', 'experience', '!',...",False,False,True,False,True
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"['think', 'wonderful', 'way', 'spend', 'time',...",False,False,True,True,False
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"['basically', 'there', 'be', 'family', 'little...",True,False,False,False,True


**Output Explanation** :
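1. The cell above creates a **OneHotEncoder** instance but performs the encoding with ***pd.get_dummies()***, which adds one boolean indicator column (**True**/**False**) per category, named after the source column and the category (for example, **Sentiment_positive**). Fitting **OneHotEncoder** on the same columns would instead return a sparse matrix of **0**s and **1**s, with column names available from ***enc.get_feature_names_out()***.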

### Step 6 : Label Encoder
**Purpose** : Encodes categorical variables into integer labels.<br>
**Usage** : Often used for encoding target variables or categorical variables with ordinal relationships.<br>
**Advantages**:
- Converts categorical labels into integers, making them easier for machine learning algorithms to process as they typically work with numerical data.<br>

**Considerations** :
- **LabelEncoder** sorts the categories and assigns integers in that order (alphabetical for strings), not in order of first appearance. The resulting numbers (e.g., negative = 0, neutral = 1, positive = 2) can imply an ordinal relationship between categories, which may not be appropriate if the categories have no inherent order.

- Transformation may discard some nuances present in original categorical labels, potentially affecting model performance in tasks relying on subtle differences between categories (e.g., sentiment analysis, sarcasm detection).
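
A quick way to check which integer each category received is to inspect the fitted encoder's `classes_` attribute. The sketch below uses a short hypothetical list of labels rather than the notebook's columns.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "neutral", "positive"]  # toy values

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted categories; a category's position is its code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings from the integer codes
restored = le.inverse_transform(codes).tolist()
print(restored)
```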


In [39]:
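from sklearn.preprocessing import LabelEncoder
# Apply a separate LabelEncoder to each column so that each fitted mapping
# (classes_) stays available for inverse_transform later
sentiment_encoder = LabelEncoder()
sarcasm_encoder = LabelEncoder()
data['Sentiment_Label'] = sentiment_encoder.fit_transform(data['Sentiment'])
data['Sarcasm_Label'] = sarcasm_encoder.fit_transform(data['Sarcasm'])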


In [41]:
data.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review,Sentiment_Label,Sarcasm_Label
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"['one', 'reviewer', 'mention', 'watch', '1', '...",2,0
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"['wonderful', 'little', 'production', '.', 'fi...",2,0
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"['movie', 'groundbreaking', 'experience', '!',...",2,1
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"['think', 'wonderful', 'way', 'spend', 'time',...",2,0
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"['basically', 'there', 'be', 'family', 'little...",0,1


**Output Explanation** :
1. **Encoded Sentiment Labels** : Each unique sentiment category (negative, neutral, positive) is assigned a unique integer label.<br>
**LabelEncoder** assigns the labels in sorted (alphabetical) order (**negative** = 0, **neutral** = 1, **positive** = 2).<br>

2. **Encoded Sarcasm Labels** : Each unique sarcasm category (non-sarcastic, sarcastic) is assigned a unique integer label.<br>The same sorted order applies (**non-sarcastic** = 0, **sarcastic** = 1).

### Step 7 : TF-IDF (Term Frequency-Inverse Document Frequency)
**Purpose** : Represents documents as vectors of term importance.<br>
**Usage** : Converts text data (after tokenization and optionally lemmatization) into numerical vectors.<br>
**Advantages** :
- It scales down the impact of common words that appear in many documents (e.g., "the", "and") which might not contribute much to sarcasm detection.
- It scales up the importance of words that are distinctive to specific documents or classes (sarcastic vs. non-sarcastic).<br>

**Considerations** :
- Useful for capturing the discriminative power of less frequent words in determining sarcasm.
- May not capture semantic relationships between words or phrases as effectively as Word2Vec.<br>
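
To make the scaling described above concrete, here is a small hand-computed sketch. It uses the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is the form scikit-learn's **TfidfVectorizer** applies with its default settings. The corpus and the helper name `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already tokenized)
docs = [
    ["great", "movie", "great", "plot"],
    ["boring", "movie"],
    ["great", "acting", "movie"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    # Raw count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document, "movie" occurs in every review and so receives a lower weight than "boring", which occurs in only one.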



In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Apply Tfidf on Lemmatized_Review column
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(data['Lemmatized_Review'])
print("TfidfVectorizer Vocabulary:")
print(tfidf_vectorizer.get_feature_names_out())

TfidfVectorizer Vocabulary:
['00' '000' '0000110' ... 'zzz' 'zzzz' 'zzzzzzzzzzzz']


In [45]:
# Print the output
print(X_tfidf)

  (0, 21585)	0.04980605866299702
  (0, 5980)	0.05078213482709231
  (0, 24253)	0.05279890850407331
  (0, 23794)	0.03282657001448616
  (0, 25589)	0.07415959619458323
  (0, 24815)	0.07477473489909324
  (0, 4872)	0.07945962687585753
  (0, 2371)	0.042253500559414105
  (0, 14957)	0.04244012946969331
  (0, 8349)	0.04731784752471771
  (0, 21787)	0.06001487842277317
  (0, 22849)	0.054679211955762606
  (0, 13492)	0.04717075643144834
  (0, 2701)	0.07612174032553351
  (0, 24598)	0.040441442018540746
  (0, 4560)	0.05637084280623327
  (0, 15339)	0.05528284665334174
  (0, 14673)	0.08598285885373277
  (0, 26061)	0.02728804828207833
  (0, 17004)	0.0528813763210903
  (0, 13230)	0.04409909869430822
  (0, 12190)	0.16323848878106215
  (0, 16407)	0.09344250704968061
  (0, 21108)	0.06017671602076017
  (0, 26263)	0.09463569504943542
  :	:
  (6505, 15911)	0.13553539799247885
  (6506, 6126)	0.5019068200247979
  (6506, 10512)	0.4513596894609441
  (6506, 20960)	0.40218808986637994
  (6506, 20413)	0.42195838451671

**Output Explanation** :
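1. **TfidfVectorizer Vocabulary** : ***tfidf_vectorizer.get_feature_names_out()*** returns the vocabulary learned while fitting on **data['Lemmatized_Review']**, i.e. every unique token the vectorizer extracted. With the default settings, text is lowercased and only tokens of two or more word characters are kept.<br>

2. **Output Format** : The vocabulary is an array of feature names ***(words)*** sorted in lexicographical order.
Each element in the array corresponds to one column of **X_tfidf**.<br>

3. **X_tfidf** : Printing the sparse matrix lists each non-zero entry as a ***(row, column)*** pair followed by a value, where the row is the review index, the column is the index of the word in the vocabulary, and the value is the TF-IDF weight of that word in that review.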

### Step 8 : Word2Vec
**Purpose** : Represents words as dense vectors in a continuous space.<br>
**Usage** : Converts words into fixed-size dense vectors that capture semantic meanings.<br>
**Advantages** :
- Embeds words into a continuous vector space where similar words have similar vector representations.
- Captures contextual meanings and nuances, which can be useful for understanding the subtleties of sarcasm.<br>

**Considerations** :
- Requires a large amount of text data to train effectively.
- May struggle with out-of-vocabulary words not seen during training.



In [47]:
import ast
from gensim.models import Word2Vec

# Apply Word2Vec on the Tokenized_Review column.
# The column was reloaded from CSV, so each entry is the string form of a list;
# parse it back into a list of tokens instead of re-tokenizing the string,
# which would add brackets, quotes, and commas to the vocabulary
tokenized_reviews = data['Tokenized_Review'].apply(ast.literal_eval).tolist()

# Train Word2Vec model (sg=0 selects the CBOW architecture)
word2vec_model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, sg=0)

# Example of obtaining a word vector ('example_word' is a placeholder)
word_to_find = 'example_word'

if word_to_find in word2vec_model.wv.key_to_index:
    word_vector = word2vec_model.wv[word_to_find]
    print(f"Vector representation of '{word_to_find}':")
    print(word_vector)
else:
    print(f"'{word_to_find}' is not present in the vocabulary.")

'example_word' is not present in the vocabulary.



**Output Explanation** :
1. **Vector Representation** : The code checks if ***'example_word'*** exists in the vocabulary of the trained Word2Vec model ***(word2vec_model.wv.key_to_index)***. If it does, it retrieves the vector representation **(word_vector)** of ***'example_word'*** from the model.<br>

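2. **Output Format** : The vector representation of ***'example_word'*** is printed using ***print(word_vector)***.<br>
If ***'example_word'*** is not found in the vocabulary, a message indicating its absence is printed. That is what happens here, because ***'example_word'*** is a placeholder that does not occur in the reviews.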
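
The lookup above returns a vector for a single word, whereas a classifier needs one fixed-size vector per review. One common approach, shown here as an assumption rather than as part of this notebook, is to average the vectors of the words in a review and skip out-of-vocabulary words. The sketch uses a hypothetical dictionary `toy_vectors` with 3-dimensional embeddings in place of the trained model.

```python
# Hypothetical 3-dimensional embeddings standing in for word2vec_model.wv
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.1],
    "boring": [-0.7, 0.2, 0.3],
}

def review_vector(tokens, vectors, size):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension across the known tokens
    return [sum(dim) / len(known) for dim in zip(*known)]

vec = review_vector(["great", "movie", "unseenword"], toy_vectors, size=3)
print(vec)
```

With the trained model, `word2vec_model.wv` and `word2vec_model.vector_size` could presumably take the place of `toy_vectors` and `size`, since gensim's keyed vectors support both membership tests and indexing by word.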

### Step 9 : Term Frequency Encoder
**Purpose** : Represents documents as vectors of term frequencies.<br>
**Usage** : Converts text data into numerical vectors based on raw term frequencies.<br>
**Advantages** :
- Simple and interpretable, representing the raw frequency of each word.
- Can be effective if word frequency (and potentially the presence of specific words) is important for sarcasm detection.<br>

**Considerations** :
- Doesn’t capture the importance of words based on their rarity or commonality across the corpus.
- May overemphasize common words that might not contribute to sarcasm detection.
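
As a minimal sketch of what a term frequency encoding contains, the following builds a count matrix by hand with `collections.Counter`. The two reviews are invented, and whitespace splitting is a simplifying assumption; **CountVectorizer** applies its own tokenization rules.

```python
from collections import Counter

# Toy reviews (hypothetical), tokenized by whitespace for simplicity
reviews = ["great movie great plot", "boring movie"]
tokenized = [r.split() for r in reviews]

# Vocabulary sorted alphabetically, mirroring how feature names are listed
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per review, one column per vocabulary term, raw counts as values
counts = []
for doc in tokenized:
    term_counts = Counter(doc)
    counts.append([term_counts[term] for term in vocab])

print(vocab)
for row in counts:
    print(row)
```

Every term contributes its raw count regardless of how common it is across reviews, which is the limitation noted in the considerations above.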

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

# Apply TFE on Lemmatized_Review Column
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(data['Lemmatized_Review'])


In [50]:
# Print the output
print("Feature names (vocabulary):")
print(count_vectorizer.get_feature_names_out())
print("\n")
print("X_count (Sparse Matrix representation):")
print(X_count)
print("\n")
print("X_count (Dense Array representation):")
print(X_count.toarray())

Feature names (vocabulary):
['00' '000' '0000110' ... 'zzz' 'zzzz' 'zzzzzzzzzzzz']


X_count (Sparse Matrix representation):
  (0, 16902)	1
  (0, 19950)	1
  (0, 15207)	1
  (0, 25940)	3
  (0, 17274)	6
  (0, 8015)	2
  (0, 11422)	1
  (0, 20050)	2
  (0, 8235)	1
  (0, 10796)	1
  (0, 23799)	1
  (0, 8969)	2
  (0, 23885)	1
  (0, 22871)	2
  (0, 3383)	1
  (0, 24964)	1
  (0, 20802)	1
  (0, 25645)	4
  (0, 21223)	1
  (0, 26453)	2
  (0, 10076)	2
  (0, 24540)	1
  (0, 21515)	4
  (0, 8517)	1
  (0, 10991)	1
  :	:
  (6505, 16035)	1
  (6506, 3423)	1
  (6506, 20413)	1
  (6506, 20960)	1
  (6506, 10512)	1
  (6506, 6126)	1
  (6507, 8201)	1
  (6507, 18173)	1
  (6507, 16286)	1
  (6507, 15469)	1
  (6507, 8994)	1
  (6507, 8382)	1
  (6508, 8190)	1
  (6508, 15911)	1
  (6508, 14560)	1
  (6508, 640)	1
  (6508, 1937)	1
  (6508, 9796)	1
  (6509, 640)	1
  (6509, 11513)	1
  (6509, 22791)	1
  (6509, 24652)	1
  (6509, 16525)	1
  (6509, 15427)	1
  (6509, 16656)	1


X_count (Dense Array representation):
[[0 0 0 ... 0 0 0]
 [

**Output Explanation** :
1. **Feature Names (Vocabulary)** : ***count_vectorizer.get_feature_names_out()***  prints the vocabulary learned from the data, which represents all unique words found in **data['Lemmatized_Review']**.<br>

2. **X_count (Sparse Matrix Representation)** : ***X_count*** is a sparse matrix where each row represents a document (review) and each column represents a word from the vocabulary.<br>
It shows the counts of each word in each document (review).<br>

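3. **X_count (Dense Array Representation)** : ***X_count.toarray()*** converts the sparse matrix ***X_count*** into a dense NumPy array for easier inspection and manipulation.<br>
It provides the actual count values of each word in each document. Because the dense array stores every zero explicitly, it needs far more memory than the sparse matrix for a corpus of this size (6510 reviews and a vocabulary of more than 26,000 terms), so converting only a slice, such as ***X_count[:5].toarray()***, is safer for inspection.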
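
A rough back-of-the-envelope estimate shows why the dense conversion is costly. The row count comes from `df.shape` in this notebook; the vocabulary size of 26,500 is an assumed round figure, chosen because the printed column indices exceed 26,000, and 8 bytes per value assumes the default int64 counts.

```python
# Rough memory estimate for a dense int64 term-count matrix
n_reviews = 6510        # rows, from df.shape in this notebook
vocab_size = 26_500     # assumed; the printed column indices exceed 26,000
bytes_per_value = 8     # int64

dense_bytes = n_reviews * vocab_size * bytes_per_value
dense_gib = dense_bytes / 1024**3
print(f"Dense array: about {dense_gib:.2f} GiB")
```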



### Step 10 : Summary
**Label Encoding** provides a straightforward method to convert categorical variables like "Sarcasm" and "Sentiment" into numerical representations suitable for machine learning algorithms.<br>

**For Sarcasm Column** :
- "non-sarcastic" will be encoded as 0
- "sarcastic" will be encoded as 1<br>

**For Sentiment Column**:
- "negative" will be encoded as 0
- "neutral" will be encoded as 1
- "positive" will be encoded as 2<br>

For sarcasm detection in movie reviews with a dataset of 6510 reviews, **Word2Vec** is likely to be the most effective choice among the three text representations (TF-IDF, Word2Vec, and term frequency). Here's why:

- **Contextual Understanding** : Sarcasm often relies on subtle linguistic and contextual cues. Word2Vec learns each word's vector from the words that surround it, so it can capture these cues better than TF-IDF or simple term frequencies, which treat every word as an independent feature.
- **Semantic Relationships** : Word2Vec embeddings can encode semantic meanings and relationships between words, which are crucial for understanding the nuanced differences between sarcastic and non-sarcastic expressions.
- **Generalization**: Because words used in similar contexts receive similar vectors, a model trained on Word2Vec features can generalize to reviews that use different but related wording, potentially improving robustness in sarcasm detection. Words that never appeared in the training corpus still have no vector and must be skipped or handled separately.<br>

While TF-IDF and Term Frequency (CountVectorizer) are simpler and more straightforward, they may not capture the nuanced semantics and context-dependent nature of sarcasm as effectively as Word2Vec. Therefore, if computational resources permit, opting for Word2Vec for feature representation in your sarcasm detection model would likely yield more accurate and insightful results.