# Evaluation of Encoding Methods:
## One Hot Encoder:

Suitability: Not ideal for text data. It would require creating a separate feature for each unique word in the dataset, leading to a very high-dimensional and sparse matrix, which is not efficient for sarcasm detection.
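
To make the dimensionality concern concrete, here is a minimal sketch in plain Python that one-hot encodes single words from a toy corpus. The three comments are invented for illustration and are not taken from the dataset. Every distinct word adds a column, and each word vector has exactly one non-zero entry.

```python
# Toy corpus, invented for illustration only.
comments = [
    "oh great another monday",
    "great job breaking it",
    "another day another bug",
]

# Vocabulary: one column per distinct word.
vocab = sorted({word for comment in comments for word in comment.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

vec = one_hot("great")
print(len(vocab))          # number of columns grows with the vocabulary
print(sum(vec), len(vec))  # one non-zero entry out of len(vocab)
```

With a real comment corpus the vocabulary can easily reach tens of thousands of words, which is why the resulting matrix is stored in sparse form.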
## Label Encoder:

Suitability: Not suitable for text data. Label encoding maps each category to an arbitrary integer, which imposes a meaningless numeric order on words and captures nothing about their meaning. It is appropriate for the target label, not for the comment text.
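
A small sketch of why integer codes are a poor fit for words. It mimics the sorted-order assignment that scikit-learn's `LabelEncoder` uses, written in plain Python; the three words are invented for illustration.

```python
# Toy vocabulary, invented for illustration only.
words = ["terrible", "awesome", "okay"]

# Integers are assigned by sorted order, not by meaning.
label_ids = {word: i for i, word in enumerate(sorted(words))}
print(label_ids)

# A model that treats these codes as numbers sees "terrible" (2) as greater
# than "okay" (1) and "awesome" (0), an ordering with no linguistic basis.
```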
## TF-IDF:

Suitability: Good for text classification tasks. TF-IDF can effectively convert the text data into numerical vectors by capturing the importance of words in each comment relative to the entire dataset. It helps in distinguishing common words from more unique words, which can be useful for detecting sarcasm.
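
The weighting idea can be sketched by hand. The example below uses the textbook form, raw count times log(N / document frequency), on three invented comments. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tokenized toy comments, invented for illustration only.
docs = [
    "yeah great idea".split(),
    "great great plan".split(),
    "what a plan".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: raw count times log(N / df)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}

weights = tfidf(docs[0])
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

In the first comment, "yeah" and "idea" occur in only one document and receive a higher weight than "great", which occurs in two.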
## Word2Vec:

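Suitability: Very suitable. Word2Vec learns dense vector representations in which words used in similar contexts lie close together, so the vectors carry semantic information. Since sarcasm often depends on the meaning of the words used, Word2Vec embeddings can be beneficial for a sarcasm detection model. Note that each word gets a single static vector, and averaging word vectors over a comment discards word order.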
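
One limitation is worth seeing in code. This notebook later represents each comment as the mean of its word vectors, and a mean ignores word order. The sketch below uses made-up two-dimensional vectors rather than trained Word2Vec embeddings, and the helper name `average_embedding` is hypothetical.

```python
# Made-up 2-d "embeddings" for illustration; real Word2Vec vectors are learned.
toy_vectors = {
    "not": [-1.0, 0.0],
    "good": [1.0, 1.0],
    "really": [0.0, 0.5],
}

def average_embedding(tokens, vectors, size=2):
    """Mean of the known token vectors; zeros if no token is in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]

a = average_embedding(["really", "not", "good"], toy_vectors)
b = average_embedding(["not", "really", "good"], toy_vectors)
print(a == b)  # word order does not affect the average
```

Two comments with the same words in a different order get the same representation, which may matter for sarcasm that hinges on phrasing.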
## Term Frequency Encoder:

Suitability: Basic but can be useful. It simply counts the occurrences of each word in a document. While it does not consider the importance of words across documents like TF-IDF, it can still provide a baseline representation of text data.
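
As a quick illustration in plain Python, term frequency keeps how often each word occurs, while a binary encoding keeps only whether it occurs. The token list is invented for illustration.

```python
from collections import Counter

# Toy tokenized comment, invented for illustration only.
tokens = "wow great great job wow wow".split()

term_frequency = Counter(tokens)                        # raw counts per word
binary_presence = {word: 1 for word in term_frequency}  # 1 if the word occurs at all

print(term_frequency["wow"], binary_presence["wow"])
```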


## Recommendations:
Primary Recommendation: Word2Vec or similar embeddings (like GloVe or FastText). These embeddings capture semantic similarity between words, which matters for detecting sarcasm that often relies on nuanced meanings.


Secondary Recommendation: TF-IDF can be a strong alternative if you prefer traditional feature extraction methods. It can provide useful features for machine learning models by highlighting important words.

Step 1: Load the Dataset

In [1]:
import pandas as pd

# Load the dataset
file_path = 'cleaned_balanced_dataset_FINAL.csv'
data = pd.read_csv(file_path)

# Drop rows with missing comments
data.dropna(subset=['comment'], inplace=True)

# Display the first few rows of the dataset
data.head()


Unnamed: 0,label,comment
0,1,need
1,0,might well milk last
2,1,ask locktrap
3,1,im glad community doesnt make console player f...
4,0,joke put stitch


Step 2: Preprocess the Text Data

In [2]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK data
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

# Preprocess text function
def preprocess_text(text):
    text = text.lower()  # Lowercase text
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse extra whitespace
    tokens = word_tokenize(text)  # Tokenize text
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return tokens

# Apply preprocessing to the comment column
data['processed_comment'] = data['comment'].apply(preprocess_text)
data.head()



Unnamed: 0,label,comment,processed_comment
0,1,need,[need]
1,0,might well milk last,"[might, well, milk, last]"
2,1,ask locktrap,"[ask, locktrap]"
3,1,im glad community doesnt make console player f...,"[im, glad, community, doesnt, make, console, p..."
4,0,joke put stitch,"[joke, put, stitch]"
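
Stopword lists often include negations such as "not", so it is worth checking what the filter removes before relying on it for sarcasm. The sketch below is a simplified stand-in for the preprocessing step: it uses a hypothetical mini stopword list and `str.split` instead of NLTK's list and `word_tokenize`, so that it runs without any downloads. The names `STOPWORDS`, `KEEP`, and `clean` are illustrative.

```python
import re

# Hypothetical mini stopword list; the notebook itself uses NLTK's English list.
STOPWORDS = {"i", "the", "is", "this", "not", "so"}
KEEP = {"not"}  # assumed choice: words to protect from removal

def clean(text, stop_words):
    """Lowercase, strip digits and punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = re.sub(r"[^\w\s]", "", text)
    return [tok for tok in text.split() if tok not in stop_words]

sentence = "This is NOT the best idea of 2024!!!"
print(clean(sentence, STOPWORDS))         # negation removed
print(clean(sentence, STOPWORDS - KEEP))  # negation kept
```

Whether keeping negations helps is an empirical question for this dataset; the sketch only shows how to make the choice explicit.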


Step 3: Encoding Methods

## One Hot Encoding and Term Frequency

In [4]:
data['processed_comment_str'] = data['processed_comment'].apply(lambda x: ' '.join(x))
data.head()

Unnamed: 0,label,comment,processed_comment,processed_comment_str
0,1,need,[need],need
1,0,might well milk last,"[might, well, milk, last]",might well milk last
2,1,ask locktrap,"[ask, locktrap]",ask locktrap
3,1,im glad community doesnt make console player f...,"[im, glad, community, doesnt, make, console, p...",im glad community doesnt make console player f...
4,0,joke put stitch,"[joke, put, stitch]",joke put stitch


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# One Hot Encoding (binary bag-of-words: 1 if the word occurs in the comment, else 0)
onehot_vectorizer = CountVectorizer(binary=True)
X_onehot = onehot_vectorizer.fit_transform(data['processed_comment_str'])

# Term Frequency (raw word counts per comment)
tf_vectorizer = CountVectorizer()
X_tf = tf_vectorizer.fit_transform(data['processed_comment_str'])


Step 4: Create TF-IDF Vectors and Train a Random Forest Model

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf_vectorizer.fit_transform(data['processed_comment_str'])


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Reuse X_tfidf from the previous cell (TF-IDF features, at most 10,000 terms).
# Note: the vectorizer was fit on all comments before the split, so its
# vocabulary and IDF weights also reflect the test comments.

# Train-test split
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split(X_tfidf, data['label'], test_size=0.2, random_state=42)

# Train a Random Forest model
model_tfidf = RandomForestClassifier(n_estimators=100, random_state=42)
model_tfidf.fit(X_train_tfidf, y_train)

# Evaluate the model
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
print("TF-IDF Model Performance:\n", classification_report(y_test, y_pred_tfidf))


TF-IDF Model Performance:
               precision    recall  f1-score   support

           0       0.65      0.67      0.66     12746
           1       0.67      0.64      0.65     12994

    accuracy                           0.66     25740
   macro avg       0.66      0.66      0.66     25740
weighted avg       0.66      0.66      0.66     25740
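
In the cell above, the TF-IDF vectorizer is fit on every comment before the train-test split, so the test set influences the vocabulary and IDF weights. One common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The sketch below shows the pattern on a tiny invented dataset so that it runs on its own; it is not a rerun of the experiment above, and the step names `"tfidf"` and `"forest"` are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny invented dataset so the sketch is self-contained.
texts = [
    "oh great another meeting", "thanks for the help",
    "yeah right sure", "love this weather",
    "what a surprise", "nice work team",
    "totally believable story", "see you tomorrow",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the vectorizer never sees the test comments.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)  # vocabulary and IDF are learned from training text only
predictions = pipeline.predict(X_test)
print(predictions)
```

On the real data, the same pattern would take `data['processed_comment_str']` and `data['label']` in place of the toy lists.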



Step 5: Create Word2Vec Embeddings and Train a Random Forest Model


In [12]:
# Install gensim from a local wheel. Use the %pip notebook magic (a bare
# "pip install ..." line inside a code cell is a SyntaxError) and quote the
# path, because it contains spaces.
%pip install "H:\infosys Springboard internship\gensim-4.3.2-cp312-cp312-win_amd64.whl"


In [12]:
from gensim.models import Word2Vec
import numpy as np

# Train Word2Vec model. Passing sentences to the constructor already trains the
# model, so the epoch count is set there instead of calling train() a second time.
sentences = data['processed_comment'].tolist()
word2vec_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2, workers=4, epochs=30)

# Function to average Word2Vec embeddings
def get_avg_word2vec(tokens, model, vector_size):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        return np.zeros(vector_size)
    else:
        return np.mean(vectors, axis=0)

# Apply Word2Vec to each comment
vector_size = 100
data['word2vec'] = data['processed_comment'].apply(lambda x: get_avg_word2vec(x, word2vec_model, vector_size))

# Convert to numpy array
X_word2vec = np.array(data['word2vec'].tolist())

# Train-test split
X_train_w2v, X_test_w2v, y_train, y_test = train_test_split(X_word2vec, data['label'], test_size=0.2, random_state=42)

# Train a Random Forest model
model_w2v = RandomForestClassifier(n_estimators=100, random_state=42)
model_w2v.fit(X_train_w2v, y_train)

# Evaluate the model
y_pred_w2v = model_w2v.predict(X_test_w2v)
print("Word2Vec Model Performance:\n", classification_report(y_test, y_pred_w2v))


ModuleNotFoundError: No module named 'gensim'

## Label Encoding

In [13]:
from sklearn.preprocessing import LabelEncoder

# Label Encoder (for target variable); the labels are already 0/1, so the values stay the same
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['label'])


Step 6: Train and Evaluate Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train-test split function
def train_test_split_data(X, y):
    return train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest Model
def train_evaluate_model(X_train, X_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return classification_report(y_test, y_pred)

# Prepare data for model training
X_train_onehot, X_test_onehot, y_train, y_test = train_test_split_data(X_onehot, y)
X_train_tf, X_test_tf, y_train, y_test = train_test_split_data(X_tf, y)
X_train_tfidf, X_test_tfidf, y_train, y_test = train_test_split_data(X_tfidf, y)
X_train_w2v, X_test_w2v, y_train, y_test = train_test_split_data(X_word2vec, y)

# Evaluate models
print("One Hot Encoding Model Performance:\n", train_evaluate_model(X_train_onehot, X_test_onehot, y_train, y_test))
print("Term Frequency Model Performance:\n", train_evaluate_model(X_train_tf, X_test_tf, y_train, y_test))
print("TF-IDF Model Performance:\n", train_evaluate_model(X_train_tfidf, X_test_tfidf, y_train, y_test))
print("Word2Vec Model Performance:\n", train_evaluate_model(X_train_w2v, X_test_w2v, y_train, y_test))


## Compare Performance

In [None]:
# Print the performance comparison for each encoding method
print("Performance Comparison:")
print("One Hot Encoding Model Performance:\n", train_evaluate_model(X_train_onehot, X_test_onehot, y_train, y_test))
print("Term Frequency Model Performance:\n", train_evaluate_model(X_train_tf, X_test_tf, y_train, y_test))
print("TF-IDF Model Performance:\n", train_evaluate_model(X_train_tfidf, X_test_tfidf, y_train, y_test))
print("Word2Vec Model Performance:\n", train_evaluate_model(X_train_w2v, X_test_w2v, y_train, y_test))
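
Full classification reports are hard to compare side by side. A more compact option is to ask `classification_report` for a dictionary (`output_dict=True`) and keep only accuracy and macro-averaged F1. The sketch below uses invented predictions for two hypothetical encodings; the helper name `summarize` and the keys `encoding_a` and `encoding_b` are illustrative, not results from this notebook.

```python
from sklearn.metrics import classification_report

def summarize(y_true, y_pred):
    """Return accuracy and macro-averaged F1 from scikit-learn's report dictionary."""
    report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
    return {"accuracy": report["accuracy"], "macro_f1": report["macro avg"]["f1-score"]}

# Invented labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 1, 0]
results = {
    "encoding_a": summarize(y_true, [0, 1, 0, 0, 1, 0]),
    "encoding_b": summarize(y_true, [1, 1, 0, 0, 1, 1]),
}

# Print one line per encoding, best macro F1 first.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["macro_f1"], reverse=True):
    print(f"{name:12s} accuracy={scores['accuracy']:.2f} macro_f1={scores['macro_f1']:.2f}")
```

In the notebook, the same helper could be applied to the predictions of each trained model, which would also avoid training every model a second time just to print its report.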


## Explanation

Load the Dataset: Reads the dataset and drops any rows with missing comments.

Preprocess the Text Data: Cleans and tokenizes the text data, then converts the processed tokens back into strings.

Encoding Methods: Encodes the text data using One Hot Encoding, Term Frequency, TF-IDF, and Word2Vec. Label encoding is used for the target variable.

Train and Evaluate Models: Trains and evaluates a Random Forest model using each encoding method.

Compare Performance: Prints the performance comparison for each encoding method.

This code helps determine which feature extraction method works best for sarcasm detection with a Random Forest classifier. In the run recorded here, only the TF-IDF model produced results (0.66 accuracy on 25,740 test comments); the Word2Vec cell failed because gensim was not installed. Adjust the parameters as needed for your requirements and dataset.