## Sarcasm Detection in Movie Review (Model Evaluation - Machine Learning Models)
In this notebook, we evaluate several machine learning models. We start by importing the necessary libraries and loading the dataset, then split it into training and testing sets so that each model is trained on one part and evaluated on unseen data.

Next, we train each model on the training data and evaluate its performance using several common metrics: accuracy, precision, recall, and F1 score. Additionally, we use cross-validation to check the robustness of our models.
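
A minimal sketch of the cross-validation step, assuming scikit-learn's `Pipeline`, `StratifiedKFold`, and `cross_val_score`. The toy `texts` and `labels` below are hypothetical stand-ins for the review text and the Sarcasm_Label column, not data from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = sarcastic, 0 = non-sarcastic)
texts = [
    "oh great another sequel how original",
    "yeah right what a masterpiece of boredom",
    "wow such a fresh idea said nobody",
    "sure because we needed another remake",
    "oh wonderful two more hours of nothing",
    "brilliant i totally stayed awake",
    "a moving story with superb acting",
    "the soundtrack was beautiful and fitting",
    "i loved the characters and the pacing",
    "a gripping plot from start to finish",
    "the cinematography was stunning",
    "i highly recommend this film",
]
labels = [1] * 6 + [0] * 6

# Keeping the vectorizer inside the pipeline means it is refit on the
# training portion of every fold, never on the held-out portion.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much a single train/test split can vary.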

**Traditional Machine Learning Models** <br>
1. Logistic Regression
2. K-Nearest Neighbors
3. Naive Bayes
4. Gradient Boosting
5. XGBoost
6. Random Forest Classifier
7. Support Vector Machine (SVM)



### Step 1: Loading the Vectorized Data

We will evaluate the machine learning models using tokenized data. We start by importing the necessary libraries and loading the tokenized dataset.

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# File path
file_path = '/content/drive/MyDrive/IMBD/Vector_dataset.csv'

In [3]:
# Read CSV file
import pandas as pd
df = pd.read_csv(file_path)

In [4]:
# Display the first 5 rows of data
df.head()

Unnamed: 0,Review,Sentiment,Sarcasm,Lemmatized_Review,Tokenized_Review,Sentiment_Label,Sarcasm_Label,word2vec_vector
0,One reviewers mentioned watching 1 Oz episode ...,positive,non-sarcastic,one reviewer mention watch 1 oz episode hook ....,"['one', 'reviewer', 'mention', 'watch', '1', '...",2,0,[-0.33703893 0.63750656 0.20848949 0.110051...
1,wonderful little production. filming technique...,positive,non-sarcastic,wonderful little production . film technique u...,"['wonderful', 'little', 'production', '.', 'fi...",2,0,[-2.21933369e-01 6.40139948e-01 2.48385639e-...
2,movie groundbreaking experience! I've never se...,positive,sarcastic,movie groundbreaking experience ! I have never...,"['movie', 'groundbreaking', 'experience', '!',...",2,1,[-7.50784083e-01 8.69618461e-01 6.57767776e-...
3,thought wonderful way spend time hot summer we...,positive,non-sarcastic,think wonderful way spend time hot summer week...,"['think', 'wonderful', 'way', 'spend', 'time',...",2,0,[-0.29578843 0.66404176 0.19095987 0.130039...
4,Basically there's family little boy (Jake) thi...,negative,sarcastic,basically there be family little boy ( Jake ) ...,"['basically', 'there', 'be', 'family', 'little...",0,1,[-0.36713844 0.69574437 0.21454412 0.073285...
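
Because the dataset is read back from CSV, the word2vec_vector column is loaded as text rather than as numeric arrays. A minimal sketch of converting it back, assuming each cell is a whitespace-separated array printout like the preview above; `parse_vector` and the two-row `sample` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def parse_vector(text):
    """Convert a string such as '[-0.33 0.63 0.20]' into a float array."""
    # split() with no argument also handles line breaks inside long printouts
    return np.array(text.strip("[]").split(), dtype=float)

# Hypothetical two-row frame mimicking the stored format
sample = pd.DataFrame({
    "word2vec_vector": [
        "[-0.33703893 0.63750656 0.20848949]",
        "[-2.21933369e-01 6.40139948e-01 2.48385639e-01]",
    ]
})

# Stack the parsed rows into a single (n_rows, n_dims) matrix
vectors = np.vstack(sample["word2vec_vector"].apply(parse_vector).tolist())
print(vectors.shape)
```

The resulting matrix could be passed to a classifier directly, as an alternative to the TF-IDF features used in the cells below.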


### Step 2: Check for Class Imbalance in the Dataset
Our dataset has two columns, Sentiment_Label and Sarcasm_Label, that represent the target variables. <br>
To check for imbalance, we count the occurrences of each class within a target variable and then assess whether there is a significant disparity between the class counts.

In [5]:
# Check class distribution for Sarcasm_Label
sarcasm_counts = df['Sarcasm_Label'].value_counts()
print("\nSarcasm Label Distribution:")
print(sarcasm_counts)
print()

# Determine if Sarcasm_Label is imbalanced
is_imbalanced = False
for count in sarcasm_counts:
    if count < 0.2 * sarcasm_counts.sum():
        is_imbalanced = True
        break

if is_imbalanced:
    print("Sarcasm Label is imbalanced")
else:
    print("Sarcasm Label is balanced")


Sarcasm Label Distribution:
Sarcasm_Label
1    3518
0    2979
Name: count, dtype: int64

Sarcasm Label is balanced


In [6]:
# Check class distribution for Sentiment_Label
sentiment_counts = df['Sentiment_Label'].value_counts()
print("Sentiment Label Distribution:")
print(sentiment_counts)
print()

# Determine if Sentiment_Label is imbalanced
is_imbalanced = False
for count in sentiment_counts:
    if count < 0.2 * sentiment_counts.sum():
        is_imbalanced = True
        break

if is_imbalanced:
    print("Sentiment Label is imbalanced")
else:
    print("Sentiment Label is balanced")

Sentiment Label Distribution:
Sentiment_Label
0    4184
2    2300
1      13
Name: count, dtype: int64

Sentiment Label is imbalanced


**Output Explanation**:<br>

**Sentiment_Label Distribution**: value_counts() calculates the count of each unique value in the Sentiment_Label column.<br>

**Check Imbalance**: The loop for count in sentiment_counts iterates through the counts of each unique value. If any count is less than 20% of the total count ***(0.2 * sentiment_counts.sum())***, it flags the label as imbalanced. Here class 1 has only 13 of 6,497 rows, so Sentiment_Label is flagged.<br>

**Sarcasm_Label Distribution**: Similarly, value_counts() calculates the count of each unique value in the Sarcasm_Label column.<br>

**Check Imbalance for Sarcasm_Label**: The loop for count in sarcasm_counts iterates through the counts of each unique value. If any count is less than 20% of the total count ***(0.2 * sarcasm_counts.sum())***, it flags the label as imbalanced. Both classes are well above that share, so Sarcasm_Label is reported as balanced.<br>

This approach works for any number of classes: Sarcasm_Label has two unique values (0 and 1) and Sentiment_Label has three (0, 1, and 2). Adjust the threshold (0.2 in this case) to suit the specific dataset and imbalance criteria.<br>
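
The same 20% rule can be written once and reused for both label columns. This is a minimal sketch, assuming pandas; `is_imbalanced` is a hypothetical helper, and the two series are rebuilt from the class counts printed above rather than read from the dataset.

```python
import pandas as pd

def is_imbalanced(labels, threshold=0.2):
    """Return True if any class holds less than `threshold` of all rows."""
    shares = labels.value_counts(normalize=True)
    return bool((shares < threshold).any())

# Rebuild the label columns from the class counts reported above
sarcasm = pd.Series([1] * 3518 + [0] * 2979, name="Sarcasm_Label")
sentiment = pd.Series([0] * 4184 + [2] * 2300 + [1] * 13, name="Sentiment_Label")

print("Sarcasm imbalanced:", is_imbalanced(sarcasm))      # False
print("Sentiment imbalanced:", is_imbalanced(sentiment))  # True

# Class shares make the size of the disparity explicit
print(sentiment.value_counts(normalize=True).round(3))
```

Working with shares instead of raw counts keeps the threshold comparison independent of the dataset size.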

### Machine Learning Models
1. Logistic Regression

In [7]:
import ast

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer


# Convert Tokenized_Review (a stringified list) back to text for vectorization.
# ast.literal_eval parses the list literal without executing arbitrary code.
df['Review_Text'] = df['Tokenized_Review'].apply(lambda x: ' '.join(ast.literal_eval(x)))

# Extract features and labels
X = df['Review_Text']
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to TF-IDF features
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Train the model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_tfidf)

# Evaluate the model
print("Logistic Regression Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Logistic Regression Performance:
Accuracy: 0.8253846153846154
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.86      0.82       604
           1       0.86      0.80      0.83       696

    accuracy                           0.83      1300
   macro avg       0.83      0.83      0.83      1300
weighted avg       0.83      0.83      0.83      1300



2. K-Nearest Neighbors (KNN)

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate the model
print("K-Nearest Neighbors Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


K-Nearest Neighbors Performance:
Accuracy: 0.5761538461538461
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.09      0.17       604
           1       0.56      1.00      0.72       696

    accuracy                           0.58      1300
   macro avg       0.75      0.54      0.44      1300
weighted avg       0.74      0.58      0.46      1300



3. Naive Bayes

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Make predictions
y_pred = nb.predict(X_test)

# Evaluate the model
print("Naive Bayes Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Naive Bayes Performance:
Accuracy: 0.7884615384615384
Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.84      0.79       604
           1       0.84      0.75      0.79       696

    accuracy                           0.79      1300
   macro avg       0.79      0.79      0.79      1300
weighted avg       0.79      0.79      0.79      1300



4. Gradient Boosting

In [10]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# Make predictions
y_pred = gb.predict(X_test)

# Evaluate the model
print("Gradient Boosting Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Gradient Boosting Performance:
Accuracy: 0.7992307692307692
Classification Report:
               precision    recall  f1-score   support

           0       0.77      0.80      0.79       604
           1       0.82      0.80      0.81       696

    accuracy                           0.80      1300
   macro avg       0.80      0.80      0.80      1300
weighted avg       0.80      0.80      0.80      1300



5. XGBoost

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Evaluate the model
print("XGBoost Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


XGBoost Performance:
Accuracy: 0.8184615384615385
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.84      0.81       604
           1       0.85      0.80      0.83       696

    accuracy                           0.82      1300
   macro avg       0.82      0.82      0.82      1300
weighted avg       0.82      0.82      0.82      1300



6. Random Forest Classifier

In [12]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Evaluate the model
print("Random Forest Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Random Forest Performance:
Accuracy: 0.8246153846153846
Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.90      0.83       604
           1       0.90      0.76      0.82       696

    accuracy                           0.82      1300
   macro avg       0.83      0.83      0.82      1300
weighted avg       0.84      0.82      0.82      1300



7. Support Vector Machine (SVM)

In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Extract features and labels
X_text = df['Tokenized_Review'].astype(str)  # Ensure text is treated as string
y = df['Sarcasm_Label']

# Split the dataset into training and testing sets
X_train_text, X_test_text, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

# Convert text to numerical features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Adjust max_features as needed
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)

# Train the model
svm = SVC(kernel='linear', random_state=42)
svm.fit(X_train, y_train)

# Make predictions
y_pred = svm.predict(X_test)

# Evaluate the model
print("SVM Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


SVM Performance:
Accuracy: 0.8138461538461539
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.83      0.81       604
           1       0.84      0.80      0.82       696

    accuracy                           0.81      1300
   macro avg       0.81      0.81      0.81      1300
weighted avg       0.82      0.81      0.81      1300
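
The seven cells above repeat the same split and vectorization before fitting each model. A minimal sketch of how that could be consolidated into one loop, assuming scikit-learn's `make_pipeline`; the toy corpus is hypothetical and only three of the models are shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (1 = sarcastic, 0 = non-sarcastic)
texts = [
    "oh great another sequel how original",
    "yeah right what a masterpiece of boredom",
    "wow such a fresh idea said nobody",
    "sure because we needed another remake",
    "oh wonderful two more hours of nothing",
    "brilliant i totally stayed awake",
    "oh sure the plot made perfect sense",
    "great another talking dog movie",
    "a moving story with superb acting",
    "the soundtrack was beautiful and fitting",
    "i loved the characters and the pacing",
    "a gripping plot from start to finish",
    "the cinematography was stunning",
    "i highly recommend this film",
    "a heartwarming and original story",
    "the acting was excellent throughout",
]
labels = [1] * 8 + [0] * 8

# One shared split so every model sees exactly the same data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM (linear)": SVC(kernel="linear", random_state=42),
}

results = {}
for name, model in models.items():
    # Same vectorizer settings for every model keeps the comparison even
    pipe = make_pipeline(TfidfVectorizer(max_features=1000), model)
    pipe.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipe.predict(X_test))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Note that the Logistic Regression cell above uses the full TF-IDF vocabulary, while the other cells cap it at 1000 features, so a shared pipeline would also put the models on equal footing.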



In [14]:
import pandas as pd

# Sample sarcastic and non-sarcastic movie reviews
sarcastic_reviews = [
    "Oh sure, because a movie about vampires has never been done before...",
    "Brilliant acting! I could totally tell they weren't reading from a script...",
    "What a masterpiece... if you like boring, unoriginal films.",
    "Yeah, because having a plot twist that makes sense is way overrated...",
    "Another rom-com where the guy is a total jerk but somehow wins the girl. Realistic!",
    "Wow, another superhero movie. How original...",
    "Of course, because we all know a talking dog is exactly what cinema needed...",
    "Oh great, another movie about a white guy saving the day. Groundbreaking.",
    "Yeah, because cramming 10 plots into one film is a recipe for success...",
    "Oh look, another sequel. Because Hollywood has run out of ideas, obviously."
]

non_sarcastic_reviews = [
    "This movie was absolutely brilliant! I loved every minute of it.",
    "What a heartwarming story. It brought tears to my eyes.",
    "The cinematography was stunning, and the acting was top-notch.",
    "A gripping tale that kept me on the edge of my seat.",
    "I was pleasantly surprised by the depth of the characters.",
    "An original plot that kept me guessing until the very end.",
    "I can't wait to see it again. A true masterpiece.",
    "The soundtrack was phenomenal. It really set the mood.",
    "A must-watch for anyone who loves good cinema.",
    "I highly recommend this movie to everyone."
]

# Combine sarcastic and non-sarcastic reviews into a DataFrame
sarcastic_labels = ['Sarcastic'] * 10
non_sarcastic_labels = ['Non-Sarcastic'] * 10

data = pd.DataFrame({
    'Review': sarcastic_reviews + non_sarcastic_reviews,
    'Label': sarcastic_labels + non_sarcastic_labels
})

# Shuffle the DataFrame
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the DataFrame
data


Unnamed: 0,Review,Label
0,"Oh sure, because a movie about vampires has ne...",Sarcastic
1,The soundtrack was phenomenal. It really set t...,Non-Sarcastic
2,An original plot that kept me guessing until t...,Non-Sarcastic
3,Brilliant acting! I could totally tell they we...,Sarcastic
4,"Yeah, because cramming 10 plots into one film ...",Sarcastic
5,"Wow, another superhero movie. How original...",Sarcastic
6,What a heartwarming story. It brought tears to...,Non-Sarcastic
7,"Yeah, because having a plot twist that makes s...",Sarcastic
8,A must-watch for anyone who loves good cinema.,Non-Sarcastic
9,I can't wait to see it again. A true masterpiece.,Non-Sarcastic


In [15]:
# Preprocess the data (assuming TF-IDF vectorization)
tfidf_vectorizer = TfidfVectorizer(max_features=1000)  # Use the same vectorizer parameters as trained models
X_train_text = tfidf_vectorizer.fit_transform(data['Review'])

In [19]:
# Logistic Regression
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

# Encode labels into numerical format
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Label'])

# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_text, y_encoded)

# Predict with the Logistic Regression model
log_reg_predictions = log_reg.predict(X_train_text)

# Map predictions back to labels for comparison
predicted_labels_logreg = label_encoder.inverse_transform(log_reg_predictions)

# Add predicted labels to the DataFrame
data['LogisticRegression_Prediction'] = predicted_labels_logreg

In [18]:
# XGBoost
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode labels into numerical format
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Label'])

# Train XGBoost model (a new model is fit here; nothing is loaded from disk)
xgb_model = XGBClassifier()
xgb_model.fit(X_train_text, y_encoded)

# Predict with the XGBoost model
xgb_predictions = xgb_model.predict(X_train_text)

# Map predictions back to labels for comparison
predicted_labels = label_encoder.inverse_transform(xgb_predictions)

# Add predicted labels to the DataFrame
data['XGBoost_Prediction'] = predicted_labels

In [20]:
# Random Forest
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Encode labels into numerical format
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Label'])

# Train Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train_text, y_encoded)

# Predict with the Random Forest model
rf_predictions = rf_model.predict(X_train_text)

# Map predictions back to labels for comparison
predicted_labels_rf = label_encoder.inverse_transform(rf_predictions)

# Add predicted labels to the DataFrame
data['RandomForest_Prediction'] = predicted_labels_rf


In [21]:
# SVM
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Encode labels into numerical format
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(data['Label'])

# Train and make predictions with SVM model
svm_model = SVC()
svm_model.fit(X_train_text, y_encoded)
svm_predictions = svm_model.predict(X_train_text)

# Map predictions back to labels for comparison
predicted_labels_svm = label_encoder.inverse_transform(svm_predictions)

# Add predicted labels to the DataFrame
data['SVM_Prediction'] = predicted_labels_svm


In [23]:
# Function to highlight incorrect predictions
def highlight_incorrect_predictions(row):
    colors = ['background-color: yellow' if row['Label'] != row[col] else '' for col in row.index[2:]]
    return [''] * 2 + colors

# Apply the function to the DataFrame
styled_df = data.style.apply(highlight_incorrect_predictions, axis=1)

# Display the styled DataFrame
display(styled_df)

Unnamed: 0,Review,Label,XGBoost_Prediction,LogisticRegression_Prediction,RandomForest_Prediction,SVM_Prediction
0,"Oh sure, because a movie about vampires has never been done before...",Sarcastic,Sarcastic,Sarcastic,Sarcastic,Sarcastic
1,The soundtrack was phenomenal. It really set the mood.,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic
2,An original plot that kept me guessing until the very end.,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic
3,Brilliant acting! I could totally tell they weren't reading from a script...,Sarcastic,Sarcastic,Sarcastic,Sarcastic,Sarcastic
4,"Yeah, because cramming 10 plots into one film is a recipe for success...",Sarcastic,Sarcastic,Sarcastic,Sarcastic,Sarcastic
5,"Wow, another superhero movie. How original...",Sarcastic,Sarcastic,Sarcastic,Sarcastic,Sarcastic
6,What a heartwarming story. It brought tears to my eyes.,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic
7,"Yeah, because having a plot twist that makes sense is way overrated...",Sarcastic,Sarcastic,Sarcastic,Sarcastic,Sarcastic
8,A must-watch for anyone who loves good cinema.,Non-Sarcastic,Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic
9,I can't wait to see it again. A true masterpiece.,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic,Non-Sarcastic
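
The models in the cells above are fit and scored on the same sample reviews, so the table reflects training fit rather than performance on new text. A minimal sketch of scoring reviews the model has not seen, assuming a pipeline fit on a separate labeled corpus; `train_texts`, `train_labels`, and `new_reviews` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus
train_texts = [
    "oh great another sequel how original",
    "yeah right what a masterpiece of boredom",
    "wow such a fresh idea said nobody",
    "sure because we needed another remake",
    "a moving story with superb acting",
    "the soundtrack was beautiful and fitting",
    "i loved the characters and the pacing",
    "a gripping plot from start to finish",
]
train_labels = ["Sarcastic"] * 4 + ["Non-Sarcastic"] * 4

# Fit the vectorizer and the classifier together on the training corpus only
model = make_pipeline(
    TfidfVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Hypothetical unseen reviews
new_reviews = [
    "Wow, another superhero movie. How original...",
    "The cinematography was stunning, and the acting was top-notch.",
]

# predict() reuses the fitted vocabulary; the vectorizer is not refit here
predictions = pd.DataFrame({
    "Review": new_reviews,
    "Prediction": model.predict(new_reviews),
})
print(predictions)
```

Calling `fit_transform` on the new reviews instead would build a different vocabulary, and the trained model's weights would no longer line up with the feature columns.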


In [29]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

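# Preprocess the data
# Note: the vectorizer is fit on the full dataset before the split, so IDF
# statistics from the test rows influence the training features.
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X = tfidf_vectorizer.fit_transform(df['Tokenized_Review']).toarray()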

# Encode labels into numerical format
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Sarcasm_Label'])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [30]:
# Logistic Regression Hyperparameter Tuning
log_reg_params = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}
log_reg = LogisticRegression(max_iter=1000)
log_reg_grid = GridSearchCV(log_reg, log_reg_params, cv=5, scoring='accuracy')
log_reg_grid.fit(X_train, y_train)

In [33]:
# Evaluate Logistic Regression
log_reg_best = log_reg_grid.best_estimator_
log_reg_pred = log_reg_best.predict(X_test)
print("Logistic Regression Best Parameters:", log_reg_grid.best_params_)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_reg_pred))
print("Logistic Regression Classification Report:\n", classification_report(y_test, log_reg_pred))


Logistic Regression Best Parameters: {'C': 1, 'solver': 'liblinear'}
Logistic Regression Accuracy: 0.8115384615384615
Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.81      0.80       604
           1       0.83      0.81      0.82       696

    accuracy                           0.81      1300
   macro avg       0.81      0.81      0.81      1300
weighted avg       0.81      0.81      0.81      1300
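
In the tuning cells, TF-IDF is fit on the full dataset before the split, so vocabulary and IDF statistics from held-out rows reach the training folds. A minimal sketch of one way to avoid that, assuming the vectorizer is placed inside a `Pipeline` that is passed to `GridSearchCV`; the toy data and the reduced grid are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = sarcastic, 0 = non-sarcastic)
texts = [
    "oh great another sequel how original",
    "yeah right what a masterpiece of boredom",
    "wow such a fresh idea said nobody",
    "sure because we needed another remake",
    "oh wonderful two more hours of nothing",
    "brilliant i totally stayed awake",
    "a moving story with superb acting",
    "the soundtrack was beautiful and fitting",
    "i loved the characters and the pacing",
    "a gripping plot from start to finish",
    "the cinematography was stunning",
    "i highly recommend this film",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=1000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__solver": ["liblinear", "lbfgs"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
grid.fit(texts, labels)

print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
```

With this layout the vectorizer is refit on the training portion of each fold, so the cross-validated score is computed on text the vectorizer has not seen.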



In [32]:
# Random Forest Hyperparameter Tuning
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier()
rf_grid = GridSearchCV(rf, rf_params, cv=5, scoring='accuracy')
rf_grid.fit(X_train, y_train)


In [34]:
# Evaluate Random Forest
rf_best = rf_grid.best_estimator_
rf_pred = rf_best.predict(X_test)
print("Random Forest Best Parameters:", rf_grid.best_params_)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_pred))


Random Forest Best Parameters: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200}
Random Forest Accuracy: 0.8261538461538461
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.91      0.83       604
           1       0.90      0.76      0.82       696

    accuracy                           0.83      1300
   macro avg       0.83      0.83      0.83      1300
weighted avg       0.84      0.83      0.83      1300



In [None]:
# SVM Hyperparameter Tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import time

svm_params = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}
svm = SVC()

# Use RandomizedSearchCV for faster hyperparameter tuning
svm_random_grid = RandomizedSearchCV(svm, svm_params, n_iter=20, cv=5, scoring='accuracy', random_state=42, n_jobs=-1)

start_time = time.time()
svm_random_grid.fit(X_train, y_train)
end_time = time.time()


In [None]:

# Evaluate SVM
svm_best = svm_random_grid.best_estimator_
svm_pred = svm_best.predict(X_test)
print("SVM Best Parameters:", svm_random_grid.best_params_)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("SVM Classification Report:\n", classification_report(y_test, svm_pred))
print(f"Time taken for RandomizedSearchCV: {end_time - start_time} seconds")

