# SVC Model


This notebook evaluates `SVC` (the Support Vector Classifier from `sklearn.svm`) for sarcasm detection. The main reasons to consider it over other approaches are:

### Reasons to Use `SVC`:

1. **Effectiveness in High-Dimensional Spaces:**
   `SVC` is known for its effectiveness in high-dimensional spaces, which is particularly relevant when dealing with text data that has been transformed into TF-IDF features. The high dimensionality of the feature space can be efficiently handled by the SVC algorithm.

2. **Robustness to Overfitting:**
   SVMs, including `SVC`, have regularization parameters (`C` in `SVC`) that help prevent overfitting. This is crucial for sarcasm detection, where the model needs to generalize well to new, unseen examples of sarcasm without being too specific to the training data.

3. **Kernel Trick:**
   `SVC` supports the use of kernel functions, which can map the original features into higher-dimensional spaces where a linear separation is possible. This is beneficial for sarcasm detection, as the relationship between features and labels may not be linearly separable in the original feature space. Using kernels like the RBF (Radial Basis Function) can capture complex patterns in the data.

4. **Performance:**
   SVMs have a long track record as strong baselines for text classification. Sarcasm is harder than topic classification because it depends on tone and context, so accuracy should be measured on held-out data rather than assumed.

5. **Scalability:**
   `SVC` in `scikit-learn` is built on libsvm, and its training time grows at least quadratically with the number of samples, so it becomes slow beyond a few tens of thousands of examples. For larger datasets, `LinearSVC` or `SGDClassifier` are the usual alternatives. The experiments in this notebook subsample to 10,000 comments to keep hyperparameter search tractable.
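
The role of `C` and the kernel can be seen in a small sketch. The comments and labels below are hypothetical toy data, not the notebook's dataset, and the values of `C` are arbitrary; the only point is how `TfidfVectorizer` and `SVC` fit together.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy comments; 1 = sarcastic, 0 = non-sarcastic
texts = [
    "oh great another monday meeting",
    "yeah right that will totally work",
    "wow i just love being stuck in traffic",
    "sure because that went so well last time",
    "the meeting is scheduled for monday",
    "traffic was light this morning",
    "the patch fixed the login bug",
    "dinner was good and the service was quick",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Smaller C means stronger regularization (a wider, softer margin)
for C in (0.1, 1, 10):
    model = make_pipeline(
        TfidfVectorizer(),
        SVC(kernel="rbf", C=C, gamma="scale"),
    )
    model.fit(texts, labels)
    n_support = int(model.named_steps["svc"].n_support_.sum())
    train_acc = model.score(texts, labels)
    print(f"C={C}: {n_support} support vectors, train accuracy {train_acc:.2f}")
```

On real data, training accuracy alone says little; the comparison that matters is between training and held-out accuracy as `C` changes.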

### Comparison with Other Approaches:

- **Logistic Regression:** While logistic regression is simpler and faster, it might not capture the complex patterns in sarcasm detection as effectively as `SVC` with non-linear kernels.
- **Decision Trees/Random Forests:** These models can capture non-linear patterns but may require extensive tuning and can be prone to overfitting. They also may not perform as well in high-dimensional spaces compared to `SVC`.
- **XGBoost:** While powerful and often performing well in various tasks, XGBoost might be more complex to tune and more computationally intensive compared to `SVC`.
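
One way to check comparisons like these on a given dataset is to cross-validate each candidate on identical folds. The sketch below uses a hypothetical toy corpus, so the printed scores carry no meaning; the `candidates` dictionary and its settings are illustrative assumptions, and XGBoost is left out to keep the example dependent on `scikit-learn` only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy comments; 1 = sarcastic, 0 = non-sarcastic
texts = [
    "oh great another monday meeting",
    "yeah right that will totally work",
    "wow i just love being stuck in traffic",
    "sure because that went so well last time",
    "what a surprise the server is down again",
    "fantastic my flight got cancelled",
    "the meeting is scheduled for monday",
    "traffic was light this morning",
    "the patch fixed the login bug",
    "dinner was good and the service was quick",
    "the server was restarted last night",
    "my flight leaves at noon",
]
labels = [1] * 6 + [0] * 6

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "svc_rbf": SVC(kernel="rbf", C=1, gamma="scale"),
}

# The same shuffled, stratified folds are reused for every candidate
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

mean_scores = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline, so it is re-fit on each training fold
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```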

### Conclusion:

**SVC** is a strong candidate for sarcasm detection due to its ability to handle high-dimensional feature spaces, robustness against overfitting, support for non-linear classification through kernel functions, and demonstrated empirical performance in text classification tasks.

In [2]:
import pandas as pd

# Load the dataset
file_path = 'balanced_dataset_50000.csv'
data = pd.read_csv(file_path)

# Count the number of samples
num_samples = len(data)
print(f'Number of samples: {num_samples}')

# Get the shape of the data
data_shape = data.shape
print(f'Shape of the data: {data_shape}')

# Count null values
null_values = data.isnull().sum()
print(f'Null values:\n{null_values}')

# Remove null values
data_cleaned = data.dropna()

# Shuffle the dataset
data_shuffled = data_cleaned.sample(frac=1).reset_index(drop=True)

# Save the new dataset
new_file_path = 'balanced_dataset_50000_cleaned_shuffled.csv'
data_shuffled.to_csv(new_file_path, index=False)

# Count values in the last column (note: the last column here is 'comment', not 'label'; the next cell counts the labels)
value_counts = data_shuffled.iloc[:, -1].value_counts()
print(f'Value counts of target column:\n{value_counts}')


Number of samples: 50000
Shape of the data: (50000, 2)
Null values:
label        0
comment    502
dtype: int64
Value counts of target column:
comment
forgot                                                                                              280
dropped                                                                                             118
yes                                                                                                  81
thanks                                                                                               64
lol                                                                                                  50
                                                                                                   ... 
joke u part backgammon operation                                                                      1
parachute stage ignition                                                                              1
well hows wife hol

In [3]:
# Count the values of 0s and 1s in the 'label' column
label_counts = data_shuffled['label'].value_counts()
print(f'Value counts in the label column:\n{label_counts}')

Value counts in the label column:
label
1    24908
0    24590
Name: count, dtype: int64


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score

# Load the dataset
data = pd.read_csv('balanced_dataset_50000_cleaned_shuffled.csv')

# Preprocess the dataset
X = data['comment']
y = data['label']

# Encode labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to strings to handle potential float values
X_train = X_train.astype(str)
X_test = X_test.astype(str)

# Tokenize and transform sequences using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

In [2]:
print(data.head(10))

   label                                            comment
0      0                                      yes joke sure
1      0  credit worthiness isnt going factor dealer dec...
2      0                         m paint motion adobe flash
3      0                               take pantsand jacket
4      0                      cobalt alchemist crimson oems
5      0                                           guy fuck
6      0  seriously neither option ballot option win one...
7      1                                   wouldnt happened
8      0            german car french wine imagine response
9      1  used build house couldwould take month complet...


In [3]:
print(data.shape)

(49498, 2)


In [4]:
print("Original dataset shape:", data.shape)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
print("X_train_tfidf shape:", X_train_tfidf.shape)
print("X_test_tfidf shape:", X_test_tfidf.shape)

Original dataset shape: (49498, 2)
X_train shape: (39598,)
y_train shape: (39598,)
X_test shape: (9900,)
y_test shape: (9900,)
X_train_tfidf shape: (39598, 10000)
X_test_tfidf shape: (9900, 10000)


In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the cleaned and shuffled dataset
file_path = 'balanced_dataset_50000_cleaned_shuffled.csv'
data = pd.read_csv(file_path)

# Subsample the dataset for quicker experimentation (e.g., using 10,000 samples)
data_sampled = data.sample(n=10000, random_state=42)

# Prepare features (X) and labels (y)
X = data_sampled['comment']  # Assuming 'comment' column contains text data
y = data_sampled['label']

# Convert text data to TF-IDF features with limited number of features
vectorizer = TfidfVectorizer(max_features=5000)  # Limiting to 5000 features
X_tfidf = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Initialize the Support Vector Classifier with a linear kernel
svc = SVC(kernel='linear')

# Train the model
svc.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = svc.predict(X_train)

# Make predictions on the testing set
y_test_pred = svc.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_report = classification_report(y_test, y_test_pred)

print(f'Training Accuracy: {train_accuracy}')
print(f'Testing Accuracy: {test_accuracy}')
print(f'Classification Report:\n{test_report}')

Training Accuracy: 0.812625
Testing Accuracy: 0.622
Classification Report:
              precision    recall  f1-score   support

           0       0.61      0.68      0.64       988
           1       0.64      0.57      0.60      1012

    accuracy                           0.62      2000
   macro avg       0.62      0.62      0.62      2000
weighted avg       0.62      0.62      0.62      2000
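
One thing to note about the cell above: the TF-IDF vocabulary and IDF weights are fit on all sampled comments before the split, so the test rows influence the features. A minimal sketch of a variant that avoids this is to split the raw text first and let a `Pipeline` fit the vectorizer on the training rows only. The small DataFrame here is a hypothetical stand-in for the CSV, and its scores are not comparable to the results above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical stand-in for the cleaned dataset (same column names)
data = pd.DataFrame({
    "comment": [
        "oh great another monday meeting",
        "yeah right that will totally work",
        "wow just love being stuck in traffic",
        "sure because that went so well last time",
        "the meeting is scheduled for monday",
        "traffic was light this morning",
        "the patch fixed the login bug",
        "dinner was good and the service was quick",
    ],
    "label": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Split the raw text first, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    data["comment"], data["label"],
    test_size=0.25, random_state=42, stratify=data["label"],
)

# The vectorizer is fit inside the pipeline, on training text only
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("svc", SVC(kernel="linear")),
])
pipe.fit(X_train, y_train)

print(f"Training accuracy: {pipe.score(X_train, y_train):.3f}")
print(f"Testing accuracy:  {pipe.score(X_test, y_test):.3f}")
```

With this layout, `pipe.predict` accepts raw strings directly, which also removes the need to keep a separate vectorizer object in sync with the model.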



In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# Load the cleaned and shuffled dataset
file_path = 'balanced_dataset_50000_cleaned_shuffled.csv'
data = pd.read_csv(file_path)

# Subsample the dataset for quicker experimentation (e.g., using 10,000 samples)
data_sampled = data.sample(n=10000, random_state=42)

# Prepare features (X) and labels (y)
X = data_sampled['comment']  # Assuming 'comment' column contains text data
y = data_sampled['label']

# Convert text data to TF-IDF features with limited number of features
vectorizer = TfidfVectorizer(max_features=5000)  # Limiting to 5000 features
X_tfidf = vectorizer.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Initialize the Support Vector Classifier (the kernel is chosen by the grid search below)
svc = SVC()

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [3, 4, 5],  # Only applicable for 'poly' kernel
    'gamma': ['scale', 'auto']  # Only applicable for 'rbf', 'poly', and 'sigmoid'
}

# Initialize GridSearchCV
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', verbose=2, n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best estimator
best_svc = grid_search.best_estimator_

# best_estimator_ is already refit on the full training set
# (GridSearchCV uses refit=True by default), so no extra fit call is needed

# Make predictions on the training set
y_train_pred = best_svc.predict(X_train)

# Make predictions on the testing set
y_test_pred = best_svc.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_report = classification_report(y_test, y_test_pred)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Training Accuracy: {train_accuracy}')
print(f'Testing Accuracy: {test_accuracy}')
print(f'Classification Report:\n{test_report}')


Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best Parameters: {'C': 1, 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf'}
Training Accuracy: 0.94675
Testing Accuracy: 0.634
Classification Report:
              precision    recall  f1-score   support

           0       0.62      0.69      0.65       988
           1       0.66      0.58      0.62      1012

    accuracy                           0.63      2000
   macro avg       0.64      0.63      0.63      2000
weighted avg       0.64      0.63      0.63      2000
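
The search above fits 96 candidates, but many of them are equivalent: `degree` is ignored by every kernel except `poly`, and `gamma` is ignored by the `linear` kernel. `GridSearchCV` also accepts a list of dictionaries, which lets each kernel carry only the parameters it uses. The sketch below reuses the same values and only counts the candidates with `ParameterGrid`; the name `per_kernel_grid` is an illustrative choice.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
gamma_values = ['scale', 'auto']

# Single dictionary, as in the cell above: every combination is tried
full_grid = {
    'C': C_values,
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [3, 4, 5],
    'gamma': gamma_values,
}

# One dictionary per kernel family, listing only the parameters that kernel uses
per_kernel_grid = [
    {'kernel': ['linear'], 'C': C_values},
    {'kernel': ['rbf', 'sigmoid'], 'C': C_values, 'gamma': gamma_values},
    {'kernel': ['poly'], 'C': C_values, 'gamma': gamma_values, 'degree': [3, 4, 5]},
]

n_full = len(ParameterGrid(full_grid))
n_per_kernel = len(ParameterGrid(per_kernel_grid))
print(f"Single dictionary: {n_full} candidates")       # 96
print(f"Per-kernel list:   {n_per_kernel} candidates")  # 44
```

Passing `per_kernel_grid` as `param_grid` to `GridSearchCV` would cover the same distinct models with fewer than half the fits.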



In [8]:
# Get the best model
best_model = grid_search.best_estimator_

In [9]:
best_model = grid_search.best_estimator_
train_predictions = best_model.predict(X_train)
test_predictions = best_model.predict(X_test)

In [10]:
print("Train Accuracy: {:.2f}%".format(accuracy_score(y_train, train_predictions) * 100))
print("Test Accuracy: {:.2f}%".format(accuracy_score(y_test, test_predictions) * 100))
print("Test Precision: {:.2f}%".format(precision_score(y_test, test_predictions) * 100))

Train Accuracy: 94.67%
Test Accuracy: 63.40%
Test Precision: 65.66%


In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from scipy.sparse import hstack

# Load the dataset
file_path = 'balanced_dataset_50000_cleaned_shuffled.csv'
data = pd.read_csv(file_path)

# Preprocess the dataset
X = data['comment']
y = data['label']

# Convert text data to TF-IDF features with limited number of features
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Convert text data to count features with limited number of features
count_vectorizer = CountVectorizer(max_features=5000)
X_count = count_vectorizer.fit_transform(X)

# Combine the TF-IDF and count features
X_features = hstack([X_tfidf, X_count])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.2, random_state=42)

# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Make predictions on the testing set
y_test_pred = rf.predict(X_test)

# Evaluate the model
train_accuracy = accuracy_score(y_train, rf.predict(X_train))
test_accuracy = accuracy_score(y_test, y_test_pred)
test_report = classification_report(y_test, y_test_pred)

print(f'Training Accuracy: {train_accuracy}')
print(f'Testing Accuracy: {test_accuracy}')
print(f'Classification Report:\n{test_report}')

Training Accuracy: 0.9644426486186171
Testing Accuracy: 0.6441414141414141
Classification Report:
              precision    recall  f1-score   support

           0       0.65      0.65      0.65      4975
           1       0.64      0.64      0.64      4925

    accuracy                           0.64      9900
   macro avg       0.64      0.64      0.64      9900
weighted avg       0.64      0.64      0.64      9900



In [22]:
import pickle

# Save the trained model
filename = 'trained_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(best_model, f)

# Load the saved model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Input text
input_text = "I'm thrilled to spend my weekend working."

# Vectorize the input text. Note: `vectorizer` was re-fit in the Random Forest cell above,
# so its vocabulary may differ from the one `best_model` was trained with
input_tfidf = vectorizer.transform([input_text])

# Predict the label (1 = sarcastic, 0 = non-sarcastic)
prediction = loaded_model.predict(input_tfidf)[0]

# Print the prediction
print("Sentiment:", prediction)

Sentiment: 1


In [23]:
import pickle

# Save the trained model
filename = 'trained_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(best_model, f)

# Load the saved model
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Input text (a preprocessed comment that is labeled 1 in the dataset preview)
input_text = "wouldnt happened"

# Vectorize the input text
input_tfidf = vectorizer.transform([input_text])

# Predict the label (1 = sarcastic, 0 = non-sarcastic)
prediction = loaded_model.predict(input_tfidf)[0]

# Print the prediction
print("Sentiment:", prediction)

Sentiment: 0


In [28]:
import pickle

# Load the saved model
filename = 'trained_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Input texts
input_texts = [
    "I just love waiting in long lines. It's my favorite pastime.",
    "Wow, another rainy day. Just what I needed.",
    "I'm so excited for Monday morning meetings.",
    "The food here is amazing, said no one ever.",
    "I'm thrilled to spend my weekend working.",
    "What a beautiful day to stay indoors.",
    "I can't wait to do my taxes. It's so much fun.",
    "Great, my phone battery died in the middle of nowhere.",
    "I absolutely love getting stuck in traffic for hours.",
    "Thank you for the wonderful gift. It’s exactly what I didn't want."
]

# Preprocess the input texts
input_tfidf = vectorizer.transform(input_texts)

# Predict the labels (1 = sarcastic, 0 = non-sarcastic)
predictions = loaded_model.predict(input_tfidf)

# Print the predictions
for i, prediction in enumerate(predictions):
    if prediction == 0:
        print(f"Comment {i}: Non-sarcastic")
    else:
        print(f"Comment {i}: Sarcastic")



Comment 0: Non-sarcastic
Comment 1: Non-sarcastic
Comment 2: Non-sarcastic
Comment 3: Sarcastic
Comment 4: Sarcastic
Comment 5: Non-sarcastic
Comment 6: Non-sarcastic
Comment 7: Sarcastic
Comment 8: Non-sarcastic
Comment 9: Non-sarcastic


In [29]:
import pickle

# Load the saved model
filename = 'trained_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

# Input texts
input_texts = [
    "The weather looks great today. Perfect for a picnic!",
    "Wow, another rainy day. Just what I needed.",
    "Thank you for considering my application. I look forward to hearing from you.",
    "The food here is amazing, said no one ever.",
    "I'm thrilled to spend my weekend working.",
    "I appreciate your feedback on the project. Let's discuss it further.",
    "I can't wait to do my taxes. It's so much fun.",
    "Great, my phone battery died in the middle of nowhere.",
    "I absolutely love getting stuck in traffic for hours.",
    "I enjoyed our conversation yesterday. Let's catch up again soon."
]

# Preprocess the input texts
input_tfidf = vectorizer.transform(input_texts)

# Predict the labels (1 = sarcastic, 0 = non-sarcastic)
predictions = loaded_model.predict(input_tfidf)

# Print the predictions
for i, prediction in enumerate(predictions):
    if prediction == 0:
        print(f"Comment {i}: Non-sarcastic")
    else:
        print(f"Comment {i}: Sarcastic")



Comment 0: Sarcastic
Comment 1: Non-sarcastic
Comment 2: Non-sarcastic
Comment 3: Sarcastic
Comment 4: Sarcastic
Comment 5: Sarcastic
Comment 6: Non-sarcastic
Comment 7: Sarcastic
Comment 8: Non-sarcastic
Comment 9: Sarcastic
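
Because the predictions depend on the exact vectorizer the classifier was trained with, one option is to wrap both in a single `Pipeline` and pickle that object, so the two cannot drift apart between cells or sessions. This is a sketch under assumptions: the training comments are hypothetical toy data, and the file name `sarcasm_pipeline.pkl` is a placeholder.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy comments; 1 = sarcastic, 0 = non-sarcastic
texts = [
    "oh great another monday meeting",
    "yeah right that will totally work",
    "wow i just love being stuck in traffic",
    "sure because that went so well last time",
    "the meeting is scheduled for monday",
    "traffic was light this morning",
    "the patch fixed the login bug",
    "dinner was good and the service was quick",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are trained and stored as one object
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipeline.fit(texts, labels)

new_comments = [
    "great my phone battery died again",
    "the report is due on friday",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sarcasm_pipeline.pkl")  # placeholder file name

    # Save and reload the whole pipeline, not just the classifier
    with open(path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The loaded pipeline takes raw strings; no separate transform step is needed
predictions = loaded.predict(new_comments)
for comment, label in zip(new_comments, predictions):
    print(f"{comment!r}: {'Sarcastic' if label == 1 else 'Non-sarcastic'}")
```

Pickled `scikit-learn` objects are generally only safe to load with the same library version that wrote them, and pickle files from untrusted sources should not be loaded.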


The following points explain why a deep learning (DL) model may be preferred over a Support Vector Classifier (SVC) for sarcasm detection:

1. **Automatic Feature Learning:** DL models automatically learn relevant features from raw text data, eliminating the need for manual feature engineering required by SVCs.

2. **Handling Non-linear Relationships:** DL models can capture complex, non-linear relationships in language, which are crucial for understanding nuanced sarcasm expressions. An SVC with a non-linear kernel can also learn non-linear decision boundaries, but only over the fixed bag-of-words features it is given, so it has no access to word order or composition.

3. **Contextual Understanding:** DL models, especially those with attention mechanisms, excel at capturing contextual dependencies across sentences, enhancing their ability to recognize sarcasm based on broader linguistic context.

4. **Scalability with Large Datasets:** DL models scale effectively with large datasets, which matters for sarcasm detection because it involves diverse and subtle forms of expression. Training a kernel SVC becomes slow as the number of samples grows, so it benefits less from very large corpora.

5. **Utilization of Pretrained Models:** DL models can leverage pretrained language models like BERT or GPT, which encode extensive linguistic knowledge. Fine-tuning these models for sarcasm detection improves accuracy and robustness compared to SVCs.

6. **Adaptability to Domain-Specific Contexts:** DL models can be fine-tuned on domain-specific datasets, allowing them to better understand and predict sarcasm within specific contexts, whereas SVCs may struggle to generalize beyond initial training domains.

7. **Continuous Learning:** DL models can be updated with new data continuously, improving over time as language use and sarcasm expressions evolve. SVCs typically require retraining from scratch with updated datasets, which is more resource-intensive.

In summary, DL models offer advantages in automatic feature learning, handling complex relationships, contextual understanding, scalability with large datasets, utilization of pretrained models, adaptability to domains, and continuous learning, making them more effective than SVCs for sarcasm detection tasks.