# Imbalance in the dataset

In a dataset, class imbalance refers to a situation where the number of instances of one class greatly outnumbers the instances of another class (or classes). This can lead to biased models that perform poorly on the minority class.
The class counts reveal the imbalance itself. Its effect on a model shows up in evaluation metrics such as the confusion matrix and per-class recall, whereas overall accuracy alone can hide it.

If your dataset is imbalanced, there are several methods to address this issue:

## 1) Resampling
Either oversample the minority class (add more instances of the minority class) or undersample the majority class (remove instances of the majority class).
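As a rough illustration of both directions of resampling, the sketch below uses `sklearn.utils.resample` on a small made-up DataFrame. The `toy` frame and its `feature` and `label` columns are assumptions for the example and are not part of this notebook's data.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
toy = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Oversampling: draw minority rows with replacement up to the majority count.
minority_over = resample(minority, replace=True, n_samples=len(majority), random_state=0)
oversampled = pd.concat([majority, minority_over])

# Undersampling: draw majority rows without replacement down to the minority count.
majority_under = resample(majority, replace=False, n_samples=len(minority), random_state=0)
undersampled = pd.concat([majority_under, minority])

print(oversampled["label"].value_counts().to_dict())
print(undersampled["label"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that the test set keeps the original class distribution.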

## 2) Synthetic Minority Over-sampling Technique (SMOTE)
Generate synthetic samples for the minority class to balance the dataset.
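The core idea of SMOTE can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The function `smote_sketch` and its parameters are hypothetical names for this illustration. In practice a tested implementation such as the `SMOTE` class in the `imbalanced-learn` package would likely be preferable.

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X_minority))      # random minority sample
        neighbour = rng.choice(neighbours[base])  # one of its k nearest neighbours
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_sketch(X_min, n_synthetic=4)
print(new_points.shape)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class.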

## 3) Weighted Loss Function
Adjust the loss function to penalize misclassifications of the minority class more than the majority class.
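In scikit-learn, one common way to weight the loss is the `class_weight` parameter that many classifiers accept. The sketch below shows the weights that the `"balanced"` setting produces and passes the same setting to `LogisticRegression`. The arrays `X_toy` and `y_toy` are made-up data for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0 and 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) + y_toy[:, None]

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))

# The same weighting applied inside the model's loss function.
weighted_model = LogisticRegression(class_weight="balanced", max_iter=1000)
weighted_model.fit(X_toy, y_toy)
```

Here each minority-class error costs nine times as much as a majority-class error, which matches the 90:10 class ratio.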

## 4) Different Algorithms
Some algorithms, such as Random Forests, often cope with imbalanced datasets better than others, although none is immune to it. Consider using algorithms that are less sensitive to class imbalance.

In [1]:
import pandas as pd

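# Load the training split; the label column is 'Sarcasm'.
df = pd.read_excel('train_data.xlsx')

def check_imbalance(df, label_column, threshold=0.05):
    """Return True if any class makes up less than `threshold` of the rows."""
    label_counts = df[label_column].value_counts()
    class_proportions = label_counts / label_counts.sum()
    # With threshold=0.05 only severe imbalance is flagged.
    return bool((class_proportions < threshold).any())

label = 'Sarcasm'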
# Check imbalance
is_imbalanced = check_imbalance(df, label)
if is_imbalanced:
    print("The dataset is imbalanced.")
else:
    print("The dataset is balanced.")


The dataset is balanced.


# Performance Metrics

## 1) accuracy_score

In scikit-learn, the accuracy_score function is used to calculate the accuracy of a classification model. Accuracy is defined as the ratio of correctly predicted instances to the total instances in the dataset. It is a simple and intuitive metric for evaluating classification models. 

## Accuracy = Number of Correct Predictions / Total Number of Predictions

A high accuracy score (close to 1) generally indicates good model performance, meaning most predictions are correct.
Accuracy is a reliable metric when the classes are balanced. On an imbalanced dataset, a model that always predicts the majority class can still score highly.
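To see why accuracy depends on class balance, consider a made-up label vector with 95 negatives and 5 positives and a "model" that always predicts the majority class. The arrays below are illustrative assumptions, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced ground truth: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate classifier that always predicts the majority class.
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
minority_recall = recall_score(y_true, y_pred_majority, pos_label=1)

print(f"accuracy={acc:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

The accuracy looks strong even though not a single positive instance is found, which is why per-class metrics are reported alongside it.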

## 2) confusion_matrix

A confusion matrix is a table that describes the performance of a classification model. It provides a more detailed breakdown of correct and incorrect classifications than overall accuracy.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

- True Positives (TP): Positive instances correctly predicted as positive.
- True Negatives (TN): Negative instances correctly predicted as negative.
- False Positives (FP): Negative instances incorrectly predicted as positive (Type I error).
- False Negatives (FN): Positive instances incorrectly predicted as negative (Type II error).

A good model will have high true positive and true negative values and low false positive and false negative values.
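One detail worth checking when reading scikit-learn output: `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, with labels in sorted order. For 0/1 labels this gives `[[TN, FP], [FN, TP]]`. The short lists below are made-up labels used only to show the layout.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]

# Default layout: rows = actual, columns = predicted, labels sorted as [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)          # [[4 1]
                   #  [2 1]]
print(tn, fp, fn, tp)  # 4 1 2 1

# Passing labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]].
cm_positive_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_positive_first)  # [[1 2]
                          #  [1 4]]
```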

## 3) classification_report

The classification_report function in sklearn provides a detailed summary of the performance of a classification algorithm. It includes key metrics such as precision, recall, f1-score, and support for each class. 

Precision: The ratio of correctly predicted positive observations to the total predicted positives. Precision is a measure of how accurate the model is in identifying relevant instances.

## Precision = True Positives/(True Positives + False Positives)


Recall (Sensitivity or True Positive Rate): The ratio of correctly predicted positive observations to all observations in the actual class. Recall measures how well the model captures all relevant instances.

## Recall = True Positives/(True Positives + False Negatives)
 
 
F1-Score: The harmonic mean of precision and recall. The F1-score is especially useful when you need to balance precision and recall, because it is high only when both are high.

## F1-Score = 2 × (Precision × Recall)/(Precision + Recall)

Support: The number of actual occurrences of the class in the dataset. This helps understand the class distribution in the dataset.
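As a worked example, the metrics for class 1 can be recomputed by hand from the logistic regression confusion matrix printed later in this notebook, `[[488 116], [204 492]]`, using F1 as the harmonic mean 2PR/(P+R). The variable names are chosen for this illustration.

```python
# Counts for class 1, read from the logistic regression confusion matrix
# [[488 116], [204 492]] (rows = actual, columns = predicted).
tn, fp, fn, tp = 488, 116, 204, 492

precision = tp / (tp + fp)   # of everything predicted as class 1, how much was right
recall = tp / (tp + fn)      # of all actual class 1 instances, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn            # number of actual class 1 instances

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
# prints: precision=0.81 recall=0.71 f1=0.75 support=696
```

These values agree with the class 1 row of the logistic regression classification report.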

# ML models training

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [3]:
train = pd.read_excel('train_data.xlsx')

X_train = train['idf_vector']
y_train = train['Sarcasm']

In [4]:
test = pd.read_excel('test_data.xlsx')

X_test = test['idf_vector']
y_test = test['Sarcasm']

In [7]:
X_train = np.array(X_train.apply(lambda x: np.fromstring(x[1:-1], sep=' ')).tolist())
X_test = np.array(X_test.apply(lambda x: np.fromstring(x[1:-1], sep=' ')).tolist())

In [8]:
train

Unnamed: 0,train_data,Sarcasm,cbow_vectors,skip_vectors,idf_vector
0,"['Ive', 'already', 'see', 'spinoffs', 'cartoon...",0,[ 1.33140460e-02 -4.40596461e-01 -2.24290624e-...,[-0.08790181 0.13337336 0.20003186 0.057840...,[0.35064284 0. 0.70874616 0.35064284 0...
1,"['probably', 'one', 'bad', 'movie', 'ever', 'm...",1,[-7.44502991e-03 -4.45567846e-01 -2.15648413e-...,[-0.09143309 0.1359427 0.20129208 0.058086...,[0.36828071 0. 0.74407735 0.36828071 0...
2,"['Paint', 'number', 'story', 'mediocre', 'act'...",0,[-0.03100936 -0.44244248 -0.228988 -0.594159...,[-0.08539815 0.13671964 0.1906442 0.051597...,[0.33881107 0.00515968 0.68165561 0.33881107 0...
3,"['first', 'murder', 'scene', 'one', 'best', 'm...",0,[-0.00549362 -0.3918013 -0.20509245 -0.564492...,[-0.0883963 0.1333404 0.1942192 0.059124...,[0.3352069 0. 0.68956848 0.3352069 0...
4,"['Bravo', 'another', 'movie', 'hero', 'deep', ...",1,[-0.04189784 -0.3942514 -0.22329089 -0.573215...,[-0.08794053 0.14131702 0.19568387 0.059249...,[0.32067146 0. 0.7215108 0.32067146 0...
...,...,...,...,...,...
5192,"['love', 'movie', 'manage', 'suck', 'joy', 'li...",1,[-0.01969795 -0.46055502 -0.23612061 -0.589896...,[-9.14753377e-02 1.41389504e-01 1.97079882e-...,[0.33714827 0. 0.73559621 0.33714827 0...
5193,"['Yet', 'another', 'adventure', 'movie', 'prot...",1,[-0.0546301 -0.41608295 -0.22733516 -0.576408...,[-0.07994005 0.13224454 0.19289885 0.049859...,[0.30734299 0. 0.68298441 0.30734299 0...
5194,"['Yet', 'another', 'forgettable', 'action', 'f...",1,[-0.03428297 -0.43000963 -0.23234665 -0.575691...,[-8.8605739e-02 1.3400672e-01 1.9444640e-01 ...,[0.29527348 0. 0.68897146 0.29527348 0...
5195,"['Id', 'rather', 'stick', 'elevator', 'mime', ...",1,[-0.0534298 -0.39407605 -0.23116787 -0.555099...,[-9.07627866e-02 1.49938092e-01 1.94540560e-...,[0.31039716 0. 0.74495319 0.31039716 0...


In [9]:
def test_single_input(model, input_vector):
    prediction = model.predict([input_vector])
    return prediction[0]

# Logistic Regression
Logistic regression is a statistical model used for binary classification tasks, where the goal is to predict the probability that an instance belongs to a particular class.

Logistic regression computes a weighted sum of the input features plus an intercept and applies the sigmoid function to convert this sum into a probability. Training minimizes the logistic loss (binary cross-entropy) between the predicted probabilities and the actual labels. In `sklearn`, `LogisticRegression` uses the `lbfgs` solver with an L2 penalty by default rather than plain gradient descent, and `max_iter` caps the number of solver iterations. For binary problems, `predict` returns the class whose predicted probability is higher, which is equivalent to a threshold of 0.5.
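The mechanics (weighted sum, sigmoid, cross-entropy gradient, threshold) can be sketched with plain batch gradient descent in NumPy. This is an illustrative simplification and not how `sklearn` fits the model: it has no regularization and uses a fixed learning rate. The names `fit_logistic_gd`, `predict_labels`, `X_toy`, and `y_toy` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)             # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples  # gradient of mean cross-entropy w.r.t. w
        grad_b = np.mean(p - y)             # gradient w.r.t. the intercept
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_labels(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Hypothetical, well-separated two-class data.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w, b = fit_logistic_gd(X_toy, y_toy)
train_acc = np.mean(predict_labels(X_toy, w, b) == y_toy)
print(f"training accuracy: {train_acc:.2f}")
```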

In [10]:
# Logistic Regression
def logistic_regression(X_train, X_test, y_train, y_test):
    lr_model = LogisticRegression(max_iter=1000)
    lr_model.fit(X_train, y_train)
    lr_preds = lr_model.predict(X_test)
    
    accuracy = accuracy_score(y_test, lr_preds)
    conf_matrix = confusion_matrix(y_test, lr_preds)
    class_report = classification_report(y_test,lr_preds)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)
    
    return lr_model

reg_model=logistic_regression(X_train,X_test, y_train, y_test)


Evaluation for the given vectors:

Accuracy: 0.75
Confusion Matrix:
[[488 116]
 [204 492]]
Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.81      0.75       604
           1       0.81      0.71      0.75       696

    accuracy                           0.75      1300
   macro avg       0.76      0.76      0.75      1300
weighted avg       0.76      0.75      0.75      1300



# Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions between the features. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even though these assumptions are often violated in real-world data, Naive Bayes classifiers have been found to perform well in practice, especially for text classification tasks. They are simple, fast, and require a small amount of training data.

Naive Bayes in `sklearn` starts by assuming that the features are conditionally independent given the class label. The model calculates the prior probabilities of each class based on the training data. For each feature, it estimates the likelihood of the feature value given each class using the appropriate probability distribution (e.g., Gaussian for continuous data, multinomial for discrete data). During prediction, it combines these likelihoods with the prior probabilities using Bayes' theorem to compute the posterior probabilities for each class. The model then assigns the class label with the highest posterior probability to the input data. Naive Bayes in `sklearn` uses efficient methods to handle these computations, making it suitable for both small and large datasets.
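The prior, likelihood, and posterior steps can be reproduced by hand for the multinomial case and compared against `MultinomialNB`. The count matrix `X_toy`, the labels `y_toy`, and the query `x_new` are made-up values for this sketch.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features for four documents and two classes.
X_toy = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])
classes = np.unique(y_toy)

# Prior: fraction of training documents in each class.
log_prior = np.log(np.array([np.mean(y_toy == c) for c in classes]))

# Likelihood: per-class feature frequencies with Laplace (add-one) smoothing.
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes]) + 1.0
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))

# Posterior (up to a constant): log prior + sum of count * log likelihood.
x_new = np.array([1, 0, 1])
log_posterior = log_prior + x_new @ log_likelihood.T
manual_pred = classes[np.argmax(log_posterior)]

sk_model = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)
sk_pred = sk_model.predict([x_new])[0]
print(manual_pred, sk_pred)  # 0 0
```

`MultinomialNB` expects non-negative, count-like features, which is worth keeping in mind when interpreting its scores on other kinds of vectors.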


In [11]:
# Naive Bayes
def naive_bayes(X_train, X_test, y_train, y_test):
    nb_model = MultinomialNB()
    nb_model.fit(X_train, y_train)
    nb_preds = nb_model.predict(X_test)

    accuracy = accuracy_score(y_test, nb_preds)
    conf_matrix = confusion_matrix(y_test, nb_preds)
    class_report = classification_report(y_test,nb_preds)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)
    return nb_model

bayes_model = naive_bayes( X_train, X_test, y_train, y_test)


Evaluation for the given vectors:

Accuracy: 0.55
Confusion Matrix:
[[ 27 577]
 [ 13 683]]
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.04      0.08       604
           1       0.54      0.98      0.70       696

    accuracy                           0.55      1300
   macro avg       0.61      0.51      0.39      1300
weighted avg       0.60      0.55      0.41      1300



# Decision Tree
A decision tree is a flowchart-like tree structure where an internal node represents a "test" on an attribute (e.g., whether a word appears in a document), a branch represents the outcome of the test, and each leaf node represents a class label (e.g., sarcastic or not sarcastic). The paths from the root to the leaf represent classification rules.

Decision trees are easy to interpret and visualize, making them useful for understanding the data. However, they can be prone to overfitting, especially with complex trees and noisy data.

Decision trees in `sklearn` start by selecting the best feature to split the data based on a criterion such as Gini impurity or information gain. The model recursively partitions the dataset into subsets that contain instances with similar values for the selected feature, creating nodes and branches. This process continues until all leaves are pure (contain only instances of a single class) or a stopping criterion is met (such as maximum depth or minimum samples per leaf). At each node, the model chooses the split that results in the most significant reduction in impurity. During prediction, the model traverses the tree from the root to a leaf node, following the branches based on the feature values of the input data, and assigns the class label of the reached leaf node. `sklearn` implements decision trees efficiently, allowing for fast training and prediction on both small and large datasets.
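The split selection step can be illustrated with a small Gini impurity calculation: for one feature, try each midpoint between consecutive values and keep the threshold with the lowest weighted impurity. The helpers `gini` and `best_threshold` and the toy arrays are hypothetical names for this sketch, which covers a single feature only.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, gini(labels))
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Hypothetical feature that separates the two classes cleanly.
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])

print(gini(labels))  # 0.5
threshold, impurity = best_threshold(feature, labels)
print(threshold, impurity)  # 0.5 0.0
```

A full tree repeats this search over every feature at every node and recurses into the two resulting subsets.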


In [12]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree
def decision_tree(X_train, X_test, y_train, y_test):
    dt_model = DecisionTreeClassifier()
    dt_model.fit(X_train, y_train)
    dt_preds = dt_model.predict(X_test)
    
    accuracy = accuracy_score(y_test, dt_preds)
    conf_matrix = confusion_matrix(y_test, dt_preds)
    class_report = classification_report(y_test, dt_preds)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)

    return dt_model

model_dec=decision_tree(X_train, X_test, y_train, y_test)


Evaluation for the given vectors:

Accuracy: 0.71
Confusion Matrix:
[[426 178]
 [198 498]]
Classification Report:
              precision    recall  f1-score   support

           0       0.68      0.71      0.69       604
           1       0.74      0.72      0.73       696

    accuracy                           0.71      1300
   macro avg       0.71      0.71      0.71      1300
weighted avg       0.71      0.71      0.71      1300



# Support Vector Machine (SVM)
Support Vector Machine is a powerful supervised machine learning algorithm that can be used for classification tasks. SVM finds the hyperplane that best separates the different classes in the feature space. It maximizes the margin, which is the distance between the hyperplane and the nearest data point from either class. SVM can handle high-dimensional data and works well in cases where the number of dimensions is greater than the number of samples.

Support Vector Machines (SVM) in `sklearn` aim to find the optimal hyperplane that separates the data points of different classes with the maximum margin. The model identifies support vectors, which are the data points closest to the hyperplane, and these vectors determine the position and orientation of the hyperplane. The regularization parameter `C` balances maximizing the margin against minimizing classification errors. For data that is not linearly separable, SVM employs kernel functions (such as polynomial or radial basis function) that implicitly map the inputs into a higher-dimensional space where a linear separation is possible; `SVC` uses the RBF kernel by default. For prediction, the model classifies new data points based on which side of the hyperplane they fall on. Training time for `SVC` grows at least quadratically with the number of samples, so it is best suited to small and medium-sized datasets.
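A tiny linear example makes the roles of the support vectors and the hyperplane side concrete. The points in `X_toy` and `y_toy` are made up for this sketch, and a linear kernel is used for readability, whereas the notebook's `SVC()` call uses the default RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable points in two dimensions.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# Only the points nearest the boundary are kept as support vectors.
print(linear_svm.support_vectors_)

# decision_function gives the signed score relative to the hyperplane;
# a positive score maps to the second class in linear_svm.classes_.
scores = linear_svm.decision_function(X_toy)
preds = linear_svm.predict(X_toy)
print(np.sign(scores), preds)
```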


In [13]:
from sklearn.svm import SVC

def svm(X_train, X_test, y_train, y_test):
    svm_model = SVC()
    svm_model.fit(X_train, y_train)
    svm_preds = svm_model.predict(X_test)

    accuracy = accuracy_score(y_test, svm_preds)
    conf_matrix = confusion_matrix(y_test, svm_preds)
    class_report = classification_report(y_test, svm_preds)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)

    return svm_model

# Assuming X_train, y_train, X_test, and y_test are already defined
model_svm = svm(X_train, X_test, y_train, y_test)


Evaluation for the given vectors:

Accuracy: 0.77
Confusion Matrix:
[[504 100]
 [204 492]]
Classification Report:
              precision    recall  f1-score   support

           0       0.71      0.83      0.77       604
           1       0.83      0.71      0.76       696

    accuracy                           0.77      1300
   macro avg       0.77      0.77      0.77      1300
weighted avg       0.78      0.77      0.77      1300



# Random Forest
Random Forest is an ensemble learning method that constructs a multitude of decision trees at training time and combines their predictions to improve generalization and robustness over a single tree. Random forests are less likely to overfit than a single decision tree and are effective for both classification and regression tasks.

Each tree in the forest is trained on a different subset of the training data, created using bootstrapping (random sampling with replacement). Additionally, at each split in a tree, a random subset of features is considered when choosing the best split, which adds further randomness and reduces correlation between the trees. During prediction, the classic formulation outputs the class that receives the majority vote from all the trees; the `sklearn` implementation instead averages the trees' predicted class probabilities and returns the class with the highest average. Aggregating many trees in this way reduces variance and improves overall model performance, and it scales reasonably well to large and high-dimensional datasets.

Each of the methods covered in this notebook has its own strengths and weaknesses, and the choice of which one to use depends on the specific characteristics of your dataset and the nature of the problem you are trying to solve.
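The two sources of randomness, bootstrap samples and random feature subsets, can be sketched by hand with `DecisionTreeClassifier`. This sketch uses hard majority voting for clarity, whereas `RandomForestClassifier` averages predicted probabilities, so it is an illustration of the idea and not a reimplementation. The data and variable names are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical two-class data with four features.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, size=(60, 4)), rng.normal(2, 1, size=(60, 4))])
y_toy = np.array([0] * 60 + [1] * 60)

trees = []
for seed in range(25):
    # Bootstrap: sample row indices with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    # max_features="sqrt" considers a random subset of features at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Majority vote across trees (25 trees, so no ties for binary labels).
votes = np.array([t.predict(X_toy) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
forest_acc = np.mean(forest_pred == y_toy)
print(f"training accuracy of the hand-built forest: {forest_acc:.2f}")
```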


In [15]:
# Random Forest
def random_forest(X_train, X_test, y_train, y_test):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_report = classification_report(y_test, y_pred)

    print(f"Evaluation for the given vectors:\n")
    print(f'Accuracy: {accuracy:.2f}')
    print('Confusion Matrix:')
    print(conf_matrix)
    print('Classification Report:')
    print(class_report)

    return model

model_rf=random_forest(X_train, X_test, y_train, y_test)

Evaluation for the given vectors:

Accuracy: 0.80
Confusion Matrix:
[[509  95]
 [167 529]]
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.84      0.80       604
           1       0.85      0.76      0.80       696

    accuracy                           0.80      1300
   macro avg       0.80      0.80      0.80      1300
weighted avg       0.80      0.80      0.80      1300



# Conclusion

Of the five models trained above, Random Forest achieved the highest test accuracy (0.80), followed by SVM (0.77), Logistic Regression (0.75), Decision Tree (0.71), and Naive Bayes (0.55). Naive Bayes predicted class 1 for almost every test instance, which explains its recall of only 0.04 on class 0.

The Random Forest model correctly predicted 80% of the 1,300 test instances.

The confusion matrix further illustrates the model's effectiveness. The diagonal of the confusion matrix represents correct predictions (true positives and true negatives), while off-diagonal elements indicate errors (false positives and false negatives). Here the diagonal holds 509 true negatives and 529 true positives, against 95 false positives and 167 false negatives.

The classification report shows an F1-score of 0.80 for both classes. Precision and recall are not symmetric, however: class 1 has higher precision (0.85) than recall (0.76), while class 0 has higher recall (0.84) than precision (0.75). In other words, the model misses class 1 instances more often than it wrongly assigns instances to class 1.

Overall, the Random Forest model shows the strongest predictive performance on the held-out test set among the models compared, making it a suitable choice for this classification task.