## AML Assignment 1
### Varun Agrawal
### MDS202251
#### train.ipynb

Import and Data Loading

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

def load_data(train_path, val_path, test_path, mod_df_path):
    train_df = pd.read_csv(train_path)
    val_df = pd.read_csv(val_path)
    test_df = pd.read_csv(test_path)
    df = pd.read_csv(mod_df_path)
    return train_df, val_df, test_df, df


Data Preprocessing

In [2]:
def preprocess_data(train_df, val_df, test_df):
    # Learn the vocabulary from the training text only, so that nothing from
    # the validation or test sets leaks into the features.
    vectorizer = CountVectorizer()
    vectorizer.fit(train_df.X_train)
    X_train = vectorizer.transform(train_df.X_train)
    X_val = vectorizer.transform(val_df.X_val)
    X_test = vectorizer.transform(test_df.X_test)

    # Fit the IDF weights on the training counts, then reuse them for every split.
    tfidf_trans = TfidfTransformer().fit(X_train)
    tfidf_X_train = tfidf_trans.transform(X_train)
    tfidf_X_val = tfidf_trans.transform(X_val)
    tfidf_X_test = tfidf_trans.transform(X_test)

    return tfidf_X_train, tfidf_X_val, tfidf_X_test
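
The two-step vectorization in `preprocess_data` can be checked on a toy corpus. This is a minimal sketch, assuming default settings for every transformer and using made-up messages rather than the assignment data. With defaults, scikit-learn's `TfidfVectorizer` should produce the same matrix as `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy messages, used only for illustration.
train_texts = ["free prize call now", "are we meeting today", "call me when free"]
new_texts = ["free call today", "unseen words only"]

# Two-step route, as in preprocess_data: fit on the training text only.
counts = CountVectorizer().fit(train_texts)
tfidf = TfidfTransformer().fit(counts.transform(train_texts))
two_step = tfidf.transform(counts.transform(new_texts))

# One-step route with default settings.
one_step = TfidfVectorizer().fit(train_texts).transform(new_texts)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("Same TF-IDF matrix:", same)
print("Shape (rows, vocabulary size):", two_step.shape)
```

Words that never appear in the training text (the second toy message) are simply dropped, which is why the vectorizer is fitted on the training split alone.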


Model Training and Hyperparameter Tuning

In [3]:
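def train_model(model, X_train, y_train, hyperparams=None):
    # With a parameter grid, wrap the model in a 5-fold cross-validated grid
    # search over the training data; otherwise fit the model as given.
    if hyperparams:
        model = GridSearchCV(model, hyperparams, cv=5, scoring='accuracy', return_train_score=True)
    model.fit(X_train, y_train)
    return model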
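
Because `train_model` passes `return_train_score=True`, the fitted search object keeps per-candidate training and cross-validation accuracies in `cv_results_`. The sketch below shows one way to read them; it assumes synthetic data from `make_classification` as a stand-in for the TF-IDF features, and `max_iter=1000` is an illustrative choice, not a setting from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features; not the assignment data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X, y)

# One entry per candidate value of C, averaged over the 5 folds.
# "mean_test_score" is the score on the held-out fold of the training data,
# not on a separate test set.
for params, train_acc, cv_acc in zip(
    search.cv_results_["params"],
    search.cv_results_["mean_train_score"],
    search.cv_results_["mean_test_score"],
):
    print(params, f"train={train_acc:.3f}", f"cv={cv_acc:.3f}")

print("Best params:", search.best_params_)
```

A large gap between the training and cross-validation columns for a given `C` would suggest overfitting at that setting.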


Model Evaluation

In [4]:
def evaluate_model(model, X_val, y_val):
    val_predictions = model.predict(X_val)
    accuracy = accuracy_score(y_val, val_predictions)
    report = classification_report(y_val, val_predictions)
    confusion = confusion_matrix(y_val, val_predictions)
    return accuracy, report, confusion


Model Testing

In [5]:
def test_model(model, X_test, y_test):
    test_predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, test_predictions)
    report = classification_report(y_test, test_predictions)
    confusion = confusion_matrix(y_test, test_predictions)
    return accuracy, report, confusion


Loading Data

In [6]:
train_path = './data/train.csv'
val_path = './data/validation.csv'
test_path = './data/test.csv'
mod_df_path = './data/modified_df.csv'
train_df, val_df, test_df, _ = load_data(train_path, val_path, test_path, mod_df_path)
tfidf_X_train, tfidf_X_val, tfidf_X_test = preprocess_data(train_df, val_df, test_df)

Model Initialization and Hyperparameter Definitions

In [7]:
# Initialize models
nb_model = MultinomialNB()
lr_model = LogisticRegression()
svc_model = SVC()
dt_model = DecisionTreeClassifier()

# Hyperparameters for models
nb_hyperparams = {'alpha': [0.01, 0.1, 1.0, 10.0]}
lr_hyperparams = {'C': [0.01, 0.1, 1.0, 10.0]}
svc_hyperparams = {'C': [0.01, 0.1, 1.0, 10.0]}
dt_hyperparams = {'max_depth': [None, 6, 10, 14]}


Model Training

In [8]:
nb_model = train_model(nb_model, tfidf_X_train, train_df.y_train, nb_hyperparams)
lr_model = train_model(lr_model, tfidf_X_train, train_df.y_train, lr_hyperparams)
svc_model = train_model(svc_model, tfidf_X_train, train_df.y_train, svc_hyperparams)
dt_model = train_model(dt_model, tfidf_X_train, train_df.y_train, dt_hyperparams)


Model Evaluation on Validation Set

In [9]:
nb_accuracy, nb_report, nb_confusion = evaluate_model(nb_model, tfidf_X_val, val_df.y_val)
lr_accuracy, lr_report, lr_confusion = evaluate_model(lr_model, tfidf_X_val, val_df.y_val)
svc_accuracy, svc_report, svc_confusion = evaluate_model(svc_model, tfidf_X_val, val_df.y_val)
dt_accuracy, dt_report, dt_confusion = evaluate_model(dt_model, tfidf_X_val, val_df.y_val)

Model Scoring on Test Set

In [10]:
nb_test_accuracy, nb_test_report, nb_test_confusion = test_model(nb_model.best_estimator_, tfidf_X_test, test_df.y_test)
lr_test_accuracy, lr_test_report, lr_test_confusion = test_model(lr_model.best_estimator_, tfidf_X_test, test_df.y_test)
svc_test_accuracy, svc_test_report, svc_test_confusion = test_model(svc_model.best_estimator_, tfidf_X_test, test_df.y_test)
dt_test_accuracy, dt_test_report, dt_test_confusion = test_model(dt_model.best_estimator_, tfidf_X_test, test_df.y_test)

Printing Results

In [11]:
# Print best hyperparameters for each model
print("Best Hyperparameters:")
print("Naive Bayes:", nb_model.best_params_)
print("Logistic Regression:", lr_model.best_params_)
print("Support Vector Machine:", svc_model.best_params_)
print("Decision Tree:", dt_model.best_params_)


Best Hyperparameters:
Naive Bayes: {'alpha': 0.01}
Logistic Regression: {'C': 10.0}
Support Vector Machine: {'C': 10.0}
Decision Tree: {'max_depth': None}


In [12]:
print("Naive Bayes Model:")
print("Validation Accuracy:", nb_accuracy)
print("Test Accuracy:", nb_test_accuracy)
print("Classification Report:")
print(nb_report)
print("Confusion Matrix:")
print(nb_confusion)

print("\nLogistic Regression Model:")
print("Validation Accuracy:", lr_accuracy)
print("Test Accuracy:", lr_test_accuracy)
print("Classification Report:")
print(lr_report)
print("Confusion Matrix:")
print(lr_confusion)

print("\nSupport Vector Machine Model:")
print("Validation Accuracy:", svc_accuracy)
print("Test Accuracy:", svc_test_accuracy)
print("Classification Report:")
print(svc_report)
print("Confusion Matrix:")
print(svc_confusion)

print("\nDecision Tree Model:")
print("Validation Accuracy:", dt_accuracy)
print("Test Accuracy:", dt_test_accuracy)
print("Classification Report:")
print(dt_report)
print("Confusion Matrix:")
print(dt_confusion)


Naive Bayes Model:
Validation Accuracy: 0.9922480620155039
Test Accuracy: 0.9808027923211169
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       411
           1       1.00      0.96      0.98       105

    accuracy                           0.99       516
   macro avg       1.00      0.98      0.99       516
weighted avg       0.99      0.99      0.99       516

Confusion Matrix:
[[411   0]
 [  4 101]]

Logistic Regression Model:
Validation Accuracy: 0.9941860465116279
Test Accuracy: 0.9930191972076788
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       411
           1       1.00      0.97      0.99       105

    accuracy                           0.99       516
   macro avg       1.00      0.99      0.99       516
weighted avg       0.99      0.99      0.99       516

Confusion Matrix:
[[411   0]
 [  3 102]]

Support Vector Machine

---

Summary of what the notebook above does:
1. **Fit a model on train data:** The code fits multiple models (Naive Bayes, Logistic Regression, Support Vector Machine, and Decision Tree) on the training data.

2. **Score a model on given data:** The code scores the models on the validation set and the test set.

3. **Evaluate the model predictions:** The code evaluates the model predictions by calculating accuracy, F1-score, and confusion matrix for both validation and test sets.

4. **Validate the model:** Each tuned model is evaluated on the held-out validation set with `evaluate_model`. The hyperparameters themselves are selected beforehand by GridSearchCV, using 5-fold cross-validation on the training data.

5. **Fit on train:** The models are trained on the training data.

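6. **Score on train and validation:** Training accuracy is recorded for each cross-validation fold by GridSearchCV (`return_train_score=True`, stored in `cv_results_`). Accuracy, the classification report, and the confusion matrix are computed on the validation set.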

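7. **Evaluate on train and validation:** Each model's performance is evaluated on the validation set with `evaluate_model`. Training performance is available only through the cross-validation train scores stored by GridSearchCV; it is not printed in this notebook.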

8. **Fine-tune using train and validation (if necessary):** Hyperparameter tuning is performed with GridSearchCV, which runs 5-fold cross-validation on the training data only. The separate validation set is not used during tuning; it is used afterwards to compare the tuned models.

9. **Score at least 3 benchmark models on test data and select the best one:** The code scores all models (Naive Bayes, Logistic Regression, Support Vector Machine, and Decision Tree) on the test set and selects the best-performing model based on test set performance.

The best-performing model in terms of accuracy was Logistic Regression with C = 10, which also had the highest F1 score. From prepare.ipynb we know that about 75% of the messages in the original data are ham (0) and 25% are spam (1), so the classes are imbalanced. For that reason I am trying to maximize the F1 score while also keeping an eye on accuracy.
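
The F1 comparison can be reproduced by hand from the validation confusion matrices printed above. The sketch below is a small illustration, not part of the notebook: `binary_metrics` is a hypothetical helper, and it assumes scikit-learn's layout of rows as true labels and columns as predicted labels, with spam (1) as the positive class. Only Naive Bayes and Logistic Regression are included, because those are the only two matrices shown in the output.

```python
def binary_metrics(confusion):
    """Accuracy, precision, recall and F1 for class 1 from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = confusion
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Validation confusion matrices copied from the printed output above.
validation_confusions = {
    "Naive Bayes": [[411, 0], [4, 101]],
    "Logistic Regression": [[411, 0], [3, 102]],
}

results = {name: binary_metrics(cm) for name, cm in validation_confusions.items()}
for name, (acc, prec, rec, f1) in results.items():
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} "
          f"recall={rec:.2f} f1={f1:.4f}")
```

The two models differ by a single spam message on the validation set (4 missed versus 3), which is where the small gap in recall and F1 comes from.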
