<a href="https://colab.research.google.com/github/s34836/EWD/blob/main/lab11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab - Classification 3

## Tasks
1. The `spam.csv` dataset contains examples of `spam` and `ham` email messages. Use the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to generate word frequency vectors for each message. Fit and compare several classification models, such as `MultinomialNB`, `KNeighborsClassifier`, `LogisticRegression`, etc.



In [26]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
)

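df = pd.read_csv('spam.csv')

# Build the document-term matrix: one row per message, one column per token
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Message'])
# Dense copy for inspection only; the models below are trained on the sparse matrix X
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())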

In [27]:
# Convert labels to binary format
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

# Get target variable
y = df['Category']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifiers
models = {
    'Multinomial NB': MultinomialNB(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(kernel='linear'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"\nTraining {name}...")

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1
    }

    # Print results
    print(f"{name} results:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

# Create a DataFrame with the results for comparison
results_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(results_df)



# Find the best model based on F1 score
best_model_name = results_df['f1_score'].idxmax()
print(f"\nBest model based on F1 score: {best_model_name}")

# Get the confusion matrix for the best model
best_model = models[best_model_name]
y_pred_best = best_model.predict(X_test)
conf_matrix = confusion_matrix(y_test, y_pred_best)


# Print detailed classification report for the best model
print("\nDetailed Classification Report for Best Model:")
print(classification_report(y_test, y_pred_best, target_names=['ham', 'spam']))


Training Multinomial NB...
Multinomial NB results:
Accuracy: 0.9857
Precision: 0.9404
Recall: 0.9530
F1 Score: 0.9467

Training KNN...
KNN results:
Accuracy: 0.9256
Precision: 1.0000
Recall: 0.4430
F1 Score: 0.6140

Training Logistic Regression...
Logistic Regression results:
Accuracy: 0.9857
Precision: 0.9926
Recall: 0.8993
F1 Score: 0.9437

Training SVM...
SVM results:
Accuracy: 0.9883
Precision: 1.0000
Recall: 0.9128
F1 Score: 0.9544

Training Random Forest...
Random Forest results:
Accuracy: 0.9785
Precision: 1.0000
Recall: 0.8389
F1 Score: 0.9124

Model Comparison:
                     accuracy  precision    recall  f1_score
Multinomial NB       0.985650   0.940397  0.953020  0.946667
KNN                  0.925561   1.000000  0.442953  0.613953
Logistic Regression  0.985650   0.992593  0.899329  0.943662
SVM                  0.988341   1.000000  0.912752  0.954386
Random Forest        0.978475   1.000000  0.838926  0.912409

Best model based on F1 score: SVM

Detailed Classificat
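
The F1 column in the comparison table is the harmonic mean of precision and recall, which explains why KNN ranks last despite perfect precision: its recall of about 0.44 pulls the score down. The sketch below recomputes F1 from the precision and recall values reported above; the helper name `f1_from` is made up for this illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the comparison table above
reported = {
    "Multinomial NB": (0.940397, 0.953020),
    "KNN": (1.000000, 0.442953),
    "SVM": (1.000000, 0.912752),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from(p, r):.4f}")
```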

2. The `IMDB` dataset contains positive and negative movie reviews. Vectorize the reviews, then use the [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function to compare the:
    
    - accuracy and f1-score,
    - fit time,
    - score time

    of selected classification models including `MultinomialNB`, `KNeighborsClassifier` and `LogisticRegression`.
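
A minimal sketch of what `cross_validate` returns, assuming synthetic stand-in data from `make_classification` rather than the IMDB reviews (the names `X_demo`, `y_demo` and `cv_demo` are illustrative). With a list of scorers, the result is a dictionary holding one array per metric plus the per-fold `fit_time` and `score_time`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (hypothetical; not the IMDB reviews)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

cv_demo = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=5,                          # 5 folds, so each array has 5 entries
    scoring=["accuracy", "f1"],    # produces test_accuracy and test_f1
)

# One value per fold for each key
for key in sorted(cv_demo):
    print(key, cv_demo[key].round(4))
```

Averaging each array, as done in the solution below, gives a single number per model for the comparison.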

In [40]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('IMDB Dataset.csv')
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [41]:
X = df['review']
y = df['sentiment']

# Vectorize the reviews
# Keep only the 100 most frequent terms to limit runtime and memory;
# this discards most of the vocabulary
vectorizer = CountVectorizer(max_features=100)
X_vectorized = vectorizer.fit_transform(X)
X_vectorized_df = pd.DataFrame(X_vectorized.toarray(), columns=vectorizer.get_feature_names_out())
X_vectorized_df.head()

   about  after  all  also  an  and  any  are  as  at  ...  well  were  what  when  which  who  will  with  would  you
0      1      1    1     0   1    6    0    2   4   0  ...     1     0     2     0      1    2     0     5      1    3
1      1      0    2     0   0    7    0    2   0   0  ...     3     0     0     0      1    0     0     3      0    1
2      0      0    0     0   0    4    0    1   0   1  ...     1     0     0     1      0    0     0     2      0    0
3      0      0    3     0   0    4    0    2   2   0  ...     1     0     0     1      1    0     0     3      0    2
4      2      0    2     0   0    5    0    1   1   0  ...     0     0     1     0      1    0     0     1      0    0


In [43]:


# Define the models to compare
models = {
    'MultinomialNB': MultinomialNB(),
    'KNN': KNeighborsClassifier(n_neighbors=2),
    'LogisticRegression': LogisticRegression(max_iter=100, C=1.0)
}

y = df['sentiment'].map({'positive': 1, 'negative': 0})
scoring = ['accuracy', 'f1']

# Dictionary to store results
results = {}

# Perform cross-validation for each model
for name, model in models.items():
    print(f"Cross-validating {name}...")

    # Use cross_validate to get multiple metrics
    cv_results = cross_validate(
        model,
        X_vectorized,
        y,
        cv=5,  # 5-fold cross-validation
        scoring=scoring,
        return_train_score=False,
        n_jobs=-1,  # Use all available cores
        return_estimator=False,
        verbose=0,
        error_score='raise'
    )

    # Store results
    results[name] = {
        'accuracy': cv_results['test_accuracy'].mean(),
        'f1_score': cv_results['test_f1'].mean(),
        'fit_time': cv_results['fit_time'].mean(),
        'score_time': cv_results['score_time'].mean()
    }

# Convert to DataFrame for easier comparison
results_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(results_df)




# Identify best model
best_accuracy_model = results_df['accuracy'].idxmax()
best_f1_model = results_df['f1_score'].idxmax()
results_df['total_time'] = results_df['fit_time'] + results_df['score_time']
fastest_model = results_df['total_time'].idxmin()

print(f"\nBest model by accuracy: {best_accuracy_model} ({results_df.loc[best_accuracy_model, 'accuracy']:.4f})")
print(f"Best model by F1 score: {best_f1_model} ({results_df.loc[best_f1_model, 'f1_score']:.4f})")
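# Note: the fastest model is selected by total (fit + score) time,
# but the value printed here is its mean fit time only
print(f"Fastest model: {fastest_model} ({results_df.loc[fastest_model, 'fit_time']:.4f} seconds)")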

Cross-validating MultinomialNB...
Cross-validating KNN...
Cross-validating LogisticRegression...

Model Comparison:
                    accuracy  f1_score  fit_time  score_time
MultinomialNB        0.68498  0.676570  0.050763    0.014033
KNN                  0.57698  0.477075  0.048009   66.594518
LogisticRegression   0.73208  0.736627  0.667245    0.011252

Best model by accuracy: LogisticRegression (0.7321)
Best model by F1 score: LogisticRegression (0.7366)
Fastest model: MultinomialNB (0.0508 seconds)
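
In the solution above the vectorizer is fitted on all reviews before cross-validation, so the vocabulary is learned from the validation folds as well. A possible alternative, sketched below on a tiny made-up corpus (the texts and labels are hypothetical), is to wrap the vectorizer and the classifier in a pipeline so that each fold fits its own vocabulary on the training part only. With this approach `fit_time` also includes the vectorization cost.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (hypothetical; stands in for the review texts)
texts = [
    "great film", "loved it", "wonderful acting", "really great story",
    "terrible film", "hated it", "boring acting", "really bad story",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside every fold, on that fold's training texts only
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
cv_pipe = cross_validate(pipe, texts, labels, cv=2, scoring=["accuracy", "f1"])

print("accuracy per fold:", cv_pipe["test_accuracy"])
print("fit time per fold:", cv_pipe["fit_time"])
```

The scores on such a small corpus are not meaningful; the point is only the structure of the call.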
