# Name: Harsh Siddhapura
# ASU ID: 1230169813

## Lab 16: Naive Bayes Classifier 



In [1]:
# Imports
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

1 Write a script that fits a Gaussian Naive Bayes classifier to the given dataset. The script should:
- Read the dataset using the proper function for reading a libsvm file.
- Convert the feature data into a dense array using X.todense().
- Split the data into train/test parts. Use 30% of the data for testing.
- Create a machine learning pipeline that includes a standard scaler and a Gaussian Naive Bayes classifier.
- Fit the pipeline to the training data. Since there are no parameters to tune, and due to the limitations of the cross_val_score implementation, we will not use k-fold cross validation for this classifier.
- Use the pipe.score() function to apply the pipeline's fitted scaler to the test set and find the accuracy.
- Run your code and make sure it does not contain any errors. What is the classification accuracy returned by the pipe's score function?

In [2]:
# Load the a9a dataset (libsvm/svmlight format)
X, y = load_svmlight_file('a9a.txt')

# Convert feature data to dense array
X_dense = np.array(X.todense())

# Split data into train/test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_dense, y, test_size=0.3, random_state=0)

# Create a pipeline with standard scaler and Gaussian Naive Bayes classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('nb', GaussianNB())          # Gaussian Naive Bayes classifier
])

# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Evaluate accuracy on the test set
accuracy = pipe.score(X_test, y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5150987818609889
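
To make explicit what `pipe.score()` does in the step above, here is a minimal sketch that reproduces it by hand: the scaler fitted on the training data transforms the test data, the classifier predicts, and the predictions are compared with the true labels. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the a9a dataset), and the `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the a9a dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("nb", GaussianNB()),
])
demo_pipe.fit(Xtr, ytr)

# Manual equivalent of demo_pipe.score(Xte, yte):
# 1) scale the test data with statistics learned from the training data only
Xte_scaled = demo_pipe.named_steps["scaler"].transform(Xte)
# 2) predict with the fitted classifier
manual_pred = demo_pipe.named_steps["nb"].predict(Xte_scaled)
# 3) compute the fraction of correct predictions
manual_accuracy = accuracy_score(yte, manual_pred)

pipe_accuracy = demo_pipe.score(Xte, yte)
print(manual_accuracy, pipe_accuracy)
```

The two printed values should agree, which shows that the test set is only transformed, never used to refit the scaler.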


2 Modify your script by adding a part that fits a DT classifier to the data:
- Create another pipeline that includes a standard scaler and a Decision Tree classifier.
- Use a parameter grid that evaluates two impurity metrics (entropy and gini) and maximum tree depths of 10, 50 and 100.
- Fit this pipeline to the training data using k-fold cross validation with k=5.
- Find the best performing model and the corresponding parameter values.
- Create another pipeline that contains the same standard scaler and the best performing model.
- Use this pipeline to call the score() function over the test data.
- Run your code and make sure it does not contain any errors. What is the classification accuracy obtained on the test data?

In [5]:
# Load the a9a dataset (libsvm/svmlight format)
X, y = load_svmlight_file('a9a.txt')

# Convert feature data to dense array
X_dense = np.array(X.todense())

# Split data into train/test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_dense, y, test_size=0.3, random_state=0)

# Create a pipeline with standard scaler and Decision Tree Classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('dt', DecisionTreeClassifier())  # Decision Tree Classifier
])

# Define hyperparameters for grid search
param_grid = {
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': [10, 50, 100]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)

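# Get the best performing model (best_estimator_ is the full scaler + classifier
# pipeline, refit on the whole training set with the best parameters)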
best_model = grid_search.best_estimator_

# Evaluate accuracy on the test set
test_accuracy = best_model.score(X_test, y_test)
print(f"Best Model Test Accuracy: {test_accuracy}")

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

Best Model Test Accuracy: 0.8324291124987204
Best Hyperparameters: {'dt__criterion': 'gini', 'dt__max_depth': 10}
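
The output above reports only the winning combination. As a hedged sketch, the mean cross-validation score of every candidate can be read from `cv_results_` to see how close the alternatives were. The example runs on synthetic data from `make_classification` (an assumption, not the a9a dataset); the `demo_*` names and the fixed `random_state=0` on the tree are illustrative choices, not part of the original script.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is NOT the a9a dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("dt", DecisionTreeClassifier(random_state=0)),
])
demo_grid = {
    "dt__criterion": ["gini", "entropy"],
    "dt__max_depth": [10, 50, 100],
}
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Pair each candidate's mean CV accuracy with its parameters, best first
cv_summary = sorted(
    zip(demo_search.cv_results_["mean_test_score"], demo_search.cv_results_["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for mean_score, params in cv_summary:
    print(f"{mean_score:.4f}  {params}")
```

With 2 criteria and 3 depths the table has 6 rows, and the top row's score matches `demo_search.best_score_`.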


3 Modify your script by adding a part that fits an SVM classifier to the data:
- Create another pipeline that includes a standard scaler and an SVM classifier.
- Use a parameter grid that evaluates three kernels: linear, polynomial (with degree values 2 and 3), and rbf (with gamma values 0.001, 0.1 and 2).
- Fit this pipeline to the training data using k-fold cross validation with k=5.
- Find the best performing model and the corresponding parameter values.
- Create another pipeline that contains the same standard scaler and the best performing model.
- Use this pipeline to call the score() function over the test data.
- Run your code and make sure it does not contain any errors. What is the obtained classification accuracy?

In [6]:
# Load the a9a dataset (libsvm/svmlight format)
X, y = load_svmlight_file('a9a.txt')

# Convert feature data to dense array
X_dense = np.array(X.todense())

# Split data into train/test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_dense, y, test_size=0.3, random_state=0)

# Create a pipeline with standard scaler and SVM classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('svm', SVC())  # SVM classifier
])

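# Define hyperparameters for grid search.
# Note: a single dict is expanded into the full cross product, so degree and gamma
# are also varied for kernels that ignore them (the linear kernel uses neither).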
param_grid = {
    'svm__kernel': ['linear', 'poly', 'rbf'],
    'svm__degree': [2, 3],
    'svm__gamma': [0.001, 0.1, 2]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

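# Get the best performing model (best_estimator_ is the full scaler + classifier
# pipeline, refit on the whole training set with the best parameters)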
best_model = grid_search.best_estimator_

# Evaluate accuracy on the test set
test_accuracy = best_model.score(X_test, y_test)
print(f"Best Model Test Accuracy: {test_accuracy}")

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")


Best Model Test Accuracy: 0.8446666666666667
Best Hyperparameters: {'svm__degree': 2, 'svm__gamma': 0.001, 'svm__kernel': 'linear'}
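
The reported best parameters include `degree` and `gamma` even though the winning kernel is linear, because the single-dict grid pairs every kernel with every value. One possible alternative, sketched below, is a list of dicts so that each kernel is only combined with the parameters named for it in the task. `ParameterGrid` is used here just to count candidates without fitting anything; the variable names are hypothetical. Note that in this alternative the polynomial kernel would run with SVC's default `gamma`, so it is not guaranteed to reproduce the results above.

```python
from sklearn.model_selection import ParameterGrid

# Single dict, as in the cell above: every kernel is paired with every
# degree and every gamma, including combinations the kernel ignores.
cross_product_grid = {
    "svm__kernel": ["linear", "poly", "rbf"],
    "svm__degree": [2, 3],
    "svm__gamma": [0.001, 0.1, 2],
}

# List of dicts (alternative sketch): each kernel gets only its own parameters.
per_kernel_grid = [
    {"svm__kernel": ["linear"]},
    {"svm__kernel": ["poly"], "svm__degree": [2, 3]},
    {"svm__kernel": ["rbf"], "svm__gamma": [0.001, 0.1, 2]},
]

n_cross = len(ParameterGrid(cross_product_grid))
n_per_kernel = len(ParameterGrid(per_kernel_grid))
print(n_cross, n_per_kernel)  # 18 candidates vs. 6 candidates
```

Either grid can be passed to `GridSearchCV` as `param_grid`; with `cv=5` the candidate counts translate to 90 versus 30 cross-validation fits.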


#### Which classifier (and parameter values) gave the highest accuracy?

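Test accuracies for all 3 classifiers are as follows:  

1. Gaussian Naive Bayes = 0.5150987818609889
2. Decision Tree = 0.8324291124987204
3. SVM Classifier = 0.8446666666666667

The SVM classifier gave the highest test accuracy, 0.8446666666666667, with the parameter values {'svm__degree': 2, 'svm__gamma': 0.001, 'svm__kernel': 'linear'}. Because the selected kernel is linear, the degree and gamma values listed do not affect the model: SVC uses degree only for the polynomial kernel and gamma only for the rbf, poly and sigmoid kernels.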
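
As a small sketch, the comparison can also be done in code by collecting the test accuracies printed by the three cells and taking the maximum. The dictionary below simply copies those printed values; the names `test_accuracies` and `best_name` are hypothetical.

```python
# Test accuracies copied from the outputs of the three cells above
test_accuracies = {
    "Gaussian Naive Bayes": 0.5150987818609889,
    "Decision Tree": 0.8324291124987204,
    "SVM": 0.8446666666666667,
}

# Pick the classifier with the highest test accuracy
best_name = max(test_accuracies, key=test_accuracies.get)
print(f"{best_name}: {test_accuracies[best_name]}")
```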