# *CodexCue*
## **ALLOCATED PROJECTS 3 - Hyperparameter Tuning of ML Models**

**Grid Search** and **Random Search** are techniques used for hyperparameter tuning in machine learning models.

- **Grid Search**: This method involves an exhaustive search over a predefined set of hyperparameters. It evaluates all possible combinations of hyperparameter values provided in the grid. While thorough, it can be computationally expensive and time-consuming, especially with a large number of hyperparameters and possible values.

- **Random Search**: Instead of searching all possible combinations, random search samples a fixed number of hyperparameter combinations from the specified distributions. It is generally more efficient than grid search, as it can explore a wider range of hyperparameters in less time, often leading to comparable or better performance with less computational cost.


**When to use each:**

- **Grid Search**: Use when the hyperparameter space is small and you want to evaluate all possible combinations to find the best parameters exhaustively.

- **Random Search**: Use when the hyperparameter space is large and you want to sample a fixed number of combinations to efficiently explore a wider range of hyperparameters with less computational cost.
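
To make the contrast concrete, here is a minimal standard-library sketch of how the two strategies pick candidates. The `space` dictionary and the budget `n_iter = 5` are illustrative assumptions, not settings taken from the experiments in this notebook.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration.
space = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "sigmoid"],
}

# Grid search: enumerate every combination of every listed value.
names = list(space)
grid = [dict(zip(names, values)) for values in itertools.product(*space.values())]

# Random search: draw a fixed budget of combinations (here without replacement).
rng = random.Random(42)
n_iter = 5
sampled = rng.sample(grid, k=n_iter)

print(len(grid), "grid candidates;", len(sampled), "random candidates")
# prints: 16 grid candidates; 5 random candidates
```

The grid grows multiplicatively with every added hyperparameter, while the random budget stays at whatever `n_iter` is set to.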

### Mounted drive, dataset accessed.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import zipfile
# The context manager closes the archive automatically after extraction
with zipfile.ZipFile('/content/drive/MyDrive/Codexcue/emails.csv.zip', 'r') as zip_ref:
    zip_ref.extractall('/content')

## Required Imports

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix


from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import time
import random
import warnings
warnings.filterwarnings("ignore")

## Dataset Analysis

In [4]:
df = pd.read_csv('emails.csv')
df.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0


In [10]:
df.columns

Index(['Email No.', 'the', 'to', 'ect', 'and', 'for', 'of', 'a', 'you', 'hou',
       ...
       'connevey', 'jay', 'valued', 'lay', 'infrastructure', 'military',
       'allowing', 'ff', 'dry', 'Prediction'],
      dtype='object', length=3002)

In [5]:
rows, cols = df.shape
print("Our dataset has",rows,"rows and",cols,"columns")

Our dataset has 5172 rows and 3002 columns


In [6]:
if df.isnull().sum().sum() == 0:
    print("No missing values found in the dataset.")
else:
    print("Missing values found. Handling missing values...")
    df = df.dropna()

No missing values found in the dataset.


In [7]:
df.drop_duplicates(keep='first', inplace=True)

In [8]:
print(df['Prediction'].value_counts())

Prediction
0    3672
1    1500
Name: count, dtype: int64


In [9]:
categorical_columns = df.select_dtypes(include=['object']).columns
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns

# **Pre-processing**

##  Encoding categorical variables using LabelEncoder

In [11]:
label_encoder = LabelEncoder()
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])

##  Scaling numerical columns using StandardScaler except Prediction

In [12]:
numerical_columns = [col for col in df.columns if col != 'Prediction' and df[col].dtype in ['int64', 'float64']]

scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
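
Fitting the scaler on the full DataFrame means the test rows contribute to the scaling statistics. One common alternative, sketched below under assumptions, is to put the scaler inside a scikit-learn `Pipeline` so that it is fitted on training rows only. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `pipe` are stand-ins for illustration, not part of this notebook's workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the email word-count matrix (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The scaler is fitted inside the pipeline, so it only ever sees training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name.
scaler_in_pipe = pipe.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(scaler_in_pipe.mean_, X_tr.mean(axis=0)))
print("Test accuracy:", pipe.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `cross_val_score` or `GridSearchCV`, in which case the scaler is refitted within each training fold.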

In [13]:
print("Encoded and scaled DataFrame:")
df.head(5)

Encoded and scaled DataFrame:


Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,-1.731716,-0.565449,-0.649083,-0.293895,-0.508752,-0.667663,-0.421725,-0.611169,-0.571751,-0.290556,...,-0.047525,-0.062944,-0.091138,-0.172137,-0.044197,-0.04733,-0.056285,-0.329048,-0.070971,0
1,-0.98759,0.115757,0.714508,1.337337,0.483741,0.614369,-0.100659,0.530831,-0.339949,3.584743,...,-0.047525,-0.062944,-0.091138,-0.172137,-0.044197,-0.04733,-0.056285,0.030672,-0.070971,0
2,-0.243465,-0.565449,-0.649083,-0.293895,-0.508752,-0.667663,-0.421725,-0.542649,-0.571751,-0.290556,...,-0.047525,-0.062944,-0.091138,-0.172137,-0.044197,-0.04733,-0.056285,-0.329048,-0.070971,0
3,0.50066,-0.565449,-0.124625,1.19549,-0.508752,0.400697,-0.261192,-0.051589,-0.108147,1.14474,...,-0.047525,-0.062944,-0.091138,-0.172137,-0.044197,-0.04733,-0.056285,-0.329048,-0.070971,0
4,1.244786,0.030606,-0.019733,0.840875,-0.343336,0.400697,-0.100659,0.016931,-0.571751,1.00121,...,-0.047525,-0.062944,-0.091138,-0.172137,-0.044197,-0.04733,-0.056285,0.030672,-0.070971,0


In [14]:
X = df.drop(columns=['Email No.', 'Prediction'])
y = df['Prediction']

## Splitting the dataset into training and testing sets

In [15]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

In [16]:
print(X_train.shape)
print(X_test.shape)

(3879, 3000)
(1293, 3000)


### Create three model instances (random forest, SVM and logistic regression), get predictions and evaluate the accuracy of each model

In [18]:
rf = RandomForestClassifier()
svc = SVC()
lr = LogisticRegression()

In [19]:
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.9760247486465584

In [20]:
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test,y_pred)

0.9373549883990719

In [21]:
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
accuracy_score(y_test,y_pred)

0.9690641918020109
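
The classes are imbalanced (3672 vs. 1500), so accuracy alone may hide errors on the minority class. The sketch below shows how the already-imported `confusion_matrix` and `classification_report` could complement it. It runs on synthetic data with a roughly 70/30 class split; `X_demo`, `y_demo`, and `clf` are hypothetical names, and the numbers it prints say nothing about the email dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (about 70/30), not the email data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, weights=[0.7, 0.3], random_state=0
)
# stratify keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall and F1.
print(classification_report(y_te, y_hat))

# F1 for the positive (minority) class only.
minority_f1 = f1_score(y_te, y_hat)
print("F1 (class 1):", round(minority_f1, 3))
```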

### Checking the complete parameter list, then initializing rf with some parameters to analyze its performance

In [22]:
print(rf.get_params())

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [23]:
rf = RandomForestClassifier(max_samples=0.75, criterion='gini', random_state=1)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

0.9737045630317092

### Taking the mean cross-validation score for rf to see how results vary across different test folds

In [24]:
np.mean(cross_val_score(RandomForestClassifier(max_samples=0.75),X,y,cv=5,scoring='accuracy'))

0.9549468785916521
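
The mean alone does not show how much the folds disagree. A small sketch of reporting the spread as well, using an explicit `StratifiedKFold` so the folds are shuffled and reproducible, is given below. The synthetic data and the reduced `n_estimators=50` are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, max_samples=0.75, random_state=1),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

# One accuracy per fold, then mean and standard deviation.
print(scores)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```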

# **Hyperparameter Tuning for SVM Model**

In [38]:
model = SVC()

### **GridSearchCV**

In [40]:
# Parameters for Grid Search
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 'auto'],
              'kernel': ['rbf', 'sigmoid']
              }

# Start the timer for Grid Search
start_time = time.time()

# Execute Grid Search
grid_search = GridSearchCV(estimator=model,
                           param_grid = param_grid,
                           cv = 5,
                           n_jobs = -1,
                           refit=True,
                           verbose=0)


In [41]:
grid_search.fit(X_train, y_train)

In [42]:
# Calculate the execution time of Grid Search
grid_search_time = time.time() - start_time

print("Best score of Grid Search: ", grid_search.best_score_)
print("Processing time of Grid Search: {:.2f} seconds".format(grid_search_time))

Best score of Grid Search:  0.9541104090455604
Processing time of Grid Search: 835.78 seconds


In [48]:
grid_search.best_params_

{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
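
Beyond `best_params_`, the fitted search object keeps per-candidate results in `cv_results_`. The sketch below, on small synthetic data with a reduced hypothetical grid (`demo_grid`), shows one way to rank candidates and to time the fit within a single cell so that idle time between cells is not counted.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic problem and grid so the example runs in seconds (assumptions).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "sigmoid"]}

# Start and stop the timer around fit() only.
start = time.perf_counter()
demo_search = GridSearchCV(SVC(), demo_grid, cv=3).fit(X_demo, y_demo)
elapsed = time.perf_counter() - start

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort.
results = pd.DataFrame(demo_search.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["param_C", "param_kernel", "mean_test_score", "std_test_score", "mean_fit_time"]
]
print(ranking.to_string(index=False))
print(f"search took {elapsed:.2f} s")
```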

### **RandomizedSearchCV**

In [43]:
# Parameters for Random Search
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 'auto'],
              'kernel': ['rbf', 'sigmoid']
              }

# Start the timer for Random Search
start_time = time.time()

In [44]:
# Execute Random Search
random_search = RandomizedSearchCV(estimator=model,
                           param_distributions = param_grid,
                           cv = 5,
                           n_iter=50,
                           n_jobs = -1,
                           refit=True,
                           verbose=0,
                           random_state=42)



In [45]:
random_search.fit(X_train, y_train)

In [46]:
# Calculate the execution time of Random Search
random_search_time = time.time() - start_time

print("Best score of Random Search: ", random_search.best_score_)
print("Processing time of Random Search: {:.2f} seconds".format(random_search_time))

Best score of Random Search:  0.9541104090455604
Processing time of Random Search: 1230.05 seconds


In [47]:
print(random_search.best_params_)

{'kernel': 'rbf', 'gamma': 'scale', 'C': 10}
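
The random search above asks for `n_iter=50`, but the search space lists only 16 combinations. When every entry is a list, scikit-learn samples without replacement and caps the number of candidates at the grid size (emitting a warning, which this notebook suppresses). The sketch below checks this with `ParameterSampler`, the helper that generates candidates for `RandomizedSearchCV`.

```python
import warnings

from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as used for both searches above.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
    "kernel": ["rbf", "sigmoid"],
}

# Requesting more iterations than there are combinations triggers a UserWarning.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sampled = list(ParameterSampler(param_grid, n_iter=50, random_state=42))

print(len(ParameterGrid(param_grid)), len(sampled))
# prints: 16 16
```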


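*The two techniques can be compared in terms of running time. Both report the same best score and best parameters here, which is expected rather than coincidental: `param_grid` lists only 16 combinations, fewer than `n_iter=50`, so `RandomizedSearchCV` ends up evaluating every combination, just as `GridSearchCV` does. This example uses the same `param_grid` for both techniques; a different search space could also be used for each.*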
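
As an illustration of giving random search its own search space, the sketch below samples `C` and `gamma` from continuous log-uniform distributions instead of fixed lists. The ranges, the budget `n_iter=8`, and the synthetic data are assumptions picked for a quick demonstration, not tuned recommendations for the email dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic problem so the example runs quickly (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Continuous distributions for C and gamma; a list for the kernel.
param_distributions = {
    "C": loguniform(1e-1, 1e2),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["rbf", "sigmoid"],
}

# n_iter fixes the budget regardless of how large the space is.
demo_random = RandomizedSearchCV(
    SVC(),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    random_state=42,
).fit(X_demo, y_demo)

print(demo_random.best_params_)
print(round(demo_random.best_score_, 3))
```

With continuous distributions the search is no longer limited to a handful of preselected values, which is where random search tends to differ most from grid search.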