<a href="https://colab.research.google.com/github/sheelaj123/Machine-Learning-Course--2024/blob/main/Cross_Validation_and_Ensemble_IN_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Implementing cross-validation

The code below uses 5-fold cross-validation to determine the best value of k for a kNN model built on the defaulter dataset.

The dataset can be downloaded here.

### Reading the data

The defaulter dataset contains data about customers defaulting on loans.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
#read data from input csv file
defaulter = pd.read_csv("default.csv")


## Feature Engineering

We will now normalize the features in the dataset using MinMaxScaler, which rescales each feature to the range 0 to 1.
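As a minimal sketch of what MinMaxScaler computes, each column is rescaled as (x - min) / (max - min), so the smallest value in a column maps to 0 and the largest to 1. The toy values below are made up for illustration and do not come from the defaulter dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two columns standing in for balance and income
toy = np.array([[0.0, 20000.0],
                [500.0, 35000.0],
                [2000.0, 50000.0]])

# Manual min-max scaling, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation using sklearn
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual, scaled))
```

Both approaches should give the same array, which is why the scaled columns in the notebook can be read as "position between the column minimum and maximum".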

In [5]:
#### Normalizing the data using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
features_to_scale = ["balance","income"]
scaled_values = scaler.fit_transform(defaulter[features_to_scale])
defaulter["norm_balance"] = scaled_values[:,0]
defaulter["norm_income"] = scaled_values[:,1]


Splitting the data into train and test set

In [7]:
from sklearn.model_selection import train_test_split
X=defaulter[["norm_balance","norm_income"]]
Y=defaulter['default']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
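
Note that the scaler in this notebook is fitted on the full dataset before the split, so the test rows influence the minimum and maximum used for scaling. One common alternative, sketched here under the assumption that synthetic data from make_classification can stand in for the defaulter data, is to put the scaler and the classifier in a pipeline so that the scaler is fitted on the training rows only. The names X_demo, y_demo and pipe are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-feature data (a stand-in, not the defaulter dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100)

# The scaler is fitted inside the pipeline, using the training rows only
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=9))
pipe.fit(Xd_train, yd_train)

train_acc = pipe.score(Xd_train, yd_train)
test_acc = pipe.score(Xd_test, yd_test)
print(train_acc, test_acc)
```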


Finding best value of k for KNN


In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
#create a new knn model
knn = KNeighborsClassifier()
#create a dictionary of all k neighbor values to try: 1, 3, ..., 13
param_grid = {'n_neighbors': np.arange(1, 15,2)}
#use GridSearchCV to perform 5-fold cross-validation (cv=5 is also the default)
knn_gscv = GridSearchCV(knn, param_grid, cv=5, return_train_score=True, verbose=1, scoring='accuracy')
#fit model to data
knn_gscv.fit(X_train,Y_train)
#storing results to dataframe
#print(knn_gscv.cv_results_)
df=pd.DataFrame(knn_gscv.cv_results_)
#filtering out columns
df=df[['param_n_neighbors','mean_train_score','mean_test_score']]


Fitting 5 folds for each of 7 candidates, totalling 35 fits
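
The 35 fits reported above are 7 candidate values of k multiplied by 5 folds. As a rough sketch of what happens for each candidate, cross_val_score can run the 5 folds for one value of k at a time. Synthetic data from make_classification is used as a stand-in here, so the scores are not those of the defaulter dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the defaulter dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

mean_scores = {}
for k in np.arange(1, 15, 2):
    # cv=5: train on 4 folds, score on the held-out fold, repeat 5 times
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[int(k)] = fold_scores.mean()

# Pick the k with the highest mean validation accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```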


The KNN parameter 'n_neighbors' sets the value of k. In the results table it appears in the column 'param_n_neighbors'.

Here we observe that param_n_neighbors = 9 gives good performance on both the training folds (mean_train_score) and the validation folds (mean_test_score), so 9 is chosen as the best value for k.
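
Instead of reading the best value off the results table by eye, a fitted GridSearchCV object also records the winning candidate. The sketch below uses synthetic stand-in data, so the value it prints will generally differ from the 9 found on the defaulter data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the defaulter dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": np.arange(1, 15, 2)},
                      cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# Candidate with the highest mean validation accuracy, and that accuracy
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ has already been
# retrained on all the data that was passed to fit()
print(search.best_estimator_)
```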

Having determined the best value of k using 5-fold cross-validation, use that value to train a model on the entire training set and check its performance on the train and test data, as shown below:


In [9]:
model = KNeighborsClassifier(n_neighbors = 9, metric="euclidean")
model.fit(X_train,Y_train)
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)
#output
#0.974625 0.9725


0.974625 0.9725


# Ensemble methods

Ensemble methods aim to improve prediction accuracy by training several models and combining their predictions instead of relying on a single model.

Two commonly used ensemble methods are Bagging and Boosting.

Bagging
In bagging (bootstrap aggregating), multiple models are trained with the same algorithm on different random subsets of the training data, typically sampled with replacement. The trained models are then combined by majority voting for classification or by averaging for regression.
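
The bagging idea can be sketched by hand: draw bootstrap samples (rows sampled with replacement), fit one model per sample, and take a majority vote. This is an illustrative sketch on synthetic data, not the code used later in this notebook; sklearn's BaggingClassifier offers a ready-made version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with labels 0 and 1
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100)

rng = np.random.default_rng(0)
models = []
for _ in range(11):  # an odd number of models avoids ties in a binary vote
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeClassifier(random_state=0)
    models.append(tree.fit(Xd_train[idx], yd_train[idx]))

# Each row of votes holds one model's predictions on the test set
votes = np.array([m.predict(Xd_test) for m in models])
# Majority vote: predict 1 when more than half of the models say 1
majority = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (majority == yd_test).mean()
print(bagged_accuracy)
```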

Random forest is a bagging algorithm that uses decision trees as base models. Each tree is trained on a random subset of the training data, and at each split only a random sample of the feature variables is considered. This adds another layer of variety and randomness to the final classifier. When the forest is applied to new data, the data is run through every tree in the collection and the trees' predictions are aggregated to give the final output.
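
To make the aggregation step concrete, the sketch below fits a small forest on synthetic data and checks that the forest's predicted class probabilities are the average of the individual trees' probabilities, which is how sklearn's RandomForestClassifier combines its trees. The max_features argument controls how many features are sampled at each split; the value "sqrt" and the seed are chosen here purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the spambase dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X_demo, y_demo)

# Run the data through each of the 10 trees separately
tree_probas = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Aggregate by averaging the per-tree class probabilities
averaged = tree_probas.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X_demo))
print(matches)
```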

The demo code below trains a random forest model with 10 decision trees.

Reading the input data

Here you will work on the spambase dataset, which records the normalized frequency of different words in an email. Based on these frequencies, each email is labelled as spam (1) or not spam (0).



In [11]:
#reading input data from csv file
spam_data = pd.read_csv("spambase.csv")


Splitting the data into train and test set



In [12]:
from sklearn.model_selection import train_test_split
features = spam_data.columns.drop('spam')
target = "spam"
X=spam_data[features]
Y=spam_data[target]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=100)


Model Building

In [13]:
from sklearn.ensemble import RandomForestClassifier
# build a RandomForestClassifier with 10 underlying decision trees (estimators)
model = RandomForestClassifier(n_estimators=10,
                               min_samples_split=20,
                               min_impurity_decrease=0.05)
model.fit(X_train,Y_train)
# Evaluate the model performance
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)
#output (changes from run to run because random_state is not set), for example:
#0.8633152173913043 0.8577633007600435


0.8538043478260869 0.8631921824104235
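
The printed accuracies differ from the ones recorded in the code comment because the forest is built from random samples and no random_state is set. A minimal sketch of making a run repeatable follows, using synthetic stand-in data and an arbitrary seed of 42; the helper fit_and_score is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the spambase dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

def fit_and_score(seed):
    """Fit a 10-tree forest with a fixed seed and return its training accuracy."""
    forest = RandomForestClassifier(n_estimators=10, min_samples_split=20,
                                    random_state=seed)
    return forest.fit(X_demo, y_demo).score(X_demo, y_demo)

# The same seed gives the same forest, and so the same score
first_run = fit_and_score(42)
second_run = fit_and_score(42)
print(first_run, second_run)
```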


Reviewing the feature importance:


A random forest model can also help us evaluate which features are important. The code below demonstrates this.

In [14]:
# one row per feature, with its importance in the fitted model
feature_imps = pd.DataFrame({"feature": features,
                             "importance": model.feature_importances_})
feature_imps.sort_values(by="importance",ascending=False)


| index | feature | importance |
|---|---|---|
| 51 | char_freq_! | 0.25983 |
| 52 | char_freq_$ | 0.228477 |
| 20 | word_freq_your | 0.149205 |
| 23 | word_freq_money | 0.111111 |
| 6 | word_freq_remove | 0.097842 |
| 24 | word_freq_hp | 0.058052 |
| 16 | word_freq_business | 0.053059 |
| 15 | word_freq_free | 0.042424 |
| 42 | word_freq_original | 0.0 |
| 38 | word_freq_pm | 0.0 |


In [None]:
#In the above code, the model.feature_importances_ value is used to determine the importance of each feature in the random forest model.

#In the run shown above, the random forest gave non-zero importance to only 8 of the 50+ features in the dataset. The exact set can change between runs because random_state is not set.
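
As a small sketch of how to read these values, the importances of a fitted forest are normalized to sum to 1, and the features the forest actually used can be counted as those with non-zero importance. Synthetic data and made-up column names (f0, f1, ...) are used here, so the numbers are not those of the spambase model.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=3, random_state=0)
names = [f"f{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# One importance per feature, sorted from most to least important
imps = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
used = int((imps > 0).sum())

print(imps)
print("sum of importances:", imps.sum())
print("features with non-zero importance:", used)
```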

# Boosting

Boosting is another ensemble learning technique, in which the models are built sequentially. Each new model takes into account the mistakes made by the previous models in predicting the target value. In AdaBoost, one of the best-known boosting techniques, every training sample starts with the same weight. Samples that a model labels incorrectly are given more weight when the next model is built. The output of the boosted model is a weighted sum of the predictions made by the individual models.
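
The reweighting loop can be sketched by hand. The code below follows the classic discrete AdaBoost update with depth-1 decision trees on synthetic data, with labels recoded as -1 and +1. It is an illustration under those assumptions; sklearn's AdaBoostClassifier may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, with labels recoded from 0/1 to -1/+1
X_demo, y01 = make_classification(n_samples=400, n_features=10, random_state=0)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 10
weights = np.full(len(y_pm), 1.0 / len(y_pm))  # equal weights at the start
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a depth-1 tree that pays more attention to heavily weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this model, clipped to keep the log finite
    err = np.clip(weights[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this model's say in the final vote

    # Misclassified samples (y * pred == -1) gain weight, correct ones lose weight
    weights = weights * np.exp(-alpha * y_pm * pred)
    weights = weights / weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of the individual predictions
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ensemble_pred = np.where(scores >= 0, 1, -1)
train_acc = (ensemble_pred == y_pm).mean()
print(train_acc)
```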

The code below shows how to use AdaBoost in sklearn.

In [15]:
from sklearn.ensemble import AdaBoostClassifier
#building AdaBoostClassifier with 10 models, also called estimators
#(by default each estimator is a decision tree of depth 1, so no base model is passed)
model = AdaBoostClassifier(n_estimators=10)
model.fit(X_train,Y_train)
# Evaluating the model performance
train_accuracy = model.score(X_train,Y_train)
test_accuracy = model.score(X_test,Y_test)
print(train_accuracy,test_accuracy)
#output
#0.9195652173913044 0.9272529858849077


0.9195652173913044 0.9272529858849077


Similar to the random forest, the AdaBoost classifier also exposes the feature importances.



In [16]:
# one row per feature, with its importance in the fitted model
feature_imps = pd.DataFrame({"feature": features,
                             "importance": model.feature_importances_})
feature_imps.sort_values(by="importance",ascending=False)


| index | feature | importance |
|---|---|---|
| 15 | word_freq_free | 0.1 |
| 45 | word_freq_edu | 0.1 |
| 55 | capital_run_length_longest | 0.1 |
| 52 | char_freq_$ | 0.1 |
| 6 | word_freq_remove | 0.1 |
| 51 | char_freq_! | 0.1 |
| 26 | word_freq_george | 0.1 |
| 44 | word_freq_re | 0.1 |
| 24 | word_freq_hp | 0.1 |
| 36 | word_freq_1999 | 0.1 |

