# Loading the dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv("/Users/rajatchauhan/Desktop/Machine Learning Notes/Datasets/Social_Network_Ads.csv")
data

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0
...,...,...,...,...,...
395,15691863,Female,46,41000,1
396,15706071,Male,51,23000,1
397,15654296,Female,50,20000,1
398,15755018,Male,36,33000,0


In machine learning, decision tree implementations such as scikit-learn's, like many other algorithms, expect numerical inputs.

At each node a decision tree makes a binary split by comparing a feature value against a threshold, so every feature must be numeric.

Categorical variables such as "Gender", with the values "Male" and "Female", therefore have to be converted to numbers first.

This conversion is called encoding. Because Gender has only two categories here, a simple binary mapping (Male to 0, Female to 1) is enough. One-hot encoding, which creates one indicator column per category, is the usual choice when a variable has more than two unordered categories.
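To make the difference between the two encodings concrete, here is a minimal sketch on a tiny made-up frame (the four rows and the names `df`, `binary`, and `one_hot` are illustrative assumptions, not part of the dataset above).

```python
import pandas as pd

# Tiny hypothetical frame standing in for the Gender column
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Binary encoding: a single integer column
binary = df["Gender"].map({"Male": 0, "Female": 1})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)

print(binary.tolist())
print(one_hot)
```

With only two categories, the two one-hot columns carry the same information as the single binary column, which is why the simpler mapping is a reasonable choice here.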

In [3]:
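# Assign the mapped column back; calling replace(..., inplace=True) on a selected column is deprecated in recent pandas
data["Gender"] = data["Gender"].map({"Male": 0, "Female": 1})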
data

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,0,19,19000,0
1,15810944,0,35,20000,0
2,15668575,1,26,43000,0
3,15603246,1,27,57000,0
4,15804002,0,19,76000,0
...,...,...,...,...,...
395,15691863,1,46,41000,1
396,15706071,0,51,23000,1
397,15654296,1,50,20000,1
398,15755018,0,36,33000,0


# Defining input features and output variable

In [4]:
X = data.iloc[:,1:4]
X

Unnamed: 0,Gender,Age,EstimatedSalary
0,0,19,19000
1,0,35,20000
2,1,26,43000
3,1,27,57000
4,0,19,76000
...,...,...,...
395,1,46,41000
396,0,51,23000
397,1,50,20000
398,0,36,33000


In [5]:
y = data.iloc[:,-1]
y

0      0
1      0
2      0
3      0
4      0
      ..
395    1
396    1
397    1
398    0
399    1
Name: Purchased, Length: 400, dtype: int64

# Data preprocessing

Before moving forward, note that Age and EstimatedSalary are on very different scales (tens versus tens of thousands), so we apply standard scaling to the features.


StandardScaler is a scikit-learn preprocessing class that standardizes the features of a dataset.

It transforms each feature so that it has a mean of 0 and a standard deviation of 1, by subtracting the feature's mean and dividing by its standard deviation.

Standardization matters most for algorithms that are sensitive to the scale of their inputs, such as distance-based or gradient-based models. Decision trees split on thresholds and are largely unaffected by feature scale, so the step is optional here, but it does no harm.
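As a quick check of what the scaler computes, the sketch below standardizes a small made-up array by hand and compares it with StandardScaler. The values in `X_demo` are illustrative assumptions, not rows taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age and salary values, for illustration only
X_demo = np.array(
    [[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float
)

# z = (x - mean) / std, computed per column with the population standard deviation
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

# Both approaches should agree up to floating-point error
print(np.allclose(manual, scaled))
```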

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
scaler = StandardScaler()

In [8]:
X = scaler.fit_transform(X)
X

array([[-1.02020406, -1.78179743, -1.49004624],
       [-1.02020406, -0.25358736, -1.46068138],
       [ 0.98019606, -1.11320552, -0.78528968],
       ...,
       [ 0.98019606,  1.17910958, -1.46068138],
       [-1.02020406, -0.15807423, -1.07893824],
       [ 0.98019606,  1.08359645, -0.99084367]])

# Train Test Split

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Defining Decision Tree Classifier

In [11]:
from sklearn.tree import DecisionTreeClassifier

In [12]:
clf = DecisionTreeClassifier()

In [13]:
clf.fit(X_train, y_train)

DecisionTreeClassifier()

# Doing prediction using this model

In [14]:
y_pred = clf.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])

In [15]:
y_test

398    0
125    0
328    1
339    1
172    0
      ..
347    1
41     0
180    0
132    0
224    0
Name: Purchased, Length: 80, dtype: int64

# Checking the accuracy of the model

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
accuracy_score(y_test, y_pred)

0.8

So this is the test accuracy we get when the classifier is left at its default hyperparameters. Because random_state is not set, the exact value can vary slightly between runs.
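Accuracy is simply the fraction of predictions that match the true labels. The sketch below shows this on two short made-up arrays (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's `y_test` and `y_pred`).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions: 4 of the 5 agree
y_true_demo = np.array([0, 0, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0])

# Fraction of positions where prediction equals the true label
manual_acc = np.mean(y_true_demo == y_pred_demo)

print(manual_acc, accuracy_score(y_true_demo, y_pred_demo))
```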

In [18]:
# Get the parameters used by the classifier
params = clf.get_params()

# Print the parameters and their values
for param, value in params.items():
    print(f"{param}: {value}")

ccp_alpha: 0.0
class_weight: None
criterion: gini
max_depth: None
max_features: None
max_leaf_nodes: None
min_impurity_decrease: 0.0
min_samples_leaf: 1
min_samples_split: 2
min_weight_fraction_leaf: 0.0
random_state: None
splitter: best


# Doing Hyper-parameter Tuning using GridSearchCV

The first step is to set up the parameter grid, which lists the candidate values for each hyperparameter. The grid search then picks the combination that performs best on our dataset.

Let us start by tuning two hyperparameters: criterion and max_depth

In [19]:
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [1, 2, 3, 4, 5, 6, 7, None]}
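To see how many candidate models this grid describes, scikit-learn's ParameterGrid can enumerate the combinations. This is a small illustrative sketch; the names `grid` and `n_candidates` are assumptions introduced here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, None],
}

# Every combination of one criterion with one max_depth
grid = list(ParameterGrid(param_grid))
n_candidates = len(grid)

print(n_candidates)
print(grid[0])
```

Each extra hyperparameter multiplies the number of candidates, so grids grow quickly as more parameters are added.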

In [20]:
from sklearn.model_selection import GridSearchCV

In [21]:
gsv = GridSearchCV(clf, param_grid=param_grid, cv=10, n_jobs=-1)

The cv parameter of GridSearchCV sets the number of cross-validation folds. With cv = 10, the training data is split into 10 parts, and each hyperparameter combination is trained on 9 of them and scored on the remaining one, 10 times over.

The n_jobs parameter sets how many CPU cores are used in parallel; n_jobs = -1 uses all available cores.

The individual fits, one per fold and hyperparameter combination, are distributed across those cores.
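The fold structure can be inspected directly. For a classifier, an integer cv makes scikit-learn use stratified folds, so this sketch uses StratifiedKFold on a made-up label vector of 320 rows (the size of the training split; the 200/120 class mix is an assumption chosen for illustration).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for the training labels; the class mix is made up
y_demo = np.array([0] * 200 + [1] * 120)
X_demo = np.zeros((320, 3))  # feature values do not affect how folds are drawn

skf = StratifiedKFold(n_splits=10)

# (training rows, validation rows) for each of the 10 folds
fold_sizes = [
    (len(train_idx), len(val_idx)) for train_idx, val_idx in skf.split(X_demo, y_demo)
]

print(len(fold_sizes), fold_sizes[0])
```

The total number of model fits in a grid search is the number of candidate combinations times the number of folds, plus one final refit of the best candidate on the full training set.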

In [22]:
gsv.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [1, 2, 3, 4, 5, 6, 7, None]})

Let us look at the best parameters found by the grid search

In [23]:
gsv.best_params_

{'criterion': 'gini', 'max_depth': 2}

In [24]:
gsv.best_score_

0.91875

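The tuned model's mean cross-validated accuracy (about 0.92) is higher than the 0.80 test accuracy of the default tree. Note that best_score_ is computed on validation folds of the training data, so the two numbers are not directly comparable; scoring the tuned model on X_test gives the like-for-like comparison.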
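To compare like with like, the tuned model can also be scored on the held-out test set. The following is a minimal sketch on synthetic data from make_classification, which stands in for the CSV used above (an assumption made so the example runs on its own); names such as `X_demo` and `search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 400 rows, 3 numeric features, binary target
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={
        "criterion": ["gini", "entropy"],
        "max_depth": [1, 2, 3, 4, 5, 6, 7, None],
    },
    cv=10,
)
search.fit(X_tr, y_tr)

# After fitting, predict() uses the best estimator refit on all training rows
test_acc = accuracy_score(y_te, search.predict(X_te))

print(search.best_params_)
print("mean CV accuracy:", search.best_score_)
print("test accuracy:", test_acc)
```

In the notebook itself, the equivalent call would be accuracy_score(y_test, gsv.predict(X_test)).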

In the same way, we can tune the other hyperparameters, such as min_samples_split and min_samples_leaf.