## Decision Tree

The decision tree is one of the most versatile algorithms in machine learning and can perform both classification and regression. It works well with complex datasets, and it is easy to understand and read, which adds to its popularity. When coupled with ensemble techniques, which we will learn very soon, it performs even better.
As the name suggests, this algorithm works by dividing the whole dataset into a tree-like structure based on some rules and conditions and then gives prediction based on those conditions.
Let’s understand the approach to decision tree with a basic scenario. 
Suppose it’s Friday night and you are not able to decide if you should go out or stay at home. Let the decision tree decide it for you.


<img src="Decision_tree1.PNG" width="300">
                         
Although we may or may not use the decision tree for such decisions, this was a basic example to help you understand how a decision tree makes a decision.
So how did it work?
*	It selects a root node based on a given condition, e.g. our root node was chosen as time > 10 pm.
*	Then, the root node was split into child nodes based on that condition. The right child node in the above figure fulfilled the condition, so no more questions were asked.
*	The left child node didn’t fulfil the condition, so it was split again based on a new condition.
*	This process continues until all the conditions are met or the tree reaches a predefined depth, e.g. the depth of our tree is 3, and it reached that depth when all the conditions were exhausted.

Let’s see how the parent nodes and conditions are chosen for the splitting to work.

#### Decision Tree for Regression
When performing regression with a decision tree, we divide the predictor space, i.e. the set of possible values of X1, X2,..., Xp, into J distinct and non-overlapping regions R1, R2, . . . , RJ.
For every observation that falls into the region Rj, the prediction is the mean of the response (y) values of the training observations in Rj.
The regions R1, R2, . . . , RJ are chosen to minimise the following residual sum of squares (RSS):


<img src="formula1.PNG" width="300">
                                                        
Here, yrj (the second term) is the mean of the response values of the training observations in region ‘j’.



#### Recursive binary splitting (Greedy approach)
As mentioned above, we try to divide the X values into J regions, but it is computationally very expensive to consider every possible partition of the X values into J regions. Thus, the decision tree opts for a top-down greedy approach in which a node is divided into two regions based on a given condition; only the nodes for which a worthwhile split exists are split into two branches. It is called greedy because it makes the best split at the current step rather than looking ahead for a split that would lead to a better tree in later steps. For a feature Xj, it picks a threshold value (say s) that divides the observations into two regions, Xj < s and Xj >= s, such that the total RSS of the two regions is minimum.


<img src="formula2.PNG" width="400">
                      
Here, the feature j and the threshold s are chosen such that the above expression has the minimum value; the regions R1 and R2 are the two sides of that split.
The same logic is then applied to each of the regions created above, splitting them further. This continues until a predefined stopping criterion is reached.
Once all the regions are created, the prediction for an observation is the mean of the training responses in its region.
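
The search for the threshold s can be sketched for a single feature as follows. This is a minimal illustration, not a library implementation: the helper names `rss` and `best_split` and the six data points are made up for the example, and candidate thresholds are assumed to be the midpoints between consecutive distinct values of x.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (s, total RSS) for the threshold s minimising RSS(x < s) + RSS(x >= s)."""
    distinct = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct x values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_s, best_total = None, float("inf")
    for s in candidates:
        left = [yi for xi, yi in zip(x, y) if xi < s]
        right = [yi for xi, yi in zip(x, y) if xi >= s]
        total = rss(left) + rss(right)
        if total < best_total:
            best_s, best_total = s, total
    return best_s, best_total


# Hypothetical data: two clearly separated groups of observations.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
s, total = best_split(x, y)
print(s, round(total, 3))  # 6.5 1.667
```

A full regression tree would repeat this search over every feature and then recurse into each of the two resulting regions until the stopping criterion is met.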

The process described above has a high chance of overfitting the training data, as the resulting tree can be very complex.


### Classification Trees

Regression trees are used for quantitative data. For qualitative (categorical) data, we use classification trees. In regression trees, we split the nodes based on the RSS criterion, but in classification trees the split is chosen using the classification error rate, Gini impurity or entropy.
Let’s understand these terms in detail.

#### Entropy
Entropy is the measure of randomness in the data. In other words, it gives the impurity present in the dataset.

<img src="entropy.PNG" width="300">
                                           
When we split a node into two regions and put different observations in each region, the main goal is to reduce the entropy, i.e. reduce the randomness in each region and divide the data more cleanly than it was in the parent node. If splitting the node doesn’t lead to a reduction in entropy, we try to split based on a different condition, or we stop.
A region is clean (low entropy) when it contains data with the same labels and random if there is a mixture of labels present (high entropy).
Let’s suppose there are ‘m’ observations and we need to classify them into categories 1 and 2.
Let’s say that category 1 has ‘n’ observations and category 2 has ‘m-n’ observations.

p = n/m  and  q = (m-n)/m = 1-p

then, entropy for the given set is:


          E = -p*log2(p) - q*log2(q)


When all the observations belong to category 1, p = 1, and when all of them belong to category 2, p = 0. In both cases E = 0, as there is no randomness in the categories.
If half of the observations are in category 1 and another half in category 2, then p =1/2 and q =1/2, and the entropy is maximum, E =1.
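
The two-category formula can be checked numerically with a short sketch. The function name `binary_entropy` is chosen for this example only, and the usual convention that 0*log2(0) counts as 0 is assumed so that pure nodes do not raise an error.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a node where a fraction p of observations is in category 1."""
    q = 1.0 - p
    # Terms with probability 0 are skipped: 0 * log2(0) is treated as 0.
    return sum(-t * math.log2(t) for t in (p, q) if t > 0)


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(binary_entropy(p), 3))
```

The values at p = 0 and p = 1 come out as 0, and the value at p = 0.5 is 1, matching the cases described above.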


<img src="entropy1.PNG" width="300">
                                  

#### Information Gain
Information gain calculates the decrease in entropy after splitting a node. It is the difference between entropies before and after the split. The more the information gain, the more entropy is removed. 

<img src="info_gain.PNG" width="300">

                                 
Where, T is the parent node before split and X is the split node from T.
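
As a rough sketch, information gain can be computed as the parent's entropy minus the size-weighted entropy of the child nodes. The function names and the yes/no labels below are illustrative assumptions, not taken from the figures.

```python
from collections import Counter
import math


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
perfect_split = [["yes"] * 5, ["no"] * 5]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that separates the labels completely removes all of the parent's entropy, while a split whose children keep the parent's 50/50 mix gains nothing.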

A tree that is split on the basis of entropy and information gain looks like:

<img src="entropy_tree.PNG" width="900">

#### Gini Impurity
According to Wikipedia, ‘Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it was randomly labelled according to the distribution of labels in the subset.’
It is calculated by summing, over all classes, the probability that a randomly chosen observation belongs to a class multiplied by the probability that it would be wrongly labelled, i.e. assigned to any other class.
Let’s suppose there are k classes and p(i) is the probability that an observation belongs to the class ‘i’; then Gini impurity is given as:

<img src="ginni.PNG" width="300">
                                    
Gini impurity lies between 0 and 1 - 1/k for k classes (0.5 for two classes): 0 means no impurity, and the maximum is reached when the observations are spread evenly across the classes.
The split with the lowest (weighted) Gini impurity is selected to split the node.
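
A minimal sketch of the calculation, using the common form 1 - sum(p(i)^2), which is algebraically the same as summing p(i) * (1 - p(i)) over the classes. The function name `gini_impurity` and the sample labels are assumptions made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i ** 2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["a", "a", "a", "a"]))   # a pure node
print(gini_impurity(["a", "b", "a", "b"]))   # two classes, evenly mixed
print(gini_impurity(["a", "b", "c"]))        # three classes, evenly mixed
```

The pure node scores 0, and the evenly mixed nodes score 1 - 1/k for k classes, which is 0.5 for two classes and about 0.667 for three.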


A tree that is split on the basis of Gini impurity looks like:

<img src="tree_example.PNG" width="900">





### Maths behind Decision Tree Classifier
Before we see the Python implementation of the decision tree, let's first understand the math behind decision tree classification. We will see how all the above-mentioned terms are used for splitting.

We will use a simple dataset which contains information about students of different classes and genders and see whether they stay in the school's hostel or not.

This is what our dataset looks like:


<img src='data_class.PNG' width="200">

Let's try to understand how the root node is selected by calculating Gini impurity. We will use the above-mentioned data.

We have two features which we can use for nodes: "Class" and "Gender".
We will calculate the Gini impurity for each of the features and then select the feature which has the least Gini impurity.

Let's review the formula for calculating Gini impurity:

<img src='example/gini.PNG' width="200">

Let's start with class: we will calculate the Gini impurity for all the different values in "class".

<img src='example/1.PNG' width="500">

<img src='example/2.PNG' width="500">

<img src='example/3.1.PNG' width="500">

<img src='example/3.PNG' width="500">

<img src='example/4.PNG' width="500">

<img src='example/5.PNG' width="500">

<img src='example/6.PNG' width="500">

<img src='example/7.PNG' width="500">

<img src='example/8.PNG' width="500">

This is how the decision tree node is selected: by calculating the Gini impurity of each candidate feature individually.
If the number of features increases, we just need to repeat the same steps after the selection of the root node.
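
The same procedure can be sketched in a few lines of Python. The rows below are hypothetical and invented for illustration (they are not the table shown in the figure), and the helper names `gini` and `weighted_gini` are assumptions. Each feature is scored by the Gini impurity of its groups, weighted by group size, and the feature with the lowest score becomes the root.

```python
from collections import Counter, defaultdict

# Hypothetical rows; "Hostel" plays the role of the target.
rows = [
    {"Class": "IX", "Gender": "M", "Hostel": "yes"},
    {"Class": "IX", "Gender": "F", "Hostel": "yes"},
    {"Class": "IX", "Gender": "M", "Hostel": "yes"},
    {"Class": "X", "Gender": "F", "Hostel": "no"},
    {"Class": "X", "Gender": "M", "Hostel": "no"},
    {"Class": "X", "Gender": "F", "Hostel": "no"},
    {"Class": "X", "Gender": "M", "Hostel": "yes"},
    {"Class": "IX", "Gender": "F", "Hostel": "yes"},
]


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(rows, feature, target="Hostel"):
    """Gini impurity of each value of `feature`, weighted by the share of rows it holds."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())


scores = {feature: weighted_gini(rows, feature) for feature in ("Class", "Gender")}
root = min(scores, key=scores.get)
print(scores, root)
```

With these made-up rows, "Class" separates the hostel labels more cleanly than "Gender", so it would be picked as the root node.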

We will now find the root node for the same dataset by calculating entropy and information gain.

DataSet:

<img src='data_class.PNG' width="200">

We have two features and we will try to choose the root node by calculating the information gain by splitting each feature.

Let's review the formulas for entropy and information gain:

<img src='example/formula_entropy.PNG' width="300">

<img src='example/inform_gain.PNG' width="300">


Let's start with feature "class" :

<img src='example/9.PNG' width="500">

<img src='example/10.1.PNG' width="500">

<img src='example/11.PNG' width="500">

<img src='example/12.PNG' width="500">

<img src='example/13.PNG' width="500">


Let's see the information gain from the feature "gender":

<img src='example/10.2.PNG' width="500">

<img src='example/14.PNG' width="500">

<img src='example/15.PNG' width="500">

<img src='example/16.PNG' width="500">







### Different Algorithms for Decision Tree


* ID3 (Iterative Dichotomiser 3): It is one of the algorithms used to construct decision trees for classification. It uses information gain as the criterion for choosing the root node and splitting nodes. It only accepts categorical attributes.

* C4.5: It is an extension of the ID3 algorithm and improves on it by handling both continuous and discrete values. It is also used for classification.


* Classification and Regression Trees (CART): It is the most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting the nodes to split; however, one can use "entropy" as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation.


Entropy and Gini impurity can be used interchangeably; the choice doesn't affect the result much. Gini is, however, cheaper to compute than entropy, since entropy involves a logarithm. That's why the CART algorithm uses Gini as the default criterion.

If we plot Gini impurity against entropy, we can see there is not much difference between them:

<img src="example/entropyVsGini.PNG" width = "400">
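
The similarity can also be seen numerically. The sketch below tabulates both measures for a two-class node as the share p of one class varies; the function names are chosen for this example only.

```python
import math


def gini(p):
    """Gini impurity of a two-class node with class shares p and 1 - p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def entropy(p):
    """Entropy in bits of a two-class node with class shares p and 1 - p."""
    return sum(-t * math.log2(t) for t in (p, 1.0 - p) if t > 0)


ps = [i / 10 for i in range(11)]
for p in ps:
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}")
```

Both curves are zero for a pure node and peak at p = 0.5; they differ in scale (0.5 versus 1 at the peak), not in where they rank a node as most or least mixed.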



##### Advantages of Decision Tree:

   * It can be used for both Regression and Classification problems.
   * Decision trees are very easy to grasp, as the splitting rules are clearly stated.
   * Even complex decision tree models are simple to follow once they are visualized.
   * Scaling and normalization are not needed.


##### Disadvantages of Decision Tree:


   * A small change in data can cause instability in the model because of the greedy approach.
   * Probability of overfitting is very high for Decision Trees.
   * Training a decision tree can take more time than training some simpler classification algorithms.

## Business Case: Based on the given features, we need to find whether an employee will leave the company or not.

In [None]:
## Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
## Target variable: Attrition

In [None]:
## Loading the data
data=pd.read_csv('HR-Employee-Attrition.csv')

## Basic Checks

In [None]:
data.head()#first five rows

In [None]:
## Getting all columns from the dataset
data.columns

In [None]:
pd.set_option('display.max_columns',None)#to display all columns from the dataset
data.head()


In [None]:
data.HourlyRate.value_counts()#number of appearance for each label in the columns

In [None]:
data.tail()#last five rows

In [None]:
data.describe()##used to view some basic statistical details like percentile, mean, std etc. 

In [None]:
# we do not have any null values
# we have one constant feature Employee count

In [None]:
data.describe(include=['O'])#It will give you info about categorical data/columns

In [None]:
data.info()#To check  data type and  null value of all columns  

## Exploratory Data Analysis

# Univariate Analysis

In [None]:
## Univariate Analysis
# installing the sweetviz library
!pip install sweetviz

In [None]:
import sweetviz as sv#importing sweetviz library 
my_report = sv.analyze(data)#syntax to use sweetviz
my_report.show_html()#Default arguments will generate to "SWEETVIZ_REPORT.html"

## Insights from univariate analysis
* people between the ages of 25 and 40 are the majority
* 70% of the people travel rarely, 20% travel frequently and the rest do not travel
* more than 70% of the employees belong to research and development
* almost 50% of the people live near the office, i.e. the distance from their home is less than or equal to 10.
* more than 60% of the people have an educational qualification of 4 or 5
* the largest group (40%) of the people are from the life science field and 30% are from the medical field
* 60% of the people are fairly satisfied with the environment of the office, with ratings of 3 or more.
* gender count: 60% male, 40% female
* 60% of the people have partial involvement in their job and 20% have good involvement
* more than 60% of the employees seem to be satisfied with their job
* 50% of the people are married, 30% single and the rest are divorced
* 60% of the people have less than 10k income
* 40% of the people have worked for 1 company or fewer, which implies they are freshers
* 30% of the people have worked for more than 5 companies
* 80% of the people have an average work rating
* 60% of the people have worked for the same company for 5 years or less
* 80% of the people own only 1 or 0 stock at the company

# Bivariate Analysis

## Checking the relationship of all variables with respect to the target variable

In [None]:
categorical_col = []#list of categorical columns
for column in data.columns:#for loop to access columns from the dataset
    if data[column].dtype == object and len(data[column].unique()) <= 50:#checking whether the datatype is object/string and the number of unique labels is at most 50
        categorical_col.append(column)#appending the columns that satisfy the condition
        print(f"{column} : {data[column].unique()}")#output
        print("====================================")

## Categorical Data

In [None]:
## Create a new dataframe with categorical variables only
data1=data[['Attrition',
 'BusinessTravel',
 'Department',
 'EducationField',
 'Gender',
 'JobRole',
 'MaritalStatus',
 'Over18',
 'OverTime']]

In [None]:
data1#new data frame with categorical columns only

In [None]:
# Plotting how every categorical feature relates to the target
plt.figure(figsize=(50,50), facecolor='white')#canvas size
plotnumber = 1#count variable

for column in data1:#for loop to access columns from data1
    if plotnumber<=16 :#checking whether the count variable is at most 16
        ax = plt.subplot(4,4,plotnumber)#placing up to 16 graphs on the canvas (4 rows and 4 columns)
        sns.countplot(x=data1[column].dropna(axis=0)#plotting count plot
                        ,hue=data.Attrition)
        plt.xlabel(column,fontsize=20)#assigning name to x-axis and increasing its font size
        plt.ylabel('Attrition',fontsize=20)#assigning name to y-axis and increasing its font size
    plotnumber+=1#increasing counter
plt.tight_layout()

## Insights from bivariate analysis
* these are the insights with respect to attrition
* more male employees are expected to quit their job
* people who travel more are more likely to leave the job
* people who do not work overtime do not leave the job
* singles are expected to quit the job
* people from the life science and medical fields are more likely to leave their job

In [None]:
numerical_col = []#list for continuous columns
for column in data.columns:#accessing columns from the dataset
    if pd.api.types.is_integer_dtype(data[column]) and data[column].nunique() >= 10: #checking whether the datatype is integer and the count of unique values is at least 10
        numerical_col.append(column) # inserting those columns in the list
        

In [None]:
numerical_col#printing list which contain continous columns

##  Discrete data

In [None]:
data3=data[['Education',
 'EmployeeCount',
 'EnvironmentSatisfaction',
 'JobInvolvement',
 'JobLevel',
 'JobSatisfaction',
 'NumCompaniesWorked',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TrainingTimesLastYear',
 'WorkLifeBalance']]#discrete columns

In [None]:
# Plotting how every  discrete feature correlate with the "target"
plt.figure(figsize=(20,25), facecolor='white')#canvas size
plotnumber = 1

for column in data3:
    if plotnumber<=16 :
        ax = plt.subplot(4,4,plotnumber)
        sns.countplot(x=data3[column].dropna(axis=0)
                        ,hue=data.Attrition)
        plt.xlabel(column,fontsize=20)
        plt.ylabel('Attrition',fontsize=20)
    plotnumber+=1
plt.tight_layout()

### Bivariate analysis of continuous variables

In [None]:
data2=data[['Age',
 'DailyRate',
 'DistanceFromHome',
 'EmployeeNumber',
 'HourlyRate',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'PercentSalaryHike',
 'TotalWorkingYears',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']]#continuous variables/columns

In [None]:
# Plotting how every numerical feature relates to the target
plt.figure(figsize=(20,25), facecolor='white')#canvas size
plotnumber = 1#counter for number of plots

for column in data2:#accessing columns from the data2 DataFrame
    if plotnumber<=16 :#checking whether the counter is at most 16
        ax = plt.subplot(4,4,plotnumber)#placing up to 16 graphs on the canvas (4 rows and 4 columns)
        sns.histplot(x=data2[column].dropna(axis=0)# plotting hist plot after dropping null values, coloured by the target
                        ,hue=data.Attrition)
        plt.xlabel(column,fontsize=20)#assigning name to x-axis and increasing its font size
        plt.ylabel('Attrition',fontsize=20)#assigning name to y-axis and increasing its font size
    plotnumber+=1#increasing counter by 1
plt.tight_layout()

## Final conclusions
BusinessTravel : The workers who travel a lot are more likely to quit than other employees.

Department : The workers in Research & Development are more likely to stay than the workers in other departments.

EducationField : The workers with Human Resources and Technical Degree backgrounds are more likely to quit than employees from other fields of education.

Gender : Male employees are more likely to quit.

JobRole : The workers in Laboratory Technician, Sales Representative, and Human Resources roles are more likely to quit than the workers in other positions.

MaritalStatus : The workers who are single are more likely to quit than those who are married or divorced.

OverTime : Attrition rate is almost equal

# Data Preprocessing

## Checking missing values/null values

In [None]:

data.isnull().sum()#null value checking 
# no null values

# Conversion of categorical columns into numerical columns

In [None]:
## Categorical data conversion
data1.head()

###  1.Attrition

In [None]:
data.Attrition.unique()#checking unique value in Attrition column

In [None]:
## Manual encoding Attrition feature
data.Attrition=data.Attrition.map({'Yes':1,'No':0})
data.head()#checking the encoded column (data1 is a separate copy and would not show the change)


###  2.BusinessTravel 

In [None]:
data.BusinessTravel.unique()#checking unique value

In [None]:
## Encoding BusinessTravel: this feature showed that workers who travel frequently quit more often, so let's do the
## manual encoding
data.BusinessTravel=data.BusinessTravel.map({'Travel_Frequently':2,'Travel_Rarely':1,'Non-Travel':0})


In [None]:
data.head()#checking whether the encoding was done properly

### 3.Department

In [None]:
data.Department.unique()#unique values

In [None]:
data.Department=data.Department.map({'Research & Development':2,'Sales':1,'Human Resources':0})#encoding using map function


### 4.EducationField

In [None]:
data.EducationField.unique()#unique labels

In [None]:
#using map function
data.EducationField=data.EducationField.map({'Life Sciences':5,'Medical':4,'Marketing':3,'Technical Degree':2,'Other':1,'Human Resources':0 })



In [None]:
data.head()#checking the encoding

### 5.Gender

In [None]:
data.Gender.value_counts()#checking weightage of each label whoever have high count 

In [None]:
## Encoding Gender
data['Gender']=pd.get_dummies(data['Gender'],drop_first=True,dtype=int).iloc[:,0]#keeps a single 0/1 dummy column (the first category is dropped)

In [None]:
data.Gender#checking whether the encoding was done or not


### JobRole

In [None]:
data.JobRole.value_counts()#checking count for each label

In [None]:
## Encoding JobRole
data.JobRole=data.JobRole.map({'Laboratory Technician':8,'Sales Executive':7,'Research Scientist':6,'Sales Representative':5,
                              'Human Resources':4,'Manufacturing Director':3,'Healthcare Representative':2,'Manager':1,'Research Director':0 })
  
   
  

In [None]:
data.JobRole#checking whether the encoding was done or not


### Encoding MaritalStatus using label encoding 


In [None]:
## Encoding MaritalStatus

from sklearn.preprocessing import LabelEncoder#importing label encoder from sklearn 

label = LabelEncoder()#object creation 
data.MaritalStatus=label.fit_transform(data.MaritalStatus)#applying label encoder to  marital status

In [None]:
data.MaritalStatus

### OverTime

In [None]:
## Encoding OverTime
data.OverTime=label.fit_transform(data.OverTime)#label encoding

In [None]:
data.head()#checking the encoding

## Feature Selection

In [None]:
## Checking correlation

plt.figure(figsize=(30, 30))#canvas size
sns.heatmap(data2.corr(), annot=True, cmap="RdYlGn", annot_kws={"size":15})#plotting heat map to check correlation

In [None]:
## Removing constant features
data.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'], axis="columns", inplace=True)#dropping columns that carry no information for the model (constant values or a row identifier)

In [None]:
data.describe()

## Model Creation

In [None]:
## Creating independent and dependent variable
X = data.drop('Attrition', axis=1)#independent variable 
y = data.Attrition#dependent variable 

In [None]:
## Preparing training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
## Balancing the training data
from collections import Counter# importing Counter to check the count of each label
from imblearn.over_sampling import SMOTE #for balancing the data
sm=SMOTE()#object creation
print(Counter(y_train))# checking count for each class before resampling
# oversampling only the training split, so the test split contains no synthetic rows
X_train,y_train=sm.fit_resample(X_train,y_train)
print(Counter(y_train))# checking count for each class after resampling

In [None]:
from sklearn.tree import DecisionTreeClassifier#importing decision tree from sklearn.tree
dt=DecisionTreeClassifier(criterion='entropy', max_depth=10, min_samples_leaf= 1, min_samples_split= 3, splitter= 'random')#object creation for decision tree  
dt.fit(X_train,y_train)#training the model
y_hat=dt.predict(X_test)#prediction
y_hat#predicted values 

In [None]:
y_train_predict=dt.predict(X_train)#predicting training data to check training performance 
y_train_predict

In [None]:
## Evaluating the model
from sklearn.metrics import accuracy_score,classification_report,f1_score    #importing metrics to check model performance
##Training score
y_train_predict=dt.predict(X_train)#passing X_train to predict Y_train
acc_train=accuracy_score(y_train,y_train_predict)#checking accuracy
acc_train


In [None]:
print(classification_report(y_train,y_train_predict))# it will give precision,recall,f1 scores and accuracy  

In [None]:
pd.crosstab(y_train,y_train_predict)#it will show you confusion matrix

In [None]:
## test acc
test_acc=accuracy_score(y_test,y_hat)#testing accuracy 
test_acc

In [None]:
## test score
test_f1=f1_score(y_test,y_hat)#f1 score
test_f1

In [None]:
print(classification_report(y_test,y_hat))# for  testing 

In [None]:
pd.crosstab(y_test,y_hat)# confusion matrix for the test data

## Hyperparameters of DecisionTree
* Hyperparameter tuning is searching the hyperparameter space for a set of values that optimizes the model's performance.


* criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.


* splitter: The strategy used to choose the split at each node. The default value is “best”: for each node, the algorithm evaluates the candidate features and thresholds and chooses the best split. If you set the splitter parameter to “random”, a random threshold is drawn for each candidate feature and the best of these random splits is chosen.



* max_depth: This determines the maximum depth of the tree. If it is left unset, the tree keeps growing until every leaf is pure or too small to split, which often results in over-fitted decision trees: the tree fits the training data perfectly and fails to generalize to the testing data. The depth parameter is one of the ways in which we can regularize the tree, i.e. limit the way it grows to prevent over-fitting.



* min_samples_split: The minimum number of samples required to split an internal node. The smallest allowed integer value is 2 (the default); a range of roughly 2 to 40 is a common starting point.



* min_samples_leaf: The minimum number of samples required to be at a leaf node. It is similar to min_samples_split, but it applies to the leaves, the base of the tree. A range of roughly 1 to 20 is a common starting point.
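
The regularising effect of max_depth can be illustrated on synthetic data. This is only a sketch: the dataset comes from scikit-learn's `make_classification`, not from the HR data, and the names `X_demo`, `Xtr`, `trees` and so on are chosen to avoid clashing with the notebook's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy two-class data (flip_y randomly flips some labels).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=4,
                                     flip_y=0.1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

trees = {}
for depth in (None, 3):  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    trees[depth] = tree
    print(f"max_depth={depth}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(Xtr, ytr):.3f}, test acc={tree.score(Xte, yte):.3f}")
```

The unrestricted tree memorises the training split, while the shallow tree is forced to keep only a handful of rules; comparing the two test accuracies shows how much of the extra depth was fitting noise.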


In [None]:
# Reference: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680

In [None]:
from sklearn.model_selection import GridSearchCV
#It helps to loop through predefined hyperparameters and fit your estimator (model) on your training set. 
#So,in the end, you can select the best parameters from the listed hyperparameters.

In [None]:

#creating dictionary--> key value pair of hyperparameters having key as parameter and values as its values
params = {
    "criterion":("gini", "entropy"), #quality of split
    "splitter":("best", "random"), # searches the features for a split
    "max_depth":(list(range(1, 20))), #depth of tree range from 1 to 19
    "min_samples_split":[2, 3, 4],    #the minimum number of samples required to split internal node
    "min_samples_leaf":list(range(1, 20)),#minimum number of samples required to be at a leaf node,we are passing list which is range from 1 to 19 
}


tree_clf = DecisionTreeClassifier(random_state=3)#object creation for decision tree with random state 3
tree_cv = GridSearchCV(tree_clf, params, scoring="f1", n_jobs=-1, verbose=1, cv=3)
#passing model to GridSearchCV,
#tree_clf-->model
#params---->hyperparameters (dictionary we created)
#scoring--->performance metric used to compare candidates
#n_jobs---->number of jobs to run in parallel, -1 means using all processors.
#verbose=controls the verbosity: the higher, the more messages.
#>1 : the computation time for each fold and parameter candidate is displayed;
#>2 : the score is also displayed;
#>3 : the fold and candidate parameter indexes are also displayed together with the starting time of the computation.
#cv------> number of folds




tree_cv.fit(X_train,y_train)#fitting the grid search on the training data
best_params = tree_cv.best_params_#it will give you the best parameters
print(f"Best parameters: {best_params}")#printing the best parameters



In [None]:
#Fitting 3 folds for each of 4332 candidates, totalling 12996 fits
#Best parameters: {'criterion': 'entropy', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
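
The candidate count reported by the grid search follows directly from the size of the parameter grid. A small sketch using scikit-learn's `ParameterGrid` reproduces it; the variable names `grid`, `n_candidates` and `n_folds` are chosen for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the params dictionary used for the grid search.
grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": list(range(1, 20)),
}

n_candidates = len(ParameterGrid(grid))  # 2 * 2 * 19 * 3 * 19 combinations
n_folds = 3
print(n_candidates, n_candidates * n_folds)  # 4332 12996
```

Every extra value added to any one hyperparameter multiplies the total, which is why an exhaustive grid search becomes slow quickly.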


In [None]:
tree_cv.best_params_#getting best parameters from cv

In [None]:
tree_cv.best_score_#getting best score form cv

In [None]:
dt1=DecisionTreeClassifier(**best_params)#passing the best parameters found by the grid search to the decision tree

In [None]:
dt1.fit(X_train,y_train)#training the model with the best parameters

In [None]:
y_hat1=dt1.predict(X_test)#predicting
y_hat1

In [None]:
acc_test=accuracy_score(y_test,y_hat1)#checking accuracy
acc_test

In [None]:
test_f1=f1_score(y_test,y_hat1)#f1_score
test_f1

In [None]:
print(classification_report(y_test,y_hat1))#it will give precision,recall,f1 scores and accuracy 

# What is random forest
* A random forest is a supervised machine learning algorithm that is built from many decision trees.


* It is used to solve regression and classification problems. It relies on ensemble learning, a technique that combines many models to provide solutions to complex problems.


* The random forest algorithm establishes the outcome based on the predictions of its decision trees: it averages the trees' outputs for regression and takes the majority class for classification. Increasing the number of trees generally makes the outcome more stable, at the cost of more computation.

**Working of random forest**
![](rf.png)

**The output side, where the trees' predictions are combined, is called aggregation**


**What is bootstrap in random forest?**
* When training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, known as bootstrapping, which means that some samples will be used multiple times in a single tree.
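
Bootstrapping can be sketched with the standard library alone. The row count of 1000 and the variable names are assumptions for illustration; the point is that drawing n rows with replacement leaves some rows repeated and others unused (on average about 63% of the rows appear at least once).

```python
import random

rng = random.Random(0)        # fixed seed so the draw is repeatable
n = 1000
indices = list(range(n))      # stand-ins for the row numbers of a training set

# Bootstrap sample: draw n row numbers with replacement.
sample = rng.choices(indices, k=n)

unique_fraction = len(set(sample)) / n
print(f"rows drawn: {len(sample)}, distinct rows: {unique_fraction:.1%}")
```

Each tree in the forest would get its own draw of this kind, so every tree sees a slightly different version of the training data.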




**For a regression task, it takes the average of the trees' outputs**



**For classification, it takes the class predicted by the most trees (majority vote)**
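
The two aggregation rules can be sketched as follows. The function names and the per-tree predictions are hypothetical. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this shows the idea rather than that library's exact procedure.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


def aggregate_regression(outputs):
    """Return the average of the trees' numeric predictions."""
    return mean(outputs)


tree_votes = [1, 0, 1, 1, 0]        # hypothetical class predictions from five trees
tree_outputs = [2.0, 3.0, 4.0]      # hypothetical numeric predictions from three trees

print(aggregate_classification(tree_votes))
print(aggregate_regression(tree_outputs))
```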

## RandomForest Implementation

In [None]:
from sklearn.ensemble import RandomForestClassifier#importing randomforest

rf_clf = RandomForestClassifier(n_estimators=100)#object creation ,taking 100 decision tree in random forest 
rf_clf.fit(X_train,y_train)#training the data

In [None]:
y_predict=rf_clf.predict(X_test)#testing

In [None]:

print(classification_report(y_test,y_predict))

In [None]:
f_Score=f1_score(y_test,y_predict)
f_Score

## Hyperparameter Tuning

* n_estimators = number of trees in the forest

* max_features = the maximum number of features the random forest considers when looking for the best split at each node. There are multiple options available in Python to set the maximum features

* max_depth = the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data

* min_samples_split = the minimum number of samples required to split an internal node. This can vary from 2 samples up to all of the samples at the node

* min_samples_leaf = minimum number of data points allowed in a leaf node
* bootstrap = method for sampling data points (with or without replacement)

In [None]:
#Random Search sets up a grid of hyperparameter values and selects random combinations to train the model and score.
#This allows you to explicitly control the number of parameter combinations that are attempted.
#The number of search iterations is set based on time or resources.
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]       #Number of decision trees
max_features = ['log2', 'sqrt']                                  #number of features considered when looking for the best split
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]      #List Comprehension-using for loop in list
max_depth.append(None)
min_samples_split = [2, 5, 10]#minimum number of samples required to split an internal node
min_samples_leaf = [1, 2, 4]#minimum number of samples required to be at a leaf node.
bootstrap = [True, False]#sampling 

#dictionary for hyperparameters
random_grid = {'n_estimators': n_estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}

rf_clf1 = RandomForestClassifier(random_state=42)#model

rf_cv = RandomizedSearchCV(estimator=rf_clf1, scoring='f1',param_distributions=random_grid, n_iter=100, cv=3, 
                               verbose=2, random_state=42, n_jobs=-1)

#estimator--->the model whose hyperparameters are tuned
#scoring--->performance metric used to compare candidates
#param_distributions-->hyperparameter values to sample from
#n_iter--->number of combinations to try
#cv------> number of folds
#verbose=controls the verbosity: the greater the number, the more detail you will get.
#n_jobs----> if you set n_jobs to -1, it will use all CPU cores. If it is set to 1 or 2, it will use one or two cores only





rf_cv.fit(X_train, y_train)                                  ##training data on randomsearch cv
rf_best_params = rf_cv.best_params_                          ##it will give you best parameters 
print(f"Best parameters: {rf_best_params}")                  ##printing the best parameters
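
To see what the random search saves, the sketch below counts the full grid and draws 100 combinations from it with scikit-learn's `ParameterSampler`. The dictionary `space` restates the same value lists in plain Python; the names `space`, `total` and `sampled` are chosen for this example.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

space = {
    "n_estimators": list(range(200, 2001, 200)),      # 200, 400, ..., 2000
    "max_features": ["log2", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],   # 10, 20, ..., 110 and None
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

total = len(ParameterGrid(space))                     # size of the exhaustive grid
sampled = list(ParameterSampler(space, n_iter=100, random_state=42))
print(total, len(sampled))
```

With n_iter=100 only a small fraction of the full grid is fitted, which is the trade-off the random search makes: less coverage in exchange for a fixed, controllable cost.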
 


In [None]:

#passing best parameter to randomforest
rf_clf2 = RandomForestClassifier(n_estimators= 1400, min_samples_split= 2, min_samples_leaf= 1, 
                                 max_features= 'log2', max_depth= 40, bootstrap= False)



rf_clf2.fit(X_train, y_train)

y_predict=rf_clf2.predict(X_test)

rf_f1=f1_score(y_test,y_predict)#stored under a new name so the f1_score function is not overwritten

In [None]:
rf_f1#calling variable