<h1> Day 26 - Class </h1>

## CART Algorithms Contd ...

### CHAID (Chi-square Automatic Interaction Detection)
Information Gain, the GINI Index and Entropy measure similarity within a group. <b>CHAID</b> instead measures the variance between groups: the higher the variance, the better the split.

- Information Gain is good for continuous variables
- GINI Index is good for binary variables

CHAID is one of the oldest decision tree algorithms. It was proposed in 1980 by Gordon V. Kass. CART followed in 1984, ID3 in 1986 and C4.5 in 1993. The name is an acronym of chi-square automatic interaction detection. Here, chi-square is a metric for the significance of a feature: the higher the value, the higher the statistical significance. Like the other algorithms, CHAID builds decision trees for classification problems, so it expects data sets with a categorical target variable.

CHAID uses chi-square tests to find the most dominant feature, whereas ID3 uses information gain, C4.5 uses gain ratio and CART uses the GINI index. Chi-square testing was introduced by Karl Pearson, who also developed the Pearson correlation coefficient. Today, most programming libraries (e.g. Pandas for Python) use the Pearson method for correlation by default.

The formula of chi-square testing is easy.

$$\sqrt{\frac{(y - y')^2}{y'}}$$

where y is actual and y’ is expected.
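
As a minimal sketch, the formula can be written as a small helper. The function name `chaid_term` is an assumption made for illustration and is not part of any library. Note that the usual Pearson chi-square statistic sums (y - y')^2 / y' without the square root; this sketch follows the formula as written in these notes.

```python
import math

def chaid_term(actual, expected):
    """Square root of (actual - expected)^2 / expected for a single cell."""
    return math.sqrt((actual - expected) ** 2 / expected)

# Example: 80 observed against an expected count of 108
print(round(chaid_term(80, 108), 6))
```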

Let's try a program for the CHAID formula. First, an example of the calculation is shown below.

<img src='img/chi-1.png' />

We pick the column with the highest CHAID score as the first node.

Let's try another example to understand this further

<img src='img/chaid-01.png' />

<b>Outlook feature</b>
The outlook feature has 3 classes: sunny, rain and overcast. There are 2 decisions: yes and no. We first count the yes and no decisions for each class.

<img src='img/chaid-02.png' />

The chi-square value of outlook is the sum of the chi-square values in the yes and no columns.

0.316 + 0.316 + 1.414 + 1.414 + 0.316 + 0.316 = 4.092
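
The sum above can be reproduced with a short sketch. The yes/no counts per class are an assumption here (sunny 2/3, overcast 4/0, rain 3/2), chosen because they give the per-cell values 0.316 and 1.414 listed above. The expected value is taken as the class total divided by 2, and the helper name `feature_score` is hypothetical.

```python
import math

# Assumed (yes, no) counts per outlook class; not stated in the text itself
outlook_counts = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

def feature_score(counts):
    """Sum sqrt((actual - expected)^2 / expected) over every class and decision."""
    total = 0.0
    for yes, no in counts.values():
        expected = (yes + no) / 2  # equal split between the two decisions
        for actual in (yes, no):
            total += math.sqrt((actual - expected) ** 2 / expected)
    return total

outlook_score = feature_score(outlook_counts)
print(round(outlook_score, 3))
```

Summing the unrounded terms gives about 4.093; the 4.092 above comes from adding the already rounded values.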

<b>Temperature feature</b>
This feature has 3 classes: hot, mild and cool. The following table summarizes the chi-square values for these classes.

<img src='img/chaid-03.png' />

Chi-square value of temperature feature will be

0 + 0 + 0.577 + 0.577 + 0.707 + 0.707 = 2.569

This value is less than the chi-square value of outlook, which means that outlook is more important than temperature based on chi-square testing.

<b>Humidity feature</b>
Humidity has 2 classes: high and normal. Let’s summarize the chi-square values.

<img src='img/chaid-04.png' />

So, the chi-square value of humidity feature is

0.267 + 0.267 + 1.336 + 1.336 = 3.207

This is less than the chi-square value of outlook as well. What about wind feature?

<b>Wind feature</b>

Wind feature has 2 classes: weak and strong. The following table is the pivot table.

<img src='img/chaid-05.png' />

Herein, the chi-square test value of the wind feature is

0.802 + 0.802 + 0 + 0 = 1.604

We’ve found the chi square values of all features. Let’s see them all in a table. 

<img src='img/chaid-06.png' />

As seen, the outlook feature has the highest chi-square value, which means it is the most significant feature. So we put this feature at the root node.

<img src='img/chaid-07.png' />

The overcast branch has only yes decisions in its sub data set. This means that the CHAID tree returns YES if outlook is overcast.

Outlook = Sunny branch

This branch has 5 instances. Now we look for the most dominant feature within it. We ignore the outlook column because all of its values are the same in this branch. In other words, we find the most dominant feature among temperature, humidity and wind by computing the chi-square value for each of them, and choose the next branch accordingly.

Humidity is the most dominant feature for the sunny outlook branch, so we use it as the decision rule.

<img src='img/chaid-08.png' />

### CHAID Sample program

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [45]:
data = sns.load_dataset('titanic')
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [21]:
y = data['survived']
x = data[['pclass','who','sex','parch']]

Calculate the CHAID score between pclass and survived

In [22]:
data['pclass'].value_counts()

3    491
1    216
2    184
Name: pclass, dtype: int64

In [23]:
chaid_pclass = pd.crosstab(data['pclass'],data['survived']) # 0 - did not survive, 1 - survived
chaid_pclass

survived,0,1
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,80,136
2,97,87
3,372,119


In [24]:
chaid_pclass.values.sum(axis=1)

array([216, 184, 491], dtype=int64)

The expected value here is simply the average: the row total divided by the number of target classes (2).

In [25]:
expected_survived = chaid_pclass.values.sum(axis=1)/2
expected_survived

array([108. ,  92. , 245.5])

In [26]:
chaid_pclass['expected_survived_0'] = expected_survived
chaid_pclass['expected_survived_1'] = expected_survived
chaid_pclass

survived,0,1,expected_survived_0,expected_survived_1
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,80,136,108.0,108.0
2,97,87,92.0,92.0
3,372,119,245.5,245.5


In [27]:
chaid_pclass['survived_0_deviation'] = chaid_pclass['expected_survived_0'] - chaid_pclass[0]
chaid_pclass['survived_1_deviation'] = chaid_pclass['expected_survived_1'] - chaid_pclass[1]
chaid_pclass

survived,0,1,expected_survived_0,expected_survived_1,survived_0_deviation,survived_1_deviation
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,80,136,108.0,108.0,28.0,-28.0
2,97,87,92.0,92.0,-5.0,5.0
3,372,119,245.5,245.5,-126.5,126.5


Let's apply the formula now

In [28]:
chaid_0 = np.sqrt(chaid_pclass['survived_0_deviation'] ** 2 / chaid_pclass['expected_survived_0'])
chaid_1 = np.sqrt(chaid_pclass['survived_1_deviation'] ** 2 / chaid_pclass['expected_survived_1'])
chaid_pclass['chaid_score_survived_0']  = chaid_0
chaid_pclass['chaid_score_survived_1']  = chaid_1
chaid_pclass

survived,0,1,expected_survived_0,expected_survived_1,survived_0_deviation,survived_1_deviation,chaid_score_survived_0,chaid_score_survived_1
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,80,136,108.0,108.0,28.0,-28.0,2.694301,2.694301
2,97,87,92.0,92.0,-5.0,5.0,0.521286,0.521286
3,372,119,245.5,245.5,-126.5,126.5,8.073554,8.073554


In [31]:
total_chaid_score = (chaid_0.values + chaid_1.values).sum()
total_chaid_score

22.57828343341873

Calculate the CHAID score between sex and survived

In [32]:
chaid_sex = pd.crosstab(data['sex'],data['survived']) # 0 - did not survive, 1 - survived
chaid_sex

survived,0,1
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,81,233
male,468,109


Find the expected value for sex

In [35]:
expected_survived = chaid_sex.values.sum(axis=1)/2
expected_survived

array([157. , 288.5])

In [36]:
chaid_sex['expected_survived_0'] = expected_survived
chaid_sex['expected_survived_1'] = expected_survived
chaid_sex

survived,0,1,expected_survived_0,expected_survived_1
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,81,233,157.0,157.0
male,468,109,288.5,288.5


In [37]:
chaid_sex['survived_0_deviation'] = chaid_sex['expected_survived_0'] - chaid_sex[0]
chaid_sex['survived_1_deviation'] = chaid_sex['expected_survived_1'] - chaid_sex[1]
chaid_sex

survived,0,1,expected_survived_0,expected_survived_1,survived_0_deviation,survived_1_deviation
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
female,81,233,157.0,157.0,76.0,-76.0
male,468,109,288.5,288.5,-179.5,179.5


In [38]:
chaid_0 = np.sqrt(chaid_sex['survived_0_deviation'] ** 2 / chaid_sex['expected_survived_0'])
chaid_1 = np.sqrt(chaid_sex['survived_1_deviation'] ** 2 / chaid_sex['expected_survived_1'])
chaid_sex['chaid_score_survived_0']  = chaid_0
chaid_sex['chaid_score_survived_1']  = chaid_1
chaid_sex

survived,0,1,expected_survived_0,expected_survived_1,survived_0_deviation,survived_1_deviation,chaid_score_survived_0,chaid_score_survived_1
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
female,81,233,157.0,157.0,76.0,-76.0,6.06546,6.06546
male,468,109,288.5,288.5,-179.5,179.5,10.567969,10.567969


In [39]:
total_chaid_score = (chaid_0.values + chaid_1.values).sum()
total_chaid_score

33.266859301707605
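
The two calculations above follow the same steps, so they can be wrapped in one helper. This is only a sketch: `chaid_score` is a hypothetical name, and the counts are copied from the crosstab outputs above so that the block runs without loading the data set.

```python
import math

def chaid_score(table):
    """table maps each class to its (count for survived=0, count for survived=1)."""
    score = 0.0
    for counts in table.values():
        expected = sum(counts) / len(counts)  # row total / number of outcomes
        score += sum(math.sqrt((c - expected) ** 2 / expected) for c in counts)
    return score

pclass_table = {1: (80, 136), 2: (97, 87), 3: (372, 119)}
sex_table = {"female": (81, 233), "male": (468, 109)}

scores = {"pclass": chaid_score(pclass_table), "sex": chaid_score(sex_table)}
best_feature = max(scores, key=scores.get)
print(scores, best_feature)
```

With these counts sex scores higher than pclass, so of the two it would be preferred for the first split.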

## Regression Algorithms

<b>Decision Tree Regression & Random Forest Regression</b>. These regression algorithms are not used very frequently.

### Decision Tree Regression
Two measures are used here: the coefficient of deviation (coefficient of variation) and the standard deviation.

<img src='img/dec-tree-reg-1.png'/>

Just like we calculated Information Gain, for each predictor variable we calculate the reduction in standard deviation,
i.e. S(T) - S(T,X1) , S(T) - S(T,X2) , S(T) - S(T,X3)

Let's see the formula using the titanic data set

In [40]:
data

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [46]:
y = data['age']
x = data[['pclass','fare','who']]

y.isna().sum()

177

In [47]:
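# fill missing ages with the median; assign back instead of using inplace on a column
data['age'] = data['age'].fillna(data['age'].median())
y = data['age']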
y.isna().sum()

0

In [48]:
x.isna().sum()

pclass    0
fare      0
who       0
dtype: int64

In [49]:
x

Unnamed: 0,pclass,fare,who
0,3,7.2500,man
1,1,71.2833,woman
2,3,7.9250,woman
3,1,53.1000,woman
4,3,8.0500,man
...,...,...,...
886,2,13.0000,man
887,1,30.0000,woman
888,3,23.4500,woman
889,1,30.0000,man


In [50]:
y

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

#### Convert continuous variables to class variables (one approach is shown below)

We do this because otherwise the tree would grow too large.

In [51]:
# work on an explicit copy so the assignment does not raise SettingWithCopyWarning
x = x.copy()
# 1 if the fare is below the mean fare, else 0
x['fare'] = (x['fare'] < x['fare'].mean()).astype(int)


In [52]:
x.head()

Unnamed: 0,pclass,fare,who
0,3,1,man
1,1,0,woman
2,3,1,woman
3,1,0,woman
4,3,1,man


#### Let's find the coefficient of variation using the formula (Standard Deviation / Mean) * 100

In [54]:
cv_y = (y.std()/y.mean()) * 100
cv_y

44.342625451832305

##### As with information gain, we have to do this for each predictor variable

In [60]:
# assign returns a new DataFrame, so no SettingWithCopyWarning is raised
x = x.assign(y=y)
x


Unnamed: 0,pclass,fare,who,y
0,3,1,man,22.0
1,1,0,woman,38.0
2,3,1,woman,26.0
3,1,0,woman,35.0
4,3,1,man,35.0
...,...,...,...,...
886,2,1,man,27.0
887,1,1,woman,19.0
888,3,1,woman,28.0
889,1,1,man,26.0


In [63]:
x.groupby('pclass').std(numeric_only=True)  # skip the non-numeric 'who' column

Unnamed: 0_level_0,fare,y
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.441764,14.182103
2,0.325338,13.581096
3,0.239758,10.697676


In [65]:
x.groupby('pclass').count()

Unnamed: 0_level_0,fare,who,y
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,216,216,216
2,184,184,184
3,491,491,491


In [64]:
# standard deviation of y within each pclass, as a one-column DataFrame
y_x1_cv = x.groupby('pclass')['y'].std().to_frame()
y_x1_cv

Unnamed: 0_level_0,y
pclass,Unnamed: 1_level_1
1,14.182103
2,13.581096
3,10.697676


In [67]:
y_x1_cv['count'] = x.groupby('pclass').count()['y']
y_x1_cv

Unnamed: 0_level_0,y,count
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,14.182103,216
2,13.581096,184
3,10.697676,491


In [69]:
y_x1_cv.columns = ['std dev for each class','count']
y_x1_cv

Unnamed: 0_level_0,std dev for each class,count
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,14.182103,216
2,13.581096,184
3,10.697676,491


In [70]:
y_x1_cv['probability'] = y_x1_cv['count'] / y_x1_cv['count'].sum()
y_x1_cv

Unnamed: 0_level_0,std dev for each class,count,probability
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,14.182103,216,0.242424
2,13.581096,184,0.20651
3,10.697676,491,0.551066


In [72]:
y_x1_cv['deviation coefficient'] = y_x1_cv['std dev for each class'] * y_x1_cv['probability']
y_x1_cv

Unnamed: 0_level_0,std dev for each class,count,probability,deviation coefficient
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,14.182103,216,0.242424,3.438086
2,13.581096,184,0.20651,2.804626
3,10.697676,491,0.551066,5.895128


In [73]:
y_x1_cv['deviation coefficient'].sum()

12.137839373154554

We repeat the above for each of the predictor variables to find the deviation coefficients for y_x2, y_x3, etc.

#### Standard Deviation Reduction

<img src='img/sdr-1.png'/>

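So we have S(T), the standard deviation of the target (`y.std()`), and S(T,X), the weighted standard deviation computed above from `y_x1_cv`. To calculate SDR we subtract the second from the first: SDR(T,X) = S(T) - S(T,X). Note that `cv_y` above is the coefficient of variation, which is a different quantity from S(T).
We do this calculation for each of the X variables, and the X variable with the highest SDR is chosen as the first node in the tree.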
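
A minimal sketch of the SDR calculation, assuming the sample standard deviation (the pandas default) and made-up toy values. The function names `weighted_std` and `sdr` are hypothetical.

```python
import statistics

def weighted_std(groups):
    """S(T, X): per-group sample std dev weighted by the group's share of rows."""
    n = sum(len(values) for values in groups.values())
    return sum(len(values) / n * statistics.stdev(values) for values in groups.values())

def sdr(groups):
    """Standard deviation reduction: S(T) - S(T, X)."""
    all_values = [v for values in groups.values() for v in values]
    return statistics.stdev(all_values) - weighted_std(groups)

# Toy target values grouped by a made-up predictor with classes "A" and "B"
toy_groups = {"A": [10, 12, 14], "B": [30, 34]}
toy_sdr = sdr(toy_groups)
print(round(toy_sdr, 3))
```

Running `sdr` once per predictor and keeping the largest value corresponds to choosing the first node as described above.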

The main drawback of this algorithm shows up at prediction time: for a new record, the algorithm finds the leaf the record belongs to, and the prediction is the average of all the records in that leaf node. The new record's prediction therefore depends entirely on the other records in that leaf.

### Random Forest Regression

We split the data into multiple samples. Samples can be drawn across

1) Columns
2) Rows

We then run Decision Tree Regression on each sample and take the average of all the predictions as the final prediction from Random Forest Regression. The example below illustrates this.
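
The sampling and averaging steps can be sketched in plain Python. This is an illustration under assumptions, not a full random forest: the rows are made up, the tree predictions are supplied by hand, and `draw_sample` and `forest_prediction` are hypothetical helper names. Library implementations such as scikit-learn typically redraw the column subset at each split rather than once per tree.

```python
import random
import statistics

def draw_sample(rows, n_columns, rng):
    """Bootstrap the rows (with replacement) and pick a random subset of columns."""
    sampled_rows = rng.choices(rows, k=len(rows))
    columns = sorted(rng.sample(range(len(rows[0])), n_columns))
    return [[row[c] for c in columns] for row in sampled_rows], columns

def forest_prediction(tree_predictions):
    """The forest's regression output is the mean of the per-tree predictions."""
    return statistics.mean(tree_predictions)

rng = random.Random(0)
rows = [[3, 7.25, 0], [1, 71.28, 1], [3, 7.92, 1], [1, 53.10, 1]]  # made-up feature rows
sample, used_columns = draw_sample(rows, n_columns=2, rng=rng)
print(used_columns, sample)

# Suppose three trees predicted these values for one new record
print(forest_prediction([24.0, 30.0, 27.0]))
```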

<img src='img/rfr-1.png'/>

## Bagging Classifier & Regression

In Random Forest we created the samples by splitting the data across both rows and columns. In Bagging we only sample rows, i.e. we never drop columns in Bagging Classifier and Regression.
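
A small sketch of the difference using scikit-learn, on synthetic data from `make_regression` (the data and parameter values are assumptions chosen for illustration). By default `BaggingRegressor` keeps every column (`max_features=1.0`), while `RandomForestRegressor` can be told to consider only a subset of columns at each split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

# Synthetic regression data, only for illustration
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Bagging: bootstrap the rows, keep every column (max_features defaults to 1.0)
bagging = BaggingRegressor(n_estimators=20, random_state=0).fit(X, y)

# Random forest: bootstrap the rows and consider only 2 random columns at each split
forest = RandomForestRegressor(n_estimators=20, max_features=2, random_state=0).fit(X, y)

print(bagging.max_features, forest.max_features)
print(bagging.predict(X[:3]), forest.predict(X[:3]))
```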

## GridSearchCV

In [76]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor,BaggingClassifier,BaggingRegressor

In [77]:
RandomForestClassifier()

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Take a look at the list of hyperparameters available for the algorithm. Most algorithms have similar kinds of parameters. GridSearchCV lets us run through a range of values for each hyperparameter, e.g. whether to use gini or entropy (information gain) as the criterion, and similarly for all the other hyperparameters. GridSearchCV helps us with <b> hyperparameter tuning </b>
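
A minimal sketch of a grid search over a few of these hyperparameters. The iris data set and the specific grid values are assumptions picked to keep the example small; they are not from the class.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Every combination of these values is tried: 2 * 3 * 2 = 12 candidates
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, None],
    "n_estimators": [10, 50],
}

# cv=3 means each candidate is scored with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best combination found.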

<img src="img/gridsearch-2.png"/>

## Stacking In Machine Learning Algorithms

Stacking is a form of ensembling. It follows a meta-modeling technique.

Basically, the output from one model is passed as input to the next model, and so on. We can build multiple such layers.


In this technique, we make several predictions from a number of models and then use a different model to train on these predictions.

Steps involved would be,

1) splitting the train set into two disjoint sets

2) train several base learners on the first part

3) make predictions with the base learners on the second part

4) use predictions from 3) as the inputs to train a higher level trainer

<b>This is an explanation from one of the videos I watched</b>

<img src='img/blended-2.png' />

After this, train algorithm 3 on B1 and make predictions for C1

<pre>
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split

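# train, y and test are assumed to exist already: training features, training target, unseen features
# split the train data itself into 2 disjoint parts
x_train, x_test , y_train, y_test = train_test_split(train,y,test_size=0.5)
model1 = RandomForestRegressor()
model2 = LinearRegression()

# train the base learners on the first part
model1.fit(x_train,y_train)
model2.fit(x_train,y_train)

# base-learner predictions on the second part become the meta-model's training inputs
preds1 = model1.predict(x_test)
preds2 = model2.predict(x_test)

# base-learner predictions on the unseen test set
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)

# one column per base learner
stacked_predictions = np.column_stack((preds1,preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))

# train the higher-level model on the stacked predictions
meta_model = LinearRegression()
meta_model.fit(stacked_predictions,y_test)
final_predictions = meta_model.predict(stacked_test_predictions)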
</pre>

## Blended Stacking

This is another approach to stacking

<img src='img/blended-1.png'/>

## Boosting algorithms

Unlike many ML models, which focus on high-quality prediction by a single model, boosting algorithms seek to improve predictive power by training a sequence of weak models, each compensating for the weaknesses of its predecessors.

In the sequence, each model tries to correct the errors of the previous model. Each incorrect prediction is penalized with a weight and thereby becomes a candidate for improvement in the next model, and so on until we get the optimum result.
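
The notes do not name a specific boosting algorithm. As a sketch, assuming AdaBoost-style reweighting, one round of the weight update could look like this. The helper name `reweight` and the sample outcomes are made up for illustration.

```python
import math

def reweight(weights, is_correct):
    """One AdaBoost-style round: raise the weights of misclassified samples, then normalise.

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # influence of this weak model
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.2] * 5
is_correct = [True, True, False, True, True]  # made-up outcome of one weak model
new_weights, alpha = reweight(weights, is_correct)
print(new_weights)
```

The misclassified third sample ends up with the largest weight, so the next weak model in the sequence pays more attention to it.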

<img src='img/boost-1.png'/>

## TODO

- Make notes on Blended Stacking
- Yet to teach Boosting in detail
- Like GridSearchCV there are other approaches for Hyper parameter tuning. Yet to cover that in the course