<h1> Day 30 - Class </h1>

## Covariance

The covariance is a measure of how two variables are related to each other, i.e., how they vary together.

Let n be the population size, x and y two different features (variables), and μ the population mean; the covariance can then be formally defined as:

<img src='img/covariance-01.png'/>

A covariance of 0 indicates that two variables have no linear relationship (they may still be related in a nonlinear way). If the covariance is positive, the variables tend to increase together, and if the covariance is negative, the variables tend to change in opposite directions.
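
As a minimal sketch of the definition above, the covariance can be computed by hand and checked against NumPy. This assumes the population form with a 1/n normalization; the arrays `x` and `y` and the helper `population_covariance` are made up for illustration.

```python
import numpy as np

# Hypothetical example data: two features measured on the same 5 samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def population_covariance(a, b):
    """Mean of the products of the deviations from each feature's mean (1/n form)."""
    return np.mean((a - a.mean()) * (b - b.mean()))

cov_xy = population_covariance(x, y)

# np.cov divides by n - 1 by default; bias=True switches to division by n
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_xy, cov_numpy)
```

The sign of `cov_xy` is positive here because `y` tends to grow when `x` grows; its magnitude depends on the units of both features.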

Pearson’s ρ or “r” (typically just called the “correlation coefficient”) measures the linear correlation between two features and is closely related to the covariance. In fact, it’s a normalized version of the covariance as shown below:

<img src='img/covariance-02.png'/>

By dividing the covariance by the features’ standard deviations, we ensure that the correlation between two features is in the range [-1, 1], which makes it more interpretable than the unbounded covariance. However, note that the covariance and correlation are exactly the same if the features are normalized to unit variance (e.g., via standardization or z-score normalization). Two features are perfectly positively correlated if ρ=1 and perfectly negatively correlated if ρ=−1. No linear correlation is observed if ρ=0.
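
The relationship described above can be checked numerically. The sketch below assumes population (1/n) statistics throughout; the data and the helper names `pearson_r` and `standardize` are illustrative only.

```python
import numpy as np

# Hypothetical example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Covariance divided by the product of the standard deviations."""
    cov_ab = np.mean((a - a.mean()) * (b - b.mean()))
    return cov_ab / (a.std() * b.std())  # np.std uses the 1/n form by default

def standardize(a):
    """z-score normalization: zero mean, unit variance."""
    return (a - a.mean()) / a.std()

r = pearson_r(x, y)

# After standardization both means are 0, so the covariance is just the mean product
cov_standardized = np.mean(standardize(x) * standardize(y))

# NumPy's built-in correlation coefficient, for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(r, cov_standardized, r_numpy)
```

All three printed values should agree up to floating-point error, which illustrates that the correlation is the covariance of the standardized features.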

In [1]:
from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a matrix
A = array([[1, 2], [3, 4], [5, 6]])
print(A)
print()
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
print()
# center columns by subtracting column means
C = A - M
print(C)
print()
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
print()
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print()
print(values)
print()
# project data
P = vectors.T.dot(C.T)
print(P.T)
print()

[[1 2]
 [3 4]
 [5 6]]

[3. 4.]

[[-2. -2.]
 [ 0.  0.]
 [ 2.  2.]]

[[4. 4.]
 [4. 4.]]

[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]

[8. 0.]

[[-2.82842712  0.        ]
 [ 0.          0.        ]
 [ 2.82842712  0.        ]]



In [None]:
## In the final output above we could keep only the first principal component (PC1) for further analysis, thereby reducing the number of dimensions.
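
Keeping only the first component relies on knowing which eigenvector has the largest eigenvalue, and `numpy.linalg.eig` does not guarantee any particular ordering. A minimal sketch of one way to make the selection explicit, using `numpy.linalg.eigh` (intended for symmetric matrices) and sorting; the variable names `k`, `P1` and `explained` are illustrative:

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])
C = A - A.mean(axis=0)             # center each column
V = np.cov(C, rowvar=False)        # covariance matrix of the columns

# eigh returns the eigenvalues of a symmetric matrix in ascending order
values, vectors = np.linalg.eigh(V)
order = np.argsort(values)[::-1]   # indices of the eigenvalues, largest first
values, vectors = values[order], vectors[:, order]

k = 1                              # number of components to keep
P1 = C @ vectors[:, :k]            # project the centered data onto the top-k eigenvectors
explained = values[:k].sum() / values.sum()  # fraction of the variance retained

print(P1)
print(explained)
```

The sign of the projected values may be flipped compared with the output above, because an eigenvector is only defined up to its sign.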

In [4]:
# sklearn way of coding the same
from sklearn.decomposition import PCA
A = array([[1,2],[3,4],[5,6]])
print(A)
pca = PCA()
pca.fit(A)
print(pca.components_)
print(pca.explained_variance_)
B = pca.transform(A)
B

[[1 2]
 [3 4]
 [5 6]]
[[ 0.70710678  0.70710678]
 [ 0.70710678 -0.70710678]]
[8.00000000e+00 2.25080839e-33]


array([[-2.82842712e+00,  2.22044605e-16],
       [ 0.00000000e+00,  0.00000000e+00],
       [ 2.82842712e+00, -2.22044605e-16]])

In [5]:
# same as above with sklearn, but keeping only the first principal component
from sklearn.decomposition import PCA
A = array([[1,2],[3,4],[5,6]])
print(A)
pca = PCA(n_components=1)
pca.fit(A)
print(pca.components_)
print(pca.explained_variance_)
B = pca.transform(A)
B

[[1 2]
 [3 4]
 [5 6]]
[[0.70710678 0.70710678]]
[8.]


array([[-2.82842712],
       [ 0.        ],
       [ 2.82842712]])

## AdaBoost

A tree with just one node and two leaves is called a stump. In AdaBoost we build a forest of stumps. Because each stump uses only one feature to make its decision, it is a weak learner, which is exactly what AdaBoost wants.

<img src='img/adaboost-01.png'/>

In a Random Forest, each decision tree gets an equal vote in the final classification, whereas in AdaBoost some stumps get more say than others.

In a Random Forest, the order in which the trees are built doesn't matter, whereas in AdaBoost the order matters: each stump is built taking the errors of the previous stump into account.

<img src='img/adaboost-02.png'/>

<img src='img/adaboost-03.png'/>

<img src='img/adaboost-04.png'/>

<img src='img/adaboost-05.png'/>

<img src='img/adaboost-06.png'/>

<img src='img/adaboost-07.png'/>

<img src='img/adaboost-08.png'/>

<img src='img/adaboost-09.png'/>

We compute the Gini index for each of the features.

<img src='img/adaboost-10.png'/>
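
A minimal sketch of how the Gini index of a stump can be computed, assuming the usual definition (impurity of a leaf is 1 minus the sum of squared class proportions, and the stump's score is the size-weighted average of its two leaves). The feature names and counts below are hypothetical and are not taken from the figures.

```python
def gini_leaf(n_yes, n_no):
    """Gini impurity of one leaf from its class counts."""
    total = n_yes + n_no
    if total == 0:
        return 0.0
    p_yes = n_yes / total
    p_no = n_no / total
    return 1.0 - p_yes**2 - p_no**2

def gini_stump(left, right):
    """Size-weighted average of the impurities of the two leaves."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini_leaf(*left) + (n_right / n) * gini_leaf(*right)

# Hypothetical (yes, no) heart-disease counts in the left and right leaf of each stump
candidate_stumps = {
    "feature_a": ((3, 2), (2, 1)),
    "feature_b": ((3, 0), (2, 3)),
}

scores = {name: gini_stump(*leaves) for name, leaves in candidate_stumps.items()}
best = min(scores, key=scores.get)  # the lowest Gini index is the purest split

print(scores, best)
```

The stump with the lowest score separates the classes best and would be chosen first.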


Patient weight has the lowest Gini index, so it will be the first stump in the forest. Now we have to determine how much say this stump will have in the final classification.

<img src='img/adaboost-11.png'/>

<img src='img/adaboost-12.png'/>

The higher the total error, the lower the amount of say, and vice versa.
<img src='img/adaboost-13.png'/>
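
The relationship between total error and amount of say can be sketched in a few lines. This assumes the standard AdaBoost formula, amount of say = ½ · ln((1 − total error) / total error); the function name and the small `eps` guard against division by zero are illustrative choices.

```python
import math

def amount_of_say(total_error, eps=1e-10):
    """Standard AdaBoost weight of a stump, given its total (weighted) error."""
    # Clip the error away from 0 and 1 so the logarithm stays finite
    total_error = min(max(total_error, eps), 1 - eps)
    return 0.5 * math.log((1 - total_error) / total_error)

for err in (0.1, 0.125, 0.5, 0.9):
    print(err, round(amount_of_say(err), 3))
```

An error of 0.5 (no better than a coin flip) gives zero say, and an error above 0.5 gives a negative say, which effectively flips the stump's vote.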

<img src='img/adaboost-14.png'/>

<img src='img/adaboost-15.png'/>

<img src='img/adaboost-16.png'/>

<img src='img/adaboost-17.png'/>

<img src='img/adaboost-18.png'/>


For the <b>incorrectly</b> classified samples, we <b> INCREASE </b> the sample weight in the next iteration.

<img src='img/adaboost-19.png'/>

<img src='img/adaboost-20.png'/>

For the <b>correctly</b> classified samples, we <b> DECREASE </b> the sample weight in the next iteration.
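
The two update rules can be sketched together. This assumes the standard AdaBoost update (multiply by e^(amount of say) for misclassified samples, by e^(−amount of say) for correctly classified ones, then normalize so the weights sum to 1); the 8 samples and the single misclassified index are hypothetical.

```python
import numpy as np

n_samples = 8
weights = np.full(n_samples, 1 / n_samples)   # every sample starts with the same weight

# Hypothetical outcome of the first stump: only sample 3 was misclassified
misclassified = np.zeros(n_samples, dtype=bool)
misclassified[3] = True

total_error = weights[misclassified].sum()
say = 0.5 * np.log((1 - total_error) / total_error)

# Increase the weights of misclassified samples, decrease the others
new_weights = np.where(misclassified,
                       weights * np.exp(say),
                       weights * np.exp(-say))
new_weights /= new_weights.sum()              # normalize so the weights sum to 1

print(new_weights)
```

After normalization the misclassified sample carries much more weight than before, so the next stump is pushed to classify it correctly.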

<img src='img/adaboost-21.png' />

<img src='img/adaboost-22.png' />

<img src='img/adaboost-23.png' />

<img src='img/adaboost-24.png' />

<img src='img/adaboost-25.png' />

<img src='img/adaboost-26.png' />

<img src='img/adaboost-27.png' />

<img src='img/adaboost-28.png' />

<img src='img/adaboost-29.png' />

<img src='img/adaboost-30.png' />

<img src='img/adaboost-31.png' />

<img src='img/adaboost-32.png' />

<img src='img/adaboost-33.png' />

<img src='img/adaboost-34.png' />

<img src='img/adaboost-35.png' />

<img src='img/adaboost-36.png' />

<img src='img/adaboost-37.png' />

<img src='img/adaboost-38.png' />

Suppose we are predicting whether a person has heart disease. Each stump classifies the sample as 'yes' or 'no'. We add up the amounts of say of the stumps that say 'yes' and, separately, of those that say 'no'; the group with the larger total decides the final classification.
<img src='img/adaboost-39.png' />
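
A minimal sketch of this weighted vote. The list of (prediction, amount of say) pairs is hypothetical, and `weighted_vote` is an illustrative helper, not a library function.

```python
from collections import defaultdict

def weighted_vote(votes):
    """Sum the amount of say per predicted label and return the label with the larger total."""
    totals = defaultdict(float)
    for label, say in votes:
        totals[label] += say
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Hypothetical stumps: (predicted label, amount of say)
votes = [("yes", 0.97), ("yes", 0.32), ("no", 0.41), ("yes", 0.63), ("no", 0.82)]

prediction, totals = weighted_vote(votes)
print(prediction, totals)
```

Note that the result depends on the summed say, not on the number of stumps on each side: a few confident stumps can outvote many weak ones.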

## Gradient Boosting - Regression

<img src='img/gbr-01.png'/>

This leaf represents an initial guess for the weight (the value we want to predict) of all the samples.

<img src='img/gbr-02.png'/>

<img src='img/gbr-03.png'/>

Also, like AdaBoost, Gradient Boost scales the trees. However, Gradient Boost scales all trees by the same amount.

<img src='img/gbr-04.png'/>

<img src='img/gbr-05.png'/>

<img src='img/gbr-06.png'/>

<img src='img/gbr-07.png'/>

It might seem strange to predict the residuals instead of the weight, but let's just go with the flow.

<img src='img/gbr-08.png'/>

In this example we allow only up to 4 leaves, but with a larger dataset it is common to allow anywhere from 8 to 32. Because we restrict the number of leaves, some residuals end up in the same leaf; in that case the leaf's output is the average of those residuals.

<img src='img/gbr-09.png'/>

<img src='img/gbr-10.png'/>

<img src='img/gbr-11.png'/>

Is this great? No. The model fits the training data too well. In other words, we have low bias but probably very high variance.

<img src='img/gbr-12.png'/>

<img src='img/gbr-13.png'/>

But it's a little bit better than the prediction made with just the original leaf, which predicted that all samples would weigh 71.2. According to Jerome Friedman, who invented Gradient Boost, empirical evidence shows that taking lots of small steps in the right direction (with the step size set by the learning rate) results in better predictions on a testing dataset, i.e., lower variance.
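
The effect of the learning rate on a single prediction can be sketched with plain arithmetic. The initial prediction of 71.2 comes from the text; the observed weight of 88 and the learning rate of 0.1 are assumed values for illustration, and the sketch assumes the tree's leaf for this sample contains exactly its residual.

```python
initial_prediction = 71.2   # output of the single initial leaf (from the text)
observed_weight = 88.0      # hypothetical observed value for one sample
learning_rate = 0.1         # hypothetical learning rate

residual = observed_weight - initial_prediction

# Without scaling, the tree's output is added in full and the training sample is hit exactly
full_step = initial_prediction + residual

# With a learning rate, only a fraction of the tree's output is added
small_step = initial_prediction + learning_rate * residual

print(round(residual, 2), round(full_step, 2), round(small_step, 2))
```

The small step moves the prediction only slightly toward the observed value, which is why many trees are needed, each contributing a small correction.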

Now recompute the residuals from the new predictions and fill in the residual column.

<img src='img/gbr-14.png'/>

We build a new tree in the next iteration.

<img src='img/gbr-15.png'/>

<img src='img/gbr-16.png'/>

<img src='img/gbr-17.png'/>

<img src='img/gbr-18.png'/>

<img src='img/gbr-19.png'/>

We keep adding trees until we reach the specified maximum number of trees, or until additional trees no longer significantly reduce the size of the residuals.
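
The whole procedure can be sketched as a short loop. This is a simplified illustration for squared-error loss, not a replacement for a library implementation: the toy data, `learning_rate`, `n_trees` and `max_leaves` are assumed values, the residual trees are scikit-learn `DecisionTreeRegressor` objects limited with `max_leaf_nodes`, and only the fixed-number-of-trees stopping rule is shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature and a noisy linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=40)

learning_rate = 0.1   # every tree is scaled by the same amount
n_trees = 50          # maximum number of trees
max_leaves = 4        # restrict the size of each tree

initial_guess = y.mean()                      # the single initial leaf
prediction = np.full_like(y, initial_guess)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y - prediction                # what the current model still gets wrong
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=0)
    tree.fit(X, residuals)                    # each leaf outputs the average residual in it
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

def predict(X_new):
    """Initial guess plus the scaled output of every tree."""
    out = np.full(X_new.shape[0], initial_guess)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

print(mse_history[0], mse_history[-1])
```

Watching `mse_history` is one way to apply the second stopping rule: once the error stops shrinking noticeably, adding more trees brings little benefit.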

## Gradient Boosting - Classification

<img src='img/gbc-01.png'/>
<img src='img/gbc-02.png'/>
<img src='img/gbc-03.png'/>
<img src='img/gbc-04.png'/>
<img src='img/gbc-05.png'/>
<img src='img/gbc-06.png'/>
<img src='img/gbc-07.png'/>
<img src='img/gbc-08.png'/>
<img src='img/gbc-09.png'/>
<img src='img/gbc-10.png'/>
<img src='img/gbc-11.png'/>
<img src='img/gbc-12.png'/>
<img src='img/gbc-13.png'/>
<img src='img/gbc-14.png'/>
<img src='img/gbc-15.png'/>
<img src='img/gbc-16.png'/>
<img src='img/gbc-17.png'/>
<img src='img/gbc-18.png'/>
<img src='img/gbc-19.png'/>
<img src='img/gbc-20.png'/>
<img src='img/gbc-21.png'/>
<img src='img/gbc-22.png'/>
<img src='img/gbc-23.png'/>
<img src='img/gbc-24.png'/>
<img src='img/gbc-25.png'/>
<img src='img/gbc-26.png'/>
<img src='img/gbc-27.png'/>
<img src='img/gbc-28.png'/>
<img src='img/gbc-29.png'/>
<img src='img/gbc-30.png'/>
<img src='img/gbc-31.png'/>
<img src='img/gbc-32.png'/>
<img src='img/gbc-33.png'/>
<img src='img/gbc-34.png'/>
<img src='img/gbc-35.png'/>
<img src='img/gbc-36.png'/>
<img src='img/gbc-37.png'/>
<img src='img/gbc-38.png'/>
<img src='img/gbc-39.png'/>
<img src='img/gbc-40.png'/>
<img src='img/gbc-41.png'/>






In [None]:
PCA
ADA Boosting
Gradient boosting regression
Gradient boosting classification
XGBoost

## References

https://sebastianraschka.com/faq/docs/covariance-vs-correlation.html

https://machinelearningmastery.com/adaboost-ensemble-in-python/

https://www.youtube.com/watch?v=LsK-xG1cLYA
