### Agenda

- Implement K-fold cross-validation
- Understand the bootstrap method and bagging (bootstrap aggregation)
- Understand and implement Random Forests


In [None]:
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
print(X.shape)
print(y.shape)

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
model = DecisionTreeRegressor(max_depth = 3)
cross_val_score(model, X_train, y_train, cv=10)
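
What `cross_val_score` does with an integer `cv` can be sketched by hand with `KFold`. The example below is only an illustration: it uses a synthetic dataset from `make_regression` (an assumption, so that it runs without the housing data) and the names `X_demo`, `y_demo`, `fold_scores` and `auto_scores` are hypothetical. Each fold is scored with R², the default score of a regressor.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption: stands in for the housing data used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kfold = KFold(n_splits=5)  # 5 consecutive folds, no shuffling
fold_scores = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit on 4 folds, evaluate on the held-out fold
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(tree.score(X_demo[val_idx], y_demo[val_idx]))  # R^2
fold_scores = np.array(fold_scores)

# The same folds and the same model passed to cross_val_score
auto_scores = cross_val_score(
    DecisionTreeRegressor(max_depth=3, random_state=0), X_demo, y_demo, cv=kfold
)
print(fold_scores.round(3))
print(auto_scores.round(3))
```

Because the folds and the model settings are identical, the two printed arrays should agree.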

#### Testing different models with cross-validation

In [None]:
model1 = DecisionTreeRegressor(max_depth=None)
from sklearn.linear_model import LinearRegression
model2 = LinearRegression()
from sklearn.neighbors import KNeighborsRegressor
model3 = KNeighborsRegressor()

In [None]:
import numpy as np
model_pipeline = [model1, model2, model3]
model_names = ['Regression Tree', 'Linear Regression', 'KNN']
scores = {}
for name, model in zip(model_names, model_pipeline):
    # Mean score (R^2 for regressors) over the 10 folds
    scores[name] = np.mean(cross_val_score(model, X_train, y_train, cv=10))
print(scores)

#### Bootstrapping

- Bootstrap: A reliable estimate needs enough samples in the dataset, but it is not always possible to collect enough real data. The bootstrap method lets us emulate the process of obtaining new sample sets from the original data. Bootstrapping is therefore the process of generating distinct data sets by repeatedly sampling observations (with replacement) from the original data set.
- [link to the image - Generating Bootstrap Samples](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.08/7.08-bootstrapping.png)

In the image above, we generate `B` bootstrap samples from the original dataset. Since sampling is done with replacement, some rows are repeated within a bootstrap sample. Each bootstrap sample is used to estimate alpha (for example, a measure of accuracy for a linear regression model). We then take the mean of all alpha estimates to obtain a more reliable final estimate.
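
The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the `original` sample is made up, and alpha is taken to be the sample mean rather than a model accuracy measure, simply to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical original sample of 50 observations
original = rng.normal(loc=10.0, scale=2.0, size=50)

B = 1000  # number of bootstrap samples
n = len(original)
alphas = np.empty(B)
for b in range(B):
    # Draw n row indices with replacement: some rows repeat, others are left out
    idx = rng.integers(0, n, size=n)
    alphas[b] = original[idx].mean()  # alpha for this bootstrap sample

print("mean of bootstrap estimates:", alphas.mean())
print("bootstrap standard error:", alphas.std(ddof=1))
```

The spread of the `B` alpha values also gives an estimate of how uncertain alpha is, which a single sample cannot provide.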

#### Bagging
- Why do we need the bagging technique?

  - One of the disadvantages of decision trees is that their results have high variability, i.e. the accuracy measures they produce can vary greatly. This can be seen from the snapshot here: [link to the image - High Variance in Output for Regression Trees](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.08/7.08-high_variability.png)

- Bagging is a general-purpose technique used to reduce the variance of a machine learning model. The idea is to draw B bootstrap samples, fit a model on each bootstrap sample, and then aggregate the results of all B models. This method is particularly useful for decision trees.

- Bagging applied to decision trees: B bootstrapped training sets are sampled from the original data. On each bootstrap sample, a decision tree is fit and a prediction is made. Then we average the resulting predictions. These trees are grown deep and have high variance. Averaging these B trees reduces the variance.

- Essentially we are combining the results from hundreds or thousands of independently grown decision trees.
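
The steps above can be sketched by hand for regression trees. The data comes from `make_regression` (an assumption, used so the example is self-contained), and names such as `all_preds` and `bagged_mse` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption) split into train and test sets
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

rng = np.random.default_rng(0)
B = 100  # number of bootstrap samples / trees
all_preds = np.empty((B, len(X_te)))
for b in range(B):
    # Bootstrap sample of the training rows (with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeRegressor(random_state=b)  # grown deep: max_depth=None
    tree.fit(X_tr[idx], y_tr[idx])
    all_preds[b] = tree.predict(X_te)

# Aggregate: average the B predictions for every test row
bagged_pred = all_preds.mean(axis=0)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

mean_tree_mse = float(np.mean([mse(y_te, p) for p in all_preds]))
bagged_mse = mse(y_te, bagged_pred)
print("average MSE of the individual trees:", round(mean_tree_mse, 1))
print("MSE of the bagged prediction:", round(bagged_mse, 1))
```

scikit-learn packages the same idea as `BaggingRegressor` and `BaggingClassifier` in `sklearn.ensemble`; the manual loop is shown only to make the mechanics visible.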


### Random Forests

- Random forests are very similar to bagging, with one improvement: the features considered while building each tree are also randomized.

- A random forest also consists of building a large number of trees (one decision tree for each bootstrap sample).
- While growing each tree, every split considers only a random sample of `m` features out of the total set of `p` predictors, instead of all of them.
- Hence the name random forests.
- There is no single best value of `m`, but the square root of `p` is a commonly used starting point.
- Using a small value of `m` in building a random forest will typically be helpful when we have a large number of correlated predictors.
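
As a rough illustration of the role of `m`, the sketch below compares `m = sqrt(p)` with `m = p` (which reduces to plain bagging) using `max_features`. The dataset is synthetic, with deliberately redundant features (an assumption), and no claim is made about which setting wins in general.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with correlated (redundant) predictors (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=4, n_redundant=8, random_state=0
)
p = X_demo.shape[1]
m = int(math.sqrt(p))  # common starting point: m = sqrt(p)

scores = {}
for max_features in (m, p):  # p features per split is equivalent to bagging
    forest = RandomForestClassifier(
        n_estimators=50, max_features=max_features, random_state=0
    )
    scores[max_features] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()

for max_features, score in scores.items():
    print(f"max_features={max_features}: mean CV accuracy = {score:.3f}")
```

In practice `m` is usually treated as a hyperparameter and tuned with cross-validation.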

#### Parameters in Random Forests

- `n_estimators: int, default=100` - The number of trees in the forest.
- `max_features: {"auto", "sqrt", "log2"}, int or float, default="auto"` - The number of features to consider when looking for the best split. (Newer scikit-learn releases drop `"auto"` and default to `"sqrt"` for classifiers.)
- `bootstrap: bool, default=True` - Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- The remaining parameters are the same ones we looked at for decision trees:

  - `criterion: {"gini", "entropy"}, default="gini"`
  - `max_depth: int, default=None`
  - `min_samples_split: int or float, default=2`
  - `min_samples_leaf: int or float, default=1`
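
A short sketch of how these parameters are passed to `RandomForestClassifier`. The values below are arbitrary choices for illustration, not recommendations, and the data is synthetic (an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,        # number of trees in the forest
    max_features="sqrt",    # features considered at each split
    bootstrap=True,         # fit each tree on a bootstrap sample
    criterion="gini",
    max_depth=4,
    min_samples_split=4,
    min_samples_leaf=2,
    random_state=0,
)
forest.fit(X_demo, y_demo)

# The fitted trees are available in estimators_
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print("trees:", n_trees)
print("deepest tree:", deepest)
```

Inspecting `estimators_` is a quick way to confirm that limits such as `max_depth` were applied to every tree.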

#### Example

- Using a random forest model on mail promotion data.
- The first objective is to build a classification model that predicts which customers are more likely to respond.
- For the customers predicted as likely to respond, we will then build a regression model to predict the amount of money they will donate.
- It is important to note how we retain the information from the column `TARGET_D`, which is the target column for the regression model.
- For the classification model, we will use the cleaned data from the provided CSV files in the `files_for_lesson_and_activities` folder:
  - `numerical.csv` has the numerical features (not normalized)
  - `categorical.csv` has the categorical columns (not encoded)
  - `target.csv` has the two target columns `TARGET_B` and `TARGET_D`

In [None]:
# Reading data
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

In [None]:
numerical = pd.read_csv('./files_for_lesson_and_activities/numerical.csv')
categorical = pd.read_csv('./files_for_lesson_and_activities/categorical.csv')
targets = pd.read_csv('./files_for_lesson_and_activities/target.csv')
data = pd.concat([numerical, categorical, targets], axis = 1)
data['TARGET_B'].value_counts()

In [None]:
# Downsampling to balance data
category_0 = data[data['TARGET_B']==0].sample(len(data[data['TARGET_B']==1]))
print(category_0.shape)

In [None]:
category_1 = data[data['TARGET_B']== 1 ]
data = pd.concat([category_0, category_1], axis = 0)
data = data.sample(frac =1)
data = data.reset_index(drop=True)
print(data.shape)

In [None]:
# Data Processing
y = data['TARGET_B']
X = data.drop(['TARGET_B'], axis = 1)

In [None]:
numericalX = X.select_dtypes(np.number)
categorcalX = X.select_dtypes(object)  # np.object was removed in NumPy 1.24

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first').fit(categorcalX)
encoded_categorical = encoder.transform(categorcalX).toarray()
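encoded_categorical = pd.DataFrame(encoded_categorical)
# Use string column names: recent scikit-learn versions reject mixed str/int feature names
encoded_categorical.columns = encoded_categorical.columns.astype(str)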
X = pd.concat([numericalX, encoded_categorical], axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
# Retaining Info for Regression Model for Later
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

# TARGET_D is still among the features here, so save it as the regression target
y_train_regression = X_train['TARGET_D']
y_test_regression = X_test['TARGET_D']

# Now we can remove the TARGET_D column from the set of features
X_train = X_train.drop(['TARGET_D'], axis = 1)
X_test = X_test.drop(['TARGET_D'], axis = 1)

In [None]:
# Building the model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# For cross validation
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier(max_depth=2, random_state=0)
cross_val_scores = cross_val_score(clf, X_train, y_train, cv=10)
print(np.mean(cross_val_scores))