# Activity 12 - Bagging and Boosting

***
##### CS 434 - Data Mining and Machine Learning
##### Oregon State University-Cascades
***

# Load packages

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, RandomForestClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from xgboost import XGBClassifier

# Dataset

[Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing)

The data relate to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (`'yes'`) or not (`'no'`).



#### Features
* **age** (numeric)
* **job** : type of job (categorical) `'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown'`
* **marital** : marital status (categorical) `'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed`
* **education** (categorical): `'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown'`
* **default**: has credit in default? (categorical) `'no','yes','unknown'`
* **housing**: has housing loan? (categorical) `'no','yes','unknown'`
* **loan**: has personal loan? (categorical) `'no','yes','unknown'`
* **contact**: contact communication type (categorical) `'cellular','telephone'`
* **month**: last contact month of year (categorical) `'jan', 'feb', 'mar', ..., 'nov', 'dec'`
* **day_of_week**: last contact day of the week (categorical) `'mon','tue','wed','thu','fri'`
* **duration**: last contact duration, in seconds (numeric). 
* **campaign**: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* **pdays**: number of days that passed by after the client was last contacted from a previous campaign (numeric; `999` means client was not previously contacted)
* **previous**: number of contacts performed before this campaign and for this client (numeric)
* **poutcome**: outcome of the previous marketing campaign (categorical) `'failure','nonexistent','success'`
* **emp.var.rate**: employment variation rate - quarterly indicator (numeric)
* **cons.price.idx**: consumer price index - monthly indicator (numeric)
* **cons.conf.idx**: consumer confidence index - monthly indicator (numeric)
* **euribor3m**: euribor 3 month rate - daily indicator (numeric)
* **nr.employed**: number of employees - quarterly indicator (numeric)


#### Target variable:
* **y**: has the client subscribed to a term deposit? (binary: `'yes','no'`)

In [0]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'
zip_file = 'bank.zip'
dat_file = 'bank.csv'           # exercise 1-5
#dat_file = 'bank-full.csv'     # exercise 6  

*** 
# Exercise #1 - Load data
*** 

##### 1.1 `wget` the `url`

In [0]:
# fetch file, then comment out this line
!wget $url

##### 1.2 Unzip `zip_file`

In [0]:
# unzip, then comment out this line
!unzip $zip_file

##### 1.3 Read the `dat_file` into new dataframe `df`.

In [0]:
# read the dataframe
print('your code here')
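
As a hint on the file format: the UCI bank files are, to the best of our knowledge, semicolon-delimited with quoted strings, so the default comma separator would not split the columns. The sketch below reads a small in-memory sample (the two rows and the reduced column set are made up for illustration) with `sep=';'`.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the assumed ';'-delimited layout.
sample = io.StringIO(
    '"age";"job";"education";"y"\n'
    '30;"unemployed";"primary";"no"\n'
    '33;"services";"secondary";"no"\n'
)

# sep=';' tells pandas to split on semicolons instead of commas.
df_sample = pd.read_csv(sample, sep=';')
print(df_sample.shape)
print(df_sample.dtypes)
```

If the real file turns out to use a different delimiter, only the `sep` argument needs to change.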

##### 1.4 Describe dataframe

In [0]:
# describe df
print('your code here')

##### 1.5 Count values in `'education'`

In [0]:
# count values for 'education'
print('your code here')

> Don't overcomplicate.  The above takes one simple pandas command.
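
For reference, here is a minimal sketch of counting category frequencies on a toy column (the values are invented, not taken from the bank data). It assumes `Series.value_counts()` is the kind of one-liner meant here.

```python
import pandas as pd

# Toy column with invented values; the real column comes from df.
toy = pd.DataFrame({
    'education': ['primary', 'secondary', 'secondary',
                  'tertiary', 'secondary', 'unknown']
})

# value_counts() tallies each distinct value, most frequent first.
counts = toy['education'].value_counts()
print(counts)
```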

*** 
# Exercise #2 - Prepare dataset
*** 

##### 2.1 Encode categorical features and class label

In [0]:
# encode all attributes
print('your code here')

A few of the features are already numeric, but some of those have other issues (e.g., a special `-1` value).  

> Therefore, just encode **all** of the features. That makes it a one-liner with pandas/sklearn.  

This is fine for some algorithms, such as a decision tree (used below), but it would not be desirable for a mathematical model such as logistic regression, which would treat the arbitrary integer codes as meaningful magnitudes.
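
One way the encode-everything idea could look is sketched below on a toy frame with invented values. It assumes `LabelEncoder` is applied column by column through `DataFrame.apply`; each column is refit independently, and classes are numbered in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with invented values: one numeric and two categorical columns.
toy = pd.DataFrame({
    'age': [30, 45, 30],
    'education': ['primary', 'secondary', 'tertiary'],
    'y': ['no', 'yes', 'no'],
})

# apply() calls fit_transform once per column, so every column
# (numeric ones included) is replaced by integer codes 0..k-1.
encoded = toy.apply(LabelEncoder().fit_transform)
print(encoded)
```

Note that the numeric `age` column is encoded too: its values become ranks among the distinct ages, which a tree can still split on.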

###### Self Check

In [0]:
if dat_file == 'bank.csv' :
  assert df.iloc[0,3] == 0 and df.iloc[1,3] == 1
else:
  assert df.iloc[0,3] == 2 and df.iloc[1,3] == 1

##### 2.2 Split `X` and `y`

In [0]:
# split X and y
print('your code here')

##### 2.3 Partition to train and test sets with hold-out

* test proportion of 50%
* `random_state=1`
* stratify by `y`
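
To see what stratifying buys you, the sketch below splits a synthetic, imbalanced label vector (90 negatives, 10 positives; made up for illustration) in half and checks that both halves keep the same positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

# stratify=y_toy preserves the class proportions in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=1, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print('positive rate:', y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of a rare class can leave one partition with noticeably fewer positives than the other.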

In [0]:
# partition train and test set
print('your code here')

##### 2.4 Print the shapes of your four sets

In [0]:
# print shapes of train and test
print('your code here')

Train: (2260, 16) (2260,)
Test: (2261, 16) (2261,)


*** 
# Exercise #3 - Baseline decision tree
*** 

##### 3.1 Build a decision tree

* entropy
* no max depth
* `random_state=1`

In [0]:
# create Decision Tree
print('your code here')

###### Self Check

```python
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=1, splitter='best')
```

##### 3.2 Train and test your tree

* predict both `X_train` and `X_test`
* save predictions for both (e.g., `y_train_pred` and `y_test_pred`)

In [0]:
# train decision tree
print('your code here')

##### 3.3 Determine accuracy and F$_1$ score (train and test set)

In [0]:
# print accuracy and F1
print('your code here')

##### Self Check

The accuracy and F$_1$ of the *train* set are both `1.0`.
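
A perfect training score is what you would expect from a tree with no depth limit: with default settings it keeps splitting until every leaf is pure, which is always possible when no two identical rows carry different labels. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for illustration, not the bank data), including some deliberately flipped labels.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data with 5% label noise (flip_y).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1],
    flip_y=0.05, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1, stratify=y_demo
)

# Unlimited depth lets the tree memorize the training set, noise included.
demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=None,
                                   random_state=1)
demo_tree.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, demo_tree.predict(X_tr))
train_f1 = f1_score(y_tr, demo_tree.predict(X_tr))
test_acc = accuracy_score(y_te, demo_tree.predict(X_te))
test_f1 = f1_score(y_te, demo_tree.predict(X_te))
print('train:', train_acc, train_f1)
print('test: ', test_acc, test_f1)
```

The gap between the train and test rows is the overfitting that bagging and boosting are meant to address.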

##### 3.4 Plot a confusion matrix (test set)

In [0]:
# graph a confusion matrix
print('your code here')

> Note the discrepancy between accuracy and F$_1$.  This skew is caused by the severe class imbalance. The confusion matrix and the F$_1$ score paint a more honest picture.  The accuracy mostly reflects the (skewed) distribution of the class label in the dataset. 
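
A small worked example makes the gap concrete. The labels and predictions below are invented: 100 samples, 10 of them positive, and a classifier that finds only 2 of the positives while raising 2 false alarms.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented ground truth: 90 negatives followed by 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# Invented predictions: 88 TN, 2 FP, then 8 FN, 2 TP.
y_pred = np.array([0] * 88 + [1] * 2 + [0] * 8 + [1] * 2)

acc = accuracy_score(y_true, y_pred)    # (88 + 2) / 100
f1 = f1_score(y_true, y_pred)           # precision 0.5, recall 0.2
cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted

print('accuracy:', acc)
print('F1:', round(f1, 3))
print(cm)
```

Accuracy comes out at 0.9 even though 8 of the 10 positives are missed; F$_1$, which combines precision (2/4) and recall (2/10), is only 2/7, or about 0.29.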

*** 
# Exercise #4 - Bagger
*** 

##### 4.1 Bag your decision tree

* use a `BaggingClassifier` (see [api](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html))
* use your tree from Ex.#3
* use 100 estimators
* `bootstrap=True`
* `random_state=1`
* `n_jobs=-1`
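
The settings above could be wired together roughly as follows. This is a sketch on synthetic data from `make_classification` (an assumption, not the bank data). The tree is passed positionally because the keyword's name depends on the scikit-learn version: older releases call it `base_estimator`, newer ones `estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=1, stratify=y_demo
)

demo_tree = DecisionTreeClassifier(criterion='entropy', random_state=1)

# Each of the 100 trees is fit on its own bootstrap sample (rows drawn
# with replacement); predictions are combined across the trees.
demo_bagger = BaggingClassifier(
    demo_tree, n_estimators=100, bootstrap=True,
    random_state=1, n_jobs=-1
)
demo_bagger.fit(X_tr, y_tr)

y_te_pred = demo_bagger.predict(X_te)
print('test F1:', f1_score(y_te, y_te_pred))
```

Because the trees are trained independently of one another, `n_jobs=-1` can spread the fitting across all available cores.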

In [0]:
# build bagger
print('your code here')

###### Self Check

```python
BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='entropy',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=1,
                                                        min_samples_split=2,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=1,
                                                        splitter='best'),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=100, n_jobs=-1, oob_score=False,
                  random_state=1, verbose=0, warm_start=False)
```

##### 4.2 Train and test bagger

* predict both `X_train` and `X_test`
* save predictions for both (e.g., `y_train_pred` and `y_test_pred`)

In [0]:
# train bagger
print('your code here')

##### 4.3 Determine accuracy and F$_1$ score (train and test)

In [0]:
# print accuracy and F1
print('your code here')

##### 4.4 Plot a confusion matrix (test set)

In [0]:
# graph a confusion matrix
print('your code here')

*** 
# Exercise #5 - Boosting
*** 

##### 5.1 Boost your decision tree

* construct an `AdaBoost` model (see [api](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html))
* use your tree from Ex.#3
* use 100 estimators
* `learning_rate=0.1`
* `random_state=1`

In [0]:
# build AdaBoost
print('your code here')

> For bagging, you could parallelize with the `n_jobs` parameter.  AdaBoost has no such option. **Why is this?** 
>
> Consider the fundamental differences between the two approaches (see "flowchart" images in Lecture 12). 
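
As a hint, the loop below is a simplified sketch of discrete AdaBoost with decision stumps on synthetic data. It is a textbook-style illustration, not scikit-learn's internal implementation, and all variable names are our own. Watch where `weights` is read and where it is written.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=1)
n = len(y_demo)
weights = np.full(n, 1.0 / n)   # start with uniform sample weights
stumps, alphas = [], []

for round_idx in range(5):
    # The learner for this round is fit using the CURRENT weights ...
    stump = DecisionTreeClassifier(max_depth=1, random_state=1)
    stump.fit(X_demo, y_demo, sample_weight=weights)

    # ... and its weighted error determines its vote (alpha).
    miss = stump.predict(X_demo) != y_demo
    err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)

    # Misclassified samples are up-weighted, correct ones down-weighted,
    # and the result is renormalized for the NEXT round.
    weights = weights * np.exp(np.where(miss, alpha, -alpha))
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

print('learner votes:', np.round(alphas, 3))
```

Compare this with bagging, where every bootstrap sample can be drawn before any tree has been trained.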

###### Self Check

```python
AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                         class_weight=None,
                                                         criterion='entropy',
                                                         max_depth=None,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort='deprecated',
                                                         random_state=1,
                                                         splitter='best'),
                   learning_rate=0.1, n_estimators=100, random_state=1)
```

##### 5.2 Train and test AdaBoost

* predict both `X_train` and `X_test`
* save predictions for both (e.g., `y_train_pred` and `y_test_pred`)

In [0]:
# train AdaBoost
print('your code here')

##### 5.3 Determine accuracy and F$_1$ score (train and test)

In [0]:
# print accuracy and F1
print('your code here')

##### 5.4 Plot a confusion matrix (test set)

In [0]:
# graph a confusion matrix
print('your code here')

*** 
# Exercise #6 - Go big or go home$^1$
*** 

> $^1$ ... *but it's March Mayteenth, 2020. I'm already at home.*

##### 6.1 Run the entire Activity 12 again on `'bank-full.csv'`.

In the dataset section, set `dat_file` to `'bank-full.csv'`:
```python
#dat_file = 'bank.csv'           # exercise 1-5
dat_file = 'bank-full.csv'        # exercise 6  
```

> `'bank-full.csv'` be 10 times mo' biggly than `'bank.csv'`

**Remember** to switch back to the smaller `'bank.csv'` and re-run before answering the Canvas questions.