# DECISION TREE
The goal of this exercise is to explore the parameters of a decision tree. To do that, we will build a default decision tree with sklearn and then use visual exploration to decide how to set the model's parameters.

We're working with the following dataset: 

###- Previous dataset
The BADS_T2_logreg_sampling.csv file, excluding the country column (`pais`).


## VERSION 1
In this version you have to write the functions from scratch and solve any problems that arise along the way. Compare your results with those from sklearn.
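As a possible starting point for the from-scratch part, here is a minimal sketch of the Gini impurity (the default `criterion` in sklearn's `DecisionTreeClassifier`) and of a search for the best threshold on one numeric feature. The function names `gini` and `best_threshold` are hypothetical, and this covers a single split only, not a full tree.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum(p_k^2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted_gini) of the best split on one feature.

    Candidate thresholds are the midpoints between consecutive distinct values.
    Returns (None, parent_gini) when no split improves on the parent node.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    best = (None, gini(labels))
    values = np.unique(feature)
    for low, high in zip(values[:-1], values[1:]):
        threshold = (low + high) / 2
        left, right = labels[feature <= threshold], labels[feature > threshold]
        weighted = (left.size * gini(left) + right.size * gini(right)) / labels.size
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

print(gini([0, 0, 1, 1]))
print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
```

Applying `best_threshold` to every column and recursing on the two resulting subsets is one way to grow a tree that can then be compared against sklearn's.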

First, you have to import the datasets.

In [None]:
# import from Google Drive

In [None]:
# preprocessing for decision tree

Then you have to create the decision tree, visualize it and tune it.

In [None]:
# build the model

In [None]:
# plot the tree

In [None]:
# create a model with optimized parameters

## SOLUTION
Here is the solution (try not to spoil it for yourself).

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# IMPORTING
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/Data Science/Training/Exercises/Datasets/BADS_T2_logreg_sampling.csv")
X,y = df.iloc[:,:-1],df.transaccion

In [None]:
# PREPROCESSING
X = X.fillna(0)
X = X.drop('pais',axis=1)
X = pd.get_dummies(X)

In [None]:
# MODEL
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import tree
dt = DecisionTreeClassifier()
dt.fit(X,y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [None]:
# PLOT
import graphviz

dot_data = tree.export_graphviz(dt, filled=True,feature_names=X.columns)
graph = graphviz.Source(dot_data, format="png") 
graph

In [None]:
#!pip install dtreeviz
from dtreeviz.trees import dtreeviz
viz = dtreeviz(dt, X, y, feature_names=X.columns)
viz
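The solution above stops at the visualization step. For the last step of the exercise (a model with tuned parameters), one option is a small grid search over the parameters that control tree size. This is a sketch under assumptions: it runs on synthetic data from `make_classification` rather than the exercise file, and the grid values and the `average_precision` scoring choice are illustrative, not the author's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the exercise data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Limit depth and leaf size so the tree cannot memorize the training set.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [5, 20, 50]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="average_precision",
    cv=5,
)
search.fit(X_tr, y_tr)

dt_tuned = search.best_estimator_
print(search.best_params_)
print("depth:", dt_tuned.get_depth(), "leaves:", dt_tuned.get_n_leaves())
print("test accuracy:", dt_tuned.score(X_te, y_te))
```

A shallower tree is also much easier to read in the `export_graphviz` plot than the unrestricted default.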

# XGBOOST
The goal of this exercise is both to test XGBoost hyperparameters and to compare its performance against simpler models (logistic regression and decision trees) on two very different datasets.

We're working with the following datasets: 

###- Small one
The BADS_T2_logreg_sampling.csv file.

###- Big one
A telecom dataset. It's in the same folder, under the name BADS_T2_xgboost.csv.
The transaction column comes directly from the original dataset ;)

## VERSION 1

First, you have to import the datasets.

Then you have to do some preprocessing.

### SMALL DATASET

After that, the first step is to test the three models (logistic regression, decision tree and XGBoost) with default parameters. You can use either F1 or average precision score. Report the chosen metric on both the train and test sets (80/20 split).

Finally, you can play around with the hyperparameters of XGBoost to achieve better results. It's suggested to try these values for the hyperparameters:

- `max_depth`: 3 to 6
- `gamma`: 0 to 1
- `min_child_weight`: 1 to 5
- `learning_rate`: 0.01 to 0.3
- `reg_lambda`: 1 to 10
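The suggested ranges can be turned into a set of random candidate configurations. The sketch below uses sklearn's `ParameterSampler` with `scipy.stats` distributions. The number of candidates (10) and the choice of random rather than exhaustive search are assumptions, not part of the exercise.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# scipy conventions: randint(low, high) excludes high,
# uniform(loc, scale) covers [loc, loc + scale].
param_distributions = {
    "max_depth": randint(3, 7),             # integers 3..6
    "gamma": uniform(0, 1),                 # 0 to 1
    "min_child_weight": randint(1, 6),      # integers 1..5
    "learning_rate": uniform(0.01, 0.29),   # 0.01 to 0.3
    "reg_lambda": uniform(1, 9),            # 1 to 10
}

candidates = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))
for params in candidates[:3]:
    print(params)

# Each dict can then be passed to the model, e.g. XGBClassifier(**params),
# and scored on the train and test sets with the chosen metric.
```

The same `param_distributions` dict can also be handed to `RandomizedSearchCV` if cross-validated selection is preferred over a single train/test split.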


### BIG DATASET

First, run the default models on this dataset and compare the results with those from the small one.

In [None]:
# load the big dataset (BADS_T2_xgboost.csv, same folder as the small one)
df_bq = pd.read_csv("/content/drive/MyDrive/Data Science/Training/Exercises/Datasets/BADS_T2_xgboost.csv")

Now take the parameters you found on the small dataset and use them in a model for this dataset.

(Optional) If you want, you can play around again with the parameters to find a better combination.

## SOLUTION

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
df = pd.read_csv("/content/drive/MyDrive/Data Science/Training/Exercises/Datasets/BADS_T2_logreg_sampling.csv")
df = df.fillna(0)
df.loc[:,"transaccion"] = np.where(df.loc[:,'transaccion']==0,0,1)

In [None]:
# DIVIDE X,Y
X,y = df.iloc[:,:-1].copy(), df.iloc[:,-1].copy()

# TRAIN AND TEST
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify = y)

# CATEGORICAL
from sklearn.preprocessing import OneHotEncoder
oneHot = OneHotEncoder(handle_unknown='ignore').fit(X_train)
X_train, X_test = oneHot.transform(X_train), oneHot.transform(X_test)

# NORMALIZATION
from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit(X_train)
X_train, X_test = normalizer.transform(X_train), normalizer.transform(X_test)
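Two details of the preprocessing above are worth knowing. `OneHotEncoder` fitted on the whole `X_train` treats every column as categorical, numeric ones included, and `Normalizer` rescales each row to unit norm rather than scaling each feature. A possible alternative, sketched here on a toy frame, encodes only the categorical columns and standardizes the numeric ones. The column `visitas` and all the values are hypothetical. `max_iter=1000` is likewise an assumption, chosen to give the logistic regression solver more room to converge.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: "pais" appears in the exercise, "visitas" is a made-up numeric column.
toy = pd.DataFrame({
    "pais": ["ES", "FR", "ES", "DE", "FR", "ES", "DE", "FR"],
    "visitas": [1, 5, 2, 8, 7, 1, 9, 6],
    "transaccion": [0, 1, 0, 1, 1, 0, 1, 0],
})
X_toy, y_toy = toy.drop(columns="transaccion"), toy["transaccion"]

# One-hot encode categorical columns, standardize numeric columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pais"]),
    ("num", StandardScaler(), ["visitas"]),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_toy, y_toy)

encoded = clf.named_steps["prep"].transform(X_toy)
print(encoded.shape)  # three country columns plus one scaled numeric column
print(clf.predict(X_toy))
```

Wrapping the steps in a `Pipeline` also keeps the encoder and scaler fitted on the training data only when the pipeline is used with a train/test split.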

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()

In [None]:
logreg.fit(X_train,y_train)
dt.fit(X_train,y_train)
xgb.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [None]:
from sklearn.metrics import f1_score,average_precision_score

models = {"logreg":logreg,"dt":dt,"xgb":xgb}
results_f1 = []
results_aps = []

for name,model in models.items():
  model.fit(X_train,y_train)
  y_hat, y_proba = model.predict(X_test), model.predict_proba(X_test)[:,1]
  y_hat_train,y_proba_train = model.predict(X_train),model.predict_proba(X_train)[:,1]
  f1_train, aps_train = f1_score(y_train,y_hat_train),average_precision_score(y_train,y_proba_train)
  f1_test, aps_test = f1_score(y_test,y_hat),average_precision_score(y_test,y_proba)
  results_f1.append([name,f1_train,f1_test])
  results_aps.append([name,aps_train,aps_test])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
pd.DataFrame(results_f1,columns=['model','train','test'])

| | model | train | test |
|---|---|---|---|
| 0 | logreg | 0.256139 | 0.210187 |
| 1 | dt | 0.983422 | 0.300706 |
| 2 | xgb | 0.0 | 0.0 |


In [None]:
pd.DataFrame(results_aps,columns=['model','train','test'])

| | model | train | test |
|---|---|---|---|
| 0 | logreg | 0.488637 | 0.424852 |
| 1 | dt | 0.999102 | 0.108978 |
| 2 | xgb | 0.303574 | 0.298463 |
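An F1 of exactly 0 alongside a nonzero average precision, as in the xgb rows, is consistent with a model that never predicts the positive class at the default 0.5 threshold while still ranking positives above most negatives. F1 is computed from hard predictions, average precision from the predicted probabilities. The hand-made arrays below are an illustration of that effect, not output from the exercise models.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Made-up labels and probabilities: no probability reaches 0.5.
y_true = np.array([0, 0, 0, 1, 0, 1])
y_proba = np.array([0.10, 0.35, 0.15, 0.40, 0.05, 0.30])

# Default decision rule used by predict(): positive if probability >= 0.5.
pred_default = (y_proba >= 0.5).astype(int)
f1_default = f1_score(y_true, pred_default, zero_division=0)

# Average precision only looks at the ranking induced by the probabilities.
aps = average_precision_score(y_true, y_proba)

# Lowering the threshold turns the same probabilities into useful predictions.
pred_low = (y_proba >= 0.25).astype(int)
f1_low = f1_score(y_true, pred_low)

print("F1 at 0.5:", f1_default)
print("average precision:", aps)
print("F1 at 0.25:", f1_low)
```

On an imbalanced target it can therefore be worth reporting average precision, or tuning the decision threshold, before concluding that a model with F1 = 0 has learned nothing.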
