# AI Trainee - Level 3 Practical

The aim of this notebook is to show selected Machine Learning (ML) algorithms in action and highlight their advantages and drawbacks. The focus is on the important ideas, not on coding style. **Do not worry** if you cannot fully understand some code snippets, as long as you understand the idea behind them.   

# Data preparation<a name="data-preparation"/>

## Get data<a name="get-data"/>
It is highly unlikely that the customer will hand you a script that loads the data directly. More often, you will receive access to a database, a link to a data set, or simply an archive containing the data.  

In this example, the customer asks us to predict [churn rate](https://en.wikipedia.org/wiki/Churn_rate) given selected data on customers and contracts, plus a yes/no churn flag for each customer. We will use the [Telecom churn data sets](https://www.kaggle.com/vpfahad/telecom-churn-data-sets) hosted on kaggle.com.  

1. Download the data zip archive: https://www.kaggle.com/vpfahad/telecom-churn-data-sets
2. Extract to your data folder
    * Tip: Python's standard library includes [zipfile](https://docs.python.org/3.7/library/zipfile.html), which lets you read from an archive without extracting it!
    * If you are working on kaggle.com (although I highly recommend working on your own PC):
        * From dashboard, click on Data > New Dataset > Upload (include duplicates)
        * Start new notebook (or clone this) and click on + Add data & select your folder (you may need to leave or restart the notebook if some error occurs). Then you will see your data set in "input" from the Data drop-down menu. Congrats!
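
As a minimal sketch of the `zipfile` tip above: the snippet builds a tiny in-memory archive and reads the CSV with pandas without extracting anything to disk. The two data rows and the variable name `demo` are made up for illustration; with the real download you would pass the path of the zip file instead of the buffer.

```python
import io
import zipfile

import pandas as pd

# Build a small archive in memory so the example is self-contained.
# With real data, pass the path of the downloaded zip file instead.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as zf:
    zf.writestr("churn_data.csv", "customerID,Churn\nA-1,No\nB-2,Yes\n")
buffer.seek(0)

# Read the CSV straight from the archive (no extraction to disk)
with zipfile.ZipFile(buffer) as zf:
    print(zf.namelist())  # files stored in the archive
    with zf.open("churn_data.csv") as f:
        demo = pd.read_csv(f)

print(demo.shape)
```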

## Read<a name="read"/>
* We will read the [churn rate](https://en.wikipedia.org/wiki/Churn_rate) data to pandas/python
    * **Read** the churn data, customer data, internet contract data, and metadata
        * All data stored in [csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* If working on kaggle.com the data will be located in `../input/` folder

In [None]:
import pandas as pd
# This will control the display options of pandas within this notebook only
pd.options.display.max_columns = None
pd.options.display.max_rows = 20

In [None]:
# Our master table containing main information
df = pd.read_csv("../input/ai-trainee/churn_data.csv")
print("Input data-frame shape: {}".format(df.shape))
df.head()

In [None]:
# The customer-specific data
customer = pd.read_csv("../input/ai-trainee/customer_data.csv")
customer.head()

In [None]:
# Contract data
contract = pd.read_csv("../input/ai-trainee/internet_data.csv")
contract.head()

In [None]:
# Read metadata. 
# If you get an error, inspect the file manually and remove empty spaces or tabs after each column (' ,'>'')
# Provided data sets very often contain strange symbols that lead to errors. Get used to it :)
meta = pd.read_csv("../input/ai-trainee/Telecom Churn Data Dictionary.csv")
# the columns in input data and meta do not match (some are in lowercase, some contain empty spaces, ...)
# Will create a new column that will unify all. Same will be done in following section for input dataframes
meta = meta.assign(name_id=meta["Variable Name"].replace({" ":"","\t":""},regex=True).str.lower()).set_index("name_id")
meta.head()

## Join dataframes<a name="join"/>
* The aim of a data [join](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) is to get all required columns into one dataframe. We have 3 files/dataframes and need to combine them into one
* If you look closer at the dataframes, they share a unique `customerID` column that is suitable for joining
* Here, we will do a [left join](https://www.w3schools.com/sql/sql_join_left.asp). This is easy as we have a [1:1 relation](https://hackernoon.com/mysql-tutorial-example-relation-foreign-key-database-funtion-join-table-query-one-namy-nest-41dd09648fbd) (no duplicates = one customer, one contract)
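
One possible way to check the 1:1 assumption explicitly is sketched below. The two toy tables and their values are invented for illustration. `DataFrame.merge` accepts a `validate` argument that raises an error if the relation is not one-to-one.

```python
import pandas as pd

# Toy tables (made-up values) indexed by a unique customer ID
idx = pd.Index(["A-1", "B-2"], name="customerID")
left = pd.DataFrame({"churn": ["No", "Yes"]}, index=idx)
right = pd.DataFrame({"gender": ["Female", "Male"]}, index=idx)

# A unique index on both sides implies a 1:1 relation
print(left.index.is_unique, right.index.is_unique)

# validate="one_to_one" makes pandas raise MergeError if duplicates sneak in
joined = left.merge(right, how="left", left_index=True, right_index=True,
                    validate="one_to_one")
print(joined.shape)  # same number of rows as `left`
```

Comparing the number of rows before and after the join, as done in this notebook, catches the same problem after the fact.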

In [None]:
# Set customer ID (=unique ID) as index for further joining
# This is clearly a repeating task. Imagine doing this ten times or so! => use a for loop
# In addition, rename the columns to be identical with "meta"
for i in [df, customer, contract]:
    i.set_index("customerID",inplace=True)
    i.rename(columns={j:j.lower() for j in i.columns},inplace=True)
    
df.head()

In [None]:
# Join all three data-frames (one after another)
df = df.join(customer).join(contract)

# Make sure no 1:N relation = no duplicates, print shape again (compare number of rows with input above)
print("Joined dataframe shape: {}".format(df.shape))
df.head()

## Train-test split<a name="train-test-split"/>
The aim of the [train-test-split](https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50) is to assess (estimate) the future performance of the model on "unseen" data. It can be achieved simply by putting aside part of the available data (usually 20-30% depending on the data set size) and training the model on the remaining part. It is very important to treat the test set as if we did not have it. Data scientists often unconsciously make their decisions after looking at the complete data set. This is called [data snooping](https://en.wikipedia.org/wiki/Data_dredging) and should be avoided. Check the awesome [Caltech ML course CS 156](https://www.youtube.com/watch?v=SEYAnnLazMU&list=PLD63A284B7615313A&index=6&t=0s) for a very detailed explanation.

Here, we will apply the split before encoding and scaling! Otherwise, we would scale the data using the mean of the whole data set! In real-world applications, we do not know the mean, standard deviation, or range of all features in advance! On a similar note, imagine you want to encode a text data set. Fitting the encoder on the whole data set is a bit of cheating, as we do not know in advance all possible expressions that will pop up in future data. 
Hence, when the train-test-split is applied prior to any processing steps, the test accuracy is more likely to follow the out-of-sample accuracy. 

In addition, [train-test-split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) will automatically shuffle the data, so we avoid training the model on a "sorted" data set (for example, if the customer provides the data sorted by tenure).

After the train-test-split, we would normally use "[pipes](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)" (pipelines) to apply the same steps to the train and test data sets. Here, we will simply apply all steps manually to both data sets.
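
For reference, a minimal pipeline sketch is shown below. It uses synthetic data and a logistic regression purely for illustration (neither is part of this notebook's workflow). The point is that `fit` learns the scaling parameters from the training part only and re-uses them on the test part.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 200 samples, 3 numerical features
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=123)

# fit() learns the scaling parameters on the training part only;
# score()/predict() re-use them on the test part
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_tr, y_tr)

print(pipe.named_steps["scaler"].mean_)  # means of the training features
print(pipe.score(X_te, y_te))            # accuracy on the held-out part
```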


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# to achieve the identical result each run (just for this AI trainee lesson), use 'random_state' option
train, test = train_test_split(df,train_size=0.75,shuffle=True,random_state=123)

In [None]:
print(train.shape)
train.head()

## Inspect data types<a name="inspect-types"/>
* What data **types** are the columns?
    * Keep in mind that "object" can store strings, date-times, or even integers if some entries are None/missing
* Do we have any missing data?
* What are the basic statistics for numerical values?  


> We will need these checks at least twice, so we prepare a function. Write a function whenever you know that a task will be repeated!

In [None]:
def check_stats(df):
    """
    This function will return a table (dataframe) showing main statistics and additional indicators
    """
    # We will store the data types in a separate dataframe
    dfinfo = pd.DataFrame(df.dtypes,columns=["dtypes"])
    
    # We are interested if we have any missing data (sum all). 
    # Again, join the result with the dfinfo (append new column). Consider '' or ' ' also as missing
    dfinfo = dfinfo.join((df.replace({'':None,' ':None}) if "('O')" in str(df.dtypes.values) else df).isna().sum().rename("isna"))
        

    # In the last step, add statistics (will be computed for numerical columns only)
    # We need to "T"ranspose the dataframe to same shape as df.describe() output
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    return pd.concat([dfinfo.T, df.describe()], sort=False)

### Missing data
* Check if any data are missing (see the created "isna" column)
* If you find some, first check that the missing data are not caused by incorrect data reading (not in this case)
* Due to the low number of affected entries (0.2%), we can simply drop these rows
    * Revise this decision if facing imbalanced data set
    * If you decide not to drop the rows, follow the notebook describing different methods when dealing with missing data: https://www.kaggle.com/mmdatainfo/missing-data

In [None]:
check_stats(train).T.query("isna != 0")

In [None]:
# The missing data are in the input file marked with ' '
# drop them = overwrite the dataframe
test = test[test.totalcharges!=' ']
train = train[train.totalcharges!=' ']

# Feature engineering<a name="feature-engineering"/>

## Encoding<a name="encoding"/>

The aim of encoding is to convert available data types to numerical values suitable for machine learning. These non-numerical values are often called categorical values as they describe a category such as 'internet contract' or 'cable contract'
* For example, column containing 'no' & 'yes' values needs to be converted to numeric representation
* There are also methods to process numerical values, e.g., [discretization](https://en.wikipedia.org/wiki/Discretization_of_continuous_features) but these are applied only in rare cases

There are numerous ways/methods to encode the data as shown in the [category-encoders](https://pypi.org/project/category-encoders/) library. Here is the list of most common ones:

### Label encoding 

Label encoding will convert the categorical value into a vector of integers. The aforementioned 'no' & 'yes' would simply be converted to `[0,1]` list. Accordingly, `['no','maybe','yes']` will be converted to `[0,1,2]`. 

* **Advantages**
    * Simple & memory-efficient (integer/binary vector requires only minimal RAM)
        * Does not create new features (=columns)
    * Ideal for [ordinal](https://en.wikipedia.org/wiki/Ordinal_number) categorical values

* **Disadvantages**
    * Not suitable for many [ML scikit-learn estimators](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features), as these expect continuous input, and would interpret the categories as being ordered, which is often not desired
        * Take for example linear/logistic regression with input `['red','blue','green']` converted to `[0,1,2]`. Higher number (`green`) would imply higher/lower effect (assuming positive/negative slope respectively)

* **Examples**
    * Experience of a candidate applying for a job: the ordered list `['junior','senior','lead']` would become `[0,1,2]`
    * Product rating: `['ok-ish','good','very-good','perfect']` would be `[0,1,2,3]`
        * Keep in mind that the input list is sorted/ordered!
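
A small sketch of why the order matters in practice: scikit-learn's `LabelEncoder` sorts the classes alphabetically, so the intended order is preserved only when it happens to coincide with the alphabetical one. `OrdinalEncoder` with an explicit `categories` list is one way to enforce the order. The variable names below are illustrative.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["junior", "senior", "lead"]  # intended order

# LabelEncoder sorts classes alphabetically: junior=0, lead=1, senior=2
le = LabelEncoder().fit(levels)
print(list(le.classes_))
print(le.transform(levels))

# OrdinalEncoder with explicit categories keeps the intended order
oe = OrdinalEncoder(categories=[levels])
encoded = oe.fit_transform([[lvl] for lvl in levels]).ravel()
print(encoded)
```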
        
### One-Hot Encoding

[One-Hot Encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) converts the categorical values into a vector of ones and zeros. A one is used at the index where a particular value exists, and zeros are used for all other columns. 

* **Examples**
   * Colors: If we have a feature with three possible colors `['red','green','blue']`, then rows with `green` will be converted to `[0,1,0]`. Similarly, rows with `blue` will be converted to `[0,0,1]`, and `red` will be encoded as `[1,0,0]`
       * It is not possible to have more than one color, hence we will have exactly one `1` entry per row
   * Country codes: imagine we have customers from `US`, `DE`, `SK`, and `PL`. The one-hot encoding will create 4 columns. Rows where the customer is from `US` will look like `[1,0,0,0]`, rows with Germany/`DE` like `[0,1,0,0]`, and so on
       * One could think of a different method to encode such column. For example, order the countries according to some criteria (e.g., distance) and use ordinal encoding.
       
* **Advantages**
    * Does not impose any artificial order on the categories, so it works regardless of the ordinal or non-ordinal nature of the feature/input vector

* **Disadvantages**
    * Suitable only for categorical values with relatively low number of unique entries
        * Creates many new features/columns, which increases RAM demand
    * May introduce [collinearity](https://en.wikipedia.org/wiki/Multicollinearity) which is not desired in some models (see for example [linear regression](https://www.kaggle.com/mmdatainfo/linear-regression))
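
The color example can be reproduced with the following sketch. Passing `categories` explicitly is a choice made here so that the column order matches the text (red, green, blue); without it, scikit-learn sorts the categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(["red", "green", "blue", "green"]).reshape(-1, 1)

# Passing `categories` fixes the column order to red, green, blue;
# without it, scikit-learn would sort the categories alphabetically
ohe_demo = OneHotEncoder(categories=[["red", "green", "blue"]])
onehot = ohe_demo.fit_transform(colors).toarray()
print(onehot)
```

If collinearity is a concern, `OneHotEncoder(drop="first")` drops one column per feature, at the cost of a slightly less readable encoding.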
    

Before applying the encoding method, have a look at the unique values available in each of the columns
* Number of unique entries affects directly the shape of output/encoded matrix, see one-hot-encoding/disadvantages section

In [None]:
# for visualization purposes, we will store the results in pandas dataframe and print the result in the next jupyter notebook cell
# nr of unique: will count the number of unique entries
# first 5 unique: will show first 5 unique entries (if less, only those)
dfsummary = pd.DataFrame({"nr of unique":[],"first 5 unique":[]})

# Run loop over all columns computing length (len) of unique entries in each column and converting first 5 entries to a string
for i in train.columns:
    dfsummary = pd.concat([dfsummary,pd.DataFrame({"nr of unique":[len(train[i].unique())],
                                                   "first 5 unique":[str(train[i].unique()[0:5])]},index=[i])],sort=False)

In [None]:
# join the result with metadata-column description
# Will join on column name. However, the provided metadata column names are not identical (e.g, contain empty spaces)
meta = meta.assign(join_name=meta["Variable Name"].replace({" ":"","\t":""},regex=True).str.lower()).set_index("join_name")

# need to do the same for dfsummary
dfsummary = dfsummary.assign(join_name=dfsummary.index.astype(str).str.lower()).set_index("join_name")

# now having identical indices, join tables
dfsummary = dfsummary.join(meta[["Meaning"]])
# set column width only for the next step (to see the Meaning description)
pd.options.display.max_colwidth = 100
dfsummary

In [None]:
# set back to normal
pd.options.display.max_colwidth = 50

Looking at the results above, we can do the following
* columns with `['No' 'Yes']` can be encoded using [scikit-learn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
* `['Yes' 'No' 'No internet service']` can be first converted to `['No' 'Yes']` and encoded using [scikit-learn LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Same applies to the `['Yes' 'No' 'No phone service']`
    * We already have the information about internet and phone service in the `internetservice` and `phoneservice` columns
* column `totalcharges` should be converted to number
    * the pandas `read_csv` method did not convert it because of the ' ' missing values
* `contract` can be interpreted as ordered list: monthly, yearly, 2-yearly. Hence, apply [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
* The remaining columns will be encoded applying [scikit-learn OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
* We will create new dataframes `train_e` and `test_e` holding the encoded values


In [None]:
train = train.replace({'No phone service':'No','No internet service':'No'})
test = test.replace({'No phone service':'No','No internet service':'No'})
df = df.replace({'No phone service':'No','No internet service':'No'})

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:

# tenure, seniorcitizen, and monthlycharges are already numerical and do not need to be converted
# We copy them into the new (encoded) dataframes
train_e = train[["tenure","seniorcitizen","monthlycharges"]].copy()
test_e = test[["tenure","seniorcitizen","monthlycharges"]].copy()

# Convert to float (read as str because of the ' ' missing entries, which were dropped above)
train_e["totalcharges"] = train.totalcharges.astype("float64")
test_e["totalcharges"] = test.totalcharges.astype("float64")

In [None]:
# We want to ensure that 'No' is 0 and 'Yes' is 1. 
# To do that, "fit" the encoder first, and apply (=transform) afterwards
# "fit" means that the object/variable "le_no_yes" will "remember" that No=0 and Yes=1
# (LabelEncoder sorts the classes alphabetically, which gives exactly this mapping)
le_no_yes = LabelEncoder().fit(['No','Yes'])

# now, apply to all columns where 'Yes', 'No' (or inverse order) occurs
for i in dfsummary.index:
    if "yes" in dfsummary.loc[i,"first 5 unique"].lower() and "no" in dfsummary.loc[i,"first 5 unique"].lower():
        print(i)
        train_e[i] = le_no_yes.transform(train[i])
        test_e[i] = le_no_yes.transform(test[i])

In [None]:
# fit first so the mapping is fixed: LabelEncoder sorts classes alphabetically,
# which here matches the intended order: Month-to-month=0, One year=1, Two year=2
le_contract = LabelEncoder().fit(['Month-to-month','One year','Two year'])
train_e["contract"] = le_contract.transform(train["contract"])
test_e["contract"] = le_contract.transform(test["contract"])

In [None]:
# Here is a for loop that will convert all remaining columns applying OneHotEncoder
# we will "declare" the encoder just to use its 'categories_' attribute 
ohe = OneHotEncoder() 
for i in df.columns:
    if i not in train_e.columns:
        print(i)
        # fit = get new mapping for each column
        ohe = OneHotEncoder().fit(train[i].unique().reshape(-1,1))
        # OneHotEncoder (just like ML models) expects/require a numpy matrix/array as input
        temp = pd.DataFrame(ohe.transform(df[i].to_numpy().reshape(-1,1)).toarray(),
                            index=df.index,
                            columns=[i+"_"+cat.lower().replace(" ","_") for cat in ohe.categories_[0]])
        # Check also category-encoders library for easier encoding
        train_e = train_e.join(temp)
        test_e = test_e.join(temp)

In [None]:
train_e.head()

In [None]:
check_stats(train_e)

#### Check correlation
* See if there is any high (anti)correlation for `churn`
* See if there are some fully-correlated feature columns (if so, you can drop one of them)

> Correlation = linear relation!
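
To illustrate the remark above, the sketch below (synthetic data, illustrative names) compares a perfectly linear and a perfectly quadratic dependence on `x`. The Pearson correlation computed by `DataFrame.corr` detects the first and misses the second.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 101)
demo = pd.DataFrame({"x": x, "linear": 2 * x + 1, "quadratic": x ** 2})

# Pearson correlation with x: 1 for the linear column,
# approximately 0 for the quadratic one (although it fully depends on x)
corr_demo = demo.corr()["x"]
print(corr_demo.round(3))
```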

In [None]:
train_e.corr().round(3).style.background_gradient(cmap="viridis")

## Scaling<a name="scaling"/>

The aim of scaling is to bring the values of all features into a comparable range. One of the most commonly used scaling techniques is [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)), more specifically standardization: transforming each feature to zero mean and unit variance (standard deviation of 1). This process is useful especially for ML models applying transfer functions (activation functions in neural networks).  

Without scaling, these transfer functions do not serve their purpose. Imagine the [Logistic function](https://en.wikipedia.org/wiki/Logistic_function) converting the input into the 0 to 1 output range. Without scaling (setting the mean to 0 so that, for normally distributed data, about 99.7% of values fall within +/- 3), all values could be mapped to the min/max, i.e., 0 or 1, even if the input vector shows "normal" variance but its mean is, let's say, -100 or +100 respectively. Scaling also improves numerical stability. Without scaling, the program could exceed the floating-point range or lose precision (imagine multiplying ```1e10``` with ```1e13``` multiple times)

For more information, see [scikit-learn preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) materials  
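
The saturation effect can be sketched with synthetic numbers as follows. The helper `logistic` and the feature values are made up for illustration. Standardization computes `z = (x - mean) / std` for each feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def logistic(v):
    """Logistic (sigmoid) function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

# Made-up feature with "normal" variance but a mean of about +100
rng = np.random.default_rng(0)
raw = rng.normal(loc=100.0, scale=1.0, size=(1000, 1))

# Without scaling the logistic function saturates: every output is ~1
print(logistic(raw).min(), logistic(raw).max())

# After standardization z = (x - mean) / std the outputs spread over (0, 1)
scaled = StandardScaler().fit_transform(raw)
print(round(float(scaled.mean()), 3), round(float(scaled.std()), 3))
print(logistic(scaled).min(), logistic(scaled).max())
```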


In [None]:
# see https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
from sklearn.preprocessing import StandardScaler

In [None]:
# It only makes sense to apply scaling to numerical values
# Again, "fit" on train and apply/"transform" test
scaler = StandardScaler()
num_cols = ["tenure","monthlycharges","totalcharges"]
train_e[num_cols] = scaler.fit_transform(train_e[num_cols])
test_e[num_cols] = scaler.transform(test_e[num_cols])

In [None]:
check_stats(train_e)

## Imbalance<a name="imbalance"/>
[Class imbalance](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) is a common phenomenon in classification problems. The aim of our ML algorithm is to correctly classify each class member. This includes minority classes. The class imbalance is not to be confused with [outliers in regression](https://www.kaggle.com/mmdatainfo/linear-regression) where we want to suppress their effect. 

The [machine learning mastery](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/) summarizes different methods that could be deployed when dealing with class imbalance.  
Here, we will just inspect our data set for imbalance.

The simplest way to check for imbalance is to use the [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method in Pandas. Feel free to check the notebook showing how to plot data using [PyPlot](https://www.kaggle.com/mmdatainfo/pyplot-visualization) or interactive [Plotly](https://www.kaggle.com/mmdatainfo/plotly-visualization) libraries
* Our target is "churn" with binary classes (not a [multiclass problem](https://en.wikipedia.org/wiki/Multiclass_classification))
* As shown below, the imbalance is rather small-to-moderate, i.e. around 1:3
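
A quick numeric check, sketched on a made-up series with a 1:3 imbalance (the values are not taken from the churn data): `value_counts(normalize=True)` returns the class shares instead of absolute counts.

```python
import pandas as pd

# Made-up target column with a 1:3 imbalance
churn_demo = pd.Series(["No"] * 75 + ["Yes"] * 25, name="churn")

print(churn_demo.value_counts())  # absolute counts

share = churn_demo.value_counts(normalize=True)  # class shares
print(share)
```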


In [None]:
import plotly as py
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected=True)

In [None]:
fig = go.Figure(data=[go.Bar(
                            x=train_e.churn.value_counts().index, 
                            y=train_e.churn.value_counts(),
                            text=train_e.churn.value_counts().index,
                            name="train"
                            ),
                      go.Bar(
                            x=test_e.churn.value_counts().index, 
                            y=test_e.churn.value_counts(),
                            text=test_e.churn.value_counts().index,
                            name="test"
                            ),
                     ],
                layout = go.Layout(
                                   title="Checking target/churn imbalance",
                                   # Reduce default (rather big) margins between Figure edges and axes
                                   margin=go.layout.Margin(l=50,r=50,b=50,t=50),
                                   # Set figure size
                                   width=600,
                                   height=400,
                                   xaxis=go.layout.XAxis(
                                                        showgrid=False,
                                                        zeroline=False,
                                                        showticklabels=False
                                                        )
                                   )
               )

fig.show()

# ML: finally!<a name="ml"/>
Now we can finally look at some ML in action! Here, we will show Decision Trees/Random Forest, k-NN, and SVM models. Check the [sklearn model comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) to see how various models perform in different situations (with different decision boundaries).
For metrics, see the prepared [notebook](https://www.kaggle.com/mmdatainfo/performance-metrics) and/or [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall)  

# Decision Tree <a name="decision-tree"/>
Much more detailed description can be found [here](https://www.kaggle.com/mmdatainfo/decision-tree)

### How it works
At each step, find the attribute we can use to partition the data set so as to minimize the [Entropy](https://en.wikipedia.org/wiki/Entropy), or [Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) (not coefficient), of the data in the next step
* just pick the decision that reduces entropy the most at the current stage (=greedy algorithm)
    * will not produce an optimal tree, just one that will (somehow) work
    
#### Entropy<a name="entropy"/>
* Entropy is a measure of disorder in the data set: a quantity that measures, in some sense, how much information is "produced" by a (Markov) process, or at what rate information is produced
* Entropy is highest when every outcome is equally likely
* Every time you remove an equally probable outcome, you introduce predictability and the entropy decreases
* **If the entropy decreases, we can guess the outcome with higher likelihood**
* Information gain is a measure of the decrease in disorder achieved by partitioning the original data set
* Information gain/entropy and **Gini impurity** are [almost identical](https://datascience.stackexchange.com/questions/10228/gini-impurity-vs-entropy)
    * In decision tree learning, it reportedly matters in only 2% of the cases whether you use Gini impurity or entropy.
    * Entropy (needed for information gain) might be a little slower to compute (because it makes use of the logarithm).
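
The quantities above can be written down in a few lines. The sketch below uses the standard definitions, entropy `H = -sum(p_i * log2(p_i))` and Gini impurity `G = 1 - sum(p_i^2)`, where `p_i` is the share of class `i`. The function names and toy labels are illustrative.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, left, right):
    """Entropy decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array(["yes"] * 5 + ["no"] * 5)
left, right = parent[:5], parent[5:]  # a perfect split

print(entropy(parent), gini(parent))          # 1.0 0.5
print(information_gain(parent, left, right))  # 1.0
```

Both measures are highest for the 50/50 parent node and drop to zero for the pure child nodes, which is why the perfect split yields the maximum information gain of 1 bit.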
        
### Pros<a name="pros"/> 
* Simple to understand and to interpret. Trees can be visualized
* Requires little data preparation: **no data normalization** required 
    * however, balancing of classes is recommended (see disadvantages)
* Able to **handle both numerical and categorical data**
* Uses a **white box model**. If a given situation is observable in a model, the explanation for the condition is easily explained by Boolean logic 
    * By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret
* Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model
* Performs well even if its assumptions are somewhat violated by the true model from which the data were generated

### Cons<a name="cons"/> 
* Decision-tree learners can create over-complex trees that do not generalize the data well = **overfitting**
    * **Random decision forests** correct for decision trees' habit of overfitting to their training set
* Decision trees can be **unstable** because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble 
* There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems
* Decision tree learners create **biased trees if some classes dominate**. 
    * It is therefore recommended to balance the dataset prior to fitting with the decision tree

## Random forest
The training algorithm for [random forests](https://en.wikipedia.org/wiki/Random_forest) applies the general technique of bootstrap aggregating, or bagging, to tree learners
* bagging repeatedly selects a random sample with replacement of the training set and fits trees
* outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees
    * Can compute probability of the estimate as we have X trees "voting" for one result
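
The "voting" can be made visible with a short sketch on synthetic data (all names are illustrative). Note that scikit-learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities rather than counting hard votes, so the forest probability equals the mean over the individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is available in `estimators_`; average their class probabilities
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's probability is the mean of the per-tree probabilities
print(np.allclose(averaged, forest.predict_proba(X_demo)))
```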

In [None]:
# select features & labels
X_train, X_test = train_e.drop(columns=["churn"]).to_numpy(), test_e.drop(columns=["churn"]).to_numpy()
y_train, y_test = train_e["churn"].to_numpy(), test_e["churn"].to_numpy()

Let’s compare the performance of one tree (`dt`) to a whole forest (`rf`). In addition, use a [dummy model](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.dummy) as a benchmark (your model should always outperform the dummy model)

In [None]:
# Scikit-learn models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

# Classification metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

# cross-validation
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV

In [None]:
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=50)
du = DummyClassifier(strategy="stratified")

In [None]:
for clf in [du,dt,rf]:
    clf.fit(X_train,y_train)
    print(f"\n{clf}")
    print(classification_report(y_test,clf.predict(X_test)))

In [None]:
# ## Visualize the tree: https://www.kaggle.com/willkoehrsen/visualize-a-decision-tree-w-python-scikit-learn

# from sklearn.tree import export_graphviz
# # Export as dot file
# export_graphviz(dt, out_file='tree.dot', 
#                 feature_names = list(train_e.drop(columns=["churn"]).columns),
#                 class_names = ['no','yes'],
#                 rounded = True, proportion = False, 
#                 precision = 2, filled = True)

# # Convert to png > works on Linux
# from subprocess import call
# call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# # Display in python
# import matplotlib.pyplot as plt
# plt.figure(figsize = (14, 18))
# plt.imshow(plt.imread('tree.png'))
# plt.axis('off');
# plt.show();


Let’s try to improve the random forest performance via cross-validation hyper-parameter tuning

* Some of the **[important parameters](https://www.analyticsvidhya.com/blog/2020/03/beginners-guide-random-forest-hyperparameter-tuning/)**
    * `max_depth`: maximum depth of the tree (none by default)
    * `min_samples_leaf`: minimal number of samples required to be at a leaf node (after split). 1 by default
        * See a tree visualization, for example, [here](https://medium.com/greyatom/decision-trees-a-simple-way-to-visualize-a-decision-dc506a403aeb)
    * `min_samples_split`: minimum number of samples required to split an internal node
        * See a discussion about the difference between split and leaf on [stackoverflow](https://stackoverflow.com/questions/46480457/difference-between-min-samples-split-and-min-samples-leaf-in-sklearn-decisiontre)
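    * `max_features`: how many features are considered for each decision/split. `sqrt(n_features)` by default for `RandomForestClassifier` (all features for a single `DecisionTreeClassifier`)
    * `n_estimators`: the number of trees in the forest (100 by default since scikit-learn 0.22, 10 in older versions)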
    
> Keep in mind that tuning more parameters (with more options) will increase the computation time (and RAM demands)
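
To get a feeling for the cost before starting a grid search, one can count the parameter combinations with `ParameterGrid`. The grid below mirrors the values used in this notebook; `grid_demo` and `n_folds` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

grid_demo = {"min_samples_leaf": [1, 3, 6],
             "min_samples_split": [2, 10, 15],
             "n_estimators": [100, 350, 600]}

n_combinations = len(ParameterGrid(grid_demo))
n_folds = 5
print(n_combinations)            # 27 parameter combinations
print(n_combinations * n_folds)  # 135 model fits (plus one final refit)
```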

In [None]:
tuning_parameter = {'min_samples_leaf': [1, 3, 6],
                    'min_samples_split': [2, 10, 15],
                    'n_estimators': [100,350,600]}

In [None]:
cv = KFold(n_splits=5)
rf = GridSearchCV(RandomForestClassifier(),
                  param_grid=tuning_parameter, 
                  scoring="f1_weighted",
                  cv=cv,
                  n_jobs = 10,
                  ) # verbose=10

In [None]:
rf.fit(X_train,y_train)

In [None]:
rf.best_estimator_

In [None]:
print(classification_report(y_test,rf.best_estimator_.predict(X_test)))

# k-Nearest Neighbors (kNN)

kNN is a non-probabilistic, non-parametric, and instance-based learning algorithm:
* **Non-parametric** means it makes no explicit assumptions about the functional form of *h*, avoiding the dangers of mis-modelling the underlying distribution of the data
    * For example, suppose our data is highly non-Gaussian but the chosen learning model assumes a Gaussian form. In that case, a parametric algorithm would make extremely poor predictions
* **Instance-based** learning means that the algorithm does not explicitly learn a model
    * Instead, it chooses to memorize the training instances which are subsequently used as "knowledge" for the prediction phase
    * Concretely, this means that only when a query to our database is made (i.e., when we ask it to predict a label given an input), will the algorithm use the training instances to predict the result
        
See separate [k-NN notebook](https://www.kaggle.com/mmdatainfo/k-nearest-neighbors) for more details.  

### How it works
An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k-nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor
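
The plurality vote can be sketched from scratch in a few lines of NumPy. The coordinates and labels are made up so that the prediction flips between k = 3 and k = 5. This is an illustration of the idea, not how `KNeighborsClassifier` is implemented internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    """Classify `query` by plurality vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Made-up points: 2 triangles and 1 square close to the query,
# 2 more squares slightly farther away, 1 triangle far away
points = np.array([[0.5, 0.0], [0.0, 0.6], [-0.7, 0.0],
                   [0.0, -1.0], [1.1, 0.0], [3.0, 3.0]])
labels = np.array(["triangle", "triangle", "square",
                   "square", "square", "triangle"])
query = np.array([0.0, 0.0])

print(knn_predict(points, labels, query, k=3))  # triangle
print(knn_predict(points, labels, query, k=5))  # square
```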

[Wikipedia example of k-NN classification](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm): The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle), it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle), it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/1280px-KnnClassification.svg.png" alt="drawing" width="300"/> 

### Pros<a name="pros"/> 
* **simple** to understand and implement
* kNN **works just as easily with multi-class data** sets, whereas some other algorithms are designed for the binary setting
* the non-parametric nature of kNN gives it an edge in settings where the data is highly unusual, because it works **without prior knowledge of the distribution**

### Cons<a name="cons"/> 
* **computationally expensive** testing phase
    * we **need to keep the whole training set and search it for each prediction**!
* can **suffer from skewed class distributions**
    * for example, if a certain class is very frequent in the training set, it will tend to dominate the majority voting of the new example (large number = more common)
* the accuracy can be severely **degraded with high-dimensional data** because there is little difference between the distances to the nearest and the farthest neighbor
    * **the curse of dimensionality** refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience
    * for high-dimensional data (e.g., with more than 10 dimensions), **scaling** and **dimension reduction** (such as PCA) are usually performed prior to applying kNN
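
The shrinking contrast between the nearest and the farthest neighbor can be observed on random data. The following is a small sketch under simple assumptions (uniformly distributed points, Euclidean distance); the helper `nearest_to_farthest_ratio` is a name introduced for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest Euclidean distance from a random query point."""
    points = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio close to 1 means "nearest" and "farthest" are almost equally far away
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(nearest_to_farthest_ratio(n_dims), 3))
```

As the number of dimensions grows, the ratio moves towards 1, so the notion of a "nearest" neighbor carries less and less information.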


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=0.999)  # keep enough components to explain 99.9% of the variance (or set n_components="mle")
# As always, fit on train, transform test
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
print("Nr. of features after PCA = {} (input = {})".format(X_train_pca.shape[1],X_train.shape[1]))

In [None]:
tuning_parameter = {"n_neighbors":list(range(1,30,2)),
                    "weights":["uniform","distance"]}
knn = GridSearchCV(KNeighborsClassifier(), 
                   tuning_parameter, 
                   cv=cv,
                   scoring="f1_weighted",
                   n_jobs = 10)

In [None]:
knn.fit(X_train_pca,y_train)

In [None]:
knn.best_estimator_

In [None]:
print(classification_report(y_test,knn.best_estimator_.predict(X_test_pca)))
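
The "fit on train, transform test" rule also applies to every fold inside the cross-validation. One possible way to respect it, sketched here on synthetic data rather than on the churn data, is to wrap scaling, PCA and kNN in a scikit-learn `Pipeline` and tune the pipeline as a whole. The names `X_demo`, `pipe` and `knn_pipe`, the 95% variance threshold and the small grid are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the churn data of this notebook)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaler and PCA are re-fitted on the training part of each fold only
pipe = Pipeline([("scale", StandardScaler()),
                 ("pca", PCA(n_components=0.95)),
                 ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"knn__n_neighbors": [3, 7, 11],
              "knn__weights": ["uniform", "distance"]}

knn_pipe = GridSearchCV(pipe, param_grid, cv=KFold(n_splits=5),
                        scoring="f1_weighted")
knn_pipe.fit(X_tr, y_tr)

print(knn_pipe.best_params_)
print("weighted F1 on held-out data:", knn_pipe.score(X_te, y_te))
```

The fitted search object transforms new data with the scaler and PCA learned on the training data, so no separate `X_test_pca` needs to be created by hand.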

# Support Vector Machines (SVMs)

This section gives a very brief introduction to Support Vector Machines (SVMs). Please refer to the [SVM notebook](https://www.kaggle.com/mmdatainfo/support-vector-machines) for more details. 
SVMs are **supervised learning models** for **classification** and **regression**. The algorithm finds a hyperplane that divides the data into two categories. The training points closest to it, the support vectors, define this hyperplane and its margin. Thus, a linear SVM is a parametric, non-probabilistic binary classifier. SVMs can also be used for classification with non-linear boundaries using the "kernel trick".   
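
As a rough illustration of support vectors and the margin, the sketch below fits a linear SVM to six hand-picked, linearly separable points. The data (`X_toy`, `y_toy`) is invented for this example, and the very large `C` is an assumption used to approximate a hard margin. For a linear SVM the margin width equals 2 / ||w||, where w is the normal vector of the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (illustrative data, not the churn set)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves (almost) no room for margin violations
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = clf.coef_[0]                           # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)     # distance between the two margin boundaries

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points returned in `support_vectors_` determine the hyperplane. Moving any of the other points (without crossing the margin) would leave the solution unchanged.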

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/1920px-SVM_margin.png" alt="drawing" width="300"/> 

### How it works

The aim of SVM is to find a line (hyperplane) that separates two sets of labels into two classes
* The question is how to set the line: there are many hyperplanes that might classify the data
* One reasonable choice as the best hyperplane is the one that **represents the largest separation**, or margin, between the two classes. 
    * A bigger margin means a higher probability that any new point will be correctly classified (more space with correct classification)
    * Maximizing the margin also constrains the solution, which leads to a reduced [VC dimension](https://en.wikipedia.org/wiki/VC_dimension) (compared to arbitrary lines that would separate the two labels)
    * The algorithm **chooses the hyperplane so that the distance from it to the nearest data point on each side is maximized**
* Non-linear boundary: works via a non-linear transformation that converts the features from one space to another feature space
    * The advantage is that this only affects the features, while the actual function to be minimized stays the same
    * We work in the new space, which means that the **support vectors are also in the new space**
        * So even if the boundary defined by the support vectors may look very dramatic in the original space (a polynomial of degree 10, for example), it is **linear in the new space**
    * More importantly, **only the number of found support vectors in the new space affects the generalizability**
        * **Compared to simple regression analysis**: fitting a degree-10 polynomial, for example, would significantly increase the number of estimated parameters (complexity), leading to poor generalization
    * The transformation is achieved by the so-called "kernel trick" (far beyond the scope of this lecture)
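
Although the theory is out of scope here, the core idea of the kernel trick can be checked numerically. The sketch below uses the homogeneous polynomial kernel of degree 2 for two-dimensional inputs as an example. The functions `phi` and `poly2_kernel` and the two sample vectors are chosen for illustration only.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product computed in the new 3-D space
implicit = poly2_kernel(x, z)      # same quantity computed in the original 2-D space
print(explicit, implicit)
```

Both numbers agree, so an algorithm that only needs inner products can work in the new space without ever computing the transformed features explicitly.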

### Pros
* Very efficient: the decision function depends only on the support vectors, a subset of the training points
* Works well with **higher-dimensional data**
* Intuitive interpretation
* Can use different kernels (linear, polynomial,...): can **solve non-linear problems**
* The **absence of local minima**: training is a convex optimization problem

### Cons
* Highly **depends on choice of the kernel** and regularization parameter
* Does not directly provide probability estimates
* The optimal design for multiclass SVM classifiers is a further area for research (SVM is binary by design)
* **Scaling** and **regularization** are recommended
* Parameters of a solved model are difficult to interpret
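
Two of these points, scaling and the missing probability estimates, can be handled with standard scikit-learn tools. The sketch below is one possible approach on synthetic data (not the churn data): the scaler is combined with the SVM in a pipeline, and probabilities are obtained afterwards by calibrating the decision scores with `CalibratedClassifierCV`. The names `X_demo`, `svm_pipe` and `calibrated` are assumptions of this example.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the churn data of this notebook)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scaling and SVM in one estimator: the scaler is always fitted on training data only
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_pipe.fit(X_tr, y_tr)

# The raw SVM output is a signed score, not a probability
scores = svm_pipe.decision_function(X_te[:3])

# Sigmoid (Platt) calibration maps the scores to probability estimates
calibrated = CalibratedClassifierCV(svm_pipe, method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
probas = calibrated.predict_proba(X_te[:3])

print(scores)
print(probas)
```

The calibration step needs extra cross-validated fits, so it adds to the training time.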


In [None]:
from sklearn.svm import SVC
import numpy as np

Parameters for [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) using [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
* `C`: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty (see [Regularization](https://www.kaggle.com/mmdatainfo/regularization-and-validation) and [RBF theory](https://www.kaggle.com/mmdatainfo/support-vector-machines))
* `gamma`: Kernel coefficient (see [RBF theory](https://www.kaggle.com/mmdatainfo/support-vector-machines))
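
To build some intuition for `gamma` before running the full grid search, the sketch below fits an RBF SVM with three very different `gamma` values on a synthetic two-moons data set and compares training and test accuracy. The data set, the fixed `C=1.0` and the chosen values are assumptions for illustration, not recommendations for the churn data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: NOT the churn data of this notebook)
X_demo, y_demo = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

gamma_scores = {}
for gamma in (0.001, 1.0, 1000.0):
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    # store (train accuracy, test accuracy, number of support vectors)
    gamma_scores[gamma] = (model.score(X_tr, y_tr),
                           model.score(X_te, y_te),
                           len(model.support_))
    print(gamma, gamma_scores[gamma])
```

A very large `gamma` lets each training point influence only its immediate surroundings, which typically shows up as a high training score combined with a lower test score (overfitting). A very small `gamma` gives an overly smooth boundary.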

In [None]:
tuning_parameter = {"C":np.logspace(-3, 3, 7),
                    "gamma":np.logspace(-3, 3, 7)}
svc = GridSearchCV(SVC(kernel="rbf"), 
                      tuning_parameter, 
                      cv=cv,
                      scoring="f1_weighted",
                      n_jobs = 10,
                      return_train_score=True)

In [None]:
svc.fit(X_train,y_train)

In [None]:
svc.best_estimator_

In [None]:
print(classification_report(y_test,svc.best_estimator_.predict(X_test)))

In [None]:
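# To show how the score depends on the parameters, inspect or plot the cross-validation results
# (note: "mean_test_score" in cv_results_ refers to the validation folds, not to X_test)
# pd.DataFrame(svc.cv_results_)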
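
One way to make the cross-validation results readable is to pivot them into a `C` by `gamma` table. The sketch below runs a small search on synthetic data so that it is self-contained. The names `demo_search` and `score_table` and the reduced 3 x 3 grid are assumptions of this example. The same pivot should work on the results of the search fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic stand-in data and a reduced grid (assumptions for a quick demo)
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)
grid = {"C": np.logspace(-1, 1, 3), "gamma": np.logspace(-1, 1, 3)}

demo_search = GridSearchCV(SVC(kernel="rbf"), grid,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0),
                           scoring="f1_weighted",
                           return_train_score=True)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination; cast the parameter columns to float for pivoting
results = pd.DataFrame(demo_search.cv_results_)
results = results.astype({"param_C": float, "param_gamma": float})

# Rows: C, columns: gamma, values: mean score over the validation folds
score_table = results.pivot_table(index="param_C", columns="param_gamma",
                                  values="mean_test_score")
print(score_table.round(3))
```

Because `return_train_score=True` is set, the same pivot with `values="mean_train_score"` gives the training scores, and comparing both tables helps to spot overfitting.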

## Data Quest:
1. try SVM and/or random forest with the reduced data set (`X_train_pca`) to see if the result differs from `X_train`
2. try running random forest using [over-sampled](https://imbalanced-learn.org/stable/over_sampling.html) data to further suppress the moderate class imbalance (solution in commented code below)
3. try to change the [scoring metric](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) and number of [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)s in [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) just to see how the results change
4. try to change the hyper-parameter range (`tuning_parameter`)
    * Keep in mind: more = longer run time

In [None]:
# ##You can try to train SVM with the reduced data set: the results should be similar
# svc.fit(X_train_pca,y_train)
# print(classification_report(y_test,svc.best_estimator_.predict(X_test_pca)))

In [None]:
# ##Try re-sampling to suppress the moderate imbalance
# from imblearn.over_sampling import RandomOverSampler
# ros = RandomOverSampler(random_state=0)
# X_resampled, y_resampled = ros.fit_resample(X_train, y_train)
# tuning_parameter = {'min_samples_leaf': [1, 3, 6],
#                     'min_samples_split': [2, 10, 15],
#                     'n_estimators': [100,350,600]}
# rf = GridSearchCV(RandomForestClassifier(),
#                   param_grid=tuning_parameter, 
#                   scoring="f1_weighted",
#                   cv=cv,
#                   n_jobs = 10,
#                   )
# rf.fit(X_resampled, y_resampled)
# print(classification_report(y_test,rf.best_estimator_.predict(X_test)))

In [None]:
# ## Check if the result changes if using more k-folds
# search = GridSearchCV(RandomForestClassifier(),
#                       param_grid={'min_samples_leaf': [1, 3, 6],
#                                   'min_samples_split': [2, 10, 15],
#                                   'n_estimators': [100,350,600]}, 
#                       scoring="f1_weighted",
#                       cv=KFold(n_splits=10),
#                       n_jobs = 10,
#                       )
# search.fit(X_train,y_train)
# print(classification_report(y_test,search.best_estimator_.predict(X_test)))