<font size="4">Costa Rica Poverty Exploration based on the data set provided on [Kaggle](www.kaggle.com)</font>

Created on Tue Apr 9, 2019

**Project Team on Kaggle for Machine Learning:**
 * Jatin Solanki
 * Hemant Pandey
 * Dinesh Bulusu
 * Bheemeswara Sarma Kalluri

**References:**
* Scaling the data using SciKit:
    * https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/
* Splitting to Train & Test using SciKit:
    * https://medium.com/@contactsunny/how-to-split-your-dataset-to-train-and-test-datasets-using-scikit-learn-e7cf6eb5e0d
* Cross Validation:
    * https://scikit-learn.org/stable/modules/cross_validation.html
* K Nearest Neighbors:
    * https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/
* How to get started with Machine Learning:
    * http://www.freecodecamp.org
* Feature Engineering:
    * https://www.kaggle.com/willkoehrsen/a-complete-introduction-and-walkthrough
* Value Error:
    * https://datascience.stackexchange.com/questions/11928/valueerror-input-contains-nan-infinity-or-a-value-too-large-for-dtypefloat32
* Random Forest: Hyperparameter Tuning:
    * https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
    * https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

**Tips:**
* To hide code on Kaggle:
    * In the kernel editor, every code cell has an option in its top-right corner to hide the input or output.
* Use Markdown for formatting on Kaggle:
    * URL: https://www.markdowntutorial.com/

**Objectives:**
1. Import Required libraries    
1. Standard Scaling (numeric data / Data Preparation)
1. Visualization
1. PCA (to reduce number of columns)
1. Pipeline
1. Apply following modeling techniques:
    * RandomForest (use cross-validation and Bayesian optimization for the hyperparameter values)
    * XGBoost (use cross-validation and Bayesian optimization for the hyperparameter values); generally the best model observed on Kaggle
    * LightGBM (use cross-validation and Bayesian optimization for the hyperparameter values)
1. Summary Observations

**1. Import Required libraries**

* **NumPy** - The well-known numerical computing library. It supports everything from computing the median of a data distribution to processing multidimensional arrays.
* **Pandas** - Used for reading CSV files, processing tables, and computing summary statistics.
* **Matplotlib** - Used to visualize the data held in pandas DataFrames. A picture is worth a thousand words.
* **Seaborn** - Another visualization tool that is more focused on statistical visualizations such as histograms, count plots, and density curves.
* **SciKit-Learn** - The final boss of Machine Learning with Python. It provides everything we need, from algorithms to model tuning.

In [None]:
%reset -f

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# 1.0 Importing Basic Libraries.
# 1.1 Load pandas, numpy, matplotlib, json, Encoder, os & time
import numpy as np                     # linear algebra
import pandas as pd                    # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import json 
from sklearn.preprocessing import LabelEncoder,OrdinalEncoder
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
import time
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

In [None]:
# 1.2 Image manipulation
from skimage.io import imshow, imsave

In [None]:
# 1.3 Libraries for Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale

In [None]:
# 1.4 Libraries for splitting.
from sklearn.model_selection import train_test_split

In [None]:
# 1.4.1 Return stratified folds. The folds are made by preserving the percentage of samples for each class.
from sklearn.model_selection import StratifiedKFold
# Libraries for AUC (Area Under the Curve) & ROC (Receiver Operating Characteristic curve)
from sklearn.metrics import auc, roc_curve
# Libraries for Modelling
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier as gbm
from sklearn.tree import  DecisionTreeClassifier as dt
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.ensemble import RandomForestRegressor

In [None]:
# 1.4.2 Visualization Plotly
#import plotly.plotly as py
#import plotly.graph_objs as go
import seaborn as sns

In [None]:
# 1.5 ML - we will classify using lightgbm
import lightgbm as lgb

In [None]:
# 1.6 Bayes Optimization -- One method
from bayes_opt import BayesianOptimization

In [None]:
# 1.7 Bayes optimization -- second method
# skopt (scikit-optimize) is a parameter-optimisation framework
from skopt import BayesSearchCV

In [None]:
# Metrics
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

### 2. Data Loading and Processing for understanding

**2.1 Load the data**

Load data from Train, Test & Sample Submission CSV files.

In [None]:
#os.chdir("../input")
#os.listdir()
df_train = pd.read_csv("../input/train.csv") 
df_test =  pd.read_csv("../input/test.csv")
df_sample_submission =  pd.read_csv("../input/sample_submission.csv")

**2.2 Check the data**

In [None]:
print ("Glimpse / sample of Train Dataset: ")
df_train.head()

These are the core data fields as described in the data description of df_train:
1. Id - a unique identifier for each row.
2. Target - the target is an ordinal variable indicating groups of income levels:
    * 1 = extreme poverty
    * 2 = moderate poverty
    * 3 = vulnerable households
    * 4 = non vulnerable households
3. idhogar - this is a unique identifier for each household. This can be used to create household-wide features, etc. All rows in a given household will have a matching value for this identifier.
4. parentesco1 - indicates if this person is the head of the household.
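
Since every row of a household shares the same `idhogar`, household-wide features can be built with a group-by. The following is a minimal sketch on a small invented frame; the values and the `hh_mean_age` and `hh_members` column names are illustrative assumptions and do not come from the competition data.

```python
import pandas as pd

# Toy data: two households identified by idhogar (all values are invented).
toy = pd.DataFrame({
    "idhogar": ["a1", "a1", "a1", "b2", "b2"],
    "age": [40, 38, 6, 70, 65],
    "parentesco1": [1, 0, 0, 1, 0],
})

# Aggregate per household, then broadcast the result back to every member row.
grouped = toy.groupby("idhogar")["age"]
toy["hh_mean_age"] = grouped.transform("mean")
toy["hh_members"] = grouped.transform("count")

# One row per household: keep only the heads of household.
heads = toy[toy["parentesco1"] == 1]
print(heads)
```

Using `transform` keeps the original row count, so the new columns can sit next to the individual-level columns without a merge.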

In [None]:
print ("Summary of Train Dataset: ",df_train.describe())
df_train.head(5)

In [None]:
target = df_train['Target']
target.value_counts()

In [None]:
print ("Glimpse / Sample of Test Dataset: ")
df_test.head()

In [None]:
print ("Summary of Test Dataset: ")
df_test.describe()

In [None]:
print ("Glimpse of Sample Submission Dataset: ")
df_sample_submission.head()

In [None]:
print ("Summary of Sample Submission Dataset: ")
df_sample_submission.describe()

### Visualization

In [None]:
# 3.1 Target
import seaborn as sns
sns.countplot(x="Target", data=df_train)

In [None]:
sns.countplot(x="r4t3",hue="Target",data=df_train)

In [None]:
sns.countplot(x="v18q",hue="Target",data=df_train)

In [None]:
sns.countplot(x="tamhog",hue="Target",data=df_train)

In [None]:
sns.countplot(x="hhsize",hue="Target",data=df_train)

In [None]:
sns.countplot(x="noelec",hue="Target",data=df_train)

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
scatter_matrix(df_train.select_dtypes('float'), alpha=0.2, figsize=(26, 20), diagonal='kde')
plt.show()

### Finding columns with null values

In [None]:
def missingdata(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/ data.isnull().count()*100).sort_values(ascending = False)
    ms=pd.concat([total, percent], axis=1, keys=['Total','Percent'])
    ms = ms[ms["Percent"] > 0]
    f,ax = plt.subplots(figsize=(8,6))
    plt.xticks(rotation=90)
    fig = sns.barplot(x=ms.index, y=ms["Percent"], color="green", alpha=0.8)
    plt.xlabel('Features', fontsize=15)
    plt.ylabel('Percent of missing values', fontsize=15)
    plt.title('Percent missing data by feature', fontsize=15)
    return ms

In [None]:
missingdata(df_train)

In [None]:
df_train.head(10)

In [None]:
df_train.dtypes.value_counts() 

Now check for missing values in df_test data set

In [None]:
missingdata(df_test)

In [None]:
df_test.shape

In [None]:
df_test.dtypes.value_counts() 

In [None]:
df_train.drop(columns = ['rez_esc','v18q1','v2a1','meaneduc', 'SQBmeaned'], inplace = True)

In [None]:
df_test.drop(columns = ['rez_esc','v18q1','v2a1','meaneduc', 'SQBmeaned'], inplace = True)

The columns that are not required have been dropped from both data sets.

Another option to check for missing values in both test and train data sets

In [None]:
naData = df_train.isnull().sum().values / df_train.shape[0] *100 
df_na = pd.DataFrame(naData, index=df_train.columns, columns=['Count']) 
df_na = df_na.sort_values(by=['Count'], ascending=False)
missing_count = df_na[df_na['Count']>0].shape[0]
print('We got', missing_count, 'columns which have missing values in train set.') 
df_na.head(10) 
# Before the drop above, 5 columns of the train dataset contained missing values

In [None]:
naData = df_test.isnull().sum().values / df_test.shape[0] *100 
df_na = pd.DataFrame(naData, index=df_test.columns, columns=['Count']) 
df_na = df_na.sort_values(by=['Count'], ascending=False)
missing_count = df_na[df_na['Count']>0].shape[0]
print('We got', missing_count, 'columns which have missing values in test set.') 
df_na.head(10)

In [None]:
# 2.3 Function to examine any dataset
#     ExamineData.__doc__  => Gives help
def ExamineData(x):
    """Prints various data characteristics, given x
    """
    print("Data shape:", x.shape)
    print("\nColumns:", x.columns)
    print("\nData types\n", x.dtypes)
    print("\nDescribe data\n", x.describe())
    print("\nData\n", x.head(2))
    print ("\nSize of data:", np.sum(x.memory_usage()))    # Get size of dataframes
    print("\nAre there any NULLS\n", np.sum(x.isnull()))

In [None]:
# 2.3.1 Let us understand train data
ExamineData(df_train)

In [None]:
# 2.3.2 Let us understand test data
ExamineData(df_test)

Handling the missing values in test and train data

In [None]:
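# v2a1 (monthly rent) was dropped above, so this assignment re-creates the column:
# 0 where tipovivi1 == 1, NaN everywhere else
df_train.loc[(df_train['tipovivi1'] == 1), 'v2a1'] = 0
df_test.loc[(df_test['tipovivi1'] == 1), 'v2a1'] = 0
# Caution: no column label is given here, so every column of the matching rows
# (age > 19 or age < 7), including Id and Target in df_train, is set to 0
df_train.loc[((df_train['age'] > 19) | (df_train['age'] < 7))] = 0
df_test.loc[((df_test['age'] > 19) | (df_test['age'] < 7))] = 0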

In [None]:
print("The Train dataset has {0} rows and {1} columns".format(df_train.shape[0], df_train.shape[1]))
print("The Test dataset has {0} rows and {1} columns".format(df_test.shape[0], df_test.shape[1]))

In [None]:
# Remove Squared Variables to avoid confusion.
df_test = df_test[[x for x in df_test if not x.startswith('SQB')]]
df_train = df_train[[x for x in df_train if not x.startswith('SQB')]]
df_train = df_train.drop(columns = ['agesq'])
df_test = df_test.drop(columns = ['agesq'])
df_test.shape, df_train.shape

In [None]:
print("The Train dataset has {0} rows and {1} columns".format(df_train.shape[0], df_train.shape[1]))
print("The Test dataset has {0} rows and {1} columns".format(df_test.shape[0], df_test.shape[1]))

In [None]:
#Getting list of Columns which can act as Features
## list of features to be used
features = [c for c in df_train.columns if c not in ['Id', 'Target']]

In [None]:
import random
# Fix the random seeds so that results do not change between runs
random.seed(45)
np.random.seed(45)   # scikit-learn draws from NumPy's global generator when random_state is not set

In [None]:
# Handling data by label encoding
def label_encoding(col):
    le = LabelEncoder()
    # Fit on the string form of the values so that fit and transform see the same labels
    le.fit(list(df_train[col].astype(str).values) + list(df_test[col].astype(str).values))
    df_train[col] = le.transform(df_train[col].astype(str))
    df_test[col] = le.transform(df_test[col].astype(str))

num_cols = df_train._get_numeric_data().columns
cat_cols = list(set(features) - set(num_cols))
for col in cat_cols:
    label_encoding(col)
    
df_train.shape,df_test.shape

In [None]:
print("The Train dataset has {0} rows and {1} columns".format(df_train.shape[0], df_train.shape[1]))
print("The Test dataset has {0} rows and {1} columns".format(df_test.shape[0], df_test.shape[1]))

#### Feature Engineering

In [None]:
# Difference between people living in house and household size
df_train['hhsize-diff'] = df_train['tamviv'] - df_train['hhsize']
df_test['hhsize-diff'] = df_test['tamviv'] - df_test['hhsize']
# Assign an electricity-type code: 0 = noelec, 1 = coopele, 2 = public, 3 = planpri.
# The first matching indicator wins; rows where no indicator is set get NaN.
def electricity_type(df):
    conditions = [df['noelec'] == 1, df['coopele'] == 1,
                  df['public'] == 1, df['planpri'] == 1]
    return np.select(conditions, [0, 1, 2, 3], default=np.nan)

# Record the new variable and missing flag
df_test['elec'] = electricity_type(df_test)
df_test['elec-missing'] = df_test['elec'].isnull()
df_train['elec'] = electricity_type(df_train)
df_train['elec-missing'] = df_train['elec'].isnull()

# Wall ordinal variable
df_train['walls'] = np.argmax(np.array(df_train[['epared1', 'epared2', 'epared3']]),
                           axis = 1)
df_test['walls'] = np.argmax(np.array(df_test[['epared1', 'epared2', 'epared3']]),
                           axis = 1)
# Roof ordinal variable
df_train['roof'] = np.argmax(np.array(df_train[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
df_test['roof'] = np.argmax(np.array(df_test[['etecho1', 'etecho2', 'etecho3']]),
                           axis = 1)
# Floor ordinal variable
df_train['floor'] = np.argmax(np.array(df_train[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
df_test['floor'] = np.argmax(np.array(df_test[['eviv1', 'eviv2', 'eviv3']]),
                           axis = 1)
# Create new feature
df_train['walls+roof+floor'] = df_train['walls'] + df_train['roof'] + df_train['floor']
df_test['walls+roof+floor'] = df_test['walls'] + df_test['roof'] + df_test['floor']
# No toilet, no electricity, no floor, no water service, no ceiling
df_test['warning'] = 1 * (df_test['sanitario1'] + 
                         (df_test['elec'] == 0) + 
                         df_test['pisonotiene'] + 
                         df_test['abastaguano'] + 
                         (df_test['cielorazo'] == 0))
df_train['warning'] = 1 * (df_train['sanitario1'] + 
                         (df_train['elec'] == 0) + 
                         df_train['pisonotiene'] + 
                         df_train['abastaguano'] + 
                         (df_train['cielorazo'] == 0))
# Owns a refrigerator, computer, and television
df_train['bonus'] = 1 * (df_train['refrig'] + 
                      df_train['computer'] + 
                      df_train['television'])
df_test['bonus'] = 1 * (df_test['refrig'] + 
                      df_test['computer'] + 
                      df_test['television'])
# Per capita features
df_test['phones-per-capita'] = df_test['qmobilephone'] / df_test['tamviv']
df_test['rooms-per-capita'] = df_test['rooms'] / df_test['tamviv']
df_test['rent-per-capita'] = df_test['v2a1'] / df_test['tamviv']

df_train['phones-per-capita'] = df_train['qmobilephone'] / df_train['tamviv']
df_train['rooms-per-capita'] = df_train['rooms'] / df_train['tamviv']
df_train['rent-per-capita'] = df_train['v2a1'] / df_train['tamviv']

# Create one feature from the `instlevel` columns
df_train['inst'] = np.argmax(np.array(df_train[[c for c in df_train if c.startswith('instl')]]), axis = 1)
df_test['inst'] = np.argmax(np.array(df_test[[c for c in df_test if c.startswith('instl')]]), axis = 1)

df_train['escolari/age'] = df_train['escolari'] / df_train['age']
df_train['inst/age'] = df_train['inst'] / df_train['age']
df_test['escolari/age'] = df_test['escolari'] / df_test['age']
df_test['inst/age'] = df_test['inst'] / df_test['age']

print('Train Data shape: ', df_train.shape,'Test Data shape: ',df_test.shape)

In [None]:
print("The Train dataset has {0} rows and {1} columns".format(df_train.shape[0], df_train.shape[1]))
print("The Test dataset has {0} rows and {1} columns".format(df_test.shape[0], df_test.shape[1]))

Finding and handling column names with Object Dtypes

In [None]:
#Finding Object type columns
df_train.select_dtypes('object').head()

In [None]:
# Dropping the only object type column "Id" so that the data can be processed for further modelling
df_train =df_train.drop(columns = ['Id'])

In [None]:
print("The Train dataset has {0} rows and {1} columns".format(df_train.shape[0], df_train.shape[1]))
print("The Test dataset has {0} rows and {1} columns".format(df_test.shape[0], df_test.shape[1]))

In [None]:
df_train.head(10)

Generic Column Transform Function

In [None]:
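# Defining the transformation function using ColumnTransformer, OneHotEncoder and StandardScaler
from sklearn.compose import ColumnTransformer as ct
from sklearn.preprocessing import OneHotEncoder as ohe
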
def transform(categorical_columns,numerical_columns,df):
    #  (taskName, objectToPerformTask, columns-upon-which-to-perform)
    # One hot encode categorical columns
    cat = ('categorical', ohe() , categorical_columns  )
    # Scale numerical columns
    num = ('numeric', StandardScaler(), numerical_columns)
    # Instantiate columnTransformer object to perform task
    # It transforms X separately by each transformer and then concatenates results.
    col_trans = ct([cat, num])
    # Learn data
    col_trans.fit(df)
    # Now transform df
    df_transAndScaled = col_trans.transform(df)
    # Return transformed data and also transformation object
    return df_transAndScaled, col_trans

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
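# Cleaning data
# Note: replace() returns a new DataFrame; the results below are not assigned,
# so df_train and df_test are left unchanged by these two lines
df_train.replace(0, np.nan)
df_test.replace(0, np.nan)
# fillna() replaces missing values with the mean value of each numeric column
df_train.fillna(df_train.mean(numeric_only=True), inplace=True)
print(df_train.isnull().sum())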

df_train.shape

In [None]:
df_test.fillna(df_test.mean(numeric_only=True), inplace=True)
print(df_test.isnull().sum())
df_test

In [None]:
df_train.drop(['idhogar',"dependency","edjefe","edjefa"], inplace = True, axis =1)
df_test.drop(['idhogar',"dependency","edjefe","edjefa"], inplace = True, axis =1)

### Principal Component Analysis (PCA) / Karhunen-Loeve Transform (KLT)

Principal Component Analysis (PCA) is used mainly for reducing dimensionality when we have a lot of features and are undecided about which features / components to keep in the analysis without impacting the end result.

1. It can be used to mitigate the problems caused by high dimensionality.
2. It can be used to compress the data with very little information lost.
3. It helps in understanding the structure of data with hundreds of dimensions, which is otherwise difficult.

PCA is also known as the **Karhunen-Loeve Transform (KLT)**, a technique used for finding patterns in high-dimensional data.

PCA reduces a set of possibly correlated high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called **Principal Components**.

PCA finds a set of vectors spanning a subspace that minimizes the sum of the squared errors of the projected data; equivalently, the projection retains the greatest proportion of the original data set's variance.

PCA is also very useful when the variance in a data set is distributed unevenly across the dimensions.
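
To make the idea concrete, here is a minimal sketch on synthetic data (the arrays and the names `data`, `scaled`, `pca_demo` and `scores` are illustrative assumptions, not part of the Costa Rica data set). The third column is almost a copy of the first, so nearly all of the variance fits in two components, and the resulting components are uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic data: 3 columns, the third almost a copy of the first.
a = rng.normal(size=500)
b = rng.normal(size=500)
data = np.column_stack([a, b, a + 0.05 * rng.normal(size=500)])

# Standardize first so every column contributes on the same scale.
scaled = StandardScaler().fit_transform(data)

pca_demo = PCA().fit(scaled)
ratios = pca_demo.explained_variance_ratio_
print(ratios.round(3))             # share of variance per component
print(np.cumsum(ratios).round(3))  # cumulative share

# The components are uncorrelated: off-diagonal covariances are close to 0.
scores = pca_demo.transform(scaled)
print(np.cov(scores, rowvar=False).round(3))
```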

In [None]:
from sklearn.decomposition import PCA

In [None]:
# PCA should be run on the standardised data set; note that df_train has not been scaled at this point
pca = PCA().fit(df_train)

PCA_Summary function provides summary of PCA Analysis

In [None]:
def pca_summary(pca, standardized_data, out=True):
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = list(np.std(pca.transform(standardized_data), axis=0))
    b = list(pca.explained_variance_ratio_)
    c = [np.sum(pca.explained_variance_ratio_[:i]) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"), ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame(list(zip(a, b, c)), index=names, columns=columns)
    
    if out:
        print("Importance of components:")
        display(summary)
    return summary

In [None]:
# Use the standardised data set to get PCA_Summary
summary = pca_summary(pca, df_train)

This gives us the standard deviation of each component, and the proportion of variance explained by each component. 

The standard deviation of the components is stored in a column called `sdev` of the output made by the `pca_summary` function, which we stored in the `summary` variable:

In [None]:
# Summary of SD is taken from the above
summary.sdev

The total variance explained by the components is the sum of the variances of the components:

In [None]:
np.sum(summary.sdev**2)

In [None]:
summary.sdev**2

In [None]:
# Note: df_train still contains the Target column here, so Target is part of X
X = df_train.values

In [None]:
X

In [None]:
X = scale(X)

In [None]:
pca = PCA(n_components =141)

In [None]:
pca.fit(X)

In [None]:
var= pca.explained_variance_ratio_

In [None]:
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

In [None]:
plt.plot(var1)

In [None]:
# Here we decide how many principal components to keep.
# A common rule of thumb is to keep the principal components whose SD > 1; here we keep 80, as the cumulative variance curve is flat beyond that point.
pca = PCA(n_components=80)

In [None]:
X = pca.fit_transform(X)

In [None]:
X.shape

#### PCA Conclusion

Using PCA we have reduced the data from 144 columns to 80 principal components, as the remaining 64 components add little or no variation.
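
The number of components can also be derived from a variance threshold rather than read off the plot. The sketch below is one possible approach on synthetic data; the 80% threshold and the names `demo`, `k` and `pca_80` are assumptions chosen for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)

# Synthetic stand-in: 20 columns driven by only 5 underlying signals.
signals = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
demo = signals @ mixing + 0.1 * rng.normal(size=(300, 20))
demo = StandardScaler().fit_transform(demo)

# Option 1: read the count off the cumulative explained variance.
cum = np.cumsum(PCA().fit(demo).explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.80) + 1)   # smallest k whose cumulative share reaches 0.80

# Option 2: pass the threshold as a float and let PCA pick the count.
pca_80 = PCA(n_components=0.80, svd_solver="full").fit(demo)
print(k, pca_80.n_components_)
```

Both options should agree; the second is shorter, while the first keeps the full cumulative curve available for plotting.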

### Split the data

In [None]:
df_train.head()

In [None]:
# Separating target and predictors
y=df_train["Target"]
y.unique()

In [None]:
df_train.drop(['Target'], inplace = True, axis =1)

In [None]:
df_train.shape #(9557, 140)

In [None]:
#X=df_train

In [None]:
df_test.shape #(23856, 141)

In [None]:
#Scaling the data 
scaler = StandardScaler()   # named scaler so the imported scale() function is not overwritten
X = scaler.fit_transform(X)

In [None]:
#Final set of features for modelling 
X.shape #(9557, 80)

In [None]:
y.shape #(9557)

In [None]:
df_test.shape #(23856, 141)

In [None]:
# Splitting the data into test and train
X_train, X_test, y_train, y_test = train_test_split(
                                                    X,
                                                    y,
                                                    test_size = 0.3, stratify = y)

In [None]:
print("The X_train dataset has {0} rows and {1} columns".format(X_train.shape[0], X_train.shape[1]))
print("The X_test dataset has {0} rows and {1} columns".format(X_test.shape[0], X_test.shape[1]))
print("The y_train dataset has {0} rows".format(y_train.shape[0]))
print("The y_test dataset has {0} rows".format(y_test.shape[0]))

## Part II - Modelling

### Model 1 - Random Forest

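This is a supervised machine learning problem: it is supervised because we have both the features and the target that we want to predict. In this section we give the random forest both the features and the targets, and it must learn how to map the data to a prediction. Because Target takes one of a few ordinal levels, the first model below is a classifier; the hyperparameter search further down also tries a `RandomForestRegressor`.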

**Steps to Follow**:
1. State the question and determine required data
2. Acquire the data in an accessible format
3. Identify and correct missing data points/anomalies as required
4. Prepare the data for the machine learning model
5. Establish a baseline model that you aim to exceed
6. Train the model on the training data
7. Make predictions on the test data
8. Compare predictions to the known test set targets and calculate performance metrics
9. If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
10. Interpret model and report results visually and numerically.

Most of the steps up to data clean-up and splitting were completed before we began modelling, so we now focus on the remaining steps.
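
Step 5 (establishing a baseline) can be sketched with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The labels below are invented to mimic an imbalanced 1-4 target; they are not the competition's class counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Invented, imbalanced labels on the same 1-4 scale as Target.
y_demo = np.array([4] * 60 + [2] * 20 + [3] * 12 + [1] * 8)
X_demo = np.zeros((len(y_demo), 1))   # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, pred)
macro_f1 = f1_score(y_demo, pred, average="macro", zero_division=0)
print(acc)        # 0.6, the share of the majority class
print(macro_f1)   # 0.1875, three of the four classes are never predicted
```

On imbalanced targets a high accuracy can hide poor performance on the rare classes, which is why the macro F1 score is worth reporting alongside accuracy.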

After all the work of data preparation before modelling, creating and training the model is simple using Scikit-learn.

We take the random forest classifier from scikit-learn (imported above as `rf`), instantiate the model, and fit (scikit-learn's name for training) the model on the training data, setting the random state for reproducible results.

In [None]:
modelrf = rf(random_state=42)

In [None]:
start = time.time()
modelrf = modelrf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelrf.predict(X_test)
classes

In [None]:
(classes == y_test).sum()/y_test.size  #91

In [None]:
f1 = f1_score(y_test, classes, average='macro')
f1 #56

In [None]:
y_pred = modelrf.predict(X_test)
accuracy_score(y_test, y_pred) #91

In [None]:
f  = confusion_matrix( y_test, classes )#confusion_matrix(y_true, y_pred)
f

In [None]:
bayes_tuner = BayesSearchCV(
    #  Place your estimator here with those parameter values
    #      that you DO NOT WANT TO TUNE
    rf(
       n_jobs = 2         # No need to tune this parameter value
      ),

    # 2.12 Specify estimator parameters that you would like to change/tune
    {
        'n_estimators': (80, 500),           # Specify integer-values parameters like this
        'criterion': ['gini', 'entropy'],     # Specify categorical parameters as here
        'max_depth': (4, 100),                # integer valued parameter
        'max_features' : (10,64),             # integer-valued parameter
        'min_weight_fraction_leaf' : (0,0.5, 'uniform')   # Float-valued parameter
    },


    n_iter=32,            # How many points to sample
    cv = 3                # Number of cross-validation folds
)
# Start op

In [None]:
bayes_tuner.fit(X_train, y_train)

In [None]:
best_params = pd.Series(bayes_tuner.best_params_)
best_params

In [None]:
TunedRF=rf(criterion="gini",
               max_depth=100,
               max_features=64,
               min_weight_fraction_leaf=0.0,
               n_estimators=500)

In [None]:
start = time.time()
TunedRF = TunedRF.fit(X_train, y_train)
end = time.time()
(end-start)/60

Our model has now been trained to learn the relationships between the features and the targets. 

The next step is figuring out how good the model is! To do this we make predictions on the test features (the model is never allowed to see the test answers). 

We then compare the predictions to the known answers. 

When performing regression, we need to make sure to use the **absolute error** because we expect some of our answers to be low and some to be high. We are interested in how far away our average prediction is from the actual value so we take the absolute value.
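
A small worked example shows why the absolute value matters. The numbers are invented for illustration: without the absolute value, positive and negative errors partly cancel and understate how far off the predictions are.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

actual = np.array([4, 2, 3, 1])            # invented targets
predicted = np.array([3.5, 2.5, 3.0, 2.0])  # invented predictions

errors = predicted - actual                 # [-0.5, 0.5, 0.0, 1.0]
mean_signed = errors.mean()                 # 0.25: signed errors partly cancel
mean_abs = np.abs(errors).mean()            # 0.5: mean absolute error
print(mean_signed, mean_abs)
print(mean_absolute_error(actual, predicted))  # 0.5, the same quantity from scikit-learn
```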

In [None]:
rf_predict=TunedRF.predict(X_test)
rf_predict

In [None]:
#  What accuracy is available on test-data
(rf_predict == y_test).sum()/y_test.size 

In [None]:
bayes_tuner.best_score_

#### Random Forest has provided an accuracy of 92.95%

To quantify the usefulness of all the variables in the entire random forest, we can look at the relative importances of the variables. The importances returned by scikit-learn represent how much including a particular variable improves the prediction. The actual calculation of the importance is beyond the scope of this notebook, but we can use the numbers to make relative comparisons between variables.
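
As a hedged illustration, the sketch below reads `feature_importances_` from a forest trained on synthetic data in which only the first column carries signal. The data and the feature names are assumptions made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic data: only column 0 determines the label, columns 1-2 are noise.
X_imp = rng.normal(size=(400, 3))
y_imp = (X_imp[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_imp, y_imp)

# Importances are non-negative and sum to 1, so they can be compared directly.
importances = forest.feature_importances_
for name, score in zip(["informative", "noise_a", "noise_b"], importances):
    print(f"{name}: {score:.3f}")
```

In this notebook the models are trained on principal components, so the importances would refer to components rather than to the original columns.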

In [None]:
#  And what all sets of parameters were tried?
bayes_tuner.cv_results_['params']

Hyperparameters in Random Forest

In [None]:
# Use a separate name so the rf alias keeps pointing at the classifier class
rf_default = rf(random_state = 42)
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf_default.get_params())

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = [1.0, 'sqrt']   # 1.0 = all features (what 'auto' meant for regressors in older scikit-learn)
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune (named rf_reg so the rf classifier alias stays usable)
rf_reg = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf_reg, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)

In [None]:
rf_random.best_params_

In [None]:
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy
base_model = RandomForestRegressor(n_estimators = 10, random_state = 42)
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_test, y_test)

In [None]:
best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))

We achieved an **improvement in accuracy of 0.86%**. 

Depending on the application, this could be a significant benefit. 

We can further improve our results by using grid search to focus on the most promising hyperparameters ranges found in the random search.

#### Grid Search with Cross Validation (CV)

In [None]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a base model
rf_grid = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf_grid, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

This will try out 1 x 4 x 2 x 3 x 3 x 4 = 288 combinations of settings. 
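
The count can be checked with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. The grid is repeated here under the assumed name `param_grid_demo` so that the snippet is self-contained.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000],
}

n_combinations = len(ParameterGrid(param_grid_demo))
n_fits = n_combinations * 3            # cv = 3 folds per combination
print(n_combinations, n_fits)          # 288 864
```

With 3-fold cross-validation each combination is fitted three times, so the search trains 864 models before the final refit.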

We can fit the model, display the best hyperparameters, and evaluate performance:

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
# Best parameters reported by the search:
# {'bootstrap': True,
#  'max_depth': 80,
#  'max_features': 3,
#  'min_samples_leaf': 5,
#  'min_samples_split': 12,
#  'n_estimators': 100}
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, X_test, y_test)

In [None]:
print('Improvement of {:0.2f}%.'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))

### Results of Random Forest Classifier

**Accuracy on Test Data with Random Forest is : 94.94%**

### Model 2 - Gradient Boosting Classifier

Gradient Boosting involves 3 elements:
1. A loss function to be optimised.
2. A weak learner to make predictions.
3. An additive model to add weak learners to minimise the loss function.

Improvements that can be made to basic Gradient Boosting:
1. Tree Constraints
2. Shrinkage
3. Random Sampling
4. Penalized Learning
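
The three elements can be sketched by hand for a regression toy problem. This is a simplified illustration with squared loss on synthetic data, not scikit-learn's actual `GradientBoostingClassifier` implementation; all names and values are assumptions. It also shows two of the listed improvements: tree constraints (`max_depth`) and shrinkage (`learning_rate`).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1                              # shrinkage
prediction = np.full_like(y_gb, y_gb.mean())     # start from a constant model
initial_mse = np.mean((y_gb - prediction) ** 2)  # loss function: squared error

for _ in range(100):
    # For squared loss the negative gradient is (up to a constant) the residual.
    residuals = y_gb - prediction
    tree = DecisionTreeRegressor(max_depth=2)    # weak learner, constrained tree
    tree.fit(X_gb, residuals)
    # Additive model: add a damped copy of the new learner's output.
    prediction += learning_rate * tree.predict(X_gb)

final_mse = np.mean((y_gb - prediction) ** 2)
print(initial_mse, final_mse)
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error falls step by step.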

In [None]:
modelgbm=gbm()

In [None]:
start = time.time()
modelgbm = modelgbm.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelgbm.predict(X_test)
classes

In [None]:
unique_elements, counts_elements = np.unique(classes, return_counts=True)
print(np.asarray((unique_elements, counts_elements)))

In [None]:
(classes == y_test).sum()/y_test.size #93	

#### Gradient Boosting Classifier has given an accuracy of 92.29%

In [None]:
f1 = f1_score(y_test, classes, average='macro')	
f1 # 58%
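
A macro F1 score far below the accuracy is what one would expect if a few classes are rare and poorly predicted, although that has not been verified here. The following sketch uses made-up labels, not the project data, to show how the two metrics can diverge. The helper names `accuracy` and `macro_f1` are hypothetical and written by hand so the arithmetic is visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: class 4 dominates, classes 1-3 are rare
y_true = [4] * 90 + [1] * 4 + [2] * 3 + [3] * 3
# A model that always predicts the majority class
y_pred = [4] * 100

acc = accuracy(y_true, y_pred)
f1_macro = macro_f1(y_true, y_pred)
print(acc, f1_macro)
```

In this made-up case the accuracy is 0.9, yet three of the four classes have an F1 of zero, so the macro average is pulled down to roughly 0.24. Macro F1 weights every class equally, which is why it is the more demanding metric when rare classes matter.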

In [None]:
bayes_tuner = BayesSearchCV(
    gbm(),
    {
        'n_estimators': (100, 500),           # integer-valued parameter
        'max_depth': (4, 100),                # integer-valued parameter
        'max_features': (10, 64),             # integer-valued parameter
        'min_weight_fraction_leaf': (0.0, 0.5, 'uniform')   # float-valued parameter
    },
    n_iter=32,            # how many points to sample
    cv=3                  # number of cross-validation folds
)

In [None]:
bayes_tuner.fit(X_train, y_train)

In [None]:
best_params = pd.Series(bayes_tuner.best_params_)	
best_params

In [None]:
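# Refit a Gradient Boosting model with the tuned values.
# criterion="entropy" is dropped: it is not a valid option for gradient
# boosting and was not part of the search space above.
TunedGBM = gbm(max_depth=100,
               max_features=64,
               min_weight_fraction_leaf=0.0,
               n_estimators=250)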

In [None]:
start = time.time()
TunedGBM = TunedGBM.fit(X_train, y_train)
end = time.time()
(end-start)/60

### Model 3 - Extra Tree Classifier

In [None]:
modeletf = ExtraTreesClassifier()
start = time.time()
modeletf = modeletf.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modeletf.predict(X_test)
classes	

In [None]:
(classes == y_test).sum()/y_test.size

#### Extra Tree Classifier has given an accuracy of 94.28%

In [None]:
bayes_cv_tuner = BayesSearchCV(
    ExtraTreesClassifier(),
    {
        'n_estimators': (100, 500),           # integer-valued parameter
        'criterion': ['gini', 'entropy'],     # categorical parameter
        'max_depth': (4, 100),                # integer-valued parameter
        'max_features': (10, 64),             # integer-valued parameter
        'min_weight_fraction_leaf': (0.0, 0.5, 'uniform')   # float-valued parameter
    },
    n_iter=32,            # how many points to sample
    cv=2                  # number of cross-validation folds
)

In [None]:
# Start optimization
bayes_cv_tuner.fit(X_train, y_train)

In [None]:
#  Get list of best-parameters
bayes_cv_tuner.best_params_

In [None]:
modeletc=ExtraTreesClassifier(criterion="entropy",
               max_depth=100,
               max_features=64,
               min_weight_fraction_leaf=0.0,
               n_estimators=100)

In [None]:
start = time.time()
modeletc = modeletc.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
bayes_cv_tuner.best_score_

In [None]:
# Accuracy available on test-data
bayes_cv_tuner.score(X_test, y_test)

In [None]:
bayes_cv_tuner.cv_results_['params']

### Model 4 - Light GBM

In [None]:
modellgb = lgb.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                             random_state=None, silent=True, metric='None', 
                             n_jobs=4, n_estimators=5000, class_weight='balanced',
                             colsample_bytree =  0.93, min_child_samples = 95, num_leaves = 14, subsample = 0.96)


In [None]:
start = time.time()
modellgb = modellgb.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modellgb.predict(X_test)
classes
(classes == y_test).sum()/y_test.size 

#### Light GBM has given an accuracy of 94.14%	

### Model 5 - KNeighborsClassifier

In [None]:
modelneigh = KNeighborsClassifier(n_neighbors=4)

In [None]:
start = time.time()
modelneigh = modelneigh.fit(X_train, y_train)
end = time.time()
(end-start)/60

In [None]:
classes = modelneigh.predict(X_test)
classes
(classes == y_test).sum()/y_test.size

#### KNeighborsClassifier has given an accuracy of 90.55%

### Model 6 - XGBoost

XGBoost, short for “Extreme Gradient Boosting”, was introduced by Chen in 2014. Since its introduction, XGBoost has become one of the most popular machine learning algorithms.

In [None]:
modelXGB = XGBClassifier()
modelXGB.fit(X_train, y_train)

In [None]:
y_pred = modelXGB.predict(X_test)	
predictions = [round(value) for value in y_pred]

In [None]:
accuracy = accuracy_score(y_test, predictions)	
print("Accuracy: %.2f%%" % (accuracy * 100.0))

#### XGBoost has given an accuracy of 92.89%

## Cumulative Model Summary for Costa Rica Project

Of the 6 models, **Extra Tree Classifier tops the list with an accuracy of 94.28%**:
1. Extra Tree Classifier has given an accuracy of 94.28%
2. Light GBM has given an accuracy of 94.14%
3. Random Forest has provided an accuracy of 93.79%
4. Gradient Boosting Classifier has given an accuracy of 93.27%
5. XGBoost has given an accuracy of 92.89%
6. KNeighborsClassifier has given an accuracy of 90.55%

## Hyperparameter Fine-Tuning - Bayesian Optimization

There are two common methods of parameter tuning: grid search and random search.
Each has its pros and cons. Grid search is slow but covers the whole search space systematically, while random search is fast but could miss important points in the search space.
Luckily, a third option exists: Bayesian optimization.
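
To make the grid versus random trade-off concrete, here is a small sketch that gives both strategies the same budget of 25 evaluations on a toy two-parameter objective. Everything in it is an assumption for illustration: the `objective` function is hypothetical (it stands in for a validation loss that depends strongly on `x` and only weakly on `y`), and no model from this notebook is involved.

```python
import random

def objective(x, y):
    """Toy validation loss (hypothetical): the minimum is at x = 0.37, y = 0.5."""
    return (x - 0.37) ** 2 + 0.01 * (y - 0.5) ** 2

# Grid search: 5 x 5 evenly spaced points = 25 evaluations,
# but only 5 distinct values of the important parameter x are ever tried
grid_axis = [0.0, 0.25, 0.5, 0.75, 1.0]
grid_trials = [(objective(x, y), x, y) for x in grid_axis for y in grid_axis]
grid_best = min(grid_trials)

# Random search: the same budget of 25 evaluations, sampled uniformly,
# so up to 25 distinct values of x are tried
rng = random.Random(0)
random_trials = []
for _ in range(25):
    x, y = rng.random(), rng.random()
    random_trials.append((objective(x, y), x, y))
random_best = min(random_trials)

print("grid best (loss, x, y):  ", grid_best)
print("random best (loss, x, y):", random_best)
```

The grid can never get closer to the optimum than its nearest grid line (`x = 0.25` here), while random search has no such floor but also no guarantee of covering any particular region.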

Using Bayesian optimization for parameter tuning allows us to search efficiently for the best parameters for a given model, e.g., Random Forest.
This also allows us to perform better model selection. Typically, a machine learning engineer or data scientist will perform some form of manual parameter tuning (grid search or random search) for a few models, such as decision tree, XGBoost, and k nearest neighbors, then compare the accuracy scores and select the best one. This approach risks comparing sub-optimal models: maybe we have found the optimal parameters for the decision tree but missed the optimal parameters for Random Forest, in which case the model comparison is flawed. K nearest neighbors may beat Random Forest every time if the Random Forest parameters are poorly tuned.
Bayesian optimization allows us to find the best parameters of all the models considered, and therefore compare the best models.
This results in better model selection, because we are comparing the best k nearest neighbors to the best decision tree, etc. Only in this way can we do model selection with high confidence that the actual best model is selected and used.

# Sample code from hyperopt	
from hyperopt import fmin, tpe, hp
best = fmin(
    fn=lambda x: x,
    space=hp.uniform('x', 0, 1),
    algo=tpe.suggest,
    max_evals=100)
print(best)

### Results of Bayesian Optimization	

#### Extra Tree Classifier improved its accuracy to 95.01% after finetuning with Bayesian optimization
#### Random Forest improved its accuracy to 94.45% after finetuning with Bayesian optimization