# Purpose of this Kernel

The following is a demonstration machine learning project in Python. I have a couple of years of experience with R and have become familiar with machine learning using the Caret framework. However, I realize that Python is truly the lingua franca of data science, so I am getting on board. I have a good deal of Python programming experience, so (hopefully) the transition will be fairly painless. My goal for this kernel isn't to demonstrate anything groundbreaking but to get personally comfortable creating attractive-looking kernels in Python and to develop an XGBoost workflow that I can come back to in future, more complex projects.

**All feedback is highly appreciated!**

I have drawn on a large number of resources to learn Python and XGBoost but the following had a direct effect on this kernel:

* This [cheat sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf) for Scikit Learn 
* This [guide](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) to tuning XGBoost hyperparameters
* This [kernel](https://www.kaggle.com/tilii7/hyperparameter-grid-search-with-xgboost) by Tilii was really helpful in understanding how to implement random grid search
* This [mini course](https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/) on XGBoost in Python 



<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Python.svg/768px-Python.svg.png" width="250px"/>

 <a id="top"></a>
# Table of Contents

* [Load Packages](#1)
* [Import Data](#2)
* [Preprocess Data](#3)
* [Split Data Into Test and Training Sets](#4)
* [Build Base Model](#5)
* [Build Tuned Model](#6)
* [Conclusions](#7)


# Load Packages <a id="1"></a>

In [None]:
# pandas is for data manipulation and wrangling
import pandas as pd
# XGBoost is the specific model and we want the classifier 
from xgboost import XGBClassifier
# creates feature importance plot
from xgboost import plot_importance
# Label encoding transforms non-ordinal categorical variables into integers
from sklearn.preprocessing import LabelEncoder
# splits data into test and training sets
from sklearn.model_selection import train_test_split
# for tuning; older scikit-learn versions keep this in sklearn.grid_search
from sklearn.model_selection import RandomizedSearchCV
# for assessing accuracy
from sklearn.metrics import accuracy_score
# for splitting the data into stratified folds
from sklearn.model_selection import StratifiedKFold



# Import Data <a id="2"></a>

In [None]:
# import data set
df = pd.read_csv('../input/WA_Fn-UseC_-Telco-Customer-Churn.csv')
# view the top rows
df.head()

# Preprocess Data <a id="3"></a>

Fortunately this is not a 'messy' data set in that there aren't any missing values. That said, in order to use XGBoost some preprocessing still needs to be done so that all the data is numerical:

* Encode categorical variables with two levels.
* For categorical variables with more than two levels, create dummy variables.
* Remove the customer ID feature.
* The data in the Total Charges column are strings. Convert them to floats.
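
To make the first two steps concrete, here is a minimal sketch on a toy frame. The rows are hypothetical and only borrow two column names from the data set; they are not taken from the Telco file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not rows from the Telco data set
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Two-level column: LabelEncoder sorts the labels, so "No" -> 0 and "Yes" -> 1
toy["Partner"] = LabelEncoder().fit_transform(toy["Partner"])

# Multi-level column: get_dummies creates one indicator column per level
toy = pd.get_dummies(toy, columns=["Contract"])

print(toy["Partner"].tolist())  # [1, 0, 1]
print(toy.columns.tolist())
```

The dummy columns are named after the original column and the level, which keeps the feature importance plot readable later on.
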

In [None]:
# Make dummy variables for categorical variables with >2 levels
dummy_columns = ["MultipleLines","InternetService","OnlineSecurity",
                 "OnlineBackup","DeviceProtection","TechSupport",
                 "StreamingTV","StreamingMovies","Contract",
                 "PaymentMethod"]

df_clean = pd.get_dummies(df, columns = dummy_columns)

# Encode categorical variables with 2 levels
enc = LabelEncoder()
encode_columns = ["Churn","PaperlessBilling","PhoneService",
                  "gender","Partner","Dependents"]

for col in encode_columns:
    df_clean[col] = enc.fit_transform(df[col])
    
# Remove customer ID column
del df_clean["customerID"]


# Make TotalCharges column numeric, empty strings are zeros
df_clean["TotalCharges"] = pd.to_numeric(df_clean["TotalCharges"],
    errors = 'coerce').fillna(0)
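
As a small illustration of the last step, this sketch shows how a blank string behaves under errors='coerce'. The values are made up for the example.

```python
import pandas as pd

# Hypothetical values; the middle entry mimics a blank TotalCharges cell
raw = pd.Series(["29.85", " ", "1889.5"])

# Unparseable strings become NaN, which fillna then replaces with 0
clean = pd.to_numeric(raw, errors="coerce").fillna(0)

print(clean.tolist())  # [29.85, 0.0, 1889.5]
print(clean.dtype)     # float64
```
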

# Split Data into Test and Training Sets <a id="4"></a>

Split the data into the target variable, y, which needs to be predicted (whether the customer churned or not), and all the other predictive variables, x. Then use the train_test_split function to assign 80% of the data to the training set and 20% to the test set.
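
Here is a quick sketch of the 80/20 split on ten hypothetical rows (the toy frame and the names x_toy and y_toy are made up for this example). It also shows that fixing random_state makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten rows with an alternating target
toy = pd.DataFrame({"tenure": range(10), "Churn": [0, 1] * 5})
y_toy = toy["Churn"]
x_toy = toy.drop("Churn", axis=1)

# Each call returns (x_train, x_test, y_train, y_test)
split_a = train_test_split(x_toy, y_toy, test_size=0.2, random_state=1)
split_b = train_test_split(x_toy, y_toy, test_size=0.2, random_state=1)

print(len(split_a[0]), len(split_a[1]))           # 8 2
print(split_a[1].index.equals(split_b[1].index))  # True
```
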

In [None]:
# Split data into x and y
y = df_clean["Churn"]  # a 1-D Series, which is the target shape scikit-learn expects
x = df_clean.drop("Churn", axis=1)

# Create test and training sets
x_train, x_test, y_train, y_test = train_test_split(x,
    y, test_size= .2, random_state= 1)

# Build Base Model <a id="5"></a>

This is the XGBoost model built using the default parameters. I will compare this base model's performance to that of a tuned model.

In [None]:
# Build XGBoost model
model = XGBClassifier()
model.fit(x_train, y_train)


# Make predictions for test data (predict already returns class labels)
y_pred = model.predict(x_test)

# Find accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Display feature importance
plot_importance(model)
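
As a reminder of what the accuracy figure measures, here is a tiny sketch with hypothetical labels. These are not results from the churn model.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, not output from the model above
y_true = [0, 1, 1, 0]
y_hat = [0, 1, 0, 0]

# Three of the four predictions match, so accuracy is 3 / 4
toy_accuracy = accuracy_score(y_true, y_hat)
print("Accuracy: %.2f%%" % (toy_accuracy * 100.0))  # Accuracy: 75.00%
```
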

# Build Tuned Model <a id="6"></a>


The following explanations are quick reminders to myself based on the documentation.

### Model parameters to be tuned
* **min_child_weight** – Minimum sum of instance weight (hessian) needed in a child. Used to control over-fitting. Values that are too high can lead to under-fitting. If the classes are highly unbalanced, lower values (even 1) can be alright. 
* **gamma** – Minimum loss reduction required to make a further partition on a leaf node of the tree. Good values of this parameter are highly specific to the data and model. (typically 0 - 10) 
* **subsample** – Subsample ratio of the training instances. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small might lead to under-fitting. (typically .5 - 1)
* **colsample_bytree** – Subsample ratio of columns when constructing each tree. Denotes the fraction of columns to be randomly sampled for each tree. (typically .5 - 1)
* **max_depth** – Maximum tree depth for base learners. Used to control overfitting. (typically 3 – 10)

### Model parameters that do not need to be tuned
* **learning_rate** – Boosting learning rate (xgb’s “eta”), smaller generally gives better results but will take more time
* **n_estimators** – Number of boosted trees to fit. To a point, more are better but will take more time.
* **objective** – Specify the learning task and the corresponding learning objective. I use ‘binary:logistic’, which is logistic regression for binary classification and returns probabilities for each of the two classes.
* **silent** – Whether to print messages while running boosting. Recent XGBoost releases replace this with verbosity.
* **nthread** – Number of parallel threads used to run xgboost. -1 means all cores available. Recent XGBoost releases name this n_jobs.
* **booster** – Specify which booster to use: gbtree, gblinear or dart. gblinear fits a linear model rather than trees, so the tree parameters tuned above would not apply to it. I choose gbtree arbitrarily but I know others have had success with dart.
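
As a rough illustration of what ‘binary:logistic’ means, the sketch below maps a raw score (log-odds) to a probability with the logistic function. The helper name margin_to_probability is hypothetical and not part of the XGBoost API.

```python
import math


def margin_to_probability(margin):
    """Map a raw score (log-odds) to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-margin))


# A raw score of 0 sits exactly on the decision boundary
print(margin_to_probability(0.0))            # 0.5
# Positive scores lean towards the positive class (churn)
print(round(margin_to_probability(2.0), 3))  # 0.881
```
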

### Stratified K fold parameters 
We use stratified folds so that each fold has the same ratio of target classes as the entire data set. We do this because machine learning model performance can be very sensitive to sample balancing. 
* **n_splits** – Number of folds. Higher values generally give a more reliable estimate but take more time to run.
* **shuffle** – Whether to shuffle each class's samples before splitting into batches. There is no need to do so here as the data ordering is arbitrary. 
* **random_state** - If int, random_state is the seed used by the random number generator. It only has an effect when shuffle is True.
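
The effect of stratification can be checked on a small, deliberately imbalanced target. This is only a sketch with made-up data: 15 non-churners and 5 churners.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 25% of the samples are positive
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((20, 1))  # feature values do not matter for the split itself

skf_toy = StratifiedKFold(n_splits=5, shuffle=False)

# Share of positives in each held-out fold
fold_rates = [y_toy[test_idx].mean() for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(fold_rates)  # [0.25, 0.25, 0.25, 0.25, 0.25]
```

Every held-out fold keeps the 25% positive rate of the full toy target, which is the property described above.
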

### Random search parameters
* **estimator** - The model to tune. A copy of it is fitted for each sampled parameter combination. This is the model we specify.
* **param_distributions** – Dictionary with parameter names as keys and distributions or lists of parameters to try. This is specified as ‘tuned parameters’ in this kernel.
* **n_iter** – Number of parameter settings that are sampled. In other words, the number of random combinations of the tuning parameters to evaluate the model on. A higher number generally gives better results but a longer run time. 
* **scoring** – A single string naming the metric used to evaluate predictions on the held-out fold.
* **n_jobs** – Number of jobs to run in parallel. -1 means all cores available.
* **cv** - Determines the cross-validation splitting strategy. Here I use the split method of the stratified K fold object, which generates indices for the training and testing part of each fold. 
* **verbose** - Controls the verbosity: the higher, the more messages.
* **random_state** - The seed used by the random number generator.
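
The sketch below shows how these arguments fit together and how to read the results afterwards. To keep it self-contained it uses synthetic data and scikit-learn's DecisionTreeClassifier as a stand-in for XGBoost; both are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the churn features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=22)

# Lists of candidate values; 3 x 3 = 9 possible combinations
demo_parameters = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=22),  # stand-in model
    param_distributions=demo_parameters,
    n_iter=4,              # evaluate 4 of the 9 combinations
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5),
    random_state=22,
)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))  # 4 sampled combinations
print(search.best_params_)                # best combination found
print(round(search.best_score_, 3))       # its mean cross-validated accuracy
```

After fitting, best_params_ and best_score_ are a convenient way to see which combination won before predicting on the test set.
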


In [None]:
tuned_parameters = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5, 10],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 5, 8]
        }

model = XGBClassifier(learning_rate=0.02, 
                    n_estimators=200,
                    booster = 'gbtree',
                    objective='binary:logistic',
                    silent=True, 
                    nthread=-1)


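# random_state only applies when shuffle=True; recent scikit-learn
# versions raise an error if it is set while shuffle=False
skf = StratifiedKFold(n_splits=5, shuffle=False)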

random_search_model = RandomizedSearchCV(estimator = model, 
                                   param_distributions=tuned_parameters, 
                                   n_iter=10, 
                                   scoring='accuracy', 
                                   n_jobs=-1, 
                                   cv=skf.split(x_train,y_train), 
                                   verbose=3, 
                                   random_state=22)

random_search_model.fit(x_train, y_train)

# Predict with the best model found by the search (returns class labels)
y_pred = random_search_model.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Conclusions <a id="7"></a>

This kernel has successfully worked through a simple machine learning problem, and the workflow can be easily adapted in the future.