# Customer Churn Prediction with TrueFoundry

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/truefoundry/truefoundry-examples/blob/main/churn-prediction-sklearn/train.ipynb)


Link to the problem statement: https://www.kaggle.com/c/customer-churn-prediction-2020

This notebook walks through a solution to the Customer Churn Prediction 2020 challenge on Kaggle.

1. We use `GradientBoostingClassifier` for classification.
2. We use `mlfoundry` to log the metrics, datasets, and model for each run.

### Install and import Python packages

In [None]:
# install pip packages
%pip install -U "mlfoundry>=0.3.33,<0.4.0" > /dev/null

# import the libraries used throughout the notebook
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier

### Copy your MLFoundry API key and store it in `api_token`

In [None]:
from getpass import getpass
api_token = getpass("TrueFoundry API Token (Get it from https://app.truefoundry.com/settings):")


### Create MLFoundry client

We will use the client to log hyperparameters, metrics, datasets, and the model.



In [None]:
import mlfoundry as mlf
mlf_client = mlf.get_client(api_key=api_token)

In [None]:
# download the test and train datasets
!curl -O https://raw.githubusercontent.com/truefoundry/truefoundry-examples/main/churn-prediction-sklearn/data/train.csv
!curl -O https://raw.githubusercontent.com/truefoundry/truefoundry-examples/main/churn-prediction-sklearn/data/test.csv

### Let's create an MLFoundry run

In [None]:
run = mlf_client.create_run(project_name="churn-prediction-sklearn")

### Load training data and split it into train and validation datasets

In [None]:
df = pd.read_csv('train.csv')
df.describe()

In [None]:
# split the training data into train and validation sets
from sklearn.model_selection import train_test_split

X = df.drop(columns=['churn'])
y = df['churn']

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)

# let's take a look at the value counts of yes and no
y_train.value_counts(), y_val.value_counts()
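
The `stratify=y` argument asks `train_test_split` to keep the share of each class roughly equal in both splits. A minimal sketch on made-up data (the `toy_*` names and the 80/20 label mix are assumptions for illustration, not the Kaggle data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 80 "no" labels and 20 "yes" labels.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["no"] * 80 + ["yes"] * 20, name="churn")

_, _, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=1
)

# Fraction of each class in the two splits; with stratification they should match closely.
train_share = toy_y_train.value_counts(normalize=True)
val_share = toy_y_val.value_counts(normalize=True)
print(train_share)
print(val_share)
```

Without `stratify`, a small validation set could end up with noticeably more or fewer churners than the training set, which makes the two accuracies harder to compare.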

### Cleaning the data
1. Sum the day, evening, and night minutes into `total_net_minutes` to reduce the number of features; do the same for calls and charges
2. Convert the `yes` and `no` strings into 1/0 in `voice_mail_plan` and `international_plan`; the `churn` labels are encoded separately with `pd.Categorical`
3. Drop the categorical columns `state` and `area_code`
4. Drop the per-period columns that the new totals replace

In [None]:
def clean_data(df):
    # work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # collapse the day/evening/night columns into one total per quantity
    df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
    df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
    df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']

    # encode the yes/no flags as 1/0
    df['voice_mail_plan'] = df['voice_mail_plan'].map({'yes': 1, 'no': 0})
    df['international_plan'] = df['international_plan'].map({'yes': 1, 'no': 0})

    # drop the categorical columns and the per-period columns replaced by the totals
    df.drop(columns=['state', 'area_code'], inplace=True)
    df.drop(columns=['total_day_charge', 'total_eve_charge', 'total_night_charge',
                     'total_day_calls', 'total_eve_calls', 'total_night_calls',
                     'total_day_minutes', 'total_eve_minutes', 'total_night_minutes'], inplace=True)
    return df
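
`clean_data` removes `state` and `area_code` entirely. If those columns were to be kept as features instead, one-hot encoding with `pd.get_dummies` is one option. The sketch below runs on a toy frame whose values are made up for illustration; it is not part of the pipeline in this notebook.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "state": ["OH", "NJ", "OH"],
    "area_code": ["area_code_415", "area_code_408", "area_code_415"],
    "account_length": [107, 137, 84],
})

# Replace each categorical column with one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=["state", "area_code"])
print(encoded.columns.tolist())
```

With this approach the validation frame would need the same indicator columns as the training frame, for example by reindexing it to the training columns with `fill_value=0`.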

In [None]:
x_train_clean = clean_data(x_train)
y_train_clean = pd.Categorical(y_train).codes

x_val_clean = clean_data(x_val)
y_val_clean = pd.Categorical(y_val).codes
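
`pd.Categorical(...).codes` replaces each label with the index of its category, and categories inferred from strings are sorted, so `no` becomes 0 and `yes` becomes 1. A small illustration on made-up labels:

```python
import pandas as pd

# Invented labels, for illustration only.
labels = pd.Series(["no", "yes", "no", "no"])
cat = pd.Categorical(labels)

print(list(cat.categories))  # ['no', 'yes']
print(cat.codes.tolist())    # [0, 1, 0, 0]
```

The mapping is inferred separately on each call, so it is only consistent between the train and validation labels when both contain the same set of classes.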

### Log Dataset to MLFoundry

In [None]:
run.log_dataset('train', features=x_train_clean, actuals=y_train_clean)
run.log_dataset('validation', features=x_val_clean, actuals=y_val_clean)

### Log our hyperparameters to MLFoundry

In [None]:
LR = 0.05
N_ESTIMATORS = 1000
MAX_DEPTH = 10

run.log_params({'learning_rate': LR, 'n_estimators': N_ESTIMATORS, 'max_depth': MAX_DEPTH})

### Train the gradient boosting model on the training dataset

In [None]:
# scikit-learn's GradientBoostingClassifier (not the XGBoost library)
xg = GradientBoostingClassifier(learning_rate=LR, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH)
xg.fit(x_train_clean, y_train)

### Log metrics

We log the accuracy on the training and validation datasets.

In [None]:
run.log_metrics({
    'train_accuracy': xg.score(x_train_clean, y_train),
    'val_accuracy': xg.score(x_val_clean, y_val),
})
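
Accuracy alone can look good when one class dominates, so precision, recall, and F1 on the validation set may be worth tracking as well. The sketch below is an assumption-laden illustration rather than part of the original run: it uses synthetic data from `make_classification` with an invented 85/15 class mix and a smaller model, and the metric names in `extra_metrics` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the churn dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_va)

# Metrics for the positive class (label 1); zero_division=0 avoids a warning
# if the model predicts no positives at all.
extra_metrics = {
    "val_precision": float(precision_score(y_va, pred, zero_division=0)),
    "val_recall": float(recall_score(y_va, pred, zero_division=0)),
    "val_f1": float(f1_score(y_va, pred, zero_division=0)),
}
print(extra_metrics)
```

A dictionary of this shape could be passed to `run.log_metrics` in the same way as the accuracies above.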

### Save the model with MLFoundry and end run

In [None]:
run.log_model(xg, 'sklearn')
run.end()