# Customer Churn Prediction with TrueFoundry


Link to the problem statement: https://www.kaggle.com/c/customer-churn-prediction-2020

This notebook walks through a solution to the Customer Churn Prediction 2020 challenge on Kaggle.

1. We use `GradientBoostingClassifier` from scikit-learn for classification.
2. We use `mlfoundry` to log the metrics, datasets, and model for each run.

### Install and import Python packages

In [4]:
# install pip packages
!pip install mlfoundry > /dev/null

# import other things
import pandas as pd

from sklearn.ensemble import GradientBoostingClassifier

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.24.2 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.


### Copy your TrueFoundry API key and store it in `api_token`

In [5]:
from getpass import getpass
api_token = getpass("TrueFoundry API Token (Get it from https://app.truefoundry.com/settings):")


TrueFoundry API Token (Get it from https://app.truefoundry.com/settings):··········
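When the notebook is run non-interactively, a prompt is inconvenient. The sketch below is one possible variation: it reads the token from an environment variable and only falls back to `getpass` when the variable is unset. The variable name `CHURN_DEMO_API_TOKEN` and the helper `read_api_token` are hypothetical names made up for this example; they are not part of `mlfoundry`.

```python
import os
from getpass import getpass

# Hypothetical variable name, chosen for this sketch only.
TOKEN_ENV_VAR = "CHURN_DEMO_API_TOKEN"


def read_api_token(env_var: str = TOKEN_ENV_VAR) -> str:
    """Return the token from the environment, or prompt for it if unset."""
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    # Fall back to the interactive prompt used in the cell above.
    return getpass("TrueFoundry API Token (Get it from https://app.truefoundry.com/settings):")
```

In the notebook, `api_token = read_api_token()` could then stand in for the direct `getpass` call.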


### Create MLFoundry client

We will use the client to create runs, which log hyperparameters, metrics, datasets, and the model.

In [6]:
import mlfoundry as mlf
mlf_client = mlf.get_client(api_key=api_token)

In [7]:
# download the test and train datasets
!curl -O https://raw.githubusercontent.com/truefoundry/truefoundry-examples/main/churn-prediction-sklearn/data/train.csv
!curl -O https://raw.githubusercontent.com/truefoundry/truefoundry-examples/main/churn-prediction-sklearn/data/test.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  382k  100  382k    0     0  1099k      0 --:--:-- --:--:-- --:--:-- 1099k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70061  100 70061    0     0   417k      0 --:--:-- --:--:-- --:--:--  417k


### Let's create an MLFoundry run

In [30]:
run = mlf_client.create_run(project_name="churn-prediction-sklearn")

[mlfoundry] 2022-06-08T06:20:24+0000 INFO Run is created with name 'allow-serious-question' and id 'bb98bbbf65414a2ea5fb7be4baa323b6'


### Load training data and split it into train and validation datasets

In [31]:
df = pd.read_csv('train.csv')
df.describe()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
count,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0
mean,100.236235,7.631765,180.2596,99.907294,30.644682,200.173906,100.176471,17.015012,200.527882,99.839529,9.023892,10.256071,4.426353,2.769654,1.559059
std,39.698401,13.439882,54.012373,19.850817,9.182096,50.249518,19.908591,4.271212,50.353548,20.09322,2.265922,2.760102,2.463069,0.745204,1.311434
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,73.0,0.0,143.325,87.0,24.365,165.925,87.0,14.1025,167.225,86.0,7.5225,8.5,3.0,2.3,1.0
50%,100.0,0.0,180.45,100.0,30.68,200.7,100.0,17.06,200.45,100.0,9.02,10.3,4.0,2.78,1.0
75%,127.0,16.0,216.2,113.0,36.75,233.775,114.0,19.8675,234.7,113.0,10.56,12.0,6.0,3.24,2.0
max,243.0,52.0,351.5,165.0,59.76,359.3,170.0,30.54,395.0,175.0,17.77,20.0,20.0,5.4,9.0


In [32]:
# divide the labelled data into train and validation sets
from sklearn.model_selection import train_test_split

X = df.drop(columns=['churn'])
y = df['churn']

# stratify on the target so both splits keep the same churn ratio
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.25, stratify=y, random_state=1)

# let's take a look at the value counts of yes and no
y_train.value_counts(), y_val.value_counts()

(no     2739
 yes     448
 Name: churn, dtype: int64, no     913
 yes    150
 Name: churn, dtype: int64)
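
The counts above work out to a churn rate of about 14% in both splits (448 of 3187 and 150 of 1063), which is what `stratify` is meant to achieve. As a minimal sketch of the same effect, the example below splits a small made-up label column (not the Kaggle data) and compares the churn rate on each side.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, made-up data: 85 "no" and 15 "yes" labels.
toy = pd.DataFrame({"feature": range(100), "churn": ["no"] * 85 + ["yes"] * 15})

# Stratifying on the label keeps the yes/no ratio the same in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy["churn"], random_state=1
)

train_rate = (toy_train["churn"] == "yes").mean()
val_rate = (toy_val["churn"] == "yes").mean()
print(f"train churn rate: {train_rate:.2f}, validation churn rate: {val_rate:.2f}")
```

Without `stratify`, a small validation set can end up with noticeably more or fewer churners than the training set purely by chance.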

### Cleaning the data
1. Calculate `total_net_minutes` as the sum of the day, evening, and night minutes to reduce the number of features; do the same for calls and charge
2. Convert the `yes` and `no` strings into 1/0 in `voice_mail_plan` and `international_plan`; the `churn` target is encoded separately with `pd.Categorical`
3. Drop the categorical columns `state` and `area_code` (the code below discards them rather than one-hot encoding them)
4. Drop the per-period columns that the new totals replace
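
If you would rather keep `state` and `area_code` as features, one-hot encoding is a common option. The sketch below shows it with `pd.get_dummies` on a tiny made-up frame; the values are illustrative only and this step is not part of the notebook's pipeline.

```python
import pandas as pd

# Made-up rows that mimic the shape of the categorical columns.
toy = pd.DataFrame({
    "state": ["OH", "NJ", "OH"],
    "area_code": ["415", "408", "415"],
    "account_length": [107, 137, 84],
})

# Each distinct value becomes its own 0/1 column named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["state", "area_code"], dtype=int)
print(encoded.columns.tolist())
```

The training and validation frames would need the same set of dummy columns, so in practice the encoding is usually fitted on the training data and reused for validation.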

In [33]:
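def clean_data(df):
    # Work on a copy so the caller's DataFrame is left untouched; this also
    # avoids pandas' SettingWithCopyWarning when df is a slice of another frame.
    df = df.copy()

    # Combine the day, evening, and night columns into one total per measure
    df['total_net_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
    df['total_net_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
    df['total_net_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']

    # Encode the yes/no plan flags as 1/0
    df['voice_mail_plan'] = df['voice_mail_plan'].map({'yes': 1, 'no': 0})
    df['international_plan'] = df['international_plan'].map({'yes': 1, 'no': 0})

    # Drop the categorical columns, which are not used as features here
    df = df.drop(columns=['state', 'area_code'])

    # Drop the per-period columns now covered by the totals
    df = df.drop(columns=['total_day_charge', 'total_eve_charge', 'total_night_charge',
                          'total_day_calls', 'total_eve_calls', 'total_night_calls',
                          'total_day_minutes', 'total_eve_minutes', 'total_night_minutes'])
    return df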

In [34]:
x_train_clean  = clean_data(x_train)
y_train_clean = pd.Categorical(y_train).codes

x_val_clean = clean_data(x_val)
y_val_clean = pd.Categorical(y_val).codes
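
The two encodings used here behave slightly differently, which is worth knowing before relying on them. The sketch below uses made-up labels to show what `pd.Categorical(...).codes` and `Series.map` produce, including one edge case for each.

```python
import pandas as pd

labels = pd.Series(["no", "yes", "no", "no"])  # made-up labels

# pd.Categorical sorts the distinct values, so "no" -> 0 and "yes" -> 1.
codes = pd.Categorical(labels).codes
print(codes.tolist())  # [0, 1, 0, 0]

# Edge case: if only "yes" were present, it would be coded as 0.
only_yes = pd.Categorical(["yes", "yes"]).codes
# Passing the categories explicitly pins the mapping regardless of the data.
pinned = pd.Categorical(["yes", "yes"], categories=["no", "yes"]).codes
print(only_yes.tolist(), pinned.tolist())  # [0, 0] [1, 1]

# Series.map with a dict is explicit about the mapping, but any value
# missing from the dict (here "maybe") becomes NaN.
mapped = pd.Series(["yes", "no", "maybe"]).map({"yes": 1, "no": 0})
print(mapped.isna().tolist())  # [False, False, True]
```

Because the split above is stratified, both classes appear in each split, so the codes line up between `y_train_clean` and `y_val_clean`.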


### Log the datasets to MLFoundry

In [35]:
run.log_dataset('train', features=x_train_clean, actuals=y_train_clean)
run.log_dataset('validation', features=x_val_clean, actuals=y_val_clean)

[mlfoundry] 2022-06-08T06:20:25+0000 INFO Logging Dataset, this might take a while ...
[mlfoundry] 2022-06-08T06:20:25+0000 INFO Shutting down background jobs and syncing data for run with id '9121373614f44e1b89f0248836c32cf6', please don't kill this process...
[mlfoundry] 2022-06-08T06:20:26+0000 INFO Finished syncing data for run with id '9121373614f44e1b89f0248836c32cf6'. Thank you for waiting!
[mlfoundry] 2022-06-08T06:20:42+0000 INFO Dataset logged successfully
[mlfoundry] 2022-06-08T06:20:42+0000 INFO Logging Dataset, this might take a while ...
[mlfoundry] 2022-06-08T06:20:54+0000 INFO Dataset logged successfully


### Log our hyperparameters to MLFoundry

In [36]:
LR = 0.05
N_ESTIMATORS = 1000
MAX_DEPTH = 10

run.log_params({'learning_rate': LR, 'n_estimators': N_ESTIMATORS, 'max_depth': MAX_DEPTH})

[mlfoundry] 2022-06-08T06:20:55+0000 INFO Parameters logged successfully
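
One way to make sure the logged hyperparameters are exactly the ones the model uses is to keep them in a single dictionary and unpack it into the estimator. This is a sketch of that pattern, not the notebook's own code; the `params` name is introduced here for illustration, and the `run.log_params` call is shown only as a comment because it needs a live run.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Single source of truth for the hyperparameters (values from the cell above).
params = {"learning_rate": 0.05, "n_estimators": 1000, "max_depth": 10}

# The same dict would be passed to the run: run.log_params(params)
model = GradientBoostingClassifier(**params)

# Read the values back from the estimator to confirm they match.
used = {name: model.get_params()[name] for name in params}
print(used)
```

With this layout, changing a value in `params` updates both the logged record and the model in one place.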


### Train the gradient boosting model on the training dataset

In [None]:
xg = GradientBoostingClassifier(learning_rate=LR, n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH)
xg.fit(x_train_clean, y_train)
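
Fitting 1000 trees of depth 10 can take a while and may overfit. scikit-learn's `GradientBoostingClassifier` supports early stopping through `n_iter_no_change` and `validation_fraction`; the sketch below tries it on synthetic data from `make_classification` with smaller, illustrative settings. It is an optional variation, not what the notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic, imbalanced data standing in for the churn features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=1
)

model = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=200,         # upper bound on the number of trees
    max_depth=3,
    validation_fraction=0.2,  # internal hold-out used only for early stopping
    n_iter_no_change=10,      # stop once 10 rounds bring no improvement
    random_state=1,
)
model.fit(X_toy, y_toy)

# n_estimators_ reports how many trees were actually fitted.
print("trees actually fitted:", model.n_estimators_)
```

If early stopping is used, the fitted tree count is a natural extra value to log alongside the configured `n_estimators`.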

### Log metrics

We log the accuracy on the training and validation datasets.

In [None]:
run.log_metrics({'train_accuracy': xg.score(x_train_clean, y_train), 'val_accuracy': xg.score(x_val_clean, y_val)})
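
Accuracy needs some context on imbalanced data: with 150 churners among 1063 validation rows, a model that always predicts "no" already scores about 86%. The sketch below computes that baseline from the counts shown earlier and then, on made-up predictions, shows how recall and F1 expose a model that never flags a churner. Logging these extra metrics is a suggestion, not something the notebook does.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Validation class counts taken from the value_counts() output above.
n_no, n_yes = 913, 150
baseline_accuracy = n_no / (n_no + n_yes)  # accuracy of always predicting "no"
print(f"majority-class baseline: {baseline_accuracy:.3f}")

# Made-up labels: 8 non-churners, 2 churners, and a model that predicts "no" for all.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

toy_metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred, zero_division=0),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}
print(toy_metrics)
```

A dictionary like `toy_metrics`, computed on the real validation predictions, could be passed to `run.log_metrics` in the same way as the accuracy values.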

### Save the model with MLFoundry and end the run

In [None]:
run.log_model(xg, 'sklearn')
run.end()