<a href="https://colab.research.google.com/github/somilasthana/MachineLearningSkills/blob/master/Application_Churn_Prediction_Telecom_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!mkdir telco

In [2]:
!unzip /content/telco/telco-customer-churn.zip # Get it from Kaggle

Archive:  /content/telco/telco-customer-churn.zip
  inflating: WA_Fn-UseC_-Telco-Customer-Churn.csv  


#Data Details

The raw data contains 7043 rows (customers) and 21 columns (features).

Each row represents a customer, each column contains customer’s attributes described on the column Metadata.

* Customers who left within the last month – the column is called Churn

* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies

* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges

* Demographic info about customers – gender, age range, and if they have partners and dependents

Description of fields

* customerID: Customer ID

* gender: Male or Female

* SeniorCitizen: 0 or 1

* Partner: whether partner 0 or 1

* Dependents:  have dependent or not ( Yes or No )

* tenure: number of months the customer stayed with the company.

* PhoneService: phone service used ( Yes or No )

* MultipleLines: Customer has multiple lines or not (Yes, No, No phone service)

* InternetService: Customer’s internet service provider (DSL, Fiber optic, No)

* OnlineSecurity: Customer has online security or not (Yes, No, No internet service)

* OnlineBackup: Customer has online backup or not (Yes, No, No internet service)

* DeviceProtection: Customer has device protection or not (Yes, No, No internet service)

* TechSupport: Customer has tech support or not (Yes, No, No internet service)

* Streaming: TV: Customer has streaming TV or not (Yes, No, No internet service)

* StreamingMovies: Customer has streaming movies or not (Yes, No, No internet service)

* Contract: Contract term of the customer (Month-to-month, One year, Two year)

* PaperlessBilling: Yes or No

* PaymentMethod: Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic)

* MonthlyCharges: Amount charged 

* TotalCharges:  Total amount charged to the customer

* Churn:  Yes or No

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

## Data

In [0]:
churn_df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [25]:
churn_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [26]:
print("Shape=",churn_df.shape)

Shape= (7043, 21)


In [27]:
churn_df.isna().any()

customerID          False
gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
MultipleLines       False
InternetService     False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
PaperlessBilling    False
PaymentMethod       False
MonthlyCharges      False
TotalCharges        False
Churn               False
dtype: bool

In [28]:
churn_df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [0]:
churn_df["TotalCharges"] = churn_df["TotalCharges"].apply(lambda x: float(x.strip()) if len(x.strip()) != 0 else 0.0)

#Plan of Action

1. First Split the data using StraifiedKFold into 5 splits, repetition allowed.
2. Build Initial Machine Learning Framework ( also include Customer ID ): Algorithms used KNN, SVM, RandomForest, ExtraTreeForest, XGB ( the results are not good enough )
2. Stack Feaures ( Not Models ) Transform the data using PCA, NCA, SelectKBest, LDA, SelectKModel
3. Build the second Machine Learning Framework: Algorithms used KNN, SVM, RandomForest, ExtraTreeForest, XGB ( the results are not good enough )
4. Stack Features again but using Count Encoder, Percentile Encoder, Likelihood Encoder ( given by ) Far0n/kaggletils
5. Build the second Machine Learning Framework: Algorithms used KNN, SVM, RandomForest,, XGB). RandomForest is giving promising results
6. Grid Search to find best parameters for  RandomForest ( by running on one 1 fold data )

## StratifiedKFold

In [0]:
X = churn_df.values[:, :-1]
y = churn_df.values[:, -1]

In [82]:
X[0]

array(['7590-VHVEG', 'Female', 0, 'Yes', 'No', 1, 'No',
       'No phone service', 'DSL', 'No', 'Yes', 'No', 'No', 'No', 'No',
       'Month-to-month', 'Yes', 'Electronic check', 29.85, 29.85],
      dtype=object)

In [0]:
#RepeatStratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold

rksf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
training_test_split = []
for train_index, test_index in rksf.split(X, y):
  training_test_split.append((train_index, test_index))

In [8]:
len(training_test_split[0][0]), len(training_test_split[0][1])

(5634, 1409)

## The Machine Learning Framework

In [0]:
# Types of labels: Single column, binary values 

In [83]:
column_map = { v:k for k, v in enumerate(churn_df.columns)}
column_map

{'Churn': 20,
 'Contract': 15,
 'Dependents': 4,
 'DeviceProtection': 11,
 'InternetService': 8,
 'MonthlyCharges': 18,
 'MultipleLines': 7,
 'OnlineBackup': 10,
 'OnlineSecurity': 9,
 'PaperlessBilling': 16,
 'Partner': 3,
 'PaymentMethod': 17,
 'PhoneService': 6,
 'SeniorCitizen': 2,
 'StreamingMovies': 14,
 'StreamingTV': 13,
 'TechSupport': 12,
 'TotalCharges': 19,
 'customerID': 0,
 'gender': 1,
 'tenure': 5}

In [0]:
# Numeric features - Standardized
# Categorical features - One Hot Encoded
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline

In [0]:
numeric_features = [column_map[i] for i in ('tenure','MonthlyCharges', 'TotalCharges' )]
categorical_feature = [column_map[i] for i in ('gender', 'SeniorCitizen', 'Partner', 'Dependents', 
                       'PhoneService', 'MultipleLines','InternetService',
                       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                       'PaperlessBilling', 'PaymentMethod') 
                      ]

In [0]:
preprocess = ColumnTransformer(
    transformers = [
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_feature)
    ]
)

In [0]:
X_transform = preprocess.fit_transform(X)

In [39]:
X_transform.shape

(7043, 46)

In [40]:
X_transform[0]

array([-1.27744458, -1.16032292, -0.99261052,  1.        ,  0.        ,
        1.        ,  0.        ,  0.        ,  1.        ,  1.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  1.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  1.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
        1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  1.        ,
        0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
        0.        ,  1.        ,  0.        ,  0.        ,  1.        ,
        0.        ])

In [0]:
# Save preprocess for future

import pickle
pickle.dump(preprocess, open('/tmp/preprocess.pkl', 'wb'))

In [0]:
from sklearn.metrics import confusion_matrix

In [44]:
# Apply knn algorithm to predict

from sklearn.neighbors import KNeighborsClassifier
display = True
acc_knn=[]
n_neighbors = 3 # Hyper parameter
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
for train_indices, test_indices in training_test_split:
  knn.fit(X_transform[train_indices, :], y[train_indices])
  
  # Compute the nearest neighbor accuracy on the embedded test set
  acc_knn.append(knn.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ") 
    print(confusion_matrix(knn.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
  
acc_knn = np.array(acc_knn)
print(" Knn with neighbors={0}, accuracy={1}, {2}".format(n_neighbors, acc_knn.mean(), acc_knn.std()))

confusion metrics = 
[[879 174]
 [156 200]]
 Knn with neighbors=3, accuracy=0.753144790091921, 0.00855214427300203


In [45]:
display = True
acc_knn = []
n_neighbors = 10 # Hyper parameter
knn = KNeighborsClassifier(n_neighbors=n_neighbors)

for train_indices, test_indices in training_test_split:
  knn.fit(X_transform[train_indices, :], y[train_indices])
  
  # Compute the nearest neighbor accuracy on the embedded test set
  acc_knn.append(knn.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(knn.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
  
"""
Not much change 
"""
acc_knn = np.array(acc_knn)
print(" Knn with neighbors={0}, accuracy={1},{2}".format(n_neighbors, acc_knn.mean(), acc_knn.std()))

confusion metrics = 
[[948 201]
 [ 87 173]]
 Knn with neighbors=10, accuracy=0.7827913459166702,0.008589727204208577


In [46]:
# Apply SVM algo
from sklearn import svm
acc_svm = []
display = True
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)

for train_indices, test_indices in training_test_split:
  clf.fit(X_transform[train_indices, :], y[train_indices])
  
  acc_svm.append(clf.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
    
    
acc_svm = np.array(acc_svm)
print(" svm with kernel={0}, accuracy={1},{2}".format('rbf', acc_svm.mean(), acc_svm.std()))



confusion metrics = 
[[974 246]
 [ 61 128]]
 svm with kernel=rbf, accuracy=0.7803500191428542,0.008256413951220698


In [47]:
from sklearn.ensemble import RandomForestClassifier

acc_rf = []
display = True
clf = RandomForestClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_transform[train_indices, :], y[train_indices])
  
  acc_rf.append(clf.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
    
    
acc_rf = np.array(acc_rf)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_rf.mean(), acc_rf.std()))



confusion metrics = 
[[953 187]
 [ 82 187]]
 RF with n_estimators=500, accuracy=10,0.7969891594445899


In [48]:
from sklearn.ensemble import ExtraTreesClassifier

acc_et = []
display = True
clf = ExtraTreesClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_transform[train_indices, :], y[train_indices])
  
  acc_et.append(clf.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
  
acc_et = np.array(acc_et)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_et.mean(), acc_et.std()))


confusion metrics = 
[[939 188]
 [ 96 186]]
 RF with n_estimators=500, accuracy=10,0.7857014733692381


In [49]:
import xgboost as xgb
display = True
acc_xgb = []

for train_indices, test_indices in training_test_split:
  clf = xgb.XGBClassifier().fit(X_transform[train_indices, :], y[train_indices])
  
  acc_xgb.append(clf.score(X_transform[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_transform[test_indices,:]), y[test_indices]))
    display = False
    
acc_xgb = np.array(acc_xgb)
print(" XGB with accuracy={0},{1}".format(acc_xgb.mean(), acc_xgb.std()))


confusion metrics = 
[[953 179]
 [ 82 195]]
 XGB with accuracy=0.8046147342976893,0.008685475971088683


##Stacking Features

In [0]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectFromModel

from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.neighbors import NeighborhoodComponentsAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LassoCV

pca = PCA(n_components=5)
skb = SelectKBest(mutual_info_classif, k=2)
nca = NeighborhoodComponentsAnalysis(n_components=3, random_state=42)
lda = LinearDiscriminantAnalysis(n_components=3)
clf = LassoCV(cv=4)
sfm = SelectFromModel(clf, threshold=0.25)

union = FeatureUnion(
    [
        ("pca", pca),
        ("skb", skb),
        ("nca", nca),
        ("lda", lda),
        ("sfm", sfm)
        
    ]
)


In [51]:
union.fit(X_transform, LabelEncoder().fit_transform(y))



FeatureUnion(n_jobs=None,
             transformer_list=[('pca',
                                PCA(copy=True, iterated_power='auto',
                                    n_components=5, random_state=None,
                                    svd_solver='auto', tol=0.0, whiten=False)),
                               ('skb',
                                SelectKBest(k=2,
                                            score_func=<function mutual_info_classif at 0x7fa082430730>)),
                               ('nca',
                                NeighborhoodComponentsAnalysis(callback=None,
                                                               init='auto',
                                                               max_iter=50,
                                                               n_components=3,
                                                               random_state=42,
                                                               t...
                       

In [52]:
X_feature = union.transform(X_transform)



In [53]:
X_feature.shape

(7043, 11)

In [55]:
X_feature[0]

array([-1.22215532e+00, -1.71455714e+00,  1.45403482e+00,  4.30145069e-01,
        7.09163700e-01, -1.27744458e+00,  1.00000000e+00, -5.83864377e+01,
       -6.05901584e+02, -2.01974149e+01,  1.02332910e+00])

In [0]:
# Save union for future

import pickle
pickle.dump(union, open('/tmp/union.pkl', 'wb'))

## Apply Model and Evaluate

In [57]:
# Apply knn algorithm to predict

from sklearn.neighbors import KNeighborsClassifier

display = True
acc_knn=[]
n_neighbors = 3 # Hyper parameter
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
for train_indices, test_indices in training_test_split:
  knn.fit(X_feature[train_indices, :], y[train_indices])
  
  # Compute the nearest neighbor accuracy on the embedded test set
  acc_knn.append(knn.score(X_feature[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ") 
    print(confusion_matrix(knn.predict(X_feature[test_indices,:]), y[test_indices]))
    display = False
  
acc_knn = np.array(acc_knn)
print(" Knn with neighbors={0}, accuracy={1}, {2}".format(n_neighbors, acc_knn.mean(), acc_knn.std()))

confusion metrics = 
[[910 174]
 [125 200]]
 Knn with neighbors=3, accuracy=0.7814700703115265, 0.0071675415187152655


In [60]:
# Apply SVM algo
from sklearn import svm
acc_svm = []
display = True
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)

for train_indices, test_indices in training_test_split:
  clf.fit(X_feature[train_indices, :], y[train_indices])
  
  acc_svm.append(clf.score(X_feature[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature[test_indices,:]), y[test_indices]))
    display = False
        
    
acc_svm = np.array(acc_svm)
print(" svm with kernel={0}, accuracy={1},{2}".format('rbf', acc_svm.mean(), acc_svm.std()))



confusion metrics = 
[[1026  358]
 [   9   16]]
 svm with kernel=rbf, accuracy=0.7395315826827538,0.0


In [61]:
from sklearn.ensemble import RandomForestClassifier

acc_rf = []
display = True
clf = RandomForestClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_feature[train_indices, :], y[train_indices])
  
  acc_rf.append(clf.score(X_feature[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature[test_indices,:]), y[test_indices]))
    display = False
    
    
acc_rf = np.array(acc_rf)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_rf.mean(), acc_rf.std()))


confusion metrics = 
[[937 171]
 [ 98 203]]
 RF with n_estimators=500, accuracy=3,0.7973730354614439


In [62]:
from sklearn.ensemble import ExtraTreesClassifier

acc_et = []
display = True
clf = ExtraTreesClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_feature[train_indices, :], y[train_indices])
  
  acc_et.append(clf.score(X_feature[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature[test_indices,:]), y[test_indices]))
    display = False
  
acc_et = np.array(acc_et)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_et.mean(), acc_et.std()))


confusion metrics = 
[[934 177]
 [101 197]]
 RF with n_estimators=500, accuracy=3,0.7940642725740656


In [63]:
import xgboost as xgb
display = True
acc_xgb = []

for train_indices, test_indices in training_test_split:
  clf = xgb.XGBClassifier().fit(X_feature[train_indices, :], y[train_indices])
  
  acc_xgb.append(clf.score(X_feature[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature[test_indices,:]), y[test_indices]))
    display = False
    
acc_xgb = np.array(acc_xgb)
print(" XGB with accuracy={0},{1}".format(acc_xgb.mean(), acc_xgb.std()))


confusion metrics = 
[[950 174]
 [ 85 200]]
 XGB with accuracy=0.8074402397421794,0.008516071914349499


In [0]:
X_merged = np.hstack((X_transform, X_feature))

In [65]:
X_merged.shape

(7043, 57)

In [66]:
# Apply SVM algo
from sklearn import svm
acc_svm = []
display = True
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)

for train_indices, test_indices in training_test_split:
  clf.fit(X_merged[train_indices, :], y[train_indices])
  
  acc_svm.append(clf.score(X_merged[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_merged[test_indices,:]), y[test_indices]))
    display = False
        
    
acc_svm = np.array(acc_svm)
print(" svm with kernel={0}, accuracy={1},{2}".format('rbf', acc_svm.mean(), acc_svm.std()))



confusion metrics = 
[[1026  358]
 [   9   16]]
 svm with kernel=rbf, accuracy=0.7379099332497,0.0025513383556078823


In [67]:
from sklearn.ensemble import RandomForestClassifier

acc_rf = []
display = True
clf = RandomForestClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_merged[train_indices, :], y[train_indices])
  
  acc_rf.append(clf.score(X_merged[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_merged[test_indices,:]), y[test_indices]))
    display = False
     
  
acc_rf = np.array(acc_rf)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_rf.mean(), acc_rf.std()))


confusion metrics = 
[[942 181]
 [ 93 193]]
 RF with n_estimators=500, accuracy=3,0.8055358410220014


## Second Level of Feature  Preprocessing and Stacking

In [0]:
X = churn_df.values[:, :-1] # Not considering customer_id
y = churn_df.values[:, -1]

In [77]:
X[0]

array(['7590-VHVEG', 'Female', 0, 'Yes', 'No', 1, 'No',
       'No phone service', 'DSL', 'No', 'Yes', 'No', 'No', 'No', 'No',
       'Month-to-month', 'Yes', 'Electronic check', 29.85, 29.85],
      dtype=object)

In [0]:
from collections import Counter

import numpy as np
from scipy.stats import norm
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder
from statsmodels.distributions import ECDF

In [0]:
numeric_features_index = [column_map[i] for i in ('tenure','MonthlyCharges', 'TotalCharges' )]
categorical_feature_name = ('gender', 'SeniorCitizen', 'Partner', 'Dependents', 
                       'PhoneService', 'MultipleLines','InternetService',
                       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                       'PaperlessBilling', 'PaymentMethod')
categorical_feature_index = [column_map[i] for i in categorical_feature_name 
                      ]

In [78]:
categorical_feature_index

[1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]

In [87]:
X[1, categorical_feature_index]

array(['Male', 0, 'No', 'No', 'Yes', 'No', 'DSL', 'Yes', 'No', 'Yes',
       'No', 'No', 'No', 'One year', 'No', 'Mailed check'], dtype=object)

In [0]:
class CounterEncoder(BaseEstimator, TransformerMixin):
  def __init__(self, min_count=0, nan_value=-1, copy=True):
    self.min_cnt = min_count
    self.nans = nan_value
    self.cp = copy
    self.counts = {}
    
  def is_numpy(self, x):
    return isinstance(x, np.ndarray)
    
  def fit(self, x):
    self.counts = {}
    if len(x.shape) == 1:
      x = x.reshape(-1, 1)
    ncols = x.shape[1]
    is_np = self.is_numpy(x)
    
    for i in range(ncols):
      if is_np:
        cnt = dict(Counter(x[:, i]))
      else:
        cnt = x.iloc[:, i].value_counts().to_dict()
        
      if self.min_cnt > 0:
        cnt = dict((k, self.nans if v < self.min else v ) for k, v in cnt.items())
    
      self.counts.update({i:cnt})
    return self
  
  def fit_transform(self, x):
    self.fit(x)
    return self.transform(x)
  
  def transform(self, x):
    if self.cp:
      xm = x.copy()
      
    if len(xm.shape) == 1:
      xm = xm.reshape(-1, 1)
      
    ncols = xm.shape[1]
    is_np = self.is_numpy(xm)
    
    for i in range(ncols):
      cnt = self.counts[i]
      
      if is_np:
        k, v = np.array( list ( zip ( *sorted(cnt.items()))))
        ix = np.digitize(xm[:, i], k, right=True)
        xm[:, i] = v[ix]
      else:
        xm.iloc[:, i].replace(cnt, inplace=True)
    return xm

In [0]:
#X_counter = counter.fit_transform(X[:, categorical_feature_index])
X_label = X.copy()
for i in categorical_feature_index:
  X_label[:, i] = LabelEncoder().fit(X[:, i]).transform(X[:, i])


In [0]:
counter = CounterEncoder()
X_label_count = counter.fit_transform(X_label[:, categorical_feature_index])

In [123]:
X_label_count[:3, :]

array([[3488, 5901, 3402, 4933, 682, 682, 2421, 3498, 2429, 3095, 3473,
        2810, 2785, 3875, 4171, 2365],
       [3555, 5901, 3641, 4933, 6361, 3390, 2421, 2019, 3088, 2422, 3473,
        2810, 2785, 1473, 2872, 1612],
       [3555, 5901, 3641, 4933, 6361, 3390, 2421, 2019, 2429, 3095, 3473,
        2810, 2785, 3875, 4171, 1612]], dtype=object)

In [0]:
class PercentileEncoder(BaseEstimator, TransformerMixin):
  def __init__(self, apply_ppf=False, copy=True):
    self.ppf = lambda x: norm.ppf(x * .998 + .001) if apply_ppf else x
    self.cp = copy
    self.ecdfs = {}
    
  def is_numpy(self, x):
    return isinstance(x, np.ndarray)
    
  def fit(self, x):
    self.ecdfs = {}
    
    if len(x.shape) == 1:
      x = x.reshape(-1, 1)
      
    ncols = x.shape[1]
    is_np = self.is_numpy(x)
    
    for i in range(ncols):
      if is_np:
        self.ecdfs.update({i: ECDF(x[:, i])})
      else:
        self.ecdfs.update({i: ECDF(x.iloc[:, i].values)})
        
    return self
  
  def fit_transform(self, x):
    self.fit(x)
    return self.transform(x)
  
  def transform(self, x):
    
    if self.cp:
      xm = x.copy()
      
    if len(xm.shape) == 1:
      xm = xm.reshape(-1, 1)
      
    ncols = xm.shape[1]  
    is_np = self.is_numpy(xm)
    
    for i in range(ncols):
      ecdf = self.ecdfs[i]
      if is_np:
        xm[:, i] = self.ppf(ecdf(xm[:, i]))
      else:
        xm.iloc[:, i] = self.ppf(ecdf(xm[:, i]))
        
    return xm

In [0]:
p = PercentileEncoder()

X_percentile = p.fit_transform(X[:, numeric_features_index])

In [131]:
X_percentile[:3, :]

array([[0.08859860854749398, 0.23384921198352973, 0.030242794263808038],
       [0.5543092432202187, 0.3941502200766719, 0.582564248189692],
       [0.12239102655118558, 0.3553883288371433, 0.1211131620048275]],
      dtype=object)

In [0]:
from sklearn.utils import check_X_y, check_array

def is_numpy(x):
    return isinstance(x, np.ndarray)
  
class LikelihoodEstimator(BaseEstimator):
    def __init__(self, seed=0, alpha=0, noise=0, leave_one_out=False):
        self.alpha = alpha
        self.noise = noise
        self.seed = seed
        self.leave_one_out = leave_one_out
        self.nclass = None
        self.classes = None
        self.class_priors = None
        self.likelihoods = None
        self.x_likelihoods = None

    def fit(self, x, y):
        np.random.seed(self.seed)
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)

        x, y = check_X_y(x, y)

        self.classes = np.unique(y)
        self.nclass = self.classes.shape[0]

        ctab = pd.crosstab(y, list(x.T)).T.reset_index()

        xdim = x.shape[1]
        xcols = list(ctab.columns[:xdim])
        ycols = list(ctab.columns[xdim:])

        xtab = pd.DataFrame(x, columns=xcols)
        xtab = xtab.merge(ctab, how='left', on=xcols)

        self.class_priors = xtab[ycols].div(xtab[ycols].sum(axis=1), axis=0).mean().values

        if self.leave_one_out:
            xtab[ycols] -= pd.get_dummies(y)

        xtab[ycols] = xtab[ycols].add(self.class_priors * self.alpha). \
            div(xtab[ycols].sum(axis=1) + self.alpha + 1E-15, axis=0)
        if self.noise > 0:
            xtab[ycols] = np.abs(xtab[ycols] + normal(0, scale=self.noise, size=xtab[ycols].shape))
            xtab[ycols] = xtab[ycols].div(xtab[ycols].sum(axis=1), axis=0)
        self.x_likelihoods = xtab[ycols].values

        xtab_agg = xtab.groupby(xcols, as_index=False)[ycols].agg(['mean']).fillna(0)
        xtab_agg.columns = xtab_agg.columns.get_level_values(1)

        self.likelihoods = xtab_agg.T.ix['mean'].reset_index(drop=True).T.reset_index()
        # self.likelihoods = xtab_agg.T.ix['mean'].reset_index(drop=True).to_dict('list')
        # self.likelihoods_cov = xtab_agg.T.ix['std'].reset_index(drop=True).to_dict('list')
        # self.likelihoods_cov = dict((k, np.diag(v)) for k, v in self.likelihoods_cov.items())

        return self

    def _calc_likelihood(self, x):
        return (x + self.class_priors * self.alpha) / (x.sum() + self.alpha)

    def _get_likelihood(self, x, noise):
        mean = self.likelihoods.get(x[0], self.class_priors)
        cov = self.likelihoods_cov.get(x[0], np.diag(np.zeros((self.nclass,))))
        if noise:
            if isinstance(noise, float):
                cov = np.diag(np.ones((self.nclass,)) * noise)
            lh = np.abs(multivariate_normal(mean, cov))
            return lh / lh.sum()
        else:
            return mean

    def predict(self, x, noise=False, normalize=False):
        if normalize:
            return np.average(self.predict_proba(x, noise), axis=1, weights=self.classes)
        else:
            return np.dot(self.predict_proba(x, noise), self.classes)

    def predict_proba(self, x, noise=False):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)

        x = check_array(x)

        xx = pd.DataFrame(x, columns=self.likelihoods.columns[:-self.nclass])
        xx = xx.merge(self.likelihoods, how='left')
        xx.drop(xx.columns[:-self.nclass], axis=1, inplace=True)
        xx.loc[xx.isnull().any(axis=1) | (xx == 0).all(axis=1), :] = self.class_priors

        if noise:
            np.random.seed(self.seed)
            _noise = noise if isinstance(noise, float) else self.noise
            if _noise > 1E-12:
                xx = np.abs(xx + normal(0, scale=_noise, size=xx.shape))
                xx = xx.div(xx.sum(axis=1), axis=0)

        # return np.apply_along_axis(self._get_likelihood, 1, x, noise)
        return xx.values

class LikelihoodEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, seed=0, alpha=0, leave_one_out=False, noise=0):
        self.alpha = alpha
        self.noise = noise
        self.seed = seed
        self.leave_one_out = leave_one_out
        self.nclass = None
        self.estimators = []

    def fit(self, x, y):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        ncols = x.shape[1]
        if not is_numpy(x):
            x = np.array(x)

        self.nclass = np.unique(y).shape[0]

        for i in range(ncols):
            self.estimators.append(LikelihoodEstimator(**self.get_params()).fit(x[:, i], y))
        return self
      
    def fit_transform(self, x, y):
        self.fit(x, y)
        return self.transform(x)

    def transform(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        ncols = x.shape[1]
        if not is_numpy(x):
            x = np.array(x)

        likelihoods = None

        for i in range(ncols):
            lh = self.estimators[i].predict(x[:, i], noise=True).reshape(-1, 1)
            # lh = self.estimators[i].predict_proba(x[:, i])
            # if self.nclass <= 2:
            #     lh = lh.T[1].reshape(-1, 1)
            likelihoods = np.hstack((lh,)) if likelihoods is None else np.hstack((likelihoods, lh))
        return likelihoods

In [0]:
#X_counter = counter.fit_transform(X[:, categorical_feature_index])
X_cat = X.copy()
for i in categorical_feature_index:
  X_cat[:, i] = LabelEncoder().fit(X[:, i]).transform(X[:, i])
  
le = LikelihoodEncoder()
X_likelihood = le.fit_transform(X_cat[:, 1:], LabelEncoder().fit_transform(y)) # # Not considering customer_id

In [0]:
X_feature_stacking = np.hstack([X_label_count, X_percentile, X_likelihood])

## Apply Model and Evaluate

In [145]:
# Apply knn algorithm to predict

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
display = True
acc_knn=[]
n_neighbors = 3 # Hyper parameter
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
for train_indices, test_indices in training_test_split:
  knn.fit(X_feature_stacking[train_indices, :], y[train_indices])
  
  # Compute the nearest neighbor accuracy on the embedded test set
  acc_knn.append(knn.score(X_feature_stacking[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ") 
    print(confusion_matrix(knn.predict(X_feature_stacking[test_indices,:]), y[test_indices]))
    display = False
  
acc_knn = np.array(acc_knn)
print(" Knn with neighbors={0}, accuracy={1}, {2}".format(n_neighbors, acc_knn.mean(), acc_knn.std()))

confusion metrics = 
[[917 177]
 [118 197]]
 Knn with neighbors=3, accuracy=0.7809884270223454, 0.008618955990493584


In [153]:
# Apply Naive algorithm to predict

from sklearn.naive_bayes import GaussianNB

display = True
acc_nb=[]
n_neighbors = 3 # Hyper parameter
nb = GaussianNB()
for train_indices, test_indices in training_test_split:
  nb.fit(X_feature_stacking[train_indices, :], y[train_indices])
  
  # Compute the nearest neighbor accuracy on the embedded test set
  acc_nb.append(nb.score(X_feature_stacking[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ") 
    print(confusion_matrix(knn.predict(X_feature_stacking[test_indices,:]), y[test_indices]))
    display = False
  
acc_nb = np.array(acc_nb)
print(" NB with neighbors={0}, accuracy={1}, {2}".format(n_neighbors, acc_nb.mean(), acc_nb.std()))

confusion metrics = 
[[945 135]
 [ 90 239]]
 NB with neighbors=3, accuracy=0.966435368528946, 0.005580538080268117


In [154]:
# Apply SVM algo
from sklearn import svm
acc_svm = []
display = True
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)

for train_indices, test_indices in training_test_split:
  clf.fit(X_feature_stacking[train_indices, :], y[train_indices])
  
  acc_svm.append(clf.score(X_feature_stacking[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature_stacking[test_indices,:]), y[test_indices]))
    display = False
        
    
acc_svm = np.array(acc_svm)
print(" svm with kernel={0}, accuracy={1},{2}".format('rbf', acc_svm.mean(), acc_svm.std()))



confusion metrics = 
[[1012  263]
 [  23  111]]
 svm with kernel=rbf, accuracy=0.8062766770426485,0.006095040881327082


In [155]:
from sklearn.ensemble import RandomForestClassifier

acc_rf = []
display = True
clf = RandomForestClassifier(n_estimators=500, min_samples_split=5, random_state=42)

for train_indices, test_indices in training_test_split:
  clf.fit(X_feature_stacking[train_indices, :], y[train_indices])
  
  acc_rf.append(clf.score(X_feature_stacking[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature_stacking[test_indices,:]), y[test_indices]))
    display = False
     
  
acc_rf = np.array(acc_rf)
print(" RF with n_estimators=500, accuracy={0},{1}".format(n_neighbors, acc_rf.mean(), acc_rf.std()))


confusion metrics = 
[[1019   18]
 [  16  356]]
 RF with n_estimators=500, accuracy=3,0.9793555995748724


In [0]:
import xgboost as xgb
display = True
acc_xgb = []

for train_indices, test_indices in training_test_split:
  clf = xgb.XGBClassifier().fit(X_feature_stacking[train_indices, :], y[train_indices])
  
  acc_xgb.append(clf.score(X_feature_stacking[test_indices,:], y[test_indices]))
  
  if display:
    print("confusion metrics = ")
    print(confusion_matrix(clf.predict(X_feature_stacking[test_indices,:]), y[test_indices]))
    display = False
    
acc_xgb = np.array(acc_xgb)
print(" XGB with accuracy={0},{1}".format(acc_xgb.mean(), acc_xgb.std()))


## Grid Search For RandomForest

In [0]:
from sklearn.model_selection import GridSearchCV
from time import time
param_dist = {
    "n_estimators": [120, 300, 500, 800, 1200],
    "max_depth": [5, 8, 15, 25, 30, None],
    "max_features": ["log2", "sqrt", None],
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"]
}

clf = RandomForestClassifier()
grid_search = GridSearchCV(clf, param_grid=param_dist, cv=5)
start = time()
grid_search.fit(X_feature_stacking[train_indices, :], y[train_indices])
end = time()
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))

In [0]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

clf = grid_search.best_estimator_


y_pred_rf = clf.predict_proba(X_feature_stacking[test_indices,:])
fpr_rf, tpr_rf, _ = roc_curve(LabelEncoder().fit_transform(y[test_indices]), y_pred_rf[:, 1])

auc = roc_auc_score(LabelEncoder().fit_transform(y[test_indices]), y_pred_rf[:, 1])

plt.figure(0)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rf, tpr_rf, label='RF AUC {}'.format(np.round(auc, 3)))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()