# Choosing a baseline model

The data come from KDD Cup 2009 and were provided by the telecommunications corporation Orange S.A. The data set contains 50000 examples and 230 features: the first 190 features are numerical and the last 40 are categorical. The task is to estimate the churn probability of customers (a classification problem on imbalanced data). The performance metric is ROC AUC.
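
To make the choice of metric concrete, here is a minimal sketch of what ROC AUC measures: the fraction of (churner, non-churner) pairs in which the churner gets the higher score, with ties counted as one half. The helper `pairwise_auc` and the toy labels and scores are made up for illustration and are not part of the Orange data; in the notebook itself the metric comes from scikit-learn.

```python
def pairwise_auc(labels, scores, positive=1):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for lab, s in zip(labels, scores) if lab == positive]
    neg = [s for lab, s in zip(labels, scores) if lab != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data with the same -1/1 label coding as the churn set: 2 churners out of 10.
labels = [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1]

# "Nobody churns" is 80% accurate on these labels but has no ranking ability.
constant = [0.0] * 10

# Scores that rank both churners above all but one non-churner.
ranked = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.6, 0.35]

print(pairwise_auc(labels, constant))  # 0.5
print(pairwise_auc(labels, ranked))    # 0.9375
```

This is why accuracy would be a poor guide on imbalanced data: the constant predictor looks good on accuracy but sits at the chance level of 0.5 on ROC AUC.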

The public data were used in a Kaggle competition: https://www.kaggle.com/c/telecom-clients-churn-prediction/data. The 50000 examples are split into training data (40000 examples) and test data (10000 examples). Competitors do not have access to the test labels, so the test data cannot be used for model fitting, only for performance evaluation.

Here I look for a good baseline model among several popular classifiers, using very simple data processing and default parameters. Later I will try to improve the result by searching for optimal parameters and better processing (see "Model construction and test.ipynb").

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

In [2]:
# Read the training data and drop the ID column
data = pd.read_csv("orange_small_churn_train_data.csv")
data.drop("ID", axis=1, inplace=True)
print(data.shape)
# The first 190 columns are numerical, the next 40 are categorical
num_features = list(data.columns[:190])
cat_features = list(data.columns[190:230])
data.head()

(40000, 231)


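| | Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | Var7 | Var8 | Var9 | Var10 | ... | Var222 | Var223 | Var224 | Var225 | Var226 | Var227 | Var228 | Var229 | Var230 | labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 3052.0 | NaN | NaN | NaN | NaN | ... | vr93T2a | LM8l689qOp | NaN | NaN | fKCe | 02N6s8f | xwM2aC7IdeMC0 | NaN | NaN | -1 |
| 1 | NaN | NaN | NaN | NaN | NaN | 1813.0 | 7.0 | NaN | NaN | NaN | ... | 6hQ9lNX | LM8l689qOp | NaN | ELof | xb3V | RAYp | 55YFVY9 | mj86 | NaN | -1 |
| 2 | NaN | NaN | NaN | NaN | NaN | 1953.0 | 7.0 | NaN | NaN | NaN | ... | catzS2D | LM8l689qOp | NaN | NaN | FSa2 | ZI9m | ib5G6X1eUxUn6 | mj86 | NaN | -1 |
| 3 | NaN | NaN | NaN | NaN | NaN | 1533.0 | 7.0 | NaN | NaN | NaN | ... | e4lqvY0 | LM8l689qOp | NaN | NaN | xb3V | RAYp | F2FyR07IdsN7I | NaN | NaN | 1 |
| 4 | NaN | NaN | NaN | NaN | NaN | 686.0 | 7.0 | NaN | NaN | NaN | ... | MAz3HNj | LM8l689qOp | NaN | NaN | WqMG | RAYp | F2FyR07IdsN7I | NaN | NaN | -1 |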


In [5]:
# Fill NaNs with the median for numerical features and with zero for categorical features
for feat in num_features:
    med_value = data[feat].median()  # NaNs are skipped; an all-NaN column stays NaN here
    data.fillna({feat: med_value}, inplace=True)
data.fillna(0, inplace=True)  # categorical NaNs and any all-NaN numerical columns

# Hash each categorical value into one of hash_space integer buckets.
# Note: on Python 3 the built-in hash() of a str is salted per process
# unless PYTHONHASHSEED is fixed, so these bucket ids are reproducible
# across runs only on Python 2 or with a fixed seed.
hash_space = 50
for feat in cat_features:
    data[feat] = [hash(x) % hash_space for x in data[feat]]

# Scale every feature to zero mean and unit variance, keeping the labels unscaled
scaler = StandardScaler()
df = data.drop(labels=['labels'], axis=1)
data_scaled = scaler.fit_transform(df)
data_scaled = pd.DataFrame(data_scaled, index=df.index, columns=df.columns)
data_scaled['labels'] = data['labels']
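
The hashing step relies on Python's built-in `hash()`. If the notebook is rerun on Python 3, string hashes change between processes, so a process-independent hash may be preferable. The sketch below is one possible replacement, not what produced the numbers in this notebook: `stable_bucket` is a hypothetical helper, and CRC32 is simply a convenient deterministic choice from the standard library.

```python
import zlib


def stable_bucket(value, n_buckets=50):
    """Map a category to a bucket id that is the same in every Python process."""
    # str() lets the same function handle real categories and the 0 used for NaNs.
    return zlib.crc32(str(value).encode("utf-8")) % n_buckets


# A few category values from the table above; 0 stands for a filled-in NaN.
categories = ["LM8l689qOp", "RAYp", "xb3V", 0]
buckets = [stable_bucket(c) for c in categories]
print(buckets)
```

With 50 buckets, distinct categories can still collide; that trade-off is the same as in the original hashing code.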

In [34]:
# Hold out 20% of the data, then run 5-fold cross-validation on the remaining 80%
X = data_scaled.drop("labels", axis=1)
y = data_scaled['labels']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21)

# All classifiers use their default parameters
classifiers = [
    LogisticRegression(),
    RandomForestClassifier(n_jobs=2),
    GradientBoostingClassifier(),
    GaussianNB()]

for clf in classifiers:
    roc_auc = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
    print("ROC AUC for %s is %f" % (clf.__class__.__name__, np.mean(roc_auc)))

ROC AUC for LogisticRegression is 0.658062
ROC AUC for RandomForestClassifier is 0.589208
ROC AUC for GradientBoostingClassifier is 0.727723
ROC AUC for GaussianNB is 0.508344


GradientBoostingClassifier shows the best result. Let's check how it performs on the hold-out set (20% of the training data).

In [35]:
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
# Column 1 of predict_proba is the probability of class 1 (churn), since classes_ is [-1, 1]
y_prob = clf.predict_proba(X_test)[:, 1]
roc_auc_ho = metrics.roc_auc_score(y_test, y_prob)
print('AUC ROC on the hold-out data set is %s' % roc_auc_ho)

AUC ROC on the hold-out data set is 0.729119800047
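
One caveat about these numbers: the medians and the StandardScaler in the preprocessing cell are fit on all 40000 rows before the split, so the validation folds and the hold-out set influence the preprocessing. The effect is probably small for a baseline, but it can be avoided by putting the preprocessing inside a scikit-learn pipeline so that it is refit on each training fold. The sketch below shows the idea on synthetic data (`X_demo` and `y_demo` are made up, not the Orange features) and assumes scikit-learn 0.20 or later for `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (assumption: not the real data).
rng = np.random.RandomState(21)
X_demo = rng.normal(size=(400, 5))
# Imbalanced -1/1 labels driven mostly by the first feature.
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=400) > 1.0, 1, -1)
# Knock out about 20% of the values to mimic missing data.
X_demo[rng.rand(400, 5) < 0.2] = np.nan

# Imputation and scaling are refit on the training part of every fold.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="roc_auc")
print("ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Printing the standard deviation across folds alongside the mean also gives a rough sense of how much of the gap between two classifiers is noise.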
