# Data Mining Process

The goal of this notebook is to implement a data mining process chain according to [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).    
For educational purposes, the [Telco Customer Churn](https://www.kaggle.com/blastchar/telco-customer-churn) dataset is analyzed. 

In [None]:
import numpy as np 
import pandas as pd

path_dataset = '/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'

# Data Access

In the following section I take a look at the dataset to get a first impression.    
I check the available data and the data types, and scan for pitfalls such as parsing errors, NaNs or unexpected parsing results.

In [None]:
df = pd.read_csv(path_dataset)
df.head(3)

In [None]:
print(f'Dataset shape: {df.shape}')

In [None]:
# Check out a detailed description of the data.
# Of main interest: the data types and the non-null count per column
df.info()

**Observation:**    
In total there are 21 data columns, including int64, float64 and object data types.
At first glance the dataset seems to be complete, with 7043 entries in each column.
Judging from the column names almost all data that should be numerical is either of float or integer data type.
An exception is found in the "TotalCharges" column, which is indicated as object data type.
Missing values are a possible cause for this mismatch.
To verify this assumption the dataset is read once again, this time parsing empty and whitespace-only strings explicitly as NaN values.

In [None]:
df = pd.read_csv(path_dataset, na_values=[' ', ''])
df.info()

**Observation:**    
Replacing empty strings with NaNs shows the expected data type of float64 for "TotalCharges".    
The dataset info now also reveals that 11 values are indeed empty.    
Since 11 of 7043 rows is only a very small fraction of the dataset, the affected rows are simply dropped.
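
The effect described above can be reproduced on a tiny in-memory CSV. This is only a sketch: the three rows are made up for illustration, and the blank `TotalCharges` entry is assumed to be a single space, as the `na_values` list used in this notebook suggests.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; the blank TotalCharges mimics the raw file.
csv_text = "tenure,TotalCharges\n1,29.85\n0, \n34,1889.5\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=[' ', ''])

# The single-space entry keeps the whole column non-numeric in the raw read.
print(pd.api.types.is_numeric_dtype(raw['TotalCharges']))
# With na_values the entry becomes NaN and the column parses as float64.
print(clean['TotalCharges'].dtype)
print(clean['TotalCharges'].isna().sum())
```

Counting the NaNs before dropping rows makes it easy to confirm that only a handful of rows is affected.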

In [None]:
df.dropna(inplace=True)
print(f'New dataset shape: {df.shape}')

# Preprocess and Understand Data

Now that the dataset can be accessed properly and the rows with missing values have been removed, the data can be analyzed more closely.

## Check Feature Domains

First, check some example values of each feature.

In [None]:
# Check value ranges of data
for col in df:
    print(f'Feature: {col}')
    print(f'Values: {df[col].unique()[:5]}')
    print('---')

In [None]:
# Get a better overview of the numerical data
df.describe()

**Observation:**    
By inspecting the data closer it becomes clear that data types are still not consistent.
For example, "SeniorCitizen" is of integer type but actually describes categorical data.
Other categorical data is described as strings.

## Transformation of non-numerical data

To make the dataset consistent and to allow further processing with ML algorithms, the categorical data is converted into a numeric representation.
For this purpose the `LabelEncoder` of scikit-learn is used.    
The `customerID` column does not contain valuable data for analysis, therefore this feature is dropped before the transformation.

In [None]:
df = df.drop(columns=['customerID'])

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
object_cols = df.select_dtypes(['object']).columns
# The encoder is refit for every column, so each column gets its own integer mapping
df[object_cols] = df[object_cols].apply(lambda col: label_encoder.fit_transform(col))
df.head()

In [None]:
# Check value ranges once more
df.describe()

**Observation:**    
The data transformation worked properly. As expected all categorical data is now described as numerical data.
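
To see what the label encoding does to a single column, here is a minimal sketch on a made-up list of contract types (the values are illustrative, not read from the dataset). `LabelEncoder` assigns integer codes in sorted order of the labels, and a fitted encoder can map the codes back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for one categorical column.
contract = ['Two year', 'Month-to-month', 'One year', 'Month-to-month']

enc = LabelEncoder()
codes = enc.fit_transform(contract)

# classes_ holds the sorted labels; their position is the assigned code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
decoded = enc.inverse_transform(codes)

print(mapping)
print(list(codes))
```

Note that the cell above reuses one encoder object that is refit for each column, so only the mapping of the last column remains available afterwards. Keeping one encoder per column, as sketched here, would be one way to retain the mappings if they are needed later.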

## Understand Data: Univariate Distribution Visualization (Numerical Features)

To gain more insight into the data the distributions will be displayed.

### Feature: Tenure

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
plt.figure(figsize=(16, 6))
plt.xticks(range(-10,100,2))
plt.title('Tenure')
sns.violinplot(x='tenure', data=df)  # pass the column name as a string, not a list

**Observation:**    
Tenure is measured in months. Most customers stay about 2 to 6 months. Another tenure peak is between 66 and 72 months, indicating some very loyal customers.    
The median tenure is 29 months.    
50% of customers stay between 9 and 55 months, as indicated by the lower and upper quartiles.
This rather big range is a sign of a high standard deviation and highly scattered data.
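
The median and quartiles quoted above can also be computed directly rather than read off the plot. A small sketch with made-up tenure values (not the real column):

```python
import pandas as pd

# Hypothetical tenure values in months.
tenure = pd.Series([1, 2, 5, 9, 20, 29, 40, 55, 66, 70, 72], name='tenure')

# Lower quartile, median and upper quartile in one call.
q1, median, q3 = tenure.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # the interquartile range covers the middle 50% of customers

print(q1, median, q3, iqr)
```

On the real data the same call would be `df['tenure'].quantile([0.25, 0.5, 0.75])`.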

### Feature: Monthly Charges

In [None]:
plt.figure(figsize=(16, 6))
plt.title('Monthly Charges')
plt.xticks(range(5,140,5))
sns.violinplot(x='MonthlyCharges', data=df)  # pass the column name as a string, not a list

**Observation:**    
Most customers pay between 18 and 25€ per month. The violin plot shows two local peaks between 48-55 and 75-85€. Overall the monthly charges feature a high standard deviation and the charges are scattered widely.

### Feature: Total Charges

In [None]:
plt.figure(figsize=(16, 6))
plt.title('Total Charges')
plt.xticks(range(0,10000,500))
sns.violinplot(x='TotalCharges', data=df)  # pass the column name as a string, not a list

**Observation:**    
Most customers pay between 200€ and 500€ in total.
Values range up to a maximum of 8684, leading to a high standard deviation.    
The median value lies at 1400€ total charges.
50% of customers pay between 400€ and 3800€ in total, as indicated by the lower and upper quartiles.

## Understand Data: Univariate Distribution Visualization (Discrete Values)

In [None]:
# First grab the categorical subset of the data to make life easier
# To get the categorical data programmatically the data types are exploited.
df_original = pd.read_csv(path_dataset)
df_discrete = df_original[df_original.select_dtypes(['object']).columns]
df_discrete = df_discrete.drop(columns=['customerID', 'TotalCharges'])
feature_names = df_discrete.columns

In [None]:
fig = plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=.7, wspace=.4)

# One subplot per categorical feature (16 features fill the 4x4 grid)
for index, feature_name in enumerate(feature_names):
    ax = fig.add_subplot(4, 4, index + 1)
    sns.countplot(x=feature_name, data=df_discrete, ax=ax)
    # Rotate the tick labels after plotting, once the category labels exist
    plt.setp(ax.get_xticklabels(), rotation=30, ha='right')

**Observation:**    
Some features are significantly unbalanced, e.g. Dependents, PhoneService, MultipleLines, Contract and Churn.    
Especially Churn seems to be critical, since this is the label that's supposed to be predicted.
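
The imbalance of a label can be quantified with relative frequencies. The sketch below uses a made-up label column with a 70/30 split, purely for illustration:

```python
import pandas as pd

# Hypothetical label column: 7 customers stayed, 3 churned.
churn = pd.Series(['No'] * 7 + ['Yes'] * 3, name='Churn')

# Relative frequency of each class.
share = churn.value_counts(normalize=True)

# Share of the most frequent class.
majority_share = share.max()

print(share)
print(majority_share)
```

The majority share is also the accuracy that a model reaches by always predicting the most frequent class, which is why it is a useful reference point for later evaluation.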

## Understand Data: Conditional Distribution Visualization

To get a grasp of which features might be important for the Churn label, the distributions are visualized conditionally on this variable.

### Feature: Tenure (conditional on Churn label)

In [None]:
sns.FacetGrid(df, col='Churn').map(sns.violinplot, 'tenure', order=[0,1])

**Observation:**    
Customers with low tenure are significantly more likely to churn. At the same time customers who didn't churn are spread over a wide range of tenure.

### Feature: Monthly Charges (conditional on Churn label)

In [None]:
sns.FacetGrid(df, col='Churn').map(sns.violinplot, 'MonthlyCharges', order=[0,1])

**Observation:**    
The plot indicates that monthly charges impact the customers' decision to churn significantly.
Customers with high monthly charges are way more likely to churn than customers with a low amount of monthly charges.
On the other hand customers with low monthly charges are more likely to stay.

### Feature: Total Charges (conditional on Churn label)

In [None]:
sns.FacetGrid(df, col='Churn').map(sns.violinplot, 'TotalCharges', order=[0,1])

**Observation:**    
The amount of total charges does not significantly impact the churn rate.
The distributions show no big discrepancy.

**General Observations:**    
Judging from the plots, Tenure and MonthlyCharges seem to be valuable features for churn prediction due to their vastly different distributions in the context of the churn label.    
On the other hand TotalCharges seems to be a less valuable feature.
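
A numerical counterpart to the conditional plots is a per-class summary. The following sketch uses a made-up six-row frame; the column names match the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Hypothetical rows: three customers who stayed (0) and three who churned (1).
toy = pd.DataFrame({
    'Churn': [0, 0, 0, 1, 1, 1],
    'tenure': [60, 45, 30, 2, 5, 11],
    'MonthlyCharges': [25.0, 55.0, 40.0, 80.0, 95.0, 70.0],
})

# Median of each numerical feature, separately for each churn class.
summary = toy.groupby('Churn')[['tenure', 'MonthlyCharges']].median()
print(summary)
```

Large gaps between the two rows of such a table point to the same features that stand out in the violin plots.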

## Categorical Data (conditional on Churn label)

In [None]:
df_categorical = df.drop(columns=['tenure', 'MonthlyCharges', 'TotalCharges'])

In [None]:
feature_names = df_categorical.columns

# One pair of count plots (Churn = 0 / Churn = 1) per categorical feature
for feature_name in feature_names:
    sns.FacetGrid(df_categorical, col='Churn').map(
        sns.countplot, feature_name, order=df_categorical[feature_name].unique())

**Observation:**    
* Gender does not seem to be a valuable feature for churn prediction. Values are evenly distributed.
* Senior citizens are almost equally likely to churn or to stay. Younger customers are more likely to stay. This assumption might be biased due to a lower amount of senior citizens in the dataset.
* Customers without partner are a little more likely to churn.
* Customers with dependents are a little bit less likely to churn.
* The phone service seems to have little impact on the customer's decision.
* Multiple lines seem to have little impact on the customer's decision.
* The internet service does impact the churn rate. Fiber optics customers are more likely to churn.
* Online security seems to be an important feature. Customers without online security are more likely to churn.
* The lack of online backup services increases the churn probability.
* The lack of device protection increases churn probability.
* The lack of tech support increases churn probability.
* Customers don't churn quite as often when no TV streaming is used because no internet service is booked.
* Customers don't churn quite as often when no movie streaming is used because no internet service is booked.
* Month-to-month contracts are almost equally likely to stay or to churn. Longer contracts bind the customer to the company.
* No paperless billing reduces the risk of a churn.
* Customers with electronic check billing are most likely to churn.
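
Statements like the ones in the list above can be backed by churn rates per category. A minimal sketch with made-up rows, using a row-normalized cross table:

```python
import pandas as pd

# Hypothetical rows: four customers per contract type.
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['Two year'] * 4,
    'Churn':    ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'Yes'],
})

# normalize='index' turns each row into shares that sum to 1,
# so the 'Yes' column is the churn rate within each contract type.
rates = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')
print(rates)
```

Comparing rates instead of raw counts also helps with small groups such as senior citizens, where the count plots alone can be misleading.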

## Understand Data: Correlation Analysis

In [None]:
correlations = df.corr()
fig = plt.figure(figsize=(12,10), dpi=80, facecolor='w', edgecolor='k')
ax = sns.heatmap(correlations, cmap='PiYG', vmin=-1, vmax=1,  annot=True)

In [None]:
correlations.Churn.sort_values()

**Observation:**  
Some stronger correlations can be observed in the correlation heatmap, e.g. Tenure with Contract and Tenure with TotalCharges. The usage of additional services (e.g. streaming and security offers) tends to go along with the customer's tenure, their contract type and their total charges.

In the context of the Churn label a different picture emerges. The strongest negative correlation is found between Churn and the contract type, the strongest positive correlation between Churn and the monthly charges.

The assumption that features whose distributions differ more strongly between the two churn classes also show higher correlation values was confirmed for Contract, MonthlyCharges and Tenure. Features with similar distributions in both classes show lower correlation values (e.g. gender). The assumption that TotalCharges is not a potentially usable feature has been disproved: its absolute correlation value is nearly equal to that of MonthlyCharges, and there are features with even lower correlation values.
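
When comparing features by correlation strength, the sign is less important than the magnitude. The sketch below ranks features by absolute Pearson correlation with the label on a made-up, already label-encoded frame (the values are illustrative only):

```python
import pandas as pd

# Hypothetical label-encoded rows.
toy = pd.DataFrame({
    'Contract': [0, 0, 0, 1, 1, 2, 2, 2],
    'gender':   [0, 1, 0, 1, 0, 1, 0, 1],
    'Churn':    [1, 1, 0, 0, 1, 0, 0, 0],
})

# Correlation of every feature with Churn, ranked by absolute value.
ranking = (toy.corr()['Churn']
              .drop('Churn')
              .abs()
              .sort_values(ascending=False))
print(ranking)
```

One caveat worth keeping in mind: for nominal features with more than two categories the integer codes carry an arbitrary order, so Pearson correlation on label-encoded values should be read as a rough indicator only.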

## Univariate Feature Selection

Features will be selected based on univariate statistical tests.

In [None]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

In [None]:
values = df.values

X = values[:,0:19] # Features
y = values[:,19] # Targets

# Mutual Information for Classification
mutual_info = mutual_info_classif(X, y, discrete_features='auto', 
                                    n_neighbors=3, copy=True, random_state=None)

# chi square
chi_score, chi_pval = chi2(X,y)

# ANOVA F-test (not to be confused with the F1 measure used for evaluation)
f_score, f_pval = f_classif(X,y)

data_feature_selection = { 'MutualInfo': mutual_info,
                        'ChiSquaredScore': chi_score,
                        'ChiSquaredPVal': chi_pval,
                        'FScore': f_score,
                        'FPVal': f_pval
                        }

features = df.columns[0:19]

df_feature_selection = pd.DataFrame(data_feature_selection)
df_feature_selection.insert(0, 'Feature', features)

pd.set_option('display.float_format', lambda x: '%.3f' % x)
df_feature_selection

**Comparison of the test methods:**

In [None]:
fig = plt.figure(figsize=(16, 5), dpi= 80, facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=.1, wspace=.6)

ax = fig.add_subplot(1, 2, 1)
ax = sns.barplot(x='MutualInfo', y="Feature", 
                 data=df_feature_selection.sort_values('MutualInfo', ascending=False))

ax = fig.add_subplot(1, 2, 2)
ax = sns.barplot(x='FScore', y="Feature", 
                 data=df_feature_selection.sort_values('FScore', ascending=False))

In [None]:
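# Rows 4, 17 and 18 hold the continuous features tenure, MonthlyCharges and
# TotalCharges; they are left out because chi2 is meant for categorical data.
ax = sns.barplot(x='ChiSquaredScore', y="Feature", 
                 data=df_feature_selection.drop([4,17,18])
                 .sort_values('ChiSquaredScore', ascending=False))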

**Observation:**    
Mutual information and the F-test compute different feature scores, but the top-ranked features overlap, e.g. Contract and Tenure.

The chi-squared test is only meaningful for categorical (non-negative) data. Here Contract is scored far ahead of the other features, which corresponds to the results of the other tests.
This suggests that Contract might be an important feature.
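
The behaviour of the two tests can be illustrated on a tiny made-up matrix with one informative and one uninformative feature. This is a sketch under stated assumptions: the codes are invented, and only the relative order of the scores is of interest.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

# Hypothetical binary label and two label-encoded features.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([
    # contract-like code, gender-like code
    [2, 0], [2, 1], [1, 0], [2, 1],
    [0, 0], [0, 1], [0, 0], [1, 1],
])

chi_score, chi_pval = chi2(X, y)
f_score, f_pval = f_classif(X, y)
print(chi_score, f_score)

# chi2 refuses negative inputs, e.g. standardized values.
try:
    chi2(X - 1, y)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

The last part shows why the chi-squared test has to run on the encoded categories rather than on standardized data.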

In [None]:
fig = plt.figure(figsize=(16, 5), dpi= 80, facecolor='w', edgecolor='k')
fig.subplots_adjust(hspace=.1, wspace=.6)

ax = fig.add_subplot(1, 2, 1)
ax = sns.barplot(x='ChiSquaredPVal', y="Feature", 
                 data=df_feature_selection
                 .sort_values('ChiSquaredPVal', ascending=True))

ax = fig.add_subplot(1, 2, 2)
ax = sns.barplot(x='FPVal', y="Feature", data=df_feature_selection
                 .sort_values('FPVal', ascending=True))

**Observation:**   
When taking p-values into consideration the eight best features are equal to the features shown in the bar plots above, even though their ranking differs in the case of the ChiSquared test.

The features gender and PhoneService show very high p-values and are therefore not valuable features.

The best features according to mutual information are now selected with scikit-learn's `SelectKBest`.

In [None]:
selector = SelectKBest(score_func = mutual_info_classif, k = 8).fit(X,y)

# Integer positions of the selected columns
feature_indices = selector.get_support(indices=True)

print("Best 8 features (Mutual Information):")
for index in feature_indices:
    print(df.columns[index])

## Transform Data: One Hot Encoding

Non-binary nominal data is now one-hot-encoded for consumption by ML algorithms.
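
As a quick illustration of what one-hot encoding produces, here is a sketch using `pd.get_dummies` on a made-up column. Note that this is a different tool from the `OneHotEncoder` used in this notebook; it is shown only because its output is easy to inspect.

```python
import pandas as pd

# Hypothetical nominal column with three categories.
toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year', 'One year']})

# One indicator column per category; exactly one of them is set per row.
encoded = pd.get_dummies(toy, columns=['Contract'])
print(encoded)
```

Unlike label encoding, this representation does not impose an order on the categories.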

In [None]:
# Extract the columns holding non-binary categorical data
df_object_dtypes = df_original.select_dtypes(include="object").copy()
df_object_dtypes.drop(columns=['customerID', 'TotalCharges'], inplace=True)
features_non_binary_categorical = []
for col in df_object_dtypes.columns:
    # More than two distinct values means the feature is not binary
    if df_object_dtypes[col].nunique() > 2:
        features_non_binary_categorical.append(col)
    else:
        df_object_dtypes.drop(columns=[col], inplace=True)

features_non_binary_categorical

In [None]:
from sklearn.preprocessing import OneHotEncoder
df_onehotencoded = df.copy()

for feature in features_non_binary_categorical:
    col_values = df_onehotencoded[feature].values.reshape(-1,1)
    # `sparse_output` replaces the former `sparse` argument (scikit-learn >= 1.2)
    col_values_one_hot = OneHotEncoder(sparse_output=False, categories='auto').fit_transform(col_values)
    col_values_one_hot = col_values_one_hot.tolist()
    
    df_onehotencoded[feature] = col_values_one_hot
    
df_onehotencoded.head()

## Transform Data: Scaling

In [None]:
# Standardized features
df_standardized = ((df-df.mean())/
                   df.std())
df_standardized.head()

# Modeling & Evaluation

## Train / Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
features = df.columns[0:19]
target = df.columns[19]

x_train, x_test, y_train, y_test = train_test_split(df[features],
                                                   df[target],
                                                   test_size = 0.3,
                                                   random_state = 10)

print(f'Shape of training set X: {x_train.shape}')
print(f'Shape of training set y: {y_train.shape}')
print('---')
print(f'Shape of test set X: {x_test.shape}')
print(f'Shape of test set y: {y_test.shape}')

## Data Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier

In [None]:
# Preparations needed for the pipelines
# We want to scale the numerical data and we want to onehot encode the categorical data.
# For this we make use of a ColumnTransformer
categorical_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

numeric_features = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Generate a mask identifying the features that are supposed to be one hot encoded
one_hot_mask = (df_original.drop(columns=['Churn', 'customerID']).dtypes == object).values

#Define the pipeline steps
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    # `sparse_output` replaces the former `sparse` argument (scikit-learn >= 1.2)
    (OneHotEncoder(sparse_output=False, categories='auto'), categorical_features),
    remainder='passthrough'
)

pipeline = Pipeline([
    ('Preprocessor', preprocessor),
    ("KBest", SelectKBest(mutual_info_classif, k=8)),
    ('Classifier', LogisticRegression(solver ='liblinear'))
])

## Fit & Evaluate

In [None]:
# Print and return accuracy, precision, recall and F1 score
def calculate_results(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
        
    print(f'Accuracy: {accuracy}')
    print(f'Precision: {precision}')
    print(f'Recall: {recall}')
    print(f'F1-Score: {f1}')
    
    return accuracy, precision, recall, f1

In [None]:
# Execute pipeline
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)

pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True) 

In [None]:
_ = calculate_results(y_test, y_pred)

In [None]:
# Baseline - always predict 0
pipeline = Pipeline([
    ('Classifier', DummyClassifier(strategy='constant', constant=0))
]) 

pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)

# sklearn's confusion_matrix(y_test, y_pred) would work as well;
# pd.crosstab is used because it produces a better-labelled table
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True) 

In [None]:
_ = calculate_results(y_test, y_pred)

**Observation:**    
To evaluate the model multiple measures are considered:
Accuracy alone is misleading here, because it is only informative when the classes are roughly balanced. As seen above, always predicting label 0 already results in an accuracy of 73%.    
Depending on the classification problem the F1 score is a better metric. It is the harmonic mean of precision and recall and therefore works best if the costs of false positives and false negatives are nearly equal.
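
How the four measures relate to each other can be worked through by hand. The confusion-matrix counts below are invented for illustration and are not the results of this notebook.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 30, 20, 20, 130
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted churners that really churned
recall = tp / (tp + fn)      # share of real churners that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Always predicting the negative class on the same data:
# every actual negative (tn + fp) is classified correctly.
baseline_accuracy = (tn + fp) / total

print(accuracy, precision, recall, f1, baseline_accuracy)
```

In this made-up example the model beats the constant baseline by only a few points of accuracy, while the baseline's recall for the churn class is zero. That gap is what the F1 score makes visible.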

## Grid Search    
Determine class weights and the regularization parameter C with grid search to maximize the model's performance.

In [None]:
gridSearchParams = [
    {'C':[0.01, 0.03, 0.1, 0.3, 1.0, 1.1, 1.3, 1.33, 1.6],
     'class_weight':[{0:.1, 1:.9},{0:.2, 1:.8},{0:.3, 1:.7},{0:.4, 1:.6},
                    {0:.4, 1:.7}, {0:.6, 1:.8}, {0:.5, 1:.8}, 'balanced']
    }
]

classifier = GridSearchCV(LogisticRegression(solver ='liblinear'), gridSearchParams, cv=5)

#Define the pipeline steps
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    # `sparse_output` replaces the former `sparse` argument (scikit-learn >= 1.2)
    (OneHotEncoder(sparse_output=False, categories='auto'), categorical_features),
    remainder='passthrough'
)

pipeline = Pipeline([
    ('Preprocessor', preprocessor),
    ('Classifier', classifier)
])

# Execute pipeline
_ = pipeline.fit(x_train, y_train)

# Calculate results
y_pred = pipeline.predict(x_test)

pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True) 

In [None]:
p5_measures = calculate_results(y_test, y_pred)

print("Best Hyperparameters:",classifier.best_params_)

**Observation:**    
Grid search found hyperparameters that improve all four scores.    
The accuracy improved from 0.78 to 0.80.    
Precision improved from 0.59 to 0.62. Intuitively this means that when the model predicts a customer to churn, it is correct 62% of the time.    
Recall improved from 0.54 to 0.64. Intuitively this means that the model correctly predicts 64% of all churned customers.    
The F1-Score improved from 0.57 to 0.63.
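
A possible variation, sketched here on synthetic data rather than on the churn dataset, is to wrap the whole pipeline in the grid search and to select hyperparameters by F1 instead of the default accuracy. Everything in this example is an assumption made for illustration: the data comes from `make_classification`, and the step names `scaler` and `clf` as well as the small grid are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 73% negatives).
X, y = make_classification(n_samples=300, n_features=6, weights=[0.73],
                           random_state=10)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    'clf__C': [0.1, 1.0],
    'clf__class_weight': [None, 'balanced'],
}

# scoring='f1' makes the search optimize the F1 score of the positive class.
search = GridSearchCV(pipe, param_grid, cv=3, scoring='f1')
search.fit(X, y)
print(search.best_params_)
```

Searching over the full pipeline means the scaler is refit inside every cross-validation fold, so no statistics from the validation part leak into preprocessing.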

# Conclusion

This kernel analyzes a telco customer churn dataset according to the CRISP-DM standard.
The dataset is loaded, preprocessed and transformed to fit the specific needs, e.g. by applying label encoding to non-numeric data or scaling the values.

To better understand the dataset, visualizations of univariate and conditional data distributions were created.
With the help of these plots potentially interesting features for churn prediction were identified.
The features found this way were validated by inspecting a correlation heatmap.

The most informative features were selected with three different metrics: mutual information, the ANOVA F-score and the chi-squared score.
Comparison of the metrics showed clear overlaps of interesting features.
Salient features of the dataset were determined to be: Tenure, InternetService, OnlineSecurity, DeviceProtection, TechSupport, Contract, MonthlyCharges and TotalCharges.

Finally a model was trained and evaluated with scikit-learn.
Grid search was applied for hyperparameter optimization.
Due to the unbalanced nature of the dataset, accuracy is a potentially misleading performance metric.
Instead the F1 score was found to be more reliable.