# Telco Customer Churn Prediction Using Machine Learning Algorithms - 100% accuracy

## Steps for Prediction
### 1. Explore & Prepare Data
### 2. Handling Imbalanced Data
### 3. Predict Using Machine Learning Algorithms

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Explore & Prepare Data

In [None]:
df = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.head()

Check the data type of each column in the dataset

In [None]:
df.info()

Many columns have the `object` data type and need to be converted to a numeric type

In [None]:
df.describe()

Check which columns contain null values

In [None]:
df.isnull().sum()

In [None]:
df.shape

The TotalCharges column contains blank strings (a single space), which `isnull()` does not report as missing, so we remove the rows where TotalCharges is blank

In [None]:
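# Keep only rows with a non-blank TotalCharges; copy so later assignments
# modify an independent frame instead of a slice of df
df_1 = df[df.TotalCharges != ' '].copy()
df_1.shape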

Convert TotalCharges to a numeric data type

In [None]:
df_1['TotalCharges'] = pd.to_numeric(df_1['TotalCharges'])
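An alternative to filtering on the exact string `' '` is to let pandas coerce anything non-numeric to NaN and then drop those rows. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the dataset), and the names `toy` and `toy_clean` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real data; values are made up for illustration
toy = pd.DataFrame({
    'tenure': [1, 0, 34],
    'TotalCharges': ['29.85', ' ', '1889.5'],
})

# errors='coerce' turns strings that cannot be parsed (such as ' ') into NaN
toy['TotalCharges'] = pd.to_numeric(toy['TotalCharges'], errors='coerce')

# Drop the rows whose TotalCharges could not be parsed
toy_clean = toy.dropna(subset=['TotalCharges'])
print(toy_clean)
```

This variant also catches other unparseable values, at the cost of silently discarding them, so it may be worth counting the NaN rows before dropping.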

Check that the TotalCharges column has been converted to a numeric type

In [None]:
df_1.info()

Drop customerID, which is not useful as a feature

In [None]:
df_1 = df_1.drop(['customerID'], axis='columns')

Create a function to print the unique values of every object-typed column

In [None]:
def print_unique_col_values(df):
    """Print the unique values of every object-typed column."""
    for column in df:
        if df[column].dtype == 'object':
            print(f'{column}: {df[column].unique()}')

In [None]:
print_unique_col_values(df_1)

Replace 'No internet service' and 'No phone service' with 'No', because for these columns they mean the same thing

In [None]:
df_1.replace('No internet service','No',inplace=True)
df_1.replace('No phone service','No',inplace=True)

In [None]:
print_unique_col_values(df_1)

Convert every column that holds 'Yes' and 'No' values to 1 and 0

In [None]:
binary_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                  'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']
for col in binary_columns:
    # Assign the result back: an in-place replace on a selected column may not update df_1
    df_1[col] = df_1[col].map({'Yes': 1, 'No': 0})

In [None]:
df_1.info()

In [None]:
print_unique_col_values(df_1)

In [None]:
features = df_1
features.info()

Encode gender as a number: 'Male' becomes 1 and 'Female' becomes 0

In [None]:
from sklearn.preprocessing import LabelEncoder
le_gender = LabelEncoder()
features['gender_label'] = le_gender.fit_transform(features['gender'])
features = features.drop(['gender'], axis='columns')
features.head()
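`LabelEncoder` assigns codes in the sorted order of the labels, which is why 'Female' maps to 0 and 'Male' to 1. A minimal sketch on made-up values (the list `genders` is illustrative only) shows how to read the mapping back from the fitted encoder:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only
genders = ['Male', 'Female', 'Female', 'Male']

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_ is sorted, so index 0 is 'Female' and index 1 is 'Male'
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'Female': 0, 'Male': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```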

Use one-hot encoding to turn the remaining categorical columns into numeric ones; pandas `get_dummies` creates a new column for every category

In [None]:
features = pd.get_dummies(data=features, columns=['InternetService','Contract','PaymentMethod'])
features.info()
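To see what `get_dummies` does to a single column, here is a small sketch on a made-up frame (the values are illustrative). Passing `dtype=int` is a choice made for this sketch: recent pandas versions return boolean dummy columns by default, and integers make the 0/1 encoding explicit.

```python
import pandas as pd

# Toy frame; values are made up for illustration
toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
})

# The original 'Contract' column is replaced by one column per category
encoded = pd.get_dummies(data=toy, columns=['Contract'], dtype=int)
print(encoded.columns.tolist())
print(encoded['Contract_One year'].tolist())
```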

In [None]:
features.head()

In [None]:
features.isnull().sum()

In [None]:
features.describe()

See the correlation between every pair of columns

In [None]:
features.corr()

See the correlation of each feature column with the 'Churn' column

In [None]:
# Use all columns; slicing with columns[1:] would leave the first feature out
features.corr()['Churn'].sort_values(ascending=False)

In [None]:
features.describe()

Normalize the data with min-max scaling

In [None]:
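from sklearn.preprocessing import MinMaxScaler
features_scaler = MinMaxScaler()
# Exclude the target column so the label does not leak into the feature matrix
features = features_scaler.fit_transform(features.drop(['Churn'], axis='columns'))
features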
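Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The sketch below, on made-up tenure-like values, checks that the manual formula agrees with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column of tenure-like values, made up for illustration
values = np.array([[1.0], [25.0], [72.0], [13.0]])

# Manual min-max scaling, computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(values)
print(np.allclose(manual, scaled))  # True
```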

In [None]:
x = features
y = df_1.Churn

# 2. Handling Imbalanced Data Using SMOTE

The class labels are imbalanced, so we use SMOTE to generate synthetic samples of the minority class until both classes are the same size

In [None]:
y.value_counts()

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='minority')
x_sm, y_sm = smote.fit_resample(x, y)

y_sm.value_counts()

In [None]:
x_sm.shape

In [None]:
y_sm.shape
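The idea behind SMOTE is to create each synthetic point on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that interpolation step, not the imbalanced-learn implementation; the function name `smote_like_samples`, its parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))             # random minority sample
        neighbour = neighbours[base, rng.integers(k)]  # one of its k neighbours
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Toy minority class with two features (made-up values)
minority = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points.shape)  # (6, 2)
```

Because every synthetic point is a blend of two existing minority samples, it always falls inside the region already covered by the minority class.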

# 3. Predict Using Machine Learning Algorithms

In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

model_params = {
    'svm': {
        'model': SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20,30],
            'kernel': ['rbf','linear','poly']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [10,50,100]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear'),
        'params': {
            'C': [1,5,10]
        }
    },
    'KNN' : {
        'model': KNeighborsClassifier(),
        'params': {
            'n_neighbors': [3,7,11,13]
        }
    }
    
}

In [None]:
from sklearn.model_selection import GridSearchCV
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x_sm, y_sm)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df_score = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df_score
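The grid search above runs on data that was scaled before cross-validation. One common alternative, sketched here on synthetic data rather than the Telco set, is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training part of each fold. The step names `scale` and `clf` are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the Telco dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, {'clf__C': [1, 5, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```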

Several of these machine learning algorithms make good predictions

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_sm, y_sm, test_size=0.2, random_state=15, stratify=y_sm)

Choose one of the machine learning algorithms to produce a confusion matrix and a classification report

In [None]:
model = SVC(C=1, kernel='rbf')
model.fit(x_train,y_train)
model.score(x_test,y_test)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sb
y_predicted = model.predict(x_test)
cm = confusion_matrix(y_test,y_predicted)
plt.figure(figsize = (10,7))
sb.heatmap(cm, annot=True, fmt="d")  # confusion-matrix cells are integer counts
plt.xlabel('Predicted')
plt.ylabel('Truth')

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predicted))
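The precision, recall, and F1 values in the classification report can be derived directly from the four cells of the confusion matrix. The worked example below uses made-up labels (not results from this notebook) and checks the manual arithmetic against scikit-learn's own metric functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration (1 = churn, 0 = no churn)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, rows are the truth and columns are the predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners the model found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 4))  # 0.6 0.75 0.6667
```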