# Problem: 

**The bank wants to identify, in advance, the customers who are likely to leave.**

- The aim of the project is to predict whether a customer will leave the bank.

- Customer churn is defined as the closing of the customer's bank account.

**Dataset Story:**

- The dataset consists of 10,000 observations and 13 variables. One of these variables is the dependent variable.
- The independent variables contain information about the customers.
- The dependent variable indicates whether the customer churned.

**Variables:**

- Surname : Customer's surname
- CreditScore : Customer's credit score
- Geography : Country of residence (Germany/France/Spain)
- Gender : Customer's gender (Female/Male)
- Age : Customer's age
- Tenure : Number of years the customer has been with the bank
- Balance : Account balance
- NumOfProducts : Number of bank products used (credit card, salary account, etc.)
- HasCrCard : Credit card status (0=No, 1=Yes)
- IsActiveMember : Active membership status (0=Not active, 1=Active)
- EstimatedSalary : Customer's estimated salary
- Exited : Did the customer churn? (0=No, 1=Yes)


### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression  
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=FutureWarning) 
warnings.filterwarnings("ignore", category=UserWarning) 

%config InlineBackend.figure_format = 'retina'

pd.set_option('display.max_columns', None); pd.set_option('display.max_rows', None);

In [None]:
class color:
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

### Functions

In [None]:
#1
def read_data():
    """Reads the dataframe, assigns the column CustomerId as index, and the dataframe is returned."""
    return pd.read_csv("../input/bank-churn-modelling/Churn_Modelling.csv",index_col="CustomerId")
#2
def split_features():
    """Separates variables as categorical,numeric and outcome. It prints them on the screen and returns them."""
    categorical_features=["HasCrCard","IsActiveMember","Gender","Geography"]
    numerical_features=["CreditScore","Age","Tenure","Balance","NumOfProducts","EstimatedSalary"]
    Target="Exited"
    print(categorical_features,numerical_features,Target,sep="\n")
    return categorical_features,numerical_features,Target
#3
def cat_vis():
    """Visualizes the number of churned customers by category"""
    # Exited is 0/1, so "sum" counts churned customers ("count" would only give the group size)
    df.groupby("Gender").agg({"Exited":"sum"}).plot.bar(color="blue")
    plt.title("Churn by Gender")
    df.groupby("Geography").agg({"Exited":"sum"}).plot.bar(color="black")
    plt.title("Churn by Geography")
    df.groupby("HasCrCard").agg({"Exited":"sum"}).plot.bar(color="green")
    plt.title("Churn by Credit Card")
    df.groupby("IsActiveMember").agg({"Exited":"sum"}).plot.bar(color="green")
    plt.title("Churn by activity status")
#4
def stats(num_data):
    """It gives descriptive statistics according to the determined percentiles"""
    return df[num_data].describe([0.05,0.25,0.50,0.75,0.95]).T
#5
def missing_values():
    """It examines the missing data in the data set visually and numerically."""
    import missingno as msno
    msno.bar(df); 
    print(df.isnull().sum())
#6
def data_prep():
    """Drops the Surname and RowNumber columns, one-hot encodes the categorical variables, and drops the first dummy column of each."""
    df.drop(["Surname","RowNumber"],axis=1,inplace=True)
    return pd.get_dummies(df,columns = cat_ft, drop_first = True)
    
#7    
def handle_outliers(df,q1=0.05,q3=0.95,method="quantiles",
                    inplace=False):
    """Analyzes outliers with the LOF or quantiles method. With inplace=True, the quantiles method caps values at the upper limit and the LOF method drops the outlying rows."""
    if method=="quantiles":
        for feature in df:
            Q1 = df[feature].quantile(q1)
            Q3 = df[feature].quantile(q3)
            IQR = Q3-Q1
            lower = Q1- 1.5*IQR  # kept for reference; only the upper limit is checked below
            upper = Q3 + 1.5*IQR
            if (df[feature] > upper).any():
                print(color.BOLD+color.UNDERLINE+feature+":"+color.END,"OUTLIERS"+" ",sep="\n")
                print(df[(df[feature] > upper)])
                if inplace:
                    # cap in place and keep looping so that every feature is processed
                    df.loc[df[feature] > upper,feature] = upper
                print("*******************************O*******************************")
            else:
                print(color.BOLD+color.UNDERLINE+feature+color.END+": There are no outliers in this feature"+color.END+" ")
                print("*******************************O*******************************")
        if inplace:
            return df
    elif method=="LOF":
        from sklearn.neighbors import LocalOutlierFactor
        # fall back to 20 neighbors when the prompt is left empty
        n_neighbors=int(input("n_neighbors(default=20): ") or 20)
        clf = LocalOutlierFactor(n_neighbors=n_neighbors)
        clf.fit_predict(df)
        df_scores = clf.negative_outlier_factor_
        print(np.sort(df_scores)[0:30])
        # 1-based rank of the sorted score to use as the cut-off
        threshold=int(input("threshold: "))
        threshold=np.sort(df_scores)[threshold-1]
        print(threshold)
        print(df[df_scores< threshold])
        if inplace:
            # drop the rows in place; like pandas in-place methods, nothing is returned
            df.drop(index=df[df_scores< threshold].index,inplace=True)

def var_target():
    """Splits the dataframe into the outcome (y) and the independent variables (X)."""
    y=df[["Exited"]]
    X=df.drop("Exited",axis=1)
    return y,X
#8
def scale(num_data):
    """Uses Robust Scaler to scale the numerical variables and joins them back with the remaining features."""
    from sklearn.preprocessing import RobustScaler
    scaler = RobustScaler() 
    data_scaled = scaler.fit_transform(df[num_data])
    df_scaled=pd.DataFrame(data_scaled,columns=num_data,index=df.index)
    # remaining (already encoded) features, selected via the columns of the global X
    cat_df=df[X.columns.difference(df_scaled.columns)]
    
    return df_scaled.merge(cat_df,left_index=True,right_index=True)
#9
def ml_simple_models(X,y):
    
    """It takes the independent variables(X) and outcome(y) as parameters, and prints the prediction
    success of the models within it after the train test separation."""
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 12345)
    
    names = ["LogisticRegression","GaussianNB","KNeighborsClassifier","LinearSVC","SVC",
         "DecisionTreeClassifier","RandomForestClassifier","GradientBoostingClassifier",
         "XGBClassifier","LGBMClassifier"]
    
    
    classifiers = [LogisticRegression(), GaussianNB(), KNeighborsClassifier(), LinearSVC(), SVC(),
               DecisionTreeClassifier(),RandomForestClassifier(), GradientBoostingClassifier(),
               XGBClassifier(), LGBMClassifier()]

    for name, clf in zip(names, classifiers):

        model = clf.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        msg = "%s: %f" % (name, acc)
        print(msg)

    
#10
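def tuned_ml_models(X,y):
    """Runs a grid search for Gradient Boosting, Random Forest and LightGBM, then prints the test accuracy
    and plots the confusion matrix of each tuned model."""
    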
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 12345)
    
    # the grid starts at 0.25: a learning rate of 0 would mean the model never updates
    gb_params = {"learning_rate": np.linspace(0.25,1,4),
            "max_depth": [2,6,8,10],
            "n_estimators": [50,100,250,500],
            "min_samples_split": [2,7,10]}

    gb_model = GradientBoostingClassifier()

    gb_cv_model = GridSearchCV(gb_model, 
                               gb_params, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 1) 

    gb_cv_model.fit(X_train, y_train)
    
    gb_tuned=gb_cv_model.best_estimator_
    gb_tuned.fit(X_train,y_train)
    y_pred=gb_tuned.predict(X_test)
    print("Best params: ",gb_cv_model.best_params_)
    print("Tuned Gradient Boosting Classifier: ",accuracy_score(y_test,y_pred))
    
    cm = confusion_matrix( y_test,y_pred, labels=[1,0] )
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["1", "0"] , 
    yticklabels = ["1", "0"] )
    plt.ylabel('ACTUAL')
    plt.xlabel('PREDICTED')
    plt.show()
    
    print("********************************************0**************************************************")
    
    rf_params = {"max_depth":[2,4,8], 
            "max_features": [2,5,8],
            "n_estimators": [50,150,300,500],
            "min_samples_split": [2,5,9]}

    rf_model = RandomForestClassifier()

    rf_cv_model = GridSearchCV(rf_model, 
                               rf_params, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 1) 

    rf_cv_model.fit(X_train, y_train)
    
    rf_tuned=rf_cv_model.best_estimator_
    rf_tuned.fit(X_train,y_train)
    y_pred=rf_tuned.predict(X_test)
    print("Best params: ",rf_cv_model.best_params_)
    print("Tuned Random Forests: ",accuracy_score(y_test,y_pred))
    
    cm = confusion_matrix( y_test,y_pred, labels=[1,0] )
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["1", "0"] , 
    yticklabels = ["1", "0"] )
    plt.ylabel('ACTUAL')
    plt.xlabel('PREDICTED')
    plt.show()
    
    print("********************************************0**************************************************")
    
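    # LightGBM parameter names (it has no max_features / min_samples_split);
    # the learning rate must be strictly positive, so the grid starts at 0.25
    lgbm_params = {"learning_rate":np.linspace(0.25,1,4), 
            "max_depth": [2,5,7],
            "n_estimators": [10,50,150,300,500],
            "min_child_samples": [2,5,7]}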

    lgbm_model = LGBMClassifier()

    lgbm_cv_model = GridSearchCV(lgbm_model, 
                               lgbm_params, 
                               cv = 3, 
                               n_jobs = -1, 
                               verbose = 3) 

    lgbm_cv_model.fit(X_train, y_train)
    
    lgbm_tuned=lgbm_cv_model.best_estimator_
    lgbm_tuned.fit(X_train,y_train)
    y_pred=lgbm_tuned.predict(X_test)
    print("Best params: ",lgbm_cv_model.best_params_)
    print("Tuned LGBM: ",accuracy_score(y_test,y_pred))
    
    cm = confusion_matrix( y_test,y_pred, labels=[1,0] )
    sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["1", "0"] , 
    yticklabels = ["1", "0"] )
    plt.ylabel('ACTUAL')
    plt.xlabel('PREDICTED')
    plt.show()
    
#11
def conf_matrix(X,y):
    """Fits Random Forest, Gradient Boosting and LightGBM with default settings, prints the test
    accuracy and plots the confusion matrix of each model."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 12345)

    names = ["RandomForestClassifier","GradientBoostingClassifier","LGBMClassifier"]

    classifiers = [RandomForestClassifier(), GradientBoostingClassifier(),LGBMClassifier()]

    for name, clf in zip(names, classifiers):

        model = clf.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        msg = "%s: %f" % (name, acc)
        print(msg)
        # labels must be passed by keyword in current scikit-learn versions
        cm = confusion_matrix( y_test,y_pred, labels=[1,0] )
        sns.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["1", "0"] , 
                yticklabels = ["1", "0"] )
        plt.ylabel('ACTUAL')
        plt.xlabel('PREDICTED')
        plt.show()

## Data Reading and Understanding

In [None]:
# first 5 rows
df=read_data()
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# Splitting features into 3 categories
cat_ft,num_ft,outcome=split_features()

In [None]:
cat_vis()

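It is noteworthy that more churn is seen among customers who hold a credit card. This may indicate dissatisfaction with credit cards, although the bars are raw counts, so the difference may also simply reflect that card holders are the larger group.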
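
Since the bar charts above show raw counts per category, a larger group can look worse only because it contains more customers. A minimal sketch of the count-versus-rate comparison follows; the toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, only to illustrate the idea)
toy = pd.DataFrame({
    "HasCrCard": [1, 1, 1, 1, 1, 1, 0, 0],
    "Exited":    [1, 1, 0, 0, 0, 0, 1, 0],
})

summary = toy.groupby("HasCrCard")["Exited"].agg(
    customers="count",   # group size
    churned="sum",       # number of churned customers (Exited is 0/1)
    churn_rate="mean",   # share of the group that churned
)
print(summary)
```

In this toy frame the card holders have more churned customers in absolute terms but a lower churn rate, which is why looking at the rate alongside the count can change the interpretation.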

In [None]:
stats(num_ft)

Descriptive statistics give a general picture of the distributions: means, medians, standard deviations and, very roughly, outliers. Looking at each variable on its own, there does not seem to be an anomaly, so outliers are examined later with the multivariate LOF (Local Outlier Factor) method.
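
As a rough illustration of how LOF scores can be turned into a cut-off, here is a minimal sketch on synthetic two-dimensional data. The data, the number of injected outliers and the chosen rank are all assumptions made for the demo; in the notebook the rank is chosen interactively after inspecting the sorted scores.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 200 ordinary points plus 3 far-away points (synthetic, for illustration only)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
data = np.vstack([inliers, far_points])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(data)
scores = lof.negative_outlier_factor_   # more negative = more anomalous

# Indices of the 3 most anomalous observations
worst = np.argsort(scores)[:3]
print(sorted(worst.tolist()))

# Use the 4th-lowest score as the cut-off and keep everything at or above it
cutoff = np.sort(scores)[3]
kept = data[scores >= cutoff]
print(kept.shape)
```

Rows with a score strictly below the cut-off are discarded, which mirrors the comparison used in `handle_outliers`.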

In [None]:
#we're checking if there is missing data in the dataframe
missing_values()

#### There are no missing values in the dataframe

## Data Preprocessing

In [None]:
# Dropping the "Surname" and "RowNumber" columns
# One-hot encoding the categorical variables
df=data_prep()

In [None]:
df.head()
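
A minimal sketch of the one-hot encoding step on a made-up two-column frame (the toy values are assumptions; only the column names mirror the dataset). With `drop_first=True`, the first level of each variable is dropped and acts as the baseline category.

```python
import pandas as pd

# Toy frame (hypothetical rows) with two categorical columns
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# One dummy column per level, minus the first level of each variable
encoded = pd.get_dummies(toy, columns=["Geography", "Gender"], drop_first=True)
print(encoded.columns.tolist())
```

Here "France" and "Female" are the dropped baseline levels, so a row with zeros in all `Geography_*` columns corresponds to France.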

In [None]:
handle_outliers(df,method="LOF",inplace=True)

In [None]:
df.shape

In [None]:
y,X=var_target()

In [None]:
y.head()

In [None]:
X.head()

In [None]:
X=scale(num_ft)

In [None]:
X.head()

### Machine Learning

In [None]:
ml_simple_models(X,y)

In [None]:
conf_matrix(X,y)

### Model Tuning

Hyperparameter optimization for the 3 algorithms that scored highest on the test set with their default (untuned) settings.
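
For readers unfamiliar with grid search, a minimal sketch of the idea on synthetic data follows. The dataset from `make_classification`, the tiny grid and the variable names are assumptions chosen to keep the example fast; they are not the grids used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Every combination in the grid is evaluated with 3-fold cross-validation
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3], "n_estimators": [50]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
# With the default refit=True, the best model is already refitted on the full training set
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```

The cross-validated score selects the parameters, while the held-out test set is used only once at the end to estimate how the tuned model generalizes.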

In [None]:
tuned_ml_models(X,y)

### Conclusion

- The hyperparameter searches did not yield better results than the untuned models.
- The highest accuracy, about 86%, was obtained with the untuned Gradient Boosting Classifier.
- Things that could be done to achieve higher predictive performance:
   * We did not apply any treatment to the outliers. Flagging some values as outliers and then dropping them, filling them with the mean or median, or capping them could be tried.
   * A different scaling method could be used. (We used Robust Scaler.)
   * Scaling could be applied to the whole dataset. (We applied it only to the numerical variables.)
   * New variables could be derived and existing ones transformed through feature engineering.
   * Different hyperparameter spaces should definitely be tested to find the optimal hyperparameters.
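
As one way to try the scaling suggestions above, the sketch below compares three scalers inside a scikit-learn `Pipeline`, so that each scaler is fitted only on the training folds. The synthetic frame, the target rule and the choice of `LogisticRegression` are assumptions made for the demo, not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(42)
n = 400
# Synthetic stand-in for the customer table (hypothetical values)
demo = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Balance": rng.normal(75000, 30000, size=n),
    "IsActiveMember": rng.integers(0, 2, size=n),
})
# Synthetic target loosely tied to Age (an assumption for the demo)
target = (demo["Age"] + rng.normal(0, 10, size=n) > 50).astype(int)

numeric_cols = ["Age", "Balance"]
results = {}
for scaler in (RobustScaler(), StandardScaler(), MinMaxScaler()):
    # Scale only the numeric columns; pass the binary column through unchanged
    pre = ColumnTransformer([("num", scaler, numeric_cols)], remainder="passthrough")
    pipe = Pipeline([("prep", pre), ("model", LogisticRegression(max_iter=1000))])
    results[type(scaler).__name__] = cross_val_score(pipe, demo, target, cv=5).mean()

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Keeping the scaler inside the pipeline also avoids fitting it on rows that later end up in the validation fold.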