# [EN] 🚀 Bank Customer Churn Prediction
## 📌 Project Goal
This project was developed to **predict whether bank customers will leave the service (whether they will churn).**

The aim is to **predict customer churn in advance and enable banks to take action** using machine learning.

## 📊 Dataset Used
- **Data:** "Bank Customer Churn Prediction" dataset
- **Link:** https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset
- **Features:**
  - `credit_score`: Customer's credit score
  - `age`: Customer's age
  - `tenure`: Number of years the customer has been with the bank
  - `balance`: Account balance
  - `products_number`: Number of banking products used
  - `credit_card`: Whether the customer has a credit card (0: No, 1: Yes)
  - `active_member`: Whether the customer is an active member (0: No, 1: Yes)
  - `estimated_salary`: Estimated annual salary
  - `country`: Customer's country (France, Germany, Spain)
  - `gender`: Customer's gender (Male, Female)
  - `churn`: **Target variable**, whether the customer will leave the bank (0: No, 1: Yes)

## 🏗 Modeling Steps
1️⃣ **Analyzing Data and Checking for Missing Values**
2️⃣ **Converting Categorical Data to Numeric Form (One-Hot Encoding, Label Encoding)**
3️⃣ **Feature Engineering and Standardization (`StandardScaler`)**
4️⃣ **Analyzing Outliers and Correlation**
5️⃣ **Solving the Class Imbalance Problem (`SMOTE`, `Class Weight`)**
6️⃣ **Training with RandomForest and LogisticRegression Models**
7️⃣ **Hyperparameter Optimization (`RandomizedSearchCV`)**
8️⃣ **Making Churn Prediction for a New Customer**

## Importing Libraries

In [178]:
import pandas as pd 
import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Load Dataset

In [179]:
df=pd.read_csv("Bank Customer Churn Prediction.csv")
df

Unnamed: 0,customer_id,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
0,15634602,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,15647311,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,15701354,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,15737888,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,15606229,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,15569892,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,15584532,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,15682355,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


In [None]:
def examine_data(df):
    # df.info() prints its summary itself and returns None, so call it on its own
    df.info()
    print("shape:", df.shape)
    print("missing value:", df.isna().sum())
    # Print the value distribution of every column
    for column in df.columns:
        print("value counts:", df[column].value_counts(), sep="\n")

In [181]:
examine_data(df)  # notes: drop customer_id, drop country, look into the zero balances

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB
info: None
shape: (10000, 12)
missing value: customer_id         0
credit_score        0
country             0
gender              0
age                 0
tenure         

customer_id
15656710    1
15768163    1
15672754    1
15719276    1
15692664    1
           ..
15737888    1
15701354    1
15619304    1
15647311    1
15634602    1
Name: count, Length: 10000, dtype: int64

value counts: None


credit_score
850    233
678     63
655     54
705     53
667     53
      ... 
358      1
412      1
382      1
373      1
419      1
Name: count, Length: 460, dtype: int64

value counts: None


country
France     5014
Germany    2509
Spain      2477
Name: count, dtype: int64

value counts: None


gender
Male      5457
Female    4543
Name: count, dtype: int64

value counts: None


age
37    478
38    477
35    474
36    456
34    447
     ... 
84      2
82      1
88      1
85      1
83      1
Name: count, Length: 70, dtype: int64

value counts: None


tenure
2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
0      413
Name: count, dtype: int64

value counts: None


balance
0.00         3617
130170.82       2
105473.74       2
113957.01       1
85311.70        1
             ... 
88381.21        1
155060.41       1
57369.61        1
75075.31        1
116363.37       1
Name: count, Length: 6382, dtype: int64

value counts: None


products_number
1    5084
2    4590
3     266
4      60
Name: count, dtype: int64

value counts: None


credit_card
1    7055
0    2945
Name: count, dtype: int64

value counts: None


active_member
1    5151
0    4849
Name: count, dtype: int64

value counts: None


estimated_salary
24924.92     2
140469.38    1
51695.41     1
151325.24    1
64327.26     1
            ..
2988.28      1
99595.67     1
53445.17     1
115146.40    1
23101.13     1
Name: count, Length: 9999, dtype: int64

value counts: None


churn
0    7963
1    2037
Name: count, dtype: int64

value counts: None


In [182]:
df.describe()

Unnamed: 0,customer_id,credit_score,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,15690940.0,650.5288,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,71936.19,96.653299,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,15628530.0,584.0,32.0,3.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


## Eliminating Missing Data

In [183]:
df.columns

Index(['customer_id', 'credit_score', 'country', 'gender', 'age', 'tenure',
       'balance', 'products_number', 'credit_card', 'active_member',
       'estimated_salary', 'churn'],
      dtype='object')

In [184]:
df = df.drop(columns=['customer_id'])

## Converting Categorical Values

In [185]:
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])  # Male -> 1, Female -> 0

df = pd.get_dummies(df, columns=['country'], drop_first=True)

In [186]:
df

Unnamed: 0,credit_score,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn,country_Germany,country_Spain
0,619,0,42,2,0.00,1,1,1,101348.88,1,False,False
1,608,0,41,1,83807.86,1,0,1,112542.58,0,False,True
2,502,0,42,8,159660.80,3,1,0,113931.57,1,False,False
3,699,0,39,1,0.00,2,0,0,93826.63,0,False,False
4,850,0,43,2,125510.82,1,1,1,79084.10,0,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,771,1,39,5,0.00,2,1,0,96270.64,0,False,False
9996,516,1,35,10,57369.61,1,1,1,101699.77,0,False,False
9997,709,0,36,7,0.00,1,0,1,42085.58,1,False,False
9998,772,1,42,3,75075.31,2,1,0,92888.52,1,True,False
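
The two encodings used above can be checked on a toy frame. This is a minimal sketch with made-up rows (the `toy` frame and `le_toy` name are illustrative, not part of the notebook); it shows that `LabelEncoder` assigns codes in sorted order and that `drop_first=True` leaves France as the implicit baseline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same two categorical columns as the dataset (rows are made up)
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "country": ["France", "Spain", "Germany", "France"],
})

# LabelEncoder sorts the class names, so Female -> 0 and Male -> 1
le_toy = LabelEncoder()
toy["gender"] = le_toy.fit_transform(toy["gender"])

# drop_first=True drops the first category (France); a row with both
# country_Germany and country_Spain set to False is therefore a French customer
toy = pd.get_dummies(toy, columns=["country"], drop_first=True)

encoded_columns = list(toy.columns)
gender_codes = toy["gender"].tolist()
print(encoded_columns)
print(gender_codes)
```

Dropping one dummy column avoids perfectly collinear features, which mainly matters for the linear model.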


## Standardizing Numeric Values

In [187]:
scaler = StandardScaler()

# Select the columns to scale (excluding churn)
num_cols = ['credit_score', 'age', 'tenure', 'balance', 'products_number', 'credit_card', 'active_member', 'estimated_salary']
df[num_cols] = scaler.fit_transform(df[num_cols])
df

Unnamed: 0,credit_score,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn,country_Germany,country_Spain
0,-0.326221,0,0.293517,-1.041760,-1.225848,-0.911583,0.646092,0.970243,0.021886,1,False,False
1,-0.440036,0,0.198164,-1.387538,0.117350,-0.911583,-1.547768,0.970243,0.216534,0,False,True
2,-1.536794,0,0.293517,1.032908,1.333053,2.527057,0.646092,-1.030670,0.240687,1,False,False
3,0.501521,0,0.007457,-1.387538,-1.225848,0.807737,-1.547768,-1.030670,-0.108918,0,False,False
4,2.063884,0,0.388871,-1.041760,0.785728,-0.911583,0.646092,0.970243,-0.365276,0,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1.246488,1,0.007457,-0.004426,-1.225848,0.807737,0.646092,-1.030670,-0.066419,0,False,False
9996,-1.391939,1,-0.373958,1.724464,-0.306379,-0.911583,0.646092,0.970243,0.027988,0,False,False
9997,0.604988,0,-0.278604,0.687130,-1.225848,-0.911583,-1.547768,0.970243,-1.008643,1,False,False
9998,1.256835,1,0.293517,-0.695982,-0.022608,0.807737,0.646092,-1.030670,-0.125231,1,True,False


## Detecting Outliers

In [None]:
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1

df_cleaned = df[~((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
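
To make the IQR rule concrete, here is a small sketch on a single made-up `age` column (the values and the `toy` frame are illustrative assumptions). A row is kept only if its value lies within 1.5 IQR of the quartiles, which is the same test the cell above applies to every column in `num_cols`.

```python
import pandas as pd

# Made-up ages; 92 is far from the rest of the values
toy = pd.DataFrame({"age": [30, 32, 35, 37, 40, 41, 44, 92]})

q1 = toy["age"].quantile(0.25)
q3 = toy["age"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
inside = toy["age"].between(lower, upper)

kept = toy[inside]
removed = toy.loc[~inside, "age"].tolist()
print("bounds:", lower, upper)
print("removed:", removed)
```

In the notebook the mask is combined with `.any(axis=1)`, so a row is dropped as soon as any one of its scaled columns falls outside its bounds.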

## Data Sampling: SMOTE

In [189]:
X = df_cleaned.drop(columns=['churn'])
y = df_cleaned['churn']

# Oversample the minority class with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)


In [None]:
y_resampled.value_counts()  # the class imbalance has been removed

churn
1    7646
0    7646
Name: count, dtype: int64
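
The idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours. The snippet below is a simplified illustration of that interpolation step with hand-picked points; it is not the `imblearn` implementation, and the names `x_i`, `x_nn`, and `lam` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# A minority-class sample and one of its nearest minority neighbours (hand-picked)
x_i = np.array([1.0, 2.0])
x_nn = np.array([3.0, 6.0])

# Pick a random position along the segment between the two points
lam = rng.uniform(0.0, 1.0)
x_new = x_i + lam * (x_nn - x_i)

print("interpolation factor:", lam)
print("synthetic sample:", x_new)
```

Because every synthetic row is derived from existing minority rows, oversampling adds no new information; it only changes how often the minority region is seen during training.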

## Train/Test Split

In [201]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
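
One caveat about the order of operations: the notebook scales and oversamples the full dataset before splitting, so the test set contains synthetic SMOTE rows and the scaler has seen test data. The scores reported below may therefore be optimistic. The following is a minimal sketch of an alternative order (split first, then fit the preprocessing on the training part only). It uses synthetic data from `make_classification` instead of the bank dataset and relies on `class_weight="balanced"` rather than SMOTE, so it is an assumption-laden illustration, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training part only and reuses it for the test part
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=500, random_state=42),
)
clf.fit(X_tr, y_tr)

demo_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"accuracy on untouched test data: {demo_accuracy:.4f}")
```

If SMOTE is preferred over class weights, it would be applied to `X_tr, y_tr` only, for example as a step in an `imblearn.pipeline.Pipeline`, leaving the test rows untouched.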

## Model Training and Performance: Logistic Regression

In [204]:
# Build and train the model
log_model = LogisticRegression(class_weight='balanced', random_state=42)
log_model.fit(X_train, y_train)

# Predict class labels; predict() already returns 0/1, so no thresholding is needed
y_pred_class = log_model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred_class):.4f}")
print("\nClassification Report: \n",classification_report(y_test, y_pred_class))
print("\nConfusion Matrix: \n",confusion_matrix(y_test, y_pred_class))


Accuracy: 0.7404

Classification Report: 
               precision    recall  f1-score   support

           0       0.74      0.75      0.75      1567
           1       0.74      0.73      0.73      1492

    accuracy                           0.74      3059
   macro avg       0.74      0.74      0.74      3059
weighted avg       0.74      0.74      0.74      3059


Confusion Matrix: 
 [[1177  390]
 [ 404 1088]]
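
The figures in the classification report can be recomputed by hand from the confusion matrix above, which is a useful check on how to read it. The sketch below uses only the four counts printed by the notebook; rows are the true classes and columns the predicted classes, following scikit-learn's convention.

```python
# Confusion matrix printed by the Logistic Regression cell:
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = [[1177, 390],
      [404, 1088]]

tn, fp = cm[0]  # true non-churners: correctly kept / wrongly flagged
fn, tp = cm[1]  # true churners: missed / correctly flagged
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)  # of the customers flagged as churn, how many really churned
recall_1 = tp / (tp + fn)     # of the real churners, how many were flagged
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:    {accuracy:.4f}")
print(f"precision_1: {precision_1:.2f}")
print(f"recall_1:    {recall_1:.2f}")
print(f"f1_1:        {f1_1:.2f}")
```

For churn prevention, recall on class 1 is often the number to watch, since each false negative is a churner the bank never gets a chance to retain.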


## Model Training and Performance: Random Forest

In [None]:
# Build and train the model
rf_model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf_model.fit(X_train, y_train)

# Predict
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))


Accuracy: 0.8918
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89      1567
           1       0.88      0.90      0.89      1492

    accuracy                           0.89      3059
   macro avg       0.89      0.89      0.89      3059
weighted avg       0.89      0.89      0.89      3059

Confusion Matrix:
 [[1384  183]
 [ 148 1344]]


## Hyperparameter Tuning: RandomizedSearchCV

In [193]:
# Hyperparameters for RandomForest
rf_param_grid = {
    'n_estimators': [50, 100, 200, 500],  # Number of trees
    'max_depth': [10, 20, 30, None],  # Maximum tree depth
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],  # Minimum number of samples required at a leaf node
    'bootstrap': [True, False]  # Whether to draw bootstrap samples
}

# Hyperparameters for Logistic Regression
log_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'penalty': ['l1', 'l2'],  # Penalty type (Lasso or Ridge)
    'solver': ['liblinear', 'saga']  # Optimization algorithm
}


In [194]:
# RandomForest model
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Logistic Regression model
log_reg = LogisticRegression(class_weight="balanced", random_state=42, max_iter=500)

# Define the RandomizedSearchCV objects
random_search_rf = RandomizedSearchCV(
    estimator=rf,
    param_distributions=rf_param_grid,
    n_iter=20,  # Number of parameter combinations to sample
    cv=5,  # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1  # Use all available cores
)

random_search_log = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=log_param_grid,
    n_iter=20,
    cv=5,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

random_search_rf.fit(X_train, y_train)
random_search_log.fit(X_train, y_train)

print("🌟 Best RandomForest hyperparameters:", random_search_rf.best_params_)
print("🌟 Best Logistic Regression hyperparameters:", random_search_log.best_params_)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Fitting 5 folds for each of 20 candidates, totalling 100 fits
🌟 Best RandomForest hyperparameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 30, 'bootstrap': False}
🌟 Best Logistic Regression hyperparameters: {'solver': 'saga', 'penalty': 'l2', 'C': 0.1}
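
It helps to know how much of each search space the 20 sampled candidates cover. The sketch below simply counts the combinations in the two grids defined earlier (repeated here so the snippet runs on its own); the helper name `grid_size` is an assumption for the example.

```python
from math import prod

rf_param_grid = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}
log_param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
}

def grid_size(grid):
    """Number of distinct combinations in a grid of value lists."""
    return prod(len(values) for values in grid.values())

rf_total = grid_size(rf_param_grid)
log_total = grid_size(log_param_grid)
n_iter = 20

print(f"RandomForest: {n_iter} of {rf_total} combinations sampled")
print(f"LogisticRegression: {min(n_iter, log_total)} of {log_total} combinations sampled")
```

With `n_iter=20`, the Logistic Regression search is effectively exhaustive, while the RandomForest search tries only a small fraction of its grid, so a larger `n_iter` might find a better forest at the cost of more fits.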


In [195]:
best_rf = random_search_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test)

best_log = random_search_log.best_estimator_
y_pred_log = best_log.predict(X_test)


print("📌 RandomForest Classification Report:\n", classification_report(y_test, y_pred_rf))
print("\n📌 Logistic Regression Classification Report:\n", classification_report(y_test, y_pred_log))

print("\n📊 RandomForest Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("\n📊 Logistic Regression Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))


📌 RandomForest Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.89      0.90      1567
           1       0.89      0.89      0.89      1492

    accuracy                           0.89      3059
   macro avg       0.89      0.89      0.89      3059
weighted avg       0.89      0.89      0.89      3059


📌 Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.74      0.75      0.75      1567
           1       0.74      0.73      0.73      1492

    accuracy                           0.74      3059
   macro avg       0.74      0.74      0.74      3059
weighted avg       0.74      0.74      0.74      3059


📊 RandomForest Confusion Matrix:
 [[1400  167]
 [ 160 1332]]

📊 Logistic Regression Confusion Matrix:
 [[1178  389]
 [ 404 1088]]
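
## Making a Churn Prediction for a New Customer

The last modeling step, scoring a new customer, could look roughly like the sketch below. It is self-contained and therefore trains on randomly generated stand-in data with only a subset of the columns; the `new_customer` values, the `demo_*` names, and the manual encoding are assumptions for illustration. In the notebook itself, the already fitted `scaler` and `best_rf` would be reused instead, and the new row would need all the columns the model was trained on, in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
country_idx = rng.integers(0, 3, size=n)  # 0: France, 1: Germany, 2: Spain

# Random stand-in for the preprocessed training data (not the real dataset)
demo_train = pd.DataFrame({
    "credit_score": rng.integers(350, 851, size=n).astype(float),
    "age": rng.integers(18, 93, size=n).astype(float),
    "balance": rng.uniform(0, 250_000, size=n),
    "gender": rng.integers(0, 2, size=n),
    "country_Germany": country_idx == 1,
    "country_Spain": country_idx == 2,
})
demo_target = rng.integers(0, 2, size=n)  # random labels, for illustration only

demo_num_cols = ["credit_score", "age", "balance"]
feature_cols = list(demo_train.columns)

demo_scaler = StandardScaler()
demo_train[demo_num_cols] = demo_scaler.fit_transform(demo_train[demo_num_cols])
demo_model = RandomForestClassifier(n_estimators=50, random_state=42)
demo_model.fit(demo_train[feature_cols], demo_target)

# A hypothetical new customer in raw form
new_customer = {"credit_score": 600.0, "age": 40.0, "balance": 60000.0,
                "gender": "Female", "country": "Germany"}

# Apply the same encoding as in training: Male -> 1, Female -> 0, France as baseline
new_row = pd.DataFrame([{
    "credit_score": new_customer["credit_score"],
    "age": new_customer["age"],
    "balance": new_customer["balance"],
    "gender": 1 if new_customer["gender"] == "Male" else 0,
    "country_Germany": new_customer["country"] == "Germany",
    "country_Spain": new_customer["country"] == "Spain",
}])[feature_cols]

# Reuse the fitted scaler; never refit it on the new row
new_row[demo_num_cols] = demo_scaler.transform(new_row[demo_num_cols])

churn_pred = int(demo_model.predict(new_row)[0])
churn_col = list(demo_model.classes_).index(1)
churn_proba = float(demo_model.predict_proba(new_row)[0, churn_col])
print("predicted churn:", churn_pred, "| churn probability:", round(churn_proba, 3))
```

Using the probability instead of the hard label lets the bank choose its own threshold, for example contacting every customer above a given churn probability.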
