<html> 
<body> 
<hr style="border: 2px solid #1f88c0ff; width: 90%;">
<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 Customer Churn Prediction
</h1>
<hr style="border: 2px solid #1f88c0ff; width: 90%;">

</body> 
</html>



# 🟢 Customer Churn Prediction


## 1️⃣ Problem Definition
**Customer Churn** refers to customers who stop using a company's product or service.  
The goal is to **predict which customers are likely to churn** based on historical behavior.


## 2️⃣ Why This Problem Is Important
- Retaining existing customers is **cheaper than acquiring new ones**.  
- Helps companies **increase revenue and improve customer satisfaction**.  
- Enables **targeted retention strategies** for at-risk customers.


## 3️⃣ How Machine Learning Can Help
- ML models can **analyze historical customer data** to detect churn patterns.  
- **Predictive models** identify high-risk customers **before they leave**.  
- Supports **data-driven decision making** in marketing and customer support.


## 4️⃣ Data Description
| Feature | Type | Description |
|---------|------|-------------|
| CustomerID | Identifier | Unique ID for each customer |
| Age | Numeric | Age of the customer |
| Gender | Categorical | Male / Female |
| Tenure | Numeric | Months customer has been with company |
| Usage Frequency | Numeric | How often customer uses the service |
| Support Calls | Numeric | Number of calls to support |
| Payment Delay | Numeric | Delays in payments |
| Subscription Type | Categorical | Basic / Standard / Premium |
| Contract Length | Categorical | Monthly / Quarterly / Annual |
| Total Spend | Numeric | Total amount spent by the customer |
| Last Interaction | Numeric | Days since last interaction |
| Churn | Target | 0 = Active, 1 = Churned |





<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 1.Importing
</h1>

</body> 
</html>



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split ,GridSearchCV
from sklearn.preprocessing import OrdinalEncoder ,OneHotEncoder

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier , AdaBoostClassifier ,ExtraTreesClassifier 
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report , confusion_matrix, ConfusionMatrixDisplay

import warnings
warnings.filterwarnings('ignore')


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 2.Reading Data
</h1>

</body> 
</html>



In [None]:
train = pd.read_csv('/kaggle/input/customer-churn-data/customer_churn_dataset-training-master.csv')
test = pd.read_csv('/kaggle/input/customer-churn-data/customer_churn_dataset-testing-master.csv')

In [None]:
print(f"Train Shape : {train.shape}")
print(f"Test Shape : {test.shape}")

#### I will concatenate the two DataFrames and then split the result into train, validation, and test sets

In [None]:
# Concatenate the two DataFrames; ignore_index avoids duplicate row labels
df = pd.concat([train, test], axis=0, ignore_index=True)
df.head()

In [None]:
df.shape

<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 3.Exploratory Data Analysis
</h1>

</body> 
</html>



In [None]:
print(df.columns)

In [None]:
df.drop('CustomerID' , axis = 1 , inplace=True)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include= 'object')

In [None]:
for col in df.columns :
    print(col)
    print(df[col].unique()) 
    print("*******************")

**Dataset Summary**

- Numerical feature columns (7): `Age`, `Tenure`, `Usage Frequency`, `Support Calls`, `Payment Delay`, `Total Spend`, `Last Interaction` (`CustomerID` was dropped above; `Churn` is the numeric target)
- Binary nominal column (1): `Gender`
- Ordinal categorical columns (2): `Contract Length`, `Subscription Type`


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 4.Detect and Handle Missing Values
</h1>

</body> 
</html>



In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace = True)

In [None]:
df.duplicated().sum()


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 5.Detect Outliers
</h1>

</body> 
</html>




In [None]:
num_col = df.select_dtypes(include='number').columns
cat_col = df.select_dtypes(include='object').columns

In [None]:
plt.figure(figsize=(10, 8))
sns.boxplot(data=df[num_col] , palette='Blues')

plt.title('Boxplot for Outlier Detection')
plt.xticks(rotation = 45)
plt.show()


##### The boxplots show no outliers in the numerical columns

<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 6. Analysis and Visualizations
</h1>

</body> 
</html>



## Univariate Analysis 

In [None]:
plt.figure(figsize=(5, 5))

sns.countplot(
    data=df,
    x="Churn" ,width =.4
)

plt.title("Churn Distribution")
plt.xlabel("Churn")
plt.ylabel("Count")

plt.show()


##### The distribution shows that the data is imbalanced: churned customers outnumber active ones

In [None]:
plt.figure(figsize=(15, 10))

for i, col in enumerate(num_col, 1):
    plt.subplot(2 , 4, i)
    sns.histplot(data=df, x=col, kde=True, bins=30 ,palette='Blues')  
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))

for i, col in enumerate(cat_col, 1):
    plt.subplot(1 , 3 , i)
    sns.countplot(data=df, x=col , hue = 'Churn' , palette="Blues")  
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

#### - We observed that females have a higher churn rate than males.
#### - Customers with monthly contracts have the highest churn rate.

## Bivariate Analysis 

In [None]:

sns.boxplot(data=df, x='Churn', y='Tenure', palette="Blues")

In [None]:
plt.figure(figsize=(6, 6))

corr = df[num_col].corr()
corr_with_target = corr['Churn'].sort_values(ascending= True).to_frame()
sns.heatmap(
    data=corr_with_target,
    annot=True,          
    fmt=".2f",           
    cmap="Blues",       
    cbar=True,           
    linewidths=0.5,      
    linecolor='white',  
    square=True         
)

plt.title("Correlation  With Target", fontsize=18)
plt.show()



## Multivariate Analysis 

In [None]:
plt.figure(figsize=(12, 8))

corr = df[num_col].corr()
sns.heatmap(
    data=corr,
    annot=True,          
    fmt=".2f",           
    cmap="Blues",       
    cbar=True,           
    linewidths=0.5,      
    linecolor='white',  
    square=True         
)

plt.title("Correlation Heatmap of Numerical Features", fontsize=18)
plt.tight_layout()
plt.show()


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 7.Encoding
</h1>

</body> 
</html>



#### I will apply two encodings:
####   1. OneHotEncoder > Gender, because it is nominal
####   2. OrdinalEncoder > Subscription Type, Contract Length, because they are ordinal

In [None]:
df_encoded = df.copy()

ohe = OneHotEncoder(
    drop='first',
    sparse_output=False
)

gender_encoded = ohe.fit_transform(df_encoded[['Gender']])

gender_df = pd.DataFrame(
    gender_encoded,
    columns=ohe.get_feature_names_out(['Gender']),
    index=df_encoded.index
)

df_encoded = pd.concat([df_encoded.drop('Gender', axis=1), gender_df], axis=1)


In [None]:
ordinal_cols = ['Subscription Type', 'Contract Length']

oe = OrdinalEncoder(
    categories=[
        ['Basic', 'Standard', 'Premium'],        
        ['Monthly', 'Quarterly', 'Annual']       
    ]
)

df_encoded[ordinal_cols] = oe.fit_transform(df_encoded[ordinal_cols])


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 8.Splitting Data
</h1>

</body> 
</html>



In [None]:

X = df_encoded.drop('Churn' , axis= 1)
y = df_encoded['Churn']

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp
)
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)
print(X_test.shape)
print(y_test.shape)
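
The second `test_size` of 0.176 looks odd at first. The validation split is taken from the 85% that remains after the test split, so 0.176 of that remainder is roughly 15% of the full data. A quick arithmetic check (variable names are illustrative):

```python
test_fraction = 0.15        # first split: share of the full data held out for testing
val_of_remaining = 0.176    # second split: share of the remaining data used for validation

remaining = 1 - test_fraction                 # share left after the first split
val_fraction = remaining * val_of_remaining   # share of the full data used for validation
train_fraction = remaining - val_fraction     # everything else is training data

print(f"train={train_fraction:.3f}, val={val_fraction:.3f}, test={test_fraction:.3f}")
```

This gives an approximately 70 / 15 / 15 split.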



<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 9.Scaling Numerical Features
</h1>

</body> 
</html>



#### Apply scaling to help distance-based models (the scaler is fit on the training split only)

In [None]:


scaled_col = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 'Payment Delay',
       'Total Spend', 'Last Interaction']
scaler = StandardScaler()
X_train[scaled_col] = scaler.fit_transform(X_train[scaled_col])
X_val[scaled_col] = scaler.transform(X_val[scaled_col])
X_test[scaled_col] = scaler.transform(X_test[scaled_col])
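
What the scaler does can be written out by hand: `StandardScaler` learns a mean and a population standard deviation per column from the training rows, then applies the same shift and scale to the validation and test rows. A minimal sketch on made-up numbers (the values and function names are illustrative, not taken from the dataset):

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return (mean, std) learned from training values only."""
    return fmean(values), pstdev(values)  # population std (ddof = 0)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in values]


train_tenure = [2.0, 4.0, 6.0, 8.0]   # toy training column
val_tenure = [5.0, 10.0]              # toy validation column

mean, std = fit_standardizer(train_tenure)
train_scaled = standardize(train_tenure, mean, std)
val_scaled = standardize(val_tenure, mean, std)   # reuses the training statistics

print(mean, std)
print(val_scaled)
```

Reusing the training statistics for the other splits is what keeps information from the validation and test rows out of the preprocessing step.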


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 10.Handle Imbalanced problem
</h1>

</body> 
</html>



#### I balance the classes of the training data with random undersampling

In [None]:
rus = RandomUnderSampler(sampling_strategy='auto',
    random_state=42
)

X_train, y_train = rus.fit_resample(
    X_train, y_train
)
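
Random undersampling keeps every row of the minority class and draws an equally sized random subset of the majority class. The sketch below mimics that idea with the standard library on toy labels; it illustrates the technique and is not the `imblearn` implementation, and `undersample` is a hypothetical helper.

```python
import random
from collections import Counter


def undersample(labels, seed=42):
    """Return sorted row indices with every class cut down to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(rows, minority_size))
    return sorted(kept)


toy_labels = [1] * 7 + [0] * 3            # imbalanced toy target
kept_idx = undersample(toy_labels)
balanced = [toy_labels[i] for i in kept_idx]
print(Counter(toy_labels), Counter(balanced))
```

Only the training split is resampled in this notebook, so the validation and test splits keep the original class distribution.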

<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 11.Models & Predictions
</h1>

</body> 
</html>



# Basic Models

In [None]:
Basic_models = {
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

results = []

for name, model in Basic_models.items():
    model.fit(X_train, y_train)  

    y_pred = model.predict(X_val)
    
    
    results.append({
        "Model": name,
        "Accuracy": round(accuracy_score(y_val, y_pred),2),
        "Precision": round(precision_score(y_val, y_pred),2),
        "Recall": round(recall_score(y_val, y_pred),2),
        "F1-score": round(f1_score(y_val, y_pred),2)
    })
    print(f"                 {name}       ( Acc : {accuracy_score(y_val, y_pred):.2f})")
    print("=============================================================")
    print(classification_report(y_val, y_pred))
    cm = confusion_matrix(y_val , y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.show()

results_df = pd.DataFrame(results)
results_df

#  Hyperparameter Tuning 

In [None]:
param_grids = {
    "Logistic Regression": {
        "C": [0.01, 0.1, 1, 10],
        "solver": ["lbfgs"],
        "max_iter": [200 ,500 ,1000]
    },
    "Naive Bayes": {
        
    },
    "KNN": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"]
    },
    
    "Decision Tree": {
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10]
    },
    "Random Forest": {
        "n_estimators": [50, 100, 150],
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10]
    },
    
}
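
Grid search fits one model per parameter combination per fold, so the grids above translate directly into a fit count. The sketch below copies the Random Forest grid and assumes the 5-fold cross-validation used in the tuning cell; the variable names are illustrative.

```python
from itertools import product

rf_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate model.
candidates = [dict(zip(rf_grid, combo)) for combo in product(*rf_grid.values())]
total_fits = len(candidates) * cv_folds   # one fit per candidate per fold

print(len(candidates), total_fits)
```

With its default `refit=True`, `GridSearchCV` also refits the best candidate once more on the full training data, so the Random Forest search is by far the most expensive of the five.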


In [None]:
tuned_models = {}
tuned_results = []

for name, model in Basic_models.items():
    print(f"Running GridSearchCV for {name}...")
    
    if name in param_grids and param_grids[name]:  
        grid = GridSearchCV(
            estimator=model,
            param_grid=param_grids[name],
            scoring="f1",
            cv=5,
            n_jobs=-1
        )
        grid.fit(X_train, y_train)
        best_model = grid.best_estimator_
        best_params = grid.best_params_
    else:
        model.fit(X_train, y_train)
        best_model = model
        best_params = "No tuning needed"
    

    tuned_name = name + " (Tuned)"
    tuned_models[tuned_name] = {
        "model": best_model ,
        "best_params": best_params
    }
    
    y_pred = best_model.predict(X_val)
    
    tuned_results.append({
        "Model": tuned_name,
        "Accuracy": round(accuracy_score(y_val, y_pred),2),
        "Precision": round(precision_score(y_val, y_pred),2),
        "Recall": round(recall_score(y_val, y_pred),2),
        "F1-score": round(f1_score(y_val, y_pred),2),
        "Best_Params": best_params
    })
    print(f"                 {name}       ( Acc : {accuracy_score(y_val, y_pred):.2f})")
    print("=============================================================")
    print(classification_report(y_val, y_pred))
    cm = confusion_matrix(y_val , y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.show()



tuned_results_df = pd.DataFrame(tuned_results)
tuned_results_df


# Before & After Hyperparameter Tuning 

In [None]:
all_models = pd.concat([results_df , tuned_results_df] , axis =0).sort_values('Model')
all_models

#  More Advanced Algorithms 
### [ AdaBoost , ExtraTrees , LightGBM , XGBoost , CatBoost ]

In [None]:
Advanced_models = {
    'AdaBoost': AdaBoostClassifier(),
    'ExtraTrees': ExtraTreesClassifier(verbose=0),
    'LightGBM': LGBMClassifier(verbosity=-1),
    'XGBoost': XGBClassifier(),
    'CatBoost': CatBoostClassifier(verbose=0)
}

Advanced_results = []

for name, model in Advanced_models.items():
    model.fit(X_train, y_train)  

    y_pred = model.predict(X_val)
    
    
    Advanced_results.append({
        "Model": name,
        "Accuracy": round(accuracy_score(y_val, y_pred),2),
        "Precision": round(precision_score(y_val, y_pred),2),
        "Recall": round(recall_score(y_val, y_pred),2),
        "F1-score": round(f1_score(y_val, y_pred),2)
    })
    print(f"                 {name}       ( Acc : {accuracy_score(y_val, y_pred):.2f})")
    print("=============================================================")
    print(classification_report(y_val, y_pred))
    cm = confusion_matrix(y_val , y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.show()

    

Advanced_results = pd.DataFrame(Advanced_results)
Advanced_results

<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 12.Models Comparison
</h1>

</body> 
</html>



In [None]:
Final_results = pd.concat([all_models ,Advanced_results ] , axis = 0)
Final_results

In [None]:
plt.figure(figsize=(15, 7))

# Accuracy Plot
plt.subplot(1, 2, 1)
sns.barplot(
    data=Final_results.sort_values(by='Accuracy', ascending=False),
    x='Accuracy',
    y='Model',
    palette='Blues_r'
)
plt.title('Model Comparison based on Accuracy')
plt.xlabel('Accuracy')
plt.ylabel('Model')

# F1-score Plot
plt.subplot(1, 2, 2)
sns.barplot(
    data=Final_results.sort_values(by='F1-score', ascending=False),
    x='F1-score',
    y='Model',
    palette='Blues_r'
)
plt.title('Model Comparison based on F1-score')
plt.xlabel('F1-score')
plt.ylabel('Model')

plt.tight_layout()
plt.show()


<html> 
<body> 

<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 13.Best Model
</h1>

</body> 
</html>



In [None]:
# Get Best Model Name
best_model_name = (
    Final_results
    .sort_values(by='F1-score', ascending=False)
    .iloc[0]['Model']
)
best_model_name

In [None]:
def get_model(name):
    if name in Basic_models:
        return Basic_models[name]

    elif name in Advanced_models:
        return Advanced_models[name]

    elif name in tuned_models:
        return tuned_models[name]['model']

    else:
        raise ValueError(f"Model '{name}' not found in stored models")
    
    

In [None]:
# Tree-based models (including tuned ones) expose feature_importances_
best_estimator = get_model(best_model_name)
if hasattr(best_estimator, 'feature_importances_'):
    importances = best_estimator.feature_importances_


    feature_names = X_train.columns  

    feat_imp = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
    feat_imp = feat_imp.sort_values(by='Importance', ascending=False)


    # Plot
    plt.figure(figsize=(10,6))
    plt.barh(feat_imp['Feature'], feat_imp['Importance'])
    plt.gca().invert_yaxis()
    plt.title('Feature Importance')
    plt.show()

In [None]:
# Test Accuracy
Best_model = get_model(best_model_name)
y_test_pred = Best_model.predict(X_test)
print(f"Test Acc for {best_model_name} : {accuracy_score(y_test, y_test_pred):.2f}")



#### Save Our Models

In [None]:
import joblib

joblib.dump(Best_model, 'Best_Model.pkl')
joblib.dump(scaler, 'Scaler.pkl')
joblib.dump(oe, 'Ordinal_Encoder.pkl')
joblib.dump(ohe, 'One_Hot_Encoder.pkl')

## Test Function

In [None]:
import pandas as pd
import joblib
Test_df = {
    'Age': 30,
    'Gender': 'Female',
    'Tenure': 39,
    'Usage Frequency': 14,
    'Support Calls': 5,
    'Payment Delay': 18,
    'Subscription Type': 'Standard',
    'Contract Length': 'Annual',
    'Total Spend': 932,
    'Last Interaction': 17
}

def Test_func(data):
    df = pd.DataFrame([data])

    # Load the preprocessing objects and the model
    model = joblib.load('Best_Model.pkl')
    scaler = joblib.load('Scaler.pkl')
    ohe = joblib.load('One_Hot_Encoder.pkl')
    oe = joblib.load('Ordinal_Encoder.pkl')

    num_cols = ['Age', 'Tenure', 'Usage Frequency', 'Support Calls', 
                'Payment Delay', 'Total Spend', 'Last Interaction']

    ordinal_cols = [ 'Subscription Type','Contract Length']
    onehot_cols = ['Gender']

    df[num_cols] = scaler.transform(df[num_cols])

    
    ohe_df = pd.DataFrame(ohe.transform(df[onehot_cols]), 
                          columns=ohe.get_feature_names_out(onehot_cols))

    # Drop original one-hot columns and concat encoded
    df = df.drop(columns=onehot_cols)
    df = pd.concat([df, ohe_df], axis=1)

    df[ordinal_cols] = oe.transform(df[ordinal_cols])


    # Prediction
    prediction = model.predict(df)
    prediction_proba = model.predict_proba(df)

    return prediction[0], prediction_proba[0]

pred, pred_proba = Test_func(Test_df)
print("Prediction:", pred)
print("Prediction Probabilities:", pred_proba)


# Project Summary

**Goal:** Predict customer churn to help the company improve retention.

**Best Model:** Random Forest (Highest F1-score)

**Key Findings:**
- Age, Payment Delay, Total Spend, and Support Calls are important features.
- Customers with short tenure and low usage tend to churn more.
- Data preprocessing (scaling, encoding) was applied before modeling; scaling mainly benefits distance-based models such as KNN.

**Evaluation Metrics of Best Model:**
- Accuracy: 0.94
- F1-score: 0.95
- Precision: 0.90
- Recall: 1.00
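
As a consistency check on the numbers above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values:

```python
precision = 0.90   # reported above
recall = 1.00      # reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```

The result rounds to the reported F1-score of 0.95.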

**Next Steps:**
- Deploy the model for real-time prediction.
- Monitor feature importance regularly to detect changes in customer behavior.
- Consider cost-sensitive learning for business impact.
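
One simple way to start on the cost-sensitive idea is to score a confusion matrix with business costs instead of counting all errors equally. The sketch below is only an illustration: the cost figures and the confusion-matrix counts are made up, and `total_error_cost` is a hypothetical helper.

```python
def total_error_cost(tn, fp, fn, tp, cost_fp, cost_fn):
    """Total cost of the errors in a binary confusion matrix.

    Correct predictions (tn, tp) are treated as free in this sketch.
    """
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: a retention offer sent to a loyal customer (false positive)
# is cheap, a missed churner (false negative) is expensive.
COST_FP = 10
COST_FN = 100

# Two made-up confusion matrices: (tn, fp, fn, tp)
model_a = (80, 20, 0, 100)    # no missed churners, more false alarms
model_b = (95, 5, 10, 90)     # fewer false alarms, some missed churners

cost_a = total_error_cost(*model_a, COST_FP, COST_FN)
cost_b = total_error_cost(*model_b, COST_FP, COST_FN)
print(cost_a, cost_b)
```

In this toy comparison the second model has the higher accuracy (0.925 vs 0.90) but the higher cost, which is the kind of trade-off a cost-sensitive evaluation is meant to surface.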


<html> 
<body> 
<hr style="border: 2px solid #1f88c0ff; width: 90%;">
<h1 style="text-align:center; font-family: 'Arial', sans-serif; color:#ffffff;">
 Thank You
</h1>
<hr style="border: 2px solid #1f88c0ff; width: 90%;">

</body> 
</html>

