### Problem Statement
This notebook covers **Model Training** on the **Customer Churn Dataset**: the goal is to develop a **Machine Learning** model that predicts whether a customer will churn. The dataset has already been preprocessed so that it is suitable for modeling.

### Stages of Model Training
1. Splitting the dataset into training and testing sets
2. Selecting and training different machine learning models
3. Evaluating model performance using appropriate metrics
4. Hyperparameter tuning to improve model performance
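
The four stages above can be sketched end to end on a small synthetic dataset. This is a minimal illustration, not the notebook's actual pipeline: the data comes from `make_classification`, and the parameter grid, the fold count, and the choice of F1 as the tuning metric are all assumptions made for the example. In the notebook itself the train and test sets are loaded from separate pre-split files rather than produced by `train_test_split`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the churn data (assumption: binary target).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 1. Split, stratifying so both sets keep the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Train a baseline model with default hyperparameters.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# 3. Evaluate it on the held-out test set.
baseline_f1 = f1_score(y_te, baseline.predict(X_te))

# 4. Tune the regularization strength C with stratified cross-validation.
#    The grid values below are illustrative, not recommendations.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=cv,
)
search.fit(X_tr, y_tr)
tuned_f1 = f1_score(y_te, search.predict(X_te))

print(f"Baseline F1: {baseline_f1:.3f}")
print(f"Best C: {search.best_params_['C']}, tuned F1: {tuned_f1:.3f}")
```

Tuning only ever sees the training folds; the test set is used once per model for the final score, which keeps the reported metric an honest estimate.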

### Importing Required Python Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc

### Data Loading & Splitting

In [6]:
train_data = pd.read_csv('Data_Preprocessed/new_train.csv')
train_data.head()

Unnamed: 0,CustomerID,Age,Tenure,Usage Frequency,Support Calls,Payment Delay,Total Spend,Last Interaction,Churn,Subscription Type Encoded,Gender Encoded,Contract Length_Annual,Contract Length_Monthly
0,0.0,0.255319,0.644068,0.448276,0.5,0.6,0.924444,0.551724,1.0,0.0,0,1.0,0.0
1,2e-06,1.0,0.813559,0.0,1.0,0.266667,0.507778,0.172414,1.0,1.0,0,0.0,1.0
2,4e-06,0.787234,0.220339,0.103448,0.6,0.6,0.094444,0.068966,1.0,1.0,0,0.0,0.0
3,7e-06,0.851064,0.627119,0.689655,0.7,0.233333,0.328889,0.965517,1.0,0.0,1,0.0,1.0
4,9e-06,0.106383,0.525424,0.655172,0.5,0.266667,0.574444,0.655172,1.0,1.0,1,0.0,1.0


In [7]:
test_data = pd.read_csv('Data_Preprocessed/new_test.csv')
test_data.head()

Unnamed: 0,CustomerID,Age,Tenure,Usage Frequency,Support Calls,Payment Delay,Total Spend,Last Interaction,Churn,Subscription Type Encoded,Gender Encoded,Contract Length_Annual,Contract Length_Monthly
0,-2e-06,0.085106,0.40678,0.448276,0.4,0.9,0.553333,0.275862,1,1.0,0,0.0,1.0
1,0.0,0.489362,0.457627,0.931034,0.7,0.433333,0.537778,0.655172,0,0.0,0,0.0,1.0
2,2e-06,0.617021,0.440678,0.310345,0.2,0.966667,0.73,0.689655,0,2.0,1,1.0,0.0
3,4e-06,0.361702,0.135593,0.37931,0.5,0.566667,0.146667,0.586207,0,2.0,1,0.0,0.0
4,7e-06,0.744681,0.966102,0.793103,0.9,0.066667,0.481111,0.586207,0,0.0,0,1.0,0.0


In [8]:
X_train = train_data.drop(columns=['Churn'])
y_train = train_data['Churn']
X_test = test_data.drop(columns=['Churn'])
y_test = test_data['Churn']
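
Note that dropping only `Churn` leaves `Unnamed: 0` (the row index written out when the CSV was saved) and `CustomerID` in the feature matrix, as the `head()` output above shows. Whether these columns affect the models here is not established in this notebook, but identifier-like columns usually carry no generalizable signal, so it may be worth comparing results with and without them. A minimal sketch, where `split_features_target` and `ID_LIKE_COLUMNS` are hypothetical names introduced for illustration:

```python
import pandas as pd

# Assumption: these columns are identifiers, not predictive features.
ID_LIKE_COLUMNS = ["Unnamed: 0", "CustomerID"]

def split_features_target(df, target="Churn", drop_cols=ID_LIKE_COLUMNS):
    """Return (X, y), dropping the target and any identifier-like columns present."""
    present = [c for c in drop_cols if c in df.columns]
    X = df.drop(columns=[target] + present)
    y = df[target]
    return X, y

# Tiny frame mirroring a few columns of the real data.
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "CustomerID": [0.0, 2e-06, 4e-06],
    "Age": [0.255319, 1.0, 0.787234],
    "Tenure": [0.644068, 0.813559, 0.220339],
    "Churn": [1.0, 1.0, 1.0],
})

X_demo, y_demo = split_features_target(demo)
print(list(X_demo.columns))  # ['Age', 'Tenure']
```

Applied to both `train_data` and `test_data`, this would give the two feature matrices the same reduced column set. The saved index can also be kept out of the columns at load time by passing `index_col=0` to `pd.read_csv`.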

In [9]:
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (440832, 12)
Shape of y_train: (440832,)
Shape of X_test: (64374, 12)
Shape of y_test: (64374,)
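Shapes alone do not show how the classes are distributed. The LightGBM training log in this notebook reports 249999 positive and 190833 negative training rows (about 56.7% positive), so it can be useful to check the same ratio on the test set and to note the accuracy a trivial majority-class predictor would reach. The sketch below uses toy labels rather than the real `y_train` and `y_test`; `class_balance` is a hypothetical helper.

```python
import pandas as pd

def class_balance(y):
    """Return the share of each class label in y, sorted by label."""
    return pd.Series(y).value_counts(normalize=True).sort_index()

# Toy labels standing in for y_train and y_test (assumption: 0/1 encoding).
y_train_demo = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])  # 60% positive
y_test_demo = pd.Series([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])   # 40% positive

train_share = class_balance(y_train_demo)
test_share = class_balance(y_test_demo)

# Accuracy obtained by always predicting the most frequent test class.
majority_baseline = test_share.max()

# Difference in positive rate between the two splits.
positive_rate_shift = train_share[1] - test_share[1]

print(train_share)
print(test_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
print(f"Positive-rate shift (train - test): {positive_rate_shift:.2f}")
```

A large gap between the training and test positive rates, or model accuracy close to the majority-class baseline, is a signal to inspect the split and the features before comparing models.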


### Train Models

In [10]:
# Define the candidate classifiers, all with default hyperparameters
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    # probability=True adds an internal cross-validation step, which is slow on large data
    'SVM': SVC(probability=True),
    'Gradient Boosting': GradientBoostingClassifier(),
    # use_label_encoder is omitted: this XGBoost version ignores it and only warns about it
    'XGBoost': XGBClassifier(eval_metric='logloss'),
    'LightGBM': LGBMClassifier()
}

# Fit each model on the training set and score it on the held-out test set
model_results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    model_results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1 Score': f1_score(y_test, y_pred)
    }

# Collect the metrics in a DataFrame with one row per model
results_df = pd.DataFrame(model_results).T
print(results_df)

[LightGBM] [Info] Number of positive: 249999, number of negative: 190833
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006580 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 729
[LightGBM] [Info] Number of data points in the train set: 440832, number of used features: 12
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.567107 -> initscore=0.270058
[LightGBM] [Info] Start training from score 0.270058
                     Accuracy  Precision    Recall  F1 Score
Logistic Regression  0.473685   0.473685  1.000000  0.642858
Decision Tree        0.503293   0.488125  0.998885  0.655787
Random Forest        0.497794   0.485384  0.999803  0.653505
SVM                  0.485631   0.479412  1.000000  0.648112
Gradient Boosting    0.494221   0.483611  0.999672  0.651868
XGBoost              0.503216   0.488086  0.998918  0.655759
LightGBM       