# 📊 Customer Churn Analysis Project
**What is Churn?** Churn happens when customers stop using a service.
In this project, we are building an AI model to predict which customers might leave, so the company can offer them a discount or a better deal to stay!

### Step 1: Loading our Dataset
We use the `pandas` library to load our CSV file. This is like opening an Excel sheet inside Python so we can work with it.

In [2]:
import pandas as pd

df = pd.read_csv("telco_customer_churn_202602221854.csv")
print(df.columns)

Index(['gender', 'seniorcitizen', 'partner', 'dependents', 'tenure',
       'phoneservice', 'multiplelines', 'internetservice', 'onlinesecurity',
       'onlinebackup', 'deviceprotection', 'techsupport', 'streamingtv',
       'streamingmovies', 'contract', 'paperlessbilling', 'paymentmethod',
       'monthlycharges', 'totalcharges', 'churn', 'churn_new'],
      dtype='object')
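
Before modelling, it can help to check how each column was read. The sketch below uses a tiny made-up CSV (not the real file) and assumes, purely for illustration, that a numeric column such as `totalcharges` might contain a blank entry, which would make pandas load it as text. Whether the real file has this problem is not known from the notebook.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows; column names follow the notebook)
csv_text = """tenure,monthlycharges,totalcharges,churn
1,29.85,29.85,No
34,56.95,1889.5,No
0,52.55, ,No
2,53.85,108.15,Yes
"""
sample = pd.read_csv(io.StringIO(csv_text))

# A blank entry makes pandas read the whole column as text, not numbers
print(sample.dtypes)

# Convert to numeric; entries that cannot be parsed become NaN so they can be counted
sample["totalcharges"] = pd.to_numeric(sample["totalcharges"], errors="coerce")
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

If a check like this reports missing values on the real data, they would need to be filled or dropped before training.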


### Step 2: Preparing the Data
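Machine learning models only understand numbers, not words.
* **Dropping Columns:** We remove both churn columns (`churn` and its numeric copy `churn_new`) from our features so the model doesn't "cheat" by seeing the answer.
* **One-Hot Encoding:** We turn categories (like `gender` or `contract`) into 0/1 columns so the math works! With `drop_first=True`, one category per column is left out, because it is already implied when all the others are 0.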

In [3]:
print(df[['churn','churn_new']].head())

  churn  churn_new
0    No          0
1    No          0
2   Yes          1
3    No          0
4   Yes          1


In [4]:
# Target
y = df['churn_new']

# Features: drop BOTH churn columns
X = df.drop(['churn', 'churn_new'], axis=1)

# Encode
X = pd.get_dummies(X, drop_first=True)

In [5]:
print([col for col in X.columns if 'churn' in col.lower()])

[]
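
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does on a made-up three-row table. The rows are hypothetical; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical mini-table; column names mirror the notebook's dataset
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "gender": ["Female", "Male", "Female"],
    "contract": ["Month-to-month", "One year", "Two year"],
})

# Numeric columns pass through; text columns become one 0/1 column per
# category, minus the first category of each column
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped categories (`Female` and `Month-to-month` here) become the baseline: a row with all of that column's dummies at 0 belongs to the dropped category.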


### Step 3: Train-Test Split
Think of this like school:
* **Training Set (80%):** This is the "textbook" the model uses to learn.
* **Testing Set (20%):** This is the "final exam" to see if the model actually learned or just memorised the answers.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
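
The `stratify=y` argument keeps the share of churners the same in the training and testing sets, which matters when one class is much smaller than the other. A small sketch on synthetic labels (not the real data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 75 "stay" (0) and 25 "churn" (1), i.e. 25% churn
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both shares should match the overall 25% churn rate
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, a random split could by chance put too few churners in the test set, making the "final exam" less representative.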

### Step 4: Building the Logistic Regression Model
We are using **Logistic Regression**. Even though it sounds fancy, it's just a way for the computer to draw a line between "People staying" and "People leaving."

In [7]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=1000)

Parameter            Value
penalty              'l2'
dual                 False
tol                  0.0001
C                    1.0
fit_intercept        True
intercept_scaling    1
class_weight         None
random_state         None
solver               'lbfgs'
max_iter             1000
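
The warning above says the solver hit its iteration limit and suggests scaling the data. One possible way to follow that advice is to put a `StandardScaler` in front of the model with a pipeline. The sketch below runs on synthetic data, with one column deliberately inflated to mimic a large-valued feature (an assumption, not a fact about the real dataset). It is not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; one column is blown up to mimic
# a large-valued feature such as total charges (an assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then reused at predict time
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

iterations_used = scaled_model[-1].n_iter_[0]
test_accuracy = scaled_model.score(X_te, y_te)
print(iterations_used, round(test_accuracy, 3))
```

If a pipeline like this replaced `model`, the saved file would carry the scaler along with the classifier, so new data would be scaled the same way automatically.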


### Step 5: How did we do?
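We use a **Confusion Matrix** to see:
1. How many people we correctly predicted would stay.
2. How many people we correctly predicted would leave.
3. How many people we got wrong: false alarms (predicted to leave but stayed) and missed churners (predicted to stay but left).

*Accuracy matters, but it is not the whole story: most customers stay, so a model can score well while still missing many of the people who leave. That is why we also look at recall for the churn class and at ROC-AUC.*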

In [8]:
y_pred = model.predict(X_test)

In [9]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:,1]))

[[924 111]
 [166 208]]
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1035
           1       0.65      0.56      0.60       374

    accuracy                           0.80      1409
   macro avg       0.75      0.72      0.73      1409
weighted avg       0.80      0.80      0.80      1409

ROC-AUC: 0.842636079464724
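
The numbers in the classification report can be rebuilt by hand from the four confusion-matrix counts printed above. This short sketch uses only those counts and adds a baseline that predicts "stays" for everyone:

```python
# Counts copied from the confusion matrix printed above:
# rows are actual (stay, churn), columns are predicted (stay, churn)
tn, fp = 924, 111
fn, tp = 166, 208

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of those flagged as churners, how many really left
recall = tp / (tp + fn)      # of those who really left, how many we flagged

# Baseline: predict "stays" for every customer
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(f"baseline accuracy={baseline_accuracy:.2f}")
```

The baseline reaches about 0.73 accuracy without finding a single churner, which is why the model's 0.80 accuracy should be read together with its churn recall of 0.56.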


In [11]:
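from sklearn.metrics import classification_report

# Probability of churn (class 1) for each test customer
y_probs = model.predict_proba(X_test)[:, 1]

# Lower the decision threshold from the default 0.5 to 0.3:
# more customers get flagged as at-risk, trading precision for recall
y_pred_custom = (y_probs > 0.3).astype(int)

print(classification_report(y_test, y_pred_custom))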

              precision    recall  f1-score   support

           0       0.90      0.75      0.82      1035
           1       0.52      0.76      0.62       374

    accuracy                           0.75      1409
   macro avg       0.71      0.75      0.72      1409
weighted avg       0.80      0.75      0.76      1409



### Step 6: What drives Churn?
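This is the most important part for a business! We want to know which features (like having Fiber Optic internet) are most strongly linked to customers leaving. As a quick sanity check, we first train a Random Forest and compute its ROC-AUC, to see whether a more complex model separates churners better than our Logistic Regression.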

In [12]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

roc_auc_score(y_test, rf.predict_proba(X_test)[:,1])


0.8253313182980704

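In [21]:
import pandas as pd

# Pair each encoded feature with its logistic regression coefficient
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
})

# The 10 largest positive coefficients: features that push predictions toward churn
feature_importance.sort_values(by='Coefficient', ascending=False).head(10)

                           Feature  Coefficient
10     internetservice_Fiber optic      0.74309
28  paymentmethod_Electronic check     0.399692
26            paperlessbilling_Yes     0.374593
9                multiplelines_Yes     0.276259
21                 streamingtv_Yes     0.199063
23             streamingmovies_Yes     0.198425
8   multiplelines_No phone service       0.1599
0                    seniorcitizen     0.148908
29      paymentmethod_Mailed check     0.072625
5                      partner_Yes     0.020999

Fiber optic internet, electronic-check payment and paperless billing have the largest positive coefficients, so they are the features most strongly associated with churn in this model. Because the features were not scaled before training, treat the ranking as a rough guide rather than an exact ordering.

### Step 7: Exporting the Model
We save our trained model into a file called `churn_model.pkl`. This way, we can use it later in a real-world app or a website without having to train it all over again!

In [14]:
import joblib
joblib.dump(model, "churn_model.pkl")

['churn_model.pkl']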
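
For completeness, here is a minimal sketch of the other half of the round trip: loading a saved model with `joblib.load` and checking that it behaves like the original. It trains a small synthetic model and writes to a temporary folder, so it does not touch the real `churn_model.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for the notebook's trained `model`
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should give the same probabilities as the original
same_output = np.allclose(
    demo_model.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print(same_output)
```

When the real model is loaded in an app, new customer data has to be encoded into the same columns, in the same order, as `X.columns`. It is also safest to load the file with the same scikit-learn version that saved it.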

### Step 8: Calculating Probability
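Instead of just saying "Yes" or "No," we want to know how *sure* the model is.
* **`predict_proba`** gives us a probability between 0 and 1 for each class. In the output, the first column is the chance the customer stays (class 0) and the second is the chance they churn (class 1). For example, a churn score of 0.85 means the model estimates an 85% chance the customer will churn!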

In [15]:
model.predict_proba(X)

array([[0.38187513, 0.61812487],
       [0.95413699, 0.04586301],
       [0.70432419, 0.29567581],
       ...,
       [0.60839319, 0.39160681],
       [0.28617197, 0.71382803],
       [0.95222302, 0.04777698]])
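
The column order of `predict_proba` follows the model's `classes_` attribute, which is why `[:, 1]` picks out the churn probability. A tiny synthetic example (one made-up feature, six rows) shows this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example: one feature, labels 0 (stay) and 1 (churn)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

proba = demo_model.predict_proba(X_demo)

# Column order follows classes_, so column 1 belongs to class 1 (churn)
print(demo_model.classes_)
print(proba.sum(axis=1))   # each row sums to 1
churn_probability = proba[:, 1]
```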

### Step 9: Creating our "Smart" Dataset
We are now adding two new columns to our original table:
1.  **Prediction:** The final 0 or 1.
2.  **Churn Probability:** The actual risk score (between 0 and 1).

This is the "Gold Mine" for business teams! They can use these scores to call the highest-risk customers first. 📞💰

In [16]:
df['prediction'] = model.predict(X)
df['churn_probability'] = model.predict_proba(X)[:,1]

We add prediction and churn probability columns to the dataset.

* Prediction → Final classification (0 or 1)
* Churn Probability → Risk score between 0 and 1

This allows business teams to identify high-risk customers.
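
As a rough sketch of how a team might use the scores, the example below sorts customers by risk and adds a simple risk band. The `customer_id` column, the sample rows, and the 0.3 / 0.6 cut-offs are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical scored customers; `customer_id` is an illustrative column
scored = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "churn_probability": [0.20, 0.90, 0.55, 0.71],
})

# Highest-risk customers first
call_list = scored.sort_values("churn_probability", ascending=False)

# Illustrative risk bands; the cut-offs 0.3 and 0.6 are assumptions
call_list["risk_band"] = pd.cut(
    call_list["churn_probability"],
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
print(call_list.head(2))
```

A band column like this can also be handy as a filter or colour in a dashboard.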

### Step 10: Exporting for the Dashboard
Success! 🎉 We are exporting our final results to a new file called `churn_with_predictions.csv`.
We will now take this file into **Power BI** to build a visual dashboard that shows management exactly where the churn risk is highest.

In [19]:
df.to_csv("churn_with_predictions.csv", index=False)

We export the dataset with model predictions.

This file will be used in Power BI to create dashboards for business decision-making.