
### scikit-learn Implementation of Logistic Regression


Churn Prediction on a Telecommunications Dataset

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Download Dataset
!gdown 1uUt7uL-VuF_5cpodYRiriEwhsldeEp3m

Downloading...
From: https://drive.google.com/uc?id=1uUt7uL-VuF_5cpodYRiriEwhsldeEp3m
To: /content/churn_logistic.csv


In [3]:
churn = pd.read_csv("churn_logistic.csv")
churn.head(5)

,Account Length,VMail Message,Day Mins,Eve Mins,Night Mins,Intl Mins,CustServ Calls,Intl Plan,VMail Plan,Day Calls,...,Eve Calls,Eve Charge,Night Calls,Night Charge,Intl Calls,Intl Charge,State,Area Code,Phone,Churn
0,128,25,265.1,197.4,244.7,10.0,1,0,1,110,...,99,16.78,91,11.01,3,2.7,KS,415,382-4657,0
1,107,26,161.6,195.5,254.4,13.7,1,0,1,123,...,103,16.62,103,11.45,3,3.7,OH,415,371-7191,0
2,137,0,243.4,121.2,162.6,12.2,0,0,0,114,...,110,10.3,104,7.32,5,3.29,NJ,415,358-1921,0
3,84,0,299.4,61.9,196.9,6.6,2,1,0,71,...,88,5.26,89,8.86,7,1.78,OH,408,375-9999,0
4,75,0,166.7,148.3,186.9,10.1,3,1,0,113,...,122,12.61,121,8.41,3,2.73,OK,415,330-6626,0


In [6]:
churn.shape

(5700, 21)

In [4]:
churn.columns

Index(['Account Length', 'VMail Message', 'Day Mins', 'Eve Mins', 'Night Mins',
       'Intl Mins', 'CustServ Calls', 'Intl Plan', 'VMail Plan', 'Day Calls',
       'Day Charge', 'Eve Calls', 'Eve Charge', 'Night Calls', 'Night Charge',
       'Intl Calls', 'Intl Charge', 'State', 'Area Code', 'Phone', 'Churn'],
      dtype='object')

In [7]:
# Based on simple EDA, I have chosen the following features to build the model

cols = ['Day Mins', 'Eve Mins', 'Night Mins', 'CustServ Calls', 'Account Length']
y = churn['Churn']
X = churn[cols]

In [10]:
X.head(5)

,Day Mins,Eve Mins,Night Mins,CustServ Calls,Account Length
0,265.1,197.4,244.7,1,128
1,161.6,195.5,254.4,1,107
2,243.4,121.2,162.6,0,137
3,299.4,61.9,196.9,2,84
4,166.7,148.3,186.9,3,75


In [8]:
X.shape

(5700, 5)

Let us split the data into training, validation, and test sets.

In [9]:
from sklearn.model_selection import train_test_split

# Step 1: Split into 80% training+validation and 20% test
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=1 # random_state=1 ensures the split is reproducible
)

# Step 2: Split the 80% into:
# 60% training and 20% validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.25,  # 25% of 80% = 20% of total
    random_state=1 # random_state=1 ensures the split is reproducible
)

# Check training shape
print(X_train.shape)


(3420, 5)


  1. The first train_test_split call holds out 20% of the data as the test set (X_test, y_test) and keeps the remaining 80% for training and validation (X_train_val, y_train_val).
  2. The second call splits that 80% again: 25% of it (20% of the original dataset) becomes the validation set (X_val, y_val), and the other 75% (60% of the original dataset, 3420 rows) becomes the final training set (X_train, y_train).
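
The 60/20/20 arithmetic above can be checked with a short sketch. It uses a synthetic array with the same number of rows as the churn data rather than the real file, and the names `X_demo`, `y_demo`, `split_sizes`, and `split_fractions` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the churn data (5700 rows, 5 features).
n_rows = 5700
X_demo = np.arange(n_rows * 5).reshape(n_rows, 5)
y_demo = np.zeros(n_rows, dtype=int)

# Same two-step split as in the notebook: 20% test, then 25% of the rest as validation.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=1
)

# Row counts and their share of the full dataset.
split_sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
split_fractions = {name: size / n_rows for name, size in split_sizes.items()}

print(split_sizes)
print(split_fractions)
```

The training count agrees with the `(3420, 5)` shape printed by the notebook.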

In [11]:
# Scale the data before fitting the model
from sklearn.preprocessing import StandardScaler

# Step 1: Create the scaler object
scaler = StandardScaler()

# Step 2: Fit the scaler ONLY on the training data
# (learns the mean and standard deviation from X_train)
scaler.fit(X_train)

# Step 3: Transform all datasets using the same scaler
X_train = scaler.transform(X_train)
X_val   = scaler.transform(X_val)
X_test  = scaler.transform(X_test)

In [15]:
X_train[:5, :5]

array([[-1.8525591 , -0.54121117,  1.87596728,  0.0724823 ,  2.13378709],
       [ 0.93155078,  1.05292599,  0.39854651, -0.54879454, -0.81991418],
       [ 0.46912157,  0.11462924,  1.13324217,  0.0724823 , -2.27130187],
       [-0.95630455,  1.07033768,  0.83013002,  0.0724823 , -0.25972945],
       [-0.94200262,  0.11656387, -0.69346015,  1.31503598,  1.6499912 ]])
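
To make the scaling step concrete, here is a minimal NumPy-only sketch of what standardization computes: subtract the training mean and divide by the training standard deviation (StandardScaler uses the population standard deviation, `ddof=0`). The toy arrays `X_train_demo` and `X_val_demo` are hypothetical and are not taken from the churn data.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 validation rows, 2 features each.
X_train_demo = np.array([[100.0, 1.0],
                         [200.0, 3.0],
                         [300.0, 5.0],
                         [400.0, 7.0]])
X_val_demo = np.array([[250.0, 4.0],
                       [500.0, 0.0]])

# Statistics come from the training rows only, mirroring scaler.fit(X_train).
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to every split, mirroring scaler.transform(...).
X_train_scaled = (X_train_demo - mean) / std
X_val_scaled = (X_val_demo - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 for each feature
print(X_train_scaled.std(axis=0))   # close to 1 for each feature
print(X_val_scaled)
```

Only the training split is guaranteed to end up with zero mean and unit variance. The validation and test splits are shifted and scaled by the training statistics, which keeps information about those splits out of the preprocessing step.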

We now implement Logistic Regression.

In [16]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

This code imports the LogisticRegression class (a blueprint for creating logistic regression models), creates an instance of it with model = LogisticRegression() *(an actual model object in memory)*, and then trains that model with model.fit(X_train, y_train).

During fitting, the model learns the coefficients that best map the input features (X_train) to the target labels (y_train). After this step, the learned weights are stored on the model object, which is ready to make predictions on new data.
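
For reference, the standard binary logistic regression model can be written as below. The symbols are notation introduced here, not taken from the notebook: $x$ is a scaled feature vector, $w$ is the weight vector (model.coef_), and $b$ is the bias (model.intercept_).

$$
P(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Fitting chooses $w$ and $b$ so that these predicted probabilities match the training labels as closely as possible, subject to the regularization that scikit-learn applies by default.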

In [18]:
# Let us now look at the weights
print("Weights (coefficients):", model.coef_)
print("Bias (intercept):", model.intercept_)

Weights (coefficients): [[0.68445262 0.29104301 0.1363756  0.79630985 0.06125924]]
Bias (intercept): [-0.01220319]
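
As a rough illustration of how these numbers turn into a prediction, the sketch below applies the printed weights and bias by hand to the first scaled training row shown earlier. It assumes Churn is coded 0/1 with 1 meaning churn; the helper `sigmoid` and the variables `z`, `p_churn`, and `predicted_label` are names chosen for this example.

```python
import numpy as np

# Weights and bias copied from the printed output above.
w = np.array([0.68445262, 0.29104301, 0.1363756, 0.79630985, 0.06125924])
b = -0.01220319

# First row of the scaled training data, copied from the X_train[:5, :5] output.
x = np.array([-1.8525591, -0.54121117, 1.87596728, 0.0724823, 2.13378709])

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Linear score, then the probability of the positive class (assumed: churn = 1).
z = float(w @ x + b)
p_churn = float(sigmoid(z))

# A 0.5 probability threshold corresponds to a score of 0.
predicted_label = int(p_churn >= 0.5)

print(f"score z = {z:.4f}")
print(f"P(churn) = {p_churn:.4f}")
print("predicted label:", predicted_label)
```

Because the features are standardized, the coefficient sizes are roughly comparable: under this model, CustServ Calls and Day Mins carry the largest weights, so they move the score the most per standard deviation of change.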
