## Logistic Regression Implementation

Logistic regression is a statistical method for binary classification: predicting which of two categories an observation belongs to (spam/not spam, pass/fail, and so on).
Unlike linear regression, which predicts continuous values, logistic regression predicts a probability between 0 and 1. It passes a linear combination of the features through the logistic (sigmoid) function, which maps any real number into the interval (0, 1).
Instead of fitting a straight line to the data points, logistic regression fits an S-shaped curve that represents the probability of the positive class. Mathematically, the model is linear in the log-odds (the logit of that probability), and its parameters are estimated by maximum likelihood.
Training finds the coefficients that maximize the likelihood of the observed training labels. Those coefficients are then combined with the sigmoid function to predict probabilities for new data.
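
To make the sigmoid/log-odds relationship concrete, here is a minimal NumPy sketch. The intercept, coefficients, and input values are made up purely for illustration; they are not fitted to any data in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores (log-odds) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability -> log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical parameters, chosen only for illustration
intercept = -1.0
coef = np.array([2.0, -0.5])

x_new = np.array([1.0, 2.0])
z = intercept + coef @ x_new   # linear score, i.e. the log-odds
p = sigmoid(z)                 # probability of the positive class

print(f"log-odds = {z:.2f}, probability = {p:.2f}")
```

A log-odds of 0 corresponds to a probability of 0.5, which is why the default decision boundary of a logistic model sits where the linear score crosses zero.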


### Key Differences from Linear Regression
 - Linear regression: predicts continuous values and is typically fit by minimizing mean squared error (MSE)
 - Logistic regression: predicts class probabilities and is fit by minimizing log loss (the negative log-likelihood)
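
As a rough illustration of the two loss functions, the sketch below evaluates both on a handful of made-up labels and predicted probabilities (`y_true_demo` and `p_demo` are hypothetical values, not outputs of any model here). It also checks a hand-written log loss against scikit-learn's `log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error

# Hypothetical labels and predicted probabilities of the positive class
y_true_demo = np.array([1, 0, 1, 1])
p_demo = np.array([0.9, 0.2, 0.6, 0.4])

# Log loss: average negative log-likelihood of the true labels
manual_log_loss = -np.mean(
    y_true_demo * np.log(p_demo) + (1 - y_true_demo) * np.log(1 - p_demo)
)

print("manual log loss :", manual_log_loss)
print("sklearn log loss:", log_loss(y_true_demo, p_demo))
print("MSE on the same probabilities:", mean_squared_error(y_true_demo, p_demo))
```

Log loss grows without bound as a confident prediction turns out wrong, whereas squared error on a probability can never exceed 1.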

### Assumptions of Logistic Regression
 - Linear relationship between the independent variables and the log-odds of the outcome
 - Independence of observations
 - Little or no multicollinearity among predictors
 - A sample large enough for stable maximum likelihood estimates

"""
# PRACTICAL TIPS AND BEST PRACTICES:

1. FEATURE SCALING:
   - Logistic regression benefits from feature scaling (StandardScaler)
   - Helps with convergence and coefficient interpretation

2. REGULARIZATION:
   - L1 (Lasso): Feature selection, sparse solutions
   - L2 (Ridge): Prevents overfitting, handles multicollinearity
   - LogisticRegression uses penalty='l2' by default; penalty='l1' requires solver='liblinear' or solver='saga'

3. HANDLING IMBALANCED DATA:
   - Use class_weight='balanced' parameter
   - Consider SMOTE for oversampling minority class
   - Adjust decision threshold based on business needs

4. MODEL VALIDATION:
   - Use cross-validation for robust performance estimation
   - Check for overfitting with validation curves
   - Monitor both training and validation performance

5. FEATURE ENGINEERING:
   - Create polynomial features for non-linear relationships
   - Handle categorical variables with encoding
   - Consider feature interactions

6. DIAGNOSTIC CHECKS:
   - Plot deviance or Pearson residuals to check assumptions (raw residuals are hard to read for a binary outcome)
   - Check for influential outliers
   - Validate linear relationship with log-odds

7. INTERPRETATION:
   - Each coefficient is the change in log-odds per one-unit increase in that feature, holding the others fixed
   - Use odds ratios (e^coefficient) for easier interpretation
   - Consider confidence intervals for coefficients

8. WHEN TO USE:
   - Binary classification problems
   - Need probability estimates, not just classifications
   - Interpretability is important
   - Baseline model before trying complex algorithms
"""
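
Tips 1 and 7 above can be sketched together: scale the features inside a pipeline, fit the model, then exponentiate the coefficients to get odds ratios. This is a minimal sketch on a small synthetic dataset; `X_demo`, `y_demo`, and `pipe` are illustrative names. Because the features are standardized, each odds ratio refers to a one-standard-deviation increase in the feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Scaling lives inside the pipeline so it is learned from training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
coefs = pipe.named_steps["logisticregression"].coef_.ravel()
odds_ratios = np.exp(coefs)

for i, (c, o) in enumerate(zip(coefs, odds_ratios)):
    print(f"feature {i}: coef={c:+.3f}, odds ratio={o:.3f}")
```

An odds ratio above 1 means the feature raises the odds of the positive class, and one below 1 means it lowers them.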

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
from sklearn.datasets import make_classification

In [3]:
## create the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=15)

In [7]:
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-0.377957,1.043895,1.043494,-0.101838,-1.617442,0.402713,0.913601,-0.067192,0.175471,-1.049646
1,-0.325259,1.276263,-0.686123,-2.463205,-0.489426,-0.240715,-1.469496,1.006633,-0.833692,0.957744
2,0.739019,-0.600903,-0.177294,1.335714,-0.817332,-0.790047,1.457365,-0.218981,0.878643,-1.257740
3,0.474312,-1.103002,1.189936,-0.800186,0.912377,-0.406451,-1.130950,1.985111,1.379029,1.041768
4,0.927365,1.114796,0.080284,1.261064,0.761179,0.921563,0.440832,0.184645,-1.567739,-0.142107
...,...,...,...,...,...,...,...,...,...,...
995,1.538272,0.171629,0.075371,-0.957658,-1.066219,1.158096,-0.036964,0.123689,0.927871,-0.225003
996,-0.060266,0.095018,-0.271685,1.830560,0.219445,-0.341269,1.180088,-0.216876,-1.752938,-0.810152
997,0.675563,-0.538420,-1.299500,0.747835,1.733898,-0.268044,-0.520953,2.043336,0.947388,0.790354
998,2.629710,-2.452899,-1.359785,1.592065,0.854157,1.618828,0.621701,0.378898,-1.971894,-0.252250


In [4]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=42)

In [5]:
from sklearn.linear_model import LogisticRegression
logistic=LogisticRegression()
logistic.fit(X_train,y_train)
y_pred=logistic.predict(X_test)
print(y_pred)

[0 0 0 1 1 0 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 1 1 0 1
 1 0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0
 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 0 1
 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 1
 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 0 1 0
 0 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0
 0 1 0 0 0 0 0 1 0 1 0 0 1 0 1 1 0 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1
 0 0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
 0 1 0 0]


In [None]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [None]:
## scikit-learn metrics expect (y_true, y_pred) in that order
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
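
To show how the numbers in the report relate to the confusion matrix, here is a small sketch on hand-made labels (`y_true_demo` and `y_pred_demo` are hypothetical). scikit-learn's metric functions take the true labels first, and swapping the two arguments transposes the confusion matrix, which exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```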

## Hyperparameter Tuning And Cross Validation

## GridSearchCV

In [None]:
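model=LogisticRegression()
c_values=[100,10,1.0,0.1,0.01]

In [None]:
## not every solver supports every penalty, so use one sub-grid per penalty
params=[
    dict(penalty=['l2'],C=c_values,solver=['newton-cg','lbfgs','liblinear','sag','saga']),
    dict(penalty=['l1'],C=c_values,solver=['liblinear','saga']),
    ## elasticnet needs saga plus an l1_ratio; 0.5 is an assumed, illustrative value
    dict(penalty=['elasticnet'],C=c_values,solver=['saga'],l1_ratio=[0.5]),
]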

In [None]:
from sklearn.model_selection import StratifiedKFold
cv=StratifiedKFold()

In [None]:
## GridSearchCV
from sklearn.model_selection import GridSearchCV

grid=GridSearchCV(estimator=model,param_grid=params,scoring='accuracy',cv=cv,n_jobs=-1)

In [None]:
grid

In [None]:
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_score_

In [None]:
y_pred=grid.predict(X_test)

In [None]:
## scikit-learn metrics expect (y_true, y_pred) in that order
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
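
Beyond `best_params_` and `best_score_`, a fitted grid search keeps the score of every candidate in `cv_results_`. The sketch below is self-contained and uses a deliberately small grid over `C` only, on its own synthetic data. The names `X_demo`, `y_demo`, and `demo_grid`, and the `max_iter=1000` setting, are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=15)

demo_grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5),
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked setting comes first
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `std_test_score` across candidates helps show whether the winner is clearly better or within fold-to-fold noise of its neighbours.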

## RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
model=LogisticRegression()
randomcv=RandomizedSearchCV(estimator=model,param_distributions=params,cv=5,scoring='accuracy')

In [None]:
randomcv.fit(X_train,y_train)

In [None]:
randomcv.best_score_

In [None]:
randomcv.best_params_

In [None]:
y_pred=randomcv.predict(X_test)

In [None]:
## scikit-learn metrics expect (y_true, y_pred) in that order
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
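
A randomized search can also sample from continuous distributions rather than fixed lists. The sketch below draws `C` from a log-uniform distribution with `scipy.stats.loguniform`. The range, `n_iter=10`, `max_iter=1000`, and the `*_demo` names are assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=15)

demo_search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    # Sample C on a log scale between 0.01 and 100
    param_distributions={"C": loguniform(1e-2, 1e2)},
    n_iter=10,          # number of sampled candidates
    cv=5,
    scoring="accuracy",
    random_state=0,     # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_, demo_search.best_score_)
```

Sampling on a log scale spreads the candidates evenly across orders of magnitude, which suits a regularization strength like `C`.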

## Logistic Regression For Multiclass Classification Problem

In [None]:
## create the dataset
X, y = make_classification(n_samples=1000, n_features=10,n_informative=3, n_classes=3, random_state=15)

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=42)

In [None]:
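from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
## multi_class='ovr' is deprecated in recent scikit-learn; this wrapper applies the one-vs-rest scheme explicitly
logistic=OneVsRestClassifier(LogisticRegression())
logistic.fit(X_train,y_train)
y_pred=logistic.predict(X_test)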

In [None]:
y_pred

In [None]:
## scikit-learn metrics expect (y_true, y_pred) in that order
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
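
For a multiclass model, `predict_proba` returns one column per class. The sketch below fits a plain `LogisticRegression` on its own small three-class dataset and shows that the predicted label is the class with the highest probability. With the default `lbfgs` solver this is, to my understanding, a multinomial (softmax) fit rather than one-vs-rest. The `*_demo` names and `max_iter=1000` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_classes=3, random_state=15
)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo[:5])          # shape (5, 3): one column per class
pred = clf.classes_[np.argmax(proba, axis=1)]  # most probable class per row

print(np.round(proba, 3))
print("row sums:", proba.sum(axis=1))
print("argmax labels:", pred)
```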

## Logistic Regression for Imbalanced Dataset

In [None]:
# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification


In [None]:
## imbalanced dataset
X,y=make_classification(n_samples=10000,n_features=2,n_clusters_per_class=1,
                   n_redundant=0,weights=[0.99],random_state=10)

In [None]:
X

In [None]:
Counter(y)

In [None]:
import seaborn as sns

In [None]:
import pandas as pd
## recent seaborn versions require x and y as keyword arguments
sns.scatterplot(x=X[:,0],y=X[:,1],hue=y)

In [None]:
## Split the dataset into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

In [None]:
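## w0 and w1 are the weights for class 0 and class 1 (avoid reusing the name y, which holds the labels)
class_weight=[{0:w0,1:w1} for w0 in [1,10,50,100] for w1 in [1,10,50,100]]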

In [None]:
class_weight

In [None]:
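## Hyperparameter tuning
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
c_values=[100,10,1.0,0.1,0.01]
class_weight=[{0:w0,1:w1} for w0 in [1,10,50,100] for w1 in [1,10,50,100]]

In [None]:
## not every solver supports every penalty, so use one sub-grid per penalty
params=[
    dict(penalty=['l2'],C=c_values,solver=['newton-cg','lbfgs','liblinear','sag','saga'],class_weight=class_weight),
    dict(penalty=['l1'],C=c_values,solver=['liblinear','saga'],class_weight=class_weight),
    ## elasticnet needs saga plus an l1_ratio; 0.5 is an assumed, illustrative value
    dict(penalty=['elasticnet'],C=c_values,solver=['saga'],l1_ratio=[0.5],class_weight=class_weight),
]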

In [None]:
params

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
cv=StratifiedKFold()
grid=GridSearchCV(estimator=model,param_grid=params,scoring='accuracy',cv=cv)

In [None]:
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
y_pred=grid.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [None]:
## scikit-learn metrics expect (y_true, y_pred) in that order
score=accuracy_score(y_test,y_pred)
print(score)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
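
On data this imbalanced, accuracy alone can look high even if the minority class is never predicted. As a hedged sketch, the block below regenerates the same kind of dataset and compares an unweighted model with `class_weight='balanced'`, reporting minority-class recall alongside accuracy. The `*_demo`, `plain`, and `balanced` names, and the stratified split, are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=10000, n_features=2, n_clusters_per_class=1,
    n_redundant=0, weights=[0.99], random_state=10,
)
# stratify keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

plain = LogisticRegression().fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

for name, clf in [("plain", plain), ("balanced", balanced)]:
    y_hat = clf.predict(X_te)
    acc = accuracy_score(y_te, y_hat)
    rec = recall_score(y_te, y_hat)   # recall of the minority (positive) class
    print(f"{name:9s} accuracy={acc:.3f} minority recall={rec:.3f}")
```

Reweighting typically trades some accuracy and precision for better minority-class recall, so a scorer such as `'recall'` or `'f1'` may be a more informative tuning target than `'accuracy'` here.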

## Logistic Regression With ROC curve And ROC AUC score 

In [None]:
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

In [None]:
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)

In [None]:
## Split the dataset into train and test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

In [None]:
# Baseline: a dummy model that outputs probability 0 (class 0) for every test sample
dummy_model_prob = [0 for _ in range(len(y_test))]
dummy_model_prob

In [None]:
## Let's create a basic logistic model
model=LogisticRegression()
model.fit(X_train,y_train)

In [None]:
## Prediction based on probability
model_prob=model.predict_proba(X_test)

In [None]:
model_prob

In [None]:
## Let's keep only the probability of the positive class
model_prob=model_prob[:,1]

In [None]:
## Let's calculate the AUC scores
dummy_model_auc=roc_auc_score(y_test,dummy_model_prob)
model_auc=roc_auc_score(y_test,model_prob)
print(dummy_model_auc)
print(model_auc)
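
One way to read an AUC value is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this pairwise interpretation against `roc_auc_score` on a few made-up scores (`y_true_demo` and `scores_demo` are hypothetical). It also shows why a constant-output dummy model scores 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive against every negative
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_manual = pairwise_auc(y_true_demo, scores_demo)
auc_constant = pairwise_auc(y_true_demo, np.zeros(len(y_true_demo)))

print("pairwise AUC   :", auc_manual)
print("roc_auc_score  :", roc_auc_score(y_true_demo, scores_demo))
print("constant scores:", auc_constant)
```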

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In [None]:
## calculate ROC Curves
dummy_fpr, dummy_tpr, _ = roc_curve(y_test, dummy_model_prob)
model_fpr, model_tpr, thresholds = roc_curve(y_test, model_prob)

In [None]:
thresholds

In [None]:
model_fpr,model_tpr
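
The arrays returned by `roc_curve` can be used to pick a decision threshold other than the default 0.5. One common criterion, used here purely as an example and not prescribed by this notebook, is Youden's J statistic (TPR minus FPR). The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

fpr, tpr, thr = roc_curve(y_true_demo, scores_demo)

# Youden's J: the threshold where TPR exceeds FPR by the largest margin
j = tpr - fpr
best = np.argmax(j)
best_threshold = thr[best]

# Apply the chosen threshold instead of the default 0.5
y_pred_custom = (scores_demo >= best_threshold).astype(int)

print("best threshold:", best_threshold, "J =", j[best])
print("predictions   :", y_pred_custom)
```

In practice the criterion should reflect the relative cost of false positives and false negatives, and the threshold should be chosen on validation data rather than the test set.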

In [None]:
import seaborn as sns


In [None]:
# plot the roc curve for the model
pyplot.plot(dummy_fpr, dummy_tpr, linestyle='--', label='Dummy Model')
pyplot.plot(model_fpr, model_tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

In [None]:
# plot the roc curve for the model, annotating each point with its threshold
import numpy as np
# create the axes first so the curves and the annotations share the same axes
fig, ax = pyplot.subplots(figsize=(20,50))
ax.plot(dummy_fpr, dummy_tpr, linestyle='--', label='Dummy Model')
ax.plot(model_fpr, model_tpr, marker='.', label='Logistic')
for fpr, tpr, thr in zip(model_fpr, model_tpr, thresholds):
    ax.annotate('%s' % np.round(thr,2), xy=(fpr,tpr))
# axis labels
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
# show the legend
ax.legend()
# show the plot
pyplot.show()