# Logistic Regression

## 1. Purpose of Logistic Regression: Best Classification

The goal of `Logistic Regression` is to find the best classifier (decision boundary) that effectively separates two classes by estimating the probability of a sample belonging to class 1.

- Unlike Linear Regression, which predicts continuous values, Logistic Regression transforms inputs into probabilities using the **sigmoid function** and classifies them into two categories (e.g., 0 or 1).
- It works well for binary classification problems, such as spam detection, fraud detection, and disease prediction.

<img src="images/binary.ppm" width='350px'>

### Mathematical Representation
For a given input feature vector $x$, the logistic function is:

$$
h(x) = \frac{1}{1 + e^{-(w^T x + b)}}
$$

where:
- $h(x)$ is the probability that the point belongs to **class 1**.
- $w$ is the **weight vector** (parameters to be learned).
- $x$ is the **feature vector**.
- $b$ is the **bias term**.

Training searches for the $w$ and $b$ whose decision boundary best separates the two classes.
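The formula above can be sketched in a few lines of NumPy. The parameter values for `w` and `b` below are made up for illustration and are not learned from any dataset; the helper names `sigmoid` and `predict_proba` are likewise just illustrative.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Probability of class 1 for feature vector x: h(x) = sigmoid(w^T x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative (made-up) parameters, not learned from data.
w = np.array([2.0, -1.0])
b = 0.5

# A point exactly on the boundary: w^T x + b = 0, so the probability is 0.5.
p_boundary = predict_proba(np.array([0.0, 0.5]), w, b)

# A point well inside the positive region: w^T x + b = 5.5.
p_positive = predict_proba(np.array([3.0, 1.0]), w, b)

print(p_boundary, p_positive)
```

A point on the boundary gets probability 0.5, and the probability approaches 1 as $w^T x + b$ grows.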

---

## 3. Assumptions in Logistic Regression
1. **Binary Output**: The target variable should have only two categories (e.g., spam or not spam).
2. **Linear Log-Odds Relationship**: The independent variables should have a linear relationship with the **log-odds** of the dependent variable.
3. **No Multicollinearity**: Independent variables should not be highly correlated.
4. **Independence of Observations**: Each observation should be independent.
5. **Large Dataset Size**: Logistic Regression performs best with large, well-balanced datasets.

---


### 2. Equations of Decision Boundary: Line and Plane

#### Equation of a Decision Boundary in 2D (Line)
In a two-dimensional space, the decision boundary is a straight line given by:

$$
w^T x + b = 0
$$

Expanding it in terms of coordinates:

$$
w_1 X_1 + w_2 X_2 + b = 0
$$

where:
- $w_1, w_2$ are the weights (coefficients) for features $X_1, X_2$.
- $b$ is the bias term.

#### Equation of a Decision Boundary in Higher Dimensions (Plane/Hyperplane)
For an $n$-dimensional space, the decision boundary is a **hyperplane**:

$$
w^T x + b = w_1 X_1 + w_2 X_2 + \dots + w_n X_n + b = 0
$$

- If $w^T x + b > 0$ → Class 1
- If $w^T x + b < 0$ → Class 0
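The decision rule above amounts to checking the sign of $w^T x + b$. A minimal sketch, assuming made-up values for `w` and `b` and a hypothetical helper `predict_class`; points lying exactly on the boundary are assigned to class 0 here, which is an arbitrary choice:

```python
import numpy as np

def predict_class(X, w, b):
    """Return 1 where w^T x + b > 0 and 0 otherwise, one label per row of X."""
    scores = X @ w + b
    return (scores > 0).astype(int)

# Illustrative boundary: X_1 + X_2 - 3 = 0.
w = np.array([1.0, 1.0])
b = -3.0

X = np.array([
    [1.0, 1.0],   # score -1 -> class 0
    [2.0, 2.0],   # score +1 -> class 1
    [0.0, 5.0],   # score +2 -> class 1
])

labels = predict_class(X, w, b)
print(labels)
```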

---



## Working of Logistic Regression: Distance Calculation

To decide which side of the decision boundary a point falls on, and how far from it, we calculate the point's distance from the boundary.

### Distance from a Point to a Line (2D Space)
For a point $(X_1, X_2)$, the perpendicular distance **$d$** from the decision boundary is given by:

$$
d = \frac{|w^T x + b|}{\| w \|}
$$

### Distance from a Point to a Hyperplane (Higher Dimension)
For a point $(X_1, X_2, \dots, X_n)$ in an $n$-dimensional space, the distance is:

$$
d = \frac{|w^T x + b|}{\| w \|}
$$

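The absolute value gives only how far the point is from the boundary. The sign of $w^T x + b$ tells which side the point lies on, so the signed distance $(w^T x + b) / \| w \|$, compared with the true label, determines whether the point is classified correctly.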
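The distance formula can be checked numerically. The sketch below computes the signed distance (the formula without the absolute value) and takes its magnitude afterwards; `w`, `b`, and the sample points are made-up values chosen so that $\| w \| = 5$.

```python
import numpy as np

def signed_distance(X, w, b):
    """Signed perpendicular distance of each row of X from w^T x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

# Illustrative boundary 3*X_1 + 4*X_2 - 5 = 0, with ||w|| = 5.
w = np.array([3.0, 4.0])
b = -5.0

X = np.array([
    [3.0, 4.0],   # score 20 -> distance +4 (positive side)
    [0.0, 0.0],   # score -5 -> distance -1 (negative side)
    [1.0, 0.5],   # score  0 -> distance  0 (on the boundary)
])

d_signed = signed_distance(X, w, b)
d_abs = np.abs(d_signed)   # the unsigned distance d from the formula
print(d_signed, d_abs)
```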

---


### Condition for Finding the Best Classifier Line or Plane

The best classifier is chosen by maximizing the sum of signed scores over all training points, with labels encoded as $y_i \in \{+1, -1\}$, given by the **argmax equation**:

$$
\underset{w,b}{\arg\max} \sum_{i=1}^{m} y_i (w^T x_i + b)
$$

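This means:
- We compute $y_i (w^T x_i + b)$ for each training sample $x_i$; the product is positive when the point lies on the correct side of the boundary and negative otherwise.
- The classifier that maximizes this summation is the best decision boundary.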
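As a rough illustration, the sum can be evaluated for two candidate boundaries on a tiny made-up dataset with labels in $\{+1, -1\}$. One assumption is added here that the equation above does not state: each candidate is rescaled to unit $\| w \|$, because otherwise the raw sum can be made arbitrarily large just by multiplying $w$ and $b$ by a constant. The helper name `margin_sum` is hypothetical.

```python
import numpy as np

def margin_sum(X, y, w, b):
    """Sum of y_i * (w^T x_i + b), with (w, b) rescaled so that ||w|| = 1."""
    norm = np.linalg.norm(w)
    return float(np.sum(y * (X @ w + b) / norm))

# Made-up training points: two positives, two negatives.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Candidate A separates the classes cleanly; candidate B does not.
score_a = margin_sum(X, y, np.array([1.0, 1.0]), 0.0)
score_b = margin_sum(X, y, np.array([1.0, -1.0]), 0.0)

best = "A" if score_a > score_b else "B"
print(score_a, score_b, best)
```

Under this criterion candidate A wins, since every one of its terms $y_i (w^T x_i + b)$ is positive.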

---




## Four Cases in Classification

| Case | Condition | Distance Calculation | Correct Classification? |
|------|-----------|----------------------|-------------------------|
| 1. Positive point in positive region | $w^T x + b > 0$ and actual class = 1 | Distance is positive | Correct ✅ |
| 2. Negative point in negative region | $w^T x + b < 0$ and actual class = -1 | Distance is negative | Correct ✅ |
| 3. Negative point in positive region | $w^T x + b > 0$ but actual class = -1 | Distance is positive | Incorrect ❌ |
| 4. Positive point in negative region | $w^T x + b < 0$ but actual class = 1 | Distance is negative | Incorrect ❌ |

Thus, cases 3 and 4 indicate misclassification, while cases 1 and 2 indicate correct classification.
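The four cases collapse into a single test: a point is classified correctly exactly when the product of its label and its score $w^T x + b$ is positive. A small sketch, with hypothetical helper names `case_number` and `is_correct`, and labels in $\{+1, -1\}$ as in the table:

```python
def case_number(score, label):
    """Return the table's case number (1-4) for score = w^T x + b and label in {+1, -1}."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    if score == 0:
        raise ValueError("point lies exactly on the boundary")
    if score > 0:
        return 1 if label == 1 else 3
    return 4 if label == 1 else 2

def is_correct(score, label):
    """Correct classification means label and score share the same sign."""
    return label * score > 0

# Made-up (score, label) pairs, one per case.
examples = [(2.5, 1), (-1.0, -1), (0.7, -1), (-3.0, 1)]
cases = [case_number(s, l) for s, l in examples]
correct = [is_correct(s, l) for s, l in examples]
print(cases, correct)
```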

---



### Condition after training:
After training the model, the final decision boundary equation is:

$$
w^T x + b = 0
$$

This equation **divides the feature space into two regions**:
- Region 1 (+ class): $w^T x + b > 0$
- Region 0 (- class): $w^T x + b < 0$

The **classifier with the highest** $\sum_i y_i (w^T x_i + b)$ is chosen as the best classifier.

---




In [2]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = sns.load_dataset('penguins')

In [6]:
df['species'].unique() # Output column

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

In [7]:
print(df['island'].unique()) # One hot encoding
print(df['sex'].unique())

['Torgersen' 'Biscoe' 'Dream']
['Male' 'Female' nan]


In [8]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [10]:
df.shape

(344, 7)

In [None]:
df = df.dropna() # Eliminate null values.

- Currently focusing on binary classification:

In [33]:
df = df[df['species']!='Chinstrap']
print(df['species'].unique())
print(df['species'].nunique())
print(df.shape)

['Adelie' 'Gentoo']
2
(265, 7)


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 265 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            265 non-null    object 
 1   island             265 non-null    object 
 2   bill_length_mm     265 non-null    float64
 3   bill_depth_mm      265 non-null    float64
 4   flipper_length_mm  265 non-null    float64
 5   body_mass_g        265 non-null    float64
 6   sex                265 non-null    object 
dtypes: float64(4), object(3)
memory usage: 16.6+ KB


- The output column `species` now has two unique values: 'Adelie' and 'Gentoo'.

- Dividing the dataset into independent and dependent variables.

In [44]:
X = df.iloc[:,1:]
y = df.iloc[:,0]

- Splitting into training and testing sets.

In [59]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=0)

In [60]:
X_train.sample(3)

Unnamed: 0,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
284,Biscoe,45.8,14.2,219.0,4700.0,Female
235,Biscoe,49.3,15.7,217.0,5850.0,Male
65,Biscoe,41.6,18.0,192.0,3950.0,Male


### Preparing the Data for the Model

In [61]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder,MinMaxScaler

In [62]:
df.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

In [63]:
ct = ColumnTransformer(transformers=[
    ('t1',OneHotEncoder(),['island','sex']),
    ('t2',MinMaxScaler(),['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'])
],remainder='passthrough')

In [64]:
ct.fit(X_train)

In [65]:
new_X_train = ct.transform(X_train)
new_X_test = ct.transform(X_test)

In [66]:
print(new_X_train.shape)
print(new_X_test.shape)

(198, 9)
(67, 9)


In [67]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [68]:
lr.fit(new_X_train,y_train)

In [69]:
y_pred = lr.predict(new_X_test)

In [70]:
from sklearn.metrics import accuracy_score, classification_report

In [71]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

1.0

In [72]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); swapping them swaps precision and recall

              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        34
      Gentoo       1.00      1.00      1.00        33

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67
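A fitted scikit-learn model exposes the learned $w$ and $b$ as `coef_` and `intercept_`, so the formulas from the first sections can be checked against its predictions. The sketch below uses synthetic two-cluster data rather than the penguins frame, so it runs on its own; the data and the names `X_toy` and `y_toy` are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two Gaussian clusters, one per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=1.0, size=(50, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X_toy, y_toy)

# Learned weight vector and bias of the decision boundary w^T x + b = 0.
w = model.coef_[0]
b = model.intercept_[0]

# Recompute the scores and probabilities by hand.
scores = X_toy @ w + b
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Compare with what scikit-learn reports.
sklearn_proba = model.predict_proba(X_toy)[:, 1]
same_proba = np.allclose(manual_proba, sklearn_proba)
same_labels = np.array_equal(model.predict(X_toy), (scores > 0).astype(int))
print(same_proba, same_labels)
```

The hand-computed sigmoid of $w^T x + b$ matches `predict_proba`, and the predicted label is 1 exactly where the score is positive.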



In [81]:
df1 = sns.load_dataset('titanic')

In [82]:
df1.sample(3)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
362,0,3,female,45.0,0,1,14.4542,C,Third,woman,False,,Cherbourg,no,False
307,1,1,female,17.0,1,0,108.9,C,First,woman,False,C,Cherbourg,yes,False
387,1,2,female,36.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True


In [83]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


- Understanding Data:

In [90]:
df1.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64

In [None]:
df1 = df1.drop(columns='deck')

In [96]:
df1 = df1.drop(columns='embark_town')

In [101]:
df1 = df1.drop(columns='class')

In [89]:
df1['age'] = df1['age'].fillna(df1['age'].mean())

In [91]:
df1 = df1.dropna()

In [126]:
df1.isnull().sum()

survived      0
pclass        0
sex           0
age           0
sibsp         0
parch         0
fare          0
embarked      0
who           0
adult_male    0
alive         0
alone         0
dtype: int64

In [127]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 889 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    889 non-null    int64  
 1   pclass      889 non-null    int64  
 2   sex         889 non-null    object 
 3   age         889 non-null    float64
 4   sibsp       889 non-null    int64  
 5   parch       889 non-null    int64  
 6   fare        889 non-null    float64
 7   embarked    889 non-null    object 
 8   who         889 non-null    object 
 9   adult_male  889 non-null    bool   
 10  alive       889 non-null    object 
 11  alone       889 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(4)
memory usage: 78.1+ KB


In [128]:
df1.sample()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,who,adult_male,alive,alone
77,0,3,male,29.699118,0,0,8.05,S,man,True,no,True


In [129]:
df1['who'].unique()

array(['man', 'woman', 'child'], dtype=object)

In [130]:
df1.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'who', 'adult_male', 'alive', 'alone'],
      dtype='object')

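sex, embarked, who, adult_male, alive, alone -> nominal  
pclass -> ordinal  
age, fare -> numerical (scaled)  
sibsp, parch -> numerical counts (passed through unchanged)  

Note: `alive` is a yes/no copy of the target `survived`, so keeping it among the features leaks the label into the model. The perfect scores reported further down reflect that leakage, not genuine predictive power.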

In [159]:
X = df1.iloc[:,1:]
y = df1.iloc[:,0]

- Splitting into training and testing sets.

In [169]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=0)

In [170]:
X_train.sample(3)

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embarked,who,adult_male,alive,alone
780,3,female,13.0,0,0,7.2292,C,child,False,yes,True
678,3,female,43.0,1,6,46.9,S,woman,False,no,False
830,3,female,15.0,1,0,14.4542,C,child,False,yes,False


In [171]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

In [172]:
ct1 = ColumnTransformer(transformers=[
    ('t1', OneHotEncoder(sparse_output=False, drop='first'),['sex','embarked', 'who', 'adult_male', 'alive', 'alone']),
    ('t2',MinMaxScaler(),['age','fare'])
], remainder='passthrough')

In [173]:
ct1.fit(X_train)

The format of the columns of the 'remainder' transformer in ColumnTransformer.transformers_ will change in version 1.7 to match the format of the other transformers.
At the moment the remainder columns are stored as indices (of type int). With the same ColumnTransformer configuration, in the future they will be stored as column names (of type str).



In [174]:
X_train_s = ct1.transform(X_train)
X_test_s = ct1.transform(X_test)

In [175]:
X_train_s.shape

(666, 13)

In [176]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [178]:
lr.fit(X_train_s,y_train)

In [179]:
y_pred = lr.predict(X_test_s)

In [180]:
y_pred

array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 1])

In [181]:
import numpy as np
np.array(y_test)

array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 1])

In [182]:
from sklearn.metrics import accuracy_score, classification_report

In [183]:
accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

1.0

In [184]:
print(classification_report(y_test, y_pred))  # (y_true, y_pred); swapping them swaps precision and recall

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       132
           1       1.00      1.00      1.00        91

    accuracy                           1.00       223
   macro avg       1.00      1.00      1.00       223
weighted avg       1.00      1.00      1.00       223
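A perfect score on Titanic data is suspicious. The feature set here includes `alive`, which encodes the same information as the target `survived`, so the model can simply read the answer off that column. The effect can be reproduced on synthetic data: the sketch below builds one weakly informative feature and one feature that copies the label, then compares hold-out accuracy with and without the copy. All data and names (`weak_feature`, `leaky_feature`, `holdout_accuracy`) are made up for illustration and are not taken from the Titanic frame.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)

# A noisy feature that carries only a little information about y.
weak_feature = y + rng.normal(0.0, 2.0, size=n)
# A feature that is just the label in disguise (like `alive` vs `survived`).
leaky_feature = y.astype(float)

X_leaky = np.column_stack([weak_feature, leaky_feature])
X_clean = weak_feature.reshape(-1, 1)

def holdout_accuracy(X, y):
    """Fit on 75% of the rows and report accuracy on the remaining 25%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = LogisticRegression().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

acc_leaky = holdout_accuracy(X_leaky, y)
acc_clean = holdout_accuracy(X_clean, y)
print(acc_leaky, acc_clean)
```

For the Titanic frame, the analogous fix would be to drop `alive` from `X` before fitting (for example with `X.drop(columns='alive')`) and re-run the evaluation; the resulting scores are not shown here because they have not been computed in this notebook.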

