# Assignment for Topic 2: Logistic Regression


Hi,there! This assignment is created by @Xingjian and checked by Professor @Jiahui and @Chunmei.


## Another Edition

This is an edition for jupyter notebook. About .py edition, please see attachment in this folder.

## Datasets

In this section, I choose the dataset from Kaggle, here is link: [Diabetes Healthcare: Comprehensive Dataset-AI](https://www.kaggle.com/datasets/deependraverma13/diabetes-healthcare-comprehensive-dataset?resource=download).You can download it and check all the details about this topic.

In [1]:
import pandas as pd
data = pd.read_csv("D:\\MLtraining_xingjian\\2_logistic_regression\\Train2\\health_care_diabetes.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [2]:
data.head(6)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0


Then we can use the normalization to clean the data (but not including outcome) as following

In [3]:
import numpy as np
from sklearn.preprocessing import StandardScaler
mm = StandardScaler() #create the object for normalization
dd = np.array(data)
mm_data = mm.fit_transform(dd)
print(mm_data)
origin_data = mm.inverse_transform(mm_data) # return back to original data
print(origin_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.46849198  1.4259954
   1.36589591]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.36506078 -0.19067191
  -0.73212021]
 [ 1.23388019  1.94372388 -0.26394125 ...  0.60439732 -0.10558415
   1.36589591]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.68519336 -0.27575966
  -0.73212021]
 [-0.84488505  0.1597866  -0.47073225 ... -0.37110101  1.17073215
   1.36589591]
 [-0.84488505 -0.8730192   0.04624525 ... -0.47378505 -0.87137393
  -0.73212021]]
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]


## Define the Logistic Sigmoid Function $\sigma(z)$

we know Logistic Sigmoid Function is as following:
$$\sigma(z)=\frac{1}{1+e^{-z}}$$

In [4]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

And we can check that:

In [5]:
sigmoid(0)

0.5

## Define LogisticRegression

Then we can give the class object of LogisticRegression like following:

In [6]:
class LogisticRegression2:
    def __init__(self, M, N, lr=0.1):
        # unlike linear regression. we need the weight matrix here
        # StandardScaler(): normalizer, which is important
        # initailize all variables
        self.W = np.random.normal(0,1,size=(N,1))# weight matrix
        self.b = np.random.rand(1).reshape(1,1)# bias
        self.lr = lr
        self.M = M
        self.N = N
    def fit(self, X, y, epoch=5000):
        # normalize first
        # for each epoch, update the weight matrix and bias
        M,N = np.shape(X)
        if len(y.shape) == 1:
            y = y.reshape(y.shape[0],1)
            
        weights = np.concatenate([self.b, self.W],axis=0)
        X = np.c_[np.ones((np.shape(X)[0],1)),X]
        costs = []
        
        for i in range(1,epoch+1):
            H = sigmoid(np.dot(X,weights))        
            cost0 = y.T.dot(np.log(sigmoid(H)))
            cost1 = (1-y).T.dot(np.log(1-sigmoid(H)))
            cost = -((cost1 + cost0))/self.M
            cost = np.squeeze(cost)
            costs.append(cost)
            weights = weights - self.lr * np.dot(X.T, sigmoid(np.dot(X,weights)) - np.reshape(y,(len(y),1)))
            if i % 100 == 0:
                print ('Epoch:{}, The cost is :{}'.format(i, cost))
        
        self.b = weights[0]
        
        self.W = weights[1:]
        
        return self.W, self.b, costs

    def predict(self, X_test):
        X = np.c_[np.ones((np.shape(X_test)[0],1)),X_test]
        weight = np.concatenate([self.b.reshape(self.b.shape[0],1), self.W.reshape(self.W.shape[0],1)],axis=0)
        H = sigmoid(np.dot(X,weight))
        y_pred = []
        for i in H:
            if i>0.5:
                y_pred.append(1)
            else:
                y_pred.append(0)
        return y_pred

But we also need some measurement to verify this performance, we will use $F_{1}$ score to finish this process.

In [7]:
def F1_score(y,y_hat):
    tp,tn,fp,fn = 0,0,0,0
    for i in range(len(y)):
        if y[i] == 1 and y_hat[i] == 1:
            tp += 1
        elif y[i] == 1 and y_hat[i] == 0:
            fn += 1
        elif y[i] == 0 and y_hat[i] == 1:
            fp += 1
        elif y[i] == 0 and y_hat[i] == 0:
            tn += 1
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    f1_score = 2*precision*recall/(precision+recall)
    return f1_score

Now we can check like the following:

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
def data2(data):
    X = np.array(data)[:,:-1]
    y = np.array(data)[:,-1]
    X_tr,X_te,y_tr,y_te = train_test_split(X,y,test_size=0.1)
    normal = StandardScaler()
    X_tr = normal.fit_transform(X_tr)
    X_te = normal.fit_transform(X_te)
    M,N = np.shape(X_tr)
    obj1 = LogisticRegression2(M,N)
    model= obj1.fit(X_tr,y_tr)
    y_pred = obj1.predict(X_te)
    y_train = obj1.predict(X_tr)
    #Let's see the f1-score for training and testing data
    f1_score_tr = F1_score(y_tr,y_train)
    f1_score_te = F1_score(y_te,y_pred)
    print(f1_score_tr)
    print(f1_score_te)
    logisticRegr = LogisticRegression()
    logisticRegr.fit(X_tr, y_tr)
    y_pred2 = logisticRegr.predict(X_tr)
    f1_score_tr2 = F1_score(y_tr,y_pred2)
    print(f1_score_tr2)
    y_pred3 = logisticRegr.predict(X_te)
    f1_score_tr3 = F1_score(y_te,y_pred3)
    print(f1_score_tr3)
data2(data)

Epoch:100, The cost is :0.6998769095082942
Epoch:200, The cost is :0.6998769095082942
Epoch:300, The cost is :0.6998769095082942
Epoch:400, The cost is :0.6998769095082942
Epoch:500, The cost is :0.6998769095082942
Epoch:600, The cost is :0.6998769095082942
Epoch:700, The cost is :0.6998769095082942
Epoch:800, The cost is :0.6998769095082942
Epoch:900, The cost is :0.6998769095082942
Epoch:1000, The cost is :0.6998769095082942
Epoch:1100, The cost is :0.6998769095082942
Epoch:1200, The cost is :0.6998769095082942
Epoch:1300, The cost is :0.6998769095082942
Epoch:1400, The cost is :0.6998769095082942
Epoch:1500, The cost is :0.6998769095082942
Epoch:1600, The cost is :0.6998769095082942
Epoch:1700, The cost is :0.6998769095082942
Epoch:1800, The cost is :0.6998769095082942
Epoch:1900, The cost is :0.6998769095082942
Epoch:2000, The cost is :0.6998769095082942
Epoch:2100, The cost is :0.6998769095082942
Epoch:2200, The cost is :0.6998769095082942
Epoch:2300, The cost is :0.69987690950829

Besides, we add anther classfication dataset to state our training code.

In [9]:
from sklearn.datasets import make_classification
def data1():
    X,y = make_classification(n_features=4)
    X_tr,X_te,y_tr,y_te = train_test_split(X,y,test_size=0.1)
    normal = StandardScaler()
    X_tr = normal.fit_transform(X_tr)
    X_te = normal.fit_transform(X_te)
    M,N = np.shape(X_tr)
    obj1 = LogisticRegression2(M,N)
    model= obj1.fit(X_tr,y_tr)
    y_pred = obj1.predict(X_te)
    y_train = obj1.predict(X_tr)
    #Let's see the f1-score for training and testing data
    f1_score_tr = F1_score(y_tr,y_train)
    f1_score_te = F1_score(y_te,y_pred)
    print(f1_score_tr)
    print(f1_score_te)

    
    logisticRegr = LogisticRegression()
    logisticRegr.fit(X_tr, y_tr)
    y_pred2 = logisticRegr.predict(X_tr)
    f1_score_tr2 = F1_score(y_tr,y_pred2)
    print(f1_score_tr2)
    y_pred3 = logisticRegr.predict(X_te)
    f1_score_tr3 = F1_score(y_te,y_pred3)
    print(f1_score_tr3)

In [10]:
data1()

Epoch:100, The cost is :0.5698954657352746
Epoch:200, The cost is :0.5698954657358896
Epoch:300, The cost is :0.5698954657358896
Epoch:400, The cost is :0.5698954657358896
Epoch:500, The cost is :0.5698954657358896
Epoch:600, The cost is :0.5698954657358896
Epoch:700, The cost is :0.5698954657358896
Epoch:800, The cost is :0.5698954657358896
Epoch:900, The cost is :0.5698954657358896
Epoch:1000, The cost is :0.5698954657358896
Epoch:1100, The cost is :0.5698954657358896
Epoch:1200, The cost is :0.5698954657358896
Epoch:1300, The cost is :0.5698954657358896
Epoch:1400, The cost is :0.5698954657358896
Epoch:1500, The cost is :0.5698954657358896
Epoch:1600, The cost is :0.5698954657358896
Epoch:1700, The cost is :0.5698954657358896
Epoch:1800, The cost is :0.5698954657358896
Epoch:1900, The cost is :0.5698954657358896
Epoch:2000, The cost is :0.5698954657358896
Epoch:2100, The cost is :0.5698954657358896
Epoch:2200, The cost is :0.5698954657358896
Epoch:2300, The cost is :0.56989546573588