In [1]:
import pandas as pd
import numpy as np

In [4]:
col_names = ['x1', 'x2', 'result']

df = pd.read_csv("./Logistic_Regression_Dataset_Students.csv", header=None, names=col_names)
df

Unnamed: 0,x1,x2,result
0,34.623660,78.024693,0
1,30.286711,43.894998,0
2,35.847409,72.902198,0
3,60.182599,86.308552,1
4,79.032736,75.344376,1
...,...,...,...
95,83.489163,48.380286,1
96,42.261701,87.103851,1
97,99.315009,68.775409,1
98,55.340018,64.931938,1


The iloc indexer is used to extract a subset of the rows and columns from the dataset. The syntax is df.iloc[row_indexer, column_indexer], where df is the data frame, row_indexer is a list of row indices, and column_indexer is a list of column indices.

In this case, ds.iloc[:, [0, 1]] selects all rows (:) and the first two columns (0 and 1). The resulting data frame is then converted to a NumPy array using the values attribute.

So, x is a NumPy array containing the values in the first two columns of the dataset. These values correspond to the input features (x1 and x2) of the logistic regression model.

```
# input
x = ds.iloc[:, [0, 1]].values
```

But we wont be using this in our code

In [8]:
# extract the input features from the data frame and convert them to a NumPy array
x = df[['x1', 'x2']].values
x

array([[34.62365962, 78.02469282],
       [30.28671077, 43.89499752],
       [35.84740877, 72.90219803],
       [60.18259939, 86.3085521 ],
       [79.03273605, 75.34437644],
       [45.08327748, 56.31637178],
       [61.10666454, 96.51142588],
       [75.02474557, 46.55401354],
       [76.0987867 , 87.42056972],
       [84.43281996, 43.53339331],
       [95.86155507, 38.22527806],
       [75.01365839, 30.60326323],
       [82.30705337, 76.4819633 ],
       [69.36458876, 97.71869196],
       [39.53833914, 76.03681085],
       [53.97105215, 89.20735014],
       [69.07014406, 52.74046973],
       [67.94685548, 46.67857411],
       [70.66150955, 92.92713789],
       [76.97878373, 47.57596365],
       [67.37202755, 42.83843832],
       [89.67677575, 65.79936593],
       [50.53478829, 48.85581153],
       [34.21206098, 44.2095286 ],
       [77.92409145, 68.97235999],
       [62.27101367, 69.95445795],
       [80.19018075, 44.82162893],
       [93.1143888 , 38.80067034],
       [61.83020602,

In [11]:
# extract the output labels from the data frame and convert them to a NumPy array
y = df['result'].values
y

array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

The scale function from scikit-learn's preprocessing module standardizes the input features by scaling them so that they have zero mean and unit variance. Standardization can be useful for algorithms that assume that the input features have zero mean and unit variance, or when we want to compare the importance of different features.

The scale function takes a NumPy array of shape (n_samples, n_features) as input and returns a standardized version of the input array.

In this case, xp is a NumPy array containing the standardized versions of the input features (x1 and x2). Standardization is performed by subtracting the mean of each feature from each value and dividing by the standard deviation of each feature.

In [12]:
# standardizing the input features so they have zero mean and variance
from sklearn import preprocessing
xp = preprocessing.scale(x)

This code block is training the logistic regression model using the training data and the stochastic gradient descent optimization algorithm.

The KFold function from scikit-learn's model_selection module is used to split the data into folds for cross-validation. The n_splits argument specifies the number of folds to use. In this case, the data is split into 5 folds.

The split method of KFold returns the indices of the training and test samples for each fold. The for loop iterates over the folds and uses the indices to extract the training and test samples from the input data (xp) and the labels (y).

The training and test samples are then used to train and test the logistic regression model using the stochastic gradient descent algorithm. The model's weights (b0, b1, and b2) are updated at each epoch until the number of epochs (iterations) is exhausted. The alpha parameter is the learning rate, which controls the size of the weight updates.

Finally, the trained model's weights (b0, b1, and b2) are printed.

In [15]:
from sklearn.model_selection import KFold, train_test_split

# training the logistic regression model using the training data and the stochastic gradient descent optimization algorithm
# we split the data into 5 folds using kfold so we cross verify
kf = KFold(n_splits=5)

# training and testing the logistic regression model using the stochastic gradient descent algorithm. model's weights (b0, b1, and b2) are updated at each epoch until the number of epochs (iterations) is exhausted. The alpha parameter is the learning rate, which controls the size of the weight update
for train_index, test_index in kf.split(xp):
    xtrain, xtest, ytrain, ytest = train_test_split(xp, y, test_size = 0.20, random_state=0)
    x1 = xtrain[:, 0]
    x2 = xtrain[:, 1]
    b0 = 0.0
    b1 = 0.0
    b2 = 0.0
    epoch=1000
    alpha=0.001
    while(epoch > 0):
        for i in range(len(xtrain)):
            prediction = 1/(1 + np.exp(-b0 + b1*x1[i] + b2*x2[i]))
            b0 = b0 + alpha*(ytrain[i] - prediction) * prediction * ( 1 - prediction) * 1.0
            b1 = b1 + alpha*(ytrain[i] - prediction) * prediction * ( 1 - prediction) * x1[i]
            b2 = b2 + alpha*(ytrain[i] - prediction) * prediction * ( 1 - prediction) * x2[i]
        epoch = epoch - 1

print(b0)
print(b1)
print(b2)

2.274620542034513
2.785686835762315
3.0555177592607037


In [16]:
# using the trained model to make predictions on the test set
final_predictions = []

x3 = xtest[:, 0]
x4 = xtest[:, 1]

print(ytest)

[1 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 1 1]


In [17]:
y_pred = [0] * len(xtest)

for i in range(len(xtest)):
    y_pred[i] = np.round( 1 / 1 + np.exp(-(b0 + b1*x3[i] + b2*x4[i])))
    final_predictions.append(np.ceil(y_pred[i]))

print(final_predictions)

[1.0, 1.0, 3.0, 4.0, 1.0, 1.0, 2.0, 1.0, 15.0, 1.0, 25.0, 71.0, 12.0, 1.0, 2.0, 1.0, 17.0, 1.0, 1.0, 1.0]


In [18]:
from sklearn.metrics import accuracy_score

print("accuracy : ",accuracy_score(ytest,y_pred))

accuracy :  0.5
