Let's import some modules we may need first:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, let's load the occupancy data we are going to work with.

In [2]:
filename = '../data/occupancy_data/datatraining.txt'
df_training = pd.read_csv(filename)
# inspect the data
df_training.head(7)

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
1,2015-02-04 17:51:00,23.18,27.272,426.0,721.25,0.004793,1
2,2015-02-04 17:51:59,23.15,27.2675,429.5,714.0,0.004783,1
3,2015-02-04 17:53:00,23.15,27.245,426.0,713.5,0.004779,1
4,2015-02-04 17:54:00,23.15,27.2,426.0,708.25,0.004772,1
5,2015-02-04 17:55:00,23.1,27.2,426.0,704.5,0.004757,1
6,2015-02-04 17:55:59,23.1,27.2,419.0,701.0,0.004757,1
7,2015-02-04 17:57:00,23.1,27.2,419.0,701.666667,0.004757,1


In [3]:
df_training.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8143 entries, 1 to 8143
Data columns (total 7 columns):
date             8143 non-null object
Temperature      8143 non-null float64
Humidity         8143 non-null float64
Light            8143 non-null float64
CO2              8143 non-null float64
HumidityRatio    8143 non-null float64
Occupancy        8143 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 508.9+ KB


It doesn't seem like we have any null entries, so we don't have to do any imputing.
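A more direct check than reading the info() summary is to count the missing values in each column. The sketch below uses a small made-up frame; the name toy and its values are assumptions for illustration, not the occupancy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_training; values are made up.
toy = pd.DataFrame({
    "Temperature": [23.18, 23.15, np.nan],
    "CO2": [721.25, 714.0, 713.5],
})

# isnull() marks missing cells; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only when no column has any missing value.
has_no_nulls = bool((missing_per_column == 0).all())
```

Running the same two lines on df_training should report zero for every column if the info() output above is accurate.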

Our plan is to implement logistic regression from scratch and see how well it performs against sklearn's logistic regression. Let's put our data into numpy arrays, splitting the training set into the feature matrix X and the target vector y.

In [4]:
# ignore the datetime column
X_train = df_training.iloc[:,1:6].values
y_train = df_training['Occupancy'].values
# now, we need to add a column of 1's to the front of the X so we can take into account
# the intercept term
ones = np.ones((X_train.shape[0],1)) # one row per training sample, not hard-coded
Xb_train = np.concatenate((ones,X_train),axis=1)
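To see why a column of ones accounts for the intercept, the following sketch checks that multiplying the augmented matrix by a weight vector whose first entry is the intercept gives the same result as X.dot(w) + b. All numbers and names here are hypothetical.

```python
import numpy as np

# Toy feature matrix (3 samples, 2 features); values are made up.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([0.5, -0.25])  # hypothetical feature weights
b = 2.0                     # hypothetical intercept

# Prepend a column of ones, sized from the data.
Xb = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
Wb = np.concatenate(([b], w))  # intercept goes first, matching the ones column

with_bias_column = Xb.dot(Wb)
explicit_intercept = X.dot(w) + b
print(np.allclose(with_bias_column, explicit_intercept))
```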

Now, we will define the logistic function and perform our gradient descent. We will do this by taking the derivative of the cross-entropy error function and continually updating our weights.
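For reference, with predictions p = sigmoid(X.dot(W)), the summed cross-entropy error is J(W) = -sum(y*log(p) + (1-y)*log(1-p)), and its gradient with respect to W is X.T.dot(p - y). The sketch below checks that formula against finite differences on random toy data; the helper names and the data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(W, X, y):
    # Summed (not averaged) cross-entropy error.
    p = sigmoid(X.dot(W))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy_grad(W, X, y):
    # Analytic gradient of the summed cross-entropy.
    return X.T.dot(sigmoid(X.dot(W)) - y)

rng = np.random.RandomState(0)
X = rng.randn(20, 3)                      # made-up features
y = (rng.rand(20) > 0.5).astype(float)    # made-up binary labels
W = 0.1 * rng.randn(3)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(W.size):
    step = np.zeros_like(W)
    step[j] = eps
    numeric[j] = (cross_entropy(W + step, X, y)
                  - cross_entropy(W - step, X, y)) / (2 * eps)

analytic = cross_entropy_grad(W, X, y)
print(np.allclose(numeric, analytic, atol=1e-5))
```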

In [29]:
W = np.random.randn(6) # initialize weights randomly, need an extra weight for intercept term
def sigmoid(z):
    return 1./(1.+np.exp(-z))
y_pred = sigmoid(Xb_train.dot(W)) # initial prediction is based on random weights
learning_rate = 0.01
# we will perform 1000 iterations
for i in range(0,1000):
    W-= learning_rate * Xb_train.T.dot(y_pred-y_train) # update weight based on derivative
    y_pred = sigmoid(Xb_train.dot(W))
print("weights:",W)

('weights:', array([-4.50498652e+03, -9.34492166e+04, -8.00111640e+04,  1.50908672e+04,
        1.04046475e+03, -1.34030311e+01]))
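Weights this large push Xb_train.dot(W) to very large magnitudes, where np.exp(-z) can overflow for strongly negative z. One common remedy, sketched here under the assumption that z is a 1-D NumPy array, is to evaluate the sigmoid in a form that only ever exponentiates non-positive numbers. The name stable_sigmoid is hypothetical and not part of the original notebook.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by never exponentiating a positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Standardizing the features before running gradient descent would be another way to keep the weights and z at moderate scales, though that changes the model being fitted.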




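Let's compare these weights to what we would get from running the sklearn Logistic Regression. Note that coef_ holds only the five feature weights; sklearn stores the intercept separately in intercept_.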

In [28]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
logreg.coef_

array([[-5.16482752e-01,  4.40977352e-02,  1.99510758e-02,
         4.31632201e-03, -3.47110769e-05]])

As we can see, sklearn's logistic regression produces very different weights. This is probably because sklearn applies L2 regularization by default, whereas we ran plain gradient descent with no regularization. Let's test the accuracy of both models on the test set provided by the ML Repository.
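To illustrate how a penalty on the weights changes the result, here is a minimal sketch of gradient descent with an added L2 term. It is not sklearn's solver or its exact objective; the helper fit_gd, the penalty strength lam, and the toy data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(Xb, y, lam, learning_rate=0.5, n_iter=500):
    """Gradient descent on mean cross-entropy plus (lam / 2) * ||w||^2.

    Hypothetical helper; the intercept (first weight) is left unpenalized.
    """
    W = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        grad = Xb.T.dot(sigmoid(Xb.dot(W)) - y) / len(y)
        penalty = lam * W
        penalty[0] = 0.0  # do not shrink the intercept
        W -= learning_rate * (grad + penalty)
    return W

# Toy, linearly separable data with a leading column of ones (made up).
Xb = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

W_plain = fit_gd(Xb, y, lam=0.0)  # no regularization
W_l2 = fit_gd(Xb, y, lam=0.1)     # with L2 penalty
print(W_plain, W_l2)
```

On separable data like this, the unpenalized weights keep growing with more iterations, while the penalized ones settle at a smaller magnitude.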

In [30]:
filename = '../data/occupancy_data/datatest.txt'
df_test = pd.read_csv(filename)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2665 entries, 140 to 2804
Data columns (total 7 columns):
date             2665 non-null object
Temperature      2665 non-null float64
Humidity         2665 non-null float64
Light            2665 non-null float64
CO2              2665 non-null float64
HumidityRatio    2665 non-null float64
Occupancy        2665 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 166.6+ KB


In [32]:
# ignore the datetime column
X_test = df_test.iloc[:,1:6].values
y_test = df_test['Occupancy'].values
# now, we need to add a column of 1's to the front of the X so we can take into account
# the intercept term
ones = np.ones((X_test.shape[0],1)) # one row per test sample, not hard-coded
Xb_test = np.concatenate((ones,X_test),axis=1)

Let's measure how accurate our gradient descent model is on the test data.

In [39]:
y_pred_GD = np.round(sigmoid(Xb_test.dot(W))) # round to 0 or 1 based on 0.5 threshold
accuracy = np.sum(y_pred_GD==y_test)/float(len(y_test))
print("The accuracy was " + repr(accuracy)) # repr() replaces the Python 2-only backtick syntax

The accuracy was 0.9789868667917448
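The thresholding and accuracy steps can also be written with an explicit comparison and np.mean. The sketch below uses made-up probabilities and labels, not the notebook's predictions.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels.
probs = np.array([0.9, 0.2, 0.65, 0.4, 0.1])
y_true = np.array([1, 0, 0, 1, 0])

# Explicit 0.5 threshold. Unlike np.round, which rounds halves to the
# nearest even value, a probability of exactly 0.5 maps to class 1 here.
y_hat = (probs >= 0.5).astype(int)

# The mean of a boolean array is the fraction of True entries,
# which here is the accuracy.
accuracy = np.mean(y_hat == y_true)
print(accuracy)
```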




Overall, our model using gradient descent from scratch was very accurate. Let's see how well our sklearn model did.

In [52]:
y_pred_sk = logreg.predict(X_test)
logreg.score(X_test,y_test)

0.9782363977485928

As we can see, the accuracy of the sklearn model was actually slightly lower: about 0.9782 versus 0.9790, a difference of two test samples out of 2665.