***Multi Variate Linear Regression using Gradient Descent for stream data*** <br />
***Sliding Window Approach***

***Procedure for Online Regression using Sliding Window Approach***

1. The Batch Data Arrives (here from a file)
2. Initially for the first step, we choose some random values of weights (thetas)
3. We apply gradient descent on that batch and get the updated parameters
4. We pass those parameters to the next batch
5. We apply the same method for this batch too, just the weights are not chosen ramdomly
6. We repeat the same for all other batched.
7. The weights we get after the last batch are the final set of weights

***Comparision with other approaches***
***Advantages***
1. We do not need to store the batch data.
2. Lower iterations for batches.

***Disadvantages***
1. As the new data arrives, the previous records are discarded.
2. We can't have global view of all the records.
3. Hence, we can't accurately model the entire stream data.

***We have implemented for Multiple Variate Dataset. Same procedure can be applied for Uni Variate Too***

In [120]:
#Importing Libraries

from sklearn import metrics
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_boston
from numpy.linalg import inv, pinv, LinAlgError

In [121]:
#Loading Dataset
X,Y = load_boston(True)

#Adding dummy column to X
X_new=np.ones((X.shape[0],X.shape[1]+1))
X_new[:,1:]=X
X=X_new

# Splitting the dataset into training set and testing set
# 80% Training and 20 % Testing 
index = int(0.8 * len(X))
x_train,x_test=X[:index],X[index:]
y_train,y_test=Y[:index],Y[index:]

# display the shape/dimensions for X_train and X_test
print("Type of x_train:", type(x_train), "Shape of x_train:", x_train.shape)
print("Type of x_test:", type(x_test), "Shape of x_test:", x_test.shape)

# Normalizing the Data
# from sklearn.preprocessing import StandardScaler
# scaler=StandardScaler()
# scaler.fit(x_train[:,1:])
# x_train[:,1:]=scaler.transform(x_train[:,1:])
# x_test[:,1:]=scaler.transform(x_test[:,1:])

#Rearranging Y and storing X,Y in csv format
Y=np.reshape(Y,newshape=(Y.shape[0],1))
boston=np.hstack((X,Y)) #Combining the X and Y and making in horizontal stack form
boston=pd.DataFrame(boston) #Converting to pandas Dataframe
boston.to_csv('boston.csv', header='None', index=False)  #Converting to CSV Format and writing in file

Type of x_train: <class 'numpy.ndarray'> Shape of x_train: (404, 14)
Type of x_test: <class 'numpy.ndarray'> Shape of x_test: (102, 14)


In [122]:
# Function for generation of Stream

def stream_data(wsize=100):
    counter=0   #Initializing
    for chunk in pd.read_csv('boston.csv', header=None, chunksize=wsize): # Opening file and reading data into chunks
        chunk_array=chunk.values    #Converting pandas dataframe to numpy arrays
        counter=counter+1   #incrementing the counter
        yield (chunk_array[:,:-1], chunk_array[:,-1])   # Streams the chunk data without exiting/returning the function
        if counter >= 4:  #Checking the counter until it comes at the end of data that needs to stream
            break


In [123]:
#Cost Function
def cost_function(X, y, theta):
    m = y.size
    error = np.dot(X, theta.T) - y
    cost = 1/(2*m) * np.dot(error.T, error)
    return cost, error

In [124]:
#Gradient Descent
def gradient_descent(X, y, theta, alpha, iters):
    cost_array = np.zeros(iters)
    m = y.size
    for i in range(iters):
        cost, error = cost_function(X, y, theta)
        theta = theta - (alpha * (1/m) * np.dot(X.T, error))
        cost_array[i] = cost
    return theta, cost_array

In [125]:
#Getting stream Data
sdata = stream_data(100)   # <-- Yeilds 100 data points/instances at a time

# Deciding the Learning Rate and Number of Iterations for each batch
iterations = 10
alpha = 0.001
cnt = 0

for (x,y) in sdata:
    
    # Normalize our features
    X = (x - x.mean()) / x.std()

    if cnt==0:     # <-- Initialize Theta Values to 0    
        theta = np.zeros(X.shape[1])
        cnt = cnt+1
    
    initial_cost, error_mat = cost_function(X, y, theta)
    print("***** BATCH {0}*****".format(cnt))
    print('Cost Before Gradient Descent is: {0}'.format(initial_cost))
    # Run Gradient Descent [Theta gets updates and gets passed to new batch]
    theta, cost_num = gradient_descent(X, y, theta, alpha, iterations)
    final_cost, error_mat = cost_function(X, y, theta)
    print("Cost After  Gradient Descent is: {0}".format(final_cost))
    
    # Calculating Statistics in the Trained Model for this batch
    x_test_std = (x_test - x_test.mean()) / x_test.std()
    y_pred = np.dot(theta, x_test_std.T)
    print("Mean Absolute Error: ", metrics.mean_absolute_error(y_test, y_pred))
    print("Mean Squared  Error: ", metrics.mean_squared_error(y_test, y_pred))
    print()
    cnt = cnt + 1


# Display cost chart
# plotChart(iterations, cost_num)

***** BATCH 1*****
Cost Before Gradient Descent is: 261.70925
Cost After  Gradient Descent is: 203.12089236005522
Mean Absolute Error:  13.750193611001542
Mean Squared  Error:  214.11059344073243

***** BATCH 2*****
Cost Before Gradient Descent is: 274.05601926130316
Cost After  Gradient Descent is: 221.36775788033458
Mean Absolute Error:  11.427417916394079
Mean Squared  Error:  154.5405496562384

***** BATCH 3*****
Cost Before Gradient Descent is: 341.1365430832656
Cost After  Gradient Descent is: 269.2976460320818
Mean Absolute Error:  9.02041989864586
Mean Squared  Error:  102.72275428949952

***** BATCH 4*****
Cost Before Gradient Descent is: 130.58639410915137
Cost After  Gradient Descent is: 113.96605605838378
Mean Absolute Error:  7.780349840087024
Mean Squared  Error:  80.4851164359746

