### Coding the Linear Regression Algorithm

Now that we have studied the math behind Linear Regression, we will code our own LR algorithm for the one-dimensional case (a single feature).
The algorithm needs four functions:
* A function fit that receives X_train and Y_train and trains the model on that data by calculating the optimal coefficients m (slope) and c (intercept).
* A function predict that receives X_test, m and c as parameters and uses m and c to generate Y_pred.
* A function score that uses the coefficient of determination formula to score our algorithm. It requires Y_test and Y_pred as parameters.
* A function cost that evaluates the cost function, that is, the mean of the squared errors over all data points. It requires X, Y, m and c as parameters.
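
For reference, the four functions correspond to the formulas below, written to match the code in this section. The notation is a choice made for this summary and may differ from the earlier derivation: bars denote means over the data set, $\hat{y}_i$ is the prediction for $x_i$, and $N$ is the number of data points.

$$
m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^{2}}, \qquad c = \bar{y} - m\,\bar{x}
$$

$$
\hat{y}_i = m x_i + c
$$

$$
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
$$

$$
\mathrm{cost}(m, c) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + c)\bigr)^2
$$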

In [1]:
import numpy as np

In [2]:
# Loading Data
data = np.loadtxt('data.csv', delimiter=",")

# Splitting Features and Output
X = data[:, 0]
Y = data[:, 1]

X.shape, Y.shape

((100,), (100,))

In [3]:
from sklearn import model_selection

In [4]:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((75,), (25,), (75,), (25,))
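
Because train_test_split shuffles the data randomly, the coefficients and score printed later in this section will vary from run to run. A minimal sketch of a reproducible split follows, assuming synthetic stand-in arrays (X_demo and Y_demo are hypothetical, not the contents of data.csv) and an arbitrary seed of 42. The test_size=0.25 argument spells out the library default, which is what yields the 75/25 split shown above.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical stand-in data with 100 points (not the notebook's data.csv)
X_demo = np.arange(100, dtype=float)
Y_demo = 3.0 * X_demo + 2.0

# Fixing random_state makes the shuffle, and hence the split, repeatable
split_a = model_selection.train_test_split(X_demo, Y_demo, test_size=0.25, random_state=42)
split_b = model_selection.train_test_split(X_demo, Y_demo, test_size=0.25, random_state=42)

X_train_a, X_test_a, Y_train_a, Y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# Both calls return identical arrays because the seed is the same
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```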

Now that our data is split into training and testing datasets, we can write the functions that make up our own Linear Regression algorithm.

In [5]:
def fit(X_train, Y_train):
    # Closed-form least-squares solution for a single feature:
    #   m = (mean(x*y) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2)
    #   c = mean(y) - m*mean(x)
    numerator = np.mean(X_train * Y_train) - np.mean(X_train) * np.mean(Y_train)
    denominator = np.mean(X_train**2) - np.mean(X_train)**2
    m = numerator / denominator
    c = np.mean(Y_train) - m * np.mean(X_train)
    return (m, c)
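
One way to gain confidence in the closed-form formulas is to compare them with an independent least-squares fit. The sketch below does this on synthetic data, which is an assumption for illustration and not the notebook's data.csv. It repeats fit so that it runs on its own, and uses np.polyfit with deg=1, which returns the slope first and then the intercept.

```python
import numpy as np

def fit(X_train, Y_train):
    # Same closed-form solution as in the cell above
    numerator = np.mean(X_train * Y_train) - np.mean(X_train) * np.mean(Y_train)
    denominator = np.mean(X_train**2) - np.mean(X_train)**2
    m = numerator / denominator
    c = np.mean(Y_train) - m * np.mean(X_train)
    return (m, c)

# Synthetic data (hypothetical): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0.0, 0.5, size=x_demo.shape)

m_demo, c_demo = fit(x_demo, y_demo)

# Reference fit of a degree-1 polynomial: coefficients are [slope, intercept]
m_ref, c_ref = np.polyfit(x_demo, y_demo, deg=1)

# Both checks should print True if the formulas are implemented correctly
print(np.isclose(m_demo, m_ref), np.isclose(c_demo, c_ref))
```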

In [6]:
def predict(X_test, m, c):
    # Here we simply generate the fit line by calculating Y for all X_test data points
    # Numpy arrays make this extra simple
    return m*X_test + c

In [7]:
def score(Y_test, Y_pred):
    # Coefficient of determination: R^2 = 1 - u/v
    u = np.sum((Y_test - Y_pred)**2)            # residual sum of squares
    v = np.sum((Y_test - np.mean(Y_test))**2)   # total sum of squares
    return 1 - (u / v)
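
To get a feel for what the score means, it helps to evaluate it on a few hand-made cases. The sketch below uses a tiny hypothetical array (y_true is an illustration, not taken from the data set) and repeats score so that it runs on its own: perfect predictions give 1, always predicting the mean gives 0, and predictions worse than the mean give a negative score.

```python
import numpy as np

def score(Y_test, Y_pred):
    # Coefficient of determination, as defined in the cell above
    u = np.sum((Y_test - Y_pred)**2)
    v = np.sum((Y_test - np.mean(Y_test))**2)
    return 1 - (u / v)

y_true = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical targets

# Predictions identical to the targets: no residual error
perfect = score(y_true, y_true)

# Always predicting the mean of the targets: u equals v
baseline = score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reversed order: worse than predicting the mean
reversed_pred = score(y_true, y_true[::-1])

print(perfect, baseline, reversed_pred)  # 1.0 0.0 -3.0
```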

In [8]:
def cost(X, Y, m, c):
    # Mean squared error: the average of the squared differences between Y and m*X + c
    total_cost = np.mean((Y - (m*X + c))**2)
    return total_cost
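
A small worked example makes the cost formula concrete. The sketch below uses hypothetical points that lie exactly on the line y = 2x and repeats cost so that it runs on its own. With m = 2 and c = 0 the cost is zero. With m = 1 and c = 0 the residuals are 1, 2 and 3, so the cost is (1 + 4 + 9) / 3 = 14/3.

```python
import numpy as np

def cost(X, Y, m, c):
    # Mean squared error, as defined in the cell above
    total_cost = np.mean((Y - (m*X + c))**2)
    return total_cost

# Hypothetical points lying exactly on y = 2x
x_small = np.array([1.0, 2.0, 3.0])
y_small = np.array([2.0, 4.0, 6.0])

# The true line reproduces every point, so there is no error
cost_exact = cost(x_small, y_small, 2.0, 0.0)

# A line with the wrong slope leaves residuals 1, 2 and 3
cost_wrong = cost(x_small, y_small, 1.0, 0.0)

print(cost_exact, cost_wrong)
```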

In [9]:
# Train the algorithm on the training data to obtain m and c
m, c = fit(X_train, Y_train)

# Use the generated coefficients to get the FIT LINE for test data X
Y_pred = predict(X_test, m, c)

# Get the score of our algorithm using coefficient of determination
sc = score(Y_test, Y_pred)

# Get the mean squared error of the fit line on the training data using the cost function
err = cost(X_train, Y_train, m, c)

print("M, C", m, c)
print("Score", sc)
print("Cost on training data", err)

M, C 1.2034175463933428 13.496578597683445
Score 0.6350533202377575
Cost on training data 101.05127294222592
