# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Implementation of Linear Regression using OOPs

## Learning Objectives

At the end of the mini-project, you will be able to:

- understand the power and flexibility of the object-oriented programming (OOP) paradigm
- build OOP-based classes and methods and use them to implement linear regression for real-world data problems


## Problem Statement

Implement linear regression using classes and methods built with OOP.

## Information

#### Object-oriented programming in a nutshell

Object-oriented programming is built around the concept of "objects". An object has two kinds of attributes, both accessed with dot (`.`) syntax: data attributes (also called instance variables) and function attributes (also called methods). An object's data is typically modified through its methods.

To learn more about OOP in Python, click [here](https://docs.python.org/3/tutorial/classes.html)
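
As a minimal sketch of the two kinds of attributes described above, the hypothetical `Counter` class below (not part of the exercises) has one data attribute, `count`, and one method, `increment`, that modifies it.

```python
class Counter:
    """Hypothetical example class: one data attribute and one method."""

    def __init__(self, start=0):
        # Data attribute (instance variable), stored on the object itself.
        self.count = start

    def increment(self, step=1):
        # Method: modifies the object's own data and returns the new value.
        self.count += step
        return self.count


counter = Counter()       # create an object (an instance of Counter)
counter.increment()       # call a method with dot syntax
counter.increment(4)
print(counter.count)      # read a data attribute with dot syntax; prints 5
```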

#### Linear Regression

In statistics, linear regression is a linear approach to model the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.

To learn more about linear regression, click [here](http://www.mit.edu/~6.s085/notes/lecture3.pdf)


## Grading = 10 Points

#### There are 10 exercises in total, worth 1 point each.

##### Importing Necessary Packages

In [107]:
import numpy as np # Numpy Package
import pandas as pd # Pandas Package

#### Exercise 1: Generate 50 points with an approximate relationship of y = 3x + 1, with normally distributed errors.

**Hint:** np.linspace(), np.random.randn()

In [108]:
# YOUR CODE HERE

In [109]:
x_actual= np.linspace(10,20,num=50)

In [110]:
y=3*x_actual+np.ones(50)

In [111]:
x_actual

array([10.        , 10.20408163, 10.40816327, 10.6122449 , 10.81632653,
       11.02040816, 11.2244898 , 11.42857143, 11.63265306, 11.83673469,
       12.04081633, 12.24489796, 12.44897959, 12.65306122, 12.85714286,
       13.06122449, 13.26530612, 13.46938776, 13.67346939, 13.87755102,
       14.08163265, 14.28571429, 14.48979592, 14.69387755, 14.89795918,
       15.10204082, 15.30612245, 15.51020408, 15.71428571, 15.91836735,
       16.12244898, 16.32653061, 16.53061224, 16.73469388, 16.93877551,
       17.14285714, 17.34693878, 17.55102041, 17.75510204, 17.95918367,
       18.16326531, 18.36734694, 18.57142857, 18.7755102 , 18.97959184,
       19.18367347, 19.3877551 , 19.59183673, 19.79591837, 20.        ])

In [112]:
x=x_actual+np.random.randn(50)

In [113]:
np.random.randn(50)

array([ 0.363779  , -1.10353165,  1.33838936,  0.4318283 , -0.05522974,
       -1.36374105, -1.01801016,  0.04138905,  0.86821892, -1.0890109 ,
        1.71017514, -0.45162197, -1.20175542, -0.4771423 ,  0.01607059,
       -1.07226399, -0.48423506,  2.86944539, -0.04535328,  0.18001361,
       -0.45195553, -0.171367  , -1.44838139, -0.16992862,  0.02063148,
        0.05160786, -0.05292013, -0.4015743 , -0.83471693, -1.6460258 ,
       -1.46928274, -2.47395438, -0.10415319,  0.53967247,  1.94777163,
       -0.64867687, -1.1008839 ,  1.22681159, -2.50447115,  0.04474438,
       -0.81890464, -0.61701026,  0.86655837, -1.14126055,  1.69366945,
        0.29233626,  1.50874642,  0.8011663 , -0.60099855,  0.84339957])
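
The cells above add the random noise to `x` and leave `y` exact. One alternative reading of the exercise, sketched below, keeps `x` fixed and places the normally distributed error on `y`. The names `rng`, `x_demo`, `noise` and `y_demo` and the seed value are assumptions made for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed is arbitrary.
rng = np.random.default_rng(seed=0)

x_demo = np.linspace(10, 20, num=50)   # 50 evenly spaced x values
noise = rng.standard_normal(50)        # errors drawn from a standard normal
y_demo = 3 * x_demo + 1 + noise        # y = 3x + 1 plus the error term

print(x_demo.shape, y_demo.shape)
```

With this setup, a fitted slope should come out close to 3, although the exact value depends on the random draw.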

#### Exercise 2: Define a class named **LinearRegression** and add a short description of linear regression using the built-in method \_\_repr\_\_

**Hint:** [How to use \_\_repr\_\_ method](https://www.educative.io/edpresso/what-is-the-repr-method-in-python)

In [114]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits given x and y with minimum error between the actual and predicted'


In [115]:
model= LinearRegression(2,3)

In [116]:
model.__repr__()

'fits given x and y with minimum error between the actual and predicted'

In [117]:
repr(model)

'fits given x and y with minimum error between the actual and predicted'

#### Exercise 3: In the Linear Regression class defined above, add a method which takes a list of values as input and returns the mean of those values.

**Note:** Don't use a built-in method to calculate the mean

**Hint:** 
1. The mean is the average of the numbers
2. [How to define a method in a class](https://docs.python.org/3/tutorial/classes.html#scopes-and-namespaces-example)

In [118]:
# YOUR CODE HERE

In [119]:
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        sum=0
        for item in list1:
            sum+=item
        return sum/len(list1)

In [120]:
model= LinearRegression([4,5,6,9],[4,6,7,9])
model.avg(model.x)

6.0

In [121]:
model

fits the given x and y with minimum error between the actual and predicted

#### Exercise 4: In the Linear Regression class defined above, add a method which takes a list of values as input and returns the variance of those values.

**Note:** Don't use a built-in method to calculate the variance

**Hint:** 

1. The variance is the average of the squared differences of each data point from the mean
2. [How to call one method from another method inside a class](https://docs.python.org/3/tutorial/classes.html#scopes-and-namespaces-example)


In [122]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        sum=0
        for item in list1:
            sum+=item
        return sum/len(list1)
    def var(self,list1):
        k=self.avg(list1)
        sum =0
        for item in list1:
            sum+= (item - k)**2
        return sum/len(list1)

In [123]:
model= LinearRegression([1,2,3],[4,6,7])
model.avg(model.x)

2.0

In [124]:
model.var(model.x)

0.6666666666666666
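
The `var` method above divides by the number of values, which gives the population variance rather than the sample variance (which divides by n - 1). A quick way to check a hand-written version is to compare it with the standard library's `statistics` module. The function names `mean_manual` and `variance_manual` below are assumptions made for this sketch.

```python
import math
import statistics


def mean_manual(values):
    # Sum the values one by one, then divide by how many there are.
    total = 0
    for v in values:
        total += v
    return total / len(values)


def variance_manual(values):
    # Average squared deviation from the mean (divides by n).
    m = mean_manual(values)
    total = 0
    for v in values:
        total += (v - m) ** 2
    return total / len(values)


data = [1, 2, 3]
pop_var = variance_manual(data)
print(pop_var)                      # hand-written population variance
print(statistics.pvariance(data))   # population variance, divides by n
print(statistics.variance(data))    # sample variance, divides by n - 1
```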

#### Exercise 5: In the Linear Regression class defined above, add a method which takes two lists of values as input and returns the covariance of those values.

**Note:** Don't use a built-in method to calculate the covariance

**Hint:** [How to calculate the covariance of two values](https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/covariance/)

In [125]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        sum=0
        for item in list1:
            sum+=item
        return sum/len(list1)
    def var(self,list1):
        k=self.avg(list1)
        sum =0
        for item in list1:
            sum+= (item - k)**2
        return sum/len(list1)
    def cov(self,list1,list2):
        if len(list1)!= len(list2):
            return('X and y should have same length')
        else:
            xavg=self.avg(list1)
            yavg=self.avg(list2)
            sum=0
            for i in range(len(list1)):
                sum+= (list1[i]-xavg)*(list2[i]-yavg)
            return sum/(len(list1))


In [126]:
model= LinearRegression([1,2,3],[1,2,3])
model.var(model.x)

0.6666666666666666

In [127]:
model.cov(model.x,model.y)

0.6666666666666666
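
The two outputs above agree because the covariance of a list with itself is its variance. The standalone sketch below illustrates this and also shows a negative covariance. The function name `covariance` is an assumption, and unlike the `cov` method above it raises a `ValueError` on mismatched lengths instead of returning a message string.

```python
def covariance(xs, ys):
    # Population covariance: mean of the products of paired deviations.
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n


same = covariance([1, 2, 3], [1, 2, 3])      # equals the variance of [1, 2, 3]
opposite = covariance([1, 2, 3], [3, 2, 1])  # y falls as x rises, so negative
print(same, opposite)
```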

#### Exercise 6: In the above defined Linear Regression class, add a method named 'fit' which takes two values as input (x, y) and returns the estimated coefficients.

**Hint:**

- Equation of the line: $y = b_{0} + b_{1} x$
- The estimated coefficients, i.e. the values of $b_{0}$ and $b_{1}$, are calculated as follows:
    - $b_{1} = \mathrm{covariance}(x, y) / \mathrm{variance}(x)$ and
    - $b_{0} = \mathrm{mean}(y) - b_{1} \cdot \mathrm{mean}(x)$
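
As a worked example of the two formulas above, the sketch below fits four points that lie exactly on y = 2x + 1, so the expected slope is 2 and the expected intercept is 1. The helper name `simple_fit` is an assumption for illustration. It returns the coefficients in the order (b1, b0).

```python
def simple_fit(xs, ys):
    # Means of x and y
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Population covariance of (x, y) and population variance of x
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_mean) ** 2 for x in xs) / n
    # Slope, then intercept
    b1 = cov_xy / var_x
    b0 = y_mean - b1 * x_mean
    return b1, b0


b1, b0 = simple_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(b1, b0)  # 2.0 1.0
```

Here covariance(x, y) = 2.5 and variance(x) = 1.25, so b1 = 2.5 / 1.25 = 2.0 and b0 = 6.0 - 2.0 * 2.5 = 1.0.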

In [128]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        sum=0
        for item in list1:
            sum+=item
        return sum/len(list1)
    def var(self,list1):
        k=self.avg(list1)
        sum =0
        for item in list1:
            sum+= (item - k)**2
        return sum/len(list1)
    def cov(self,list1,list2):
        if len(list1)!= len(list2):
            return('X and y should have same length')
        else:
            xavg=self.avg(list1)
            yavg=self.avg(list2)
            sum=0
            for i in range(len(list1)):
                sum+= (list1[i]-xavg)*(list2[i]-yavg)
            return sum/(len(list1))
    def fit(self,list1,list2):
        if len(list1)!= len(list2):
            return('X and y should have same length')
        else:
            b1=self.cov(list1,list2)/self.var(list1)
            b0=self.avg(list2)-b1*self.avg(list1)
            return b1,b0


#### Exercise 7: In the Linear Regression class defined above, add a method named `predict` which takes two values as input (x, y) and returns the predicted values.

**Hint:** Substitute the estimated coefficient values calculated above into the equation of the line, i.e. $y = b_{0} + b_{1} x$

In [129]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        sum=0
        for item in list1:
            sum+=item
        return sum/len(list1)
    def var(self,list1):
        k=self.avg(list1)
        sum =0
        for item in list1:
            sum+= (item - k)**2
        return sum/len(list1)
    def cov(self,list1,list2):
        if len(list1)!= len(list2):
            return('X and y should have same length')
        else:
            xavg=self.avg(list1)
            yavg=self.avg(list2)
            sum=0
            for i in range(len(list1)):
                sum+= (list1[i]-xavg)*(list2[i]-yavg)
            return sum/(len(list1))
    def fit(self,list1,list2):
        if len(list1)!= len(list2):
            return('X and y should have same length')
        else:
            b1=self.cov(list1,list2)/self.var(list1)
            b0=self.avg(list2)-b1*self.avg(list1)
        return b1,b0
    def predict(self,list1,list2):
        b1,b0 = self.fit(list1,list2)
        y_pred = (np.ones(len(list1))*b0) + (b1*list1)
        return y_pred
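
The `predict` method above refits the model on every call and therefore needs `y` as an input. A common alternative design, sketched here under a hypothetical class name `StoredLinearRegression`, has `fit` store the coefficients on the object so that `predict` needs only new `x` values. This is an illustration of the design choice, not the solution the exercise asks for.

```python
class StoredLinearRegression:
    """Hypothetical variant: fit() stores coefficients, predict() reuses them."""

    def fit(self, xs, ys):
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
        n = len(xs)
        x_mean = sum(xs) / n
        y_mean = sum(ys) / n
        cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n
        var_x = sum((x - x_mean) ** 2 for x in xs) / n
        # Keep the coefficients as data attributes for later use.
        self.b1 = cov_xy / var_x
        self.b0 = y_mean - self.b1 * x_mean
        return self.b1, self.b0

    def predict(self, xs):
        # Evaluate the fitted line at each new x value.
        return [self.b0 + self.b1 * x for x in xs]


demo = StoredLinearRegression()
demo.fit([1, 2, 3, 4], [3, 5, 7, 9])   # points on y = 2x + 1
print(demo.predict([5, 6]))            # [11.0, 13.0]
```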

#### Data

The dataset chosen for this experiment is **Pizza Franchise** dataset. The dataset contains the following data

X = annual franchise fee ($1000)

Y = start up cost ($1000) for a pizza franchise

Download the dataset [here](https://cdn.iisc.talentsprint.com/CDS/Datasets/pizza.csv)

#### Exercise 8: Using the above defined class LinearRegression, calculate the Estimated coefficients, fit the model, and predict the values on the Pizza Franchise dataset.

In [130]:
# YOUR CODE HERE

In [131]:
df=pd.read_csv('pizza.csv')

In [132]:
df.head()

|   | X | Y |
|---|---|---|
| 0 | 1000 | 1050 |
| 1 | 1125 | 1150 |
| 2 | 1087 | 1213 |
| 3 | 1070 | 1275 |
| 4 | 1100 | 1300 |


In [133]:
x=df.X
y=df.Y

In [134]:
model=LinearRegression(x,y)

In [135]:
model.fit(model.x,model.y)

(0.3731579359288647, 867.6042222620562)

In [136]:
ypred=model.predict(model.x,model.y)

In [137]:
ypred[0:5]

0    1240.762158
1    1287.406900
2    1273.226899
3    1266.883214
4    1278.077952
Name: X, dtype: float64

In [138]:
y[0:5]

0    1050
1    1150
2    1213
3    1275
4    1300
Name: Y, dtype: int64

#### Exercise 9: In the Linear Regression class defined above, add a method named `RMSE` which takes two values as input (x, y) and returns the root-mean-square error between the predicted and actual values.

**Hint:**

- [How to calculate RMSE value](https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e)

In [139]:
# YOUR CODE HERE
class LinearRegression:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self) -> str:
        return 'fits the given x and y with minimum error between the actual and predicted'
    def avg(self,list1):
        # mean = sum of the values / number of values
        total=0
        for item in list1:
            total+=item
        return total/len(list1)
    def var(self,list1):
        # population variance: mean squared deviation from the mean
        k=self.avg(list1)
        total=0
        for item in list1:
            total+= (item - k)**2
        return total/len(list1)
    def cov(self,list1,list2):
        # population covariance of two equal-length sequences
        if len(list1)!= len(list2):
            raise ValueError('x and y should have the same length')
        xavg=self.avg(list1)
        yavg=self.avg(list2)
        total=0
        for i in range(len(list1)):
            total+= (list1[i]-xavg)*(list2[i]-yavg)
        return total/len(list1)
    def fit(self,list1,list2):
        # slope b1 = cov(x, y) / var(x); intercept b0 = mean(y) - b1 * mean(x)
        if len(list1)!= len(list2):
            raise ValueError('x and y should have the same length')
        b1=self.cov(list1,list2)/self.var(list1)
        b0=self.avg(list2)-b1*self.avg(list1)
        return b1,b0
    def predict(self,list1,list2):
        # refit on (list1, list2), then evaluate b0 + b1 * x for every x in list1
        b1,b0 = self.fit(list1,list2)
        y_pred = (np.ones(len(list1))*b0) + (b1*list1)
        return y_pred
    def RMSE(self,x,y):
        # root-mean-square error between the predictions and the actual y values
        y_pred=self.predict(x,y)
        total=0
        for i in range(len(y_pred)):
            total += ((y_pred[i]-y[i])**2)
        return (total/len(y_pred))**0.5

In [140]:
model=LinearRegression(x,y)
model.fit(model.x,model.y)

(0.3731579359288647, 867.6042222620562)

In [141]:
ypred=model.predict(model.x,model.y)

In [142]:
model.RMSE(model.x,model.y)

107.50949652541532
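
RMSE is the square root of the mean of the squared differences between predicted and actual values, so it is expressed in the same units as `y`. The standalone sketch below computes it for plain Python lists. The function name `rmse` is an assumption, and unlike the `RMSE` method above it takes the predictions directly instead of refitting the model.

```python
import math


def rmse(actual, predicted):
    # Root of the mean squared difference between paired values.
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


perfect = rmse([1, 2, 3], [1, 2, 3])   # identical lists give zero error
off = rmse([0, 0], [3, 4])             # sqrt((9 + 16) / 2) = sqrt(12.5)
print(perfect, off)
```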

#### Data

The dataset chosen for this exercise is **List Price Vs. Best Price for a New GMC Pickup** dataset. The dataset contains the following data

X = List price (in $1000) for a GMC pickup truck

Y = Best price (in $1000) for a GMC pickup truck

Download the dataset [here](https://cdn.iisc.talentsprint.com/CDS/Datasets/gmc.csv)

#### Exercise 10: Using the class LinearRegression defined above,

- calculate the estimated coefficients, fit the model, and predict the values on the List Price Vs. Best Price for a New GMC Pickup dataset.
- calculate the RMSE between the predicted and actual values of the List Price Vs. Best Price for a New GMC Pickup dataset using the method defined above.

In [143]:
# YOUR CODE HERE
df=pd.read_csv('gmc.csv')
x=df.X
y=df.Y

In [144]:
lr=LinearRegression(x,y)

In [145]:
lr.fit(lr.x, lr.y)

(0.8511440378638508, 0.4345844908253085)

In [146]:
y_pred=lr.predict(lr.x, lr.y)

In [147]:
y_pred

0     10.988770
1     12.605944
2     12.776173
3     13.116630
4     14.138004
5     14.818918
6     14.478461
7     13.542202
8     14.904033
9     15.670062
10    16.436092
11    17.712808
12    19.500211
13    16.946779
14    13.627317
15    14.648691
16    15.159376
17    16.095634
18    16.776551
19    15.244490
20    17.031893
21    17.202123
22    18.478839
Name: X, dtype: float64

In [148]:
lr.RMSE(lr.x, lr.y)

0.10868338378542268

### Optional

* Use the `LinearRegression` class from scikit-learn (`sklearn.linear_model`) to determine the coefficients for the above problems.
* Compare the coefficients from the OOP-based implementation with those from scikit-learn's `LinearRegression`.

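In [149]:
# Alias the import so it does not shadow the LinearRegression class defined above
from sklearn.linear_model import LinearRegression as SkLinearRegression

In [150]:
sklr=SkLinearRegression()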

In [151]:
sklr.fit(np.array(x).reshape(-1,1),y)

LinearRegression()

In [152]:
sklr.coef_

array([0.85114404])

In [153]:
sklr.intercept_

0.4345844908253138

In [155]:
sklr.predict(np.array(x).reshape(-1,1))

array([10.98877024, 12.60594439, 12.77617304, 13.11663033, 14.13800382,
       14.81891841, 14.47846112, 13.54220235, 14.90403313, 15.67006245,
       16.43609176, 17.71280781, 19.50021062, 16.9467785 , 13.62731708,
       14.64869057, 15.1593757 , 16.09563446, 16.77655066, 15.24449043,
       17.03189323, 17.20212268, 18.47883874])

In [158]:
np.array(y).reshape(1,-1)

array([[11.19999981, 12.5       , 12.69999981, 13.10000038, 14.10000038,
        14.80000019, 14.39999962, 13.39999962, 14.89999962, 15.60000038,
        16.39999962, 17.70000076, 19.60000038, 16.89999962, 14.        ,
        14.60000038, 15.10000038, 16.10000038, 16.79999924, 15.19999981,
        17.        , 17.20000076, 18.60000038]])

## scikit-learn coefficients
b1 = 0.85114404, b0 = 0.4345844908253138
## OOP-based implementation's coefficients
b1 = 0.8511440378638508, b0 = 0.4345844908253085
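
The two sets of coefficients differ only in the last printed digits. One way to make the comparison explicit is to test them for closeness within a tolerance rather than for exact equality. The sketch below reuses the values reported above. The variable names and the tolerance `rel_tol=1e-6` are assumptions chosen for illustration.

```python
import math

# Coefficients copied from the outputs above.
oop_b1, oop_b0 = 0.8511440378638508, 0.4345844908253085
sk_b1, sk_b0 = 0.85114404, 0.4345844908253138

# Floating-point results from different algorithms rarely match bit for bit,
# so compare within a relative tolerance instead of using ==.
rel_tol = 1e-6
slopes_match = math.isclose(oop_b1, sk_b1, rel_tol=rel_tol)
intercepts_match = math.isclose(oop_b0, sk_b0, rel_tol=rel_tol)
print(slopes_match, intercepts_match)  # True True
```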
