# System requirements


**Create virtual environment**

*python3.6 -m venv {Virtual Env Name} or {Absolute Path for Virtual Env}*

Ex: python3.6 -m venv azar or python3.6 -m venv Users/azar-0000/python/azar

**cd to venv directory**

*cd azar* or *Users/azar-0000/python/azar*

**Activate virtual environment**

*source bin/activate*

Install basic dependencies.

*pip install numpy scikit-learn notebook pandas matplotlib*

**Open Notebook**

jupyter notebook

<h2><center>All ready, Let's write code</center></h2>

 <h1><center>Linear Regression from scratch</center></h1>

Numpy is the only dependecy we need to run our **OWN** linear regression.

In [20]:
import numpy as np

As we already saw below method is for updating **w & b** in given equation.

In [21]:
def update_w_and_b(spendings, sales, w, b, alpha):
    
    dl_dw = 0.0
    dl_db = 0.0
    
    N = len(spendings)
    for i in range(N):
        dl_dw += -(spendings[i]*(sales[i]-(w*spendings[i])))
        dl_db += -(sales[i]-(w*spendings[i]))
        
    w = w-(1/float(N))*dl_dw*alpha
    b = b-(1/float(N))*dl_db*alpha
    
    return w,b

This method is the captain of the ship, Here we train.<br>
Training means we are just moving towards best(not the best, which is near to the best) values of w and b.

In [22]:
def train(spendings, sales, w, b, alpha, epochs):
    for e in range(epochs):
        w, b = update_w_and_b(spendings, sales,  w, b, alpha)
        print("epoch: ", e, "loss: ", loss(spendings,  sales,  w, b))
    return w, b

How to check our hand made LR(Linear regression) preformance. We are using below method to find loss.

In [23]:
def loss(spendings, sales, w, b):
    N =len(spendings)
    total_error=0.0
    for i in range(N):
        total_error += (sales[i]-(w*spendings[i] + b))**2
        return total_error/float(N)

What is the use of creating model without predict, Here we can predict.

In [24]:
def predict(x, w, b):
    return w*x + b

Let's check our own and lovely Linear Regression.

In [25]:
# spendings = np.array([50, 60, 80, 90, 21])
# sales = np.array([100, 120, 160, 180, 42])

spendings = np.array([2, 4, 6, 8, 10])
sales = np.array([20, 40, 60, 80, 100])

w = 0.0
b = 0.0
w, b = train(spendings, sales, w, b, alpha = 0.01, epochs = 1)

epoch:  0 loss:  22.472


Let's check our model with new input.

In [26]:
x_new = 23
y_new = predict(23, w, b)
y_new

101.8

**Oh! Oh! Some thing is not right,** We are getting 203. We should get 230 or some value near to that.

Rewind today session and try to change paramaters by your own.
If you couldn't find, don't worry.

Ask us. We'll help you.

**Don't forget, Machine learning is all about hyper parameters**

# Linear Regressing using scikit-learn : hourly wages dataset

## Import necessary libraries

In [27]:
import numpy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Load and explore data

In [28]:
dataset=pd.read_csv("data/hourly_wages.csv")

In [29]:
dataset.head()

Unnamed: 0,wage_per_hour,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
0,5.1,0,8,21,35,1,1,0,1,0
1,4.95,0,9,42,57,1,1,0,1,0
2,6.67,0,12,1,19,0,0,0,1,0
3,4.0,0,12,4,22,0,0,0,0,0
4,7.5,0,12,17,35,0,1,0,0,0


In [30]:
dataset.describe(include='all')

Unnamed: 0,wage_per_hour,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
count,534.0,534.0,534.0,534.0,534.0,534.0,534.0,534.0,534.0,534.0
mean,9.024064,0.179775,13.018727,17.822097,36.833333,0.458801,0.655431,0.292135,0.185393,0.044944
std,5.139097,0.38436,2.615373,12.37971,11.726573,0.498767,0.475673,0.45517,0.388981,0.207375
min,1.0,0.0,2.0,0.0,18.0,0.0,0.0,0.0,0.0,0.0
25%,5.25,0.0,12.0,8.0,28.0,0.0,0.0,0.0,0.0,0.0
50%,7.78,0.0,12.0,15.0,35.0,0.0,1.0,0.0,0.0,0.0
75%,11.25,0.0,15.0,26.0,44.0,1.0,1.0,1.0,0.0,0.0
max,44.5,1.0,18.0,55.0,64.0,1.0,1.0,1.0,1.0,1.0


## Prepare training and testing data

In [31]:
#create a dataframe with all training data except the target column
X = dataset.drop(columns=['wage_per_hour'])

#check that the target variable has been removed
X.head()

Unnamed: 0,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
0,0,8,21,35,1,1,0,1,0
1,0,9,42,57,1,1,0,1,0
2,0,12,1,19,0,0,0,1,0
3,0,12,4,22,0,0,0,0,0
4,0,12,17,35,0,1,0,0,0


In [32]:
#create a dataframe with only the target column
Y = dataset[['wage_per_hour']]

#view dataframe
Y.head()

Unnamed: 0,wage_per_hour
0,5.1
1,4.95
2,6.67
3,4.0
4,7.5


In [33]:
(trainX, testX, trainY, testY) = train_test_split(X, Y, test_size=0.25, random_state=42)

In [34]:
print("Number of data in training set ",len(trainX), len(trainY))
print("Number of data in tesing set ",len(testX), len(testY))

Number of data in training set  400 400
Number of data in tesing set  134 134


## Model building

In [35]:
# There are three steps to model something with sklearn
# 1. Set up the model
model = LinearRegression()
# 2. Use fit
for i in range(0,100):
    model.fit(trainX, trainY)
# 3. Check the score
model.score(testX, testY)

0.3087885789792395

## Save and Load the model / Prediction¶

In [36]:
import pickle

In [37]:
# save the model to disk
filename = 'linr_model'
pickle.dump(model, open(filename, 'wb'))

In [38]:
Xnew = [[0, 12, 45, 63, 1, 1, 0, 0, 0]]
ynew = model.predict(Xnew)
ynew

array([[9.59925632]])

In [39]:
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

In [40]:
ynew = loaded_model.predict(Xnew)
ynew

array([[9.59925632]])