# Let's try our hand at Multivariate Linear Regression
## Using the `make_regression` function as data
Note: we do not split the dataset into training and test sets, since this is just an illustration of the `make_regression` function.
Also note that we can pass NumPy ndarrays straight into the regression model and skip the pandas DataFrame conversion.

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn import linear_model

# Generate our random regression data
# We unpack the tuple returned by make_regression (three values, because coef=True)
# X and y are our training examples, where X holds the features (i.e. independent variables)
# and y is the dependent variable
# coef holds the coefficients of the underlying linear model that generated the data
X, y, coef = make_regression(n_samples=100, n_features=5,
                          n_informative=1, noise=10,
                          coef=True, random_state=0)
coef 

array([ 45.70587613,   0.        ,   0.        ,   0.        ,   0.        ])

In [2]:
X.shape

(100, 5)

In [3]:
y.shape

(100,)

In [4]:
reg = linear_model.LinearRegression()
# no need to load the data into a pandas dataframe, since it's already nicely generated for us
reg.fit(X, y)

# this prints out the coefficients of our independent variables (i.e. features). 5 features --> 5 coefficients
reg.coef_ 

array([  4.58448502e+01,  -8.45538675e-03,  -2.38412637e+00,
         5.91382877e-01,  -6.86280983e-01])

In [5]:
reg.intercept_ #our y intercept

-2.0738422536129955

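Food for thought: Why are the coefficients printed in cell 1 different from those in cell 4?

ANSWER: Because we introduced noise (`noise=10`) in the `make_regression` function! The model is fitted to the noisy `y`, so its estimates only approximate the true coefficients.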
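To make the effect of the noise concrete, here is a minimal sketch that repeats the experiment with and without noise. The helper name `fit_and_compare` is made up for this illustration; everything else uses the same calls as the cells above. With `noise=0` the fitted coefficients should match the true ones up to floating-point error, while with `noise=10` they should be visibly off.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def fit_and_compare(noise):
    """Hypothetical helper: generate data with the given noise level
    and return (true coefficients, fitted coefficients)."""
    X, y, true_coef = make_regression(n_samples=100, n_features=5,
                                      n_informative=1, noise=noise,
                                      coef=True, random_state=0)
    fitted_coef = LinearRegression().fit(X, y).coef_
    return true_coef, fitted_coef

true_clean, fitted_clean = fit_and_compare(noise=0.0)
true_noisy, fitted_noisy = fit_and_compare(noise=10.0)

# Largest gap between a true and a fitted coefficient, per noise level
print("max abs error, noise=0: ", np.max(np.abs(fitted_clean - true_clean)))
print("max abs error, noise=10:", np.max(np.abs(fitted_noisy - true_noisy)))
```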

In [6]:
reg.get_params()  
#fetches the parameters of the linear regression model estimator, in case you want to check your model's 'settings'

{'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}

In [7]:
# reg.predict()
predict_for_me = np.array([[5,5,5,5,5]]) # Create a 1x5 dimension array
predict_for_me.shape

(1, 5)

In [8]:
reg.predict(predict_for_me)

array([ 214.7130094])
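What `predict` does for a linear model can be checked by hand: it is the dot product of the sample with `coef_`, plus `intercept_`. The sketch below regenerates the data so that it runs on its own; the variable names `x_new` and `manual` are chosen for this illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, n_informative=1,
                       noise=10, random_state=0)
reg = LinearRegression().fit(X, y)

x_new = np.array([[5, 5, 5, 5, 5]])           # one sample, five features
manual = x_new @ reg.coef_ + reg.intercept_   # dot product plus intercept

# Both values should agree up to floating-point error
print(manual, reg.predict(x_new))
```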

## Using the `load_linnerud` function as data

In [9]:
from sklearn.datasets import load_linnerud
from sklearn.metrics import mean_squared_error, r2_score
import pprint 
dataset = load_linnerud()
pprint.pprint(dataset)

{'DESCR': 'Linnerrud dataset\n'
          '\n'
          'Notes\n'
          '-----\n'
          'Data Set Characteristics:\n'
          '    :Number of Instances: 20\n'
          '    :Number of Attributes: 3\n'
          '    :Missing Attribute Values: None\n'
          '\n'
          'The Linnerud dataset constains two small dataset:\n'
          '\n'
          '- *exercise*: A list containing the following components: exercise '
          'data with\n'
          '  20 observations on 3 exercise variables: Weight, Waist and '
          'Pulse.\n'
          '\n'
          '- *physiological*: Data frame with 20 observations on 3 '
          'physiological variables:\n'
          '   Chins, Situps and Jumps.\n'
          '\n'
          'References\n'
          '----------\n'
          '  * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. '
          'Paris: Editions Technic.\n',
 'data': array([[   5.,  162.,   60.],
       [   2.,  110.,   60.],
       [  12.,  101.,  101

In [10]:
print(dataset.DESCR)

Linnerrud dataset

Notes
-----
Data Set Characteristics:
    :Number of Instances: 20
    :Number of Attributes: 3
    :Missing Attribute Values: None

The Linnerud dataset constains two small dataset:

- *exercise*: A list containing the following components: exercise data with
  20 observations on 3 exercise variables: Weight, Waist and Pulse.

- *physiological*: Data frame with 20 observations on 3 physiological variables:
   Chins, Situps and Jumps.

References
----------
  * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris: Editions Technic.



In [11]:
print(dataset.data.shape)

(20, 3)


In [12]:
print(dataset.target.shape)

(20, 3)


In [13]:
# We can put them in a table format if we want to have a feel of how it looks
dfx = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
dfx.head()

| | Chins | Situps | Jumps |
|---|---|---|---|
| 0 | 5.0 | 162.0 | 60.0 |
| 1 | 2.0 | 110.0 | 60.0 |
| 2 | 12.0 | 101.0 | 101.0 |
| 3 | 12.0 | 105.0 | 37.0 |
| 4 | 13.0 | 155.0 | 58.0 |


In [14]:
dfy = pd.DataFrame(data=dataset.target, columns=dataset.target_names)
dfy.head()

| | Weight | Waist | Pulse |
|---|---|---|---|
| 0 | 191.0 | 36.0 | 50.0 |
| 1 | 189.0 | 37.0 | 52.0 |
| 2 | 193.0 | 38.0 | 58.0 |
| 3 | 162.0 | 35.0 | 62.0 |
| 4 | 189.0 | 35.0 | 46.0 |


In [15]:
df = dfx.join(dfy)
df.head()

# So Chins, Situps and Jumps are our features
# and Weight, Waist and Pulse are our targets (i.e. variables to predict)

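| | Chins | Situps | Jumps | Weight | Waist | Pulse |
|---|---|---|---|---|---|---|
| 0 | 5.0 | 162.0 | 60.0 | 191.0 | 36.0 | 50.0 |
| 1 | 2.0 | 110.0 | 60.0 | 189.0 | 37.0 | 52.0 |
| 2 | 12.0 | 101.0 | 101.0 | 193.0 | 38.0 | 58.0 |
| 3 | 12.0 | 105.0 | 37.0 | 162.0 | 35.0 | 62.0 |
| 4 | 13.0 | 155.0 | 58.0 | 189.0 | 35.0 | 46.0 |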


In [16]:
df_train=df.sample(frac=0.8,random_state=200) 
df_score=df.drop(df_train.index)
print(df_train.shape, df_score.shape) #unfortunately, this dataset is quite small...

(16, 6) (4, 6)


In [17]:
# Let's reshape this into our features and targets
train_x = df_train[['Chins', 'Situps','Jumps']].values.reshape(-1, 3)  #the number of rows is not specified, hence -1
train_y = df_train[['Weight', 'Waist','Pulse']].values.reshape(-1, 3)  #Y
print(train_x.shape, train_y.shape)

score_x = df_score[['Chins', 'Situps','Jumps']].values.reshape(-1, 3)  #the number of rows is not specified, hence -1
score_y = df_score[['Weight', 'Waist','Pulse']].values.reshape(-1, 3)  #Y
print(score_x.shape, score_y.shape)

(16, 3) (16, 3)
(4, 3) (4, 3)
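As an alternative to sampling and dropping rows in a dataframe, the same 80/20 split can be sketched with scikit-learn's `train_test_split`, working directly on the ndarrays. This is a swapped-in technique rather than what the cells above do, and it will generally select different rows than `df.sample`, even with the same `random_state`.

```python
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split

dataset = load_linnerud()

# Split features and targets together so that rows stay aligned
train_x, score_x, train_y, score_y = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=200)

print(train_x.shape, train_y.shape)
print(score_x.shape, score_y.shape)
```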


In [18]:
# The sklearn dataset already ships as ndarrays, separated from the feature/target names,
# so the dataframe above was only needed for a readable view and for the train/score split.
# We fit on the training arrays produced by that split.

reg = linear_model.LinearRegression()
reg.fit(train_x, train_y)

# this prints out the coefficients of our independent variables (i.e. features).
# 3 targets x 3 features --> a 3x3 array, one row per target
reg.coef_ 

array([[-0.2796632 , -0.30945104,  0.32228343],
       [-0.12653769, -0.05046425,  0.06380484],
       [ 0.62135825,  0.01865561, -0.07862093]])

In [19]:
prediction = reg.predict(score_x)

mean_squared_error(score_y, prediction) # the signature is (y_true, y_pred); MSE is symmetric, so the value is the same either way

314.6571307806442

In [20]:
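# Caution: the signature is r2_score(y_true, y_pred) and R^2 is not symmetric.
# With the arguments in this order, the value below is not the usual R^2 of the predictions;
# r2_score(score_y, prediction) would give that.
r2_score(prediction,score_y)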



-3.7033868140054853
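Argument order matters for R² because the score is measured against the variance of the true values. The sketch below computes R² by hand for a single target to show the asymmetry. The function name `r_squared` and the numbers are hypothetical, chosen only for illustration. For several targets, `r2_score` averages the per-target scores by default.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical values, not taken from the Linnerud data
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

correct_order = r_squared(y_true, y_pred)
swapped_order = r_squared(y_pred, y_true)

# The residuals are identical, but the baseline variance differs, so the scores differ
print(correct_order, swapped_order)
```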

In [21]:
reg.intercept_ #Note the 3 intercepts, one for each target

array([ 206.78149037,   39.85959778,   54.26568249])
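With several targets, `coef_` has one row per target and `intercept_` one entry per target, so a prediction is a matrix product with the transposed coefficients. The following sketch checks this on the full Linnerud data (no split, to keep it short); `manual` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.linear_model import LinearRegression

dataset = load_linnerud()
X, Y = dataset.data, dataset.target   # 20 rows each, 3 features and 3 targets

reg = LinearRegression().fit(X, Y)

# coef_ is laid out as (n_targets, n_features), intercept_ as (n_targets,)
print(reg.coef_.shape, reg.intercept_.shape)

# Each column of the result is the prediction for one target
manual = X @ reg.coef_.T + reg.intercept_
print(np.allclose(manual, reg.predict(X)))
```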

# Gradient Descent, Yo!
## Applying it to a univariate function

$f(x)=x^{4}-3x^{3}+2$ has the derivative $f'(x)=4x^{3}-9x^{2}$
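Before running the algorithm, it helps to know what answer to expect. Setting the derivative to zero gives the stationary points, and the second derivative tells us which one is a minimum:

$$
f'(x) = 4x^{3} - 9x^{2} = x^{2}(4x - 9) = 0 \quad\Longrightarrow\quad x = 0 \ \text{ or } \ x = \tfrac{9}{4} = 2.25
$$

$$
f''(x) = 12x^{2} - 18x, \qquad f''\!\left(\tfrac{9}{4}\right) = \tfrac{243}{4} - \tfrac{81}{2} = \tfrac{81}{4} > 0
$$

So $x = 2.25$ is a local minimum. At $x = 0$ the second derivative is zero and $f'$ does not change sign (it is negative on both sides), so that point is an inflection point rather than a minimum.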

In [22]:
current_x = 6 #starting x variable
alpha = 0.01 #learning rate
precision = 0.000001  # the limit of the smallest step that we want the algorithm to take.
previous_step_size = current_x # any value larger than precision works here; it only ensures the loop runs at least once

der = lambda x: 4 * x**3 - 9 * x**2

while previous_step_size > precision: #stop once the step size reaches our precision cut-off
    prev_x = current_x
    current_x += - alpha * der(prev_x)  # this is the same as current_x = current_x - alpha * der(prev_x)
    previous_step_size = abs(current_x - prev_x)
    
print("The local minimum occurs at %f" % current_x)

The local minimum occurs at 2.249996
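The loop above can also be wrapped in a small reusable function. This is only a sketch: the function name `gradient_descent` and the `max_iters` safety cap are additions for this illustration and are not part of the cell above. The cap guards against an endless loop if the learning rate is too large for the iteration to converge.

```python
def gradient_descent(derivative, start_x, alpha=0.01, precision=1e-6, max_iters=10_000):
    """Step against the derivative until the step size drops below `precision`.

    `max_iters` is a hypothetical safety cap so the loop always terminates.
    """
    x = start_x
    for _ in range(max_iters):
        step = alpha * derivative(x)
        x -= step
        if abs(step) < precision:
            break
    return x

def der(x):
    # Derivative of f(x) = x**4 - 3*x**3 + 2
    return 4 * x**3 - 9 * x**2

x_min = gradient_descent(der, start_x=6)
print("The local minimum occurs at %f" % x_min)
```

Starting from the same point with the same learning rate, the result should land close to the analytical minimum at $x = 2.25$.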
