## Linear Regression using TensorFlow

In this session, we will build a linear regression model using TensorFlow 2.0. We assume that you already know the basics of linear regression and have seen it implemented with sklearn. Here we will work out how to do the same with the TensorFlow tools we have learnt.


<i>Note that to keep this exercise simple and focused on tensorflow and its relevant functions, we will make a very simple model with very basic preprocessing.</i> 

#### Let us start by mounting our drive so that we can use the dataset stored in it.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


#### Now, let us install TensorFlow 2.0.


In [0]:
!pip install tensorflow==2.0.0

##### Check the version of the installation

In [2]:
import tensorflow as tf
print(tf.__version__)

2.2.0


Now let us import our data and get it ready for modelling.

In [9]:
import pandas as pd
import numpy as np


cars_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/data/usedcars.csv')
cars_data

Unnamed: 0,year,model,price,mileage,color,transmission
0,2011,SEL,21992,7413,Yellow,AUTO
1,2011,SEL,20995,10926,Gray,AUTO
2,2011,SEL,19995,7351,Silver,AUTO
3,2011,SEL,17809,11613,Gray,AUTO
4,2012,SE,17500,8367,White,AUTO
...,...,...,...,...,...,...
145,2006,SES,6200,95000,Silver,AUTO
146,2002,SE,5995,87003,Red,AUTO
147,2000,SE,5980,96841,Red,AUTO
148,2001,SE,4899,151479,Yellow,AUTO


In [10]:
cars_data.isnull().sum()

year            0
model           0
price           0
mileage         0
color           0
transmission    0
dtype: int64

In [11]:
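# Just some basic preprocessing.
# Note: drop_duplicates() returns a new DataFrame. The result is not assigned here,
# so cars_data is left unchanged; use cars_data = cars_data.drop_duplicates() to apply it.
cars_data.drop_duplicates()
# Create dummy (one-hot) variables for the categorical features.
cars_data = pd.get_dummies(cars_data)
# Convert the dataset to float32 so that it can be converted into tensors later.
cars_data = cars_data.astype('float32')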
cars_data


Unnamed: 0,year,price,mileage,model_SE,model_SEL,model_SES,color_Black,color_Blue,color_Gold,color_Gray,color_Green,color_Red,color_Silver,color_White,color_Yellow,transmission_AUTO,transmission_MANUAL
0,2011.0,21992.0,7413.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,2011.0,20995.0,10926.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,2011.0,19995.0,7351.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,2011.0,17809.0,11613.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2012.0,17500.0,8367.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,2006.0,6200.0,95000.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
146,2002.0,5995.0,87003.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
147,2000.0,5980.0,96841.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
148,2001.0,4899.0,151479.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
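
To see what these preprocessing calls do in isolation, here is a minimal sketch on a tiny made-up frame. The `toy` data and the variable names are hypothetical and are not taken from `usedcars.csv`; the sketch only assumes pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from usedcars.csv.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012],
    "color": ["Red", "Red", "Blue"],
})

# drop_duplicates() returns a new DataFrame, so the result must be assigned.
deduped = toy.drop_duplicates()

# get_dummies() replaces each categorical column with one indicator column
# per category; astype("float32") then makes every column a float.
encoded = pd.get_dummies(deduped).astype("float32")

print(len(toy), len(deduped))
print(list(encoded.columns))
print(encoded.dtypes)
```

The indicator columns are named `<column>_<category>`, which is where names such as `color_Red` and `transmission_AUTO` in the output above come from.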


In [12]:
# Exploring column names
cars_data.columns

Index(['year', 'price', 'mileage', 'model_SE', 'model_SEL', 'model_SES',
       'color_Black', 'color_Blue', 'color_Gold', 'color_Gray', 'color_Green',
       'color_Red', 'color_Silver', 'color_White', 'color_Yellow',
       'transmission_AUTO', 'transmission_MANUAL'],
      dtype='object')

In [14]:
#getting the features and labels and finally splitting the test and train data.


from sklearn.model_selection import train_test_split
X = cars_data[['year','mileage', 'model_SE', 'model_SEL', 'model_SES',
       'color_Black', 'color_Blue', 'color_Gold', 'color_Gray', 'color_Green',
       'color_Red', 'color_Silver', 'color_White', 'color_Yellow',
       'transmission_AUTO', 'transmission_MANUAL']]
Y = cars_data['price']


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

# Let us scale the data, as the features are on different scales, which might be a problem while modelling.
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
# MinMaxScaler has been used here. You can go ahead and use the other scalers available and check the effect on the results.
# Fitting the transform on test and train separately.
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
X_train

array([[0.75      , 0.27928138, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.58332825, 0.5233474 , 0.        , ..., 1.        , 0.        ,
        1.        ],
       [0.75      , 0.21721278, 1.        , ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.83332825, 0.20944397, 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.33332825, 0.6878496 , 1.        , ..., 0.        , 1.        ,
        0.        ],
       [0.83332825, 0.2253431 , 1.        , ..., 0.        , 1.        ,
        0.        ]], dtype=float32)
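
The cell above refits the scaler on the test split. A commonly used alternative is to fit the scaler on the training split only and reuse those statistics for the test split, so that both splits share one mapping from raw values to scaled values. The sketch below illustrates the difference on made-up numbers; the `*_toy` arrays are hypothetical, and the outputs recorded in this notebook come from the original cell, not from this variant.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy splits with a single feature, not the used-cars data.
X_train_toy = np.array([[0.0], [5.0], [10.0]], dtype=np.float32)
X_test_toy = np.array([[2.0], [4.0]], dtype=np.float32)

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training split only.
train_scaled = scaler.fit_transform(X_train_toy)
# Reuse the training minimum and maximum for the test split.
test_scaled = scaler.transform(X_test_toy)

# For comparison: refitting on the test split uses the test minimum and
# maximum, so the same raw value is mapped differently than in training.
test_refit = MinMaxScaler().fit_transform(X_test_toy)

print(test_scaled.ravel())
print(test_refit.ravel())
```

With a shared scaler, a raw value of 2.0 lands at the same scaled position in both splits; with a refitted scaler it depends on whatever range the test split happens to cover.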

In [21]:
y_train.values

array([12990.,  7999., 13845.,  9000., 10979., 14477., 13999., 15298.,
       13997., 11749., 10995., 15995., 10770., 11992., 13995., 10815.,
       15992., 17500., 14495., 13992., 14893., 14989., 10836.,  7488.,
       11754.,  8495.,  9992., 12995.,  9999., 12777., 16992., 10995.,
        9999., 16950., 19995., 11984., 17000.,  9651., 11450., 13950.,
       13995.,  6950.,  7900., 14495., 13992., 13599., 10955., 12849.,
       12500., 12995., 16995., 14992., 14992.,  9995., 14900., 12995.,
       10995.,  3800., 15499., 15889., 13895., 16995.,  8996., 12595.,
       11980.,  6200., 12995.,  4899., 11495., 13950., 15999., 12997.,
       14699., 12999., 14000., 15499., 13425., 14549., 15688.,  7995.,
       10000.,  6980., 14355., 15980., 13799., 17809.,  6980., 14999.,
        5980., 11792., 14480., 10955., 16950., 12995., 16995., 12998.,
       13992., 21992., 20995., 13995., 15899., 13992., 13991., 16995.,
       12995., 12704., 13888., 14299., 12280., 15988., 15500.,  8480.,
      

## George's Challenge
- What's wrong?

In [0]:
# Let us now convert the data into tensors, since TensorFlow operations take tensors as input.
# X_train and X_test were already converted to NumPy arrays by the scaler, while y_train and y_test are still pandas Series, so we pass their .values.
X_train=tf.convert_to_tensor(X_train)
y_train=tf.convert_to_tensor(y_train.values)
X_test=tf.convert_to_tensor(X_test)
y_test=tf.convert_to_tensor(y_test.values)

In [23]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

(TensorShape([120, 16]),
 TensorShape([120]),
 TensorShape([30, 16]),
 TensorShape([30]))

#### Let us try modelling now. We will use a few concepts covered in the practice exercise shared with the course material.

In [0]:
input_dim = X_train.shape[1]  #16
output_dim = 1
learning_rate = 0.01

# Let us initialize the weights and bias variables. 
weights = tf.Variable(tf.zeros(shape=(input_dim, output_dim), dtype= tf.float32))
bias = tf.Variable(tf.ones(shape=(output_dim,), dtype= tf.float32))

def predict(features):
  return tf.matmul(features, weights) + bias # note that the matmul is matrix multiplication and is needed for calculating predictions
  # features (None, 16)
  # weights (16, 1)
  # bias (1,)
  # (None, 1)

def compute_loss(y_true, predictions):
  return tf.reduce_mean(tf.square(y_true - predictions)) # mean square error

# Let us now define a function to train the model. It calls the functions defined above.
def train(x, y):
  with tf.GradientTape() as tape:
    predictions = predict(x)
    loss = compute_loss(y, predictions)
  # Compute the gradients after leaving the tape context; a list of variables can be passed here.
  dloss_dw, dloss_db = tape.gradient(loss, [weights, bias])
  weights.assign_sub(learning_rate * dloss_dw) 
  bias.assign_sub(learning_rate * dloss_db)
  return loss
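
One detail worth checking in `compute_loss` is the shape of its two arguments. In this notebook `y_train` has shape `(120,)` while `predict(x)` returns shape `(120, 1)`, so `y_true - predictions` is subject to broadcasting. The sketch below uses small made-up tensors to show what broadcasting does in that situation, and one possible way (reshaping the labels into a column) to get per-example differences instead. The values and the name `y_column` are illustrative assumptions, and the loss values recorded in this notebook were produced by the original code.

```python
import tensorflow as tf

# Hypothetical toy values standing in for labels and model outputs.
y_true = tf.constant([1.0, 2.0, 3.0])              # shape (3,), like y_train
predictions = tf.constant([[1.0], [2.0], [3.0]])   # shape (3, 1), like predict(x)

# Broadcasting expands (3,) against (3, 1) into a (3, 3) matrix
# containing the difference for every label/prediction pair.
pairwise = y_true - predictions
loss_broadcast = tf.reduce_mean(tf.square(pairwise))

# Reshaping the labels to (3, 1) gives element-wise differences instead.
y_column = tf.reshape(y_true, (-1, 1))
loss_elementwise = tf.reduce_mean(tf.square(y_column - predictions))

print(pairwise.shape)
print(float(loss_broadcast), float(loss_elementwise))
```

Here the predictions match the labels exactly, so the element-wise loss is zero, while the broadcast version is not, because it also averages over mismatched pairs.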

## George's Challenge
- What's `weights`'s shape?
- How about `features`?
- How many dimensions does `bias` have?
- What's the matrix multiplication result's shape?
- What are we doing with this code `tf.matmul(features, weights) + bias`?
- What's the shape of the resulting array?
- Walk thru this code `tf.reduce_mean(tf.square(y_true - predictions))`
- Difference between `tf.reduce_mean` and `np.mean`?
- What does `assign_sub` do? 

In [0]:
#@title Answer
- `weights`'s shape: (16, 1)
- `features`'s shape: (None, 16)
- `bias` has one dimension
- matmul's result shape: (None, 1)
- same as above
-  no difference
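
As a small illustration for the last two questions, the sketch below shows `assign_sub` on a toy variable and compares `tf.reduce_mean` with `np.mean`. For float data like the tensors in this notebook the two means agree; one difference worth knowing is that `tf.reduce_mean` keeps the dtype of its input, so integer inputs give an integer result. All values here are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# assign_sub(delta) subtracts delta from the variable in place (like v -= delta).
v = tf.Variable([1.0, 2.0])
v.assign_sub([0.5, 0.5])

# On float data, tf.reduce_mean and np.mean give the same value.
values = np.array([1.0, 2.0, 4.0], dtype=np.float32)
tf_mean = tf.reduce_mean(tf.constant(values))
np_mean = np.mean(values)

# On integer data, tf.reduce_mean returns an integer tensor,
# while np.mean returns a float.
int_mean_tf = tf.reduce_mean(tf.constant([1, 2]))
int_mean_np = np.mean(np.array([1, 2]))

print(v.numpy())
print(float(tf_mean), float(np_mean))
print(int_mean_tf.dtype, int_mean_np)
```

In `train`, `weights.assign_sub(learning_rate * dloss_dw)` is therefore the gradient descent update: it moves the weights a small step against the gradient without creating a new variable.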

#### Let us now call the train function for 50 epochs

In [25]:
for epoch in range(50):
  loss = train(X_train, y_train)
  print('Epoch %d: Loss = %.4f' % (epoch, float(loss)))


print('Final Weights after 50 epochs:')
print('###############################################################################')
print(weights)

print('Final Bias after 50 epochs:')
print('###############################################################################')
print(bias)

Epoch 0: Loss = 176731152.0000
Epoch 1: Loss = 157933792.0000
Epoch 2: Loss = 141282512.0000
Epoch 3: Loss = 126532168.0000
Epoch 4: Loss = 113465600.0000
Epoch 5: Loss = 101890464.0000
Epoch 6: Loss = 91636392.0000
Epoch 7: Loss = 82552456.0000
Epoch 8: Loss = 74505032.0000
Epoch 9: Loss = 67375680.0000
Epoch 10: Loss = 61059544.0000
Epoch 11: Loss = 55463732.0000
Epoch 12: Loss = 50505960.0000
Epoch 13: Loss = 46113356.0000
Epoch 14: Loss = 42221356.0000
Epoch 15: Loss = 38772788.0000
Epoch 16: Loss = 35716996.0000
Epoch 17: Loss = 33009130.0000
Epoch 18: Loss = 30609448.0000
Epoch 19: Loss = 28482746.0000
Epoch 20: Loss = 26597854.0000
Epoch 21: Loss = 24927156.0000
Epoch 22: Loss = 23446194.0000
Epoch 23: Loss = 22133296.0000
Epoch 24: Loss = 20969274.0000
Epoch 25: Loss = 19937130.0000
Epoch 26: Loss = 19021804.0000
Epoch 27: Loss = 18209964.0000
Epoch 28: Loss = 17489794.0000
Epoch 29: Loss = 16850830.0000
Epoch 30: Loss = 16283806.0000
Epoch 31: Loss = 15780509.0000
Epoch 32: Lo

#### Let us now evaluate our model on the test data and generate predictions for it.

In [27]:
test_predictions = tf.matmul(X_test, weights) + bias
print(compute_loss(y_test, test_predictions))

tf.Tensor(8548196.0, shape=(), dtype=float32)
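
A squared-error loss is expressed in squared price units, which makes it hard to read. Taking the printed value at face value as a mean squared error, a quick way to get a feel for it is to take the square root, which brings it back to the units of `price`. The snippet below is only this arithmetic; the variable names are illustrative.

```python
import math

# Test loss printed above (squared price units).
test_mse = 8548196.0

# The square root brings the error back to the units of price.
test_rmse = math.sqrt(test_mse)
print(round(test_rmse, 1))
```

This works out to a little over 2,900, which can be compared against the typical prices in the dataset.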


In [26]:
predict(X_test)

<tf.Tensor: shape=(30, 1), dtype=float32, numpy=
array([[12966.248 ],
       [12759.732 ],
       [11598.596 ],
       [11823.012 ],
       [12167.281 ],
       [10588.252 ],
       [12544.059 ],
       [13192.689 ],
       [11679.381 ],
       [13533.48  ],
       [13303.452 ],
       [11865.229 ],
       [12738.202 ],
       [12878.569 ],
       [13292.344 ],
       [10183.005 ],
       [12652.132 ],
       [12997.4795],
       [10452.418 ],
       [11850.391 ],
       [12616.388 ],
       [10689.471 ],
       [12982.405 ],
       [12955.671 ],
       [12863.805 ],
       [ 9874.346 ],
       [11986.701 ],
       [12172.322 ],
       [11634.078 ],
       [11628.225 ]], dtype=float32)>

In [0]:
X_test.shape

TensorShape([30, 16])

We learnt how to create a very simple linear regression model on the cars data and used it to predict prices.

We could have done an extensive EDA and improved the model further, but here we have focused on TensorFlow and its operations.

<i>Happy Learning!</i>

## George's Bonus
The big picture!