## Linear Regression

Here, we investigate how linear regression works. The data comes from the CMU Pronouncing Dictionary, and the task is to predict the number of syllables in a word from its number of characters.

We will use the *massaged_cmudict.txt* file. The first column contains the word itself, the second column contains the number of characters, and the last column contains the number of syllables. Only the second and third columns are used in the analysis.

In [19]:
%%bash
head data/massaged_cmudict.txt

A	1	1
A	1	1
A'S	2	1
A.	1	1
A.'S	2	1
A.S	2	1
A42128	6	6
AA	2	2
AAA	3	3
AABERG	6	2


**numpy** is a Python module for matrix calculations, and many deep learning tools take numpy arrays as input. **keras** is a deep learning library that uses **tensorflow** to train and test models. *keras* is more convenient to use: in tensorflow, you have to work out the input and output dimensions of each layer and create the weight and bias variables yourself, whereas keras automatically takes the output of the previous layer as the input of the next one and generates the weights and biases for you.
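
To make the idea of weights and biases concrete, here is a minimal numpy sketch of the calculation a single dense layer with one input and one output performs. The values of `x`, `W`, and `b` are made up for illustration and are not taken from the trained model.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
x = np.array([[1.0], [2.0], [3.0]])   # 3 samples, 1 feature each -> shape (3, 1)
W = np.array([[0.5]])                 # weight matrix: 1 input, 1 output -> shape (1, 1)
b = np.array([0.1])                   # bias vector: one value per output -> shape (1,)

# A dense layer without an activation computes x @ W + b.
y = x @ W + b
print(y.shape)
print(y.ravel())
```

In keras, these two arrays are created and updated by the layer itself, which is the convenience described above.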

In [3]:
import numpy as np
from keras import optimizers
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import TensorBoard

Using TensorFlow backend.


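Here, the script opens the *massaged_cmudict.txt* file as *f*, reads every line into a list, and assigns the list to the *data* variable. The **with** statement closes the file automatically when the block ends. We then print the first ten items to check the list. Each line still ends with a newline character, which is why the output below is double-spaced.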

In [24]:
with open("data/massaged_cmudict.txt", "r") as f:
    data = f.readlines()  # the with block closes f automatically

# Print the first ten lines; each one still ends with "\n".
for i in range(0, 10):
    print(data[i])

A	1	1

A	1	1

A'S	2	1

A.	1	1

A.'S	2	1

A.S	2	1

A42128	6	6

AA	2	2

AAA	3	3

AABERG	6	2
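
The blank lines in this output come from the newline that **readlines** keeps at the end of every line. The sketch below shows this on an in-memory stand-in for the file (the `sample` text is an assumption, copied from the first rows above), together with one way to strip the newline.

```python
import io

# In-memory stand-in for the data file (tab-separated: word, characters, syllables).
sample = io.StringIO("A\t1\t1\nAA\t2\t2\nAABERG\t6\t2\n")
lines = sample.readlines()

print(repr(lines[0]))            # the trailing "\n" is still part of the string

# rstrip("\n") removes the newline without touching the tabs.
stripped = [line.rstrip("\n") for line in lines]
print(stripped[2])
```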



Next, we create lists for the predictor and the response. The list *X* will contain the number of characters, and *y* will contain the number of syllables. Since the **split** function returns *string* values, we need to convert them to *int*. Finally, we check that *X* and *y* contain the same number of items.

In [5]:
X = []
y = []
for line in data:
    (word, word_len, syll_len) = line.split("\t")
    X.append(int(word_len))
    y.append(int(syll_len))
    
print(len(X))
print(len(y))

133779
133779
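
The loop above assumes that every line has exactly three tab-separated fields. If that is not guaranteed, a more defensive version might look like the sketch below. The helper name `parse_line` and the sample lines are hypothetical and only serve as an illustration.

```python
def parse_line(line):
    """Return (n_chars, n_syllables), or None if the line is malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    try:
        # int() raises ValueError when a field is not a whole number.
        return int(fields[1]), int(fields[2])
    except ValueError:
        return None

# Two well-formed lines and one line without tabs.
sample_lines = ["A\t1\t1\n", "AABERG\t6\t2\n", "broken line\n"]
pairs = [p for p in map(parse_line, sample_lines) if p is not None]
print(pairs)
```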


We now have a total of 133,779 data points. We will split them into three datasets: training, validation, and test. The training dataset is used to fit the model, the validation dataset is used to check the performance of the model during training, and the test dataset is used to evaluate the performance of the final model.

In [7]:
X_train = X[0:100000]
y_train = y[0:100000]

print(len(X_train))

X_val = X[100000:130000]
y_val = y[100000:130000]

print(len(X_val))

X_test = X[130000:]
y_test = y[130000:]

print(len(X_test))

100000
30000
3779
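
Depending on the keras version, plain Python lists may or may not be accepted as training input, so converting them to numpy arrays first is the safer choice. The sketch below is an assumption about how that could be done, not part of the original run. It uses a ten-item toy dataset in place of *X* and *y*, and the `_arr` variable names are hypothetical.

```python
import numpy as np

# Toy stand-ins for the X and y lists built earlier.
X = [1, 1, 2, 1, 2, 2, 6, 2, 3, 6]
y = [1, 1, 1, 1, 1, 1, 6, 2, 3, 2]

# One row per sample, one column per feature.
X_arr = np.asarray(X, dtype="float32").reshape(-1, 1)
y_arr = np.asarray(y, dtype="float32")

# Same slicing idea as above, scaled down to the toy data.
X_train_arr, X_val_arr, X_test_arr = X_arr[:6], X_arr[6:9], X_arr[9:]
print(X_train_arr.shape, X_val_arr.shape, X_test_arr.shape)
```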


There are 100,000 training samples, 30,000 validation samples, and 3,779 test samples. We will build a model with them. **Sequential()** creates an empty model. We then add one layer (**model.add**) which has one input value and one output value. We use **Mean Squared Error (MSE)** as the loss function, so the goal of training is to reach the lowest possible MSE. The **fit** function trains the model, and the **TensorBoard** callback logs the training process so that we can inspect it later. The arguments passed to **fit** are:

batch_size: the number of samples processed in each weight update

epochs: the number of passes over the whole training data

verbose: how much of the training progress is displayed

validation_data: the data used for validation

shuffle: whether the training data is shuffled before each epoch

callbacks: functions called during training, used here to log the results
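
To see what *batch_size* and *epochs* mean in practice, the arithmetic below counts the weight updates for the settings used in the next cell. It assumes that the final, smaller batch of each epoch is also processed.

```python
import math

n_train = 100000    # training samples
batch_size = 256    # samples per weight update
epochs = 10         # full passes over the training data

# The last batch of an epoch is smaller than batch_size, hence the ceiling.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```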

In [20]:
model = Sequential()
model.add(Dense(1, input_dim=1))
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

tensorboard = TensorBoard(log_dir='./logs', histogram_freq = 0, 
                         write_graph=True, write_images=False)

model.fit(X_train, y_train, batch_size = 256, epochs = 10, verbose = 1, validation_data=(X_val, y_val), 
          shuffle = True, callbacks=[tensorboard])

Train on 100000 samples, validate on 30000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8c5e0468d0>

We can see that the MSE on the training data keeps decreasing, while the MSE on the validation data fluctuates up and down. During this process, the model adjusts its weight (slope) and bias (intercept).
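
The following sketch illustrates how a weight and a bias can be adjusted step by step to lower the MSE. It uses plain gradient descent on toy data, which is a simplification: the model above is trained with the Adam optimizer, and the data, learning rate, and number of steps here are made up.

```python
import numpy as np

# Toy data generated from y = 0.5 * x + 1 (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 0.5 * x + 1.0

w, b = 0.0, 0.0          # start from an arbitrary guess
learning_rate = 0.02

for step in range(5000):
    error = (w * x + b) - y
    # Gradients of mean((w*x + b - y)**2) with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Move both parameters a small step against their gradients.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

After enough steps, *w* and *b* end up close to the slope and intercept that generated the toy data.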

With the trained model, we will evaluate the performance of the model on the test dataset.

In [9]:
results = model.evaluate(X_test, y_test, verbose=1)
print(results)

[0.37081340698613535, 0.37081340698613535]


The MSE on the test data was 0.37, which is lower than the MSE on the training and validation data. This suggests that the model is not overfitted.
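
As a reminder of what the MSE measures, the sketch below computes it by hand for the first five prediction and label pairs from the prediction listing in this section. Five rows are far too few to reproduce the 0.37 reported for the whole test set. They only show the calculation.

```python
import numpy as np

# First five (prediction, actual) pairs from the prediction listing in this section.
predicted = np.array([1.9400102, 4.2086363, 4.2086363, 2.6962187, 2.6962187])
actual = np.array([1, 3, 3, 2, 2])

# Mean of the squared differences between prediction and label.
mse = np.mean((predicted - actual) ** 2)
print(mse)
```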

Now, we will compare the number of syllables predicted by the model with the actual number of syllables in the test data.

In [13]:
predictions = model.predict(X_test)
for i in range(0, len(predictions)):
    print(predictions[i][0], "\t", y_test[i])

1.9400102 	 1
4.2086363 	 3
4.2086363 	 3
2.6962187 	 2
2.6962187 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.4524274 	 2
3.4524274 	 2
2.3181145 	 2
2.3181145 	 2
3.4524274 	 2
3.4524274 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.8305318 	 2
3.8305318 	 2
3.4524274 	 3
3.4524274 	 3
3.8305318 	 3
3.8305318 	 3
1.9400102 	 2
1.9400102 	 2
1.9400102 	 2
3.074323 	 2
3.074323 	 2
2.6962187 	 2
2.6962187 	 2
2.6962187 	 2
2.6962187 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
1.9400102 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
3.074323 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
2.6962187 	 2
2.6962187 	 2
2.3181145 	 2
2.3181145 	 2
1.9400102 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
3.074323 	 2
3.074323 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
...
2.6962187 	 2
2.6962187 	 2
3.074323 	 2
1.5619059 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
2.6962187 	 2
3.074323 	 3
4.2086363 	 3
4.5867405 	 3
3.4524274 	 3
3.4524274 	 3
3.8305318 	 3
4.2086363 	 3
4.2086363 	 3
2.6962187 	 2
1.9400102 	 2
2.3181145 	 2
3.074323 	 3
1.9400102 	 2
2.3181145 	 2
3.074323 	 3
3.4524274 	 3
3.074323 	 3
3.8305318 	 3
1.9400102 	 2
2.3181145 	 3
2.6962187 	 3
3.074323 	 3
2.3181145 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
2.3181145 	 2
2.6962187 	 2
2.6962187 	 2
1.9400102 	 1
2.3181145 	 2
2.3181145 	 2
3.4524274 	 3
1.9400102 	 2
3.8305318 	 3
2.6962187 	 2
2.3181145 	 2
2.3181145 	 3
3.074323 	 3
1.5619059 	 1
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
3.074323 	 2
1.5619059 	 2
2.6962187 	 2
1.5619059 	 2
1.9400102 	 2
2.6962187 	 2
1.9400102 	 2
1.9400102 	 1
1.9400102 	 2
2.3181145 	 2
2.6962187 	 2
3.4524274 	 3
2.3181145 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
2.6962187 	 2
1.5619059 	 1
1.9400102 	 2
1.9400102 	 2


1.9400102 	 1
2.3181145 	 2
2.6962187 	 2
1.5619059 	 2
1.9400102 	 2
2.3181145 	 3
1.1838015 	 1
1.9400102 	 2
2.3181145 	 2
3.074323 	 2
2.6962187 	 2
2.6962187 	 2
3.4524274 	 3
3.8305318 	 3
2.6962187 	 2
4.2086363 	 3
1.9400102 	 2
2.3181145 	 3
2.6962187 	 3
2.6962187 	 2
2.6962187 	 2
4.5867405 	 4
1.9400102 	 2
2.3181145 	 3
1.9400102 	 2
1.9400102 	 2
3.8305318 	 4
3.4524274 	 4
3.074323 	 3
1.9400102 	 2
3.074323 	 3
3.8305318 	 3
3.4524274 	 3
3.074323 	 3
0.8056972 	 1
1.1838015 	 1
1.5619059 	 2
1.1838015 	 1
1.1838015 	 2
1.5619059 	 2
2.3181145 	 3
2.6962187 	 3
2.3181145 	 3
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
1.1838015 	 1
1.9400102 	 2
1.5619059 	 1
2.6962187 	 3
3.074323 	 3
1.1838015 	 1
1.5619059 	 1
2.3181145 	 2
2.6962187 	 2
1.5619059 	 1
1.9400102 	 1
4.2086363 	 4
3.074323 	 3
4.5867405 	 3
3.4524274 	 3
1.9400102 	 2
2.3181145 	 2
1.5619059 	 1
2.6962187 	 2
2.6962187 	 2
4.2086363 	 4
2.3181145 	 2
3.074323 	 3
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2

2.3181145 	 2
1.5619059 	 2
2.3181145 	 3
2.3181145 	 3
0.8056972 	 1
1.5619059 	 2
0.8056972 	 1
1.9400102 	 3
1.5619059 	 2
1.9400102 	 3
1.9400102 	 3
1.5619059 	 2
1.9400102 	 2
2.3181145 	 2
4.5867405 	 5
2.3181145 	 2
1.9400102 	 3
1.1838015 	 1
2.3181145 	 2
1.5619059 	 1
1.9400102 	 1
1.9400102 	 2
1.9400102 	 2
1.9400102 	 2
2.3181145 	 2
1.9400102 	 2
2.6962187 	 3
3.074323 	 3
3.074323 	 3
3.074323 	 3
3.8305318 	 3
2.3181145 	 2
3.8305318 	 3
4.2086363 	 3
1.5619059 	 1
1.1838015 	 1
2.3181145 	 2
1.9400102 	 2
2.3181145 	 2
2.6962187 	 2
3.074323 	 3
3.4524274 	 3
1.5619059 	 2
1.9400102 	 2
1.9400102 	 3
1.9400102 	 3
3.8305318 	 5
0.8056972 	 1
1.1838015 	 1
2.6962187 	 2
1.9400102 	 2
1.1838015 	 2
1.1838015 	 1
1.5619059 	 2
1.5619059 	 2
0.8056972 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
2.3181145 	 2
0.8056972 	 1
1.5619059 	 2
1.9400102 	 2
0.8056972 	 1
1.5619059 	 2
1.9400102 	 2
1.9400102 	 3
2.3181145 	 3
1.9400102 	 2
1.9400102 	 2
1.9400102 	 3
...
1.5619059 	 2
1.5619059 	 3
3.074323 	 4
1.5619059 	 2
1.9400102 	 2
1.9400102 	 2
3.4524274 	 4
2.3181145 	 3
2.6962187 	 3
3.074323 	 3
1.5619059 	 2
1.1838015 	 1
3.074323 	 3
3.074323 	 3
1.1838015 	 1
1.9400102 	 2
1.9400102 	 2
2.3181145 	 2
2.3181145 	 2
3.4524274 	 3
3.8305318 	 3
3.074323 	 3
3.074323 	 3
3.4524274 	 3
3.4524274 	 3
0.8056972 	 1
1.1838015 	 1
1.5619059 	 1
2.3181145 	 1
1.9400102 	 1
2.6962187 	 2
1.9400102 	 2
1.9400102 	 2
0.8056972 	 1
1.1838015 	 1
2.3181145 	 3
2.3181145 	 3
2.6962187 	 3
2.6962187 	 3
1.9400102 	 1
2.3181145 	 4
0.8056972 	 1
1.5619059 	 2
1.5619059 	 2
2.6962187 	 3
1.5619059 	 2
2.6962187 	 3
1.9400102 	 2
2.3181145 	 3
2.6962187 	 3
1.5619059 	 2
1.5619059 	 2
1.5619059 	 2
1.1838015 	 2
2.3181145 	 4
1.5619059 	 2
1.9400102 	 2
2.3181145 	 2
1.5619059 	 2
1.5619059 	 2
2.3181145 	 3
2.6962187 	 2
2.6962187 	 2
2.3181145 	 2
1.9400102 	 2
1.5619059 	 2
1.1838015 	 2
1.9400102 	 3
1.9400102 	 3
1.9400102 	 2
1.5619059 	 2
...
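
Since syllable counts are whole numbers, one possible way to read these predictions is to round them to the nearest integer and compare them with the labels. The sketch below does this for six rows picked by hand from the output above, so the resulting fraction says nothing about the full test set.

```python
import numpy as np

# Six (prediction, actual) pairs copied from the output above.
predicted = np.array([1.9400102, 4.2086363, 2.3181145,
                      3.4524274, 1.9400102, 0.8056972])
actual = np.array([1, 3, 2, 3, 2, 1])

rounded = np.rint(predicted).astype(int)   # nearest whole number of syllables
accuracy = np.mean(rounded == actual)      # fraction of exact matches
print(rounded, accuracy)
```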

A linear regression model has the form **ax + b**, where *a* is the slope and *b* is the intercept. We can read these values from the trained layer with the **get_weights()** function.

In [17]:
for layer in model.layers:
    weight = layer.get_weights()[0]
    bias = layer.get_weights()[1]
    print("Weight:\t", weight)
    print("Bias:\t", bias)

Weight:	 [[0.37810433]]
Bias:	 [-0.3286158]


Here, we can see that the formula for this model is **0.378x - 0.3286**, where *x* is the number of characters and the result is the predicted number of syllables.
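
As a quick check, the sketch below applies this formula to two words from the data file, using the weight and bias printed above. The helper name `predict_syllables` is hypothetical.

```python
# Weight and bias as printed by get_weights() above.
WEIGHT = 0.37810433
BIAS = -0.3286158

def predict_syllables(n_chars):
    """Apply the fitted line to a character count."""
    return WEIGHT * n_chars + BIAS

for word in ["AAA", "AABERG"]:
    print(word, round(predict_syllables(len(word)), 4))
```

The values for three and six characters agree with the 0.8056972 and 1.9400102 that appear in the prediction output.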