# COMP541 - LAB #1

In this exercise, you will preprocess the Boston Housing dataset so that we can later use it in machine learning models such as linear regression.

The housing dataset contains housing-related information for 506 neighborhoods in Boston from 1978. Each neighborhood is represented by 13 attributes, such as crime rate or distance to employment centers. The goal is to predict the median house value of each neighborhood, given in units of $1000.

Note that the only reason we wrap everything in functions is so that we can test your work properly. You do not have to put everything in a function when you write your own code 😉

In [81]:
using Test

## Exercise 0

To use some necessary functions, we need to import a few modules. Insert the following line as the first line or cell.

`using DelimitedFiles, Statistics, Random`

**Statistics** provides statistical functions such as **mean** and **std**, **DelimitedFiles** provides the function we use to read the data (**readdlm**), and **Random** provides random number utilities (**rand**, **Random.seed!**, etc.).

In [82]:
using DelimitedFiles, Statistics, Random

## Exercise 1

First download the file, then read it. You need to download the data from within the Julia notebook (have a look at Julia's **readdlm** and **download** functions, e.g. by typing **@doc download**). If you look at the data, you will see that each neighborhood is represented by 14 whitespace-separated values (13 attributes plus the median house price) and that there are 506 lines in total. Here’s the [link](https://raw.githubusercontent.com/ilkerkesen/ufldl-tutorial/master/ex1/housing.data) to the dataset.
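To see how **readdlm** treats whitespace-separated numbers without touching the network, here is a minimal sketch on a temporary file. The two-row sample and the `toy_*` names are made up for illustration and are not part of the real dataset.

```julia
using DelimitedFiles

# Hypothetical sample in the same whitespace-separated layout as the dataset
toy_text = "0.1  18.0 2.31\n0.2   0.0 7.07\n"

# Write the sample to a temporary file
toy_path, toy_io = mktemp()
write(toy_io, toy_text)
close(toy_io)

# With no delimiter given, readdlm splits on runs of whitespace
# and returns a numeric matrix when every field is a number
toy_data = readdlm(toy_path)
println(size(toy_data))
```

Each line of the file becomes one row of the matrix, which is why the real file produces one row per neighborhood.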

In [83]:
"""
    download_and_read(data_url)

First download the `data_url`, then read it into an array.
"""
function download_and_read(data_url)
    fname = download(data_url)
    readdlm(fname)
end
housing_data = download_and_read("https://raw.githubusercontent.com/ilkerkesen/ufldl-tutorial/master/ex1/housing.data")

@info "Testing Download"
@test size(housing_data) == (506, 14)
@test mean(housing_data) ≈ 66.67816985036704

┌ Info: Testing Download
└ @ Main In[83]:12

Test Passed

## Exercise 2

The resulting data matrix should have 506 rows representing neighborhoods and 14 columns representing the attributes. The last attribute is the median house price to be predicted, so let’s separate it. Also, transpose the data matrix so that it matches common mathematical notation (deep learning people mostly represent instances/samples as column vectors). We will use Julia’s array indexing to split the data array into input x and output y. (Hint: indexing a single row returns a vector, so use the **reshape** function to turn y into a matrix of size 1x506.)
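The transpose, split and reshape steps can be sketched on a tiny matrix. The `toy_*` names and values are made up for illustration; only the shapes matter here.

```julia
# Two samples (rows) with two attributes and one target (last column)
toy = [1.0 2.0 3.0;
       4.0 5.0 6.0]

toy_t = copy(transpose(toy))     # 3x2: samples are now columns
toy_inputs = toy_t[1:end-1, :]   # every row except the last, 2x2
toy_target = toy_t[end, :]       # indexing one row yields a 2-element vector
toy_target = reshape(toy_target, 1, :)   # back to a 1x2 matrix

println(size(toy_inputs), " ", size(toy_target))
```

The `:` in `reshape` lets Julia infer the second dimension from the number of elements.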

In [84]:
"""
    reshape_data(data)

Transpose `data` and split it into two parts along the first axis.
The second part has size 1. Return both parts as 2-dimensional arrays.
"""
function reshape_data(data)
    data = copy(transpose(data))
    x = data[1:end-1, :]
    y = data[end, :]
    y = reshape(y, 1, :)
    return x,y
end
x,y = reshape_data(housing_data)

@info "Testing Reshape"
@test size(x) == (13,506)
@test size(y) == (1,506)
@test mean(x[:,1]) ≈ 62.37687076923077
@test y[1,123] == 20.5

┌ Info: Testing Reshape
└ @ Main In[84]:16

Test Passed

Do not panic if you see "LinearAlgebra.Adjoint{Float64,Array{Float64,2}}" (or the analogous "LinearAlgebra.Transpose{Float64,Array{Float64,2}}") instead of "Array{Float64,2}". This is just a lazy wrapper type around the array that Julia returns when you transpose an array. You can convert it back to a plain array with the copy or collect function.
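A small sketch of the wrapper types mentioned above, using a made-up 2x2 matrix (`toy_a` is a hypothetical name):

```julia
using LinearAlgebra

toy_a = [1.0 2.0; 3.0 4.0]

toy_t = transpose(toy_a)     # lazy wrapper, no data is copied
println(typeof(toy_t))       # a LinearAlgebra.Transpose type
println(typeof(toy_a'))      # the ' operator gives a LinearAlgebra.Adjoint type

toy_plain = copy(toy_t)      # materialize into an ordinary matrix
println(typeof(toy_plain))
```

Both wrappers behave like ordinary matrices in most operations, so the conversion is mainly useful when a test or function expects a plain array type.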

## Exercise 3
As you can see, the input attributes have different ranges. We need to normalize each attribute by subtracting its mean and then dividing by its standard deviation (hint: each attribute is a row of x, so take the means and standard deviations along the second dimension using the dims keyword). Use the mean and std functions to compute these values, then normalize the input data.
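As a minimal sketch of this per-attribute normalization, consider a made-up matrix with two attributes (rows) and three samples (columns). The `toy_*` names are illustrative only.

```julia
using Statistics

toy = [ 1.0  2.0  3.0;
       10.0 20.0 30.0]

toy_mean = mean(toy, dims=2)   # 2x1, one mean per attribute
toy_std = std(toy, dims=2)     # 2x1, one standard deviation per attribute

# Broadcasting stretches the 2x1 statistics across all columns
toy_normalized = (toy .- toy_mean) ./ toy_std
println(toy_normalized)
```

After normalization both rows are on the same scale, even though the second attribute was originally ten times larger.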

In [85]:
"""
    normalize(x)

Take the mean and standard deviation of `x` along the second axis. Subtract the mean from `x` and
divide by the standard deviation. Return the result.
"""
function normalize(x)
    μ = mean(x, dims = 2)
    σ = std(x, dims = 2)
    x = (x .- μ) ./ σ
end
x = normalize(x)

@info "Testing Normalize"
@test size(x) == (13,506)
@test mean(x[:, 123]) ≈ -0.010503411301568476

┌ Info: Testing Normalize
└ @ Main In[85]:14

Test Passed

### *Important Note on Random Number Generation*
Before generating random numbers, strings, etc., you need to set a seed, because Julia uses a pseudorandom number generator. With such a generator, a given seed always produces the same sequence of random numbers. If you don’t set a seed, the results you obtain in the next exercises will differ from ours. If a test fails at some point, rerun the cells starting from the cell or line where we set the random seed.
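The effect of seeding can be checked with a short sketch. The seed value 1 and the `toy_*` names are chosen only for illustration.

```julia
using Random

Random.seed!(1)
toy_first = rand(3)

Random.seed!(1)          # resetting the seed restarts the sequence
toy_second = rand(3)

toy_third = rand(3)      # no reset: the sequence simply continues

println(toy_first == toy_second)
println(toy_second == toy_third)
```

This is why any extra random call between the seed and the tested call shifts the sequence and changes the results.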

## Exercise 4
We need to split our dataset into training and test subsets so that we can estimate how well our model will perform on unseen data. There are 506 neighborhoods in our dataset. Let’s pick 400 of them at random and use them as training data. Let the rest be test data. In the end, you will have 4 different arrays: xtrn, ytrn, xtst and ytst.

Use the **randperm** function to split our dataset into training and test sets. Note that the results will differ from run to run, since **randperm** introduces randomness. To make the results reproducible, set a seed with the **Random.seed!** function. In this exercise, we set the seed to 1 just before the **randperm** call, and you need to take the first 400 random samples (not the last 400) as your training data so that you get exactly the same results. Use the **@doc** macro to see the documentation of **randperm** and **Random.seed!** (e.g. type **@doc randperm** in the Julia REPL or notebook).
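A permutation-based split can be sketched on a small index range. The sizes (10 items, 7 for training) and the `toy_*` names are made up for illustration.

```julia
using Random

Random.seed!(1)
toy_n = 10
toy_k = 7

toy_perm = randperm(toy_n)          # a random ordering of 1:toy_n
toy_train = toy_perm[1:toy_k]       # first toy_k shuffled indices
toy_test = toy_perm[toy_k+1:end]    # the remaining indices

# Every index lands in exactly one of the two subsets
println(sort(vcat(toy_train, toy_test)) == 1:toy_n)
```

Indexing both the inputs and the targets with the same index vectors keeps each sample paired with its own target.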

In [86]:
Random.seed!(1)
"""
    train_test_split(x, y, split_size=400)

Shuffle both `x` and `y` with same random permutation so that they correspond to each other
on their second axis. Split both into two by `split_size`, return the splits.
"""
function train_test_split(x, y, split_size=400)
    rperm = randperm(size(y)[2])
    xtrn = x[:, rperm[1:split_size]] 
    ytrn = y[:, rperm[1:split_size]]
    xtst = x[:, rperm[split_size + 1 : size(x)[2]]]
    ytst = y[:, rperm[split_size + 1 : size(x)[2]]]
    return xtrn, xtst, ytrn, ytst
end
xtrn, xtst, ytrn, ytst = train_test_split(x, y)

@info "Testing Split"
@test size(xtrn) == (13,400)
@test size(xtst) == (13,106)
@test size(ytrn) == (1,400)
@test size(ytst) == (1,106)
@test mean(xtrn[:,123]) ≈ 0.5696601530016167
@test mean(xtst[:,42]) ≈ -0.15323416352408686
@test ytrn[1,123] == 13.1
@test ytst[1,42] == 20.0

┌ Info: Testing Split
└ @ Main In[86]:18

Test Passed

## Exercise 5
Our data is ready to be used. This week, we will not deal with training a model, but let’s look at how well a randomly initialized linear regression model performs on our processed data.

Basically, we need weights by which we multiply the attributes of a neighborhood in order to predict its house price. Neighborhoods are represented with 13 attributes and we need to predict the price, which is a single number. We therefore need a weight matrix of size 1x13. We do not use a bias term in this tutorial, which is equivalent to fixing the bias at 0.

To create the weight matrix, we will sample from a normal distribution with zero mean and a small standard deviation. In this tutorial, the standard deviation is 0.1. Use the randn function to create a random weight matrix whose values are sampled from a unit normal distribution (mean=0, standard deviation=1), then multiply the weight matrix by 0.1, our desired standard deviation.
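To check that scaling unit normal samples really gives the desired standard deviation, here is a sketch on a large sample. The sample size of 10,000, the seed, and the `toy_*` names are arbitrary choices for illustration.

```julia
using Random, Statistics

Random.seed!(1)                  # fixed seed, used only for repeatability
toy_scale = 0.1                  # desired standard deviation
toy_samples = randn(10_000) .* toy_scale

# The empirical statistics should be close to 0 and toy_scale
toy_mean = mean(toy_samples)
toy_std = std(toy_samples)
println(toy_mean, " ", toy_std)
```

With only 13 weights, as in this exercise, the empirical values will of course fluctuate much more than in this large sample.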

In [87]:
"""
    create_matrix(x=1, y=13, scale=0.1)

Return a matrix of size (`x`, `y`) with normally distributed entries scaled by `scale`.
"""
function create_matrix(x=1, y=13, scale=0.1)
    randn((x,y)) .* scale
end
Random.seed!(1)
w = create_matrix()

@info "Testing w"
@test size(w) == (1,13)
@test w[1,3] ≈ -0.059763447672823114

┌ Info: Testing w
└ @ Main In[87]:12

Test Passed

Note that if you ran any other operation that consumes random numbers after calling randperm in the previous exercise, your weight array will not be the same as in the example. In that case, reset your seed and try again.

## Exercise 6
Now we have inputs and weights. Let’s write a function to predict prices. Implement a function that takes the weight matrix and neighborhood attributes as input and outputs one house price prediction per neighborhood. Simply perform a matrix multiplication inside this function and return the output.
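The shapes involved in this multiplication can be sketched with a made-up example of 2 attributes and 3 samples (the `toy_*` names and values are illustrative only):

```julia
toy_w = [1.0 2.0]            # 1x2 weight matrix
toy_x = [1.0 3.0 5.0;
         2.0 4.0 6.0]        # 2x3: one column per sample

# (1x2) * (2x3) gives a 1x3 matrix with one prediction per column
toy_pred = toy_w * toy_x
println(toy_pred)
```

Each prediction is the weighted sum of one column, for example 1.0 * 1.0 + 2.0 * 2.0 for the first sample.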

In [88]:
"""
    predict(w, x)

Return the matrix product of `w` and `x`.
"""
function predict(w, x)
    w * x
end
ypred = predict(w, xtrn)

@info "Testing Predict"
@test size(ypred) == (1,400)
@test ypred[1,123] ≈ -0.16473245771387449

┌ Info: Testing Predict
└ @ Main In[88]:11

Test Passed

ypred is a 1x400 array/matrix. Each value in this array is the model’s price prediction for an average house in the corresponding neighborhood.

## Exercise 7
Let’s implement a loss function called Mean Squared Error (MSE):
![](https://github.com/OsmanMutlu/rawtext/raw/master/img/Comp541-Lab1-Screenshot7.png)
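In this function we calculate J, our loss value: the average of the squared differences between the real and the predicted prices, halved by convention (the reference solution divides the sum by 2N, where N is the number of neighborhoods).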
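Written out, and assuming the notation below (which is ours, chosen to match the reference solution rather than copied from the figure), the loss for N neighborhoods with inputs $x_i$, true prices $y_i$ and predictions $\hat{y}_i$ is:

$$
J(w) = \frac{1}{2N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2, \qquad \hat{y}_i = w\,x_i
$$

The factor $\frac{1}{2}$ does not change which weights minimize the loss. It is a common convention that cancels the 2 appearing when the square is differentiated.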

Implement an MSE loss function that takes the weight matrix, the input matrix (xtrn or xtst) and the ground truth prices (ytrn or ytst). We make the weight matrix the first parameter of the loss function. This is not crucial, but make it a habit. Use the **predict** function you implemented above. Helpful functions: sum, mean, size, abs2, .* (you don’t have to use all of them). If you use abs2, apply it elementwise with dot syntax, as in **abs2.(x)**.
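The arithmetic of the halved mean squared error can be checked by hand on three made-up values (the `toy_*` names and numbers are illustrative only):

```julia
toy_pred = [1.0 2.0 3.0]     # hypothetical predictions
toy_true = [1.0 4.0 7.0]     # hypothetical ground truth

# Squared differences are 0, 4 and 16, so their sum is 20
toy_squared = abs2.(toy_pred .- toy_true)

# Divide by 2N with N = 3 samples
toy_loss = sum(toy_squared) / (2 * size(toy_true, 2))
println(toy_loss)
```

The result is 20 / 6, which is half of the plain mean of the squared differences.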

In [89]:
"""
    mse_loss(w, x, y)

Predict `x` using `w`. Calculate the loss of the predictions using `y`.
"""
function mse_loss(w, x, y)
    sum(abs2.(predict(w, x) .- y)) / (2 * size(y, 2))
    
end

train_loss = mse_loss(w, xtrn, ytrn)
test_loss = mse_loss(w, xtst, ytst)

@info "Testing Loss"
@test train_loss ≈ 294.9289548077531
@test test_loss ≈ 298.1198492352728

┌ Info: Testing Loss
└ @ Main In[89]:14

Test Passed

## Exercise 8
Lastly, let’s find for how many neighborhoods the model predicts the price with an error smaller than the average error. Measure the absolute difference between the predicted price and the correct price for each neighborhood, and compare those differences with the square root of the loss value calculated in the previous exercise. You can use the **sqrt** function to take square roots. The result should be **107**.
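Counting the entries below a threshold can be sketched as follows. The error values and the loss of 4.0 are made up, and the `toy_*` names are illustrative only.

```julia
toy_errors = [0.5 2.0 1.0 3.0]   # hypothetical absolute errors
toy_loss = 4.0                   # hypothetical loss, so the threshold is 2.0

# The comparison is elementwise and yields a Boolean matrix;
# count returns how many entries are true
toy_mask = toy_errors .< sqrt(toy_loss)
toy_count = count(toy_mask)
println(toy_count)
```

The comparison is strict, so an error exactly equal to the threshold (2.0 here) is not counted.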

In [90]:
"""
    num_above_avg_preds(y,pred,loss)

Calculate absolute difference between `y` and `pred`. Return the number of instances
whose difference is less than square root of `loss`.
"""
function num_above_avg_preds(y,pred,loss)
    diff = abs.(y-pred)
    count(diff .< sqrt(loss))
end
result = num_above_avg_preds(ytrn,ypred,train_loss)

@info "Testing Result"
@test result == 107

┌ Info: Testing Result
└ @ Main In[90]:13

Test Passed

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*