# Linear Regression

In [None]:
# Import the required libraries
import pandas as pd
import numpy as np
from numpy import random

## Data Preprocessing

### **Exploring the dataset**

Let's start by loading the training data from the CSV file into a pandas DataFrame.

Load the datasets from GitHub. The training dataset is loaded into `df` below; the test dataset is loaded later, in the testing section.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')

Let's see what the first 5 rows of this dataset look like

In [None]:
df.head()

Unnamed: 0,LotArea,TotalBsmtSF,GrLivArea,GarageArea,PoolArea,OverallCond,Utilities,SalePrice
0,11553,1051,1159,336,0,5,AllPub,158000
1,8400,1052,1052,288,0,5,AllPub,138500
2,8960,1008,1028,360,0,6,AllPub,115000
3,11100,0,930,308,0,7,AllPub,84900
4,15593,1304,2287,667,0,4,AllPub,225000


What are all the features present? What is the range for each of the features along with their mean?

In [None]:

df.info()
df.head()
df.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1314 entries, 0 to 1313
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   LotArea      1314 non-null   int64 
 1   TotalBsmtSF  1314 non-null   int64 
 2   GrLivArea    1314 non-null   int64 
 3   GarageArea   1314 non-null   int64 
 4   PoolArea     1314 non-null   int64 
 5   OverallCond  1314 non-null   int64 
 6   Utilities    1314 non-null   object
 7   SalePrice    1314 non-null   int64 
dtypes: int64(7), object(1)
memory usage: 82.3+ KB


Unnamed: 0,LotArea,TotalBsmtSF,GrLivArea,GarageArea,PoolArea,OverallCond,Utilities,SalePrice
1309,9020,1127,1165,490,0,7,AllPub,174900
1310,10793,780,1620,462,0,5,AllPub,152000
1311,8885,864,902,484,0,5,AllPub,131000
1312,11275,710,2978,564,0,7,AllPub,242000
1313,10206,0,944,528,0,3,AllPub,82000
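
`df.info()` lists the columns and their dtypes, but not the range or mean that the question above asks about. A minimal sketch of how `describe()` could provide them, assuming a small hand-made frame (`sample`, a hypothetical stand-in built from the five rows printed by `df.head()`) in place of the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for df: a few columns of the five rows printed by df.head()
sample = pd.DataFrame({
    'LotArea': [11553, 8400, 8960, 11100, 15593],
    'GrLivArea': [1159, 1052, 1028, 930, 2287],
    'Utilities': ['AllPub'] * 5,
    'SalePrice': [158000, 138500, 115000, 84900, 225000],
})

# describe() summarizes numeric columns only; keep the rows we care about
summary = sample.describe().loc[['min', 'max', 'mean']]
print(summary)

# Categorical columns are better summarized by their value counts
print(sample['Utilities'].value_counts())
```

On the real data, the same call on `df` should give the range and mean of every numeric feature at once.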


### **Feature Scaling and One-Hot Encoding**

You must have noticed that some features (such as `Utilities`) are not continuous values.

These features contain values indicating different categories and must be converted to numbers before the model can use them, since a linear model operates on numbers and not strings.

These features are called categorical features. We can represent them with a `One-Hot Representation`.

You must have also noticed that the other features are each on a different scale. This can be detrimental to the performance of our linear regression model, so we normalize them so that all of them are in the range $[0,1]$.

> NOTE: When you are doing feature scaling, store the min/max values used for normalization somewhere. They are needed again at testing time. Try to think about why we are doing this.
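
To make the note concrete, here is a minimal sketch of min-max scaling in which the statistics are computed once on the training data and then reused on new data. The frames `train` and `test` and their values are made up for illustration, and the names `train_min`, `train_max` and `train_range` are assumptions, not part of the notebook:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
train = pd.DataFrame({'LotArea': [8000.0, 10000.0, 12000.0],
                      'OverallCond': [3.0, 5.0, 7.0]})
test = pd.DataFrame({'LotArea': [9000.0, 13000.0],
                     'OverallCond': [5.0, 8.0]})

# Compute and store the statistics on the training set only
train_min = train.min()
train_max = train.max()
train_range = (train_max - train_min).replace(0, 1)  # guard against constant columns

train_scaled = (train - train_min) / train_range

# Reuse the stored training statistics at test time
test_scaled = (test - train_min) / train_range

print(train_scaled)
print(test_scaled)  # values can fall outside [0, 1] when the test set exceeds the training range
```

Because both sets go through the same transformation, a given raw value maps to the same scaled value in training and testing, which is what the learned weights expect.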

In [None]:
# Do the one-hot encoding here
df_pandas_encoded = pd.get_dummies(df, columns = ['Utilities'], drop_first = True)

In [None]:
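# Do the feature scaling here

# Find the min and max of each column of the encoded frame (keep these for test time)
df_pandas_encoded = df_pandas_encoded.astype(float)
minima = df_pandas_encoded.min()
maxima = df_pandas_encoded.max()

# Min-max normalization to [0, 1]; constant columns get a range of 1 to avoid dividing by zero
ranges = (maxima - minima).replace(0, 1)
df_feature_scaled = (df_pandas_encoded - minima) / ranges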


### **Conversion to NumPy**

Now that we have preprocessed all the data, we need to convert it to NumPy for our linear regression model.

Assume that our dataset has a total of $N$ datapoints, each having a total of $D$ features (after one-hot encoding). We want our NumPy array to be of shape $(N, D)$.

In our task, we have to predict the `SalePrice`. We will need 2 NumPy arrays $(X, Y)$. These represent the features and targets respectively.

In [None]:
# Convert to numpy array

# Drop columns that are entirely NaN (if any), then split the features (X) from the target (Y)
scaled = df_feature_scaled.dropna(axis=1, how='all').astype(float)
arr = scaled.drop(columns='SalePrice').to_numpy()   # X: N rows, D columns
Y = scaled['SalePrice'].to_numpy().reshape(-1, 1)   # Y: N rows, 1 column
print(arr.shape, Y.shape)


## Linear Regression formulation
  
We now have our data in the form we need. Let's try to create a linear model to get our initial (Really bad) prediction


In [None]:
# D weights plus 1 bias are to be learned
#
# y_i = w_1*x_i1 + ... + w_D*x_iD + b, for every datapoint i from 1 to N
# arr has N rows (datapoints) and D columns (features)
X = np.asarray(arr, dtype=float)
N, D = X.shape

random_weights = random.normal(size = (1, D))
random_biases = random.normal(size = (1, 1))

# Multiply each row of features by the weights, sum over the features and add the bias.
# In matrix form this is X W^T + b, computed for all N rows at once.
sums = (X @ random_weights.T + random_biases).ravel()



print(sums[1313])


-0.2128709596589699


Let's say a single datapoint in our dataset consists of 3 features $(x_1, x_2, x_3)$, we can pose it as a linear equation as follows:
$$ y = w_1x_1 + w_2x_2 + w_3x_3 + b $$
Here we have to learn 4 parameters $(w_1, w_2, w_3, b)$
  
  
Now how do we extend this to multiple datapoints?  
  
  
Try to answer the following:
- How many parameters will we have to learn in the case of our dataset? (Don't forget the bias term)
- Form a linear equation for our dataset. We need just a single matrix equation which correctly represents all the datapoints in our dataset
- Implement the linear equation as an equation using NumPy arrays (Start by randomly initializing the weights from a standard normal distribution)

How well does our model perform? Try comparing our predictions with the actual values
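
One way to approach the questions above is sketched here on random data. The arrays `X` and `Y` and the sizes `N = 5`, `D = 3` are hypothetical stand-ins for the real dataset, and the column-vector layout of `W` is just one possible convention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scaled dataset: N datapoints with D features each
N, D = 5, 3
X = rng.random((N, D))            # features, shape (N, D)
Y = rng.random((N, 1))            # targets, shape (N, 1)

# D weights plus one bias: D + 1 parameters in total
W = rng.standard_normal((D, 1))   # drawn from a standard normal distribution
b = rng.standard_normal()

# A single matrix equation covers every datapoint: Y_hat = X W + b
Y_hat = X @ W + b                 # shape (N, 1)

# Compare predictions (first column) with the actual values (second column)
print(np.hstack([Y_hat, Y]))
print('Mean absolute error:', np.mean(np.abs(Y_hat - Y)))
```

With randomly initialized weights, the two columns should have little to do with each other, which is the motivation for learning the weights.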

In [None]:
print(df)

      LotArea  TotalBsmtSF  GrLivArea  GarageArea  PoolArea  OverallCond  \
0       11553         1051       1159         336         0            5   
1        8400         1052       1052         288         0            5   
2        8960         1008       1028         360         0            6   
3       11100            0        930         308         0            7   
4       15593         1304       2287         667         0            4   
...       ...          ...        ...         ...       ...          ...   
1309     9020         1127       1165         490         0            7   
1310    10793          780       1620         462         0            5   
1311     8885          864        902         484         0            5   
1312    11275          710       2978         564         0            7   
1313    10206            0        944         528         0            3   

     Utilities  SalePrice  
0       AllPub     158000  
1       AllPub     138500  
2  

### **Learning weights using gradient descent**

So these results are really horrible. We need to somehow update our weights so that they correctly represent our data. How do we do that?

We must do the following:
- We need some numerical indication of our performance; for this we define a Loss Function ( $\mathscr{L}$ )
- Find the gradients of the `Loss` with respect to the `Weights`
- Update the weights in accordance with the gradients: $W = W - \alpha\nabla_W \mathscr{L}$

Let's define the loss function:
- We will use the MSE loss since it is a regression task. (Specify the assumptions we make while doing so, as taught in class.)
- Implement this loss as a function. (Use NumPy as much as possible)

In [None]:
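def mse_loss_fn(y_true, y_pred):
  # Flatten both inputs so shapes (N,) and (N, 1) cannot broadcast to (N, N)
  diff = np.ravel(y_true).astype(float) - np.ravel(y_pred).astype(float)
  return np.mean(diff**2)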


Calculate the gradients of the loss with respect to the weights (and biases). First write the equations down on a piece of paper, then proceed to implement it
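
As a reference for the pen-and-paper step, here is a sketch of the equations, assuming the linear model $\hat{y}_i = \sum_{j=1}^{D} w_j x_{ij} + b$ and the MSE loss averaged over $N$ datapoints (the symbols $\hat{y}_i$, $x_{ij}$, $N$ and $D$ are notation introduced here for illustration):

$$ \mathscr{L} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 $$

Applying the chain rule to each parameter gives:

$$ \frac{\partial \mathscr{L}}{\partial w_j} = \frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)x_{ij}, \qquad \frac{\partial \mathscr{L}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right) $$

With the features stacked in a matrix $X$ of shape $(N, D)$, the weight gradient can be written compactly as $\nabla_W \mathscr{L} = \frac{2}{N}X^\top(\hat{Y} - Y)$. If the loss is instead defined with a factor of $\frac{1}{2}$, the leading $2$ cancels.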

In [None]:
def get_gradients(y_true, y_pred, W, b, X):
  """
    Calculates the gradients for the MSE loss function with respect to the weights (and bias)

    Args:
        y_true: The true values of the target variable (SalePrice in our case)
        y_pred: The predicted values of the target variable using our model (W*X + b)

        W: The weights of the model (unused here, kept so the signature stays the same)
        b: The bias of the model (unused here, kept so the signature stays the same)
        X: The input features

    Returns:
        dW: The gradients of the loss function with respect to the weights
        db: The gradients of the loss function with respect to the bias

        gradient of loss wrt weights is (2/N) * X^T (y_pred - y_true) - D values
        gradient of loss wrt bias is (2/N) * sum(y_pred - y_true) - 1 value
  """
  X = np.asarray(X, dtype=float)
  # Residuals, flattened to shape (N,)
  error = np.ravel(y_pred).astype(float) - np.ravel(y_true).astype(float)
  n = len(error)

  dW = 2 * (X.T @ error) / n   # shape (D,)
  db = 2 * np.sum(error) / n   # scalar

  return dW, db
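
A common sanity check for hand-derived gradients is to compare them against finite differences. The sketch below does this on random toy data; the helper names `mse` and `analytic_gradients` and all the values are assumptions made for this example, not the notebook's own functions:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def analytic_gradients(X, y, w, b):
    # Gradients of the MSE loss for the linear model y_hat = X @ w + b
    error = X @ w + b - y
    return 2 * X.T @ error / len(y), 2 * np.sum(error) / len(y)

rng = np.random.default_rng(0)
X = rng.random((20, 3))      # hypothetical toy features
y = rng.random(20)           # hypothetical toy targets
w = rng.standard_normal(3)
b = 0.5

dW, db = analytic_gradients(X, y, w, b)

# Central finite differences: perturb one parameter at a time
eps = 1e-6
num_dW = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    num_dW[j] = (mse(y, X @ (w + step) + b) - mse(y, X @ (w - step) + b)) / (2 * eps)
num_db = (mse(y, X @ w + (b + eps)) - mse(y, X @ w + (b - eps))) / (2 * eps)

print(np.allclose(dW, num_dW, atol=1e-6), np.isclose(db, num_db, atol=1e-6))
```

If the analytic and numerical values disagree, the derivation or its implementation most likely has a sign, scaling or indexing error.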





Update the weights using the gradients

In [None]:
def update(weights, bias, gradients_weights, gradients_bias, lr):
  """Updates the weights (and bias) using the gradients and the learning rate

    Args:
        weights: The current weights of the model
        bias: The current bias of the model

        gradients_weights: The gradients of the loss function with respect to the weights
        gradients_bias: The gradients of the loss function with respect to the bias

        lr: The learning rate

    Returns:
        weights_new: The updated weights of the model
        bias_new: The updated bias of the model

  """
  # The gradients are already known, so each parameter takes one step against its gradient
  weights_new = np.asarray(weights) - lr * np.asarray(gradients_weights)
  bias_new = bias - lr * gradients_bias
  return weights_new, bias_new


Put all these together to find the loss value, its gradient and finally updating the weights in a loop. Feel free to play around with different learning rates and epochs
  
> NOTE: The code in comments is just meant to be used as a guide. You will have to make changes based on your code
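
Before running the loop on the real data, it can help to try it where the right answer is known. The following is a minimal sketch on synthetic data generated from hypothetical parameters `true_w` and `true_b`; the learning rate and epoch count are assumed values chosen for this toy problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data generated from known parameters
true_w, true_b = np.array([0.5, -0.3]), 0.2
X = rng.random((200, 2))
y = X @ true_w + true_b

w = rng.standard_normal(2)
b = 0.0
lr, num_epochs = 0.3, 2000   # assumed values; tune them for your own data
losses = []

for epoch in range(num_epochs):
    y_pred = X @ w + b                   # forward pass on all datapoints
    error = y_pred - y
    losses.append(np.mean(error ** 2))   # MSE loss for this epoch
    dW = 2 * X.T @ error / len(y)        # gradient w.r.t. the weights
    db = 2 * np.mean(error)              # gradient w.r.t. the bias
    w, b = w - lr * dW, b - lr * db      # gradient descent step

print('first loss:', losses[0], 'last loss:', losses[-1])
print('learned w:', w, 'learned b:', b)
```

Note that one epoch here is one pass over the whole dataset, so the number of epochs is independent of the number of datapoints.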

In [None]:

LEARNING_RATE = 2e-2
# Number of passes over the whole dataset (not the number of datapoints)
NUM_EPOCHS = 1000

losses = []
# arr is what our x values are called; the targets are the scaled SalePrice column
x = np.asarray(arr, dtype=float)
y = df_feature_scaled['SalePrice'].to_numpy(dtype=float)

# One weight per feature, plus a single bias
w = random.normal(size=x.shape[1])
b = random.normal()

for epoch in range(NUM_EPOCHS):
  # Forward pass: predictions for every datapoint at once
  y_pred = x @ w + b

  loss = mse_loss_fn(y, y_pred)
  losses.append(loss)

  # Backward pass, then one gradient descent step
  dw, db = get_gradients(y, y_pred, w, b, x)
  w, b = update(w, b, dw, db, LEARNING_RATE)

Now use matplotlib to plot the loss graph

In [None]:
import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

### **Testing with test data**

Load and apply all the preprocessing steps used in the training data for the testing data as well. Remember to use the **SAME** min/max values which you used for the training set and not recalculate them from the test set. Also mention why we are doing this.

To load test data from GitHub, use the code below.


In [None]:
df_test = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/test_processed_splitted.csv')
print(df_test)

# One-hot encode the test set in the same way as the training set
df_test = pd.get_dummies(df_test, columns = ['Utilities'])

# Give the test set exactly the columns of the encoded training set, in the same order.
# Columns missing from the test set are added as all zeros; extra ones are dropped.
df_test = df_test.reindex(columns = df_pandas_encoded.columns, fill_value = 0).astype(float)
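
A minimal sketch of why the one-hot columns of the test set need to be aligned with those of the training set. The toy frames and the category name `NoSeWa` are hypothetical and only serve to show a category that is present in one set but not the other:

```python
import pandas as pd

# Hypothetical toy frames; the category 'NoSeWa' appears only in the training data
train = pd.DataFrame({'GrLivArea': [1000, 1500, 2000],
                      'Utilities': ['AllPub', 'NoSeWa', 'AllPub']})
test = pd.DataFrame({'GrLivArea': [1200, 1800],
                     'Utilities': ['AllPub', 'AllPub']})

train_encoded = pd.get_dummies(train, columns=['Utilities'], dtype=int)
test_encoded = pd.get_dummies(test, columns=['Utilities'], dtype=int)

# Give the test set exactly the training columns, in the same order;
# categories unseen in the test set become all-zero columns
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_encoded)
```

Without the alignment step, the test matrix would have fewer columns than the weight vector learned on the training set, and the matrix product would fail or silently pair weights with the wrong features.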



Using the weights learnt above, predict the values in the test dataset. Also answer the following questions:
- Are the predictions good?
- What is the MSE loss for the test set?
- Is the MSE loss for testing greater or lower than for training?
- Why is this the case?

In [None]:
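# Scale the features

# Fill NaN values
df_test = df_test.fillna(0)

# Scale features with the min/max of the TRAINING set (never recomputed from the test set)
train_encoded = df_pandas_encoded.astype(float)
train_min, train_max = train_encoded.min(), train_encoded.max()
train_range = (train_max - train_min).replace(0, 1)  # avoid dividing by zero for constant columns
df_test = df_test.reindex(columns=train_encoded.columns, fill_value=0).astype(float)
df_test_scaled = (df_test - train_min) / train_range

# Check for unexpected NaNs
assert not df_test_scaled.isna().any().any(), 'NaNs found in the scaled test set'

# Convert to numpy array
x_test = df_test_scaled.drop('SalePrice', axis=1).to_numpy() # (N, D)
y_test = df_test_scaled['SalePrice'].to_numpy().reshape(-1, 1) # (N, 1)
print(x_test.shape)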


In [None]:
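# Compare against the encoded training columns, since df_test has been one-hot encoded too
extra_cols = list(set(df_test.columns) - set(df_pandas_encoded.columns))
print("Extra columns in df_test:", extra_cols)

missing_cols = list(set(df_pandas_encoded.columns) - set(df_test.columns))
print("Missing columns in df_test:", missing_cols)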

In [None]:
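# Make predictions
y_pred_test = (x_test @ np.ravel(w) + b).reshape(-1, 1) # (N, 1)
loss_test = mse_loss_fn(y_test, y_pred_test)


# Scale the predictions back to the original scale
# (assumes SalePrice was min-max scaled with the training-set range)
price_min, price_max = df['SalePrice'].min(), df['SalePrice'].max()
y_pred_test_scaled = y_pred_test * (price_max - price_min) + price_min
y_test_scaled = y_test * (price_max - price_min) + price_min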


In [None]:
idx = np.random.randint(0, x_test.shape[0], 5)
y_pred_test_sample = y_pred_test_scaled[idx].round().astype(int)
y_true_test_sample = y_test_scaled[idx].round().astype(int)

print('Predicted SalePrice: \t', y_pred_test_sample.squeeze().tolist())
print('Actual SalePrice: \t', y_true_test_sample.squeeze().tolist())
print('\nTest Loss: \t\t', loss_test)