# Regression - GlobalAIHub

📌 **Regression** is a type of *supervised learning* that uses an algorithm to model the relationship between one or more independent variables (the inputs, or features) and a dependent variable (the output). Regression models are helpful for predicting numerical values from feature values: for example, forecasting temperature from wind, humidity, and pressure, or estimating the price of a car from its model year, brand, and transmission type. In regression, we want to learn the relationship between the features and the output so that we can predict, for example, the price of a house when we know its features but not its price. If this relationship is assumed to be linear, the algorithm is called linear regression.

📌 **Linear regression** is perhaps the most well-known and well-understood algorithm in statistics and machine learning. A simple linear regression model tries to explain the relationship between two variables using a best-fitting straight line. We call this the regression line.
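To make the idea of a best-fitting line concrete, here is a minimal sketch that fits a straight line to a handful of made-up points with NumPy. The data values and variable names are hypothetical and are not part of the house price dataset used below.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)

# The regression line gives a predicted y for every x.
y_hat = slope * x + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The slope says how much y changes when x increases by one unit, and the intercept is the predicted y when x is zero.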



📌 The first step is reading the data. To do that, we need to import Python’s handy data science library, **Pandas**. After importing the pandas library, we can easily load our train and test datasets using *read_csv*. We will use the train dataset to help our regression model learn some important patterns in the data. Then we’ll use the test dataset to check how well the model learned those patterns, that is, how well it predicts. Let’s start with this simple operation.

In [None]:
import pandas as pd

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
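After loading a file, it is usually worth a quick look at its size and column types. The sketch below uses a small in-memory CSV as a stand-in for a file such as train.csv; the rows in it are hypothetical and only serve to show the calls.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """OverallQual,GrLivArea,SalePrice
7,1710,208500
6,1262,181500
7,1786,223500
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.head())  # the first rows of the table
print(sample.dtypes)  # the type pandas inferred for each column
```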

📌 Let’s observe features in our train dataset. We can see that some of the features we have include the house’s general quality, the year it was built, the size of the garage, and so on.

In [None]:
train.columns

Index(['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF',
       'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea',
       'SalePrice', 'ExterQual_TA', 'Foundation_PConc', 'KitchenQual_TA'],
      dtype='object')


📌 Now that we have our datasets loaded, we can import a linear regression model from one of Python’s most widely used machine learning libraries, **sklearn** (scikit-learn). It is an open-source library for machine learning. Many models are already implemented in this library, so we just need to import the one that we will use.

In [None]:
from sklearn.linear_model import LinearRegression

📌 After importing the linear regression model, we can create an instance of it and assign it to the **“model”** variable so that it is easy to use.

In [None]:
model = LinearRegression()

📌 In this example, we’re trying to predict house prices. The column we want to predict is called the “ground truth”, “target”, or “labels”, and the other columns are called “features” or “attributes”. For the model to predict house prices, we first need to identify which column holds the prices, that is, the ground truth. We remove it from the features using the *drop* function and select it as the labels using the *loc* function. Inside the drop function, we write the column name followed by the argument axis, which we set to 1. This tells pandas that the name refers to a column rather than a row.

In [None]:
X_train = train.drop('SalePrice', axis=1)
y_train = train.loc[:,'SalePrice']
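The behaviour of drop and loc can be checked on a tiny data frame. This is only an illustrative sketch; the three rows below are hypothetical.

```python
import pandas as pd

# Hypothetical miniature dataset with two features and a target.
houses = pd.DataFrame({
    "OverallQual": [7, 6, 8],
    "GrLivArea": [1710, 1262, 2198],
    "SalePrice": [208500, 181500, 250000],
})

# axis=1 means "SalePrice" is looked up among the columns, not the row labels.
# drop returns a new DataFrame; the original is left unchanged.
X = houses.drop("SalePrice", axis=1)

# loc[:, name] selects all rows of one column and returns a Series.
y = houses.loc[:, "SalePrice"]

print(X.columns.tolist())
print(type(y).__name__)
print("SalePrice" in houses.columns)
```

Because drop does not modify the original data frame, the same `houses` object can still be used afterwards to select the target column.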

📌 Basically, we want to predict y with the help of X. By convention, we assign the target to the y variable and the features to the X variable. Then we can **fit** our model, which means letting it learn the patterns in the training dataset.

In [None]:
model.fit(X_train,y_train)

LinearRegression()
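What the model learns during fitting is one weight per feature plus a constant term. As a rough sketch, the example below fits a model on hypothetical noise-free data generated from a known formula, so the learned values can be compared with the true ones. The names `X_toy`, `y_toy`, and `toy_model` are made up for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3*x1 + 2*x2 + 5, with no noise.
X_toy = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

print(toy_model.coef_)       # one learned weight per feature
print(toy_model.intercept_)  # the learned constant term
```

On real data the relationship is never this clean, so the learned weights are estimates rather than exact values.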

📌 After the fitting process, our model is almost ready to make predictions. But before that, we also need to divide our test dataset into **“target”** and **“features”**.

In [None]:
X_test = test.drop('SalePrice', axis=1)
y_test = test.loc[:,'SalePrice']

📌 Here the target contains the actual values against which we will compare the model’s predictions.

In [None]:
predictions = model.predict(X_test)

📌 Comparing a few data points by eye won’t tell you how well your model predicts. This is exactly where we use evaluation metrics! Let’s import **mean squared error** from the *sklearn library*. We also need to import **square root** from the *NumPy library*, because we want the root mean squared error (RMSE), which is in the same units as our target. After importing, we can combine these functions to compute the error.

In [None]:
from sklearn.metrics import mean_squared_error
from numpy import sqrt

In [None]:
rmse = sqrt(mean_squared_error(y_test, predictions))
rmse

33186.384172367696
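To see what the metric actually computes, here is a small sketch that calculates RMSE by hand and checks it against the sklearn function. The three values in each array are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual values and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# By hand: take the errors, square them, average them, then take the square root.
errors = y_true - y_pred
rmse_manual = np.sqrt(np.mean(errors ** 2))

# The same quantity using sklearn's mean_squared_error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Squaring makes large errors count more than small ones, and the final square root brings the result back to the units of the target.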

In [None]:
comparison = pd.DataFrame({"Actual Values": y_test,"Predictions": predictions})

📌 Now let’s compare our predictions with the actual test values. We can simply put them into the same data frame and look at some of the rows using the head and tail functions. We see that the actual values and the model’s predictions are fairly close for most rows, although some are off by tens of thousands. Of course, as with every machine learning model, there are some inaccuracies.

In [None]:
comparison.head()

|  | Actual Values | Predictions |
|---|---|---|
| 0 | 118500 | 83380.944694 |
| 1 | 154900 | 105974.149765 |
| 2 | 133000 | 139238.138343 |
| 3 | 115000 | 104982.049557 |
| 4 | 154500 | 140473.360146 |


In [None]:
comparison.tail()

|  | Actual Values | Predictions |
|---|---|---|
| 324 | 132250 | 102816.796295 |
| 325 | 123000 | 121698.649065 |
| 326 | 316600 | 271745.844407 |
| 327 | 142000 | 131258.275591 |
| 328 | 250000 | 263005.372419 |
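Reading two columns side by side gets easier with an explicit error column. The sketch below rebuilds the five rows printed by comparison.head() and adds the signed, absolute, and percentage error for each. The column names "Error", "Abs Error", and "Pct Error" are made up for this illustration.

```python
import pandas as pd

# The five rows shown in the comparison.head() output.
head_rows = pd.DataFrame({
    "Actual Values": [118500, 154900, 133000, 115000, 154500],
    "Predictions": [83380.944694, 105974.149765, 139238.138343,
                    104982.049557, 140473.360146],
})

# Negative error: the model predicted too low. Positive: too high.
head_rows["Error"] = head_rows["Predictions"] - head_rows["Actual Values"]
head_rows["Abs Error"] = head_rows["Error"].abs()
head_rows["Pct Error"] = 100 * head_rows["Abs Error"] / head_rows["Actual Values"]

print(head_rows.round(1))
```

The same three lines could be applied to the full comparison data frame to find the rows where the model is furthest off.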


📌 Our **RMSE** is approximately 33,000! Given that the average price is about 185,000 and the maximum is as high as 500,000, an error of 33,000 is reasonable for a first model. Keep going! We can also check which features are most strongly related to our target. A simple way is to compute the **correlations** on our train dataset. Since we only need the correlations between the target and the features, we can take just the “SalePrice” column from the correlation data frame. From the unsorted values it is hard to tell which features matter most, so let’s sort them with sort_values and view the top 10 with head.

In [None]:
train.corr()["SalePrice"].sort_values(ascending=False).head(10)

SalePrice           1.000000
OverallQual         0.792263
GrLivArea           0.712054
GarageCars          0.658355
GarageArea          0.621354
1stFlrSF            0.621057
TotalBsmtSF         0.612205
FullBath            0.597505
TotRmsAbvGrd        0.573845
Foundation_PConc    0.517222
Name: SalePrice, dtype: float64

In [None]:
correlations = train.corr()
correlations

|  | OverallQual | YearBuilt | YearRemodAdd | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars | GarageArea | SalePrice | ExterQual_TA | Foundation_PConc | KitchenQual_TA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OverallQual | 1.0 | 0.572367 | 0.550407 | 0.557685 | 0.539527 | 0.62889 | 0.598265 | 0.482744 | 0.627897 | 0.579378 | 0.792263 | -0.692146 | 0.593079 | -0.579892 |
| YearBuilt | 0.572367 | 1.0 | 0.615451 | 0.418706 | 0.315715 | 0.205311 | 0.496001 | 0.122193 | 0.530869 | 0.466243 | 0.503317 | -0.6086 | 0.675289 | -0.478635 |
| YearRemodAdd | 0.550407 | 0.615451 | 1.0 | 0.305751 | 0.299912 | 0.300983 | 0.500358 | 0.189233 | 0.507051 | 0.459938 | 0.504414 | -0.58621 | 0.608433 | -0.621112 |
| TotalBsmtSF | 0.557685 | 0.418706 | 0.305751 | 1.0 | 0.912271 | 0.51743 | 0.370448 | 0.337671 | 0.476327 | 0.539858 | 0.612205 | -0.414837 | 0.330111 | -0.353424 |
| 1stFlrSF | 0.539527 | 0.315715 | 0.299912 | 0.912271 | 1.0 | 0.589766 | 0.392271 | 0.416777 | 0.472616 | 0.531808 | 0.621057 | -0.355415 | 0.262008 | -0.315156 |
| GrLivArea | 0.62889 | 0.205311 | 0.300983 | 0.51743 | 0.589766 | 1.0 | 0.624707 | 0.826999 | 0.492914 | 0.4998 | 0.712054 | -0.427637 | 0.34034 | -0.384288 |
| FullBath | 0.598265 | 0.496001 | 0.500358 | 0.370448 | 0.392271 | 0.624707 | 1.0 | 0.550967 | 0.528268 | 0.465081 | 0.597505 | -0.516471 | 0.519781 | -0.474227 |
| TotRmsAbvGrd | 0.482744 | 0.122193 | 0.189233 | 0.337671 | 0.416777 | 0.826999 | 0.550967 | 1.0 | 0.426842 | 0.389448 | 0.573845 | -0.307535 | 0.2559 | -0.251362 |
| GarageCars | 0.627897 | 0.530869 | 0.507051 | 0.476327 | 0.472616 | 0.492914 | 0.528268 | 0.426842 | 1.0 | 0.845512 | 0.658355 | -0.543945 | 0.517289 | -0.465095 |
| GarageArea | 0.579378 | 0.466243 | 0.459938 | 0.539858 | 0.531808 | 0.4998 | 0.465081 | 0.389448 | 0.845512 | 1.0 | 0.621354 | -0.511492 | 0.451725 | -0.455758 |


In [None]:
saleprice_correlations = correlations["SalePrice"]
saleprice_correlations

OverallQual         0.792263
YearBuilt           0.503317
YearRemodAdd        0.504414
TotalBsmtSF         0.612205
1stFlrSF            0.621057
GrLivArea           0.712054
FullBath            0.597505
TotRmsAbvGrd        0.573845
GarageCars          0.658355
GarageArea          0.621354
SalePrice           1.000000
ExterQual_TA       -0.598202
Foundation_PConc    0.517222
KitchenQual_TA     -0.527176
Name: SalePrice, dtype: float64

📌 Don’t forget that we need to set ascending to False because we want to see the 10 highest values!

In [None]:
saleprice_correlations.sort_values(ascending=False).head(10)

SalePrice           1.000000
OverallQual         0.792263
GrLivArea           0.712054
GarageCars          0.658355
GarageArea          0.621354
1stFlrSF            0.621057
TotalBsmtSF         0.612205
FullBath            0.597505
TotRmsAbvGrd        0.573845
Foundation_PConc    0.517222
Name: SalePrice, dtype: float64
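The ranking above has two quirks: SalePrice tops the list because it is perfectly correlated with itself, and strongly negative correlations never reach the top. One possible variation, sketched below, drops the target and ranks the features by the absolute value of their correlation. The numbers are copied from the saleprice_correlations output; the variable names are made up for this illustration.

```python
import pandas as pd

# Values copied from the saleprice_correlations output.
corr = pd.Series({
    "OverallQual": 0.792263, "YearBuilt": 0.503317, "YearRemodAdd": 0.504414,
    "TotalBsmtSF": 0.612205, "1stFlrSF": 0.621057, "GrLivArea": 0.712054,
    "FullBath": 0.597505, "TotRmsAbvGrd": 0.573845, "GarageCars": 0.658355,
    "GarageArea": 0.621354, "SalePrice": 1.000000, "ExterQual_TA": -0.598202,
    "Foundation_PConc": 0.517222, "KitchenQual_TA": -0.527176,
})

# Remove the target's correlation with itself.
features_only = corr.drop("SalePrice")

# Order by strength of the relationship, ignoring its sign,
# while keeping the signed values in the result.
order = features_only.abs().sort_values(ascending=False).index
ranked = features_only.reindex(order)

print(ranked.head(10))
```

Ranked this way, ExterQual_TA and KitchenQual_TA enter the top 10 even though their correlations are negative. Keep in mind that correlation measures how strongly two columns move together; it does not by itself show that a feature causes the price to change.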