# Python - Week 4 - Linear regression

## Shahin Mammadov

Your neighbor is a real estate agent and wants some help predicting housing prices for regions in the USA. It would be great if you could create a model that lets her enter a few features of a house and returns an estimate of what the house would sell for.

She has asked you if you could help her out with your new data science skills. You say yes, and decide that Linear Regression might be a good path to solve this problem!

Your neighbor then gives you some information about a bunch of houses in regions of the United States; it is all in the data set USA_Housing.csv.

The data contains the following columns:

- 'Avg. Area Income': average income of residents of the city where the house is located
- 'Avg. Area House Age': average age of houses in the same city
- 'Avg. Area Number of Rooms': average number of rooms for houses in the same city
- 'Avg. Area Number of Bedrooms': average number of bedrooms for houses in the same city
- 'Area Population': population of the city where the house is located
- 'Price': price that the house sold at
- 'Address': address of the house

## 1. Let's prepare the data.

In [29]:
import pandas as pd

In [30]:
data = pd.read_csv('USA_Housing.csv')

In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB


In [32]:
data.head()

| | Avg. Area Income | Avg. Area House Age | Avg. Area Number of Rooms | Avg. Area Number of Bedrooms | Area Population | Price | Address |
|---|---|---|---|---|---|---|---|
| 0 | 79545.458574 | 5.682861 | 7.009188 | 4.09 | 23086.800503 | 1059034.0 | 208 Michael Ferry Apt. 674\nLaurabury, NE 3701... |
| 1 | 79248.642455 | 6.0029 | 6.730821 | 3.09 | 40173.072174 | 1505891.0 | 188 Johnson Views Suite 079\nLake Kathleen, CA... |
| 2 | 61287.067179 | 5.86589 | 8.512727 | 5.13 | 36882.1594 | 1058988.0 | 9127 Elizabeth Stravenue\nDanieltown, WI 06482... |
| 3 | 63345.240046 | 7.188236 | 5.586729 | 3.26 | 34310.242831 | 1260617.0 | USS Barnett\nFPO AP 44820 |
| 4 | 59982.197226 | 5.040555 | 7.839388 | 4.23 | 26354.109472 | 630943.5 | USNS Raymond\nFPO AE 09386 |


Looking at the task, the dependent variable is Price and the remaining columns are independent variables, so we assign Price to y and the other columns to X. We also exclude Address, because a linear regression model cannot use a free-text column directly.

In [33]:
y = data['Price']

X = data[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)
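
Because the split above uses `random_state=None`, every run shuffles the rows differently, so the fitted coefficients and metrics change slightly from run to run. The following is a minimal sketch of that idea using only the standard library. The helper `holdout_split` and its `seed` parameter are hypothetical names for illustration and are not scikit-learn's implementation; `seed` plays the role that `random_state` plays in `train_test_split`.

```python
import math
import random


def holdout_split(rows, test_size=0.2, seed=None):
    """Shuffle row indices and hold out a fraction of them for testing.

    Illustrative helper only (not scikit-learn's implementation).
    """
    indices = list(range(len(rows)))
    # seed=None gives a different order on every run; a fixed seed does not.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train_a, test_a = holdout_split(rows, test_size=0.2, seed=42)
train_b, test_b = holdout_split(rows, test_size=0.2, seed=42)

# Same seed, same split: any metric computed on it is reproducible.
print(test_a == test_b)           # True
print(len(train_a), len(test_a))  # 8 2
```

Passing a fixed integer as `random_state` to `train_test_split` has the same effect of making the results repeatable.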

## 2. Training the model

In [35]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [36]:
y_pred = regressor.predict(X_test)

## 3. Model evaluation

In [37]:
from sklearn import metrics
import numpy as np
R2 = metrics.r2_score(y_test, y_pred)
n = X_test.shape[0]
p = X_test.shape[1]

In [38]:
print('R^2:', R2)
# Adjusted R^2 penalizes R^2 for the number of predictors p
print('Adjusted R^2:', round(1 - (1 - R2) * (n - 1) / (n - p - 1), 4))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

R^2: 0.9133765530963829
Adjusted R^2: 0.9129
Mean Absolute Error: 81017.13019705343
Mean Squared Error: 10204051189.674582
Root Mean Squared Error: 101015.10377005303


Let's evaluate the linear regression model using R^2. The coefficient of determination is the proportion of the variation in the dependent variable that is predictable from the independent variables. Above, R^2 = 0.91, which means the model explains about 91% of the variance in house prices on the test set, while the remaining 9% of the variability is unaccounted for. It does not mean that 91% of prices are predicted correctly: the mean absolute error shows that a typical prediction misses by about \$81,000.
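
To make the definitions concrete, here is a minimal sketch that computes the same metrics from scratch with the standard library. The function name `regression_metrics` and the toy numbers are made up for illustration and are not taken from the housing data. Note that adjusted R^2 is 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of test rows and p the number of predictors; it is not 1 - R^2.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R^2 and adjusted R^2 for paired values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2, 'adj_r2': adj_r2}


# Toy values, invented for this example only.
y_true = [100.0, 200.0, 300.0, 400.0, 500.0]
y_hat = [110.0, 190.0, 330.0, 390.0, 480.0]

result = regression_metrics(y_true, y_hat, n_features=1)
for name, value in result.items():
    print(name, value)
```

Adjusted R^2 is always at most R^2, and the gap grows as predictors are added relative to the number of rows.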

### Interpret coefficients

In [40]:
coeff_df = pd.DataFrame(regressor.coef_,X.columns,columns=['Coefficient'])
coeff_df

| | Coefficient |
|---|---|
| Avg. Area Income | 21.712471 |
| Avg. Area House Age | 164771.792874 |
| Avg. Area Number of Rooms | 121098.583031 |
| Avg. Area Number of Bedrooms | 1695.046439 |
| Area Population | 15.214228 |


- With all other predictors held constant, if the average area income increases by 1 unit, the average price increases by \$21.71.
- With all other predictors held constant, if the average area house age increases by 1 unit, the average price increases by \$164771.79.
- With all other predictors held constant, if the average number of rooms increases by 1 unit, the average price increases by \$121098.58.
- With all other predictors held constant, if the average number of bedrooms increases by 1 unit, the average price increases by \$1695.05.
- With all other predictors held constant, if the area population increases by 1 unit, the average price increases by \$15.21.
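
The "held constant" reading can be checked directly: a linear model predicts intercept + sum(coefficient × feature), so raising one feature by 1 changes the prediction by exactly that feature's coefficient. The sketch below uses the coefficients from the table above and the feature values from the first row of `data.head()`. The notebook never prints `regressor.intercept_`, so the `INTERCEPT` value here is a placeholder assumption; it cancels out in the comparison. The helper name `linear_prediction` is also just illustrative.

```python
# Coefficients as displayed in the coefficient table (six decimals).
COEFFICIENTS = {
    'Avg. Area Income': 21.712471,
    'Avg. Area House Age': 164771.792874,
    'Avg. Area Number of Rooms': 121098.583031,
    'Avg. Area Number of Bedrooms': 1695.046439,
    'Area Population': 15.214228,
}

# Placeholder, NOT the fitted intercept (it is not shown in the notebook).
INTERCEPT = 0.0


def linear_prediction(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Intercept plus the coefficient-weighted sum of the feature values."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())


# Feature values from the first row of data.head().
house = {
    'Avg. Area Income': 79545.458574,
    'Avg. Area House Age': 5.682861,
    'Avg. Area Number of Rooms': 7.009188,
    'Avg. Area Number of Bedrooms': 4.09,
    'Area Population': 23086.800503,
}

# Same house, but with the area house age increased by one unit.
older = dict(house)
older['Avg. Area House Age'] += 1

difference = linear_prediction(older) - linear_prediction(house)
print(round(difference, 2))  # 164771.79
```

With the real model, `regressor.predict` on two such rows would differ by the same amount, since the fitted intercept appears in both predictions.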

In [43]:
regressor.coef_

array([2.17124714e+01, 1.64771793e+05, 1.21098583e+05, 1.69504644e+03,
       1.52142282e+01])