# Predicting house prices

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# plotting modules
import seaborn as sns

%matplotlib inline
import matplotlib.pyplot as plt
# read in all our data
houses = pd.read_csv("../DATA/houses.csv")

# set seed for reproducibility
np.random.seed(0)

**Load some house sales data**


In [None]:
houses

**Exploring the data for housing sales**

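House prices tend to be correlated with the size of the property. This notebook uses `LotArea` (the lot size in square feet) as the single input feature and `SalePrice` as the target.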
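
One way to quantify a relationship like this is the Pearson correlation coefficient. The sketch below uses small synthetic arrays as stand-ins for `houses.LotArea` and `houses.SalePrice` (the numbers are made up for illustration, and `pearson_r` is a hypothetical helper); on the real data, `houses.LotArea.corr(houses.SalePrice)` should give the equivalent value.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-ins for LotArea (sq ft) and SalePrice (dollars); not real data
lot_area = np.array([5000, 7500, 8000, 9600, 11000, 14000])
sale_price = np.array([120000, 150000, 145000, 181000, 200000, 250000])

r = pearson_r(lot_area, sale_price)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear relationship, and a value near 0 indicates little linear relationship.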

In [None]:
x = houses.LotArea
y = houses.SalePrice

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( x, y, test_size=0.20, random_state=0)

In [None]:

plt.scatter(x, y ) #, color='#2F08EC', marker="s")
#plt.scatter(x, y_ , color='#FF082C')
#plt.plot(houses.LotArea, houses.SalePrice , color='#FF0000')


**Create a simple regression model of LotArea to SalePrice**

The data was split into training and testing sets above with `train_test_split`.  
We pass `random_state=0` so that everyone running this notebook gets the same split. In practice, you may choose a different seed, or leave `random_state` unset to get a different random split on each run.  
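
To illustrate why a fixed seed makes the split reproducible, here is a minimal NumPy sketch of a shuffled 80/20 split. It is not a description of how `train_test_split` works internally, and the helper name `simple_split` is hypothetical.

```python
import numpy as np

def simple_split(x, y, test_size=0.20, seed=0):
    """Shuffle indices with a seeded generator, then cut off a test portion."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)       # same seed -> same permutation
    idx = rng.permutation(len(x))
    n_test = int(np.ceil(test_size * len(x)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Toy data, only to show the behaviour
x_demo = np.arange(10)
y_demo = x_demo * 2

first = simple_split(x_demo, y_demo, seed=0)
second = simple_split(x_demo, y_demo, seed=0)

# Both calls produce identical splits because the seed is the same
print(all(np.array_equal(p, q) for p, q in zip(first, second)))
```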

## Build the regression model using only LotArea as a feature

In [None]:
from sklearn.linear_model import LinearRegression
model1 = LinearRegression(fit_intercept=True)

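# scikit-learn expects a 2-D feature array, so convert the Series to a
# NumPy array and add a column axis (indexing a Series with [:, np.newaxis]
# fails on recent pandas versions)
model1.fit(X_train.to_numpy()[:, np.newaxis], Y_train)

Y_pred = model1.predict(X_train.to_numpy()[:, np.newaxis])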

plt.scatter(X_train, Y_train)
plt.plot(X_train, Y_pred, color='#FF0000');

**Evaluate the simple model**

In [None]:
model1

In [None]:
print(model1.intercept_)
print(model1.coef_)

print('Coefficients: W=%.3f, b=%.3f' % (model1.coef_[0],model1.intercept_))

The score is R2, the coefficient of determination ("R squared"): the proportion of the variance in the dependent variable that is predictable from the independent variable(s).


R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1 indicates that the regression line perfectly fits the data.
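
The definition can be written as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. A minimal sketch of that formula follows; the helper name `r2_score_manual` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy targets, not taken from the housing data
y_true = np.array([100.0, 150.0, 200.0, 250.0])

perfect = r2_score_manual(y_true, y_true)                        # exact predictions
baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))    # always predict the mean
print(perfect, baseline)
```

Always predicting the mean gives an R2 of 0, and on held-out data a model that does worse than that baseline gets a negative R2.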

In [None]:
# convert the Series to a 2-D NumPy array before scoring
model1.score(X_test.to_numpy()[:, np.newaxis], Y_test)

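Note that `score` returns the R2 on the test set, not an error in dollars. The root mean squared error (RMSE) is the usual way to express the prediction error in the units of `SalePrice`.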
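
The cell above reports R2, so an RMSE has to be computed separately. A minimal sketch, assuming the true and predicted prices are available as arrays (the helper name `rmse` and the numbers are made up for illustration; with the fitted model you would pass `Y_test` and the model's predictions for the test features):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices in dollars, not results from this notebook
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 320000.0])

error = rmse(actual, predicted)
print(f"RMSE = {error:,.2f}")
```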

**Let's show what our predictions look like**

Matplotlib is a Python plotting library. You can install it with:

`pip install matplotlib`

In [None]:
def predict(x,W,b):
    p = list()
    for i in x:
        y = W*i + b
        p.append(y)
    return p

In [None]:
# coef_ is a length-1 array, so pass the scalar slope to get a flat list of predictions
y_ = predict(X_test, model1.coef_[0], model1.intercept_)
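
The loop in `predict` can also be written as a single NumPy expression, since multiplication and addition broadcast over arrays. This is a sketch with made-up values; `predict_vectorized` is a hypothetical name, not part of the notebook.

```python
import numpy as np

def predict_vectorized(x, W, b):
    """Apply y = W * x + b to every element of x at once."""
    return W * np.asarray(x, dtype=float) + b

# Made-up inputs and coefficients for illustration
x_demo = [1000, 2000, 3000]
W_demo, b_demo = 2.5, 100.0

loop_result = [W_demo * i + b_demo for i in x_demo]    # same logic as the loop version
vector_result = predict_vectorized(x_demo, W_demo, b_demo)

print(np.allclose(loop_result, vector_result))
```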

Below: blue squares are the training data, red dots are the test data, and the green line is the prediction from the simple regression.

The learned regression coefficients were printed above.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(X_train, Y_train , color='#2F08EC', marker="s")
plt.scatter(X_test, Y_test , color='#FF082C')
plt.plot(X_test, y_ , color='#00FF00')

### Explore other features in the data

To build a more elaborate model, we will explore using more features.
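
With several features the model becomes $\hat{y} = w_1 x_1 + \dots + w_k x_k + b$. As a rough illustration of what fitting such a model involves, the sketch below solves the ordinary least-squares problem with `np.linalg.lstsq` on synthetic, noise-free data. It is an illustration of the idea, not a description of scikit-learn's internals, and all names and values in it are assumptions.

```python
import numpy as np

# Synthetic data: two features and a known linear relationship (no noise)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
true_w = np.array([3.0, -2.0])
true_b = 5.0
y = X @ true_w + true_b

# Append a column of ones so the intercept is estimated as an extra coefficient
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

print("weights:", w_hat)
print("intercept:", b_hat)
```

Because the synthetic targets contain no noise, the recovered weights and intercept match the ones used to generate the data, up to floating-point error.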

In [None]:
#Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,
#Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,
#HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,
#Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,
#BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,
#Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,
#BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,
#FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,
#WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,
#MoSold,YrSold,SaleType,SaleCondition,SalePrice

H = houses[['LotFrontage','LotArea','OverallQual','OverallCond','SalePrice']].dropna()
x2 = H[['LotFrontage','LotArea','OverallQual','OverallCond']]
y2 = H.SalePrice


X2_train, X2_test, Y2_train, Y2_test = train_test_split( x2, y2, test_size=0.20, random_state=0)

In [None]:
model2 = LinearRegression(fit_intercept=True)

model2.fit(X2_train, Y2_train)

y2fit = model2.predict(X2_test)

In [None]:
plt.scatter(X2_train.LotArea, Y2_train , color='#2F08EC', marker="s")
plt.scatter(X2_test.LotArea, Y2_test , color='#FF082C')
plt.scatter(X2_test.LotArea, y2fit , color='#00FF00')

In [None]:
model2

In [None]:
print(model2.intercept_)
print(model2.coef_)

And now the R2 score, first on the test set and then on the training set (recall that 1 indicates that the regression line perfectly fits the data).

In [None]:
model2.score(X2_test,Y2_test)

In [None]:
model2.score(X2_train,Y2_train)

More info about [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

More info about the R2 score: [Coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
