# Fire up GraphLab Create

In [None]:
import graphlab

# Load some house sales data

The dataset contains house sales from King County, the region where the city of Seattle, WA is located.

In [None]:
sales = graphlab.SFrame('home_data.gl/')

In [None]:
sales

# Exploring the data for housing sales

The house price is correlated with the number of square feet of living space.

In [None]:
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

# Create a simple regression model of sqft_living to price

Split data into training and testing.  
We use seed=0 so that everyone running this notebook gets the same results.  In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).  

In [None]:
train_data,test_data = sales.random_split(.8,seed=0)
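
To show why a fixed seed gives everyone the same split, here is a minimal pure-Python sketch. The helper `seeded_split` is hypothetical and only an illustration; whether GraphLab Create's `random_split` samples row by row in this way is an assumption.

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first split with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales table
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)
train_c, test_c = seeded_split(rows, 0.8, seed=1)

print(train_a == train_b)  # same seed: identical split
print(train_a == train_c)  # different seed: almost surely a different split
print(len(train_a), len(test_a))
```

Because each row is assigned independently, the first split holds roughly, not exactly, 80% of the rows.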

## Build the regression model using only sqft_living as a feature

In [None]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'],validation_set=None)

# Evaluate the simple model

In [None]:
print test_data['price'].mean()

In [None]:
print sqft_model.evaluate(test_data)

RMSE of about \$255,170!
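
RMSE is the square root of the mean squared difference between actual and predicted prices. The sketch below computes it by hand on three made-up prices, which are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Made-up prices for illustration only
actual = [300000.0, 450000.0, 520000.0]
predicted = [310000.0, 430000.0, 560000.0]

print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly mispredicted houses can dominate the RMSE.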

# Let's show what our predictions look like

Matplotlib is a Python plotting library that we use here to draw the predictions.  You can install it with:

`pip install matplotlib`

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.plot(test_data['sqft_living'],test_data['price'],'.',
        test_data['sqft_living'],sqft_model.predict(test_data),'-')

Above:  blue dots are original data, green line is the prediction from the simple regression.

Below: we can view the learned regression coefficients. 

In [None]:
sqft_model.get('coefficients')
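
For a single feature, the two coefficients (intercept and slope) have a closed-form least-squares solution. The sketch below, with a hypothetical helper `fit_line` and made-up data, shows that formula; GraphLab Create's own solver and settings may give slightly different numbers.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on price = 50000 + 200 * sqft
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [250000.0, 350000.0, 450000.0, 550000.0]

intercept, slope = fit_line(sqft, price)
print(intercept, slope)
```

The slope reads as the predicted price change per additional square foot of living space.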

# Explore other features in the data

To build a more elaborate model, we will explore using more features.

In [None]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [None]:
sales[my_features].show()

In [None]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

Pull the bar at the bottom to view more of the data.  

98039 is the most expensive zip code.

# Build a regression model with more features

In [None]:
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)

In [None]:
print my_features

## Comparing the results of the simple model with adding more features

In [None]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)

The RMSE goes down from \$255,170 to \$179,508 with more features.

# Apply learned models to predict prices of 3 houses

The first house we will use is considered an "average" house in Seattle. 

In [None]:
house1 = sales[sales['id']=='5309101200']

In [None]:
house1

<img src="http://info.kingcounty.gov/Assessor/eRealProperty/MediaHandler.aspx?Media=2916871">

In [None]:
print house1['price']

In [None]:
print sqft_model.predict(house1)

In [None]:
print my_features_model.predict(house1)

In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature.  However, on average, the model with more features is better.
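
A model with a lower RMSE can still be worse on an individual house. The toy numbers below are made up to illustrate that point and are not predictions from either model.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

# Made-up values (in thousands of dollars) for three houses
actual = [100.0, 200.0, 300.0]
simple_pred = [105.0, 260.0, 240.0]  # stand-in for a one-feature model
rich_pred = [120.0, 205.0, 295.0]    # stand-in for a many-feature model

# Absolute error on the first house only
simple_err_house0 = abs(actual[0] - simple_pred[0])
rich_err_house0 = abs(actual[0] - rich_pred[0])

print(simple_err_house0, rich_err_house0)
print(rmse(actual, simple_pred), rmse(actual, rich_pred))
```

Here the simpler model wins on the first house, while the richer model has the lower error across all three.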

## Prediction for a second, fancier house

We will now examine the predictions for a fancier house.

In [None]:
house2 = sales[sales['id']=='1925069082']

In [None]:
house2

<img src="https://ssl.cdn-redfin.com/photo/1/bigphoto/302/734302_0.jpg">

In [None]:
print sqft_model.predict(house2)

In [None]:
print my_features_model.predict(house2)

In this case, the model with more features provides a better prediction.  This behavior is expected here, because this house is more differentiated by features that go beyond its square feet of living space, especially the fact that it's a waterfront house. 

## Last house, super fancy

Our last house is a very large one owned by a famous Seattleite.

In [None]:
bill_gates = {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Bill_gates%27_house.jpg/2560px-Bill_gates%27_house.jpg">

In [None]:
print my_features_model.predict(graphlab.SFrame(bill_gates))

The model predicts a price of over \$13M for this house! But we expect the house to cost much more.  (The dataset contains very few houses this fancy, so we don't expect the model to predict its price accurately.)

# my_code

In [1]:
# __future__ imports go first so true division and print() apply to everything below
from __future__ import absolute_import, division, print_function
import numpy as np
import graphlab
import matplotlib.pyplot as plt
%matplotlib inline
graphlab.canvas.set_target('ipynb')



In [2]:
data_all = graphlab.SFrame('home_data.gl/')
#data_all.show(view="Scatter Plot", x="sqft_living", y="price")
#data_all

In [3]:
# Split into training (80%) and test (20%) sets; seed=0 makes the split reproducible
data_train, data_test = data_all.random_split(0.8, seed=0)
#print(len(data_test))
#print(len(data_train))

In [4]:
# Zip code 98039 has the highest average house price
zip_98039_prices = data_all['price'][data_all['zipcode']=='98039']
# mean() avoids hard-coding the number of sales in this zip code
zip_98039_prices.mean()
# answer to the first quiz question

2160606.6
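
A per-zip-code average like the one above can be sketched without GraphLab Create as a group-by mean. The helper `mean_price_by_zip` is hypothetical and the prices are made up for illustration.

```python
from collections import defaultdict

def mean_price_by_zip(rows):
    """Return {zipcode: mean price} for (zipcode, price) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zipcode, price in rows:
        totals[zipcode] += price
        counts[zipcode] += 1
    return {z: totals[z] / counts[z] for z in totals}

# Made-up sales, not rows from home_data.gl
rows = [
    ('98039', 2000000.0), ('98039', 2400000.0),
    ('98004', 1300000.0), ('98004', 1500000.0),
    ('98103', 600000.0),
]

means = mean_price_by_zip(rows)
top_zip = max(means, key=means.get)  # zip code with the highest mean price
print(top_zip, means[top_zip])
```

Dividing by the counted number of rows, rather than a fixed constant, keeps the average correct if the data changes.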

In [5]:
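# Fraction of houses with living area strictly between 2000 and 4000 sq. ft.
req_data_len = len(data_all[(data_all['sqft_living']>2000) & (data_all['sqft_living']<4000)])
req_data_len/len(data_all)
# answer to the second quiz question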

0.4215518437977143
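
The fraction above relies on `from __future__ import division`; under plain Python 2, dividing two integers would truncate to 0. A minimal sketch that gives the same result either way, using a hypothetical helper `fraction_in_range` and made-up square footages:

```python
def fraction_in_range(values, low, high):
    """Fraction of values strictly between low and high."""
    in_range = [v for v in values if low < v < high]
    # float() forces true division even without the __future__ import
    return len(in_range) / float(len(values))

# Made-up living areas in square feet
sqft = [1500, 2100, 2500, 3999, 4000, 5200, 900, 2000]

frac = fraction_in_range(sqft, 2000, 4000)
print(frac)
```

The bounds are exclusive here, matching the `>` and `<` comparisons used on `sqft_living`.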

In [6]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [7]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

#### Training both models with different features

In [8]:
my_feature_model = graphlab.linear_regression.create(
    data_train, target='price', features=my_features, validation_set=None)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 1.050867     | 3763208.270524     | 181908.848367 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:


In [9]:
adv_feature_model = graphlab.linear_regression.create(
    data_train, target='price', features=advanced_features, validation_set=None)

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.182697     | 3469012.450624     | 154580.940734 |
PROGRESS: | 2         | 3        | 0.286620     | 3469012.450673     | 154580.940735 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:


In [10]:
# model.evaluate reports the max error and RMSE on the test set
my_feature_model.evaluate(data_test)

{'max_error': 3486584.50938179, 'rmse': 179542.43331269047}

In [11]:
adv_feature_model.evaluate(data_test)

{'max_error': 3556849.413849059, 'rmse': 156831.11680200775}

In [12]:
my_feature_model.evaluate(data_test)['rmse'] - adv_feature_model.evaluate(data_test)['rmse']

22711.316510682722
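
As a quick arithmetic check, the difference can be reproduced from the two RMSE values reported above, along with the relative improvement. The variable names are illustrative; storing each `evaluate` result once also avoids recomputing it.

```python
# RMSE values copied from the evaluate() outputs above
my_rmse = 179542.43331269047   # model with 6 features
adv_rmse = 156831.11680200775  # model with 18 features

improvement = my_rmse - adv_rmse  # absolute drop in RMSE, in dollars
relative = improvement / my_rmse  # drop as a fraction of the 6-feature RMSE

print(improvement)
print(relative)
```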

In [13]:
# The RMSE difference above is the answer to the third quiz question