# Fire up GraphLab Create

In [1]:
import graphlab as gl

A newer version of GraphLab Create (v1.6.1) is available! Your current version is v1.6.

# Load some house sales data

In [2]:
sales = gl.SFrame('home_data.gl/')

[INFO] This non-commercial license of GraphLab Create is assigned to vantroy@gmail.com and will expire on September 23, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-3985 - Server binary: /home/hades/devel/anaconda/envs/dato-env/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443844055.log
[INFO] GraphLab Server Version: 1.6


In [3]:
sales

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900,3,1.0,1180,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000,3,2.25,2570,7242,2,0
5631500400,2015-02-25 00:00:00+00:00,180000,2,1.0,770,10000,1,0
2487200875,2014-12-09 00:00:00+00:00,604000,4,3.0,1960,5000,1,0
1954400510,2015-02-18 00:00:00+00:00,510000,3,2.0,1680,8080,1,0
7237550310,2014-05-12 00:00:00+00:00,1225000,4,4.5,5420,101930,1,0
1321400060,2014-06-27 00:00:00+00:00,257500,3,2.25,1715,6819,2,0
2008000270,2015-01-15 00:00:00+00:00,291850,3,1.5,1060,9711,1,0
2414600126,2015-04-15 00:00:00+00:00,229500,3,1.0,1780,7470,1,0
3793500160,2015-03-12 00:00:00+00:00,323000,3,2.5,1890,6560,2,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661
0,5,7,1050,910,1965,0,98136,47.52082
0,3,8,1680,0,1987,0,98074,47.61681228
0,3,11,3890,1530,2001,0,98053,47.65611835
0,3,7,1715,0,1995,0,98003,47.30972002
0,3,7,1060,0,1963,0,98198,47.40949984
0,3,7,1050,730,1960,0,98146,47.51229381
0,3,7,1890,0,2003,0,98038,47.36840673

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0
-122.39318505,1360.0,5000.0
-122.04490059,1800.0,7503.0
-122.00528655,4760.0,101930.0
-122.32704857,2238.0,6819.0
-122.31457273,1650.0,9711.0
-122.33659507,1780.0,8113.0
-122.0308176,2390.0,7570.0


# Exploring the data for housing sales

In [4]:
gl.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")



# Create a simple regression model of sqft_living to price

In [5]:
train_data, test_data = sales.random_split(.8, seed=0)
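
The split above can be pictured with a small plain-Python sketch. The helper `random_split_rows` is hypothetical and is not how `SFrame.random_split` is implemented. It assumes each row is sent to the first part independently with probability `fraction`, which would make the split roughly, not exactly, 80/20, and reproducible when the seed is fixed.

```python
import random


def random_split_rows(rows, fraction, seed=None):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator so the seed does not affect other code
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


# Stand-in for the rows of `sales`; these are row indices, not real data.
rows = list(range(1000))
train_rows, test_rows = random_split_rows(rows, 0.8, seed=0)
```

Every row lands in exactly one of the two parts, so the test rows are never seen during training.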

In [6]:
train_data.sample

<bound method SFrame.sample of Columns:
	id	str
	date	datetime
	price	int
	bedrooms	str
	bathrooms	str
	sqft_living	int
	sqft_lot	int
	floors	str
	waterfront	int
	view	int
	condition	int
	grade	int
	sqft_above	int
	sqft_basement	int
	yr_built	int
	yr_renovated	int
	zipcode	str
	lat	float
	long	float
	sqft_living15	float
	sqft_lot15	float

Rows: Unknown

Data:
+------------+---------------------------+---------+----------+-----------+-------------+
|     id     |            date           |  price  | bedrooms | bathrooms | sqft_living |
+------------+---------------------------+---------+----------+-----------+-------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 |  221900 |    3     |     1     |     1180    |
| 6414100192 | 2014-12-09 00:00:00+00:00 |  538000 |    3     |    2.25   |     2570    |
| 5631500400 | 2015-02-25 00:00:00+00:00 |  180000 |    2     |     1     |     770     |
| 2487200875 | 2014-12-09 00:00:00+00:00 |  604000 |    4     |     3     |     1960    |
| 195440

# Build a regression model

In [7]:
sqft_model = gl.linear_regression.create(train_data, target='price', features=['sqft_living'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16523
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients      : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.004445     | 4297668.734813     | 4373157.
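
With a single feature the model has two coefficients, an intercept and a slope, which matches the count reported in the log above. The log shows GraphLab Create fitting them with Newton's method. The sketch below instead uses the closed-form least-squares solution, which is enough to see what is being estimated. The function name and the toy numbers are made up for illustration and do not come from `home_data.gl`.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up toy data lying exactly on price = 50000 + 200 * sqft.
toy_sqft = [1000, 1500, 2000, 2500]
toy_price = [250000, 350000, 450000, 550000]
intercept, slope = fit_simple_linear_regression(toy_sqft, toy_price)
```

On this toy data the fit recovers the line exactly. Real sales data is noisy, so the fitted line only captures the average trend of price against living area.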

# Evaluate the simple model

In [8]:
print test_data['price'].mean()

543054.042563


In [9]:
print sqft_model.evaluate(test_data)

{'max_error': 4162109.8683798565, 'rmse': 255150.0490403781}
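
The two metrics reported above can be sketched in plain Python. This assumes `rmse` is the square root of the mean squared residual and `max_error` is the largest absolute residual. The helper name and the toy prices are illustrative only.

```python
import math


def rmse_and_max_error(actual, predicted):
    """Return the root-mean-square error and the largest absolute error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(r * r for r in residuals) / float(len(residuals)))
    max_error = max(abs(r) for r in residuals)
    return {'max_error': max_error, 'rmse': rmse}


# Made-up prices and predictions, not taken from the test set.
toy_actual = [200000, 300000, 400000]
toy_predicted = [210000, 290000, 430000]
metrics = rmse_and_max_error(toy_actual, toy_predicted)
```

Read against the mean test price of about 543,054, the reported RMSE of about 255,150 is roughly 47% of a typical sale price, so living area alone leaves a large share of the price unexplained.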


# Show what our predictions look like

In [10]:
import matplotlib.pyplot as plt