# Homework

In [40]:
import graphlab
import matplotlib.pyplot as plt
%matplotlib inline

In [34]:
sales = graphlab.SFrame('home_data.gl/')

**1. Selection and summary statistics**: In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price.

In [29]:
expensive_houses = sales[sales['zipcode']=='98039']

In [30]:
graphlab.canvas.set_target('ipynb')
expensive_houses.show()
# Compute the average sale price for this zip code directly.
expensive_houses['price'].mean()

**2. Filtering data**: One of the key features we used in our model was the number of square feet of living space (`sqft_living`) in the house. For this part, we are going to use the idea of filtering (selecting) data.

In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in [the Logical Filter section of this documentation](https://turi.com/products/create/docs/generated/graphlab.SFrame.html). Using such filters, first select the houses that have `sqft_living` higher than 2000 sqft but no larger than 4000 sqft.

What fraction of all the houses have `sqft_living` in this range?

In [31]:
filtered_data = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]
total_data_length, filtered_data_length = len(sales), len(filtered_data)
filtered_data_fraction = float(filtered_data_length) / total_data_length
filtered_data_fraction

0.42187572294452413
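To make the boundary conditions of the filter explicit, here is a minimal pure-Python sketch of the same idea: build an element-wise boolean mask, keep the rows where it is true, and divide the counts. The `sqft_living` values below are made up for illustration and do not come from `home_data.gl`; the list-based mask only mimics what the SFrame logical filter does and is not GraphLab's implementation.

```python
# Hypothetical living-area values in square feet (illustrative, not real data).
sqft_living = [1500, 2000, 2001, 3200, 4000, 4001, 5200]

# Element-wise mask: strictly greater than 2000 and at most 4000.
mask = [(2000 < s <= 4000) for s in sqft_living]

# Keep only the entries whose mask value is True.
selected = [s for s, keep in zip(sqft_living, mask) if keep]

# Cast to float so the division is not truncated under Python 2.
fraction = float(len(selected)) / len(sqft_living)

print(selected)
print(fraction)
```

Note that 2000 is excluded and 4000 is included, which matches "higher than 2000 sqft but no larger than 4000 sqft".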

**3. Building a regression model with several more features**: In the sample notebook, we built two regression models to predict house prices: one using just `sqft_living` and another using a few more features, a set we called `my_features`:

```python
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
```

Now, going back to the original dataset, you will build a model using the following features:

```python
advanced_features = [
    'bedrooms',
    'bathrooms',
    'sqft_living',
    'sqft_lot',
    'floors',
    'zipcode',
    'condition',  # condition of house
    'grade',  # measure of quality of construction
    'waterfront',  # waterfront property
    'view',  # type of view
    'sqft_above',  # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',  # the year built
    'yr_renovated',  # the year renovated
    'lat', 'long',  # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',  # average lot size of 15 nearest neighbors
]
```

**Compute the RMSE (root mean squared error)** on the `test_data` for the model using just `my_features`, and for the one using `advanced_features`.

**Note 1**: both models must be trained on the original sales dataset, not the filtered one.

**Note 2**: when doing the train-test split, make sure you use seed=0, so you get the same training and test sets, and thus results, as we do.

**Note 3**: in the module we discussed residual sum of squares (RSS) as an error metric for regression, but GraphLab Create uses root mean squared error (RMSE). These are two common measures of error in regression, and RMSE is simply the square root of the mean squared error, that is, of the RSS averaged over the data points:

$$RMSE = \sqrt{\frac{RSS}{N}}$$

where $N$ is the number of data points. RMSE can be more intuitive than RSS because its units are the same as those of the target column (dollars, in our case) and, unlike RSS, it does not grow with the number of data points.
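The relationship between the two metrics can be checked with a small pure-Python sketch. The helper names `rss` and `rmse` and the price values are illustrative assumptions, not part of GraphLab Create. Duplicating the data doubles the RSS but leaves the RMSE unchanged.

```python
import math

def rss(actual, predicted):
    """Residual sum of squares: sum of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean squared error: sqrt(RSS / N)."""
    return math.sqrt(rss(actual, predicted) / float(len(actual)))

# Hypothetical prices in dollars (illustrative values only).
actual = [300000.0, 450000.0, 520000.0]
predicted = [310000.0, 430000.0, 550000.0]

print(rss(actual, predicted))   # 1400000000.0
print(rmse(actual, predicted))  # in dollars, same unit as the target

# Repeating every data point doubles RSS but does not change RMSE.
print(rss(actual * 2, predicted * 2))
print(rmse(actual * 2, predicted * 2))
```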

**Important note**: when answering the question below using GraphLab Create, make sure you pass the parameter `validation_set=None` when you call the `linear_regression.create()` function. When you use regression in GraphLab Create, it sets aside a small random subset of the data to validate some parameters. This process can cause fluctuations in the final RMSE, so we will avoid it to make sure everyone gets the same answer.

What is the difference in RMSE between the model trained with `my_features` and the one trained with `advanced_features`?

In [36]:
train_data, test_data = sales.random_split(.8, seed=0)
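Fixing the seed is what makes the split above reproducible. The sketch below illustrates the idea in plain Python with a hypothetical `random_split` helper that assigns each row to the first part with a given probability. It is a simplified stand-in written for this example, not GraphLab's implementation, and it will not reproduce the same rows that `SFrame.random_split` selects.

```python
import random

def random_split(rows, fraction, seed):
    """Assign each row to the first part with probability `fraction`.

    Hypothetical helper for illustration; a dedicated generator seeded
    with `seed` makes the assignment repeatable.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(20))

# Two calls with the same seed produce identical partitions.
train_a, test_a = random_split(rows, 0.8, seed=0)
train_b, test_b = random_split(rows, 0.8, seed=0)

print(train_a == train_b and test_a == test_b)  # True
```

Because each row is assigned independently, the first part holds roughly, not exactly, 80% of the rows.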

In [43]:
basic_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

basic_model = graphlab.linear_regression.create(train_data, target='price', 
                                                features=basic_features, validation_set=None)
basic_model_evaluation = basic_model.evaluate(test_data)

In [44]:
advanced_features = [
    'bedrooms',
    'bathrooms',
    'sqft_living',
    'sqft_lot',
    'floors',
    'zipcode',
    'condition',  # condition of house
    'grade',  # measure of quality of construction
    'waterfront',  # waterfront property
    'view',  # type of view
    'sqft_above',  # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',  # the year built
    'yr_renovated',  # the year renovated
    'lat', 'long',  # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',  # average lot size of 15 nearest neighbors
]

advanced_model = graphlab.linear_regression.create(train_data, target='price', 
                                                   features=advanced_features, validation_set=None)
advanced_model_evaluation = advanced_model.evaluate(test_data)

In [46]:
difference = abs(basic_model_evaluation['rmse'] - advanced_model_evaluation['rmse'])
print(difference)

22711.3165108
