<a href="https://colab.research.google.com/github/spentaur/DS-Unit-2-Regression-Classification/blob/master/module2/final_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
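
A minimal sketch of the coefficient, intercept, and metric items above, using made-up toy numbers rather than the renthop data. The arrays `X` and `y` are assumptions for illustration only; the scikit-learn calls are the standard ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy data (invented): columns are bedrooms and bathrooms, y is monthly rent
X = np.array([[0, 1], [1, 1], [2, 1], [2, 2], [3, 2]], dtype=float)
noise = np.array([50.0, -50.0, 0.0, 50.0, -50.0])
y = 1000 + 800 * X[:, 0] + 500 * X[:, 1] + noise

model = LinearRegression().fit(X, y)
coefficients = model.coef_    # one slope per feature
intercept = model.intercept_  # prediction when every feature is 0

y_pred = model.predict(X)
mae = mean_absolute_error(y, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y, y_pred))  # square root of mean squared error
r2 = r2_score(y, y_pred)                       # same value as model.score(X, y)

print(coefficients, intercept, mae, rmse, r2)
```

The same three metric calls work on the test split by swapping in the test features and test target.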


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
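
A few of these ideas can be sketched on a tiny invented dataframe. The column names `description`, `cats_allowed`, `dogs_allowed`, `bedrooms`, and `bathrooms` match the renthop data used below; the new feature names and the rows themselves are assumptions for illustration.

```python
import pandas as pd

# Invented rows; only the column names come from the real dataset
toy = pd.DataFrame({
    'description': ['Sunny two bedroom near the park', None, '   '],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [0, 0, 1],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 1.0],
})

# Treat missing and whitespace-only descriptions as "no description"
desc = toy['description'].fillna('').str.strip()
toy['has_description'] = (desc != '').astype(int)
toy['description_len'] = desc.str.len()

# "or" and "and" versions of the pet features
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']
```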

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s) !

# Here's the game plan

1. Make a function that adds features.
 1. Part of that is assigning neighborhood values.
  1. The way I'm going to do that is to cluster the training data with k-means, then use those labels to train a KNeighborsClassifier that predicts the labels for new data.
  1. The reason I'm doing it this way is because I reorder the cluster labels so they go up with mean price, and I want the predicted labels to match that new order. I don't know if there's a better way to do it, but I'm sure there is.
1. Now that I have that function, I can make another function that takes in the raw, untouched dataframe, splits it into training and testing sets, and returns my scores and my trained model.
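
One possible alternative to the second classifier, offered only as a sketch: keep the fitted k-means model and a dictionary that maps each raw cluster id to its price rank, then call `predict` on new data and apply the same dictionary. The helper names `fit_price_ordered_clusters` and `assign_neighborhoods` and the toy data are hypothetical, and plain `KMeans` stands in for the `MiniBatchKMeans` used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def fit_price_ordered_clusters(train, geo_cols, price_col, n_clusters, seed=0):
    """Fit k-means on train and map each raw cluster id to its mean-price rank."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    raw = kmeans.fit_predict(train[geo_cols])
    # Cheapest cluster gets rank 1, most expensive gets rank n_clusters
    order = train[price_col].groupby(raw).mean().sort_values().index
    rank_of = {cluster: rank for rank, cluster in enumerate(order, start=1)}
    return kmeans, rank_of

def assign_neighborhoods(data, kmeans, rank_of, geo_cols):
    """Label rows by nearest centroid, then translate to the price-ordered ranks."""
    raw = pd.Series(kmeans.predict(data[geo_cols]), index=data.index)
    return raw.map(rank_of)

# Toy data: three well-separated blobs with different typical prices (invented)
rng = np.random.default_rng(0)
centers = [(40.70, -74.00), (40.80, -73.95), (40.75, -73.90)]
typical_prices = [2000, 5000, 3500]
frames = []
for (lat, lon), price in zip(centers, typical_prices):
    frames.append(pd.DataFrame({
        'latitude': lat + rng.normal(0, 0.001, 50),
        'longitude': lon + rng.normal(0, 0.001, 50),
        'price': price + rng.normal(0, 50, 50),
    }))
toy_train = pd.concat(frames, ignore_index=True)

geo = ['latitude', 'longitude']
kmeans, rank_of = fit_price_ordered_clusters(toy_train, geo, 'price', n_clusters=3)

toy_test = pd.DataFrame({'latitude': [40.80, 40.70], 'longitude': [-73.95, -74.00]})
labels = assign_neighborhoods(toy_test, kmeans, rank_of, geo)
```

Because the mapping is built once from the training prices, the test rows never need a `price` column.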

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.3.0)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)
Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

# Import all the library functions

In [0]:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Import the DataFrame and remove the most extreme 1% of prices and 0.1% of latitudes and longitudes

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# Change created to datetime

In [0]:
df['created'] = pd.to_datetime(df['created'])

# Check to make sure i'm splitting the data up right

In [6]:
df['created'].dt.month.value_counts()

6    16973
4    16217
5    15627
Name: created, dtype: int64

In [0]:
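# month 6 is June (test); months 4 and 5 are April and May (train)
june_count = df['created'].dt.month.value_counts()[6]
april_may_count = df['created'].dt.month.value_counts()[5] + df['created'].dt.month.value_counts()[4]

In [8]:
april_may_count + june_count, april_may_count, june_count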

(48817, 31844, 16973)

In [9]:
april = '2016-04'
june = '2016-06'

# use >= so a listing created exactly at midnight on the first of the month is not dropped
train = df[(df['created'] >= april) & (df['created'] < june)].copy()
test = df[(df['created'] >= june)].copy()

print(len(df), len(train), len(test))

48817 31844 16973


# Separate the target; drop it from test so I can't accidentally put it in (train keeps price because add_neighborhoods needs it)

In [0]:
train_target = train['price']

test_target = test['price']
test.drop('price', axis=1, inplace=True)

# Add some features

In [0]:
# kmeans_features = ['latitude', 'longitude', 'roof_deck']

kmeans_features = ['latitude', 'longitude']

In [0]:
def add_neighborhoods(train, number_of_clusters=50):
    new_labels = list(range(1,number_of_clusters + 1))

    kmeans = MiniBatchKMeans(n_clusters=number_of_clusters, batch_size=1993, max_iter=10000).fit(train[kmeans_features])

    labels = kmeans.labels_

    # assign those labels to the dataframe
    # i'm adding 1 because labels start at 0, and i don't want that for my other
    # stuff, it'll mess with the math i believe
    train['neighborhood'] = labels + 1

    # i have to make sure that the labels go up with price
    # so i'm grouping by neighborhood, sorting by mean price and getting the
    # neighborhood values
    old_labels = train.groupby('neighborhood')['price'].mean().sort_values().index

    # now i'm mapping neighborhoods in order
    train['neighborhood'] = train['neighborhood'].map(dict(zip(old_labels, new_labels)))

In [0]:
# this trains the model that predicts neighborhood labels, fit on the training data

def train_neighborhoods_model(train):

    # now that i have labels i need to make a model to predict those specific labels
    
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(train[kmeans_features], train['neighborhood'])

    return clf

In [0]:
def predic_neighborhoods(data):
    return trained_neighborhood_model.predict(data[kmeans_features])

In [0]:
def add_new_features(data, neighborhoods=True):

    data['total_rooms'] = data.loc[:, ['bedrooms', 'bathrooms']].sum(axis=1)

    data['bed_bath_ratio'] = data['bedrooms'] / data['bathrooms']

    data['interest_level_code'] = data['interest_level'].map({'low': 0, 'medium': 1, 'high': 2})

    data['pets_allowed'] = (data.loc[:, ['cats_allowed', 'dogs_allowed']].sum(axis=1) > 0).astype('int8')

    rich_keywords = ['views',
                 'manhattan',
                 'penthouse',]

    rich_pattern = '|'.join(rich_keywords)

    rich_special_keywords = data['description'].str.contains(rich_pattern)
    rich_special_keywords.fillna(False, inplace=True)
    data['manhattan_views'] = rich_special_keywords.astype('int64')

    parking_special_keywords = data['description'].str.contains('parking')
    parking_special_keywords.fillna(False, inplace=True)
    data['parking'] = parking_special_keywords.astype('int64')

    central_park_special_keywords = data['description'].str.contains('(?i)central park')
    central_park_special_keywords.fillna(False, inplace=True)
    data['central_park'] = central_park_special_keywords.astype('int64')

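    # note: this label slice sums every column from 'elevator' through
    # 'central_park', so besides the perk flags it also includes the engineered
    # columns above (and 'neighborhood' when that column already exists, as in train)
    data['total_perks'] = data.loc[:, 'elevator':'central_park'].sum(axis=1)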

    data['description_len'] = data['description'].str.len()

    data.fillna(value=0, inplace=True)
    data.replace(np.inf, 0, inplace=True)

    if neighborhoods:
        print("You're changing the neighborhood columns by prediction.")
        data['neighborhood'] = predic_neighborhoods(data)
    
    return data

In [16]:
add_neighborhoods(train)
add_new_features(train, False)

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,neighborhood,total_rooms,bed_bath_ratio,interest_level_code,pets_allowed,manhattan_views,parking,central_park,total_perks,description_len
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,45,2.0,1.000000,2,0,0,0,0,53.000000,691.0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34,2.0,1.000000,0,0,0,0,0,39.000000,492.0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,5.0,4.000000,0,0,0,0,0,19.000000,479.0
5,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,45,6.0,2.000000,1,0,0,0,0,54.000000,8.0
6,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.9660,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,3.0,2.000000,0,1,0,0,0,40.000000,579.0
7,2.0,1,2016-04-13 06:01:42,"This huge sunny ,plenty of lights 1 bed/2 bath...",West 21st Street,40.7427,-73.9957,5645,155 West 21st Street,low,1,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,47,3.0,0.500000,0,0,0,0,0,58.500000,948.0
8,1.0,1,2016-04-20 02:36:35,<p><a website_redacted,Hamilton Terrace,40.8234,-73.9457,1725,63 Hamilton Terrace,medium,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,2.0,1.000000,1,1,0,0,0,18.000000,24.0
9,2.0,4,2016-04-02 02:58:15,This is a spacious four bedroom with every bed...,522 E 11th,40.7278,-73.9808,5800,522 E 11th,low,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,33,6.0,2.000000,0,0,0,0,0,43.000000,1052.0
10,1.0,0,2016-04-14 01:10:30,New to the market! Spacious studio located in ...,York Avenue,40.7769,-73.9467,1950,1661 York Avenue,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,23,1.0,0.000000,0,0,0,0,0,24.000000,168.0
12,1.0,2,2016-04-19 05:37:25,***LOW FEE. Beautiful CHERRY OAK WOODEN FLOORS...,E 38th St,40.7488,-73.9770,3000,137 E 38th St,high,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,35,3.0,2.000000,2,0,0,0,0,45.000000,697.0


In [17]:
train.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space',
       'neighborhood', 'total_rooms', 'bed_bath_ratio', 'interest_level_code',
       'pets_allowed', 'manhattan_views', 'parking', 'central_park',
       'total_perks', 'description_len'],
      dtype='object')

In [18]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

# Define function to get fitted model and errors

In [0]:
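def fit_model(train, features, target, **kwargs):
    # fit a plain linear regression on the chosen feature columns
    # (extra keyword arguments are accepted but currently unused)
    model = LinearRegression()

    model.fit(train[features], target)

    return model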

def get_errors(y_true, y_pred):

    rmse = mean_squared_error(y_true, y_pred)**.5

    mae = mean_absolute_error(y_true, y_pred)

    return {
        'rmse': rmse,
        'mae': mae
    }

def predict_and_error(data, features, target, model):

    pred = model.predict(data[features])

    errors = get_errors(target, pred)

    errors['r2'] = model.score(data[features], target)

    return errors

# Test it out, get some baselines

In [0]:
features_with_geo = np.append(train.columns.values[10:], ['bedrooms', 'bathrooms', 'latitude', 'longitude'])
features_without_geo = np.append(train.columns.values[10:], ['bedrooms', 'bathrooms'])

In [21]:
features_with_geo

array(['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive',
       'loft', 'garden_patio', 'wheelchair_access',
       'common_outdoor_space', 'neighborhood', 'total_rooms',
       'bed_bath_ratio', 'interest_level_code', 'pets_allowed',
       'manhattan_views', 'parking', 'central_park', 'total_perks',
       'description_len', 'bedrooms', 'bathrooms', 'latitude',
       'longitude'], dtype=object)

In [22]:
model = fit_model(train, features_with_geo, train_target)

train_true = train_target

train_pred = model.predict(train[features_with_geo])

get_errors(train_true, train_pred)

{'mae': 618.678532432344, 'rmse': 994.1276171637836}

In [23]:
model = fit_model(train, features_without_geo, train_target)

train_true = train_target

train_pred = model.predict(train[features_without_geo])

get_errors(train_true, train_pred)

{'mae': 619.5424235449724, 'rmse': 994.7504508145051}

In [24]:
mean_price = df['price'].mean()

guess_mean = [mean_price] * len(df)

get_errors(df['price'], guess_mean)

{'mae': 1201.532252154329, 'rmse': 1762.4127206231178}

# Already better than just guessing
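
The baseline above takes the mean over the whole dataframe. A slightly stricter version, sketched here with invented numbers, learns the mean from the training prices only and scores it on the test prices. `DummyRegressor` is scikit-learn's built-in constant predictor; the arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Invented prices, standing in for train_target and test_target
train_prices = np.array([2000.0, 3000.0, 4000.0])
test_prices = np.array([2500.0, 5000.0])

baseline = DummyRegressor(strategy='mean')
# DummyRegressor ignores the features, so a placeholder column is enough
baseline.fit(np.zeros((len(train_prices), 1)), train_prices)

test_pred = baseline.predict(np.zeros((len(test_prices), 1)))
baseline_mae = mean_absolute_error(test_prices, test_pred)
```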

# Okay problems i see

The function is taking a really long time because it's doing so much. Maybe I should pull the feature engineering out and just pass that function into the training and testing steps. Another problem is k-means: it's randomly initialized, so the clusters (and the resulting MAE) change from run to run. I need to run it a bunch of times and keep the best one.

What this loop does is find the best neighborhood labels, which are later used to fit trained_neighborhood_model.
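
A hedged sketch of the same "run it many times, keep the best" idea with fixed seeds, so the winning clustering can be reproduced later from its seed alone. The function name `best_seeded_clustering` and the toy data are hypothetical, and for brevity the regression here uses only the ranked neighborhood label as a feature.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def best_seeded_clustering(coords, prices, n_clusters, seeds):
    """Try one clustering per seed and keep the one with the lowest train MAE."""
    best = None
    for seed in seeds:
        kmeans = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=seed)
        raw = kmeans.fit_predict(coords)

        # Relabel clusters so the label goes up with mean price (1 = cheapest)
        order = pd.Series(prices).groupby(raw).mean().sort_values().index
        rank_of = {cluster: rank for rank, cluster in enumerate(order, start=1)}
        ranked = np.array([rank_of[c] for c in raw])

        # Score this clustering by the train MAE of a one-feature regression
        X = ranked.reshape(-1, 1)
        pred = LinearRegression().fit(X, prices).predict(X)
        mae = mean_absolute_error(prices, pred)

        if best is None or mae < best['mae']:
            best = {'seed': seed, 'mae': mae, 'labels': ranked}
    return best

# Toy data (invented): price rises with latitude, plus noise
rng = np.random.default_rng(0)
coords = rng.normal(loc=[40.75, -73.97], scale=0.05, size=(300, 2))
prices = 3000 + 20000 * (coords[:, 0] - 40.75) + rng.normal(0, 200, 300)

best = best_seeded_clustering(coords, prices, n_clusters=4, seeds=range(5))
print(best['seed'], best['mae'])
```

Tracking a single running best also avoids keying a dictionary by floating-point MAE values.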

In [25]:
import time
best = {}

begin = time.time()

for num in range(314):

    start = time.time()

    add_neighborhoods(train)

    features = features_with_geo

    model = fit_model(train, features, train_target)

    train_true = train_target

    train_pred = model.predict(train[features_with_geo])

    result = get_errors(train_true, train_pred)

    best[result['mae']] = result
    best[result['mae']]['labels'] = train['neighborhood'].values

    print(num, result['mae'], sorted(best)[0], time.time() - start)

print(sorted(best)[0])
time.time() - begin

0 625.3550420308761 625.3550420308761 0.38170957565307617
1 627.6353590266623 625.3550420308761 0.37249302864074707
2 627.892470977498 625.3550420308761 0.46546220779418945
3 624.055184013536 624.055184013536 0.360431432723999
4 625.9052703748722 624.055184013536 0.32292723655700684
5 625.4132915661745 624.055184013536 0.4171028137207031
6 625.5097928255609 624.055184013536 0.34049010276794434
7 627.1830329773381 624.055184013536 0.40291738510131836
8 626.817436585882 624.055184013536 0.3648216724395752
9 624.276412196298 624.055184013536 0.3864560127258301
10 625.5524904688336 624.055184013536 0.4771695137023926
11 627.2302616687783 624.055184013536 0.32257604598999023
12 624.5542339373426 624.055184013536 0.3708984851837158
13 628.4188272127902 624.055184013536 0.31928086280822754
14 627.4482400980887 624.055184013536 0.4309849739074707
15 627.5408211506062 624.055184013536 0.43491554260253906
16 627.0249082339562 624.055184013536 0.4741332530975342
17 628.0217270732343 624.055184013

119.21984100341797

In [0]:
best_model = best[sorted(best)[0]]

In [0]:
train['neighborhood'] = best_model['labels']

# let's try some other regression techniques

In [0]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, Lars, LassoLars, OrthogonalMatchingPursuit, BayesianRidge, ARDRegression, SGDRegressor, PassiveAggressiveRegressor, RANSACRegressor, HuberRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import RegressorMixin
from sklearn.base import is_classifier

In [0]:
def fit_other_models(data, features, target, model):

    reg = model.fit(data[features],target)

    pred = reg.predict(data[features])

    result = get_errors(target, pred)

    return result, pred, model

In [30]:
fit_other_models(train, features_with_geo, train_target, LinearRegression())[0]

{'mae': 622.1709324116582, 'rmse': 1000.2208305314416}

In [31]:
fit_other_models(train, features_with_geo, train_target, HuberRegressor())[0]

{'mae': 595.637220332226, 'rmse': 1045.4883527930576}

In [32]:
fit_other_models(train, features_with_geo, train_target, Ridge())[0]

{'mae': 622.165157963177, 'rmse': 1000.221553139777}

In [33]:
fit_other_models(train, features_with_geo, train_target, RandomForestRegressor(n_estimators=200))[0]

{'mae': 133.15458868509617, 'rmse': 248.171212778433}

In [0]:
linear_regression_model = fit_other_models(train, features_with_geo, train_target, LinearRegression())[2]

In [0]:
forrest_model = fit_other_models(train, features_with_geo, train_target, RandomForestRegressor(n_estimators=200))[2]

# Now run it on the test data

In [36]:
test.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'street_address', 'interest_level', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [37]:
trained_neighborhood_model = train_neighborhoods_model(train)
add_new_features(test)

You're changing the neighborhood columns by prediction.


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,total_rooms,bed_bath_ratio,interest_level_code,pets_allowed,manhattan_views,parking,central_park,total_perks,description_len,neighborhood
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4.5,2.000000,1,0,0,0,0,7.500000,588.0,26
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,2.000000,0,1,0,0,0,11.000000,8.0,47
11,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.000000,0,0,0,0,0,3.000000,690.0,14
14,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.000000,0,0,0,0,0,6.000000,569.0,36
24,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,6.0,2.000000,1,1,0,0,0,21.000000,870.0,49
34,1.0,2,2016-06-17 03:30:24,This desirable apartment is located in Washing...,W 171 Street,40.8440,-73.9404,651 W 171 Street,low,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.0,2.000000,0,1,0,0,0,10.000000,445.0,14
39,1.0,0,2016-06-29 04:08:35,Prime Location!! This Luxury Chelsea building ...,W 34 St.,40.7530,-73.9959,360 W 34 St.,medium,1,1,1,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0,1.0,0.000000,1,1,0,0,0,16.000000,563.0,33
48,1.0,1,2016-06-21 04:09:32,<p><a website_redacted,Second Avenue,40.7802,-73.9504,1731 Second Avenue,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.000000,0,0,0,0,0,3.000000,24.0,32
59,1.0,1,2016-06-08 06:26:49,Steps to G TrainShared BackyardShared RoofHard...,Clifton Pl,40.6888,-73.9522,310 Clifton Pl,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,1.000000,0,0,0,0,0,5.000000,189.0,20
82,1.0,0,2016-06-24 05:04:03,(((Spacious 2 bedroom in the Upper East Side))...,E 61 St.,40.7617,-73.9626,311 E 61 St.,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,0.000000,0,0,0,0,0,1.000000,352.0,36


In [38]:
get_errors(test_target, linear_regression_model.predict(test[features_with_geo]))

{'mae': 689.8107906345883, 'rmse': 1080.3556261190888}

In [39]:
get_errors(test_target, forrest_model.predict(test[features_with_geo]))

{'mae': 725.6573128390655, 'rmse': 1283.564524299664}

In [0]:
def test_on_models(model):
    reg = model.fit(train[features_with_geo],train_target)
    return get_errors(test_target, reg.predict(test[features_with_geo]))

In [45]:
test_on_models(MLPRegressor(max_iter=840))

{'mae': 625.022064540944, 'rmse': 1005.6002900241914}

In [49]:
test_on_models(ExtraTreesRegressor(n_estimators=200))

{'mae': 562.9199713026946, 'rmse': 974.1549918772384}