<a href="https://colab.research.google.com/github/spentaur/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
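
For reference, the three metrics in the checklist can be computed by hand. The sketch below uses plain Python and made-up rents and predictions (not values from the renthop data); the function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # MAE: average absolute error
    mae = sum(abs(e) for e in errors) / n

    # RMSE: square root of the average squared error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    # R^2: 1 minus (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {'rmse': rmse, 'mae': mae, 'r2': r2}

# Illustrative numbers only
y_true = [2000, 3000, 4000, 5000]
y_pred = [2200, 2900, 4100, 4600]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

RMSE penalizes the single 400-dollar miss more heavily than MAE does, which is why the two numbers differ.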


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
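
Several of the ideas above can be sketched with pandas on a toy frame. The rows are made up, the column names mirror the renthop columns, and the new feature names (`has_description`, `cats_or_dogs`, and so on) are hypothetical choices rather than anything the assignment prescribes.

```python
import numpy as np
import pandas as pd

# Made-up listings, including an empty and a missing description
toy = pd.DataFrame({
    'description': ['Sunny loft near the park', '', None],
    'cats_allowed': [1, 0, 1],
    'dogs_allowed': [1, 0, 0],
    'bedrooms': [2, 0, 1],
    'bathrooms': [1.0, 1.0, 0.0],
})

# Treat missing descriptions as empty strings
desc = toy['description'].fillna('')
toy['has_description'] = (desc.str.strip() != '').astype(int)
toy['description_len'] = desc.str.len()

# Cats OR dogs vs. cats AND dogs
cats = toy['cats_allowed'] == 1
dogs = toy['dogs_allowed'] == 1
toy['cats_or_dogs'] = (cats | dogs).astype(int)
toy['cats_and_dogs'] = (cats & dogs).astype(int)

toy['total_rooms'] = toy['bedrooms'] + toy['bathrooms']

# Avoid dividing by zero: treat zero bathrooms as missing, then fill the ratio with 0
toy['bed_bath_ratio'] = (toy['bedrooms'] / toy['bathrooms'].replace(0, np.nan)).fillna(0)

print(toy.drop(columns='description'))
```

Filling the undefined ratio with 0 is one possible convention; a separate "no bathrooms" flag would be another.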

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression.
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views).
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s)!

In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
import plotly.express as px
import spacy
import re
import time

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

df.fillna(value=0, inplace=True)
df.replace(np.inf, 0, inplace=True)
df['created'] = pd.to_datetime(df['created'])

In [5]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [6]:
target = 'price'

features = np.append(df.columns.values[10:], ['bedrooms', 'bathrooms'])

print(features)

['elevator' 'cats_allowed' 'hardwood_floors' 'dogs_allowed' 'doorman'
 'dishwasher' 'no_fee' 'laundry_in_building' 'fitness_center' 'pre-war'
 'laundry_in_unit' 'roof_deck' 'outdoor_space' 'dining_room'
 'high_speed_internet' 'balcony' 'swimming_pool' 'new_construction'
 'terrace' 'exclusive' 'loft' 'garden_patio' 'wheelchair_access'
 'common_outdoor_space' 'bedrooms' 'bathrooms']


In [0]:
# Get regression metrics RMSE, MAE, and R^2 for both the train and test data.

def fit_and_errors(df, features, **kwargs):
    # **kwargs is accepted but never used, so extra arguments have no effect.

    model = LinearRegression()

    # Month boundaries for the split: train on April & May 2016, test on June 2016
    april = '2016-04'
    june = '2016-06'

    train = df[(df['created'] > april) & (df['created'] < june)]
    test = df[(df['created'] > june)]

    # `target` is the module-level name set earlier ('price')
    X_train = train[features]
    y_train = train[target]

    X_test = test[features]
    y_test = test[target]

    reg = model.fit(X_train, y_train)

    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)

    test_errors = get_errors(y_test, y_test_pred)
    train_errors = get_errors(y_train, y_train_pred)

    # reg.score returns R^2 for a regressor
    return {
        'test_r2': reg.score(X_test, y_test),
        'test_rmse': test_errors['rmse'],
        'test_mae': test_errors['mae'],
        'train_r2': reg.score(X_train, y_train),
        'train_rmse': train_errors['rmse'],
        'train_mae': train_errors['mae'],
        'model': reg
    }

def get_errors(y_true, y_pred):
    # RMSE is the square root of the mean squared error
    rmse = mean_squared_error(y_true, y_pred)**.5

    mae = mean_absolute_error(y_true, y_pred)

    return {
        'rmse': rmse,
        'mae': mae
    }

In [8]:
# Baseline: predict the overall mean price for every listing

y_pred = [df['price'].mean()] * len(df)

get_errors(df['price'], y_pred)

{'mae': 1201.532252154329, 'rmse': 1762.4127206231178}

In [9]:
fit_and_errors(df, features)

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 757.4700095987182,
 'test_r2': 0.5845874625030612,
 'test_rmse': 1136.2706550259431,
 'train_mae': 748.7163232374169,
 'train_r2': 0.576272151423602,
 'train_rmse': 1147.033969216284}
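
The date-based split inside `fit_and_errors` can be checked in isolation. The sketch below uses a toy frame with made-up timestamps and explicit `pd.Timestamp` bounds. The half-open ranges are an assumption about the intended split (April and May for training, June for testing), and the `split_by_date` helper is hypothetical.

```python
import pandas as pd

def split_by_date(frame, train_start='2016-04-01', test_start='2016-06-01',
                  test_end='2016-07-01', date_col='created'):
    """Split rows into [train_start, test_start) and [test_start, test_end)."""
    train_start, test_start, test_end = map(
        pd.Timestamp, (train_start, test_start, test_end))
    dates = frame[date_col]
    train = frame[(dates >= train_start) & (dates < test_start)]
    test = frame[(dates >= test_start) & (dates < test_end)]
    return train, test

# Made-up listings sitting on or near the month boundaries
toy = pd.DataFrame({
    'created': pd.to_datetime(['2016-04-01 00:00:00', '2016-05-31 23:59:59',
                               '2016-06-01 00:00:00', '2016-06-29 12:00:00']),
    'price': [2500, 3100, 2800, 4000],
})

train, test = split_by_date(toy)
print(len(train), len(test))
```

Using `>=` on the lower bound keeps a listing created exactly at midnight on the first of the month, which a strict `>` would drop.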

In [0]:
# Engineer at least two new features. (See below for explanation & ideas.)

# Does the apartment have a description?
# How long is the description?
# How many total perks does each apartment have?
# Are cats or dogs allowed?
# Are cats and dogs allowed?
# Total number of rooms (beds + baths)
# Ratio of beds to baths
# What's the neighborhood, based on address or latitude & longitude?



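# Encode interest_level as integers. LabelEncoder assigns codes in
# alphabetical order (high=0, low=1, medium=2), not by level of interest.
lb_make = LabelEncoder()

df["interest_level_code"] = lb_make.fit_transform(df["interest_level"])

# Description length in characters (NaN where the description is missing)
df['description_len'] = df['description'].str.len()

# .loc label slices include the end label, so this sum covers every perk
# column plus interest_level_code
df['total_perks'] = df.loc[:, 'elevator':'interest_level_code'].sum(axis=1)

# 0 = no pets, 1 = cats or dogs, 2 = cats and dogs
df['pets_allowed'] = df.loc[:, ['cats_allowed', 'dogs_allowed']].sum(axis=1)

df['total_rooms'] = df.loc[:, ['bedrooms', 'bathrooms']].sum(axis=1)

# Zero bathrooms yields inf or NaN here; both are replaced with 0 before fitting
df['bed_bath_ratio'] = df['bedrooms'] / df['bathrooms']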

In [0]:
# Cluster the coordinates with k-means to get rough stand-in neighborhoods

geo = df[['latitude', 'longitude']]

kmeans = KMeans(n_clusters=3, n_init=25, max_iter=250).fit(geo)

labels = kmeans.labels_

# Shift the labels so neighborhoods are numbered from 1
df['neighborhood'] = labels + 1

In [12]:
px.set_mapbox_access_token(open("/content/.mapbox_token").read())
fig = px.scatter_mapbox(df, lat="latitude", lon="longitude", color="neighborhood",
                        size_max=15, zoom=10)

fig.show()

In [0]:
# Fit a linear regression model with at least two features.

In [14]:
features = np.append(df.columns.values[10:], ['bedrooms', 'bathrooms'])

df.fillna(value=0, inplace=True)
df.replace(np.inf, 0, inplace=True)

fit_and_errors(df,features)

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 714.3290492558085,
 'test_r2': 0.6159379249503141,
 'test_rmse': 1092.5534757759049,
 'train_mae': 707.185999776011,
 'train_r2': 0.6068225842418403,
 'train_rmse': 1104.9103902881186}
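
The checklist also asks for the model's coefficients and intercept. A fitted scikit-learn `LinearRegression` exposes them as `coef_` and `intercept_`. The sketch below fits a toy model on made-up rents that follow an exact linear rule, so the recovered values are easy to check; the data and the name `toy_features` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

toy_features = ['bedrooms', 'bathrooms']
X = pd.DataFrame({
    'bedrooms':  [1, 2, 3, 0, 2],
    'bathrooms': [1, 1, 2, 1, 2],
})

# Made-up rents: price = 1000 + 500 * bedrooms + 300 * bathrooms, with no noise
y = 1000 + 500 * X['bedrooms'] + 300 * X['bathrooms']

model = LinearRegression().fit(X[toy_features], y)

# Pair each coefficient with its feature name for readability
coefficients = pd.Series(model.coef_, index=toy_features)
print('Intercept:', model.intercept_)
print(coefficients)
```

In this notebook the same attributes would be available on the estimator stored under the `'model'` key of the dict returned by `fit_and_errors`.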

In [0]:
# What's the best test MAE you can get? Share your score and features used with your cohort on Slack!

In [16]:
pip install -U spacy

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.1.8)


In [0]:
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    if text:
        return TAG_RE.sub('', text)

In [0]:
df['description'] = df['description'].apply(remove_tags)

In [0]:
descriptions = df['description'].copy()

In [20]:
descriptions.isna().sum()

1425

In [0]:
descriptions.fillna('', inplace=True)

In [0]:

nlp = spacy.load("en_core_web_sm")
ents_i_care_about = ['PERSON', 'ORG', 'LOC', 'GPE']

all_ents = []
ents_len = []

for doc in nlp.pipe(descriptions, disable=["tagger", "parser"]):
    ents = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ents_i_care_about]
    ents_len.append(len(ents))
    all_ents.append(ents)

In [0]:
assert len(all_ents) == len(df) == len(ents_len)

In [0]:
all_ents

In [0]:
df['total_entities'] = ents_len

In [26]:
rich_keywords = ['views',
                 'manhattan',
                 'penthouse',]

rich_pattern = '|'.join(rich_keywords)

rich_special_keywords = df['description'].str.contains(rich_pattern)
rich_special_keywords.fillna(False, inplace=True)
df[rich_special_keywords]['price'].mean()

4311.364925582701

In [27]:
parking_special_keywords = df['description'].str.contains('parking')
parking_special_keywords.fillna(False, inplace=True)
df[parking_special_keywords]['price'].mean()

3920.831953239162

In [28]:
central_park_special_keywords = df['description'].str.contains('(?i)central park')
central_park_special_keywords.fillna(False, inplace=True)
df[central_park_special_keywords]['price'].mean()

3965.1224379719524

In [29]:
balcony_special_keywords = df['description'].str.contains('(?i)balcony')
balcony_special_keywords.fillna(False, inplace=True)
df[balcony_special_keywords]['price'].mean()

4097.3970856102005

In [30]:
poor_keywords = ['brooklyn', 'bronx', 'queens']

# A single leading (?i) makes the whole alternation case-insensitive
poor_pattern = '(?i)' + '|'.join(poor_keywords)

poor_special_keywords = df['description'].str.contains(poor_pattern)
poor_special_keywords.fillna(False, inplace=True)
df[poor_special_keywords]['price'].mean()

3056.877094972067

In [31]:
df['price'].mean()

3579.5852469426636
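
The keyword checks above repeat the same three steps (match, fill missing values, compare mean prices), so they could be wrapped in a helper. This is a sketch on a toy frame with made-up listings. `keyword_flag` is a hypothetical name, and it matches case-insensitively for every keyword list, which differs from the cells above where some patterns are case-sensitive.

```python
import re
import pandas as pd

def keyword_flag(descriptions, keywords):
    """Return 1 where a description mentions any keyword (ignoring case), else 0."""
    # Escape each keyword so it is matched literally, then join as alternatives
    pattern = '|'.join(re.escape(k) for k in keywords)
    # na=False turns missing descriptions into False instead of NaN
    matches = descriptions.str.contains(pattern, case=False, regex=True, na=False)
    return matches.astype('int64')

# Made-up listings, including a missing description
toy = pd.DataFrame({
    'description': ['Penthouse with Manhattan VIEWS', 'Cozy studio in Queens', None],
    'price': [9000, 2000, 3000],
})

toy['manhattan_views'] = keyword_flag(toy['description'], ['views', 'manhattan', 'penthouse'])
toy['other_burrows'] = keyword_flag(toy['description'], ['brooklyn', 'bronx', 'queens'])

# Mean price with and without the keyword flag
print(toy.groupby('manhattan_views')['price'].mean())
```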

In [0]:
df['manhattan_views'] = rich_special_keywords.astype('int64')

In [0]:
df['other_burrows'] = poor_special_keywords.astype('int64')

In [0]:
df['parking'] = parking_special_keywords.astype('int64')

In [0]:
df['central_park'] = central_park_special_keywords.astype('int64')

In [0]:
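# features = np.append(df.columns.values[10:], ['bedrooms', 'bathrooms', 'total_entities', 'manhattan_views', 'other_burrows', 'parking', 'central_park'])

# The keyword columns are already part of df.columns[10:], so appending them
# again lists them twice in `features`; only bedrooms and bathrooms are new here.
features = np.append(df.columns.values[10:], ['bedrooms', 'bathrooms', 'manhattan_views', 'other_burrows', 'parking', 'central_park'])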

In [37]:
features

array(['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive',
       'loft', 'garden_patio', 'wheelchair_access',
       'common_outdoor_space', 'interest_level_code', 'description_len',
       'total_perks', 'pets_allowed', 'total_rooms', 'bed_bath_ratio',
       'neighborhood', 'total_entities', 'manhattan_views',
       'other_burrows', 'parking', 'central_park', 'bedrooms',
       'bathrooms', 'manhattan_views', 'other_burrows', 'parking',
       'central_park'], dtype=object)

In [0]:
models = {}

df.fillna(value=0, inplace=True)
df.replace(np.inf, 0, inplace=True)

for _ in range(10):
    geo = df[['latitude', 'longitude']]

    kmeans = KMeans(n_clusters=3).fit(geo)

    labels = kmeans.labels_

    df['neighborhood'] = labels + 1

    results = fit_and_errors(df, features, normalize=True)

    models[results['test_mae']] = labels

In [39]:
sorted(models)

[712.1060218600504,
 712.1912632881149,
 728.9371721198817,
 750.6150636916068,
 750.6197976485186,
 750.6197976485188,
 750.6201801838171]

In [0]:
best = models[sorted(models)[0]]

In [41]:
df['neighborhood'] = best + 1

results = fit_and_errors(df, features)

results

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 712.1060218600504,
 'test_r2': 0.6208553120996321,
 'test_rmse': 1085.5366201043912,
 'train_mae': 705.2895839714581,
 'train_r2': 0.6115903360057624,
 'train_rmse': 1098.1907697680597}

That's as good as I can get right now, but I still want to do neighborhoods properly, since I think location has the largest effect. The trick is ordering the cluster labels correctly, so that a higher label means a pricier area.
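
One way to order cluster labels is to rank the clusters by mean price. The sketch below uses a toy frame with made-up cluster ids and prices, and it computes the ranking from training rows only, which is an assumption here: it keeps test prices out of the feature. `rank_clusters_by_price` and the `is_train` column are hypothetical names.

```python
import pandas as pd

def rank_clusters_by_price(frame, is_train, cluster_col='neighborhood',
                           target_col='price'):
    """Map each cluster id to its rank (1 = cheapest) by mean training price."""
    train_means = (frame.loc[is_train]
                   .groupby(cluster_col)[target_col]
                   .mean()
                   .sort_values())
    ranks = {cluster: rank
             for rank, cluster in enumerate(train_means.index, start=1)}
    # Clusters that never appear in the training rows get rank 0
    return frame[cluster_col].map(ranks).fillna(0).astype(int)

# Made-up cluster ids and prices; the last two rows play the role of test data
toy = pd.DataFrame({
    'neighborhood': [7, 7, 2, 2, 5, 5, 9],
    'price': [5000, 5200, 1500, 1700, 3000, 3200, 9999],
    'is_train': [True, True, True, True, True, False, False],
})

toy['neighborhood_rank'] = rank_clusters_by_price(toy, toy['is_train'])
print(toy[['neighborhood', 'neighborhood_rank']])
```

Cluster 9 appears only in the held-out rows, so it falls back to rank 0 rather than borrowing its own test price.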

In [0]:
price_categories = pd.cut(df['price'], bins=50, labels=list(range(1,51)), include_lowest=True)

In [0]:
df_copy = df.copy()

In [0]:
df_copy['price_category'] = price_categories

In [0]:
geo = df[['latitude', 'longitude']].copy()

kmeans = MiniBatchKMeans(n_clusters=252).fit(geo)

labels = kmeans.labels_

df_copy['neighborhood'] = labels + 1

In [46]:
px.set_mapbox_access_token(open("/content/.mapbox_token").read())
fig = px.scatter_mapbox(df_copy, lat="latitude", lon="longitude", color="neighborhood",
                        size_max=15, zoom=10)

fig.show()

In [0]:
# Cluster ids ordered from cheapest to most expensive by mean price
old_labels = df_copy.groupby('neighborhood')['price'].mean().sort_values().index.values
one_to_252 = list(range(1,253))

new_labels = one_to_252

In [0]:
# replace old labels with new labels

df_copy['neighborhood'] = df_copy['neighborhood'].map(dict(zip(old_labels, new_labels)))

In [49]:
df_copy.groupby('price_category')['neighborhood'].mean()

price_category
1      21.112830
2      55.337763
3      77.500262
4      99.637953
5     110.373585
6     120.484493
7     132.485678
8     138.355953
9     145.317000
10    150.244712
11    158.865948
12    159.217614
13    160.897833
14    163.345927
15    161.182600
16    167.398703
17    174.224675
18    179.381371
19    173.439909
20    184.536269
21    187.024752
22    189.882353
23    194.836364
24    189.957031
25    188.917808
26    191.344086
27    196.655405
28    205.906250
29    200.149254
30    201.357143
31    194.226562
32    199.062500
33    198.400000
34    210.192308
35    207.568182
36    197.571429
37    204.086957
38    198.347222
39    229.666667
40    199.545455
41    188.300000
42    216.451613
43    206.586957
44    223.066667
45    198.627451
46    251.000000
47    205.888889
48    172.375000
49    206.467532
50    217.333333
Name: neighborhood, dtype: float64

In [62]:
one_to_252 = list(range(1,253))

new_labels = one_to_252

models = {}

for i in range(1000):

    geo = df[['latitude', 'longitude']].copy()

    kmeans = MiniBatchKMeans(n_clusters=252, batch_size=1000, max_iter=1000).fit(geo)

    labels = kmeans.labels_

    df_copy['neighborhood'] = labels + 1

    # Cluster ids ordered from cheapest to most expensive by mean price
    old_labels = df_copy.groupby('neighborhood')['price'].mean().sort_values().index.values

    df_copy['neighborhood'] = df_copy['neighborhood'].map(dict(zip(old_labels, new_labels)))

    results = fit_and_errors(df_copy, features)

    print(i, results['test_mae'])

    models[results['test_mae']] = {
        'labels': df_copy['neighborhood'].values
    }

0 640.2517151689998
1 646.5637817054452
2 646.5319744330626
3 645.9810950333765
4 644.592314009847
5 646.3434265191142
6 645.3970411667215
7 647.3835924270835
8 647.052668450575
9 640.9508638012816
10 646.276887402694
11 644.0153223558525
12 643.6819629809686
13 648.581051024282
14 646.1767153360785
15 645.2558723970416
16 649.7113934200902
17 646.5743958894672
18 646.7482327767874
19 648.012967160094
20 646.2650492699718
21 641.3916436907623
22 647.5934947299481
23 647.9656646689572
24 646.2989537154316
25 645.5195969445737
26 643.8185159436331
27 647.7963951393002
28 650.8463741400943
29 646.6840798336032
30 643.2588730869675
31 648.4091479924325
32 645.9351280302989
33 646.2583368095417
34 646.5770449119299
35 643.6341251766956
36 650.6999688948488
37 645.9780293832663
38 646.1748725477463
39 649.3953548057465
40 646.7441660056871
41 644.4909254729266
42 648.5927329848356
43 647.4746834434193
44 644.5063702953022
45 643.1740553796596
46 647.7113961411652
47 644.9855190680598
48 643.

In [73]:
sorted(models)[0]

636.4449826884655

In [74]:
sorted(models)

[636.4449826884655,
 636.6719475936657,
 638.1266306759927,
 638.3280834483352,
 638.4812451209075,
 638.6742687554776,
 639.2684059287398,
 639.3805647983968,
 639.4312669451949,
 639.464392686314,
 639.649573476475,
 639.7041668518664,
 639.8284513574885,
 639.88374528617,
 639.8911783941297,
 639.9150733692887,
 640.0845321342872,
 640.0857084531259,
 640.1123074638106,
 640.2517151689998,
 640.2840057285592,
 640.4265144348834,
 640.4650758775698,
 640.5480024704306,
 640.5569276439817,
 640.5874606292465,
 640.6516623002244,
 640.6595771709276,
 640.6787089927385,
 640.6804511037033,
 640.6993499409748,
 640.7089351633404,
 640.7171251302454,
 640.7341671427166,
 640.8599779516453,
 640.8619154555671,
 640.8871265836283,
 640.9508638012816,
 640.9528420585965,
 640.9633318210806,
 640.9828820993138,
 641.0115119818646,
 641.0225952232753,
 641.0226385077501,
 641.0349035980829,
 641.0400416942244,
 641.0880015495851,
 641.1888127311535,
 641.211936494802,
 641.2773033484401,
 641.

In [0]:
df['neighborhood'] = models[sorted(models)[0]]['labels']

In [76]:
results = fit_and_errors(df, features)
results

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 637.4313833671766,
 'test_r2': 0.6839148666217318,
 'test_rmse': 991.1605909193883,
 'train_mae': 635.5725732313831,
 'train_r2': 0.673233730657437,
 'train_rmse': 1007.2827244441891}

In [66]:
df_copy.groupby('price_category').mean()

# i want to get every column where price_category 1 is at least 50% less than price_category 25

price_category,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,interest_level_code,description_len,total_perks,pets_allowed,total_rooms,bed_bath_ratio,neighborhood,total_entities,manhattan_views,other_burrows,parking,central_park
1,1.010919,0.659691,40.766625,-73.927099,1536.89263,0.277525,0.456779,0.203822,0.422202,0.042766,0.047316,0.10919,0.035487,0.027298,0.193813,0.012739,0.0,0.014559,0.015469,0.00364,0.00182,0.00455,0.0,0.00091,0.10828,0.024568,0.007279,0.00546,0.00455,0.900819,448.88808,2.920837,0.878981,1.67061,0.646042,44.88535,3.488626,0.026388,0.095541,0.031847,0.017288
2,1.000923,0.773717,40.764859,-73.942086,1817.471023,0.232189,0.421558,0.300849,0.373939,0.045404,0.089332,0.15467,0.064599,0.018457,0.203027,0.012182,0.003322,0.029531,0.026209,0.032115,0.006645,0.004799,0.006275,0.004799,0.058324,0.099668,0.008121,0.006275,0.010705,1.117017,448.125877,3.330011,0.795496,1.77464,0.764673,90.531561,3.18863,0.028793,0.05131,0.02141,0.039498
3,1.012428,0.843799,40.755182,-73.956085,2086.814495,0.297227,0.387493,0.359498,0.348247,0.119571,0.177132,0.18158,0.04134,0.040031,0.215856,0.034537,0.016745,0.044479,0.040816,0.016222,0.0191,0.006803,0.009419,0.007064,0.057823,0.073522,0.013344,0.00471,0.006279,1.177132,480.905024,3.695971,0.73574,1.856227,0.819987,116.650968,3.436682,0.046049,0.057823,0.020147,0.049451
4,1.013156,0.890431,40.752359,-73.967252,2394.244368,0.418274,0.442782,0.457199,0.404577,0.305821,0.308884,0.314111,0.05082,0.150478,0.194089,0.10128,0.090647,0.090467,0.047396,0.068481,0.030636,0.029915,0.046495,0.019643,0.041269,0.047216,0.024509,0.017661,0.018021,1.199856,552.120382,4.920526,0.84736,1.903586,0.852977,141.817805,4.144531,0.094972,0.04271,0.031537,0.06776
5,1.015723,0.985744,40.749956,-73.97329,2654.575472,0.471908,0.470231,0.472117,0.444235,0.374214,0.384277,0.348218,0.050943,0.218449,0.207757,0.12935,0.112998,0.110482,0.049057,0.071488,0.033543,0.040252,0.050105,0.026834,0.04109,0.043187,0.025577,0.024738,0.026834,1.176101,582.898323,5.403983,0.914465,2.001468,0.952655,152.2587,4.428931,0.131656,0.043396,0.033753,0.085744
6,1.032716,1.339259,40.749987,-73.974017,2916.327723,0.506241,0.455749,0.506808,0.430598,0.406203,0.436838,0.382564,0.053328,0.236384,0.184191,0.161498,0.137292,0.130673,0.078101,0.082262,0.045575,0.052383,0.05938,0.043116,0.038389,0.033472,0.035363,0.029312,0.026664,1.180787,594.86233,5.733169,0.886346,2.371974,1.282385,158.806732,4.483169,0.130295,0.039713,0.037254,0.085098
7,1.039784,1.511352,40.748446,-73.979049,3214.367919,0.619563,0.491407,0.544664,0.45958,0.541693,0.501804,0.465097,0.066624,0.363251,0.183323,0.166773,0.199873,0.173775,0.102058,0.118608,0.069595,0.065563,0.067473,0.053893,0.037556,0.034797,0.040738,0.030766,0.039465,1.179928,643.826225,6.617865,0.950987,2.551135,1.446531,170.541057,4.778697,0.158073,0.031827,0.055379,0.08572
8,1.068341,1.614344,40.746334,-73.980359,3491.354021,0.620382,0.503502,0.518474,0.471867,0.537068,0.485149,0.42526,0.067617,0.369476,0.182323,0.184979,0.170007,0.161314,0.10553,0.101425,0.072205,0.069307,0.063511,0.049022,0.033567,0.021492,0.042019,0.031393,0.043951,1.15262,638.350881,6.483458,0.975368,2.682685,1.521178,173.700797,4.891331,0.172905,0.032359,0.049505,0.074137
9,1.1235,1.727,40.746024,-73.981424,3769.329,0.575667,0.533667,0.465,0.508333,0.498667,0.454333,0.403,0.050333,0.344667,0.174333,0.202333,0.155333,0.146,0.091,0.09,0.060667,0.063,0.054667,0.042333,0.042667,0.024333,0.037333,0.033333,0.032,1.133333,620.008667,6.216333,1.042,2.8505,1.552022,178.959,4.792333,0.182333,0.029667,0.044667,0.074667
10,1.193281,1.888843,40.749374,-73.981328,4059.608046,0.61095,0.503526,0.462878,0.481128,0.552883,0.476566,0.39776,0.062215,0.354625,0.160929,0.246786,0.134799,0.17752,0.133555,0.08876,0.07922,0.07051,0.051846,0.051846,0.040647,0.019909,0.05392,0.036914,0.041891,1.145168,636.19494,6.436748,0.984654,3.082124,1.635494,183.430942,4.932393,0.20282,0.028204,0.056408,0.082953


In [0]:
price_df = df_copy.groupby('price_category').mean()

price_df.drop(['price', 'latitude', 'longitude'], axis=1, inplace=True)

price_category_1 = price_df.iloc[0]
price_category_25 = price_df.iloc[24]

In [0]:
features = (price_category_25 / price_category_1).sort_values(ascending=False).index

In [69]:
results = fit_and_errors(df, features)
results

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 637.4313833671766,
 'test_r2': 0.6839148666217318,
 'test_rmse': 991.1605909193883,
 'train_mae': 635.5725732313831,
 'train_r2': 0.673233730657437,
 'train_rmse': 1007.2827244441891}

In [0]:
best_feats = {}
for idx in range(len(features)):
    results = fit_and_errors(df, features[:idx + 1])
    best_feats[results['test_mae']] = features[:idx + 1]

In [80]:
features = best_feats[sorted(best_feats)[0]]
results = fit_and_errors(df, features)
results

{'model': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False),
 'test_mae': 636.2477184111135,
 'test_r2': 0.6833061981879741,
 'test_rmse': 992.1144447234204,
 'train_mae': 635.2131635434067,
 'train_r2': 0.6718466491417191,
 'train_rmse': 1009.4183544054107}

I'm calling it here. I might be able to get a better score by trying some other approaches.