## **Rent Regression**

The **goal** of this notebook is to predict the rent of a typical NYC apartment from listing data taken from renthop.com.

This is achieved by splitting the data into two sets: train and test. From there I fit a linear regression model and evaluate its accuracy with several metrics (MAE, RMSE and R^2).

In [5]:
# Utility imports
import numpy as np
import pandas as pd

# Model building & metrics
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

# Visualizations
import matplotlib.pyplot as plt

**Data cleaning and feature engineering.**

In [2]:
# Reading in the data
df = pd.read_csv('renthop-nyc.csv',
                 index_col='created').sort_index()
# Using 'created' as the index removes it from the columns, so the expected column count is one lower
assert df.shape == (49352, 33) 

# Removing outliers: the most extreme 1% of prices (0.5% per tail)
# and the most extreme 0.1% of latitudes and longitudes (0.05% per tail)
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
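
The same trimming can be expressed with `Series.quantile` and `Series.between`, which keeps each bound inclusive on both sides. This is only a sketch on synthetic data: `trim_by_quantile` and `demo` are hypothetical names, not part of the notebook. Note that `quantile` takes fractions in [0, 1], whereas `np.percentile` takes values in [0, 100].

```python
import numpy as np
import pandas as pd

def trim_by_quantile(frame, column, lower, upper):
    """Keep rows whose `column` value lies between the two quantiles (inclusive)."""
    lo, hi = frame[column].quantile([lower, upper])
    return frame[frame[column].between(lo, hi)]

# Synthetic stand-in for the 'price' column (not the renthop data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'price': rng.lognormal(mean=8, sigma=0.5, size=1000)})

# 0.005 and 0.995 correspond to np.percentile(..., 0.5) and np.percentile(..., 99.5)
trimmed = trim_by_quantile(demo, 'price', 0.005, 0.995)
print(len(demo), '->', len(trimmed))
```

Chaining one call per column would reproduce the price, latitude and longitude filters, although each quantile would then be computed on the already-trimmed frame rather than on the full data as in the cell above.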

In [6]:
# Target vector (y) and feature matrix (X)
y = df['price']
X = df[['bathrooms', 'bedrooms']]

In [7]:
# Train test split
cutoff = '2016-06-01'
mask = X.index < cutoff
X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]
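
Because the data is read without date parsing, the index holds strings, and the comparison with `cutoff` is a string comparison. That gives the intended result only if `created` is stored in a year-first format such as `YYYY-MM-DD HH:MM:SS`, which is an assumption here. A minimal sketch of the same time-based split on a parsed datetime index, using synthetic rows (`demo` is a hypothetical stand-in):

```python
import pandas as pd

# Synthetic listings indexed by creation time (not the renthop data)
demo = pd.DataFrame(
    {'price': [3000, 3200, 2900, 4100]},
    index=pd.to_datetime(['2016-04-15 10:00:00', '2016-05-31 23:59:59',
                          '2016-06-01 00:00:00', '2016-06-20 08:30:00']),
).sort_index()

# Everything strictly before the cutoff is training data, the rest is test data
cutoff = pd.Timestamp('2016-06-01')
mask = demo.index < cutoff
train, test = demo.loc[mask], demo.loc[~mask]

print(len(train), 'train rows,', len(test), 'test rows')
```

With the real file, passing `parse_dates=['created']` to `pd.read_csv` would be one way to obtain a datetime index.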

In [8]:
# New features (added to df after X was defined, so the model below does not use them)
df['total_rooms'] = df['bathrooms'] + df['bedrooms']
df['price_room_ratio'] = df['price'] / df['total_rooms']
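
One thing to watch with `price_room_ratio`: a listing with zero bathrooms and zero bedrooms would produce a division by zero, which pandas reports as `inf`. Whether such listings exist in this dataset is an assumption. The sketch below, on made-up rows, swaps zeros for `NaN` before dividing. Since the ratio is derived from `price`, it suits exploration rather than use as a model feature.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one has no bathrooms or bedrooms listed
demo = pd.DataFrame({
    'price': [2500, 3000, 1800],
    'bathrooms': [1.0, 1.5, 0.0],
    'bedrooms': [1, 2, 0],
})

demo['total_rooms'] = demo['bathrooms'] + demo['bedrooms']
# Replace 0 with NaN so the ratio is missing instead of infinite
demo['price_room_ratio'] = demo['price'] / demo['total_rooms'].replace(0, np.nan)

print(demo)
```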

**Instantiating a linear model.**

In [9]:
# Linear Regression Model

#Baselines:
print('Mean Price:', y_train.mean())
y_pred = [y_train.mean()] * len(y_train)
print('Baseline MAE:', mean_absolute_error(y_train, y_pred))

# Model
model = LinearRegression()
model.fit(X_train, y_train);

Mean Price: 3575.604007034292
Baseline MAE: 1201.8811133682555
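
The baseline above is scored on the training set only. To compare it with the test metrics later on, the same mean prediction can be scored on held-out data. A minimal sketch with scikit-learn's `DummyRegressor` on toy numbers (the arrays are invented, not taken from the dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Toy targets; the features are ignored by the dummy model
y_tr = np.array([3000.0, 3500.0, 4000.0])
y_te = np.array([3200.0, 3900.0])
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

# Always predicts the mean of the training target (3500 here)
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, baseline.predict(X_tr))
test_mae = mean_absolute_error(y_te, baseline.predict(X_te))
print('Baseline train MAE:', train_mae)
print('Baseline test MAE:', test_mae)
```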


In [21]:
# Intercept and first coefficient only: model.coef_[0] belongs to 'bathrooms', model.coef_[1] to 'bedrooms'
print('Intercept:', round(model.intercept_, 2), 'Coef:', round(model.coef_[0], 2))

Intercept: 485.72 Coef: 2072.61
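
The model has two features, but the cell above prints only `model.coef_[0]`. Pairing the coefficients with the column names shows all of them at once. The sketch below uses a small synthetic dataset whose true coefficients are chosen arbitrarily, so its numbers say nothing about the rental data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data generated from price = 500 + 2000*bathrooms + 400*bedrooms
X_demo = pd.DataFrame({'bathrooms': [1, 1, 2, 2, 3],
                       'bedrooms': [1, 2, 2, 3, 4]})
y_demo = 500 + 2000 * X_demo['bathrooms'] + 400 * X_demo['bedrooms']

demo_model = LinearRegression().fit(X_demo, y_demo)

# Label each coefficient with the feature it belongs to
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
print('Intercept:', round(demo_model.intercept_, 2))
print(coefs.round(2))
```

Applied to the fitted `model`, the equivalent line would be `pd.Series(model.coef_, index=X_train.columns)`.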


In [22]:
# Regression Metrics:

# MAE, Result: Better than baseline :)
print('Training MAE:', mean_absolute_error(y_train, model.predict(X_train)))
print('Test MAE:', mean_absolute_error(y_test, model.predict(X_test)))

print('\n')

# RMSE, Result: noticeably higher than MAE, which points to some large errors (outliers)
print('Training RMSE:', mean_squared_error(y_train, model.predict(X_train), squared=False))
print('Test RMSE:', mean_squared_error(y_test, model.predict(X_test), squared=False))

print('\n')

#R2
print('Training R^2:', r2_score(y_train, model.predict(X_train)))
print('Test R^2:', r2_score(y_test, model.predict(X_test)))

Training MAE: 818.5310213271713
Test MAE: 825.8987822403525


Training RMSE: 1232.0225917223486
Test RMSE: 1219.719357233823


Training R^2: 0.5111543084316607
Test R^2: 0.5213303957090345
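
RMSE squares each error before averaging, so a few large misses raise it well above MAE. The toy example below illustrates that gap with invented numbers. It takes the square root of `mean_squared_error` manually, because recent scikit-learn releases deprecate the `squared=False` argument in favour of `root_mean_squared_error`, and the manual form should work either way.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented targets and predictions: three small errors and one large one
y_true = np.array([3000.0, 3000.0, 3000.0, 3000.0])
y_hat = np.array([3100.0, 2900.0, 3100.0, 4700.0])

mae = mean_absolute_error(y_true, y_hat)
# Square root of MSE, independent of the `squared` keyword
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print('MAE:', mae)
print('RMSE:', round(rmse, 2))
```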
