<a href="https://colab.research.google.com/github/taycurran/DS-Unit-2-Linear-Models/blob/master/CURRAN_M2_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [X] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [X] Engineer at least two new features. (See below for explanation & ideas.)
- [X] Fit a linear regression model with at least two features.
- [X] Get the model's coefficients and intercept.
- [X] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [X] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [X] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# MY CODE

In [0]:
from datetime import datetime
import math
import numpy as np

In [183]:
print(df.shape)
df.head(3)

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [184]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48817 entries, 0 to 49351
Data columns (total 34 columns):
bathrooms               48817 non-null float64
bedrooms                48817 non-null int64
created                 48817 non-null object
description             47392 non-null object
display_address         48684 non-null object
latitude                48817 non-null float64
longitude               48817 non-null float64
price                   48817 non-null int64
street_address          48807 non-null object
interest_level          48817 non-null object
elevator                48817 non-null int64
cats_allowed            48817 non-null int64
hardwood_floors         48817 non-null int64
dogs_allowed            48817 non-null int64
doorman                 48817 non-null int64
dishwasher              48817 non-null int64
no_fee                  48817 non-null int64
laundry_in_building     48817 non-null int64
fitness_center          48817 non-null int64
pre-war                 4

### **Feature Engineering**
1. Apartment's Distance From Central Park
2. Easy Cleaning Score

#### **Domain Knowledge**

Central Park coordinates: 40.7829° N, 73.9654° W

Positive latitude is north of the equator (N), and negative latitude is south of it (S). Positive longitude is east of the prime meridian, and negative longitude is west of it. The prime meridian is the north-south line that runs through Greenwich, England. In signed decimal degrees, Central Park is therefore at roughly (40.7829, -73.9654).

In [186]:
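# Coordinates DF with Coordinate of Interest, Central Park
# NOTE: the latitude used here (40.7892) differs from the 40.7829 quoted in the
# Domain Knowledge notes; the outputs shown below were computed with 40.7892.
centralPark = [40.7892, -73.9654]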
coordinates = df[['latitude', 'longitude']]
coordinates.head(2)

Unnamed: 0,latitude,longitude
0,40.7145,-73.9425
1,40.7947,-73.9667


In [0]:
# Calculate Each XY Distance
distLat = abs(centralPark[0] - df['latitude']) 
distLong = abs(centralPark[1] - df['longitude'])

In [188]:
# Calculate Radial Distance (Straight-Line Distance in Degrees, Not Miles or Kilometers)
distCentralP = np.sqrt(distLat**2 + distLong**2)
distCentralP.head()

0    0.078131
1    0.005652
2    0.062170
3    0.035375
4    0.038435
dtype: float64

In [0]:
# Add to Original DF
df['distCentralP'] = distCentralP
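
The distance above is measured in degrees, and one degree of longitude covers less ground than one degree of latitude at New York's latitude. As a rough alternative, the sketch below computes great-circle distance in kilometers with the haversine formula. The function name `haversine_km` and the Earth-radius constant are assumptions for illustration, not part of the original notebook, and the resulting feature has not been scored against the model here.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Hypothetical helper: accepts scalars, NumPy arrays, or pandas Series.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Sanity check: one degree of latitude is about 111.19 km
one_degree_lat = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(float(one_degree_lat), 2))
```

In the notebook this could presumably be applied column-wise, for example `haversine_km(df['latitude'], df['longitude'], centralPark[0], centralPark[1])`, to get a distance feature in kilometers instead of degrees.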

Easy Cleaning Score
- 1 Point for Laundry in Building
- 2 Points for Laundry in Unit
- 1.5 Points for Dishwasher in Unit

In [0]:
easyClean = 1.5*df['dishwasher'] + 1*df['laundry_in_building'] + 2*df['laundry_in_unit']

In [242]:
easyClean.value_counts()

0.0    25522
1.5    12218
3.5     7529
1.0     2051
2.0      972
2.5      439
4.5       77
3.0        9
dtype: int64

In [243]:
easyClean.head()

0    0.0
1    0.0
2    2.5
3    0.0
4    0.0
dtype: float64

In [0]:
df['easyClean'] = easyClean

In [0]:
# One More Feature for Bed+Bath
df['BedBath'] = df['bathrooms'] + df['bedrooms']

New Features:

In [220]:
df[['distCentralP','easyClean', 'BedBath']].head(2)

Unnamed: 0,distCentralP,easyClean,BedBath
0,0.078131,0,4.5
1,0.005652,0,3.0


### **Fitting Model**

In [0]:
# 1. Import the Appropriate Estimator Class from Scikit-Learn
from sklearn.linear_model import LinearRegression

# 2. Instantiate the Class
model = LinearRegression()

In [0]:
# The Below Cell Uses More Features and Realizes a Smaller Mean Absolute Error
# Select Features for Regression Model
#myFeatures = df[['price', 'distCentralP', 'BedBath', 'easyClean']]
#myFeatures.head()

In [0]:
# Select Features for Regression Model
myFeatures = df[['price', 'distCentralP', 'BedBath', 'easyClean', 'doorman', 
                 'fitness_center', 'no_fee', 'pre-war', 'roof_deck', 
                 'outdoor_space', 'balcony', 'swimming_pool', 'terrace', 
                 'exclusive', 'garden_patio', 'common_outdoor_space']]

### **Train/Test Split**

In [246]:
# Establish Separation Conditions from Original DF that Includes Date Column
trainCond = df['created'].str.contains('2016-04|2016-05')
testCond = df['created'].str.contains('2016-06')

# Separation
train = myFeatures[trainCond]
test = myFeatures[testCond]

print(train.shape)
print(test.shape)

(31844, 16)
(16973, 16)
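
The split above matches month substrings in the `created` text column. A date-based comparison is one alternative that may be easier to adapt to other cutoffs. The sketch below is only an illustration: `before_cutoff` is a hypothetical helper, and the toy rows are made up rather than taken from the dataset.

```python
import pandas as pd


def before_cutoff(dates, cutoff):
    """Boolean mask: True where the parsed date falls strictly before `cutoff`."""
    return pd.to_datetime(dates) < pd.Timestamp(cutoff)


# Toy listings standing in for the real data (values are made up)
toy = pd.DataFrame({
    "created": [
        "2016-04-17 03:26:41",
        "2016-05-30 23:59:59",
        "2016-06-01 00:00:00",
        "2016-06-24 07:54:24",
    ],
    "price": [2850, 3100, 5465, 3000],
})

is_train = before_cutoff(toy["created"], "2016-06-01")
toy_train = toy[is_train]    # April and May rows
toy_test = toy[~is_train]    # June rows
print(toy_train.shape, toy_test.shape)
```

Because the mask is built from the date column alone, it can be applied to a separate frame with the same index, as the notebook does with `myFeatures`.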


In [247]:
# 3. Separate Target Vector from Features Matrix
y_train = train['price']
X_train = train.drop('price', axis=1)

y_test = test['price']
X_test = test.drop('price', axis=1)

print(f"Linear Regression, Dependent On: {list(X_train.columns)}")

Linear Regression, Dependent On: ['distCentralP', 'BedBath', 'easyClean', 'doorman', 'fitness_center', 'no_fee', 'pre-war', 'roof_deck', 'outdoor_space', 'balcony', 'swimming_pool', 'terrace', 'exclusive', 'garden_patio', 'common_outdoor_space']


In [248]:
# Begin with Baseline
baseline = train['price'].mean()
baseline

3575.604007034292

In [0]:
# Import Calculators
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score

In [250]:
# Baseline Prediction Scores
basePred = [baseline] * len(y_train)
maeBase = mean_absolute_error(y_train, basePred)
rmseBase = sqrt(mean_squared_error(y_train, basePred))
RsquareBase = r2_score(y_train, basePred)

print('Baseline Prediction Scores')
print(f"MAE: {maeBase}")
print(f"RMSE: {rmseBase}")
print(f"R^2: {RsquareBase}")

Baseline Prediction Scores
MAE: 1201.8811133682555
RMSE: 1762.1090255404863
R^2: 0.0
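
To make the three scores concrete, here is a NumPy-only sketch of how they are defined. `regression_scores` is a hypothetical helper, not a scikit-learn function, and the toy prices are made up. It also shows why a baseline that always predicts the mean of the data it is scored on gets an R^2 of exactly 0, as in the output above.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equal-length sequences.

    Assumes y_true is not constant (otherwise R^2 divides by zero).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))             # penalizes large misses more
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R^2": r2}


# Predicting the mean everywhere makes ss_res equal ss_tot, so R^2 is 0
toy_prices = [2000, 3000, 4000]
mean_guess = [np.mean(toy_prices)] * len(toy_prices)
scores = regression_scores(toy_prices, mean_guess)
print(scores)
```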


In [0]:
# 4. Fit the Model
model.fit(X_train, y_train)
yPredTrain = model.predict(X_train)

In [252]:
# 5. Get Model Coefficients
print(model.intercept_, model.coef_)

1272.6024338279153 [-5.62720888e+03  7.91284040e+02  1.69478309e+02  7.76758006e+02
  3.28735099e+02 -3.20067026e+02 -4.40885718e+01 -2.77965770e+02
 -1.10172924e+02 -1.04940320e+02  5.02483436e+01  2.79094311e+02
  3.04711009e+02 -3.57613802e+00 -1.65064371e+02]


In [253]:
# Interpret Coefficients
print('Intercept', model.intercept_)
coefficients = pd.Series(model.coef_, list(X_train.columns))
print(coefficients.to_string())

Intercept 1272.6024338279153
distCentralP           -5627.208875
BedBath                  791.284040
easyClean                169.478309
doorman                  776.758006
fitness_center           328.735099
no_fee                  -320.067026
pre-war                  -44.088572
roof_deck               -277.965770
outdoor_space           -110.172924
balcony                 -104.940320
swimming_pool             50.248344
terrace                  279.094311
exclusive                304.711009
garden_patio              -3.576138
common_outdoor_space    -165.064371
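
One way to read these numbers is to rebuild a single prediction by hand. The sketch below uses the intercept and four of the coefficients printed above for a hypothetical apartment: 2 bedrooms plus 1 bathroom, 0.05 degrees from the Central Park reference point, a dishwasher only (easyClean of 1.5), and a doorman. All other amenity flags are assumed to be 0, so they contribute nothing. The apartment is invented for illustration.

```python
# Values copied from the printed model output above
intercept = 1272.6024338279153
coefs = {
    "distCentralP": -5627.208875,  # dollars per degree of distance
    "BedBath": 791.284040,         # dollars per additional room
    "easyClean": 169.478309,       # dollars per cleaning-score point
    "doorman": 776.758006,         # dollars if the building has a doorman
}

# Hypothetical apartment; every feature not listed is assumed to be 0
apartment = {"distCentralP": 0.05, "BedBath": 3.0, "easyClean": 1.5, "doorman": 1}

# Linear model: intercept plus the sum of coefficient * feature value
predicted = intercept + sum(coefs[name] * apartment[name] for name in coefs)
print(round(predicted, 2))  # about 4396.07
```

Note that the large `distCentralP` coefficient applies per full degree; at 0.05 degrees it lowers this prediction by roughly $281.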


In [254]:
# y_train Scores
maeTrain = mean_absolute_error(y_train, yPredTrain)
rmseTrain = sqrt(mean_squared_error(y_train, yPredTrain))
RsquareTrain = r2_score(y_train, yPredTrain)

print('Train Prediction Scores')
print(f"MAE: {maeTrain}")
print(f"RMSE: {rmseTrain}")
print(f"R^2: {RsquareTrain}")

Train Prediction Scores
MAE: 805.9566633130073
RMSE: 1220.2486030829666
R^2: 0.5204531009585542


In [0]:
# 6. Apply the Model to New Data (Test Data)
yPredTest = model.predict(X_test)

In [256]:
# y_Test Scores
maeTest = mean_absolute_error(y_test, yPredTest)
rmseTest = sqrt(mean_squared_error(y_test, yPredTest))
RsquareTest = r2_score(y_test, yPredTest)

print('Test Prediction Scores')
print(f"MAE: {maeTest}")
print(f"RMSE: {rmseTest}")
print(f"R^2: {RsquareTest}")

Test Prediction Scores
MAE: 816.2058999906938
RMSE: 1217.8998355314743
R^2: 0.5227574454751267


## **Conclusions**

In [257]:
# Analysis Efforts are Better Than Base Estimates
maeBase > maeTest

True

In [258]:
# Compared with the Baseline (Scored on Train), the Model is, on Average,...
# ...About $385.68 Closer in Its Apartment Rent Predictions on the Test Set
maeBase - maeTest

385.6752133775617

In [259]:
print(f"Baseline Mean Absolute Error: {maeBase}")
print(f"Train Mean Absolute Error: {maeTrain}")
print(f"Test Mean Absolute Error: {maeTest}")

Baseline Mean Absolute Error: 1201.8811133682555
Train Mean Absolute Error: 805.9566633130073
Test Mean Absolute Error: 816.2058999906938


In [0]:
#Baseline Mean Absolute Error: 1201.8811133682555
#Train Mean Absolute Error: 871.6213868400388
#Test Mean Absolute Error: 885.3902866494701

## Attempt at Graphing

In [0]:
import itertools
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
from sklearn.linear_model import LinearRegression

def regression_3d(df, x, y, z, num=100, **kwargs):
    """
    Visualize linear regression in 3D: 2 features + 1 target
    
    df : Pandas DataFrame
    x : string, feature 1 column in df
    y : string, feature 2 column in df
    z : string, target column in df
    num : integer, number of quantiles for each feature
    """
    
    # Plot data
    fig = px.scatter_3d(df, x, y, z, **kwargs)
    
    # Fit Linear Regression
    features = [x, y]
    target = z
    model = LinearRegression()
    model.fit(df[features], df[target])    
    
    # Define grid of coordinates in the feature space
    xmin, xmax = df[x].min(), df[x].max()
    ymin, ymax = df[y].min(), df[y].max()
    xcoords = np.linspace(xmin, xmax, num)
    ycoords = np.linspace(ymin, ymax, num)
    coords = list(itertools.product(xcoords, ycoords))
    
    # Make predictions for the grid
    predictions = model.predict(coords)
    Z = predictions.reshape(num, num).T
    
    # Plot predictions as a 3D surface (plane)
    fig.add_trace(go.Surface(x=xcoords, y=ycoords, z=Z))
    
    return fig

In [262]:
regression_3d(
    train,
    x='BedBath', 
    y='distCentralP', 
    z='price', 
    num=50, 
    title='Price vs. BedBath and distCentralP'
)