<a href="https://colab.research.google.com/github/raymond98tan/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/Raymond_Tan_LS_DS_212_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
#  Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test
from sklearn.model_selection import train_test_split

In [21]:
df.shape

(48817, 34)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48817 entries, 0 to 49351
Data columns (total 34 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   bathrooms             48817 non-null  float64
 1   bedrooms              48817 non-null  int64  
 2   created               48817 non-null  object 
 3   description           47392 non-null  object 
 4   display_address       48684 non-null  object 
 5   latitude              48817 non-null  float64
 6   longitude             48817 non-null  float64
 7   price                 48817 non-null  int64  
 8   street_address        48807 non-null  object 
 9   interest_level        48817 non-null  object 
 10  elevator              48817 non-null  int64  
 11  cats_allowed          48817 non-null  int64  
 12  hardwood_floors       48817 non-null  int64  
 13  dogs_allowed          48817 non-null  int64  
 14  doorman               48817 non-null  int64  
 15  dishwasher         

In [4]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
# Get the last day of May and subset the training and test data
date_may = '2016-05-31'
# Before the last day in May
df_train = df[df['created'] <= date_may].reset_index().drop('index',axis = 1)
# First day of June onward
df_test = df[df['created'] > date_may].reset_index().drop('index',axis = 1)

In [6]:
df_train.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2.0,4,2016-04-19 04:24:47,,West 18th Street,40.7429,-74.0028,7995,350 West 18th Street,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,2,2016-04-27 03:19:56,Stunning unit with a great location and lots o...,West 107th Street,40.8012,-73.966,3600,210 West 107th Street,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
df_test.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-06-03 03:21:22,Check out this one bedroom apartment in a grea...,W. 173rd Street,40.8448,-73.9396,1675,644 W. 173rd Street,low,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-06-01 03:11:01,Spacious 1-Bedroom to fit King-sized bed comfo...,East 56th St..,40.7584,-73.9648,3050,315 East 56th St..,low,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,2.0,4,2016-06-07 04:39:56,SPRAWLING 2 BEDROOM FOUND! ENJOY THE LUXURY OF...,W 18 St.,40.7391,-73.9936,7400,30 W 18 St.,medium,1,1,1,1,1,1,0,0,1,0,0,0,1,0,1,1,0,0,1,0,0,0,0,0


In [8]:
# Engineer at least two new features

# This feature finds out if the apartment is 'new' by checking the description
# Descriptions that say 'New York' will still be counted as new=1, how to work 
# around this?? listing may be new but also in New York vs old listing in New York
df_train['new'] = [1 if 'new' in string.lower() else 0 for string in df_train['description'].fillna(' ')]

# This feature finds out if the address is a numbered street or if it has a name
import re

df_train['named_address'] = [0 if bool(re.search(r'\d', string)) else 1
                             for string in df_train['display_address'].fillna('1')]

In [9]:
pd.options.display.max_colwidth = 1000
df_train[['description','new']][0:25]

Unnamed: 0,description,new
0,"Top Top West Village location, beautiful Pre-war building with laundry in the basement and live in super!<br/><br/>Apartment features a large bedroom with closet. Separate living room, kitchen features granite tops, dishwasher and microwave included, marble bathroom and hardwood flooring. Building is very well maintained and conveniently located near A,C,E,L,1,2,3 trains. Surrounded by many local cafe?s, restaurants, available for November 1st move in!<br/><br/>To view this apartment or any other please contact me via email or call at the number listed.<br/><br/><br/><br/><br/><br/>Bond New York is a real estate broker that supports equal housing opportunity.<p><a website_redacted",1
1,"Building Amenities - Garage - Garden - fitness room - laundry room -rooftop deck .<br /><br />Located in midtown East - High energy area - plenty of Bars and restaurants to choose from - within walking distance to the transit E,M,6,7<br /><br />This Apartment also feature a renovated kitchen with microwave - Marble Bath tiles.<br /><br />Call or Email and Text for Exclusive Showing!!<br /><br />**NO FEE**<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><p><a website_redacted",0
2,Beautifully renovated 3 bedroom flex 4 bedroom apartment for rent. Available immediately. Elevator building. Exposed brick walls. Hardwood floors through the apartment. Kitchen has wood kitchen cabinets and stainless steel appliances including a dishwasher and a microwave. Unit has a washer and dryer. Bathroom has ceramic wall and floors tiles and lots of storage. Apartment is very sunny and spacious. Available immediatelyPlease contact/email/text Jessica for an appointment,0
3,,0
4,"Stunning unit with a great location and lots of natural light. Close to the subway line or a nice walk to Columbia University.Near many restaurants to choose from, place of worship and shops. Call/Email Helen and I will work on finding the right place for you :)kagglemanager@renthop.com 766-103-4663Hablo espanol!Building Amenities:* Open Kitchen with stainless steel* Great cabinet space* Good amount of closet space* Subway (1 Trains)Contact me to show you all Bohemia Realty Group has to offer 766-103-4663 kagglemanager@renthop.com espanol! - See more at: website_redacted",0
5,"This huge sunny ,plenty of lights 1 bed/2 bath offers you a brand new kitchen,open to the living space, with stainless steel appliances,granite counters,plenty of cabinet spaces ,dishwasher,micro,marble bath with stand up shower and foldable closets throughout the apartment and your OWN OUTDOOR SPACE!!!The building comes with an elevator,gym,laundry,24 hour doorman and beautiful roof deck/terrace with breathtaking view to the city!gut reno,ss app,dish washer,closets,marble bath,laundry,elevator,drmn,gym,roofdeck<br /><br />Your future new apartment is located in west Chelsea,near amazing restaurants ,shops,art galleries and nightlife!Close to the famous Highline, Chelsea Piers and the Chelsea Market<br /><br />Call or e mail me for any further information or to schedule a viewing.As an expert of the area, I have access to all available apartments.I look forward to help you finding your dream home!<br /><br /><p><a website_redacted",1
6,<p><a website_redacted,0
7,"This is a spacious four bedroom with every bedroom able to fit queen sized beds with windows and closets and room to spare. There are ceiling fans and Exposed brick scattered throughout the apartment. It also has High ceilings, great hardwood floors, an excellent open kitchen with a dishwasher and room for a table, Really great living room that fits a wrap around couch and windows throughout the apartment.<br /><br />Located in the heart of the lower East side you can't help but notice all the neighborhood bars, restaurants, cafes, and boutique shopping.<br /><br />Call, text, or e-mail me today to set up a private viewing.<br /><br /><br /><br />-------------Listing courtesy of Miron Properties. All material herein is intended for information purposes only and has been compiled from sources deemed reliable. Though information is believed to be correct, it is presented subject to errors, omissions, changes or withdrawal without notice. Miron Properties is a licensed Real Estate ...",0
8,"New to the market! Spacious studio located in the 80s, in a well maintained elevator building with laundry room. For viewing and more information please contact Artak.",1
9,"***LOW FEE. Beautiful CHERRY OAK WOODEN FLOORSTHE LOBBY IS VERY NICE AND MODERN! GORGEOUS FLEX TWO BEDROOM APARTMENT LOCATED IN THE HEART OF MURRAY HILL. Near many great restaurants, bars, entertainment, shopping and major transportation. THIS IS A STEAL AND WILL NOT LAST. Great Location 24 HR DOORMAN SPACIOUS ROOMS TONS OF SUNLIGHT IN HOUSE WASHER/DRYER FACILITY<br /><br />FOR FURTHER INFORMATION AND EXCLUSIVE VIEWING PLEASE CONTACT MEYER OVADIA AT 753-396-6626 OR VIA EMAIL AT kagglemanager@renthop.com<br /><br />NOT WHAT YOU ARE LOOKING FOR? FEEL FREE TO CONTACT ME WITH ANY SPECIAL REQUESTS. I'LL BE MORE THAN HAPPY TO FURTHER ASSIST YOU IN YOUR SEARCH<br /><br /><p><a website_redacted",0


In [11]:
df_train[['display_address','named_address']][0:5]

Unnamed: 0,display_address,named_address
0,W 13 Street,0
1,East 49th Street,0
2,West 143rd Street,0
3,West 18th Street,0
4,West 107th Street,0


In [12]:
# Fit linear regression model with at least two features.
from sklearn.linear_model import LinearRegression

# Instantiate the model
model = LinearRegression()

# Get regressors, drop non-numeric values and target
X = df_train.select_dtypes(['number']).drop('price',axis = 1)

# Get target
y = df_train[['price']]
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [13]:
df_train.select_dtypes(['number']).head()

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,new,named_address
0,1.0,1,40.7388,-74.0018,2850,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,1.0,1,40.7539,-73.9677,3275,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,4,40.8241,-73.9493,3350,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2.0,4,40.7429,-74.0028,7995,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,2,40.8012,-73.966,3600,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [14]:
# Get model coefficients and intercept
print('coefficients:\t', model.coef_)
print('intercept:\t', model.intercept_)

coefficients:	 [[  1765.12659286    480.82170497   1527.10683624 -13733.52663864
     148.18780211    -38.73395501   -196.63151841     98.59954545
     480.20364929     49.53743067   -165.95047613   -255.57793711
     139.90077643    -69.51193927    501.89661186   -167.6595985
    -113.07515937    234.76002678   -345.52720879    -52.15034317
      52.75218733   -134.08354994    182.06957743    112.81689723
     130.10323254    -45.52225202    201.14991334    -92.30982743
     -56.33534239    -49.49523138]]
intercept:	 [-1077623.7249224]


In [15]:
# Checking lengths
print(len(model.coef_[0]))
print(len(X.columns))

30
30


In [16]:
# Adding the new features to the validation data
df_test['new'] = [1 if 'new' in string.lower() else 0 for string in df_test['description'].fillna(' ')]
df_test['named_address'] = [0 if bool(re.search(r'\d', string)) else 1
                             for string in df_test['display_address'].fillna('1')]
X_test = df_test.select_dtypes(['number']).drop('price',axis = 1)

In [17]:
# Get regression metrics RMSE, MAE, and  R2 , for both the train and test data.
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import r2_score as R2

# Predicted and actual values for training
y_true_train = y
y_pred_train = model.predict(X)

# Predicted and actual values for testing
y_true_test = df_test[['price']]
X_test = df_test.select_dtypes(['number']).drop('price',axis = 1)
y_pred_test = model.predict(X_test)

# MSE on train, test
MSE_list = [MSE(y_true_train,y_pred_train,squared = False),
            MSE(y_true_test,y_pred_test,squared = False)]
print('Root Mean Squared Error:\n\t{}\n\t{}'.format(MSE_list[0],MSE_list[1]))

# MAE on train, test
MAE_list = [MAE(y_true_train,y_pred_train),
            MAE(y_true_test,y_pred_test)]
print('\nMean Absolute Error:\n\t{}\n\t{}'.format(MAE_list[0],MAE_list[1]))

# R2 score on train, test
R2_list = [R2(y_true_train,y_pred_train),
           R2(y_true_test,y_pred_test)]
print('\nR2 score:\n\t{}\n\t{}'.format(R2_list[0],R2_list[1]))

Root Mean Squared Error:
	1089.6047958540923
	1079.1529899317168

Mean Absolute Error:
	692.0248370834879
	703.605948351354

R2 score:
	0.6176818544170486
	0.6252251125604636
