Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.
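
The date-based split in the first checklist item can be sketched on toy data before applying it to the real listings. This is a minimal illustration, not the assignment solution: the `toy` frame, its prices, and the `cutoff` name are made up for the example, and it assumes the frame is indexed by creation time, as the notebook does with `created`.

```python
import pandas as pd

# Toy listings indexed by creation time (values are made up for illustration).
toy = pd.DataFrame(
    {"price": [3000, 2850, 3275, 5465]},
    index=pd.to_datetime(
        ["2016-04-17 03:26:41", "2016-05-30 23:59:59",
         "2016-06-01 00:00:00", "2016-06-24 07:54:24"]
    ),
)

# Everything strictly before June 1st is training data; the rest is test data.
cutoff = pd.Timestamp("2016-06-01")
is_train = toy.index < cutoff
train, test = toy[is_train], toy[~is_train]

print(len(train), len(test))
```

Comparing against a single timestamp keeps the boundary explicit: a listing created at exactly midnight on June 1st goes to the test set.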


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?
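
A few of these ideas can be sketched with pandas on a toy frame. The column names `cats_allowed`, `dogs_allowed`, `bedrooms`, `bathrooms`, and `description` match the renthop data; the toy rows and the new column names (`has_description`, `cats_or_dogs`, `cats_and_dogs`, `total_rooms`) are assumptions made for illustration, and treating whitespace-only descriptions as missing is one possible choice rather than a rule.

```python
import pandas as pd

# Toy rows standing in for the real listings (made-up values).
toy = pd.DataFrame({
    "description": ["Sunny 2BR near park", None, "   "],
    "cats_allowed": [1, 0, 1],
    "dogs_allowed": [0, 0, 1],
    "bedrooms": [2, 1, 3],
    "bathrooms": [1.0, 1.0, 1.5],
})

# Treat missing and whitespace-only descriptions as "no description".
text = toy["description"].fillna("").str.strip()
toy["has_description"] = (text != "").astype(int)
toy["description_len"] = text.str.len()

# Logical OR / AND of the two binary pet columns.
cats = toy["cats_allowed"] == 1
dogs = toy["dogs_allowed"] == 1
toy["cats_or_dogs"] = (cats | dogs).astype(int)
toy["cats_and_dogs"] = (cats & dogs).astype(int)

# Total number of rooms (beds + baths).
toy["total_rooms"] = toy["bedrooms"] + toy["bathrooms"]

print(toy.drop(columns="description"))
```

Keeping "or" and "and" as separate 0/1 columns lets a linear model give each its own coefficient, which a single summed pet column cannot do.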

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4) (20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s)!

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv', index_col='created', parse_dates= ['created'])
# assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
# Step 1: Data Wrangling EDA

df.head()

created,bathrooms,bedrooms,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
2016-06-24 07:54:24,1.5,3,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-06-12 12:19:27,1.0,2,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-17 03:26:41,1.0,1,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-18 02:22:02,1.0,1,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-28 01:32:41,1.0,4,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:

df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48817 entries, 2016-06-24 07:54:24 to 2016-04-12 02:48:07
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   bathrooms             48817 non-null  float64
 1   bedrooms              48817 non-null  int64  
 2   description           47392 non-null  object 
 3   display_address       48684 non-null  object 
 4   latitude              48817 non-null  float64
 5   longitude             48817 non-null  float64
 6   price                 48817 non-null  int64  
 7   street_address        48807 non-null  object 
 8   interest_level        48817 non-null  object 
 9   elevator              48817 non-null  int64  
 10  cats_allowed          48817 non-null  int64  
 11  hardwood_floors       48817 non-null  int64  
 12  dogs_allowed          48817 non-null  int64  
 13  doorman               48817 non-null  int64  
 14  dishwasher            48817 non-nul

In [5]:
df.describe()

statistic,bathrooms,bedrooms,latitude,longitude,price,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
count,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0,48817.0
mean,1.201794,1.537149,40.75076,-73.97276,3579.585247,0.524838,0.478276,0.478276,0.447631,0.424852,0.415081,0.367085,0.052769,0.268452,0.185653,0.175902,0.132761,0.138394,0.102833,0.087203,0.060471,0.055206,0.051908,0.046193,0.043305,0.042711,0.039331,0.027224,0.026241
std,0.470711,1.106087,0.038954,0.028883,1762.430772,0.499388,0.499533,0.499533,0.497255,0.494326,0.492741,0.482015,0.223573,0.443158,0.38883,0.380741,0.33932,0.345317,0.303744,0.282136,0.238359,0.228385,0.221844,0.209905,0.203544,0.202206,0.194382,0.162738,0.159852
min,0.0,0.0,40.5757,-74.0873,1375.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,40.7283,-73.9918,2500.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,40.7517,-73.978,3150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2.0,40.774,-73.955,4095.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,10.0,8.0,40.9894,-73.7001,15500.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# Work on a copy so the engineered features do not modify the original df
df_new = df.copy()

df_new.head()

created,bathrooms,bedrooms,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
2016-06-24 07:54:24,1.5,3,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-06-12 12:19:27,1.0,2,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-17 03:26:41,1.0,1,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-18 02:22:02,1.0,1,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2016-04-28 01:32:41,1.0,4,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [7]:
df.columns

Index(['bathrooms', 'bedrooms', 'description', 'display_address', 'latitude',
       'longitude', 'price', 'street_address', 'interest_level', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [8]:
# Feature engineering

# Length of each apartment's description (NaN where the description is missing)
df_new['description_len'] = df_new['description'].str.len()

# Total number of perks per apartment: sum of the binary amenity columns (pets excluded)
perk_columns = ['elevator', 'hardwood_floors', 'doorman', 'dishwasher', 'no_fee',
                'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
                'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet',
                'balcony', 'swimming_pool', 'new_construction', 'terrace', 'exclusive',
                'loft', 'garden_patio', 'wheelchair_access', 'common_outdoor_space']
df_new['perks'] = df_new[perk_columns].sum(axis=1)

# Number of pet types allowed: 0 = neither, 1 = cats or dogs, 2 = both
df_new['dog_and_cat'] = df_new['dogs_allowed'] + df_new['cats_allowed']

# Total number of bedrooms and bathrooms
df_new['bathrooms_and_bedrooms'] = df_new['bathrooms'] + df_new['bedrooms']

# Ratio of bedrooms to bathrooms (inf or NaN where bathrooms == 0)
df_new['bedrooms/bathrooms'] = df_new['bedrooms'] / df_new['bathrooms']
df_new['bedrooms/bathrooms']

# Neighborhood from latitude & longitude: not implemented yet

created
2016-06-24 07:54:24    2.0
2016-06-12 12:19:27    2.0
2016-04-17 03:26:41    1.0
2016-04-18 02:22:02    1.0
2016-04-28 01:32:41    4.0
                      ... 
2016-06-02 05:41:05    2.0
2016-04-04 18:22:34    1.0
2016-04-16 02:13:40    1.0
2016-04-08 02:13:33    0.0
2016-04-12 02:48:07    2.0
Name: bedrooms/bathrooms, Length: 48817, dtype: float64
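
The ratio above is undefined for listings with zero bathrooms: pandas returns `inf` for a nonzero number divided by zero and `NaN` for zero divided by zero, and scikit-learn estimators reject both. A minimal sketch of one way to clean this up at creation time follows. The toy frame and the `beds_per_bath` name are assumptions for illustration, and filling with 0 is only one possible convention.

```python
import numpy as np
import pandas as pd

# Toy rows: a normal listing, two listings with zero bathrooms, another normal one.
toy = pd.DataFrame({"bedrooms": [3, 1, 0, 2],
                    "bathrooms": [1.5, 0.0, 0.0, 1.0]})

# 1 / 0 gives inf and 0 / 0 gives NaN; pandas does not raise here.
raw = toy["bedrooms"] / toy["bathrooms"]

# Turn +/-inf into NaN first, then fill every undefined ratio with 0.
toy["beds_per_bath"] = raw.replace([np.inf, -np.inf], np.nan).fillna(0)

print(toy)
```

Handling the column by name avoids a frame-wide `replace`, which would also touch any other column that happened to contain `inf`.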

In [9]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48817 entries, 2016-06-24 07:54:24 to 2016-04-12 02:48:07
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   bathrooms               48817 non-null  float64
 1   bedrooms                48817 non-null  int64  
 2   description             47392 non-null  object 
 3   display_address         48684 non-null  object 
 4   latitude                48817 non-null  float64
 5   longitude               48817 non-null  float64
 6   price                   48817 non-null  int64  
 7   street_address          48807 non-null  object 
 8   interest_level          48817 non-null  object 
 9   elevator                48817 non-null  int64  
 10  cats_allowed            48817 non-null  int64  
 11  hardwood_floors         48817 non-null  int64  
 12  dogs_allowed            48817 non-null  int64  
 13  doorman                 48817 non-null  int64  
 14  dis

# First Model: Making Features Matrix with Bathrooms and Bedrooms

In [11]:
# making target vector and features matrix
y = df['price']
X = df[['bathrooms','bedrooms']]
print (f'The shape of y is {y.shape} and that of X is {X.shape}')


# Split the data: train on listings created before June 2016, test on June 2016
mask = df_new.index < '2016-06-01 00:00:00'

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

print (f'Training Data y and X shape is {y_train.shape},{X_train.shape}')
print (f'Test Data y and X shape is {y_test.shape},{X_test.shape} \n \n ')

# checking if the length of test and train data is equal to the len of df
assert len(X_test) + len(X_train) == len(df)

# Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
print(f'The equation of the model fit is \n price = {model.intercept_} + {model.coef_[0]} x bathrooms + {model.coef_[1]} x bedrooms \n')

# Get the results
print(f'The Training Model Score is : {model.score(X_train, y_train)}')
print(f'The Test Model Score is : {model.score(X_test, y_test)}')

The shape of y is (48817,) and that of X is (48817, 2)
Training Data y and X shape is (31844,),(31844, 2)
Test Data y and X shape is (16973,),(16973, 2) 
 
 
The equation of the model fit is 
 price = 485.71869002322865 + 2072.610116385187 x bathrooms + 389.32489590255614 x bedrooms 

The Training Model Score is : 0.5111543084316607
The Test Model Score is : 0.5213303957090345


In [12]:
# Predicting with 3 bathrooms and 8 bedrooms
model.predict([[3,8]])

array([9818.1482064])
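
The prediction above is just the fitted linear equation evaluated at 3 bathrooms and 8 bedrooms. As a small check, the sketch below plugs the intercept and coefficients printed by the first model into that equation by hand; the variable names are chosen for this example only.

```python
# Values copied from the first model's printed equation.
intercept = 485.71869002322865
coef_bathrooms = 2072.610116385187
coef_bedrooms = 389.32489590255614

bathrooms, bedrooms = 3, 8

# price = intercept + coef_bathrooms * bathrooms + coef_bedrooms * bedrooms
predicted_price = intercept + coef_bathrooms * bathrooms + coef_bedrooms * bedrooms
print(round(predicted_price, 2))  # 9818.15, matching model.predict([[3, 8]])
```

Reading the coefficients this way also shows their meaning: with bedrooms held fixed, each additional bathroom adds about $2,073 to the predicted rent in this model.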

Output for the First Model: training R-squared is 51.1% and test R-squared is 52.1%.

# Second Model: Features Matrix with Bathrooms and Bedrooms Combined into a Single Variable, to Look for a Better R-Squared Score


In [13]:
# Update the features matrix to use only the combined rooms feature
X = df_new[['bathrooms_and_bedrooms']]
print (f'The shape of y is {y.shape} and that of X is {X.shape}')


# Split the data: train on listings created before June 2016, test on June 2016
mask = df_new.index < '2016-06-01 00:00:00'

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

print (f'Training Data y and X shape is {y_train.shape},{X_train.shape}')
print (f'Test Data y and X shape is {y_test.shape},{X_test.shape} \n \n ')

# checking if the length of test and train data is equal to the len of df
assert len(X_test) + len(X_train) == len(df)

# Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
print(f'The equation intercept is = {model.intercept_} and coefficient {model.coef_}\n')

# Get the results
print(f'The Training Model Score is : {model.score(X_train, y_train)}')
print(f'The Test Model Score is : {model.score(X_test, y_test)}')

The shape of y is (48817,) and that of X is (48817, 1)
Training Data y and X shape is (31844,),(31844, 1)
Test Data y and X shape is (16973,),(16973, 1) 
 
 
The equation intercept is = 1363.4860214142823 and coefficient [809.68138268]

The Training Model Score is : 0.420930005213961
The Test Model Score is : 0.4220517161864745


Output for the Second Model: training R-squared is 42.1% and test R-squared is 42.2%.
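
The drop relative to the First Model has a simple explanation, sketched here with symbols ($\beta$, $\gamma$) introduced only for this note. The First Model fits

$$
\text{price} = \beta_0 + \beta_1 \cdot \text{bathrooms} + \beta_2 \cdot \text{bedrooms}
$$

while the Second Model fits

$$
\text{price} = \gamma_0 + \gamma_1 \cdot (\text{bathrooms} + \text{bedrooms})
= \gamma_0 + \gamma_1 \cdot \text{bathrooms} + \gamma_1 \cdot \text{bedrooms}
$$

So the Second Model is the First Model with the extra constraint $\beta_1 = \beta_2$. Least squares picks the coefficients that minimize the training error, so the constrained model can never have a higher training $R^2$ than the unconstrained one. The First Model's fitted coefficients (about 2072.6 per bathroom versus 389.3 per bedroom) are far from equal, which is why forcing them to be equal costs so much here.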

# Third Model: Features Matrix with Bathrooms, Bedrooms and Perks, to Look for a Better R-Squared Score



In [14]:
# Update the features matrix to include the perks count
X = df_new[['bathrooms','bedrooms','perks']]
print (f'The shape of y is {y.shape} and that of X is {X.shape}')


# Split the data: train on listings created before June 2016, test on June 2016
mask = df_new.index < '2016-06-01 00:00:00'

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

print (f'Training Data y and X shape is {y_train.shape},{X_train.shape}')
print (f'Test Data y and X shape is {y_test.shape},{X_test.shape} \n \n ')

# checking if the length of test and train data is equal to the len of df
assert len(X_test) + len(X_train) == len(df)

# Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
print(f'The equation intercept is = {model.intercept_} and coefficient {model.coef_}\n')

# Get the results
print(f'The Training Model Score is : {model.score(X_train, y_train)}')
print(f'The Test Model Score is : {model.score(X_test, y_test)}')

The shape of y is (48817,) and that of X is (48817, 3)
Training Data y and X shape is (31844,),(31844, 3)
Test Data y and X shape is (16973,),(16973, 3) 
 
 
The equation intercept is = 341.50890178970394 and coefficient [1939.17782263  390.05145325   80.20429938]

The Training Model Score is : 0.5316997355169831
The Test Model Score is : 0.5407197747938215


Output for the Third Model: training R-squared is 53.2% and test R-squared is 54.1%.

# Fourth Model: Features Matrix with Bathrooms, Bedrooms, Perks, Cats and Dogs, Latitude, Longitude, description_len and interest_level, to Look for a Better R-Squared Score



In [15]:
# description_len is NaN where the description is missing; fill those rows with 0
df_new['description_len'] = df_new['description_len'].fillna(0)
# Check that no missing values remain
assert df_new['description_len'].isnull().sum() == 0
df_new['description_len']

created
2016-06-24 07:54:24     588.0
2016-06-12 12:19:27       8.0
2016-04-17 03:26:41     691.0
2016-04-18 02:22:02     492.0
2016-04-28 01:32:41     479.0
                        ...  
2016-06-02 05:41:05     787.0
2016-04-04 18:22:34    1125.0
2016-04-16 02:13:40     671.0
2016-04-08 02:13:33     735.0
2016-04-12 02:48:07     799.0
Name: description_len, Length: 48817, dtype: float64

In [16]:
# Encode interest_level as an ordinal integer: medium = 2, high = 3,
# and any other value (including 'low') = 1
def change_to_number(x):
  if x == 'medium':
    return 2
  elif x == 'high':
    return 3
  else:
    return 1
# df_new['interest_level'].isnull().sum()
df_new['interest_level'] = df_new['interest_level'].apply(change_to_number)

In [17]:
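# Replace undefined ratios in bedrooms/bathrooms with 0:
# NaN comes from 0 / 0, inf from dividing a nonzero bedroom count by 0 bathrooms
df_new['bedrooms/bathrooms'] = df_new['bedrooms/bathrooms'].fillna(0)
df_new = df_new.replace([np.inf], 0)
# Check: no inf rows should remain (expect an empty DataFrame)
df_new[df_new['bedrooms/bathrooms'] == np.inf]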

created,bathrooms,bedrooms,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,description_len,perks,dog_and_cat,bathrooms_and_bedrooms,bedrooms/bathrooms


### I tried using 'bedrooms/bathrooms' and 'bathrooms_and_bedrooms' instead of 'bathrooms' and 'bedrooms', but they reduced my R-squared value

In [18]:
# Update the features matrix to include the engineered features
# To try: also add 'bedrooms/bathrooms' and check whether the score increases marginally
X = df_new[['bathrooms','bedrooms','perks','dog_and_cat', 'latitude', 'longitude','description_len','interest_level']]
print (f'The shape of y is {y.shape} and that of X is {X.shape}')


# Split the data: train on listings created before June 2016, test on June 2016
mask = df_new.index < '2016-06-01 00:00:00'

X_train, y_train = X.loc[mask], y.loc[mask]
X_test, y_test = X.loc[~mask], y.loc[~mask]

print (f'Training Data y and X shape is {y_train.shape},{X_train.shape}')
print (f'Test Data y and X shape is {y_test.shape},{X_test.shape} \n \n ')

# checking if the length of test and train data is equal to the len of df
assert len(X_test) + len(X_train) == len(df)

# Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
print(f'The equation intercept is = {model.intercept_} and coefficient {model.coef_}\n')

# Get the results
print(f'The Training Model Score is : {model.score(X_train, y_train)}')
print(f'The Test Model Score is : {model.score(X_test, y_test)}')

The shape of y is (48817,) and that of X is (48817, 8)
Training Data y and X shape is (31844,),(31844, 8)
Test Data y and X shape is (16973,),(16973, 8) 
 
 
The equation intercept is = -1091079.2507346729 and coefficient [ 1.83644640e+03  4.51552932e+02  5.41214190e+01  2.89581837e+01
  1.35497444e+03 -1.40178403e+04  3.87361893e-02 -4.75079888e+02]

The Training Model Score is : 0.6114458125822391
The Test Model Score is : 0.6236723939327279


**The best training and test R-squared scores so far are 61.1% and 62.4%, from the Fourth Model.**

In [19]:
# MAE of the Fourth Model on the train and test data

from sklearn.metrics import mean_absolute_error
# Mean price: the prediction a mean baseline would make (not used in the MAE below)
price_mean = df['price'].mean()
print('MAE Train', mean_absolute_error(y_train, model.predict(X_train)))
print('MAE Test', mean_absolute_error(y_test, model.predict(X_test)))

MAE Train 700.5210173549026
MAE Test 701.6544737951979
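
These MAE values belong to the fitted model. To judge them, it helps to compare against a mean baseline, which always predicts the average training price. The sketch below shows the calculation on made-up numbers; the toy arrays stand in for `y_train` and `y_test`, and nothing here reproduces the notebook's real baseline score.

```python
import numpy as np

# Toy prices standing in for y_train / y_test (made-up numbers).
y_train = np.array([2500, 3000, 3500, 5000])
y_test = np.array([2800, 3200, 4500])

# The baseline "model" always predicts the mean price seen in training.
baseline = y_train.mean()

# MAE is the average absolute distance between actual and predicted prices.
mae_train = np.mean(np.abs(y_train - baseline))
mae_test = np.mean(np.abs(y_test - baseline))

print(f"Baseline prediction: {baseline:.2f}")
print(f"Baseline MAE train: {mae_train:.2f}, test: {mae_test:.2f}")
```

Note that the baseline is computed from the training prices only, so the test rows stay unseen; a model is only useful if its test MAE is clearly below the baseline's.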


In [20]:
# Getting the RMSE as the square root of the MSE
# (avoids the squared=False argument, which recent scikit-learn versions deprecate)
from sklearn.metrics import mean_squared_error
print('RMSE Train', np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
print('RMSE Test', np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

RMSE Train 1098.395063768929
RMSE Test 1081.4962798535753


In [21]:
# Getting the R Squared again
print(f'Model Score (R-Squared) on Train is : {model.score(X_train, y_train)}')
print(f'Model Score (R-Squared) on Test is : {model.score(X_test, y_test)}')

Model Score (R-Squared) on Train is : 0.6114458125822391
Model Score (R-Squared) on Test is : 0.6236723939327279
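
For reference, the three metrics reported above can be computed directly from their definitions. The sketch below does so with NumPy on made-up numbers; `regression_metrics` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error

    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                        # share of variance explained
    return mae, rmse, r2

# Made-up actual and predicted prices.
mae, rmse, r2 = regression_metrics([3000, 4000, 5000], [3100, 3900, 5300])
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```

Because RMSE squares the errors before averaging, it is never smaller than MAE, and the gap grows when a few predictions miss badly. That is consistent with the notebook's test RMSE (about 1081) sitting well above its test MAE (about 702).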


## Get the model coefficients and intercept

In [22]:
print(f'The linear equation for the best run model is \n price = {model.intercept_} + {model.coef_[0]} x bathrooms + {model.coef_[1]} x bedrooms + \
 {model.coef_[2]} x perks + {model.coef_[3]} x dog_and_cat + {model.coef_[4]} x latitude \
 + \n{model.coef_[5]} x longitude + {model.coef_[6]} x description_len + {model.coef_[7]} x interest_level')

The linear equation for the best run model is 
 price = -1091079.2507346729 + 1836.4464011075265 x bathrooms + 451.55293166912355 x bedrooms +  54.12141899586252 x perks + 28.958183703664 x dog_and_cat + 1354.97443871816 x latitude  + 
-14017.840338306245 x longitude + 0.03873618934994738 x description_len + -475.07988827056187 x interest_level


## The best MAE I can get (Fourth Model) is:

MAE Train 700.5210173549026

MAE Test 701.6544737951979