<a href="https://colab.research.google.com/github/reidharris01/DS-Unit-2-Linear-Models/blob/master/212_Assignment_Reid_Harris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
type(df['created'][0]) # checks data type of created column

str

In [5]:
df['created'] = pd.to_datetime(df['created']) # changes created column to timestamps
type(df['created'][0]) # checks to make sure successfully changed

pandas._libs.tslibs.timestamps.Timestamp

In [6]:
# INSTRUCTIONS:  Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.

train_start = '2016-04-01'
train_end = '2016-05-31'
test_start = '2016-06-01'
test_end = '2016-06-30'

train_mask = (train_start <= df['created']) & (df['created'] <= train_end)
test_mask = (test_start <= df['created']) & (df['created'] <= test_end)

train = df.loc[train_mask]
test = df.loc[test_mask]

print(train['created'].min().month, train['created'].max().month) # check range of train months (Looking for "4 5")
print(test['created'].min().month, test['created'].max().month) # check range of test months (Looking for "6 6")

4 5
6 6


In [7]:
# Split apart X and y for both train and test

X_train, X_test = train.drop('price',axis=1), test.drop('price',axis=1)
y_train, y_test = train['price'], test['price']

<font color='red'>**That's how I would split them, but I want to engineer features first before the split since geopy takes a long time and I don't want to have to repeat everything I did with the training set again with the test set.**

In [8]:
# INSTRUCTIONS: Engineer at least two new features. (See below for explanation
# & ideas.)

# I'd like to try to make a feature called "borough", but all we have is
# latitude and longitutde.  We'll use geopy to find address by lat/long.

from geopy.geocoders import Nominatim
geolocator = Nominatim() # initialize locator
location = geolocator.reverse("40.7145, -73.9425") # making sure it works 
# on a point in our dataset
print(location.address)



792, Metropolitan Avenue, Williamsburg, Brooklyn, New York, 11211, United States of America


In [9]:
df['coord_str'] = df['bedrooms'] # create a column of same length with same 
# indices (values don't matter as each value will be overwritten with a for loop)
for i in df.index: # since geopy requires a string input of form "latitute, 
# longitude", we'll make a column of that format
  df['coord_str'][i] = str(df['latitude'][i].round(4)) + ", " + str(df['longitude'][i].round(4))
df['coord_str'] # check to make sure it's looking fine

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


0        40.7145, -73.9425
1        40.7947, -73.9667
2        40.7388, -74.0018
3        40.7539, -73.9677
4        40.8241, -73.9493
               ...        
49347     40.7426, -73.979
49348    40.7102, -74.0163
49349      40.7601, -73.99
49350    40.7066, -74.0101
49351    40.8699, -73.9172
Name: coord_str, Length: 48817, dtype: object

In [10]:
df['full_address'] = df['bedrooms'] # create a column of same length with same 
# indices (values don't matter as each value will be overwritten with a for-loop)
for i in df.index:
  if i < 5: # just doing it for the first few listings to make sure it's working
# now plug new column into geopy reverse look-up
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i]))
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7145, -73.9425","792, Metropolitan Avenue, Williamsburg, Brookl..."
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7947, -73.9667","808, Columbus Avenue, Manhattan Valley, Manhat..."
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7388, -74.0018","239, West 13th Street, Manhattan Community Boa..."
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7539, -73.9677","333, East 49th Street, Turtle Bay, Manhattan, ..."
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8241, -73.9493","500, West 143rd Street, Hamilton Heights, Manh..."


In [11]:
# create a column of same length with same indices (values don't matter as each
# value will be overwritten with a for-loop later)
df['full_address'] = df['bedrooms']
df

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7145, -73.9425",3
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7947, -73.9667",2
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7388, -74.0018",1
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7539, -73.9677",1
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8241, -73.9493",4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49347,1.0,2,2016-06-02 05:41:05,"30TH/3RD, MASSIVE CONV 2BR IN LUXURY FULL SERV...",E 30 St,40.7426,-73.9790,3200,230 E 30 St,medium,1,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7426, -73.979",2
49348,1.0,1,2016-04-04 18:22:34,"HIGH END condo finishes, swimming pool, and ki...",Rector Pl,40.7102,-74.0163,3950,225 Rector Place,low,1,1,0,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,"40.7102, -74.0163",1
49349,1.0,1,2016-04-16 02:13:40,Large Renovated One Bedroom Apartment with Sta...,West 45th Street,40.7601,-73.9900,2595,341 West 45th Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7601, -73.99",1
49350,1.0,0,2016-04-08 02:13:33,Stylishly sleek studio apartment with unsurpas...,Wall Street,40.7066,-74.0101,3350,37 Wall Street,low,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7066, -74.0101",0


In [12]:
df['coord_str'][1325]

'40.7261, -73.8486'

In [13]:
# In case the rows are already clumped by address/location, I need to shuffle
# first so that I get a fair representation of each borough in the model
# since I'm about to choose 10% of them to keep.
df = df.sample(frac=1).reset_index(drop=True) # shuffles rows
df

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984",2
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.9400,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94",1
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.9870,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987",0
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814",1
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48812,1.0,0,2016-04-29 05:04:14,Luxury gut-renovated 4th floor (elevator build...,1090 Saint Nicholas Avenue,40.8383,-73.9397,1850,1090 Saint Nicholas Avenue,low,1,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,"40.8383, -73.9397",0
48813,1.0,0,2016-04-17 02:31:11,**E 89TH/2ND AVE * \r\rSPACIOUS*\r\r*EXP BRICK...,E 89 Street,40.7793,-73.9496,1850,310 E 89 Street,low,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,"40.7793, -73.9496",0
48814,1.0,1,2016-04-15 06:37:50,LOW FEE!Desirable Murray Hill location. This T...,East 34th Street,40.7436,-73.9727,3298,401 East 34th Street,medium,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,"40.7436, -73.9727",1
48815,1.0,2,2016-06-16 03:22:23,Amazing 1 Bedroom in a Luxury Building will le...,Washington Street,40.7080,-74.0149,2893,90 Washington Street,medium,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.708, -74.0149",2


In [18]:
# I tried running geopy for all 50K listings, but the processing 
# took over two hours and still wasn't done.  So I'm going to sacrifice most
# of my training set's data points for a feature that could be highly 
# predictive.  100 geopy lookups took 45 seconds, so it's about half a second
# per value.  I'll keep 1K out of the 50K for now which should take about eight
# or nine minutes.

keep = 500

for i in df.index:
  if i < keep:
# plug new column into geopy reverse look-up.
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

# So takes a while for just a few hundred.  When I try to do 1000, something
# always goes wrong midway through.  So, I'm going to do this for-loop
# over indices 500-999, then 1000-1499, then 1500-1999, etc.  (below)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [19]:
for i in df.index:
  if 500 <= i < 999: # adds full address to rows 500 thru 999
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [20]:
for i in df.index:
  if 1000 <= i <= 1499: # adds full address to rows 1000-1499
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [23]:
for i in df.index:
  if 1500 <= i <= 1999: # adds full address to rows 1500-1999
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [26]:
for i in df.index:
  if 2000 <= i <= 2499: # adds full address to rows 2000-2499
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [30]:
for i in df.index:
  if 2500 <= i <= 2999: # adds full address to rows 2000-2499
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [31]:
for i in df.index:
  if 3000 <= i <= 3499: # adds full address to rows 2000-2499
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


In [33]:
for i in df.index:
  if 3500 <= i <= 3999: # adds full address to rows 2000-2499
    df['full_address'][i] = str(geolocator.reverse(df['coord_str'][i])) 
  else:
    pass
df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


<font color='red'>**4000 is enough rows for now at least.  Probably not worth waiting another 5 hours for the rest.**

In [34]:
# Keep only those with borough info
# Assign to new variable so that if we mess something up, we don't have to
# rerun geopy, which takes forever
df2 = df.loc[df.index < 4000]
display(df2.head())
display(df2.tail())

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar..."
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo..."
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T..."
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan..."
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y..."


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address
3995,1.0,0,2016-06-08 02:10:35,This spacious and immaculate 1 bedroom apartme...,Remsen Street,40.695,-73.9972,2250,33 Remsen Street,low,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.695, -73.9972","33, Remsen Street, Brooklyn Heights, Brooklyn,..."
3996,1.0,2,2016-04-29 03:46:43,Large and recently renovated 2 bedroom duplex ...,Thompson Street,40.7277,-74.0,3950,"Greenwich Village, Downtown Manhattan",low,0,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,"40.7277, -74.0","174, Thompson Street, Manhattan Community Boar..."
3997,2.0,2,2016-04-28 04:28:06,"Experience a new, unparalleled level of luxury...",Junction Blvd,40.7327,-73.8638,3112,61-55 Junction Blvd,low,1,1,1,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,0,0,0,1,0,"40.7327, -73.8638","Dallas BBQ, 61-35, Junction Boulevard, Rego Pa..."
3998,1.0,1,2016-04-28 05:59:48,This luxury building wraps around the 3 corner...,Reade St,40.7162,-74.0095,3295,121 Reade St,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"40.7162, -74.0095","121, Reade Street, Manhattan, Tribeca, New Yor..."
3999,2.0,3,2016-04-22 06:09:34,NO FEE!! 3 BED-2 BATH WITH WASHER & DRYER-2ND ...,Avenue B,40.7279,-73.9794,5495,186 Avenue B,low,0,0,1,0,0,1,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,"40.7279, -73.9794","186, Avenue B, Alphabet City, Manhattan, New Y..."


In [58]:
boroughs_list = ['Manhattan','Brooklyn','Queens','The Bronx','Staten Island']

# Create 'borough' feature
# Doesn't matter contents, just needs to be same indices
df2['borough'] = df2['bedrooms']
# Overwrite with borough name
for i in df2.index:
  for word in boroughs_list:
    if word in str(df2['full_address'][i]):
      df2['borough'][i] = word
    else:
      pass
df2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.loc[key

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address,borough
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar...",Manhattan
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo...",Manhattan
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T...",Manhattan
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan...",Manhattan
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y...",Manhattan


In [59]:
df2['borough'].value_counts() # checks values

Manhattan    3422
Brooklyn      377
Queens        163
The Bronx      26
2               6
1               4
4               1
3               1
Name: borough, dtype: int64

In [62]:
# Looks like there are a few that didn't get assigned a borough.
# Remove them
for i in df2.index:
  if df2['borough'][i] == 1:
    df2.drop(index=i, inplace=True)
  elif df2['borough'][i] == 2:
     df2.drop(index=i, inplace=True)
  elif df2['borough'][i] == 3:
     df2.drop(index=i, inplace=True)
  elif df2['borough'][i] == 4:
     df2.drop(index=i, inplace=True)
  else:
    pass
print(df2.shape)
df2.head()

(3988, 37)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address,borough
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar...",Manhattan
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo...",Manhattan
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T...",Manhattan
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan...",Manhattan
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y...",Manhattan


In [63]:
# In order to make borough useable, I'll make dummies
df2 = pd.get_dummies(data= df2, columns=['borough'])

In [64]:
df2.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,coord_str,full_address,borough_Brooklyn,borough_Manhattan,borough_Queens,borough_The Bronx
0,1.0,2,2016-04-12 05:51:55,This is a lovely 2 bedroom apartment located i...,Mulberry Street,40.7177,-73.9984,3983,115 Mulberry Street,low,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,"40.7177, -73.9984","115, Mulberry Street, Manhattan Community Boar...",0,1,0,0
1,1.0,1,2016-05-14 03:24:04,"Spring is here, and NYC apts are renting FAST,...",West 145,40.8222,-73.94,1600,215 West 145,low,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.8222, -73.94","215, West 145th Street, Manhattan Community Bo...",0,1,0,0
2,1.0,0,2016-05-21 02:41:11,"53-story tower with its dramatic entrance, 13,...",W 47 St.,40.7604,-73.987,2600,271 W 47 St.,low,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,"40.7604, -73.987","Biltmore Tower, West 47th Street, Manhattan, T...",0,1,0,0
3,1.0,1,2016-06-05 04:02:29,TAKE ADVANTAGE OF EV'S HOTTEST NEW BLDG * This...,East 2nd Street,40.7214,-73.9814,3500,252 East 2nd Street,medium,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,"40.7214, -73.9814","254, East 2nd Street, Alphabet City, Manhattan...",0,1,0,0
4,1.0,1,2016-04-14 06:07:39,"Prime Gramercy Location, 1 block from subway.<...",First Avenue,40.7318,-73.9822,2950,252 First Avenue,low,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,"40.7318, -73.9822","257, 1st Avenue, Gramercy, East Village, New Y...",0,1,0,0


<font color='red'>**Perfect.  One feature down, one to go.  The next one will be MUCH easier.**

In [65]:
# I also want to make a "total space" feature, combining bathrooms, bedrooms,
# dining_rooms, balcony, and garden_patio.  This one will be much easier

df2['total_space'] = df2['bedrooms'] + df2['bathrooms'] + df2['balcony'] + df2['garden_patio']
df2['total_space']

0       4.0
1       2.0
2       1.0
3       2.0
4       2.0
       ... 
3995    1.0
3996    3.0
3997    4.0
3998    2.0
3999    5.0
Name: total_space, Length: 3988, dtype: float64

In [66]:
# INSTRUCTIONS: Fit a linear regression model with at least two features.

# Now I'll do train/test split on df2

train_start = '2016-04-01'
train_end = '2016-05-31'
test_start = '2016-06-01'
test_end = '2016-06-30'

train_mask = (train_start <= df2['created']) & (df2['created'] <= train_end)
test_mask = (test_start <= df2['created']) & (df2['created'] <= test_end)

train = df2.loc[train_mask]
test = df2.loc[test_mask]

# check range of train months (Looking for "4 5")
print(train['created'].min().month, train['created'].max().month)
# check range of test months (Looking for "6 6")
print(test['created'].min().month, test['created'].max().month)

4 5
6 6


In [67]:
# Split apart X and y for both train and test

# For features, we'll drop price and all the features that aren't predictive
X_train = train.drop(['price','created','description','display_address',
                     'latitude','longitude','street_address',
                     'full_address','coord_str'], axis=1)
X_test = test.drop(['price','created','description','display_address',
                    'latitude','longitude','street_address','full_address',
                    'coord_str'], axis=1)
# We kept interest_level, but we need to map it to an integer
X_train.replace({'high':3,'medium':2,'low':1},inplace=True)
X_test.replace({'high':3,'medium':2,'low':1},inplace=True)

# For targets, we'll just use price
y_train, y_test = train['price'], test['price']

In [68]:
X_train.head()

Unnamed: 0,bathrooms,bedrooms,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space,borough_Brooklyn,borough_Manhattan,borough_Queens,borough_The Bronx,total_space
0,1.0,2,1,0,0,1,0,0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,4.0
1,1.0,1,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,2.0
2,1.0,0,1,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1.0
4,1.0,1,1,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,0,0,1,0,0,2.0
6,1.0,2,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,3.0


In [69]:
from sklearn.linear_model import LinearRegression

model = LinearRegression() # initialize
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [70]:
# INSTRUCTIONS: Get the model's coefficients and intercept.
print("Coefficients =", model.coef_)
print("Intercept =", model.intercept_)

Coefficients = [-2.28695902e+15 -2.28695902e+15 -3.99145106e+02  1.77112972e+02
 -1.04544853e+02 -2.32589395e+02  3.39529231e+01  5.50684841e+02
  1.01415202e+02 -8.53063638e+01 -1.70692821e+01  1.56130827e+02
  1.09475699e+01  5.58710611e+02 -1.88556576e+02 -1.34200611e+02
  1.33000614e+02 -2.78632812e+02 -2.28695902e+15 -3.79296875e+00
  2.22419922e+02  1.81796875e+01  1.00928711e+02  1.94551270e+02
 -2.28695902e+15  4.19960938e+01 -6.59509277e+01  9.30336914e+01
  7.63758789e+02 -3.08757080e+02 -5.48777710e+02  2.28695902e+15]
Intercept = 335.1984526112187


<font color='red'>**So, the equation would be each of those coefficients times each variable (column name) plus the intercept.**

In [71]:
# INSTRUCTIONS: Get regression metrics RMSE, MAE, and R2, for both the train and test data.

# Need predictions and targets defined for all three
train_targets, train_preds = y_train, model.predict(X_train)
test_targets, test_preds = y_test, model.predict(X_test)

# RMSE
import sklearn
import math
mse = sklearn.metrics.mean_squared_error(train_targets, train_preds)
rmse = math.sqrt(mse)

# MAE
mae = round(abs(train_preds - train_targets).sum()/len(train_targets),2)

# R^2
SSreg = ((train_preds - train_targets)**2).sum()
SSmean = ((train_targets.mean() - train_targets)**2).sum()
r2 = 1 - (SSreg/SSmean)

print("RMSE =", round(rmse,2))
print("MAE =", round(mae,2))
print("R-squared =", round(r2,2))

RMSE = 1149.95
MAE = 733.17
R-squared = 0.6


In [72]:
round(abs(train_preds - train_targets).sum()/len(train_targets),2)

733.17

In [73]:
# Let's compare our RMSE and MAE to a baseline mean RMSE and MAE

mse = (((train_targets - train_preds.mean())**2)/len(train_targets)).sum()
rmse = math.sqrt(mse)
mae = round(abs(train_preds.mean() - train_targets).sum()/len(train_targets),2)
print("Baseline RMSE =", round(rmse,2))
print("Baseline MAE =", round(mae,2))

Baseline RMSE = 1810.11
Baseline MAE = 1220.01


<font color='red'>**So our model did indeed improve both RMSE and MAE substantially.**