<a href="https://colab.research.google.com/github/rileythejones/DS-Unit-2-Linear-Models/blob/master/RJ_assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

# import pandas_profiling
# df.profile_report()

import datetime
df['created'] = pd.to_datetime(df['created'])
df_train = df.loc[(df['created'].dt.month==4)|(df['created'].dt.month==5)]
df_test = df.loc[(df['created'].dt.month==6)]
df_train.shape, df_test.shape

# import plotly.express as px
# px.scatter(df, x='longitude', y='latitude', opacity=0.2)

((31844, 34), (16973, 34))

In [0]:
df_train_lat_long = df_train[['latitude', 'longitude']]

In [4]:
df_train_lat_long

Unnamed: 0,latitude,longitude
2,40.7388,-74.0018
3,40.7539,-73.9677
4,40.8241,-73.9493
5,40.7429,-74.0028
6,40.8012,-73.9660
...,...,...
49346,40.7296,-73.9869
49348,40.7102,-74.0163
49349,40.7601,-73.9900
49350,40.7066,-74.0101


In [0]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# df_train_scaled = scaler.fit_transform(df_train_lat_long)

# from sklearn.decomposition import PCA
# pca = PCA(10)
# pca_features = pca.fit_transform(df_train_scaled)
# pca_features

In [4]:
from sklearn.metrics import mean_absolute_error
# Arrange y target vectors
target = 'price'
y_train = df_train[target]
y_test = df_test[target]

# Get mean baseline
print('Mean Baseline (using 0 features)')
guess = y_train.mean()

# Train Error
y_pred = [guess] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error (April, May): {mae:.2f} $/month')

# Test Error
y_pred = [guess] * len(y_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error (June): {mae:.2f} $/month')

Mean Baseline (using 0 features)
Train Error (April, May): 1201.88 $/month
Test Error (June): 1197.71 $/month


In [5]:
# 1. Import the appropriate estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression

# 2. Instantiate this class
model = LinearRegression()

# 3. Arrange X features matrices (already did y target vectors)
features = ['bedrooms']
X_train = df_train[features]
X_test = df_test[features]
print(f'Linear Regression, dependent on: {features}')

# 4. Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} $/month')

# 5. Apply the model to new data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f} $/month')

Linear Regression, dependent on: ['bedrooms']
Train Error: 969.88 $/month
Test Error: 988.73 $/month


In [79]:
# Re-arrange X features matrices
features = ['bedrooms', 'bathrooms']
print(f'Linear Regression, dependent on: {features}')
X_train = df_train[features]
X_test = df_test[features]

# Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} $/month')

# Apply the model to new data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test Error: {mae:.2f} $/month')

Linear Regression, dependent on: ['bedrooms', 'bathrooms']
Train Error: 818.53 percentage points
Test Error: 825.90 percentage points


In [120]:
from sklearn.cluster import KMeans
ten_cluster = KMeans(n_clusters=14)
ten_cluster.fit(df_train_lat_long)
labels = ten_cluster.labels_
df_train['cluster'] = labels



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [0]:

df_train_lat_long['cluster'] = labels

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
colors = {0:'red', 1:'blue', 2:'yellow', 3:'orange', 4:'green', 5:'black', 6:'cyan', 7:'yellow', 8:'orange', 9:'green'}
grouped = df_train_lat_long.groupby('cluster')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='latitude', y='longitude', label=key, color=colors[key])
plt.title('Graphing the clusters from lat/long data')
plt.show()   

In [0]:
# adding this did nothing to improve accuracy 
df_train['pet_friendly'] = df_train['dogs_allowed'] + df_train['cats_allowed']
# adding this improved accuracy a little but 
# but I'm not sure how to transfer it over to the new dataset
# 
df_train['cluster'] = labels

In [126]:
# Re-arrange X features matrices
features = ['bedrooms', 'bathrooms', 'cluster']
print(f'Linear Regression, dependent on: {features}')
X_train = df_train[features]
# X_test = df_test[features]

# Fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Train Error: {mae:.2f} $/month')

# # Apply the model to new data
# y_pred = model.predict(X_test)
# mae = mean_absolute_error(y_test, y_pred)
# print(f'Test Error: {mae:.2f} $/month')

Linear Regression, dependent on: ['bedrooms', 'bathrooms', 'cluster']
Train Error: 795.31 percentage points


In [0]:
#  Get the model's coefficients and intercept.
#  Get regression metrics RMSE, MAE, and  R2 , for both the train and test data.
#  What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
#  As always, commit your notebook to your fork of the GitHub repo.