# Apartment Hunting

How might you approach the problem of recognizing a good deal from a bad deal?

Housing Markets are difficult to predict:
- Continually changing marketplace
- Many factors that play into one decision
- Different features have different values to renters/rentees

### 1. Formalize a Question 

### 2. What information do we need/would be helpful to answer this question? Where do we find it?

### 3. What patterns do we see in the data? EDA

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('../shared-resources/AptData.csv')
df.describe(include='all')

Unnamed: 0.1,Unnamed: 0,available,bath,bed,cat,content,content_length,date,dog,feet,...,lastseen,lat,laundry,long,parking,price,smoking,time,title,wheelchair
count,59668.0,59662,57955.0,57955.0,59668.0,59668,59668.0,59668,59668.0,52002.0,...,35194,56280.0,54706,56280.0,45358,59603.0,59662.0,59668,59668,59662.0
unique,,418,,,,42098,,172,,,...,57,,5,,7,,,32748,37617,
top,,2017-03-01,,,,Manger Special: $100 off rent for the first 6...,,2017-05-30,,,...,2017-05-31,,w/d in unit,,off-street parking,,,09:44:21,1 AND 2 BED AVAILABLE,
freq,,1714,,,,326,,763,,,...,1748,,34855,,14897,,,8,132,
mean,6049989000.0,,1.283944,1.436287,0.715258,,222.731632,,0.66751,888.145417,...,,45.518263,,-122.628349,,2802.6,0.0,,,0.0
std,60183190.0,,0.51381,0.934212,0.451295,,119.112952,,0.471109,2384.246357,...,,0.134953,,0.507698,,125390.9,0.0,,,0.0
min,5920380000.0,,0.0,0.0,0.0,,3.0,,0.0,1.0,...,,27.939305,,-124.018466,,1.0,0.0,,,0.0
25%,6001450000.0,,1.0,1.0,0.0,,143.0,,0.0,604.0,...,,45.504798,,-122.687025,,1190.0,0.0,,,0.0
50%,6042061000.0,,1.0,1.0,1.0,,212.0,,1.0,789.0,...,,45.519138,,-122.668855,,1399.0,0.0,,,0.0
75%,6096789000.0,,1.5,2.0,1.0,,287.0,,1.0,1000.0,...,,45.532877,,-122.612359,,1795.0,0.0,,,0.0


In [71]:
unique_data = df.drop_duplicates(keep='first')
print(unique_data.columns)

Index(['Unnamed: 0', 'available', 'bath', 'bed', 'cat', 'content',
       'content_length', 'date', 'dog', 'feet', 'furnished', 'getphotos',
       'hasmap', 'housingtype', 'lastseen', 'lat', 'laundry', 'long',
       'parking', 'price', 'smoking', 'time', 'title', 'wheelchair'],
      dtype='object')


### 4. What methods might we use to answer our question?

In [75]:
# input features -> predict price (cont variable)
# linear, log regress, neural, pca, forest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

unique_data = df.drop_duplicates(keep='first')
unique_data = unique_data[['bath', 'bed', 'feet', 'price']]
unique_data = unique_data.dropna()

X = unique_data.drop(['price'], axis=1)
y = unique_data.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .15, random_state = 100)


In [76]:
lr = LinearRegression(normalize=True)
lr.fit(X_train, y_train)

lr.score(X_test, y_test)

0.0015649683266656389

In [80]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rf = RandomForestRegressor(n_estimators=30)
rf.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [81]:
pred = rf.predict(X_test)

In [82]:
r2_score(pred, y_test)

-648.90878623577362

In [32]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_train.head()

array([[ 0.2       ,  0.        ,  0.00101679],
       [ 0.5       ,  0.25      ,  0.00239333],
       [ 0.2       ,  0.        ,  0.00115501],
       ..., 
       [ 0.2       ,  0.125     ,  0.0014182 ],
       [ 0.2       ,  0.125     ,  0.00098271],
       [ 0.2       ,  0.        ,  0.00075549]])

In [64]:
from keras.models import Sequential
from keras.layers import Activation, Dense

model = Sequential()
model.add(Dense(3, input_shape=X_train.shape[1:]))  # model.layers[0].get_weights()[:].shape: [(2,),]
model.add(Activation('relu'))  # relu, softmax, sigmoid
model.compile(loss='mean_squared_error', optimizer='sgd') 
# notice that there are 2 random weights and a bias (initialized to zero), just like we had before
print(model.layers[0].get_weights())

[array([[-0.0318293 , -0.63616037, -0.43955475],
       [-0.24039263,  0.3029573 , -0.51735497],
       [-0.79712427,  0.84538138, -0.70950437]], dtype=float32), array([ 0.,  0.,  0.], dtype=float32)]


In [61]:
print(X_train.shape[1:])
# y_train.values.reshape(-1,1)

(3,)


In [65]:
model.fit(X_train, y_train.values.reshape(-1,1), batch_size=100)

ValueError: Error when checking target: expected activation_12 to have shape (None, 3) but got array with shape (43456, 1)

In [38]:
# ypred = model.predict(X_test)
# print(ypred.shape)
# print(y.shape)
# mad = np.abs(ypred - y_test.values.reshape(len(y_test), 1)).mean()
# print(mad)

(7669,)

### 5. How do we explain our results?