# Path 3 - Building and Testing A Model

What is machine learning?
1. Supervised Learning
    - We give our model questions and answers
    - e.g. is this a picture of a cat or dog?
2. Unsupervised Learning
    - We give our model unlabeled data, and it figures out something about it
    - e.g. what are the most common types of customers I have?
3. Reinforcement Learning
    - We give our model an environment to play in, and a notion of when it wins or loses
    - e.g. playing chess

Our case is definitely supervised learning--we have questions and answers (question: how much should a house cost that has these features? answer: the price!). So a model is anything that takes those details, and infers the answer.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

train_df = pd.read_csv('https://raw.githubusercontent.com/wlifferth/build-an-ml-web-app/main/cleaned_data.csv', index_col='id')
train_df.head()

In [None]:
average_price = train_df['price'].mean()

average_price

In [None]:
mean_model_df = train_df.copy()

mean_model_df['predicted'] = average_price  # always predict the mean price (about 335720)

mean_model_df['absolute_error'] = np.abs(mean_model_df['price'] - mean_model_df['predicted'])

In [None]:
plt.hist(mean_model_df['absolute_error'])

mean_model_df['absolute_error'].mean()
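
The same always-guess-the-mean baseline can also be sketched with scikit-learn's `DummyRegressor`. The tiny `toy_df` table here is made up for illustration and only stands in for `train_df`.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up stand-in for train_df (values are illustrative only)
toy_df = pd.DataFrame({
    'livingArea': [1000, 1500, 2000, 2500],
    'price': [200000, 300000, 350000, 550000],
})

# strategy='mean' ignores the inputs and always predicts the mean training price
baseline = DummyRegressor(strategy='mean')
baseline.fit(toy_df[['livingArea']], toy_df['price'])

baseline_predictions = baseline.predict(toy_df[['livingArea']])
baseline_mae = mean_absolute_error(toy_df['price'], baseline_predictions)
print(baseline_predictions[0], baseline_mae)
```

A baseline like this is handy to keep around: any fancier model should at least beat it.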

In [None]:
# This is literally the simplest model we could build--and this counts as a model! Just a really dumb one : )

# 4. How do we make a model more powerful?
#     a. More data
#         i. Take into account more variables (i.e. what's the average price of a square foot)
#         ii. Add more rows to our dataset
#     b. More "capacity"
#         i. Give our model a bigger brain (we'll look at this later)

In [None]:
# What if we incorporated just the livingArea value?
# What is the average price per square foot?

square_footage_model_df = train_df.copy()
square_footage_model_df['price_per_sqft'] = square_footage_model_df['price'] / square_footage_model_df['livingArea']

square_footage_model_df['price_per_sqft'].mean()

In [None]:
square_footage_model_df['predicted'] = square_footage_model_df['livingArea'] * 195.35527446966

square_footage_model_df['absolute_error'] = np.abs(square_footage_model_df['price'] - square_footage_model_df['predicted'])

In [None]:
plt.hist(square_footage_model_df['absolute_error'])

square_footage_model_df['absolute_error'].mean()

In [None]:
# Wow! We just made our model a lot more accurate! On average, we're 20k closer to the correct price!
# 5. This is our first model--it's called linear regression
#     a. scikit-learn lets us build this kind of model quickly!

from sklearn.linear_model import LinearRegression

lin_reg_df = train_df.copy()

input_data = lin_reg_df[['livingArea']] # This is 2d
output_data = lin_reg_df['price'] # This is 1d

linear_regression_on_living_area_model = LinearRegression()

linear_regression_on_living_area_model.fit(input_data, output_data)

In [None]:
lin_reg_df['predicted'] = linear_regression_on_living_area_model.predict(input_data)

lin_reg_df['absolute_error'] = np.abs(lin_reg_df['price'] - lin_reg_df['predicted'])

In [None]:
plt.hist(lin_reg_df['absolute_error'])

# Whoa--it did a little bit better than us--what's going on? (LinearRegression also fits a bias, a.k.a. an intercept, in addition to the slope)
lin_reg_df['absolute_error'].mean(skipna=True)
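
One way to see what the bias (intercept) buys us is to fit the same data with and without it. This is a sketch on synthetic data where every house has a fixed base cost; the base price, slope, and noise level are assumptions chosen for illustration, not values from our dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
living_area = rng.uniform(500, 4000, size=200).reshape(-1, 1)
# Assumed relationship: $50,000 base price + $150 per square foot + noise
price = 50000 + 150 * living_area[:, 0] + rng.normal(0, 20000, size=200)

# Slope only: comparable to a single price-per-square-foot multiplier
slope_only = LinearRegression(fit_intercept=False).fit(living_area, price)
# Slope plus intercept: what LinearRegression fits by default
with_bias = LinearRegression().fit(living_area, price)

mse_slope_only = mean_squared_error(price, slope_only.predict(living_area))
mse_with_bias = mean_squared_error(price, with_bias.predict(living_area))
print(with_bias.intercept_, with_bias.coef_[0])
print(mse_slope_only, mse_with_bias)
```

Because `LinearRegression` minimizes squared error, adding the intercept can never make the squared error on the training data worse; the comparison here uses mean squared error for that reason.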

In [None]:
# Now would also be a good time to introduce another helpful utility from scikit-learn--calculating our error for us:

from sklearn.metrics import mean_absolute_error

predictions = linear_regression_on_living_area_model.predict(input_data)

mean_absolute_error(lin_reg_df['price'], predictions)
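
As a sanity check on what `mean_absolute_error` computes, here is a small sketch with made-up numbers comparing it to the NumPy expression used earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices and predictions, for illustration only
actual_prices = np.array([300000, 450000, 250000])
predicted_prices = np.array([320000, 400000, 250000])

# By hand: the average of the absolute differences
manual_mae = np.abs(actual_prices - predicted_prices).mean()
# scikit-learn: the arguments are (y_true, y_pred)
sklearn_mae = mean_absolute_error(actual_prices, predicted_prices)
print(manual_mae, sklearn_mae)
```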


In [None]:
# We can also give our model more capacity
lin_reg_df = train_df.copy() # Overwriting lin_reg_df

lin_reg_df['livingAreaSquared'] = lin_reg_df['livingArea'] ** 2
lin_reg_df['livingAreaRooted'] = lin_reg_df['livingArea'] ** 0.5

input_data = lin_reg_df[['livingArea', 'livingAreaSquared', 'livingAreaRooted']]
output_data = lin_reg_df['price']
lr_on_living_area_nonlinear_model = LinearRegression()
lr_on_living_area_nonlinear_model.fit(input_data, output_data)
predictions = lr_on_living_area_nonlinear_model.predict(input_data)
mean_absolute_error(lin_reg_df['price'], predictions)

In [None]:
# Okay--we got a little bit better! Could we just keep adding additional terms?
columns = ['livingArea', 'livingAreaSquared', 'livingAreaRooted']
for i in range(3, 6):  # start at 3: the squared term is already in `columns`
    column = f'livingAreaToThePowerOf{i}'
    columns.append(column)
    lin_reg_df[column] = lin_reg_df['livingArea'] ** i

input_data = lin_reg_df[columns]
output_data = lin_reg_df['price']
lr_on_living_area_nonlinear_model = LinearRegression()
lr_on_living_area_nonlinear_model.fit(input_data, output_data)
predictions = lr_on_living_area_nonlinear_model.predict(input_data)
mean_absolute_error(lin_reg_df['price'], predictions)

In [None]:
# Technically it's better, but not by much
# 8. Now it's time for the big "but" in machine learning, and it's called overfitting. I've spent a lot of time thinking about what the best way to explain overfitting is, and I think a really good analogy is with study guides.
#     a. So I want us to all pretend that I'm a biology teacher, and you all are my students.
#     b. I have 100 questions I've come up with that cover our material, and I need to make a test, and give y'all a study guide.
#     c. So let's say I give you all 100 questions, with answers, as the study guide. Then I randomly pick 10 of them to be the test.
#     d. This might be a fine way of doing things. But, what if some of my students have a photographic memory? This is when someone can look at something, and basically without thinking, recall every specific detail of what they saw. This is kind of the equivalent of a high-capacity model.
#     e. Well this would be bad, because the students wouldn't have to learn anything, they could just memorize the specific questions and regurgitate them.
#     f. This is one of the biggest problems we face in machine learning--it's called overfitting. And the easiest way to think about it is when your model just memorizes the training data.
#     g. Why is this a problem? Because it doesn't generalize--you can only perform well on data you have already seen. So you can't actually make good predictions.
#     h. So what would we do in the study guide example?
#     i. I could take my 100 questions, give 90 of them to you as a study guide, and keep the remaining 10 a secret for the test. That way you can't get a high grade just by memorizing, you actually have to learn.

In [None]:
fake_data = pd.DataFrame({
    'x': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    # 'y': [-3, 14, 16, 9, 12, 14, 39, 63]
    'y': [0.0, 1.0, 1.4142135623730951, 2.6, 2.0, 2.23606797749979, 2.449489742783178, 2.6457513110645907, 2.8284271247461903, 2.5, 3.1622776601683795, 3.3166247903554]
})

plt.scatter(fake_data['x'], fake_data['y'])

In [None]:
columns = []  # the loop adds x ** 1 itself, so starting with 'x' would duplicate it
predicted_columns = []
for i in range(1,10):
    column = f'xToThePowerOf{i}'
    columns.append(column)
    fake_data[column] = fake_data['x'] ** i
    model = LinearRegression()
    model.fit(fake_data[columns], fake_data['y'])
    predicted_column = f'predictedFrom{i}'
    predicted_columns.append(predicted_column)
    fake_data[predicted_column] = model.predict(fake_data[columns])

In [None]:
for predicted_column in ['predictedFrom1', 'predictedFrom2', 'predictedFrom5', 'predictedFrom9']:
    plt.title(predicted_column)
    plt.scatter(fake_data['x'], fake_data['y'])
    plt.plot(fake_data['x'], fake_data[predicted_column])
    plt.show()

In [None]:
# So how do we make sure we're not making that last model?
# Thinking back to our story about tests and study guides, we can do the same thing.
# 9. When we do this in machine learning, it's called cross validation.
#     a. Basically we split our training data up into a smaller training set the model gets to see, then we test it on the rest of the data it hasn't seen yet.
# 10. So now we see that adding capacity helps, up to a point. If we add too much capacity, our model just starts memorizing things, and doesn't perform as well on data it hasn't seen.

from sklearn.model_selection import train_test_split

X = train_df[['livingArea']]
y = train_df['price']

errors = []
for i in range(4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=i)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    error = mean_absolute_error(predictions, y_test)
    print(error)
    errors.append(error)
print(f'Mean Error {np.mean(errors)}')

# Already we see that our error is worse when our model is being tested on data it hasn't seen yet!
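
The loop above repeats a random train/test split. A closely related approach is k-fold cross-validation, where every row is used for testing exactly once. Here is a minimal sketch using `cross_val_score`, on synthetic data standing in for `train_df` (the data-generating numbers are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 4000, size=(300, 1))
y_demo = 50000 + 150 * X_demo[:, 0] + rng.normal(0, 20000, size=300)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
# scikit-learn scorers are "higher is better", so MAE comes back negated
neg_mae_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=folds, scoring='neg_mean_absolute_error',
)
fold_errors = -neg_mae_scores
print(fold_errors)
print(f'Mean Error {fold_errors.mean()}')
```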

In [None]:
# So we've eaten our vegetables, now we get to go nuts--let's throw in all the data we cleaned last time!

X = train_df.drop(['city', 'state', 'lotUnit', 'price'], axis=1)
y = train_df['price']

errors = []
for i in range(4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=i)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    error = mean_absolute_error(predictions, y_test)
    print(error)
    errors.append(error)
print(f'Mean Error {np.mean(errors)}')

In [None]:
# What if we one-hot encoded state?
X = pd.get_dummies(train_df.drop(['city', 'lotUnit', 'price'], axis=1), columns=['state'])
y = train_df['price']

errors = []
for i in range(4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=i)
    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    error = mean_absolute_error(predictions, y_test)
    print(error)
    errors.append(error)
print(f'Mean Error {np.mean(errors)}')

# Nice!
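
If one-hot encoding is new to you, here is a small sketch of what `pd.get_dummies` does to a made-up table (the rows are illustrative only): the `state` column is replaced by one indicator column per state.

```python
import pandas as pd

# Made-up rows, for illustration only
toy_df = pd.DataFrame({
    'livingArea': [1200, 1800, 2400],
    'state': ['UT', 'CA', 'UT'],
})

# Each distinct value of 'state' becomes its own state_<value> column
encoded = pd.get_dummies(toy_df, columns=['state'])
print(encoded)
```

This lets a linear model learn a separate price adjustment for each state instead of trying to treat the state name as a number.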

In [None]:
# All this has been using our original model, LinearRegression, but there are a lot of hot sexy models out there
from sklearn.neural_network import MLPRegressor

X = pd.get_dummies(train_df.drop(['city', 'lotUnit', 'price'], axis=1), columns=['state'])
y = train_df['price']

errors = []
for i in range(4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=i)
    model = MLPRegressor(hidden_layer_sizes=(4,))
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    error = mean_absolute_error(predictions, y_test)
    print(error)
    errors.append(error)
print(f'Mean Error {np.mean(errors)}')

In [None]:
# Finally, let's cover how to submit to the Kaggle competition

final_model = LinearRegression()

final_training_input = pd.get_dummies(train_df.drop(['city', 'lotUnit', 'price'], axis=1), columns=['state'])

X = pd.get_dummies(final_training_input)
y = train_df['price']
final_model.fit(X, y)

In [None]:
# First we have to do all the preprocessing we did on our training dataset on our testing dataset:

test = pd.read_csv('https://raw.githubusercontent.com/wlifferth/build-an-ml-web-app/main/test.csv', index_col='id')

test.drop(['homeStatus', 'dateSold', 'address'], axis=1, inplace=True)

def convert_lot_area(row):
    if row['lotUnit'] == 'acres':
        return row['lotArea'] * 43560
    else:
        return row['lotArea']

test['lotArea'] = test.apply(convert_lot_area, axis=1)

test.drop(['lotUnit'], inplace=True, axis=1)

test = pd.get_dummies(test, columns=['homeType'])

print(test.head())

zip_code_df = pd.read_csv('median_income_by_zip_code.csv')

zip_code_df['median_income']

test = pd.merge(test, zip_code_df, how='left', left_on='zipcode', right_on='zip_code').set_index(test.index)

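# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
test['median_income'] = test['median_income'].fillna(test['median_income'].mean())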

test.drop(['zipcode', 'zip_code'], axis=1, inplace=True)

test.head()

In [None]:
final_input = pd.get_dummies(test.drop(['city'], axis=1), columns=['state'])
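
One thing to watch for: `pd.get_dummies` only creates columns for the categories it actually sees, so the test set can end up with different columns than the training set, and `predict` may then raise an error or misread the features. A possible guard, sketched here on made-up frames (all names and values are assumptions for illustration), is to reindex the test columns to match the training columns.

```python
import pandas as pd

# Made-up frames: the test set has no 'CA' rows and one unseen state ('NV')
toy_train = pd.DataFrame({'livingArea': [1200, 1800, 2400], 'state': ['UT', 'CA', 'UT']})
toy_test = pd.DataFrame({'livingArea': [1500, 2000], 'state': ['UT', 'NV']})

train_input = pd.get_dummies(toy_train, columns=['state'])
test_input = pd.get_dummies(toy_test, columns=['state'])

# Keep exactly the training columns, in the same order:
# missing ones (state_CA) are filled with 0, unseen ones (state_NV) are dropped
aligned_test_input = test_input.reindex(columns=train_input.columns, fill_value=0)
print(aligned_test_input)
```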

In [None]:

test['price'] = final_model.predict(final_input)

In [None]:
test.head()


In [None]:
test['price'].to_csv('2021-01-13-submission.csv', index_label='id')

## Next steps

1. One of the reasons our neural network didn't perform very well is because we didn't _normalize_ our data
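    - Basically neural networks work best when all their inputs are of a similar magnitude, so we scale all our numbers down to be between -1 and 1
    - Note that `Normalizer` rescales each row rather than each column; for per-column scaling, look at `MinMaxScaler` or `StandardScaler` in the same module
    - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer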
2. There are other cool ways of encoding our categorical data
    - You could replace each city with the average house price of that city
3. There are a ton of other cool models out there 
    - Search for regression on https://scikit-learn.org/stable/supervised_learning.html
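
A minimal sketch of the first two ideas, on a made-up table (the `toy_df` name and all values are assumptions for illustration). It uses `StandardScaler`, one of several per-column scalers in scikit-learn, which gives each column zero mean and unit variance rather than a strict -1 to 1 range, and a simple `groupby` for the city encoding.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_df = pd.DataFrame({
    'city': ['Provo', 'Provo', 'Orem', 'Orem', 'Lehi', 'Lehi'],
    'livingArea': [1200, 1800, 1500, 2100, 2400, 3000],
    'price': [250000, 340000, 300000, 390000, 450000, 540000],
})

# Idea 2: replace each city with the average price of homes in that city.
# In practice, compute these averages from training rows only, so that
# held-out prices don't leak into the features.
city_mean_price = toy_df.groupby('city')['price'].mean()
toy_df['cityMeanPrice'] = toy_df['city'].map(city_mean_price)

# Idea 1: scale every input column before it reaches the network.
# Putting the scaler in a pipeline means it is fit on whatever data we call fit() with.
X_demo = toy_df[['livingArea', 'cityMeanPrice']]
y_demo = toy_df['price']
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(4,), max_iter=500, random_state=0),
)
scaled_mlp.fit(X_demo, y_demo)

# Peek at what the network actually sees: each column now has mean ~0
scaled_inputs = scaled_mlp.named_steps['standardscaler'].transform(X_demo)
print(np.round(scaled_inputs.mean(axis=0), 6))
```

Six rows are far too few to train a useful network, and scikit-learn will likely warn that it did not converge; the point is only the shape of the workflow.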