<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     INE Bootcamp
</h1>
<h2 style="color: white; position: absolute; top:36%; left:10%;">
    Data Analysis, Visualization and Predictive Modeling
</h2> 

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:58%; left:10%;">
    <b>David Mertz, Ph.D.</b>
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    <b>Data Scientist</b>
</h3>
</div>

<div style="width: 100%; height: 200px; background-color: #222; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Basic curve fitting as predictive regression
</h1>

<br><br> 
</div>

> A simple approach to predictive modeling is to fit a polynomial to the data. In the simplest case, fitting an order one polynomial is called "linear regression."

Let us start out by importing a variety of capabilities we will use, largely from scikit-learn.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import seaborn as sns
sns.set_theme()

Let us look at some data for housing in King County, Washington (USA).

In [None]:
df = pd.read_csv('data/kc_house_data.csv')
df.head()

For an illustration of polynomials, let us determine what single feature corresponds most strongly with price.

In [None]:
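# numeric_only=True skips string columns such as `date`, which recent Pandas refuses to correlate
df.corr(numeric_only=True).loc['price'].abs().sort_values(ascending=False).head(8)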

Apart from price itself, the feature we carry forward is living area, `sqft_living`. It is a nice starting point for an example.

In [None]:
sqft = df.loc[:,['sqft_living', 'price']].sort_values('sqft_living')
sqft

<h2 style="font-weight: bold;">
    Plotting a feature
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

We can plot the relationship between the "top feature" and the target.  

Note that we could do this in Pandas alone, e.g. `sqft.set_index('sqft_living').plot()`, but in a moment we want to add more to the plot than Pandas plotting handles comfortably.

In [None]:
fig, ax = plt.subplots(figsize=(15, 4.5))
ax.plot(sqft.sqft_living, sqft.price, 
        color='cornflowerblue', linewidth=0.5)
ax.set_xlabel("Square Feet Living Area")
ax.set_ylabel("Price");

Let us consider some polynomials.

In [None]:
# Ground truth first
fig, ax = plt.subplots(figsize=(15, 4.5))
ax.plot(sqft.sqft_living, sqft.price, 
        color='cornflowerblue', 
        linewidth=0.5, alpha=0.5,
        label="ground truth")
#ax.set_xscale('log')
ax.set_xlabel("Square Feet Living Area")
ax.set_ylabel("Price")
ax.set_title("Modeled relationship of living area to price")

for degree in range(6):
    X = PolynomialFeatures(degree).fit_transform(sqft[['sqft_living']])
    model = LinearRegression()
    model.fit(X, sqft.price)
    y_predict = model.predict(X)
    ax.plot(sqft.sqft_living, y_predict, linewidth=2, label=f"degree {degree}")
    
ax.legend(loc='upper left');
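
Each curve above is still a linear regression; only the input columns change. As a minimal sketch of that idea, the following uses NumPy alone on synthetic data (the data and the helper `fit_polynomial` are hypothetical, not part of the housing example): the powers of `x` become columns, and ordinary least squares finds one coefficient per column.

```python
import numpy as np

# Synthetic data (hypothetical, not the housing data): y = 2 + 3x + 0.5x^2
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + 0.5 * x**2

def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via linear regression on powers of x."""
    # Columns of the design matrix are x^0, x^1, ..., x^degree
    design = np.vander(x, degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

coef = fit_polynomial(x, y, degree=2)
print(np.round(coef, 3))  # approximately the generating coefficients 2, 3, 0.5
```

`PolynomialFeatures(degree)` plays the role of the design matrix here, and `LinearRegression` plays the role of the least-squares solve.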

<h2 style="font-weight: bold;">
    High dimensional linear regression
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

When we move to more dimensions by using more features, we usually get more predictive power.

In [None]:
X = df.drop(columns=['id', 'zipcode', 'date',         # Identifiers and dates, not quantities
                     'lat', 'long', 'yr_renovated',   # Lat/long treated as "random"; yr_renovated often zero
                     'sqft_living15', 'sqft_lot15',   # No clear distinction from base features
                     'price'])                        # Price MUST be excluded: it is the target
y = df.price
model = LinearRegression()
model.fit(X, y)
model.score(X, y)
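
For a regressor such as `LinearRegression`, `score` reports the coefficient of determination, usually written R^2. As a rough sketch of what that number means, the hypothetical helper `r_squared` below computes it by hand on made-up values: 1.0 is a perfect fit, and 0.0 is no better than always predicting the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]                # made-up targets with mean 6.0
perfect = r_squared(y_true, y_true)          # predictions equal the targets
baseline = r_squared(y_true, [6.0] * 4)      # always predict the mean
print(perfect, baseline)
```

Note that the score in the cell above is computed on the same rows the model was fitted to.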

In [None]:
X

Including feature combinations (products and powers of the features) is equivalent to fitting a high-dimensional polynomial.  This can often improve prediction further.

In [None]:
X_poly = PolynomialFeatures(3).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
model.score(X_poly, y)
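
One caveat: the score above is measured on the rows used for fitting, and for least squares a training score cannot go down when columns are added. A hedged sketch of a fairer check follows, using NumPy and synthetic data only (the data, the split sizes, and the helpers `design` and `r2` are all assumptions for illustration): fit on one part of the data and score on rows the model never saw.

```python
import numpy as np

# Synthetic stand-in data (hypothetical): a truly linear signal plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1.5 * x + rng.normal(0, 0.3, 60)

# Hold out the last 20 rows; the model never sees them while fitting
train, test = np.arange(0, 40), np.arange(40, 60)

def design(x, degree):
    """Polynomial design matrix with columns x^0 .. x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

scores = {}
for degree in (1, 3, 9):
    coef, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    scores[degree] = (
        r2(y[train], design(x[train], degree) @ coef),  # training score
        r2(y[test], design(x[test], degree) @ coef),    # held-out score
    )

for degree, (train_r2, test_r2) in scores.items():
    print(f"degree {degree}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

With scikit-learn, `train_test_split` from `sklearn.model_selection` is the usual way to make such a split; comparing the two columns of scores shows whether the extra polynomial terms help beyond the training rows.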

In [None]:
X_poly.shape
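
The column count grows quickly with the degree. As a sketch using only the standard library, the hypothetical helpers below enumerate the terms of a polynomial expansion (the constant, each feature, and every product up to the chosen degree) and count them with a binomial coefficient. The figure of 12 input columns is an assumption for illustration; check `X.shape[1]` for the real value.

```python
from itertools import combinations_with_replacement
from math import comb

def n_polynomial_terms(n_features, degree):
    """Count monomials of total degree <= degree, including the constant."""
    return comb(n_features + degree, degree)

def monomials(names, degree):
    """List every monomial of total degree <= degree over the named features."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo) or "1")  # the empty product is the constant
    return terms

terms = monomials(["a", "b"], 2)
print(terms)                      # ['1', 'a', 'b', 'a*a', 'a*b', 'b*b']
print(n_polynomial_terms(12, 3))  # 455 columns for 12 features at degree 3
```

The second number should match the second entry of `X_poly.shape` when `X` has that many columns.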

<div style="width: 100%; height: 200px; background-color: #ef7d22; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Exercises
</h1>

<br><br> 
</div>

<h2 style="font-weight: bold;">
    Best feature evaluation
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Determine the model score for polynomial fits of several degrees using only the top feature (including the linear fit).

In [None]:
# your solution here

<h2 style="font-weight: bold;">
    Best few features
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Determine the model score for polynomials using the top two or three features.

In [None]:
# your solution here

<div style="width: 100%; height: 400px; background-color: #222; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Review and questions
</h1>

<br><br> 
</div>

---
<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz, Ph.D.
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    david.mertz@gmail.com
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz/
</p>

</div>

<br><br><br>