![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Basic curve fitting as predictive regression

A simple approach to predictive modeling is to fit a polynomial to the data. In the simplest useful case, fitting a polynomial of degree one is ordinary "linear regression."
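As a minimal sketch of that claim, the snippet below fits a degree-one polynomial with `numpy.polyfit` and a `LinearRegression` to the same points, then compares the coefficients. The data is synthetic and made up purely for illustration; it is not the housing data used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, illustrative data: a noisy line y = 3x + 2
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_line = 3 * x + 2 + rng.normal(scale=0.5, size=x.size)

# Degree-one polynomial fit; polyfit returns the highest power first
slope_poly, intercept_poly = np.polyfit(x, y_line, deg=1)

# Ordinary linear regression on the same points
linreg = LinearRegression().fit(x.reshape(-1, 1), y_line)
slope_lr, intercept_lr = linreg.coef_[0], linreg.intercept_

# Both minimize squared error, so the coefficients agree up to rounding
print(np.isclose(slope_poly, slope_lr), np.isclose(intercept_poly, intercept_lr))
```

Both calls solve the same least-squares problem, which is why the slope and intercept match.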

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

Let us start out by importing a variety of capabilities we will use, largely from scikit-learn.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import seaborn as sns
sns.set_theme()

Let us look at some data for housing in King County, Washington (USA).

In [None]:
df = pd.read_csv('data/kc_house_data.csv')
df.head()

For an illustration of polynomials, let us determine which single feature correlates most strongly with price.

In [None]:
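# numeric_only=True skips string columns such as `date`, which recent Pandas versions refuse to correlate
df.corr(numeric_only=True).loc['price'].abs().sort_values(ascending=False).head(8)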

Apart from `price` itself, `sqft_living` is the feature we will use. It is a nice starting point for an example.

In [None]:
sqft = df.loc[:,['sqft_living', 'price']].sort_values('sqft_living')
sqft

## Plotting a feature

We can plot the relationship between the "top feature," `sqft_living`, and the target, `price`.

Note that we could do this directly in Pandas, e.g. `sqft.set_index('sqft_living').plot()`, but in a moment we want to add more to the plot than the Pandas plotting interface conveniently supports.

In [None]:
fig, ax = plt.subplots(figsize=(15, 4.5))
ax.plot(sqft.sqft_living, sqft.price, 
        color='cornflowerblue', linewidth=0.5)
ax.set_xlabel("Square Feet Living Area")
ax.set_ylabel("Price");

Let us overlay polynomial fits of degree 0 through 5 on the data.

In [None]:
# Ground truth first
fig, ax = plt.subplots(figsize=(15, 4.5))
ax.plot(sqft.sqft_living, sqft.price, 
         color='cornflowerblue', 
         linewidth=0.5, alpha=0.5,
         label="ground truth")
#ax.set_xscale('log')
ax.set_xlabel("Square Feet Living Area")
ax.set_ylabel("Price")
ax.set_title("Modeled relationship of living area to price")

for degree in range(6):
    X = PolynomialFeatures(degree).fit_transform(sqft[['sqft_living']])
    model = LinearRegression()
    model.fit(X, sqft.price)
    y_predict = model.predict(X)
    ax.plot(sqft.sqft_living, y_predict, linewidth=2, label=f"degree {degree}")
    
ax.legend(loc='upper left');

## High dimensional linear regression

When we move to more dimensions by using more features, we usually get more predictive power. The score reported below is the R² of the fit, computed on the same data the model was trained on.

In [None]:
X = df.drop(columns=['id', 'zipcode', 'date',         # Identifier/category/date, not numeric quantities
                     'lat', 'long', 'yr_renovated',   # Lat/lon "random"; yr_renovated often zero
                     'sqft_living15', 'sqft_lot15',   # Not clear distinction from base features
                     'price'])                        # Price HAS TO be excluded as target
y = df.price
model = LinearRegression()
model.fit(X, y)
model.score(X, y)
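To see the effect of adding a feature in isolation, here is a small sketch on synthetic data. The names `X_demo` and `y_demo` and the coefficients 4 and 3 are hypothetical, chosen only for illustration. The target depends on two features, and we compare the score of a fit that sees one of them against a fit that sees both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): the target depends on two features plus noise
rng = np.random.default_rng(42)
n = 500
X_demo = rng.normal(size=(n, 2))
y_demo = 4 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=1.0, size=n)

# Fit using only the first feature, then using both
score_one = LinearRegression().fit(X_demo[:, [0]], y_demo).score(X_demo[:, [0]], y_demo)
score_two = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)

print(f"one feature: {score_one:.3f}, two features: {score_two:.3f}")
```

On training data, adding a feature to an ordinary least-squares fit can never lower the score. Whether it helps on unseen data is a separate question.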

Including feature combinations (products and powers of the original features) is equivalent to fitting a high-dimensional polynomial. This can often improve the fit further, at least as measured on the training data.

In [None]:
X_poly = PolynomialFeatures(3).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
model.score(X_poly, y)

## Exercise

* Determine the model score for polynomial fits using only the top feature (including the linear fit).
* Determine the model score for polynomials using the top two or three features.