<a href="https://colab.research.google.com/github/sensei-jirving/Online-DS-PT-01.24.22-cohort-notes/blob/main/Week_16/Lecture_01/Demo/Intro_to_Linear_Regression_Coefficients_05_10_22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Linear Regression Coefficients

- UPDATED 05/10/22
- James M. Irving, Ph.D.

## Learning Objectives



- To review how linear regression predicts a continuous value.
- To understand what coefficients are and how they are used to calculate the target.

- Lesson Duration:
    - ~10 mins

# Predicting the Price of a Home Using Linear Regression

<img src="https://github.com/jirvingphd/from-linear-to-logistic-regression-brief-intro/blob/main/images/istock24011682medium_1200xx1697-955-0-88.jpg?raw=1" width=60% alt="Source: https://images.app.goo.gl/oJoMSGU8LGgDjkA76">

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
## Customization Options
pd.set_option('display.float_format',lambda x: f"{x:,.4f}")
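## NOTE: matplotlib 3.6 renamed 'seaborn-talk' to 'seaborn-v0_8-talk'
plt.style.use('seaborn-v0_8-talk' if 'seaborn-v0_8-talk' in plt.style.available else 'seaborn-talk')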
plt.rcParams['figure.facecolor']='white'

In [None]:
## additional required imports
from sklearn.model_selection import train_test_split
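## NOTE: plot_confusion_matrix was removed in scikit-learn 1.2;
## ConfusionMatrixDisplay is its replacement
from sklearn.metrics import (r2_score, ConfusionMatrixDisplay,
                             classification_report)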


In [None]:
## Load in the King County housing dataset and display the head and info
df = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSEZQEzxja7Hmj5tr5nc52QqBvFQdCAGb52e1FRK1PDT2_TQrS6rY_TR9tjZjKaMbCy1m5217sVmI5q/pub?output=csv")
df.info()  ## prints its summary directly and returns None
display(df.head())

In [None]:
## FILTERING EXTREME VALUES FOR DEMONSTRATION PURPOSES
## Each comparison needs its own parentheses because & binds tighter than < and >
df = df.loc[(df['bedrooms'] < 8) & (df['price'] < 2_000_000) & (df['bathrooms'] > 0)]
df

In [None]:
## Visualize the distribution of house prices (using seaborn!)
sns.displot(df['price'],aspect=2);

## ~~🕹Activity:~~ Visualizing Our Features vs Our Target

- We want to determine how features of a home influence its sale price. 
- Specifically, we will be using:
    - `sqft_living`: Square Footage of all Living Areas
    - `bedrooms`: # of Bedrooms
    - `bathrooms`: # of Bathrooms

In [None]:
from matplotlib.ticker import StrMethodFormatter
## Plot a scatter plot of sqft-living vs price
ax = sns.scatterplot(data=df,x='sqft_living',y='price',s=50)
ax.set_title('Relationship Between Square Footage and House Price')

## Formatting Price Axis
price_fmt = StrMethodFormatter("${x:,.0f}")
ax.yaxis.set_major_formatter(price_fmt)
ax.get_figure().set_size_inches(10,6)

- We can see a positive relationship between `sqft_living` and price, but it would be clearer if we plotted the line of best fit along with it.

### Functionizing Our Code

In [None]:
## NOTE: if we had more time, we would write this together.
def plot_feature_vs_target(df,x='sqft_living',y='price',price_format=True):
    """Plots a seaborn regplot of x vs y."""
    ax = sns.regplot(data=df,x=x,y=y,
                line_kws=dict(color='k',ls='--',lw=2),
               scatter_kws=dict(s=50,edgecolor='white',lw=1,alpha=0.8)
                    )
    
    ax.get_figure().set_size_inches(10,6)
    ax.set_title(f'{x} vs {y}')
    ax.get_figure().set_facecolor('white')
    
    if price_format:
        ## Formatting Price Axis
        price_fmt = StrMethodFormatter("${x:,.0f}")
        ax.yaxis.set_major_formatter(price_fmt)
    return ax

In [None]:
## Visualize the relationship between sqft_living and price
ax = plot_feature_vs_target(df,x='sqft_living');


### What Our Trendline Tells Us
- Our trendline summarizes the relationship between our feature and our target.
- It is composed of: <br>
1) a y-intercept (AKA $c$ or $b$ or $\beta_{0}$), the value of y when X=0.<br>
2) a slope (AKA $m$ or $\beta$), the relationship between X and y. When X increases by 1, y changes by $m$.
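
As a minimal sketch of what the intercept and slope mean, the cell below fits a straight line to a few made-up points with `np.polyfit`. The data are hypothetical (they lie exactly on $y = 50 + 3x$) and are not taken from the housing dataset.

```python
import numpy as np

# Hypothetical data that lies exactly on the line y = 50 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 50 + 3 * x

# A degree-1 polynomial fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# The intercept is the value of the line when X=0
y_at_0 = intercept + slope * 0

# Moving X up by 1 (here from 2 to 3) changes y by the slope
change_per_unit = (intercept + slope * 3) - (intercept + slope * 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"y at X=0: {y_at_0:.2f}, change per unit of X: {change_per_unit:.2f}")
```

The trendline drawn by `sns.regplot` is the same kind of line, fit to the real (noisy) data instead of to points that fall perfectly on it.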

In [None]:
## Visualize the relationship between bathrooms and price
plot_feature_vs_target(df,x='bathrooms');

In [None]:
## Visualize the relationship between bedrooms and price
plot_feature_vs_target(df,x='bedrooms')

>- Now, let's create a Linear Regression model with scikit-learn to determine the effect of these 3 features!

## 🕹Activity: Predicting House Price with scikit-learn's `LinearRegression`

In [None]:
## Create our X & y using bedrooms,bathrooms, sqft-living
use_cols = ['bedrooms','bathrooms','sqft_living']
X = df[use_cols].copy()
y = df['price'].copy()

## Train test split (random-state 321, test_size=0.25)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=321)
X_train

In [None]:
## import LinearRegression from sklearn and fit the model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train,y_train)

In [None]:
## Get our model's R-squared value for the train and test data
print(f"Training R-Squared: {linreg.score(X_train,y_train):.3f}")
print(f"Test R-Squared: {linreg.score(X_test,y_test):.3f}")

>- Ok, so what does this tell us?
    - Our model can explain 52% of the variance of house price using just 3 features!
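
To make "variance explained" concrete, here is a rough sketch of how R-squared is computed by hand: one minus the ratio of the unexplained variation (squared errors) to the total variation around the mean. The `y_true` and `y_pred` values are hypothetical and are not results from our housing model.

```python
import numpy as np

# Hypothetical actual and predicted values (NOT from the housing data)
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_pred = np.array([220.0, 280.0, 410.0, 490.0])

# Variation the model failed to explain (sum of squared errors)
ss_res = np.sum((y_true - y_pred) ** 2)

# Total variation of the target around its mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R-squared: the share of the total variation that the model explains
r2_manual = 1 - ss_res / ss_tot

print(f"R-Squared: {r2_manual:.3f}")
```

This is the same quantity that `linreg.score` reports for a `LinearRegression` model.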

### What Coefficients Did Our Model Find? 

In [None]:
## NOTE: with more time, we would code this together. 
def get_coeffs(reg,X_train):
    """Extracts the coefficients from a scikit-learn LinearRegression or LogisticRegression"""
    coeffs = pd.Series(reg.coef_.flatten(),index=X_train.columns)
    
    ## intercept_ is a float for LinearRegression but a 1-element array for
    ## a binary LogisticRegression, so flatten it before taking the value
    coeffs.loc['intercept'] = np.ravel(reg.intercept_)[0]

    return coeffs

- Linear Regression Equation
$$ \large \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n  $$
which we can simplify to the following, where $x_0 = 1$ so that $\beta_0$ acts as the intercept:
$$ \hat y =  \sum_{i=0}^{n} \beta_i x_i  $$
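
The two forms of the equation give the same answer as long as a constant $x_0 = 1$ is paired with $\beta_0$. The sketch below checks this with made-up numbers; `betas` and `features` are hypothetical values chosen for easy arithmetic, not coefficients from our model.

```python
import numpy as np

# Hypothetical coefficients: beta_0 (intercept), beta_1, beta_2
betas = np.array([10.0, 2.0, -1.0])

# One observation with features x_1 = 3 and x_2 = 4
features = np.array([3.0, 4.0])

# Long form: beta_0 + beta_1*x_1 + beta_2*x_2
y_hat_long = betas[0] + betas[1] * features[0] + betas[2] * features[1]

# Summation form: prepend x_0 = 1 so the intercept joins the sum
x_with_const = np.concatenate(([1.0], features))
y_hat_sum = np.dot(betas, x_with_const)

print(y_hat_long, y_hat_sum)
```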

In [None]:
## Get the coefficients from the model using our new function
coeffs = get_coeffs(linreg,X_train)
coeffs

>- **Each coefficient tells us the effect of increasing the values in that column by 1 unit, holding the other features constant.** 
>- According to our model, we can determine a home's price using the following results:
    - The model's default/starting house price is \$130,191.2155 (the intercept).
    - For each additional bedroom, subtract \$41,206.78.
    - For each additional bathroom, add \$13,537.01.
    - For each additional square foot of living space, add \$243.11.
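
As a quick sanity check on that interpretation, the sketch below plugs the rounded coefficients quoted above into the regression equation for a hypothetical house (3 bedrooms, 2 bathrooms, 2,000 sqft), then adds one bedroom to see how the prediction moves. The helper `predict_price` and the example house are illustrative assumptions, not part of the fitted model.

```python
# Coefficients as quoted in the text above (rounded)
coeffs_text = {
    'intercept': 130_191.2155,
    'bedrooms': -41_206.78,
    'bathrooms': 13_537.01,
    'sqft_living': 243.11,
}

def predict_price(bedrooms, bathrooms, sqft_living, c=coeffs_text):
    """Applies the linear regression equation by hand (hypothetical helper)."""
    return (c['intercept']
            + c['bedrooms'] * bedrooms
            + c['bathrooms'] * bathrooms
            + c['sqft_living'] * sqft_living)

# Hypothetical house: 3 bedrooms, 2 bathrooms, 2,000 sqft of living space
base = predict_price(3, 2, 2000)

# The same house with one more bedroom and everything else unchanged
plus_bedroom = predict_price(4, 2, 2000)

print(f"Base prediction: ${base:,.2f}")
print(f"Change from one extra bedroom: ${plus_bedroom - base:,.2f}")
```

The change in the prediction equals the bedroom coefficient because nothing else about the house was altered.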

In [None]:
## Let's select an example house and see how we calculate price
i = 300
house = X_test.iloc[i]
house

In [None]:
## Calculate the home's predicted price using our coefficients
price = house['bedrooms']*coeffs['bedrooms'] + \
        house['bathrooms']*coeffs['bathrooms'] + \
        house['sqft_living']*coeffs['sqft_living'] + coeffs['intercept']

print(f"${price:,.2f}")

In [None]:
## What would our model predict for our test house?
## Passing a 1-row DataFrame keeps the feature names the model was fit with
linreg.predict(X_test.iloc[[i]])

In [None]:
## Compare against the actual sale price of our test house
y_test.iloc[i]

## Linear Regression Summary
- Linear regression allowed us to predict the exact dollar price of a given home.
- It summarizes the relationship of each feature using coefficients, which are used to calculate the target. 

>-  But what do we do when we want to predict what group a house belongs to instead of an exact price?