### Parallel Slopes Linear Regression
parallel slopes regression is a linear regression with one numeric and one categorical explanatory variable; it is the first step toward conquering multiple linear regression

multiple regression is a regression model with more than one explanatory variable
including more explanatory variables can give more insight into the relationship between the explanatory variables and the response and can also provide more accurate predictions

#### ONE EXPLANATORY VARIABLE AT A TIME -- numeric

In [None]:
# linear regression is run by using ols, passing a formula and a dataframe
# the formula has the response variable on the left and the explanatory variable on the right of the tilde 
# the model is then fit using .fit()

from statsmodels.formula.api import ols

mdl_mass_vs_length = ols("mass_g ~ length_cm",
                          data=fish).fit()

# print the model params
print(mdl_mass_vs_length.params)

# this will give you 1 intercept coefficient and 1 slope coefficient

#### ONE EXPLANATORY VARIABLE AT A TIME -- categorical

In [None]:
# now change the explanatory variable to a categorical one, such as "species"
# the coefficients are a little easier to understand if you add "+ 0" to the formula, which tells statsmodels not to include an intercept in the model
# you will instead get one coefficient for each category (one for each species of fish)

from statsmodels.formula.api import ols

mdl_mass_vs_species = ols("mass_g ~ species + 0",
                          data=fish).fit()
print(mdl_mass_vs_species.params)
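
To see why the "+ 0" coefficients are easy to read, here is a minimal sketch using plain NumPy least squares instead of statsmodels. The species labels and masses are made up for illustration (they are not the fish data). With one 0/1 indicator column per category and no intercept column, the fitted coefficients should equal the per-category means.

```python
import numpy as np

# hypothetical toy data: three species, two fish each
species = np.array(["Bream", "Bream", "Perch", "Perch", "Pike", "Pike"])
mass_g = np.array([600.0, 700.0, 100.0, 300.0, 400.0, 1000.0])

# one 0/1 indicator column per category, no intercept column
categories = np.unique(species)
X = (species[:, None] == categories[None, :]).astype(float)

# ordinary least squares fit
coefs, *_ = np.linalg.lstsq(X, mass_g, rcond=None)

# per-category means for comparison
means = np.array([mass_g[species == c].mean() for c in categories])

for category, coef, mean in zip(categories, coefs, means):
    print(category, round(float(coef), 2), round(float(mean), 2))
```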

#### BOTH EXPLANATORY VARIABLES AT THE SAME TIME

In [None]:
# to include both, combine them on the right side of the formula
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0",
                        data=fish).fit()

# this will give you one slope coefficient and an intercept coefficient for each category in the categorical variable
# if you compare these coefficients with those from the two "one variable at a time" models, you'll see that they are different
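
As a rough sketch of what this formula fits, the design matrix can be built by hand: one indicator column per species plus a single shared length column. The data below is hypothetical, generated from known intercepts and a known slope so that the least squares fit can be checked against them. It uses plain NumPy, not statsmodels.

```python
import numpy as np

# hypothetical toy data generated from known values:
# a shared slope of 20 g/cm and a different intercept per species
true_intercepts = {"Bream": -100.0, "Perch": -300.0}
true_slope = 20.0

species = np.array(["Bream", "Bream", "Bream", "Perch", "Perch", "Perch"])
length_cm = np.array([20.0, 30.0, 40.0, 20.0, 30.0, 40.0])
mass_g = np.array([true_intercepts[s] for s in species]) + true_slope * length_cm

# design matrix: one indicator column per species, then the numeric column
categories = np.unique(species)
indicators = (species[:, None] == categories[None, :]).astype(float)
X = np.column_stack([indicators, length_cm])

coefs, *_ = np.linalg.lstsq(X, mass_g, rcond=None)
*intercepts, slope = coefs

print(intercepts)  # one intercept per species
print(slope)       # a single slope shared by all species
```

Because every species shares the one length column, the model can only shift the line up or down per species, never tilt it.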

#### VISUALIZATION -- 1 numeric explanatory variable

In [None]:
# the standard visualization for a linear regression with a numeric explanatory variable
# a single numeric explanatory variable's prediction will form a single straight line

import matplotlib.pyplot as plt
import seaborn as sns

sns.regplot(x="length_cm",
            y="mass_g",
            data=fish,
            ci=None)

plt.show()

#### VISUALIZATION -- 1 categorical explanatory variable

In [None]:
# there are a few options for plots of this but the simplest one is a boxplot for each category
# the model coefficients are the means of each category; showmeans=True adds those means to the plot
# a single categorical explanatory variable's predictions are the means of each category

sns.boxplot(x="species",
            y="mass_g",
            data=fish,
            showmeans=True)

#### VISUALIZATION -- both explanatory variables

In [None]:
# seaborn doesn't have an easy way to plot the results of a model with both explanatory variables, so draw the trend lines manually

# first extract the model coefficients into separate intercepts and the slope
coeffs = mdl_mass_vs_both.params
print(coeffs)

ic_bream, ic_perch, ic_pike, ic_roach, sl = coeffs

# now draw a standard scatter plot with one additional argument: hue

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)
# hue colors the points by the values of a categorical or continuous variable

# use plt.axline for each category; xy1 is a point on the line (here the intercept at x=0)
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

# since all the lines use the same slope, the trend lines are parallel to each other, hence the nickname "parallel slopes regression"

#### INTERPRETING PARALLEL SLOPES COEFFICIENTS

"parallel slopes" cases will give you a numeric and a categorical explanatory variable, what do the coefficients mean?

for the numerical variable, for each additional nearby convenience store, the expected house price increases by 0.79 dollars per square meter

for the categorical variable, each category's coefficient is an intercept: for a house in that age group (0-15 years, for example) with 0 nearby convenience stores, the expected house price is 9.41 dollars per square meter

the model has one slope coefficient and as many intercept coefficients as there are categories (one for each category of the categorical explanatory variable)
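
As a small worked example of how the two kinds of coefficients combine, the sketch below plugs the values quoted above into intercept + slope * x. The choice of 3 convenience stores is a hypothetical input picked for illustration.

```python
# coefficients quoted in the text above
intercept_0_to_15 = 9.41    # expected price for a 0-15 year old house with 0 stores
slope_n_convenience = 0.79  # change in expected price per extra convenience store

# hypothetical input: a 0-15 year old house with 3 nearby convenience stores
n_convenience = 3

expected_price = intercept_0_to_15 + slope_n_convenience * n_convenience
print(round(expected_price, 2))  # 11.78
```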

#### INTERPRETING VISUALIZATIONS

a single numeric explanatory variable's prediction will form a single straight line
a single categorical explanatory variable's predictions are the means of each category

The two plots give very different predictions: one gives a predicted response that increases linearly with a numeric variable; the other gives a fixed response for each category. The only sensible way to reconcile these two conflicting predictions is to incorporate both explanatory variables in the model at once.

When it comes to a linear regression model with a numeric and a categorical explanatory variable, seaborn doesn't have an easy, "out of the box" way to show the predictions so you have to draw the trend lines manually.

When looking at the plot, the parallel slopes let you see that newer houses are on average more expensive than houses older than 15 years. 

### PREDICTING PARALLEL SLOPES



In [None]:
# the prediction workflow starts with choosing values for the explanatory variables and storing them in a pandas dataframe
# the end of the arange is non-inclusive

import pandas as pd
import numpy as np

expl_data_length = pd.DataFrame({"length_cm": np.arange(5, 61, 5)})
print(expl_data_length)

In [None]:
# for multiple explanatory variables you need to define multiple columns in your explanatory dataframe
# for example, if you want all the combinations of "a", "b", "c" with 1, 2 then you can use the product function

from itertools import product
list(product(["a", "b", "c"], [1, 2]))
# product() returns an iterator, so wrap it in list() to see the output: all combinations of its inputs
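
A quick way to check how product() orders its output, and how many rows to expect, is to run it on plain Python lists. The sketch below assumes the four species names used for the intercepts earlier and lengths from 5 to 60 in steps of 5. The last input varies fastest, and the number of combinations is the product of the input lengths.

```python
from itertools import product

lengths = list(range(5, 61, 5))  # 5, 10, ..., 60 (the end is non-inclusive)
species = ["Bream", "Perch", "Pike", "Roach"]

combos = list(product(lengths, species))

print(len(lengths), len(species), len(combos))  # 12 4 48
print(combos[:5])
# [(5, 'Bream'), (5, 'Perch'), (5, 'Pike'), (5, 'Roach'), (10, 'Bream')]
```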

In [None]:
# first create the explanatory variable lists
length_cm = np.arange(5, 61, 5)
# for a categorical variable, use pandas' unique() method 
species = fish["species"].unique()

# create a combination of all the elements of these input lists
p = product(length_cm, species)

# next, turn the product into a pandas df and name the columns
expl_data_both = pd.DataFrame(p, columns=['length_cm', 'species'])
# the output will be 5cm of each fish species, 10cm of each species, etc. 

# add predictions to the df
# predict mass_g from length only
prediction_data_length = expl_data_length.assign(mass_g = mdl_mass_vs_length.predict(expl_data_length))

# for two or more explanatory variables, the code is the same except for the variable naming
prediction_data_both = expl_data_both.assign(mass_g = mdl_mass_vs_both.predict(expl_data_both))

In [None]:
# you can visualize the predictions from the model by adding another scatterplot and setting the data argument to prediction_data_both
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)

sns.scatterplot(x="length_cm",
                y="mass_g",
                color="black",
                data=prediction_data_both)
# black is used to distinguish between predictions and actual data points
# the black prediction points will lie on the trend lines

In [None]:
# instead of using a bunch of if/else statements, use numpy's select() function to get values based on conditions
# np.select() takes a list of conditions and a list of choices; both lists need to be the same length

# the conditions list contains one condition per species; each is a boolean Series (True/False per row)
conditions = [
    expl_data_both["species"] == "Bream",
    expl_data_both["species"] == "Perch",
    expl_data_both["species"] == "Pike",
    expl_data_both["species"] == "Roach"
]

# the choices list is the collection of intercepts that were extracted from the model coefficients
choices = [ic_bream, ic_perch, ic_pike, ic_roach]

# np.select() will then retrieve the corresponding intercept for each of the fish species
intercept = np.select(conditions, choices)

# the final prediction step is to calculate the response: intercept + slope * length
prediction_data = expl_data_both.assign(intercept = intercept,
                                        mass_g = intercept + sl * expl_data_both["length_cm"])
print(prediction_data)

# some of the predicted masses are negative, which isn't a good sign
# the explanation is that this model performs poorly for small fish lengths
# you can check that the manual calculation is right by calling predict
mdl_mass_vs_both.predict(expl_data_both)
# the predictions should match the calculated mass_g column
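
For a feel of how np.select() behaves on its own, here is a minimal sketch with made-up labels and intercept values (none of these numbers come from the fitted model). Each element gets the choice belonging to the first condition that is True, and elements that match no condition get the default value.

```python
import numpy as np

# hypothetical species labels and intercepts, for illustration only
species = np.array(["Bream", "Pike", "Perch", "Roach"])
conditions = [species == "Bream", species == "Perch", species == "Pike"]
choices = [-100.0, -300.0, -500.0]

# "Roach" matches no condition, so it receives the default;
# the default is 0 unless specified, NaN makes missing cases easier to spot
intercepts = np.select(conditions, choices, default=np.nan)
print(intercepts)
```

Setting the default to NaN is a design choice for this sketch: a silent 0 intercept would still produce plausible-looking predictions for a species the conditions forgot to cover.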

### ASSESSING MODEL PERFORMANCE

the major benefit of using more than one explanatory variable in a model is that you can sometimes get a better fit than when only using a single explanatory variable
in the last course, two metrics used for measuring model performance were the coefficient of determination (the r-squared value, which measures how well the regression's prediction line fits the observed values; larger is better) and the residual standard error (RSE, the typical size of the residuals; smaller is better)
will these metrics improve when both explanatory variables are included in the model?

something to be aware of is overfitting: too many explanatory variables can cause the model to fit this data set very closely while no longer reflecting the general population. you would know the model was overfit if it performed well on this fish dataset but performed poorly on a different fish dataset

another metric, the adjusted coefficient of determination, includes a small penalty term for each additional explanatory variable to compensate for the effect of overfitting, so it's a better metric than the plain coefficient of determination when comparing models with different numbers of explanatory variables
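
To make the penalty concrete, here is a minimal sketch of both metrics using the standard textbook definitions, with `n_explanatory` counting explanatory terms and not the intercept. The function names and the observed/predicted values are hypothetical, and statsmodels may handle edge cases (such as models without an intercept) differently.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_explanatory):
    """Penalize R-squared for each explanatory term (intercept not counted)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_explanatory - 1)


# hypothetical observed and predicted masses
observed = [100.0, 200.0, 300.0, 400.0, 500.0]
predicted = [110.0, 190.0, 310.0, 390.0, 500.0]

r2 = r_squared(observed, predicted)
adj_one = adjusted_r_squared(r2, n_obs=5, n_explanatory=1)
adj_two = adjusted_r_squared(r2, n_obs=5, n_explanatory=2)

print(r2)
print(adj_one)
print(adj_two)  # same fit, more explanatory terms, bigger penalty
```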

In [None]:
# to get the coefficient of determination, use rsquared attribute of the fitted model
print(mdl_mass_vs_both.rsquared)
# 0 is the worst possible fit and 1 is a perfect fit
# you could calculate this for all three models and then see which one is the best (closest to 1)

In [None]:
# adjusted coefficient of determination
print(mdl_mass_vs_both.rsquared_adj)

In [None]:
# residual standard error
rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
# RSE isn't directly available as an attribute, but the mean squared error of the residuals (mse_resid) is, so take its square root
# the lowest value is the best: RSE is the typical amount by which the predicted mass is wrong (about 100 grams, for example)
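
For reference, the same quantity can be sketched by hand: the sum of squared residuals divided by the residual degrees of freedom (observations minus fitted coefficients), then square-rooted. The helper name and the numbers below are hypothetical.

```python
import math


def residual_standard_error(observed, predicted, n_coefficients):
    """sqrt(sum of squared residuals / residual degrees of freedom)."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    df_resid = len(observed) - n_coefficients
    return math.sqrt(ss_res / df_resid)


# hypothetical observed and predicted masses
observed = [100.0, 200.0, 300.0, 400.0, 500.0]
predicted = [110.0, 190.0, 310.0, 390.0, 500.0]

# assuming a model with 2 coefficients (one intercept, one slope)
rse = residual_standard_error(observed, predicted, n_coefficients=2)
print(rse)
```

Dividing by the degrees of freedom (not the number of observations) means a model with more coefficients needs a genuinely smaller residual sum of squares to show a lower RSE.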