### Multiple Linear Regression
work with more than two explanatory variables
up to this point, the models had one numeric and one categorical explanatory variable
what changes if you have 2 numeric explanatory variables instead?

#### Two Numeric Explanatory Variables
two numeric explanatory variables plus a numeric response variable gives three numeric variables to plot

scatter plots are designed to show relationships between two numeric variables, so there are two options:
    draw a 3D scatter plot (beyond the scope of this course, and hard to interpret because of perspective issues)
    draw a 2D scatter plot and use color for the response variable


In [None]:
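# imports used throughout this section (the fish DataFrame is assumed to be loaded already)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

# a 2D scatter plot with the hue argument set to mass: plot the two numeric explanatory variables on the x and y axes
# and color the points according to the response variable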
sns.scatterplot(x="length_cm",
                y="height_cm",
                data=fish,
                hue="mass_g")
# the colors get darker as you move up and to the right, which represents heavier fish

# modeling with an extra explanatory variable
mdl_mass_vs_both = ols("mass_g ~ length_cm + height_cm", data=fish).fit()
print(mdl_mass_vs_both.params)
# you'll get a global intercept coefficient and one slope coefficient for each explanatory variable

# prediction flow
# create a DataFrame of explanatory values with product
from itertools import product

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p, columns=["length_cm", "height_cm"])

# then add a column of predictions with assign and predict
prediction_data = explanatory_data.assign(mass_g = mdl_mass_vs_both.predict(explanatory_data))
print(prediction_data)

# plotting: create one scatter plot with the actual data points and another one with the prediction data points
# to avoid duplication, the legend in one of the scatter plot calls can be removed
# the prediction data point markers have been changed to squares for clarity
sns.scatterplot(x="length_cm",
                y="height_cm",
                data=fish,
                hue="mass_g")

sns.scatterplot(x="length_cm",
                y="height_cm",
                data=prediction_data,
                hue="mass_g",
                legend=False,
                marker="s")
# the color of the grid will give a nice overview of how the response variable changes over the plane of explanatory variables

# to include an interaction, replace the plus with a times in the formula
# (a new name keeps the no-interaction model available for comparison)
mdl_mass_vs_both_inter = ols("mass_g ~ length_cm * height_cm", data=fish).fit()
print(mdl_mass_vs_both_inter.params)
# this will give you one extra slope term for the interaction of the two explanatory variables

# the prediction flow is the same as above, just call predict() on mdl_mass_vs_both_inter

# the plotting code is the same as above too :)
# this time you will see that the square marker points closely match the color of the circular data points and that's a great
# visual indicator that the model is a good fit

#### More Than Two Explanatory Variables

regression models aren't limited to two explanatory variables
we'll look at 3 explanatory variables and what happens when you increase that number even further
the fish scatter plot showed some grouping of data points, maybe those are based on species
you can check that by faceting on species

modeling code scales nicely with more variables, but the size of the prediction dataset increases rapidly to account for all the possible combinations
visualizing these predictions isn't as useful now because it reaches the limit of visual interpretation, so we stick with predicting the response variable instead
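
To get a feel for how quickly the prediction grid grows, here is a minimal sketch using only `itertools.product`. The value ranges mirror the earlier prediction flow, and the four species names are an assumption for illustration, not values taken from these notes.

```python
from itertools import product

# Grids of explanatory values (same ranges as the earlier prediction flow)
length_cm = range(5, 61, 5)   # 5, 10, ..., 60 -> 12 values
height_cm = range(2, 21, 2)   # 2, 4, ..., 20 -> 10 values
# Assumed species names, for illustration only
species = ["Bream", "Perch", "Pike", "Roach"]

# Every combination of two explanatory variables
grid_two = list(product(length_cm, height_cm))
# Adding a third variable multiplies the row count by its number of values
grid_three = list(product(length_cm, height_cm, species))

print(len(grid_two))    # 12 * 10 = 120 rows
print(len(grid_three))  # 12 * 10 * 4 = 480 rows
```

Each extra variable multiplies the number of rows by the number of values it takes, which is why the grid gets large so fast.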

In addition to using x and y scales for two numeric variables, you can use color for a third numeric variable, and you can use faceting for categorical variables. That's about all you can really do without using something like a heatmap and that has less info about each variable 

when using 3 explanatory variables, you have a few options for specifying interactions:
    --no interactions
    --2-way interactions, gives you model coefficients for each pair of variables
    --all the interactions, gives you the three 2-way interactions plus the 3-way interaction between all three explanatory variables
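
The three options above can be sketched by listing the formula terms each one implies. This is a plain Python illustration, not a statsmodels call, and the helper name `interaction_terms` is hypothetical.

```python
from itertools import combinations

variables = ["length_cm", "height_cm", "species"]

def interaction_terms(names, max_order):
    """Hypothetical helper: main effects plus interactions up to max_order."""
    terms = []
    for order in range(1, max_order + 1):
        # combinations() yields every group of `order` distinct variables
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms

no_inter = interaction_terms(variables, 1)   # main effects only
two_way = interaction_terms(variables, 2)    # plus every pair
three_way = interaction_terms(variables, 3)  # plus the single triple

print(" + ".join(no_inter))
print(" + ".join(two_way))
print(" + ".join(three_way))
```

With three variables there are three pairs and one triple, so the term count goes from 3 to 6 to 7.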

In [None]:
# give each species its own panel with the FacetGrid function from seaborn

# first, prepare the grid by specifying the layout: col lets you split by species, col_wrap makes it a 2x2 grid, and palette
# is optional and can be used to improve the coloring
grid = sns.FacetGrid(data=fish, 
                     col="species",
                     hue="mass_g",
                     col_wrap=2,
                     palette="plasma")

# second, map what visualization you want to plot on the grid
# in this example, specify a scatter plot with length and height on the x and y axis respectively 
grid.map(sns.scatterplot, "length_cm", "height_cm")
plt.show()
# you'll see the 4 grids, in the example, there's a noticeable strong positive correlation between length and height for each species of fish
# the relationship between the explanatory variables and the response is harder to quantify because you can't determine colors as accurately as x and y coordinates
# in this case, you can see that as fish get longer and taller, they also get heavier  
                     
# it can be tricky to include more than 3 numeric variables in a scatter plot
# you can include as many categorical variables as you like when using faceting, but more facets make it harder to see the overall picture
                     
# modeling doesn't really get harder as you increase the number of explanatory variables
# model with no interaction
ols("mass_g ~ length_cm + height_cm + species + 0", data=fish).fit()
                     
# two-way (or pairwise) interaction between pairs of variables (one term for each of the three pairs)
ols("mass_g ~ length_cm + height_cm + species + length_cm:height_cm + length_cm:species + height_cm:species + 0", data=fish).fit()
                     
# three-way interaction between all the explanatory variables
# (the formula string is split across lines with implicit string concatenation)
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0", data=fish).fit()
    
# this stuff gets crazy to write so there's shortcuts, swap the plus for times, the above is the same as:
ols("mass_g ~ length_cm * height_cm * species + 0", data=fish).fit()
    
# to get only two-way interaction in the model but not the 3-way
ols("mass_g ~ (length_cm + height_cm + species) ** 2 + 0", data=fish).fit()
    
# prediction flow is the same as normal

#### How Linear Regression Works

to keep it simple, this will be simple linear regression with a single numeric explanatory variable

residuals are the actual response minus the predicted response, they'll show as red lines on the scatter plot
for the best fit, you want those red lines to be as short as possible, we want a metric that measures the size of all those residuals and we want to make that as small as possible

adding up all the residuals would be simple but that doesn't work because some of the residuals are negative so those ones would make the total smaller instead of larger

instead of that, we'll do the next easiest thing which is to square each residual so they're non-negative and then add that up, this is called the sum of squares
the tricky part of this is figuring out which intercept and slope coefficients will result in the smallest sum of squares
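
To make the two metrics concrete, here is a small sketch on made-up data (the numbers are invented for illustration). Both candidate lines have residuals that sum to roughly zero, so the plain sum can't tell them apart, while the sum of squares can.

```python
# Made-up data for illustration only
x_actual = [0, 1, 2, 3, 4]
y_actual = [1, 3, 2, 5, 4]

def residuals(intercept, slope):
    """Actual response minus predicted response for each point."""
    return [y - (intercept + slope * x) for x, y in zip(x_actual, y_actual)]

# Two candidate lines: a flat line at the mean of y, and a sloped line
flat = residuals(3.0, 0.0)
sloped = residuals(1.4, 0.8)

# Plain sums: positive and negative residuals cancel out for both lines
sum_flat = sum(flat)
sum_sloped = sum(sloped)

# Sums of squares: every residual contributes a non-negative amount
ss_flat = sum(r ** 2 for r in flat)
ss_sloped = sum(r ** 2 for r in sloped)

print(sum_flat, sum_sloped)  # both approximately 0
print(ss_flat, ss_sloped)    # approximately 10 and 3.6
```

The sloped line has the smaller sum of squares, so by this metric it is the better fit.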

to solve this problem, we'll use numerical optimization, it means finding the minimum point of a function
we could use calculus for this but we'll let Python find the minimum for us using the optimize package from scipy

Linear regression minimizes the sum of the squares of the differences between the actual responses and the predicted responses.
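
For simple linear regression, the calculus route mentioned above has a well-known closed form: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below applies it to made-up data (invented for illustration) and checks that nudging the coefficients in any direction makes the sum of squares larger. It is a sanity check on the idea, not how `ols()` is implemented.

```python
# Made-up data for illustration only
x_actual = [0, 1, 2, 3, 4]
y_actual = [1, 3, 2, 5, 4]

n = len(x_actual)
x_mean = sum(x_actual) / n
y_mean = sum(y_actual) / n

# Closed-form least squares coefficients for one explanatory variable
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_actual, y_actual))
s_xx = sum((x - x_mean) ** 2 for x in x_actual)
best_slope = s_xy / s_xx
best_intercept = y_mean - best_slope * x_mean

def sum_of_squares(b0, b1):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(x_actual, y_actual))

best = sum_of_squares(best_intercept, best_slope)

# Nudge both coefficients slightly in each direction
nudged = [sum_of_squares(best_intercept + d0, best_slope + d1)
          for d0 in (-0.1, 0.1) for d1 in (-0.1, 0.1)]

print(best_intercept, best_slope)     # approximately 1.4 and 0.8
print(all(v > best for v in nudged))  # True: every nudge is worse
```

Numerical optimization reaches the same coefficients by searching for them instead of solving the equations directly.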

In [None]:
from scipy.optimize import minimize

# define a function to calculate the sum of squares metric
# accepts the intercept and slope, returns the sum of the squares of residuals
# have to use the trick of giving the function a single coeffs argument then extracting the individual intercept and slope
def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    # more calculations
# call minimize() to find coefficients that minimize this function
# pass your sum of squares function and an initial guess with one value per coefficient
minimize(fun=calc_sum_of_squares, x0=[0, 0])

# example in video
# def calc_quadratic(x):
#     y = x ** 2 - x + 10
#     return y
# minimize(fun=calc_quadratic, x0=3)


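# exercise code
# x_actual and y_actual are assumed to be the columns used in the ols() call at the end
x_actual = taiwan_real_estate["n_convenience"]
y_actual = taiwan_real_estate["price_twd_msq"]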
def calc_sum_of_squares(coeffs):
    # Unpack coeffs
    intercept, slope = coeffs
    # Calculate predicted y-values
    y_pred = intercept + slope * x_actual
    # Calculate differences between y_actual and y_pred
    y_diff = y_pred - y_actual
    # Calculate sum of squares
    sum_sq = np.sum(y_diff ** 2)
    # Return sum of squares
    return sum_sq

# Call minimize on calc_sum_of_squares  
print(minimize(fun=calc_sum_of_squares,
               x0=[0, 0]))

# Compare the output with the ols() call.
print(ols("price_twd_msq ~ n_convenience", data=taiwan_real_estate).fit().params)