explore the effects of interactions between explanatory variables; accounting for them allows for more realistic models that can have better predictive power

#### Models for Each Category

the parallel slopes model uses a common slope for every category, but that's not always the best option
one way to give each species of fish a different slope is to run a separate model for each of them
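
a minimal sketch of the same idea without copying the code once per species: loop over the species and fit one line per subset. it uses np.polyfit as a stand-in for ols, and the species names and numbers are made up (not the real fish data) so the true slopes are known in advance

```python
import numpy as np

# made-up straight lines per species: (intercept, slope); NOT real fish values
true_lines = {"Bream": (-1000.0, 55.0), "Pike": (-1500.0, 50.0)}
lengths = np.array([20.0, 30.0, 40.0, 50.0])

# build (species, length_cm, mass_g) rows that sit exactly on those lines
rows = [(name, x, ic + sl * x)
        for name, (ic, sl) in true_lines.items()
        for x in lengths]

# fit one straight line per species by filtering the rows for that species
fits = {}
for name in true_lines:
    x = np.array([r[1] for r in rows if r[0] == name])
    y = np.array([r[2] for r in rows if r[0] == name])
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients come highest power first
    fits[name] = (intercept, slope)

for name, (intercept, slope) in fits.items():
    print(f"{name}: intercept={intercept:.1f}, slope={slope:.1f}")
```

with the real fish dataframe the same loop could probably iterate over fish.groupby("species") and call ols on each group, which keeps one code path no matter how many species there are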


In [None]:
# each species with a different slope
# split up the dataset into the subsets by filtering for each species and assigning the result to its own variable
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]

# now that all the subsets exist, run 4 models, each predicting mass from length for one species
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_bream.params)

mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
print(mdl_perch.params)

mdl_pike = ols("mass_g ~ length_cm", data=pike).fit()
print(mdl_pike.params)

mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
print(mdl_roach.params)

# to make predictions with these models, first create a df of explanatory variables
# each model has the same explanatory variable so this code only needs to be written once
explanatory_data = pd.DataFrame({"length_cm": np.arange(5, 61, 5)})
print(explanatory_data)

# the next step is to add a column with the assign method, named after the response variable,
# by calling predict on the model with explanatory_data as the argument
# the model variable is different for each species since every species now has its own model coefficients
# for readability, also add a column labeling the species
prediction_data_bream = explanatory_data.assign(mass_g = mdl_bream.predict(explanatory_data), species = "Bream")
prediction_data_perch = explanatory_data.assign(mass_g = mdl_perch.predict(explanatory_data), species = "Perch")
prediction_data_pike = explanatory_data.assign(mass_g = mdl_pike.predict(explanatory_data), species = "Pike")
prediction_data_roach = explanatory_data.assign(mass_g = mdl_roach.predict(explanatory_data), species = "Roach")

# working with a separate dataframe for each species is annoying, so concatenate the predictions into one
prediction_data = pd.concat([prediction_data_bream, 
                             prediction_data_roach, 
                             prediction_data_perch, 
                             prediction_data_pike])

# the regplot function can no longer be used for visualizing because there is now a separate regression for each subset of the dataset
# to visualize one trend line per species, use seaborn's lmplot function with hue
sns.lmplot(x="length_cm",
           y="mass_g",
           data=fish,
           hue="species",
           ci=None)
# plt.show() is held back so the prediction points are drawn on the same axes as the trend lines

# to sanity check our concatenated predictions, add them to the plot to see if they align with seaborn's lmplot calculations
sns.scatterplot(x="length_cm",
                y="mass_g",
                data=prediction_data,
                hue="species",
                legend=False)
plt.show()
# the resulting graph will show that each line has its own slope
# as predicted, each line of prediction points follows seaborn's trend lines

# the next question is: are these models better?
# calculate the coefficient of determination for a model on the whole fish dataset
mdl_fish = ols("mass_g ~ length_cm + species", data=fish).fit()
print(mdl_fish.rsquared_adj)
# calculate the adjusted coefficient of determination for each of the individual models
print(mdl_bream.rsquared_adj)
print(mdl_perch.rsquared_adj)
print(mdl_pike.rsquared_adj)
print(mdl_roach.rsquared_adj)
# a higher adjusted coefficient of determination indicates a better fit
# so only the species models that score higher than the whole dataset model are an improvement

# calculate the residual standard error for the whole dataset model
print(np.sqrt(mdl_fish.mse_resid))
# ...and for the individual models
print(np.sqrt(mdl_bream.mse_resid))
print(np.sqrt(mdl_perch.mse_resid))
print(np.sqrt(mdl_pike.mse_resid))
print(np.sqrt(mdl_roach.mse_resid))
# a higher residual standard error indicates larger differences between actual and predicted values
# so only the species models with a lower value than the whole dataset model are an improvement

# a mixed result is common: one species model can fit better than the whole dataset model on one metric
# but worse on the other, while other species improve or get worse on both metrics
# the whole dataset model benefits from the increased power of more rows of data
# but the individual models benefit from not having to fit several different species at once
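
to see what the two metrics measure, here is a rough sketch that computes them by hand for a simple linear fit. the data is made up, np.polyfit stands in for ols, and the formulas are the usual textbook ones, which should match what rsquared_adj and np.sqrt(mse_resid) report for a model with an intercept

```python
import numpy as np

# made-up data, not taken from the fish dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

n = len(y)   # number of observations
k = 1        # number of explanatory terms, not counting the intercept

ss_resid = np.sum(resid ** 2)             # variation the model fails to explain
ss_total = np.sum((y - y.mean()) ** 2)    # total variation in the response

r_squared = 1 - ss_resid / ss_total
# the adjustment penalizes every extra coefficient
r_squared_adj = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# residual degrees of freedom = observations minus fitted coefficients
df_resid = n - (k + 1)
rse = np.sqrt(ss_resid / df_resid)

print(r_squared_adj, rse)
```

the residual standard error is in the units of the response (grams for the fish models), while the coefficient of determination has no units, which is one reason the two metrics can disagree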

#### One Model with an Interaction

using different models for different parts of the dataset is annoying
a better solution is to specify a single model that contains intercepts and slopes for each category
you can do this by specifying interactions between explanatory variables

what is an interaction?
for example, different fish species have different mass to length ratios
the effect of length on the expected mass is different for different species
so, the effect of one explanatory variable on the expected response changes depending on the value of another explanatory variable
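
a small sketch of what "one model with an intercept and a slope per category" can look like as plain columns of numbers. it uses numpy least squares instead of statsmodels, and the species and coefficients are made up so the expected answer is known; how statsmodels lays out its columns internally is not shown here

```python
import numpy as np

# made-up data lying exactly on one line per species (NOT real fish values)
species = np.array(["Bream"] * 4 + ["Pike"] * 4)
length = np.array([20.0, 30.0, 40.0, 50.0] * 2)
true_ic = {"Bream": -1000.0, "Pike": -1500.0}
true_slope = {"Bream": 55.0, "Pike": 50.0}
mass = np.array([true_ic[s] + true_slope[s] * x for s, x in zip(species, length)])

# one indicator column per species (the intercepts) and one
# indicator * length column per species (the slopes); no global intercept
is_bream = (species == "Bream").astype(float)
is_pike = (species == "Pike").astype(float)
X = np.column_stack([is_bream, is_pike, is_bream * length, is_pike * length])

# a single least-squares fit recovers every intercept and slope at once
coeffs, *_ = np.linalg.lstsq(X, mass, rcond=None)
ic_bream, ic_pike, slope_bream, slope_pike = coeffs
print(coeffs)
```

because each length column is zeroed out for the other species, the slope for one species is estimated only from that species' rows, which is why the effect of length can differ by species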

the formula for no interaction is
    response ~ explntry1 + explntry2
    
to include an interaction between those variables, swap the plus for a times symbol
this is the implicit version because you don't write down which interactions are needed, statsmodels figures that out
    response ~ explntry1 * explntry2
    
the implicit version is usually best, but if you want to explicitly document which interactions are included in the model then you can use the explicit version
    response ~ explntry1 + explntry2 + explntry1:explntry2
    
the result of these two is the same, so choose based on whether you prefer brevity or detail
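
for two numeric explanatory variables, the explntry1:explntry2 term amounts to an extra column holding their product. the sketch below builds that column by hand with numpy (a stand-in for the formula interface) on made-up, noise-free data, then shows that the slope of one variable depends on the value of the other

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, size=50)
x2 = rng.uniform(0, 10, size=50)

# made-up coefficients: intercept, x1, x2, and the x1:x2 interaction
b0, b1, b2, b12 = 3.0, 2.0, -1.0, 0.5
y = b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

# columns for: intercept, x1, x2, and the interaction (the product x1 * x2)
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
est, *_ = np.linalg.lstsq(X, y, rcond=None)

# with an interaction, the slope of x1 is no longer a single number:
# it is est[1] + est[3] * x2, so it changes with the value of x2
slope_x1_when_x2_is_0 = est[1] + est[3] * 0.0
slope_x1_when_x2_is_10 = est[1] + est[3] * 10.0
print(est, slope_x1_when_x2_is_0, slope_x1_when_x2_is_10)
```

dropping the last column gives the no-interaction model, where the slope of x1 is the same for every value of x2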

In [None]:
# fit the model using the implicit interaction formula
mdl_mass_vs_both = ols("mass_g ~ length_cm * species", data=fish).fit()
print(mdl_mass_vs_both.params)

# the coefficients are difficult to interpret, just like those of the models with a categorical explanatory variable
# the Intercept coefficient is the intercept for the first species: bream
# the length_cm coefficient is the slope for bream
# the Intercept coefficient plus the species[T.Perch] coefficient is the intercept for perch
# the length_cm coefficient plus the length_cm:species[T.Perch] interaction coefficient is the slope for perch

# this is a mess: a harder-to-read formula is needed to get easier-to-understand coefficients
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()
print(mdl_mass_vs_both_inter.params)
# the + 0 removes the global intercept
# you'll now get an intercept coefficient and a slope coefficient for each species
# this gives the same results as making a model for each category, but you don't have to manage four sets of code

# response ~ explntry1 + explntry2 + explntry1:explntry2 is probably the updated form needed
# the video said to use the other form but the exercises said it was deprecated

# Fit a linear regression of price_twd_msq versus n_convenience and house_age_years, using the "times" syntax to 
# implicitly generate an interaction between them.
mdl_price_vs_both_inter = ols('price_twd_msq ~ n_convenience * house_age_years', data=taiwan_real_estate).fit()

# Fit a linear regression of price_twd_msq versus n_convenience and house_age_years, using the "colon" syntax to 
# explicitly generate an interaction between them.
mdl_price_vs_both_inter = ols("price_twd_msq ~ n_convenience + house_age_years + n_convenience:house_age_years", data=taiwan_real_estate).fit()

# Fit a linear regression of price_twd_msq versus house_age_years plus an interaction between n_convenience and 
# house_age_years, and no global intercept, using the taiwan_real_estate dataset.
mdl_readable_inter = ols("price_twd_msq ~ house_age_years + n_convenience:house_age_years + 0", data=taiwan_real_estate).fit()
print(mdl_readable_inter.params)
# based on the last 3 coefficients you can conclude that the expected increase in house price for each nearby convenience store is lowest
# for the 30-45 age group

#### Making Predictions with Interactions



In [None]:
# the prediction flow for a model containing an interaction is the same as for the parallel slopes model
from itertools import product

# create an array of numbers
length_cm = np.arange(5, 61, 5)
# extract the unique values of species
species = fish["species"].unique()
# use product from itertools to get all combinations of length and species
p = product(length_cm, species)

# now convert to a dataframe and name the columns
explanatory_data = pd.DataFrame(p, columns=["length_cm", "species"])
prediction_data = explanatory_data.assign(mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

# visualizing the predictions
sns.lmplot(x="length_cm",
           y="mass_g",
           data=fish,
           hue="species",
           ci=None)
# each trend line has its own slope; plt.show() is held back so the prediction points are drawn on the same axes

# to sanity check the predictions, add them to the plot to see if they align with seaborn's lmplot calculations
sns.scatterplot(x="length_cm",
                y="mass_g",
                data=prediction_data,
                hue="species",
                legend=False)
plt.show()

# to see how the predictions work, you could manually calculate them
# get the coefficients from the model using the params attribute
coeffs = mdl_mass_vs_both_inter.params
# next, unpack all the coefficients into the four intercepts and the four slopes
ic_bream, ic_perch, ic_pike, ic_roach, slope_bream, slope_perch, slope_pike, slope_roach = coeffs
# now use numpy's select function, it needs a list of conditions and an equally sized list of choices
conditions = [
    explanatory_data["species"] == "Bream", 
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]
# pick the matching intercept for each row (ic is short for intercept)
ic_choices = [ic_bream, ic_perch, ic_pike, ic_roach]
intercept = np.select(conditions, ic_choices)
# pick the matching slope for each row
slope_choices = [slope_bream, slope_perch, slope_pike, slope_roach]
slope = np.select(conditions, slope_choices)
# finally, calculate the predictions using the formula intercept plus slope times explanatory data
prediction_data = explanatory_data.assign(mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)
# you'll see that the calculated values are the same as those returned by the predict function
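
a tiny standalone version of the np.select step, small enough to check by hand. the coefficients below are hypothetical round numbers, not fitted values

```python
import numpy as np

species = np.array(["Bream", "Pike", "Bream", "Roach"])
length_cm = np.array([10.0, 10.0, 20.0, 20.0])

# hypothetical coefficients chosen for easy arithmetic, NOT fitted values
ic = {"Bream": -1000.0, "Pike": -1500.0, "Roach": -300.0}
sl = {"Bream": 55.0, "Pike": 50.0, "Roach": 20.0}

# one boolean mask per species, in the same order as the choices
conditions = [species == name for name in ic]
intercept = np.select(conditions, [ic[name] for name in ic])
slope = np.select(conditions, [sl[name] for name in ic])

# intercept plus slope times length, row by row
mass_g = intercept + slope * length_cm
print(mass_g)
```

rows that match none of the conditions get np.select's default of 0, so a misspelled species name would quietly produce a prediction of 0 rather than an error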

#### Simpson's Paradox

Simpson's paradox is a nonintuitive result that some datasets show
it occurs when the trend of a model on the whole dataset is very different from the trends shown by models on subsets

for example, the slope of the whole dataset could be positive but if you look at the slope of each group, they could all be negative 
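
a sketch with made-up numbers that produces exactly this pattern: every group trends downward, but the groups are shifted up and to the right of each other, so a single line through all the points trends upward. np.polyfit is used as a simple stand-in for a regression model

```python
import numpy as np

# made-up data: three groups, each shifted right and up relative to the last
x_offsets = [0.0, 5.0, 10.0]
y_offsets = [0.0, 10.0, 20.0]
base_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

group_slopes = []
all_x, all_y = [], []
for dx, dy in zip(x_offsets, y_offsets):
    x = base_x + dx
    y = dy - 1.0 * base_x              # slope of -1 inside every group
    slope, _ = np.polyfit(x, y, deg=1)
    group_slopes.append(slope)
    all_x.append(x)
    all_y.append(y)

# one line fitted to all the points, ignoring the groups
pooled_slope, _ = np.polyfit(np.concatenate(all_x), np.concatenate(all_y), deg=1)
print(group_slopes, pooled_slope)
```

the pooled slope is positive only because the group a point belongs to is tied to both x and y; once the group is included in the model, the negative within-group trend shows up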

because of this, it's helpful to visualize your dataset, so plot it! this matters even more when different models give conflicting results
also remember: you can't choose the best model in general because it depends on the dataset and the question you're trying to answer
so articulate a question before you start fitting models

another example is test scores vs hours playing video games
the whole dataset would show a positive slope but each group would show a negative slope
at first it would seem like playing more video games increases test scores
BUT, the interpretation changes if you reveal that the groups are based on the age of the children
you would now know that older children score higher on the test and that, within each age group, playing lots of video games is related to a lower score

most of the time the grouped model contains insight that you'd otherwise miss
the discrepancy between the models may reveal that you need to include more explanatory variables (could wealth be the reason for lower infection rates? proximity of hospitals?)
context is really important when deciding which model to use, so consider your dataset and what question you are trying to answer

extreme examples of Simpson's paradox are rare; usually the effect is less obvious
you may see a zero slope rather than a complete change in direction
and it may not appear in every group
