# One model per category

The model you ran on the whole dataset fits some parts of the data better than others. It's worth taking a look at what happens when you run a linear model on different parts of the dataset separately, to see if each model agrees or disagrees with the others.

`taiwan_real_estate` is available.

In [1]:
# # Filter for rows where house age is 0 to 15 years
# taiwan_0_to_15 = taiwan_real_estate[taiwan_real_estate["house_age_years"] == "0 to 15"]

# # Filter for rows where house age is 15 to 30 years
# taiwan_15_to_30 = taiwan_real_estate[taiwan_real_estate["house_age_years"] == "15 to 30"]

# # Filter for rows where house age is 30 to 45 years
# taiwan_30_to_45 = taiwan_real_estate[taiwan_real_estate["house_age_years"] == "30 to 45"]

# # Model price vs. no. convenience stores using 0 to 15 data
# mdl_0_to_15 = ols("price_twd_msq ~ n_convenience", data = taiwan_0_to_15).fit()

# # Model price vs. no. convenience stores using 15 to 30 data
# mdl_15_to_30 = ols("price_twd_msq ~ n_convenience", data = taiwan_15_to_30).fit()

# # Model price vs. no. convenience stores using 30 to 45 data
# mdl_30_to_45 = ols("price_twd_msq ~ n_convenience", data = taiwan_30_to_45).fit()

# # Print the coefficients
# print(mdl_0_to_15.params)
# print(mdl_15_to_30.params)
# print(mdl_30_to_45.params)
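
The same split-then-fit pattern can be tried without the course dataset. The sketch below is only an illustration: the data are synthetic, the "true" intercepts and slopes are made up, and `np.polyfit` stands in for `ols()`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" (intercept, slope) for each house age category
true_params = {
    "0 to 15": (9.0, 0.8),
    "15 to 30": (7.0, 0.9),
    "30 to 45": (8.0, 0.6),
}

fitted = {}
for category, (intercept, slope) in true_params.items():
    # Simulate 200 houses in this category
    n_convenience = rng.integers(0, 11, size=200)
    price = intercept + slope * n_convenience + rng.normal(0, 0.5, size=200)
    # A degree-1 polyfit returns [slope, intercept]
    est_slope, est_intercept = np.polyfit(n_convenience, price, deg=1)
    fitted[category] = (est_intercept, est_slope)

for category, (est_intercept, est_slope) in fitted.items():
    print(f"{category}: intercept={est_intercept:.2f}, slope={est_slope:.2f}")
```

Each category gets its own intercept and slope, which is what printing the three `.params` objects shows for the real models.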

# Predicting multiple models

To see what each category's model is doing, it's helpful to make predictions from them. The workflow is exactly the same as for the whole-dataset model; just remember that each of these models has only a single explanatory variable.

The models `mdl_0_to_15`, `mdl_15_to_30` and `mdl_30_to_45` are available.

In [1]:
# # Create explanatory_data, setting no. of conv stores from  0 to 10
# explanatory_data = pd.DataFrame({'n_convenience': np.arange(0, 11)})

# # Add column of predictions using "0 to 15" model and explanatory data 
# prediction_data_0_to_15 = explanatory_data.assign(
#     price_twd_msq = mdl_0_to_15.predict(explanatory_data)
# )

# # Same again, with "15 to 30"
# prediction_data_15_to_30 = explanatory_data.assign(
#     price_twd_msq = mdl_15_to_30.predict(explanatory_data)
# )

# # Same again, with "30 to 45"
# prediction_data_30_to_45 = explanatory_data.assign(
#     price_twd_msq = mdl_30_to_45.predict(explanatory_data)
# )

# print(prediction_data_0_to_15)
# print(prediction_data_15_to_30)
# print(prediction_data_30_to_45)

# Visualizing multiple models

In the last two exercises, you ran models for each category of house ages separately, then calculated predictions for each model. Now it's time to visualize those predictions to see how they compare.

When you use `sns.lmplot()` with the `hue` argument set to the categorical variable, you get multiple trend lines, each with their own slope. This is in contrast with the parallel slopes model you saw in Chapter 1, where all models had the same slope.

`taiwan_real_estate` is available. `prediction_data` is loaded as a concatenated DataFrame of all house ages.

In [2]:
# # Plot the trend lines of price_twd_msq vs. n_convenience for each house age category
# sns.lmplot(x="n_convenience",
#            y="price_twd_msq",
#            data=taiwan_real_estate,
#            hue="house_age_years",
#            ci=None,
#            legend_out=False)

# # Add a scatter plot for prediction_data
# sns.scatterplot(x="n_convenience",
#                 y="price_twd_msq",
#                 data=prediction_data,
#                 hue="house_age_years",
#                 legend=False)

# plt.show()

# Assessing model performance

To test which approach is best — the whole dataset model or the models for each house age category — you need to calculate some metrics. Here, you'll compare the coefficient of determination and the residual standard error for each model.

Four models of price versus no. of convenience stores (`mdl_all_ages`, `mdl_0_to_15`, `mdl_15_to_30`, and `mdl_30_to_45`) are available.

In [3]:
# # Print the coeff. of determination for mdl_all_ages
# print("R-squared for mdl_all_ages: ", mdl_all_ages.rsquared)

# # Print the coeff. of determination for mdl_0_to_15
# print("R-squared for mdl_0_to_15: ", mdl_0_to_15.rsquared)

# # Print the coeff. of determination for mdl_15_to_30
# print("R-squared for mdl_15_to_30: ", mdl_15_to_30.rsquared)

# # Print the coeff. of determination for mdl_30_to_45
# print("R-squared for mdl_30_to_45: ", mdl_30_to_45.rsquared)

In [4]:
# # Print the RSE for mdl_all_ages
# print("RSE for mdl_all_ages: ", np.sqrt(mdl_all_ages.mse_resid))

# # Print the RSE for mdl_0_to_15
# print("RSE for mdl_0_to_15: ", np.sqrt(mdl_0_to_15.mse_resid))

# # Print the RSE for mdl_15_to_30
# print("RSE for mdl_15_to_30: ", np.sqrt(mdl_15_to_30.mse_resid))

# # Print the RSE for mdl_30_to_45
# print("RSE for mdl_30_to_45: ", np.sqrt(mdl_30_to_45.mse_resid))
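
Both metrics can also be computed by hand, which makes it clearer what is being compared. This is a minimal sketch on synthetic data (an assumption, not the Taiwan dataset), with `np.polyfit` standing in for `ols()`. The residual standard error divides the residual sum of squares by the residual degrees of freedom before taking the square root, which is the same quantity as `np.sqrt(model.mse_resid)`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data
x = rng.integers(0, 11, size=100).astype(float)
y = 8.0 + 0.8 * x + rng.normal(0, 3.0, size=100)

# Fit a straight line and compute the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

ss_resid = np.sum(residuals ** 2)
ss_total = np.sum((y - y.mean()) ** 2)

# Coefficient of determination: share of the variance that is explained
r_squared = 1 - ss_resid / ss_total

# Residual standard error: n observations minus 2 estimated coefficients
df_resid = len(y) - 2
rse = np.sqrt(ss_resid / df_resid)

print(f"R-squared: {r_squared:.3f}, RSE: {rse:.3f}")
```

A higher R-squared and a lower RSE both indicate a better fit, but the RSE is in the units of the response, so it is only comparable between models of the same response variable.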


# Specifying an interaction

So far you've used a single parallel slopes model, which gave an OK fit for the whole dataset, and then three separate models, one per house age category. The separate models fit each category better, but they are clunky because you have three models to work with and explain. Ideally, you'd have a single model with all the predictive power of the individual models.

You define this single model by adding interactions between explanatory variables. The syntax of `statsmodels.formula` is flexible and gives you a couple of options, depending on whether you prefer concise code that is quick to type and read, or explicit code that describes what you are doing in detail.

`taiwan_real_estate` is available.

In [5]:
# # Model price vs both with an interaction using "times" syntax
# mdl_price_vs_both_inter = ols("price_twd_msq ~ n_convenience * house_age_years", data = taiwan_real_estate).fit()

# # Print the coefficients
# print(mdl_price_vs_both_inter.params)

In [6]:
# # Model price vs. both with an interaction using "colon" syntax
# mdl_price_vs_both_inter = ols("price_twd_msq ~ house_age_years + n_convenience + house_age_years:n_convenience", data = taiwan_real_estate).fit()

# # Print the coefficients
# print(mdl_price_vs_both_inter.params)

# Interactions with understandable coeffs

The previous model with the interaction term returned coefficients that were a little tricky to interpret. In order to clarify what the model is predicting, you can reformulate the model so that it returns understandable coefficients. For further clarity, you can compare the results to the models on the separate house age categories (`mdl_0_to_15`, `mdl_15_to_30`, and `mdl_30_to_45`).

`taiwan_real_estate`, `mdl_0_to_15`, `mdl_15_to_30`, and `mdl_30_to_45` are available.

In [7]:
# # Model price vs. house age plus an interaction, no intercept
# mdl_readable_inter = ols("price_twd_msq ~ house_age_years + n_convenience : house_age_years + 0", data = taiwan_real_estate).fit()

# # Print the coefficients for mdl_0_to_15
# print("mdl_0_to_15 coefficients:", "\n", mdl_0_to_15.params)

# # Print the coefficients for mdl_15_to_30
# print("mdl_15_to_30 coefficients:", "\n", mdl_15_to_30.params)

# # Print the coefficients for mdl_30_to_45
# print("mdl_30_to_45 coefficients:", "\n", mdl_30_to_45.params)

# # Print the coefficients for mdl_readable_inter
# print("\n", "mdl_readable_inter coefficients:", "\n", mdl_readable_inter.params)

Which statement about the coefficients of `mdl_readable_inter` is true?
- The expected increase in house price for each nearby convenience store is lowest for the 30 to 45 year age group.
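
The reason the no-intercept interaction model is easy to read is that it estimates one intercept and one slope per category, so its coefficients coincide with those of the separate models. The sketch below checks this on synthetic data by building the design matrix by hand with NumPy. The data, the "true" parameters, and the use of `np.linalg.lstsq` instead of `ols()` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
categories = ["0 to 15", "15 to 30", "30 to 45"]

# Synthetic data: each category has its own hypothetical intercept and slope
group = rng.integers(0, 3, size=300)
x = rng.integers(0, 11, size=300).astype(float)
true_intercepts = np.array([9.0, 7.0, 8.0])
true_slopes = np.array([0.8, 0.9, 0.6])
y = true_intercepts[group] + true_slopes[group] * x + rng.normal(0, 1.0, size=300)

# Design matrix with no global intercept:
# three indicator columns, then three indicator-times-x columns
indicators = (group[:, None] == np.arange(3)).astype(float)
design = np.hstack([indicators, indicators * x[:, None]])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
joint_intercepts, joint_slopes = coeffs[:3], coeffs[3:]

# Separate straight-line fits, one per category
separate = np.array(
    [np.polyfit(x[group == g], y[group == g], deg=1) for g in range(3)]
)
separate_slopes, separate_intercepts = separate[:, 0], separate[:, 1]

for g, name in enumerate(categories):
    print(name, joint_intercepts[g].round(3), joint_slopes[g].round(3))
```

The joint fit and the three separate fits agree up to floating-point error, because the indicator columns split the least-squares problem into three independent ones.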

# Predicting with interactions

As with every other regression model you've created, the fun part is making predictions. Fortunately, the code flow for this case is the same as the one without interactions — statsmodels can handle calculating the interactions without any extra prompting from you. The only thing you need to remember is the trick for getting combinations of explanatory variables.

`mdl_price_vs_both_inter` is available as a fitted model, and `itertools.product` is loaded.

In [8]:
# # Create n_convenience as an array of numbers from 0 to 10
# n_convenience = np.arange(0,11)

# # Extract the unique values of house_age_years
# house_age_years = taiwan_real_estate["house_age_years"].unique()

# # Create p as all combinations of values of n_convenience and house_age_years
# p = product(n_convenience,house_age_years)

# # Transform p to a DataFrame and name the columns
# explanatory_data = pd.DataFrame(p, columns=["n_convenience", "house_age_years"])

# # Print it
# print(explanatory_data)
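
The "combinations" trick is easier to see on its own, using only the standard library. In this small sketch the category labels are written out by hand rather than read from the dataset. The two inputs have different lengths (11 and 3), so they cannot simply sit side by side as columns; `product()` is what pairs every store count with every category.

```python
from itertools import product

n_convenience = range(0, 11)  # 0, 1, ..., 10
house_age_years = ["0 to 15", "15 to 30", "30 to 45"]

# product() returns a one-shot iterator of (n_convenience, house_age_years) tuples
p = product(n_convenience, house_age_years)
combinations = list(p)

print(len(combinations))  # 11 * 3 = 33 rows
print(combinations[:4])

# The iterator is now exhausted, so a second pass yields nothing
leftover = list(p)
print(leftover)
```

Because the iterator can only be consumed once, convert it to a list or a DataFrame a single time and reuse that result.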

In [9]:
# # Create n_convenience as an array of numbers from 0 to 10
# n_convenience = np.arange(0, 11)

# # Extract the unique values of house_age_years
# house_age_years = taiwan_real_estate["house_age_years"].unique()

# # Create p as all combinations of values of n_convenience and house_age_years
# p = product(n_convenience, house_age_years)

# # Transform p to a DataFrame and name the columns
# explanatory_data = pd.DataFrame(p, columns=["n_convenience", "house_age_years"])

# # Add predictions to the DataFrame
# prediction_data = explanatory_data.assign( 
#       price_twd_msq = mdl_price_vs_both_inter.predict(explanatory_data))

# # Plot the trend lines of price_twd_msq vs. n_convenience colored by house_age_years
# sns.lmplot(x="n_convenience", y="price_twd_msq", hue="house_age_years", data=taiwan_real_estate)

# # Add a scatter plot for prediction_data
# sns.scatterplot(x="n_convenience", y="price_twd_msq", hue="house_age_years", data=prediction_data,
#                 legend=False)

# # Show the plot
# plt.show()

# Manually calculating predictions with interactions

In order to understand how `.predict()` works, it's time to calculate the predictions manually again. For this model, there are three separate lines to calculate, and for each one the prediction is an intercept plus a slope times the numeric explanatory value. The tricky part is getting the right intercept and the right slope for each case.

`mdl_price_vs_both_inter` and `explanatory_data` are available.

In [10]:
# # Get the coefficients from mdl_price_vs_both_inter
# coeffs = mdl_price_vs_both_inter.params

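# # Assign each of the elements of coeffs
# # (this order assumes the model was fit with the no-intercept formula:
# #  house_age_years + n_convenience:house_age_years + 0)
# ic_0_15, ic_15_30, ic_30_45, slope_0_15, slope_15_30, slope_30_45 = coeffs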

# # Create conditions
# conditions = [
#     explanatory_data["house_age_years"] == "0 to 15",
#     explanatory_data["house_age_years"] == "15 to 30",
#     explanatory_data["house_age_years"] == "30 to 45"
# ]

# # Create intercept_choices
# intercept_choices = [ic_0_15, ic_15_30, ic_30_45]

# # Create slope_choices
# slope_choices = [slope_0_15, slope_15_30, slope_30_45]

# # Create intercept and slope
# intercept = np.select(conditions, intercept_choices)
# slope = np.select(conditions, slope_choices)

# # Create prediction_data with the column price_twd_msq
# prediction_data = explanatory_data.assign(
#     price_twd_msq = intercept + slope * explanatory_data["n_convenience"]
#     )

# # Print it
# print(prediction_data)
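
If the model is instead fit with a global intercept (the "times" syntax), the coefficients are offsets from a baseline category rather than per-category intercepts and slopes, so they need converting before the same `np.select()` approach applies. The sketch below shows that conversion. The coefficient names follow the layout such a fit typically reports, and the values are made up for illustration; neither comes from the fitted course model.

```python
import numpy as np

# Hypothetical coefficients for a fit with a global intercept;
# "0 to 15" is the baseline category
coeffs = {
    "Intercept": 9.0,
    "house_age_years[T.15 to 30]": -2.0,
    "house_age_years[T.30 to 45]": -1.0,
    "n_convenience": 0.8,
    "n_convenience:house_age_years[T.15 to 30]": 0.1,
    "n_convenience:house_age_years[T.30 to 45]": -0.2,
}

base_ic, base_slope = coeffs["Intercept"], coeffs["n_convenience"]

# Per-category intercept = baseline intercept + category offset
intercepts = {
    "0 to 15": base_ic,
    "15 to 30": base_ic + coeffs["house_age_years[T.15 to 30]"],
    "30 to 45": base_ic + coeffs["house_age_years[T.30 to 45]"],
}
# Per-category slope = baseline slope + interaction offset
slopes = {
    "0 to 15": base_slope,
    "15 to 30": base_slope + coeffs["n_convenience:house_age_years[T.15 to 30]"],
    "30 to 45": base_slope + coeffs["n_convenience:house_age_years[T.30 to 45]"],
}

# A few example rows of explanatory data
house_age = np.array(["0 to 15", "15 to 30", "30 to 45", "15 to 30"])
n_conv = np.array([0, 5, 10, 2])

# Pick the matching intercept and slope for each row
conditions = [house_age == label for label in intercepts]
intercept = np.select(conditions, list(intercepts.values()))
slope = np.select(conditions, list(slopes.values()))

predictions = intercept + slope * n_conv
print(predictions)
```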

# Modeling eBay auctions

Sometimes modeling a whole dataset suggests trends that disagree with models on separate parts of that dataset. This is known as Simpson's paradox. In the most extreme case, you may see a positive slope on the whole dataset, and negative slopes on every subset of that dataset (or the other way around).
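
A tiny constructed example makes the extreme case concrete. The numbers below are invented purely for illustration and have nothing to do with the auction data. Within each of three groups, y falls as x rises, but groups with larger x values also sit at a higher overall level.

```python
import numpy as np

# Hypothetical group offsets: later groups are shifted right and up
x_offsets = np.array([0.0, 10.0, 20.0])
y_offsets = np.array([0.0, 20.0, 40.0])
local_x = np.arange(0.0, 5.0)  # 0, 1, 2, 3, 4 within each group

xs, ys = [], []
for x0, y0 in zip(x_offsets, y_offsets):
    xs.append(x0 + local_x)
    ys.append(y0 - 1.0 * local_x)  # slope of -1 inside every group

# Slope of a straight-line fit within each group, then on the pooled data
group_slopes = [np.polyfit(x, y, deg=1)[0] for x, y in zip(xs, ys)]
overall_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), deg=1)[0]

print("within-group slopes:", np.round(group_slopes, 2))
print("overall slope:", round(overall_slope, 2))
```

Every within-group slope is negative while the pooled slope is positive, because the pooled fit is dominated by the differences between groups rather than the trend inside them.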

Over the next few exercises, you'll look at eBay auctions of Palm Pilot M515 PDA models.

- `price` = Final sale price, USD
- `openbid` = The opening bid, USD
- `auction_type` = Duration of the auction (3 day, 5 day, or 7 day)

`auctions` is available as a pandas DataFrame.

In [11]:
# # Take a glimpse at the dataset
# print(auctions.info())

# # Model price vs. opening bid using auctions
# mdl_price_vs_openbid = ols("price ~ openbid", data = auctions).fit()

# # See the result
# print(mdl_price_vs_openbid.params)

# # Plot the scatter plot of price vs. openbid with a linear trend line
# sns.regplot(x = "openbid", y= "price", data = auctions)

# # Show the plot
# plt.show()

# Modeling each auction type

You just saw that the opening bid price appeared not to affect the final sale price of Palm Pilots in the eBay auctions. Now let's look at what happens when you model the three auction types (3 day, 5 day, and 7 day) separately.

`auctions` is available.

In [12]:
# # Fit linear regression of price vs. opening bid and auction type, with an interaction, without intercept
# mdl_price_vs_both = ols("price ~ auction_type + openbid:auction_type + 0", data=auctions).fit()

# # Using auctions, plot price vs. opening bid colored by auction type as a scatter plot with linear regression trend lines
# sns.lmplot(x = "openbid", y= "price", hue = "auction_type", data = auctions)

# # Show the plot
# plt.show()


Which statement about the model resolves Simpson's Paradox?
- The two models disagree, and the best model to take advice from depends upon the question we are trying to solve.