# Interactive 3D scatter plot

Since computer screens and paper are both two-dimensional objects, most plots are best suited to visualizing two variables at once. For the case of three continuous variables, you can draw a 3D scatter plot, but perspective problems usually make it difficult to interpret. This can partially be solved by making the 3D plot interactive, so you can view the plot from different angles.

The 3D scatter plot shown has `n_convenience` on the x-axis, `sqrt_dist_to_mrt_m` on the y-axis, and `price_twd_msq` on the z-axis. You can interact with it by dragging and zooming. The points are colored according to `price_twd_msq`.

Which of the following statements is NOT correct?

- There tends to be a positive relationship between the number of convenience stores and the distance to the closest MRT station.

# Visualizing three numeric variables

There are also some "flat" alternatives to 3D plots that are easier to interpret, though they take a little thought to construct. A good approach is to plot the two numeric explanatory variables on the x- and y-axes of a scatter plot and color the points according to the response variable.

`taiwan_real_estate` is available.

In [1]:
# # Transform dist_to_mrt_m to sqrt_dist_to_mrt_m
# taiwan_real_estate["sqrt_dist_to_mrt_m"] = np.sqrt(taiwan_real_estate["dist_to_mrt_m"])
# # Draw a scatter plot of sqrt_dist_to_mrt_m vs. n_convenience colored by price_twd_msq
# sns.scatterplot(x="n_convenience", y="sqrt_dist_to_mrt_m", hue="price_twd_msq", data=taiwan_real_estate)

# # Show the plot
# plt.show()

# Modeling two numeric explanatory variables

You already saw how to make a model and predictions with a numeric and a categorical explanatory variable. The code for modeling and predicting with two numeric explanatory variables is the same, other than a slight difference in how to specify the explanatory variables to make predictions against.

Here you'll model and predict house prices against the number of nearby convenience stores and the square root of the distance to the nearest MRT station.

`taiwan_real_estate` is available with the square-root transformed variable `sqrt_dist_to_mrt_m`. `itertools.product` is also loaded.

In [1]:
# # Fit linear regression of price vs. no. of conv. stores and sqrt dist. to nearest MRT, no interaction
# mdl_price_vs_conv_dist = ols("price_twd_msq ~ n_convenience + sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()

# # Create n_convenience as an array of numbers from 0 to 10
# n_convenience = np.arange(0,11)

# # Create sqrt_dist_to_mrt_m as an array of numbers from 0 to 80 in steps of 10
# sqrt_dist_to_mrt_m = np.arange(0,81,10)

# # Create p as all combinations of values of n_convenience and sqrt_dist_to_mrt_m
# p = product(n_convenience, sqrt_dist_to_mrt_m)

# # Transform p to a DataFrame and name the columns
# explanatory_data = pd.DataFrame(p, columns = ["n_convenience" , "sqrt_dist_to_mrt_m"])

# # Add column of predictions
# prediction_data = explanatory_data.assign(
#     price_twd_msq = mdl_price_vs_conv_dist.predict(explanatory_data)
# )

# # See the result  
# print(prediction_data)
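
The grid-building step above can be tried without the dataset. The following is a minimal sketch using only the standard library, with `range()` standing in for `np.arange()` (an assumption made to keep the example dependency-free). It shows that `product()` yields every pairing, with the last argument varying fastest, and that the result is a one-shot iterator.

```python
from itertools import product

# Same value ranges as the exercise; range() stands in for np.arange()
n_convenience = range(0, 11)           # 0, 1, ..., 10
sqrt_dist_to_mrt_m = range(0, 81, 10)  # 0, 10, ..., 80

# product() returns a lazy iterator, so materialize it once as a list
p = product(n_convenience, sqrt_dist_to_mrt_m)
grid = list(p)

print(len(grid))  # 11 * 9 = 99 combinations
print(grid[:3])   # the last argument varies fastest

# The iterator is now exhausted: a second pass yields nothing
leftover = list(p)
print(leftover)
```

Because the iterator can be consumed only once, `p` needs to be re-created before it is turned into a second DataFrame.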


# Visualizing two numeric explanatory variables

The code for visualizing two numeric explanatory variables is the same as you've seen before: create a layer of the actual data points, and add a layer of the prediction points to see how they match. In the case of two numeric explanatory variables, the prediction point layer will look like a grid.

`taiwan_real_estate` and `prediction_data` are available with the square-root transformed variable `sqrt_dist_to_mrt_m`.

In [2]:
# # Create scatter plot of taiwan_real_estate
# sns.scatterplot(x="n_convenience", y="sqrt_dist_to_mrt_m", hue="price_twd_msq", data=taiwan_real_estate)

# # Create scatter plot of prediction_data without legend
# sns.scatterplot(x="n_convenience", y="sqrt_dist_to_mrt_m", hue="price_twd_msq", data=prediction_data, marker="s", legend=False)

# # Show the plot
# plt.show()

# Including an interaction

Just as in the case with one numeric and one categorical explanatory variable, it is possible for numeric explanatory variables to interact. With this model structure, you'll get a third slope coefficient: one for each explanatory variable and one for the interaction.

Here, you'll run, predict, and plot the same model as in the previous exercise, but this time including an interaction between the explanatory variables.
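
To see what the extra coefficient does, here is a small sketch of the prediction equation written out by hand. The function name `predict_price` and the coefficient values are hypothetical and chosen only for illustration; they are not fitted values from `taiwan_real_estate`.

```python
def predict_price(n_convenience, sqrt_dist, coeffs):
    """Prediction equation for two numeric variables plus their interaction."""
    intercept, b_conv, b_dist, b_inter = coeffs
    return (intercept
            + b_conv * n_convenience
            + b_dist * sqrt_dist
            + b_inter * n_convenience * sqrt_dist)

# Hypothetical coefficients for illustration only, not fitted values
coeffs = (10.0, 0.5, -0.1, -0.01)

# Effect of one extra convenience store at two different distances
effect_near = predict_price(1, 0, coeffs) - predict_price(0, 0, coeffs)
effect_far = predict_price(1, 40, coeffs) - predict_price(0, 40, coeffs)
print(effect_near, effect_far)
```

With an interaction term, the effect of one extra store is `b_conv + b_inter * sqrt_dist`, so it changes with distance. Without the interaction it would be the same everywhere.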

In [3]:
# # Fit the same model with an interaction (swap + for *) and name it mdl_price_vs_conv_dist_inter
# mdl_price_vs_conv_dist_inter = ols("price_twd_msq ~ n_convenience * sqrt_dist_to_mrt_m", data=taiwan_real_estate).fit()

# # Use mdl_price_vs_conv_dist_inter to make predictions
# n_convenience = np.arange(0, 11)
# sqrt_dist_to_mrt_m = np.arange(0, 81, 10)
# p = product(n_convenience, sqrt_dist_to_mrt_m)
# explanatory_data = pd.DataFrame(p, columns=["n_convenience", "sqrt_dist_to_mrt_m"])
# prediction_data = explanatory_data.assign(
#     price_twd_msq = mdl_price_vs_conv_dist_inter.predict(explanatory_data))

# sns.scatterplot(x="n_convenience", y="sqrt_dist_to_mrt_m", data=taiwan_real_estate, hue="price_twd_msq", legend=False)

# sns.scatterplot(x="n_convenience", y="sqrt_dist_to_mrt_m", data=prediction_data, hue="price_twd_msq", marker="s")

# plt.show()

# Visualizing many variables

As you begin to consider more variables, plotting them all at the same time becomes increasingly difficult. In addition to using x and y scales for two numeric variables, you can use color for a third numeric variable, and you can use faceting for categorical variables. And that's about your limit before the plots become too difficult to interpret. There are some specialist plot types like correlation heatmaps and parallel coordinates plots that will handle more variables, but they give you much less information about each variable, and they aren't great for visualizing model predictions.

Here you'll push the limits of the scatter plot by showing the house price, the distance to the MRT station, the number of nearby convenience stores, and the house age, all together in one plot.

`taiwan_real_estate` is available.

In [4]:
# # Prepare the grid using taiwan_real_estate, for each house age category, colored by price_twd_msq
# grid = sns.FacetGrid(data=taiwan_real_estate,
#             col="house_age_years",
#             hue="price_twd_msq",
#             palette="plasma")

# # Plot the scatterplots with sqrt_dist_to_mrt_m on the x-axis and n_convenience on the y-axis
# grid.map(sns.scatterplot,
#          "sqrt_dist_to_mrt_m",
#           "n_convenience")

# # Show the plot (brighter colors mean higher prices)
# plt.show()

# Different levels of interaction

Once you have three explanatory variables, the number of options for specifying interactions increases. You can specify no interactions. You can specify 2-way interactions, which gives you model coefficients for each pair of variables. The third option is to specify all the interactions, which means the three 2-way interactions and the interaction between all three explanatory variables.

As the number of explanatory variables increases further, the number of interaction possibilities rapidly increases.
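
The three options can be enumerated with a short sketch. The helpers `model_terms` and `count_interactions` are hypothetical names introduced here; the `a:b` notation is the formula syntax for an interaction term on its own. The sketch counts terms, not coefficients: a categorical variable such as `house_age_years` can contribute several coefficients per term, so `.params` may be longer than these lists.

```python
from itertools import combinations

def model_terms(variables, max_order):
    """List main effects and interaction terms up to max_order-way."""
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(variables, order):
            terms.append(":".join(combo))
    return terms

variables = ["n_convenience", "sqrt_dist_to_mrt_m", "house_age_years"]

no_inter = model_terms(variables, 1)   # a + b + c
two_way = model_terms(variables, 2)    # (a + b + c) ** 2
three_way = model_terms(variables, 3)  # a * b * c
print(len(no_inter), len(two_way), len(three_way))

def count_interactions(k):
    """Number of possible interaction terms among k variables: 2**k - k - 1."""
    return 2 ** k - k - 1

print([count_interactions(k) for k in range(2, 7)])
```

For three variables there are four possible interaction terms (three pairs and one triple). The count grows roughly as a power of two in the number of variables.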

`taiwan_real_estate` is available.

In [5]:
# # Model price vs. no. of conv. stores, sqrt dist. to MRT station & house age, no global intercept, no interactions
# mdl_price_vs_all_no_inter = ols("price_twd_msq ~ n_convenience + sqrt_dist_to_mrt_m + house_age_years + 0", data=taiwan_real_estate).fit()

# # See the result
# print(mdl_price_vs_all_no_inter.params)

In [8]:
# # Model price vs. sqrt dist. to MRT station, no. of conv. stores & house age, no global intercept, 3-way interactions
# mdl_price_vs_all_3_way_inter = ols("price_twd_msq ~ n_convenience * sqrt_dist_to_mrt_m * house_age_years + 0", data=taiwan_real_estate).fit()

# # See the result
# print(mdl_price_vs_all_3_way_inter.params)

In [7]:
# # Model price vs. sqrt dist. to MRT station, no. of conv. stores & house age, no global intercept, 2-way interactions
# mdl_price_vs_all_2_way_inter = ols("price_twd_msq ~ (n_convenience + sqrt_dist_to_mrt_m + house_age_years) ** 2 + 0", data = taiwan_real_estate).fit()

# # See the result
# print(mdl_price_vs_all_2_way_inter.params)

# Predicting again

You've followed the prediction workflow several times now with different combinations of explanatory variables. Time to try it once more on the model with three explanatory variables. Here, you'll use the model with 3-way interactions, though the code is the same when using any of the three models from the previous exercise.

`taiwan_real_estate` and `mdl_price_vs_all_3_way_inter` are available. `itertools.product` is loaded.

In [9]:
# # Create n_convenience as an array of numbers from 0 to 10
# n_convenience = np.arange(0, 11)

# # Create sqrt_dist_to_mrt_m as an array of numbers from 0 to 80 in steps of 10
# sqrt_dist_to_mrt_m = np.arange(0, 81, 10)

# # Create house_age_years with unique values
# house_age_years = taiwan_real_estate["house_age_years"].unique()

# # Create p as all combinations of n_convenience, sqrt_dist_to_mrt_m, and house_age_years, in that order
# p = product(n_convenience, sqrt_dist_to_mrt_m, house_age_years)

# # Transform p to a DataFrame and name the columns
# explanatory_data = pd.DataFrame(p, columns=["n_convenience",
#                                             "sqrt_dist_to_mrt_m",
#                                             "house_age_years"])

# # Add column of predictions
# prediction_data = explanatory_data.assign(
#     price_twd_msq = mdl_price_vs_all_3_way_inter.predict(explanatory_data)
# )

# # See the result
# print(prediction_data)
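
When a categorical variable joins the grid, the same workflow applies. In the sketch below, the three `house_age_years` labels are assumed for illustration; in the exercise they come from `taiwan_real_estate["house_age_years"].unique()`. Plain dictionaries stand in for DataFrame rows.

```python
from itertools import product

# Assumed category labels; in practice take them from the dataset
house_age_years = ["0 to 15", "15 to 30", "30 to 45"]
columns = ["n_convenience", "sqrt_dist_to_mrt_m", "house_age_years"]

# One record per combination of the three explanatory variables
rows = [dict(zip(columns, combo))
        for combo in product(range(0, 11), range(0, 81, 10), house_age_years)]

print(len(rows))  # 11 * 9 * 3 = 297 rows
print(rows[0])
```

The grid size is the product of the number of values per variable, so it grows quickly as variables are added.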

# The sum of squares

In order to choose the "best" line to fit the data, regression models need to optimize some metric. For linear regression, this metric is called the sum of squares.

In the dashboard, try setting different values of the intercept and slope coefficients. In the plot, the solid black line has the intercept and slope you specified. The dotted blue line has the intercept and slope calculated by a linear regression on the dataset.

How does linear regression try to optimize the sum of squares metric?

- Linear regression minimizes the sum of the squares of the differences between the actual responses and the predicted responses.
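
As a small worked example of the metric, the sketch below computes the sum of squares for two candidate lines on made-up data. The function name `sum_of_squares` and the data values are illustrative assumptions, not part of the dashboard.

```python
def sum_of_squares(intercept, slope, x_values, y_values):
    """Sum of squared differences between actual and predicted responses."""
    total = 0.0
    for x, y in zip(x_values, y_values):
        y_pred = intercept + slope * x
        total += (y - y_pred) ** 2
    return total

# Made-up data lying exactly on the line y = 1 + 2x
x_values = [0, 1, 2, 3]
y_values = [1, 3, 5, 7]

perfect = sum_of_squares(1, 2, x_values, y_values)  # the true line
shifted = sum_of_squares(0, 2, x_values, y_values)  # intercept off by 1
print(perfect, shifted)
```

The true line scores 0, while the shifted line misses every point by 1 and scores 4. Linear regression picks the coefficients with the lowest score.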


# Linear regression algorithm

To truly understand linear regression, it is helpful to know how the algorithm works. The code for `ols()` is hundreds of lines because it has to work with any formula and any dataset. However, in the case of simple linear regression for a single dataset, you can implement a linear regression algorithm in just a few lines of code.

The workflow is:

First, write a function to calculate the sum of squares using this general syntax:
```
def function_name(args):
  # some calculations with the args
  return outcome
```

Second, use SciPy's `minimize()` function to find the coefficients that minimize this function.

The explanatory values (the `n_convenience` column of `taiwan_real_estate`) are available as `x_actual`. The response values (the `price_twd_msq` column of `taiwan_real_estate`) are available as `y_actual`.

`minimize()` is also loaded.

In [10]:
# # Complete the function
# def calc_sum_of_squares(coeffs):
#     # Unpack coeffs
#     intercept, slope = coeffs
#     # Calculate predicted y-values
#     y_pred = intercept + slope * x_actual
#     # Calculate differences between y_actual and y_pred
#     y_diff = y_pred - y_actual
#     # Calculate sum of squares
#     sum_sq = np.sum(y_diff ** 2)
#     # Return sum of squares
#     return sum_sq

# # Call minimize on calc_sum_of_squares  
# print(minimize(fun=calc_sum_of_squares,
#                x0=(0,0)))

# # Compare the output with the ols() call.
# print(ols("price_twd_msq ~ n_convenience", data=taiwan_real_estate).fit().params)
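
For simple linear regression, the minimum of the sum of squares also has a closed-form solution, which offers another way to check the result. Note that this is a different technique from the numerical `minimize()` approach above. The function name `fit_simple_regression` and the data are made up for illustration.

```python
def fit_simple_regression(x_values, y_values):
    """Closed-form least-squares intercept and slope for one explanatory variable."""
    n = len(x_values)
    x_mean = sum(x_values) / n
    y_mean = sum(y_values) / n
    # slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    sxx = sum((x - x_mean) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Made-up data, not the Taiwan real estate values
x_values = [0, 1, 2, 3, 4]
y_values = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_simple_regression(x_values, y_values)
print(round(intercept, 3), round(slope, 3))
```

On the same data, a numerical minimizer started from `(0, 0)` should converge to approximately these coefficients.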
  