In [15]:
!pip install bambi

Collecting bambi
  Downloading bambi-0.15.0-py3-none-any.whl.metadata (8.8 kB)
Collecting formulae>=0.5.3 (from bambi)
  Downloading formulae-0.5.4-py3-none-any.whl.metadata (4.5 kB)
Downloading bambi-0.15.0-py3-none-any.whl (109 kB)
Downloading formulae-0.5.4-py3-none-any.whl (53 kB)
Installing collected packages: formulae, bambi
Successfully installed bambi-0.15.0 formulae-0.5.4


In [16]:
import numpy as np
import pandas as pd
import arviz as az
import pymc as pm
import matplotlib.pyplot as plt
import bambi as bmb

# Unit 6 Exercises: Is my model good?

#### Over and Under fitting, Model Visualization, and Model/Variable Selection Concepts

These exercises are meant to get you to think about the model and variable selection process, and consider how we determine if a model is "good".

**Task1**:

Does elpd_loo mean anything if we only have one model?

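Yes. Even with one model, elpd_loo is an estimate of how well the model predicts new data (the expected log pointwise predictive density under leave-one-out cross-validation). The number is hard to interpret in isolation because its scale depends on the data set, so it is most useful when compared with the elpd_loo of a second model fit to the same data. With one model it still means something.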

**Task2**:

Describe overfitting, in the context of this course

Overfitting is when a model fits its training data so closely that it captures noise specific to that data set. The model then performs well on the training data but poorly on new data.
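As a rough illustration of this definition, the sketch below fits a simple and a very flexible polynomial to the same small data set and scores both on fresh data. The data are simulated (hypothetical, not the course data), and plain least squares with `np.polyfit` stands in for the Bayesian models used in the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a straight-line trend plus noise.
x_train = np.linspace(-1, 1, 12)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=x_train.size)
x_new = np.linspace(-0.95, 0.95, 50)
y_new = 2.0 * x_new + rng.normal(scale=0.5, size=x_new.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    # Store (training error, error on data the fit never saw).
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_new, y_new))
    print(f"degree {degree}: train MSE = {results[degree][0]:.3f}, "
          f"new-data MSE = {results[degree][1]:.3f}")
```

The degree 9 fit can never have a higher training error than the degree 1 fit, because it contains the straight line as a special case. The gap between its training error and its new-data error is the symptom to look for.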

**Task3**:

How do we mitigate overfitting?

The best way to mitigate overfitting is by using weakly informative (regularizing) priors, which keep the parameters from chasing noise in the data. Another way to mitigate overfitting is by using a larger, more general data set. This way the model learns general trends instead of the specific quirks of a small data set.
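To see how a prior can limit overfitting, here is a minimal sketch under stated assumptions: a linear model with Normal noise and a zero-centered Normal prior on each coefficient. In that setting the posterior mode has a closed form (it is ridge regression), so no sampler is needed. The data and the value of `lam` are made up for illustration; this is not the PyMC workflow from the course.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small data set: 15 observations, 10 predictors,
# and only the first predictor actually affects the outcome.
n, p = 15, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n)


def map_coefficients(X, y, lam):
    """Posterior mode under a Normal(0, tau) prior on each coefficient.

    lam = sigma**2 / tau**2, so lam = 0 corresponds to a flat prior
    (ordinary least squares) and a larger lam to a tighter prior.
    """
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


flat = map_coefficients(X, y, lam=0.0)
regularized = map_coefficients(X, y, lam=5.0)
print("coefficient norm, flat prior:        ", np.linalg.norm(flat))
print("coefficient norm, regularizing prior:", np.linalg.norm(regularized))
```

Compared with the flat prior, the Normal prior pulls every coefficient toward zero, which limits how much the nine irrelevant predictors can contribute to the fit.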

**Task4**:

How do we mitigate underfitting?

To mitigate underfitting, the model needs to be made more flexible so that it can capture more complex trends. This can be done by increasing the complexity of the model, for example by switching from a linear regression model to a polynomial regression model.

**Task5**:

Why would we want more than one predictor in a model?

To make accurate predictions it is good to use more than one predictor. Real world outcomes usually have multiple causes, so to make predictions that are accurate to the real world we need to account for those causes by including the relevant predictors.

**Task6**:

Can we have too many predictors? How would we know?

Yes, you can have too many predictors. If you have too many predictors you can run into overfitting problems, where your model bases its predictions on predictors that look useful in the training data but do not actually affect the outcome in the real world. We could know that we have too many predictors by comparing models with and without the extra predictors using elpd_loo: if adding predictors does not improve elpd_loo, they are probably not helping the model predict new data.
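The sketch below illustrates the idea with simulated data: 15 predictors that are pure noise are added to a model with one real predictor, and the fit is scored both in-sample and by leave-one-out cross-validation. It uses ordinary least squares and squared error rather than PSIS-LOO and elpd, and every name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the outcome depends on one real predictor only.
n = 40
x_real = rng.normal(size=n)
y = 1.5 * x_real + rng.normal(scale=1.0, size=n)
noise_predictors = rng.normal(size=(n, 15))  # unrelated to y by construction


def fit_summary(X, y):
    """Return (in-sample MSE, leave-one-out MSE) for least squares on X."""
    X = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Diagonal of the hat matrix H = X @ pinv(X). For least squares the
    # leave-one-out residual is resid_i / (1 - h_ii), so no refitting is needed.
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
    loo_resid = resid / (1.0 - h)
    return float(np.mean(resid**2)), float(np.mean(loo_resid**2))


small = fit_summary(x_real[:, None], y)
large = fit_summary(np.column_stack([x_real, noise_predictors]), y)
print(f"1 predictor:   in-sample MSE = {small[0]:.3f}, LOO MSE = {small[1]:.3f}")
print(f"16 predictors: in-sample MSE = {large[0]:.3f}, LOO MSE = {large[1]:.3f}")
```

The in-sample error always drops when predictors are added, so it cannot reveal the problem. The leave-one-out error is the number to watch, which is the same reasoning behind comparing models on elpd_loo.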

**Task7**:

What is variable selection, and how does it work?

Variable selection is the process of choosing which predictors to use in your model. It works by identifying which predictors actually affect the real world outcome and keeping only those predictors in your model.

**Task8**:

Describe the differences and similarities between the following three models: linear regression with two predictors, one of which is a categorical variable:

- adding the variables in the model, as is standard.
- using that categorical variable as a hierarchy upon the other predictor variable.
- adding the variables, plus the categorical variable's interaction with the other variable.

Differences:
The models differ in how they treat the slope of the numeric predictor. The standard (additive) model uses a single slope for the numeric predictor across all categories, and the categorical variable only shifts the intercept. The hierarchical model and the interaction model both allow a different slope for each category. The models also differ in how they represent the interaction between the two predictors. The standard model does not model the interaction at all. The hierarchical model models it implicitly, through group-level slopes that share a common prior. The interaction model models it explicitly, with a separate interaction term.
Similarities:
All the models can be used to explain the relationship between the outcome and the two predictors. All of them handle a categorical variable and are linear. They can all be extended to handle more variables.
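The slope difference between the additive and interaction models can be sketched with plain least squares on simulated data. This is only an illustration under assumptions: the data, the 0/1 coding of the categorical variable, and the coefficient values are all made up. The hierarchical model is left out because it needs a sampler. It would place a shared prior on the group slopes and pull them toward each other.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: two groups whose true slopes differ (1.0 vs 3.0).
n = 200
x = rng.uniform(0, 1, size=n)
g = rng.integers(0, 2, size=n)  # categorical predictor coded 0/1
y = 0.5 + 1.0 * x + 0.2 * g + 2.0 * x * g + rng.normal(scale=0.1, size=n)

ones = np.ones(n)
X_additive = np.column_stack([ones, x, g])            # y ~ x + g
X_interaction = np.column_stack([ones, x, g, x * g])  # y ~ x + g + x:g

b_add, *_ = np.linalg.lstsq(X_additive, y, rcond=None)
b_int, *_ = np.linalg.lstsq(X_interaction, y, rcond=None)

# Additive model: one shared slope; the category only shifts the intercept.
print("additive slope (both groups):", b_add[1])
# Interaction model: group 1's slope is the base slope plus the interaction term.
slope_g0, slope_g1 = b_int[1], b_int[1] + b_int[3]
print("interaction slopes (group 0, group 1):", slope_g0, slope_g1)
```

The additive model is forced to report a single compromise slope that lies between the two group slopes, while the interaction model recovers a separate slope for each group.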

**Task9**:

How do we visualize multiple linear regression models? Can we visualize the entire model, all at once?

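You cannot visualize the entire model at once when it has more than a couple of predictors, because each predictor adds another dimension. You can still visualize multiple linear regression models piece by piece, for example with scatter plots of the outcome against one predictor at a time (optionally split by a categorical predictor) and bar graphs for the categorical predictors.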

**Task10**:

Compare the following linear models that all use the basketball data to predict field goal percentage:

- predictors free throw percentage and position (with position as a categorical predictor)
- predictors free throw percentage and position (with position as a hierarchy)
- predictors free throw percentage and position (with position interacting with free throw percentage)
- predictors free throw percentage, position, 3 point attempts, and interactions between all three predictors
- predictors free throw percentage, position, 3 point attempts, with an interaction between 3 point attempts and position.

using ```az.compare()``` and ```az.plot_compare()```, or an equivalent method using LOO (elpd_loo).

You may use the following two code blocks to load and clean the data.

In [4]:
#have to drop incomplete rows, so that bambi will run
bb = pd.read_csv(
    'https://raw.githubusercontent.com/thedarredondo/data-science-fundamentals/refs/heads/main/Data/basketball2324.csv').dropna()

In [5]:
#only look at players who played more than 600 minutes
#which is 20 min per game, for 30 games
bb = bb.query('MP > 600')
#remove players who never missed a free throw
bb = bb.query('`FT%` != 1.0')
#filter out the combo positions. This will make it easier to read the graphs
bb = bb.query("Pos in ['C','PF','SF','SG','PG']")
#gets rid of the annoying '%' sign
#reassign instead of inplace=True, which triggers a SettingWithCopyWarning on the filtered frame
bb = bb.rename(columns={"FT%":"FTp","FG%":"FGp"})


In [18]:
model_1 = bmb.Model("'FGp' ~ 'FTp' + Pos", data=bb)
idata_model1 = model_1.fit(idata_kwargs={'log_likelihood': True})

Output()

Output()

In [20]:
model_2 = bmb.Model("'FGp' ~ 'FTp' + (1|Pos)", data=bb)
idata_model2 = model_2.fit(idata_kwargs={'log_likelihood': True})

Output()

Output()

ERROR:pymc.stats.convergence:There were 10 divergences after tuning. Increase `target_accept` or reparameterize.


In [21]:
model_3 = bmb.Model("'FGp' ~ 'FTp' * Pos", data=bb)
idata_model3 = model_3.fit(idata_kwargs={'log_likelihood': True})

Output()

Output()

In [22]:
model_4 = bmb.Model("'FGp' ~ 'FTp' * Pos * '3PA'", data=bb)
idata_model4 = model_4.fit(idata_kwargs={'log_likelihood': True})

Output()

Output()

In [23]:
model_5 = bmb.Model("'FGp' ~ 'FTp' + Pos * '3PA'", data=bb)
idata_model5 = model_5.fit(idata_kwargs={'log_likelihood': True})

Output()

Output()

In [25]:
compare_models = az.compare({"model1":idata_model1, "model2":idata_model2, "model3":idata_model3, "model4":idata_model4, "model5":idata_model5})
compare_models



| | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
|---|---|---|---|---|---|---|---|---|---|
| model5 | 0 | 529.787424 | 12.723274 | 0.0 | 0.6007315 | 15.836023 | 0.0 | False | log |
| model4 | 1 | 529.524352 | 22.279539 | 0.263072 | 0.3992685 | 14.373571 | 6.134473 | True | log |
| model3 | 2 | 509.032086 | 14.19606 | 20.755338 | 1.459178e-15 | 16.827788 | 7.644392 | True | log |
| model1 | 3 | 507.320487 | 8.299099 | 22.466937 | 0.0 | 16.083419 | 6.783529 | False | log |
| model2 | 4 | 507.297724 | 8.339864 | 22.489699 | 0.0 | 16.069981 | 6.860257 | False | log |


**Task11**:

Which model is "better" according to this metric?

Why do you think that is?

According to the metric, model5 is the best model, although model4 is nearly tied with it (an elpd_diff of about 0.26). Model5 is the model where the predictors are FTp, Pos, and 3PA, with an interaction between Pos and 3PA. This is probably because there is a strong relationship between Pos and 3PA, where bigger forwards and centers attempt fewer 3 pointers.
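To gauge how decisive the ranking is, one rough check is to divide each model's elpd_diff by its dse, the standard error of that difference. The sketch below does this with the values copied from the `az.compare()` table above. Treating a difference of less than about two standard errors as inconclusive is a rule of thumb assumed here, not something the table itself states.

```python
# (elpd_diff, dse) for each model relative to model5, copied from the table above.
comparison = {
    "model4": (0.263072, 6.134473),
    "model3": (20.755338, 7.644392),
    "model1": (22.466937, 6.783529),
    "model2": (22.489699, 6.860257),
}

# How many standard errors each model sits behind the top-ranked model.
ratios = {name: diff / dse for name, (diff, dse) in comparison.items()}
for name, ratio in ratios.items():
    print(f"{name}: elpd_diff is {ratio:.2f} standard errors behind model5")
```

By this check model4 is indistinguishable from model5, while models 1 to 3 are clearly behind. The `warning` column is also `True` for model4 and model3, so their LOO estimates deserve extra caution.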