
# Unit 6 Exercises: Is my model good?

#### Over and Under fitting, Model Visualization, and Model/Variable Selection Concepts

These exercises are meant to get you to think about the model and variable selection process, and consider how we determine if a model is "good".

**Task1**:

Does elpd_loo mean anything if we only have one model?

Not much. elpd_loo estimates a model's out-of-sample predictive performance, but its value depends on the scale of the data and the number of observations, so a single number has no absolute interpretation. It becomes informative when it is compared against the elpd_loo of other models fit to the same data.
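To make the quantity concrete, here is a minimal sketch of a brute-force leave-one-out log predictive density for a toy normal model. It is a simplification: it uses plug-in estimates of the mean and standard deviation rather than posterior draws, and it does not use the PSIS approximation that ArviZ relies on. The data values and the helper names (`normal_logpdf`, `brute_force_elpd_loo`) are made up for illustration.

```python
import math
import statistics

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def brute_force_elpd_loo(values):
    """Sum of log predictive densities, each point scored by a fit to the others."""
    total = 0.0
    for i, held_out in enumerate(values):
        rest = values[:i] + values[i + 1:]
        mu = statistics.mean(rest)
        sigma = statistics.stdev(rest)
        total += normal_logpdf(held_out, mu, sigma)
    return total

# Made-up field goal percentages, not taken from the course data set.
fg_pct = [0.45, 0.47, 0.52, 0.44, 0.49, 0.55, 0.46, 0.50]

elpd_original = brute_force_elpd_loo(fg_pct)
# Same data and same model, but expressed in percent instead of proportions.
elpd_rescaled = brute_force_elpd_loo([100 * v for v in fg_pct])

print(elpd_original, elpd_rescaled)
```

The two numbers differ only because the units changed, which is why a lone elpd value says little about whether a model is good.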

**Task2**:

Describe overfitting, in the context of this course

Overfitting refers to a scenario where a model learns not just the underlying patterns in the data but also the random noise or fluctuations. As a result, the model fits the training data very well but fails to generalize to unseen data. In this course, overfitting typically happens when the model becomes too complex, for example by including too many predictors.
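A small sketch of this behavior, assuming made-up data that follow a straight line plus noise. It uses ordinary least-squares polynomial fits from NumPy rather than the Bayesian models from the course, purely to show the pattern of training error versus held-out error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Made-up data: a straight line plus noise, split into training and held-out points.
x_train = np.linspace(-1, 1, 12)
x_test = np.linspace(-0.95, 0.95, 12)
y_train = 0.5 * x_train + rng.normal(0, 0.2, x_train.size)
y_test = 0.5 * x_test + rng.normal(0, 0.2, x_test.size)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 9):
    fit = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse[degree] = mse(fit, x_train, y_train)
    test_mse[degree] = mse(fit, x_test, y_test)
    print(f"degree {degree}: train MSE {train_mse[degree]:.4f}, test MSE {test_mse[degree]:.4f}")
```

The degree-9 polynomial can never have a larger training error than the line, because the line is a special case of it. Its held-out error is typically worse, which is the signature of overfitting.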

**Task3**:

How do we mitigate overfitting?

To mitigate overfitting, you can start by simplifying the model. You can do this by reducing the number of predictors or using a less complex model, such as linear regression instead of polynomial regression.

**Task4**:

How do we mitigate underfitting?

To mitigate underfitting, you may need to increase the complexity of your model. This could involve using more predictors or using non-linear models if the relationship between the variables is not well captured by a simple linear model.
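As a rough illustration, the sketch below fits a straight line and a quadratic to made-up data with a curved relationship. It uses least-squares fits from NumPy, not the course's Bayesian models, and the data are invented for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Made-up data with a curved (quadratic) relationship plus a little noise.
x = np.linspace(-1, 1, 50)
y = x**2 + rng.normal(0, 0.05, x.size)

linear = Polynomial.fit(x, y, 1)      # too simple for this data
quadratic = Polynomial.fit(x, y, 2)   # matches the shape of the relationship

mse_linear = float(np.mean((linear(x) - y) ** 2))
mse_quadratic = float(np.mean((quadratic(x) - y) ** 2))

print(f"linear MSE: {mse_linear:.4f}, quadratic MSE: {mse_quadratic:.4f}")
```

The line misses the curvature entirely, so its error stays large even on the data it was fit to. Adding the quadratic term removes most of that error.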

**Task5**:

Why would we want more than one predictor in a model?

Using multiple predictors in a model allows us to account for more factors that might influence the outcome. With multiple predictors, the model can capture more complex relationships, leading to improved accuracy in predicting the dependent variable.

**Task6**:

Can we have too many predictors? How would we know?

Yes, having too many predictors can lead to problems such as overfitting. If a model has too many predictors, it may start fitting noise in the data, resulting in poor generalization to unseen data. Another problem is where predictors are highly correlated with each other, making it difficult to determine the individual effect of each predictor.
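One quick way to spot highly correlated predictors is a correlation matrix. The sketch below uses made-up per-game numbers (not the course data set) in which three-point makes closely track three-point attempts.

```python
import numpy as np

# Made-up per-game values for eight hypothetical players.
three_pt_attempts = np.array([1.0, 2.5, 3.0, 4.2, 5.5, 6.1, 7.8, 8.0])
three_pt_makes = np.array([0.3, 0.9, 1.1, 1.5, 2.1, 2.2, 2.9, 2.8])
free_throw_pct = np.array([0.71, 0.88, 0.65, 0.80, 0.77, 0.90, 0.69, 0.84])

# One column per predictor; rowvar=False tells NumPy that columns are variables.
predictors = np.column_stack([three_pt_attempts, three_pt_makes, free_throw_pct])
corr = np.corrcoef(predictors, rowvar=False)

print(np.round(corr, 2))
```

An off-diagonal entry close to 1 or -1, such as the one between attempts and makes here, suggests that the two predictors carry nearly the same information and that their individual effects will be hard to separate.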

**Task7**:

What is variable selection, and how does it work?

Variable selection is the process of identifying which predictors should be included in a model. It aims to find the most relevant predictors that contribute significantly to the outcome while avoiding unnecessary complexity. It works by evaluating each feature’s relevance to the target variable and then selecting the most important features for the analysis.
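One simple version of this idea is to score each candidate set of predictors by how well it predicts held-out points. The sketch below does that with brute-force leave-one-out squared error on made-up data. Squared error stands in for elpd_loo here, and the helper name `loo_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Made-up data: y depends on x1 only; x2 is pure noise.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(0, 0.1, n)

candidates = {"x1": [x1], "x2": [x2], "x1 + x2": [x1, x2]}

def loo_mse(columns, y):
    """Leave-one-out mean squared error of a least-squares linear model."""
    X = np.column_stack([np.ones(len(y)), *columns])
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # every row except the held-out one
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errors))

scores = {name: loo_mse(cols, y) for name, cols in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Lower is better for this score. The model that uses only the irrelevant predictor does far worse than the one that uses the relevant predictor.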

**Task8**:

Describe the differences and similarities between the following three models: linear regression with two predictors, one of which is a categorical variable:

- adding the variables in the model, as is standard.
- using that categorical variable as a hierarchy upon the other predictor variable.
- adding the variables, plus the categorical variable's interaction with the other variable.

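Adding the variables as is: You include both predictors (one continuous and one categorical) as separate additive terms. The categorical variable shifts the intercept for each category, while the slope of the continuous predictor is the same for every category.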

Using the categorical variable as a hierarchy: Each level of the categorical variable is treated as a group with its own parameters (for example, its own intercept or its own slope for the continuous predictor), and those group-level parameters are drawn from a shared distribution. The groups therefore inform one another instead of being estimated in isolation.

Adding the interaction: You include both predictors as well as their interaction, meaning the effect of one predictor on the outcome may depend on the level of the other predictor, capturing more complex relationships between them.

All three models involve using the same two predictors (one continuous and one categorical), but they differ in how they account for their relationship:

- The first treats them separately.
- The second introduces a structured relationship.
- The third considers how the predictors influence the outcome together.
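The difference between the additive and the interaction models can be seen in their design matrices. The sketch below uses made-up, noise-free data with two groups whose true slopes differ, fit by ordinary least squares. The hierarchical version is not shown, since it needs a Bayesian fit to pool information across groups.

```python
import numpy as np

# Made-up data: two groups (A and B) observed at the same five x values.
x = np.tile(np.linspace(0.0, 1.0, 5), 2)
is_b = np.repeat([0.0, 1.0], 5)                 # dummy variable: 0 = group A, 1 = group B
y = np.where(is_b == 0, 1 + 2 * x, 3 - 1 * x)   # A: y = 1 + 2x, B: y = 3 - x

ones = np.ones_like(x)
X_additive = np.column_stack([ones, x, is_b])              # y ~ x + group
X_interaction = np.column_stack([ones, x, is_b, x * is_b])  # y ~ x * group

beta_additive, *_ = np.linalg.lstsq(X_additive, y, rcond=None)
beta_interaction, *_ = np.linalg.lstsq(X_interaction, y, rcond=None)

print("additive    [intercept, slope, B shift]:", np.round(beta_additive, 3))
print("interaction [intercept, slope, B shift, B slope change]:", np.round(beta_interaction, 3))
```

The additive model is forced to give both groups one shared slope, while the extra `x * is_b` column lets the interaction model recover a separate slope for group B.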

**Task9**:

How do we visualize multiple linear regression models? Can we visualize the entire model, all at once?

Visualizing a multiple linear regression model with many predictors is difficult because the model is high-dimensional, meaning it has more than two or three variables. But you can still visualize parts of the model.

**Task10**:

Compare the following linear models that all use the basketball data to predict field goal percentage:

- predictors free throw percentage and position (with position as a categorical predictor)
- predictors free throw percentage and position (with position as a hierarchy)
- predictors free throw percentage and position (with position interacting with free throw percentage)
- predictors free throw percentage, position, 3 point attempts, and interactions between all three predictors
- predictors free throw percentage, position, 3 point attempts, with an interaction between 3 point attempts and position.

using ```az.compare()``` and ```az.plot_compare()```, or an equivalent method using LOO (elpd_loo).

**You** may use the following two code blocks to load and clean the data.

In [5]:
import pandas as pd

#have to drop incomplete rows, so that bambi will run
bb = pd.read_csv(
    'https://raw.githubusercontent.com/thedarredondo/data-science-fundamentals/refs/heads/main/Data/basketball2324.csv').dropna()

In [6]:
#only look at players who played more than 600 minutes
#which is 20 min per game, for 30 games
bb = bb.query('MP > 600')
#remove players who never missed a free throw
bb = bb.query('`FT%` != 1.0')
#filter out the combo positions. This will make it easier to read the graphs
bb = bb.query("Pos in ['C','PF','SF','SG','PG']")
#gets rid of the annoying '%' sign
bb.rename(columns={"FT%":"FTp","FG%":"FGp"}, inplace=True)

In [8]:
import xarray as xr


In [9]:
!pip install bambi



In [10]:
import bambi as bmb

# Model with position as a categorical predictor
model1 = bmb.Model("FGp ~ FTp + Pos", data=bb)
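# Store the pointwise log-likelihood so az.compare() can compute elpd_loo
trace1 = model1.fit(idata_kwargs={"log_likelihood": True})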


In [11]:
# Model with position as a hierarchical predictor
model2 = bmb.Model("FGp ~ FTp + (1|Pos)", data=bb)
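# Store the pointwise log-likelihood so az.compare() can compute elpd_loo
trace2 = model2.fit(idata_kwargs={"log_likelihood": True})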

ERROR:pymc.stats.convergence:There were 170 divergences after tuning. Increase `target_accept` or reparameterize.
ERROR:pymc.stats.convergence:The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details


In [12]:
# Model with interaction between FT% and Pos
model3 = bmb.Model("FGp ~ FTp * Pos", data=bb)
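# Store the pointwise log-likelihood so az.compare() can compute elpd_loo
trace3 = model3.fit(idata_kwargs={"log_likelihood": True})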


In [13]:
# Model with FT%, Pos, 3PA and all interactions
# The task asks for 3 point attempts, so use the `3PA` column rather than `3P`
model4 = bmb.Model("FGp ~ FTp * Pos * `3PA`", data=bb)
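# Store the pointwise log-likelihood so az.compare() can compute elpd_loo
trace4 = model4.fit(idata_kwargs={"log_likelihood": True})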


In [14]:
# Model with FT%, Pos, 3PA and interaction between Pos and 3PA
# The task asks for 3 point attempts, so use the `3PA` column rather than `3P`
model5 = bmb.Model("FGp ~ FTp + Pos + `3PA` + Pos:`3PA`", data=bb)
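# Store the pointwise log-likelihood so az.compare() can compute elpd_loo
trace5 = model5.fit(idata_kwargs={"log_likelihood": True})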


In [16]:
import numpy as np
import pandas as pd
import arviz as az
import pymc as pm
import matplotlib.pyplot as plt

In [18]:
# Store the models and their traces in a dictionary
model_traces = {
    "Model 1: FTp + Pos": trace1,
    "Model 2: FTp + (1|Pos)": trace2,
    "Model 3: FTp * Pos": trace3,
    "Model 4: FTp * Pos * 3PA": trace4,
    "Model 5: FTp + Pos + 3PA + Pos:3PA": trace5,
}


In [20]:
# Perform LOO comparison using az.compare()
cmp_df = az.compare(model_traces)

# Display the comparison results
print(cmp_df)

# Plot the comparison of models
az.plot_compare(cmp_df)


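TypeError: Encountered error in ELPD computation of compare.

This error is most likely raised because the traces have no `log_likelihood` group, which `az.compare()` needs in order to compute elpd_loo. Bambi stores the pointwise log-likelihood only when a model is fit with `model.fit(idata_kwargs={"log_likelihood": True})`, so the comparison can run only on traces produced that way.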
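One way to check which traces `az.compare()` can work with is to look for a `log_likelihood` group in each one before comparing. The sketch below assumes each trace exposes a `groups()` method, as ArviZ `InferenceData` objects do. The helper `traces_missing_log_likelihood` and the `FakeTrace` stand-in are hypothetical and exist only so the example runs without sampling.

```python
def traces_missing_log_likelihood(traces):
    """Return the names of traces that have no log_likelihood group."""
    return [
        name for name, idata in traces.items()
        if "log_likelihood" not in idata.groups()
    ]

class FakeTrace:
    """Hypothetical stand-in for az.InferenceData, used only for this demo."""
    def __init__(self, groups):
        self._groups = groups

    def groups(self):
        return self._groups

demo_traces = {
    "with log-likelihood": FakeTrace(["posterior", "log_likelihood", "observed_data"]),
    "without log-likelihood": FakeTrace(["posterior", "observed_data"]),
}

missing = traces_missing_log_likelihood(demo_traces)
print(missing)
```

In the notebook, the same helper could be called as `traces_missing_log_likelihood(model_traces)`. Any name it returns belongs to a model that would need to be refit with the log-likelihood stored.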

**Task11**:

Which model is "better" according to this metric?

Why do you think that is?