# Regression Refresher Exercises

## Exercise 1: Create a basic linear regression model

You're asked to build a linear regression model to predict the weight of the fish based on their width. Take the following points into consideration while building your model.

* Are transformations needed? Does it make sense to... 
    * Transform the predictor?
    * Transform the response?
    * Both?
* How many lines of code is it?
* How does the fit of this model compare to the fit of the simple model we created in the lesson?

In addition, this is an opportunity to be creative. You can try other transformations and custom visualizations as well.

Have fun!

## Exercise 2: What if we only use the species of fish?

In "Section 70: Accounting for species" we included the species variable as a predictor in a model that already had a numerical predictor. Mathematically, it's the following model:

$$
\begin{aligned}
\beta_{0,j} & \sim \text{Normal}(0, \sigma_{\beta_0}) \\
\beta_{1,j} & \sim \text{Normal}(0, \sigma_{\beta_1}) \\
\sigma & \sim \text{HalfNormal}(\sigma_\varepsilon) \\
\mu_{i, j} & = \beta_{0,j} + \beta_{1,j} \log{(\text{Length}_{i, j})} \\
\log{(\text{Weight}_{i,j})} & \sim \text{Normal}(\mu_{i, j}, \sigma)
\end{aligned}
$$

for $j=1, \cdots, 7$.

In PyMC code, this is:

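```python
import numpy as np
import pymc as pm

# `data` is the fish DataFrame loaded in the lesson
log_length = np.log(data["Length1"].to_numpy())
log_weight = np.log(data["Weight"].to_numpy())

# `species` holds the unique species names; `species_idx` maps each fish to its species
species, species_idx = np.unique(data["Species"], return_inverse=True)
coords = {"species": species}

with pm.Model(coords=coords) as model:
    # One intercept and one slope per species
    β0 = pm.Normal("β0", mu=0, sigma=5, dims="species")
    β1 = pm.Normal("β1", mu=0, sigma=5, dims="species")
    sigma = pm.HalfNormal("sigma", sigma=5)
    # Pick the intercept and slope that match each fish's species
    mu = β0[species_idx] + β1[species_idx] * log_length
    pm.Normal("log(weight)", mu=mu, sigma=sigma, observed=log_weight)
```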
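The indexing `β0[species_idx]` relies on how `np.unique(..., return_inverse=True)` encodes the labels. The toy example below shows the mechanism with plain NumPy arrays. The species labels and intercept values are made up for illustration and are not taken from the fish data.

```python
import numpy as np

# Made-up labels standing in for data["Species"]
labels = np.array(["Pike", "Bream", "Pike", "Roach", "Bream"])

# `species` holds the sorted unique labels;
# `species_idx` gives, for each row, the position of its label in `species`
species, species_idx = np.unique(labels, return_inverse=True)

# Hypothetical per-species intercepts, one value per entry of `species`
beta0 = np.array([1.0, 2.0, 3.0])

# Integer-array indexing expands the per-species values to one value per row
beta0_per_row = beta0[species_idx]

print(species)        # ['Bream' 'Pike' 'Roach']
print(species_idx)    # [1 0 1 2 0]
print(beta0_per_row)  # [2. 1. 2. 3. 1.]
```

The same expansion happens inside the model: each observation picks up the parameter that belongs to its own species.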

However, we didn't cover what a model that includes the species, but not the length, would look like. That's what this exercise is about!

You have to create a linear regression model using only the species of fish as a predictor. Answer the following questions:

* Is there a slope parameter? 
* How many intercept parameters does the model have? Is it one, or more than one? Why?
* What is the meaning of the intercept parameter(s)?
* Is it necessary to transform the response variable?
* What's the difference between this model and the intercept-only model?

## Exercise 3: Multiple intercepts, but a single slope

The model we created in "Section 70: Accounting for species" uses varying intercepts and slopes. That is, every species has its own intercept and its own slope. We did this because it was the most flexible approach. However, when we analyzed the posterior estimates, we noticed that the slope posteriors were all quite similar, meaning the regression lines for the species were approximately parallel. Because of this, it makes sense to use a single slope parameter instead of one per species, which slightly reduces the complexity of the model.

The goal of this exercise is to write a regression model with unpooled intercepts (one intercept per species) but a completely pooled slope, that is, a single common slope for all species. Consider the following points when solving the exercise:

* Perform the same train-test split as in "Section 80: New fish arrive".
* Build the model with a single slope, but multiple intercepts, using the train dataset.
* Predict the weight of the fish in the test set.
* Compare the predictions obtained here with the ones obtained in "Section 80: New fish arrive".

Also, do you notice any difference in sampling speed? Why? Is that what you were expecting?
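As a starting point, and assuming you keep the same priors and log transformations as in Exercise 2, one way to write this model in the same notation is shown below. The only change is that the slope $\beta_1$ loses its species index $j$:

$$
\begin{aligned}
\beta_{0,j} & \sim \text{Normal}(0, \sigma_{\beta_0}) \\
\beta_1 & \sim \text{Normal}(0, \sigma_{\beta_1}) \\
\sigma & \sim \text{HalfNormal}(\sigma_\varepsilon) \\
\mu_{i, j} & = \beta_{0,j} + \beta_1 \log{(\text{Length}_{i, j})} \\
\log{(\text{Weight}_{i,j})} & \sim \text{Normal}(\mu_{i, j}, \sigma)
\end{aligned}
$$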

## Exercise 4: Test your skills with a brand new problem!

You are the data scientist in a research team at a large construction company. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate concrete strength so that it's possible to predict how a particular sample is expected to perform.

### The data

The team has already tested more than a thousand samples ([source](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)), and the following variables were measured and recorded:

* **cement** - Portland cement in kg/m3
* **slag** - Blast furnace slag in kg/m3
* **fly_ash** - Fly ash in kg/m3
* **water** - Water in liters/m3
* **superplasticizer** - Superplasticizer additive in kg/m3
* **coarse_aggregate** - Coarse aggregate (gravel) in kg/m3
* **fine_aggregate** - Fine aggregate (sand) in kg/m3
* **age** - Age of the sample in days
* **strength** - Concrete compressive strength in megapascals (MPa)

### The challenge

This is the initial iteration of the modeling process, so we are not using all the variables in the dataset. You're asked to provide your project leader with a formula that estimates the compressive strength based on **cement** and **water**. 

Estimate the following regression model:

$$
\text{Concrete Strength} = \beta_0  + \beta_1 \text{cement} + \beta_2 \text{water}
$$

Compute the strength of concrete for all the combinations of the following water and cement values:

* Cement: 300, 400, 500 kg/m3
* Water: 140, 160, 180 liters/m3
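Enumerating the combinations above can be sketched as follows. The coefficient values `b0`, `b1`, and `b2` and the helper `predict_strength` are hypothetical placeholders, not estimates from the concrete data; swap in values from your own fitted model.

```python
from itertools import product

# Values requested in the exercise
cement_values = [300, 400, 500]  # kg/m3
water_values = [140, 160, 180]   # liters/m3

# HYPOTHETICAL placeholder coefficients, NOT estimates from the concrete data.
# Replace them with, for example, posterior means from your fitted model.
b0, b1, b2 = 0.0, 1.0, -1.0


def predict_strength(cement, water):
    """Evaluate the linear formula beta_0 + beta_1 * cement + beta_2 * water."""
    return b0 + b1 * cement + b2 * water


# All 3 x 3 = 9 combinations of cement and water
grid = list(product(cement_values, water_values))
predictions = {(c, w): predict_strength(c, w) for c, w in grid}

for (c, w), strength in predictions.items():
    print(f"cement={c}, water={w} -> strength={strength}")
```

Plugging in point estimates like this gives a single number per combination. With a Bayesian model you may prefer to evaluate the formula on every posterior draw, so that each prediction comes with its uncertainty.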

### Citations

The data for this exercise originally comes from:

* I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).