## Conceptual

### Question 1
For each of parts (a) through (d), indicate whether we would generally
expect the performance of a flexible statistical learning method to be
better or worse than an inflexible method. \
Justify your answer. \
(a) The sample size $n$ is extremely large, and the number of predictors
$p$ is small. \
(b) The number of predictors $p$ is extremely large, and the number
of observations $n$ is small. \
(c) The relationship between the predictors and response is highly
non-linear. \
(d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\varepsilon)$, is extremely
high.


### Answers
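a) Flexible is better: with $n$ large and $p$ small, a flexible method can adapt to the shape of $f$ with little risk of overfitting. \
b) Inflexible is better: with $p \gg n$, a flexible method has too few observations to pin down its many degrees of freedom and is at high risk of overfitting. \
c) Flexible is better: an inflexible method cannot capture a highly non-linear relationship and would suffer from high bias. \
d) Inflexible is better: when $\sigma^2$ is large, a flexible method tends to fit the noise in the training data, which inflates its variance.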
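
As a rough illustration of answers (a) to (c), the sketch below fits a straight line (inflexible) and a degree-8 polynomial (flexible) to simulated data. The true function $\sin(3x)$, the noise level, the sample sizes and the helper name `fit_and_score` are all assumptions made for this example, not part of the exercise.

```python
import numpy as np

def fit_and_score(n_train, degree, noise_sd=0.3, seed=0):
    """Fit a polynomial of the given degree and return (train MSE, test MSE).

    The data-generating process below is an assumption chosen for illustration.
    """
    rng = np.random.default_rng(seed)

    def f(x):
        return np.sin(3 * x)  # assumed non-linear "true" relationship

    # Training data: the same seed gives the same sample for every degree.
    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = f(x_tr) + rng.normal(0, noise_sd, n_train)
    # Independent test data from the same process.
    x_te = rng.uniform(-1, 1, 2000)
    y_te = f(x_te) + rng.normal(0, noise_sd, 2000)

    coefs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    return train_mse, test_mse

# Degree 1 = inflexible, degree 8 = flexible.
small_n = {deg: fit_and_score(15, deg) for deg in (1, 8)}
large_n = {deg: fit_and_score(5000, deg) for deg in (1, 8)}

for name, res in (("n = 15", small_n), ("n = 5000", large_n)):
    for deg, (tr, te) in res.items():
        print(f"{name}, degree {deg}: train MSE = {tr:.3f}, test MSE = {te:.3f}")
```

The flexible fit always has the lower training MSE, since the linear model is nested in it; the test MSE is the quantity to compare when judging which method is better in each setting.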

### Question 2
Explain whether each scenario is a classification or regression problem, \
and indicate whether we are most interested in inference or prediction. \
Finally, provide $n$ and $p$. \
(a) We collect a set of data on the top $500$ firms in the US. For each
firm we record profit, number of employees, industry and the
CEO salary. \
We are interested in understanding which factors
affect CEO salary. \
(b) We are considering launching a new product and wish to know
whether it will be a success or a failure. We collect data on $20$
similar products that were previously launched. \
For each product
we have recorded whether it was a success or failure, price
charged for the product, marketing budget, competition price,
and ten other variables. \
(c) We are interested in predicting the % change in the USD/Euro
exchange rate in relation to the weekly changes in the world
stock markets. Hence we collect weekly data for all of $2012$. \
For
each week we record the % change in the USD/Euro, the %
change in the US market, the % change in the British market,
and the % change in the German market.

### Answers
a) Regression and inference, $n = 500$, $p = 3$. \
b) Classification and prediction, $n = 20$, $p = 13$. \
c) Regression and prediction, $n = 52$, $p = 3$.

### Question 4
You will now think of some real-life applications for statistical learning.  
(a) Describe three real-life applications in which classification might be useful.  
Describe the response, as well as the predictors.  
Is the goal of each application inference or prediction? Explain your answer.  
(b) Describe three real-life applications in which regression might be useful.  
Describe the response, as well as the predictors.  
Is the goal of each application inference or prediction? Explain your answer.  
(c) Describe three real-life applications in which cluster analysis might be useful.




### Answers
a) Determining whether a person is infected or not (1), whether a house has been sold or not (2), and whether an e-mail is spam or not (3).  
Predictors and response:  
(1) Data about the person, such as age, symptoms and medical history; the response is "yes" or "no".  
(2) Characteristics of the houses for sale, such as location and price; the response is "yes" or "no".  
(3) Characteristics of the e-mail, such as keywords and sender; the response is "is spam" or "not spam".  
All of the above fall into the prediction paradigm, since the interest is in the label of a new case rather than in how each predictor affects it.


b) (1) Predicting house prices.  
Response: House price (a continuous variable).  
Predictors: House size, number of bedrooms, location, age of the property, and amenities.  
Goal: Prediction.  
The objective is to estimate the price of a house based on its characteristics, focusing on accurate predictions for new houses rather than understanding the exact relationship between predictors and price.

(2) Estimating the impact of advertising on sales.  
Response: Sales (in dollars, a continuous variable).  
Predictors: Advertising budget allocated to TV, radio, and online platforms.  
Goal: Inference.  
The goal is to understand how each advertising channel impacts sales (e.g., how much an additional dollar spent on TV increases sales). This helps allocate budgets effectively.

(3) Modeling electricity consumption.  
Response: Electricity usage (in kilowatt-hours, a continuous variable).  
Predictors: Time of year, outdoor temperature, and household size.  
Goal: Prediction.  
The aim is to forecast future electricity consumption based on predictors to optimize grid management and resource allocation.

c) Clustering is useful when we want to group data based on shared characteristics or patterns, especially when there is no predefined label or category. For example, we could cluster students based on their test scores into three groups: those with high scores, those around the mean, and those with low scores.

This approach can help identify performance trends and enable targeted actions. For instance:

Students in the low-score group might benefit from additional support or tutoring.  
Students in the high-score group could be offered advanced learning opportunities.  
Identifying the group around the mean might help to evaluate the overall effectiveness of the teaching methods.
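
A minimal sketch of the test-score example, assuming a plain one-dimensional k-means with $k = 3$. The scores, the helper name `kmeans_1d` and the quantile-based initialization are hypothetical choices made for this illustration.

```python
import numpy as np

def kmeans_1d(values, k=3, n_iter=50):
    """Very small 1-D k-means; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Assumed initialization: centers at evenly spaced quantiles of the data.
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Recompute each center as the mean of its group (keep it if the group is empty).
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Made-up test scores with three visible groups.
scores = [35, 40, 42, 65, 68, 70, 72, 90, 93, 95]
labels, centers = kmeans_1d(scores)

for group in range(3):
    members = [s for s, lab in zip(scores, labels) if lab == group]
    print(f"group {group}: center = {centers[group]:.2f}, scores = {members}")
```

No labels are supplied to the algorithm; the low, middle and high groups emerge from the scores alone, which is what distinguishes clustering from classification.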

### Question 5
What are the advantages and disadvantages of a very flexible (versus
a less flexible) approach for regression or classification?  
Under what
circumstances might a more flexible approach be preferred to a less
flexible approach? When might a less flexible approach be preferred?

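### Answer
A less flexible approach is more interpretable and requires fewer observations, but it can be heavily biased when the true $f$ is far from the assumed form.  
A very flexible approach has lower bias and can fit a wider range of shapes for $f$, but it has higher variance, can overfit, requires more observations, and is harder to interpret.  
A more flexible approach is preferred when $n$ is large or when the relationship shows non-linear patterns.  
A less flexible approach is preferred when $n$ is small, when the variance of the error terms is high, or when the goal is inference.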

### Question 6
Describe the differences between a parametric and a non-parametric
statistical learning approach.  
What are the advantages of a parametric
approach to regression or classification (as opposed to a nonparametric
approach)? What are its disadvantages?

### Answer
In a parametric approach, one assumes a functional form for $f$, i.e., $Y = f(X, \beta) = f(X_1,\,\dots,\,X_p,\,\beta_1,\,\dots,\,\beta_p)$, where the $\beta_j$ are real parameters,  
and then estimates the values of the $\beta_j$ from the training data.  
This simplifies the problem, because estimating a set of parameters is easier than estimating an arbitrary function $f$.  
The disadvantage is that the chosen form will usually not match the true unknown form of $f$.  
In a non-parametric approach, one makes no explicit assumption about the form of $f$;  
instead, one seeks an estimate of $f$ that gets as close as possible to the data points.  
By not assuming a particular form, one can fit a wider range of possible shapes for $f$.  
The disadvantage is that a very large number of observations is required in order to obtain an accurate estimate of $f$.
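
To make the contrast concrete, here is a small sketch comparing a parametric fit (a straight line estimated by least squares) with a non-parametric one (K-nearest-neighbors regression) on data generated from $y = x^2$. The data and the helper names `fit_linear`, `predict_linear` and `predict_knn` are assumptions made for this illustration.

```python
import numpy as np

def fit_linear(x, y):
    """Parametric: assume f(x) = beta_0 + beta_1 * x and estimate the two parameters."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict_linear(beta, x_new):
    return beta[0] + beta[1] * np.asarray(x_new, dtype=float)

def predict_knn(x, y, x_new, k=3):
    """Non-parametric: average the responses of the k nearest training points."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    dists = np.abs(x_new[:, None] - x[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y[nearest].mean(axis=1)

# Assumed noiseless data from a non-linear function.
x = np.linspace(0, 1, 21)
y = x ** 2

beta = fit_linear(x, y)
linear_resid = predict_linear(beta, x) - y
knn_resid = predict_knn(x, y, x, k=1) - y

print("estimated (beta_0, beta_1):", beta)
print("max |residual|, linear:", np.max(np.abs(linear_resid)))
print("max |residual|, 1-NN  :", np.max(np.abs(knn_resid)))
```

The linear model summarizes the data with two numbers but cannot follow the curvature, whereas the nearest-neighbor fit follows the training data closely and has no compact set of parameters to interpret.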

In [46]:
import numpy as np
import pandas as pd

In [73]:
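df = pd.DataFrame(data = [[0, 3, 0, 'Red'], 
                        [2, 0, 0, 'Red'],
                        [0, 1, 3, 'Red'],
                        [0, 1, 2, 'Green'],
                        [- 1, 0, 1, 'Green'],
                        [1, 1, 1, 'Red']], 
                  # Number the observations 1 to 6; 'Obs' is the name of the index, not a row label
                  index = pd.Index([1, 2, 3, 4, 5, 6], name = 'Obs'),
                  columns = ['X1', 'X2', 'X3', 'Y'])

In [77]:
df

| Obs | X1 | X2 | X3 | Y |
|-----|----|----|----|-------|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Green |
| 5 | -1 | 0 | 1 | Green |
| 6 | 1 | 1 | 1 | Red |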


### Question 7
Suppose we wish to use this data set df to make a prediction for Y when  
X1 = X2 = X3 = 0 using K-nearest neighbors.  
(a) Compute the Euclidean distance between each observation and
the test point, X1 = X2 = X3 = 0.  
(b) What is our prediction with K = 1? Why?  
(c) What is our prediction with K = 3? Why?  
(d) If the Bayes decision boundary in this problem is highly nonlinear,  
then would we expect the best value for K to be large or  
small? Why?

In [87]:
def euc_norm(X_1, X_2, X_3):
    return np.sqrt(X_1 ** 2 + X_2 ** 2 + X_3 ** 2)

In [90]:
# a) 
for index, row in df.iterrows():
    X1, X2, X3 = row['X1'], row['X2'], row['X3']
    dist = euc_norm(X1, X2, X3)
    print(f"The distance between the point {(X1, X2, X3)} and the test point (0, 0, 0) is {dist:.2f}.")

The distance between the point (0, 3, 0) and the test point (0, 0, 0) is 3.00.
The distance between the point (2, 0, 0) and the test point (0, 0, 0) is 2.00.
The distance between the point (0, 1, 3) and the test point (0, 0, 0) is 3.16.
The distance between the point (0, 1, 2) and the test point (0, 0, 0) is 2.24.
The distance between the point (-1, 0, 1) and the test point (0, 0, 0) is 1.41.
The distance between the point (1, 1, 1) and the test point (0, 0, 0) is 1.73.


b) If $K = 1$, the prediction is Green, because the single nearest neighbor of the test point, $(-1, 0, 1)$ at distance $\sqrt{2} \approx 1.41$, has response Green.


c) If $K = 3$, we consider the three closest points, $(-1, 0, 1)$, $(1, 1, 1)$ and $(2, 0, 0)$, whose responses are, respectively, Green, Red and Red.  
Thus, by majority vote, the prediction is Red.  

Formally, we have
\begin{equation} 
P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \, \in \, \mathcal{N}_0} I(y_i = j),
\end{equation}
where $\mathcal{N}_0$ denotes the set of the $K$ training points closest to the test point $x_0$.  
The three points closest to the test point $x_0 = (0, 0, 0)$ are $x_5 = (-1, 0, 1)$, $x_6 = (1, 1, 1)$ and $x_2 = (2, 0, 0)$, whose responses $y_5$, $y_6$ and $y_2$ are Green, Red and Red, respectively.  
Hence, we have
$$
\begin{aligned}
\text{Green}: &\quad \frac{1}{3} \left(I(y_2 = \text{Green}) + I(y_5 = \text{Green}) + I(y_6 = \text{Green})\right) = \frac{1}{3} (0 + 1 + 0) = \frac{1}{3}, \\
\text{Red}: &\quad \frac{1}{3} \left(I(y_2 = \text{Red}) + I(y_5 = \text{Red}) + I(y_6 = \text{Red})\right) = \frac{1}{3} (1 + 0 + 1) = \frac{2}{3}.
\end{aligned}
$$
Since Red has the larger estimated probability, the prediction is Red.
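
The K-nearest-neighbors vote for this data set can also be checked numerically. The following is a minimal sketch using plain NumPy arrays; the helper names `knn_class_probs` and `knn_predict` are hypothetical and introduced only for this example.

```python
import numpy as np
from collections import Counter

# Training observations 1 to 6 from the table and their responses.
X = np.array([[0, 3, 0],
              [2, 0, 0],
              [0, 1, 3],
              [0, 1, 2],
              [-1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array(['Red', 'Red', 'Red', 'Green', 'Green', 'Red'])

def knn_class_probs(X, y, x0, k):
    """Estimate P(Y = j | X = x0) as the fraction of the k nearest points with class j."""
    dists = np.linalg.norm(X - x0, axis=1)   # Euclidean distance to each observation
    nearest = np.argsort(dists)[:k]          # indices of the k closest observations
    counts = Counter(y[nearest].tolist())
    return {label: count / k for label, count in counts.items()}

def knn_predict(X, y, x0, k):
    """Predict the class with the largest estimated probability."""
    probs = knn_class_probs(X, y, x0, k)
    return max(probs, key=probs.get)

x0 = np.zeros(3)  # test point X1 = X2 = X3 = 0
for k in (1, 3):
    print(f"K = {k}: probabilities = {knn_class_probs(X, y, x0, k)}, "
          f"prediction = {knn_predict(X, y, x0, k)}")
```

No ties occur among the distances in this data set, so the ordering of the neighbors is unambiguous; a general implementation would need a rule for breaking ties.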
