# 2.4 Exercises

## Conceptual

### 1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

#### (a) The sample size _n_ is extremely large, and the number of predictors _p_ is small.

Flexible models generally have lower bias but are susceptible to high variance.  With a very large sample and few predictors, overfitting is less of a concern, so we would expect a flexible model to perform better in this case.

#### (b) The number of predictors _p_ is extremely large, and the number of observations _n_ is small.

An inflexible model would be preferable in this case, as overfitting is a major concern when the sample size is small and the number of predictors is large.

#### (c) The relationship between the predictors and response is highly non-linear.

Because the relationship between the predictors and the response is highly non-linear, we would expect a flexible model to perform better.  A linear model would most likely have very high bias.

#### (d) The variance of the error terms, i.e. $\sigma^2 = Var(\epsilon)$, is extremely high.

With a very noisy dataset like this, an inflexible model would be expected to perform better.  A more flexible model would be very susceptible to overfitting and would not generalize well to new data.

### 2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide _n_ and _p_.

#### (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

Regression  
Inference  
n = 500  
p = 3 (profit, number of employees, industry)

#### (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a _success_ or _failure_, price charged for the product, marketing budget, competition price, and ten other variables.

Classification  
Prediction  
n = 20  
p = 13 (price charged for the product, marketing budget, competition price, ten other variables)

#### (c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

Regression  
Prediction  
n = 52  
p = 3 (the % change in the US market, the % change in the British market, and the % change in the German market; the % change in the USD/Euro is the response)

### 3. We now revisit the bias-variance decomposition.

#### (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

#### (b) Explain why each of the five curves has the shape displayed in part (a).

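Squared bias - Decreases monotonically as flexibility (degrees of freedom) increases, because a more flexible method can approximate the true _f_ more closely.  
Variance - Increases monotonically as flexibility increases, because a more flexible fit changes more when the training data change.  
Training error - Decreases monotonically as flexibility increases, because a more flexible method can follow the training data more closely.  
Test error - U-shaped: it decreases at first while the drop in bias dominates, then starts increasing again once the rise in variance dominates (overfitting).  
Bayes error - Constant: it is the irreducible error $Var(\epsilon)$, which does not depend on the method and is a lower bound on the expected test error.  It is almost always unknown in practice.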
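
The shape of the test error curve can be read off the standard bias-variance decomposition of the expected test MSE at a point $x_0$. The notation here ($\hat{f}$ for the fitted model, $y_0$ for the response observed at $x_0$) is assumed for illustration:

$$
E\left(y_0 - \hat{f}(x_0)\right)^2 = Var\left(\hat{f}(x_0)\right) + \left[Bias\left(\hat{f}(x_0)\right)\right]^2 + Var(\epsilon)
$$

As flexibility increases, the variance term typically grows and the squared bias term typically shrinks, while $Var(\epsilon)$ stays fixed. Their sum is therefore U-shaped and never falls below $Var(\epsilon)$.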

### 4. You will now think of some real-life applications for statistical learning.

#### (a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

1. Mortgage lending.  Response is whether or not a mortgage will default.  Predictors might include salary, loan amount, credit history, etc.  The goal is prediction.  
2. Stock market.  Response is whether the stock market will go up or down.  Predictors might be market performance on previous days.  The goal is prediction.  
3. Heart disease.  Response is whether or not a person has heart disease.  Predictors might include age, sex, level of activity, diet, etc.  The goal is inference.

#### (b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

1. Fantasy baseball.  Response is number of fantasy baseball points.  Predictors might be historical on-base percentage, slugging percentage, handedness of opposing pitcher, etc.  The goal is prediction.  
2. Beer quality.  Response is the beer's rating score.  Predictors might be things like ABV, type of hops, style of beer, etc.  The goal is inference.  
3. Housing market.  Response is the sale price of a house.  Predictors might be square footage, number of bedrooms and bathrooms, neighborhood.  The goal is prediction.

#### (c) Describe three real-life applications in which cluster analysis might be useful.

1. Retail user segmentation.  
2. Grouping similar movies for a recommender system.  
3. Cancer clustering.

### 5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

A very flexible approach to regression or classification can achieve very low bias, but it may overfit and generalize poorly (high variance), and it is harder to interpret.  A flexible approach is preferred when the relationship between the predictors and the response is non-linear, the sample size is large, and prediction matters more than inference.  A less flexible approach is preferred when the relationship is close to linear, the sample size is small, or inference is important.

### 6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

### 7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.
<table>
    <tr>
        <th>Obs.</th>
        <th>X1</th>
        <th>X2</th>
        <th>X3</th>
        <th>Y</th>
    </tr>
    <tr>
        <td>1</td>
        <td>0</td>
        <td>3</td>
        <td>0</td>
        <td>Red</td>
    </tr>
    <tr>
        <td>2</td>
        <td>2</td>
        <td>0</td>
        <td>0</td>
        <td>Red</td>
    </tr>
    <tr>
        <td>3</td>
        <td>0</td>
        <td>1</td>
        <td>3</td>
        <td>Red</td>
    </tr>
    <tr>
        <td>4</td>
        <td>0</td>
        <td>1</td>
        <td>2</td>
        <td>Green</td>
    </tr>
    <tr>
        <td>5</td>
        <td>−1</td>
        <td>0</td>
        <td>1</td>
        <td>Green</td>
    </tr>
    <tr>
        <td>6</td>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>Red</td>
    </tr>
</table>

### Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

#### (a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.
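
One way to check this calculation is a short Python sketch using only the standard library. The names `observations`, `test_point`, and `euclidean` are illustrative and not part of the exercise.

```python
import math

# Training observations copied from the table: (X1, X2, X3, Y).
observations = [
    (0, 3, 0, "Red"),
    (2, 0, 0, "Red"),
    (0, 1, 3, "Red"),
    (0, 1, 2, "Green"),
    (-1, 0, 1, "Green"),
    (1, 1, 1, "Red"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Distance from each observation to the test point, in table order.
distances = [euclidean(obs[:3], test_point) for obs in observations]

for i, (obs, d) in enumerate(zip(observations, distances), start=1):
    print(f"Obs. {i}: distance = {d:.3f} ({obs[3]})")
```

For observations 1 through 6 the distances are 3, 2, $\sqrt{10} \approx 3.162$, $\sqrt{5} \approx 2.236$, $\sqrt{2} \approx 1.414$, and $\sqrt{3} \approx 1.732$.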

#### (b) What is our prediction with K = 1? Why?

#### (c) What is our prediction with K = 3? Why?
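
The predictions for K = 1 and K = 3 can be checked with a minimal majority-vote sketch in Python. The helper `knn_predict` and the variable names are hypothetical, written for this exercise only, and the sketch does no special tie-breaking because no ties occur in this data set.

```python
import math
from collections import Counter

# Training data copied from the table: ((X1, X2, X3), Y).
training = [
    ((0, 3, 0), "Red"),
    ((2, 0, 0), "Red"),
    ((0, 1, 3), "Red"),
    ((0, 1, 2), "Green"),
    ((-1, 0, 1), "Green"),
    ((1, 1, 1), "Red"),
]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train, x0, k):
    """Return the majority class among the k training points nearest to x0."""
    nearest = sorted(train, key=lambda row: euclidean(row[0], x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


x0 = (0, 0, 0)
pred_k1 = knn_predict(training, x0, k=1)
pred_k3 = knn_predict(training, x0, k=3)
print("K = 1:", pred_k1)
print("K = 3:", pred_k3)
```

With K = 1 the single nearest neighbor is observation 5 (Green), so the prediction is Green. With K = 3 the nearest neighbors are observations 5 (Green), 6 (Red), and 2 (Red), so the majority vote gives Red.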

#### (d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

## Applied

### 8. This exercise relates to the `College` data set, which can be found in the file `College.csv`. It contains a number of variables for 777 different universities and colleges in the US. The variables are
* `Private` : Public/private indicator
* `Apps` : Number of applications received
* `Accept` : Number of applicants accepted
* `Enroll` : Number of new students enrolled
* `Top10perc` : New students from top 10 % of high school class
* `Top25perc` : New students from top 25 % of high school class
* `F.Undergrad` : Number of full-time undergraduates
* `P.Undergrad` : Number of part-time undergraduates
* `Outstate` : Out-of-state tuition
* `Room.Board` : Room and board costs
* `Books` : Estimated book costs
* `Personal` : Estimated personal spending
* `PhD` : Percent of faculty with Ph.D.’s
* `Terminal` : Percent of faculty with terminal degree
* `S.F.Ratio` : Student/faculty ratio
* `perc.alumni` : Percent of alumni who donate
* `Expend` : Instructional expenditure per student
* `Grad.Rate` : Graduation rate
### Before reading the data into R, it can be viewed in Excel or a text editor.