In [12]:
import pandas as pd
import numpy as np

# Statistical Learning Exercises

#### 1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

***(a) The sample size n is extremely large, and the number of predictors p is small.***

>In this case, I would favor a more flexible method: the large n lets it pick up extra structure in the data, and with so few predictors there is little opportunity for overfitting.

***(b) The number of predictors p is extremely large, and the number of observations n is small.***

>In this case, where models are particularly susceptible to overfitting (learning noise in the data), I would favor a less flexible method.

***(c) The relationship between the predictors and response is highly non-linear.***

>In this case, a more flexible model that does not make any assumptions of an underlying functional form would likely outperform an inflexible method.

***(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.***

>I would expect an inflexible model to perform better here, since a flexible model becomes more likely to fit the noise as the error variance grows.
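
A small simulation can make the trade-offs in parts (a) through (d) concrete. The sketch below is illustrative only: the function `true_f`, the helper `fit_and_score`, the polynomial degrees, the sample sizes, and the noise levels are all assumptions chosen for the demo, with polynomial degree standing in for "flexibility".

```python
import numpy as np

def true_f(x):
    # Assumed non-linear "truth" for the demo; not part of the exercise
    return np.sin(2 * np.pi * x)

def fit_and_score(n_train, noise_sd, degree, seed=0, n_test=2000):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(-1, 1, n_train)
    y_train = true_f(x_train) + rng.normal(0, noise_sd, n_train)
    x_test = rng.uniform(-1, 1, n_test)
    y_test = true_f(x_test) + rng.normal(0, noise_sd, n_test)

    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Large n, low noise: the flexible fit (degree 9) can exploit the data
for degree in (1, 9):
    print("large n, degree", degree, fit_and_score(1000, 0.1, degree))

# Small n, high noise: the flexible fit tends to chase the noise
for degree in (1, 9):
    print("small n, degree", degree, fit_and_score(12, 1.0, degree))
```

Under these assumptions the degree-9 fit should have the lower test MSE in the first setting, while in the second setting its training MSE drops but its test MSE typically gets much worse.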

#### 2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

***(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.***

>This is regression; the target, CEO salary, is a continuous variable.  We are interested in inference, meaning we are less concerned with the particular predictions, and more with the factors associated with increased or decreased CEO salary. n = # of firms = 500; p = # of predictors = 3

***(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.***

>This is classification; we are trying to make a binary determination of success or failure.  The goal here is prediction since we are more interested in the predictions themselves and less in the relationship between features and the output. n = 20, p = 13

***(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.***

>This is regression for prediction.  We are trying to predict % change, a continuous value, and the goal is to predict, not to understand the exact relationship between features and target. n = 52 (one observation per week of 2012), p = 3.

#### 3. We now revisit the bias-variance decomposition.

***(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.***

<img src="../figures/2_3a.jpg" alt="flexibility" width="600px"/>
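
For reference, the curves in a sketch like this are tied together by the standard bias-variance decomposition of the expected test MSE at a point $x_0$. The notation is an assumption that follows the exercises: $\hat{f}$ is the fitted model and $\varepsilon$ is the error term.

$$
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)
$$

The last term is the irreducible error, which does not depend on the flexibility of the method.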

***(b) Explain why each of the five curves has the shape displayed in part (a).***


#### 4. You will now think of some real-life applications for statistical learning.

***(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.***

***(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.***

***(c) Describe three real-life applications in which cluster analysis might be useful.***

#### 5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?


#### 6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

>A parametric approach reduces the problem to estimating a set of parameters of a particular function (e.g. linear regression coefficients), whereas non-parametric approaches do not assume a particular functional form.
>An advantage of a parametric approach is that it can have significantly lower variance: assuming a particular functional form makes the model less sensitive to individual data points or outliers. A disadvantage is increased bias; the assumed form limits the complexity of the fitted function or decision boundary, and it may be insufficient for some problems.
>
>A non-parametric approach may be preferred if the data is highly non-linear (or the decision boundary is highly irregular), or if we have a large amount of data and a relatively small number of predictors. A disadvantage of these approaches can be high variance and a greater likelihood of overfitting, as the model is flexible enough to learn from noise in the data.
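
To illustrate the contrast, here is a minimal sketch that fits a parametric model (a straight line) and a non-parametric one (K-nearest-neighbours regression) to the same simulated data. Everything in it is an assumption for illustration: the function `true_f`, the helper `knn_regress`, the sample size, the noise level, and `k=10`.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Assumed non-linear relationship, chosen only for illustration
    return np.sin(3 * x)

x_train = rng.uniform(-2, 2, 500)
y_train = true_f(x_train) + rng.normal(0, 0.2, 500)
x_test = np.linspace(-1.8, 1.8, 200)
y_test = true_f(x_test)  # noise-free targets, to measure error against the truth

# Parametric: assume y = b0 + b1 * x and estimate just two coefficients
b1, b0 = np.polyfit(x_train, y_train, 1)
linear_pred = b0 + b1 * x_test

# Non-parametric: K-nearest-neighbours regression, no functional form assumed
def knn_regress(x_train, y_train, x_new, k=10):
    dists = np.abs(x_new[:, None] - x_train[None, :])  # pairwise distances
    nearest = np.argsort(dists, axis=1)[:, :k]          # indices of the k closest
    return y_train[nearest].mean(axis=1)                # average their responses

knn_pred = knn_regress(x_train, y_train, x_test)

linear_mse = np.mean((linear_pred - y_test) ** 2)
knn_mse = np.mean((knn_pred - y_test) ** 2)
print("linear MSE:", linear_mse)
print("KNN MSE:   ", knn_mse)
```

With plenty of data and a strongly non-linear truth, the KNN fit should come out well ahead here; the linear model's bias dominates its error.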



#### 7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

|Obs. |X1 |X2| X3| Y|
|---|---|---|---|---|
|1| 0|3|0| Red|
|2| 2|0|0| Red|
|3| 0|1|3| Red|
|4| 0|1|2| Green|
|5| −1|0|1| Green|
|6| 1|1|1|Red|

***(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.***

In [3]:
training_data = pd.DataFrame([
    [1, 0, 3, 0, 'Red'],
    [2, 2, 0, 0, 'Red'],
    [3, 0, 1, 3, 'Red'],
    [4, 0, 1, 2, 'Green'],
    [5, -1, 0, 1, 'Green'],
    [6, 1, 1, 1, 'Red']
], columns=['Obs.', 'X1', 'X2', 'X3', 'Y']).set_index('Obs.')

training_data

|Obs.|X1|X2|X3|Y|
|---|---|---|---|---|
|1|0|3|0|Red|
|2|2|0|0|Red|
|3|0|1|3|Red|
|4|0|1|2|Green|
|5|-1|0|1|Green|
|6|1|1|1|Red|


In [14]:
def get_distance(training_data, new_sample):
    distance = ((training_data[['X1', 'X2', 'X3']] - new_sample)**2).sum(axis=1).apply(np.sqrt)
    return distance

In [16]:
test_point = [0, 0, 0]

print("Euclidean distance of point {}".format(test_point))
get_distance(training_data, test_point)

Euclidean distance of point [0, 0, 0]


Obs.
1    3.000000
2    2.000000
3    3.162278
4    2.236068
5    1.414214
6    1.732051
dtype: float64
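
As a cross-check on the values above, the same distances can be computed directly with NumPy. This is a minimal sketch; the array names `X` and `x0` are introduced here for illustration, and each distance is the square root of the summed squared coordinate differences.

```python
import numpy as np

# Predictors X1, X2, X3 for observations 1..6, copied from the table in question 7
X = np.array([
    [0, 3, 0],
    [2, 0, 0],
    [0, 1, 3],
    [0, 1, 2],
    [-1, 0, 1],
    [1, 1, 1],
])
x0 = np.zeros(3)  # test point X1 = X2 = X3 = 0

# Row-wise Euclidean norm of the difference to the test point
distances = np.linalg.norm(X - x0, axis=1)
print(distances)
```

For example, observation 3 gives sqrt(0² + 1² + 3²) = sqrt(10) ≈ 3.162, matching the pandas result.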

***(b) What is our prediction with K = 1? Why?***

In [52]:
def get_pred(training_data, test_sample, k=1):
    # Sort training observations by distance to the test point, nearest first
    distances = get_distance(training_data, test_sample).sort_values(ascending=True)
    print("Distances from training data to test point")
    print(distances)

    # Majority vote among the k nearest neighbours (ties are not handled specially)
    return training_data.loc[distances.iloc[:k].index, 'Y'].value_counts().index[0]

In [55]:
# observation 5 is closest to test point, and Y='Green' for obs. 5

'Prediction for k = 1 is "{}"'.format(get_pred(training_data, test_point, k=1))

Distances from training data to test point
Obs.
5    1.414214
6    1.732051
2    2.000000
4    2.236068
1    3.000000
3    3.162278
dtype: float64


'Prediction for k = 1 is "Green"'

***(c) What is our prediction with K = 3? Why?***

In [56]:
# The 3 closest points are observations 5, 6, and 2. Observations 6 and 2 are Red, so the majority vote (and the prediction) is Red

'Prediction for k = 3 is "{}"'.format(get_pred(training_data, test_point, k=3))

Distances from training data to test point
Obs.
5    1.414214
6    1.732051
2    2.000000
4    2.236068
1    3.000000
3    3.162278
dtype: float64


'Prediction for k = 3 is "Red"'
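
The majority vote can also be read as an estimate of class probabilities: the fraction of each class among the K nearest neighbours. The sketch below shows this for K = 1 and K = 3. The helper `knn_class_proportions` and the arrays `X`, `y`, and `x0` are names introduced here for illustration.

```python
import numpy as np
from collections import Counter

# Training data from question 7 (observations 1..6)
X = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
y = np.array(['Red', 'Red', 'Red', 'Green', 'Green', 'Red'])
x0 = np.zeros(3)  # test point

def knn_class_proportions(X, y, x0, k):
    """Fraction of each class among the k training points nearest to x0."""
    distances = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(distances)[:k]        # indices of the k closest points
    counts = Counter(y[nearest].tolist())
    return {label: count / k for label, count in counts.items()}

print(knn_class_proportions(X, y, x0, k=1))
print(knn_class_proportions(X, y, x0, k=3))
```

With K = 1 the only neighbour is observation 5, so Green gets proportion 1. With K = 3 the neighbours are observations 5, 6, and 2, so Red gets 2/3 and Green 1/3.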

***(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?***

>In this case, I would favor a smaller value of K. Increasing K averages the prediction over many nearby points, which makes the model _less_ sensitive to local, non-linear structure in the data. That smoothing is desirable if the data is very noisy, but not if there are truly significant non-linearities in the underlying problem.
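
The effect of K on a highly non-linear boundary can be sketched with a one-dimensional toy problem. All specifics here are assumptions for illustration: the labelling rule `true_label`, the 10% label noise, the sample sizes, and the two values of K being compared.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_label(x):
    # Assumed highly non-linear rule: the class flips every pi/4 along x
    return (np.sin(4 * x) > 0).astype(int)

x_train = rng.uniform(0, 2 * np.pi, 400)
y_train = true_label(x_train)
flip = rng.random(400) < 0.1                    # 10% label noise (assumption)
y_train = np.where(flip, 1 - y_train, y_train)

x_test = rng.uniform(0, 2 * np.pi, 1000)
y_test = true_label(x_test)

def knn_predict(x_train, y_train, x_new, k):
    dists = np.abs(x_new[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote; k is odd here, so the two classes cannot tie
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

for k in (5, 151):
    accuracy = np.mean(knn_predict(x_train, y_train, x_test, k) == y_test)
    print("K =", k, "test accuracy:", accuracy)
```

In this setup a large K averages across several class regions at once and washes out the boundary, so the small K should score noticeably higher.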

## Applied