## Ridge, Lasso and Elastic Net
At this point we've seen a number of criteria and algorithms for fitting regression models to data. We've seen simple linear regression using ordinary least squares and its generalization to polynomial regression. We've also seen how we can overfit models to data almost arbitrarily using kernel methods or feature engineering. With all of that, we began to explore tools for analyzing the general problem of overfitting versus underfitting, including train and test splits, bias and variance, and cross validation.

Now we're going to take a look at another way to tune our models. These methods all modify the mean squared error function that we were optimizing. The modifications add a penalty for large coefficient weights in the resulting model. If we think back to our case of feature engineering, we can see how this penalty helps curb our ability to fit the training data ever more closely by simply adding features.

In general, all of these penalties are based on $L^p$ norms.

## $L^p$ norm of x
In order to help account for underfitting and overfitting, we often use what are called $L^p$ norms.   
The **$L^p$ norm of x** is defined as:  

### $||x||_p  =  \big(\sum_{i} |x_i|^p\big)^\frac{1}{p}$
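
As a quick illustration of the definition, here is a minimal sketch that computes the norm directly for a small vector. The helper name `lp_norm` is hypothetical, and the sketch assumes $p \geq 1$ and a plain Python sequence of numbers.

```python
def lp_norm(x, p):
    """Return the L^p norm of a sequence of numbers (assumes p >= 1)."""
    # Absolute values matter: without them, negative entries would
    # cancel out for odd p.
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

v = [3, -4]
l1 = lp_norm(v, 1)  # |3| + |-4|
l2 = lp_norm(v, 2)  # sqrt(3^2 + (-4)^2), the Euclidean length
```

For the same vector, the $L^1$ norm is never smaller than the $L^2$ norm, which is part of why the two penalties behave differently.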

## 1. Ridge (L2)
One common form of regularization is called Ridge Regression and uses the squared $l_2$ norm (the $l_2$ norm is also known as the Euclidean norm) as defined above.   
The ridge coefficients minimize a penalized residual sum of squares:    
    $ \sum_i(\hat{y}_i-y_i)^2 + \lambda \sum_j w_j^2$

Write this loss function for performing ridge regression.

In [None]:
def ridge_loss(y, y_hat):
    return l2_err
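
One possible way to write the ridge loss is sketched below. The stub above only takes `y` and `y_hat`, so the extra parameters `w` (the model weights) and `lam` (the penalty strength $\lambda$), as well as the name `ridge_loss_sketch`, are assumptions made for illustration.

```python
import numpy as np

def ridge_loss_sketch(y, y_hat, w, lam=1.0):
    """Residual sum of squares plus lam times the sum of squared weights."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    rss = np.sum((y_hat - y) ** 2)   # data-fit term
    penalty = lam * np.sum(w ** 2)   # squared l2 norm of the weights
    return rss + penalty

# Small hand-checkable example with made-up numbers
ridge_example = ridge_loss_sketch(y=[1.0, 2.0], y_hat=[1.5, 1.0],
                                  w=[3.0, -4.0], lam=0.1)
```

Here the data-fit term is 0.25 + 1.0 = 1.25 and the penalty is 0.1 × 25 = 2.5.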

## 2. Lasso (L1)
Another common form of regularization is called Lasso Regression and uses the $l_1$ norm.   
The lasso coefficients minimize a penalized residual sum of squares:    
    $ \sum_i(\hat{y}_i-y_i)^2 + \lambda \sum_j |w_j|$

Write this loss function for performing lasso regression.

In [None]:
def lasso_loss(y, y_hat):
    return l1_err
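
A matching sketch for the lasso loss follows. As before, the parameters `w` and `lam` and the name `lasso_loss_sketch` are assumptions, since the stub above only takes `y` and `y_hat`.

```python
import numpy as np

def lasso_loss_sketch(y, y_hat, w, lam=1.0):
    """Residual sum of squares plus lam times the sum of absolute weights."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    w = np.asarray(w, dtype=float)
    rss = np.sum((y_hat - y) ** 2)       # data-fit term
    penalty = lam * np.sum(np.abs(w))    # l1 norm of the weights
    return rss + penalty

# Same made-up numbers as a hand-checkable example
lasso_example = lasso_loss_sketch(y=[1.0, 2.0], y_hat=[1.5, 1.0],
                                  w=[3.0, -4.0], lam=0.1)
```

The data-fit term is again 1.25, while the penalty is 0.1 × 7 = 0.7. The only difference from the ridge loss is that weights are penalized by absolute value rather than by square.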

## 3. Ridge in Practice
Modify your polynomial linear regression function to incorporate the Ridge $l_2$ penalty rather than simply MSE.

In [None]:
#Previous Code
import numpy as np

def grad_desc(x, y, precision, max_iters, w, rss, predict, alpha=0.001): #alpha (step size) default is an assumed value
    previous_step_size = 1 #Arbitrary
    iteration = 0 #iteration counter
    while (previous_step_size > precision) and (iteration < max_iters):
        if iteration%500==0:
            print('Iteration {} \nCurrent weights:\n{} \nRSS Produced: {}'.format(iteration, w, rss(y, predict(x, w))))
            print('\n\n')
        #Calculate Nearby Points
        sample_steps = np.array(w)/1000.0 #Scale each weight by 1/1000 and use
                                                #these steps to create surrounding sample points.
        #Calculate the Gradient
        #Look at weights surrounding our current position.
        weights_sample_space = np.array([w+(i*sample_steps) for i in range(-50,51)])

        #Calculate the RSS error for this surrounding weights-space.
        y_hats = np.array([predict(x, wi) for wi in weights_sample_space])
        rss_weights_sample_space = np.array([rss(y, y_hat) for y_hat in y_hats])

        #weights_and_y_hats = np.concatenate((weights_sample_space,  np.array([rss_weights_sample_space]).T), axis=1)
        gradients = np.gradient(rss_weights_sample_space)
        steepest_gradient_idx = max(list(enumerate(gradients)), key=lambda x: x[1])[0]


        #Move opposite the gradient by some step size
        prev_w = w #Save for calculating how much we moved
        w = w - alpha*weights_sample_space[steepest_gradient_idx]

        previous_step_size = np.sqrt(sum([wi**2 for wi in w-prev_w]))
        iteration += 1
    

    print("Gradient descent converged. Local minimum identified at:")
    print('Iteration {} \nCurrent weights:\n{} \nRSS Produced: {}'.format(iteration, w, rss(y, predict(x, w))))
    return w

In [None]:
#Your Ridge Linear Regression Function here
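
As a cross-check for a gradient-descent result, ridge regression also has a closed-form solution, $w = (X^TX + \lambda I)^{-1}X^Ty$. The sketch below uses that formula rather than the `grad_desc` routine above; the function name `ridge_closed_form`, the synthetic data, and the choice to penalize every weight with no separate intercept are all assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept term)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming an explicit inverse
    return np.linalg.solve(A, X.T @ y)

# Synthetic data: 50 samples, 3 features, known weights plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_closed_form(X_demo, y_demo, lam=0.0)     # lam = 0 reduces to OLS
w_ridge = ridge_closed_form(X_demo, y_demo, lam=10.0)  # weights shrink toward 0

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

With `lam=0.0` this reduces to ordinary least squares, and increasing `lam` shrinks the overall size of the weight vector. In practice, features on very different scales (such as budget versus rating) are usually standardized first so the penalty treats them comparably.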

## 4. Lasso in Practice
Modify your polynomial linear regression function to incorporate the Lasso $l_1$ penalty rather than simply MSE.

In [None]:
#Your Lasso Linear Regression Function here

## 5. Run + Compare your Results
Answer the following questions:
* Which model do you think created better results overall? 
* Comment on the differences between the coefficients of the resulting models

In [3]:
import pandas as pd

In [None]:
df = pd.read_excel('movie_data_detailed_with_ols.xlsx')
df.head()
X = df[['budget', 'imdbRating',
       'Metascore', 'imdbVotes']]
y = df['domgross']

In [None]:
#Run your models here

In [None]:
#Your answers here