Welcome to the very last day of the 5-Day Regression Challenge!  (Already, can you believe it?) In the last four days, we've:

* [Day 1: Learned about different types of regression (Poisson, linear and logistic) and when to use them](https://www.kaggle.com/rtatman/regression-challenge-day-1)
* [Day 2: Learned how to fit & evaluate a model with diagnostic plots](https://www.kaggle.com/rtatman/regression-challenge-day-2)
* [Day 3: Learned how to read and understand models](https://www.kaggle.com/rtatman/regression-challenge-day-3)
* [Day 4: Learned how to fit & interpret a multiple regression model](https://www.kaggle.com/rtatman/regression-challenge-day-4)

Yesterday we learned about multiple regression, where we use multiple inputs to predict an output. Multiple regression is very powerful, but it also creates a new problem. So far we've just been picking input variables we think are interesting by hand. But what if you have three or four hundred input variables and no way of knowing ahead of time which ones are going to be important or interesting? Today we're going to learn one technique for **automatically selecting which input variables to use.** Picking which input variables to use in a model is called "feature selection".

One way to do this is by just fitting a regular GLM regression model and then picking the predictors that have a very big effect on our training data. There are, unfortunately, two big problems with this.

**Over-fitting:** This is when your model is so powerful it's picking up on random patterns in your training data that aren't actually helpful. When you then use that model to try to make predictions on a new dataset, you're going to get poor results. In general, the more complex your model is and the more input variables you use, the more likely you are to over-fit. In the case of feature selection, over-fitting means that you're likely to accidentally select variables that aren't actually important.

**Multicollinearity:** When two or more of your input variables are very strongly related to each other (for example someone's age and the number of birthdays they've had), even if they would actually be very informative for your model they may not show up as important when you look at your GLM model. In particular, the standard error will be much bigger for variables that are highly related to each other. The big impact on feature selection is that this means that you're likely to throw out those variables, even if they are actually very important and interesting!
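
To see why strongly related inputs get large standard errors, here is a minimal sketch on made-up data (the variable names and numbers are assumptions for illustration, not values from the survey). It fits ordinary least squares twice with NumPy, once with two unrelated inputs and once with two nearly identical inputs, and compares the standard error of the same coefficient.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficients and their standard errors (intercept included)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])        # add an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    dof = n - design.shape[1]                        # degrees of freedom
    sigma2 = residuals @ residuals / dof             # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)  # coefficient covariance
    return coef, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 500
age = rng.normal(30, 5, n)
unrelated = rng.normal(0, 5, n)            # independent of age
birthdays = age + rng.normal(0, 0.1, n)    # almost a copy of age
noise = rng.normal(0, 1, n)

# same true relationship in both cases: the output depends on age only
y = 2.0 * age + noise

_, se_independent = ols_standard_errors(np.column_stack([age, unrelated]), y)
_, se_collinear = ols_standard_errors(np.column_stack([age, birthdays]), y)

print("SE of age coefficient, unrelated second input:", se_independent[1])
print("SE of age coefficient, collinear second input:", se_collinear[1])
```

The age effect is equally real in both fits, but in the second one the model can't tell age and birthdays apart, so the uncertainty about each individual coefficient grows.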

Fortunately for us, we can tackle both of these problems in one fell swoop by using Elastic Net to select our variables. Elastic Net is a regularization technique for regression, which means it helps us avoid over-fitting by making sure that none of the coefficients end up being ridiculously large. At the same time, if a variable isn't important, Elastic Net will remove it from the regression model by setting its coefficient to exactly 0. The really nice thing about Elastic Net is that it's particularly good at dealing with multicollinearity; it won't drop groups of correlated variables if the group as a whole is helpful. (If you're curious about the math under the hood, you can check out [this paper](http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2005.00503.x/abstract).)
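
As a small illustration of both behaviours, here is a sketch on synthetic data using scikit-learn's `ElasticNet`. The data, the column names and the penalty settings (`alpha=1.0`, `l1_ratio=0.5`) are all assumptions chosen for the demo, not recommendations. Two inputs are strongly correlated and both matter; three more are unrelated to the output.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n = 1000

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # strongly correlated with x1
junk = rng.normal(size=(n, 3))        # three inputs unrelated to the output
X = np.column_stack([x1, x2, junk])

# the output depends only on the correlated pair
y = 3 * x1 + 3 * x2 + rng.normal(size=n)

# alpha sets the overall penalty strength; l1_ratio mixes the lasso (L1)
# and ridge (L2) penalties. Both values here are arbitrary choices.
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X, y)

for name, coef in zip(["x1", "x2", "junk1", "junk2", "junk3"], model.coef_):
    print(f"{name}: {coef:.3f}")
```

With these settings the unrelated inputs should come out with coefficients of exactly 0, while both members of the correlated pair keep a non-zero (though shrunken) coefficient.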

Today, we're going to learn how to use Elastic Net to help us select which variables to use for multiple regression.  

# Example: Predicting New Coders' Income
___

For our example, I'm going to be figuring out what features are most helpful when we're predicting the income of a new coder. I'm going to be using data from the 2016 Free Code Camp new coder's survey, and I've already read in data from the 2017 survey for you to use in your example.

In [None]:
import pandas as pd

# read in data
coders_2016 = pd.read_csv("../input/2016-new-coder-survey-/2016-FCC-New-Coders-Survey-Data.csv")
coders_2017 = pd.read_csv("../input/the-freecodecamp-2017-new-coder-survey/2017-fCC-New-Coders-Survey-Data.csv")

The Elastic Net implementation we'll use, scikit-learn's ElasticNetCV, wants your input variables to be in a matrix. (A matrix is like a dataframe that only has numeric variables in it, and a dataframe with only numeric columns will work too.)

One of the variables I'm interested in is Gender, which right now is a column of strings. I'm going to convert it to a boolean variable by making a new column that has True if the Gender column in that row has "female" in it and False if it doesn't. We can then put that in our matrix because booleans are numbers (True is equal to 1 and False is equal to 0).
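
Here is that conversion on a tiny made-up column (the `toy` dataframe is hypothetical and only stands in for the survey data), so you can see what happens to each value, including a missing one.

```python
import pandas as pd

# hypothetical toy data standing in for the survey's Gender column
toy = pd.DataFrame({"Gender": ["female", "male", None, "female"]})

# the comparison gives True/False; missing values compare as False
is_female = toy["Gender"] == "female"

# casting to int turns True into 1 and False into 0
toy["IsWoman"] = is_female.astype(int)
print(toy)
```

Note that a missing Gender ends up as 0 here, the same as any answer other than "female", so it will not be removed by a later `dropna()`.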

In addition to gender, I have seven other variables that I think may be interesting: the person's age, how long they spend commuting, whether they have children, whether they attended a coding bootcamp, whether they have any debt, how many hours a week they spend learning and how many months they've been programming for.

In [None]:
# create a subset of the data with only our variables of interest
# (non-numeric variables won't work)
import pandas as pd
# 1 if Gender is "female", 0 otherwise (missing values also become 0)
iswoman = (coders_2016['Gender'] == "female").astype(int)
# .copy() so that adding a column doesn't modify a view of coders_2016
subset = coders_2016[['Age', 'CommuteTime', 'HasChildren',
           'AttendedBootcamp', 'HasDebt',
           'HoursLearning', 'MonthsProgramming', 'Income']].copy()
subset['IsWoman'] = iswoman
# drop rows with a missing value in any of these columns
subset = subset.dropna()

X = subset[['Age', 'CommuteTime', 'HasChildren', 
           'AttendedBootcamp', 'IsWoman', 'HasDebt',
           'HoursLearning', 'MonthsProgramming']]
# get a vector with our output variable
y = subset['Income']

print('Number of data points: ' + str(len(y)))

Because we have a couple thousand data points, even though salary can best be represented by a Poisson distribution, I'm going to use a Gaussian distribution. (With bigger datasets, these two distributions start to become very similar.)

In order to fit our model, we're going to do something called “cross-validation”. This is another technique to help avoid over-fitting. The general idea is that you take your full dataset, train a model on just part of it, and check how well that model predicts the part you held out. Then you hold out a different part of your data and repeat. In this case, we're going to split the data into ten parts (`cv=10`). ElasticNetCV uses the held-out scores to choose how strong the penalty should be, then fits our final model on the full dataset with that penalty.
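
The hold-out idea behind cross-validation can be sketched by hand. The example below is a simplified illustration on made-up data, using plain least squares in NumPy rather than Elastic Net: it cuts the rows into ten folds, fits on nine of them, scores on the one left out, and averages the ten scores. ElasticNetCV does a version of this internally to compare candidate penalty strengths.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10

# made-up data: three inputs, one output, noise with variance 1
X = rng.normal(size=(n, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# shuffle the row indices and cut them into k roughly equal folds
folds = np.array_split(rng.permutation(n), k)

fold_mse = []
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)   # every row not held out

    # fit least squares (with an intercept) on the training rows only
    design = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)

    # score the fit on the rows the model never saw
    test_design = np.column_stack([np.ones(len(held_out)), X[held_out]])
    errors = y[held_out] - test_design @ coef
    fold_mse.append(np.mean(errors ** 2))

cv_mse = np.mean(fold_mse)
print("mean held-out MSE across folds:", cv_mse)
```

Because every score comes from rows the model didn't train on, a model that has only memorized noise in its training rows will score badly here.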

In [None]:
from sklearn.linear_model import ElasticNetCV
regr = ElasticNetCV(cv=10, random_state=0)
regr.fit(X, y)

Now that we have our full model, let's check out which features we want to keep!  

In [None]:
print(regr.intercept_)

In [None]:
coefficients = pd.DataFrame()
coefficients['columns'] = X.columns
coefficients['coef'] = regr.coef_
print(coefficients)

It looks like all of our features but "HasDebt" were very useful for predicting income. A value of exactly 0 next to HasDebt means that the coefficient for this input variable was pushed to zero by Elastic Net.

Now, this is in and of itself a regression model, so if you like you can call it quits at this point. However, since the reason we used Elastic Net in the first place was to help us select which features to use, let's fit a new model using these features. 

First, I'm going to do a bit of data munging to get a vector of the variables with non-zero coefficients out of our Elastic Net model.

In [None]:
variables_non_zero = coefficients[coefficients['coef'] != 0]['columns']
print(variables_non_zero)

Now let's fit a new regression model using these variables! I'm going to use our list of variable names to pick the matching columns out of the dataframe and give them to LinearRegression. (You'll notice I'm trying very hard to avoid having to type variable names more than once! I tend to make a lot of typos, so this ends up being a huge time-saver for me.)

In [None]:
# keep only the columns with non-zero Elastic Net coefficients
X = subset[variables_non_zero]

# fit a linear regression model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X, y)

Great, now we have a linear regression model! At this point in the challenge, you probably already know what comes next: diagnostic plots. 

In [None]:
import matplotlib.pyplot as plt
# residuals vs. predicted values
y_pred = regr.predict(X)
residual = y - y_pred
plt.scatter(y_pred, residual)
plt.xlabel('Predicted income')
plt.ylabel('Residual')

And these look pretty good! We do have a little bit of a skew & some weirdness with very high salaries. That's not super surprising, though: if a brand-new coder is making a salary of more than $100,000 a year there's probably something else going on that we're going to have a hard time accounting for. Overall, this model doesn't look too bad!

Let's take a closer look:

In [None]:
print(regr.intercept_)

In [None]:
coefficients = pd.DataFrame()
coefficients['columns'] = X.columns
coefficients['coef'] = regr.coef_
print(coefficients)

You may notice that the coefficients for this model are slightly different from the ones output by the Elastic Net model. That's due to the different ways that these two models are fitted.
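
One way to see the difference in fitting: Elastic Net adds a penalty that pulls coefficients toward zero, while plain linear regression has no penalty. The sketch below fits both to the same made-up data (the data and the `alpha=0.5` penalty are arbitrary assumptions for the demo).

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 500

# made-up data: two independent inputs with known effects
X = rng.normal(size=(n, 2))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)   # arbitrary penalty

print("least squares:", ols.coef_)
print("elastic net:  ", enet.coef_)
```

Both models keep both inputs here, but the Elastic Net coefficients should be closer to zero than the least-squares ones.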

Looking at the estimates and the standard error, it looks like MonthsProgramming is far and away the most important feature, but that age is also pretty important. We can check this out using the Added-Variable Plots we learned about yesterday.
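
As a reminder of what an added-variable plot shows, here is a minimal NumPy sketch on made-up data (the variables `months`, `age` and `income` are synthetic stand-ins, and `residualize` is a hypothetical helper). It computes the two sets of residuals that go on the plot's axes, and checks that the slope between them matches the coefficient from the full multiple regression.

```python
import numpy as np

def residualize(target, others):
    """Residuals of `target` after least-squares regression on `others` plus an intercept."""
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

rng = np.random.default_rng(0)
n = 300

# made-up stand-ins for two inputs and an output
months = rng.normal(20, 10, n)
age = 25 + 0.2 * months + rng.normal(0, 4, n)
income = 1000 * months + 500 * age + rng.normal(0, 5000, n)

# x-axis: the part of `months` not explained by `age`
# y-axis: the part of `income` not explained by `age`
months_resid = residualize(months, age)
income_resid = residualize(income, age)

# slope of the added-variable plot for `months`
av_slope = (months_resid @ income_resid) / (months_resid @ months_resid)

# coefficient of `months` in the full multiple regression, for comparison
full_design = np.column_stack([np.ones(n), months, age])
full_coef, *_ = np.linalg.lstsq(full_design, income, rcond=None)

print("added-variable slope:", av_slope)
print("full-model coefficient:", full_coef[1])
```

To draw the plot itself, you could pass `months_resid` and `income_resid` to `plt.scatter`.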

In [None]:
# added-variable plots for our model

And that's pretty much it! So to review, the pipeline for multiple regression that includes feature selection with Elastic Net looks like:

1. Pick an output variable & some subset of possible input variables
2. Convert your input values to a matrix
3. Use Elastic Net to select some sub-set of those input variables
4. (Optional) Refit a model using those variables
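
The steps above can be wrapped up in one small function. This is only a sketch: `select_and_refit` is a hypothetical helper (not part of scikit-learn), and the data below is made up so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV, LinearRegression

def select_and_refit(X, y, cv=10, random_state=0):
    """Select columns with Elastic Net, then refit plain linear regression.

    `select_and_refit` is a hypothetical helper, not part of scikit-learn.
    """
    selector = ElasticNetCV(cv=cv, random_state=random_state).fit(X, y)
    kept = X.columns[selector.coef_ != 0]      # columns Elastic Net kept
    if len(kept) == 0:
        raise ValueError("Elastic Net dropped every input variable")
    final_model = LinearRegression().fit(X[kept], y)
    return list(kept), final_model

# made-up data: two useful inputs and three unrelated ones
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=["a", "b", "junk1", "junk2", "junk3"])
y = 5 * X["a"] - 3 * X["b"] + rng.normal(size=300)

kept, final_model = select_and_refit(X, y)
print("kept columns:", kept)
print("refit coefficients:", dict(zip(kept, final_model.coef_)))
```

Don't be surprised if some unrelated columns survive: the penalty picked by cross-validation can be small enough that weak inputs keep a tiny non-zero coefficient.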

Ready to give it a try?

# Your turn!
____

1. Using the 2017 survey, pick a variable to predict and at least three possible variables to use to predict it.
2. Select some (or all) of those possible input values using Elastic Net.
3. Fit and evaluate a GLM model using those variables. 
4. *Optional:* If you want to share your analysis with friends or to ask for help, you’ll need to make it public so that other people can see it.
    * Publish your kernel by hitting the big blue “publish” button. (This may take a second.)
    * Change the visibility to “public” by clicking on the blue “Make Public” text (right above the “Fork Notebook” button).
    * Tag your notebook with 5daychallenge

In [None]:
coders_2017.head()

In [None]:
# create a subset of the data with only our variables of interest
# (non-numeric variables won't work)
import pandas as pd
# 1 if Gender is "female", 0 otherwise (missing values also become 0)
iswoman = (coders_2017['Gender'] == "female").astype(int)
# .copy() so that adding a column doesn't modify a view of coders_2017
subset = coders_2017[['Age', 'HasChildren',
           'AttendedBootcamp', 'HasDebt',
           'HoursLearning', 'MonthsProgramming', 'Income']].copy()
subset['IsWoman'] = iswoman
# drop rows with a missing value in any of these columns
subset = subset.dropna()

X = subset[['Age', 'HasChildren', 
           'AttendedBootcamp', 'IsWoman', 'HasDebt',
           'HoursLearning', 'MonthsProgramming']]
# get a vector with our output variable
y = subset['Income']

print('Number of data points: ' + str(len(y)))

In [None]:
from sklearn.linear_model import ElasticNetCV
regr = ElasticNetCV(cv=10, random_state=0)
regr.fit(X, y)

In [None]:
coefficients = pd.DataFrame()
coefficients['columns'] = X.columns
coefficients['coef'] = regr.coef_
print(coefficients)

In [None]:
variables_non_zero = coefficients[coefficients['coef'] != 0]['columns']
print(variables_non_zero)

In [None]:
# keep only the columns with non-zero Elastic Net coefficients
X = subset[variables_non_zero]

# fit a linear regression model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X, y)

In [None]:
import matplotlib.pyplot as plt
# residuals vs. predicted values
y_pred = regr.predict(X)
residual = y - y_pred
plt.scatter(y_pred, residual)
plt.xlabel('Predicted income')
plt.ylabel('Residual')

In [None]:
coefficients = pd.DataFrame()
coefficients['columns'] = X.columns
coefficients['coef'] = regr.coef_
print(coefficients)

Want more? Ready for a different dataset? [This notebook](https://www.kaggle.com/rtatman/datasets-for-regression-analysis/) has additional dataset suggestions for you to practice regression with. 