Welcome to day 4 of the regression challenge! So far we've been focusing on examples where we are looking at the effect of **one** input variable on a specific output variable. Today, we're going to be looking at predicting a single output variable using multiple input variables. This is known as "multiple regression".

We're going to be using all the skills we've learned so far, so feel free to head back to the previous day's challenges if you need a quick refresher:

* [Day 1: Learned about different types of regression (Poisson, linear and logistic) and when to use them](https://www.kaggle.com/rtatman/regression-challenge-day-1)
* [Day 2: Learned how to fit & evaluate a model with diagnostic plots](https://www.kaggle.com/rtatman/regression-challenge-day-2)
* [Day 3: Learned how to read and understand models](https://www.kaggle.com/rtatman/regression-challenge-day-3)

If you're already caught up, we can get right down to fitting and examining a model using multiple regression!
___

<center>

[**You can check out a video that goes with this notebook by clicking here.**](https://www.youtube.com/embed/iN8Rl8sIzVg)

</center>

## Example: Predicting BMI
___

For my multiple regression example, I'm going to be using a dataset of health and eating habits of Americans collected by the US Bureau of Labor Statistics. I'm going to see if we can predict BMI (body mass index, intended to be a rough measure of body fat) using height, weight, how much time each person spends exercising and how much time each person spends eating. 

I'm also reading a dataset of New York City census data for you to use in your exercises.

In [None]:
import pandas as pd

# read in our data 
bmi_data = pd.read_csv("../input/eating-health-module-dataset/ehresp_2014.csv")
nyc_census = pd.read_csv("../input/new-york-city-census-data/nyc_census_tracts.csv")

In [None]:
# remove rows where the reported BMI is less than 0 (impossible)
bmi_data = bmi_data[bmi_data['erbmi']>0]

The column names in the bmi_data dataset aren't very helpful, so I've checked [the documentation](https://www.bls.gov/tus/ehmintcodebk1416.pdf) to figure out which columns have the information I want. After looking through the documentation, I picked out these five variables:

* erbmi = body mass index (this is what I'm going to try to predict!)
* euexfreq = how many times in the past week the person exercised (outside of their job)
* euwgt = weight, in pounds
* euhgt = height, in inches
* ertpreat = amount of time spent eating and drinking (in minutes) over the past week 

With this in mind, I'm going to fit the model the same way we've been doing the past few days. I'm just passing in four input columns instead of one:

In [None]:
X = bmi_data[['euexfreq', 'euwgt', 'euhgt', 'ertpreat']]
y = bmi_data['erbmi']
# fit a linear regression model (ordinary least squares)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,y)

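Now that we've fit our model, let's check out some diagnostics. The cell below uses matplotlib to plot the residuals against the predicted values. The four-plot layout discussed afterwards is the one R's plot() function produces, which is very slightly different from what you would get with the glm.diag.plots() function from the boot package.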

In [None]:
import matplotlib.pyplot as plt
y_pred = reg.predict(X)
residual = y - y_pred
plt.scatter(y_pred,residual)

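In the full set of four diagnostic plots, the top two are pretty much the same as the ones we saw on day 2, but the bottom two are slightly different. (The cell above draws only the first one, residuals vs. predicted values.)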

* **Residuals vs. predicted values**: We don't want to see a pattern here; it should just sort of look like a cloud, and the red line in the middle should be more-or-less flat. The pattern we see here, with bigger residuals towards the sides and smaller ones in the center, means that our model is much more accurate for values in the middle range than for extreme values (either very high or very low).
* **Normal Q-Q**: We want all our points to be on that dotted line and the line to go across the center diagonal of the plot. The fact that a lot of our residuals are above or below the line suggests that there's a strong skew in our data.
* **Scale-Location**: This helps you see if your data points are spread out evenly along your predictors (for example, checking that 20 out of 25 cantaloupes don't all weigh exactly 1 pound, which might suggest something is up with your dataset). You want points to be scattered randomly and the red line in the middle to be flat(ish). This plot suggests that we have a strong skew in our data. (Which is the same thing the other plots have been telling us.)
* **Residuals vs. Leverage**:  This plot will help you look for outliers. If you can see a dotted red line between the bulk of the points and one or two off on their own, those outliers are strongly affecting your analysis. Here we don't have any strong outliers to worry about.
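
The scatter plot in the cell above covers only the first of these four plots. As a rough sketch of how all four could be drawn in Python, the block below computes the usual ingredients (leverage and standardized residuals) with NumPy and plots the panels with matplotlib. It runs on synthetic data rather than `bmi_data`, the helper name `diagnostic_quantities` is made up for this illustration, and it draws no smoothing lines, so treat it as an approximation of the plots described here rather than a reproduction.

```python
from statistics import NormalDist

import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def diagnostic_quantities(X, y):
    """Fit ordinary least squares and return the pieces the four plots need."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept column + inputs
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    resid = y - fitted
    # leverage is the diagonal of the hat matrix, computed via a QR decomposition
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q ** 2, axis=1)
    dof = len(y) - design.shape[1]
    sigma = np.sqrt(np.sum(resid ** 2) / dof)
    std_resid = resid / (sigma * np.sqrt(1.0 - leverage))
    return fitted, resid, leverage, std_resid


# synthetic stand-in data (NOT the BMI dataset): two inputs, one output
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=200)

fitted, resid, leverage, std_resid = diagnostic_quantities(X_demo, y_demo)

# theoretical normal quantiles for the Q-Q panel
n = len(std_resid)
theoretical = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].scatter(fitted, resid, s=8)
axes[0, 0].axhline(0, color="red")
axes[0, 0].set_title("Residuals vs. predicted values")

axes[0, 1].scatter(theoretical, np.sort(std_resid), s=8)
axes[0, 1].plot(theoretical, theoretical, linestyle=":")
axes[0, 1].set_title("Normal Q-Q")

axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=8)
axes[1, 0].set_title("Scale-Location")

axes[1, 1].scatter(leverage, std_resid, s=8)
axes[1, 1].set_title("Residuals vs. Leverage")

fig.tight_layout()
```

In a notebook you would drop the `matplotlib.use("Agg")` line so the figure shows up inline, and pass your own inputs and output in place of `X_demo` and `y_demo`.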

These plots are telling us that we have a pretty strong skew in our data. We can probably trust the predictions we make towards the center of our range (since residuals are very low around 30), but the further away we move from the mean the less we can trust our model. If we want accurate predictions across the range of possible BMIs we probably don't want to use this model. **We can still continue investigating our model, but we should be very cautious in interpreting our results!**

Now that we're aware of some of the pitfalls with this model, we can examine it more closely.

In [None]:
# examine our model
print('coefficient = ' + str(reg.coef_))
print('intercept = ' + str(reg.intercept_))

Looking at our model, we can tell that we're pretty sure that the average BMI isn't 0, since the intercept is pretty far from 0 and the standard error for it is relatively small (less than 10% of the estimate). Note that `reg.coef_` and `reg.intercept_` give only the estimates themselves; the standard errors come from a full model summary. Using that same metric, we can also tell that both weight (euwgt) and height (euhgt) are probably important, but that neither how often someone exercises (euexfreq) nor how much time they spend eating (ertpreat) seems to be particularly informative.

We can also see that all our inputs together are very helpful because there's a large difference between the residual deviance and the null deviance.
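
Neither the standard errors nor the deviances are printed by the cell above, so here is a minimal sketch of how both checks could be computed by hand for a linear model. It uses synthetic data (not `bmi_data`) and made-up variable names, and it relies on the fact that for a linear (Gaussian) model the residual deviance is the residual sum of squares and the null deviance is the sum of squares around the mean.

```python
import numpy as np

# synthetic stand-in data (NOT the BMI dataset): x1 matters, x2 is pure noise
rng = np.random.default_rng(1)
n_rows = 500
x1 = rng.normal(size=n_rows)
x2 = rng.normal(size=n_rows)
y_demo = 5.0 + 2.0 * x1 + rng.normal(size=n_rows)

design = np.column_stack([np.ones(n_rows), x1, x2])  # intercept, x1, x2
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
resid = y_demo - design @ beta

# standard error of each estimate: sqrt of the diagonal of sigma^2 * (X'X)^-1
dof = n_rows - design.shape[1]
sigma2 = np.sum(resid ** 2) / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

# the rule of thumb from the text: is the standard error under 10% of the estimate?
looks_important = std_err < 0.1 * np.abs(beta)

# for a linear (Gaussian) model the deviances are sums of squares
residual_deviance = np.sum(resid ** 2)                 # model with all inputs
null_deviance = np.sum((y_demo - y_demo.mean()) ** 2)  # intercept-only model

for name, est, se, flag in zip(["intercept", "x1", "x2"], beta, std_err, looks_important):
    print(f"{name}: estimate={est:.3f} std_err={se:.3f} important={flag}")
print(f"null deviance={null_deviance:.1f} residual deviance={residual_deviance:.1f}")
```

The 10% threshold is just the rule of thumb used in this notebook, not a formal significance test.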

So, while we do want to be careful with interpreting this model given that we know it doesn't handle extreme values well, it looks like both weight and height are important for predicting BMI. (Which is good, since BMI is calculated using both weight and height!)
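
For reference, this is the standard BMI formula for weight in pounds and height in inches, written as a small function. The function name is made up for this example, and whether the survey's erbmi column was computed in exactly this way is an assumption. Notice that height enters as a squared term in the denominator, which is not a straight-line relationship and may be part of why a plain linear model struggles at the extremes.

```python
def bmi_from_imperial(weight_lb, height_in):
    """Standard BMI formula for pounds and inches: 703 * weight / height**2."""
    return 703.0 * weight_lb / height_in ** 2


# same weight, two different heights: the taller person has the lower BMI
print(round(bmi_from_imperial(150, 65), 1))
print(round(bmi_from_imperial(150, 70), 1))
```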

We can double check this using Added-Variable Plots, also known as partial-regression plots. AV Plots are a set of plots, one for each of your input variables, each showing what happens to your output variable if you hold all but one input variable fixed and change just that one input variable. Each observation is represented by a single point on the plot, and the coefficient is shown using a red line.

You can read these a bit like correlation plots: if an input variable is important, there will be a strong linear pattern in the points and the line will have a slope that's very different from 0. If an input variable isn't important, you won't see a strong pattern in the dots and the red line will just be a flat line at 0.

In [None]:
# added-variable plots for our model
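
The cell above is empty, so here is one way added-variable plots could be built by hand. This is a sketch on synthetic data with made-up helper and column names, not the plots the text below was written about. For each input, it regresses both that input and the output on all the other inputs and then plots the two sets of residuals against each other; the slope of the red line through those points matches that input's coefficient in the full model.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def ols_residuals(design, target):
    """Residuals of target after a least-squares fit on design (intercept added)."""
    design = np.column_stack([np.ones(len(target)), design])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta


def added_variable_points(X, y, j):
    """x- and y-coordinates of the added-variable plot for column j of X."""
    others = np.delete(X, j, axis=1)
    x_resid = ols_residuals(others, X[:, j])  # part of input j the others can't explain
    y_resid = ols_residuals(others, y)        # part of the output the others can't explain
    return x_resid, y_resid


# synthetic stand-in data (NOT the BMI dataset): two useful inputs, one noise input
rng = np.random.default_rng(2)
names = ["useful_a", "useful_b", "noise"]
X_demo = rng.normal(size=(300, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(size=300)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
slopes = []
for j, ax in enumerate(axes):
    x_resid, y_resid = added_variable_points(X_demo, y_demo, j)
    slope = np.sum(x_resid * y_resid) / np.sum(x_resid ** 2)  # line through the origin
    slopes.append(slope)
    ax.scatter(x_resid, y_resid, s=8)
    order = np.argsort(x_resid)
    ax.plot(x_resid[order], slope * x_resid[order], color="red")
    ax.set_xlabel(f"{names[j]} | others")
    ax.set_ylabel("y | others")
fig.tight_layout()
```

To try this on the BMI model you would pass `X.to_numpy()` and `y.to_numpy()` instead of the synthetic arrays and use a 2 x 2 grid of axes for the four inputs.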

Looking at these plots, we can see in the top right corner that as euwgt (weight) increases, so does erbmi (BMI, the variable we're trying to predict). In the bottom left corner we can see that as euhgt (height) increases, erbmi actually **decreases**. So both height and weight are important, but they have opposite effects! We can also tell this from our model's coefficients because euwgt had a *positive* estimate, while euhgt had a *negative* estimate.

The other two plots show that there's not a strong relationship between those variables and the one we're trying to predict, which we already figured out from our model.

And that's it for multiple regression! Now it's time for you to try it yourself. :)

> If you're really dying to know how to fit a model that's a better representation of this particular dataset, you can check out [this notebook](https://www.kaggle.com/rtatman/regression-challenge-day-4-gamma-distribution/), but you don't need it to work on your assignment for today.

## Your turn!
___

Now it's your turn to come up with a model and interpret it!

1. Pick a question to answer using the NYC Census dataset. Pick a variable to predict and at least three variables to use to predict it.
2. Fit a GLM model of the appropriate family. (Check out [Monday's challenge](https://www.kaggle.com/rtatman/regression-challenge-day-1) if you need a refresher).
3. Plot diagnostic plots for your model. Does it seem like your model is a good fit for your data? If you're fitting a linear or Poisson model, are the residuals normally distributed (no patterns in the first plot and the points in the second plot are all in a line)? Are there any influential outliers?
4. Examine your fitted model (its coefficients and intercept, or a full model summary if you have one). Which, if any, input variables have a strong relationship to the output variable you're predicting?
5. Draw added-variable plots for your input variables. Do the plots agree with your interpretation of the model?
6. *Optional:* If you want to share your analysis with friends or to ask for help, you’ll need to make it public so that other people can see it.
    * Publish your kernel by hitting the big blue “publish” button. (This may take a second.)
    * Change the visibility to “public” by clicking on the blue “Make Public” text (right above the “Fork Notebook” button).
    * Tag your notebook with 5daychallenge

In [None]:
nyc_census.head()

In [None]:
nyc_census.columns

In [None]:
selected_nyc_census = nyc_census[['Unemployment', 'Hispanic', 'White', 'Black', 'Native']]
selected_nyc_census = selected_nyc_census.dropna()
X = selected_nyc_census[['Hispanic', 'White', 'Black', 'Native']]
y = selected_nyc_census['Unemployment']

In [None]:
# fit a linear regression model (ordinary least squares)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,y)

In [None]:
y_pred = reg.predict(X)
residual = y - y_pred
plt.scatter(y_pred,residual)

In [None]:
# examine our model
print('coefficient = ' + str(reg.coef_))
print('intercept = ' + str(reg.intercept_))

Want more? Ready for a different dataset? [This notebook](https://www.kaggle.com/rtatman/datasets-for-regression-analysis/) has additional dataset suggestions for you to practice regression with. 