This is the second day of the 5-Day Regression Challenge. You can find the first day's challenge [here](https://www.kaggle.com/rtatman/regression-challenge-day-1). Today, we’re going to learn how to fit a model to data and how to make sure we haven’t violated any of the underlying assumptions. First, though, you need a tiny bit of background:
____

**Regression formulas in R**

In R, regression is expressed using a specific type of object called a formula. This means that the syntax for expressing a regression relationship is the same across packages that use formula objects. The general syntax for a formula looks like this:

    Output ~ input

If you think that more than one input might be affecting your output (for example that both the amount of time spent exercising and the number of calories consumed might affect changes in someone’s weight) you can represent that with this notation:

	Output ~ input1 + input2
    
We'll talk about how to know which inputs you should include later on; for now, let's just stick to picking inputs based on questions that are interesting to you. (That is, figuring out how to turn a question into a query.)

**Regression in Python**

In Python, there is no native type or operator for expressing a regression. Instead, we structure our input as a nested array (usually using `numpy`) and use a machine learning library such as `sklearn` to perform the regression.

The input is a list of lists: each element of the outer list represents a data point, and the inner list holds that data point's feature(s)/signal(s) that affect the output. For example, with a single feature/signal as input we could do this:

```python
from sklearn import linear_model
X = inputs = [[0], [1], [2]]  # Usually we use X to denote the inputs.
Y = outputs = [0, 1, 2]       # Usually we use Y to denote the outputs.
regressor = linear_model.LinearRegression()
regressor.fit(X, Y)
```

We can visualize the above inputs/outputs as a table:


|Feature 1 |**Output**|
|:---:|:---:|
| 0 | **0** |
| 1 | **1** |
| 2 | **2** |


And if we have multiple features/signals that affect the output, the inner lists of `X` include more values:

```python
from sklearn import linear_model
X = inputs = [[0,0,0], [1,1,1], [2,2,2]]  # Usually we use X to denote the inputs.
Y = outputs = [0, 1, 2]       # Usually we use Y to denote the outputs.
regressor = linear_model.LinearRegression()
regressor.fit(X, Y)
```

And visualizing it as a matrix/table:


|Feature 1| Feature 2| Feature 3 |**Output**|
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | **0** |
| 1 | 1 | 1 | **1** | 
| 2 | 2 | 2 | **2** |
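
Once `fit` has run, the regressor holds a slope and an intercept and can be asked for predictions. Here is a minimal sketch using the single-feature data from above; the new input `[[3]]` and the variable names `slope`, `intercept` and `prediction` are just illustrative choices:

```python
from sklearn import linear_model

X = [[0], [1], [2]]   # one feature per data point
Y = [0, 1, 2]         # one output per data point
regressor = linear_model.LinearRegression()
regressor.fit(X, Y)

# The fitted line is: output = intercept + slope * feature
slope = regressor.coef_[0]
intercept = regressor.intercept_

# Predict the output for a new, unseen data point (note the nested list).
prediction = regressor.predict([[3]])[0]
```

Because the toy data lie exactly on a straight line, the slope should come out at about 1, the intercept at about 0, and the prediction for an input of 3 at about 3.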




**What are these “residuals” everyone keeps talking about?**

A residual is just how far off a model is for a single point. So if our model predicts that a 20 pound cantaloupe should sell for eight dollars and it actually sells for ten dollars, the residual for that data point would be two dollars. Most models will be off by at least a little bit for pretty much all points, but you want to make sure that there’s not a strong pattern in your residuals because that suggests that your model is failing to capture some underlying trend in your dataset.
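
In code, a residual is simply the observed value minus the predicted value. The first pair of numbers below is the cantaloupe example; the extra prices in the lists are made up purely to show the same calculation across several points:

```python
# The cantaloupe example: predicted eight dollars, sold for ten.
predicted_price = 8
actual_price = 10
residual = actual_price - predicted_price  # how far off the model was

# The same idea for several points at once (hypothetical prices).
actual = [10.0, 7.5, 12.0]
predicted = [8.0, 8.0, 11.0]
residuals = [a - p for a, p in zip(actual, predicted)]
```

A positive residual means the model guessed too low, and a negative one means it guessed too high.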
____

Today, we're going to practice fitting a regression model to our data and examining the residuals to see if our model is a good representation of our data.

___

<center>
[**You can check out a video that goes with this notebook by clicking here.**](https://www.youtube.com/embed/3C8SxyD8C7I)


## Example: Kaggle data science survey
___

For our example today, we’re going to use the 2017 Kaggle ML and Data Science Survey. I’m interested in seeing if we can predict the salary of data scientists based on their age. My intuition is that older data scientists, who are probably more experienced, will have higher salaries.

Because salary is a count value (you're usually paid in integer increments of a unit of currency, and hopefully you aren't being paid a negative amount), we're going to model this with a Poisson regression. 

Before we train a model, however, we need to set up our environment. I'm going to read in two datasets: the Kaggle Data Science Survey for the example and the Stack Overflow Developer Survey for you to work with. 

In **R**


```
# libraries
library(tidyverse)
library(boot) #for diagnostic plots

# read in data
kaggle <- read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv")
stackOverflow <- read_csv("../input/so-survey-2017/survey_results_public.csv")
```

In **Python**:

```python
from pandas import read_csv

# Note that the Kaggle data seems to be in latin-1 encoding
kaggle = read_csv("../input/kaggle-survey-2017/multipleChoiceResponses.csv", encoding='iso-8859-1')
stackoverflow = read_csv("../input/so-survey-2017/survey_results_public.csv", encoding='utf8')
```

Now that we've got our environment set up, I'm going to do a tiny bit of data cleaning. First, I only want to look at rows where we have people who have reported having compensation of more than 0 units of currency. (There are many different currencies in the dataset, but for simplicity I'm going to ignore them.)

In **R**:

```
# do some data cleaning
has_compensation <- kaggle %>%
    filter(CompensationAmount > 0) %>% # only get salaries of > 0
    mutate(CleanedCompensationAmount = str_replace_all(CompensationAmount,"[[:punct:]]", "")) %>%
    mutate(CleanedCompensationAmount = as.numeric(CleanedCompensationAmount)) 

# the last two lines remove punctuation (some of the salaries have commas in them)
# and make sure that salary is numeric
```

In **Python**:

```python
import re
import string

punct = re.escape(string.punctuation)
regex = re.compile(f'[{punct}]')

# NumPy/pandas give NaN preferential treatment and leave them
# as they are if there are no explicit instructions to handle them.
# So we have to fill them with zeros first.
clean_compensation = kaggle['CompensationAmount'].fillna(0)

# Then we make sure that the column values are strings before
# we apply the regex that strips out the punctuation.
clean_compensation = clean_compensation.astype(str).apply(lambda x: regex.sub('', x))

# Then create a new 'CleanedCompensationAmount' column with the clean values
# and filter out all the rows where the value is not numerical.
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below.)
kaggle['CleanedCompensationAmount'] = clean_compensation
kaggle_clean = kaggle[kaggle['CleanedCompensationAmount'].apply(lambda x: x.isnumeric())].copy()

# Now, we can safely cast all values in 'CleanedCompensationAmount' to integers.
kaggle_clean['CleanedCompensationAmount'] = kaggle_clean['CleanedCompensationAmount'].astype(int)

# Lastly, we filter out the rows where people don't get paid @_@
has_compensation = kaggle_clean[kaggle_clean['CleanedCompensationAmount'] > 0]

# In addition to the R cleaning, let's also remove the rows that have NaN in the 'Age' column.
has_compensation = has_compensation.dropna(subset=['Age'], how='all')
```

In **Python** (with fewer lines):

```python
import re
import string

punct = re.escape(string.punctuation)
regex = re.compile(f'[{punct}]')

kaggle['CleanedCompensationAmount'] = kaggle['CompensationAmount'].fillna(0).astype(str).apply(lambda x: regex.sub('', x))
kaggle_clean = kaggle[kaggle['CleanedCompensationAmount'].apply(lambda x: x.isnumeric())]
has_compensation = kaggle_clean[kaggle_clean['CleanedCompensationAmount'].astype(int) > 0].dropna(subset=['Age'], how='all')
```
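
To see what the cleaning does without loading the whole survey, here is a small sketch on a made-up table. It uses a slightly different route than the cells above: pandas' own `str.replace` and `to_numeric` instead of the `re` module. The `toy` dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the messy 'CompensationAmount' column.
toy = pd.DataFrame({
    "CompensationAmount": ["100,000", "0", None, "-", "85000"],
    "Age": [34, 25, 41, 29, None],
})

# Strip every character that is not a digit, then convert to numbers.
# Missing or empty values become NaN instead of raising an error.
digits_only = toy["CompensationAmount"].str.replace(r"\D", "", regex=True)
toy["CleanedCompensationAmount"] = pd.to_numeric(digits_only, errors="coerce")

# Keep rows with a positive salary and a known age.
toy_clean = toy[toy["CleanedCompensationAmount"] > 0].dropna(subset=["Age"])
```

Only the first row survives: the zero salary, the missing salary, the lone dash and the row with no age are all dropped.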

Alright, now we're ready to fit our model! To do this, we need to pass the function glm() a formula with the columns we're interested in, the name of the dataframe (so it knows where the columns are from) and the family for our model. Remember from earlier that our formula should look like this:

    Output ~ input
    
We're also predicting a count value, as discussed above, so we want to make sure the family is Poisson.
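
To make "Poisson family" a bit more concrete: with the default log link, the model's prediction is `exp(intercept + slope * Age)`, so predictions are always positive. The sketch below uses made-up coefficients (`intercept` and `slope` are illustrative values, not numbers fitted from the survey):

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
intercept = 10.0
slope = 0.02

ages = np.array([25, 35, 45])

# Log link: the linear predictor lives on the log scale, so we exponentiate it.
predicted_salary = np.exp(intercept + slope * ages)

# Each 10-year step multiplies the prediction by the same factor, exp(10 * slope).
ratios = predicted_salary[1:] / predicted_salary[:-1]
```

That constant ratio is the main practical difference from a straight-line model, where each extra year would add a fixed amount instead of multiplying by a fixed factor.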

In **R**:

```
# poisson model to predict salary by age
model <- glm(CleanedCompensationAmount ~ Age, data = has_compensation, family = poisson)
```


In **Python** (using `sklearn`, without a Poisson distribution):

```python
import numpy as np

# First we must get 'Age' into the correct input nested array format.
ages_flat_array = np.array(has_compensation['Age'])  # This is a flat array of int.
ages_flat_array
```

```python
# We use the `reshape()` function.
ages_nested_array = np.array(has_compensation['Age']).reshape(len(has_compensation), 1)
ages_nested_array
```

```python
import numpy as np

from sklearn import linear_model
X = inputs = np.array(has_compensation['Age']).reshape((len(has_compensation), 1))
Y = outputs = has_compensation['CleanedCompensationAmount']
regressor = linear_model.LinearRegression()
regressor.fit(X, Y)
```
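
A side note on the reshaping step: there are a couple of equivalent ways to get from a flat column to the nested shape that `sklearn` expects. The sketch below uses a tiny made-up `toy` dataframe in place of `has_compensation`:

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for has_compensation['Age'].
toy = pd.DataFrame({"Age": [22, 35, 41]})

flat = toy["Age"].to_numpy()          # shape (3,): one flat array
nested = flat.reshape(-1, 1)          # shape (3, 1): -1 lets numpy infer the row count
from_frame = toy[["Age"]].to_numpy()  # double brackets keep the 2-D shape directly
```

Both `nested` and `from_frame` hold one inner list per data point, which is the format used for `X` above.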

Sadly, at the time of writing `sklearn` didn't yet have Poisson regression (releases from 0.23 onward add `PoissonRegressor`). Here's the documentation of the generalized linear models available in `sklearn`: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

But fear not: the `statsmodels` library and its `glm` function come to the rescue.

In **Python** (with `statsmodels`):

See https://stackoverflow.com/a/37942077/610569 for more details

```python
# Make sure that our input and output are integers.
has_compensation = has_compensation.copy()  # avoid SettingWithCopyWarning
has_compensation['CleanedCompensationAmount'] = has_compensation['CleanedCompensationAmount'].astype(int)
has_compensation['Age'] = has_compensation['Age'].astype(int)
```

```python
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson

glm = statsmodels.formula.api.glm # Formula interface that emulates the R syntax.
model = glm("CleanedCompensationAmount ~ Age", 
            data=has_compensation, family=Poisson())
results = model.fit()
print(results.summary())
```

We'll talk about how to examine and interpret a model tomorrow. For now, we want to make sure that it's a good fit for our data and problem. To do this, let's use some diagnostic plots.  

In **R**:
    
```
# diagnostic plots
glm.diag.plots(model)
```

In **Python**:

Sadly, getting diagnostic plots in Python isn't as simple as it is in R. There's a good article on this at https://medium.com/@emredjan/emulating-r-regression-plots-in-python-43741952c034 

We'll try our best to replicate the R plots here:


```python
# The libraries we'll need for the plot.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import ProbPlot

sns.set() # pretty matplotlib plots (the 'seaborn' style name is gone from newer matplotlib)
```

Before we get to any actual plotting, we need to extract the values that we'll be plotting.

```python
# NOTE: this assumes `results` comes from a statsmodels GLM fit.
influence = results.get_influence()

# fitted values (on the scale of the response)
model_fitted_y = results.fittedvalues
# model residuals (observed minus fitted)
model_residuals = results.resid_response
# normalized (studentized) residuals, as a plain array for positional indexing
model_norm_residuals = np.asarray(influence.resid_studentized)
# square root of the absolute normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = influence.hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = influence.cooks_distance[0]
```

```python
# Linear predictor vs. residuals.
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
# x is the fitted values and y is the response column we modelled.
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y='CleanedCompensationAmount', 
                                  data=has_compensation, 
                                  lowess=True, 
                                  scatter_kws={'alpha': 0.5}, 
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Linear Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
```

```python
# Normal Q-Q plot of the normalized residuals.
QQ = ProbPlot(model_norm_residuals)
plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
plot_lm_2.set_figheight(8)
plot_lm_2.set_figwidth(12)
plot_lm_2.axes[0].set_title('Normal Q-Q')
plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
plot_lm_2.axes[0].set_ylabel('Standardized Residuals');
# annotate the three points with the largest absolute residuals
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    plot_lm_2.axes[0].annotate(i, 
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]));
```

All of these diagnostic plots are plotting residuals, or how much our model is off for a specific prediction. Spoiler alert: all of these plots are showing us big warning signs for this model! Here's what they should look like:

* **Residuals vs Linear predictor**: You want this to look like a shapeless cloud. If there are outliers it means you've gotten some things very wrong, and if there's a clear pattern it usually means you've picked the wrong type of model. (For logistic regression, you can just ignore this plot. It's checking if the residuals are normally distributed, and logistic regression doesn't assume that they will be.)
* **Quantiles of standard normal vs. ordered deviance residuals**: For this plot you want to see the residuals lined up along a diagonal line that goes from the bottom left to the top right. If they're strongly off that line, especially in one corner, it means you have a strong skew in your data. (For logistic regression you can ignore this plot too.)
* **Cook's distance vs. h/(1-h)**: Here, you want your data points to be clustered near zero. If you have a data point that is far from zero (on either axis) it means that it's very influential and that one point is dramatically changing your analysis.
* **Cook's distance vs. case**: In this plot, you want your data to be mostly around zero on the y axis. The x axis just tells you what row in your dataframe the observation is taken from. Points that are outliers on the y axis are changing your model a lot and should probably be removed (unless you have a good reason to include them).
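
To get a feel for what leverage (the `h` in `h/(1-h)`) and Cook's distance measure, here is a small hand-rolled sketch. It is a simplification: it uses ordinary least squares on made-up data rather than the Poisson model from this notebook, and all names and values are hypothetical.

```python
import numpy as np

# Hypothetical data: five well-behaved points plus one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 30.0])

X = np.column_stack([np.ones_like(x), x])     # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
residuals = y - X @ beta

# Leverage h: the diagonal of the hat matrix X (X'X)^-1 X'.
hat = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(hat)

# Cook's distance (OLS version): combines residual size with leverage.
p = X.shape[1]                                # number of fitted coefficients
s2 = residuals @ residuals / (len(y) - p)     # residual variance estimate
cooks = residuals**2 / (p * s2) * leverage / (1 - leverage) ** 2

most_influential = int(np.argmax(cooks))      # row index of the most influential point
```

Here the last point has both a large residual and high leverage, so it ends up with by far the largest Cook's distance. That is the kind of point the last two plots are designed to flag.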

Based on these diagnostic plots, we should definitely not trust this model. There are a small handful of very influential points that are drastically changing our model. Remember, we didn't convert all the currencies to the same currency, so we're probably seeing some weirdnesses due to including a currency like the Yen, which is worth roughly one one-hundredth of a dollar. 

With that in mind, let's see how the plots change when we remove any salaries of 150,000 or more. 

```
# remove compensation values of 150,000 or more
has_compensation <- has_compensation %>%
    filter(CleanedCompensationAmount < 150000)

# poisson model to predict salary by age
model <- glm(CleanedCompensationAmount ~ Age, data = has_compensation, family = poisson)

# diagnostic plots
glm.diag.plots(model)
```
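
For readers following along in Python, the same cutoff can be sketched in pandas with a boolean mask. The `toy` dataframe below is a made-up stand-in for `has_compensation`:

```python
import pandas as pd

# Hypothetical stand-in for the has_compensation dataframe.
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "CleanedCompensationAmount": [60000, 149999, 150000, 9000000],
})

# Keep only rows strictly below the 150,000 cutoff, like filter() in the R code.
trimmed = toy[toy["CleanedCompensationAmount"] < 150000]
```

Note that the comparison is strict, so a salary of exactly 150,000 is dropped along with the larger values.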

Now our plots look much better! Our residuals are more-or-less randomly distributed (which is what the first two plots tell us) and while we still have one outstanding influential point, we can tell by comparing the Cook statistics from the first and second set of plots that it's waaaaaaaayyy less influential than the outliers we got rid of. 

Our first model would probably not have been very informative for a new set of observations. Our second model is more likely to be helpful. 

As a final step, we can fit & plot a model to our data, like we did yesterday, to see if our hunch about age and salary was correct.

```
# plot & add a regression line
ggplot(has_compensation, aes(x = Age, y = CleanedCompensationAmount)) + # draw a plot
    geom_point() + # add points
    geom_smooth(method = "glm", # plot a regression...
    method.args = list(family = "poisson")) # ...from the poisson family
```

It looks like we were right about older data scientists making more. It does look like there are some outliers in terms of age, which we could remove with further data cleaning (which you're free to do if you like). First, however, why don't you try your hand at fitting a model and using diagnostic plots to check it out?

## Your turn!
___

Now it's your turn to come up with a model and check it out using diagnostic plots!

1. Pick a question to answer using the Stack Overflow dataset. (You may want to check out the "survey_results_schema.csv" file to learn more about the data.) Pick a variable to predict and one variable to use to predict it.
2. Fit a GLM model of the appropriate family. (Check out [yesterday's challenge](https://www.kaggle.com/rtatman/regression-challenge-day-1) if you need a refresher.)
3. Plot diagnostic plots for your model. Does it seem like your model is a good fit for your data? Are the residuals normally distributed (no patterns in the first plot and the points in the second plot are all in a line)? Are there any influential outliers?
4. Plot your two variables & use "geom_smooth" and the appropriate family to fit and plot a model.
5. Optional: If you want to share your analysis with friends or to ask for help, you’ll need to make it public so that other people can see it.
    * Publish your kernel by hitting the big blue “publish” button. (This may take a second.)
    * Change the visibility to “public” by clicking on the blue “Make Public” text (right above the “Fork Notebook” button).
    * Tag your notebook with 5daychallenge

```
# your work goes here :)
```


Want more? Ready for a different dataset? [This notebook](https://www.kaggle.com/rtatman/datasets-for-regression-analysis/) has additional dataset suggestions for you to practice regression with. 