## Problem Solving Homework 5
This homework is intended for you to develop your skills in both Pandas and linear modeling. You'll also use Seaborn to visualize the results. Again, you have to do some independent thinking to integrate various concepts you've learned in the class.

As usual, you must describe all code to get credit.

You'll be using the same NHANES data as in HW4. This data contains the results of interviews and other data collection on thousands of US adults in the 1970s and 1980s.



### 1. Cleaning
Read the NHANES NHEFS data into a data frame and clean it to make it more useful for analysis. Make a cleaned version of the data frame, coding each column as either a binary or a numeric value. For columns coded as binary, you must code them so that the more extreme values are grouped together. For example, if the possible values are "Love ice cream", "Like ice cream", "Don't care about ice cream", "Dislike ice cream", and "Hate ice cream", then the more positive feelings about ice cream must be put into the same category. In addition, no category should have fewer than 250 people in it. (You can reuse your HW4 code if you did HW4, but make sure it follows these requirements.)
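
A minimal sketch of the binary-coding rule, using the ice cream example above on made-up data. The column name `ice_cream`, the response counts, and the helper `to_binary` are all hypothetical and are not part of NHEFS. The two positive answers are merged into one category, everything else goes into the other, and the group sizes are checked against the 250-person minimum.

```python
import pandas as pd

# Hypothetical survey answers; the counts are invented for illustration.
answers = pd.Series(
    ["Love ice cream"] * 300
    + ["Like ice cream"] * 200
    + ["Don't care about ice cream"] * 150
    + ["Dislike ice cream"] * 100
    + ["Hate ice cream"] * 50,
    name="ice_cream",
)

MIN_GROUP_SIZE = 250  # minimum category size required by the assignment


def to_binary(series, positive_levels):
    """Return True where the answer is one of the levels grouped as positive."""
    return series.isin(positive_levels)


# Group the two positive answers together; every other answer is False.
likes_ice_cream = to_binary(answers, ["Love ice cream", "Like ice cream"])

# Count the people in each category and confirm both are large enough.
sizes = likes_ice_cream.value_counts()
all_large_enough = bool((sizes >= MIN_GROUP_SIZE).all())
print(sizes)
print("All categories large enough:", all_large_enough)
```

If a category came out too small, the cut point between the two groups would have to move, for example by counting "Don't care about ice cream" with the positive answers.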

In [62]:
# Load packages and data
import pandas as pd
import numpy as np
nhefs = pd.read_csv('../data/nhefs.csv')

# If the data is numeric, split it in half based on whether it's lower or higher than the median

# Pull only numeric columns (copy them so the assignments below do not write into a view of nhefs)
num_df = nhefs.select_dtypes(include=['int64', 'float64']).copy()
# For each numeric column
for col in num_df.columns:
    # Find the median of the data in that column
    med = num_df[col].median()
    # Check whether data is greater than the median
    indexer = num_df[col] > med
    # Reassign the boolean value to the value of that column in the df
    num_df[col] = indexer

# For numeric columns, reassign True as 'high' and False as 'low'
# (column-wise map is used because DataFrame.applymap is deprecated in recent pandas)
num_df = num_df.apply(lambda s: s.map({True: 'high', False: 'low'}))
# Replace the numeric columns in the main df with our encoded columns
nhefs = nhefs.drop(num_df.columns, axis=1)
nhefs = pd.concat([num_df, nhefs], axis=1)

# For columns with categories, pull the leading digit of each label as the integer category code
cats = nhefs.loc[:,['alcoholfreq', 'exercise', 'quit']]
cats = cats.apply(lambda s: s.str[0].astype(int))
# Replace the categorical columns in the main df with our encoded columns
nhefs = nhefs.drop(['alcoholfreq', 'exercise', 'quit'], axis=1)
nhefs = pd.concat([cats, nhefs], axis=1)

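# nerve_med: True if the answer is 'Yes', False otherwise
nhefs['nerve_med'] = nhefs['nerve_med'] == 'Yes'

# sex: True if 'man', False otherwise
nhefs['sex'] = nhefs['sex'] == 'man'

# marital: True if 'married', False for every other marital status
nhefs['marital'] = nhefs['marital'] == 'married'

# race: True if 'White', False for every other value
nhefs['race'] = nhefs['race'] == 'White'

# Print the number of people in each category of every column
for col in nhefs.columns:
    print(nhefs.groupby(col).size())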

alcoholfreq
0    362
1    257
2    547
3    364
4    211
5      5
dtype: int64
exercise
0    343
1    730
2    673
dtype: int64
quit
0    1282
1     464
dtype: int64
income
high     470
low     1276
dtype: int64
price71
high    791
low     955
dtype: int64
price82
high    790
low     956
dtype: int64
smokeintensity
high     492
low     1254
dtype: int64
smokeyrs
high    845
low     901
dtype: int64
tax71_82
high    767
low     979
dtype: int64
wt71
high    873
low     873
dtype: int64
wt82
high    802
low     944
dtype: int64
wt82_71
high    839
low     907
dtype: int64
school
high     308
low     1438
dtype: int64
nerve_med
False    1478
True      268
dtype: int64
sex
False    882
True     864
dtype: int64
marital
False     377
True     1369
dtype: int64
race
False     235
True     1511
dtype: int64


### 2. Single predictor
Find the single predictor that best explains weight gain (wt82_71). Your answer should print the summary of the best model as well as the metric you used to decide that it was the best one.

Note: you may need to remove rows with missing values to get this to work.
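
One way to compare candidate predictors is to fit a one-variable least-squares line for each and rank them by R-squared. The sketch below does this on synthetic data with plain NumPy. The column names `x1`, `x2`, and `y` and the helper `r_squared` are hypothetical, and R-squared is only one possible criterion (AIC or log-likelihood could be used in the same way). Rows with missing values are dropped first, as the note above suggests. A modeling library such as statsmodels would still be needed to print a full model summary.

```python
import numpy as np
import pandas as pd

# Synthetic data: y depends strongly on x1 and only weakly on x2 (invented relationship).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 3.0 * df["x1"] + 0.2 * df["x2"] + rng.normal(scale=0.5, size=n)
df.loc[0, "y"] = np.nan  # simulate one missing outcome


def r_squared(data, predictors, target):
    """Fit ordinary least squares with an intercept and return R-squared."""
    columns = [np.ones(len(data))] + [data[p].to_numpy(dtype=float) for p in predictors]
    X = np.column_stack(columns)
    y = data[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = (residuals ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


# Drop incomplete rows, then score each candidate predictor on its own.
clean = df.dropna()
scores = pd.Series(
    {p: r_squared(clean, [p], "y") for p in ["x1", "x2"]}
).sort_values(ascending=False)

best_predictor = scores.index[0]
print(scores)
print("Best single predictor:", best_predictor)
```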


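| | alcoholfreq | exercise | quit | income | price71 | price82 | smokeintensity | smokeyrs | tax71_82 | wt71 | wt82 | wt82_71 | school | nerve_med | sex | marital | race |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | low | high | low | high | high | high | high | low | low | low | False | True | True | False |
| 1 | 0 | 0 | 0 | low | high | low | low | low | high | low | low | low | low | False | True | True | True |
| 2 | 3 | 2 | 0 | low | low | low | low | high | low | low | low | high | low | True | False | False | False |
| 3 | 2 | 2 | 0 | low | low | low | low | high | low | low | low | high | low | False | True | False | False |
| 4 | 2 | 1 | 0 | low | high | low | low | low | high | high | high | high | low | False | True | True | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1741 | 3 | 0 | 0 | low | low | high | low | high | low | low | low | high | low | False | False | True | True |
| 1742 | 4 | 0 | 0 | low | low | low | high | high | low | low | low | low | low | False | False | True | True |
| 1743 | 2 | 1 | 0 | low | low | high | low | high | low | low | low | high | low | False | True | True | True |
| 1744 | 2 | 0 | 0 | low | low | high | low | low | low | high | high | low | low | False | True | True | True |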


### 3. Pair of predictors
Find the best pair of predictors of weight gain. Your answer should print the summary of the best model as well as show how you decided that this was the best pair. (We will discuss models with more than one predictor on Nov 15 or 16.)
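
A possible way to search over pairs is to enumerate every two-predictor combination with `itertools.combinations`, fit each model, and keep a table of scores so the decision is visible. The sketch below runs on synthetic data; the column names `a` to `d`, the outcome `y`, and the helper `fit_r_squared` are hypothetical. Because every candidate model has the same number of predictors, ranking by R-squared and by adjusted R-squared gives the same order.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: y depends on a and c only (invented relationship).
rng = np.random.default_rng(1)
n = 300
candidates = ["a", "b", "c", "d"]
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=candidates)
df["y"] = 2.0 * df["a"] - 1.5 * df["c"] + rng.normal(scale=0.5, size=n)


def fit_r_squared(data, predictors, target):
    """Fit ordinary least squares with an intercept and return R-squared."""
    columns = [np.ones(len(data))] + [data[p].to_numpy(dtype=float) for p in predictors]
    X = np.column_stack(columns)
    y = data[target].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()


# Score every pair of predictors and sort from best to worst.
rows = [
    {"first": p, "second": q, "r_squared": fit_r_squared(df, [p, q], "y")}
    for p, q in combinations(candidates, 2)
]
results = pd.DataFrame(rows).sort_values("r_squared", ascending=False)

best = results.iloc[0]
best_pair = (best["first"], best["second"])
print(results)
print("Best pair:", best_pair)
```

Printing the whole `results` table, rather than only the winner, shows how far ahead the best pair is of the alternatives.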

## 4. Model metrics
Calculate the RMSE, R-squared, and likelihood of your best models without using the model functions; use only regular pandas/scipy.stats functions. (We will talk about these metrics on Nov 15 or 16.)

#### 4A: RMSE
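
RMSE is the square root of the mean squared residual. A minimal sketch with invented observed and predicted values (the two Series below are placeholders, not NHEFS results):

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
observed = pd.Series([2.0, 4.0, 6.0, 8.0])
predicted = pd.Series([1.0, 5.0, 5.0, 9.0])

# Residual = observed minus predicted; square, average, then take the root.
residuals = observed - predicted
rmse = np.sqrt((residuals ** 2).mean())
print(rmse)  # every residual is +1 or -1 here, so the RMSE is 1.0
```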

#### 4B: R-squared
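
R-squared compares the residual sum of squares with the total sum of squares around the mean of the outcome. A minimal sketch with invented observed and predicted values (placeholders, not NHEFS results):

```python
import pandas as pd

# Invented values for illustration only.
observed = pd.Series([2.0, 4.0, 6.0, 8.0])
predicted = pd.Series([1.0, 5.0, 5.0, 9.0])

# Sum of squared residuals (unexplained variation).
ss_res = ((observed - predicted) ** 2).sum()
# Sum of squared deviations from the mean (total variation).
ss_tot = ((observed - observed.mean()) ** 2).sum()

r_squared = 1 - ss_res / ss_tot
print(r_squared)  # here ss_res = 4 and ss_tot = 20, so this is 1 - 4/20
```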

#### 4C: Likelihood
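
For a linear model, the likelihood is usually computed under the assumption that the residuals are normally distributed with mean zero. The sketch below makes that assumption explicit, estimates the residual standard deviation by maximum likelihood (dividing by n rather than n minus the number of parameters), and sums `scipy.stats.norm.logpdf` over the residuals. The data are invented, and the log-likelihood is reported because the raw likelihood (its exponential) quickly becomes too small to work with on large samples.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented values for illustration only.
observed = pd.Series([2.0, 4.0, 6.0, 8.0])
predicted = pd.Series([1.0, 5.0, 5.0, 9.0])

residuals = observed - predicted
n = len(residuals)

# Assumption: normal errors, with sigma estimated by maximum likelihood.
sigma_hat = np.sqrt((residuals ** 2).sum() / n)

# Log-likelihood = sum of the log density of each residual.
log_likelihood = stats.norm.logpdf(residuals, loc=0, scale=sigma_hat).sum()

# Cross-check against the closed form for a normal model at the ML estimate.
closed_form = -n / 2 * (np.log(2 * np.pi * sigma_hat ** 2) + 1)
print(log_likelihood, closed_form)
```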

## 5. Predict
### 5A: Make predictions
Predict weight gain using the best models you created in questions 2 and 3. Do not use the "predict" function of the model. Your answer should create a Series for each of the two models. For each Series, print its head and its `describe()` output, with the models clearly labeled.
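
Predicting without the model's "predict" function amounts to applying the fitted equation by hand: intercept plus each coefficient times its predictor. The sketch below uses invented data and coefficients. The names `x1`, `x2`, `coefs_one`, `coefs_two`, and `predict_manually` are hypothetical, and the `Intercept` label is an assumption about how the fitted coefficients are stored.

```python
import pandas as pd

# Invented predictors and coefficients; real ones would come from the fitted models.
df = pd.DataFrame({"x1": [0.0, 1.0, 2.0, 3.0], "x2": [1.0, 0.0, 1.0, 0.0]})
coefs_one = pd.Series({"Intercept": 1.0, "x1": 2.0})
coefs_two = pd.Series({"Intercept": 0.5, "x1": 2.0, "x2": -1.0})


def predict_manually(data, coefs):
    """Return intercept + sum(coefficient * predictor) for every row."""
    prediction = pd.Series(coefs["Intercept"], index=data.index, dtype=float)
    for name, value in coefs.drop("Intercept").items():
        prediction = prediction + value * data[name]
    return prediction


pred_one = predict_manually(df, coefs_one).rename("one-predictor model")
pred_two = predict_manually(df, coefs_two).rename("two-predictor model")

# Print a labeled head and summary for each model's predictions.
for pred in (pred_one, pred_two):
    print(pred.name)
    print(pred.head())
    print(pred.describe())
```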

### 5B: Visualize predictions
Using seaborn, make a plot with one predictor on the x-axis and both the actual and the predicted values of wt82_71 on the y-axis, with the predicted values in a different, clearly indicated color.
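
One way to get both sets of values onto the same axes with distinct colors is to reshape the data to long format and let seaborn color by a label column. The sketch below uses synthetic data; the column names `x`, `observed`, `predicted`, `kind`, and `value` are hypothetical, and the predicted values are a stand-in for the manually computed predictions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: a noisy linear outcome and the line used to predict it.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=100)})
df["observed"] = 1.0 + 0.5 * df["x"] + rng.normal(scale=1.0, size=100)
df["predicted"] = 1.0 + 0.5 * df["x"]  # stand-in for manual predictions

# Long format: one row per point, with a label saying which kind of value it is.
long_df = df.melt(
    id_vars="x",
    value_vars=["observed", "predicted"],
    var_name="kind",
    value_name="value",
)

# hue="kind" gives observed and predicted points different colors and a legend.
ax = sns.scatterplot(data=long_df, x="x", y="value", hue="kind")
ax.set_xlabel("predictor (x)")
ax.set_ylabel("outcome")
ax.set_title("Observed vs. predicted values")
```

In a notebook the figure is displayed when the cell finishes; in a script, a call to `matplotlib.pyplot.show()` would be needed.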

#### 5B-1: Visualize model from Q2

#### 5B-2: Visualize model from Q3
Make 2 plots, one for each predictor. 