<a href="https://colab.research.google.com/github/veritaem/AB-Demo/blob/master/DS_Unit_2_Sprint_Challenge_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 2 Sprint Challenge 3

## Logistic Regression and Beyond

In this sprint challenge you will fit a logistic regression modeling the probability of an adult having an income above 50K. The dataset is available at UCI:

https://archive.ics.uci.edu/ml/datasets/adult

Your goal is to:

1. Load, validate, and clean/prepare the data.
2. Fit a logistic regression model
3. Answer questions based on the results (as well as a few extra questions about the other modules)

Don't let the perfect be the enemy of the good! Manage your time, and make sure to get to all parts. If you get stuck wrestling with the data, simplify it (if necessary, drop features or rows) so you're able to move on. If you have time at the end, you can go back and try to fix/improve.

### Hints

The dataset has a variety of features: some are continuous, but many are categorical. You may find [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) (a function to one-hot encode) helpful!

The features have dramatically different ranges. You may find [sklearn.preprocessing.minmax_scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html#sklearn.preprocessing.minmax_scale) helpful!
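
As a rough illustration of the two hints together, the sketch below one-hot encodes a categorical column and rescales the numeric ones. The tiny frame and its column names (`age`, `capital_gain`, `workclass`) are made up for the example and are not the notebook's actual columns.

```python
import pandas as pd
from sklearn.preprocessing import minmax_scale

# Tiny made-up frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "capital_gain": [2174, 0, 0, 14084],
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
})

numeric_cols = ["age", "capital_gain"]

# One-hot encode the categorical column; numeric columns pass through untouched.
encoded = pd.get_dummies(toy, columns=["workclass"])

# Rescale each numeric column to the [0, 1] range so coefficients are comparable.
encoded[numeric_cols] = minmax_scale(encoded[numeric_cols])

print(encoded)
```

Scaling only the numeric columns leaves the dummy columns as plain 0/1 indicators, which keeps them easy to read.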

In [1]:
import pandas as pd
import numpy as np
import statistics
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.preprocessing import minmax_scale



## Part 1 - Load, validate, and prepare data

The data is available at: https://archive.ics.uci.edu/ml/datasets/adult

Load it, name the columns, and make sure that you've loaded the data successfully. Note that missing values for categorical variables can essentially be considered another category ("unknown"), and may not need to be dropped.

You should also prepare the data for logistic regression - one-hot encode categorical features as appropriate.

In [2]:
data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
test_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'
df = pd.read_csv(data, header = None)
test_df = pd.read_csv(test_data)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
test_df.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,|1x3 Cross validator
25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.
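
The odd header above comes from the first line of `adult.test`, which is not a data row. One possible way to read the file more cleanly is sketched below, using an in-memory sample instead of the real URL so it runs offline. The column names follow the UCI description of the dataset and are an assumption here, since the notebook keeps the default integer labels.

```python
import io
import pandas as pd

# Column names follow the UCI description of the Adult dataset; the notebook
# itself keeps the default integer labels, so these names are an assumption.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Stand-in for adult.test: a non-data first line followed by two rows shown above.
sample = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.\n"
)

test_df = pd.read_csv(
    sample,
    header=None,            # the file has no header row
    names=columns,
    skiprows=1,             # skip the "|1x3 Cross validator" line
    skipinitialspace=True,  # drop the space after each comma
    na_values="?",          # read "?" as a missing value
)

# Labels in the test file carry a trailing period; strip it to match the training labels.
test_df["income"] = test_df["income"].str.rstrip(".")
print(test_df)
```

The same `skipinitialspace` and `na_values` arguments would also remove the leading spaces and `' ?'` placeholders seen in the training file.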


In [4]:
df.isna().sum()
# That's weird: no NaNs reported, even though the data has missing values

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
dtype: int64

In [5]:
df[1].unique()
# Ah, so missing values are encoded as ' ?' strings rather than NaN

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [6]:
df[14].unique()

array([' <=50K', ' >50K'], dtype=object)

In [7]:
df = df.replace(' ?', np.nan)  # turn the ' ?' placeholders into real NaNs
df.isna().sum()

0        0
1     1836
2        0
3        0
4        0
5        0
6     1843
7        0
8        0
9        0
10       0
11       0
12       0
13     583
14       0
dtype: int64

We now know that three columns contain missing values:

1) the workclass of the subject (column 1),

2) the occupation of the subject (column 6), and

3) the country the subject is native to (column 13).

All of these sound terribly important to my mind, so I think I'll run two models: one that drops the affected rows completely, and one that keeps them with the value left as unknown. Let's see what it does for us!
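
One detail worth checking when keeping the unknowns: `pd.get_dummies` gives a NaN row zeros in every dummy column, so "unknown" only becomes its own category if the gap is filled with a label first. A small sketch with a made-up column (the label `"Unknown"` is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# Made-up workclass column with one missing entry (illustrative values only).
workclass = pd.Series(["Private", np.nan, "State-gov"], name="workclass")

# Left as NaN, the missing row gets zeros in every dummy column.
implicit = pd.get_dummies(workclass)

# Filled with a label, the missing row gets its own dummy column.
explicit = pd.get_dummies(workclass.fillna("Unknown"))

print(implicit.columns.tolist())
print(explicit.columns.tolist())
```

Either encoding keeps the row in the model; the explicit version just makes the coefficient for "unknown" visible.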

In [8]:
df2 = df.dropna()
print(df2.isna().sum())
df2.head()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
dtype: int64


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K



## Part 2 - Fit and present a Logistic Regression

Your data should now be in a state to fit a logistic regression. Use scikit-learn, define your `X` (independent variable) and `y`, and fit a model.

Then, present results - display coefficients in as interpretable a way as you can (hint - scaling the numeric features will help, as it will at least make coefficients more comparable to each other). If you find it helpful for interpretation, you can also generate predictions for cases (like our 5 year old rich kid on the Titanic) or make visualizations - but the goal of your exploration is to be able to answer the question, not to produce any particular plot (i.e. don't worry about polishing it).

It is *optional* to use `train_test_split` or validate your model more generally - that is not the core focus for this week. So, it is suggested you focus on fitting a model first, and if you have time at the end you can do further validation.
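
For the optional validation step, a minimal sketch of holding out part of the data might look like this. It uses synthetic data rather than the Adult dataset, and the names `X_demo` and `y_demo` are placeholders for whatever prepared features and 0/1 target you end up with.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and 0/1 target (not the Adult data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.5, size=500)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Hold out 25% of the rows so the score is measured on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

model = LogisticRegression().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)
```

A large gap between the two accuracies would hint at overfitting; similar values suggest the training score is a fair summary.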

In [0]:
X = df2.drop([14], axis = 1)
y = df2[14]
X2 = df.drop([14], axis = 1)
y2 = df[14]

In [11]:
# The 'display.height' option is not accepted by current pandas, so it is omitted
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
X = pd.get_dummies(X)
X2 = pd.get_dummies(X2)
X2

Unnamed: 0,0,2,4,10,11,12,1_ Federal-gov,1_ Local-gov,1_ Never-worked,1_ Private,1_ Self-emp-inc,1_ Self-emp-not-inc,1_ State-gov,1_ Without-pay,3_ 10th,3_ 11th,3_ 12th,3_ 1st-4th,3_ 5th-6th,3_ 7th-8th,3_ 9th,3_ Assoc-acdm,3_ Assoc-voc,3_ Bachelors,3_ Doctorate,3_ HS-grad,3_ Masters,3_ Preschool,3_ Prof-school,3_ Some-college,5_ Divorced,5_ Married-AF-spouse,5_ Married-civ-spouse,5_ Married-spouse-absent,5_ Never-married,5_ Separated,5_ Widowed,6_ Adm-clerical,6_ Armed-Forces,6_ Craft-repair,6_ Exec-managerial,6_ Farming-fishing,6_ Handlers-cleaners,6_ Machine-op-inspct,6_ Other-service,6_ Priv-house-serv,6_ Prof-specialty,6_ Protective-serv,6_ Sales,6_ Tech-support,6_ Transport-moving,7_ Husband,7_ Not-in-family,7_ Other-relative,7_ Own-child,7_ Unmarried,7_ Wife,8_ Amer-Indian-Eskimo,8_ Asian-Pac-Islander,8_ Black,8_ Other,8_ White,9_ Female,9_ Male,13_ Cambodia,13_ Canada,13_ China,13_ Columbia,13_ Cuba,13_ Dominican-Republic,13_ Ecuador,13_ El-Salvador,13_ England,13_ France,13_ Germany,13_ Greece,13_ Guatemala,13_ Haiti,13_ Holand-Netherlands,13_ Honduras,13_ Hong,13_ Hungary,13_ India,13_ Iran,13_ Ireland,13_ Italy,13_ Jamaica,13_ Japan,13_ Laos,13_ Mexico,13_ Nicaragua,13_ Outlying-US(Guam-USVI-etc),13_ Peru,13_ Philippines,13_ Poland,13_ Portugal,13_ Puerto-Rico,13_ Scotland,13_ South,13_ Taiwan,13_ Thailand,13_ Trinadad&Tobago,13_ United-States,13_ Vietnam,13_ Yugoslavia
0,39,77516,13,2174,0,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,37,284582,14,0,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
6,49,160187,5,0,0,16,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,52,209642,9,0,0,45,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
8,31,45781,14,14084,0,50,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
9,42,159449,13,5178,0,40,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [12]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
X2 = scaler.fit_transform(X2)



In [13]:
y = pd.DataFrame(y)
y2 = pd.DataFrame(y2)
y2[14].unique()

array([' <=50K', ' >50K'], dtype=object)

In [14]:
y[14] = y[14].replace({' <=50K':0, ' >50K':1})
y

Unnamed: 0,14
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,1
8,1
9,1


In [15]:
print(X.shape)
print(y.shape)
print(X2.shape)
print(y2.shape)

(30162, 104)
(30162, 1)
(32561, 105)
(32561, 1)


In [16]:
y2[14] = y2[14].replace({' <=50K':0, ' >50K':1})
y2

Unnamed: 0,14
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,1
8,1
9,1


In [17]:
lin_reg = LinearRegression().fit(X, y)
lin_reg.score(X, y)

0.37010698722362145

In [18]:
lin_reg = LinearRegression().fit(X2, y2)
lin_reg.score(X2, y2)

0.3690944573554026

In [19]:
log_reg = LogisticRegression().fit(X, y[14])  # pass the target as a 1-D Series
log_reg.score(X, y[14])


0.8498110204893574

In [20]:
log_reg = LogisticRegression().fit(X2, y2[14])  # pass the target as a 1-D Series
log_reg.score(X2, y2[14])


0.8533214581861737

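So logistic regression is pulling more weight here: it classifies about 85% of rows correctly, while linear regression reaches an R² of only about 0.37. The two scores measure different things (accuracy versus share of variance explained), so they are not directly comparable, but accuracy is the one that matters for a yes/no target. What's the MSE for the model that left the unknowns in (which ended up being the top performer)?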
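
A note on that MSE: when both labels and predictions are coded 0/1, every wrong prediction contributes exactly 1 to the squared error, so the MSE is the misclassification rate, i.e. 1 minus accuracy. With the accuracy reported above, 1 - 0.8533 gives roughly 0.1467. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions, both coded 0/1.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

accuracy = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Each error contributes (1 - 0)^2 = 1, so MSE is the fraction misclassified.
print(accuracy, mse, 1 - accuracy)
```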



In [21]:

print(mean_squared_error(y2, log_reg.predict(X2)))
print(log_reg.coef_)

0.14667854181382636
[[ 3.47553418e-01  7.45758989e-02  3.68651320e-01  2.34406346e+00
   2.60537495e-01  3.66386279e-01  1.72915434e-01  8.28982720e-02
  -6.38431732e-02  2.39145308e-01  1.26585924e-01  8.14891961e-03
   4.10493170e-02 -1.24471187e-01 -9.01453909e-02 -1.11461812e-01
  -3.86552992e-02 -3.56333977e-02 -3.50733079e-02 -1.01595242e-01
  -7.36564837e-02 -1.17165092e-02  1.89424334e-02  1.44920357e-01
   1.13424642e-01 -7.37941513e-02  1.34398694e-01 -4.65682727e-01
   1.26627263e-01  1.83952935e-02 -2.26244353e-01  5.29407030e-02
   7.54567048e-01 -7.54265071e-02 -5.35607697e-01 -1.38278857e-01
  -9.13128971e-02  5.77693204e-02 -1.64077212e-02  8.37348562e-02
   3.19613729e-01 -1.39458402e-01 -9.99643323e-02 -2.43267540e-02
  -1.93769587e-01 -2.60012303e-01  2.31788886e-01  1.05756000e-01
   1.45636656e-01  1.39074756e-01  1.52998513e-02 -6.85091375e-02
   1.73167927e-01 -9.01331569e-02 -2.97288417e-01  8.26001452e-02
   2.61194547e-01 -4.75804208e-02  3.06115618e-02 -2.839

Very nicely jobbed, computer, you live another day!  

In [22]:
x =[ 3.47553418e-01,  7.45758989e-02,  3.68651320e-01,  2.34406346e+00,
   2.60537495e-01,  3.66386279e-01,  1.72915434e-01,  8.28982720e-02,
  -6.38431732e-02,  2.39145308e-01,  1.26585924e-01,  8.14891961e-03,
   4.10493170e-02, -1.24471187e-01, -9.01453909e-02, -1.11461812e-01,
  -3.86552992e-02, -3.56333977e-02, -3.50733079e-02, -1.01595242e-01,
  -7.36564837e-02, -1.17165092e-02,  1.89424334e-02,  1.44920357e-01,
   1.13424642e-01, -7.37941513e-02,  1.34398694e-01, -4.65682727e-01,
   1.26627263e-01,  1.83952935e-02, -2.26244353e-01,  5.29407030e-02,
   7.54567048e-01, -7.54265071e-02, -5.35607697e-01, -1.38278857e-01,
  -9.13128971e-02,  5.77693204e-02, -1.64077212e-02,  8.37348562e-02,
   3.19613729e-01, -1.39458402e-01, -9.99643323e-02, -2.43267540e-02,
  -1.93769587e-01, -2.60012303e-01,  2.31788886e-01,  1.05756000e-01,
   1.45636656e-01,  1.39074756e-01,  1.52998513e-02, -6.85091375e-02,
   1.73167927e-01, -9.01331569e-02, -2.97288417e-01, 8.26001452e-02,
   2.61194547e-01, -4.75804208e-02,  3.06115618e-02,-2.83955104e-02,
  -2.87583009e-02,  2.89743660e-02, -2.02170240e-01, 2.02170240e-01,
   3.57507980e-02,  3.13991047e-02, -2.43032154e-02, -8.18776786e-02,
   2.87221882e-02, -7.58800109e-02, -2.75649732e-03, -2.39749382e-02,
   2.59921288e-02,  2.30607922e-02,  4.00332214e-02, -2.37356923e-02,
  -2.79797041e-03,  4.93572788e-03, -1.95951475e-02, -2.16168962e-02,
   2.12543175e-03,  1.45598726e-03, -1.04345298e-02,  8.44262352e-03,
   1.95278856e-02,  4.69485613e-02,  1.13076170e-02,  2.52548141e-02,
  -9.90602654e-03, -5.06615199e-02, -1.98940683e-02, -1.18524162e-01,
  -2.00460031e-02,  4.74142552e-02,  7.78767518e-03,  5.15406046e-03,
  -8.76243889e-03,  3.64671719e-03, -4.36388000e-02, 8.83519969e-03,
  -8.91958177e-03, -4.79735059e-03,  1.16332331e-01, -4.33881191e-02,
   1.92827910e-02]


print(sorted(x))

[-0.535607697, -0.465682727, -0.297288417, -0.260012303, -0.226244353, -0.20217024, -0.193769587, -0.139458402, -0.138278857, -0.124471187, -0.118524162, -0.111461812, -0.101595242, -0.0999643323, -0.0913128971, -0.0901453909, -0.0901331569, -0.0818776786, -0.0758800109, -0.0754265071, -0.0737941513, -0.0736564837, -0.0685091375, -0.0638431732, -0.0506615199, -0.0475804208, -0.0436388, -0.0433881191, -0.0386552992, -0.0356333977, -0.0350733079, -0.0287583009, -0.0283955104, -0.024326754, -0.0243032154, -0.0239749382, -0.0237356923, -0.0216168962, -0.0200460031, -0.0198940683, -0.0195951475, -0.0164077212, -0.0117165092, -0.0104345298, -0.00990602654, -0.00891958177, -0.00876243889, -0.00479735059, -0.00279797041, -0.00275649732, 0.00145598726, 0.00212543175, 0.00364671719, 0.00493572788, 0.00515406046, 0.00778767518, 0.00814891961, 0.00844262352, 0.00883519969, 0.011307617, 0.0152998513, 0.0183952935, 0.0189424334, 0.019282791, 0.0195278856, 0.0230607922, 0.0252548141, 0.0259921288, 0.
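
A less manual way to match coefficients to columns could look like the sketch below. It uses only a handful of names and values copied from the printouts above. In the notebook, the full versions would come from the dummy-encoded frame's columns (saved before scaling, since `StandardScaler` returns a plain array) and from `log_reg.coef_[0]`. The names `feature_names` and `coefficients` are illustrative, not variables from the notebook.

```python
import pandas as pd

# A few column names and coefficients copied from the printouts above.
# Columns "10" and "4" are the unnamed numeric columns 10 and 4.
feature_names = [
    "10", "5_ Married-civ-spouse", "4",
    "5_ Never-married", "3_ Preschool", "7_ Own-child",
]
coefficients = [
    2.34406346, 0.754567048, 0.368651320,
    -0.535607697, -0.465682727, -0.297288417,
]

# Pair each coefficient with its column name, then sort by value.
coef = pd.Series(coefficients, index=feature_names).sort_values()

print(coef.head(3))  # most negative coefficients
print(coef.tail(3))  # most positive coefficients
```

Keeping the names attached to the values avoids counting columns by hand, which is easy to get wrong by one because the printout's first column is the row index.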

## Part 3 - Analysis, Interpretation, and Questions

### Based on your above model, answer the following questions

1. What are 3 features positively correlated with income above 50k?
2. What are 3 features negatively correlated with income above 50k?
3. Overall, how well does the model explain the data and what insights do you derive from it?

*These answers count* - that is, make sure to spend some time on them, connecting to your analysis above. There is no single right answer, but as long as you support your reasoning with evidence you are on the right track.

Note - scikit-learn logistic regression does *not* automatically perform a hypothesis test on coefficients. That is OK - if you scale the data they are more comparable in weight.

### Match the following situation descriptions with the model most appropriate to addressing them

In addition to logistic regression, a number of other approaches were covered this week. Pair them with the situations they are most appropriate for, and briefly explain why.

Situations:
1. You are given data on academic performance of primary school students, and asked to fit a model to help predict "at-risk" students who are likely to receive the bottom tier of grades.
2. You are studying tech companies and their patterns in releasing new products, and would like to be able to model and predict when a new product is likely to be launched.
3. You are working on modeling expected plant size and yield with a laboratory that is able to capture fantastically detailed physical data about plants, but only of a few dozen plants at a time.

Approaches:
1. Ridge Regression
2. Quantile Regression
3. Survival Analysis

### Questions pt 1

1) As per the coefficients reported just above, the largest coefficient is coefficient 4 (2.34), which belongs to column 10, 'capital gain'. The second largest (0.75) is 'Married-civ-spouse', meaning married to a civilian spouse, and the third (0.37) is column 4, 'education-num' (years of education). All of this suggests that the strongest predictors of whether you make more than 50K are having capital gains (or reporting them, I suppose, is more accurate), being married, and having a higher education level. All of which is not only intuitive but borne out in the data.

For checkback, the list x above is sorted to have the highest coefficients at the end of the list. You can Ctrl+F each value in the coefficients above it (don't forget the decimal!) to find its position, and then go to that column in the dummies dataset (X2's printout), remembering that the printout's first column is the row index. I'm absolutely positive there's a better way, but this way works just fine and spares me some extra time and effort for this writeup!



2) Likewise, the 3 largest negative factors are: never having married (-0.54), having only a preschool education (-0.47), and being the householder's own child (-0.30), in order from largest to smallest negative influence. So don't be a baby or a child if you want to make good money, but it looks like never marrying is even worse for your financial wellbeing. That's the Big Oof from me!

3) Overall this model explains quite a bit of this data, getting 85% or so correct after scaling and leaving in the rows with unknown values. The insights to publish about this might be that, in order to be making a salary above 50K, you should be of working age, have a high education, and be married rather than never married. I was very surprised about the marriage thing, but I ran it like 6 times, so... yeah, that's a rough one.


### Questions pt 2


1) 'At-risk' kids really means the bottom quantile of grades (let's just say the bottom 6.3252878% for simplicity).

Because of this, quantile regression is the right tool: instead of modeling the average grade, it models a chosen quantile of grades given a student's features, so you can fit the bottom tier directly. That is what you want in this case, since students predicted to land above the bottom tier (the other 93.6747122%) are not the ones historically at the biggest risk.

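
To make the quantile idea concrete, here is a small sketch of the pinball (quantile) loss that quantile regression minimizes. It fits only a constant to made-up grades, which is the simplest possible case; a real model would let the prediction depend on student features. The quantile `q = 0.25` and the helper name `pinball_loss` are illustrative choices.

```python
import numpy as np

def pinball_loss(y, prediction, q):
    """Average quantile (pinball) loss of a constant prediction at quantile q."""
    diff = y - prediction
    # Under-predictions are weighted by q, over-predictions by (1 - q).
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Made-up grades for illustration.
grades = np.array([52, 58, 61, 65, 70, 74, 78, 83, 88, 95], dtype=float)

q = 0.25  # target the bottom quarter (an arbitrary choice of quantile)
candidates = np.linspace(50, 100, 501)
losses = [pinball_loss(grades, c, q) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(best)
```

The loss is smallest at a grade that sits near the bottom quarter of the data rather than at the mean, which is why this loss targets the low tail.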
2) Survival analysis is all about time and events: it models the time that elapses between a starting point and an event. Because of this, survival analysis is just what you need for this issue! By treating the time before a product comes out as a 'lifetime' and fitting a model to it, you can estimate how long it will be from now until a new product is expected to launch. Start saving your money up now!

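
As a rough illustration of the survival idea, the sketch below hand-rolls a Kaplan-Meier estimate on made-up launch gaps. The data, the helper name `kaplan_meier`, and the choice of this particular estimator are assumptions for the example; a dedicated survival library would normally do this work.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)               # still waiting just before t
        events = np.sum((durations == t) & observed)   # launches exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Made-up months between launches; False means no launch seen yet (censored).
months = [3, 5, 5, 8, 12, 12, 15]
launched = [True, True, False, True, True, False, True]

times, surv = kaplan_meier(months, launched)
for t, s in zip(times, surv):
    print(f"P(no launch by month {t:.0f}) = {s:.3f}")
```

Censored gaps (products not yet launched) still count toward the "at risk" group until their last observed month, which is what separates survival analysis from simply averaging the observed gaps.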
3) This one needs ridge regression, because ridge regression lets you take a model that would overfit (as a plant dataset would, with its many columns and few observations) and rein it in so that it generalizes better. This is done by penalizing the size of the coefficients, which makes the model 'underfit a little on purpose' in a way, which, if you know you're overfitting, is just what you want! This will make the model less accurate on its training data but more useful on additional observations.
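
The trade-off in the ridge answer can be seen in a small sketch on synthetic data with more columns than rows. Everything here is made up for illustration, including the penalty strength `alpha=10.0`; it is not a tuned or recommended setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic stand-in: 30 "plants" with 60 measurements each (more columns than rows).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 60))
true_coef = np.zeros(60)
true_coef[:3] = [2.0, -1.0, 0.5]  # only three measurements actually matter
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=30)

# Fresh plants drawn the same way, to compare behavior on unseen data.
X_new = rng.normal(size=(200, 60))
y_new = X_new @ true_coef + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=10.0).fit(X_demo, y_demo)  # alpha is an arbitrary penalty strength

print("train R^2:", ols.score(X_demo, y_demo), ridge.score(X_demo, y_demo))
print("new-data R^2:", ols.score(X_new, y_new), ridge.score(X_new, y_new))
print("coef norm:", np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

The unpenalized fit reproduces the training rows almost exactly, while ridge gives up some training accuracy in exchange for smaller coefficients; the new-data scores show how each choice carries over to plants the model has not seen.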