# Introduction

The National Longitudinal Survey of Youth 1997-2011 dataset is one of the most important databases available to social scientists working with US data. 

It allows scientists to study the determinants of earnings and educational attainment, which makes it highly relevant to government policy. It can also shed light on politically sensitive issues, such as how educational attainment and salaries differ by ethnicity, sex, and other factors. A better understanding of how these variables affect education and earnings helps in formulating more suitable government policies.

<center><img src=https://i.imgur.com/cxBpQ3I.png height=400></center>


### Upgrade Plotly

In [2]:
# %pip install --upgrade plotly

### Import Statements


In [3]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Notebook Presentation

In [4]:
pd.options.display.float_format = '{:,.2f}'.format

# Load the Data



In [5]:
df = pd.read_csv('NLSY97_subset.csv')

### Understand the Dataset

Have a look at the file entitled `NLSY97_Variable_Names_and_Descriptions.csv`. 

---------------------------

    :Key Variables:  
      1. S           Years of schooling (highest grade completed as of 2011)
      2. EXP         Total out-of-school work experience (years) as of the 2011 interview.
      3. EARNINGS    Current hourly earnings in $ reported at the 2011 interview

# Preliminary Data Exploration 🔎

**Challenge**

* What is the shape of `df`?
* How many rows and columns does it have?
* What are the column names?
* Are there any NaN values or duplicates?

In [6]:
print(df.shape)
print(df.columns)
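print(df.isna().values.any())        # True if any cell is NaN
print(df.duplicated().values.any())  # True if any row repeats an earlier row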


(1599, 96)
Index(['ID', 'EARNINGS', 'S', 'EXP', 'FEMALE', 'MALE', 'BYEAR', 'AGE',
       'AGEMBTH', 'HHINC97', 'POVRAT97', 'HHBMBF', 'HHBMOF', 'HHOMBF',
       'HHBMONLY', 'HHBFONLY', 'HHOTHER', 'MSA97NO', 'MSA97NCC', 'MSA97CC',
       'MSA97NK', 'ETHBLACK', 'ETHHISP', 'ETHWHITE', 'EDUCPROF', 'EDUCPHD',
       'EDUCMAST', 'EDUCBA', 'EDUCAA', 'EDUCHSD', 'EDUCGED', 'EDUCDO',
       'PRMONM', 'PRMONF', 'PRMSTYUN', 'PRMSTYPE', 'PRMSTYAN', 'PRMSTYAE',
       'PRFSTYUN', 'PRFSTYPE', 'PRFSTYAN', 'PRFSTYAE', 'SINGLE', 'MARRIED',
       'COHABIT', 'OTHSING', 'FAITHN', 'FAITHP', 'FAITHC', 'FAITHJ', 'FAITHO',
       'FAITHM', 'ASVABAR', 'ASVABWK', 'ASVABPC', 'ASVABMK', 'ASVABNO',
       'ASVABCS', 'ASVABC', 'ASVABC4', 'VERBAL', 'ASVABMV', 'HEIGHT',
       'WEIGHT04', 'WEIGHT11', 'SF', 'SM', 'SFR', 'SMR', 'SIBLINGS', 'REG97NE',
       'REG97NC', 'REG97S', 'REG97W', 'RS97RURL', 'RS97URBN', 'RS97UNKN',
       'JOBS', 'HOURS', 'TENURE', 'CATGOV', 'CATPRI', 'CATNPO', 'CATMIS',
       'CATSE', 'COLLBAR

## Data Cleaning - Check for Missing Values and Duplicates

Find and remove any duplicate rows.

In [7]:
df.drop_duplicates()

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
0,4275,18.50,12,9.71,0,1,1984,27,24.00,64000.00,...,1,0,0,1,0,0,0,1,0,0
1,4328,19.23,17,5.71,0,1,1982,29,32.00,6000.00,...,2,0,0,1,0,0,1,0,0,0
2,8763,39.05,14,9.94,0,1,1981,30,23.00,88252.00,...,1,0,0,0,1,0,0,1,0,0
3,8879,16.80,18,1.54,0,1,1983,28,30.00,,...,1,0,1,0,0,0,1,0,0,0
4,1994,36.06,15,2.94,0,1,1984,27,23.00,44188.00,...,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1589,4379,10.50,16,4.21,0,1,1983,28,33.00,37500.00,...,1,0,0,0,1,0,0,1,0,0
1590,687,12.00,11,8.88,0,1,1983,28,24.00,35100.00,...,1,0,1,0,0,0,1,0,0,0
1593,4198,20.00,16,6.73,0,1,1983,28,32.00,43660.00,...,1,0,0,1,0,0,0,1,0,0
1595,5599,8.00,12,6.90,0,1,1983,28,20.00,10002.00,...,0,0,0,1,0,1,0,0,0,0
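
Note that `drop_duplicates()` returns a new DataFrame and leaves the original untouched, so the result has to be assigned for the removal to persist. A minimal sketch on a made-up frame (the `toy` data and its values are hypothetical, not taken from the NLSY97 file):

```python
import pandas as pd

# Hypothetical toy frame with one repeated row (not the NLSY97 data)
toy = pd.DataFrame({"ID": [1, 2, 2, 3], "EARNINGS": [10.0, 12.5, 12.5, 9.0]})

# Count rows that repeat an earlier row
print(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame
deduped = toy.drop_duplicates()

# The original frame keeps all of its rows; only the returned frame is shorter
print(toy.shape, deduped.shape)
```

In the notebook the same pattern would read `df = df.drop_duplicates()`.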


# Split Training & Test Dataset

We *can't* use all the entries in our dataset to train our model. Keep 20% of the data for later as a testing dataset (out-of-sample data).  

`df` is my training dataset, and `testing_df` is my testing dataset.

In [8]:
testing_df = pd.read_csv('Testing.csv')
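
This notebook loads the test set from a separate file. When only one file is available, a 20% hold-out can be produced with scikit-learn's `train_test_split`. The sketch below uses randomly generated stand-in data; the `toy` frame, the seed, and the `random_state` value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data that reuses the key variable names
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "S": rng.integers(8, 21, size=100),
    "EXP": rng.uniform(0, 15, size=100),
    "EARNINGS": rng.uniform(5, 50, size=100),
})

features = toy[["S", "EXP"]]
target = toy["EARNINGS"]

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=10
)
print(X_train.shape, X_test.shape)
```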

# Simple Linear Regression

Only use the years of schooling to predict earnings. Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data? 

In [9]:
reg = LinearRegression()
reg.fit(df[['S']].values, df['EARNINGS'])
reg.score(df[['S']].values, df['EARNINGS'])

0.06799053397850718
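
An r-squared of about 0.068 means that years of schooling alone account for roughly 7% of the variance in earnings in the training data. For a regressor, `score` returns the coefficient of determination, which can be checked by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on synthetic data (all values below are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relation between schooling and earnings
rng = np.random.default_rng(1)
schooling = rng.integers(8, 21, size=200).astype(float)
earnings = 1.2 * schooling + rng.normal(0, 10, size=200)

X = schooling.reshape(-1, 1)
model = LinearRegression().fit(X, earnings)

# R^2 = 1 - SS_res / SS_tot
predicted = model.predict(X)
ss_res = np.sum((earnings - predicted) ** 2)
ss_tot = np.sum((earnings - earnings.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both numbers should agree up to floating-point error
print(r2_manual, model.score(X, earnings))
```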

### Evaluate the Coefficients of the Model

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative). 

Interpret the regression. How many extra dollars can one expect to earn for an additional year of schooling?

In [10]:
reg.coef_

array([1.14750398])

In [11]:
print(reg.predict([[12]]))
print(reg.predict([[13]]))
print(reg.predict([[14]]))


[15.90625314]
[17.05375712]
[18.2012611]
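
The three predictions make the interpretation concrete: each additional year of schooling raises predicted hourly earnings by the slope, about $1.15. The arithmetic below uses only the numbers printed above; the variable names are chosen here for illustration.

```python
# Predictions printed above for 12, 13 and 14 years of schooling
pred_12, pred_13, pred_14 = 15.90625314, 17.05375712, 18.2012611
slope = 1.14750398  # value of reg.coef_[0] printed above

# Each one-year step in schooling changes the prediction by the slope
step_12_13 = pred_13 - pred_12
step_13_14 = pred_14 - pred_13
print(round(step_12_13, 4), round(step_13_14, 4))

# The intercept is the prediction at zero years of schooling
intercept = pred_12 - 12 * slope
print(round(intercept, 4))
```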


### Analyse the Estimated Values & Regression Residuals

How good our regression is also depends on the residuals: the differences between the true values ($y_i$) and the model's predictions ($\hat{y}_i$). Do you see any patterns in the distribution of the residuals?
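
As a rough illustration of what to look for, the sketch below fits a line to synthetic data with right-skewed noise and summarises the residuals. The data and the noise distribution are assumptions chosen to mimic a skewed earnings variable; they are not the NLSY97 values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
schooling = rng.integers(8, 21, size=300).astype(float)
# Made-up earnings with right-skewed (exponential) noise
earnings = 2.0 + 1.1 * schooling + rng.exponential(8.0, size=300)

X = schooling.reshape(-1, 1)
model = LinearRegression().fit(X, earnings)

# Residual = true value minus predicted value
residuals = earnings - model.predict(X)

# With an intercept, least-squares residuals average to zero on the training data
print(residuals.mean())
# A median well below the mean points to a right-skewed residual distribution
print(np.median(residuals))
```

A histogram of `residuals` (for example with `sns.displot`) would show the same skew visually.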

In [12]:
# Residual = actual earnings minus predicted earnings on the test set
# (assumes Testing.csv has the same columns as the training file)
predicted = reg.predict(testing_df[['S']].values)
residuals = testing_df['EARNINGS'] - predicted

# Multivariable Regression

Now use both years of schooling and years of work experience to predict earnings. How high is the r-squared for the regression on the training data?

In [13]:
mreg = LinearRegression()
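# Fit on plain arrays so that later predict() calls with lists do not trigger feature-name warnings
mreg.fit(df[['S', 'EXP']].values, df['EARNINGS'])
mreg.score(df[['S', 'EXP']].values, df['EARNINGS'])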

0.0958206621011295

### Evaluate the Coefficients of the Model

In [14]:
mreg.coef_

array([1.67401216, 0.86688565])

### Analyse the Estimated Values & Regression Residuals

In [15]:
print(mreg.predict([[8, 10]]))

[10.69237509]




In [16]:
# Residuals on the test set: actual earnings minus predicted earnings
# (assumes Testing.csv has the same columns as the training file)
predicted = mreg.predict(testing_df[['S', 'EXP']].values)
residuals = testing_df['EARNINGS'] - predicted

# Use Your Model to Make a Prediction

How much can someone with a bachelor's degree (12 + 4 years of schooling) and 5 years of work experience expect to earn in 2011?

In [17]:
print(mreg.predict([[16, 17]]))

[30.15267193]
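
The call above passes 17 as the experience value, whereas the question asks about 5 years. As a cross-check, the sketch below backs out the intercept from the coefficients and the `[8, 10]` prediction printed earlier, then evaluates the same linear formula at 16 years of schooling and 5 years of experience. It relies only on those printed numbers; the helper `predict` is a hypothetical stand-in for `mreg.predict`.

```python
# Values printed in the cells above
coef_s, coef_exp = 1.67401216, 0.86688565   # mreg.coef_
pred_8_10 = 10.69237509                     # mreg.predict([[8, 10]])

# A linear model predicts intercept + coef_s * S + coef_exp * EXP,
# so the intercept can be backed out from one known prediction
intercept = pred_8_10 - coef_s * 8 - coef_exp * 10

def predict(s, exp):
    """Hypothetical stand-in for mreg.predict on a single (S, EXP) pair."""
    return intercept + coef_s * s + coef_exp * exp

print(round(predict(16, 17), 2))   # reproduces the value printed above
print(round(predict(16, 5), 2))    # 16 years of schooling, 5 years of experience
```

Under these figures the predicted hourly earnings for 16 years of schooling and 5 years of experience come out at roughly $19.75.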




In [18]:
print(mreg.predict([[32, 33]]))

[70.80703692]




# Experiment and Investigate Further

Which other features could you consider adding to further improve the regression to better predict earnings? 

In [23]:
mreg = LinearRegression()
mreg.fit(df[['S', 'EXP','CATGOV','CATPRI', 'SIBLINGS']].values, df['EARNINGS'])

In [25]:
print(mreg.predict([[20, 15, 1, 0,2]]))
print(mreg.predict([[19, 17, 1, 0,2]]))
print(mreg.predict([[23, 13, 1, 0,2]]))
# print(mreg.predict([[23,3,0,1,3]]))

[34.70222724]
[34.6974806]
[38.14100887]
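
One way to judge whether an extra feature helps is to compare r-squared on held-out data for different feature sets. The sketch below does this on synthetic data in which `SIBLINGS` has no effect by construction; the data, the seed, and the helper `held_out_r2` are assumptions for illustration, not results from the NLSY97 file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 500
toy = pd.DataFrame({
    "S": rng.integers(8, 21, size=n).astype(float),
    "EXP": rng.uniform(0, 15, size=n),
    "SIBLINGS": rng.integers(0, 6, size=n).astype(float),
})
# Made-up earnings that depend on S and EXP only
toy["EARNINGS"] = 1.5 * toy["S"] + 0.9 * toy["EXP"] + rng.normal(0, 5, size=n)

train, test = train_test_split(toy, test_size=0.2, random_state=0)

def held_out_r2(columns):
    """Fit on the training rows and return r-squared on the test rows."""
    model = LinearRegression().fit(train[columns].values, train["EARNINGS"])
    return model.score(test[columns].values, test["EARNINGS"])

for columns in (["S"], ["S", "EXP"], ["S", "EXP", "SIBLINGS"]):
    print(columns, round(held_out_r2(columns), 3))
```

A feature that carries real signal should raise the held-out score noticeably, while an irrelevant one should leave it roughly unchanged.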
