<a href="https://colab.research.google.com/github/richardky30/income-hours/blob/main/income_hours.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Our file is relatively large (25 MB). Let's upload it to Google Drive and then copy the path here.

In [None]:
import pandas as pd
from google.colab import drive

# Mount Google Drive so the CSV stored there is readable under /content/drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/data/usa_00004.csv')
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,UHRSWORK,INCTOT
0,0,9000
1,0,150
2,32,1400
3,0,22700
4,0,0
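
Because the file is large, it can help to read only the columns that are needed and to give them compact integer types. The following is a minimal sketch, assuming the CSV has a header row containing `UHRSWORK` and `INCTOT`; the in-memory sample text and the extra `YEAR` column are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV on Drive; YEAR is an invented extra column.
sample_csv = io.StringIO(
    "YEAR,UHRSWORK,INCTOT\n"
    "2019,0,9000\n"
    "2019,0,150\n"
    "2019,32,1400\n"
)

# usecols skips unneeded columns; dtype avoids the default 64-bit integers.
small = pd.read_csv(
    sample_csv,
    usecols=["UHRSWORK", "INCTOT"],
    dtype={"UHRSWORK": "int8", "INCTOT": "int32"},
)
print(small.dtypes)
print(small.memory_usage(deep=True))
```

`int8` is enough for `UHRSWORK` because its largest code is 99, and `int32` comfortably holds the 9999999 code in `INCTOT`.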


How many samples do we have?

In [None]:
len(df)

3239553

Wow, that's a lot!

In [None]:
from sklearn.linear_model import LinearRegression as lr

# scikit-learn expects a 2-D feature matrix, so wrap the single predictor in a DataFrame
x = pd.DataFrame(df.UHRSWORK)
y = df.INCTOT
model = lr().fit(x, y)
r_sq = model.score(x, y)  # R^2 computed on the same data used for fitting
print("The coefficient of determination is", r_sq)
print("The y-intercept is", model.intercept_)
print("The coefficient is", model.coef_)
y_pred = model.predict(x)
print("The predicted response is", y_pred, sep='\n')

The coefficient of determination is 0.16222310610079538
The y-intercept is 3028382.5803736877
The coefficient is [-69296.11264377]
The predicted response is
[3028382.58037369 3028382.58037369  810906.97577302 ... 3028382.58037369
 3028382.58037369 3028382.58037369]


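That seems a bit strange: the intercept is over 3 million and the slope is negative, so predicted income drops by about 69,000 for every extra hour worked. Let's investigate.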

In [None]:
df.max()

UHRSWORK         99
INCTOT      9999999
dtype: int64

We forgot to remove the N/A values! They are coded as 00 for UHRSWORK and 9999999 for INCTOT. Also, 99 in UHRSWORK is a top code, so the true value for those observations may be higher.

In [None]:
# Drop rows carrying the N/A codes (0 for UHRSWORK, 9999999 for INCTOT)
df = df[df.UHRSWORK != 0]
df = df[df.INCTOT != 9999999]
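
To see why a handful of N/A codes can distort the fit so badly, here is a small sketch on made-up data (the numbers below are hypothetical, not taken from the census file). It uses `np.polyfit` rather than scikit-learn purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: income is exactly 1000 per hour for five workers,
# plus three rows carrying the N/A codes (0 hours, 9999999 income).
toy = pd.DataFrame({
    "UHRSWORK": [20, 30, 40, 50, 60, 0, 0, 0],
    "INCTOT": [20000, 30000, 40000, 50000, 60000, 9999999, 9999999, 9999999],
})

# A degree-1 polyfit returns [slope, intercept].
slope_raw, intercept_raw = np.polyfit(toy.UHRSWORK, toy.INCTOT, 1)

# Apply the same two filters as in the cell above.
clean = toy[(toy.UHRSWORK != 0) & (toy.INCTOT != 9999999)]
slope_clean, intercept_clean = np.polyfit(clean.UHRSWORK, clean.INCTOT, 1)

print("slope with N/A codes:", slope_raw)
print("slope after filtering:", slope_clean)
```

In this toy case the coded rows drag the line toward a huge intercept and a negative slope, the same pattern seen in the first fit, while the filtered data recover a slope of about 1000.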

In [None]:
x = pd.DataFrame(df.UHRSWORK)
y = df.INCTOT
model = lr().fit(x, y)
r_sq = model.score(x, y)
print("The coefficient of determination is", r_sq)
print("The y-intercept is", model.intercept_)
print("The coefficient is", model.coef_)
y_pred = model.predict(x)
print("The predicted response is", y_pred[:8], sep='\n')

The coefficient of determination is 0.08602565076795865
The y-intercept is -4532.8039325639475
The coefficient is [1687.33738866]
The predicted response is
[ 49461.99250444  62960.69161369  47774.65511578  62960.69161369
  15715.24473131  62960.69161369  29213.94384056 113580.81327337]
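
The fitted values can be checked by hand, since a simple linear model predicts `intercept + coefficient * hours`. The sketch below plugs in the intercept and coefficient printed above; the hour values 32, 40 and 70 are assumptions, chosen because they reproduce three of the printed predictions (32 is also the first nonzero `UHRSWORK` value shown by `df.head()`).

```python
# Intercept and slope copied from the output above.
intercept = -4532.8039325639475
coef = 1687.33738866


def predict_income(hours):
    """Evaluate the fitted line at a given UHRSWORK value."""
    return intercept + coef * hours


for hours in (32, 40, 70):
    print(hours, round(predict_income(hours), 2))
```

These evaluate to about 49461.99, 62960.69 and 113580.81, in line with the first, second and last predictions printed above.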


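That's better! The slope is now positive, although hours worked explain only about 8.6% of the variance in income. But how about polynomial regression?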

In [None]:
from sklearn.preprocessing import PolynomialFeatures as pf

# Default degree is 2, so x_ holds UHRSWORK and UHRSWORK squared;
# include_bias=False leaves the intercept to LinearRegression
x_ = pf(include_bias=False).fit_transform(x)
model = lr().fit(x_, y)
r_sq = model.score(x_, y)
print("The coefficient of determination is", r_sq)
print("The y-intercept is", model.intercept_)
print("The coefficients are", model.coef_)
y_pred = model.predict(x_)
print("The predicted response is", y_pred[:8], sep='\n')

The coefficient of determination is 0.08602811115700015
The y-intercept is -4958.571575994414
The coefficients are [ 1.71335936e+03 -3.47240999e-01]
The predicted response is
[ 49513.35301163  63020.21703877  47821.86983924  63020.21703877
  15551.73798419  63020.21703877  29169.7191311  113275.10240137]
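
With its default settings, `PolynomialFeatures` expands to degree 2, so with `include_bias=False` a single input column becomes two columns: the value and its square. That is why two coefficients are printed above. A small sketch on made-up hour values:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical hour values, shaped as a single-column feature matrix.
hours = np.array([[2], [3], [40]])

# Default degree is 2; include_bias=False omits the leading column of ones.
expanded = PolynomialFeatures(include_bias=False).fit_transform(hours)
print(expanded)
```

In the fit above the coefficient on the squared term is small (about -0.35), and R² moves only from about 0.086026 to 0.086028, so the quadratic term adds almost nothing here.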


Now let's try out statsmodels.

In [None]:
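import statsmodels.api as sm

# sm.OLS does not add an intercept by itself, so append a column of ones.
# Note: the column is named 'INCTOT' here, so the 'INCTOT' row in the
# summary below is the intercept, not the income variable.
x['INCTOT'] = 1
model = sm.OLS(y, x)
results = model.fit()
results.summary()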



Dep. Variable:,INCTOT,R-squared:,0.086
Model:,OLS,Adj. R-squared:,0.086
Method:,Least Squares,F-statistic:,158700.0
Date:,"Thu, 03 Mar 2022",Prob (F-statistic):,0.0
Time:,19:46:40,Log-Likelihood:,-21283000.0
No. Observations:,1686350,AIC:,42570000.0
Df Residuals:,1686348,BIC:,42570000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
UHRSWORK,1687.3374,4.235,398.401,0.000,1679.036,1695.638
INCTOT,-4532.8039,170.802,-26.538,0.000,-4867.571,-4198.037

Omnibus:,1647505.757,Durbin-Watson:,1.82
Prob(Omnibus):,0.0,Jarque-Bera (JB):,93761465.879
Skew:,4.803,Prob(JB):,0.0
Kurtosis:,38.244,Cond. No.,122.0
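
The `t` and confidence-interval columns follow from `coef` and `std err`. The sketch below recomputes them from the rounded values in the table, assuming the interval is `coef ± q * std err` with `q` taken from the standard normal distribution, which is a close approximation to the t quantile at this sample size. Small differences from the table come from the rounding of `std err`.

```python
from statistics import NormalDist

# 97.5th percentile of the standard normal (about 1.96) for a 95% interval.
q = NormalDist().inv_cdf(0.975)

# (coef, std err) pairs copied from the summary table above.
# The second row is labeled INCTOT in the table but holds the column of ones.
rows = {
    "UHRSWORK": (1687.3374, 4.235),
    "INCTOT (column of ones)": (-4532.8039, 170.802),
}

for name, (coef, se) in rows.items():
    t_stat = coef / se
    lower, upper = coef - q * se, coef + q * se
    print(f"{name}: t={t_stat:.2f}, 95% CI=({lower:.3f}, {upper:.3f})")
```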


In [None]:
print("The coefficient of determination is", results.rsquared)
print("The adjusted coefficient of determination is", results.rsquared_adj)
print("The regression coefficients are", results.params.to_string(index=False), sep='\n')
print("The predicted responses are", results.predict(x)[:6].to_string(index=False), sep='\n')

The coefficient of determination is 0.0860256507679441
The adjusted coefficient of determination is 0.08602510878352032
The regression coefficients are
 1687.337389
-4532.803933
The predicted responses are
49461.992504
62960.691614
47774.655116
62960.691614
15715.244731
62960.691614
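
The adjusted coefficient of determination can be recovered from the ordinary one with the usual formula `1 - (1 - R²) * (n - 1) / (n - p - 1)`, where `n` is the number of observations and `p` the number of predictors excluding the intercept. A quick check using the numbers reported above:

```python
# Values copied from the summary table and the printed output above.
r_squared = 0.0860256507679441
n_obs = 1686350   # "No. Observations"
n_predictors = 1  # "Df Model": UHRSWORK only, the intercept is not counted

adjusted = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(adjusted)
```

With about 1.7 million observations and a single predictor, the adjustment only changes the seventh decimal place.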
