The goal of this project is to understand and apply the two-stage least squares (2SLS) technique. By Romain DS.

## Importing the modules and the dataset

In [46]:
import pandas as pd
import numpy as np
import statsmodels.api as sm 
from linearmodels.iv import IV2SLS

url = "https://raw.githubusercontent.com/ATerracol/P8Econ/master/data/Projet3_Groupe10.csv"
df = pd.read_csv(url)

In [47]:
df.head()

Unnamed: 0,gender,ethnicity,urban,unemp,wage,distance,education,income
0,female,afam,yes,7.7,9.881197,0.2,12,high
1,female,other,no,7.2,8.855185,0.6,16,high
2,male,other,no,7.1,9.975384,1.0,15,high
3,female,other,no,10.0,10.266041,2.5,13,low
4,male,other,no,4.6,9.92358,4.0,12,low


In [48]:
df.dtypes

gender        object
ethnicity     object
urban         object
unemp        float64
wage         float64
distance     float64
education      int64
income        object
dtype: object

In [49]:
df.income.value_counts()

income
low     1050
high     450
Name: count, dtype: int64

In [52]:
labels = {
    'gender':{'male':0,'female':1},
    'ethnicity':{'other':0,'hispanic':1,'afam':2},
    'urban':{'no':0,'yes':1},
    'income':{'low':0,'high':1}
}

df.replace(labels, inplace=True)
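
Coding `ethnicity` as 0/1/2 makes the regression treat it as a single numeric variable, which imposes an ordering and equal spacing between the categories. A minimal sketch of an alternative with indicator (dummy) variables follows. The small `demo` frame and its values are made up for illustration and are not taken from the dataset, and the choice of `other` as the reference category is an assumption.

```python
import pandas as pd

# Hypothetical toy frame mimicking the categorical columns of the dataset.
demo = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "ethnicity": ["afam", "other", "hispanic"],
    "income": ["high", "low", "low"],
})

# Binary variables can be mapped to 0/1 directly.
demo["gender"] = demo["gender"].map({"male": 0, "female": 1})
demo["income"] = demo["income"].map({"low": 0, "high": 1})

# One indicator column per ethnicity level; "other" is kept as the reference.
dummies = pd.get_dummies(demo["ethnicity"], prefix="ethnicity", dtype=int)
demo_encoded = pd.concat(
    [demo.drop(columns="ethnicity"), dummies.drop(columns="ethnicity_other")],
    axis=1,
)
print(demo_encoded)
```

With this encoding each ethnicity coefficient is read as a difference relative to the reference category, instead of a single slope across arbitrary codes.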

In [54]:
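# Keep only the regressors by name (this leaves out urban, unemp, wage and distance).
X = df[['gender', 'ethnicity', 'education', 'income']]
X = sm.add_constant(X)
y = df['wage']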

## OLS regression of wage on education, ethnicity, gender, and membership in a high- or low-income household

In [55]:
X

Unnamed: 0,const,gender,ethnicity,education,income
0,1.0,1,2,12,1
1,1.0,1,0,16,1
2,1.0,0,0,15,1
3,1.0,1,0,13,0
4,1.0,0,0,12,0
...,...,...,...,...,...
1495,1.0,1,0,13,0
1496,1.0,0,0,17,1
1497,1.0,1,0,12,1
1498,1.0,1,2,13,0


In [63]:
ols_model = sm.OLS(y,X).fit()
ols_model.summary()

0,1,2,3
Dep. Variable:,wage,R-squared:,0.036
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,13.99
Date:,"Fri, 29 Aug 2025",Prob (F-statistic):,3.27e-11
Time:,00:46:36,Log-Likelihood:,-2516.9
No. Observations:,1500,AIC:,5044.0
Df Residuals:,1495,BIC:,5070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,9.8242,0.270,36.453,0.000,9.296,10.353
gender,-0.1275,0.068,-1.888,0.059,-0.260,0.005
ethnicity,-0.2910,0.045,-6.472,0.000,-0.379,-0.203
education,-0.0111,0.019,-0.575,0.565,-0.049,0.027
income,0.1685,0.076,2.210,0.027,0.019,0.318

0,1,2,3
Omnibus:,6.717,Durbin-Watson:,1.98
Prob(Omnibus):,0.035,Jarque-Bera (JB):,5.642
Skew:,0.072,Prob(JB):,0.0595
Kurtosis:,2.736,Cond. No.,113.0


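The number of years of education may be endogenous in this model. Its estimated coefficient is negative (-0.0111) and not statistically significant (p = 0.565), which runs against the positive return to education one would expect. This is a symptom rather than a proof: if `education` is correlated with the error term, the OLS estimator is biased and inconsistent.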
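
The mechanism can be illustrated with a small simulation. This is a sketch under assumed values, not the project's data: a hypothetical unobserved factor `ability` raises education and, in this made-up setup, enters the wage equation with a negative sign, so that leaving it out pushes the OLS slope below the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (not the project's data):
# an unobserved factor drives both education and wage.
ability = rng.normal(size=n)
education = 12 + 2.0 * ability + rng.normal(size=n)
true_beta = 0.10
wage = 8 + true_beta * education - 0.5 * ability + rng.normal(size=n)

# OLS of wage on a constant and education, leaving the unobserved factor out.
X = np.column_stack([np.ones(n), education])
beta_ols, *_ = np.linalg.lstsq(X, wage, rcond=None)

print(f"true effect: {true_beta:.3f}, OLS estimate: {beta_ols[1]:.3f}")
```

In this setup the OLS slope converges to the true effect plus cov(education, error) / var(education) = 0.10 + (-1 / 5) = -0.10, so the estimate has the wrong sign even though the true effect is positive.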

Among the available variables not used as regressors, `distance` looks like a plausible instrument. It influences the probability of pursuing longer studies (relevance), and it is assumed not to affect the target `wage` directly (exclusion).
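
An instrument must be relevant (correlated with the endogenous regressor) and excluded (no direct effect on the outcome). Only relevance can be checked from the data, by regressing the endogenous variable on the instrument and the exogenous controls. The following is a minimal sketch on simulated data; the variable `control` and all coefficients are assumptions for illustration. With a single instrument, the first-stage F-statistic equals the squared t-statistic, and a value above roughly 10 is a common rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_500

# Hypothetical data: education falls as distance grows.
distance = rng.exponential(scale=2.0, size=n)
control = rng.integers(0, 2, size=n)  # stands in for an exogenous control
education = 14 - 0.3 * distance + 0.5 * control + rng.normal(size=n)

# First stage: education ~ const + control + distance
Z = np.column_stack([np.ones(n), control, distance])
coef, *_ = np.linalg.lstsq(Z, education, rcond=None)
resid = education - Z @ coef

# Classical (homoskedastic) covariance of the first-stage coefficients.
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)

# With a single instrument, the first-stage F equals the squared t-statistic.
t_distance = coef[2] / np.sqrt(cov[2, 2])
first_stage_F = t_distance ** 2
print(f"first-stage coefficient: {coef[2]:.3f}, F = {first_stage_F:.1f}")
```

The exclusion restriction cannot be tested this way with a single instrument; it has to be argued from the context.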

In [None]:
# IV regression using distance as the external instrument.

In [62]:
iv_model = IV2SLS.from_formula("wage ~ 1 + ethnicity + gender + income + [education ~ distance]",data=df).fit()
iv_model.summary

0,1,2,3
Dep. Variable:,wage,R-squared:,-0.0269
Estimator:,IV-2SLS,Adj. R-squared:,-0.0296
No. Observations:,1500,F-statistic:,51.430
Date:,"Fri, Aug 29 2025",P-value (F-stat),0.0000
Time:,00:44:37,Distribution:,chi2(4)
Cov. Estimator:,robust,,
,,,

0,1,2,3,4,5,6
,Parameter,Std. Err.,T-stat,P-value,Lower CI,Upper CI
Intercept,12.418,2.7254,4.5564,0.0000,7.0763,17.760
ethnicity,-0.3185,0.0558,-5.7059,0.0000,-0.4279,-0.2091
gender,-0.1236,0.0701,-1.7633,0.0779,-0.2610,0.0138
income,0.3518,0.2119,1.6603,0.0969,-0.0635,0.7671
education,-0.2012,0.2002,-1.0049,0.3149,-0.5936,0.1912
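
To make the "two stages" concrete, the following minimal sketch reproduces the 2SLS point estimate by hand. It runs on simulated data with a hypothetical data-generating process, not on the project's dataset. Stage one regresses education on the instrument, and stage two regresses wage on the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data: `ability` is unobserved, `distance` is the instrument.
ability = rng.normal(size=n)
distance = rng.exponential(scale=2.0, size=n)
education = 14 - 0.3 * distance + 2.0 * ability + rng.normal(size=n)
true_beta = 0.10
wage = 8 + true_beta * education - 0.5 * ability + rng.normal(size=n)

const = np.ones(n)

# Stage 1: project education on the instrument (plus the constant).
Z = np.column_stack([const, distance])
gamma, *_ = np.linalg.lstsq(Z, education, rcond=None)
education_hat = Z @ gamma

# Stage 2: regress wage on the constant and the fitted education.
X_hat = np.column_stack([const, education_hat])
beta_2sls, *_ = np.linalg.lstsq(X_hat, wage, rcond=None)

# Plain OLS for comparison, biased by the omitted `ability`.
X = np.column_stack([const, education])
beta_ols, *_ = np.linalg.lstsq(X, wage, rcond=None)

print(f"true: {true_beta:.3f}  OLS: {beta_ols[1]:.3f}  2SLS: {beta_2sls[1]:.3f}")
```

The point estimates from this manual procedure match 2SLS, but the standard errors of a naive second-stage OLS would be wrong because they ignore that `education_hat` is itself estimated. That is one reason to rely on `IV2SLS` for inference.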
