<h1> BART for inequalities </h1>

In [36]:
import os

import pandas as pd

<h2> Preprocessing </h2>

Life expectancy for all countries, from the World Bank. Only the 2017 values are kept.

In [33]:
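life_exp_df = pd.read_csv("wb_life_expectancy.csv", skiprows = 4) # Skip the 4 rows that precede the header row
life_exp_df = life_exp_df[["Country Name", "2017"]]
life_exp_df = life_exp_df.rename(columns = {"2017":"Life Expectancy 2017"}) # Reassign instead of inplace on a column subset
life_exp_df = life_exp_df.dropna() # Drop rows with no 2017 value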
life_exp_df

Unnamed: 0,Country Name,Life Expectancy 2017
0,Aruba,76.010000
1,Afghanistan,64.130000
2,Angola,60.379000
3,Albania,78.333000
5,Arab World,71.622526
...,...,...
259,Kosovo,71.946341
260,"Yemen, Rep.",66.086000
261,South Africa,63.538000
262,Zambia,63.043000


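Income distribution for all countries and world regions, from WID.world. The distribution is split into four income shares: the bottom 50 percent (p0p50), the 50-90 percent range (p50p90, the middle class), the top 10 percent (p90p100), and the top 1 percent (p99p100). The top 1 percent is a subset of the top 10 percent, so only the first three shares sum to 1.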

In [31]:
income_df = pd.read_csv("wid_income_dist.csv", skiprows = 1, sep = ";", header = None)
income_df = income_df[[0, 2, 4]]
income_df.columns = ["Region Name", "percentile", "Income Share"]
income_df = income_df.dropna() # Drop rows with a missing share
income_df = income_df.pivot(index = "Region Name", columns = "percentile", values = "Income Share") # Reshape: one column per share
income_df = income_df.dropna() # Only keep regions with all 4 parts of the income distribution
income_df

Region Name,p0p50,p50p90,p90p100,p99p100
Africa,0.088212,0.368794,0.542994,0.190221
Albania,0.209400,0.470900,0.319700,0.082100
Algeria,0.207066,0.420077,0.372856,0.097033
Angola,0.130631,0.380834,0.488535,0.151751
Austria,0.234300,0.449100,0.316600,0.092700
...,...,...,...,...
United Kingdom,0.206100,0.439300,0.354600,0.126100
Western Africa,0.116490,0.375802,0.507708,0.164721
Zambia,0.073127,0.311930,0.614943,0.230787
Zanzibar,0.154000,0.365000,0.481000,0.161700
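
To make the reshape step concrete, here is a minimal sketch on made-up numbers (the regions "A" and "B" and their shares are illustrative, not WID.world data). It suggests that dropping missing rows in long format does not by itself remove a region that lacks one of the four shares, whereas a `dropna` after the pivot does.

```python
import pandas as pd

# Toy long-format data: region "B" has no p99p100 row.
long_df = pd.DataFrame({
    "Region Name": ["A", "A", "A", "A", "B", "B", "B"],
    "percentile": ["p0p50", "p50p90", "p90p100", "p99p100",
                   "p0p50", "p50p90", "p90p100"],
    "Income Share": [0.20, 0.45, 0.35, 0.10, 0.15, 0.40, 0.45],
})

# Long -> wide: one row per region, one column per share.
wide_df = long_df.pivot(index="Region Name", columns="percentile",
                        values="Income Share")

# Region "B" is still present, with NaN in the p99p100 column.
print(wide_df)

# Keep only regions where all four shares are available.
complete_df = wide_df.dropna()
print(complete_df.index.tolist())
```
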


In [23]:
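# Stale cell: it was executed before the pivot (execution count 23 < 31), so the output below is still in long format
income_df.dropna()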

Unnamed: 0,Region Name,percentile,Income Share
0,Africa,p90p100,0.542994
1,Africa,p50p90,0.368794
2,Africa,p0p50,0.088212
3,Africa,p99p100,0.190221
12,Albania,p90p100,0.319700
...,...,...,...
835,Zanzibar,p99p100,0.161700
836,Zimbabwe,p90p100,0.507186
837,Zimbabwe,p50p90,0.364170
838,Zimbabwe,p0p50,0.128644


Merge the life expectancy and income dataframes on country name. The default inner join keeps only the countries whose names match exactly in both sources.

In [34]:
le_income_df = life_exp_df.merge(income_df, left_on = "Country Name", right_on = "Region Name")
le_income_df

Unnamed: 0,Country Name,Life Expectancy 2017,p0p50,p50p90,p90p100,p99p100
0,Angola,60.379000,0.130631,0.380834,0.488535,0.151751
1,Albania,78.333000,0.209400,0.470900,0.319700,0.082100
2,Austria,81.641463,0.234300,0.449100,0.316600,0.092700
3,Burundi,60.898000,0.151344,0.371082,0.477574,0.145485
4,Belgium,81.439024,0.205900,0.480100,0.313900,0.077700
...,...,...,...,...,...,...
79,Tanzania,64.479000,0.153972,0.365047,0.480980,0.161714
80,Uganda,62.516000,0.131229,0.353945,0.514826,0.168541
81,South Africa,63.538000,0.062700,0.286500,0.650800,0.192100
82,Zambia,63.043000,0.073127,0.311930,0.614943,0.230787
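
Because the two sources may spell some country names differently, an inner merge can drop rows without any warning. The following is a minimal sketch of one way to audit that, using pandas' `indicator=True` option on toy frames. The frames `left` and `right`, the name "Yemen", and all values are illustrative assumptions, not the actual data.

```python
import pandas as pd

# Toy stand-ins for the two sources; names and values are illustrative only.
left = pd.DataFrame({
    "Country Name": ["Angola", "Yemen, Rep.", "Zambia"],
    "Life Expectancy 2017": [60.4, 66.1, 63.0],
})
right = pd.DataFrame({
    "Region Name": ["Angola", "Yemen", "Zambia"],
    "p0p50": [0.13, 0.10, 0.07],
})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each row came from the left frame, the right frame, or both.
audit = left.merge(right, left_on="Country Name", right_on="Region Name",
                   how="outer", indicator=True)

# Rows that an inner merge would have dropped.
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Country Name", "Region Name", "_merge"]])
```

Listing the unmatched names this way makes it possible to decide whether a manual name mapping is worth adding before the merge.
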


In [42]:
X = le_income_df[["p0p50", "p50p90", "p90p100", "p99p100"]]
y = le_income_df[["Life Expectancy 2017"]] # Double brackets: y is a one-column DataFrame, not a 1-D Series

<h2> Random Forest implementation </h2>

In [48]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
rf_pipeline = Pipeline(steps=[
    ("model", RandomForestRegressor(n_estimators = 50, random_state = 0)),
])

In [51]:
from sklearn.model_selection import cross_val_score
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(rf_pipeline, X, y,
                              cv = 5,
                              scoring = "neg_mean_absolute_error")

print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())

MAE scores:
 [3.79270005 2.72457212 5.30532131 3.46914154 4.39436878]
Average MAE score (across experiments):
3.937220760162539


<h2> XBART implementation </h2>

In [55]:
from xbart import XBART
xbt_pipeline = Pipeline(steps=[
    ("model", XBART(num_trees = 10, num_sweeps = 10, burnin = 0)),
])

In [56]:
scores = -1 * cross_val_score(xbt_pipeline, X, y,
                              cv = 5,
                              scoring = "neg_mean_absolute_error")

print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())

MAE scores:
 [nan nan nan nan nan]
Average MAE score (across experiments):
nan
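
Scores that are all `nan` typically indicate that fitting or scoring raised an exception in every fold and that scikit-learn recorded `nan` instead of surfacing it. Passing `error_score="raise"` to `cross_val_score` re-raises the underlying exception. The sketch below illustrates this with a hypothetical `StrictRegressor` that rejects a DataFrame target. It is a stand-in written for this example, not XBART, and the data is random.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score


class StrictRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in (not XBART): predicts the training mean and
    rejects any target that is not a NumPy array or pandas Series."""

    def fit(self, X, y):
        if not isinstance(y, (np.ndarray, pd.Series)):
            raise TypeError("y must be numpy array or pandas Series")
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((20, 2)), columns=["a", "b"])
y_frame = pd.DataFrame({"target": rng.random(20)})  # one-column DataFrame
y_series = y_frame["target"]                        # 1-D Series

# error_score="raise" surfaces the exception instead of recording nan.
try:
    cross_val_score(StrictRegressor(), X_demo, y_frame, cv=5,
                    scoring="neg_mean_absolute_error", error_score="raise")
    failure = None
except TypeError as exc:
    failure = str(exc)
print("With a DataFrame target:", failure)

# With a 1-D Series target the same call returns one score per fold.
scores_ok = -1 * cross_val_score(StrictRegressor(), X_demo, y_series, cv=5,
                                 scoring="neg_mean_absolute_error",
                                 error_score="raise")
print(scores_ok)
```

Rerunning the XBART cross-validation with `error_score="raise"` would show which exception, if any, lies behind the `nan` values above.
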


In [61]:
from sklearn.model_selection import train_test_split
# Fix the seed so the 80/20 split is reproducible
train, test = train_test_split(le_income_df, test_size = 0.2, random_state = 0)

X_train = train[["p0p50", "p50p90", "p90p100", "p99p100"]]
y_train = train[["Life Expectancy 2017"]]

X_test = test[["p0p50", "p50p90", "p90p100", "p99p100"]]
y_test = test[["Life Expectancy 2017"]]

In [62]:

xbt = XBART(num_trees = 100, num_sweeps = 40, burnin = 15)
xbt.fit(X_train, y_train) # y_train is a one-column DataFrame, which triggers the TypeError below
xbart_yhat_matrix = xbt.predict(X_test) # Returns an n x num_sweeps matrix
y_hat = xbart_yhat_matrix[:, 15:].mean(axis=1) # Discard the 15 burn-in sweeps and use the mean as the prediction

TypeError: y must be numpy array or pandas Series
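
The error message asks for a NumPy array or a pandas Series, while selecting with double brackets (`train[["Life Expectancy 2017"]]`) yields a one-column DataFrame. Below is a minimal sketch, on a toy frame with illustrative values, of two ways to obtain a 1-D target, followed by the burn-in averaging step applied to a synthetic matrix. XBART itself is not called here, and `fake_yhat_matrix` is only an assumed stand-in for the n x num_sweeps output described in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; values are illustrative only.
toy = pd.DataFrame({"p0p50": [0.13, 0.21, 0.23],
                    "Life Expectancy 2017": [60.4, 78.3, 81.6]})

y_frame = toy[["Life Expectancy 2017"]]   # double brackets -> DataFrame, shape (3, 1)
y_series = toy["Life Expectancy 2017"]    # single brackets -> Series, shape (3,)
y_array = y_frame.to_numpy().ravel()      # DataFrame -> 1-D NumPy array

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
print(type(y_array).__name__, y_array.shape)

# Synthetic stand-in for an n x num_sweeps prediction matrix (n=3, 40 sweeps).
num_sweeps, burnin = 40, 15
rng = np.random.default_rng(0)
fake_yhat_matrix = 70 + rng.normal(size=(3, num_sweeps))

# Drop the burn-in sweeps, then average the remaining 25 per observation.
y_hat = fake_yhat_matrix[:, burnin:].mean(axis=1)
print(y_hat.shape)
```

Under these assumptions, passing `train["Life Expectancy 2017"]` as `y_train` would likely satisfy the type check in the error message. Whether XBART then fits cleanly on this data is not shown here.
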

In [63]:
from bartpy.sklearnmodel import SklearnModel

ModuleNotFoundError: No module named 'bartpy'
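
The `ModuleNotFoundError` indicates that `bartpy` is not installed in the environment the kernel runs in. One possible way to keep the notebook running end to end is to check for the module before importing it. This is a sketch only, and `is_installed` is a hypothetical helper introduced for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bartpy"):
    from bartpy.sklearnmodel import SklearnModel
else:
    print("bartpy is not installed in this environment; skipping the BartPy model.")
```
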