# Task 6
**In this task we will create a predictor using scikit-learn to predict GDP per capita. We will be excluding other GDP-related data fields.**
\
\
Minitasks: \
 a) We will show prediction error (MSE) on the training and the testing data sets.\
 b) Name the fields we will use.\
 c) Find the top 5 fields/features that contribute the most to the predictions.\
 d) Train another predictor that uses those top 5 features.\
 e) Save the predictor in a file.\
For this task I am thinking of using XGBoost (through its scikit-learn-compatible XGBRegressor), as the data will be straightforward.
\
\
Steps:
1) We will first preprocess the data to make it usable.
2) Train first model.
3) We will find which features are most likely to contribute the most to predictions.
4) Train final model.
5) Save model to file.

First we load up the data from the IMF_DATA file 

In [1]:
import pandas as pd

imf_data = pd.read_pickle("IMF_DATA.pkl")

Now we filter the data down to what we will use and split it into other_data and gdp_data.\
**Note: estimates will not be used**

In [2]:
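def remove_predictions(row):
    """Zero out every year column that holds an IMF estimate instead of an observation."""
    DATA_END_DATE = 2025
    DATA_START_DATE = 1980

    # Last year with observed data; every later year is an estimate
    prediction_start_year = int(row['Estimates Start After'])

    if prediction_start_year < DATA_START_DATE:
        # The whole 1980 : 2025 range is estimated
        row.loc[DATA_START_DATE : DATA_END_DATE] = 0
    elif prediction_start_year < DATA_END_DATE:
        # Only the years after the cutoff are estimated
        row.loc[prediction_start_year + 1 : DATA_END_DATE] = 0

    return row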

In [3]:
imf_data = imf_data.apply(remove_predictions, axis = 1)
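
The row-wise `apply` above zeroes every year later than `Estimates Start After`. The same rule can be sketched in vectorized form on a toy frame. This assumes, as the integer slices in `remove_predictions` suggest, that the year columns are labelled with integers; the frame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the wide IMF table: one row per series, one column per year
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Estimates Start After": [2001, 1999, 2005],
    2000: [1.0, 2.0, 3.0],
    2001: [1.1, 2.1, 3.1],
    2002: [1.2, 2.2, 3.2],
})

year_cols = [c for c in toy.columns if isinstance(c, int)]

# is_estimate[i, j] is True when year j lies after row i's last observed year
cutoff = toy["Estimates Start After"].to_numpy()[:, None]
is_estimate = np.array(year_cols)[None, :] > cutoff

# Keep observed values, replace estimates with 0 (the notebook's missing-value marker)
toy[year_cols] = np.where(is_estimate, 0.0, toy[year_cols].to_numpy(dtype=float))
print(toy)
```

Row A keeps 2000 and 2001, row B is zeroed entirely because its cutoff precedes the first year, and row C is untouched because its cutoff lies after the last year.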

In [4]:
# Fetching the GDP per capita data denominated in U.S. dollars
gdp_pattern = r'\bGross domestic product per capita.*'
gdp_data = imf_data[imf_data['Subject Descriptor'].str.contains(gdp_pattern)]
# regex=False so the dots in 'U.S.' are matched literally and not as wildcards
gdp_data = gdp_data[gdp_data['Units'].str.contains('U.S. dollars', regex=False)]

# Dropping all other columns apart from 'Country' and the year columns 1980 : 2025
columns_to_be_dropped = ['WEO Country Code', 'ISO', 'WEO Subject Code', 'Subject Descriptor', 'Subject Notes',
                               'Units', 'Scale', 'Country/Series-specific Notes', 'Estimates Start After']
# Reassigning instead of using inplace avoids a SettingWithCopyWarning on the filtered frame
gdp_data = gdp_data.drop(columns_to_be_dropped, axis = 1)
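
One detail worth knowing about `str.contains`: the pattern is treated as a regular expression by default, so an unescaped `.` matches any character. A small sketch with made-up unit labels shows the difference between the regex and the literal match:

```python
import pandas as pd

# Hypothetical unit labels; the second one is not U.S. dollars but still fits the regex
units = pd.Series(["U.S. dollars", "UxSx dollars", "National currency"])

as_regex = units.str.contains(r"U.S. dollars")                # '.' matches any character
as_literal = units.str.contains("U.S. dollars", regex=False)  # plain substring test

print(as_regex.tolist())
print(as_literal.tolist())
```

With real IMF unit labels the two variants most likely select the same rows, but the literal form states the intent more precisely.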

In [5]:
# Dropping rows that contain GDP related data
gdp_pattern = r'\bGross domestic product.*'
other_data = imf_data[imf_data['Subject Descriptor'].str.contains(gdp_pattern) == False]

In [6]:
# Keeping only the Units that I will use, as some Subject Descriptors have over 195 values
# r'Index', r'U.S. dollars', 
units_to_be_dropped = [r'Missing']

# Looping over units_to_be_dropped and dropping the rows that contain the expression
for expression in units_to_be_dropped:
    other_data = other_data[other_data['Units'].str.contains(expression) == False]

In [7]:
# Making a human-readable subjects dataframe: subject descriptor, WEO subject code and units
subjects = other_data.loc[:, ['Subject Descriptor', 'WEO Subject Code', 'Units']]
subjects = subjects.drop_duplicates()

# Uncomment code if you want to see the human readable subjects dataframe
#print(subjects)

In [8]:
# Dropping columns in other_data that will be of no use
columns_to_be_dropped = ['Subject Descriptor', 'WEO Country Code', 'ISO', 
                         'Subject Notes', 'Country/Series-specific Notes', 'Estimates Start After', 'Units', 'Scale']
# Reassigning instead of using inplace avoids a SettingWithCopyWarning on the filtered frame
other_data = other_data.drop(columns_to_be_dropped, axis = 1)

Now that we have all the subjects in the units we need, we can start converting both dataframes from wide to long versions and also drop zero rows.

In [9]:
# Melting other_data to long format so the year can become a column
other_data = other_data.melt(id_vars=['Country', 'WEO Subject Code'], var_name='Year', value_name='Value')

# Pivoting other_data to make every subject code a separate column and then resetting the index
other_data = other_data.pivot_table(index=['Country', 'Year'], columns='WEO Subject Code', values='Value').reset_index()
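
The melt-then-pivot step can be easier to follow on a tiny example. The sketch below uses a made-up frame `wide` with two countries, two subject codes and two integer year columns, and applies the same two calls as the cell above:

```python
import pandas as pd

# Hypothetical wide table: one row per (country, subject code), one column per year
wide = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "WEO Subject Code": ["LP", "LUR", "LP", "LUR"],
    2000: [1.0, 5.0, 2.0, 6.0],
    2001: [1.1, 5.5, 2.2, 6.6],
})

# Wide -> long: every year column becomes a row with 'Year' and 'Value'
long = wide.melt(id_vars=["Country", "WEO Subject Code"], var_name="Year", value_name="Value")

# Long -> tidy: every subject code becomes its own column, one row per (country, year)
tidy = long.pivot_table(index=["Country", "Year"], columns="WEO Subject Code",
                        values="Value").reset_index()
print(tidy)
```

Note that `pivot_table` aggregates with the mean by default, so if a (country, year, subject code) combination appeared more than once, the values would be averaged silently.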

In [10]:
# Melting the gdp_data so I can merge based on 'Country' and 'Year' columns
gdp_data = gdp_data.melt(id_vars='Country', var_name='Year', value_name='GDP per capita')

# Merging the data on 'Country' and 'Year' columns
merged_data = other_data.merge(gdp_data, on = ['Country', 'Year'], how = 'inner')

# Discarding rows if GDP per capita is == 0
merged_data = merged_data[merged_data['GDP per capita'] != 0]

Now that we have a complete data frame with no missing values for GDP per capita, we can start searching for the features we will use.
First we need to find which columns have the fewest missing values (stored as 0 in this case). There are 6500 rows that have GDP per capita values.

In [11]:
# We find which columns have the least zero values
non_zero_values = {}
for column in merged_data.columns:
    non_zero_values[column] = len(merged_data[merged_data[column] != 0])

#pd.DataFrame(list(non_zero_values.items()), columns = ['Column', 'Number']
#            ).sort_values('Number', axis = 0, ascending = False
#                        ).merge(subjects, how='inner', left_on = 'Column', right_on = 'WEO Subject Code')
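
The per-column loop can also be written as a single vectorized expression. A minimal sketch on a made-up frame (the column names are borrowed from the notebook, the values are not):

```python
import pandas as pd

# Hypothetical data where 0 marks a missing value, as in the notebook
toy = pd.DataFrame({
    "LP": [1.0, 2.0, 0.0, 4.0],
    "LUR": [0.0, 0.0, 0.0, 7.5],
    "GDP per capita": [100.0, 200.0, 300.0, 400.0],
})

# Comparing the whole frame to 0 gives a boolean frame; summing counts the True cells per column
non_zero_counts = (toy != 0).sum().sort_values(ascending=False)
print(non_zero_counts)
```

Columns near the top of `non_zero_counts` are the best covered and therefore need the least imputation later on.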

Now we can finally try to train our first model (XGBRegressor) with all the features. First we still have to impute our data with the help of a pipeline.

In [12]:
# Replacing 0 values with nan so the imputer can treat them as missing
merged_data = merged_data.replace({0 : float('nan')})

from sklearn.model_selection import train_test_split

# Separating the target from the features
y = merged_data['GDP per capita']
X = merged_data.drop(['GDP per capita', 'Country', 'Year'], axis=1)

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error

from xgboost import XGBRegressor

train_preprocessor = SimpleImputer()

train_model = XGBRegressor(n_estimators=750, learning_rate=0.1)

train_pipeline = Pipeline(steps=[('preprocessor', train_preprocessor),
                              ('model', train_model)
                             ])

# Preprocessing of training data, fit model 
train_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = train_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 1234.8449853572529
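
The task statement asks for the MSE on both the training and the testing data, while the cell above reports the MAE on the validation split only. A minimal sketch of how both MSE values could be computed is given below. It runs on synthetic data from `make_regression` and swaps in scikit-learn's `GradientBoostingRegressor` for `XGBRegressor` so that it needs no extra dependency; both substitutions are assumptions made for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the IMF features, with roughly 10% of the cells set to NaN
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Same pipeline shape as in the notebook: mean imputation followed by a boosted-tree model
pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", GradientBoostingRegressor(random_state=0)),
])
pipeline.fit(X_train, y_train)

# Error on the data the model has seen and on the data it has not
train_mse = mean_squared_error(y_train, pipeline.predict(X_train))
valid_mse = mean_squared_error(y_valid, pipeline.predict(X_valid))
print("Training MSE:  ", train_mse)
print("Validation MSE:", valid_mse)
```

With the notebook's objects the same two lines would apply to `train_pipeline`, `X_train`, `y_train`, `X_valid` and `y_valid`. A training error far below the validation error is a sign of overfitting.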


Features used for training first model: 'BCA', 'BCA_NGDPD', 'FLIBOR6', 'GGR', 'GGR_NGDP',
       'GGSB', 'GGSB_NPGDP', 'GGX', 'GGXCNL', 'GGXCNL_NGDP', 'GGXONLB',
       'GGXONLB_NGDP', 'GGXWDG', 'GGXWDG_NGDP', 'GGXWDN', 'GGXWDN_NGDP',
       'GGX_NGDP', 'LE', 'LP', 'LUR', 'NGAP_NPGDP', 'NGSD_NGDP', 'NID_NGDP',
       'PCPI', 'PCPIE', 'PCPIEPCH', 'PCPIPCH', 'PPPEX', 'TMG_RPCH', 'TM_RPCH',
       'TXG_RPCH', 'TX_RPCH'

Now, using mutual information (MI) scores, we will find the top 5 most important features for the model.

In [14]:
from sklearn.feature_selection import mutual_info_regression

# Function for scoring features, it doesn't accept nan values :(
def make_mi_scores(X, y):
    mi_scores = mutual_info_regression(X, y)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X.fillna(0), y.fillna(0))
mi_scores.head(9)  # show a few features with their MI scores

LE            0.345073
BCA           0.303304
GGR_NGDP      0.276372
LUR           0.273179
LP            0.239088
GGX_NGDP      0.227669
PPPEX         0.222344
GGSB_NPGDP    0.200230
NGAP_NPGDP    0.187885
Name: MI Scores, dtype: float64

Features we will use: LE, BCA, GGR_NGDP, LUR, LP\
LE - Employment, BCA - Current account balance, GGR_NGDP - General government revenue (percent of GDP), LUR - Unemployment rate, LP - Population.
I will also add the 3 features that will be needed in the next task for training (we will need continent and population).
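
Mutual information measures how strongly each feature depends on the target, independently of any model. If the goal is to see which features a fitted model actually relies on, permutation importance from `sklearn.inspection` is one possible cross-check. The sketch below uses synthetic data and a `RandomForestRegressor` as a stand-in model; the feature names `f0` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: f0 drives the target strongly, f1 weakly, f2 and f3 are pure noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y = 10.0 * X["f0"] + 2.0 * X["f1"] + rng.normal(scale=0.5, size=500)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record how much the score drops
result = permutation_importance(model, X_valid, y_valid, n_repeats=5, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

If the MI ranking and the permutation ranking agree on the top features, that is some reassurance that the selected top 5 are not an artifact of one scoring method.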

In [15]:
features_final = ['LE', 'BCA', 'GGR_NGDP', 'LUR', 'LP']

X_final = X[features_final]

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X_final, y, train_size=0.8, test_size=0.2, random_state=69)

train_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = train_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

MAE: 1967.8464338720576


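As the MAE shows, the second model is noticeably worse than the first: 1967.85 versus 1234.84 USD, a gap of about 733 USD. The two runs use different train/validation splits (random_state=0 and random_state=69), so the comparison is only approximate. We could improve on this further by engineering custom features.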
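
One way to make such a comparison less dependent on a single split is to score both feature sets with the same cross-validation folds. A minimal sketch, again on synthetic data with `GradientBoostingRegressor` standing in for `XGBRegressor`; the helper `cv_mae` and the column subset `top_columns` are hypothetical names introduced for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 10 features, all of them informative
X, y = make_regression(n_samples=300, n_features=10, n_informative=10,
                       noise=5.0, random_state=0)
top_columns = [0, 1, 2, 3, 4]  # hypothetical "top 5" subset

# One fixed set of folds, reused for both feature sets
folds = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_mae(features):
    """Mean absolute error averaged over the shared folds."""
    model = GradientBoostingRegressor(random_state=0)
    scores = cross_val_score(model, features, y, cv=folds,
                             scoring="neg_mean_absolute_error")
    return -scores.mean()  # scikit-learn reports the negated MAE

mae_all = cv_mae(X)
mae_top = cv_mae(X[:, top_columns])
print("MAE, all features:", mae_all)
print("MAE, top 5 only:  ", mae_top)
```

Because both models see exactly the same folds, the difference between `mae_all` and `mae_top` reflects the feature set and not the luck of the split.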

Now we will save the last model to a file for our next task. I will be using joblib for this, as it is generally more efficient when dealing with large NumPy arrays.

**Note: because our model is part of a pipeline we will save the pipeline itself**

In [16]:
import joblib

filename = 'finalized_pipeline.sav'
joblib.dump(train_pipeline, filename)

['finalized_pipeline.sav']
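
For the next task the saved pipeline has to be loaded again. A minimal round-trip sketch, assuming a small stand-in pipeline with `LinearRegression` and a temporary file name (`toy_pipeline.sav`) chosen only for this example:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Small stand-in pipeline fitted on synthetic data with 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])

pipeline = Pipeline(steps=[("preprocessor", SimpleImputer()),
                           ("model", LinearRegression())])
pipeline.fit(X, y)

# Save to disk, load it back, and check that the copy predicts identically
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.sav")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

same_predictions = np.allclose(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

For the notebook's file the equivalent call would be `joblib.load('finalized_pipeline.sav')`. The loaded pipeline expects the same 5 feature columns it was trained on, and it is safest to load it with the same library versions that were used to save it.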