# About this notebook

This notebook uses the output of [this kernel](https://www.kaggle.com/hinepo/pnad-data-analysis) to build and train machine learning models that predict the income of Brazilian people.

Part 1: [Data analysis](https://www.kaggle.com/hinepo/pnad-data-analysis)

Part 2: Modeling (this notebook)

Part 3: [Lazy Predict](https://www.kaggle.com/hinepo/pnad-lazy-predict?scriptVersionId=74288711)

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import skew, norm

from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Modeling
import xgboost as xgb

from colorama import Fore, Back, Style
g_ = Fore.GREEN
m_ = Fore.MAGENTA
sr_ = Style.RESET_ALL

import warnings
warnings.filterwarnings("ignore")

# Load data

In [None]:
df = pd.read_csv('../input/pesquisa-nacional-por-amostra-de-domiclios-pnad/pnad_2015_clean.csv')
df

# Config

A configuration class that keeps all experiment settings in one place, which makes it easy to simulate different setups.

In [None]:
class CFG:
    debug = False
#     Transform_target = True
    Transform_target = False
    scaler = MinMaxScaler()
#     scaler = RobustScaler()
    threshold = 15000
    eval_split = 0.10
    eval_split_seed = 42
    n_folds = 5
    seeds = [0, 1]

In [None]:
if CFG.debug:
    df = df.head(10_000)

# Outliers

As seen in the [data analysis notebook](https://www.kaggle.com/hinepo/pnad-data-analysis), this dataset has very few outliers, but the ones it has are very far from the typical cases (check the boxplots).

For modeling, a good decision is to remove them, as they are likely to degrade the performance of the models.

I will just remove the rows above an arbitrary income threshold (`CFG.threshold`). This decision could also be based on a more detailed statistical analysis.

In [None]:
df = df[df['Income'] <= CFG.threshold]
df = df.reset_index(drop=True)
df.shape
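As an illustration of a more statistical cutoff than the arbitrary threshold, here is a minimal sketch using Tukey's rule (upper fence = Q3 + 1.5 * IQR). It is not what this notebook does: the incomes below are synthetic values drawn from a log-normal distribution, not PNAD data, and the names `income`, `upper_fence` and `kept` are only for this example.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical incomes (NOT the PNAD data): a right-skewed sample
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=1_000), name="Income")

# Tukey's rule: values above Q3 + 1.5 * IQR are treated as outliers
q1, q3 = income.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

kept = income[income <= upper_fence]

print(f"Upper fence: {upper_fence:.2f}")
print(f"Rows kept: {len(kept)} of {len(income)}")
```

On a strongly skewed variable such as income this rule can be quite aggressive, so the resulting fence is probably best treated as a starting point to compare against the boxplots rather than a final answer.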

# Skewed features

In [None]:
df.dtypes

In [None]:
numeric_feats = df.dtypes[df.dtypes != "object"].index
print("There are {} numeric columns.".format(numeric_feats.shape[0]))

# Check the skew of all numerical features
skewed_features = df[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical columns: \n")
skewness = pd.DataFrame({'Skew' : skewed_features})
skewness

In [None]:
high_skew = skewed_features[skewed_features > 0.75]
skew_index = high_skew.index

print("{} numeric column(s) with Skew > 0.75.".format(high_skew.shape[0]))
high_skewness = pd.DataFrame({'High Skewed' : high_skew})
high_skewness.head(10)

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
sns.distplot(df['Income'], color="b", fit=norm);
ax.xaxis.grid(True)
ax.set(ylabel="Frequency")
ax.set(xlabel="Income")
ax.set(title="Target original distribution");

In [None]:
print("Skewness: %f" % df['Income'].skew()) # curve lopsidedness
print("Kurtosis: %f" % df['Income'].kurt()) # curve tailedness

print("\nIncome min: %.2f" % df['Income'].min())
print("Income max: %.2f" % df['Income'].max())

# Target transformation

Income is the variable that we want to predict (the target variable), and it is highly skewed. I tried to correct its distribution before modeling, but I could not find a statistical transformation that improved the results, probably because the distribution is too far from a normal distribution.

It is unrealistic to expect Income to follow anything close to a normal (Gaussian) distribution.

If we choose to transform the target (`CFG.Transform_target = True`), the models predict log(1 + Income), so after computing the predictions we need to transform them back to the original scale with `np.expm1()` to get meaningful values.

In [None]:
# log(1+x) transform
df_original = df.copy() # save the dataframe before any transformation

df["Income"] = np.log1p(df["Income"])

comparison = pd.DataFrame({'Target (original)' : df_original['Income'], 'Target (transformed)' : df['Income']})
comparison

In [None]:
comparison.describe()

In [None]:
comparison.hist(); # comparing distributions

In [None]:
# two stacked panels: original target on top, transformed target below
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (10, 8))

stats.probplot(df_original['Income'], plot=ax1);
ax1.set_title('Income (original)')

stats.probplot(df['Income'], plot=ax2);
ax2.set_title('Income (transformed)')

plt.tight_layout();

In [None]:
# if CFG class is set to not do target transformation, just undo it

if CFG.Transform_target == False:
    df['Income'] = np.expm1(df['Income'])
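To make the transformation and its inverse concrete, here is a small sketch on synthetic data (hypothetical log-normal values, not the PNAD incomes). It shows that `np.log1p` reduces the skew of a right-skewed sample and that `np.expm1` recovers the original values.

```python
import numpy as np
from scipy.stats import skew

# Synthetic, hypothetical right-skewed sample (NOT the PNAD data)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=7.0, sigma=0.8, size=5_000)

income_log = np.log1p(income)       # forward transform: log(1 + x)
income_back = np.expm1(income_log)  # inverse transform: exp(x) - 1

print(f"Skew before: {skew(income):.2f}")
print(f"Skew after:  {skew(income_log):.2f}")
print("Round trip recovers the original values:", np.allclose(income, income_back))
```

A synthetic log-normal sample is the ideal case for this transformation. The real Income column is not guaranteed to behave this well, which is consistent with the lack of improvement reported above.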

# One-Hot-Encoding

In [None]:
df

In [None]:
# convert the Sex feature into integers so it stays a single column after one-hot encoding;
# later I want the model to report the impact of Sex as a whole, not of separate 'Male' and 'Female' columns
Sex_dict = {
    'Male' : 0,
    'Female' : 1
}

df["Sex"] = df["Sex"].map(Sex_dict)

In [None]:
# One-hot-encoding
df = pd.get_dummies(df, drop_first = False)
df.shape

In [None]:
# dataframe ready to feed the models
df

# Split

Now we set aside a fraction of the dataset that will later be used to evaluate the models and their predictions. Holding this data out keeps it from leaking into training and lets us detect overfitting, because the models are evaluated on data they have not seen during training.

In [None]:
# take out a sample to use to evaluate predictions later
eval_set = df.sample(frac = CFG.eval_split, random_state = CFG.eval_split_seed)
eval_set.shape

In [None]:
# indexes that will be used to evaluate predictions (reuse the sample already drawn above)
list_of_indexes_in_eval_set = list(eval_set.index)

# train_set: remove indexes above
df = df.drop(list_of_indexes_in_eval_set, axis=0).reset_index(drop=True)
df.shape
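The same kind of hold-out split can also be done in one call with scikit-learn's `train_test_split`. This is only an alternative sketch, not the method used in this notebook, and the `toy` dataframe below is made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({"Age": np.arange(20, 70), "Income": np.arange(50) * 100.0})

# 10% goes to the evaluation set, the rest stays for training
train_df, eval_df = train_test_split(toy, test_size=0.10, random_state=42)

print(train_df.shape, eval_df.shape)
print("Rows in both sets:", len(train_df.index.intersection(eval_df.index)))
```

Whichever approach is used, the property to check is the one printed at the end: no row appears in both sets.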

# Features and Target

Here we separate the features (variables that we will use to predict the target) from the target (variable that we want to predict).

In [None]:
features = df.drop('Income', axis = 1)
features.shape

In [None]:
target = df['Income']
target.shape

In [None]:
eval_set_features = eval_set.drop('Income', axis = 1)
eval_set_features = eval_set_features.reset_index(drop=True)
eval_set_features.shape

In [None]:
eval_set_target = eval_set['Income']
eval_set_target = eval_set_target.reset_index(drop=True)
eval_set_target.shape

# Scaling

We might use algorithms that depend on distances or feature magnitudes (non tree-based models), so we scale the features. Tree-based models such as XGBoost do not need scaling, but it does not hurt them either. We should not scale the target though.

In [None]:
# scale only features
scaler = CFG.scaler.fit(features)
features_scaled = scaler.transform(features)

features_scaled.shape

In [None]:
# scale only features (eval_set), reusing the scaler already fitted on the training features;
# refitting on eval_set would put the two sets on different scales
eval_set_features_scaled = scaler.transform(eval_set_features)

eval_set_features_scaled.shape
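Here is a tiny sketch of the usual scaling pattern, with made-up numbers: the scaler learns its minimum and maximum from the training features only, and the same fitted scaler is then applied to the evaluation features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-column feature matrices
X_train = np.array([[0.0], [5.0], [10.0]])
X_eval = np.array([[2.0], [20.0]])

# fit on the training features only, then reuse the fitted scaler everywhere
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_eval_scaled = scaler.transform(X_eval)

print(X_train_scaled.ravel())  # training values mapped into [0, 1]
print(X_eval_scaled.ravel())   # same mapping; values can fall outside [0, 1]
```

With these numbers the evaluation value 20 maps to 2.0, outside the [0, 1] range. That is expected: it simply means the value is larger than anything seen in training.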

# Metric

I chose Root Mean Squared Error (RMSE) as the evaluation metric because I want a measure that is sensitive to large errors. RMSE squares each deviation before averaging and taking the square root, so a few large errors weigh more than many small ones.

I also want to be able to compare the predicted values against the real values (ground truth). Like MAE (Mean Absolute Error), RMSE is expressed in the same unit as the target, which is not the case for MSE (Mean Squared Error).

In [None]:
def rmse_score(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
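A small worked example, with made-up numbers, of why RMSE reacts more strongly to large errors than MAE. Both sets of predictions have the same total absolute error (400), spread evenly in one case and concentrated in a single row in the other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1000.0, 1000.0, 1000.0, 1000.0])
pred_even = np.array([900.0, 1100.0, 900.0, 1100.0])     # four errors of 100
pred_spiky = np.array([1000.0, 1000.0, 1000.0, 1400.0])  # one error of 400

results = {}
for name, y_pred in [("even", pred_even), ("spiky", pred_spiky)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    results[name] = (mae, rmse)
    print(f"{name}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

MAE is 100 in both cases, while RMSE goes from 100 to 200 when the error is concentrated in one prediction.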

# XGB

[XGB docs](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

In [None]:
def get_preds_xgb(X = features_scaled, y = np.array(target),
                  nfolds = CFG.n_folds,
                  n_estimators = 100):
    
    seeds_avg = list() # to store seed runs and calculate average
    xgb_models = [] # to store models

    for seed in CFG.seeds:
        scores = list() # to store scores
        
        kfold = KFold(n_splits = nfolds, shuffle=True, random_state = seed) # use the nfolds argument
        
        for k, (train_idx,valid_idx) in enumerate(kfold.split(X, y)):
            
            # instantiate model
            model = xgb.XGBRegressor(objective ='reg:squarederror',
                                     n_estimators = n_estimators,
                                     random_state = seed) # use tree_method='gpu_hist' to enable gpu
            
            # KFold split
            X_train, y_train = X[train_idx], y[train_idx] # train set for this fold
            X_valid, y_valid = X[valid_idx], y[valid_idx] # validation set for this fold
            
            model.fit(X_train, y_train) # train (fit) model
            xgb_models.append(model) # store model
            prediction = model.predict(X_valid) # make predictions for the current fold
            
            # Undo target transformation (if it was done) on both predictions and fold targets,
            # so that RMSE is computed on the original income scale
            if CFG.Transform_target == True:
                prediction = np.expm1(prediction)
                y_valid = np.expm1(y_valid)
            
            score = round(rmse_score(y_valid, prediction), 4) # calculate score (RMSE)
            scores.append(score) # store score
            
            print(f'Seed {seed} | Fold {k} | RMSE score: {score}') # print fold score (RMSE)
            
        print(f"{m_}\nMean RMSE for seed {seed} : {round(np.mean(scores), 4)}{sr_}\n") # print seed score (RMSE)
        
        seeds_avg.append(round(np.mean(scores), 4)) # calculate run score (RMSE)
        
    print(f"{g_}Average score: {round(np.mean(seeds_avg), 4)}{sr_}") # print run score (RMSE)
    
    return round(np.mean(seeds_avg), 4), xgb_models

In [None]:
%%time
score_xgb, xgb_models = get_preds_xgb()

# Feature importances

In [None]:
# choose model (fold) to evaluate: index into xgb_models
model_fold = 0 # first one

features_importances = xgb_models[model_fold].feature_importances_
argsort = np.argsort(features_importances)
features_importances_sorted = features_importances[argsort]

feature_names = features.columns
features_sorted = feature_names[argsort]

# plot feature importances
plt.figure(figsize = (8, 10))
plt.barh(features_sorted, features_importances_sorted)
plt.title("Feature Importances");

# Evaluate

Predict on eval set: Compute predicted values (y_pred) and RMSE between predictions and real values (y_test).

The models have not seen the data in the evaluation set (eval_set_features_scaled and eval_set_target).

Here we use all xgb models to generate the predictions. The ensemble of xgb models has (CFG.n_folds * len(CFG.seeds)) models.

In [None]:
# predict on eval_set using all models trained
# clip to avoid predicting negative incomes
y_pred = np.mean([np.clip(xgb_models[i].predict(eval_set_features_scaled), a_min = 0, a_max=None)\
         for i in range(0, len(xgb_models))], axis=0)
 
# Undo target transformation on the predictions made (if it was done)
if CFG.Transform_target == True:
    y_pred = np.expm1(y_pred)

y_test = eval_set_target

# Undo target transformation on the eval set (if it was done)
if CFG.Transform_target == True:
    y_test = np.expm1(eval_set_target)

# compute error (RMSE for evaluation set)
rmse = rmse_score(y_test, y_pred)
print("RMSE on test data: %.2f" % rmse)
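The averaging step can be seen in isolation with a few made-up prediction arrays standing in for the outputs of the trained models (the values below are hypothetical, not real model outputs). Each array is clipped at zero first, then the arrays are averaged row by row.

```python
import numpy as np

# Hypothetical predictions from three models for the same four rows
fold_predictions = [
    np.array([1200.0, -50.0, 3000.0, 800.0]),
    np.array([1000.0, 10.0, 3200.0, 900.0]),
    np.array([1100.0, -20.0, 2800.0, 1000.0]),
]

# clip each model's output at zero, then average across models (axis=0)
clipped = [np.clip(p, a_min=0, a_max=None) for p in fold_predictions]
ensemble = np.mean(clipped, axis=0)

print(ensemble)
```

Clipping before averaging guarantees that no single model contributes a negative income to the mean, so the ensemble prediction is never negative.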

# Predicted x Real

Plot predictions x real values.

In [None]:
%%time

print_every = 50

fig = plt.figure(figsize=(20,5))

plt.bar(list(range(len(y_test[::print_every]))), y_test.values[::print_every],
        alpha = 1, color = 'red', width = 1, label = 'true values')

plt.bar(list(range(len(y_pred[::print_every]))), y_pred[::print_every],
        alpha = 0.5, color = 'blue', width = 1, label = 'predicted values')

plt.legend();

# Simulate

Make any prediction you want!

Define your features array: set the values below for each column. The values must follow the same column order as the features used for training.

In [None]:
my_pred = np.array([[

# Sex (0: Male; 1: Female)
1,
# Age
45,
# Years of study
12,
# Height
1.60,
# State_Acre
0,
# State_Alagoas
0,
# State_Amapá
0,
# State_Amazonas
0,
# State_Bahia
0,
# State_Ceará
0,
# State_Distrito Federal
0,
# State_Espírito Santo
0,
# State_Goiás
0,
# State_Maranhão
0,
# State_Mato Grosso
0,
# State_Mato Grosso do Sul
0,
# State_Minas Gerais
0,
# State_Paraná
0,
# State_Paraíba
0,
# State_Pará
0,
# State_Pernambuco
0,
# State_Piauí
0,
# State_Rio Grande do Norte
0,
# State_Rio Grande do Sul
0,
# State_Rio de Janeiro
1,
# State_Rondônia
0,
# State_Roraima
0,
# State_Santa Catarina
0,
# State_Sergipe
0,
# State_São Paulo
0,
# State_Tocantins
0,
# Color_White 
0,
# Color_Indigenous
0,
# Color_Brown
1,
# Color_Black
0,
# Color_Yellow
0
]])
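A positional array like the one above only works if its order matches the training columns exactly. A less fragile option, sketched below, is to describe the person by column name and let pandas fill every other column with 0. The `train_columns` list here is a short hypothetical subset used for illustration; in this notebook the real list would be `features.columns`.

```python
import pandas as pd

# Hypothetical subset of the training columns, in training order
train_columns = ["Sex", "Age", "Years of study", "Height",
                 "State_Rio de Janeiro", "State_São Paulo",
                 "Color_Brown", "Color_White"]

# only the non-zero values need to be listed, in any order
person = {"Color_Brown": 1, "State_Rio de Janeiro": 1,
          "Sex": 1, "Age": 45, "Years of study": 12, "Height": 1.60}

# reindex puts the columns in training order and fills the missing ones with 0
row = pd.DataFrame([person]).reindex(columns=train_columns, fill_value=0)

print(row.to_numpy())
```

The resulting `row.to_numpy()` has shape (1, number of columns) and could be used in place of a hand-written array.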

In [None]:
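# the models were trained on scaled features, so scale the input with the same scaler
my_pred_scaled = scaler.transform(my_pred)
res = np.mean([np.clip(xgb_models[i].predict(my_pred_scaled), a_min = 0, a_max=None)\
               for i in range(0, len(xgb_models))], axis=0)
if CFG.Transform_target == True: # undo target transformation (if it was done)
    res = np.expm1(res)

print("Income predicted for information in my_pred array:", round(res[0], 2), "reais.")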