<a href="https://colab.research.google.com/github/HinePo/PNAD-analysis-and-prediction/blob/master/PNAD_2015.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PNAD

In this article we study the 2015 PNAD dataset: we clean and explore the data, and then train machine learning models to try to predict the value of the column 'Renda' (monthly salary) from the other variables.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import seaborn as sns

## Loading the dataset

PNAD: Pesquisa Nacional por Amostra de Domicílios 2015 - IBGE (Instituto Brasileiro de Geografia e Estatística)


Dataset available at https://www.kaggle.com/upadorprofzs/testes

In [None]:
df = pd.read_csv('../input/testes/dados.csv')
df.head()

In [None]:
df.shape
# there are 76840 rows and 7 columns on the dataframe (df)

In [None]:
df.isnull().values.any()

In [None]:
df['Cor'].unique()

In [None]:
df.dtypes

# Data Manipulation

As mentioned in the dataset description (link above), although the dataframe contains only integer and float values, many of the columns are actually codes for categorical variables.

Models need numerical input. We could feed the dataset in as it is, but that is not ideal: the integer codes would impose an order and a magnitude on the categories that don't really exist.

So I prefer another approach. First I map the coded values back to their category names, and then apply One Hot Encoding to get the dummy columns for each categorical feature. This way no category is given an artificial weight, and the readable labels also help us analyse the data. The downside is that the dataset becomes much wider once the dummies are added.
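
To make the two steps concrete, here is a minimal sketch on a made-up four-row dataframe (the values are illustrative and are not taken from the PNAD data). It maps integer codes to labels and then one-hot encodes them with `pd.get_dummies`:

```python
import pandas as pd

# Toy data with a coded categorical column (values are illustrative only)
toy = pd.DataFrame({
    "Cor": [2, 8, 4, 2],
    "Idade": [30, 45, 27, 52],
})

# Step 1: map the integer codes to their category labels
cor_labels = {2: "Branca", 4: "Preta", 8: "Parda"}
toy["Cor"] = toy["Cor"].map(cor_labels)

# Step 2: one-hot encode; each remaining category becomes its own 0/1 column.
# drop_first=True drops the first category (in sorted order), which is
# implied whenever all the other dummy columns are 0.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Here 'Branca' is the dropped category, so a row with 0 in both `Cor_Parda` and `Cor_Preta` stands for 'Branca'.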

In [None]:
# How many categories are present in each column?

for col in df.columns:
  print(col, " :", len(df[col].unique()))

## UF column

In [None]:
Dict_UF = {
    11 : 'Rondônia',
    12 : 'Acre',
    13 : 'Amazonas',
    14 : 'Roraima',
    15 : 'Pará',
    16 : 'Amapá',
    17 : 'Tocantins',
    21 : 'Maranhão',
    22 : 'Piauí',
    23 : 'Ceará',
    24 : 'Rio Grande do Norte',
    25 : 'Paraíba',
    26 : 'Pernambuco',
    27 : 'Alagoas',
    28 : 'Sergipe',
    29 : 'Bahia',
    31 : 'Minas Gerais',
    32 : 'Espírito Santo',
    33 : 'Rio de Janeiro',
    35 : 'São Paulo',
    41 : 'Paraná',
    42 : 'Santa Catarina',
    43 : 'Rio Grande do Sul',
    50 : 'Mato Grosso do Sul',
    51 : 'Mato Grosso',
    52 : 'Goiás',
    53 : 'Distrito Federal'
}

In [None]:
df["UF"] = df["UF"].map(Dict_UF)

In [None]:
# Verifying changes made
df.loc[2000:2010]

## Sexo column

This column is ok and will not be modified.

0 means 'male'.

1 means 'female'.

## Idade column

This column is also alright and doesn't need any changes.

## Cor column

In [None]:
Dict_Cor = {
    0 : 'Indígena',
    2 : 'Branca',
    4 : 'Preta',
    6 : 'Amarela',
    8 : 'Parda',
    9 : 'Sem declaração'
    }

In [None]:
df["Cor"] = df["Cor"].map(Dict_Cor)

In [None]:
# Verifying changes made
df.loc[45000:45005]

In [None]:
df.groupby('Cor').count()

## Anos de Estudo column

In [None]:
df['Anos de Estudo'].value_counts()

In [None]:
Dict_Anos = {
    1 : 0,
    2 : 1,
    3 : 2,
    4 : 3,
    5 : 4,
    6 : 5,
    7 : 6,
    8 : 7,
    9 : 8,
    10 : 9,
    11 : 10,
    12 : 11,
    13 : 12,
    14 : 13,
    15 : 14,
    16 : 15,
    17 : 0   # code 17 appears to be 'not determined' in the dataset description; treated as 0 years here
    }

In [None]:
df["Anos de Estudo"] = df["Anos de Estudo"].map(Dict_Anos)

In [None]:
df.dtypes

In [None]:
# Verifying changes made
df.tail()

## Renda column

This is also fine.

## Altura column

In [None]:
df['Altura'] = round(df['Altura'], 2)

In [None]:
df.loc[900:905]

# Exploratory Data Analysis

Let's do some brief analysis of the dataset.

In [None]:
df.columns

## UF

In [None]:
df["UF"].value_counts()

In [None]:
df["UF"].value_counts().plot(kind = 'bar', figsize=(12,5))
plt.title("Number of observations by UF")

## Sexo

In [None]:
df["Sexo"].value_counts()

In [None]:
df["Sexo"].value_counts().plot(kind = 'bar')
plt.title("Number of observations by Sexo")
# 0 means male;
# 1 means female.

## Idade

In [None]:
df["Idade"].value_counts()

In [None]:
plt.title("Number of observations by Idade")
df["Idade"].plot(kind = 'hist')

In [None]:
# from 76840 observations, there are 423 that have Age less than 20
len(df["Idade"][df["Idade"]<20])

In [None]:
print("Maximum value for Idade", df["Idade"].max())
print("Minimum value for Idade", df["Idade"].min())

## Cor

In [None]:
df["Cor"].value_counts()

In [None]:
plt.figure(figsize = (5,5))
plt.title("Number of observations by Cor")
df["Cor"].value_counts().plot(kind = 'bar')

## Anos de Estudo

15 years means '15 years or more'.

In [None]:
df["Anos de Estudo"].value_counts()

In [None]:
plt.title("Number of observations by Anos de Estudo")
df["Anos de Estudo"].value_counts().plot(kind = 'bar')

In [None]:
# Anos de estudo by Cor
sns.boxplot(x = df['Cor'], y = df['Anos de Estudo'], data = df)
plt.title("Anos de Estudo x Cor")

In [None]:
# Anos de estudo by Sexo
sns.boxplot(x = df['Sexo'], y = df['Anos de Estudo'], data = df)
plt.title("Anos de Estudo x Sexo")

In [None]:
# select the column before averaging: UF and Cor are text columns and cannot be averaged
df.groupby('UF')[['Anos de Estudo']].mean().plot(kind='bar')
plt.title("Anos de Estudo (Average) x UF")

## Renda

### Renda Distribution

In [None]:
df["Renda"].value_counts()

In [None]:
# Some insights
print("Number of observations that have Renda < 20 k :", len(df["Renda"][df["Renda"] < 20000]))
print("Number of observations that have Renda > 20 k :", len(df["Renda"][df["Renda"] > 20000]))
print("Number of observations that have Renda > 40 k :", len(df["Renda"][df["Renda"] > 40000]))
print("\nAverage Salary (Renda) :", round(df['Renda'].mean(), 2))
print("Maximum value for Renda :", df["Renda"].max())
print("Minimum value for Renda :", df["Renda"].min())

In [None]:
# hist plot with zoom
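# recent matplotlib versions renamed the seaborn styles to 'seaborn-v0_8-*'
talk_style = 'seaborn-v0_8-talk' if 'seaborn-v0_8-talk' in plt.style.available else 'seaborn-talk'
plt.style.use(talk_style)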
fig, ax = plt.subplots(1, 4, figsize = (14, 5))
ax[0].hist(df["Renda"][df["Renda"] < 40000], bins = 100)
ax[0].set_title('Frequency x Renda (<40k)')
ax[1].hist(df["Renda"][df["Renda"] < 15000], bins = 100)
ax[1].set_title('Frequency x Renda (<15k)')
ax[2].hist(df["Renda"][df["Renda"] < 10000], bins = 100)
ax[2].set_title('Frequency x Renda (<10k)')
ax[3].hist(df["Renda"][df["Renda"] < 5000], bins = 100)
ax[3].set_title('Frequency x Renda (<5k)')

In [None]:
df["Renda"][df["Renda"] > 40000].plot(kind = 'hist', bins = 100)
plt.title('Frequency x Renda (>40k)')

### Renda x Cor

In [None]:
# Renda (<5000) by Cor: filter once, then refer to the columns by name
sns.boxplot(x = 'Cor', y = 'Renda', data = df[df['Renda'] < 5000])
plt.title('Renda (<5k) x Cor')

In [None]:
# Renda (>25000) by Cor
sns.boxplot(x = 'Cor', y = 'Renda', data = df[df['Renda'] > 25000])
plt.title('Renda (>25k)  x Cor')

### Renda x Sexo

In [None]:
sns.boxplot(x = 'Sexo', y = 'Renda', data = df[df['Renda'] > 25000])
plt.title('Renda (>25k) x Sexo')

In [None]:
sns.boxplot(x = 'Sexo', y = 'Renda', data = df[df['Renda'] < 10000])
plt.title('Renda (<10k) x Sexo')

In [None]:
sns.boxplot(x = 'Sexo', y = 'Renda', data = df[df['Renda'] < 4000])
plt.title('Renda (<4k) x Sexo')

### Renda x Idade

In [None]:
# x and y are passed as keywords: recent seaborn versions no longer accept them positionally
sns.scatterplot(x = 'Idade', y = 'Renda', hue = 'Cor', data = df)
plt.xticks([0, 10, 20, 30, 40, 50, 60, 70, 80], labels = [0, 10, 20, 30, 40, 50, 60, 70, 80])
plt.title("Renda x Idade x Cor")

### Renda x Anos de Estudo

In [None]:
less_than_five_years = df[df["Anos de Estudo"] <= 5]
five_nine_years = df[(df["Anos de Estudo"] > 5) &  (df["Anos de Estudo"] < 10)]
nine_fourteen_years = df[(df["Anos de Estudo"] >= 10) & (df["Anos de Estudo"] < 15)]
more_than_fifteen_years = df[df["Anos de Estudo"] >= 15]

In [None]:
print("Average Salary (Renda) for 0-5 years of study :", round(less_than_five_years['Renda'].mean(), 2))
print("Average Salary (Renda) for 6-9 years of study :", round(five_nine_years['Renda'].mean(), 2))
print("Average Salary (Renda) for 10-14 years of study :", round(nine_fourteen_years['Renda'].mean(), 2))
print("Average Salary (Renda) for 15+ years of study :", round(more_than_fifteen_years['Renda'].mean(), 2))

In [None]:
# plot averages
year_avgs = np.array([
    round(less_than_five_years['Renda'].mean(), 2),
    round(five_nine_years['Renda'].mean(), 2),
    round(nine_fourteen_years['Renda'].mean(), 2),
    round(more_than_fifteen_years['Renda'].mean(), 2)
    ])

# labels match the bands defined above: 0-5, 6-9, 10-14 and 15+ years
categories = np.array(['0-5', '6-9', '10-14', '15+'])

In [None]:
plt.figure(figsize=(5,3))
sns.barplot(x=categories, y=year_avgs)
plt.title("Renda x Anos de Estudo")

### Renda x UF

In [None]:
# select the column before averaging: UF and Cor are text columns and cannot be averaged
df.groupby('UF')[['Renda']].mean().plot(kind='bar')

## Altura

In [None]:
# Altura by Sexo
plt.title("Altura x Sexo")
sns.boxplot(x = df['Sexo'], y = df['Altura'])

In [None]:
print("Average height for men :", round(df[df['Sexo'] == 0].Altura.mean(), 3))
print("Average height for women :", round(df[df['Sexo'] == 1].Altura.mean(), 3))

# Attributes Relations

Now let's check some correlations on the data.

In [None]:
sns.pairplot(df, hue = 'Cor')

In [None]:
sns.pairplot(df, hue = 'Sexo')

In [None]:
# heatmap for correlations
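# UF and Cor now hold text labels, so restrict the correlation to the numeric columns
corr = df.select_dtypes('number').corr()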
sns.heatmap(corr, annot = True, vmin = 0, vmax = 1, cmap = 'Purples')

# Dealing with Outliers

One issue we might want to address is the heavy skew of the target column (Renda). The few very high salaries (above ~15k) are likely to hurt the model during fitting, so we test this by creating a new dataset (df_model) without them. Since these observations are few, dropping them should cost the model little for typical incomes, although the resulting model should not be expected to predict salaries above that threshold.

The metric that we will be using (RMSE: Root Mean Squared Error) to evaluate the models is sensitive to outliers, so this is also a good reason to remove some data that is too far from the normal.
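
As a small illustration of that sensitivity, the sketch below computes RMSE on made-up salaries, with and without a single badly missed high earner. The numbers are invented for the example, and the mean absolute error (MAE) is shown only for contrast; it is not used elsewhere in this notebook.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up salaries; every prediction is off by exactly 100
y_true = [1000, 1500, 2000, 2500]
y_pred = [1100, 1400, 2100, 2400]
rmse_clean = rmse(y_true, y_pred)
mae_clean = mae(y_true, y_pred)

# Add one high earner that the model misses by 30000
y_true_out = y_true + [40000]
y_pred_out = y_pred + [10000]
rmse_out = rmse(y_true_out, y_pred_out)
mae_out = mae(y_true_out, y_pred_out)

print("without outlier: RMSE = %.1f, MAE = %.1f" % (rmse_clean, mae_clean))
print("with outlier:    RMSE = %.1f, MAE = %.1f" % (rmse_out, mae_out))
```

Because the errors are squared before averaging, the single large miss dominates the RMSE far more than it dominates the MAE.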

You can test your own values and get new results if you open this notebook with Google Colab.

In [None]:
# Creating new df, with fewer outliers
df_model = df[df['Renda'] <= 15000]
df.shape, df_model.shape

# One Hot Encoding

As stated in the "Data Manipulation" section, we will now perform One Hot Encoding to prepare the data for the model.

In [None]:
df_model = pd.get_dummies(df_model, drop_first = True)
df_model.shape

In [None]:
df_model.head()

# Defining features and target

In [None]:
df_model.columns

In [None]:
features = df_model.drop('Renda', axis = 1)
features.shape

In [None]:
features.head()

In [None]:
target = df_model['Renda']
target.shape

In [None]:
target.head()

# Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# scaling dataset
scaler.fit(features)
features_scaled = scaler.transform(features)

In [None]:
features_scaled
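
For reference, with its default settings `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`, where the minimum and maximum are learned in `fit`. The sketch below reproduces that arithmetic with NumPy on made-up values (the columns stand in for Idade and Altura). It also shows that a new observation has to be rescaled with the stored training minimum and range, not with its own.

```python
import numpy as np

# Made-up training features: columns are [Idade, Altura]
train = np.array([[20.0, 1.50],
                  [40.0, 1.70],
                  [60.0, 1.90]])

# "fit": remember the per-column minimum and range
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "transform": rescale each column to [0, 1]
train_scaled = (train - col_min) / col_range

# A new observation is scaled with the *training* minimum and range
new_row = np.array([[30.0, 1.65]])
new_scaled = (new_row - col_min) / col_range

print(train_scaled)
print(new_scaled)
```

Any input passed to a model trained on scaled features needs the same transformation first.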

# Regression Preliminary Modeling

In [None]:
from sklearn.model_selection import cross_val_score
cv = 10
scoring = 'neg_mean_squared_error'
random_state = 0

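We will evaluate our models using RMSE (Root Mean Squared Error): the lower the RMSE, the better the model. The 'neg_mean_squared_error' scoring returns the negated MSE of each fold, so in the cells below we flip the sign and take the square root.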
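
To make the conversion from negated MSE scores to RMSE concrete, here is a small sketch with hypothetical fold scores (invented numbers, not results from this dataset). It also shows that the square root of the mean MSE, which is what this notebook computes, is not quite the same as the average of the per-fold RMSE values.

```python
import numpy as np

# Hypothetical per-fold scores in the form returned with
# scoring='neg_mean_squared_error' (negated MSE, so higher is better)
scores = np.array([-2_250_000.0, -2_560_000.0, -2_890_000.0])

# Flip the sign to recover the MSE of each fold
mse_per_fold = -scores

# Approach used in this notebook: square root of the mean MSE
rmse_pooled = np.sqrt(mse_per_fold.mean())

# Alternative: average of the per-fold RMSE values
rmse_mean = np.sqrt(mse_per_fold).mean()

print("sqrt of mean MSE: %.2f" % rmse_pooled)
print("mean of fold RMSE: %.2f" % rmse_mean)
```

Either choice works for ranking models as long as the same one is used for every model being compared.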

In [None]:
all_models = []
all_scores = []

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
np.sqrt(-scores.mean())

Lol. Linear Regression, y u so bad? xD
I won't even plot this.

Plain (unregularized) linear regression clearly breaks down on this dataset, so it is left out of the comparison. Let's see whether the regularized linear models below do any better.

## Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

In [None]:
model = Lasso(random_state = random_state)
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('Lasso')
all_scores.append(res)

In [None]:
all_models, all_scores

## Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
model = Ridge(random_state = random_state)
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('Ridge')
all_scores.append(res)

In [None]:
all_models, all_scores

## Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
model = KNeighborsRegressor()
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('KNN')
all_scores.append(res)

In [None]:
all_models, all_scores

## Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
model = DecisionTreeRegressor(random_state = random_state)
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('Decision Tree')
all_scores.append(res)

In [None]:
all_models, all_scores

## XGB Regressor

Note: I also experimented with other tree-based models, like RandomForestRegressor, but they usually took too long to run (several minutes) and gave worse results.

In [None]:
from xgboost import XGBRegressor

In [None]:
# this takes a minute to run
model = XGBRegressor(random_state = random_state)
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('XGB')
all_scores.append(res)

In [None]:
all_models, all_scores

## Neural Networks

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
# this takes a few minutes to run
model = MLPRegressor(random_state = random_state)
scores = cross_val_score(model, features_scaled, target, cv = cv,
                         scoring = scoring, n_jobs = -1)
res = round(np.sqrt(-scores.mean()), 2)

In [None]:
all_models.append('MLP')
all_scores.append(res)

In [None]:
all_models, all_scores

## Preliminary Modeling Results

In [None]:
names = list(all_models)
values = list(all_scores)

In [None]:
# plot results
bar1 = plt.bar(np.arange(len(values)), values)
plt.xticks(range(len(names)), names)
plt.title('Renda prediction: Model x Error')
plt.ylim(0,2500)
for rect in bar1:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2.0, height, '%.2f' % float(height), ha='center', va='bottom', fontsize = 12, fontweight = 'bold')

We can see that the best model we tested is the XGB Regressor, so that's what we are going to use from now on.

# Split dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target,
                                                      test_size = 0.2, random_state = random_state)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape

# Training

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
# Define model
# eval_metric is set here: recent xgboost versions no longer accept it in fit()
model = XGBRegressor(objective='reg:squarederror', eval_metric='rmse',
                     random_state = random_state)

# Fit (train) model
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_test, y_test)],
          verbose=False)

In [None]:
# Evaluate model
# Predict on new data (X_test). The model wasn't trained on this data and hasn't seen it yet
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE on test data: %.2f" % rmse)

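The RMSE on the hold-out test set is lower than the cross-validated score from the preliminary run. Keep in mind that this is the same algorithm evaluated on a single 80/20 split instead of 10 folds, so part of the difference comes from the split itself rather than from a better model.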

In [None]:
all_models.append('XGB trained')
all_scores.append(round(rmse, 2))

In [None]:
all_models, all_scores

# Final Results

## Models comparison

In [None]:
names = list(all_models)
values = list(all_scores)

In [None]:
# plot results
bar1 = plt.bar(np.arange(len(values)), values)
plt.xticks(range(len(names)), names)
plt.title('Renda prediction: Model x Error')
plt.ylim(0,2500)
for rect in bar1:
    height = rect.get_height()
    plt.text(rect.get_x() + rect.get_width()/2.0, height, '%.2f' % float(height), ha='center', va='bottom', fontsize = 12, fontweight = 'bold')

## Feature Importances

The plot below shows the importance of each feature (column) in the dataset. These importances were estimated by the model while fitting the data, and they indicate which features have the most impact on the predicted value of Renda.
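
The sorting step used for that plot relies on `np.argsort`. As a minimal sketch, with made-up importances and names standing in for `model.feature_importances_` and `features.columns` (the numbers are not results from this model):

```python
import numpy as np

# Made-up importances and names, for illustration only
importances = np.array([0.10, 0.55, 0.05, 0.30])
names = np.array(["Sexo", "Anos de Estudo", "Altura", "Idade"])

# argsort returns the indices that sort the array in ascending order
order = np.argsort(importances)

# Reverse the order to list the most important feature first
ranked = names[order[::-1]].tolist()
for idx in order[::-1]:
    print(f"{names[idx]:<15} {importances[idx]:.2f}")
```

Keep in mind that after One Hot Encoding the importance of a categorical variable such as UF is spread across its dummy columns, so individual `UF_*` bars can look small even if the variable as a whole matters.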

In [None]:
features_importances = model.feature_importances_
argsort = np.argsort(features_importances)
features_importances_sorted = features_importances[argsort]

feature_names = features.columns
features_sorted = feature_names[argsort]

# plot feature importances
plt.figure(figsize = (5,10))
plt.barh(features_sorted, features_importances_sorted)
plt.title("Feature Importances")

## Renda: Predicted x Real

In [None]:
print_every = 50
fig = plt.figure(figsize=(20,5))
plt.bar(list(range(len(y_test[::print_every]))), y_test.values[::print_every],
        alpha = 1, color = 'red', width = 1, label = 'true values')
plt.bar(list(range(len(y_pred[::print_every]))), y_pred[::print_every],
        alpha = 0.5, color = 'blue', width = 1, label = 'predicted values')
plt.legend()

## Make predictions

In [None]:
# Making predictions of Renda for the first 5 observations of the test set (X_test)
model.predict(X_test)[0:5]

In [None]:
# Make any prediction you want!
# Define your features array: Set the values below for each column

my_pred = np.array([[

# Sexo
1,
# Idade
25,
# Anos de Estudo
8,
# Altura
1.65,
# UF_Alagoas
0,
# UF_Amapá
0,
# UF_Amazonas
0,
# UF_Bahia
0,
# UF_Ceará
0,
# UF_Distrito Federal
0,
# UF_Espírito Santo
0,
# UF_Goiás
0,
# UF_Maranhão
0,
# UF_Mato Grosso
0,
# UF_Mato Grosso do Sul
0,
# UF_Minas Gerais
0,
# UF_Paraná
0,
# UF_Paraíba
0,
# UF_Pará
0,
# UF_Pernambuco
0,
# UF_Piauí
0,
# UF_Rio Grande do Norte
0,
# UF_Rio Grande do Sul
0,
# UF_Rio de Janeiro
1,
# UF_Rondônia
0,
# UF_Roraima
0,
# UF_Santa Catarina
0,
# UF_Sergipe
0,
# UF_São Paulo
0,
# UF_Tocantins
0,
# Cor_Branca 
0,
# Cor_Indígena
0,
# Cor_Parda 
1,
# Cor_Preta
0
]])

In [None]:
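# The model was trained on scaled features, so the input must go through the same scaler
my_pred_scaled = scaler.transform(pd.DataFrame(my_pred, columns = features.columns))
res = model.predict(my_pred_scaled)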
print("Renda predicted for information in my_pred array:", round(res[0], 2), "reais.")
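
Typing every value in positional order is easy to get wrong. One possible alternative, sketched below, is to start from zeros for every feature column and set only the ones you need by name. The helper `make_row` is hypothetical (it is not part of this notebook), and `feature_names` is a shortened stand-in for `list(features.columns)`.

```python
import numpy as np

# Shortened stand-in for list(features.columns); the real list has more columns
feature_names = ["Sexo", "Idade", "Anos de Estudo", "Altura",
                 "UF_Rio de Janeiro", "UF_São Paulo", "Cor_Parda", "Cor_Preta"]

def make_row(feature_names, values):
    """Return a 1 x n array with zeros everywhere except the named features."""
    row = dict.fromkeys(feature_names, 0.0)
    for name, value in values.items():
        if name not in row:
            # Fail loudly on typos instead of silently ignoring them
            raise KeyError(f"unknown feature: {name}")
        row[name] = float(value)
    return np.array([[row[name] for name in feature_names]])

my_row = make_row(feature_names, {
    "Sexo": 1, "Idade": 25, "Anos de Estudo": 8, "Altura": 1.65,
    "UF_Rio de Janeiro": 1, "Cor_Parda": 1,
})
print(my_row)
```

Since the model in this notebook was fit on `features_scaled`, a row built this way would still need to go through `scaler.transform` before being passed to `model.predict`.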

If you are not happy with your predictions, go back to the "Feature Importances" section and see how you can change your input ;)

# Conclusions

This article explored the PNAD 2015 dataset and provided some insights into it. It showed that, with the features in this dataset, a person's monthly salary (Renda) can be predicted with an RMSE of ~1600 reais, for observations with Renda up to 15,000.