I've pulled this dataset from Kaggle.com. The file describes the income and expenditure characteristics of Filipino households.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
family_income_data = pd.read_csv('../input/family-income-and-expenditure/Family Income and Expenditure.csv')

In [None]:
family_income_data.head()

In [None]:
import missingno as msno

The first thing I want to do is check whether there are missing values in my dataset. I do this with the **missingno** library's `matrix` function, which helps me visualize where the null values are.

In [None]:
msno.matrix(family_income_data)
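
Alongside the matrix plot, a numeric summary can confirm exactly which columns have gaps. Below is a minimal sketch on a made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for family_income_data; the values are made up.
toy = pd.DataFrame({
    "Total Household Income": [120000, 95000, 310000, 78000],
    "Household Head Class of Worker": ["Self-employed", np.nan, "Private", np.nan],
})

# Count missing values per column and keep only the columns that have any.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

The same two lines should work on the real frame by swapping `toy` for `family_income_data`.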

## TASK 1: Make a model to Predict Household Income through the Expenditures

Let's see if we can predict `Total Household Income` from the expenditures each family makes. I start by taking all the column names with 'Expenditure' in them.

In [None]:
expenditures = [column for column in family_income_data.columns if 'Expenditure' in column]

Checking the values I have:

In [None]:
expenditures

I set my features as the slice of the original dataset containing the expenditure columns, and set the target as the `Total Household Income` column.

In [None]:
X = family_income_data.loc[:, expenditures]
y = family_income_data['Total Household Income']

Importing the necessary libraries for fitting.

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression, Lasso
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import Normalizer, PolynomialFeatures, MinMaxScaler, StandardScaler
from xgboost import XGBRegressor, XGBClassifier
from sklearn.svm import SVR
from sklearn.decomposition import IncrementalPCA, SparsePCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, precision_score,recall_score, classification_report, confusion_matrix

But before I train a model, I want to see how the features look in a scatterplot when plotted against the label.

In [None]:
plt.figure(figsize=(20, 20))
i = 1
for exp in expenditures :
    plt.subplot(6,3,i)
    sns.regplot(x=X[exp], y=y)
    i += 1

With respect to the selected label, a lot of the features show a huge variance. This will limit the reliability of the regression model later.

Next, my goal is to see how the features and the income are correlated, and I want to show this visually using a heatmap and the `.corr()` method.

In [None]:
Xy = X.copy()
Xy['THI'] = y

In [None]:
Xy_corr = Xy.corr()

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(Xy_corr, square=True)

From here, it seems like the strongest correlation I have is about 0.8, between `Total Rice Expenditure` and `Bread and Cereals Expenditure`. Next to that, `Total Food Expenditure` has a fairly high correlation with `Meat Expenditure` and `Vegetables Expenditure`. That makes sense, given that **Pinoys** have a culture of *rice and ulam*. It's not surprising that these are related.
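
Reading the strongest pairs off a heatmap by eye can be imprecise, so here is a rough sketch of ranking them numerically. The `toy` frame and its column names are made up for illustration; on the notebook's data the same steps would presumably start from `Xy_corr`.

```python
import numpy as np
import pandas as pd

# Made-up numbers standing in for a few expenditure columns.
rng = np.random.default_rng(0)
rice = rng.gamma(shape=2.0, scale=1000.0, size=200)
toy = pd.DataFrame({
    "rice": rice,
    "cereals": rice * 1.2 + rng.normal(0, 300, size=200),  # built to track rice
    "meat": rng.gamma(shape=2.0, scale=800.0, size=200),    # independent of rice
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```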

Visualizing these three in a regression plot:

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(3,1,1)
sns.regplot(x=family_income_data['Total Rice Expenditure'], y = family_income_data['Bread and Cereals Expenditure'])
plt.subplot(3,1,2)
sns.regplot(x=family_income_data['Total Food Expenditure'], y = family_income_data['Vegetables Expenditure'])
plt.subplot(3,1,3)
sns.regplot(x=family_income_data['Total Food Expenditure'], y = family_income_data['Meat Expenditure'])

At this point, I want to show how skewed the data is. This is to set my expectation for regression modelling later.

In [None]:
plt.figure(figsize=(15, 25))
i = 1
for exp in expenditures :
    plt.subplot(6,3,i)
    sns.histplot(X[exp], kde=True, stat='density')  # distplot is deprecated in recent seaborn
    i += 1

Okay, so I looked at how the data points are distributed, and the distributions are heavily right-skewed: most of the mass sits on the left, which tells me that a large number of households in the dataset spend about the same amount on each expense. The KDEs, however, tell a different story. Most peaks sit below the bin with the most data points, which tells us there are values at the extreme right stretching the curve heavily.

While most households spend within a narrow range, there is also a small group that spends far more.
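
The long right tail can also be put into a number with `Series.skew()`. The sketch below uses made-up lognormal values (an assumption, not the dataset) and shows how a `log1p` transform, one common option, pulls the tail in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up right-skewed "spending" values (lognormal), not taken from the dataset.
spend = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=5000))

raw_skew = spend.skew()            # positive means a long tail to the right
log_skew = np.log1p(spend).skew()  # log1p compresses the right tail

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```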

All the while, the distribution of income is this:

In [None]:
plt.figure(figsize=(10, 10))
sns.histplot(y, bins=1000, kde=True, stat='density')  # distplot is deprecated in recent seaborn

This is a ridiculously lopsided distribution curve. Even with the number of bins set explicitly, only about three of them are distinguishable.

What this tells me is that almost everyone in the dataset has a `Total Household Income` of between 5000-10000. It's quite disturbing.

Okay, now let's train a model. I'm going to pick `RandomForestRegressor` and `KNeighborsRegressor`, since these are great picks for chaotic scatterplots. Then, I'm also throwing in an `XGBRegressor` in the end to see if an optimized model can perform better than the earlier two.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)

I'm setting verbose to True so that I can see what's happening under the model as it happens.

Moreover, I'm adding a comparative regplot for the actual values for Total Household Income and the ones predicted by the model. Ideally, we want to see them fit inside the regression line to say that "Okay, this is a good model."

### Random Forest Regressor

In [None]:
rfr = RandomForestRegressor(verbose=True, n_jobs=-1, n_estimators=1000)
rfr.fit(X_train, y_train)

In [None]:
rfr.score(X_test, y_test)

Let's visually check how the predicted values and the actual values line up along a 45-degree line. The closer the points are to the line, the better the job the model did at predicting values.

We'll also repeat this method for the next models we use.
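
Since the same check is repeated for each model, it could be wrapped in a small helper. This is only a sketch: `evaluate_regressor` is a hypothetical name, the data is synthetic, and it scores on the held-out split rather than on all rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_test, y_test):
    """Hypothetical helper: score a fitted regressor on held-out rows only."""
    pred = model.predict(X_test)
    return {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }


# Synthetic data standing in for the expenditure features and income target.
rng = np.random.default_rng(41)
X_demo = rng.uniform(0, 1, size=(500, 3))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=41)
model = RandomForestRegressor(n_estimators=50, random_state=41).fit(X_tr, y_tr)
scores = evaluate_regressor(model, X_te, y_te)
print(scores)
```

Scoring on the test split keeps rows the model has already seen from flattering the numbers.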

In [None]:
y_rfr_predict = rfr.predict(X)  # note: X includes the training rows
print(mean_squared_error(y, y_rfr_predict))  # print it, otherwise the value is discarded mid-cell
plt.figure(figsize=(20,5))
ax = sns.regplot(x=y, y = y_rfr_predict)
ax.set(xlabel='Total Household Income', ylabel='Predicted THI')

### K-Nearest Neighbors Regressor

In [None]:
knr = KNeighborsRegressor(n_neighbors=15, n_jobs=-1, leaf_size=50)
knr.fit(X_train, y_train)

In [None]:
knr.score(X_test, y_test)

In [None]:
y_knr_predict = knr.predict(X)  # note: X includes the training rows
print(mean_squared_error(y, y_knr_predict))  # print it, otherwise the value is discarded mid-cell
plt.figure(figsize=(20,5))
ax = sns.regplot(x=y, y = y_knr_predict)
ax.set(xlabel='Total Household Income', ylabel='Predicted THI')

### XGBRegressor

In [None]:
xgbr = XGBRegressor(n_jobs=-1, learning_rate=0.1, subsample=0.5)  # n_jobs and learning_rate are the sklearn-API names for nthread and eta
xgbr.fit(X_train, y_train)

In [None]:
xgbr.score(X_test, y_test)

In [None]:
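y_xgbr_predict = xgbr.predict(X)  # note: X includes the training rows
print(mean_squared_error(y, y_xgbr_predict))  # print it, otherwise the value is discarded mid-cell
plt.figure(figsize=(20,5))  # create the figure before plotting onto it
ax = sns.regplot(x=y, y = y_xgbr_predict)
ax.set(xlabel='Total Household Income', ylabel='Predicted THI')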

Of the three models used, the Random Forest Regressor seems to yield the best score. This model can still be improved by scaling the data points and normalizing the distribution.
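
One possible way to try the scaling and normalizing idea is sketched below, under stated assumptions: the data is synthetic, KNN is used only because it is sensitive to feature scale, and `log1p` is one choice of normalizing transform among several. `TransformedTargetRegressor` fits on the transformed target and converts predictions back to the original units.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed data; a stand-in for expenditures (X) and income (y).
rng = np.random.default_rng(41)
X_demo = rng.lognormal(mean=8.0, sigma=1.0, size=(1000, 4))
y_demo = X_demo.sum(axis=1) * rng.lognormal(mean=0.0, sigma=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=41)

# Scale the features inside a pipeline so the scaler is fit on training rows only,
# and fit on log1p(y) while still predicting in the original units.
scaled_knr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=15)),
    func=np.log1p,
    inverse_func=np.expm1,
)
scaled_knr.fit(X_tr, y_tr)
r2_demo = scaled_knr.score(X_te, y_te)
print(f"held-out R^2: {r2_demo:.3f}")
```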

## TASK 2: Predicting Income Bracket Through Classification

For this task, I aim to predict the income category of each household using the following features:
 -  Household Head Sex
 -  Household Head Age
 -  Household Head Marital Status
 -  Household Head Highest Grade Completed
 -  Household Head Job or Business Indicator
 -  Household Head Class of Worker
 -  Type of Household
 -  Total Number of Family members
 -  Total number of family members employed
 -  Type of Building/House
 -  Type of Roof
 -  Type of Walls
 -  House Floor Area
 -  House Age
 -  Number of bedrooms
 -  Tenure Status
 -  Toilet Facilities
 -  Electricity
 -  Main Source of Water Supply
 -  Number of Television
 -  Number of CD/VCD/DVD
 -  Number of Component/Stereo set
 -  Number of Refrigerator/Freezer
 -  Number of Washing Machine
 -  Number of Airconditioner
 -  Number of Car, Jeep, Van
 -  Number of Landline/wireless telephones
 -  Number of Cellular phone
 -  Number of Personal Computer
 -  Number of Stove with Oven/Gas Range
 -  Number of Motorized Banca
 -  Number of Motorcycle/Tricycle
 
For consistency, I will also be using the same families of models I used in the regression task: Random Forest, KNeighbors, and XGBoost classifiers.

Before I get to one-hot encoding, I need to know how wide the dataset could become, so I check the number of unique values in each feature column. I only need the ones with type 'object':

In [None]:
class_fid = family_income_data.loc[: , ['Household Head Sex' ,'Household Head Age' ,'Household Head Marital Status' ,'Household Head Highest Grade Completed' ,'Household Head Job or Business Indicator' ,'Household Head Class of Worker' ,'Type of Household' ,'Total Number of Family members' ,'Total number of family members employed' ,'Type of Building/House' ,'Type of Roof' ,'Type of Walls' ,'House Floor Area' ,'House Age' ,'Number of bedrooms' ,'Tenure Status' ,'Toilet Facilities' ,'Electricity' ,'Main Source of Water Supply' ,'Number of Television' ,'Number of CD/VCD/DVD' ,'Number of Component/Stereo set' ,'Number of Refrigerator/Freezer' ,'Number of Washing Machine' ,'Number of Airconditioner' ,'Number of Car, Jeep, Van' ,'Number of Landline/wireless telephones' ,'Number of Cellular phone' ,'Number of Personal Computer' ,'Number of Stove with Oven/Gas Range' ,'Number of Motorized Banca' ,'Number of Motorcycle/Tricycle' ]]

In [None]:
class_fid_cat = []
for col in class_fid.columns :
    if class_fid[col].dtype == object :
        class_fid_cat.append(col)
        print(col," : ",len(class_fid[col].value_counts()))

I'm afraid the many categories in `Household Head Highest Grade Completed` could slow down my training, so I'm going to simplify this column.

First, I have to check the unique values for the column.

In [None]:
for item in family_income_data['Household Head Highest Grade Completed'].value_counts().index :
    print(".", item)

Next, I manually sort them into five categories:

In [None]:
educ_attainment = { 'DNA/Primary/Elementary' : ['Elementary Graduate', 'Grade 4', 'Grade 5', 'Grade 3', 'Grade 2', 'Grade 1', 'Grade 6', 'No Grade Completed', 'Preschool'], 
                    'Secondary' : ['High School Graduate', 'Second Year High School', 'Third Year High School', 'First Year High School'],
                    'Attended College' : ['Second Year College', 'Third Year College', 'First Year College', 'Second Year Post Secondary', 'Fourth Year College', 'First Year Post Secondary'],
                    'Post Baccalaureate' : ['Post Baccalaureate'], 
                    'Degrees/Programs' : ['Business and Administration Programs', 'Teacher Training and Education Sciences Programs', 'Engineering and Engineering Trades Programs', 'Engineering and Engineering trades Programs', 'Engineering and Engineering trades Programs', 'Health Programs', 'Computing/Information Technology Programs', 'Security Services Programs', 'Agriculture, Forestry, and Fishery Programs',
                                  'Transport Services Programs', 'Social and Behavioral Science Programs', 'Social and Behavioral Science Programs', 'Personal Services Programs', 'Humanities Programs', 'Other Programs in Education at the Third Level, First Stage, of the Type that Leads to an Award not Equivalent to a First University or Baccalaureate Degree',
                                  'Law Programs', 'Architecture and Building Programs', 'Basic Programs', 'Journalism and Information Programs', 'Arts Programs', 'Life Sciences Programs', 'Manufacturing and Processing Programs',
                                  'Social Services Programs', 'Physical Sciences Programs', 'Other Programs of Education at the Third Level, First Stage, of the Type that Leads to a Baccalaureate or First University/Professional Degree (HIgher Education Level, First Stage, or Collegiate Education Level)',
                                  'Veterinary Programs', 'Environmental Protection Programs'
                                 ]
                    }

Then, I use pd.Series.apply() to create a new column with these categories

In [None]:
family_income_data['Household Head Highest Grade Completed (Simplified)'] = family_income_data['Household Head Highest Grade Completed'].apply(lambda x : ''.join([key for key in educ_attainment.keys() if x in educ_attainment[key]]))
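
As an alternative sketch of the same lookup, the grouping dictionary can be flipped into a flat `{level: group}` mapping and passed to `Series.map`. The `groups` dictionary here is a shortened, hypothetical version of `educ_attainment`, and the grade values are made up.

```python
import pandas as pd

# Shortened, hypothetical version of the grouping dictionary.
groups = {
    "DNA/Primary/Elementary": ["Grade 1", "Elementary Graduate"],
    "Secondary": ["High School Graduate"],
}

# Flip {group: [levels]} into {level: group} for direct lookup.
level_to_group = {level: group for group, levels in groups.items() for level in levels}

grades = pd.Series(["Grade 1", "High School Graduate", "Unlisted Level"])
simplified = grades.map(level_to_group)

# Unmapped levels become NaN, which makes gaps in the dictionary easy to spot.
print(simplified.isna().sum(), "unmapped value(s)")
```

Unlike the `''.join(...)` approach, which yields an empty string for an unlisted level, this version leaves a NaN that `isna()` can count.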

In [None]:
family_income_data.head()

I have to create the income categories first before fitting a model. I decided to make four categories that will be used as labels based on the Total Household Income values:

1. **Category 1** - (< 25% of the distribution)
2. **Category 2** - (25%-50% of the distribution)
3. **Category 3** - (50%-75% of the distribution)
4. **Category 4** - (> 75% of the distribution)

I think to make things simple, dividing using the quartiles will suffice.
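
To make the quartile split concrete, here is a small sketch of `pd.qcut` on made-up incomes (the numbers are assumptions chosen only to show the behavior). Each category ends up with roughly the same number of rows, regardless of how wide its income range is.

```python
import pandas as pd

# Made-up incomes, only to show how quartile binning behaves.
income = pd.Series([50, 60, 70, 80, 90, 100, 200, 400])
labels = ["Category 1", "Category 2", "Category 3", "Category 4"]

category = pd.qcut(income, q=4, labels=labels)
print(category.value_counts().sort_index())
```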

In [None]:
family_income_data['Total Household Income'].describe()

I use pd.qcut to divide the data by quartiles, and use custom labels as I have described earlier:

In [None]:
family_income_data['Income Category'] = pd.qcut(family_income_data['Total Household Income'], q=4, labels=['Category 1', 'Category 2', 'Category 3', 'Category 4'])

In [None]:
family_income_data

Okay, at this point, I can create my feature and label sets:

In [None]:
XX =  family_income_data.loc[:, [ 'Household Head Sex', 'Household Head Age', 'Household Head Marital Status','Household Head Highest Grade Completed (Simplified)','Household Head Job or Business Indicator','Household Head Class of Worker' , 'Type of Household' , 'Total Number of Family members' , 'Total number of family members employed', 'Type of Building/House', 'Type of Roof' , 'Type of Walls' , 'House Floor Area' , 'House Age' , 'Number of bedrooms' , 
'Tenure Status' , 'Toilet Facilities' , 'Electricity' , 'Main Source of Water Supply' , 'Number of Television' , 'Number of CD/VCD/DVD' , 'Number of Component/Stereo set' , 'Number of Refrigerator/Freezer' , 'Number of Washing Machine' , 
'Number of Airconditioner' , 'Number of Car, Jeep, Van' , 'Number of Landline/wireless telephones' , 'Number of Cellular phone' , 'Number of Personal Computer' , 'Number of Stove with Oven/Gas Range' , 'Number of Motorized Banca' , 'Number of Motorcycle/Tricycle']]
yy = family_income_data['Income Category']

In [None]:
XX

Looks great! Now time to One-Hot encode using sklearn's preprocessing library.

In [None]:
from sklearn.preprocessing import OneHotEncoder

As before, I need to know if I have any null values in any of my features.

In [None]:
msno.matrix(XX)

Woops, need to clean that up with `fillna`. Let's just place a string called 'N/A' for those.

In [None]:
XX = XX.fillna({'Household Head Class of Worker': 'N/A'})  # reassign instead of chained inplace fillna, which may not modify XX

Earlier, I made a list of categorical columns, and I remember I created a new feature called 'Household Head Highest Grade Completed (Simplified)'. I want to use this in place of the huge 'Household Head Highest Grade Completed' column.

In [None]:
class_fid_cat

In [None]:
class_fid_cat[2] = 'Household Head Highest Grade Completed (Simplified)'

Now, I can make the categorical section of my XX data set, and then One-Hot encode it.

In [None]:
XX_cat = XX.loc[:, class_fid_cat]

In [None]:
ohe = OneHotEncoder(sparse_output=False) #dense output since I want to see my array; scikit-learn < 1.2 uses sparse=False instead
XX_t = ohe.fit_transform(XX_cat)

In [None]:
XX_t

Great! Now, my categorical features are encoded to ones and zeroes. Time to append those to my numerical ones and create my final XX set.
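
The encode-then-append steps can also be expressed in one object with scikit-learn's `ColumnTransformer`. This is a sketch of that alternative, not the approach used in this notebook; the `toy` frame and its values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with one categorical and one numeric feature.
toy = pd.DataFrame({
    "Type of Household": ["Single Family", "Extended Family", "Single Family"],
    "Household Head Age": [45, 62, 38],
})

# One-hot encode the categorical column and pass the numeric one through unchanged.
ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Type of Household"])],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)
print(encoded.shape)
```

Because the transformer is fit once, it can be reused on new rows with `ct.transform(...)` and keeps the column order consistent.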

In [None]:
class_fid_num = [col for col in XX.columns if col not in class_fid_cat]  # XX (Task 2 features), not X (Task 1 expenditures)
XX_tt_0 = XX[class_fid_num]
XX_tt = XX_tt_0.to_numpy() #necessary step since I'm scared that I won't be able to merge a numpy array and a dataframe

In [None]:
XX_T = np.concatenate((XX_t, XX_tt), axis=1) #axis 1 tells me I am concatenating column-wise

Now, I can train-test-split my data. Similar to the regression part, I'm using an 80%-20% division:

In [None]:
XX_train, XX_test, yy_train, yy_test = train_test_split(XX_T, yy, test_size=0.2, random_state=41)

### Random Forest Classifier

In [None]:
rfc = RandomForestClassifier(verbose = True, n_jobs=-1)
yy_fit_rfc = rfc.fit(XX_train, yy_train)

In [None]:
yy_pred_rfc = yy_fit_rfc.predict(XX_T)

Let's see how well the predictions made are using the classification report:

In [None]:
print(classification_report(yy, yy_pred_rfc))

Awesome! I think this model did pretty well. Let's add a confusion matrix to see how the predicted and actual values compare.

In [None]:
labels = ['Category 1', 'Category 2', 'Category 3', 'Category 4']
cm = confusion_matrix(yy, yy_pred_rfc, labels=labels)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
cax = ax.matshow(cm, cmap='cubehelix')
plt.title('Confusion Matrix - Random Forest Classifier')
fig.colorbar(cax)
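ax.set_xticks(range(len(labels)))  # pin tick positions so each label lines up with its cell
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)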
plt.xlabel('Predicted')
plt.ylabel('True')


That diagonal lighter shade tells me that a large part of my Predicted matched the Actual values.
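
The diagonal can also be read as numbers. In the sketch below, the true and predicted labels are made up for illustration; dividing each diagonal entry by its row total gives the recall for that category.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Category 1", "Category 2", "Category 3", "Category 4"]
# Made-up true and predicted labels, for illustration only.
y_true = ["Category 1", "Category 1", "Category 2", "Category 3", "Category 4", "Category 4"]
y_pred = ["Category 1", "Category 2", "Category 2", "Category 3", "Category 4", "Category 3"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, so diagonal / row sum is the recall of each class.
per_class_recall = np.diag(cm) / cm.sum(axis=1)
accuracy = np.trace(cm) / cm.sum()
print(dict(zip(labels, per_class_recall)), accuracy)
```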

### KNeighbors Classifier

In [None]:
knc = KNeighborsClassifier(n_jobs=-1)
yy_fit_knc= knc.fit(XX_train, yy_train)

In [None]:
yy_pred_knc = yy_fit_knc.predict(XX_T)

In [None]:
print(classification_report(yy, yy_pred_knc))

KNeighbors didn't do as well as the Random Forest Classifier.

In [None]:
labels = ['Category 1', 'Category 2', 'Category 3', 'Category 4']
cm = confusion_matrix(yy, yy_pred_knc, labels=labels)
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
cax = ax.matshow(cm, cmap='cubehelix')
plt.title('Confusion Matrix - KNeighbors Classifier')
fig.colorbar(cax)
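ax.set_xticks(range(len(labels)))  # pin tick positions so each label lines up with its cell
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)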
plt.xlabel('Predicted')
plt.ylabel('True')

Lastly, let's do an XGB Classifier.

### XGB Classifier

In [None]:
xbc = XGBClassifier(verbosity=1, n_jobs=-1)  # XGBoost's logging parameter is verbosity, not verbose
yy_fit_xbc = xbc.fit(XX_train, yy_train)

In [None]:
yy_pred_xbc = yy_fit_xbc.predict(XX_T)

In [None]:
print(classification_report(yy, yy_pred_xbc))

In [None]:
labels = ['Category 1', 'Category 2', 'Category 3', 'Category 4']
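cm = confusion_matrix(yy, yy_pred_xbc, labels=labels)  # use the XGB predictions, not the KNN ones
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111)
cax = ax.matshow(cm, cmap='cubehelix')
plt.title('Confusion Matrix - XGB Classifier')
fig.colorbar(cax)
ax.set_xticks(range(len(labels)))  # pin tick positions so each label lines up with its cell
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)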
plt.xlabel('Predicted')
plt.ylabel('True')

Seeing this final product, the best classifier for predicting the Income Category was the Random Forest Classifier. KNN predicted worse, possibly because nearest neighbors are hard to define given the huge variance and unscaled features in our dataset. XGBoost, on the other hand, was run with default hyperparameters and may do better with tuning.
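
One way to firm up a comparison like this is k-fold cross-validation, where every model is scored only on rows it was not fit on. The sketch below uses a synthetic 4-class problem from `make_classification` as a stand-in for the income-category task; XGBoost is left out only to keep the example dependency-light, and it could be added to `candidates` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class problem standing in for the income-category task.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=41
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=41),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# 5-fold cross-validation: each model is always scored on rows it was not fit on.
cv_scores = {
    name: cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, clf in candidates.items()
}
print(cv_scores)
```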