In this demo project, the performance of a single decision tree is compared with that of a random forest on a regression problem. The dataset is the Family Income and Expenditure of Filipino Households. Both models predict the Total Household Income.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

**Importing and Processing the data:**
First, the dataset is imported from a CSV file. A quick exploratory pass then checks the data types of the columns.

In [None]:
df = pd.read_csv('../input/family-income-and-expenditure/Family Income and Expenditure.csv')
target = 'Total Household Income'
df.sample(5)


In [None]:
df.dtypes

In [None]:
df.describe(include='all')

The dataset has 41,544 samples and 59 variables. Most are integer-valued, but some are strings (categorical). For the purposes of this project, only the numerical (int64) columns are considered.
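The shape and the mix of column types quoted above can be checked directly. The sketch below uses a small hypothetical frame (`toy`, with invented column names and values) rather than the real CSV, and shows `shape`, a per-dtype column count, and the effect of `select_dtypes(include=['int64'])`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: two int64 columns and one string column
toy = pd.DataFrame({
    'income': np.array([120000, 85000, 430000], dtype='int64'),
    'food_expenditure': np.array([60000, 41000, 98000], dtype='int64'),
    'region': ['NCR', 'CAR', 'NCR'],
})

n_rows, n_cols = toy.shape                # (samples, variables)
dtype_counts = toy.dtypes.value_counts()  # number of columns per dtype
numeric_only = toy.select_dtypes(include=['int64'])

print(n_rows, n_cols)
print(dtype_counts)
print(list(numeric_only.columns))
```

Note that `include=['int64']` also drops any float columns; `include='number'` would keep every numeric dtype.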

**Exploratory Data Analysis**

In [None]:
df = df.select_dtypes(include=['int64'])
df.sample(7)
# print(df.dtypes)

In [None]:
# Plot setup for uniformity
# sns.set() resets seaborn to its defaults, so the style and context are passed to it directly
sns.set(style='whitegrid', context='paper')
figsize = (30, 14)

In [None]:
print(df[target].describe())

Plotting the histogram of Total Household Income shows that incomes are heavily concentrated at the lower end, with a few high-income outliers.

In [None]:
plt.hist(df[target], bins = 200)
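The concentration at the low end can be quantified with the sample skewness, and a log transform is one common way to view such a distribution. The sketch below uses synthetic log-normal values as a stand-in for the income column; the distribution and its parameters are assumptions, not properties of the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "incomes": an assumed log-normal shape, not the real data
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=12.0, sigma=1.0, size=10_000))

raw_skew = income.skew()            # strongly positive for a long right tail
log_skew = np.log1p(income).skew()  # close to zero once the tail is compressed

print(f"skew (raw): {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

On the real data, the same two lines could be applied to `df[target]` to see how far the income distribution departs from symmetry.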

In [None]:
# Compute the correlation matrix
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
# Absolute correlation of every column with the target
cor_target = corr[target].abs()
# Keep columns with |r| > 0.5; the target itself (|r| = 1) is retained as well
relevant_features = cor_target[cor_target > 0.5]
relevant_features

In [None]:
df = df[relevant_features.keys()]
df.sample(5)

In [None]:
x = df.drop(columns = [target])
y = df[target]

x.sample(5)

In [None]:
# Standardize each feature to zero mean and unit variance (z-score)
x = (x - x.mean()) / x.std()
x.sample(5)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.1, random_state=214)
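The standardization above uses the mean and standard deviation of the full dataset, so the test rows influence the scaling. A minimal sketch of an alternative, on synthetic data, fits scikit-learn's `StandardScaler` on the training split only and reuses it for the test split. Tree-based models split on thresholds and are largely insensitive to this kind of per-feature rescaling, so the step mainly matters if other model types are added later.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(214)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=214)

# Learn the mean and standard deviation from the training rows only
# (StandardScaler uses the population std, ddof=0, unlike pandas' default ddof=1)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The test-set means come out close to zero but not exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.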

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state = 214)

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state = 214)

# Fit the single decision tree and compute its RMSE on the held-out test set
tree_reg.fit(X_train, y_train)
y_pred = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

In [None]:
# Fit the random forest (default hyperparameters) and compute its test RMSE
forest_reg.fit(X_train, y_train)
y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
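A single train/test split gives one RMSE per model, which can depend on the particular split. As a rough sketch of a more stable comparison, the two regressors can be scored with 5-fold cross-validation. The data here comes from `make_regression` and is purely synthetic, so the numbers say nothing about the income dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=214)

models = {
    'tree': DecisionTreeRegressor(random_state=214),
    'forest': RandomForestRegressor(n_estimators=50, random_state=214),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so MSE is reported negated
    neg_mse = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    cv_rmse[name] = np.sqrt(-neg_mse)  # one RMSE per fold
    print(name, cv_rmse[name].mean().round(2), cv_rmse[name].std().round(2))
```

Reporting the mean and spread across folds makes it easier to judge whether a gap between the two models is larger than the split-to-split variation.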

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [{'n_estimators': [3,10,30], 'max_features':[6,8,10,12]}]
grid_search = GridSearchCV(forest_reg, param_grid, cv = 5,
                          scoring='neg_mean_squared_error',
                          return_train_score = True)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_estimator_


In [None]:
y_pred = grid_search.predict(X_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
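Because the grid search is scored with `neg_mean_squared_error`, its scores are negative MSE values. The sketch below, again on synthetic data with a smaller hypothetical grid, converts the cross-validated scores to RMSE and lists them per parameter combination. With the default `refit=True`, `GridSearchCV` refits the best combination on the whole training set, which is why `grid_search.predict` can be called directly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=214)

# Hypothetical, smaller grid than the one used above
param_grid = {'n_estimators': [3, 10], 'max_features': [2, 4, 6]}
search = GridSearchCV(RandomForestRegressor(random_state=214), param_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X, y)

# mean_test_score holds negative MSE; flip the sign and take the square root
results = pd.DataFrame(search.cv_results_['params'])
results['cv_rmse'] = np.sqrt(-search.cv_results_['mean_test_score'])
print(results.sort_values('cv_rmse'))
print(search.best_params_)
```

The cross-validated RMSE in this table is computed on training folds only, so it is a separate estimate from the test-set RMSE reported above.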

In [None]:
# Hyperparameters belong on the estimator, not in fit()
forest_reg.set_params(n_estimators=30, max_features=8).fit(X_train, y_train)