## Introduction
* This notebook aims to profile the types of families with high and low spending on education.
* Multiple linear and polynomial regression models are built, and R² (the coefficient of determination) is used to evaluate them.
* The initial model, with income as the only feature, scores an R² of 0.16662.
* The data is cleaned by restricting the income range and removing families with no school-age children.
* The final model scores 0.31158, meaning the chosen features explain only about 31% of the variance in education spending, which is a weak relationship.
* Other features tested include total food spending, medical spending, and spending on specific food categories.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Importing the dataset

In [None]:
dataset = pd.read_csv('../input/family-income-and-expenditure/Family Income and Expenditure.csv')
X = dataset.iloc[:, [0, 12, 13, 26]].values # Feature: total income, alcohol spending, tobacco spending, age of head of household
y = dataset.iloc[:, [20]].values # Total education expenditure
s = dataset.iloc[:, [35]].values # Number of school-age children

## Removing high-income households
* Most households in the dataset have a total income below PHP 5,000,000, so households above this cutoff are treated as outliers and removed.
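
As a minimal sketch of this kind of filtering, assuming a tiny made-up income array (the `toy_*` values are hypothetical and not taken from the dataset), a boolean mask can both measure how much of the data lies under the cutoff and keep parallel arrays aligned:

```python
import numpy as np

# Hypothetical toy data for illustration only; not from the real dataset
toy_income = np.array([120_000, 250_000, 480_000, 900_000, 6_500_000])
toy_spending = np.array([1_000, 3_000, 2_500, 8_000, 90_000])
cutoff = 5_000_000

# True where a household's income is at or below the cutoff
within_cutoff = toy_income <= cutoff

# Share of households that survive the filter
share_kept = within_cutoff.mean()

# Apply the same mask to every array so the rows stay aligned
toy_income_kept = toy_income[within_cutoff]
toy_spending_kept = toy_spending[within_cutoff]

print(share_kept, toy_income_kept.shape, toy_spending_kept.shape)
```

Checking the share of rows kept before filtering is one way to confirm that a cutoff removes only a small tail rather than a large part of the data.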

In [None]:
cutoff = 5000000
# Boolean mask: True for households whose total income (column 0 of X) exceeds the cutoff
income_outlier = X[:, 0] > cutoff
# Drop the same rows from all three arrays so they stay aligned
X = X[~income_outlier]
y = y[~income_outlier]
s = s[~income_outlier]

## Removing households with no school-age children
* Households with no school-age children do not need to be accounted for, so they are removed.

In [None]:
# Boolean mask: True for households with zero school-age children
children_outlier = s[:, 0] == 0
# Drop the same rows from all three arrays so they stay aligned
X = X[~children_outlier]
y = y[~children_outlier]
s = s[~children_outlier]

## Splitting the dataset into the Training set and Test set
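
The sketch below illustrates what an 80/20 split with a fixed `random_state` does, using small hypothetical arrays (`toy_X`, `toy_y`) rather than the real dataset. It is only meant to show the resulting shapes and that the same seed reproduces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 10 rows, 4 features, 1 target column
toy_X = np.arange(40).reshape(10, 4)
toy_y = np.arange(10).reshape(10, 1)

# Two splits with the same seed should be identical
split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
print(toy_X_train.shape, toy_X_test.shape)

# True when every returned array matches between the two runs
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```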

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Polynomial Regression model on the Training set
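
A degree-2 polynomial expansion of four features produces 15 columns: one constant term, the four original features, their four squares, and six pairwise products. The sketch below checks this on a small hypothetical array (`toy_X` is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: 3 rows, 4 features
toy_X = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.5, 1.5, 2.5, 3.5],
                  [2.0, 0.0, 1.0, 3.0]])

toy_poly = PolynomialFeatures(degree=2)
toy_X_poly = toy_poly.fit_transform(toy_X)

# Number of rows is unchanged; number of columns grows to 15
print(toy_X_poly.shape)
print(toy_poly.n_output_features_)
```

Because the number of columns grows quickly with the degree and the number of features, a linear regression fitted on the expanded matrix has many more coefficients than one fitted on the raw features.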

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 2)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)

## Predicting the Test set results

In [None]:
y_pred = regressor.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

## Evaluating the Model Performance
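
R² compares the model's squared error with that of always predicting the mean of the target: R² = 1 - SS_res / SS_tot. The sketch below, using hypothetical toy values rather than the notebook's predictions, computes it by hand and compares it with `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy targets and predictions for illustration only
toy_true = np.array([3.0, -0.5, 2.0, 7.0])
toy_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error of the model's predictions
ss_res = np.sum((toy_true - toy_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)

manual_r2 = 1.0 - ss_res / ss_tot
library_r2 = r2_score(toy_true, toy_pred)

# A model that always predicts the mean scores 0
mean_r2 = r2_score(toy_true, np.full_like(toy_true, toy_true.mean()))

print(manual_r2, library_r2, mean_r2)
```

Read this way, an R² of 0 means the model is no better than predicting the mean, and 1 means a perfect fit on the evaluated data.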

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)