# Income Prediction ML - without AWI adjustment

## Goal
Predict income in year t+2 from year t data, **without AWI adjustment** of any incomes:
- 2011 data to predict 2013  
- 2012 data to predict 2014  
- 2013 data to predict 2015  

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
# Load data
income = pd.read_csv("Income_Home_Prices_ZIP.csv")

In [3]:
# DataFrame column names
income.columns

Index(['ZIP Code', 'Borough', 'Neighborhood', 'Year', 'Median HH Income ($)',
       'Mean HH Income ($)', 'Median Home Value ($/sq. foot)', '% Employed',
       '% Unemployed', '% Not in Labor Force', 'Bordering Water',
       'Number of Subway Stations in ZIP', 'Stops in ZIP',
       'Number of Subway Lines Serving ZIP', 'Lines Serving ZIP',
       'Number of Parks', 'Number of Playgrounds', 'Park Acreage',
       'LandSqMile', 'Latitude', 'Longitude', 'adjacentZIP'],
      dtype='object')

In [5]:
# Rename zip code column
income = income.rename(columns = {'ZIP Code':'ZIPCODE'})

In [10]:
# Convert currency strings ("$", ",") and percentage strings ("%") to numeric values
formatsign = lambda x: float(x.replace("$","").replace(",",""))
formatpercent = lambda x: float(x.replace("%",""))

income['MedianIncome'] = income['Median HH Income ($)'].map(formatsign)
income['HomeValue'] = income['Median Home Value ($/sq. foot)'].map(formatsign)

In [11]:
# Create Dataframe per year

income2011 = income[income.Year == 2011]
income2012 = income[income.Year == 2012]
income2013 = income[income.Year == 2013]
income2014 = income[income.Year == 2014]
income2015 = income[income.Year == 2015]

In [12]:
# Calculate mean of surrounding areas median income
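
The per-year functions in the following cells differ only in which year's frame they search. As a sketch of a more compact alternative, a single helper could take the frame as an argument. The name `neighbor_avg` and the toy data are hypothetical, and the sketch assumes one row per ZIP code per year; unlike the original functions, it returns NaN rather than raising when no neighbor is found.

```python
import pandas as pd

def neighbor_avg(zips, year_df):
    """Mean MedianIncome over the comma-separated ZIP codes present in year_df."""
    wanted = [int(z) for z in zips.split(",")]
    # Keep only the neighbors that actually have a row in this year's data.
    found = year_df.loc[year_df["ZIPCODE"].isin(wanted), "MedianIncome"]
    return round(found.mean(), 2)

# Toy stand-in for one year's frame (values are made up for illustration).
toy2011 = pd.DataFrame({
    "ZIPCODE": [10001, 10002, 10003],
    "MedianIncome": [50000.0, 40000.0, 66000.0],
    "adjacentZIP": ["10002,10003", "10001,99999", "10001"],  # 99999 has no row
})

toy2011["MedianIncomeAdj"] = toy2011["adjacentZIP"].map(
    lambda zips: neighbor_avg(zips, toy2011)
)
print(toy2011[["ZIPCODE", "MedianIncomeAdj"]])
```

The same helper could then be mapped over each year's frame in a loop instead of defining five separate functions.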

In [13]:
def neighbor_avg2011(zips):
    zips_list = [int(z) for z in zips.split(",")]
    total = 0
    n = 0

    for zz in zips_list:
        try:
            neighbor = income2011[income2011.ZIPCODE == zz]['MedianIncome'].tolist()[0]
            total += neighbor
            n += 1
        except IndexError:  # neighbor ZIP has no row in this year's data
            pass
    
    avg = round(total/n,2)
    
    return avg

# Calculate average of adjacent areas income
income2011['MedianIncomeAdj'] = income2011['adjacentZIP'].map(neighbor_avg2011)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
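
The warning above appears because the per-year frame was produced by filtering `income`, so pandas cannot tell whether the new column is meant to be written to a copy or to the original. One common way to avoid it, sketched here on a small made-up frame, is to take an explicit `.copy()` when creating the per-year frame. The names `toy` and `toy2011` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `income` table (values are made up for illustration).
toy = pd.DataFrame({
    "ZIPCODE": [10001, 10002, 10001, 10002],
    "Year": [2011, 2011, 2012, 2012],
    "MedianIncome": [50000.0, 40000.0, 52000.0, 41000.0],
})

# .copy() makes the per-year frame independent of `toy`, so adding a column
# to it is unambiguous and does not raise SettingWithCopyWarning.
toy2011 = toy[toy.Year == 2011].copy()
toy2011["MedianIncomeAdj"] = [40000.0, 50000.0]

print(toy2011)
print("MedianIncomeAdj" in toy.columns)  # the original frame is left untouched
```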


In [14]:
def neighbor_avg2012(zips):
    zips_list = [int(z) for z in zips.split(",")]
    total = 0
    n = 0

    for zz in zips_list:
        try:
            neighbor = income2012[income2012.ZIPCODE == zz]['MedianIncome'].tolist()[0]
            total += neighbor
            n += 1
        except IndexError:  # neighbor ZIP has no row in this year's data
            pass
    
    avg = round(total/n,2)
    
    return avg

# Calculate average of adjacent areas income
income2012['MedianIncomeAdj'] = income2012['adjacentZIP'].map(neighbor_avg2012)



In [15]:
def neighbor_avg2013(zips):
    zips_list = [int(z) for z in zips.split(",")]
    total = 0
    n = 0

    for zz in zips_list:
        try:
            neighbor = income2013[income2013.ZIPCODE == zz]['MedianIncome'].tolist()[0]
            total += neighbor
            n += 1
        except IndexError:  # neighbor ZIP has no row in this year's data
            pass
    
    avg = round(total/n,2)
    
    return avg

# Calculate average of adjacent areas income
income2013['MedianIncomeAdj'] = income2013['adjacentZIP'].map(neighbor_avg2013)



In [16]:
def neighbor_avg2014(zips):
    zips_list = [int(z) for z in zips.split(",")]
    total = 0
    n = 0

    for zz in zips_list:
        try:
            neighbor = income2014[income2014.ZIPCODE == zz]['MedianIncome'].tolist()[0]
            total += neighbor
            n += 1
        except IndexError:  # neighbor ZIP has no row in this year's data
            pass
    
    avg = round(total/n,2)
    
    return avg

# Calculate average of adjacent areas income
income2014['MedianIncomeAdj'] = income2014['adjacentZIP'].map(neighbor_avg2014)



In [17]:
def neighbor_avg2015(zips):
    zips_list = [int(z) for z in zips.split(",")]
    total = 0
    n = 0

    for zz in zips_list:
        try:
            neighbor = income2015[income2015.ZIPCODE == zz]['MedianIncome'].tolist()[0]
            total += neighbor
            n += 1
        except IndexError:  # neighbor ZIP has no row in this year's data
            pass
    
    avg = round(total/n,2)
    
    return avg

# Calculate average of adjacent areas income
income2015['MedianIncomeAdj'] = income2015['adjacentZIP'].map(neighbor_avg2015)



# Income Prediction Random Forest Regressor

In [18]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

### A) Data pre-processing

NOT USED

| Year | AWI | Increase (from previous year) |
|------|-----------|-------|
| 2015 | 48,098.63 | 3.48% |
| 2014 | 46,481.52 | 3.55% |
| 2013 | 44,888.16 | 1.28% |
| 2012 | 44,321.67 | 3.12% |
| 2011 | 42,979.61 | 3.13% |
| 2010 | 41,673.83 | 2.36% |

In [28]:
# Remove the wage increase with reference year 2011

#income2011['MedianIncomeDisc'] = income2011['MedianIncome']
#income2012['MedianIncomeDisc'] = income2012['MedianIncome'] / (1.0312)
#income2013['MedianIncomeDisc'] = income2013['MedianIncome'] / ((1.0312)*(1.0128))
#income2014['MedianIncomeDisc'] = income2014['MedianIncome'] / ((1.0312)*(1.0128)*(1.0355))
#income2015['MedianIncomeDisc'] = income2015['MedianIncome'] / ((1.0312)*(1.0128)*(1.0355)*(1.0348))
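
Although this adjustment is not used in this notebook, the discount factors in the commented-out lines can be cross-checked against the AWI values listed above: the cumulative product of the yearly increases should match the ratio of a year's AWI to the 2011 AWI. The following is a small sketch of that check; the names `awi` and `deflator` are introduced here for illustration only.

```python
# AWI values copied from the listing above.
awi = {
    2011: 42979.61,
    2012: 44321.67,
    2013: 44888.16,
    2014: 46481.52,
    2015: 48098.63,
}

# Deflator relative to 2011: dividing a year's income by this factor
# expresses it in 2011 wage terms.
deflator = {year: value / awi[2011] for year, value in awi.items()}

# Cumulative product of the published yearly increases, as in the cell above.
product_2015 = 1.0312 * 1.0128 * 1.0355 * 1.0348

for year in sorted(deflator):
    print(year, round(deflator[year], 4))
print("2015 via yearly increases:", round(product_2015, 4))
```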

In [20]:
# Prediction of income at t+2 - create output variable income t+2
# (.values assigns by position, so each year's frame must list the same ZIPs in the same order)

income2011['MedianIncome_in2yrs'] = income2013['MedianIncome'].values
income2012['MedianIncome_in2yrs'] = income2014['MedianIncome'].values
income2013['MedianIncome_in2yrs'] = income2015['MedianIncome'].values
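
Assigning `.values` relies on every year's frame containing the same ZIP codes in the same row order, which the notebook does not verify. A possible alternative, sketched below on made-up data, is to join the future income on the ZIP code so that alignment is explicit. The frame and column handling here are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy stand-in for `income`; note the 2013 rows are in a different ZIP order.
toy = pd.DataFrame({
    "ZIPCODE": [10001, 10002, 10002, 10001],
    "Year": [2011, 2011, 2013, 2013],
    "MedianIncome": [50000.0, 40000.0, 43000.0, 55000.0],
})

base = toy[toy.Year == 2011]
future = (
    toy.loc[toy.Year == 2013, ["ZIPCODE", "MedianIncome"]]
    .rename(columns={"MedianIncome": "MedianIncome_in2yrs"})
)

# validate="one_to_one" raises if a ZIP code appears more than once per year.
aligned = base.merge(future, on="ZIPCODE", how="left", validate="one_to_one")
print(aligned)
```

With a left join, a ZIP code that is missing in the later year gets NaN as its target instead of silently shifting the other rows.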



In [29]:
# Create dataset for ML training

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
X = pd.concat([income2011, income2012, income2013])

In [30]:
# Extra data processing

def yesno(x):
    if x == "Y":
        return 1
    else:
        return 0

X['% Employed'] = X['% Employed'].map(formatpercent)
X['Bordering Water'] = X['Bordering Water'].map(yesno)
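# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
X['Number of Subway Lines Serving ZIP'] = X['Number of Subway Lines Serving ZIP'].fillna(0)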

not_features = ['Borough', 'Neighborhood','LandSqMile', 'Latitude',
       'Longitude','Median HH Income ($)',
       'Mean HH Income ($)', 'Median Home Value ($/sq. foot)',
       '% Unemployed', '% Not in Labor Force',
       'Number of Subway Stations in ZIP', 'Stops in ZIP','Lines Serving ZIP',
       'adjacentZIP','HomeValue']

for notf in not_features:
    del X[notf]
    
X.columns

Index(['ZIPCODE', 'Year', '% Employed', 'Bordering Water',
       'Number of Subway Lines Serving ZIP', 'Number of Parks',
       'Number of Playgrounds', 'Park Acreage', 'MedianIncome',
       'MedianIncomeAdj', 'MedianIncome_in2yrs'],
      dtype='object')

In [31]:
# Shuffle data
X = X.sample(frac=1, random_state=0).reset_index(drop=True)

# Create labels
Y = [i for i in X.MedianIncome_in2yrs]
Year = [i for i in X.Year]
ZIP = [i for i in X.ZIPCODE]

# Drop labels from input data
X.drop('MedianIncome_in2yrs', axis=1, inplace=True)
X.drop('Year', axis=1, inplace=True)
X.drop('ZIPCODE', axis=1, inplace=True)

n = int(X.shape[0] * 4/5)
X_train = X[:n]
Y_train = Y[:n]
Year_train = Year[:n]
ZIP_train = ZIP[:n]
X_test = X[n:]
Y_test = Y[n:]
Year_test = Year[n:]
ZIP_test = ZIP[n:]
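
The manual shuffle and 80/20 slice above can also be expressed with scikit-learn's `train_test_split`. The sketch below uses made-up data and hypothetical variable names; because the shuffling differs from `X.sample(frac=1, random_state=0)`, it would not reproduce exactly the same rows in each split as the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (random values, for illustration only).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "MedianIncome": rng.uniform(20000, 150000, size=50),
    "MedianIncomeAdj": rng.uniform(20000, 150000, size=50),
})
toy_Y = toy_X["MedianIncome"] * 1.05

# Shuffle and hold out 20% of the rows for testing in one call.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    toy_X, toy_Y, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Additional arrays such as `Year` and `ZIP` can be passed to the same call so that they are split consistently with the features.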

In [32]:
print(X_train.shape)
print(len(Y_train))
print(len(Year_train))

(420, 8)
420
420


### B) ML algorithms

#### B0 - Baseline

In [33]:
# Baseline - predicted t+2 = income at t
Y1 = np.array(X_test['MedianIncome'].tolist())  # pred
Y2 = Y_test  # actual

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y2, Y1))


# The MAPE (mean absolute percentage error)
print("Mean absolute percentage error:", np.mean(np.abs((Y2 - Y1) / Y2)) * 100)


# R^2 (coefficient of determination): 1 is a perfect prediction
print('Variance score: %.6f' % r2_score(Y2, Y1))

Mean squared error: 21918461.30
Mean absolute percentage error: 4.35415051021
Variance score: 0.974282
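
The same three metrics are recomputed for every model below, so they could be wrapped in one helper. This is a minimal sketch with a hypothetical function name, `report_metrics`; it converts the inputs to arrays first so that plain Python lists such as `Y_test` work in the MAPE formula, mean(|actual - predicted| / actual) × 100.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def report_metrics(y_true, y_pred):
    """Print and return MSE, MAPE (in percent) and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    r2 = r2_score(y_true, y_pred)
    print("Mean squared error: %.2f" % mse)
    print("Mean absolute percentage error: %.4f" % mape)
    print("R^2: %.6f" % r2)
    return mse, mape, r2

# Tiny made-up example: actual values 100 and 200, predictions 110 and 190.
mse, mape, r2 = report_metrics([100.0, 200.0], [110.0, 190.0])
```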


#### B1 - Random Forest

In [34]:
# 1. Random Forest Regression

# The default split criterion is mean squared error.
# A depth-8 tree has at most 2^8 = 256 leaves (vs. 174 ZIPs?)
rf = RandomForestRegressor(n_estimators=500, max_depth=8, random_state=0)
rf.fit(X_train,Y_train)
# Index of the feature with the largest importance
print('Best feature is ' + str(np.argmax(rf.feature_importances_)))
preds_rf = rf.predict(X_test)

# 1 - |mean signed error| / target range: measures bias only,
# because positive and negative errors cancel out
performance = 1 - abs(np.mean(preds_rf - Y_test)) / (max(Y) - min(Y))
print('Performance is ', performance)


# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, preds_rf))

# The MAPE (mean absolute percentage error)
print("Mean absolute percentage error:", np.mean(np.abs((Y_test - preds_rf) / Y_test)) * 100)

# R^2 (coefficient of determination): 1 is a perfect prediction (same measure as the .score method)
print('Variance score: %.6f' % r2_score(Y_test, preds_rf))

Best feature is 6
Performance is  0.999124783355
Mean squared error: 13798533.02
Mean absolute percentage error: 3.60257068705
Variance score: 0.983809


In [35]:
print(X_train.columns)
print(rf.feature_importances_)

Index(['% Employed', 'Bordering Water', 'Number of Subway Lines Serving ZIP',
       'Number of Parks', 'Number of Playgrounds', 'Park Acreage',
       'MedianIncome', 'MedianIncomeAdj'],
      dtype='object')
[  5.65640519e-03   2.37826934e-04   5.29525367e-03   1.34487097e-03
   3.51605908e-04   2.34407125e-03   9.76907985e-01   7.86198076e-03]
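
The two printouts above are easier to read when each importance is paired with its column name. A short sketch, reusing the values printed above rather than refitting the model:

```python
import pandas as pd

# Column names and importances copied from the output above.
feature_names = [
    "% Employed", "Bordering Water", "Number of Subway Lines Serving ZIP",
    "Number of Parks", "Number of Playgrounds", "Park Acreage",
    "MedianIncome", "MedianIncomeAdj",
]
importances = [
    5.65640519e-03, 2.37826934e-04, 5.29525367e-03, 1.34487097e-03,
    3.51605908e-04, 2.34407125e-03, 9.76907985e-01, 7.86198076e-03,
]

# Sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same series could be built from `rf.feature_importances_` and `X_train.columns`. Current-year `MedianIncome` accounts for almost all of the importance, which is consistent with how close the baseline already is.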


In [36]:
# Stack as four rows: Year, ZIP, actual, predicted (one column per test sample)
results = np.vstack((Year_test, ZIP_test, Y_test, preds_rf))
np.savetxt("RF_noAWIadj.csv", results, delimiter=",")
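
The file written above has one row per quantity and one column per test sample, with no header. If a one-row-per-sample layout with named columns is preferred, a pandas-based sketch could look like the following. The column names and toy values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for Year_test, ZIP_test, Y_test and preds_rf (made-up values).
results = pd.DataFrame({
    "Year": [2011, 2012],
    "ZIPCODE": [10001, 10002],
    "actual": [50000.0, 40000.0],
    "predicted": [51000.5, 39500.0],
})

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = results.to_csv(index=False)
print(csv_text)
```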

#### B2 - Linear Regression

In [37]:
# 2. Linear Regression

lr = LinearRegression()
lr.fit(X_train, Y_train)

preds_lr = lr.predict(X_test)


# The coefficients
print(X_train.columns)
print('Intercept: \n', lr.intercept_)
print('Coefficients: \n', lr.coef_)
print("")

# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(Y_test, preds_lr))

# The MAPE (mean absolute percentage error)
print("Mean absolute percentage error:", np.mean(np.abs((Y_test - preds_lr) / Y_test)) * 100)

# R^2 (coefficient of determination): 1 is a perfect prediction (same measure as the .score method)
print('Variance score: %.6f' % r2_score(Y_test, preds_lr))

Index(['% Employed', 'Bordering Water', 'Number of Subway Lines Serving ZIP',
       'Number of Parks', 'Number of Playgrounds', 'Park Acreage',
       'MedianIncome', 'MedianIncomeAdj'],
      dtype='object')
Intercept: 
 354.292271095
Coefficients: 
 [ -4.96177165e+01  -9.56199478e-01   7.07060725e+02  -2.46576580e+01
  -1.81356170e+02  -1.58989341e-03   1.04909522e+00   4.56363423e-03]

Mean squared error: 12675394.00
Mean absolute percentage error: 4.08103688296
Variance score: 0.985127
