# Data imports and new feature generation

## Intro:

In this notebook, we (1.1) import the training data ('properties'), the training labels ('train'), and the sample submission file ('sample_submission'). We prepare a training dataframe (train_df) and then (1.2) generate potentially relevant extra features.

In the next notebook (2), we will clean this training dataframe and finish by splitting it into the usual six pieces (x_train, y_train, x_valid, y_valid, x_test, y_test).

In [38]:
import pandas as pd
import numpy as np
import timeit

In [8]:
np.random.seed(1234)

In [49]:
# Read data from saved pickle files.
# start_time = timeit.default_timer()
properties = pd.read_pickle('Data/properties')
train = pd.read_pickle('Data/train')
sample_submission = pd.read_pickle('Data/sample_submission')
# elapsed = timeit.default_timer() - start_time
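
The commented-out timer lines can be switched on to measure how long a load takes. A minimal sketch of that pattern, assuming a tiny throwaway frame written to a temporary directory (the `demo` frame and its path are illustrative stand-ins, not the competition files):

```python
import os
import tempfile
import timeit

import pandas as pd

# Hypothetical stand-in for the real pickles: a tiny frame in a temp directory.
demo = pd.DataFrame({"parcelid": [1, 2, 3], "taxamount": [100.0, 250.0, None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_properties")
    demo.to_pickle(path)

    # Same timing pattern as the commented-out lines in the cell above.
    start_time = timeit.default_timer()
    loaded = pd.read_pickle(path)
    elapsed = timeit.default_timer() - start_time

print(f"Loaded {loaded.shape[0]} rows in {elapsed:.4f} s")
```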

Let's look at the df heads to get a quick sense of what's there:

In [28]:
properties.head(1)

Unnamed: 0,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,10754147,,,,0.0,0.0,,,,,...,,,,9.0,2015.0,9.0,,,,


In [29]:
train.head(1)

Unnamed: 0,parcelid,logerror,transactiondate
0,11016594,0.0276,2016-01-01


In [30]:
sample_submission.head(1)

Unnamed: 0,ParcelId,201610,201611,201612,201710,201711,201712
0,10754147,0,0,0,0,0,0


So, to be clear: the `properties` df holds data on a large number of properties, and the `train` df holds, for each transaction, the transaction date and the [logerror](https://www.kaggle.com/c/zillow-prize-1/data) between the Zillow prediction and the actual transaction price.
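
For concreteness, the competition defines logerror as log(Zestimate) - log(SalePrice). A minimal sketch with made-up prices, assuming natural logarithms (the numbers are illustrative and not taken from the dataset):

```python
import numpy as np

# Made-up values for illustration only.
zestimate = np.array([310_000.0, 450_000.0, 200_000.0])   # Zillow's predicted price
sale_price = np.array([300_000.0, 450_000.0, 220_000.0])  # actual transaction price

# Assumption: natural logarithm.
logerror = np.log(zestimate) - np.log(sale_price)

# Positive means the prediction was too high, negative too low, zero exact.
print(np.round(logerror, 4))
```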

## Feature generation

Add some more features using this helpful Kaggle [kernel](https://www.kaggle.com/nikunjm88/creating-additional-features?scriptVersionId=1379783), mainly by [Nikunj](https://www.kaggle.com/nikunjm88).

In [45]:
# Work on a copy so the new columns are not also added to `properties`.
updated_properties = properties.copy()
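
One thing worth keeping in mind here: plain assignment binds a second name to the same DataFrame, whereas `.copy()` creates an independent one. A small sketch on a toy frame (all names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2]})

alias = toy                # second name for the same object
independent = toy.copy()   # separate object with its own data

# Adding a column through the alias also changes `toy`.
alias["b"] = alias["a"] * 10

print(alias is toy)                 # same object?
print("b" in toy.columns)           # did the original gain the column?
print("b" in independent.columns)   # did the copy gain the column?
```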

In [46]:
# Rankings according to Nikunj XGB importance f_score
updated_properties['N-LivingAreaProp'] = updated_properties['calculatedfinishedsquarefeet']/updated_properties['lotsizesquarefeet'] #1
updated_properties['N-ValueRatio'] = updated_properties['taxvaluedollarcnt']/updated_properties['taxamount'] #2
updated_properties['N-ValueProp'] = updated_properties['structuretaxvaluedollarcnt']/updated_properties['landtaxvaluedollarcnt'] #3
updated_properties["N-location"] = updated_properties["latitude"] + updated_properties["longitude"] #4
updated_properties["N-location-2"] = updated_properties["latitude"]*updated_properties["longitude"] #5
updated_properties['N-ExtraSpace'] = updated_properties['lotsizesquarefeet'] - updated_properties['calculatedfinishedsquarefeet'] #6

zip_count = updated_properties['regionidzip'].value_counts().to_dict()
updated_properties['N-zip_count'] = updated_properties['regionidzip'].map(zip_count) #7
updated_properties['N-TaxScore'] = updated_properties['taxvaluedollarcnt']*updated_properties['taxamount'] #8

group = updated_properties.groupby('regionidcity')['structuretaxvaluedollarcnt'].aggregate('mean').to_dict()
updated_properties['N-Avg-structuretaxvaluedollarcnt'] = updated_properties['regionidcity'].map(group) #9
updated_properties["N-structuretaxvaluedollarcnt-2"] = updated_properties["structuretaxvaluedollarcnt"] ** 2 #10
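
The two group-based features (#7 and #9) build a lookup dict and then `map` it onto the key column. `groupby(...).transform(...)` is an alternative that broadcasts the group statistic back to each row directly. A sketch on toy data with no missing keys (behaviour for missing keys is not covered here):

```python
import pandas as pd

toy = pd.DataFrame({
    "regionidzip": [96001, 96001, 96002, 96001],
    "regionidcity": [10, 10, 20, 20],
    "structuretaxvaluedollarcnt": [100.0, 300.0, 50.0, 150.0],
})

# Pattern used above: build a lookup dict, then map it onto the key column.
zip_count = toy["regionidzip"].value_counts().to_dict()
count_via_map = toy["regionidzip"].map(zip_count)
city_mean = toy.groupby("regionidcity")["structuretaxvaluedollarcnt"].mean().to_dict()
mean_via_map = toy["regionidcity"].map(city_mean)

# Alternative: transform returns one value per original row.
count_via_transform = toy.groupby("regionidzip")["regionidzip"].transform("count")
mean_via_transform = toy.groupby("regionidcity")["structuretaxvaluedollarcnt"].transform("mean")

print(count_via_map.tolist(), count_via_transform.tolist())
print(mean_via_map.tolist(), mean_via_transform.tolist())
```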

The kernel defines more features, but these 10 are enough for now. Let's stop here, recognizing that adding more features could boost performance later.
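
One caveat for the ratio features (#1 to #3): float division in pandas yields `inf` where the denominator is 0 and `NaN` where either input is missing. A hedged sketch on a toy frame, assuming one would want to map infinities to `NaN` before modelling (whether the real data contains zero denominators is not checked here):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "calculatedfinishedsquarefeet": [1500.0, 2000.0, 900.0, np.nan],
    "lotsizesquarefeet": [6000.0, 0.0, 4500.0, 5000.0],
})

# Same construction as N-LivingAreaProp, on toy values.
ratio = toy["calculatedfinishedsquarefeet"] / toy["lotsizesquarefeet"]

# Division by zero gives inf rather than raising; turn it into NaN.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(pd.DataFrame({"raw": ratio, "cleaned": cleaned}))
```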

For reference, the original and updated shapes are:

In [50]:
print(properties.shape)

(2985217, 58)


In [48]:
print(updated_properties.shape)

(2985217, 68)
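
The difference of 10 columns matches the 10 added features. One way to list exactly which columns were added is a set difference on the column names. A sketch on toy frames (the frames and the `N-demo` column are hypothetical):

```python
import pandas as pd

before = pd.DataFrame({"parcelid": [1, 2], "taxamount": [10.0, 20.0]})
after = before.copy()
after["N-demo"] = after["taxamount"] * 2

# Columns present in `after` but not in `before`.
new_columns = after.columns.difference(before.columns).tolist()

print(new_columns)
print(after.shape[1] - before.shape[1])
```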


## Export new features df for next steps:

In [44]:
updated_properties.to_pickle('updated_properties')
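
The intro mentions a training dataframe (`train_df`), which the cells above do not build explicitly. A minimal sketch of one plausible construction, assuming a left join of the labels onto the features by `parcelid` (toy data; the join type is an assumption, not something the notebook confirms):

```python
import pandas as pd

# Toy stand-ins for the exported features and the training labels.
toy_properties = pd.DataFrame({
    "parcelid": [101, 102, 103],
    "taxamount": [1200.0, 3400.0, 560.0],
})
toy_train = pd.DataFrame({
    "parcelid": [102, 101],
    "logerror": [0.0276, -0.0100],
    "transactiondate": ["2016-01-01", "2016-01-02"],
})

# Keep every labelled transaction and attach its property features.
train_df = toy_train.merge(toy_properties, how="left", on="parcelid")

print(train_df)
```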

# Works Cited:

- https://www.kaggle.com/infinitewing/xgboost-without-outliers-lb-0-06463