# Data-Fixing with Zillow Dataset
Useful Links:
- [Dataframe merging](https://pandas.pydata.org/pandas-docs/stable/merging.html)
- [Pipe-separated values to df](https://stackoverflow.com/questions/20949955/changing-pipe-separated-data-to-dataframe-in-python-pandas)
- [Storing and reading pickles for storage of "good" Dataframes](https://stackoverflow.com/questions/17098654/how-to-store-a-dataframe-using-pandas)

In [85]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import operator
import sys
import os
sys.path.insert(0, '../')  # make the parent folder importable before importing utils
from utils import zillow_helpers
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Set the data folders here. Be sure to add your actual data folders to `.gitignore` so they are not pushed to GitHub.

In [37]:
datadir1 = 'data/46/ZAsmt/'
datadir2 = 'data/46/ZTrans/'
datadir3 = 'data/50/ZAsmt/'
datadir4 = 'data/50/ZTrans/'

List the text files in the folder. The list is useful later for iterating over every table!

In [61]:
os.listdir(datadir1)

['AdditionalPropertyAddress.txt',
 'BKManagedSpecific.txt',
 'Building.txt',
 'BuildingAreas.txt',
 'CareOfName.txt',
 'ExteriorWall.txt',
 'ExtraFeature.txt',
 'Garage.txt',
 'InteriorFlooring.txt',
 'InteriorWall.txt',
 'LotSiteAppeal.txt',
 'MailAddress.txt',
 'Main.txt',
 'Name.txt',
 'Oby.txt',
 'Pool.txt',
 'SaleData.txt',
 'TaxDistrict.txt',
 'TaxExemption.txt',
 'TypeConstruction.txt',
 'Value.txt',
 'VestingCodes.txt']
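To make that listing easier to iterate over, here is a minimal sketch that turns it into a dictionary from table name to file path. The helper name `list_tables` is hypothetical (it is not part of `zillow_helpers`), and it assumes every table lives in a `.txt` file directly inside the folder.

```python
import os

def list_tables(datadir):
    """Map each table name (file name without '.txt') to its full path.

    Hypothetical helper: assumes tables are '.txt' files directly in datadir.
    """
    tables = {}
    for name in sorted(os.listdir(datadir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".txt":
            tables[stem] = os.path.join(datadir, name)
    return tables
```

With the folders above, `list_tables(datadir1)['SaleData']` would give the path that the later cells build by hand with `datadir1+'SaleData.txt'`.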

The standard syntax for opening a text file and reading lines from it is shown below.

In [68]:
f = open(datadir1+'SaleData.txt', 'r')

In [73]:
# Only read a few lines at a time with these files!
for _ in range(3):
    line = f.readline()
    print(line)  # each line keeps its trailing newline, hence the blank lines in the output

B90D272B-3F28-E611-80C4-3863BB43AC67|1|||2013-04-04|||91|151|CFD||640000.0000|AF|46107|801464

BA0D272B-3F28-E611-80C4-3863BB43AC67|1|||2010-05-18|||89|450|WD|WRDE|17500.0000|AF|46107|801464

BD0D272B-3F28-E611-80C4-3863BB43AC67|1|||1999-06-11|||||||| |46107|801464
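The open/readline pattern above leaves the file handle open between cells. As an alternative, here is a small sketch that peeks at the first few lines and closes the file right away. The name `peek` and the encoding settings are assumptions, not something the dataset documentation was checked against.

```python
from itertools import islice

def peek(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    Hypothetical helper: the encoding is a guess, so undecodable bytes are
    replaced rather than raising an error.
    """
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `peek(datadir1+'SaleData.txt')` would return the first three pipe-separated rows as strings.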



Note that creating a DataFrame from these very large files does not seem to require much processing power. Printing the whole dataset, however, will probably crash your kernel! Display only the first few rows with `head()` each time to check the result.

In [75]:
df = pd.read_csv(datadir1+'SaleData.txt', sep="|", index_col=False, header=None, low_memory=False)
df.head(3)

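|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A30D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 2011-06-10 |  |  | 90.0 | 398.0 | WD | WRDE |  |  | 46107 | 801464 |
| 1 | A90D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 1998-08-24 |  |  |  |  |  |  |  |  | 46107 | 801464 |
| 2 | AD0D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 2010-07-14 |  |  | 90.0 | 14.0 | WD | WRDE | 58000.0 | AF | 46107 | 801464 |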
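If even building the full DataFrame gets heavy, `pd.read_csv` can be told to read only part of the file. The sketch below shows two options: `nrows` to load just the top of a file, and `chunksize` to stream through it in pieces. The function names `read_head` and `count_rows` are hypothetical, and the default sizes are arbitrary.

```python
import pandas as pd

def read_head(path, nrows=1000):
    """Load only the first nrows rows of a pipe-separated file."""
    return pd.read_csv(path, sep="|", header=None, index_col=False, nrows=nrows)

def count_rows(path, chunksize=100_000):
    """Count rows by streaming the file in chunks instead of loading it whole."""
    total = 0
    for chunk in pd.read_csv(path, sep="|", header=None, chunksize=chunksize):
        total += len(chunk)
    return total
```

`read_head(datadir1+'SaleData.txt', nrows=3)` should give the same three rows as `df.head(3)` above, without parsing the rest of the file.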


In [87]:
df2 = zillow_helpers.txt_to_df(datadir1+'Main.txt')
df2.head(3)

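|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01D1B108-F8A6-E611-80C9-3863BB43AC67 | 119614254 | 46137 | SD | ZIEBACH | 2016-11-01 | 72016 | 4 | BKF | 2643 | ... |  |  |  |  |  |  | 0 | 1564496 | 2624 | 562086187 |
| 1 | 02D1B108-F8A6-E611-80C9-3863BB43AC67 | 119613826 | 46137 | SD | ZIEBACH | 2016-11-01 | 72016 | 4 | BKF | 2213 | ... |  |  |  |  |  |  | 0 | 1564496 | 2196 | -1621235809 |
| 2 | 03D1B108-F8A6-E611-80C9-3863BB43AC67 | 119615712 | 46137 | SD | ZIEBACH | 2016-11-01 | 72016 | 4 | BKF | 4104 | ... |  |  |  |  |  |  | 0 | 1564496 | 4081 | 1892456234 |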


If your DataFrame looks good, then pickle it! Pickling saves the DataFrame to disk so you do not need to re-create it later.
![Pickle Rick](http://pm1.narvii.com/6511/c7ba0df4a630d1c05fad94fec2cac061bc28d69a_128.jpg)

In [81]:
df.to_pickle("df.pickle")

In [82]:
df3 = pd.read_pickle("df.pickle")

In [83]:
df3.head(3)

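|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A30D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 2011-06-10 |  |  | 90.0 | 398.0 | WD | WRDE |  |  | 46107 | 801464 |
| 1 | A90D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 1998-08-24 |  |  |  |  |  |  |  |  | 46107 | 801464 |
| 2 | AD0D272B-3F28-E611-80C4-3863BB43AC67 | 1 |  |  | 2010-07-14 |  |  | 90.0 | 14.0 | WD | WRDE | 58000.0 | AF | 46107 | 801464 |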
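The save-then-load round trip above can be wrapped into a small cache, so a notebook re-run picks up the pickle when it exists and only rebuilds the DataFrame when it does not. This is a sketch only: `load_or_build` is a hypothetical name, and `build` is any zero-argument function you supply that returns the DataFrame.

```python
import os
import pandas as pd

def load_or_build(pickle_path, build):
    """Return the pickled DataFrame if present, otherwise build and pickle it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    df = build()              # the expensive step, run only on a cache miss
    df.to_pickle(pickle_path)
    return df
```

For example, `load_or_build("df.pickle", lambda: pd.read_csv(datadir1+'SaleData.txt', sep="|", header=None))` would reuse `df.pickle` on the second run. Delete the pickle file by hand whenever the source data changes, since this sketch does not check for staleness.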


# Organizing via Syntax
Time to start organizing and merging the files! The best approach seems to be to separate the DataFrames into ZAsmt and ZTrans groups first.
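One possible way to start on that split is to group the data folders by their last path component, then stack the same table from every folder in a group. Both helper names below are hypothetical. Stacking with `pd.concat` is only one option and assumes the table has the same columns in every folder, which has not been verified here.

```python
import os
import pandas as pd

def group_by_source(datadirs):
    """Group folders by their last path component, e.g. 'ZAsmt' or 'ZTrans'."""
    groups = {}
    for d in datadirs:
        source = os.path.basename(os.path.normpath(d))
        groups.setdefault(source, []).append(d)
    return groups

def stack_table(dirs, table, read):
    """Read one table from each folder with `read` and stack the rows.

    Assumes the table has identical columns in every folder.
    """
    frames = [read(os.path.join(d, table + ".txt")) for d in dirs]
    return pd.concat(frames, ignore_index=True)

# The four folders defined at the top of the notebook
groups = group_by_source(
    ['data/46/ZAsmt/', 'data/46/ZTrans/', 'data/50/ZAsmt/', 'data/50/ZTrans/']
)
```

Here `groups['ZAsmt']` holds the two ZAsmt folders, and `stack_table(groups['ZAsmt'], 'Main', zillow_helpers.txt_to_df)` would stack both `Main.txt` files into one DataFrame, given that `txt_to_df` takes a file path as it does above.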