# Session 8: Reading and Writing Data, Data Wrangling: Geocoding the Rental Data

Below is a concise version of the code that reads the Bay Area rental data into a DataFrame and cleans it up.  We will use this as a starting point.  It is one solution to the first part of Assignment 1.

In [1]:
# import libraries and read in the csv file
import re, pandas as pd, numpy as np, requests, json
df = pd.read_csv('data/bay.csv')

# clean price and neighborhood
df.price = df.price.str.strip('$').astype('float64')
df.neighborhood = df.neighborhood.str.strip().str.strip('(').str.strip(')')

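# clean bedrooms: keep the text between '/ ' and 'br'; rows with no match become NaN
# (a single vectorized call avoids chained assignment such as df.bedrooms[i] = ...)
df['bedrooms'] = df['bedrooms'].str.extract(r'(?<=/ )(.*)(?=br)', expand=False)
df['bedrooms'] = df['bedrooms'].astype('float64')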


# clean up the sqft
def parse_sqft(s):
    # rows with no square-footage entry (or a missing value) become NaN
    if not isinstance(s, str) or 'ft' not in s:
        return np.nan
    end = s.find('ft')
    begin = s.find('- ') + 2
    # when there is no bedroom count, the number follows '/ ' instead of '- '
    if begin > end:
        begin = s.find('/ ') + 2
    return s[begin:end]

df['sqft'] = df['sqft'].apply(parse_sqft).astype('float64')

# break out the date into month day year columns
df['month'] = df['date'].str.split().str[0]
df['day'] = df['date'].str.split().str[1].astype('int32')
df['year'] = df['date'].str.split().str[2].astype('int32')
del df['date']

## Now we need to build a geocoding script.

Here are some hints about one way to do this.  You may find other methods equally or more useful.

Test the FCC API manually in your browser and examine how the results are structured.

Review the examples in Chapter 6 of Wes McKinney's book for retrieving data from a URL.  I suggest using the JSON format in the API, but you can test either JSON or XML.
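As a rough sketch of the JSON route, the snippet below builds a request URL and pulls the block and county fields out of a response body.  The base URL and the `Block`/`County` keys are the ones used in the geocoding cell of this notebook; the helper names (`build_url`, `parse_block_response`) and the sample body are made up for illustration, so treat this as one possible approach rather than the required one.

```python
import json

# Base URL taken from the geocoding cell in this notebook.
BASE_URL = 'http://data.fcc.gov/api/block/2010/find'


def build_url(latitude, longitude):
    """Return the request URL for one latitude/longitude pair."""
    return '{}?format=json&latitude={}&longitude={}'.format(
        BASE_URL, latitude, longitude)


def parse_block_response(text):
    """Pull the block and county fields out of a JSON response body."""
    data = json.loads(text)
    return {
        'block_fips': data['Block']['FIPS'],
        'county_fips': data['County']['FIPS'],
        'county_name': data['County']['name'],
    }


# Hand-written sample body (an assumption, not captured API output).
SAMPLE_BODY = '''
{"Block": {"FIPS": "060855114002016"},
 "County": {"FIPS": "06085", "name": "Santa Clara"}}
'''

print(build_url(37.015548, -121.591553))
print(parse_block_response(SAMPLE_BODY))
```

With a live connection, the text returned by `requests.get(build_url(lat, lon))` could be passed to `parse_block_response` in place of the sample body.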

Also, since you will want to step through the DataFrame one row at a time, consider using `iterrows`: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.iterrows.html
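To get a feel for `iterrows` before pointing it at the rental data, here is a minimal sketch on a tiny made-up DataFrame (the `listings` frame and its `label` column are invented for this example).  Each pass yields the index label and a Series holding that row's values.

```python
import pandas as pd

# Tiny made-up frame standing in for the rental data.
listings = pd.DataFrame({
    'latitude': [37.0, 37.5, 38.0],
    'longitude': [-121.5, -122.0, -122.5],
})
listings['label'] = ''

for i, row in listings.iterrows():
    # row is a Series holding one row's values; i is its index label
    label = '{:.1f},{:.1f}'.format(row['latitude'], row['longitude'])
    # write through the frame itself, because row is only a copy
    listings.at[i, 'label'] = label

print(listings)
```

Note that assigning to `row` inside the loop would not change `listings`, which is why the sketch writes back with `listings.at[i, ...]`.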

**CAUTION**: Since the API is a bit slow, and works one record at a time, test it out with only a few records until you are positive you have it working correctly.  It may take several minutes to grind through all the records, and you don't want to get us locked out of the site.  You might want to print the row number you are processing, maybe every tenth one, to give you some sense of how it is progressing.

Also watch out for rows with missing lat-longs.  You'll probably want to skip them, since the API won't know what to do with them.
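The cautions above (try a few records first, report progress, skip missing lat-longs) can be sketched together as follows.  This is only an illustration: `fake_lookup` is a hypothetical stand-in for the API call so that the loop runs offline, and `geocode_rows`, `max_rows`, and `report_every` are names invented here.

```python
import numpy as np
import pandas as pd


def fake_lookup(latitude, longitude):
    """Hypothetical stand-in for the API call, so the loop runs offline."""
    return 'block@{:.2f},{:.2f}'.format(latitude, longitude)


def geocode_rows(frame, lookup, max_rows=None, report_every=10):
    """Fill a 'block' column, skipping rows with missing coordinates."""
    frame = frame.copy()
    frame['block'] = ''
    skipped = 0
    # head(max_rows) restricts a trial run to the first few records
    subset = frame if max_rows is None else frame.head(max_rows)
    for n, (i, row) in enumerate(subset.iterrows()):
        if n % report_every == 0:
            print('processing row: ', i)
        if pd.isnull(row['latitude']) or pd.isnull(row['longitude']):
            skipped += 1
            continue
        frame.at[i, 'block'] = lookup(row['latitude'], row['longitude'])
    return frame, skipped


sample = pd.DataFrame({
    'latitude': [37.0, np.nan, 37.5, 38.0],
    'longitude': [-121.5, -122.0, np.nan, -122.5],
})
result, skipped = geocode_rows(sample, fake_lookup, max_rows=3, report_every=2)
print(result)
print('skipped:', skipped)
```

Once a small run like this looks right, dropping the `max_rows` argument processes every row.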

Finally, study how to write an HDF5 file, and write your results to a file in this format.

In [8]:
# We use an FCC API to convert lat, long to census block and other geographies
# http://www.fcc.gov/developers/census-block-conversions-api
url = 'http://data.fcc.gov/api/block/2010/find?format=json&latitude='

# define the new geolocation fields for our dataframe
# (the block column is named 'fipsblock' to match the cell that reads the file back)
df['fipsblock'] = ''
df['countyfips'] = ''
df['county'] = ''

# We need to iterate over the rows of the DataFrame and get data from the FCC API for each
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.iterrows.html
for i, row in df.iterrows():
    # NaN > 0 is False, so rows with a missing latitude are skipped
    if row['latitude'] > 0:  # and i < 50:
        resp = requests.get(url + str(row['latitude']) + '&longitude=' + str(row['longitude']))
        data = json.loads(resp.text)
        # print(data)
        # df.at writes straight into df; chained df['col'][i] = ... may write to a copy
        df.at[i, 'countyfips'] = data['County']['FIPS']
        df.at[i, 'county'] = data['County']['name']
        df.at[i, 'fipsblock'] = data['Block']['FIPS']
        if i % 200 == 0:
            print('processing row: ', i)

print(df[:5])

store = pd.HDFStore('data/bay.h5')
store['rents'] = df
print(store['rents'][:5])
store.close()

processing row:  0
processing row:  200
processing row:  800

ConnectionError: HTTPConnectionPool(host='data.fcc.gov', port=80): Max retries exceeded with url: /api/block/2010/find?format=json&latitude=37.015548&longitude=-121.591553 (Caused by <class 'httplib.BadStatusLine'>: '')
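The run above stopped partway through with a `ConnectionError`.  One possible way to make a long loop more tolerant of a dropped connection, offered as a sketch and not as what this notebook did, is to retry the call a few times before giving up.  The helper `fetch_with_retries` and the simulated `flaky_fetch` below are hypothetical; the sketch catches `IOError` because the exceptions raised by `requests` derive from it.

```python
import time


def fetch_with_retries(fetch, attempts=3, wait_seconds=0.0):
    """Call fetch() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as err:  # requests' ConnectionError subclasses IOError
            last_error = err
            if attempt < attempts - 1:
                time.sleep(wait_seconds)  # pause before the next try
    raise last_error


# Hypothetical flaky function: fails twice, then succeeds.
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise IOError('simulated dropped connection')
    return '{"status": "OK"}'


body = fetch_with_retries(flaky_fetch)
print(body, 'after', calls['count'], 'calls')
```

In the geocoding loop, `fetch` could be something like `lambda: requests.get(full_url).text`, with a non-zero `wait_seconds` so the site is not hammered; `full_url` here is just a placeholder name.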


In [28]:
store = pd.HDFStore('data/bay.h5')
test = store['rents']
print(test['fipsblock'][:100])
store.close()

0     060855114002016
1     060855114002016
2     060855114002016
3     060855114002016
4     060855114002016
5     060855114002016
6     060855114002016
7     060855114002016
8     060855114002016
9     060855114002016
10    060855114002016
11    060855114002016
12    060855114002016
13    060855114002016
14    060855114002016
...
85    060855114002016
86    060855114002016
87    060855114002016
88    060855114002016
89    060855114002016
90    060855114002016
91    060855114002016
92    060855114002016
93    060855114002016
94    060855114002016
95    060855114002016
96    060855114002016
97    060855114002016
98    060855114002016
99    060855114002016
Name: fipsblock, Length: 100, dtype: object
