# Demographic and Economic data per zipcode
* Option 1: Use US census (community) data
* Option 2: Use Python module uszipcode

I am now using the Python module [uszipcode](https://uszipcode.readthedocs.io/index.html#) (Option 2) to get demographic and economic data per zipcode. The only problem so far is that it needs to be installed via pip.
```python
import sys
!{sys.executable} -m pip install uszipcode
```
The `sys.executable` prefix ensures that pip installs into the Python environment of the running Jupyter kernel and nowhere else, see [Installing Python Packages from a Jupyter Notebook](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/).
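
Outside a notebook the `!` shell escape is not available. A minimal sketch of the same idea using `subprocess` follows; the helper names `pip_install_command` and `pip_install` are made up for this illustration.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the argument list that runs pip under the current interpreter."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Install `package` into the environment of the running interpreter."""
    subprocess.check_call(pip_install_command(package))


# Uncomment to actually install the module:
# pip_install("uszipcode")
```

Because the command starts with `sys.executable`, the package ends up in the same environment that runs the code.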

Note: The data is stored in a SQLite database that has to be downloaded. Presumably, one could query the database directly instead of using the few exposed methods. The documentation states: "The Zipcode and SimpleZipcode are actually sqlalchemy orm declarative base class."
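
As a first step towards querying the database directly, one could inspect which tables the downloaded file contains. The sketch below uses only the standard `sqlite3` module; the path of the database file is an assumption you have to supply yourself (look inside the directory passed as `db_file_dir`), and `list_tables` is a hypothetical helper, not part of uszipcode.

```python
import sqlite3
from contextlib import closing


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at `db_path`."""
    # Note: sqlite3.connect creates an empty file if `db_path` does not exist.
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Once the table names are known, ordinary SQL `SELECT` statements can be run against the same connection.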

In [None]:
import sys
!{sys.executable} -m pip install uszipcode


## What kind of data do we have?
From the documentation: "uszipcode has two backend database, SimpleZipcode and Zipcode. Zipcode has more info, but the database file is 450MB (takes more time to download)."

In [2]:
from uszipcode import SearchEngine, SimpleZipcode, Zipcode

In [3]:
zip_code = "10001"  # avoid the name `zip`, which shadows the Python builtin
search = SearchEngine(simple_zipcode=True, db_file_dir="./tmp") # set simple_zipcode=False to use rich info database
zipcode = search.by_zipcode(zip_code)

Use `db_file_dir` to choose the directory into which the database file is downloaded.

```python
search = SearchEngine(db_file_dir="/tmp")
```


In [4]:
zipcode.values() # to list

['10001',
 'Standard',
 'New York',
 'New York, NY',
 ['New York'],
 'New York County',
 'NY',
 40.75,
 -73.99,
 'Eastern',
 0.9090909090909091,
 ['718', '917', '347', '646'],
 21102,
 33959.0,
 0.62,
 0.0,
 12476,
 11031,
 650200,
 81671,
 -74.008621,
 -73.984076,
 40.759731,
 40.743451]

Alternatives
```python
zipcode.to_dict() # to dict
zipcode.to_json() # to json
```

## The finer view



In [5]:
search = SearchEngine(simple_zipcode=False, db_file_dir="./tmp") 

In [6]:
zip_code = "10001"  # avoid the name `zip`, which shadows the Python builtin
zipcode = search.by_zipcode(zip_code)

In [None]:
zipcode.keys()

## Which variables to keep?
For space reasons, we may want to prune the database a bit.

In [8]:
vars = ['zipcode_type',
 'major_city',
 'post_office_city',
 'common_city_list',
 'county',
 'state',
 'lat',
 'lng',
 'timezone',
 'radius_in_miles',
 'area_code_list',
 'population',
 'population_density',
 'land_area_in_sqmi',
 'water_area_in_sqmi',
 'housing_units',
 'occupied_housing_units',
 'median_home_value',
 'median_household_income',
 
 'zipcode',
 
 'population_by_year',
 'population_by_age',
 'population_by_gender',
 'population_by_race',
 'head_of_household_by_age',
 'families_vs_singles',
 'households_with_kids',
 'children_by_age',
 'housing_type',
 'year_housing_was_built',
 'housing_occupancy',
 'vancancy_reason',
 'owner_occupied_home_values',
 'rental_properties_by_number_of_rooms',
 'monthly_rent_including_utilities_studio_apt',
 'monthly_rent_including_utilities_1_b',
 'monthly_rent_including_utilities_2_b',
 'monthly_rent_including_utilities_3plus_b',
 'employment_status',
 'average_household_income_over_time',
 'household_income',
 'annual_individual_earnings',
 'sources_of_household_income____percent_of_households_receiving_income',
 'sources_of_household_income____average_income_per_household_by_income_source',
 'household_investment_income____percent_of_households_receiving_investment_income',
 'household_investment_income____average_income_per_household_by_income_source',
 'household_retirement_income____percent_of_households_receiving_retirement_incom',
 'household_retirement_income____average_income_per_household_by_income_source',
 'source_of_earnings',
 'means_of_transportation_to_work_for_workers_16_and_over',
 'travel_time_to_work_in_minutes',
 'educational_attainment_for_population_25_and_over',
 'school_enrollment_age_3_to_17']
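
Pruning boils down to keeping a fixed subset of keys from each record. A minimal sketch on a plain dictionary follows; `prune` is a hypothetical helper, and the sample record only reuses values shown for zipcode 10001 earlier in this notebook.

```python
def prune(record, keys):
    """Keep only `keys` from `record`; keys missing from the record map to None."""
    return {key: record.get(key) for key in keys}


record = {"zipcode": "10001", "state": "NY", "population": 21102, "lat": 40.75}
print(prune(record, ["zipcode", "state", "population"]))
# {'zipcode': '10001', 'state': 'NY', 'population': 21102}
```

Using `record.get` instead of `record[key]` means a key that is absent from a record yields `None` rather than raising a `KeyError`.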


## Which zipcodes to consider?
Determined by the EV Hub data.
Problem: as the cells below show, some zipcodes in the registration database are dubious. We may have to clean the registration database before we can use them as is.

In [None]:
import pandas as pd

In [None]:
url = "https://raw.githubusercontent.com/siddhantmaharana/atlytics_team_recylers/master/data/zip_data.csv"

In [None]:
zip_df = pd.read_csv(url)

In [None]:
zip_df.head()

In [None]:
zip_df["ZIP Code"].value_counts()

In [None]:
# Problem: ZIP code 94304 is Palo Alto, CA
zip_df[zip_df["ZIP Code"]=='94304.0']

In [None]:
zip_df[zip_df["ZIP Code"]=='H2C2G']
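
The two lookups above hint at two kinds of bad values: ZIP codes with a float suffix such as `'94304.0'` and non-numeric entries such as `'H2C2G'`. A possible cleaning step is sketched below; `normalize_zip` is a hypothetical helper, and padding short numbers with leading zeros is a heuristic that assumes they were truncated when read as numbers.

```python
import re


def normalize_zip(raw):
    """Return a five-digit ZIP string, or None if `raw` does not look like a US ZIP code."""
    text = str(raw).strip()
    # Undo float formatting such as "94304.0".
    text = re.sub(r"\.0+$", "", text)
    # Drop a ZIP+4 suffix such as "-1234".
    text = text.split("-")[0]
    if not re.fullmatch(r"[0-9]{1,5}", text):
        return None
    # Restore leading zeros lost when the value was read as a number.
    return text.zfill(5)


print(normalize_zip("94304.0"))  # 94304
print(normalize_zip("H2C2G"))    # None
print(normalize_zip(2804))       # 02804
```

Applied to the column, for example with `zip_df["ZIP Code"].map(normalize_zip)`, rows that come back as `None` can then be inspected or dropped.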



## Which zipcodes to consider? A state-based approach

Let's simply use all the zipcodes of a given state.

In [9]:
res_RI = search.by_state("Rhode Island")  # a list of Zipcode objects, not dictionaries

In [10]:
[ x.zipcode for x in res_RI] 

['02804', '02806', '02808', '02809', '02812']

There are 90 zip codes in Rhode Island, so why do we get only 5 rows? `by_state` limits the number of results through its `returns` parameter, which defaults to 5. Passing `returns=None` lifts the limit.

In [None]:
res_RI = search.by_state("Rhode Island", returns=None)
[ x.zipcode for x in res_RI] 

#### Better code

In [None]:
from uszipcode import SearchEngine, SimpleZipcode, Zipcode

# Using the search engine as a context manager cleans up the database connection on exit.
with SearchEngine() as search:
    res_RI_2 = search.by_state("Rhode Island")
    lst = [x.zipcode for x in res_RI_2]

#### Boilerplate testing

In [None]:
dict(res_RI[0].items())

In [None]:
# {key: d[key] for key in d.viewkeys() & l}

In [None]:
list_of_dict = [ {key: (dict(x.items())[key]) for key in vars} for x in res_RI]

In [None]:
table = pd.DataFrame(list_of_dict)

## Building the table

In [21]:
import pandas as pd
from uszipcode import SearchEngine, SimpleZipcode, Zipcode

In [22]:
ev_reg_files = ['co_ev_registrations_public.xlsx',
 'ct_ev_registrations.xlsx',
 'fl_ev_registrations.xlsx',
 'mi_ev_registrations_public.xlsx',
 'mn_ev_registrations_public.xlsx',
 'nj_ev_registrations_public.xlsx',
 'ny_ev_registrations_public.xlsx',
 'or_ev_registrations_public.xlsx',
 'tx_ev_registrations_public.xlsx',
 'va_ev_registrations_public.xlsx',
 'vt_ev_registrations_public.xlsx',
 'wa_ev_registrations_public.xlsx',
 'wi_ev_registrations_public.xlsx']


In [23]:
states =  ['colorado',
 'connecticut',
 'florida',
 'michigan',
 'minnesota',
 'new jersey',
 'new york',
 'oregon',
 'texas',
 'virginia',
 'vermont',
 'washington',
 'wisconsin']


In [24]:
table = pd.DataFrame()

search = SearchEngine(simple_zipcode=False, db_file_dir="./tmp") 

In [25]:
for s in states:
    results = search.by_state(s, returns=None)
    print(results[0].state)
    # Build a list of dictionaries restricted to the keys listed in `vars`.
    list_of_dict = [{key: dict(x.items())[key] for key in vars} for x in results]
    print("processing..", s + "  " + str(len(list_of_dict)))

    table = pd.concat([table, pd.DataFrame(list_of_dict)])

CO
processing.. colorado  450
CT
processing.. connecticut  272
FL
processing.. florida  935
MI
processing.. michigan  911
MN
processing.. minnesota  831
NJ
processing.. new jersey  558
NY
processing.. new york  1667
OR
processing.. oregon  399
TX
processing.. texas  1745
VA
processing.. virginia  861
VT
processing.. vermont  244
WA
processing.. washington  530
WI
processing.. wisconsin  722


Apparently some zipcodes are missing. The loop above found 450 zipcodes for Colorado and 935 for Florida, whereas:
* "Colorado has roughly 649 zip codes"
* "Florida has roughly 1469 zip codes"
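
To put the gap into numbers, the counts printed by the loop can be compared with the quoted totals. This is plain arithmetic on the figures above; the quoted totals are themselves only approximate, so the shares are rough.

```python
# Counts printed by the loop above.
found = {"colorado": 450, "florida": 935}
# Approximate totals from the quotes above.
expected = {"colorado": 649, "florida": 1469}

coverage = {state: found[state] / expected[state] for state in found}
for state, share in coverage.items():
    print(f"{state}: {share:.0%}")
# colorado: 69%
# florida: 64%
```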

In [26]:
table.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10125 entries, 0 to 721
Data columns (total 53 columns):
zipcode_type                                                                        10125 non-null object
major_city                                                                          10125 non-null object
post_office_city                                                                    9915 non-null object
common_city_list                                                                    10125 non-null object
county                                                                              10125 non-null object
state                                                                               10125 non-null object
lat                                                                                 9915 non-null float64
lng                                                                                 9915 non-null float64
timezone                                      

In [28]:
table.head()

Unnamed: 0,zipcode_type,major_city,post_office_city,common_city_list,county,state,lat,lng,timezone,radius_in_miles,...,sources_of_household_income____average_income_per_household_by_income_source,household_investment_income____percent_of_households_receiving_investment_income,household_investment_income____average_income_per_household_by_income_source,household_retirement_income____percent_of_households_receiving_retirement_incom,household_retirement_income____average_income_per_household_by_income_source,source_of_earnings,means_of_transportation_to_work_for_workers_16_and_over,travel_time_to_work_in_minutes,educational_attainment_for_population_25_and_over,school_enrollment_age_3_to_17
0,Standard,Arvada,"Arvada, CO",[Arvada],Jefferson County,CO,39.79,-105.1,Mountain,4.0,...,"[{'key': 'Data', 'values': [{'x': 'Wages', 'y'...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'Worked Full...","[{'key': 'Data', 'values': [{'x': 'Car, Truck,...","[{'key': 'Data', 'values': [{'x': '< 10', 'y':...","[{'key': 'Data', 'values': [{'x': 'Less Than H...","[{'key': 'Data', 'values': [{'x': 'Enrolled In..."
1,Standard,Arvada,"Arvada, CO","[Arvada, Westminster]",Jefferson County,CO,39.83,-105.06,Mountain,3.0,...,"[{'key': 'Data', 'values': [{'x': 'Wages', 'y'...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'Worked Full...","[{'key': 'Data', 'values': [{'x': 'Car, Truck,...","[{'key': 'Data', 'values': [{'x': '< 10', 'y':...","[{'key': 'Data', 'values': [{'x': 'Less Than H...","[{'key': 'Data', 'values': [{'x': 'Enrolled In..."
2,Standard,Arvada,"Arvada, CO",[Arvada],Jefferson County,CO,39.81,-105.12,Mountain,3.0,...,"[{'key': 'Data', 'values': [{'x': 'Wages', 'y'...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'Worked Full...","[{'key': 'Data', 'values': [{'x': 'Car, Truck,...","[{'key': 'Data', 'values': [{'x': '< 10', 'y':...","[{'key': 'Data', 'values': [{'x': 'Less Than H...","[{'key': 'Data', 'values': [{'x': 'Enrolled In..."
3,Standard,Arvada,"Arvada, CO","[Arvada, Westminster]",Jefferson County,CO,39.85,-105.12,Mountain,3.0,...,"[{'key': 'Data', 'values': [{'x': 'Wages', 'y'...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'Worked Full...","[{'key': 'Data', 'values': [{'x': 'Car, Truck,...","[{'key': 'Data', 'values': [{'x': '< 10', 'y':...","[{'key': 'Data', 'values': [{'x': 'Less Than H...","[{'key': 'Data', 'values': [{'x': 'Enrolled In..."
4,Standard,Arvada,"Arvada, CO",[Arvada],Jefferson County,CO,39.87,-105.22,Mountain,5.0,...,"[{'key': 'Data', 'values': [{'x': 'Wages', 'y'...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'Interest', ...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'IRA Distrib...","[{'key': 'Data', 'values': [{'x': 'Worked Full...","[{'key': 'Data', 'values': [{'x': 'Car, Truck,...","[{'key': 'Data', 'values': [{'x': '< 10', 'y':...","[{'key': 'Data', 'values': [{'x': 'Less Than H...","[{'key': 'Data', 'values': [{'x': 'Enrolled In..."


### Saving the table 


In [30]:
table.to_csv('data/uszipcode_data.csv')
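
One thing to keep in mind: nested columns such as `population_by_age` hold lists of dictionaries, and in a CSV file they end up as plain strings. A sketch of how such a cell could be turned back into Python objects after reading the file follows. It assumes the cells contain only plain lists, dicts, strings and numbers; `parse_nested` is a hypothetical helper, and the sample cell is made up in the shape shown in the `table.head()` output.

```python
import ast


def parse_nested(cell):
    """Turn a stringified list/dict from the CSV back into Python objects."""
    if not isinstance(cell, str) or not cell.strip():
        return None  # missing value
    # literal_eval only accepts Python literals, so it does not execute code.
    return ast.literal_eval(cell)


# Made-up cell in the same shape as the nested columns.
cell = "[{'key': 'Data', 'values': [{'x': 'Wages', 'y': 1}]}]"
parsed = parse_nested(cell)
print(parsed[0]["values"][0]["x"])  # Wages
```

When reading the file back with pandas, it may also be worth passing `dtype={"zipcode": str}` to `pd.read_csv`, since zipcodes such as `02804` otherwise lose their leading zero.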


## Building state-zipcode pairs

In [32]:
state_zip_pairs = table[['state','zipcode']]

In [None]:
state_zip_pairs[1000:1050]

In [36]:
state_zip_pairs.to_csv('data/state_zip_pairs.csv')