# Demographic and economic data per zipcode
* Option 1: Use US census (community) data
* Option 2: Use Python module uszipcode

I am now using the Python module [uszipcode](https://uszipcode.readthedocs.io/index.html#) to get demographic and economic data per zipcode. The only problem so far is that it needs to be installed via pip:
```python
import sys
!{sys.executable} -m pip install uszipcode
```
Calling pip through `{sys.executable} -m pip` ensures that the package is installed into the environment of the running Jupyter kernel rather than whichever `pip` happens to be first on the path; see [Installing Python Packages from a Jupyter Notebook](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/).
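
To make the install line less opaque, here is a minimal sketch of the equivalent plain-Python call. The helper name `pip_install_command` is made up for illustration; it only builds the command, and the commented-out `subprocess` line is what would actually run it.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build the pip command for the interpreter running this kernel.

    `pip_install_command` is a hypothetical helper, not part of pip or Jupyter.
    """
    # sys.executable is the Python binary behind the current kernel, so the
    # package lands in the same environment the notebook imports from.
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("uszipcode")
print(cmd)
# Uncomment to run the install (needs network access):
# subprocess.check_call(cmd)
```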

In [30]:
import sys
!{sys.executable} -m pip install uszipcode




## What kind of data do we have?
"uszipcode has two backend database, SimpleZipcode and Zipcode. Zipcode has more info, but the database file is 450MB (takes more time to download)."

In [31]:
from uszipcode import SearchEngine

In [32]:
zip_code = "10001"  # avoid shadowing the built-in zip()
search = SearchEngine(simple_zipcode=True, db_file_dir="./tmp")  # set simple_zipcode=False to use rich info database
zipcode = search.by_zipcode(zip_code)

Start downloading data for simple zipcode database, total size 9MB ...
  1 MB finished ...
  2 MB finished ...
  3 MB finished ...
  4 MB finished ...
  5 MB finished ...
  6 MB finished ...
  7 MB finished ...
  8 MB finished ...
  9 MB finished ...
  10 MB finished ...
  Complete!


Use `db_file_dir` to choose the directory the database file is downloaded to:

```python
search = SearchEngine(db_file_dir="/tmp")
```


In [33]:
zipcode.values() # to list

['10001',
 'Standard',
 'New York',
 'New York, NY',
 ['New York'],
 'New York County',
 'NY',
 40.75,
 -73.99,
 'Eastern',
 0.9090909090909091,
 ['718', '917', '347', '646'],
 21102,
 33959.0,
 0.62,
 0.0,
 12476,
 11031,
 650200,
 81671,
 -74.008621,
 -73.984076,
 40.759731,
 40.743451]

Alternative output formats:
```python
zipcode.to_dict() # to dict
zipcode.to_json() # to json
```

## The finer view



In [34]:
search = SearchEngine(simple_zipcode=False, db_file_dir="./tmp") 

In [35]:
zip_code = "10001"  # avoid shadowing the built-in zip()
zipcode = search.by_zipcode(zip_code)

In [38]:
zipcode.keys()

['zipcode_type',
 'major_city',
 'post_office_city',
 'common_city_list',
 'county',
 'state',
 'lat',
 'lng',
 'timezone',
 'radius_in_miles',
 'area_code_list',
 'population',
 'population_density',
 'land_area_in_sqmi',
 'water_area_in_sqmi',
 'housing_units',
 'occupied_housing_units',
 'median_home_value',
 'median_household_income',
 'bounds_west',
 'bounds_east',
 'bounds_north',
 'bounds_south',
 'zipcode',
 'polygon',
 'population_by_year',
 'population_by_age',
 'population_by_gender',
 'population_by_race',
 'head_of_household_by_age',
 'families_vs_singles',
 'households_with_kids',
 'children_by_age',
 'housing_type',
 'year_housing_was_built',
 'housing_occupancy',
 'vancancy_reason',
 'owner_occupied_home_values',
 'rental_properties_by_number_of_rooms',
 'monthly_rent_including_utilities_studio_apt',
 'monthly_rent_including_utilities_1_b',
 'monthly_rent_including_utilities_2_b',
 'monthly_rent_including_utilities_3plus_b',
 'employment_status',
 'average_household_inc

## Which variables to keep?
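For space reasons, we may want to prune the database a bit. Compared with the keys listed above, the selection below drops the four `bounds_*` fields and `polygon`.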

In [39]:
keep_vars = ['zipcode_type',  # named keep_vars to avoid shadowing the built-in vars()
 'major_city',
 'post_office_city',
 'common_city_list',
 'county',
 'state',
 'lat',
 'lng',
 'timezone',
 'radius_in_miles',
 'area_code_list',
 'population',
 'population_density',
 'land_area_in_sqmi',
 'water_area_in_sqmi',
 'housing_units',
 'occupied_housing_units',
 'median_home_value',
 'median_household_income',
 # bounds_west, bounds_east, bounds_north and bounds_south dropped
 'zipcode',
 # polygon dropped
 'population_by_year',
 'population_by_age',
 'population_by_gender',
 'population_by_race',
 'head_of_household_by_age',
 'families_vs_singles',
 'households_with_kids',
 'children_by_age',
 'housing_type',
 'year_housing_was_built',
 'housing_occupancy',
 'vancancy_reason',
 'owner_occupied_home_values',
 'rental_properties_by_number_of_rooms',
 'monthly_rent_including_utilities_studio_apt',
 'monthly_rent_including_utilities_1_b',
 'monthly_rent_including_utilities_2_b',
 'monthly_rent_including_utilities_3plus_b',
 'employment_status',
 'average_household_income_over_time',
 'household_income',
 'annual_individual_earnings',
 'sources_of_household_income____percent_of_households_receiving_income',
 'sources_of_household_income____average_income_per_household_by_income_source',
 'household_investment_income____percent_of_households_receiving_investment_income',
 'household_investment_income____average_income_per_household_by_income_source',
 'household_retirement_income____percent_of_households_receiving_retirement_incom',
 'household_retirement_income____average_income_per_household_by_income_source',
 'source_of_earnings',
 'means_of_transportation_to_work_for_workers_16_and_over',
 'travel_time_to_work_in_minutes',
 'educational_attainment_for_population_25_and_over',
 'school_enrollment_age_3_to_17']
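
A minimal sketch of how such a list could be applied, assuming the record is first converted with `zipcode.to_dict()`. The helper `prune_record` and the toy record are illustrative only and not part of uszipcode.

```python
def prune_record(record, keep):
    """Return a copy of `record` restricted to the keys named in `keep`.

    Keys missing from `record` are skipped rather than raising KeyError.
    """
    return {key: record[key] for key in keep if key in record}


# Toy record for illustration; a real one would come from zipcode.to_dict().
sample = {
    "zipcode": "10001",
    "major_city": "New York",
    "population": 21102,
    "bounds_west": -74.008621,
}
pruned = prune_record(sample, ["zipcode", "population", "median_home_value"])
print(pruned)  # {'zipcode': '10001', 'population': 21102}
```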


## Which zipcodes to consider?
The zipcodes of interest are determined by the EV hub data.

In [42]:
import pandas as pd

In [49]:
url = "https://raw.githubusercontent.com/siddhantmaharana/atlytics_team_recylers/master/data/zip_data.csv"

In [50]:
zip_df = pd.read_csv(url)

In [51]:
zip_df.head()

| Unnamed: 0.1 | Unnamed: 0 | ZIP Code | count | state |
|---|---|---|---|---|
| 0 | 0 | 80002.0 | 53 | co |
| 1 | 1 | 80003.0 | 99 | co |
| 2 | 2 | 80004.0 | 126 | co |
| 3 | 3 | 80005.0 | 140 | co |
| 4 | 4 | 80007.0 | 159 | co |


In [55]:
zip_df["ZIP Code"].value_counts()

8054.0     9
94304.0    8
55344.0    8
43026.0    8
30348.0    7
          ..
48757.0    1
14445.0    1
11786.0    1
12164.0    1
55313.0    1
Name: ZIP Code, Length: 9083, dtype: int64
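
Entries such as `8054.0` suggest that the ZIP codes passed through a float representation at some point, which drops leading zeros (presumably `08054` here). The following is a minimal sketch of a clean-up step under that assumption; `normalize_zip` is a hypothetical helper, not something from pandas or uszipcode.

```python
import re


def normalize_zip(value):
    """Return a 5-digit ZIP string, or None if `value` does not look like one.

    Hypothetical helper: assumes values such as "8054.0" are ZIP codes that
    lost their leading zeros when stored as floats.
    """
    text = str(value).strip()
    # Drop a trailing ".0" left over from float formatting.
    if text.endswith(".0"):
        text = text[:-2]
    # Accept 1 to 5 digits and restore leading zeros; reject everything else.
    if re.fullmatch(r"\d{1,5}", text):
        return text.zfill(5)
    return None


for raw in ["8054.0", "94304.0", 80002.0, "H2C2G"]:
    print(repr(raw), "->", normalize_zip(raw))
```

On the dataframe this could be applied as `zip_df["ZIP Code"].map(normalize_zip)`; rows that come back as `None` need a manual look.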

In [63]:
# Problem: ZIP code 94304 is Palo Alto, CA, but the rows below list it under other states
zip_df[zip_df["ZIP Code"] == '94304.0']

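| Unnamed: 0.1 | Unnamed: 0 | ZIP Code | count | state |
|---|---|---|---|---|
| 1250 | 871 | 94304.0 | 6 | ct |
| 2502 | 871 | 94304.0 | 6 | ct |
| 3754 | 871 | 94304.0 | 6 | ct |
| 5327 | 532 | 94304.0 | 1 | mn |
| 6172 | 839 | 94304.0 | 3 | nj |
| 8572 | 2391 | 94304.0 | 192 | ny |
| 8727 | 134 | 94304.0 | 4 | or |
| 10471 | 1301 | 94304.0 | 105 | tx |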

In [62]:
zip_df[zip_df["ZIP Code"]=='H2C2G']

| Unnamed: 0.1 | Unnamed: 0 | ZIP Code | count | state |
|---|---|---|---|---|
| 11599 | 819 | H2C2G | 1 | wa |


## TO DO

* Build a pandas dataframe with zipcode and variables of interest
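
One possible shape for this step, as a sketch only: collect one dict per zipcode and hand the list to pandas. The names `build_rows` and `fake_db` are hypothetical, and `lookup` stands in for a small wrapper around `search.by_zipcode(...)` and `to_dict()` so that the example runs without the uszipcode database.

```python
def build_rows(zip_codes, lookup, keep):
    """Collect one pruned dict per ZIP code that `lookup` can resolve.

    `lookup` is any callable returning a dict for a known ZIP code and None
    otherwise; it stands in for a uszipcode search plus `to_dict()`.
    """
    rows = []
    for zip_code in zip_codes:
        record = lookup(zip_code)
        if not record:
            continue  # unknown or malformed code, skip it
        rows.append({key: record.get(key) for key in keep})
    return rows


# Stand-in lookup table so the sketch runs without the uszipcode database.
fake_db = {"10001": {"zipcode": "10001", "state": "NY", "population": 21102}}
rows = build_rows(["10001", "H2C2G"], fake_db.get, ["zipcode", "population"])
print(rows)  # [{'zipcode': '10001', 'population': 21102}]
```

`pd.DataFrame(rows)` then turns the result into a dataframe with one row per resolved zipcode and one column per kept variable.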