# Python Exercise 9

### Problem 9-1: converting JSON to CSV, cleaning data

Download the file `zips.json` and place it in the same directory as the notebooks.

The file contains information about all postal code areas in the United States of America. On each line of the file we find information about one postal code area in a JSON document. The JSON documents look as follows:
```
    { 
        "_id" : "01001", 
        "city" : "AGAWAM", 
        "loc" : [ -72.622739, 42.070206 ], 
        "pop" : 15338, 
        "state" : "MA" 
    }
```

The actual zip code in in the `_id` field, the `loc` field contains the coordinates (longitude and latitude) of the zip code area, `city` is the name of the city (a city can has many zip code areas), `pop` is the population of the zip code area, and `state` is the state the zip code is in.



Our first task is to convert this file into a CSV file and clean the data a little during the process. Here's the steps we want to do:

* change the name of the `_id` field to `zip`
* convert the name of the city so that each word in the name starts with an uppercase character and all remaining characters of the word are lowercase. For example:
    * `AGAWAM` -> `Agawam`
    * `NEW YORK` -> `New York`
* convert the list `loc` containing longitude and latitude into two fields `longitude` and `latitude`

I provided code to open the `json` file and to open a new `csv` file. After opening the files we perform the following steps:

1. read each line and convert it into a Python dictionary (using `json.reads`)
2. transform the dictionary using the `clean_zip` function
3. write the cleaned dictionary to a the cvs file.

Your task is to implement the `clean_zip` function so that it performs as outlined above. (I already added one line to remove the `loc` field):

In [102]:
import csv
import json

def clean_zip(zip_data):
    #
    # add your code below this line
    #
    zip_data['zip']=zip_data['_id']
    zip_data['city']=zip_data['city'].title()
    zip_data['longitude']=zip_data['loc'][0]
    zip_data['latitude']=zip_data['loc'][1]
    del zip_data['loc']
    del zip_data['_id']
    return zip_data

with open('zips.json') as in_file, open('zips.csv', 'w') as out_file:
    field_names = ['zip', 'city', 'longitude', 'latitude', 'pop', 'state']
    writer = csv.DictWriter(out_file, field_names)
    writer.writeheader()
    for line in in_file:
        d = json.loads(line)
        writer.writerow(clean_zip(d))

Now let's check how the data looks like:

In [103]:
import pandas as pd
df = pd.read_csv('zips.csv')
df.head()

Unnamed: 0,zip,city,longitude,latitude,pop,state
0,1001,Agawam,-72.622739,42.070206,15338,MA
1,1002,Cushman,-72.51565,42.377017,36963,MA
2,1005,Barre,-72.108354,42.409698,4546,MA
3,1007,Belchertown,-72.410953,42.275103,10579,MA
4,1008,Blandford,-72.936114,42.182949,1240,MA


Observe that the values for `zip`, `longitude`, and `latitude` are all `NaN` (**N**ot **a** **N**umber, pandas way of telling us that they are missing). The right values should appear once you finished writing `clean_zip`.


### Problem 9-2: calculate population

Calculate the total population of the United States of America (according to the dataset):

In [104]:

df['pop'].sum()

248408400

Which 5 states have the highest population:

In [105]:
df2=df.groupby('state').sum().sort_values('pop',ascending=False)
df2.head(5)

Unnamed: 0_level_0,zip,longitude,latitude,pop
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CA,141919430,-181794.790982,55163.20937,29754890
NY,20330377,-119867.338913,67437.045861,17990402
TX,129283392,-163366.170609,52327.970282,16984601
FL,26733804,-65864.791365,22580.621689,12686644
PA,25011438,-113171.366377,59273.693528,11881643


### Problem 9-3: calculate city population

Now we want to calculate the population of different cities. If we just group by `city` we will encounter a problem: cities with the same name exist in different states! Let us first check whether this problem exists: we list the number of different states each city name appears (I don't know how to achieve this just using pandas, there might be a smart way which I missed). In any way, we can use the `iterrows` method of DataFrame to iterate over all rows in a DataFrame, `iterrows` generates tuples consisting of index and corresponding row of the dataframe (no need to understand this code fully, just run it and see the output):

In [106]:
from collections import defaultdict

states_by_city = defaultdict(set)
for idx, zip_area in df.iterrows():
    states_by_city[zip_area['city']].add(zip_area['state'])
    
# now states_by_city contains a set of states for each city-name, let's sort by length:
cities_by_size = sorted(states_by_city.items(), key=lambda x: len(x[1]), reverse=True)
for city in  cities_by_size[:5]:
    print('{}: {}'.format(city[0], city[1]))

Clinton: {'WA', 'MT', 'MS', 'SC', 'AR', 'ME', 'OK', 'LA', 'NJ', 'IA', 'OH', 'NC', 'MD', 'WI', 'NY', 'CT', 'MA', 'TN', 'IL', 'MN', 'MI', 'IN', 'PA', 'KY'}
Franklin: {'VA', 'WV', 'ID', 'AZ', 'AR', 'ME', 'AL', 'LA', 'NJ', 'VT', 'NH', 'NC', 'TX', 'WI', 'NY', 'MA', 'NE', 'MO', 'TN', 'MN', 'MI', 'IN', 'PA', 'KY'}
Madison: {'WV', 'KS', 'SC', 'MS', 'ME', 'NJ', 'FL', 'NH', 'OH', 'NC', 'WI', 'NY', 'CT', 'SD', 'GA', 'MO', 'NE', 'TN', 'IL', 'CA', 'MN', 'IN', 'PA'}
Arlington: {'VA', 'WA', 'KS', 'AZ', 'AL', 'CO', 'IA', 'VT', 'OH', 'TX', 'WI', 'NY', 'MA', 'SD', 'GA', 'NE', 'TN', 'IL', 'OR', 'MN', 'IN', 'KY'}
Greenville: {'VA', 'WV', 'UT', 'DE', 'SC', 'MS', 'ME', 'AL', 'IA', 'FL', 'OH', 'NC', 'TX', 'WI', 'NY', 'MO', 'IL', 'CA', 'RI', 'MI', 'IN', 'KY'}


We thus see that some cities (like Clinton, Franklin, Madison) exists in many different states!

Hence, we should group by city and state, or even better, add a column to the DataFrame containing both the city and the state-name.

Add a column `full_name` which contains the name of the city, following by a comma, followed by the code of the state. For example, the full name of the first row of the dataframe should be `"Agawam, MA"`.

In [107]:
def get(city,state):
    return city+', '+state
df['full_name']=get(df['city'],df['state'])
df.head()

Unnamed: 0,zip,city,longitude,latitude,pop,state,full_name
0,1001,Agawam,-72.622739,42.070206,15338,MA,"Agawam, MA"
1,1002,Cushman,-72.51565,42.377017,36963,MA,"Cushman, MA"
2,1005,Barre,-72.108354,42.409698,4546,MA,"Barre, MA"
3,1007,Belchertown,-72.410953,42.275103,10579,MA,"Belchertown, MA"
4,1008,Blandford,-72.936114,42.182949,1240,MA,"Blandford, MA"


Now calculate the population of every city in the USA and print out (only) the names and population of all cities with more than `1000000` inhabitants, ordered from the largest to the smallest. 

In [108]:
df3=df.groupby('full_name').sum();
df3=df3[df3['pop']>1000000]
df3.sort_values('pop',ascending=False).loc[:,'pop']

full_name
Chicago, IL         2452177
Brooklyn, NY        2300504
Los Angeles, CA     2102295
Houston, TX         2095918
Philadelphia, PA    1610956
New York, NY        1476790
Bronx, NY           1209548
San Diego, CA       1049298
Name: pop, dtype: int64

**Hint for problems 9-3 to 9-5:** you will need to either use two `groupby` functions or use some other methods of the `groupby`-object (documentation: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby, this kind of object is returned by the `groupby`-method of `DataFrame`).

### Problem 9-4:

Calculate the average population per city for each state:

In [110]:
df4=df;
df4['cnt']=1
df5=df4.groupby('state').sum()
df5['avg']=df5['pop']/df5['cnt']
df5.loc[:,'avg']

state
AK     2793.323077
AL     7126.255732
AR     4066.998270
AZ    13574.918519
CA    19627.236148
CO     7955.929952
CT    12498.539924
DC    25287.500000
DE    12569.207547
FL    15779.407960
GA    10201.914961
HI    13852.862500
IA     3011.301518
ID     4126.020492
IL     9238.137429
IN     8201.384615
KS     3461.937063
KY     4543.243511
LA     9089.644397
MA    12692.879747
MD    11384.235714
ME     2991.824390
MI    10611.069635
MN     4958.029478
MO     5141.496982
MS     7088.749311
MT     2544.420382
NC     9402.321986
ND     1632.409207
NE     2749.371080
NH     5088.311927
NJ    14315.162963
NM     5489.380435
NV    11556.086538
NY    11279.248903
OH    10771.119166
OK     5367.892491
OR     7401.877604
PA     8149.275034
RI    14539.391304
SC     9962.008571
SD     1810.929688
TN     8378.792096
TX    10164.333333
UT     8404.146341
VA     7575.341912
VT     2315.876543
WA    10055.148760
WI     6832.079609
WV     2733.454268
WY     3239.485714
Name: avg, dtype: float64

### Problem 9-5: largest city in each state
For each state, find the largest city in this state.

In [125]:
dd={}
def get_max(pop,state):
    if state not in dd.keys():
        dd[state]=pop;
    else:
        dd[state]=max(dd[state],pop)
df.apply(get_max)
#df[df['pop']>(get_max(df['state']))]

Unnamed: 0,zip,city,longitude,latitude,pop,state,full_name,cnt
0,1001,Agawam,-72.622739,42.070206,15338,MA,"Agawam, MA",1
1,1002,Cushman,-72.515650,42.377017,36963,MA,"Cushman, MA",1
2,1005,Barre,-72.108354,42.409698,4546,MA,"Barre, MA",1
3,1007,Belchertown,-72.410953,42.275103,10579,MA,"Belchertown, MA",1
4,1008,Blandford,-72.936114,42.182949,1240,MA,"Blandford, MA",1
5,1010,Brimfield,-72.188455,42.116543,3706,MA,"Brimfield, MA",1
6,1011,Chester,-72.988761,42.279421,1688,MA,"Chester, MA",1
7,1012,Chesterfield,-72.833309,42.381670,177,MA,"Chesterfield, MA",1
8,1013,Chicopee,-72.607962,42.162046,23396,MA,"Chicopee, MA",1
9,1020,Chicopee,-72.576142,42.176443,31495,MA,"Chicopee, MA",1
