# New York Air Quality

## Read dataset

In [27]:
import pandas as pd

df = pd.read_csv('./data/Air_Quality.csv', sep=',', decimal='.')

df.head(5)


Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,336867,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Winter 2014-15,12/01/2014,23.97,
1,336741,375,Nitrogen dioxide (NO2),Mean,ppb,CD,107,Upper West Side (CD7),Winter 2014-15,12/01/2014,27.42,
2,550157,375,Nitrogen dioxide (NO2),Mean,ppb,CD,414,Rockaway and Broad Channel (CD14),Annual Average 2017,01/01/2017,12.55,
3,412802,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Winter 2015-16,12/01/2015,22.63,
4,412803,375,Nitrogen dioxide (NO2),Mean,ppb,CD,407,Flushing and Whitestone (CD7),Summer 2016,06/01/2016,14.0,


## Investigating columns

### Data classification

The dataset is not homogeneous: it mixes several kinds of indicators, such as:

- Disease frequency (asthma, cardiac and respiratory deaths, etc.)
- Pollutant concentrations in the air (O3, NO2, PM2.5, benzene)
- Pollutant emissions (SO2, NOx, PM2.5)
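
One way to make this classification explicit is to tag each indicator name with a coarse category. The sketch below is only an illustration: the keyword table and the `classify_indicator` helper are hypothetical, not part of the dataset, and the category boundaries are an assumption based on the indicator names listed in this notebook.

```python
import pandas as pd

# Hypothetical keyword-to-category table, checked in order; not part of the dataset.
CATEGORY_KEYWORDS = [
    ("health outcome", ("asthma", "deaths", "hospitalizations")),
    ("emission", ("emissions",)),
    ("traffic", ("vehicle miles",)),
]


def classify_indicator(name: str) -> str:
    """Return a coarse category for an indicator name (illustrative only)."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    # Anything that matches no keyword is treated as an air concentration.
    return "air concentration"


# A few indicator names taken from the 'Name' column.
names = pd.Series([
    "Nitrogen dioxide (NO2)",
    "Asthma hospitalizations due to Ozone",
    "Boiler Emissions- Total SO2 Emissions",
    "Annual vehicle miles traveled (cars)",
    "Outdoor Air Toxics - Benzene",
])
categories = names.map(classify_indicator)
print(pd.DataFrame({"Name": names, "Category": categories}))
```

On the real data, the same mapping could be applied with `df['Name'].map(classify_indicator)`, after checking that the keywords cover every indicator name.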

In [28]:
df['Name']\
    .groupby(df['Name'])\
    .count()\
    .sort_values(ascending=False)

Name
Nitrogen dioxide (NO2)                                    6345
Fine particles (PM 2.5)                                   6345
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           480
Asthma hospitalizations due to Ozone                       480
Asthma emergency department visits due to PM2.5            480
Annual vehicle miles traveled (cars)                       321
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Cardiac and respiratory deaths due to Ozone                240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Deaths due to PM2.5                                        240
Outdoor Air Toxics - Formaldehyde                          203
Outdoor Air Toxics - Benzene                               203
Boiler Emissions- Total SO2 Emissions             

### Measure strategies

Note that this dataset uses several different measures (means, annual rates, counts per area, and so on). Make sure that any operation combines only rows that share the same measure.
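
As a minimal sketch of that precaution, a small guard can refuse to aggregate a subset that mixes measures. The `require_single_measure` helper is hypothetical, and the toy rows below only mimic the dataset's columns; their values are made up.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns; the values are made up.
toy = pd.DataFrame({
    "Name": ["Ozone (O3)", "Ozone (O3)", "Deaths due to PM2.5"],
    "Measure": ["Mean", "Mean", "Estimated annual rate (age 30+)"],
    "Data Value": [30.0, 32.0, 50.0],
})


def require_single_measure(frame: pd.DataFrame) -> str:
    """Return the only measure present in `frame`, or raise if measures are mixed."""
    measures = frame["Measure"].unique()
    if len(measures) != 1:
        raise ValueError(f"Mixed measures: {sorted(measures)}")
    return measures[0]


# Restrict to one indicator first, then confirm the measure is unique.
ozone = toy[toy["Name"] == "Ozone (O3)"]
measure = require_single_measure(ozone)
ozone_mean = ozone["Data Value"].mean()
print(measure, ozone_mean)
```

Calling `require_single_measure(toy)` on the unfiltered toy frame raises a `ValueError`, which is the situation this section warns about.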

In [29]:
df['Measure']\
    .groupby(df['Measure'])\
    .count()\
    .sort_values(ascending=False)

Measure
Mean                                    14805
Million miles                             963
Estimated annual rate                     720
Estimated annual rate (age 18+)           720
Estimated annual rate (under age 18)      720
Annual average concentration              406
Number per km2                            288
Estimated annual rate (age 30+)           240
Name: Measure, dtype: int64
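
The `groupby(...).count().sort_values(ascending=False)` chain used for these counts can usually be shortened with `Series.value_counts()`, which counts occurrences and sorts them in descending order by default. The sketch below compares both forms on a made-up toy column; the order of tied values and the name of the resulting Series may differ between the two, depending on the pandas version.

```python
import pandas as pd

# Toy column with made-up values, standing in for df['Measure'].
toy = pd.DataFrame({"Measure": ["Mean", "Mean", "Million miles", "Mean"]})

# Form used in this notebook: group the column by itself, count, then sort.
grouped = toy["Measure"].groupby(toy["Measure"]).count().sort_values(ascending=False)

# Shorter form: value_counts() counts and sorts in descending order by default.
counted = toy["Measure"].value_counts()

print(counted)
```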

### Geotypes

Each geotype in this dataset is explained below:

- UHF42 (United Hospital Fund) refers to a set of 42 public health areas
- CD refers to a community district
- UHF34 is the same as UHF42, but with 34 health areas
- Borough refers to one of New York City's five boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island)
- Citywide refers to New York City as a whole
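
Because these geotypes overlap (a borough contains several community districts, and the citywide rows cover everything), it is usually safer to pick one geotype before comparing places. The sketch below shows the idea on toy rows; the place names and values are made up for illustration.

```python
import pandas as pd

# Toy rows that mimic the dataset's columns; places and values are made up.
toy = pd.DataFrame({
    "Geo Type Name": ["Borough", "Borough", "Citywide", "UHF42"],
    "Geo Place Name": ["Bronx", "Queens", "New York City", "Willowbrook"],
    "Data Value": [20.0, 18.0, 19.0, 17.0],
})

# Keep a single geotype so that overlapping areas are not mixed together.
boroughs = toy[toy["Geo Type Name"] == "Borough"]

# Aggregate per place within that geotype only.
by_place = boroughs.groupby("Geo Place Name")["Data Value"].mean()
print(by_place)
```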

In [31]:
df['Geo Type Name']\
    .groupby(df['Geo Type Name'])\
    .count()\
    .sort_values(ascending=False)

Geo Type Name
UHF42       7392
CD          6844
UHF34       3570
Borough      880
Citywide     176
Name: Geo Type Name, dtype: int64

## Case studies

### Cardiovascular hospitalizations due to PM2.5 (age 40+)

In [32]:
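# Columns relevant to this case study
columns = ['Name', 'Measure', 'Geo Place Name', 'Geo Type Name', 'Data Value']

# Select the indicator with .loc, then sort without mutating a slice of df in place
sample = df.loc[df['Name'] == 'Cardiovascular hospitalizations due to PM2.5 (age 40+)', columns]\
    .sort_values(by='Geo Place Name', ascending=False)

sample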

Unnamed: 0,Name,Measure,Geo Place Name,Geo Type Name,Data Value
14237,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Willowbrook,UHF42,11.000000
11313,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Willowbrook,UHF42,17.800000
11064,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Willowbrook,UHF42,14.048749
11063,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Willowbrook,UHF42,16.800000
14849,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Willowbrook,UHF42,27.100000
...,...,...,...,...,...
14130,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Bayside - Little Neck,UHF42,12.500000
18458,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Bayside - Little Neck,UHF42,3.000000
18459,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Bayside - Little Neck,UHF42,18.300000
14114,Cardiovascular hospitalizations due to PM2.5 (...,Estimated annual rate,Bayside - Little Neck,UHF42,11.700000


In [33]:
sample['Measure'].groupby(sample['Measure']).count().sort_values(ascending=False)

Measure
Estimated annual rate    240
Name: Measure, dtype: int64

In [34]:
sample['Geo Place Name'].groupby(sample['Geo Place Name']).count().sort_values(ascending=False)

Geo Place Name
Bayside - Little Neck                   5
Bedford Stuyvesant - Crown Heights      5
Bensonhurst - Bay Ridge                 5
Borough Park                            5
Bronx                                   5
Brooklyn                                5
Canarsie - Flatlands                    5
Central Harlem - Morningside Heights    5
Chelsea - Clinton                       5
Coney Island - Sheepshead Bay           5
Crotona -Tremont                        5
Downtown - Heights - Slope              5
East Flatbush - Flatbush                5
East Harlem                             5
East New York                           5
Flushing - Clearview                    5
Fordham - Bronx Pk                      5
Fresh Meadows                           5
Gramercy Park - Murray Hill             5
Greenpoint                              5
Greenwich Village - SoHo                5
High Bridge - Morrisania                5
Hunts Point - Mott Haven                5
Jamaica            