# New York Air Quality

## Read dataset

In [13]:
import pandas as pd

df = pd.read_csv('./data/Air_Quality.csv', sep=',', decimal='.')

df.head(5)


| | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 336867 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2014-15 | 12/01/2014 | 23.97 | |
| 1 | 336741 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 107 | Upper West Side (CD7) | Winter 2014-15 | 12/01/2014 | 27.42 | |
| 2 | 550157 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 414 | Rockaway and Broad Channel (CD14) | Annual Average 2017 | 01/01/2017 | 12.55 | |
| 3 | 412802 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Winter 2015-16 | 12/01/2015 | 22.63 | |
| 4 | 412803 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 407 | Flushing and Whitestone (CD7) | Summer 2016 | 06/01/2016 | 14.0 | |


## Investigating columns

### Data classification

The dataset is not homogeneous: it mixes several kinds of indicators, such as

- Disease frequency (asthma, cardiac and respiratory deaths, etc.)
- Pollutant concentrations in the air (O3, NO2, PM2.5, benzene)
- Pollutant emissions (SO2, NOx, PM2.5)

In [14]:
df['Name']\
    .groupby(df['Name'])\
    .count()\
    .sort_values(ascending=False)

Name
Nitrogen dioxide (NO2)                                    6345
Fine particles (PM 2.5)                                   6345
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           480
Asthma hospitalizations due to Ozone                       480
Asthma emergency department visits due to PM2.5            480
Annual vehicle miles traveled (cars)                       321
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Cardiac and respiratory deaths due to Ozone                240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Deaths due to PM2.5                                        240
Outdoor Air Toxics - Formaldehyde                          203
Outdoor Air Toxics - Benzene                               203
Boiler Emissions- Total SO2 Emissions             

### Measure strategies

Note that this dataset uses several different measurement strategies.

Before performing any operation, make sure that all the rows involved share the same `Measure` and the same `Measure Info`.

In [15]:
df['Measure']\
    .groupby(df['Measure'])\
    .count()\
    .sort_values(ascending=False)

Measure
Mean                                    14805
Million miles                             963
Estimated annual rate                     720
Estimated annual rate (age 18+)           720
Estimated annual rate (under age 18)      720
Annual average concentration              406
Number per km2                            288
Estimated annual rate (age 30+)           240
Name: Measure, dtype: int64

In [16]:
df['Measure Info']\
    .groupby(df['Measure Info'])\
    .count()\
    .sort_values(ascending=False)

Measure Info
ppb                     8460
mcg/m3                  6345
per 100,000 adults      1440
per square mile          963
per 100,000 children     720
µg/m3                    406
number                   288
per 100,000              240
Name: Measure Info, dtype: int64
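
As a rough illustration of the caution above, the sketch below aggregates `Data Value` only within groups that share `Name`, `Measure`, and `Measure Info`. The `toy` frame, its values, and the way it pairs measures with units are invented for the example; only the column names come from the dataset.

```python
import pandas as pd

# Invented rows that mimic the dataset's columns; the values are not real data.
toy = pd.DataFrame({
    'Name': ['Nitrogen dioxide (NO2)', 'Nitrogen dioxide (NO2)',
             'Deaths due to PM2.5', 'Deaths due to PM2.5'],
    'Measure': ['Mean', 'Mean',
                'Estimated annual rate (age 30+)',
                'Estimated annual rate (age 30+)'],
    'Measure Info': ['ppb', 'ppb', 'per 100,000 adults', 'per 100,000 adults'],
    'Data Value': [20.0, 30.0, 40.0, 60.0],
})

# Group on the full measurement definition first, so that concentrations
# in ppb are never averaged together with rates per 100,000.
mean_by_measure = (
    toy.groupby(['Name', 'Measure', 'Measure Info'])['Data Value']
       .mean()
)

print(mean_by_measure)
```

Applied to the real `df`, the same `groupby` keys should keep each average within a single unit.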

### Geotypes

Each geotype in this dataset is explained below:

- UHF42 (United Hospital Fund) refers to a set of 42 public health areas
- CD refers to a community district
- UHF34 follows the same scheme as UHF42, but with 34 health areas
- Borough refers to one of New York City's five boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island)
- Citywide refers to New York City as a whole

In [17]:
df['Geo Type Name']\
    .groupby(df['Geo Type Name'])\
    .count()\
    .sort_values(ascending=False)

Geo Type Name
UHF42       7392
CD          6844
UHF34       3570
Borough      880
Citywide     176
Name: Geo Type Name, dtype: int64

In [18]:
df[df['Geo Type Name'] == 'Borough']['Geo Place Name'].unique()

array(['Queens', 'Manhattan', 'Bronx', 'Brooklyn', 'Staten Island'],
      dtype=object)
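
Since Borough and Citywide rows cover larger areas than UHF42 or CD rows, it may help to restrict a comparison to one geotype at a time. The sketch below shows one way to do that. The helper `only_geotype`, the `toy` frame, its values, and the place name `'New York City'` are assumptions made for the example.

```python
import pandas as pd

# Invented rows; only the column names and geotype labels come from the dataset.
toy = pd.DataFrame({
    'Geo Type Name': ['UHF42', 'UHF42', 'Borough', 'Citywide'],
    'Geo Place Name': ['Willowbrook', 'Bayside - Little Neck',
                       'Queens', 'New York City'],
    'Data Value': [4.7, 1.2, 3.0, 3.5],
})


def only_geotype(frame, geotype):
    """Return a copy of the rows whose 'Geo Type Name' equals `geotype`."""
    return frame[frame['Geo Type Name'] == geotype].copy()


uhf42 = only_geotype(toy, 'UHF42')
print(uhf42['Geo Place Name'].tolist())
```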

## Study cases

### Asthma hospitalizations due to Ozone

In [19]:
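cols = ['Name', 'Measure', 'Geo Place Name', 'Geo Type Name', 'Data Value']
sample = df.loc[df['Name'] == 'Asthma hospitalizations due to Ozone', cols]

# Reassign instead of sorting in place, which can raise SettingWithCopyWarning on a filtered frame
sample = sample.sort_values(by='Geo Place Name', ascending=False)

sample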

| | Name | Measure | Geo Place Name | Geo Type Name | Data Value |
|---|---|---|---|---|---|
| 3087 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Willowbrook | UHF42 | 4.7 |
| 4980 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Willowbrook | UHF42 | 3.8 |
| 8005 | Asthma hospitalizations due to Ozone | Estimated annual rate (under age 18) | Willowbrook | UHF42 | 5.6 |
| 8008 | Asthma hospitalizations due to Ozone | Estimated annual rate (under age 18) | Willowbrook | UHF42 | 7.1 |
| 8343 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Willowbrook | UHF42 | 3.6 |
| ... | ... | ... | ... | ... | ... |
| 15561 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Bayside - Little Neck | UHF42 | 1.2 |
| 9122 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Bayside - Little Neck | UHF42 | 1.8 |
| 14104 | Asthma hospitalizations due to Ozone | Estimated annual rate (under age 18) | Bayside - Little Neck | UHF42 | 8.3 |
| 18803 | Asthma hospitalizations due to Ozone | Estimated annual rate (age 18+) | Bayside - Little Neck | UHF42 | 0.0 |


In [20]:
sample['Measure'].groupby(sample['Measure']).count().sort_values(ascending=False)

Measure
Estimated annual rate (age 18+)         240
Estimated annual rate (under age 18)    240
Name: Measure, dtype: int64

In [21]:
sample['Geo Place Name'].groupby(sample['Geo Place Name']).count().sort_values(ascending=False)

Geo Place Name
Bayside - Little Neck                   10
Bedford Stuyvesant - Crown Heights      10
Bensonhurst - Bay Ridge                 10
Borough Park                            10
Bronx                                   10
Brooklyn                                10
Canarsie - Flatlands                    10
Central Harlem - Morningside Heights    10
Chelsea - Clinton                       10
Coney Island - Sheepshead Bay           10
Crotona -Tremont                        10
Downtown - Heights - Slope              10
East Flatbush - Flatbush                10
East Harlem                             10
East New York                           10
Flushing - Clearview                    10
Fordham - Bronx Pk                      10
Fresh Meadows                           10
Gramercy Park - Murray Hill             10
Greenpoint                              10
Greenwich Village - SoHo                10
High Bridge - Morrisania                10
Hunts Point - Mott Haven               

In [22]:
sample['Geo Type Name'].groupby(sample['Geo Type Name']).count().sort_values(ascending=False)

Geo Type Name
UHF42       420
Borough      50
Citywide     10
Name: Geo Type Name, dtype: int64

## Analysis suggestions

Filter the indicators related to hospitalizations and deaths, then perform a comparative analysis that highlights the following points.

### Comparison between pollutants

- PM2.5 vs Ozone: Which one seems to have a greater impact (hospitalizations or deaths) in specific areas?
- Do neighborhoods with high asthma hospitalization rates (Ozone) also have high cardiac hospitalization rates (PM2.5)?
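
One possible way to set the two pollutants side by side is to pivot the data so that each area becomes a row and each indicator a column. The sketch below uses an invented `toy` frame with made-up areas and values. The two indicators cover different age groups, so the difference it computes is only a rough signal.

```python
import pandas as pd

ozone = 'Asthma hospitalizations due to Ozone'
pm25 = 'Respiratory hospitalizations due to PM2.5 (age 20+)'

# Invented rows; 'Area A' and 'Area B' are placeholders, not real places.
toy = pd.DataFrame({
    'Name': [ozone, pm25, ozone, pm25],
    'Geo Place Name': ['Area A', 'Area A', 'Area B', 'Area B'],
    'Data Value': [4.0, 10.0, 9.0, 6.0],
})

# One row per area, one column per indicator (mean over any repeated rows).
wide = toy.pivot_table(index='Geo Place Name', columns='Name',
                       values='Data Value', aggfunc='mean')

# Positive values: the PM2.5 indicator is higher than the Ozone one.
wide['PM2.5 minus Ozone'] = wide[pm25] - wide[ozone]

print(wide['PM2.5 minus Ozone'])
```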

### Total Health Impact

- For each neighborhood or district, sum up the health impacts (hospitalizations and deaths) to understand which areas suffer the most overall from air pollution.
- Create a disease burden index related to air pollution (like a score).
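
A burden score can be defined in many ways. The sketch below shows one simple option, assumed here only for illustration: standardize each indicator (z-score) and average the results per area. The `rates` frame, its areas, and its values are invented.

```python
import pandas as pd

# Invented per-area rates, already pivoted to one column per indicator.
rates = pd.DataFrame(
    {
        'Asthma hospitalizations due to Ozone': [2.0, 4.0, 6.0],
        'Deaths due to PM2.5': [50.0, 30.0, 40.0],
    },
    index=['Area A', 'Area B', 'Area C'],
)

# Standardize each column so indicators on different scales weigh equally.
z = (rates - rates.mean()) / rates.std()

# Burden score: mean z-score across indicators, highest burden first.
burden = z.mean(axis=1).sort_values(ascending=False)

print(burden)
```

An unweighted mean treats a hospitalization indicator and a death indicator as equally important. A weighted average would be a natural variation.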

### Vulnerability Analysis

- Cross-reference the data to identify highly vulnerable areas, that is, neighborhoods that rank high across several health outcomes (asthma, cardiac hospitalizations, respiratory deaths).
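
A minimal sketch of this idea: rank the areas within each outcome and count how often each area lands in the top `top_k`. The thresholds (`top_k = 2`, at least two outcomes), the `rates` frame, and its values are assumptions made for the example.

```python
import pandas as pd

# Invented per-area rates, one column per health outcome.
rates = pd.DataFrame(
    {
        'Asthma hospitalizations due to Ozone': [8.0, 3.0, 6.0, 1.0],
        'Cardiovascular hospitalizations due to PM2.5 (age 40+)': [20.0, 25.0, 22.0, 5.0],
        'Cardiac and respiratory deaths due to Ozone': [9.0, 2.0, 7.0, 3.0],
    },
    index=['Area A', 'Area B', 'Area C', 'Area D'],
)

# Rank within each outcome; rank 1 is the highest rate.
ranks = rates.rank(ascending=False, method='min')

# Count the outcomes in which each area is among the top_k.
top_k = 2
times_in_top = (ranks <= top_k).sum(axis=1)

# Flag areas that are in the top_k for at least two outcomes.
vulnerable = times_in_top[times_in_top >= 2].index.tolist()

print(vulnerable)
```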

### Correlation Analysis

- Check whether PM2.5-related hospitalizations correlate with Ozone-related deaths.
- Check whether areas with many asthma cases also experience more respiratory deaths.
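
Once the indicators are pivoted into one column each, a correlation check could look like the sketch below. It uses `DataFrame.corr`, which computes the Pearson coefficient by default. The `rates` frame and its values are invented.

```python
import pandas as pd

pm25_hosp = 'Respiratory hospitalizations due to PM2.5 (age 20+)'
ozone_deaths = 'Cardiac and respiratory deaths due to Ozone'

# Invented per-area rates, one row per area.
rates = pd.DataFrame(
    {
        pm25_hosp: [1.0, 2.0, 3.0, 4.0],
        ozone_deaths: [2.0, 4.0, 5.0, 9.0],
    },
    index=['Area A', 'Area B', 'Area C', 'Area D'],
)

# Pairwise Pearson correlation between the indicator columns.
corr = rates.corr()
r = corr.loc[pm25_hosp, ozone_deaths]

print(round(r, 3))
```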

### Pollution Profile by Region

Create profiles for neighborhoods, for example:
- Neighborhood X has more problems with Ozone and asthma
- Neighborhood Y has more problems with PM2.5 and cardiac hospitalizations
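
One way such profiles might be derived, sketched under stated assumptions: standardize each indicator, then label every neighborhood with the indicator on which it stands out most. The `rates` frame, the neighborhood labels, and the values are invented.

```python
import pandas as pd

asthma_ozone = 'Asthma hospitalizations due to Ozone'
cardio_pm25 = 'Cardiovascular hospitalizations due to PM2.5 (age 40+)'

# Invented per-neighborhood rates, one column per indicator.
rates = pd.DataFrame(
    {
        asthma_ozone: [9.0, 3.0, 6.0],
        cardio_pm25: [10.0, 30.0, 26.0],
    },
    index=['Neighborhood X', 'Neighborhood Y', 'Neighborhood Z'],
)

# Standardize so the indicators are comparable despite different scales.
z = (rates - rates.mean()) / rates.std()

# For each neighborhood, the indicator with the highest z-score.
profile = z.idxmax(axis=1)

print(profile)
```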