# Project 1
# NYC Air Quality Analysis (PM 2.5)
This project uses the Air Quality dataset from NYC Open Data: https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data
It analyzes fine particulate matter (PM2.5), which is Indicator ID 365, to explore the air quality trend in New York City. The full dataset contains around 18,900 rows covering annual, seasonal, and borough-level measurements from 2009 to 2023.

**Goals:**
1. Compute the mean, median, and mode of PM2.5 levels using:
   - (a) Pandas 
   - (b) Pure Python
2. Create one visualization using shapes and symbols.
3. Interpret the findings and explain what the visualization shows.




In [3]:
pip install pandas


Collecting pandas
  Using cached pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Using cached numpy-2.3.4-cp313-cp313-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached pandas-2.3.3-cp313-cp313-macosx_11_0_arm64.whl (10.7 MB)
Using cached numpy-2.3.4-cp313-cp313-macosx_14_0_arm64.whl (5.1 MB)
Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB)
Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-2.3.4 pandas-2.3.3 pytz-2025.2 tzdata-2025.2

## Data Loading and Inspection
First, import pandas, load the CSV file, and inspect its columns and first rows.

In [19]:
import pandas as pd
df = pd.read_csv("NYC_Air_Quality.csv")
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       18862 non-null  int64  
 1   Indicator ID    18862 non-null  int64  
 2   Name            18862 non-null  object 
 3   Measure         18862 non-null  object 
 4   Measure Info    18862 non-null  object 
 5   Geo Type Name   18862 non-null  object 
 6   Geo Join ID     18862 non-null  int64  
 7   Geo Place Name  18862 non-null  object 
 8   Time Period     18862 non-null  object 
 9   Start_Date      18862 non-null  object 
 10  Data Value      18862 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB


|   | Unique ID | Indicator ID | Name | Measure | Measure Info | Geo Type Name | Geo Join ID | Geo Place Name | Time Period | Start_Date | Data Value | Message |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 878218 | 386 | Ozone (O3) | Mean | ppb | UHF42 | 402 | West Queens | Summer 2023 | 06/01/2023 | 34.365989 | NaN |
| 1 | 876975 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | UHF42 | 501 | Port Richmond | Summer 2023 | 06/01/2023 | 11.331992 | NaN |
| 2 | 876900 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | UHF42 | 207 | East Flatbush - Flatbush | Summer 2023 | 06/01/2023 | 12.020333 | NaN |
| 3 | 877140 | 375 | Nitrogen dioxide (NO2) | Mean | ppb | CD | 205 | Fordham and University Heights (CD5) | Summer 2023 | 06/01/2023 | 14.123178 | NaN |
| 4 | 874556 | 365 | Fine particles (PM 2.5) | Mean | mcg/m3 | UHF34 | 410 | Rockaways | Summer 2023 | 06/01/2023 | 8.150637 | NaN |


## Compute Mean, Median, and Mode (Pandas)
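Filter to **Indicator ID = 365 (Fine particles - PM2.5)**, keep only the rows whose measure is a mean, and compute the statistics on the `Data Value` column.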

In [20]:
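# Keep only the PM2.5 rows (Indicator ID 365)
pm = df[df["Indicator ID"] == 365]
# Keep only rows whose measure is a mean, if the column exists
if "Measure" in pm.columns:
    pm = pm[pm["Measure"].astype(str).str.contains("mean", case=False, na=False)]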

In [None]:
s = pd.to_numeric(pm["Data Value"], errors="coerce").dropna()

In [22]:
p_mean = s.mean()
p_median = s.median()
modes = s.mode()  # may hold several values when there are ties
p_mode = modes.iloc[0] if not modes.empty else None


In [23]:
print("PANDAS  mean / median / mode")
print(round(p_mean, 3), round(p_median, 3), None if p_mode is None else round(p_mode, 3))

PANDAS  mean / median / mode
9.045 8.76 10.35
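Because PM2.5 readings are continuous measurements, the mode depends on how many values happen to repeat exactly, so it can shift when the data are rounded. The following is a minimal sketch of that effect using only the standard library. The helper name `mode_with_count` and the sample numbers are made up for illustration and are not part of the dataset.

```python
from collections import Counter


def mode_with_count(values, ndigits=None):
    """Return (mode, count), optionally rounding the values before counting."""
    if ndigits is not None:
        values = [round(v, ndigits) for v in values]
    counts = Counter(values)
    top = max(counts.values())
    # On ties, pick the smallest value that reaches the top count.
    best = min(v for v, c in counts.items() if c == top)
    return best, top


# Hypothetical readings, not taken from the NYC dataset.
sample = [8.12, 8.14, 9.2, 9.2, 10.31, 10.33, 10.34]

exact_mode = mode_with_count(sample)        # counts exact repeats only
rounded_mode = mode_with_count(sample, 1)   # counts after rounding to 1 decimal
print(exact_mode, rounded_mode)
```

On this sample the exact mode is 9.2 (two repeats), while rounding to one decimal moves the mode to 10.3 (three values), so the reported mode of 10.35 is best read alongside the mean and median rather than alone.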


## Compute Mean, Median, and Mode (Hard Way)
In this part, I recreate the same calculations using only the Python standard library (`csv`) and hand-written functions for mean, median, and mode.

In [None]:
import csv

def read_pm25_values(csv_path, indicator_id=365):
    """Return the sorted "Data Value" floats for one indicator's mean measures."""
    values = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            try:
                if int(float(row["Indicator ID"])) != indicator_id:
                    continue
                if "Measure" in row and row["Measure"]:
                    if "mean" not in row["Measure"].lower():
                        continue
                v = float(row["Data Value"])
                values.append(v)
            except (ValueError, TypeError, KeyError):
                # skip bad or missing values
                continue
    values.sort()
    return values


In [None]:
def mean(vals):
    return sum(vals) / len(vals)

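def median(vals):
    # assumes vals is already sorted, as read_pm25_values returns it
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

def mode(vals):
    # frequency dict; on ties, returns the first value to reach the top count
    # (the smallest tied value when vals is sorted)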
    freq = {}
    best_val, best_count = None, 0
    for v in vals:
        freq[v] = freq.get(v, 0) + 1
        if freq[v] > best_count:
            best_val, best_count = v, freq[v]
    return best_val


In [None]:
CSV_FILE = "NYC_Air_Quality.csv"   # or your exact filename
PM25_ID = 365                      # PM2.5 indicator ID

vals = read_pm25_values(CSV_FILE, indicator_id=PM25_ID)
h_mean, h_median, h_mode = mean(vals), median(vals), mode(vals)

print("HARD WAY mean / median / mode")
print(round(h_mean, 3), round(h_median, 3), round(h_mode, 3))

HARD WAY mean / median / mode
9.045 8.76 10.35
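One way to gain extra confidence in the hand-written helpers is to compare them with the standard library's `statistics` module on a small list. The sketch below repeats the three helpers so that it runs on its own. The sample values are made up, and the comparison assumes Python 3.8 or later, where `statistics.mode` returns the first mode it encounters.

```python
import math
import statistics


def mean(vals):
    return sum(vals) / len(vals)


def median(vals):
    # Assumes vals is already sorted.
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2


def mode(vals):
    # On ties, returns the first value to reach the top count.
    freq = {}
    best_val, best_count = None, 0
    for v in vals:
        freq[v] = freq.get(v, 0) + 1
        if freq[v] > best_count:
            best_val, best_count = v, freq[v]
    return best_val


# Hypothetical readings, sorted the same way read_pm25_values sorts its output.
sample = sorted([7.4, 9.1, 9.1, 10.3, 6.8, 10.3])

checks = {
    "mean": math.isclose(mean(sample), statistics.mean(sample)),
    "median": math.isclose(median(sample), statistics.median(sample)),
    "mode": mode(sample) == statistics.mode(sample),
}
print(checks)
```

Note that both `mean` and `median` raise an error on an empty list, so a caller may want to check `len(vals)` first when a filter could match no rows.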


## Data Visualization
Use only Python’s `print()` function to draw a horizontal bar chart showing average PM2.5 by time period (annual averages and seasons). Each bar’s length is proportional to the value.

### Interpretation
The visualization below shows the average PM2.5 (fine particulate matter) levels in New York City from 2009 to 2023. Each ▮ bar represents the mean concentration in micrograms per cubic meter (µg/m³) for one time period, either an annual average or a season. The annual averages show a clear downward trend, which suggests that NYC’s air quality has improved substantially since 2009. For example, the annual average dropped from about 10.98 µg/m³ in 2009 to about 6.85 µg/m³ in 2023, although the decline is not strictly monotonic (2023 is slightly above the 6.07 µg/m³ recorded for 2022). Seasonal variations (higher in summer and winter) reflect differences in heating, traffic, and atmospheric conditions.
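The size of the decline can be expressed as a percent change. This is a small worked example that uses the two "Annual Average" values printed in the chart output for 2009 and 2023 (10.98 and 6.85). The function name `percent_change` is introduced here only for illustration.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100


annual_2009 = 10.98  # Annual Average 2009 from the chart output
annual_2023 = 6.85   # Annual Average 2023 from the chart output

change = percent_change(annual_2009, annual_2023)
print(f"Change 2009 -> 2023: {change:.1f}%")
```

The result is a drop of roughly 37.6 percent between the two annual averages.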

### Prepare data with pandas

In [None]:
by_year = (
    pm.groupby("Time Period")["Data Value"]
      .mean()
      .reset_index()
      .sort_values("Time Period")
)
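Sorting by the raw `Time Period` string orders the labels alphabetically, so every "Annual Average" row comes before every "Summer" row. If a chronological or annual-only view is wanted, the labels can be split into a kind and a starting year first. The sketch below does this with the standard library. The helper `parse_period` is hypothetical, and the "Winter 2009-10" label format is an assumption, since only annual and summer labels appear in the output shown in this notebook.

```python
import re


def parse_period(label):
    """Split a label such as 'Summer 2009' into (kind, start_year)."""
    match = re.search(r"\d{4}", label)
    if match is None:
        return label, None
    kind = label[:match.start()].strip()
    return kind, int(match.group())


# Example labels; the winter format is assumed, not confirmed by the output.
labels = ["Summer 2010", "Annual Average 2010", "Winter 2009-10", "Annual Average 2009"]

parsed = [parse_period(label) for label in labels]
# Keep only the annual rows and order them by year.
annual = sorted((p for p in parsed if p[0] == "Annual Average"), key=lambda p: p[1])
print(annual)
```

The same idea could be applied to the `by_year` table by filtering on labels that start with "Annual Average" before plotting, which would give one bar per year.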


### Visualization using only the standard library


In [None]:

years = by_year["Time Period"].astype(str).tolist()
vals = by_year["Data Value"].astype(float).tolist()

max_val = max(vals) if vals else 1.0
scale = 40 / max_val  # control bar width

print("Average PM2.5 (Indicator ID 365) by Year")
print("(Each ▮ bar’s length is proportional to the mean PM2.5 in µg/m³)")
for year, val in zip(years, vals):
    bar = "▮" * int(val * scale)
    print(f"{year:>4}: {bar} {val:.2f}")


Average PM2.5 (Indicator ID 365) by Year
(Each ▮ bar’s length is proportional to the mean PM2.5 in µg/m³)
Annual Average 2009: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 10.98
Annual Average 2010: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 10.07
Annual Average 2011: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 10.59
Annual Average 2012: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 9.44
Annual Average 2013: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 9.14
Annual Average 2014: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 9.40
Annual Average 2015: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 9.04
Annual Average 2016: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 7.88
Annual Average 2017: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 7.74
Annual Average 2018: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 7.38
Annual Average 2019: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 7.01
Annual Average 2020: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 6.32
Annual Average 2021: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 6.76
Annual Average 2022: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 6.07
Annual Average 2023: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 6.85
Summer 2009: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 11.10
Summer 2010: ▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮▮ 12.21
Summer 2011: ▮▮▮▮▮▮▮
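The labels in the output above have different lengths, so the bars start at different columns and are harder to compare by eye. A possible refinement, sketched below under the assumption that labels and values arrive as two parallel lists, pads every label to the width of the longest one. The function `render_bars` and its parameters are illustrative names, not part of the original notebook.

```python
def render_bars(labels, values, width=40, symbol="▮"):
    """Return chart lines with right-aligned labels and proportional bars."""
    if not values:
        return []
    max_val = max(values)
    scale = width / max_val if max_val > 0 else 0
    label_width = max(len(label) for label in labels)
    lines = []
    for label, value in zip(labels, values):
        bar = symbol * int(value * scale)
        lines.append(f"{label:>{label_width}}: {bar} {value:.2f}")
    return lines


# Two rows taken from the chart output above.
for line in render_bars(["Annual Average 2009", "Summer 2009"], [10.98, 11.10]):
    print(line)
```

Returning a list of strings instead of printing directly keeps the drawing logic easy to check, while the final loop still uses only `print()`.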

## Conclusion
In this project, I used the New York City Air Quality dataset from NYC Open Data to measure the PM2.5 air quality trend. The analysis met all technical requirements by computing the mean, median, and mode through both pandas and pure Python methods, which agree (9.045, 8.76, and 10.35). The visualization used only Python’s `print()` function to generate a text-based bar chart, where the length of each bar represents the average PM2.5 concentration for a time period. The results show a declining PM2.5 trend, which is consistent with stricter emission regulations, although this analysis does not test that cause. Overall, this project helped me get familiar with using pandas and pure Python (the hard way) to filter a large dataset. It also gave me insight into simple visualization.