# NYC Air Quality Data Analysis
This project analyzes an air quality dataset from NYC Open Data.  
The dataset contains air pollution measurements across NYC geographies (boroughs, neighborhoods, and community districts) and time periods.  
We will calculate the mean, median, and mode of the numeric column `Data Value` using both pandas and the Python standard library.

In [2]:
import pandas as pd

# Load dataset
df = pd.read_csv("airquality.csv")

# Preview
df.head()

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,878218,386,Ozone (O3),Mean,ppb,UHF42,402,West Queens,Summer 2023,06/01/2023,34.365989,
1,876975,375,Nitrogen dioxide (NO2),Mean,ppb,UHF42,501,Port Richmond,Summer 2023,06/01/2023,11.331992,
2,876900,375,Nitrogen dioxide (NO2),Mean,ppb,UHF42,207,East Flatbush - Flatbush,Summer 2023,06/01/2023,12.020333,
3,877140,375,Nitrogen dioxide (NO2),Mean,ppb,CD,205,Fordham and University Heights (CD5),Summer 2023,06/01/2023,14.123178,
4,874556,365,Fine particles (PM 2.5),Mean,mcg/m3,UHF34,410,Rockaways,Summer 2023,06/01/2023,8.150637,


## 1. Data Overview

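The dataset contains air pollution measurements by geographic area and time period.  
Each record includes fields such as:
- `Geo Type Name`: the kind of geography (e.g., Borough, Citywide, or codes such as `UHF42` and `CD` seen in the preview).
- `Geo Place Name`: the name of the area (e.g., Manhattan, Bronx, West Queens).
- `Data Value`: the measured value, in the unit given by `Measure Info` (`ppb` or `mcg/m3` in the preview).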

Let's check the number of rows and columns.

In [3]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18862 entries, 0 to 18861
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       18862 non-null  int64  
 1   Indicator ID    18862 non-null  int64  
 2   Name            18862 non-null  object 
 3   Measure         18862 non-null  object 
 4   Measure Info    18862 non-null  object 
 5   Geo Type Name   18862 non-null  object 
 6   Geo Join ID     18862 non-null  int64  
 7   Geo Place Name  18862 non-null  object 
 8   Time Period     18862 non-null  object 
 9   Start_Date      18862 non-null  object 
 10  Data Value      18862 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(2), int64(3), object(7)
memory usage: 1.7+ MB


(18862, 12)
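
The preview rows suggest that `Geo Type Name` and `Measure Info` each take several distinct values. Below is a minimal sketch, using only the standard library, of how one might tally them. The inline `SAMPLE_CSV` just copies the five preview rows shown earlier (trimmed to a few columns) so the snippet runs on its own, and `tally` is a helper name introduced here. For the real file, one would pass `open("airquality.csv", newline="", encoding="utf-8")` instead.

```python
import csv
import io
from collections import Counter

# Five preview rows from df.head(), trimmed to the columns used here.
SAMPLE_CSV = """Name,Measure Info,Geo Type Name,Geo Place Name,Data Value
Ozone (O3),ppb,UHF42,West Queens,34.365989
Nitrogen dioxide (NO2),ppb,UHF42,Port Richmond,11.331992
Nitrogen dioxide (NO2),ppb,UHF42,East Flatbush - Flatbush,12.020333
Nitrogen dioxide (NO2),ppb,CD,Fordham and University Heights (CD5),14.123178
Fine particles (PM 2.5),mcg/m3,UHF34,Rockaways,8.150637
"""

def tally(column, source):
    """Count how often each distinct value appears in one CSV column."""
    reader = csv.DictReader(source)
    return Counter(row[column] for row in reader)

geo_types = tally("Geo Type Name", io.StringIO(SAMPLE_CSV))
units = tally("Measure Info", io.StringIO(SAMPLE_CSV))

print(geo_types.most_common())
print(units.most_common())
```

Even in these five rows there are three geography types and two units, which is worth keeping in mind when reading a single overall average of `Data Value`.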

## 2. Why This Dataset?

This dataset is a good fit because:
- It has **enough rows** (18,862, well over 1,000).
- It includes a **numeric column** (`Data Value`) suitable for calculating mean, median, and mode.
- The results are **meaningful** for real-world interpretation: the average pollution level gives a rough picture of overall air quality in NYC.

In [4]:
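# Drop rows with a missing Data Value (df.info() shows none, so this is a safeguard)
df = df.dropna(subset=["Data Value"])

# Compute statistics
mean_val = df["Data Value"].mean()
median_val = df["Data Value"].median()
# mode() returns all tied values in sorted order; keep the first (smallest)
mode_val = df["Data Value"].mode().iloc[0]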

print("Mean:", mean_val)
print("Median:", median_val)
print("Mode:", mode_val)

Mean: 21.05158015629472
Median: 14.79
Mode: 2.0
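
Because `Data Value` mixes indicators measured in different units (the preview shows both `ppb` and `mcg/m3`), a single overall mean blends quantities that are not directly comparable. As a hedged sketch, one could compute the same statistics per indicator. The `rows` list below is hypothetical scaffolding built from the five preview rows; on the full data, the pandas equivalent would be along the lines of `df.groupby("Name")["Data Value"].mean()`.

```python
from collections import defaultdict
from statistics import mean, median

# (indicator name, value) pairs copied from the df.head() preview.
rows = [
    ("Ozone (O3)", 34.365989),
    ("Nitrogen dioxide (NO2)", 11.331992),
    ("Nitrogen dioxide (NO2)", 12.020333),
    ("Nitrogen dioxide (NO2)", 14.123178),
    ("Fine particles (PM 2.5)", 8.150637),
]

# Group values by indicator so each summary covers a single unit.
by_indicator = defaultdict(list)
for name, value in rows:
    by_indicator[name].append(value)

summary = {
    name: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
    for name, vals in by_indicator.items()
}

for name, stats in summary.items():
    print(f"{name:<25} n={stats['n']} mean={stats['mean']:.3f} median={stats['median']:.3f}")
```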


## 3. Results Using pandas

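The mean, median, and mode each describe the **central tendency** of the recorded pollution values.  
- **Mean** (21.05) is the arithmetic average of all values.  
- **Median** (14.79) is the middle value and is much less sensitive to extreme outliers. The mean sitting well above the median suggests a right-skewed distribution, where a minority of high readings pull the average up.  
- **Mode** (2.0) is the most frequently recorded value. For continuous measurements it depends heavily on how values are rounded, so it should be read with caution.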

These metrics help understand typical air quality across New York City.
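
To make the difference between the three measures concrete, here is a small illustrative example using the standard library's `statistics` module. The `readings` values are invented for illustration and are not taken from the dataset.

```python
from statistics import mean, median, multimode

# Hypothetical readings: mostly low values plus one extreme outlier.
readings = [2.0, 2.0, 8.0, 12.0, 15.0, 18.0, 90.0]

print("Mean:", mean(readings))          # pulled upward by the 90.0 outlier
print("Median:", median(readings))      # middle value of the sorted list
print("Mode(s):", multimode(readings))  # most frequent value(s)

# Dropping the outlier changes the mean far more than the median.
trimmed = readings[:-1]
print("Mean without outlier:", mean(trimmed))
print("Median without outlier:", median(trimmed))
```

In this toy sample the mean is 21.0 and the median is 12.0. Removing the single outlier moves the mean to 9.5 but the median only to 10.0, the same kind of gap seen between the dataset's mean and median.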

In [5]:
import csv

values = []
with open("airquality.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        try:
            val = float(row["Data Value"])
            values.append(val)
        except ValueError:
            continue

# Mean
n = len(values)
mean_val_py = sum(values) / n

# Median
sorted_vals = sorted(values)
if n % 2 == 1:
    median_val_py = sorted_vals[n // 2]
else:
    median_val_py = (sorted_vals[n // 2 - 1] + sorted_vals[n // 2]) / 2

# Mode
counts = {}
for v in values:
    counts[v] = counts.get(v, 0) + 1

max_count = max(counts.values())
modes_py = [v for v, c in counts.items() if c == max_count]

print("Mean:", mean_val_py)
print("Median:", median_val_py)
print("Mode(s):", modes_py)

Mean: 21.05158015629467
Median: 14.79
Mode(s): [2.0]
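
One way to sanity-check a manual implementation like this is to compare it against the standard library's `statistics` module on a small input. The sketch below repackages the same logic as functions; the names `manual_median` and `manual_modes` and the `sample` list are assumptions introduced here, not part of the notebook.

```python
import statistics

def manual_median(values):
    """Median via sorting, mirroring the odd/even branch above."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def manual_modes(values):
    """All values that share the highest frequency, in first-seen order."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

# Hypothetical sample with an even length and a two-way tie for the mode.
sample = [3.0, 1.0, 2.0, 3.0, 1.0, 4.0]

assert manual_median(sample) == statistics.median(sample)
assert manual_modes(sample) == statistics.multimode(sample)
print(manual_median(sample), manual_modes(sample))
```

Note that the manual version keeps every tied mode, whereas taking only the first element of pandas' `mode()` result keeps just the smallest one.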


## 4. Comparing pandas vs. Standard Library

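Both methods agree: the median (14.79) and mode (2.0) are identical, and the means (21.05158015629472 vs. 21.05158015629467) differ by only about 5e-14.  
That tiny difference is a floating-point rounding effect, most likely caused by the two approaches adding the values in a different order.  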
This consistency shows that our manual implementation works correctly and that pandas simplifies these operations greatly.
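
Since the two means are not bit-for-bit equal, comparing them with `==` would report a mismatch. A hedged sketch of a tolerance-based comparison with `math.isclose` follows, along with a small demonstration of how naive accumulation picks up rounding error. The explanation that summation order accounts for the difference is an assumption; the notebook does not verify it.

```python
import math

# The two means exactly as printed by the notebook.
mean_pandas = 21.05158015629472
mean_stdlib = 21.05158015629467

# Exact equality fails, but the values agree within a tight relative tolerance.
print(mean_pandas == mean_stdlib)                             # False
print(math.isclose(mean_pandas, mean_stdlib, rel_tol=1e-12))  # True

# Naive left-to-right accumulation picks up rounding error at each step;
# math.fsum tracks partial sums and returns a correctly rounded total.
tenths = [0.1] * 10
naive_total = 0.0
for x in tenths:
    naive_total += x

print(naive_total == 1.0)        # False
print(math.fsum(tenths) == 1.0)  # True
```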

In [7]:
borough_means = df.groupby("Geo Place Name")["Data Value"].mean().to_dict()
max_val = max(borough_means.values())
max_width = 40

print("Average Air Quality by Borough\n")
for borough, value in sorted(borough_means.items(), key=lambda x: x[1], reverse=True)[:15]:
    bar_len = int(value / max_val * max_width)
    bar = "#" * bar_len
    print(f"{borough:<40} | {bar:<40} {value:>6.2f}")

Average Air Quality by Borough

High Bridge - Morrisania                 | ########################################  38.11
Hunts Point - Mott Haven                 | ######################################    36.74
Crotona -Tremont                         | #####################################     36.19
East Harlem                              | ###################################       33.65
Central Harlem - Morningside Heights     | ##################################        32.50
Bronx                                    | ###############################           30.38
Manhattan                                | ###############################           29.70
Gramercy Park - Murray Hill              | ###############################           29.58
Union Square - Lower East Side           | ##############################            28.88
Stuyvesant Town and Turtle Bay (CD6)     | #############################             28.36
Chelsea - Clinton                        | ###############
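
The bar lengths in the chart above come from scaling each value against the maximum: `int(value / max_val * max_width)`. As a sketch, the same logic can be wrapped in a reusable function. `text_bar_chart` and its parameters are names made up here, and the sample values are the top three rounded averages from the output above.

```python
def text_bar_chart(means, width=40, top_n=15, label_width=40):
    """Return chart lines for the top_n largest values, longest bar first.

    Assumes the largest value is positive, as pollution averages are here.
    """
    if not means:
        return []
    max_val = max(means.values())
    ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    lines = []
    for label, value in ranked:
        # Scale relative to the maximum so the largest value fills the width.
        bar = "#" * int(value / max_val * width)
        lines.append(f"{label:<{label_width}} | {bar:<{width}} {value:>6.2f}")
    return lines

sample_means = {
    "High Bridge - Morrisania": 38.11,
    "Hunts Point - Mott Haven": 36.74,
    "Crotona -Tremont": 36.19,
}
for line in text_bar_chart(sample_means, width=40):
    print(line)
```

Returning a list of strings rather than printing directly makes the function easier to test and to reuse with a different `width` or `top_n`.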

## 5. Text-Based Visualization

This is a simple **text chart** of the average `Data Value` for each `Geo Place Name` (a mix of boroughs, neighborhoods, and community districts), limited to the 15 highest averages.  
Each bar's length is proportional to the area's average relative to the highest one, so longer bars mean higher average values.  
The rendering itself uses only built-in Python (`print` and string multiplication); the averages are computed with pandas.

From the chart, areas such as **High Bridge - Morrisania**, **Central Harlem - Morningside Heights**, and **the Bronx** have relatively high average values, while areas like **Bayside - Little Neck** show lower readings (these fall outside the portion of the chart shown above).  
This variation may reflect differences in **traffic density, industrial activity, or population concentration** across NYC.

## 6. Conclusion

This project demonstrates how to analyze a real-world dataset using both pandas and Python’s standard library.  
The calculated mean (21.05), median (14.79), and mode (2.0) summarize the overall distribution of `Data Value` in NYC’s air quality data.  
Through a simple text-based visualization, we also explored spatial differences among areas of the city.

Overall, the project highlights how data analysis can transform raw public data into interpretable insights,  
even without using advanced visualization libraries.

## Acknowledgment: Use of AI Tools

This project was completed with partial assistance from AI tools (e.g., ChatGPT) for **guidance, code structure, and Markdown writing style**.  
All data analysis, interpretation, and validation of results were conducted independently by the author.  

The AI tool was mainly used to:
- Clarify Python syntax and pandas functions  
- Suggest clearer Markdown explanations and blog-style formatting  
- Improve readability and presentation of the notebook  