# Project 1 — Exploring Electricity Consumption (2010–2025)

**Research question.** What does monthly electricity consumption look like over time? Are there many “zero-consumption” periods, and how do they affect summary statistics?

**Dataset.** New York City “Electric Consumption and Cost (2010–May 2025)”.  
**Primary source link:** <https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2010-May-2025-/jr24-e7cr/about_data>


**What you’ll see in this notebook.**
1. Load the dataset and pick a numeric column — **`Consumption (KW)`**.
2. Compute **mean / median / mode** with pandas.
3. Compute the **same metrics the hard way** using only the Python standard library.
4. Make a **pure-standard-library** visualization (ASCII chart) with a clear title, axis labels, and units.
5. Discuss what worked, what didn’t, and what we learned.


## Data loading & cleaning

The raw file uses **thousands separators** (e.g., `1,680.81`), so we pass `thousands=","` to `pandas.read_csv`.  
We keep valid numeric values (including zeros) and coerce non-numeric values to `NaN`.

In [None]:
import pandas as pd

df = pd.read_csv('Electric_Consumption_And_Cost_(2010_-_May_2025)_20251110.csv', thousands=",")

In [None]:
col = "Consumption (KW)"
# Coerce non-numeric entries to NaN, then drop them; valid zeros are kept.
values = pd.to_numeric(df[col], errors="coerce").dropna()

### Summary statistics (pandas)

- **Mean** is sensitive to rare high peaks (right-skewed distribution).
- **Median** reflects a typical operating level.
- **Mode** shows the most frequent value; in this dataset zeros are common (idle/off periods).

In [None]:
mean = values.mean()
median = values.median()
mode = values.mode()[0] 

print("Mean:", round(mean, 2))
print("Median:", round(median, 2))
print("Mode:", round(mode, 2))

Mean: 60.82
Median: 20.09
Mode: 0.0
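
To make the research question about zeros concrete, the sketch below compares mean and median with and without zero readings. It uses only the standard library's `statistics` module. The `sample_kw` list is a hypothetical stand-in, not values from the dataset, so the printed numbers only illustrate the direction of the effect.

```python
import statistics

# Hypothetical readings in kW (assumption: illustrative only, not from the CSV).
sample_kw = [0.0, 0.0, 0.0, 12.5, 20.09, 48.0, 150.0, 1680.81]

# Split off the zero readings and measure how common they are.
nonzero_kw = [v for v in sample_kw if v != 0]
zero_share = (len(sample_kw) - len(nonzero_kw)) / len(sample_kw)

print(f"Share of zeros: {zero_share:.1%}")
print("Mean   (all / non-zero):", round(statistics.mean(sample_kw), 2),
      "/", round(statistics.mean(nonzero_kw), 2))
print("Median (all / non-zero):", round(statistics.median(sample_kw), 2),
      "/", round(statistics.median(nonzero_kw), 2))
```

Dropping zeros raises both statistics in this sample. On the real column the same comparison would show how much of the gap between mean and median is due to idle periods rather than to high peaks.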


## The hard way (standard library only)

We re-read the CSV with the built-in `csv` module.  
We strip commas from numbers, skip empty strings, and convert to `float`.  
Then we manually compute:
- **mean**: `sum(values) / len(values)`
- **median**: sort and pick middle (average two middles if even length)
- **mode**: count with a dictionary (we round to 2 decimals so that values differing only by floating-point noise are counted together)

In [None]:
import csv
values_py = []
with open("Electric_Consumption_And_Cost_(2010_-_May_2025)_20251110.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        raw = row["Consumption (KW)"]
        if raw is None:
            continue
        raw = raw.strip().replace(",", "")     # remove thousands separators
        if raw == "":                          # skip empty cells (matches pandas NaN handling)
            continue
        try:
            val = float(raw)
            values_py.append(val)              # keep zeros, to stay aligned with pandas
        except ValueError:
            pass

In [None]:
# mean
py_mean = sum(values_py) / len(values_py)

# median
sorted_vals = sorted(values_py)
n = len(sorted_vals)
py_median = sorted_vals[n//2] if n % 2 == 1 else (sorted_vals[n//2 - 1] + sorted_vals[n//2]) / 2

# mode (floats carry many distinct digits, so bucket by 2 decimals to stay comparable with pandas)
counts = {}
for v in sorted_vals:
    k = round(v, 2)                 # 2-decimal bucket: tiny float differences no longer create extra unique values
    counts[k] = counts.get(k, 0) + 1
mode_key = max(counts, key=counts.get)
py_mode = mode_key

print("Python Mean:", round(py_mean, 2))
print("Python Median:", round(py_median, 2))
print("Python Mode:", round(py_mode, 2))

Python Mean: 60.82
Python Median: 20.09
Python Mode: 0.0
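
As a further sanity check, the manual formulas can be compared against the standard library's `statistics` module. This is a minimal sketch: the function names (`manual_mean`, `manual_median`, `manual_mode`) and the `sample_kw` values are hypothetical and chosen for illustration, not taken from the notebook or the dataset.

```python
import math
import statistics

def manual_mean(data):
    """Arithmetic mean: total divided by the number of values."""
    return sum(data) / len(data)

def manual_median(data):
    """Middle value of the sorted data; average the two middles if the length is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def manual_mode(data, ndigits=2):
    """Most frequent value after rounding to `ndigits` decimals."""
    tally = {}
    for v in data:
        key = round(v, ndigits)
        tally[key] = tally.get(key, 0) + 1
    return max(tally, key=tally.get)

# Hypothetical readings in kW (assumption: illustrative only).
sample_kw = [0.0, 0.0, 12.5, 20.09, 20.09, 0.0, 150.0, 1680.81]

# The manual results should agree with the library implementations.
assert math.isclose(manual_mean(sample_kw), statistics.mean(sample_kw))
assert manual_median(sample_kw) == statistics.median(sample_kw)
assert manual_mode(sample_kw) == statistics.mode(sample_kw)
print("manual and statistics-module results agree")
```

The mean is compared with `math.isclose` because a plain float sum and the library's internal arithmetic can differ in the last bits.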


## Visualizing the distribution (ASCII histogram)

**Title**: KW Consumption Distribution (2010–2025)  
**X-axis**: Frequency (count)  
**Y-axis**: Consumption bins (kW)

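We print a horizontal ASCII bar for each bin. Bar length is proportional to the bin's count relative to the largest bin, scaled to at most 50 characters, and every bin shows at least one block. Because the bins are right-closed and start at 0, exact zeros (and values above 1,500 kW) fall outside every bin and do not appear in this chart.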

In [None]:
bins = [0, 10, 20, 50, 100, 200, 400, 600, 800, 1000, 1500]
counts = pd.cut(values, bins=bins, right=True).value_counts().sort_index()

max_count = counts.max()
width = 50

print("KW Consumption Distribution (ASCII Histogram)")
for interval, count in counts.items():
    bar_len = max(1, int(count / max_count * width))
    bar = "█" * bar_len
    print(f"{str(interval):18} {bar} {count}")

KW Consumption Distribution (ASCII Histogram)
(0, 10]            █████████████ 28270
(10, 20]           ███████ 15008
(20, 50]           ██████████████████████████ 54529
(50, 100]          ██████████████████████████████████████████████████ 101780
(100, 200]         ███████████████████████████████████████ 80642
(200, 400]         █████████████ 26475
(400, 600]         █ 4053
(600, 800]         █ 1267
(800, 1000]        █ 488
(1000, 1500]       █ 348
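
The binning step can also be done without pandas. The sketch below uses `bisect` to place values into the same kind of right-closed bins and tallies out-of-range values separately, so zeros stay visible. The helper name `bin_counts`, the shortened `edges` list, and the `sample_kw` values are assumptions made for illustration.

```python
from bisect import bisect_left

def bin_counts(data, edges):
    """Tally data into right-closed bins (edges[i], edges[i+1]].

    Returns (count at or below the first edge, per-bin counts, count above the last edge).
    """
    per_bin = [0] * (len(edges) - 1)
    low = high = 0
    for v in data:
        if v <= edges[0]:
            low += 1                  # zeros land here when edges[0] == 0
        elif v > edges[-1]:
            high += 1
        else:
            # bisect_left finds the first edge >= v, i.e. the bin's right edge.
            per_bin[bisect_left(edges, v) - 1] += 1
    return low, per_bin, high

# Hypothetical readings in kW and a shortened edge list (illustrative only).
sample_kw = [0, 0, 5, 10, 10.5, 20, 35, 50, 75]
edges = [0, 10, 20, 50]
low, per_bin, high = bin_counts(sample_kw, edges)

low_label = f"<= {edges[0]}"
high_label = f"> {edges[-1]}"
print(f"{low_label:10} {'█' * low} {low}")
for left, right, count in zip(edges, edges[1:], per_bin):
    label = f"({left}, {right}]"
    print(f"{label:10} {'█' * count} {count}")
print(f"{high_label:10} {'█' * high} {high}")
```

Reporting the two out-of-range tallies next to the bins is one way to keep the zero readings in view; whether that is preferable to folding zeros into the first bin is a design choice.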


## Average monthly consumption (last 24 months)

**Title**: Average Monthly Electricity Consumption  
**Y-axis**: Month (YYYY-MM)  
**X-axis**: Average consumption (kW)

We aggregate by the `Revenue Month` column (`YYYY-MM`), then render a pure-text bar chart so the length encodes the magnitude (kW).

In [None]:
df['Revenue Month'].unique()

array(['2025-05', '2025-04', '2025-03', '2025-02', '2025-01', '2024-12',
       '2024-11', '2024-10', '2024-09', '2024-08', '2024-07', '2024-06',
       '2024-05', '2024-04', '2024-03', '2024-02', '2024-01', '2023-12',
       '2023-11', '2023-10', '2023-09', '2023-08', '2023-07', '2023-06',
       '2023-05', '2023-04', '2023-03', '2023-02', '2023-01', '2022-12',
       '2022-11', '2022-10', '2022-09', '2022-08', '2022-07', '2022-06',
       '2022-05', '2022-04', '2022-03', '2022-02', '2022-01', '2021-12',
       '2021-11', '2021-10', '2021-09', '2021-08', '2021-07', '2021-06',
       '2021-05', '2021-04', '2021-03', '2021-02', '2021-01', '2020-12',
       '2020-11', '2020-10', '2020-09', '2020-08', '2020-07', '2020-06',
       '2020-05', '2020-04', '2020-03', '2020-02', '2020-01', '2019-12',
       '2019-11', '2019-10', '2019-09', '2019-08', '2019-07', '2019-06',
       '2019-05', '2019-04', '2019-03', '2019-02', '2019-01', '2017-12',
       '2017-11', '2017-10', '2017-09', '2017-08', 

In [None]:
df["Revenue Month"] = pd.to_datetime(df["Revenue Month"], format="%Y-%m", errors="coerce")
monthly_avg = (
    df.dropna(subset=["Revenue Month", col])
      .groupby("Revenue Month")[col]
      .mean()
      .sort_index()
)

monthly_avg_tail = monthly_avg.tail(24)

labels = [d.strftime("%Y-%m") for d in monthly_avg_tail.index]
values = [float(v) for v in monthly_avg_tail.values]
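
In keeping with the standard-library section above, the same per-month average can be sketched without pandas. The `sample_rows` pairs below are hypothetical stand-ins for `(Revenue Month, Consumption (KW))` cells; the cleaning mirrors the earlier `csv` loop (strip commas, skip empty cells, keep zeros).

```python
from collections import defaultdict

# Hypothetical (month, raw cell) pairs standing in for CSV rows (illustrative only).
sample_rows = [
    ("2024-08", "1,200.50"),
    ("2024-08", "0"),
    ("2024-09", "64.12"),
    ("2024-09", ""),          # empty cell: skipped, like NaN in pandas
]

sums = defaultdict(float)
row_counts = defaultdict(int)
for month, raw in sample_rows:
    cleaned = raw.strip().replace(",", "")   # remove thousands separators
    if not cleaned:
        continue
    sums[month] += float(cleaned)
    row_counts[month] += 1

# "YYYY-MM" strings sort chronologically, so a plain sort gives time order.
monthly_means = {m: sums[m] / row_counts[m] for m in sorted(sums)}

# Keep at most the 24 most recent months, as in the pandas version.
last_months = list(monthly_means.items())[-24:]
for month, mean_kw in last_months:
    print(f"{month}: {mean_kw:.2f} kW")
```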

In [None]:
title = "Average Monthly Electricity Consumption (kW)"
x_label = "Average Consumption (kW)"   # X-axis (numeric, with units)
y_label = "Month (YYYY-MM)"            # Y-axis (categories)

MAX_BAR_WIDTH = 50

if values:
    max_val = max(values)
    scale = MAX_BAR_WIDTH / max_val if max_val > 0 else 1.0
else:
    max_val = 1.0
    scale = 1.0

print("=" * (len(title) + 4))
print(f"| {title} |")
print("=" * (len(title) + 4))
print(f"Y-axis: {y_label}")
print(f"X-axis: {x_label}")
print(f"Note: 1 block ≈ {max_val/MAX_BAR_WIDTH:.2f} kW (scale based on max)")
print()

for lab, val in zip(labels, values):
    blocks = int(val * scale)
    bar = "█" * blocks if blocks > 0 else "▏"
    print(f"{lab} | {bar} {val:.2f} kW")

| Average Monthly Electricity Consumption (kW) |
Y-axis: Month (YYYY-MM)
X-axis: Average Consumption (kW)
Note: 1 block ≈ 1.77 kW (scale based on max)

2023-06 | ███████████████████████████ 47.75 kW
2023-07 | ███████████████████████████████████████ 69.27 kW
2023-08 | ███████████████████████████████████████ 69.72 kW
2023-09 | ████████████████████████████████████████ 70.82 kW
2023-10 | ██████████████████████████ 46.28 kW
2023-11 | ████████████████████████ 43.34 kW
2023-12 | ████████████████████████ 44.04 kW
2024-01 | ████████████████████████ 42.69 kW
2024-02 | ████████████████████████ 43.03 kW
2024-03 | ███████████████████████ 41.57 kW
2024-04 | ███████████████████████ 41.13 kW
2024-05 | ████████████████████████████ 50.67 kW
2024-06 | ██████████████████████████████████████████ 75.21 kW
2024-07 | ████████████████████████████████████████ 71.63 kW
2024-08 | ██████████████████████████████████████████████████ 88.28 kW
2024-09 | ████████████████████████████████████ 64.12 kW
2024-10 | █████████

## Findings

- The distribution is **right-skewed**: many low/zero values and occasional high peaks, which raises the **mean** above the **median**.
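- A naive read (without `thousands=","`) leaves values such as `1,680.81` as text; coercing them to numbers turns them into `NaN`, so the large values are silently dropped and the mean is biased downward. We kept this “dead end” to show why it’s wrong and how to fix it.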
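- Monthly averages in the displayed months peak in summer (about 64–88 kW in June–September 2024, versus about 41–51 kW from October 2023 through May 2024); zeros likely correspond to off/idle periods.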
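
The thousands-separator pitfall can be reproduced in a few lines of standard-library Python. This is a sketch under stated assumptions: `raw_cells` is a hypothetical handful of cell strings, and the two parser functions are illustrative names, not part of the notebook. The naive parser discards anything `float` cannot read, which is analogous to coercing comma-formatted strings to `NaN` and dropping them.

```python
# Hypothetical raw cells as they might appear in the CSV (illustrative only).
raw_cells = ["12.5", "0", "1,680.81", "48", "2,300.00"]

def parse_naive(cells):
    """Keep only cells that float() accepts as-is; comma-formatted values are lost."""
    parsed = []
    for cell in cells:
        try:
            parsed.append(float(cell))
        except ValueError:
            pass                      # "1,680.81" ends up here and is dropped
    return parsed

def parse_stripping_commas(cells):
    """Remove thousands separators before converting, so large values survive."""
    return [float(cell.replace(",", "")) for cell in cells]

naive = parse_naive(raw_cells)
fixed = parse_stripping_commas(raw_cells)

naive_mean = sum(naive) / len(naive)
fixed_mean = sum(fixed) / len(fixed)
print(f"naive: kept {len(naive)} of {len(raw_cells)} values, mean {naive_mean:.2f}")
print(f"fixed: kept {len(fixed)} of {len(raw_cells)} values, mean {fixed_mean:.2f}")
```

Only values of 1,000 or more carry a separator, so the naive path loses exactly the largest readings, which is why the bias is always downward.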