# Project 1

## Step 0
**Mini research question (v0):** What does the distribution of NYC taxi trip distances look like?

**Dataset:** NYC Open Data, Yellow Taxi Trip Records (2023).
We download all columns for January 2023, capped at 500,000 rows, so the analysis is reproducible and the file is not too large.

**URL (CSV, all columns, Jan 2023, limit 500k):** "https://data.cityofnewyork.us/resource/4b4i-vvec.csv?$where=tpep_pickup_datetime between '2023-01-01T00:00:00' and '2023-01-31T23:59:59'&$limit=500000"

(Save the download as "nyc_taxi_2023_01_full.csv" in the project folder.)
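The download can also be scripted so the file name and query stay in sync with the later steps. The following is a minimal sketch using only the standard library; the helper names `build_url` and `download` are hypothetical, and it assumes the endpoint accepts percent-encoded `$where` and `$limit` parameters in place of the raw spaces and quotes shown in the URL above.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://data.cityofnewyork.us/resource/4b4i-vvec.csv"
LOCAL_CSV = "nyc_taxi_2023_01_full.csv"  # file name used by the later steps


def build_url(limit=500_000):
    """Return the Step 0 query URL with its parameters percent-encoded."""
    params = {
        "$where": (
            "tpep_pickup_datetime between "
            "'2023-01-01T00:00:00' and '2023-01-31T23:59:59'"
        ),
        "$limit": str(limit),
    }
    # quote_via=quote encodes spaces as %20 rather than '+'
    query = urllib.parse.urlencode(params, quote_via=urllib.parse.quote)
    return f"{BASE_URL}?{query}"


def download(path=LOCAL_CSV):
    """Fetch the CSV and save it under the name the notebook expects."""
    urllib.request.urlretrieve(build_url(), path)


# download()  # uncomment to fetch the file (large download)
```

Building the query from a dictionary avoids hand-escaping the spaces and quotes in the date filter.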

**What we’ll do:**

- Use pandas to read the CSV, then pick one numeric column: trip_distance.
- Compute the mean, median, and rounded mode (0.1-mile bins).
- Build a standard-library-only ASCII visualization (no plotting libraries).

## Step 1

Print out the trip_distance column.

In [15]:
import pandas as pd

# Read the CSV (it should be in the same folder as this notebook)
df = pd.read_csv("nyc_taxi_2023_01_full.csv")

# Print the 'trip_distance' column (pandas truncates long output to the first and last rows)
print(df["trip_distance"])


0         1.53
1         1.32
2         1.70
3         3.10
4         3.80
          ... 
499995    1.70
499996    1.64
499997    1.46
499998    1.35
499999    0.40
Name: trip_distance, Length: 500000, dtype: float64


## Step 2

Calculate the average distance.

In [31]:
import pandas as pd
# Ensure the column is numeric (coerce bad/missing values to NaN)
df["trip_distance"] = pd.to_numeric(df["trip_distance"], errors="coerce")

# Calculate the mean over non-missing values
avg_distance = df["trip_distance"].mean()

print(f"Average trip distance (miles): {avg_distance:.2f}")

Average trip distance (miles): 3.97


## Step 3

Calculate the median distance.

In [32]:
import pandas as pd
# Calculate the median over non-missing values
median_distance = df["trip_distance"].median()
print(f"Median trip distance (miles): {median_distance:.2f}")

Median trip distance (miles): 1.88
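
The mean from Step 2 (3.97) is roughly twice the median (1.88). A gap in that direction is what a right-skewed distribution tends to produce: a few long trips pull the mean up while the median barely moves. The toy example below illustrates the effect; the distances are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical distances in miles: six short trips and one long trip.
toy_trips = [0.8, 1.1, 1.5, 1.9, 2.3, 3.0, 17.4]

toy_mean = mean(toy_trips)      # pulled upward by the single long trip
toy_median = median(toy_trips)  # middle value, unaffected by how long the longest trip is

print(f"mean={toy_mean:.2f}  median={toy_median:.2f}")
```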


## Step 4

Calculate the mode of the distance, rounded to 0.1 mile.

In [35]:
import pandas as pd
# Round to 0.1 mile first (continuous data -> practical mode)
rounded = pd.to_numeric(df["trip_distance"], errors="coerce").dropna().round(1)
modes = rounded.mode().tolist()     # a Python list

# Pretty print
if modes:
    print("Mode(s) of trip_distance (0.1-mile bins):",
          ", ".join(f"{m:.1f}" for m in modes))
else:
    print("No mode found.")

Mode(s) of trip_distance (0.1-mile bins): 1.0


## Step 5

Mean / median / mode again, this time with the standard library only (no pandas).

#### 1. Read file

In [29]:
import csv
import math
from collections import Counter
LOCAL_CSV = "nyc_taxi_2023_01_full.csv"   # same file from Step 0

# --- 1) Read the single numeric column from CSV ---
def read_trip_distance(path):
    vals = []
    with open(path, newline="", encoding="utf-8") as f:
        r = csv.DictReader(f)
        for row in r:
            try:
                x = float(row["trip_distance"])
                if math.isfinite(x) and x >= 0:
                    vals.append(x)
            except (ValueError, TypeError):
                pass
    return vals

vals = read_trip_distance(LOCAL_CSV)
print(f"Loaded {len(vals)} distances.")


Loaded 500000 distances.


#### 2. Hand-made mean / median / (rounded) mode

In [30]:
# The "_std" suffix stands for "standard library", not standard deviation.
def mean_std(v):
    return sum(v)/len(v) if v else float("nan")

def median_std(v):
    n = len(v)
    if n == 0:
        return float("nan")
    s = sorted(v)
    m = n // 2
    return s[m] if n % 2 == 1 else (s[m-1] + s[m]) / 2

def mode_rounded_std(v, ndigits=1):
    if not v:
        return []
    rounded = [round(x, ndigits) for x in v]
    counts = Counter(rounded)
    peak = max(counts.values())
    return sorted([x for x, c in counts.items() if c == peak])

mean_h   = mean_std(vals)
median_h = median_std(vals)
modes_h  = mode_rounded_std(vals, ndigits=1)

print(f"[stdlib] Mean trip distance (miles):   {mean_h:.2f}")
print(f"[stdlib] Median trip distance (miles): {median_h:.2f}")
print("[stdlib] Mode(s) (0.1-mile bins):", ", ".join(f"{m:.1f}" for m in modes_h))

[stdlib] Mean trip distance (miles):   3.97
[stdlib] Median trip distance (miles): 1.88
[stdlib] Mode(s) (0.1-mile bins): 1.1
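
The mean and median match the pandas results, but the mode does not: Step 4 reports 1.0 and this step reports 1.1. One plausible cause, not verified against this dataset, is how half-way values such as 1.05 are rounded. Python's built-in `round(x, 1)` works from the exact binary value of the float, while NumPy-style rounding (which pandas is assumed to use here) scales by 10, rounds half to even, and scales back. The sketch below reproduces both behaviours with the standard library; the value 1.05 is a hypothetical example.

```python
from decimal import Decimal

x = 1.05  # hypothetical distance sitting on a 0.1-mile bin boundary

# The float nearest to 1.05 is in fact slightly larger than 1.05.
exact = Decimal(x)

builtin_rounded = round(x, 1)        # built-in round, as used in mode_rounded_std
scaled_rounded = round(x * 10) / 10  # scale, round half to even, rescale

print(exact > Decimal("1.05"))
print(builtin_rounded, scaled_rounded)
```

If many trips are recorded at distances ending in 5 hundredths, the two methods assign them to different 0.1-mile bins, which could be enough to move the peak from one bin to its neighbour.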


## Step 6

Data visualization: an ASCII histogram of trip_distance.

In [27]:
# Step 6: ASCII histogram for trip_distance (no plotting libs)

# 1) Prep values with pandas (allowed for calculations)
dist_series = pd.to_numeric(df["trip_distance"], errors="coerce").dropna()

# Drop extreme outliers so the chart is readable (cut at the 99th percentile, not a hard-coded number)
hi_cap = float(dist_series.quantile(0.99))   # values above this (top 1%) are excluded
values = [x for x in dist_series if x <= hi_cap]

# 2) Draw with Python standard library only
def ascii_hist(values, bins=20, max_width=40, ch="▇", title="Trip distance (miles)"):
    if not values:
        print("No data to plot.")
        return

    lo = min(values)
    hi = max(values)
    if hi == lo:
        hi = lo + 1e-9  # avoid zero-width
    width = (hi - lo) / bins
    edges = [lo + i*width for i in range(bins+1)]
    counts = [0]*bins
    for x in values:
        i = min(int((x - lo) / (hi - lo) * bins), bins-1)
        counts[i] += 1

    m = max(counts) or 1
    print(f"{title} — ASCII histogram (clipped at 99th percentile)")
    print("Bin range → count (each block is relative to the max bin)\n")
    for i, c in enumerate(counts):
        left, right = edges[i], edges[i+1]
        bar = ch * int(round(c / m * max_width))
        print(f"{left:6.2f}–{right:6.2f} | {bar}")

ascii_hist(values, bins=20, max_width=42, ch="*",
           title="NYC Yellow Taxi — trip_distance")


NYC Yellow Taxi — trip_distance — ASCII histogram (clipped at 99th percentile)
Bin range → count (each block is relative to the max bin)

  0.00–  1.03 | *******************************
  1.03–  2.07 | ******************************************
  2.07–  3.10 | *********************
  3.10–  4.13 | **********
  4.13–  5.17 | *****
  5.17–  6.20 | ***
  6.20–  7.23 | ***
  7.23–  8.27 | **
  8.27–  9.30 | **
  9.30– 10.34 | **
 10.34– 11.37 | **
 11.37– 12.40 | *
 12.40– 13.44 | *
 13.44– 14.47 | *
 14.47– 15.50 | *
 15.50– 16.54 | *
 16.54– 17.57 | **
 17.57– 18.60 | **
 18.60– 19.64 | *
 19.64– 20.67 | *
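
The cell above still relies on pandas for the 99th-percentile cut-off. If the whole visualization should run on the standard library, the cut-off could be computed with `statistics.quantiles` instead. This is a sketch under assumptions: the helper name `trim_top_percent` is hypothetical, and `method="inclusive"` is chosen to approximate the linear interpolation pandas uses by default, so the cap may differ slightly from `hi_cap` above.

```python
from statistics import quantiles


def trim_top_percent(values, percent=1):
    """Drop values above the (100 - percent)th percentile, stdlib only."""
    if len(values) < 2:
        return list(values)  # quantiles() needs at least two data points
    # n=100 yields 99 cut points; index 98 is the 99th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    cap = cuts[100 - percent - 1]
    return [x for x in values if x <= cap]


# Small check on synthetic data: integers 0..100, so the 99th percentile is 99.
demo = list(range(101))
trimmed = trim_top_percent(demo)
print(len(trimmed), max(trimmed))
```

With the real data, `trim_top_percent(vals)` on the list loaded in Step 5 could feed `ascii_hist` directly.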
