# NYC Taxicab 2015 Data from S3 - Parquet with Dask

This notebook demonstrates reading NYC taxicab 2015 data from S3 (in Parquet format) using Dask and performing simple data manipulations.

## Learning Objectives

- **S3 Access**: Reading data directly from Amazon S3 without downloading locally
- **Parquet Format**: Understanding columnar storage and its benefits
- **Lazy Evaluation**: Working with large datasets that don't fit in memory
- **Basic Manipulations**: Filtering, aggregations, and transformations on distributed data

## Why Parquet?

- **Columnar storage**: Efficient for analytics queries (only read needed columns)
- **Compression**: Smaller file sizes, faster I/O
- **Schema preservation**: Data types are stored with the data

## Why S3?

- **Scalability**: Access datasets too large for local storage
- **No local download**: Work directly with cloud data
- **Public datasets**: Many datasets available without credentials


In [1]:
import dask.dataframe as dd
import pandas as pd
import numpy as np
from dask.distributed import Client

# Optional: Start a client for better performance and diagnostics
client = Client()
client


Perhaps you already have a cluster running?
Hosting the HTTP server on port 59003 instead


Client
  Connection method: Cluster object
  Cluster type: distributed.LocalCluster
  Dashboard: http://127.0.0.1:59003/status

Cluster
  Workers: 5, Total threads: 10, Total memory: 16.00 GiB
  Status: running, Using processes: True

Scheduler
  Comm: tcp://127.0.0.1:59004
  Dashboard: http://127.0.0.1:59003/status
  Started: Just now

Workers (each: 2 threads, 3.20 GiB memory)
  Comm: tcp://127.0.0.1:59017, Dashboard: http://127.0.0.1:59019/status, Nanny: tcp://127.0.0.1:59007
    Local directory: /var/folders/cl/mbdp3zfx4cg9mmnz60g052540000gn/T/dask-worker-space/worker-g756zg78
  Comm: tcp://127.0.0.1:59021, Dashboard: http://127.0.0.1:59024/status, Nanny: tcp://127.0.0.1:59008
    Local directory: /var/folders/cl/mbdp3zfx4cg9mmnz60g052540000gn/T/dask-worker-space/worker-twf2pnhd
  Comm: tcp://127.0.0.1:59026, Dashboard: http://127.0.0.1:59028/status, Nanny: tcp://127.0.0.1:59009
    Local directory: /var/folders/cl/mbdp3zfx4cg9mmnz60g052540000gn/T/dask-worker-space/worker-wzwrj_1w
  Comm: tcp://127.0.0.1:59018, Dashboard: http://127.0.0.1:59022/status, Nanny: tcp://127.0.0.1:59010
    Local directory: /var/folders/cl/mbdp3zfx4cg9mmnz60g052540000gn/T/dask-worker-space/worker-matzoiz2
  Comm: tcp://127.0.0.1:59027, Dashboard: http://127.0.0.1:59030/status, Nanny: tcp://127.0.0.1:59011
    Local directory: /var/folders/cl/mbdp3zfx4cg9mmnz60g052540000gn/T/dask-worker-space/worker-u6t7ymx8


## Data Source

- **S3 Bucket**: `s3://dask-data/nyc-taxi/nyc-2015.parquet/`
- **Format**: Parquet (columnar, compressed)
- **Access**: Public (no credentials needed)
- **Year**: 2015
- **Size**: Large enough to demonstrate out-of-core processing


## Reading Parquet from S3

Dask can read Parquet files directly from S3 using `dd.read_parquet()`. Key points:

- **Lazy loading**: Data isn't loaded into memory until you call `.compute()` or a method that computes implicitly, such as `.head()` or `len()`
- **Storage options**: Use `storage_options={"anon": True}` for public buckets
- **Wildcards**: Can use patterns like `part.*.parquet` to read multiple files


In [None]:
# First, let's check if files exist in the S3 bucket
import s3fs

s3 = s3fs.S3FileSystem(anon=True)
bucket_path = "s3://dask-data/nyc-taxi/nyc-2015.parquet/"

# Try to list files in the bucket
try:
    files = s3.glob(f"{bucket_path}*.parquet")
    print(f"Found {len(files)} parquet files")
    if files:
        print(f"Sample files: {files[:3]}")
    else:
        print("No parquet files found. Trying alternative path...")
        # Try listing the directory structure
        try:
            dirs = s3.ls("s3://dask-data/nyc-taxi/")
            print(f"Available directories: {dirs}")
        except Exception as e:
            print(f"Could not list directory: {e}")
except Exception as e:
    print(f"Error accessing S3: {e}")
    print("The S3 bucket may not be accessible or the path may be incorrect.")

# Read NYC taxi 2015 data from S3. This is lazy: only metadata is read here.
# Note: Do not call .compute() on df, or the full dataset is loaded into memory as a pandas DataFrame
df = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    storage_options={"anon": True}  # Anonymous access for public bucket
)

print(f"\nData loaded (lazy Dask DataFrame): {type(df)}")
print(f"Number of partitions: {df.npartitions}")


Data loaded (lazy): Empty DataFrame
Columns: []
Index: []


In [10]:
# Get basic info without loading all data
print("Columns:", list(df.columns))
print("\nData types:")
print(df.dtypes)
print("\nFirst few rows:")
df.head()


Columns: []

Data types:
Series([], dtype: object)

First few rows:


In [4]:
# Get the row count and in-memory size (len(df) is exact; both lines trigger a pass over the full dataset)
print(f"Approximate number of rows: {len(df):,}")
print(f"Memory usage estimate: {df.memory_usage(deep=True).sum().compute() / 1024**3:.2f} GB")


Approximate number of rows: 0
Memory usage estimate: 0.00 GB


## Basic Data Exploration

Let's explore the dataset structure and understand what we're working with. Remember:
- Operations are **lazy** - they build a computation graph
- Call `.compute()` to actually execute and get results
- This allows us to work with datasets larger than memory


In [5]:
# Look at a sample of the data
sample = df.head(10)
print("Sample data:")
print(sample)

# Check for common taxi columns
expected_cols = ['passenger_count', 'trip_distance', 'fare_amount', 
                 'tip_amount', 'total_amount', 'pickup_datetime', 
                 'dropoff_datetime']
available = [col for col in expected_cols if col in df.columns]
print(f"\nAvailable columns: {available}")


Sample data:
Empty DataFrame
Columns: []
Index: []

Available columns: []


In [6]:
# Compute basic statistics for numerical columns
# This is still lazy until .compute() is called
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("Numerical columns:", list(numerical_cols))

# Compute descriptive statistics
stats = df[numerical_cols].describe().compute()
print("\nDescriptive Statistics:")
print(stats)


Numerical columns: []


ValueError: Cannot describe a DataFrame without columns

In [None]:
# Count missing values per column
missing = df.isnull().sum().compute()
print("Missing values per column:")
print(missing[missing > 0])
print(f"\nTotal missing values: {missing.sum():,}")


## Simple Data Manipulations

Now let's perform some basic data manipulations:
- **Filtering**: Select subsets of data
- **Creating columns**: Add derived columns
- **Grouping**: Aggregate data by categories
- **Remember**: All operations are lazy until `.compute()` is called


In [None]:
# Filter for trips with passengers
df_with_passengers = df[df['passenger_count'] > 0]

# Filter for trips with tips
df_with_tips = df[df['tip_amount'] > 0]

# Note: each len() call below triggers its own computation over the dataset
print(f"Trips with passengers: {len(df_with_passengers):,}")
print(f"Trips with tips: {len(df_with_tips):,}")


In [None]:
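# Calculate tip percentage; rows with a non-positive fare get 0
# (dividing by a zero fare would otherwise produce inf or NaN)
fare_is_positive = df['fare_amount'] > 0
df = df.assign(
    tip_percentage=(df['tip_amount'] / df['fare_amount'] * 100).where(fare_is_positive, 0).fillna(0),
    has_tip=(df['tip_amount'] > 0)
)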

# Show new columns
print("New columns added:")
print(df[['tip_percentage', 'has_tip']].head())


In [None]:
# Average tip amount by passenger count
avg_tip_by_passengers = df.groupby('passenger_count')['tip_amount'].mean().compute()
print("Average tip amount by passenger count:")
print(avg_tip_by_passengers)


In [None]:
# Summary statistics by passenger count
summary = df.groupby('passenger_count').agg({
    'fare_amount': ['mean', 'std', 'count'],
    'tip_amount': ['mean', 'sum'],
    'trip_distance': 'mean'
}).compute()

print("Summary statistics by passenger count:")
print(summary)


In [None]:
# Show the computation graph for a groupby aggregation
# (visualize() requires the optional graphviz dependency)
result = df.groupby('passenger_count')['tip_amount'].mean()
result.visualize()


## Working with Partitions

Dask splits data across **partitions** for parallel processing. Understanding partitions helps optimize performance:

- Data is divided into chunks (partitions)
- Each partition can be processed independently
- Repartitioning can optimize for specific operations


In [None]:
print(f"Number of partitions: {df.npartitions}")
print("Partition sizes (rows):")

# Count the rows in each partition (this triggers a computation)
partition_sizes = df.map_partitions(len).compute()
print(partition_sizes.head(10))
print(f"\nAverage partition size: {partition_sizes.mean():.0f} rows")


In [None]:
# Repartition to optimize for downstream operations
# This is useful if partitions are too small or too large
df_repartitioned = df.repartition(npartitions=10)
print(f"Original partitions: {df.npartitions}")
print(f"Repartitioned: {df_repartitioned.npartitions}")


## Performance Considerations

When working with large datasets, consider these optimization strategies:

- **`.persist()`**: Cache data in memory for repeated operations
- **Column selection**: Only load columns you need
- **Early filtering**: Filter data as early as possible in your pipeline


In [None]:
# Select only needed columns to reduce memory usage
df_subset = df[['passenger_count', 'fare_amount', 'tip_amount', 'trip_distance']]

# This uses less memory than the full dataframe
print("Reduced columns for analysis")
print(f"Original columns: {len(df.columns)}")
print(f"Selected columns: {len(df_subset.columns)}")


In [None]:
# If you'll use the filtered data multiple times, persist it
df_filtered = df[df['fare_amount'] > 0].persist()

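# Now multiple operations on df_filtered will be faster
mean_fare = df_filtered['fare_amount'].mean().compute()
# Dask does not compute an exact median across multiple partitions;
# quantile(0.5) returns an approximate median instead
median_fare = df_filtered['fare_amount'].quantile(0.5).compute()

print(f"Mean fare: ${mean_fare:.2f}")
print(f"Median fare (approximate): ${median_fare:.2f}")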


## Summary

### Key Takeaways

- **Reading Parquet from S3**: Use `dd.read_parquet()` with `storage_options={"anon": True}` for public buckets
- **Lazy Evaluation**: Operations build a computation graph; call `.compute()` to execute
- **Basic Manipulations**: Filtering, column creation, and aggregations use a pandas-like API
- **Performance Optimization**: Use column selection, early filtering, and `.persist()` for repeated operations
- **Partitions**: Data is split across partitions for parallel processing; repartition when needed

### Benefits of This Approach

- **No local storage needed**: Work directly with cloud data
- **Memory efficient**: Handle datasets larger than available RAM
- **Parallel processing**: Automatic parallelization across partitions
- **Scalable**: Can scale to clusters for even larger datasets


## Next Steps and Exercises

Try these exercises to deepen your understanding:

1. **Different years**: Read data from other years (2014, 2016, etc.)
2. **Combine years**: Merge multiple years of data
3. **More complex aggregations**: Group by multiple columns
4. **Time-based analysis**: If datetime columns are available, analyze patterns by hour/day/month
5. **Export results**: Save computed results to local Parquet files using `.to_parquet()`
6. **Visualizations**: Create plots of aggregated data (histograms, bar charts)

### Example: Export to Local Parquet

```python
# Export filtered data to a local Parquet dataset (Dask writes a directory of part files)
df_filtered.to_parquet('filtered_taxi_data.parquet')
```
