# NYC Taxi Data S3 Reader

This notebook demonstrates how to read NYC taxi data from S3 using Dask for efficient processing of large datasets.

## Data Source
- **Dataset**: NYC Taxi & Limousine Commission (TLC) data
- **S3 Location**: `s3://nyc-tlc/trip data/`
- **Format**: Parquet files
- **Years Available**: 2009-2023

## Features
- Lazy loading of large datasets
- Parallel processing with Dask
- Memory-efficient operations
- Data exploration and analysis


In [1]:
# Import required libraries
import dask.dataframe as dd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')
%matplotlib inline

print(f"Starting NYC Taxi Data Analysis at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


Starting NYC Taxi Data Analysis at 2025-10-17 16:16:54


## 1. Explore Available Data

First, let's see what data is available in the S3 bucket.
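
Once a listing succeeds, the file names themselves describe what is available. The sketch below groups names by taxi type and year. It assumes the `<type>_tripdata_<YYYY>-<MM>.parquet` naming seen in the data source pattern used later in this notebook; `summarize_listing` and the sample paths are hypothetical.

```python
import re
from collections import defaultdict

# Assumed naming convention: <type>_tripdata_<YYYY>-<MM>.parquet
FILE_PATTERN = re.compile(
    r"(?P<kind>[a-z]+)_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.parquet$"
)

def summarize_listing(paths):
    """Map (taxi type, year) to the sorted list of months present in `paths`."""
    summary = defaultdict(list)
    for path in paths:
        match = FILE_PATTERN.search(path)
        if match is None:
            continue  # skip files that do not follow the assumed convention
        key = (match["kind"], int(match["year"]))
        summary[key].append(int(match["month"]))
    return {key: sorted(months) for key, months in summary.items()}

# Hypothetical listing, shaped like the result of s3.glob(...)
sample_paths = [
    "nyc-tlc/trip data/yellow_tripdata_2015-02.parquet",
    "nyc-tlc/trip data/yellow_tripdata_2015-01.parquet",
    "nyc-tlc/trip data/green_tripdata_2015-01.parquet",
    "nyc-tlc/trip data/README.txt",
]
summary = summarize_listing(sample_paths)
for (kind, year), months in sorted(summary.items()):
    print(f"{kind} {year}: months {months}")
```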


In [5]:
# Import s3fs for S3 exploration
import s3fs

# Create S3 filesystem connection
s3 = s3fs.S3FileSystem(anon=True)

# Explore the NYC TLC bucket structure
bucket_path = 's3://nyc-tlc/trip data/'
print(f"Exploring bucket: {bucket_path}")

# List available files (this might take a moment)
try:
    files = s3.glob(f"{bucket_path}*.parquet")
    print(f"Found {len(files)} parquet files")
    
    # Show first few files
    print("\nSample files:")
    for i, file in enumerate(files[:10]):
        print(f"  {i+1}. {file}")
    
    if len(files) > 10:
        print(f"  ... and {len(files) - 10} more files")
        
except Exception as e:
    print(f"Error exploring S3 bucket: {e}")
    print("This might be due to network connectivity or bucket access restrictions")


Exploring bucket: s3://nyc-tlc/trip data/
Error exploring S3 bucket: Access Denied
This might be due to network connectivity or bucket access restrictions
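
The recorded output above says `Access Denied`, which points at permissions rather than connectivity. A small helper can make that distinction explicit instead of printing one generic hint. This is a sketch that assumes the filesystem layer surfaces Python's built-in exception types (`PermissionError`, `FileNotFoundError`, other `OSError` subclasses); `explain_storage_error` is a hypothetical name.

```python
def explain_storage_error(exc):
    """Return a short, human-readable hint for a storage access failure."""
    # Order matters: the specific types below are all subclasses of OSError
    if isinstance(exc, PermissionError):
        return "Access denied: the bucket may not allow anonymous reads."
    if isinstance(exc, FileNotFoundError):
        return "Not found: check the bucket name, prefix, and file pattern."
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return "Network problem: check connectivity and retry."
    if isinstance(exc, OSError):
        return f"I/O error: {exc}"
    return f"Unexpected {type(exc).__name__}: {exc}"

# Simulate the failure mode shown in the recorded output
try:
    raise PermissionError("Access Denied")
except Exception as exc:
    hint = explain_storage_error(exc)
    print(hint)
```

In the cell above, the `except` branch could call this helper in place of the fixed hint.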


## 2. Load NYC Taxi Data

Load one year of NYC yellow-taxi trip data. Restricting the load to a single year keeps the dataset manageable; the year is a parameter and can be changed.
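
To keep a first run even smaller than a full year, explicit monthly paths can replace the `*` wildcard. The sketch below builds them with the standard library only. `monthly_paths` is a hypothetical helper, and it assumes the same `yellow_tripdata_<YYYY>-<MM>.parquet` naming as the pattern in the next cell.

```python
def monthly_paths(year, months=range(1, 13), base="s3://nyc-tlc/trip data/"):
    """Build one path per month, e.g. .../yellow_tripdata_2015-01.parquet."""
    months = list(months)
    for month in months:
        if not 1 <= month <= 12:
            raise ValueError(f"month out of range: {month}")
    return [f"{base}yellow_tripdata_{year}-{month:02d}.parquet" for month in months]

# First quarter of 2015 only
q1_paths = monthly_paths(2015, months=[1, 2, 3])
for path in q1_paths:
    print(path)
```

`dd.read_parquet` accepts a list of paths as well as a glob string, so the resulting list could be passed as its first argument with the same `storage_options`.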


In [6]:
# Define the data source - using 2015 data as an example
# You can change the year or use multiple years
year = 2015
data_source = f"s3://nyc-tlc/trip data/yellow_tripdata_{year}-*.parquet"

print(f"Loading NYC taxi data for year {year}")
print(f"Data source: {data_source}")

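# Build a lazy Dask DataFrame (only Parquet metadata is read here)
try:
    # Read parquet files from S3
    df = dd.read_parquet(
        data_source,
        storage_options={'anon': True}
    )
    
    print("✓ Successfully opened dataset (lazy, no rows loaded yet)")
    # df.shape would show a lazy row count, so report cheap metadata instead
    print(f"Partitions: {df.npartitions}")
    print(f"Columns: {list(df.columns)}")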
    
except Exception as e:
    print(f"Error loading data: {e}")
    print("This might be due to network connectivity or data availability")


Loading NYC taxi data for year 2015
Data source: s3://nyc-tlc/trip data/yellow_tripdata_2015-*.parquet
Error loading data: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Access Denied
This might be due to network connectivity or data availability


## 3. Explore Data Structure

Let's examine the structure and content of the loaded data.


In [4]:
# Display basic information about the dataset
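print("=== Dataset Information ===")
# df.shape[0] is lazy in Dask, so count rows explicitly with len()
print(f"Shape: ({len(df):,}, {len(df.columns)})")
print(f"Number of partitions: {df.npartitions}")
# Note: memory_usage(deep=True) reads every partition, which is slow on a full year
print(f"Memory usage: {df.memory_usage(deep=True).sum().compute() / 1024**3:.2f} GB")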

# Show column information
print("\n=== Column Information ===")
print(f"Total columns: {len(df.columns)}")
print("Columns:")
for i, col in enumerate(df.columns, 1):
    print(f"  {i:2d}. {col}")

# Show data types
print("\n=== Data Types ===")
dtypes = df.dtypes  # already a pandas Series; it has no .compute() method
for col, dtype in dtypes.items():
    print(f"  {col}: {dtype}")


=== Dataset Information ===


NameError: name 'df' is not defined

In [None]:
# Display first few rows
print("=== Sample Data ===")
sample_data = df.head()
print(sample_data)

# Display last few rows
print("\n=== Last Few Rows ===")
tail_data = df.tail()
print(tail_data)


## 4. Data Analysis

Perform some basic analysis on the taxi data.
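
Column names are not uniform across TLC files: yellow-taxi files prefix the timestamp columns with `tpep_` and green-taxi files with `lpep_`. One way to keep the analysis cells independent of the exact schema is to resolve each logical column against a list of candidate names. The sketch below is standard-library only; `resolve_columns` and the candidate lists are illustrative assumptions, not an exhaustive schema.

```python
# Candidate physical names for each logical column (assumed, not exhaustive)
CANDIDATES = {
    "pickup_datetime": ["tpep_pickup_datetime", "lpep_pickup_datetime", "pickup_datetime"],
    "dropoff_datetime": ["tpep_dropoff_datetime", "lpep_dropoff_datetime", "dropoff_datetime"],
    "trip_distance": ["trip_distance"],
}

def resolve_columns(available, candidates=CANDIDATES):
    """Map each logical name to the first candidate present (case-insensitive)."""
    lookup = {name.lower(): name for name in available}
    resolved = {}
    for logical, options in candidates.items():
        for option in options:
            if option.lower() in lookup:
                resolved[logical] = lookup[option.lower()]
                break
    return resolved

# Hypothetical column list, shaped like list(df.columns)
columns = ["VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime", "Trip_distance"]
resolved = resolve_columns(columns)
print(resolved)
```

Logical names with no matching candidate are simply absent from the result, so later cells can test membership before using a column.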


In [None]:
# Basic statistics for numerical columns
print("=== Basic Statistics ===")
try:
    # Get numerical columns
    numerical_cols = df.select_dtypes(include=[np.number]).columns
    print(f"Numerical columns: {list(numerical_cols)}")
    
    if len(numerical_cols) > 0:
        stats = df[numerical_cols].describe().compute()
        print("\nDescriptive Statistics:")
        print(stats)
    else:
        print("No numerical columns found")
        
except Exception as e:
    print(f"Error computing statistics: {e}")


In [None]:
# Analyze specific columns if they exist
print("=== Column Analysis ===")

# Check for common taxi data columns
common_columns = {
    'trip_distance': 'Trip distance in miles',
    'fare_amount': 'Fare amount in dollars',
    'tip_amount': 'Tip amount in dollars',
    'total_amount': 'Total amount in dollars',
    'passenger_count': 'Number of passengers',
    # Yellow-taxi files prefix the timestamp columns with "tpep_"
    'tpep_pickup_datetime': 'Pickup date and time',
    'tpep_dropoff_datetime': 'Dropoff date and time'
}

available_columns = []
for col, description in common_columns.items():
    if col in df.columns:
        available_columns.append((col, description))
        print(f"✓ {col}: {description}")
    else:
        print(f"✗ {col}: Not found")

print(f"\nFound {len(available_columns)} common taxi columns")


## 5. Data Visualization

Create some visualizations to understand the data better.


In [None]:
# Create visualizations for available numerical columns
if len(available_columns) > 0:
    print("Creating visualizations...")
    
    # Set up the plotting area
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle(f'NYC Taxi Data Analysis - {year}', fontsize=16)
    
    plot_count = 0
    
    # Plot available numerical columns
    for col, description in available_columns[:4]:  # Limit to 4 plots
        if col in df.columns:
            try:
                row = plot_count // 2
                col_idx = plot_count % 2
                
                # Sample 10% of rows for plotting (to limit memory use);
                # random_state makes the sample reproducible
                sample_data = df[col].dropna().sample(frac=0.1, random_state=42).compute()
                
                if len(sample_data) > 0:
                    axes[row, col_idx].hist(sample_data, bins=50, alpha=0.7, edgecolor='black')
                    axes[row, col_idx].set_title(f'{col} Distribution')
                    axes[row, col_idx].set_xlabel(col)
                    axes[row, col_idx].set_ylabel('Frequency')
                    
                    plot_count += 1
                    
            except Exception as e:
                print(f"Error plotting {col}: {e}")
    
    # Hide unused subplots
    for i in range(plot_count, 4):
        row = i // 2
        col_idx = i % 2
        axes[row, col_idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()
    
else:
    print("No suitable columns found for visualization")


## 6. Summary

### What this notebook covers:
1. **Connects to S3** and lists the available NYC taxi Parquet files
2. **Opens the data** lazily with Dask
3. **Explores data structure** and content
4. **Computes descriptive statistics** and checks for common trip, fare, and timestamp columns
5. **Creates histograms** to show the distributions of key columns

### Key Features:
- **Lazy loading**: Data is loaded on-demand to save memory
- **Parallel processing**: Dask processes data across multiple cores
- **Scalable**: Can handle datasets much larger than available memory
- **Interactive**: Easy to explore and analyze data

### Next Steps:
- Try different years of data
- Combine multiple years for longitudinal analysis
- Add more sophisticated analysis (geospatial, temporal patterns)
- Export to different formats (CSV, database, etc.)


In [None]:
print(f"\nAnalysis completed at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("✓ NYC Taxi Data Analysis Complete!")
