# Module 1 Homework: Docker & SQL

This notebook contains solutions to the Module 1 homework questions.

## Table of Contents
1. [Question 1: Understanding Docker images](#question-1)
2. [Question 2: Understanding Docker networking](#question-2)
3. [Question 3: Counting short trips](#question-3)
4. [Question 4: Longest trip for each day](#question-4)
5. [Question 5: Biggest pickup zone](#question-5)
6. [Question 6: Largest tip](#question-6)
7. [Question 7: Terraform Workflow](#question-7)

## Question 1: Understanding Docker images {#question-1}

**Task:** Run docker with the python:3.13 image. Use an entrypoint bash to interact with the container. What's the version of pip in the image?

**Answer Options:**
- 25.3
- 24.3.1
- 24.2.1
- 23.3.1

In [10]:
# Run this command in terminal to check pip version:
# docker run --entrypoint bash python:3.13 -c "pip --version"

# Or interactively:
# docker run -it --entrypoint bash python:3.13
# Then inside the container: pip --version

In [11]:
import subprocess

# Run docker command to check pip version
try:
    result = subprocess.run(
        ['docker', 'run', '--entrypoint', 'bash', 'python:3.13', '-c', 'pip --version'],
        capture_output=True,
        text=True,
        timeout=30
    )
    if result.returncode == 0:
        print("Pip version output:")
        print(result.stdout)
        print("\nAnswer: Check the output above to determine the pip version")
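    else:
        # A non-zero exit code has many possible causes (daemon not running,
        # image pull failure, ...), so show what docker actually reported
        print(f"docker exited with code {result.returncode}:")
        print(result.stderr)
        print("To retry manually, run:")
        print("docker run --entrypoint bash python:3.13 -c 'pip --version'")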
except Exception as e:
    print(f"Error: {e}")
    print("\nTo check pip version manually, run:")
    print("docker run --entrypoint bash python:3.13 -c 'pip --version'")

Pip version output:
pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)


Answer: Check the output above to determine the pip version

**Answer: 25.3**

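The version can also be extracted from that output line programmatically instead of being read by eye. This is a minimal sketch: the helper name `parse_pip_version` is hypothetical, and it assumes the usual `pip <version> from <path>` layout seen in the output above.

```python
import re
from typing import Optional


def parse_pip_version(output: str) -> Optional[str]:
    """Return the version from a `pip --version` line, or None if it does not match."""
    match = re.match(r"pip (\S+) from ", output.strip())
    return match.group(1) if match else None


# The line printed by the container in the run above
line = "pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)"
print(parse_pip_version(line))  # 25.3
```
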

## Question 2: Understanding Docker networking {#question-2}

**Task:** Given the docker-compose.yaml, what is the hostname and port that pgadmin should use to connect to the postgres database?

**Answer Options:**
- postgres:5433
- localhost:5432
- db:5433
- postgres:5432
- db:5432

### Explanation:

In Docker Compose, services can communicate with each other using their **service names** as hostnames. The service name in the docker-compose.yaml is `db`.

For the port:
- Inside the container, postgres runs on port `5432` (default PostgreSQL port)
- The port mapping `5433:5432` maps host port 5433 to container port 5432
- **For inter-container communication**, you use the **container's internal port** (5432), not the host-mapped port (5433)

Therefore, pgadmin should connect to: **`db:5432`**

**Answer: db:5432**
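
The rule above can be sketched as a small helper that splits a Compose port mapping into the address used by other containers and the address used from the host. The function name `addresses` is hypothetical, and the sketch assumes the short two-part `HOST:CONTAINER` syntax only (no IP prefix or protocol suffix).

```python
def addresses(service: str, port_mapping: str) -> dict:
    """Split a Compose "HOST:CONTAINER" port mapping into the two usable addresses."""
    host_port, container_port = port_mapping.split(":")
    return {
        # Containers on the same Compose network use the service name
        # and the container's internal port
        "from_other_containers": f"{service}:{container_port}",
        # Tools running on the host machine use the published port
        "from_host_machine": f"localhost:{host_port}",
    }


result = addresses("db", "5433:5432")
print(result["from_other_containers"])  # db:5432
print(result["from_host_machine"])      # localhost:5433
```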

## Question 3: Counting short trips {#question-3}

**Task:** For trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a trip_distance of less than or equal to 1 mile?

**Answer Options:**
- 7,853
- 8,007
- 8,254
- 8,421

### Setup: Download the data first

In [12]:
# Download the data files
import subprocess
import os

# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Download green taxi trips data
print("Downloading green_tripdata_2025-11.parquet...")
subprocess.run([
    'wget',
    'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2025-11.parquet',
    '-O', 'data/green_tripdata_2025-11.parquet'
], check=True)  # raise CalledProcessError if wget fails

# Download taxi zone lookup
print("\nDownloading taxi_zone_lookup.csv...")
subprocess.run([
    'wget',
    'https://github.com/DataTalksClub/nyc-tlc-data/releases/download/misc/taxi_zone_lookup.csv',
    '-O', 'data/taxi_zone_lookup.csv'
], check=True)  # raise CalledProcessError if wget fails

print("\nData download complete!")

Downloading green_tripdata_2025-11.parquet...

Downloading taxi_zone_lookup.csv...


Data download complete!



In [13]:
# Load data and perform analysis
import pandas as pd  # read_parquet needs a parquet engine such as pyarrow installed

# Read parquet file
df = pd.read_parquet('data/green_tripdata_2025-11.parquet')

print(f"Total rows: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print("\nFirst few rows:")
df.head()

Total rows: 46912
Columns: ['VendorID', 'lpep_pickup_datetime', 'lpep_dropoff_datetime', 'store_and_fwd_flag', 'RatecodeID', 'PULocationID', 'DOLocationID', 'passenger_count', 'trip_distance', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'ehail_fee', 'improvement_surcharge', 'total_amount', 'payment_type', 'trip_type', 'congestion_surcharge', 'cbd_congestion_fee']

First few rows:


| | VendorID | lpep_pickup_datetime | lpep_dropoff_datetime | store_and_fwd_flag | RatecodeID | PULocationID | DOLocationID | passenger_count | trip_distance | fare_amount | ... | mta_tax | tip_amount | tolls_amount | ehail_fee | improvement_surcharge | total_amount | payment_type | trip_type | congestion_surcharge | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2025-11-01 00:34:48 | 2025-11-01 00:41:39 | N | 1.0 | 74 | 42 | 1.0 | 0.74 | 7.2 | ... | 0.5 | 1.94 | 0.0 | | 1.0 | 11.64 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 2025-11-01 00:18:52 | 2025-11-01 00:24:27 | N | 1.0 | 74 | 42 | 2.0 | 0.95 | 7.2 | ... | 0.5 | 0.0 | 0.0 | | 1.0 | 9.7 | 2.0 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 2025-11-01 01:03:14 | 2025-11-01 01:15:24 | N | 1.0 | 83 | 160 | 1.0 | 2.19 | 13.5 | ... | 0.5 | 5.0 | 0.0 | | 1.0 | 21.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 2 | 2025-11-01 00:10:57 | 2025-11-01 00:24:53 | N | 1.0 | 166 | 127 | 1.0 | 5.44 | 24.7 | ... | 0.5 | 0.5 | 0.0 | | 1.0 | 27.7 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1 | 2025-11-01 00:03:48 | 2025-11-01 00:19:38 | N | 1.0 | 166 | 262 | 1.0 | 3.2 | 18.4 | ... | 1.5 | 1.0 | 0.0 | | 1.0 | 24.65 | 1.0 | 1.0 | 2.75 | 0.0 |


In [14]:
# Convert lpep_pickup_datetime to datetime if it's not already
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])

# Filter for November 2025 (between '2025-11-01' and '2025-12-01', exclusive of upper bound)
nov_trips = df[
    (df['lpep_pickup_datetime'] >= '2025-11-01') & 
    (df['lpep_pickup_datetime'] < '2025-12-01')
]

# Count trips with trip_distance <= 1 mile
short_trips = nov_trips[nov_trips['trip_distance'] <= 1.0]

print(f"Total trips in November 2025: {len(nov_trips):,}")
print(f"Trips with trip_distance <= 1 mile: {len(short_trips):,}")
print(f"\n✓ ANSWER: {len(short_trips):,}")

Total trips in November 2025: 46,891
Trips with trip_distance <= 1 mile: 8,007

✓ ANSWER: 8,007


### SQL Query (for use in PostgreSQL):

```sql
SELECT COUNT(*) as short_trips_count
FROM green_taxi_trips
WHERE lpep_pickup_datetime >= '2025-11-01' 
  AND lpep_pickup_datetime < '2025-12-01'
  AND trip_distance <= 1.0;
```
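
The half-open range used in the filter above (`>= '2025-11-01'` and `< '2025-12-01'`) matters because a bare date compares as midnight. The sketch below uses made-up timestamps, and the helper name `in_november` is hypothetical; it shows that an inclusive upper bound of `2025-11-30` would silently drop almost all of the last day.

```python
from datetime import datetime

START = datetime(2025, 11, 1)
END = datetime(2025, 12, 1)  # exclusive upper bound


def in_november(ts: datetime) -> bool:
    """Half-open interval check: START <= ts < END."""
    return START <= ts < END


# Hypothetical pickup times, for illustration only
late_on_last_day = datetime(2025, 11, 30, 23, 59, 59)
first_of_december = datetime(2025, 12, 1, 0, 0, 0)

print(in_november(late_on_last_day))   # True
print(in_november(first_of_december))  # False

# An inclusive bound of '2025-11-30' means midnight at the START of that day,
# so a pickup later on November 30 would be excluded
print(late_on_last_day <= datetime(2025, 11, 30))  # False
```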

## Question 4: Longest trip for each day {#question-4}

**Task:** Which was the pick up day with the longest trip distance? Only consider trips with trip_distance less than 100 miles (to exclude data errors). Use the pick up time for your calculations.

**Answer Options:**
- 2025-11-14
- 2025-11-20
- 2025-11-23
- 2025-11-25

In [15]:
# Filter for November 2025 and trip_distance < 100
nov_trips_filtered = df[
    (df['lpep_pickup_datetime'] >= '2025-11-01') & 
    (df['lpep_pickup_datetime'] < '2025-12-01') &
    (df['trip_distance'] < 100)
].copy()

# Extract date from pickup datetime
nov_trips_filtered['pickup_date'] = nov_trips_filtered['lpep_pickup_datetime'].dt.date

# Find the maximum trip distance for each day
max_distance_per_day = nov_trips_filtered.groupby('pickup_date')['trip_distance'].max().reset_index()
max_distance_per_day.columns = ['pickup_date', 'max_distance']

# Sort by max_distance descending to find the day with longest trip
max_distance_per_day = max_distance_per_day.sort_values('max_distance', ascending=False)

print("Top 5 days with longest trip distances:")
print(max_distance_per_day.head())

# The day with the longest trip
longest_trip_day = max_distance_per_day.iloc[0]
print(f"\nDay with longest trip: {longest_trip_day['pickup_date']}")
print(f"Longest trip distance: {longest_trip_day['max_distance']:.2f} miles")
print(f"\n✓ ANSWER: {longest_trip_day['pickup_date']}")

Top 5 days with longest trip distances:
   pickup_date  max_distance
13  2025-11-14         88.03
19  2025-11-20         73.84
22  2025-11-23         45.26
21  2025-11-22         40.16
14  2025-11-15         39.81

Day with longest trip: 2025-11-14
Longest trip distance: 88.03 miles

✓ ANSWER: 2025-11-14


### SQL Query:

```sql
SELECT 
    DATE(lpep_pickup_datetime) as pickup_date,
    MAX(trip_distance) as max_distance
FROM green_taxi_trips
WHERE lpep_pickup_datetime >= '2025-11-01' 
  AND lpep_pickup_datetime < '2025-12-01'
  AND trip_distance < 100
GROUP BY DATE(lpep_pickup_datetime)
ORDER BY max_distance DESC
LIMIT 1;
```
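
For readers who want to see the grouping logic without pandas or a database, here is a plain-Python sketch of the same steps: filter out distances of 100 miles or more, keep the maximum per pickup day, then take the day with the largest maximum. The rows are hypothetical and exist only to illustrate the mechanics.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (pickup_datetime, trip_distance) rows, for illustration only
rows = [
    (datetime(2025, 11, 14, 9, 5), 3.2),
    (datetime(2025, 11, 14, 18, 40), 88.03),
    (datetime(2025, 11, 20, 7, 15), 73.84),
    (datetime(2025, 11, 20, 22, 1), 250.0),  # dropped by the < 100 filter
]

# Equivalent of GROUP BY DATE(pickup) with MAX(trip_distance)
max_per_day = defaultdict(float)
for pickup, distance in rows:
    if distance < 100:
        day = pickup.date()
        max_per_day[day] = max(max_per_day[day], distance)

# Equivalent of ORDER BY max_distance DESC LIMIT 1
best_day = max(max_per_day, key=max_per_day.get)
print(best_day, max_per_day[best_day])  # 2025-11-14 88.03
```

Because only the single longest trip matters, taking the pickup date of the row with the overall maximum distance (after the filter) gives the same day without grouping.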

## Question 5: Biggest pickup zone {#question-5}

**Task:** Which was the pickup zone with the largest total_amount (sum of all trips) on November 18th, 2025?

**Answer Options:**
- East Harlem North
- East Harlem South
- Morningside Heights
- Forest Hills

### First, we need to load the zones data and join with trips

In [16]:
# Load zones data
zones_df = pd.read_csv('data/taxi_zone_lookup.csv')
print("Zones data:")
print(zones_df.head())
print(f"\nTotal zones: {len(zones_df)}")

Zones data:
   LocationID        Borough                     Zone service_zone
0           1            EWR           Newark Airport          EWR
1           2         Queens              Jamaica Bay    Boro Zone
2           3          Bronx  Allerton/Pelham Gardens    Boro Zone
3           4      Manhattan            Alphabet City  Yellow Zone
4           5  Staten Island            Arden Heights    Boro Zone

Total zones: 265


In [17]:
# Filter for November 18th, 2025
nov_18_trips = df[
    (df['lpep_pickup_datetime'] >= '2025-11-18') & 
    (df['lpep_pickup_datetime'] < '2025-11-19')
]

print(f"Total trips on November 18, 2025: {len(nov_18_trips):,}")

# Join with zones data on PULocationID
nov_18_with_zones = nov_18_trips.merge(
    zones_df,
    left_on='PULocationID',
    right_on='LocationID',
    how='left'
)

# Group by pickup zone and sum total_amount
zone_totals = nov_18_with_zones.groupby('Zone')['total_amount'].sum().reset_index()
zone_totals = zone_totals.sort_values('total_amount', ascending=False)

print("\nTop 5 pickup zones by total_amount:")
print(zone_totals.head())

# The zone with largest total_amount
biggest_zone = zone_totals.iloc[0]
print(f"\nPickup zone with largest total_amount: {biggest_zone['Zone']}")
print(f"Total amount: ${biggest_zone['total_amount']:,.2f}")
print(f"\n✓ ANSWER: {biggest_zone['Zone']}")

Total trips on November 18, 2025: 1,773

Top 5 pickup zones by total_amount:
                         Zone  total_amount
39          East Harlem North       9281.92
40          East Harlem South       6696.13
23               Central Park       2378.79
128  Washington Heights South       2139.05
85        Morningside Heights       2100.59

Pickup zone with largest total_amount: East Harlem North
Total amount: $9,281.92

✓ ANSWER: East Harlem North


### SQL Query:

```sql
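-- If the tables were loaded with pandas to_sql, the mixed-case columns must be
-- double-quoted in PostgreSQL, e.g. t."PULocationID", z."LocationID", z."Zone".
SELECT 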
    z.Zone as pickup_zone,
    SUM(t.total_amount) as total_amount_sum
FROM green_taxi_trips t
JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID
WHERE DATE(t.lpep_pickup_datetime) = '2025-11-18'
GROUP BY z.Zone
ORDER BY total_amount_sum DESC
LIMIT 1;
```

## Question 6: Largest tip {#question-6}

**Task:** For passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip?

**Note:** It's tip, not trip. We need the name of the zone, not the ID.

**Answer Options:**
- JFK Airport
- Yorkville West
- East Harlem North
- LaGuardia Airport

In [18]:
# Filter for November 2025 and attach pickup zone names
nov_trips_with_zones = df[
    (df['lpep_pickup_datetime'] >= '2025-11-01') & 
    (df['lpep_pickup_datetime'] < '2025-12-01')
].merge(
    zones_df,
    left_on='PULocationID',
    right_on='LocationID',
    how='left'
)

# Filter for pickup zone "East Harlem North"
east_harlem_north_pickups = nov_trips_with_zones[
    nov_trips_with_zones['Zone'] == 'East Harlem North'
]

print(f"Trips picked up from East Harlem North in November 2025: {len(east_harlem_north_pickups):,}")

# Join with zones for dropoff location
east_harlem_north_with_dropoff_zones = east_harlem_north_pickups.merge(
    zones_df,
    left_on='DOLocationID',
    right_on='LocationID',
    how='left',
    suffixes=('_pickup', '_dropoff')
)

# Group by dropoff zone and sum tips
dropoff_zone_tips = east_harlem_north_with_dropoff_zones.groupby('Zone_dropoff')['tip_amount'].sum().reset_index()
dropoff_zone_tips.columns = ['dropoff_zone', 'total_tips']
dropoff_zone_tips = dropoff_zone_tips.sort_values('total_tips', ascending=False)

print("\nTop 10 dropoff zones by total tips:")
print(dropoff_zone_tips.head(10))

# Check the specific zones from answer options
target_zones = ['JFK Airport', 'Yorkville West', 'East Harlem North', 'LaGuardia Airport']
print("\nChecking zones from answer options:")
for zone in target_zones:
    zone_data = dropoff_zone_tips[dropoff_zone_tips['dropoff_zone'] == zone]
    if len(zone_data) > 0:
        print(f"  {zone}: ${zone_data.iloc[0]['total_tips']:,.2f}")

# The dropoff zone with the largest summed tips across all zones
largest_tip_zone = dropoff_zone_tips.iloc[0]
print(f"\nDropoff zone with largest tip (overall): {largest_tip_zone['dropoff_zone']}")
print(f"Total tips: ${largest_tip_zone['total_tips']:,.2f}")

# The overall leader is not one of the answer options, so pick the best-ranked option.
# dropoff_zone_tips is already sorted in descending order, so the first match wins.
options = dropoff_zone_tips[dropoff_zone_tips['dropoff_zone'].isin(target_zones)]
if len(options) > 0:
    best_option = options.iloc[0]
    print(f"\n✓ ANSWER (among options): {best_option['dropoff_zone']}")
    print(f"   Total tips: ${best_option['total_tips']:,.2f}")

Trips picked up from East Harlem North in November 2025: 12,041

Top 10 dropoff zones by total tips:
              dropoff_zone  total_tips
114  Upper East Side North     4242.01
29       East Harlem South     3425.94
116  Upper West Side North     2752.60
135         Yorkville West     2403.17
76     Morningside Heights     2121.65
117  Upper West Side South     1888.49
60       LaGuardia Airport     1835.52
134         Yorkville East     1488.96
69        Manhattan Valley     1307.24
61         Lenox Hill East     1217.26

Checking zones from answer options:
  JFK Airport: $307.66
  Yorkville West: $2,403.17
  East Harlem North: $604.10
  LaGuardia Airport: $1,835.52

Dropoff zone with largest tip (overall): Upper East Side North
Total tips: $4,242.01

✓ ANSWER (among options): Yorkville West
   Total tips: $2,403.17


### SQL Query:

Note that this query ranks all drop-off zones by summed tips, so `LIMIT 1` should return the overall leader (Upper East Side North in the output above), which is not among the answer options.

```sql
SELECT 
    z_dropoff.Zone as dropoff_zone,
    SUM(t.tip_amount) as total_tips
FROM green_taxi_trips t
JOIN taxi_zone_lookup z_pickup ON t.PULocationID = z_pickup.LocationID
JOIN taxi_zone_lookup z_dropoff ON t.DOLocationID = z_dropoff.LocationID
WHERE t.lpep_pickup_datetime >= '2025-11-01' 
  AND t.lpep_pickup_datetime < '2025-12-01'
  AND z_pickup.Zone = 'East Harlem North'
GROUP BY z_dropoff.Zone
ORDER BY total_tips DESC
LIMIT 1;
```
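
The wording "largest tip" can also be read as the single largest tip rather than the largest sum of tips, and the two readings can pick different zones. The sketch below illustrates the difference with an in-memory SQLite table; the table name `tips`, the zone names, and the amounts are all made up, and the helper `top_zone` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tips (dropoff_zone TEXT, tip_amount REAL)")
conn.executemany(
    "INSERT INTO tips VALUES (?, ?)",
    # Made-up rows: Zone A has many small tips, Zone B has one large tip
    [("Zone A", 5.0), ("Zone A", 6.0), ("Zone A", 7.0), ("Zone B", 15.0)],
)


def top_zone(aggregate: str) -> str:
    """Return the zone ranked first by SUM or MAX of tip_amount."""
    if aggregate not in ("SUM", "MAX"):
        raise ValueError("aggregate must be 'SUM' or 'MAX'")
    query = f"""
        SELECT dropoff_zone
        FROM tips
        GROUP BY dropoff_zone
        ORDER BY {aggregate}(tip_amount) DESC
        LIMIT 1
    """
    return conn.execute(query).fetchone()[0]


print(top_zone("SUM"))  # Zone A (5 + 6 + 7 = 18 in total)
print(top_zone("MAX"))  # Zone B (a single tip of 15)
```

In the PostgreSQL query above, the second reading corresponds to replacing `SUM(t.tip_amount)` with `MAX(t.tip_amount)`; it is worth running both against the real data before settling on an answer.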

## Question 7: Terraform Workflow {#question-7}

**Task:** Which of the following sequences, respectively, describes the workflow for:
1. Downloading the provider plugins and setting up backend
2. Generating proposed changes and auto-executing the plan
3. Remove all resources managed by terraform

**Answer Options:**
- terraform import, terraform apply -y, terraform destroy
- teraform init, terraform plan -auto-apply, terraform rm
- terraform init, terraform run -auto-approve, terraform destroy
- terraform init, terraform apply -auto-approve, terraform destroy
- terraform import, terraform apply -y, terraform rm

### Explanation:

1. **Downloading provider plugins and setting up backend:** `terraform init`
   - Initializes Terraform working directory
   - Downloads provider plugins
   - Sets up backend configuration

2. **Generating proposed changes and auto-executing the plan:** `terraform apply -auto-approve`
   - `terraform apply` generates the execution plan itself
   - `-auto-approve` skips the interactive confirmation and applies the plan immediately
   - Alternative: `terraform plan` (only shows the plan) followed by `terraform apply` (applies it)

3. **Remove all resources managed by terraform:** `terraform destroy`
   - Destroys all resources managed by the current Terraform configuration

**Answer: terraform init, terraform apply -auto-approve, terraform destroy**
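
The three steps can be scripted in the same `subprocess` style used earlier in this notebook. This is a sketch only: the names `WORKFLOW` and `run_workflow` are hypothetical, and the default is a dry run that lists the commands without executing anything, since `terraform destroy` removes real resources.

```python
import subprocess

# The three commands from the answer, in order
WORKFLOW = [
    ["terraform", "init"],
    ["terraform", "apply", "-auto-approve"],
    ["terraform", "destroy"],
]


def run_workflow(dry_run: bool = True) -> list:
    """Return the command lines; execute them only when dry_run is False."""
    command_lines = []
    for argv in WORKFLOW:
        command_lines.append(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence as soon as one command fails
            subprocess.run(argv, check=True)
    return command_lines


for command in run_workflow():
    print(command)
```

Unlike the `apply` step, `terraform destroy` as written here still asks for interactive confirmation, which is a reasonable safety net when tearing resources down.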