# Module 1 Homework: Docker & SQL

In this homework we'll prepare the environment and practice Docker and SQL.

When submitting your homework, you will also need to include a link to your GitHub repository or other public code-hosting site.

This repository should contain the code for solving the homework.

If your solution consists of SQL or shell commands rather than code files (e.g. Python files), include them directly in the README file of your repository.

---

## Question 1. Understanding Docker images

Run Docker with the python:3.13 image. Use bash as the entrypoint to interact with the container.

	What's the version of pip in the image?

In [31]:
!pip -V

pip 25.3 from /usr/local/lib/python3.13/site-packages/pip (python 3.13)
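
The cell above reports the pip version of whatever environment the notebook runs in. To check the image itself, as the question describes, one option is to start the container with bash as its entrypoint and ask pip for its version. The sketch below only assembles that command and prints it; the helper name `pip_version_command` is hypothetical, and actually running the command assumes a local Docker installation that can pull the image.

```python
import shlex


def pip_version_command(image: str = "python:3.13") -> list[str]:
    """Build a `docker run` command that starts bash in `image` and prints pip's version."""
    return [
        "docker", "run", "--rm",   # remove the container once it exits
        "--entrypoint", "bash",    # use bash as the container's entrypoint
        image,
        "-c", "pip --version",     # arguments handed to bash
    ]


command = pip_version_command()
print(shlex.join(command))

# To execute it (assumes Docker is installed and the image is available):
# import subprocess
# subprocess.run(command, check=True)
```

For an interactive session, replacing `--rm ... -c "pip --version"` with `-it` and typing `pip --version` at the bash prompt should give the same information.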


## Prepare the Data

In [32]:
import pandas as pd
from sqlalchemy import create_engine

# Connection details
user = "postgres"
password = "postgres"
host = "db"
port = "5432"
database = "ny_taxi"

engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{database}")

In [33]:
# List all tables and show row counts for each
from sqlalchemy import inspect

inspector = inspect(engine)
tables = inspector.get_table_names()
print("Tables in the database:", tables)

for table in tables:
    count = pd.read_sql(f"SELECT COUNT(*) FROM {table}", engine).iloc[0, 0]
    print(f"Table '{table}' has {count} rows.")

# Load the tables into DataFrames
df_trips = pd.read_sql_table('green_tripdata', engine)
df_taxi_zones = pd.read_sql_table('taxi_zone_lookup', engine)

Tables in the database: ['green_tripdata', 'taxi_zone_lookup']
Table 'green_tripdata' has 46912 rows.
Table 'taxi_zone_lookup' has 265 rows.


## Question 3. Counting short trips

    For the trips in November 2025 (lpep_pickup_datetime between '2025-11-01' and '2025-12-01', exclusive of the upper bound), how many trips had a trip_distance of less than or equal to 1 mile?

In [34]:
november_trips = df_trips[
    (df_trips['lpep_pickup_datetime'] >= '2025-11-01') &
    (df_trips['lpep_pickup_datetime'] < '2025-12-01') &
    (df_trips['trip_distance'] <= 1)
]

num_short_trips = len(november_trips)
num_short_trips

8007
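
Since the homework asks for SQL, the same filter can also be written as a query. The sketch below is not the query run against the real database: it uses an in-memory SQLite table with invented rows so that it runs without a Postgres server, and only the table and column names are taken from the cells above. The half-open interval (`>=` on the lower bound, `<` on the upper bound) mirrors the pandas filter.

```python
import sqlite3

# Invented rows for illustration only; the real data lives in Postgres.
rows = [
    ("2025-10-31 23:59:00", 0.5),  # before November: excluded
    ("2025-11-01 00:00:00", 0.8),  # first instant of November: included
    ("2025-11-15 12:30:00", 1.0),  # exactly 1 mile: included by <=
    ("2025-11-20 08:00:00", 2.4),  # longer than 1 mile: excluded
    ("2025-12-01 00:00:00", 0.3),  # upper bound is exclusive: excluded
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_tripdata (lpep_pickup_datetime TEXT, trip_distance REAL)"
)
conn.executemany("INSERT INTO green_tripdata VALUES (?, ?)", rows)

query = """
SELECT COUNT(*)
FROM green_tripdata
WHERE lpep_pickup_datetime >= '2025-11-01'
  AND lpep_pickup_datetime <  '2025-12-01'
  AND trip_distance <= 1
"""
short_trip_count = conn.execute(query).fetchone()[0]
print(short_trip_count)  # 2 for the toy rows above
```

SQLite compares the timestamps as text here, whereas Postgres would compare them as timestamps; for ISO-formatted values the two orderings agree.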

## Question 4. Longest trip for each day

    Which was the pick up day with the longest trip distance? Only consider trips with trip_distance less than 100 miles (to exclude data errors).

Use the pick up time for your calculations.

In [35]:
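# Exclude likely data errors: keep only trips shorter than 100 miles
valid_trips = df_trips[df_trips['trip_distance'] < 100]

# Select the row with the largest remaining trip_distance
longest_trip = valid_trips.loc[valid_trips['trip_distance'].idxmax()]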

# Extract the pickup day
pickup_day_longest_trip = longest_trip['lpep_pickup_datetime'].date()
pickup_day_longest_trip

datetime.date(2025, 11, 14)
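
A SQL version of this question could group trips by pickup day and keep the day with the largest maximum distance. The following is a minimal sketch under stated assumptions: the rows are invented, and SQLite stands in for Postgres so the example is self-contained. `DATE(...)` is used to truncate the timestamp to a day; on Postgres, `CAST(lpep_pickup_datetime AS DATE)` is an equivalent spelling.

```python
import sqlite3

# Invented rows for illustration only.
rows = [
    ("2025-11-03 09:00:00", 88.0),
    ("2025-11-03 18:00:00", 3.2),
    ("2025-11-07 07:30:00", 45.5),
    ("2025-11-09 10:00:00", 250.0),  # 100 miles or more: treated as a data error
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE green_tripdata (lpep_pickup_datetime TEXT, trip_distance REAL)"
)
conn.executemany("INSERT INTO green_tripdata VALUES (?, ?)", rows)

query = """
SELECT DATE(lpep_pickup_datetime) AS pickup_day,
       MAX(trip_distance)         AS longest_trip
FROM green_tripdata
WHERE trip_distance < 100
GROUP BY DATE(lpep_pickup_datetime)
ORDER BY longest_trip DESC
LIMIT 1
"""
pickup_day, longest_trip = conn.execute(query).fetchone()
print(pickup_day, longest_trip)  # 2025-11-03 88.0 for the toy rows above
```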

## Question 5. Biggest pickup zone
    
    Which was the pickup zone with the largest total_amount (sum of all trips) on November 18th, 2025?

        East Harlem North
        East Harlem South
        Morningside Heights
        Forest Hills

In [36]:
# Filter trips on November 18th, 2025
nov18_trips = df_trips[
    (df_trips['lpep_pickup_datetime'] >= '2025-11-18') &
    (df_trips['lpep_pickup_datetime'] < '2025-11-19')
]

# Group by pickup location and sum total_amount
pickup_totals = nov18_trips.groupby('PULocationID')['total_amount'].sum()

# Find the PULocationID with the largest total_amount
max_pu_location_id = pickup_totals.idxmax()

# Get the pickup zone name
pickup_zone_name = df_taxi_zones[df_taxi_zones['LocationID'] == max_pu_location_id]['Zone'].values[0]
pickup_zone_name


'East Harlem North'
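
The groupby-and-lookup steps above correspond to a join plus an aggregate in SQL. The sketch below illustrates that shape with invented zones and amounts in an in-memory SQLite database; only the table and column names come from the cells above. The mixed-case column names are double-quoted because Postgres would otherwise fold them to lowercase.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxi_zone_lookup ("LocationID" INTEGER, "Zone" TEXT);
CREATE TABLE green_tripdata (
    lpep_pickup_datetime TEXT, "PULocationID" INTEGER, total_amount REAL
);
-- Invented rows for illustration only.
INSERT INTO taxi_zone_lookup VALUES (1, 'Zone A'), (2, 'Zone B');
INSERT INTO green_tripdata VALUES
    ('2025-11-18 08:00:00', 1, 20.0),
    ('2025-11-18 09:00:00', 1, 15.0),
    ('2025-11-18 10:00:00', 2, 30.0),
    ('2025-11-19 00:00:00', 2, 99.0);  -- next day: excluded
""")

query = """
SELECT z."Zone", SUM(t.total_amount) AS total
FROM green_tripdata AS t
JOIN taxi_zone_lookup AS z ON z."LocationID" = t."PULocationID"
WHERE t.lpep_pickup_datetime >= '2025-11-18'
  AND t.lpep_pickup_datetime <  '2025-11-19'
GROUP BY z."Zone"
ORDER BY total DESC
LIMIT 1
"""
zone, total = conn.execute(query).fetchone()
print(zone, total)  # Zone A 35.0 for the toy rows above
```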

## Question 6. Largest tip

    For the passengers picked up in the zone named "East Harlem North" in November 2025, which was the drop off zone that had the largest tip?

Note: it's tip, not trip. We need the name of the zone, not the ID.

    JFK Airport
    Yorkville West
    East Harlem North
    LaGuardia Airport


In [37]:
# Filter trips picked up in "East Harlem North" in November 2025
east_harlem_november = df_trips[
    (df_trips['lpep_pickup_datetime'] >= '2025-11-01') &
    (df_trips['lpep_pickup_datetime'] < '2025-12-01') &
    (df_trips['PULocationID'].isin(df_taxi_zones[df_taxi_zones['Zone'] == 'East Harlem North']['LocationID']))
]

# Find the trip with the largest tip
largest_tip_trip = east_harlem_november.loc[east_harlem_november['tip_amount'].idxmax()]

# Get the drop off zone name
dropoff_zone_id = largest_tip_trip['DOLocationID']
dropoff_zone_name = df_taxi_zones[df_taxi_zones['LocationID'] == dropoff_zone_id]['Zone'].values[0]
dropoff_zone_name

'Yorkville West'
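
In SQL this question needs the zone table twice: once to filter on the pickup zone name and once to translate the drop-off ID into a name. The sketch below shows one way to write that, assuming the same table and column names as above; the rows, the extra zone names, and the aliases `pz` and `dz` are invented for illustration, and SQLite is used only so the example runs without a Postgres server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxi_zone_lookup ("LocationID" INTEGER, "Zone" TEXT);
CREATE TABLE green_tripdata (
    lpep_pickup_datetime TEXT, "PULocationID" INTEGER,
    "DOLocationID" INTEGER, tip_amount REAL
);
-- Invented rows for illustration only.
INSERT INTO taxi_zone_lookup VALUES
    (1, 'East Harlem North'), (2, 'Zone B'), (3, 'Zone C');
INSERT INTO green_tripdata VALUES
    ('2025-11-05 08:00:00', 1, 2, 4.0),
    ('2025-11-12 09:00:00', 1, 3, 12.5),
    ('2025-11-20 10:00:00', 2, 3, 50.0),  -- picked up elsewhere: excluded
    ('2025-12-02 11:00:00', 1, 2, 80.0);  -- outside November: excluded
""")

query = """
SELECT dz."Zone" AS dropoff_zone, t.tip_amount
FROM green_tripdata AS t
JOIN taxi_zone_lookup AS pz ON pz."LocationID" = t."PULocationID"
JOIN taxi_zone_lookup AS dz ON dz."LocationID" = t."DOLocationID"
WHERE pz."Zone" = 'East Harlem North'
  AND t.lpep_pickup_datetime >= '2025-11-01'
  AND t.lpep_pickup_datetime <  '2025-12-01'
ORDER BY t.tip_amount DESC
LIMIT 1
"""
dropoff_zone, largest_tip = conn.execute(query).fetchone()
print(dropoff_zone, largest_tip)  # Zone C 12.5 for the toy rows above
```

Ordering by the single largest `tip_amount` matches the pandas `idxmax()` approach; summing tips per drop-off zone would be a different reading of the question.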