## Week 1 Homework

In this homework we'll prepare the environment
and practice with Docker and SQL.

## Question 1. Knowing docker tags

Run the command to get information on Docker 

```docker --help```

Now run the command to get help on the `docker build` command.

Which tag has the following text? - *Write the image ID to the file* 

- `--imageid string`
- `--iidfile string`
- `--idimage string`
- `--idfile string`

### Answer

Running the following command lists the available options. The one described as *Write the image ID to the file* is `--iidfile string`.

``` docker build --help```

In [1]:
!docker build --help


Usage:  docker build [OPTIONS] PATH | URL | -

Build an image from a Dockerfile

Options:
      --add-host list           Add a custom host-to-IP mapping (host:ip)
      --build-arg list          Set build-time variables
      --cache-from strings      Images to consider as cache sources
      --disable-content-trust   Skip image verification (default true)
  -f, --file string             Name of the Dockerfile (Default is
                                'PATH/Dockerfile')
      --iidfile string          Write the image ID to the file
      --isolation string        Container isolation technology
      --label list              Set metadata for an image
      --network string          Set the networking mode for the RUN
                                instructions during build (default "default")
      --no-cache                Do not use cache when building the image
  -o, --output stringArray      Output destination (format:
                                type=local,dest=path)
    

## Question 2. Understanding docker first run 

Run Docker with the `python:3.9` image in interactive mode, with `bash` as the entrypoint.
Now check which Python modules are installed (use `pip list`).
How many Python packages/modules are installed?

- 1
- 6
- 3
- 7

### Answer

Run the following:

```
docker run -it python:3.9 /bin/bash
pip list
```

| Package     | Version |
|-------------|---------|
| pip         | 22.0.4  |
| setuptools  | 58.1.0  |
| wheel       | 0.38.4  |

The answer is 3
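
The count can also be obtained from inside Python rather than by reading the table. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+). It counts the distributions visible to the running interpreter, which should agree with `pip list` in a fresh container but may differ in environments with unusual install layouts. The function name `count_installed_packages` is made up for illustration.

```python
from importlib import metadata


def count_installed_packages() -> int:
    """Count distinct distributions visible to the running interpreter."""
    names = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip entries with broken or missing metadata
            names.add(name.lower())
    return len(names)


if __name__ == "__main__":
    print(count_installed_packages())
```

Run inside the `python:3.9` container, this should print the same number of rows that `pip list` shows.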



## Prepare Postgres

Run Postgres and load data as shown in the videos.
We'll use the green taxi trips from January 2019:

```wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz```

You will also need the dataset with zones:

```wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv```

Download this data and put it into Postgres (with Jupyter notebooks or with a pipeline).

In [2]:
import pandas as pd
from sqlalchemy import create_engine

In [3]:
engine = create_engine('postgresql://root:root@localhost:5433/ny_taxi')

In [None]:
!wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz

In [4]:
df_green_taxi_trips=pd.read_csv('green_tripdata_2019-01.csv.gz')

In [5]:
df_green_taxi_trips.lpep_pickup_datetime = pd.to_datetime(df_green_taxi_trips.lpep_pickup_datetime)
df_green_taxi_trips.lpep_dropoff_datetime = pd.to_datetime(df_green_taxi_trips.lpep_dropoff_datetime)

In [6]:
# DDL Data Definition Language
print(pd.io.sql.get_schema(df_green_taxi_trips, name='green_taxi_trips'))

CREATE TABLE "green_taxi_trips" (
"VendorID" INTEGER,
  "lpep_pickup_datetime" TIMESTAMP,
  "lpep_dropoff_datetime" TIMESTAMP,
  "store_and_fwd_flag" TEXT,
  "RatecodeID" INTEGER,
  "PULocationID" INTEGER,
  "DOLocationID" INTEGER,
  "passenger_count" INTEGER,
  "trip_distance" REAL,
  "fare_amount" REAL,
  "extra" REAL,
  "mta_tax" REAL,
  "tip_amount" REAL,
  "tolls_amount" REAL,
  "ehail_fee" REAL,
  "improvement_surcharge" REAL,
  "total_amount" REAL,
  "payment_type" INTEGER,
  "trip_type" INTEGER,
  "congestion_surcharge" REAL
)


In [7]:
df_green_taxi_trips.to_sql(name="green_taxi_trips", con=engine, if_exists="append")

918
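
For larger files it can help to load the data in chunks instead of in one call. The sketch below illustrates the idea with `pd.read_csv(..., chunksize=...)`. To stay self-contained it writes to an in-memory SQLite database and reads a tiny invented CSV, both of which are assumptions for illustration; with Postgres, the `engine` created above would be passed as `con` instead. The helper name `load_in_chunks` is hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Tiny stand-in for the CSV file; the values are invented for illustration.
SAMPLE_CSV = """lpep_pickup_datetime,lpep_dropoff_datetime,trip_distance
2019-01-01 00:10:00,2019-01-01 00:20:00,1.5
2019-01-01 00:30:00,2019-01-01 00:45:00,2.0
2019-01-01 01:00:00,2019-01-01 01:30:00,6.5
"""


def load_in_chunks(csv_source, conn, table="green_taxi_trips", chunksize=2):
    """Append the CSV to `table` chunk by chunk; return the rows written."""
    total = 0
    reader = pd.read_csv(
        csv_source,
        parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"],
        chunksize=chunksize,
    )
    for chunk in reader:
        # index=False keeps the DataFrame index out of the table
        chunk.to_sql(table, con=conn, if_exists="append", index=False)
        total += len(chunk)
    return total


conn = sqlite3.connect(":memory:")
print(load_in_chunks(io.StringIO(SAMPLE_CSV), conn))  # prints 3
```

Each chunk is an ordinary DataFrame, so any per-chunk cleaning can happen inside the loop before the call to `to_sql`.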

In [8]:
!wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv

--2023-01-30 15:52:30--  https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.59.104, 52.216.49.176, 52.216.221.232, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.59.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12322 (12K) [application/octet-stream]
Saving to: ‘taxi+_zone_lookup.csv.1’


2023-01-30 15:52:32 (4.27 MB/s) - ‘taxi+_zone_lookup.csv.1’ saved [12322/12322]



In [9]:
df_zones = pd.read_csv('taxi+_zone_lookup.csv')

In [10]:
df_zones.head()

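|   | LocationID | Borough       | Zone                    | service_zone |
|---|------------|---------------|-------------------------|--------------|
| 0 | 1          | EWR           | Newark Airport          | EWR          |
| 1 | 2          | Queens        | Jamaica Bay             | Boro Zone    |
| 2 | 3          | Bronx         | Allerton/Pelham Gardens | Boro Zone    |
| 3 | 4          | Manhattan     | Alphabet City           | Yellow Zone  |
| 4 | 5          | Staten Island | Arden Heights           | Boro Zone    |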


In [15]:
!wget "https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv" -O "zone_lookup.csv"

--2023-01-30 20:18:31--  https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.47.94, 52.216.217.96, 54.231.202.104, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.47.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12322 (12K) [application/octet-stream]
Saving to: ‘zone_lookup.csv’


2023-01-30 20:18:33 (51.8 MB/s) - ‘zone_lookup.csv’ saved [12322/12322]



In [16]:
lookup_df = pd.read_csv("zone_lookup.csv")
lookup_df.head()

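|   | LocationID | Borough       | Zone                    | service_zone |
|---|------------|---------------|-------------------------|--------------|
| 0 | 1          | EWR           | Newark Airport          | EWR          |
| 1 | 2          | Queens        | Jamaica Bay             | Boro Zone    |
| 2 | 3          | Bronx         | Allerton/Pelham Gardens | Boro Zone    |
| 3 | 4          | Manhattan     | Alphabet City           | Yellow Zone  |
| 4 | 5          | Staten Island | Arden Heights           | Boro Zone    |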


## Question 3. Count records 

How many taxi trips were made in total on January 15?

Tip: count only trips that both started and finished on 2019-01-15.

Remember that the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns are timestamps (date plus hour, minute, and second), not dates.

- 20689
- 20530
- 17630
- 21090

In [11]:
df_green_taxi_trips[(df_green_taxi_trips["lpep_pickup_datetime"]>="2019-01-15 00:00:00")&(df_green_taxi_trips["lpep_dropoff_datetime"]<"2019-01-16 00:00:00")].shape[0]

20530
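
Since the homework is also about SQL, the same filter can be written as a query. The sketch below is a minimal illustration, not the query used to produce the answer above: it runs against an in-memory SQLite table filled with four invented rows, and the helper name `count_trips_on_day` is hypothetical. The `WHERE` clause itself uses plain comparisons and should carry over to the `green_taxi_trips` table in Postgres.

```python
import sqlite3

# Toy rows standing in for the real table (pickup, dropoff); not real data.
SAMPLE_TRIPS = [
    ("2019-01-14 23:50:00", "2019-01-15 00:05:00"),  # started the day before
    ("2019-01-15 00:00:00", "2019-01-15 00:20:00"),  # counted
    ("2019-01-15 12:30:00", "2019-01-15 23:59:59"),  # counted
    ("2019-01-15 23:55:00", "2019-01-16 00:10:00"),  # finished the day after
]

COUNT_QUERY = """
SELECT COUNT(*)
FROM green_taxi_trips
WHERE lpep_pickup_datetime >= '2019-01-15 00:00:00'
  AND lpep_dropoff_datetime < '2019-01-16 00:00:00'
"""


def count_trips_on_day(rows):
    """Load rows into an in-memory SQLite table and run the count query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE green_taxi_trips ("
            "lpep_pickup_datetime TEXT, lpep_dropoff_datetime TEXT)"
        )
        conn.executemany("INSERT INTO green_taxi_trips VALUES (?, ?)", rows)
        return conn.execute(COUNT_QUERY).fetchone()[0]
    finally:
        conn.close()


print(count_trips_on_day(SAMPLE_TRIPS))  # prints 2 for the toy rows above
```

Using a half-open range (`>=` the start of the day, `<` the start of the next day) avoids having to spell out the last second of the day.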

## Question 4. Largest trip for each day
Which was the day with the largest trip distance? Use the pickup time for your calculations.

- 2019-01-18
- 2019-01-28
- 2019-01-15
- 2019-01-10

In [12]:
df_green_taxi_trips[df_green_taxi_trips['trip_distance'] == df_green_taxi_trips['trip_distance'].max()].lpep_pickup_datetime

297377   2019-01-15 19:27:58
Name: lpep_pickup_datetime, dtype: datetime64[ns]
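
A SQL version of this lookup groups trips by pickup day and keeps the day whose longest trip is the largest. The sketch below is an illustration under assumptions: it uses an in-memory SQLite table with invented rows, the helper name `day_with_longest_trip` is hypothetical, and `date()` is SQLite syntax (in Postgres the equivalent would be `CAST(lpep_pickup_datetime AS DATE)`).

```python
import sqlite3

# Toy rows (pickup time, trip distance); invented for illustration only.
SAMPLE_TRIPS = [
    ("2019-01-10 08:15:00", 3.0),
    ("2019-01-15 19:00:00", 42.0),
    ("2019-01-15 21:00:00", 1.0),
    ("2019-01-18 07:45:00", 20.0),
]

# date() is SQLite syntax; Postgres would use CAST(... AS DATE) instead.
LONGEST_TRIP_QUERY = """
SELECT date(lpep_pickup_datetime) AS pickup_day,
       MAX(trip_distance) AS longest_trip
FROM green_taxi_trips
GROUP BY pickup_day
ORDER BY longest_trip DESC
LIMIT 1
"""


def day_with_longest_trip(rows):
    """Return (pickup_day, longest_trip) for the day with the longest trip."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE green_taxi_trips ("
            "lpep_pickup_datetime TEXT, trip_distance REAL)"
        )
        conn.executemany("INSERT INTO green_taxi_trips VALUES (?, ?)", rows)
        return conn.execute(LONGEST_TRIP_QUERY).fetchone()
    finally:
        conn.close()


print(day_with_longest_trip(SAMPLE_TRIPS))  # ('2019-01-15', 42.0)
```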

## Question 5. The number of passengers
On 2019-01-01, how many trips had 2 and 3 passengers?

- 2: 1282 ; 3: 266
- 2: 1532 ; 3: 126
- 2: 1282 ; 3: 254
- 2: 1282 ; 3: 274

In [13]:
df_green_taxi_trips[(df_green_taxi_trips['lpep_pickup_datetime'] >= '2019-01-01 00:00:00') & \
    (df_green_taxi_trips['lpep_pickup_datetime'] < '2019-01-02 00:00:00')]['passenger_count'].value_counts()

1    12415
2     1282
5      616
6      273
3      254
4      129
0       21
Name: passenger_count, dtype: int64
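
The pandas `value_counts()` call corresponds to a `GROUP BY` in SQL. The sketch below shows one way to write it, restricted to 2 and 3 passengers. It runs on an in-memory SQLite table with invented rows, and the helper name `passenger_counts` is hypothetical; the query uses only standard comparisons and should also work against Postgres.

```python
import sqlite3

# Toy rows (pickup time, passenger count); invented for illustration only.
SAMPLE_TRIPS = [
    ("2019-01-01 00:10:00", 2),
    ("2019-01-01 09:00:00", 2),
    ("2019-01-01 18:30:00", 3),
    ("2019-01-01 20:00:00", 1),  # ignored: neither 2 nor 3 passengers
    ("2019-01-02 00:00:00", 2),  # ignored: picked up the next day
]

PASSENGER_QUERY = """
SELECT passenger_count, COUNT(*) AS trips
FROM green_taxi_trips
WHERE lpep_pickup_datetime >= '2019-01-01 00:00:00'
  AND lpep_pickup_datetime < '2019-01-02 00:00:00'
  AND passenger_count IN (2, 3)
GROUP BY passenger_count
ORDER BY passenger_count
"""


def passenger_counts(rows):
    """Return {passenger_count: number_of_trips} for trips on 2019-01-01."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE green_taxi_trips ("
            "lpep_pickup_datetime TEXT, passenger_count INTEGER)"
        )
        conn.executemany("INSERT INTO green_taxi_trips VALUES (?, ?)", rows)
        return dict(conn.execute(PASSENGER_QUERY).fetchall())
    finally:
        conn.close()


print(passenger_counts(SAMPLE_TRIPS))  # {2: 2, 3: 1}
```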

## Question 6. Largest tip
For the passengers picked up in the Astoria zone, which drop-off zone had the largest tip? We want the name of the zone, not the ID.

Note: it's not a typo; it's tip, not trip.

- Central Park
- Jamaica
- South Ozone Park
- Long Island City/Queens Plaza

The answer is "Long Island City/Queens Plaza".

In [17]:
astoria_id = lookup_df[lookup_df['Zone'] == 'Astoria']['LocationID'].values[0]
astoria_df = df_green_taxi_trips[df_green_taxi_trips['PULocationID'] == astoria_id]
drop_off_id = astoria_df[astoria_df['tip_amount'] == astoria_df.tip_amount.max()].DOLocationID.values[0]
lookup_df[lookup_df["LocationID"] == drop_off_id]['Zone'].values[0]

'Long Island City/Queens Plaza'
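
In SQL this lookup becomes a join between the trips and the zone table, once for the pickup zone and once for the drop-off zone. The sketch below is an illustration under assumptions: the table name `zones` is hypothetical (the notebook keeps the lookup in a DataFrame), the location IDs, tips, and the zone names "Zone A" and "Zone B" are invented, and it runs on in-memory SQLite. The mixed-case column names are double-quoted because Postgres would otherwise fold them to lowercase.

```python
import sqlite3

# Toy lookup rows (LocationID, Zone); the IDs are invented, not the real ones.
SAMPLE_ZONES = [(1, "Astoria"), (2, "Zone A"), (3, "Zone B")]

# Toy trips (PULocationID, DOLocationID, tip_amount); invented values.
SAMPLE_TRIPS = [
    (1, 2, 5.0),
    (1, 3, 12.5),
    (2, 3, 99.0),  # ignored: not picked up in Astoria
    (1, 2, 7.0),
]

LARGEST_TIP_QUERY = """
SELECT z_do."Zone", MAX(t."tip_amount") AS largest_tip
FROM green_taxi_trips AS t
JOIN zones AS z_pu ON t."PULocationID" = z_pu."LocationID"
JOIN zones AS z_do ON t."DOLocationID" = z_do."LocationID"
WHERE z_pu."Zone" = 'Astoria'
GROUP BY z_do."Zone"
ORDER BY largest_tip DESC
LIMIT 1
"""


def largest_tip_dropoff(zones, trips):
    """Return (drop-off zone name, tip) for the largest tip from Astoria."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute('CREATE TABLE zones ("LocationID" INTEGER, "Zone" TEXT)')
        conn.execute(
            'CREATE TABLE green_taxi_trips ('
            '"PULocationID" INTEGER, "DOLocationID" INTEGER, "tip_amount" REAL)'
        )
        conn.executemany("INSERT INTO zones VALUES (?, ?)", zones)
        conn.executemany("INSERT INTO green_taxi_trips VALUES (?, ?, ?)", trips)
        return conn.execute(LARGEST_TIP_QUERY).fetchone()
    finally:
        conn.close()


print(largest_tip_dropoff(SAMPLE_ZONES, SAMPLE_TRIPS))  # ('Zone B', 12.5)
```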