#### To run SQL queries in a Jupyter notebook, we use the [`ipython-sql`](https://github.com/catherinedevlin/ipython-sql) module.

``` python
pip install ipython-sql
```
##### Load the extension with the command `%load_ext`


In [2]:
# pip install ipython-sql
%load_ext sql

#### To create a connection to Postgres, we use the following URL syntax:
`postgresql://user:password@host:port/db_name`

In [32]:
%sql postgresql://root:root@localhost:5432/ny_taxi

%sql postgresql://
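
As a small illustration of the URL format above, the pieces can be assembled in plain Python. This is only a sketch: `build_postgres_url` is a hypothetical helper, not part of `ipython-sql`, and the values are the ones used in this notebook. Percent-encoding the credentials is assumed to matter mainly when they contain characters such as `@` or `/`.

```python
from urllib.parse import quote


def build_postgres_url(user, password, host, port, db_name):
    """Assemble a postgresql:// connection URL (hypothetical helper)."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URL.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{db_name}"


url = build_postgres_url("root", "root", "localhost", 5432, "ny_taxi")
print(url)
```

The resulting string is what gets passed to `%sql` in the cell above.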

For this session we are going to use two tables from the database:

1. [`yellow_taxi_trips`](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.csv): contains information about the trips made by yellow taxis in NYC.
2. `taxi_zones`: contains details of the zones where the trips were made.

**`%sql` treats only a single line as the query, whereas `%%sql` treats the entire cell as the SQL query.**
  

In [34]:
# Start with a simple SQL query
%sql SELECT * FROM yellow_taxi_trips LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
1,2021-01-01 00:30:10,2021-01-01 00:36:12,1,2.1,1,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5
1,2021-01-01 00:51:20,2021-01-01 00:52:19,1,0.2,1,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0
1,2021-01-01 00:43:30,2021-01-01 01:11:06,1,14.7,1,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0
1,2021-01-01 00:15:48,2021-01-01 00:31:01,0,10.6,1,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0
2,2021-01-01 00:31:49,2021-01-01 00:48:21,1,4.94,1,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5


In [38]:
%sql SELECT * FROM taxi_zones LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


LocationID,Borough,Zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
3,Bronx,Allerton/Pelham Gardens,Boro Zone
4,Manhattan,Alphabet City,Yellow Zone
5,Staten Island,Arden Heights,Boro Zone


### Exploring the database tables and data.

In [37]:
%%sql
SELECT *
FROM
    yellow_taxi_trips yt JOIN taxi_zones tz
    ON yt."PULocationID" = tz."LocationID"
LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,LocationID,Borough,Zone,service_zone
1,2021-01-01 00:30:10,2021-01-01 00:36:12,1,2.1,1,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,142,Manhattan,Lincoln Square East,Yellow Zone
1,2021-01-01 00:51:20,2021-01-01 00:52:19,1,0.2,1,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,238,Manhattan,Upper West Side North,Yellow Zone
1,2021-01-01 00:43:30,2021-01-01 01:11:06,1,14.7,1,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,132,Queens,JFK Airport,Airports
1,2021-01-01 00:15:48,2021-01-01 00:31:01,0,10.6,1,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,138,Queens,LaGuardia Airport,Airports
2,2021-01-01 00:31:49,2021-01-01 00:48:21,1,4.94,1,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,68,Manhattan,East Chelsea,Yellow Zone


#### Join `yellow_taxi_trips` with `taxi_zones` twice (once for pickup, once for drop-off) and show only the first 5 rows.

In [45]:
%%sql
SELECT
    *
FROM
    yellow_taxi_trips t,
    taxi_zones zpu,
    taxi_zones zdo
WHERE
    t."PULocationID" = zpu."LocationID" AND
    t."DOLocationID" = zdo."LocationID"
LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,LocationID,Borough,Zone,service_zone,LocationID_1,Borough_1,Zone_1,service_zone_1
1,2021-01-01 00:30:10,2021-01-01 00:36:12,1,2.1,1,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,142,Manhattan,Lincoln Square East,Yellow Zone,43,Manhattan,Central Park,Yellow Zone
1,2021-01-01 00:51:20,2021-01-01 00:52:19,1,0.2,1,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,238,Manhattan,Upper West Side North,Yellow Zone,151,Manhattan,Manhattan Valley,Yellow Zone
1,2021-01-01 00:43:30,2021-01-01 01:11:06,1,14.7,1,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,132,Queens,JFK Airport,Airports,165,Brooklyn,Midwood,Boro Zone
1,2021-01-01 00:15:48,2021-01-01 00:31:01,0,10.6,1,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,138,Queens,LaGuardia Airport,Airports,132,Queens,JFK Airport,Airports
2,2021-01-01 00:31:49,2021-01-01 00:48:21,1,4.94,1,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,68,Manhattan,East Chelsea,Yellow Zone,33,Brooklyn,Brooklyn Heights,Boro Zone


* We give aliases to the `yellow_taxi_trips` and `taxi_zones` tables for easier access.
* We match the IDs in `PULocationID` and `DOLocationID` against `LocationID` in `taxi_zones` to pull in the zone details for pickups and drop-offs.
* We wrap column names in double quotes (`""`) because Postgres requires them when a column name contains capital letters; unquoted names are folded to lower case.
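
To make the quoting rule concrete, here is a rough sketch of how one might decide when a column name needs double quotes. `quote_identifier` is a hypothetical helper written for this note, and it is deliberately simplified: it ignores reserved keywords, which would also need quoting.

```python
import re

# Unquoted Postgres identifiers are folded to lower case, so only names made of
# lower-case letters, digits, and underscores are safe without quotes.
_SAFE_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def quote_identifier(name):
    """Double-quote a name when leaving it bare would change its case.

    Hypothetical helper; reserved keywords are not handled.
    """
    if _SAFE_IDENTIFIER.fullmatch(name):
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


columns = ["tpep_pickup_datetime", "PULocationID", "DOLocationID"]
select_list = ", ".join(quote_identifier(c) for c in columns)
print(f"SELECT {select_list} FROM yellow_taxi_trips LIMIT 5;")
```

Only the two mixed-case columns end up quoted, which matches how the queries in this notebook are written.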

In [46]:
%%sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", '/', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", '/', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t,
    taxi_zones zpu,
    taxi_zones zdo
WHERE
    t."PULocationID" = zpu."LocationID" AND
    t."DOLocationID" = zdo."LocationID"
LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


tpep_pickup_datetime,tpep_dropoff_datetime,total_amount,pickup_loc,dropoff_loc
2021-01-01 00:30:10,2021-01-01 00:36:12,11.8,Manhattan/Lincoln Square East,Manhattan/Central Park
2021-01-01 00:51:20,2021-01-01 00:52:19,4.3,Manhattan/Upper West Side North,Manhattan/Manhattan Valley
2021-01-01 00:43:30,2021-01-01 01:11:06,51.95,Queens/JFK Airport,Brooklyn/Midwood
2021-01-01 00:15:48,2021-01-01 00:31:01,36.35,Queens/LaGuardia Airport,Queens/JFK Airport
2021-01-01 00:31:49,2021-01-01 00:48:21,24.36,Manhattan/East Chelsea,Brooklyn/Brooklyn Heights


* Same as the previous query, but instead of complete rows we display only specific columns.
* We use joins (implicit joins in this case) to combine info from several columns into a single column.
    * The new "virtual" column `pickup_loc` contains the values of both the `Borough` and `Zone` columns of the zones table, separated by a slash (`/`).
    * Same for `dropoff_loc`.
* More specifically, this is an inner join: only rows that have a match in every joined table are returned.
* Learn more about SQL joins [here](https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/) and [here](https://www.wikiwand.com/en/Join_(SQL)).
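
The inner-join behaviour described above can be tried on toy data without a Postgres server. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` with an in-memory database, made-up rows, and the `||` operator instead of `CONCAT` (which older SQLite builds lack).

```python
import sqlite3

# SQLite in memory stands in for Postgres here; the rows are made-up toy data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxi_zones ("LocationID" INTEGER, "Borough" TEXT, "Zone" TEXT);
    CREATE TABLE yellow_taxi_trips (
        "PULocationID" INTEGER, "DOLocationID" INTEGER, total_amount REAL
    );
    INSERT INTO taxi_zones VALUES
        (132, 'Queens', 'JFK Airport'),
        (138, 'Queens', 'LaGuardia Airport');
    INSERT INTO yellow_taxi_trips VALUES (138, 132, 36.35), (132, 999, 10.0);
""")

rows = conn.execute("""
    SELECT
        t.total_amount,
        zpu."Borough" || '/' || zpu."Zone" AS pickup_loc,
        zdo."Borough" || '/' || zdo."Zone" AS dropoff_loc
    FROM yellow_taxi_trips t, taxi_zones zpu, taxi_zones zdo
    WHERE t."PULocationID" = zpu."LocationID"
      AND t."DOLocationID" = zdo."LocationID"
""").fetchall()

# Only the first trip survives: drop-off ID 999 has no row in taxi_zones.
print(rows)
conn.close()
```

The second toy trip disappears from the result because its drop-off ID has no matching zone, which is exactly what "inner" means here.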

In [48]:
%%sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", '/', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", '/', zdo."Zone") AS "dropoff_loc"
FROM
    yellow_taxi_trips t JOIN taxi_zones zpu
        ON t."PULocationID" = zpu."LocationID"
    JOIN taxi_zones zdo
        ON t."DOLocationID" = zdo."LocationID"
LIMIT 5;

 * postgresql://root:***@localhost:5432/ny_taxi
5 rows affected.


tpep_pickup_datetime,tpep_dropoff_datetime,total_amount,pickup_loc,dropoff_loc
2021-01-01 00:30:10,2021-01-01 00:36:12,11.8,Manhattan/Lincoln Square East,Manhattan/Central Park
2021-01-01 00:51:20,2021-01-01 00:52:19,4.3,Manhattan/Upper West Side North,Manhattan/Manhattan Valley
2021-01-01 00:43:30,2021-01-01 01:11:06,51.95,Queens/JFK Airport,Brooklyn/Midwood
2021-01-01 00:15:48,2021-01-01 00:31:01,36.35,Queens/LaGuardia Airport,Queens/JFK Airport
2021-01-01 00:31:49,2021-01-01 00:48:21,24.36,Manhattan/East Chelsea,Brooklyn/Brooklyn Heights


* Exactly the same statement as before but rewritten using explicit `JOIN` keywords.
    * Explicit inner joins are generally preferred over implicit ones because the join condition stays separate from any filtering.
* The `JOIN` keyword goes in the `FROM` clause, and the matching condition moves from `WHERE` into `ON`, so no `WHERE` clause is needed.
    ```sql
    SELECT whatever_columns FROM table_1 JOIN table_2_with_a_matching_column ON column_from_1=column_from_2
    ```
* You can also use the keyword `INNER JOIN` for clarity.
* Learn more about SQL joins [here](https://dataschool.com/how-to-teach-people-sql/sql-join-types-explained-visually/) and [here](https://www.wikiwand.com/en/Join_(SQL)).
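
One way to convince yourself that the implicit and explicit forms are equivalent is to run both on the same toy data and compare the results. This is a minimal sketch using an in-memory `sqlite3` database as a stand-in for Postgres; the tables are reduced to a couple of columns, and `trip_id` is a made-up column added only to get a stable ordering.

```python
import sqlite3

# Toy stand-in for the notebook's tables; trip_id is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxi_zones ("LocationID" INTEGER, "Zone" TEXT);
    CREATE TABLE yellow_taxi_trips (trip_id INTEGER, "PULocationID" INTEGER);
    INSERT INTO taxi_zones VALUES (68, 'East Chelsea'), (142, 'Lincoln Square East');
    INSERT INTO yellow_taxi_trips VALUES (1, 142), (2, 68), (3, 999);
""")

# Implicit join: tables listed in FROM, condition in WHERE.
implicit = conn.execute("""
    SELECT t.trip_id, z."Zone"
    FROM yellow_taxi_trips t, taxi_zones z
    WHERE t."PULocationID" = z."LocationID"
    ORDER BY t.trip_id
""").fetchall()

# Explicit join: same condition, expressed with INNER JOIN ... ON.
explicit = conn.execute("""
    SELECT t.trip_id, z."Zone"
    FROM yellow_taxi_trips t
    INNER JOIN taxi_zones z ON t."PULocationID" = z."LocationID"
    ORDER BY t.trip_id
""").fetchall()

print(implicit == explicit)
conn.close()
```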

##### Checking whether `PULocationID` column contains null values.

In [50]:
%%sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "PULocationID" IS NULL
LIMIT 100;

 * postgresql://root:***@localhost:5432/ny_taxi
0 rows affected.


tpep_pickup_datetime,tpep_dropoff_datetime,total_amount,PULocationID,DOLocationID


##### Checking whether `DOLocationID` column contains null values.

In [51]:
%%sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "DOLocationID" IS NULL
LIMIT 100;

 * postgresql://root:***@localhost:5432/ny_taxi
0 rows affected.


tpep_pickup_datetime,tpep_dropoff_datetime,total_amount,PULocationID,DOLocationID
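
The two null checks above list offending rows one column at a time. As an alternative sketch, both columns can be summarised in a single query by comparing `COUNT(*)` with `COUNT(column)`. The example below assumes toy data in an in-memory `sqlite3` database rather than the real `ny_taxi` tables.

```python
import sqlite3

# Made-up rows with a few NULLs, standing in for the real trips table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yellow_taxi_trips ("PULocationID" INTEGER, "DOLocationID" INTEGER);
    INSERT INTO yellow_taxi_trips VALUES
        (142, 43), (NULL, 151), (132, NULL), (NULL, 33);
""")

# COUNT(*) counts every row, COUNT(column) skips NULLs, so the difference
# is the number of NULLs in that column.
pu_nulls, do_nulls = conn.execute("""
    SELECT
        COUNT(*) - COUNT("PULocationID"),
        COUNT(*) - COUNT("DOLocationID")
    FROM yellow_taxi_trips
""").fetchone()

print(pu_nulls, do_nulls)
conn.close()
```

On the real dataset, both numbers being zero would agree with the empty results shown above.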


* Selects rows from the `yellow_taxi_trips` table whose drop-off location ID does not appear in the `taxi_zones` table.
* If you did not modify any rows in the original datasets, the query returns an empty list.

In [52]:
%%sql
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    yellow_taxi_trips
WHERE
    "DOLocationID" NOT IN (
        SELECT "LocationID" FROM taxi_zones
    )
LIMIT 100;

 * postgresql://root:***@localhost:5432/ny_taxi
0 rows affected.


tpep_pickup_datetime,tpep_dropoff_datetime,total_amount,PULocationID,DOLocationID
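
One caveat with the `NOT IN` check: if the subquery ever returns a `NULL`, the comparison becomes unknown for every row and the query returns nothing, even when orphan IDs exist. The sketch below shows this on made-up data with an in-memory `sqlite3` database (an assumption; the notebook itself runs against Postgres) and compares it with a `NOT EXISTS` variant. `trip_id` is a hypothetical column added for readability.

```python
import sqlite3

# Toy data: trip 2 points at a drop-off zone that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxi_zones ("LocationID" INTEGER);
    CREATE TABLE yellow_taxi_trips (trip_id INTEGER, "DOLocationID" INTEGER);
    INSERT INTO taxi_zones VALUES (43), (151);
    INSERT INTO yellow_taxi_trips VALUES (1, 43), (2, 999);
""")

NOT_IN = """
    SELECT trip_id FROM yellow_taxi_trips
    WHERE "DOLocationID" NOT IN (SELECT "LocationID" FROM taxi_zones)
"""
NOT_EXISTS = """
    SELECT trip_id FROM yellow_taxi_trips t
    WHERE NOT EXISTS (
        SELECT 1 FROM taxi_zones z WHERE z."LocationID" = t."DOLocationID"
    )
"""

orphans = conn.execute(NOT_IN).fetchall()

# A single NULL in the subquery makes every non-matching NOT IN comparison
# unknown, so the same query now returns no rows.
conn.execute("INSERT INTO taxi_zones VALUES (NULL)")
orphans_with_null = conn.execute(NOT_IN).fetchall()
orphans_not_exists = conn.execute(NOT_EXISTS).fetchall()

print(orphans, orphans_with_null, orphans_not_exists)
conn.close()
```

Since the earlier checks found no `NULL` IDs in the trips table, and assuming `taxi_zones` has none either, the `NOT IN` result above can be trusted for this dataset.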
