# Explore Green Taxi data

## Import libraries

In [1]:
import pandas as pd
from sqlalchemy import create_engine

## Download data

In [2]:
!wget -nc https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-09.csv.gz

File ‘green_tripdata_2019-09.csv.gz’ already there; not retrieving.



## Load data

The full dataset may be too large to inspect comfortably, so let's get an overview by loading only the first 100 rows.

In [3]:
df = pd.read_csv("green_tripdata_2019-09.csv.gz", nrows=100)
df

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge
0,2,2019-09-01 00:10:53,2019-09-01 00:23:46,N,1,65,189,5,2.00,10.5,0.5,0.5,2.36,0.00,,0.3,14.16,1,1,0.0
1,2,2019-09-01 00:31:22,2019-09-01 00:44:37,N,1,97,225,5,3.20,12.0,0.5,0.5,0.00,0.00,,0.3,13.30,2,1,0.0
2,2,2019-09-01 00:50:24,2019-09-01 01:03:20,N,1,37,61,5,2.99,12.0,0.5,0.5,0.00,0.00,,0.3,13.30,2,1,0.0
3,2,2019-09-01 00:27:06,2019-09-01 00:33:22,N,1,145,112,1,1.73,7.5,0.5,0.5,1.50,0.00,,0.3,10.30,1,1,0.0
4,2,2019-09-01 00:43:23,2019-09-01 00:59:54,N,1,112,198,1,3.42,14.0,0.5,0.5,3.06,0.00,,0.3,18.36,1,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2,2019-09-01 00:40:55,2019-09-01 00:48:35,N,1,95,160,1,2.24,9.0,0.5,0.5,0.00,0.00,,0.3,10.30,2,1,0.0
96,2,2019-09-01 00:13:52,2019-09-01 00:21:47,N,1,75,151,1,2.04,8.5,0.5,0.5,1.96,0.00,,0.3,11.76,1,1,0.0
97,2,2019-09-01 00:37:24,2019-09-01 01:02:31,N,1,41,182,1,7.77,26.0,0.5,0.5,5.46,0.00,,0.3,32.76,1,1,0.0
98,2,2019-09-01 00:22:15,2019-09-01 00:29:52,N,1,74,260,1,4.99,15.0,0.5,0.5,2.58,6.12,,0.3,25.00,1,1,0.0
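If the whole file is eventually needed, one option besides `nrows` is to read it in chunks. The following is a minimal sketch, not part of the original notebook: it uses a tiny in-memory CSV with made-up values as a stand-in for `green_tripdata_2019-09.csv.gz`, and the chunk size of 4 is an arbitrary assumption chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: 10 rows of made-up values.
csv_text = "VendorID,trip_distance\n" + "\n".join(f"2,{i}.5" for i in range(10))

total_rows = 0
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything into memory at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)

print(chunk_sizes, total_rows)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the file name, and each chunk could be inspected or written elsewhere before the next one is read.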


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               100 non-null    int64  
 1   lpep_pickup_datetime   100 non-null    object 
 2   lpep_dropoff_datetime  100 non-null    object 
 3   store_and_fwd_flag     100 non-null    object 
 4   RatecodeID             100 non-null    int64  
 5   PULocationID           100 non-null    int64  
 6   DOLocationID           100 non-null    int64  
 7   passenger_count        100 non-null    int64  
 8   trip_distance          100 non-null    float64
 9   fare_amount            100 non-null    float64
 10  extra                  100 non-null    float64
 11  mta_tax                100 non-null    float64
 12  tip_amount             100 non-null    float64
 13  tolls_amount           100 non-null    float64
 14  ehail_fee              0 non-null      float64
 15  improve

Let's convert the `lpep_pickup_datetime` and `lpep_dropoff_datetime` columns from _object_ to _datetime_.

In [5]:
df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)

In [6]:
df.iloc[:, 1:3].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   lpep_pickup_datetime   100 non-null    datetime64[ns]
 1   lpep_dropoff_datetime  100 non-null    datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 1.7 KB
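The same conversion can also be requested at load time through the `parse_dates` argument of `pd.read_csv`. This is a sketch under assumptions: the in-memory CSV below reuses the timestamps of the first two rows shown earlier and drops the other columns, and the `sample` and `duration` names are introduced here only for illustration.

```python
import io

import pandas as pd

# Two rows copied from the preview above, reduced to three columns.
csv_text = (
    "VendorID,lpep_pickup_datetime,lpep_dropoff_datetime\n"
    "2,2019-09-01 00:10:53,2019-09-01 00:23:46\n"
    "2,2019-09-01 00:31:22,2019-09-01 00:44:37\n"
)

# parse_dates converts the listed columns to datetime while reading.
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["lpep_pickup_datetime", "lpep_dropoff_datetime"],
)

# Datetime columns support arithmetic, e.g. computing trip duration.
duration = sample["lpep_dropoff_datetime"] - sample["lpep_pickup_datetime"]
print(sample.dtypes)
print(duration)
```

Once the columns are datetimes, subtracting them yields a timedelta column, which is one practical reason to convert them before further analysis.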


We can see some differences between the schemas of the Green Taxi and Yellow Taxi datasets:
* The pickup and dropoff columns are named `tpep_pickup_datetime` and `tpep_dropoff_datetime` in the Yellow Taxi dataset, but `lpep_pickup_datetime` and `lpep_dropoff_datetime` in the Green Taxi dataset.
* The Green Taxi dataset has 2 extra columns: `ehail_fee` (which, at least in this data file, is completely empty) and `trip_type`.
* The columns are arranged in a different order in the two datasets.
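A comparison like the one above can be checked programmatically with set differences on the column names. The sketch below is illustrative only: the two lists are deliberately abbreviated and contain just the columns named in this section plus two shared ones, so they are not the full schemas. In practice they would come from `df.columns` of each loaded dataset.

```python
# Abbreviated, illustrative column lists (not the complete schemas).
green_cols = [
    "VendorID",
    "lpep_pickup_datetime",
    "lpep_dropoff_datetime",
    "ehail_fee",
    "trip_type",
    "total_amount",
]
yellow_cols = [
    "VendorID",
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "total_amount",
]

# Columns present in one dataset but not in the other.
only_green = sorted(set(green_cols) - set(yellow_cols))
only_yellow = sorted(set(yellow_cols) - set(green_cols))

print("Only in Green Taxi: ", only_green)
print("Only in Yellow Taxi:", only_yellow)
```

Note that the renamed datetime columns show up on both sides of the difference, since a set comparison cannot tell a rename from an added or removed column.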