# **Problem Statement**

In [1]:
#mount the drive

We are working as a data analyst in a company called Fast Cars. Fast Cars is a cab aggregator like Uber and Ola, i.e. it connects passengers to cabs in cities through an app.

They are going to launch their product in New York City, but before that they want to understand the New York taxi market.

As a data analyst, you are provided with the Yellow Taxi dataset, which contains information about taxi trips that people hailed from the street in New York City.

You are asked to analyse this data to provide insights about the taxi market of New York.

Do a short and preliminary analysis on the data.

You can find the data and the relevant information about Yellow Taxi dataset here - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Data identification and collection**

As this is a short and preliminary analysis, we will focus on only one month of data, viz. June 2020.

The data is collected from the website mentioned above.

# **Data Cleaning and Manipulation**

The first stage in any EDA problem-solving strategy is the data cleaning and manipulation stage.

**Data Cleaning and Manipulation Steps**

> Data Understanding

- Reading Dataset documentation
- Importing data and understanding the data in each column
- Data summarisation, e.g. checking the data type of each column, the number of rows, etc.

> Data Cleaning

- Dropping irrelevant columns
- Renaming the columns
- Dropping the duplicate rows
- Dropping or handling missing values
- Dropping invalid data rows (also check that each column has the correct data type)
- Detecting and handling outliers (this can be handled in data analysis part as well)


> Data Manipulation

- Column transformation
- Joining datasets
- Other manipulation like pivoting or transposing (this is also applied in the data analysis part)
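
As a rough illustration of the cleaning steps listed above, the sketch below applies them to a tiny made-up DataFrame. The column names loosely mirror the taxi data, but the values, the `unused_flag` column and the validity rule (drop-off must come after pickup) are assumptions for illustration only, not the rules used later in this notebook.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2020-06-01 00:10", "2020-06-01 00:10",
                              "2020-06-01 09:00", "2020-06-02 12:00"]),
    "dropoff": pd.to_datetime(["2020-06-01 00:25", "2020-06-01 00:25",
                               "2020-06-01 08:50", "2020-06-02 12:30"]),
    "trip_distance": [2.1, 2.1, 1.0, None],
    "unused_flag": ["N", "N", "N", "Y"],
})

cleaned = (
    toy.drop(columns=["unused_flag"])                          # drop irrelevant columns
       .rename(columns={"trip_distance": "distance_miles"})    # rename columns
       .drop_duplicates()                                      # drop duplicate rows
       .dropna(subset=["distance_miles"])                      # drop rows with missing values
)

# drop invalid rows: drop-off must come after pickup (assumed rule)
cleaned = cleaned[cleaned["dropoff"] > cleaned["pickup"]].reset_index(drop=True)
print(cleaned)
```

Chaining the steps keeps the order of operations visible, which makes it easier to compare against the checklist.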

## Data Understanding

### Reading data documentation

On the website - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

there is data about yellow taxis, categorised by year and month (the file used in this notebook is in Parquet format).

There is also a data dictionary PDF on the website called `Yellow Trips Data Dictionary`.

By looking at the data dictionary, we find that the following columns would be useful for our analysis (you can find the data description file here - https://drive.google.com/file/d/1B6JUAqGNmaMfyc6TQqxRWrvOVI1DtgQf/view?usp=sharing):<br>
* tpep_pickup_datetime - The date and time when the meter was engaged.
* tpep_dropoff_datetime - The date and time when the meter was disengaged.
* Passenger_count - The number of passengers in the vehicle.
* Trip_distance - The elapsed trip distance in miles reported by the taximeter.
* PULocationID - TLC Taxi Zone in which the taximeter was engaged
* DOLocationID - TLC Taxi Zone in which the taximeter was disengaged
* Payment_type - A numeric code signifying how the passenger paid for the trip.
* Fare_amount - The time-and-distance fare calculated by the meter.
* Extra - Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges.
* MTA_tax - \$0.50 MTA tax that is automatically triggered based on the metered rate in use.
* Improvement_surcharge - \$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
* congestion_surcharge - Total amount collected in the trip for the congestion surcharge.
* Tip_amount - This field is automatically populated for credit card tips. Cash tips are not included.
* Tolls_amount - Total amount of all tolls paid in trip.
* Total_amount -  The total amount charged to passengers. Does not include cash tips.

And we will drop the following columns:<br>
* VendorID
* RateCodeID
* Store_and_fwd_flag

For more details about each column's values or the data available, please look at the `Trip Record User Guide` linked on the website.

We also have the `taxi+_zone_lookup.csv` file, in which the zone IDs in the columns `PULocationID` and `DOLocationID` are mapped to their respective locations.
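
To make the mapping concrete, here is a minimal sketch of how a zone ID could be joined to its location with `DataFrame.merge`. The two small frames are stand-ins for the trip data and the lookup file; the fare values and the output column names `pickup_borough` and `pickup_zone` are assumptions made for this example.

```python
import pandas as pd

# Toy stand-ins for the trip data and the zone lookup table (fares are invented)
trips = pd.DataFrame({"PULocationID": [4, 1, 4], "fare_amount": [10.0, 52.0, 7.5]})
zones = pd.DataFrame({
    "LocationID": [1, 2, 4],
    "Borough": ["EWR", "Queens", "Manhattan"],
    "Zone": ["Newark Airport", "Jamaica Bay", "Alphabet City"],
})

# Rename the lookup columns so the join key matches and the new columns
# say which end of the trip they describe
pickup_lookup = zones.rename(columns={
    "LocationID": "PULocationID",
    "Borough": "pickup_borough",
    "Zone": "pickup_zone",
})

# A left join keeps every trip, even if its zone ID has no match in the lookup
trips_with_zone = trips.merge(pickup_lookup, on="PULocationID", how="left")
print(trips_with_zone)
```

The same pattern can be repeated for `DOLocationID` with a second renamed copy of the lookup table.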

### Importing the data

In [3]:
# import important libraries - matplotlib, seaborn and pandas
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [4]:
# yellow taxi data
file_loc = '/content/drive/MyDrive/Dataset_folder/yellow_tripdata_2020-06.parquet'

# read file
trip_data = pd.read_parquet(file_loc)
trip_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,1.0,N,140,68,1,15.5,3.0,0.5,4.0,0.0,0.3,23.3,2.5,
1,1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,1.0,N,79,226,1,19.5,3.0,0.5,2.0,0.0,0.3,25.3,2.5,
2,1,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,1.0,N,238,116,2,10.0,0.5,0.5,0.0,0.0,0.3,11.3,0.0,
3,1,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,1.0,N,141,116,2,17.5,3.0,0.5,0.0,0.0,0.3,21.3,2.5,
4,1,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,1.0,N,186,75,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,


In [5]:
# taxi zone lookup file
file_loc_2 = '/content/drive/MyDrive/Dataset_folder/taxi+_zone_lookup.csv'

#zone look up file
taxi_zone_data = pd.read_csv(file_loc_2)
taxi_zone_data.head()

Unnamed: 0,LocationID,Borough,Zone,service_zone
0,1,EWR,Newark Airport,EWR
1,2,Queens,Jamaica Bay,Boro Zone
2,3,Bronx,Allerton/Pelham Gardens,Boro Zone
3,4,Manhattan,Alphabet City,Yellow Zone
4,5,Staten Island,Arden Heights,Boro Zone


### Data Summarisation

In [6]:
trip_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,1.0,N,140,68,1,15.5,3.0,0.5,4.0,0.0,0.3,23.3,2.5,
1,1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,1.0,N,79,226,1,19.5,3.0,0.5,2.0,0.0,0.3,25.3,2.5,
2,1,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,1.0,N,238,116,2,10.0,0.5,0.5,0.0,0.0,0.3,11.3,0.0,
3,1,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,1.0,N,141,116,2,17.5,3.0,0.5,0.0,0.0,0.3,21.3,2.5,
4,1,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,1.0,N,186,75,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,2.5,


In [7]:
trip_data.tail()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
549792,2,2020-06-30 23:05:00,2020-06-30 23:32:00,,12.96,,,17,69,0,32.91,0.0,0.5,2.75,6.12,0.3,42.58,,
549793,2,2020-06-30 23:21:47,2020-06-30 23:25:24,,0.36,,,41,41,0,11.45,0.0,0.5,2.75,0.0,0.3,15.0,,
549794,2,2020-06-30 23:34:00,2020-06-30 23:44:00,,2.36,,,242,81,0,18.45,0.0,0.5,2.75,0.0,0.3,22.0,,
549795,2,2020-06-30 23:22:47,2020-06-30 23:42:01,,5.5,,,14,118,0,15.9,0.0,0.5,6.23,12.24,0.3,35.17,,
549796,2,2020-06-30 23:56:18,2020-07-01 00:27:19,,9.59,,,61,137,0,29.68,0.0,0.5,0.0,0.0,0.3,32.98,,


In [8]:
print(trip_data.shape)

(549797, 19)


In [9]:
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549797 entries, 0 to 549796
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   VendorID               549797 non-null  int64         
 1   tpep_pickup_datetime   549797 non-null  datetime64[ns]
 2   tpep_dropoff_datetime  549797 non-null  datetime64[ns]
 3   passenger_count        499079 non-null  float64       
 4   trip_distance          549797 non-null  float64       
 5   RatecodeID             499079 non-null  float64       
 6   store_and_fwd_flag     499079 non-null  object        
 7   PULocationID           549797 non-null  int64         
 8   DOLocationID           549797 non-null  int64         
 9   payment_type           549797 non-null  int64         
 10  fare_amount            549797 non-null  float64       
 11  extra                  549797 non-null  float64       
 12  mta_tax                549797 non-null  floa

## Data Cleaning and Manipulation Steps (Reading Assignment)

We have done all the data cleaning and manipulation steps below, though we have not followed the exact order of the steps listed at the start.

These are the cleaning steps that we have done:
* Dropped 5 columns: `'VendorID','RatecodeID','store_and_fwd_flag','congestion_surcharge','airport_fee'`
* Converted the pickup and dropoff columns to the datetime data type
* Extracted the trip day and trip date from the pickup datetime column
* Extracted the pickup hour and dropoff hour from the datetime columns
* Calculated the trip duration (in minutes) from the two datetime columns
* Checked for missing values
* Converted payment type values from integers to strings (based on the mapping given in the data dictionary file)
* Combined the three tax values (mta_tax, extra, improvement_surcharge) into a single value called total_taxes.

In [10]:
# remove following columns - 'VendorID','RatecodeID','store_and_fwd_flag','congestion_surcharge','airport_fee'
trip_data.drop(['VendorID','RatecodeID','store_and_fwd_flag','congestion_surcharge','airport_fee'],axis=1,inplace=True)
# print data head
trip_data.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,140,68,1,15.5,3.0,0.5,4.0,0.0,0.3,23.3
1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,79,226,1,19.5,3.0,0.5,2.0,0.0,0.3,25.3
2,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,238,116,2,10.0,0.5,0.5,0.0,0.0,0.3,11.3
3,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,141,116,2,17.5,3.0,0.5,0.0,0.0,0.3,21.3
4,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,186,75,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95


We will now deal with the time-related columns. We have two of them:
* tpep_pickup_datetime
* tpep_dropoff_datetime

We will first convert these columns to the pandas datetime data type.

We will create three different features from these:
* hour - pickup hour and dropoff hour
* day name - the day of the week when the trip took place. We will only take the day name from the pickup date, as the drop-off day is expected to be the same as the pickup day, except for trips that run past midnight
* duration of trip

In [11]:
# convert 'tpep_pickup_datetime' and 'tpep_dropoff_datetime' to datetime format
trip_data['tpep_pickup_datetime'] = pd.to_datetime(trip_data['tpep_pickup_datetime'])
trip_data['tpep_dropoff_datetime'] = pd.to_datetime(trip_data['tpep_dropoff_datetime'])
# print data info
print(trip_data.info())
# print data head
trip_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549797 entries, 0 to 549796
Data columns (total 14 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   tpep_pickup_datetime   549797 non-null  datetime64[ns]
 1   tpep_dropoff_datetime  549797 non-null  datetime64[ns]
 2   passenger_count        499079 non-null  float64       
 3   trip_distance          549797 non-null  float64       
 4   PULocationID           549797 non-null  int64         
 5   DOLocationID           549797 non-null  int64         
 6   payment_type           549797 non-null  int64         
 7   fare_amount            549797 non-null  float64       
 8   extra                  549797 non-null  float64       
 9   mta_tax                549797 non-null  float64       
 10  tip_amount             549797 non-null  float64       
 11  tolls_amount           549797 non-null  float64       
 12  improvement_surcharge  549797 non-null  floa

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,140,68,1,15.5,3.0,0.5,4.0,0.0,0.3,23.3
1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,79,226,1,19.5,3.0,0.5,2.0,0.0,0.3,25.3
2,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,238,116,2,10.0,0.5,0.5,0.0,0.0,0.3,11.3
3,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,141,116,2,17.5,3.0,0.5,0.0,0.0,0.3,21.3
4,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,186,75,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95


In [12]:
# create 'duration' column using pd.Timedelta(minutes=1)
trip_data['duration'] = (trip_data['tpep_dropoff_datetime'] - trip_data['tpep_pickup_datetime'])/ pd.Timedelta(minutes=1)
# create 'trip_pickup_hour' column using 'tpep_pickup_datetime' column
trip_data['trip_pickup_hour'] = trip_data['tpep_pickup_datetime'].dt.hour
# create 'trip_dropoff_hour' column using 'tpep_dropoff_datetime' column
trip_data['trip_dropoff_hour'] = trip_data['tpep_dropoff_datetime'].dt.hour
# create 'trip_day' column using 'tpep_pickup_datetime' column - use day_name()
trip_data['trip_day'] = trip_data['tpep_pickup_datetime'].dt.day_name()
# create 'trip_date' column using 'tpep_pickup_datetime' column - use date
trip_data['trip_date']=trip_data['tpep_pickup_datetime'].dt.date
# print data info
print(trip_data.info())
# print data head
trip_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549797 entries, 0 to 549796
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   tpep_pickup_datetime   549797 non-null  datetime64[ns]
 1   tpep_dropoff_datetime  549797 non-null  datetime64[ns]
 2   passenger_count        499079 non-null  float64       
 3   trip_distance          549797 non-null  float64       
 4   PULocationID           549797 non-null  int64         
 5   DOLocationID           549797 non-null  int64         
 6   payment_type           549797 non-null  int64         
 7   fare_amount            549797 non-null  float64       
 8   extra                  549797 non-null  float64       
 9   mta_tax                549797 non-null  float64       
 10  tip_amount             549797 non-null  float64       
 11  tolls_amount           549797 non-null  float64       
 12  improvement_surcharge  549797 non-null  floa

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,trip_pickup_hour,trip_dropoff_hour,trip_day,trip_date
0,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,140,68,1,15.5,3.0,0.5,4.0,0.0,0.3,23.3,18.583333,0,0,Monday,2020-06-01
1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,79,226,1,19.5,3.0,0.5,2.0,0.0,0.3,25.3,21.716667,0,1,Monday,2020-06-01
2,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,238,116,2,10.0,0.5,0.5,0.0,0.0,0.3,11.3,9.3,0,0,Monday,2020-06-01
3,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,141,116,2,17.5,3.0,0.5,0.0,0.0,0.3,21.3,15.416667,0,1,Monday,2020-06-01
4,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,186,75,1,14.5,3.0,0.5,3.65,0.0,0.3,21.95,12.816667,0,0,Monday,2020-06-01


Let's also see the number of missing values for each column

In [13]:
# print missing values for each column - use .isnull().sum
trip_data.isnull().sum(axis=0).reset_index()

Unnamed: 0,index,0
0,tpep_pickup_datetime,0
1,tpep_dropoff_datetime,0
2,passenger_count,50718
3,trip_distance,0
4,PULocationID,0
5,DOLocationID,0
6,payment_type,0
7,fare_amount,0
8,extra,0
9,mta_tax,0


In [14]:
passengercount_missing=(trip_data['passenger_count'].isnull().sum()/trip_data.shape[0])*100
passengercount_missing

9.224859357180923

From the above calculation we can observe that 9.22% of the values in passenger_count are missing.

Instead of deleting these rows, we will replace the null values with the average passenger count for that day.

In [15]:
#calculate avg passenger_count for every date
passenger_count_avg=round(trip_data.groupby('trip_date')['passenger_count'].mean()).reset_index()
passenger_count_avg

Unnamed: 0,trip_date,passenger_count
0,2009-01-01,1.0
1,2020-05-31,1.0
2,2020-06-01,1.0
3,2020-06-02,1.0
4,2020-06-03,1.0
5,2020-06-04,1.0
6,2020-06-05,1.0
7,2020-06-06,1.0
8,2020-06-07,1.0
9,2020-06-08,1.0
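
The averages above are computed per date, but the fill step itself is not shown in this section. One possible way to do it, sketched here on made-up data, is to align the rounded daily mean with each row using `groupby(...).transform` and pass it to `fillna`. The toy values are assumptions; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (invented); trip_date and passenger_count mirror the notebook's columns
toy = pd.DataFrame({
    "trip_date": ["2020-06-01", "2020-06-01", "2020-06-01", "2020-06-01",
                  "2020-06-02", "2020-06-02"],
    "passenger_count": [1.0, 1.0, 2.0, np.nan, 3.0, np.nan],
})

# Rounded mean passenger count per date, aligned row by row with the original frame
daily_avg = toy.groupby("trip_date")["passenger_count"].transform("mean").round()

# Replace only the missing values with the average for that date
toy["passenger_count"] = toy["passenger_count"].fillna(daily_avg)
print(toy)
```

Note that a date on which every passenger_count is missing would still be left as null by this approach, so it is worth re-checking `isnull().sum()` afterwards.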


For payment_type we have the following mapping for categories:
* 1 = Credit card
* 2 = Cash
* 3 = No charge
* 4 = Dispute
* 5 = Unknown
* 6 = Voided trip

Let's check whether only these categories appear in payment_type.

In [16]:
# value_counts for 'payment_type' column
trip_data['payment_type'].value_counts()

1    322582
2    168953
0     50718
3      5257
4      2275
5        12
Name: payment_type, dtype: int64
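
The same check can be made explicit by comparing the codes found in the data with the documented ones. The sketch below uses a small made-up Series in place of `trip_data['payment_type']`; the name `documented_codes` is introduced here for illustration.

```python
import pandas as pd

# Codes documented in the mapping listed above
documented_codes = {1, 2, 3, 4, 5, 6}

# Toy column standing in for trip_data['payment_type'] (values invented)
payment_type = pd.Series([1, 2, 0, 1, 3, 0])

# Codes present in the data but absent from the mapping
undocumented = sorted(set(payment_type.unique().tolist()) - documented_codes)
print(undocumented)

# Number of rows carrying an undocumented code
n_undocumented_rows = int((~payment_type.isin(documented_codes)).sum())
print(n_undocumented_rows)
```

A set difference reports every unexpected code at once, which is easier to scan than reading through the `value_counts()` output when there are many categories.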

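The output also contains code 0 (50718 rows), which is not part of the mapping above; this is the same as the number of rows with a missing passenger_count.

Now we will replace the numeric codes in payment_type with the actual category names.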

In [17]:
# function for mapping numerical payment_type to actual payment
def map_payment_type(x):
    if x==1:
        return 'Credit_card'
    elif x==2:
        return 'Cash'
    elif x==3:
        return 'No_charge'
    elif x==4:
        return 'Dispute'
    elif x==5:
        return 'Unknown'
    elif x==6:
        return 'Voided_trip'
    else:
        # any code outside the data dictionary (e.g. 0) gets its own label
        # (the label name is chosen here) instead of being counted as a voided trip
        return 'Undocumented'

# use .apply on payment_type column to change 'payment_type' column
trip_data['payment_type'] = trip_data.payment_type.apply(map_payment_type)
# print data head
trip_data.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,duration,trip_pickup_hour,trip_dropoff_hour,trip_day,trip_date
0,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,140,68,Credit_card,15.5,3.0,0.5,4.0,0.0,0.3,23.3,18.583333,0,0,Monday,2020-06-01
1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,79,226,Credit_card,19.5,3.0,0.5,2.0,0.0,0.3,25.3,21.716667,0,1,Monday,2020-06-01
2,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,238,116,Cash,10.0,0.5,0.5,0.0,0.0,0.3,11.3,9.3,0,0,Monday,2020-06-01
3,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,141,116,Cash,17.5,3.0,0.5,0.0,0.0,0.3,21.3,15.416667,0,1,Monday,2020-06-01
4,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,186,75,Credit_card,14.5,3.0,0.5,3.65,0.0,0.3,21.95,12.816667,0,0,Monday,2020-06-01


In [18]:
# print data info to show that payment_type data type has changed
trip_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 549797 entries, 0 to 549796
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   tpep_pickup_datetime   549797 non-null  datetime64[ns]
 1   tpep_dropoff_datetime  549797 non-null  datetime64[ns]
 2   passenger_count        499079 non-null  float64       
 3   trip_distance          549797 non-null  float64       
 4   PULocationID           549797 non-null  int64         
 5   DOLocationID           549797 non-null  int64         
 6   payment_type           549797 non-null  object        
 7   fare_amount            549797 non-null  float64       
 8   extra                  549797 non-null  float64       
 9   mta_tax                549797 non-null  float64       
 10  tip_amount             549797 non-null  float64       
 11  tolls_amount           549797 non-null  float64       
 12  improvement_surcharge  549797 non-null  floa

Now our total_amount is basically<br>
total_amount = fare_amount + tolls_amount + tip_amount + (extra + mta_tax + improvement_surcharge)

Of the above components of total_amount, we will specifically focus on 'fare_amount', 'tip_amount', 'tolls_amount' and 'total_taxes'.

We are combining extra, mta_tax and improvement_surcharge into one category called total_taxes, as these are determined by local laws and taxes and do not depend on the distance travelled or the time taken for the trip.

Here total_taxes is the sum of the three columns 'extra', 'mta_tax' and 'improvement_surcharge', so we will create a new column for it.

We will also drop the three columns 'extra', 'mta_tax' and 'improvement_surcharge'.
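
Before dropping the three columns, the decomposition of total_amount can be sanity-checked. The sketch below does this for the first three rows of the data shown earlier, using `numpy.isclose` because the amounts are floats. It only demonstrates the check on these rows; the relationship may not hold for every row of the full dataset.

```python
import numpy as np
import pandas as pd

# First three rows of the trip data shown earlier in this notebook
rows = pd.DataFrame({
    "fare_amount": [15.5, 19.5, 10.0],
    "extra": [3.0, 3.0, 0.5],
    "mta_tax": [0.5, 0.5, 0.5],
    "tip_amount": [4.0, 2.0, 0.0],
    "tolls_amount": [0.0, 0.0, 0.0],
    "improvement_surcharge": [0.3, 0.3, 0.3],
    "total_amount": [23.3, 25.3, 11.3],
})

components = ["fare_amount", "tolls_amount", "tip_amount",
              "extra", "mta_tax", "improvement_surcharge"]

# Compare with a tolerance instead of == to avoid floating point surprises
matches = np.isclose(rows[components].sum(axis=1), rows["total_amount"])
print(matches.all())
```

On the full data, `(~matches).sum()` would give the number of rows where the components do not add up to total_amount.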


In [19]:
# create 'total_taxes' column from summing 'extra','mta_tax', 'improvement_surcharge'
trip_data['total_taxes'] = trip_data['extra']+trip_data['mta_tax']+trip_data['improvement_surcharge']
# drop 'extra','mta_tax','improvement_surcharge' columns
trip_data.drop(['extra','mta_tax','improvement_surcharge'],axis=1,inplace=True)
# print data head
trip_data.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,PULocationID,DOLocationID,payment_type,fare_amount,tip_amount,tolls_amount,total_amount,duration,trip_pickup_hour,trip_dropoff_hour,trip_day,trip_date,total_taxes
0,2020-06-01 00:31:23,2020-06-01 00:49:58,1.0,3.6,140,68,Credit_card,15.5,4.0,0.0,23.3,18.583333,0,0,Monday,2020-06-01,3.8
1,2020-06-01 00:42:50,2020-06-01 01:04:33,1.0,5.6,79,226,Credit_card,19.5,2.0,0.0,25.3,21.716667,0,1,Monday,2020-06-01,3.8
2,2020-06-01 00:39:51,2020-06-01 00:49:09,1.0,2.3,238,116,Cash,10.0,0.0,0.0,11.3,9.3,0,0,Monday,2020-06-01,1.3
3,2020-06-01 00:56:13,2020-06-01 01:11:38,1.0,5.3,141,116,Cash,17.5,0.0,0.0,21.3,15.416667,0,1,Monday,2020-06-01,3.8
4,2020-06-01 00:16:41,2020-06-01 00:29:30,1.0,4.4,186,75,Credit_card,14.5,3.65,0.0,21.95,12.816667,0,0,Monday,2020-06-01,3.8


In [21]:
trip_data.to_csv('/content/drive/MyDrive/taxi_data/yellow_taxi_data_2020-06_cleaned.csv',index=False)

We will read this file back in the data analysis and visualisation step.
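
One thing to keep in mind when reading the file back: a CSV file stores everything as text, so the datetime columns come back as plain strings unless we ask pandas to parse them. A minimal sketch, using a small in-memory CSV in place of the saved file (the two rows are taken from the data above):

```python
import io
import pandas as pd

# Small in-memory stand-in for the cleaned CSV file
csv_text = io.StringIO(
    "tpep_pickup_datetime,tpep_dropoff_datetime,payment_type,total_amount\n"
    "2020-06-01 00:31:23,2020-06-01 00:49:58,Credit_card,23.3\n"
    "2020-06-01 00:39:51,2020-06-01 00:49:09,Cash,11.3\n"
)

# parse_dates restores the datetime dtype that plain CSV text does not carry
reloaded = pd.read_csv(
    csv_text,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
print(reloaded.dtypes)
```

With the real file, the path passed to `to_csv` above would replace `csv_text`.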