Goal: analyze the relationship between fare amount and payment type in the 2017 Yellow Taxi trip data.

In [1]:
#Import Packages and Libraries
import os

import pandas as pd
import numpy as np

from scipy import stats

In [2]:
# Load the dataset from the "Dataset" folder in the current working directory
data_path = os.path.join(os.getcwd(), "Dataset", "2017_Yellow_Taxi_Trip_Data.csv")
df = pd.read_csv(data_path, index_col=0)

In [3]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [4]:
#Basic stats
df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


In [5]:
#Average fare amount for each payment type
df.groupby(['payment_type']).agg({'fare_amount':'mean'})

payment_type,fare_amount
1,13.429748
2,12.213546
3,12.186116
4,9.913043


Based on the averages shown, customers who pay by credit card (payment_type 1) pay a higher average fare than customers who pay in cash (payment_type 2): about 13.43 versus 12.21. However, this difference might be due to sampling variability rather than a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a two-sample hypothesis test.

$Step 1$: Define the Null Hypothesis (H0) and the Alternative Hypothesis (HA):

H0 - There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

HA - There is a difference in the average fare amount between customers who use credit cards and customers who use cash.
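
Stated symbolically, the two hypotheses describe a two-sided test on the difference of means. The symbols $\mu_{\text{card}}$ and $\mu_{\text{cash}}$ are notation introduced here for illustration; they denote the population mean fare amounts for credit card and cash trips:

$$
H_0:\ \mu_{\text{card}} = \mu_{\text{cash}}
\qquad\qquad
H_A:\ \mu_{\text{card}} \neq \mu_{\text{cash}}
$$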


$Step 2$: Choose a Significance Level:

Significance Level = 5%

$Step 3$: Find the P-Value:

In [6]:
# Split trips by payment type: 1 = credit card, 2 = cash
cc_df = df[df['payment_type'] == 1]
cash_df = df[df['payment_type'] == 2]

# Significance level
sl = 0.05

# Welch's two-sample t-test (equal_var=False does not assume equal variances)
t_stat, p_value = stats.ttest_ind(a=cc_df['fare_amount'], b=cash_df['fare_amount'], equal_var=False)

print("P-Value:", p_value)
print('P-Value < Significance Level:', (p_value < sl))

P-Value: 6.797387473030518e-12
P-Value < Significance Level: True
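
To make clearer what `equal_var=False` computes, here is a minimal sketch of Welch's t-test written out by hand and compared against `stats.ttest_ind`. The helper name `welch_t_test` and the synthetic arrays are assumptions made for illustration; they are not part of the original analysis, and the synthetic numbers are not taxi data.

```python
import numpy as np
from scipy import stats


def welch_t_test(a, b):
    """Two-sided Welch's t-test computed by hand (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Variance of each sample mean: sample variance divided by sample size.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # Test statistic: difference of means over its estimated standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the Student t distribution.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value


# Synthetic stand-in samples (NOT the taxi dataset), used only to exercise the function.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=13.4, scale=13.0, size=500)
group_b = rng.normal(loc=12.2, scale=11.0, size=300)

t_manual, dof, p_manual = welch_t_test(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

# Both comparisons should agree up to floating-point tolerance.
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Welch's variant is a reasonable default here because the two payment groups differ in size and need not share the same fare variance.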


$Step 4$: Reject or Fail to Reject the Null Hypothesis

Since the p-value (about 6.8e-12) is less than the chosen 5% significance level, we reject the null hypothesis. That is, there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.

NOTE:

This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data was not collected this way, so the analysis treats trips as if they had been randomly assigned to a payment type, as in an A/B test. The test also does not account for other likely explanations. For example, riders might not carry much cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that fare amount determines payment type than the reverse.
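
One way to probe the alternative explanation in this note is to compare trip length and fare per mile across payment types, since a gap driven by longer card-paid trips should shrink once distance is accounted for. The sketch below is an illustration only, not part of the original analysis: it uses just the five rows shown by `df.head()`, and the `fare_per_mile` column is a name introduced here. With the full dataset, `sample` would be replaced by `df`.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the relevant columns.
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 1, 2],
    "trip_distance": [3.34, 1.8, 1.0, 3.7, 4.37],
    "fare_amount": [13.0, 16.0, 6.5, 20.5, 16.5],
})

# Keep credit card (1) and cash (2) trips with a positive distance,
# so that dividing by distance is well defined.
mask = sample["payment_type"].isin([1, 2]) & (sample["trip_distance"] > 0)
trips = sample[mask].copy()

# Fare per mile (hypothetical derived column) normalizes out trip length.
trips["fare_per_mile"] = trips["fare_amount"] / trips["trip_distance"]

# Per-payment-type summary: trip count, mean distance, mean fare, mean fare per mile.
summary = trips.groupby("payment_type").agg(
    trips=("fare_amount", "count"),
    mean_distance=("trip_distance", "mean"),
    mean_fare=("fare_amount", "mean"),
    mean_fare_per_mile=("fare_per_mile", "mean"),
)
print(summary)
```

Five rows are far too few to support any conclusion; the point is only the shape of the check.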