# New York City Taxi & Limousine Commission (NYC TLC)
## A/B Hypothesis Testing for the Taxi Ride Fare Project

Overview:\
Purpose: Conduct an A/B test to analyze whether there is a relationship between payment type and fare amount. \
Objective: Build predictive model(s) for taxi ride fares to increase taxi driver profitability.

**Part 1:** Inspect the data

**Part 2:** Conduct hypothesis testing

**Part 3:** Communicate insights

### Change Log
2024_0520, S. Souto, Initial Version

### Data Sources

1. Sampled from original data: NYC.gov: "2017_Yellow_Taxi_Trip_Data.csv"

### Imports and Data Loading

In [1]:
# Import packages and libraries
import pandas as pd
from scipy import stats

In [2]:
# Notebook setup
pd.set_option('display.max_columns', None)

In [3]:
# Load dataset into dataframe, save copy
df0 = pd.read_csv('data/2017_Yellow_Taxi_Trip_Data.csv', index_col=0)
df1 = df0.copy()

## Part 1: Inspect the data
**Note:**  Refer to the previous comprehensive EDA effort for in-depth analysis.

In [4]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float64
 1

In [5]:
df1.shape

(22699, 17)

In [6]:
df1.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [7]:
# descriptive stats
df1.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


We are interested in the relationship between fare amount and payment type. To begin, we will look at the average fare for each payment type.

From the data dictionary, `payment_type` is a numeric code:
1. Credit Card
2. Cash
3. No charge
4. Dispute
5. Unknown
6. Voided trip
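
For readability, the numeric codes can be mapped to the labels listed above. This is a minimal sketch that re-enters two columns of the five rows shown by `df1.head()`; the names `PAYMENT_LABELS`, `sample`, and `payment_label` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Labels copied from the data dictionary list above
PAYMENT_LABELS = {
    1: "Credit Card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
    6: "Voided trip",
}

# payment_type and fare_amount for the five rows shown by df1.head()
sample = pd.DataFrame(
    {"payment_type": [1, 1, 1, 1, 2], "fare_amount": [13.0, 16.0, 6.5, 20.5, 16.5]},
    index=[24870114, 35634249, 106203690, 38942136, 30841670],
)

# Series.map looks up each code in the dictionary; unmapped codes become NaN
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

The same `map` call should work on `df1['payment_type']` directly.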

In [8]:
df1['payment_type'].value_counts()

payment_type
1    15265
2     7267
3      121
4       46
Name: count, dtype: int64

The two most common payment types are credit card (1) and cash (2). Credit card trips (15,265) outnumber cash trips (7,267) by more than two to one.
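
The ratio can be checked from the counts above. This small sketch re-enters the `value_counts()` output by hand (the names `counts`, `shares`, and `card_to_cash` are illustrative); on the full dataframe, `df1['payment_type'].value_counts(normalize=True)` gives the shares directly:

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({1: 15265, 2: 7267, 3: 121, 4: 46}, name="count")

# Share of all trips by payment type
shares = counts / counts.sum()

# Ratio of credit card trips to cash trips
card_to_cash = counts.loc[1] / counts.loc[2]

print(shares.round(4))
print(f"credit card : cash = {card_to_cash:.2f} : 1")
```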

In [9]:
df1.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64

The earlier EDA indicated that tips were recorded only for credit card payments. The group means above also show credit card customers paying a slightly higher average fare (about 13.43) than cash customers (about 12.21). To determine whether this difference is statistically significant rather than due to random chance, a hypothesis test is conducted on the relationship between `payment_type` and `fare_amount`.
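
The tip pattern can be checked with a grouped summary. This is a minimal sketch on the five rows shown by `df1.head()`, far too few to confirm anything, so it only illustrates the mechanics; `sample` and `tip_summary` are illustrative names:

```python
import pandas as pd

# payment_type and tip_amount for the five rows shown by df1.head()
sample = pd.DataFrame(
    {"payment_type": [1, 1, 1, 1, 2], "tip_amount": [2.76, 4.0, 1.45, 6.39, 0.0]}
)

# For each payment type: number of trips, trips with a recorded tip, and mean tip
tip_summary = sample.groupby("payment_type")["tip_amount"].agg(
    trips="count",
    tipped=lambda s: int((s > 0).sum()),
    mean_tip="mean",
)
print(tip_summary)
```

The TLC data dictionary notes that `tip_amount` is populated for credit card tips and that cash tips are not included, so a zero tip on a cash trip does not mean no tip was given.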

## Part 2: Conduct hypothesis testing

**Null hypothesis:** There is no difference in average fare between customers who use credit cards and customers who use cash. 

**Alternative hypothesis:** There is a difference in average fare between customers who use credit cards and customers who use cash.

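Conduct a two-sample t-test at the 5% significance level. The test is run with `equal_var=False` (Welch's t-test), so it does not assume equal variances in the two groups.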

In [10]:
# Create two dataframes, filter on Credit Card and Cash
credit_card = df1[df1['payment_type'] == 1]
cash = df1[df1['payment_type'] == 2]

In [11]:
# For this analysis, the chosen significance level is 5%
significance_level = 0.05
significance_level

0.05

In [12]:
# run t-test, find the pvalue:
tstatistic, pvalue = stats.ttest_ind(a=cash['fare_amount'], b=credit_card['fare_amount'], equal_var=False)
print("tstatistic:", tstatistic)
print("pvalue:", pvalue)

tstatistic: -6.866800855655372
pvalue: 6.797387473030518e-12
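
The call above passes `equal_var=False`, so SciPy runs Welch's t-test, which does not pool the two sample variances. As a rough sketch of what that computes, the function below reproduces the statistic and two-sided p-value by hand and compares them with `stats.ttest_ind`. The data is synthetic, and `welch_ttest`, `group_a`, and `group_b` are illustrative names, not part of the original notebook:

```python
import numpy as np
from scipy import stats


def welch_ttest(a, b):
    """Two-sided Welch's t-test for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance uses ddof=1)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p = 2 * stats.t.sf(abs(t), dof)
    return t, p


# Synthetic fares for illustration only, not the TLC data
rng = np.random.default_rng(0)
group_a = rng.normal(loc=12.0, scale=10.0, size=500)
group_b = rng.normal(loc=13.5, scale=13.0, size=1000)

t_manual, p_manual = welch_ttest(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The sign of the statistic follows the argument order: the notebook passes `a=cash` first, so its negative t-statistic reflects a lower mean fare for cash than for credit card.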


In [13]:
# Check if pvalue is less than or equal to significance level
pvalue <= significance_level

True

- The p-value (about 6.8e-12) is far below the significance level of 0.05.
- At the 5% significance level, there is sufficient evidence to reject the null hypothesis.
- The difference in average fare amount between customers who use credit cards and customers who use cash is statistically significant.
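
The decision rule and the size of the observed difference can be restated in a few lines. This is a minimal sketch that re-enters values from the notebook output above by hand; the function `decide` and the variable names for the means are illustrative:

```python
# Values copied from the notebook output above
pvalue = 6.797387473030518e-12
significance_level = 0.05
mean_fare_credit_card = 13.429748
mean_fare_cash = 12.213546


def decide(pvalue, alpha):
    """Return the test decision for a given p-value and significance level."""
    if pvalue <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Observed difference in mean fare (credit card minus cash)
mean_difference = mean_fare_credit_card - mean_fare_cash

print(decide(pvalue, significance_level))
print(f"observed difference in mean fare: {mean_difference:.2f}")
```

The difference is credit card minus cash, using the group means reported in Part 1.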

## Part 3: Communicate insights

- The results of this A/B test suggest that credit card customers tend to pay higher fares, on average, than cash customers.
- This analysis assumes that the dataset is a random sample covering all payment types and that customers chose only between cash and credit card. To isolate the effect of payment type, the data was filtered to these two payment methods.
- **The observed correlation between payment type and fare amount may be influenced by confounding factors.** For instance, customers might opt to pay by credit card on longer trips because they carry limited cash. It is therefore plausible that fare amount influences payment type rather than the reverse.
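
One way to probe the trip-length confounder mentioned above is to compare mean fares by payment type within trip-distance bins. This is a minimal sketch on the five rows shown by `df1.head()`, so it only illustrates the mechanics; the helper `mean_fare_by_distance` and the bin edges are assumptions made for illustration:

```python
import pandas as pd


def mean_fare_by_distance(df, bins):
    """Mean fare per payment type within each trip-distance bin."""
    distance_bin = pd.cut(df["trip_distance"], bins=bins)
    return (
        df.groupby([distance_bin, "payment_type"], observed=True)["fare_amount"]
        .mean()
        .unstack("payment_type")
    )


# Rows shown by df1.head(): trip_distance, payment_type, fare_amount
sample = pd.DataFrame(
    {
        "trip_distance": [3.34, 1.8, 1.0, 3.7, 4.37],
        "payment_type": [1, 1, 1, 1, 2],
        "fare_amount": [13.0, 16.0, 6.5, 20.5, 16.5],
    }
)

# Bin edges are an arbitrary choice for illustration
table = mean_fare_by_distance(sample, bins=[0, 2, 5])
print(table)
```

If the gap between credit card and cash fares shrinks within each distance bin on the full data, that would point to trip length as a driver of the overall difference.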