# T-Test with Python · NYC Taxi Fares

Completed by [Anton Starshev](http://linkedin.com/in/starshev) on 17/04/2024

### Context

A data consulting firm is tasked with analyzing the relationship between fare amounts and payment types within the NYC Taxi Service. In particular, the goal is to determine whether fare amounts differ between payment types in a statistically significant way.

### Data

This project uses a dataset called `2017_Yellow_Taxi_Trip_Data.csv`, gathered by the New York City Taxi and Limousine Commission (TLC). In the dataset, `payment_type` is encoded as integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
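
The integer codes can be turned into readable labels with a lookup table. The snippet below is a minimal sketch: the dictionary simply restates the legend above, while the `sample` frame and the `payment_label` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Lookup table restating the payment_type legend above.
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical rows standing in for the real dataset.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 3],
    "fare_amount": [13.0, 16.5, 6.5, 0.0],
})

# Add a human-readable label next to each integer code.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_LABELS)
print(sample)
```

Keeping the original integer column alongside the label means later filters such as `payment_type == 1` keep working unchanged.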

### Execution

Imported necessary libraries.

In [2]:
import pandas as pd
import numpy as np
from scipy import stats

Loaded the dataset.

In [7]:
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

taxi_data.head()

| Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
| 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.8 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.0 | 0.0 | 0.3 | 20.8 |
| 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.0 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
| 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.7 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
| 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 17.8 |


Checked data size.

In [9]:
taxi_data.shape

(22699, 17)

Verified data types of the variables.

In [11]:
taxi_data.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
RatecodeID                 int64
store_and_fwd_flag        object
PULocationID               int64
DOLocationID               int64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

Looked at the average fare amount for each payment type.

In [5]:
taxi_data_card = taxi_data[taxi_data['payment_type'] == 1]
taxi_data_card.fare_amount.mean()

13.429747789059942

In [6]:
taxi_data_cash = taxi_data[taxi_data['payment_type'] == 2]
taxi_data_cash.fare_amount.mean()

12.21354616760699

**Observation:** Based on these averages (about 13.43 for credit card versus 12.21 for cash), customers who pay by credit card appear to pay a higher fare than customers who pay in cash. However, this difference might arise from random sampling variation rather than reflect a true difference in fare amounts. To assess whether the difference is statistically significant, a hypothesis test needs to be conducted.
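
The two per-group means can also be obtained in one step with `groupby`, which additionally reports group sizes and spread. This is a sketch on a hypothetical miniature frame (`demo`, with illustrative values only), not the notebook's actual data or output.

```python
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only.
demo = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 3],
    "fare_amount": [13.0, 16.0, 6.5, 16.5, 9.5, 0.0],
})

# Keep only credit card (1) and cash (2) trips.
card_cash = demo[demo["payment_type"].isin([1, 2])]

# Count, mean and sample standard deviation of the fare per payment type.
summary = card_cash.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

Looking at the group sizes and standard deviations next to the means helps judge whether an equal-variance assumption would be reasonable for the test.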

Stated the null and alternative hypotheses:

**H<sub>0</sub>**: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

**H<sub>a</sub>**: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.

Assigned **5% significance level** to the hypothesis test.

Determined the type of hypothesis test: **two-sample, two-tailed t-test** (Welch's version, which does not assume equal variances in the two groups).

Conducted the hypothesis test using the SciPy Stats module.

In [12]:
# equal_var = False requests Welch's t-test (no equal-variance assumption)
stats.ttest_ind(a = taxi_data_card.fare_amount, b = taxi_data_cash.fare_amount,
                equal_var = False, alternative = 'two-sided')

TtestResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12, df=16675.48547403633)
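
To make the reported statistic and the non-integer `df` less opaque, here is a minimal sketch of what Welch's t-test computes, checked against `stats.ttest_ind`. The arrays `card` and `cash` are synthetic stand-ins generated for illustration (their means and spreads are assumptions, not values from the dataset), and `welch_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two fare samples (illustrative parameters only).
rng = np.random.default_rng(0)
card = rng.normal(loc=13.4, scale=10.0, size=500)
cash = rng.normal(loc=12.2, scale=12.0, size=300)

def welch_t(a, b):
    """Return (t statistic, degrees of freedom, two-sided p-value) for Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution.
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

t_manual, df_manual, p_manual = welch_t(card, cash)
result = stats.ttest_ind(card, cash, equal_var=False)

print(t_manual, df_manual, p_manual)
print(result.statistic, result.pvalue)
```

The degrees of freedom come from the Welch-Satterthwaite approximation, which is why the notebook output shows a fractional `df` rather than a whole number.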

**Test result:** The p-value (approximately 6.8e-12) is far below the 5% significance level, so H<sub>0</sub> was rejected.
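
The decision rule behind this result can be written out explicitly. The sketch below reuses the p-value printed in the test output above; the names `ALPHA` and `decide` are hypothetical and chosen only for this illustration.

```python
ALPHA = 0.05  # 5% significance level chosen for the test

def decide(p_value, alpha=ALPHA):
    """Return True when the null hypothesis is rejected at the given level."""
    return p_value < alpha

# p-value copied from the ttest_ind output above.
p_value = 6.797387473030518e-12
reject_null = decide(p_value)

print("Reject H0" if reject_null else "Fail to reject H0")
```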

### Insight and recommendation

Based on the test, the key business insight is that there is a statistically significant difference in the average fare amount between customers who pay by credit card and those who pay in cash. Specifically, customers who use credit cards pay a higher fare on average than customers who pay in cash. The analysis therefore suggests that encouraging customers to pay by credit card could increase revenue for taxi drivers. For instance, signage inside cabs could highlight the advantages of credit card payments, and drivers could be asked to mention these advantages to customers.

### Acknowledgment

I would like to express my gratitude to Google and Coursera for supporting the educational process and providing the opportunity to refine and showcase skills acquired during the courses by completing real-life scenario portfolio projects such as this one.

### Reference

This is an end-of-course workplace scenario project *«Automatidata, featuring a fictional data consulting firm»* proposed within the syllabus of Google Advanced Data Analytics Professional Certificate on Coursera.