# Part 4: Statistical analysis 

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. Your A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*Activities:*

**Part 1:** Imports and data loading

**Part 2:** Conduct EDA and hypothesis testing

**Part 3:** Communicate insights with stakeholders


*Note:* In the dataset, `payment_type` is encoded as integers:

- 1: Credit card
- 2: Cash
- 3: No charge
- 4: Dispute
- 5: Unknown
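The integer codes are easier to read in summaries when mapped to labels. A minimal sketch, assuming a small hypothetical sample in place of the real dataset; the names `PAYMENT_TYPE_LABELS`, `sample`, and `payment_label` are illustrative and do not appear in the original notebook:

```python
import pandas as pd

# Code-to-label mapping copied from the note above.
PAYMENT_TYPE_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical mini-sample standing in for the real taxi data.
sample = pd.DataFrame({
    "payment_type": [1, 2, 1, 4, 3],
    "fare_amount": [13.0, 16.5, 6.5, 9.0, 12.0],
})

# Add a readable label column; any code missing from the mapping becomes NaN.
sample["payment_label"] = sample["payment_type"].map(PAYMENT_TYPE_LABELS)
print(sample)
```

The same `map` call could be applied to `taxi_data["payment_type"]` once the dataset is loaded.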

<h3>Imports and data loading</h3>

In [9]:
# import library

import pandas as pd
import scipy.stats as stats

In [4]:
# import dataset

taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

<h3>Conduct EDA and hypothesis testing</h3>

In [5]:
taxi_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [6]:
taxi_data.describe(include = 'all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699,22699,22699.0,22699.0,22699.0,22699,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
unique,,22687,22688,,,,2,,,,,,,,,,
top,,07/03/2017 3:45:19 PM,10/18/2017 8:07:45 PM,,,,N,,,,,,,,,,
freq,,2,2,,,,22600,,,,,,,,,,
mean,1.556236,,,1.642319,2.913313,1.043394,,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,,,1.285231,3.653171,0.708391,,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,,,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,,,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,,,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,,,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8


In [7]:
taxi_data.groupby('payment_type')['fare_amount'].mean()

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64
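Group means alone do not show how many trips sit behind each value, and that matters when interpreting a t-test. A minimal sketch of reporting count, mean, and standard deviation together, using a hypothetical sample rather than the real dataset:

```python
import pandas as pd

# Hypothetical mini-sample; the values are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 4],
    "fare_amount": [13.0, 16.0, 6.5, 16.5, 8.0, 9.0],
})

# Count, mean, and sample standard deviation of fare per payment type.
summary = (
    sample.groupby("payment_type")["fare_amount"]
    .agg(["count", "mean", "std"])
)
print(summary)
```

A group with a single row has an undefined sample standard deviation (shown as `NaN`), which is a quick signal that the group is too small to test.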

**Hypothesis Testing:**

Null hypothesis: There is no difference in average fare between customers who pay by credit card and customers whose payment type is dispute.

Alternative hypothesis: There is a difference in average fare between customers who pay by credit card and customers whose payment type is dispute.

In [10]:
significance_level = 0.05

# payment_type 1 = credit card, 4 = dispute
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
dispute = taxi_data[taxi_data['payment_type'] == 4]['fare_amount']

# Welch's t-test (does not assume equal variances)
stats.ttest_ind(a=credit_card, b=dispute, equal_var=False)

Ttest_indResult(statistic=0.9866203884666155, pvalue=0.3290937942836959)

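**Conclusion:**

p_value > significance_level (0.329 > 0.05)

Fail to reject the null hypothesis

The data do not provide sufficient evidence of a difference in average fare between credit card payments and disputed payments. Failing to reject the null hypothesis does not prove that it is true.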
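Passing `equal_var=False` to `stats.ttest_ind` selects Welch's t-test, which does not pool the two group variances. A minimal sketch of what that computes, using hypothetical fare samples (`group_a` and `group_b` are illustrative, not values from the dataset) and checking the hand calculation against SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical fare samples, not values from the dataset.
group_a = np.array([13.0, 16.0, 6.5, 20.5, 9.5, 14.0])
group_b = np.array([16.5, 8.0, 7.5, 12.0, 5.5])

# Per-group variance of the sample mean (sample variance divided by n).
var_a = group_a.var(ddof=1) / len(group_a)
var_b = group_b.var(ddof=1) / len(group_b)

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The two printed pairs should agree up to floating-point error.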

**Hypothesis Testing:**

**Null hypothesis**: There is no difference in average fare between customers who use credit cards and customers who use cash.

**Alternative hypothesis**: There is a difference in average fare between customers who use credit cards and customers who use cash.

In [None]:
# payment_type 1 = credit card, 2 = cash
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']

# Welch's t-test (does not assume equal variances)
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

**Conclusion:**

p_value << significance_level

Reject the null hypothesis

The data support the alternative hypothesis: there is a statistically significant difference in average fare between customers who use credit cards and customers who use cash.
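The stated goal asks whether credit card customers pay *higher* fares, which is a directional question, while the test above is two-sided. A minimal sketch of a one-sided Welch test, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument) and using hypothetical fare values rather than the real dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical fare samples, not values from the dataset.
credit_card = np.array([13.0, 16.0, 6.5, 20.5, 14.5, 9.5, 17.0, 12.0])
cash = np.array([16.5, 8.0, 7.5, 12.0, 5.5, 9.0, 10.5])

significance_level = 0.05

# Two-sided test: is there any difference in mean fare?
two_sided = stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

# One-sided test: is the mean credit card fare greater than the mean cash fare?
one_sided = stats.ttest_ind(
    a=credit_card, b=cash, equal_var=False, alternative="greater"
)

# Decision rule: reject only when the p-value falls below the significance level.
if one_sided.pvalue < significance_level:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(two_sided.pvalue, one_sided.pvalue, decision)
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half the two-sided one. The direction should be chosen before looking at the data.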

<h3>Communicate insights with stakeholders</h3>

**Key takeaways**

- The key business insight is that encouraging customers to pay with credit cards may generate more revenue for taxi cab drivers, subject to the caveat below.

- This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data were not collected this way, so the analysis had to assume that trips were randomly assigned to the two groups in order to treat it as an A/B test. This dataset does not account for other likely explanations. For example, riders might not carry much cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that fare amount determines payment type than the reverse.
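One way to probe the alternative explanation above is to compare trip distance across payment types: if credit card trips are also longer on average, distance may account for part of the fare gap. A minimal sketch using hypothetical rows (the `trips` frame and its values are illustrative; only the column names come from the dataset):

```python
import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
trips = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [3.34, 1.8, 4.1, 4.37, 0.9, 1.2],
    "fare_amount": [13.0, 16.0, 17.5, 16.5, 5.5, 6.5],
})

# Mean distance and mean fare for credit card (1) and cash (2) trips.
by_payment = (
    trips.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
)
print(by_payment)
```

A difference in mean distance would not settle the direction of causation, but it would indicate that fare differences should not be attributed to payment method alone.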