# NYC TLC Project Part 3 

This part analyzes the relationship between fare amount and payment type and conducts an A/B test.



# Statistical analysis

The project covers fundamental concepts such as descriptive statistics and hypothesis testing.
<br/>   

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. The A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this project, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This activity has two parts:*

**Part 1:** Imports and data loading

**Part 2:** Conduct EDA and hypothesis testing



# **Conduct an A/B test**


Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Before you begin, recall the following Python packages and functions that may be useful:

*Main functions*: `stats.ttest_ind(a, b, equal_var)`

*Other functions*: `mean()`

*Packages*: `pandas`, `scipy.stats`

</details>

In [1]:
import pandas as pd
import numpy as np
from scipy import stats


In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



In [3]:
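# Summary statistics for the numeric columns
print(taxi_data.describe())
# Number of rows and columns
print(taxi_data.shape)
# Column dtypes and non-null counts; info() prints directly and returns None
taxi_data.info()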

           VendorID  passenger_count  trip_distance    RatecodeID  \
count  22699.000000     22699.000000   22699.000000  22699.000000   
mean       1.556236         1.642319       2.913313      1.043394   
std        0.496838         1.285231       3.653171      0.708391   
min        1.000000         0.000000       0.000000      1.000000   
25%        1.000000         1.000000       0.990000      1.000000   
50%        2.000000         1.000000       1.610000      1.000000   
75%        2.000000         2.000000       3.060000      1.000000   
max        2.000000         6.000000      33.960000     99.000000   

       PULocationID  DOLocationID  payment_type   fare_amount         extra  \
count  22699.000000  22699.000000  22699.000000  22699.000000  22699.000000   
mean     162.412353    161.527997      1.336887     13.026629      0.333275   
std       66.633373     70.139691      0.496211     13.243791      0.463097   
min        1.000000      1.000000      1.000000   -120.000000 

We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [4]:
payment_grouped = taxi_data.groupby(["payment_type"])
mean_fare = payment_grouped["fare_amount"].mean()
print(mean_fare)

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64


Based on the averages shown, customers who pay by credit card appear to pay a higher fare amount than customers who pay in cash. However, this difference might be due to chance in the sample rather than a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
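Group means alone do not show how many trips fall in each group or how spread out the fares are. The following is a minimal sketch of a fuller per-group summary. It assumes a small made-up DataFrame, `sample`, standing in for `taxi_data`; the values are hypothetical and only the column names and payment codes come from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for taxi_data; the values are made up.
sample = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2, 3],
    "fare_amount": [14.0, 9.5, 20.0, 8.0, 12.5, 10.0, 5.0],
})

# Keep only credit card (1) and cash (2), the two groups under comparison.
two_groups = sample[sample["payment_type"].isin([1, 2])]

# Group size, mean, and standard deviation of fare_amount per payment type.
summary = two_groups.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])
print(summary)
```

Running the same aggregation on `taxi_data` would show whether the two groups are large enough, and similar enough in spread, for the comparison of means to be informative.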


### Task 3. Hypothesis testing


$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.



Our goal in this step is to conduct a two-sample t-test. The steps for conducting a hypothesis test are: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 



We choose 5% as the significance level and proceed with a two-sample t-test.
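Calling `stats.ttest_ind` with `equal_var=False` performs Welch's t-test, which does not assume that the two groups share a variance. For reference, the statistic and its approximate degrees of freedom are given below. The symbols are introduced here for illustration and do not appear in the original notebook: $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

The p-value is then the two-sided tail probability of $|t|$ under a t-distribution with $\nu$ degrees of freedom.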

In [5]:
significance_level = 0.05
credit_card = taxi_data[taxi_data['payment_type'] == 1]['fare_amount']
cash = taxi_data[taxi_data['payment_type'] == 2]['fare_amount']
stats.ttest_ind(a=credit_card, b=cash, equal_var=False)

Ttest_indResult(statistic=6.866800855655372, pvalue=6.797387473030518e-12)

Since the p-value is far smaller than the significance level of 5%, we reject the null hypothesis. 

We conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
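The decision rule can also be written out in code. This is a minimal sketch on synthetic data: `group_a`, `group_b`, and `reject_null` are hypothetical names, and the generated fares are not the TLC data.

```python
import numpy as np
from scipy import stats

# Synthetic fares for illustration only; these are not the TLC data.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=13.5, scale=4.0, size=500)
group_b = rng.normal(loc=12.0, scale=4.0, size=500)

significance_level = 0.05

# Welch's t-test (equal_var=False), the same variant used in the cell above.
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Decision rule: reject H0 only when the p-value is below the significance level.
reject_null = result.pvalue < significance_level
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}, reject H0: {reject_null}")
```

Keeping the comparison explicit makes the chosen significance level visible next to the result, rather than leaving it implied by the prose.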

### Outcomes

1.   The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers. 

2.   This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data was not collected this way, so we had to assume that trips were randomly assigned to the two payment groups in order to treat the analysis as an A/B test. The dataset does not account for other likely explanations. For example, riders might not carry much cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that fare amount determines payment type than the reverse. 
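One rough way to probe the alternative explanation above is to check whether trip distance also differs by payment type. The sketch below assumes a small made-up DataFrame, `trips`; the column names follow the dataset, but the values and the `fare_per_distance` measure are hypothetical and were not part of the original analysis.

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the dataset, values are made up.
trips = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "trip_distance": [5.2, 3.1, 8.4, 1.0, 2.3, 1.6],
    "fare_amount": [18.0, 12.5, 27.0, 6.0, 9.5, 7.5],
})

# Average distance and fare per payment type.
by_payment = trips.groupby("payment_type")[["trip_distance", "fare_amount"]].mean()
print(by_payment)

# Fare per unit of distance removes part of the distance effect.
# Zero-distance trips are dropped first to avoid dividing by zero.
nonzero = trips[trips["trip_distance"] > 0]
ratio = nonzero["fare_amount"] / nonzero["trip_distance"]
fare_per_distance = ratio.groupby(nonzero["payment_type"]).mean()
print(fare_per_distance)
```

If credit card trips in `taxi_data` turned out to be longer on average, that would be consistent with distance, not payment method, driving the higher fares.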