# Automatidata project 
**Course 4 - The Power of Statistics**

You are a data professional at a data consulting firm called Automatidata. The current project for its newest client, the New York City Taxi & Limousine Commission (New York City TLC), is reaching its midpoint, having completed a project proposal, Python coding work, and exploratory data analysis.

You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about a new request from the New York City TLC: to analyze the relationship between fare amount and payment type. A follow-up email from Luana includes your specific assignment: to conduct an A/B test. 

A notebook was structured and prepared to help you in this project. Please complete the following questions.


# Course 4 End-of-course project: Statistical analysis

In this activity, you will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. You will explore the data provided and conduct A/B and hypothesis testing.  
<br/>   

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. Your A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct EDA and hypothesis testing
* How did computing descriptive statistics help you analyze your data? 

* How did you formulate your null hypothesis and alternative hypothesis? 

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your A/B test?

* What business recommendations do you propose based on your results?

<br/> 
Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work. 

# **Conduct an A/B test**


<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## PACE: Plan 

In this stage, consider the following questions where applicable to complete your code response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


==> ENTER YOUR RESPONSE HERE 

*Complete the following steps to perform statistical analysis of your data:* 

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Before you begin, recall the following Python packages and functions that may be useful:

*Main functions*: stats.ttest_ind(a, b, equal_var)

*Other functions*: mean() 

*Packages*: pandas, scipy.stats

</details>

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv("2017_Yellow_Taxi_Trip_Data.csv", index_col = 0)

<img src="images/Analyze.png" width="100" height="100" align=left>

<img src="images/Construct.png" width="100" height="100" align=left>

## PACE: **Analyze and Construct**

In this stage, consider the following questions where applicable to complete your code response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


Answer:
- **Summarizing Data Distribution:** Descriptive statistics such as measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation, range) provide information about the distribution of your data. They give you a sense of the typical values, the spread of the data, and the presence of outliers or unusual patterns.

- **Detecting Skewness and Symmetry:** Descriptive statistics, such as skewness and kurtosis, help you understand the shape of your data distribution. Positive or negative skewness indicates a departure from symmetry, while kurtosis measures the degree of peakedness or flatness in the distribution. These statistics can reveal important characteristics about your data and guide further analysis.

- **Identifying Central Tendency:** Measures of central tendency, such as the mean or median, provide insight into the typical or representative value of the data. They help you understand the center around which the data points tend to cluster and can be used to make comparisons or assess deviations from the norm.

- **Assessing Variability and Spread:** Descriptive statistics like variance and standard deviation quantify the spread or variability of your data. They indicate how much the data points deviate from the mean and can help identify patterns, trends, or differences between groups or variables.

- **Understanding Relationships and Correlations:** Descriptive statistics, such as correlation coefficients, can help you explore the relationships between variables. They provide insights into the strength and direction of the linear association between variables, allowing you to identify potential dependencies or patterns.

- **Identifying Outliers:** Descriptive statistics can help identify outliers, which are data points that significantly deviate from the rest of the data. Outliers can indicate measurement errors, anomalies, or unique observations that require further investigation.
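
As a rough illustration of the measures listed above, the sketch below computes them with pandas on a handful of made-up fare values. The numbers and the 1.5 × IQR outlier rule are assumptions chosen for the example, not values or methods taken from the taxi dataset.

```python
import pandas as pd

# Hypothetical fare values, invented for illustration (not from the taxi dataset)
fares = pd.Series([6.5, 9.5, 13.0, 16.0, 16.5, 20.5, 52.0])

summary = {
    "mean": fares.mean(),
    "median": fares.median(),
    "std": fares.std(),                  # sample standard deviation (ddof=1)
    "range": fares.max() - fares.min(),
    "skew": fares.skew(),                # > 0 suggests a longer right tail
    "kurtosis": fares.kurt(),            # excess kurtosis (0 for a normal distribution)
}

# Flag potential outliers with the 1.5 * IQR rule (one common convention)
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = fares[(fares < q1 - 1.5 * iqr) | (fares > q3 + 1.5 * iqr)]

print(summary)
print("Potential outliers:", outliers.tolist())
```

A mean well above the median, together with positive skew, is a typical sign that a few large values are pulling the average upward.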

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown



In [8]:
taxi_data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
24870114,2,03/25/2017 8:55:43 AM,03/25/2017 9:09:47 AM,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
35634249,1,04/11/2017 2:53:28 PM,04/11/2017 3:19:58 PM,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
106203690,1,12/15/2017 7:26:56 AM,12/15/2017 7:34:08 AM,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
38942136,2,05/07/2017 1:17:59 PM,05/07/2017 1:48:14 PM,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
30841670,2,04/15/2017 11:32:20 PM,04/15/2017 11:49:03 PM,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [10]:
taxi_data.shape

(22699, 17)

In [11]:
taxi_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22699 entries, 24870114 to 17208911
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               22699 non-null  int64  
 1   tpep_pickup_datetime   22699 non-null  object 
 2   tpep_dropoff_datetime  22699 non-null  object 
 3   passenger_count        22699 non-null  int64  
 4   trip_distance          22699 non-null  float64
 5   RatecodeID             22699 non-null  int64  
 6   store_and_fwd_flag     22699 non-null  object 
 7   PULocationID           22699 non-null  int64  
 8   DOLocationID           22699 non-null  int64  
 9   payment_type           22699 non-null  int64  
 10  fare_amount            22699 non-null  float64
 11  extra                  22699 non-null  float64
 12  mta_tax                22699 non-null  float64
 13  tip_amount             22699 non-null  float64
 14  tolls_amount           22699 non-null  float

In [12]:
taxi_data.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0,22699.0
mean,1.556236,1.642319,2.913313,1.043394,162.412353,161.527997,1.336887,13.026629,0.333275,0.497445,1.835781,0.312542,0.299551,16.310502
std,0.496838,1.285231,3.653171,0.708391,66.633373,70.139691,0.496211,13.243791,0.463097,0.039465,2.800626,1.399212,0.015673,16.097295
min,1.0,0.0,0.0,1.0,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,1.0,0.99,1.0,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,1.0,1.61,1.0,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2.0,3.06,1.0,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,6.0,33.96,99.0,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29


You are interested in the relationship between payment type and the total fare amount the customer pays. One approach is to look at the average total fare amount for each payment type. 

In [13]:
# Define the mapping of values to descriptions
payment_mapping = {
    1: 'Credit card',
    2: 'Cash',
    3: 'No charge',
    4: 'Dispute',
    5: 'Unknown'
}

# Replace the values in the "payment_type" column
taxi_data['payment_type'] = taxi_data['payment_type'].replace(payment_mapping)

In [27]:
(
    taxi_data.groupby('payment_type')[['total_amount']]
    .mean()
    .rename(columns={'total_amount': 'average_total_fare'})
    .sort_values(by='average_total_fare', ascending=False)
)

| payment_type | average_total_fare |
|---|---|
| Credit card | 17.663577 |
| No charge | 13.579669 |
| Cash | 13.545821 |
| Dispute | 11.238261 |


Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger total fare amount than customers who pay in cash. However, this difference might arise from random sampling variation rather than reflect a true difference in total fare amount. To assess whether the difference is statistically significant, you conduct a hypothesis test.
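
Group means alone do not show how much the values vary or how many trips fall in each group, and both affect whether a gap could be noise. The sketch below, which uses a small invented DataFrame rather than the taxi data, shows one way to view count, mean, and standard deviation side by side. Clearly unequal spreads or group sizes are a common reason to prefer Welch's t-test (`equal_var=False` in `stats.ttest_ind`).

```python
import pandas as pd

# Hypothetical mini-sample, invented for illustration (not the taxi data)
trips = pd.DataFrame({
    "payment_type": ["Credit card", "Cash", "Credit card", "Cash", "Credit card", "Cash"],
    "total_amount": [16.56, 17.80, 20.80, 8.30, 27.69, 11.80],
})

# Count, mean, and sample standard deviation per payment type, highest mean first
group_summary = (
    trips.groupby("payment_type")["total_amount"]
    .agg(["count", "mean", "std"])
    .sort_values(by="mean", ascending=False)
)
print(group_summary)
```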


### Task 3. Hypothesis testing

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypothesis. Consider your hypotheses for this project as listed below.

$H_0$: There is no difference in the average total fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average total fare amount between customers who use credit cards and customers who use cash.
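
In symbols, writing $\mu_{\text{credit}}$ and $\mu_{\text{cash}}$ for the population mean total fare amounts of the two groups (notation introduced here for illustration only), these hypotheses can be stated as:

$$H_0: \mu_{\text{credit}} = \mu_{\text{cash}} \qquad\qquad H_A: \mu_{\text{credit}} \neq \mu_{\text{cash}}$$

Because $H_A$ allows a difference in either direction, the corresponding test is two-sided.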



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 



**Note:** For the purpose of this exercise, your hypothesis test is the main component of your A/B test. 

You choose 5% as the significance level and proceed with a two-sample t-test.
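
For intuition about what a two-sample t-test with `equal_var=False` (Welch's t-test) computes, the sketch below reproduces the statistic and p-value by hand and compares them with `stats.ttest_ind`. The two samples are synthetic and their parameters are arbitrary assumptions; they are not the taxi data.

```python
import numpy as np
from scipy import stats

# Synthetic samples, invented for illustration (not the taxi data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=17.0, scale=6.0, size=200)
group_b = rng.normal(loc=14.0, scale=4.0, size=120)

# Per-group variance of the sample mean
va = group_a.var(ddof=1) / group_a.size
vb = group_b.var(ddof=1) / group_b.size

# Welch's t-statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (group_a.size - 1) + vb ** 2 / (group_b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Compare with SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The sign of the statistic follows the order of the arguments: a positive value means the first sample has the larger mean.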

In [29]:
# Boolean masks for the two payment groups being compared
mask1 = taxi_data['payment_type'] == 'Credit card'
mask2 = taxi_data['payment_type'] == 'Cash'

# Total amount paid by each group
creditcard_users_fare = taxi_data.loc[mask1, 'total_amount']
cash_users_fare = taxi_data.loc[mask2, 'total_amount']

# Welch's two-sample t-test (equal_var=False does not assume equal variances)
t_statistic, p_value = stats.ttest_ind(a=creditcard_users_fare,
                                       b=cash_users_fare,
                                       equal_var=False,
                                       alternative='two-sided')

print("T-statistic:", t_statistic)
print("P-value:", p_value)


T-statistic: 20.34644022783838
P-value: 4.5301445359736376e-91


Conclusion:

Since the p-value (about 4.5e-91) is far below the significance level of 0.05, the difference in the average total fare amount between customers who use credit cards and customers who use cash is statistically significant. Therefore, we **reject the null hypothesis $H_0$ in favor of the alternative hypothesis $H_A$ that there is a difference in the average total fare amount between the two payment types.** The positive t-statistic indicates that the credit card group has the higher sample mean.
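
A very small p-value says the difference is unlikely to be zero, but not how large it is. One possible follow-up, sketched below under stated assumptions, is a confidence interval for the difference in means based on Welch's approximation. The helper `welch_mean_diff_ci` is a hypothetical function written for this example, and the samples are synthetic rather than drawn from the taxi data.

```python
import numpy as np
from scipy import stats

def welch_mean_diff_ci(a, b, confidence=0.95):
    """Return (mean difference, lower, upper) using Welch's approximation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size          # variance of the mean of a
    vb = b.var(ddof=1) / b.size          # variance of the mean of b
    diff = a.mean() - b.mean()
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    margin = stats.t.ppf((1 + confidence) / 2, df) * se
    return diff, diff - margin, diff + margin

# Synthetic samples, invented for illustration (not the taxi data)
rng = np.random.default_rng(1)
credit = rng.normal(17.5, 6.0, 300)
cash = rng.normal(13.5, 5.0, 150)

diff, lower, upper = welch_mean_diff_ci(credit, cash)
print(f"difference: {diff:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

A 95% interval that excludes zero is consistent with rejecting $H_0$ at the 5% level in a two-sided Welch test, and the interval's width gives a sense of how precisely the difference is estimated.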

<img src="images/Execute.png" width="100" height="100" align=left>

## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4. Communicate insights with stakeholders

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this educational project.

Answer for Q1:
Business Insight(s) drawn from the result of hypothesis test:
- **Payment Preference:** The company can observe that customers who pay with credit cards have significantly different fare amounts compared to those who pay with cash. This insight can help the company understand customer preferences and tailor payment options accordingly.

- **Revenue Analysis:** By analyzing the fare amounts based on payment types, the company can gain insights into revenue generation. They can identify which payment method contributes more to their overall revenue and make informed decisions about optimizing payment processing systems or incentivizing certain payment methods.

- **Financial Planning:** The difference in fare amounts between credit card and cash payments can also have implications for financial planning. The company can assess transaction costs, reconcile cash handling expenses, and evaluate the impact of payment processing fees on their financial statements.

Answer for Q2:
Regarding the A/B test project's realism and assumptions:

- **Educational Project:** It's important to note that this was an educational project, and the dataset used may not fully reflect the complexities and real-world dynamics of a taxi company's operations. Real-world A/B tests typically involve more rigorous experimental designs, sample sizes, and considerations for external factors.

- **Assumptions:** The hypothesis test assumes that the sampled data is representative of the entire population, and that the two groups (credit card users and cash users) are independent and randomly selected. These assumptions may not hold in real-world scenarios, as there may be confounding variables or selection biases that could impact the results.

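- **Limited Factors Considered:** The hypothesis test focused solely on the average total fare amount and the payment type. In practice, there are various other factors that can influence fare amounts, such as distance, time of day, surge pricing, and additional charges. These factors were not considered in this analysis. In addition, `total_amount` includes `tip_amount`, which the TLC data dictionary says is populated only for credit card tips (cash tips are not recorded), so part of the observed difference may reflect unrecorded cash tips rather than higher fares.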

- It is crucial to conduct further research, analysis, and considerations beyond the scope of this project to draw more comprehensive and realistic insights for business decision-making.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.