# Automatidata project 
**Course 4 - The Power of Statistics**

You are a data professional at a data consulting firm called Automatidata. The current project for its newest client, the New York City Taxi & Limousine Commission (New York City TLC), is reaching its midpoint, having completed a project proposal, Python coding work, and exploratory data analysis.

You receive a new email from Uli King, Automatidata’s project manager. Uli tells your team about a new request from the New York City TLC: to analyze the relationship between fare amount and payment type. A follow-up email from Luana includes your specific assignment: to conduct an A/B test. 

A notebook was structured and prepared to help you in this project. Please complete the following questions.


# Course 4 End-of-course project: Statistical analysis

In this activity, you will practice using statistics to analyze and interpret data. The activity covers fundamental concepts such as descriptive statistics and hypothesis testing. You will explore the data provided and conduct A/B and hypothesis testing.  
<br/>   

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze A/B tests. Your A/B test results should aim to find ways to generate more revenue for taxi cab drivers.

**Note:** For the purpose of this exercise, assume that the sample data comes from an experiment in which customers are randomly selected and divided into two groups: 1) customers who are required to pay with credit card, 2) customers who are required to pay with cash. Without this assumption, we cannot draw causal conclusions about how payment method affects fare amount.

**The goal** is to apply descriptive statistics and hypothesis testing in Python. The goal for this A/B test is to sample data and analyze whether there is a relationship between payment type and fare amount. For example: discover if customers who use credit cards pay higher fare amounts than customers who use cash.
  
*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct EDA and hypothesis testing
* How did computing descriptive statistics help you analyze your data? 

* How did you formulate your null hypothesis and alternative hypothesis? 

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from your A/B test?

* What business recommendations do you propose based on your results?

<br/> 
Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work. 

# **Conduct an A/B test**


# **PACE stages**

Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

## PACE: Plan 

In this stage, consider the following questions where applicable to complete your code response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


The research question for this project is: "Is there a statistically significant relationship between the fare amount and the payment type used by customers, specifically between credit card payments and cash payments?"

*Complete the following steps to perform statistical analysis of your data:* 

### Task 1. Imports and data loading

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Before you begin, recall the following Python packages and functions that may be useful:

*Main functions*: stats.ttest_ind(a, b, equal_var)

*Other functions*: mean() 

*Packages*: pandas, scipy.stats

</details>

In [1]:
#==> ENTER YOUR CODE HERE

import pandas as pd
import numpy as np
from scipy import stats

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [2]:
# Load dataset into dataframe
taxi_data = pd.read_csv(r"C:\Users\younu\Desktop\My Py Scripts\Git Repos\14_Google Advanced DA - Automatidata\Automatidata - Dataset.csv", index_col = 0)

## PACE: **Analyze and Construct**

In this stage, consider the following questions where applicable to complete your code response:
1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


Computing descriptive statistics helps to summarize and understand the main characteristics of the dataset, such as the central tendency, variability, and distribution. It allows for identifying patterns, trends, and potential anomalies, which are essential for conducting more detailed analyses.

### Task 2. Data exploration

Use descriptive statistics to conduct Exploratory Data Analysis (EDA). 

<details>
  <summary><h4><strong>Hint: </strong></h4></summary>

Refer back to *Self Review Descriptive Statistics* for this step-by-step process.

</details>

**Note:** In the dataset, `payment_type` is encoded in integers:
*   1: Credit card
*   2: Cash
*   3: No charge
*   4: Dispute
*   5: Unknown
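
The integer codes above can be translated into readable labels before exploring the data. The following is a minimal sketch, assuming a small made-up series in place of the real `payment_type` column; the names `PAYMENT_LABELS`, `example_codes`, and `example_labels` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Mapping copied from the note above
PAYMENT_LABELS = {
    1: "Credit card",
    2: "Cash",
    3: "No charge",
    4: "Dispute",
    5: "Unknown",
}

# Hypothetical stand-in for taxi_data["payment_type"]; values are made up
example_codes = pd.Series([1, 2, 1, 3, 4, 2], name="payment_type")

# Series.map replaces each code with its label; unmapped codes would become NaN
example_labels = example_codes.map(PAYMENT_LABELS)
print(example_labels.value_counts())
```

Keeping the labels in a separate series leaves the original integer column untouched, so later filters such as `payment_type == 1` continue to work.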



In [4]:
#==> ENTER YOUR CODE HERE

# Compute descriptive statistics
descriptive_stats = taxi_data.describe()
print(descriptive_stats)

           VendorID  passenger_count  trip_distance    RatecodeID  \
count  22699.000000     22699.000000   22699.000000  22699.000000   
mean       1.556236         1.642319       2.913313      1.043394   
std        0.496838         1.285231       3.653171      0.708391   
min        1.000000         0.000000       0.000000      1.000000   
25%        1.000000         1.000000       0.990000      1.000000   
50%        2.000000         1.000000       1.610000      1.000000   
75%        2.000000         2.000000       3.060000      1.000000   
max        2.000000         6.000000      33.960000     99.000000   

       PULocationID  DOLocationID  payment_type   fare_amount         extra  \
count  22699.000000  22699.000000  22699.000000  22699.000000  22699.000000   
mean     162.412353    161.527997      1.336887     13.026629      0.333275   
std       66.633373     70.139691      0.496211     13.243791      0.463097   
min        1.000000      1.000000      1.000000   -120.000000 

You are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type. 

In [5]:
#==> ENTER YOUR CODE HERE

# Explore the relationship between payment type and fare amount
average_fare_by_payment = taxi_data.groupby('payment_type')['fare_amount'].mean()
print(average_fare_by_payment)

payment_type
1    13.429748
2    12.213546
3    12.186116
4     9.913043
Name: fare_amount, dtype: float64


Based on the averages shown, it appears that customers who pay by credit card tend to pay a larger fare amount than customers who pay with cash. However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, you conduct a hypothesis test.
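
A group mean alone does not show how many trips are in each group or how much the fares vary. As a rough sketch, the group size, median, and standard deviation can be reported alongside the mean. The data below is made up for illustration, and the names `toy_trips` and `fare_summary` are hypothetical.

```python
import pandas as pd

# Made-up fares for illustration only; not drawn from the TLC dataset
toy_trips = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 14.0, 18.0, 8.0, 12.0, 13.0],
})

# Report group size, mean, median, and sample standard deviation together
fare_summary = (
    toy_trips.groupby("payment_type")["fare_amount"]
    .agg(["count", "mean", "median", "std"])
)
print(fare_summary)
```

The same `agg` call should work on `taxi_data` in place of `toy_trips`, since it only relies on the `payment_type` and `fare_amount` columns.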


### Task 3. Hypothesis testing

Before you conduct your hypothesis test, consider the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypothesis. Consider your hypotheses for this project as listed below.

$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.



Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test: 


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis 
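
The four steps above can be sketched in code as follows. This is an illustration under stated assumptions, not the notebook's solution: the two samples are synthetic, their means and spreads are arbitrary, and the names `group_a`, `group_b`, and `decision` are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration; the means and spreads are arbitrary
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=13.5, scale=4.0, size=500)
group_b = rng.normal(loc=12.0, scale=4.0, size=500)

# Step 1: H0 says the two population means are equal; HA says they differ

# Step 2: choose a significance level
alpha = 0.05

# Step 3: find the p-value (Welch's t-test, no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Step 4: reject H0 only when the p-value falls below alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3g}: {decision}")
```

Fixing `alpha` before computing the p-value mirrors the order of the steps and avoids choosing a threshold after seeing the result.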



**Note:** For the purpose of this exercise, your hypothesis test is the main component of your A/B test. 

You choose 5% as the significance level and proceed with a two-sample t-test.

In [6]:
#==> ENTER YOUR CODE HERE

# Extract fare amounts for credit card and cash payments
credit_card_fares = taxi_data.loc[taxi_data['payment_type'] == 1, 'fare_amount']
cash_fares = taxi_data.loc[taxi_data['payment_type'] == 2, 'fare_amount']

# Conduct a two-sample t-test
t_stat, p_value = stats.ttest_ind(credit_card_fares, cash_fares, equal_var=False)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: 6.866800855655372
P-value: 6.797387473030518e-12


==> ENTER YOUR DECISION TO REJECT OR FAIL TO REJECT THE NULL HYPOTHESIS

The t-statistic is 6.8668 and the p-value is 6.797e-12, which is far below the 5% significance level, so we reject the null hypothesis. This indicates that there is a statistically significant difference in the average fare amounts between customers who use credit cards and those who use cash.
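
With `equal_var=False`, `stats.ttest_ind` runs Welch's t-test, which does not assume the two groups share a variance. The sketch below recomputes the statistic and two-sided p-value by hand on made-up numbers and checks them against SciPy. The samples and the names `sample_1`, `sample_2`, `t_manual`, and `p_manual` are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up fares for illustration only; not drawn from the TLC dataset
sample_1 = np.array([10.0, 14.0, 18.0, 22.0, 11.0])
sample_2 = np.array([8.0, 12.0, 13.0, 9.0, 10.0, 11.0])

# Sample variances use ddof=1 (divide by n - 1)
var_1, var_2 = sample_1.var(ddof=1), sample_2.var(ddof=1)
n_1, n_2 = sample_1.size, sample_2.size

# Welch's statistic: difference in means over the unpooled standard error
standard_error = np.sqrt(var_1 / n_1 + var_2 / n_2)
t_manual = (sample_1.mean() - sample_2.mean()) / standard_error

# Welch-Satterthwaite approximation for the degrees of freedom
dof = (var_1 / n_1 + var_2 / n_2) ** 2 / (
    (var_1 / n_1) ** 2 / (n_1 - 1) + (var_2 / n_2) ** 2 / (n_2 - 1)
)

# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Compare against SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(sample_1, sample_2, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

A positive statistic means the first sample has the larger mean, which matches the sign of the t-statistic reported for credit card fares versus cash fares.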

## PACE: **Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### Task 4. Communicate insights with stakeholders

*Ask yourself the following questions:*

1. What business insight(s) can you draw from the result of your hypothesis test?
2. Consider why this A/B test project might not be realistic, and what assumptions had to be made for this educational project.

Under the assumption that customers were randomly assigned a payment method, the hypothesis test results suggest that customers who pay with credit cards tend to have higher fare amounts than those who pay with cash. This insight can be leveraged to develop strategies that encourage more customers to use credit cards, potentially leading to increased revenue for taxi drivers.

This A/B test project might not be realistic due to several reasons:

1. Random Assignment: The assumption that customers are randomly assigned to pay with either cash or credit card is unrealistic in a real-world setting. Payment methods are usually determined by customer preference and situational factors.
2. Controlled Environment: In practice, it is challenging to control external variables that might influence payment choices, such as availability of cash, credit card facilities, and customer convenience.
3. Behavioral Factors: Customer behavior and preferences play a significant role in choosing payment methods, which were not accounted for in this simplified scenario.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.