# 🧪 NYC Yellow Taxi Fare Statistical Analysis
This notebook tests whether the average fare of NYC Yellow Taxi trips differs significantly between the two main payment types (`Card` vs `Cash`).

## 📥 Importing Libraries and Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load dataset (update the path as per your local setup)
df = pd.read_csv('yellow_tripdata_2020-01.csv')
df.head()

## 🧹 Data Preprocessing

In [None]:
# Convert datetime columns
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

In [None]:
# Calculate duration in minutes
df['Duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
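As a quick sanity check of the duration formula, the sketch below applies the same expression to a single made-up trip. The timestamps are hypothetical values chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical pickup/dropoff timestamps, 4 minutes 48 seconds apart
pickup = pd.to_datetime(pd.Series(['2020-01-01 00:28:15']))
dropoff = pd.to_datetime(pd.Series(['2020-01-01 00:33:03']))

# Same expression as in the notebook: timedelta -> seconds -> minutes
minutes = (dropoff - pickup).dt.total_seconds() / 60
print(minutes.iloc[0])
```

Subtracting two datetime columns yields a timedelta column, which is why the `.dt.total_seconds()` accessor is available here.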

In [None]:
# Keep only relevant columns
df = df[['passenger_count', 'payment_type', 'fare_amount', 'trip_distance', 'Duration']]

In [None]:
# Drop rows with null values
df = df.dropna()

In [None]:
# Convert data types
df['passenger_count'] = df['passenger_count'].astype(int)
df['payment_type'] = df['payment_type'].astype(int)

In [None]:
# Remove duplicates
df = df.drop_duplicates()

In [None]:
# Keep only card (1) and cash (2) payments, and trips with 1 to 5 passengers
df = df[df['payment_type'] < 3]
df = df[(df['passenger_count'] > 0) & (df['passenger_count'] < 6)]

In [None]:
# Replace payment_type with labels
# Assign the result back: an in-place replace on a selected column may not update df
df['payment_type'] = df['payment_type'].replace([1, 2], ['Card', 'Cash'])

In [None]:
# Remove invalid (<=0) values
df = df[(df['fare_amount'] > 0) & (df['trip_distance'] > 0) & (df['Duration'] > 0)]
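The filtering steps above can be tried on a small toy frame before running them on the full file. The following is a minimal sketch, not the notebook's own code: the helper name `clean_trips` and all row values are hypothetical, and it uses `isin([1, 2])` and `Series.map` as one possible way to express the same filters and labels.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook, values are made up
toy = pd.DataFrame({
    'passenger_count': [1, 2, 0, 7, 1, 3],
    'payment_type':    [1, 2, 1, 2, 4, 2],
    'fare_amount':     [12.5, 8.0, 10.0, 9.0, 7.5, -3.0],
    'trip_distance':   [2.1, 1.3, 1.0, 1.2, 0.9, 0.5],
    'Duration':        [14.0, 9.5, 6.0, 7.0, 5.0, 4.0],
})

def clean_trips(frame):
    """Apply the validity filters and label the payment types (illustrative helper)."""
    mask = (
        frame['payment_type'].isin([1, 2])          # card or cash only
        & frame['passenger_count'].between(1, 5)    # 1 to 5 passengers, inclusive
        & (frame['fare_amount'] > 0)
        & (frame['trip_distance'] > 0)
        & (frame['Duration'] > 0)
    )
    out = frame.loc[mask].copy()                    # copy so the assignment below is safe
    out['payment_type'] = out['payment_type'].map({1: 'Card', 2: 'Cash'})
    return out

cleaned = clean_trips(toy)
print(cleaned)
```

Only the first two toy rows pass every filter; the others fail on passenger count, payment type, or a non-positive fare.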

## 📊 Outlier Removal using IQR

In [None]:
# Remove outliers using IQR method
for col in ['fare_amount', 'trip_distance', 'Duration']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    IQR = q3 - q1
    lower = q1 - 1.5 * IQR
    upper = q3 + 1.5 * IQR
    df = df[(df[col] >= lower) & (df[col] <= upper)]
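The same IQR rule can be wrapped in a small function and checked on toy data. This is a sketch under assumptions: the function name `remove_iqr_outliers`, the parameter `k`, and the toy fares are introduced here for illustration. Like the loop above, it filters one column at a time, so the quartiles of later columns are computed on rows that survived the earlier columns.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR], one column at a time."""
    out = frame
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends, matching >= lower and <= upper
        out = out[out[col].between(q1 - k * iqr, q3 + k * iqr)]
    return out

# Hypothetical fares with one obvious outlier
toy = pd.DataFrame({'fare_amount': [5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})
trimmed = remove_iqr_outliers(toy, ['fare_amount'])
print(trimmed['fare_amount'].tolist())  # [5.0, 6.0, 7.0, 8.0, 9.0]
```

For these toy values the quartiles are 6.25 and 8.75, giving bounds of 2.5 and 12.5, so only the fare of 100.0 is dropped.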

## 📈 Visualization

In [None]:
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.title("Fare amount by payment type")
plt.hist(df[df['payment_type'] == 'Card']['fare_amount'], bins=20, edgecolor='k', alpha=0.7, label='Card')
plt.hist(df[df['payment_type'] == 'Cash']['fare_amount'], bins=20, edgecolor='k', alpha=0.7, label='Cash')
plt.legend()

plt.subplot(1, 2, 2)
plt.title("Trip distance by payment type")
plt.hist(df[df['payment_type'] == 'Card']['trip_distance'], bins=20, edgecolor='k', alpha=0.7, label='Card')
plt.hist(df[df['payment_type'] == 'Cash']['trip_distance'], bins=20, edgecolor='k', alpha=0.7, label='Cash')
plt.legend()
plt.tight_layout()
plt.show()

## 🧪 Hypothesis Testing
**H₀ (Null Hypothesis):** The average fare is the same for Card and Cash payments.

**H₁ (Alternative Hypothesis):** The average fare differs between Card and Cash payments.
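
In symbols, the two-sided test can be written as follows. The notation is introduced here for illustration: $\mu$ denotes a population mean fare, $\bar{x}$ a sample mean, $s^2$ a sample variance, and $n$ a sample size. The statistic shown is the standard one for Welch's t-test, which does not assume equal variances in the two groups.

$$
H_0:\ \mu_{\text{Card}} = \mu_{\text{Cash}}
\qquad \text{versus} \qquad
H_1:\ \mu_{\text{Card}} \neq \mu_{\text{Cash}}
$$

$$
t = \frac{\bar{x}_{\text{Card}} - \bar{x}_{\text{Cash}}}
         {\sqrt{\dfrac{s_{\text{Card}}^2}{n_{\text{Card}}} + \dfrac{s_{\text{Cash}}^2}{n_{\text{Cash}}}}}
$$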

In [None]:
# Split data
card_sample = df[df['payment_type'] == 'Card']['fare_amount']
cash_sample = df[df['payment_type'] == 'Cash']['fare_amount']

# Perform Welch's t-test
t_stats, p_value = stats.ttest_ind(a=card_sample, b=cash_sample, equal_var=False)
print('T-Statistic:', t_stats)
print('P-Value:', p_value)
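
To make the `equal_var=False` option less of a black box, the sketch below computes Welch's statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The two samples are synthetic draws from normal distributions with arbitrarily chosen parameters; they stand in for the fare samples and are not real taxi data. The helper name `welch_t` is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two fare samples (assumed parameters, not real data)
a = rng.normal(loc=13.0, scale=6.0, size=500)
b = rng.normal(loc=12.0, scale=5.0, size=300)

def welch_t(x, y):
    """Return Welch's t statistic, degrees of freedom, and two-sided p-value."""
    # Per-group variance of the sample mean: s^2 / n
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

t_manual, dof, p_manual = welch_t(a, b)
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print('manual:', t_manual, p_manual)
print('scipy: ', t_scipy, p_scipy)
```

The two pairs of numbers should agree up to floating-point error, which confirms that the SciPy call with `equal_var=False` is performing Welch's test rather than the pooled-variance Student's t-test.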

## ✅ Conclusion
Because the p-value is below the 0.05 significance level, we reject the null hypothesis.
There is a statistically significant difference in average fare between Card and Cash payments.