# Maximizing Revenue for Taxi Cab Drivers through Payment Type Analysis

## Problem Statement

In the fast-paced taxi booking sector, maximizing driver revenue is essential for long-term success and driver satisfaction. Our goal is to use data-driven insights to maximize revenue streams for taxi drivers. Our research aims to determine whether payment method has an impact on fare pricing, and to understand the customer group that uses the preferred mode of payment.

## Research Question

Is there a relationship between the total fare amount and the payment type, and how can we nudge customers towards payment methods that generate higher revenue for drivers without negatively impacting customer experience?

## Objective

To examine the relationship, if any, between total fare and preferred method of payment.
We use descriptive statistics and hypothesis testing to extract useful information that can help drivers earn more. In particular, we want to find out whether fares differ significantly between customers who pay by credit card and those who pay in cash.

# Loading Packages

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

from IPython.display import display, Markdown

import warnings

warnings.filterwarnings('ignore')

# Understanding the Data

[Information about the Data]

# Data Loading

In [None]:
'''

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

d = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        d[filename] = pd.read_csv(os.path.join(dirname, filename))

df = d['yellow_tripdata_2015-01.csv']

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

'''

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if filename == 'yellow_tripdata_2015-01.csv':
            df = pd.read_csv(os.path.join(dirname, filename))
        else:
            continue
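
Monthly trip files can be large, so reading only the columns the analysis keeps can reduce memory use. A minimal sketch, assuming the column names used later in this notebook; the two-row CSV text is invented so the snippet runs without the Kaggle input directory:

```python
import io
import pandas as pd

# Invented sample rows; the column names follow the ones used in this notebook
csv_text = (
    "VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,"
    "passenger_count,trip_distance,payment_type,fare_amount\n"
    "2,2015-01-15 19:05:39,2015-01-15 19:23:42,1,1.59,1,12.0\n"
    "1,2015-01-10 20:33:38,2015-01-10 20:53:28,1,3.30,2,14.5\n"
)

needed = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime",
    "passenger_count", "trip_distance", "payment_type", "fare_amount",
]

# usecols skips unneeded columns; parse_dates converts the timestamps on load
toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=needed,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the path found in the loop above.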

# Initial Examination

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

# Data Collection

In [None]:
df["tpep_pickup_datetime"] = pd.to_datetime(df["tpep_pickup_datetime"])
df["tpep_dropoff_datetime"] = pd.to_datetime(df["tpep_dropoff_datetime"])
df.info()

In [None]:
df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
df['duration'] = df['duration'].dt.total_seconds()/60
df
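
To make the duration step concrete, here is a minimal sketch on a two-row toy frame (the timestamps are invented for illustration). Subtracting two datetime columns yields a timedelta column, which `.dt.total_seconds()` converts to seconds before dividing by 60:

```python
import pandas as pd

# Toy data: timestamps are invented for illustration only
toy = pd.DataFrame({
    "tpep_pickup_datetime": ["2015-01-01 10:00:00", "2015-01-01 23:50:00"],
    "tpep_dropoff_datetime": ["2015-01-01 10:12:30", "2015-01-02 00:05:00"],
})
for col in ["tpep_pickup_datetime", "tpep_dropoff_datetime"]:
    toy[col] = pd.to_datetime(toy[col])

# Trip duration in minutes; the second trip crosses midnight and is still handled
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

durations = toy["duration"].tolist()
```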

In [None]:
df = df[['passenger_count', 'trip_distance', 'payment_type', 'fare_amount', 'duration']]
df

# Data Cleaning

In [None]:
# Checking for missing values

df.isnull().sum()

In [None]:
# Checking for right datatypes

df.dtypes

In [None]:
# Checking for duplicate values

df.duplicated().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
df.shape

# Univariate Analysis

In [None]:
df

In [None]:
# Looking at categorical data

df['passenger_count'].value_counts(normalize = True)

In [None]:
df = df[(df['passenger_count']>0) & (df['passenger_count']<7)]
df['passenger_count'].value_counts(normalize = True)

In [None]:
df['payment_type'].value_counts(normalize = True)

In [None]:
df = df[df['payment_type']<3]
df['payment_type'].value_counts(normalize = True)

In [None]:
# Assign the result back: an in-place replace on a selected column may not update df
df['payment_type'] = df['payment_type'].replace({1: 'Card', 2: 'Cash'})
df
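
In the TLC data dictionary, payment code 1 is credit card and code 2 is cash; the remaining codes cover cases such as no charge and disputes. A small sketch of the filter-then-label step on an invented series of codes, using `Series.map` with a dictionary:

```python
import pandas as pd

# Invented payment codes for illustration
codes = pd.Series([1, 2, 1, 3, 2, 4])

# Keep only card (1) and cash (2), then attach readable labels
kept = codes[codes < 3]
labels = kept.map({1: "Card", 2: "Cash"})
```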

In [None]:
# Looking at numerical data

df.describe()

In [None]:
# taking positive values only
df = df[df['trip_distance']>0]
df = df[df['fare_amount']>0]
df = df[df['duration']>0]
df.describe()

In [None]:
# outlier handling

plt.boxplot(df['fare_amount'])

In [None]:
for col in ['trip_distance', 'fare_amount', 'duration']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3-q1

    lowerbound = q1 - 1.5 * iqr
    upperbound = q3 + 1.5 * iqr
    # Keep only the rows that lie between the two bounds
    df = df[(df[col]>lowerbound) & (df[col]<upperbound)]
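
The conventional IQR rule keeps values that lie between the lower and upper fences. A minimal sketch of that rule as a reusable helper; the function name `iqr_filter` and the toy fares are hypothetical, and `between` is inclusive at both ends, which is a small simplification:

```python
import pandas as pd

def iqr_filter(frame, columns, k=1.5):
    """Keep rows whose values lie inside [Q1 - k*IQR, Q3 + k*IQR] for each column in turn."""
    out = frame
    for col in columns:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        out = out[out[col].between(q1 - k * iqr, q3 + k * iqr)]
    return out

# Toy fares with one obvious outlier (invented values)
toy = pd.DataFrame({"fare_amount": [5.0, 6.0, 7.0, 8.0, 9.0, 200.0]})
filtered = iqr_filter(toy, ["fare_amount"])
```

As in the loop above, the columns are filtered sequentially, so each column's quartiles are computed on the rows that survived the previous column.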

In [None]:
df['trip_distance'].describe()

# Multivariate Analysis

In [None]:
df[df['payment_type']=='Card']

# Objective (i): 
Examine the relationship between total fare and method of payment.
We use descriptive statistics here to extract useful information that can help drivers earn more. In particular, we want to find out whether fares differ noticeably between customers who pay by credit card and those who pay in cash.

In [None]:
df.groupby('payment_type').agg(
    {
    'fare_amount': ['mean', 'std'],
    'trip_distance': ['mean', 'std']
    })

In [None]:
plt.title("Preference of Payment Type(No of rides)")
plt.pie([
            df[df['payment_type']=='Cash']['payment_type'].count(),
            df[df['payment_type']=='Card']['payment_type'].count()],
        labels = ['Cash', 'Card'], autopct='%1.1f%%', startangle=90, 
        colors = ['#FF5733', '#C70039'], explode = [0, 0.05],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

In [None]:
plt.title("Preference of Payment Type(Fare Amount)")
plt.pie([
            df[df['payment_type']=='Cash']['fare_amount'].sum(),
            df[df['payment_type']=='Card']['fare_amount'].sum()],
        labels = ['Cash', 'Card'], autopct='%1.1f%%', startangle=90, 
        colors = ['#FF5733', '#C70039'], explode = [0, 0.05],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

In [None]:
plt.figure()
plt.title('fare_amount')
plt.hist(df[df['payment_type']=='Card']['fare_amount'], bins = 25, label = 'Card', edgecolor = 'k', color = '#C70039')
plt.hist(df[df['payment_type']=='Cash']['fare_amount'], bins = 25, label = 'Cash', edgecolor = 'k', color = '#FF5733')
plt.legend()
plt.show()

# Objective (ii):
Examine the relationship between preferred mode of payment and factors other than total fare, such as customer count, trip duration and trip distance.

# Payment Type and Customer Count

In [None]:
p_count = df.groupby(['payment_type', 'passenger_count'])[['passenger_count']].count()
p_count.rename(columns = {'passenger_count': 'count'}, inplace=True)
p_count['perc'] = (p_count['count']/p_count['count'].sum())*100
p_count.reset_index(inplace=True)
p_count

In [None]:
new_df = pd.DataFrame(
    columns = ['payment_type', '1', '2', '3', '4', '5', '6'])
new_df['payment_type'] = ['Cash', 'Card']
new_df.iloc[0,1:] = p_count.iloc[6:,-1]
new_df.iloc[1,1:] = p_count.iloc[0:6,-1]
new_df['zero'] = [0, 0]

In [None]:
new_df
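
The same payment-type by passenger-count table can also be built in one step with `pd.crosstab`, which avoids relying on row positions. This is a sketch of an alternative on invented data, not a change to the cells above; `normalize="all"` expresses each cell as a share of all rides:

```python
import pandas as pd

# Invented rides for illustration
toy = pd.DataFrame({
    "payment_type": ["Card", "Card", "Card", "Cash", "Cash", "Card", "Cash", "Card"],
    "passenger_count": [1, 1, 2, 1, 2, 1, 1, 2],
})

# Rows: payment type, columns: passenger count, values: percent of all rides
share = pd.crosstab(
    toy["payment_type"], toy["passenger_count"], normalize="all"
) * 100
```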

In [None]:
plt.title('Payment Type and Customer Count Breakup')
plt.xlabel('Share of all rides (%)')
plt.yticks([0, 1], new_df['payment_type'])

color_list = ['#DAF7A6', '#FFC300', '#FF5733', '#C70039', '#900C3F', '#581845']

passenger_count_sum = new_df['zero']

for segment_color in color_list:
    passenger_count = new_df[str((color_list.index(segment_color))+1)]
    plt.barh(y = [0, 1], width = passenger_count,
             left = passenger_count_sum,
             color = segment_color, 
             label = str(color_list.index(segment_color)+1))
    passenger_count_sum = passenger_count_sum + passenger_count
    

plt.legend()
plt.show()

In [None]:
plt.title("Card by passenger count")
plt.pie(
        # sort_index() orders the counts by passenger count so they match the labels
        df[df['payment_type']=='Card']['passenger_count'].value_counts().sort_index(),
        labels = [1, 2, 3, 4, 5, 6], startangle=90, 
        colors = ['#DAF7A6', '#FFC300', '#FF5733', '#C70039', '#900C3F', '#581845'],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

plt.title("Cash by passenger count")
plt.pie(
        # sort_index() orders the counts by passenger count so they match the labels
        df[df['payment_type']=='Cash']['passenger_count'].value_counts().sort_index(),
        labels = [1, 2, 3, 4, 5, 6], startangle=90, 
        colors = ['#DAF7A6', '#FFC300', '#FF5733', '#C70039', '#900C3F', '#581845'],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

# Payment Type and Trip Distance

In [None]:
plt.figure()
plt.title('trip_distance')
plt.hist(df[df['payment_type']=='Card']['trip_distance'], bins = 25, label = 'Card', edgecolor = 'k', color = '#C70039')
plt.hist(df[df['payment_type']=='Cash']['trip_distance'], bins = 25, label = 'Cash', edgecolor = 'k', color = '#FF5733')
plt.legend()
plt.show()

In [None]:
plt.title("Preference of Payment Type(Trip Distance)")
plt.pie([
            df[df['payment_type']=='Cash']['trip_distance'].sum(),
            df[df['payment_type']=='Card']['trip_distance'].sum()],
        labels = ['Cash', 'Card'], autopct='%1.1f%%', startangle=90, 
        colors = ['#FF5733', '#C70039'], explode = [0, 0.05],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

# Payment Type and Duration

In [None]:
plt.figure()
plt.title('duration')
plt.hist(df[(df['payment_type']=='Card')]['duration'], bins = 25, label = 'Card', edgecolor = 'k', color = '#C70039')
plt.hist(df[(df['payment_type']=='Cash')]['duration'], bins = 25, label = 'Cash', edgecolor = 'k', color = '#FF5733')
plt.legend()
plt.show()

In [None]:
plt.title("Preference of Payment Type(Duration)")
plt.pie([
            df[df['payment_type']=='Cash']['duration'].sum(),
            df[df['payment_type']=='Card']['duration'].sum()],
        labels = ['Cash', 'Card'], autopct='%1.1f%%', startangle=90, 
        colors = ['#FF5733', '#C70039'], explode = [0, 0.05],
        wedgeprops = {
            'edgecolor': 'black',
            'linewidth': 2
        })
plt.show()

# A/B Testing

# TEST 1

1. Is there a statistically significant difference in average fares for those who pay through credit card vs those who pay in cash? How big is the difference with a 95% CI?

(i) In general

(ii) (a) for short trips, (b) for long trips

(iii) (a) passengers <= 4, (b) passengers > 4

(iv) 

(a) for short trips and passengers <= 4 (intracity, 4 seaters)

(b) for long trips and passengers <= 4 (intercity, 4 seaters)

(c) for short trips and passengers > 4 (intracity, large groups)

(d) for long trips and passengers > 4 (intercity, large groups)

2. Is the variance in fares smaller for those who pay by credit card than for those who pay in cash? By how much? (Do credit cards attract more high-ticket customers than cash?)

## SUBTEST 1

H_null: mu_credit_card_fares - mu_cash_fares = 0

H_alternate: mu_credit_card_fares - mu_cash_fares > 0

## SUBTEST 2

H_null: sigma_cash_fares / sigma_credit_card_fares = 1

H_alternate: sigma_cash_fares / sigma_credit_card_fares > 1
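
For Subtest 1, Welch's t statistic divides the difference in sample means by a standard error that does not assume equal variances. The sketch below computes it by hand on synthetic fares (the means, spreads and sample sizes are arbitrary assumptions) and checks it against `scipy.stats.ttest_ind` with `equal_var=False`:

```python
import numpy as np
import scipy.stats as st

# Synthetic fares; the distribution parameters are arbitrary assumptions
rng = np.random.default_rng(0)
card = rng.normal(loc=13.0, scale=6.0, size=500)
cash = rng.normal(loc=12.0, scale=5.0, size=400)

# Welch statistic: (mean_card - mean_cash) / sqrt(s_card^2/n_card + s_cash^2/n_cash)
se = np.sqrt(card.var(ddof=1) / card.size + cash.var(ddof=1) / cash.size)
t_manual = (card.mean() - cash.mean()) / se

# One-sided test matching H_alternate: mean card fare > mean cash fare
result = st.ttest_ind(card, cash, equal_var=False, alternative="greater")
```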

In [None]:
#checking the shape of the sample

st.probplot(df['fare_amount'], dist = 'norm', plot=plt)
plt.title("Q-Q PLOT")
plt.show()

In [None]:
# 1(i)

# Separating the groups (in general)
credit_fare = df[df['payment_type']=='Card']['fare_amount']
cash_fare = df[df['payment_type']=='Cash']['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare, cash_fare, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

# Print the results
print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

# Interpretation
if result.pvalue < alpha:
    print('Credit fare exceeds cash fare')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence')
else:
    print('Credit fare does not exceed cash fare')


# 2(i)

# Using Levene's test
levene_statistic, levene_p_value = st.levene(cash_fare, credit_fare)

# Print the results
print("Levene's Statistic:", levene_statistic)
print("p-value:", levene_p_value)

# Interpretation
alpha = 0.05
if levene_p_value < alpha:
    print("Reject the null hypothesis: variances are significantly different.")
else:
    print("Fail to reject the null hypothesis: variances are not significantly different.")

print("Variance of credit fares: ", np.var(credit_fare, ddof=1))
print("Variance of cash fares: ", np.var(cash_fare, ddof=1))
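
Levene's test checks only whether the variances are equal, so the direction and size of the difference have to come from the sample variances themselves. A minimal sketch on synthetic data (the spreads are arbitrary assumptions) that pairs the test with the variance ratio from Subtest 2:

```python
import numpy as np
import scipy.stats as st

# Synthetic fares with clearly different spreads (arbitrary assumptions)
rng = np.random.default_rng(1)
cash = rng.normal(loc=12.0, scale=3.0, size=500)
card = rng.normal(loc=12.0, scale=1.0, size=500)

# Levene's test for equality of variances (SciPy centers on the median by default)
stat, p_value = st.levene(cash, card)

# Ratio of sample variances: values above 1 mean cash fares vary more
ratio = np.var(cash, ddof=1) / np.var(card, ddof=1)
```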

In [None]:
# Set the style of seaborn
sns.set(style='whitegrid')
# Create a figure
plt.figure(figsize=(10, 6))
# Plot the histogram
sns.histplot(df['trip_distance'], bins=30, kde=True, stat='density', color='blue', alpha=0.6)
# Add labels and title
plt.title('Distribution of Trip Distance')
plt.xlabel('Trip distance')
plt.ylabel('Density')
# Show the plot
plt.show()

print('mean: ', df['trip_distance'].mean())

In [None]:
# 1(ii)a

# Separating the groups (for short distances dist<=30)
credit_fare_short = df[(df['payment_type']=='Card') & (df['trip_distance']<=30)]['fare_amount']
cash_fare_short = df[(df['payment_type']=='Cash') & (df['trip_distance']<=30)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_short, cash_fare_short, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for short distances')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for short distance(<=30)')
else:
    print('Credit fare does not exceed cash fare for short distances')

In [None]:
# 1(ii)b

# Separating the groups (for long distances dist>30)
credit_fare_long = df[(df['payment_type']=='Card') & (df['trip_distance']>30)]['fare_amount']
cash_fare_long = df[(df['payment_type']=='Cash') & (df['trip_distance']>30)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_long, cash_fare_long, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for long distances')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for long distance(>30)')
else:
    print('Credit fare does not exceed cash fare for long distances')


In [None]:
# 1(iii)a

# Separating the groups (for passenger <= 4)
credit_fare_small = df[(df['payment_type']=='Card') & (df['passenger_count']<=4)]['fare_amount']
cash_fare_small = df[(df['payment_type']=='Cash') & (df['passenger_count']<=4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_small, cash_fare_small, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for 4 seaters')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for 4 seaters')
else:
    print('Credit fare does not exceed cash fare for 4 seaters')


In [None]:
# 1(iii)b

# Separating the groups (for passenger > 4)
credit_fare_large = df[(df['payment_type']=='Card') & (df['passenger_count']>4)]['fare_amount']
cash_fare_large = df[(df['payment_type']=='Cash') & (df['passenger_count']>4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_large, cash_fare_large, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for large groups')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for large groups')
else:
    print('Credit fare does not exceed cash fare for large groups')
    

In [None]:
# 1(iv)a

# Separating the groups (for short distance and passenger <= 4)
credit_fare_short_small = df[(df['payment_type']=='Card') & (df['trip_distance']<=30) & (df['passenger_count']<=4)]['fare_amount']
cash_fare_short_small = df[(df['payment_type']=='Cash') & (df['trip_distance']<=30) & (df['passenger_count']<=4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_short_small, cash_fare_short_small, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for short distance travel in 4 seaters')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for short distance travel in 4 seaters')
else:
    print('Credit fare does not exceed cash fare for short distance travel in 4 seaters')
    

In [None]:
# 1(iv)b

# Separating the groups (for long distance and passenger <= 4)
credit_fare_long_small = df[(df['payment_type']=='Card') & (df['trip_distance']>30) & (df['passenger_count']<=4)]['fare_amount']
cash_fare_long_small = df[(df['payment_type']=='Cash') & (df['trip_distance']>30) & (df['passenger_count']<=4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_long_small, cash_fare_long_small, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for long distance travel in 4 seaters')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for long distance travel in 4 seaters')
else:
    print('Credit fare does not exceed cash fare for long distance travel in 4 seaters')
    

In [None]:
# 1(iv)c

# Separating the groups (for short distance and passenger > 4)
credit_fare_short_large = df[(df['payment_type']=='Card') & (df['trip_distance']<=30) & (df['passenger_count']>4)]['fare_amount']
cash_fare_short_large = df[(df['payment_type']=='Cash') & (df['trip_distance']<=30) & (df['passenger_count']>4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_short_large, cash_fare_short_large, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for short distance travel in large groups')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for short distance travel in large groups')
else:
    print('Credit fare does not exceed cash fare for short distance travel in large groups')
    

In [None]:
# 1(iv)d

# Separating the groups (for long distance and passenger > 4)
credit_fare_long_large = df[(df['payment_type']=='Card') & (df['trip_distance']>30) & (df['passenger_count']>4)]['fare_amount']
cash_fare_long_large = df[(df['payment_type']=='Cash') & (df['trip_distance']>30) & (df['passenger_count']>4)]['fare_amount']

# Performing Welch's t-test (two-sample, one-sided: card > cash)
result = st.ttest_ind(credit_fare_long_large, cash_fare_long_large, 
                      equal_var=False, alternative='greater')

alpha = 0.05
ci = result.confidence_interval(confidence_level=(1-alpha))

print(f'T statistic: {result.statistic}')
print(f'p value: {result.pvalue}')

if result.pvalue < alpha:
    print('Credit fare exceeds cash fare for long distance travel in large groups')
    print(f'Credit card fares exceed cash fares by at least ${ci.low} with {(1-alpha)*100}% confidence for long distance travel in large groups')
else:
    print('Credit fare does not exceed cash fare for long distance travel in large groups')
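
The subgroup cells above repeat the same steps with different filters, so they could be folded into one helper. The sketch below is one possible shape, not the notebook's method: the name `compare_fares`, its return format and the synthetic data are all hypothetical. It also returns `None` when a subgroup has too few rides to test, which can happen for narrow filters.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

def compare_fares(frame, mask=None, alpha=0.05):
    """One-sided Welch t-test (card > cash) on fare_amount for an optional row subset."""
    subset = frame if mask is None else frame[mask]
    card = subset.loc[subset["payment_type"] == "Card", "fare_amount"]
    cash = subset.loc[subset["payment_type"] == "Cash", "fare_amount"]
    if len(card) < 2 or len(cash) < 2:
        return None  # too few rides in one group to run the test
    result = st.ttest_ind(card, cash, equal_var=False, alternative="greater")
    return {
        "n_card": len(card),
        "n_cash": len(cash),
        "t": result.statistic,
        "p": result.pvalue,
        "reject_null": bool(result.pvalue < alpha),
    }

# Synthetic rides: card fares are drawn with a higher mean on purpose
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "payment_type": ["Card"] * 300 + ["Cash"] * 300,
    "passenger_count": rng.integers(1, 7, size=600),
    "fare_amount": np.concatenate([rng.normal(14, 3, 300), rng.normal(10, 3, 300)]),
})

overall = compare_fares(toy)
large_groups = compare_fares(toy, toy["passenger_count"] > 4)
no_rides = compare_fares(toy, toy["passenger_count"] > 6)  # empty subgroup
```

Each of the eight subgroup tests then becomes a single call with a different boolean mask.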
    

In [None]:
df

# Conclusion

## Card-paying customers pay higher fares than cash-paying customers in general.

## Most card-paying customers travel alone and over long distances.