# I would really appreciate your support by leaving an upvote if you liked my work :)


# **The Client**
XYZ is a private firm in the US. Due to remarkable growth in the cab industry over the last few years and the presence of multiple key players in the market, it is planning an investment in the cab industry. As per its Go-to-Market (G2M) strategy, it wants to understand the market before making a final decision.

# **Project delivery:**
You have been provided with multiple data sets that contain information on 2 cab companies. Each file (data set) represents a different aspect of the customer profile. XYZ is interested in using your actionable insights to help it identify the right company for its investment.

The outcome of your delivery will be a presentation to XYZ’s Executive team. This presentation will be judged based on the visuals provided, the quality of your analysis and the value of your recommendations and insights.

# **Data Set:**
You have been provided with 4 individual data sets. The data covers the period from 02/01/2016 to 31/12/2018.

The following datasets are provided for the analysis:

**Cab_Data.csv** – this file includes transaction details for the 2 cab companies.

**Customer_ID.csv** – this is a mapping table that contains a unique identifier linking each customer to their demographic details.

**Transaction_ID.csv** – this is a mapping table that contains the transaction-to-customer mapping and the payment mode.

**City.csv** – this file contains a list of US cities, their population and number of cab users.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import scikitplot as skplt
import seaborn as sns
sns.set()

from sklearn import metrics
from sklearn.model_selection import cross_validate

import warnings
warnings.filterwarnings('ignore')

plt.rcParams['axes.labelsize'] = 15
plt.rcParams['axes.titlesize'] = 20

In [None]:
#importing datasets

cab_df = pd.read_csv("../input/cab-invest-firm-datasets/Cab_Data - Copy.csv")
cab_df.info()

In [None]:
cab_df.isna().sum()

There are no missing values in this dataset.

In [None]:
cab_df.head()

The `Date of Travel` column is stored in serial-number format. Extracting date information from this column.

In [None]:
cab_df['Date of Travel'] = pd.to_datetime(cab_df['Date of Travel'])
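If `Date of Travel` really holds Excel-style serial day numbers, calling `pd.to_datetime` without arguments would interpret the integers as nanoseconds since 1970. Below is a minimal sketch of an explicit conversion, assuming the serials follow the Excel 1900 date system (serial 0 = 1899-12-30). The helper name `excel_serial_to_datetime` and the sample serials are illustrative and are not read from the dataset.

```python
import pandas as pd

def excel_serial_to_datetime(serials):
    """Convert Excel-style serial day numbers to pandas timestamps.

    Assumption: Excel 1900 date system, where serial 0 maps to 1899-12-30.
    """
    return pd.to_datetime(serials, unit="D", origin="1899-12-30")

# Illustrative serial numbers (hypothetical, not read from Cab_Data.csv)
sample = pd.Series([42371, 43465])
converted = excel_serial_to_datetime(sample)
print(converted.dt.strftime("%Y-%m-%d").tolist())
```

If the column already contains date strings, the plain `pd.to_datetime` call is sufficient and this conversion is not needed.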

In [None]:
cab_df.head()

In [None]:
#Import city dataset
city_df = pd.read_csv('../input/cab-invest-firm-datasets/City.csv')
city_df

In [None]:
#checking if Cities in city_df are in cab_df

np.setdiff1d(city_df.City, cab_df.City) #No san Francisco in cab_df dataset

In [None]:
# importing Transaction ID dataset

trans_df = pd.read_csv('../input/cab-invest-firm-datasets/Transaction_ID.csv')
trans_df.info()

In [None]:
trans_df.head()

In [None]:
#Checking for Transaction ID's not in cab_df dataset

len(np.setdiff1d(trans_df['Transaction ID'], cab_df['Transaction ID']))

There are 80,706 transaction ID's in trans_df that are not present in the cab_df dataset. These will be dropped when joining trans_df with the cab_df dataset.

In [None]:
df1 = pd.merge(cab_df, trans_df, on = 'Transaction ID')
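The merge relies on `pd.merge` defaulting to an inner join, which silently drops unmatched rows. The sketch below shows one way to audit what gets dropped, using `indicator=True` on small stand-in frames. The frames `cab_toy` and `trans_toy` and all of their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for cab_df and trans_df (hypothetical values)
cab_toy = pd.DataFrame({
    "Transaction ID": [1, 2, 3],
    "Company": ["Pink Cab", "Yellow Cab", "Pink Cab"],
})
trans_toy = pd.DataFrame({
    "Transaction ID": [1, 2, 3, 4, 5],
    "Payment_Mode": ["Card", "Cash", "Card", "Cash", "Card"],
})

# An outer join with indicator=True labels which table each row came from
audit = pd.merge(cab_toy, trans_toy, on="Transaction ID", how="outer", indicator=True)
merge_counts = audit["_merge"].value_counts().to_dict()
print(merge_counts)

# The default inner join keeps only the transactions present in both tables
inner = pd.merge(cab_toy, trans_toy, on="Transaction ID")
print(len(inner))
```

The `right_only` count plays the same role as the `np.setdiff1d` check above, and it has the advantage of being computed by the join itself.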

In [None]:
# importing customer ID df

cust_df = pd.read_csv('../input/cab-invest-firm-datasets/Customer_ID.csv')
cust_df.head()

In [None]:
cust_df.info()

In [None]:
# Checking for Customer ID's not in df1 dataset

len(np.setdiff1d(cust_df['Customer ID'], df1['Customer ID']))

There are 3023 Customer ID's in cust_df that are not in the df1 dataset. These will be dropped when cust_df is joined with the df1 dataset.

In [None]:
full_df = pd.merge(df1, cust_df, on = 'Customer ID')
full_df.head()

In [None]:
full_df.info()

In [None]:
#Checking for NA's

full_df.isna().sum()

We have joined all datasets and created a master dataset comprising columns from all of them.

Next, splitting the `City` column into city and state columns. Fortunately, a list of all US state abbreviations is already available online in list format: https://snipplr.com/view/50728/list-of-us-state-abbreviations

In [None]:
%%time
US_States = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
]

state = []
city = []

for i in range(len(full_df)):
    if full_df.City[i].split()[~0] in US_States:
        city.append(full_df.City[i].split(f' {full_df.City[i].split()[~0]}')[0])
        state.append(full_df.City[i].split()[~0])
    else:
        city.append(full_df.City[i])
        state.append(np.nan)
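The row-by-row loop above can be slow on a few hundred thousand rows. As a possible alternative, here is a vectorized sketch of the same split using pandas string methods. The helper name `split_city_state` and the toy inputs are assumptions for illustration, and the sketch assumes at least one name contains more than one word.

```python
import pandas as pd

def split_city_state(names, states):
    """Split 'CITY NAME ST' strings into city and state columns.

    Names whose last token is not a known state code are kept whole,
    and their state is left missing.
    Assumption: at least one name has two or more words, so that the
    split produces two columns.
    """
    parts = names.str.rsplit(n=1, expand=True)   # column 0: leading text, column 1: last token
    has_state = parts[1].isin(states)
    city = names.where(~has_state, parts[0])     # keep the full name unless a state code was found
    state = parts[1].where(has_state)            # missing where no state code was found
    return pd.DataFrame({"City": city, "State": state})

# Toy inputs (hypothetical sample, short state list for brevity)
toy_names = pd.Series(["NEW YORK NY", "ATLANTA GA", "ORANGE COUNTY", "SILICON VALLEY"])
result = split_city_state(toy_names, ["NY", "GA", "CA"])
print(result)
```

On the real data the call would look like `split_city_state(full_df.City, US_States)`, with the two returned columns assigned back to `full_df`.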

In [None]:
full_df['City'] = city
full_df['State'] = state

full_df.head()

In [None]:
print(full_df.City.unique())
print()
full_df.State.unique()

In [None]:
full_df.isna().sum()

There are 12501 observations with a missing value in the `State` column. Checking which cities have missing states.

In [None]:
full_df[full_df.State.isna()]['City'].unique()

Since both Orange County and Silicon Valley are located in the state of California, I will impute the missing values in the `State` column for these cities as 'CA' (California).

In [None]:
#Filling only the State column, so that no other column can be affected
full_df['State'] = full_df['State'].fillna('CA')

In [None]:
full_df.isna().sum()

In [None]:
#Checking for duplicated observation or columns
full_df.duplicated().sum() #None

In [None]:
#Replacing spaces in columns names with _

full_df.columns = [col.strip().replace(' ', '_').lower() for col in full_df.columns]
print(full_df.columns)
full_df.rename(columns = {'income_(usd/month)' : 'cust_income', 
                          'date_of_travel' : 'travel_date'}, inplace = True)

In [None]:
#Sorting the data based on date of travel and transaction ID.

full_df.sort_values(['travel_date', 'transaction_id'], ignore_index=True, inplace = True)

In [None]:
full_df

In [None]:
#checking for any duplicated observations

print(cust_df.duplicated().sum())
print(trans_df.duplicated().sum())
print(cab_df.duplicated().sum())

#None

##### Now the dataset is complete and ready for EDA

In [None]:
# full_df.to_csv('full_df.csv', index = False)

# EDA

In [None]:
city_df.info()

Both the `Population` and `Users` columns are of object data type. Converting these columns to numeric.

In [None]:
city_df.Population[0], city_df.Users[0]
#There are commas inside figures in population and users.

In [None]:
#Removing the thousands separators with vectorized string methods
city_df.Population = city_df.Population.str.replace(",", "")
city_df.Users = city_df.Users.str.replace(",", "")

In [None]:
city_df

In [None]:
city_df.Population[0]

In [None]:
city_df.Population = city_df.Population.astype('int64')
city_df.Users = city_df.Users.astype('int64')
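The comma removal and type conversion above can also be done at load time. The following is a sketch that uses the `thousands` parameter of `pd.read_csv` on a small in-memory file. The rows are made up, and the sketch assumes the quoted numbers contain no stray whitespace; if they do, the manual cleaning above is still needed.

```python
import io
import pandas as pd

# Hypothetical two-row extract in the same shape as City.csv
raw = io.StringIO(
    'City,Population,Users\n'
    '"CITY A ST","1,234,567","12,345"\n'
    '"CITY B ST","987,654","6,789"\n'
)

# thousands="," tells the parser to treat commas inside numbers as separators
city_toy = pd.read_csv(raw, thousands=",")
print(city_toy.dtypes)
```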

In [None]:
city_df.info()

In [None]:
US_States = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
]

state = []
city = []

for i in range(len(city_df)):
    if city_df.City[i].split()[~0] in US_States:
        city.append(city_df.City[i].split(f' {city_df.City[i].split()[~0]}')[0])
        state.append(city_df.City[i].split()[~0])
    else:
        city.append(city_df.City[i])
        state.append(np.nan)
        
city_df.City = city
city_df['state'] = state

In [None]:
city_df

In [None]:
#Filling only the state column, so that no other column can be affected
city_df['state'] = city_df['state'].fillna('CA')

### Visualizing Demographics

I will visualize the total population of both users and non-users of cab services by state.

In [None]:
city_df.groupby(['state', 'City']).mean()[['Population', 'Users']]

In [None]:
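#sum(level = 0) is not available in recent pandas versions, so the state level is grouped explicitly
demograph = city_df.groupby(['state', 'City'])[['Population', 'Users']].mean().groupby(level = 0).sum()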

In [None]:
demograph['non-users'] = demograph['Population'] - demograph['Users']

In [None]:
demograph.sort_values(['Population', 'Users'])[['Users', 'non-users']].\
plot(kind = 'barh', stacked = True, figsize = (12, 7), title = "Cab Passengers by State");

plt.xlabel("Population in Millions");

The state of New York has the highest population, approximately 8.5 million, followed by California (around 5.3 million). These two states also have the highest number of users, followed by Illinois, DC and Massachusetts.

In [None]:
#Visualizing at City level

#Selecting the numeric columns before aggregating, so the text column 'state' is not averaged
city_demog = city_df.groupby('City')[['Population', 'Users']].mean()

city_demog['non-users'] = city_demog['Population'] - city_demog['Users']

city_demog.sort_values(['Population', 'Users'])[['Users', 'non-users']].\
plot(kind = 'barh', stacked = True, figsize = (10, 7), title = "Cab Passengers by City");

plt.xlabel("Population in Millions");

New York City has the highest number of users, followed by Chicago, Los Angeles and Washington.

In [None]:
#Gender
#Printing the share of each gender by label instead of relying on positional indexing
gender_share = full_df.gender.value_counts(normalize = True) * 100
for gender, share in gender_share.items():
    print(f'Proportion of Total {gender} Customers: {share:.2f} %')

fig, ax = plt.subplots(1,2, figsize = (10,5))

sns.countplot(x = full_df.gender, ax = ax[0]).set_title("Passengers by Gender");

pd.crosstab(index = full_df.company, columns = full_df.gender, normalize = 'index').\
plot(kind = 'bar', stacked = True, ax = ax[1], rot = 0, title = "Customer Gender Proportions");

The proportion of male passengers is higher than that of female passengers. However, both companies have the same distribution of passengers by gender.

# Visualizing Trips

In [None]:
#Extracting date info from travel_date

full_df['year'] = full_df['travel_date'].dt.year
full_df['month'] = full_df.travel_date.dt.month
full_df['date'] = full_df.travel_date.dt.day
full_df['day_of_week'] = full_df.travel_date.dt.dayofweek

In [None]:
full_df.info()

In [None]:
full_df

In [None]:
trip = full_df.groupby(['travel_date', 'company']).size().reset_index().rename(columns = {0 : 'count'})

#full_df.travel_date = full_df.travel_date.dt.date
trip

In [None]:
trip["day"] = trip.travel_date.dt.day_name()
trip

##### Distribution of number of trips

In [None]:
#Assigning Colors for companies
palette = ['#d965a4', '#ffc400']

In [None]:
fig, ax = plt.subplots(1,2, figsize = (13,5))
sns.histplot(x = 'count', hue = 'company', data = trip, kde = True, palette = palette, 
             bins = 100, ax = ax[0]).set_title("Daily Passenger Trips distribution");

sns.boxplot(x = 'count', y = 'company', data = trip, hue = 'company', palette = palette, 
            ax = ax[1]).set_title("Daily Passenger Trips distribution");

The plots above depict the distribution of daily trips for both cab companies. $\color{yellow}{\text{Yellow Cab}}$ has a higher median number of daily trips than $\color{violet}{\text{Pink Cab}}$. Both distributions are skewed to the right, meaning that days with a very high number of trips are rare.

In [None]:
plt.figure(figsize = (13,6))
sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);
plt.title('Daily Number of Trips');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');

The plot above displays the daily trips made by both cab companies from the beginning of 2016 until the end of 2018. There is clear seasonality at the weekly, monthly and yearly level for both cab companies, and both follow the same patterns. <br>
<br>
Within each year, there is a clear upward trend from month to month. At the new year, the daily trips drop back to their lowest level. From year to year, however, the overall level is almost constant.<br>
<br>
$\color{yellow}{\text{Yellow Cab}}$ makes significantly more trips on any given day than $\color{violet}{\text{Pink Cab}}$. The highest number of trips for both cab companies was recorded on 5th January 2018.

In [None]:
plt.figure(figsize = (13,6))
ax = sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);

plt.title('Weekly Seasonality (2016)');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');
plt.xlim(16801, 17167);

In [None]:
plt.figure(figsize = (13,6))
sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);

plt.title('Weekly Seasonality (2017)');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');
plt.xlim(17167, 17532);


In [None]:
plt.figure(figsize = (13,6))
sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);

plt.title('Weekly Seasonality (2018)');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');
plt.xlim(17532, 17897);
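The numeric limits passed to `plt.xlim` in the cells above are day counts since 1970-01-01, which is the default date unit in Matplotlib 3.3 and later. A short sketch that derives them from calendar dates, so the magic numbers can be checked (the helper name `days_since_epoch` is illustrative):

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def days_since_epoch(day):
    """Number of days between 1970-01-01 and `day`."""
    return (day - EPOCH).days

# First day of each year covered by the x-limits used in the plots
year_starts = {year: days_since_epoch(date(year, 1, 1)) for year in (2016, 2017, 2018, 2019)}
print(year_starts)
# {2016: 16801, 2017: 17167, 2018: 17532, 2019: 17897}
```

On a date axis, passing the dates themselves (for example `pd.Timestamp('2016-01-01')`) to `plt.xlim` should be an equivalent and more readable choice.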

For the year 2017, the seasonality pattern is different compared to both 2016 and 2018.

In [None]:
#Selecting range of months
#2016 to 2017

plt.figure(figsize = (13,6))

ax = sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);

plt.axvline(x = 17167, color = 'k', alpha = 0.8, linestyle = "--"); #To mark new year

for day, color in zip(['Friday', 'Saturday', 'Sunday'], ['red', 'blue', 'green']):
    trip.query(f"day == '{day}'")[['travel_date', 'count']].\
    plot.scatter(x = 'travel_date', y = 'count', ax = ax, label = f'{day}', color = color);


plt.title('Weekly Seasonality');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');
plt.xlim(17120, 17200);

In any given month of 2016, there is a weekly seasonality in which the number of trips is especially high on weekends (Saturday and Sunday).<br>
<br>
Interestingly, the seasonality starts to change in 2017. The number of rides increases on Friday, dips on Saturday and then increases again on Sunday. This pattern is observed for both cab companies. <br>

In [None]:
#2017 to 2018

plt.figure(figsize = (13,6))
ax = sns.lineplot(x = 'travel_date', y = 'count', data = trip, hue = 'company', 
             palette = palette);

plt.axvline(x = 17532, color = 'k', alpha = 0.8, linestyle = "--"); #To mark new year

for day, color in zip(['Friday', 'Saturday', 'Sunday'], ['red', 'blue', 'green']):
    trip.query(f"day == '{day}'")[['travel_date', 'count']].\
    plot.scatter(x = 'travel_date', y = 'count', ax = ax, label = f'{day}', color = color);

plt.title('Weekly Seasonality');
plt.xlabel('Date of Travel');
plt.ylabel('Number of Trips');
plt.xlim(17470, 17600);

From 2017 to 2018, the pattern changes once again. This time, the number of rides peaks on Friday and Saturday and then dips to a low on Sunday. <br>

Let's look closely at the number of trips on 5th January 2018.

In [None]:
daily_trip_city = full_df.groupby(['travel_date', 'city', 'company']).size().reset_index().\
                                                            rename(columns = {0 : 'count'})

daily_trip_city

In [None]:
sns.relplot(x = 'travel_date', y = 'count', row = 'city', hue = 'company', 
            palette = palette, data = daily_trip_city, kind = 'line', aspect = 2);

plt.xlim(17532, 17542);
plt.xticks(rotation = 45);
plt.tight_layout();

The highest spike in the number of rides can be seen in New York, followed by Chicago, Washington and Los Angeles. The reason for the spike on this particular day is unknown. There was a blizzard in New York during the first week of January 2018, which may be related. Unfortunately, I wasn't able to find any suitable weather datasets covering this time period for any of the states.

Next, I will visualize the data at a monthly level.

In [None]:
#Aggregating at a month level to visualize month seasonality

month_trip = full_df.groupby(['year', 'month', 'company']).size().reset_index().\
                                                    rename(columns = {0:'count'})
month_trip

In [None]:
month_trip['month_level'] = month_trip['year'].astype('str') + "-" + month_trip['month'].astype('str')
month_trip
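One caveat with string labels such as `2016-2` and `2016-10` is that they do not sort chronologically, which matters if the plotting call sorts the x values. The sketch below compares unpadded and zero-padded labels on a toy frame with hypothetical values. The column name `month_level_padded` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy year/month pairs (hypothetical), deliberately out of order
toy = pd.DataFrame({"year": [2016, 2016, 2016, 2017], "month": [10, 2, 12, 1]})

# Unpadded labels, built the same way as month_level in the cell above
toy["month_level"] = toy["year"].astype(str) + "-" + toy["month"].astype(str)

# Zero-padded labels sort lexicographically in chronological order
toy["month_level_padded"] = (
    toy["year"].astype(str) + "-" + toy["month"].astype(str).str.zfill(2)
)

print(sorted(toy["month_level"]))
# ['2016-10', '2016-12', '2016-2', '2017-1']
print(sorted(toy["month_level_padded"]))
# ['2016-02', '2016-10', '2016-12', '2017-01']
```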

In [None]:
plt.figure(figsize = (13,6))
ax = sns.lineplot(x = 'month_level', y = 'count', data = month_trip, hue = 'company', 
             palette = palette);

for month, name, color in zip([2,12], ['February', 'December'], ['red', 'blue']):
    month_trip.query(f"month == {month}")[['month_level', 'count']].\
    plot.scatter(x = 'month_level', y = 'count', ax = ax, label = f'{name}', color = color);

plt.xticks(rotation = 45)
plt.title('Monthly Trips');
plt.xlabel('Month of Travel');
plt.ylabel('Number of Trips');

When trips are aggregated at a monthly level, there is again a clear seasonality. In every year, the number of trips is lowest in February and highest in December. For both cab companies, there is a slight upward trend over the years.

##### Next, I will visualize monthly trips per city for both Cab companies

In [None]:
city_trips_month = full_df.groupby(['year', 'month', 'city', 'company']).size().\
                                    reset_index().rename(columns = {0:'count'})

city_trips_month['month_level'] = city_trips_month['year'].astype('str') + "_" + \
                                                city_trips_month['month'].astype('str')

city_trips_month

In [None]:
for i in city_trips_month.city.unique():
    plt.figure(figsize = (13,6))
    temp_df = city_trips_month.query(f"city == '{i}'")
    
    sns.lineplot(x = 'month_level', y = 'count', data = temp_df, hue = 'company', 
                 palette = palette);

    plt.title(f'Monthly Trips in {i}');
    plt.xlabel('Date of Travel');
    plt.ylabel('Number of Trips');
    plt.xticks(rotation = 45)

The plots above display the patterns at city level more clearly.
<br>
<br>
* $\color{yellow}{\text{Yellow Cab}}$ thrives in the following cities: **Atlanta, Austin, Boston, Chicago, Dallas, Denver, Los Angeles, Miami, New York, Orange County, Phoenix, Seattle, Silicon Valley, Tucson, Washington**.
<br>
<br>
* $\color{violet}{\text{Pink Cab}}$ thrives in the following cities: **Nashville and Sacramento**. 
<br>
<br>
In cities such as **Pittsburgh and San Diego**, both companies have almost the same number of rides.
<br>
<br>
Overall, $\color{yellow}{\text{Yellow Cab}}$ performs better in terms of the number of rides during the time period.

# Visualizing Trip information

#### What is the income made by the driver for each trip?

**Assumptions**

1) Due to the limited data available on the Internet, I will assume that the expenses for a trip only involve fuel charges. 

2) There isn't sufficient data on the Internet on the base fares per year for each city in the US. Base fares will be left out of the analysis.

In [None]:
full_df['profit'] = full_df.price_charged - full_df.cost_of_trip #operating income
full_df

In [None]:
full_df[['km_travelled', 'price_charged', 'cost_of_trip', 'profit']].describe().T

In [None]:
fig, axes = plt.subplots(2,2, figsize = (10,8), sharey = True)

for col, ax in zip(['km_travelled', 'price_charged', 'cost_of_trip', 'profit'], 
                   axes.flatten()):
    
    sns.boxplot(x = col, data = full_df, y = 'company', ax = ax, palette = palette);
    plt.tight_layout();

The plots above illustrate the distributions of the trip-related features. Distance traveled and cab expenses are roughly uniformly distributed. Only profit has a bell-shaped distribution, and it is skewed to the right. <br>
<br>
There are high outliers on the right side of both the profit and price charged columns. Both cab companies have the same median distance traveled. $\color{yellow}{\text{Yellow Cab}}$ has higher cab expenses overall. The median price charged by $\color{violet}{\text{Pink Cab}}$ is lower than that of its rival. The profit of $\color{yellow}{\text{Yellow Cab}}$ is significantly higher. 
<br>
<br>
Both cab companies have made some $\color{red}{\text{losses}}$, as is evident on the left side of the profit box plots. I will take a closer look at this in the next sections.

I have a hypothesis that the outliers in the price_charged variable might be due to cabs offering 'Premium' services, where the company offers trips in luxury or high-end vehicles. In order to test this hypothesis, I will cap the price range at the upper outlier boundary (third quartile plus 1.5 times the interquartile range) of each cab company's price_charged variable. <br>
<br>
According to my hypothesis, customers who call for premium cabs are richer and would use premium cabs to travel any distance.

In [None]:
#Interquartile range and upper boundary of price_charged for both cab companies

yc_IQR = full_df.query('company == "Yellow Cab"').price_charged.quantile(0.75) - \
         full_df.query('company == "Yellow Cab"').price_charged.quantile(0.25)

pc_IQR = full_df.query('company == "Pink Cab"').price_charged.quantile(0.75) - \
         full_df.query('company == "Pink Cab"').price_charged.quantile(0.25)

print(f'Yellow Cab IQR = {yc_IQR:.3f}')
print(f'Pink Cab IQR = {pc_IQR:.3f}')

distance = 1.5

yc_upper_limit = full_df.query('company == "Yellow Cab"').price_charged.quantile(0.75) + \
                 (yc_IQR * distance)

pc_upper_limit = full_df.query('company == "Pink Cab"').price_charged.quantile(0.75) + \
                 (pc_IQR * distance)

print()
print(f"Yellow Cab Upper Boundary = {yc_upper_limit:.3f}")
print(f"Pink Cab Upper Boundary = {pc_upper_limit:.3f}")
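The boundary computed above is the upper fence, Q3 + 1.5 × IQR. A compact sketch of the same calculation per company using `groupby`, shown on toy prices with hypothetical values (the helper name `upper_fence` is an assumption for illustration):

```python
import pandas as pd

def upper_fence(values, distance=1.5):
    """Upper outlier boundary: Q3 + distance * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return q3 + distance * (q3 - q1)

# Toy fares (hypothetical values) for two companies
toy = pd.DataFrame({
    "company": ["Pink Cab"] * 5 + ["Yellow Cab"] * 5,
    "price_charged": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# One fence per company, indexed by company name
fences = toy.groupby("company")["price_charged"].apply(upper_fence)
print(fences)
```

For the Pink Cab toy values, Q1 = 2 and Q3 = 4, so the fence is 4 + 1.5 × 2 = 7.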

In [None]:
#Creating new variable called 'is_premium'. Premium trips are marked as 1.

#Each trip is compared against the upper limit of its own company
upper_limit = np.where(full_df['company'] == 'Yellow Cab', yc_upper_limit, pc_upper_limit)

full_df['is_premium'] = (full_df['price_charged'] >= upper_limit).astype(int)

In [None]:
premium_trips = full_df.query('is_premium == 1')
non_premium_trips = full_df.query('is_premium == 0')

fig, ax = plt.subplots(1,2, figsize = (13,5), sharey = True)

sns.boxplot(x = 'price_charged', y = 'company', data = premium_trips, palette = palette, 
            order = ['Pink Cab', 'Yellow Cab'], ax = ax[0]).\
set_title('Premium Trips');


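#Using the same company order as the premium plot so that the colors match
sns.boxplot(x = 'price_charged', y = 'company', data = non_premium_trips, palette = palette, 
            order = ['Pink Cab', 'Yellow Cab'], ax = ax[1]).\
set_title('Non-Premium Trips');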

In the plots above, price charged has been further split by premium category. Premium trips have a higher price range than non-premium trips. Almost all of the outliers within the non-premium trips have been removed. For the premium trips, however, there are still outliers at the higher end of the `price_charged` variable.
<br>
<br>
Next, I will see how premium rides relate to distance and customer income.

In [None]:
fig, ax = plt.subplots(1,2, figsize = (13,5))

sns.boxplot(y = 'km_travelled', x = 'is_premium', data = full_df, ax = ax[0]).\
set_title('Distance by Premium Fares');

sns.boxplot(y = 'cust_income', x = 'is_premium', data = full_df, ax = ax[1]).\
set_title('Customer Income by Premium Fares');

The plots above suggest that the `price_charged` for a trip depends mainly on the distance traveled and that the customer's income doesn't have much of an effect. Therefore, my hypothesis regarding the `price_charged` variable is wrong.

Next, I will visualize the data in a way that might reveal any correlation between these features. I will only take a sample of the data as it is computationally expensive to plot a Pairplot using the whole dataset.

In [None]:
%%time
sampled_df_1 = full_df[['km_travelled', 'price_charged', 'cost_of_trip', 'profit', 
                      'company']].sample(frac = 0.05, random_state=42)

g = sns.pairplot(sampled_df_1, hue = 'company', hue_order = ['Pink Cab', 'Yellow Cab'], 
             palette = palette, plot_kws={'alpha': 0.1});

g._legend.set_bbox_to_anchor((0.98, 0.8))

plt.tight_layout()

In [None]:
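#numeric_only = True leaves the text column 'company' out of the correlation matrix
sns.heatmap(sampled_df_1.corr(numeric_only = True), annot = True, center = 0);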

The plots above show that all features are correlated with each other. In general, as the distance of travel increases, both cab expenses and cab fares increase. The correlation between distance traveled and cab expenses is strong. <br>
<br>
For the price, expense and profit variables, **distance traveled is the confounding variable**.
<br>
<br>
For the operational expenses, there is a greater spread between all of the variables. Looking closely at the scatter plot between price charged and profit, the spread is higher when the price charged is low and becomes smaller as the price charged increases.

#### Are the daily number of trips and the total distance traveled in a day correlated?

In [None]:
trip_distance_df = full_df.groupby('travel_date').agg({'km_travelled' : 'sum', 'city' : 'count'}).\
                   rename(columns = {'city' : 'trips_#'})

plt.figure(figsize = (10,6))
sns.scatterplot(x = 'trips_#', y = 'km_travelled', data = trip_distance_df, edgecolor = 'black', 
                alpha = 0.5).set_title("Daily trips vs Total distance covered daily");

There is a near-perfect linear correlation between the number of trips and the total distance traveled in a day. Therefore, either one of these variables largely determines both companies' daily revenues, expenses and profits.
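The strength of the relationship seen in the scatter plot can be quantified with the Pearson correlation coefficient. The sketch below uses toy daily aggregates, which are hypothetical values and not results from the dataset. The helper name `pearson_r` is illustrative.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy daily aggregates (hypothetical values): trips per day and total km per day
trips = [100, 150, 200, 250]
total_km = [2250, 3400, 4480, 5650]

r = pearson_r(trips, total_km)
print(f"r = {r:.4f}")
```

On the real data, `trip_distance_df['trips_#'].corr(trip_distance_df['km_travelled'])` would compute the same statistic.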

##### Losses

In this section, we look closely at the trips that did not turn a profit.

In [None]:
loss = full_df.query("profit <= 0")
loss

There are 24823 trips that ended in a loss instead of a profit.

#### How does the distribution of losses vary across all cities?

In [None]:
sns.catplot(y = 'city', x = 'profit', col = 'company', data = loss, kind = 'violin', row = 'year');

Although both companies have trips that did not make a profit, trips made by $\color{violet}{\text{Pink Cab}}$ across all cities and all three years have a higher frequency of losses than those of $\color{yellow}{\text{Yellow Cab}}$. These losses could affect the overall profit margin of $\color{violet}{\text{Pink Cab}}$.

#### At which points in time did both companies make the largest total losses?

In [None]:
#Selecting the profit column before summing, so that no other column is aggregated
total_loss = loss.groupby(['travel_date', 'company'])['profit'].sum().sort_values().reset_index()

plt.figure(figsize = (13,5))

ax = sns.lineplot(x = 'travel_date', y = 'profit', data = total_loss, hue = 'company', 
             hue_order = ['Pink Cab', 'Yellow Cab'], palette = palette);

total_loss.loc[:50, ["travel_date", "profit"]].plot.scatter("travel_date", "profit", 
                                                            color = 'red', ax = ax);

The plot above shows a timeline of the loss-making trips only, aggregated at a daily level by summing up the losses. Here, $\color{yellow}{\text{Yellow Cab}}$ has had a series of total losses every year. A pattern is apparent: there are a few clusters of losses during particular months, most visibly during July and August.

In [None]:
month_loss_count = loss.groupby(['month', 'company']).size().reset_index().\
                                                        rename(columns = {0:'count'})

plt.figure(figsize = (10,5))
sns.lineplot(x = 'month', y = 'count', hue = 'company', data = month_loss_count, 
             palette = palette);

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

plt.xticks(np.arange(1,13,1), labels = months);
plt.ylabel('Number of Non-profitable Trips');
plt.title("Month-Wise Total Non-Profitable Trips");

From the plot above, we can deduce that $\color{yellow}{\text{Yellow Cab}}$ makes the most non-profitable trips during the months of July, August, November and December. For $\color{violet}{\text{Pink Cab}}$, it is during September and October.

#### I will test the hypothesis that losses differ across months.

In [None]:
loss_month_profit = loss[['month', 'profit']]
loss_month_profit

In [None]:
plt.figure(figsize = (10,5));
ax = sns.boxplot(y = 'month', x = 'profit', data = loss_month_profit, orient = 'h').\
set_title("Losses Distribution by Month");

plt.yticks(np.arange(0,12,1), labels = months);

The distribution of losses is skewed in every month. The data therefore do not satisfy the assumptions of a one-way ANOVA test. Instead, I will employ a test called the 'Kruskal-Wallis H-test', which is a non-parametric version of ANOVA.
<br>
<br>
The Kruskal-Wallis H-test tests the null hypothesis that the population **medians** of all of the groups are equal.
<br>
<br>
Therefore, I will employ this test to check whether any association exists between the month and the losses made.

In [None]:
#creating arrays of profit(losses) for each month

jan = loss_month_profit.query('month == 1')['profit'].values
feb = loss_month_profit.query('month == 2')['profit'].values
mar = loss_month_profit.query('month == 3')['profit'].values
apr = loss_month_profit.query('month == 4')['profit'].values
may = loss_month_profit.query('month == 5')['profit'].values
jun = loss_month_profit.query('month == 6')['profit'].values
jul = loss_month_profit.query('month == 7')['profit'].values
aug = loss_month_profit.query('month == 8')['profit'].values
sep = loss_month_profit.query('month == 9')['profit'].values
octr = loss_month_profit.query('month == 10')['profit'].values
nov = loss_month_profit.query('month == 11')['profit'].values
dec = loss_month_profit.query('month == 12')['profit'].values

#### * **Null hypothesis**: The median monthly loss is equal across all months (no variation in the medians of the groups). <br>
$H_0: m_1 = m_2 = \dots = m_{12}$
<br>
<br>
#### * **Alternative hypothesis**: At least one monthly median differs from the other months. <br>
$H_1: m_i \neq m_j$ for at least one pair of months $i \neq j$

In [None]:
from scipy import stats

H, p = stats.kruskal(jan, feb, mar, apr, may, jun, jul, aug, sep, octr, nov, dec)

print(f'H-Value: {H:.3f}')
print(f'P-value: {p:.3f}')
print()

alpha = 0.01 #significance level used for the decision

if p <= alpha:
    print('P-value less than alpha - Reject H0')
else:
    print('P-value higher than alpha - Cannot Reject H0')
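The twelve monthly arrays can also be built in one step with `groupby`. The sketch below runs the same `stats.kruskal` test on toy losses for three months; all values are hypothetical and chosen so that the months do not overlap. The names `toy_loss`, `H_toy` and `p_toy` are illustrative.

```python
import pandas as pd
from scipy import stats

# Toy losses (hypothetical values) for three months, with no overlap between months
toy_loss = pd.DataFrame({
    "month": [1] * 5 + [2] * 5 + [3] * 5,
    "profit": [-1, -2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15],
})

# One array of losses per month, in month order
groups = [grp.to_numpy() for _, grp in toy_loss.groupby("month")["profit"]]

H_toy, p_toy = stats.kruskal(*groups)
print(f"H = {H_toy:.3f}, p = {p_toy:.4f}")
```

On the real data, the same pattern applied to `loss_month_profit.groupby('month')['profit']` would replace the twelve separate `query` calls.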

The test above indicates that the median losses differ across months, with losses being higher during some particular months.

#### In which cities do both companies make the most unprofitable trips?

In [None]:
loss_city = loss.groupby(['city', 'company']).size().reset_index().rename(columns = {0:'count'})

g = sns.catplot(y = 'city', x = 'count', col = 'company', data = loss_city, kind = 'bar');

g.set_xlabels("Total Non-Profitable Trips");

According to the data, most of the non-profitable trips made by $\color{yellow}{\text{Yellow Cab}}$ were in Chicago, Boston, Washington and Los Angeles. For $\color{violet}{\text{Pink Cab}}$, they were mostly in Chicago and Los Angeles.

In [None]:
#Selecting the profit column before summing, so that no other column is aggregated
loss_year_city = loss.groupby(['year', 'city', 'company'])['profit'].sum().reset_index()

g = sns.catplot(y = 'city', x = 'profit', col = 'company', row = 'year', data = loss_year_city, 
                kind = 'bar');

g.set_xlabels("Total Losses ($)");

For $\color{yellow}{\text{Yellow Cab}}$, the losses from non-profitable trips were highest in Chicago during 2017. For $\color{violet}{\text{Pink Cab}}$, they were highest in Los Angeles during 2017.

#### How much do the monthly cab fares, cab expenses and profit vary across all states for both cab companies?

In [None]:
#Grouping by year, month, state and company and taking the median of Cab fare, expenses and profit

#I take the median, as the distribution of both profit and price charged is heavily skewed to the right.

#Selecting the numeric columns before aggregating, so that text columns are not passed to median()
cab_monthly_finances = full_df.groupby(['year', 'month', 'state', 'company'])\
[['price_charged', 'cost_of_trip', 'profit']].median().reset_index()

#Concatenating year and month into a single column
cab_monthly_finances['month_level'] = cab_monthly_finances['year'].astype('str') + "-" + \
                                                cab_monthly_finances['month'].astype('str')

#dropping individual year and month column
cab_monthly_finances.drop(['year', 'month'], axis = 1, inplace = True)

#unpivoting price charge and cost of trip in order to make it easier to plot both in a single axis
cab_monthly_finances = cab_monthly_finances.melt(id_vars = ['state', 'company', 'month_level'], 
                                                 var_name = 'inc_exp', value_name = 'amount')

cab_monthly_finances
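
To make the reshaping step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and are not taken from the dataset. `melt` keeps the identifier columns and stacks the three measures into a single `amount` column, labelled by `inc_exp`.

```python
import pandas as pd

# Toy wide table: one row per state/company/month (illustrative values only)
wide = pd.DataFrame({
    'state': ['NY', 'NY'],
    'company': ['Pink Cab', 'Yellow Cab'],
    'month_level': ['2016-1', '2016-1'],
    'price_charged': [300.0, 450.0],
    'cost_of_trip': [250.0, 280.0],
    'profit': [50.0, 170.0],
})

# Long format: 2 rows x 3 measures -> 6 rows
long = wide.melt(id_vars=['state', 'company', 'month_level'],
                 var_name='inc_exp', value_name='amount')
print(long)
```

The long format is what lets `hue = 'inc_exp'` draw one line per measure on the same axis.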

In [None]:
g = sns.relplot(y = 'amount', x = 'month_level', data = cab_monthly_finances, kind = 'line', 
                hue = 'inc_exp', row = 'state', col = 'company');

axes = g.axes.flatten()

for ax in axes:
    ax.axhline(0, ls='--', color='red') #to mark profit below zero

g.set_xticklabels(rotation=65);
plt.tight_layout();

g.set_ylabels("Median Monthly Amount $");
g.set_xlabels("Timeline in Months");
g._legend.set_bbox_to_anchor((0.99, 0.985))

The above plots represent the monthly median Cab fares, Cab expenses and profit for both companies across all states. The $\color{blue}{\text{blue line}}$ represents the median $\color{blue}{\text{price charged}}$ for the trips, the $\color{orange}{\text{orange line}}$ represents the $\color{orange}{\text{cab expenses}}$ and the $\color{green}{\text{green line}}$ represents the $\color{green}{\text{profit}}$.
The $\color{red}{\text{red line}}$ marks zero: any amount (mainly $\color{green}{\text{profit}}$) below it is a $\color{red}{\text{loss}}$. These are my observations:
<br>
* Across all states, $\color{yellow}{\text{Yellow Cab}}$ has higher $\color{orange}{\text{cab expenses}}$ compared to $\color{violet}{\text{Pink Cab}}$. 
<br>
<br>
* In general, $\color{violet}{\text{Pink Cab}}$ has lower $\color{blue}{\text{price charged}}$ compared to $\color{yellow}{\text{Yellow Cab}}$. This is especially discernible in the state of New York.
<br>
<br>
* Like $\color{blue}{\text{price charged}}$, $\color{green}{\text{profit}}$ for $\color{yellow}{\text{Yellow Cab}}$ is higher than its rival's. Both $\color{blue}{\text{price charged}}$ and $\color{green}{\text{profit}}$ follow the same pattern.
<br>
<br>
* For the state of New York, which has more Cab passengers than any other state, $\color{yellow}{\text{Yellow Cab}}$ has higher Cab fares than $\color{violet}{\text{Pink Cab}}$ over the same timeline. $\color{yellow}{\text{Yellow Cab's}}$ $\color{green}{\text{profit}}$ is significantly higher in New York than in any other state.
<br>
<br>
* There is an interesting pattern for the state of PA (Pennsylvania). During February 2018, both Cab fares and Cab expenses of $\color{violet}{\text{Pink Cab}}$ dipped way below the norm. During the same period, the Cab fares of $\color{yellow}{\text{Yellow Cab}}$ showed a significant spike.
<br>
<br>
* Both Cab companies make median $\color{red}{\text{losses}}$ during some time periods across all states. This is more frequent for $\color{violet}{\text{Pink Cab}}$ in states such as Colorado, Florida, Georgia, Illinois, Massachusetts, Pennsylvania, Tennessee, Texas and Washington. For $\color{yellow}{\text{Yellow Cab}}$, there are barely any months with median $\color{red}{\text{losses}}$. This could signify that $\color{yellow}{\text{Yellow Cab}}$ performs better across all states and that any $\color{red}{\text{losses}}$ it makes from non-profit trips are easily offset by the $\color{green}{\text{profit}}$ it makes.

It is important to note that for all three variables there is a confounding variable, which is the distance traveled (`km_travelled`). Perhaps binning this variable will help in further analysis.

### Distance (`km_travelled`)

In [None]:
#Binning km_travelled into 3 equal-frequency quantiles.

full_df['trip_type'] = pd.qcut(full_df.km_travelled, 3, labels = ['short', 'medium', 'long'])
#Short distance = 1.899 to 15.47 km
#Medium distance = 15.47 to 29.4 km
#Long distance = 29.4 to 48.0 km

sns.countplot(x = full_df.trip_type);

All three trip types occur equally often in the data, which is expected because `pd.qcut` builds equal-frequency bins.
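
As a quick illustration of why the counts come out equal, here is a minimal sketch of quantile binning on synthetic distances. The uniform values are an assumption for the example and are not the real `km_travelled` column. `retbins = True` also returns the bin edges that `qcut` picked.

```python
import numpy as np
import pandas as pd

# Synthetic distances in km (assumption: uniform between 2 and 48, not the real data)
rng = np.random.default_rng(0)
km = pd.Series(rng.uniform(2, 48, size=3000))

# retbins=True also returns the bin edges that qcut chose
trip_type, edges = pd.qcut(km, 3, labels=['short', 'medium', 'long'], retbins=True)

print(edges)                      # the tercile boundaries
print(trip_type.value_counts())   # roughly equal counts per bin
```

Equal counts are therefore a property of the binning method, not a finding about the trips themselves.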

#### How does price charged vary with different trip intervals at State level?

In [None]:
sns.catplot(x = 'price_charged', y = 'state', data = full_df, row = 'company', kind = 'violin', 
           col = 'trip_type');

plt.tight_layout()

From the above plots, the prices charged for short distance trips by both cab companies are roughly the same across all states, except for New York. As the distance increases, more variation in prices charged across states is observed for $\color{yellow}{\text{Yellow Cab}}$. In comparison, the prices charged by $\color{violet}{\text{Pink Cab}}$ for a particular distance category remain roughly the same across all states.

#### How does Expenses vary with different trip intervals at State level?

In [None]:
sns.catplot(x = 'cost_of_trip', y = 'state', data = full_df, row = 'company', kind = 'violin', 
           col = 'trip_type');

plt.tight_layout()

Within each trip interval, Cab expenses remain roughly uniform across all states.

#### How does profit vary with different trip intervals at State level?

In [None]:
sns.catplot(x = 'profit', y = 'state', data = full_df, row = 'company', kind = 'violin', 
           col = 'trip_type');

plt.tight_layout()

From the above plot, there is a higher probability of making high profits on long trips across all states.

#### At which distance interval are most non-profit trips made by both companies?

In [None]:
loss_dist = full_df.query("profit < 0")[['profit', 'company', 'trip_type']]

g = sns.catplot(x = 'profit', y = 'trip_type', col = 'company', data = loss_dist, kind = 'violin');

For both Cab companies, short trips result in a higher frequency of losses. Losses made during medium trips are roughly the same for both. For long trips, although the frequency is lower than for the other two intervals, there is a small probability of making higher losses. $\color{violet}{\text{Pink Cab}}$ has the trips with the highest losses compared to its rival.

#### How does Distance affect the price, costs and profit?

In [None]:
sampled_df_2 = full_df[['price_charged', 'cost_of_trip', 'profit', 'trip_type']].\
sample(frac = 0.05, random_state=42)

g = sns.pairplot(sampled_df_2, hue = 'trip_type', plot_kws={'alpha': 0.1});

g._legend.set_bbox_to_anchor((1, 0.8))
plt.tight_layout();

The above plots show that price, cost and profit all increase with each distance interval, and that the variability of all three variables grows as the distance increases.

Next, I will bin `profit` into different levels and see its relation with distance.

In [None]:
bins = [-221, 0.00001, 30.5, 85.5, 200.5, 800.5, 1464]
#Here, I assume that profit = 0 also counts as a loss.

label = ['loss', 'low', 'average', 'above-average', 'high', 'highest']

full_df['profit_level'] = pd.cut(full_df.profit, bins = bins, labels = label)
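
To show how the edges above behave, here is a minimal sketch with a few made-up profit values. `pd.cut` uses right-inclusive intervals by default, so a profit of exactly 0 falls into `(-221, 0.00001]` and is labelled as a loss, while 30.5 still belongs to `low`.

```python
import pandas as pd

# Same edges and labels as above; the profit values are made up for illustration
bins = [-221, 0.00001, 30.5, 85.5, 200.5, 800.5, 1464]
label = ['loss', 'low', 'average', 'above-average', 'high', 'highest']

profit = pd.Series([-50.0, 0.0, 0.5, 30.5, 30.6, 1000.0])
levels = pd.cut(profit, bins=bins, labels=label)

print(pd.DataFrame({'profit': profit, 'profit_level': levels}))
```

Values outside the outer edges (below -221 or above 1464) would become `NaN`, so the outer edges have to cover the observed range of `profit`.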

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,5), sharey = True, tight_layout = True);

pd.crosstab(index = full_df.query("company == 'Pink Cab'").profit_level, 
            columns = full_df.query("company == 'Pink Cab'").trip_type, normalize = 'index').\
plot(kind = 'barh', stacked = True, ax = ax[0], title = "Pink Cab", legend = False);

ax[0].set_xlabel('Proportion');

pd.crosstab(index = full_df.query("company == 'Yellow Cab'").profit_level, 
            columns = full_df.query("company == 'Yellow Cab'").trip_type, 
            normalize = 'index').plot(kind = 'barh', stacked = True, ax = ax[1], 
                                     title = "Yellow Cab");

ax[1].set_xlabel('Proportion');

The above plots depict the proportion of trip types that contribute to each profit level for both companies. For the loss category (where profit <= 0), all three trip types have roughly equal proportions, which could mean that distance is not the main factor contributing to losses for either Cab company.
<br>
<br>
Long trips contribute the most to the highest profit level.
<br>
<br>
What sets $\color{yellow}{\text{Yellow Cab}}$ apart is that it is able to make better profits from shorter trips compared to its rival.
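
The stacked bars above are row-normalized cross-tabulations. As a minimal sketch on a toy frame (illustrative values only), `normalize = 'index'` divides each row by its row total, so every profit level sums to 1 across trip types:

```python
import pandas as pd

# Toy trips (illustrative only)
toy = pd.DataFrame({
    'profit_level': ['loss', 'loss', 'loss', 'high', 'high', 'high', 'high'],
    'trip_type':    ['short', 'medium', 'long', 'long', 'long', 'long', 'medium'],
})

# normalize='index' divides each row by its row total
share = pd.crosstab(index=toy.profit_level, columns=toy.trip_type, normalize='index')
print(share.round(2))
```

Because each row is scaled separately, the bars compare the mix of trip types within a profit level, not the number of trips in it.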

#### Is there any association between daily number of trips and trip types?

In [None]:
trip_date = full_df.groupby(['travel_date', 'company', 'trip_type']).size().reset_index().\
            rename(columns={0:'count'})

plt.figure(figsize = (13,7))
sns.scatterplot(x = 'travel_date', y = 'count', hue = 'trip_type', data = trip_date, alpha = 0.8, 
               edgecolor="black", style = 'company').set_title('Daily Trip Types');

At a daily level, there seems to be equal frequency of all trip types. No trip type stands out in particular. Even on 5th Jan 2018, when the highest number of trips were made according to the data, all trip types were made on that day.

#### Is there any association between distance traveled and profit being at a loss?

I will use **chi-squared test** to test my hypothesis.

In [None]:
#Is there any association between distance travelled and profit_level being loss?

pd.crosstab(index = full_df.query("profit_level == 'loss'").profit_level, 
            columns = full_df.query("profit_level == 'loss'").trip_type)

#### **Null Hypothesis (H0)**: There is no association between Loss and trip_type. <br>
#### **Alternative hypothesis (H1)**: There is an association between loss and trip_type. 

In [None]:
data = pd.crosstab(index = full_df.query("profit_level == 'loss'").profit_level, 
            columns = full_df.query("profit_level == 'loss'").trip_type).values

from scipy.stats import chi2_contingency
stat, p, dof, expected = chi2_contingency(data)
  
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

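The test reports no association between a trip being loss-making and the distance traveled. Note, however, that the table passed to `chi2_contingency` here has a single row (loss only), so the test has zero degrees of freedom and returns a p-value of 1 by construction. A conclusive test would need to compare loss-making trips against the remaining trips.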
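
As a sanity check on how `chi2_contingency` treats the shape of its input, here is a minimal sketch with hypothetical counts (they are not taken from the dataset). A table with one row has no second group to compare against, whereas a loss / no-loss table with three trip types has (2 - 1) * (3 - 1) = 2 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (not from the dataset): rows = loss / no loss,
# columns = short / medium / long
one_row = np.array([[120, 110, 115]])
two_rows = np.array([[120, 110, 115],
                     [900, 950, 1100]])

# Single row: zero degrees of freedom, so the p-value is 1 by construction
_, p_one, dof_one, _ = chi2_contingency(one_row)

# Two rows: (2 - 1) * (3 - 1) = 2 degrees of freedom
_, p_two, dof_two, _ = chi2_contingency(two_rows)

print(dof_one, p_one)
print(dof_two, p_two)
```

On the real data, a two-row table could be built with something like `pd.crosstab(full_df.profit_level == 'loss', full_df.trip_type)`; I have not run this here, so treat it as a suggestion rather than a result.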

#### Let's also test whether any association exists between the other profit levels and distance traveled. The Null hypothesis remains the same as before (except it's for profit instead of loss).

In [None]:
data = pd.crosstab(index = full_df.query("profit_level != 'loss'").profit_level, 
            columns = full_df.query("profit_level != 'loss'").trip_type).values

stat, p, dof, expected = chi2_contingency(data)
  
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

For the other profit levels, there is an association between profit level and distance traveled.

# Customer Income

In [None]:
#Distribution of Customer Income
plt.figure(figsize = (8,5))
sns.histplot(x = 'cust_income', kde = True, data = full_df);

Just like the age variable, customer income follows a two-phase uniform distribution: customers are spread evenly across incomes below about 25000 $, and, at a lower but again roughly uniform density, across the higher income range.

##### Is there a relation between a customer's income and the number of times the customer uses a Cab service?

In [None]:
#Creating a dataset that displays customer id, customer's income and total number of times the 
#customer has made the trip.

customer_income_trip_df = full_df.groupby(['customer_id', 'cust_income']).size().\
                                                reset_index().rename(columns = {0:'count'})

customer_income_trip_df

In [None]:
plt.figure(figsize = (10,6))
sns.scatterplot(x = 'cust_income', y = 'count', data = customer_income_trip_df, 
               alpha = 0.5);
plt.ylabel("Number of Trips");

In [None]:
#Correlation using spearman as the data is not normally distributed and both variables are discrete, 
#not continuous.
sns.heatmap(customer_income_trip_df.iloc[:,1:].corr('spearman'), annot = True, center = 0);

According to the data in hand, there is no correlation between a customer's income and the number of times the customer travels by Cab. Perhaps binning income levels will add more insight to the analysis. I have sourced the income range categories from here: https://money.usnews.com/money/personal-finance/family-finance/articles/where-do-i-fall-in-the-american-economic-class-system 
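
The heatmap only shows the coefficient. If a p-value is also wanted, `scipy.stats.spearmanr` returns both. Below is a minimal sketch on synthetic customers, where income and trip counts are drawn independently (an assumption for the example, not the real data):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Synthetic customers (assumption: income and trip counts drawn independently)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'cust_income': rng.integers(2000, 35000, size=500),
    'count': rng.integers(1, 40, size=500),
})

rho, p = spearmanr(toy['cust_income'], toy['count'])
print(f'rho = {rho:.3f}, p = {p:.3f}')

# pandas computes the same coefficient, without the p-value
rho_pd = toy['cust_income'].corr(toy['count'], method='spearman')
```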

In [None]:
#Binning income levels

bins = [0, 2670.5, 4451.5, 8903.5, 20030.5, 35000.5]
label = ['low', 'low-middle', 'middle', 'upper-middle', 'high']

full_df['income_levels'] = pd.cut(full_df.cust_income, bins = bins, labels = label)

In [None]:
income_count = full_df.groupby(['year', 'company', 'income_levels']).size().reset_index().\
                                                                rename(columns = {0:'count'})

sns.catplot(x = 'income_levels', y = 'count', row = 'company', col = 'year', data = income_count, 
           kind = 'bar', aspect = 0.85);

After binning income, we can see that most of the passengers belong to the upper-middle class for both Cab companies, followed by the high income class. $\color{yellow}{\text{Yellow Cab}}$ has the higher number of passengers. For both companies, there is slight growth in passengers from 2016 to 2017, followed by stagnation or a slight dip from 2017 to 2018.

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (12,5))

sns.violinplot(x = 'profit', y = 'income_levels', data = full_df, ax = ax[0]).\
                            set_title("Profit Distribution by Income Groups");

pd.crosstab(index = full_df.income_levels, columns = full_df.trip_type, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, ax = ax[1], title = "Distance Travelled by Income Groups").\
legend(loc = "lower right", title = "Trip Type");

ax[1].set_ylabel('Proportion');

The distribution of profit is the same across all income groups. Similarly, customers from all income classes have the same proportions of trip types.

In [None]:
#Null Hypothesis (H0): No association between income and distance travelled
#Alternative (H1): Association exists between these 2 variables

data = pd.crosstab(index = full_df.income_levels, columns = full_df.trip_type).values

stat, p, dof, expected = chi2_contingency(data)
  
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

Both the plots and the statistical test suggest that distance traveled does not depend on the income level of the customer.

In [None]:
fig, ax = plt.subplots(1, 2, tight_layout = True, figsize = (11, 6), sharey = True)

#Pink Cab
pd.crosstab(columns = full_df.query("company == 'Pink Cab'").profit_level,
            index = full_df.query("company == 'Pink Cab'").income_levels, 
            normalize = 'index').\
plot(kind = 'barh', stacked = True, title = "Pink Cab", ylabel = "Proportion", 
                 xlabel = 'Income Groups', ax = ax[0], legend = False)

#Yellow Cab
pd.crosstab(columns = full_df.query("company == 'Yellow Cab'").profit_level, 
            index = full_df.query("company == 'Yellow Cab'").income_levels, 
            normalize = 'index').\
plot(kind = 'barh', stacked = True, title = "Yellow Cab", ylabel = "Proportion", 
                 xlabel = 'Income Groups', ax = ax[1]).\
legend(loc='lower right', bbox_to_anchor=(1.0, -0.3), ncol = 3, title = 'Profit Levels');

For both companies, the proportions of profit levels are the same for customers of all income levels. There is no visible association between a customer's income and profit.

In [None]:
pd.crosstab(full_df.city, columns = full_df.income_levels, normalize = 'index').\
plot(kind = 'barh', stacked = True, figsize = (10,6), title = "Proportion of Customers by Income Levels");

All cities have similar customer distributions by income level.

In [None]:
fig, ax = plt.subplots(1,2, sharey = True, figsize = (13,5))

pd.crosstab(index = full_df.is_premium, columns = full_df.income_levels, normalize = 'index').\
plot(kind = 'barh', stacked = True, title = "Proportion of Premium rides", ax = ax[0]).\
legend(loc = 'lower center', bbox_to_anchor = (0.5, -0.3), ncol = 3);

pd.crosstab(index = full_df.is_premium, columns = full_df.trip_type, normalize = 'index').\
plot(kind = 'barh', stacked = True, title = "Proportion of Trip Types", ax = ax[1]).\
legend(loc = 'lower center', bbox_to_anchor = (0.5, -0.2), ncol = 3);

For the `is_premium` variable, my hypothesis about premium cabs has again been proven wrong, as `is_premium` shows no association with either distance or the customer's income level. Therefore, I will drop this column.

From all of the above plots, we can conclude that a customer's income level does not determine either Cab company's profit or mode of operation.

In [None]:
full_df.drop('is_premium', axis = 1, inplace = True)

##### Which Cab company has the most loyal customers?

In [None]:
loyal_cust = full_df.groupby(['customer_id', 'company']).size().reset_index().rename(columns = {0:'count'})

#identifying loyal customers who have used a particular Cab company at least 5 times.
loyal_cust['is_loyal_five'] = np.where(loyal_cust['count'] >= 5, 'Loyal', 'Not Loyal')

loyal_cust

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,5))

pd.crosstab(index = loyal_cust.company, columns = loyal_cust.is_loyal_five).\
plot(kind = 'bar', rot = 0, stacked = True, title = "Total Loyal Customers", ax = ax[0], legend = False);

pd.crosstab(index = loyal_cust.company, columns = loyal_cust.is_loyal_five, normalize = 'index').\
plot(kind = 'bar', rot = 0, stacked = True, title = "Loyal Customer Proportion", ax = ax[1]);

From the above plots, it's clear that $\color{yellow}{\text{Yellow Cab}}$ has more loyal customers (customers who have used the company's services at least 5 times) than $\color{violet}{\text{Pink Cab}}$.

In [None]:
#identifying loyal customers who have used a particular Cab company at least 10 times.
loyal_cust['is_loyal_ten'] = np.where(loyal_cust['count'] >= 10, 'Loyal', 'Not Loyal')

fig, ax = plt.subplots(1,2, figsize = (10,5))

pd.crosstab(index = loyal_cust.company, columns = loyal_cust.is_loyal_ten).\
plot(kind = 'bar', rot = 0, stacked = True, title = "Total Loyal Customers", ax = ax[0], legend = False);

pd.crosstab(index = loyal_cust.company, columns = loyal_cust.is_loyal_ten, normalize = 'index').\
plot(kind = 'bar', rot = 0, stacked = True, title = "Loyal Customer Proportion", ax = ax[1]);

Same as before, $\color{yellow}{\text{Yellow Cab}}$ has more loyal customers (customers who have used the company's services at least 10 times) than $\color{violet}{\text{Pink Cab}}$.
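
Since the two loyalty checks differ only in the threshold, the computation can be wrapped in a small helper. This is a minimal sketch: `loyal_share` is a hypothetical name of my own, and the toy trips below are made up for illustration.

```python
import pandas as pd

def loyal_share(trips, threshold):
    """Share of (customer, company) pairs with at least `threshold` trips, per company."""
    counts = trips.groupby(['customer_id', 'company']).size()
    return (counts >= threshold).groupby(level='company').mean()

# Toy trips (illustrative only): customer 1 rides Yellow Cab six times, and so on
toy = pd.DataFrame({
    'customer_id': [1] * 6 + [2] * 2 + [3] * 5 + [4] * 1,
    'company': ['Yellow Cab'] * 6 + ['Yellow Cab'] * 2 + ['Pink Cab'] * 5 + ['Pink Cab'] * 1,
})

for t in (5, 10):
    print(t, loyal_share(toy, t).to_dict())
```

Trying several thresholds this way shows whether the ranking of the two companies depends on where the loyalty cut-off is placed.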

#### Customer growth by year

In [None]:
#Annual Customer growth by company

yearly_cust_growth = full_df.groupby(['city', 'year', 'company']).agg({'customer_id':'nunique'}).\
                                                                                    reset_index()

yearly_cust_growth

In [None]:
sns.catplot(y = 'customer_id', x = 'year', col = 'company', data = yearly_cust_growth, 
           kind = 'point', hue = 'company', palette = palette);

The number of customers grew slightly from 2016 to 2017 and then fell slightly from 2017 to 2018.
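
To put a number on the growth, the year-over-year change can be computed with `pct_change` within each company. The counts below are hypothetical and only illustrate the calculation:

```python
import pandas as pd

# Hypothetical unique-customer counts per company and year (not from the dataset)
toy = pd.DataFrame({
    'company': ['Pink Cab'] * 3 + ['Yellow Cab'] * 3,
    'year': [2016, 2017, 2018] * 2,
    'customers': [100, 110, 104.5, 200, 230, 207],
})

# Percentage change relative to the previous year, computed within each company
toy['growth_pct'] = toy.sort_values('year').groupby('company')['customers'].pct_change() * 100
print(toy)
```

The first year of each company has no previous year to compare against, so its growth is `NaN`.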

##### Preferred Payment Mode by Customers

In [None]:
pd.crosstab(index = full_df.income_levels, columns = full_df.payment_mode, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, title = "Payment-Mode Proportions by Income Levels").\
legend(loc = "lower center", ncol = 2, bbox_to_anchor=(0.5, -0.3));

In [None]:
pd.crosstab(columns = full_df.gender, index = full_df.income_levels, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, title = "Gender Proportions by Income").\
legend(loc = 'lower center', ncol = 2, bbox_to_anchor = (0.5, -0.3));

### Age

The age distribution of the passengers in the dataset follows a two-phase uniform distribution. According to the data, most passengers are between 18 and 40 years old. Passengers older than 40 and up to 65 still make up a sizable portion.

In [None]:
full_df['age_level'] = pd.qcut(full_df.age, 5, 
                               labels = ['early_20s', 'late_20s', 'early_30s', 'middle_age', 'senior'])

# early_20s: 17.999 to 24.0 
# late_20s: 24.0 to 30.0 
# early_30s: 30.0 to 36.0 
# middle_age: 36.0 to 47.0
# senior : 47.0 to 65.0

In [None]:
sns.catplot(x = 'age_level', col = 'company', data = full_df, kind = 'count');

$\color{yellow}{\text{Yellow Cab}}$ has a higher number of customers, but the distribution of customers across age levels is the same for both companies.

In [None]:
fig, ax = plt.subplots(1,2, figsize = (10,5), tight_layout = True, sharey = True)

pd.crosstab(index = full_df.age_level, columns = full_df.trip_type, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, title = "Distance travelled by Age", 
     ylabel = "Proportion", ax = ax[0]).\
legend(loc = "lower center", bbox_to_anchor = (0.5,-0.3), ncol = 3);

pd.crosstab(index = full_df.age_level, columns = full_df.profit_level, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, title = "Profit by Age", ax = ax[1]).\
legend(loc = "lower center", bbox_to_anchor = (0.5,-0.35), ncol = 3);

Customer age doesn't seem to have any effect on either distance traveled or profit, as the proportions remain the same across all age levels.

#### Market Share

In [None]:
#Market share by company

pd.crosstab(index = full_df.year, columns = full_df.company, normalize = 'index').\
plot(kind = 'bar', stacked = True, rot = 0, title = 'Company Market Share', color = ['tab:pink', 'gold'], 
    figsize = (10, 5), ylabel = 'Proportion').\
legend(loc = 'lower center', ncol = 2, bbox_to_anchor = (0.5, -0.3));

In [None]:
fig, axes = plt.subplots(3,1, sharex = True, figsize = (10,20))

for year, ax in zip([2016, 2017, 2018], axes.flatten()):
    temp_df = full_df.query(f'year == {year}')
    pd.crosstab(index=temp_df.city, columns=temp_df.company, normalize='index').sort_values('Pink Cab').\
    plot(kind = 'barh', stacked = True, ax = ax, title = f"{year} Market Share", 
        color = ['tab:pink', 'gold']).\
    legend(loc = 'lower center', ncol = 2, bbox_to_anchor = (0.5, -0.3))
    
    ax.axvline(0.5, ls = '--', color = 'k', alpha = 0.7)
    
plt.xlabel("Proportion");
plt.xticks(np.arange(0.0, 1.1, 0.1));

Across all the years, the market share for each company has not changed much. 

$\color{yellow}{\text{Yellow Cab}}$ has the highest market share in the majority of the cities.

#### Does `price_charged` vary by Gender?

In [None]:
sns.catplot(x = 'gender', y = 'price_charged', data = full_df, col = 'company', kind = 'violin');

For both cab companies, the median `price_charged` is very similar for both genders. Therefore, there is no visible relation between `price_charged` and gender.

# Conclusion

After analyzing all the variables in the dataset, here is a summary of my analysis:
<br>
* Both Cab companies' financial performance is mainly based on **profit**. Profit is the difference between the **price charged** and the **cost of trip** for each trip. Both of these variables are highly correlated with the **distance traveled** for each trip, and the total distance traveled in a day is positively correlated with the **total number of daily trips**.
<br>
<br>
* There is **weekly, monthly and quarterly seasonality** in the number of rides. The number of cab rides is highest during December and lowest during February.
<br>
<br>
*  $\color{yellow}{\text{Yellow Cab}}$ covers more cities and has more loyal customers than $\color{violet}{\text{Pink Cab}}$. Moreover, $\color{yellow}{\text{Yellow Cab}}$ seems to perform well in almost all cities and makes significantly higher profits than its rival.
<br>

#### In conclusion, we can measure a company's performance by looking at the total number of daily trips.

In the next section, I will combine extra datasets with the full dataset to see other factors that can affect both companies' mode of operation.

### Holidays vs Trips

In [None]:
holiday_df = pd.read_csv("../input/us-holiday-dates-2004-2021/US Holiday Dates (2004-2021).csv").\
             query('Date >= "2016-01-01" & Date <= "2018-12-31"').\
             sort_values('Date').reset_index(drop = True)

holiday_df.head()

In [None]:
holiday_df.Date = pd.to_datetime(holiday_df.Date)
holiday_df.info()

In [None]:
#Holidays are 1, non-holidays = 0

full_df['is_holiday'] = np.where(full_df['travel_date'].isin(holiday_df.Date), 1, 0)
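
The flag above relies on `travel_date` and `holiday_df.Date` both being datetime columns. Here is a minimal sketch of the same lookup on a hypothetical holiday list and three made-up trips:

```python
import numpy as np
import pandas as pd

# Hypothetical holiday list and trips; both sides are converted to datetime64 first
holidays = pd.to_datetime(pd.Series(['2016-07-04', '2016-12-25']))
trips = pd.DataFrame({
    'travel_date': pd.to_datetime(['2016-07-03', '2016-07-04', '2016-12-25'])
})

# 1 where the travel date appears in the holiday list, 0 otherwise
trips['is_holiday'] = np.where(trips['travel_date'].isin(holidays), 1, 0)
print(trips)
```

If either side carried a time-of-day component, the dates would not match exactly, so it would be safer to normalize both to midnight before the lookup.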

In [None]:
holiday_trip = full_df.groupby(['travel_date', 'company', 'is_holiday']).size().reset_index().\
               rename(columns = {0:'count'})

# plt.figure(figsize = (25,8));
g = sns.relplot(x = 'travel_date', y = 'count', hue = 'is_holiday', style = 'company', 
            data = holiday_trip, edgecolor="black", aspect = 1.9, height = 5, alpha = 0.8);

g._legend.set_bbox_to_anchor((0.95, 0.8))
plt.tight_layout();

In [None]:
trip_holiday = full_df.groupby(['travel_date', 'is_holiday']).\
                   size().reset_index().rename(columns = {0:'trips_#'})

fig, ax = plt.subplots(1,2, figsize = (13,5))

sns.boxplot(x = 'is_holiday', y = 'trips_#', data = trip_holiday, ax = ax[0]).\
set_title('Trips by Holiday');

pd.crosstab(index = full_df.income_levels, columns = full_df.is_holiday, normalize = 'index').\
plot(kind='bar', stacked=True, title='Trip Proportions by Income Levels on Holidays', ax=ax[1], rot = 0).\
legend(loc = 'lower center', bbox_to_anchor = (0.5,-0.25), ncol = 2);

The above plot shows that the distribution of daily trips is heavily right-skewed for both types of days. Also, customers of all income levels make the same proportion of their trips on holidays and non-holidays.
<br>
<br>
The median number of trips for both categories seems almost equal. But we need to test this statistically before drawing any conclusion about an association between the number of trips and the type of day.
<br>
<br>
Similar to earlier, I will employ Kruskal-Wallis H-test to test for association between type of day and number of trips.

#### Is there an association between the daily number of trips and whether the day is a holiday?

#### **Kruskal-Wallis H-test**
<br>

##### Null hypothesis (H0): Median number of daily trips is the same on holidays and non-holidays.

##### Alternative Hypothesis (H1): Median number of daily trips differs between holidays and non-holidays.

In [None]:
from scipy import stats

a = holiday_trip.query('is_holiday == 0')['count'].values # daily trip counts per company on non-holidays
b = holiday_trip.query('is_holiday == 1')['count'].values # daily trip counts per company on holidays

h, p = stats.kruskal(a, b)

print(f'H-Value: {h:.3f}')
print(f'P-value: {p:.3f}')
print()

alpha = 0.05

if p <= alpha:
    print('P-value less than alpha - Reject H0')
else:
    print('P-value higher than alpha - Cannot Reject H0')

As the P-value is higher than the set alpha level, we cannot reject H0: **the data gives no evidence that holidays increase or decrease the daily number of trips**.
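
To get a feel for how the test responds, here is a minimal sketch of `stats.kruskal` on synthetic daily counts (made-up numbers, not the real data). Comparing a sample with itself gives a P-value close to 1, while comparing it with a clearly shifted copy gives a very small one.

```python
import numpy as np
from scipy import stats

# Synthetic daily trip counts (assumption: made-up numbers, not the real data)
rng = np.random.default_rng(2)
ordinary = rng.poisson(300, size=200)
shifted = ordinary + 500          # clearly larger values, for contrast

h_same, p_same = stats.kruskal(ordinary, ordinary)
h_diff, p_diff = stats.kruskal(ordinary, shifted)

print(f'identical samples: p = {p_same:.3f}')
print(f'shifted samples:   p = {p_diff:.3g}')
```

A high P-value only means the test found no detectable difference. With few holiday observations, a small real effect could still go unnoticed.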

### New York Weather Data Set

In order to find any association between weather conditions and trips, I sourced datasets related to climate in the US through: https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day.
<br>
<br>
Unfortunately, I was only able to source data for all three years (2016-2018) for New York. Therefore, this section analyses weather and trip information only for the state of New York.

In [None]:
ny_df = pd.read_csv('../input/extra-datasets/ny_weather.csv')
ny_df.DATE = pd.to_datetime(ny_df.DATE)
ny_df.rename(columns = {'DATE' : 'travel_date'}, inplace = True)
ny_df.info()

In [None]:
full_df.columns

In [None]:
df = ny_df.merge(full_df, how = 'inner', on = ['travel_date', 'city', 'state'])

df.columns

In [None]:
df.drop(['payment_mode', 'gender'], axis = 1, 
       inplace = True)

#renaming columns
df.rename(columns = {'TEMP' : 'avg_temp_F', #Daily average temperature in Fahrenheit
                      'DEWP' : 'dew_point', #Average dew point
                      'VISIB' : 'visibility', #Average visibility
                      'MXSPD' : 'max_wind_speed', #Maximum sustained wind speed
                      'PRCP' : 'precipitation' #Precipitation
                     }, inplace = True)

ny_weather_df = df[['travel_date', 'avg_temp_F', 'precipitation', 'city', 'price_charged', 
                    'cost_of_trip', 'km_travelled']]

In [None]:
ny_trip = ny_weather_df.groupby('travel_date').agg({'city' : 'count', 
                                                    'avg_temp_F' : 'median',
                                                    'precipitation' : 'median',
                                                    'km_travelled' : 'median',
                                                    'price_charged' : 'median',
                                                    'cost_of_trip' : 'median'}).\
          rename(columns = {'city' : 'trips_#'})

ny_trip.head()

In [None]:
ax = ny_trip.plot(subplots = True, figsize = (14, 25));
ny_trip.rolling(30).mean().plot(subplots = True, ax = ax, color = 'black', alpha = 0.8, legend = False);

The above plots show daily variations for New York in the number of trips and in the daily medians of temperature, precipitation, distance travelled, price charged and cost of trip. The black lines represent the 30-day rolling mean of each variable.
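
For reference, this is how the rolling mean behaves at the start of a series. The sketch uses a synthetic daily series (illustrative only): with a window of 30 and the default settings, the first 29 values are `NaN` because the window is not yet full.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only)
idx = pd.date_range('2016-01-01', periods=90, freq='D')
daily = pd.Series(np.arange(90, dtype=float), index=idx)

smooth = daily.rolling(30).mean()

print(smooth.iloc[28])   # NaN: fewer than 30 observations so far
print(smooth.iloc[29])   # mean of the first 30 values
```

This is why the black lines in the plots start about a month after the raw series.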
<br>
<br>
Visually, there seems to be no apparent correlation between any of the variables. In the next section, I will run a **Granger causality test** to check whether the climate variables help forecast any of the trip variables.

According to https://www.machinelearningplus.com/time-series/time-series-analysis-python/: 
<br>
"Granger causality test is used to determine if one time series will be useful to forecast another.
<br>
<br>
It is based on the idea that if X causes Y, then the forecast of Y based on previous values of Y AND the previous values of X should outperform the forecast of Y based on previous values of Y alone.
<br>
<br>
So, understand that Granger causality should not be used to test if a lag of Y causes Y. Instead, it is generally used on exogenous (not Y lag) variables only.
<br>
<br>
#### The Null hypothesis is: the series in the second column, does not Granger cause the series in the first. If the P-Values are less than a significance level (0.05) then you reject the null hypothesis and conclude that the said lag of X is indeed useful."

I will test to see if the number of trips is affected by any of the climate variables.

In [None]:
from statsmodels.tsa.stattools import grangercausalitytests


targets = ['trips_#', 'price_charged', 'cost_of_trip']
cols = ['avg_temp_F', 'precipitation']

max_lags = 5 #number of lags

for i in targets:
    print()
    for j in cols:
        results = grangercausalitytests(ny_trip[[i, j]], max_lags, verbose = False); 
        p_values=[round(results[k+1][0]['ssr_ftest'][1],4) for k in range(max_lags)] 
        print(f'Target - {i}, Column - {j} : P_Values - {p_values}')

The P-value for each of the lags is higher than 0.05 for daily trips vs the climate variables. This signifies that neither climate variable 'Granger causes' the number of trips. The same can be said for cost of trip.
<br>
<br>
Interestingly, the P-values for price charged vs. average temperature are below the 5% threshold for the first 3 lags. However, we established earlier that the price of a trip is determined by the distance traveled, which when totaled daily is highly correlated with total daily trips. Therefore, I conclude that daily mean temperature and precipitation do not affect either cab company's mode of operation.

# Extreme Weather Events

I managed to find data from 2016 to 2018 that records different extreme weather events in various US states from https://www.ncdc.noaa.gov/stormevents/ftp.jsp.
<br>
<br>
I will compare this data with the Cab dataset to see if extreme weather events affect either cab company's mode of operation.

In [None]:
ext_weather_df = pd.read_csv('../input/extra-datasets/extreme_weather.csv')
ext_weather_df.head()

In [None]:
ext_weather_df.isna().sum()

The dataset came with a document defining all of its variables. I am quoting the document's description of the relevant variable:

`MAGNITUDE`:  "magnitude (Ex: 0.75, 60, 0.88, 2.75) - The magnitude of the event. This is only used for wind speeds and hail size (e.g. 0.75” of hail; 60 knot winds)".
<br>
<br>
Therefore, many observations are missing a value in the `MAGNITUDE` column. I will drop those observations. The only way to quantify the severity of an event is the `MAGNITUDE` variable. Hence, for this section, I will only be able to check the effect of hail and wind speed against the Cab dataset.

In [None]:
ext_weather_df.dropna(axis = 0, inplace = True)

In [None]:
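# After the dropna above this is a no-op; assign back instead of using inplace on a column
ext_weather_df['MAGNITUDE'] = ext_weather_df['MAGNITUDE'].fillna(0)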
ext_weather_df.isna().sum()

In [None]:
np.setdiff1d(full_df.state.unique(), ext_weather_df.state_abbr.unique())

State 'DC' is missing from the extreme_weather dataset.

In [None]:
#Converting Event_Date to datetime format
ext_weather_df['Event_Date'] = pd.to_datetime(ext_weather_df['Event_Date'])

In [None]:
ext_weather_df.info()

In [None]:
ext_weather_df.head()

In [None]:
#Types of weather events
np.sort(ext_weather_df.EVENT_TYPE.unique())

In [None]:
plt.figure(figsize = (10,5))
sns.violinplot(y = 'EVENT_TYPE', x = 'MAGNITUDE', data = ext_weather_df, orient = 'h').\
set_title("Magnitude of Weather Events");

Note: Magnitude has no single unit. As mentioned before, magnitude for hail represents the size of the hail. For wind, magnitude represents the wind speed. <br>
<br>
The violin plots above show that the frequency of each weather event type is very low at various magnitudes.

In [None]:
# Select MAGNITUDE before aggregating so non-numeric columns are not passed to median()
ew_df = ext_weather_df.groupby(['Event_Date', 'state_abbr', 'EVENT_TYPE'])[['MAGNITUDE']].\
        median().reset_index()
ew_df.head()

Next, I will join the Cab dataset with the extreme_weather dataset using a left join. This way we also keep the records for days with no extreme weather events, so they can be compared with days that had one.

In [None]:
new_df = full_df.query("state != 'DC'").merge(
    ew_df, 
    how = 'left', 
    left_on = ['travel_date', 'state'], 
    right_on = ['Event_Date', 'state_abbr']
).drop(['city', 'payment_mode', 'gender', 'is_holiday', 'day_of_week', 'cust_income'], axis = 1)

new_df.shape

In [None]:
new_df.head()

In [None]:
new_df.isna().sum()

In [None]:
new_df.EVENT_TYPE.unique()

Missing values in the `EVENT_TYPE` variable will be imputed with 'No Event', as they denote that no extreme event took place on that day. `MAGNITUDE` for the 'No Event' category will be 0.

In [None]:
# Assign back instead of calling fillna(inplace=True) on a column accessor
new_df['MAGNITUDE'] = new_df['MAGNITUDE'].fillna(0)
new_df['EVENT_TYPE'] = new_df['EVENT_TYPE'].fillna('No Event')
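The left join and imputation above can be illustrated on a toy example. The two small frames below are made up for illustration and only borrow the column names used in this notebook. The sketch also shows one thing worth keeping in mind: if a state has more than one event type on the same day, each trip on that day appears once per event type after the join.

```python
import pandas as pd

# Hypothetical trips: two on 2016-01-02 (NY and CA), one on 2016-01-03 (NY)
trips = pd.DataFrame({
    "travel_date": pd.to_datetime(["2016-01-02", "2016-01-02", "2016-01-03"]),
    "state": ["NY", "CA", "NY"],
    "price_charged": [30.0, 45.0, 25.0],
})

# Hypothetical events: NY has two event types on 2016-01-02
events = pd.DataFrame({
    "Event_Date": pd.to_datetime(["2016-01-02", "2016-01-02"]),
    "state_abbr": ["NY", "NY"],
    "EVENT_TYPE": ["Hail", "High Wind"],
    "MAGNITUDE": [1.0, 50.0],
})

# Left join keeps every trip, matched or not
merged = trips.merge(
    events,
    how="left",
    left_on=["travel_date", "state"],
    right_on=["Event_Date", "state_abbr"],
)

# Unmatched trips had no event that day in that state
merged["EVENT_TYPE"] = merged["EVENT_TYPE"].fillna("No Event")
merged["MAGNITUDE"] = merged["MAGNITUDE"].fillna(0)

print(merged[["travel_date", "state", "EVENT_TYPE", "MAGNITUDE"]])
print(len(trips), "trips ->", len(merged), "rows after the join")
```

Comparing the row count before and after the merge is a quick way to see how many trips were repeated because of multiple events on the same day.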

In [None]:
new_df.isna().sum()

In [None]:
new_df.duplicated().sum() #No duplicated values

In [None]:
new_df.drop(['Event_Date', 'state_abbr'], axis = 1, inplace = True)

In [None]:
new_df.isna().sum()

In [None]:
new_df

In [None]:
#Effect of EVENT TYPES on trips

event_trip = new_df.groupby(['travel_date', 'EVENT_TYPE']).\
            agg({'state' : 'count', 'MAGNITUDE' : 'median'}).reset_index().\
            rename(columns = {'state' : 'trips'})

event_trip

In [None]:
fig, axes = plt.subplots(3,2, figsize = (12,20))

for event, ax in zip(event_trip.EVENT_TYPE.unique(), axes.flatten()):
    sns.scatterplot(x = 'MAGNITUDE', y = 'trips', data = event_trip.query(f'EVENT_TYPE == "{event}"'), 
                   ax = ax, size = 'MAGNITUDE', hue = 'MAGNITUDE', edgecolor = 'black').\
    set_title(f"{event} vs Trips");

fig.delaxes(axes[2,1]);

The above plots illustrate the effect of the magnitude of different events on the number of daily trips.
<br>
<br>
The first plot is trivial: days with no extreme weather event all have a magnitude of 0, so it shows no relationship. On all other plots, magnitudes below a certain threshold do not seem to affect the number of daily trips.
<br>
<br>
Beyond that threshold, however, the plots show fewer trips at very high magnitudes.
<br>
<br>
An exception is the Strong Wind event, where the number of trips is higher when the magnitude is high.

In [None]:
fig, axes = plt.subplots(2,2, figsize = (12,10))

for event, ax in zip(['High Wind', 'Thunderstorm Wind', 'Hail', 'Strong Wind'], axes.flatten()):
    sns.heatmap(event_trip.query(f'EVENT_TYPE == "{event}"')[['trips', 'MAGNITUDE']].corr('spearman'), 
               ax = ax, center = 0, annot = True).\
    set_title(f"Correlation: {event} vs Trips");

Using Spearman correlation, there is no correlation between the magnitude of each event type and the number of trips. Therefore, according to the data, the number of trips does not increase or decrease with the magnitude of a weather event.
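As a reminder of why Spearman rather than Pearson is a reasonable choice here, the sketch below uses made-up numbers (not the cab data) in which one column is a monotonic but strongly non-linear function of the other. Spearman works on ranks, so it picks up any monotonic relationship, whereas Pearson only measures linear association.

```python
import numpy as np
import pandas as pd

# Toy data: 'trips' grows monotonically but non-linearly with 'MAGNITUDE'
toy = pd.DataFrame({"MAGNITUDE": np.arange(1, 11, dtype=float)})
toy["trips"] = np.exp(toy["MAGNITUDE"])

pearson = toy.corr(method="pearson").loc["MAGNITUDE", "trips"]
spearman = toy.corr(method="spearman").loc["MAGNITUDE", "trips"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman coefficient near zero, as in the heatmaps above, therefore suggests there is no monotonic trend at all, not merely no linear one.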

In [None]:
#Creating new variable where any event other than 'No Event' is 1.
new_df['is_extreme'] = np.where(new_df['EVENT_TYPE'] == 'No Event', 0, 1)

In [None]:
is_extreme_trip = new_df.groupby(['travel_date', 'is_extreme']).size().reset_index().\
             rename(columns = {0:'count'})

is_extreme_trip

In [None]:
fig, ax = plt.subplots(1,2, figsize = (14,5))
sns.violinplot(y = 'is_extreme', x = 'count', data = is_extreme_trip, orient = 'h', ax = ax[0]).\
set_title("Distribution of trips on Weather Events");

ax[0].set_xlabel('Number of Trips');

# Each row of is_extreme_trip is one (date, is_extreme) pair, so this counts days, not trips
sns.countplot(x = 'is_extreme', data = is_extreme_trip, ax = ax[1]).\
set_title("Number of Days by Weather Event Status");

The above plots show that the median number of daily trips differs a lot between the two groups. On days with weather events, the median number of trips is much lower than on days with no weather events.
<br>
<br>
As before, I will perform a statistical test of whether the median number of trips is the same for both groups.

In [None]:
from scipy import stats

a = is_extreme_trip.query('is_extreme == 0')['count'].values # daily trips without an extreme weather event
b = is_extreme_trip.query('is_extreme == 1')['count'].values # daily trips with an extreme weather event

h, p = stats.kruskal(a, b)

print(f'H-Value: {h:.3f}')
print(f'P-value: {p:.3f}')
print()

alpha = 0.05

if p <= alpha:
    print('P-value less than alpha - Reject H0')
else:
    print('P-value higher than alpha - Cannot Reject H0')

The above test shows that the number of trips differs between days with and without extreme weather events.
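For readers who want to see how the Kruskal-Wallis test behaves, here is a minimal sketch on simulated daily trip counts. The group sizes, means and spreads are invented for illustration and are not taken from the cab data. The test compares the rank distributions of the groups and makes no normality assumption; with only two groups it is closely related to the Mann-Whitney U test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated daily trip counts (assumed values, for illustration only)
no_event_days = rng.normal(loc=1000, scale=50, size=200)
event_days = rng.normal(loc=300, scale=50, size=60)

h, p = stats.kruskal(no_event_days, event_days)

print(f"Median without event: {np.median(no_event_days):.1f}")
print(f"Median with event:    {np.median(event_days):.1f}")
print(f"H-value: {h:.3f}, P-value: {p:.3g}")

alpha = 0.05
print("Reject H0" if p <= alpha else "Cannot reject H0")
```

Because the two simulated groups barely overlap, the test rejects the null hypothesis of equal distributions, which is the same decision rule applied to the real data above.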

In [None]:
# new_df.to_csv('df_with_weather_events.csv')

In [None]:
# full_df.to_csv('full_df.csv', index = False)