# Cyclistic Case Study

## Intro
The goal of this project is to understand how casual riders and annual members differ in their use of Cyclistic, a bike-share company. Within the company, the director of marketing believes that future success depends on maximizing the number of annual memberships. The insights from this analysis will therefore inform a digital marketing strategy aimed at converting casual riders into annual members.

For the purpose of this study, **casual riders** are defined as customers who purchase single-ride or full-day passes, while **Cyclistic members** are customers who purchase an annual membership. 

In [None]:
# Importing data science libraries
import pandas as pd
import numpy as np
import glob
import os

# Importing visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.offline as pyo
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots

plt.style.use('fivethirtyeight')

In [None]:
# Importing the data
path = 'data'
csv_files = glob.glob(os.path.join(path, "*.csv"))

dfs = []

for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)

In [None]:
# Combining the data into a single data frame
cdf = pd.concat(dfs, ignore_index=True).drop_duplicates('ride_id') #cdf standing for "combined data frame"

# Checking the combined data frame
cdf.head(10)

In [None]:
cdf.shape

In [None]:
cdf.info()

In [None]:
# Converting `started_at` and `ended_at` to datetime format
cdf['started_at'] = pd.to_datetime(cdf['started_at'], format='mixed')
cdf['ended_at'] = pd.to_datetime(cdf['ended_at'], format='mixed')

cdf.info()

In [None]:
# Sorting by datetime, most recent first
cdf.sort_values('started_at', ascending = False).head(10)

## Cleaning the data
Next, we look for missing values and any remaining duplicates, then decide how to handle them.
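The checks described above can be sketched on a small toy frame. This is only an illustration: `toy` and `quality_report` are hypothetical names that do not appear elsewhere in the notebook, and the column names are assumed to match those in `cdf`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `cdf`; column names follow the notebook.
toy = pd.DataFrame({
    "ride_id": ["a1", "a2", "a2", "a3"],
    "start_station_name": ["Clark St", None, None, "State St"],
    "member_casual": ["member", "casual", "casual", "member"],
})

def quality_report(df, key="ride_id"):
    """Return (percent of missing values per column, number of repeated keys)."""
    missing_pct = df.isna().mean().mul(100).round(1)
    n_duplicates = int(df.duplicated(subset=key).sum())
    return missing_pct, n_duplicates

missing_pct, n_duplicates = quality_report(toy)
print(missing_pct)
print("duplicate ride_id rows:", n_duplicates)
```

Reporting missing values as a percentage rather than a raw count makes it easier to judge whether a column can be filled with a placeholder or needs closer inspection.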

In [None]:
# Checking if there are any missing values across each column

cdf.isnull().any()

In [None]:
cdf.isna().sum()

In [None]:
msno.bar(cdf)

In [None]:
msno.heatmap(cdf, cmap='YlGnBu')

In [None]:
msno.matrix(cdf)

In [None]:
# Replacing missing station names and IDs with the placeholder 'Unknown'
station_cols = ['start_station_name', 'start_station_id',
                'end_station_name', 'end_station_id']
cdf[station_cols] = cdf[station_cols].fillna('Unknown')

# Sampling 100 rides with an unknown start station to inspect where they begin
cd_u = cdf.loc[cdf['start_station_name'] == 'Unknown'].sample(n=100)

cd_u.head(30)

In [None]:
fig = px.scatter_geo(cd_u, lat='start_lat', lon='start_lng',
                     title='Unknown Start Station Name')
fig.show()

In [None]:
# Counting rides per known start station
cdf.loc[cdf['start_station_name'] != 'Unknown', 'start_station_name'].value_counts()

In [None]:
# Computing each ride's duration as a timedelta (end time minus start time)
cdf['duration'] = cdf['ended_at'] - cdf['started_at']
cdf['duration'].head()

In [None]:
cdf.head()

In [None]:
# Looking at the average ride time for members vs. casual riders

mc = cdf.groupby('member_casual')
mc_duration = mc['duration'].mean()

print(mc_duration)

## Observation #1
On average, a member's ride lasts less than half as long as a casual rider's. Unfortunately, the data does not distinguish single-ride passes from full-day passes, so casual rides cannot be split further. Even so, the takeaway is clear: members tend to take shorter rides than casual riders.
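The comparison above can be made explicit by converting durations to minutes and taking the ratio of the two group means. The sketch below uses a hypothetical toy frame rather than `cdf`; the `duration_min` column and the sample values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy rides; the notebook's `cdf` has a timedelta `duration` column.
toy = pd.DataFrame({
    "member_casual": ["casual", "casual", "member", "member"],
    "duration": pd.to_timedelta(["0:30:00", "0:20:00", "0:10:00", "0:14:00"]),
})

# Express each ride in minutes, then average per rider type.
toy["duration_min"] = toy["duration"].dt.total_seconds() / 60
avg_minutes = toy.groupby("member_casual")["duration_min"].mean()

# Ratio of the member average to the casual average;
# a value below 0.5 means "less than half".
ratio = avg_minutes["member"] / avg_minutes["casual"]
print(avg_minutes.round(2))
print(f"member / casual = {ratio:.2f}")
```

Applied to the averages used later in this notebook (12.98 minutes for members and 27.12 minutes for casual riders), the same ratio is about 0.48.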

In [None]:
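# Converting the mean durations to minutes so the y-axis is readable
ax = (mc['duration'].mean().dt.total_seconds() / 60).plot(
    kind='bar', title='Average Ride Time in Minutes: Member vs. Casual')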

In [None]:
# Creating a simple data frame with the average ride time, in minutes, for casual riders and members.
# The values are hard-coded here rather than read from `mc_duration`.
art_data = {'casual': [27.12],
            'member': [12.98]}

art = pd.DataFrame(art_data, index=['avg_minutes'])

print(art)

In [None]:
# Visualizing to show the stark difference in average ride times. 

ax = sns.barplot(art)

# Labeling each bar with its value
for container in ax.containers:
    ax.bar_label(container)

# Setting the axis labels once, outside the loop
ax.set(xlabel='Ride Type', ylabel='Average Minutes Per Ride')

In [None]:
# Now looking at the number of bike rental events for members vs. casual riders. 

cdf['member_casual'].value_counts()

## Observation #2
Casual riders take much longer rides on average than members, but members account for far more rental events than casual riders.
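One way to view both observations side by side is a single summary table with ride counts, average minutes, and total minutes per rider type. This is a minimal sketch on hypothetical toy data; `toy`, `summary`, and the `duration_min` column are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy rides with durations already expressed in minutes.
toy = pd.DataFrame({
    "member_casual": ["member", "member", "member", "casual"],
    "duration_min": [10.0, 12.0, 14.0, 30.0],
})

# Count, average, and total ride time per rider type in one pass.
summary = toy.groupby("member_casual")["duration_min"].agg(
    rides="count", avg_minutes="mean", total_minutes="sum"
)

# Share of all rides taken by each rider type, in percent.
summary["ride_share_pct"] = (summary["rides"] / summary["rides"].sum() * 100).round(1)
print(summary)
```

Total minutes combines the two effects: a group with fewer but longer rides can still account for a large share of overall bike usage.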

In [None]:
from matplotlib import ticker
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Showing y-axis ticks in millions so they match the axis label
formatter = ticker.FuncFormatter(lambda y, _: f'{y / 1e6:.1f}')

ax = sns.barplot(cdf['member_casual'].value_counts(), palette='Set1')
plt.xlabel("Ride Type")
plt.ylabel("# of Rides (in Millions)")
plt.title("Rides per Pass Type")
ax.yaxis.set_major_formatter(formatter)