# The Effect of COVID-19 on NYC Bike Rentals

TBD - Introduction paragraph
- Set the context
- Introduce the analysis
- Add an image?


## 1. A story in 9+ million rows of data

TBD - Introduce the data set
- Where does it come from
- What were some of the challenges - # of rows and optimization
- What are the steps in this section
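
On the optimization point above, here is a minimal sketch of how the dtype conversions affect memory. It uses a small synthetic frame rather than the real rentals data (all values are made up), and mirrors the kind of conversions listed in the `col_types` dictionary defined later in this section.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a slice of the rentals data (values are made up)
rng = np.random.default_rng(0)
n = 100_000
sample = pd.DataFrame({
    'usertype': rng.choice(['Subscriber', 'Customer'], size=n),
    'gender': rng.integers(0, 3, size=n),                          # int64 by default
    'birth year': rng.integers(1940, 2005, size=n).astype(float),  # floats, as when nulls were present
})

# Memory before any conversion (deep=True also counts the string contents)
before = sample.memory_usage(deep=True).sum() / 1024**2

# Apply the same kind of conversions used at import time
optimized = sample.astype({'usertype': 'category', 'gender': 'int8', 'birth year': 'int'})
after = optimized.memory_usage(deep=True).sum() / 1024**2

print('Before: {:.2f} MB, after: {:.2f} MB'.format(before, after))
```

The exact savings depend on the data, but the `category` conversion of a low-cardinality string column is usually where most of the reduction comes from.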



### Import Packages 


In [None]:
# Import packages
import glob
import numpy as np
import pandas as pd
import math
# import scipy.stats as stats

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-poster') #sets the size of the charts
plt.style.use('ggplot')

import seaborn as sns
sns.set(style='ticks', color_codes=True, font_scale=1.25)

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.tile_providers import CARTODBPOSITRON, get_provider
from bokeh.models import ColumnDataSource, HoverTool

# Set display option for floats in Pandas
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Setup Bokeh to output directly to the notebook
output_notebook(resources=None, verbose=False, hide_banner=True, load_timeout=5000, notebook_type='jupyter')


### Prep for data import


In [None]:
# Dictionary of columns and optimal dtypes
col_types = {'start station id': 'int',
             'end station id': 'int', 
             'usertype': 'category',
             'birth year': 'int', 
             'gender': 'int8'
            }

# Create list of updated column names
col_names = ['tripduration', 'starttime', 'stoptime', 'start_station_id', 'end_station_id',
             'bikeid', 'usertype', 'birth_year', 'gender']

# Create list of station column names
station_cols = ['id', 'name', 'lat', 'lon']


### Create helper functions


In [None]:
# Function for reading in and processing data files
def read_and_process(filepath):
    '''Reads in and processes CitiBike monthly csv data files
         
    Args:
      filepath (string): path to csv data file
    
    Returns:
      temp_df (dataframe): dataframe containing the list of rentals for a given month
      
      temp_stations (dataframe): dataframe containing the list of unique rental stations for a given month
    '''
    
    # Read in the data
    temp_df = pd.read_csv(filepath) 
    
    # Drop rows with null values
    temp_df = temp_df.dropna()
    
    # Convert start/stop time columns to datetime
    temp_df.starttime = pd.to_datetime(temp_df.starttime, infer_datetime_format=True)
    temp_df.stoptime = pd.to_datetime(temp_df.stoptime, infer_datetime_format=True)
    
    # Convert column dtypes
    temp_df = temp_df.astype(col_types)
    
    # Process station data
    temp_stations = process_station_data(temp_df)
    
    # Drop redundant columns
    temp_df = temp_df.drop(['start station name', 'start station latitude', 'start station longitude',
              'end station name', 'end station latitude', 'end station longitude'], axis=1)
    
    # Rename remaining columns
    temp_df.columns = col_names

    return temp_df, temp_stations


# Function for abstracting out station-related data
def process_station_data(df):
    '''Processes and abstracts station-related data into a separate dataframe
         
    Args:
      df (dataframe): dataframe containing bike rental data, including start and end stations
    
    Returns:
      temp_stations (dataframe): dataframe containing the list of unique rental stations for a given month
    '''
    
    # New column names
    cols = ['id', 'name', 'lat', 'lon']

    # Temp df for start stations
    start_stations = df[['start station id', 'start station name',
                         'start station latitude', 'start station longitude']]
    start_stations.columns = cols
    start_stations = start_stations.drop_duplicates()

    # Temp df for end stations
    end_stations = df[['end station id', 'end station name',
                       'end station latitude', 'end station longitude']]
    end_stations.columns = cols
    end_stations = end_stations.drop_duplicates()
    
    # Concatenate the start/end station dfs and drop dups 
    temp_stations = pd.concat([start_stations, end_stations],
                              ignore_index=True).drop_duplicates()
    
    return temp_stations


### Read in the data


In [None]:
# Create lists for each set of dataframes
rentals_dfs = []
station_dfs = []

# Read in and process the data
data_files = glob.glob('../data/' + "*.csv")
for file in data_files:
    rentals_df, station_df = read_and_process(file)
    rentals_dfs.append(rentals_df)
    station_dfs.append(station_df)

# Concatenate into 2 dataframes - rentals & stations and drop duplicate stations
df = pd.concat(rentals_dfs, ignore_index=True)
stations = pd.concat(station_dfs, ignore_index=True).drop_duplicates(subset='id', keep="first")

# Calculate memory usage
rentals_mem = df.memory_usage().sum() / 1024**2
stations_mem = stations.memory_usage().sum() / 1024**2

# Print output
print("Rentals: " + str(df.shape[0]) + " rows")
print('Memory usage after optimization:  {:.2f} MB'.format(rentals_mem))
print("Stations: " + str(stations.shape[0]) + " rows")
print('Memory usage after optimization:  {:.2f} MB'.format(stations_mem))


## 2. What's a good day to ride?

TBD - Intro the section

- Add more date dimensions
- Add a count column
- Trip count by dow, hod (gender?, age? usertype?)

Our data set includes two columns containing datetime information, `starttime` and `stoptime`, which were converted from strings to datetime format during import. To aid our analysis later on, we'll extract some additional date-related information from `starttime`.

To do this, we will create the following new columns:

- **hour** - Extract hour of the day as int (i.e., 0-23, 0 = Midnight)
- **dow** - Extract day of the week as int (i.e., 0 = Monday, 6 = Sunday)
- **month_day** - Extract 2-digit month and 2-digit day (e.g., 01-31)
- **year** - Extract the 4-digit year (e.g., 2019)
- **date** - Extract the date (e.g., '2013-10-01')
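
As a quick illustration of these extractions, the sketch below applies the pandas `.dt` accessor to three made-up start times standing in for `df.starttime`. The `demo` frame is purely illustrative and is not part of the analysis.

```python
import pandas as pd

# Three made-up start times standing in for df.starttime
starttime = pd.Series(pd.to_datetime([
    '2019-03-01 08:15:00',   # a Friday
    '2020-03-07 17:40:00',   # a Saturday
    '2020-03-08 00:05:00',   # a Sunday, just after midnight
]))

demo = pd.DataFrame({
    'hour': starttime.dt.hour,                    # 0-23, 0 = midnight
    'dow': starttime.dt.dayofweek,                # 0 = Monday, 6 = Sunday
    'month_day': starttime.dt.strftime('%m-%d'),  # e.g. '03-01'
    'year': starttime.dt.year,                    # e.g. 2019
    'date': starttime.dt.date,                    # datetime.date objects
})
print(demo)
```

Note that `.dt.date` produces Python `date` objects stored in an object column, which can be memory-heavy on millions of rows.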


### Create new columns for date and time manipulations


In [None]:
# Create new columns for hour, day of week and date of rental
# df["hour"] = df.starttime.dt.hour
# df["dow"] = df.starttime.dt.dayofweek
# df["date"] = df.starttime.dt.date

# month_day, day and month are used by the year-over-year cells below
df["month_day"] = df.starttime.dt.strftime('%m-%d')
df['day'] = df.starttime.dt.day
df['month'] = df.starttime.dt.month
df["year"] = df.starttime.dt.year



### Create a helper function for the plots


In [None]:
# def plot_barh(df, col, factor, ylabels, title, figsize):
#     '''Plots a horizontal bar chart showing rental counts across a 
#         specified dimension (col)
    
#     Args:
#       df (dataframe): dataframe containing bike rental data
#       col (string): name of column for dimension to be plotted
#       factor (integer): to indicate the number of bars (for arrangement)
#       ylabels (list): list of strings for y-axis labels
#       title (string): plot title
#       figsize (tuple): tuple of integers designating width and height of output"
    
#     Returns:
#       Outputs plot inline
#     '''
    
#     # The y values are the unique list of col values
#     y = df[col].unique().tolist()
    
#     # The x values are the sum of the rental count by the col dimension 
#     rentals_19 = df[df.year == 2019].groupby([col])[col].count()
#     x_19 = rentals_19.tolist()
#     rentals_20 = df[df.year == 2020].groupby([col])[col].count()
#     x_20 = rentals_20.tolist()
    
#     # Set width and arrange bars 
#     width = 0.75
#     ind = np.arange(len(rentals_19))
    
#     # Create figure & set style
#     fig = plt.figure(figsize=figsize)
#     ax = fig.add_subplot(111)
#     ax.set_title(title,fontsize=20)
#     ax.set_ylim(-1,factor)
#     sns.set_style("ticks")
# #     sns.set_style("whitegrid")
    
#     # Plot the data for each year
#     ax.barh(ind, x_19, width, color='#1E88E5', label='2019')
#     ax.barh(ind, x_20, width, color='#FFC107', label='2020')
        
#     # Create axis labels and legend
#     plt.yticks(ind+width/factor, labels=ylabels)
#     plt.gca().invert_yaxis() # account for default reversal of the y axis values
#     plt.legend(loc='upper right');
    
#     plt.show()

### Plotting rides by hour of the day


In [None]:
# # Assign variables
# col = 'hour'
# factor = 24
# ylabels = ['12:00 AM', '1:00 AM', '2:00 AM', '3:00 AM', '4:00 AM', '5:00 AM', '6:00 AM', '7:00 AM',
#            '8:00 AM', '9:00 AM', '10:00 AM', '11:00 AM', '12:00 PM', '1:00 PM', '2:00 PM', '3:00 PM',
#            '4:00 PM', '5:00 PM', '6:00 PM', '7:00 PM', '8:00 PM', '9:00 PM', '10:00 PM', '11:00 PM']
# title = 'Total Rentals by Hour of Day'
# figsize = (20,15)

# # Create the plot
# plot_barh(df, col, factor, ylabels, title, figsize)


### Plotting rides by day of the week


In [None]:
# # Assign variables
# col = 'dow'
# factor = 7
# ylabels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# title = 'Total Rentals by Day of Week'
# figsize = (20,5)

# # Create the plot
# plot_barh(df, col, factor, ylabels, title, figsize)


### Rental count year-over-year comparison


In [None]:
# Sort the entire dataframe by starttime
df = df.sort_values(by=['starttime']).reset_index(drop=True)

In [None]:
# Create a column to facilitate totaling rental count
df['rental_count'] = 1

In [None]:
# Create temp dfs by year (copies, so new columns can be added without a SettingWithCopyWarning)
df_2019 = df.loc[df['year'] == 2019].copy()
df_2020 = df.loc[df['year'] == 2020].copy()

# Release df
del df

In [None]:
# Calculate running total of rental counts by year
df_2019['running_total'] = df_2019['rental_count'].cumsum()
df_2020['running_total'] = df_2020['rental_count'].cumsum()

In [None]:
# Calculate running total of duration by year
df_2019['running_total_dur'] = df_2019['tripduration'].cumsum()
df_2020['running_total_dur'] = df_2020['tripduration'].cumsum()

# Concatenate the temp dfs back into one
df = pd.concat([df_2019, df_2020], ignore_index=True)

In [None]:
fig = plt.figure(figsize=(20,10))
ax = sns.lineplot(x='starttime', y='running_total', data=df_2019, color='#1E88E5', label='2019')
sns.lineplot(x='starttime', y='running_total', data=df_2020, color='#FFC107', label='2020')

# plt.axhline(y_19.mean(), color='#1E88E5', linestyle='dashed', linewidth=2, label='2019 Avg')
# plt.axhline(y_20.mean(), color='#FFC107', linestyle='dashed', linewidth=2, label='2020 Avg')

In [None]:
###
# TODO
# 1. Figure out how to label the weekend bars and determine if adjustment is needed to placement
# 2. Adjust labels on x-axis to show all 31 days
###
    
#     # Create figure & set style
#     fig = plt.figure(figsize=figsize)
#     ax = fig.add_subplot(111)
#     ax.set_title(title,fontsize=20)
#     ax.set_ylim(-1,factor)
#     sns.set_style("ticks")
#     plt.legend(loc='upper right');

# The y values are the total rental counts for each calendar day (month-day), by year
y_19 = df[df.year == 2019].groupby(['month_day']).usertype.count()
y_20 = df[df.year == 2020].groupby(['month_day']).usertype.count()

# The x values are the matching month-day labels, taken from the grouped index so x and y stay aligned
x_19 = y_19.index.tolist()
x_20 = y_20.index.tolist()

# Create the figure and plot the lines for each year, by date and average
fig = plt.figure(figsize=(20,10))
ax = sns.lineplot(x=x_19, y=y_19.values, color='#1E88E5', label='2019')
plt.axhline(y_19.mean(), color='#1E88E5', linestyle='dashed', linewidth=2, label='2019 Avg')
sns.lineplot(x=x_20, y=y_20.values, color='#FFC107', label='2020')
plt.axhline(y_20.mean(), color='#FFC107', linestyle='dashed', linewidth=2, label='2020 Avg')

# # Create a list of weekend dates
# weekends = []
# for date in x_values:
#     if date.weekday() >= 5:
#         weekends.append(date)

# # Plot a vertical gray bar to designate weekend days
# for date in weekends:
#     plt.axvline(date, color='gray', alpha=0.2, linewidth=37) # linewidth based on trial and error

# Add limits, labels and title
# g1.set_xticks(range(1,32,1))
# g1.set_xticklabels = [item.day for item in x_values]
# g1.set_xlim(df.date.min(), df.date.max())
ax.set(xlabel=None, ylabel=None,
       title='Rental Count Per Day - 2019 vs. 2020')
ax.legend(loc='upper right', bbox_to_anchor=(1.15, 1.0), ncol=1);

In [None]:
temp_df_gender = df.pivot_table(index=['year', 'month', 'day', 'gender'],values='rental_count',
                                     aggfunc=np.sum).reset_index()

In [None]:
# Drop only the leap day (Feb 29) so both years cover the same set of dates
temp_df_gender = temp_df_gender[~((temp_df_gender['month'] == 2) & (temp_df_gender['day'] == 29))]

In [None]:
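# The pivoted frame has no 'starttime' column, so plot against a month-day label, one line per gender
temp_df_gender['month_day'] = temp_df_gender['month'].map('{:02d}'.format) + '-' + temp_df_gender['day'].map('{:02d}'.format)
g = sns.FacetGrid(temp_df_gender, col='year', hue='gender', col_wrap=2, height=15)
g = g.map(plt.plot, 'month_day', 'rental_count')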

In [None]:
# Scratch example with random data; it uses its own dataframe so the rentals df is not overwritten
demo_df = pd.DataFrame(
    data=np.random.randn(90, 4),
    columns=pd.Series(list("ABCD"), name="walk"),
    index=pd.date_range("2015-01-01", "2015-03-31",
                        name="date"))
demo_df = demo_df.cumsum(axis=0).stack().reset_index(name="val")

def dateplot(x, y, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    data.plot(x=x, y=y, ax=ax, grid=False, **kwargs)

g = sns.FacetGrid(demo_df, col="walk", col_wrap=2, height=3.5)
g = g.map_dataframe(dateplot, "date", "val")


## 3. It's always sunny in NYC

TBD - Intro section
- Merge in weather data
- Trip counts versus weather (gender? age? usertype?)
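
One possible shape for the weather merge is sketched below. The weather table and its column names (`avg_temp_f`, `precip_in`) are hypothetical, since no weather source has been chosen yet, and the rentals are a handful of made-up trips.

```python
import pandas as pd

# Made-up rentals: one row per trip, with only the start time
rentals = pd.DataFrame({'starttime': pd.to_datetime([
    '2020-03-01 09:00', '2020-03-01 18:30', '2020-03-02 08:10',
    '2020-03-03 07:45', '2020-03-03 12:00', '2020-03-03 19:20'])})

# Hypothetical daily weather table; column names and values are assumptions
weather = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03']),
    'avg_temp_f': [45.0, 38.0, 55.0],
    'precip_in': [0.0, 0.4, 0.0],
})

# Count trips per calendar day (normalize() truncates each timestamp to midnight)
daily = (rentals.groupby(rentals['starttime'].dt.normalize())
                .size()
                .rename('rental_count')
                .rename_axis('date')
                .reset_index())

# Attach that day's weather; a left join keeps days with no weather record
merged = daily.merge(weather, on='date', how='left')
print(merged)
```

Aggregating to one row per day before merging keeps the join small, rather than attaching weather to every individual trip.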


## 4. Bikes, bikes, bikes

TBD - Intro section
- Most popular bikes - by count
- Most used bikes - by duration
- Count of bikes in service at any given time
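
A minimal sketch of the first two bullets, using a few made-up trips in place of the rentals dataframe. It assumes the `bikeid` and `tripduration` columns from the import step, with `tripduration` taken to be in seconds.

```python
import pandas as pd

# Made-up trips standing in for the rentals dataframe
trips = pd.DataFrame({
    'bikeid': [101, 102, 101, 103, 101, 102],
    'tripduration': [300, 1200, 450, 3600, 250, 900],   # assumed to be seconds
})

# One row per bike: number of trips and total time ridden
bike_stats = (trips.groupby('bikeid')['tripduration']
                   .agg(trip_count='count', total_duration='sum'))

most_popular = bike_stats['trip_count'].idxmax()     # bike with the most trips
most_used = bike_stats['total_duration'].idxmax()    # bike with the most riding time
print(bike_stats)
print(most_popular, most_used)
```

The two measures can disagree: here bike 101 has the most trips, while bike 103 has the longest total duration from a single long ride.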


## 5. What about stations?

TBD - Intro section
- Most popular stations by count (age, gender, usertype options / slider or other interactive to change between months)
- Least popular stations
- Most heavily traveled routes (if possible to get directions)
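
A rough sketch of ranking stations by rental count and attaching their names from a stations lookup. The trips are made up, and the station names and coordinates are placeholders. The column layout (`id`, `name`, `lat`, `lon`) follows the stations dataframe built during import.

```python
import pandas as pd

# Made-up trips; only the start station is needed for this count
trips = pd.DataFrame({'start_station_id': [72, 79, 72, 82, 72, 79]})

# Placeholder stations lookup with the same columns as the stations dataframe
stations = pd.DataFrame({
    'id': [72, 79, 82],
    'name': ['Station A', 'Station B', 'Station C'],
    'lat': [40.767, 40.719, 40.711],
    'lon': [-73.994, -74.007, -74.000],
})

# Count rentals per start station, join the station details, and rank
station_counts = (trips.groupby('start_station_id')
                       .size()
                       .rename('rental_count')
                       .reset_index()
                       .merge(stations, left_on='start_station_id', right_on='id')
                       .sort_values('rental_count', ascending=False))

print(station_counts[['name', 'rental_count']])
```

Reading the same ranking from the bottom (or sorting ascending) gives the least popular stations, and the `lat`/`lon` columns are what a map plot would need.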


## 6. So what does it all mean?

TBD - conclusion
- Recommendations
- Suggestions for other analysis