# Exploratory analysis of Indego City Bike trip data

This notebook contains some examples of data analysis using Python. We use the trip data from Philadelphia's Indego City Bike program. The data is released on a quarterly basis by the City of Philadelphia and is freely available as CSV files through the city's portal for the bike share program. We have collected the data into the `Indego-trip-data` folder, which has been zipped to keep the size small. Below, we read the files directly from the zip archive. The files are organized as

1. `indego-stations.csv` - Station names and IDs
2. `indego-trips-{yyyy}-q{n}.csv` - Trip data for year `{yyyy}` and quarter `{n}`, where `{n}` is between 1 and 4. We have data from the 2nd quarter of 2015 through the 4th quarter of 2019.
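
As a minimal sketch of how a CSV file can be read straight out of a zip archive, the example below builds a tiny archive in memory and reads it back. The member name and its contents are made up for illustration and are not taken from the real data set.

```python
import io
import zipfile

import pandas

# Build a tiny in-memory zip archive so the example is self-contained.
# The member name and its contents are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr(
        "Indego-trip-data/example-stations.csv",
        "station_id,station_name\n3004,Example A\n3005,Example B\n",
    )

# List the members of the archive, then read one member into a DataFrame
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    with archive.open(names[0]) as csv_file:
        stations = pandas.read_csv(csv_file)

print(names)
print(stations)
```

`namelist()` returns the full path of each member inside the archive, which is why a lookup by file name has to allow for a folder prefix.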

We start by reading and analyzing data from the first file `indego-trips-2015-q2.csv`.

### Read data into a pandas DataFrame

Pandas (https://pandas.pydata.org/) is an extremely popular data analysis library. We will use it extensively to analyze this data set. There are also other approaches to analyzing data of this type. We will note these methods at the end.

In [None]:
import os
import numpy
import zipfile
import pandas

# Setup the file path for zip file
data_folder = os.path.abspath('./data/')
    
# Read file from zip file
zip_file = os.path.join(data_folder, 'Indego-trip-data.zip')

def read_csv_from_zip(zf, fn):
    """Read a CSV file from a zip archive
    
    Parameters
    ----------
    zf : zipfile.ZipFile instance
        ZipFile instance
        
    fn : str
        String name for data file
        
    Returns
    -------
    pandas.DataFrame
        Contents of CSV read into a DataFrame
    """
    with zipfile.ZipFile(zf) as z:
        matching = [s for s in z.namelist() if fn in s]
        if not matching:
            raise FileNotFoundError('File {} not in zip archive'.format(fn))
        else:
            zipped_data_file = matching[0]
            # Read the first matching CSV file into a DataFrame
            return pandas.read_csv(z.open(zipped_data_file))

print(data_folder)
print(zip_file)

In [None]:
# Select the file for analysis
data_file = 'indego-trips-2015-q2.csv'
data_2015_q2 = read_csv_from_zip(zip_file, data_file)
type(data_2015_q2)

In [None]:
# First five entries in dataframe
data_2015_q2.head()

In [None]:
# Last five entries in dataframe
data_2015_q2.tail()

### Total number of trips

In [None]:
total_trips = len(data_2015_q2)
print(total_trips)

### Dataframe column headers

The `pandas.DataFrame` is a custom data structure. The DataFrame object comes with

* member variables (attributes) - Values associated with the object
* member functions (methods) - Functions that act on the data in the DataFrame

Member variables are referred to by `<object_name>.<member variable>` and functions are called with `<object_name>.<member function name>()`.
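
The difference can be sketched on a small made-up DataFrame. The column names mimic the trip data, but the values are invented for illustration.

```python
import pandas

# A small made-up DataFrame; the values are hypothetical
trips = pandas.DataFrame({
    "trip_id": [1, 2, 3],
    "start_station_id": [3004, 3005, 3004],
})

# Member variables (attributes) are accessed without parentheses
print(trips.shape)      # (number of rows, number of columns)
print(trips.columns)    # column labels
print(trips.dtypes)     # data type of each column

# Member functions (methods) are called with parentheses
print(trips.head(2))    # first two rows
print(trips.count())    # number of non-missing values per column
```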

In [None]:
print(data_2015_q2.columns)

### Summary statistics of numerical columns

`DataFrame.describe()` computes summary statistics only for the columns that contain numerical data.
All other columns are ignored unless you pass the argument `include='all'`.
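
A short sketch of the two behaviours, using a made-up DataFrame with one numerical and one text column. The column names `duration` and `rider_type` are assumptions for this example only.

```python
import pandas

# Hypothetical data: one numerical column and one text column
trips = pandas.DataFrame({
    "duration": [5, 12, 30],
    "rider_type": ["member", "casual", "member"],
})

# Default: only the numerical column is summarized
numeric_summary = trips.describe()

# include='all': text columns are summarized too (count, unique, top, freq)
full_summary = trips.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of a text column, show up as `NaN` in the combined table.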

In [None]:
data_2015_q2.describe()

### Slicing

Extracting a subset of data from the DataFrame.
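
One detail worth keeping in mind is that `.loc` selects by label and includes the end of a slice, while `.iloc` selects by position and excludes it. The sketch below uses a made-up DataFrame whose index labels deliberately differ from the row positions.

```python
import pandas

# Hypothetical data with index labels 10..13 (positions are 0..3)
trips = pandas.DataFrame(
    {"trip_id": [101, 102, 103, 104], "duration": [5, 12, 30, 8]},
    index=[10, 11, 12, 13],
)

# Label-based slice: rows labelled 10, 11 and 12 (end label included)
by_label = trips.loc[10:12, "trip_id"]

# Position-based slice: rows at positions 0 and 1 (end position excluded)
by_position = trips.iloc[0:2, 0]

print(by_label)
print(by_position)
```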

In [None]:
# Refer to a single data point using position
print(data_2015_q2.iloc[0,0])

In [None]:
# Refer to a single data point using the index and column name
print(data_2015_q2.loc[119555, "trip_id"])

In [None]:
# Refer to a row of data
print(data_2015_q2.loc[119555, :])

In [None]:
# Refer to a single column
print(data_2015_q2.loc[:, "trip_id"])
print(data_2015_q2["trip_id"])

In [None]:
# Extract a slice of data with some rows and columns
print(data_2015_q2.loc[119550:119555, "trip_id":"start_station_id"])

In [None]:
# First 10 rows of data
data_2015_q2.iloc[0:10,:]

In [None]:
# Extracting a column from a DataFrame creates a pandas.Series
start_times = data_2015_q2['start_time']
type(start_times)

### Handling time entries - datetime library

Pandas contains extensive capabilities and features for working with time series data. It builds on Python's datetime library and NumPy's datetime64 and timedelta64 types. In our data set, each trip has a start time and an end time.

In [None]:
# By default the time data is read in as strings.
# This is not a very useful format for dates and
# times, as we cannot do time operations on the data
type(data_2015_q2.iloc[0,2])

In [None]:
# Convert columns to datetime
# The following commands overwrite the columns of the existing DataFrame.
# If you do not want this, work on a copy of the DataFrame instead

data_2015_q2['start_time'] = pandas.to_datetime(data_2015_q2['start_time']) 
data_2015_q2['end_time'] = pandas.to_datetime(data_2015_q2['end_time']) 
type(data_2015_q2.iloc[0,2])
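
Once a column holds datetime values, its `.dt` accessor exposes the parts of every entry at once. The following is a small sketch on made-up timestamps, not on the real trip data.

```python
import pandas

# Hypothetical start times, read in as plain strings
trips = pandas.DataFrame({
    "start_time": ["2015-04-23 08:00:00", "2015-06-01 17:30:00"],
})

# Convert the strings to datetime values
trips["start_time"] = pandas.to_datetime(trips["start_time"])

# The .dt accessor works on the whole column
hours = trips["start_time"].dt.hour
months = trips["start_time"].dt.month
weekdays = trips["start_time"].dt.day_name()

print(hours.tolist())
print(months.tolist())
print(weekdays.tolist())
```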

### Using datetime - Number of trips in June

We now want to extract a subset of the data using the start_time column as a reference and calculate the number of trips in June.

In [None]:
# Datetime example
# Datetime library allows us to create standardized date and 
# time formats and perform arithmetic and logical operations

import datetime

t1 = datetime.datetime(2019, 12, 31, 10, 0, 0)
t2 = datetime.datetime(2020, 1, 1, 10, 0, 0)

print(t2 - t1)
print(t2 != t1)

In [None]:
# datetime(year, month, day, hour, minute, second, microsecond)
# Time arguments default to 0, so these are midnight on June 1 and July 1
june_1 = datetime.datetime(2015,6,1)
july_1 = datetime.datetime(2015,7,1)

# .loc functionality allows for logical expressions for indexing
# Here we use the & operator to keep all start_times from the start
# of June 1 (inclusive) up to the start of July 1 (exclusive)

data_2015_june = data_2015_q2.loc[(data_2015_q2['start_time'] >= june_1) & (data_2015_q2['start_time'] < july_1)]
len(data_2015_june)
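
When filtering on a date range, the boundary values deserve a check. The sketch below uses four made-up timestamps that sit right at the edges of June and compares a half-open interval with a selection by month number. The month-based version is only equivalent when the data covers a single year.

```python
import pandas

# Hypothetical start times right at the edges of June 2015
trips = pandas.DataFrame({
    "start_time": pandas.to_datetime([
        "2015-05-31 23:59:59",
        "2015-06-01 00:00:00",
        "2015-06-30 23:59:59",
        "2015-07-01 00:00:00",
    ])
})

# Half-open interval: the start is included, the end is excluded
june_start = pandas.Timestamp("2015-06-01")
july_start = pandas.Timestamp("2015-07-01")
in_june = (trips["start_time"] >= june_start) & (trips["start_time"] < july_start)
june_trips = trips.loc[in_june]

# Same selection using the month number from the .dt accessor
june_trips_by_month = trips.loc[trips["start_time"].dt.month == 6]

print(len(june_trips), len(june_trips_by_month))
```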

### Average number of trips in each month of the quarter
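
One possible approach, sketched on made-up data, is to count the trips that start in each calendar month and then average those monthly totals. This reads the heading as "average of the monthly counts", which is an assumption; a per-day average would need a different grouping.

```python
import pandas

# Hypothetical start times spread over three months
trips = pandas.DataFrame({
    "start_time": pandas.to_datetime([
        "2015-04-23 08:00:00", "2015-04-30 09:00:00",
        "2015-05-02 10:00:00", "2015-05-15 11:00:00", "2015-05-20 12:00:00",
        "2015-06-01 13:00:00",
    ])
})

# Number of trips that start in each calendar month
trips_per_month = trips.groupby(trips["start_time"].dt.to_period("M")).size()

# Average of the monthly totals
average_trips_per_month = trips_per_month.mean()

print(trips_per_month)
print(average_trips_per_month)
```

Grouping by `dt.to_period("M")` rather than `dt.month` keeps the same month of different years apart, which matters once several years of data are combined.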

### What are the shortest, longest, and average trip lengths?
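
A minimal sketch, assuming the trip length is taken as the difference between `end_time` and `start_time` after both have been converted to datetime. The values below are made up for illustration.

```python
import pandas

# Hypothetical trips lasting 5, 30 and 25 minutes
trips = pandas.DataFrame({
    "start_time": pandas.to_datetime([
        "2015-04-23 08:00:00", "2015-04-23 09:00:00", "2015-04-23 10:00:00",
    ]),
    "end_time": pandas.to_datetime([
        "2015-04-23 08:05:00", "2015-04-23 09:30:00", "2015-04-23 10:25:00",
    ]),
})

# Subtracting two datetime columns gives a column of time differences
trip_length = trips["end_time"] - trips["start_time"]

shortest = trip_length.min()
longest = trip_length.max()
average = trip_length.mean()

print(shortest, longest, average)
```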

### Number of trips from a given station

`pandas.Series` has a member function `value_counts()` that counts how many times each unique value occurs in the series. It returns a Series object indexed by the values being counted, sorted from the most frequent to the least frequent.
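
A small sketch of this behaviour on a made-up series of station IDs:

```python
import pandas

# Hypothetical starting station IDs for six trips
start_station_ids = pandas.Series([3004, 3005, 3004, 3010, 3004, 3005])

# Index: the station IDs; values: number of trips from each station
trip_counts = start_station_ids.value_counts()

print(trip_counts)
print(trip_counts.loc[3004])    # trips that started at station 3004
print(trip_counts.index[0])     # station with the most trips
```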

In [None]:
# Extract the starting station
start_station_ids = data_2015_q2.loc[:,"start_station_id"]

# Return a Series containing counts of unique values.
start_station_tripcount = start_station_ids.value_counts()
print(start_station_tripcount)

In [None]:
# Use the index of value_counts() to look at number of trips from a given station
start_station_tripcount.loc[3065]

In [None]:
# Writing function to get the number of trips from a given station
def n_trips_from(df, station_id):
    """Calculate the number of trips from a station
    
    Parameters
    ----------
    df : pandas.DataFrame
        Pandas dataframe containing trip data
        
    station_id: int
        ID of the starting station
    
    Returns
    -------
    n_trips: int
        Number of trips starting at station_id
    """
    return df["start_station_id"].value_counts().loc[station_id]

n_trips_from(data_2015_q2, 3004)

### Adding station names based on station IDs

The dataframe currently stores only the station IDs. There is a separate CSV file that contains the mapping between the station ID and station name. After reading the data we can use the `map` member function of a pandas.Series to create a new column where the value of the column is read from the map.
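
The lookup can be sketched on made-up data as follows. The station names are invented, and ID 3999 is deliberately missing from the lookup table to show that unmatched IDs become `NaN`.

```python
import pandas

# Hypothetical station table
stations = pandas.DataFrame({
    "station_id": [3004, 3005],
    "station_name": ["Example A", "Example B"],
})

# Series of names indexed by station ID; this acts as the lookup table
lookup = stations.set_index("station_id")["station_name"]

# Hypothetical trips; station 3999 has no entry in the lookup table
trips = pandas.DataFrame({"start_station_id": [3005, 3004, 3999]})
trips["start_station_name"] = trips["start_station_id"].map(lookup)

print(trips)
```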

In [None]:
station_names = read_csv_from_zip(zip_file, 'indego-stations.csv')
station_names.head()

In [None]:
# Need to have the station_names dataframe indexed by the station_id for the map to work
station_names = station_names.set_index(station_names['station_id'])
station_names.head()

In [None]:
# Use the map function
data_2015_q2['start_station_name'] = data_2015_q2['start_station_id'].map(station_names['station_name'])
data_2015_q2[['start_station_id','start_station_name']].tail()

## When are people riding?

At what hour of the day do most rides start? To answer this question we need to create a histogram of the `start_time` data. Note that we have already converted this column to datetime format, so we can easily extract the hour from each datetime entry and build the histogram.


In [None]:
# Pure pandas approach
# This indirectly calls matplotlib in the background
%matplotlib inline

data_2015_q2['start_time'].apply(lambda x: x.hour).hist(bins=24)

In [None]:
# Extracting data using numpy and then plotting

%matplotlib inline

import numpy
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

# Empty list to hold hour data
data_2015_q2_hr = []

# Loop to extract the starting hour of each trip
for xs in data_2015_q2['start_time']:
    data_2015_q2_hr.append(xs.hour)
    
# Converting list to numpy    
data_2015_q2_hr = numpy.array(data_2015_q2_hr)

# Computing the histogram using numpy
histogram, bin_edges = numpy.histogram(data_2015_q2_hr,bins=numpy.linspace(-0.50, 23.50, 25))
print(histogram, bin_edges)
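
The bin edges run from -0.5 to 23.5 in steps of 1, so every bin is centered on a whole hour. The sketch below checks this on a handful of made-up hour values and compares the result with `numpy.bincount`, which counts non-negative integers directly.

```python
import numpy

# Hypothetical starting hours for seven trips
hours = numpy.array([0, 0, 8, 8, 8, 17, 23])

# 25 edges give 24 bins of width 1, centered on 0, 1, ..., 23
bin_edges = numpy.linspace(-0.5, 23.5, 25)
counts, edges = numpy.histogram(hours, bins=bin_edges)
bin_centers = (edges[:-1] + edges[1:]) / 2

# For integer data, bincount gives the same counts without explicit edges
counts_from_bincount = numpy.bincount(hours, minlength=24)

print(bin_centers)
print(counts)
```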

In [None]:
# Use matplotlib to create a figure and plot the histogram

# Generate figure
fig = plt.figure()

# Generate a grid object
gs = gridspec.GridSpec(1,1)

# Add an axes to the grid
ax = fig.add_subplot(gs[0,0])

# Plot the histogram
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
ax.bar(bin_centers, histogram)

In [None]:
# You can also compute and plot the histogram in one matplotlib command.
# ax.hist bins the raw hour values itself, so no separate numpy.histogram call is needed

fig = plt.figure()
gs = gridspec.GridSpec(1,1)
ax = fig.add_subplot(gs[0,0])
histogram, bin_edges, patches = ax.hist(data_2015_q2_hr,bins=numpy.linspace(-0.50, 23.50, 25))

In [None]:
# You can also use the pandas .dt.hour.value_counts() method to get the counts for each hour
print(data_2015_q2['start_time'].dt.hour.value_counts())
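
Note that `value_counts()` orders its result by count, not by hour. A short sketch on made-up timestamps shows how `sort_index()` restores the hour order and how `reindex` can fill in hours with no trips.

```python
import pandas

# Hypothetical start times: three trips at 17h, two at 8h, one at 12h
start_times = pandas.Series(pandas.to_datetime([
    "2015-04-23 17:05:00", "2015-04-23 08:10:00", "2015-04-24 17:45:00",
    "2015-04-24 17:20:00", "2015-04-25 08:30:00", "2015-04-25 12:00:00",
]))

# Sorted by count, most frequent hour first
by_count = start_times.dt.hour.value_counts()

# Sorted by hour of the day
by_hour = by_count.sort_index()

# All 24 hours, with 0 for hours in which no trip started
all_hours = by_hour.reindex(range(24), fill_value=0)

print(by_count)
print(by_hour)
```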

# Analyzing several files

In [None]:
def data_file(year, quarter):
    return 'indego-trips-{:d}-q{:d}.csv'.format(year, quarter)

# Collect each quarter's DataFrame in a list and concatenate once at the end.
# (DataFrame.append was removed in pandas 2.0.)
frames = []
nrows = 0

for y in [2015, 2016, 2017, 2018, 2019]:
    for q in [1,2,3,4]:
        # There is no data for the 1st quarter of 2015
        if y == 2015 and q < 2:
            continue
        df = read_csv_from_zip(zip_file,data_file(y,q))
        nrows += df.shape[0]
        frames.append(df)
        print(nrows)

data = pandas.concat(frames)
print(data.shape)
print(nrows)

In [None]:
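# Option 1: build one list of file names per year, then flatten the nested list
files = [ [data_file(y,q) for q in [1,2,3,4] if not (y==2015 and q == 1)] 
          for y in [2015,2016,2017,2018,2019] ]
files_flattened = [item for sublist in files for item in sublist]
# print(files_flattened)

# Option 2: build the flat list directly with a single comprehension
# (year is the outer loop, so the files are in chronological order)
files = [ data_file(y,q) for y in [2015,2016,2017,2018,2019]
          for q in [1,2,3,4]
          if not (y==2015 and q == 1) ]

# Read every file and stack the DataFrames into one
dataframes = [read_csv_from_zip(zip_file,f) for f in files]
data = pandas.concat(dataframes)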
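
By default `pandas.concat` keeps the row index of each input, so the combined DataFrame can contain repeated index labels. The sketch below shows this on two made-up frames and uses `ignore_index=True` to renumber the rows.

```python
import pandas

# Two hypothetical quarters of trip data
q2 = pandas.DataFrame({"trip_id": [1, 2]})
q3 = pandas.DataFrame({"trip_id": [3, 4, 5]})

# Default: each frame keeps its own index labels
stacked = pandas.concat([q2, q3])

# ignore_index=True renumbers the rows from 0
renumbered = pandas.concat([q2, q3], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels matter for label-based selection: `.loc` with a repeated label returns every matching row rather than a single one.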


In [None]:
data.shape

In [None]:
counts = data_2015_q2.groupby(['start_station_id','start_station_name']).size()
counts.sort_values(ascending=False)[:5]

In [None]:
def top_5_start_stations(df, key='start_station_id'):
    """Return the five (key, start_station_name) pairs with the most trips"""
    counts = df.groupby([key,'start_station_name']).size()
    return counts.sort_values(ascending=False)[:5]
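
The grouping step can be sketched on a small made-up DataFrame. It assumes the frame already has a `start_station_name` column, which for the trip data means the station names must be mapped in first. The example keeps the top two entries instead of five because the data is so small.

```python
import pandas

# Hypothetical trips with station IDs and invented station names
trips = pandas.DataFrame({
    "start_station_id": [3004, 3005, 3004, 3010, 3004, 3005],
    "start_station_name": ["Example A", "Example B", "Example A",
                           "Example C", "Example A", "Example B"],
})

# Number of trips for each (ID, name) pair
counts = trips.groupby(["start_station_id", "start_station_name"]).size()

# Two busiest starting stations
top_2 = counts.sort_values(ascending=False).head(2)

print(top_2)
```

Grouping by both columns keeps the station name next to its ID in the result, which is easier to read than the ID alone.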