# Exploratory Data Analysis and Visualization

This notebook demonstrates univariate analysis of the Citi Bike trips dataset.

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize

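# Setting some defaults
sns.set(rc={'figure.figsize': (12, 8)})
# Pass font_scale to set_context directly: a separate sns.set(font_scale=...) call
# is overridden by a later set_context call, which resets the scale to 1
sns.set_context("poster", font_scale=0.5)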

# Read the data

In [2]:
df = pd.read_csv("../../data/citibike_sample.csv.zip")

In [3]:
# Convert the timestamp columns back to datetimes.
# A CSV stores every value as text, so the datetime dtype is lost when a dataframe is written to CSV.
# read_csv infers numeric columns again on load, but by default it leaves timestamps as strings.
df['starttime'] = pd.to_datetime(df['starttime'])
df['stoptime'] = pd.to_datetime(df['stoptime'])
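
To see why this conversion is needed, here is a minimal sketch of a CSV round trip. The two-row frame `demo` and the in-memory buffer are made up for illustration and are not part of the Citi Bike data. `parse_dates` is shown as one possible alternative to calling `pd.to_datetime` after loading.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the trips data (assumption)
demo = pd.DataFrame({
    "starttime": pd.to_datetime(["2020-10-27 16:51:37.994", "2020-10-09 19:46:12.616"]),
    "tripduration": [1309, 1553],
})

# Write to an in-memory CSV and read it back: the datetime dtype is not preserved
csv_text = demo.to_csv(index=False)
roundtrip = pd.read_csv(io.StringIO(csv_text))
print(roundtrip.dtypes)

# Option 1: convert the column after loading
converted = pd.to_datetime(roundtrip["starttime"])

# Option 2: ask read_csv to parse the column while loading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["starttime"])
print(parsed.dtypes)
```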

View the first few rows.

In [4]:
df.head()

| | tripduration | starttime | stoptime | start station id | start station name | start station latitude | start station longitude | end station id | end station name | end station latitude | ... | usertype | birth year | gender | tripduration_minutes | starttime_dayname | stoptime_dayname | starttime_hour | stoptime_hour | age | distance_miles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1309 | 2020-10-27 16:51:37.994 | 2020-10-27 17:13:27.103 | 3457 | E 58 St & Madison Ave | 40.763026 | -73.972095 | 3534 | Frederick Douglass Blvd & W 117 St | 40.805159 | ... | Subscriber | 1983 | male | 21.816667 | Tuesday | Tuesday | 16 | 17 | 37 | 3.047236 |
| 1 | 1553 | 2020-10-09 19:46:12.616 | 2020-10-09 20:12:06.366 | 3815 | E 51 St & 2 Ave | 40.755293 | -73.967641 | 3641 | Broadway & W 25 St | 40.742869 | ... | Subscriber | 1996 | male | 25.883333 | Friday | Friday | 19 | 20 | 24 | 1.418907 |
| 2 | 437 | 2020-10-10 15:19:17.455 | 2020-10-10 15:26:34.470 | 3096 | Union Ave & N 12 St | 40.71924 | -73.95242 | 3086 | Graham Ave & Conselyea St | 40.715143 | ... | Subscriber | 1984 | female | 7.283333 | Saturday | Saturday | 15 | 15 | 36 | 0.502512 |
| 3 | 1490 | 2020-10-28 01:30:42.644 | 2020-10-28 01:55:33.210 | 3821 | Evergreen Ave & Noll St | 40.70106 | -73.93318 | 3058 | Lewis Ave & Kosciuszko St | 40.692371 | ... | Subscriber | 1989 | male | 24.833333 | Wednesday | Wednesday | 1 | 1 | 31 | 0.633161 |
| 4 | 1178 | 2020-10-11 16:48:46.773 | 2020-10-11 17:08:24.976 | 3417 | Baltic St & 5 Ave | 40.679577 | -73.97855 | 3300 | Prospect Park West & 8 St | 40.665147 | ... | Subscriber | 1978 | male | 19.633333 | Sunday | Sunday | 16 | 17 | 42 | 1.002208 |


View a description of all numeric columns.

In [5]:
df.describe()

| | tripduration | start station id | start station latitude | start station longitude | end station id | end station latitude | end station longitude | bikeid | birth year | tripduration_minutes | starttime_hour | stoptime_hour | age | distance_miles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| mean | 1039.44631 | 2114.45744 | 40.741888 | -73.976341 | 2111.629844 | 40.74149 | -73.976464 | 37653.471902 | 1981.862754 | 17.324105 | 14.233238 | 14.41212 | 38.137246 | 1.367531 |
| std | 875.726452 | 1555.662785 | 0.036212 | 0.02366 | 1556.525668 | 0.036103 | 0.023743 | 9111.587283 | 12.243732 | 14.595441 | 4.66727 | 4.745886 | 12.243732 | 1.097055 |
| min | 180.0 | 72.0 | 40.6554 | -74.025353 | 72.0 | 40.6554 | -74.063913 | 14529.0 | 1921.0 | 3.0 | 0.0 | 0.0 | 16.0 | 0.0 |
| 25% | 464.0 | 435.0 | 40.717548 | -73.993836 | 434.0 | 40.717488 | -73.993929 | 32940.0 | 1970.0 | 7.733333 | 11.0 | 11.0 | 28.0 | 0.610297 |
| 50% | 792.0 | 3115.0 | 40.741444 | -73.981281 | 3112.0 | 40.739448 | -73.98142 | 39491.0 | 1985.0 | 13.2 | 15.0 | 15.0 | 35.0 | 1.057515 |
| 75% | 1350.0 | 3524.0 | 40.765005 | -73.960241 | 3522.0 | 40.765005 | -73.960241 | 45318.0 | 1992.0 | 22.5 | 18.0 | 18.0 | 50.0 | 1.814084 |
| max | 10792.0 | 4229.0 | 40.852252 | -73.884308 | 4229.0 | 40.852252 | -73.884308 | 48702.0 | 2004.0 | 179.866667 | 23.0 | 23.0 | 99.0 | 13.152597 |


# Categorical Variables

First, let's visualize the distributions of two categorical variables in this dataset: `usertype` and `gender`.

## User Type

`usertype` is a categorical variable containing two categories: Subscribers and Customers.

Subscribers are members who have an annual Citi Bike subscription. Customers are those who purchase single-ride or day passes.

### Histogram (Counts)

Make a histogram showing the total number of trips contained in the dataset, segmented by `usertype`.

In [6]:
# your code here


### Histogram (Percentage)

Make a histogram showing the *percentage* of trips contained in the dataset, segmented by `usertype`.

In [7]:
# your code here


## Gender

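`gender` is a categorical variable with three categories as defined by Citi Bike: 0 = unknown, 1 = male, 2 = female. In this sample the codes appear as text labels (for example, `male` and `female` in the `df.head()` output above).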

### Histogram (Counts)

Make a histogram showing the total number of trips contained in the dataset, segmented by `gender`.

In [8]:
# your code here


### Histogram (Percentage)

Make a histogram showing the percentage of trips contained in the dataset, segmented by `gender`.

In [9]:
# your code here
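
It can help to cross-check a percentage histogram against the raw numbers. The sketch below does that with `value_counts(normalize=True)` on a synthetic `trips` frame (an assumption). The label `"unknown"` and the 60/30/10 split are placeholders, since only `male` and `female` are visible in the sample output.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (assumption); "unknown" is a placeholder label
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "gender": rng.choice(["male", "female", "unknown"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Share of trips per category, expressed in percent
gender_pct = trips["gender"].value_counts(normalize=True) * 100
print(gender_pct)
```

The values printed here should match the bar heights of a `stat="percent"` histogram drawn from the same column.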


# Numerical Variables

Next, let's visualize the distributions of three numerical variables in this dataset: `tripduration_minutes`, `distance_miles`, and `age`.

For now, we will analyze just a single variable at a time.

## Trip Duration

`tripduration_minutes` is a continuous numeric variable that contains the duration of each trip in minutes.

### Histogram

Make a histogram representing trip duration. Use percentage rather than counts, and set the histogram's bin width to be 1 minute.

In [10]:
# your code here


Make another histogram representing trip duration, similar to above but with a bin width of 5 minutes.

In [11]:
# your code here


Make yet another histogram representing trip duration, but this time with a bin width of 30 minutes.

In [12]:
# your code here


Make a fourth and final histogram representing trip duration, but this time use Doane's formula to set the bin width automatically.

In [13]:
# your code here


Which bin width do you think is the best for visualizing trip duration?

### Density Plot

Make a combined histogram and density plot representing trip duration, using the `kde=True` parameter. Set bin width to be 5 minutes.

In [14]:
# your code here


Make a density plot representing trip duration, grouped by `usertype`.

In [15]:
# your code here


Make a density plot representing trip duration, grouped by `gender`.

In [16]:
# your code here


What do you think is a better choice for visualizing trip duration? Histogram or density plot?

### Box Plot

Make a box plot of trip duration that shows the lower and upper fences (i.e., the "whiskers"). You can leave the default parameters as they are for now.

In [17]:
# your code here
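
To make the fences concrete, the sketch below computes them by hand with the usual 1.5 × IQR rule, which is also the default whisker setting (`whis=1.5`) in seaborn and matplotlib. The ten durations are invented for illustration. Note that the drawn whiskers stop at the most extreme data point inside each fence, not at the fence value itself.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Ten made-up trip durations in minutes (assumption), with one extreme value
durations = pd.Series([5, 7, 8, 10, 12, 13, 15, 18, 22, 120],
                      dtype=float, name="tripduration_minutes")

# Quartiles and interquartile range
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences: points beyond them are drawn individually as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())

fig, ax = plt.subplots()
sns.boxplot(x=durations, ax=ax)
```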


Do you see any potential outliers?

### Strip Plot

Make a strip plot representing trip duration. Set `alpha` to 0.1 and `size` to 5.

In [18]:
# your code here
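
A sketch of the call on a synthetic `trips` frame (an assumption standing in for `df`). `alpha` controls marker transparency, so dense regions show up darker, and `size` is the marker size. `sns.stripplot` adds random jitter along the other axis by default.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df (assumption): right-skewed durations in minutes
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "tripduration_minutes": rng.lognormal(mean=2.6, sigma=0.7, size=5000).clip(3, 180),
})

fig, ax = plt.subplots()
# Low alpha makes overplotting visible as darker bands
sns.stripplot(data=trips, x="tripduration_minutes", alpha=0.1, size=5, ax=ax)
```

For the grouped variants, adding a categorical `y` (for example `y="usertype"`) should produce one strip per category.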


Make another strip plot representing trip duration, grouped by `usertype`.

In [19]:
# your code here


Make another strip plot representing trip duration, grouped by `gender`.

In [20]:
# your code here


### Swarm Plot

Make a swarm plot representing trip duration. Use only a sample of 2,000 trips, since a swarm plot cannot handle all 500,000 trips at once.

In [21]:
# your code here
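
A minimal sketch of the sampling step followed by the plot, using a synthetic `trips` frame (an assumption standing in for `df`). `DataFrame.sample` with a fixed `random_state` keeps the subset reproducible. The small marker `size` is an arbitrary choice to reduce crowding, since a swarm plot places every point without overlap.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df (assumption): 10,000 right-skewed durations
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "tripduration_minutes": rng.lognormal(mean=2.6, sigma=0.7, size=10_000).clip(3, 180),
})

# Draw a reproducible random subset of 2,000 trips
sample = trips.sample(n=2000, random_state=0)

fig, ax = plt.subplots()
# Small markers leave more room for the non-overlapping layout
sns.swarmplot(data=sample, x="tripduration_minutes", size=2, ax=ax)
```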


Make a swarm plot representing trip duration, grouped by `usertype`.

In [22]:
# your code here


Make a swarm plot representing trip duration, grouped by `gender`.

In [23]:
# your code here


### Violin Plot

Make a violin plot representing trip duration.

In [24]:
# your code here


Make a violin plot representing trip duration, grouped by `gender` and further split by `usertype`.

In [25]:
# your code here
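
One possible reading of "grouped and split" is a violin per `gender` with `usertype` as `hue` and `split=True`, so each violin shows one user type on each half. The sketch below assumes a synthetic `trips` frame in place of `df`. The `"unknown"` label and the uniform category mix are placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for df (assumption): categories are assigned at random
rng = np.random.default_rng(0)
n = 3000
trips = pd.DataFrame({
    "gender": rng.choice(["male", "female", "unknown"], size=n),
    "usertype": rng.choice(["Subscriber", "Customer"], size=n),
    "tripduration_minutes": rng.lognormal(2.6, 0.7, size=n).clip(3, 180),
})

fig, ax = plt.subplots()
# split=True puts the two hue levels on opposite halves of each violin
sns.violinplot(data=trips, x="gender", y="tripduration_minutes",
               hue="usertype", split=True, ax=ax)
```

Splitting works naturally here because `usertype` has exactly two levels.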
