# By: Thomas Isola
## Class: PUI 2018
## HW #6 Assignment 4

**Assignment 4 Goal**:  
- Test whether the trip duration distributions of bikers who ride during the day and at night differ.  
- Use 3 tests: KS, Pearson's, Spearman's.  
- Use the scipy.stats functions scipy.stats.ks_2samp, scipy.stats.pearsonr, scipy.stats.spearmanr.  
- For the KS test, run it on the entire dataset and on a subset 200 times smaller.  
- Choose a single significance threshold for the whole exercise.  
- For each test, phrase the Null Hypothesis in words, describe the return of the scipy function used, and state the result in terms of rejection of the Null.

In [1]:
# Import packages
import pylab as pl
import pandas as pd
import numpy as np
import os
import scipy.stats
# Import the CitiBike data downloader
from getCitiBikeCSV import getCitiBikeCSV

%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Statistical Test Null Hypotheses
For all tests, I will use a significance level of $\alpha = 0.05$.
### 2 sample K-S Test
**H0**: The two samples (daytime trip duration and nighttime trip duration) come from the same continuous distribution.
### Pearson's test for correlation
**H0**: The two samples (daytime trip duration and nighttime trip duration) are not linearly correlated.
### Spearman's test for correlation
**H0**: The two samples (daytime trip duration and nighttime trip duration) have no monotonic (rank) correlation.
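
All three scipy functions named in the goal return a test statistic together with a p-value, and the decision rule is the same each time: reject H0 when the p-value is below $\alpha$. The following is a minimal sketch of that rule on synthetic data. The `decide` helper, the `ALPHA` constant and the two exponential samples are assumptions made for illustration and are not part of the assignment or the CitiBike data.

```python
import numpy as np
import scipy.stats

ALPHA = 0.05  # single significance threshold for every test


def decide(p_value, alpha=ALPHA):
    """Hypothetical helper: turn a p-value into a decision about H0."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Synthetic stand-ins for two trip duration samples (not CitiBike data)
rng = np.random.RandomState(0)
sample_a = rng.exponential(scale=600, size=500)
sample_b = rng.exponential(scale=900, size=500)

# Each function returns a (statistic, p-value) pair
ks_stat, ks_p = scipy.stats.ks_2samp(sample_a, sample_b)
pearson_r, pearson_p = scipy.stats.pearsonr(sample_a, sample_b)
spearman_rho, spearman_p = scipy.stats.spearmanr(sample_a, sample_b)

for name, stat, p in [("K-S", ks_stat, ks_p),
                      ("Pearson", pearson_r, pearson_p),
                      ("Spearman", spearman_rho, spearman_p)]:
    print("{}: statistic={:.4f}, p-value={:.3g} -> {}".format(name, stat, p, decide(p)))
```

Using one helper for all three tests keeps the threshold in a single place, which matches the requirement to pick one significance level for the whole exercise.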

In [2]:
# Download desired months of Citibike data

# First month (colder month)
datestring1 = '201602'
getCitiBikeCSV(datestring1)

Downloading 201602
file in place, you can continue


In [3]:
# Second month (warmer month)
datestring2 = '201608'
getCitiBikeCSV(datestring2)

Downloading 201608
file in place, you can continue


In [4]:
# Read the downloaded CSV files into pandas DataFrames
df1 = pd.read_csv(os.path.join(os.getenv("PUIDATA"), datestring1 + '-citibike-tripdata.csv'))
df2 = pd.read_csv(os.path.join(os.getenv("PUIDATA"), datestring2 + '-citibike-tripdata.csv'))

In [5]:
# Combine the dataframes
df3 = pd.concat([df1, df2], ignore_index=True)

In [6]:
# Display the columns
df3.columns

Index(['tripduration', 'starttime', 'stoptime', 'start station id',
       'start station name', 'start station latitude',
       'start station longitude', 'end station id', 'end station name',
       'end station latitude', 'end station longitude', 'bikeid', 'usertype',
       'birth year', 'gender'],
      dtype='object')

In [7]:
# Drop the unnecessary columns
df3.drop(['stoptime','start station id','start station name','start station latitude',
         'start station longitude','end station id','end station name','end station latitude',
          'end station longitude','bikeid','usertype','birth year','gender'], axis=1, inplace=True)
df3.head()

|   | tripduration | starttime |
|---|---|---|
| 0 | 461 | 2/1/2016 00:00:08 |
| 1 | 297 | 2/1/2016 00:00:56 |
| 2 | 280 | 2/1/2016 00:01:00 |
| 3 | 662 | 2/1/2016 00:01:00 |
| 4 | 355 | 2/1/2016 00:01:41 |


In [8]:
# Filter the data by day and night
# For this assignment, I am using 6AM - 6PM as day and otherwise it is night
df3['Day'] = np.where(((df3.starttime > '2/1/2016 06:00:00') & (df3.starttime < '2/1/2016 18:00:00')) 
                      | ((df3.starttime > '2/2/2016 06:00:00') & (df3.starttime < '2/2/2016 18:00:00'))
                      | ((df3.starttime > '2/3/2016 06:00:00') & (df3.starttime < '2/3/2016 18:00:00'))
                      | ((df3.starttime > '2/4/2016 06:00:00') & (df3.starttime < '2/4/2016 18:00:00'))
                      | ((df3.starttime > '2/5/2016 06:00:00') & (df3.starttime < '2/5/2016 18:00:00'))
                      | ((df3.starttime > '2/6/2016 06:00:00') & (df3.starttime < '2/6/2016 18:00:00'))
                      | ((df3.starttime > '2/7/2016 06:00:00') & (df3.starttime < '2/7/2016 18:00:00'))
                      | ((df3.starttime > '2/8/2016 06:00:00') & (df3.starttime < '2/8/2016 18:00:00'))
                      | ((df3.starttime > '2/9/2016 06:00:00') & (df3.starttime < '2/9/2016 18:00:00'))
                      | ((df3.starttime > '2/10/2016 06:00:00') & (df3.starttime < '2/10/2016 18:00:00'))
                      | ((df3.starttime > '2/11/2016 06:00:00') & (df3.starttime < '2/11/2016 18:00:00'))
                      | ((df3.starttime > '2/12/2016 06:00:00') & (df3.starttime < '2/12/2016 18:00:00'))
                      | ((df3.starttime > '2/13/2016 06:00:00') & (df3.starttime < '2/13/2016 18:00:00'))
                      | ((df3.starttime > '2/14/2016 06:00:00') & (df3.starttime < '2/14/2016 18:00:00'))
                      | ((df3.starttime > '2/15/2016 06:00:00') & (df3.starttime < '2/15/2016 18:00:00'))
                      | ((df3.starttime > '2/16/2016 06:00:00') & (df3.starttime < '2/16/2016 18:00:00'))
                      | ((df3.starttime > '2/17/2016 06:00:00') & (df3.starttime < '2/17/2016 18:00:00'))
                      | ((df3.starttime > '2/18/2016 06:00:00') & (df3.starttime < '2/18/2016 18:00:00'))
                      | ((df3.starttime > '2/19/2016 06:00:00') & (df3.starttime < '2/19/2016 18:00:00'))
                      | ((df3.starttime > '2/20/2016 06:00:00') & (df3.starttime < '2/20/2016 18:00:00'))
                      | ((df3.starttime > '2/21/2016 06:00:00') & (df3.starttime < '2/21/2016 18:00:00'))
                      | ((df3.starttime > '2/22/2016 06:00:00') & (df3.starttime < '2/22/2016 18:00:00'))
                      | ((df3.starttime > '2/23/2016 06:00:00') & (df3.starttime < '2/23/2016 18:00:00'))
                      | ((df3.starttime > '2/24/2016 06:00:00') & (df3.starttime < '2/24/2016 18:00:00'))
                      | ((df3.starttime > '2/25/2016 06:00:00') & (df3.starttime < '2/25/2016 18:00:00'))
                      | ((df3.starttime > '2/26/2016 06:00:00') & (df3.starttime < '2/26/2016 18:00:00'))
                      | ((df3.starttime > '2/27/2016 06:00:00') & (df3.starttime < '2/27/2016 18:00:00'))
                      | ((df3.starttime > '2/28/2016 06:00:00') & (df3.starttime < '2/28/2016 18:00:00'))
                      | ((df3.starttime > '2/29/2016 06:00:00') & (df3.starttime < '2/29/2016 18:00:00'))
                      | ((df3.starttime > '8/1/2016 06:00:00') & (df3.starttime < '8/1/2016 18:00:00'))
                      | ((df3.starttime > '8/2/2016 06:00:00') & (df3.starttime < '8/2/2016 18:00:00'))
                      | ((df3.starttime > '8/3/2016 06:00:00') & (df3.starttime < '8/3/2016 18:00:00'))
                      | ((df3.starttime > '8/4/2016 06:00:00') & (df3.starttime < '8/4/2016 18:00:00'))
                      | ((df3.starttime > '8/5/2016 06:00:00') & (df3.starttime < '8/5/2016 18:00:00'))
                      | ((df3.starttime > '8/6/2016 06:00:00') & (df3.starttime < '8/6/2016 18:00:00'))
                      | ((df3.starttime > '8/7/2016 06:00:00') & (df3.starttime < '8/7/2016 18:00:00'))
                      | ((df3.starttime > '8/8/2016 06:00:00') & (df3.starttime < '8/8/2016 18:00:00'))
                      | ((df3.starttime > '8/9/2016 06:00:00') & (df3.starttime < '8/9/2016 18:00:00'))
                      | ((df3.starttime > '8/10/2016 06:00:00') & (df3.starttime < '8/10/2016 18:00:00'))
                      | ((df3.starttime > '8/11/2016 06:00:00') & (df3.starttime < '8/11/2016 18:00:00'))
                      | ((df3.starttime > '8/12/2016 06:00:00') & (df3.starttime < '8/12/2016 18:00:00'))
                      | ((df3.starttime > '8/13/2016 06:00:00') & (df3.starttime < '8/13/2016 18:00:00'))
                      | ((df3.starttime > '8/14/2016 06:00:00') & (df3.starttime < '8/14/2016 18:00:00'))
                      | ((df3.starttime > '8/15/2016 06:00:00') & (df3.starttime < '8/15/2016 18:00:00'))
                      | ((df3.starttime > '8/16/2016 06:00:00') & (df3.starttime < '8/16/2016 18:00:00'))
                      | ((df3.starttime > '8/17/2016 06:00:00') & (df3.starttime < '8/17/2016 18:00:00'))
                      | ((df3.starttime > '8/18/2016 06:00:00') & (df3.starttime < '8/18/2016 18:00:00'))
                      | ((df3.starttime > '8/19/2016 06:00:00') & (df3.starttime < '8/19/2016 18:00:00'))
                      | ((df3.starttime > '8/20/2016 06:00:00') & (df3.starttime < '8/20/2016 18:00:00'))
                      | ((df3.starttime > '8/21/2016 06:00:00') & (df3.starttime < '8/21/2016 18:00:00'))
                      | ((df3.starttime > '8/22/2016 06:00:00') & (df3.starttime < '8/22/2016 18:00:00'))
                      | ((df3.starttime > '8/23/2016 06:00:00') & (df3.starttime < '8/23/2016 18:00:00'))
                      | ((df3.starttime > '8/24/2016 06:00:00') & (df3.starttime < '8/24/2016 18:00:00'))
                      | ((df3.starttime > '8/25/2016 06:00:00') & (df3.starttime < '8/25/2016 18:00:00'))
                      | ((df3.starttime > '8/26/2016 06:00:00') & (df3.starttime < '8/26/2016 18:00:00'))
                      | ((df3.starttime > '8/27/2016 06:00:00') & (df3.starttime < '8/27/2016 18:00:00'))
                      | ((df3.starttime > '8/28/2016 06:00:00') & (df3.starttime < '8/28/2016 18:00:00'))
                      | ((df3.starttime > '8/29/2016 06:00:00') & (df3.starttime < '8/29/2016 18:00:00'))
                      ,"Day","Night")
df3.head()

|   | tripduration | starttime | Day |
|---|---|---|---|
| 0 | 461 | 2/1/2016 00:00:08 | Night |
| 1 | 297 | 2/1/2016 00:00:56 | Night |
| 2 | 280 | 2/1/2016 00:01:00 | Night |
| 3 | 662 | 2/1/2016 00:01:00 | Night |
| 4 | 355 | 2/1/2016 00:01:41 | Night |
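
The filter in `In [8]` compares `starttime` as text, so the comparisons are lexicographic rather than chronological, and the list of dates stops at 8/29. A possible alternative is to parse the timestamps once and label each trip by its hour of day. The sketch below runs on a small made-up frame named `demo` (an assumption, not the real `df3`) and treats 06:00:00 up to but not including 18:00:00 as day, which differs slightly at the boundaries from the strict inequalities used in the cell. The results printed elsewhere in this notebook were not produced with this version.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df3, with start times in the format shown in the output above
demo = pd.DataFrame({
    "tripduration": [461, 297, 540, 1200, 355],
    "starttime": ["2/1/2016 00:00:08", "2/1/2016 06:00:00", "2/10/2016 17:59:59",
                  "8/30/2016 12:30:00", "8/31/2016 18:00:00"],
})

# Parse the strings once so that comparisons are chronological
start = pd.to_datetime(demo["starttime"], format="%m/%d/%Y %H:%M:%S")

# Day is 06:00:00 up to (but not including) 18:00:00, on any date
is_day = (start.dt.hour >= 6) & (start.dt.hour < 18)
demo["Day"] = np.where(is_day, "Day", "Night")
print(demo)
```

Because the label depends only on the hour, the same two lines cover every date in both months without listing them individually.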


In [9]:
# Split the day and night data
dfNight = df3[df3.Day == "Night"]
dfDay = df3[df3.Day == "Day"]

In [10]:
# Perform the 2 sample K-S test
ks = scipy.stats.ks_2samp(dfNight.tripduration,dfDay.tripduration)
print(ks)

Ks_2sampResult(statistic=0.0084213674921139736, pvalue=6.5945558267841418e-30)


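`ks_2samp` returns the K-S statistic, which is the largest distance between the empirical cumulative distributions of the two samples, and a two-sided p-value. The statistic is small (about 0.008), but the p-value (about $6.6 \times 10^{-30}$) is far below our chosen significance level of 0.05, so the null hypothesis that daytime and nighttime trip durations come from the same distribution is rejected. With samples this large, even a small difference between the two distributions is statistically significant.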
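
The assignment goal also asks for the K-S test on a subset 200 times smaller than the full data. One way this could be done is sketched below on synthetic lognormal samples. The arrays `night` and `day`, their sizes and the `subsample` helper are assumptions for illustration only, so the printed numbers say nothing about the CitiBike result.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(100)

# Synthetic stand-ins for dfNight.tripduration and dfDay.tripduration (not CitiBike data)
night = rng.lognormal(mean=6.4, sigma=0.7, size=40000)
day = rng.lognormal(mean=6.5, sigma=0.7, size=60000)


def subsample(values, factor, rng):
    """Hypothetical helper: random subset `factor` times smaller, without replacement."""
    return rng.choice(values, size=len(values) // factor, replace=False)


night_small = subsample(night, 200, rng)
day_small = subsample(day, 200, rng)

# Same test on the full samples and on the reduced samples
ks_full = scipy.stats.ks_2samp(night, day)
ks_small = scipy.stats.ks_2samp(night_small, day_small)

print("full:   statistic={:.4f}, p-value={:.3g}".format(ks_full.statistic, ks_full.pvalue))
print("subset: statistic={:.4f}, p-value={:.3g}".format(ks_small.statistic, ks_small.pvalue))
```

Comparing the two p-values shows how much the conclusion depends on sample size: the same underlying difference is harder to detect with 200 times fewer trips.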

In [11]:
# Select random samples of the day and night data of equal size (200 trips each)
np.random.seed(100)
dfNight_reduced = np.random.choice(dfNight.tripduration, size=200, replace=False)
dfDay_reduced = np.random.choice(dfDay.tripduration, size=200, replace=False)
# Sort both samples in place so the correlation tests compare them in rank order
dfNight_reduced.sort()
dfDay_reduced.sort()

In [12]:
# Perform Pearson's test for correlation
pear = scipy.stats.pearsonr(dfNight_reduced, dfDay_reduced)
print(pear)

(0.99901744324249409, 5.9941913280920799e-270)


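`pearsonr` returns Pearson's correlation coefficient and a two-sided p-value. The coefficient is close to 1 (about 0.999) and the p-value (about $6.0 \times 10^{-270}$) is smaller than our chosen significance level of 0.05, so the null hypothesis of no linear correlation is rejected. Because both samples were sorted first, the coefficient describes how closely the ordered values of the two samples track each other, not a pairing of individual trips.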

In [13]:
# Perform Spearman's test for correlation
spear = scipy.stats.spearmanr(dfNight_reduced, dfDay_reduced)
print(spear)

SpearmanrResult(correlation=0.99998799957201612, pvalue=0.0)


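`spearmanr` returns Spearman's rank correlation coefficient and a two-sided p-value. The coefficient is close to 1 (about 0.99999) and the reported p-value of 0.0 is smaller than our chosen significance level of 0.05, so the null hypothesis of no rank correlation is rejected. As with Pearson's test, both samples were sorted first, so the two sequences are both non-decreasing and a rank correlation near 1 is expected.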
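
Since both samples are sorted before the correlation tests, it can help to see what sorting alone does to the coefficients. The sketch below draws two independent synthetic samples (exponential, an assumption chosen only because trip durations are skewed) and compares the coefficients before and after sorting. It is an illustration of the effect, not a re-analysis of the CitiBike data.

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(100)

# Two independent synthetic samples of 200 values each (not CitiBike data)
a = rng.exponential(scale=600, size=200)
b = rng.exponential(scale=600, size=200)

# Unsorted: the pairing is arbitrary, so little correlation is expected
r_unsorted, _ = scipy.stats.pearsonr(a, b)
rho_unsorted, _ = scipy.stats.spearmanr(a, b)

# Sorted: both sequences increase together, which drives the coefficients toward 1
r_sorted, _ = scipy.stats.pearsonr(np.sort(a), np.sort(b))
rho_sorted, _ = scipy.stats.spearmanr(np.sort(a), np.sort(b))

print("unsorted: pearson={:.3f}, spearman={:.3f}".format(r_unsorted, rho_unsorted))
print("sorted:   pearson={:.3f}, spearman={:.3f}".format(r_sorted, rho_sorted))
```

On sorted samples the coefficients measure how similar the shapes of the two distributions are, quantile by quantile, which is a useful way to read the values reported above.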