# HW 4 Part 4:
    test if the distribution of 
    
    1) trip duration of bikers that ride during the day vs night
    
    2) age of bikers for trips originating in Manhattan and in Brooklyn
    
    are different. Use 3 tests: KS, Pearson's, Spearman's. 
    
    Use the scipy.stats functions scipy.stats.ks_2samp, scipy.stats.pearsonr, scipy.stats.spearmanr. 
    
    For the KS do the test with the entire dataset and with a subset 200 times smaller
    
    Choose a single significant threshold for the whole exercise. 
    
    For each test phrase the Null Hypothesis in words.
    
    Describe the return of the scipy function you use in each case.
    
    State the result in terms of rejection of the Null.

In [5]:
import pandas as pd
import numpy as np
import scipy.stats

# Read and wrangle data
### Comparing trip durations for bikers during the day and night

In [7]:
datestring1 = '201704'
datestring2 = '201707'

df1 = pd.read_csv('/gws/open/Student/citibike' + "/" + datestring1 + '-citibike-tripdata.csv.zip', 
                 compression='zip')
df2 = pd.read_csv('/gws/open/Student/citibike' + "/" + datestring2 + '-citibike-tripdata.csv.zip', 
                 compression='zip')

In [8]:
df = pd.concat([df1, df2], ignore_index=True)
df['date'] = pd.to_datetime(df['starttime'])
df['hour'] = pd.DatetimeIndex(df['date']).hour
df['tripduration'] = df['tripduration']/60 # convert to minutes

# define daytime as between 6AM and 7PM
df['daynight'] = np.where((df['hour'] > 5) & (df['hour'] <19),'day','night')
df.head()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,date,hour,daynight
0,3.25,2017-04-01 00:00:58,2017-04-01 00:04:14,217,Old Fulton St,40.702772,-73.993836,430,York St & Jay St,40.701485,-73.986569,25454,Subscriber,1966.0,1,2017-04-01 00:00:58,0,night
1,9.75,2017-04-01 00:01:54,2017-04-01 00:11:40,490,8 Ave & W 33 St,40.751551,-73.993934,3158,W 63 St & Broadway,40.771639,-73.982614,25930,Subscriber,1995.0,1,2017-04-01 00:01:54,0,night
2,11.083333,2017-04-01 00:01:56,2017-04-01 00:13:01,504,1 Ave & E 16 St,40.732219,-73.981656,412,Forsyth St & Canal St,40.715816,-73.994224,27110,Subscriber,1987.0,1,2017-04-01 00:01:56,0,night
3,4.366667,2017-04-01 00:02:24,2017-04-01 00:06:47,445,E 10 St & Avenue A,40.727408,-73.98142,411,E 6 St & Avenue D,40.722281,-73.976687,28392,Subscriber,1961.0,1,2017-04-01 00:02:24,0,night
4,2.233333,2017-04-01 00:02:36,2017-04-01 00:04:50,3423,West Drive & Prospect Park West,40.661063,-73.979453,3300,Prospect Park West & 8 St,40.665147,-73.976376,19316,Subscriber,1967.0,2,2017-04-01 00:02:36,0,night


In [9]:
df.drop(['starttime', 'stoptime', 'start station id',
         'start station name', 'start station latitude',
         'start station longitude', 'end station id', 'end station name',
         'end station latitude', 'end station longitude', 'bikeid', 'usertype', 
         'birth year', 'gender', 'date','hour'], 
        axis=1, inplace=True)

In [10]:
df.head()

Unnamed: 0,tripduration,daynight
0,3.25,night
1,9.75,night
2,11.083333,night
3,4.366667,night
4,2.233333,night


In [11]:
dayTrips = pd.np.array(df['tripduration'][df['daynight'] == 'day'])
nightTrips = pd.np.array(df['tripduration'][df['daynight'] == 'night'])

## KS Test to compare 2 samples

### Null Hypothesis:
The trip duration times for riders during the day and the night are drawn from the same distribution.


#### Alternative Hypothesis:
The trip duration times for riders during the day and the night are not drawn from the same distribution.

#### Significance level
α = 0.05

In [12]:
ks = scipy.stats.ks_2samp(dayTrips,nightTrips)
print(ks)

Ks_2sampResult(statistic=0.029651666638394972, pvalue=0.0)


#### Results
Because the K-S statistic indiciates the distance between the CDFs and is small, we can't reject the null hypothesis that the distributions are the same (even though the p-value is small.

### KS Test for 1/200 subset

In [13]:
dfSubset = df.sample(frac=0.005, random_state = 10)
dayTripsSub = pd.np.array(dfSubset['tripduration'][df['daynight'] == 'day'])
nightTripsSub = pd.np.array(dfSubset['tripduration'][df['daynight'] == 'night'])

In [14]:
ks = scipy.stats.ks_2samp(dayTripsSub,nightTripsSub)
print(ks)

Ks_2sampResult(statistic=0.03616185809259387, pvalue=0.0025793612402765721)


#### Results
Same as above, the K-S statistic is small enough so that we can't reject the null hypothesis.

## Pearson's  test for correlation

#### Null Hypothesis:
The trip duration times for riders during the day and the night are uncorrelated.

#### Alternative Hypothesis:
The trip duration times for riders during the day and the night are correlated.

#### Significance level
α = 0.05

In [15]:
np.random.seed(2)
dayReduced = np.sort(np.random.choice(dayTrips, size = 300000, replace = False))
nightReduced = np.sort(np.random.choice(nightTrips, size = 300000, replace = False))

In [16]:
ps = scipy.stats.pearsonr(dayReduced,nightReduced)
print(ps)

(0.97426320005624822, 0.0)


#### Results
The correlation coefficient is very close to 1, which indicates that the datasets are nearly linearly correlated. In addition, the p-value 0.0 indicates that there is no possibility of uncorrelated data producing a correlation coefficient as extreme as above. Thus we can reject the null hypothesis and determine that the datasets are correlated.

## Spearman's  test for correlation

#### Null Hypothesis:
The trip duration times for riders during the day and the night are uncorrelated.

#### Alternative Hypothesis:
The trip duration times for riders during the day and the night are correlated.

#### Significance level
α = 0.05

In [17]:
sm = scipy.stats.spearmanr(dayReduced,nightReduced)
print(sm)

SpearmanrResult(correlation=0.99999946909216542, pvalue=0.0)


#### Results
The correlation coefficient is very close to 1, which indicates that the datasets are nearly linearly correlated. In addition, the p-value 0.0 indicates that there is no possibility of uncorrelated data producing a correlation coefficient as extreme as above. Thus we can reject the null hypothesis and determine that the datasets are correlated.

# Next...analyze data for biker age and borough

In [18]:
df = pd.concat([df1, df2], ignore_index=True)
df['age'] = 2017 - df['birth year']
df = df.rename(columns = {'start station latitude':'lat', 'start station longitude':'long'})
df.drop(['tripduration','starttime', 'stoptime', 'start station id',
      'end station id', 'end station name',
      'end station latitude', 'end station longitude', 'bikeid', 'usertype', 'gender', 'birth year'
       ], axis=1, inplace=True)
df.head()

Unnamed: 0,start station name,lat,long,age
0,Old Fulton St,40.702772,-73.993836,51.0
1,8 Ave & W 33 St,40.751551,-73.993934,22.0
2,1 Ave & E 16 St,40.732219,-73.981656,30.0
3,E 10 St & Avenue A,40.727408,-73.98142,56.0
4,West Drive & Prospect Park West,40.661063,-73.979453,50.0


In [21]:
# manually set approximate latitude & longtitude values to approximate 
# linear boundaries of the boroughs

# approximate boundaries for Manhattan exclude Financial District, Upper East Side, Harlem
df['borough'] =  np.where((df['lat'] > 40.707) & (df['long'] > -74.02) & (df['long'] < 73.97) ,
                          'Manhattan', np.NaN)
dfMan = df[(df['borough'] == 'Manhattan')]

# 2 different boundaries/areas for Brooklyn: Downtown Brooklyn and Williamsburg
df['borough'] =  np.where((df['lat'] < 40.707) & (df['long'] > -74.00), 'Brooklyn', np.NaN)
dfBklyn1 = df[(df['borough'] == 'Brooklyn')]
df['borough'] =  np.where((df['lat'] < 40.736) & (df['lat'] > 40.707) & (df['long'] > -73.96), 
                          'Brooklyn', np.NaN)
dfBklyn2 = df[(df['borough'] == 'Brooklyn')]
dfBklyn = pd.concat([dfBklyn1, dfBklyn2], ignore_index=True)

In [22]:
#drop NA age values
dfMan.dropna(subset=['age'],inplace= True)
dfBklyn.dropna(subset=['age'],inplace= True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [23]:
# Citibike stations in Manhattan
dfMan.head()

Unnamed: 0,start station name,lat,long,age,borough
1,8 Ave & W 33 St,40.751551,-73.993934,22.0,Manhattan
2,1 Ave & E 16 St,40.732219,-73.981656,30.0,Manhattan
3,E 10 St & Avenue A,40.727408,-73.98142,56.0,Manhattan
6,W 20 St & 11 Ave,40.746745,-74.007756,46.0,Manhattan
8,11 Ave & W 41 St,40.760301,-73.998842,40.0,Manhattan


In [24]:
# Citibike stations in Brooklyn
dfBklyn.head()

Unnamed: 0,start station name,lat,long,age,borough
0,Old Fulton St,40.702772,-73.993836,51.0,Brooklyn
1,West Drive & Prospect Park West,40.661063,-73.979453,50.0,Brooklyn
2,Smith St & 9 St,40.674696,-73.997858,43.0,Brooklyn
3,3 St & 7 Ave,40.670384,-73.978397,30.0,Brooklyn
4,Washington Park,40.691782,-73.97373,41.0,Brooklyn


In [25]:
agesMan =  pd.np.array(dfMan['age'])
agesBklyn =  pd.np.array(dfBklyn['age'])

## KS Test to compare two samples

### Null Hypothesis:
The ages of riders in Brooklyn and Manhattan are drawn from the same distribution.


#### Alternative Hypothesis:
The ages of riders in Brooklyn and Manhattan are not drawn from the same distribution.

#### Significance level
α = 0.05

In [26]:
ks = scipy.stats.ks_2samp(agesMan,agesBklyn)
print(ks)

Ks_2sampResult(statistic=0.09340184547021711, pvalue=0.0)


#### Results
Because the K-S statistic indiciates the distance between the CDFs and is small, we can't reject the null hypothesis that the distributions are the same (even though the p-value is small.

## KS Test for 1/200 subset

In [27]:
dfManSubset = dfMan.sample(frac=0.005, random_state = 10)
dfBklynSubset = dfBklyn.sample(frac=0.005, random_state = 10)

agesManSub = pd.np.array(dfManSubset['age'])
agesBklynSub = pd.np.array(dfBklynSubset['age'])

In [28]:
ks = scipy.stats.ks_2samp(dayTripsSub,nightTripsSub)
print(ks)

Ks_2sampResult(statistic=0.03616185809259387, pvalue=0.0025793612402765721)


#### Results
Same as above, the K-S statistic is small enough so that we can't reject the null hypothesis.

## Pearson's test for correlation

#### Null Hypothesis:
The ages of riders in Brooklyn and Manhattan are not correlated.

#### Alternative Hypothesis:
The ages of riders in Brooklyn and Manhattan are correlated.

#### Significance level
α = 0.05

In [29]:
np.random.seed(2)
agesManReduced = np.sort(np.random.choice(agesMan, size = 300000, replace = False))
agesBklynReduced = np.sort(np.random.choice(agesBklyn, size = 300000, replace = False))

In [30]:
ps = scipy.stats.pearsonr(agesManReduced,agesBklynReduced)
print(ps)

(0.99381518210323905, 0.0)


#### Results
The correlation coefficient is very close to 1, which indicates that the datasets are nearly linearly correlated. In addition, the p-value 0.0 indicates that there is no possibility of uncorrelated data producing a correlation coefficient as extreme as above. Thus we can reject the null hypothesis and determine that the datasets are correlated.

### Spearman's test for correlation

#### Null Hypothesis:
The ages of riders in Brooklyn and Manhattan are not correlated.

#### Alternative Hypothesis:
The ages of riders in Brooklyn and Manhattan are correlated.

#### Significance level
α = 0.05

In [31]:
sm = scipy.stats.spearmanr(agesManReduced,agesBklynReduced)
print(sm)

SpearmanrResult(correlation=0.99881500098251863, pvalue=0.0)


#### Results
The correlation coefficient is very close to 1, which indicates that the datasets are nearly linearly correlated. In addition, the p-value 0.0 indicates that there is no possibility of uncorrelated data producing a correlation coefficient as extreme as above. Thus we can reject the null hypothesis and determine that the datasets are correlated.