# Assignment 2: CitiBike Data Analysis

Your notebook must display
* the complete formulation of the hypothesis to be tested
* the data tables for the unreduced datasets (first few columns)
* the data tables for the reduced datasets (first few columns)
* the plots for each dataframe, with usual rules for plotting applying: visible and readable axes, title, legend, caption.

## Idea: Subscribers bike more on weekdays and Customers bike more on weekends.

## NULL HYPOTHESIS
### The ratio of subscribers biking on weekdays over subscribers biking on weekends is the same as or lower than the ratio of customers biking on weekdays over customers biking on weekends, tested at a significance level of $\alpha = 0.05$.

## $H_0$ : $\frac{S_{\mathrm{weekday}}}{S_{\mathrm{weekend}}} \leq \frac{C_{\mathrm{weekday}}}{C_{\mathrm{weekend}}}$

## $H_1$ : $\frac{S_{\mathrm{weekday}}}{S_{\mathrm{weekend}}} > \frac{C_{\mathrm{weekday}}}{C_{\mathrm{weekend}}}$
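
The notebook does not say which statistical test decides between $H_0$ and $H_1$. One possible choice, sketched here as an assumption and not as the assignment's prescribed method, is a one-sided two-proportion z-test. Because $x/(1-x)$ is increasing in $x$, comparing weekday/weekend ratios is equivalent to comparing the weekday fraction of all trips for each user type. The function name `weekday_ratio_ztest` and the counts in the usage lines are hypothetical and are not taken from the data; the test also assumes trips are independent.

```python
import math

def weekday_ratio_ztest(s_weekday, s_weekend, c_weekday, c_weekend):
    """One-sided two-proportion z-test (illustrative sketch).

    Returns (subscriber ratio, customer ratio, z statistic, p-value) for
    H1: subscriber weekday/weekend ratio > customer weekday/weekend ratio.
    """
    n_s = s_weekday + s_weekend
    n_c = c_weekday + c_weekend
    # Weekday fraction of trips for each user type
    p_s = s_weekday / float(n_s)
    p_c = c_weekday / float(n_c)
    # Pooled weekday fraction, used for the standard error under equality
    p_pool = (s_weekday + c_weekday) / float(n_s + n_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n_s + 1.0 / n_c))
    z = (p_s - p_c) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    ratio_s = s_weekday / float(s_weekend)
    ratio_c = c_weekday / float(c_weekend)
    return ratio_s, ratio_c, z, p_value

# Hypothetical counts, for illustration only
ratio_s, ratio_c, z, p_value = weekday_ratio_ztest(800, 200, 50, 50)
alpha = 0.05
reject_null = p_value < alpha
```

With these made-up counts the subscriber ratio is 4.0 and the customer ratio is 1.0, so the null hypothesis would be rejected; the real decision depends on the counts from the CitiBike file.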

In [1]:
import numpy as np
import pandas as pd
import sys
import os

In [2]:
def getCitiBikeCSV(datestring):
    '''
    Retrieves the CitiBike CSV file given a datestring
    and saves it in the directory specified in $PUIDATA
    
    Argument:
        datestring: String in format YYYYMM
    '''
    
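    PUIdata = os.getenv("PUIDATA")
    if PUIdata is None:
        # without $PUIDATA the path concatenations below would raise a TypeError
        print ("Error: the PUIDATA environment variable is not set")
        return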
    
    print ("Downloading", datestring)
    
    ### First I will check that it is not already there
    if not os.path.isfile(PUIdata + "/" + datestring + "-citibike-tripdata.csv"):
        if os.path.isfile(datestring + "-citibike-tripdata.csv"):
            # if in the current dir just move it
            if os.system("mv " + datestring + "-citibike-tripdata.csv " + PUIdata):
                print ("Error moving file!, Please check!")
                
        #otherwise start looking for the zip file
        else:
            if not os.path.isfile(PUIdata + "/" + datestring + "-citibike-tripdata.zip"):
                if not os.path.isfile(datestring + "-citibike-tripdata.zip"):
                    os.system("curl -O https://s3.amazonaws.com/tripdata/" + datestring + "-citibike-tripdata.zip")
                ###  To move it I use the os.system() functions to run bash commands with arguments
                os.system("mv " + datestring + "-citibike-tripdata.zip " + PUIdata)
            ### unzip the csv 
            os.system("unzip " + PUIdata + "/" + datestring + "-citibike-tripdata.zip")
            ## NOTE: old csv citibike data had a different name structure. 
            if '2014' in datestring:
                # raw string so the backslashes that escape the spaces reach the shell unchanged
                os.system("mv " + datestring[:4] + '-' +  datestring[4:] +
                          r"\ -\ Citi\ Bike\ trip\ data.csv " + datestring + "-citibike-tripdata.csv")
            os.system("mv " + datestring + "-citibike-tripdata.csv " + PUIdata)
    
    ### One final check:
    if not os.path.isfile(PUIdata + "/" + datestring + "-citibike-tripdata.csv"):
        print ("WARNING!!! something is wrong: the file is not there!")

    else:
        print ("file in place, you can continue")
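
The unzip and rename steps can also be done without shelling out, which avoids quoting problems with file names that contain spaces. The sketch below is an alternative under assumptions, not a replacement for the function above: `csv_name` and `extract_csv` are hypothetical helper names, and it only covers the case where the zip file is already on disk.

```python
import os
import shutil
import zipfile

def csv_name(datestring):
    """Standard file name for a given YYYYMM datestring."""
    return datestring + "-citibike-tripdata.csv"

def extract_csv(zip_path, data_dir, datestring):
    """Extract the first CSV member of zip_path into data_dir.

    The file is written under the standard name, so archives whose
    internal file name differs need no special case.
    Returns the path of the extracted file.
    """
    target = os.path.join(data_dir, csv_name(datestring))
    with zipfile.ZipFile(zip_path) as archive:
        members = [m for m in archive.namelist() if m.lower().endswith(".csv")]
        if not members:
            raise ValueError("no CSV file found in " + zip_path)
        # Stream the member to disk instead of loading it all into memory
        with archive.open(members[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

A call such as `extract_csv(zip_path, os.getenv("PUIDATA"), "201612")` would then leave the CSV in the data directory, assuming `zip_path` points at the downloaded archive.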

In [3]:
datestring = '201612'
getCitiBikeCSV(datestring)

('Downloading', '201612')
file in place, you can continue


In [4]:
PUIdata = os.getenv('PUIDATA')
print(PUIdata)

/home/cusp/uc288/PUIdata


### NOTE TO SELF: Trip duration is in SECONDS

In [5]:
data = pd.read_csv(PUIdata + '/' + datestring + '-citibike-tripdata.csv', 
                   infer_datetime_format=True,
                   parse_dates=['Start Time', 'Stop Time'])
data.head()

|   | Trip Duration | Start Time | Stop Time | Start Station ID | Start Station Name | Start Station Latitude | Start Station Longitude | End Station ID | End Station Name | End Station Latitude | End Station Longitude | Bike ID | User Type | Birth Year | Gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 528 | 2016-12-01 00:00:04 | 2016-12-01 00:08:52 | 499 | Broadway & W 60 St | 40.769155 | -73.981918 | 228 | E 48 St & 3 Ave | 40.754601 | -73.971879 | 26931 | Subscriber | 1964.0 | 1 |
| 1 | 218 | 2016-12-01 00:00:28 | 2016-12-01 00:04:06 | 3418 | Plaza St West & Flatbush Ave | 40.675021 | -73.971115 | 3358 | Garfield Pl & 8 Ave | 40.671198 | -73.974841 | 27122 | Subscriber | 1955.0 | 1 |
| 2 | 399 | 2016-12-01 00:00:39 | 2016-12-01 00:07:19 | 297 | E 15 St & 3 Ave | 40.734232 | -73.986923 | 345 | W 13 St & 6 Ave | 40.736494 | -73.997044 | 19352 | Subscriber | 1985.0 | 1 |
| 3 | 254 | 2016-12-01 00:00:44 | 2016-12-01 00:04:59 | 405 | Washington St & Gansevoort St | 40.739323 | -74.008119 | 358 | Christopher St & Greenwich St | 40.732916 | -74.007114 | 20015 | Subscriber | 1982.0 | 1 |
| 4 | 1805 | 2016-12-01 00:00:54 | 2016-12-01 00:31:00 | 279 | Peck Slip & Front St | 40.707873 | -74.00167 | 279 | Peck Slip & Front St | 40.707873 | -74.00167 | 23148 | Subscriber | 1989.0 | 1 |
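
As a quick check of the note that trip duration is in seconds, the first row shown above can be verified by hand: the difference between its Stop Time and Start Time should equal its Trip Duration value. The sketch uses only values printed in the table.

```python
from datetime import datetime

start = datetime(2016, 12, 1, 0, 0, 4)   # Start Time of row 0
stop = datetime(2016, 12, 1, 0, 8, 52)   # Stop Time of row 0

# Elapsed time between start and stop, in seconds and in minutes
elapsed_seconds = (stop - start).total_seconds()
elapsed_minutes = elapsed_seconds / 60.0

print(elapsed_seconds)  # 528.0, matching Trip Duration for row 0
print(elapsed_minutes)  # 8.8
```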


In [6]:
data.columns

Index([u'Trip Duration', u'Start Time', u'Stop Time', u'Start Station ID',
       u'Start Station Name', u'Start Station Latitude',
       u'Start Station Longitude', u'End Station ID', u'End Station Name',
       u'End Station Latitude', u'End Station Longitude', u'Bike ID',
       u'User Type', u'Birth Year', u'Gender'],
      dtype='object')

In [7]:
data.drop([u'Trip Duration', u'Stop Time', u'Start Station ID',
       u'Start Station Name', u'Start Station Latitude',
       u'Start Station Longitude', u'End Station ID', u'End Station Name',
       u'End Station Latitude', u'End Station Longitude', u'Bike ID',
       u'Birth Year', u'Gender'], axis=1, inplace=True)


In [8]:
data.head()

|   | Start Time | User Type |
|---|---|---|
| 0 | 2016-12-01 00:00:04 | Subscriber |
| 1 | 2016-12-01 00:00:28 | Subscriber |
| 2 | 2016-12-01 00:00:39 | Subscriber |
| 3 | 2016-12-01 00:00:44 | Subscriber |
| 4 | 2016-12-01 00:00:54 | Subscriber |
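
An equivalent way to reduce the table is to name the two columns to keep instead of listing the thirteen to drop. This is shorter and keeps working if the file gains extra columns. A minimal sketch on a small hand-made frame follows; `toy` and `reduced` are hypothetical names, and the two rows copy a few fields from the output above.

```python
import pandas as pd

# Small stand-in for the full table, with only some of its columns
toy = pd.DataFrame({
    'Trip Duration': [528, 218],
    'Start Time': pd.to_datetime(['2016-12-01 00:00:04',
                                  '2016-12-01 00:00:28']),
    'User Type': ['Subscriber', 'Subscriber'],
    'Gender': [1, 1],
})

# Keep only the columns needed for the hypothesis; .copy() makes the
# result independent of the original frame
reduced = toy[['Start Time', 'User Type']].copy()
print(reduced.columns.tolist())
```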


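### Separating Subscribers (\_s) and Customers (\_c) to test the hypothesis.

#### User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member)

### Plot the number of trips per day of the week for Customers and Subscribers

In [None]:
# day of the week for each trip (Monday = 0, ..., Sunday = 6)
weekday = data['Start Time'].dt.weekday

# normalization constant; 1 keeps the raw trip counts
norm_c = 1

((data['Start Time'][data['User Type'] == 'Customer'].groupby(weekday).count()) / norm_c).plot(kind="bar", 
                                                                                         color='IndianRed', 
                                                                                         label='Customer')

In [None]:
norm_s = 1

# number of Subscriber trips per day of the week
data['Start Time'][data['User Type'] == 'Subscriber'].groupby(weekday).count() / norm_s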
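
To tie the reduced table back to the hypothesis, the four counts in $H_0$ and $H_1$ can be tallied by flagging each trip as weekday or weekend. The sketch below runs on a small made-up frame, not on the December 2016 data. The names `trips`, `day_type`, `counts`, and `ratio` are hypothetical, and the column names follow the reduced table.

```python
import pandas as pd

# Made-up trips, for illustration only
trips = pd.DataFrame({
    'Start Time': pd.to_datetime([
        '2016-12-01 08:00:00',  # Thursday
        '2016-12-02 09:00:00',  # Friday
        '2016-12-03 10:00:00',  # Saturday
        '2016-12-03 11:00:00',  # Saturday
        '2016-12-04 12:00:00',  # Sunday
        '2016-12-05 13:00:00',  # Monday
    ]),
    'User Type': ['Subscriber', 'Subscriber', 'Subscriber',
                  'Customer', 'Customer', 'Customer'],
})

# dt.weekday is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
trips['day_type'] = (trips['Start Time'].dt.weekday >= 5).map(
    {True: 'weekend', False: 'weekday'})

# Rows: user type; columns: weekday / weekend trip counts
counts = pd.crosstab(trips['User Type'], trips['day_type'])

# Weekday-over-weekend ratio for each user type
ratio = counts['weekday'] / counts['weekend']
print(ratio)
```

A user type with no weekend trips would give a division by zero here, so a real analysis should check the counts before forming the ratio.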