# Homework 2


## Task 1 (6 points)

Your task is to **compute the maximum number of active "citibikers"** that were using the Citibike service at any point in time. This is the same as computing the maximum number of Citibikes that were checked out at a particular time. The input data set is **citibike_docking_events.csv**, which logs all docking and undocking events at all Citibike stations. The description of the fields in this file is as follows:

|Column name|Description|
|--|--|
|time |The timestamp of the event. All events are sorted increasingly by their timestamps. |
|bikeid |The unique ID of the bike involved in this event. |
|station_id |The station ID, where the event happened. |
|event |A string of either *"dock"* or *"undock"* for describing the drop-off or pick-up event, respectively. |

For example, let's assume that on *Feb-01-2015*, there was a user that picked a bike at the station ID *521* at midnight and dropped it at the station ID *423* at 14 minutes past midnight. If the bike that this customer used has the ID of *17131*, then you should see two events being logged in this data set as:

<pre>
...
2015-02-01 00:00:00+00,17131,521,undock
...
2015-02-01 00:14:00+00,17131,423,dock
...
</pre>

You are given the above data set in a streaming fashion (read row by row) and must design a streaming algorithm that uses the least possible additional memory to compute the maximum number of active users of the Citibike service. Please modify the code snippet below to complete this task. Your code should output only a single number, which is the maximum number of active users. Of course, you can add initialization code outside of the for loop as needed.
Additionally, please provide a brief rationale and/or justification for your design after the code.

In [1]:
import csv

def csvRows(filename):
    '''
    This function creates a generator
    of all the rows read one by one from the file(csv)
    params: 
    filename: filename passed as string
    '''
    with open(filename, 'r') as fi:
        reader = csv.DictReader(fi)
        for row in reader:
            yield row

maxActiveUsers = 0
count = 0
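for row in csvRows('citibike_docking_events.csv'):
    if row["event"] == "undock":
        # A bike leaves a station: one more active rider.
        count += 1
        # A new peak can only be reached right after an undock.
        if count > maxActiveUsers:
            maxActiveUsers = count
    elif row["event"] == "dock":
        # A bike is returned: one fewer active rider.
        count -= 1

print('The maximum active citi bike users in the given dataset are : %d' % maxActiveUsers)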

The maximum active citi bike users in the given dataset are : 250


#### RATIONALE AND JUSTIFICATION

**Given dataset:** Events sorted by their `time` timestamp.

Each event is either _undock_ or _dock_.

**Objective:** To find the maximum number of active Citibike users.

**Rationale:**
The number of active Citibike users at any moment equals the number of bikes whose ride has not yet been completed. The data is consumed in a _streaming fashion_: the generator `csvRows` yields one row at a time, so the file is never held in memory. A counter is initialized to zero, incremented on every 'undock' event, and decremented on every 'dock' event. This works because the events are sorted by time regardless of their type, so the counter always equals the number of bikes checked out at the current timestamp.
A second variable, `maxActiveUsers`, is also initialized to zero and compared with the counter; whenever the counter exceeds it, `maxActiveUsers` is overwritten with the counter value. Only these two integers are kept, so the additional memory is constant.
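
The counter logic described above can be sketched as a small function that does not depend on the CSV file. This is only an illustrative sketch: the function name `max_active` and the in-memory sample stream are assumptions made for the example, and the events are assumed to arrive sorted by time, as in the data set.

```python
def max_active(events):
    """Return the peak number of simultaneously checked-out bikes.

    `events` is any iterable of "dock"/"undock" strings that is
    already sorted by time. Only two integers are kept in memory.
    """
    count = 0  # bikes currently checked out
    peak = 0   # largest value `count` has reached so far
    for event in events:
        if event == "undock":
            count += 1
            if count > peak:
                peak = count
        elif event == "dock":
            count -= 1
    return peak


# Hypothetical sample stream: two rides overlap, then a third starts alone.
sample = ["undock", "undock", "dock", "dock", "undock", "dock"]
print(max_active(sample))
```

For this sample stream the function returns 2, because at most two rides are in progress at the same time.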


## Task 2 (4 points)

The objective of this task is identical to Task 1's, but you are asked to use the **citibike.csv** data set instead of the docking events. The main difference (and challenge) is that the pick-up and drop-off events for each trip are now presented as a single record; thus, the drop-off events are not sorted by their timestamps. You are again asked to do this in a streaming fashion that minimizes the amount of memory used. Please modify the code below accordingly, and also provide a brief explanation of the solution.

In [2]:
import csv
import datetime
from dateutil import parser

def csvRows(filename):
    '''
    This function creates a generator
    of all the rows taken one by one from the file(csv)
    params: 
    filename: filename passed as string
    '''
    with open(filename, 'r') as fi:
        reader = csv.DictReader(fi)
        for row in reader:
            yield row

maxActiveUsers = 0
count = 0
stoplist = []   # list of stoptimes 
for row in csvRows('citibike.csv'):
    starttime = parser.parse(row['starttime'])
    stoptime = parser.parse(row['stoptime'])
    stoplist.append(stoptime)
    
    # Use filter + lambda to keep only the stop times later than this start time,
    # i.e. the trips still in progress. list() keeps this working on Python 3 too.
    stoplist = list(filter(lambda x: x > starttime, stoplist))
    count = len(stoplist)
    if count > maxActiveUsers:
        maxActiveUsers = count

print('The maximum active citi bike users in the given dataset are : %d' % maxActiveUsers)

The maximum active citi bike users in the given dataset are : 250


#### RATIONALE AND JUSTIFICATION



**Given dataset:** Trips sorted by the start time of the ride.

Each record holds both the start time and the stop time of one trip.

**Objective:** To find the maximum number of active Citibike users with minimum memory usage.

**Rationale:**
The data is consumed in a _streaming fashion_ using the generator `csvRows`, as in the previous task. The stop time of each row is appended to a list, and a counter and the maximum-users variable are initialized to zero. A higher-order function (`filter` with a `lambda`) then compares the current start time with every element of the stop-time list and keeps only the stop times that are later, that is, the trips still in progress. The counter is set to the number of elements left in the list and compared with `maxActiveUsers`, which is overwritten whenever the counter is greater.
Because the rows are sorted by start time, a trip that has ended at or before the current start time can never become active again, so dropping it is safe and the list never holds more entries than there are trips in progress. `filter` applies the lambda to each stored stop time in one pass, so no explicit loop is needed.
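
One possible refinement, which is not the approach used in the cell above, is to keep the pending stop times in a min-heap instead of a plain list. Since rows arrive sorted by start time, the earliest stop time always sits at the heap root, so finished trips can be popped off without rescanning every element. The sketch below uses the standard `heapq` module; the function name `max_active_trips` and the sample trips are hypothetical, and the timestamps are plain integers here, although any comparable values such as `datetime` objects would work.

```python
import heapq


def max_active_trips(trips):
    """Return the peak number of overlapping trips.

    `trips` is an iterable of (starttime, stoptime) pairs sorted by
    starttime. The heap only ever holds trips still in progress.
    """
    stop_heap = []  # min-heap of stop times for trips in progress
    peak = 0
    for starttime, stoptime in trips:
        heapq.heappush(stop_heap, stoptime)
        # Drop every trip that ended at or before this start time.
        while stop_heap and stop_heap[0] <= starttime:
            heapq.heappop(stop_heap)
        if len(stop_heap) > peak:
            peak = len(stop_heap)
    return peak


# Hypothetical trips as (start, stop) in minutes past midnight.
sample_trips = [(0, 14), (5, 20), (10, 12), (15, 30)]
print(max_active_trips(sample_trips))
```

For the sample trips the function returns 3, since the first three trips overlap at minute 10. Each row costs a push and a few pops on a heap of the trips in progress rather than a full pass over the list, while the memory footprint stays of the same order as in the list version.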