### Challenge Set 1

### Topic:   Explore MTA turnstile data   
#### Date:    7/7/2016   
#### Name:    Marc Gameroff

# Challenge 1

In [1]:
!curl -O http://web.mta.info/developers/data/nyct/turnstile/turnstile_160611.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.2M    0 24.2M    0     0  2580k      0 --:--:--  0:00:09 --:--:--  671k


In [1]:
import csv, itertools, pprint
from dateutil import parser

input_file = csv.DictReader(open("turnstile_160611.txt"))
rawdict = {}

for row in input_file:
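    # Build the turnstile key from control area, unit, SCP, and station
    key = (row['C/A'], row['UNIT'], row['SCP'], row['STATION'])
    # The EXITS header is padded with trailing spaces in the raw file; copy the value to a clean column name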
    row['EXITS'] = row['EXITS                                                               ']
    # Start an empty list the first time a turnstile appears, then append this reading
    if key not in rawdict:
        rawdict[key] = []
    rawdict[key].append([row['LINENAME'], row['DIVISION'], row['DATE'], row['TIME'],
                        row['DESC'], row['ENTRIES'], row['EXITS'].strip()])


```rawdict``` is a solution to Challenge 1.
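
As a possible variation on the cell above, here is a minimal sketch that strips the padded header names once instead of spelling out the padded `EXITS` key, and uses `defaultdict(list)` so the `if key not in ...` check is not needed. The function name `group_by_turnstile` and the variable `rawdict_sketch` are made up for illustration, and the two sample rows reuse values from the printed output below with an assumed amount of trailing padding.

```python
import csv
import io
from collections import defaultdict

# Two rows in the layout of the MTA file; the trailing padding is an assumption.
sample = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS     \n"
    "A002,R051,02-00-00,59 ST,NQR456,BMT,06/04/2016,00:00:00,REGULAR,0005692587,0001927326     \n"
    "A002,R051,02-00-00,59 ST,NQR456,BMT,06/04/2016,04:00:00,REGULAR,0005692650,0001927338     \n"
)

def group_by_turnstile(handle):
    """Group rows by (C/A, UNIT, SCP, STATION); hypothetical helper."""
    reader = csv.DictReader(handle)
    # Reading fieldnames consumes the header row; strip the padding once here.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    grouped = defaultdict(list)  # missing keys start out as an empty list
    for row in reader:
        key = (row['C/A'], row['UNIT'], row['SCP'], row['STATION'])
        grouped[key].append([row['LINENAME'], row['DIVISION'], row['DATE'],
                             row['TIME'], row['DESC'], row['ENTRIES'],
                             row['EXITS'].strip()])
    return dict(grouped)

rawdict_sketch = group_by_turnstile(sample)
print(rawdict_sketch)
```

With the real file, the same function could be called inside `with open("turnstile_160611.txt", newline="") as f:` so the file is closed automatically.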

In [2]:
# Print the very beginning of the rawdict dictionary
for key in sorted(rawdict)[:1]:
    print (key, ':')
    for value in rawdict[key][:5]:
        print (' '*3, value)

('A002', 'R051', '02-00-00', '59 ST') :
    ['NQR456', 'BMT', '06/04/2016', '00:00:00', 'REGULAR', '0005692587', '0001927326']
    ['NQR456', 'BMT', '06/04/2016', '04:00:00', 'REGULAR', '0005692650', '0001927338']
    ['NQR456', 'BMT', '06/04/2016', '08:00:00', 'REGULAR', '0005692683', '0001927379']
    ['NQR456', 'BMT', '06/04/2016', '12:00:00', 'REGULAR', '0005692790', '0001927500']
    ['NQR456', 'BMT', '06/04/2016', '16:00:00', 'REGULAR', '0005693055', '0001927552']


# Challenge 2

In [3]:
import csv, itertools, pprint
from dateutil import parser

input_file = csv.DictReader(open("turnstile_160611.txt"))
time_series = {}

for row in input_file:
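    # Build the turnstile key from control area, unit, SCP, and station
    key = (row['C/A'], row['UNIT'], row['SCP'], row['STATION'])
    # ENTRIES is a zero-padded string such as '0005692587'; convert it to an int
    row['entries_n'] = int(row['ENTRIES'])
    # Start an empty list the first time a turnstile appears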
    if key not in time_series:
        time_series[key] = []
    dateString = row['DATE'] + " " + row['TIME']  
    dateObj = parser.parse(dateString)  
    time_series[key].append([dateObj, row['entries_n']])  

```time_series``` is a solution to Challenge 2.
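
`parser.parse` has to guess the date layout for every row. A minimal alternative sketch, assuming the file always uses `MM/DD/YYYY` dates and 24-hour `HH:MM:SS` times (as in the rows printed above), is the standard library's `datetime.strptime` with an explicit format. The helper name `parse_reading` is hypothetical.

```python
from datetime import datetime

def parse_reading(date_text, time_text):
    """Combine the DATE and TIME columns into one datetime (hypothetical helper)."""
    # The format string is an assumption based on values like '06/04/2016' and '04:00:00'.
    return datetime.strptime(date_text + " " + time_text, "%m/%d/%Y %H:%M:%S")

stamp = parse_reading("06/04/2016", "04:00:00")
print(repr(stamp))  # datetime.datetime(2016, 6, 4, 4, 0)
```

An explicit format raises `ValueError` on a row that does not match, which may be preferable to a silent wrong guess.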

In [4]:
# Print the first two turnstiles (just a few values per key)
for key in sorted(time_series)[:2]:
    print (key, ':')
    for value in time_series[key][:3]:
        print (' '*3, value)

('A002', 'R051', '02-00-00', '59 ST') :
    [datetime.datetime(2016, 6, 4, 0, 0), 5692587]
    [datetime.datetime(2016, 6, 4, 4, 0), 5692650]
    [datetime.datetime(2016, 6, 4, 8, 0), 5692683]
('A002', 'R051', '02-00-01', '59 ST') :
    [datetime.datetime(2016, 6, 4, 0, 0), 5222784]
    [datetime.datetime(2016, 6, 4, 4, 0), 5222817]
    [datetime.datetime(2016, 6, 4, 8, 0), 5222840]


# Challenge 3

At this point, the values for each key (i.e., turnstile) in ```time_series``` show cumulative turnstile entries at 4-hour intervals. 

7/11/16

Hi Reshama,
I ran into problems with #3 and met with Mike last week about it. My first problem was figuring out how to collapse and sum the elements belonging to a single day. He reminded me of setdefault() as a way to build up the new keys, something like this:

```python
count_per_date = {}
for date_hour_count in one_turnstile_date_hour_values: # for list in list_of_lists
    date_, entries = date_hour_count[0].date(), date_hour_count[1]  
    # sum the counts for each date
    count_per_date[date_] = count_per_date.setdefault(date_, 0) + entries
```
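
To make the `setdefault()` step concrete, here is a small runnable version of the same loop, split into two lines so each step is visible. The three readings are made-up numbers, not MTA data.

```python
from datetime import datetime

# Made-up readings for one turnstile: [timestamp, count]
one_turnstile_date_hour_values = [
    [datetime(2016, 6, 4, 0, 0), 10],
    [datetime(2016, 6, 4, 4, 0), 5],
    [datetime(2016, 6, 5, 0, 0), 7],
]

count_per_date = {}
for reading_time, entries in one_turnstile_date_hour_values:
    date_ = reading_time.date()
    # If date_ is already a key, setdefault returns its stored total.
    # If not, it stores 0 under date_ and returns that 0.
    running_total = count_per_date.setdefault(date_, 0)
    count_per_date[date_] = running_total + entries

print(count_per_date)
# {datetime.date(2016, 6, 4): 15, datetime.date(2016, 6, 5): 7}
```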

That evening, I rewrote the code we worked on without setdefault() because I was again having trouble following what it was doing. It's the same reason I still tend to use nested for loops instead of comprehensions: I'm not yet solid on what's going on under the hood, and for loops are easier to follow. (I have time to become a comprehension master later.) After much trial and error, I was able to accomplish the date collapsing and sorting, leaving the issue of counts for later. While the following is probably as un-Pythonic as it comes, at least I figured out part of it. 

```python
d_final = {}
for old_key, old_value in time_series.items():
    d = {}
    for lst in old_value:
        new_key = lst[0].date()
        new_value = lst[1]
        if new_key not in d:
            d[new_key] = 0
        d[new_key] = d[new_key] + new_value
    # sorted(d) returns only the sorted dates (the keys); the summed counts stay in d
    d2 = sorted(d)
    #print(d)
    d_final[old_key] = d2
```  

    ('N063', 'R011', '02-00-00', '42 ST-PORT AUTH') : [datetime.date(2016, 6, 4), datetime.date(2016, 6, 5), datetime.date(2016, 6, 6), datetime.date(2016, 6, 7), datetime.date(2016, 6, 8), datetime.date(2016, 6, 9), datetime.date(2016, 6, 10)]
    ('R421', 'R427', '00-06-01', 'MIDDLETOWN RD') : [datetime.date(2016, 6, 4), datetime.date(2016, 6, 5), datetime.date(2016, 6, 6), datetime.date(2016, 6, 7), datetime.date(2016, 6, 8), datetime.date(2016, 6, 9), datetime.date(2016, 6, 10)]
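
One possible way to handle the counts: because the readings are cumulative, adding them up per day overstates ridership, so a daily figure has to come from the differences between consecutive readings. The sketch below is one way to do that, not the assigned solution. It assumes the counter only ever increases (no resets or bad rows), credits each interval to the date on which it starts, and uses made-up numbers. The function name `daily_entries` is hypothetical.

```python
from datetime import datetime

# Made-up cumulative readings for one turnstile: [timestamp, counter value]
sample_readings = [
    [datetime(2016, 6, 4, 0, 0), 100],
    [datetime(2016, 6, 4, 12, 0), 130],
    [datetime(2016, 6, 5, 0, 0), 150],
    [datetime(2016, 6, 5, 12, 0), 190],
    [datetime(2016, 6, 6, 0, 0), 200],
]

def daily_entries(readings):
    """Sum the increase between consecutive readings per starting date."""
    ordered = sorted(readings)  # sort by timestamp
    totals = {}
    # Pair each reading with the one that follows it.
    for (start, before), (_, after) in zip(ordered, ordered[1:]):
        day = start.date()
        if day not in totals:
            totals[day] = 0
        totals[day] = totals[day] + (after - before)
    return totals

per_day = daily_entries(sample_readings)
print(per_day)
# {datetime.date(2016, 6, 4): 50, datetime.date(2016, 6, 5): 50}
```

The last reading (June 6 at midnight) only closes the June 5 interval, so June 6 gets no total of its own.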


### I intend to figure out all of #3 with some more help, and then do #4 onward, which somehow doesn't look as challenging to me. 

Note: I thought I'd learn more if I figured this out on my own, but I learned my lesson: spending so long on one thing doesn't speed up or enhance my learning if it means other assignments have to suffer. When I'm really stuck on something, getting help and tips EARLY and _then_ trying to go it alone will likely work better for me from now on. (Sounds obvious now, but it wasn't.)

## Everything below is just snippets, not part of this homework submission, but I wanted to show some of the things I was trying to do to help myself understand things.




```python
d = {"first_name": "Alfred", "last_name":"Hitchcock"}

for key,val in d.items():
    print("{} = {}".format(key, val))
```

    last_name = Hitchcock
    first_name = Alfred



```python
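from collections import defaultdict

# Group the numbers by color; a missing key starts out as an empty list
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]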
d = defaultdict(list)
for k, v in s:
    d[k].append(v)

list(d.items())
```


```python
from collections import defaultdict
s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)
```


```python
tmp   = [['day1',4],
         ['day2',5],
         ['day3',34],
         ['day2',32],
         ['day1',2],
         ['day3',1]]

# want to end up with:
# d = { 'day1': 6,
#       'day2': 37,
#       'day3': 35 }


d = {}
for lst in tmp:
    key = lst[0]
    value = lst[1]
    if key not in d:
        d[key] = 0
    d[key] = d[key] + value 
```


```python
newkey = None
d = None
```



```python
d_initial = { 'A' : [['day1',4],
                     ['day2',5],
                     ['day3',34],
                     ['day2',32],
                     ['day1',2],
                     ['day3',1]],
            'B' :   [['day1',40],
                     ['day2',50],
                     ['day3',340],
                     ['day2',320],
                     ['day1',20],
                     ['day3',10]],
            'C' :   [['day1',400],
                     ['day2',500],
                     ['day3',3400],
                     ['day2',3200],
                     ['day1',200],
                     ['day3',100]] } 
```


```python
# i want to end up with this:

d_final = {  'A' : [['day1',6],
                    ['day2',37],
                    ['day3',35]],
             'B' : [['day1',60],
                    ['day2',370],
                    ['day3',350]],
             'C' : [['day1',600],
                    ['day2',3700],
                    ['day3',3500]] }
```




```python
d = {"first_name": "Alfred", "last_name":"Hitchcock"}

for key,val in d.items():
    print("{} = {}".format(key, val))
```

    first_name = Alfred
    last_name = Hitchcock


```python
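d_final = {}
for k, v in d_initial.items():
    # one running-total dict per outer key, created BEFORE looping over its
    # readings (creating it inside the inner loop wipes the totals every time)
    ddd = {}
    for elem in v:
        newkey = elem[0]
        num = elem[1]
        if newkey not in ddd:
            ddd[newkey] = 0
        ddd[newkey] = ddd[newkey] + num
        print(k, newkey, num)
    # store the totals as sorted [day, total] pairs, matching the target shape
    d_final[k] = []
    for day in sorted(ddd):
        d_final[k].append([day, ddd[day]])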
```

    C day1 400
    C day2 500
    C day3 3400
    C day2 3200
    C day1 200
    C day3 100
    A day1 4
    A day2 5
    A day3 34
    A day2 32
    A day1 2
    A day3 1
    B day1 40
    B day2 50
    B day3 340
    B day2 320
    B day1 20
    B day3 10



```python
tmp = {'A': [['aaa', 4],
             ['bbb', 5],
             ['ccc', 34],
             ['bbb', 32],
             ['aaa', 2],
             ['ccc', 1]],
       'B': [['aaa', 40],
             ['bbb', 50],
             ['ccc', 340],
             ['bbb', 320],
             ['aaa', 20],
             ['ccc', 10]],
       'C': [['aaa', 400],
             ['bbb', 500],
             ['ccc', 3400],
             ['bbb', 3200],
             ['aaa', 200],
             ['ccc', 100]]}
```
