# Analyzing New York City subway data 

#### Author: Sushant N. More

### Data from web.mta.info/developers/turnstile.html, plus a Weather Underground data file obtained from the Udacity website

#### Revision history: 

Sept. 15, 2017: Started writing

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
import csv

## Data cleaning

We will analyse subway data from the MTA's website for May 2011, because the weather data I later want to relate it to covers that month.

This is real-world data, so we can expect to spend significant effort cleaning and formatting it. The data for a given month is split across four files. Let's start by looking at one of the files to see how the data is arranged.

In [3]:
turnstile_df1 = pd.read_csv('./data/turnstile_110507_from_web_mta_info.txt')

In [4]:
turnstile_df1.head()

Unnamed: 0,A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907,04-30-11.1,04:00:00,...,05-01-11,00:00:00.1,REGULAR.6,003144312,001088151,05-01-11.1,04:00:00.1,REGULAR.7,003144335,001088159
0,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177,05-01-11,12:00:00,...,05-02-11,08:00:00,REGULAR,3144941.0,1088420.0,05-02-11,12:00:00,REGULAR,3145094.0,1088753.0
1,A002,R051,02-00-00,05-02-11,16:00:00,REGULAR,3145337,1088823,05-02-11,20:00:00,...,05-03-11,16:00:00,REGULAR,3146790.0,1089417.0,05-03-11,20:00:00,REGULAR,3147615.0,1089478.0
2,A002,R051,02-00-00,05-04-11,00:00:00,REGULAR,3147798,1089515,05-04-11,04:00:00,...,05-05-11,00:00:00,REGULAR,3149281.0,1090139.0,05-05-11,04:00:00,REGULAR,3149297.0,1090145.0
3,A002,R051,02-00-00,05-05-11,08:00:00,REGULAR,3149331,1090257,05-05-11,09:04:33,...,05-05-11,12:00:00,OPEN,3149494.0,1090579.0,05-05-11,16:00:00,DOOR,3149805.0,1090652.0
4,A002,R051,02-00-00,05-05-11,20:00:00,REGULAR,3150639,1090714,05-06-11,00:00:00,...,05-06-11,20:00:00,REGULAR,3152200.0,1091283.0,,,,,


As expected, the data is not conveniently arranged, and it is difficult to make sense of it after loading it into a pandas data frame. Let's try looking at the file contents directly.

In [20]:
with open("./data/turnstile_110507_from_web_mta_info.txt") as myfile:
    # print(...) with a single argument works in both Python 2 and Python 3
    print(myfile.readlines()[0:4])

['A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907,04-30-11,04:00:00,REGULAR,003143547,001087915,04-30-11,08:00:00,REGULAR,003143563,001087935,04-30-11,12:00:00,REGULAR,003143646,001088024,04-30-11,16:00:00,REGULAR,003143865,001088083,04-30-11,20:00:00,REGULAR,003144181,001088132,05-01-11,00:00:00,REGULAR,003144312,001088151,05-01-11,04:00:00,REGULAR,003144335,001088159              \r\n', 'A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,003144353,001088177,05-01-11,12:00:00,REGULAR,003144424,001088231,05-01-11,16:00:00,REGULAR,003144594,001088275,05-01-11,20:00:00,REGULAR,003144808,001088317,05-02-11,00:00:00,REGULAR,003144895,001088328,05-02-11,04:00:00,REGULAR,003144905,001088331,05-02-11,08:00:00,REGULAR,003144941,001088420,05-02-11,12:00:00,REGULAR,003145094,001088753              \r\n', 'A002,R051,02-00-00,05-02-11,16:00:00,REGULAR,003145337,001088823,05-02-11,20:00:00,REGULAR,003146168,001088888,05-03-11,00:00:00,REGULAR,003146322,001088918,05-03-11,04:00:00,REGULAR

As we can see, there are numerous data points included in each row of the MTA Subway turnstile text file. 

We want to rewrite the file so that there is only one entry per row, which means a single row from the input file will generate multiple rows. For instance, the first row displayed above will turn into the following set of rows.

A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907
A002,R051,02-00-00,04-30-11,04:00:00,REGULAR,003143547,001087915
A002,R051,02-00-00,04-30-11,08:00:00,REGULAR,003143563,001087935
A002,R051,02-00-00,04-30-11,12:00:00,REGULAR,003143646,001088024
A002,R051,02-00-00,04-30-11,16:00:00,REGULAR,003143865,001088083
A002,R051,02-00-00,04-30-11,20:00:00,REGULAR,003144181,001088132
A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,003144312,001088151
A002,R051,02-00-00,05-01-11,04:00:00,REGULAR,003144335,001088159

The first three elements of the input line (A002, R051, 02-00-00) are repeated on each of the 8 lines in the output file. The remaining 40 fields form 8 groups of five, one group per output line.
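
To make the index arithmetic concrete, here is a minimal sketch that splits a single row held in memory. The function name `split_row` is hypothetical, and the sample is a shortened version of the first row of the file (only its first two readings), so it should be read as an illustration of the layout rather than the notebook's actual processing step.

```python
def split_row(fields):
    """Split one wide row into rows of 8 fields each.

    Assumes the layout seen in the file: 3 identifying fields followed by
    groups of 5 fields (date, time, a status such as REGULAR, two counters).
    """
    key = fields[:3]
    # Integer division ignores any incomplete trailing group
    n_readings = (len(fields) - 3) // 5
    return [key + fields[3 + 5 * i: 8 + 5 * i] for i in range(n_readings)]


# Shortened sample: the first two readings of the first row in the file
sample = ("A002,R051,02-00-00,"
          "04-30-11,00:00:00,REGULAR,003143506,001087907,"
          "04-30-11,04:00:00,REGULAR,003143547,001087915").split(",")

rows = split_row(sample)
for row in rows:
    print(",".join(row))
```

Reading `i` selects fields `3 + 5*i` through `7 + 5*i`, which is why the slice ends at `8 + 5*i`.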

In [24]:
# Each input row holds 3 identifying fields followed by several readings
# of 5 fields each (date, time, a status such as REGULAR, two counters).
# Write one output row per reading, repeating the identifying fields.
with open("./data/turnstile_110507_from_web_mta_info.txt", 'r') as fin1, \
     open("./data/updated_turnstile_110507_from_web_mta_info.txt", 'w') as fout1:

    reader = csv.reader(fin1, delimiter=',', quoting=csv.QUOTE_NONE)
    writer = csv.writer(fout1, delimiter=',', quoting=csv.QUOTE_NONE)

    for line in reader:

        # Fields shared by every reading in this row
        key = line[0:3]

        # Number of complete 5-field readings; // keeps this an integer
        # in both Python 2 and Python 3
        n_readings = (len(line) - 3) // 5

        for i in range(n_readings):
            writer.writerow(key + line[5*i + 3 : 5*i + 8])
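
Since the month's data is spread over four files, the same steps could be wrapped in a function and applied to each file in turn. The following is a minimal sketch under a few assumptions: it targets Python 3 (hence `newline=""`), every file shares the layout inspected above, and the name `split_turnstile_file` is hypothetical. Unlike the loop above, it also strips whitespace from each field, because the last field of each input row carries trailing spaces.

```python
import csv


def split_turnstile_file(in_path, out_path):
    """Rewrite a wide turnstile file with one reading per row.

    Returns the number of rows written. Assumes 3 identifying fields
    followed by groups of 5 fields on every input row.
    """
    n_written = 0
    # newline="" lets the csv module handle line endings itself (Python 3)
    with open(in_path, "r", newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, quoting=csv.QUOTE_NONE)
        writer = csv.writer(fout, quoting=csv.QUOTE_NONE)
        for line in reader:
            # Strip padding such as the trailing spaces on the last field
            fields = [f.strip() for f in line]
            key = fields[:3]
            for i in range((len(fields) - 3) // 5):
                writer.writerow(key + fields[3 + 5 * i: 8 + 5 * i])
                n_written += 1
    return n_written
```

A call would take the input path and the desired output path, for example the two paths used in the cell above. The other three file names are not listed here, so they are left for the reader to supply.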

Check that the first few lines of the updated file look as expected.

In [25]:
with open("./data/updated_turnstile_110507_from_web_mta_info.txt") as myfile:
    # print(...) with a single argument works in both Python 2 and Python 3
    print(myfile.readlines()[0:4])

['A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907\r\n', 'A002,R051,02-00-00,04-30-11,04:00:00,REGULAR,003143547,001087915\r\n', 'A002,R051,02-00-00,04-30-11,08:00:00,REGULAR,003143563,001087935\r\n', 'A002,R051,02-00-00,04-30-11,12:00:00,REGULAR,003143646,001088024\r\n']
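
Looking at the first few lines does not guarantee that the whole file is well formed. As a sketch of a more thorough check, one could count how many fields each row has. The helper `count_field_widths` is hypothetical. It accepts any iterable of lines, so it is demonstrated here on the first two rows printed above rather than on the file itself.

```python
import csv
from collections import Counter


def count_field_widths(lines):
    """Return a Counter mapping number of fields -> number of rows."""
    reader = csv.reader(lines, quoting=csv.QUOTE_NONE)
    return Counter(len(row) for row in reader)


# Demonstrated on the first two rows of the updated file
demo = [
    "A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907\r\n",
    "A002,R051,02-00-00,04-30-11,04:00:00,REGULAR,003143547,001087915\r\n",
]
widths = count_field_widths(demo)
print(widths)
```

On the real file, the open file object can be passed in place of `demo`. A result with the single key 8 would indicate that every row has the expected eight fields.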
