# Analyzing New York City subway data 

#### Author: Sushant N. More

### Data from web.mta.info/developers/turnstile.html, plus a Weather Underground data file obtained from the Udacity website

#### Revision history: 

Sept. 15, 2017: Started writing

In [31]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
import csv

## Data cleaning

We will be analysing subway data from the MTA's website for May 2011, because the weather data I want to relate it to later covers that month.

This is real-world data, so we can expect to spend significant effort cleaning and formatting it. The data is published in weekly files, so a given month is spread over several of them (five are used below for May 2011). Let's start by looking at the files to see how the data is arranged.

In [32]:
turnstile_df1 = pd.read_csv('./data/turnstile_110507_from_web_mta_info.txt')

In [33]:
turnstile_df1.head()

Unnamed: 0,A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907,04-30-11.1,04:00:00,...,05-01-11,00:00:00.1,REGULAR.6,003144312,001088151,05-01-11.1,04:00:00.1,REGULAR.7,003144335,001088159
0,A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,3144353,1088177,05-01-11,12:00:00,...,05-02-11,08:00:00,REGULAR,3144941.0,1088420.0,05-02-11,12:00:00,REGULAR,3145094.0,1088753.0
1,A002,R051,02-00-00,05-02-11,16:00:00,REGULAR,3145337,1088823,05-02-11,20:00:00,...,05-03-11,16:00:00,REGULAR,3146790.0,1089417.0,05-03-11,20:00:00,REGULAR,3147615.0,1089478.0
2,A002,R051,02-00-00,05-04-11,00:00:00,REGULAR,3147798,1089515,05-04-11,04:00:00,...,05-05-11,00:00:00,REGULAR,3149281.0,1090139.0,05-05-11,04:00:00,REGULAR,3149297.0,1090145.0
3,A002,R051,02-00-00,05-05-11,08:00:00,REGULAR,3149331,1090257,05-05-11,09:04:33,...,05-05-11,12:00:00,OPEN,3149494.0,1090579.0,05-05-11,16:00:00,DOOR,3149805.0,1090652.0
4,A002,R051,02-00-00,05-05-11,20:00:00,REGULAR,3150639,1090714,05-06-11,00:00:00,...,05-06-11,20:00:00,REGULAR,3152200.0,1091283.0,,,,,


The data is not as conveniently arranged as one might hope, and it is difficult to make sense of it by loading it into a pandas data frame. Let's try looking at the file contents directly.

In [34]:
with open("./data/turnstile_110507_from_web_mta_info.txt") as myfile:
    print(myfile.readlines()[0:4])

['A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907,04-30-11,04:00:00,REGULAR,003143547,001087915,04-30-11,08:00:00,REGULAR,003143563,001087935,04-30-11,12:00:00,REGULAR,003143646,001088024,04-30-11,16:00:00,REGULAR,003143865,001088083,04-30-11,20:00:00,REGULAR,003144181,001088132,05-01-11,00:00:00,REGULAR,003144312,001088151,05-01-11,04:00:00,REGULAR,003144335,001088159              \r\n', 'A002,R051,02-00-00,05-01-11,08:00:00,REGULAR,003144353,001088177,05-01-11,12:00:00,REGULAR,003144424,001088231,05-01-11,16:00:00,REGULAR,003144594,001088275,05-01-11,20:00:00,REGULAR,003144808,001088317,05-02-11,00:00:00,REGULAR,003144895,001088328,05-02-11,04:00:00,REGULAR,003144905,001088331,05-02-11,08:00:00,REGULAR,003144941,001088420,05-02-11,12:00:00,REGULAR,003145094,001088753              \r\n', 'A002,R051,02-00-00,05-02-11,16:00:00,REGULAR,003145337,001088823,05-02-11,20:00:00,REGULAR,003146168,001088888,05-03-11,00:00:00,REGULAR,003146322,001088918,05-03-11,04:00:00,REGULAR

As we can see, each row of the MTA Subway turnstile text file contains numerous data points.

We want to write a function that rewrites each row in the text file so that there is only one entry per row; a single row of the input file will therefore generate multiple rows. For instance, the first row displayed above turns into the following set of rows.

A002,R051,02-00-00,04-30-11,00:00:00,REGULAR,003143506,001087907
A002,R051,02-00-00,04-30-11,04:00:00,REGULAR,003143547,001087915
A002,R051,02-00-00,04-30-11,08:00:00,REGULAR,003143563,001087935
A002,R051,02-00-00,04-30-11,12:00:00,REGULAR,003143646,001088024
A002,R051,02-00-00,04-30-11,16:00:00,REGULAR,003143865,001088083
A002,R051,02-00-00,04-30-11,20:00:00,REGULAR,003144181,001088132
A002,R051,02-00-00,05-01-11,00:00:00,REGULAR,003144312,001088151
A002,R051,02-00-00,05-01-11,04:00:00,REGULAR,003144335,001088159

The first three elements of the input line (A002,R051,02-00-00) are repeated on each of the 8 lines of the output file.

Along with the above, the following two preemptive operations are done in the code below.

1) From the time column, make a new column for Hour, so that I can plot hourly data later on.

2) I want to combine this with weather data, which uses the date format yyyy-mm-dd, so I change the date format here.
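The reshaping and the two extra operations described above can be sketched for a single row as follows. This is only an illustration: `expand_row` and `sample` are names I made up here, and the input is assumed to be one line that has already been split on commas.

```python
def expand_row(fields):
    """Split one wide turnstile row into one record per audit event.

    `fields` is a single input line already split on commas. The first
    three values identify the turnstile; every following group of five
    values is (date, time, description, entries, exits).
    """
    key = fields[:3]
    records = []
    # Step through the remaining values five at a time; an incomplete
    # trailing group is ignored.
    for start in range(3, len(fields) - 4, 5):
        date, time, desc, entries, exits = fields[start:start + 5]
        # MM-DD-YY -> YYYY-MM-DD
        month, day, year = date.split('-')
        date_formatted = '20' + year + '-' + month + '-' + day
        # The hour is the leading part of the hh:mm:ss string
        hour = int(time.split(':')[0])
        records.append(key + [date_formatted, time, hour, desc, entries, exits])
    return records


# Shortened version of the first row of the file (two audit events only)
sample = ('A002,R051,02-00-00,'
          '04-30-11,00:00:00,REGULAR,003143506,001087907,'
          '04-30-11,04:00:00,REGULAR,003143547,001087915').split(',')

for record in expand_row(sample):
    print(record)
```

Each record has nine values: the three identifying fields, the reformatted date, the time, the hour, the description, and the two counters.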

As I found out the hard way, given the size of the data, in-memory computations are really time-consuming, so the files are processed line by line.

In [35]:
# Each input row is: C/A, UNIT, SCP, followed by repeated groups of five
# values (date, time, description, entries, exits).
with open("./data/turnstile_110507_from_web_mta_info.txt", 'r') as fin1, \
     open("./data/updated_turnstile_110507_from_web_mta_info.txt", 'w') as fout1:

    reader = csv.reader(fin1, delimiter = ',', quoting=csv.QUOTE_NONE)
    writer = csv.writer(fout1, delimiter = ',', quoting=csv.QUOTE_NONE)

    for line in reader:

        record1 = line[0]
        record2 = line[1]
        record3 = line[2]

        length = len(line)

        # Index of the last group of five; // keeps this an integer
        nn = (length - 1 - 7) // 5

        for i in range(0, nn + 1):

            date = line[5*i + 3]

            # MM-DD-YY -> YYYY-MM-DD, to match the weather data
            month, day, year = date.split('-')
            date_formatted = '20' + year + '-' + month + '-' + day

            # The hour is the leading part of the hh:mm:ss time string
            lineToWrite = [record1, record2, record3, date_formatted,
                           line[5*i + 4], int(line[5*i + 4].split(':')[0]),
                           line[5*i + 5], line[5*i + 6], line[5*i + 7]]

            writer.writerow(lineToWrite)

Check to see if the updated file looks as expected

In [36]:
with open("./data/updated_turnstile_110507_from_web_mta_info.txt") as myfile:
    print(myfile.readlines()[0:4])

['A002,R051,02-00-00,2011-04-30,00:00:00,0,REGULAR,003143506,001087907\r\n', 'A002,R051,02-00-00,2011-04-30,04:00:00,4,REGULAR,003143547,001087915\r\n', 'A002,R051,02-00-00,2011-04-30,08:00:00,8,REGULAR,003143563,001087935\r\n', 'A002,R051,02-00-00,2011-04-30,12:00:00,12,REGULAR,003143646,001088024\r\n']


Let's fix the data in other files all at once

In [37]:
fileloc = './data/'
filenames = [fileloc + 'turnstile_110514_from_web_mta_info.txt', \
            fileloc + 'turnstile_110521_from_web_mta_info.txt', \
            fileloc + 'turnstile_110528_from_web_mta_info.txt', \
            fileloc + 'turnstile_110604_from_web_mta_info.txt']

In [38]:
for name in filenames:

    # Write next to the input file, with 'updated_' prepended to the file name
    out_name = name[0:len(fileloc)] + 'updated_' + name[len(fileloc):]

    with open(name, 'r') as f_in, open(out_name, 'w') as f_out:

        reader_in = csv.reader(f_in, delimiter = ',', quoting=csv.QUOTE_NONE)
        writer_out = csv.writer(f_out, delimiter = ',', quoting=csv.QUOTE_NONE)

        for line in reader_in:

            record1 = line[0]
            record2 = line[1]
            record3 = line[2]

            length = len(line)

            # Index of the last group of five; // keeps this an integer
            nn = (length - 1 - 7) // 5

            for i in range(0, nn + 1):

                date = line[5*i + 3]

                # MM-DD-YY -> YYYY-MM-DD, to match the weather data
                month, day, year = date.split('-')
                date_formatted = '20' + year + '-' + month + '-' + day

                lineToWrite = [record1, record2, record3, date_formatted,
                               line[5*i + 4], int(line[5*i + 4].split(':')[0]),
                               line[5*i + 5], line[5*i + 6], line[5*i + 7]]

                writer_out.writerow(lineToWrite)
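The output file name above is built by slicing the path at `len(fileloc)`. A small sketch of an alternative using `os.path`, which does not depend on knowing the directory prefix in advance; `updated_path` is a hypothetical helper name, not something the notebook defines.

```python
import os


def updated_path(path, prefix='updated_'):
    """Return `path` with `prefix` added to the file name, keeping the directory."""
    directory, filename = os.path.split(path)
    return os.path.join(directory, prefix + filename)


print(updated_path('./data/turnstile_110514_from_web_mta_info.txt'))
```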

As mentioned in the field description file, the columns are as follows:

C/A = Control Area (e.g., A002)

UNIT = Remote Unit for a station (e.g., R051)

SCP = Subunit Channel Position, which represents a specific address for a device (e.g., 02-00-00)

DATEn = Represents the date (MM-DD-YY)

TIMEn = Represents the time (hh:mm:ss) for a scheduled audit event

DESCn = Represents the "REGULAR" scheduled audit event (occurs every 4 hours)

ENTRIESn = The cumulative entry register value for a device

EXITSn = The cumulative exit register value for a device

Let's combine all the input files we generated into a single file and have a header row at the top describing the entries

In [39]:
filenamesUpdated = [fileloc + 'updated_turnstile_110507_from_web_mta_info.txt', \
            fileloc + 'updated_turnstile_110514_from_web_mta_info.txt', \
            fileloc + 'updated_turnstile_110521_from_web_mta_info.txt', \
            fileloc + 'updated_turnstile_110528_from_web_mta_info.txt', \
            fileloc + 'updated_turnstile_110604_from_web_mta_info.txt'       ]

In [40]:
with open(fileloc + 'turnstile_master_data.csv', 'w') as master_file:
    master_file.write('C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn\n')
    
    for name in filenamesUpdated:
        with open(name) as infile:
            for line in infile:
                master_file.write(line)

In [41]:
with open("./data/turnstile_master_data.csv") as myfile:
    print(myfile.readlines()[0:4])

['C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn\n', 'A002,R051,02-00-00,2011-04-30,00:00:00,0,REGULAR,003143506,001087907\r\n', 'A002,R051,02-00-00,2011-04-30,04:00:00,4,REGULAR,003143547,001087915\r\n', 'A002,R051,02-00-00,2011-04-30,08:00:00,8,REGULAR,003143563,001087935\r\n']


In [42]:
turnstile_df = pd.read_csv('./data/turnstile_master_data.csv')

In [43]:
turnstile_df.head()

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,2011-04-30,00:00:00,0,REGULAR,3143506,1087907
1,A002,R051,02-00-00,2011-04-30,04:00:00,4,REGULAR,3143547,1087915
2,A002,R051,02-00-00,2011-04-30,08:00:00,8,REGULAR,3143563,1087935
3,A002,R051,02-00-00,2011-04-30,12:00:00,12,REGULAR,3143646,1088024
4,A002,R051,02-00-00,2011-04-30,16:00:00,16,REGULAR,3143865,1088083


In [44]:
turnstile_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1069902 entries, 0 to 1069901
Data columns (total 9 columns):
C/A         1069902 non-null object
UNIT        1069902 non-null object
SCP         1069902 non-null object
DATEn       1069902 non-null object
TIMEn       1069902 non-null object
Hour        1069902 non-null int64
DESCn       1069902 non-null object
ENTRIESn    1069902 non-null int64
EXITSn      1069902 non-null int64
dtypes: int64(3), object(6)
memory usage: 73.5+ MB


In [45]:
turnstile_df.DATEn.unique()

array(['2011-04-30', '2011-05-01', '2011-05-02', '2011-05-03',
       '2011-05-04', '2011-05-05', '2011-05-06', '2011-05-07',
       '2011-05-08', '2011-05-09', '2011-05-10', '2011-05-11',
       '2011-05-12', '2011-05-13', '2011-05-14', '2011-05-15',
       '2011-05-16', '2011-05-17', '2011-05-18', '2011-05-19',
       '2011-05-20', '2011-05-21', '2011-05-22', '2011-05-23',
       '2011-05-24', '2011-05-25', '2011-05-26', '2011-05-27',
       '2011-05-28', '2011-05-29', '2011-05-30', '2011-05-31',
       '2011-06-01', '2011-06-02', '2011-06-03'], dtype=object)

In [46]:
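# NOTE: turnstile_df_reg is only created in the DESCn == 'REGULAR' cell below, so this
# cell raises a NameError unless that cell has been run first; re-running that cell
# afterwards discards this date filter.
dates_to_drop = ['2011-04-30', '2011-05-31', '2011-06-01', '2011-06-02', '2011-06-03']
turnstile_df_reg = turnstile_df_reg.loc[~turnstile_df_reg['DATEn'].isin(dates_to_drop)]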

In [47]:
turnstile_df.DESCn.unique()

array(['REGULAR', 'DOOR', 'OPEN', 'TS', 'VLT', 'OPN', 'RECOVR', 'AUD',
       'LOGON', 'LGF-MAN', 'BRD', 'CHG'], dtype=object)

In [48]:
turnstile_df.DESCn.value_counts()

REGULAR    873271
DOOR        53615
OPEN        49333
RECOVR      41186
AUD         38530
TS           4665
VLT          4181
OPN          3192
LOGON        1896
BRD            15
CHG            13
LGF-MAN         5
Name: DESCn, dtype: int64

The field description key says that the 'REGULAR' in the DESCn column represents a scheduled audit event. So, let's keep only the entries which have 'REGULAR' in the DESCn column. 

In [49]:
turnstile_df_reg = turnstile_df.loc[turnstile_df['DESCn'] == 'REGULAR']

In [50]:
turnstile_df_udacity = pd.read_csv('./data/turnstile_data_master_with_weather_udacity.csv')

In [54]:
turnstile_df_udacity_improved = \
pd.read_csv('./data/improved-dataset/improved-dataset/turnstile_weather_v2.csv')

The MTA Subway turnstile data reports the cumulative number of entries and exits in each row. We would like to find the entries since the last reading, since that can be translated into subway ridership.

We create a new column called ENTRIESn_4hourly and assign to it the difference between ENTRIESn of the current row and the previous row.
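One thing to watch with a row-by-row difference is that the counters belong to individual turnstiles, so the first row of each turnstile would be subtracted from the last row of a different device. A minimal sketch of one way to avoid that, on made-up toy data, is to take the difference within each (C/A, UNIT, SCP) group using pandas `groupby(...).diff()`. This is an alternative illustration, not the approach the cells in this notebook take.

```python
import pandas as pd

# Toy data (made up for illustration): two turnstiles, three readings each.
toy = pd.DataFrame({
    'C/A':      ['A002'] * 6,
    'UNIT':     ['R051'] * 6,
    'SCP':      ['02-00-00'] * 3 + ['02-00-01'] * 3,
    'ENTRIESn': [100, 140, 200, 5000, 5010, 5070],
})

# diff() within each (C/A, UNIT, SCP) group subtracts the previous reading of
# the same turnstile, so the first reading of every turnstile becomes NaN
# (replaced by 0 here) instead of a jump from the previous device's counter.
toy['ENTRIESn_4hourly'] = (
    toy.groupby(['C/A', 'UNIT', 'SCP'])['ENTRIESn'].diff().fillna(0)
)

print(toy)
```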

In [51]:
turnstile_df_reg.shape

(873271, 9)

In [52]:
turnstile_df_udacity.shape

(131951, 22)

In [55]:
turnstile_df_udacity_improved.shape

(42649, 27)

In [56]:
turnstile_df_udacity_improved.DATEn.unique()

array(['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
       '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11',
       '05-11-11', '05-12-11', '05-13-11', '05-14-11', '05-15-11',
       '05-16-11', '05-17-11', '05-18-11', '05-19-11', '05-20-11',
       '05-21-11', '05-22-11', '05-23-11', '05-24-11', '05-25-11',
       '05-26-11', '05-27-11', '05-28-11', '05-29-11', '05-30-11',
       '05-31-11'], dtype=object)

In [57]:
turnstile_df_udacity.DATEn.unique()

array(['2011-05-01', '2011-05-02', '2011-05-03', '2011-05-04',
       '2011-05-05', '2011-05-06', '2011-05-07', '2011-05-08',
       '2011-05-09', '2011-05-10', '2011-05-11', '2011-05-12',
       '2011-05-13', '2011-05-14', '2011-05-15', '2011-05-16',
       '2011-05-17', '2011-05-18', '2011-05-19', '2011-05-20',
       '2011-05-21', '2011-05-22', '2011-05-23', '2011-05-24',
       '2011-05-25', '2011-05-26', '2011-05-27', '2011-05-28',
       '2011-05-29', '2011-05-30'], dtype=object)

In [59]:
turnstile_df_udacity.UNIT.nunique()

465

In [60]:
turnstile_df_udacity_improved.UNIT.nunique()

240

In [61]:
turnstile_df.UNIT.nunique()

465

In [62]:
turnstile_df_udacity.UNIT.nunique()

465

In [53]:
turnstile_df_udacity.head()

Unnamed: 0.1,Unnamed: 0,UNIT,DATEn,TIMEn,Hour,DESCn,ENTRIESn_hourly,EXITSn_hourly,maxpressurei,maxdewpti,...,meandewpti,meanpressurei,fog,rain,meanwindspdi,mintempi,meantempi,maxtempi,precipi,thunder
0,0,R001,2011-05-01,01:00:00,1,REGULAR,0.0,0.0,30.31,42.0,...,39.0,30.27,0.0,0.0,5.0,50.0,60.0,69.0,0.0,0.0
1,1,R001,2011-05-01,05:00:00,5,REGULAR,217.0,553.0,30.31,42.0,...,39.0,30.27,0.0,0.0,5.0,50.0,60.0,69.0,0.0,0.0
2,2,R001,2011-05-01,09:00:00,9,REGULAR,890.0,1262.0,30.31,42.0,...,39.0,30.27,0.0,0.0,5.0,50.0,60.0,69.0,0.0,0.0
3,3,R001,2011-05-01,13:00:00,13,REGULAR,2451.0,3708.0,30.31,42.0,...,39.0,30.27,0.0,0.0,5.0,50.0,60.0,69.0,0.0,0.0
4,4,R001,2011-05-01,17:00:00,17,REGULAR,4400.0,2501.0,30.31,42.0,...,39.0,30.27,0.0,0.0,5.0,50.0,60.0,69.0,0.0,0.0


In [63]:
turnstile_df.head()

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn
0,A002,R051,02-00-00,2011-04-30,00:00:00,0,REGULAR,3143506,1087907
1,A002,R051,02-00-00,2011-04-30,04:00:00,4,REGULAR,3143547,1087915
2,A002,R051,02-00-00,2011-04-30,08:00:00,8,REGULAR,3143563,1087935
3,A002,R051,02-00-00,2011-04-30,12:00:00,12,REGULAR,3143646,1088024
4,A002,R051,02-00-00,2011-04-30,16:00:00,16,REGULAR,3143865,1088083


In [81]:
turnstile_df_reg.loc[(turnstile_df_reg['UNIT'] == 'R051') & (turnstile_df_reg['C/A'] != 'A002'), :]

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn
168711,R245,R051,00-00-00,2011-04-30,00:00:00,0,REGULAR,10641946,2070651
168712,R245,R051,00-00-00,2011-04-30,04:00:00,4,REGULAR,10642028,2070661
168715,R245,R051,00-00-00,2011-04-30,12:00:00,12,REGULAR,10642198,2070765
168716,R245,R051,00-00-00,2011-04-30,16:00:00,16,REGULAR,10642661,2070930
168717,R245,R051,00-00-00,2011-04-30,20:00:00,20,REGULAR,10643410,2071053
168718,R245,R051,00-00-00,2011-05-01,00:00:00,0,REGULAR,10643811,2071092
168719,R245,R051,00-00-00,2011-05-01,04:00:00,4,REGULAR,10643884,2071102
168722,R245,R051,00-00-00,2011-05-01,12:00:00,12,REGULAR,10644017,2071155
168723,R245,R051,00-00-00,2011-05-01,16:00:00,16,REGULAR,10644342,2071260
168724,R245,R051,00-00-00,2011-05-01,20:00:00,20,REGULAR,10644901,2071352


In [79]:
True & True

True

In [82]:
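# Despite its name, SCPChange is True when the SCP is the *same* as in the previous row
# and False on the first row of each new SCP.
turnstile_df_reg['SCPChange'] = (turnstile_df_reg['SCP'] == turnstile_df_reg['SCP'].shift(1))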

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [85]:
turnstile_df_reg[turnstile_df_reg['SCPChange'] == False].head()

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn,SCPChange
0,A002,R051,02-00-00,2011-04-30,00:00:00,0,REGULAR,3143506,1087907,False
47,A002,R051,02-00-01,2011-04-30,00:00:00,0,REGULAR,3093886,658807,False
93,A002,R051,02-03-00,2011-04-30,00:00:00,0,REGULAR,1286852,4585104,False
138,A002,R051,02-03-01,2011-04-30,00:00:00,0,REGULAR,2845503,4386061,False
185,A002,R051,02-03-02,2011-04-30,00:00:00,0,REGULAR,2616361,3583719,False


In [86]:
turnstile_df_reg.loc[45:48, :]

Unnamed: 0,C/A,UNIT,SCP,DATEn,TIMEn,Hour,DESCn,ENTRIESn,EXITSn,SCPChange
45,A002,R051,02-00-00,2011-05-06,16:00:00,16,REGULAR,3151315,1091227,True
46,A002,R051,02-00-00,2011-05-06,20:00:00,20,REGULAR,3152200,1091283,True
47,A002,R051,02-00-01,2011-04-30,00:00:00,0,REGULAR,3093886,658807,False
48,A002,R051,02-00-01,2011-04-30,04:00:00,4,REGULAR,3093909,658812,True
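The SettingWithCopyWarning raised when SCPChange was assigned commonly appears when a column is added to a frame that was itself obtained by filtering another frame. A minimal sketch on made-up toy data, assuming the usual remedy of taking an explicit `.copy()` of the filtered frame before adding columns (`toy` and `toy_reg` are illustrative names):

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'SCP':   ['02-00-00', '02-00-00', '02-00-01', '02-00-01'],
    'DESCn': ['REGULAR', 'DOOR', 'REGULAR', 'REGULAR'],
})

# .copy() makes the filtered frame independent of `toy`, so adding a column
# to it is unambiguous.
toy_reg = toy.loc[toy['DESCn'] == 'REGULAR'].copy()

# True where the row has the same SCP as the previous row.
toy_reg['SCPChange'] = toy_reg['SCP'] == toy_reg['SCP'].shift(1)

print(toy_reg)
```

The original frame `toy` is left untouched; only `toy_reg` gains the new column.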


In [None]:
turnstile_df_reg['SCPChange'].value_counts()

In [None]:
turnstile_df_reg.loc[:,'ENTRIESn_4hourly'] = \
    turnstile_df_reg.loc[:, 'ENTRIESn'] - turnstile_df_reg.loc[:, 'ENTRIESn'].shift(1) 
# using .loc is a safe way of doing this rather than using 'chained' indexing

In [None]:
turnstile_df_reg['ENTRIESn_4hourly_actual'] = turnstile_df_reg['ENTRIESn_4hourly'] * turnstile_df_reg['SCPChange']

In [None]:
turnstile_df_reg

As expected, the first value is NaN. Let's replace it with 0.

In [None]:
turnstile_df_reg.iloc[11737:11742]

In [None]:
turnstile_df_reg.loc[turnstile_df_reg['ENTRIESn_4hourly_actual'] < 0]

In [None]:
turnstile_df_reg.loc[:, 'ENTRIESn_4hourly'] = turnstile_df_reg.loc[:, 'ENTRIESn_4hourly'].fillna(0)

In [None]:
turnstile_df_reg.head()

Let's do the same thing for the exits and replace NaN with 0.

In [None]:
turnstile_df_reg['DATEn'] == turnstile_df_reg['DATEn'].shift(1);

In [None]:
turnstile_df_reg.loc[:,'EXITSn_4hourly'] = \
    turnstile_df_reg.loc[:, 'EXITSn'] - turnstile_df_reg.loc[:, 'EXITSn'].shift(1) 

In [None]:
turnstile_df_reg.loc[:, 'EXITSn_4hourly'] = turnstile_df_reg.loc[:, 'EXITSn_4hourly'].fillna(0)

In [None]:
turnstile_df_reg.head()

In [None]:
turnstile_df_reg[turnstile_df_reg['UNIT'] == 'R001']

For future analysis, it's convenient if we convert TIMEn into the hour of the day. 

In [None]:
turnstile_df_reg.head(10)

Removing some of the April and June values that had crept in

In [None]:
turnstile_df_reg['DATEn'].unique()

In [None]:
turnstile_df_reg.head(5)

## Merge the weather data

In [None]:
weather_df = pd.read_csv('./data/weather-underground.csv')

In [None]:
weather_df.head()

In [None]:
weather_df.shape

In [None]:
turnstile_df_reg.shape

In [None]:
turnstile_df_reg.DATEn.unique()

In [None]:
turnstile_df_reg.DATEn.nunique()

In [None]:
weather_df.columns

It's clear that many of these features are unlikely to have an effect on subway ridership. We can also get rid of columns that have the same value throughout, for instance snowfall = 0 (it's the month of May!). So let's pick the columns which could be relevant.
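Columns that hold the same value throughout can also be found programmatically rather than by eye. A minimal sketch on a made-up toy table; the column name `snow` and all values are hypothetical and not taken from the real weather file.

```python
import pandas as pd

# Toy weather table (values made up for illustration).
toy_weather = pd.DataFrame({
    'date':      ['2011-05-01', '2011-05-02', '2011-05-03'],
    'meantempi': [60.0, 57.0, 65.0],
    'rain':      [0, 1, 0],
    'snow':      [0, 0, 0],   # same value on every day
})

# A column with a single distinct value cannot explain any variation.
n_distinct = toy_weather.nunique()
constant_columns = n_distinct[n_distinct <= 1].index.tolist()
print(constant_columns)

toy_weather_varying = toy_weather.drop(columns=constant_columns)
```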

In [None]:
weather_df_relevant = weather_df[['date', 'maxpressurei', 'maxdewpti', 'mindewpti', 'minpressurei', 'meandewpti', \
                                 'meanpressurei', 'fog', 'rain', 'meanwindspdi', 'mintempi', 'meantempi', \
                                 'maxtempi', 'precipi', 'thunder']]

In [None]:
weather_df_relevant.head()

In [None]:
turnstile_df_reg.columns

In [None]:
weather_df_relevant = weather_df_relevant.rename(columns = {'date': 'DATEn'})

**Make the column name for the date column match**

In [None]:
weather_df_relevant.head()

In [None]:
turnstile_weather_df = pd.merge(turnstile_df_reg, weather_df_relevant, on = 'DATEn', how = 'outer')
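With an outer merge, dates that appear in only one of the two tables are kept, with NaN in the columns coming from the other table. A minimal sketch on made-up toy frames of how the `indicator=True` option of `pd.merge` can show where each row came from (`toy_turnstile`, `toy_weather` and `weather_only` are illustrative names):

```python
import pandas as pd

# Toy frames (made up for illustration).
toy_turnstile = pd.DataFrame({
    'DATEn':    ['2011-05-01', '2011-05-01', '2011-05-02'],
    'ENTRIESn': [10, 20, 30],
})
toy_weather = pd.DataFrame({
    'DATEn': ['2011-05-01', '2011-05-02', '2011-05-03'],
    'rain':  [0, 1, 0],
})

# indicator=True adds a `_merge` column saying where each row came from:
# 'both', 'left_only' or 'right_only'.
merged = pd.merge(toy_turnstile, toy_weather, on='DATEn', how='outer', indicator=True)
print(merged['_merge'].value_counts())

# Dates that exist only in the weather table have no turnstile readings.
weather_only = merged.loc[merged['_merge'] == 'right_only', 'DATEn'].tolist()
print(weather_only)
```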

In [None]:
turnstile_weather_df.head()

In [None]:
turnstile_weather_df.tail()

In [None]:
turnstile_weather_df.shape

In [None]:
turnstile_df_reg.shape

In [None]:
weather_df_relevant.shape

## Data analysis

Now that we have merged the weather and the subway data, let's do some data exploration.

Let's start by looking at ENTRIESn_4hourly_actual and checking what distribution it follows, comparing rainy and non-rainy days.

In [None]:
plt.figure()
# Select rows and column in a single .loc call instead of chaining two indexing steps
turnstile_weather_df.loc[turnstile_weather_df['rain'] == 0, 'ENTRIESn_4hourly_actual'].hist(bins = 2000, label = 'no rain')
turnstile_weather_df.loc[turnstile_weather_df['rain'] == 1, 'ENTRIESn_4hourly_actual'].hist(bins = 2000, label = 'rain')
plt.xlim([0, 6000])
plt.legend()
plt.xlabel('ENTRIESn_4hourly_actual')
plt.ylabel('Frequency')

In [None]:
turnstile_weather_df.loc[:,'ENTRIESn_4hourly_actual'][turnstile_weather_df['rain'] == 0]

In [None]:
np.mean(turnstile_weather_df['ENTRIESn_4hourly_actual'][turnstile_weather_df['rain'] == 1])
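The same kind of summary can be computed for rainy and non-rainy rows in one step. A minimal sketch on made-up toy data, using `groupby` with `agg`; the numbers are purely illustrative and say nothing about the real ridership figures.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'rain': [0, 0, 0, 1, 1],
    'ENTRIESn_4hourly_actual': [100.0, 200.0, 300.0, 150.0, 250.0],
})

# One row per value of `rain`, with the group size, mean and median.
summary = toy.groupby('rain')['ENTRIESn_4hourly_actual'].agg(['count', 'mean', 'median'])
print(summary)
```

Looking at the median next to the mean is useful here because the histogram suggests the distribution is strongly skewed.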

In [None]:
turnstile_weather_df.loc[1]

In [None]:
turnstile_weather_df['UNIT'].nunique()

In [None]:
turnstile_weather_df[turnstile_weather_df['UNIT'] == 'R001']

In [None]:
turnstile_weather_df.shape

In [None]:
turnstile_weather_df[['C/A','UNIT','ENTRIESn_4hourly']].groupby(['UNIT'], as_index = False).sum()

In [None]:
turnstile_weather_df[turnstile_weather_df['ENTRIESn_4hourly'] < 0]

In [None]:
True

In [None]:
int(True)

In [None]:
True * 5

In [None]:
False * 5