# Analyzing the NYC Subway Dataset

Project connected to the [Udacity Intro to Data Science course](https://www.udacity.com/course/viewer#!/c-ud359-nd).

by Victor Ribeiro, October/2015

---

### Section 0. References

**About the Dataset**

The Turnstile and Weather Variables dataset reports the cumulative number of entries and exits at NYC subway turnstiles, together with additional information about the weather.

* [Original Dataset](https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv) - data set used throughout the course and used in the report below.
* [Improved Dataset](https://www.dropbox.com/s/1lpoeh2w6px4diu/improved-dataset.zip?dl=0) - cleaned-up subset of original dataset with additional variables. [Variables in the dataset](https://s3.amazonaws.com/uploads.hipchat.com/23756/665149/05bgLZqSsMycnkg/turnstile-weather-variables.pdf)

**References**

* [Mann-Whitney U Test](https://storage.googleapis.com/supplemental_media/udacityu/4332539257/MannWhitneyUTest.pdf) Udacity
* [Mann-Whitney U Test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) Wikipedia
* [Shapiro-Wilk Test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) Wikipedia
* [Shapiro-Wilk Test](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html) Python reference
* Diez, David; Barr, Christopher; Çetinkaya-Rundel, Mine. OpenIntro Statistics, Third Edition

---

### Section 1. Statistical Test

**1.1 Which statistical test did you use to analyze the NYC subway data? Did you use a one-tail or a two-tail P value? What is the null hypothesis? What is your p-critical value?**

>Considering the proposed project:
>* Null hypothesis: there is no difference in the number of subway rides between rainy and non-rainy periods.
>* Alternative hypothesis: there is a difference in the number of subway rides between rainy and non-rainy periods.
>
>A Mann-Whitney U test is applied. It is a two-tailed test and the p-critical value is 5% (or 0.05).
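
As a rough illustration of what the Mann-Whitney U test measures, the sketch below computes the U statistic and an approximate two-tailed p-value on made-up numbers. It is not the code used in this report: the function name `mann_whitney_u` and the sample values are hypothetical, and the p-value uses a plain normal approximation without tie or continuity corrections, so it should not be expected to match `scipy.stats.mannwhitneyu` exactly.

```python
from __future__ import division, print_function
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, approximate two-tailed p-value).

    Assumption: normal approximation with no tie correction and no
    continuity correction, so the p-value is only a rough guide.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs (a, b) with a > b; a tie counts as half a pair.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Two-tailed p-value under the standard normal distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_tailed


# Made-up hourly entries, for illustration only.
rain = [12, 15, 18, 20, 22]
no_rain = [10, 11, 13, 14, 16]

u, p = mann_whitney_u(rain, no_rain)
alpha = 0.05
print('U =', u, 'p =', p, 'reject null:', p < alpha)
```

The pairwise count is quadratic in the sample sizes, so this form is only meant to show the idea; on the full dataset a library implementation is the practical choice.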

---
**1.2 Why is this statistical test applicable to the dataset? In particular, consider the assumptions that the test is making about the distribution of ridership in the two samples.**

>Taking a look at the data to support the decision made in 1.1.

In [None]:
import numpy as np
import pandas
import pandasql
import matplotlib.pyplot as plt
import datetime
import csv
import scipy
import scipy.stats
import statsmodels.api as sm
import sys
from ggplot import *
import itertools

df = pandas.read_csv("turnstile_data_master_with_weather.csv")
#df.describe()

In [None]:
df.head()

In [None]:
plt.figure()
bins = 25
alpha = 0.75
df[df['rain']==0]['ENTRIESn_hourly'].hist(bins = bins, alpha=alpha) 
df[df['rain']==1]['ENTRIESn_hourly'].hist(bins = bins, alpha=alpha) 
    
plt.suptitle('Histogram of ENTRIESn_hourly')
plt.ylabel('Frequency')
plt.xlabel('ENTRIESn_hourly')
plt.legend(['no rain', 'rain'])
plt.grid(True)
plt.show()

![Histogram Raining and Non-Raining](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_1.png?raw=true)

>As the histogram above shows, neither the rain nor the no-rain data follows a normal distribution. Indeed, applying the Shapiro-Wilk test (below) confirms that the two samples are not normally distributed, as the p-value of the test is very small for both the rain and the no-rain data.

In [None]:
print scipy.stats.shapiro(df[df['rain']==0]['ENTRIESn_hourly'])
print scipy.stats.shapiro(df[df['rain']==1]['ENTRIESn_hourly'])

>Thus, a non-parametric test (a test that does not assume the data is drawn from any particular underlying probability distribution) like the Mann-Whitney U test is applicable.
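
A quick complementary check on the shape of a distribution is its skewness, which is close to zero for symmetric data and clearly positive for a long right tail like the one in the histogram. The sketch below is not the Shapiro-Wilk test and is not part of the original analysis; the function name `skewness` and both samples are made up for illustration.

```python
from __future__ import division, print_function


def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [0, 0, 1, 1, 2, 3, 50]

print('symmetric:', skewness(symmetric))
print('right-skewed:', skewness(right_skewed))
```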

---
**1.3 What results did you get from this statistical test? These should include the following numerical values: p-values, as well as the means for each of the two samples under test.**

>Applying Mann-Whitney U Test.



In [None]:
with_rain_mean = np.mean(df[df['rain']==1]['ENTRIESn_hourly'])
without_rain_mean = np.mean(df[df['rain']==0]['ENTRIESn_hourly'])
    
U,p = scipy.stats.mannwhitneyu(df[df['rain']==1]['ENTRIESn_hourly'],
                               df[df['rain']==0]['ENTRIESn_hourly'])

print 'Mean with rain :',with_rain_mean, '\nMean without rain :', without_rain_mean, '\nU :', U, '\n2*p:', 2*p

**1.4 What is the significance and interpretation of these results?**

>The `ENTRIESn_hourly` mean with rain is slightly larger than the mean without rain. The two-tailed p-value (2 \* p) is slightly below 0.05, thus the **null hypothesis is rejected**.

---


### Section 2. Linear Regression

**2.1 What approach did you use to compute the coefficients theta and produce prediction for ENTRIESn_hourly in your regression model:**
- **OLS using Statsmodels or Scikit Learn,**
- **Gradient descent using Scikit Learn,**
- **Or something different?**

>OLS (using Statsmodels) has been selected to predict `ENTRIESn_hourly`.

In [None]:
def linear_regression(features, values):
    model = sm.OLS(values,sm.add_constant(features))
    results = model.fit()
    intercept = results.params[0]
    params = results.params[1:]    
    
    return intercept, params

features = df[['Hour','rain','meantempi','minpressurei']] 
dummy_units = pandas.get_dummies(df['UNIT'], prefix='unit')
features = features.join(dummy_units)

# Perform linear regression
intercept, params = linear_regression(features, df['ENTRIESn_hourly'])
    
predictions = intercept + np.dot(features, params)

def compute_r_squared(data, predictions):
    n = ((data - predictions)**2).sum()
    d = ((data - data.mean())**2).sum()
    
    r_squared = 1 - n/d
    return r_squared

compute_r_squared(df['ENTRIESn_hourly'], predictions)
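
To make the idea behind OLS concrete, here is a minimal sketch of the closed-form solution for a single feature plus an intercept. It is an illustration under simplifying assumptions, not what Statsmodels does internally for the many-column model above; `fit_simple_ols` and the data are hypothetical.

```python
from __future__ import division, print_function


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line entries = 10 + 3 * hour.
hours = [0, 4, 8, 12, 16, 20]
entries = [10 + 3 * h for h in hours]

intercept, slope = fit_simple_ols(hours, entries)
print('intercept:', intercept, 'slope:', slope)
```

Because the toy data lies exactly on a line, the fit recovers that line; with real ridership data the residuals would be non-zero.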


>Trying some brute force to choose features, after failing miserably with gut feeling alone (different feature sets led to very different values).

In [None]:
# collection with almost all features. Following Ex 5 on Set 3, EXITSn_hourly is not being used.
aall_features = ['rain','fog', 'precipi', 'Hour', 
                 'meanwindspdi',
                 'meantempi', 
                 'meanpressurei',
                 'meandewpti' ]

                #'maxtempi', 'mintempi', 'meantempi', 
                #'maxpressurei', 'minpressurei', 'meanpressurei',
                #'maxdewpti', 'mindewpti', 'meandewpti' ]

# multiple variables to log results
i = 0
t_feat = []
t_inte = []
t_para = []
t_pred = []
t_rsqu = []
t_subs = []
log_experiments = []

# global max logs and counter
r_max = -1
s_max = 'none'
j = 0

# This brute force loop will select all combinations of some specific sizes. 
# At first, not using dummies variables.
for L in range(3,(len(aall_features)+1)):
    log_experiments.append(j)
    r_max_local = -1
    
    # for each combination, run a linear regression, logging data and preserving the max
    for subset in itertools.combinations(aall_features, L):
        t_feat.append(i)
        t_inte.append(i)
        t_para.append(i)
        t_pred.append(i)
        t_rsqu.append(i)
        t_subs.append(i)
        
        t_subs[i] = subset
        
        # Selected features
        t_feat[i] = df[[subset[0]]]
        for k in range(1,len(subset)):
            t_feat[i] = t_feat[i].join(df[[subset[k]]])
        
        # dummy variable
        t_feat[i] = t_feat[i].join(pandas.get_dummies(df['UNIT'], prefix='unit'))

        # Perform linear regression
        t_inte[i], t_para[i] = linear_regression(t_feat[i], df['ENTRIESn_hourly'])
        t_pred[i] = t_inte[i] + np.dot(t_feat[i], t_para[i])
        t_rsqu[i] = compute_r_squared(df['ENTRIESn_hourly'], t_pred[i])
        
        # Saving max for each combination size
        if r_max_local < t_rsqu[i]:
            r_max_local = t_rsqu[i]
            log_experiments[j] = [r_max_local, t_subs[i], i]
            
        #print 'Test ',i,' Subset:', subset, 'R2:', t_rsqu[i]
        i = i+1
    
    # Saving total max for all combinations size so far
    if r_max < r_max_local:
        r_max = r_max_local
        s_max = log_experiments[j]
        
    j = j+1
        
# print R^2 max for each combination size and features used
for k in range(0,len(log_experiments)):
    print log_experiments[k][0], log_experiments[k][1], log_experiments[k][2]

In [None]:
# now, for the best combinations found without dummies, add the dummy variables
log_experiments_dummies = []

j = 0
for k in log_experiments:
    log_experiments_dummies.append(j)
    
    subset = log_experiments[j][1]
    
    t_feat.append(i)
    t_inte.append(i)
    t_para.append(i)
    t_pred.append(i)
    t_rsqu.append(i)
    t_subs.append(i)
        
    t_subs[i] = subset
        
    # Selected features + dummies
    t_feat[i] = df[[subset[0]]]
    for k in range(1,len(subset)):
        t_feat[i] = t_feat[i].join(df[[subset[k]]])
            
    t_feat[i] = t_feat[i].join(pandas.get_dummies(df['UNIT'], prefix='unit'))
        
    # Perform linear regression
    t_inte[i], t_para[i] = linear_regression(t_feat[i], df['ENTRIESn_hourly'])
    t_pred[i] = t_inte[i] + np.dot(t_feat[i], t_para[i])
    t_rsqu[i] = compute_r_squared(df['ENTRIESn_hourly'], t_pred[i])
      
    log_experiments_dummies[j] = [t_rsqu[i], t_subs[i], i]
            
    # Saving global max so far
    if r_max < t_rsqu[i]:
        r_max = t_rsqu[i]
        s_max = log_experiments_dummies[j]    

    #print 'Test ',i,' Subset:', subset, 'R2:', t_rsqu[i]
    i = i+1        
    j = j+1

    
# print R^2 max for each subset with dummies and features used
for k in range(0,len(log_experiments_dummies)):
    print log_experiments_dummies[k][0], log_experiments_dummies[k][1], '+ Dummies : UNIT'
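
The brute-force search can also be written more compactly. The sketch below assumes a scoring function that takes a feature subset and returns a number to maximize; `best_subset` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for fitting the regression and returning its R2.

```python
from __future__ import print_function
import itertools

candidates = ['rain', 'fog', 'precipi', 'Hour',
              'meanwindspdi', 'meantempi', 'meanpressurei', 'meandewpti']


def best_subset(candidates, score, min_size=3):
    """Score every subset with at least min_size features; keep the best."""
    best_score, best, n_tried = float('-inf'), None, 0
    for size in range(min_size, len(candidates) + 1):
        for subset in itertools.combinations(candidates, size):
            n_tried += 1
            current = score(subset)
            if current > best_score:
                best_score, best = current, subset
    return best_score, best, n_tried


def toy_score(subset):
    # Hypothetical stand-in for "fit the regression and return R^2".
    return (1.0 if 'Hour' in subset else 0.0) - 0.01 * len(subset)


best_score, best, n_tried = best_subset(candidates, toy_score)
print(n_tried, best, best_score)
```

With 8 candidate features and subset sizes from 3 to 8, the search fits 219 models, which is why each fit needs to be reasonably cheap.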

---
**2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?**

In [None]:
print 'Selected variables were : UNIT as dummy variable and :', s_max[1]
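
For readers unfamiliar with dummy variables, the sketch below shows what one-hot encoding of a categorical column produces, in the spirit of `pandas.get_dummies(df['UNIT'], prefix='unit')`. The helper `one_hot` is hypothetical and the unit labels are made up for illustration.

```python
from __future__ import print_function


def one_hot(labels, prefix):
    """Return (column names, rows of 0/1 flags), one column per category."""
    categories = sorted(set(labels))
    columns = ['{}_{}'.format(prefix, c) for c in categories]
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return columns, rows


# Made-up turnstile unit labels.
units = ['R001', 'R002', 'R001', 'R003']
columns, rows = one_hot(units, prefix='unit')

print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the dummy columns add up to a constant column. When an intercept is also included, the individual dummy coefficients are therefore not uniquely identified; dropping one dummy column is a common remedy.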


---
**2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model.**
- **Your reasons might be based on intuition. For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.”**
- **Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”**

>Having lived most of my life in a city without a subway, I have no strong gut feeling about which variables would influence ridership the most. Indeed, I tried different combinations and the results differed widely from one another. Thus, I decided on some brute force: (i) take all combinations of 3 to 8 candidate features and calculate R2 for each without a dummy variable, and then (ii) pick the best results and try them with a dummy variable.
>
>The dummy variable `UNIT` drastically improves the R2 value. The maximum R2 value for the features mentioned in 2.2 is:



In [None]:
print r_max

---

**2.4 What are the parameters (also known as "coefficients" or "weights") of the non-dummy features in your linear regression model?**

In [None]:
t_para[s_max[2]].head(6)

---
**2.5 What is your model’s R2 (coefficient of determination) value?**

In [None]:
compute_r_squared(df['ENTRIESn_hourly'], t_pred[s_max[2]])

---

**2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2  value?**

> R2 is the proportion of the variance in the data that the model explains. The closer R2 is to one, the better the model fits; the closer it is to zero, the worse. Our R2 is smaller than 0.5 (closer to zero than to one), which is a middling result: not good, but not bad either. Below, a histogram of the residuals (original data minus predicted data) is presented. Most of the residuals lie within about 5000 of zero.
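
To give the R2 scale some reference points, the sketch below evaluates the same formula as `compute_r_squared` on made-up numbers: perfect predictions, predictions that always equal the mean, and something in between. The name `r_squared` and all values are illustrative assumptions.

```python
from __future__ import division, print_function


def r_squared(observed, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


observed = [2.0, 4.0, 6.0, 8.0]
perfect = list(observed)            # every prediction is exact
mean_only = [5.0] * 4               # always predict the mean of the data
in_between = [3.5, 4.5, 5.5, 6.5]   # right direction, slope too small

print(r_squared(observed, perfect))     # 1.0
print(r_squared(observed, mean_only))   # 0.0
print(r_squared(observed, in_between))  # 0.75
```

A model that does no better than predicting the overall mean scores 0, so an R2 below 0.5 means that more than half of the variance is left unexplained.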

---


In [None]:
plt.figure()
plt.suptitle('Histogram of residuals')
plt.ylabel('Frequency')
plt.xlabel('Difference original vs predicted')
plt.grid(True)
(df['ENTRIESn_hourly'] - t_pred[s_max[2]]).hist(bins = 50)
plt.show()

![Histogram of residuals](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_2.png?raw=true)

### Section 3. Visualization

**Please include two visualizations that show the relationships between two or more variables in the NYC subway data.
Remember to add appropriate titles and axes labels to your plots. Also, please add a short description below each figure commenting on the key insights depicted in the figure.**

**3.1 One visualization should contain two histograms: one of  ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.**
- **You can combine the two histograms in a single plot or you can use two separate plots.**
- **If you decide to use two separate plots for the two histograms, please ensure that the x-axis limits for both of the plots are identical. It is much easier to compare the two in that case.**
- **For the histograms, you should have intervals representing the volume of ridership (value of ENTRIESn_hourly) on the x-axis and the frequency of occurrence on the y-axis. For example, each interval (along the x-axis), the height of the bar for this interval will represent the number of records (rows in our data) that have ENTRIESn_hourly that falls in this interval.**
- **Remember to increase the number of bins in the histogram (by having larger number of bars). The default bin width is not sufficient to capture the variability in the two samples.**


In [None]:
plt.figure()
bins = 20
alpha = 0.50
df[df['rain']==0]['ENTRIESn_hourly'].hist(bins = bins, alpha=alpha) 
df[df['rain']==1]['ENTRIESn_hourly'].hist(bins = bins, alpha=alpha) 
    
plt.suptitle('Histogram of ENTRIESn_hourly')
plt.ylabel('Frequency')
plt.xlabel('ENTRIESn_hourly')
plt.legend(['no rain', 'rain'])
plt.grid(True)
plt.show()

![Histogram Raining and Non-Raining](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_1.png?raw=true)

---

**3.2 One visualization can be more freeform. You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like. Some suggestions are:**
- **Ridership by time-of-day**
- **Ridership by day-of-week**

---

In [None]:
df_t1 = df[['ENTRIESn_hourly', 'Hour']].groupby('Hour').sum()
df_t1.index.name = 'Hour'
df_t1.reset_index(inplace=True)

df_t2 = df[['EXITSn_hourly', 'Hour']].groupby('Hour').sum()
df_t2.index.name = 'Hour'
df_t2.reset_index(inplace=True)
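
The aggregation above adds up the hourly counts for each hour of the day. A dependency-free sketch of the same idea, on made-up `(hour, entries)` rows, looks like this:

```python
from __future__ import print_function
from collections import defaultdict

# Made-up (hour, entries) rows standing in for the turnstile records.
rows = [(0, 120), (4, 30), (0, 80), (4, 10), (8, 300)]

totals = defaultdict(int)
for hour, entries in rows:
    totals[hour] += entries  # accumulate entries per hour of the day

per_hour = sorted(totals.items())
print(per_hour)  # [(0, 200), (4, 40), (8, 300)]
```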

In [None]:
plt.figure()
plt.suptitle('Total entries and exits per hour of the day')
plt.ylabel('Total')
plt.xlabel('Hour of the day')
plt.grid(True)
plt.plot(df_t1['Hour'], df_t1['ENTRIESn_hourly'])
plt.plot(df_t2['Hour'], df_t2['EXITSn_hourly'])
plt.legend(['entries', 'exits'])
plt.show()

![Total entries and exits per hour of the day](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_3.png?raw=true)

>It is quite interesting that `EXITS` are consistently smaller than `ENTRIES`: does NYC have subway stations with no exit turnstiles?
>
>Now trying ggplot.

In [None]:
df_t1 = df[['ENTRIESn_hourly', 'EXITSn_hourly', 'DATEn']].groupby('DATEn').sum()
df_t1.index.name = 'DATEn'
df_t1.reset_index(inplace=True)
df_t1['DATEn'] = pandas.to_datetime(df_t1['DATEn'])
df_t1.head()

# reshape to long format: one row per (date, variable) pair
df_t2 = pandas.melt(df_t1, 'DATEn')

gg = ggplot(df_t2, aes(x='DATEn', y='value', colour = 'variable')) +\
    geom_line() +\
    ylab("Number entries or exits") +\
    xlab("Day") +\
    ggtitle("Total daily entries and exits")
print gg

![Total daily entries and exits](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_4.png?raw=true)

>Once more, the plot confirms that the number of exits is smaller than the number of entries.

### Section 4. Conclusion

**Please address the following questions in detail. Your answers should be 1-2 paragraphs long.**

**4.1 From your analysis and interpretation of the data, do more people ride
the NYC subway when it is raining or when it is not raining?**

>According to the study presented here and the data used, it is possible to conclude that more people ride the NYC subway when it is raining than when it is not. The difference is statistically significant at the 5% level.

**4.2 What analyses lead you to this conclusion? You should use results from both your statistical
tests and your linear regression to support your analysis.**

>In Section 1, a statistical analysis of the given dataset was presented. First, the data was examined to determine which kind of statistical test could be used. Then, after observing that the data does not follow a normal distribution, a non-parametric test, the Mann-Whitney U test, was selected and applied. The resulting p-value led to rejecting the null hypothesis (no difference in ridership between rainy and non-rainy periods) at the 5% significance level `(2*p-value<0.05)`.
>
>The linear regression model fitted with OLS (Statsmodels) on the given dataset produced an R2 lower than 0.5, which is not a good result (but not a really bad one either). The residual histogram shows that the majority of residuals (original data minus predicted) are around `0 +/- 5000`.



---

### Section 5. Reflection

**Please address the following questions in detail. Your answers should be 1-2 paragraphs long.**

**5.1 Please discuss potential shortcomings of the methods of your analysis, including:**
- **Dataset,**
- **Analysis, such as the linear regression model or statistical test.**

>With respect to the data :
>
>1. **The data is from a single month: May 2011. This is a big issue for the analysis made here.** May is popularly known as a 'nice weather month', so trying to figure out whether people ride the NYC subway more or less in the rain from a single month of data does not seem reasonable. It would be better to have good data covering all months and all seasons.
>2. Further data analysis, verification and fixes may be required. For instance, discussions on the Udacity site point to some potential flaws (such as a large number of entries in one hour followed, only minutes later, by almost none). I have not investigated the data in detail, nor discussed or verified how it was collected.
>
>With respect to the linear regression model :
>
>* The R2 value obtained is 0.46, which is not good but not bad either. Trying other models (polynomial, logistic regression) may lead to better results.

---

**5.2 (Optional) Do you have any other insight about the dataset that you would like to share with us?**

>The most important comments about the dataset were given in the answer above. While this was a great exercise (and I really enjoyed doing the project), the results of this study are of little use because the data is from a single month.

---

In [None]:
for k in range(0,len(t_subs)):
    if ((len(t_subs[k])==3) and ('Hour' in t_subs[k])):
        print k, t_subs[k], t_rsqu[k]

In [None]:
len(t_subs[0])

In [None]:
t_subs[0]

In [None]:
'Hour' in t_subs[0]
