# Analyzing the NYC Subway Dataset

Project connected to the [Udacity Intro to Data Science course](https://www.udacity.com/course/viewer#!/c-ud359-nd).

by Victor Ribeiro, October/2015

---

### Section 0. References

**About the Dataset**

The Turnstile and Weather Variables dataset reports the cumulative number of entries and exits in the NYC subway, with additional information about the weather.

* [Original Dataset](https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv) - data set used throughout the course and used in the report below.
* [Improved Dataset](https://www.dropbox.com/s/1lpoeh2w6px4diu/improved-dataset.zip?dl=0) - cleaned-up subset of original dataset with additional variables. [Variables in the dataset](https://s3.amazonaws.com/uploads.hipchat.com/23756/665149/05bgLZqSsMycnkg/turnstile-weather-variables.pdf)

**References**

* [Mann-Whitney U Test](https://storage.googleapis.com/supplemental_media/udacityu/4332539257/MannWhitneyUTest.pdf) Udacity
* [Mann-Whitney U Test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) Wikipedia
* [Shapiro-Wilk Test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) Wikipedia
* [Shapiro-Wilk Test](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html) Python reference
* Diez, David; Barr, Christopher; Çetinkaya-Rundel, Mine. OpenIntro Statistics, Third Edition

---

### Section 1. Statistical Test

**1.1 Which statistical test did you use to analyze the NYC subway data? Did you use a one-tail or a two-tail P value? What is the null hypothesis? What is your p-critical value?**

>Considering the proposed project:
>* Null hypothesis: there is no difference in the number of subway rides between rainy and non-rainy days.
>* Alternative hypothesis: there is a difference in the number of subway rides between rainy and non-rainy days.
>
>A Mann-Whitney U test is applied. It is a two-tailed test with a p-critical value of 0.05 (5%).

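One common way to formalize the two hypotheses above for a Mann-Whitney U test is sketched below. The symbols are assumptions introduced here for illustration, not notation from the report: $X$ is the hourly entries of a randomly chosen rainy-day record, $Y$ the same for a non-rainy-day record, $p$ the two-tailed p-value, and $\alpha$ the p-critical value.

```latex
H_0:\; P(X > Y) = P(Y > X)
\qquad\text{vs.}\qquad
H_1:\; P(X > Y) \neq P(Y > X),
\qquad
\text{reject } H_0 \text{ if } p < \alpha = 0.05
```
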
---
**1.2 Why is this statistical test applicable to the dataset? In particular, consider the assumptions that the test is making about the distribution of ridership in the two samples.**

>A look at the data supports the decision made in 1.1.

In [9]:

    import numpy as np
    import pandas
    import pandasql
    import matplotlib.pyplot as plt
    import datetime
    import csv
    import scipy
    import scipy.stats
    import statsmodels.api as sm
    import sys
    from ggplot import *

    # Load the original dataset and summarize every numeric column
    df = pandas.read_csv("turnstile_data_master_with_weather.csv")
    df.describe()

| | Unnamed: 0 | Hour | ENTRIESn_hourly | EXITSn_hourly | maxpressurei | maxdewpti | mindewpti | minpressurei | meandewpti | meanpressurei | fog | rain | meanwindspdi | mintempi | meantempi | maxtempi | precipi | thunder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951.0 | 131951 |
| mean | 65975.0 | 10.896158 | 1095.348478 | 886.890838 | 30.031894 | 57.241302 | 48.259013 | 29.892714 | 52.703526 | 29.965077 | 0.1671 | 0.334245 | 5.543065 | 56.169775 | 64.269729 | 71.769968 | 0.172276 | 0 |
| std | 38091.117022 | 6.892084 | 2337.015421 | 2008.604886 | 0.125689 | 8.770891 | 11.305312 | 0.146384 | 9.94359 | 0.130461 | 0.373066 | 0.471728 | 1.982441 | 6.338875 | 6.568289 | 7.627218 | 0.429005 | 0 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 29.74 | 39.0 | 22.0 | 29.54 | 31.0 | 29.64 | 0.0 | 0.0 | 1.0 | 46.0 | 55.0 | 58.0 | 0.0 | 0 |
| 25% | 32987.5 | 5.0 | 39.0 | 32.0 | 29.96 | 50.0 | 38.0 | 29.84 | 45.0 | 29.91 | 0.0 | 0.0 | 5.0 | 52.0 | 60.0 | 65.0 | 0.0 | 0 |
| 50% | 65975.0 | 12.0 | 279.0 | 232.0 | 30.03 | 57.0 | 51.0 | 29.91 | 54.0 | 29.96 | 0.0 | 0.0 | 5.0 | 54.0 | 63.0 | 71.0 | 0.0 | 0 |
| 75% | 98962.5 | 17.0 | 1109.0 | 847.0 | 30.1 | 64.0 | 55.0 | 29.97 | 60.0 | 30.05 | 0.0 | 1.0 | 6.0 | 60.0 | 68.0 | 78.0 | 0.1 | 0 |
| max | 131950.0 | 23.0 | 51839.0 | 45249.0 | 30.31 | 70.0 | 66.0 | 30.23 | 68.0 | 30.27 | 1.0 | 1.0 | 12.0 | 70.0 | 78.0 | 86.0 | 2.18 | 0 |


In [18]:

    plt.figure()
    bins = 50
    alpha = 0.75
    # Histogram of hourly entries when it is not raining
    df[df['rain']==0]['ENTRIESn_hourly'].hist(bins=bins, alpha=alpha)
    # Histogram of hourly entries when it is raining, overlaid on the same axes
    df[df['rain']==1]['ENTRIESn_hourly'].hist(bins=bins, alpha=alpha)

    plt.suptitle('Histogram of ENTRIESn_hourly')
    plt.ylabel('Frequency')
    plt.xlabel('ENTRIESn_hourly')
    plt.legend(['no rain', 'rain'])
    plt.grid(True)
    plt.show()

![Histogram Raining and Non-Raining](https://github.com/vfribeiro/IntroDataScience/blob/master/figure_1.png?raw=true)

>As the histogram above shows, neither the rainy-day nor the non-rainy-day data follow a normal distribution. Applying the Shapiro-Wilk test (below) confirms this: the reported p-value is effectively zero for both samples.

In [12]:

    # Shapiro-Wilk normality test on each sample; each call returns (W statistic, p-value)
    print(scipy.stats.shapiro(df[df['rain']==0]['ENTRIESn_hourly']))
    print(scipy.stats.shapiro(df[df['rain']==1]['ENTRIESn_hourly']))

(0.47661787271499634, 0.0)
(0.4715914726257324, 0.0)


>Thus, a non-parametric test such as the Mann-Whitney U test is applicable.

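The normality check above can be reproduced in miniature on synthetic data. The sketch below is illustrative only: the exponential sample is a hypothetical stand-in for the right-skewed `ENTRIESn_hourly` values, not the real data, and the sample size of 2000 is an assumption chosen because the SciPy documentation notes that the Shapiro-Wilk p-value may be inaccurate for more than 5000 observations (the real samples are far larger, so their reported p-values should be read with that caveat in mind).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for ENTRIESn_hourly: non-negative and heavily right-skewed.
skewed = rng.exponential(scale=1000.0, size=2000)
# A genuinely normal sample of the same size, for contrast.
normal = rng.normal(loc=1000.0, scale=200.0, size=2000)

# shapiro returns the W statistic and the p-value of the normality test.
w_skewed, p_skewed = stats.shapiro(skewed)
w_normal, p_normal = stats.shapiro(normal)

# W close to 1 is consistent with normality; a small p-value rejects it.
print("skewed: W=%.3f p=%.3g" % (w_skewed, p_skewed))
print("normal: W=%.3f p=%.3g" % (w_normal, p_normal))
```
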
---
**1.3 What results did you get from this statistical test? These should include the following numerical values: p-values, as well as the means for each of the two samples under test.**

>Applying Mann-Whitney U Test.



In [16]:

    # Mean hourly entries for the rainy and non-rainy samples
    with_rain_mean = np.mean(df[df['rain']==1]['ENTRIESn_hourly'])
    without_rain_mean = np.mean(df[df['rain']==0]['ENTRIESn_hourly'])

    U, p = scipy.stats.mannwhitneyu(df[df['rain']==1]['ENTRIESn_hourly'],
                                    df[df['rain']==0]['ENTRIESn_hourly'])

    # p is the one-tailed value returned here, so it is doubled for the two-tailed test
    print(with_rain_mean, without_rain_mean, U, 2*p)

1105.44637675 1090.27878015 1924409167.0 0.049999825587


**1.4 What is the significance and interpretation of these results?**

>The mean of `ENTRIESn_hourly` on rainy days (1105.45) is slightly higher than on non-rainy days (1090.28), which makes sense, as people may prefer to walk when it is not raining (personally, I would have guessed a much larger difference). The two-tailed p-value (2 \* p ≈ 0.0499998) is just below the 0.05 threshold, thus the **null hypothesis is rejected**, though only by a very narrow margin.

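The report doubles `p` because the `mannwhitneyu` call it used returned a one-tailed value. As a hedged sketch of the same decision rule, more recent SciPy releases accept an `alternative` argument that requests the two-tailed p-value directly. The lognormal samples below are hypothetical stand-ins for the rainy and non-rainy entries, and the names `rainy`, `dry`, and `reject_null` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical skewed samples standing in for rainy / non-rainy hourly entries.
rainy = rng.lognormal(mean=6.0, sigma=1.5, size=3000)
dry = rng.lognormal(mean=5.8, sigma=1.5, size=5000)

# alternative="two-sided" asks for the two-tailed p-value, so no manual doubling.
u_stat, p_two_sided = stats.mannwhitneyu(rainy, dry, alternative="two-sided")

# One-tailed p-values in each direction, to show how they relate to the two-tailed one.
_, p_greater = stats.mannwhitneyu(rainy, dry, alternative="greater")
_, p_less = stats.mannwhitneyu(rainy, dry, alternative="less")
doubled_one_tailed = 2 * min(p_greater, p_less)

alpha = 0.05  # p-critical value from section 1.1
reject_null = bool(p_two_sided < alpha)

print("U =", u_stat)
print("two-tailed p =", p_two_sided, "doubled one-tailed p =", doubled_one_tailed)
print("reject null hypothesis:", reject_null)
```
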
---


### Section 2. Linear Regression

**2.1 What approach did you use to compute the coefficients theta and produce prediction for ENTRIESn_hourly in your regression model:**
- **OLS using Statsmodels or Scikit Learn,**
- **Gradient descent using Scikit Learn,**
- **Or something different?**

**2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?**

**2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that the selected features will contribute to the predictive power of your model.**
- **Your reasons might be based on intuition. For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.”**
- **Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”**

**2.4 What are the parameters (also known as "coefficients" or "weights") of the non-dummy features in your linear regression model?**

**2.5 What is your model’s R2 (coefficient of determination) value?**

**2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2  value?**

---

### Section 3. Visualization

**Please include two visualizations that show the relationships between two or more variables in the NYC subway data.
Remember to add appropriate titles and axes labels to your plots. Also, please add a short description below each figure commenting on the key insights depicted in the figure.**

**3.1 One visualization should contain two histograms: one of  ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.**
- **You can combine the two histograms in a single plot or you can use two separate plots.**
- **If you decide to use two separate plots for the two histograms, please ensure that the x-axis limits for both of the plots are identical. It is much easier to compare the two in that case.**
- **For the histograms, you should have intervals representing the volume of ridership (value of ENTRIESn_hourly) on the x-axis and the frequency of occurrence on the y-axis. For example, for each interval (along the x-axis), the height of the bar for this interval will represent the number of records (rows in our data) that have ENTRIESn_hourly that falls in this interval.**
- **Remember to increase the number of bins in the histogram (by having larger number of bars). The default bin width is not sufficient to capture the variability in the two samples.**

**3.2 One visualization can be more freeform. You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like. Some suggestions are:**
- **Ridership by time-of-day**
- **Ridership by day-of-week**

---

### Section 4. Conclusion

**Please address the following questions in detail. Your answers should be 1-2 paragraphs long.**

**4.1 From your analysis and interpretation of the data, do more people ride
the NYC subway when it is raining or when it is not raining?**

**4.2 What analyses led you to this conclusion? You should use results from both your statistical
tests and your linear regression to support your analysis.**

---

### Section 5. Reflection

**Please address the following questions in detail. Your answers should be 1-2 paragraphs long.**

**5.1 Please discuss potential shortcomings of the methods of your analysis, including:**
- **Dataset,**
- **Analysis, such as the linear regression model or statistical test.**

**5.2 (Optional) Do you have any other insight about the dataset that you would like to share with us?**

---