# Exercise goals
In this exercise, we will explore a large dataset of NYC taxi trips, stored in the file `train.csv`.

# Train data exercise
Now we will use NYC taxi data to build a box plot. 


## Python notes
In the code below, we use `pd.read_csv('train.csv')` to load the data file `train.csv` into the variable `train`. (Don't worry: this file is already stored in this lab, so you do not need to do anything to get it.)

We then use the following lines to show summary statistics for the trip durations:
```python
duration = train['trip_duration']
print(duration.describe()) # summary stats
```
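
To see what `read_csv` and `describe()` return without loading the full file, here is a minimal sketch that reads a tiny CSV from a string instead. The three rows are made up for illustration; only the column names `id` and `trip_duration` mirror the lab's file, and the names `sample_csv`, `sample`, `sample_duration` and `summary` are hypothetical.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for train.csv (values are made up)
sample_csv = io.StringIO(
    "id,trip_duration\n"
    "a,300\n"
    "b,600\n"
    "c,900\n"
)

sample = pd.read_csv(sample_csv)           # same call, but on an in-memory file
sample_duration = sample['trip_duration']  # select one column as a Series
summary = sample_duration.describe()       # count, mean, std, min, quartiles, max
print(summary)
```

The result of `describe()` is itself a Series, so individual statistics can be looked up by label, for example `summary['mean']`.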

Run the next cell to see documentation for the `read_csv` method.

In [None]:
import pandas as pd
?pd.read_csv
#?pd.DataFrame.describe

# Read in the train data
You will run the following cell to read in the train data.

First, you must set the values for the `train` and `duration` variables.
- For `train`, use the `read_csv` method.
- For `duration`, select the column called `trip_duration` from the `train` variable.

Try it first.  You can check your code against the answer in the cell that follows.

**Note**: As there's a lot of data, this code block might take up to 15 seconds to run.

In [None]:
# again, we load libraries that allow us to access pre-made functions
import pandas as pd
import matplotlib.pyplot as plt
# shows matplotlib plots within the jupyter notebook
%matplotlib inline 
# if you get an error here, you may need to move train.csv to your current working directory from your 
# Downloads folder

#------------------Enter your code here---------------------------#
# load train.csv
train = 
duration = 
#-----------------------------------------------------------------#
# use describe() to display summary stats for trip_duration
duration.describe()

# Answer code
We used the following code in the cell above:

```python
#------------------Enter your code here---------------------------#
# load train.csv
train = pd.read_csv('train.csv')
# select the trip_duration column
duration = train['trip_duration']
#-----------------------------------------------------------------#
```

# Create box plot for train data
The next code cell creates a box plot for the train data.

## Python notes
We use the method `ax.annotate()` to add arrows pointing to different percentiles in the box plot.

We use `plt.boxplot(duration, whis=whiskers)` to generate and show the box plot for duration data.
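
As a rough illustration of what passing a pair of percentiles to `whis` does, the sketch below calls `matplotlib.cbook.boxplot_stats`, a helper that computes box plot statistics without drawing anything. The data here is made up (the integers 1 to 100, not the taxi data), and the names `data` and `stats` are hypothetical.

```python
import numpy as np
from matplotlib import cbook

# Made-up data: the integers 1..100 (not the taxi durations)
data = np.arange(1, 101)

# With a pair of percentiles, the whiskers sit at the most extreme data
# points that still lie inside the 5th-95th percentile range
stats = cbook.boxplot_stats(data, whis=[5, 95])[0]

print('box:', stats['q1'], stats['med'], stats['q3'])
print('whiskers:', stats['whislo'], stats['whishi'])
print('points beyond the whiskers:', len(stats['fliers']))
```

Points beyond the whiskers are the "fliers" that a box plot draws as individual markers.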

In [None]:
fig = plt.figure()
fig.suptitle("NYC Taxi Trip Duration")
ax = fig.add_subplot(111)
duration = train['trip_duration']
# create a variable with the number of observations in duration
count_duration = len(duration)

# lambda gives another way to define a function. So, we can call
# perc(22) and it substitutes 22 for x.
# perc(x) returns the value that is greater than x% of the data
perc = lambda x: sorted(duration)[int(count_duration*x/100)-1]

# bottom of box
ax.annotate('25th percentile', xy=(1.1, perc(25)), xytext=(1.25, perc(25)),
            arrowprops=dict(facecolor='black', shrink=0.05))

#---------------Enter your code here---------------#
# top of box
ax.annotate('75th percentile', xy=(1.1, perc(75)), xytext=(1.25, perc(75)),
            arrowprops=dict(facecolor='black', shrink=0.05))

#--------------------------------------------------#

# there are an equal number of points greater than and less than the median
# line in the box
# sorted() sorts duration from least to greatest
median = sorted(duration)[int(count_duration/2)-1]

# line within the box
ax.annotate('median', xy=(1.1, median), xytext=(1.25, median),
            arrowprops=dict(facecolor='black', shrink=0.05))

# we choose which percentile to set our whiskers at
whiskers = one, two = [5,95]

# whisker 1
ax.annotate(str(one)+'th percentile', xy=(1.05, perc(one)), xytext=(1.25, perc(one)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# whisker 2
ax.annotate(str(two)+'th percentile', xy=(1.05, perc(two)), xytext=(1.25, perc(two)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# add y axis
plt.ylabel('Duration (seconds)')

#---------------Enter your code here---------------#

# boxplot call
plt.boxplot(duration, whis=whiskers)

# display the plot 
plt.show()
#--------------------------------------------------#

# Documentation
Run the following cell to pull up documentation for box plots.

In [None]:
?plt.boxplot

# Train duration box plot data
In the box plot above, the data is so spread out that we cannot see the box. This is the result of
outliers, which we can hide by passing `showfliers=False` to `plt.boxplot()`.
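
To get a feel for how a single extreme value can dominate the scale, here is a small sketch on made-up durations (nine ordinary trips plus one day-long outlier). The numbers and the names `durations`, `q75` and `largest` are invented for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Made-up durations in seconds: nine ordinary trips plus one extreme value
durations = pd.Series([300, 400, 500, 600, 700, 800, 900, 1000, 1100, 86400])

print(durations.describe())

q75 = durations.quantile(0.75)   # top of the box
largest = durations.max()        # the outlier sets the top of the y axis
print('max is', round(largest / q75), 'times the 75th percentile')
```

When the maximum is this far above the 75th percentile, the box takes up only a sliver of the y axis, which is why hiding the outliers makes it visible again.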

Run the block below to see duration summary data and a box plot with outliers removed.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('train.csv')
duration = train['trip_duration']
print(duration.describe()) # summary stats

# think of plt.figure as the frame of a picture
fig = plt.figure()
fig.suptitle("NYC Taxi Trip Duration")
# think of the ax as the picture within the frame
ax = fig.add_subplot(111)

# create a variable with the number of observations in duration
count_duration = len(duration)

# lambda gives another way to define a function. So, we can call
# perc(22) and it substitutes 22 for x.
# perc(x) gives the value that is greater than x% of the data
perc = lambda x: sorted(duration)[int(count_duration*x/100)-1]

# bottom of box
ax.annotate('25th percentile', xy=(1.1, perc(25)), xytext=(1.25, perc(25)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# top of box
ax.annotate('75th percentile', xy=(1.1, perc(75)), xytext=(1.25, perc(75)),
            arrowprops=dict(facecolor='black', shrink=0.05))

# there are an equal number of points greater than and less than the median
# line in the box
# sorted() sorts duration from least to greatest
median = sorted(duration)[int(count_duration/2)-1]

# line within the box
ax.annotate('median', xy=(1.1, median), xytext=(1.25, median),
            arrowprops=dict(facecolor='black', shrink=0.05))

# we choose which percentiles to place our whiskers at
whiskers = one, two = [5,95]

# whisker 1
ax.annotate(str(one)+'th percentile', xy=(1.05, perc(one)), xytext=(1.25, perc(one)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# whisker 2
ax.annotate(str(two)+'th percentile', xy=(1.05, perc(two)), xytext=(1.25, perc(two)),
            arrowprops=dict(facecolor='black', shrink=0.05))
plt.ylabel("Duration (seconds)")
#---------------Enter your code here---------------#

# call to boxplot
plt.boxplot(duration, whis=whiskers, showfliers=False)

# display the plot
plt.show()
#--------------------------------------------------#

# Bar chart exercise
We will now make a bar chart of the number of rides with different numbers of passengers.

- We use `pd.DataFrame.groupby` to get a count of the number of rides that occur with 0, 1, 2, etc. passengers
- You can read more about `groupby` here: https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
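
As a small illustration of the counting step, the sketch below groups a made-up five-row table by `passenger_count`. The rows and the names `rides`, `by_count` and `by_size` are hypothetical; only the column names `id` and `passenger_count` come from the lab's data.

```python
import pandas as pd

# Made-up rides: three with 1 passenger, two with 2 passengers
rides = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'passenger_count': [1, 2, 1, 1, 2],
})

# count() gives the number of non-missing values per group in every column;
# picking the 'id' column leaves one count per passenger_count value
by_count = rides.groupby('passenger_count').count()['id']

# size() is an alternative that counts rows per group directly
by_size = rides.groupby('passenger_count').size()

print(by_count)
print(by_size)
```

Both results are indexed by the distinct values of `passenger_count`, which is the shape a bar chart needs.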

# Code documentation
Run the following cell to see docs for `groupby`.

In [None]:
import pandas as pd
?pd.DataFrame.groupby

# Get passenger count
Edit the line `passenger_count =` to group the results by `passenger_count` using the `groupby` method. Then run the following cell.

Try it first.  You can check your code against the answer in the cell that follows.

In [None]:
# We import libraries as is common practice
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# seaborn is an API built on top of matplotlib
# It creates more aesthetically appealing plots and simplifies some things
# read more about what an API is here:
# https://en.wikipedia.org/wiki/Application_programming_interface
import seaborn as sns; sns.set()
%matplotlib inline

# load up the NYC Taxi data
# remember to move train.csv to your current working directory
train = pd.read_csv('train.csv')
# groupby is kind of like a pivot table in Excel
# check the docs by running ?pd.DataFrame.groupby

#-----------------Enter your code here--------------------#
# groupby groups all rows together that correspond to each unique value 
# of passenger count
# count() collapses all the rows into the number of rows for each group
# .id selects a single column of those counts
passenger_count = 

#---------------------------------------------------------#

# display passenger_count
passenger_count

# Answer code
We used the following code in the cell above:

```python
#-----------------Enter your code here--------------------#
# groupby groups all rows together that correspond to each unique value 
# of passenger count
# count() collapses all the rows into the number of rows for each group
# .id selects a single column of those counts
passenger_count = train.groupby('passenger_count').count().id

#---------------------------------------------------------#
```

# Code documentation
Run the following cell to see docs for the `bar` method.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
?ax.bar

# Create the bar chart 
In the cell below, write the call to the `ax.bar()` method to create the bar chart for passenger count 
vs. the number of rides.

Try it first.  You can check your code against the answer in the cell that follows.

In [None]:
# these give us control of title, x axis, y axis, etc.
fig, ax = plt.subplots()
bar_width=0.35
opacity=0.4
# shape[0] returns the number of rows/observations
# np.arange returns an array [0, 1, ..., rows in passenger_count - 1]
index = np.arange(passenger_count.shape[0])
# we create our passenger count bar
#------------------Enter your code here-------------------#
rect = ax.bar()

#---------------------------------------------------------#
ax.set_xlabel('Passenger Count')
ax.set_ylabel('Number of Rides')
ax.set_title('Number of Rides vs. Passenger Count')

fig.tight_layout()
# a table form of our data
print(passenger_count)
plt.show()


# Answer code
We used the following code in the cell above:

```python
#------------------Enter your code here-------------------#
rect = ax.bar(index, passenger_count, bar_width,
              alpha=opacity, color='b')

#---------------------------------------------------------#
```

# Histogram exercise
We will now make a histogram of trip duration.

**Hint**: taking the log may make the values easier to interpret in the final graph.

In the cell below, edit the call to the `plt.hist()` method.

Then answer: What's the most likely trip duration?

Try it first.  You can check your code against the answer in the cell that follows.

In [None]:
# Histogram Exercise
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

train = pd.read_csv('train.csv')

fig = plt.figure()
fig.suptitle('Trip Duration')
# taking the log bunches all the values closer together
# if you plot raw values, the outliers stretch the x axis and the graph takes a while to load
#-------------Enter your code here-------------------------------#
plt.hist() 
#----------------------------------------------------------------#
plt.xlabel('ln(Duration in seconds)')
plt.ylabel('Count')
plt.show()

# Answer code
We used the following code in the cell above:

```python
#-------------Enter your code here-------------------------------#
plt.hist(np.log(train['trip_duration']), bins='auto', color='black') 
#----------------------------------------------------------------#
```

# Most likely trip duration
Using the histogram above, what's the most likely trip duration in seconds?
The x axis shows the natural log of the duration, so read off the x value at the peak, enter it as the argument to `np.exp()`, then run the cell below.
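
The conversion from a log-scale peak back to seconds can be sketched on made-up numbers as follows. The eight durations, the choice of 5 bins, and the names `durations`, `log_durations`, `peak` and `peak_center` are all invented for illustration; the real histogram above uses `bins='auto'` on the full data.

```python
import numpy as np

# Made-up durations in seconds (not the taxi data)
durations = np.array([100, 400, 420, 450, 480, 500, 520, 3000])
log_durations = np.log(durations)

# Histogram on the log scale, using 5 bins for this tiny example
counts, edges = np.histogram(log_durations, bins=5)

# The tallest bin marks the most common log-duration
peak = np.argmax(counts)
peak_center = (edges[peak] + edges[peak + 1]) / 2

# np.exp undoes np.log, turning the peak back into seconds
print(np.exp(peak_center))
```

Because `np.exp` is the inverse of `np.log`, the printed value is in the same units as the original durations.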

In [None]:

#-------------------Enter Your code here--------------#

# enter your answer for most likely duration
import numpy as np
np.exp()
#-----------------------------------------------------#