# Exploring and Visualizing Data
In this exercise, we will explore a large dataset for New York City trains. We will then create visualizations for that data in the following forms.

- A **box and whisker plot** that shows train trip duration.
- A **bar chart** that shows the number of rides for each passenger count.
- A **histogram** that shows the distribution of trip durations, including the most likely trip duration.

## Box and Whisker Plot: Train Trip Duration
This first visualization is a box and whisker plot that uses train data from New York City. This visualization has three steps.
1. Load the train data.
2. Create a box and whisker plot.
3. Refine the box and whisker plot.

### Step 1. Load the Train Data
Before we can create the box and whisker plot, we need a dataset. To load the train data, follow these steps.

1. In the following code cell, enter a value for the `train` variable. 

   To do this, either replace the `train =` line in the code cell with the following line, or create your own value for the `train` variable based on the `read_csv` method.

    `train = pd.read_csv('train.csv')`

   For more information, see **Python Notes** and **read_csv Method Documentation** below the code cell.

2. In the following code cell, enter a value for the `duration` variable. 

   To do this, either replace the `duration =` line in the code cell with the following line, or enter your own value for the `duration` variable based on the `trip_duration` column in the `train` variable.

    `duration = train['trip_duration']`

3. Run the code cell.

   Note: This dataset is large. This code block might take up to 15 seconds to run.

In [None]:
# Load libraries that allow us to access pre-made functions.
import pandas as pd
import matplotlib.pyplot as plt
# Show matplotlib plots within the Jupyter notebook.
%matplotlib inline 

# You need to load the train.csv file. To do this, enter a value for the train 
# variable in the following line.
#-------- Enter a value for the train variable here --------------#
train =
#-----------------------------------------------------------------#
# If an error occurs when you try to load the train.csv file, try moving train.csv 
# from your Downloads folder to your current working directory.

#-------- Enter a value for the duration variable here -----------#
duration =
#-----------------------------------------------------------------#
# use describe() to display summary stats for trip_duration
duration.describe()

#### Python Notes
In this code, `pd.read_csv('train.csv')` loads the `train.csv` data file into the `train` variable. (The `train.csv` file is stored in this lab.)

The following lines select the `trip_duration` column and display its summary statistics:
```python
duration = train['trip_duration']
print(duration.describe()) # summary stats
```
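
To try these two calls without waiting for the full dataset to load, here is a minimal sketch that reads a tiny in-memory CSV instead. The rows are made up for illustration; only the column names (`id`, `trip_duration`, `passenger_count`) mirror the ones used in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv; the values are invented for illustration.
csv_text = """id,trip_duration,passenger_count
id001,455,1
id002,663,1
id003,2124,2
id004,429,1
id005,435,6
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# Select one column, then summarize it (count, mean, std, min, quartiles, max).
duration = sample['trip_duration']
stats = duration.describe()
print(stats)
```

The result of `describe()` is itself a pandas Series, so individual statistics can be read by label, for example `stats['mean']` or `stats['50%']`.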

#### read_csv Method Documentation
Run the following cell to see documentation for the `read_csv` method.


In [None]:
import pandas as pd
?pd.read_csv
#?pd.DataFrame.describe

### Step 2. Create a Box and Whisker Plot
To create a box and whisker plot of this train data, run the following code cell.

For more information about the code, see **Python Notes** and **Box Plot Documentation** below the code cell.

In [None]:
fig = plt.figure()
fig.suptitle("NYC Train Trip Duration")
ax = fig.add_subplot(111)
duration = train['trip_duration']
# Create a variable with the number of observations in the duration.
count_duration = len(duration)

# lambda gives another way to define a function. So, we can call
# perc(22) and it will substitute 22 for x.
# perc(x) returns the value that is greater than x% of the data.
perc = lambda x: sorted(duration)[int(count_duration*x/100)-1]

# bottom of box.
ax.annotate('25th percentile', xy=(1.1, perc(25)), xytext=(1.25, perc(25)),
            arrowprops=dict(facecolor='black', shrink=0.05))

#---------------Enter your code here---------------#
# top of box
ax.annotate('75th percentile', xy=(1.1, perc(75)), xytext=(1.25, perc(75)),
            arrowprops=dict(facecolor='black', shrink=0.05))

#--------------------------------------------------#

# There are an equal number of points greater than and less than the median
# line in the box.
# sorted() sorts duration from least to greatest.
median = sorted(duration)[int(count_duration/2)-1]

# Create a line within the box.
ax.annotate('median', xy=(1.1, median), xytext=(1.25, median),
            arrowprops=dict(facecolor='black', shrink=0.05))

# Choose the percentiles where you want to set the whiskers.
whiskers = one, two = [5,95]

# Whisker 1.
ax.annotate(str(one)+'th percentile', xy=(1.05, perc(one)), xytext=(1.25, perc(one)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# Whisker 2.
ax.annotate(str(two)+'th percentile', xy=(1.05, perc(two)), xytext=(1.25, perc(two)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# Label the y axis.
plt.ylabel('Duration (seconds)')

#---------------Enter your code here---------------#

# The boxplot call.
plt.boxplot(duration, whis=whiskers)

# Display the plot. 
plt.show()
#--------------------------------------------------#

#### Python Notes
We use the `ax.annotate()` method to add arrows pointing to different percentiles in the box plot.

We use `plt.boxplot(duration, whis=whiskers)` to generate and show the box plot for duration data.
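
The `perc` lambda in the cell above sorts the data on every call, and for very small values of `x` its index becomes -1, which selects the largest value. The following sketch shows one possible alternative on a small made-up sample: sort once, guard the index, and compare with `np.percentile`. The sample values and the name `sorted_sample` are assumptions for illustration, not part of the lab.

```python
import numpy as np

# Hypothetical trip durations in seconds (not taken from train.csv).
sample = np.array([300, 420, 480, 600, 660, 720, 900, 1200, 1500, 86400])

# Sort once and reuse the result instead of sorting inside every call.
sorted_sample = np.sort(sample)

def perc(x):
    """Return the value at roughly the x-th percentile, like the notebook's lambda."""
    position = int(len(sorted_sample) * x / 100) - 1
    # Guard against a negative index, which would wrap around to the largest value.
    return sorted_sample[max(position, 0)]

p25, p75 = perc(25), perc(75)

# np.percentile interpolates between neighboring values by default,
# so its answers can differ slightly from the index-based version.
np_p25, np_p75 = np.percentile(sample, [25, 75])

print(p25, p75)
print(np_p25, np_p75)
```

On a dataset as large as the one in this lab, the two approaches should give nearly identical values; the difference is only noticeable on small samples like this one.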

#### Box Plot Documentation
Run the following cell to access documentation for box plots.

In [None]:
?plt.boxplot

### Step 3. Refine the Box and Whisker Plot
In the box plot we just created, the data is so spread out that we cannot see the box. This is the result of outliers, which we can hide from the plot.

Run the following code cell to create a box plot with the outliers hidden.

For more information about the code, see **Python Notes** below the code cell.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

train = pd.read_csv('train.csv')
duration = train['trip_duration']
print(duration.describe()) # summary stats

# Think of plt.figure as the frame of a picture.
fig = plt.figure()
fig.suptitle("NYC Train Trip Duration")
# Think of ax as the picture within the frame.
ax = fig.add_subplot(111)

# Create a variable with the number of observations in the duration.
count_duration = len(duration)

# lambda gives another way to define a function. So, we can call
# perc(22) and it will substitute 22 for x.
# perc(x) returns the value that is greater than x% of the data.
perc = lambda x: sorted(duration)[int(count_duration*x/100)-1]

# The bottom of the box.
ax.annotate('25th percentile', xy=(1.1, perc(25)), xytext=(1.25, perc(25)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# The top of the box.
ax.annotate('75th percentile', xy=(1.1, perc(75)), xytext=(1.25, perc(75)),
            arrowprops=dict(facecolor='black', shrink=0.05))

# There are an equal number of points greater than and less than the median
# line in the box.
# sorted() sorts the duration from least to greatest.
median = sorted(duration)[int(count_duration/2)-1]

# A line within the box.
ax.annotate('median', xy=(1.1, median), xytext=(1.25, median),
            arrowprops=dict(facecolor='black', shrink=0.05))

# Choose the percentiles where you want to set the whiskers.
whiskers = one, two = [5,95]

# Whisker 1.
ax.annotate(str(one)+'th percentile', xy=(1.05, perc(one)), xytext=(1.25, perc(one)),
            arrowprops=dict(facecolor='black', shrink=0.05))
# Whisker 2.
ax.annotate(str(two)+'th percentile', xy=(1.05, perc(two)), xytext=(1.25, perc(two)),
            arrowprops=dict(facecolor='black', shrink=0.05))
plt.ylabel("Duration (seconds)")
#---------------Enter your code here---------------#

# The call to the box plot.
plt.boxplot(duration, whis=whiskers, showfliers=False)

# Display the plot.
plt.show()
#--------------------------------------------------#

#### Python Notes
We hide the outliers in our plot by passing `showfliers=False` to `plt.boxplot()`. The outliers remain in the data; they are only left out of the drawing.
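
To see which points count as outliers here, the following sketch marks the values that fall outside the 5th to 95th percentile range on a small made-up sample. It assumes that the points drawn as outliers are the ones beyond the chosen whisker percentiles; the sample values are invented for illustration.

```python
import numpy as np

# Hypothetical trip durations in seconds, with one extreme value (not from train.csv).
sample = np.array([120, 400, 450, 520, 610, 700, 850, 990, 1300, 40000])

# Percentile bounds that correspond to whis=[5, 95] in the notebook.
low, high = np.percentile(sample, [5, 95])

# Points outside the bounds are the ones that showfliers=False leaves undrawn.
fliers = sample[(sample < low) | (sample > high)]
kept = sample[(sample >= low) & (sample <= high)]

print(fliers)       # points hidden from the plot
print(len(sample))  # the data itself is unchanged
```

A single extreme value such as the last one in this sample is enough to stretch the y axis so far that the box collapses into a thin line, which is what we saw in Step 2.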


## Bar Chart: Number of Rides with 'n' Passengers

This visualization is a bar chart of the number of rides with different numbers ('n') of passengers. This visualization has two steps.
1. Load the train data.
2. Create a bar chart.

### Step 1. Load the Train Data

To load the train data, follow these steps.

1. In the following code cell, edit the `passenger_count =` line to group the results by `passenger_count` using the `groupby` method. 

   To do this, either replace the `passenger_count =` line in the code cell with the following line, or add your own code.

    `passenger_count = train.groupby('passenger_count').count().id`

   For more information, see **Python Notes** and **groupby Documentation** below the code cell.

2. Run the code cell.

In [None]:
# Load libraries that allow us to access pre-made functions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# seaborn is a plotting library built on top of matplotlib.
# It creates more aesthetically appealing plots and simplifies some things
# through a higher-level API. Read more about what an API is here:
# https://en.wikipedia.org/wiki/Application_programming_interface
import seaborn as sns; sns.set()
%matplotlib inline

# Load the NYC train data.
# Remember to move train.csv to your current working directory.
train = pd.read_csv('train.csv')
# groupby is similar to a pivot table in Microsoft Excel.
# groupby groups all rows together that correspond to each unique value 
# of passenger count.
# count() collapses all the rows into the number of rows for each group.
# .id selects the counts for the id column, which gives the number of
# non-null id values (rides) in each group.

#---------------- Edit the following line ----------------#
passenger_count = 
#---------------------------------------------------------#

# Display passenger_count.
passenger_count

#### Python Notes

We use `pd.DataFrame.groupby` to get a count of the number of rides that occur with 0, 1, 2, etc. passengers. 
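
As a rough illustration of what this grouping produces, the following sketch runs the same expression on a six-row DataFrame. The rows are made up; only the column names match the lab's data. It also shows `size()`, which is another way to count rows per group.

```python
import pandas as pd

# Hypothetical rides; only the column names mirror train.csv.
sample = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'passenger_count': [1, 2, 1, 1, 5, 2],
})

# Group rows by passenger count, count non-null values per column,
# then keep the counts for the id column.
passenger_count = sample.groupby('passenger_count').count().id

# size() counts rows per group directly, without picking a column.
by_size = sample.groupby('passenger_count').size()

print(passenger_count)
```

The index of the result holds the distinct passenger counts (1, 2, and 5 here), and the values hold the number of rides for each.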

#### groupby Documentation
For more information about `groupby`, see https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm, or run the following cell to access `groupby` documentation.

In [None]:
import pandas as pd
?pd.DataFrame.groupby

### Step 2. Create a Bar Chart

Now that we've loaded passenger data and grouped the results by number of passengers, we can create a bar chart to show this data.

To create a bar chart, follow these steps.
1. In the following code cell, write the call to the `ax.bar()` method to create the bar chart for the number of rides that have specific passenger counts. 

   To do this, either replace the `rect = ax.bar()` line in the code with the following lines, or write your own call.

```python
rect = ax.bar(index, passenger_count, bar_width,
              alpha=opacity, color='b')
```
   For more information, see **bar Method Documentation** below the code cell.
   
2. Run the code cell.

In [None]:
# These give us control of the title, x axis, y axis, etc.
fig, ax = plt.subplots()
bar_width=0.35
opacity=0.4
# shape[0] returns the number of rows/observations.
# np.arange returns an array of integers from 0 to (rows in passenger_count - 1).
index = np.arange(passenger_count.shape[0])

# Create the passenger count bar.
#------------------Enter your code here-------------------#
rect = ax.bar()
#---------------------------------------------------------#

ax.set_xlabel('Passenger Count')
ax.set_ylabel('Number of Rides')
ax.set_title('Number of Rides with Each Passenger Count')

fig.tight_layout()
# a table form of our data
print(passenger_count)
plt.show()


#### bar Method Documentation
Run the following cell to access documentation for the `bar` method.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
?ax.bar
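
Because the bars in Step 2 are positioned with `index`, the x axis shows positions (0, 1, 2, ...) rather than the passenger counts themselves. The following sketch, which uses made-up counts and is only one possible refinement, labels each bar with its passenger count. The `Agg` backend line is an assumption so that the example runs outside a notebook; inside this lab, `%matplotlib inline` already handles display.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so the sketch runs without a display.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical ride counts indexed by passenger count (not taken from train.csv).
passenger_count = pd.Series([2, 5, 1], index=[1, 2, 5], name='id')

fig, ax = plt.subplots()
index = np.arange(passenger_count.shape[0])
rect = ax.bar(index, passenger_count, 0.35, alpha=0.4, color='b')

# Label each bar with its actual passenger count rather than its position.
ax.set_xticks(index)
ax.set_xticklabels([str(n) for n in passenger_count.index])
ax.set_xlabel('Passenger Count')
ax.set_ylabel('Number of Rides')

# Each bar's height equals the number of rides for that passenger count.
heights = [bar.get_height() for bar in rect]
print(heights)
```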

## Histogram: Trip Duration
Finally, we will create a histogram that shows trip duration.

To create the histogram, follow these steps.

1. In the following code cell, edit the call to the `plt.hist()` method.

   To do this, either replace the `plt.hist()` line with the following line, or enter your own call.
   
   `plt.hist(np.log(train['trip_duration']), bins='auto', color='black')`

   For more information, see **Python Notes** below the code cell.
   
2. Run the code cell.

In [1]:
# Histogram Exercise
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

train = pd.read_csv('train.csv')

fig = plt.figure()
fig.suptitle('Trip Duration')
# Take the natural log to bunch all the values closer together.
# If you plot raw values, a few very long trips stretch the x axis, so the
# resulting graph is hard to read and takes a while to render.

#------------- Edit the following line --------------------------#
plt.hist() 
#----------------------------------------------------------------#

plt.xlabel('ln(Duration in Seconds)')
plt.ylabel('Count')
plt.show()

#### Python Notes

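Taking the natural log of the durations compresses the very large values, so most of the data no longer crowds into a few bars and the shape of the distribution is easier to see in the final graph.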
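
The effect of the log can be seen on a handful of numbers. This is a minimal sketch on made-up durations, not values from the lab's dataset.

```python
import numpy as np

# Hypothetical trip durations in seconds, including one very long trip.
durations = np.array([60, 300, 600, 900, 1800, 86400])
log_durations = np.log(durations)

# On the raw scale, the longest trip is 1440 times the shortest.
raw_ratio = durations.max() / durations.min()

# On the log scale, the same trips are only about 7.3 units apart.
log_span = log_durations.max() - log_durations.min()

print(raw_ratio)
print(round(log_span, 2))
```

With the raw values, a histogram would need an x axis wide enough for the longest trip, which pushes almost every other trip into the first bar.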

#### Most likely trip duration
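Using the histogram above, what's the most likely trip duration in seconds?
Because the x axis shows the natural log of the duration, read the x value at the tallest bar, enter it as the argument to `np.exp()` to convert it back to seconds, then run the cell below.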

In [None]:

#------------- Edit the following line ----------------#

# Enter your answer for most likely trip duration.
import numpy as np
np.exp()
#-----------------------------------------------------#
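
One way to check a value read from the chart is to locate the tallest bin numerically. The sketch below does this on synthetic data drawn from a log-normal distribution; the distribution, its parameters, and the variable names are assumptions for illustration and say nothing about the real answer for `train.csv`.

```python
import numpy as np

# Synthetic durations (seconds) whose logs are centered near 6.5; not real trip data.
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=6.5, sigma=0.7, size=10_000)

# Bin the logged values the same way the histogram cell does (bins='auto').
counts, edges = np.histogram(np.log(durations), bins='auto')

# The tallest bin marks the most common range of log durations.
peak = np.argmax(counts)
peak_center = (edges[peak] + edges[peak + 1]) / 2

# Convert the log value at the peak back to seconds.
most_likely_seconds = np.exp(peak_center)
print(peak_center, most_likely_seconds)
```

The exact peak depends on the bin width, so treat the result as an approximate range rather than a single precise duration.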