# Visualizing Uber and Lyft trip distances



Let's look at a dataset of Uber and Lyft trips in Boston. 

* The dataset is in the file "fhv_rides.csv". Only the economy versions of the services (UberX and Lyft Economy) are included.

**Question**: Do Uber riders or Lyft riders have longer trips?

We can separate Uber trips from Lyft trips using filtering.
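A minimal sketch of this filtering step with pandas. The column names `company` and `distance` are assumptions made for illustration (the actual columns in "fhv_rides.csv" may differ), and a small hand-made table stands in for the file so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("fhv_rides.csv");
# the column names `company` and `distance` are assumed, not taken from the file.
rides = pd.DataFrame({
    "company": ["Uber", "Lyft", "Uber", "Lyft", "Uber"],
    "distance": [1.2, 3.4, 2.5, 0.9, 4.1],
})

# A boolean mask keeps only the rows where the condition is True
uber = rides[rides["company"] == "Uber"]
lyft = rides[rides["company"] == "Lyft"]

print(len(uber), len(lyft))
```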

In this course, many of the visualizations are enabled by the `pyplot` module in the `matplotlib` library. `pyplot` is often imported as `plt`. It includes many functions for different kinds of charts.

## What plot should we use to visualize the distribution of trip distances?

# Using the histogram

In `matplotlib.pyplot`, the function `hist()` creates histograms. The main input argument is the observed values of a variable. After creating the histogram, we should call `plt.show()` to display the chart.

Please keep in mind that many of the ideas that we discuss here (e.g., adding legends, titles, grid lines, ticks etc.) for histograms can be applied to other charts that can be created using `pyplot`.

Let's visualize Uber's trip distances using a histogram.
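A minimal sketch of this step. The distances below are synthetic values generated with NumPy; they only stand in for the Uber distance column of the dataset, which is not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic trip distances (an assumption; replace with the real Uber column)
rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)

# hist() returns the bin counts, the bin edges, and the drawn bar objects
counts, edges, bars = plt.hist(uber_distance)
plt.show()
```

By default `hist()` splits the data range into 10 equal-width bins.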

## Using keyword arguments to change the look of the chart

Keyword arguments (kwargs for short) are a way to pass arguments to a function by explicitly specifying the **parameter name along with the value**. They customize different aspects of a plot, such as the color or the number of bins.

The options that `hist()` accepts, including those passed through `**kwargs`, are listed in the [matplotlib.pyplot.hist documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).

We can:
* change the color using `color=`, and the color of the bar edges using `edgecolor=`
* change the number of bins using `bins=`
* use `density=True` to normalize the histogram so that the bar areas sum to 1 (each bar's height is the fraction of values in the bin divided by the bin width), instead of showing the count of values in the bins
* use `plt.title()` to define the title
* use `plt.xlabel()` and `plt.ylabel()` for the axis labels
* use `alpha=` to make the bars translucent
* use `plt.grid()` to add grid lines
* use `plt.legend()` to show a legend; note that to show the legend, we need to set the labels for each histogram in the same plot
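The options above can be combined in one call, as in the sketch below. The data are synthetic, and the specific colors, bin count, and label texts are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic distances standing in for the real Uber column (an assumption)
rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)

heights, edges, _ = plt.hist(
    uber_distance,
    bins=20,
    density=True,       # normalize so that the bar areas sum to 1
    color="skyblue",
    edgecolor="black",
    alpha=0.7,
    label="Uber",       # text shown by plt.legend()
)
plt.title("Distribution of Uber trip distances")
plt.xlabel("Distance")
plt.ylabel("Density")
plt.grid()
plt.legend()
plt.show()

# With density=True, height times bin width adds up to 1 across all bins
print(np.sum(heights * np.diff(edges)))
```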

Note that we can also add a vertical line to help visualize the total frequency to the left or right of a certain value, and a horizontal line to find the ranges of distances with a frequency below a certain value.
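One way to draw such reference lines is with `plt.axvline()` and `plt.axhline()`. The sketch below uses synthetic distances, and the cut-off values (3 for the vertical line, 50 for the horizontal line) are chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)  # synthetic stand-in data

plt.hist(uber_distance, bins=20, color="skyblue", edgecolor="black")

# Vertical line at a distance of 3, horizontal line at a count of 50
vline = plt.axvline(x=3, color="red", linestyle="--")
hline = plt.axhline(y=50, color="gray", linestyle=":")
plt.show()
```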

We can add Lyft's trip distances into the same chart.
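A sketch of overlaying two histograms on the same axes by calling `plt.hist()` twice before `plt.show()`. Both arrays are synthetic stand-ins for the Uber and Lyft distance columns, and the shared bin edges are a design choice of this example rather than something the dataset requires.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two distance columns (assumptions, not real data)
rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)
lyft_distance = rng.gamma(shape=2.0, scale=1.4, size=500)

# Shared bin edges computed from all values make the two sets of bars comparable
all_distances = np.concatenate([uber_distance, lyft_distance])
bin_edges = np.histogram_bin_edges(all_distances, bins=25)

plt.hist(uber_distance, bins=bin_edges, alpha=0.5, label="Uber")
plt.hist(lyft_distance, bins=bin_edges, alpha=0.5, label="Lyft")
plt.xlabel("Distance")
plt.ylabel("Number of trips")
legend = plt.legend()
plt.show()
```

Setting `alpha=` below 1 keeps the bars of the first histogram visible where the two overlap.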

**Notes**:
1. In addition to specifying the number of bins, we can also:
    * use a list to specify the edge locations of each bin
        - try to set `bin_edge=[0,2,4,6,8,10,12,14,16,18]`, and then use `bins=bin_edge` in `plt.hist()`
    * automatically generate the bins by setting `bins=` to one of the pre-defined string values ('auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', or 'sqrt'), which correspond to pre-set methods. See [numpy.histogram_bin_edges](https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges) for details.
2. See the documentation of pyplot for the [available color codes](https://matplotlib.org/stable/gallery/color/named_colors.html).
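Both ways of setting the bins from note 1 can be tried as in the following sketch, which again uses synthetic distances in place of the real column.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)  # synthetic stand-in data

# Option 1: explicit bin edges (ten edges define nine bins, each 2 units wide)
bin_edge = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
counts, edges, _ = plt.hist(uber_distance, bins=bin_edge)
plt.show()

# Option 2: let NumPy choose the edges with the 'auto' rule
auto_counts, auto_edges, _ = plt.hist(uber_distance, bins="auto")
plt.show()

print(len(counts), len(auto_counts))
```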

# Using the boxplot

Let's visualize Uber's trip distances using a boxplot (also known as a box-and-whisker plot).

In `matplotlib.pyplot`, the function `boxplot()` creates boxplots.

Create a boxplot for Uber's trip distances.
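A minimal sketch of a single boxplot, using synthetic distances as a stand-in for the Uber column of the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)  # synthetic stand-in data

# boxplot() returns a dict of the drawn artists (boxes, medians, whiskers, fliers, ...)
parts = plt.boxplot(uber_distance)
plt.ylabel("Distance")
plt.show()
```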



**Question**: How do we interpret the boxplot?

* The orange line in the center represents the median. 
* The bottom and top edges of the box indicate the 1st quartile (Q1, 25th percentile) and the 3rd quartile (Q3, 75th percentile), respectively.
    * The difference Q3-Q1 is called the interquartile range (IQR).
* The "whiskers" extend:
    * upwards to the largest data point within 1.5 x IQR above Q3
    * downwards to the smallest data point within 1.5 x IQR below Q1

First, calculate Q1, the median, and Q3.
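One way to get the three quartiles at once is the pandas `quantile()` method, sketched below. The values in `uber_distance` are hypothetical, and the name `Qs` is chosen to match the variable used in the next calculation.

```python
import pandas as pd

# Hypothetical distances; the real values come from the Uber rows of the dataset
uber_distance = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

# quantile() with a list returns a Series indexed by the requested quantiles
Qs = uber_distance.quantile([0.25, 0.5, 0.75])
print(Qs)
```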


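```python
# Calculate Q1, median, Q3, IQR, and the lower and upper whisker limits
# Qs is expected to hold the 25th, 50th, and 75th percentiles, in that order

q1 = Qs.iloc[0]
q2 = Qs.iloc[1]
q3 = Qs.iloc[2]

iqr = q3 - q1

# The whiskers stop at the most extreme data points inside these limits
lw = q1 - 1.5 * iqr
uw = q3 + 1.5 * iqr

print(q1, q2, q3, iqr, lw, uw)
```

```
1.51 2.35 3.01 1.4999999999999998 -0.7399999999999995 5.26
```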
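The values `q1 - 1.5*iqr` and `q3 + 1.5*iqr` are limits; the whiskers themselves end at actual observations. The sketch below, which uses a small hypothetical array rather than the dataset, shows how the whisker ends and the points beyond them could be found.

```python
import numpy as np

# Hypothetical distances, chosen so that one value (9.0) lies far from the rest
distance = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0])

q1, q3 = np.percentile(distance, [25, 75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker ends at the most extreme observation inside its limit
lower_whisker = distance[distance >= lower_limit].min()
upper_whisker = distance[distance <= upper_limit].max()

# Observations outside the limits are drawn as individual points
outliers = distance[(distance < lower_limit) | (distance > upper_limit)]

print(lower_limit, upper_limit, lower_whisker, upper_whisker, outliers)
```

When the lower limit is negative, as in the output above, the lower whisker simply ends at the smallest observed distance, provided no distance is negative.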


Now let's include the plots for both Uber and Lyft trips.

To create multiple boxplots in the same chart, pass a list of variables as the input argument.
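A sketch of side-by-side boxplots, with synthetic arrays standing in for the Uber and Lyft distance columns. The tick labels are set with `plt.xticks()`, which is one of several ways to name the boxes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two distance columns (assumptions, not real data)
rng = np.random.default_rng(0)
uber_distance = rng.gamma(shape=2.0, scale=1.2, size=500)
lyft_distance = rng.gamma(shape=2.0, scale=1.4, size=500)

# One box per list element, drawn at x positions 1, 2, ...
parts = plt.boxplot([uber_distance, lyft_distance])
plt.xticks([1, 2], ["Uber", "Lyft"])
plt.ylabel("Distance")
plt.show()
```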

Based on the boxplot, what insights can we get?