>  We’ve already seen how the mean, standard deviation, and median can help us understand our data. They describe the bulk of our data, trying to summarize where most values lie. But sometimes it’s useful to look at the unusual values:
> + Which users had an unusually high number of unsuccessful login attempts?
> + Which products were the most popular?
> + On which days and at what times are our sales the lowest?
>
> These questions aren’t unique to data science. For example, bars have been offering
> “happy hour” for many years, discounting their products at a time when they have
> fewer customers. Data science allows us to ask these questions more formally, to get
> more precise answers, and then to check whether our changes have had the desired
> results.

> In this exercise, you are to create a two-column data frame from the taxi data we looked at in exercise 6. The first column will contain the passenger count for each trip, and the second column will contain the distance (in miles) for each trip. Once you have created this data frame, I want you to:
> + Count how many trip distances were outliers.
> + Calculate the mean number of passengers for outliers. Is it different from the mean number of passengers for all trips?

SOLUTION

In [6]:
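import pandas as pd
from pandas import Series
from pandas import DataFrame

# Each CSV file holds a single column with no header row, so squeeze()
# turns the resulting one-column data frame into a series
trip_distance = pd.read_csv('./Data/taxi-distance.csv', header=None).squeeze()
passenger_count = pd.read_csv('./Data/taxi-passenger-count.csv', header=None).squeeze()

# Combine the two series into a single data frame, one column per series
df = DataFrame({'trip_distance': trip_distance,
                'passenger_count': passenger_count})
df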


      trip_distance  passenger_count
0              1.63                1
1              0.46                1
2              0.87                1
3              2.13                1
4              1.40                1
...             ...              ...
9994           2.70                1
9995           4.50                1
9996           5.59                1
9997           1.54                6


In [24]:
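# Calculate the interquartile range (IQR) of the trip distances
q1 = df.trip_distance.quantile(0.25)
q3 = df.trip_distance.quantile(0.75)
iqr = q3 - q1

# Lower and upper bounds: distances outside them count as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Label each trip True (outlier) or False (not an outlier) by binning its distance.
# df1 is a categorical series aligned with df, not a new column of df.
# Infinite outer edges make sure no distance falls outside the bins.
df1 = pd.cut(df.trip_distance,
             bins=[float('-inf'), lower_bound, upper_bound, float('inf')],
             labels=[True, False, True],
             ordered=False)

# Keep only the rows labeled as outliers
df.loc[df1.array]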

      trip_distance  passenger_count
7             11.90                4
60             9.30                1
73            12.65                1
82            10.24                3
88            23.76                2
...             ...              ...
9975           7.60                1
9976          12.60                1
9979          11.30                1
9980           9.13                1
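
The exercise also asks how many trip distances were outliers, which the cell above displays but does not count. One way to get that number, sketched here on a small made-up data frame (the `sample` values are hypothetical and are not taken from the taxi files), is to build a boolean mask from the same 1.5 × IQR bounds and sum it:

```python
import pandas as pd

# Hypothetical stand-in for the taxi data; these values are invented for illustration
sample = pd.DataFrame({
    "trip_distance": [0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 25.0],
    "passenger_count": [1, 1, 2, 1, 1, 3, 1, 2, 4],
})

q1 = sample["trip_distance"].quantile(0.25)
q3 = sample["trip_distance"].quantile(0.75)
iqr = q3 - q1

# True where a distance lies below the lower bound or above the upper bound
is_outlier = (
    (sample["trip_distance"] < q1 - 1.5 * iqr)
    | (sample["trip_distance"] > q3 + 1.5 * iqr)
)

# True counts as 1 when summed, so the sum of the mask is the number of outliers
outlier_count = int(is_outlier.sum())
print(outlier_count)
```

With the objects already defined in this notebook, `len(df.loc[df1.array])` should give the corresponding count for the real data.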


In [25]:
# Mean of each column over the outlier rows only;
# passenger_count is the value the exercise asks about

df.loc[df1.array].mean()

trip_distance      12.257285
passenger_count     1.730107
dtype: float64
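
To answer the second question, the outlier mean needs to be set next to the mean for all trips. The following is a minimal sketch of that comparison, assuming the same 1.5 × IQR rule; the helper `passenger_means` and the `sample` data are hypothetical and only illustrate the shape of the calculation, not the result for the taxi data:

```python
import pandas as pd


def passenger_means(frame, k=1.5):
    """Return the mean passenger count for distance outliers and for all trips."""
    q1 = frame["trip_distance"].quantile(0.25)
    q3 = frame["trip_distance"].quantile(0.75)
    iqr = q3 - q1

    # Outliers lie more than k * IQR below the first or above the third quartile
    is_outlier = (
        (frame["trip_distance"] < q1 - k * iqr)
        | (frame["trip_distance"] > q3 + k * iqr)
    )

    return pd.Series({
        "outliers": frame.loc[is_outlier, "passenger_count"].mean(),
        "all_trips": frame["passenger_count"].mean(),
    })


# Hypothetical stand-in for the taxi data; these values are invented for illustration
sample = pd.DataFrame({
    "trip_distance": [0.5, 1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 25.0],
    "passenger_count": [1, 1, 2, 1, 1, 3, 1, 2, 4],
})

means = passenger_means(sample)
print(means)
```

Calling `passenger_means(df)` on the data frame built earlier would put the two means side by side, so the outlier figure shown above can be compared directly with the overall one.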