# Boolean Indexing with Numpy

We'll be working with a subset of this data: Yellow taxi trips to and from New York City airports between January and June 2016. In our dataset, each row represents a unique taxi trip. Below is information about selected columns from the data set:

- `pickup_year` - The year of the trip.
- `pickup_month` - The month of the trip (January is 1, December is 12).
- `pickup_day` - The day of the month of the trip.
- `pickup_location_code` - The airport or borough where the the trip started, as one of eight categories:<br>
 * 0 - Bronx.<br>
 * 01 - Brooklyn.<br>
 * 02 - JFK Airport.<br>
 * 03 - LaGuardia Airport.<br>
 * 04 - Manhattan.<br>
 * 05 - Newark Airport.<br>
 * 06 - Queens.<br>
 * 07 - Staten Island.
- `dropoff_location_code` - The airport or borough where the the trip finished, using the same eight category codes as pickup_location_code.
- `trip_distance` - The distance of the trip in miles.
- `trip_length` - The length of the trip in seconds.
- `fare_amount` - The base fare of the trip, in dollars.
- `total_amount` - The total amount charged to the passenger, including all fees, tolls and tips.

In [1]:
import numpy as np

In [3]:
# The numpy.genfromtxt() function reads a text file into a NumPy ndarray.
# np.genfromtxt(filename,delimiter)
taxi = np.genfromtxt('nyc_taxis.csv',delimiter=',',skip_header=1)
taxi

array([[2.016e+03, 1.000e+00, 1.000e+00, ..., 1.165e+01, 6.999e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 8.000e+00, 5.430e+01,
        1.000e+00],
       [2.016e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 3.780e+01,
        2.000e+00],
       ...,
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 5.000e+00, 6.334e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 8.950e+00, 4.475e+01,
        1.000e+00],
       [2.016e+03, 6.000e+00, 3.000e+01, ..., 0.000e+00, 5.484e+01,
        2.000e+00]])

In [7]:
# Boolean operation on narray
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

a_bool = a < 3
print(a_bool)
b_bool = b == 'blue'
print(b_bool)
c_bool = c > 100
print(c_bool)

[ True  True False False False]
[ True  True False  True]
[False  True False  True]


In [14]:
#Boolean indexing
print(taxi.shape)
february = taxi[taxi[:,1] == 2]
print(february.shape)
february_rides = february.shape[0]

(89560, 15)
(13333, 15)


In [15]:
march = taxi[taxi[:,1] == 3]
print(march.shape)
february_rides = march.shape[0]

(15547, 15)


In [18]:
tip_bool = taxi[:,12] > 50
tip_bool.shape

(89560,)

In [19]:
top_tips = taxi[tip_bool,5:14]

In [21]:
taxi_modified = taxi.copy()

In [23]:
tax_modified[28214,5] = 1
taxi_modified[:,0] = 16
taxi_modified[1800:1802,7] = taxi_modified[:,7].mean()

In [27]:
taxi_modified = taxi.copy()
print(taxi_modified.shape)
zeros = np.zeros([taxi_modified.shape[0], 1])
print(zeros.shape)
taxi_modified = np.concatenate([taxi_modified,zeros],axis=1)
print(taxi_modified.shape)

(89560, 15)
(89560, 1)
(89560, 16)


In [32]:
taxi_modified[taxi_modified[:,5] == 2,15] = 1
taxi_modified[taxi_modified[:,5] == 3,15] = 1
taxi_modified[taxi_modified[:,5] == 5,15] = 1

In [39]:
jfk_count = taxi[taxi[:,6] == 2].shape[0]
jfk_count

11832

In [40]:
laguardia_count = taxi[taxi[:,6] == 3].shape[0]
laguardia_count

16602

In [42]:
newark_count = taxi[taxi[:,6] == 6].shape[0]
newark_count

9000

In [47]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()
mean_mph = trip_mph[trip_mph < 100].mean()
mean_mph

23.353238774840836