### Reading CSV files with NumPy
- `numpy.genfromtxt(filename, delimiter=None)`

1. Import the NumPy library and assign to the alias `np`.
2. Use the `numpy.genfromtxt()` function to read the `nyc_taxis.csv` file into NumPy. Assign the result to `taxi`.
3. Use the `ndarray.shape` attribute to assign the shape of `taxi` to `taxi_shape`.

In [1]:
import numpy as np

taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
taxi_shape = taxi.shape

In [2]:
print(taxi_shape)

(89561, 15)


A NumPy ndarray can contain only one datatype. By default, `numpy.genfromtxt()` converts every value to a float, and any value it cannot convert (such as the text in a header row) becomes `nan`.

- Use `ndarray.dtype` attribute to see the internal datatype that has been used.

In [3]:
print(taxi.dtype)

float64


1. Use the `numpy.genfromtxt()` function to read the `nyc_taxis.csv` file into NumPy again, but this time skip the first row. Assign the result to `taxi`.
2. Assign the shape of `taxi` to `taxi_shape`.

In [4]:
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',', skip_header=1)
taxi_shape = taxi.shape
print(taxi_shape)

(89560, 15)
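
The drop from 89561 to 89560 rows can be reproduced on a small scale. The sketch below is only an illustration: the three-line CSV text and its column names are made up to stand in for `nyc_taxis.csv`, and it is read from an in-memory `io.StringIO` object instead of a file.

```python
import io
import numpy as np

# Hypothetical three-line CSV standing in for nyc_taxis.csv
csv_text = "pickup_year,pickup_month,trip_distance\n2016,1,21.0\n2016,2,16.29\n"

# Without skip_header, the header row is parsed too and becomes a row of nan
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(with_header.shape)            # (3, 3)
print(np.isnan(with_header[0]))     # [ True  True  True]

# skip_header=1 drops that first line before parsing
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(without_header.shape)         # (2, 3)
print(without_header.dtype)         # float64
```

The header text cannot be converted to a number, so it shows up as `nan` values rather than raising an error.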


### Boolean Indexing with 1D ndarrays

- Select just the `pickup_month` column:

In [5]:
pickup_month = taxi[:,1]

- Use a boolean operation to make a boolean array, where the value `1` corresponds to January:

In [6]:
january_bool = pickup_month == 1

- Then use the new boolean array to select only the items from `pickup_month` that have a value of `1`:

In [7]:
january = pickup_month[january_bool]

- Use the `.shape` attribute to find out how many items are in the `january` ndarray, which is equal to the number of taxi rides from the month of January. 
- Use `0` to extract the value from the tuple returned by `.shape`:

In [8]:
january_rides = january.shape[0]
print(january_rides)

13481


1. Create a boolean array, `february_bool`, that evaluates whether the items in `pickup_month` are equal to `2`.
2. Use the `february_bool` boolean array to index `pickup_month`. Assign the result to `february`.
3. Use the `ndarray.shape` attribute to find the number of items in `february`. Assign the result to `february_rides`.

In [9]:
february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]
print(february_rides)

13333
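
As a minimal sketch of the same pattern on data small enough to check by eye, the example below uses a made-up `pickup_month` array. It also shows an alternative way to count matches, summing the boolean array directly, which is not the approach used in the cells above.

```python
import numpy as np

# Hypothetical pickup_month values for seven rides
pickup_month = np.array([1, 2, 1, 3, 2, 1, 6])

january_bool = pickup_month == 1          # elementwise comparison gives a boolean array
january = pickup_month[january_bool]      # keeps only the positions that are True

# Two equivalent ways to count the matching rides
count_from_shape = january.shape[0]
count_from_sum = int(january_bool.sum())  # True counts as 1, False as 0

print(january_bool)
print(count_from_shape, count_from_sum)   # 3 3
```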


### Boolean Indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods of 1D ndarrays. The only limitation is that the boolean array must have the same length as the dimension you're indexing.
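
The length rule can be illustrated with a small made-up array. This is a sketch only; `arr`, `row_bool`, and `col_bool` are hypothetical names, and the `IndexError` behavior shown is what recent NumPy versions do when the lengths do not match.

```python
import numpy as np

# Hypothetical 3x4 array standing in for a slice of taxi
arr = np.arange(12).reshape(3, 4)

row_bool = np.array([True, False, True])          # length 3, matches the row count
col_bool = np.array([False, True, True, False])   # length 4, matches the column count

rows = arr[row_bool]            # rows 0 and 2, all columns
rows_cols = arr[row_bool, 1:3]  # rows 0 and 2, columns 1 and 2
cols = arr[:, col_bool]         # all rows, columns 1 and 2

# A boolean array of the wrong length raises IndexError
try:
    arr[np.array([True, False])]
    mismatch_raised = False
except IndexError:
    mismatch_raised = True

print(rows_cols)
print(mismatch_raised)
```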

- In the previous mission, the calculated maximum trip speed was 82,000 mph, which cannot be correct. Check the data for issues.

In [10]:
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

- Check for trips with an average speed greater than 20,000 mph:

In [11]:
# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and 
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]
print(trips_over_20000_mph)

[[ 2.   2.  23.   1. ]
 [ 2.   2.  19.6  1. ]
 [ 2.   2.  16.7  2. ]
 [ 3.   3.  17.8  2. ]
 [ 2.   2.  17.2  2. ]
 [ 3.   3.  16.9  3. ]
 [ 2.   2.  27.1  4. ]]
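
To see why these rows produce such large speeds, the calculation can be repeated by hand on the first three rows printed above. The units are an assumption inferred from the `mph` name and the division by 3600: distance in miles and trip length in seconds.

```python
import numpy as np

# trip_distance and trip_length values copied from the first three rows printed above
trip_distance = np.array([23.0, 19.6, 16.7])   # assumed to be miles
trip_length = np.array([1.0, 1.0, 2.0])        # assumed to be seconds

# Dividing seconds by 3600 converts to hours, so the result is miles per hour
trip_mph = trip_distance / (trip_length / 3600)
print(trip_mph)
```

A 23-mile trip recorded as lasting one second works out to roughly 82,800 mph, which points to the recorded trip lengths rather than the distances as the likely problem.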


- Examine the rows that have the highest values for the `tip_amount` column:
1. Create a boolean array, `tip_bool`, that determines which rows have values for the `tip_amount` column of more than `50`.
2. Use the `tip_bool` array to select all rows from `taxi` with tip amounts of more than `50`, and the columns at indexes `5` through `13`, inclusive. Assign the resulting array to `top_tips`.

In [12]:
tip_amount = taxi[:,12]

In [13]:
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]

### Assigning Values in ndarrays

Syntax (using pseudocode):

`ndarray[location_of_values] = new_value`

In [14]:
# this creates a copy of our taxi ndarray so that our original is not changed
taxi_modified = taxi.copy()

1. The value at column index `5` (pickup_location) of row index `28214` is incorrect. Use assignment to change this value to `1` in the `taxi_modified` ndarray.

In [15]:
taxi_modified[28214,5] = 1

2. The first column (index `0`) contains year values as four digit numbers in the format YYYY (`2016`, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (`16`) in the `taxi_modified` ndarray.

In [16]:
taxi_modified[:,0] = 16

3. The values at column index `7` (trip_distance) of row indexes `1800` and `1801` are incorrect. Use assignment to change these values in the `taxi_modified` ndarray to the mean value for that column.

In [17]:
taxi_modified[1800:1802,7] = taxi_modified[:,7].mean()
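
The three kinds of assignment above can be checked on a small made-up array. This is a sketch under assumed data: `original` and `modified` are hypothetical names, and the columns only loosely stand in for year, location code, and distance.

```python
import numpy as np

# Hypothetical 4x3 array; columns stand in for year, location code, distance
original = np.array([[2016.0, 4.0, 10.0],
                     [2016.0, 2.0, 99.0],
                     [2016.0, 9.0, 98.0],
                     [2016.0, 3.0, 20.0]])
modified = original.copy()           # independent copy; original stays untouched

modified[2, 1] = 1                   # single value: row 2, column 1
modified[:, 0] = 16                  # whole column: the scalar is broadcast to every row
# The right-hand side is evaluated first, so the mean still includes the old values
modified[1:3, 2] = modified[:, 2].mean()

print(modified)
print(original[0, 0])                # 2016.0
```

Note that the mean is computed before the assignment happens, so the incorrect values are still part of it, exactly as in the cell above.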

### Assignment Using Boolean Arrays

In [18]:
# this creates a second fresh copy of the taxi ndarray to leave the original alone
taxi_copy = taxi.copy()

1. Select the fourteenth column (index 13) in `taxi_copy`. Assign it to a variable named `total_amount`.

In [19]:
total_amount = taxi_copy[:,13]

2. For rows where the value of `total_amount` is less than `0`, use assignment to change the value to `0`.

In [20]:
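# total_amount is a view of column 13, so this also updates taxi_copy;
# indexing taxi_copy itself with the boolean array would zero out entire rows
total_amount[total_amount < 0] = 0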
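
How much of the array changes depends on what the boolean array is applied to. The sketch below uses a made-up two-column array (`base` and the other names are hypothetical) to contrast three forms of boolean assignment.

```python
import numpy as np

# Hypothetical 3x2 array; column 1 stands in for total_amount
base = np.array([[1.0, 12.5],
                 [2.0, -3.0],
                 [3.0, 8.0]])

# Indexing the whole array with a 1D boolean array selects entire rows
whole_rows = base.copy()
whole_rows[whole_rows[:, 1] < 0] = 0       # row 1 becomes [0., 0.]

# Adding a column index limits the assignment to that column
one_column = base.copy()
one_column[one_column[:, 1] < 0, 1] = 0    # row 1 becomes [2., 0.]

# A column slice is a view, so assigning through it also updates the parent array
via_view = base.copy()
amount = via_view[:, 1]
amount[amount < 0] = 0                     # via_view row 1 becomes [2., 0.]

print(whole_rows[1], one_column[1], via_view[1])
```

When only one column should change, the second or third form is the one to reach for.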

### Assignment Using Boolean Arrays with Two Dimensions

Syntax (using pseudocode):

- using an intermediate variable:


`bool = array[:, column_for_comparison] == value_for_comparison`

`array[bool, column_for_assignment] = new_value`


- all as one line:

`array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value`

In [21]:
# create a new column filled with `0`
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]


1. In the new column at index `15`, assign the value `1` if the `pickup_location_code` (column index `5`) corresponds to an airport location, leaving the value as `0` otherwise by performing these three operations:
    - For rows where the value for the column index `5` is equal to `2` (JFK Airport), assign the value `1` to column index `15`.
    - For rows where the value for the column index `5` is equal to `3` (LaGuardia Airport), assign the value `1` to column index `15`.
    - For rows where the value for the column index `5` is equal to `5` (Newark Airport), assign the value `1` to column index `15`.

In [22]:
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1
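
As an alternative sketch, not the approach used in the cell above, the three assignments can be collapsed into one by testing membership with `np.isin`. The `codes` array below is made up for illustration; the airport codes `2`, `3`, and `5` are the ones given in the text.

```python
import numpy as np

# Hypothetical array: column 0 is a location code, column 1 is the new flag column
codes = np.array([[2.0, 0.0],
                  [4.0, 0.0],
                  [3.0, 0.0],
                  [5.0, 0.0],
                  [0.0, 0.0]])

# Airport codes as given in the text: 2 (JFK), 3 (LaGuardia), 5 (Newark)
airport_bool = np.isin(codes[:, 0], [2, 3, 5])
codes[airport_bool, 1] = 1

print(codes[:, 1])   # [1. 0. 1. 1. 0.]
```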

### Challenge: Which is the most popular airport?

1. Using the original `taxi` ndarray, calculate how many trips had JFK Airport as their destination:
    - Use boolean indexing to select only the rows where the `dropoff_location_code` column (column index `6`) has a value that corresponds to JFK. Assign the result to `jfk`.
    - Calculate how many rows are in the new `jfk` array and assign the result to `jfk_count`.

In [23]:
jfk = taxi[taxi[:,6] == 2] 
jfk_count = jfk.shape[0]

2. Calculate how many trips from `taxi` had LaGuardia Airport as their destination:
    - Use boolean indexing to select only the rows where the `dropoff_location_code` column (column index `6`) has a value that corresponds to LaGuardia. Assign the result to `laguardia`.
    - Calculate how many rows are in the new `laguardia` array. Assign the result to `laguardia_count`.

In [24]:
laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]

3. Calculate how many trips from `taxi` had Newark Airport as their destination:
    - Select only the rows where the `dropoff_location_code` column has a value that corresponds to Newark, and assign the result to `newark`.
    - Calculate how many rows are in the new `newark` array and assign the result to `newark_count`.

In [25]:
newark = taxi[taxi[:,6] == 5]
newark_count = newark.shape[0]

4. Inspect the values for `jfk_count`, `laguardia_count`, and `newark_count` to see which airport had the most dropoffs.

In [26]:
print(f"JFK: {jfk_count}\nLaGuardia: {laguardia_count}\nNewark: {newark_count}")

JFK: 11832
LaGuardia: 16602
Newark: 63
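
If only the counts are needed, the intermediate arrays can be skipped by summing each boolean array. The sketch below is an alternative to the cells above, run on a made-up `dropoff` array; the `airports` dictionary is a hypothetical helper built from the codes given earlier.

```python
import numpy as np

# Hypothetical dropoff_location_code values
dropoff = np.array([2, 3, 3, 5, 2, 3, 0, 6])

# Airport codes as given in the text
airports = {"JFK": 2, "LaGuardia": 3, "Newark": 5}

# Summing a boolean array counts its True values
counts = {name: int((dropoff == code).sum()) for name, code in airports.items()}
busiest = max(counts, key=counts.get)

print(counts)   # {'JFK': 2, 'LaGuardia': 3, 'Newark': 1}
print(busiest)  # LaGuardia
```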


### Challenge: Calculating Statistics for Trips on Clean Data

This challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining "clean" data.

In [27]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

1. Create a new ndarray, `cleaned_taxi`, containing only rows for which the values of `trip_mph` are less than 100.

In [28]:
cleaned_taxi = taxi[trip_mph < 100]

2. Calculate the mean of the `trip_distance` column of `cleaned_taxi`. Assign the result to `mean_distance`.

In [29]:
mean_distance = cleaned_taxi[:,7].mean()

3. Calculate the mean of the `trip_length` column of `cleaned_taxi`. Assign the result to `mean_length`.

In [30]:
mean_length = cleaned_taxi[:,8].mean()

4. Calculate the mean of the `total_amount` column of `cleaned_taxi`. Assign the result to `mean_total_amount`.

In [31]:
mean_total_amount = cleaned_taxi[:,13].mean()
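
One detail worth checking is what the `trip_mph < 100` filter does with trips whose recorded length is zero. The sketch below uses made-up values, with units assumed to be miles and seconds, to show that such rows produce `inf` or `nan` speeds and are excluded by the comparison.

```python
import numpy as np

# Hypothetical columns: trip_distance (assumed miles) and trip_length (assumed seconds)
trip_distance = np.array([10.0, 23.0, 5.0, 0.0, 2.0])
trip_length = np.array([1800.0, 1.0, 0.0, 0.0, 600.0])

# Silence the divide-by-zero and invalid-value warnings for this calculation only
with np.errstate(divide='ignore', invalid='ignore'):
    # Speeds are approximately [20, 82800, inf, nan, 12]
    trip_mph = trip_distance / (trip_length / 3600)
    # inf and nan both compare as False, so those rows are dropped
    keep = trip_mph < 100

mean_distance = trip_distance[keep].mean()
mean_length = trip_length[keep].mean()

print(keep)
print(mean_distance, mean_length)
```

Only the first and last trips survive the filter here, so the means are computed from those two rows.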