### Reading CSV files with NumPy

We can use the __numpy.genfromtxt()__ function to read files into NumPy ndarrays. Here is the simplified syntax for the function, and an explanation of its two parameters:

```
np.genfromtxt(filename, delimiter=None)
```
where:
- filename: A positional argument, usually a string representing the path to the text file to be read.
- delimiter: A named argument, specifying the string used to separate each value.
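
As a minimal sketch of the call, the example below reads a tiny in-memory CSV instead of a file on disk. The `io.StringIO` object and the sample values are assumptions made for illustration and are not part of the taxi dataset.

```python
import io
import numpy as np

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = "1,2,3\n4,5,6\n"

# genfromtxt accepts a file-like object as well as a path string.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(sample.shape)  # (2, 3)
print(sample.dtype)  # float64
```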

In [1]:
import numpy as np

# Read a CSV file
taxi = np.genfromtxt('data/nyc_taxis.csv', delimiter=',')

taxi_shape = taxi.shape

taxi_shape

(89561, 15)

In our previous data import code, we converted all the values to floats before we converted the list of lists to an ndarray. That's because NumPy ndarrays can contain only _one datatype_.

With _genfromtxt()_, NumPy converts every value in the file to a single data type. Unless we pass a dtype argument, that type is float (float64).

The float64 type allows most of the values from our CSV to be read. You can think of NumPy's float64 type as being identical to Python's float type (the "64" refers to the number of bits used to store the underlying value).

Values that cannot be converted to float64 are stored as NaN. NaN is an acronym for __Not a Number__: it literally means that the value cannot be stored as a number. It is similar to (and often referred to as) a null value, like Python's None constant.

NaN is most commonly seen when a value is missing, but in this case, we have NaN values because the first line from our CSV file contains the names of each column. NumPy is unable to convert string values like pickup_year into the float64 data type.
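
The effect of a header row can be reproduced on a small scale. This is a sketch with made-up data: the two column names are borrowed from the taxi file for flavor, but the values are hypothetical.

```python
import io
import numpy as np

# Hypothetical CSV text whose first line holds column names.
csv_text = "pickup_year,pickup_month\n2016,1\n2016,2\n"

with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(with_header.shape)               # (3, 2): the header counts as a row
print(with_header[0])                  # [nan nan]
print(np.isnan(with_header[0]).all())  # True
```

Note that NaN never compares equal to anything, including itself, so `np.isnan()` is the usual way to test for it.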

To avoid reading the header row in the first place, we can pass an additional parameter, __skip_header__, to the _numpy.genfromtxt()_ function. The skip_header parameter accepts an integer: the number of rows to skip from the start of the file.
- Note that because this integer should be the number of rows and not the index, skipping the first row would require a value of 1, not 0.
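
To make the "count, not index" point concrete, here is a small sketch comparing `skip_header=0` with `skip_header=1`. The CSV text is hypothetical sample data.

```python
import io
import numpy as np

# Hypothetical CSV text: one header line followed by two data rows.
csv_text = "pickup_year,pickup_month\n2016,1\n2016,2\n"

# skip_header=0 (the default) skips nothing, so the header row is still read.
keep_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=0)

# skip_header=1 skips exactly one row: the header.
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(keep_header.shape)  # (3, 2)
print(no_header.shape)    # (2, 2)
```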

In [2]:
taxi = np.genfromtxt('data/nyc_taxis.csv', delimiter=',', skip_header=1) # Remove the header row

taxi_shape = taxi.shape
taxi_shape

(89560, 15)

### Boolean Arrays

A __boolean array__, as the name suggests, is an array of boolean values.
- Boolean arrays are sometimes called __boolean vectors__ or __boolean masks__

The boolean (or bool) type is a built-in Python type that can be one of two unique values:
- True
- False

In [3]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

a_bool = a < 3
b_bool = b == 'blue'
c_bool = c > 100
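
Printing the results of the three comparisons above shows what a boolean array looks like. The snippet repeats the same definitions so that it runs on its own.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# Each comparison is applied element by element and returns a boolean array.
a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

print(a_bool.tolist())  # [True, True, False, False, False]
print(b_bool.tolist())  # [True, True, False, True]
print(c_bool.tolist())  # [False, True, False, True]
print(a_bool.dtype)     # bool
```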

### Boolean Indexing with 1D ndarrays

To index using our new boolean array, we simply insert it in the square brackets.

The boolean array acts as a filter, so that the values corresponding to True become part of the result and the values corresponding to False are removed.
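
A short sketch of this filtering behavior, reusing the small array `c` from the previous section:

```python
import numpy as np

c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c > 100

# Only the positions where c_bool is True survive.
filtered = c[c_bool]

print(filtered.tolist())  # [103.4, 200.3]
print(filtered.shape[0])  # 2
```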

In [4]:
pickup_month = taxi[:, 1]

january_bool = pickup_month == 1

january = pickup_month[january_bool] # Keep only the entries where january_bool is True

january_rides = january.shape[0]

print(january_rides)

# Now to check number of rides in the month of February
february_bool = pickup_month == 2

february = pickup_month[february_bool]

february_rides = february.shape[0]

february_rides

13481


13333

### Boolean Indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods.
- The only limitation is that the boolean array must have the same length as the dimension you're indexing.

Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.
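
The sketch below illustrates this on a small made-up array (the values and the name `mask` are hypothetical). A boolean array built from one column selects whole rows, and it can be combined with a column slice.

```python
import numpy as np

# Hypothetical 3x3 array; column 1 is the one we compare against.
arr = np.array([[1, 10, 100],
                [2, 60, 200],
                [3, 70, 300]])

# One boolean per row, so its length (3) matches the row dimension.
mask = arr[:, 1] > 50

rows = arr[mask]        # matching rows, all columns
subset = arr[mask, 1:]  # matching rows, columns 1 onward

print(rows.tolist())    # [[2, 60, 200], [3, 70, 300]]
print(subset.tolist())  # [[60, 200], [70, 300]]
```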

In [5]:
# examine the rows that have the highest values for the tip_amount column
tip_amount = taxi[:, 12]

tip_bool = tip_amount > 50

top_tips = taxi[tip_bool, 5:14]

### Assigning Values in ndarrays

Syntax:
```
ndarray[location_of_values] = new_value
```

In [6]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

taxi_modified[28214, 5] = 1 # The pickup location code in this row is 0, which is incorrect

# Change the year format from YYYY to YY since all of them are 2016
taxi_modified[:, 0] = 16

taxi_modified[1800:1802, 7] = taxi_modified[:, 7].mean()
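
The three kinds of assignment used above (a single element, a whole column, and a slice of a column) can be tried on a small array. This is a sketch with a hypothetical array named `grid`, not the taxi data.

```python
import numpy as np

grid = np.zeros((3, 4))

grid[0, 1] = 5                    # a single element
grid[:, 3] = 9                    # every row of one column
grid[1:3, 0] = grid[:, 3].mean()  # a slice, filled with one computed value (9.0)

print(grid)
# [[0. 5. 0. 9.]
#  [9. 0. 0. 9.]
#  [9. 0. 0. 9.]]
```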

### Assignment Using Boolean Arrays
- Boolean arrays become very powerful when we use them for assignment.
- The boolean array controls the values that the assignment applies to, and the other values remain unchanged.

A common shortcut is to write the comparison that defines the boolean array directly inside the square brackets, instead of storing it in a variable first. This shortcut is the conventional way to write boolean indexing.

In [7]:
# this creates a copy of our taxi ndarray
taxi_copy = taxi.copy()

total_amount = taxi_copy[:, 13]
total_amount[total_amount < 0] = 0
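
One detail worth spelling out: selecting a column with a slice returns a view, so assigning through that view also changes the array it came from. The sketch below checks this on a small hypothetical array (`sample` and `col` are illustrative names).

```python
import numpy as np

# Hypothetical data: column 1 contains one negative value.
sample = np.array([[1.0, -2.5],
                   [2.0, 10.0]])

col = sample[:, 1]  # basic slicing returns a view, not a copy
col[col < 0] = 0    # boolean assignment through the view

print(sample[:, 1].tolist())          # [0.0, 10.0]
print(np.shares_memory(col, sample))  # True
```

This is why the cell above copies the array first: without `taxi.copy()`, the assignment would modify the original `taxi` ndarray.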

The general pattern for boolean assignment on a 2D ndarray, in pseudocode, is as follows, first using an intermediate variable:

```
bool_array = array[:, column_for_comparison] == value_for_comparison
array[bool_array, column_for_assignment] = new_value
```
and then all in one line:
```
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```
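
A runnable instance of the one-line pattern, assuming a small made-up array in which column 0 is compared and column 1 is assigned:

```python
import numpy as np

# Hypothetical data: set column 1 to 0 wherever column 0 equals 1.
arr = np.array([[1.0, -5.0],
                [2.0, 7.0],
                [1.0, 3.0]])

arr[arr[:, 0] == 1, 1] = 0

print(arr.tolist())  # [[1.0, 0.0], [2.0, 7.0], [1.0, 0.0]]
```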

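Next, we create a new copy of our taxi dataset, taxi_modified, with an additional column containing the value 0 for every row. We then set that column to 1 for every row whose pickup location code (column index 5) is one of the airport codes.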

In [8]:
# create a new column filled with '0'
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

# JFK Airport modification
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1

# LaGuardia Airport modification
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1

# Newark Airport modification
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]
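
The three airport assignments can also be collapsed into one. The sketch below uses `np.isin()`, which is a swapped-in alternative and not what the cell above does. The array `trips` is hypothetical, with column 0 as the location code and column 1 as the flag.

```python
import numpy as np

# Hypothetical data: location code in column 0, flag column (all zeros) in column 1.
trips = np.array([[2.0, 0.0],
                  [4.0, 0.0],
                  [3.0, 0.0],
                  [5.0, 0.0]])

# np.isin returns True wherever the code is one of the listed values.
airport_bool = np.isin(trips[:, 0], [2, 3, 5])
trips[airport_bool, 1] = 1

print(trips[:, 1].tolist())  # [1.0, 0.0, 1.0, 1.0]
```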


## Challenge 1: Which is the most popular airport?

To do that, we'll use boolean indexing to create three filtered arrays and then look at how many rows are in each array. We'll need to check if the __dropoff_location_code__ column (column index 6) is equal to one of the following values:

- 2: JFK Airport
- 3: LaGuardia Airport
- 5: Newark Airport

In [9]:
# JFK Airport drop offs
jfk = taxi[taxi[:, 6] == 2]
jfk_count = jfk.shape[0]

# LaGuardia Airport drop offs
laguardia = taxi[taxi[:, 6] == 3]
laguardia_count = laguardia.shape[0]

# Newark Airport drop offs
newark = taxi[taxi[:, 6] == 5]
newark_count = newark.shape[0]

In [10]:
print(jfk_count, laguardia_count, newark_count)

11832 16602 63


Looks like LaGuardia Airport has the most drop offs in our dataset.
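
If only the counts are needed, the filtered arrays are optional: summing a boolean array counts its True values. A sketch on hypothetical drop-off codes:

```python
import numpy as np

# Hypothetical drop-off location codes.
dropoff = np.array([2, 3, 3, 5, 1, 3])

# True counts as 1 and False as 0, so .sum() gives the number of matches.
counts = {code: int((dropoff == code).sum()) for code in (2, 3, 5)}

print(counts)  # {2: 1, 3: 3, 5: 1}
```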

## Challenge 2: Calculating Statistics for Trips on Clean Data

We need to remove bad data from our dataset and calculate some descriptive statistics on the remaining "clean" data.

We'll start by using boolean indexing to remove any rows that have an average speed for the trip greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two missions. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

- trip_distance, at column index 7
- trip_length, at column index 8
- total_amount, at column index 13

In [11]:
trip_mph = taxi[:, 7] / (taxi[:, 8] / 3600)

# Create a new ndarray containing only rows where trip_mph is less than 100
cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()
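
The same steps can be checked by hand on three made-up trips. This sketch assumes, as the division by 3600 above implies, that trip length is stored in seconds. The array `trips` and its column order are hypothetical.

```python
import numpy as np

# Hypothetical trips: distance (miles), length (seconds), total amount.
trips = np.array([[10.0, 1800.0, 30.0],  # 10 miles in 30 minutes -> 20 mph
                  [5.0, 60.0, 20.0],     # 5 miles in 1 minute -> 300 mph (bad data)
                  [2.0, 600.0, 12.0]])   # 2 miles in 10 minutes -> 12 mph

mph = trips[:, 0] / (trips[:, 1] / 3600)

# Keep only the rows with a plausible average speed.
clean = trips[mph < 100]

print(clean.shape[0])       # 2
print(clean[:, 0].mean())   # 6.0
print(clean[:, 2].mean())   # 21.0
```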

### Conclusion

In this mission we learned:

- How to use numpy.genfromtxt() to read a file into an ndarray.
- About NaN values.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.