
<img width="60" src="https://drive.google.com/uc?export=view&id=1JQRWCUpJNAvselJbC_K5xa5mcKl1gBQe"> 



In [1]:
# Uploading files from your local file system

from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving nyc_taxis.csv to nyc_taxis.csv
User uploaded file "nyc_taxis.csv" with length 5145963 bytes



# 1.0 Boolean Indexing with NumPy

## 1.1 Reading CSV files with NumPy

In the previous section we learned how to use NumPy and ndarrays to perform vectorized operations to work with data. We learned that NumPy makes it quick and easy to make selections of our data, and includes a number of functions and methods that make it easy to calculate statistics across the different axes (or dimensions).

Using the skills we've learned so far, we were able to select subsets of our taxi trip data and then calculate things like the maximum, minimum, sum, and mean of various columns and rows. But what if we wanted to find out how many trips were taken in each month? Or which airport is the busiest? For this we will need a new technique: **Boolean Indexing.**

In the previous section, we used Python's built-in [csv module](https://docs.python.org/3/library/csv.html) to import our CSV as a 'list of lists' and used loops to convert each value to a float before we created our NumPy ndarray. Now that we understand NumPy a little better, let's learn about the [numpy.genfromtxt() function](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt) to read in files.

The **numpy.genfromtxt()** function reads a text file into a NumPy ndarray. While it has over 20 parameters, for most cases you need only two. Here is the simplified syntax for the function, and an explanation of the two parameters:

```python
np.genfromtxt(filename, delimiter=None)
```

- **filename** - A positional argument, usually a string representing the path to the text file to be read.
- **delimiter** - A named argument, specifying the string used to separate each value.

In this case, because we have a CSV file, the delimiter is a comma. Let's look at what the code would look like to read in the **nyc_taxis.csv** file.

```python
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
print(taxi)
```

The output of this code is shown below:

```python
[[   nan    nan    nan ...,    nan    nan    nan]
 [  2016      1      1 ...,  11.65  69.99      1]
 [  2016      1      1 ...,      8   54.3      1]
 ..., 
 [  2016      6     30 ...,      5  63.34      1]
 [  2016      6     30 ...,   8.95  44.75      1]
 [  2016      6     30 ...,      0  54.84      2]]
```

When **numpy.genfromtxt()** reads in a file, it converts every value to a single data type. Unless we specify otherwise with the **dtype** parameter, that type is float. We can use the **ndarray.dtype** attribute to see the internal datatype that has been used.

```python
>>> taxi.dtype

    float64
```

NumPy uses the **float64** type by default, which allows most of the values from our CSV to be read. You can think of NumPy's **float64** type as being equivalent to Python's float type (the **'64'** refers to the number of bits used to store the underlying value).
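As a quick check of that comparison, here is a minimal sketch (not part of the original lesson; the sample value is arbitrary) showing that a **float64** occupies 64 bits and behaves like a Python float:

```python
import numpy as np

value = np.float64(1.5)  # arbitrary sample value

# float64 stores each value in 8 bytes, i.e. 64 bits
bits = np.dtype(np.float64).itemsize * 8
print(bits)

# np.float64 is a subclass of Python's built-in float
is_python_float = isinstance(value, float)
print(is_python_float)
```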

The first row of our data contains a value that we haven't seen before: **nan**. **NaN** is an acronym for **Not a Number**. The concept of NaN is an unusual one at first - it literally means that the value cannot be stored as a number. It is similar to (and often referred to interchangeably as) a null value, like Python's [None constant](https://docs.python.org/3.4/library/constants.html#None).
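The short sketch below, using a made-up three-element array rather than the taxi data, illustrates two properties of NaN that are easy to trip over: it lives happily inside a float array, and it never compares equal to anything, so **numpy.isnan()** is the usual way to detect it.

```python
import numpy as np

# made-up array with one missing value
values = np.array([1.0, np.nan, 3.0])

# NaN is a special floating-point value, so the array is still float64
print(values.dtype)

# NaN does not compare equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)

# np.isnan() marks the positions that hold NaN
nan_positions = np.isnan(values)
print(nan_positions)
```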

NaN is most commonly seen when a value is missing, but in this case we have NaN because the first line from our CSV file contains the names of each column. As we mentioned in the previous section, NumPy ndarrays can contain only one type. NumPy is unable to convert string values like **pickup_year** into the **float64** data type. Later in this course we'll talk about NaN some more in the context of missing values. For now, we need to remove this row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:

```python
taxi = taxi[1:]
```

This removes the first row from the array. Alternatively, we can pass an additional parameter, **skip_header**, to the **numpy.genfromtxt()** function. The **skip_header** parameter accepts an integer: the number of rows to skip from the start of the file. Because this is a number of rows and not an index, skipping the first row requires a value of **1**, not **0**.
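To see both approaches side by side, here is a small sketch that reads a made-up three-line CSV from memory instead of **nyc_taxis.csv**. The column names and values are illustrative only, and `io.StringIO` is used just so the example runs without a file on disk.

```python
import io
import numpy as np

# hypothetical CSV text standing in for a real file
csv_text = "pickup_year,pickup_month,fare\n2016,1,11.65\n2016,6,8.95\n"

# without skip_header, the header line becomes a row of nan
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(with_header.shape)

# skip_header=1 skips the first line while reading
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(no_header.shape)

# slicing off the first row afterwards gives the same array
same_result = np.array_equal(with_header[1:], no_header)
print(same_result)
```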

**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Import the **NumPy** library.
2. Use the **numpy.genfromtxt()** function to read the **nyc_taxis.csv** file into NumPy, skipping the first row, and assign the result to **taxi**.



In [0]:
# put your code here

## 1.2 Boolean Arrays

In the last sections we mentioned five ways to index, or select, data from ndarrays:

- An **integer**, indicating a specific location.
- A **slice**, indicating a range of locations.
- A **colon**, indicating every location.
- A **list of values**, indicating specific locations.
- A **boolean array**, indicating specific locations.

In this section we're going to focus on the last and arguably the most powerful method, the boolean array. A boolean array, as the name suggests, is an array full of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

Let's take a moment to refresh our understanding of what a boolean value is. The boolean (or **bool**) type is a built-in Python type that can contain one of two unique values:

- True
- False


Boolean values can be defined either by **'hard-coding'** them with the keywords **True** or **False**, or by using any of the Python comparison operators, like **== (equal), > (greater than), < (less than), != (not equal)**. They're commonly seen within if statements, like the example below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1cltwKCELwoqrOzBNU7AIDsJ083ykMhPL">

As the code is executed the boolean operation is evaluated, causing the print function to run. We can use the console to perform simple boolean operations as well:

```python
>>> type(3.5) == float
    True
>>> 3 < 10
    True
>>> "hello" == "goodbye"
    False
>>> 5 > 6
    False
>>> (3 + 3) != 5
    True
```

When we explored vector math in the first section, we learned that an operation between an ndarray and a scalar (individual) value results in a new ndarray:


```python
>>> np.array([2,4,6,8]) + 10

    array([12, 14, 16, 18])
```

The **+ 10** operation is applied to each value in the array.

Now, let's look at what happens when we perform a boolean operation between an ndarray and a scalar:

```python
>>> np.array([2,4,6,8]) < 5

    array([ True,  True, False, False], dtype=bool)
```

A similar pattern occurs: the 'less than five' operation is applied to each value in the array. The diagram below shows this step by step:


<img width="600" src="https://drive.google.com/uc?export=view&id=1QINhkJfEHn-CXbppP-x-RklfQCxfKrxg">

Let's practice using vectorized boolean operations to create some boolean arrays.

**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Use vectorized boolean operations to:
  - Evaluate whether the elements in array **a** are less than **3** and assign the result to **a_bool**.
  - Evaluate whether the elements in array **b** are equal to **"blue"** and assign the result to **b_bool**.
  - Evaluate whether the elements in array **c** are greater than **100** and assign the result to **c_bool**.

In [0]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# put your code here


## 1.3 Boolean Indexing with 1D ndarrays

Now we know what a boolean array is and how to create one using vectorized boolean operations. The last piece of the puzzle is understanding how to index (or select) using boolean arrays. This is known as boolean indexing. Let's use one of the examples from the previous section.

<img width="600" src="https://drive.google.com/uc?export=view&id=1nNX9HUvygkpb2_GowE6t3QPtozu4TaX6">


To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:

<img width="600" src="https://drive.google.com/uc?export=view&id=1HruGF2TejcaPODJP0PvLqNj2g9qNoRVQ">

The boolean array acts as a filter: the values that correspond to **True** become part of the resultant ndarray, whereas the values that correspond to **False** are removed.
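In code, the filtering step can be sketched as follows, reusing the array from the earlier 'less than five' comparison (the variable names are chosen for illustration):

```python
import numpy as np

numbers = np.array([2, 4, 6, 8])

# vectorized comparison: one boolean per element
numbers_bool = numbers < 5

# indexing with the boolean array keeps only the True positions
result = numbers[numbers_bool]
print(result)  # [2 4]
```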

Now, let's look at an example using our **taxi** data. The second column in the ndarray is **pickup_month**. Let's use boolean indexing to create a filtered ndarray containing only items where the value is **1**, which corresponds to January. Once we have done that, we can look at the [ndarray.shape attribute](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.shape.html) for the filtered ndarray, which will tell us the number of taxi rides in our data set from the month of January.

We'll do it step by step, starting with selecting just the **pickup_month** column:

```python
pickup_month = taxi[:,1]
```

Next, we use a boolean operation to make our boolean array:

```python
january_bool = pickup_month == 1
```

Then we use the new boolean array to select only the items from pickup_month that have a value of 1:

```python
january = pickup_month[january_bool]
```

Finally, we use the **.shape** attribute to find out how many items are in our **january** ndarray, which is the number of taxi rides in our data set from the month of January. We'll use **[0]** to extract the value from the tuple returned by **.shape**.

```python
january_rides = january.shape[0]
print(january_rides)

13481
```

There are 13,481 rides in our dataset from the month of January. Let's practice boolean indexing and find out the number of rides in our data set for February and March.
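Before moving on, here is the whole sequence in one place as a sketch on a tiny made-up month column (the values are hypothetical, not taken from the taxi data). It also shows that counting the **True** values in the boolean array directly with **numpy.count_nonzero()** gives the same number as filtering and then checking **.shape**.

```python
import numpy as np

# hypothetical stand-in for the pickup_month column
months = np.array([1, 6, 2, 6, 1, 3, 6])

# boolean array, then boolean indexing, then .shape
june_bool = months == 6
june = months[june_bool]
june_rides = june.shape[0]
print(june_rides)

# counting the True values directly gives the same answer
same_count = np.count_nonzero(june_bool)
print(same_count)
```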

**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Calculate the number of rides in the **taxi** ndarray that are from February:
  - Create a boolean array, **february_bool**, that evaluates whether the items in **pickup_month** are equal to **2**.
  - Use the **february_bool** boolean array to index **pickup_month**, and assign the result to **february**.
  - Use the **ndarray.shape** attribute to find the number of items in **february** and assign the result to **february_rides**.
2. Calculate the number of rides in the **taxi** ndarray that are from March:
  - Create a boolean array, **march_bool**, that evaluates whether the items in **pickup_month** are equal to **3**.
  - Use the **march_bool** boolean array to index **pickup_month**, and assign the result to **march.**
  - Use the **ndarray.shape** attribute to find the number of items in **march** and assign the result to **march_rides**.

In [0]:
# put your code here

## 1.4 Boolean Indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous section. The only limitation is that the boolean array must have the same length as the dimension you're indexing. Let's look at some examples:

<img width="500" src="https://drive.google.com/uc?export=view&id=1jXwHlU2lUX-VHmCTm9brDiTRu8L7yx2t">

Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.
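The following sketch, on a small made-up 4x3 array, walks through the same ideas in code: a boolean array built from one column selects whole rows, it can be combined with a column slice, and a boolean array whose length does not match the indexed dimension is rejected.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# built from one column, so it has one value per row (length 4)
rows_bool = arr[:, 0] > 3

# select whole rows, or rows combined with a column slice
selected_rows = arr[rows_bool]
selected_part = arr[rows_bool, 1:]
print(selected_part)

# a boolean array for the column axis needs one value per column (length 3)
cols_bool = np.array([True, False, True])
selected_cols = arr[:, cols_bool]
print(selected_cols)

# using the length-3 array on the row axis (length 4) fails
try:
    arr[cols_bool]
    length_mismatch_raised = False
except IndexError:
    length_mismatch_raised = True
print(length_mismatch_raised)
```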

Let's look at an example from our taxi trip data. In the previous section, we sorted our ndarray in order to view the trips that had very large average speeds. Boolean indexing makes this much easier:

```python
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)
```

```python
[[     2      2     23      1]
 [     2      2   19.6      1]
 [     2      2   16.7      2]
 [     3      3   17.8      2]
 [     2      2   17.2      2]
 [     3      3   16.9      3]
 [     2      2   27.1      4]]
```

Combining our boolean array with a column slice allowed us to view just the key data of these trips with very high average speeds. As we observed in the previous section, all of these trips have the same pickup and dropoff locations, and last only a few seconds.

Let's use this technique to examine the rows that have the highest values for the **tip_amount** column.


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Create a boolean array, **tip_bool**, that determines which rows have values for the **tip_amount** column of more than **50**.
2. Use the **tip_bool** array to select all rows from **taxi** with tip amounts of more than **50**, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to **top_tips**.

In [0]:
# put your code here

## 1.5 Assigning Values in ndarrays

So far we've learned how to retrieve data from ndarrays, and how to add rows or columns. There is one missing piece to our NumPy fundamentals toolbox: modifying values.

We can use the same indexing techniques we've already learned to assign values within an ndarray. The syntax we'll use (in pseudocode) is:

```python
ndarray[location_of_values] = new_value
```

Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific index location:

```python
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)

['orange' 'blue' 'black' 'blue' 'purple']
```

Or we can assign multiple values at once:

```python
a[3:] = 'pink'
print(a)

['orange' 'blue' 'black' 'pink' 'pink']
```

With a 2D ndarray, just like with a 1D, we can assign one specific index location.

```python
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

We can also assign a whole row...

```python
ones[0] = 42
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

...or a whole column:

```python
ones[:,2] = 0
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
```

Let's practice some array assignment with our taxi dataset.

**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

To help you practice without making changes to our original array, we have used the [ndarray.copy()](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy) method to make **taxi_modified**, a copy of our original for these exercises.
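Why a copy matters can be seen in a minimal sketch on a made-up 2x2 array: plain assignment only gives the same array a second name, while **ndarray.copy()** produces an independent array.

```python
import numpy as np

original = np.array([[1, 2], [3, 4]])

# plain assignment: a second name for the same array
alias = original
# .copy(): an independent array with its own data
duplicate = original.copy()

# changing the copy leaves the original untouched
duplicate[0, 0] = 99
after_copy_edit = original[0, 0]
print(after_copy_edit)

# changing the alias changes the original too
alias[0, 0] = -1
after_alias_edit = original[0, 0]
print(after_alias_edit)
```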


- The value at column index 5 (**pickup_location_code**) of row index 28214 is incorrect. Use assignment to change this value to **1** in the **taxi_modified** ndarray.
- The first column (index 0) contains year values as four-digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the **taxi_modified** ndarray.
- The values at column index 7 (**trip_distance**) of row indexes 1800 and 1801 are incorrect. Use assignment to change these values in the **taxi_modified** ndarray to the mean value for that column.



In [0]:
# put your code here

## 1.6 Assignment Using Boolean Arrays

Boolean arrays become very powerful when we use them for assignment. Let's start by looking at a simple example:

```python
>>> a = np.array([1, 2, 3, 4, 5])

>>> a[a > 2] = 99

>>> print(a)

    [ 1  2 99 99 99]
```

Before we walk through how the code works, we've just seen a 'shortcut' for the first time. The second line of code inserted the definition of the boolean array directly into the selection. This 'shortcut' way is the conventional way to write boolean indexing. Up until now, we've been taking the extra step of assigning to an intermediate variable first so that the process is clear. Let's look at how we would have written the example using the intermediate variable.

```python
>>> a2 = np.array([1, 2, 3, 4, 5])

>>> a2_bool = a2 > 2

>>> a2[a2_bool] = 99

>>> print(a2)

    [ 1  2 99 99 99]
```

You can see that both ways produce the same results. From here on, we will use the shortcut method instead of the intermediate variable. The boolean array controls the values that the assignment applies to, and the other values remain unchanged. Let's look at how this code works:

<img width="600" src="https://drive.google.com/uc?export=view&id=1u8WcLq-TYCIhSFuEa9ElfMFYPBAC_rZZ">


Next, let's look at an example of assignment using a boolean array with two dimensions:

```python
>>> b = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> b[b > 4] = 99

>>> print(b)

    [[ 1  2  3]
     [ 4 99 99]
     [99 99 99]]
```

<img width="600" src="https://drive.google.com/uc?export=view&id=1VPmK9UuV1jvX74-ljHWJT6oE_vkTkS-a">


Lastly, let's look at an example that uses a 1D boolean array to perform assignment on a 2D array:

```python
>>> c = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> c[c[:,1] > 2, 1] = 99

>>> print(c)

    [[ 1  2  3]
     [ 4 99  6]
     [ 7 99  9]]
```


In this example, the **c[:,1] > 2** boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array to specify the rows for assignment, and use the integer **1** to specify the second column. This results in our boolean array only being applied to the second column, with all other values remaining unchanged:

<img width="600" src="https://drive.google.com/uc?export=view&id=1nXvILrVeMLryXgLr_TYLHPdjHVJZxstA">


This pattern, where a 1D boolean array is used to specify assignment in the row dimension and an index value is used to specify which column the assignment applies to, is very common. The pseudocode syntax for this pattern is as follows, first using an intermediate variable:

```python
# named mask rather than bool, to avoid shadowing Python's built-in bool type
mask = array[:, column_for_comparison] == value_for_comparison
array[mask, column_for_assignment] = new_value
```

and then all in one line:

```python
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```
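As a concrete instance of the pattern, here is a sketch on a small made-up array in which the comparison column and the assignment column differ. The column meanings and the value 7 are purely illustrative.

```python
import numpy as np

# hypothetical data: column 0 holds a category code, column 2 holds a flag
data = np.array([[1, 10, 0],
                 [2, 20, 0],
                 [1, 30, 0],
                 [3, 40, 0]])

# compare on column 0, assign into column 2 of the matching rows
data[data[:, 0] == 1, 2] = 7
print(data)
```

Only the rows whose first column equals 1 have their third column changed; every other value is left as it was.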

Let's practice this pattern using our taxi data set:

**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

We have created a new copy of our taxi dataset, **taxi_modified**, with an additional column containing the value 0 for every row.

1. In our new column at index **15**, assign the value **1** if the **pickup_location_code** (column index 5) corresponds to an airport location, leaving the value as 0 otherwise, by performing these three operations:
  - For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [0]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

# create a new column filled with `0`.
zeros = np.zeros([taxi_modified.shape[0], 1])
taxi_modified = np.concatenate([taxi_modified, zeros], axis=1)
print(taxi_modified)

# put your code here

## 1.7 Challenge: Which is the most popular airport?

We'll conclude this section with two challenges. Challenges are designed to help you practice the techniques you've learned in this section.

**Don't be discouraged if these challenge steps take a few attempts to get right. Working with data is an iterative process!**

In this challenge, we want to find out which airport is the most popular destination in our data set. To do that, we'll use boolean indexing and the **dropoff_location_code** column (column index 6) to create three filtered arrays and then look at how many rows are in each array. The values from the column we're interested in are:

- 2 - JFK Airport.
- 3 - LaGuardia Airport.
- 5 - Newark Airport.


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


- Using the original **taxi** ndarray, calculate how many trips had JFK Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to JFK, and assign the result to **jfk**.
  - Calculate how many rows are in the new **jfk** array and assign the result to **jfk_count**.
- Calculate how many trips from **taxi** had LaGuardia Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to LaGuardia, and assign the result to **laguardia**.
  - Calculate how many rows are in the new **laguardia** array and assign the result to **laguardia_count**.
- Calculate how many trips from **taxi** had Newark Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to Newark, and assign the result to **newark**.
  - Calculate how many rows are in the new **newark** array and assign the result to **newark_count**.
- After you have run your code, inspect the values for **jfk_count**, **laguardia_count**, and **newark_count** and see which airport has the most dropoffs.

In [0]:
# put your code here

## 1.8 Challenge: Calculating Statistics for Trips on Clean Data

Our calculations in the previous challenge show that LaGuardia is the most common airport for dropoffs in our data set.

Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining 'clean' data.

We'll start by using boolean indexing to remove any rows that have an average speed for the trip greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two sections. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

- **trip_distance**, at column index 7
- **trip_length**, at column index 8
- **total_amount**, at column index 13
- **trip_mph**, not available as a column but as its own ndarray


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


The **trip_mph** ndarray has been provided for you.

- Create a new ndarray, **cleaned_taxi**, containing only rows for which the values of **trip_mph** are less than 100.
- Calculate the mean of the **trip_distance** column of **cleaned_taxi**, and assign the result to **mean_distance**.
- Calculate the mean of the **trip_length** column of **cleaned_taxi**, and assign the result to **mean_length**.
- Calculate the mean of the **total_amount** column of **cleaned_taxi**, and assign the result to **mean_total_amount.**
- Calculate the mean of the **trip_mph**, excluding values greater than 100, and assign the result to **mean_mph**.

In [0]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# put your code here

In this section we learned:

- How to use **numpy.genfromtxt()** to read in an ndarray.
- About **NaN** values.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.

This is the last section that deals exclusively with NumPy, but it's certainly not the last time we'll use NumPy. As we move on to using pandas, and later in our learning paths other Python data libraries, you'll see that a lot of the concepts we've learned transfer, and you'll also find yourself using a lot of these fundamental NumPy concepts. We'll also use NumPy from time to time to create, transform, and otherwise work with tabular data.

In the next section, we'll start using the pandas library and learn how it compares with NumPy.