# Introduction to NumPy



## Understanding Vectorization

One of the reasons that the Python language is extremely popular is that it makes writing programs easy. When we execute Python code, the Python interpreter converts our code into [bytecode](https://en.wikipedia.org/wiki/Bytecode), a simpler set of instructions that the interpreter then runs on our computer. When you write code in Python, you don't have to worry about things like allocating memory on your computer or choosing how certain operations are done by your computer's processor. Python takes care of that for you.

<img width="500" src="https://drive.google.com/uc?export=view&id=1WSCD15qS89t5di-x-_WjLHwI6-Tj9EeH">

Python is what we call a **high-level language**. High-level languages allow you to write programs faster because the interpreter decides how to execute your instructions. In contrast, when you use **low-level** languages like C, you define exactly how memory will be managed and how the processor will execute your instructions. This means that coding in a **low-level language** takes longer, but you have more ability to optimize your code to run faster.

| Language Type | Example | Time taken to write program | Control over program performance |
|---------------|---------|-----------------------------|----------------------------------|
| High-Level | Python | Low | Low |
| Low-Level | C | High | High |

When choosing between a high and low-level language, you have to make a trade-off between being able to work quickly and having programs that run quickly and efficiently. Luckily, there are two Python libraries that were created to give us the best of both worlds: **NumPy** and **pandas**. Together, pandas and NumPy provide a powerful toolset for working with data in Python. They allow us to write code quickly without sacrificing performance. But how do they do this? What is it that makes these libraries faster than raw Python? The answer is **vectorization**.


**How Vectorization Makes Code Faster**

Let's look at an example where we have two columns of data. Each row contains two numbers we wish to add together. Using just Python, we would use a list of lists structure to store our data, and use for loops to iterate over that data. Let's see what this would look like as Python code:


<img width="800" src="https://drive.google.com/uc?export=view&id=15rYQH5ne_AhjfSzSzsXdV2AdD7LKRsrl">


When this code is run, the Python interpreter will turn our code into bytecode, following the logic of our **for** loop. In each iteration of our loop, the bytecode asks our computer's processor to add the two numbers together and stores the result. The diagram shows the first calculation our computer's processor would make:

<img width="800" src="https://drive.google.com/uc?export=view&id=10qEvWGmvAHbT1NcqW6D8DjmZzPh08_TZ">


Our computer would take eight processor cycles to process the eight rows of our data.

Vectorization takes advantage of a processor feature called **Single Instruction Multiple Data (SIMD)** to process data faster. Most modern computer processors support SIMD. SIMD allows a processor to perform the same operation, on multiple data points, in a single processor cycle. Let's look at how a vectorized version of our code above might be processed using a SIMD instruction that allows four data points to be processed at once:


<img width="800" src="https://drive.google.com/uc?export=view&id=1DY8rZ_TtTrOOmG4qWgaEJcVE4vlJAhJc">

The vectorized version of our code will only take two processor cycles to process our eight rows of data - a four times speed-up. Vectorized operations might process as few as two and as many as hundreds of operations per processor cycle, depending on the capabilities of the processor and the size of each data point.

The good news is that you don't have to worry about SIMD and processor cycles, because NumPy and pandas take care of this for you. We'll introduce pandas in more detail later in this course, but first we're going to learn about NumPy so we understand the fundamentals of working with vectorized operations.

In the next sections, we'll learn:

- How to work with data using NumPy objects.
- How to explore and clean data.
- How to use NumPy to analyze data quickly and efficiently.

Let's get started!



## NYC Taxi-Airport Data

As we learn NumPy, we'll be analyzing taxi trip data released by the city of New York. The city releases data on taxis and for-hire vehicles on the [Taxi and Limousine Commission (TLC) Website](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). There is data on over 1.3 billion individual trips, reaching back as far as 2009, and it is regularly updated.

<center>
<img width="400" src="https://drive.google.com/uc?export=view&id=1CjDo9pWq9g7bKjNH_Xo6cLUj17yf-f2f">
</center>


We'll be working with a subset of this data: Yellow taxi trips to and from New York City airports between January and June 2016. In our dataset, each row represents a unique taxi trip. Below is information about selected columns from the data set:

- **pickup_year** - The year of the trip.
- **pickup_month** - The month of the trip (January is 1, December is 12).
- **pickup_day** - The day of the month of the trip.
- **pickup_location_code** - The airport or borough where the trip started, as one of eight categories:
  - 0 - Bronx.
  - 1 - Brooklyn.
  - 2 - JFK Airport.
  - 3 - LaGuardia Airport.
  - 4 - Manhattan.
  - 5 - Newark Airport.
  - 6 - Queens.
  - 7 - Staten Island.
- **dropoff_location_code** - The airport or borough where the trip finished, using the same eight category codes as **pickup_location_code**.
- **trip_distance** - The distance of the trip in miles.
- **trip_length** - The length of the trip in seconds.
- **fare_amount** - The base fare of the trip, in dollars.
- **total_amount** - The total amount charged to the passenger, including all fees, tolls and tips.

You can find information on all columns in the [dataset data dictionary](https://s3.amazonaws.com/dq-content/289/nyc_taxi_data_dictionary.md).

We have [randomly sampled](https://en.wikipedia.org/wiki/Simple_random_sample) approximately 90,000 trips for our analysis, representing one-fiftieth of the trips for the six-month period. Our data is stored in a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file called **nyc_taxis.csv**. Here are the first 10 rows of the data set (note that some columns were omitted due to space limitations):

| pickup_year | pickup_month | pickup_day | pickup_dayofweek | pickup_time | pickup_location_code | dropoff_location_code | trip_distance | trip_length | fare_amount | total_amount |
|-------------|--------------|------------|------------------|-------------|----------------------|-----------------------|---------------|-------------|-------------|--------------|
| 2016 | 1 | 1 | 5 | 0 | 2 | 4 | 21.00 | 2037 | 52.0 | 69.99 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 1 | 16.29 | 1520 | 45.0 | 54.30 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 12.70 | 1462 | 36.5 | 37.80 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 8.70 | 1210 | 26.0 | 32.76 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 5.56 | 759 | 17.5 | 18.80 |
| 2016 | 1 | 1 | 5 | 0 | 4 | 2 | 21.45 | 2004 | 52.0 | 105.60 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 8.45 | 927 | 24.5 | 32.25 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 7.30 | 731 | 21.5 | 22.80 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 5 | 36.30 | 2562 | 109.5 | 131.38 |
| 2016 | 1 | 1 | 5 | 0 | 6 | 2 | 12.46 | 1351 | 36.0 | 37.30 |


This, however, is what the first few lines of raw data in our CSV look like (we are showing only the first four columns from the file to make the format easier to understand):

```python
pickup_year,pickup_month,pickup_day,pickup_dayofweek
2016,1,1,5
2016,1,1,5
2016,1,1,5
2016,1,1,5
```

To start working with this CSV data in NumPy, we'll first need to import the NumPy library into our Python environment. For this, we use a simple import statement:

```python
import numpy as np
```

We used the **as** syntax in our **import** statement. This allows us to access the NumPy library using another name. When working with NumPy, the convention is to import the library as **np** for brevity.

Next, we'll use Python's built-in [csv module](https://docs.python.org/3/library/csv.html) to import our CSV as a **'list of lists'**.
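As a rough sketch of what that step looks like, the snippet below reads a tiny in-memory CSV rather than the real file. The name `sample_csv` and its contents are stand-ins chosen for illustration (they mirror the first four columns shown above), and `io.StringIO` is used only so the example runs without **nyc_taxis.csv** on disk.

```python
import csv
import io

# Hypothetical stand-in for the first lines of nyc_taxis.csv
# (first four columns only), held in memory instead of on disk.
sample_csv = io.StringIO(
    "pickup_year,pickup_month,pickup_day,pickup_dayofweek\n"
    "2016,1,1,5\n"
    "2016,1,1,5\n"
)

# csv.reader yields one list of strings per line of the file
rows = list(csv.reader(sample_csv))

# the first row is the header; the remaining rows are data
header = rows[0]

# convert every remaining value from string to float
data_list = [[float(item) for item in row] for row in rows[1:]]

print(header)
print(data_list)
```

Note that the csv module reads every value as a string, which is why the conversion to float is a separate step.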

The last step is to convert our list of lists into a NumPy n-dimensional array, or [ndarray](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html). We're going to explain ndarrays in more detail in the next screen, but for now you can think of it as NumPy's version of a list of lists format. To convert from the list type to ndarray, we use the [numpy.array() constructor](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html). Here's an example of how it works:

```python
# our list of lists is stored as data_list
data_ndarray = np.array(data_list)
```

We used the syntax **np.array()** instead of **numpy.array()** because of our **import numpy as np** code. When we introduce a new syntax, we'll always use the full name to describe it, and you'll need to substitute in the shorthand as appropriate.

Let's convert our taxi CSV into a NumPy ndarray!


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

In the code below, we have imported **numpy**, and used Python's **csv** module to import the **nyc_taxis.csv** file and convert it to a **list of lists** containing **float** values.

1. Add a line of code using the **numpy.array()** constructor to convert the **taxi_list_of_lists** variable to a NumPy ndarray. 
2. Assign the result to the variable name **taxi**.

In [1]:
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
# remove the header row
# convert each element to float

taxi_list_of_lists = [[float(item) for item in row]
 for row in list(csv.reader(open("nyc_taxis.csv", "r")))[1:]]

# put your code here

## Understanding NumPy ndarrays

As we mentioned earlier, ndarray stands for 'n-dimensional array'. In programming, array is a term that describes a collection of elements. Even if you haven't heard the term before, you have likely encountered arrays: a list object in Python could be described generically as an array. N-dimensional refers to the fact that ndarrays can have one or more dimensions. Let's look at some visualizations of one, two, and three dimensional arrays and their common names:


<img width="450" src="https://drive.google.com/uc?export=view&id=1Zcmsq84y8NNNNeujYJEO3Nc2gxqkk_m-">



Arrays with more than three dimensions do exist in data science but they're rare. We'll focus on:

- One-dimensional ndarrays (1D ndarrays)
- Two-dimensional ndarrays (2D ndarrays)

Similar to using lists of lists, we use numbers to specify the location of elements of our data that we want to work with. Just like with lists, we call these numbers index values (or collectively, indices).

Unlike with Python lists, every value in an ndarray must be of the same type. For the NYC taxi data set this does not matter, as all the values are float values. We'll talk further about this restriction and how to handle it in a later mission.
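To get a feel for the same-type rule, here is a minimal sketch using a made-up list (`mixed_list` is not part of the taxi data). It relies on the `dtype` attribute, which reports the single type shared by every element of an ndarray.

```python
import numpy as np

# a Python list can happily mix integers and floats
mixed_list = [1, 2.5, 3]

# when converted, NumPy picks one type that can hold every value
mixed_array = np.array(mixed_list)

print(mixed_array)        # the integers are now stored as floats
print(mixed_array.dtype)  # a single dtype describes the whole array
```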

Let's take a look at the data in the taxi variable from the previous screen by printing it using [Python's print() function](https://docs.python.org/3.4/library/functions.html#print):

```python
>>> print(taxi)

    [[ 2016.  1.   1.  ..., 11.65  69.99   1. ]
     [ 2016.  1.   1.  ...,  8.    54.3    1. ]
     [ 2016.  1.   1.  ...,  0.    37.8    2. ]
     ..., 
     [ 2016.  6.  30.  ...,  5.    63.34   1. ]
     [ 2016.  6.  30.  ...,  8.95  44.75   1. ]
     [ 2016.  6.  30.  ...,  0.    54.84   2. ]]
```

At first, this looks identical to a list of lists, with two exceptions:

- Between the third and fourth column of every row there is an ellipsis **(...)**.
- Between the third and fourth row there is another ellipsis.

These ellipses indicate that there is more data in our NumPy ndarray than can easily be printed. NumPy will summarize any ndarray we print if it contains more than 1000 elements. If we want to see how many rows and columns are in our ndarray, we can use the [ndarray.shape attribute](http://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape). If you like, you can run this command yourself to see the result.

```python
>>> taxi.shape
    (89560, 15)
```

The output of the ndarray.shape attribute gives us a few important pieces of information:

- There are two numbers, which tells us that our ndarray is two-dimensional.
    - Note: the data type returned is called a [tuple](https://docs.python.org/3.6/library/stdtypes.html#tuples). Tuples are very similar to Python lists, but are immutable (can't be modified). Tuples are defined and displayed using parentheses **()** rather than brackets **[]**.
- The first number tells us that the first dimension is 89,560 items long, or put another way that there are 89,560 rows in our data set.
- The second number tells us that the second dimension is 15 items long, or put another way that there are 15 columns in our data set.
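The same reading of **shape** can be tried on a small array. This is only a sketch: `small` is a hypothetical 3 x 4 array, not the taxi data.

```python
import numpy as np

# hypothetical array with 3 rows and 4 columns
small = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8],
                  [9, 10, 11, 12]])

# shape is a tuple, so it can be unpacked into separate names
n_rows, n_cols = small.shape

print(small.shape)       # (rows, columns)
print(len(small.shape))  # how many numbers are in the tuple = number of dimensions
```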

If we just want to select a number of rows from an ndarray, we can use slicing, just like we would with a list of lists. Here's how we would print the first five rows:

```python
>>> print(taxi[:5])

    [[ 2016  1  1  5  0  2  4  21    2037  52.   0.8  5.54  11.65  69.99   1  ]
     [ 2016  1  1  5  0  2  1  16.29  1520  45.   1.3  0     8    54.3    1  ]
     [ 2016  1  1  5  0  2  6  12.7   1462  36.5  1.3  0     0    37.8    2  ]
     [ 2016  1  1  5  0  2  6   8.7   1210  26.   1.3  0     5.46  32.76   1  ]
     [ 2016  1  1  5  0  2  6   5.56   759  17.5  1.3  0     0    18.8    2  ]]
```

You'll notice that because we have fewer than 1000 items in our output, NumPy does not summarize the data and we can see all 15 columns (although they're harder to see because each wraps onto a new line).

Let's practice making a slice of multiple rows of our ndarray.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

This exercise uses the **taxi** variable we created in the previous section.

1. Select the first ten rows of the **taxi** ndarray, and assign the result to a new variable **taxi_ten**.
2. Use **Python's print()** function to display **taxi_ten**.

In [12]:
# put your code here

## Selecting and Slicing Rows and Items from ndarrays

Let's look at a comparison between working with ndarrays and lists of lists to select one or more rows of data:

<img width="600" src="https://drive.google.com/uc?export=view&id=1jfWR9J9dsX2WhqSEnsTJMy_8x0NldSRT">


Just like we saw in the previous screen, selecting rows from ndarrays looks very similar to selecting rows from lists of lists. In reality, what we're seeing is a shortcut of sorts. For any two-dimensional array, the full syntax for selecting data is:

```python
ndarray[row,column]

# or if you want to select all
# columns for a given set of rows
ndarray[row]
```

Where row defines the location along the row axis and column defines the location along the column axis. Both row and column can be one of the following:

- An **integer**, indicating a specific location, e.g. **ndarray[3,0]**.
- A **slice**, indicating a range of locations, e.g. **ndarray[0:5,6:]**.
- A **colon**, indicating every location, e.g. **ndarray[:,2]**.
- A **list of values**, indicating specific locations, e.g. **ndarray[[0,1,3,4],0]**.
- A **boolean array**, indicating specific locations - we'll look at this method in detail later.
- Or any combination of the above.
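The selection styles listed above can be tried on a small made-up array. In this sketch, `arr` is a hypothetical 6 x 8 array built so that each value equals `row * 10 + column`, which makes it easy to see which locations were picked. Boolean arrays are left out, since they are covered later.

```python
import numpy as np

# hypothetical 6x8 array: the value at [row, column] is row * 10 + column
arr = np.arange(6).reshape(6, 1) * 10 + np.arange(8)

single_item = arr[3, 0]              # integer row, integer column
block = arr[0:5, 6:]                 # slice of rows, slice of columns
one_column = arr[:, 2]               # colon selects every row of column 2
picked_rows = arr[[0, 1, 3, 4], 0]   # list of row indices, column 0

print(single_item)
print(block.shape)
print(one_column)
print(picked_rows)
```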

This is how we select a single item from a 2D ndarray:


<img width="600" src="https://drive.google.com/uc?export=view&id=1PZX6Ba54H6UM7NfnlyMkyoff5ZgMaSl9">


With a list of lists, we use two separate pairs of square brackets back-to-back. With a NumPy ndarray, we use a single pair of brackets with comma separated row and column locations.
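Here is a minimal side-by-side sketch of the two syntaxes, using a hypothetical 3 x 3 data set (`data_list` and `data_ndarray` hold the same made-up values).

```python
import numpy as np

data_list = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
data_ndarray = np.array(data_list)

# list of lists: two pairs of brackets, first the row, then the item
item_from_list = data_list[1][2]

# ndarray: one pair of brackets with comma-separated row and column
item_from_ndarray = data_ndarray[1, 2]

print(item_from_list, item_from_ndarray)
```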

Let's practice selecting one row, multiple rows, and single items from our **taxi** ndarray.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. From the **taxi** ndarray:
  - Select the row at index 0 and assign it to **row_0**.
  - Select every column for the rows at indexes 391 to 500 inclusive and assign them to **rows_391_to_500**.
  - Select the item at row index 21 and column index 5 and assign it to **row_21_column_5**.



In [18]:
# put your code here

## Selecting Columns and Custom Slicing ndarrays

Let's continue by learning how to select one or more columns of data:


<img width="550" src="https://drive.google.com/uc?export=view&id=1SMRvKH2kCSLpdANtxt4XZvSxomE0QgP5">

With a list of lists, we need to use a for loop to extract specific column(s) and append them back to a new list. With ndarrays, the process is much simpler. We again use single brackets with comma separated row and column locations, but we use a colon **(:)** for the row locations. This colon acts as a wildcard, and gives us all items in that dimension, or in other words all rows.
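As a sketch of that difference, the code below selects a column both ways from a small hypothetical data set (the variable names are made up for this example).

```python
import numpy as np

data_list = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
data_ndarray = np.array(data_list)

# list of lists: loop over the rows and collect the item at column index 1
second_col_list = []
for row in data_list:
    second_col_list.append(row[1])

# ndarray: colon for "all rows", then the column index
second_col_ndarray = data_ndarray[:, 1]

# several columns at once: pass a list of column indices
cols_0_and_2 = data_ndarray[:, [0, 2]]

print(second_col_list)
print(second_col_ndarray)
print(cols_0_and_2)
```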

If we wanted to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

<img width="550" src="https://drive.google.com/uc?export=view&id=1ywqJGXCPuLD17sTVg8f_eIJHD4hiHM2D">

Lastly, if we wanted to select a 2D slice, we can use slices for both dimensions:


<img width="550" src="https://drive.google.com/uc?export=view&id=1ag7hqo_71kgwLbhpo74rwQyKRA4Xjwpk">
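The partial 1D and 2D slices described above can be sketched on a small hypothetical array. Here `grid` is a made-up 4 x 5 array holding the numbers 0 to 19, so each selection is easy to check by eye.

```python
import numpy as np

# hypothetical 4x5 array: row 0 is 0..4, row 1 is 5..9, and so on
grid = np.arange(20).reshape(4, 5)

# 1D slice of a row: single row index, slice of columns
part_of_row = grid[2, 1:4]

# 1D slice of a column: slice of rows, single column index
part_of_col = grid[1:3, 4]

# 2D slice: slices for both dimensions
block = grid[1:3, 1:4]

print(part_of_row)
print(part_of_col)
print(block)
```

As with list slicing, the end of each slice is exclusive, so `1:4` covers indexes 1, 2, and 3.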


Let's practice everything we've learned so far to perform some more complex selections using NumPy.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. From the **taxi** ndarray:
  - Select every row for the columns at indexes 1, 4, and 7 and assign them to **columns_1_4_7**.
  - Select the columns at indexes 5 to 8 inclusive for the row at index 99 and assign them to **row_99_columns_5_to_8**.
  - Select the rows at indexes 100 to 200 inclusive for the column at index 14 and assign them to **rows_100_to_200_column_14**.

In [23]:
# put your code here

## Vector Math

The examples in the previous two sections showed us how much easier it is to select data using NumPy ndarrays. Beyond this, the selection we are making is a lot faster when working with vectorized operations. To illustrate this, we've created a random 5,000,000 x 5 NumPy ndarray and an equivalent list of lists, along with a function that selects the second and third columns of each:

- **python_subset()**
- **numpy_subset()**



In [35]:
import numpy as np

# create random (5000000,5) numpy arrays and 
# list of lists
np_array = np.random.rand(5000000,5)
list_array = np_array.tolist()

def python_subset():
    filtered_cols = []
    for row in list_array:
        filtered_cols.append([row[1],row[2]])
    return filtered_cols

def numpy_subset():
    return np_array[:,1:3]



We'll use the special IPython [%timeit](http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) magic command (in its cell form, **%%timeit**) to time repeated runs of each function:

In [54]:
%%timeit -r 2 -n 10
# the number of executions will be n * r

list_of_list = python_subset()

1.32 s ± 22.2 ms per loop (mean ± std. dev. of 2 runs, 10 loops each)


In [56]:
%%timeit -r 2 -n 10
# the number of executions will be n * r

numpy_array = numpy_subset()

703 ns ± 276 ns per loop (mean ± std. dev. of 2 runs, 10 loops each)


Our NumPy version was over $10^6$ times quicker than the list of lists version: the list of lists result is reported in seconds, while the NumPy result is reported in nanoseconds.
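One caveat when reading these timings: a basic NumPy slice returns a *view* onto the original array's memory instead of copying the values, and that is likely a large part of why the NumPy selection is so quick here. The sketch below checks this with `np.shares_memory()`, using a much smaller array than the one timed above so that it runs instantly.

```python
import numpy as np

# same idea as np_array above, but much smaller
np_array = np.random.rand(1000, 5)

# basic slicing: no values are copied, the result points at np_array's data
subset = np_array[:, 1:3]
print(np.shares_memory(np_array, subset))

# an explicit copy has its own memory
copied = subset.copy()
print(np.shares_memory(np_array, copied))
```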

When we first talked about vectorized operations, we used the example of adding two columns of data. With data in a list of lists, we'd have to construct a for-loop and add each pair of values from each row individually. To refresh your memory, here's what our example code looked like:

```python
my_numbers = [
              [6, 5],
              [1, 3],
              [5, 6],
              [1, 4],
              [3, 7],
              [5, 8],
              [3, 5],
              [8, 4]
             ]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
```

At the time, we only talked about how vectorized operations make this faster; however, they also make the code itself much simpler. We'll break this down into three steps:

- Convert our data to an ndarray,
- Select each column,
- Add the columns.

Let's look at what that looks like in code:

```python
# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)

# select each of the columns - the result
# of each will be a 1D ndarray
col1 = my_numbers[:,0]
col2 = my_numbers[:,1]

# add the two columns
sums = col1 + col2
```

We could simplify this further if we wanted to:

```python
sums = my_numbers[:,0] + my_numbers[:,1]
```

Here are some key observations about this code:

- When we selected each column, we used the syntax **ndarray[:,c]** where **c** is the column index we wanted to select. Like we saw in the previous screen, the colon acts as a wildcard and selects all rows.
- To add the two 1D ndarrays, **col1** and **col2** (which sometimes would be called **vectors** in this context), we simply use the addition operator **(+)** between them.
- The result of adding two 1D vectors is a 1D vector of the same shape (or dimensions) as the original.


Here's what happened behind the scenes:


<img width="600" src="https://drive.google.com/uc?export=view&id=14pACrFQpoxcFg9esh3CyqrKTASHfjmvm">



What we just did, adding two columns (or vectors) together, is called **vector math**. When we're performing vector math on two one-dimensional vectors, both vectors must have the same shape. We can use any of the standard [Python numeric operators](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) to perform vector math:

- **vector_a + vector_b** - Addition
- **vector_a - vector_b** - Subtraction
- **vector_a \* vector_b** - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
- **vector_a / vector_b** - Division
- **vector_a % vector_b** - Modulus (find the remainder when **vector_a** is divided by **vector_b**)
- **vector_a \*\* vector_b** - Exponent (raise **vector_a** to the power of **vector_b**)
- **vector_a // vector_b** - Floor Division (divide **vector_a** by **vector_b**, rounding down to the nearest integer)
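To see a few of these operators in action, here is a small sketch using two made-up vectors of the same shape. Each operator is applied to one pair of values at a time.

```python
import numpy as np

# two hypothetical 1D vectors with the same shape
vector_a = np.array([6, 1, 5, 8])
vector_b = np.array([5, 3, 6, 4])

added = vector_a + vector_b        # 6+5, 1+3, 5+6, 8+4
multiplied = vector_a * vector_b   # element by element, not a dot product
remainder = vector_a % vector_b    # remainder of each division
floor_div = vector_a // vector_b   # each division rounded down

print(added)
print(multiplied)
print(remainder)
print(floor_div)
```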

Let's look at an example from our taxi dataset. Here are the first five rows of two of the columns in the data set:

| trip_distance | trip_length |
|---------------|-------------|
| 21.00 | 2037.0 |
| 16.29 | 1520.0 |
| 12.70 | 1462.0 |
| 8.70 | 1210.0 |
| 5.56 | 759.0 |


Let's use these columns to calculate the average travel speed of each trip in miles per hour. The formula for calculating miles per hour is:

$$
\textrm{miles per hour} = \textrm{distance in miles} \div \textrm{length in hours}
$$

As we learned in the second screen of this mission, **trip_distance** is expressed in miles, and **trip_length** is in seconds, so our first step is converting **trip_length** into hours. Here's how we would do it:

```python
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
```

Here we have a different example of vector math. We've divided a vector (one-dimensional array) by a scalar (single number). In this case, each value in the vector gets divided by the scalar to form the result.

From here, let's perform vector division again to calculate the miles per hour.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Use vector division to divide **trip_distance_miles** by **trip_length_hours**, assigning the result to **trip_mph**.
2. After you have run your code, inspect the contents of the new **trip_mph** variable.

In [58]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

# put your code here

## Arithmetic Numpy Functions

To make the calculations in the previous section, we used operators like the **/** symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations - **arithmetic functions**. Let's look at how we would write the exercise from the previous screen with the equivalent [numpy.divide](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.divide.html) function:

```python
# using the `/` operator:
trip_mph_1 = trip_distance_miles / trip_length_hours

# using the `numpy.divide()` function:
trip_mph_2 = np.divide(trip_distance_miles,trip_length_hours)
```

The variables **trip_mph_1** and **trip_mph_2** will be identical.
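If you'd like to check this for yourself without loading the full data set, the sketch below repeats the comparison on the first three trips from the table shown earlier. `np.array_equal()` is used here as one way to compare two arrays element by element.

```python
import numpy as np

# first three trips from the table shown earlier
trip_distance_miles = np.array([21.00, 16.29, 12.70])
trip_length_hours = np.array([2037.0, 1520.0, 1462.0]) / 3600

# operator form and function form of the same division
trip_mph_1 = trip_distance_miles / trip_length_hours
trip_mph_2 = np.divide(trip_distance_miles, trip_length_hours)

# True when both arrays have the same shape and the same values
print(np.array_equal(trip_mph_1, trip_mph_2))
```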

As you become more familiar with NumPy (and later, pandas), you'll find that there is often more than one way to do the same thing. Most of the time, which you choose is up to you. The general rule with situations like these is to choose the one that makes your code easier to read, which will pay dividends both as you start working with data in teams, and when you have to refer back to code you wrote some time ago. You will find that for these arithmetic operations, it's much more common to use the built-in Python operators than the functions.

As you start to feel more comfortable with these libraries, you should start exploring the documentation. This is useful because it builds out your knowledge of available functions and methods, but also because it gets you used to reading the documentation. It's not possible to remember the syntax for every variation of every data science library, but if you remember what is possible, and can read the documentation, you'll always be able to quickly refamiliarize yourself with some syntax whenever you need it.

You may have noticed that when we mention a function or method for the first time, we'll link to the documentation for it. Take a moment now to click the link for the [**numpy.divide()**](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.divide.html) function from the first paragraph of this screen and look at the documentation. It may seem a little overwhelming at first, but it is well worth your time.

You might like to also take a look at all of the [arithmetic functions from the NumPy documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html#arithmetic-operations).

## Calculating Statistics For 1D ndarrays

Earlier, we created **trip_mph**, a 1D ndarray of the average mile-per-hour speed of each trip in our dataset, based off the **trip_length** and **trip_distance** columns. We might like to explore this data further, for instance working out what the maximum and minimum values are for that ndarray.

We could use the built-in Python functions **min()** and **max()** to make these calculations; however, these will perform calculations without taking advantage of vectorization. Instead, we can use NumPy's ndarray methods to calculate statistics.

To calculate the minimum value of a 1D ndarray, we use the vectorized [ndarray.min()](http://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html) method, like so:


```python
>>> mph_min = trip_mph.min()

>>> mph_min
    0.0
```

The minimum value in our **trip_mph** ndarray is **0.0**, for a trip that didn't travel any distance at all.

Before we look at other array methods, let's take a moment to clarify the difference between **methods** and **functions**. Functions act as standalone segments of code that usually take an input, perform some processing, and return some output. When we're working with Python lists, we can use the **len()** function to calculate the length of a list, but if we're working with Python strings, we can also use **len()**. In this case, it calculates the number of characters (or length) of the string.

```python
>>> my_list = [21,14,91]
>>> len(my_list)
    3

>>> my_string = 'Natal'
>>> len(my_string)
    5
```

In contrast, methods are special functions that belong to a specific type of object. Python lists have a **list.append()** method that we can use to add an item to the end of a list. If we try to use that method on a string, we will get an error:

```python
>>> my_list.append(21)

>>> my_string.append(' is the best!')

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'append'
```

When you're learning NumPy, this can get confusing, because sometimes there are operations that are implemented as both methods and functions, but sometimes there are not. Let's look at some examples:

| Calculation | Function Representation | Method Representation |
|------------------------------------------------|-------------------------|-----------------------------------|
| Calculate the minimum value of **trip_mph** | np.min(trip_mph) | trip_mph.min() |
| Calculate the maximum value of **trip_mph** | np.max(trip_mph) | trip_mph.max() |
| Calculate the mean average value of **trip_mph** | np.mean(trip_mph) | trip_mph.mean() |
| Calculate the median average value of **trip_mph** | np.median(trip_mph) | There is no ndarray median method |


To remember the right terminology, anything that starts with np (e.g. **np.mean()**) is a function and anything you express with an object (or variable) name first (e.g. **trip_mph.mean()**) is a method. As we discussed in the previous section, where both exist it's up to you which you use, but it's much more common to see the method approach, and that's the one we'll use moving forward.

NumPy ndarrays have methods for many different calculations. A few key methods are:
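Here is a short sketch of the function and method forms side by side. The `speeds` array holds made-up values, not the real **trip_mph** data, and `hasattr()` is used only to confirm that ndarrays have no median method.

```python
import numpy as np

# hypothetical speeds in miles per hour
speeds = np.array([37.1, 38.6, 31.3, 25.9, 26.4])

# function form and method form of the same calculation
min_from_function = np.min(speeds)
min_from_method = speeds.min()
print(min_from_function, min_from_method)

# median exists only as a function
print(np.median(speeds))
print(hasattr(speeds, "median"))
```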

- [ndarray.min()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) to calculate the minimum value
- [ndarray.max()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html) to calculate the maximum value
- [ndarray.mean()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean) to calculate the mean average value
- [ndarray.sum()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum) to calculate the sum of the values

You can see a full list of ndarray methods in the NumPy ndarray [documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation).

Let's use the methods we've just learned about to calculate the smallest, largest, and mean average speed from our **trip_mph** ndarray.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Use the [ndarray.max()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html) method to calculate the maximum value of **trip_mph** and assign the result to **mph_max**. Tip: see also **ndarray.argmax()**.
2. Use the [ndarray.mean()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean) method to calculate the average value of **trip_mph** and assign the result to **mph_mean**.

In [67]:
# put your code here

## Calculating Statistics For 2D ndarrays

Looking at the result of the code in the previous screen, you would have observed:

- Minimum trip speed: 0 mph
- Average (mean) trip speed (rounded): 32 mph
- Maximum trip speed (rounded): 82,000 mph

While it's easy to imagine a case where the trip speed is 0 mph (a trip that starts and ends without traveling any distance), a trip speed of 82,000 mph is definitely not possible in New York traffic; that's almost 20x faster than the fastest plane in the world! This could be due to an error in the devices that record the data, or perhaps errors made somewhere in the data pipeline. We'll spend some time later in this mission looking into the data that gave us this unrealistic number.

For now, we're going to look at how we can calculate statistics for two-dimensional ndarrays. If we use the methods without additional parameters, they will return a single value, just like they do with a 1D array:

<img width="500" src="https://drive.google.com/uc?export=view&id=1hOKRuh4eN2_ZeiDOMT5ZwEz8syqMlB2X">


But what if we wanted to find the maximum value of each row? For that, we need to use the **axis** parameter, and specify a value of **1**, which indicates we want to calculate values for each row.

<img width="500" src="https://drive.google.com/uc?export=view&id=151in-86Grb_igmMjTnMAHsuqu3XrxFiH">

If we want to find the maximum value of each column, we use an **axis** value of **0**:


<img width="550" src="https://drive.google.com/uc?export=view&id=1tTSmu1P6FlOADhVrAJIL8ydgbX_ToT2m">


To help you remember which is which, you can think of the first axis as rows and the second axis as columns, in the same way that we use **ndarray[row,column]** when indexing a 2D NumPy array. Then think about which axis you want to apply the method along. The tricky part is remembering that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:


<img width="550" src="https://drive.google.com/uc?export=view&id=11Yyylj-uQYHTDJcY-M5e8PMiiXydK4U8">
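The behaviour of the **axis** parameter can also be sketched in code. The array `arr` below is a made-up example, not the one from the diagrams:

```python
import numpy as np

# a small made-up 2D array: 2 rows, 3 columns
arr = np.array([[1, 5, 2],
                [7, 3, 9]])

overall_max = arr.max()       # single value for the whole array
row_max = arr.max(axis=1)     # one value per row
col_max = arr.max(axis=0)     # one value per column

print(overall_max)  # 9
print(row_max)      # [5 9]
print(col_max)      # [7 5 9]
```

Notice that `axis=1` produces two results (one per row) while `axis=0` produces three (one per column).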


Let's look at an example from our taxi data set. Let's say that we wanted to do some validation, and check that the **total_amount** column is accurate. To remind ourselves of what the data looks like, let's look at the first five rows of columns with indexes 9 through 13:

| fare_amount | fees_amount | tolls_amount | tip_amount | total_amount |
|-------------|-------------|--------------|------------|--------------|
| 52.0 | 0.8 | 5.54 | 11.65 | 69.99 |
| 45.0 | 1.3 | 0.00 | 8.00 | 54.3 |
| 36.5 | 1.3 | 0.00 | 0.00 | 37.8 |
| 26.0 | 1.3 | 0.00 | 5.46 | 32.76 |
| 17.5 | 1.3 | 0.00 | 0.00 | 18.8 |


We want to check whether the first 4 of these columns sum to the 5th column. This is how we would do it:


```python
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]
# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13] 
# select the total_amount column
fare_totals = taxi_first_five[:,13]

# sum the component columns
fare_sums = fare_components.sum(axis=1)

# compare the summed columns to the fare_totals
# (no rounding, so the values print exactly as stored)
print(fare_totals)
print(fare_sums)
```

Our code outputs the following:

```python
[ 69.99  54.3   37.8   32.76  18.8 ]
[ 69.99  54.3   37.8   32.76  18.8 ]
```

We have validated that the **total_amount** column is correct (at least for the first five rows).
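Comparing two printed arrays by eye only works for a handful of rows. As a hedged sketch of how the same check could be done programmatically, the block below rebuilds the five rows from the table above in a hypothetical array named `sample` and compares the sums with `np.allclose()`, which tolerates tiny floating-point rounding differences:

```python
import numpy as np

# the five rows from the table above, columns fare_amount through total_amount
sample = np.array([[52.0, 0.8, 5.54, 11.65, 69.99],
                   [45.0, 1.3, 0.00, 8.00, 54.30],
                   [36.5, 1.3, 0.00, 0.00, 37.80],
                   [26.0, 1.3, 0.00, 5.46, 32.76],
                   [17.5, 1.3, 0.00, 0.00, 18.80]])

# sum the four component columns for each row
component_sums = sample[:, :4].sum(axis=1)
totals = sample[:, 4]

# True only if every row's sum matches its total within a small tolerance
totals_match = np.allclose(component_sums, totals)
print(totals_match)
```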

Now, let's practice calculating the average for each column:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Using a single method, calculate the mean value for each column of **taxi**, and assign the result to **taxi_column_means.**

In [69]:
# put your code here

## Adding Rows and Columns to ndarrays

Earlier in this lesson, we produced an ndarray **trip_mph** of the average speed of each trip. We also observed that the maximum speed was 82,000 mph, which is definitely not an accurate number. To take a closer look at why we might be getting this value, we're going to do the following:

- Add the **trip_mph** as a column to our **taxi** ndarray.
- Sort taxi by **trip_mph**.
- Look at the rows with the highest **trip_mph** from our sorted ndarray to see what they tell us about these large values.


To start, let's learn how to add rows and columns to an ndarray. The technique we're going to use involves the [numpy.concatenate() function](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html). This function accepts:

- A list of ndarrays as the first, unnamed parameter.
- An integer for the **axis** parameter, where 0 will add rows and 1 will add columns.

The **numpy.concatenate()** function requires that each array have the same shape, excepting the dimension corresponding to **axis**. Let's look at an example to understand more precisely how that works. We have two arrays, **ones** and **zeros**:

```python
>>> print(ones)

    [[ 1  1  1]
     [ 1  1  1]]

>>> print(zeros)

    [ 0  0  0]
```

Let's try to use **numpy.concatenate()** to add **zeros** as a row. Because we want to add a row, we use **axis=0**:

```python
>>> combined = np.concatenate([ones,zeros],axis=0)

    Traceback (most recent call last):
      File "stdin", line 1, in module
    ValueError: all the input arrays must have same number of dimensions
```

We've got an error because our dimensions don't match - let's look at the shape of each array to see if we can understand why:

```python
>>> print(ones.shape)

    (2, 3)

>>> print(zeros.shape)

    (3,)
```

Because we're using **axis=0**, our shapes have to match across all dimensions except the first. If we look at these two arrays, we can see that the second dimension of **ones** is 3, but **zeros** doesn't have a second dimension, because it's only a 1D array. This is the source of our error. The table below shows the shapes we need to be able to combine these arrays.


| Object | Current shape | Desired Shape |
|--------|---------------|---------------|
| ones | (2, 3) | (2, 3) |
| zeros | (3,) | (1, 3) |


In order to adjust the shape of **zeros**, we can use the [numpy.expand_dims()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html) function. You might like to follow these steps in the cell. We'll start by passing **axis=0** because we want to convert our 1D array into a 2D array representing a row:


```python
>>> zeros_2d = np.expand_dims(zeros,axis=0)

>>> print(zeros_2d)

    [[ 0  0  0]]

>>> print(zeros_2d.shape)

    (1, 3)
```

Finally, we can use **numpy.concatenate()** to combine the two arrays:

```python
>>> combined = np.concatenate([ones,zeros_2d],axis=0)

>>> print(combined)

    [[ 1  1  1]
     [ 1  1  1]
     [ 0  0  0]]
```

Adding a column is done the same way, except substituting **axis=1** for **axis=0** in both functions.
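Here is a minimal sketch of the column version. To be added as a column, the 1D array needs one value per row, so this example assumes a **zeros** array of length 2 rather than the length-3 array used above:

```python
import numpy as np

ones = np.ones((2, 3), dtype=int)   # shape (2, 3)
zeros = np.zeros(2, dtype=int)      # shape (2,), one value per row of ones

# axis=1 turns the 1D array into a single column with shape (2, 1)
zeros_2d = np.expand_dims(zeros, axis=1)

# axis=1 adds the new column to the right of the existing columns
combined = np.concatenate([ones, zeros_2d], axis=1)
print(combined)
```

The resulting array has shape (2, 4): the original three columns plus the new one.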

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Expand the dimensions of **trip_mph** to be a single column in a 2D ndarray, and assign the result to **trip_mph_2d**.
2. Add **trip_mph_2d** as a new column at the end of **taxi**, assigning the result back to **taxi**.
3. Use the **print()** function to display **taxi** and view the new column.


In [74]:
# put your code here

## Sorting ndarrays

Now that we've added our **trip_mph** column to our array, our next step is to sort the array. For this, we'll use the [numpy.argsort() function](http://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.argsort.html#numpy.argsort). The **numpy.argsort()** function returns the indices which would sort an array. Don't worry if that sounds a little unusual. We'll look at an example to help explain it.

We'll start by defining a simple 1D ndarray, where each item is a string containing the name of a fruit:

<img width="450" src="https://drive.google.com/uc?export=view&id=1xep-cNIjyRLiSjj40rH7FBI-JsxGZjEr">

We've put the indices, or index numbers, next to each value in the array. We use the indices whenever we want to select an item, for instance **fruit[2]** would return the value **'apple'** and **fruit[1]** would return the value **'banana'**. As we learned earlier in the mission, if we selected using a list of values like **fruit[[2,1]]**, we would get back an ndarray of those values in the order: **['apple','banana'].**

Next, we'll use **numpy.argsort()** to return the indices that would sort the array:

<img width="500" src="https://drive.google.com/uc?export=view&id=1_2PNcK3Ty6blapVQBHg1efLPr_QGJAy5">


If we look at these indices carefully, we can see what has happened. The first value of **sorted_order** is 2: The value at index 2 of fruit is **'apple'**, the first item if we sort in alphabetical order. The second value is 1: The value at index 1 of fruit is **'banana'**, the second item if we sort in alphabetical order, and so on.

If we use the array of sorted indices to select items from fruit, here is what we get:

<img width="500" src="https://drive.google.com/uc?export=view&id=1CTtkZu4vHHwTAovRdWeEqPhwuV3jH9Ls">


In the code above, the values from **sorted_order** get inserted between the brackets. The code is the equivalent of:

```python
sorted_fruit = fruit[[2, 1, 4, 3, 0]]
```

As you can see, the result is that our original array has been sorted in alphabetical order.
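The steps above can be sketched as runnable code. The text only confirms that index 2 holds `'apple'` and index 1 holds `'banana'`, so the other three fruit names below are assumptions chosen to be consistent with the indices `[2, 1, 4, 3, 0]`:

```python
import numpy as np

# 'apple' (index 2) and 'banana' (index 1) come from the text above;
# the other three names are assumptions consistent with the indices shown
fruit = np.array(['orange', 'banana', 'apple', 'grape', 'cherry'])

# indices that would put the array in alphabetical order
sorted_order = np.argsort(fruit)
print(sorted_order)

# selecting with those indices returns the sorted values
sorted_fruit = fruit[sorted_order]
print(sorted_fruit)
```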

Let's look at an example with a 2D ndarray. We'll sort a 5x5 ndarray called **int_square** by its last column:


```python
>>> print(int_square)

    [[5 2 8 3 4]
     [2 8 6 2 5]
     [1 6 2 7 7]
     [0 7 7 4 5]
     [5 7 1 1 2]]
```

We'll start by selecting just the last column.

```python
>>> last_column = int_square[:,4]

>>> print(last_column)

    [4 5 7 5 2]
```

Then, we use **numpy.argsort()** to get the indices that would sort the last column and assign them to **sorted_order**.

```python
>>> sorted_order = np.argsort(last_column)

>>> print(sorted_order)

    [4 0 1 3 2]
```

As a test, let's use **sorted_order** to sort just the last column:

```python
>>> last_column_sorted = last_column[sorted_order]

>>> print(last_column_sorted)

    [2 4 5 5 7]
```

Finally, we can use **sorted_order** to sort the full ndarray:

```python
>>> int_square_sorted = int_square[sorted_order]

>>> print(int_square_sorted)

    [[5 7 1 1 2]
     [5 2 8 3 4]
     [2 8 6 2 5]
     [0 7 7 4 5]
     [1 6 2 7 7]]
```

We can use the same technique to sort our **taxi** ndarray by the **trip_mph** column. The **numpy.argsort()** function only sorts in ascending order, but that is not a problem: we'll just look at the last few rows instead of the first few rows to examine the data we need.
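As a small sketch of how to get at the largest values, the block below reuses the **int_square** example. It shows two options: slicing the last rows of the ascending result, and reversing the index array with `[::-1]` to get a descending order. The name `int_square_desc` is introduced here for illustration only:

```python
import numpy as np

int_square = np.array([[5, 2, 8, 3, 4],
                       [2, 8, 6, 2, 5],
                       [1, 6, 2, 7, 7],
                       [0, 7, 7, 4, 5],
                       [5, 7, 1, 1, 2]])

sorted_order = np.argsort(int_square[:, 4])
int_square_sorted = int_square[sorted_order]

# the last two rows hold the two largest values of the last column
print(int_square_sorted[-2:])

# reversing the index array gives a descending order instead
int_square_desc = int_square[sorted_order[::-1]]
print(int_square_desc[:, 4])
```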

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Use **numpy.argsort()** to get the indices which would sort the **trip_mph** column from the **taxi** ndarray. The **trip_mph** column is at column index **15**.
2. Use the indices from the previous instruction to **sort** the **taxi** ndarray, and assign the result to **taxi_sorted**.
3. Use the **print()** function to examine the **taxi_sorted** ndarray.

In [79]:
# put your code here

In this section we learned:

- How vectorization makes our code faster.
- About n-dimensional arrays, and NumPy's ndarrays.
- How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
- How to use vector math to apply simple calculations to entire ndarrays.
- How to use vectorized methods to perform calculations across either axis of ndarrays.
- How to add extra columns and rows to ndarrays.
- How to sort an ndarray.


# Boolean Indexing with NumPy

## Reading CSV files with NumPy

In the previous section we learned how to use NumPy and ndarrays to perform vectorized operations to work with data. We learned that NumPy makes it quick and easy to make selections of our data, and includes a number of functions and methods that make it easy to calculate statistics across the different axes (or dimensions).

Using the skills we've learned so far, we were able to select subsets of our taxi trip data and then calculate things like the maximum, minimum, sum, and mean of various columns and rows. But what if we wanted to find out how many trips were taken in each month? Or which airport is the busiest? For this we will need a new technique: **Boolean Indexing.**

In the previous section, we used Python's built-in [csv module](https://docs.python.org/3/library/csv.html) to import our CSV as a 'list of lists' and used loops to convert each value to a float before we created our NumPy ndarray. Now that we understand NumPy a little better, let's learn about the [numpy.genfromtxt() function](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt) to read in files.

The **numpy.genfromtxt()** function reads a text file into a NumPy ndarray. While it has over 20 parameters, for most cases you need only two. Here is the simplified syntax for the function, and an explanation of the two parameters:

```python
np.genfromtxt(filename,delimiter)
```

- **filename** - A positional argument, usually a string representing the path to the text file to be read.
- **delimiter** - A named argument, specifying the string used to separate each value.

In this case, because we have a CSV file, the delimiter is a comma. Let's look at what the code would look like to read in the **nyc_taxis.csv** file.

```python
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
print(taxi)
```

The output of this code is shown below:

```python
[[   nan    nan    nan ...,    nan    nan    nan]
 [  2016      1      1 ...,  11.65  69.99      1]
 [  2016      1      1 ...,      8   54.3      1]
 ..., 
 [  2016      6     30 ...,      5  63.34      1]
 [  2016      6     30 ...,   8.95  44.75      1]
 [  2016      6     30 ...,      0  54.84      2]]
```

When **numpy.genfromtxt()** reads in a file, it converts every value to a single data type, which is float unless we specify otherwise with the **dtype** parameter. We can use the **ndarray.dtype** attribute to see the internal datatype that has been used.

```python
>>> taxi.dtype

    dtype('float64')
```

NumPy has used the **float64** type, which allows most of the values from our CSV to be read. You can think of NumPy's **float64** type as being identical to Python's float type (the **'64'** refers to the number of bits used to store the underlying value).

The first row of our data contains a value that we haven't seen before: **nan**. **NaN** is an acronym for **Not a Number**. The concept of NaN is an unusual one at first - it literally means that the value cannot be stored as a number. It is similar to (and often referred to interchangeably as) a null value, like Python's [None constant](https://docs.python.org/3.4/library/constants.html#None).

NaN is most commonly seen when a value is missing, but in this case we have NaN because the first line from our CSV file contains the names of each column. As we mentioned in the previous mission, NumPy ndarrays can contain only one type. NumPy is unable to convert string values like **pickup_year** into the **float64** data type. Later in this course we'll talk about NaN some more in the context of missing values. For now, we need to remove this row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:

```python
taxi = taxi[1:]
```

This removes the first row from the array. Alternatively, we can pass an additional parameter, **skip_header**, to the **numpy.genfromtxt()** function. The **skip_header** parameter accepts an integer, the number of rows from the start of the file to skip (note that because this is the number of rows and not the index, skipping the first row requires a value of 1 and not 0).
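To see the effect of the header row without needing the real file, here is a minimal sketch that reads a tiny made-up CSV from memory. The `csv_text` contents are invented for this example, and `io.StringIO` stands in for a file on disk:

```python
import io
import numpy as np

# a tiny made-up CSV with a header row, standing in for a real file
csv_text = "pickup_year,pickup_month,fare_amount\n2016,1,52.0\n2016,1,45.0\n"

# without skip_header, the header row becomes a row of nan values
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(with_header.shape)

# skip_header=1 drops the first line before parsing
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(without_header.shape)
```

The first array has one more row than the second, and that extra row contains only **nan**.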

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Import the **NumPy** library.
2. Use the **numpy.genfromtxt()** function to read the **nyc_taxis.csv** file into NumPy, skipping the first row, and assign the result to **taxi**.



In [12]:
import numpy as np

taxi = np.genfromtxt('nyc_taxis.csv', skip_header=1, delimiter=',')

## Boolean Arrays

In the last sections we mentioned five ways to index, or select, data from ndarrays:

- An **integer**, indicating a specific location.
- A **slice**, indicating a range of locations.
- A **colon**, indicating every location.
- A **list of values**, indicating specific locations.
- A **boolean array**, indicating specific locations.

In this section we're going to focus on the last and arguably the most powerful method, the boolean array. A boolean array, as the name suggests, is an array full of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

Let's take a moment to refresh our understanding of what a boolean value is. The boolean (or **bool**) type is a built-in Python type that can contain one of two unique values:

- True
- False


Boolean values can be defined either by **'hard-coding'** them into the code using the keywords **True** or **False**, or by using any of the Python comparison operators like **== (equal), > (greater than), < (less than), != (not equal)**. They're commonly seen within if statements, like the example below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1cltwKCELwoqrOzBNU7AIDsJ083ykMhPL">

As the code is executed the boolean operation is evaluated, causing the print function to run. We can use the console to perform simple boolean operations as well:

```python
>>> type(3.5) == float
    True
>>> 3 < 10
    True
>>> "hello" == "goodbye"
    False
>>> 5 > 6
    False
>>> (3 + 3) != 5
    True
```

When we explored vector math in the first section, we learned that an operation between an ndarray and a scalar (individual) value results in a new ndarray:


```python
>>> np.array([2,4,6,8]) + 10

    array([12, 14, 16, 18])
```

The **+ 10** operation is applied to each value in the array.

Now, let's look at what happens when we perform a boolean operation between an ndarray and a scalar:

```python
>>> np.array([2,4,6,8]) < 5

    array([ True,  True, False, False], dtype=bool)
```

A similar pattern occurs: the 'less than five' operation is applied to each value in the array. The diagram below shows this step by step:


<img width="600" src="https://drive.google.com/uc?export=view&id=1QINhkJfEHn-CXbppP-x-RklfQCxfKrxg">

Let's practice using vectorized boolean operations to create some boolean arrays.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Use vectorized boolean operations to:
  - Evaluate whether the elements in array __a__ are less than 3 and assign the result to **a_bool**.
  - Evaluate whether the elements in array __b__ are equal to **"blue"** and assign the result to **b_bool**.
  - Evaluate whether the elements in array __c__ are greater than 100 and assign the result to **c_bool**.

In [5]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# put your code here
a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

## Boolean Indexing with 1D ndarrays

Now we know what a boolean array is and how to create one using vectorized boolean operations. The last piece of the puzzle is understanding how to index (or select) using boolean arrays. This is known as boolean indexing. Let's use one of the examples from the previous screen.

<img width="600" src="https://drive.google.com/uc?export=view&id=1nNX9HUvygkpb2_GowE6t3QPtozu4TaX6">


To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:

<img width="600" src="https://drive.google.com/uc?export=view&id=1HruGF2TejcaPODJP0PvLqNj2g9qNoRVQ">

The boolean array acts as a filter: the values that correspond to **True** become part of the resultant ndarray, whereas the values that correspond to **False** are removed.
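The filtering step can be sketched in code using the array from the earlier comparison example. The variable names `values`, `values_bool`, and `result` are chosen for this illustration:

```python
import numpy as np

values = np.array([2, 4, 6, 8])
values_bool = values < 5      # [ True  True False False]

# keeps only the positions where the boolean array is True
result = values[values_bool]
print(result)
```

Only 2 and 4 survive the filter, because those are the positions where the boolean array is **True**.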

Now, let's look at an example using our **taxi** data. The second column in the ndarray is **pickup_month**. Let's use boolean indexing to create a filtered ndarray containing only items where the value is **1**, which corresponds to January. Once we have done that, we can look at the [ndarray.shape attribute](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.shape.html) for the filtered ndarray, which will tell us the number of taxi rides in our data set from the month of January.

We'll do it step by step, starting with selecting just the **pickup_month** column:

```python
pickup_month = taxi[:,1]
```

Next, we use a boolean operation to make our boolean array:

```python
january_bool = pickup_month == 1
```

Then we use the new boolean array to select only the items from pickup_month that have a value of 1:

```python
january = pickup_month[january_bool]
```

Finally, we use the **.shape** attribute to find out how many items are in our **january** ndarray, which is the number of taxi rides in our data set from the month of January. We'll use **[0]** to extract the value from the tuple returned by **.shape**:

```python
january_rides = january.shape[0]
print(january_rides)

13481
```

There are 13,481 rides in our dataset from the month of January. Let's practice boolean indexing and find out the number of rides in our data set for February and March.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

tip: details about the [dataset](https://s3.amazonaws.com/dq-content/289/nyc_taxi_data_dictionary.md)
1. Calculate the number of rides in the **taxi** ndarray that are from February:
  - Create a boolean array, **february_bool**, that evaluates whether the items in **pickup_month** are equal to **2**.
  - Use the **february_bool** boolean array to index **pickup_month**, and assign the result to **february**.
  - Use the **ndarray.shape** attribute to find the number of items in **february** and assign the result to **february_rides**.
2. Calculate the number of rides in the **taxi** ndarray that are from March:
  - Create a boolean array, **march_bool**, that evaluates whether the items in **pickup_month** are equal to **3**.
  - Use the **march_bool** boolean array to index **pickup_month**, and assign the result to **march.**
  - Use the **ndarray.shape** attribute to find the number of items in **march** and assign the result to **march_rides**.

In [25]:
# put your code here
february_bool = taxi[:,1] == 2
february = taxi[february_bool]
february_rides = february.shape[0]

march_bool = taxi[:,1] == 3
march = taxi[march_bool]
march_rides = march.shape[0]

## Boolean Indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission. The only limitation is that the boolean array must have the same length as the dimension you're indexing. Let's look at some examples:

<img width="500" src="https://drive.google.com/uc?export=view&id=1jXwHlU2lUX-VHmCTm9brDiTRu8L7yx2t">

Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.
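Here is a minimal sketch of these ideas on a made-up 4x3 array. The array and mask names are assumptions for this example and do not come from the diagram:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])   # shape (4, 3)

row_mask = np.array([True, False, True, False])   # length 4, matches the rows
col_mask = np.array([False, True, True])          # length 3, matches the columns

selected_rows = arr[row_mask]        # rows 0 and 2
selected_cols = arr[:, col_mask]     # columns 1 and 2

# a mask built from one column can filter the whole array
big_first_col = arr[arr[:, 0] > 4]   # rows whose first value exceeds 4

print(selected_rows)
print(selected_cols)
print(big_first_col)
```

Using `row_mask` in the column position, as in `arr[:, row_mask]`, would raise an error because its length (4) does not match the number of columns (3).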

Let's look at an example from our taxi trip data. In the previous mission, we sorted our ndarray in order to view the trips that had very large average speeds. Boolean indexing makes this much easier:

```python
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)
```

```python
[[     2      2     23      1]
 [     2      2   19.6      1]
 [     2      2   16.7      2]
 [     3      3   17.8      2]
 [     2      2   17.2      2]
 [     3      3   16.9      3]
 [     2      2   27.1      4]]
```

Combining our boolean array with a column slice allowed us to view just the key data of these trips with very high average speeds. As we observed in the previous mission, all of these trips have the same pickup and dropoff locations, and last only a few seconds.

Let's use this technique to examine the rows that have the highest values for the **tip_amount** column.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Create a boolean array, **tip_bool**, that determines which rows have values for the **tip_amount** column of more than **50**.
2. Use the **tip_bool** array to select all rows from **taxi** with tip amounts of more than **50**, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to **top_tips.**

In [31]:
# put your code here
tip_bool = taxi[:,-3] > 50
# 5:14 selects column indexes 5 through 13 inclusive
top_tips = taxi[tip_bool, 5:14]
# top_tips

## Assigning Values in ndarrays

So far we've learned how to retrieve data from ndarrays, and how to add rows or columns. There is one missing piece to our NumPy fundamentals toolbox: modifying values.

We can use the same indexing techniques we've already learned to assign values within an ndarray. The syntax we'll use (in pseudocode) is:

```python
ndarray[location_of_values] = new_value
```

Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific index location:

```python
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)

['orange' 'blue' 'black' 'blue' 'purple']
```

Or we can assign multiple values at once:

```python
a[3:] = 'pink'
print(a)

['orange' 'blue' 'black' 'pink' 'pink']
```

With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location.

```python
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

We can also assign a whole row...

```python
ones[0] = 42
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

...or a whole column:

```python
ones[:,2] = 0
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
```

Let's practice some array assignment with our taxi dataset.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

To help you practice without making changes to our original array, we have used the [ndarray.copy()](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy) method to make **taxi_modified**, a copy of our original for these exercises.


- The value at column index 5 (**pickup_location**) of row index 28214 is incorrect. Use assignment to change this value to __1__ in the **taxi_modified** ndarray.
- The first column (index 0) contains year values as four digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the **taxi_modified** ndarray.
- The values at column index 7 (**trip_distance**) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the **taxi_modified** ndarray to the mean value for that column.



In [57]:
# put your code here
taxi_modified = taxi.copy()
taxi_modified[28214,5] = 1

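# 2016 % 1000 leaves the two-digit year, 16
taxi_modified[:,0] = taxi_modified[:,0] % 1000

# replace the incorrect trip_distance values with the column mean
taxi_modified[1800:1802,7] = taxi_modified[:,7].mean()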

taxi_modified[:,0]

array([16., 16., 16., ..., 16., 16., 16.])

## Assignment Using Boolean Arrays

Boolean arrays become very powerful when we use them for assignment. Let's start by looking at a simple example:

```python
>>> a = np.array([1, 2, 3, 4, 5])

>>> a[a > 2] = 99

>>> print(a)

    [ 1  2 99 99 99]
```

Before we walk through how the code works, note that we've just used a 'shortcut' for the first time. The second line of code inserted the definition of the boolean array directly into the selection. This 'shortcut' way is the conventional way to write boolean indexing. Up until now, we've been taking the extra step of assigning to an intermediate variable first so that the process is clear. Let's look at how we would have written the example using the intermediate variable.

```python
>>> a2 = np.array([1, 2, 3, 4, 5])

>>> a2_bool = a2 > 2

>>> a2[a2_bool] = 99

>>> print(a2)

    [ 1  2 99 99 99]
```

You can see that both ways produce the same results. From here on, we will use the shortcut method instead of the intermediate variable. The boolean array controls the values that the assignment applies to, and the other values remain unchanged. Let's look at how this code works:

<img width="600" src="https://drive.google.com/uc?export=view&id=1u8WcLq-TYCIhSFuEa9ElfMFYPBAC_rZZ">


Next, let's look at an example of assignment using a boolean array with two dimensions:

```python
>>> b = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> b[b > 4] = 99

>>> print(b)

    [[ 1  2  3]
     [ 4 99 99]
     [99 99 99]]
```

<img width="600" src="https://drive.google.com/uc?export=view&id=1VPmK9UuV1jvX74-ljHWJT6oE_vkTkS-a">


Lastly, let's look at an example that uses a 1D boolean array to perform assignment on a 2D array:

```python
>>> c = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> c[c[:,1] > 2, 1] = 99

>>> print(c)

    [[ 1  2  3]
     [ 4 99  6]
     [ 7 99  9]]
```


In this example, the **c[:,1] > 2** boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array to specify the rows for assignment, and use the integer **1** to specify the second column. This results in our boolean array only being applied to the second column, with all other values remaining unchanged:

<img width="600" src="https://drive.google.com/uc?export=view&id=1nXvILrVeMLryXgLr_TYLHPdjHVJZxstA">


This pattern, where a 1D boolean array is used to specify assignment in the row dimension and an index value is used to specify which column the assignment applies to, is very common. The pseudocode syntax for this pattern is as follows, first using an intermediate variable:

```python
# named bool_array to avoid shadowing Python's built-in bool type
bool_array = array[:, column_for_comparison] == value_for_comparison
array[bool_array, column_for_assignment] = new_value
```

and then all in one line:

```python
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```
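To make the pseudocode concrete, here is a minimal sketch on a made-up array. The name `trips` and its values are assumptions for this example: column 0 plays the role of the comparison column and column 1 is the column being assigned to.

```python
import numpy as np

# made-up data: column 0 is a location code, column 1 is a flag
trips = np.array([[2, 0],
                  [4, 0],
                  [2, 0],
                  [3, 0]])

# wherever column 0 equals 2, set column 1 to 1
trips[trips[:, 0] == 2, 1] = 1
print(trips)
```

Only the first and third rows are changed, because those are the rows where the comparison is **True**.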

Let's practice this pattern using our taxi data set:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

We have created a new copy of our taxi dataset, **taxi_modified**, with an additional column containing the value 0 for every row.

1. In our new column at index **15**, assign the value __1__ if the **pickup_location_code** (column index 5) corresponds to an airport location, leaving the value as 0 otherwise by performing these three operations:
  - For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [48]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

# create a new column filled with `0`.
zeros = np.zeros([taxi_modified.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
taxi_modified[:,15] = (taxi_modified[:,5] == 2) | (taxi_modified[:,5] == 3) | (taxi_modified[:,5] == 5)
print(taxi_modified[1:10,15])

# put your code here

[1. 1. 1. 1. 0. 1. 1. 1. 0.]


## Challenge: Which is the most popular airport?

We'll conclude this lesson with two challenges. Challenges are designed to help you practice the techniques you've learned in this lesson.

**Don't be discouraged if these challenge steps take a few attempts to get right. Working with data is an iterative process!**

In this challenge, we want to find out which airport is the most popular destination in our data set. To do that, we'll use boolean indexing and the **dropoff_location_code** column (column index 6) to create three filtered arrays and then look at how many rows are in each array. The values from the column we're interested in are:

- 2 - JFK Airport.
- 3 - LaGuardia Airport.
- 5 - Newark Airport.
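
The filter-then-count idea described above can be sketched on a small made-up array before trying it on the real data. The array `toy` and its column layout are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data: column 0 is a trip id, column 1 is a dropoff location code.
toy = np.array([
    [10, 2],
    [11, 3],
    [12, 2],
    [13, 5],
    [14, 0],
])

code_two_mask = toy[:, 1] == 2            # boolean array, one value per row
code_two_rows = toy[code_two_mask]        # keeps only the rows where the mask is True
code_two_count = code_two_rows.shape[0]   # number of rows in the filtered array

print(code_two_count)        # 2
print(code_two_mask.sum())   # 2, because each True counts as 1
```

Summing the boolean array is a shortcut that gives the same count without building the filtered array, which can be handy when only the number of matching rows is needed.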


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


- Using the original **taxi** ndarray, calculate how many trips had JFK Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to JFK, and assign the result to **jfk**.
  - Calculate how many rows are in the new **jfk** array and assign the result to **jfk_count**.
- Calculate how many trips from **taxi** had LaGuardia Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to LaGuardia, and assign the result to **laguardia**.
  - Calculate how many rows are in the new **laguardia** array and assign the result to **laguardia_count**.
- Calculate how many trips from **taxi** had Newark Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to Newark, and assign the result to **newark**.
  - Calculate how many rows are in the new **newark** array and assign the result to **newark_count**.
- After you have run your code, inspect the values for **jfk_count**, **laguardia_count**, and **newark_count** and see which airport has the most dropoffs.

In [59]:
taxi[6]

array([2.016e+03, 1.000e+00, 1.000e+00, 5.000e+00, 0.000e+00, 2.000e+00,
       6.000e+00, 8.450e+00, 9.270e+02, 2.450e+01, 1.300e+00, 0.000e+00,
       6.450e+00, 3.225e+01, 1.000e+00])

In [60]:
# put your code here
# Filter the rows for each airport, then count the rows in each filtered array
jfk = taxi[taxi[:,6] == 2]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:,6] == 3]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:,6] == 5]
newark_count = newark.shape[0]

print("JFK:", jfk_count)
print("LaGuardia:", laguardia_count)
print("Newark:", newark_count)

JFK: 11832
LaGuardia: 16602
Newark: 63
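
As a cross-check on counts like these, one possible approach is to count every distinct code in a single pass with `np.unique()` and its `return_counts` parameter. The sketch below uses a short made-up array of codes rather than the taxi data; `codes`, `count_by_code`, and `airport_names` are names chosen for this example.

```python
import numpy as np

# Hypothetical dropoff location codes for seven trips.
codes = np.array([2, 3, 3, 5, 0, 3, 2])

# np.unique returns the sorted distinct values and, with return_counts=True,
# how many times each value appears.
values, counts = np.unique(codes, return_counts=True)
count_by_code = dict(zip(values.tolist(), counts.tolist()))

airport_names = {2: "JFK", 3: "LaGuardia", 5: "Newark"}
for code, name in airport_names.items():
    # .get() returns 0 for a code that never appears in the data
    print(name, count_by_code.get(code, 0))
```

For this toy input, the loop reports 2 trips for JFK, 3 for LaGuardia, and 1 for Newark. On the real data, the same call would be applied to the **dropoff_location_code** column.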


## Challenge: Calculating Statistics for Trips on Clean Data

Our calculations in the previous challenge show that LaGuardia is the most common airport for dropoffs in our data set.

Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining 'clean' data.

We'll start by using boolean indexing to remove any rows that have an average trip speed greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two sections. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

- **trip_distance**, at column index 7
- **trip_length**, at column index 8
- **total_amount**, at column index 13
- **trip_mph**, not available as a column but as its own ndarray
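
The steps above can be sketched on a tiny made-up array. Here a separate speed array is used as the filter, and the mean is taken on what remains. The array `trips`, its two-column layout (miles, seconds), and the zero-duration row are assumptions added for illustration.

```python
import numpy as np

# Hypothetical toy data: column 0 is distance in miles, column 1 is duration in seconds.
trips = np.array([
    [1.0, 360.0],   # 10 mph
    [5.0, 900.0],   # 20 mph
    [30.0, 60.0],   # 1800 mph, clearly bad data
    [2.0, 0.0],     # zero duration, so the computed speed is infinite
])

# Silence the divide-by-zero warning; the speed for the last row becomes inf.
with np.errstate(divide="ignore"):
    mph = trips[:, 0] / (trips[:, 1] / 3600)

keep = mph <= 100        # inf and implausibly large speeds compare as False
cleaned = trips[keep]    # the mask comes from a separate array of the same length

print(cleaned[:, 0].mean())   # mean distance of the remaining rows
print(mph[keep].mean())       # mean speed of the remaining rows
```

With this toy input, only the first two rows survive the filter, giving a mean distance of 3.0 miles and a mean speed of 15.0 mph. The mask works on `trips` even though it was built from `mph`, because both have one entry per row.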


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


The **trip_mph** ndarray has been provided for you.

- Create a new ndarray, **cleaned_taxi**, containing only rows for which the values of **trip_mph** are no greater than 100.
- Calculate the mean of the **trip_distance** column of **cleaned_taxi**, and assign the result to **mean_distance**.
- Calculate the mean of the **trip_length** column of **cleaned_taxi**, and assign the result to **mean_length**.
- Calculate the mean of the **total_amount** column of **cleaned_taxi**, and assign the result to **mean_total_amount**.
- Calculate the mean of **trip_mph**, excluding values greater than 100, and assign the result to **mean_mph**.

In [67]:
# Average speed in mph: trip_distance divided by trip_length converted from seconds to hours
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# Keep only the rows with an average speed of at most 100 mph
cleaned_taxi = taxi[trip_mph <= 100]
mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()
mean_mph = trip_mph[trip_mph <= 100].mean()

In this section we learned:

- How to use **numpy.genfromtxt()** to read in an ndarray.
- About **NaN** values.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.

This is the last section that deals exclusively with NumPy, but it's certainly not the last time we'll use it. As we move on to pandas and, later in our learning paths, other Python data libraries, you'll see that many of the concepts we've learned transfer, and you'll find yourself using these fundamental NumPy concepts often. We'll also use NumPy from time to time to create, transform, and otherwise work with tabular data.

In the next section, we'll start using the pandas library and learn how it compares with NumPy.