# 1. Understanding Vectorization

## Learn
One of the reasons the Python language is extremely popular is that it makes writing programs easy. When we execute Python code, the Python interpreter converts our code into bytecode, a lower-level set of instructions, and then runs that bytecode. When you write code in Python, you don't have to worry about things like allocating memory on your computer or choosing how certain operations are carried out by your computer's processor. Python takes care of that for you.
<img src="bytecode.svg" />
Python is what we call a **high-level language**. High-level languages allow you to write programs faster because the interpreter decides how to execute your instructions. In contrast, when you use **low-level languages** like ```C```, you define exactly how memory will be managed and how the processor will execute your instructions. This means that coding in a low-level language takes longer, but you have more ability to optimize your code to run faster.

|Language Type|Example|Time taken to write program|Control over program performance|
|------|------|------|------|
|High-Level|Python|Low|Low|
|Low-Level|C|High|High|

When choosing between a high-level and a low-level language, you have to make a trade-off between being able to work quickly and having programs that run quickly and efficiently. Luckily, there are two Python libraries that were created to give us the best of both worlds: **NumPy** and **pandas**. Together, pandas and NumPy provide a powerful toolset for working with data in Python. They allow us to write code quickly without sacrificing performance. But how do they do this? What is it that makes these libraries faster than raw Python? The answer is **vectorization**.
### How Vectorization Makes Code Faster
Let's look at an example where we have two columns of data. Each row contains two numbers we wish to add together. Using just Python, we would use a *list* of *lists* structure to store our data, and use *for loops* to iterate over that data. Let's see what this would look like as Python code:

<img src="for_loop.svg" />
When this code is run, the Python interpreter will turn our code into bytecode, following the logic of our ```for``` loop. In each iteration of our loop, the bytecode asks our computer's processor to add the two numbers together and stores the result. The diagram shows the first calculation our computer's processor would make:
<img src="unvectorized.svg" />
Our computer would take eight processor cycles to process the eight rows of our data.

Vectorization takes advantage of a processor feature called __Single Instruction Multiple Data (SIMD)__ to process data faster. Most modern computer processors support SIMD. SIMD allows a processor to perform the same operation, on multiple data points, in a single processor cycle. Let's look at how a vectorized version of our code above might be processed using a SIMD instruction that allows four data points to be processed at once:
<img src="unvectorized.svg" />
The vectorized version of our code will only take two processor cycles to process our eight rows of data, a four-times speed-up. Vectorized operations might process as few as two and as many as hundreds of operations per processor cycle, depending on the capabilities of the processor and the size of each data point.
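
As a rough sketch of the difference in code (not the course's own implementation), the snippet below adds two columns of eight sample rows, first with a plain Python loop and then with a single NumPy expression. The variable names are illustrative, the indexing syntax is explained later in this mission, and whether SIMD instructions are actually used depends on your processor and your NumPy build.

```python
import numpy as np

# Eight rows of two sample numbers each
rows = [[6, 5], [1, 3], [5, 6], [1, 4], [3, 7], [5, 8], [3, 5], [8, 4]]

# Unvectorized: one addition per loop iteration
loop_sums = []
for first, second in rows:
    loop_sums.append(first + second)

# Vectorized: one expression adds every pair of values
array = np.array(rows)
vector_sums = array[:, 0] + array[:, 1]

print(loop_sums)
print(vector_sums.tolist())
```

Both approaches produce the same eight sums. The difference is that the vectorized version hands the whole column to NumPy at once instead of asking Python to process one row at a time.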

The good news is that you don't have to worry about SIMD and processor cycles, because NumPy and pandas take care of this for you. We'll introduce pandas in more detail later in this course, but first we're going to learn about NumPy so we understand the fundamentals of working with vectorized operations.

In this course, we'll learn:

* How to work with data using NumPy and pandas objects.
* How to explore and clean data in pandas.
* How to use pandas and NumPy to analyze data quickly and efficiently.

Let's get started. Click 'next' to continue.

# 2. NYC Taxi-Airport Data

## Learn
As we learn NumPy, we'll be analyzing taxi trip data released by the city of New York. The city releases data on taxis and for-hire vehicles on the Taxi and Limousine Commission (TLC) website. There is data on over 1.3 billion individual trips, reaching back as far as 2009, and it is regularly updated.

We'll be working with a subset of this data: Yellow taxi trips to and from New York City airports between January and June 2016. In our dataset, each row represents a unique taxi trip. Below is information about selected columns from the data set:
* ```pickup_year``` - The year of the trip.
* ```pickup_month``` - The month of the trip (January is ```1```, December is ```12```).
* ```pickup_day``` - The day of the month of the trip.
* ```pickup_location_code``` - The airport or borough where the trip started, as one of eight categories:
 * ```0``` - Bronx.
 * ```1``` - Brooklyn.
 * ```2``` - JFK Airport.
 * ```3``` - LaGuardia Airport.
 * ```4``` - Manhattan.
 * ```5``` - Newark Airport.
 * ```6``` - Queens.
 * ```7``` - Staten Island.
* ```dropoff_location_code``` - The airport or borough where the trip finished, using the same eight category codes as ```pickup_location_code```.
* ```trip_distance``` - The distance of the trip in miles.
* ```trip_length``` - The length of the trip in seconds.
* ```fare_amount``` - The base fare of the trip, in dollars.
* ```total_amount``` - The total amount charged to the passenger, including all fees, tolls and tips.

You can find information on all columns in the dataset data dictionary.

We have randomly sampled approximately 90,000 trips for our analysis, representing one fiftieth of the trips for the six-month period. Our data is stored in a CSV file called ```nyc_taxis.csv```. Here are the first 10 rows of the data set:

|pickup_year|pickup_month|pickup_day|pickup_dayofweek|pickup_time|pickup_location_code|dropoff_location_code|trip_distance|trip_length|fare_amount|fees_amount|tolls_amount|tip_amount|total_amount|payment_type|
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|2016|1|1|5|0|2|4|21.00|2037|52.0|0.8|5.54|11.65|69.99|1|
|2016|1|1|5|0|2|1|16.29|1520|45.0|1.3|0.00|8.00|54.30|1|
|2016|1|1|5|0|2|6|12.70|1462|36.5|1.3|0.00|0.00|37.80|2|
|2016|1|1|5|0|2|6|8.70|1210|26.0|1.3|0.00|5.46|32.76|1|
|2016|1|1|5|0|2|6|5.56|759|17.5|1.3|0.00|0.00|18.80|2|
|2016|1|1|5|0|4|2|21.45|2004|52.0|0.8|0.00|52.80|105.60|1|
|2016|1|1|5|0|2|6|8.45|927|24.5|1.3|0.00|6.45|32.25|1|
|2016|1|1|5|0|2|6|7.30|731|21.5|1.3|0.00|0.00|22.80|2|
|2016|1|1|5|0|2|5|36.30|2562|109.5|0.8|11.08|10.00|131.38|1|
|2016|1|1|5|0|6|2|12.46|1351|36.0|1.3|0.00|0.00|37.30|2|

This, however, is what the first few lines of raw data in our CSV look like (we are showing only the first four columns from the file to make the format easier to understand):

```text
pickup_year,pickup_month,pickup_day,pickup_dayofweek
2016,1,1,5
2016,1,1,5
2016,1,1,5
2016,1,1,5
```

To start working with this CSV data in NumPy, we'll first need to start by importing the NumPy library into our Python environment. For this, we use a simple import statement:
```python
import numpy as np
```

We used the ```as``` syntax in our ```import``` statement. This allows us to access the NumPy library using another name. When working with NumPy, the convention is to import the library as ```np``` for brevity.

Next, we'll use Python's built-in ```csv``` module to import our CSV as a 'list of lists'.

The last step is to convert our list of lists into a NumPy n-dimensional array, or ndarray. We're going to explain ndarrays in more detail in the next screen, but for now you can think of it as NumPy's version of a list of lists format. To convert from the list type to ndarray, we use the ```numpy.array()``` constructor. Here's an example of how it works:

```python
# our list of lists is stored as data_list
data_ndarray = np.array(data_list)
```

We used the syntax ```np.array()``` instead of ```numpy.array()``` because of our ```import numpy as np``` code. When we introduce a new syntax, we'll always use the full name to describe it, and you'll need to substitute in the shorthand as appropriate.
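
To see the three steps working together, here is a minimal sketch that runs them on the four-column sample shown above. It reads from an in-memory string (via ```io.StringIO```) instead of the real ```nyc_taxis.csv``` file, and the names ```raw_csv``` and ```rows``` are illustrative only.

```python
import csv
import io

import numpy as np

# Stand-in for the file: the four-column sample shown above
raw_csv = """pickup_year,pickup_month,pickup_day,pickup_dayofweek
2016,1,1,5
2016,1,1,5
2016,1,1,5
2016,1,1,5
"""

# csv.reader yields each line as a list of strings
rows = list(csv.reader(io.StringIO(raw_csv)))

# Drop the header row and convert every value to a float
data_list = [[float(value) for value in row] for row in rows[1:]]

# Convert the list of lists to a 2D ndarray
data_ndarray = np.array(data_list)
print(data_ndarray.shape)  # (4, 4)
```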

Let's convert our taxi CSV into a NumPy ndarray!


## Instructions
In the 'script.py' code box on the right, we have imported ```numpy```, and used Python's ```csv``` module to import the ```nyc_taxis.csv``` file and convert it to a list of lists containing float values.

* Add a line of code using the ```numpy.array()``` constructor to convert the ```converted_taxi_list``` variable to a NumPy ndarray. Assign the result to the variable name ```taxi```.
* Click 'Run' to run your code and get feedback.

If you need a hint, you can find one under the 'get help' menu.

In [26]:
#script.py
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# start writing your code below this comment
taxi = np.array(converted_taxi_list)

# nothing is printed here; we'll inspect taxi in the next screen

# 3. Understanding NumPy ndarrays

## Learn
As we mentioned earlier, ndarray stands for 'n-dimensional array'. In programming, array is a term that describes a collection of elements. Even if you haven't heard the term before, you have likely encountered arrays: a list object in Python could be described generically as an array. N-dimensional refers to the fact that ndarrays can have one or more dimensions. Let's look at some visualizations of one, two, and three dimensional arrays and their common names:
<img src="dimensional_arrays.svg" />
Arrays with more than three dimensions do exist in data science but they're rare. We'll focus on:

* One-dimensional ndarrays (1D ndarrays)
* Two-dimensional ndarrays (2D ndarrays)

Similar to using lists of lists, we use numbers to specify the location of elements of our data that we want to work with. Just like with lists, we call these numbers index values (or collectively, indices).
Unlike with Python lists, every value in an ndarray must be of the same type. For the NYC taxi data set this does not matter, as all the values are floats. We'll talk further about this restriction and how to handle it in a later mission.
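
As a small illustration of the single-type rule, the sketch below builds three toy arrays (the values are made up for this example) and checks the ```dtype``` attribute, which reports the type NumPy chose for every element.

```python
import numpy as np

# All floats: the array stores floats
floats = np.array([[2016.0, 1.0], [21.0, 2037.0]])
print(floats.dtype)  # float64

# Mixing integers and floats: NumPy converts everything to float
mixed_numbers = np.array([1, 2.5])
print(mixed_numbers.dtype)  # float64

# Mixing numbers and text: NumPy converts everything to strings
mixed_text = np.array([1, 2.5, "JFK"])
print(mixed_text.dtype.kind)  # U (a unicode string type)
```
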
Let's take a look at the data in the ```taxi``` variable from the previous screen by printing it using Python's ```print()``` function:
```shell
>>> print(taxi)

    [[ 2016.  1.   1.  ..., 11.65  69.99   1. ]
     [ 2016.  1.   1.  ...,  8.    54.3    1. ]
     [ 2016.  1.   1.  ...,  0.    37.8    2. ]
     ..., 
     [ 2016.  6.  30.  ...,  5.    63.34   1. ]
     [ 2016.  6.  30.  ...,  8.95  44.75   1. ]
     [ 2016.  6.  30.  ...,  0.    54.84   2. ]]
```
     
At first, this looks identical to a list of lists, with two exceptions:

* Between the third and fourth column of every row there is an ellipsis (```...```).
* Between the third and fourth row there is another ellipsis.

These ellipses indicate that there is more data in our NumPy ndarray than can easily be printed. NumPy will summarize any ndarray we print if it contains more than 1000 elements. If we want to see how many rows and columns are in our ndarray, we can use the ```ndarray.shape``` attribute. If you like, you can open the console from the bottom right of the interface and run this command to see it for yourself.
```shellscript
>>> taxi.shape
    (89560, 15)
```
    
Note that the ```>>>``` above isn't part of the code, but is the Python console prompt. Every time you see this in our exercises, it's an indication that we're showing you how you can use the console.

The output of the ```ndarray.shape``` attribute gives us a few important pieces of information:

* There are two numbers, which tells us that our ndarray is two-dimensional.
 * Note: the data type returned is called a tuple. Tuples are very similar to Python lists, but are immutable (can't be modified). Tuples are defined and displayed using parentheses ```()``` rather than brackets ```[]```.
* The first number tells us that the first dimension is 89,560 items long, or put another way that there are 89,560 rows in our data set.
* The second number tells us that the second dimension is 15 items long, or put another way that there are 15 columns in our data set.
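
To make the link between shape and dimensions concrete, here is a short sketch using two small made-up arrays rather than the taxi data. It also uses the ```ndarray.ndim``` attribute, which reports the number of dimensions directly.

```python
import numpy as np

one_d = np.array([10, 20, 30])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

print(one_d.shape)  # (3,)   one number, so one dimension
print(two_d.shape)  # (2, 3) two numbers, so two dimensions
print(two_d.ndim)   # 2

# Because shape is a tuple, we can unpack it into rows and columns
n_rows, n_columns = two_d.shape
```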

If we just want to select a number of rows from an ndarray, we can use slicing, just like we would with a list of lists. Here's how we would print the first five rows:

```shellscript
>>> print(taxi[:5])

    [[ 2016  1  1  5  0  2  4  21    2037  52.   0.8  5.54  11.65  69.99   1  ]
     [ 2016  1  1  5  0  2  1  16.29  1520  45.   1.3  0     8    54.3    1  ]
     [ 2016  1  1  5  0  2  6  12.7   1462  36.5  1.3  0     0    37.8    2  ]
     [ 2016  1  1  5  0  2  6   8.7   1210  26.   1.3  0     5.46  32.76   1  ]
     [ 2016  1  1  5  0  2  6   5.56   759  17.5  1.3  0     0    18.8    2  ]]
```
     
You'll notice that because we have fewer than 1000 items in our output, NumPy does not summarize the data and we can see all 15 columns (although they're harder to see because each wraps onto a new line).

Let's practice making a slice of multiple rows of our ndarray.

In [27]:
taxi.shape

(89560, 15)

## Instructions

Throughout all of our Dataquest missions, variables we created in previous screens are available.

* Select the first ten rows of the ```taxi``` ndarray, and assign the result to a new variable ```taxi_ten```.
* Use Python's ```print()``` function to display ```taxi_ten```.

In [28]:
taxi_five = taxi[:5]
#your code below

# 4. Selecting and Slicing Rows and Items from ndarrays

## Learn

Let's look at a comparison between working with ndarrays and lists of lists to select one or more rows of data:
<img src="selection_rows.svg" />
Just like we saw in the previous screen, row selections from ndarrays look like they behave very similarly to selections from lists of lists. In reality, what we're seeing is a shortcut of sorts. For any two-dimensional array, the full syntax for selecting data is:
```python
ndarray[row,column]

# or if you want to select all
# columns for a given set of rows
ndarray[row]
```

Where ```row``` defines the location along the row axis and ```column``` defines the location along the column axis. Both ```row``` and ```column``` can be one of the following:

* An ***integer***, indicating a specific location, e.g. ```ndarray[3,0]```.
* A ***slice***, indicating a range of locations, e.g. ```ndarray[0:5,6:]```.
* A ***colon***, indicating every location, e.g. ```ndarray[:,2]```.
* A ***list of values***, indicating specific locations, e.g. ```ndarray[[0,1,3,4],0]```.
* A ***boolean array***, indicating specific locations. We'll look at this method in detail in the second mission of this course.
* Or ***any combination of the above***.

This is how we select a single item from a 2D ndarray:
<img src="selection_item.svg" />
With a list of lists, we use two separate pairs of square brackets back-to-back. With a NumPy ndarray, we use a single pair of brackets with comma separated row and column locations.
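
The selection options above can be tried out on a small array. The sketch below uses a made-up 4 x 3 data set (the names ```data_list``` and ```data_array``` are illustrative) to contrast the list of lists syntax with the ndarray syntax, and then shows a slice, a colon, and a list of values used as locations.

```python
import numpy as np

data_list = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
data_array = np.array(data_list)

# Single item: two pairs of brackets for the list, one pair for the ndarray
list_item = data_list[3][0]               # 9
array_item = data_array[3, 0]             # 9

# Other kinds of row and column locations (ndarray only)
block = data_array[0:2, 1:]               # slices: rows 0-1, columns 1 onward
whole_column = data_array[:, 2]           # colon: every row of column 2
picked_rows = data_array[[0, 1, 3], 0]    # list: rows 0, 1 and 3 of column 0

print(block)
print(whole_column)
print(picked_rows)
```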

Let's practice selecting one row, multiple rows, and single items from our ```taxi``` ndarray.

## Instructions

* From the ```taxi``` ndarray:
 * Select the row at index ```0``` and assign it to ```row_0```.
 * Select every column for the rows at indexes ```391``` to ```500``` inclusive and assign them to ```rows_391_to_500```.
 * Select the item at row index ```21``` and column index ```5``` and assign it to ```row_21_column_5```.

In [None]:
#Code here


# 5. Selecting Columns and Custom Slicing ndarrays

## Learn

Let's continue by learning how to select one or more columns of data:
<img src="selection_columns.svg" />
With a list of lists, we need to use a for loop to extract specific column(s) and append them to a new list. With ndarrays, the process is much simpler. We again use single brackets with comma-separated row and column locations, but we use a colon (```:```) for the row locations. This colon acts as a wildcard and gives us all items in that dimension, or in other words all rows.
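
Here is a minimal sketch of that comparison on a made-up 3 x 3 data set. The variable names are illustrative and are not part of the exercise.

```python
import numpy as np

data_list = [[5, 2, 8], [4, 9, 1], [7, 3, 6]]
data_array = np.array(data_list)

# List of lists: loop over the rows and collect one value from each
second_column_list = []
for row in data_list:
    second_column_list.append(row[1])

# ndarray: a colon for "all rows", then the column index
second_column_array = data_array[:, 1]

# Several columns at once: pass a list of column indexes
first_and_third = data_array[:, [0, 2]]
```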

If we wanted to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:
<img src="selection_1darray.svg" />

Lastly, if we wanted to select a 2D slice, we can use slices for both dimensions:

<img src="selection_2darray.svg" />
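
The 1D and 2D slices described above can be sketched on a small made-up array. Note how a single value in one dimension gives a 1D result, while slices in both dimensions give a 2D result.

```python
import numpy as np

# 4 rows and 5 columns holding the values 0 to 19
grid = np.arange(20).reshape(4, 5)

row_slice = grid[2, 1:4]       # part of one row
column_slice = grid[1:3, 4]    # part of one column
two_d_slice = grid[1:3, 0:2]   # slices for both dimensions

print(row_slice.shape)     # (3,)
print(column_slice.shape)  # (2,)
print(two_d_slice.shape)   # (2, 2)
```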

Let's practice everything we've learned so far to perform some more complex selections using NumPy.

## Instructions

* From the ```taxi``` ndarray:
 * Select every row for the columns at indexes ```1```, ```4```, and ```7``` and assign them to ```columns_1_4_7```.
 * Select the columns at indexes ```5``` to ```8``` inclusive for the row at index ```99``` and assign them to ```row_99_columns_5_to_8```.
 * Select the rows at indexes ```100``` to ```200``` inclusive for the column at index ```14``` and assign them to ```rows_100_to_200_column_14```.

In [33]:
#Code result here
columns_1_4_7 = taxi[:,[1,4,7]]
row_99_columns_5_to_8 = taxi[99,5:9]
rows_100_to_200_column_14 = taxi[100:201,14]

# 6. Vector Math

## Learn

The examples in the previous two screens showed us how much easier it is to select data using NumPy ndarrays. Beyond this, the selection we are making is a lot faster when working with vectorized operations. To illustrate this, we've created a random 500 x 5 NumPy ndarray and an equivalent list of lists, and then a function to select the second and third columns for each:

* ```python_subset()```
* ```numpy_subset()```

To keep things simple, we won't show you the underlying code for these functions, however we've used almost identical syntax to the diagrams from the previous screen. We'll use a special IPython ```%timeit``` magic command to time a single run of each function:
```shellscript
>>> %timeit -r 1 -n 1 python_subset()

    1 loop, best of 1: 284 µs per loop

>>> %timeit -r 1 -n 1 numpy_subset()

    1 loop, best of 1: 7.9 µs per loop
```
    
Our NumPy version was over 30 times quicker than the list of lists version (the units of the output are in microseconds)!
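
If you'd like to try a similar comparison yourself, the sketch below is one possible way to do it. The bodies of ```python_subset()``` and ```numpy_subset()``` here are guesses written for illustration, not the functions timed above, and it uses the standard library's ```timeit``` module so it runs outside IPython. Your timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

# A random 500 x 5 list of lists and an equivalent ndarray
random.seed(0)
data_list = [[random.random() for _ in range(5)] for _ in range(500)]
data_array = np.array(data_list)


def python_subset():
    """Select the second and third columns with a for loop (hypothetical)."""
    subset = []
    for row in data_list:
        subset.append(row[1:3])
    return subset


def numpy_subset():
    """Select the second and third columns with ndarray slicing (hypothetical)."""
    return data_array[:, 1:3]


# Time 1,000 calls of each function
python_seconds = timeit.timeit(python_subset, number=1000)
numpy_seconds = timeit.timeit(numpy_subset, number=1000)
print(f"list of lists: {python_seconds:.4f} s, ndarray: {numpy_seconds:.4f} s")
```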

When we first talked about vectorized operations, we used the example of adding two columns of data. With data in a list of lists, we'd have to construct a for-loop and add each pair of values from each row individually. To refresh your memory, here's what our example code looked like:

```python
my_numbers = [
              [6, 5],
              [1, 3],
              [5, 6],
              [1, 4],
              [3, 7],
              [5, 8],
              [3, 5],
              [8, 4]
             ]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
```
    
At the time, we only talked about how vectorized operations make this faster. However, they also make the code we need to write much simpler. We'll break this down into three steps:

* Convert our data to an ndarray,
* Select each column,
* Add the columns.

Let's look at what that looks like in code:
```python
# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)

# select each of the columns - the result
# of each will be a 1D ndarray
col1 = my_numbers[:,0]
col2 = my_numbers[:,1]

# add the two columns
sums = col1 + col2
```

We could simplify this further if we wanted to:

```python
sums = my_numbers[:,0] + my_numbers[:,1]
```

Here are some key observations about this code:

* When we selected each column, we used the syntax ```ndarray[:,c]``` where ```c``` is the column index we wanted to select. Like we saw in the previous screen, the colon acts as a wildcard and selects all rows.
* To add the two 1D ndarrays, ```col1``` and ```col2``` (which sometimes would be called ***vectors*** in this context), we simply use the addition operator (```+```) between them.
* The result of adding two 1D vectors is a 1D vector of the same shape (or dimensions) as the original.

Here's what happened behind the scenes:
<img src="vectorized_addition.svg" />

What we just did, adding two columns (or vectors) together is called ***vector math***. When we're performing vector math on two one-dimensional vectors, both vectors must have the same shape. We can use any of the standard Python numeric operators to perform vector math:

* ```vector_a + vector_b``` - Addition
* ```vector_a - vector_b``` - Subtraction
* ```vector_a * vector_b``` - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
* ```vector_a / vector_b``` - Division
* ```vector_a % vector_b``` - Modulus (find the remainder when ```vector_a``` is divided by ```vector_b```)
* ```vector_a ** vector_b``` - Exponent (raise ```vector_a``` to the power of ```vector_b```)
* ```vector_a // vector_b``` - Floor Division (divide ```vector_a``` by ```vector_b```, rounding down to the nearest integer)
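
Here is a quick sketch of each operator applied to two small made-up vectors. In every case the operation is applied to each pair of values in turn.

```python
import numpy as np

vector_a = np.array([7, 8, 9])
vector_b = np.array([2, 4, 3])

added = vector_a + vector_b          # 9, 12, 12
subtracted = vector_a - vector_b     # 5, 4, 6
multiplied = vector_a * vector_b     # 14, 32, 27
divided = vector_a / vector_b        # 3.5, 2.0, 3.0
remainder = vector_a % vector_b      # 1, 0, 0
power = vector_a ** vector_b         # 49, 4096, 729
floor_divided = vector_a // vector_b # 3, 2, 3
```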

Let's look at an example from our taxi dataset. Here are the first five rows of two of the columns in the data set:

|trip_distance|trip_length|
|------|------|
|21.00|2037.0|
|16.29|1520.0|
|12.70|1462.0|
|8.70|1210.0|
|5.56|759.0|

Let's use these columns to calculate the average travel speed of each trip in miles per hour. The formula for calculating miles per hour is:


*miles per hour = distance in miles / length in hours*

As we learned in the second screen of this mission, ```trip_distance``` is expressed in miles, and ```trip_length``` is in seconds, so our first step is converting ```trip_length``` into hours. Here's how we would do it:

```python
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
```

Here we have a different example of vector math. We've divided a vector (one-dimensional array) by a scalar (single number). In this case, each value in the vector gets divided by the scalar to form the result.
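
To see the vector-by-scalar division on its own, here is a small sketch that uses the five ```trip_length``` values from the table above instead of the full ```taxi``` ndarray.

```python
import numpy as np

# trip_length values (in seconds) from the first five rows shown above
trip_length_seconds = np.array([2037.0, 1520.0, 1462.0, 1210.0, 759.0])

# The scalar 3600 is applied to every value in the vector
trip_length_hours = trip_length_seconds / 3600

print(trip_length_hours.round(3))
```

The result has the same shape as the original vector: one value in hours for every value in seconds.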

From here, let's perform vector division again to calculate the miles per hour.


## Instructions

* Use vector division to divide ```trip_distance_miles``` by ```trip_length_hours```, assigning the result to ```trip_mph```.
* After you have run your code, use the variable inspector below the code box to inspect the contents of the new ```trip_mph``` variable.

In [35]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

# code below
trip_mph = trip_distance_miles / trip_length_hours
trip_mph


array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])

# 7. Arithmetic NumPy Functions

## Learn

To make the calculations in the previous screen, we used operators like the ```/``` symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations: ***arithmetic*** functions. Let's look at how we would write the exercise from the previous screen with the equivalent function, ```numpy.divide()```:

```python
# using the `/` operator:
trip_mph_1 = trip_distance_miles / trip_length_hours

# using the `numpy.divide()` function:
trip_mph_2 = np.divide(trip_distance_miles, trip_length_hours)
```

The variables ```trip_mph_1``` and ```trip_mph_2``` will be identical.
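
We can check that claim on a small sample. The sketch below uses the first three rows of ```trip_distance``` and ```trip_length``` shown earlier rather than the full data set, and compares the two results with ```numpy.array_equal()```.

```python
import numpy as np

# First three rows of the trip_distance and trip_length columns
trip_distance_miles = np.array([21.0, 16.29, 12.7])
trip_length_hours = np.array([2037.0, 1520.0, 1462.0]) / 3600

trip_mph_1 = trip_distance_miles / trip_length_hours
trip_mph_2 = np.divide(trip_distance_miles, trip_length_hours)

print(np.array_equal(trip_mph_1, trip_mph_2))  # True
```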

As you become more familiar with NumPy (and later, pandas), you'll find that there is often more than one way to do the same thing. Most of the time, which you choose is up to you. The general rule in situations like these is to choose the one that makes your code easier to read, which will pay dividends both as you start working with data in teams, and when you have to refer back to code you wrote some time ago. You will find that for these arithmetic operations, it's much more common to use the built-in Python operators than the functions.

As you start to feel more comfortable with these libraries, you should start exploring the documentation. This is useful because it builds out your knowledge of available functions and methods, but also because it gets you used to reading the documentation. It's not possible to remember the syntax for every variation of every data science library, but if you remember what is possible, and can read the documentation, you'll always be able to quickly refamiliarize yourself with some syntax whenever you need it.

You may have noticed that when we mention a function or method for the first time, we'll link to the documentation for it. Take a moment now to click the link for the ```numpy.divide()``` function from the first paragraph of this screen and look at the documentation. It may seem a little overwhelming at first, but it is well worth your time.

You might like to also take a look at all of the <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html#arithmetic-operations">arithmetic functions from the NumPy documentation</a>.

# 8. Calculating Statistics For 1D ndarrays

## Learn

Earlier, we created ```trip_mph```, a 1D ndarray of the average **miles-per-hour** speed of each trip in our dataset, based on the ```trip_length``` and ```trip_distance``` columns. We might like to explore this data further, for instance by working out the maximum and minimum values of that ndarray.

We could use the built-in Python functions ```min()``` and ```max()``` to make these calculations, but these would perform the calculations without taking advantage of vectorization. Instead, we can use NumPy's ndarray methods to calculate statistics.

To calculate the minimum value of a 1D ndarray, we use the vectorized ```ndarray.min()``` method, like so:
```python
>>> mph_min = trip_mph.min()

>>> mph_min

    0.0
```
    
The minimum value in our ```trip_mph``` ndarray is ```0.0```, for a trip that didn't travel any distance at all.

Before we look at other array methods, let's take a moment to clarify the difference between methods and functions. Functions act as standalone segments of code that usually take an input, perform some processing, and return some output. For example, when we're working with Python lists, we can use the ```len()``` function to calculate the length of a list, and when we're working with Python strings, we can also use ```len()```. In that case, it calculates the number of characters (or length) of the string.

```python
>>> my_list = [21,14,91]

>>> len(my_list)

    3

>>> my_string = 'Dataquest'

>>> len(my_string)

    9
```
    
In contrast, methods are special functions that belong to a specific type of object. Python lists have a ```list.append()``` method that we can use to add an item to the end of a list. If we try to use that method on a string, we will get an error:

```python
>>> my_list.append(21)

>>> my_string.append(' is the best!')

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'append'
```
    
When you're learning NumPy, this can get confusing, because sometimes there are operations that are implemented as both methods and functions, but sometimes there are not. Let's look at some examples:

|Calculation|Function Representation|Method Representation|
|------|------|------|
|Calculate the minimum value of ```trip_mph```|```np.min(trip_mph)```|```trip_mph.min()```|
|Calculate the maximum value of ```trip_mph```|```np.max(trip_mph)```|```trip_mph.max()```|
|Calculate the mean average value of ```trip_mph```|```np.mean(trip_mph)```|```trip_mph.mean()```|
|Calculate the median average value of ```trip_mph```|```np.median(trip_mph)```|There is no ndarray median method|

To remember the right terminology, anything that starts with ```np``` (e.g. ```np.mean()```) is a function and anything you express with an object (or variable) name first (e.g. ```trip_mph.mean()```) is a method. As we discussed in the previous screen, where both exist it's up to you which you use, but it's much more common to see the method approach, and that's the one we'll use moving forward.
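
The sketch below checks the table's claims on a small made-up array of speeds (the name ```speeds``` is illustrative). It uses Python's built-in ```hasattr()``` to confirm that ndarrays have no ```median``` method.

```python
import numpy as np

speeds = np.array([37.1, 38.6, 31.3, 25.9, 26.4])

# Function form and method form give the same result
print(np.min(speeds) == speeds.min())    # True
print(np.mean(speeds) == speeds.mean())  # True

# The median is only available as a function
print(np.median(speeds))          # 31.3
print(hasattr(speeds, "median"))  # False
```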

NumPy ndarrays have methods for many different calculations. A few key methods are:

* ```ndarray.min()``` to calculate the minimum value
* ```ndarray.max()``` to calculate the maximum value
* ```ndarray.mean()``` to calculate the mean average value
* ```ndarray.sum()``` to calculate the sum of the values

You can see a full list of ndarray methods in the <a href="https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation">NumPy ndarray documentation</a>.

Let's use the methods we've just learned about to calculate the smallest, largest, and mean average speed from our ```trip_mph``` ndarray.


## Instructions

* Use the ```ndarray.max()``` method to calculate the maximum value of ```trip_mph``` and assign the result to ```mph_max```.
* Use the ```ndarray.mean()``` method to calculate the average value of ```trip_mph``` and assign the result to ```mph_mean```.

In [36]:
mph_min = trip_mph.min()
#Code below
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()

# 9. Calculating Statistics For 2D ndarrays

## Learn

Looking at the result of the code in the previous screen, you would have observed:

* Minimum trip speed: 0 mph
* Average (mean) trip speed (rounded): 32 mph
* Maximum trip speed (rounded): 82,000 mph

While it's easy to imagine a case where the trip speed is 0 mph (a trip that starts and ends without traveling any distance), a trip speed of 82,000 mph is definitely not possible in New York traffic. That's almost 20x faster than the fastest plane in the world! This could be due to an error in the devices that record the data, or perhaps errors made somewhere in the data pipeline. We'll spend some time later in this mission looking into the data that gave us this unrealistic number.

For now, we're going to look at how we can calculate statistics for two-dimensional ndarrays. If we use the array methods without additional parameters, they will return a single value, just like they do with a 1D array:
<img src="array_method_axis_none.svg" />
But what if we wanted to find the maximum value of each row? For that, we need to use the ```axis``` parameter, and specify a value of ```1```, which indicates we want to calculate values for each row.

<img src="array_method_axis_1.svg" />
If we want to find the maximum value of each column, we use an ```axis``` value of ```0```:
<img src="array_method_axis_0.svg" />

To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use ```ndarray[row,column]```. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:
<img src="axis_param.svg" />
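
The same idea can be sketched in code with a small made-up 2 x 3 array. Notice that ```axis=1``` gives one result per row, and ```axis=0``` gives one result per column.

```python
import numpy as np

scores = np.array([[1, 9, 4],
                   [7, 2, 8]])

overall_max = scores.max()        # single value: 9
row_max = scores.max(axis=1)      # one value per row: 9, 8
column_max = scores.max(axis=0)   # one value per column: 7, 9, 8

print(row_max.shape)     # (2,)  two rows
print(column_max.shape)  # (3,)  three columns
```
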
Let's look at an example from our taxi data set. Let's say that we wanted to do some validation, and check that the ```total_amount``` column is accurate. To remind ourselves of what the data looks like, let's look at the first five rows of columns with indexes 9 through 13:

|fare_amount|fees_amount|tolls_amount|tip_amount|total_amount|
|------|------|------|------|------|
|52.0|0.8|5.54|11.65|69.99|
|45.0|1.3|0.00|8.00|54.3|
|36.5|1.3|0.00|0.00|37.8|
|26.0|1.3|0.00|5.46|32.76|
|17.5|1.3|0.00|0.00|18.8|

We want to check whether the first four of these columns sum to the fifth column. This is how we would do it:
```python
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]
# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13] 
# select the total_amount column
fare_totals = taxi_first_five[:,13]

# sum the component columns
fare_sums = fare_components.sum(axis=1)

# compare the summed columns to the fare_totals
print(fare_totals)
print(fare_sums)
```

Our code outputs the following:

```python
[ 69.99  54.3   37.8   32.76  18.8 ]
[ 69.99  54.3   37.8   32.76  18.8 ]```
We have validated that our ```total_amount``` column is correct (at least for the first five rows).
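
Comparing printed output by eye works for five rows, but not for thousands. One possible way to automate the check is ```numpy.allclose()```, which tolerates tiny floating-point differences. In this sketch the five rows from the table are typed in by hand; the array ```sample``` is an assumption standing in for ```taxi[:5, 9:14]```:

```python
import numpy as np

# the five rows shown in the table above, entered by hand
sample = np.array([
    [52.0, 0.8, 5.54, 11.65, 69.99],
    [45.0, 1.3, 0.00, 8.00, 54.30],
    [36.5, 1.3, 0.00, 0.00, 37.80],
    [26.0, 1.3, 0.00, 5.46, 32.76],
    [17.5, 1.3, 0.00, 0.00, 18.80],
])

fare_components = sample[:, :4]   # first four columns
fare_totals = sample[:, 4]        # last column

# sum across each row, then compare with a small tolerance
fare_sums = fare_components.sum(axis=1)
totals_match = np.allclose(fare_sums, fare_totals)
print(totals_match)
```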

Now, let's practice calculating the average for each column:


## Instructions

* Using a single method, calculate the mean value for each column of ```taxi```, and assign the result to ```taxi_column_means```.

In [41]:
taxi_column_means = taxi.mean(axis=0)
taxi_column_means.shape
type(taxi_column_means)

numpy.ndarray

# 10. Adding Rows and Columns to ndarrays

## Learn

Earlier in the mission, we produced an ndarray ```trip_mph``` of the average speed of each trip. We also observed that the maximum speed was 82,800 mph, which is definitely not an accurate number. To take a closer look at why we might be getting this value, we're going to do the following:

* Add the ```trip_mph``` as a column to our ```taxi``` ndarray.
* Sort ```taxi``` by ```trip_mph```.
* Look at the rows with the highest ```trip_mph``` from our sorted ndarray to see what they tell us about these large values.

To start, let's learn how to add rows and columns to an ndarray. The technique we're going to use involves the ```numpy.concatenate()``` function. This function accepts:

* A list of ndarrays as the first, unnamed parameter.
* An integer for the ```axis``` parameter, where ```0``` will add rows and ```1``` will add columns.

The ```numpy.concatenate()``` function requires that each array have the same shape, excepting the dimension corresponding to ```axis```. Let's look at an example to understand more precisely how that works. We have two arrays, ```ones``` and ```zeros```:
```python
>>> print(ones)

    [[ 1  1  1]
     [ 1  1  1]]

>>> print(zeros)

    [ 0  0  0]```
    
Let's try to use ```numpy.concatenate()``` to add ```zeros``` as a row. Because we want to add a row, we use ```axis=0```:

```python
>>> combined = np.concatenate([ones,zeros],axis=0)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: all the input arrays must have same number of dimensions```
    
We've got an error because our dimensions don't match - let's look at the shape of each array to see if we can understand why:

```python
>>> print(ones.shape)

    (2, 3)

>>> print(zeros.shape)

    (3,)```

Because we're using ```axis=0```, our shapes have to match across all dimensions except the first. If we look at these two arrays, we can see that the second dimension of ```ones``` is ```3```, but ```zeros``` doesn't have a second dimension, because it's only a 1D array. This is the source of our error. The table below shows the shapes we need to be able to combine these arrays.


|Object|Current shape|Desired Shape|
|------|------|------|
|```ones```|```(2, 3)```|```(2, 3)```|
|```zeros```|```(3,)```|```(1, 3)```|

In order to adjust the shape of ```zeros```, we can use the ```numpy.expand_dims()``` function. You might like to follow these steps in the console. We'll start by passing ```axis=0``` because we want to convert our 1D array into a 2D array representing a row:

```python
>>> zeros_2d = np.expand_dims(zeros,axis=0)

>>> print(zeros_2d)

    [[ 0  0  0]]

>>> print(zeros_2d.shape)

    (1, 3)```
    
Finally, we can use ```numpy.concatenate()``` to combine the two arrays:
```python
>>> combined = np.concatenate([ones,zeros_2d],axis=0)

>>> print(combined)

    [[ 1  1  1]
     [ 1  1  1]
     [ 0  0  0]]```
     
Adding a column is done the same way, except substituting ```axis=1``` for ```axis=0``` in both functions. The initial code for this screen shows this process.
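
As a rough sketch of the column case, the example below builds its own small arrays. These ```ones``` and ```zeros``` are assumptions for illustration: here ```zeros``` has two values (one per row of ```ones```), unlike the three-value ```zeros``` used above.

```python
import numpy as np

ones = np.ones((2, 3), dtype=int)   # shape (2, 3)
zeros = np.zeros(2, dtype=int)      # shape (2,), one value per row of `ones`

# axis=1 turns the 1D array into a single column: (2,) -> (2, 1)
zeros_2d = np.expand_dims(zeros, axis=1)
print(zeros_2d.shape)  # (2, 1)

# axis=1 adds the new array as a column
combined = np.concatenate([ones, zeros_2d], axis=1)
print(combined)
# [[1 1 1 0]
#  [1 1 1 0]]
```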

## Instructions

* Expand the dimensions of ```trip_mph``` to be a single column in a 2D ndarray, and assign the result to ```trip_mph_2d```.
* Add ```trip_mph_2d``` as a new column at the end of ```taxi```, assigning the result back to ```taxi```.
* Use the ```print()``` function to display ```taxi``` and view the new column.

In [43]:
# These `ones` and `zeros` variables
# are different from the ones in the
# main lesson example
"""
print(ones)
print(zeros)
print() # creates a space in our output

print(ones.shape)
print(zeros.shape)
print()

zeros_2d = np.expand_dims(zeros,axis=1)
print(zeros_2d)
print(zeros_2d.shape)
print()

combined = np.concatenate([ones,zeros_2d],axis=1)
print(combined)
print()
"""
# the `trip_mph` variable is still available from the
# previous screen
# your Code below
trip_mph_2d = np.expand_dims(trip_mph,axis=1)
taxi = np.concatenate([taxi,trip_mph_2d],axis=1)
print(taxi.shape)
print(trip_mph_2d.shape)

(89560, 16)
(89560, 1)


# 11. Sorting ndarrays

## Learn
Now that we've added our ```trip_mph``` column to our array, our next step is to sort the array. For this, we'll use the ```numpy.argsort()``` function, which returns the indices that would sort an array. Don't worry if that sounds a little unusual; we'll look at an example to help explain it.

We'll start by defining a simple 1D ndarray, where each item is a string containing the name of a fruit:
<img src="argsort_1.svg" />

We've put the indices, or index numbers, next to each value in the array. We use the indices whenever we want to select an item, for instance ```fruit[2]``` would return the value ```'apple'``` and ```fruit[1]``` would return the value ```'banana'```. As we learned earlier in the mission, if we selected using a list of values like ```fruit[[2,1]]```, we would get back an ndarray of those values in the order: ```['apple','banana']```.

Next, we'll use ```numpy.argsort()``` to return the indices that would sort the array:
<img src="argsort_2.svg" />

If we look at these indices carefully, we can see what has happened. The first value of ```sorted_order``` is ```2```: the value at index ```2``` of ```fruit``` is ```'apple'```, the first item if we sort in alphabetical order. The second value is ```1```: the value at index ```1``` of ```fruit``` is ```'banana'```, the second item if we sort in alphabetical order, and so on.

If we use the array of sorted indices to select items from ```fruit```, here is what we get:
<img src="argsort_3.svg" />

In the code above, the values from ```sorted_order``` get inserted between the brackets. The code is the equivalent of:
```python
sorted_fruit = fruit[[2, 1, 4, 3, 0]]```
As you can see, the result is that our original array has been sorted in alphabetical order.
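
The whole sequence can be sketched in text form as follows. Only ```'apple'``` (index 2) and ```'banana'``` (index 1) are taken from the description above; the other three fruit names are assumptions chosen so that the sorted indices come out as ```[2, 1, 4, 3, 0]```:

```python
import numpy as np

# 'apple' and 'banana' sit at the indices described above;
# the other three names are assumed for illustration
fruit = np.array(['orange', 'banana', 'apple', 'grape', 'cherry'])

# indices that would put `fruit` in alphabetical order
sorted_order = np.argsort(fruit)
print(sorted_order)  # [2 1 4 3 0]

# use those indices to select the items in sorted order
sorted_fruit = fruit[sorted_order]
print(sorted_fruit)  # ['apple' 'banana' 'cherry' 'grape' 'orange']
```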

Let's look at an example with a 2D ndarray. We'll be sorting a 5x5 ndarray called ```int_square``` by its last column:

```python
>>> print(int_square)

    [[5 2 8 3 4]
     [2 8 6 2 5]
     [1 6 2 7 7]
     [0 7 7 4 5]
     [5 7 1 1 2]]```

We'll start by selecting just the last column.

```python
>>> last_column = int_square[:,4]

>>> print(last_column)

    [4 5 7 5 2]```
    
Then, we use ```numpy.argsort()``` to get the indices that would sort the last column and assign them to ```sorted_order```.

```python
>>> sorted_order = np.argsort(last_column)

>>> print(sorted_order)

    [4 0 1 3 2]```

As a test, let's use ```sorted_order``` to sort just the last column:
```python
>>> last_column_sorted = last_column[sorted_order]

>>> print(last_column_sorted)

    [2 4 5 5 7]```
    
Finally, we can use ```sorted_order``` to sort the full ndarray:
```python
>>> int_square_sorted = int_square[sorted_order]

>>> print(int_square_sorted)

    [[5 7 1 1 2]
     [5 2 8 3 4]
     [2 8 6 2 5]
     [0 7 7 4 5]
     [1 6 2 7 7]]```

We can use the same technique to sort our ```taxi``` ndarray by the ```trip_mph``` column. NumPy only supports sorting in ascending order, but that is not a problem: we'll just look at the last few rows instead of the first few rows to examine the data we need.
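
If you do want the largest values first, one common workaround is to reverse the array of sorted indices with a slice. The sketch below reuses ```int_square``` from the example above; the names ```descending_order``` and ```int_square_desc``` are introduced here for illustration only:

```python
import numpy as np

int_square = np.array([[5, 2, 8, 3, 4],
                       [2, 8, 6, 2, 5],
                       [1, 6, 2, 7, 7],
                       [0, 7, 7, 4, 5],
                       [5, 7, 1, 1, 2]])

# ascending indices for the last column
sorted_order = np.argsort(int_square[:, 4])

# reverse them to get descending order
descending_order = sorted_order[::-1]
int_square_desc = int_square[descending_order]

print(int_square_desc[:, 4])  # [7 5 5 4 2]
```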

## Instructions

* Use ```numpy.argsort()``` to get the indices which would sort the ```trip_mph``` column from the ```taxi``` ndarray. The ```trip_mph``` column is at column index ```15```.
* Use the indices from the previous instruction to sort the ```taxi``` ndarray, and assign the result to ```taxi_sorted```.
* Use the ```print()``` function to examine the ```taxi_sorted``` ndarray.

In [45]:
#code below
argsorted = np.argsort(trip_mph)
taxi_sorted = taxi[argsorted]
taxi_sorted

array([[2.016e+03, 6.000e+00, 2.800e+01, ..., 7.000e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 3.000e+00, 3.000e+00, ..., 6.230e+01, 1.000e+00,
        0.000e+00],
       [2.016e+03, 4.000e+00, 6.000e+00, ..., 3.300e+00, 4.000e+00,
        0.000e+00],
       ...,
       [2.016e+03, 3.000e+00, 2.800e+01, ..., 4.300e+00, 2.000e+00,
        3.204e+04],
       [2.016e+03, 2.000e+00, 1.300e+01, ..., 3.300e+00, 2.000e+00,
        7.056e+04],
       [2.016e+03, 1.000e+00, 2.200e+01, ..., 3.300e+00, 2.000e+00,
        8.280e+04]])

# 12. Analyzing Trips with High Average Speeds

## Learn
Below are the last 10 rows of our ```taxi_sorted``` ndarray, with ```trip_mph``` values ranging between 15,570 and 82,800:

|pickup_year|pickup_month|pickup_day|pickup_dayofweek|pickup_time|pickup_location_code|dropoff_location_code|trip_distance|trip_length|fare_amount|fees_amount|tolls_amount|tip_amount|total_amount|payment_type|trip_mph|
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|
|2016.0|2.0|19.0|5.0|4.0|2.0|2.0|17.3|4.0|2.5|1.8|0.0|0.00|4.30|2.0|15570.0|
|2016.0|6.0|6.0|1.0|0.0|2.0|2.0|18.7|4.0|2.5|1.3|0.0|0.00|3.80|3.0|16830.0|
|2016.0|4.0|12.0|2.0|4.0|2.0|2.0|19.8|4.0|2.5|1.8|0.0|0.00|4.30|2.0|17820.0|
|2016.0|4.0|24.0|7.0|5.0|3.0|3.0|16.9|3.0|52.0|0.8|0.0|0.00|52.80|3.0|20280.0|
|2016.0|6.0|30.0|4.0|3.0|2.0|2.0|27.1|4.0|75.0|0.8|0.0|0.00|75.80|2.0|24390.0|
|2016.0|3.0|23.0|3.0|2.0|2.0|2.0|16.7|2.0|52.0|0.8|0.0|10.55|63.35|1.0|30060.0|
|2016.0|3.0|30.0|3.0|4.0|2.0|2.0|17.2|2.0|2.5|1.8|0.0|0.00|4.30|2.0|30960.0|
|2016.0|3.0|28.0|1.0|4.0|3.0|3.0|17.8|2.0|2.5|1.8|0.0|0.00|4.30|2.0|32040.0|
|2016.0|2.0|13.0|6.0|3.0|2.0|2.0|19.6|1.0|2.5|0.8|0.0|0.00|3.30|2.0|70560.0|
|2016.0|1.0|22.0|5.0|3.0|2.0|2.0|23.0|1.0|2.5|0.8|0.0|0.00|3.30|2.0|82800.0|

There is no discernible pattern to the date or time of the trips with unrealistic average speeds. All of them are very short rides: each has a ```trip_length``` of 4 seconds or less, which does not reconcile with the trip distances, all of which are more than 16 miles.
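
To see how such speeds arise, we can redo the arithmetic for the last three rows of the table. This sketch assumes ```trip_mph``` is distance in miles divided by time in hours, which is consistent with the values shown; the two arrays are typed in by hand from the table:

```python
import numpy as np

# trip_distance (miles) and trip_length (seconds) for the last three rows
trip_distance = np.array([17.8, 19.6, 23.0])
trip_length = np.array([2.0, 1.0, 1.0])

# miles per hour = miles / hours, with 3600 seconds in an hour
trip_mph = trip_distance / (trip_length / 3600)
print(trip_mph.round())  # [32040. 70560. 82800.]
```

A trip of 23 miles recorded as taking 1 second works out to 82,800 mph.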

In each of these rows, the ```pickup_location_code``` and ```dropoff_location_code``` are identical. This might suggest that the machines that record the data fall back on the last known GPS signal when they can't find the location, so if a driver starts and finishes a fare quickly, the machine records an accurate time alongside inaccurate location data.

In any case, it's safe to say that the data in these rows is bad, and needs to be removed before any further analysis is performed. We'll look at how to do this in the next mission.

# 13. Next Steps

## Learn

In this mission we learned:

* How vectorization makes our code faster.
* About n-dimensional arrays, and NumPy's ndarrays.
* How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
* How to use vector math to apply simple calculations to entire ndarrays.
* How to use vectorized methods to perform calculations across either axis of ndarrays.
* How to add extra columns and rows to ndarrays.
* How to sort an ndarray.

In the next mission, we'll continue to work with the NYC taxi data as we learn about **Boolean indexing**, one of the most powerful tools when working with data in NumPy and pandas.