## About This Notebook 
In this **Introduction to NumPy** chapter, we will learn:
- How vectorization makes our code faster.
- About n-dimensional arrays, and NumPy's ndarrays.
- How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
- How to apply simple calculations to entire ndarrays.
- How to use vectorized methods to perform calculations across any axis of an ndarray.
***
## 1. Introduction

We have finished the fundamentals of Python programming in the previous two courses. In this course, we'll build on that knowledge to learn data analysis with some of the most powerful Python libraries for working with data.

Have you ever wondered why the Python language is so popular? One straightforward answer is that Python makes writing programs easy. Python is a **high-level language**, which means we don't need to worry about allocating memory or choosing how our computers' processors carry out certain operations, as we do when we use a **low-level language** such as C. It usually takes more time to write code in a low-level language; however, it also gives us more ability to optimize the code so that it runs faster.

We have two Python libraries that enable us to write code efficiently without sacrificing performance: <b>NumPy</b> and <b>pandas</b>.

Now let's take a closer look at NumPy.

## 2. Introduction to Ndarrays

The core data structure in NumPy is the <b>ndarray</b>, or <b>n-dimensional array</b>. In data science, <b>array</b> describes a collection of elements, similar to a list. The word <b>n-dimensional</b> refers to the fact that ndarrays can have one or more dimensions. Let's begin by working with one-dimensional (1D) ndarrays.

In order to use the NumPy library, the first step is to import numpy into our Python environment like this:

````python
import numpy as np
````
Note that ``np`` is the common alias for numpy.

With the NumPy library, a list can be converted directly to an ndarray using the `numpy.array()` [constructor](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html).
How can we create a 1D ndarray? Look at the code below:

````python
data_ndarray = np.array([5, 10, 15, 20])
````

Have you noticed that we used the syntax `np.array()` instead of `numpy.array()`? This is because we imported numpy under the alias `np`:
````python
import numpy as np
````
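As a quick check, the following minimal sketch confirms that `np.array()` turns a list into an ndarray and that the result is one-dimensional. The variable name `data_list` is illustrative only and is not part of the exercises.

```python
import numpy as np

# A plain Python list and the ndarray built from it
data_list = [5, 10, 15, 20]
data_ndarray = np.array(data_list)

print(type(data_list))      # <class 'list'>
print(type(data_ndarray))   # <class 'numpy.ndarray'>
print(data_ndarray.ndim)    # 1 (the number of dimensions)
```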
Now, let's do some exercises creating 1D ndarrays.

### Task 3.1.2:
1. Import `numpy` and assign it to the alias `np`.
2. Create a NumPy ndarray from the list `[10, 20, 30]`. Assign the result to the variable `data_ndarray`.

In [0]:
# Start your code below:


## 3. NYC Taxi-Airport Data

So far we've only created one-dimensional ndarrays. However, ndarrays can also be two-dimensional.
To illustrate this, we will analyze New York City taxi trip data released by the city of New York.

Our dataset is stored in a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values) called <b>nyc_taxis.csv</b>. To convert the data set into a 2D ndarray, we'll first use Python's built-in csv [module](https://docs.python.org/3/library/csv.html) to import our CSV as a "list of lists". Then we can convert the list of lists to an ndarray. Assuming our list of lists is stored as `data_list`, the conversion looks like this:
````python
data_ndarray = np.array(data_list)
````
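To see the whole pipeline on a small scale, here is a sketch that reads CSV text into a list of lists and converts it to a 2D ndarray. It uses an in-memory string with made-up values in place of a real file, so the column names and numbers are assumptions for illustration only.

```python
import csv
import io
import numpy as np

# A tiny stand-in for a CSV file (made-up values, not from nyc_taxis.csv)
csv_text = "a,b\n1.5,2\n3,4.5\n"

# csv.reader yields each row as a list of strings
rows = list(csv.reader(io.StringIO(csv_text)))

# Skip the header row and convert every item to a float
data_list = [[float(item) for item in row] for row in rows[1:]]

data_ndarray = np.array(data_list)
print(data_ndarray.ndim)   # 2
```

Because every inner list has the same length, NumPy can arrange the values in a rectangular grid of rows and columns.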

Below is the information about selected columns from the data set:
- `pickup_year`: The year of the trip.
- `pickup_month`: The month of the trip (January is 1, December is 12).
- `pickup_day`: The day of the month of the trip.
- `pickup_location_code`: The airport or borough where the trip started.
- `dropoff_location_code`: The airport or borough where the trip finished.
- `trip_distance`: The distance of the trip in miles.
- `trip_length`: The length of the trip in seconds.
- `fare_amount`: The base fare of the trip, in dollars.
- `total_amount`: The total amount charged to the passenger, including all fees, tolls and tips.

### Task 3.1.3 (IMPORTANT):
We have used Python's csv module to import the nyc_taxis.csv file and convert it to a list of lists containing float values.

1. Add a line of code using the `numpy.array()` constructor to convert the `converted_taxi_list` variable to a NumPy ndarray.
2. Assign the result to the variable name `taxi`.

In [0]:
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("../../../Data/nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)
    
# Start your code below:


## 4. Array Shapes

If we want, we can use the `print()` function to take a look at the data in the `taxi` variable.

In [0]:
# the code below only works if you have solved Task 3.1.3
print(taxi)

The ellipses (...) between rows and columns indicate that there is more data in our NumPy ndarray than can easily be printed. To find the number of rows and columns in an ndarray, we can use the `ndarray.shape` attribute like this:

In [0]:
import numpy as np
data_ndarray = np.array([[5, 10, 15], 
                         [20, 25, 30]])
print(data_ndarray.shape)

A **tuple** is returned as the result. Recall from the previous course that a tuple is a type of value that can't be modified.

The output gives us the following information:
1. The first number tells us that there are 2 rows in `data_ndarray`.
2. The second number tells us that there are 3 columns in `data_ndarray`.
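Since the shape is a tuple, it can be unpacked into separate variables. The sketch below does this for the same 2D array and also shows the shape of a 1D array for comparison; the names `n_rows`, `n_cols`, and `one_d` are illustrative.

```python
import numpy as np

data_ndarray = np.array([[5, 10, 15],
                         [20, 25, 30]])

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = data_ndarray.shape
print(n_rows, n_cols)   # 2 3

# A 1D ndarray has a one-element shape tuple
one_d = np.array([5, 10, 15, 20])
print(one_d.shape)      # (4,)
```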

## 5. Selecting and Slicing Rows and Items from ndarrays

The following code compares how we select one or more rows of data from a list of lists and from an ndarray:

#### List of lists method:
````python
# Selecting a single row
sel_lol = data_lol[1]

# Selecting multiple rows
sel_lol = data_lol[2:]
````

#### NumPy method:

````python
# Selecting a single row
sel_np = data_np[1]

# Selecting multiple rows
sel_np = data_np[2:]
````

You see that the syntax of selecting rows in ndarrays is very similar to lists of lists. In fact, the syntax that we wrote above is a kind of shortcut. For any 2D array, the full syntax for selecting data is:

````python
ndarray[row_index,column_index]
````

When you want to select every column for a given set of rows, you can leave out the column index:
````python
ndarray[row_index]
````

Here `row_index` defines the location along the row axis and `column_index` defines the location along the column axis.

As with lists, array slicing runs from the first specified index up to, but **not including**, the second specified index. For example, to select the items at index 1, 2, and 3, we'd need to use the slice `[1:4]`.
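The slicing rule can be checked with a small sketch. The array values here are made up so that each value reveals its own position.

```python
import numpy as np

data_np = np.array([0, 10, 20, 30, 40, 50])

# The slice 1:4 covers indexes 1, 2, and 3; index 4 is excluded
sel_np = data_np[1:4]
print(sel_np)   # [10 20 30]
```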

This is how we **select a single item** from a 2D ndarray:


#### List of lists method

````python
# Selecting a single item
sel_lol = data_lol[1][3]
````

#### NumPy method

````python
# Selecting a single item
sel_np = data_np[1, 3] # The comma separates the row and column locations. Produces a single value.
````

Two separate pairs of square brackets back-to-back are used with a list of lists and a single pair of brackets with comma-separated row and column locations is used with a NumPy ndarray.
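Here is a runnable version of that comparison, using a small made-up grid of numbers as a stand-in for `data_lol` and `data_np`:

```python
import numpy as np

data_lol = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12]]
data_np = np.array(data_lol)

sel_lol = data_lol[1][3]   # two pairs of brackets, back to back
sel_np = data_np[1, 3]     # one pair of brackets, comma-separated
print(sel_lol, sel_np)     # 8 8
```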

### Task 3.1.5:
From the `taxi` ndarray:

1. Select the row at index `0`. Assign it to `row_0`.
2. Select every column for the rows from index `391` up to and including `500`. Assign them to `rows_391_to_500`.
3. Select the item at row index `21` and column index `5`. Assign it to `row_21_column_5`.

In [0]:
# Start your code below:


## 6. Selecting Columns and Custom Slicing ndarrays

Let's take a look at how to select one or more columns of data:


#### List of lists method
````python
# Selecting a single column
sel_lol = []

for row in data_lol:
    col4 = row[3]
    sel_lol.append(col4)

# Selecting multiple columns
sel_lol = []

for row in data_lol:
    col23 = row[1:3]
    sel_lol.append(col23)

# Selecting multiple, specific columns
sel_lol = []

for row in data_lol:
    cols = [row[1], row[3], row[4]]
    sel_lol.append(cols)
````

#### NumPy Method

````python
# Selecting a single column
sel_np = data_np[:, 3] # Produces a 1D ndarray

# Selecting multiple columns
sel_np = data_np[:, 1:3] # Produces a 2D ndarray

# Selecting multiple, specific columns
cols = [1, 3, 4]
sel_np = data_np[:, cols] # Produces a 2D ndarray
````

You see that with a list of lists, we need to use a for loop to extract specific column(s) and append them back to a new list. It is much easier with ndarrays. We again use single brackets with comma-separated row and column locations, but we use a colon (`:`) for the row locations, which gives us all of the rows.
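The three NumPy column selections can be tried on a small made-up array. In this sketch each value encodes its position (the tens digit is the row, the ones digit is the column), which makes it easy to confirm what was selected.

```python
import numpy as np

data_np = np.array([[ 0,  1,  2,  3,  4],
                    [10, 11, 12, 13, 14],
                    [20, 21, 22, 23, 24]])

single_col = data_np[:, 3]          # one column, returned as a 1D ndarray
col_range = data_np[:, 1:3]         # columns 1 and 2, returned as a 2D ndarray
specific = data_np[:, [1, 3, 4]]    # columns 1, 3, and 4, returned as a 2D ndarray

print(single_col)   # [ 3 13 23]
print(single_col.shape, col_range.shape, specific.shape)   # (3,) (3, 2) (3, 3)
```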

If we want to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:


#### List of lists method
````python
# Selecting a 1D slice (row)
sel_lol = data_lol[2][1:4]  # third row (row index 2), columns 1, 2, and 3

# Selecting a 1D slice (column)
sel_lol = []

rows = data_lol[1:5]  # rows 1, 2, 3, and 4
for r in rows:
    col5 = r[4]  # fifth column (column index 4)
    sel_lol.append(col5)
````

#### NumPy Method

````python
# Selecting a 1D slice (row)
sel_np = data_np[2, 1:4] # Produces a 1D ndarray
    
# Selecting a 1D slice (column)
sel_np = data_np[1:5, 4] # Produces a 1D ndarray
````

Lastly, if we want to select a 2D slice, we can use slices for both dimensions:

#### List of lists method

````python
# Selecting a 2D slice 
sel_lol = []

rows = data_lol[1:4]
for r in rows:
    new_row = r[:3]
    sel_lol.append(new_row)
````

#### NumPy method

````python
# Selecting a 2D slice 
sel_np = data_np[1:4,:3]
````
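The 1D and 2D slices above can be verified on a small made-up array, again with values that encode their own row and column. This is only an illustrative sketch; the variable names are not part of the exercises.

```python
import numpy as np

data_np = np.array([[ 0,  1,  2,  3,  4],
                    [10, 11, 12, 13, 14],
                    [20, 21, 22, 23, 24],
                    [30, 31, 32, 33, 34],
                    [40, 41, 42, 43, 44]])

row_slice = data_np[2, 1:4]    # row 2, columns 1 to 3
col_slice = data_np[1:5, 4]    # rows 1 to 4, column 4
block = data_np[1:4, :3]       # rows 1 to 3, columns 0 to 2

print(row_slice)   # [21 22 23]
print(col_slice)   # [14 24 34 44]
print(block.shape) # (3, 3)
```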

### Task 3.1.6:
From the `taxi` ndarray:

1. Select every row for the columns at indexes `1`, `4`, and `7`. Assign them to `columns_1_4_7`.
2. Select the columns at indexes `5` to `8` inclusive for the row at index `99`. Assign them to `row_99_columns_5_to_8`.
3. Select the rows at indexes `100` to `200` inclusive for the column at index `14`. Assign them to `rows_100_to_200_column_14`.

In [0]:
# Start your code below:


## 7. Vector Math
In this section we will explore the power of vectorization. Take a look at the example below:

In [0]:
# convert the list of lists to an ndarray
my_numbers = [[1,2,3],[4,5,6], [7,8,9]] 
my_numbers = np.array(my_numbers)

# select each of the columns - the result
# of each will be a 1D ndarray
col1 = my_numbers[:,0]
col2 = my_numbers[:,1]

# add the two columns
sums = col1 + col2
sums

The code above can be simplified into one line of code, like this:

In [0]:
sums = my_numbers[:,0] + my_numbers[:,1]
sums

Some key takeaways from the code above:
- When we selected each column, we used the syntax `ndarray[:,c]` where `c` is the column index we wanted to select. As we saw in the previous section, the colon selects all rows.
- To add the two 1D ndarrays, `col1` and `col2`, we can simply put the addition operator ``+`` between them.
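For contrast, here is a sketch of the same column addition written with an explicit loop over a list of lists, next to the vectorized version. The name `my_list` is introduced only for this comparison.

```python
import numpy as np

my_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
my_numbers = np.array(my_list)

# Loop version: add the first two values of each row, one row at a time
loop_sums = []
for row in my_list:
    loop_sums.append(row[0] + row[1])

# Vectorized version: add the two columns in a single operation
vector_sums = my_numbers[:, 0] + my_numbers[:, 1]

print(loop_sums)     # [3, 9, 15]
print(vector_sums)   # [ 3  9 15]
```

Both versions give the same numbers, but the vectorized one needs no loop and no intermediate list.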

## 8. Vector Math Continued

The standard Python numeric operators also work with vectors:

- **Addition**: `vector_a + vector_b`
- **Subtraction**: `vector_a - vector_b`
- **Multiplication** (unrelated to the vector multiplication in linear algebra): `vector_a * vector_b`
- **Division**: `vector_a / vector_b`

Note that all these operations are entry-wise.
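A short sketch with two made-up vectors shows what entry-wise means in practice: each operation pairs up the values at the same position.

```python
import numpy as np

vector_a = np.array([10, 20, 30])
vector_b = np.array([2, 4, 5])

print(vector_a + vector_b)   # [12 24 35]
print(vector_a - vector_b)   # [ 8 16 25]
print(vector_a * vector_b)   # [ 20  80 150]
print(vector_a / vector_b)   # [5. 5. 6.]
```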

Below is an example table from our taxi data set:

|trip_distance|trip_length|
|-------------|-----------|
|21.00|2037.0|
|16.29|1520.0|
|12.70|1462.0|
|8.70|1210.0|
|5.56|759.0|

We want to use these columns to calculate the average travel speed of each trip in miles per hour. For this we can use the formula below: <br>
**miles per hour = distance in miles / length in hours**

The current column `trip_distance` is already expressed in miles, but `trip_length` is expressed in seconds. First, we want to convert `trip_length` into hours:

````python
trip_distance = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 
````
Note: 3600 seconds is one hour.

Here, we divide each value in the vector by a single number, 3600, instead of by another vector. Let's see the first five rows of the result below:

|trip_length_hours|
|-------------|
|0.565833|
|0.422222|
|0.406111|
|0.336111|
|0.210833|
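The full speed calculation can be sketched with just the five example rows from the `trip_distance` and `trip_length` table. The arrays are typed in by hand here rather than sliced from `taxi`, so this runs without the data file.

```python
import numpy as np

# The five example rows from the trip_distance / trip_length table
trip_distance_miles = np.array([21.00, 16.29, 12.70, 8.70, 5.56])
trip_length_seconds = np.array([2037.0, 1520.0, 1462.0, 1210.0, 759.0])

# Divide every element by a single number to convert seconds to hours
trip_length_hours = trip_length_seconds / 3600

# Divide one vector by another, entry by entry, to get miles per hour
trip_mph = trip_distance_miles / trip_length_hours

print(trip_length_hours.round(6))
print(trip_mph.round(2))
```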

## 9. Calculating Statistics For 1D ndarrays

We created ``trip_mph`` in the previous exercise. This is a 1D ndarray of the average speed, in miles per hour, of each trip in our dataset. We can also calculate the ``minimum``, ``maximum``, and ``mean`` values for `trip_distance`.

In order to calculate the minimum value of a 1D ndarray, all we need to do is to use the vectorized `ndarray.min()` [method](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html), like this:

````python
distance_min = trip_distance.min()
````

NumPy ndarrays provide several similar methods:
- [ndarray.min()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) to calculate the minimum value
- [ndarray.max()](https://docs.scipy.org/doc/numpy-1.16.1/reference/generated/numpy.ndarray.max.html#numpy.ndarray.max) to calculate the maximum value
- [ndarray.mean()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean) to calculate the mean or average value
- [ndarray.sum()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum) to calculate the sum of the values

You will find the full list of ndarray methods in the NumPy ndarray documentation [here](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation).
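As a minimal sketch, the four methods listed above can be tried on a small made-up array (the name `values` and its contents are illustrative, not taken from the taxi data):

```python
import numpy as np

values = np.array([4, 8, 15, 16, 23, 42])

print(values.min())    # 4
print(values.max())    # 42
print(values.sum())    # 108
print(values.mean())   # 18.0
```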

### Task 3.1.9:
1. Use the `ndarray.max()` method to calculate the maximum value of `trip_distance`. Assign the result to `distance_max`.
2. Use the `ndarray.mean()` method to calculate the average value of `trip_distance`. Assign the result to `distance_mean`.

In [0]:
# Selecting only the relevant column distance
trip_distance = taxi[:,7]
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
trip_mph = trip_distance_miles / trip_length_hours

# Start your code below:


## 10. Calculating Statistics For 1D ndarrays Continued (IMPORTANT)

Let's examine the difference between methods and functions.
<b>Functions</b> act as standalone segments of code that usually take an input, perform some processing, and return some output. Take for example the `len()` function, used to calculate the length of a list or the number of characters in a string:

In [0]:
my_list = [25,18,9]
print(len(my_list))

In [0]:
my_string = 'RBI'
print(len(my_string))

In contrast, <b>methods</b> are special functions that belong to a specific type of object. In other words, when we work with list objects, there are special functions or methods that can only be used with lists. For example, the `list.append()` method is used to add an item to the end of a list. We will get an error if we try to use this method on a string:

In [0]:
my_string.append(' is the best!')

In NumPy, some operations are implemented as both methods and functions. This can be confusing at first glance, so let's take a look at an example.

|Calculation|Function Representation|Method Representation|
|-------------|-----------|---------------|
|Calculate the minimum value of trip_mph|np.min(trip_mph)|trip_mph.min()|
|Calculate the maximum value of trip_mph|np.max(trip_mph)|trip_mph.max()|
|Calculate the mean average value of trip_mph|np.mean(trip_mph)|trip_mph.mean()|
|Calculate the median average value of trip_mph|np.median(trip_mph)|There is no ndarray median method|

To help you remember, you can think of it this way:
- Anything that starts with `np` (e.g. `np.mean()`) is a function.
- Anything expressed with an object (or variable) name first (e.g. `trip_mph.mean()`) is a method.
- It's up to you to decide which one to use.
- However, it is more common to use the method approach.
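The sketch below checks that the function and method forms agree, using a small made-up array named `speeds` in place of `trip_mph`. It also shows that the median is only available as a function.

```python
import numpy as np

speeds = np.array([30.0, 45.0, 25.0, 40.0])   # made-up values

# Function form and method form return the same result
print(np.min(speeds) == speeds.min())     # True
print(np.max(speeds) == speeds.max())     # True
print(np.mean(speeds) == speeds.mean())   # True

# The median exists only as a function
print(np.median(speeds))                  # 35.0
print(hasattr(speeds, "median"))          # False
```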

## 11. Calculating Statistics For 2D ndarrays

We have only worked with statistics for 1D ndarrays so far. If we use the `ndarray.max()` method on a 2D ndarray without any additional parameters, a single value will be returned, just like with a 1D array.

What happens if we want to find the **maximum value of each row**?
We need to specify the `axis` parameter to indicate that we want the maximum value for each row. For a 2D ndarray, this means an axis value of 1: `ndarray.max(axis=1)`.

If we want to find the maximum value of each column, we'd use an axis value of 0 like this:

`ndarray.max(axis=0)`
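A small made-up 2D array makes the effect of the `axis` parameter easier to see. This is an illustrative sketch only; the array `data` is not part of the taxi data set.

```python
import numpy as np

data = np.array([[1, 9, 3],
                 [7, 2, 8]])

print(data.max())          # 9        (a single value for the whole array)
print(data.max(axis=1))    # [9 8]    (one value per row)
print(data.max(axis=0))    # [7 9 8]  (one value per column)
```

The result of `axis=1` has one entry for each of the 2 rows, and the result of `axis=0` has one entry for each of the 3 columns.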

Let's use what we've learned to check the data in our taxi data set. Below is an example table of our data set:

|fare_amount|fees_amount|tolls_amount|tip_amount|total_amount|
|-------------|-----------|--------|-------------|-----------|
|52.0|0.8|5.54|11.65|69.99|
|45.0|1.3|0.00|8.00|54.3|
|36.5|1.3|0.00|0.00|37.8|
|26.0|1.3|0.00|5.46|32.76|
|17.5|1.3|0.00|0.00|18.8|

You see that **total amount = fare amount + fees amount + tolls amount + tip amount**.

Now let's see if you can perform a 2D ndarray calculation on the data set.

### Task 3.1.11:
1. Use the `ndarray.sum()` method to calculate the sum of each row in `fare_components`. Assign the result to `fare_sums`.
2. Extract the 14th column in `taxi_first_five`. Assign to `fare_totals`.
3. Print `fare_totals` and `fare_sums`. You should see the same numbers.

In [0]:
# get the table from above (first five rows of taxi and columns fare_amount, fees_amount, tolls_amount, tip_amount)
fare_components = taxi[:5,[9,10,11,12]]
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]

# Start your code below:
