### Introduction to Ndarrays

The core data structure in NumPy is the __ndarray__ or __n-dimensional array__. In programming, __array__ describes a collection of elements, similar to a list. The word __n-dimensional__ refers to the fact that ndarrays can have one or more dimensions. We'll start by working with one-dimensional (1D) ndarrays.

__Syntax:__

_import numpy as np_

We can directly convert a list to an ndarray using the _numpy.array()_ constructor. To create a 1D ndarray, we pass in a single list.

In [1]:
# Importing the library
import numpy as np

# Create a NumPy array
data_ndarray = np.array([10, 20, 30])
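
As a quick check, the sketch below inspects the array created above. The name `data_ndarray` comes from the cell above; `ndim` and `shape` are standard ndarray attributes.

```python
import numpy as np

data_ndarray = np.array([10, 20, 30])

# The constructor returns a numpy.ndarray object
print(type(data_ndarray))

# A single list produces a one-dimensional array with three elements
print(data_ndarray.ndim)    # 1
print(data_ndarray.shape)   # (3,)
```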

### Understanding Vectorization
Example: read eight rows of data (two columns each) and calculate the sum of each row.

Using regular Python code with a list of lists and a for loop, our computer processes the rows one at a time. In this simplified model, that takes eight processor cycles for the eight rows of our data.

The NumPy library takes advantage of a processor feature called __Single Instruction Multiple Data (SIMD)__ to process data faster. SIMD allows a processor to perform the same operation on multiple data points in a single processor cycle.

As a result, if the processor can handle four rows per instruction, the NumPy version of our code would take only two processor cycles, a fourfold speed-up. This concept of replacing for loops with operations applied to multiple data points at once is called __vectorization__, and ndarrays make vectorization possible.
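
The following is a minimal sketch of the eight-row example, comparing a for loop with the vectorized equivalent. The sample values and variable names are made up for illustration. The code shows only that the two approaches give the same result; it does not measure processor cycles.

```python
import numpy as np

# Hypothetical sample data: eight rows, two columns
rows = [[6, 5], [1, 3], [5, 6], [1, 4], [3, 7], [5, 8], [3, 5], [8, 4]]

# Loop version: one addition per row, eight iterations in total
loop_sums = []
for first, second in rows:
    loop_sums.append(first + second)

# Vectorized version: build one array per column, then add them in a single expression
first_col = np.array([row[0] for row in rows])
second_col = np.array([row[1] for row in rows])
vector_sums = first_col + second_col

print(loop_sums)     # [11, 4, 11, 5, 10, 13, 8, 12]
print(vector_sums)   # same values, stored in an ndarray
```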

### NYC Taxi-Airport Data

source: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

For this project, we'll only work with a subset of this data - approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the data set:

- _pickup_year_: The year of the trip.
- _pickup_month_: The month of the trip (January is 1, December is 12).
- _pickup_day_: The day of the month of the trip.
- _pickup_location_code_: The airport or borough where the trip started.
- _dropoff_location_code_: The airport or borough where the trip finished.
- _trip_distance_: The distance of the trip in miles.
- _trip_length_: The length of the trip in seconds.
- _fare_amount_: The base fare of the trip, in dollars.
- _total_amount_: The total amount charged to the passenger, including all fees, tolls and tips.

To convert the data set into a 2D ndarray, we'll first use Python's built-in __csv__ module to import our CSV as a "list of lists". Then, we'll convert the list of lists to an ndarray. We'll again use the _numpy.array()_ constructor, but to create a 2D ndarray, we'll pass in our list of lists instead of a single list.

In [2]:
# Library imports
import csv
import numpy as np

# read dataset into list of lists (the with block closes the file for us)
with open('data/nyc_taxis.csv', newline='') as input_file:
    read_file = csv.reader(input_file)
    taxi_list = list(read_file)

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to float
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)
    
# Use the NumPy array constructor
taxi = np.array(converted_taxi_list)
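
To try the same steps without the data file, here is a small sketch that runs on an in-memory CSV. The three-column sample text is hypothetical and only stands in for `data/nyc_taxis.csv`. The nested list comprehension is one possible shorthand for the nested for loops above.

```python
import csv
import io
import numpy as np

# Hypothetical sample standing in for the real CSV file
sample_csv = "pickup_year,pickup_month,trip_distance\n2016,1,21.0\n2016,1,16.29\n"

# csv.reader accepts any iterable of lines, including a StringIO object
all_rows = list(csv.reader(io.StringIO(sample_csv)))

# Remove the header row
body_rows = all_rows[1:]

# Convert every string value to float
converted = [[float(item) for item in row] for row in body_rows]

# A list of lists becomes a 2D ndarray
sample = np.array(converted)
print(sample.shape)   # (2, 3)
print(sample.dtype)   # float64
```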

### Array Shapes

It's often useful to know the number of rows and columns in an ndarray. When we can't easily print the entire ndarray, we can use the ndarray.shape attribute instead.

The data type returned is called a tuple. Tuples are very similar to Python lists, but can't be modified.

For a 2D ndarray such as _taxi_, the output gives us two important pieces of information:
- The first number tells us the number of rows in _taxi_.
- The second number tells us the number of columns in _taxi_.

In [3]:
# Assign the array shape to a new variable
taxi_shape = taxi.shape
taxi_shape

(89560, 15)
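
For comparison, the sketch below prints the shape of a small 1D array and a small 2D array. Both arrays are made up for illustration.

```python
import numpy as np

one_d = np.array([10, 20, 30])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

# A 1D array has a one-element shape tuple
print(one_d.shape)   # (3,)

# A 2D array reports (rows, columns)
print(two_d.shape)   # (2, 3)

# Because the shape is a tuple, it can be unpacked into separate variables
n_rows, n_cols = two_d.shape
```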

### Selecting and Slicing Rows and Items from ndarrays

For any 2D array, the full syntax for selecting data is:

__ndarray[row_index, column_index]__

Or, if you want to select all columns for a given set of rows:

__ndarray[row_index]__

Where row_index defines the location along the row axis and column_index defines the location along the column axis.

#### Selection:
- With a list of lists, we use two separate pairs of square brackets back-to-back.
- With a NumPy ndarray, we use a single pair of brackets with comma-separated row and column locations.
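
The two selection styles can be sketched side by side. The small data set here is hypothetical.

```python
import numpy as np

data_list = [[5, 2, 8], [4, 9, 1], [7, 3, 6]]
data_array = np.array(data_list)

# List of lists: two pairs of brackets, back-to-back
from_list = data_list[1][2]

# ndarray: one pair of brackets, row and column separated by a comma
from_array = data_array[1, 2]

print(from_list, from_array)   # 1 1
```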

In [4]:
# Select the row at index 0
row_0 = taxi[0]

# Select every column for the rows at indexes 391 to 500 inclusive.
rows_391_to_500 = taxi[391:501]

# Select the item at row index 21 and column index 5
row_21_column_5 = taxi[21, 5]

### Selecting Columns and Custom Slicing ndarrays

In [5]:
# Select every row for the columns at indexes 1, 4, and 7
cols = [1, 4, 7]
columns_1_4_7 = taxi[:,cols]

# Select the columns at indexes 5 to 8 inclusive for the row at index 99
row_99_columns_5_to_8 = taxi[99, 5:9]

# Select the rows at indexes 100 to 200 inclusive for the column at index 14
rows_100_to_200_column_14 = taxi[100:201, 14]
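
The same selection patterns are easier to verify on a small array. The sketch below uses a made-up 4x5 grid built with `np.arange` so that each result can be checked by eye.

```python
import numpy as np

# Values 0..19 arranged as 4 rows and 5 columns
grid = np.arange(20).reshape(4, 5)

# A single column: the colon selects every row
single_col = grid[:, 1]          # [ 1  6 11 16]

# Several non-adjacent columns: pass a list of column indexes
some_cols = grid[:, [0, 2, 4]]   # shape (4, 3)

# A 2D slice: rows 1 to 2 and columns 2 to 4 (the end index is excluded)
block = grid[1:3, 2:5]           # [[ 7  8  9] [12 13 14]]

print(single_col)
print(some_cols.shape)
print(block)
```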

### Vector Math

NumPy ndarrays allow us to select data much more easily. Beyond this, calculations on the data we select are a lot faster with __vectorized operations__, because the operations are applied to multiple data points at once.

When we add two 1D ndarrays of the same shape, the values are added element by element, and the result is a 1D ndarray of the same shape (or dimensions) as the originals.
- In this context, ndarrays can also be called __vectors__, a term taken from a branch of mathematics called linear algebra.
- Adding two vectors together is called __vector addition__.

In [6]:
fare_amount = taxi[:, 9]
fees_amount = taxi[:, 10]

fare_and_fees = fare_amount + fees_amount
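
Here is a minimal sketch of vector addition on two short, made-up vectors, showing that the addition is element by element and that the shape is preserved.

```python
import numpy as np

# Hypothetical fares and fees for three trips
fares = np.array([52.0, 45.0, 36.5])
fees = np.array([0.5, 1.5, 1.5])

# Element-by-element addition: fares[0] + fees[0], fares[1] + fees[1], ...
totals = fares + fees

print(totals)         # [52.5 46.5 38. ]
print(totals.shape)   # (3,)
```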

Let's use the columns _trip_distance_ & _trip_length_ to calculate the average travel speed of each trip in miles per hour. The formula for calculating miles per hour is:

$$
\text{miles per hour (mph)} = \frac{\text{distance in miles}}{\text{length in hours}}
$$

In [7]:
trip_distance_miles = taxi[:, 7]
trip_length_seconds = taxi[:, 8]

trip_length_hrs = trip_length_seconds / 3600   # 3600 seconds is 1 hour

# Calculate miles per hour
trip_mph = trip_distance_miles / trip_length_hrs

trip_mph

array([37.11340206, 38.58157895, 31.27222982, ..., 22.29907867,
       42.41551247, 36.90473407])
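
To make the arithmetic easy to follow, the sketch below repeats the calculation on three made-up trips whose speeds can be worked out by hand.

```python
import numpy as np

# Hypothetical trips: distance in miles and duration in seconds
distance_miles = np.array([15.0, 40.0, 5.0])
length_seconds = np.array([1800.0, 3600.0, 900.0])

# Dividing an array by a single number applies the division to every element
length_hours = length_seconds / 3600   # [0.5  1.   0.25]

# Dividing one array by another works element by element
mph = distance_miles / length_hours

print(mph)   # [30. 40. 20.]
```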

### Calculating Statistics for 1D ndarrays

_trip_mph_ is a 1D ndarray of the average mile-per-hour speed of each trip.

In [8]:
# calculate the minimum value of trip_mph
mph_min = trip_mph.min()

# calculate the maximum value of trip_mph
mph_max = trip_mph.max()

# calculate the average value of trip_mph
mph_mean = trip_mph.mean()

print("Min mph: ", mph_min)
print("Max mph: ", mph_max)
print("Mean mph: ", mph_mean)

Min mph:  0.0
Max mph:  82800.0
Mean mph:  32.24258580925573


A speed of 82,800 mph is clearly impossible in New York traffic. That's almost 20x faster than the fastest plane in the world! This could be due to an error in the devices that record the data, or perhaps errors made somewhere in the data pipeline.

Before we look at other array methods, let's review the difference between methods and functions. __Functions__ act as standalone segments of code that usually take an input, perform some processing, and return some output. For example, we can use the __len()__ function to calculate the length of a _list_ or the number of characters in a _string_.

In contrast, __methods__ are special functions that belong to a specific type of _object_. This means that, for instance, when we work with list objects, there are special functions or _methods_ that can only be used with lists. For example, we can use the __list.append()__ method to add an item to the end of a list. If we try to use that method on a _string_, we will get an error!

In NumPy, sometimes there are operations that are implemented as both methods and functions, which can be confusing.

| Calculation | Function Representation | Method Representation |
| ----------- | ----------------------- | --------------------- |
| Calculate the minimum value of trip_mph | np.min(trip_mph) | trip_mph.min() |
| Calculate the maximum value of trip_mph | np.max(trip_mph) | trip_mph.max() |
| Calculate the mean average value of trip_mph | np.mean(trip_mph) | trip_mph.mean() |
| Calculate the median average value of trip_mph | np.median(trip_mph) | There is no ndarray median method |

To remember the right terminology, anything that starts with __np__ (e.g. _np.mean()_) is a function and anything expressed with an object (or variable) name first (e.g. _trip_mph.mean()_) is a method. When both exist, it's up to you to decide which to use, but it's much more common to use the method approach.
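
The table can be checked with a short sketch. The `speeds` array is made up for illustration; the functions and methods are the ones listed in the table.

```python
import numpy as np

speeds = np.array([30.0, 40.0, 20.0, 50.0])

# Function form and method form return the same values
print(np.min(speeds), speeds.min())     # 20.0 20.0
print(np.max(speeds), speeds.max())     # 50.0 50.0
print(np.mean(speeds), speeds.mean())   # 35.0 35.0

# The median is only available as a function
print(np.median(speeds))                # 35.0
print(hasattr(speeds, "median"))        # False
```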

### Calculating Statistics For 2D ndarrays

If we use the _ndarray.max()_ method on a 2D ndarray without any additional parameters, it will return a single value, just like with a 1D array.

But what if we wanted to find the maximum value of each row?
- We'd need to use the __axis__ parameter and specify a value of __1__ to indicate we want to calculate the maximum value for each row.

If we want to find the maximum value of each column, we'd use an __axis__ value of __0__.
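
A small, made-up 2x3 array illustrates the three cases: no axis, `axis=1` (one result per row), and `axis=0` (one result per column).

```python
import numpy as np

values = np.array([[1, 9, 3],
                   [7, 2, 8]])

# No axis: a single maximum over the whole array
overall_max = values.max()         # 9

# axis=1: the maximum of each row
row_max = values.max(axis=1)       # [9 8]

# axis=0: the maximum of each column
column_max = values.max(axis=0)    # [7 9 8]

print(overall_max, row_max, column_max)
```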

In [9]:
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]

# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13]

# calculate the sum of each row in fare_components
fare_sums = fare_components.sum(axis = 1)

# Extract the 14th column in taxi_first_five
fare_totals = taxi_first_five[:, 13]  # Index is 13

print(fare_totals, fare_sums)

[69.99 54.3  37.8  32.76 18.8 ] [69.99 54.3  37.8  32.76 18.8 ]
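
Rather than comparing the two printed arrays by eye, one option is to let NumPy do the comparison. The sketch below uses hypothetical fare values and `np.allclose`, which tolerates tiny floating-point differences. This is an optional extra check, not part of the original analysis.

```python
import numpy as np

# Hypothetical fare components for two trips: fare, fees, tolls, tip
components = np.array([[50.0, 0.5, 5.0, 10.0],
                       [40.0, 0.5, 0.0, 8.0]])

# Hypothetical recorded totals for the same trips
recorded_totals = np.array([65.5, 48.5])

# Sum each row, then check that the sums match the recorded totals
row_sums = components.sum(axis=1)
totals_match = np.allclose(row_sums, recorded_totals)

print(row_sums)       # [65.5 48.5]
print(totals_match)   # True
```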


### Summary:

We learned the following:
- How vectorization makes our code faster.
- About n-dimensional arrays, and NumPy's ndarrays.
- How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
- How to apply simple calculations to entire ndarrays.
- How to use vectorized methods to perform calculations across either axis of ndarrays.