# 1.0 Access Files in Google Drive


In [0]:
#1. Install a Drive FUSE wrapper google-drive-ocamlfuse.
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa > /dev/null 2>&1
!apt-get update -qq > /dev/null 2>&1
!apt-get -y install -qq google-drive-ocamlfuse fuse

In [0]:
#2. Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()

In [0]:
#3. Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [0]:
#4. Create a directory and mount Google Drive using that directory.
!mkdir -p drive
!google-drive-ocamlfuse drive

In [0]:
# command to unmount the drive
# !fusermount -u drive

In [0]:
import os
os.chdir("/content/drive/EEC1509_MachineLearning/Lesson #02 - Platform")

# 2.0 Introduction to NumPy



## 2.1 Understanding Vectorization

One of the reasons that the Python language is extremely popular is that it makes writing programs easy. When we execute Python code, the Python interpreter converts your code into bytecode that your computer can understand, and then runs that [bytecode](https://en.wikipedia.org/wiki/Bytecode). When you write code in Python, you don't have to worry about things like allocating memory on your computer or choosing how certain operations are done by your computer's processor. Python takes care of that for you.

<img width="500" src="https://drive.google.com/uc?export=view&id=1WSCD15qS89t5di-x-_WjLHwI6-Tj9EeH">

Python is what we call a **high-level language**. High-level languages allow you to write programs faster, because the interpreter decides how to execute your instructions. In contrast, when you use **low-level** languages like C, you define exactly how memory will be managed and how the processor will execute your instructions. This means that coding in a **low-level language** takes longer, but you have more ability to optimize your code to run faster.

| Language Type | Example | Time taken to write program | Control over program performance |
|---------------|---------|-----------------------------|----------------------------------|
| High-Level | Python | Low | Low |
| Low-Level | C | High | High |

When choosing between a high-level and a low-level language, you have to make a trade-off between being able to work quickly and having programs that run quickly and efficiently. Luckily, there are two Python libraries that were created to give us the best of both worlds: **NumPy** and **pandas**. Together, pandas and NumPy provide a powerful toolset for working with data in Python. They allow us to write code quickly without sacrificing performance. But how do they do this? What is it that makes these libraries faster than raw Python? The answer is **vectorization**.


**How Vectorization Makes Code Faster**

Let's look at an example where we have two columns of data. Each row contains two numbers we wish to add together. Using just Python, we would use a list of lists structure to store our data, and use for loops to iterate over that data. Let's see what this would look like as Python code:


<img width="800" src="https://drive.google.com/uc?export=view&id=15rYQH5ne_AhjfSzSzsXdV2AdD7LKRsrl">


When this code is run, the Python interpreter will turn our code into bytecode, following the logic of our **for** loop. In each iteration of our loop, the bytecode asks our computer's processor to add the two numbers together and stores the result. The diagram shows the first calculation our computer's processor would make:

<img width="800" src="https://drive.google.com/uc?export=view&id=10qEvWGmvAHbT1NcqW6D8DjmZzPh08_TZ">


Our computer would take eight processor cycles to process the eight rows of our data.

Vectorization takes advantage of a processor feature called **Single Instruction Multiple Data (SIMD)** to process data faster. Most modern computer processors support SIMD. SIMD allows a processor to perform the same operation, on multiple data points, in a single processor cycle. Let's look at how a vectorized version of our code above might be processed using a SIMD instruction that allows four data points to be processed at once:


<img width="800" src="https://drive.google.com/uc?export=view&id=1DY8rZ_TtTrOOmG4qWgaEJcVE4vlJAhJc">

The vectorized version of our code will only take two processor cycles to process our eight rows of data - a four times speed-up. Vectorized operations might process as few as two and as many as hundreds of operations per processor cycle, depending on the capabilities of the processor and the size of each data point.
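The cycle counts above follow from simple arithmetic. Here is a rough sketch, assuming an idealized processor in which one instruction handles a fixed number of data points per cycle. The function name `simd_cycles` and its parameters are hypothetical, and real processors are considerably more complicated:

```python
import math

def simd_cycles(n_rows, lanes):
    """Idealized number of cycles needed to process n_rows values
    when each instruction handles `lanes` values at once."""
    return math.ceil(n_rows / lanes)

scalar_cycles = simd_cycles(8, 1)      # one addition per cycle
vectorized_cycles = simd_cycles(8, 4)  # four additions per cycle
speed_up = scalar_cycles / vectorized_cycles

print(scalar_cycles, vectorized_cycles, speed_up)
```

Under this simplified model, eight rows take eight cycles one at a time and two cycles four at a time, which is where the four times speed-up comes from.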

The good news is that you don't have to worry about SIMD and processor cycles, because NumPy and pandas take care of this for you. We'll introduce pandas in more detail later in this course, but first we're going to learn about NumPy so we understand the fundamentals of working with vectorized operations.

In the next sections, we'll learn:

- How to work with data using NumPy and pandas objects.
- How to explore and clean data in pandas.
- How to use pandas and NumPy to analyze data quickly and efficiently.

Let's get started.



## 2.2 NYC Taxi-Airport Data

As we learn NumPy, we'll be analyzing taxi trip data released by the city of New York. The city releases data on taxis and for-hire vehicles on the [Taxi and Limousine Commission (TLC) Website](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). The data covers over 1.3 billion individual trips, reaches back as far as 2009, and is regularly updated.

<center>
<img width="400" src="https://drive.google.com/uc?export=view&id=1CjDo9pWq9g7bKjNH_Xo6cLUj17yf-f2f">
</center>


We'll be working with a subset of this data: Yellow taxi trips to and from New York City airports between January and June 2016. In our dataset, each row represents a unique taxi trip. Below is information about selected columns from the data set:

- **pickup_year** - The year of the trip.
- **pickup_month** - The month of the trip (January is 1, December is 12).
- **pickup_day** - The day of the month of the trip.
- **pickup_location_code** - The airport or borough where the trip started, as one of eight categories:
  - 0 - Bronx.
  - 1 - Brooklyn.
  - 2 - JFK Airport.
  - 3 - LaGuardia Airport.
  - 4 - Manhattan.
  - 5 - Newark Airport.
  - 6 - Queens.
  - 7 - Staten Island.
- **dropoff_location_code** - The airport or borough where the trip finished, using the same eight category codes as **pickup_location_code**.
- **trip_distance** - The distance of the trip in miles.
- **trip_length** - The length of the trip in seconds.
- **fare_amount** - The base fare of the trip, in dollars.
- **total_amount** - The total amount charged to the passenger, including all fees, tolls and tips.

You can find information on all columns in the [dataset data dictionary](https://s3.amazonaws.com/dq-content/289/nyc_taxi_data_dictionary.md).

We have [randomly sampled](https://en.wikipedia.org/wiki/Simple_random_sample) approximately 90,000 trips for our analysis, representing one 50th of the trips for the six month period. Our data is stored in a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file called **nyc_taxis.csv**. Here are the first 10 rows of the data set (note that some columns were omitted due to space limitation):

| pickup_year | pickup_month | pickup_day | pickup_dayofweek | pickup_time | pickup_location_code | dropoff_location_code | trip_distance | trip_length | fare_amount | total_amount |
|-------------|--------------|------------|------------------|-------------|----------------------|-----------------------|---------------|-------------|-------------|--------------|
| 2016 | 1 | 1 | 5 | 0 | 2 | 4 | 21.00 | 2037 | 52.0 | 69.99 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 1 | 16.29 | 1520 | 45.0 | 54.30 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 12.70 | 1462 | 36.5 | 37.80 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 8.70 | 1210 | 26.0 | 32.76 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 5.56 | 759 | 17.5 | 18.80 |
| 2016 | 1 | 1 | 5 | 0 | 4 | 2 | 21.45 | 2004 | 52.0 | 105.60 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 8.45 | 927 | 24.5 | 32.25 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 6 | 7.30 | 731 | 21.5 | 22.80 |
| 2016 | 1 | 1 | 5 | 0 | 2 | 5 | 36.30 | 2562 | 109.5 | 131.38 |
| 2016 | 1 | 1 | 5 | 0 | 6 | 2 | 12.46 | 1351 | 36.0 | 37.30 |


This, however, is what the first few lines of raw data in our CSV look like (we are showing only the first four columns from the file to make the format easier to understand):

```python
pickup_year,pickup_month,pickup_day,pickup_dayofweek
2016,1,1,5
2016,1,1,5
2016,1,1,5
2016,1,1,5
```

To start working with this CSV data in NumPy, we first need to import the NumPy library into our Python environment. For this, we use a simple import statement:

```python
import numpy as np
```

We used the **as** syntax in our **import** statement. This allows us to access the NumPy library using another name. When working with NumPy, the convention is to import the library as **np** for brevity.

Next, we'll use Python's built-in [csv module](https://docs.python.org/3/library/csv.html) to import our CSV as a **'list of lists'**.
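As a minimal sketch of what the **csv** module produces, the snippet below reads a small in-memory sample instead of the real file. The `raw_csv` string is a stand-in we made up from the four-column sample shown above, and `io.StringIO` is used only so the example runs without **nyc_taxis.csv** on disk:

```python
import csv
import io

# a small in-memory stand-in for the first lines of the CSV file
raw_csv = (
    "pickup_year,pickup_month,pickup_day,pickup_dayofweek\n"
    "2016,1,1,5\n"
    "2016,1,1,5\n"
)

# csv.reader yields one list of strings per line
rows = list(csv.reader(io.StringIO(raw_csv)))

header = rows[0]
data_rows = rows[1:]

print(header)
print(data_rows)
```

Note that every value comes back as a string, which is why the values need converting to floats before doing any math with them.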

The last step is to convert our list of lists into a NumPy n-dimensional array, or [ndarray](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html). We're going to explain ndarrays in more detail in the next screen, but for now you can think of it as NumPy's version of a list of lists format. To convert from the list type to ndarray, we use the [numpy.array() constructor](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.array.html). Here's an example of how it works:

```python
# our list of lists is stored as data_list
data_ndarray = np.array(data_list)
```

We used the syntax **np.array()** instead of **numpy.array()** because of our **import numpy as np** code. When we introduce a new syntax, we'll always use the full name to describe it, and you'll need to substitute in the shorthand as appropriate.

Let's convert our taxi CSV into a NumPy ndarray!


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

In the code below, we have imported **numpy**, and used Python's **csv** module to import the **nyc_taxis.csv** file and convert it to a **list of lists** containing **float** values.

1. Add a line of code using the **numpy.array()** constructor to convert the **converted_taxi_list** variable to a NumPy ndarray. 
2. Assign the result to the variable name **taxi**.

In [0]:
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# start writing your code below this comment
taxi = np.array(converted_taxi_list)

## 2.3 Understanding NumPy ndarrays

As we mentioned earlier, ndarray stands for 'n-dimensional array'. In programming, array is a term that describes a collection of elements. Even if you haven't heard the term before, you have likely encountered arrays: a list object in Python could be described generically as an array. N-dimensional refers to the fact that ndarrays can have one or more dimensions. Let's look at some visualizations of one, two, and three dimensional arrays and their common names:


<img width="450" src="https://drive.google.com/uc?export=view&id=1Zcmsq84y8NNNNeujYJEO3Nc2gxqkk_m-">



Arrays with more than three dimensions do exist in data science but they're rare. We'll focus on:

- One-dimensional ndarrays (1D ndarrays)
- Two-dimensional ndarrays (2D ndarrays)

Similar to using lists of lists, we use numbers to specify the location of elements of our data that we want to work with. Just like with lists, we call these numbers index values (or collectively, indices).

Unlike with Python lists, every value in an ndarray must be of the same type. For the NYC taxi data set this does not matter, as all the values are float values. We'll talk further about this restriction and how to handle it in a later mission.
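To see the single-type restriction in action, here is a small sketch using made-up values. When the inputs are mixed, NumPy picks one type that can hold all of them:

```python
import numpy as np

# integers mixed with a float are all stored as floats
ints_and_floats = np.array([1, 2, 3.5])
print(ints_and_floats)
print(ints_and_floats.dtype)

# numbers mixed with text are all stored as strings
numbers_and_text = np.array([1, 2.5, "three"])
print(numbers_and_text)
print(numbers_and_text.dtype.kind)  # 'U' stands for a unicode string type
```

This is worth keeping in mind when loading real data: a single text value in a column is enough to turn every number in the array into a string.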

Let's take a look at the data in the taxi variable from the previous screen by printing it using [Python's print() function](https://docs.python.org/3.4/library/functions.html#print):

```python
>>> print(taxi)

    [[ 2016.  1.   1.  ..., 11.65  69.99   1. ]
     [ 2016.  1.   1.  ...,  8.    54.3    1. ]
     [ 2016.  1.   1.  ...,  0.    37.8    2. ]
     ..., 
     [ 2016.  6.  30.  ...,  5.    63.34   1. ]
     [ 2016.  6.  30.  ...,  8.95  44.75   1. ]
     [ 2016.  6.  30.  ...,  0.    54.84   2. ]]
```

At first, this looks identical to a list of lists, with two exceptions:

- Between the third and fourth column of every row there is an ellipsis **(...)**.
- Between the third and fourth row there is another ellipsis.


These ellipses indicate that there is more data in our NumPy ndarray than can easily be printed. NumPy will summarize any ndarray we print if it contains more than 1000 elements. If we want to see how many rows and columns are in our ndarray, we can use the [ndarray.shape attribute](http://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape). You can run this command yourself to check:

```python
>>> taxi.shape
    (89560, 15)
```

The output of the ndarray.shape attribute gives us a few important pieces of information:

- There are two numbers, which tells us that our ndarray is two-dimensional.
    - Note: the data type returned is called a [tuple](https://docs.python.org/3.6/library/stdtypes.html#tuples). Tuples are very similar to Python lists, but are immutable (can't be modified). Tuples are defined and displayed using parentheses **()** rather than brackets **[]**.
- The first number tells us that the first dimension is 89,560 items long, or put another way that there are 89,560 rows in our data set.
- The second number tells us that the second dimension is 15 items long, or put another way that there are 15 columns in our data set.
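The same reading of **ndarray.shape** can be tried on a tiny array. The values below are made up for illustration:

```python
import numpy as np

small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# shape is a tuple with one entry per dimension
n_rows, n_cols = small.shape

print(small.shape)  # (rows, columns)
print(small.ndim)   # number of dimensions
```

Because the shape is a tuple, it can be unpacked directly into row and column counts as shown.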

If we just want to select a number of rows from an ndarray, we can use slicing, just like we would with a list of lists. Here's how we would print the first five rows:

```python
>>> print(taxi[:5])

    [[ 2016  1  1  5  0  2  4  21    2037  52.   0.8  5.54  11.65  69.99   1  ]
     [ 2016  1  1  5  0  2  1  16.29  1520  45.   1.3  0     8    54.3    1  ]
     [ 2016  1  1  5  0  2  6  12.7   1462  36.5  1.3  0     0    37.8    2  ]
     [ 2016  1  1  5  0  2  6   8.7   1210  26.   1.3  0     5.46  32.76   1  ]
     [ 2016  1  1  5  0  2  6   5.56   759  17.5  1.3  0     0    18.8    2  ]]
```

You'll notice that because we have fewer than 1000 items in our output, NumPy does not summarize the data and we can see all 15 columns (although they're harder to see because each wraps onto a new line).
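The summarizing behavior can be checked directly. This sketch assumes NumPy's print options have not been changed from their defaults, in which case the threshold reported by `np.get_printoptions()` should be the 1000 elements mentioned above:

```python
import numpy as np

# the number of elements above which NumPy summarizes printed output
threshold = np.get_printoptions()["threshold"]

at_threshold = str(np.arange(threshold))        # printed in full
over_threshold = str(np.arange(threshold + 1))  # summarized

print(threshold)
print("..." in at_threshold, "..." in over_threshold)
```

An array with exactly as many elements as the threshold is still printed in full. Only arrays with more elements than the threshold are shortened with an ellipsis.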

Let's practice making a slice of multiple rows of our ndarray.

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

This exercise uses the **taxi** variable we created in the previous section.

1. Select the first ten rows of the **taxi** ndarray, and assign the result to a new variable **taxi_ten**.
2. Use **Python's print()** function to display **taxi_ten**.

In [25]:
# put your code here
taxi_ten = taxi[:10]
print(taxi_ten)

[[2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  4.0000e+00 2.1000e+01 2.0370e+03 5.2000e+01 8.0000e-01 5.5400e+00
  1.1650e+01 6.9990e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  1.0000e+00 1.6290e+01 1.5200e+03 4.5000e+01 1.3000e+00 0.0000e+00
  8.0000e+00 5.4300e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 1.2700e+01 1.4620e+03 3.6500e+01 1.3000e+00 0.0000e+00
  0.0000e+00 3.7800e+01 2.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 8.7000e+00 1.2100e+03 2.6000e+01 1.3000e+00 0.0000e+00
  5.4600e+00 3.2760e+01 1.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 2.0000e+00
  6.0000e+00 5.5600e+00 7.5900e+02 1.7500e+01 1.3000e+00 0.0000e+00
  0.0000e+00 1.8800e+01 2.0000e+00]
 [2.0160e+03 1.0000e+00 1.0000e+00 5.0000e+00 0.0000e+00 4.0000e+00
  2.0000e+00 2.1450e+01 2.0040e+03 5.2000e+01 8.0000e-01 0.0000e+00
  5.

## 2.4 Selecting and Slicing Rows and Items from ndarrays

Let's look at a comparison between working with ndarrays and lists of lists to select one or more rows of data:

<img width="600" src="https://drive.google.com/uc?export=view&id=1jfWR9J9dsX2WhqSEnsTJMy_8x0NldSRT">


Just like we saw in the previous screen, row selections from ndarrays look like they behave very similarly to lists of lists. In reality, what we're seeing is a shortcut of sorts. For any two-dimensional array, the full syntax for selecting data is:

```python
ndarray[row,column]

# or if you want to select all
# columns for a given set of rows
ndarray[row]
```

Where row defines the location along the row axis and column defines the location along the column axis. Both row and column can be one of the following:

- An **integer**, indicating a specific location, e.g. **ndarray[3,0]**.
- A **slice**, indicating a range of locations, e.g. **ndarray[0:5,6:]**.
- A **colon**, indicating every location, e.g. **ndarray[:,2]**.
- A **list of values**, indicating specific locations, e.g. **ndarray[[0,1,3,4],0]**.
- A **boolean array**, indicating specific locations - we'll look at this method in detail later.
- Or any combination of the above.
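The selection types listed above can be tried on a small made-up array. This is only a sketch, and `arr` and the other variable names are ours rather than part of the taxi data:

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]

single_item = arr[3, 0]          # integer row, integer column
block = arr[0:2, 3:]             # slice of rows, slice of columns
column = arr[:, 2]               # colon (every row), integer column
picked_rows = arr[[0, 1, 3], 0]  # list of row locations, integer column

print(single_item)
print(block)
print(column)
print(picked_rows)
```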

This is how we select a single item from a 2D ndarray:


<img width="600" src="https://drive.google.com/uc?export=view&id=1PZX6Ba54H6UM7NfnlyMkyoff5ZgMaSl9">


With a list of lists, we use two separate pairs of square brackets back-to-back. With a NumPy ndarray, we use a single pair of brackets with comma separated row and column locations.
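Here is the difference between the two syntaxes side by side, using a small made-up data set:

```python
import numpy as np

data_list = [[1, 2, 3],
             [4, 5, 6]]
data_ndarray = np.array(data_list)

# list of lists: two pairs of brackets, back-to-back
item_from_list = data_list[1][2]

# ndarray: one pair of brackets, row and column separated by a comma
item_from_ndarray = data_ndarray[1, 2]

print(item_from_list, item_from_ndarray)
```

Both expressions pick out the item in the second row and third column.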

Let's practice selecting one row, multiple rows, and single items from our **taxi** ndarray.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. From the **taxi** ndarray:
  - Select the row at index 0 and assign it to **row_0**.
  - Select every column for the rows at indexes 391 to 500 inclusive and assign them to **rows_391_to_500**.
  - Select the item at row index 21 and column index 5 and assign it to **row_21_column_5**



In [28]:
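# put your code here
row_0 = taxi[0]
rows_391_to_500 = taxi[391:501,:]
row_21_column_5 = taxi[21, 5]
# print the shape of each selection
for selection in (row_0, rows_391_to_500, row_21_column_5):
    print(selection.shape)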

(15,)
(110, 15)
()


## 2.5 Selecting Columns and Custom Slicing ndarrays

Let's continue by learning how to select one or more columns of data:


<img width="550" src="https://drive.google.com/uc?export=view&id=1SMRvKH2kCSLpdANtxt4XZvSxomE0QgP5">

With a list of lists, we need to use a for loop to extract specific column(s) and append them to a new list. With ndarrays, the process is much simpler. We again use single brackets with comma separated row and column locations, but we use a colon **(:)** for the row locations. This colon acts as a wildcard, and gives us all items in that dimension, or in other words all rows.
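To make the contrast concrete, here is a short sketch that selects the same column both ways from a small made-up data set:

```python
import numpy as np

data_list = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]

# list of lists: loop over the rows and collect the item at index 1
second_col_list = []
for row in data_list:
    second_col_list.append(row[1])

# ndarray: a colon for the rows, an integer for the column
data_ndarray = np.array(data_list)
second_col_ndarray = data_ndarray[:, 1]

print(second_col_list)
print(second_col_ndarray)
```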

If we wanted to select a partial 1D slice of a row or column, we can combine a single value for one dimension with a slice for the other dimension:

<img width="550" src="https://drive.google.com/uc?export=view&id=1ywqJGXCPuLD17sTVg8f_eIJHD4hiHM2D">

Lastly, if we wanted to select a 2D slice, we can use slices for both dimensions:


<img width="550" src="https://drive.google.com/uc?export=view&id=1ag7hqo_71kgwLbhpo74rwQyKRA4Xjwpk">
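The partial 1D and 2D selections described above can also be written out in code. The array `grid` is made up for illustration. Note how the shape of the result depends on whether each dimension gets a single value or a slice:

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]

partial_row = grid[1, 1:3]  # single row, slice of columns -> 1D
partial_col = grid[0:2, 3]  # slice of rows, single column -> 1D
block_2d = grid[1:, :2]     # slices for both dimensions -> 2D

print(partial_row, partial_row.shape)
print(partial_col, partial_col.shape)
print(block_2d, block_2d.shape)
```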


Let's practice everything we've learned so far to perform some more complex selections using NumPy.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. From the **taxi** ndarray:
  - Select every row for the columns at indexes **1**, **4**, and **7** and assign them to **columns_1_4_7**.
  - Select the columns at indexes **5** to **8** inclusive for the row at index 99 and assign them to **row_99_columns_5_to_8**.
  - Select the rows at indexes **100 to 200** inclusive for the column at index 14 and assign them to **rows_100_to_200_column_14**.

In [55]:
# put your code here
cols = [1,4,7]
columns_1_4_7 = taxi[:,cols]
row_99_columns_5_to_8 = taxi[99,5:9]
rows_100_to_200_column_14 = taxi[100:201,14]
print(rows_100_to_200_column_14)

[2. 1. 1. 1. 1. 1. 2. 1. 1. 2. 1. 1. 1. 2. 2. 2. 1. 2. 1. 2. 1. 1. 2. 2.
 2. 1. 1. 2. 1. 2. 1. 1. 2. 2. 1. 1. 2. 2. 1. 1. 1. 2. 1. 1. 1. 2. 2. 2.
 2. 2. 1. 4. 2. 1. 2. 1. 2. 2. 2. 2. 1. 1. 2. 1. 2. 2. 2. 2. 1. 2. 2. 1.
 2. 1. 2. 1. 2. 2. 1. 1. 1. 1. 2. 1. 1. 2. 2. 1. 1. 2. 2. 1. 1. 2. 1. 1.
 1. 1. 1. 2. 2.]


## 2.6 Vector Math

The examples in the previous two screens showed us how much easier it is to select data using NumPy ndarrays. Beyond this, the selection itself is a lot faster, because NumPy uses vectorized operations. To illustrate this, we've created a random 500 x 5 NumPy ndarray and an equivalent list of lists, and then a function to select the second and third columns from each:

- **python_subset()**
- **numpy_subset()**



In [0]:
import numpy as np

# create random (500,5) numpy arrays and list of lists
np_array = np.random.rand(500,5)
list_array = np_array.tolist()

def python_subset():
  filtered_cols = []
  for row in list_array:
    filtered_cols.append([row[1],row[2]])
  return filtered_cols

def numpy_subset():
  return np_array[:,1:3]

We'll use the special IPython [%%timeit](http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) cell magic to time a single run of each function:

In [42]:
%%timeit -r 1 -n 1
list_of_list = python_subset()

1 loop, best of 1: 236 µs per loop


In [41]:
%%timeit -r 1 -n 1
numpy_array = numpy_subset()

1 loop, best of 1: 7.53 µs per loop


In this run, our NumPy version was roughly 30 times quicker than the list of lists version (the units of the output are in [microseconds](https://en.wikipedia.org/wiki/Microsecond))! Timings from a single run vary, so your numbers will differ.

When we first talked about vectorized operations, we used the example of adding two columns of data. With data in a list of lists, we'd have to construct a for-loop and add each pair of values from each row individually. To refresh your memory, here's what our example code looked like:

```python
my_numbers = [
              [6, 5],
              [1, 3],
              [5, 6],
              [1, 4],
              [3, 7],
              [5, 8],
              [3, 5],
              [8, 4]
             ]

sums = []

for row in my_numbers:
    row_sum = row[0] + row[1]
    sums.append(row_sum)
```

At the time, we only talked about how vectorized operations make this faster; however, they also make the code much simpler. We'll break this down into three steps:

- Convert our data to an ndarray,
- Select each column,
- Add the columns.

Let's look at what that looks like in code:

```python
# convert the list of lists to an ndarray
my_numbers = np.array(my_numbers)

# select each of the columns - the result
# of each will be a 1D ndarray
col1 = my_numbers[:,0]
col2 = my_numbers[:,1]

# add the two columns
sums = col1 + col2
```

We could simplify this further if we wanted to:

```python
sums = my_numbers[:,0] + my_numbers[:,1]
```

Here are some key observations about this code:

- When we selected each column, we used the syntax **ndarray[:,c]** where **c** is the column index we wanted to select. Like we saw in the previous screen, the colon acts as a wildcard and selects all rows.
- To add the two 1D ndarrays, **col1** and **col2** (which sometimes would be called **vectors** in this context), we simply use the addition operator **(+)** between them.
- The result of adding two 1D vectors is a 1D vector of the same shape (or dimensions) as the original.


Here's what happened behind the scenes:


<img width="600" src="https://drive.google.com/uc?export=view&id=14pACrFQpoxcFg9esh3CyqrKTASHfjmvm">



What we just did, adding two columns (or vectors) together, is called **vector math**. When we're performing vector math on two one-dimensional vectors, both vectors must have the same shape. We can use any of the standard [Python numeric operators](https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex) to perform vector math:

- **vector_a + vector_b** - Addition
- **vector_a - vector_b** - Subtraction
- **vector_a \* vector_b** - Multiplication (this is unrelated to the vector multiplication used in linear algebra).
- **vector_a / vector_b** - Division
- **vector_a % vector_b** - Modulus (find the remainder when **vector_a** is divided by **vector_b**)
- **vector_a \*\* vector_b** - Exponent (raise **vector_a** to the power of **vector_b**)
- **vector_a // vector_b** - Floor Division (divide **vector_a** by **vector_b**, rounding down to the nearest integer)
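Here is each of the operators above applied to two small made-up vectors. Every operation works element by element and returns a vector of the same shape:

```python
import numpy as np

vector_a = np.array([7, 9, 8])
vector_b = np.array([2, 3, 4])

added = vector_a + vector_b        # [ 9 12 12]
subtracted = vector_a - vector_b   # [5 6 4]
multiplied = vector_a * vector_b   # [14 27 32]
divided = vector_a / vector_b      # [3.5 3.  2. ]
remainder = vector_a % vector_b    # [1 0 0]
power = vector_a ** vector_b       # [  49  729 4096]
floored = vector_a // vector_b     # [3 3 2]

print(added, subtracted, multiplied, divided, remainder, power, floored)
```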

Let's look at an example from our taxi dataset. Here are the first five rows of two of the columns in the data set:

| trip_distance | trip_length |
|---------------|-------------|
| 21.00 | 2037.0 |
| 16.29 | 1520.0 |
| 12.70 | 1462.0 |
| 8.70 | 1210.0 |
| 5.56 | 759.0 |


Let's use these columns to calculate the average travel speed of each trip in miles per hour. The formula for calculating miles per hour is:

$$
\textrm{miles per hour} = \textrm{distance in miles} \div \textrm{length in hours}
$$

As we learned in the second screen of this mission, **trip_distance** is expressed in miles, and **trip_length** is in seconds, so our first step is converting **trip_length** into hours. Here's how we would do it:

```python
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
```

Here we have a different example of vector math. We've divided a vector (one-dimensional array) by a scalar (single number). In this case, each value in the vector gets divided by the scalar to form the result.
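A runnable sketch of vector-by-scalar division, using the five **trip_length** values from the table above in place of the full column:

```python
import numpy as np

# the first five trip_length values from the table, in seconds
trip_length_seconds = np.array([2037.0, 1520.0, 1462.0, 1210.0, 759.0])

# every element is divided by the same scalar
trip_length_hours = trip_length_seconds / 3600

print(trip_length_hours)
print(trip_length_hours.shape)
```

The result has the same shape as the original vector, with each value now expressed in hours.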

From here, let's perform vector division again to calculate the miles per hour.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Use vector division to divide **trip_distance_miles** by **trip_length_hours**, assigning the result to **trip_mph**.
2. After you have run your code, inspect the contents of the new **trip_mph** variable.

In [68]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

# put your code here
trip_mph = trip_distance_miles / trip_length_hours
print(trip_mph)

[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]


## 2.7 Arithmetic NumPy Functions

To make the calculations in the previous screen, we used operators like the **/** symbol to perform vectorized operations over our data. NumPy provides a second way to make these calculations - **arithmetic functions**. Let's look at how we would write the exercise from the previous screen with the equivalent [numpy.divide](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.divide.html) function:

```python
# using the `/` operator:
trip_mph_1 = trip_distance_miles / trip_length_hours

# using the `numpy.divide()` function:
trip_mph_2 = np.divide(trip_distance_miles,trip_length_hours)
```

The variables **trip_mph_1** and **trip_mph_2** will be identical.
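One way to check that claim is with **numpy.array_equal()**. This sketch uses the first three rows of the sample table rather than the full **taxi** array, and the variable names are ours:

```python
import numpy as np

# first three trips from the sample table
distance_miles = np.array([21.0, 16.29, 12.7])
length_hours = np.array([2037.0, 1520.0, 1462.0]) / 3600

by_operator = distance_miles / length_hours
by_function = np.divide(distance_miles, length_hours)

# True when both arrays have the same shape and the same values
same = np.array_equal(by_operator, by_function)

print(by_operator)
print(same)
```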

As you become more familiar with NumPy (and later, pandas), you'll find that there is often more than one way to do the same thing. Most of the time, which you choose is up to you. The general rule with situations like these is to choose the one that makes your code easier to read, which will pay dividends both as you start working with data in teams, and when you have to refer back to code you wrote some time ago. You will find that for these arithmetic operations, it's much more common to use the built-in Python operators than the functions.

As you start to feel more comfortable with these libraries, you should start exploring the documentation. This is useful because it builds out your knowledge of available functions and methods, but also because it gets you used to reading the documentation. It's not possible to remember the syntax for every variation of every data science library, but if you remember what is possible, and can read the documentation, you'll always be able to quickly refamiliarize yourself with some syntax whenever you need it.

You may have noticed that when we mention a function or method for the first time, we'll link to the documentation for it. Take a moment now to click the link for the **numpy.divide()** function from the first paragraph of this screen and look at the documentation. It may seem a little overwhelming at first, but it is well worth your time.

You might also like to take a look at all of the [arithmetic functions from the NumPy documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.math.html#arithmetic-operations).

## 2.8 Calculating Statistics For 1D ndarrays

Earlier, we created **trip_mph**, a 1D ndarray of the average mile-per-hour speed of each trip in our dataset, based off the **trip_length** and **trip_distance** columns. We might like to explore this data further, for instance working out what the maximum and minimum values are for that ndarray.

We could use the built-in Python functions **min()** and **max()** to make these calculations, however these will perform calculations without taking advantage of vectorization. Instead, we can use NumPy's ndarray methods to calculate statistics.

To calculate the minimum value of a 1D ndarray, we use the vectorized [ndarray.min()](http://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html) method, like so:


```python
>>> mph_min = trip_mph.min()

>>> mph_min
    0.0
```

The minimum value in our **trip_mph** ndarray is **0.0**, for a trip that didn't travel any distance at all.

Before we look at other array methods, let's take a moment to clarify the difference between **methods** and **functions**. Functions act as standalone segments of code that usually take an input, perform some processing, and return some output. When we're working with Python lists, we can use the **len()** function to calculate the length of a list, but if we're working with Python strings, we can also use **len()**. In this case, it calculates the number of characters (or length) of the string.

```python
>>> my_list = [21,14,91]
>>> len(my_list)
    3

>>> my_string = 'Natal'
>>> len(my_string)
    5
```

In contrast, methods are special functions that belong to a specific type of object. Python lists have a **list.append()** method that we can use to add an item to the end of a list. If we try to use that method on a string, we will get an error:

```python
>>> my_list.append(21)

>>> my_string.append(' is the best!')

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'str' object has no attribute 'append'
```

When you're learning NumPy, this can get confusing, because sometimes there are operations that are implemented as both methods and functions, but sometimes there are not. Let's look at some examples:

| Calculation | Function Representation | Method Representation |
|------------------------------------------------|-------------------------|-----------------------------------|
| Calculate the minimum value of **trip_mph** | np.min(trip_mph) | trip_mph.min() |
| Calculate the maximum value of **trip_mph** | np.max(trip_mph) | trip_mph.max() |
| Calculate the mean average value of **trip_mph** | np.mean(trip_mph) | trip_mph.mean() |
| Calculate the median average value of **trip_mph** | np.median(trip_mph) | There is no ndarray median method |


To remember the right terminology, anything that starts with **np** (e.g. **np.mean()**) is a function, and anything you express with an object (or variable) name first (e.g. **trip_mph.mean()**) is a method. As we discussed in the previous section, where both exist it's up to you which you use, but it's much more common to see the method approach, and that's the one we'll use moving forward.
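To make the function/method distinction concrete, here is a minimal sketch using a small made-up array (`speeds` is hypothetical and only stands in for **trip_mph**). It shows that the function and method forms agree where both exist, and that the median is only available as a function.

```python
import numpy as np

# Hypothetical values standing in for trip_mph.
speeds = np.array([12.0, 30.0, 18.0, 24.0])

# Function form and method form give the same result.
print(np.min(speeds), speeds.min())
print(np.mean(speeds), speeds.mean())

# The median exists only as a function; ndarrays have no median method.
print(np.median(speeds))
has_median_method = hasattr(speeds, "median")
print(has_median_method)
```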

Numpy ndarrays have methods for many different calculations. A few key methods are:

- [ndarray.min()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.min.html#numpy.ndarray.min) to calculate the minimum value
- [ndarray.max()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html) to calculate the maximum value
- [ndarray.mean()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean) to calculate the mean average value
- [ndarray.sum()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.sum.html#numpy.ndarray.sum) to calculate the sum of the values

You can see a full list of ndarray methods in the NumPy ndarray [documentation](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.ndarray.html#calculation).

Let's use the methods we've just learned about to calculate the smallest, largest, and mean average speed from our **trip_mph** ndarray.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Use the [ndarray.max()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.max.html) method to calculate the maximum value of **trip_mph** and assign the result to **mph_max**.
2. Use the [ndarray.mean()](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.ndarray.mean.html#numpy.ndarray.mean) method to calculate the average value of **trip_mph** and assign the result to **mph_mean**.

In [69]:
# put your code here
mph_max = trip_mph.max()
mph_mean = trip_mph.mean()
print(mph_max)
print(mph_mean)

82800.0
32.24258580925573


## 2.9 Calculating Statistics For 2D ndarrays

Looking at the result of the code in the previous screen, you would have observed:

- Minimum trip speed: 0 mph
- Average (mean) trip speed (rounded): 32 mph
- Maximum trip speed: 82,800 mph

While it's easy to imagine a case where the trip speed is 0 mph (a trip that starts and ends without traveling any distance), a trip speed of 82,800 mph is definitely not possible in New York traffic; that's almost 20x faster than the fastest plane in the world! This could be due to an error in the devices that record the data, or perhaps errors made somewhere in the data pipeline. We'll spend some time later in this mission looking into the data that gave us this unrealistic number.

For now, we're going to look at how we can calculate statistics for two-dimensional ndarrays. If we use the methods without additional parameters, they will return a single value, just like they do with a 1D array:

<img width="500" src="https://drive.google.com/uc?export=view&id=1hOKRuh4eN2_ZeiDOMT5ZwEz8syqMlB2X">


But what if we wanted to find the maximum value of each row? For that, we need to use the **axis** parameter, and specify a value of **1**, which indicates we want to calculate values for each row.

<img width="500" src="https://drive.google.com/uc?export=view&id=151in-86Grb_igmMjTnMAHsuqu3XrxFiH">

If we want to find the maximum value of each column, we use an **axis** value of **0**:


<img width="550" src="https://drive.google.com/uc?export=view&id=1tTSmu1P6FlOADhVrAJIL8ydgbX_ToT2m">


To help you remember which is which, you can think of the first axis as rows, and the second axis as columns, just in the same way as when we're indexing a 2D NumPy array we use **ndarray[row,column]**. Then you think about which axis you want to apply the method along. The tricky part is to remember that when you apply the method along one axis, you get results in the other axis. Here is an illustration of that:


<img width="550" src="https://drive.google.com/uc?export=view&id=11Yyylj-uQYHTDJcY-M5e8PMiiXydK4U8">
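The same idea can be sketched in code with a small made-up 2D array (`grid` is a hypothetical name, not part of the taxi data). Note how **axis=1** yields one value per row, while **axis=0** yields one value per column.

```python
import numpy as np

# A made-up 2 x 3 array for illustration.
grid = np.array([[1, 9, 3],
                 [7, 2, 8]])

overall_max = grid.max()       # single value for the whole array
row_max = grid.max(axis=1)     # one value per row
col_max = grid.max(axis=0)     # one value per column

print(overall_max)  # 9
print(row_max)      # [9 8]
print(col_max)      # [7 9 8]
```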


Let's look at an example from our taxi data set. Let's say that we wanted to do some validation, and check that the **total_amount** column is accurate. To remind ourselves of what the data looks like, let's look at the first five rows of columns with indexes 9 through 13:

| fare_amount | fees_amount | tolls_amount | tip_amount | total_amount |
|-------------|-------------|--------------|------------|--------------|
| 52.0 | 0.8 | 5.54 | 11.65 | 69.99 |
| 45.0 | 1.3 | 0.00 | 8.00 | 54.3 |
| 36.5 | 1.3 | 0.00 | 0.00 | 37.8 |
| 26.0 | 1.3 | 0.00 | 5.46 | 32.76 |
| 17.5 | 1.3 | 0.00 | 0.00 | 18.8 |


We want to check whether the first 4 of these columns sum to the 5th column. This is how we would do it:


```python
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]
# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13] 
# select the total_amount column
fare_totals = taxi_first_five[:,13]

# sum the component columns
fare_sums = fare_components.sum(axis=1)

# compare the summed columns to the fare_totals
print(fare_totals)
print(fare_sums)
```

Our code outputs the following:

```python
[ 69.99  54.3   37.8   32.76  18.8 ]
[ 69.99  54.3   37.8   32.76  18.8 ]
```

We have validated that our **total_amount** column is correct (at least for the first five rows).
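For readers without the full data set loaded, the check can be reproduced from the five rows in the table above. This is only a sketch: `sample` is a hypothetical array typed in by hand, and the use of `np.allclose()` is an assumption on our part (the lesson compares the printed values by eye). It is used here because summed floats may differ from the stored totals by tiny rounding errors.

```python
import numpy as np

# The five rows from the table: fare, fees, tolls, tip, total.
sample = np.array([
    [52.0, 0.8, 5.54, 11.65, 69.99],
    [45.0, 1.3, 0.00, 8.00, 54.30],
    [36.5, 1.3, 0.00, 0.00, 37.80],
    [26.0, 1.3, 0.00, 5.46, 32.76],
    [17.5, 1.3, 0.00, 0.00, 18.80],
])

# Sum the four component columns across each row.
fare_sums = sample[:, :4].sum(axis=1)
fare_totals = sample[:, 4]

# Compare with a small tolerance rather than exact equality.
totals_match = np.allclose(fare_sums, fare_totals)
print(totals_match)
```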

Now, let's practice calculating the average for each column:


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Using a single method, calculate the mean value for each column of **taxi**, and assign the result to **taxi_column_means**.

In [0]:
# put your code here
taxi_column_means = taxi.mean(axis=0)

## 2.10 Adding Rows and Columns to ndarrays

Earlier in the lesson, we produced an ndarray **trip_mph** of the average speed of each trip. We also observed that the maximum speed was 82,800 mph, which is definitely not an accurate number. To take a closer look at why we might be getting this value, we're going to do the following:

- Add the **trip_mph** as a column to our **taxi** ndarray.
- Sort taxi by **trip_mph**.
- Look at the rows with the highest **trip_mph** from our sorted ndarray to see what they tell us about these large values.


To start, let's learn how to add rows and columns to an ndarray. The technique we're going to use involves the [numpy.concatenate() function](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.concatenate.html). This function accepts:

- A list of ndarrays as the first, unnamed parameter.
- An integer for the **axis** parameter, where **0** will add rows and **1** will add columns.

The **numpy.concatenate()** function requires that each array have the same shape, excepting the dimension corresponding to **axis**. Let's look at an example to understand more precisely how that works. We have two arrays, **ones** and **zeros**:

```python
>>> print(ones)

    [[ 1  1  1]
     [ 1  1  1]]

>>> print(zeros)

    [ 0  0  0]
```

Let's try to use **numpy.concatenate()** to add **zeros** as a row. Because we want to add a row, we use **axis=0**:

```python
>>> combined = np.concatenate([ones,zeros],axis=0)

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: all the input arrays must have same number of dimensions
```

We've got an error because our dimensions don't match - let's look at the shape of each array to see if we can understand why:

```python
>>> print(ones.shape)

    (2, 3)

>>> print(zeros.shape)

    (3,)
```

Because we're using **axis=0**, our shapes have to match across all dimensions except the first. If we look at these two arrays, we can see that the second dimension of **ones** is **3**, but **zeros** doesn't have a second dimension, because it's only a 1D array. This is the source of our error. The table below shows the shapes we need to be able to combine these arrays.


| Object | Current shape | Desired Shape |
|--------|---------------|---------------|
| ones | (2, 3) | (2, 3) |
| zeros | (3,) | (1, 3) |


In order to adjust the shape of **zeros**, we can use the [numpy.expand_dims()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.expand_dims.html) function. You might like to follow these steps in the cell. We'll start by passing **axis=0** because we want to convert our 1D array into a 2D array representing a row:


```python
>>> zeros_2d = np.expand_dims(zeros,axis=0)

>>> print(zeros_2d)

    [[ 0  0  0]]

>>> print(zeros_2d.shape)

    (1, 3)
```

Finally, we can use **numpy.concatenate()** to combine the two arrays:

```python
>>> combined = np.concatenate([ones,zeros_2d],axis=0)

>>> print(combined)

    [[ 1  1  1]
     [ 1  1  1]
     [ 0  0  0]]
```

Adding a column is done the same way, except substituting **axis=1** for **axis=0** in both functions. The initial code for this screen shows this process.
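As a minimal sketch of the column case, the snippet below reuses the **ones** array from above and a hypothetical 1D array `new_column` with one value per row (that name and its values are ours, chosen for illustration).

```python
import numpy as np

ones = np.array([[1, 1, 1],
                 [1, 1, 1]])

# Hypothetical 1D array holding one value per row of ones.
new_column = np.array([0, 0])

# axis=1 turns the 1D array of shape (2,) into a column of shape (2, 1).
new_column_2d = np.expand_dims(new_column, axis=1)
print(new_column_2d.shape)

# axis=1 appends the column to the right-hand side of ones.
combined = np.concatenate([ones, new_column_2d], axis=1)
print(combined)
```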


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Expand the dimensions of **trip_mph** to be a single column in a 2D ndarray, and assign the result to **trip_mph_2d**.
2. Add **trip_mph_2d** as a new column at the end of **taxi**, assigning the result back to **taxi**.
3. Use the **print()** function to display **taxi** and view the new column.


In [72]:
# put your code here
trip_mph_2d = np.expand_dims(trip_mph, axis=1)
taxi = np.concatenate([taxi,trip_mph_2d], axis=1)
print(taxi)

[[2.01600000e+03 1.00000000e+00 1.00000000e+00 ... 3.71134021e+01
  3.71134021e+01 3.71134021e+01]
 [2.01600000e+03 1.00000000e+00 1.00000000e+00 ... 3.85815789e+01
  3.85815789e+01 3.85815789e+01]
 [2.01600000e+03 1.00000000e+00 1.00000000e+00 ... 3.12722298e+01
  3.12722298e+01 3.12722298e+01]
 ...
 [2.01600000e+03 6.00000000e+00 3.00000000e+01 ... 2.22990787e+01
  2.22990787e+01 2.22990787e+01]
 [2.01600000e+03 6.00000000e+00 3.00000000e+01 ... 4.24155125e+01
  4.24155125e+01 4.24155125e+01]
 [2.01600000e+03 6.00000000e+00 3.00000000e+01 ... 3.69047341e+01
  3.69047341e+01 3.69047341e+01]]


## 2.11 Sorting ndarrays

Now that we've added our **trip_mph** column to our array, our next step is to sort the array. For this, we'll use the [numpy.argsort() function](http://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.argsort.html#numpy.argsort). The **numpy.argsort()** function returns the indices which would sort an array. Don't worry if that sounds a little unusual, we'll look at an example to help explain it.

We'll start by defining a simple 1D ndarray, where each item is a string containing the name of a fruit:

<img width="450" src="https://drive.google.com/uc?export=view&id=1xep-cNIjyRLiSjj40rH7FBI-JsxGZjEr">

We've put the indices, or index numbers, next to each value in the array. We use the indices whenever we want to select an item, for instance **fruit[2]** would return the value **'apple'** and **fruit[1]** would return the value **'banana'**. As we learned earlier in the mission, if we selected using a list of values like **fruit[[2,1]]**, we would get back an ndarray of those values in the order: **['apple','banana'].**

Next, we'll use **numpy.argsort()** to return the indices that would sort the array:

<img width="500" src="https://drive.google.com/uc?export=view&id=1_2PNcK3Ty6blapVQBHg1efLPr_QGJAy5">


If we look at these indices carefully, we can see what has happened. The first value of **sorted_order** is 2: the value at index 2 of fruit is **'apple'**, the first item if we sort in alphabetical order. The second value is 1: the value at index 1 of fruit is **'banana'**, the second item if we sort in alphabetical order, and so on.

If we use the array of sorted indices to select items from fruit, here is what we get:

<img width="500" src="https://drive.google.com/uc?export=view&id=1CTtkZu4vHHwTAovRdWeEqPhwuV3jH9Ls">


In the code above, the values from **sorted_order** get inserted between the brackets. The code is the equivalent of:

```python
sorted_fruit = fruit[[2, 1, 4, 3, 0]]
```

As you can see, the result is that our original array has been sorted in alphabetical order.
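Since the steps above are shown as images, here is a runnable sketch of the same process. Only **'apple'** (index 2) and **'banana'** (index 1) are confirmed by the text; the other fruit names are assumptions picked so that the sorted indices come out as `[2, 1, 4, 3, 0]`.

```python
import numpy as np

# 'banana' and 'apple' match the text; the other names are hypothetical.
fruit = np.array(['orange', 'banana', 'apple', 'grape', 'cherry'])

# Indices that would put the array in alphabetical order.
sorted_order = np.argsort(fruit)
print(sorted_order)   # [2 1 4 3 0]

# Using those indices to select items returns the sorted array.
sorted_fruit = fruit[sorted_order]
print(sorted_fruit)   # ['apple' 'banana' 'cherry' 'grape' 'orange']
```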

Let's look at an example with a 2D ndarray. We'll sort a 5x5 ndarray called **int_square** by its last column:


```python
>>> print(int_square)

    [[5 2 8 3 4]
     [2 8 6 2 5]
     [1 6 2 7 7]
     [0 7 7 4 5]
     [5 7 1 1 2]]
```

We'll start by selecting just the last column.

```python
>>> last_column = int_square[:,4]

>>> print(last_column)

    [4 5 7 5 2]
```

Then, we use **numpy.argsort()** to get the indices that would sort the last column and assign them to **sorted_order**.

```python
>>> sorted_order = np.argsort(last_column)

>>> print(sorted_order)

    [4 0 1 3 2]
```

As a test, let's use **sorted_order** to sort just the last column:

```python
>>> last_column_sorted = last_column[sorted_order]

>>> print(last_column_sorted)

    [2 4 5 5 7]
```

Finally, we can use **sorted_order** to sort the full ndarray:

```python
>>> int_square_sorted = int_square[sorted_order]

>>> print(int_square_sorted)

    [[5 7 1 1 2]
     [5 2 8 3 4]
     [2 8 6 2 5]
     [0 7 7 4 5]
     [1 6 2 7 7]]
```

We can use the same technique to sort our **taxi** ndarray by the **trip_mph** column. NumPy only supports sorting in ascending order, however that is not a problem - we'll just look at the last few rows instead of the first few rows to examine the data we need.
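A short sketch of "looking at the last few rows", using the **int_square** example from above. The reversed view with `[::-1]` is an optional extra that the lesson itself does not rely on.

```python
import numpy as np

int_square = np.array([[5, 2, 8, 3, 4],
                       [2, 8, 6, 2, 5],
                       [1, 6, 2, 7, 7],
                       [0, 7, 7, 4, 5],
                       [5, 7, 1, 1, 2]])

# Sort the rows in ascending order of the last column.
sorted_order = np.argsort(int_square[:, 4])
int_square_sorted = int_square[sorted_order]

# The rows with the largest values are at the end of the sorted array.
largest_two = int_square_sorted[-2:]
print(largest_two)

# Optional: reverse the row order to view the largest values first.
descending = int_square_sorted[::-1]
print(descending)
```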


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Use **numpy.argsort()** to get the indices which would sort the **trip_mph** column from the **taxi** ndarray. The **trip_mph** column is at column index **15**.
2. Use the indices from the previous instruction to **sort** the **taxi** ndarray, and assign the result to **taxi_sorted**.
3. Use the **print()** function to examine the **taxi_sorted** ndarray.

In [74]:
# put your code here
taxi_sorted = taxi[np.argsort(taxi[:,15])]
print(taxi_sorted)

[[2.016e+03 6.000e+00 2.800e+01 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.016e+03 3.000e+00 3.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [2.016e+03 4.000e+00 6.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 ...
 [2.016e+03 3.000e+00 2.800e+01 ... 3.204e+04 3.204e+04 3.204e+04]
 [2.016e+03 2.000e+00 1.300e+01 ... 7.056e+04 7.056e+04 7.056e+04]
 [2.016e+03 1.000e+00 2.200e+01 ... 8.280e+04 8.280e+04 8.280e+04]]


In this mission we learned:

- How vectorization makes our code faster.
- About n-dimensional arrays, and NumPy's ndarrays.
- How to select specific items, rows, columns, 1D slices, and 2D slices from ndarrays.
- How to use vector math to apply simple calculations to entire ndarrays.
- How to use vectorized methods to perform calculations across either axis of ndarrays.
- How to add extra columns and rows to ndarrays.
- How to sort an ndarray.

# 3.0 Boolean Indexing with NumPy

## 3.1 Reading CSV files with NumPy

In the previous section we learned how to use NumPy and ndarrays to perform vectorized operations to work with data. We learned that NumPy makes it quick and easy to make selections of our data, and includes a number of functions and methods that make it easy to calculate statistics across the different axes (or dimensions).

Using the skills we've learned so far, we were able to select subsets of our taxi trip data and then calculate things like the maximum, minimum, sum, and mean of various columns and rows. But what if we wanted to find out how many trips were taken in each month? Or which airport is the busiest? For this we will need a new technique: **Boolean Indexing.**

In the previous section, we used Python's built-in [csv module](https://docs.python.org/3/library/csv.html) to import our CSV as a 'list of lists' and used loops to convert each value to a float before we created our NumPy ndarray. Now that we understand NumPy a little better, let's learn about the [numpy.genfromtxt() function](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt) to read in files.

The **numpy.genfromtxt()** function reads a text file into a NumPy ndarray. While it has over 20 parameters, for most cases you need only two. Here is the simplified syntax for the function, and an explanation of the two parameters:

```python
np.genfromtxt(filename,delimiter)
```

- **filename** - A positional argument, usually a string representing the path to the text file to be read.
- **delimiter** - A named argument, specifying the string used to separate each value.

In this case, because we have a CSV file, the delimiter is a comma. Let's look at what the code would look like to read in the **nyc_taxis.csv** file.

```python
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
print(taxi)
```

The output of this code is shown below:

```python
[[   nan    nan    nan ...,    nan    nan    nan]
 [  2016      1      1 ...,  11.65  69.99      1]
 [  2016      1      1 ...,      8   54.3      1]
 ..., 
 [  2016      6     30 ...,      5  63.34      1]
 [  2016      6     30 ...,   8.95  44.75      1]
 [  2016      6     30 ...,      0  54.84      2]]
```

When **numpy.genfromtxt()** reads in a file, it attempts to determine the data type of the file by looking at the values. We can use the **ndarray.dtype** attribute to see the internal datatype that has been used.

```python
>>> taxi.dtype

    float64
```

NumPy has chosen the **float64** type as it will allow most of the values from our CSV to be read. You can think of NumPy's **float64** type as being identical to Python's float type (the **'64'** refers to the number of bits used to store the underlying value).

The first row of our data contains a value that we haven't seen before: **nan**. **NaN** is an acronym for **Not a Number**. The concept of NaN is an unusual one at first: it literally means that the value cannot be stored as a number. It is similar to (and often referred to interchangeably as) a null value, like Python's [None constant](https://docs.python.org/3.4/library/constants.html#None).

NaN is most commonly seen when a value is missing, but in this case we have NaN because the first line from our CSV file contains the names of each column. As we mentioned in the previous mission, NumPy ndarrays can contain only one type. NumPy is unable to convert string values like **pickup_year** into the **float64** data type. Later in this course we'll talk about NaN some more in the context of missing values. For now, we need to remove this row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:

```python
taxi = taxi[1:]
```

This removes the first row from the array. Alternatively, we can pass an additional parameter, **skip_header**, to the **numpy.genfromtxt()** function. The **skip_header** parameter accepts an integer, the number of rows from the start of the file to skip (note that because this is the number of rows and not the index, skipping the first row requires a value of **1** and not **0**).
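The sketch below illustrates both behaviors without needing the real file. It reads a tiny made-up CSV from an in-memory `io.StringIO` object (the three column names and two data rows are hypothetical, not the contents of **nyc_taxis.csv**); with the real file you would pass the file path instead.

```python
import io
import numpy as np

# A tiny made-up CSV with a header row, standing in for nyc_taxis.csv.
csv_text = "pickup_year,pickup_month,pickup_day\n2016,1,1\n2016,6,30\n"

# Without skip_header, the header row cannot be converted and becomes NaN.
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(with_header[0])   # [nan nan nan]

# With skip_header=1, the header row is never read.
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(no_header.shape)  # (2, 3)
```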

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Import the **NumPy** library.
2. Use the **numpy.genfromtxt()** function to read the **nyc_taxis.csv** file into NumPy, skipping the first row, and assign the result to **taxi**.



In [4]:
import numpy as np
taxi = np.genfromtxt('nyc_taxis.csv', delimiter=',')
taxi = taxi[1:]

OSError: ignored

## 3.2 Boolean Arrays

In the last sections we mentioned five ways to index, or select, data from ndarrays:

- An **integer**, indicating a specific location.
- A **slice**, indicating a range of locations.
- A **colon**, indicating every location.
- A **list of values**, indicating specific locations.
- A **boolean array**, indicating specific locations.

In this section we're going to focus on the last and arguably the most powerful method, the boolean array. A boolean array, as the name suggests, is an array full of boolean values. Boolean arrays are sometimes called boolean vectors or boolean masks.

Let's take a moment to refresh our understanding of what a boolean value is. The boolean (or **bool**) type is a built-in Python type that can contain one of two unique values:

- True
- False


Boolean values can be defined either by **'hard-coding'** them into the code using the keywords **True** or **False**, or alternatively by using any of the Python comparison operators like **== (equal), > (greater than), < (less than), != (not equal)**. They're commonly seen within if statements, like the example below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1cltwKCELwoqrOzBNU7AIDsJ083ykMhPL">

As the code is executed the boolean operation is evaluated, causing the print function to run. We can use the console to perform simple boolean operations as well:

```python
>>> type(3.5) == float
    True
>>> 3 < 10
    True
>>> "hello" == "goodbye"
    False
>>> 5 > 6
    False
>>> (3 + 3) != 5
    True
```

When we explored vector math in the first section, we learned that an operation between a ndarray and a scalar (individual) value results in a new ndarray:


```python
>>> np.array([2,4,6,8]) + 10

    array([12, 14, 16, 18])
```

The **+ 10** operation is applied to each value in the array.

Now, let's look at what happens when we perform a boolean operation between an ndarray and a scalar:

```python
>>> np.array([2,4,6,8]) < 5

    array([ True,  True, False, False], dtype=bool)
```

A similar pattern occurs: the 'less than five' operation is applied to each value in the array. The diagram below shows this step by step:


<img width="600" src="https://drive.google.com/uc?export=view&id=1QINhkJfEHn-CXbppP-x-RklfQCxfKrxg">
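The other comparison operators behave the same way. As a brief sketch with a made-up array (`values` is a hypothetical name), each comparison is applied to every element and returns a boolean array of the same length:

```python
import numpy as np

values = np.array([2, 4, 6, 8])

is_four = values == 4      # True only where the element equals 4
not_four = values != 4     # the opposite of is_four
at_least_six = values >= 6

print(is_four)
print(not_four)
print(at_least_six)
```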

Let's practice using vectorized boolean operations to create some boolean arrays.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Use vectorized boolean operations to:
  - Evaluate whether the elements in array **a** are less than **3** and assign the result to **a_bool**.
  - Evaluate whether the elements in array **b** are equal to **"blue"** and assign the result to **b_bool**.
  - Evaluate whether the elements in array **c** are greater than **100** and assign the result to **c_bool**.

In [0]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

# put your code here

## 3.3 Boolean Indexing with 1D ndarrays

Now we know what a boolean array is and how to create one using vectorized boolean operations. The last piece of the puzzle is understanding how to index (or select) using boolean arrays. This is known as boolean indexing. Let's use one of the examples from the previous screen.

<img width="600" src="https://drive.google.com/uc?export=view&id=1nNX9HUvygkpb2_GowE6t3QPtozu4TaX6">


To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:

<img width="600" src="https://drive.google.com/uc?export=view&id=1HruGF2TejcaPODJP0PvLqNj2g9qNoRVQ">

The boolean array acts as a filter: the values that correspond to **True** become part of the resultant ndarray, whereas the values that correspond to **False** are removed.
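Because the steps above are shown as images, here is a text sketch of the same filtering idea. The array and variable names (`numbers`, `numbers_bool`) are assumptions and may not match the ones in the diagrams.

```python
import numpy as np

numbers = np.array([2, 4, 6, 8])

# Step 1: build the boolean array with a vectorized comparison.
numbers_bool = numbers < 5

# Step 2: place the boolean array inside the square brackets.
# Only the positions holding True are kept.
result = numbers[numbers_bool]
print(result)   # [2 4]
```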

Now, let's look at an example using our **taxi** data. The second column in the ndarray is **pickup_month**. Let's use boolean indexing to create a filtered ndarray containing only items where the value is **1**, which corresponds to January. Once we have done that, we can look at the [ndarray.shape attribute](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.shape.html) for the filtered ndarray, which will tell us the number of taxi rides in our data set from the month of January.

We'll do it step by step, starting with selecting just the **pickup_month** column:

```python
pickup_month = taxi[:,1]
```

Next, we use a boolean operation to make our boolean array:

```python
january_bool = pickup_month == 1
```

Then we use the new boolean array to select only the items from pickup_month that have a value of 1:

```python
january = pickup_month[january_bool]
```

Finally, we use the **.shape** attribute to find out how many items are in our **january** ndarray, which is the number of taxi rides in our data set from the month of January. We'll use **[0]** to extract the value from the tuple returned by **.shape**:

```python
january_rides = january.shape[0]
print(january_rides)

13481
```

There are 13,481 rides in our dataset from the month of January. Let's practice boolean indexing and find out the number of rides in our data set for February and March.

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Calculate the number of rides in the **taxi** ndarray that are from February:
  - Create a boolean array, **february_bool**, that evaluates whether the items in **pickup_month** are equal to **2**.
  - Use the **february_bool** boolean array to index **pickup_month**, and assign the result to **february**.
  - Use the **ndarray.shape** attribute to find the number of items in **february** and assign the result to **february_rides**.
2. Calculate the number of rides in the **taxi** ndarray that are from March:
  - Create a boolean array, **march_bool**, that evaluates whether the items in **pickup_month** are equal to **3**.
  - Use the **march_bool** boolean array to index **pickup_month**, and assign the result to **march.**
  - Use the **ndarray.shape** attribute to find the number of items in **march** and assign the result to **march_rides**.

In [0]:
# put your code here

## 3.4 Boolean Indexing with 2D ndarrays

When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission. The only limitation is that the boolean array must have the same length as the dimension you're indexing. Let's look at some examples:

<img width="500" src="https://drive.google.com/uc?export=view&id=1jXwHlU2lUX-VHmCTm9brDiTRu8L7yx2t">

Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.
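A minimal sketch of these rules, using a made-up 4 x 3 array (`arr`, `row_bool`, and `col_bool` are hypothetical names). A boolean array used for rows must have length 4, and one used for columns must have length 3.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# A boolean array built from one column has one value per row (length 4)...
row_bool = arr[:, 0] > 3

# ...so it can be used to select whole rows of the full array.
selected_rows = arr[row_bool]
print(selected_rows)

# A boolean array of length 3 can be used to select columns.
col_bool = np.array([True, False, True])
selected_cols = arr[:, col_bool]
print(selected_cols)

# Boolean indexing on rows can be combined with a column slice.
rows_and_slice = arr[row_bool, 1:]
print(rows_and_slice)
```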

Let's look at an example from our taxi trip data. In the previous mission, we sorted our ndarray in order to view the trips that had very large average speeds. Boolean indexing makes this much easier:

```python
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)
```

```python
[[     2      2     23      1]
 [     2      2   19.6      1]
 [     2      2   16.7      2]
 [     3      3   17.8      2]
 [     2      2   17.2      2]
 [     3      3   16.9      3]
 [     2      2   27.1      4]]
```

Combining our boolean array with a column slice allowed us to view just the key data of these trips with very high average speeds. As we observed in the previous mission, all of these trips have the same pickup and dropoff locations, and last only a few seconds.

Let's use this technique to examine the rows that have the highest values for the **tip_amount** column.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Create a boolean array, **tip_bool**, that determines which rows have values for the **tip_amount** column of more than **50**.
2. Use the **tip_bool** array to select all rows from **taxi** with tip amounts of more than **50**, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to **top_tips**.

In [0]:
# put your code here

## 3.5 Assigning Values in ndarrays

So far we've learned how to retrieve data from ndarrays, and how to add rows or columns. There is one missing piece to our NumPy fundamentals toolbox: modifying values.

We can use the same indexing techniques we've already learned to assign values within an ndarray. The syntax we'll use (in pseudocode) is:

```python
ndarray[location_of_values] = new_value
```

Let's take a look at what that looks like in actual code. With our 1D array, we can specify one specific index location:

```python
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)

['orange' 'blue' 'black' 'blue' 'purple']
```

Or we can assign multiple values at once:

```python
a[3:] = 'pink'
print(a)

['orange' 'blue' 'black' 'pink' 'pink']
```

With a 2D ndarray, just like with a 1D, we can assign one specific index location.

```python
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

We can also assign a whole row...

```python
ones[0] = 42
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

...or a whole column:

```python
ones[:,2] = 0
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
```

Let's practice some array assignment with our taxi dataset.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

To help you practice without making changes to our original array, we have used the [ndarray.copy()](http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy) method to make **taxi_modified**, a copy of our original for these exercises.
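To see why the copy matters, here is a small sketch with a made-up array (`original` and `modified` are hypothetical names, not the taxi data). Assigning into the copy leaves the original untouched.

```python
import numpy as np

original = np.array([[2016, 1],
                     [2016, 2]])

# copy() creates an independent array with its own data.
modified = original.copy()
modified[0, 1] = 99

print(original[0, 1])   # still 1
print(modified[0, 1])   # 99
```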


- The value at column index 5 (**pickup_location**) of row index 28214 is incorrect. Use assignment to change this value to **1** in the **taxi_modified** ndarray.
- The first column (index 0) contains year values as four digit numbers in the format YYYY (2016, since all trips in our data set are from 2016). Use assignment to change these values to the YY format (16) in the **taxi_modified** ndarray.
- The values at column index 7 (**trip_distance**) of rows index 1800 and 1801 are incorrect. Use assignment to change these values in the **taxi_modified** ndarray to the mean value for that column.



In [0]:
# put your code here

## 3.6 Assignment Using Boolean Arrays

Boolean arrays become very powerful when we use them for assignment. Let's start by looking at a simple example:

```python
>>> a = np.array([1, 2, 3, 4, 5])

>>> a[a > 2] = 99

>>> print(a)

    [ 1  2 99 99 99]
```

Before we walk through how the code works, we've just seen a 'shortcut' for the first time. The second line of code inserted the definition of the boolean array directly into the selection. This 'shortcut' way is the conventional way to write boolean indexing. Up until now, we've been taking the extra step of assigning to an intermediate variable first so that the process is clear. Let's look at how we would have written the example using the intermediate variable.

```python
>>> a2 = np.array([1, 2, 3, 4, 5])

>>> a2_bool = a2 > 2

>>> a2[a2_bool] = 99

>>> print(a2)

    [ 1  2 99 99 99]
```

You can see that both ways produce the same results. From here on, we will use the shortcut method instead of the intermediate variable. The boolean array controls the values that the assignment applies to, and the other values remain unchanged. Let's look at how this code works:

<img width="600" src="https://drive.google.com/uc?export=view&id=1u8WcLq-TYCIhSFuEa9ElfMFYPBAC_rZZ">


Next, let's look at an example of assignment using a boolean array with two dimensions:

```python
>>> b = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> b[b > 4] = 99

>>> print(b)

    [[ 1  2  3]
     [ 4 99 99]
     [99 99 99]]
```

<img width="600" src="https://drive.google.com/uc?export=view&id=1VPmK9UuV1jvX74-ljHWJT6oE_vkTkS-a">


Lastly, let's look at an example that uses a 1D boolean array to perform assignment on a 2D array:

```python
>>> c = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

>>> c[c[:,1] > 2, 1] = 99

>>> print(c)

    [[ 1  2  3]
     [ 4 99  6]
     [ 7 99  9]]
```


In this example, the **c[:,1] > 2** boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array to specify the rows for assignment, and use the integer **1** to specify the second column. This results in our boolean array only being applied to the second column, with all other values remaining unchanged:

<img width="600" src="https://drive.google.com/uc?export=view&id=1nXvILrVeMLryXgLr_TYLHPdjHVJZxstA">


This pattern, where a 1D boolean array is used to specify assignment in the row dimension and an index value is used to specify which column the array applies to, is very common. The pseudocode syntax for this pattern is as follows, first using an intermediate variable:

```python
# named bool_mask rather than bool, so Python's built-in bool is not shadowed
bool_mask = array[:, column_for_comparison] == value_for_comparison
array[bool_mask, column_for_assignment] = new_value
```

and then all in one line:

```python
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```
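
As a concrete instance of the pattern above, here is a minimal sketch on a hypothetical two-column array (the name **trips** and its values are made up for illustration):

```python
import numpy as np

# Hypothetical data: column 0 holds a location code, column 1 a flag set to 0
trips = np.array([[2, 0],
                  [4, 0],
                  [3, 0],
                  [2, 0]])

# Compare on column 0, assign into column 1
trips[trips[:, 0] == 2, 1] = 1

print(trips)
```

Only the rows whose code equals 2 receive the new value in column 1; column 0 and the other rows are unchanged.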

Let's practice this pattern using our taxi data set:


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

We have created a new copy of our taxi dataset, **taxi_modified**, with an additional column containing the value 0 for every row.

1. In our new column at index **15**, assign the value **1** if the **pickup_location_code** (column index 5) corresponds to an airport location, leaving the value as 0 otherwise by performing these three operations:
  - For rows where the value for the column index 5 is equal to 2 (JFK Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 3 (LaGuardia Airport), assign the value 1 to column index 15.
  - For rows where the value for the column index 5 is equal to 5 (Newark Airport), assign the value 1 to column index 15.

In [0]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

# create a new column filled with `0`.
zeros = np.zeros([taxi_modified.shape[0], 1])
taxi_modified = np.concatenate([taxi_modified, zeros], axis=1)
print(taxi_modified)

# put your code here

## 3.7 Challenge: Which is the most popular airport?

We'll conclude this mission with two challenges. Challenges are designed to help you practice the techniques you've learned in this mission.

**Don't be discouraged if these challenge steps take a few attempts to get right. Working with data is an iterative process!**

In this challenge, we want to find out which airport is the most popular destination in our data set. To do that, we'll use boolean indexing and the **dropoff_location_code** column (column index 6) to create three filtered arrays and then look at how many rows are in each array. The values from the column we're interested in are:

- 2 - JFK Airport.
- 3 - LaGuardia Airport.
- 5 - Newark Airport.
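
The filter-then-count approach described above can be sketched on a small array. The array **rides** and its codes are hypothetical and are not the taxi data:

```python
import numpy as np

# Hypothetical array: column 0 is a trip id, column 1 is a dropoff code
rides = np.array([[101, 2],
                  [102, 3],
                  [103, 2],
                  [104, 5],
                  [105, 2]])

# Keep only the rows whose code is 2
code_2 = rides[rides[:, 1] == 2]

# The first element of shape is the number of rows
code_2_count = code_2.shape[0]

print(code_2_count)
```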


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

- Using the original **taxi** ndarray, calculate how many trips had JFK Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to JFK, and assign the result to **jfk**.
  - Calculate how many rows are in the new **jfk** array and assign the result to **jfk_count**.
- Calculate how many trips from **taxi** had LaGuardia Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to LaGuardia, and assign the result to **laguardia**.
  - Calculate how many rows are in the new **laguardia** array and assign the result to **laguardia_count**.
- Calculate how many trips from **taxi** had Newark Airport as their destination:
  - Select only the rows where the **dropoff_location_code** column has a value that corresponds to Newark, and assign the result to **newark**.
  - Calculate how many rows are in the new **newark** array and assign the result to **newark_count**.
- After you have run your code, inspect the values for **jfk_count**, **laguardia_count**, and **newark_count** and see which airport has the most dropoffs.

In [0]:
# put your code here

## 3.8 Challenge: Calculating Statistics for Trips on Clean Data

Our calculations in the previous screen show that LaGuardia is the most common airport for dropoffs in our data set.

Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining 'clean' data.

We'll start by using boolean indexing to remove any rows that have an average speed for the trip greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two missions. Then, we'll use array methods to calculate the mean for specific columns of the remaining data. The columns we're interested in are:

- **trip_distance**, at column index 7
- **trip_length**, at column index 8
- **total_amount**, at column index 13
- **trip_mph**, not available as a column but as its own ndarray
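
Because the speed values live in their own ndarray, the same boolean array can filter both the 2D data and the 1D speeds. Here is a minimal sketch of that idea, assuming a hypothetical array **rides** with distance in miles and duration in seconds:

```python
import numpy as np

# Hypothetical data: column 0 is distance (miles), column 1 is duration (seconds)
rides = np.array([[1.0, 360.0],
                  [5.0, 900.0],
                  [30.0, 60.0],   # 30 miles in one minute: clearly bad data
                  [2.0, 600.0]])

# Speed is computed separately, with one value per row of rides
speed_mph = rides[:, 0] / (rides[:, 1] / 3600)

# One boolean array filters the 2D array and the 1D array alike
keep = speed_mph < 100
clean_rides = rides[keep]
mean_distance = clean_rides[:, 0].mean()
mean_speed = speed_mph[keep].mean()

print(clean_rides.shape, mean_distance, mean_speed)
```

This works because **speed_mph** has the same length as the number of rows in **rides**, so the boolean array lines up with both.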


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


The **trip_mph** ndarray has been provided for you.

- Create a new ndarray, **cleaned_taxi**, containing only rows for which the values of **trip_mph** are less than 100.
- Calculate the mean of the **trip_distance** column of **cleaned_taxi**, and assign the result to **mean_distance**.
- Calculate the mean of the **trip_length** column of **cleaned_taxi**, and assign the result to **mean_length**.
- Calculate the mean of the **total_amount** column of **cleaned_taxi**, and assign the result to **mean_total_amount.**
- Calculate the mean of the **trip_mph**, excluding values greater than 100, and assign the result to **mean_mph**.

In [0]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

# put your code here

In this section we learned:

- How to use **numpy.genfromtxt()** to read in an ndarray.
- About **NaN** values.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.

This is the last section that deals exclusively with NumPy, but it's certainly not the last time we'll use it. As we move on to pandas, and later in our learning paths to other Python data libraries, you'll see that a lot of the concepts we've learned transfer, and you'll find yourself using many of these fundamental NumPy concepts. We'll also use NumPy from time to time to create, transform and otherwise work with tabular data.

In the next section, we'll start using the pandas library and learn how it compares with NumPy.

# 4.0 Introduction to Pandas

## 4.1 Understanding pandas and NumPy

As we have become familiar with the NumPy library we've discovered that it makes working with data easier. Because we can easily work across multiple dimensions, our code is a lot easier to understand. By using vectorized operations instead of loops, our code will be faster with larger data.

NumPy provides fundamental structures and tools that make working with data easier, but there are several things that limit its usefulness as a single tool when working with data:

- The lack of support for column names forces us to frame the questions we want to answer as multi-dimensional array operations.
- Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
- There are lots of low level methods, however there are many common analysis patterns that don't have pre-built methods.

The **pandas** library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you've been learning will come in handy as you begin to learn more about pandas.
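
To make the single-dtype limitation concrete, here is a small sketch comparing the two libraries on mixed data. The two sample rows and the column names are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample mixing text and numbers
rows = [["Walmart", 485873], ["State Grid", 315199]]

# NumPy stores everything with one dtype, so the numbers become text
as_array = np.array(rows)
print(as_array.dtype)

# pandas keeps a separate dtype per column and lets us use column names
as_frame = pd.DataFrame(rows, columns=["company", "revenues"])
print(as_frame.dtypes)

total = as_frame["revenues"].sum()
print(total)
```

In the ndarray the revenue values would have to be converted back to numbers before any arithmetic, while the dataframe column can be summed directly.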

In this mission, we'll learn:

- about the two core pandas types: dataframes and series
- how to select data using row and column labels
- a variety of methods for exploring data with pandas
- how to assign data using various techniques in pandas
- how to use boolean indexing with pandas for selection and assignment

We'll be working with a data set from [Fortune](http://fortune.com/) magazine's [Global 500](https://en.wikipedia.org/wiki/Fortune_Global_500) list 2017, which ranks the top 500 corporations worldwide by revenue. The dataset we'll be using was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017), however we have modified the original data set into a more accessible format.

<img width="400" src="https://drive.google.com/uc?export=view&id=1vWtPGbbxR7Mn2xHg_KMa3MOKSqs05uyE">


The dataset is a CSV file called **f500.csv**. Here is a data dictionary for some of the columns in the CSV:

- **company** - The Name of the company.
- **rank** - The Global 500 rank for the company.
- **revenues** - The company's total revenues for the fiscal year, in millions of dollars (USD).
- **revenue_change** - The percentage change in revenue between the current and prior fiscal years.
- **profits** - Net income for the fiscal year, in millions of dollars (USD).
- **ceo** - The company's Chief Executive Officer.
- **industry** - The industry in which the company operates.
- **sector** - The sector in which the company operates.
- **previous_rank** - The Global 500 rank for the company for the prior year.
- **country** - The Country in which the company is headquartered.
- **hq_location** - The City and Country (or City and State for the USA) where the company is headquartered.
- **employees** - Total employees (full-time equivalent, if available) at fiscal year-end.


Similar to the import convention for NumPy (**import numpy as np**), the import convention for pandas is:

```python
import pandas as pd
```

We have already imported pandas and used the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read the CSV into a pandas object and assign it to the variable name f500. In the next mission we'll learn about **read_csv()**, but for now all you need to know is that it handles reading and parsing most CSV files automatically.

Like NumPy, pandas objects have a **.shape** attribute which returns a tuple representing the dimensions of each axis of the object. We'll use that and Python's **type()** function to inspect the f500 pandas object.
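
Here is a quick sketch of both inspections on a tiny dataframe. The frame **small** is hypothetical and built by hand, whereas the real **f500** is read from the CSV file:

```python
import pandas as pd

# Hypothetical three-row frame built by hand for illustration
small = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "China"]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

small_type = type(small)    # the class of the object
small_shape = small.shape   # a (rows, columns) tuple

print(small_type)
print(small_shape)
```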


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Use Python's **type()** function to assign the type of **f500** to **f500_type.**
2. Use the **DataFrame.shape** attribute to assign the shape of **f500** to **f500_shape.**

In [0]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

# put your code here

## 4.2 Introducing DataFrames

The code we wrote in the previous screen let us know that our data has 500 rows and 16 columns, and is stored as a [pandas.core.frame.DataFrame object](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame). More commonly referred to as **pandas.DataFrame()** objects, or just **dataframes**, the type is the primary pandas data structure.

Dataframes are two dimensional pandas objects, the pandas equivalent of a NumPy 2D ndarray. Unlike NumPy, pandas does not use the same type for 1D and 2D arrays.

We'll learn about the second pandas data structure, series, later in this mission, but first, let's look at the anatomy of a dataframe, using a selection of our Fortune 500 data:

<img width="500" src="https://drive.google.com/uc?export=view&id=1lUAxPbqauhiMPdWCAM2oPOy0vsmvWtYy">

There are three key things we can observe immediately:

- In Red: Just like a 2D ndarray, there are two axes, however each axis of a dataframe has a specific name. The first axis is called **index**, and the second axis is called **columns.**
- In Blue: Our axis values have string **labels**, not just numeric locations.
- In Green: Our dataframe contains columns with **multiple dtypes**: integer, float, and string.

We can use the [DataFrame.dtypes](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute (similar to NumPy's **ndarray.dtype** attribute) to return information about the types of each column. Let's see what this would return for our selection of data above:

```python
>>> f500_selection.dtypes

    rank          int64
    revenues      int64
    profits     float64
    country      object
    dtype: object
```

We can see three different data types (dtypes), which correspond to what we observed by looking at the data:

- int64
- float64
- object


We have seen the **float64** dtype before in NumPy. Pandas uses NumPy dtypes for numeric columns, including **int64**. There is also a type we haven't seen before, **object**, which is used for columns that have data that doesn't fit into any other dtypes. This is almost always used for columns containing string values. If you like, you can run **f500.dtypes** in the console to see the types of all the columns in the f500 dataframe.

When we import data, pandas will attempt to guess the correct dtype for each column. Generally, pandas does a pretty good job with this, which means we don't need to worry about specifying dtypes every time we start to work with data. Later in this course, we'll look at how to change the dtype of a column.

Next, let's learn a few handy methods we can use to get some high-level information about our dataframe:

- If we wanted to view the first few rows of our dataframe, we can use the [DataFrame.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which returns the first 5 rows of our dataframe. The **DataFrame.head()** method also accepts an optional integer parameter which specifies the number of rows. We could use **f500.head(10)** to return the first 10 rows of our **f500** dataframe.
- Similar in function to **DataFrame.head()**, we can use the [DataFrame.tail()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) method to show us the last rows of our dataframe. The **DataFrame.tail()** method accepts an optional integer parameter to specify the number of rows, defaulting to 5.
- If we wanted to get an overview of all the dtypes used in our dataframe, along with its shape and some extra information, we could use the [DataFrame.info()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) method. Note that **DataFrame.info()** prints the information, rather than returning it, so we can't assign it to a variable.
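
The behavior of the three methods can be sketched on a small, hypothetical dataframe (the name **numbers** and its contents are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with 10 rows
numbers = pd.DataFrame({"n": range(10), "square": [i ** 2 for i in range(10)]})

first_three = numbers.head(3)   # first 3 rows
last_two = numbers.tail(2)      # last 2 rows
default_head = numbers.head()   # no argument: first 5 rows

# info() prints its summary and returns None
info_result = numbers.info()
print(info_result)
```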

Let's practice using these three new methods. Just like in the previous missions, the f500 variable we created in the previous section is available to you here.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">



1. Using the links above to the documentation if you need to, use the three methods we just learned about to learn more about the **f500** dataframe:

  - Use the **head()** method to select the first 6 rows and assign the result to **f500_head**.
  - Use the **tail()** method to select the last 8 rows and assign the result to **f500_tail.**
  - Use the **info()** method to display information about the dataframe.


In [0]:
# put your code here

## 4.3  Selecting Columns From a DataFrame by Label

By looking at the results produced by the **DataFrame.head()** and **DataFrame.tail()** methods in the previous screen, we can see that our data set seems to be pre-sorted in order of Fortune 500 rank.

We can also see that the **DataFrame.info()** method showed us the number of entries in our index (representing the number of rows), a list of each column with their dtype and the number of non-null values, as well as a summary of the different dtypes and memory usage. In pandas, null values are represented using NaN, just like in NumPy.

Because our axes in pandas have labels, we can select data using those labels, unlike in NumPy where we needed to know the exact index location. To do this, we use the [DataFrame.loc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc) method.

Throughout our pandas missions you'll see **df** used in code examples as shorthand for a dataframe object. We use this convention because you also see this throughout the official pandas documentation, so getting used to reading it is important. You'll notice that we use brackets **([])** instead of parentheses **(())** with **loc**. This is similar to how we use brackets when selecting by location in Python lists or NumPy arrays. The syntax for the **DataFrame.loc[]** method is:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels respectively, and can be one of:

- A single label.
- A list or array of labels.
- A slice object with labels.
- A boolean array.

We'll look at boolean arrays later in this mission - for now, we're going to focus on the first three options. We're going to use the same selection of data we used in the previous screen, which is stored using the variable name **f500_selection** to make these examples easier.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WNexstd5iVGtj04VxjcnQSqa-UTSETDg">


In each of these examples, we're going to use **:** to specify that we wish to select all rows, so we can focus on making selections using column labels only.

First, let's select a single column by specifying a single label:

<img width="600" src="https://drive.google.com/uc?export=view&id=15V82UWlysJjrA_Eg0b5vKPeMcagBhQCI">


Selecting a single column returns a pandas series. We'll talk about pandas series objects more in the next screen, but for now the important thing is to note that the new series has the same index axis labels as the original dataframe. Let's look at how we can use a list of labels to select specific columns:

<img width="600" src="https://drive.google.com/uc?export=view&id=1MSj7K1OU_0LwnKACxpYLOTfU71faWb4h">


When we use a list of labels, a dataframe is returned with only the columns specified in our list, in the order specified in our list. Just like when we used a single column label, the new dataframe has the same index axis labels as the original. Let's finish by using a **slice object with labels** to select specific columns.

<img width="600" src="https://drive.google.com/uc?export=view&id=1CEtF_oFReD_-6pqheEfEd4nRWgakEgXQ">

Again we get a dataframe object, with all of the columns from the first up until **and including** the last column in our slice. This is an important distinction: when we use slices with lists and in NumPy, the result does not include the end of the slice. The reason this is different with **loc[]** is that with labels it is less obvious what the end of the slice would be. When we're using integers, we know that the number after **3** is **4**, but knowing the column label that comes after **profits** is not as obvious.
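
The three column selections described above can be written out as code. This sketch rebuilds a five-row frame by hand; the values mirror the **f500_selection** printed in section 4.6, but the construction itself is an assumption made for illustration:

```python
import pandas as pd

# Hand-built stand-in for the f500_selection dataframe
f500_selection = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# A single label returns a series
single = f500_selection.loc[:, "rank"]

# A list of labels returns a dataframe, in the order of the list
listed = f500_selection.loc[:, ["country", "rank"]]

# A label slice returns a dataframe and includes the end label
sliced = f500_selection.loc[:, "rank":"profits"]

print(type(single))
print(listed.columns.tolist())
print(sliced.columns.tolist())
```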

Let's practice using these techniques to select specific columns from our f500 dataframe.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Select the **industry** column, and assign the result to the variable name **industries**.
2. Select the **rank**, **previous_rank** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **previous**.
3. Select all columns from **revenues** up to and including **profit_change**, in order, and assign the result to the variable name **financial_data**.

In [0]:
# put your code here

## 4.4 Column selection shortcuts

There are two shortcuts that pandas provides for accessing columns.

1. **Single Bracket** – Instead of **df.loc[:,"col_1"]** you can use **df["col_1"]** to select columns. This works for single columns and lists of columns but not for column slices. This style of selecting columns is very commonly seen and we will use it throughout our Dataquest missions.
2. **Dot Accessor** – Instead of **df.loc[:,"col_1"]** you can use **df.col_1**. This shortcut does not work for labels that contain spaces or special characters. This style of selecting columns is much more rarely seen, and we will not use this in our Dataquest missions.

These shortcuts are designed to make some of the more common selection tasks easier. We recommend you always use the common shorthand in your code, as it will make your code easier to read. A summary of the techniques we've learned so far is below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BlhNf3XAGs0E50GISg0-W0sZXcRowrRe">
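
As a rough check that the shortcuts return the same data as **loc[]**, here is a sketch on a hypothetical three-column dataframe:

```python
import pandas as pd

# Hypothetical frame whose column names follow the examples above
df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "col_3": [5, 6]})

# Single bracket shortcut: one column, or a list of columns
same_single = df["col_1"].equals(df.loc[:, "col_1"])
same_list = df[["col_1", "col_3"]].equals(df.loc[:, ["col_1", "col_3"]])

# Dot accessor: only for labels that are valid Python names
same_dot = df.col_1.equals(df["col_1"])

# Column slices still need loc
col_slice = df.loc[:, "col_1":"col_2"]

print(same_single, same_list, same_dot)
print(col_slice.columns.tolist())
```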

Let's practice selecting data by column some more, this time using the common shorthand method.

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Select the **country** column, and assign the result to the variable name **countries**.
2. Select the **revenues** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **revenues_years**.
3. Select all columns from **ceo** up to and including **sector**, in order, and assign the result to the variable name **ceo_to_sector**.



In [0]:
# put your code here

## 4.5 Selecting Items from a Series by Label

In the last section we observed that when you select just one column of a dataframe, you get a new pandas type: a **series object**. Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.

You might like to think of a dataframe as being a collection of series objects, which is similar to how pandas stores the data behind the scenes.

<img width="600" src="https://drive.google.com/uc?export=view&id=1xNxF1TgmYOzlYj6ogofxKBiSO7FqsHOH">


To better understand the relationship between dataframe and series objects, we'll look at some examples. We'll start by looking at two pandas operations that each produce a series object:


<img width="600" src="https://drive.google.com/uc?export=view&id=1aI-saLdbP4eZOQKGjHTJT54mFCa2Vu42">

Because a series has only one axis, its axis labels are either the index axis or column axis labels, depending on whether it is representing a row or a column from the original dataframe. If we make a 2D selection from a dataframe, it will retain the labels from both axes:

<img width="600" src="https://drive.google.com/uc?export=view&id=1YyiMWxDLQ8bo4vkL1qZc5Pq49C4KNqE8">
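
The way labels carry over from a dataframe to a series can be verified in code. This is a sketch on a small hypothetical frame; selecting a row with **loc[]** is covered in more detail later in this mission:

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)

# A column as a series: its labels come from the index axis
column_series = df["rank"]

# A row as a series: its labels come from the columns axis
row_series = df.loc["Walmart"]

# A 2D selection stays a dataframe and keeps both sets of labels
block = df.loc[["Walmart"], ["rank"]]

print(column_series.index.tolist())
print(row_series.index.tolist())
print(type(block))
```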

Let's look at a brief summary of the differences between dataframes and series.


<img width="400" src="https://drive.google.com/uc?export=view&id=1k7GgAAgsHY8YD-Z8x7Enerozt9MlqkuR">


Just like dataframes, we can use **Series.loc[]** to select items from a series using single labels, a list, or a slice object. We can also omit **loc[]** and use bracket shortcuts for all three. Let's look at an example:

```python
>>> print(s)

a    0
b    1
c    2
d    3
e    4
dtype: int64
```

We can select a single item:

```python
print(s["d"])

3
```

Like with dataframe columns, there is a dot accessor (e.g., **s.d**) available, but this is rarely used, even less than the dataframe dot accessor.

To select several items using a list:

```python
print(s[["a", "e", "c"]])

a    0
e    4
c    2
dtype: int64
```

And lastly, several items using a slice:

```python
print(s["a":"d"])

a    0
b    1
c    2
d    3
dtype: int64
```

Let's practice selecting data from pandas series:

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. From the pandas series **ceos**:
  -  Select the item at index label **Walmart** and assign the result to the variable name **walmart**.
  -  Select the items from index label **Apple** up to and including index label **Samsung Electronics** and assign the result to the variable name **apple_to_samsung**.
  -  Select the items with index labels **Exxon Mobil**, **BP**, and **Chevron**, in order, and assign the result to the variable name **oil_companies**.



In [0]:
ceos = f500["ceo"]

# put your code here

## 4.6 Selecting Rows From a DataFrame by Label

Now that we've learned how to select columns using the labels of the **'column'** axis, let's learn how to select rows using the labels of the **'index'** axis.

<img width="400" src="https://drive.google.com/uc?export=view&id=1HH7k0yK6abEG4xKiHgPTt5I_rz1OE5UD">

Selecting **rows** from a dataframe by label uses the same syntax as we use for **columns.** As a reminder:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels. We'll look at how to select rows, again using our **f500_selection** dataframe to make these examples easier.

```python
print(type(f500_selection))
print(f500_selection)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
Walmart                      1    485873  13643.0     USA
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```

To select a single row:

```python
single_row = f500_selection.loc["Sinopec Group"]
print(type(single_row))
print(single_row)

<class 'pandas.core.series.Series'>

rank             3
revenues    267518
profits     1257.9
country      China
Name: Sinopec Group, dtype: object
```

As we would expect, a single row is returned as a series. We should take a moment to note that the dtype of this series is object. Because this series has to store integer, float, and string values pandas uses the object dtype, since none of the numeric types could cater for all values.
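
To see the effect of mixed values on the dtype of a row, here is a small sketch on a hypothetical one-row dataframe. When only the numeric columns are kept before selecting the row, pandas can use a numeric dtype instead of object:

```python
import pandas as pd

# Hypothetical one-row frame with an integer, a float and a string column
df = pd.DataFrame(
    {"rank": [3], "profits": [1257.9], "country": ["China"]},
    index=["Sinopec Group"],
)

# The full row mixes numbers and a string, so the series dtype is object
mixed_row = df.loc["Sinopec Group"]

# Keeping only the numeric columns first gives a row with a float dtype
numeric_row = df[["rank", "profits"]].loc["Sinopec Group"]

print(mixed_row.dtype)
print(numeric_row.dtype)
```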

To select a list of rows:

```python
list_rows = f500_selection.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))
print(list_rows)

<class 'pandas.core.frame.DataFrame'>

              rank  revenues  profits country
Toyota Motor     5    254694  16899.3   Japan
Walmart          1    485873  13643.0     USA
```

For selection using slices, we can use the bracket shortcut and omit **loc[]**. This is the reason we can't use this shortcut for column slices, because with slices it's reserved for use with rows:

```python
slice_rows = f500_selection["State Grid":"Toyota Motor"]
print(type(slice_rows))
print(slice_rows)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```
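
As a quick check that the slice shortcut selects the same rows as **loc[]**, here is a sketch on a hypothetical four-row dataframe:

```python
import pandas as pd

# Hypothetical frame with company names as index labels
df = pd.DataFrame(
    {"rank": [1, 2, 3, 4]},
    index=["Walmart", "State Grid", "Sinopec Group", "Toyota Motor"],
)

# Bracket shortcut with a label slice selects rows
shortcut = df["State Grid":"Sinopec Group"]

# The explicit form with loc
explicit = df.loc["State Grid":"Sinopec Group"]

print(shortcut.equals(explicit))
print(shortcut.index.tolist())
```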

Let's take a look at a summary of all the different label selection methods we've learned so far:


<img width="800" src="https://drive.google.com/uc?export=view&id=1rQMPkOZBVh57x6kVu5qWjU3UC6vCh9v_">
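
Row and column selections can be combined in a single **loc[]** call. The following sketch uses a hypothetical three-row frame that reuses a few values from **f500_selection**:

```python
import pandas as pd

# Hypothetical frame reusing a few values from f500_selection
df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Single row label and single column label: one value
cell = df.loc["Walmart", "country"]

# Lists on both axes: rows and columns come back in the listed order
picked = df.loc[["Sinopec Group", "Walmart"], ["country", "rank"]]

# Slices on both axes: both end labels are included
block = df.loc["State Grid":"Sinopec Group", "rank":"revenues"]

print(cell)
print(picked)
print(block)
```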


Now for some practice - we're going to make it a little bit harder this time, by asking you to combine selection methods for rows and columns on both dataframes and series!

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">



1. By selecting data from **f500**:
  - Create a new variable, **drink_companies**, with:
    - Rows with indices **Anheuser-Busch InBev**, **Coca-Cola**, and **Heineken Holding**, in that order.
    - All columns.
  - Create a new variable **big_movers**, with:
    - Rows with indices **Aviva**, **HP**, **JD.com**, and **BHP Billiton**, in that order.
    - The **rank** and **previous_rank** columns, in that order.
  - Create a new variable, **middle_companies** with:
    - All rows with indices from **Tata Motors** to **Nationwide**, inclusive.
    - All columns from **rank** to **country**, inclusive.

In [0]:
# put your code here

## 4.7 Series and Dataframe Describe Methods

We're starting to get a feel for how axes labels in pandas make selecting data much easier. Pandas also has a large number of methods and functions that make working with data easier. Let's use a few of these to explore our Fortune 500 data.

The first method we'll learn about is the [Series.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html#pandas.Series.describe) method, which returns some descriptive statistics on the data contained within a specific pandas series. Let's look at an example:

```python
revs = f500["revenues"]
print(revs.describe())
```

```python
count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenues, dtype: float64
```

We've assigned the **revenues** column to a new series, **revs**, and then used the **describe()** method on that series. The method tells us how many non-null values are contained in the series, the mean and standard deviation, along with the minimum, maximum and [quartile](https://en.wikipedia.org/wiki/Quartile) values.

Rather than assigning the series to its own variable, we can actually skip that step and use the method directly on the result of the column selection. This is called **method chaining** and is a way to combine multiple methods together in a single line. It's not unique to pandas, however it is something that you see a lot in pandas code. Let's see what the command looks like with method chaining, using the **assets** column.


```python
print(f500["assets"].describe())
```

```python
count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64
```

From here, you'll start to see method chaining used more in our missions. When writing code, you should always assess whether method chaining will make your code harder to read. It's always preferable to break out into more than one line if it will make your code easier to understand.
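
To confirm that chaining gives the same result as using an intermediate variable, here is a sketch on a hypothetical four-value column:

```python
import pandas as pd

# Hypothetical frame with a single numeric column
df = pd.DataFrame({"revenues": [10, 20, 30, 40]})

# Two steps: select the column, then describe it
revs = df["revenues"]
two_step = revs.describe()

# One step: chain the method onto the selection
chained = df["revenues"].describe()

print(two_step.equals(chained))
print(chained["mean"])
```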

You might have noticed that the values in the code segment above look a little bit different. Because the values for this column are too long to display neatly, pandas has displayed them in **E-notation**, a type of [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation). Here is an expansion of what the E-notation represents:

| Original Notation | Expanded Formula | Result |
|-------------------|--------------------|----------|
| 5.000000E+02 | 5.000000 * 10 ** 2 | 500 |
| 2.436323E+05 | 2.436323 * 10 ** 5 | 243632.3 |


If we use **describe()** on a column that contains non-numeric values, we get some different statistics. Let's look at an example:

```python
print(f500["country"].describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object
```

Here is what the output indicates:

The first statistic, **count**, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:

- **unique** - The number of unique values in the series. In this case, it tells us that there are 34 different countries represented in the Fortune 500.
- **top** - The most common value in the series. The USA is the most common country that a company in the Fortune 500 is headquartered in.
- **freq** - The frequency of the most common value. The USA is the country that 132 companies from Fortune 500 are headquartered in.
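
To see these four statistics on data small enough to check by hand, here is a minimal sketch. The **countries** series is hypothetical and only borrows a few country names from the dataset.

```python
import pandas as pd

# Hypothetical values: USA appears 3 times, China twice, Japan once.
countries = pd.Series(
    ["USA", "China", "USA", "Japan", "USA", "China"], name="country"
)

country_desc = countries.describe()
print(country_desc)

# Individual statistics can be read from the result by label.
print(country_desc["top"], country_desc["freq"])
```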

Because series and dataframes are two distinct objects, they have their own unique methods. There are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. DataFrame objects also have a [DataFrame.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method that returns these same statistics for every column. If you like, you can take a look at the documentation using the link in the previous sentence to familiarize yourself with some of the differences between the two methods.

One difference is that you need to specify manually if you want to see the statistics for the non-numeric columns. By default, **DataFrame.describe()** will return statistics for only numeric columns. If we want to get just the object columns, we need to use the **include=['O']** parameter when using the dataframe version of describe:

```python
print(f500.describe(include=['O']))
```

```python
_            ceo    industry     sector  country  hq_location    website
count        500         500        500      500          500        500
unique       500          58         21       34          235        500
top     Xavie...   Banks:...  Financ...      USA  Beijing,...  http:/...
freq           1          51        118      132           56          1
```

Another difference is that **Series.describe()** returns a series object, whereas **DataFrame.describe()** returns a dataframe object.
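
The sketch below checks both differences on a made-up dataframe (the **toy** name and its values are assumptions for illustration only): the return types differ, and the dataframe version skips the non-numeric column by default.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns.
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "revenues": [100, 200, 300],
    "profits": [10.0, 20.0, 30.0],
})

series_desc = toy["revenues"].describe()  # returns a series
frame_desc = toy.describe()               # returns a dataframe

print(type(series_desc).__name__)
print(type(frame_desc).__name__)

# Only the numeric columns are described by default.
print(list(frame_desc.columns))
```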

Let's practice using both the series and dataframe describe methods:


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

1. Use the appropriate **describe()** method to:
  - Return a series of descriptive statistics for the **profits** column, and assign the result to **profits_desc**.
  - Return a dataframe of descriptive statistics for the **revenues** and **employees** columns, in order, and assign the result to **revenue_and_employees_desc**.
  - Return a dataframe of descriptive statistics for every column in the **f500** dataframe, by checking the documentation for the correct value for the **include** parameter, and assign the result to **all_desc**.




In [0]:
# put your code here

## 4.8 More Data Exploration Methods

Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported. One basic concept is vectorized operations. Let's look at an example of how this would work with a pandas series:

```python
>>> print(my_series)

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

>>> my_series = my_series + 10

>>> print(my_series)

    0    11
    1    12
    2    13
    3    14
    4    15
    dtype: int64
```

Many of the descriptive stats methods are also supported. Here are a few handy methods (with links to documentation) that you might use when working with data in pandas:

- [Series.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html) and [DataFrame.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html)
- [Series.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html) and [DataFrame.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)
- [Series.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) and [DataFrame.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)
- [Series.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html) and [DataFrame.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html)
- [Series.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html) and [DataFrame.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html)
- [Series.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) and [DataFrame.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)


As the documentation indicates, the series methods don't require an axis parameter. The dataframe methods accept one so that we can specify which axis to calculate across. While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings **"index"** and **"columns"** for the axis parameter. Let's refresh our memory on how this works:

<img width="700" src="https://drive.google.com/uc?export=view&id=1euiSMOgXE7IVP_U-VIRzwV6JAXBpC4yx">

For instance, if we wanted to find the median (middle) value for the **revenues** and **profits** columns, we could use the following code:

```python
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64
  
```



In fact, the default value for the axis parameter with these methods is **axis=0**, so we could have just used the **median()** method without a parameter to get the same result!
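
Here is a minimal sketch of the axis options on a made-up dataframe (the **toy** values are assumptions chosen so the medians are easy to verify by eye). It shows that **axis=0**, **axis="index"** and no argument at all give the same result, while **axis="columns"** calculates across each row instead.

```python
import pandas as pd

# Hypothetical values, not taken from the Fortune 500 data.
toy = pd.DataFrame({"revenues": [100, 200, 600], "profits": [10, 30, 20]})

by_int = toy.median(axis=0)            # one median per column
by_name = toy.median(axis="index")     # same calculation, string form
by_default = toy.median()              # axis=0 is the default

# One median per row, calculated across the columns.
across_columns = toy.median(axis="columns")

print(by_int)
print(across_columns)
```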

Another extremely handy method for exploring data in pandas is the [Series.value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method. The **Series.value_counts()** method displays each unique non-null value from a series, with a count of the number of times that value is used. We saw above that the **sector** column has 21 unique values. Let's use **Series.value_counts()** to look at the top 5:

```python
>>> print(f500["sector"].value_counts().head())

    Financials                118
    Energy                     80
    Technology                 44
    Motor Vehicles & Parts     34
    Wholesalers                28
    Name: sector, dtype: int64
```

Let's take a moment to walk through what happened in that line of code:

- We used the **print()** function to print the output of the following method chain:
    - Select the **sector** column from the **f500** dataframe, and on the resulting series
    - Use **Series.value_counts()** to produce a series of the unique values and their counts, ordered from most to least common, and on the resulting series
    - Use the [Series.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.head.html#pandas.Series.head) method to return the first 5 items only.
    
    
We haven't seen the **Series.head()** method before, but it works similarly to **DataFrame.head()**, returning the first five items from a series, or a different number if you provide an argument.

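The **Series.value_counts()** method is one of the handiest methods to use when exploring a data set. When this lesson was written it had no dataframe counterpart; pandas 1.1 and later also provide **DataFrame.value_counts()**, which counts unique rows rather than the values of a single column.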
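
As a quick sketch of the pattern on data you can count by hand, the example below uses a hypothetical **sectors** series (the values are made up, not taken from **f500**).

```python
import pandas as pd

# Hypothetical values: Energy appears 3 times, Retail twice, Tech once.
sectors = pd.Series(["Energy", "Retail", "Energy", "Tech", "Energy", "Retail"])

# Count each unique value, most common first, then keep the top 2.
top2 = sectors.value_counts().head(2)
print(top2)
```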

Don't worry too much about having to remember which methods belong to which objects for now. You'll find that as you practice them some will stick, and for the rest you'll be able to reference the pandas documentation.

Let's start the process by practicing some of these to explore the Fortune 500 some more!


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **country** column, and assign the results to **top5_countries**.
- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **previous_rank** column, and assign the results to **top5_previous_rank**.
- Use the appropriate **max()** method to find the maximum value for only the numeric columns from **f500** (you may need to check the documentation), and assign the result to the variable **max_f500**.






In [0]:
# put your code here

## 4.9 Assignment with pandas

Looking at the results of the most common values for the **previous_rank** column in the last exercise, you might have noticed something a little odd:

```python
>>> print(top5_previous_rank.head())

    0      33
    159     1
    147     1
    148     1
    149     1
    Name: previous_rank, dtype: int64
```

This indicates that 33 companies had the value **0** for their rank in the Fortune 500 for the previous year. Given that a rank of zero doesn't exist, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value to more clearly indicate that the value is missing. There are a few things we need to be able to do before we can correct this. The first is how to assign values using pandas.

When we used NumPy, we learned that the same techniques that we use to select data could be used for assignment. Let's look at an example:


```python
my_array = np.array([1, 2, 3, 4])

# to perform selection
print(my_array[0])

# to perform assignment
my_array[0] = 99
```

The same is true with pandas. Let's look at this example:

```python
>>> top5_rank_revenue = f500[["rank", "revenues"]].head()

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1    485873
    State Grid                   2    315199
    Sinopec Group                3    267518
    China National Petroleum     4    262573
    Toyota Motor                 5    254694

>>> top5_rank_revenue["revenues"] = 0

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3         0
    China National Petroleum     4         0
    Toyota Motor                 5         0
    
```


When we select a whole column by label and use assignment, we assign the value to every item in that column.

By providing labels for both axes, we can assign to a single value within our dataframe.

```python
>>> top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3       999
    China National Petroleum     4         0
    Toyota Motor                 5         0
```

If we assign a value using an index or column label that does not exist, pandas will create a new row or column in our dataframe. Let's add a new column and new row to our **top5_rank_revenue** dataframe:

```python
>>> top5_rank_revenue["year_founded"] = 0

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0

>>> top5_rank_revenue.loc["My New Company"] = 555

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0
    My New Company             555       555           555
```


There is one exception to be aware of: You **can't** create a new row/column by attempting to use the dot accessor shortcut with a label that does not exist.
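
The sketch below illustrates that exception on a made-up two-row dataframe. The names **via_dot** and **via_brackets** are hypothetical labels chosen for this example.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
toy = pd.DataFrame({"rank": [1, 2]}, index=["Walmart", "State Grid"])

# Dot assignment with a new name only sets a plain Python attribute
# on the object; it does not add a column.
toy.via_dot = 0

# Bracket assignment with a new label does create a column.
toy["via_brackets"] = 0

print(list(toy.columns))
```

Only the bracket version shows up in the column list, which is why bracket (or **loc[]**) assignment is the safer habit when adding data.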

Let's practice assigning values and adding new columns using our full Fortune 500 dataframe:


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


- Add a new column, **revenues_b** to the **f500** dataframe by using vectorized division to divide the values in the existing **revenues** column by 1000 (converting them from millions to billions).
- The company **Dow Chemical** has named a new CEO. Update the value in the **ceo** column for the row with the index label **Dow Chemical** to **Jim Fitterling**.

In [0]:
# put your code here

## 4.10 Using Boolean Indexing with pandas Objects

Now that we know how to assign values in pandas, we're one step closer to being able to correct the values in the **previous_rank** column that are **0**. If we knew the name of every single row label where this case was true, we could do this manually by using a list of labels when we performed our assignment.

While it's helpful to be able to replace specific values in rows where we know the row label ahead of time, this is cumbersome when we want to do this for all rows that meet the same criteria. Another option would be to use a loop, but this would be slower and would lose the benefits of vectorization that pandas gives us. Instead, we can use **boolean indexing**.

Just like NumPy, pandas allows us to use boolean indexing to select items based on their value, which will make our task a lot easier. Let's refresh our memory of how boolean indexing is used for selection, and learn how boolean indexing works in pandas.

In NumPy, boolean arrays are created by performing a vectorized boolean comparison on a NumPy ndarray. In pandas this works almost identically, however the resulting boolean object will be either a series or a dataframe, depending on the object on which the boolean comparison was performed. Let's look at an example of performing a boolean comparison on a series vs a dataframe:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BJNbnwxzO7TBjbYhFJppw1o93Tnn7xMU">
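
In code, the same idea might look like the sketch below. The **toy** dataframe and its values are assumptions for illustration and do not come from the diagram.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"a": [1, 5, 9], "b": [7, 2, 8]})

# Comparing a single column produces a boolean series.
series_bool = toy["a"] > 4

# Comparing the whole dataframe produces a boolean dataframe.
frame_bool = toy > 4

print(series_bool)
print(frame_bool)
```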


It's much less common to use a boolean dataframe than a boolean series in pandas. You almost always want to use the results of a comparison on one column from a dataframe (a series object) to select data in the main dataframe, or a selection of the main dataframe.

Let's look at two examples of how that works in diagram form. For our example, we'll be working with this dataframe of people and their favorite numbers:

<img width="600" src="https://drive.google.com/uc?export=view&id=1FqhK-Kfr7u7JDeAbfEFxn3_0nONlIpp1">

Let's check which people have a favorite number of 8. We perform a vectorized boolean operation that produces a boolean series:

<img width="600" src="https://drive.google.com/uc?export=view&id=1OgQSNM8KwzI5Mr3UJ4Y-89tXdc9t6Ybg">


We can use that series to index the whole dataframe, leaving us the rows that correspond only to people whose favorite number is 8.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WeDutjUzaV2RqdHgJkSVWQNPOguqskak">

Note that we didn't use **loc[]**. This is because boolean arrays use the same shortcut as slices to select along the index axis.
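
The two diagram steps can be sketched in code as follows. The names and numbers in the **people** dataframe are made up for this example and may not match the ones in the diagrams.

```python
import pandas as pd

# Hypothetical people and favorite numbers.
people = pd.DataFrame({
    "name": ["Kylie", "Rahul", "Michael", "Sarah"],
    "fave_number": [8, 5, 8, 3],
})

# Step 1: a vectorized comparison produces a boolean series.
fave_8_bool = people["fave_number"] == 8

# Step 2: index the dataframe with it, without loc[], to keep
# only the rows where the boolean series is True.
fave_8 = people[fave_8_bool]
print(fave_8)
```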


Now let's look at an example of using boolean indexing with our Fortune 500 dataset. We want to find out which are the 5 most common countries for companies belonging to the **'Motor Vehicles and Parts'** industry.

We start by making a boolean series that shows us which rows from our dataframe have the value of **Motor Vehicles and Parts** for the **industry** column. We'll then print the first five items of our boolean series so we can see it in action:


```python
>>> motor_bool = f500["industry"] == "Motor Vehicles and Parts"

>>> print(motor_bool.head())

    Walmart                     False
    State Grid                  False
    Sinopec Group               False
    China National Petroleum    False
    Toyota Motor                 True
    Name: industry, dtype: bool
```


Notice that like our examples in the diagrams above, the index labels are retained in our boolean series. Next, we use that boolean series to select only the rows that have **True** for our boolean index, and just the **country** column, and then print the first 5 items to check the values:

```python
>>> motor_countries = f500.loc[motor_bool, "country"]

>>> print(motor_countries.head())

    Toyota Motor        Japan
    Volkswagen        Germany
    Daimler           Germany
    General Motors        USA
    Ford Motor            USA
    Name: country, dtype: object
```


Lastly, we can use the **value_counts()** method for the **motor_countries** series, chained to the **head()** method to produce a series of the top 5 countries for the 'Motor Vehicles and Parts' industry:

```python
>>> top5_motor_countries = motor_countries.value_counts().head()

>>> print(top5_motor_countries)

    Japan          10
    China           7
    Germany         6
    France          3
    South Korea     3
    Name: country, dtype: int64
```

Let's practice using boolean indexing in pandas to identify the five highest ranked companies from South Korea. Remember, we observed earlier that the **f500** dataframe is already sorted by rank, so we won't need to perform any extra sorting.


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


- Create a boolean series, **kr_bool**, that compares whether the values in the **country** column from the **f500** dataframe are equal to **"South Korea"**
- Use that boolean series to index the full **f500** dataframe, assigning just the first five rows to **top_5_kr.**

In [0]:
# put your code here

## 4.11 Using Boolean Arrays to Assign Values

We now have both skills we need to fix the **0** values in the **previous_rank** column. We know how to:

- perform assignment in pandas
- use boolean indexing in pandas

Let's look at an example of how we combine these two operations together. For our example, we'll want to change the **'Motor Vehicles & Parts'** values in the **sector** column to **'Motor Vehicles and Parts'** – i.e. we will change the ampersand **(&)** to **and**.

First, we create a boolean series by comparing the values in the sector column to **'Motor Vehicles & Parts'**.

```python
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"
```

Next, we use that boolean series and the string **"sector"** to perform the assignment.

```python
f500.loc[ampersand_bool,"sector"] = "Motor Vehicles and Parts"
```

Just like we saw in the NumPy mission earlier in this course, we can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays:

```python
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
```

Now we can follow this pattern to replace the values in the **previous_rank** column. We'll replace these values with **np.nan**, which is used in pandas, just as it is in NumPy, to represent values that can't be represented numerically, most commonly missing values.

To make comparing the values in this column before and after our operation easier, we've added the following line of code to the cell below:

```python
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```

This uses **Series.value_counts()** and **Series.head()** to display the 5 most common values in the **previous_rank** column, but adds an additional **dropna=False** parameter, which stops the **Series.value_counts()** method from excluding null values when it makes its calculation, as shown in the [Series.value_counts() documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html#pandas.Series.value_counts).
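
To see what **dropna=False** changes, here is a minimal sketch on a hypothetical four-item series containing one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 3.0 appears twice, 7.0 once, plus one NaN.
ranks = pd.Series([3.0, np.nan, 3.0, 7.0])

# By default, null values are left out of the counts.
without_nulls = ranks.value_counts()

# With dropna=False, NaN gets its own entry in the result.
with_nulls = ranks.value_counts(dropna=False)

print(without_nulls)
print(with_nulls)
```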


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


- Use boolean indexing to update values in the **previous_rank** column of the **f500** dataframe:
  - Where previously there was a value of **0**, there should now be a value of **np.nan**.
  - It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.
- Create a new pandas series, **prev_rank_after**, using the same syntax that was used to create the **prev_rank_before** series.
- After you have run your code, use the variable inspector to compare **prev_rank_before** and **prev_rank_after**.

In [0]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

# put your code here

## 4.12 Challenge: Top Performers by Country

You may have noticed that after we assigned NaN values the previous_rank column changed dtype. Let's take a closer look:

```python
>>> print(prev_rank_before)

    0      33
    159     1
    147     1
    148     1
    149     1

>>> print(prev_rank_after)

    NaN      33
    471.0     1
    234.0     1
    125.0     1
    166.0     1
```

The index of the series that **Series.value_counts()** produces is now showing us floats like 471.0 instead of the integers from before. The reason behind this is that pandas uses the NumPy integer dtype, which does not support NaN values. Pandas inherits this behavior, and in instances where you try to assign a NaN value to an integer column, pandas will silently convert that column to a float dtype. If you're interested in finding out more about this, [there is a specific section on integer NaN values in the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions).
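
The conversion is easy to observe on a small scale. The sketch below uses a hypothetical one-column dataframe, not the real **f500** data, and records the column's dtype before and after assigning NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical integer column containing two 0 values.
toy = pd.DataFrame({"previous_rank": [0, 5, 0, 7]})
dtype_before = toy["previous_rank"].dtype

# Replace the 0 values with NaN using boolean indexing.
toy.loc[toy["previous_rank"] == 0, "previous_rank"] = np.nan
dtype_after = toy["previous_rank"].dtype

# The column starts as an integer dtype and ends as a float dtype.
print(dtype_before, dtype_after)
```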

We'll finish this mission with a challenge. In this challenge, we'll calculate a specific statistic or attribute of each of the three most common countries from our f500 dataframe. We've identified the three most common countries using the code below:

```python
>>> top_3_countries = f500["country"].value_counts().head(3)

>>> print(top_3_countries)

USA      132
China    109
Japan     51
Name: country, dtype: int64
```

Don't be discouraged if this takes a few attempts to get right. Working with data is an iterative process!

**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


1. Create a series, **cities_usa**, containing counts of the five most common Headquarter Location cities for companies headquartered in the USA.
2. Create a series, **sector_china**, containing counts of the three most common sectors for companies headquartered in China.
3. Create a float object, **mean_employees_japan**, containing the mean number of employees for companies headquartered in Japan.

In [0]:
# put your code here

In this mission, we learned:

- How pandas and NumPy combine to make working with data easier
- About the two core pandas types: series and dataframes
- How to select data from pandas objects using axis labels
- How to select data from pandas objects using boolean arrays
- How to assign data using labels and boolean arrays
- How to create new rows and columns in pandas
- Many new methods to make data analysis easier in pandas.

# 5.0 Exploring Data with pandas




## 5.1 Introduction

When we learned how to select data in NumPy, we used the integer position to create our selection:

<img width="400" src="https://drive.google.com/uc?export=view&id=1sNhetXwe-iqdVzkuu65z0p2rgAb9P6J0">

In pandas, each axis has labels, and we've learned to use loc[] to specify labels to create our selection:

<img width="400" src="https://drive.google.com/uc?export=view&id=19qbWRXXH0SrBu2FnMREay_KyvifucKNd">


In some scenarios, like specifying specific columns, using labels to make selections makes things easier - in others though, it makes things harder. If you wanted to select the tenth to twentieth rows in a dataframe, you'd need to know their labels first.

In this section, we'll learn how to index by integer position with pandas. We'll also learn more advanced selection techniques which will help us perform more complex data analysis.

We'll continue to use the Fortune Global 500 (2017) dataset from the previous mission.

## 5.2 Using iloc to select by integer position

Because pandas uses NumPy objects behind the scenes to store the data, the integer positions we used to select data in NumPy can also be used. To select data by integer position using pandas we use the [DataFrame.iloc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) method and the [Series.iloc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iloc.html) method. It's easy to get loc[] and iloc[] confused at first, but the easiest way to tell them apart is to remember the first letter of each method:

- **l**oc: **l**abel based selection
- **i**loc: **i**nteger position based selection

Using the **iloc[]** methods is almost identical to indexing with NumPy, with integer positions starting at **0** like ndarrays and Python lists. Let's take a look at how we would perform our selection from the previous screen using **iloc[]:**

<img width="400" src="https://drive.google.com/uc?export=view&id=1dQa9Y1ZVbYHCA0BhQxWL5FPgJVPAyU1P">


As you can see, **DataFrame.iloc[]** behaves similarly to **DataFrame.loc[]**. The full syntax for **DataFrame.iloc[]**, in pseudocode, is:

```python
df.iloc[row,column]
```

The valid inputs for row and column are almost identical to when you use **DataFrame.loc[]**, with the distinction being that you are using integers rather than labels:

- A single integer position.
- A list or array of integer positions.
- A slice object with integer positions.
- A boolean array.
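
The four input types above can be sketched on a small dataframe as follows. The **toy** dataframe is hypothetical and exists only to keep the results easy to verify.

```python
import pandas as pd

# Hypothetical four-row dataframe for illustration.
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "rank": [1, 2, 3, 4],
    "revenues": [40, 30, 20, 10],
})

# A single integer position for the row and for the column.
single = toy.iloc[2, 0]

# Lists of integer positions: rows 0 and 3, columns 0 and 2.
listed = toy.iloc[[0, 3], [0, 2]]

# Slices of integer positions: rows 1 and 2, columns 0 and 1.
sliced = toy.iloc[1:3, :2]

# A boolean list for the rows, one value per row.
masked = toy.iloc[[True, False, True, False], 0]

print(single)
print(listed)
print(sliced)
print(masked)
```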

Let's say we wanted to select just the first column from our **f500** dataframe. To do this, we use the **:** wildcard to specify all rows, and then use the integer **0** to specify the first column:

```python
first_column = f500.iloc[:,0]
print(first_column)
```
```python
0                        Walmart
1                     State Grid
2                  Sinopec Group
...
497    Wm. Morrison Supermarkets
498                          TUI
499                   AutoNation
Name: company, dtype: object
```

If we want to select a single row, we don't need to specify a column wildcard. Because integer positions start at **0**, position **3** is the fourth row. Let's see how we'd select it:

```python
fourth_row = f500.iloc[3]
print(fourth_row)
```
```python
company                 China National Petroleum
rank                                           4
revenues                                  262573
revenue_change                             -12.3
profits                                   1867.5
assets                                    585619
profit_change                              -73.7
ceo                                Zhang Jianhua
industry                      Petroleum Refining
sector                                    Energy
previous_rank                                  3
country                                    China
hq_location                       Beijing, China
website                   http://www.cnpc.com.cn
years_on_global_500_list                      17
employees                                1512048
total_stockholder_equity                  301893
Name: 3, dtype: object
```

If we are specifying a positional slice, we can take advantage of the same shortcut that we use with labels, using brackets directly on the dataframe without **loc** or **iloc**. Here's how we would select the rows from index position one up to and including position four:

```python
second_to_fifth_rows = f500[1:5]
```

```python
             company  rank  revenues ... employees  total_stockholder_equity
1         State Grid     2    315199 ...    926067                    209456
2      Sinopec Group     3    267518 ...    713288                    106523
3  China National...     4    262573 ...   1512048                    301893
4       Toyota Motor     5    254694 ...    364445                    157210
```

In the example above, the row at index position **5** is not included, just like if we were slicing with a Python list or NumPy ndarray. It's worth reiterating that **loc[]** and **iloc[]** handle the end of a slice differently, as we learned in the previous mission:

- With **loc[]**, the **end of the slice is included.**
- With **iloc[]**, the **end of the slice is not included.**
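
A minimal sketch of the difference, using a hypothetical five-row dataframe with the default integer index so that labels and positions happen to use the same numbers:

```python
import pandas as pd

# Hypothetical data; the index labels are 0 to 4.
toy = pd.DataFrame({"company": ["A", "B", "C", "D", "E"]})

# loc[] slices by label and includes the end label (1, 2 and 3).
label_slice = toy.loc[1:3]

# iloc[] slices by position and excludes the end position (1 and 2).
position_slice = toy.iloc[1:3]

print(label_slice)
print(position_slice)
```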

The table below summarizes how we can use **DataFrame.iloc[]** and **Series.iloc[]** to select by integer position:


<img width="600" src="https://drive.google.com/uc?export=view&id=18jhblUrPsASHHdT5Lgpr6mmPmaIYo6og">


**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">

We have provided code to read the **f500.csv** file into a dataframe and assigned it to **f500**, and inserted **NaN** values into the **previous_rank** column as we did in the previous section.

- Select just the fifth row of the **f500** dataframe, assigning the result to **fifth_row**.
- Select the first three rows of the **f500** dataframe, assigning the result to **first_three_rows**.
- Select the first and seventh rows and the first 5 columns of the **f500** dataframe, assigning the result to **first_seventh_row_slice**.



In [0]:
# put your code here

## 5.3 Reading CSV files with pandas

So far, we've provided the code to read the CSV file into pandas for you. In this mission, we're going to teach you how to use the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read in CSV files. Before we start, let's take a look at the first few lines of our CSV file in its raw form. To make it easier to read, we're only showing the first four columns from each line:

```python
company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7
```

Now let's take a moment to look at the code segment we've been using to read in the files.

```python
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
```


Looking at the first line only, we use the **pandas.read_csv()** function with an unnamed argument, the name of the CSV file, and a named argument for the **index_col** parameter. The **index_col** parameter specifies which column to use as the row labels. We use a value of **0** to specify that we want to use the first column.

Let's look at what the **f500** dataframe looks like after that first line. We'll use **DataFrame.iloc[]** to show the first 5 rows and the first 3 columns:

```python
>>> f500 = pd.read_csv("f500.csv", index_col=0)

>>> print(f500.iloc[:5, :3])

                              rank  revenues  revenue_change
    company                                                    
    Walmart                      1    485873             0.8
    State Grid                   2    315199            -4.4
    Sinopec Group                3    267518            -9.1
    China National Petroleum     4    262573           -12.3
    Toyota Motor                 5    254694             7.7
```

Notice that above the index labels is the text **company**. This is the value from the start of the first row of the CSV, effectively the name of the first column. Pandas has used this value as the **axis name** for the index axis. Both the column and index axes can have names assigned to them. The next line of code removes that name:

```python
f500.index.name = None
```

First, we use **DataFrame.index** to access the index axis attribute, and then we use **index.name** to access the name of the index axis. By setting this to **None** we remove the name. Let's look at what it looks like after this action:

```python
>>> f500.index.name = None

>>> print(f500.iloc[:5, :3])

                              rank  revenues  revenue_change
    Walmart                      1    485873             0.8
    State Grid                   2    315199            -4.4
    Sinopec Group                3    267518            -9.1
    China National Petroleum     4    262573           -12.3
    Toyota Motor                 5    254694             7.7
```

The index name has been removed.
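
To try these two lines without the full file, here is a minimal sketch that reads a small in-memory sample instead of **f500.csv**. The `csv_text` string and the `sample` variable are assumptions made for illustration; the sample holds only the five rows and the columns shown in the output above.

```python
import io

import pandas as pd

# Illustrative stand-in for f500.csv: the five rows and the columns shown above.
csv_text = """company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7
"""

# index_col=0 uses the first column (company) as the row labels.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The index axis name is taken from the header of the first column.
name_before = sample.index.name

# Setting the name to None removes it; the row labels themselves are unchanged.
sample.index.name = None
name_after = sample.index.name

print(name_before, name_after)
print(sample)
```

Only the axis name changes here: the company names remain the row labels, and the remaining columns are untouched.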

The **index_col** parameter we used is an optional argument. Let's look at what it looks like if we use **pandas.read_csv()** without it:

```python
>>> f500 = pd.read_csv("f500.csv")

>>> print(f500.iloc[:5,:3])

                        company  rank  revenues
    0                   Walmart     1    485873
    1                State Grid     2    315199
    2             Sinopec Group     3    267518
    3  China National Petroleum     4    262573
    4              Toyota Motor     5    254694
```

There are two differences with this approach:

- The **company** column is now included as a regular column, instead of being used for the index.
- The index labels are now integers starting from **0**.

This is the more conventional way to read in a dataframe, and it's the method we'll use from here on. There are a few things to be aware of when you have integer index labels, and we'll talk about them in the next screen.
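
As a rough check of these differences, the sketch below reads the same data both ways and compares the results. It uses a small in-memory sample rather than the real file; `csv_text`, `with_labels`, and `default_index` are names assumed for illustration.

```python
import io

import pandas as pd

# Illustrative stand-in for f500.csv: the five rows and the columns shown earlier.
csv_text = """company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7
"""

# With index_col=0, the company names become the row labels.
with_labels = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, pandas generates integer labels starting from 0
# and keeps company as a regular column.
default_index = pd.read_csv(io.StringIO(csv_text))

print(with_labels.index.tolist())
print(default_index.index.tolist())
print(default_index.columns.tolist())

# Label-based selection differs accordingly.
print(with_labels.loc["Walmart", "revenues"])
print(default_index.loc[0, "company"])
```

With the integer index, **DataFrame.loc[]** selects rows by integer label, and the company name is retrieved like any other column value.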


For now, let's re-read in the CSV file using the conventional method:




**Exercise**

<img width="60" src="https://drive.google.com/uc?export=view&id=1QoTRiOtUzjnbRL7Ue5uPxKse03tE1tPe">


The pandas library is already imported from the previous screen.

- Use the **pandas.read_csv()** function to read the **f500.csv** CSV file as a pandas dataframe, and assign it to the variable name **f500**.
  - Do not use the **index_col** parameter, so that the dataframe has integer index labels.
- Use the code below to insert the **NaN** values into the **previous_rank** column: 
```python
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
```
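
To see what that assignment does before running it on the real data, here is a minimal sketch on a toy dataframe. The `toy` dataframe and all of its values are made up for illustration and are not taken from **f500.csv**. The toy column is created as floats, since **NaN** is a floating-point value.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: 0.0 stands in for "no previous rank".
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "previous_rank": [1.0, 0.0, 3.0],
})

# The boolean mask is True for rows where previous_rank equals 0.
mask = toy["previous_rank"] == 0

# loc[row_selector, column_label] assigns NaN only to the masked rows
# of the previous_rank column.
toy.loc[mask, "previous_rank"] = np.nan

print(toy)
```

Only the row that held **0** is replaced; the other values in the column are left unchanged.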





In [0]:
# put your code here