# Introduction to Pandas

## Understanding pandas and NumPy

As we have become familiar with the NumPy library we've discovered that it makes working with data easier. Because we can easily work across multiple dimensions, our code is a lot easier to understand. By using vectorized operations instead of loops, our code will be faster with larger data.

NumPy provides fundamental structures and tools that make working with data easier, but there are several things that limit its usefulness as a single tool when working with data:

- The lack of support for column names forces us to frame the questions we want to answer as multi-dimensional array operations.
- Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
- NumPy offers many low-level methods, but many common analysis patterns have no pre-built method.

The **pandas** library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an extension of NumPy. The underlying code for pandas uses the NumPy library extensively, which means the concepts you've been learning will come in handy as you begin to learn more about pandas.
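To make the single-dtype limitation concrete, here is a minimal sketch. The two rows are hand-typed for illustration, and the names `arr` and `df` are assumptions rather than part of the mission's code.

```python
import numpy as np
import pandas as pd

# A 2D ndarray holds a single dtype, so mixing numbers and text
# converts every value to a string.
arr = np.array([[1, "USA"], [2, "China"]])
print(arr.dtype.kind)         # U (a unicode string dtype)
print(arr[0, 0] + arr[1, 0])  # 12 (string concatenation, not addition)

# A dataframe keeps one dtype per column and labels both axes.
df = pd.DataFrame(
    {"rank": [1, 2], "country": ["USA", "China"]},
    index=["Walmart", "State Grid"],
)
print(df["rank"].dtype)                 # int64
print(df.loc["State Grid", "country"])  # China
```

The numeric column stays numeric in the dataframe, and each value can be reached by its row and column labels instead of by position.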

In this mission, we'll learn:

- about the two core pandas types: dataframes and series
- how to select data using row and column labels
- a variety of methods for exploring data with pandas
- how to assign data using various techniques in pandas
- how to use boolean indexing with pandas for selection and assignment

We'll be working with a data set from [Fortune](http://fortune.com/) magazine's 2017 [Global 500](https://en.wikipedia.org/wiki/Fortune_Global_500) list, which ranks the top 500 corporations worldwide by revenue. The dataset we'll be using was originally compiled [here](https://data.world/chasewillden/fortune-500-companies-2017); however, we have modified the original data set into a more accessible format.

<img width="400" src="https://drive.google.com/uc?export=view&id=1vWtPGbbxR7Mn2xHg_KMa3MOKSqs05uyE">


The dataset is a CSV file called **f500.csv**. Here is a data dictionary for some of the columns in the CSV:

- **company** - The Name of the company.
- **rank** - The Global 500 rank for the company.
- **revenues** - The company's total revenues for the fiscal year, in millions of dollars (USD).
- **revenue_change** - The percentage change in revenue between the current and prior fiscal years.
- **profits** - Net income for the fiscal year, in millions of dollars (USD).
- **ceo** - The company's Chief Executive Officer.
- **industry** - The industry in which the company operates.
- **sector** - The sector in which the company operates.
- **previous_rank** - The Global 500 rank for the company for the prior year.
- **country** - The Country in which the company is headquartered.
- **hq_location** - The City and Country (or City and State for the USA) where the company is headquartered.
- **employees** - Total employees (full-time equivalent, if available) at fiscal year-end.


Similar to the import convention for NumPy (**import numpy as np**), the import convention for pandas is:

```python
import pandas as pd
```

We have already imported pandas and used the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read the CSV into a pandas object and assign it to the variable name f500. In the next mission we'll learn about **read_csv()**, but for now all you need to know is that it handles reading and parsing most CSV files automatically.

Like NumPy, pandas objects have a **.shape** attribute which returns a tuple representing the dimensions of each axis of the object. We'll use that and Python's **type()** function to inspect the f500 pandas object.
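As a quick illustration on a smaller object, the sketch below builds a three-row dataframe by hand (the name `small` and its contents are assumptions for this example) and inspects it the same way.

```python
import pandas as pd

# Hand-built stand-in with 3 rows and 2 columns.
small = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

small_type = type(small)    # the DataFrame class
small_shape = small.shape   # (number of rows, number of columns)

print(small_type)
print(small_shape)  # (3, 2)
```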

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Use Python's **type()** function to assign the type of **f500** to **f500_type.**
2. Use the **DataFrame.shape** attribute to assign the shape of **f500** to **f500_shape.**

In [3]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

# put your code here
f500_type = type(f500)
f500_shape = f500.shape

## Introducing DataFrames

The code we wrote in the previous screen let us know that our data has 500 rows and 16 columns, and is stored as a [pandas.core.frame.DataFrame object](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html#pandas.DataFrame). More commonly referred to as **pandas.DataFrame()** objects, or just **dataframes**, the type is the primary pandas data structure.

Dataframes are two-dimensional pandas objects, the pandas equivalent of a NumPy 2D ndarray. Unlike NumPy, pandas does not use the same type for 1D and 2D arrays.

We'll learn about the second pandas data structure, series, later in this mission, but first, let's look at the anatomy of a dataframe, using a selection of our Fortune 500 data:

<img width="500" src="https://drive.google.com/uc?export=view&id=1lUAxPbqauhiMPdWCAM2oPOy0vsmvWtYy">

There are three key things we can observe immediately:

- In Red: Just like a 2D ndarray, there are two axes, however each axis of a dataframe has a specific name. The first axis is called **index**, and the second axis is called **columns.**
- In Blue: Our axis values have string **labels**, not just numeric locations.
- In Green: Our dataframe contains columns with **multiple dtypes**: integer, float, and string.

We can use the [DataFrame.dtypes](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dtypes.html#pandas.DataFrame.dtypes) attribute (similar to NumPy's **ndarray.dtype** attribute) to return information about the types of each column. Let's see what this would return for our selection of data above:

```python
>>> f500_selection.dtypes

    rank          int64
    revenues      int64
    profits     float64
    country      object
    dtype: object
```

We can see three different data types (dtypes), which correspond to what we observed by looking at the data:

- int64
- float64
- object


We have seen the **float64** dtype before in NumPy. Pandas uses NumPy dtypes for numeric columns, including **int64**. There is also a type we haven't seen before, **object**, which is used for columns whose data doesn't fit any other dtype. This is almost always used for columns containing string values. If you like, you can run **f500.dtypes** in the console to see the types of all the columns in the f500 dataframe.

When we import data, pandas will attempt to guess the correct dtype for each column. Generally, pandas does a pretty good job with this, which means we don't need to worry about specifying dtypes every time we start to work with data. Later in this course, we'll look at how to change the dtype of a column.
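The sketch below shows this dtype inference on a tiny CSV held in memory. The CSV text is hand-typed for illustration and `io.StringIO` stands in for a file on disk; both are assumptions of this example, not part of the mission's setup.

```python
import io

import pandas as pd

# A three-row CSV with integer, float, and text columns.
csv_text = """company,rank,revenues,profits,country
Walmart,1,485873,13643.0,USA
State Grid,2,315199,9571.3,China
Sinopec Group,3,267518,1257.9,China
"""

# index_col=0 uses the first column (company) as the row labels.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# dtypes is itself a series, indexed by column label.
print(sample.dtypes)
```

No dtypes were specified, yet the whole-number columns come back as **int64** and the decimal column as **float64**. The dtype reported for the text column can depend on the pandas version.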

Next, let's learn a few handy methods we can use to get some high-level information about our dataframe:

- If we want to view the first few rows of our dataframe, we can use the [DataFrame.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) method, which returns the first 5 rows by default. The **DataFrame.head()** method also accepts an optional integer parameter that specifies the number of rows. We could use **f500.head(10)** to return the first 10 rows of our **f500** dataframe.
- Similar in function to **DataFrame.head()**, the [DataFrame.tail()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html) method shows us the last rows of our dataframe. The **DataFrame.tail()** method accepts an optional integer parameter to specify the number of rows, defaulting to 5.
- If we wanted to get an overview of all the dtypes used in our dataframe, along with its shape and some extra information, we could use the [DataFrame.info()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info) method. Note that **DataFrame.info()** prints the information, rather than returning it, so we can't assign it to a variable.
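Here is a minimal sketch of the three methods on a five-row dataframe built by hand. The name `df` and the hand-typed values are assumptions for this example.

```python
import pandas as pd

# Hand-built stand-in for a small selection of the data.
df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

first_two = df.head(2)    # first 2 rows
last_three = df.tail(3)   # last 3 rows

# info() prints its summary and returns None, so there is nothing to keep.
info_result = df.info()
print(info_result)  # None
```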

Let's practice using these three new methods. Just like in the previous missions, the f500 variable we created in the previous section is available to you here.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. Using the links above to the documentation if you need to, use the three methods we just learned about to learn more about the **f500** dataframe:

  - Use the **head()** method to select the first 6 rows and assign the result to **f500_head**.
  - Use the **tail()** method to select the last 8 rows and assign the result to **f500_tail.**
  - Use the **info()** method to display information about the dataframe.


In [4]:
# put your code here
f500_head = f500.head(6)
f500_tail = f500.tail(8)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 66.4+ KB


## Selecting Columns From a DataFrame by Label

By looking at the results produced by the **DataFrame.head()** and **DataFrame.tail()** methods in the previous screen, we can see that our data set seems to be pre-sorted in order of Fortune 500 rank.

We can also see that the **DataFrame.info()** method showed us the number of entries in our index (representing the number of rows), a list of each column with their dtype and the number of non-null values, as well as a summary of the different dtypes and memory usage. In pandas, null values are represented using NaN, just like in NumPy.

Because our axes in pandas have labels, we can select data using those labels, unlike in NumPy where we needed to know the exact index location. To do this, we use the [DataFrame.loc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc) method.

Throughout our pandas missions you'll see **df** used in code examples as shorthand for a dataframe object. We use this convention because you also see this throughout the official pandas documentation, so getting used to reading it is important. You'll notice that we use brackets **([])** instead of parentheses **(())** when selecting by location. This is similar to how we use brackets when selecting by location in Python lists or NumPy arrays. The syntax for the **DataFrame.loc[]** method is:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels respectively, and can be one of:

- A single label.
- A list or array of labels.
- A slice object with labels.
- A boolean array.

We'll look at boolean arrays later in this mission - for now, we're going to focus on the first three options. We're going to use the same selection of data we used in the previous screen, which is stored using the variable name **f500_selection** to make these examples easier.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WNexstd5iVGtj04VxjcnQSqa-UTSETDg">


In each of these examples, we're going to use **:** to specify that we wish to select all rows, so we can focus on making selections using column labels only.

First, let's select a single column by specifying a single label:

<img width="600" src="https://drive.google.com/uc?export=view&id=15V82UWlysJjrA_Eg0b5vKPeMcagBhQCI">


Selecting a single column returns a pandas series. We'll talk about pandas series objects more in the next screen, but for now the important thing is to note that the new series has the same index axis labels as the original dataframe. Let's look at how we can use a list of labels to select specific columns:

<img width="600" src="https://drive.google.com/uc?export=view&id=1MSj7K1OU_0LwnKACxpYLOTfU71faWb4h">


When we use a list of labels, a dataframe is returned with only the columns specified in our list, in the order specified in our list. Just like when we used a single column label, the new dataframe has the same index axis labels as the original. Let's finish by using a **slice object with labels** to select specific columns.

<img width="600" src="https://drive.google.com/uc?export=view&id=1CEtF_oFReD_-6pqheEfEd4nRWgakEgXQ">

Again we get a dataframe object, with all of the columns from the first up to **and including** the last column in our slice. This is an important distinction: when we use slices with lists and in NumPy, the slice does not include the end value. The reason this is different with **loc[]** is that with labels it is less obvious what the end of the slice would be. When we're using integers, we know that the number after 3 is 4, but knowing the column label that comes after profits is not as obvious.
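The three kinds of column selection can be sketched in code as follows. The dataframe `df` is a hand-built stand-in with hand-typed values, which is an assumption of this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "profits": [13643.0, 9571.3, 1257.9],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Single label: returns a series with the same index labels as df.
single = df.loc[:, "rank"]

# List of labels: returns a dataframe with columns in the list's order.
several = df.loc[:, ["country", "rank"]]

# Label slice: the end label ("profits") is included.
sliced = df.loc[:, "rank":"profits"]
print(list(sliced.columns))  # ['rank', 'revenues', 'profits']

# For contrast, a positional slice of a plain list excludes the end.
labels = list(df.columns)
print(labels[0:2])  # ['rank', 'revenues']
```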

Let's practice using these techniques to select specific columns from our f500 dataframe.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Select the **industry** column, and assign the result to the variable name **industries**.
2. Select the **rank**, **previous_rank** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **previous**.
3. Select all columns from **revenues** up to and including **profit_change**, in order, and assign the result to the variable name **financial_data**.

In [21]:
# put your code here
industries = f500.loc[:, "industry"]
previous = f500.loc[:, ["rank", "previous_rank", "years_on_global_500_list"]]

financial_data = f500.loc[:,"revenues":"profit_change"]

## Column selection shortcuts

There are two shortcuts that pandas provides for accessing columns.

1. **Single Bracket** – Instead of **df.loc[:,"col_1"]** you can use **df["col_1"]** to select columns. This works for single columns and lists of columns but not for column slices. This style of selecting columns is very commonly seen and we will use it throughout our Dataquest missions.
2. **Dot Accessor** – Instead of **df.loc[:,"col_1"]** you can use **df.col_1**. This shortcut does not work for labels that contain spaces or special characters. This style of selecting columns is much more rarely seen, and we will not use this in our Dataquest missions.
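The sketch below checks that both shortcuts return the same data as **loc[]**. The dataframe `df` is hand-built for illustration. The last check shows one further caveat of the dot accessor that is not covered in the list above, so treat it as a side note: a column label that matches an existing dataframe method name, such as **rank**, is not reachable through the dot.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

by_loc = df.loc[:, "country"]
by_bracket = df["country"]   # single-bracket shortcut
by_dot = df.country          # dot accessor

print(by_bracket.equals(by_loc))  # True
print(by_dot.equals(by_loc))      # True

# The bracket shortcut also accepts a list of labels.
pair = df[["rank", "country"]]
print(pair.equals(df.loc[:, ["rank", "country"]]))  # True

# df.rank is the DataFrame.rank() method, not the "rank" column.
print(callable(df.rank))  # True
```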

These shortcuts are designed to make some of the more common selection tasks easier. We recommend you always use the common shorthand in your code, as it will make your code easier to read. A summary of the techniques we've learned so far is below:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BlhNf3XAGs0E50GISg0-W0sZXcRowrRe">

Let's practice selecting data by column some more, this time using the common shorthand method.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Select the **country** column, and assign the result to the variable name **countries**.
2. Select the **revenues** and **years_on_global_500_list** columns, in order, and assign the result to the variable name **revenues_years**.
3. Select all columns from **ceo** up to and including **sector**, in order, and assign the result to the variable name **ceo_to_sector**.



In [28]:
# put your code here
countries = f500["country"]
revenues_years = f500[['revenues', 'years_on_global_500_list']]
ceo_to_sector = f500.loc[:,"ceo":"sector"]

## Selecting Items from a Series by Label

In the last section we observed that when you select just one column of a dataframe, you get a new pandas type: a **series object**. Series is the pandas type for one-dimensional objects. Anytime you see a 1D pandas object, it will be a series, and anytime you see a 2D pandas object, it will be a dataframe.

You might like to think of a dataframe as being a collection of series objects, which is similar to how pandas stores the data behind the scenes.

<img width="600" src="https://drive.google.com/uc?export=view&id=1xNxF1TgmYOzlYj6ogofxKBiSO7FqsHOH">


To better understand the relationship between dataframe and series objects, we'll look at some examples. We'll start by looking at two pandas operations that each produce a series object:


<img width="600" src="https://drive.google.com/uc?export=view&id=1aI-saLdbP4eZOQKGjHTJT54mFCa2Vu42">

Because a series has only one axis, its axis labels are either the index axis or column axis labels, depending on whether it is representing a row or a column from the original dataframe. If we make a 2D selection from a dataframe, it will retain the labels from both axes:

<img width="600" src="https://drive.google.com/uc?export=view&id=1YyiMWxDLQ8bo4vkL1qZc5Pq49C4KNqE8">
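The same relationships can be checked in code. This is a minimal sketch on a hand-built dataframe; the names `df`, `col`, `row`, and `sub` are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenues": [485873, 315199, 267518],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# A column as a series: its labels are the dataframe's index labels.
col = df["revenues"]
print(list(col.index))  # ['Walmart', 'State Grid', 'Sinopec Group']

# A row as a series: its labels are the dataframe's column labels.
row = df.loc["Walmart"]
print(list(row.index))  # ['rank', 'revenues', 'country']

# A 2D selection stays a dataframe and keeps labels on both axes.
sub = df.loc[["Walmart", "State Grid"], ["rank", "revenues"]]
print(sub.shape)  # (2, 2)
```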

Let's look at a brief summary of the differences between dataframes and series.


<img width="400" src="https://drive.google.com/uc?export=view&id=1k7GgAAgsHY8YD-Z8x7Enerozt9MlqkuR">


Just like dataframes, we can use **Series.loc[]** to select items from a series using single labels, a list, or a slice object. We can also omit **loc[]** and use bracket shortcuts for all three. Let's look at an example:

```python
>>> print(s)

a    0
b    1
c    2
d    3
e    4
dtype: int64
```

We can select a single item:

```python
print(s["d"])

3
```

Like with dataframe columns, there is a dot accessor (e.g., **s.d**) available, but it is rarely used, even less than the dataframe dot accessor.

To select several items using a list:

```python
print(s[["a", "e", "c"]])

a    0
e    4
c    2
dtype: int64
```

And lastly, several items using a slice:

```python
print(s["a":"d"])

a    0
b    1
c    2
d    3
dtype: int64
```

Let's practice selecting data from pandas series:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

1. From the pandas series **ceos**:
  -  Select the item at index label **Walmart** and assign the result to the variable name **walmart**.
  -  Select the items from index label **Apple** up to and including index label **Samsung Electronics** and assign the result to the variable name **apple_to_samsung**.
  -  Select the items with index labels **Exxon Mobil**, **BP**, and **Chevron**, in order, and assign the result to the variable name **oil_companies**.



In [38]:
ceos = f500["ceo"]

walmart = ceos["Walmart"]
apple_to_samsung = ceos["Apple":"Samsung Electronics"]
oil_companies = ceos[["Exxon Mobil", "BP", "Chevron"]]

## Selecting Rows From a DataFrame by Label

Now that we've learned how to select columns using the labels of the **'column'** axis, let's learn how to select rows using the labels of the **'index'** axis.

<img width="400" src="https://drive.google.com/uc?export=view&id=1HH7k0yK6abEG4xKiHgPTt5I_rz1OE5UD">

Selecting **rows** from a dataframe by label uses the same syntax as we use for **columns.** As a reminder:

```python
df.loc[row, column]
```

Where **row** and **column** refer to row and column labels. We'll look at how to select rows, again using our **f500_selection** dataframe to make these examples easier.

```python
print(type(f500_selection))
print(f500_selection)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
Walmart                      1    485873  13643.0     USA
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```

To select a single row:

```python
single_row = f500_selection.loc["Sinopec Group"]
print(type(single_row))
print(single_row)

<class 'pandas.core.series.Series'>

rank             3
revenues    267518
profits     1257.9
country      China
Name: Sinopec Group, dtype: object
```

As we would expect, a single row is returned as a series. We should take a moment to note that the dtype of this series is object. Because this series has to store integer, float, and string values pandas uses the object dtype, since none of the numeric types could cater for all values.
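The sketch below shows how the dtype of a row series depends on which columns the row spans. The dataframe is a hand-built copy of the **f500_selection** printout above, and selecting a subset of columns before selecting the row is an illustrative choice made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# Integers, floats, and text in one row: only object can hold them all.
mixed_row = df.loc["Sinopec Group"]
print(mixed_row.dtype)  # object

# Only integer columns: the row can stay int64.
ints_only = df[["rank", "revenues"]].loc["Sinopec Group"]
print(ints_only.dtype)  # int64

# Integer and float columns: the integers are converted to float64.
int_and_float = df[["rank", "profits"]].loc["Sinopec Group"]
print(int_and_float.dtype)  # float64
```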

To select a list of rows:

```python
list_rows = f500_selection.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))
print(list_rows)

<class 'pandas.core.frame.DataFrame'>

              rank  revenues  profits country
Toyota Motor     5    254694  16899.3   Japan
Walmart          1    485873  13643.0     USA
```

For selection using slices, we can use the bracket shortcut without **loc[]**. This is the reason we can't use this shortcut for column slices: a slice inside plain brackets is reserved for use with rows:

```python
slice_rows = f500_selection["State Grid":"Toyota Motor"]
print(type(slice_rows))
print(slice_rows)
```

```python
<class 'pandas.core.frame.DataFrame'>

                          rank  revenues  profits country
State Grid                   2    315199   9571.3   China
Sinopec Group                3    267518   1257.9   China
China National Petroleum     4    262573   1867.5   China
Toyota Motor                 5    254694  16899.3   Japan
```
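Row and column selectors can also be combined in a single **loc[]** call. The sketch below rebuilds the **f500_selection** printout by hand (an assumption of this example) and mixes a list, a slice, and single labels.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# A list of row labels with a slice of column labels.
subset = df.loc[["Toyota Motor", "Walmart"], "rank":"profits"]
print(subset.shape)  # (2, 3)

# A slice of row labels with a one-item list of column labels.
row_slice = df.loc["State Grid":"China National Petroleum", ["country"]]
print(row_slice.shape)  # (3, 1)

# A single row label and a single column label return one value.
cell = df.loc["State Grid", "country"]
print(cell)  # China
```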

Let's take a look at a summary of all the different label selection methods we've learned so far:


<img width="800" src="https://drive.google.com/uc?export=view&id=1rQMPkOZBVh57x6kVu5qWjU3UC6vCh9v_">


Now for some practice - we're going to make it a little bit harder this time, by asking you to combine selection methods for rows and columns on both dataframes and series!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. By selecting data from **f500**:
  - Create a new variable, **drink_companies**, with:
    - Rows with index labels **Anheuser-Busch InBev**, **Coca-Cola**, and **Heineken Holding**, in that order.
    - All columns.
  - Create a new variable, **big_movers**, with:
    - Rows with index labels **Aviva**, **HP**, **JD.com**, and **BHP Billiton**, in that order.
    - The **rank** and **previous_rank** columns, in that order.
  - Create a new variable, **middle_companies**, with:
    - All rows with index labels from **Tata Motors** to **Nationwide**, inclusive.
    - All columns from **rank** to **country**, inclusive.

In [53]:
# put your code here
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"], :]
big_movers = f500.loc[["Aviva", "HP", "JD.com", "BHP Billiton"], ["rank", "previous_rank"]]
middle_companies = f500.loc["Tata Motors":"Nationwide", "rank":"country"]

big_movers

| | rank | previous_rank |
|--------------|------|---------------|
| Aviva | 90 | 279 |
| HP | 194 | 48 |
| JD.com | 261 | 366 |
| BHP Billiton | 350 | 168 |


## Series and Dataframe Describe Methods

We're starting to get a feel for how axis labels in pandas make selecting data much easier. Pandas also has a large number of methods and functions that make working with data easier. Let's use a few of these to explore our Fortune 500 data.

The first method we'll learn about is the [Series.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html#pandas.Series.describe) method, which returns some descriptive statistics on the data contained within a specific pandas series. Let's look at an example:

```python
revs = f500["revenues"]
print(revs.describe())
```

```python
count       500.000000
mean      55416.358000
std       45725.478963
min       21609.000000
25%       29003.000000
50%       40236.000000
75%       63926.750000
max      485873.000000
Name: revenues, dtype: float64
```

We've assigned the **revenues** column to a new series, **revs**, and then used the **describe()** method on that series. The method tells us how many non-null values are contained in the series, the mean and standard deviation, along with the minimum, maximum and [quartile](https://en.wikipedia.org/wiki/Quartile) values.

Rather than assigning the series to its own variable, we can skip that step and use the method directly on the result of the column selection. This is called **method chaining** and is a way to combine multiple methods together in a single line. It's not unique to pandas, but it is something that you see a lot in pandas code. Let's see what the command looks like with method chaining, using the **assets** column.


```python
print(f500["assets"].describe())
```

```python
count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64
```

From here, you'll start to see method chaining used more in our missions. When writing code, you should always assess whether method chaining will make your code harder to read. It's always preferable to break out into more than one line if it will make your code easier to understand.
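As a small sketch of the trade-off, the example below compares the two-step and chained forms on a hand-built dataframe (an assumption of this example), then shows one common way to spread a longer chain over several lines by wrapping it in parentheses.

```python
import pandas as pd

df = pd.DataFrame(
    {"revenues": [485873, 315199, 267518, 262573, 254694]},
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)

# Two steps: select the column, then describe it.
revs = df["revenues"]
two_step = revs.describe()

# One step: chain describe() onto the column selection.
chained = df["revenues"].describe()
print(two_step.equals(chained))  # True

# A longer chain, one method per line inside parentheses.
top_three_mean = (
    df["revenues"]
    .head(3)   # first three rows
    .mean()    # average of those three values
)
print(top_three_mean)
```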

You might have noticed that the values in the code segment above look a little bit different. Because the values for this column are too long to display neatly, pandas has displayed them in **E-notation**, a type of [scientific notation](https://en.wikipedia.org/wiki/Scientific_notation). Here is an expansion of what the E-notation represents:

| Original Notation | Expanded Formula | Result |
|-------------------|--------------------|----------|
| 5.000000E+02 | 5.000000 * 10 ** 2 | 500 |
| 2.436323E+05 | 2.436323 * 10 ** 5 | 243632.3 |
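Python reads and writes E-notation directly, which gives a quick way to check the table above. This is a minimal sketch using only built-in functions; the variable names are assumptions for this example.

```python
# float() parses E-notation strings.
count_value = float("5.000000E+02")
mean_value = float("2.436323E+05")
print(count_value)  # 500.0
print(mean_value)   # 243632.3

# The "e" format code converts a number to E-notation.
formatted = f"{3473238:e}"
print(formatted)    # 3.473238e+06
```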


If we use **describe()** on a column that contains non-numeric values, we get some different statistics. Let's look at an example:

```python
print(f500["country"].describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object
```

Here is what the output indicates:

The first statistic, **count**, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:

- **unique** - The number of unique values in the series. In this case, it tells us that 34 different countries are represented in the Global 500.
- **top** - The most common value in the series. The USA is the country in which the most Global 500 companies are headquartered.
- **freq** - The frequency of the most common value. Here, 132 Global 500 companies are headquartered in the USA.
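The same four statistics can be checked by hand on a series small enough to count by eye. The five-item `countries` series below is hand-built for illustration and is not the full column.

```python
import pandas as pd

countries = pd.Series(
    ["USA", "China", "China", "China", "Japan"],
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
    name="country",
)

desc = countries.describe()
print(desc["count"])   # 5 non-null values
print(desc["unique"])  # 3 distinct countries
print(desc["top"])     # China
print(desc["freq"])    # 3
```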

Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many cases where both series and dataframe objects have a method of the same name that behaves in a similar way. DataFrame objects also have a [DataFrame.describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method that returns these same statistics for every column. If you like, you can take a look at the documentation using the link in the previous sentence to familiarize yourself with some of the differences between the two methods.

One difference is that you need to specify manually if you want to see the statistics for the non-numeric columns. By default, **DataFrame.describe()** will return statistics for only numeric columns. If we wanted to get just the object columns, we need to use the **include=['O']** parameter when using the dataframe version of describe:

```python
print(f500.describe(include=['O']))
```

```python
_            ceo    industry     sector  country  hq_location    website
count        500         500        500      500          500        500
unique       500          58         21       34          235        500
top     Xavie...   Banks:...  Financ...      USA  Beijing,...  http:/...
freq           1          51        118      132           56          1
```

Another difference is that **Series.describe()** returns a series object, whereas **DataFrame.describe()** returns a dataframe object.
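Both differences can be seen on a small hand-built dataframe. In this sketch the text column is cast to the **object** dtype explicitly, which is an assumption made so that **include=['O']** matches it regardless of how a given pandas version stores strings by default.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "rank": [1, 2, 3, 4, 5],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "profits": [13643.0, 9571.3, 1257.9, 1867.5, 16899.3],
        "country": ["USA", "China", "China", "China", "Japan"],
    },
    index=["Walmart", "State Grid", "Sinopec Group",
           "China National Petroleum", "Toyota Motor"],
)
# Assumption for this example: force the text column to the object dtype.
df["country"] = df["country"].astype(object)

# Default: numeric columns only.
numeric_desc = df.describe()
print(list(numeric_desc.columns))  # ['rank', 'revenues', 'profits']

# include=['O']: object columns only.
object_desc = df.describe(include=["O"])
print(list(object_desc.columns))   # ['country']

# The series version returns a series, the dataframe version a dataframe.
series_desc = df["rank"].describe()
print(isinstance(series_desc, pd.Series))      # True
print(isinstance(numeric_desc, pd.DataFrame))  # True
```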

Let's practice using both the series and dataframe describe methods:


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Use the appropriate **describe()** method to:
  - Return a series of descriptive statistics for the **profits** column, and assign the result to **profits_desc**.
  - Return a dataframe of descriptive statistics for the **revenues** and **employees** columns, in order, and assign the result to **revenue_and_employees_desc**.
  - Return a dataframe of descriptive statistics for every column in the **f500** dataframe, by checking the documentation for the correct value for the **include** parameter, and assign the result to **all_desc**.




In [70]:
# put your code here
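profits_desc = f500["profits"].describe()

revenue_and_employees_desc = f500[["revenues", "employees"]].describe()

# include="all" asks DataFrame.describe() for statistics on every column
all_desc = f500.describe(include="all")

# Column-by-column alternative that produced the table below. Non-numeric
# columns only contribute their count, because their other statistics
# (unique, top, freq) are not rows of the numeric summary.
df = pd.DataFrame()

for i in f500.columns:
    df[i] = f500[i].describe()

df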

| | rank | revenues | revenue_change | profits | assets | profit_change | ceo | industry | sector | previous_rank | country | hq_location | website | years_on_global_500_list | employees | total_stockholder_equity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 498.0 | 499.0 | 500.0 | 436.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 55416.358 | 4.538353 | 3055.203206 | 243632.3 | 24.152752 | | | | 222.134 | | | | 15.036 | 133998.3 | 30628.076 |
| std | 144.481833 | 45725.478963 | 28.549067 | 5171.981071 | 485193.7 | 437.509566 | | | | 146.941961 | | | | 7.932752 | 170087.8 | 43642.576833 |
| min | 1.0 | 21609.0 | -67.3 | -13038.0 | 3717.0 | -793.7 | | | | 0.0 | | | | 1.0 | 328.0 | -59909.0 |
| 25% | 125.75 | 29003.0 | -5.9 | 556.95 | 36588.5 | -22.775 | | | | 92.75 | | | | 7.0 | 42932.5 | 7553.75 |
| 50% | 250.5 | 40236.0 | 0.55 | 1761.6 | 73261.5 | -0.35 | | | | 219.5 | | | | 17.0 | 92910.5 | 15809.5 |
| 75% | 375.25 | 63926.75 | 6.975 | 3954.0 | 180564.0 | 17.7 | | | | 347.25 | | | | 23.0 | 168917.2 | 37828.5 |
| max | 500.0 | 485873.0 | 442.3 | 45687.0 | 3473238.0 | 8909.5 | | | | 500.0 | | | | 23.0 | 2300000.0 | 301893.0 |


## More Data Exploration Methods

Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported. One basic concept is vectorized operations. Let's look at an example of how this would work with a pandas series:

```python
>>> print(my_series)

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

>>> my_series = my_series + 10

>>> print(my_series)

    0    11
    1    12
    2    13
    3    14
    4    15
    dtype: int64
```

Many of the descriptive stats methods are also supported. Here are a few handy methods (with links to documentation) that you might use when working with data in pandas:

- [Series.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html) and [DataFrame.max()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html)
- [Series.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html) and [DataFrame.min()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html)
- [Series.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) and [DataFrame.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html)
- [Series.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html) and [DataFrame.median()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html)
- [Series.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html) and [DataFrame.mode()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html)
- [Series.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html) and [DataFrame.sum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)


As the documentation indicates, the series methods don't require an axis parameter, but the dataframe methods accept one so that we can specify which axis to calculate across. While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings **"index"** and **"columns"** for the axis parameter. Let's refresh our memory on how this works:

<img width="700" src="https://drive.google.com/uc?export=view&id=1euiSMOgXE7IVP_U-VIRzwV6JAXBpC4yx">

For instance, if we wanted to find the median (middle) value for the **revenues** and **profits** columns, we could use the following code:

```python
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64
  
```



In fact, the default value for the axis parameter with these methods is **axis=0**, so we could have just used the **median()** method without a parameter to get the same result!
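
To make the effect of the axis parameter concrete, here is a minimal sketch on a small, made-up dataframe (the values are hypothetical, not taken from the Fortune data). It compares the default behaviour with **axis="columns"**:

```python
import pandas as pd

# Hypothetical data: two numeric columns for three made-up companies.
df = pd.DataFrame(
    {"revenues": [100, 200, 600], "profits": [10, 40, 20]},
    index=["A", "B", "C"],
)

# Default (axis=0, or axis="index"): one result per column.
col_medians = df.median()

# axis=1, or axis="columns": one result per row.
row_max = df.max(axis="columns")

print(col_medians)
print(row_max)
```

With the default axis we get one value for each column; with **axis="columns"** we get one value for each row.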

Another extremely handy method for exploring data in pandas is the [Series.value_counts()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method. The **Series.value_counts()** method displays each unique non-null value from a series, with a count of the number of times that value is used. We saw above that the **sector** column has 21 unique values. Let's use **Series.value_counts()** to look at the top 5:

```python
>>> print(f500["sector"].value_counts().head())

    Financials                118
    Energy                     80
    Technology                 44
    Motor Vehicles & Parts     34
    Wholesalers                28
    Name: sector, dtype: int64
```

Let's take a moment to walk through what happened in that line of code:

- We used the **print()** function to print the output of the following method chain:
    - Select the **sector** column from the **f500** dataframe, and on the resulting series
    - Use the **Series.value_counts()** to produce a series of the unique values and their counts in order, and on the resulting series
    - Use the [Series.head()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.head.html#pandas.Series.head) method to return the first 5 items only.
    
    
We haven't seen the **Series.head()** method before, but it works similarly to **DataFrame.head()**, returning the first five items from a series, or a different number if you provide an argument.

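The **Series.value_counts()** method is one of the handiest methods to use when exploring a data set. In older versions of pandas it had no dataframe counterpart; since pandas 1.1 there is also a **DataFrame.value_counts()** method, which counts unique rows rather than the values of a single column.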

Don't worry too much about having to remember which methods belong to which objects for now. You'll find that as you practice them some will stick, and for the rest you'll be able to reference the pandas documentation.

Let's start the process by practicing some of these to explore the Fortune 500 some more!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **country** column, and assign the results to **top5_countries.**
- Use **Series.value_counts()** and **Series.head()** to return the 5 most common values for the **previous_rank** column, and assign the results to **top5_previous_rank**.
- Use the appropriate **max()** method to find the maximum value for only the numeric columns from **f500** (you may need to check the documentation), and assign the result to the variable **max_f500**.






In [1]:
top5_countries = f500.country.value_counts().head(5)
top5_previous_rank = f500.previous_rank.value_counts().head(5)
max_f500 = f500.max(numeric_only=True)

## Assignment with pandas

Looking at the results of the most common values for the **previous_rank** column in the last exercise, you might have noticed something a little odd:

```python
>>> print(top5_previous_rank.head())

    0      33
    159     1
    147     1
    148     1
    149     1
    Name: previous_rank, dtype: int64
```

This indicates that 33 companies had the value **0** for their rank in the Fortune 500 for the previous year. Given that a rank of zero doesn't exist, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value to more clearly indicate that the value is missing. There are a few things we need to be able to do before we can correct this. The first is assigning values using pandas.

When we used NumPy, we learned that the same techniques that we use to select data could be used for assignment. Let's look at an example:


```python
import numpy as np

my_array = np.array([1, 2, 3, 4])

# to perform selection
print(my_array[0])

# to perform assignment
my_array[0] = 99
```

The same is true with pandas. Let's look at this example:

```python
>>> top5_rank_revenue = f500[["rank", "revenues"]].head()

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1    485873
    State Grid                   2    315199
    Sinopec Group                3    267518
    China National Petroleum     4    262573
    Toyota Motor                 5    254694

>>> top5_rank_revenue["revenues"] = 0

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3         0
    China National Petroleum     4         0
    Toyota Motor                 5         0
    
```


When we select a whole column by label and use assignment, we assign the value to every item in that column.

By providing labels for both axes, we can assign to a single value within our dataframe.

```python
>>> top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999

>>> print(top5_rank_revenue)
                              rank  revenues
    Walmart                      1         0
    State Grid                   2         0
    Sinopec Group                3       999
    China National Petroleum     4         0
    Toyota Motor                 5         0
```

If we assign a value using an index or column label that does not exist, pandas will create a new row or column in our dataframe. Let's add a new column and new row to our **top5_rank_revenue** dataframe:

```python
>>> top5_rank_revenue["year_founded"] = 0

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0

>>> top5_rank_revenue.loc["My New Company"] = 555

>>> print(top5_rank_revenue)

                              rank  revenues  year_founded
    Walmart                      1         0             0
    State Grid                   2         0             0
    Sinopec Group                3       999             0
    China National Petroleum     4         0             0
    Toyota Motor                 5         0             0
    My New Company             555       555           555
```


There is one exception to be aware of: You **can't** create a new column by using the dot accessor shortcut with a label that does not exist. The dot accessor only works for columns that are already in the dataframe.
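
The difference between bracket assignment and dot assignment can be seen in a short sketch. The dataframe and the **hq_city** name below are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
df = pd.DataFrame({"rank": [1, 2]}, index=["Walmart", "State Grid"])

# Bracket assignment with a new label creates a column.
df["year_founded"] = 0

# Dot assignment with a new name only sets a Python attribute on the
# object; no column is added to the dataframe.
df.hq_city = "unknown"

print(df.columns.tolist())
```

Only **year_founded** shows up in the list of columns, so it's safest to always use brackets or **loc[]** when creating new columns.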

Let's practice assigning values and adding new columns using our full Fortune 500 dataframe:


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Add a new column, **revenues_b** to the **f500** dataframe by using vectorized division to divide the values in the existing **revenues** column by 1000 (converting them from millions to billions).
- The company **'Dow Chemical'** has named a new CEO. Update the value in the **ceo** column for the row with the index label **Dow Chemical** to **Jim Fitterling**.

In [89]:
# put your code here
f500['revenues_b'] = f500.revenues / 1000
f500.loc['Dow Chemical', 'ceo'] = "Jim Fitterling"


## Using Boolean Indexing with pandas Objects

Now that we know how to assign values in pandas, we're one step closer to being able to correct the values in the **previous_rank** column that are **0**. If we knew the name of every single row label where this case was true, we could do this manually by using a list of labels when we performed our assignment.

While it's helpful to be able to replace specific values in rows where we know the row label ahead of time, this is cumbersome when we want to do this for all rows that meet the same criteria. Another option would be to use a loop, but this would be slower and would lose the benefits of vectorization that pandas gives us. Instead, we can use **boolean indexing**.

Just like NumPy, pandas allows us to use boolean indexing to select items based on their value, which will make our task a lot easier. Let's refresh our memory of how boolean indexing is used for selection, and learn how boolean indexing works in pandas.

In NumPy, boolean arrays are created by performing a vectorized boolean comparison on a NumPy ndarray. In pandas this works almost identically, however the resulting boolean object will be either a series or a dataframe, depending on the object on which the boolean comparison was performed. Let's look at an example of performing a boolean comparison on a series vs a dataframe:

<img width="600" src="https://drive.google.com/uc?export=view&id=1BJNbnwxzO7TBjbYhFJppw1o93Tnn7xMU">


It's much less common to use a boolean dataframe than a boolean series in pandas. You almost always want to use the results of a comparison on one column from a dataframe (a series object) to select data in the main dataframe, or a selection of the main dataframe.
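
As a rough sketch of the difference, the code below runs the same comparison on a single column and on a whole dataframe. The data is made up for illustration:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1, 8, 3], "b": [8, 2, 9]}, index=["x", "y", "z"])

series_bool = df["a"] > 2   # comparison on a series gives a boolean series
frame_bool = df > 2         # comparison on a dataframe gives a boolean dataframe

print(type(series_bool).__name__)
print(type(frame_bool).__name__)
```

Both results keep the labels of the object they were created from.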

Let's look at two examples of how that works in diagram form. For our example, we'll be working with this dataframe of people and their favorite numbers:

<img width="600" src="https://drive.google.com/uc?export=view&id=1FqhK-Kfr7u7JDeAbfEFxn3_0nONlIpp1">

Let's check which people have a favorite number of 8. We perform a vectorized boolean operation that produces a boolean series:

<img width="600" src="https://drive.google.com/uc?export=view&id=1OgQSNM8KwzI5Mr3UJ4Y-89tXdc9t6Ybg">


We can use that series to index the whole dataframe, leaving us the rows that correspond only to people whose favorite number is 8.

<img width="600" src="https://drive.google.com/uc?export=view&id=1WeDutjUzaV2RqdHgJkSVWQNPOguqskak">

Note that we didn't use **loc[]**. This is because boolean arrays use the same shortcut as slices to select along the index axis.
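
The steps in the diagrams can be sketched in code. The names and numbers below are hypothetical and may not match the values shown in the images:

```python
import pandas as pd

# Hypothetical people and favorite numbers.
df = pd.DataFrame(
    {"name": ["Kylie", "Rahul", "Michael", "Sarah"],
     "favorite_number": [8, 5, 8, 3]},
    index=["w", "x", "y", "z"],
)

# Vectorized comparison produces a boolean series.
num_bool = df["favorite_number"] == 8

# Shortcut: a boolean series inside plain brackets filters rows.
result = df[num_bool]

# Explicit form: loc[] also lets us pick a column at the same time.
names_only = df.loc[num_bool, "name"]

print(result)
print(names_only)
```

The shortcut is convenient when we want every column; **loc[]** is the better fit when we want to restrict the columns as well.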


Now let's look at an example of using boolean indexing with our Fortune 500 dataset. We want to find out which are the 5 most common countries for companies belonging to the **'Motor Vehicles and Parts'** industry.

We start by making a boolean series that shows us which rows from our dataframe have the value of **Motor Vehicles and Parts** for the **industry** column. We'll then print the first five items of our boolean series so we can see it in action:


```python
>>> motor_bool = f500["industry"] == "Motor Vehicles and Parts"

>>> print(motor_bool.head())

    Walmart                     False
    State Grid                  False
    Sinopec Group               False
    China National Petroleum    False
    Toyota Motor                 True
    Name: industry, dtype: bool
```


Notice that like our examples in the diagrams above, the index labels are retained in our boolean series. Next, we use that boolean series to select only the rows that have **True** for our boolean index, and just the **country** column, and then print the first 5 items to check the values:

```python
>>> motor_countries = f500.loc[motor_bool, "country"]

>>> print(motor_countries.head())

    Toyota Motor        Japan
    Volkswagen        Germany
    Daimler           Germany
    General Motors        USA
    Ford Motor            USA
    Name: country, dtype: object
```


Lastly, we can use the **value_counts()** method for the **motor_countries** series, chained to the **head()** method to produce a series of the top 5 countries for the 'Motor Vehicles and Parts' industry:

```python
>>> top5_motor_countries = motor_countries.value_counts().head()

>>> print(top5_motor_countries)

    Japan          10
    China           7
    Germany         6
    France          3
    South Korea     3
    Name: country, dtype: int64
```

Let's practice using boolean indexing in pandas to identify the five highest ranked companies from South Korea. Remember, we observed earlier that the **f500** dataframe is already sorted by rank, so we won't need to perform any extra sorting.


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Create a boolean series, **kr_bool**, that compares whether the values in the **country** column from the **f500** dataframe are equal to **"South Korea"**
- Use that boolean series to index the full **f500** dataframe, assigning just the first five rows to **top_5_kr.**

In [94]:
kr_bool = f500.country == "South Korea"
top_5_kr = f500[kr_bool].head(5)

## Using Boolean Arrays to Assign Values

We now have all the knowledge we need to fix the 0 values in the **previous_rank** column:

- perform assignment in pandas
- use boolean indexing in pandas

Let's look at an example of how we combine these two operations together. For our example, we'll want to change the **'Motor Vehicles & Parts'** values in the **sector** column to **'Motor Vehicles and Parts'** – i.e. we will change the ampersand **(&)** to **and**.

First, we create a boolean series by comparing the values in the sector column to **'Motor Vehicles & Parts'**.

```python
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"
```

Next, we use that boolean series and the string **"sector"** to perform the assignment.

```python
f500.loc[ampersand_bool,"sector"] = "Motor Vehicles and Parts"
```

Just like we saw in the NumPy mission earlier in this course, we can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays:

```python
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
```

Now we can follow this pattern to replace the values in the **previous_rank** column. We'll replace these values with **np.nan**, which is used in pandas, just as it is in NumPy, to represent values that can't be represented numerically, most commonly missing values.

To make comparing the values in this column before and after our operation easier, we've added the following line of code to the cell below:

```python
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```

This uses **Series.value_counts()** and **Series.head()** to display the 5 most common values in the **previous_rank** column, but adds an additional **dropna=False** parameter, which stops the **Series.value_counts()** method from excluding null values when it makes its calculation, as shown in the [Series.value_counts() documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html#pandas.Series.value_counts).
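
Here is a minimal sketch of what **dropna=False** changes, using a small made-up series rather than the **previous_rank** column:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two missing values.
s = pd.Series([3.0, np.nan, 3.0, 7.0, np.nan, 3.0])

counts_default = s.value_counts()               # nulls are excluded
counts_with_na = s.value_counts(dropna=False)   # nulls get their own entry

print(counts_default)
print(counts_with_na)
```

The second result has one extra entry, labelled **NaN**, which counts the missing values.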


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use boolean indexing to update values in the **previous_rank** column of the **f500** dataframe:
  - Where previously there was a value of 0, there should now be a value of **np.nan**.
  - It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.
- Create a new pandas series, **prev_rank_after**, using the same syntax that was used to create the **prev_rank_before** series.
- After you have run your code, use the variable inspector to compare **prev_rank_before** and **prev_rank_after.**

In [97]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

f500.loc[f500.previous_rank == 0, "previous_rank"] = np.nan

prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

## Challenge: Top Performers by Country

You may have noticed that after we assigned NaN values the previous_rank column changed dtype. Let's take a closer look:

```python
>>> print(prev_rank_before)

    0      33
    159     1
    147     1
    148     1
    149     1

>>> print(prev_rank_after)

    NaN      33
    471.0     1
    234.0     1
    125.0     1
    166.0     1
```

The index of the series that **Series.value_counts()** produces is now showing us floats like 471.0 instead of the integers from before. The reason behind this is that pandas uses the NumPy integer dtype, which does not support NaN values. Pandas inherits this behavior, and in instances where you try and assign a NaN value to an integer column, pandas will silently convert that column to a float dtype. If you're interested in finding out more about this, [there is a specific section on integer NaN values in the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions).
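
The dtype difference is easy to see when building two small series directly. The values below are hypothetical; the last step, the nullable **"Int64"** dtype, is an aside that goes beyond what this mission covers:

```python
import numpy as np
import pandas as pd

# Hypothetical rank values for illustration.
ints_only = pd.Series([33, 159, 147])
with_nan = pd.Series([np.nan, 159, 147])

print(ints_only.dtype)  # an integer dtype
print(with_nan.dtype)   # a float dtype, because NaN is itself a float

# Aside: pandas also has a nullable integer dtype, "Int64" (capital I),
# which stores missing values as pd.NA and keeps the rest as integers.
nullable = with_nan.astype("Int64")
print(nullable.dtype)
```

For this mission, the float column with **NaN** values is all we need.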

We'll finish this mission with a challenge. In this challenge, we'll calculate a specific statistic or attribute of each of the three most common countries from our f500 dataframe. We've identified the three most common countries using the code below:

```python
>>> top_3_countries = f500["country"].value_counts().head(3)

>>> print(top_3_countries)

USA      132
China    109
Japan     51
Name: country, dtype: int64
```

Don't be discouraged if this takes a few attempts to get right. Working with data is an iterative process!

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


1. Create a series, **cities_usa**, containing counts of the five most common headquarters locations (the **hq_location** column) for companies headquartered in the USA.
2. Create a series, **sector_china**, containing counts of the three most common sectors for companies headquartered in China.
3. Create a float object, **mean_employees_japan**, containing the mean number of employees for companies headquartered in Japan.

In [4]:
# put your code here
# cities_usa = 
f500["country"]
# sector_china
# mean_employees_japan


In this section, we learned:

- How pandas and NumPy combine to make working with data easier
- About the two core pandas types: series and dataframes
- How to select data from pandas objects using axis labels
- How to select data from pandas objects using boolean arrays
- How to assign data using labels and boolean arrays
- How to create new rows and columns in pandas
- Many new methods to make data analysis easier in pandas.

# Exploring Data with pandas




## Introduction

When we learned how to select data in NumPy, we used the integer position to create our selection:

<img width="400" src="https://drive.google.com/uc?export=view&id=1sNhetXwe-iqdVzkuu65z0p2rgAb9P6J0">

In pandas, each axis has labels, and we've learned to use loc[] to specify labels to create our selection:

<img width="400" src="https://drive.google.com/uc?export=view&id=19qbWRXXH0SrBu2FnMREay_KyvifucKNd">


In some scenarios, like specifying specific columns, using labels to make selections makes things easier; in others, though, it makes things harder. If you wanted to select the tenth to twentieth rows in a dataframe, you'd need to know their labels first.

In this section, we'll learn how to index by integer position with pandas. We'll also learn more advanced selection techniques which will help us perform more complex data analysis.

We'll continue to use the Fortune Global 500 (2017) dataset from the previous section.

## Using iloc to select by integer position

Because pandas uses NumPy objects behind the scenes to store the data, the integer positions we used to select data can also be used. To select data by integer position using pandas we use the [DataFrame.iloc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html) method and the [Series.iloc[]](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.iloc.html) method. It's easy to get loc[] and iloc[] confused at first, but the easiest way to tell them apart is to remember the first letter of each method:

- **loc**: **l**abel based selection
- **iloc**: **i**nteger position based selection

Using the **iloc[]** methods is almost identical to indexing with NumPy, with integer positions starting at **0** like ndarrays and Python lists. Let's take a look at how we would perform our selection from the previous screen using **iloc[]:**

<img width="400" src="https://drive.google.com/uc?export=view&id=1dQa9Y1ZVbYHCA0BhQxWL5FPgJVPAyU1P">


As you can see, **DataFrame.iloc[]** behaves similarly to **DataFrame.loc[]**. The full syntax for **DataFrame.iloc[]**, in pseudocode, is:

```python
df.iloc[row,column]
```

The valid inputs for row and column are almost identical to when you use **DataFrame.loc[]**, with the distinction being that you are using integers rather than labels:

- A single integer position.
- A list or array of integer positions.
- A slice object with integer positions.
- A boolean array.
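
Each of the input types listed above can be sketched on a small, made-up dataframe. The data is hypothetical, and the boolean example uses a plain Python list, since **iloc[]** expects a positional mask rather than a labelled boolean series:

```python
import pandas as pd

# Hypothetical 4x3 dataframe; string labels keep positions and labels distinct.
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

single = df.iloc[2, 0]                         # single integer positions
picked = df.iloc[[0, 3], [0, 2]]               # lists of integer positions
sliced = df.iloc[1:3, :2]                      # slices (end position excluded)
masked = df.iloc[[True, False, True, False]]   # boolean list, one flag per row

print(single)
print(picked)
print(sliced)
print(masked)
```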

Let's say we wanted to select just the first column from our **f500** dataframe. To do this, we use the **:** wildcard to specify all rows, and then use the integer **0** to specify the first column:

```python
first_column = f500.iloc[:,0]
print(first_column)
```
```python
0                        Walmart
1                     State Grid
2                  Sinopec Group
...
497    Wm. Morrison Supermarkets
498                          TUI
499                   AutoNation
Name: company, dtype: object
```

If we want to select a single row, we don't need to specify a column wildcard. Let's see how we'd select just the fourth row:

```python
fourth_row = f500.iloc[3]
print(fourth_row)
```
```python
company                 China National Petroleum
rank                                           4
revenues                                  262573
revenue_change                             -12.3
profits                                   1867.5
assets                                    585619
profit_change                              -73.7
ceo                                Zhang Jianhua
industry                      Petroleum Refining
sector                                    Energy
previous_rank                                  3
country                                    China
hq_location                       Beijing, China
website                   http://www.cnpc.com.cn
years_on_global_500_list                      17
employees                                1512048
total_stockholder_equity                  301893
Name: 3, dtype: object
```

If we are specifying a positional slice, we can take advantage of the same shortcut that we use with labels, using brackets without **loc**. Here's how we would select the rows between index positions one up to and including four:

```python
second_to_fifth_rows = f500[1:5]
```

```python
             company  rank  revenues ... employees  total_stockholder_equity
1         State Grid     2    315199 ...    926067                    209456
2      Sinopec Group     3    267518 ...    713288                    106523
3  China National...     4    262573 ...   1512048                    301893
4       Toyota Motor     5    254694 ...    364445                    157210
```

In the example above, the row at index position 5 is not included, just like if we were slicing with a Python list or NumPy ndarray. It's worth reiterating that **iloc[]** handles the end of a slice differently from **loc[]**, as we learned in the previous mission:

- With **loc[]**, the **end of the slice is included.**
- With **iloc[]**, the **end of the slice is not included.**
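
A short sketch on a made-up series shows the two slicing behaviours side by side (the labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical series with string labels.
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = s.loc["b":"d"]     # label slice: "d" is included
by_position = s.iloc[1:3]     # position slice: position 3 is excluded

print(by_label)
print(by_position)
```

The label slice returns three items, while the position slice returns two.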

The table below summarizes how we can use **DataFrame.iloc[]** and **Series.iloc[]** to select by integer position:


<img width="600" src="https://drive.google.com/uc?export=view&id=18jhblUrPsASHHdT5Lgpr6mmPmaIYo6og">


**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


We have provided code to read the **f500.csv** file into a dataframe and assigned it to **f500**, and inserted **NaN** values into the **previous_rank** column as we did in the previous section.

- Select just the fifth row of the **f500** dataframe, assigning the result to **fifth_row.**
- Select the first three rows of the **f500** dataframe, assigning the result to **first_three_rows.**
- Select the first and seventh rows and the first 5 columns of the **f500** dataframe, assigning the result to **first_seventh_row_slice**



In [None]:
import pandas as pd
import numpy as np

f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

# put your code here

## Reading CSV files with pandas

So far, we've provided the code to read the CSV file into pandas for you. In this mission, we're going to teach you how to use the [pandas.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function to read in CSV files. Before we start, let's take a look at the first few lines of our CSV file in its raw form. To make it easier to read, we're only showing the first four columns from each line:

```python
company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
China National Petroleum,4,262573,-12.3
Toyota Motor,5,254694,7.7
```

Now let's take a moment to look at the code segment we've been using to read in the files.

```python
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
```


Looking at the first line only, we use the **pandas.read_csv()** function with an unnamed argument, the name of the CSV file, and a named argument for the **index_col** parameter. The **index_col** parameter specifies which column to use as the row labels. We use a value of **0** to specify that we want to use the first column.

Let's look at what the **f500** dataframe looks like after that first line. We'll use **DataFrame.iloc[]** to show the first 5 rows and the first 3 columns:

```python
>>> f500 = pd.read_csv("f500.csv", index_col=0)

>>> print(f500.iloc[:5, :3])

                              rank  revenues  revenue_change
    company                                                    
    Walmart                      1    485873             0.8
    State Grid                   2    315199            -4.4
    Sinopec Group                3    267518            -9.1
    China National Petroleum     4    262573           -12.3
    Toyota Motor                 5    254694             7.7
```

Notice that above the index labels is the text **company**. This is the value from the start of the first row of the CSV, effectively the name of the first column. Pandas has used this value as the **axis name** for the index axis. Both the column and index axes can have names assigned to them. The next line of code removes that name:

```python
f500.index.name = None
```

First, we use **DataFrame.index** to access the index axis attribute, and then we use **index.name** to access the name of the index axis. By setting this to **None** we remove the name. Let's look at what it looks like after this action:

```python
>>> f500.index.name = None

>>> print(f500.iloc[:5, :3])

                              rank  revenues  revenue_change
    Walmart                      1    485873             0.8
    State Grid                   2    315199            -4.4
    Sinopec Group                3    267518            -9.1
    China National Petroleum     4    262573           -12.3
    Toyota Motor                 5    254694             7.7
```

The index name has been removed.
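
If you'd like to try these steps without the **f500.csv** file, here is a minimal sketch that reads the first few rows shown earlier from an in-memory string. Using **io.StringIO** in place of a filename is our own substitution for illustration; **pandas.read_csv()** accepts file-like objects as well as paths:

```python
import io
import pandas as pd

# A few rows and columns of the CSV, held in memory so the example
# does not depend on the f500.csv file being present.
csv_text = """company,rank,revenues,revenue_change
Walmart,1,485873,0.8
State Grid,2,315199,-4.4
Sinopec Group,3,267518,-9.1
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
name_before = df.index.name    # "company", taken from the header row

df.index.name = None
name_after = df.index.name     # None

print(name_before, name_after)
print(df)
```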

The **index_col** parameter we used is an optional argument. Let's look at what it looks like if we use **pandas.read_csv()** without it:

```python
>>> f500 = pd.read_csv("f500.csv")

>>> print(f500.iloc[:5,:3])

                        company  rank  revenues
    0                   Walmart     1    485873
    1                State Grid     2    315199
    2             Sinopec Group     3    267518
    3  China National Petroleum     4    262573
    4              Toyota Motor     5    254694
```

There are two differences with this approach:

- The **company** column is now included as a regular column, instead of being used for the index.
- The index labels are now integers starting from **0**.

This is the more conventional way to read in a dataframe, and it's the method we'll use from here on in. There are a few things to be aware of when you have integer index labels, and we'll talk about them in the next screen.


For now, let's re-read in the CSV file using the conventional method:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

The pandas library is already imported from the previous screen.

- Use the **pandas.read_csv()** function to read the **f500.csv** CSV file as a pandas dataframe, and assign it to the variable name **f500**.
  - Do not use the **index_col** parameter, so that the dataframe has integer index labels.
- Use the code below to insert the **NaN** values into the **previous_rank** column: 
```python
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
```





In [None]:
# put your code here

## Working with Integer Labels

As we observed in the previous screen, our index labels are now integers, whereas previously they were all strings. As long as our dataframe contains all of its rows, in the same order as when we read it in, the integer position and the label of each row on the index axis are the same. Let's look at an example:

<img width="500" src="https://drive.google.com/uc?export=view&id=1VeQE7W6ylvfg524QrO2tJ_-iFRud_04F">


Because the index axis of our dataframe has labels that are identical to the integer positions, both **loc[]** and **iloc[]** give the same result. But what if we have modified our dataframe in some way? Let's reorder the rows of our dataframe, and then see what happens:

<img width="500" src="https://drive.google.com/uc?export=view&id=1KT2Dus_gxSDfPeBSSxqNkoU8NWXJFlVk">

Now we get different results. When we use **df.iloc[1]** it still selects the second row, since **DataFrame.iloc[]** uses integer position. However, **df.loc[1]** selects the third row. **DataFrame.loc[]** itself doesn't mind that the rows are out of order; it just looks at the axis labels and selects the row with the matching label.
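
The same behavior can be sketched in code. The dataframe below is hypothetical (it is not the one in the diagrams), and the reordering is done with a list of positions purely for illustration.

```python
import pandas as pd

# Hypothetical dataframe with default integer labels 0, 1, 2
df = pd.DataFrame({"letter": ["a", "b", "c"]})

# Labels and positions match, so both selections return the same value
print(df.loc[1, "letter"], df.iloc[1]["letter"])  # b b

# Reorder the rows by position; each row keeps its original label
reordered = df.iloc[[2, 0, 1]]
print(reordered.index.tolist())     # [2, 0, 1]

print(reordered.iloc[1]["letter"])  # a  (second row by position)
print(reordered.loc[1, "letter"])   # b  (row labelled 1, now in third position)
```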

This is one of the most confusing parts of selecting data with pandas. You might not come across it often, because a lot of the time you'll work with a dataframe where the index labels are integers, and the dataframe contains all of its original rows, in order. In that situation, you can use **DataFrame.iloc[]** and **DataFrame.loc[]** interchangeably and it doesn't matter which you choose.

Then, you remove some rows or change the order, and suddenly you're getting errors or unexpected behavior. For this reason, it's important to make sure that when you're selecting data you're always asking yourself, "Do I want to select based on position or label?" and choosing **DataFrame.iloc[]** or **DataFrame.loc[]** accordingly. Let's look at some examples with our **f500** dataframe where we come across this 'gotcha' to do with integer labels.

Let's say that we wanted to select just the Swedish companies from the Fortune 500:

```python
>>> swedish = f500.loc[f500["country"] == "Sweden","company":"revenues"]

>>> print(swedish)

                        company  rank  revenues
    300                   Volvo   301     35269
    418             LM Ericsson   419     26004
    481  H & M Hennes & Mauritz   482     22618
```

If we want to select the first company from our new **swedish** dataframe, we can use **DataFrame.iloc[]**:

```python
>>> first_swedish = swedish.iloc[0]

>>> print(first_swedish)

    company     Volvo
    rank          301
    revenues    35269
    Name: 300, dtype: object
```

Let's see what happens when we use **DataFrame.loc[]** instead of **DataFrame.iloc[]**:

```python
>>> first_swedish = swedish.loc[0]

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    /python3.4/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
       1410                 if key not in ax:
    -> 1411                     error()
       1412             except TypeError as e:

    /python3.4/site-packages/pandas/core/indexing.py in error()
       1405                 raise KeyError("the label [%s] is not in the [%s]" %
    -> 1406                                (key, self.obj._get_axis_name(axis)))
       1407 

    KeyError: 'the label [0] is not in the [index]'
```

We get an error telling us that **the label [0] is not in the [index]** (the actual traceback for this error is much longer; we have truncated it for brevity). And indeed, there is no row that has a label **0** in the index of our **swedish** dataframe.

The four most common situations in which we will see this are when we alter the rows in our dataframe by:

1. Selecting a subset of the data (like in the example above).
2. Removing certain rows, for example if they have null values (which we'll explore in the next mission).
3. Randomizing the order of the rows in our dataframe (which is commonly done to perform machine learning).
4. Sorting the rows.

Regardless of how we altered the dataframe, the way to avoid this is the same: Always think carefully and deliberately about whether you want to select by label or integer position, and use **DataFrame.loc[]** or **DataFrame.iloc[]** accordingly.
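
A minimal sketch of that habit, using a hypothetical four-row dataframe: after filtering, position-based selection works as usual, while label-based selection only works with a label that still exists.

```python
import pandas as pd

# Hypothetical data; the integer labels come from the default index
df = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "country": ["USA", "Sweden", "China", "Sweden"],
})

subset = df[df["country"] == "Sweden"]
print(subset.index.tolist())  # [1, 3] (label 0 is not in the subset)

# Selecting by position: the first row, whatever its label is
first_by_position = subset.iloc[0]

# Selecting by label: look up a label that actually exists
first_label = subset.index[0]
first_by_label = subset.loc[first_label]

# Asking for a label that was filtered out raises a KeyError
try:
    subset.loc[0]
except KeyError:
    print("no row with label 0")
```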

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

In the code below, we have used the [DataFrame.sort_values()](https://www.dataquest.io/m/292/exploring-data-with-pandas/4/working-with-integer-labels) method to sort the rows in the **f500** dataframe by the **employees** column from most to least employees, and have assigned the resulting dataframe to **sorted_emp**.

  - Assign the first five rows of the **sorted_emp** dataframe to the variable **top5_emp**, by choosing the correct method out of either **loc[]** or **iloc[].**

In [None]:
sorted_emp = f500.sort_values("employees", ascending=False)

# put your code here

## Using pandas methods to create boolean masks

We've previously used Python comparison operators like **>**, **<**, and **==** to create boolean masks to select subsets of data. There are also a number of pandas methods that return boolean masks, which are useful for working with and exploring data.

You might have noticed that for companies from the USA, the **hq_location** column contains both the city and state that the company is headquartered in:

```python
>>> usa_hqs = f500.loc[f500["country"] == "USA", "hq_location"]

>>> print(usa_hqs.head())

    0       Bentonville, AR
    7             Omaha, NE
    8         Cupertino, CA
    9            Irving, TX
    10    San Francisco, CA
    Name: hq_location, dtype: object
```

The two letters at the end of each of these values represent the state within the USA: AR for Arkansas, NE for Nebraska, CA for California, and TX for Texas. If we wanted to look at only companies headquartered in California, it would be useful to be able to create a boolean mask based on the text within these values.

There are two pandas methods that we could use to achieve this: the [Series.str.contains()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html) method and the [Series.str.endswith()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.endswith.html) method. The **Series.str.contains()** method is a vectorized version of Python's **in** operator:

```python
>>> name = "Michael Johnson"

>>> "Michael" in name

    True

>>> "John" in name

    True

>>> "Eric" in name

    False
```

In contrast, **Series.str.endswith()** is a vectorized version of the Python string **str.endswith()** method, which is probably a better option for our purposes, as it will ensure that we don't get any stray matches. This is how we could go about it:


```python
>>> usa = f500.loc[f500["country"] == "USA"]

>>> print(usa["hq_location"].head())

    0       Bentonville, AR
    7             Omaha, NE
    8         Cupertino, CA
    9            Irving, TX
    10    San Francisco, CA
    Name: hq_location, dtype: object

>>> is_california = usa["hq_location"].str.endswith("CA")

>>> print(is_california.head())

    0     False
    7     False
    8      True
    9     False
    10     True
    Name: hq_location, dtype: bool

>>> california = usa[is_california]

>>> print(california.iloc[:5,:3])

            company  rank  revenues
    8         Apple     9    215639
    10     McKesson    11    198533
    44      Chevron    45    107567
    60  Wells Fargo    61     94176
    64     Alphabet    65     90272
```
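
To illustrate what a "stray match" looks like, here is a small sketch with made-up locations. Note that **Series.str.contains()** treats its pattern as a regular expression by default; passing `regex=False` makes it behave like a plain substring check.

```python
import pandas as pd

# Made-up locations; the second contains "CA" but is not in California
locations = pd.Series(["Cupertino, CA", "CAMDEN, NJ", "Omaha, NE"])

# regex=False treats "CA" as a plain substring, like Python's in operator
contains_ca = locations.str.contains("CA", regex=False)
ends_ca = locations.str.endswith("CA")

print(contains_ca.tolist())  # [True, True, False]
print(ends_ca.tolist())      # [True, False, False]
```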

We won't use it in this mission, but you should also be aware of the [Series.str.startswith()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.startswith.html) method, which is a vectorized version of the Python string [str.startswith()](https://docs.python.org/3.6/library/stdtypes.html#str.startswith) method and can be used to create boolean masks based on the start of string values.

Another pair of handy pandas methods that create boolean masks is the [Series.isnull()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html) method and [Series.notnull()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html) method. These return boolean masks that you can use to select either rows that contain null (or NaN) values for a certain column, or inversely those rows that don't. These can be particularly useful for identifying and exploring the rows in a dataframe that have missing values.
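
As a minimal sketch of how the two masks behave, using a small hypothetical series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value
changes = pd.Series([0.8, np.nan, -4.4], index=["x", "y", "z"])

is_missing = changes.isnull()
is_present = changes.notnull()

print(is_missing.tolist())               # [False, True, False]
print(changes[is_missing].index.tolist())  # ['y']
print(changes[is_present].tolist())      # [0.8, -4.4]
```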

Let's see the **Series.isnull()** method in action to look at the rows that have null values in the **revenue_change** column.

```python
>>> rev_change_null = f500[f500["revenue_change"].isnull()]

>>> print(rev_change_null[["company","country","sector"]])

                            company  country      sector
    90                       Uniper  Germany      Energy
    180  Hewlett Packard Enterprise      USA  Technology
```

We can see that the two companies with missing values for the **revenue_change** column are Uniper, a German energy company, and Hewlett Packard Enterprise, an American technology company. Let's use what we've learned to calculate the ranking change for the companies that were ranked last year.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Use the **Series.notnull()** method to select all rows from **f500** that have a non-null value for the **previous_rank** column, and assign the result to **previously_ranked**.
- From the **previously_ranked** dataframe, subtract the **rank** column from the **previous_rank** column, and assign the result to **rank_change**.




In [None]:
# put your code here

## Boolean Operators

Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values to perform analysis. There are, however, some questions that we can't yet answer, like:

- Which companies have over 100 billion in revenue and also have revenue growth of more than 10%?
- What are the top 5 technology companies outside the USA?

Both of these questions have two or more parts that depend on the values. As an example, to answer the first question we would have to identify all the companies that have over 100 billion in revenue and also have revenue growth of more than 10%. To do this, we need to learn how to combine boolean arrays.

To recap, boolean arrays are created using any of the Python standard **comparison operators**: **==** (equal), **>** (greater than), **<** (less than), **!=** (not equal).

We combine boolean arrays using **boolean operators**. In Python, these boolean operators are **and**, **or**, and **not**. In pandas, the operators are slightly different:


| pandas | Python equivalent | Meaning |
|--------|-------------------|-------------------------------------------|
| a & b | a and b | True if both a and b are True, else False |
| a $|$ b | a or b | True if either a or b is True |
| ~a | not a | True if a is False, else False |


Let's look at how these boolean operators work across pandas series objects, using two example series objects, **a** and **b**:


<img width="200" src="https://drive.google.com/uc?export=view&id=1911_tZkilBFq50Qeojr8E41aTBBNQbta">

We'll start by using the & operator to perform a boolean **'and'**:

<img width="600" src="https://drive.google.com/uc?export=view&id=1ZdBex9EhkUA42_IzUAsbxnFU4w5-Upkg">

Let's look at what happens when we use $|$ to perform a boolean 'or':

<img width="600" src="https://drive.google.com/uc?export=view&id=1TZqw-H0A-59yCkpDKrHIeVex5lHIEhEk">

Lastly, let's look at what happens when we use ~ to perform a boolean 'not':


<img width="600" src="https://drive.google.com/uc?export=view&id=1llHLeYGNC_mtDT0qttPD9sZTJnRAwoA1">
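
The three operators can also be tried out directly in code. The series **a** and **b** below are hypothetical and do not necessarily hold the same values as the ones in the diagrams.

```python
import pandas as pd

# Hypothetical boolean series covering every combination of True and False
a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

print((a & b).tolist())  # [True, False, False, False]
print((a | b).tolist())  # [True, True, True, False]
print((~a).tolist())     # [False, False, True, True]
```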


Let's test our understanding of how boolean operators work with some multiple choice exercises:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

Looking at the dataframe and code below, choose the series that matches the result of the boolean operation and assign the integer 1, 2, or 3 to **answer_1**.

<img width="600" src="https://drive.google.com/uc?export=view&id=1EaFFoKazQIrAYd2tWuSSgXfePQ0nqcGm">


Looking at the dataframe and code below, choose the series that matches the result of the boolean operation and assign the integer 1, 2, or 3 to **answer_2.**


<img width="600" src="https://drive.google.com/uc?export=view&id=1NgbMDdJ_oPL12pTVfCGBfv-_HX7Wl327">

Looking at the dataframe and code below, choose the series that matches the result of the boolean operation and assign the integer 1, 2, or 3 to **answer_3.**


<img width="600" src="https://drive.google.com/uc?export=view&id=1-CZEjG5SY9Ho2alht_LjA47kKdecXA4A">

## Using Boolean Operators

Let's look at how we use boolean operators to combine multiple boolean comparisons in practice. We'll use **f500_sel**, a small selection of our f500 dataframe:

<img width="600" src="https://drive.google.com/uc?export=view&id=1jR7JlIVoPzYkjy_xfpvYEaCXvikUK3qU">


We want to find the companies in **f500_sel** with more than 265 billion in revenue that are headquartered in China. We'll start by performing two boolean comparisons to produce two separate boolean arrays: one based on revenue, and one based on country (the **revenues** column is already in millions, so 265 billion is written as 265000).

<img width="600" src="https://drive.google.com/uc?export=view&id=1SVde3lAUEGVt69qDE7_9LHyjI80A-2_s">

We then use the & operator to combine the two boolean arrays using boolean 'and' logic:

<img width="600" src="https://drive.google.com/uc?export=view&id=1fefm6MA1piKONeawWn94ocbUZzb7KF4a">


Lastly, we use the combined boolean array to perform selection on our dataframe:


<img width="600" src="https://drive.google.com/uc?export=view&id=1f5tAqGgDvn_faQcSK8CzSC7e3NTRII11">


The result gives us the two companies from **f500_sel** that are both Chinese and have over 265 billion in revenue. Just like when we use a single boolean array to perform selection, when using multiple boolean arrays with boolean operators we don't need to assign things to intermediate variables. Let's look at how we can streamline the code from the example above. First, let's look at the code as one segment:

```python
cols = ["company", "revenues", "country"]
final_cols = ["company", "revenues"]

f500_sel = f500[cols].head()
over_265 = f500_sel["revenues"] > 265000
china = f500_sel["country"] == "China"
combined = over_265 & china
result = f500_sel.loc[combined,final_cols]
```

The first place we can optimize our code is by making our two boolean comparisons, together with their boolean operator, in a single line, instead of assigning them to the intermediate **china** and **over_265** variables first:


```python
combined = (f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China")
```

We have used parentheses around each of our boolean comparisons. This is very important: **our boolean operation will fail without parentheses**, because **&** has higher precedence than the comparison operators. Lastly, instead of assigning the boolean arrays to **combined**, we can insert the comparison directly into our selection:

```python
result = f500_sel.loc[(f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China"), final_cols]
```

Whether to perform this final step is very much a matter of taste. As always, your decision should be driven by what will make your code more readable. Cramming everything into one line is not always the best option.
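
To see why the parentheses around each comparison matter, here is a small sketch using a made-up stand-in for **f500_sel**. The exact exception type and message can differ between pandas versions, so the sketch only checks that the unparenthesized form fails.

```python
import pandas as pd

# Made-up stand-in for f500_sel (values are hypothetical)
f500_sel = pd.DataFrame({
    "company": ["A", "B", "C"],
    "revenues": [300000, 270000, 250000],
    "country": ["USA", "China", "China"],
})

# With parentheses, each comparison is evaluated before & combines them
combined = (f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China")
print(combined.tolist())  # [False, True, False]

# Without parentheses, Python evaluates 265000 & f500_sel["country"] first,
# because & binds more tightly than > and ==, and the expression fails
try:
    f500_sel["revenues"] > 265000 & f500_sel["country"] == "China"
    failed = False
except Exception:
    failed = True
print(failed)  # True
```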

Let's practice more complex selection using boolean operators:

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

- Select from the **f500** dataframe:
  - Companies with revenues over 100 billion and negative profits, assigning the result to **big_rev_neg_profit**.
  - The first 5 companies in the Technology sector that are not headquartered in the USA, assigning the result to **tech_outside_usa**.

In [None]:
# put your code here

## Pandas Index Alignment

So far, we've only seen examples where the dataframe and series objects we're working with have matching index labels. One of the most powerful aspects of pandas is that almost every operation will **align on the index labels**. Let's look at an example. Below we have a dataframe, **food**, and a series, **colors**:

<img width="300" src="https://drive.google.com/uc?export=view&id=1_LAUprnXe3BYUgUg2r5fS2kfPpoTtdgt">

Both the **food** dataframe and the **colors** series have the same index labels, however they are in totally different orders. As an example, the first row of **food** has the index label **tomato**, and the first item of **colors** has the index label **corn**.

If we wanted to add **colors** as a new column in our **food** dataframe, we can use the following code:

```python
food["color"] = colors
```

When we do this, pandas will ignore the order of the colors series, and align on the index labels:

<img width="350" src="https://drive.google.com/uc?export=view&id=1zD-HqxfZ8yUj_4pNrbQasrnASTX0zhQE">

The result of our code operation is the dataframe below:

<img width="300" src="https://drive.google.com/uc?export=view&id=1hDfSi-D5sJf788MlGOI63vmAlLYrVIYU">


We can see that pandas has done all the hard work for us, and we don't have to worry about the fact that our series and dataframe were ordered differently. Let's look at another example. Say we had the series **alt_name** below:

<img width="200" src="https://drive.google.com/uc?export=view&id=1jMt7P6e0d7X8yaE2kLm0gPzaCxQtS30F">


The **alt_name** series only has three items. The first item, with index label **arugula**, doesn't have a corresponding row in the **food** dataframe, whereas the other two do. Let's see what happens when we assign this as a new column:

```python
food["alt_name"] = alt_name
```

<img width="300" src="https://drive.google.com/uc?export=view&id=1Vs-aLaUmDmdqUtDPt6_Eh7B3T6d3Wvhx">


In this scenario, pandas:

- Discards any items that have an index that doesn't match the dataframe.
- Aligns on the index labels for the values that do match the dataframe.
- Fills any remaining rows with **NaN**.

If we assign a new column with no matching index labels, pandas follows the same three steps above, but as there are no matching labels, all rows in the new column will be **NaN** values.

The pandas library will align on index at every opportunity. This makes working with data from different sources, or working with data when you have removed, added, or reordered rows, much easier than it would be otherwise. This works whether your index labels are strings or integers: as long as you haven't made modifications to the index labels, you can use index alignment to your advantage.
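
The alignment rules above can be sketched in a few lines. The data below is hypothetical and only loosely modelled on the **food**, **colors**, and **alt_name** objects in the diagrams.

```python
import pandas as pd

# Hypothetical data, smaller than the objects shown in the diagrams
food = pd.DataFrame(
    {"type": ["fruit", "vegetable", "vegetable"]},
    index=["tomato", "carrot", "corn"],
)
colors = pd.Series(["yellow", "red", "orange"], index=["corn", "tomato", "carrot"])
alt_name = pd.Series(["rocket", "maize"], index=["arugula", "corn"])

# Assignment aligns on index labels, not on the order of the series
food["color"] = colors
print(food["color"].tolist())  # ['red', 'orange', 'yellow']

# arugula has no matching row and is discarded; unmatched rows are NaN
food["alt_name"] = alt_name
print(food["alt_name"].isnull().tolist())  # [True, True, False]
print(food.loc["corn", "alt_name"])        # maize
```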

Let's practice this using our Fortune 500 data.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


Earlier, we created the **rank_change** series by performing vectorized subtraction only on rows without null values. We have included the code again as a reminder.

- Assign the values in the **rank_change** series to a new column in the **f500** dataframe, **"rank_change".**
- Once you have run your code, use the variable inspector to look at the **f500** dataframe and observe how the new column aligns with the existing data.

In [None]:
previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]

# put your code here

## Using Loops with pandas

So far, we've explicitly avoided doing anything with loops using pandas. Because one of the key benefits of pandas is that it has vectorized methods to work with data more efficiently, we want to avoid using loops wherever we can.

As an illustration, let's look at one common pattern that you might be tempted to use a loop for, and how we can use a vectorized operation to replace it. Let's try to replace every **8** in column **B** of a dataframe with **99**:

```python
>>> print(df)

       A  B  C
    x  6  1  0
    y  1  8  8
    z  3  8  7

>>> for row in df:
        if row["B"] == 8:
            row["B"] = 99

    ---------------------------------------------------
    TypeError                                 Traceback
    <ipython-input-17-baf1fd443d29> in <module>()
          1 for row in df:
    ----> 2     if row["B"] == 8:
          3         row["B"] = 99

    TypeError: string indices must be integers
```

In this code, we attempted to loop over every row of the dataframe, check the value for a particular column, and if it matches our check, we change it. Unfortunately, our code produced an error.

When you attempt to loop over a dataframe, it returns the column index labels, rather than the rows as we might expect. There are pandas methods to help loop over dataframes, but they should be only used as a last resort, and can almost always be avoided (we'll learn about those methods in a later course).

```python
>>> print(df)

       A  B  C
    x  6  1  0
    y  1  8  8
    z  3  8  7

>>> for i in df:
        print(i)

    A
    B
    C
```

Instead of trying to use loops, we can perform the same operation quickly and easily using vectorized operations:

```python
>>> df.loc[df["B"] == 8, "B"] = 99

>>> print(df)

       A   B  C
    x  6   1  0
    y  1  99  8
    z  3  99  7
```

One scenario where it is useful to use loops with pandas is when we are performing aggregation. Aggregation is where we apply a statistical operation to groups of our data. Let's say that we wanted to work out what the average revenue was for each country in the data set. Our process might look like this:

- Identify each unique country in the data set.
- For each country:
  - Select only the rows corresponding to that country.
  - Calculate the average revenue for those rows.

In this process, we can use a loop to iterate over the countries. We'll still use vectorized operations to select the right rows and calculate the means, so our calculation remains fast. To identify the unique countries, we can use the [Series.unique() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html). This method returns an array of unique values from any series. Once we have that, we can loop over that array and perform our operation. We'll use a dictionary to store the results. Here's what that looks like:


```python
# Create an empty dictionary to store the results
avg_rev_by_country = {}

# Create an array of unique countries
countries = f500["country"].unique()

# Use a for loop to iterate over the countries
for c in countries:
    # Use boolean comparison to select only rows that
    # correspond to a specific country
    selected_rows = f500[f500["country"] == c]
    # Calculate the mean average revenue for just those rows
    mean = selected_rows["revenues"].mean()
    # Assign the mean value to the dictionary, using the
    # country name as the key
    avg_rev_by_country[c] = mean
```


The resulting dictionary is below (we've shown just the first few keys):

```python
{'Australia': 33688.71428571428,
 'Belgium': 45905.0,
 'Brazil': 52024.57142857143,
 'Britain': 51588.708333333336,
 'Canada': 31848.0,
 'China': 55397.880733944956,
 'Denmark': 35464.0,
 ...
 }
```

We'll practice this pattern to calculate the company that employs the most people in each country. To do this extra step, we'll use the [DataFrame.sort_values()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method to sort our dataframe so we can then select the first row which will give us our largest value.
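
As a rough sketch of that extra step on its own, the snippet below sorts a small made-up dataframe and takes the first row. Because sorting reorders the rows without changing their labels, the first row is selected by position with **DataFrame.iloc[]**.

```python
import pandas as pd

# Made-up rows standing in for the companies of a single country
rows = pd.DataFrame({
    "company": ["A", "B", "C"],
    "employees": [1200, 56000, 8300],
})

# Sort from most to fewest employees; the original labels are kept
sorted_rows = rows.sort_values(by="employees", ascending=False)
print(sorted_rows.index.tolist())  # [1, 2, 0]

# Take the first row by position, then read its company value
top_row = sorted_rows.iloc[0]
print(top_row["company"])          # B
```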

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>

In this exercise, we're going to produce the following dictionary of the top employer in each country:

```python
{'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Brazil': 'JBS',
 ...
 'U.A.E': 'Emirates Group',
 'USA': 'Walmart',
 'Venezuela': 'Mercantil Servicios Financieros'}
```

- Read the documentation for the [DataFrame.sort_values() method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) to familiarize yourself with the syntax. You will need to use only the **by** and **ascending** parameters to complete this exercise.
- Create an empty dictionary, **top_employer_by_country** to store the results of the exercise.
- Use the **Series.unique()** method to create an array of unique values from the **country** column.
- Use a for loop to iterate over the array of unique countries, and in each iteration:
  - Select only the rows that have a country name equal to the current iteration.
  - Use **DataFrame.sort_values()** to sort those rows by the **employees** column in descending order.
  - Select the first row from the sorted dataframe.
  - Extract the company name from the index label **company** from the first row.
  - Assign the results to the **top_employer_by_country** dictionary, using the country name as the key, and the company name as the value.
- When you have run your code, use the variable inspector to view the top employer for each country.



In [None]:
# put your code here
# can you do it using just one code line?

## Challenge: Calculating Return on Assets by Sector


Now it's time for a challenge to bring everything together! In this challenge we're going to add a new column to our dataframe, and then perform some aggregation using that new column.

The column we create is going to contain a metric called [return on assets (ROA)](https://www.inc.com/encyclopedia/return-on-assets-roa.html). ROA is a business-specific metric which indicates a company's ability to make a profit using its available assets.

$
\textrm{return on assets} = \frac{\textrm{profits}}{\textrm{assets}}
$
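
As a quick worked example of the formula, with made-up figures for a single hypothetical company (both values in the same units):

```python
# Made-up figures for one hypothetical company
profits = 5000
assets = 100000

# Return on assets is profits divided by assets
roa = profits / assets
print(roa)  # 0.05, which is a 5% return on assets
```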

Once we've created the new column, we'll aggregate by sector, and find the company with the highest ROA from each sector. Like previous challenges, we'll provide some guidance in the hints, but try to complete it without them if you can.

Don't be discouraged if this challenge takes a few attempts to get correct. Working iteratively is a great way to work, and this challenge is more difficult than exercises you have previously completed.

**Exercise**

<left><img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ"></left>


- Create a new column **roa** in the **f500** dataframe, containing the return on assets metric for each company.
- Aggregate the data by the **sector** column, and create a dictionary **top_roa_by_sector**, with:
  - Dictionary keys with the sector name.
  - Dictionary values with the company name with the highest ROA value from that sector.







In [None]:
# put your code here

In this section, we learned how to:

- Select columns, rows and individual items using their integer location.
- Use **pd.read_csv()** to read CSV files in pandas.
- Work with integer axis labels.
- Use pandas methods to produce boolean arrays.
- Use boolean operators to combine boolean comparisons to perform more complex analysis.
- Use index labels to align data.
- Use loops to perform aggregation for more advanced analysis.

In the next mission, we'll learn techniques to use when performing data cleaning to prepare a messy data set.