## Introduction

We'll learn another way pandas makes working with data easier. It has many built-in methods and functions for common exploration and analysis tasks.

__Dataset__: Fortune Global 500 list [f500.csv]

__Source__: https://data.world/chasewillden/fortune-500-companies-2017

Here is a data dictionary for some of the columns in the CSV:

- ```company```: Name of the company.
- ```rank```: Global 500 rank for the company.
- ```revenues```: Company's total revenue for the fiscal year, in millions of dollars (USD).
- ```revenue_change```: Percentage change in revenue between the current and prior fiscal year.
- ```profits```: Net income for the fiscal year, in millions of dollars (USD).
- ```ceo```: Company's Chief Executive Officer.
- ```industry```: Industry in which the company operates.
- ```sector```: Sector in which the company operates.
- ```previous_rank```: Global 500 rank for the company for the prior year.
- ```country```: Country in which the company is headquartered.
- ```hq_location```: City and country (or city and state for the USA) where the company is headquartered.
- ```employees```: Total employees (full-time equivalent, if available) at fiscal year-end.

In [1]:
import pandas as pd

f500 = pd.read_csv('data/f500.csv', index_col=0)
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 52.7+ KB


### Vectorized Operations

Just like with NumPy, we can use any of the standard Python numeric operators with series, including:
- ```series_a + series_b``` - Addition
- ```series_a - series_b``` - Subtraction
- ```series_a * series_b``` - Multiplication (this is unrelated to the multiplications used in linear algebra).
- ```series_a / series_b``` - Division

In [2]:
rank_change = f500.loc[:, "previous_rank"] - f500.loc[:, "rank"]

### Series Data Exploration Methods

Like NumPy, pandas supports many descriptive statistics methods that help us summarize the values in a series. Here are a few of the most useful ones:
- ```Series.max()```
- ```Series.min()```
- ```Series.mean()```
- ```Series.median()```
- ```Series.mode()```
- ```Series.sum()```

In [3]:
rank_change_max = rank_change.max()
rank_change_min = rank_change.min()

### Series Describe Method

We used the Series.max() and Series.min() methods to figure out the biggest increase and decrease in rank:
- Biggest increase in rank: 226
- Biggest decrease in rank: -500

However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the ```rank``` column or ```previous_rank``` column.

We'll learn another method that can help us more quickly investigate this issue - the ```Series.describe()``` method. This method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics.

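When we call ```Series.describe()``` on a non-numeric (object) column, we get a different set of statistics. The first statistic, __count__, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new: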
- ```unique```: Number of unique values in the series.
- ```top```: Most common value in the series.
- ```freq```: Frequency of the most common value.
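
As a minimal sketch of these four statistics, the example below calls `Series.describe()` on a small, made-up series of country names. The values are illustrative assumptions and are not taken from f500.csv.

```python
import pandas as pd

# Hypothetical sample values (not from f500.csv); dtype=object mirrors
# how the info() output above reports text columns
countries = pd.Series(
    ["USA", "China", "USA", "Japan", "USA", None], dtype=object
)

country_desc = countries.describe()

# count skips the null value; unique, top and freq summarize the rest
print(country_desc)
```

Here `count` covers only the five non-null items, while `top` and `freq` report the most common value and how often it appears.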

In [4]:
rank = f500["rank"]
rank_desc = rank.describe()

prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()

In [5]:
prev_rank_desc

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64

### Method Chaining

We notice that the minimum rank is 0, which is odd! To investigate the possible cause of this issue, let's confirm the number of 0 values that appear in the previous_rank column.

We can skip some of the intermediate code assignments. This is called __method chaining__ — a way to combine multiple methods together in a single line.
- When writing code, always assess whether method chaining will make your code harder to read. If it does, it's always preferable to break the code into more than one line.

In [6]:
# Count the number of zeros in our previous_rank column
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]

### Dataframe Exploration Methods

We confirmed that 33 companies in the dataframe have a value of 0 in the previous_rank column. Given that multiple companies have a 0 rank, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value instead.

Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. Below are some examples:
- ```Series.max()``` and ```DataFrame.max()```
- ```Series.min()``` and ```DataFrame.min()```
- ```Series.mean()``` and ```DataFrame.mean()```
- ```Series.median()``` and ```DataFrame.median()```
- ```Series.mode()``` and ```DataFrame.mode()```
- ```Series.sum()``` and ```DataFrame.sum()```

Unlike their series counterparts, dataframe methods accept an _axis parameter_ so we can specify which axis to calculate across (by default, they calculate down each column). While you can use integers to refer to the first and second axis, pandas dataframe methods also accept the strings ```"index"``` and ```"columns"``` for the axis parameter.
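
To make the axis parameter concrete, here is a small sketch on a made-up dataframe. The column names echo f500, but the figures and the row labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical figures, not taken from f500.csv
df = pd.DataFrame(
    {"revenues": [100, 200, 300], "profits": [10, 20, 60]},
    index=["A", "B", "C"],
)

# axis="index" (same as axis=0): calculate down the rows, one result per column
col_max = df.max(axis="index")

# axis="columns" (same as axis=1): calculate across the columns, one result per row
row_max = df.max(axis="columns")

print(col_max)
print(row_max)
```

The first result is indexed by column name and the second by row label, which is a quick way to check that the intended axis was used.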

In [7]:
# find the maximum value for only the numeric columns from f500
max_f500 = f500.max(numeric_only=True)
max_f500

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64

### Dataframe Describe Method

Based on the column descriptions, the maximum for each of these columns seems reasonable. Like series objects, dataframe objects also have a ```DataFrame.describe()``` method that we can use to explore the dataframe more quickly.

One difference is that we need to manually specify if we want to see the statistics for the non-numeric columns. By default, ```DataFrame.describe()``` will return statistics for only numeric columns. If we want to get just the object columns, we need to use the ```include=['O']``` parameter.

Whereas the ```Series.describe()``` method returns a series object, the ```DataFrame.describe()``` method returns a dataframe object.
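
The difference between the default call and `include=['O']` can be sketched on a two-column dataframe. The data below is made up for illustration, and the text column is created with `dtype=object` explicitly so that it matches the object columns shown in the `info()` output earlier.

```python
import pandas as pd

# Hypothetical data: one object column and one numeric column
df = pd.DataFrame({
    "country": pd.Series(["USA", "China", "USA"], dtype=object),
    "employees": [1000, 2000, 3000],
})

# Default: statistics for numeric columns only
numeric_desc = df.describe()

# include=['O']: statistics for object columns only
object_desc = df.describe(include=["O"])

print(numeric_desc)
print(object_desc)
```

Both results are dataframes, with one column per described column and one row per statistic.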

In [8]:
f500_desc = f500.describe()
f500_desc

|  | rank | revenues | revenue_change | profits | assets | profit_change | previous_rank | years_on_global_500_list | employees | total_stockholder_equity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 500.0 | 500.0 | 498.0 | 499.0 | 500.0 | 436.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 55416.358 | 4.538353 | 3055.203206 | 243632.3 | 24.152752 | 222.134 | 15.036 | 133998.3 | 30628.076 |
| std | 144.481833 | 45725.478963 | 28.549067 | 5171.981071 | 485193.7 | 437.509566 | 146.941961 | 7.932752 | 170087.8 | 43642.576833 |
| min | 1.0 | 21609.0 | -67.3 | -13038.0 | 3717.0 | -793.7 | 0.0 | 1.0 | 328.0 | -59909.0 |
| 25% | 125.75 | 29003.0 | -5.9 | 556.95 | 36588.5 | -22.775 | 92.75 | 7.0 | 42932.5 | 7553.75 |
| 50% | 250.5 | 40236.0 | 0.55 | 1761.6 | 73261.5 | -0.35 | 219.5 | 17.0 | 92910.5 | 15809.5 |
| 75% | 375.25 | 63926.75 | 6.975 | 3954.0 | 180564.0 | 17.7 | 347.25 | 23.0 | 168917.2 | 37828.5 |
| max | 500.0 | 485873.0 | 442.3 | 45687.0 | 3473238.0 | 8909.5 | 500.0 | 23.0 | 2300000.0 | 301893.0 |


### Assignment with pandas

After reviewing the descriptive statistics for the numeric columns in f500, we can conclude that no values look unusual besides the 0 values in the previous_rank column.

We'll learn how to do two things so we can correct these values:
- __Perform assignment in pandas.__
- Use boolean indexing in pandas.

Just like in NumPy, the same techniques that we use to select data could be used for assignment. When we selected a whole column by label and used assignment, we assigned the value to every item in that column.

By providing labels for both axes, we can assign a value to a single item within our dataframe.
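
The two kinds of assignment described above can be sketched side by side. The dataframe, company labels, and names below are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical rows, not taken from f500.csv
df = pd.DataFrame(
    {"ceo": ["Ann", "Bob"], "sector": ["Energy", "Retail"]},
    index=["Company A", "Company B"],
)

# Selecting a whole column and assigning sets every item in that column
df["sector"] = "Unknown"

# Providing a row label and a column label sets a single item
df.loc["Company B", "ceo"] = "Cleo"

print(df)
```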

In [9]:
f500.loc["Dow Chemical","ceo"] = "Jim Fitterling"

f500.loc["Dow Chemical", "ceo"]

'Jim Fitterling'

### Using Boolean Indexing with pandas Objects

While it's helpful to be able to replace specific values when we know the row label ahead of time, this can be cumbersome when we need to replace many values. Instead, we can use __boolean indexing__ to change all rows that meet the same criteria, just like we did with NumPy.

Next, let's use boolean indexing to identify companies belonging to the "Motor Vehicles and Parts" industry in our Fortune Global 500 dataset.

In [10]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"

motor_countries = f500.loc[motor_bool, "country"] # one specific column

motor_countries

company
Toyota Motor                                 Japan
Volkswagen                                 Germany
Daimler                                    Germany
General Motors                                 USA
Ford Motor                                     USA
Honda Motor                                  Japan
SAIC Motor                                   China
Nissan Motor                                 Japan
BMW Group                                  Germany
Dongfeng Motor                               China
Robert Bosch                               Germany
Hyundai Motor                          South Korea
China FAW Group                              China
Beijing Automotive Group                     China
Peugeot                                     France
Renault                                     France
Kia Motors                             South Korea
Continental                                Germany
Denso                                        Japan
Guangzhou Automobile In

### Using Boolean Arrays to Assign Values

We now have all the knowledge we need to fix the 0 values in the previous_rank column:
- Perform assignment in pandas.
- Use boolean indexing in pandas.

We can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays.
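
A minimal sketch of the two-step and one-line forms, using a made-up dataframe in which the label "U.S." should be standardized to "USA" (both the data and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with an inconsistent country label
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "country": ["U.S.", "Japan", "U.S."],
})

# Two steps: build the boolean series first, then use it for assignment
two_step = df.copy()
is_us = two_step["country"] == "U.S."
two_step.loc[is_us, "country"] = "USA"

# One line: put the comparison directly inside loc[]
one_line = df.copy()
one_line.loc[one_line["country"] == "U.S.", "country"] = "USA"

print(one_line)
```

Both versions produce the same dataframe. The one-line form simply avoids naming the intermediate boolean series.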

Now we can follow this pattern to replace the values in the previous_rank column. We'll replace these values with ```np.nan```. Just like in NumPy, ```np.nan``` is used in pandas to represent values that can't be represented numerically, most commonly missing values.

To make comparing the values in this column before and after our operation easier, we've added the following line of code:
```
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```
This uses ```Series.value_counts()``` and ```Series.head()``` to display the 5 most common values in the ```previous_rank``` column, but adds an additional ```dropna=False``` parameter, which stops the ```Series.value_counts()``` method from excluding null values when it makes its calculation.

In [11]:
import numpy as np

# Pandas series for the prev_rank for comparison
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

# Changing the rank 0 to NAN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan
# The After series
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

### Creating New Columns

You may have noticed that after we assigned NaN values, the previous_rank column changed dtype.

The index of the series that Series.value_counts() produces now shows us floats like 471.0 instead of integers.
- Pandas uses the NumPy integer dtype, which does not support NaN values.
- Pandas inherits this behavior: when you try to assign a NaN value to an integer column, pandas will silently convert that column to a float dtype.
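
The dtype change can be observed on a small series. Note that this sketch swaps in `Series.mask()`, which returns a new series with NaN wherever the condition is true, instead of the `loc[]` assignment used in the notebook. The rank values are made up.

```python
import pandas as pd

# Hypothetical ranks; 0 stands in for "no rank last year"
ranks = pd.Series([0, 471, 6, 0])
dtype_before = ranks.dtype

# Replace every 0 with NaN; the result can no longer be stored as integers
ranks_with_nan = ranks.mask(ranks == 0)
dtype_after = ranks_with_nan.dtype

print(dtype_before, "->", dtype_after)
print(ranks_with_nan)
```

The remaining ranks are unchanged in value but are now displayed as floats, such as 471.0.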

Now that we've corrected the data, let's create the rank_change series again. This time, we'll add it to our f500 dataframe as a new column.

In [12]:
f500["rank_change"] = f500["previous_rank"] - f500["rank"]

rank_change_desc = f500["rank_change"].describe()

## Challenge: Top Performers by Country

In this challenge, we'll calculate a specific statistic or attribute of each of the three most common countries from our f500 dataframe.

Tasks:
- Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA.
- Create a series, sector_china, containing counts of the three most common values in the sector column for companies headquartered in China.
- Create a float object, mean_employees_japan, containing the mean (average) number of employees for companies headquartered in Japan.

In [13]:
top_3_countries = f500["country"].value_counts().head(3)

# counts of the two most common values in the industry column for companies headquartered in the USA
industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)

# counts of the three most common values in the sector column for companies headquartered in China
sector_china = f500.loc[f500["country"] == "China", "sector"].value_counts().head(3)

# Not literal Mean ;), if only!
# containing the mean (average) number of employees for companies headquartered in Japan.
mean_employees_japan = f500.loc[f500["country"] == "Japan", "employees"].mean()

## Conclusion:

### Fundamentals
- How to select data from pandas objects using boolean arrays.
- How to assign data using labels and boolean arrays.
- How to create new rows and columns in pandas.
- Many new methods to make data analysis easier in pandas.

Now we will continue with some more advanced pandas concepts.

## Exploring data with pandas: Intermediate 

We'll continue working with the 2017 Fortune Global 500 dataset as we learn more advanced selection and exploration techniques. As a reminder, the data dictionary for the main columns in the f500.csv file is below:
- company: Name of the company.
- rank: Global 500 rank for the company.
- revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).
- revenue_change: Percentage change in revenue between the current and prior fiscal year.
- profits: Net income for the fiscal year, in millions of dollars (USD).
- sector: Sector in which the company operates.
- previous_rank: Global 500 rank for the company for the prior year.
- country: Country in which the company is headquartered.
- hq_location: City and country, (or city and state for the USA) where the company is headquartered.
- employees: Total employees (full-time equivalent, if available) at fiscal year-end.

In [14]:
import pandas as pd
import numpy as np

# Read the data into a pandas dataframe
f500 = pd.read_csv('data/f500.csv', index_col=0)
f500.index.name = None

# Replace the 0 values in the 'previous_rank' column with NaN
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

f500_selection = f500[["rank", "revenues", "revenue_change"]].head() # first 5 rows of the selected columns

### Reading CSV files with pandas
```
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
```

If we compare the dataframe with the CSV file, we can see that the index axis labels are actually the values of the first column in the dataset, ```company```.

The ```index_col``` parameter is an optional argument and should specify which column to use as the row labels for the dataframe. When we used a value of ```0```, we specified that we wanted to use the first column as the row labels.

In [15]:
f500 = pd.read_csv('data/f500.csv') # without the index_col parameter
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

### Using iloc to select by integer position



We read our CSV file into pandas again. However, this time, we didn't use the ```index_col``` parameter.

There are two differences with this approach:
- The company column is now included as a regular column, instead of being used for the index.
- The index labels are now integers starting from 0.

In some scenarios, using labels to make selections makes things easier — in others though, it makes things harder.
- Just like in NumPy, we can also use integer positions to select data using ```DataFrame.iloc[]``` and ```Series.iloc[]```. It's easy to get ```loc[]``` and ```iloc[]``` confused at first, but the easiest way to tell them apart is to remember the first letter of each method:
    - __l__oc: __l__abel based selection
    - __i__loc: __integer__ position based selection
- Using ```iloc[]``` is almost identical to indexing with NumPy, with integer positions starting at 0 like ndarrays and Python lists.

The full syntax for ```DataFrame.iloc[]```, in pseudocode, is:

```df.iloc[row_index, column_index]```

In [16]:
fifth_row = f500.iloc[4] # 5th row of the f500 dataframe

company_value = f500.iloc[0, 0] # select the value in the first row of the first column, 'company'

Recall that ```loc[]``` handles slicing differently:
- With ```loc[]```, the end of the slice __is__ included.
- With ```iloc[]```, the end of the slice __is not__ included.
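
The slicing difference is easy to check on a small dataframe with string labels. The data and labels here are made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with four labelled rows
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# Label slice: the end label "c" is included, giving rows a, b and c
by_label = df.loc["a":"c"]

# Position slice: the end position 2 is excluded, giving positions 0 and 1
by_position = df.iloc[0:2]

print(by_label)
print(by_position)
```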

The table below summarizes how we can use ```DataFrame.iloc[]``` and ```Series.iloc[]``` to select by integer position:

| Select by integer position | Explicit Syntax | Shorthand Convention |
| -------------------------- | --------------- | -------------------- |
| Single column from dataframe | ```df.iloc[:,3]``` |  |
| List of columns from dataframe | ```df.iloc[:,[3,5,6]]``` |  |
| Slice of columns from dataframe | ```df.iloc[:,3:7]``` |  |
| Single row from dataframe | ```df.iloc[20]``` |  |
| List of rows from dataframe | ```df.iloc[[0,3,8]]``` |  |
| Slice of rows from dataframe | ```df.iloc[3:5]``` | ```df[3:5]``` |
| Single items from series | ```s.iloc[8]``` | ```s[8]``` |
| List of items from series | ```s.iloc[[2,8,1]]``` | ```s[[2,8,1]]``` |
| Slice of items from series | ```s.iloc[5:10]``` | ```s[5:10]``` |

In [17]:
first_three_rows = f500.iloc[:3]
first_seventh_row_slice = f500.iloc[[0, 6], :5] # First and Seventh row, first five columns

### Using pandas methods to create boolean masks

There are a number of pandas methods that return boolean masks useful for exploring data.

Two examples are the ```Series.isnull()``` method and ```Series.notnull()``` method. These can be used to select either rows that contain null (or NaN) values or rows that do __not__ contain null values for a certain column.

In [18]:
# Select all rows that have NULL in previous_rank column
# Select all values from the "company", "rank" and "previous_rank" columns only
null_previous_rank = f500[f500["previous_rank"].isnull()]
null_previous_rank[["company", "rank", "previous_rank"]].head()

|  | company | rank | previous_rank |
| --- | --- | --- | --- |
| 48 | Legal & General Group | 49 | NaN |
| 90 | Uniper | 91 | NaN |
| 123 | Dell Technologies | 124 | NaN |
| 138 | Anbang Insurance Group | 139 | NaN |
| 140 | Albertsons Cos. | 141 | NaN |


### Working with Integer Labels

If we want to select the first company from our new `null_previous_rank` dataframe by integer position, we can use `DataFrame.iloc[]`.

If we use `DataFrame.loc[]` instead of `DataFrame.iloc[]`:
- We get an error telling us that the `label [0] is not in the [index]`. Recall that `DataFrame.loc[]` is used for _label_ based selection:
    - __l__oc: __l__abel based selection
    - __i__loc: __integer__ position based selection
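
A short sketch of this situation, using a three-row dataframe whose integer labels and company names are copied from the output above. The rest of the setup is an assumption for illustration.

```python
import pandas as pd

# A subset whose integer labels no longer start at 0
df = pd.DataFrame(
    {"company": ["Legal & General Group", "Uniper", "Dell Technologies"]},
    index=[48, 90, 123],
)

first_by_position = df.iloc[0]  # first row, whatever its label is
first_by_label = df.loc[48]     # the row whose label is 48

try:
    df.loc[0]                   # no row is labelled 0
    loc_zero_failed = False
except KeyError:
    loc_zero_failed = True

print(first_by_position["company"], loc_zero_failed)
```

With integer labels, `loc[]` still looks up labels rather than positions, so only labels that actually exist in the index can be selected.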

In [19]:
top5_null_prev_rank = null_previous_rank.iloc[:5]

### Pandas Index Alignment

Now that we've identified the rows with null values in the previous_rank column, let's use the `Series.notnull()` method to exclude them from the next part of our analysis.

Another powerful aspect of pandas is that almost every operation will __align on the index labels__.

When we assign a series to a dataframe column, pandas will also:
- Discard any items in the series whose index label doesn't match a label in the dataframe.
- Fill any remaining rows with NaN.

The pandas library will align on index at every opportunity, no matter if our index labels are strings or integers - this makes working with data from different sources or working with data when we have removed, added, or reordered rows much easier than it would be otherwise.
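
Index alignment can be seen by assigning a series whose labels are in a different order and only partly overlap with the dataframe. The labels and numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with three labelled rows
df = pd.DataFrame({"rank": [1, 2, 3]}, index=["A", "B", "C"])

# Different order, no value for "B", and an extra label "Z"
change = pd.Series([5, -2, 9], index=["C", "A", "Z"])

# Values are matched by label, not by position
df["rank_change"] = change

print(df)
```

Row "A" receives -2 and row "C" receives 5 despite the different order, row "B" is filled with NaN, and the value labelled "Z" is dropped.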

In [20]:
# select all rows from f500 that have a non-null value
previously_ranked = f500[f500["previous_rank"].notnull()]
# subtract the rank column from the previous_rank column
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
# Assign the values in the rank_change to a new column in the f500 dataframe
f500["rank_change"] = rank_change

### Using Boolean Operators

Boolean indexing is a powerful tool which allows us to select or exclude parts of our data based on their values. However, to answer more complex questions, we need to learn how to combine boolean arrays.

We combine boolean arrays using boolean operators. In Python, these boolean operators are `and`, `or`, and `not`. In pandas, the operators are slightly different:

| pandas | Python equivalent | Meaning |
| ------ | ----------------- | ------- |
| `a & b` | `a and b` | `True` if both `a` and `b` are `True`, else `False` |
| `a \| b` | `a or b` | `True` if either `a` or `b` is `True` |
| `~a` | `not a` | `True` if `a` is `False`, else `False` |

In [21]:
large_revenue = f500["revenues"] > 100000 # Greater than 100 billion
# Negative profit
negative_profits = f500["profits"] < 0
# Combine the above two
combined = large_revenue & negative_profits
big_rev_neg_profit = f500[combined]

In [22]:
big_rev_neg_profit

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
32,Japan Post Holdings,33,122990,3.6,-267.4,2631385,-107.5,Masatsugu Nagato,"Insurance: Life, Health (stock)",Financials,37.0,Japan,"Tokyo, Japan",http://www.japanpost.jp,21,248384,91532,4.0
44,Chevron,45,107567,-18.0,-497.0,260078,-110.8,John S. Watson,Petroleum Refining,Energy,31.0,USA,"San Ramon, CA",http://www.chevron.com,23,55200,145556,-14.0


Just like when we use a single boolean array to perform selection, we don't need to use intermediate variables. The first place we can optimize our code is by combining our two boolean arrays in a single line, instead of assigning them to the intermediate large_revenue and negative_profits variables first.

```
combined = (f500["revenues"] > 100000) & (f500["profits"] < 0)
```

We used parentheses around each of our boolean comparisons. This is very important — __our boolean operation will fail without parentheses__.
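
The reason is operator precedence: in Python, `&` binds more tightly than comparison operators such as `>` and `<`, so without parentheses the expression is grouped around `100000 & df["profits"]` instead of around the two comparisons. The sketch below uses made-up integer data to show both forms.

```python
import pandas as pd

# Hypothetical figures, not taken from f500.csv
df = pd.DataFrame({
    "revenues": [150000, 90000, 120000],
    "profits": [-5, 20, 30],
})

# With parentheses: each comparison is evaluated first, then combined
combined = (df["revenues"] > 100000) & (df["profits"] < 0)

# Without parentheses: Python evaluates 100000 & df["profits"] first,
# and the resulting chained comparison cannot be reduced to one boolean
try:
    df["revenues"] > 100000 & df["profits"] < 0
    failed = False
except (TypeError, ValueError):
    failed = True

print(list(combined), failed)
```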

Lastly, instead of assigning the boolean arrays to combined, we can insert the comparison directly into our selection.

In [23]:
brazil_venezuela = f500[(f500["country"] == "Brazil") | (f500["country"] == "Venezuela")]
tech_outside_usa = f500[(f500["sector"] == "Technology") & ~(f500["country"] == "USA" )].head()

In [24]:
brazil_venezuela.head()

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
74,Petrobras,75,81405,-16.3,-4838.0,246983,,Pedro Pullen Parente,Petroleum Refining,Energy,58.0,Brazil,"Rio de Janeiro, Brazil",http://www.petrobras.com.br,23,68829,76779,-17.0
112,Itau Unibanco Holding,113,66876,21.4,6666.4,415972,-13.7,Candido Botelho Bracher,Banks: Commercial and Savings,Financials,159.0,Brazil,"Sao Paulo, Brazil",http://www.itau.com.br,4,94779,37680,46.0
150,Banco do Brasil,151,58093,-13.4,2013.8,426416,-52.3,Paulo Rogerio Caffarelli,Banks: Commercial and Savings,Financials,115.0,Brazil,"Brasilia, Brazil",http://www.bb.com.br,23,100622,26551,-36.0
153,Banco Bradesco,154,57443,31.3,5127.9,366418,-5.7,Luiz Carlos Trabuco Cappi,Banks: Commercial and Savings,Financials,209.0,Brazil,"Osasco, Brazil",http://www.bradesco.com.br,21,94541,32369,55.0
190,JBS,191,48825,-0.1,107.7,31605,-92.3,Wesley Mendonca Batista,Food Production,"Food, Beverages & Tobacco",185.0,Brazil,"Sao Paulo, Brazil",http://jbss.infoinvest.com.br,8,237061,7307,-6.0


In [25]:
tech_outside_usa

Unnamed: 0,company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity,rank_change
14,Samsung Electronics,15,173957,-2.0,19316.5,217104,16.8,Oh-Hyun Kwon,"Electronics, Electrical Equip.",Technology,13.0,South Korea,"Suwon, South Korea",http://www.samsung.com,23,325000,154376,-2.0
26,Hon Hai Precision Industry,27,135129,-4.3,4608.8,80436,-0.4,Terry Gou,"Electronics, Electrical Equip.",Technology,25.0,Taiwan,"New Taipei City, Taiwan",http://www.foxconn.com,13,726772,33476,-2.0
70,Hitachi,71,84558,1.2,2134.3,86742,48.8,Toshiaki Higashihara,"Electronics, Electrical Equip.",Technology,79.0,Japan,"Tokyo, Japan",http://www.hitachi.com,23,303887,26632,8.0
82,Huawei Investment & Holding,83,78511,24.9,5579.4,63837,-5.0,Ren Zhengfei,Network and Other Communications Equipment,Technology,129.0,China,"Shenzhen, China",http://www.huawei.com,8,180000,20159,46.0
104,Sony,105,70170,3.9,676.4,158519,-45.1,Kazuo Hirai,"Electronics, Electrical Equip.",Technology,113.0,Japan,"Tokyo, Japan",http://www.sony.net,23,128400,22415,8.0


### Sorting Values

By default, the `sort_values()` method will sort the rows in _ascending_ order — from smallest to largest.

To sort the rows in _descending_ order instead, so the company with the largest number of employees appears first, we can set the `ascending` parameter to `False`.

In [26]:
Japan = f500[f500["country"] == "Japan"] # Select Japanese companies only
Japan_sorted = Japan.sort_values("employees", ascending=False) # Sort the rows in descending order
first_row = Japan_sorted.iloc[0]
top_japanese_employer = first_row.loc["company"]

In [27]:
top_japanese_employer

'Toyota Motor'

### Using Loops with pandas

So far, we've explicitly avoided using loops in pandas because one of its key benefits is that it has vectorized methods to work with data more efficiently. Here, we'll learn how to use loops for __aggregation__.
- Aggregation is where we apply a statistical operation to groups of our data.

Let's say that we wanted to calculate the average revenue for each country in the data set. Our process might look like this:
- Identify each unique country in the data set.
- For each country:
    - Select only the rows corresponding to that country.
    - Calculate the average revenue for those rows.

To identify the unique countries, we can use the `Series.unique()` method. This method returns an array of unique values from any series. Then, we can loop over that array and perform our operation.
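
The average-revenue process described above can be sketched as follows. The dataframe is a small made-up stand-in for f500, and `avg_rev_by_country` is a name chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for f500 with just the two columns we need
df = pd.DataFrame({
    "country": ["USA", "China", "USA", "Japan", "China"],
    "revenues": [100, 200, 300, 400, 600],
})

avg_rev_by_country = {}

# Identify each unique country, then aggregate its rows
for country in df["country"].unique():
    selected_rows = df[df["country"] == country]  # rows for this country only
    avg_rev_by_country[country] = selected_rows["revenues"].mean()

print(avg_rev_by_country)
```

Each pass through the loop handles one group, and the dictionary collects one aggregated value per country.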

In [28]:
# Create an empty dictionary
top_employer_by_country = {}

country_unique = f500["country"].unique() # Select an array of unique country names

for country in country_unique:
    rows = f500[f500["country"] == country] # Select all rows that match the country
    rows_sorted = rows.sort_values("employees", ascending=False) # Sort by the employees column in DESC order
    first_row = rows_sorted.iloc[0] # Select first row
    company = first_row.loc["company"] # Select the company name (col 1, Index 0)
    top_employer_by_country[country] = company

In [29]:
top_employer_by_country

{'Australia': 'Wesfarmers',
 'Belgium': 'Anheuser-Busch InBev',
 'Brazil': 'JBS',
 'Britain': 'Compass Group',
 'Canada': 'George Weston',
 'China': 'China National Petroleum',
 'Denmark': 'Maersk Group',
 'Finland': 'Nokia',
 'France': 'Sodexo',
 'Germany': 'Volkswagen',
 'India': 'State Bank of India',
 'Indonesia': 'Pertamina',
 'Ireland': 'Accenture',
 'Israel': 'Teva Pharmaceutical Industries',
 'Italy': 'Poste Italiane',
 'Japan': 'Toyota Motor',
 'Luxembourg': 'ArcelorMittal',
 'Malaysia': 'Petronas',
 'Mexico': 'America Movil',
 'Netherlands': 'EXOR Group',
 'Norway': 'Statoil',
 'Russia': 'Gazprom',
 'Saudi Arabia': 'SABIC',
 'Singapore': 'Flex',
 'South Korea': 'Samsung Electronics',
 'Spain': 'Banco Santander',
 'Sweden': 'H & M Hennes & Mauritz',
 'Switzerland': 'Nestle',
 'Taiwan': 'Hon Hai Precision Industry',
 'Thailand': 'PTT',
 'Turkey': 'Koc Holding',
 'U.A.E': 'Emirates Group',
 'USA': 'Walmart',
 'Venezuela': 'Mercantil Servicios Financieros'}

## Challenge: Calculating Return on Assets by Country

The column we create is going to contain a metric called __return on assets (ROA)__. ROA is a business-specific metric which indicates a company's ability to make profit using its available assets.
\begin{equation}\text{return on assets} = \frac{\text{profit}}{\text{assets}}\end{equation}
Once we've created the new column, we'll aggregate by sector, and find the company with the highest ROA from each sector.

In [30]:
f500["roa"] = f500["profits"] / f500["assets"]

top_roa_by_sector = {}

sector_unique = f500["sector"].unique()

for sector in sector_unique:
    rows = f500[f500["sector"] == sector]
    rows_sorted_by_roa = rows.sort_values("roa", ascending=False)
    first_row = rows_sorted_by_roa.iloc[0]
    company = first_row.loc["company"]
    top_roa_by_sector[sector] = company

In [31]:
top_roa_by_sector

{'Aerospace & Defense': 'Lockheed Martin',
 'Apparel': 'Nike',
 'Business Services': 'Adecco Group',
 'Chemicals': 'LyondellBasell Industries',
 'Energy': 'National Grid',
 'Engineering & Construction': 'Pacific Construction Group',
 'Financials': 'Berkshire Hathaway',
 'Food & Drug Stores': 'Publix Super Markets',
 'Food, Beverages & Tobacco': 'Philip Morris International',
 'Health Care': 'Gilead Sciences',
 'Hotels, Restaurants & Leisure': 'McDonald’s',
 'Household Products': 'Unilever',
 'Industrials': '3M',
 'Materials': 'CRH',
 'Media': 'Disney',
 'Motor Vehicles & Parts': 'Subaru',
 'Retailing': 'H & M Hennes & Mauritz',
 'Technology': 'Accenture',
 'Telecommunications': 'KDDI',
 'Transportation': 'Delta Air Lines',
 'Wholesalers': 'McKesson'}

## Conclusion: 
### Intermediate

We learned how to:
- Select columns, rows and individual items using their integer location.
- Use pd.read_csv() to read CSV files in pandas.
- Work with integer axis labels.
- Use pandas methods to produce boolean arrays.
- Use boolean operators to combine boolean comparisons to perform more complex analysis.
- Use index labels to align data.
- Use aggregation to perform advanced analysis using loops.