# About This Notebook
In this last chapter of the course, **Exploring Data with pandas: Intermediate**, we will learn how to:
- Select columns, rows and individual items using their integer location.
- Use **pd.read_csv()** to read CSV files in pandas.
- Work with integer axis labels.
- Use pandas methods to produce boolean arrays.
- Use boolean operators to combine boolean comparisons to perform more complex analysis.
- Use index labels to align data.
- Use aggregation to perform advanced analysis using loops.
***
## 1. Reading CSV files with pandas (IMPORTANT)

In the previous notebook about the fundamentals of exploring data with pandas, we worked with the Fortune Global 500 dataset. In this chapter, we will learn how to use the **pandas.read_csv()** function to read in CSV files.

Previously, we used the snippet below to read our CSV file into pandas.


In [61]:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
f500.head()

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


If you look closely, you will see that the index axis labels are the values from the first column in the data set, **company**.

In the official documentation for the [read_csv() function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), the **index_col** parameter is optional. When we specify a value of **0**, the first column will be used as the row labels.

Compare the dataframe above with how the **f500** dataframe looks if we remove the second line, **f500.index.name = None**:

In [62]:
f500 = pd.read_csv("f500.csv", index_col=0)
f500.head()

company,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523
China National Petroleum,4,262573,-12.3,1867.5,585619,-73.7,Zhang Jianhua,Petroleum Refining,Energy,3,China,"Beijing, China",http://www.cnpc.com.cn,17,1512048,301893
Toyota Motor,5,254694,7.7,16899.3,437575,-12.3,Akio Toyoda,Motor Vehicles and Parts,Motor Vehicles & Parts,8,Japan,"Toyota, Japan",http://www.toyota-global.com,23,364445,157210


Notice the text **company** above the index labels, which is the name of the first column in the CSV. pandas uses this value as the **axis name** for the index axis.

Both the column and index axes can have names assigned to them. Originally, we accessed the name of the index axis and set it to **None**, which is why the dataframe didn't have a name for the index axis.
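
As a minimal sketch of the behaviour described above, the snippet below reads a tiny in-memory CSV instead of **f500.csv**. The three-row sample and the variable names `csv_text`, `labeled` and `default` are assumptions made for illustration only; the values are copied from the output shown earlier.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for f500.csv (illustration only).
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

# index_col=0 uses the first column as the row labels.
labeled = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(labeled.index.name)  # the index axis keeps the name of that column

# Setting the name to None removes the text displayed above the labels.
labeled.index.name = None
print(labeled.index.name)

# Without index_col, pandas creates a default integer index instead,
# and company stays a regular column.
default = pd.read_csv(io.StringIO(csv_text))
print(list(default.index))
```

Reading from `io.StringIO` keeps the sketch self-contained; with the real file you would pass the file name as in the cells above.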

### Task 3.5.1
1. Use the **pandas.read_csv()** function to read the **f500.csv** CSV file as a pandas dataframe. Assign it to the variable name **f500**.
    - Do not use the **index_col** parameter.
2. Use the following code to insert the NaN values (missing values) into the **previous_rank** column: <br>
    **f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan** <br>
Remark: If you get a notice that np is not defined, you have to import NumPy by typing **import numpy as np**.

In [75]:
# Start your code below:
import pandas as pd
import numpy as np

f500 = pd.read_csv("f500.csv")
f500.index.name = None

f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

f500_selection = f500.loc[:,["rank","revenues", "revenue_change"]].head()



## 2. Using iloc to select by integer position

In the previous exercise, we read our CSV file into pandas. But this time, we didn't use the index_col parameter:

In [76]:
f500 = pd.read_csv("f500.csv")
print(f500[['company', 'rank', 'revenues']].head())

                    company  rank  revenues
0                   Walmart     1    485873
1                State Grid     2    315199
2             Sinopec Group     3    267518
3  China National Petroleum     4    262573
4              Toyota Motor     5    254694


There are two significant differences from the approach we used earlier:
- the **company** column is now included as a regular column, instead of being used for the index
- the index labels are now integers starting from **0**

This is the more conventional way to read in a dataframe, and we will be using this method from now on.

Do you still remember how we worked with a dataframe with string index labels? We used **loc[]** to select the data by label. To select data by integer position instead, we use **iloc[]**.

Using **iloc[]** is almost identical to indexing with NumPy, with integer positions starting at 0 like ndarrays and Python lists.

**DataFrame.iloc[]** behaves similarly to **DataFrame.loc[]**. The full syntax for DataFrame.iloc[], in pseudocode, is: <br>
**df.iloc[row_index, column_index]**

To help you remember the two syntaxes:
- ``loc``: label based selection
- ``iloc``: integer position based selection
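
To make the distinction concrete, here is a small sketch that selects the same value once by label and once by position. The dataframe `df` is a hypothetical three-row sample (with values taken from the output above), not the full **f500** dataset.

```python
import pandas as pd

# Hypothetical mini dataframe with string index labels (illustration only).
df = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# loc: row label, then column label.
by_label = df.loc["State Grid", "revenues"]

# iloc: row position, then column position (both start at 0).
by_position = df.iloc[1, 1]

print(by_label, by_position)
```

Both lines point at the same cell; only the way of addressing it differs.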

### Task 3.5.2
1. Select just the fifth row of the **f500** dataframe. Assign the result to **fifth_row**.
2. Select the value in the first row of the **company** column. Assign the result to **company_value**.

In [77]:
# Start your code below:

fifth_row = f500.iloc[4,:]
company_value = f500.iloc[0,0] # company is the first column

## 3. Using iloc to select by integer position continued

If we want to select the first column from our **f500** dataset, we need to use ``:``, a colon, to specify all rows, and then use the integer ``0`` to specify the first column, like this:

In [78]:
first_column = f500.iloc[:,0]
print(first_column)

0                             Walmart
1                          State Grid
2                       Sinopec Group
3            China National Petroleum
4                        Toyota Motor
                    ...              
495    Teva Pharmaceutical Industries
496          New China Life Insurance
497         Wm. Morrison Supermarkets
498                               TUI
499                        AutoNation
Name: company, Length: 500, dtype: object


To specify a positional slice, we can use the same shortcut that we used with labels. Here is how we would select the rows at index positions one to four (inclusive):

In [79]:
second_to_fifth_rows = f500[1:5]
print(second_to_fifth_rows)

                    company  rank  revenues  revenue_change  profits  assets  \
1                State Grid     2    315199            -4.4   9571.3  489838   
2             Sinopec Group     3    267518            -9.1   1257.9  310726   
3  China National Petroleum     4    262573           -12.3   1867.5  585619   
4              Toyota Motor     5    254694             7.7  16899.3  437575   

   profit_change            ceo                  industry  \
1           -6.2        Kou Wei                 Utilities   
2          -65.0      Wang Yupu        Petroleum Refining   
3          -73.7  Zhang Jianhua        Petroleum Refining   
4          -12.3    Akio Toyoda  Motor Vehicles and Parts   

                   sector  previous_rank country     hq_location  \
1                  Energy              2   China  Beijing, China   
2                  Energy              4   China  Beijing, China   
3                  Energy              3   China  Beijing, China   
4  Motor Vehicles & P

Note that the row at index position 5 is not included, just as if we were slicing a Python list or NumPy ndarray. Recall that loc[] handles slicing differently:

- With loc[], the ending slice **is** included.
- With iloc[], the ending slice **is not** included.
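
The difference is easiest to see when the same slice is applied with both indexers. The sketch below uses a hypothetical six-row dataframe `df` with the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical dataframe with default integer labels 0..5 (illustration only).
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

with_loc = df.loc[1:4]    # label slice: the end label 4 is included
with_iloc = df.iloc[1:4]  # position slice: the end position 4 is excluded

print(len(with_loc), len(with_iloc))
```

The shorthand `df[1:4]` follows the positional rule, so it returns the same rows as `df.iloc[1:4]`.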

The table below summarizes how to use **DataFrame.iloc[]** and **Series.iloc[]** to select by integer position:

|Select by integer position| Explicit Syntax| Shorthand Convention|
|--|--|--|
|Single column from dataframe|df.iloc[:,3]| |
|List of columns from dataframe|df.iloc[:,[3,5,6]] | |
|Slice of columns from dataframe|df.iloc[:,3:7]| |
|Single row from dataframe|df.iloc[20]| |
|List of rows from dataframe|df.iloc[[0,3,8]]| |
|Slice of rows from dataframe|df.iloc[3:5]|df[3:5]|
|Single item from series|s.iloc[8]|s[8]|
|List of items from series|s.iloc[[2,8,1]]|s[[2,8,1]]|
|Slice of items from series|s.iloc[5:10]|s[5:10]|
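
The following sketch walks through several rows of the table on a hypothetical four-column dataframe. The column names `a` to `d` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical 4x4 dataframe (illustration only).
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12], "d": [13, 14, 15, 16]}
)

single_col = df.iloc[:, 2]     # one column, returned as a series
col_list = df.iloc[:, [0, 3]]  # list of columns: a and d
col_slice = df.iloc[:, 1:3]    # slice of columns: b and c (end excluded)
single_row = df.iloc[3]        # one row, returned as a series
row_list = df.iloc[[0, 2]]     # list of rows: first and third

print(type(single_col).__name__, list(col_list.columns), list(col_slice.columns))
```

Selecting a single row or column returns a series, while lists and slices return a dataframe.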

### Task 3.5.3
1. Select the first three rows of the f500 dataframe. Assign the result to **first_three_rows**.
2. Select the first and seventh rows and the first five columns of the f500 dataframe. Assign the result to **first_seventh_row_slice**.

In [80]:
# Start your code below:
first_three_rows = f500[:3]
first_seventh_row_slice = f500.iloc[[0, 6], :5]

print(first_three_rows)
print(first_seventh_row_slice)

         company  rank  revenues  revenue_change  profits  assets  \
0        Walmart     1    485873             0.8  13643.0  198825   
1     State Grid     2    315199            -4.4   9571.3  489838   
2  Sinopec Group     3    267518            -9.1   1257.9  310726   

   profit_change                  ceo               industry     sector  \
0           -7.2  C. Douglas McMillon  General Merchandisers  Retailing   
1           -6.2              Kou Wei              Utilities     Energy   
2          -65.0            Wang Yupu     Petroleum Refining     Energy   

   previous_rank country      hq_location                 website  \
0              1     USA  Bentonville, AR  http://www.walmart.com   
1              2   China   Beijing, China  http://www.sgcc.com.cn   
2              4   China   Beijing, China  http://www.sinopec.com   

   years_on_global_500_list  employees  total_stockholder_equity  
0                        23    2300000                     77798  
1          

## 4. Using pandas methods to create boolean masks

There are two methods that I want to introduce in this chapter: the **Series.isnull()** [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html) and the **Series.notnull()** [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.notnull.html). They can be used either to select rows that contain null (or NaN) values or to select rows that do **not** contain null values.

Let's first have a look at the **Series.isnull()** method, which is used to view rows with null values (i.e. missing values) in one column.
Here is an example for the **revenue_change** column:

In [81]:
rev_is_null = f500["revenue_change"].isnull()
print(rev_is_null.head())

0    False
1    False
2    False
3    False
4    False
Name: revenue_change, dtype: bool


We see that using **Series.isnull()** resulted in a boolean series. Just like in NumPy, we can use this series to filter our dataframe, **f500**:

In [85]:
import pandas as pd
import numpy as np

f500 = pd.read_csv("f500.csv")
f500.index.name = None


rev_change_null = f500[rev_is_null]
print(rev_change_null[["company", "country","sector"]])


                        company  country      sector
90                       Uniper  Germany      Energy
180  Hewlett Packard Enterprise      USA  Technology


### Task 3.5.4
1. Use the **Series.isnull()** method to select all rows from **f500** that have a null value for the **previous_rank** column. Select only the **company**, **rank**, and **previous_rank** columns. Assign the result to **null_previous_rank**.


In [90]:
# Start your code below:

print(f500.head())

null_previous_rank_bool  = f500["previous_rank"].isnull()
null_previous_rank = f500[null_previous_rank_bool][["company", "rank", "previous_rank"]]

                    company  rank  revenues  revenue_change  profits  assets  \
0                   Walmart     1    485873             0.8  13643.0  198825   
1                State Grid     2    315199            -4.4   9571.3  489838   
2             Sinopec Group     3    267518            -9.1   1257.9  310726   
3  China National Petroleum     4    262573           -12.3   1867.5  585619   
4              Toyota Motor     5    254694             7.7  16899.3  437575   

   profit_change                  ceo                  industry  \
0           -7.2  C. Douglas McMillon     General Merchandisers   
1           -6.2              Kou Wei                 Utilities   
2          -65.0            Wang Yupu        Petroleum Refining   
3          -73.7        Zhang Jianhua        Petroleum Refining   
4          -12.3          Akio Toyoda  Motor Vehicles and Parts   

                   sector  previous_rank country      hq_location  \
0               Retailing              1     US

## 5. Working with Integer Labels (OPTIONAL)

Now let's compare **DataFrame.loc[]** and **DataFrame.iloc[]** on a dataframe with integer labels and see how their output differs.

With **DataFrame.iloc[]**, we select the first row of **null_previous_rank** by its position:

In [91]:
# Only works if you have completed task 3.5.4
first_null_prev_rank = null_previous_rank.iloc[0]
print(first_null_prev_rank)


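IndexError: single positional indexer is out-of-bounds

The **IndexError** above most likely appears because **f500** was re-read from the CSV in an earlier cell without re-inserting the NaN values, which leaves **null_previous_rank** empty. When the NaN values from task 3.5.1 are in place, **iloc[0]** returns the first row of **null_previous_rank**.

With **DataFrame.loc[]**, however, we get an error: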

````python
first_null_prev_rank = null_previous_rank.loc[0]
````

````python
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/python3.4/site-packages/pandas/core/indexing.py in _has_valid_type(self, key, axis)
   1410                 if key not in ax:
-> 1411                     error()
   1412             except TypeError as e:

/python3.4/site-packages/pandas/core/indexing.py in error()
   1405                 raise KeyError("the label [%s] is not in the [%s]" %
-> 1406                                (key, self.obj._get_axis_name(axis)))
   1407 

KeyError: 'the label [0] is not in the [index]'
````

We get an error, telling us that **the label [0] is not in the [index]** (the actual traceback for this error is much longer than this). Remember that DataFrame.loc[] is used for label based selection:

- ``loc``: label based selection
- ``iloc``: integer position based selection

Because there is no row with the label 0 in the index, we got the error above. To select the first row using loc[], we would have to use its integer label, **48**.
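
The same situation can be reproduced on a small scale. The sketch below is an illustration under assumed data: a hypothetical four-row dataframe in which the rows with labels 1 and 3 have a missing **previous_rank**.

```python
import numpy as np
import pandas as pd

# Hypothetical data (illustration only): labels 1 and 3 hold NaN.
df = pd.DataFrame(
    {"company": ["A", "B", "C", "D"], "previous_rank": [1.0, np.nan, 3.0, np.nan]}
)

# Filtering keeps the original integer labels, here 1 and 3.
nulls = df[df["previous_rank"].isnull()]

first_by_position = nulls.iloc[0]  # first row, whatever its label is
first_by_label = nulls.loc[1]      # the row whose label is 1

# There is no row labeled 0 after filtering, so loc[0] raises a KeyError.
try:
    nulls.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(list(nulls.index), label_zero_found)
```

After filtering, positions start again at 0 but labels do not, which is why the two indexers can disagree.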


## 6. Pandas Index Alignment (OPTIONAL)
A very powerful aspect of pandas is that almost every operation will **align on the index labels**. Let's look at an example. Below we have a dataframe named **food** and a series named **alt_name**:

In [92]:
import pandas as pd
d = {"fruit_veg": ["fruit", "veg", "fruit", "veg", "veg"], "qty": [4, 2, 4, 1, 2]}
food = pd.DataFrame(data=d)
food.index = ["tomato", "carrot", "lime", "corn", "eggplant"]
food

Unnamed: 0,fruit_veg,qty
tomato,fruit,4
carrot,veg,2
lime,fruit,4
corn,veg,1
eggplant,veg,2


In [7]:
alt_name = pd.Series(['rocket', 'aubergine', 'maize'], index=["arugula", "eggplant", "corn"])
alt_name

arugula        rocket
eggplant    aubergine
corn            maize
dtype: object

Comparing the two objects above, we see that the **food** dataframe and the **alt_name** series have a different number of items and share only two index labels, **corn** and **eggplant**, which appear in a different order. If we want to add alt_name as a new column in our food dataframe, we can use the following code:

In [29]:
food["alt_name"] = alt_name

food

Unnamed: 0,fruit_veg,qty,alt_name
tomato,fruit,4,
carrot,veg,2,
lime,fruit,4,
corn,veg,1,maize
eggplant,veg,2,aubergine


When we run the code above, pandas ignores the order of the ``alt_name`` series and automatically aligns on the index labels.

In addition, pandas will also:

- Discard any items that have an index that doesn't match the dataframe (like **arugula**).
- Fill any remaining rows with **NaN**.

Observe the result again carefully.

In [30]:
# Below is the result
food

Unnamed: 0,fruit_veg,qty,alt_name
tomato,fruit,4,
carrot,veg,2,
lime,fruit,4,
corn,veg,1,maize
eggplant,veg,2,aubergine


As you can see, pandas aligns on the index whenever it can, regardless of whether the index labels are strings or integers. This makes working with data from different sources much easier.
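
The food example used string labels. As a sketch of the same behaviour with integer labels, the snippet below assigns a series to a dataframe whose index is made of integers. The labels `10`, `20`, `30`, `99` and the names `stock` and `notes` are hypothetical.

```python
import pandas as pd

# Hypothetical dataframe with integer index labels (illustration only).
stock = pd.DataFrame({"qty": [4, 2, 1]}, index=[10, 20, 30])

# The series shares only the label 30 with the dataframe.
notes = pd.Series(["restock", "unknown item"], index=[30, 99])

# Assignment aligns on the labels, not on the order of the values.
stock["note"] = notes

print(stock)
```

The value for label 99 is discarded because the dataframe has no such row, and the rows labeled 10 and 20 receive NaN.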

## 7. Using Boolean Operators (IMPORTANT)
We can combine boolean arrays using **boolean operators**. In Python, these boolean operators are ``and``, ``or``, and ``not``. In pandas, the operators are slightly different. Take a look at the table below:

|pandas|Python equivalent|Meaning|
|-|-|-|
|a & b| a and b| True if both a and b are True, else False|
| a \| b| a or b| True if either a or b is True|
|~a| not a | True if a is False, else False|

Let's apply the operators from the table to a small example:

In [97]:
cols = ["company", "revenues", "country"]
f500_sel = f500[cols].head()
f500_sel.head()

Unnamed: 0,company,revenues,country
0,Walmart,485873,USA
1,State Grid,315199,China
2,Sinopec Group,267518,China
3,China National Petroleum,262573,China
4,Toyota Motor,254694,Japan


For example, suppose we want to find the companies in **f500_sel** with more than 265 billion in revenue that are also headquartered in China. We can achieve this by using two boolean comparisons like this:

In [98]:
over_265 = f500_sel["revenues"] > 265000
china = f500_sel["country"] == "China"
print(over_265.head())
print(china.head())

0     True
1     True
2     True
3    False
4    False
Name: revenues, dtype: bool
0    False
1     True
2     True
3     True
4    False
Name: country, dtype: bool


Now we can use the **&** operator to combine the two boolean arrays, like this:

In [99]:
combined = over_265 & china
combined.head()

0    False
1     True
2     True
3    False
4    False
dtype: bool

Last but not least, we perform selection on our dataframe to get the final result like this:

In [96]:
final_cols = ["company", "revenues"]
result = f500_sel.loc[combined, final_cols]
result.head()

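IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

The **IndexingError** above appears when the index of **combined** does not match the index of **f500_sel**, for example when this cell runs while **combined** still holds a boolean series built from the full **f500** dataframe. When **combined** is computed from **f500_sel** as shown above, the selection returns the rows labeled 1 and 2, State Grid and Sinopec Group, which fulfill both criteria.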
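
The example above only used **&**. The sketch below applies the other two operators from the table, **|** and **~**, to the same five rows, rebuilt here by hand so that the snippet runs without the CSV file. The variable names `either`, `not_china` and `inline` are assumptions made for illustration.

```python
import pandas as pd

# The five rows of f500_sel, rebuilt by hand from the output above.
f500_sel = pd.DataFrame(
    {
        "company": ["Walmart", "State Grid", "Sinopec Group",
                    "China National Petroleum", "Toyota Motor"],
        "revenues": [485873, 315199, 267518, 262573, 254694],
        "country": ["USA", "China", "China", "China", "Japan"],
    }
)

over_265 = f500_sel["revenues"] > 265000
china = f500_sel["country"] == "China"

either = f500_sel[over_265 | china]  # at least one condition is True
not_china = f500_sel[~china]         # rows where the condition is False

# When comparisons are written inline, each one needs its own parentheses,
# because & and | bind more tightly than > and ==.
inline = f500_sel[(f500_sel["revenues"] > 265000) & (f500_sel["country"] == "China")]

print(list(either.index), list(not_china.index), list(inline.index))
```

Building the boolean series from the same dataframe that is being filtered keeps their indexes aligned.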

### Task 3.5.7
Now try to do a similar task by yourself:
1. Select all companies with revenues over 100 billion and negative profits from the f500 dataframe. The result should include all columns.
    - Create a boolean array that selects the companies with revenues greater than 100 billion. Assign the result to **large_revenue**.
    - Create a boolean array that selects the companies with profits less than 0. Assign the result to **negative_profits**.
    - Combine large_revenue and negative_profits. Assign the result to **combined**.
    - Use combined to filter f500. Assign the result to **big_rev_neg_profit**.

In [100]:
# Start your code below:

large_revenue = f500["revenues"] > 100000
negative_profits = f500["profits"] < 0
combined = large_revenue & negative_profits
big_rev_neg_profit = f500[combined]


## 8. Sorting Values

Now let's try to answer some more complicated questions about our data set. What if we want to find the company that employs the most people in China? How can we achieve this? We can first select all of the rows where the **country** column equals **China**, like this:

In [101]:
selected_rows = f500[f500["country"] == "China"]

Then, we can use the **DataFrame.sort_values()** [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) to sort the rows on the employees column, like this:



In [102]:
sorted_rows = selected_rows.sort_values("employees")
print(sorted_rows[["company", "country", "employees"]].head())

                                company country  employees
204                         Noble Group   China       1000
458             Yango Financial Holding   China      10234
438  China National Aviation Fuel Group   China      11739
128                         Tewoo Group   China      17353
182            Amer International Group   China      17852


By default, the **sort_values()** method sorts the rows in ascending order, from smallest to largest.

To sort the rows in descending order instead, we set the **ascending** parameter to **False**, like this:



In [103]:
sorted_rows = selected_rows.sort_values("employees", ascending=False)
print(sorted_rows[["company", "country", "employees"]].head())

                        company country  employees
3      China National Petroleum   China    1512048
118            China Post Group   China     941211
1                    State Grid   China     926067
2                 Sinopec Group   China     713288
37   Agricultural Bank of China   China     501368


Now we see that the company in China that employs the most people is China National Petroleum.
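
Notice in the output above that the sorted rows keep their original index labels (3, 118, 1, ...). The sketch below illustrates what this means for selection, using three of the rows shown above in a hypothetical mini dataframe with its own labels 0 to 2.

```python
import pandas as pd

# Three rows taken from the output above; the labels 0..2 are specific
# to this hypothetical mini dataframe.
df = pd.DataFrame(
    {
        "company": ["State Grid", "Sinopec Group", "China National Petroleum"],
        "employees": [926067, 713288, 1512048],
    }
)

# sort_values() returns a new dataframe and leaves df unchanged.
sorted_desc = df.sort_values("employees", ascending=False)

top_by_position = sorted_desc.iloc[0]   # first row after sorting
row_labeled_zero = sorted_desc.loc[0]   # row that had label 0 before sorting

print(list(sorted_desc.index))
print(top_by_position["company"], "|", row_labeled_zero["company"])
```

After sorting, position and label no longer coincide, so **iloc[]** is the right tool for picking the top row.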

Can you find the top employer among Japanese companies?

### Task 3.5.8

1. Find the company headquartered in Japan with the largest number of employees.
    - Select only the rows that have a country name equal to **Japan**.
    - Use **DataFrame.sort_values()** to sort those rows by the **employees** column in descending order.
    - Use **DataFrame.iloc[]** to select the first row from the sorted dataframe.
    - Extract the company name from the index label **company** from the first row. Assign the result to **top_japanese_employer**.

In [104]:
# Start your code below:

japan = f500[f500["country"] == "Japan"]
sorted_rows = japan.sort_values("employees", ascending=False)
top_row = sorted_rows.iloc[0]  # first row after sorting = largest employer
top_japanese_employer = top_row["company"]
