## Introduction

We'll learn another way pandas makes working with data easier. It has many built-in methods and functions for common exploration and analysis tasks.

__Dataset__: Fortune Global 500 list [f500.csv]

__Source__: https://data.world/chasewillden/fortune-500-companies-2017

Here is a data dictionary for some of the columns in the CSV:

- ```company```: Name of the company.
- ```rank```: Global 500 rank for the company.
- ```revenues```: Company's total revenue for the fiscal year, in millions of dollars (USD).
- ```revenue_change```: Percentage change in revenue between the current and prior fiscal year.
- ```profits```: Net income for the fiscal year, in millions of dollars (USD).
- ```ceo```: Company's Chief Executive Officer.
- ```industry```: Industry in which the company operates.
- ```sector```: Sector in which the company operates.
- ```previous_rank```: Global 500 rank for the company for the prior year.
- ```country```: Country in which the company is headquartered.
- ```hq_location```: City and country (or city and state for the USA) where the company is headquartered.
- ```employees```: Total employees (full-time equivalent, if available) at fiscal year-end.

In [1]:
import pandas as pd

f500 = pd.read_csv('data/f500.csv', index_col=0)
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 52.7+ KB
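
The ```company``` column from the data dictionary does not appear in the column list above because ```index_col=0``` tells ```pd.read_csv()``` to use the first column of the file as the row labels. A minimal sketch of that behavior, assuming a tiny made-up CSV (the company names and numbers below are hypothetical, not taken from f500.csv):

```python
import io

import pandas as pd

# Hypothetical three-column CSV held in memory instead of a file on disk
csv_text = "company,rank,revenues\nAlpha,1,300\nBeta,2,200\n"

# index_col=0 turns the first column (company) into the row index
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(demo.index.tolist())    # ['Alpha', 'Beta']
print(demo.columns.tolist())  # ['rank', 'revenues']
```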


### Vectorized Operations

Just like with NumPy, we can use any of the standard Python numeric operators with series, including:
- ```series_a + series_b``` - Addition
- ```series_a - series_b``` - Subtraction
- ```series_a * series_b``` - Multiplication (element-wise; this is unrelated to the matrix multiplication used in linear algebra).
- ```series_a / series_b``` - Division

In [2]:
rank_change = f500.loc[:, "previous_rank"] - f500.loc[:, "rank"]
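
To make the element-by-element behavior concrete, here is a small sketch of the same subtraction on two short series. The company names and ranks are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical ranks for three made-up companies
labels = ["Alpha", "Beta", "Gamma"]
previous_rank = pd.Series([3, 1, 2], index=labels)
rank = pd.Series([1, 2, 3], index=labels)

# The subtraction is applied to each pair of values that share an index label
rank_change_demo = previous_rank - rank

print(rank_change_demo.tolist())  # [2, -1, -1]
```

A positive value means the company moved up the list (its rank number got smaller), and a negative value means it moved down.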

### Series Data Exploration Methods

Like NumPy, pandas supports many descriptive statistics methods that help us explore the values in a series. Here are a few of the most useful ones:
- ```Series.max()```
- ```Series.min()```
- ```Series.mean()```
- ```Series.median()```
- ```Series.mode()```
- ```Series.sum()```

In [3]:
rank_change_max = rank_change.max()
rank_change_min = rank_change.min()
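
The other methods in the list work the same way. A quick sketch on a short series of hypothetical rank changes (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical rank changes, not taken from f500.csv
changes = pd.Series([2, -1, -1, 5, 0])

summary = {
    "max": changes.max(),        # 5
    "min": changes.min(),        # -1
    "mean": changes.mean(),      # 1.0
    "median": changes.median(),  # 0.0
    "sum": changes.sum(),        # 5
}

# mode() returns a series rather than a single value, because ties are possible
modes = changes.mode()
print(summary)
print(modes.tolist())  # [-1]
```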

### Series Describe Method

We used the ```Series.max()``` and ```Series.min()``` methods to figure out the biggest increase and decrease in rank:
- Biggest increase in rank: 226
- Biggest decrease in rank: -500

However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the ```rank``` column or ```previous_rank``` column.

We'll learn another method that can help us more quickly investigate this issue - the ```Series.describe()``` method. For a numeric series, this method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics.

When we call ```Series.describe()``` on a series of object (string) values, it returns a different set of statistics. The first statistic, __count__, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are specific to object columns:
- ```unique```: Number of unique values in the series.
- ```top```: Most common value in the series.
- ```freq```: Frequency of the most common value.

In [4]:
rank = f500["rank"]
rank_desc = rank.describe()

prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()

In [5]:
prev_rank_desc

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64
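
To compare the two kinds of output side by side, the sketch below calls ```describe()``` on one numeric series and one series of strings. Both series are small made-up examples, not columns from the dataset:

```python
import pandas as pd

# Hypothetical data: one numeric series and one series of strings
numeric = pd.Series([0, 10, 20, 30])
text = pd.Series(["USA", "China", "USA", "Japan"])

numeric_desc = numeric.describe()
text_desc = text.describe()

# Numeric series: count, mean, std, min, quartiles, and max
print(numeric_desc.index.tolist())

# String series: count, unique, top, and freq
print(text_desc.index.tolist())
print(text_desc["top"], text_desc["freq"])  # USA 2
```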

### Method Chaining

The minimum value in the ```previous_rank``` column is 0, which is odd because ranks should run from 1 to 500. To investigate the possible cause of this issue, let's count how many 0 values appear in the ```previous_rank``` column.

Rather than assigning each intermediate result to its own variable, we can call one method directly on the result of another. This is called __method chaining__, a way to combine multiple methods in a single line.
- When writing code, always assess whether method chaining will make your code harder to read. If it does, it's always preferable to break the code into more than one line.

In [6]:
# Count the number of zeros in our previous_rank column
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]
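
For comparison, the sketch below shows the same count written step by step, as a chain, and with a boolean comparison. The series is a made-up stand-in for the ```previous_rank``` column, and the boolean version is an alternative added here for illustration, not part of the original walkthrough:

```python
import pandas as pd

# Hypothetical stand-in for f500["previous_rank"]
previous_rank = pd.Series([0, 4, 0, 7, 0])

# Step by step, keeping the intermediate result in a variable
counts = previous_rank.value_counts()
zeros_stepwise = counts.loc[0]

# The same logic as a single chained expression
zeros_chained = previous_rank.value_counts().loc[0]

# Alternative: compare to 0 and sum the resulting booleans.
# Unlike .loc[0], this returns 0 instead of raising KeyError when no zeros exist.
zeros_by_mask = (previous_rank == 0).sum()

print(zeros_stepwise, zeros_chained, zeros_by_mask)  # 3 3 3
```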

### Dataframe Exploration Methods

We confirmed that 33 companies in the dataframe have a value of 0 in the previous_rank column. Given that multiple companies have a 0 rank, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value instead.
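
One possible way to make that replacement is sketched below using ```Series.mask()```, which swaps in a null value wherever a condition is true. This is only one option, shown on a small made-up dataframe rather than on ```f500``` itself:

```python
import pandas as pd

# Hypothetical data: "Beta" has no rank for the previous year, recorded as 0
demo = pd.DataFrame(
    {"rank": [1, 2, 3], "previous_rank": [2, 0, 1]},
    index=["Alpha", "Beta", "Gamma"],
)

# mask() replaces values with NaN where the condition is True;
# the column becomes float64 because NaN is a floating-point value
demo["previous_rank"] = demo["previous_rank"].mask(demo["previous_rank"] == 0)

# Arithmetic with a null value yields a null result instead of a misleading number
demo_rank_change = demo["previous_rank"] - demo["rank"]

print(demo["previous_rank"].isnull().sum())  # 1
print(demo_rank_change.tolist())             # [1.0, nan, -2.0]
```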

Because series and dataframes are two distinct objects, they have their own unique methods. However, series and dataframe objects often share a method of the same name that behaves in a similar way. Below are some examples:
- ```Series.max()``` and ```DataFrame.max()```
- ```Series.min()``` and ```DataFrame.min()```
- ```Series.mean()``` and ```DataFrame.mean()```
- ```Series.median()``` and ```DataFrame.median()```
- ```Series.mode()``` and ```DataFrame.mode()```
- ```Series.sum()``` and ```DataFrame.sum()```

Unlike their series counterparts, dataframe methods take an _axis parameter_ that specifies which axis to calculate across. It defaults to ```0```, which calculates down the rows and returns one result per column, while ```1``` calculates across the columns and returns one result per row. Instead of integers, pandas dataframe methods also accept the strings ```"index"``` and ```"columns"``` for the axis parameter.

In [7]:
# Find the maximum value for only the numeric columns from f500
numeric_cols = ["rank", "revenues", "revenue_change", "profits", "assets",
                "profit_change", "previous_rank", "years_on_global_500_list",
                "employees", "total_stockholder_equity"]
max_f500 = f500[numeric_cols].max(axis=0)
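
To see what each axis value does, the sketch below runs ```DataFrame.max()``` both ways on a small made-up dataframe (the company names and figures are hypothetical):

```python
import pandas as pd

# Hypothetical figures for three made-up companies
demo = pd.DataFrame(
    {"revenues": [10, 20, 30], "profits": [1, 5, 3]},
    index=["Alpha", "Beta", "Gamma"],
)

# axis="index" (same as axis=0): one result per column
col_max = demo.max(axis="index")

# axis="columns" (same as axis=1): one result per row
row_max = demo.max(axis="columns")

print(col_max.to_dict())  # {'revenues': 30, 'profits': 5}
print(row_max.to_dict())  # {'Alpha': 10, 'Beta': 20, 'Gamma': 30}
```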