### Introduction to the Data

In [1]:
import pandas as pd
import numpy as np
f500 = pd.read_csv('f500.csv', index_col=0)
f500.index.name = None

In [2]:
f500.head()

| | rank | revenues | revenue_change | profits | assets | profit_change | ceo | industry | sector | previous_rank | country | hq_location | website | years_on_global_500_list | employees | total_stockholder_equity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Walmart | 1 | 485873 | 0.8 | 13643.0 | 198825 | -7.2 | C. Douglas McMillon | General Merchandisers | Retailing | 1 | USA | Bentonville, AR | http://www.walmart.com | 23 | 2300000 | 77798 |
| State Grid | 2 | 315199 | -4.4 | 9571.3 | 489838 | -6.2 | Kou Wei | Utilities | Energy | 2 | China | Beijing, China | http://www.sgcc.com.cn | 17 | 926067 | 209456 |
| Sinopec Group | 3 | 267518 | -9.1 | 1257.9 | 310726 | -65.0 | Wang Yupu | Petroleum Refining | Energy | 4 | China | Beijing, China | http://www.sinopec.com | 19 | 713288 | 106523 |
| China National Petroleum | 4 | 262573 | -12.3 | 1867.5 | 585619 | -73.7 | Zhang Jianhua | Petroleum Refining | Energy | 3 | China | Beijing, China | http://www.cnpc.com.cn | 17 | 1512048 | 301893 |
| Toyota Motor | 5 | 254694 | 7.7 | 16899.3 | 437575 | -12.3 | Akio Toyoda | Motor Vehicles and Parts | Motor Vehicles & Parts | 8 | Japan | Toyota, Japan | http://www.toyota-global.com | 23 | 364445 | 157210 |


1. Use the `DataFrame.head()` method to select the first 10 rows in `f500`. Assign the result to `f500_head`.

In [3]:
f500_head = f500.head(10)

2. Use the `DataFrame.info()` method to display information about the dataframe.

In [4]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   rank                      500 non-null    int64  
 1   revenues                  500 non-null    int64  
 2   revenue_change            498 non-null    float64
 3   profits                   499 non-null    float64
 4   assets                    500 non-null    int64  
 5   profit_change             436 non-null    float64
 6   ceo                       500 non-null    object 
 7   industry                  500 non-null    object 
 8   sector                    500 non-null    object 
 9   previous_rank             500 non-null    int64  
 10  country                   500 non-null    object 
 11  hq_location               500 non-null    object 
 12  website                   500 non-null    object 
 13  years_on_global_500_list  500 non-null    int64  
 14  em

### Vectorized Operations

Just like with NumPy, we can use any of the standard Python numeric operators with series, including:
- `series_a + series_b` - Addition
- `series_a - series_b` - Subtraction
- `series_a * series_b` - Multiplication
- `series_a / series_b` - Division
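
As a minimal sketch of how these operators behave, the example below uses two small made-up series (the labels `A`, `B`, `C` and their values are hypothetical, not taken from `f500`). It illustrates that the operation is applied element by element and that pandas matches values by index label rather than by position.

```python
import pandas as pd

# Hypothetical miniature data; labels and values are invented for illustration.
previous_rank = pd.Series([3, 1, 2], index=["A", "B", "C"])
rank = pd.Series([1, 2, 3], index=["A", "B", "C"])

# Subtraction is applied element by element.
change = previous_rank - rank
print(change)

# Values are matched by index label, so reordering one operand
# does not change which pairs are subtracted.
shuffled = rank.loc[["C", "A", "B"]]
change_shuffled = previous_rank - shuffled
print(change_shuffled.sort_index())
```

Because of this label alignment, two columns taken from the same dataframe always line up row by row.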

1. Subtract the values in the `rank` column from the values in the `previous_rank` column. Assign the result to `rank_change`.

In [5]:
rank_change = f500["previous_rank"] - f500["rank"]

### Series Data Exploration Methods

Like NumPy, pandas provides many descriptive statistics methods that can help answer questions about the data:
- `Series.max()`
- `Series.min()`
- `Series.mean()`
- `Series.median()`
- `Series.mode()`
- `Series.sum()`
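
The following sketch runs each of these methods on a small made-up series (the numbers are hypothetical). One detail worth noting is that `Series.mode()` returns a series rather than a single value, since several values can tie for most frequent.

```python
import pandas as pd

# Hypothetical values chosen so each statistic is easy to check by hand.
values = pd.Series([4, 1, 3, 1, 6])

largest = values.max()
smallest = values.min()
average = values.mean()
middle = values.median()
most_common = values.mode()  # a Series, because ties are possible
total = values.sum()

print(largest, smallest, average, middle, total)
print(most_common)
```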

1. Use the `Series.max()` method to find the maximum value for the `rank_change` series. Assign the result to the variable `rank_change_max`.

In [6]:
rank_change_max = rank_change.max()

2. Use the `Series.min()` method to find the minimum value for the `rank_change` series. Assign the result to the variable `rank_change_min`.

In [7]:
rank_change_min = rank_change.min()

### Series Describe Method

The `Series.describe()` method shows how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics.

In [8]:
# numeric column
assets = f500["assets"]
print(assets.describe())

count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64


In [9]:
# non-numeric column
country = f500["country"]
print(country.describe())

count     500
unique     34
top       USA
freq      132
Name: country, dtype: object


- `count` shows the number of non-null values
- `unique` shows the number of unique values in the series
- `top` shows the most common value in the series
- `freq` shows the frequency of the most common value
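
To make the difference between the two kinds of summary concrete, here is a small sketch on made-up data (both series are hypothetical). It only assumes what the outputs above already show: numeric series get statistics such as `mean` and `std`, while non-numeric series get `count`, `unique`, `top`, and `freq`.

```python
import pandas as pd

# Hypothetical data for illustration only.
nums = pd.Series([10, 20, 30, 40])
labels = pd.Series(["USA", "China", "USA", "Japan"])

num_desc = nums.describe()      # count, mean, std, min, quartiles, max
label_desc = labels.describe()  # count, unique, top, freq

print(num_desc)
print(label_desc)

# The result is itself a series, so individual statistics can be looked up by label.
print(num_desc["mean"], label_desc["top"])
```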

1. Return a series of descriptive statistics for the `rank` column in `f500`.
    - Select the `rank` column. Assign it to a variable named `rank`.
    - Use the `Series.describe()` method to return a series of statistics for `rank`. Assign the result to `rank_desc`.

In [10]:
rank = f500["rank"]
rank_desc = rank.describe()
print(rank_desc)

count    500.000000
mean     250.500000
std      144.481833
min        1.000000
25%      125.750000
50%      250.500000
75%      375.250000
max      500.000000
Name: rank, dtype: float64


2. Return a series of descriptive statistics for the `previous_rank` column in `f500`.
    - Select the `previous_rank` column. Assign it to a variable named `prev_rank`.
    - Use the `Series.describe()` method to return a series of statistics for `prev_rank`. Assign the result to `prev_rank_desc`.

In [11]:
prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()
print(prev_rank_desc)

count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64


### Method Chaining

**Method chaining** is a way to combine multiple methods together in a single line.

In [12]:
print(f500["country"].value_counts().loc["China"])

109
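
A chain is equivalent to running each method separately and passing the intermediate result along. The sketch below shows both forms side by side on a made-up series of country labels (the data is hypothetical).

```python
import pandas as pd

# Hypothetical country labels for illustration.
countries = pd.Series(["USA", "China", "USA", "Japan", "China", "USA"])

# Step by step, with an intermediate variable.
counts = countries.value_counts()
china_stepwise = counts.loc["China"]

# The same work as a single chained expression.
china_chained = countries.value_counts().loc["China"]

print(china_stepwise, china_chained)
```

Chaining avoids throwaway variables, while the stepwise form can be easier to debug when an intermediate result is unexpected.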


1. Use `Series.value_counts()` and `Series.loc` to return the number of companies with a value of `0` in the `previous_rank` column in the `f500` dataframe. Assign the results to `zero_previous_rank`.

In [13]:
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]

In [14]:
print(zero_previous_rank)

33


### DataFrame Exploration Methods

Many methods exist on both series and dataframes:
- `Series.max()` and `DataFrame.max()`
- `Series.min()` and `DataFrame.min()`
- `Series.mean()` and `DataFrame.mean()`
- `Series.median()` and `DataFrame.median()`
- `Series.mode()` and `DataFrame.mode()`
- `Series.sum()` and `DataFrame.sum()`

DataFrame methods accept an _axis parameter_ that specifies which axis to calculate across (the default is `axis=0`).
- Calculate for each **column**:
    - `DataFrame.method(axis=0)` or `DataFrame.method(axis="index")`
- Calculate for each **row**
    - `DataFrame.method(axis=1)` or `DataFrame.method(axis="columns")`
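
As a rough illustration of the two directions, the sketch below sums a tiny made-up dataframe both ways (the labels and numbers are hypothetical). It also checks that the string aliases give the same results as the integer forms.

```python
import pandas as pd

# Hypothetical data; two numeric columns and three labelled rows.
df = pd.DataFrame(
    {"revenues": [100, 200, 300], "profits": [10, 20, 60]},
    index=["A", "B", "C"],
)

# axis=0 collapses the rows, leaving one result per column.
per_column = df.sum(axis=0)

# axis=1 collapses the columns, leaving one result per row.
per_row = df.sum(axis=1)

print(per_column)
print(per_row)

# The string aliases are interchangeable with the integers.
same_columns = df.sum(axis="index").equals(per_column)
same_rows = df.sum(axis="columns").equals(per_row)
```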

Find the median (middle) value for the `revenues` and `profits` columns:

In [15]:
medians = f500[["revenues", "profits"]].median(axis=0)
print(medians)

revenues    40236.0
profits      1761.6
dtype: float64


1. Use the `DataFrame.max()` method to find the maximum value for _only the numeric_ columns from `f500`. Assign the result to the variable `max_f500`.

In [16]:
max_f500 = f500.max(numeric_only=True)

In [17]:
max_f500

rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64

### DataFrame Describe Method

By default, the `DataFrame.describe()` method will return statistics only for numeric columns. If we want statistics for just the object columns, we need to use the `include=['O']` parameter (the capital letter O, not the digit zero):

In [18]:
print(f500.describe(include=['O']))

                      ceo                       industry      sector country  \
count                 500                            500         500     500   
unique                500                             58          21      34   
top     Michael J. Kasbar  Banks: Commercial and Savings  Financials     USA   
freq                    1                             51         118     132   

           hq_location                      website  
count              500                          500  
unique             235                          500  
top     Beijing, China  http://www.wuchanzhongda.cn  
freq                56                            1  


1. Return a dataframe of descriptive statistics for all of the numeric columns in `f500`. Assign the result to `f500_desc`.

In [19]:
f500_desc = f500.describe()

In [20]:
f500.describe()

| | rank | revenues | revenue_change | profits | assets | profit_change | previous_rank | years_on_global_500_list | employees | total_stockholder_equity |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 498.0 | 499.0 | 500.0 | 436.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 55416.358 | 4.538353 | 3055.203206 | 243632.3 | 24.152752 | 222.134 | 15.036 | 133998.3 | 30628.076 |
| std | 144.481833 | 45725.478963 | 28.549067 | 5171.981071 | 485193.7 | 437.509566 | 146.941961 | 7.932752 | 170087.8 | 43642.576833 |
| min | 1.0 | 21609.0 | -67.3 | -13038.0 | 3717.0 | -793.7 | 0.0 | 1.0 | 328.0 | -59909.0 |
| 25% | 125.75 | 29003.0 | -5.9 | 556.95 | 36588.5 | -22.775 | 92.75 | 7.0 | 42932.5 | 7553.75 |
| 50% | 250.5 | 40236.0 | 0.55 | 1761.6 | 73261.5 | -0.35 | 219.5 | 17.0 | 92910.5 | 15809.5 |
| 75% | 375.25 | 63926.75 | 6.975 | 3954.0 | 180564.0 | 17.7 | 347.25 | 23.0 | 168917.2 | 37828.5 |
| max | 500.0 | 485873.0 | 442.3 | 45687.0 | 3473238.0 | 8909.5 | 500.0 | 23.0 | 2300000.0 | 301893.0 |


### Assignment with pandas

In this particular dataset, the `previous_rank` column contains `0` values. We concluded that companies with a previous rank of zero didn't have a rank at all. It would make more sense to replace those zero values with a null value to clearly indicate that the value is missing.

In order to correct these values, we need to:
- Perform assignment in pandas.
- Use Boolean indexing in pandas.

To assign values in a single axis:

`top5_rank_revenue = f500[["rank", "revenues"]].head()
print(top5_rank_revenue)`


`top5_rank_revenue["revenues"] = 0
print(top5_rank_revenue)`

To assign a single value:

`top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999`

`print(top5_rank_revenue)`
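
The two kinds of assignment can be sketched on a self-contained miniature frame. The frame `f500_small` is a hypothetical stand-in built from three of the rows shown earlier, and the `.copy()` call is an addition made here (not part of the snippets above) so that the edits clearly apply to an independent object rather than to the parent frame.

```python
import pandas as pd

# Hypothetical stand-in for f500, using three rows from the head() output.
f500_small = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# .copy() is an assumption added for this sketch: it keeps the edits
# below from touching (or warning about) the original frame.
top_rank_revenue = f500_small[["rank", "revenues"]].head().copy()

# Assigning a scalar to a whole column sets every row of that column.
top_rank_revenue["revenues"] = 0

# .loc[row_label, column_label] sets a single cell.
top_rank_revenue.loc["Sinopec Group", "revenues"] = 999

print(top_rank_revenue)
```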

1. The company "Dow Chemical" has a new CEO. In the `f500` dataframe, update the value at row label `Dow Chemical` and column `ceo` to `Jim Fitterling`.

In [21]:
f500.loc["Dow Chemical", "ceo"] = "Jim Fitterling"

### Using Boolean Indexing with pandas Objects

We can use **boolean indexing** to change all rows that meet the same criteria.
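
The idea can be sketched in two steps on a tiny made-up frame (the three rows below are a hypothetical subset, not the full dataset): a comparison produces a series of `True`/`False` values, and passing that series to `.loc` keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
df = pd.DataFrame(
    {
        "industry": ["Motor Vehicles and Parts", "Utilities", "Motor Vehicles and Parts"],
        "country": ["Japan", "China", "Germany"],
    },
    index=["Toyota Motor", "State Grid", "Volkswagen"],
)

# Step 1: the comparison yields one boolean per row.
mask = df["industry"] == "Motor Vehicles and Parts"

# Step 2: .loc keeps the rows where the mask is True and selects one column.
selected = df.loc[mask, "country"]

print(mask)
print(selected)
```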

1. Create a boolean series, `motor_bool`, that compares whether the values in the `industry` column from the `f500` dataframe are equal to `"Motor Vehicles and Parts"`.

In [22]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"

2. Use the `motor_bool` boolean series to index the `country` column. Assign the result to `motor_countries`. 

In [23]:
motor_countries = f500.loc[motor_bool, "country"]

In [24]:
motor_countries

Toyota Motor                                 Japan
Volkswagen                                 Germany
Daimler                                    Germany
General Motors                                 USA
Ford Motor                                     USA
Honda Motor                                  Japan
SAIC Motor                                   China
Nissan Motor                                 Japan
BMW Group                                  Germany
Dongfeng Motor                               China
Robert Bosch                               Germany
Hyundai Motor                          South Korea
China FAW Group                              China
Beijing Automotive Group                     China
Peugeot                                     France
Renault                                     France
Kia Motors                             South Korea
Continental                                Germany
Denso                                        Japan
Guangzhou Automobile Industry G

### Using Boolean Arrays to Assign Values

The `sector` column features both `"Motor Vehicles and Parts"` and `"Motor Vehicles & Parts"` - let's make those uniform.

In [25]:
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"

In [26]:
f500.loc[ampersand_bool, "sector"] = "Motor Vehicles and Parts"

The more direct (and common) way to do this is:

`f500.loc[f500["sector"] == "Motor Vehicles & Parts", "sector"] = "Motor Vehicles and Parts"`
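
As a quick check that the one-line form only touches matching rows, here is a sketch on a hypothetical three-row frame (labels `A`, `B`, `C` are invented).

```python
import pandas as pd

# Hypothetical frame with both spellings of the same sector.
df = pd.DataFrame(
    {"sector": ["Motor Vehicles & Parts", "Energy", "Motor Vehicles and Parts"]},
    index=["A", "B", "C"],
)
unique_before = df["sector"].nunique()

# The boolean comparison selects the rows; "sector" selects the column to overwrite.
df.loc[df["sector"] == "Motor Vehicles & Parts", "sector"] = "Motor Vehicles and Parts"

unique_after = df["sector"].nunique()
print(df)
print(unique_before, unique_after)
```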

Next, we replace the `0` values in the `previous_rank` column. First, record the value counts before the change:

In [27]:
previous_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

In [28]:
previous_rank_before

0      33
159     1
147     1
148     1
149     1
Name: previous_rank, dtype: int64

1. Use boolean indexing to update values in the `previous_rank` column of the `f500` dataframe:
    - There should now be a value of `np.nan` where there previously was a value of `0`.
    - It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.

In [29]:
f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

2. Create a new pandas series, `prev_rank_after`, using the same syntax that was used to create the `previous_rank_before` series.

In [30]:
prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

In [31]:
prev_rank_after

NaN      33
471.0     1
234.0     1
125.0     1
166.0     1
Name: previous_rank, dtype: int64
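
Note that the labels in the output now print as `471.0` rather than `471`: `np.nan` is a floating-point value, so a column holding it can no longer stay an integer column. The sketch below shows this on a hypothetical three-row frame. The explicit `astype("float64")` step is an assumption added here, on the understanding that some recent pandas versions warn about, or refuse, an assignment that would silently change an integer column's dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature column; 0 stands for "no previous rank".
ranks = pd.DataFrame({"previous_rank": [1, 0, 4]}, index=["A", "B", "C"])
dtype_before = ranks["previous_rank"].dtype  # an integer dtype

# Assumption: cast to float first so the column can hold NaN
# without relying on an implicit dtype change during assignment.
ranks["previous_rank"] = ranks["previous_rank"].astype("float64")
ranks.loc[ranks["previous_rank"] == 0, "previous_rank"] = np.nan

dtype_after = ranks["previous_rank"].dtype
missing = ranks["previous_rank"].isna().sum()

print(dtype_before, dtype_after, missing)
print(ranks["previous_rank"].value_counts(dropna=False))
```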

### Creating New Columns

1. Add a new column named `rank_change` to the `f500` dataframe by subtracting the values in the `rank` column from the values in the `previous_rank` column.

In [32]:
f500["rank_change"] = f500["previous_rank"] - f500["rank"]

2. Use the `Series.describe()` method to return a series of descriptive statistics for the `rank_change` column. Assign the result to `rank_change_desc`.

In [33]:
rank_change_desc = f500["rank_change"].describe()

In [34]:
rank_change_desc

count    467.000000
mean      -3.533191
std       44.293603
min     -199.000000
25%      -21.000000
50%       -2.000000
75%       10.000000
max      226.000000
Name: rank_change, dtype: float64
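
The `count` of 467 is 500 minus the 33 rows whose `previous_rank` is now null: arithmetic with a missing value produces a missing value, and `describe()` counts only non-null entries. A minimal sketch with made-up numbers (the labels and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: company B has no previous rank.
previous_rank = pd.Series([3.0, np.nan, 1.0], index=["A", "B", "C"])
rank = pd.Series([1, 2, 3], index=["A", "B", "C"])

# NaN propagates through the subtraction.
rank_change = previous_rank - rank
print(rank_change)

# describe() reports the number of non-null values only.
non_null = rank_change.describe()["count"]
nulls = rank_change.isna().sum()
print(non_null, nulls)
```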

### Challenge: Top Performers by Country

In [36]:
# given code
top_2_countries = f500["country"].value_counts().head(2)
top_2_countries

USA      132
China    109
Name: country, dtype: int64

1. Create a series, `industry_usa`, containing counts of the two most common values in the `industry` column for companies headquartered in the USA.

In [37]:
industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)
industry_usa

Banks: Commercial and Savings               8
Insurance: Property and Casualty (Stock)    7
Name: industry, dtype: int64

2. Create a series, `sector_china`, containing counts of the three most common values in the `sector` column for companies headquartered in China.

In [38]:
sector_china = f500["sector"][f500["country"] == "China"].value_counts().head(3)
sector_china

Financials     25
Energy         22
Wholesalers     9
Name: sector, dtype: int64
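
Both answers follow the same pattern: filter rows with a boolean comparison, count the values of one column, and keep the top few. The sketch below replays that pattern on a small made-up frame (all values are hypothetical) and shows an alternative spelling with `.loc`, which selects the rows and the column in one step.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "country": ["USA", "USA", "USA", "China", "China", "USA"],
        "industry": ["Banks", "Banks", "Retail", "Banks", "Energy", "Banks"],
    }
)

is_usa = df["country"] == "USA"

# Form used above: select the column, then filter it with the mask.
top_usa = df["industry"][is_usa].value_counts().head(2)

# Alternative: .loc takes the mask and the column label together.
top_usa_loc = df.loc[is_usa, "industry"].value_counts().head(2)

print(top_usa)
```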