## Introduction

Limitations of NumPy:
- The lack of support for column names forces us to frame questions as multi-dimensional array operations.
- Support for only one data type per ndarray makes it more difficult to work with data that contains both numeric and string data.
- There are lots of low-level methods, but many common analysis patterns don't have pre-built methods.
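
The single-dtype limitation is easy to see in a small sketch. The three values below are chosen only for illustration; when they are placed in one ndarray, NumPy converts all of them to strings so that the array has a single dtype.

```python
import numpy as np

# Mixing an integer, a float, and a string in one ndarray forces NumPy
# to pick a single dtype that can represent all three values.
mixed = np.array([1, 2.5, "Walmart"])

print(mixed.dtype.kind)  # "U" is the kind code for a Unicode string dtype
print(mixed)             # the numbers are now stored as strings
```

Once the numbers have become strings, arithmetic on them no longer works without converting them back, which is the friction the second bullet describes.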

### Understanding pandas and NumPy

The __pandas__ library provides solutions to all of these pain points and more. Pandas is not so much a replacement for NumPy as an _extension_ of NumPy.

The primary data structure in pandas is called a __dataframe__. Dataframes are the pandas equivalent of a NumPy 2D ndarray, with a few key differences:
- Axis values can have string __labels__, not just numeric ones.
- Dataframes can contain columns with __multiple data types__: including integer, float, and string.
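
As a rough illustration of both differences, the sketch below builds a three-row dataframe by hand. The values are copied from the first rows of the data set shown later in this section; the name `sample` is just a placeholder.

```python
import pandas as pd

# A small dataframe with string labels on both axes and one column
# each of integer, float, and string data.
sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "revenue_change": [0.8, -4.4, -9.1],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

print(sample.index.tolist())    # row labels
print(sample.columns.tolist())  # column labels
print(sample.dtypes)            # one dtype per column
```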

### Introduction to the Data

We will work with the data set from Fortune magazine's 2017 Global 500 list, which ranks the top 500 corporations worldwide by revenue.

Source: https://data.world/chasewillden/fortune-500-companies-2017

We will be using a modified version called **_f500.csv_**. The data description is as follows:
- company: Name of the company.
- rank: Global 500 rank for the company.
- revenues: Company's total revenue for the fiscal year, in millions of dollars (USD).
- revenue_change: Percentage change in revenue between the current and prior fiscal year.
- profits: Net income for the fiscal year, in millions of dollars (USD).
- ceo: Company's Chief Executive Officer.
- industry: Industry in which the company operates.
- sector: Sector in which the company operates.
- previous_rank: Global 500 rank for the company for the prior year.
- country: Country in which the company is headquartered.

In [1]:
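# Import pandas under its conventional alias
import pandas as pd

# Read the CSV, using the first column (the company names) as the row index
f500 = pd.read_csv('data/f500.csv', index_col=0)
# Clear the index name so it is not displayed above the row labels
f500.index.name = None

# Store the object's type and its (rows, columns) shape
f500_type = type(f500)
f500_shape = f500.shape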
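
To see what `index_col=0` and clearing the index name do without needing the file, here is a minimal sketch that reads a tiny in-memory CSV. The header name `company` for the first column is an assumption about the real file, and `csv_text` and `demo` are placeholder names.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for f500.csv. Assumption: the real file
# also stores the company names in its first column.
csv_text = """company,rank,revenues
Walmart,1,485873
State Grid,2,315199
Sinopec Group,3,267518
"""

demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
name_before = demo.index.name  # the index takes the first column's header

demo.index.name = None         # clear it, as in the cell above

print(name_before)
print(type(demo))
print(demo.shape)              # (rows, columns); the index is not counted as a column
```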

### Introducing Dataframes

Recall that one of the features that makes pandas better for working with data is its support for string column and row labels:
- __Axis values can have string labels, not just numeric ones.__
- Dataframes can contain columns with multiple data types: including integer, float, and string.

To view the first few rows of our dataframe, we can use the ```DataFrame.head()``` method. By default, it returns the first five rows. It also accepts an optional integer parameter that specifies the number of rows to return.

In [2]:
f500.head(3)

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Walmart,1,485873,0.8,13643.0,198825,-7.2,C. Douglas McMillon,General Merchandisers,Retailing,1,USA,"Bentonville, AR",http://www.walmart.com,23,2300000,77798
State Grid,2,315199,-4.4,9571.3,489838,-6.2,Kou Wei,Utilities,Energy,2,China,"Beijing, China",http://www.sgcc.com.cn,17,926067,209456
Sinopec Group,3,267518,-9.1,1257.9,310726,-65.0,Wang Yupu,Petroleum Refining,Energy,4,China,"Beijing, China",http://www.sinopec.com,19,713288,106523


Likewise, we can use the ```DataFrame.tail()``` method to show the last rows of our dataframe. It also returns five rows by default and accepts the same optional integer parameter.

In [3]:
f500.tail(3)

Unnamed: 0,rank,revenues,revenue_change,profits,assets,profit_change,ceo,industry,sector,previous_rank,country,hq_location,website,years_on_global_500_list,employees,total_stockholder_equity
Wm. Morrison Supermarkets,498,21741,-11.3,406.4,11630,20.4,David T. Potts,Food and Drug Stores,Food & Drug Stores,437,Britain,"Bradford, Britain",http://www.morrisons.com,13,77210,5111
TUI,499,21655,-5.5,1151.7,16247,195.5,Friedrich Joussen,Travel Services,Business Services,467,Germany,"Hanover, Germany",http://www.tuigroup.com,23,66779,3006
AutoNation,500,21609,3.6,430.5,10060,-2.7,Michael J. Jackson,Specialty Retailers,Retailing,0,USA,"Fort Lauderdale, FL",http://www.autonation.com,12,26000,2310
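Both methods can be tried on a small stand-in dataframe. The sketch below uses only the six rows and two of the columns visible in the outputs above; `sample` is a placeholder name.

```python
import pandas as pd

# Ranks and revenues copied from the head and tail outputs above.
sample = pd.DataFrame(
    {
        "rank": [1, 2, 3, 498, 499, 500],
        "revenues": [485873, 315199, 267518, 21741, 21655, 21609],
    },
    index=[
        "Walmart", "State Grid", "Sinopec Group",
        "Wm. Morrison Supermarkets", "TUI", "AutoNation",
    ],
)

first_two = sample.head(2)    # first two rows
last_two = sample.tail(2)     # last two rows
default_head = sample.head()  # no argument: up to five rows

print(first_two)
print(last_two)
print(len(default_head))
```
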


Another feature that makes pandas better for working with data is that dataframes can contain more than one data type:
- Axis values can have string labels, not just numeric ones.
- __Dataframes can contain columns with multiple data types: including integer, float, and string.__

We can use the ```DataFrame.dtypes``` attribute (similar to NumPy's ```ndarray.dtype``` attribute) to return information about the types of each column.

- Pandas uses NumPy dtypes for numeric columns, including ```int64``` and ```float64```.
- There is also a type we haven't seen before, ```object```, which is used for columns whose data doesn't fit any other dtype.
    - This is almost always used for columns containing string values.
    
If we wanted an overview of all the dtypes used in our dataframe, along with its shape and other information, we could use the ```DataFrame.info()``` method.
- Note: ```DataFrame.info()``` prints the information, rather than returning it, so we can't assign it to a variable.

In [4]:
f500.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, Walmart to AutoNation
Data columns (total 16 columns):
rank                        500 non-null int64
revenues                    500 non-null int64
revenue_change              498 non-null float64
profits                     499 non-null float64
assets                      500 non-null int64
profit_change               436 non-null float64
ceo                         500 non-null object
industry                    500 non-null object
sector                      500 non-null object
previous_rank               500 non-null int64
country                     500 non-null object
hq_location                 500 non-null object
website                     500 non-null object
years_on_global_500_list    500 non-null int64
employees                   500 non-null int64
total_stockholder_equity    500 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 52.7+ KB
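
The difference between the attribute and the method can be sketched on a small stand-in dataframe built from values in the outputs above. The `buf` parameter of `DataFrame.info()` is one way to capture the report as text if it is needed in a variable. The dtype reported for the string column may vary with the pandas version.

```python
import io

import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "profits": [13643.0, 9571.3, 1257.9],
        "sector": ["Retailing", "Energy", "Energy"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# dtypes is an attribute: it returns a series with one dtype per column.
column_types = sample.dtypes
print(column_types)

# info() prints its report and returns None.
result = sample.info()
print(result)

# Passing a text buffer captures the report instead of printing it.
buffer = io.StringIO()
sample.info(buf=buffer)
report = buffer.getvalue()
```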


### Selecting a Column From a DataFrame by Label

Because our axes in pandas have labels, we can select data using those labels — unlike in NumPy, where we needed to know the exact index location.

To do this, we can use the ```DataFrame.loc[]``` indexer (an attribute that is indexed with square brackets rather than called like a method). The syntax for ```DataFrame.loc[]``` is:
```
df.loc[row_label, column_label]
```

In [5]:
industries = f500.loc[:, "industry"]
industries_type = type(industries)

#### Introduction to Series

When you select just one column of a dataframe, you get a new pandas type: a __series__ object.
- Series is the pandas type for one-dimensional objects.
- Anytime you see a __1D__ pandas object, it will be a __series__. Anytime you see a __2D__ pandas object, it will be a __dataframe__.

To select specific columns, we pass a __list of labels__.

Because the object returned is two-dimensional, we know it's a dataframe, not a series. Instead of ```df.loc[:,["col1","col2"]]```, you can also use the shorthand ```df[["col1", "col2"]]``` to select specific columns, just as ```df["col1"]``` is shorthand for ```df.loc[:,"col1"]```.
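
The following sketch checks these points on a three-row stand-in dataframe (values copied from the first rows shown earlier; `sample` is a placeholder name): a single label gives a series, a list of labels gives a dataframe, and the shorthand forms select the same data as the explicit ones.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "industry": ["General Merchandisers", "Utilities", "Petroleum Refining"],
        "sector": ["Retailing", "Energy", "Energy"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# A single label returns a one-dimensional series.
one_column = sample.loc[:, "industry"]
print(type(one_column).__name__, one_column.ndim)

# A list of labels returns a two-dimensional dataframe, even for one label.
two_columns = sample.loc[:, ["rank", "sector"]]
single_in_list = sample[["industry"]]
print(type(two_columns).__name__, two_columns.shape)
print(type(single_in_list).__name__, single_in_list.shape)

# The shorthand forms select the same data as the explicit forms.
same_series = sample["industry"].equals(one_column)
same_frame = sample[["rank", "sector"]].equals(two_columns)
print(same_series, same_frame)
```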

A summary of the techniques we've learned so far is below:

| Select by Label | Explicit Syntax | Common Shorthand |
| --------------- | --------------- | ---------------- |
| Single column | ```df.loc[:,"col1"]``` | ```df["col1"]``` |
| List of columns | ```df.loc[:,["col1", "col7"]]``` | ```df[["col1", "col7"]]``` |
| Slice of columns | ```df.loc[:,"col1":"col4"]``` |  |

In [6]:
countries = f500["country"]
revenues_years = f500.loc[:, ["revenues", "years_on_global_500_list"]]
ceo_to_sector = f500.loc[:, "ceo":"sector"]
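
One detail about the slice in the last line is worth checking: a label slice with ```loc``` includes its end label, unlike a positional slice of a Python list. A minimal sketch, using one row of values from the output shown earlier and a placeholder dataframe named `sample`:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "profits": [13643.0],
        "ceo": ["C. Douglas McMillon"],
        "industry": ["General Merchandisers"],
        "sector": ["Retailing"],
        "previous_rank": [1],
    },
    index=["Walmart"],
)

# Label slice: both "ceo" and "sector" are included in the result.
ceo_to_sector = sample.loc[:, "ceo":"sector"]
print(ceo_to_sector.columns.tolist())

# Positional slice of a plain list for comparison: the end position is excluded.
column_names = sample.columns.tolist()
print(column_names[1:3])
```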

### Selecting Rows From a DataFrame by Label

- We select rows using the labels of the __index__ axis.
- We use the same syntax to select rows from a dataframe as we do for columns:

```df.loc[row_label, column_label]```

When we select a single row, the object returned is a series because it is one-dimensional. Since this series has to store integer, float, and string values, pandas uses the ```object``` dtype, because no numeric dtype can represent all of them.

In [7]:
toyota = f500.loc["Toyota Motor"]
drink_companies = f500.loc[["Anheuser-Busch InBev", "Coca-Cola", "Heineken Holding"]]
middle_companies = f500.loc["Tata Motors":"Nationwide", "rank":"country"]
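
The same three patterns (a single label, a list of labels, and a slice of labels) can be tried on a small stand-in dataframe. The values come from the first rows shown earlier, and `sample` is a placeholder name.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "rank": [1, 2, 3],
        "profits": [13643.0, 9571.3, 1257.9],
        "country": ["USA", "China", "China"],
    },
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Single row label: a series whose index is the dataframe's column labels.
one_row = sample.loc["State Grid"]
print(type(one_row).__name__, one_row.dtype)

# List of row labels: a dataframe with the rows in the order requested.
two_rows = sample.loc[["Sinopec Group", "Walmart"]]
print(two_rows.index.tolist())

# Slices on both axes: end labels are included.
block = sample.loc["Walmart":"State Grid", "rank":"profits"]
print(block.shape)
```

Here `one_row` mixes an integer, a float, and a string, so it ends up with the ```object``` dtype described above.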

### Series vs Dataframes

<img src="_images/df_series_s_updated.svg" />
<img src="_images/df_series_df_updated.svg" />