# About This Notebook
In this **Introduction to pandas** chapter, we will learn:
- How pandas and NumPy combine to make working with data easier.
- About the two core pandas types: series and dataframes.
- How to select data from pandas objects using axis labels.
***
## 1. Understanding Pandas and NumPy

The pandas library provides solutions to a lot of problems. However, pandas is not so much a replacement for NumPy as an extension of it. Pandas uses the NumPy library extensively, and you will notice this more as you dig deeper into the library.

The primary data structure in pandas is called a **dataframe**. 
This is the pandas equivalent of a NumPy 2D ndarray but with some key differences:

> Axis values can have string **labels**, not just numeric ones. This means that the columns can now have their own meaningful names.

> Dataframes can contain columns with **multiple data types**, including ``integer``, ``float``, and ``string``. This enables us to store, for example, strings and integers in one dataframe.
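
To make these two differences concrete, here is a minimal sketch that compares a NumPy array with a small dataframe. The company names and figures are made up for illustration and are not taken from the Fortune data set.

```python
import numpy as np
import pandas as pd

# A NumPy 2D array holds a single dtype and is addressed by integer position.
arr = np.array([[1, 100.0], [2, 80.5]])

# A dataframe can label both axes and mix dtypes across columns.
# The company names and figures below are hypothetical.
df = pd.DataFrame(
    {"rank": [1, 2], "revenues": [100.0, 80.5], "country": ["USA", "China"]},
    index=["Alpha", "Beta"],
)

print(arr.dtype)                  # one dtype for the whole array
print(df.dtypes)                  # one dtype per column
print(df.loc["Beta", "country"])  # select by labels rather than positions
```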

## 2. Introduction to the Data
In this chapter, we will work with a data set from Fortune magazine's 2017 Global 500 list.

The data set is stored in a CSV file called **f500.csv**. Here is a data dictionary for some of the columns in the CSV:

- **company**: Name of the company.
- **rank**: Global 500 rank for the company.
- **revenues**: Company's total revenue for the fiscal year, in millions of dollars (USD).
- **revenue_change**: Percentage change in revenue between the current and prior fiscal year.
- **profits**: Net income for the fiscal year, in millions of dollars (USD).
- **ceo**: Company's Chief Executive Officer.
- **industry**: Industry in which the company operates.
- **sector**: Sector in which the company operates.
- **previous_rank**: Global 500 rank for the company for the prior year.
- **country**: Country in which the company is headquartered.

After getting to know our data set, how do we actually import the pandas library in Python?
To import pandas, we simply type in the following code:

````python
import pandas as pd
````

Pandas' dataframes have a `.shape` attribute which returns a tuple representing the dimensions of each axis of the object. Now we want to use this and Python's `type()` function to take a closer look at the `f500` dataframe.
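
As a small illustration of `type()` and `.shape`, the sketch below uses a hypothetical three-row dataframe named `toy` instead of `f500`, so the values are made up.

```python
import pandas as pd

# Hypothetical toy data: 3 rows (companies) and 2 columns.
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [100.0, 80.5, 60.2]},
    index=["Alpha", "Beta", "Gamma"],
)

toy_type = type(toy)   # the class of the object
toy_shape = toy.shape  # a tuple: (number of rows, number of columns)

print(toy_type)
print(toy_shape)
```

The first element of the shape tuple counts rows and the second counts columns.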

### Task 3.3.2:

1. Use Python's `type()` function to assign the type of `f500` to `f500_type`.
2. Use the `DataFrame.shape` attribute to assign the shape of `f500` to `f500_shape`.
3. Print both the `f500_type` and `f500_shape`.

In [0]:
import pandas as pd
f500 = pd.read_csv('../../../../Data/f500.csv',index_col=0)
f500.index.name = None
# Start your code below:


## 3. Introducing DataFrames

Remember how we spent so much time in the course "Be Around of Data Science" talking about rectangular data structures, and how we discussed flat tables consisting of rows (observations) and columns (features)? Now this will come in really handy!

I want to show you the `DataFrame.head` method. By default, it will return the first five rows of our dataframe. However, it also accepts an optional integer parameter, which specifies the number of rows:

In [0]:
f500.head(3)

There is also the `DataFrame.tail` method to show us the last rows of our dataframe:

In [0]:
f500.tail(3)
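
The default behaviour can be checked on a small example. This is a sketch using a hypothetical ten-row dataframe named `toy`; calling `head()` without an argument should return five rows.

```python
import pandas as pd

# Hypothetical toy dataframe with ten rows labeled 0 through 9.
toy = pd.DataFrame({"value": range(10)})

first_default = toy.head()  # no argument: the first five rows
last_two = toy.tail(2)      # explicit argument: the last two rows

print(first_default)
print(last_two)
```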

### Task 3.3.3:

1. Use the `head()` method to select the **first 6 rows**. Assign the result to `f500_head`.
2. Use the `tail()` method to select the **last 8 rows**. Assign the result to `f500_tail`.

In [0]:
# Start your code here:


## 4. Introducing DataFrames Continued

Now let's talk about the `DataFrame.dtypes` attribute. The `DataFrame.dtypes` attribute returns information about the types of each column.

In [0]:
print(f500.dtypes)

To see a comprehensive overview of all the dtypes used in our dataframe, as well as its shape and other information, we should use the `DataFrame.info()` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html). Remember that `DataFrame.info()` only prints the information, instead of returning it, so we can't assign it to a variable.
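
The point about printing versus returning can be verified with a short sketch. The dataframe `toy` below is hypothetical. Assigning the result of `info()` is allowed by Python, but the variable only ever holds `None`.

```python
import pandas as pd

# Hypothetical toy data with one integer and one string column.
toy = pd.DataFrame({"rank": [1, 2], "country": ["USA", "China"]})

# info() prints its summary as a side effect and returns None.
result = toy.info()
print(result)
```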

### Task 3.3.4:

1. Use the `DataFrame.info()` method to display information about the `f500` dataframe.

In [0]:
# Start your code below:


## 5. Selecting a Column From a DataFrame by Label (IMPORTANT)

Do you know that our axes in pandas all have labels and we can select data using just those labels? The **DataFrame.loc[]** attribute is exactly the syntax designed for this purpose:

````python
df.loc[row_label, column_label]
````

Pay close attention that we use brackets ``[]`` instead of parentheses ``()`` when selecting by location.

Now let's look at an example:

In [0]:
f500.loc[:,"rank"]

Notice we used `:` to specify that all rows should be selected. Also note that the result is a series with the same row labels as the original dataframe.

The following shortcut can also be used to select a single column:

In [0]:
rank_col = f500["rank"]
print(rank_col)
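
To confirm that the explicit syntax and the shortcut are interchangeable, here is a minimal sketch on a hypothetical dataframe named `toy`. Both expressions should produce the same series, which keeps the row labels of the dataframe.

```python
import pandas as pd

# Hypothetical toy data; names and values are made up.
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "Japan"]},
    index=["Alpha", "Beta", "Gamma"],
)

rank_explicit = toy.loc[:, "rank"]  # explicit label-based selection
rank_short = toy["rank"]            # common shorthand

print(type(rank_explicit))
print(rank_explicit.equals(rank_short))
print(list(rank_short.index))
```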

### Task 3.3.5:

1. Select the `industry` column. Assign the result to the variable name `industries`.
2. Use Python's `type()` function to assign the type of `industries` to `industries_type`.

In [0]:
# Start your code below:


## 6. Selecting Columns From a DataFrame by Label Continued
Below, we use a list of labels to select specific columns:

In [0]:
f500.loc[:,["country", "rank"]]

In [0]:
f500[["country","rank"]]

Both `f500.loc[:,["country", "rank"]]` and `f500[["country","rank"]]` return the same result.

Because the returned object is two-dimensional, we know it's a dataframe, not a series. So instead of `df.loc[:,["col1","col2"]]`, we can also use `df[["col1", "col2"]]` to select specific columns.

Last but not least, let's finish by using **a slice object with labels** to select specific columns:

In [0]:
f500.loc[:,"rank":"profits"]

The result is again a dataframe object with all of the columns from the first up until the last column in our slice. Unfortunately, there is no shortcut for selecting column slices.
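
One detail worth checking is that a label slice includes its end label, unlike positional slicing of a Python list. The sketch below uses a hypothetical dataframe `toy` whose columns are in the order `rank`, `revenues`, `profits`, `country`.

```python
import pandas as pd

# Hypothetical toy data; column order matters for label slices.
toy = pd.DataFrame(
    {
        "rank": [1, 2],
        "revenues": [100.0, 80.5],
        "profits": [10.0, 8.5],
        "country": ["USA", "China"],
    },
    index=["Alpha", "Beta"],
)

# The slice runs from "rank" up to and including "profits".
rank_to_profits = toy.loc[:, "rank":"profits"]

print(list(rank_to_profits.columns))
```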

See the table below for a short summary of techniques that we have just encountered:

|Select by Label|Explicit Syntax|Common Shorthand|
|--|--|--|
|Single column|df.loc[:,"col1"]|df["col1"]|
|List of columns|df.loc[:,["col1", "col7"]]|df[["col1", "col7"]]|
|Slice of columns|df.loc[:,"col1":"col4"]| |

### Task 3.3.6:

1. Select the `country` column. Assign the result to the variable name `countries`.
2. In order, select the `revenues` and `years_on_global_500_list` columns. Assign the result to the variable name `revenues_years`.
3. In order, select all columns from `ceo` up to and including `sector`. Assign the result to the variable name `ceo_to_sector`.

In [0]:
# Start your code below:


## 7. Selecting Rows From a DataFrame by Label

Now, let's learn how to use the labels of the index axis to select rows.

We can use the same syntax to select rows from a dataframe as we already have done for columns:

````python
df.loc[row_label, column_label]
````

#### To select a single row:

In [0]:
single_row = f500.loc["Sinopec Group"]
print(type(single_row))
print(single_row)

A **series** is returned because it is one-dimensional. 
> In short, ``series`` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

A single row can contain several data types, such as integer, float, and string values, so pandas stores this series with the **object** dtype, since none of the numeric types could cater for all values.
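
The dtype behaviour described above can be observed on a small example. This is a sketch with a hypothetical dataframe `toy` that mixes integer, float, and string columns. The selected row comes back as a series whose labels are the column names.

```python
import pandas as pd

# Hypothetical toy data mixing integer, float, and string columns.
toy = pd.DataFrame(
    {"rank": [1, 2], "revenues": [100.0, 80.5], "country": ["USA", "China"]},
    index=["Alpha", "Beta"],
)

row = toy.loc["Beta"]  # a single row label returns a series

print(type(row))
print(row.dtype)         # object, because the values have mixed types
print(list(row.index))   # the column names become the series labels
```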

**To select a list of rows:**

In [0]:
list_rows = f500.loc[["Toyota Motor", "Walmart"]]
print(type(list_rows))
print(list_rows)

**To select a slice object with labels:**

For selection using slices, a shortcut can be used, as shown below. This is the reason we can't use this shortcut for columns: it's reserved for use with rows.

In [0]:
slice_rows = f500["State Grid":"Toyota Motor"]
print(type(slice_rows))
print(slice_rows)
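
To check that the shortcut really selects rows, the sketch below compares it with the explicit `loc` form on a hypothetical dataframe `toy` with string row labels.

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "country": ["USA", "China", "Japan"]},
    index=["Alpha", "Beta", "Gamma"],
)

rows_explicit = toy.loc["Alpha":"Beta"]  # explicit row slice by label
rows_short = toy["Alpha":"Beta"]         # shorthand: a slice selects rows

print(rows_short)
print(rows_explicit.equals(rows_short))
```

As with column slices, the end label `Beta` is included in the result.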

### Task 3.3.7: 

By selecting data from `f500`:
1. Create a new variable `toyota`, with:
    - Just the row with index `Toyota Motor`.
    - All columns.
2. Create a new variable, `drink_companies`, with:
    - Rows with indices `Anheuser-Busch InBev`, `Coca-Cola`, and `Heineken Holding`, in that order.
    - All columns.
3. Create a new variable, `middle_companies` with:
    - All rows with indices from `Tata Motors` to `Nationwide`, inclusive.
    - All columns from `rank` to `country`, inclusive.

In [0]:
# Start your code below:


## 8. Value Counts Method

We understand that **series** and **dataframes** are two distinct objects. They each have their own unique methods. In this section we are going to look at an example of a method, which returns different results for each of the objects.

First, let's select just one column from the ``f500`` dataframe:

In [0]:
sectors = f500["sector"]
print(type(sectors))

Now, we want to substitute ``Series`` in `Series.value_counts()` with the name of our `sectors` series, like this:

In [0]:
sectors_value_counts = sectors.value_counts()
print(sectors_value_counts)

You see that each unique non-null value is being counted and listed in the output above.

Well, what happens when we use the `value_counts()` method with a dataframe? As a first step, we select the `sector` and `industry` columns to create a dataframe named `sectors_industries`, like this:

In [0]:
sectors_industries = f500[["sector", "industry"]]
print(type(sectors_industries))

Then, we'll use the `value_counts()` method:

In [0]:
si_value_counts = sectors_industries.value_counts()
print(si_value_counts)

We see that for a dataframe, the occurrences of unique combinations of values in the columns are counted. This goes to show that while the methods have the same name for both objects, the results might differ. If you are unsure what a method returns, be sure to look it up in the documentation.
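
The difference is easier to see on a handful of rows. The sketch below uses a hypothetical four-row dataframe `toy` and assumes a pandas version that provides `DataFrame.value_counts()` (1.1 or later).

```python
import pandas as pd

# Hypothetical toy data; the sector and industry values are illustrative.
toy = pd.DataFrame(
    {
        "sector": ["Energy", "Energy", "Financials", "Energy"],
        "industry": ["Petroleum Refining", "Utilities", "Banks", "Petroleum Refining"],
    }
)

# Series: counts each unique value in one column.
sector_counts = toy["sector"].value_counts()

# DataFrame: counts each unique combination of values across the columns.
combo_counts = toy.value_counts()

print(sector_counts)
print(combo_counts)
```

The dataframe result is indexed by pairs of values, one level per column, whereas the series result is indexed by single values.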

In general, it is not guaranteed that a method exists for both objects.
For example, the method ``unique()`` only exists for series objects.
If we try to use it with a dataframe, an `AttributeError` is raised.

In [0]:
sectors.unique()

In [0]:
sectors_industries.unique()
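
If we want the notebook to keep running past such a call, the error can be caught. This is a minimal sketch on a hypothetical dataframe `toy`. It records the error type instead of stopping execution.

```python
import pandas as pd

# Hypothetical toy data with two string columns.
toy = pd.DataFrame(
    {"sector": ["Energy", "Energy"], "industry": ["Utilities", "Banks"]}
)

# unique() works on a single column, which is a series.
unique_sectors = toy["sector"].unique()

# Calling unique() on the dataframe itself fails, so we catch the error.
try:
    toy.unique()
    error_name = None
except AttributeError as err:
    error_name = type(err).__name__

print(unique_sectors)
print(error_name)
```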

### Task 3.3.8:

1. Find the counts of each unique value in the `country` column in the `f500` dataframe.
    - Select the `country` column in the `f500` dataframe. Assign it to a variable named `countries`.
    - Use the `Series.value_counts()` method to return the value counts for `countries`. Assign the results to `country_counts`.

In [0]:
# Start your code below:


## Summary

Below is a summary table of all the different label selections we have learned so far:

|Select by Label|Explicit Syntax|Shorthand Convention|
|-------------|-----------|---------------|
|Single column from dataframe|df.loc[:,"col1"]|df["col1"]|
|List of columns from dataframe|df.loc[:,["col1","col7"]]|df[["col1","col7"]]|
|Slice of columns from dataframe|df.loc[:,"col1":"col4"]| |
|Single row from dataframe| df.loc["row4"]|   |
|List of rows from dataframe |df.loc[["row1","row8"]]|  |
|Slice of rows from dataframe|df.loc["row3":"row5"]| df["row3":"row5"]|
|Single item from series|s.loc["item8"]|s["item8"]|
|List of items from series|s.loc[["item1", "item7"]]|s[["item1","item7"]]|
|Slice of items from series|s.loc["item2":"item4"] |s["item2":"item4"]|
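
The series rows of the table have not been demonstrated in this chapter, so here is a short sketch of them. It uses a hypothetical series `s` with made-up labels and values.

```python
import pandas as pd

# Hypothetical series with string labels.
s = pd.Series([10, 20, 30, 40], index=["item1", "item2", "item3", "item4"])

single_item = s.loc["item2"]           # single item: returns the value itself
item_list = s.loc[["item1", "item3"]]  # list of labels: returns a series
item_slice = s.loc["item2":"item4"]    # label slice: the end label is included

# The shorthand forms give the same results as the explicit ones.
print(single_item == s["item2"])
print(item_list.equals(s[["item1", "item3"]]))
print(item_slice.equals(s["item2":"item4"]))
```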