# <p style="background-color: #f5df18; padding: 10px;">Programming & Plotting in Python | **Pandas DataFrames** </p>




### <strong>Instructor: <span style="color: darkblue;">Name (Affiliation)</span></strong>

Estimated completion time: 🕚 20 minutes


<div style="display: flex;">
    <div style="flex: 1; margin-right: 20px;">
        <h2>Questions</h2>
        <ul>
            <li>How can I do statistical analysis of tabular data?</li>
        </ul>
    </div>
    <div style="flex: 1;">
        <h2>Learning Objectives</h2>
        <ul>
            <li>Select individual values from a Pandas dataframe.</li>
            <li>Select entire rows or entire columns from a dataframe.</li>
            <li>Select a subset of both rows and columns from a dataframe in a single operation.</li>
            <li>Select a subset of a dataframe by a single Boolean criterion.</li>
        </ul>
    </div>
</div>

## Note about Pandas DataFrames/Series
---

A [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is a collection of [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html).
The DataFrame is the way Pandas represents a table, and the Series is the data structure
Pandas uses to represent a column.
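
As a minimal sketch of this relationship, the snippet below builds a small hand-made table and pulls one column out of it. The table is a toy stand-in for the Gapminder file, and its numbers are illustrative assumptions rather than real data.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

column = data["gdpPercap_1952"]          # selecting one column gives a Series
print(type(data).__name__)               # DataFrame
print(type(column).__name__)             # Series
print(column.index.equals(data.index))   # the Series shares the table's row index
```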

Pandas is built on top of the [NumPy](https://www.numpy.org/) library, which in practice means that
most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.
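
A short sketch of what this means in practice, assuming the same kind of toy table (illustrative numbers, not real Gapminder values): a NumPy function can be applied directly to a DataFrame, and the row and column labels are kept.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

logged = np.log10(data)          # element-wise NumPy function; labels are preserved
print(type(logged).__name__)     # DataFrame
print(data.to_numpy().shape)     # (4, 2): the underlying values as a plain NumPy array
```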

What makes Pandas so attractive is its powerful interface for accessing individual records
of the table, its proper handling of missing values, and its relational-database operations
between DataFrames.


## Selecting values
---

To access the value at position `[i, j]` of a DataFrame, we have two options, depending on
what `i` and `j` refer to.
Remember that a DataFrame provides an *index* as a way to identify the rows of the table;
a row, then, has a *position* inside the table as well as a *label*, which
uniquely identifies its *entry* in the DataFrame.

## Use `DataFrame.iloc[..., ...]` to select values by their (entry) position
---

- Can specify location by numerical index, analogous to a 2D version of character selection in strings.
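
A minimal sketch of positional selection, using a toy table whose numbers are illustrative assumptions rather than the real Gapminder values:

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

print(data.iloc[0, 0])     # first row, first column: 1600.0
print(data.iloc[-1, -1])   # negative positions count from the end, as with strings: 6600.0
```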

## Use `DataFrame.loc[..., ...]` to select values by their (entry) label.
---

- Can specify location by row and/or column name.
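
The same value can be reached by label instead of position. The sketch below again assumes a toy table with illustrative numbers:

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

by_label = data.loc["Albania", "gdpPercap_1952"]   # row label, column label
by_position = data.iloc[0, 0]                      # same cell, addressed by position
print(by_label)                                    # 1600.0
print(by_label == by_position)                     # True
```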

## Use `:` on its own to mean all columns or all rows.

- Just like Python's usual slicing notation.

- `data.loc["Albania", :]` selects every column of the row labelled `"Albania"`; `data.loc["Albania"]` (without a second index) gives the same result.

- `data.loc[:, "gdpPercap_1952"]` selects every row of that column; `data["gdpPercap_1952"]` gives the same result.
- `data.gdpPercap_1952` also gives the same result (not recommended, because it is easily confused with `.` notation for methods).
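
The equivalences in the bullets above can be checked with a short sketch. The table is a toy stand-in with illustrative numbers, not the real Gapminder file.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

row = data.loc["Albania", :]            # all columns of one row
row_short = data.loc["Albania"]         # same thing without the second index
col = data.loc[:, "gdpPercap_1952"]     # all rows of one column
col_short = data["gdpPercap_1952"]      # same thing using column indexing

print(row.equals(row_short))            # True
print(col.equals(col_short))            # True
```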

## Select multiple columns or rows using `DataFrame.loc` and a named slice.
---

Note that **slicing using `loc` is inclusive at both
ends**, which differs from **slicing using `iloc`**, where a slice includes
everything up to but not including the final index.
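
A small sketch contrasting the two kinds of slice, assuming a toy table with illustrative numbers:

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

# Named slice: both "Austria" and "gdpPercap_1957" are included.
by_label = data.loc["Albania":"Austria", "gdpPercap_1952":"gdpPercap_1957"]
# Positional slice: position 1 is the stop value and is left out.
by_position = data.iloc[0:1, 0:1]

print(by_label.shape)      # (2, 2)
print(by_position.shape)   # (1, 1)
```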


## Result of slicing can be used in further operations.
---

- Usually don't just print a slice.
- All the statistical operators that work on entire dataframes
  work the same way on slices.
- E.g., calculate max of a slice.
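
For example, a slice can be passed straight to `max` or `min`. This is a sketch on a toy table whose numbers are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

subset = data.loc["Albania":"Austria", "gdpPercap_1952":"gdpPercap_1957"]
slice_max = subset.max()   # one maximum per column of the slice
slice_min = subset.min()   # one minimum per column of the slice
print(slice_max)
print(slice_min)
```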


## Use comparisons to select data based on value.

- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of `True` and `False`.
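
A minimal sketch of an element-wise comparison. The table holds illustrative numbers, and the threshold of 5000 is an arbitrary value chosen for the example.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

subset = data.loc[:, "gdpPercap_1952":"gdpPercap_1957"]
is_large = subset > 5000          # compares every element with 5000
print(is_large)                   # same shape and labels, values are True/False
```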

## Select values or NaN using a Boolean mask.
---

- A frame full of Booleans is sometimes called a *mask* because of how it can be used.

- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
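
The sketch below applies a mask to a toy table (illustrative numbers, arbitrary threshold of 5000) and shows that the NaN entries are skipped when computing a mean:

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

mask = data > 5000
masked = data[mask]                        # values where mask is True, NaN elsewhere
print(masked)
print(masked["gdpPercap_1952"].mean())     # mean of the three non-NaN values only
print(data["gdpPercap_1952"].mean())       # mean of all four values, for comparison
```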

## Group By: split-apply-combine
---

Pandas' vectorized methods and grouping operations give users
a great deal of flexibility in analysing their data.

For instance, let's say we want a clearer view of how the European countries
split themselves according to their GDP.

1. We may take a first look by splitting the countries into two groups for each year surveyed:
  those with a GDP *higher* than the European average and those with a *lower* GDP.
2. We then estimate a *wealth score* based on the historical (from 1952 to 2007) values,
  counting how many times a country has appeared in the *lower* or *higher* GDP group.
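
The two steps above could be sketched as follows. The table is a toy stand-in with illustrative numbers and only three years, and the names `mask_higher` and `wealth_score` are chosen for this example. The score here is taken to be the fraction of surveyed years in which a country was above that year's average, which is one possible way to do the counting.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

# Step 1: True where a country is above that year's average, False otherwise.
mask_higher = data > data.mean()

# Step 2: fraction of years in which each country was in the "higher" group.
wealth_score = mask_higher.sum(axis=1) / len(data.columns)
print(mask_higher)
print(wealth_score)
```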


Finally, for each group in the `wealth_score` table, we sum their (financial) contribution
across the years surveyed using chained methods:
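
A minimal sketch of this last step on a toy table (illustrative numbers, three years only). The score is recomputed here so that the snippet runs on its own. The chain `groupby(...).sum()` splits the rows by score, sums each group, and combines the results.

```python
import pandas as pd

# Toy stand-in for the Gapminder table; the numbers are illustrative only.
data = pd.DataFrame(
    {
        "gdpPercap_1952": [1600.0, 6100.0, 8300.0, 5400.0],
        "gdpPercap_1957": [1900.0, 8800.0, 9700.0, 5600.0],
        "gdpPercap_1962": [2300.0, 10800.0, 11000.0, 6600.0],
    },
    index=pd.Index(["Albania", "Austria", "Belgium", "Ireland"], name="country"),
)

# Fraction of years in which each country was above that year's average.
wealth_score = (data > data.mean()).sum(axis=1) / len(data.columns)

# Split by score, sum within each group, combine into one table.
totals = data.groupby(wealth_score).sum()
print(totals)
```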

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Selection of Individual Values </p>

---

Assume Pandas has been imported into your notebook
and the Gapminder GDP data for Europe has been loaded:

```python
import pandas as pd

data_europe = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
```

Write an expression to find the Per Capita GDP of Serbia in 2007.

### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Extent of Slicing </p>

---

1. Do the two statements below produce the same output?
2. Based on this,
  what rule governs what is included (or not) in numerical slices and named slices in Pandas?

```python
print(data_europe.iloc[0:2, 0:2])
print(data_europe.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
```

### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Reconstructing Data </p>
---

Explain what each line in the following short program does:
what is in `first`, `second`, etc.?

```python
first = pd.read_csv('data/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')
```

### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Selecting Indices </p>
---

Explain in simple terms what `idxmin` and `idxmax` do in the short program below.
When would you use these methods?

```python
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())
```

### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Practice with Selection </p>
---

Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded.
Write an expression to select each of the following:

1. GDP per capita for all countries in 1982.
2. GDP per capita for Denmark for all years.
3. GDP per capita for all countries for years *after* 1985.
4. GDP per capita for each country in 2007 as a multiple of
  GDP per capita for that country in 1952.


### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Many Ways of Access </p>
---

There are at least two ways of accessing a value or slice of a DataFrame: by name or index.
However, there are many others. For example, a single column or row can be accessed either as a `DataFrame`
or a `Series` object.

Suggest different ways of doing the following operations on a DataFrame:

1. Access a single column
2. Access a single row
3. Access an individual DataFrame element
4. Access several columns
5. Access several rows
6. Access a subset of specific rows and columns
7. Access a subset of row and column ranges


### your answer here ####

## <p style="background-color: #f5df18; padding: 10px;"> 🛑 Exploring available methods using the `dir()` function </p>
---

Python includes a `dir()` function that can be used to display all of the available methods (functions) that are built into a data object.  In Episode 4, we used some methods with a string. But we can see many more are available by using `dir()`:

```python
my_string = 'Hello world!'   # creation of a string object 
dir(my_string)
```

This command returns:

```python
['__add__',
...
'__subclasshook__',
'capitalize',
'casefold',
'center',
...
'upper',
'zfill']
```

You can use `help()` or <kbd>Shift</kbd>\+<kbd>Tab</kbd> to get more information about what these methods do.

Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded as `data`.  Then, use `dir()`
to find the function that prints out the median per-capita GDP across all European countries for each year that information is available.

### your answer here ####

# <p style="background-color: #f5df18; padding: 10px;"> 🗝️ Key points</p>
---

- Use `DataFrame.iloc[..., ...]` to select values by integer location.
- Use `:` on its own to mean all columns or all rows.
- Select multiple columns or rows using `DataFrame.loc` and a named slice.
- Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Select values or NaN using a Boolean mask.