![ine-divider](https://user-images.githubusercontent.com/7065401/92672068-398e8080-f2ee-11ea-82d6-ad53f7feb5c0.png)
<hr>

# Understanding Pandas DataFrames
<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png" style="width:200px; float: right; margin: 0 40px 40px 40px;"/>

The Pandas (Panel Data) Python library is a very powerful tool for data manipulation and analysis.  We will talk about it throughout several lessons of this bootcamp, and even assume familiarity with Pandas in later lessons.

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

The data objects in Pandas you will work with most are `DataFrame`s.  As we mentioned in the last lesson, a DataFrame is, in essence, a number of Series held in a common object.  However, by collecting several Series together, we get far more powerful and versatile capabilities than merely working with them separately.

To start, we import Pandas; by convention it is given the two-letter name `pd` within Python programs.  We also import the library NumPy, using the conventional short name `np`.  This bootcamp will not discuss NumPy specifically, but Pandas' Series are built on top of NumPy `ndarray`s, and occasionally we want to use capabilities of their underlying arrays.

In [None]:
# Import the pandas package under the name pd
import pandas as pd
# Import NumPy under its conventional name np
import numpy as np

# Print the pandas version
print(pd.__version__)

We continue to analyze G7 countries for these DataFrame examples. A DataFrame looks like a table (this data is also in a spreadsheet [at this link](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing)):

| Country | Population |      GDP |    Area |   HDI | Continent |
|:--------|-----------:|---------:|--------:|------:|:----------|
| Canada  |     35.467 |  1785387 | 9984670 | 0.913 | America   |
| France  |     63.951 |  2833687 |  640679 | 0.888 | Europe    |
| Germany |     80.94  |  3874437 |  357114 | 0.916 | Europe    |
| Italy   |     60.665 |  2167744 |  301336 | 0.873 | Europe    |
| Japan   |    127.061 |  4602367 |  377930 | 0.891 | Asia      |
| UK      |     64.511 |  2950039 |  242495 | 0.907 | Europe    |
| US      |    318.523 | 17348075 | 9525067 | 0.915 | America   |

Creating a `DataFrame` manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file, other data file formats, or the web. However, you *can* create a DataFrame by specifying the columns and values (several forms of arguments to the constructor are available):

In [None]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

_(The `columns` argument is optional. I'm using it to keep the same column order as in the table above.)_
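As a minimal sketch of two of the other constructor forms, the same kind of table can be built from a list of dictionaries (one per row) or from a list of rows plus explicit column names. The two-country subset and the variable names here are illustrative only:

```python
import pandas as pd

# Form 1: a list of dictionaries, one dictionary per row
records = [
    {'Country': 'Canada', 'Population': 35.467, 'Continent': 'America'},
    {'Country': 'France', 'Population': 63.951, 'Continent': 'Europe'},
]
from_records = pd.DataFrame(records)

# Form 2: a list of rows, with the column names given separately
rows = [
    ['Canada', 35.467, 'America'],
    ['France', 63.951, 'Europe'],
]
from_rows = pd.DataFrame(rows, columns=['Country', 'Population', 'Continent'])

# Both forms describe the same table
same_table = from_records.equals(from_rows)
```

Which form is most convenient usually depends on the shape your raw data already has.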

In [None]:
df

`DataFrame`s also have indexes. As you can see in the "table" above, pandas has assigned a numeric and autoincremented index automatically to each *row* in our DataFrame. In our case, we know that each row represents a country (and which one), so we can set a more useful index:

In [None]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States']
df.index.name = "Country"
df

We can inquire about numerous aspects of a DataFrame.

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.info()

In [None]:
df.size

In [None]:
df.shape

The `.describe()` method is very useful to provide summary statistical information about all the columns.

In [None]:
df.describe()

Each column, being a Series, has one consistent datatype; however, the various columns can have different datatypes from each other.

In [None]:
df.dtypes

The `.dtypes` attribute is itself a Pandas Series, which means it has a method called `.value_counts()` that counts how many times each value (here, each datatype) occurs.

In [None]:
df.dtypes.value_counts()

## Indexing, Selection and Slicing

We can select rows, columns, or both from a DataFrame. Here is the full DataFrame again for reference:

In [None]:
df

We have seen the `.loc` accessor already in discussing Series.  It also works for DataFrames.  But since the same index is shared by all the columns, specifying one index label returns a whole row as a Series, whose own index holds the column names.

In [None]:
df.loc['Canada']

The `.iloc` accessor is similar, but selects by integer position rather than by label.

In [None]:
df.iloc[3]

Whereas a Series was similar to a dictionary mapping **index** names to data values, a DataFrame is similar to a dictionary mapping **column** names to an entire column.  What is returned is, again, a Series, in this case a column.

In [None]:
df['Population']

In many cases, you can access a column using "dotted attribute" notation.  This is nicer to look at, but will not work either where a column name is not a valid Python identifier (e.g. it contains spaces or punctuation) or where the name is the same as a reserved method or attribute.

In [None]:
df.GDP

Some data scientists or developers avoid the attribute style because of the problems that sometimes occur (for example, you could not show "Surface Area" this way).  This bootcamp will sometimes use both styles.
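A small sketch of where the two styles diverge. The tiny DataFrame below is made up for illustration, and its `size` column is hypothetical; it is there only to collide with the existing `DataFrame.size` attribute:

```python
import pandas as pd

small = pd.DataFrame(
    {
        'GDP': [1785387, 2833687],
        'Surface Area': [9984670, 640679],
        'size': [1, 2],  # hypothetical column whose name clashes with an attribute
    },
    index=['Canada', 'France'],
)

# For a simple name, both styles return the same Series
same_column = small.GDP.equals(small['GDP'])

# A name containing a space only works with bracket indexing
area = small['Surface Area']

# The attribute wins over the column: this is the number of cells, not the column
n_cells = small.size

# Bracket indexing still reaches the column
size_column = small['size']
```

Bracket indexing works for every column name, which is one reason some people use it exclusively.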

Note that the `index` of the returned Series is the same as that of the DataFrame, and its `name` is the name of the column. If you're working on a DataFrame and want to get back a DataFrame even where an operation would return a Series, use the `.to_frame()` method:

In [None]:
df.Population.to_frame()

Multiple columns can also be selected by passing a list of column names, much as a list of labels selects several items from a Series:

In [None]:
df[['Population', 'GDP']]

The result is another `DataFrame`. Slicing works differently from indexing a DataFrame with a list; it acts at "row level", and can be counterintuitive:

In [None]:
df[1:3]

Because of the oddities of direct "dictionary-style" indexing, we **strongly recommend** you always use the `.loc` and `.iloc` accessors to clarify your intention.  These are extremely flexible and powerful.

In [None]:
df.loc['Italy']

In [None]:
df.loc['France': 'Italy']

In [None]:
df.loc[['France', 'Italy']]

Using `.iloc` is similar, but with index positions.  You can use individual numbers, slices, or lists of numbers (also predicates/Boolean arrays, which we see later).

In [None]:
df.iloc[[2, 4]]

In [None]:
df.iloc[2:4]
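One difference between the two accessors is easy to miss: a label slice with `.loc` includes its end label, while a position slice with `.iloc` excludes its end position, like ordinary Python slicing. A minimal sketch, using a made-up four-row subset of the data:

```python
import pandas as pd

g4 = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665]},
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# Label slice: both endpoints are included (France, Germany, Italy)
by_label = g4.loc['France':'Italy']

# Position slice: the end position is excluded (France, Germany)
by_position = g4.iloc[1:3]
```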

### More powerful indexing

The real power of DataFrame indexing comes when we use a second "argument".  This lets us choose both rows and columns of interest.

In [None]:
df.loc['France': 'Italy', 'Population'].to_frame()

In [None]:
df.loc['France': 'Italy', ['Population', 'GDP']]

In [None]:
df.iloc[0]

In [None]:
df.iloc[3:6, :3]

In [None]:
df.iloc[1:3, 3]

In [None]:
df.iloc[1:3, [0, 3]]

In [None]:
df.iloc[3, 1:5]

You should have noticed a pattern here.  When there is only a single sequence of data values, Pandas returns a Series rather than a DataFrame (whether it is a row or a column depends on the operation).  If two dimensions are needed, it returns a DataFrame.  We can manually "promote" a Series to a DataFrame using `.to_frame()`.
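The pattern can be checked directly by looking at the type of each result. This sketch uses a made-up three-row subset; note that wrapping a single column name in a list is enough to keep the result two-dimensional:

```python
import pandas as pd

g3 = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94], 'HDI': [0.913, 0.888, 0.916]},
    index=['Canada', 'France', 'Germany'],
)

one_row = g3.loc['Canada']                        # a single row: Series
one_column = g3.loc['Canada':'France', 'HDI']     # a single column: Series
still_2d = g3.loc['Canada':'France', ['HDI']]     # list of columns: DataFrame
promoted = one_column.to_frame()                  # Series promoted to DataFrame

kinds = [type(obj).__name__ for obj in (one_row, one_column, still_2d, promoted)]
```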

## Conditional selection (boolean arrays)

We have seen conditional selection applied to Series and it works in the same way for DataFrames. After all, a DataFrame is a collection of Series.

In [None]:
df

In [None]:
df['Population'] > 70

In [None]:
df.loc[df['Population'] > 70]

The Boolean Series is matched to the DataFrame by index label, so any Boolean Series can act as a row filter as long as its index matches the DataFrame's index. Column selection still works as expected:

In [None]:
df.loc[df['Population'] > 70, 'Population']

In [None]:
df.loc[df['Population'] > 70, ['Population', 'GDP']]
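Several conditions can be combined into one mask with the element-wise operators `&` (and), `|` (or) and `~` (not). A minimal sketch on a made-up four-row subset; when the comparisons are written inline, each one needs its own parentheses because of Python's operator precedence:

```python
import pandas as pd

g7_subset = pd.DataFrame(
    {
        'Population': [35.467, 80.94, 127.061, 318.523],
        'Continent': ['America', 'Europe', 'Asia', 'America'],
    },
    index=['Canada', 'Germany', 'Japan', 'United States'],
)

large = g7_subset['Population'] > 70
american = g7_subset['Continent'] == 'America'

# Both conditions must hold
large_american = g7_subset.loc[large & american]

# Either condition may hold
large_or_american = g7_subset.loc[large | american]

# First condition holds, second does not
large_not_american = g7_subset.loc[large & ~american]

# Inline form: parentheses around each comparison are required
inline = g7_subset.loc[(g7_subset['Population'] > 70) & (g7_subset['Continent'] == 'America')]
```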

## Dropping stuff

In contrast to selecting things, we can also *drop* things.

In [None]:
df.drop('Canada')

In [None]:
df.drop(['Canada', 'Japan', 'United Kingdom'])

In [None]:
df.drop(columns=['Population', 'HDI'])

In [None]:
df.drop(['Italy', 'Canada'], axis=0)

In [None]:
df.drop(['Population', 'HDI'], axis=1)

In [None]:
df.drop(['Population', 'HDI'], axis='columns')

In [None]:
df.drop(['Canada', 'Germany'], axis='rows')

All these `drop` calls return a new `DataFrame`. If you'd like to modify it "in place", you can use the `inplace` parameter (there's an example below).  You are likely to see both the `inplace` modification and assigning back to the same variable name in code you read; it is a stylistic difference among different programmers.
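A short sketch of the two styles side by side, on a made-up two-row DataFrame. The point to notice is that `drop` with `inplace=True` returns `None`, so its result should not be assigned to anything:

```python
import pandas as pd

small = pd.DataFrame(
    {'Population': [35.467, 63.951], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Style 1: drop returns a new DataFrame; the original is untouched
without_hdi = small.drop(columns='HDI')
columns_after_plain_drop = list(small.columns)

# Style 2: inplace=True modifies the original and returns None
returned = small.drop(columns='HDI', inplace=True)
columns_after_inplace_drop = list(small.columns)
```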

## Operations

In [None]:
df[['Population', 'GDP']]

I am not certain what year this data is from nor what the currency unit is.  But we are certainly measuring both population and GDP in **millions** of something (the US and Canadian dollars, the Euro, and the British Pound are all relatively close in value; within a factor of two or so, depending on the date).

In [None]:
# Show the actual number rather than millions
df[['Population', 'GDP']] * 1_000_000

### Broadcasting

With Series, and just above with a DataFrame, we saw examples of "broadcasting" a scalar.  For example, we multiplied all the population numbers and all the GDP numbers by a million at the same time.  We can also *broadcast* a Series to a DataFrame, or one DataFrame to another.  This can be powerful, but can require some thought.

As an example, if any of these countries came to have a trillion lower GDP (remember the DataFrame already stores millions) or an HDI (Human Development Index) that was lower by 0.3, that would be an economic and social crisis.  Let us see what numbers might meet that.

In [None]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

In [None]:
df[['GDP', 'HDI']]

In [None]:
df[['GDP', 'HDI']] + crisis
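The part that requires some thought is alignment: when a Series is added to a DataFrame, the Series index is matched against the DataFrame's column labels, not against positions. A minimal sketch on a made-up two-row subset; the `Area` label below is deliberately one the DataFrame lacks:

```python
import pandas as pd

gdp_hdi = pd.DataFrame(
    {'GDP': [1785387, 2833687], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# The order of the labels does not matter; each value finds its column by name
crisis_reordered = pd.Series([-0.3, -1_000_000], index=['HDI', 'GDP'])
shifted = gdp_hdi + crisis_reordered

# A label with no matching column cannot be combined with anything,
# so every cell of the result is NaN
unmatched = pd.Series([1.0], index=['Area'])
all_missing = gdp_hdi + unmatched
```

If a result unexpectedly fills up with NaN, mismatched labels are a likely cause.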

## Modifying DataFrames

You can add columns to a DataFrame, or replace the Series held in existing columns.

### Adding a new column

While this is a huge simplification of the languages spoken in various countries, let us stipulate that some (but not all) countries have just one predominant or official language.  By stipulation, pretend it is these:

In [None]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)
langs

We can add a column, and it will default to being "missing" where we do not have a value available.  The special value *NaN* (not a number) is used to indicate these missing values.

In [None]:
df['Language'] = langs
df

### Replacing values per column

In [None]:
df['Language'] = 'English'
df

Both of those versions are rather dramatically wrong, however, in terms of actual "most common" language.  We can use filtering to come up with something better.

In [None]:
df['Language'] = langs
df

In [None]:
df.loc[df.Language.isnull(), 'Language'] = "English"
df.loc['Japan', 'Language'] = "Japanese"
df

### Renaming Columns

Pandas is pretty clever in ignoring the information that is not relevant within many of its commands.  We can rename both some columns and some index names, but in some cases the "original" value was not present to start with.  Pandas still "does the right thing" despite that deficit.

In [None]:
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Annual Popcorn Consumption': 'APC'}, 
    index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

We can rename in a programmatic way.  For example, perhaps we wish to uppercase all the country names in the index:

In [None]:
# Notice there is no inplace modification here
df.rename(index=str.upper)

Or lowercase them.

In [None]:
def lower(s):
    return s.lower()

df.rename(index=lower)

### Dropping columns

In [None]:
df.drop(columns='Language', inplace=True)
df

### Adding values

Note that below we add a row with only some column values given. The rest get filled in with the "missing" sentinel, NaN.

In [None]:
# Some rough guesses for an unknown year and currency
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
new_row = pd.Series({
        'Population': 1400,
        'GDP': 10_000_000
    },
    name='China')
pd.concat([df, new_row.to_frame().T])

`pd.concat` returns a new DataFrame; it did not modify the existing one.

In [None]:
df

You can also assign a new row directly by label with `.loc`; this does modify the `DataFrame` in place:

In [None]:
df.loc['China'] = pd.Series({'Population': 1_400, 'Continent': 'Asia'})
df

We can use `drop` to just remove a row by index:

In [None]:
df.drop('China', inplace=True)
df

### Changing the index

In [None]:
df.reset_index(inplace=True)
df

In [None]:
df.set_index('Population')

In [None]:
df = df.set_index('Country')
df
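What these calls do, sketched on a made-up two-row DataFrame: `reset_index` moves the index into an ordinary column and replaces it with default integer positions, and `set_index` does the reverse by promoting a column to be the index.

```python
import pandas as pd

small = pd.DataFrame(
    {'Population': [35.467, 63.951]},
    index=pd.Index(['Canada', 'France'], name='Country'),
)

# The named index becomes a regular column; rows are now numbered 0, 1
flat = small.reset_index()

# Promoting the column again restores the original layout
restored = flat.set_index('Country')
```

Both calls return a new DataFrame unless `inplace=True` is passed, which is why the last cell above assigns the result back to `df`.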

## Creating columns from other columns

Altering a DataFrame often involves combining different columns. For example, in our countries analysis, we could calculate the "GDP per capita", which is just `GDP / Population`.

In [None]:
df[['Population', 'GDP']]

The regular pandas way of expressing that is simply to divide one Series by the other:

In [None]:
df.GDP / df.Population

The result of that operation is just another Series, which you can add to the original `DataFrame` as a new column:

In [None]:
df['GDP Per Capita'] = df['GDP'] / df['Population']
df

## Statistical info

You've already seen the `describe` method, which gives you a good "summary" of the `DataFrame`. Let's explore other methods in more detail:

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
population = df['Population']

In [None]:
population.min(), population.max()

In [None]:
population.sum()

In [None]:
population.sum() / len(population)

In [None]:
population.mean()

In [None]:
population.std()

In [None]:
population.median()

In [None]:
population.describe()

In [None]:
population.quantile(.25)

In [None]:
population.quantile([.2, .4, .6, .8, 1])

# Exercises

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## DataFrame creation

### Create an empty pandas DataFrame


In [None]:
# your code goes here

### Create a `marvel_df` pandas DataFrame with the given marvel data

<img width=400 src="https://cdn.dribbble.com/users/4678/screenshots/1986600/avengers.png"></img>

In [None]:
marvel_data = [
    ['Spider-Man', 'male', 1962],
    ['Captain America', 'male', 1941],
    ['Wolverine', 'male', 1974],
    ['Iron Man', 'male', 1963],
    ['Thor', 'male', 1963],
    ['Thing', 'male', 1961],
    ['Mister Fantastic', 'male', 1961],
    ['Hulk', 'male', 1962],
    ['Beast', 'male', 1963],
    ['Invisible Woman', 'female', 1961],
    ['Storm', 'female', 1975],
    ['Namor', 'male', 1939],
    ['Hawkeye', 'male', 1964],
    ['Daredevil', 'male', 1964],
    ['Doctor Strange', 'male', 1963],
    ['Hank Pym', 'male', 1962],
    ['Scarlet Witch', 'female', 1964],
    ['Wasp', 'female', 1963],
    ['Black Widow', 'female', 1964],
    ['Vision', 'male', 1968]
]

In [None]:
# your code goes here
marvel_df = pd.DataFrame()
marvel_df

### Add column names to the `marvel_df`

In [None]:
# your code goes here

### Add index names to the `marvel_df` (use the character name as index)

In [None]:
# your code goes here

### Drop the name column as it's now the index

In [None]:
# your code goes here

### Drop 'Namor' and 'Hank Pym' rows

In [None]:
# your code goes here

## DataFrame selection, slicing and indexation

### Show the first 5 elements of `marvel_df`
 

In [None]:
# your code goes here

### Show the last 5 elements of `marvel_df`

In [None]:
# your code goes here

### Show the first 6 characters by order of first appearance

In [None]:
# your code goes here

### Show just the sex of the first 5 elements of `marvel_df`

In [None]:
# your code goes here

### Show the `first_appearance` of all middle elements of `marvel_df`

In [None]:
# your code goes here

### Show the first and last elements of `marvel_df`

In [None]:
# your code goes here

## DataFrame manipulation and operations

### Modify the `first_appearance` of 'Vision' to year 1964

In [None]:
# your code goes here

### Add a new column to `marvel_df` called 'years_since' with the years since `first_appearance`


In [None]:
# your code goes here

## DataFrame boolean arrays (also called masks)

### Given the `marvel_df` pandas DataFrame, make a mask showing the female characters


In [None]:
# your code goes here

### Given the `marvel_df` pandas DataFrame, get the male characters

In [None]:
# your code goes here

### Given the `marvel_df` pandas DataFrame, get the characters with `first_appearance` after 1970

In [None]:
# your code goes here


### Given the `marvel_df` pandas DataFrame, get the female characters with `first_appearance` after 1970

In [None]:
# your code goes here

## DataFrame summary statistics

### Show basic statistics of `marvel_df`

In [None]:
# your code goes here

### Given the `marvel_df` pandas DataFrame, show the mean value of `first_appearance`

In [None]:
# your code goes here

### Given the `marvel_df` pandas DataFrame, show the min value of `first_appearance`

In [None]:
# your code goes here

### Given the `marvel_df` pandas DataFrame, get the characters with the min value of `first_appearance`

In [None]:
# your code goes here