![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/75165824-badf4680-5701-11ea-9c5b-5475b0a33abf.png"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Pandas - `DataFrame`s

Probably the most important data structure of pandas is the `DataFrame`. It's a tabular structure tightly integrated with `Series`.


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

## Hands on!

In [34]:
import numpy as np
import pandas as pd

We'll continue our analysis of G7 countries, now looking at DataFrames. As mentioned, a DataFrame looks a lot like a table (like the one you can see [here](https://docs.google.com/spreadsheets/d/1IlorV2-Oh9Da1JAZ7weVw86PQrQydSMp-ydVMH135iI/edit?usp=sharing)):

<img width="700" src="https://user-images.githubusercontent.com/872296/38153492-72c032ca-3443-11e8-80f4-9de9060a5127.png" />

Creating `DataFrame`s manually can be tedious. 99% of the time you'll be pulling the data from a database, a CSV file or the web. Still, you can create a DataFrame by specifying the columns and values:

In [35]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

_(The `columns` parameter is optional. I'm using it to keep the same order as in the picture above.)_
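
Since most real data arrives from files rather than hand-written dictionaries, here is a minimal sketch of loading a similar table from CSV text with `pd.read_csv`. The CSV content and the names `csv_text` and `g7_sample` are made up for illustration; with a real file you would pass its path instead of the `io.StringIO` wrapper.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a real file on disk.
csv_text = """Country,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
"""

# read_csv accepts any file-like object; index_col picks the row labels.
g7_sample = pd.read_csv(io.StringIO(csv_text), index_col='Country')
print(g7_sample)
```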

In [36]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
0,35.467,1785387,9984670,0.913,America
1,63.951,2833687,640679,0.888,Europe
2,80.94,3874437,357114,0.916,Europe
3,60.665,2167744,301336,0.873,Europe
4,127.061,4602367,377930,0.891,Asia
5,64.511,2950039,242495,0.907,Europe
6,318.523,17348075,9525067,0.915,America


`DataFrame`s also have indexes. As you can see in the "table" above, pandas has automatically assigned a numeric, auto-incrementing index to each "row" in our DataFrame. In our case, we know that each row represents a country, so we'll just reassign the index:

In [37]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [38]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [9]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

In [10]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [12]:
df.size

35

In [13]:
df.shape

(7, 5)

In [14]:
df.describe()

Unnamed: 0,Population,GDP,Surface Area,HDI
count,7.0,7.0,7.0,7.0
mean,107.302571,5080248.0,3061327.0,0.900429
std,97.24997,5494020.0,4576187.0,0.016592
min,35.467,1785387.0,242495.0,0.873
25%,62.308,2500716.0,329225.0,0.8895
50%,64.511,2950039.0,377930.0,0.907
75%,104.0005,4238402.0,5082873.0,0.914
max,318.523,17348080.0,9984670.0,0.916


In [15]:
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [16]:
df.dtypes.value_counts()

float64    2
int64      2
object     1
Name: count, dtype: int64
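
Once you know the dtypes, you can also select columns by type. The following is a small sketch using `select_dtypes` on a made-up two-row frame (`sample`, `numeric` and `non_numeric` are illustrative names, not part of the analysis above):

```python
import pandas as pd

sample = pd.DataFrame({
    'Population': [35.467, 63.951],
    'GDP': [1785387, 2833687],
    'Continent': ['America', 'Europe'],
}, index=['Canada', 'France'])

# Keep only the numeric columns (ints and floats).
numeric = sample.select_dtypes(include='number')
# Keep everything except the numeric columns.
non_numeric = sample.select_dtypes(exclude='number')

print(list(numeric.columns))
print(list(non_numeric.columns))
```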

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Indexing, Selection and Slicing

Individual columns in the DataFrame can be selected with regular indexing. Each column is represented as a `Series`:

In [17]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [76]:
df.loc[['Canada']]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America


In [18]:
df.loc['Canada']

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

In [75]:
# df['Canada'] raises a KeyError: df[...] looks up column names, and 'Canada' is a row label. Use df.loc['Canada'] instead.

In [19]:
df.iloc[-1]

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object

In [24]:
df['Population']

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64

Note that the `index` of the returned Series is the same as the DataFrame's, and its `name` is the name of the column. If you're working in a notebook and want to see a more DataFrame-like format you can use the `to_frame` method:

In [25]:
df['Population'].to_frame()
# or df[['Population']]

Unnamed: 0,Population
Canada,35.467
France,63.951
Germany,80.94
Italy,60.665
Japan,127.061
United Kingdom,64.511
United States,318.523


Multiple columns can also be selected similarly to `numpy` and `Series`:

In [22]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075


In this case, the result is another `DataFrame`. Slicing works differently: it acts at "row level", which can be counterintuitive:

In [30]:
df.iloc[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


Row level selection works better with `loc` and `iloc` **which are recommended** over regular "direct slicing" (`df[:]`).

`loc` selects rows matching the given index:

In [28]:
df.loc[['Italy']]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Italy,60.665,2167744,301336,0.873,Europe


In [31]:
df.loc['France': 'Italy']

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
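
One detail worth spelling out: a label slice with `loc` includes its end label, while a position slice with `iloc` (or direct slicing) excludes its end position. A minimal sketch on a made-up four-row frame (`sample` and the result names are illustrative):

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 60.665]},
    index=['Canada', 'France', 'Germany', 'Italy'],
)

by_label = sample.loc['France':'Italy']  # label slice: 'Italy' is included
by_position = sample.iloc[1:3]           # position slice: position 3 is excluded
direct = sample[1:3]                     # direct slicing also acts on rows

print(list(by_label.index))
print(list(by_position.index))
```

This is one reason `loc` and `iloc` are easier to reason about: the method name tells you which of the two rules applies.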


As a second "argument", you can pass the column(s) you'd like to select:

In [70]:
df.loc['France':'Italy', 'Population']

France     63.951
Germany    80.940
Italy      60.665
Name: Population, dtype: float64

In [71]:
df.loc['France':'Italy', ['Population']]

Unnamed: 0,Population
France,63.951
Germany,80.94
Italy,60.665


In [69]:
df.loc['France':'Italy', 'Population'].to_frame()

Unnamed: 0,Population
France,63.951
Germany,80.94
Italy,60.665


In [73]:
df.loc['France': 'Italy', ['Population', 'GDP']]

Unnamed: 0,Population,GDP
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744


In [37]:
df.loc[['Canada', 'Italy'], ['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
Italy,60.665,2167744


`iloc` works with the (numeric) "position" of the index:

In [41]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [62]:
df.iloc[[0]]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America


In [48]:
df.iloc[0]

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

In [61]:
df.iloc[-1]

Population       318.523
GDP             17348075
Surface Area     9525067
HDI                0.915
Continent        America
Name: United States, dtype: object

In [60]:
df.iloc[[-1]]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
United States,318.523,17348075,9525067,0.915,America


In [53]:
df.iloc[[0, 1, -1]]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
United States,318.523,17348075,9525067,0.915,America


In [54]:
df.iloc[1:3]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe


In [59]:
df.iloc[1:3, 3]

France     0.888
Germany    0.916
Name: HDI, dtype: float64

In [57]:
df.iloc[1:3, [3]]

Unnamed: 0,HDI
France,0.888
Germany,0.916


In [56]:
df.iloc[1:3, [0, 3]]

Unnamed: 0,Population,HDI
France,63.951,0.888
Germany,80.94,0.916


In [58]:
df.iloc[1:3, 1:3]

Unnamed: 0,GDP,Surface Area
France,2833687,640679
Germany,3874437,357114


> **RECOMMENDED: Always use `loc` and `iloc` to reduce ambiguity, especially with `DataFrame`s with numeric indexes.**
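
To see why numeric indexes are the risky case, consider this sketch with a hypothetical frame whose labels are integers in reverse order (`scores` is a made-up name). The same number means different rows depending on whether it is read as a label or as a position:

```python
import pandas as pd

# Hypothetical frame whose index labels are integers, but not 0..n-1 in order.
scores = pd.DataFrame({'score': [0.9, 0.8, 0.7]}, index=[2, 1, 0])

first_by_position = scores.iloc[0]  # first row, whatever its label is
row_labelled_zero = scores.loc[0]   # the row whose label is 0 (here, the last one)

print(first_by_position['score'])
print(row_labelled_zero['score'])
```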

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Conditional selection (boolean arrays)

We saw conditional selection applied to `Series` and it'll work in the same way for `DataFrame`s. After all, a `DataFrame` is a collection of `Series`:

In [63]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Canada,35.467,1785387,9984670,0.913,America
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [64]:
df['Population'] > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: Population, dtype: bool

In [65]:
df.loc[df['Population'] > 70]

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United States,318.523,17348075,9525067,0.915,America


The boolean matching is done at index level, so you can filter with any boolean `Series`, as long as its index matches the rows of the `DataFrame`. Column selection still works as expected:

In [68]:
df.loc[df['Population'] > 70, ['Population']]

Unnamed: 0,Population
Germany,80.94
Japan,127.061
United States,318.523


In [67]:
df.loc[df['Population'] > 70, ['Population', 'GDP']]

Unnamed: 0,Population,GDP
Germany,80.94,3874437
Japan,127.061,4602367
United States,318.523,17348075
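
Conditions can also be combined. The sketch below, on a made-up four-row frame, uses `&` (and) and `~` (not); `|` (or) works the same way. Note that each condition needs its own parentheses, because these operators bind more tightly than comparisons.

```python
import pandas as pd

sample = pd.DataFrame({
    'Population': [35.467, 80.94, 127.061, 318.523],
    'Continent': ['America', 'Europe', 'Asia', 'America'],
}, index=['Canada', 'Germany', 'Japan', 'United States'])

# Both conditions must hold.
is_big = sample['Population'] > 70
is_american = sample['Continent'] == 'America'
big_american = sample.loc[is_big & is_american]

# Negate a condition with ~.
not_european = sample.loc[~(sample['Continent'] == 'Europe')]

print(list(big_american.index))
print(list(not_european.index))
```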


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Dropping stuff

As the opposite of selection, we have "dropping". Instead of pointing out which values you'd like to _select_, you can point out which ones you'd like to `drop`:

In [77]:
df.drop('Canada')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [78]:
df.drop(['Canada', 'Japan'])

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Italy,60.665,2167744,301336,0.873,Europe
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [79]:
df.drop(columns=['Population', 'HDI'])

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [80]:
df.drop(['Italy', 'Canada'], axis=0)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Germany,80.94,3874437,357114,0.916,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


In [81]:
df.drop(['Population', 'HDI'], axis=1)

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [83]:
df.drop(['Population', 'HDI'], axis='columns')

Unnamed: 0,GDP,Surface Area,Continent
Canada,1785387,9984670,America
France,2833687,640679,Europe
Germany,3874437,357114,Europe
Italy,2167744,301336,Europe
Japan,4602367,377930,Asia
United Kingdom,2950039,242495,Europe
United States,17348075,9525067,America


In [84]:
df.drop(['Canada', 'Germany'], axis='rows')

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent
France,63.951,2833687,640679,0.888,Europe
Italy,60.665,2167744,301336,0.873,Europe
Japan,127.061,4602367,377930,0.891,Asia
United Kingdom,64.511,2950039,242495,0.907,Europe
United States,318.523,17348075,9525067,0.915,America


All these `drop` calls return a new `DataFrame`. If you'd like to modify it "in place", you can use the `inplace` parameter (there's an example below).
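
A minimal sketch of that difference, using a small made-up frame (`sample`, `without_hdi` and `columns_after_drop` are illustrative names). The original is untouched until you reassign the result:

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# drop returns a new DataFrame; `sample` still has both columns afterwards.
without_hdi = sample.drop(columns='HDI')
columns_after_drop = list(sample.columns)
print(columns_after_drop)
print(list(without_hdi.columns))

# To keep the change, reassign the result (or pass inplace=True).
sample = sample.drop(columns='HDI')
```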

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Operations

In [85]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387
France,63.951,2833687
Germany,80.94,3874437
Italy,60.665,2167744
Japan,127.061,4602367
United Kingdom,64.511,2950039
United States,318.523,17348075


In [86]:
df[['Population', 'GDP']] / 100

Unnamed: 0,Population,GDP
Canada,0.35467,17853.87
France,0.63951,28336.87
Germany,0.8094,38744.37
Italy,0.60665,21677.44
Japan,1.27061,46023.67
United Kingdom,0.64511,29500.39
United States,3.18523,173480.75


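**Operations with Series** work at a column level: the index of the `Series` is matched against the column labels of the `DataFrame`, and the operation is broadcast down the rows (which can be counterintuitive).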

In [116]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI']).to_frame().T
crisis

Unnamed: 0,GDP,HDI
0,-1000000.0,-0.3


In [113]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI']).to_frame()
crisis

Unnamed: 0,0
GDP,-1000000.0
HDI,-0.3


In [119]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
crisis

GDP   -1000000.0
HDI         -0.3
dtype: float64

In [117]:
df[['GDP', 'HDI']]

Unnamed: 0,GDP,HDI
Canada,1785387,0.913
France,2833687,0.888
Germany,3874437,0.916
Italy,2167744,0.873
Japan,4602367,0.891
United Kingdom,2950039,0.907
United States,17348075,0.915


In [120]:
df[['GDP', 'HDI']] + crisis
# Note: if `crisis` is built with .to_frame() (as in the two cells above), this addition returns
# NaN everywhere. DataFrame + DataFrame aligns on both row and column labels, and the two
# frames have no cell in common.

Unnamed: 0,GDP,HDI
Canada,785387.0,0.613
France,1833687.0,0.588
Germany,2874437.0,0.616
Italy,1167744.0,0.573
Japan,3602367.0,0.591
United Kingdom,1950039.0,0.607
United States,16348075.0,0.615
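
The alignment rules behind this can be sketched on a smaller, made-up frame. All names here (`sample`, `per_column`, `per_row`, the `delta` column) are illustrative assumptions, not part of the analysis above:

```python
import pandas as pd

sample = pd.DataFrame(
    {'GDP': [1785387, 2833687], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# Series indexed by column labels: matched to columns, applied to every row.
per_column = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
by_columns = sample + per_column

# Series indexed by row labels: pass axis='index' to match on rows instead.
per_row = pd.Series([10, 20], index=['Canada', 'France'])
by_rows = sample.add(per_row, axis='index')

# A DataFrame operand is aligned on rows AND columns. This one-column frame
# shares no (row, column) cell with `sample`, so every result cell is NaN.
all_nan = sample + per_column.to_frame(name='delta')

print(by_columns)
print(by_rows)
print(all_nan)
```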


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Modifying DataFrames

It's simple and intuitive. You can add columns or replace column values without issues:

### Adding a new column

In [39]:
langs = pd.Series(
    ['French', 'German', 'Italian'],
    index=['France', 'Germany', 'Italy'],
    name='Language'
)

In [40]:
langs

France      French
Germany     German
Italy      Italian
Name: Language, dtype: object

In [41]:
df['Language'] = langs

In [42]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,
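
The empty cells appear because a `Series` is assigned by label, not by position. A short sketch of the two assignment styles, on a made-up three-row frame (`sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

# A Series is matched by label: order does not matter, missing labels give NaN.
sample['Language'] = pd.Series({'Germany': 'German', 'France': 'French'})

# A plain list is matched by position and must have one value per row.
sample['Continent'] = ['America', 'Europe', 'Europe']

print(sample)
```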


---
### Replacing values per column

In [43]:
# df = df.drop(df.columns[-1], axis='columns')
df.loc['Japan','Language'] = "Japanese" # How to change one value.
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,
United States,318.523,17348075,9525067,0.915,America,


In [44]:
mask = df['Language'].isnull()
mask

Canada             True
France            False
Germany           False
Italy             False
Japan             False
United Kingdom     True
United States      True
Name: Language, dtype: bool

In [45]:
df.loc[mask,'Language'] = "English"


In [46]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


---
### Renaming Columns


In [245]:
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Anual Popcorn Consumption': 'APC'
    }, index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'
    })

Unnamed: 0,Population,GDP,Surface Area,Human Development Index,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
UK,64.511,2950039,242495,0.907,Europe,English
USA,318.523,17348075,9525067,0.915,America,English
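
Notice that the labels that don't exist (`'Anual Popcorn Consumption'`, `'Argentina'`) were silently ignored. If you'd rather be told about them, `rename` accepts `errors='raise'`. A minimal sketch on a made-up frame (`sample`, `renamed` and `raised` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({'HDI': [0.913, 0.888]}, index=['Canada', 'France'])

# Default behaviour: the unknown label 'Argentina' is ignored silently.
renamed = sample.rename(index={'France': 'FR', 'Argentina': 'AR'})

# With errors='raise', an unknown label raises a KeyError instead.
try:
    sample.rename(index={'Argentina': 'AR'}, errors='raise')
    raised = False
except KeyError:
    raised = True

print(list(renamed.index))
print(raised)
```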


In [246]:
df.rename(index=str.lower)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
canada,35.467,1785387,9984670,0.913,America,English
france,63.951,2833687,640679,0.888,Europe,French
germany,80.94,3874437,357114,0.916,Europe,German
italy,60.665,2167744,301336,0.873,Europe,Italian
japan,127.061,4602367,377930,0.891,Asia,Japanese
united kingdom,64.511,2950039,242495,0.907,Europe,English
united states,318.523,17348075,9525067,0.915,America,English


In [247]:
df.rename(index=str.upper)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
CANADA,35.467,1785387,9984670,0.913,America,English
FRANCE,63.951,2833687,640679,0.888,Europe,French
GERMANY,80.94,3874437,357114,0.916,Europe,German
ITALY,60.665,2167744,301336,0.873,Europe,Italian
JAPAN,127.061,4602367,377930,0.891,Asia,Japanese
UNITED KINGDOM,64.511,2950039,242495,0.907,Europe,English
UNITED STATES,318.523,17348075,9525067,0.915,America,English


In [248]:
df.rename(index=lambda x: x.lower())

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
canada,35.467,1785387,9984670,0.913,America,English
france,63.951,2833687,640679,0.888,Europe,French
germany,80.94,3874437,357114,0.916,Europe,German
italy,60.665,2167744,301336,0.873,Europe,Italian
japan,127.061,4602367,377930,0.891,Asia,Japanese
united kingdom,64.511,2950039,242495,0.907,Europe,English
united states,318.523,17348075,9525067,0.915,America,English


In [322]:
df.rename(index=str.title)

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


---
### Dropping columns

In [261]:
# df.drop(columns='Language', inplace=True)
# df

# inplace=True modifies df itself instead of returning a new DataFrame.

---
### Adding values

In [47]:
# DataFrame.append was removed in pandas 2.0, so this no longer works:
# df.append(pd.Series({
#     'Population': 3,
#     'GDP': 5
# }, name='China'))
ch = pd.Series({
    'Population': 3,
    'GDP': 5
}, name='China')


# ch=ch.to_frame().T
ch


Population    3
GDP           5
Name: China, dtype: int64

In [48]:
ch.name

'China'

In [15]:
# for col in df.columns:
#     if col not in ch.columns:
#         ch[col] = None  # Or any default value you prefer
# ch

In [95]:
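import copy

# Backup/restore helper so this section can be re-run: while `df` still has its
# original 7 x 6 = 42 cells, save a copy in `df2`; otherwise restore `df` from it.
if df.size == 42:
    df2 = copy.deepcopy(df)
else:
    df = copy.deepcopy(df2)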
df2

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


In [96]:
df1 = copy.deepcopy(df2)
df1

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


In [97]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387,9984670,0.913,America,English
France,63.951,2833687,640679,0.888,Europe,French
Germany,80.94,3874437,357114,0.916,Europe,German
Italy,60.665,2167744,301336,0.873,Europe,Italian
Japan,127.061,4602367,377930,0.891,Asia,Japanese
United Kingdom,64.511,2950039,242495,0.907,Europe,English
United States,318.523,17348075,9525067,0.915,America,English


In [94]:
df1.loc[len(df1.index)] = ch
df1

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
7,3.0,5.0,,,,


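Assigning with `loc[len(df1.index)]` labels the new row with its integer position. `rename` returns a new `DataFrame` in which that last label is replaced by the name of the Series: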

In [91]:
df1 = df1.rename(index={df1.index[-1]: ch.name})
df1

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,3.0,5.0,,,,


In [82]:
df = df1
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,3.0,5.0,,,,
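
`DataFrame.append` was removed in pandas 2.0, and `pd.concat` is the usual replacement. The following is a sketch of one way to add a row with it, on a made-up two-row frame (`sample`, `new_row` and `extended` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)
new_row = pd.Series({'Population': 3, 'GDP': 5}, name='China')

# Turn the Series into a one-row DataFrame (its name becomes the row label),
# then stack it under the existing rows. concat returns a new DataFrame.
extended = pd.concat([sample, new_row.to_frame().T])

print(extended)
```

Unlike the `loc[len(...)]` approach, the row label comes directly from the Series' `name`, so no `rename` step is needed afterwards.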


You can directly set the new index and values to the `DataFrame`:

In [132]:
df.loc['China'] = pd.Series({'Population': 1_400, 'Continent': 'Asia','GDP': df.loc['United States','GDP']*0.75})

In [133]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,1400.0,13011056.25,,,Asia,


We can use `drop` to just remove a row by index:

In [59]:
# df.drop('China', inplace=True)

In [134]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
United States,318.523,17348075.0,9525067.0,0.915,America,English
China,1400.0,13011056.25,,,Asia,


---
### More radical index changes

In [135]:
df3= df.reset_index()
df3

Unnamed: 0,index,Population,GDP,Surface Area,HDI,Continent,Language
0,Canada,35.467,1785387.0,9984670.0,0.913,America,English
1,France,63.951,2833687.0,640679.0,0.888,Europe,French
2,Germany,80.94,3874437.0,357114.0,0.916,Europe,German
3,Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
4,Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
5,United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
6,United States,318.523,17348075.0,9525067.0,0.915,America,English
7,China,1400.0,13011056.25,,,Asia,


In [136]:
# df3.iloc[1,'Population']
df3.iloc[-2, 0] = 'USA'
df3

Unnamed: 0,index,Population,GDP,Surface Area,HDI,Continent,Language
0,Canada,35.467,1785387.0,9984670.0,0.913,America,English
1,France,63.951,2833687.0,640679.0,0.888,Europe,French
2,Germany,80.94,3874437.0,357114.0,0.916,Europe,German
3,Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
4,Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
5,United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
6,USA,318.523,17348075.0,9525067.0,0.915,America,English
7,China,1400.0,13011056.25,,,Asia,


In [137]:
df3 = df3.set_index('index')
df3

index,Population,GDP,Surface Area,HDI,Continent,Language
Canada,35.467,1785387.0,9984670.0,0.913,America,English
France,63.951,2833687.0,640679.0,0.888,Europe,French
Germany,80.94,3874437.0,357114.0,0.916,Europe,German
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English
USA,318.523,17348075.0,9525067.0,0.915,America,English
China,1400.0,13011056.25,,,Asia,


In [138]:
df.set_index('Population')

Population,GDP,Surface Area,HDI,Continent,Language
35.467,1785387.0,9984670.0,0.913,America,English
63.951,2833687.0,640679.0,0.888,Europe,French
80.94,3874437.0,357114.0,0.916,Europe,German
60.665,2167744.0,301336.0,0.873,Europe,Italian
127.061,4602367.0,377930.0,0.891,Asia,Japanese
64.511,2950039.0,242495.0,0.907,Europe,English
318.523,17348075.0,9525067.0,0.915,America,English
1400.0,13011056.25,,,Asia,
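
To summarise the round trip, here is a small sketch on a made-up frame (`sample`, `flat`, `restored` and `renumbered` are illustrative names). It also shows `drop=True`, which discards the old labels instead of keeping them as a column:

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951]},
    index=['Canada', 'France'],
)

# reset_index moves the labels into a regular column called 'index'...
flat = sample.reset_index()
# ...and set_index turns a column back into the row labels.
restored = flat.set_index('index')

# drop=True throws the old labels away and just renumbers the rows.
renumbered = sample.reset_index(drop=True)

print(flat)
print(restored)
print(renumbered)
```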


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Creating columns from other columns

Altering a DataFrame often involves combining different columns into another. For example, in our Countries analysis, we could try to calculate the "GDP per capita", which is just `GDP / Population`.

In [139]:
df[['Population', 'GDP']]

Unnamed: 0,Population,GDP
Canada,35.467,1785387.0
France,63.951,2833687.0
Germany,80.94,3874437.0
Italy,60.665,2167744.0
Japan,127.061,4602367.0
United Kingdom,64.511,2950039.0
United States,318.523,17348075.0
China,1400.0,13011056.25


The regular pandas way of expressing that is to divide one Series by the other:

In [140]:
df['GDP'] / df['Population']

Canada            50339.385908
France            44310.284437
Germany           47868.013343
Italy             35733.025633
Japan             36221.712406
United Kingdom    45729.239975
United States     54464.120330
China              9293.611607
dtype: float64

The result of that operation is just another series that you can add to the original `DataFrame`:

In [141]:
df['GDP Per Capita'] = df['GDP'] / df['Population']

In [142]:
df

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language,GDP Per Capita
Canada,35.467,1785387.0,9984670.0,0.913,America,English,50339.385908
France,63.951,2833687.0,640679.0,0.888,Europe,French,44310.284437
Germany,80.94,3874437.0,357114.0,0.916,Europe,German,47868.013343
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian,35733.025633
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese,36221.712406
United Kingdom,64.511,2950039.0,242495.0,0.907,Europe,English,45729.239975
United States,318.523,17348075.0,9525067.0,0.915,America,English,54464.12033
China,1400.0,13011056.25,,,Asia,,9293.611607
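
If you'd rather not modify the original `DataFrame`, `assign` is an alternative that returns a new one with the extra column. A minimal sketch on a made-up two-row frame (`sample`, `with_gdp_pc` and the column name `gdp_per_capita` are illustrative):

```python
import pandas as pd

sample = pd.DataFrame(
    {'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
    index=['Canada', 'France'],
)

# assign returns a new DataFrame with the extra column; `sample` is unchanged.
# The lambda receives the DataFrame that assign is being called on.
with_gdp_pc = sample.assign(gdp_per_capita=lambda d: d['GDP'] / d['Population'])

print(with_gdp_pc)
```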


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## Statistical info

You've already seen the `describe` method, which gives you a good "summary" of the `DataFrame`. Let's explore other methods in more detail:

In [143]:
df.head()

Unnamed: 0,Population,GDP,Surface Area,HDI,Continent,Language,GDP Per Capita
Canada,35.467,1785387.0,9984670.0,0.913,America,English,50339.385908
France,63.951,2833687.0,640679.0,0.888,Europe,French,44310.284437
Germany,80.94,3874437.0,357114.0,0.916,Europe,German,47868.013343
Italy,60.665,2167744.0,301336.0,0.873,Europe,Italian,35733.025633
Japan,127.061,4602367.0,377930.0,0.891,Asia,Japanese,36221.712406


In [144]:
df.describe()

Unnamed: 0,Population,GDP,Surface Area,HDI,GDP Per Capita
count,8.0,8.0,7.0,7.0,8.0
mean,268.88975,6071599.0,3061327.0,0.900429,40494.924205
std,465.821648,5808135.0,4576187.0,0.016592,14156.408293
min,35.467,1785387.0,242495.0,0.873,9293.611607
25%,63.1295,2667201.0,329225.0,0.8895,36099.540713
50%,72.7255,3412238.0,377930.0,0.907,45019.762206
75%,174.9265,6704539.0,5082873.0,0.914,48485.856484
max,1400.0,17348080.0,9984670.0,0.916,54464.12033


In [145]:
population = df['Population']

In [146]:
population.min(), population.max()

(35.467, 1400.0)

In [147]:
population.sum()

2151.1180000000004

In [148]:
population.sum() / len(population)

268.88975000000005

In [149]:
population.mean()

268.88975000000005

In [150]:
population.std()

465.82164757486004

In [151]:
population.median()

72.7255

In [152]:
population.describe()

count       8.000000
mean      268.889750
std       465.821648
min        35.467000
25%        63.129500
50%        72.725500
75%       174.926500
max      1400.000000
Name: Population, dtype: float64

In [153]:
population.quantile(.25)

63.1295

In [154]:
population.quantile([.2, .4, .6, .8, 1])

0.2      61.9794
0.4      64.3990
0.6      90.1642
0.8     241.9382
1.0    1400.0000
Name: Population, dtype: float64
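
Two relationships between these methods are worth checking for yourself: the median is the 0.5 quantile, and `agg` can compute several statistics in one call. A short sketch on a made-up four-value Series (the names `median_value`, `middle_quantile` and `summary` are illustrative):

```python
import pandas as pd

population = pd.Series(
    [35.467, 63.951, 80.94, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
    name='Population',
)

# The median and the 0.5 quantile describe the same point of the distribution.
median_value = population.median()
middle_quantile = population.quantile(0.5)

# agg takes a list of statistic names and returns the results as a Series.
summary = population.agg(['min', 'max', 'mean'])

print(median_value, middle_quantile)
print(summary)
```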

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
