# Data manipulation

Documentation sources:

* https://www.tutorialspoint.com/python_pandas/

More advanced topics are discussed in the following sources:

* https://tomaugspurger.github.io
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-39e811c81a0c
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-part-3-d5704b4b9116
* https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-part-4-c4216f84d388




We cover here the basics of data import and data manipulation with pandas.

In [308]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import read_csv
from pandas import MultiIndex

## I. Data import 

### Import from files

* Data can be imported from files with `pandas.read_csv`, which returns a dataframe
* The default field delimiter is `,` but it can be changed with the `sep` argument
* The row that holds the column names can be specified with the `header` argument (a zero-based row number)
* Rows for hierarchical multi-headers can be specified by `header = [level1, level2,...]`
* Rows can be skipped at the start of the file with `skiprows` and at the end with `skipfooter`
* The number of rows to be read can be limited with the `nrows` argument
* The table can be parsed chunk by chunk by specifying `chunksize`; an iterator is then returned
* Column names can be specified with the `names` argument
* Row names can be specified with the `index_col` argument
* Datatypes for each column can be specified with the `dtype` argument. Use `str` and `numpy.xx` datatypes
* Datatypes can also be forced by column converters specified with the `converters` argument
* Symbols representing **missing** values can be specified with the `na_values` argument
* Many other arguments specify details of the CSV format and data conversions
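The `sep`, `na_values` and `chunksize` arguments listed above can be tried without a file on disk. The following is a minimal sketch, assuming a small in-memory table: the `;` delimiter and the `-` marker for a missing value are made up for illustration and are not properties of `realwage.csv`.

```python
import io
import pandas as pd

# Toy table (hypothetical): ';' as delimiter and '-' as the missing-value marker
csv_text = """Time;Country;value
2006-01-01;Ireland;17132.443
2007-01-01;Ireland;18100.918
2008-01-01;Ireland;17747.406
2009-01-01;Ireland;-
2010-01-01;Ireland;18755.832
"""

# With chunksize set, read_csv returns an iterator over dataframes
reader = pd.read_csv(io.StringIO(csv_text), sep=';', na_values='-', chunksize=2)

chunk_lengths = []
missing = 0
for chunk in reader:
    chunk_lengths.append(len(chunk))                 # rows in this chunk
    missing += int(chunk['value'].isna().sum())      # '-' was parsed as NaN

print(chunk_lengths, missing)
```

Each chunk is an ordinary dataframe, so any per-chunk aggregation can replace the counting done here.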

In [52]:
display(read_csv('realwage.csv', nrows = 5))
display(read_csv('realwage.csv', index_col = 0, nrows = 5))
display(read_csv('realwage.csv', index_col = 0, skiprows = 1, names = ['a', 'b', 'c', 'd', 'e'],  nrows = 5))
display(read_csv('realwage.csv', index_col = 0, dtype = {'value':float}, nrows = 5))
display(read_csv('realwage.csv', index_col = 0, converters = {'value': lambda x: round(float(x))}, nrows = 5))
display(read_csv('realwage.csv', index_col = 0, na_values = 'Annual', nrows = 5))

Unnamed: 0.1,Unnamed: 0,Time,Country,Series,Pay period,value
0,0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Unnamed: 0,Time,Country,Series,Pay period,value
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Unnamed: 0,a,b,c,d,e
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Unnamed: 0,Time,Country,Series,Pay period,value
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Unnamed: 0,Time,Country,Series,Pay period,value
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18101
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18756


Unnamed: 0,Time,Country,Series,Pay period,value
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,,17132.443
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,,18100.918
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,,17747.406
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,,18580.139
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,,18755.832


### Import from web

There are dedicated libraries for reading data streams from the web:

* https://github.com/pydata/pandas-datareader
* https://pypi.org/project/fix-yahoo-finance/
* https://github.com/statfi/opendata
* https://github.com/opendatateam

but sometimes you have to write your own scraper:

* https://www.datacamp.com/community/tutorials/web-scraping-using-python
* https://pythonprogramminglanguage.com/web-scraping-with-pandas-and-beautifulsoup/
* https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-bc9563fe8860

### Import from SQL database

* Data from databases can be imported with the `read_sql_query` and `read_sql_table` functions
* You need to set up a `SQLAlchemy` connection
* The query can be specified with the `sql` argument
* Row names can be specified with the `index_col` argument
* Large tables can be processed, and sometimes also fetched, chunk by chunk by specifying `chunksize`; an iterator is then returned
* The function does not guarantee that the table is fetched from the database chunk by chunk
* The database schema can be specified with the `schema` argument
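A minimal sketch of `read_sql_query`, under stated assumptions: to stay self-contained it uses an in-memory SQLite database through the standard `sqlite3` module instead of a `SQLAlchemy` connection (pandas accepts such a connection for queries), and the `wages` table with its columns is hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with a small 'wages' table
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE wages (country TEXT, year INTEGER, value REAL)')
con.executemany(
    'INSERT INTO wages VALUES (?, ?, ?)',
    [('Estonia', 2008, 6383.8848), ('Estonia', 2009, 6388.894), ('Latvia', 2008, 5404.939)],
)
con.commit()

# Whole result at once, with 'country' used for row names
wages = pd.read_sql_query('SELECT * FROM wages', con, index_col='country')

# Same query processed chunk by chunk; an iterator of dataframes is returned
chunks = pd.read_sql_query('SELECT * FROM wages ORDER BY year', con, chunksize=2)
chunk_sizes = [len(chunk) for chunk in chunks]

con.close()
print(wages.shape, chunk_sizes)
```

`read_sql_table` and the `schema` argument require a `SQLAlchemy` connection, so they are not shown in this sketch.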

## II. Basic manipulation

### Basic properties of a dataframe

A dataframe is characterised by the following attributes (`head` and `tail` are methods and must be called):

* `ndim`    - number of dimensions
* `shape`   - dimensions of the data matrix
* `size`    - number of elements in the data matrix
* `axes`    - more detailed description of row and column names
* `index`   - list of row names as in `axes[0]`
* `columns` - list of column names as in `axes[1]`
* `head()`  - a few rows from the top of the data matrix
* `tail()`  - a few rows from the bottom of the data matrix

In [213]:
df = read_csv('realwage.csv', index_col = 0)
print(df.ndim, ':', df.shape, ':' ,df.size, '\n')
print(df.axes, '\n')
print(df.index)
print(df.columns)
display(df.head())
display(df.tail())

2 : (1408, 5) : 7040 

[Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407],
           dtype='int64', length=1408), Index(['Time', 'Country', 'Series', 'Pay period', 'value'], dtype='object')] 

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407],
           dtype='int64', length=1408)
Index(['Time', 'Country', 'Series', 'Pay period', 'value'], dtype='object')


Unnamed: 0,Time,Country,Series,Pay period,value
0,2006-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17132.443
1,2007-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18100.918
2,2008-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,17747.406
3,2009-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18580.139
4,2010-01-01,Ireland,In 2015 constant prices at 2015 USD PPPs,Annual,18755.832


Unnamed: 0,Time,Country,Series,Pay period,value
1403,2012-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,
1404,2013-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,
1405,2014-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.41
1406,2015-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.56
1407,2016-01-01,Costa Rica,In 2015 constant prices at 2015 USD exchange r...,Hourly,2.63


### Comparison magic for selecting

* **Do not use indirect indexing, also known as chained indexing, in assignments**
* Use simple comparison operators like `df.value > 300` to create Boolean indices for selection
* Use set operations to combine simple restrictions, but bracket every simple comparison because `&`, `|` and `^` bind more tightly than comparison operators:

  * `~a`     for complement
  * `a & b`  for intersection
  * `a | b`  for union
  * `a & ~b` for set difference
  * `a ^ b`  for symmetric difference

* Available comparison operators are:

  * numeric comparison operators
  * string comparison operators
  * `isin` operator for checking the value against lists
  * regex search and match operators `str.contains` and `str.match`
  * datetime comparison operations and the `isin` operator for `pd.date_range` values
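The warning about chained indexing in assignments can be illustrated with a minimal sketch. It assumes a toy dataframe with made-up values; the point is that the row selection and the column selection go into one `.loc` call, so the assignment reaches the original dataframe instead of a temporary copy.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'Country': ['Estonia', 'Latvia', 'Estonia'],
    'value': [5.0, 400.0, 900.0],
})

# Each comparison is bracketed before combining with &
mask = (toy.value > 300) & (toy.value < 1000)

# Chained form to avoid in assignments:
#   toy.loc[mask, :].iloc[0:1, :] = ...
# It may write to a temporary copy and leave toy unchanged.

# Single-step form: find the label of the first matching row, then assign once
first_match = toy.index[mask.to_numpy()][:1]
toy.loc[first_match, 'value'] = -1.0

print(toy)
```

The chained form remains fine for display, as in the first statement of the next cell.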


In [219]:
# Do not write this selection statement in an assignment 
display(df.loc[(df.value > 300) & (df.value < 1000), :].iloc[0:1,:])


display(df.loc[(df.value > 300) & (df.value < 1000), :])
display(df.loc[(df.value > 300) & ~(df.value >= 1000), :])

display(df.loc[(df.value <= 0.5) | (df.value >= 25500), :])
display(df.loc[(df.value >  0.5) ^ (df.value <  25500), :])

display(df.loc[df.Country < 'B', :].head())
display(df.loc[df.Country == 'Estonia', :].head()) 
display(df.loc[df.Country.isin(['Estonia', 'Latvia']), :].head()) 
display(df.loc[df.Country.isin(['Estonia', 'Latvia']) & df.Time.str.match('^2010'), :]) 

idx = pd.to_datetime(df.Time).isin(pd.date_range("2010-01-01", "2010-12-31"))
display(df.loc[idx & df.Country.isin(['Estonia', 'Latvia']), :])

Unnamed: 0,Time,Country,Series,Pay period,value
1210,2006-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Annual,568.23199


Unnamed: 0,Time,Country,Series,Pay period,value
1210,2006-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Annual,568.23199
1212,2008-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Annual,955.16498


Unnamed: 0,Time,Country,Series,Pay period,value
1210,2006-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Annual,568.23199
1212,2008-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Annual,955.16498


Unnamed: 0,Time,Country,Series,Pay period,value
120,2016-01-01,Australia,In 2015 constant prices at 2015 USD exchange r...,Annual,25643.729
206,2014-01-01,Luxembourg,In 2015 constant prices at 2015 USD exchange r...,Annual,25713.797
207,2015-01-01,Luxembourg,In 2015 constant prices at 2015 USD exchange r...,Annual,25592.293
1221,2006-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.234
1222,2007-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.448
1223,2008-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.393


Unnamed: 0,Time,Country,Series,Pay period,value
120,2016-01-01,Australia,In 2015 constant prices at 2015 USD exchange r...,Annual,25643.729
206,2014-01-01,Luxembourg,In 2015 constant prices at 2015 USD exchange r...,Annual,25713.797
207,2015-01-01,Luxembourg,In 2015 constant prices at 2015 USD exchange r...,Annual,25592.293
1221,2006-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.234
1222,2007-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.448
1223,2008-01-01,Russian Federation,In 2015 constant prices at 2015 USD exchange r...,Hourly,0.393


Unnamed: 0,Time,Country,Series,Pay period,value
88,2006-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20410.652
89,2007-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,21087.568
90,2008-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20718.238
91,2009-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20984.768
92,2010-01-01,Australia,In 2015 constant prices at 2015 USD PPPs,Annual,20879.332


Unnamed: 0,Time,Country,Series,Pay period,value
660,2006-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,5179.6499
661,2007-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,5830.6699
662,2008-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6383.8848
663,2009-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6388.894
664,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6204.4941


Unnamed: 0,Time,Country,Series,Pay period,value
660,2006-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,5179.6499
661,2007-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,5830.6699
662,2008-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6383.8848
663,2009-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6388.894
664,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6204.4941


Unnamed: 0,Time,Country,Series,Pay period,value
664,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6204.4941
675,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Hourly,2.9748
686,2010-01-01,Estonia,In 2015 constant prices at 2015 USD exchange r...,Annual,4124.624
697,2010-01-01,Estonia,In 2015 constant prices at 2015 USD exchange r...,Hourly,1.978
1280,2010-01-01,Latvia,In 2015 constant prices at 2015 USD PPPs,Annual,5923.5762
1291,2010-01-01,Latvia,In 2015 constant prices at 2015 USD PPPs,Hourly,2.84007
1302,2010-01-01,Latvia,In 2015 constant prices at 2015 USD exchange r...,Annual,3714.6919
1313,2010-01-01,Latvia,In 2015 constant prices at 2015 USD exchange r...,Hourly,1.781


Unnamed: 0,Time,Country,Series,Pay period,value
664,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Annual,6204.4941
675,2010-01-01,Estonia,In 2015 constant prices at 2015 USD PPPs,Hourly,2.9748
686,2010-01-01,Estonia,In 2015 constant prices at 2015 USD exchange r...,Annual,4124.624
697,2010-01-01,Estonia,In 2015 constant prices at 2015 USD exchange r...,Hourly,1.978
1280,2010-01-01,Latvia,In 2015 constant prices at 2015 USD PPPs,Annual,5923.5762
1291,2010-01-01,Latvia,In 2015 constant prices at 2015 USD PPPs,Hourly,2.84007
1302,2010-01-01,Latvia,In 2015 constant prices at 2015 USD exchange r...,Annual,3714.6919
1313,2010-01-01,Latvia,In 2015 constant prices at 2015 USD exchange r...,Hourly,1.781


### Multidimensional indexing and index slices

* Sometimes you want to slice the dataframe based on several columns
* For that you should create a multi-index by specifying the list of columns: `df.set_index(column_list)`
* As a result, it is possible to slice data according to the columns in `column_list`
* Top-level row indexing works as usual but hierarchical selection is defined through tuples
* The data **changes format** unless all indexes are slices or lists
  * Outer index levels that are fixed to a single value are dropped from the result
  * Still, the output format is quite unpredictable. **Validate your guess in practice!**
* Index slices `IndexSlice[...]` are handy if you want to leave some outer index columns unspecified
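The format change described above can be checked on a small example. This is a minimal sketch, assuming a toy dataframe with two index levels and rounded, illustrative values; it compares a single value, a list and an `IndexSlice` as row selectors.

```python
import pandas as pd

# Toy data (illustrative values) with a two-level row index
toy = pd.DataFrame({
    'Country': ['Estonia', 'Estonia', 'Latvia', 'Latvia'],
    'Pay period': ['Annual', 'Hourly', 'Annual', 'Hourly'],
    'value': [6383.9, 3.06, 5404.9, 2.59],
}).set_index(['Country', 'Pay period']).sort_index()

# Single value: the fixed outer level is dropped from the result
estonia = toy.loc['Estonia', :]

# List with one element: both index levels are kept
estonia_kept = toy.loc[['Estonia'], :]

# IndexSlice: outer level left unspecified, inner level fixed
annual = toy.loc[pd.IndexSlice[:, 'Annual'], :]

print(estonia.index.nlevels, estonia_kept.index.nlevels, annual.index.nlevels)
```

Sorting the index first is a precaution taken in this sketch; slicing a multi-index is generally more reliable and faster on a sorted index.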

In [281]:
hdf = df.assign(Time = pd.to_datetime(df.Time)).set_index(['Country', 'Pay period', 'Time'])
display(hdf.head())
display(hdf.loc['Estonia', :].head())
display(hdf.loc[('Estonia', 'Annual'), :].head())
display(hdf.loc[('Estonia', 'Annual', pd.Timestamp('2008-01-01 00:00:00')), :].head())
display(hdf.loc[(['Estonia'], ['Annual'], pd.date_range('2008-01-01', '2010-01-01')), :])
display(hdf.loc[pd.IndexSlice[:, 'Annual', pd.date_range('2008', '2008')], :].head())
display(hdf.loc[pd.IndexSlice[['Estonia', 'Latvia'], :, pd.date_range('2008', '2008')], :])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Ireland,Annual,2006-01-01,In 2015 constant prices at 2015 USD PPPs,17132.443
Ireland,Annual,2007-01-01,In 2015 constant prices at 2015 USD PPPs,18100.918
Ireland,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
Ireland,Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,18580.139
Ireland,Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,18755.832


Unnamed: 0_level_0,Unnamed: 1_level_0,Series,value
Pay period,Time,Unnamed: 2_level_1,Unnamed: 3_level_1
Annual,2006-01-01,In 2015 constant prices at 2015 USD PPPs,5179.6499
Annual,2007-01-01,In 2015 constant prices at 2015 USD PPPs,5830.6699
Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,6388.894
Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,6204.4941


Unnamed: 0_level_0,Series,value
Time,Unnamed: 1_level_1,Unnamed: 2_level_1
2006-01-01,In 2015 constant prices at 2015 USD PPPs,5179.6499
2007-01-01,In 2015 constant prices at 2015 USD PPPs,5830.6699
2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
2009-01-01,In 2015 constant prices at 2015 USD PPPs,6388.894
2010-01-01,In 2015 constant prices at 2015 USD PPPs,6204.4941


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,4243.8799


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
Estonia,Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,6388.894
Estonia,Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,6204.4941
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,4243.8799
Estonia,Annual,2009-01-01,In 2015 constant prices at 2015 USD exchange r...,4247.21
Estonia,Annual,2010-01-01,In 2015 constant prices at 2015 USD exchange r...,4124.624


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Ireland,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
Ireland,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,19775.785
Spain,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,12172.902
Spain,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,10072.337
Australia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,20718.238


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
Estonia,Hourly,2008-01-01,In 2015 constant prices at 2015 USD PPPs,3.06081
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,4243.8799
Estonia,Hourly,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,2.035
Latvia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,5404.939
Latvia,Hourly,2008-01-01,In 2015 constant prices at 2015 USD PPPs,2.59141
Latvia,Annual,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,3389.4529
Latvia,Hourly,2008-01-01,In 2015 constant prices at 2015 USD exchange r...,1.625


### Controlling levels inside multiindex

* Columns in a multiindex are called levels
* It is possible to reorder levels in the index with `hdf.reorder_levels` and `hdf.swaplevel`
* It is possible to push some levels out from the index with `hdf.reset_index`
* It is possible to delete levels from the index with `hdf.reset_index(..., drop = True)` but this causes data loss
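The difference between pushing a level out of the index and deleting it is easy to verify. A minimal sketch, assuming a toy dataframe with two index levels (the rows are illustrative):

```python
import pandas as pd

# Toy data (illustrative rows) indexed by Country and Time
toy = pd.DataFrame({
    'Country': ['Estonia', 'Latvia'],
    'Time': ['2008-01-01', '2008-01-01'],
    'value': [6383.8848, 5404.939],
}).set_index(['Country', 'Time'])

swapped = toy.swaplevel('Country', 'Time')     # Time becomes the outer level
pushed_out = toy.reset_index('Time')           # Time becomes an ordinary column
dropped = toy.reset_index('Time', drop=True)   # Time is discarded: data loss

print(list(swapped.index.names))
print(list(pushed_out.columns))
print(list(dropped.columns))
```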

In [302]:
display(hdf.swaplevel('Pay period', 'Country').head())
display(hdf.reorder_levels(['Pay period', 'Country','Time']).head())
display(hdf.reset_index().head())
display(hdf.reset_index('Time').head())
display(hdf.reset_index('Time', drop = True).head())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Pay period,Country,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Annual,Ireland,2006-01-01,In 2015 constant prices at 2015 USD PPPs,17132.443
Annual,Ireland,2007-01-01,In 2015 constant prices at 2015 USD PPPs,18100.918
Annual,Ireland,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
Annual,Ireland,2009-01-01,In 2015 constant prices at 2015 USD PPPs,18580.139
Annual,Ireland,2010-01-01,In 2015 constant prices at 2015 USD PPPs,18755.832


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Pay period,Country,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Annual,Ireland,2006-01-01,In 2015 constant prices at 2015 USD PPPs,17132.443
Annual,Ireland,2007-01-01,In 2015 constant prices at 2015 USD PPPs,18100.918
Annual,Ireland,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
Annual,Ireland,2009-01-01,In 2015 constant prices at 2015 USD PPPs,18580.139
Annual,Ireland,2010-01-01,In 2015 constant prices at 2015 USD PPPs,18755.832


Unnamed: 0,Country,Pay period,Time,Series,value
0,Ireland,Annual,2006-01-01,In 2015 constant prices at 2015 USD PPPs,17132.443
1,Ireland,Annual,2007-01-01,In 2015 constant prices at 2015 USD PPPs,18100.918
2,Ireland,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
3,Ireland,Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,18580.139
4,Ireland,Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,18755.832


Unnamed: 0_level_0,Unnamed: 1_level_0,Time,Series,value
Country,Pay period,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Ireland,Annual,2006-01-01,In 2015 constant prices at 2015 USD PPPs,17132.443
Ireland,Annual,2007-01-01,In 2015 constant prices at 2015 USD PPPs,18100.918
Ireland,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,17747.406
Ireland,Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,18580.139
Ireland,Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,18755.832


Unnamed: 0_level_0,Unnamed: 1_level_0,Series,value
Country,Pay period,Unnamed: 2_level_1,Unnamed: 3_level_1
Ireland,Annual,In 2015 constant prices at 2015 USD PPPs,17132.443
Ireland,Annual,In 2015 constant prices at 2015 USD PPPs,18100.918
Ireland,Annual,In 2015 constant prices at 2015 USD PPPs,17747.406
Ireland,Annual,In 2015 constant prices at 2015 USD PPPs,18580.139
Ireland,Annual,In 2015 constant prices at 2015 USD PPPs,18755.832


### Pivot tables and reshaping 

* 

In [307]:
df.columns


Index(['Time', 'Country', 'Series', 'Pay period', 'value'], dtype='object')

In [262]:
hdf.loc[(['Estonia', 'Latvia'], ['Annual']), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Series,value
Country,Pay period,Time,Unnamed: 3_level_1,Unnamed: 4_level_1
Estonia,Annual,2006-01-01,In 2015 constant prices at 2015 USD PPPs,5179.6499
Estonia,Annual,2007-01-01,In 2015 constant prices at 2015 USD PPPs,5830.6699
Estonia,Annual,2008-01-01,In 2015 constant prices at 2015 USD PPPs,6383.8848
Estonia,Annual,2009-01-01,In 2015 constant prices at 2015 USD PPPs,6388.894
Estonia,Annual,2010-01-01,In 2015 constant prices at 2015 USD PPPs,6204.4941
Estonia,Annual,2011-01-01,In 2015 constant prices at 2015 USD PPPs,5910.0601
Estonia,Annual,2012-01-01,In 2015 constant prices at 2015 USD PPPs,5932.1382
Estonia,Annual,2013-01-01,In 2015 constant prices at 2015 USD PPPs,6368.1611
Estonia,Annual,2014-01-01,In 2015 constant prices at 2015 USD PPPs,7072.6411
Estonia,Annual,2015-01-01,In 2015 constant prices at 2015 USD PPPs,7807.5259


In [187]:

* Do not use indirect indexing. It creates unexpected errors



type(idx)
len(idx)
#idx.index

#raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')



5

In [None]:
### Multidimensional Indexing
* index slices
* index reordering


In [None]:
Method chains

Manipulation on steroids. Methods added to the class that make chaining practical (the pandas version that introduced each one is in parentheses):

* `assign` (0.16.0): adds new columns to a DataFrame in a chain (inspired by dplyr's `mutate`). Simple stuff, it just adds columns.
* `pipe` (0.16.2): includes user-defined methods in method chains.
* `rename` (0.18.0): alters axis names (in addition to changing the actual labels as before).
* Window methods (0.18): the top-level `pd.rolling_*` and `pd.expanding_*` functions became NDFrame methods with a groupby-like API.
* `resample` (0.18.0): gained a new groupby-like API.
* `.where`, `.mask` and the indexers accept callables (0.18.1): a callable passed to the indexing methods is evaluated within the DataFrame's context (like `.query`, but with code instead of strings).
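The methods listed above can be combined into one chain. A minimal sketch, assuming a toy dataframe with three rows in the style of `realwage.csv`; `keep_country` is a hypothetical helper written only for this illustration.

```python
import pandas as pd

def keep_country(frame, country):
    """Hypothetical helper: keep the rows of a single country."""
    return frame.loc[frame['Country'] == country]

# Toy data (illustrative rows)
raw = pd.DataFrame({
    'Time': ['2008-01-01', '2009-01-01', '2008-01-01'],
    'Country': ['Estonia', 'Estonia', 'Latvia'],
    'value': [6383.8848, 6388.894, 5404.939],
})

result = (
    raw
    .assign(Time=lambda d: pd.to_datetime(d['Time']))   # add or replace a column
    .pipe(keep_country, 'Estonia')                      # user-defined step
    .rename(columns={'value': 'wage'})                  # relabel a column
    .loc[lambda d: d['wage'] > 6385]                    # callable indexer
)

print(result)
```

Each step returns a new dataframe, so `raw` is left unchanged and no intermediate variables are needed.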


In [None]:
Inplace and fear of copying

In [98]:
df.loc[df['Country'] == 'Estonia', :]

def kala(x):
    return 6

kala(df.value)  
#round(df.value) 
#> 100 
#?? df.value < 1000

0

In [None]:
df.T


reindex
reindex_like

rename
 -columns
 -rows
 -inplace

iteration
sorting
-index
-columns
-values
`sort_values()` accepts a `kind` argument to choose the sorting algorithm among quicksort (the default),
mergesort and heapsort. Of these three, mergesort is the only stable algorithm.
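A minimal sketch of the sorting notes above, assuming a toy dataframe with made-up rows. With a stable algorithm, rows that tie on the sort key keep their original relative order; note that for dataframes `kind` only applies when sorting on a single column.

```python
import pandas as pd

# Toy data (hypothetical rows); the Year column contains ties
toy = pd.DataFrame({
    'Country': ['Latvia', 'Estonia', 'Latvia', 'Estonia'],
    'Year': [2008, 2008, 2007, 2007],
})

by_year = toy.sort_values('Year', kind='mergesort')   # by values, stable
by_index_desc = toy.sort_index(ascending=False)       # by row index
by_column_names = toy.sort_index(axis=1)              # by column names

print(by_year)
```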

string manipulation

In [None]:
## I. Cleaning and filtering

Tidy Data in Action
https://tomaugspurger.github.io/Tidy%20Data%20in%20Action.html

>>> import pandas as pd
>>> df = pd.read_csv("dup.csv")
>>> ids = df["ID"]
>>> df[ids.isin(ids[ids.duplicated()])].sort_values("ID")
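
The duplicate lookup above can be tried without a file on disk. A minimal sketch, assuming a toy table whose `ID` and `payload` columns are made up for illustration; `duplicated(keep=False)` is shown as an alternative that marks every member of a duplicated group directly.

```python
import pandas as pd

# Toy data (hypothetical): IDs 1 and 3 occur twice, ID 2 once
toy = pd.DataFrame({
    'ID': [3, 1, 3, 2, 1],
    'payload': ['a', 'b', 'c', 'd', 'e'],
})
ids = toy['ID']

# All rows whose ID occurs more than once, via the isin idiom
via_isin = toy[ids.isin(ids[ids.duplicated()])].sort_values('ID')

# Same rows, using keep=False to flag every duplicate including the first
via_keep = toy[toy.duplicated('ID', keep=False)].sort_values('ID')

print(via_isin)
```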

statistical functions
window functions
aggregations

In [None]:
## II. Concatenation and merging 

In [None]:
## II. Mapping and aggregation

In [None]:
### Pivot tables