# Lecture 08

by Martin Hronec

### Table of contents

0. [Advanced Pandas](#AdvPandas)
1. [Merge, join and concatenate](#merge)
2. [Reshaping](#reshape)
3. [Split-apply-combine](#groupby)
4. [Git collaboration](#gitco)

In [2]:
import pandas as pd
import numpy as np
import os

In [28]:
# collect one DataFrame per file and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []

# column names are in Czech, so let's translate them
columns_translation = {'cislo_dot' : 'number',
                    'kod_predm' : 'course_code',
                    'nazev_predm' : 'course_title',
                    'prednasejici' : 'teachers',
                    'cvicici' : 'seminar_leaders',
                    't1': 'c_value',
                    't2': 'c_improve', 
                    'katedra_code' : 'department_code'}

# the data really start only in later years, so skip the first 8 entries
for d in os.listdir('unzipped_data/')[8:]:
    try:
        year, semester = d.split('_')[1], d.split('_')[2][:2]
        # on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them
        df_temp = pd.read_csv('unzipped_data/' + d, sep = ';',
                              header = 0, on_bad_lines = 'warn')
        df_temp = df_temp.rename(columns = columns_translation)
        df_temp = df_temp.dropna(how = 'all', axis = 1)
        df_temp['year'] = int(year)
        df_temp['semester'] = semester
        frames.append(df_temp)
    except (IndexError, ValueError):
        print(d + ' has name not in the expected format.')

# keep the column layout of the last file read
df_all = pd.concat(frames).reindex(columns = frames[-1].columns)

b'Skipping line 1017: expected 21 fields, saw 22\nSkipping line 2087: expected 21 fields, saw 22\nSkipping line 2447: expected 21 fields, saw 22\nSkipping line 2736: expected 21 fields, saw 22\nSkipping line 2828: expected 21 fields, saw 23\nSkipping line 3461: expected 21 fields, saw 24\nSkipping line 3645: expected 21 fields, saw 24\nSkipping line 4490: expected 21 fields, saw 23\n'
b'Skipping line 1816: expected 21 fields, saw 22\nSkipping line 1877: expected 21 fields, saw 22\nSkipping line 3253: expected 21 fields, saw 24\nSkipping line 3270: expected 21 fields, saw 22\nSkipping line 3329: expected 21 fields, saw 22\n'
b'Skipping line 7136: expected 21 fields, saw 23\n'
b'Skipping line 4890: expected 21 fields, saw 22\nSkipping line 8304: expected 21 fields, saw 22\nSkipping line 8358: expected 21 fields, saw 22\n'
b'Skipping line 1145: expected 21 fields, saw 22\nSkipping line 1512: expected 21 fields, saw 22\n'
b'Skipping line 279: expected 21 fields, saw 22\nSkipping line 4057:

In [27]:
df_all.head(3)

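|   | number | course_code | course_title | teachers | seminar_leaders | q1 | q2 | q3 | q4 | q5 | ... | q9 | q10 | q11 | q12 | q13 | c_value | c_improve | department_code | year | semester |
|---|--------|-------------|--------------|----------|-----------------|----|----|----|----|----|-----|----|-----|-----|-----|-----|---------|-----------|-----------------|------|----------|
| 0 | 1.0 | JPM634 | Crisis Games |  | Kučera,T.,Smetana,M.,Rychnovská,D.,Parízek, M. | 5.0 | 3.0 |  |  |  | ... | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | Inovativnost vyuky, interaktivitu |  | kmv | 2014 | ls |
| 1 | 2.0 | JEB111 | Advanced Data Analysis in MS Excel |  | Kraicová,L.,Polák,P. | 5.0 | 4.0 |  |  |  | ... | 1.0 | 5.0 | 5.0 | 4.0 | 5.0 |  |  | ies | 2014 | ls |
| 2 | 3.0 | JEB001 | Bachelor´s Thesis Seminar I |  | Cahlík,T.,Cotte,P. | 5.0 | 2.0 |  |  |  | ... | 3.0 | 3.0 | 1.0 | 1.0 | 5.0 | Zajímavý hosté a zajímavá témata |  | ies | 2014 | ls |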


In [9]:
q_columns = [x for x in df_all.columns if 'q' in x]
df_q = df_all[q_columns]

# Using functions on pandas objects

| Operation          | Function              |
|--------------------|-----------------------|
| Tablewise          | `pipe()`              |
| Row or Column-wise | `apply()`             |
| Aggregation        | `agg() / transform()` |
| Elementwise        | `applymap()`          |

**Tablewise**
* DFs and Series can be passed as arguments to functions
* if multiple functions need to be called in a sequence, use the `pipe()` method; this style is called method chaining
    * often used in the data science setting
    * inspired by unix pipes and the dplyr (`%>%`) operator in R
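
To make the chaining idea concrete, here is a minimal sketch on toy data. The helper functions `drop_missing` and `add_total` are hypothetical names made up for this illustration; the point is only that `pipe()` lets us read the steps top to bottom instead of inside out.

```python
import pandas as pd

# hypothetical helpers, each takes a DataFrame and returns a new one
def drop_missing(df, cols):
    return df.dropna(subset=cols)

def add_total(df, cols, name="total"):
    return df.assign(**{name: df[cols].sum(axis=1)})

toy = pd.DataFrame({"q1": [5.0, None, 3.0], "q2": [4.0, 2.0, 1.0]})

# nested calls: read from the inside out
nested = add_total(drop_missing(toy, ["q1"]), ["q1", "q2"])

# the same steps with pipe(): read from top to bottom
piped = (toy
         .pipe(drop_missing, ["q1"])
         .pipe(add_total, ["q1", "q2"])
)
```

Both versions give the same DataFrame; the chained one tends to be easier to extend with further steps.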


In [10]:
# prepare some toy dataframe
import statsmodels.formula.api as sm
x = np.linspace(-10,10,100)
y = x**2
ols_data = pd.DataFrame({'x': x, 'y': y})

In [11]:
# method chaining way, with pipe(function, arguments)
(ols_data.pipe((sm.ols, 'data'), 'y ~ x')
 .fit()
 .summary()
)

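|                   |                  |                     |            |
|-------------------|------------------|---------------------|------------|
| Dep. Variable:    | y                | R-squared:          | -0.0       |
| Model:            | OLS              | Adj. R-squared:     | -0.01      |
| Method:           | Least Squares    | F-statistic:        | -1.542e-14 |
| Date:             | Wed, 10 Apr 2019 | Prob (F-statistic): | 1.0        |
| Time:             | 15:06:52         | Log-Likelihood:     | -483.38    |
| No. Observations: | 100              | AIC:                | 970.8      |
| Df Residuals:     | 98               | BIC:                | 976.0      |
| Df Model:         | 1                |                     |            |
| Covariance Type:  | nonrobust        |                     |            |

|           | coef       | std err | t         | P>\|t\| | [0.025 | 0.975] |
|-----------|------------|---------|-----------|---------|--------|--------|
| Intercept | 34.0067    | 3.072   | 11.070    | 0.000   | 27.910 | 40.103 |
| x         | -9.437e-16 | 0.527   | -1.79e-15 | 1.000   | -1.045 | 1.045  |

|                |       |                   |         |
|----------------|-------|-------------------|---------|
| Omnibus:       | 14.29 | Durbin-Watson:    | 0.006   |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 9.864   |
| Skew:          | 0.638 | Prob(JB):         | 0.00721 |
| Kurtosis:      | 2.14  | Cond. No.         | 5.83    |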


## Row or Column-wise Function Application
* `apply()` is extremely powerful when used with some brainpower

In [13]:
# df_q.apply(np.mean, axis = 0)

# using a lambda: standardize each column
df_q.apply(lambda x: (x - np.mean(x)) / np.std(x), axis = 0);

# using a custom function with arguments (could also have been done with a lambda)
def add_and_subtract(df, sub = 1, add = 1):
    return df - sub + add
df_q.apply(add_and_subtract, args = (0,0));

# a bit more sophisticated: get the index of the observation with the longest c_value comment
df_all['c_value'].apply(lambda x: len(str(x))).idxmax()

2038

**Aggregation**
* *`aggregate()`* and *`transform()`*
* aggregation applies one or several aggregation operations in a single concise call
* the transformation method returns an object that is indexed the same as the original
   * like `aggregate()`, it accepts multiple operations at the same time instead of one-by-one
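
The difference between the two is easiest to see from the shape of the result. The following is a small sketch on a made-up DataFrame (`toy` is not part of the lecture data): `agg()` collapses each column to one value per function, while `transform()` keeps the original index.

```python
import pandas as pd

toy = pd.DataFrame({"q1": [1.0, 3.0, 5.0], "q2": [2.0, 2.0, 5.0]})

# aggregation: one row per function, the original index is gone
summary = toy.agg(["mean", "max"])

# transformation: same index and shape as the input
centered = toy.transform(lambda col: col - col.mean())
```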

In [14]:
# aggregating with a single function gives the same result as apply
df_q.agg(np.mean, axis = 0)

# aggregating with several functions is more interesting (you could easily write your own describe function!)
df_q.aggregate([np.mean, np.std, np.min, np.max], axis = 0)

# aggregating using a dictionary, i.e. column-specific aggregation
df_q.agg({'q1' : [np.mean], 'q2': np.std, 'q3': np.var})

|      | q1       | q2       | q3       |
|------|----------|----------|----------|
| mean | 4.156137 |          |          |
| std  |          | 1.071142 |          |
| var  |          |          | 0.977355 |


**Elementwise**
* `applymap()` (renamed to `DataFrame.map()` in pandas 2.1)
* not all functions can be vectorized ...

In [16]:
# some function
def l(x):
    return len(str(x))

# for series
df_all['c_value'].map(l);
# for dataframe
df_all[['c_value', 'c_improve']].applymap(l);

## Missing values

In [18]:
# share of missing observations for a specific column
df_all['q5'].isnull().mean()

0.28704510421553514

# Merge, join and concatenate

## Concat
* `concat()` combines Series and DataFrame objects along an axis, with optional set logic (union or intersection) for the indexes on the other axes
* `concat()` (and therefore the older `append()`, removed in pandas 2.0) makes a full copy of the data
    * calling it repeatedly, e.g. inside a loop, can slow down performance; collect the pieces in a list and concatenate once
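
A minimal sketch of both points, using two tiny made-up frames `a` and `b`: the `join` argument controls the set logic on the columns, and all pieces are passed to `concat()` in one list rather than one at a time.

```python
import pandas as pd

a = pd.DataFrame({"x": [1], "y": [2]})
b = pd.DataFrame({"x": [3], "z": [4]})

# default join='outer': union of the columns, missing cells become NaN
union = pd.concat([a, b], ignore_index=True)

# join='inner': only the columns shared by all pieces
inner = pd.concat([a, b], join="inner", ignore_index=True)
```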

## Join
* in-memory join operations, similar to relational databases like SQL
* you can see the comparison with SQL [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)


## Merge
* `merge()` serves as the entry point for all standard database join operations between DataFrame or named Series objects

* `pd.merge()` is a function in the pandas namespace, and also a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join

* merge methods (`how='left'`, `'right'`, `'outer'`, `'inner'`) correspond to relational algebra / SQL joins
* take care when merging repeatedly: overlapping column names get `_x` / `_y` suffixes
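
The sketch below illustrates the `how` argument and the suffix behaviour on two toy tables. The frames `left` and `right` and the suffix names are assumptions for the example only.

```python
import pandas as pd

left = pd.DataFrame({"course_code": ["A", "B", "C"], "rating": [4.5, 3.0, 4.8]})
right = pd.DataFrame({"course_code": ["A", "B", "D"], "rating": [4.0, 3.5, 2.0]})

# inner join on the key; the clashing 'rating' columns get _x / _y suffixes
inner = left.merge(right, on="course_code", how="inner")

# outer join with explicit suffixes; indicator=True adds a '_merge' column
# telling where each row came from
outer = left.merge(right, on="course_code", how="outer",
                   suffixes=("_2014", "_2015"), indicator=True)
```

Choosing explicit suffixes up front is one way to avoid a pile of `_x` / `_y` columns after several merges.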

## Join
* uses merge internally for the index-on-index (by default) and column(s)-on-index join
* DataFrame.join() is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. 
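
A short sketch of the two join flavours mentioned above, again on made-up frames: index-on-index (the default) and column-on-index via the `on` argument.

```python
import pandas as pd

ratings = pd.DataFrame({"rating": [4.5, 3.0]}, index=["A", "B"])
depts = pd.DataFrame({"department": ["ies", "kmv"]}, index=["A", "C"])

# index-on-index, left join by default: keeps the rows of 'ratings'
joined = ratings.join(depts)

# column-on-index: match the 'course_code' column against the index of 'depts'
courses = pd.DataFrame({"course_code": ["A", "C"], "year": [2014, 2015]})
on_column = courses.join(depts, on="course_code")
```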


# Reshaping and Pivot Tables
* data is often stored in so-called “stacked” or “record” format, let's look at [pd documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

* Pivoting: `pivot()` / `pivot_table()`
* Stacking & unstacking: `stack()` / `unstack()`
* Melting: `melt()`
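
The three families of reshaping operations can be sketched on one small long-format table. The data below are invented for illustration; only the method names come from pandas.

```python
import pandas as pd

# "stacked" / "record" format: one row per (course, question) pair
long = pd.DataFrame({"course": ["A", "A", "B", "B"],
                     "question": ["q1", "q2", "q1", "q2"],
                     "score": [5.0, 3.0, 4.0, 2.0]})

# pivot: long -> wide, one column per question
wide = long.pivot(index="course", columns="question", values="score")

# pivot_table: like pivot, but aggregates when there are several values per cell
mean_score = long.pivot_table(index="course", values="score", aggfunc="mean")

# melt: wide -> long again
back = wide.reset_index().melt(id_vars="course", var_name="question",
                               value_name="score")

# stack moves the column labels into the index, unstack reverses it
stacked = wide.stack()
unstacked = stacked.unstack()
```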


# Group By: split-apply-combine

* *Split* the data into groups
* *Apply* a function to each group
* *Combine* the results into a datastructure of our choosing


* the split step is straightforward
* in the apply step, we might wish to do one of the following:

    * Aggregation: compute a summary statistic (or statistics) for each group, e.g. (group means)
    * Transformation: perform some group-specific computations and return a like-indexed object, e.g. (Z-score within a group)
    * Filtration: discard some groups, according to a group-wise computation that evaluates True or False, e.g. discard data from groups with only a few members
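
To see the three steps separately, the following sketch does split-apply-combine by hand on a toy DataFrame and compares it with the one-line `groupby` version. The column names are borrowed from the course evaluation data, but the values are made up.

```python
import pandas as pd

df = pd.DataFrame({"department": ["ies", "ies", "kmv"],
                   "q1": [5.0, 3.0, 4.0]})

# split: iterate over (group name, chunk) pairs
# apply: compute the mean of q1 in each chunk
# combine: put the results into a Series
pieces = {dept: chunk["q1"].mean() for dept, chunk in df.groupby("department")}
by_hand = pd.Series(pieces)

# the same thing in one call
direct = df.groupby("department")["q1"].mean()
```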


* the name GroupBy should be quite familiar if you have used SQL-based tools (or itertools), in which you can write code like:

``SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2``

## Splitting an object into groups

* pandas objects can be split on any of their axes
* abstractly, grouping means providing a mapping of labels to group names (more on what the GroupBy object is later)
* a single group can be selected using `.get_group('label')`
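
A minimal `get_group()` sketch on invented data; with several grouping keys the label is a tuple.

```python
import pandas as pd

df = pd.DataFrame({"department": ["ies", "ies", "kmv"],
                   "year": [2014, 2015, 2014],
                   "q1": [5.0, 3.0, 4.0]})

# one key: the label is a scalar
ies = df.groupby("department").get_group("ies")

# two keys: the label is a tuple
ies_2014 = df.groupby(["department", "year"]).get_group(("ies", 2014))
```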

* the mapping can be specified many different ways:
    * a Python function, to be called on each of the axis labels.
    * a list or NumPy array of the same length as the selected axis.
    * a dict or Series, providing a label -> group name mapping.
    * for DataFrame objects, a string indicating a column to be used to group
        * `df.groupby('A')` is just syntactic sugar for `df.groupby(df['A'])`
    * for DataFrame objects, a string indicating an index level to be used to group.
    * a list of any of the above things.
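
The different ways of specifying the mapping can be tried on one toy DataFrame. The labels and group names below are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x", "y"], "B": [1, 2, 3, 4]},
                  index=["a1", "a2", "b1", "b2"])

# column name vs. the column itself: the same grouping
by_col = df.groupby("A")["B"].sum()
by_series = df.groupby(df["A"])["B"].sum()

# function, called on each index label (here: its first character)
by_func = df.groupby(lambda label: label[0])["B"].sum()

# dict mapping index label -> group name
by_dict = df.groupby({"a1": "first", "a2": "first",
                      "b1": "second", "b2": "second"})["B"].sum()

# list of the same length as the axis
by_list = df.groupby(["g1", "g1", "g1", "g2"])["B"].sum()
```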

* on a DataFrame, we obtain a GroupBy object by calling `groupby()`. With columns A and B we could naturally group by either A or B, or both
* example of added functionality: if we have a MultiIndex on columns A and B (`df2 = df.set_index(['A', 'B'])`), we can group by all levels but the specified ones
    * `df2.groupby(level=df2.index.names.difference(['B']))`
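
A runnable sketch of the MultiIndex case, assuming a small DataFrame with columns A, B and C. Wrapping the level names in `list()` is a choice made here to keep the argument a plain list.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y", "y"],
                   "B": ["one", "two", "one", "two"],
                   "C": [1, 2, 3, 4]})
df2 = df.set_index(["A", "B"])

# all index levels except 'B'
levels = list(df2.index.names.difference(["B"]))
result = df2.groupby(level=levels).sum()
```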

* pd Index objects support duplicate values
    * if a non-unique index is used as the group key, all values for the same index value will be in one group, and thus the output of aggregation functions will only contain unique index values

* complicated data manipulations can be expressed in terms of GroupBy operations 
    * efficiency not guaranteed

* by default the group keys are sorted during the groupby operation
    * pass `sort=False` for potential speedups
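
Two of the points above, grouping on a non-unique index and the `sort` option, can be sketched as follows (toy data only).

```python
import pandas as pd

# non-unique index: equal index values end up in the same group
s = pd.Series([1, 2, 3, 10, 20, 30], index=[1, 2, 3, 1, 2, 3])
totals = s.groupby(level=0).sum()

# group keys are sorted by default; sort=False keeps order of first appearance
df = pd.DataFrame({"key": ["b", "a", "b"], "val": [1, 2, 3]})
sorted_keys = df.groupby("key")["val"].sum().index.tolist()
unsorted_keys = df.groupby("key", sort=False)["val"].sum().index.tolist()
```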

### GroupBy object attributes
* the `groups` attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group
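
A quick look at the attribute on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "B": [1, 2, 3]})
grouped = df.groupby("A")

# dict-like: group name -> index labels of the rows in that group
groups = grouped.groups
labels_x = list(groups["x"])

# number of groups
n_groups = grouped.ngroups
```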


## Aggregating
* once the GroupBy object has been created, several methods are available to perform a computation on the grouped data


* the result of the aggregation will have the group names as the new index along the grouped axis
* in the case of multiple keys $\rightarrow$ the result is a MultiIndex by default, though this can be changed with the `as_index=False` option

* aggregating functions are the ones that reduce the dimension of the returned objects, e.g. `sum`, `mean`, `std`, `min`, `max`, `count`:
`df.groupby('A').aggregate(np.sum)`

* the built-in aggregating functions exclude NA values. Any function which reduces a Series to a scalar value is an aggregation function and will work; a trivial example is `df.groupby('A').agg(lambda ser: 1)`
    * aggregating with multiple functions: pass a list/dict of functions
    * the resulting columns are named after the functions themselves
* by passing a dict to `aggregate` you can apply a different aggregation to each column of a DataFrame
`df.groupby('A').agg({'C': np.sum, 'D': lambda x: np.std(x, ddof=1)})`
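
The options above can be sketched on a toy DataFrame with columns A to D. Function names are passed as strings here, which is a choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"],
                   "B": ["one", "two", "one"],
                   "C": [1.0, 2.0, 3.0],
                   "D": [10.0, 20.0, 30.0]})

# two keys: the group names form a MultiIndex ...
multi = df.groupby(["A", "B"])[["C", "D"]].sum()
# ... unless as_index=False keeps them as ordinary columns
flat = df.groupby(["A", "B"], as_index=False)[["C", "D"]].sum()

# several functions at once: result columns are named after the functions
several = df.groupby("A")["C"].agg(["sum", "mean"])

# a dict gives column-specific aggregations
per_col = df.groupby("A").agg({"C": "sum", "D": lambda x: x.std(ddof=1)})
```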

## Transformation 

* the `transform` method returns an object that is indexed the same (same size) as the one being grouped
* a typical use case is standardizing the data within each group
* the transform function must:
    * return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, `grouped.transform(lambda x: x.iloc[-1])`)
    * operate column-by-column on the group chunk
    * not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results
    * e.g. when using `fillna`, `inplace` must be `False`: `grouped.transform(lambda x: x.fillna(x.mean()))`
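
A minimal sketch of within-group standardization, assuming a toy table with a `department` column and one question column. The second call shows the broadcasting case, where a scalar per group is spread over all rows of the group.

```python
import pandas as pd

df = pd.DataFrame({"department": ["ies", "ies", "ies", "kmv", "kmv"],
                   "q1": [1.0, 3.0, 5.0, 2.0, 4.0]})

# z-score within each department; the result has one value per original row
df["q1_z"] = (df.groupby("department")["q1"]
                .transform(lambda x: (x - x.mean()) / x.std()))

# a scalar per group is broadcast back to the size of the group
df["q1_group_mean"] = df.groupby("department")["q1"].transform("mean")
```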

### expanding(), rolling()

## Filtration

* the `filter` method returns a subset of the original object: only the elements belonging to groups for which the filter function returns True
    * alternatively, with `dropna=False`, instead of dropping the offending groups we get a like-indexed object where the groups that do not pass the filter are filled with NaNs
* for DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion
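
The three bullets can be sketched on a toy DataFrame; the thresholds are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"department": ["ies", "ies", "ies", "kmv"],
                   "q1": [5.0, 4.0, 3.0, 2.0]})
grouped = df.groupby("department")

# keep only the rows of groups with at least two members
big = grouped.filter(lambda g: len(g) >= 2)

# same criterion, but keep the original index and fill failing groups with NaN
masked = grouped.filter(lambda g: len(g) >= 2, dropna=False)

# the criterion names the column it is based on explicitly
good = grouped.filter(lambda g: g["q1"].mean() > 3)
```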

## Flexible `apply`

* some operations on the grouped data might not fit into either the aggregate or transform categories
* `apply` can be substituted for both aggregate and transform in many standard use cases, and it can also handle some exceptional use cases

* `df.groupby('A').colname.std()` is more efficient than `df.groupby('A').std().colname`
    * the column is selected before the aggregation function is applied, so only that column is aggregated
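
As an illustration of an operation that is neither an aggregation nor a transformation, the sketch below takes the two largest values per group with `apply`. It also compares the two ways of computing a single column's standard deviation; the data are made up.

```python
import pandas as pd

df = pd.DataFrame({"department": ["ies", "ies", "kmv", "kmv", "kmv"],
                   "q1": [5.0, 3.0, 4.0, 2.0, 1.0]})

# top 2 values per group: the result size differs from both the number of
# groups and the number of rows, so it is neither agg nor transform
top2 = df.groupby("department")["q1"].apply(lambda s: s.nlargest(2))

# select the column first, then aggregate ...
fast = df.groupby("department")["q1"].std()
# ... instead of aggregating every column and selecting afterwards
slow = df.groupby("department").std()["q1"]
```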


## Groupby plotting

##  Date handling