# Lecture 07

by Martin Hronec

### Table of contents

0. [Advanced Pandas](#AdvPandas)
1. [Merge, join and concatenate](#merge)
2. [Reshaping](#reshape)
3. [Split-apply-combine](#groupby)
4. [Git collaboration](#gitco)

In [1]:
import pandas as pd
import numpy as np
import os

In [4]:
# prepare empty dataframe that will be populated file-by-file
df_all = pd.DataFrame(columns = ['number', 'course_code', 'course_title', 'teachers', 'seminar_leaders',
       'q1', 'q2', 'q3', 'q4', 'q5', 'q6', 'q7', 'q8', 'q9', 'q10', 'q11',
       'q12', 'q13', 'c_value', 'c_improve', 'department_code', 'year',
       'semester'])

# the column names are in Czech, so let's rename them
columns_translation = {'cislo_dot' : 'number',
                    'kod_predm' : 'course_code',
                    'nazev_predm' : 'course_title',
                    'prednasejici' : 'teachers',
                    'cvicici' : 'seminar_leaders',
                    't1': 'c_value',
                    't2': 'c_improve', 
                    'katedra_code' : 'department_code'}

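# data really start only in later years (adjust the slice to skip earlier files)
for d in os.listdir('unzipped_data/')[0:]:
    try:
        # year and semester are encoded in the file name
        year, semester = d.split('_')[1], d.split('_')[2][:2]
        # on_bad_lines replaces error_bad_lines (pandas >= 1.3); 'warn' skips malformed rows and reports them
        df_temp = pd.read_csv('unzipped_data/' + d, sep = ';',
                              header = 0, on_bad_lines = 'warn')
        df_temp = df_temp.rename(columns = columns_translation)
        # df_temp.dropna(how = 'all', inplace = True, axis = 1)
        df_temp['year'] = int(year)
        df_temp['semester'] = semester
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
        df_all = pd.concat([df_all, df_temp])
    except (IndexError, ValueError):
        print(d + ' has name not in the expected format.')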

b'Skipping line 1017: expected 21 fields, saw 22\nSkipping line 2087: expected 21 fields, saw 22\nSkipping line 2447: expected 21 fields, saw 22\nSkipping line 2736: expected 21 fields, saw 22\nSkipping line 2828: expected 21 fields, saw 23\nSkipping line 3461: expected 21 fields, saw 24\nSkipping line 3645: expected 21 fields, saw 24\nSkipping line 4490: expected 21 fields, saw 23\n'
b'Skipping line 1816: expected 21 fields, saw 22\nSkipping line 1877: expected 21 fields, saw 22\nSkipping line 3253: expected 21 fields, saw 24\nSkipping line 3270: expected 21 fields, saw 22\nSkipping line 3329: expected 21 fields, saw 22\n'
b'Skipping line 7136: expected 21 fields, saw 23\n'
b'Skipping line 4890: expected 21 fields, saw 22\nSkipping line 8304: expected 21 fields, saw 22\nSkipping line 8358: expected 21 fields, saw 22\n'
b'Skipping line 1145: expected 21 fields, saw 22\nSkipping line 1512: expected 21 fields, saw 22\n'
b'Skipping line 279: expected 21 fields, saw 22\nSkipping line 4057:

In [6]:
df_all.head()

Unnamed: 0,number,course_code,course_title,teachers,seminar_leaders,q1,q2,q3,q4,q5,...,q9,q10,q11,q12,q13,c_value,c_improve,department_code,year,semester
0,1,JSB008,Dějiny sociologie II.,Šanderová,Horák,3.0,3.0,4.0,4.0,3.0,...,4.0,4.0,3.0,4.0,3.0,,,ks,2010,ls
1,2,JSB008,Dějiny sociologie II.,Šanderová,Horák,3.0,4.0,4.0,5.0,4.0,...,5.0,3.0,3.0,4.0,3.0,,Vlastní názor by se mohl více cenit.,ks,2010,ls
2,3,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,4.0,4.0,5.0,3.0,...,5.0,4.0,4.0,3.0,3.0,,,ks,2010,ls
3,4,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,5.0,4.0,3.0,4.0,...,5.0,5.0,4.0,4.0,4.0,,,ks,2010,ls
4,5,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,4.0,4.0,3.0,5.0,...,5.0,4.0,4.0,4.0,5.0,,,ks,2010,ls


In [7]:
# define dataframe with questions
q_columns = [x for x in df_all.columns if 'q' in x]
df_q = df_all[q_columns]

In [8]:
# define dataframe with comments 
df_c = df_all[['c_value', 'c_improve']]

# define dataframe with teachers
df_t = df_all[['teachers', 'seminar_leaders']]

# Using functions on pandas objects

| Operation          | Function              |
|--------------------|-----------------------|
| Tablewise          | `pipe()`              |
| Row or Column-wise | `apply()`             |
| Aggregation        | `agg() / transform()` |
| Elementwise        | `applymap()`          |

**Tablewise**
* DataFrames and Series can be passed as arguments to functions
* if multiple functions need to be called in sequence, use the `pipe()` method; this style is called method chaining
    * often used in the data science setting
    * inspired by Unix pipes and the dplyr `%>%` operator in R


* Compare
    * `df = foo3(foo2(foo1(df, arg1=1), arg2=2), arg3=3)`

    * ```
        (df.pipe(foo1, arg1=1)
           .pipe(foo2, arg2=2)
           .pipe(foo3, arg3=3))
        ```
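A minimal sketch of the comparison above on a toy DataFrame. The helper functions `drop_missing`, `add_total` and `keep_top` are hypothetical and exist only for this illustration; any function that takes a DataFrame and returns one can be chained the same way.

```python
import pandas as pd

# Hypothetical helpers: each takes a DataFrame and returns a new DataFrame.
def drop_missing(df, subset):
    return df.dropna(subset=subset)

def add_total(df, cols, name="total"):
    return df.assign(**{name: df[cols].sum(axis=1)})

def keep_top(df, by, n=2):
    return df.nlargest(n, by)

toy = pd.DataFrame({"q1": [3.0, 4.0, None, 5.0], "q2": [4.0, 4.0, 2.0, 1.0]})

# nested calls have to be read inside-out
nested = keep_top(add_total(drop_missing(toy, subset=["q1"]), cols=["q1", "q2"]),
                  by="total", n=2)

# the pipe chain reads top-to-bottom, in the order of execution
piped = (toy.pipe(drop_missing, subset=["q1"])
            .pipe(add_total, cols=["q1", "q2"])
            .pipe(keep_top, by="total", n=2))
```

Both versions give the same result; the chained one is usually easier to extend with another step.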


In [9]:
# prepare some toy dataframe
import statsmodels.formula.api as sm
x = np.linspace(-10,10,100)
y = x**2
ols_data = pd.DataFrame({'x': x, 'y': y})

In [10]:
# method chaining way, with pipe(function, arguments)
(ols_data.pipe((sm.ols, 'data'), 'y ~ x')
 .fit()
 .summary()
)

0,1,2,3
Dep. Variable:,y,R-squared:,-0.0
Model:,OLS,Adj. R-squared:,-0.01
Method:,Least Squares,F-statistic:,-1.542e-14
Date:,"Tue, 19 Nov 2019",Prob (F-statistic):,1.0
Time:,15:22:34,Log-Likelihood:,-483.38
No. Observations:,100,AIC:,970.8
Df Residuals:,98,BIC:,976.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,34.0067,3.072,11.070,0.000,27.910,40.103
x,-1.665e-16,0.527,-3.16e-16,1.000,-1.045,1.045

0,1,2,3
Omnibus:,14.29,Durbin-Watson:,0.006
Prob(Omnibus):,0.001,Jarque-Bera (JB):,9.864
Skew:,0.638,Prob(JB):,0.00721
Kurtosis:,2.14,Cond. No.,5.83


## Row or Column-wise Function Application
* `apply()` is extremely powerful when used with some brainpower
* it also encompasses transformation and aggregation

In [11]:
# simple
df_q.apply(np.mean, axis = 0);

In [12]:
# handy lambdas
df_q.apply(np.mean, axis = 0) == df_q.apply(lambda x: np.mean(x), axis = 0);

* if you don't want to define a function outside of `apply`, you can use a `lambda` to pass x through several functions within a single `apply` call

In [13]:
# why are lambdas useful? using lambda
df_q.apply(lambda x: (x - np.mean(x)) / np.std(x), axis = 0);

In [14]:
# using a custom function with arguments (could also have been done with a lambda)
def add_and_subtract(df, sub = 1, add = 1):
    return df - sub + add
df_q.apply(add_and_subtract, args = (0,0));


In [15]:
# A little bit more sophisticated:  e.g. get index of the observation with the longest value comment
df_c['c_value'].apply(lambda x: len(str(x))).idxmax()

2038

**Aggregation**
* *`aggregate()`* and *`transform()`*
* `aggregate()` allows multiple aggregation operations in a single concise call
* `transform()` returns an object that is indexed the same as the original
   * like `aggregate()`, it accepts multiple operations at the same time, instead of one-by-one
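A small sketch of the difference between the two methods, on a toy DataFrame invented for this illustration: aggregation reduces each column to one value per function, while transformation keeps the original index.

```python
import pandas as pd

toy = pd.DataFrame({"q1": [1.0, 3.0, 5.0], "q2": [2.0, 2.0, 5.0]})

# aggregation: each column is reduced to one value per function
agg_out = toy.agg(["mean", "max"])

# transformation: the result has the same index and shape as the input
demeaned = toy.transform(lambda col: col - col.mean())

print(agg_out)
print(demeaned)
```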

In [18]:
# aggregating with a single function is the same as apply
df_q.agg(np.mean, axis = 0)

# aggregating with several functions is more interesting (you could easily write your own describe function!)
df_q.aggregate([np.mean, np.std, np.min, np.max], axis = 0)

Unnamed: 0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13
mean,4.208705,3.530888,4.291746,4.447318,4.042199,4.308994,4.522126,4.161178,2.122761,4.154364,3.721939,4.013228,4.261692
std,0.882609,0.980503,0.897745,0.840287,1.048637,0.878847,0.779398,0.997058,1.515671,0.921508,1.140794,0.984485,0.955209
amin,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
amax,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [19]:
# compare with apply
df_q.apply([np.mean, np.std, np.min, np.max], axis = 0)

Unnamed: 0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13
mean,4.208705,3.530888,4.291746,4.447318,4.042199,4.308994,4.522126,4.161178,2.122761,4.154364,3.721939,4.013228,4.261692
std,0.882609,0.980503,0.897745,0.840287,1.048637,0.878847,0.779398,0.997058,1.515671,0.921508,1.140794,0.984485,0.955209
amin,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
amax,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0


In [20]:
# aggregating using dictionary, i.e. column specific aggregation
df_q.agg({'q1' : [np.mean], 'q2': np.std, 'q3': np.var})

Unnamed: 0,q1,q2,q3
mean,4.208705,,
std,,0.980503,
var,,,0.805947


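**Elementwise**
* `applymap()` for DataFrames, `map()` for Series (from pandas 2.1 on, `DataFrame.map()` is the preferred name for `applymap()`)
* not all functions can be vectorized, so sometimes a plain Python function has to be applied element by element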

In [21]:
# some function
def l(x):
    return len(str(x))

# for series
df_all['c_value'].map(l);

In [22]:
# for dataframe
df_all[['c_value', 'c_improve']].applymap(l);

## Missing values

* [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)

In [24]:
# share of missing observations in a specific column
df_all['q5'].isnull().mean()

0.25073325041077377
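A short sketch of the most common missing-value operations, on a toy DataFrame made up for this illustration. Filling with the column mean is only one possible imputation choice, used here as an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"q1": [4.0, np.nan, 5.0, 3.0],
                    "q5": [np.nan, np.nan, 2.0, 4.0]})

# share of missing values per column (the mean of a boolean mask)
missing_share = toy.isna().mean()

# keep only the rows without any missing value
complete = toy.dropna(how="any")

# fill the gaps with the column mean
filled = toy.fillna(toy.mean())
```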

# Merge, join and concatenate

## Concat
* `concat()` combines Series and DataFrame objects along an axis, with various kinds of set logic for the indexes on the other axes; relational-algebra functionality is covered by the join / merge-type operations
* `concat()` (and therefore the older `append()`, which was removed in pandas 2.0) makes a full copy of the data
    * calling it repeatedly, e.g. inside a loop, can slow down performance
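One way to avoid the repeated copying is to collect the pieces in a list and concatenate once at the end. The sketch below uses hypothetical in-memory chunks in place of the per-file DataFrames.

```python
import pandas as pd

# hypothetical per-file chunks standing in for the yearly CSV files
chunks = [pd.DataFrame({"q1": [i, i + 1], "year": 2010 + i}) for i in range(3)]

# collect first, concatenate once: a single copy instead of one per iteration
combined = pd.concat(chunks, axis=0, ignore_index=True)
```

`ignore_index=True` rebuilds a clean 0..n-1 index instead of repeating the index of every chunk.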

In [27]:
A = df_q.head(10)
B = df_q.head(10)

In [28]:
pd.concat([A,B], axis = 0);

## Merge
* `merge()` serves as the entry point for all standard database join operations between DataFrame or named Series objects

* `pd.merge()` is a function in the pandas namespace (also available as a DataFrame instance method)
* the `how` argument (`'left'`, `'right'`, `'outer'`, `'inner'`) selects the merge method, mirroring the joins of relational algebra
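A minimal sketch of how the merge methods differ, using two toy tables invented for this illustration (the column names only mimic the survey data).

```python
import pandas as pd

left = pd.DataFrame({"course_code": ["A", "B", "C"], "q1": [4.0, 3.0, 5.0]})
right = pd.DataFrame({"course_code": ["B", "C", "D"], "teachers": ["X", "Y", "Z"]})

# inner: only keys present in both tables
inner = pd.merge(left, right, on="course_code", how="inner")

# outer: keys present in either table; indicator=True adds a _merge column
outer = pd.merge(left, right, on="course_code", how="outer", indicator=True)

# left: all keys from the left table, missing matches become NaN
left_join = left.merge(right, on="course_code", how="left")
```

The `_merge` column of `outer` is a handy check of where each row came from (`left_only`, `right_only` or `both`).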

In [29]:
??pd.merge


In [30]:
df_q.head()

Unnamed: 0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13
0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0
1,3.0,4.0,4.0,5.0,4.0,4.0,4.0,3.0,5.0,3.0,3.0,4.0,3.0
2,4.0,4.0,4.0,5.0,3.0,3.0,4.0,4.0,5.0,4.0,4.0,3.0,3.0
3,4.0,5.0,4.0,3.0,4.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,4.0
4,4.0,4.0,4.0,3.0,5.0,4.0,4.0,3.0,5.0,4.0,4.0,4.0,5.0


In [31]:
df_t.head()

Unnamed: 0,teachers,seminar_leaders
0,Šanderová,Horák
1,Šanderová,Horák
2,Šanderová,Horák
3,Šanderová,Horák
4,Šanderová,Horák


In [32]:
pd.merge(df_q, df_t, how = 'inner', left_index = True, right_index = True);

* be careful when merging repeatedly: columns that already exist in both frames are kept twice, with `_x` and `_y` suffixes in their names
    * to see this, run the merging code below two times in a row

In [33]:
df_problem = df_q.copy(deep = True)

In [34]:
df_problem = pd.merge(df_problem, df_t, how = 'inner', left_index = True, right_index = True)
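The suffix problem can be reproduced on a toy example. The frames `q` and `t` below are invented stand-ins for `df_q` and `df_t`; the `suffixes` argument is one way to keep the column names readable.

```python
import pandas as pd

q = pd.DataFrame({"q1": [3.0, 4.0]})
t = pd.DataFrame({"teachers": ["X", "Y"]})

once = pd.merge(q, t, left_index=True, right_index=True)

# merging the same columns again creates an overlap, resolved with _x / _y
twice = pd.merge(once, t, left_index=True, right_index=True)
print(list(twice.columns))  # ['q1', 'teachers_x', 'teachers_y']

# explicit suffixes make the origin of each column visible
labelled = pd.merge(once, t, left_index=True, right_index=True,
                    suffixes=("", "_again"))
```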

## Join
* uses merge internally for the index-on-index (by default) and column(s)-on-index join
* `DataFrame.join()` is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame

* similar to relational databases like SQL
    * you can see the comparison with SQL [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

In [35]:
df_q.join(df_t);

# Reshaping and Pivot Tables
* data is often stored in so-called “stacked” or “record” format; see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html)

In [36]:
df_all.head()

Unnamed: 0,number,course_code,course_title,teachers,seminar_leaders,q1,q2,q3,q4,q5,...,q9,q10,q11,q12,q13,c_value,c_improve,department_code,year,semester
0,1,JSB008,Dějiny sociologie II.,Šanderová,Horák,3.0,3.0,4.0,4.0,3.0,...,4.0,4.0,3.0,4.0,3.0,,,ks,2010,ls
1,2,JSB008,Dějiny sociologie II.,Šanderová,Horák,3.0,4.0,4.0,5.0,4.0,...,5.0,3.0,3.0,4.0,3.0,,Vlastní názor by se mohl více cenit.,ks,2010,ls
2,3,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,4.0,4.0,5.0,3.0,...,5.0,4.0,4.0,3.0,3.0,,,ks,2010,ls
3,4,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,5.0,4.0,3.0,4.0,...,5.0,5.0,4.0,4.0,4.0,,,ks,2010,ls
4,5,JSB008,Dějiny sociologie II.,Šanderová,Horák,4.0,4.0,4.0,3.0,5.0,...,5.0,4.0,4.0,4.0,5.0,,,ks,2010,ls


* Pivoting: `pivot()` and `pivot_table()`
    * `pivot()` provides general purpose pivoting with various data types (strings, numerics, etc.)
    * `pivot_table()` for pivoting with aggregation of numeric data
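A small sketch of the practical difference, on toy data invented for this illustration: `pivot_table()` aggregates duplicate index/column pairs (with the mean by default), while `pivot()` only reshapes and therefore fails on duplicates.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2010, 2010, 2010, 2011],
    "teachers": ["X", "X", "Y", "X"],
    "q1": [3.0, 5.0, 4.0, 2.0],
})

# duplicate (year, teachers) pairs are aggregated, by default with the mean
table = toy.pivot_table(index="year", columns="teachers", values="q1")

# pivot only reshapes, so the duplicate (2010, 'X') pair raises a ValueError
try:
    toy.pivot(index="year", columns="teachers", values="q1")
    pivot_failed = False
except ValueError:
    pivot_failed = True
```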
    

In [37]:
??pd.pivot


In [38]:
??pd.pivot_table


In [39]:
df_all.pivot_table(index = 'year', columns = 'teachers', values = 'q1').head(10);

In [40]:
df_all.pivot_table(index = 'year', columns = 'teachers', values = 'q1', aggfunc = np.mean).head(10);

* Stacking & unstacking: `stack()` and `unstack()`

In [41]:
df_all.stack()

0     number                                                             1
      course_code                                                   JSB008
      course_title                                   Dějiny sociologie II.
      teachers                                                   Šanderová
      seminar_leaders                                                Horák
                                               ...                        
6994  q13                                                                5
      c_value            Ja kurz absolvovala s panem docentem Malym - j...
      department_code                                                   kz
      year                                                            2017
      semester                                                          zs
Length: 2350957, dtype: object
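On a toy frame (invented for this illustration) it is easier to see what the two operations do: `stack()` moves the column labels into an inner index level, and `unstack()` moves them back.

```python
import pandas as pd

toy = pd.DataFrame({"q1": [3.0, 4.0], "q2": [5.0, 1.0]}, index=["a", "b"])

# stack: column labels become the inner level of a MultiIndex -> a long Series
stacked = toy.stack()

# unstack: the inner index level moves back into the columns
restored = stacked.unstack()

print(stacked)
print(restored)
```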

* Melting: `melt()` or `wide_to_long()`

In [114]:
# melting into long format
df_all.melt(id_vars = ['course_title', 'teachers']).head()

Unnamed: 0,course_title,teachers,variable,value
0,Úvod do sociologie práce,Kuchař,number,1
1,Úvod do sociologie práce,Kuchař,number,2
2,Úvod do sociologie práce,Kuchař,number,3
3,Úvod do sociologie práce,Kuchař,number,4
4,Úvod do sociologie práce,Kuchař,number,5
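By default `melt()` unpivots every column that is not listed in `id_vars`, which is why `number` shows up as a variable above. A sketch on a toy frame (invented for this illustration) that restricts melting to the question columns and renames the output headers:

```python
import pandas as pd

wide = pd.DataFrame({
    "course_title": ["A", "B"],
    "q1": [3.0, 4.0],
    "q2": [5.0, 1.0],
})

# value_vars restricts melting to the question columns;
# var_name / value_name replace the default 'variable' / 'value' headers
long = wide.melt(id_vars="course_title", value_vars=["q1", "q2"],
                 var_name="question", value_name="answer")
```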


# Group By: split-apply-combine


* *Split* the data into groups
* *Apply* a function to each group
* *Combine* the results into a data structure of our choosing

* look at the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html)


* the split step is straightforward
* in the apply step, we might wish to do one of the following:

    * Aggregation: compute a summary statistic (or statistics) for each group, e.g. group means
    * Transformation: perform some group-specific computations and return a like-indexed object, e.g. a z-score within a group
    * Filtration: discard some groups, according to a group-wise computation that evaluates True or False, e.g. discard data from groups with only a few members
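The three kinds of apply step can be sketched on a toy table (invented for this illustration; the column names only mimic the survey data):

```python
import pandas as pd

toy = pd.DataFrame({
    "course": ["A", "A", "A", "B", "B", "C"],
    "q1": [3.0, 4.0, 5.0, 2.0, 4.0, 5.0],
})
grouped = toy.groupby("course")["q1"]

# aggregation: one value per group
means = grouped.mean()

# transformation: same index as the input, here the deviation from the group mean
deviation = grouped.transform(lambda s: s - s.mean())

# filtration: keep only rows from groups with at least two responses
big_enough = toy.groupby("course").filter(lambda g: len(g) >= 2)
```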


* the name GroupBy should be quite familiar if you have used SQL-based tools (or itertools), in which you can write code like:

``SELECT Column1, Column2, mean(Column3), sum(Column4) FROM SomeTable GROUP BY Column1, Column2``

In [42]:
# Average q1 response for each course
df_all.groupby('course_title')['q1'].mean()

course_title
#BlackLivesMatter and Racial Justice                                       4.500000
#Humansonthemove: a workshop on current aspects of migration in Europe     4.333333
(Po)válečná společenství západního Balkánu po rozpadu Východního bloku     4.923077
20th Century American Literature                                           4.000000
20th Century Black Popular Music, Globalization, and Political Identity    4.500000
                                                                             ...   
Žurnalistická tvorba III                                                   3.481481
Žurnalistická tvorba III - Časopisecká tvorba                              4.708955
Žurnalistická tvorba III.                                                  4.153846
Žurnalistické kauzy                                                        4.777778
Žurnalistika a feminismus                                                  4.436364
Name: q1, Length: 3133, dtype: float64

In [43]:
# Average q1 response for each course and year
df_all.groupby(['course_title','year'])['q1'].mean()

course_title                                                            year
#BlackLivesMatter and Racial Justice                                    2015    4.500000
#Humansonthemove: a workshop on current aspects of migration in Europe  2015    4.333333
(Po)válečná společenství západního Balkánu po rozpadu Východního bloku  2015    4.857143
                                                                        2016    5.000000
                                                                        2017    5.000000
                                                                                  ...   
Žurnalistika a feminismus                                               2012    4.391304
                                                                        2014    4.437500
                                                                        2015    4.727273
                                                                        2016    4.750000
                                 

## Splitting an object into groups

* pandas objects can be split on any of their axes.
* the abstract definition of grouping is to provide a mapping of labels to group names. (more on what the GroupBy object is later)
* single group can be selected using `.get_group('label')`

In [149]:
df_all.groupby(['course_title','year']).get_group(('#BlackLivesMatter and Racial Justice',2015))

Unnamed: 0,number,course_code,course_title,teachers,seminar_leaders,q1,q2,q3,q4,q5,...,q9,q10,q11,q12,q13,c_value,c_improve,department_code,year,semester
547,548,JMM679,#BlackLivesMatter and Racial Justice,"Carter,D.",,5.0,5.0,5.0,5.0,5.0,...,2.0,5.0,2.0,2.0,4.0,Prof. Carter is one of the best scholars I had...,Prepare the list of readings in advance - so t...,kas,2015,ls
2879,2880,JMM679,#BlackLivesMatter and Racial Justice,"Carter,D.",,4.0,2.0,5.0,5.0,3.0,...,2.0,4.0,2.0,4.0,5.0,,,kas,2015,ls



* complicated data manipulations can be expressed in terms of GroupBy operations 
    * efficiency not guaranteed
* by default the group keys are sorted during the groupby operation
    * pass `sort=False` for potential speedups



* once the GroupBy object has been created, several methods are available to perform a computation on the grouped data

## Aggregating
* the result of the aggregation will have the group names as the new index along the grouped axis
* in the case of multiple keys $\rightarrow$ the result is a MultiIndex by default, though this can be changed with the `as_index=False` option

* aggregating functions are the ones that reduce the dimension of the returned objects, for example `df.groupby('A').aggregate(np.sum)`

* the built-in aggregating functions (such as `mean()` or `sum()`) exclude NA values. Any function which reduces a Series to a scalar value is an aggregation function and will work; a trivial example is `df.groupby('A').agg(lambda ser: 1)`
    * aggregating multiple functions: pass a list/dict of functions to do aggregation
    * the resulting aggregations are named for the functions themselves
* by passing a dict to `aggregate` you can apply a different aggregation to the columns of a DataFrame:
`df.groupby('A').agg({'C': np.sum, 'D': lambda x: np.std(x, ddof=1)})`
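A sketch of these aggregation options on a toy table (invented for this illustration). String names such as `"mean"` are used instead of the NumPy functions; both spellings select the same aggregation.

```python
import pandas as pd

toy = pd.DataFrame({
    "course": ["A", "A", "B", "B"],
    "year": [2010, 2011, 2010, 2010],
    "q1": [3.0, 5.0, 2.0, 4.0],
    "q2": [4.0, 4.0, 1.0, 3.0],
})

# two keys -> the result is indexed by a MultiIndex
multi = toy.groupby(["course", "year"])["q1"].mean()

# as_index=False keeps the keys as ordinary columns
flat = toy.groupby(["course", "year"], as_index=False)["q1"].mean()

# a dict applies a different aggregation to each column
per_column = toy.groupby("course").agg({"q1": "mean", "q2": ["min", "max"]})
```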

## Transformation 

* the `transform` method returns an object that is indexed the same (same size) as the one being grouped
* a typical use: standardizing the data within each group
* the transform function must:
    * return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, `grouped.transform(lambda x: x.iloc[-1])`)
    * operate column-by-column on the group chunk
    * not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results
    * e.g. when using `fillna`, `inplace` must be `False`: `grouped.transform(lambda x: x.fillna(x.mean(), inplace=False))`
    
**Two major differences between `apply` and `transform`**

* `apply` implicitly passes all the columns for each group as a DataFrame to the custom function, while `transform` passes each column for each group as a Series to the custom function
* the custom function passed to `apply` can return a scalar, a Series or a DataFrame (or a numpy array or even a list). The custom function passed to `transform` must return a sequence (a one-dimensional Series, array or list) of the same length as the group, or a scalar that is broadcast to that length
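The two differences can be illustrated on a toy table (invented for this illustration): with `apply` the function can combine several columns of the group, with `transform` it works on one column at a time and the result keeps the original index.

```python
import pandas as pd

toy = pd.DataFrame({
    "course": ["A", "A", "B"],
    "q1": [3.0, 5.0, 4.0],
    "q2": [1.0, 2.0, 3.0],
})
grouped = toy.groupby("course")[["q1", "q2"]]

# apply: the function receives the whole group as a DataFrame,
# so columns can be combined; here it returns one scalar per group
gap = grouped.apply(lambda g: (g["q1"] - g["q2"]).mean())

# transform: column-wise operation, the result keeps the original index
centered = grouped.transform(lambda s: s - s.mean())
```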


In [167]:
zscore = lambda x: (x - x.mean()) / x.std()
df_all.dropna(subset=['course_title']).groupby('course_title')['q1'].transform(zscore)

0      -0.219876
1      -1.699041
2            NaN
3      -0.219876
4       1.259289
          ...   
7525   -0.500039
7526    1.071513
7527   -0.500039
7528   -0.500039
7529    1.071513
Name: q1, Length: 129668, dtype: float64

## Filtration

* the `filter` method returns a subset of the original object (only the elements belonging to groups that pass the filter)
    * alternatively, instead of dropping the offending groups, we can return a like-indexed object where the groups that do not pass the filter are filled with NaNs (`dropna=False`)
* for DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion
* `df.groupby('A').colname.std()` is more efficient than `df.groupby('A').std().colname`
    * in the first form the column is selected before the aggregation function is applied, so the other columns are never aggregated
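A minimal sketch of filtration on a toy table (invented for this illustration); the threshold of three responses is an arbitrary assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "course": ["A", "A", "A", "B"],
    "q1": [3.0, 4.0, 5.0, 1.0],
})
by_course = toy.groupby("course")

# keep only rows from courses with at least three responses
kept = by_course.filter(lambda g: len(g) >= 3)

# dropna=False keeps the original shape and fills failing groups with NaN
masked = by_course.filter(lambda g: len(g) >= 3, dropna=False)

# selecting the column before aggregating avoids computing unused columns
q1_std = by_course["q1"].std()
```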

## Styling
* more in [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)
* styling is accomplished using CSS

In [150]:
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

In [151]:
df_q.head(5).style.apply(highlight_max)

Unnamed: 0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11,q12,q13
0,4.0,3,5.0,5,4,,,,4.0,4,4,4,4
1,3.0,3,4.0,4,4,,,,,3,4,4,3
2,,3,4.0,4,3,,,,4.0,4,3,3,3
3,4.0,3,,4,5,,,,,4,3,4,4
4,5.0,4,4.0,4,5,,,,5.0,3,3,4,5


## Options and settings

* pandas has an options system that lets you customize some aspects of its behaviour

In [171]:
dir(pd.options)

['compute', 'display', 'io', 'mode', 'plotting']

In [182]:
# example of display options
dir(pd.options.display),

(['chop_threshold',
  'colheader_justify',
  'column_space',
  'date_dayfirst',
  'date_yearfirst',
  'encoding',
  'expand_frame_repr',
  'float_format',
  'html',
  'large_repr',
  'latex',
  'max_categories',
  'max_columns',
  'max_colwidth',
  'max_info_columns',
  'max_info_rows',
  'max_rows',
  'max_seq_items',
  'memory_usage',
  'min_rows',
  'multi_sparse',
  'notebook_repr_html',
  'pprint_nest_depth',
  'precision',
  'show_dimensions',
  'unicode',
  'width'],)

In [184]:
pd.get_option('display.max_rows')

60

In [195]:
pd.set_option("display.max_rows",3)

In [196]:
df_all

Unnamed: 0,number,course_code,course_title,teachers,seminar_leaders,q1,q2,q3,q4,q5,...,q9,q10,q11,q12,q13,c_value,c_improve,department_code,year,semester
0,1,JSB119,Úvod do sociologie práce,Kuchař,,4.0,3.0,5.0,5.0,4.0,...,4.0,4.0,4.0,4.0,4.0,,,ks,2012,zs
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7529,26692,JPM574,Moderní strany a stranické systémy v Evropě,Brunclík,,5.0,3.0,5.0,5.0,5.0,...,2.0,5.0,5.0,5.0,5.0,,,kp,2014,zs


In [197]:
pd.reset_option("display.max_rows")

## Date handling
* when working with time series, please look at the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) for proper date handling
    * hint: there is no need to keep dates as strings; use proper datetime types
* common functionality such as rolling and expanding windows is already in place
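A short sketch of both points, on a hypothetical daily series: dates are parsed once into timestamps and compared as dates, and rolling / expanding windows are available as methods.

```python
import pandas as pd

# hypothetical daily series
idx = pd.date_range("2019-11-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# a string is parsed into a proper timestamp, then compared as a date
cutoff = pd.to_datetime("2019-11-03")
after = ts[ts.index >= cutoff]

# rolling: fixed-size window; expanding: everything up to the current row
rolling_mean = ts.rolling(window=3).mean()
expanding_sum = ts.expanding().sum()
```

The first two values of `rolling_mean` are NaN because the window of three observations is not yet full.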

# Modules and Packages

## Modules

In [1]:
# what happens when we execute this statement?
import lame_module as lame

* the [interpreter](https://www.wikiwand.com/en/Interpreter_(computing)) searches for "lame_module.py" in a list of directories assembled from the following sources:
    * the directory from which the input script was run (or the current working directory if the interpreter is run interactively)
    * the list of directories contained in the PYTHONPATH environment variable (if set)
    * an installation-dependent list of directories configured at the time Python is installed
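The search can be inspected without importing anything. A minimal sketch using only the standard library; the standard-library module `json` and the deliberately non-existent module name are just examples chosen for this illustration.

```python
import importlib.util
import sys

# sys.path is the ordered list of directories that `import` searches
search_path = list(sys.path)

# find_spec reports where a module would be loaded from, without importing it
spec = importlib.util.find_spec("json")
origin = spec.origin

# a module that is in none of the searched directories yields None
missing = importlib.util.find_spec("surely_not_an_installed_module_xyz")

print(origin)
print(missing)
```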


In [44]:
import sys
# The exact contents of sys.path are installation-dependent.
sys.path

['C:\\Users\\Martin Hronec\\Projects\\phd\\DPP_IES\\08',
 'C:\\Users\\Martin Hronec\\Miniconda3\\python37.zip',
 'C:\\Users\\Martin Hronec\\Miniconda3\\DLLs',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib',
 'C:\\Users\\Martin Hronec\\Miniconda3',
 '',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages\\win32',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages\\win32\\lib',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages\\Pythonwin',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages\\pywin32_ctypes-0.2.0-py3.7.egg',
 'C:\\Users\\Martin Hronec\\Miniconda3\\lib\\site-packages\\IPython\\extensions',
 'C:\\Users\\Martin Hronec\\.ipython']

* To ensure your module (say, mod.py) is found, do one of the following:
    * Put mod.py in the directory where the input script is located (see Vitek's code from the previous lecture)
    * Add the directory where mod.py is located to the PYTHONPATH environment variable
    * Put mod.py in one of the installation-dependent directories

Additional option:

In [3]:
# put lame_module.py into that folder
sys.path.append(r'XXX')
import lame_module
lame_module.__file__

'C:\\Users\\Martin Hronec\\Projects\\phd\\DPP_IES\\08\\lame_module.py'

* We want to distinguish between a file loaded as a module and a file run as a standalone script
    * when a .py file is imported as a module, Python sets the special dunder variable `__name__` to the name of the module
    * when a file is run as a standalone script, `__name__` is (creatively) set to the string `'__main__'`
    
* Modules are often designed with the capability to run as a standalone script
    * this way we can test the functionality contained within the module, i.e. [unit testing](https://www.wikiwand.com/en/Unit_testing)
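A self-contained sketch of the `__name__` mechanism. The module `demo_name_module` is hypothetical and is written to a temporary folder only for this demonstration; `runpy.run_path` is used here to mimic running the file as a script.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

# hypothetical module with a guarded self-test
source = '''
def double(x):
    return 2 * x

ran_self_test = False
if __name__ == "__main__":
    # executed only when the file is run as a script
    assert double(2) == 4
    ran_self_test = True
'''

tmp_dir = Path(tempfile.mkdtemp())
module_path = tmp_dir / "demo_name_module.py"
module_path.write_text(source)

# run as a standalone script: __name__ is '__main__', the guarded block runs
as_script = runpy.run_path(str(module_path), run_name="__main__")

# import as a module: __name__ is the module name, the guarded block is skipped
sys.path.insert(0, str(tmp_dir))
importlib.invalidate_caches()
as_module = importlib.import_module("demo_name_module")

print(as_script["__name__"], as_script["ran_self_test"])
print(as_module.__name__, as_module.ran_self_test)
```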

## Packages

* Allow for a hierarchical structuring of the module namespace using dot notation
    * e.g. `pkg.module1` refers to `module1` inside the package `pkg`
* Creating a package is easy, we can use the OS hierarchical file structure

**Note:** if a file named `__init__.py` is present in a package directory, it is invoked when the package or a module in the package is imported.

In [45]:
import pkg

In [46]:
pkg.variable_from_init

42

In [47]:
from pkg import module1

* `__init__.py` can also be used to import modules from a package automatically

* since Python 3.3, **implicit namespace packages** exist, so a directory without `__init__.py` can still be imported as a package
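A self-contained sketch of a package whose `__init__.py` defines a variable and imports a submodule automatically. The package `demo_pkg` and its contents are hypothetical and are created in a temporary folder only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# hypothetical package layout:
#   demo_pkg/__init__.py
#   demo_pkg/module1.py
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text(
    "variable_from_init = 42\n"
    "from . import module1  # automatic import of a submodule\n"
)
(pkg_dir / "module1.py").write_text(
    "def hello():\n"
    "    return 'hello from module1'\n"
)

# make the folder that contains the package visible to the import system
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

# __init__.py ran on import: its variable and the submodule are both available
value = demo_pkg.variable_from_init
greeting = demo_pkg.module1.hello()
```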

## Packaging

* it is easy to make your package installable
    * from a local repository
    * from your GitHub repository (you can share!)
    * from [the Python Package Index (PyPI)](https://pypi.org/) 

* to make the ies_scraper you've seen during the previous lecture installable, you could follow these steps:

* let's modify setup.py to configure our package

```
from setuptools import setup

setup(
    name='ies_scraper',
    version='0.0.1',
    description='Offers set of tools for scraping IES website.',
    url='git@github.com:martinhronec/ies_scraper.git',
    author='Vit Machacek',
    author_email='vit.machacek@cerge-ei.cz',
    license='unlicense',
)
```
* we can then push our changes to our remote GitHub repo

`git remote add origin git@github.com:martinhronec/ies_scraper.git`

`git push -u origin master`

* if we want to install the package (2 options):
    * clone the remote repo locally, then run `pip install .` inside the cloned repository
        * follow with `pip show ies_scraper` to check that the package was successfully installed
    * or install straight from GitHub: `pip install git+https://github.com/user/repository.git@branch`

* FOR YOU: Add functionality of IES_downloader.py and IES_pages.py to our ies_scraper package