## Working with libraries - an introduction



A Python library is simply a file which contains function
definitions. The key here is that this file contains only function
definitions, and no other program code. The file must also be written
in plain Python; it cannot be in notebook format. This is no major
inconvenience, since most of the time we only want to use the
functions in the file rather than change them. Library development
is, however, best done with a full-fledged Python IDE (Integrated
Development Environment).
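
As a rough illustration, such a file could look like the sketch below. The file name `my_tools.py` and both functions are hypothetical and only serve to show the shape of a library file.

```python
# Hypothetical library file, e.g. saved as my_tools.py (the name is an assumption).
# It contains only function definitions, so importing it runs no other code.


def greet(name: str) -> str:
    """Return a short greeting for the given name."""
    return f"Hello, {name}!"


def cube(x: float) -> float:
    """Return x raised to the third power."""
    return x ** 3
```

If this file were saved next to your notebook, `import my_tools` would make both functions available.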

So why would we want that?

1.  Moving often-used functions into a library declutters your code.
2.  Python has a myriad of libraries which greatly extend the
    functionality of the language.
3.  Another important reason why we prefer to use libraries (rather
    than writing our own code) is that the developers of a library
    usually spend more time optimizing the performance and improving
    the compatibility of the code. Also, if we encounter problems
    using public libraries, we can often find solutions or helpful
    information on the Internet. In that sense, by using public
    libraries, we are collaborating with the whole community.

Why you would not want to use a library:

1.  Unless it's your own library, you depend on someone else. Imagine
    this great library you found, but it has a nasty bug, and the
    person who wrote it is no longer responding to e-mail... It
    might just be easier to implement your desired functionality
    yourself.
2.  Libraries can introduce considerable complexity, which may be
    overkill in your situation.
3.  While Python itself is distributed under a permissive open-source
    license, third-party libraries are sometimes under more
    restrictive licenses. Not a big deal for your private code, but
    if you work in a commercial environment, it may be a real show
    stopper.



### Using a library in your code



Consider the following simple library [mylib.py](mylib.py). It only provides two
functions. In order to access these functions, we first need to import
the library, and then we can inspect its
content. 



In [1]:
import mylib
dir(mylib)

Note that the file `mylib.py` is present in this module folder.
However, if you move your assignment to the submission folder, you
need to copy the file `mylib.py` as well! As you can see from the
output of the above code, the `mylib` library provides two functions,
`hello_world()` and `square()`. Can you imagine what would happen if
you loaded another library which provides the same functions? Yes,
the earth would stop spinning, and we would all float helplessly into
outer space...

Python provides an ingenious solution to avoid such an undesirable
outcome. Try the following:



In [1]:
import mylib
print(square(5))

Yup, `mylib` provides `square()`, but you cannot use it by this name
alone! So even if two libraries use the same name for two different
functions, the earth will keep spinning! To access a function which is
defined inside a library, you have to write:



In [1]:
import mylib
print(mylib.square(5))

Pure genius! Each library creates its own namespace upon import. This
mechanism avoids naming conflicts, but it also adds a lot of
typing... There are two ways around it:

1.  Often you will only need one or two functions from a library (also
    called a module). In this case you can import each function
    explicitly. The onus is then on you to make sure that you do not
    import functions which share a name.



In [1]:
from mylib import hello_world, square
hello_world()
print(square(6))

2.  Or, you can create a library alias (aka short name). This
    approach is particularly useful if you need to use a lot of
    functions from the same library.



In [1]:
import mylib as ml
ml.hello_world()
print(ml.square(6))
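
The namespace idea can also be tried without `mylib`. The sketch below uses the standard-library modules `math` and `cmath` (chosen here only as an illustration, they are not part of this module's files), which both provide a function called `sqrt()`. It also shows what happens when explicitly imported names collide: the later import silently replaces the earlier one.

```python
import math
import cmath

# Both modules define sqrt(), but each one lives in its own namespace
real_root = math.sqrt(4)        # real-valued square root
complex_root = cmath.sqrt(-4)   # complex-valued square root

print(real_root)
print(complex_root)

# Importing names directly puts them into *your* namespace,
# so the second import replaces the first one without any warning
from math import sqrt
from cmath import sqrt

shadowed = sqrt(4)              # this now calls the cmath version
print(type(shadowed))
```

This is why the alias approach is usually the safer choice once you work with more than a handful of functions.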

### Using pandas to read data from an Excel file



Pandas is one of the most widely used Python libraries, and provides
powerful data analysis tools. It also provides an easy way to read
data from files which contain comma-separated values (CSV), from Excel
spreadsheets, from HDF files, and so on. Here we will use isotope data
from a recent paper by one of my graduate students (Yao et al. 2018).

In the following code snippet, we import the pandas library with the
alias `pd`. So all of the functions provided by pandas are available
as `pd.functionname()`. We will explore how to use some of the
functions provided by pandas below. Note that since pandas is
installed as part of your Python environment, rather than as a file
next to your notebook, you do not need a local copy in each working
directory.

In order to read the Excel file, we need to know its name, and we need
to know the name of the data sheet we want to read. If the read
operation succeeds, the data will be stored as a pandas dataframe
object. You can think of the pandas dataframe as a table with rows,
columns, headers, etc. Pay attention to the way I use type hinting to
indicate that `os_peak` is a variable which contains a dataframe.



In [1]:
import pandas as pd  # import pandas under the alias pd

# Define the file and sheet name we want to read. Note that the file
# has to be present in the local working directory!
fn: str = "Yao_2018.xlsx"  # file name
sn: str = "outside_peak"   # sheet name

# Read the Excel sheet with read_excel() and store it as a dataframe
os_peak: pd.DataFrame = pd.read_excel(fn, sheet_name=sn)
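
If the Excel file is not at hand, the same workflow can be tried with comma-separated data. The following is a minimal sketch: the values are made up for illustration and are not taken from Yao et al. 2018, and the column name `Depth` is an assumption. It also shows a few quick checks that the read operation produced what you expected.

```python
import io

import pandas as pd

# Made-up values for illustration only (NOT from Yao et al. 2018)
csv_text = """Depth,d34S
1.0,20.1
2.0,20.4
3.0,19.8
"""

# io.StringIO lets pandas read the string as if it were a file on disk
demo: pd.DataFrame = pd.read_csv(io.StringIO(csv_text))

print(type(demo))          # confirm that we got a dataframe
print(list(demo.columns))  # column names, taken from the header line
print(demo.shape)          # (number of rows, number of columns)
```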

### Working with the pandas DataFrame object



In most cases, your datasets will contain many lines, so listing all
of the data is wasteful. Pandas provides the `head()` and `tail()`
methods, which show only the first (or last) few lines of your
dataset. Remember that methods are bound to an object, as opposed to
functions, which expect one or more variables as arguments. Since the
`pd.read_excel()` call above created a pandas dataframe object with
the name `os_peak`, the `head()` and `tail()` methods are now
available through the dataframe object. If this does not make sense
to you, please speak up! Otherwise, try both methods here:



In [1]:
os_peak.head()

If you are really on the ball, you may have noticed that the first
column is not present in the actual Excel file (you did check that
`pd.read_excel()` actually read the correct file and data, didn't
you?).

The numbers in the first column are called the index. Think of them
as line numbers. Pandas shows them whenever it displays a dataframe,
but they are not treated as data when you do computations. So we have
an index column, and then we have data columns.
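
The distinction between the index and the data columns can be checked on a small, hand-made dataframe. This is only a sketch with made-up numbers; the column name `Depth` is an assumption.

```python
import pandas as pd

# Made-up numbers, for illustration only
demo = pd.DataFrame({"Depth": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "d34S": [20.1, 20.4, 19.8, 21.0, 20.7]})

first_two = demo.head(2)   # the first two rows
last_two = demo.tail(2)    # the last two rows

print(first_two)
print(last_two)
print(list(demo.index))    # the automatically generated index
print(list(demo.columns))  # the index is not listed among the data columns
```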



#### Selecting specific rows



In order to select a specific row from a pandas dataframe, we can use
the `iloc[]` indexer (short for integer location). Note that it is
used with square brackets, like list indexing, and that counting
starts at zero. In other words, if you want to select the 4th row,
you can write:



In [1]:
os_peak.iloc[3]  # get the 4th row

and you can use the normal slicing operators to get more than one row:



In [1]:
os_peak.iloc[3:5]  # get the 4th and 5th rows

#### Selecting specific columns by index



The `iloc[]` indexer can also be used to select a specific column. In
this case we have to give both the row and the column index we want
to retrieve (`iloc[row,col]`).



In [1]:
os_peak.iloc[1,0] # get the data in the 2nd row of the 1st col

You remember the slicing syntax (if not, review the slicing module).
So if you want to see the first two rows of the fourth column:



In [1]:
os_peak.iloc[0:2,3]  # get the first 2 rows from the 4th column

In order to get all data from the third column, you can write:



In [1]:
os_peak.iloc[:,2] # get all data from the third column
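
The position-based selections above can be tried on a small dataframe without the Excel file. This is a sketch with made-up numbers; the column names `Depth` and `d18O` are assumptions.

```python
import pandas as pd

# Made-up numbers, for illustration only
demo = pd.DataFrame({"Depth": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "d34S": [20.1, 20.4, 19.8, 21.0, 20.7],
                     "d18O": [9.5, 9.7, 9.1, 10.2, 9.9]})

row = demo.iloc[3]        # the 4th row, returned as a pandas Series
rows = demo.iloc[3:5]     # the 4th and 5th rows (stop position 5 is excluded)
cell = demo.iloc[1, 0]    # 2nd row, 1st column
part = demo.iloc[0:2, 2]  # first two rows of the 3rd column
col = demo.iloc[:, 2]     # every row of the 3rd column

print(rows)
print(cell)
print(part)
```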

#### Selecting rows/columns by Label



Pandas also supports selection by label, rather than by integer
position. This is done with the `.loc[row_label,column_label]`
indexer, where the first entry is the row label and the second is the
column label. However, the statement below requires **some
attention**. At first sight, it appears that we mix `iloc[]` and
`loc[]` syntax here. This is not the case; rather, this command
treats the values in the index column as labels. So if your index
started at 100, this code would return an empty result, since there
are no labels between 2 and 4. As a side note, the index does not even
have to be numeric; it could well be a date-time value, or even a
letter code. So `loc[2:4,'d34S']` does not select by position.
Rather, it selects every row whose label lies between 2 and 4, with
both endpoints included (here: 2, 3, and 4). This difference is
illustrated by the following code.



In [1]:
print(os_peak.iloc[2:4,3]) # rows at positions >= 2 and < 4, 4th column
print(os_peak.loc[2:4,'d34S']) # extract the d34S data for index
                               # labels which equal 2, 3, or 4
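
The difference becomes easier to see when the index does not start at zero. The sketch below uses a made-up dataframe whose index runs from 100 to 104 (an assumption chosen purely for illustration).

```python
import pandas as pd

# Made-up numbers; the index labels deliberately start at 100
demo = pd.DataFrame({"d34S": [20.1, 20.4, 19.8, 21.0, 20.7]},
                    index=[100, 101, 102, 103, 104])

by_position = demo.iloc[2:4, 0]       # positions 2 and 3 (stop is excluded)
by_label = demo.loc[101:103, "d34S"]  # labels 101, 102 and 103 (stop is included)
no_match = demo.loc[2:4, "d34S"]      # no labels between 2 and 4: empty result

print(by_position)
print(by_label)
print(len(no_match))
```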

#### Getting statistical coefficients



Pandas supports a large number of statistical methods, and the
`describe()` method will give you a quick overview of your
data. 



In [1]:
os_peak.describe()
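
The result of `describe()` is itself a dataframe, so individual statistics can be picked out by label. A minimal sketch with made-up numbers (the column name `Depth` is an assumption):

```python
import pandas as pd

# Made-up numbers, for illustration only
demo = pd.DataFrame({"Depth": [1.0, 2.0, 3.0, 4.0],
                     "d34S": [20.0, 21.0, 19.0, 22.0]})

summary = demo.describe()  # count, mean, std, min, quartiles, max per column
print(summary)

# Pick a single statistic by row label and column label
mean_d34s = summary.loc["mean", "d34S"]

# The same number, computed directly from the column
print(mean_d34s, demo["d34S"].mean())
```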

#### What else can you do?



The short answer is: lots. The dataframe can act as a database, you
can add or remove values, columns, and rows, you can clean your data
(e.g., missing numbers, bogus data), you can do Boolean algebra, and
so on. If these cases arise, please have a look at the excellent
online documentation and tutorials.
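
To give a flavor of these operations, here is a small sketch with made-up numbers (the column names `Depth` and `offset` are assumptions). It drops a row with a missing value, filters rows with a Boolean condition, and adds a derived column.

```python
import pandas as pd

# Made-up numbers; None marks a missing measurement
demo = pd.DataFrame({"Depth": [1.0, 2.0, 3.0, 4.0],
                     "d34S": [20.0, None, 19.0, 22.0]})

clean = demo.dropna()                # drop rows which contain missing values
heavy = clean[clean["d34S"] > 19.5]  # Boolean mask: keep rows where the test is True
clean = clean.assign(offset=clean["d34S"] - 20.0)  # add a derived column

print(clean)
print(heavy)
```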

