## Working with libraries - an introduction



Python enables us to group code via block statements (`if`, `while`,
`def`, etc.). However, at times you want to refer to a whole group of
code statements by name. This could be a named Python program (as
opposed to the code cells we run in a notebook) or a collection of
function definitions that provide new language elements (aka a library
or module). The popularity of Python rests, in fact, on the large
number of Python libraries that facilitate, e.g., seismic inversion,
borehole log data analysis, statistics, and graphical output.

A Python library is simply a file that contains function definitions. The key is
that this file contains only function definitions and no other
program code. The file must also be written in plain Python; it can't
be in notebook format. This is no major inconvenience since, most of
the time, we only want to use the functions in the file rather than
change them. Library development is, however, best done with a
full-fledged Python IDE (Integrated Development Environment).

So why would we want to use libraries?

1.  Moving often used functions into a library declutters your code.
2.  Python has myriads of libraries that greatly extend the
    functionality of the language.
3.  Another important reason why we prefer to use libraries (rather
    than writing our own code) is that the developers of the
    libraries usually spend more time optimizing the performance
    and improving the compatibility of the code. Also, if we
    encounter problems using public libraries, we can often find
    solutions or helpful information on the Internet. In that sense,
    by using the libraries, we are collaborating with the
    whole community.

Why you would not want to use a library:

1.  Unless it's your own library, you depend on someone else. Imagine
    this great library you found, but it has this nasty bug, and the
    guy who wrote this library is no longer responding to
    e-mail... it might be just easier to implement your desired
    functionality yourself...
2.  Libraries can introduce considerable complexities, which may be
    overkill in your situation...
3.  While Python itself is released under a permissive open-source
    license, third-party libraries are sometimes under more
    restrictive licenses. Not a big deal for your private code, but
    if you work in a commercial environment, it may be a real
    showstopper.



### Using a library in your code



Consider the following simple library [mylib.py](mylib.py). It only provides two
functions. If you click on the link in the previous sentence, you can inspect the code, and you will see that it looks just like any other python code you have seen so far.
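
For orientation, a library file of this kind might look like the sketch below. The function bodies are assumptions made for illustration; the actual contents of `mylib.py` may differ.

```python
"""A tiny example library: nothing but function definitions."""


def hello_world() -> None:
    """Print a short greeting (hypothetical body)."""
    print("Hello World")


def square(x: float) -> float:
    """Return x multiplied by itself (hypothetical body)."""
    return x * x
```

Note that the file contains no statements that run on their own; it only defines names that become available after an import.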

To use these functions, we first need to import the library, and then
we can inspect its content.



In [1]:
import mylib
help(mylib)  # help() prints the documentation itself, no print() needed

Note that the file `mylib.py` is present in this module folder. However, if
you have moved your assignment to the submission folder, you need to copy
the file `mylib.py` as well! As you can see from the output of the
above code, the `mylib` library provides two functions, `hello_world()`
and `square()`. Can you imagine what would happen if you load another
library that provides the same functions? Yes, the earth would stop
spinning, and we all would float helplessly into outer space...

Python provides an ingenious solution to avoid such an undesirable
outcome. Try the following:



In [1]:
import mylib
print(square(5))  # fails with a NameError: the bare name square is unknown

Yup, `mylib` provides `square()`, but you cannot use it by this name!
To access a function that is defined inside a library, you have to write:



In [1]:
import mylib
print(mylib.square(5))

Pure genius! Each library creates its own namespace upon import.
So if two libraries define the same function, they will be known by different names!
This mechanism avoids naming conflicts, but it also adds a lot of
typing... There are two ways around it:

1.  Often, you will only need one or two functions from a library (also
    called a module). In this case, you can import each function
    explicitly. The onus is then on you to make sure not to import
    functions that overwrite existing functions:



In [1]:
from mylib import hello_world, square

hello_world()
print(square(6))

2.  Or, you can create a library alias (aka short name). This
    is the preferred approach as it avoids the accidental redefinition of existing functions.



In [1]:
import mylib as ml # ml is now the shorthand for mylib

ml.hello_world()
print(ml.square(6))
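
The namespace mechanism described in this section can also be tried without `mylib.py`. The sketch below uses two standard-library modules, `math` and `cmath`, which both provide a function called `sqrt()`. The variable names are chosen for illustration only.

```python
import math   # sqrt() for real numbers
import cmath  # sqrt() for complex numbers

# Same function name, two namespaces, no conflict
real_root = math.sqrt(4)
complex_root = cmath.sqrt(-4)
print(real_root)
print(complex_root)

# Importing the bare name is where the risk lies:
# the second import silently replaces the first one
from math import sqrt
from cmath import sqrt

shadowed = sqrt(4)
print(type(shadowed))  # the complex version won
```

This is why the alias form (`import mylib as ml`) is the safer habit: the prefix always tells you which library a function comes from.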

### Using pandas to read data from an Excel file



Pandas is one of the most used Python libraries and provides powerful
data analysis tools. It also provides an easy way to read data
from files that contain comma-separated values (CSV), from Excel
spreadsheets, from HDF files, etc. Here we will use isotope data
from a recent paper of one of my graduate students (Yao et al. 2018).

In the following code snippet, we import the pandas library with the
alias `pd`. So all of the functions provided by pandas are available
as `pd.functionname()`. We will explore how to use some of the
functions provided by pandas below. Since pandas is installed
system-wide (unlike `mylib.py`, which is a local file), you do not
need a local copy in each working directory.

To read the Excel file, we need to know its name, and we need to
know the name of the data sheet we want to read. If the read operation
succeeds, the data will be stored as a pandas dataframe
object. You can think of the pandas dataframe as a table with rows,
columns, headers, etc. Pay attention to the way I use type hinting
to indicate that `os_peak` is a variable that contains a dataframe.



In [1]:
import pandas as pd  # import pandas under the alias pd

# define the file and sheet name we want to read. Note that the file
# has to be present in the local working directory!
fn: str = "Yao_2018.xlsx"  # file name
sn: str = "outside_peak"  # sheet name

# read the Excel sheet with pandas' read_excel() and store it as a dataframe
os_peak: pd.DataFrame = pd.read_excel(fn, sheet_name=sn)
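
Reading a CSV file follows the same pattern. The sketch below can be run without any data file because it reads a small CSV text held in memory; the column names and numbers are made up for illustration and are not taken from the Excel file.

```python
import io

import pandas as pd

# Made-up values standing in for a real data file
csv_text = """Depth,d34S
1.0,20.5
2.0,21.1
3.0,19.8
"""

# read_csv() accepts a file name or, as here, a file-like object
demo: pd.DataFrame = pd.read_csv(io.StringIO(csv_text))
print(type(demo))
print(demo)
```

With a real file, you would pass the file name instead of the `io.StringIO` object.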

### Working with the pandas DataFrame object



In most cases, your datasets will contain many lines. So listing all
of this data is wasteful. Pandas provides the `head()` and `tail()`
methods which will only show the first (or last) few lines of your
dataset. Remember that methods are bound to an object, whereas functions expect one or more variables as an argument. 

Since line 9 above created a pandas dataframe object with the name `os_peak`, the
`head()` and `tail()` methods are now available through the data-frame
object. If this does not make sense to you, please speak up!
Otherwise, try both methods here: 



In [1]:
print(os_peak.head())

If you are really on the ball, you may have noticed that the first
column is not present in the actual Excel file.
The numbers in this first column are called the index. Think of them as
line numbers. All pandas objects show them, but they are ignored when
you do computations with the data. So we do have an index column, and
then we have data columns.
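
To see the index and the data columns side by side, here is a minimal sketch with a small made-up dataframe (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame(
    {
        "Depth": [1.0, 2.0, 3.0, 4.0],
        "d34S": [20.5, 21.1, 19.8, 20.2],
    }
)

print(demo.head(2))        # first two rows
print(demo.tail(2))        # last two rows
print(list(demo.index))    # the index labels, added by pandas
print(list(demo.columns))  # the data columns we supplied
```

The index is not listed among the columns, which is why it does not take part in computations on the data.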



#### Selecting specific columns



We can select specific columns by simply specifying the column name:



In [1]:
print(os_peak["d34S"])

#### Selecting specific row & column combinations



The above syntax is intuitive but not very flexible. Pandas provides the `iloc[]` indexer (integer location), which allows us to access
rows and columns by their integer position. Note that it is used with square brackets rather than parentheses.



In [1]:
# iloc[row,col]
print(os_peak.iloc[1,0]) # get the data in the 2nd row of the 1st col

You remember the slicing syntax (if not, review the slicing module).
So if you want to see the first two rows of the fourth column:



In [1]:
print(os_peak.iloc[0:2,3])  # get the first 2 rows from the 4th column

In order to get all data from the third column, you can write:



In [1]:
print(os_peak.iloc[:,2]) # get all data from the third column

Now modify the above statement so that it prints all data from the
second row.



#### Selecting rows/columns by label



Pandas also supports selection by label rather than by position. This
is done with the `.loc[row_label, column_label]` indexer. The first
entry is the row label, and the second is the column label.
However, the statement below requires **some attention**. At first
sight, it appears that we mix `iloc[]` and `loc[]` syntax here.
However, this is not the case. Rather, this command treats the values
in the index column as labels. So if your index starts at 100, this
code would yield no result since there are no labels 2, 3, or 4.

As a side note, the index does not even have to be numeric; it could
well be a date-time value or even a letter code. So `loc[2:4,'d34S']`
does not slice by position. Rather, it selects every row whose label
lies between 2 and 4, with both end points included (here 2, 3, and
4). This difference is illustrated by the following code.



In [1]:
print(os_peak.iloc[2:4,3]) # extract the rows at positions >= 2 and < 4
print(os_peak.loc[2:4,'d34S']) # extract the d34S data for index
                               # labels which equal 2, 3, or 4
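
The sketch below shows the same contrast on a small made-up dataframe, once with the default index and once with an index that starts at 100. All names and values are assumptions for illustration.

```python
import pandas as pd

values = [20.5, 21.1, 19.8, 20.2, 20.9]  # made-up numbers

default_idx = pd.DataFrame({"d34S": values})  # labels 0..4
shifted_idx = pd.DataFrame({"d34S": values}, index=[100, 101, 102, 103, 104])

# Default index: positions and labels happen to coincide
by_position = default_idx.iloc[2:4, 0]   # positions 2 and 3
by_label = default_idx.loc[2:4, "d34S"]  # labels 2, 3 and 4
print(len(by_position), len(by_label))

# Shifted index: the positions still exist, the labels 2..4 do not
shifted_position = shifted_idx.iloc[2:4, 0]
shifted_label = shifted_idx.loc[2:4, "d34S"]
print(len(shifted_position), len(shifted_label))
```

The positional selection returns two rows in both cases, whereas the label-based selection returns three rows for the default index and none for the shifted one.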

There is a third method to select a column by label, namely attribute access:



In [1]:
import pandas as pd
all_data: pd.DataFrame = pd.read_excel("Yao_2018.xlsx", sheet_name="combined_data")  # the pandas dataframe
delta = all_data.d34S  # this will work
# age = all_data.Age [Ma]  # this will not work

Obviously, this only works if the column header is a single word without spaces or special characters.
I thus recommend sticking to the `iloc[]` indexer.
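
As a minimal sketch of this limitation, the made-up dataframe below has one single-word header and one header with a space and brackets (both names and all values are assumptions for illustration). Square-bracket selection and `iloc[]` work for either kind of header.

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame({"d34S": [20.5, 21.1], "Age [Ma]": [55.2, 55.9]})

delta = demo.d34S                  # attribute access: single-word header only
age = demo["Age [Ma]"]             # square brackets accept any header
age_by_position = demo.iloc[:, 1]  # position 1 is the second column

print(delta)
print(age)
```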



#### Include only specific rows



The `loc[]` indexer also accepts a condition in place of a row label, in which case it returns only the rows where the condition is true. We can use this to select only specific rows. In this example, we only select data where the Location equals the string "in".



In [1]:
import pandas as pd
all_data: pd.DataFrame = pd.read_excel(
    "Yao_2018.xlsx", sheet_name="combined_data"
)  # the pandas dataframe
y: pd.Series = all_data.loc[all_data["Location"] == "in", "d34S"]
print(y)
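
To see what happens under the hood, it may help to look at the condition on its own. The sketch below uses a small made-up dataframe (values are assumptions for illustration): the comparison produces a series of `True`/`False` values, and `loc[]` keeps the rows marked `True`.

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame(
    {
        "Location": ["in", "out", "in", "out"],
        "d34S": [20.5, 21.1, 19.8, 20.2],
    }
)

mask = demo["Location"] == "in"  # one True/False value per row
print(mask)

inside = demo.loc[mask, "d34S"]  # keep only the rows where mask is True
print(inside)
```

Note that the selected rows keep their original index labels.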

#### Getting statistical coefficients



Pandas supports a large number of statistical methods. As an example,
use the `describe()` method, which gives you a quick overview of
your data.



In [1]:
print(os_peak.describe())
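
The result of `describe()` is itself a dataframe, so individual statistics can be picked out by label. A minimal sketch with made-up numbers (not the values from the Excel file):

```python
import pandas as pd

# Made-up values, for illustration only
demo = pd.DataFrame({"d34S": [20.0, 21.0, 19.0, 22.0]})

summary = demo.describe()
print(summary)

# Pick a single statistic by row label and column label
mean_from_summary = summary.loc["mean", "d34S"]
# The same number is available directly as a method of the column
mean_direct = demo["d34S"].mean()
print(mean_from_summary, mean_direct)
```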

#### What else can you do?



The short answer is lots! The dataframe can act as a database, and you
can add or remove values, columns, and rows, you can clean your data
(e.g., missing numbers, bogus data), you can do boolean algebra, etc.
If these cases arise, please have a look at the excellent online
documentation and tutorials.
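
To give a flavor of these operations, here is a short sketch on a made-up dataframe with one missing value. The column name `offset` and the reference value of 20.0 are hypothetical and only serve the illustration.

```python
import pandas as pd

# Made-up values with one missing measurement
demo = pd.DataFrame(
    {
        "Location": ["in", "out", "in", "out"],
        "d34S": [20.5, None, 19.8, 20.2],
    }
)

# Clean: drop the rows where d34S is missing
cleaned = demo.dropna(subset=["d34S"])

# Add a column computed from an existing one
cleaned = cleaned.assign(offset=cleaned["d34S"] - 20.0)

# Boolean algebra: combine two conditions with & (and)
selected = cleaned[(cleaned["Location"] == "in") & (cleaned["d34S"] > 20.0)]

# Remove a column again
trimmed = cleaned.drop(columns=["offset"])

print(cleaned)
print(selected)
print(trimmed)
```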

