## Working with libraries



A Python library is simply a file which contains function
definitions. The key here is that this file contains only function
definitions, and no other program code. The file must also be written
in plain Python; it cannot be in notebook format. This is no major
inconvenience since, most of the time, we only want to use the
functions in the file, rather than change them. Library development
is, however, best done with a full-fledged Python IDE (Integrated
Development Environment).

So why would we want that?

1.  Moving often used functions into a library declutters your code.
2.  Python has myriads of libraries which greatly extend the
    functionality of the language.

Why you would not want to use a library:

1.  Unless it's your own library, you depend on someone else. Imagine
    this great library you found, but it has this nasty bug, and the
    person who wrote it is no longer responding to
    e-mail... It might be easier to implement your desired
    functionality yourself.
2.  Libraries can introduce considerable complexity, which may be
    overkill in your situation.
3.  While Python itself is released under a permissive open-source
    license (it is not in the public domain), third-party
    libraries are sometimes under more restrictive licenses.



### Using a library in your code



Consider the following simple library [mylib.py](mylib.py). It only provides two
functions. In order to access these functions, we first need to import
the library, and then we can inspect its content.



In [1]:
import mylib
dir(mylib)

Note that this library needs to be present in your local folder. So if you
moved your assignment to the submission folder, you need to move the
file mylib.py as well.
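For orientation, a library like `mylib.py` could look roughly like the sketch below. This is an assumption for illustration: only the function names `hello_world` and `square` are taken from the examples in this section, and the function bodies in the actual file may differ.

```python
# mylib.py -- hypothetical sketch of a small library file.
# It contains only function definitions and no other program code.


def hello_world() -> None:
    """Print a short greeting (the exact text is an assumption)."""
    print("Hello World")


def square(x: float) -> float:
    """Return the square of x."""
    return x * x
```

With such a file in your working directory, `import mylib` makes both functions available as `mylib.hello_world()` and `mylib.square()`.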

However, importing the library this way does not allow you to use the function names as they are written in the library file.



In [1]:
import mylib
print(square(5))  # fails with NameError: name 'square' is not defined

Rather, you have to write



In [1]:
import mylib
print(mylib.square(5))

Why is that? Most Python programs import quite a few
libraries. However, the people writing those libraries do not know which
function names have been used by other library developers. So in
order to avoid naming conflicts, Python requires the library
name in front of each function of the library. While this clever
mechanism avoids naming conflicts, it also adds a lot of
typing. There are two ways around it:

1.  Often you will only need one or two functions from a library (also
    called a module). In this case you can import each function explicitly.



In [1]:
from mylib import hello_world, square
hello_world()
print(square(6))

2.  Or, we can create our own library alias.



In [1]:
import mylib as ml
ml.hello_world()
print(ml.square(6))
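The naming conflicts mentioned above can be illustrated with the standard-library modules `math` and `cmath`, which both provide a function called `sqrt`. The sketch below is an illustration and not part of `mylib`. It shows that the library prefix keeps the two functions apart, while a plain `from ... import` lets the later import silently replace the earlier one.

```python
import cmath  # math for complex numbers
import math   # math for real numbers

# With the module prefix, both functions can coexist.
real_root = math.sqrt(4)       # square root of a real number
complex_root = cmath.sqrt(-4)  # square root of a negative number

print(real_root)     # 2.0
print(complex_root)  # 2j

# Without the prefix, the second import overwrites the first name.
from math import sqrt
from cmath import sqrt  # 'sqrt' now refers to cmath.sqrt

print(sqrt is cmath.sqrt)  # True
```

So when you import individual functions explicitly, it is up to you to make sure the names do not clash.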

### Using pandas to read data from an excel file



Pandas is one of the most used Python libraries, and provides powerful
data analysis tools. It also provides an easy way to read data
from files which contain comma-separated values (CSV), from Excel
spreadsheets, from HDF files, etc. Here we will use isotope data from a
recent paper by one of my graduate students (Yao et al., 2018).

In the following code snippet, we import two libraries using the above
syntax. The first line imports `TypeVar` from the `typing`
library. This library supports type hinting beyond the standard
variable types like integer and float. You may notice that we have
used it before. `TypeVar` allows us to create our own
variable types, and we use it in line 5 to create a type hint for the
`DataFrame` variable type. In the context of this course, there is no
need to understand this in depth, but you should know what this is
for.

Line two imports the entire pandas library with the alias `pd`. So
all of the functions provided by pandas are available as
`pd.functionname()`. We will explore how to use some of the
functions provided by pandas below. Note that pandas and typing are installed
system-wide, so you don't need to have them present in your local
directory.

In order to read the Excel file, we need to know its name, and we
need to know the name of the data sheet we want to read. If the read
operation succeeds, the data will be stored as a pandas dataframe.



In [1]:
from typing import TypeVar # this is used to declare a new type hint
import pandas as pd # import pandas with the alias pd

# declare a dataframe type for type hinting
pdf = TypeVar('pandas.core.frame.DataFrame')

# define the file and sheetname we want to read. Note that the file
# has to be present in the local working directory! 
fn :str = "Yao_2018.xlsx" # file name
sn :str = "outside_peak"  # sheet name

# read the excel sheet with the pandas read_excel function and store
# the result in a pandas dataframe
os_peak :pdf = pd.read_excel(fn, sheet_name=sn)
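Reading a CSV file follows the same pattern via `pd.read_csv`. The sketch below does not need any file on disk: it reads a small table from a text buffer (`io.StringIO`) that stands in for a CSV file. The column names and values are made up for illustration and are not taken from Yao et al. (2018).

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text: str = """sample,age,d34S
A,10.0,17.5
B,11.0,18.0
C,12.0,18.5
"""

# read_csv accepts a file name or any file-like object.
toy: pd.DataFrame = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (3, 3): three rows, three columns
print(list(toy.columns))  # ['sample', 'age', 'd34S']
```

Here `pd.DataFrame` is used directly as the type hint, which is an alternative to the `TypeVar` approach.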

### Working with the pandas dataframe object



In most cases, your datasets will contain many lines. The dataframe
`head()` and `tail()` methods show only the first (or last) few
lines of your dataset (five by default). Give this a try:



In [1]:
os_peak.head()

If you are really on the ball, you may have noticed that the first
column is not present in the actual Excel file (you did check that
`pd.read_excel()` actually read the correct file and data, didn't you?).

Those numbers are called the index. Think of them as line numbers. All
pandas objects show them, but they are ignored when you do
computations with the data. So we do have an index column, and then we
have data columns.
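The following sketch uses a small made-up dataframe (the column names and values are invented for illustration) to show how `head()`, `tail()`, and the index behave. Both methods accept the number of rows to show as an optional argument.

```python
import pandas as pd

# Made-up data: ten rows, two data columns.
toy: pd.DataFrame = pd.DataFrame(
    {
        "age": [float(i) for i in range(10)],
        "d34S": [17.0 + 0.5 * i for i in range(10)],
    }
)

first_rows: pd.DataFrame = toy.head()  # first five rows by default
last_rows: pd.DataFrame = toy.tail(3)  # last three rows

print(first_rows)
print(last_rows)

# The index is not a data column; it is stored separately.
print(list(toy.columns))      # ['age', 'd34S']
print(list(last_rows.index))  # [7, 8, 9]
```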



#### Selecting specific rows



In order to select a specific row from a pandas dataframe, we can use
`iloc` (short for integer location). Note that `iloc` is used with
square brackets rather than parentheses, and that counting starts at 0.
In other words, if you want to select the 4th row, you can write



In [1]:
os_peak.iloc[3]  # get the 4th row

and you can use the normal slicing operators to get more than one row



In [1]:
os_peak.iloc[3:5]  # get the 4th and 5th rows (the end index is excluded)
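As a quick check of how the slice boundaries work, the sketch below applies `iloc` to a made-up dataframe (the values are invented for illustration). As with Python lists, counting starts at 0 and the end of the slice is not included.

```python
import pandas as pd

# Made-up data: six rows, one data column.
toy: pd.DataFrame = pd.DataFrame({"d34S": [17.0, 17.5, 18.0, 18.5, 19.0, 19.5]})

fourth_row: pd.Series = toy.iloc[3]     # a single row is returned as a Series
two_rows: pd.DataFrame = toy.iloc[3:5]  # positions 3 and 4, i.e. the 4th and 5th rows

print(fourth_row["d34S"])    # 18.5
print(list(two_rows.index))  # [3, 4]
```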

#### Selecting specific columns by index



`iloc` can also be used to select a specific column. In
this case we have to give the row and column index we want to
retrieve. You remember the slicing syntax (if not, review the slicing
module). So in order to get all data from the third column you can
write



In [1]:
os_peak.iloc[:,2] # get all data from the third column

or, if you only want to see the first two rows of the third column:



In [1]:
os_peak.iloc[0:2,2]  # get the first 2 rows from the third column

Out [30]:
0    55.0011
1    55.0184
Name: Age [Ma], dtype: float64
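The row and column selection can also be tried without the Excel file. The sketch below uses a made-up dataframe (column names and values are invented for illustration). The first entry in the square brackets selects rows, the second selects columns, and a column selection is returned as a pandas `Series`.

```python
import pandas as pd

# Made-up data: three rows, three data columns.
toy: pd.DataFrame = pd.DataFrame(
    {
        "sample": ["A", "B", "C"],
        "depth": [1.0, 2.0, 3.0],
        "age": [10.0, 11.0, 12.0],
    }
)

third_column: pd.Series = toy.iloc[:, 2]  # all rows, third column
first_two: pd.Series = toy.iloc[0:2, 2]   # first two rows, third column

print(third_column.name)  # age
print(list(first_two))    # [10.0, 11.0]
```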

#### Selecting rows by Label



Pandas also supports selection by label, rather than by position. This
is done with `loc`. However, the statement below requires
some attention. At first sight, it appears that we mix `iloc` and
`loc` syntax here. This is not the case; rather, this
command treats the values in the index column as labels. So if your index
numbers started at 100, this code would yield no result, since
there is no label called "2". As a side note, the index does not even
have to be numeric; it could very well be a date-time value, or even a
letter code.



In [1]:
os_peak.loc[2:4,'d34S'] # extract the d34S data for index labels 2 to 4 (inclusive)
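The difference between labels and positions becomes visible once the index no longer starts at 0. The sketch below uses a made-up dataframe (invented values) whose index labels start at 100. It also shows that a `loc` slice includes its end label, while an `iloc` slice excludes its end position.

```python
import pandas as pd

# Made-up data whose index labels start at 100 instead of 0.
toy: pd.DataFrame = pd.DataFrame(
    {"d34S": [17.0, 17.5, 18.0, 18.5, 19.0]},
    index=[100, 101, 102, 103, 104],
)

by_label: pd.Series = toy.loc[101:103, "d34S"]  # labels 101, 102 and 103
by_position: pd.Series = toy.iloc[1:3, 0]       # positions 1 and 2 only
no_match: pd.Series = toy.loc[2:4, "d34S"]      # no labels between 2 and 4

print(list(by_label))     # [17.5, 18.0, 18.5]
print(list(by_position))  # [17.5, 18.0]
print(len(no_match))      # 0
```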

#### Getting statistical coefficients



Pandas supports a large number of statistical methods, and the
`describe()` method will give you a quick overview of your data.



In [1]:
os_peak.describe()
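The result of `describe()` is itself a dataframe, so individual statistics can be picked out with `loc`. The sketch below uses made-up values chosen so that the results are easy to verify by hand.

```python
import pandas as pd

# Made-up data with an easily checked mean.
toy: pd.DataFrame = pd.DataFrame({"d34S": [1.0, 2.0, 3.0, 4.0]})

stats: pd.DataFrame = toy.describe()

# The row labels of the summary name the statistics.
print(list(stats.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

print(stats.loc["mean", "d34S"])  # 2.5
print(toy["d34S"].max())          # 4.0, single statistics are also methods
```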

#### What else can you do?



The short answer is: lots. The dataframe can act as a database; you can
add or remove values, columns, and rows, you can clean your data (e.g.,
missing numbers, bogus data), you can do Boolean algebra, etc. If
these cases arise, please have a look at the excellent online
documentation and tutorials.
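As a small taste of data cleaning and Boolean selection, the sketch below removes a row with a missing value and then keeps only the rows that satisfy a condition. The dataframe is made up for illustration (invented column names and values).

```python
import pandas as pd

# Made-up data with one missing value (None is stored as NaN).
toy: pd.DataFrame = pd.DataFrame(
    {
        "age": [10.0, 11.0, 12.0, 13.0],
        "d34S": [17.0, None, 18.0, 19.0],
    }
)

# Drop all rows which contain a missing value.
clean: pd.DataFrame = toy.dropna()

# Boolean selection: keep only rows where d34S is larger than 17.5.
high: pd.DataFrame = clean[clean["d34S"] > 17.5]

print(len(clean))         # 3
print(list(high["age"]))  # [12.0, 13.0]
```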



## Assignment



In the following exercises, we will practice some of the
above. However, some tasks have not been explained. I recommend
referring to
[https://pandas.pydata.org/pandas-docs/stable/user\_guide/indexing.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
or
[https://pandas.pydata.org/pandas-docs/stable/getting\_started/basics.html#basics](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics)
or [https://www.tutorialspoint.com/python\_pandas/python\_pandas\_dataframe.htm](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm)

Notes: 

-   Create a notebook in your submissions folder with this name:
    "pandas-1-FirstName-LastName". In order to submit your
    assignment, you need to download it and submit it on Quercus
    (ipynb and pdf format). Please include the usual header with date, name, etc.

-   All questions should be solved with pandas methods. You may have
    to look up syntax or options either through the help system, or
    by using the above links.

-   Marking Scheme (per question):
    -   All variables declared and type hinting used throughout 1pt. No partial marks
    -   Code produces correct output 2pt. 1pt if code is sort of correct
    -   Proper use of comments 1pt. There is no need for doc-strings
        though
    -   Code is self contained 1pt.
    -   Max points per question: 5pts for a total of 14\*5 = 70 pts



### Exercises:



For each answer, please write self-contained code, that is, code that imports
all libraries, declares all variables, imports all data, etc., rather
than code which relies on data imported in a previous
cell. Copy and paste is your friend here.

1.  Write a code block which will import the outside peak sheet from
    `Yao_2018.xlsx` into a dataframe. Using the `tail` method display
    the last 7 lines of this dataframe.

2.  As above, but this time use the `iloc` method to display the 10<sup>th</sup> 
    row of the dataframe.
3.  As above, but this time use the `iloc` method to display the last
    3 values in the third data-column (not index!) of the dataframe.

4.  As above, but this time use the `loc` method to display the last
    3 values in the third data column of the dataframe.

5.  Use a pandas dataframe method to save your dataframe as a
    comma-separated values (CSV) text file in your local working directory. Your
    Jupyter notebook can open CSV files, so please go and check if
    the dataframe has been saved correctly.

6.  Import the `cars.csv` into a dataframe, using the license plate
    as index.

7.  As above, but this time use the `iloc` method to display the
    third row of the dataframe.

8.  As above, but this time use the `iloc` method to display the
    third data-column of the dataframe.

9.  As above, but this time use the `loc` method to display the third
    row of the dataframe.

10. As above, but this time use the `loc` method to display the
    third data-column of the dataframe.

11. Create two new dataframes from the following lists, and then
    use the pandas `concat()` function to join both dataframes by
    row to create the final dataframe.



In [1]:
import pandas as pd 
names :list = ['Paul', 'Peter', 'Hook', 'Wendy']
ages  :list = [ 12, 14, 51, 13]
nms = pd.DataFrame(names)
ags =  pd.DataFrame(ages)
print(pd.concat([nms, ags],axis=1))

12. As above, but this time we join them along the column.

13. Create a new dataframe from this list of tuples (i.e., `names_and_ages`). Feel free to ignore the part where I
    construct the list of tuples.



In [1]:
# first we do some magic to create a list of tuples
# no need to understand this in detail, but if you get it
# it is a neat trick
names :list = ['Paul', 'Peter', 'Hook', 'Wendy']
ages  :list = [ 12, 14, 51, 13]
# zip two lists into a list of tuples
names_and_ages :list = list(zip(names,ages))

14. As above, but this time include proper column headers




