## Working with the pandas series data object



The pandas dataframe is composed of smaller units, the so-called
pandas series. You already used this data type implicitly in our last
module, but here we will explore it more fully. Let's create a
dataframe first. Since I am lazy, I use a function from the numpy
library to generate an array of randomly distributed integer values
from 0 to 9 (the upper bound of 10 is excluded), and yes, I simply
googled "pandas generate list of random integers".



In [1]:
from typing import TypeVar
import pandas as pd
import numpy as np

# declare pandas types for type hinting
pdf = TypeVar('pandas.core.frame.DataFrame')
pds = TypeVar('pandas.core.series.Series')

# create a dataframe filled with random integer values from 0 to 9
# (the upper bound 10 is exclusive); the dataframe will have 4 rows
# and 4 columns, which are named A, B, C, D
df :pdf = pd.DataFrame(np.random.randint(0,10,size=(4, 4)),
                       columns=list('ABCD'))
print(df)
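
Because the values are random, your output will differ on every run. The following is only a sketch of one way to make the numbers repeatable while testing; the seed value `42` and the names `rng`, `values` and `df_demo` are my own choices for illustration and are not used elsewhere in this module. It also shows that the upper bound is never drawn.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so that reruns give the same numbers
rng = np.random.default_rng(42)

# integers(low, high) draws from low up to, but not including, high
values: np.ndarray = rng.integers(0, 10, size=(4, 4))
df_demo: pd.DataFrame = pd.DataFrame(values, columns=list("ABCD"))

print(df_demo)
print(df_demo.shape)  # number of rows and columns
print(values.min() >= 0, values.max() <= 9)  # all values lie within 0..9
```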

Now let's extract the data from column A and execute the following:



In [1]:
A :pds = df['A'] # extract column A, and save as pandas data series
print(A * 2)     # multiply A with 2 and print the result

You probably remember that this was not possible with a list, because
lists can contain numbers, letters, other lists, tuples etc. A pandas
series, on the other hand, has a single data type. In
other words, a column can contain either strings, integers, floats
etc., but all entries in a given column must be of the same
type. Since A consists only of numbers, Python can directly multiply
each element by 2. What happens if you multiply `A*B`? Is the
multiplication done element by element, or do you get the cross product?
What happens if you write `A**B`? Hurray! No more loops! (kinda)
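
To see the difference to a list side by side, here is a small sketch with made-up values (the names `numbers_list` and `numbers_series` are only for illustration):

```python
import pandas as pd

numbers_list: list = [1, 2, 3]
numbers_series: pd.Series = pd.Series([1, 2, 3])

# multiplying a list by 2 repeats the list, it does not touch the values
print(numbers_list * 2)

# multiplying a series by 2 multiplies each element by 2
print((numbers_series * 2).tolist())

# a series reports the single data type shared by all of its entries
print(numbers_series.dtype)
```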

In a way, Python treats a pandas series object like a vector, not
unlike MATLAB. The numpy library even provides a vector data type
(the numpy array) which behaves similarly to MATLAB vectors. There is
some cool stuff we can do with this. For example, we can apply a
comparison operator to a pandas series:



In [1]:
print(A>2)

Now why would this be useful? Remember that `False` equals zero,
whereas `True` equals 1. So if you want to count the number of values
in A which are larger than 2, you can simply write



In [1]:
n :int = sum(A>2)
print(n)
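
Since A is random, it helps to check the idea once with values where you know the answer. A minimal sketch with made-up numbers (the names `s`, `mask`, `n_builtin` and `n_method` are only for illustration); it also shows the series method `sum()`, which gives the same count as the builtin:

```python
import pandas as pd

s: pd.Series = pd.Series([1, 5, 2, 7])

# the comparison returns a series of booleans, one per element
mask: pd.Series = s > 2
print(mask.tolist())  # [False, True, False, True]

# True counts as 1 and False as 0, so summing counts the True entries
n_builtin: int = sum(mask)
n_method: int = int(mask.sum())
print(n_builtin, n_method)  # 2 2
```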

## Assignment



Notes: 

-   Create a notebook in your submissions folder with this name:
    "pandas-series-FirstName-LastName". In order to submit your
    assignment, you need to download it and submit it on Quercus
    (ipynb and pdf format). This time, also add any files which you
    read from or write to. Please have the usual header with date, name
    etc.

-   Marking Scheme (per question):
    -   All variables declared and type hinting used throughout 1 pt. No partial marks
    -   Code produces correct output 2 pts. 1 pt if code is sort of correct
    -   Proper use of comments 1 pt. There is no need for docstrings
        though
    -   Code is self-contained 1 pt.
    -   Docstrings for functions explain what the function does,
        explain all parameters, explain all return values. 1 pt.
    -   Max points per question: 6 pts for a total of 6\*3 = 18 pts
    -   Last question is 14 pts. See details there.
    -   Total number of points: 32



### Exercises:



For each answer, please write self-contained code, that is, code which
imports all libraries, declares all variables, imports all data etc.,
rather than code which relies on data imported in a previous
cell. Cut, copy, and paste are your friends here.

1.  Using the above methods, find a way to set all values in A which
    are smaller than, say, 4 to 0. Write the result into a new variable
    `X`, rather than replacing the values in A. Note: do not use
    builtin methods like replace for this operation, and do not use a
    loop (hint: use a boolean expression).

2.  Write a short function and code snippet which calculates the
    mean of a pandas series containing numbers. Do not use the builtin
    functions or the pandas method for mean. Rather, use your own code
    to compute the mean $\mu$ as
    
    \begin{equation}
    \mu = \frac{\sum\limits_{i=0}^{N-1} X_i}{N}
    \end{equation}
    
    The capital sigma sign on the right side is the math symbol for a
    sum, the subscript `i` refers to the index of a given vector
    element, $X_i$ denotes an individual element (i.e. `X[i]`), and N is
    equal to the number of elements in the series. Since Python starts
    counting at zero, the index runs from 0 to N-1. In the equation
    above I added the index expression to highlight the relation
    between the equation and a pandas series; however, you will often
    see the abbreviated form, which means the same thing but is faster
    to write
    
    \begin{equation}
    \mu = \frac{\sum X_i}{N}
    \end{equation}
    
    Call your function and compare the result against the mean value
    returned by the pandas mean method, e.g.:



In [1]:
print(f"The mean value of A using my_mean = {my_mean(A)}")
print(f"The mean value of A using A.mean() = {A.mean()}")

3.  As before, but now we will compute the population standard
    deviation sigma ($\sigma$), which is defined as
    
    \begin{equation}
    \sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}
    \end{equation}
    
    where $\mu$ is the mean value. Note that since your code regenerates
    the dataframe from random numbers each time, you need to compute
    $\mu$ each time. Thus, this code also needs to define `my_mean()`
    (you can obviously cut/copy/paste). Again, your code should be
    self-contained, and compare your result against the builtin
    pandas series method `A.std()`. Keep in mind that `A.std()`
    divides by N-1 by default (the sample standard deviation); pass
    `ddof=0` to get the population value defined above. Do not use
    loops.

4.  Use the coding template to write a program which imports the
    isotope data from the last lecture as a dataframe. Extract the
    delta values as a pandas series. Next, use the first two
    equations below to split the delta values into $^{32}S$ and $^{34}S$,
    and append the results to the dataframe as two new columns (`S32`
    and `S34`). Then use the third equation to recompute the delta
    values from $^{32}S$ and $^{34}S$,
    
    \begin{equation}
    ^{32}S = \frac{1000}{(\delta +1000) \times R + 1000}
    \end{equation}
    
    \begin{equation}
    ^{34}S = \frac{(\delta + 1000) \times R}{(\delta + 1000) \times R + 1000}    	
    \end{equation}
    
    \begin{equation}
    \delta^{34}S = \left(
      \frac{
        \left(\frac{^{34}S}{^{32}S}\right) _{Sample}}
      {
        \left(\frac{^{34}S}{^{32}S}\right) _{VCDT}}
      -1
      \right) \times 1000 \quad [^0/_{00}]
    \end{equation}
    
    and append the results to the dataframe as a new column called
    `delta-new`. Compute the difference between the original and the
    new delta (it should be very small or zero). Export the dataframe
    to a csv file. Again, do not use loops, import whatever
    you need, declare all variables, add docstrings to your function
    definitions, and use comments and type hinting. Use an R-value of
    `R=0.044162589`. Remember to break down any coding task into a
    set of smaller steps which you can develop and test individually.
    
    -   Proper docstrings (functions and program) 2 pts
    -   Correct variable definitions 2 pts
    -   Type hinting used throughout 2 pts
    -   Working code produces the required file 4 pts
    -   Use of comments throughout
    -   All coding template sections filled 4 pts
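
A note on exercise 3: when you compare your result against the pandas `std` method, the two numbers will only agree if pandas uses the same divisor as the equation. The sketch below uses made-up values (the names `s`, `sample_std` and `population_std` are only for illustration) to show the effect of the `ddof` parameter. It deliberately uses the pandas method, so it is not a solution to the exercise.

```python
import pandas as pd

# made-up values with mean 5 and a sum of squared deviations of 32
s: pd.Series = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# default is ddof=1, i.e. the sum of squares is divided by N - 1
sample_std: float = s.std()

# ddof=0 divides by N, which matches the population standard deviation
population_std: float = s.std(ddof=0)

print(f"sample std (ddof=1)     = {sample_std}")
print(f"population std (ddof=0) = {population_std}")
```

With these values the population standard deviation is sqrt(32/8) = 2, while the default result is slightly larger because it divides by 7 instead of 8.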

