# BIOEE 4940 : **Introduction to Quantitative Analysis in Ecology**
### ***Spring 2021***
### Instructor: **Xiangtao Xu** ( ✉️ xx286@cornell.edu)
### Teaching Assistant: **Yanqiu (Autumn) Zhou** (✉️ yz399@cornell.edu)

---

## <span style="color:royalblue">Lab 1</span> *Python Basics, I/O, and Visualization*
*Partly adapted from [Earth and Environmental Data Science](https://earth-env-data-science.github.io/intro)*

* Python is an interpreted, high-level and general-purpose programming language, first released in 1991 by Guido van Rossum.

* R vs Python - an inaccurate comparison....
  (read more [here](https://dev.to/daveparr/the-real-difference-tm-between-python-and-r-for-data-science-280i))

<img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ReZqX08z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/Ce8VP0FWIAI0ad2%3Fformat%3Djpg%26name%3Dsmall" alt="R vs Python" style="width: 600px;"/>


---

### 1. Fundamentals of Python

In this section, we will talk about some fundamental features of the Python language, including basic **data types, flow control, and mathematical operations**. Generally, these are also the key aspects to look at when you want to quickly understand a programming language and how it differs from other languages. Some subtleties in the fundamentals can also lead to common bugs in your analyses.

Currently, most of the scientific community uses Python 3, although you might encounter legacy code written in Python 2 in your own research. Running Python 2 code with Python 3 sometimes raises errors because of several key changes in Python 3, such as:

* `print` is a function
* Division of two integers with `/` returns a float
* Strings are Unicode by default
* ...

**1.1. Numbers and Math**

Numbers include integers and floats (real numbers), which are stored differently in computer memory and can behave differently under mathematical and other operations.



In [None]:
# comments are anything that comes after the "#" symbol
a_int = 1 # assign an integer 1 to a_int
a_float = 1. # assign a float 1 to a_float

# python also supports assigning to multiple variables at once
a_int, a_float = 1 , 1.

# check their types
print(type(a_int))
print(type(a_float))

Basic mathematical operations in Python include arithmetic and Boolean logic.

In [None]:
# addition / subtraction / multiplication / division

print(1 + 1 - 2 * 3 / 5)


Note that the result is a float although we only used integers. This is because **division in Python 3 returns a float by default**, whereas some other languages (e.g., C/Fortran) truncate the fractional part and return an integer, which can lead to bugs in your analysis.
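
To make the difference concrete, the short sketch below contrasts true division (`/`), floor division (`//`), and truncation with `int()`. The numbers and variable names are arbitrary illustrations.

```python
# true division always returns a float in Python 3
true_div = 7 / 2        # 3.5
exact_div = 6 / 3       # 2.0, still a float even though the result is whole

# floor division rounds toward negative infinity
floor_pos = 7 // 2      # 3
floor_neg = -7 // 2     # -4

# int() truncates toward zero, which matches integer division in C
truncated = int(-7 / 2)  # -3

print(true_div, exact_div, floor_pos, floor_neg, truncated)
```

Note that floor division and truncation only differ for negative results, which is exactly where translated C/Fortran code tends to go wrong.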

In [None]:
## exponentiation

print (2**10)
print (2**0.5)

## rounding

print(round(2**0.5, 4))

# modulo operation

print(12 % 7)   # get the remainder
print(12 // 7)  # get the integer quotient (floor division)



In [None]:
# Logic and Boolean and relational operator

is_weekday = True
is_snow = False

print(is_weekday and is_snow)
print(is_weekday or is_snow)
print(not (is_weekday and is_snow))
print((not is_weekday) and is_snow) # use parentheses where they improve readability

In [None]:
# comparison of number values

a = 5

print(a > 1)
print(a < 10.)
print(a == 5)  # equal
print(a != 5)  # not equal

Try to avoid using `==` or `!=` for floats because there could be **numerical errors** after complex mathematical operations.

In [None]:
# a simple example for numerical errors

b = (5 ** 0.0001) ** 10000
print(b)
print(a == b)


# instead compare whether the difference is small enough
abs_tol = 1e-8 # another way to define floats using scientific notation
print(abs(a - b) < abs_tol) # abs() returns the absolute value
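
The standard library provides `math.isclose` for this kind of tolerance-based comparison. The sketch below uses the classic `0.1 + 0.2` case rather than the values above; the variable names are illustrative.

```python
import math

x = 0.1 + 0.2
exact_equal = (x == 0.3)                 # False because of round-off error
close_default = math.isclose(x, 0.3)     # True with the default rel_tol of 1e-9

# near zero a relative tolerance alone is too strict, so add an absolute one
near_zero_rel = math.isclose(1e-12, 0.0)                  # False
near_zero_abs = math.isclose(1e-12, 0.0, abs_tol=1e-8)    # True

print(exact_equal, close_default, near_zero_rel, near_zero_abs)
```

For arrays, `np.isclose` and `np.allclose` play the same role element by element.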


**1.2 Strings**

Python has powerful support for string operations, which comes in handy when dealing with text/character-based data (e.g., species names, genome sequences, qualitative data).

Note that indexing in Python **starts from 0**, and slices are **left inclusive and right exclusive**.

In [None]:
# string definition, use either "" or ''

treatment_level = "1"
print(type(treatment_level) is int)

burmese_python = "\"Python bivittatus\""
print(burmese_python)

# index starts from zero; slices are left inclusive and right exclusive
print(burmese_python[0])
name_length = len(burmese_python)
print(name_length)
print(burmese_python[name_length-1])
print(burmese_python[-1])
print(burmese_python[1:7])

# this one will raise an IndexError
print(burmese_python[name_length])
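
A minimal sketch of the slicing rules, using a plain species name without the embedded quotes. It also shows that an out-of-range index raises an `IndexError` that can be caught, while an out-of-range slice is silently clipped.

```python
name = "Python bivittatus"

genus = name[0:6]      # characters 0..5; the stop index 6 is excluded
species = name[7:]     # from index 7 to the end
print(genus, species)

# a single index past the end raises IndexError
try:
    name[len(name)]
except IndexError as err:
    print("IndexError:", err)

# a slice past the end is clipped to the string length instead
clipped = name[7:100]
print(clipped)
```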


String (`str`) is a built-in class in Python with many useful methods. See [here](https://docs.python.org/3/library/stdtypes.html#string-methods) for more. A few examples are listed below.

In [None]:
# use """ for long strings
oaks = """Quercus rubra
Quercus alba
Quercus velutina"""

# convert to uppercase (press Tab in a jupyter notebook to auto-complete method names)
print(oaks.upper())

# split
print(oaks.split('\n')) # returns a list containing all the substrings

# join
print(' '.join(['Quercus','rubra','is','the','scientific','name','of','red','oak']))

# find the second occurrence of a substring by starting the search at index 2
print(oaks.find("Quercus",2))

# replace (does not change the original string)
print(oaks.replace("rubra","coccinea"))
print('\noriginal oaks:\n',oaks,'\n')

# String math operations
print(("duplicate" + " " + "myself ") * 5) # multiplication with positive integers

Strings can be converted to and from numerical data types.

In [None]:
# string to numbers
num_str = '1'
print(int(num_str))
print(float(num_str))

# numbers to string, check https://realpython.com/python-f-strings/ and https://pyformat.info/ for more details and tricks

pi_val = 3.1415926
print(f'the value of \u03C0 is {pi_val}')

# formatting floating-point numbers
# W.Pf where W is the minimum total width of the final string
# and P is the number of digits after the decimal point
print(f'the value of \u03C0 is {pi_val:4.2f}')

# formatting numbers in scientific notation
print(f'{12345678:5.3e}')

year, month, day = 2021, 2, 10
print(f'Today is {year:4d}-{month:02d}-{day:2d}') # formatting the integers, 02 means to use 0 for padding
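
The format specifications above can be checked with a few small cases. This is an illustrative sketch; `repr` is used so that the padding spaces are visible in the output.

```python
pi_val = 3.1415926

narrow = f'{pi_val:4.2f}'     # '3.14' is exactly 4 characters wide
wide = f'{pi_val:8.3f}'       # '3.142' padded on the left to 8 characters
overflow = f'{123.456:4.2f}'  # the width is only a minimum, so nothing is cut off
sci = f'{12345678:.3e}'       # scientific notation with 3 digits after the point
day = f'{5:02d}'              # integer padded with a leading zero

print(repr(narrow), repr(wide), repr(overflow), repr(sci), repr(day))
```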




**1.3 Lists and Loops**

A list is the built-in structure for representing arrays, and it can hold heterogeneous data types.

In [None]:
oaks = """Quercus rubra
Quercus alba
Quercus velutina"""
oak_list = oaks.split('\n')
print(oak_list)


# append an element
oak_list.append(1)
print(oak_list)

# pop the last element
oak_list.pop()
print(oak_list)

# sort the list
oak_list.sort()
print(oak_list)

In [None]:
# Different ways to loop over the elements of a list
for oak in oak_list:
    print(oak)

# use index
oak_num = len(oak_list)
for i in range(oak_num):
    print(oak_list[i])
    
# use enumerate
for i, oak in enumerate(oak_list):
    print(i,oak)
    
# enumerate is equivalent to
for i, oak in zip(range(len(oak_list)),oak_list):
    print(i,oak)

# enumerate is handy when you need both the index and the element at the same time

In [None]:
# while loop
i = 0
while i < len(oak_list):
    print(oak_list[i])
    i = i+1

In [None]:
# conditionals and flow control (break/continue)

maple_list=['Acer saccharum','Acer rubrum','Acer platanoides']


tree_list = oak_list + maple_list # + also works on lists (concatenation)

print(tree_list)

for i, tree in enumerate(tree_list):
    if tree.split(' ')[0] == 'Quercus':
        continue # skip oaks
    
    if tree == 'Acer rubrum':
        break # break out of the loop when encountering red maple
    
    print(i,tree)
    
    
    

In [None]:
# python has lots of tricks for working with lists

# list comprehension
squares = [n**2 for n in range(10)]
print(squares)

Cap_Genus_list = [tree.split(' ')[0].upper()+' '+tree.split(' ')[1] for tree in tree_list]
print(Cap_Genus_list)
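
A list comprehension can also filter elements with a trailing `if`. The sketch below uses its own short species list and shows the equivalent explicit loop for comparison.

```python
tree_list = ['Quercus rubra', 'Quercus alba', 'Acer saccharum', 'Acer rubrum']

# keep only the maples (genus Acer)
maples = [tree for tree in tree_list if tree.split(' ')[0] == 'Acer']
print(maples)

# the same logic written as an explicit loop
maples_loop = []
for tree in tree_list:
    if tree.split(' ')[0] == 'Acer':
        maples_loop.append(tree)

print(maples == maples_loop)
```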


**1.4 Tuples and Dictionaries**

Tuples are similar to lists, but they are ***immutable***: they cannot be extended or modified. This makes a tuple a *safe* structure for packing together heterogeneous data that you do not want to change (e.g., the results of a function).

In [None]:
genus='Quercus'
species='rubra'
diameter=15.
is_alive=True
tree = (genus,species,diameter,is_alive)
print(tree)
print(tree[1]) # can be indexed like arrays

a,b,c,d = tree # can be unpacked
print(a,b,c,d)

tree[2] = 20. # raises a TypeError because tuples are immutable

Dictionaries are useful structures that map **keys** to **values**. Values are accessed by key rather than by position (since Python 3.7, dictionaries preserve insertion order).

In [None]:
tree_dict = {'genus' : 'Quercus',
             'species' : 'rubra',
             'diameter' : 15.,
             'is_alive' : True}
print(tree_dict)

# OR
tree_dict={} # initialize an empty dictionary using {}
tree_dict['genus'] = 'Quercus'
tree_dict['species'] = 'rubra'
tree_dict['diameter'] = 15.
tree_dict['is_alive'] = True

print(tree_dict)

print('age' in tree_dict.keys())
print('diameter' in tree_dict.keys())

for key, val in tree_dict.items():
    print(key,val)
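
A few more dictionary idioms, sketched on a small stand-alone dictionary: `.get()` with a default value, membership tests directly on the dictionary, and the key order after adding an entry.

```python
tree_dict = {'genus': 'Quercus', 'species': 'rubra', 'diameter': 15.}

# .get() returns a default instead of raising KeyError for a missing key
age = tree_dict.get('age', -1)
print(age)

# membership can be tested on the dictionary itself, without .keys()
has_diameter = 'diameter' in tree_dict
print(has_diameter)

# new keys are appended at the end (insertion order, Python 3.7+)
tree_dict['is_alive'] = True
key_order = list(tree_dict)
print(key_order)
```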




### Challenge 1: Convert date string to decimal year

In [None]:
date_str="2021-Feb-10"

---
### 2. Numpy and Functions

Numpy is one of the most fundamental parts of the python "ecosystem" for science. Lots of packages are built on top of it.

Numpy includes support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions.

In [None]:
# import the package, similar to the library() command in R
import numpy as np  # "np" is the conventional alias; numpy functions are then called as np.<function>

In [None]:
arr = np.array([11.,12.5, 5.,7.])
print(arr.dtype)
print(arr.shape)


In [None]:
arr = np.array([[1,2,3],[4,5,6]])
print(arr.dtype)
print(arr.shape)
print(arr)

# note that numpy arrays are row-major (C-style) by default
# elements in the same row (i.e., along the last axis) are stored next to each other in memory
# so operations that run along the last axis are usually more efficient
# than operations that run along the first axis
print(arr.ravel()) # ravel to one dimensional array
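
The memory layout can be inspected directly. The sketch below fixes the dtype to 64-bit integers so that the strides (in bytes) are the same on every platform, and compares the default C order with Fortran (column-major) order.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

is_c_order = arr.flags['C_CONTIGUOUS']   # True: row-major layout
strides = arr.strides                    # bytes to step to the next row / next column
print(is_c_order, strides)

row_major = arr.ravel()                  # default order='C': rows one after another
col_major = arr.ravel(order='F')         # Fortran order: columns one after another
print(row_major)
print(col_major)
```

A stride of 8 bytes along the last axis means that neighbouring elements of a row sit next to each other in memory.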


In [None]:
# useful commands to generate data/index

print(np.arange(1.,2.,0.25)) # left inclusive
print(np.linspace(1.,2.,5)) # left and right inclusive
print(np.logspace(1.,2.,5)) # base 10 by default

# matrix creation
print(np.zeros((3,3))) # 3-by-3 zero matrix
print(np.ones((4,4))) # 4-by-4 matrix of ones
print(np.diag(np.arange(4))) # diagonal matrix

lon = np.linspace(-100,-80,21)
lat = np.linspace(20,30,11)

xx,yy = np.meshgrid(lon,lat) # xx holds longitudes and yy latitudes; useful for plotting spatial maps
print(xx.shape,yy.shape)
print(xx)
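
The shapes returned by `np.meshgrid` are easier to see on a tiny grid. In this sketch the first argument is treated as the x direction (longitude) and the second as the y direction (latitude); the coordinate values are arbitrary.

```python
import numpy as np

lon = np.array([-100., -90., -80.])   # 3 longitudes (x direction)
lat = np.array([20., 30.])            # 2 latitudes (y direction)

lon2d, lat2d = np.meshgrid(lon, lat)

print(lon2d.shape)   # rows follow lat, columns follow lon
print(lon2d)         # every row repeats the longitudes
print(lat2d)         # every column repeats the latitudes
```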

Numpy offers many array operations. All the familiar arithmetic and comparison operators are applied on an element-by-element basis.

In [None]:
abundance_mat = np.random.random(xx.shape) 
# generate a [0,1) random matrix with the same shape as xx
print(abundance_mat)

In [None]:
print(abundance_mat > 0.1) # places with abundance larger than 0.1
print(np.sum(abundance_mat > 0.1)) # total number of elements with abundance larger than 0.1
print(np.mean(abundance_mat > 0.1)) # fraction of such elements, should be close to 0.9
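
Boolean masks can be counted, averaged, and used to select values. The sketch below uses a seeded random generator (an assumption made only so that the run is reproducible) and its own array names.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator for reproducibility
mat = rng.random((11, 21))        # uniform values in [0, 1)

mask = mat > 0.1
count = mask.sum()                # each True counts as 1
fraction = mask.mean()            # fraction of True values
print(count, fraction)

# a mask used as an index returns the matching values as a 1-D array
selected = mat[mask]
print(selected.size, selected.min())
```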

In [None]:
# numpy is efficient at array manipulation
# let's use an element-wise product as an example

# define a function that multiplies two 1-D arrays element by element with a Python loop
def my_elementwise_prod(A,B):
    '''
        Return the element-wise product of the 1-D arrays A and B using an explicit loop
    '''
    
    C = np.zeros_like(A)
    
    length = len(C)
    
    for i in range(length):
        C[i] = A[i] * B[i]
        
    return C

# use numpy's vectorized element-wise product
def numpy_prod(A,B):
    
    return A * B


# let's check their performance
A = np.random.random(100)
B = np.random.random(100)

%timeit my_elementwise_prod(A,B)

%timeit numpy_prod(A,B)
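
Note that `A * B` multiplies two arrays element by element, whereas a dot product additionally sums those products into a single number. A minimal sketch with small hand-checkable vectors:

```python
import numpy as np

A = np.array([1., 2., 3.])
B = np.array([4., 5., 6.])

elementwise = A * B      # [4., 10., 18.]
dot = np.dot(A, B)       # 4 + 10 + 18 = 32.0
dot_operator = A @ B     # the @ operator gives the same result for 1-D arrays

print(elementwise, dot, dot_operator)
```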
    

Note that reassigning a numeric variable inside a function does not change its value outside the function, whereas in-place changes to lists/arrays made inside a function are kept.

In [None]:
def modify_values(a):
    a = 1
    b = 1
    print(f'in function a,b = {a}, {b}')
    return (a,b)

a, b = 0, 0
new_a,new_b = modify_values(a)

print(f'after function a,b = {a}, {b}')
print(f'after function new_a,new_b = {new_a}, {new_b}')


In [None]:
# note that in-place changes to lists/arrays made inside a function are kept


def modify_list(A,B):
    A.append('new_element')
    B[0] = 1.
    return

A = []
B = np.arange(5)
modify_list(A,B)
print(A)
print(B)
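
When such side effects are not wanted, one option is to pass a copy. The sketch below uses two hypothetical helper functions to contrast in-place modification with rebinding a name inside a function.

```python
import numpy as np

def set_first_to_one(arr):
    # hypothetical helper: modifies its argument in place
    arr[0] = 1
    return arr

def rebind(arr):
    # hypothetical helper: 'arr' is rebound to a new local object,
    # so the caller's array is not touched
    arr = np.zeros(3)
    return arr

original = np.arange(5)
safe = set_first_to_one(original.copy())   # only the copy is modified
rebound = rebind(original)

print(original)   # unchanged by both calls
print(safe)
print(rebound)
```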



### Challenge 2. Function to account for metabolic temperature sensitivity.
Various metabolic processes are sensitive to temperature due to changes in enzymatic activity. A simple quantitative description of such temperature sensitivity is $Q_{10}$ (the factor by which a rate changes for every 10 degrees Celsius of warming):

$X = X_0\times Q_{10}^{\frac{T-T_0}{10}}$ 

Write a function to calculate X for input T array given a tuple containing $X_0$, $T_0$, and $Q_{10}$.

What if X saturates after a certain optimal temperature $T_{opt}$? (i.e., $X(T) = X(T_{opt})$ when $T > T_{opt}$)
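
As a sanity check for any implementation, two values can be worked out by hand, assuming the parameters given in the cell below ($X_0 = 10.5$, $T_0 = 25$, $Q_{10} = 2$):

$$X(35) = 10.5 \times 2^{\frac{35-25}{10}} = 10.5 \times 2 = 21$$

$$X(15) = 10.5 \times 2^{\frac{15-25}{10}} = 10.5 \times 0.5 = 5.25$$

With $T_{opt} = 35$, the saturating version should therefore return 21 for every $T > 35$.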

In [None]:
params = (10.5,25.,2.) # X_0, T_0, and Q_10
T_opt = 35.

---
### 3. Pandas, Matplotlib, and Cartopy

[Pandas](https://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. It organizes data similarly to R data frames and is particularly suited to tabular data (e.g., data in Excel spreadsheets or CSV files).

[Matplotlib](https://matplotlib.org/) is the backbone of data visualization in Python.

[Cartopy](https://scitools.org.uk/cartopy/docs/latest/) is a package designed for geospatial data processing and visualization.

In [None]:
import pandas as pd

# the basic structure is the Series: a one-dimensional array with an index

Tree_ID = ['QR1','QR2','QA1','QA2','AR1','AR2']
diameter = [15.5,5.5,20.5,30.,55.,65.]

d_series = pd.Series(diameter,index=Tree_ID)
print(d_series)


In [None]:
from matplotlib import pyplot as plt # import matplotlib
# allow for inline plotting in jupyter notebook
%matplotlib inline 

In [None]:
fig = plt.figure(figsize=(3,4)) # figsize in inches
axis_handle = d_series.plot(kind='bar')
# or d_series.plot.bar()


**Indexing**
We can get values by label with the `.loc` attribute or by position with the `.iloc` attribute.

In [None]:
print(d_series.loc[['QR1','QA1']]) # use index
print(d_series.iloc[[0,2]]) # use raw position
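
One detail worth knowing is that label slices with `.loc` include both end points, while position slices with `.iloc` exclude the stop, as in lists. The sketch below uses a shortened stand-alone series and also shows selection with a boolean mask.

```python
import pandas as pd

d_series = pd.Series([15.5, 5.5, 20.5], index=['QR1', 'QR2', 'QA1'])

by_label = d_series.loc['QR2']             # single value by label
by_position = d_series.iloc[1]             # the same element by position

label_slice = d_series.loc['QR1':'QR2']    # includes BOTH 'QR1' and 'QR2'
position_slice = d_series.iloc[0:1]        # excludes position 1

large = d_series[d_series > 10.]           # boolean mask keeps diameters above 10

print(by_label, by_position)
print(label_slice)
print(position_slice)
print(large)
```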

**DataFrame**
A DataFrame is equivalent to a table in a spreadsheet.

In [None]:
# assume we know the allometric equation to convert tree diameter to tree height
tree_height = np.exp(1.0 + 0.5 * np.log(d_series.values))

# add tree species
species_dict = {'QA' : 'Quercus alba',
                'QR' : 'Quercus rubra',
                'AR' : 'Acer rubrum'}

tree_species = [species_dict[tree_index[0:2]] for tree_index in d_series.index]

# create a dataframe from a dictionary
data = {'Species' : tree_species,
        'Diameter' : d_series.values,
        'Height' : tree_height}

df = pd.DataFrame(data,index=Tree_ID)
print(df)


In [None]:
df.info()

In [None]:
df.head(2)

In [None]:
df.min()

In [None]:
df.mean(numeric_only=True) # skip the non-numeric Species column

In [None]:
df.std(numeric_only=True) # skip the non-numeric Species column

In [None]:
df.describe()

Add column to a DataFrame

In [None]:
df['Volume'] = df['Height'] * np.pi * (df['Diameter'] / 200.) ** 2. # assume each tree is a cylinder
df.describe()

Now let's explore pandas and matplotlib using a real data set:
the Biomass And Allometry Database for woody plants ([BAAD](https://github.com/dfalster/baad)).

In [None]:
# The data are stored on my GitHub repo
# Note that we have to use the raw CSV files (raw.githubusercontent.com)
baad_data_url = 'https://raw.githubusercontent.com/xiangtaoxu/QuantitativeEcology/main/Lab1/baad_data.csv'
baad_dictionary_url = 'https://raw.githubusercontent.com/xiangtaoxu/QuantitativeEcology/main/Lab1/baad_dictionary.csv'

# specifying the encoding is not always necessary
# here it is included because the raw CSV files are not compatible with UTF-8 encoding

df_data = pd.read_csv(baad_data_url, encoding='latin_1') # can also read local files
df_dict = pd.read_csv(baad_dictionary_url, encoding='latin_1')

In [None]:
df_data.describe()

In [None]:
print(df_dict)

In [None]:
# plot two histograms showing the distributions of sample latitude and sample height
# panel_x is the number of rows and panel_y is the number of columns
panel_x,panel_y=1,2

fig, axes = plt.subplots(panel_x,panel_y,figsize=(panel_y*3,panel_x*3))
# note that plt.subplots takes (nrows, ncols) while figsize takes (width, height), so the order is reversed

df_data.plot(y='latitude',kind='hist',ax=axes[0],bins=20)
df_data.plot(y='h.t',kind='hist',ax=axes[1],bins=np.linspace(0.,50.,20))

fig.tight_layout()
plt.savefig('./baad_pandas_hist.png',dpi=300)

In [None]:
# scatter plots of diameter vs height and diameter vs leaf area
panel_x,panel_y=2,1

# we ask the two panels to share the x axis
fig, axes = plt.subplots(panel_x,panel_y,figsize=(panel_y*3,panel_x*3),
                         sharex=True)
# note that plt.subplots takes (nrows, ncols) while figsize takes (width, height), so the order is reversed

df_data.plot(x='d.bh',y='h.t',kind='scatter',ax=axes[0],s=10,c='r',alpha=0.5
            ,loglog=True,xlim=[1e-4,10.])
axes[0].set_title('DBH vs Height')

df_data.plot(x='d.bh',y='a.lf',kind='scatter',ax=axes[1],s=20,c='b',alpha=0.5
            ,loglog=True,xlim=[1e-4,10.])
axes[1].set_title('DBH vs Leaf Area')

fig.tight_layout()
plt.savefig('./baad_pandas_scatter.png',dpi=300)

There are more tricks for plotting with pandas. Check [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) for more.

Here is one more example, using grouping:

In [None]:
# subset df by selecting forest vegetation types
df_sub = df_data[df_data['vegetation'].isin(['BorF','Sav','TempF','TropRF','TropSF','Wo'])]
df_sub.to_csv('./baad_forest.csv')

# boxplot creates its own figure when `by` is given, so pass figsize directly
df_sub.boxplot(column='h.t',by='vegetation',figsize=(12,3))
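
The same grouping idea also works for numerical summaries through `groupby`. The sketch below uses a small synthetic table; the values are made up for illustration and are not taken from BAAD.

```python
import pandas as pd

# small synthetic table: vegetation type and tree height
df_demo = pd.DataFrame({
    'vegetation': ['TempF', 'TempF', 'TropRF', 'TropRF', 'Sav'],
    'h.t': [20., 30., 40., 50., 5.],
})

# one row per vegetation type, one column per summary statistic
summary = df_demo.groupby('vegetation')['h.t'].agg(['count', 'mean', 'max'])
print(summary)
```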


### 4. Spatial maps with Cartopy


In [None]:
import cartopy.crs as ccrs

Let's start with a global map with coastlines

In [None]:
fig = plt.figure()

ax = plt.axes(projection=ccrs.Mollweide())

# you can try a few more projections such as
# Mercator (not recommended because of strong distortion near the poles)
# InterruptedGoodeHomolosine
# SouthPolarStereo

ax.stock_img()

ax.coastlines()

# add sample locations from the BAAD database
baad_latlon=df_data[['longitude','latitude']].sample(100) # randomly sample 100 records

ax.plot(baad_latlon['longitude'].values,
        baad_latlon['latitude'].values,
        marker='o',
        markerfacecolor='r',
        linestyle='',
        transform=ccrs.PlateCarree())

plt.show()

Then, let's try to overlay the synthetic abundance data on US state boundaries

In [None]:
import cartopy.feature as cfeature

fig = plt.figure()

# `projection` sets the map projection used for the plot
# `transform` declares the coordinate system of the data

plot_proj = ccrs.LambertConformal()
latlon_proj = ccrs.PlateCarree() # use this for data given in lat/lon
             
ax = plt.axes(projection=plot_proj)




lon = np.linspace(-100, -80, 21)
lat = np.linspace(20, 30, 11)

lon2d, lat2d = np.meshgrid(lon, lat)



abundance_mat = np.random.random(lon2d.shape) 



hpc = ax.pcolormesh(lon2d, lat2d,abundance_mat,vmin=0,vmax=1,transform=latlon_proj)
ax.set_extent([-120,-70,15,35],latlon_proj)
ax.coastlines()


# add state borders

# cfeature.STATES is a predefined Natural Earth feature for state and province (Admin 1) boundaries

ax.add_feature(cfeature.STATES,facecolor='none',edgecolor='gray')


plt.colorbar(hpc,ax=ax, shrink=.5,label='Abundance')

plt.show()


### Challenge 3. Plot global distribution of sampling density in BAAD database

Count the number of samples in each 2-by-2 degree window and plot the resulting density on a global map.