<a href="https://colab.research.google.com/github/sio-co2o2/keelingcurve_notebooks/blob/main/notebooks/overview_of_notebooks_keelingcurve.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview of python and its use in Keeling Curve Notebooks

This notebook is an overview of the python used in the notebooks that open in Google Colab to create the graphics appearing on the Keeling Curve website.

Keeling Curve website: [keelingcurve.ucsd.edu](https://keelingcurve.ucsd.edu)

[Keeling Curve Overview notebook](https://colab.research.google.com/github/sio-co2o2/keelingcurve_notebooks/blob/main/notebooks/overview_of_notebooks_keelingcurve.ipynb) with links to all Keeling Curve notebooks



Additional overviews of python can be found at [https://www.knowledgehut.com/tutorials/python-tutorial](https://www.knowledgehut.com/tutorials/python-tutorial) and [https://www.codecademy.com/learn/learn-python-3](https://www.codecademy.com/learn/learn-python-3). There are also many YouTube videos, free python courses at [https://www.freecodecamp.org/news/tag/python/](https://www.freecodecamp.org/news/tag/python/), and affordable online courses such as those from [Udemy](https://www.udemy.com) (they have sales frequently).  

### Keeling Curve website graphics created with python

The Keeling Curve is a record of atmospheric carbon dioxide concentration measured at the Mauna Loa Observatory, Hawaii, starting in 1958. The Keeling Curve website presents a series of graphics of the Keeling Curve over various time periods, alongside ice core records going back 800K years. Each of those plots can be generated by its own notebook.

<a name="toc"></a>

## This notebook consists of two parts

One is an overview of python and the other is an overview of the python functions used in the Keeling Curve notebooks.

## Overview of Python TOC

1. [Using python in Google Colab Notebooks](#using-python-in-google-colab-notebooks)
2. [Python code basics](#Python-code-basics)
3. [Python functions](#Python-functions)
4. [Importing external code packages](#Importing-code-packages)
5. [Requests package to fetch data files](#requests-package-to-fetch-data-files)
6. [NumPy package to work with numbers](#NumPy-package-to-work-with-numbers)
7. [Pandas package to read in and manipulate data](#pandas-package-to-read-in-and-manipulate-data)
8. [Matplotlib package for plotting](#matplotlib-package-for-plotting)
9. [Using notebooks locally](#using-notebooks-locally)

## Python Functions Used in Keeling Curve Notebooks TOC

1. [Functions to fetch data](#functions-to-fetch-data)
2. [Examples of functions used in the plotting notebooks](#examples-of-functions-used-in-the-plotting-notebooks)
3. [Functions to configure plot properties](#functions-to-configure-plot-properties)
4. [Functions to save a plot](#functions-to-save-a-plot)

# Overview of python

## Using Python in Google Colab Notebooks

Google Colab notebooks automatically include python and allow useful python packages to be imported for use in your notebook.


## Python basics (used to form functions)

This section discusses 1) collection types to hold related numbers and/or strings (the three types discussed are lists, dicts, and tuples), 2) true/false conditions, and 3) code loops. Functions, which are reusable blocks of code, are discussed later.


### Collection types are lists, dicts and tuples

#### Lists

In python, keeping track of a series of numbers can either be done with a "list" or a "NumPy array". The NumPy array is better for math, and the list is good for basic use. The NumPy package is discussed below. A python list is a way of representing a collection of comma separated values inside a pair of square brackets '[]'. For example, [1, 2, 3, 4, 5] and ['one', 'two', 'three']. And a list can contain both numbers and strings at the same time. The first element is at index 0 since python starts at 0 instead of 1. To get the first element in a list, type the following:

    a_list = [1, 2, 3, 4, 5]
    variable = a_list[0]
    
where variable = 1.

To get a part of a list, use a colon ':' to separate the start and end indices. One tricky point is that python does not include the value at the end index when selecting with a ':'. So the following code for a slice of a list results in a list of one number and not two. 

    a_value = a_list[2:3]
    
The variable a_value = [3]. The number 3 is at index = 2. Python list selections are thus [start_index: end_index] where the value at end_index is not included in the result. Notice that a list is returned when a colon ':' is used, and that a single value of 3 is returned if you type a_list[2]. 

    b_value = a_list[2]
    
Lists are great for collecting similar items together.


#### Dicts

It's helpful to use names associated with numbers and strings. There are variables of course, but there are also dicts, short for dictionaries. A dict is represented with a pair of curly braces '{}' and is a collection of names (called keys) and their associated values. These values can be numbers, strings, lists, more dicts, etc. An example of a dict is:

    a_dict = {'x': 5}
    b_dict = {'x': 5, 'y': 7}
    
where the names called keys and their values are separated by a colon ':'. Dicts are a helpful way of collecting a lot of variable names and values into one identifier. To get the value of a key, type the following:

    variable = a_dict['x'] 
    
where variable is the value 5. In general, a dict is a set of key/value pairs enclosed in curly braces, where each key is associated with its value using a colon ':' and the key/value pairs are separated by commas ','. 

    variable = {key1: value1, key2: value2, key3: value3}



#### Tuples

Tuples are another way of representing a collection. Tuples differ from lists and use a pair of parentheses '()' instead of square brackets. Tuples are ordered, unchangeable, and allow duplicate values. You can read more about tuples here [https://www.geeksforgeeks.org/python-tuples/](https://www.geeksforgeeks.org/python-tuples/).
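
As a minimal sketch of these properties (the variable names and values below are only for illustration), a tuple can be indexed like a list, but trying to change one of its values raises an error:

```python
# A tuple is written with parentheses instead of square brackets
a_tuple = (1958, 'Mauna Loa', 1958)

# Tuples are ordered, so elements are read by index just like a list
first_value = a_tuple[0]

# Duplicate values are allowed
number_of_1958s = a_tuple.count(1958)

# Tuples are unchangeable, so assigning to an index raises a TypeError
try:
    a_tuple[0] = 2024
    was_changed = True
except TypeError:
    was_changed = False

print(first_value, number_of_1958s, was_changed)
```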

### TRUE/FALSE conditions

In programming, there are many times where a decision has to be made, and to represent these, a True result means the decision was 'yes' and a False result means the decision was 'no'. To represent this decision, the words 'if', 'else', and 'elif' are used. And to separate the decisions to be made, a colon ':' is added after the decision statement. For python to know what code is associated with an if statement, that code is indented, usually by 4 spaces. For example:

    if True:
        print('The decision is yes')
    else:
        print('The decision is no')
        
And if there are more decisions, add in an elif statement.

    variable = 10

    if variable == 5:
        print('The decision is yes, the variable is equal to 5')
    elif variable == 10:
        print('The decision is yes, the variable is equal to 10')   
    else:
        print("The decision wasn't made, so do this instead")

The double equal sign '==' stands for checking if the value on the left is the same as the value on the right. 

And in python, code is indented by 4 spaces after a statement that ends in a colon ':'. The 'print' statement means display the result.



### Code loops

Python represents looping with a 'for' statement. 

To do something for each element of a list, use the following code:

    a_list = ['a', 'b', 'c']
    
    for elem in a_list:
        print(elem)

This will look at each element in the list, set the value to a variable named 'elem' and then do something with that variable. In this case, display it. 

The for statement also works with dicts, but in a different way to get to the keys and values inside.

    a_dict = {'key1': 'value1', 'key2': 'value2'}
    
    for a_key, a_val in a_dict.items():
        print(a_key)
        print(a_val)
        
This will look at each key/value item pair, and then extract out the keys to a variable a_key and the values to a variable a_val. And for each loop, both the key and the value are printed to the display.  



There is also another way to do looping called list comprehension. 

The following for loop

    y = []
    for x in [1,2,3]:
        x = x*2
        y.append(x)
       
is the same as

    y = [x*2 for x in [1,2,3]]

For both, y = [2, 4, 6]



## Python functions

In python, you can type commands without a function, but if you want to contain them and be able to use the same code in many places, functions are necessary. Functions are created using the 'def' keyword followed by the function name and a pair of parentheses '()', and finally a colon ':'. All code of the function is indented so that python knows which code is part of the function. 

    def function_name():
        print("I'm inside a function")
        
and then to call the function elsewhere, type the function name followed by a pair of parentheses '()'.

    function_name()

If there are values to pass to the function, place them inside the parentheses. And to return a calculated value from the function, use the keyword 'return'. If nothing is being returned, then you can skip writing a return statement. You can think of 'def' as standing for 'definition' since a function is being defined.

    def function_name(x):
        y = x + 2
        return y
       
and to call it,

    answer = function_name(5)
    
where the value of answer is 7.

In python, a function has to be defined before it is called. Python runs a program from the top to the bottom, so a call in a notebook cell or script can't use a function whose 'def' hasn't been run yet. A simple way to stay safe when one function calls another is to place the called function first.

    def function_being_called():
        print("I'm being called")

    def function_calling_another():
        function_being_called()

    function_calling_another()

and the result is "I'm being called" printed to the console.


## Importing code packages

Google Colab easily enables access to external code packages using the import statement. 

The import statements, seen at the top of every notebook, are commands to enable external python code packages to be used in a notebook. This way another programmer can write a code package and store it in an online repository to be downloaded when needed. 

If you were to download the Jupyter notebook and use it on your own computer, it is necessary to create a "python environment" which is like a container that holds downloaded code packages. Even though the packages have been downloaded, it is required to use the import statement in a notebook to let python know you will be using them. Google Colab notebooks create the python environment automatically and so packages don't need to be downloaded and installed first, but just imported. 
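
As a small sketch of the common forms of the import statement, the example below uses the math and statistics packages that come with python. They are chosen here only because they need no installation; they are not necessarily the packages imported in the Keeling Curve notebooks.

```python
# Import a whole package and call its functions with the package name and a dot
import math

# Import a package under a shorter abbreviation
import statistics as stats

# Import a single function so it can be called without a prefix
from math import sqrt

circle_area = math.pi * 2 ** 2
average = stats.mean([1, 2, 3, 4])
root = sqrt(16)

print(circle_area, average, root)
```

The abbreviation form is the one used for NumPy, pandas, and Matplotlib in the sections that follow.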


[TOC](#toc)

## Requests package to fetch data files <a name="requests-package-to-fetch-data-files"></a>

The [requests package](https://docs.python-requests.org/en/latest/) enables data to be downloaded from a website link and saved into a variable instead of a file. Well-formatted data can also be read into a variable using the pandas package, which is described below. The notebook uses the requests package instead of pandas because the ice core data files are text based with many comments and unneeded data mixed in. The notebook uses the pandas package for the MLO data because it is well formed with clear comment lines and column names. To use requests to fetch data, use the following code:

    variable = requests.get('file location')
    
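The response holds the downloaded content in more than one form, so you choose the one you need. For a text file, use the 'text' attribute to get the content as a string (for JSON data, the response also has a json() function):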

    variable_text = variable.text
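
A slightly fuller sketch of the same idea is shown below. The function name fetch_text and the 30 second timeout are assumptions made for illustration and are not taken from the Keeling Curve notebooks. The call to raise_for_status() stops with an error if the download failed, instead of silently returning an error page as the text.

```python
import requests


def fetch_text(url):
    """Download the file at url and return its contents as a string."""
    response = requests.get(url, timeout=30)
    # Raise an error if the server reported a problem (for example 404 Not Found)
    response.raise_for_status()
    return response.text


# Example call, where data_url is a hypothetical web address of a text file:
# file_text = fetch_text(data_url)
```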
    
[TOC](#toc)

## NumPy package to work with numbers <a name="numpy-package-to-work-with-numbers"></a>

With the NumPy package, you can work with arrays. A single list is one dimensional (only rows or only columns), while arrays can hold multiple dimensions (rows and columns at the same time, along with other dimensions like time, etc.). NumPy arrays can also be just one dimensional. NumPy is used because math on large collections of numbers is much faster with arrays than with lists. 

As a shortcut, NumPy is imported such that an abbreviation of 'np' can be used.

    import numpy as np
    
    
A python list can be turned into a numpy array by the following

    a_list = [1, 2, 3, 4]
    
    a_numpy_array = np.array(a_list)
    
    
For a two dimensional array, type the following (notice the square brackets '[]' enclosing everything):

    arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
    
The NumPy package includes many functions for working with arrays. You call a function by placing a dot (.) after np, followed by the function name. For example: 
    
    variable = np.shape(arr)

This gives the shape of the array which is (2,4) which means there are two rows and 4 columns. The first dimension = 2 and the second dimension = 4.
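
To illustrate why arrays are better for math than lists, here is a short sketch (the variable names are only for illustration). Multiplying an array acts on every element, while multiplying a list repeats the list:

```python
import numpy as np

a_list = [1, 2, 3, 4]
a_numpy_array = np.array(a_list)

# Multiplying a numpy array by 2 doubles every element
doubled_array = a_numpy_array * 2

# Multiplying a list by 2 repeats the list instead
repeated_list = a_list * 2

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# Index with [row, column]; both start at 0
value = arr[1, 2]

# Many functions accept an axis; axis=0 works down each column
column_sums = arr.sum(axis=0)

print(doubled_array, repeated_list, value, column_sums)
```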


[TOC](#toc)

## Pandas package to read in and manipulate data <a name="pandas-package-to-read-in-and-manipulate-data"></a>

The [pandas package](https://pandas.pydata.org/) can be thought of as a python representation of an Excel worksheet. Data is stored in cells arranged in rows and columns, with column headers. Data stored in a variable as a combination of columns and rows is referred to as a dataframe. This makes it easy to work with one column at a time or with several. And it's intuitive to work with data having a label rather than just the numbers themselves. 

In the notebooks, the pandas package is called by abbreviating it as pd using the statement: 

    import pandas as pd

Functions associated with the pandas package are called using the format pd followed by a dot (.) and the function name. An example function call is "pd.read_csv(filename)" where filename is the name of a file you want to read in data from. This file can either be found locally or remotely. The function read_csv 'reads' or extracts data from the file and stores it in a variable. To avoid having functions with the same name conflicting with each other, the 'pd' part is used in front of the function name. You'll also notice the variable itself can call pandas functions by using a dot (.) and a function name after it. An example is 

    df = pd.read_csv('filename.txt')
    df.head()
    
The first code line reads in the data from the file filename.txt into the variable df (called a dataframe if the data has more than one dimension, meaning rows and columns), and the next code line tells the notebook to display the first 5 lines of the dataframe. A file with the extension '.csv' is a text file containing values separated by commas. Each value separated by a comma in a row is placed in a "column", meaning you can think of it like a spreadsheet column. If there are header lines, the columns are named using them. To get all the data in a column as a variable, use the following code:

    variable = df['column name']
    
To get a subset of a dataframe such as two columns, use the following code (notice the double square brackets):

    variable = df[['column name 1', 'column name 2']]

A single set of square brackets for one column, represents a "Series" or a one dimensional representation of the data. For two columns, a double set of square brackets is used which represents a "DataFrame" or a two dimensional representation of the data.
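
The difference can be checked with a small sketch. The csv text, column names, and values below are made up for illustration, and io.StringIO is used only so that the example does not need a file on disk:

```python
import io

import pandas as pd

# Hypothetical csv text standing in for a data file
csv_text = """year,value,label
1,10.0,a
2,20.0,b
3,30.0,c
"""

# read_csv also accepts a file-like object, so no file on disk is needed here
df = pd.read_csv(io.StringIO(csv_text))

# Single square brackets return one column as a Series
value_series = df['value']

# Double square brackets return a DataFrame, even for a subset of columns
df_subset = df[['year', 'value']]

print(type(value_series).__name__, type(df_subset).__name__)
print(df_subset.shape)
```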

The row labels of a dataframe are referred to as the index. As an aside, the index is not the same as the line number. An index is a way to label each row so it is unique. It is possible for there to be 4 rows and, instead of the rows being labeled 0, 1, 2, 3, the index is 0, 1, 3, 4 (for example after a row has been removed). Note that in python, an index starts at 0 and not at 1. To get back to row labels that start at 0 and increase by 1, use the reset_index function, e.g. df.reset_index(drop=True). 
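
One way to renumber the rows is the pandas reset_index function. The sketch below, using a made-up dataframe, shows an index with a gap and the same rows after renumbering:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

# Dropping the row labeled 2 leaves a gap in the index: 0, 1, 3, 4
df_dropped = df.drop(index=2)
index_with_gap = list(df_dropped.index)

# reset_index(drop=True) relabels the rows 0, 1, 2, 3 and discards the old labels
df_renumbered = df_dropped.reset_index(drop=True)
index_renumbered = list(df_renumbered.index)

print(index_with_gap, index_renumbered)
```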

To convert a pandas dataframe into a numpy array, run
    
    x = df['column_name'].to_numpy()

This is a nice introduction to pandas [https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/).


[TOC](#toc)

## Matplotlib package for plotting <a name="matplotlib-package-for-plotting"></a>

Matplotlib is a plotting package that enables plots to display when run in a Jupyter code cell. 

#### Setup

For matplotlib commands, the package is imported with the abbreviation 'plt' and then all functions follow with a dot '.' and the function name. The import command is

    import matplotlib.pyplot as plt

To display plots inside the notebook, place the next line at the top of the notebook

    %matplotlib inline
    
This command will make your plot outputs appear and be stored within the notebook.


#### Plotting
There are two attributes associated with a plot, the Figure and the axes. 

The Figure refers to the plot that is created. One way to create a plot and keep track of it to save later is to set a variable fig equal to the output of calling plt.figure() (with a lowercase 'f'), e.g. fig = plt.figure(). The variable fig is referred to as a 'handle'. A handle is a way to keep track of the plot so that you can save it later.

    fig = plt.figure()

    x = [1,5]
    y = [3,4]

    plt.plot(x,y)

    fig.savefig('filename.png')
    
The axes refers to the plotting area and its graphic settings. When a plot is created, use a variable, called a handle, to keep track of the axes so that the plot can be customized. In the statements below, 'ax' is the axes handle and 'fig' is the figure (plot) handle. To apply a function to the axes, such as creating a label for an axis or setting the limits of the plot, type the name of the axes handle, ax, followed by a dot '.' and then the function name. Below is a function call to set the plot limits of the x axis from 0 to 5.

    fig, ax = plt.subplots()
    
    x = [1,5]
    y = [3,4] 
    
    ax.set_xlim(0, 5)

Notice that 'subplots' is used. In the previous discussion about figures, a figure was created without an axes handle to configure the plot properties. To get an axes handle, you can either define it right away with fig, ax = plt.subplots() or use fig = plt.figure() followed by ax = fig.add_subplot(111), where 111 means a grid of 1 row and 1 column with the axes placed in position 1.

The x and y axes have many configurations that can be set using functions on the axes handle. Some options include setting the limits of the plot, including labels for the x and y axis, and adding a title to the plot.

Along with properties, the plot function is called using the axes handle. 

    ax.plot(x, y)
    
You can plot directly without using an axes handle, plt.plot(x, y), but the axes handle makes it easier to customize the plot. 
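
Putting these pieces together, a minimal sketch of a complete plot looks like the following. The labels, title, and file name are placeholders chosen for illustration:

```python
import matplotlib.pyplot as plt

x = [1, 5]
y = [3, 4]

# Create the figure handle and the axes handle together
fig, ax = plt.subplots()

# Plot and customize using the axes handle
ax.plot(x, y)
ax.set_xlim(0, 5)
ax.set_xlabel('x values')
ax.set_ylabel('y values')
ax.set_title('Example plot')

# Save using the figure handle; the file name here is only an example
fig.savefig('example_plot.png')
```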

Another feature of matplotlib is to set font and linewidth information for all plots created so that the options are set once and not each time a plot is created. The property to do this is rcParams. To set the font family, type the following:

    plt.rcParams.update({
        "font.family": "sans-serif"
        })

A great introduction to plotting with Matplotlib is [https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py](https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py).


[TOC](#toc)

## Using notebooks locally

You can download notebooks from the Keeling Curve GitHub repository and run them on your own computer. The best way to use a python notebook on your local computer is to create a python environment (see below). 

You can also modify notebooks using Google Colab and save your changes. Here is an overview of Google Colab along with a description of how to download files (notebooks) from Colab. [https://www.dummies.com/article/technology/programming-web-design/python/working-with-google-colaboratory-notebooks-262687/](https://www.dummies.com/article/technology/programming-web-design/python/working-with-google-colaboratory-notebooks-262687/)

[TOC](#toc)

### Python environment

A python environment is a way to separate the python versions and python package versions used for different coding projects. With a python environment, you can start a new code project and use a new python version and new package versions without breaking other code projects. Sometimes updates have new features that don't work with older versions and can break code unless it is separated into environments.

Using pyenv to manage python environments [https://realpython.com/intro-to-pyenv/](https://realpython.com/intro-to-pyenv/).



[TOC](#toc)

# Python functions used in Keeling Curve Notebooks

### Functions to fetch data

Two methods are used to fetch data.

1. One is using pandas and its function read_csv, which can read a csv file or text file either locally from your computer or remotely from a web address. Here a web address (url) is used since the notebook runs on Google Colab and the data is located in a remote GitHub repository.

    df = pd.read_csv(url)

2. Two is using the requests package. It's a python package to call a url and retrieve files in multiple formats. Since the content can be read in more than one form, the response is followed by a dot '.' and 'text' here because the file to be fetched is a text file. 

    response = requests.get(icecore_2K_url)
    file_text = response.text




### Examples of functions used in the plotting notebooks

This line gets the decimal date of the seasonally adjusted data out of the pandas dataframe and converts it to a numeric numpy array. The plots can actually use pandas dataframes, but numpy arrays were used for consistency since, on some occasions, the data are used in the numpy numeric format outside of plotting.

    mlo_date = df_mlo['date_seas_adj'].to_numpy()

This line takes the file that was read into a string and splits it on the newline character ('\n'), which results in a list of strings representing the lines of the file. This is to get the data into a list form that can be read into a pandas dataframe. A pandas dataframe is a very convenient form to hold and transform the data, such as removing any lines with NaN CO2 values.

    text_lines = file_text.split('\n')

This line is using a list comprehension (a type of for loop) to iterate over each line of a text file to find the header at the start of icecore data to be extracted. 

    start_section = [i for i in range(len(text_lines)) if text_lines[i].startswith('2. CO2 by Core')][0]

This line gets a range of data from a text file starting at the row of the start of data to be retrieved and the end of the data section. 

    data_lines = section_lines[start_data: end_section]
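
The three steps above (splitting into lines, finding the header, and slicing out the data) can be sketched together on a small made-up text. The file contents, the rule for where the section ends, and the two-line offset to the first data row are all assumptions for this example and do not describe the real ice core files:

```python
# Hypothetical file text standing in for a downloaded data file
file_text = """Some comment lines
1. Introduction
2. CO2 by Core
age co2
10 1.0
20 2.0
3. Notes
More comments"""

text_lines = file_text.split('\n')

# Index of the first line that starts with the section header
start_section = [i for i in range(len(text_lines))
                 if text_lines[i].startswith('2. CO2 by Core')][0]

# Assumption for this sketch: the section ends at the next numbered header
end_section = [i for i in range(len(text_lines))
               if text_lines[i].startswith('3. ')][0]

# Skip the header line and the column-name line to reach the data
start_data = start_section + 2

data_lines = text_lines[start_data:end_section]
print(data_lines)
```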

This line creates a regular expression to retrieve numeric values (\d represents a numeric digit).

    r = re.compile(r'(.+\d+.*\d+.*\d)\s.*')
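
To show how a compiled regular expression is applied, here is a sketch with a simpler pattern than the one used in the notebooks. The pattern, the sample lines, and their values are made up for illustration:

```python
import re

# \d+ matches one or more digits and \s+ matches one or more spaces;
# parentheses mark the groups to capture
pattern = re.compile(r'\s*(\d+)\s+(\d+\.\d+)')

# Hypothetical lines: two data rows and one comment
lines = ['  137   280.4', 'comment without numbers', '  268   274.9']

rows = []
for line in lines:
    match = pattern.match(line)
    # match is None when the line does not fit the pattern
    if match:
        rows.append((int(match.group(1)), float(match.group(2))))

print(rows)
```

The r in front of the pattern string marks it as a raw string, so that python passes backslashes such as \d to the regular expression unchanged.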

These lines convert a list of strings into a pandas dataframe and then name the column 'data'.

    df_icecore_2K = pd.DataFrame(data_list)
    df_icecore_2K.columns = ['data']

This line removes any lines with a NaN value of CO2.

    df_icecore_800K = df_icecore_800K.dropna()

This line filters the data to find icecore data going back 800K years up to 2K years back.

    df_icecore_800K = df_icecore_800K[df_icecore_800K['date_ce'] < min_2K]
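
Both steps can be tried on a small made-up dataframe. The values and the cutoff below are hypothetical and only show the mechanics of dropna and of filtering rows with a condition:

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks a missing CO2 measurement
df = pd.DataFrame({
    'date_ce': [-5000.0, -1000.0, 500.0, 1500.0],
    'co2': [1.0, np.nan, 3.0, 4.0],
})

# Remove every row that contains a NaN
df = df.dropna()

# Keep only the rows whose date is earlier than a cutoff
cutoff = 1000.0
df_before_cutoff = df[df['date_ce'] < cutoff]

print(df_before_cutoff)
```

The expression inside the square brackets produces a True/False value for each row, and only the rows marked True are kept.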

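This line combines two dataframes by stacking the rows of the second below the rows of the first, which is the default behavior of pd.concat. The option ignore_index=True renumbers the rows of the combined dataframe starting from 0.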

    df_combined = pd.concat([df_combined_icecore, df_mlo], ignore_index=True)
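
As a sketch of what pd.concat does with its default settings, the made-up dataframes below are stacked one below the other, with and without ignore_index=True:

```python
import pandas as pd

# Hypothetical dataframes that share the same column names
df_older = pd.DataFrame({'date': [1.0, 2.0], 'co2': [10.0, 20.0]})
df_newer = pd.DataFrame({'date': [3.0, 4.0], 'co2': [30.0, 40.0]})

# By default concat stacks the rows of the second dataframe below the first
df_stacked = pd.concat([df_older, df_newer], ignore_index=True)

# Without ignore_index=True the original row labels 0, 1, 0, 1 would be kept
df_kept_labels = pd.concat([df_older, df_newer])

print(df_stacked.shape)
print(list(df_stacked.index), list(df_kept_labels.index))
```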

This function converts a datetime (a string representation of a date) into a decimal.

    dt2t(adatetime)
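
The notebooks' own dt2t function is not reproduced here. As a rough sketch of one way such a conversion could work, the hypothetical function below (the name datetime_to_decimal_year and the approach are assumptions) takes a python datetime object and adds the elapsed fraction of the year to the year number:

```python
from datetime import datetime


def datetime_to_decimal_year(a_datetime):
    """Convert a datetime object into a decimal year such as 2021.5."""
    year_start = datetime(a_datetime.year, 1, 1)
    next_year_start = datetime(a_datetime.year + 1, 1, 1)
    # Fraction of the year that has passed, which accounts for leap years
    fraction = (a_datetime - year_start) / (next_year_start - year_start)
    return a_datetime.year + fraction


# Noon on July 2 is the midpoint of a 365 day year
decimal_year = datetime_to_decimal_year(datetime(2021, 7, 2, 12))
print(decimal_year)
```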


### Functions to configure plot properties

The gradient_fill function is used to apply a color gradient underneath a line.

The function below sets font and line width settings of the plots.

    set_matplotlib_properties

The function below sets properties of the plot like ticks, tick labels, and axes labels.

    set_website_plot_props

The function below customizes how tick labels are displayed such as shifting their position to be centered between ticks when labeling the x-axes dates as days where the left and right ticks represent the start and end of a day and not noon.

    create_xtick_labels

The function below applies a title at a custom distance from the top axis and sets its font properties.
    
    add_plot_title

The function below adds arrows at specific years by finding the CO2 value at specific years and pointing to those CO2 values. 

    apply_arrow_annotations

The function below adds a high resolution png UCSD/SIO logo to the plot.

    add_sio_logo



### Functions to save a plot

The function below saves the plots created by matplotlib into PDF and png formats. It saves them at a specified size and resolution.

    save_plot_for_website

The function below makes use of Google Colab's ability to download plot files saved in the Google Colab virtual environment. It uses ipywidgets to create a clickable button. 
    
    download_files



[TOC](#toc)