# **Lab 5 — NumPy and pandas**
---

## Introduction

If you're going to be manipulating data mathematically within Python, you're almost certainly going to be using **NumPy**, a Python library for numerical computing. Likewise, for reading in CSV or Excel files and keeping them organized, **pandas** is an essential tool. This lab explores these two key Python libraries.

Your deliverable for this lab will be this notebook, with **"deliverables" completed as requested below**. The "exercises" are exploratory and not graded. Please rename the notebook from `lab_05.ipynb` to `<last_name>_lab_05.ipynb` prior to submission. Download the file using **File $\rightarrow$ Download .ipynb**. Submit it to Canvas under the Lab 5 assignment **no later than midnight Thursday, September 30th**.

## Resources

[NumPy](https://numpy.org/doc/stable/)  

[pandas](https://pandas.pydata.org/docs/) 

[seaborn](https://seaborn.pydata.org/) 

## Exercise I: Introduction to math with NumPy

In previous labs or homework, you've likely noticed the common command
```python
import numpy as np
```
which imports the NumPy library under the alias `np`. We then access NumPy functions through that alias, for example `np.sqrt()` for the square root function, or `np.mean()` for the mean. Here are links for some common commands:

* [Mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html)
* [Statistics](https://numpy.org/doc/stable/reference/routines.statistics.html)

For any of the commands listed above, you can run them by writing `np.command_name_here()`. Try out the following code. If any function is unclear, find the documentation at one of the two links in the list above. Note also the handy `np.min()` and `np.max()` functions.

In [1]:
#Run this cell to install the pandas and seaborn libraries.
!pip install pandas seaborn

Collecting pandas
  Using cached pandas-1.3.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting seaborn
  Using cached seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting scipy>=1.0
  Using cached scipy-1.7.1-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (28.4 MB)
Installing collected packages: pandas, scipy, seaborn
Successfully installed pandas-1.3.3 scipy-1.7.1 seaborn-0.11.2


In [1]:
import numpy as np

print(np.sqrt(4))
print(np.sin(90))  # Trigonometric functions use radians, not degrees...
print(np.sin(np.deg2rad(90)))  # ...so you must convert using deg2rad()
print(np.ceil(3.78), np.floor(3.78))

a = [2, 4, -3]
print(f'\na = {a}')

print('\ndiff(a)')
print(np.diff(a))

print('\nmean(a)')
print(np.mean(a))

print('\nabs(a)')
print(np.abs(a))

print('\nsin(a)')
print(np.sin(a))  # We can operate on arrays as well as scalars!

print('\nmin(a)')
print(np.min(a))

2.0
0.8939966636005579
1.0
4.0 3.0

a = [2, 4, -3]

diff(a)
[ 2 -7]

mean(a)
1.0

abs(a)
[2 4 3]

sin(a)
[ 0.90929743 -0.7568025  -0.14112001]

min(a)
-3
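
The cell above passes a plain Python list to NumPy functions. As a minimal sketch of what that implies (the variable name `arr` is just for illustration), NumPy treats the list as an array, and the results that come back are arrays that support element-wise arithmetic:

```python
import numpy as np

a = [2, 4, -3]            # a plain Python list, as in the cell above
arr = np.asarray(a)       # the same values stored as a NumPy array

print(type(arr))                 # the array type NumPy works with
print(np.max(a), np.max(arr))    # list and array give the same maximum
print(np.abs(a) + 1)             # the result is an array, so + acts element-wise
```

Adding `1` to a plain list would raise an error, so the last line only works because `np.abs()` returns a NumPy array.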


## Deliverable 1: Demonstrating NumPy functions

In a **new code cell** below, perform the following operations using NumPy functions, and print — using `print()` — the results for each case. The only basic Python operator needed should be `+`; everything else should be NumPy! Remember that you can peruse the docs linked in Exercise I, or use Google, to find the relevant function.

1. $\cos(270^\circ)$

2. $\sqrt{3^2 + 4^2}$ — use `np.power()` here, not `**`

3. $\sum a_i$ where $a = [1, 2, 3]$

4. $\ln(e^{47})$

## Deliverable 2: Revisiting the tilt difference calculation using NumPy tools

Recall Deliverable 4 from Lab 3, where you wrote code "from scratch" to calculate the mean and standard deviation of a volcanic tilt dataset. That was a useful learning exercise, but in reality we usually want to use the tools already available to us to make things simpler. In this deliverable, you'll recreate the tilt difference and uncertainty calculation, this time using NumPy functions. First, please run the code cell below to download the data and define the `y_tilt` variable:

In [None]:
!curl -O https://raw.githubusercontent.com/uafgeoteach/GEOS636_PAG/master/AV37_lava_lake.txt

import numpy as np

# Read text file
data = np.genfromtxt("AV37_lava_lake.txt", dtype=None, names=['date', 'x', 'y'], encoding='utf-8')

# Define y_tilt
y_tilt = data['y']

**Some reminder notes and hints:**

* Indices 0–4980 correspond to "pre-event" data, while indices 5640–end correspond to "post-event" data. You can use these to subset the `y_tilt` variable appropriately.

* To find the uncertainty in your tilt change calculation, note the following error propagation rule. For a quantity $Q$ computed from the difference of quantities $a$ and $b$, i.e.
$$Q = a - b\,,$$
the uncertainty in $Q$ is
$$\delta Q = \sqrt{(\delta a)^2 + (\delta b)^2}\,,$$
where $\delta a$ and $\delta b$ are the uncertainties in quantities $a$ and $b$, respectively. You can use your calculated standard deviations as estimates for these. Note that this means your final answer should be of the form $Q \pm \delta Q$. You can read more about error propagation [here](http://ipl.physics.harvard.edu/wp-uploads/2013/03/PS3_Error_Propagation_sp13.pdf).
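
To make the propagation rule concrete, here is a small sketch on made-up numbers (the arrays `a_samples` and `b_samples` are hypothetical and are not the tilt data). It assumes the standard deviation is a reasonable uncertainty estimate, and it uses `np.std()` with its default settings, which may differ slightly from the definition you used in Lab 3:

```python
import numpy as np

# Hypothetical repeated measurements of two quantities (NOT the tilt data)
a_samples = np.array([10.2, 9.8, 10.1, 9.9])
b_samples = np.array([4.1, 3.9, 4.0, 4.0])

# Best estimates and their uncertainties
a, b = np.mean(a_samples), np.mean(b_samples)
delta_a, delta_b = np.std(a_samples), np.std(b_samples)

# Difference and its propagated uncertainty
Q = a - b
delta_Q = np.sqrt(np.power(delta_a, 2) + np.power(delta_b, 2))

print(f'Q = {Q:.2f} ± {delta_Q:.2f}')
```

Note that the uncertainties add in quadrature even though the quantities themselves are subtracted.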

In a **new code cell** below, include your code which calculates the change in tilt *as well as the uncertainty in that calculation*. Feel free to copy over any relevant code from Lab 3, but note that NumPy functions should be doing most of the work for you here!

## Exercise II: Working with pandas

pandas is a valuable tool for organizing tabular data within Python. It allows you to organize and manipulate your data more easily than just keeping things in lists/dictionaries. If you're used to working with CSV or Excel files, pandas provides easy ways to read, manipulate, and write these files. We'll learn more about pandas I/O (Input/Output) in a future lab. Here, we introduce pandas using some simple examples.

First, run the code below to import the pandas library, define some dictionaries corresponding to station coordinates, and create a pandas [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):

In [1]:
import pandas as pd

station_1 = {'Name': 'SBY', 'Lat': 33.975, 'Lon': -107.181}
station_2 = {'Name': 'LEM', 'Lat': 34.166, 'Lon': -106.972}
station_3 = {'Name': 'SC01', 'Lat': 34.068, 'Lon': -106.967}

df = pd.DataFrame([station_1, station_2, station_3])

print(df)

   Name     Lat      Lon
0   SBY  33.975 -107.181
1   LEM  34.166 -106.972
2  SC01  34.068 -106.967


In the above code, we created a DataFrame using the command `pd.DataFrame()` and supplying it with a list of dictionaries. The keys in the dictionaries became column headers, and the values became column entries. You can think of a DataFrame as a spreadsheet like Excel, but stored "in-memory" in Python. You can extract columns of the DataFrame by name, similar to dictionaries. You can also select rows. A single column of a DataFrame is called a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) in pandas-speak.
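
To make the spreadsheet analogy a little more concrete, the sketch below rebuilds the same station table (so that it runs on its own) and inspects its structure. It is only an illustration of a few standard attributes, not a required step:

```python
import pandas as pd

# Same station table as above, rebuilt so this cell is self-contained
df = pd.DataFrame([
    {'Name': 'SBY', 'Lat': 33.975, 'Lon': -107.181},
    {'Name': 'LEM', 'Lat': 34.166, 'Lon': -106.972},
    {'Name': 'SC01', 'Lat': 34.068, 'Lon': -106.967},
])

print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))    # column headers, taken from the dictionary keys
print(type(df['Lat']))     # a single column is a pandas Series
```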

In [None]:
print(df['Name'])  # The Name column

In [None]:
print(df.loc[1])  # The second row

As shown above, `df.loc[label]` selects a row by its index label; with the default integer index used here, label `1` is the second row. You can also select rows based upon a condition. The syntax for this is `df.loc[df['<column_name>'] <condition>]`. **Note that this assumes your DataFrame is called `df`.** Here are a few examples:

In [None]:
print(df.loc[df['Lat'] > 34])  # Only show rows for which the latitude is larger than 34

In [None]:
print(df.loc[df['Name'] == 'SBY'])  # Only show the row with Name = SBY

In [None]:
print(df.loc[df['Name'] == 'LEM', 'Lat'])  # Show the latitude of station LEM
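
It can help to look at what the condition inside `df.loc[...]` actually is. The sketch below rebuilds the station table so that it runs on its own; the names `mask` and `north_and_east` and the longitude cutoff are chosen only for illustration. A condition evaluates to a Series of `True`/`False` values, and several conditions can be combined with `&` (and) or `|` (or):

```python
import pandas as pd

# Same station table as above, rebuilt so this cell is self-contained
df = pd.DataFrame([
    {'Name': 'SBY', 'Lat': 33.975, 'Lon': -107.181},
    {'Name': 'LEM', 'Lat': 34.166, 'Lon': -106.972},
    {'Name': 'SC01', 'Lat': 34.068, 'Lon': -106.967},
])

mask = df['Lat'] > 34      # a Series of True/False, one entry per row
print(mask)

# Combine conditions with & or |; each condition needs its own parentheses
north_and_east = df.loc[(df['Lat'] > 34) & (df['Lon'] > -106.97)]
print(north_and_east)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why `&` and `|` are used here.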

## Deliverable 3: Working with a DataFrame

Now it's your turn to play with a DataFrame! For this deliverable we'll be working with some data on passenger flights from 1949 to 1960. Run the cell below to load the data into a DataFrame called `flights`.

In [2]:
import seaborn as sns
flights = sns.load_dataset('flights')
print(flights)

     year month  passengers
0    1949   Jan         112
1    1949   Feb         118
2    1949   Mar         132
3    1949   Apr         129
4    1949   May         121
..    ...   ...         ...
139  1960   Aug         606
140  1960   Sep         508
141  1960   Oct         461
142  1960   Nov         390
143  1960   Dec         432

[144 rows x 3 columns]


As you can see, this is a table with three columns and many rows. For each of the following prompts, write an expression which prints the requested portion of the `flights` DataFrame. For some queries, the printed output may be automatically shortened with `...` in between rows. Don't worry about this; I am just looking to see if the query was done correctly. Please do this in a **new code cell** below.

1. All rows where fewer than 120 passengers flew.
2. All rows corresponding to the month of November.
3. All rows where an even number of passengers flew.

## Deliverable 4: Calculations with pandas

Now that you've gotten your feet wet with extracting rows using conditions, it's time to get a little more involved. Please answer the following questions in a **new code cell** below.

> **Hint:** Note that the pandas Series object (remember, a Series is a *column* of a DataFrame) has many handy methods for computations, such as `sum()`. For example, if you have a Series `s` you can find the sum of the values using `s.sum()` or the minimum value with `s.min()`. Another handy method is `s.unique()`, which returns the unique values of `s`. 
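
As a quick illustration of the methods named in the hint, here is a sketch on a small made-up Series (the values and the name `counts` are hypothetical and unrelated to the `flights` data):

```python
import pandas as pd

# Hypothetical Series, for illustration only
s = pd.Series([3, 1, 3, 7, 1], name='counts')

print(s.sum())      # total of all values
print(s.min())      # smallest value
print(s.unique())   # distinct values, in order of first appearance
```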

1. How many passengers flew in 1957?
2. In 1952, which month was the most popular for flying? (**Hint:** Define a new DataFrame corresponding to just 1952. Then look for the month with the maximum number of passengers.)
3. **[BONUS — up to 1 pt]** Overall, across 1949–1960 (the time range of the dataset), which month was the most popular for flying?