<a href="https://colab.research.google.com/github/weymouth/NumericalPython/blob/main/TablesAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Tables and Analysis Functions

Congratulations! In the previous three notebooks we covered the fundamentals of numerical computing with Python:
 - Variables, operations and functions
 - Conditionals, lists and looping
 - Arrays, vector operations and plotting

Plus some handy concepts like print formatting, list comprehensions and lambda functions to keep your code tidy and efficient.

For this final tutorial notebook we will look at two more advanced topics which nevertheless occur frequently in practice:
1. Reading, manipulating, and writing data
2. Using advanced built-in operations such as optimization, integration, and root finding.

These topics are covered in more detail in walk-throughs online, so we will only give an overview here.

---

# Pandas

A single _observation_ of real engineering data might include a wide range of different types of measurements. Take, for example, this (simplified) table summarizing the vehicles used at the [National Oceanography Centre](https://www.noc.ac.uk/facilities/national-marine-equipment-pool):

| vehicle | count | speed (m/s) | size (m) | working fluid | flow type |
|-------|----------|-------|------|---------------|-----------|
| quad rotors | 4 | 18 | 0.06 | air | turbulent |
| slocum gliders | 9 | 0.4 | 2 | water | transitional |
| autosubs | 12 | 1.5 | 3 | water | turbulent |
| wave drones | 2 | 0.5 | 0.4 | water | laminar |

The data consists of labels, counts, floats and categories. Since every element in a NumPy array has to be the same data type, each column would need to be its own array, which clashes with the natural grouping of the data into observations.
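To see the limitation concretely, here is a minimal sketch of what happens when one row of the table is placed in a NumPy array. The names `row` and `row_array` are only illustrative.

```python
import numpy as np

# One observation holds a label, a count, floats and categories
row = ['quad rotors', 4, 18, 0.06, 'air', 'turbulent']

# NumPy coerces every element to a single common type (here: strings)
row_array = np.array(row)
print(row_array.dtype.kind)   # 'U' means unicode string
print(row_array[1])           # the count is now text, not a number
```

Because the numbers have become text, arithmetic on this array would no longer work without converting them back.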

The [Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) library introduces a new data type called a [data frame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) to hold tables of data like this. 

In [None]:
import pandas as pd

data = [['quad rotors', 4, 18, 0.06, 'air', 'turbulent'],
        ['slocum gliders', 9, 0.4, 2, 'water', 'transitional'],
        ['autosubs', 12, 1.5, 3, 'water', 'turbulent'],
        ['wave drones', 2, 0.5, 0.4, 'water', 'laminar']]
names = ['vehicle', 'count', 'speed (m/s)', 'size (m)', 'working fluid', 'flow type']
table = pd.DataFrame(data,columns = names)
table

The code above `import`ed Pandas using the nickname `pd` and used the `DataFrame` constructor to convert the data (a list of lists) and the column names (a list of strings) into a table.

**Important notes:**
1. In practical usage, you would not *create* the data, you would *read* it. The data with the header would already be stored in a spreadsheet or a csv file. Pandas is great at [reading in data](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/02_read_write.html#min-tut-02-read-write) from sources like this.
2. Such data would typically contain many thousands of observations. Keeping this data [tidy](https://vita.had.co.nz/papers/tidy-data.pdf) and developing a reliable analysis pipeline is where the advantages of the programming approach to data science are most obvious.
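As a rough sketch of the first note, the snippet below reads a table with `pd.read_csv` and writes it back out with `to_csv`. To keep it self-contained, an in-memory string stands in for a file on disk; the names `csv_text` and `vehicles` are illustrative, and in practice you would pass a file path instead.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents mirror the first rows of the table above
csv_text = """vehicle,count,speed (m/s),size (m),working fluid,flow type
quad rotors,4,18,0.06,air,turbulent
slocum gliders,9,0.4,2,water,transitional
"""

# read_csv accepts a file path or any file-like object
vehicles = pd.read_csv(io.StringIO(csv_text))
print(vehicles.dtypes)        # column types are inferred from the text

# to_csv with no path returns the CSV text; pass a file name to write to disk
print(vehicles.to_csv(index=False))
```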

Once we have the data table, Pandas has a ton of built-in statistical operations to perform on it. For example, we can take the standard deviation:

In [None]:
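table.std(numeric_only=True)

A standard deviation only makes sense for columns with a numerical data type, so `numeric_only=True` tells Pandas to skip the text columns (recent versions of Pandas raise an error otherwise). We can loop through the columns and get a description of each using the `describe` function: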

In [None]:
for col in table:
    print('     ',col)           # column name
    print(table[col].describe()) # descriptive stats
    print('---------')

Since this table is so short, we could also loop through *rows* using the row's `index`. The [`loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer lets us access the data by label, much like indexing a NumPy array.

In [None]:
for i in table.index:
    print(table.loc[i])
    print('--------')
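Beyond whole rows, `loc` also accepts a column label or a boolean mask. The sketch below uses a small stand-in table (the name `small` and its two rows are only for illustration) so that it runs on its own.

```python
import pandas as pd

small = pd.DataFrame({'vehicle': ['autosubs', 'wave drones'],
                      'speed (m/s)': [1.5, 0.5]})

# A row label plus a column label selects a single value
print(small.loc[0, 'speed (m/s)'])

# A boolean mask selects every row that satisfies a condition
slow = small.loc[small['speed (m/s)'] < 1]
print(slow)
```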

We can generate plots very easily from the table. The `plot` function is a wrapper around `plt.plot()` and adds a legend with the column names automatically.

In [None]:
table.plot(xlabel='row');

## Modifying the table

Let's quickly show how to add rows and columns to the table. First, add a row for the NOC's research ships by selecting an index which isn't filled yet:

In [None]:
table.loc[4] = ['ships',2,8,100,'water','turbulent']
table

Note that we can overwrite a row this way as well.
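A minimal sketch of both behaviours, using a two-column stand-in table (the name `fleet` and the updated count are made up for illustration):

```python
import pandas as pd

fleet = pd.DataFrame([['autosubs', 12], ['wave drones', 2]],
                     columns=['vehicle', 'count'])

fleet.loc[1] = ['wave drones', 3]   # index 1 exists, so the row is replaced
fleet.loc[2] = ['ships', 2]         # index 2 is new, so a row is appended
print(fleet)
```

The same assignment syntax does both jobs, so it is worth checking the index before assigning if you do not intend to overwrite anything.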

Let's add a Reynolds number $Re$ column to our table, since it governs the transition from laminar to turbulent flow. $Re$ is defined as $\text{speed} \times \text{size} / \text{kinematic viscosity}$, and the transition typically occurs somewhere in the range $Re \approx 5\times 10^5$ to $10^6$. The [kinematic viscosity](https://en.wikipedia.org/wiki/Viscosity) of water is $\nu\approx 10^{-6}~m^2/s$ and that of air is $\nu\approx 15\times 10^{-6}~m^2/s$.

We can use the NumPy [where](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function to set the viscosity depending on the fluid and then apply the formula to get the Reynolds number:

In [None]:
import numpy as np
table['kin visc (m*m/s)'] = np.where(table['working fluid']=='water', 1e-6, 15e-6)
table['Re'] = table['speed (m/s)']*table['size (m)']/table['kin visc (m*m/s)']
table

Notice that the $Re$ predicts the flow type nicely for all the vehicles except the quad rotor. The jets from the rotor are much faster and more turbulent than the vehicle speed implies, leading to the disparity. 
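This comparison can also be automated. The sketch below rebuilds the four original rows in a separate frame (called `check` here so it does not disturb `table`), recomputes $Re$, and uses NumPy's `select` function to label each vehicle with the 500k and 1M thresholds quoted above. The column names `predicted` and `agrees` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Values copied from the table above; the 'Re' column is recomputed here
check = pd.DataFrame({
    'vehicle': ['quad rotors', 'slocum gliders', 'autosubs', 'wave drones'],
    'speed (m/s)': [18, 0.4, 1.5, 0.5],
    'size (m)': [0.06, 2, 3, 0.4],
    'working fluid': ['air', 'water', 'water', 'water'],
    'flow type': ['turbulent', 'transitional', 'turbulent', 'laminar']})
nu = np.where(check['working fluid'] == 'water', 1e-6, 15e-6)
check['Re'] = check['speed (m/s)'] * check['size (m)'] / nu

# Below 500k: laminar, above 1M: turbulent, in between: transitional
check['predicted'] = np.select(
    [check['Re'] < 5e5, check['Re'] > 1e6],
    ['laminar', 'turbulent'],
    default='transitional')
check['agrees'] = check['predicted'] == check['flow type']
print(check[['vehicle', 'Re', 'predicted', 'flow type', 'agrees']])
```

Only the quad rotor row should come out with `agrees` set to `False`, matching the observation above.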

**Overall** Pandas is great for keeping this table organized, but it is a bit awkward and verbose to work with, and uses a lot of specialized syntax. If you can get away with a multi-dimensional array or two, you may find those easier to work with. 

# SciPy

Typically, we only have access to the noisy measurements, not the true model. We can use the [curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) function to fit a `linear` model function to this data.

In [None]:
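from scipy.optimize import curve_fit
# Model to fit: a straight line with slope m and intercept b
def linear(x,m,b): return m*x+b
# Here `data` must be a data frame with `time` and `y` columns.
# curve_fit returns the best-fit parameters and their covariance (discarded as `_`)
params,_ = curve_fit(linear,data.time,data.y)
params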

In [None]:
data['y fit'] = linear(data.time,*params)
data.plot(x='time');

Since the true model happened to be linear, the curve fit is nearly perfect!
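For a version that runs on its own, here is a sketch of the same fit on made-up measurements. The slope of 2, intercept of 1, noise level of 0.5 and the names `demo`, `fit_params`, `fit_cov` and `fit_err` are all assumptions chosen for illustration, not values from the data used above. It also keeps the covariance matrix that `curve_fit` returns, whose diagonal gives a rough uncertainty on each parameter.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def linear(x, m, b):
    return m * x + b

# Hypothetical measurements: a line with slope 2 and intercept 1 plus Gaussian noise
rng = np.random.default_rng(seed=0)
time = np.linspace(0, 10, 50)
demo = pd.DataFrame({'time': time,
                     'y': linear(time, 2.0, 1.0) + rng.normal(scale=0.5, size=time.size)})

# curve_fit returns the best-fit parameters and their covariance matrix
fit_params, fit_cov = curve_fit(linear, demo['time'], demo['y'])
fit_err = np.sqrt(np.diag(fit_cov))   # one-standard-deviation uncertainties

print('slope     = %.3f +/- %.3f' % (fit_params[0], fit_err[0]))
print('intercept = %.3f +/- %.3f' % (fit_params[1], fit_err[1]))
```

The fitted values should land close to the assumed slope and intercept, with the uncertainties shrinking as more points are added or the noise is reduced.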