<a href="https://colab.research.google.com/github/stb2145/cig/blob/master/Week_5_Salinity_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 2: Pandas, NumPy, SciPy**

**Learning Goals**
- Learn about and `import` Python libraries (15 min)
- Learn about NumPy (5 min)
- Learn about Pandas (40 min)
- Learn about SciPy (30 min)

### **Icebreaker!**
> **How many oceans are there?** 🤔...

##### Sort of a trick question...

# **Solution to Week 1 exercises**



In [None]:
#Boolean 
indian_ocean = 30 #˚C
atlantic_ocean = 28 #˚C

indian_ocean == atlantic_ocean

In [None]:
# in statement
oceans = ['Arctic','Atlantic','Pacific','Indian']

In [None]:
#check to see if the Southern Ocean is included in our `oceans` variable
'Southern' in oceans

In [None]:
# if statement

temp1 = 30 #˚C
temp2 = 0 #˚C

if temp1 != temp2:
    print('temp1 does not equal temp2')

> **We measure ocean temperature and depth with Argo floats. What else do you think an Argo float measures in the water?**


# **Ocean Salinity**

### Salt vs Salinity

> **Salts** are compounds like sodium chloride, magnesium sulfate, potassium nitrate, and sodium bicarbonate that dissolve into ions in water.

> **Salinity** is the amount of dissolved salt in the water. It is measured as a mass fraction: the mass of dissolved salts (g) per unit mass of seawater (kg). Salinity is often reported in "practical salinity units" (psu). Practical salinity is strictly a dimensionless quantity, but 1 psu is numerically close to 1 $\frac{g}{kg}$.
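
To make the mass-fraction idea concrete, here is a small worked example. The sample masses are made-up illustration values, not measurements.

```python
# Salinity as a mass fraction: grams of dissolved salt per kilogram of seawater.
# The two masses below are made-up illustration values.
salt_mass_g = 17.5       # grams of dissolved salt in the sample
seawater_mass_kg = 0.5   # mass of the whole seawater sample, in kilograms

salinity = salt_mass_g / seawater_mass_kg
print('salinity is', salinity, 'g/kg')
```

A sample like this one would come out at 35 g/kg.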

<img width=800 src="https://smap.jpl.nasa.gov/system/news_items/main_images/1265_SMAP_salinity.jpg">

> **Based on the figure above...**
>
> **1) Which ocean basin is the saltiest/freshest?**
>
> **2) What is the average salinity of the ocean?**

# **What are Python Packages?**
## Most of the power of a programming language is in its libraries.

> A Python package is a directory of Python modules. In other words, it is a collection of Python files, each containing code (functions, constants, and so on) written for a specific purpose.

<img src='https://drive.google.com/uc?id=1C7Y1p1Nlj0QhLqEPUGMoU6FP1QWMqrFU' width="520" height="300" />

- A library is a collection of files (called modules) that contains functions for use by other programs.
 - May also contain data values (e.g., numerical constants) and other things.
 - Library’s contents are supposed to be related, but there’s no way to enforce that.
- The Python [standard library](https://docs.python.org/3/library/) is an extensive suite of modules that comes with Python itself.
- Many additional libraries are available from [PyPI](https://pypi.org/) (the Python Package Index).
- We will see later how to write new libraries.



> **Libraries and Modules**
>
> A library is a collection of modules, but the terms are often used interchangeably, especially since many libraries only consist of a single module, so don’t worry if you mix them.

# **A program must import a library module before using it.**

- Use `import` to load a library module into a program’s memory.
- Then refer to things from the module as `module_name.thing_name`.
 - Python uses `.` to mean “part of”.
- Using `numpy`, a third-party library that is not part of the standard library but comes preinstalled in Colab:

In [3]:
import numpy

print('pi is', numpy.pi)
print('cos(pi) is', numpy.cos(numpy.pi))

pi is 3.141592653589793
cos(pi) is -1.0


> You have to refer to each item with the module’s name.
>> `numpy.cos(pi)` won’t work: the reference to `pi` doesn’t somehow “inherit” the function’s reference to `numpy`.
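
The short sketch below shows what happens if you try it. It assumes a fresh notebook in which no variable called `pi` has been defined yet. The `name_error_raised` flag is only there for illustration.

```python
import numpy

# Works: both names are reached through the module.
print('cos(pi) is', numpy.cos(numpy.pi))

# Fails: a bare `pi` is not defined in our program, only inside `numpy`.
name_error_raised = False
try:
    numpy.cos(pi)
except NameError as error:
    name_error_raised = True
    print('NameError:', error)
```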

## Use `help` to learn about the contents of a library module.

In [None]:
help(numpy)

## Import specific items from a library module to shorten programs.

In [None]:
from math import cos, pi

print('cos(pi) is', cos(pi))

#### **Essential Python Packages:**

- `numpy`
- `pandas`
- `matplotlib`

> **How do we get packages into our notebooks?**
>
> We `import` them!

In [None]:
import numpy as np
import pandas as pd
import scipy
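
The `as` keyword gives a module a shorter nickname, so `np` and `pd` are just other names for `numpy` and `pandas`. A minimal sketch using only `numpy`:

```python
import numpy
import numpy as np

# `np` is just a shorter name for the same module.
print(np is numpy)

# Anything in the module can be reached through the alias.
print('pi is', np.pi)
```

The aliases `np` and `pd` are conventions, not requirements, but almost all examples you find online use them.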

# **NumPy**

[NumPy Documentation](https://numpy.org/doc/)

> NumPy is a Python library for fast numerical computing, built around the N-dimensional array (`ndarray`).
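
As a small taste of NumPy, the sketch below builds an array from the average basin salinities used in the pandas examples in this notebook and runs a few array operations on it. The variable name `salinity` is just for illustration.

```python
import numpy as np

# Average salinities for five ocean basins (same values as the pandas examples).
salinity = np.array([32, 35, 34.5, 35, 34.7])

print(salinity.shape)    # number of elements along each dimension
print(salinity.mean())   # average of all values
print(salinity.max())    # largest value
print(salinity - 35)     # arithmetic is applied to every element at once
```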

# **Pandas**

<img width="400" src='https://miro.medium.com/max/1400/1*KdxlBR9P3mDp9JZ_URMdYQ.jpeg'>

> No, but seriously: `pandas` is a Python library for efficient, high-performance analysis of _tabular_ data (i.e. the kind of data you would keep in an Excel sheet).

### **There are two main data structures in pandas**

> **Series** (`pd.Series`): 1-dimensional array of values with an index
>
> **DataFrame** (`pd.DataFrame`): 2-dimensional array of values with a row and a column index
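
A quick way to see the difference is to build one of each and check its dimensions. This is only a sketch: the labels `'a'`, `'b'`, `'c'` and the depth and temperature numbers are made up for illustration.

```python
import pandas as pd

# A Series: one column of values, labelled by an index.
depth = pd.Series([0, 10, 20], index=['a', 'b', 'c'], name='depth_m')

# A DataFrame: several columns that share the same row index.
profile = pd.DataFrame({'depth_m': [0, 10, 20], 'temp_c': [25.0, 24.5, 22.0]},
                       index=['a', 'b', 'c'])

print(depth.ndim, depth.shape)      # 1 (3,)
print(profile.ndim, profile.shape)  # 2 (3, 2)

# Each column of a DataFrame is itself a Series.
print(type(profile['temp_c']))
```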

# Pandas Capabilities

[Documentation](https://pandas.pydata.org/pandas-docs/stable/)

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
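
To illustrate just one item from the list, label-based alignment and missing-data handling, here is a minimal sketch. The two Series and their numbers are made up for illustration.

```python
import pandas as pd

# Two Series with overlapping but not identical labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=['Arctic', 'Atlantic', 'Indian'])
b = pd.Series([10.0, 20.0], index=['Atlantic', 'Pacific'])

# Values are matched by label, not by position; unmatched labels become NaN.
total = a + b
print(total)

# Missing values are skipped by default in statistics such as mean().
print(total.mean())
```

Only `'Atlantic'` appears in both Series, so it is the only label with a real number in `total`.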

# Introducing Series and DataFrames: names as the index, ages and heights as the values

<img width="500" src='https://miro.medium.com/max/1400/1*o5c599ueURBTZWDGmx1SiA.png'>

> Anytime you need more information on a package or function, add `?` after its name (this works in notebooks such as Colab and Jupyter).

In [None]:
pd?

#### **Example of a pandas Data Series:**

In [None]:
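#create index
names = ['Cherif', 'Damian', 'Eric', 'Silas', 'Victor']
#create data values (placeholder ages; replace them with the real ones in class)
ages = [15, 16, 17, 16, 15]
#height = []

In [None]:
#create pandas series: ages are the values, names are the index
club = pd.Series(data=ages, index=names, name='age')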

> You can use many statistical functions on both Series and DataFrames.

In [None]:
#oldest age in your series
club.max()

In [None]:
#youngest age value in your series
club.min()

In [None]:
#average age in the whole club
club.mean()

> **Ocean basins pandas series**

In [None]:
ocean_basins = ['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern']
avg_salinity = [32, 35, 34.5, 35, 34.7]
ds = pd.Series(data=avg_salinity, index=ocean_basins, name="Ocean basins' average salinities")

In [None]:
ds

In [None]:
# If you're not sure what the index of your pd series is:
ds.index

In [None]:
# If you're not sure what the values of your pd series are:
ds.values

> **Find the freshest ocean basin(s)**

In [None]:
#first find the minimum salinity value
ds.min()

In [None]:
#next find the index associated with that salinity value
ds[ds == 32.0]

In [None]:
#another way to write the same code!
ds[ds == ds.min()]
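
It may help to look at what `ds == ds.min()` produces on its own. The sketch below rebuilds the same Series so it can run by itself. `mask` is just an illustrative variable name, and `idxmin()` is a pandas shortcut not used elsewhere in this notebook.

```python
import pandas as pd

ocean_basins = ['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern']
avg_salinity = [32, 35, 34.5, 35, 34.7]
ds = pd.Series(data=avg_salinity, index=ocean_basins)

# Comparing a Series to a value gives a Series of True/False (a "mask").
mask = ds == ds.min()
print(mask)

# Indexing with the mask keeps only the rows where it is True.
print(ds[mask])

# idxmin() returns the index label of the first minimum directly.
print(ds.idxmin())
```

One design note: the mask approach returns every basin that ties for the minimum, while `idxmin()` returns only the first one it finds.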

> **Find the saltiest ocean basin(s)**

In [None]:
#your code here

In [None]:
#your code here

#### **Example of a pandas Data Frame:**

In [None]:
#first create a dictionary
ocean_basins = ['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern']
avg_salinity = [32, 35, 34.5, 35, 34.7]
avg_temp = [-1.8, 14, 22, 20, 4]

avg_data = {'avg_salinity': avg_salinity,
        'avg_temp': avg_temp}


df = pd.DataFrame(data=avg_data, index=ocean_basins)

In [None]:
df

In [None]:
df.info()

> You can use many statistical functions on both Series and DataFrames.

In [None]:
df.min()

In [None]:
df.max()

In [None]:
df.mean()

> Or, if you want all the basic stats, you can call `describe()`


In [None]:
df.describe()

> We can get a single column as a Series using Python's getitem (square bracket) syntax on the DataFrame object.

In [None]:
df['avg_salinity']

> or using attribute syntax.

In [None]:
df.avg_salinity
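
The two spellings give the same result for simple column names, but they are not fully interchangeable. The sketch below uses a small made-up DataFrame. The column name `'avg temp'` (with a space) is a hypothetical example chosen to show the difference.

```python
import pandas as pd

# Small illustrative DataFrame; 'avg temp' deliberately contains a space.
df = pd.DataFrame({'avg_salinity': [32, 35], 'avg temp': [-1.8, 14]},
                  index=['Arctic', 'Atlantic'])

# Both spellings return the same Series for a simple column name.
print(df['avg_salinity'].equals(df.avg_salinity))

# Bracket syntax also works for names that are not valid Python identifiers,
# such as 'avg temp'; attribute syntax cannot express that name.
print(df['avg temp'])
```

Bracket syntax is the safer habit, since it works for any column name.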

> **Find the following in this dataframe:**

In [None]:
#What ocean basin has the coldest average temperature? What is that temperature?
df.avg_temp.min()
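
`min()` gives the temperature but not the basin it belongs to. One possible way to get both is sketched below. It rebuilds the same DataFrame so it can run by itself, and `coldest_basin` and `coldest_temp` are illustrative variable names.

```python
import pandas as pd

ocean_basins = ['Arctic', 'Atlantic', 'Indian', 'Pacific', 'Southern']
avg_data = {'avg_salinity': [32, 35, 34.5, 35, 34.7],
            'avg_temp': [-1.8, 14, 22, 20, 4]}
df = pd.DataFrame(data=avg_data, index=ocean_basins)

# idxmin() returns the row label of the smallest value in the column.
coldest_basin = df['avg_temp'].idxmin()

# .loc[row_label, column_name] looks up a single value by its labels.
coldest_temp = df.loc[coldest_basin, 'avg_temp']

print(coldest_basin, coldest_temp)
```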

In [None]:
df.avg_salinity.plot(kind='bar')

In [None]:
df.plot(kind='bar')

**Learning Goals**
- Learn about ocean salinity 
- What are Python packages?
- `import` essential packages 
- Let's play with `pandas`!