<a href="https://colab.research.google.com/github/stb2145/cig/blob/master/Week_2_Pandas_NumPy_SciPy_blank_cells.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 2: Pandas, NumPy, SciPy**

**Learning Goals**
- Learn about and `import` [Python libraries](http://swcarpentry.github.io/python-novice-gapminder/06-libraries/index.html) (20 min)
- Learn about [Pandas](http://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html) (50 min)
- Learn about SciPy (20 min)

# **Solution to Week 1 exercises**

## Breakout rooms!

- Breakout rooms (5 min)

- Share results (5 min)



# **Open files with `open()`**



> A common way to work with files in Python is to create a file handler with `open()` and work with the file. After finishing the work with the file, we need to close the file handler with `close()`. For example, if we want to read all lines of a file using Python, we use

```
fh = open(filename,'r')
all_lines = fh.readlines()
fh.close()
```

Example code taken from [this website](https://cmdlinetips.com/2016/01/opening-a-file-in-python-using-with-statement/).

> It is easy to forget to close the file once you are done with it. We can use the `with` statement in Python so that we don’t have to close the file handler ourselves.

> The `with` statement creates a context manager that automatically closes the file handler for you when you are done with it. Here is an example using the `with` statement to read all lines of a file.

```
with open(filename, 'r') as fh:
    all_lines = fh.readlines()
```
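
To try the `with` pattern end to end, here is a minimal sketch. The file name `example.txt` and its contents are made up for illustration, and the file is written to a temporary directory so nothing in your workspace changes.

```python
import os
import tempfile

# Hypothetical file used only for this illustration
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "example.txt")

# Write two lines; the file is closed automatically when the block ends
with open(filename, "w") as fh:
    fh.write("first line\nsecond line\n")

# Read all lines back; again no explicit close() is needed
with open(filename, "r") as fh:
    all_lines = fh.readlines()

print(all_lines)    # ['first line\n', 'second line\n']
print(fh.closed)    # True
```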

# **What are Python Libraries?**
## Most of the power of a programming language is in its libraries.

> A Python package is a collection/directory of Python modules. In other words, it's a library of python files and in those files are scripts of code with specific functions.

<img src='https://drive.google.com/uc?id=1C7Y1p1Nlj0QhLqEPUGMoU6FP1QWMqrFU' width="520" height="300" />

- A library is a collection of files (called modules) that contains functions for use by other programs.
  - May also contain data values (e.g., numerical constants) and other things.
  - A library’s contents are supposed to be related, but there’s no way to enforce that.
- The Python [standard library](https://docs.python.org/3/library/) is an extensive suite of modules that comes with Python itself.
- Many additional libraries are available from [PyPI](https://pypi.org/) (the Python Package Index).
- We will see later how to write new libraries.



> **Libraries and Modules**
>
> A library is a collection of modules, but the terms are often used interchangeably, especially since many libraries only consist of a single module, so don’t worry if you mix them.

## A program must import a library module before using it.

- Use `import` to load a library module into a program’s memory.
- Then refer to things from the module as `module_name.thing_name`.
  - Python uses `.` to mean “part of”.
- Using `numpy`, a widely used third-party library (it is not part of the standard library):

> Have to refer to each item with the module’s name.
>> `numpy.cos(pi)` won’t work: the reference to `pi` doesn’t somehow “inherit” the function’s reference to `numpy`
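
A minimal sketch of the point above: both the function and the constant need the module prefix. The variable names `result` and `caught` are only for illustration.

```python
import numpy

# Both cos and pi live inside the numpy module
result = numpy.cos(numpy.pi)
print("cos(pi) is", result)    # cos(pi) is -1.0

# Dropping the prefix on pi fails because the bare name pi is not defined
caught = False
try:
    numpy.cos(pi)
except NameError as err:
    caught = True
    print("NameError:", err)
```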

## Use `help` to learn about the contents of a library module.

> [Numpy Documentation](https://numpy.org/doc/)

### Difference between `math` and `numpy`
[from StackOverflow](https://stackoverflow.com/questions/41648058/what-is-the-difference-between-import-numpy-and-import-math)

- Use `math` if you are doing simple computations with scalars only (no lists or arrays).
> `math` is part of the standard python library. It provides functions for basic mathematical operations as well as some commonly used constants.

- Use `numpy` if you are doing scientific computations with matrices, arrays, or large datasets.
> numpy, on the other hand, is a third-party package geared towards scientific computing. It is the de facto package for numerical and vector operations in Python. It provides several routines optimized for vector and array computations and, as a result, is a lot faster for such operations than, say, plain Python lists. See http://www.numpy.org/ for more info.
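
The difference can be seen in a short sketch. The array `values` is made up for illustration.

```python
import math
import numpy

# math works on one number at a time
single_root = math.sqrt(16)
print(single_root)    # 4.0

# numpy applies the same operation to every element of an array
values = numpy.array([1, 4, 9, 16])
roots = numpy.sqrt(values)
print(roots)          # [1. 2. 3. 4.]

# math.sqrt cannot handle an array with several elements
try:
    math.sqrt(values)
except TypeError as err:
    print("TypeError:", err)
```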

## Import specific items from a library module to shorten programs.
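
As a small sketch, `from ... import ...` loads only the named items, which can then be used without the module prefix:

```python
from numpy import cos, pi

# The imported names can be used directly, without the numpy. prefix
result = cos(pi)
print("cos(pi) is", result)    # cos(pi) is -1.0
```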

## Create an alias for a library module when importing it to shorten programs.

> Use `import ... as ...` to give a library a short alias while importing it.
>
> Then refer to items in the library using that shortened name.
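
For example, a short sketch using the conventional alias for `numpy`:

```python
import numpy as np

# np is now a short alias for the numpy module
result = np.cos(np.pi)
print("cos(pi) is", result)    # cos(pi) is -1.0
```

By convention, `pandas` is usually imported as `pd` and `matplotlib.pyplot` as `plt`.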

### Essential Python libraries:

- `numpy`
- `pandas`
- `matplotlib`

### Create alias for these libraries:

## A few exercises:

## 1)

When a colleague of yours types `help(math)`, Python reports an error:

`NameError: name 'math' is not defined`

What has your colleague forgotten to do?

In [None]:
#type solution here


## 2)
Take the square root of a 4x4 2-D array with the number 144 on the diagonal and 0's elsewhere.

## 3)
Create a 1D array of numbers going from 0 to 20 with a step size of 2.

What is the length of this object?

# **Pandas!**

<img width="400" src='https://miro.medium.com/max/1400/1*KdxlBR9P3mDp9JZ_URMdYQ.jpeg'>

## No, but seriously, `pandas` is a powerful Python library that allows for efficient, high-performance analysis (typically statistics) of _tabular_ data (i.e. Excel-sheet type of data).

> Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.

### Pandas Capabilities

[Documentation](https://pandas.pydata.org/pandas-docs/stable/)



- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

### **There are two main data structures in pandas**

> **Data Series**: 1-dimensional array of values with an index
>
> **Data Frame**: 2-dimensional array of values with a row and a column index

> A DataFrame is a collection of Series. The DataFrame is the way Pandas represents a table, and the Series is the data structure Pandas uses to represent a column.
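
A minimal sketch of both structures. The names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical club members (made-up names and values)
names = ["Ana", "Ben", "Chloe"]

# A Series: one column of values plus an index
ages = pd.Series([23, 31, 27], index=names, name="age")

# A DataFrame: a table whose columns share the same row index
club = pd.DataFrame(
    {"age": [23, 31, 27], "height_cm": [165, 180, 172]},
    index=names,
)

print(ages)
print(club)

# Pulling one column out of the DataFrame gives back a Series
print(type(club["age"]))
```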


## Data series (left), Data Frame (right)

Introduce the pandas Series and DataFrame using the class's ages and heights, with the names as the index and age/height as the values.

<img width="500" src='https://miro.medium.com/max/1400/1*o5c599ueURBTZWDGmx1SiA.png'>

> Anytime you need more information on a package or function in a notebook, type `?` after its name.

## Pandas Data Series:

> You can use many statistical functions on both Series and DataFrames.
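
A short sketch with a made-up Series of heights (the values are illustrative only):

```python
import pandas as pd

heights = pd.Series([165, 180, 172], index=["Ana", "Ben", "Chloe"])

print(heights.max())     # largest value: 180
print(heights.min())     # smallest value: 165
print(heights.mean())    # average, about 172.3
```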

In [None]:
#oldest age in your series


In [None]:
#youngest age value in your series


In [None]:
#average age in the whole club


> Ocean basins pandas series

In [None]:
# If you're not sure what the index of your pd series is:


In [None]:
# If you're not sure what the values of your pd series are:


> Find the freshest ocean basin(s)
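
One possible way to do this is sketched here with a made-up Series. The basin labels and salinity values are placeholders, not measurements.

```python
import pandas as pd

# Illustrative salinity values only (not real measurements)
salinity = pd.Series(
    [35.1, 34.6, 34.8, 34.6],
    index=["Basin A", "Basin B", "Basin C", "Basin D"],
)

# Step 1: the minimum salinity value
min_salinity = salinity.min()

# Step 2: every index label whose value equals that minimum
freshest = salinity[salinity == min_salinity].index.tolist()
print(freshest)             # ['Basin B', 'Basin D']

# Another way: idxmin() returns the label of the first minimum only
print(salinity.idxmin())    # Basin B
```

Note that `idxmin()` reports a single label even when several entries tie for the minimum, while the Boolean comparison returns all of them.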

In [None]:
#first find the minimum salinity value


In [None]:
#next find the index associated with that salinity value


In [None]:
#another way to write the same code!


> Find the saltiest ocean basin(s)

In [None]:
# Try it: your code here

In [None]:
# Try it: your code here

## Pandas Data Frame:

In [None]:
#first create a dictionary


> You can use many statistical functions on both Series and DataFrames.

> Or, if you want all the basic stats, you can call `describe()`


> We can get a single column as a Series using python's getitem syntax on the DataFrame object.

> or using attribute syntax.
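
A sketch of `describe()` and the two ways of selecting a column, using a made-up DataFrame (labels and values are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"temperature": [4.0, 3.5, 6.0], "salinity": [35.0, 34.5, 34.8]},
    index=["Basin A", "Basin B", "Basin C"],
)

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = df.describe()
print(summary)

by_item = df["salinity"]    # getitem syntax
by_attr = df.salinity       # attribute syntax
print(by_item.equals(by_attr))    # True
```

Attribute syntax only works when the column name is a valid Python identifier and does not clash with an existing DataFrame method, so getitem syntax is the safer default.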

## Indexing & Slicing

- Use `DataFrame.iloc[..., ...]` to select values by their (entry) **position**

- Use `DataFrame.loc[..., ...]` to select values by their (entry) **label**

> we can also specify the column we want to access
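
A sketch of both selection styles on a made-up DataFrame (labels and values are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"temperature": [4.0, 3.5, 6.0], "salinity": [35.0, 34.5, 34.8]},
    index=["Basin A", "Basin B", "Basin C"],
)

row_by_position = df.iloc[1]        # second row (positions start at 0)
row_by_label = df.loc["Basin B"]    # the same row, selected by its label

# Adding a second entry selects the column as well
value_by_position = df.iloc[1, 0]
value_by_label = df.loc["Basin B", "temperature"]
print(value_by_position, value_by_label)    # 3.5 3.5
```

Keep in mind that slices with `.loc` include the end label, while slices with `.iloc` stop before the end position, as in ordinary Python slicing.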

> If we make a calculation using columns from the DataFrame, it will keep the same index:

> Which we can easily add as another column to the DataFrame:
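
A sketch of both steps, assuming a made-up temperature column in degrees Celsius that we convert to kelvin (column names and values are placeholders):

```python
import pandas as pd

df = pd.DataFrame(
    {"temperature_C": [4.0, 3.5, 6.0]},
    index=["Basin A", "Basin B", "Basin C"],
)

# The result of the calculation is a Series with the same index as df
temperature_K = df["temperature_C"] + 273.15
print(temperature_K.index.equals(df.index))    # True

# Assigning to a new column name adds the Series to the DataFrame
df["temperature_K"] = temperature_K
print(df)
```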

> Now let's add a row to the DataFrame:
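
One possible approach, sketched with made-up values, is to build a one-row DataFrame and combine it with the original using `pd.concat()`. The row label `"Global"` is a placeholder. (`DataFrame.append()` was removed in pandas 2.0, so it is not used here.)

```python
import pandas as pd

df = pd.DataFrame(
    {"temperature": [4.0, 3.5], "salinity": [35.0, 34.5]},
    index=["Basin A", "Basin B"],
)

# Column averages as a one-row DataFrame labelled "Global"
global_avg = df.mean().to_frame(name="Global").T

# Stack the new row underneath the existing rows
df_with_global = pd.concat([df, global_avg])
print(df_with_global)
```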

In [None]:
#create new DataFrame object of global averages


> **Find the following in this dataframe:**

In [None]:
# What is the coldest temperature?


In [None]:
# What ocean basin has the coldest average temperature?

In [None]:
# Another way to find the index associated with coldest temperature:

In [None]:
# Plot avg salinity!

In [None]:
# Plot all the columns in the DataFrame!

# **BREAK TIME!**

## A few exercises: GDP per capita in Europe

In [None]:
url = 'https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_europe.csv'


In [None]:
# Select Austria by entry position


In [None]:
# Select Austria by entry label


In [None]:
# Select/slice to all the rows in Denmark (two ways to do this)


In [None]:
# Select/slice to all the countries/rows in the 4th column (two ways to do this)


In [None]:
# Select multiple columns or rows using .loc and a named slice


> You can apply the same statistical operations to the slices.

In [None]:
# Find the maximum gdp values within Italy through Poland, 1962-1972


In [None]:
# Find the minimum gdp values within Italy through Poland, 1962-1972


> Use comparisons to select data based on value.
  - Comparison is applied element by element.
  - Returns a similarly-shaped dataframe of `True` and `False`.
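
A sketch with a tiny made-up table. The column names follow the gapminder naming pattern, but the countries and values are placeholders.

```python
import pandas as pd

subset = pd.DataFrame(
    {"gdpPercap_1962": [8000.0, 12000.0], "gdpPercap_1972": [11000.0, 16000.0]},
    index=["Country A", "Country B"],
)

# The comparison is applied to every element and returns True/False values
mask = subset > 10000
print(mask)
```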

In [None]:
# Use a subset of data to keep output readable.


# Which values were greater than 10000 ?


> Select values or NaN using a Boolean mask.
  - A frame full of Booleans is sometimes called a mask because of how it can be used.


- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.
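
A sketch of applying a mask, again with made-up countries and values:

```python
import pandas as pd

subset = pd.DataFrame(
    {"gdpPercap_1962": [8000.0, 12000.0], "gdpPercap_1972": [11000.0, 16000.0]},
    index=["Country A", "Country B"],
)
mask = subset > 10000

# Values are kept where the mask is True and replaced by NaN where it is False
masked = subset[mask]
print(masked)

# NaN entries are skipped, so the minimum only considers values above 10000
print(masked.min())
```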


### Group By: split-apply-combine

> Pandas vectorized methods and grouping operations give users a lot of flexibility in analysing their data.

> For instance, let’s say we want a clearer view of how the European countries are distributed according to their GDP.
  - We can get a first impression by splitting the countries into two groups for each year surveyed: those with a GDP higher than the European average and those with a lower GDP.
  - We then estimate a wealth score based on the historical values (from 1962 to 2007), counting how many times each country fell in the lower or higher GDP group.

> Finally, for each group in the wealth_score table, we sum their (financial) contribution across the years surveyed using chained methods:
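
The three steps can be sketched on a tiny made-up table. The countries and values are placeholders; only the pattern (mask, score, `groupby`) is the point.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "gdpPercap_1962": [10.0, 20.0, 30.0, 40.0],
        "gdpPercap_2007": [20.0, 30.0, 40.0, 110.0],
    },
    index=["Country A", "Country B", "Country C", "Country D"],
)

# Split: True where a country is above that year's average
mask_higher = data > data.mean()

# Score: fraction of years in which each country was above average
wealth_score = mask_higher.sum(axis=1) / len(data.columns)
print(wealth_score)

# Apply and combine: sum the GDP values within each score group
totals = data.groupby(wealth_score).sum()
print(totals)
```

Here Countries A and B share a score of 0.0, Country C scores 0.5, and Country D scores 1.0, so the grouped table has three rows.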

# **SciPy**

## SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. In particular, these are some of the core packages:
  - SciPy
  - NumPy
  - Pandas
  - Matplotlib
  - SymPy

The SciPy library is meant to operate efficiently on NumPy arrays, so NumPy and SciPy work hand in hand. Check out the [User Guide](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html).

All the notes here can be found on [this website](https://scipy-lectures.org/intro/scipy.html).

In [None]:
#Start by importing the SciPy library


## Interpolation

In [None]:
#Create experimental data close to a sine function using `numpy`


In [None]:
# Visualize with matplotlib

In [None]:
#import scipy's 1D interp function to then build a linear interpolation function:


In [None]:
# Evaluate the result at the time of interest:


In [None]:
# Plot the linear results


In [None]:
# A cubic interpolation can also be selected by providing the kind optional keyword argument:


In [None]:
# Plot the cubic results


In [None]:
# Plot the data and the interpolation


## Optimization and Fit

Optimization is the problem of finding a numerical solution to a minimization or equality.

The `scipy.optimize` module provides algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.

### Curve fitting

In [None]:
# Suppose we have data on a sine wave, with some noise:
# Seed the random number generator for reproducibility


In [None]:
# And plot it


> If we know that the data lies on a sine wave, but not the amplitudes or the period, we can find those by least squares curve fitting. First we have to define the test function to fit, here a sine with unknown amplitude and period:

> We then use `scipy.optimize.curve_fit()` to find $a$ and $b$:

In [None]:
# import optimize


In [None]:
params, params_covariance = optimize.curve_fit(test_func, x_data, y_data, p0=[2, 2])
print(params)

In [None]:
# Plot the resulting curve onto the data


## Assignment: Curve fitting of temperature data
The temperature extremes in Alaska for each month, starting in January, are given by (in degrees Celsius):

`max:  17,  19,  21,  28,  33,  38, 37,  37,  31,  23,  19,  18`

`min: -62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58`

1. Plot these temperature extremes.
2. Define a function that can describe min and max temperatures. Hint: this function has to have a period of 1 year. Hint: include a time offset.
3. Fit this function to the data with `scipy.optimize.curve_fit()`.
4. Plot the result. Is the fit reasonable? If not, why?
5. Is the time offset for min and max temperatures the same within the fit accuracy?

### 1. Plot these temperature extremes

### 2. Define a function that can describe min and max temperatures. Hint: this function has to have a period of 1 year. Hint: include a time offset.

### 3. Fit this function to the data with `scipy.optimize.curve_fit()`.

### 4. Plot the result. Is the fit reasonable? If not, why?

## Finding the minimum of a scalar function

In [None]:
# Let's first define a function:


In [None]:
# And plot it:


> This function has a global minimum around -1.3 and a local minimum around 3.8.

> Searching for a minimum can be done with `scipy.optimize.minimize()`. Given a starting point x0, it returns the location of the minimum that it has found:

> **Methods:** As the function is smooth, gradient-based methods are good options. The [L-BFGS algorithm](https://en.wikipedia.org/wiki/Limited-memory_BFGS) is a good choice in general:
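
A minimal sketch, assuming the function is $f(x) = x^2 + 10\sin(x)$ as in the SciPy lecture notes (it has the minima described above). In `scipy.optimize.minimize()` the L-BFGS variant is selected with `method="L-BFGS-B"`.

```python
import numpy as np
from scipy import optimize

# Assumed example function (from the SciPy lecture notes)
def f(x):
    return x**2 + 10 * np.sin(x)

# Start the search at x0 = 0
result = optimize.minimize(f, x0=0.0, method="L-BFGS-B")
print(result.x)    # location of the minimum found, close to -1.3
```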

> **Global minimum:** A possible issue with this approach is that, if the function has local minima, the algorithm may find these local minima instead of the global minimum depending on the initial point x0:
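
A sketch of this effect, again assuming $f(x) = x^2 + 10\sin(x)$ from the SciPy lecture notes. Starting at `x0 = 3` places the search in the basin of the local minimum:

```python
import numpy as np
from scipy import optimize

# Assumed example function (from the SciPy lecture notes)
def f(x):
    return x**2 + 10 * np.sin(x)

# Same method as before, but a different starting point
result_local = optimize.minimize(f, x0=3.0, method="L-BFGS-B")
print(result_local.x)    # close to 3.8, the local minimum
```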

> If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, we use `scipy.optimize.basinhopping()` (added in version 0.12.0 of SciPy). It combines a local optimizer with sampling of starting points:
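
A sketch under the same assumption that $f(x) = x^2 + 10\sin(x)$. The starting point and the `stepsize=2.5` setting are choices made for this illustration: the two minima are about 5 apart, so the random steps need to be large enough to hop out of the local basin.

```python
import numpy as np
from scipy import optimize

# Assumed example function (from the SciPy lecture notes)
def f(x):
    return x**2 + 10 * np.sin(x)

# Start in the basin of the local minimum; large random steps let the
# search reach the basin of the global minimum
result_global = optimize.basinhopping(f, x0=[3.0], stepsize=2.5)
print(result_global.x)    # close to -1.3, the global minimum
```

Because `basinhopping()` is stochastic, the exact digits can vary slightly from run to run.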

## Assignment: 2-D minimization

The six-hump camelback function

$f(x, y) = (4 - 2.1x^2 + \frac{x^4}{3})x^2 + xy + (4y^2 - 4)y^2$

has multiple global and local minima. Find the global minima of this function.

Hints:

- Variables can be restricted to $-2 < x < 2$ and $-1 < y < 1$.
- Use `numpy.meshgrid()` and `matplotlib.pyplot.imshow()` to locate the regions visually.
- Use `scipy.optimize.minimize()`, optionally trying out several of its methods.

How many global minima are there, and what is the function value at those points? What happens for an initial guess of $(x, y) = (0, 0)$?

In [None]:


# Define the function that we are interested in


# Make a grid to evaluate the function (for plotting)


In [None]:
# Visualize the function in 2D


In [None]:
# Visualize the function in 3D


In [None]:
# Find the minima


# Show the function in 2D

# And the minimum that we've found:
