# PHYS 105A:  Introduction to Scientific Computing

## Data Processing with Python


## Using numerical and data science packages

* This lecture is about statistics, which means that we need to handle (relatively) large data sets.

* While we have learned how to read in text files, handle lists, etc. in pure python, it's useful to get some help!

* In the end, python is so popular in data science because of all the packages the python community develops!

* We will learn the basics of three packages: `numpy`, `pandas`, and `scipy`.

## `numpy`

* We will start with the `numpy` package.

* `numpy` enables array programming in python.  I.e., it enables us to work on a whole array of objects (numbers) "in one go" in python.

* The backend functionality in `numpy` is written in C, which gives it very high performance.

* The array programming model also provides a natural way to apply functions to arrays.

* `numpy` is the core package that enables scientific computation in python.
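As a small illustration of the points above, the sketch below computes the same quantity once with a pure python list comprehension and once with a single `numpy` expression, then checks that the two agree. The variable names and the array size are arbitrary choices for this example, not part of the lecture material.

```python
import numpy as np
from math import sin

n = 1000  # arbitrary size chosen for this illustration

# Pure python: one explicit loop per list
xs_list = [0.01 * i for i in range(n)]
gs_list = [100 * sin(x) for x in xs_list]

# numpy: the whole array is processed "in one go"
xs_arr = np.array(xs_list)
gs_arr = 100 * np.sin(xs_arr)

# Both approaches give the same numbers (up to floating-point rounding)
same = np.allclose(gs_list, gs_arr)
print(same)

# The array stores plain 8-byte floats back to back, like a C array
print(xs_arr.dtype, xs_arr.itemsize, xs_arr.nbytes)
```

The `numpy` version has no explicit loop: the loop over elements happens inside the compiled library code.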

In [None]:
from matplotlib import pyplot as plt
from math import sin, pi

# for every item i in list [0,1,2,...,98,99] put 0.1*i in the list
X = [0.1 * i for i in range(100)]

# for every item x in list X put x*x in the list
F = [x * x for x in X]

# for every item x in list X put 100*sin(x) in the list
G = [100 * sin(x) for x in X]

plt.plot(X, F) # X and F are two lists with the same number of elements
plt.plot(X, G) # the number of elements is determined by the list X

In [None]:
import numpy as np # load package numpy and rename it to np

X = np.linspace(0, 10, num=100) # 100 evenly spaced points from 0 to 10 (inclusive); google "numpy.linspace" for details
F = X * X # multiplication element-wise
G = 100 * np.sin(X) # calculate sin for every element in X then * 100

plt.plot(X, F) # same as above
plt.plot(X, G) # same as above

In [None]:
# X has a "data type"

print(X.dtype) # type of data elements in the array
print(type(X)) # type of the array

# All the values in a numpy array are densely packed, like a C array,
# instead of being stored as a list of python objects.
# A numpy array always has a shape, which is a tuple of non-negative integers.
# In 1D, the shape holds a single number, which is the same as len()

print(X.shape) # X.shape is still a tuple
print(len(X)) # but len(X) is a number

# But in 2D, they are different

Y = np.array([[1,2,3], [4,5,6]]) # define a 2D array; see what it looks like when printed

print(Y)
print(Y.shape)
print(len(Y))

In [None]:
# Numpy arrays, by default, operate in an "element-wise" way.

print(Y + 2)
print(Y * 2)
print(Y * Y)

# A large number of functions also work in the "element-wise" fashion.

print(np.sin(Y))
print(np.cos(Y))
print(Y ** 3)

# If every quantity in an expression is either a numpy array or a number, the operations are element-wise.

## `pandas`

* While numpy is the core of scientific computation in python, sometimes a large data set contains more information than a plain array.

* For example, when you look at an Excel spreadsheet, very often each column contains a different physical quantity with its own meaning and even its own unit (time, income, output).  Saying a spreadsheet is a 2D-array calculator is not totally fair.

* The `pandas` package allows us to add that structure, and physical meaning, to different columns of a table.

* `pandas` is one of the main packages that make data science work in python!
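To make the idea of "structure per column" concrete, here is a minimal sketch of a table whose columns hold different quantities with different data types. The column names (with units as suffixes) and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical table: every column is a different quantity with its own unit
table = pd.DataFrame({
    "time_s":       [0.0, 1.0, 2.0],      # time in seconds (float)
    "income_usd":   [10.5, 11.0, 12.25],  # income in US dollars (float)
    "output_units": [3, 4, 5],            # number of items produced (integer)
})

# Unlike a plain 2D array, each column keeps its own data type
print(table.dtypes)

# Columns are selected by name, so the code says which quantity is being used
total_income = table["income_usd"].sum()
print(total_income)
```

Putting the unit in the column name is just one simple convention; `pandas` itself does not track units.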

In [None]:
import pandas as pd

# The most useful data structure in pandas is the DataFrame, which is more or less a table, i.e., a labeled 2D array.
df = pd.DataFrame([[1,2,3], [4,5,6]])
display(df)

# The difference is that you can assign meaning to the rows and columns through the index and the column names
df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a', 'b', 'c'])
display(df)

# Now it is possible to access the different columns by name
print(df['a']) # access by key
print(df.a)    # access by attribute

In [None]:
# It is easy to create new columns in a pandas DataFrame

df['sum']  =  df.a + df.b + df.c
df['mean'] = (df.a + df.b + df.c) / 3

# Note that each column acts like a numpy array, so we can perform "element-wise" operations on it

display(df)

# We may also see a pandas DataFrame as a database.
# Then it makes sense to "drop" information...

df = df.drop(['sum', 'mean'], axis=1)

display(df)

In [None]:
# Since a DataFrame is like a database, we may use pandas to perform some operations "per row" for us.

display(df.apply(np.sum, axis=1))

# We can of course add the resulting column back to the DataFrame

df['mean'] = df.apply(np.mean, axis=1)
display(df)

In [None]:
# It is actually possible to run shell commands in a Jupyter notebook.
# We simply start the line with a "!".
# Let's first look at what files we have in this directory:

!ls

# We would like to load the "temperature.csv" file.
# Let's use the Unix command `head` to show what is in it:

!head -10 temperature.csv

# This file actually contains the world temperature as a function of time for the last ~150 years!

In [None]:
# Let's load this file using pandas:

df = pd.read_csv('temperature.csv')

# That's it!  No file opening, no for loop!

display(df)

In [None]:
# We may now plot the data set.
# The plot will take some time.  What's going on?

plt.plot(df.date, df.temperature)

In [None]:
# It turns out that the "date" column is still stored as strings!
# And matplotlib is slow at figuring out the labels of the x-axis.

df.date

In [None]:
# Fortunately, pandas has a "datetime" type that can help us fix it.

df.date = pd.to_datetime(df.date)
df.date

In [None]:
# Now plotting should take less than a second!
# Data type is important!

plt.plot(df.date, df.temperature)

In [None]:
# There are too many data points; let's zoom into the data

plt.plot(df.date, df.temperature)
plt.xlim(pd.to_datetime('2000'), pd.to_datetime('2010'))

# What are these waves?

In [None]:
# Looking at the period, these look like seasonal variations.
# pandas provides tools for us to group the data and then perform group operations.

mm = df.groupby(by=[df.date.dt.month]).mean()

display(mm)

plt.plot(mm.index, mm.temperature)
plt.title('Seasonal changes')

## Use `scipy` to fit curves

* Another very common task is fitting curves to data.

* `scipy` provides standard [curve fitting](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) functions that we can use.
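Before fitting real data, it can help to try `curve_fit` on synthetic data where the answer is known. The sketch below is one possible setup, not the lecture's own data: the "true" parameters, the noise level, and the initial guess `p0` are all assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

# Sinusoid with a 12-month period: offset, amplitude, and phase shift (in months)
def seasonal(t, off, amp, phi):
    return off + amp * np.sin(2 * np.pi * (t - phi) / 12)

months = np.arange(1, 13)

# Made-up "true" parameters and a small amount of reproducible noise
true_params = (14.0, 5.0, 4.0)
rng = np.random.default_rng(0)
temps = seasonal(months, *true_params) + rng.normal(0.0, 0.1, size=months.size)

# Rough initial guess: mean for the offset, half the range for the amplitude
p0 = (temps.mean(), (temps.max() - temps.min()) / 2, 3.0)

popt, pcov = curve_fit(seasonal, months, temps, p0=p0)

# The square roots of the diagonal of pcov estimate the parameter uncertainties
perr = np.sqrt(np.diag(pcov))

print(popt)
print(perr)
```

Supplying `p0` is optional, but for periodic models a reasonable starting point makes it more likely that the fit converges to the intended solution.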

In [None]:
from scipy.optimize import curve_fit

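# Provide a functional form to fit: a sinusoid with a 12-month period,
# with offset `off`, amplitude `amp`, and phase shift `phi` (in months)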
def func(t, off, amp, phi):
    return off + amp * np.sin(2 * pi * (t - phi) / 12)

popt, pcov = curve_fit(func, mm.index, mm.temperature)

# This contains the fitted parameters
print(popt)

In [None]:
# And we can now overplot the data and the fit

plt.plot(mm.index, mm.temperature)
plt.plot(mm.index, func(mm.index, *popt))
plt.title('Seasonal changes')