## PHYS 105A:  Introduction to Scientific Computing

# Statistics and Data Processing Methods

Chi-kwan Chan

## Project

First things first: the first project is due on March 11th at 12:00am.  The project is worth twice as much as an assignment, so you have two weeks to work on it.  The goal of the project is for you to take what you have learned in this course, come up with an independent project, and finish it.  You have learned Python, C, and random numbers, and you will learn some statistics today.  Acceptable projects include:

- Dice roller: given the number of dice and the number of sides, plot the frequencies of the resulting totals, and explain, for example, why rolling a single 12-sided die is not the same as rolling two six-sided dice.

- Perform a Monte Carlo simulation of the Monty Hall problem.

- A more detailed study of Brownian motion and diffusion.

- Clebsch-Gordan coefficient calculator.

- Sudoku solver.

- ...

Please plan your project, prepare a single slide, and discuss it with the class next week.

## Using numerical and data science packages

* This lecture is about statistics, which means that we need to handle (relatively) large data sets.

* While we have learned how to read text files, handle lists, etc., in pure Python, it's useful to get some help!

* In the end, Python is so popular in data science because of all the packages the Python community develops!

* We will learn the basics of two packages: `numpy` and `pandas`.

## `numpy`

* We will start with the `numpy` package.

* `numpy` enables array programming in Python.  That is, it lets us work on a whole array of objects (numbers) "in one go".

* The backend of `numpy` is written in C, which makes it fast.

* The array programming model also provides a natural way to apply functions to arrays.

* `numpy` is the core package that enables scientific computation in Python.

* There is a [Nature paper](https://www.nature.com/articles/s41586-020-2649-2) on `numpy`!
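
To give a feel for the "in one go" and performance points above, here is a minimal sketch that computes a sum of squares with a pure Python loop and with `numpy`. The data and the helper names `sum_of_squares_loop` and `sum_of_squares_numpy` are made up for illustration, and the timings depend on your machine, so no particular numbers are claimed.

```python
import timeit

import numpy as np

# Hypothetical sample data: 100,000 evenly spaced numbers.
n = 100_000
values_list = [0.001 * i for i in range(n)]
values_array = np.array(values_list)

def sum_of_squares_loop(values):
    """Pure Python: visit the elements one at a time."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_numpy(values):
    """numpy: square and sum the whole array in compiled code."""
    return np.sum(values * values)

loop_result = sum_of_squares_loop(values_list)
numpy_result = sum_of_squares_numpy(values_array)

# Both approaches agree up to floating-point rounding.
print(np.isclose(loop_result, numpy_result))

# Time three repetitions of each approach.
print("loop :", timeit.timeit(lambda: sum_of_squares_loop(values_list), number=3))
print("numpy:", timeit.timeit(lambda: sum_of_squares_numpy(values_array), number=3))
```

On most machines the `numpy` version should be noticeably faster, but try it yourself instead of taking that on faith.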

In [None]:
from matplotlib import pyplot as plt
from math import sin

X = [0.1 * i for i in range(100)]
F = [x * x for x in X]
G = [100 * sin(x) for x in X]

plt.plot(X, F)
plt.plot(X, G)

In [None]:
import numpy as np

X = np.linspace(0, 10, num=100)
F = X * X
G = 100 * np.sin(X)

plt.plot(X, F)
plt.plot(X, G)
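
One detail worth checking: the two cells above do not use exactly the same grid. The following sketch (the variable names `X_list`, `X_closed`, and `X_open` are just for illustration) compares the list-based grid with `np.linspace`, with and without its `endpoint` option.

```python
import numpy as np

# The grid from the pure Python cell: 0.0, 0.1, ..., 9.9
X_list = [0.1 * i for i in range(100)]

# By default linspace includes the endpoint 10, so the spacing is 10/99.
X_closed = np.linspace(0, 10, num=100)

# With endpoint=False the spacing is 10/100 = 0.1, matching the list.
X_open = np.linspace(0, 10, num=100, endpoint=False)

print(X_closed[-1], X_closed[1] - X_closed[0])
print(np.allclose(X_open, X_list))
```

For plotting, the difference is invisible, but it matters when you compare numbers point by point.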

In [None]:
# X has a "data type"

print(X.dtype)

# All the values in a numpy array are densely packed, like a C array,
# instead of being stored as a list of Python objects.
# A numpy array always has a shape, which is a tuple of non-negative integers.
# In 1D, the shape is a one-element tuple that holds the same number as len()

print(X.shape)
print(len(X))

# But in 2D, they are different: len() only reports the first dimension

Y = np.array([[1,2,3], [4,5,6]])

print(Y)
print(Y.shape)
print(len(Y))
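
As a small follow-up sketch, a few other attributes describe the same array, and the `dtype` is chosen from the values you pass in. The array `Z` is made up for illustration.

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6]])

print(Y.ndim)                # number of dimensions
print(Y.size)                # total number of elements
print(Y.shape[0] == len(Y))  # len() only reports the first dimension

# Mixing integers and floats promotes the whole array to a floating-point dtype.
Z = np.array([1, 2.5, 3])
print(Z.dtype)
```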

In [None]:
# Numpy arrays, by default, operate in an "element-wise" way.

print(Y + 2)
print(Y * 2)
print(Y * Y)

# A large number of functions also work in this "element-wise" fashion.

print(np.sin(Y))
print(np.cos(Y))
print(Y ** 3)
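
Because `*` is element-wise, `Y * Y` is not a matrix product. As a short sketch of the difference, the `@` operator performs matrix multiplication; the names `elementwise` and `matrix` are only for illustration.

```python
import numpy as np

Y = np.array([[1, 2, 3], [4, 5, 6]])

elementwise = Y * Y   # multiplies matching entries; the shape stays (2, 3)
matrix = Y @ Y.T      # matrix product of (2, 3) with (3, 2); the shape is (2, 2)

print(elementwise)
print(matrix)
```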

## `pandas`

* While `numpy` is the core of scientific computation in Python, sometimes a large data set contains more information than a plain array.

* For example, when you look at an Excel spreadsheet, each column very often contains a different physical quantity with its own meaning and even its own units (time, income, output).  Calling a spreadsheet a 2D-array calculator is not entirely fair.

* The `pandas` package allows us to add that structure, and physical meaning, to different columns of a table.

* `pandas` is one of the main packages that make data science work in Python!

In [None]:
import pandas as pd

# The most useful data structure in pandas is the DataFrame, which is more or less a table, similar to a 2D array.
df = pd.DataFrame([[1,2,3], [4,5,6]])
display(df)

# The difference is that you can attach meaning to rows and columns, such as an index and column names
df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a', 'b', 'c'])
display(df)

# Now it is possible to access the different columns by name
print(df['a']) # access by key
print(df.a)    # access by attribute
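
Rows can carry labels too. The sketch below, which uses made-up row labels and a hypothetical variable name `df_labeled`, sets an index and then selects by label with `.loc` and by integer position with `.iloc`.

```python
import pandas as pd

df_labeled = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=['a', 'b', 'c'],
    index=['first', 'second'],   # hypothetical row labels
)

print(df_labeled.loc['first'])        # one row, selected by label
print(df_labeled.loc['second', 'b'])  # one cell: row label, then column name
print(df_labeled.iloc[0, 2])          # one cell, selected by integer position
```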

In [None]:
# It is easy to create new columns in a pandas DataFrame

df['sum']  =  df.a + df.b + df.c
df['mean'] = (df.a + df.b + df.c) / 3

# Note that each column acts like a numpy array, so we can perform "element-wise" operations on it

display(df)

# We may also see a pandas DataFrame as a database.
# Then it makes sense to "drop" information...

df = df.drop(['sum', 'mean'], axis=1)

display(df)
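
Note that `drop()` returns a new DataFrame and leaves the original alone, which is why the cell above assigns the result back to `df`. A minimal sketch, with `df_full` and `df_small` as illustrative names:

```python
import pandas as pd

df_full = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df_full['sum'] = df_full.a + df_full.b + df_full.c

# drop() returns a new DataFrame; `columns=` is an alternative to axis=1.
df_small = df_full.drop(columns=['sum'])

print(list(df_full.columns))   # the original still has 'sum'
print(list(df_small.columns))  # the copy does not
```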

In [None]:
# Since a DataFrame is like a database, we may ask pandas to perform some operations "per row" for us.

display(df.apply(np.sum, axis=1))

# We can of course add the resulting column back to the DataFrame

df['mean'] = df.apply(np.mean, axis=1)
display(df)
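
For common reductions, pandas also has built-in methods that do the same job as `apply`. The sketch below (with illustrative names such as `df_demo`) compares the two. It also selects the columns explicitly, which avoids accidentally averaging over a derived column such as `mean` once it has been added.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

# Row-wise mean via apply, as in the cell above.
row_mean_apply = df_demo.apply(np.mean, axis=1)

# The same result from the built-in method, restricted to chosen columns.
row_mean_builtin = df_demo[['a', 'b', 'c']].mean(axis=1)

print(row_mean_builtin)
print(np.allclose(row_mean_apply, row_mean_builtin))
```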