<p style="text-align: center; font-size: 300%"> Introduction to Programming in Python </p>
<img src="img/logo.svg" alt="LOGO" style="display:block; margin-left: auto; margin-right: auto; width: 30%;">

# Working with PyCharm
There are no slides for this bit. Some nice tutorials are available on the web, like [this one](https://www.mygreatlearning.com/blog/pycharm-tutorial/#runningacodeinpycharm).

PyCharm also has extensive documentation:

* https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-python-project.html#summary
* https://www.jetbrains.com/help/pycharm/debugging-your-first-python-application.html#summary


# Some important packages
## NumPy
* NumPy is the most fundamental package for numerical computations in Python ([user guide](https://docs.scipy.org/doc/numpy/user/index.html)).
* Basically, it provides a datatype `ndarray` and defines mathematical functions for it.
* An array is similar to a `list`, except that
  * it can have more than one dimension;
  * its elements are homogeneous (they all have the same type).
* NumPy provides a large number of functions (*ufuncs*) that operate elementwise on arrays. This allows *vectorized* code, avoiding loops (which are slow in Python).
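
As a rough illustration of what *vectorized* means, the sketch below computes the same square roots twice: once with an explicit loop and once with the ufunc `np.sqrt`. The array `x` and the variable names are made up for this example.

```python
import numpy as np

x = np.arange(1, 6, dtype=float)  # [1., 2., 3., 4., 5.]

# Loop version: handle one element at a time.
roots_loop = np.empty_like(x)
for i in range(len(x)):
    roots_loop[i] = x[i] ** 0.5

# Vectorized version: the ufunc np.sqrt processes the whole array in one call.
roots_vec = np.sqrt(x)

print(np.allclose(roots_loop, roots_vec))  # True
```

Both versions give the same result; the vectorized one is shorter and, for large arrays, usually much faster.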

#### Constructing Arrays
* Arrays can be constructed using the `array` function, which takes sequences (e.g., lists) and converts them into arrays. The data type is inferred automatically or can be specified.

In [None]:
import numpy as np
a = np.array([1, 2, 3, 4])
print(a)

In [None]:
a = np.array([1, 2, 3, 4], dtype='float64')  # or np.array([1., 2., 3., 4.])
print(a)

* NumPy uses its own fixed-size data types, modeled on those of C, which differ from Python's (though `float64` is equivalent to Python's `float`).

* Nested lists result in multidimensional arrays. We won't need anything beyond two-dimensional (i.e., a matrix or table).

In [None]:
a = np.array([[1., 2.], [3., 4.]]); a

In [None]:
a.shape # number of rows and columns
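
To make the type inference and the shape of nested lists concrete, here is a small sketch. The variable names are illustrative only, and the exact integer type that NumPy picks depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])      # all integers -> an integer dtype (platform dependent)
floats = np.array([1, 2.5, 3])  # a single float turns every element into float64
table = np.array([[1, 2, 3],
                  [4, 5, 6]])   # nested lists -> 2 rows, 3 columns

print(ints.dtype, floats.dtype)
print(table.ndim, table.shape, table.size)  # 2 (2, 3) 6
```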

* Other functions for creating arrays include:

In [None]:
np.ones([2, 3])  # there's also np.zeros, and np.empty (which results in an uninitialized array).

In [None]:
np.arange(0, 10, 2)  # like range, but returns an array.

#### Indexing
* Indexing and slicing operations are similar to lists:

In [None]:
a = np.array([[1., 2.], [3., 4.]])
print(a)
a[0, 0] # [row, column]

In [None]:
b = a[:, 0]; b  # entire first column. Note that this yields a 1-dimensional array (vector), not a matrix with one column.

* Apart from indexing by row and column, arrays also support *Boolean* indexing:

In [None]:
a = np.arange(10); a

In [None]:
ind = a < 5; ind

In [None]:
a[ind]

A shorter way to write this is

In [None]:
a[a<5]

This is useful for selecting elements according to some condition.
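
Two further uses of Boolean indexing are sketched below: combining conditions and assigning through a mask. This is only an illustration; the names `middle` and `b` are not used elsewhere in these notes.

```python
import numpy as np

a = np.arange(10)

# Combine conditions with & (and) and | (or); the parentheses are required.
middle = a[(a >= 3) & (a < 7)]
print(middle)  # [3 4 5 6]

# A Boolean mask can also appear on the left-hand side of an assignment.
b = a.copy()
b[b % 2 == 1] = 0  # set all odd entries to zero
print(b)  # [0 0 2 0 4 0 6 0 8 0]
```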

#### Arithmetic and `ufunc`s
* NumPy `ufunc`s are functions that operate elementwise:

In [None]:
a = np.arange(1, 5); np.sqrt(a)

* Other useful ufuncs include `exp`, `log`, and `abs`.
* Basic arithmetic on arrays works elementwise: 

In [None]:
a = np.arange(1, 5); b = np.arange(5, 9); a, b, a+b, a-b, a/b.astype(float)

#### Broadcasting

* Operations between scalars and arrays are also supported:

In [None]:
np.array([1, 2, 3, 4]) + 2

* This is a special case of a more general concept known as *broadcasting*, which allows operations between arrays of different shapes.
* NumPy compares the shapes of two arrays dimension-wise. It starts with the trailing dimensions, and then works its way forward. Two dimensions are compatible if
  * they are equal, or
  * one of them is 1 (or not present).
* In the latter case, the singleton dimension is "stretched" to match the larger array.

* Example:

In [None]:
x = np.arange(6).reshape((2, 3)); x  # x has shape (2,3).

In [None]:
m = np.mean(x, axis=0); m  # m has shape (3,).

In [None]:
x-m  # the trailing dimension matches, and m is stretched to match the 2 rows of x.
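
Subtracting *row* means needs one extra step, because a vector of shape `(2,)` is not compatible with an array of shape `(2, 3)`. One possible approach, sketched here, is to keep the reduced dimension with `keepdims=True` so that the means have shape `(2, 1)`.

```python
import numpy as np

x = np.arange(6).reshape((2, 3))               # shape (2, 3)

row_means = x.mean(axis=1)                     # shape (2,): not compatible with (2, 3)
row_means_col = x.mean(axis=1, keepdims=True)  # shape (2, 1): stretched along the columns

centered = x - row_means_col                   # each row becomes -1, 0, 1
print(row_means.shape, row_means_col.shape)
print(centered)
```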

#### Array Reductions
* *Array reductions* are operations on arrays that return scalars or lower-dimensional arrays, such as the `mean` function used above.
* They can be used to summarize information about an array, e.g., compute the standard deviation:

In [None]:
a = np.random.randn(300, 3)  # create a 300x3 matrix of standard normal variates.
a.std(axis=0)  # or np.std(a, axis=0)

* By default, reductions operate on the *flattened* array (i.e., on all the elements). For row- or columnwise operation, the `axis` argument has to be given.
* Other useful reductions are `sum`, `median`, `min`, `max`, `argmin`, `argmax`, `any`, and `all` (see help).
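
The effect of the `axis` argument is easiest to see on a small array. The following sketch uses a made-up 2x3 array.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

total = a.sum()           # 21: all elements (flattened array)
col_sums = a.sum(axis=0)  # [5 7 9]: one value per column
row_sums = a.sum(axis=1)  # [ 6 15]: one value per row
largest_at = a.argmax()   # 5: position of the maximum in the flattened array
any_big = (a > 5).any()   # True: at least one element exceeds 5

print(total, col_sums, row_sums, largest_at, any_big)
```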

#### Saving Arrays to Disk

* There are several ways to save an array to disk:

In [None]:
np.save('myfile.npy', a) # save `a` as a binary .npy file

In [None]:
import os
print(os.listdir('.'))

In [None]:
b = np.load('myfile.npy')  # load the data into variable b
os.remove('myfile.npy')  # clean up

In [None]:
np.savetxt('myfile.csv', a, delimiter=',')  # save `a` as a CSV file (comma-separated values, can be read by MS Excel)

In [None]:
b = np.loadtxt('myfile.csv', delimiter=',')  # load data into `b`.
os.remove('myfile.csv')
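
The two formats differ in more than readability. The sketch below (file names and the `fmt` choice are assumptions made for this example) writes to a temporary directory and checks what comes back: the binary `.npy` file reproduces the array exactly, while a text file only keeps as many digits as its format string asks for.

```python
import os
import tempfile

import numpy as np

a = np.array([[1 / 3, 2 / 3], [1.0, 2.0]])

with tempfile.TemporaryDirectory() as tmpdir:  # the directory is deleted afterwards
    npy_path = os.path.join(tmpdir, 'demo.npy')
    csv_path = os.path.join(tmpdir, 'demo.csv')

    np.save(npy_path, a)
    from_npy = np.load(npy_path)

    np.savetxt(csv_path, a, delimiter=',', fmt='%.3f')  # keep only three decimals
    from_csv = np.loadtxt(csv_path, delimiter=',')

print(np.array_equal(a, from_npy))           # True: exact round trip
print(np.array_equal(a, from_csv))           # False: digits were cut off
print(np.allclose(a, from_csv, atol=1e-3))   # True: equal up to rounding
```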

## Pandas
#### Introduction to Pandas
* `pandas` (from *p*anel *d*ata) is another fundamental package ([user guide](http://pandas.pydata.org/pandas-docs/stable/overview.html)).
* It provides data structures (*series* and *dataframes*) designed for storing observational data, and powerful methods for manipulating (*munging*, or *wrangling*) these data. The *panel* type found in older versions has been removed.
* It is usually imported as `pd`:

In [None]:
import pandas as pd

#### Series
* A pandas `Series` is essentially a NumPy array with an associated index:

In [None]:
pop = pd.Series([5.7, 82.7, 17.0], name='Population'); pop  # the descriptive name is optional.

* The difference is that the index can be anything, not just a list of integers:

In [None]:
pop.index=['DK', 'DE', 'NL']

* The index can be used for indexing (duh...):

In [None]:
pop['NL']

* NumPy's `ufunc`s preserve the index when operating on a `Series`, and arithmetic between two `Series` aligns them on their indexes:

In [None]:
gdp = pd.Series([3494.898, 769.930], name='Nominal GDP in Billion USD', index=['DE', 'NL']); gdp

In [None]:
gdp / pop

* One advantage of a `Series` compared to NumPy arrays is that its methods handle missing data, represented as `NaN` (not a number), gracefully; for example, `mean` skips missing values by default.
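
A small sketch of how missing values arise and how they are treated, reusing the population and GDP figures from above (the shortened name `'GDP'` is only for this example):

```python
import numpy as np
import pandas as pd

pop = pd.Series([5.7, 82.7, 17.0], index=['DK', 'DE', 'NL'], name='Population')
gdp = pd.Series([3494.898, 769.930], index=['DE', 'NL'], name='GDP')

ratio = gdp / pop             # aligned on the index; 'DK' has no GDP entry, so it becomes NaN
print(ratio.isna())           # True only for 'DK'
print(ratio.count())          # 2: number of non-missing values
print(ratio.mean())           # pandas skips NaN by default
print(np.mean(ratio.values))  # nan: the plain NumPy mean of the same numbers
```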

#### Dataframes

* A `DataFrame` is a collection of `Series` with a common index (which labels the rows).

In [None]:
data = pd.concat([gdp, pop], axis=1); data  # concatenate two Series to a DataFrame.

* Columns are indexed by column name:

In [None]:
data.columns

In [None]:
data['Population']  # data.Population works too

* Rows are indexed with the `loc` indexer (note the square brackets):

In [None]:
data.loc['NL']

* Unlike arrays, dataframes can have columns with different datatypes.
* There are different ways to add columns. One is to just assign to a new column:

In [None]:
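data['Language'] = pd.Series({'DE': 'German', 'DK': 'Danish', 'NL': 'Dutch'})  # add a new column; a Series is matched to the rows by index, whereas a plain list is assigned in row order.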

* Another is to use the `join` method:

In [None]:
s = pd.Series(['EUR', 'DKK', 'EUR', 'GBP'], index=['NL', 'DK', 'DE', 'UK'], name='Currency')
data.join(s)  # add a new column from a series or dataframe.

* Notes:
  * The entry for 'UK' has disappeared. By default, `join` keeps the index of the calling dataframe (a 'left join'); pass `how='outer'` or `how='inner'` to use the union or intersection of the indexes instead.
  * The returned dataframe is a temporary object. If we want to modify `data`, we need to assign to it:

In [None]:
data = data.join(s)
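
Which rows survive a join is controlled by the `how` argument of `join`. The sketch below uses a reduced version of the example data (population only) to compare two settings.

```python
import pandas as pd

data = pd.DataFrame({'Population': [5.7, 82.7, 17.0]}, index=['DK', 'DE', 'NL'])
s = pd.Series(['EUR', 'DKK', 'EUR', 'GBP'], index=['NL', 'DK', 'DE', 'UK'], name='Currency')

outer = data.join(s, how='outer')  # union of both indexes: 'UK' is kept, its population is NaN
inner = data.join(s, how='inner')  # intersection of both indexes: 'UK' is dropped
print(outer)
print(inner)
```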

* To add rows, use `loc` or `pd.concat`:

In [None]:
print(data.loc["DE"])
data.loc['AT'] = [386.4, 8.7, 'German', 'EUR']  # add a row with index 'AT'.
s = pd.DataFrame([[511.0, 9.9, 'Swedish', 'SEK']], index=['SE'], columns=data.columns)
data = pd.concat([data, s])  # add rows by concatenating another dataframe (`DataFrame.append` was removed in pandas 2.0). May create duplicates.
data

* The `dropna` method can be used to delete rows with missing values:

In [None]:
data = data.dropna(); data

* Useful methods for obtaining summary information about a dataframe are `mean`, `std`, `info`, `describe`, `head`, and `tail`.

In [None]:
data.describe()

In [None]:
data.head()  # show the first few rows; data.tail shows the last few

* To save a dataframe to disk as a csv file, use

In [None]:
data.to_csv('myfile.csv')  # to_excel exists as well.

* To load data into a dataframe, use `pd.read_csv`:

In [None]:
pd.read_csv('myfile.csv', index_col=0)

In [None]:
os.remove('myfile.csv')  # clean up

Pandas can also open CSV files directly from a URL:

In [None]:
URL = "https://covid.ourworldindata.org/data/owid-covid-data.csv"
df = pd.read_csv(URL)
df.head()

## Working with Time Series
### Data Types

* Different data types for representing times and dates exist in Python.
* The most basic one is `datetime` from the eponymous module:

In [None]:
from datetime import datetime
datetime.today()

* `datetime` objects can be created from strings using `strptime` and a format specifier:

In [None]:
datetime.strptime('2017-03-31', '%Y-%m-%d')
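
The format specifier has to match the layout of the string. As an illustration, the sketch below parses a made-up day.month.year string with a time of day and then converts the result back to text with `strftime`.

```python
from datetime import datetime

d = datetime.strptime('31.03.2017 14:30', '%d.%m.%Y %H:%M')  # day.month.year hour:minute
print(d.year, d.month, d.day, d.hour, d.minute)  # 2017 3 31 14 30
print(d.strftime('%Y-%m-%d'))                    # 2017-03-31
print(d.weekday())                               # 4, i.e. a Friday (Monday is 0)
```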

* Pandas uses `Timestamp`s instead of `datetime` objects. `Timestamp` is a subclass of `datetime` that adds nanosecond resolution and convenient time zone handling. The two can mostly be used interchangeably.

In [None]:
pd.Timestamp('2017-03-31')
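
A short sketch of what "interchangeably" means in practice: a `Timestamp` can be compared with and subtracted from a plain `datetime`, and converted back if needed.

```python
from datetime import datetime

import pandas as pd

ts = pd.Timestamp('2017-03-31')
dt = datetime(2017, 3, 31)

print(ts == dt)                   # True: both represent the same instant
print(isinstance(ts, datetime))   # True: Timestamp is a subclass of datetime
delta = ts - datetime(2017, 1, 1)
print(delta.days)                 # 89
print(ts.to_pydatetime())         # convert back to a plain datetime object
```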

* A time series is a `Series` with a special index, called a `DatetimeIndex`; essentially an array of `Timestamp`s.
* It can be created using the `date_range` function.

In [None]:
myindex = pd.date_range(end='2022-04-14', periods=100, freq='B')  # 100 business days; the fixed end date keeps the indexing examples below reproducible.
P = 20 + np.random.randn(100).cumsum()  # make up some share prices.
aapl = pd.Series(P, name="AAPL", index=myindex)
aapl.tail()

* As a convenience, Pandas allows indexing timeseries with date strings:

In [None]:
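aapl['2022-04-11']  # ISO dates (year-month-day) avoid ambiguity between day and month.

In [None]:
aapl['2022-04-11':'2022-04-12']  # unlike list slices, both endpoints are included.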
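
Because the exact dates in an index matter for these lookups, the following self-contained sketch uses a small made-up series with fixed dates. It also shows *partial* date strings, which select a whole period (here a month) via `loc`.

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2022-04-01', periods=10, freq='B')  # ten business days from April 1, 2022
prices = pd.Series(np.arange(10.0), index=idx, name='demo')

one_day = prices['2022-04-05']                  # a single value
two_days = prices['2022-04-11':'2022-04-12']    # both endpoints are included
whole_month = prices.loc['2022-04']             # every observation in April 2022

print(one_day)
print(two_days)
print(len(whole_month))
```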

# Further reading (optional)
* https://python-course.eu/numerical-programming/

# Homework

Try to do these in PyCharm:
 * Ex. 1-16 of https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.md (skip 4 and 11)
 * https://github.com/guipsamora/pandas_exercises