# 6. NumPy & Pandas

In the sixth section we present two powerful libraries for
efficient scientific computing: __NumPy__ and __Pandas__. 
Note that we won't be able to illustrate the full potential 
of each library but rather present a selection of useful tools. 
For more details you might want to check out the
[NumPy User Guide](https://docs.scipy.org/doc/numpy/user/index.html) 
or the [Pandas Tutorial Guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).

In addition, we have a look at how one can use __seaborn__ to create
plots when working with Pandas. For more details on seaborn you might
want to visit the [seaborn Tutorial Overview](https://seaborn.pydata.org/tutorial.html).

In this part we revisit some of the previous concepts
and learn

* about array manipulation with NumPy,
* how to work with Pandas DataFrames and
* how to plot in these frameworks.

Keywords: ```np.array```, ```shape```, ```astype```, ```np.matmul```,
```np.multiply```, ```np.mean```, ```np.arange```, ```np.reshape```, 
```np.append```, ```np.random```, ```np.newaxis```, ```np.savetxt```, 
```np.loadtxt```, ```pd.DataFrame```, ```value_counts```, ```head```,  
```groupby```, ```describe```, ```pd.read_csv```, ```to_csv```, 
```seaborn```, ```sns.countplot```, ```sns.boxplot```, ```sns.violinplot```, 
```sns.jointplot```, ```*.feather```

***
## NumPy

NumPy adds many efficient ways to work with large lists and 
matrices, which are also referred to as __arrays__, and a large number 
of high-level mathematical functions. In most cases, it is much 
__more efficient__ to work with NumPy objects instead of the built-in 
objects we encountered so far. So if you are working with large arrays,
performing the calculations with NumPy is usually a good choice.

As before, we need to import the NumPy library first. The common
abbreviation is ```np```.

In [8]:
import numpy as np
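
To get a feeling for the efficiency claim above, the following sketch squares one million numbers, once with a list comprehension and once with a NumPy array. The array size and the variable names are chosen for illustration only, and the measured times will differ from machine to machine.

```Python
import time
import numpy as np

n = 1_000_000  # illustrative size, not taken from the course material
numbers_list = list(range(n))
numbers_np = np.arange(n, dtype=np.int64)

start = time.perf_counter()
squares_list = [x * x for x in numbers_list]   # Python-level loop
time_list = time.perf_counter() - start

start = time.perf_counter()
squares_np = numbers_np * numbers_np           # one vectorised operation
time_np = time.perf_counter() - start

print("list comprehension:", round(time_list, 4), "s")
print("NumPy array:       ", round(time_np, 4), "s")

# Both approaches give the same values
print(squares_list[:5], squares_np[:5])
```

On most machines the NumPy version should be noticeably faster, because the loop over the entries runs in compiled code instead of in Python.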

### Initialising NumPy arrays

Let us compare the "old" list type with NumPy arrays!

In [2]:
old_list = [1,2,3,4]
np_list = np.array( old_list )

print(old_list, "vs", np_list)

[1, 2, 3, 4] vs [1 2 3 4]


In [3]:
old_matrix = [[1,2,3],[4,5,6]]
np_matrix = np.array(old_matrix)

print(old_matrix, "\nvs\n", np_matrix)

[[1, 2, 3], [4, 5, 6]] 
vs
 [[1 2 3]
 [4 5 6]]


In [4]:
print(type(old_matrix))
print(type(np_matrix))

<class 'list'>
<class 'numpy.ndarray'>


In [5]:
len(np_matrix)

2

In [6]:
np_matrix.shape

(2, 3)

In [7]:
np_matrix.ndim

2

#### Note
that the output of ```shape``` can be interpreted as the 
__number of rows and columns__ in the matrix, while ```ndim```
specifies the number of __dimensions__, i.e. _there is one row
and one column dimension_.

In [8]:
test_array = np.arange(2,10)
print("array:", test_array)
print("shape:", test_array.shape)
print("number dimensions:", test_array.ndim)

array: [2 3 4 5 6 7 8 9]
shape: (8,)
number dimensions: 1


In [10]:
matrix = np.array( [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ] )
print("array:\n", matrix)
print("shape:", matrix.shape)
print("number of dimensions:", matrix.ndim)

array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
shape: (3, 3)
number of dimensions: 2


In [11]:
matrix[0]

array([1, 2, 3])

In [12]:
matrix[1,1]

5

In [13]:
matrix[1][1]

5

In [17]:
tensor = np.array(
    [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
    [[13, 14, 15], [16, 17, 18]]
    ]
)

print("array:\n", tensor)
print("shape:", tensor.shape)
print("number of dimensions:", tensor.ndim)

array:
 [[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]

 [[13 14 15]
  [16 17 18]]]
shape: (3, 2, 3)
number of dimensions: 3


#### Note 
that you can think of a tensor as a 3-dimensional matrix or
you can view it as a "cube of values".
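
The "cube of values" picture can be explored by indexing one dimension at a time. The following is a minimal sketch which rebuilds the same tensor with ```np.arange``` and ```reshape``` (a shortcut chosen here for brevity) and then steps from a whole block down to a single entry.

```Python
import numpy as np

tensor = np.arange(1, 19).reshape(3, 2, 3)  # same values as the tensor above

print(tensor.shape)      # (3, 2, 3): 3 blocks, each with 2 rows and 3 columns
print(tensor[1])         # second block, a 2 x 3 matrix
print(tensor[1, 0])      # first row of the second block
print(tensor[1, 0, 2])   # single entry
print(tensor.size)       # total number of entries: 3 * 2 * 3
```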

Some special types of arrays:

In [18]:
np.zeros( (3,4) )

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [19]:
np.ones( (3,4) )

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [20]:
eye = np.eye( 3 )
print(eye)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


***
### Type conversion

As before, changing types is not a big deal in Python. However, note that
we now use new functions for this.

In [21]:
matrix

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [22]:
matrix = matrix.astype(float)

In [23]:
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

#### Note
that, previously, we used ```float(...)``` to convert a single variable to a float. For NumPy arrays, ```.astype(float)``` converts all entries at once.

In [24]:
type(matrix.astype(float))

numpy.ndarray

In [25]:
matrix.dtype

dtype('float64')

#### Note 
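that ```.dtype``` reports the type of the entries stored in the array, whereas ```type(...)``` only tells us that the object is a NumPy array. Like ```.astype(...)```, ```.dtype``` is specific to NumPy arrays.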

In [26]:
matrix.astype(str)

array([['1.0', '2.0', '3.0'],
       ['4.0', '5.0', '6.0'],
       ['7.0', '8.0', '9.0']], dtype='<U32')
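
One detail worth knowing when converting types: going from floats to integers drops the fractional part instead of rounding. The following sketch uses made-up values to show this and to confirm that ```astype``` returns a new array.

```Python
import numpy as np

values = np.array([1.7, -1.7, 2.0])   # made-up values

as_int = values.astype(int)   # fractional part is dropped, no rounding
as_str = values.astype(str)   # every entry becomes a string

print(as_int)                 # [ 1 -1  2]
print(as_str)
print(values.dtype, as_int.dtype.kind, as_str.dtype.kind)

# astype returns a new array; the original is unchanged
print(values)
```

If rounding is wanted, one option is to call ```np.round(values)``` before converting.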

***
### Arithmetic

Arithmetic operations work, again, very intuitively. But note that some 
operations with NumPy arrays are much closer to the mathematical intuition
than with lists.

In [27]:
old_matrix

[[1, 2, 3], [4, 5, 6]]

In [28]:
old_matrix*2

[[1, 2, 3], [4, 5, 6], [1, 2, 3], [4, 5, 6]]

#### Note
that the "rows" of ```old_matrix``` were simply repeated and appended when using ```*2```. 
With NumPy arrays, we can indeed multiply the matrix by the scalar 2.

In [29]:
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [31]:
matrix*2

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.],
       [14., 16., 18.]])

In [32]:
matrix - matrix*2

array([[-1., -2., -3.],
       [-4., -5., -6.],
       [-7., -8., -9.]])

In [33]:
print(matrix)

matrix_mulitplication = np.matmul(matrix,matrix)

print(matrix_mulitplication)

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]
[[ 30.  36.  42.]
 [ 66.  81.  96.]
 [102. 126. 150.]]


#### Note
that ```np.matmul``` provides the matrix multiplication which 
we would expect, where entry

In [None]:
matrix_mulitplication[0][0]

is calculated with

$$1 \cdot 1 + 2 \cdot 4 + 3 \cdot 7 = 1 + 8 + 21 = 30.$$

However, 

In [34]:
elementwise_product = matrix * matrix
print(elementwise_product)

[[ 1.  4.  9.]
 [16. 25. 36.]
 [49. 64. 81.]]


is obtained by multiplying each entry of the first matrix with 
the respective entry of the second matrix element-wise. I.e.

In [35]:
elementwise_product[2][2]

81.0

is calculated with ```matrix[2,2]``` = 9 times ```matrix[2,2]``` = 9, so  $$ 9\cdot 9=81. $$
In NumPy this is also implemented as 

In [36]:
np.multiply(matrix, matrix)

array([[ 1.,  4.,  9.],
       [16., 25., 36.],
       [49., 64., 81.]])

In [37]:
matrix_mulitplication = matrix @ matrix
print(matrix_mulitplication)

[[ 30.  36.  42.]
 [ 66.  81.  96.]
 [102. 126. 150.]]


#### Note 
that the operator ```@``` can be used for the matrix multiplication of NumPy arrays (instead of ```np.matmul```).
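
To tie the three notations together, the following sketch recomputes the entry in the first row and first column by hand and checks that ```np.matmul``` and ```@``` agree, while ```*``` gives a different (element-wise) result. The variable names are chosen for this example only.

```Python
import numpy as np

matrix = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

product_matmul = np.matmul(matrix, matrix)
product_at = matrix @ matrix

# Entry [0, 0] by hand: first row of the left factor times first column of the right factor
entry_00 = sum(matrix[0, k] * matrix[k, 0] for k in range(3))

print(entry_00)                                      # 30.0
print(np.array_equal(product_matmul, product_at))    # True
print(np.array_equal(product_at, matrix * matrix))   # False: * is element-wise
```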

In [38]:
vec1 = np.array([2,2,2])
vec2 = np.array([5,10,20])

np.dot(vec1, vec2)

70

#### Note 
that ```np.dot``` is the scalar or dot product. 
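
As a quick check of what the dot product does, it can be rebuilt from an element-wise product followed by a sum. This is only a sketch to illustrate the definition; ```by_hand``` is a name introduced here.

```Python
import numpy as np

vec1 = np.array([2, 2, 2])
vec2 = np.array([5, 10, 20])

by_dot = np.dot(vec1, vec2)
by_hand = np.sum(vec1 * vec2)   # 2*5 + 2*10 + 2*20

print(by_dot, by_hand)          # 70 70
```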

In [39]:
np_exp = np.exp(2)
print(np_exp)

7.38905609893065


In [40]:
2.71828**2

7.3890461584

In [45]:
np.power(2,4)

16

In [46]:
np.sqrt(16)

4.0

In [47]:
np.log(np_exp)

2.0

***
### Common array operations

In [48]:
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [49]:
matrix.transpose()

array([[1., 4., 7.],
       [2., 5., 8.],
       [3., 6., 9.]])

In [50]:
matrix.flatten()

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [51]:
matrix.min()

1.0

In [52]:
matrix.max()

9.0

In [53]:
matrix.mean()

5.0

In [54]:
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [55]:
matrix.sum()

45.0

#### Note 
that you can also perform a column-wise sum

In [56]:
matrix

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [57]:
matrix.sum(axis=0)

array([12., 15., 18.])

or row-wise sum

In [60]:
matrix.sum(axis=1)

array([ 6., 15., 24.])

In [59]:
matrix.sum(axis=2)

AxisError: axis 2 is out of bounds for array of dimension 2
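
A helpful way to read ```axis``` is as the dimension that is summed away. The following sketch uses a small, non-square array (made up for this example) so that the resulting shapes make the difference visible, and it catches the error for an axis that does not exist.

```Python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])          # shape (2, 3)

col_sums = data.sum(axis=0)   # collapses the 2 rows    -> shape (3,)
row_sums = data.sum(axis=1)   # collapses the 3 columns -> shape (2,)

print(col_sums, col_sums.shape)   # [5 7 9] (3,)
print(row_sums, row_sums.shape)   # [ 6 15] (2,)

# A 2-dimensional array only has axes 0 and 1, so axis=2 fails
try:
    data.sum(axis=2)
except Exception as err:   # NumPy raises an AxisError here
    print("Error:", err)
```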

***

### Reshaping, slicing and appending arrays

In [61]:
new_np_matrix = np.arange(24)
new_np_matrix

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [62]:
new_np_matrix = new_np_matrix.reshape( 4,6 )
new_np_matrix

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23]])

In [63]:
new_np_matrix.shape

(4, 6)

In [66]:
new_np_matrix = new_np_matrix.reshape(2,3,4)
new_np_matrix

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [67]:
new_np_matrix[1]

array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [68]:
new_np_matrix[0,0,1]

1

In [71]:
new_np_matrix[0,0,:]

array([0, 1, 2, 3])

#### Note 
that this is short for
```Python
new_np_matrix[0,0,start:end]
```
and is the same as
```Python
new_np_matrix[0,0]
```

In [72]:
new_np_matrix[0,:,0]

array([0, 4, 8])

#### Note 
that ```:``` indicates that all elements in this dimension shall be selected.
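
The ```start:end``` form can also be combined with ```:``` in other dimensions, and an optional step can be added. The following sketch shows a few such combinations on the same 2 x 3 x 4 array; the particular slices are chosen for illustration.

```Python
import numpy as np

new_np_matrix = np.arange(24).reshape(2, 3, 4)

print(new_np_matrix[0, :, 1:3])   # all rows of block 0, columns 1 and 2
print(new_np_matrix[:, 0, 0])     # first entry of the first row in every block
print(new_np_matrix[1, 1, ::2])   # every second entry of one row
```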

In [None]:
np_vector = np.array([10,20,30,40]) 

np_append1 = np.append( np_vector, new_np_matrix[0,0,:] )

print(np_append1)
print(np_vector)

#### Note 
that ```np.append``` does not change ```np_vector```; it returns a new array, so the result needs to be assigned to a variable. 
For lists this was not necessary because the result of 

```Python
list_vector.extend(list_matrix)
```

would be directly (_in-place_) appended to ```list_vector```.


With NumPy arrays, the __dimensions of the arrays play an important role__ for 
appending. Let's have a look at the shape of ```np_vector``` and ```new_np_matrix[0,0,:]```: 

In [None]:
list_vector = [10,20,30,40]
list_matrix = [0,1,2,3]
print(list_vector)

In [None]:
list_vector.extend(list_matrix)
print(list_vector)

In [None]:
print("Dimension np_vector:", np_vector.shape)
print("Dimension new_np_matrix[0,0,:]:", new_np_matrix[0,0,:].shape)

In [None]:
np_a = np.array([ [1,2,3,4] ])
np_b = np.array([ [10,12,13,14] ])
print(np_a)

In [None]:
print("Dimension np_a:", np_a.shape)
print("Dimension np_b:", np_b.shape)

#### Note 
that for ```np_vector``` and ```new_np_matrix[0,0,:]``` we had vectors of length 4
and for ```np_a``` and ```np_b``` we have "matrices" with one row and 4 columns.

In [None]:
np.append(np_a, np_b)

In [None]:
np.append(np_a, np_b).shape

In [None]:
np.append(np_a, np_b, axis = 0)

In [None]:
np.append(np_a, np_b, axis = 0).shape

In [None]:
np.append(np_a, np_b, axis = 1)

In [None]:
np.append(np_a, np_b, axis = 1).shape

In [None]:
print(np_a)
print(np_a.shape)

In [None]:
print(new_np_matrix[0,0,:])
print(new_np_matrix[0,0,:].shape)

In [None]:
# This raises a ValueError: np_a has 2 dimensions, the slice only 1
np.append(np_a, new_np_matrix[0,0,:], axis=0)

In [None]:
np.append(np_a, new_np_matrix[0,0,:].reshape(1,4), axis=0)

In [None]:
new_np_matrix[0,0,:].transpose()

***
### Random numbers

NumPy offers a lot of different probability distributions to sample from.

In [None]:
np.random.random(size=4)

#### Note 
that ```random``` provides 4 random numbers uniformly sampled between 0 and 1.

In [None]:
np.random.randint(low=0, high=10, size=3)

#### Note 
that ```randint``` provides 3 random integers uniformly sampled between 
```low``` (0, included) and ```high``` (10, excluded), i.e. from 0 to 9.
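
The range of ```randint``` can be checked empirically by drawing many samples and looking at the smallest and largest value. The sample size below is an arbitrary choice for this sketch.

```Python
import numpy as np

samples = np.random.randint(low=0, high=10, size=10_000)

print(samples.min(), samples.max())   # never below 0, never above 9
print(np.unique(samples))             # the distinct values that were drawn
```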

In [None]:
np.random.randn(5)

#### Note
that ```randn``` denotes sampling from the standard normal distribution.

In [None]:
np.random.normal(loc=5, scale=2.0, size=4)

#### Note
that ```loc``` corresponds to the mean $\mu$ and ```scale``` to the standard deviation $\sigma$ of the 
normal / Gaussian distribution.

In [None]:
np.random.seed(1234)

#### Note
that ```seed(1234)``` sets the _seed_ or _starting point_ with index ```1234``` 
from which the (pseudo) random numbers are generated. In this way, the same
sequence of (pseudo) random numbers can be retrieved. This means if we execute

In [None]:
np.random.random(size=4)

the first three runs after setting the seed will always produce 

1. ```array([0.19151945, 0.62210877, 0.43772774, 0.78535858])```

2. ```array([0.77997581, 0.27259261, 0.27646426, 0.80187218])```

3. ```array([0.95813935, 0.87593263, 0.35781727, 0.50099513])```
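
The effect of the seed can be verified by resetting it and drawing again. The sketch below also shows ```np.random.default_rng```, the generator interface of newer NumPy versions; note that it uses a different algorithm, so the same seed gives different numbers than ```np.random.seed```.

```Python
import numpy as np

np.random.seed(1234)
first_run = np.random.random(size=4)

np.random.seed(1234)              # reset to the same starting point
second_run = np.random.random(size=4)

print(np.array_equal(first_run, second_run))   # True

# Newer NumPy versions also offer a generator object with its own seed
rng = np.random.default_rng(1234)
print(rng.random(size=4))
```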


***
### Remove redundant elements

Previously, we've had the following example

In [None]:
days = ['Friday',
        'Monday',
        'Tuesday',
        'Wednesday',
        'Thursday',
        'Friday',
        'Saturday',
        'Sunday' 
       ]

print(days)

and we wanted to remove __all occurrences__ of the element ```'Friday'```.

In [None]:
remove_element = 'Friday'

One way to achieve this is by using a _list comprehension_:

In [None]:
res_1 = [d for d in days if d != remove_element]
print("Variant 1:\n", res_1)

As we often work with NumPy arrays, the following might be the best option

In [None]:
days_np = np.array(days)

indices2delete = np.where(days_np == remove_element)

print("The following indices will be deleted:\n", indices2delete)

res_2 = np.delete(days_np, indices2delete)

print("\nVariant 2:\n", res_2)
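
A further option, sketched here as a possible third variant, is to select the entries to keep with a boolean mask instead of deleting indices. The shortened ```days_np``` array and the names ```keep``` and ```res_3``` are introduced for this example only.

```Python
import numpy as np

days_np = np.array(['Friday', 'Monday', 'Tuesday', 'Friday', 'Sunday'])
remove_element = 'Friday'

keep = days_np != remove_element   # array of True/False, one entry per day
res_3 = days_np[keep]              # keeps only the entries marked True

print(keep)
print("Variant 3:\n", res_3)
```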

***
### Read-in and write files

NumPy comes in really handy if we can use it for our data manipulation. 
Usually, this requires that we read in data from a file, first.

In [None]:
csv_data = np.loadtxt('data/numpy_example.csv', delimiter = ',')

print(csv_data)

In [None]:
csv_data.shape

In [None]:
import matplotlib.pyplot as plt

plt.plot(csv_data[:,0], csv_data[:,1])
plt.show()

Let's take the cube root of the second column with ```np.cbrt```
and multiply the result with ```-10```.

In [None]:
new_column = np.cbrt(csv_data[:,1]) * -10

print("new_column:\n", new_column)
print("shape:", new_column.shape)

We would like to append ```new_column``` to our data matrix. For this
to work we need to reshape our vector of length 20 to a matrix of shape
20 x 1. We can use ```np.newaxis``` for this.

In [None]:
new_column = new_column[:, np.newaxis] 
# or 
# new_column = new_column.reshape(20,1)

print("new_column:\n", new_column)
print("shape:", new_column.shape)
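
If the length of the vector is not known in advance, ```reshape(-1, 1)``` can be used so that NumPy works out the number of rows itself. The following sketch compares it with ```np.newaxis``` on a short made-up vector.

```Python
import numpy as np

vector = np.arange(5)                 # shape (5,), a made-up stand-in for new_column

as_column_1 = vector[:, np.newaxis]   # shape (5, 1)
as_column_2 = vector.reshape(-1, 1)   # -1 lets NumPy work out the number of rows

print(as_column_1.shape, as_column_2.shape)
print(np.array_equal(as_column_1, as_column_2))   # True
```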

In [None]:
new_csv_data = np.append(csv_data, new_column, axis = 1)
new_csv_data

Let's visualise the result.

In [None]:
plt.plot(new_csv_data[:,0], new_csv_data[:,1], label = 'input data', marker = 'd')
plt.plot(new_csv_data[:,0], new_csv_data[:,2], label = 'transformation', marker = 's')

plt.xlabel('x values')
plt.ylabel('y values')

plt.title('Our NumPy example')

plt.legend()

plt.show()

Finally, we store the results in a new .csv file with ```np.savetxt```.

In [None]:
np.savetxt('data/saved_numpy_example.csv', new_csv_data, delimiter=',', 
           fmt='%1.3f', header='x,y,z')

#### Note 
that ```delimiter``` sets the character which separates the numbers in 
the resulting output file. Further, 
```fmt='%1.3f'```  specifies that you want your entries to
be stored as floats with 3 decimals. Another example would be
```fmt = '%d'``` which would indicate that the entries shall be 
saved as integers.
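
To see what ```np.savetxt``` actually writes, the following sketch saves a small made-up array to a temporary file, prints the raw file content and reads it back with ```np.loadtxt```. The temporary path and the values are assumptions for this example and are not part of the course data.

```Python
import os
import tempfile
import numpy as np

data = np.array([[0.0, 1.23456], [1.0, 2.34567]])   # made-up values

# Write to a temporary file so that nothing in data/ is touched
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
np.savetxt(path, data, delimiter=',', fmt='%1.3f', header='x,y')

with open(path) as f:
    content = f.read()
print(content)               # the header line starts with "# "

# loadtxt skips lines starting with "#" by default
loaded = np.loadtxt(path, delimiter=',')
print(loaded)                # values are rounded to 3 decimals
```

Note that the header is written as a comment line, which is why ```np.loadtxt``` can read the file back without any extra arguments.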
***
## Pandas

Pandas is another Python library which offers efficient
data structures for tabular data and allows for
advanced data manipulation and analysis. If you have some 
experience with R, the way to work with Pandas will look 
very familiar to you.

The common abbreviation for Pandas is ```pd```.

In [None]:
import pandas as pd

In [None]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
matrix

In [None]:
matrix_df = pd.DataFrame(matrix, columns = ['col1','col2','col3'], 
                         index = ['row1','row2','row3'])

matrix_df

In [None]:
matrix_df['col1']

In [None]:
matrix_df.loc['row1']

#### Note 
that for the rows you need to use ```.loc[...]```.
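
Besides the label-based ```.loc[...]```, Pandas also offers the position-based ```.iloc[...]```. The following sketch rebuilds a 3 x 3 DataFrame like the one above and accesses the same row in both ways.

```Python
import numpy as np
import pandas as pd

matrix_df = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                         columns=['col1', 'col2', 'col3'],
                         index=['row1', 'row2', 'row3'])

print(matrix_df.loc['row2'])           # row by label
print(matrix_df.iloc[1])               # the same row by position
print(matrix_df.loc['row2', 'col3'])   # single entry: row label, column label
```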

In [None]:
matrix_df.index

In [None]:
matrix_df.columns

In [None]:
col_df = pd.DataFrame([10,11,12], columns = ['col4'], 
                      index = ['row1','row2','row3']) 
col_df

In [None]:
row_df = col_df.T
row_df

#### Note 
that ```.T``` is the transpose operation.

In [None]:
row_df.index = ['row4']
row_df.columns = ['col1','col2','col3']
row_df

In [None]:
matrix_df = pd.concat( [matrix_df, row_df]  )

In [None]:
matrix_df

For a slightly more interesting example, we revisit our country codes.

In [None]:
country_codes = {'country': ['Switzerland', 'France', 'Italy', 'UK', 'Germany'],
                 'code':[41, 33, 39, 44, 49]}

codes_df = pd.DataFrame(country_codes)
codes_df

In [None]:
codes_df['country'] == 'UK'

In [None]:
codes_df[ codes_df['country'] == 'UK' ]

In [None]:
codes_df.loc[3]
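
The comparison ```codes_df['country'] == 'UK'``` produces a column of ```True```/```False``` values, which is then used to pick rows. The following sketch makes this mask explicit, extracts a single value and combines two conditions. The names ```mask``` and ```uk_code``` are introduced for this example.

```Python
import pandas as pd

codes_df = pd.DataFrame({'country': ['Switzerland', 'France', 'Italy', 'UK', 'Germany'],
                         'code': [41, 33, 39, 44, 49]})

mask = codes_df['country'] == 'UK'            # Series of True/False values
uk_code = codes_df.loc[mask, 'code'].iloc[0]  # first (and only) match

print(uk_code)                                # 44

# Several conditions are combined with & (and) or | (or), each in parentheses
between = codes_df[(codes_df['code'] > 35) & (codes_df['code'] < 45)]
print(between)
```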

***
### Analyse input data and write out a result file

In the following, we read in a table which specifies for different red
wines a selection of their respective properties. Each row in the table 
corresponds to a different wine. We study the data set a little bit. Pandas 
is well-suited to do data exploration with methods like
```groupby``` and ```describe```.

In [None]:
import pandas as pd

wine_data = pd.read_csv('data/winequality-red.csv', sep=';')
wine_data

In [None]:
wine_data.shape

In [None]:
wine_data.head(5)

In [None]:
wine_data.tail(5)

Count the number of red wines with a particular quality with ```value_counts```.

In [None]:
quality_counts = wine_data['quality'].value_counts()
print(quality_counts)

In [None]:
import matplotlib.pyplot as plt

plt.bar(quality_counts.index, quality_counts)
plt.xlabel('Quality assessment')
plt.ylabel('Amount of different wines')
plt.show()

Let us add a new column which classifies whether a red wine is a 
__premium__ wine with a rating larger than 5.

In [None]:
wine_data['premium'] = wine_data['quality'] > 5

wine_data.head(5)

In [None]:
colours = ['green','red','red','green','red','green']

plt.bar(quality_counts.index, quality_counts, color=colours)
plt.xlabel('Quality assessment')
plt.ylabel('Amount of different wines')
plt.show()

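#### Note 
that the colour names can be abbreviated by their initial letters, e.g. 
```'g'``` instead of ```'green'``` and ```'r'``` instead of ```'red'```. Older Matplotlib versions
also accepted all initials in one string, i.e. ```color='grrgrg'```; with recent versions, pass a 
list such as ```color=list('grrgrg')``` instead. Also note that the colours are assigned in the 
order of ```quality_counts.index```, which ```value_counts``` sorts by the heights of the bars.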
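
Instead of typing the colour list by hand, it can be derived from the index of the counts, which avoids mistakes when the order of the bars changes. The following is a sketch on made-up ratings; the rule that green marks premium wines (rating larger than 5) is an assumption made for this example.

```Python
import pandas as pd

# Made-up quality ratings, only for illustration
quality = pd.Series([5, 5, 5, 5, 6, 6, 6, 7, 7, 4])
quality_counts = quality.value_counts()   # sorted by count, largest first

# Assumption for this sketch: green marks premium wines (rating > 5)
colours = ['green' if q > 5 else 'red' for q in quality_counts.index]

print(quality_counts)
print(colours)

# The list can then be passed on, e.g.
# plt.bar(quality_counts.index, quality_counts, color=colours)
```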

In [None]:
quality_grouped = wine_data.groupby('quality')
quality_grouped

#### Note 
that ```groupby('quality')``` groups all rows with the same quality
together. However, after _collecting_ the groups it is a priori not clear
how the different rows (with the same quality) are supposed to be 
combined. Pandas now allows you to choose what operation you would like to 
perform on the grouped rows. In the following, we see some examples.

For instance, we can start with the actual groups which were identified. 
A bit similar to dictionaries, the group names are accessed by ```.keys()```.

In [None]:
quality_grouped.groups.keys()

Or we can just provide the first row in the respective group with ```.first()```.

In [None]:
quality_grouped.first()

Let's display all rows in group 3 with ```.get_group(3)```.

In [None]:
quality_grouped.get_group(3)

In [None]:
quality_grouped.mean()

There are many more methods you can apply to a ```groupby``` object.
A particularly useful one is ```describe```, which provides you with some 
summary statistics.

In [None]:
stats = quality_grouped.describe()
stats['alcohol']

In [None]:
stats['sulphates']
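
The idea behind ```groupby``` (split the rows into groups, apply an operation to each group, combine the results) is easiest to follow on a table small enough to check by hand. The following sketch uses a made-up mini table, not the wine data.

```Python
import pandas as pd

# Made-up mini table, only for illustration
toy = pd.DataFrame({'quality': [5, 6, 5, 6, 7],
                    'alcohol': [9.0, 10.0, 10.0, 12.0, 13.0]})

grouped = toy.groupby('quality')

means = grouped['alcohol'].mean()                   # one mean per quality value
summary = grouped['alcohol'].agg(['min', 'max', 'count'])

print(means)
print(summary)
```

Here the two rows with quality 5 are combined to the mean 9.5, the two rows with quality 6 to 11.0, and the single row with quality 7 stays at 13.0.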

In [None]:
# index=True keeps the quality values as the first column of the file
stats['alcohol'].to_csv("data/saved_pandas_example.csv", float_format='%.3f',
                        sep=',', header=True, index=True)

#### Note 
that similar to ```fmt='%1.3f'``` for NumPy, ```float_format='%.3f'``` specifies
that the floats shall only have 3 decimals when written to the .csv file.

### Efficient data format for DataFrames with *.feather files:

The ```*.feather``` file format allows you to store Pandas DataFrames in an efficient
data format with which you can load your DataFrames in R, too! Check out [this blog post](https://blog.finxter.com/pandas-dataframe-to_feather-method/). This is what it might look like in __Python__

```Python
import pandas as pd

wine_data = pd.read_csv('data/winequality-red.csv', sep=';')

# Do some computation in Python ...

wine_data.to_feather('data/winequality-red.feather')

wine_data_feather = pd.read_feather('data/winequality-red.feather')
```

and in __R__

```R
library(arrow)

wine_data_feather <- read_feather('data/winequality-red.feather')

# ... and continue in R!
```

***
## Seaborn with Pandas

Seaborn is a statistical data visualisation library which builds upon matplotlib and 
uses Pandas data structures. It makes plotting of attractive figures really easy, in 
particular in combination with Pandas objects.

The common abbreviation for seaborn is ```sns```.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

For ggplot style plots use:

In [None]:
sns.set()

In [None]:
sns.countplot(x='quality', data=wine_data)
plt.show()

In [None]:
sns.boxplot(x='quality', y='alcohol', data=wine_data)
plt.show()

In [None]:
sns.violinplot(x='quality', y='sulphates', data=wine_data)
plt.show()

Let's consider one of the standard examples of seaborn,
the __tips__ data set.

In [None]:
tips = sns.load_dataset("tips")
tips.head(5)

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips)
plt.show()

***
## Exercise section

(1.) Create a NumPy array with entries from 4 to 9 and reshape the array to have 
shape (3,2). Make use of ```np.arange``` and ```reshape```. Let's call this matrix
```ex1```. Put your solution here:

Check your result by executing:

In [None]:
print(ex1)

(2.) Create an array with three random integers between 0 and 20. 
Make use of ```np.random.randint```. Let's call this matrix ```rand_ints```.
Put your solution here:

Check your result by executing:

In [None]:
print(rand_ints)

(3.) Multiply (element-wise) the last column of ```ex1``` with ```rand_ints``` 
and assign the result to the last column of ```ex1```. Put your solution here:

Check your result by executing:

In [None]:
print(ex1)

(4.) Append ```rand_ints``` as a column to matrix ```ex1```. Put your solution here:

Check your result by executing:

In [None]:
print(ex1)

(5.) Convert NumPy matrix ```ex1``` into a Pandas DataFrame ```ex5``` and name the columns
```A```, ```B``` and ```C```. Put your solution here:

Check your result by executing:

In [None]:
ex5