# Practical: NumPy and Pandas

**14th November 2019 – 11am to 1pm**

**Christopher Ingold Building G20** 

## Intro to NumPy

NumPy is a Python library for scientific computing. It provides a high-performance multidimensional data structure called an array. NumPy is usually imported like this:

In [1]:
import numpy as np

### Arrays

A NumPy array is a tensor of values that all have the same type. The number of dimensions is the rank of the array, and the shape of an array is a tuple of integers giving the size of the array along each dimension.
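As a quick illustration of rank and shape, the following sketch builds a three-dimensional array and inspects it. The variable name `t` is chosen only for this example.

```python
import numpy as np

# A rank-3 array: 2 blocks, each with 3 rows and 4 columns
t = np.zeros((2, 3, 4))

print(t.ndim)   # number of dimensions (the rank): 3
print(t.shape)  # size along each dimension: (2, 3, 4)
print(t.size)   # total number of elements: 24
print(t.dtype)  # type shared by all elements: float64
```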

We can create a one-dimensional NumPy array (aka a vector) from a Python list:

In [8]:
a = np.array([1, 2, 3, 4])
a

array([1, 2, 3, 4])

The previous code cell has created a NumPy array of rank 1. To check that this is a NumPy array we can use the function `type`:

In [9]:
type(a)

numpy.ndarray

If we are given an array and we need to know its shape, we can use its attribute `shape`:

In [10]:
a.shape

(4,)

What shape returns is a tuple containing the length of each dimension. In this case, we have only one dimension of size 4.

To access the values of a one-dimensional array we can index it the same way we index a Python list:

In [11]:
a[0]

1

Let's now create a 2-dimensional array (aka a matrix in math):

In [12]:
a2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
a2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Now we use shape to get the length of the dimensions:

In [13]:
a2.shape

(2, 4)

This is a matrix (with rank 2) with 2 rows and 4 columns.

To access the individual values of this matrix we give one index per dimension:

In [14]:
a2[0, 0]

1

This will access the value positioned at the first row and first column. Try to access other locations until you get an error.
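The sketch below shows what that error looks like, assuming the same 2x4 matrix as above. Indexing past the last row raises an `IndexError`, while negative indices count from the end.

```python
import numpy as np

a2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(a2[1, 3])    # last row, last column: 8
print(a2[-1, -1])  # negative indices count from the end: 8

try:
    a2[2, 0]       # there is no third row
except IndexError as err:
    print("IndexError:", err)
```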

NumPy also provides some handy functions to create tensors that are commonly used when developing scientific applications. To use these functions we just need to pass a tuple that defines the size of each dimension.

For example, use zeros to create a tensor filled with zeros:

In [15]:
np.zeros((4, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

Use ones to create a tensor filled with ones:

In [16]:
np.ones((4, 4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Use random.random to create a tensor filled with random values $\in [0, 1)$ (0 is included, 1 is not):

In [17]:
np.random.random((4, 4))

array([[0.04651746, 0.43821428, 0.82770678, 0.14876231],
       [0.45203205, 0.20096141, 0.76213478, 0.49995163],
       [0.09892925, 0.00630044, 0.32641344, 0.87079664],
       [0.49676333, 0.74235836, 0.54794594, 0.66073299]])

To create a matrix containing a given value, we can use the function full and provide the value, create a tensor of ones and multiply it by the value, or create a tensor of zeros and add the value to it:

In [18]:
np.full((4, 4), 3)

array([[3, 3, 3, 3],
       [3, 3, 3, 3],
       [3, 3, 3, 3],
       [3, 3, 3, 3]])

In [19]:
np.ones((4, 4)) * 3

array([[3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.]])

In [20]:
np.zeros((4, 4)) + 3

array([[3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.]])

The two code cells above also demonstrate that in NumPy we can easily do arithmetic operations between tensors and scalars.
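One difference between the three approaches is visible in the outputs above: `full` infers an integer type from the fill value, while `ones` and `zeros` default to floats. A minimal sketch of how to check this, and how to request a float result from `full` explicitly:

```python
import numpy as np

full = np.full((4, 4), 3)          # dtype inferred from the fill value (an integer)
from_ones = np.ones((4, 4)) * 3    # np.ones defaults to float64
from_zeros = np.zeros((4, 4)) + 3  # np.zeros defaults to float64

print(full.dtype, from_ones.dtype, from_zeros.dtype)
print(np.array_equal(full, from_ones))  # True: the values are equal

# Passing dtype explicitly makes the types match as well
full_float = np.full((4, 4), 3, dtype=np.float64)
print(full_float.dtype)  # float64
```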

To create a matrix with ones on the diagonal and zeros elsewhere (aka the identity matrix), we can use the function eye like this:

In [21]:
np.eye(4)

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

### Array Selection

We can select a part of a tensor using the slicing operator of Python. In NumPy, unlike with Python lists, we can give one slice per dimension of a multi-dimensional tensor:

In [22]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

For example, to select only the 2x2 matrix on the top right, we can do:

In [23]:
a[:2, 2:]

array([[3, 4],
       [7, 8]])

If we want to select the last row:

In [24]:
a[2:]

array([[ 9, 10, 11, 12]])
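Note that the slice `a[2:]` returns a matrix with a single row. As a small sketch of the alternative, an integer index (including a negative one) returns the row as a one-dimensional array instead:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

as_matrix = a[2:]  # slice: keeps both dimensions
as_vector = a[2]   # integer index: drops the row dimension

print(as_matrix.shape)  # (1, 4)
print(as_vector.shape)  # (4,)
print(a[-1])            # [ 9 10 11 12]
```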

If we want to select the last two columns:

In [25]:
a[:,2:]

array([[ 3,  4],
       [ 7,  8],
       [11, 12]])

We can also use a list of integers to extract only the coordinates we want, for example if we want to extract only the values of the 4 corners of the matrix `a`:

In [26]:
a[[0, 0, 2, 2],[0, 3, 0, 3]]

array([ 1,  4,  9, 12])

### Arithmetic Operators with Arrays

To add a scalar to a matrix:

In [21]:
a = np.eye(4) 
a + 3

array([[4., 3., 3., 3.],
       [3., 4., 3., 3.],
       [3., 3., 4., 3.],
       [3., 3., 3., 4.]])

To subtract a scalar:

In [22]:
a - 5

array([[-4., -5., -5., -5.],
       [-5., -4., -5., -5.],
       [-5., -5., -4., -5.],
       [-5., -5., -5., -4.]])

To multiply a matrix by a scalar:

In [23]:
a * 3

array([[3., 0., 0., 0.],
       [0., 3., 0., 0.],
       [0., 0., 3., 0.],
       [0., 0., 0., 3.]])

To divide by a scalar:

In [24]:
a / 3

array([[0.33333333, 0.        , 0.        , 0.        ],
       [0.        , 0.33333333, 0.        , 0.        ],
       [0.        , 0.        , 0.33333333, 0.        ],
       [0.        , 0.        , 0.        , 0.33333333]])

To raise each element to a power:

In [25]:
(a + 1) ** 3

array([[8., 1., 1., 1.],
       [1., 8., 1., 1.],
       [1., 1., 8., 1.],
       [1., 1., 1., 8.]])

We can also perform matrix operations such as the dot product, using the `@` operator:

In [26]:
a1 = np.array([4, 3])
a2 = np.array([[1, 2],[3, 4]])

a1 @ a2

array([13, 20])
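It is easy to confuse `@` with `*`. The sketch below, using the same two arrays, contrasts the matrix product with the element-wise product; `np.dot` is the function form of the same matrix product for these inputs.

```python
import numpy as np

a1 = np.array([4, 3])
a2 = np.array([[1, 2], [3, 4]])

print(a1 @ a2)         # matrix product: [13 20]
print(np.dot(a1, a2))  # same result via the function form
print(a1 * a2)         # element-wise product, a1 is applied to each row of a2
```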

We can transpose a matrix like this:

In [27]:
a2.T

array([[1, 3],
       [2, 4]])

### Comparison Operators with Arrays

We can use the Python comparison operators with arrays. However, this will not return a Boolean value but an array of Boolean values:

In [28]:
a = np.array([1, 2, 3, 4])
a == 3

array([False, False,  True, False])

Or, greater than or equal to:

In [29]:
a >= 3

array([False, False,  True,  True])
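These Boolean arrays are mostly useful as masks. A minimal sketch, assuming the same array `a`, of how a mask can select, count, and combine conditions:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
mask = a >= 3

print(a[mask])                 # keep only the elements where the mask is True: [3 4]
print(mask.sum())              # True counts as 1, so this counts the matches: 2
print(mask.any(), mask.all())  # True False

# Conditions are combined with & (and) and | (or); the parentheses are required
print(a[(a > 1) & (a < 4)])    # [2 3]
```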

### Extended Datatypes

NumPy is implemented in C. This allows us to use C datatypes like this:

In [30]:
a = np.array([1, 2, 3, 4], dtype=np.int32)
a

array([1, 2, 3, 4], dtype=int32)

We can find details about C datatypes [here](https://docs.scipy.org/doc/numpy/user/basics.types.html).
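As a short sketch of what the datatype controls, the example below inspects the memory used per element and converts the array to another type with `astype`:

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int32)

print(a.itemsize)  # bytes per element: 4
print(a.nbytes)    # bytes for the whole array: 16

b = a.astype(np.float64)  # returns a converted copy
print(b.dtype)            # float64

print((a / 2).dtype)      # true division always yields floats: float64
```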

# Exercise 27

Compute the following sequence when n is equal to 5:

$
a_0 = 
\begin{bmatrix}
1&0\\
0&1\\
\end{bmatrix}
$

$
a_n = 
\begin{bmatrix}
1\\
2\\
\end{bmatrix} \cdot 
\begin{bmatrix}
2&1\\
\end{bmatrix} -
a_{n-1}$


The result should be:

$a_5 = \begin{bmatrix}
1&1\\
4&1\\
\end{bmatrix}$

In [31]:
n = 5

a_n = np.eye(2)
for _ in range(n):
    a_n = np.array([[1], [2]]) @ np.array([[2, 1]]) - a_n

a_n

array([[1., 1.],
       [4., 1.]])
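As an alternative sketch, the product of the column and row vectors can be written with `np.outer`. Keeping every intermediate value (the list name `history` is only for this example) also shows that applying the step twice returns the starting matrix, so the sequence alternates between two values.

```python
import numpy as np

# Outer product of [1, 2] and [2, 1]: [[2, 1], [4, 2]]
outer = np.outer([1, 2], [2, 1])

a_n = np.eye(2)
history = [a_n]
for _ in range(5):
    a_n = outer - a_n
    history.append(a_n)

print(np.array_equal(history[0], history[2]))  # True: the sequence has period 2
print(history[5])                              # same matrix as history[1]
```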

## Intro to Pandas

Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas is usually imported using the following import statement:

In [28]:
import pandas as pd

Pandas defines two data structures called DataFrame and Series. 

### Read a CSV

To read a CSV file we can use the read_csv function, which creates a DataFrame. We can pass it either a path to a CSV file stored locally or a URL to a CSV file, which pandas will download and parse automatically.
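`read_csv` also accepts a file-like object, which is convenient for trying it out without a file or a network connection. The sketch below parses a tiny CSV held in a string; its column names and values are invented for illustration.

```python
import io

import pandas as pd

# A tiny hypothetical CSV, invented for illustration
csv_text = """name,height,weight
a,1.5,10
b,2.0,20
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (2, 3): two rows, three columns
print(df.dtypes)  # the column types are inferred from the values
```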

For this tutorial we will work with a famous dataset called IRIS. This dataset describes the features of 3 different species of flowers. Each row is a flower and each column is a different measurement made on the flowers with the exception of the last column which represents the species of the flower:

In [29]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris

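| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |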


We can check the first 5 rows of the DataFrame by using head:

In [30]:
iris.head(5)

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Or tail:

In [35]:
iris.tail(5)

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


As in NumPy, we can get the size of each dimension of the DataFrame using shape:

In [33]:
iris.shape

(150, 5)

We can analyze the content of the DataFrame using describe:

In [34]:
iris.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


To get the type of each column (Series) we can use dtypes:

In [38]:
iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
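The column types can also be used to filter columns. A minimal sketch with `select_dtypes`, on a small DataFrame invented for illustration, keeps only the numeric columns:

```python
import pandas as pd

# A small hypothetical DataFrame with two numeric columns and one text column
df = pd.DataFrame({
    "length": [5.1, 4.9],
    "width": [3.5, 3.0],
    "species": ["setosa", "setosa"],
})

numeric = df.select_dtypes(include="number")
print(list(numeric.columns))  # ['length', 'width']
```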

### Select Columns (Series)

To access a column we can either use the name of the column as an attribute or use square brackets.

In [39]:
iris.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [40]:
iris["sepal_length"]

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

To get a DataFrame back we need to put the name of the column in a list:

In [41]:
iris[["sepal_length"]]

| | sepal_length |
|---|---|
| 0 | 5.1 |
| 1 | 4.9 |
| 2 | 4.7 |
| 3 | 4.6 |
| 4 | 5.0 |
| ... | ... |
| 145 | 6.7 |
| 146 | 6.3 |
| 147 | 6.5 |
| 148 | 6.2 |


We can also select two columns using a list:

In [42]:
iris[["sepal_length", "sepal_width"]]

| | sepal_length | sepal_width |
|---|---|---|
| 0 | 5.1 | 3.5 |
| 1 | 4.9 | 3.0 |
| 2 | 4.7 | 3.2 |
| 3 | 4.6 | 3.1 |
| 4 | 5.0 | 3.6 |
| ... | ... | ... |
| 145 | 6.7 | 3.0 |
| 146 | 6.3 | 2.5 |
| 147 | 6.5 | 3.0 |
| 148 | 6.2 | 3.4 |


To select the first row we need to use `iloc`, which selects rows by their integer position:

In [43]:
iris.iloc[0]

sepal_length       5.1
sepal_width        3.5
petal_length       1.4
petal_width        0.2
species         setosa
Name: 0, dtype: object
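`iloc` selects by position, while its counterpart `loc` selects by index label. In the IRIS DataFrame the labels happen to equal the positions, so the sketch below uses a small invented DataFrame with different labels to make the distinction visible.

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions
df = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa", "setosa", "setosa"]},
    index=[10, 20, 30],
)

print(df.iloc[0]["sepal_length"])  # first row by position: 5.1
print(df.loc[10, "sepal_length"])  # row labelled 10: 5.1

print(df.iloc[0:2].shape)   # position slices exclude the end: (2, 2)
print(df.loc[10:20].shape)  # label slices include the end: (2, 2)
```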

### Select Rows

To select rows based on a condition we can use comparison operators, which produce a Boolean Series that filters the rows. For example, to select all the flowers that have a sepal length greater than or equal to 5:

In [44]:
iris[iris.sepal_length >= 5.0]

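| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |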
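Several conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition must be wrapped in parentheses. The sketch below uses a four-row DataFrame invented for illustration.

```python
import pandas as pd

# A small hypothetical sample with the same column names as the IRIS dataset
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.7, 6.3],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Both conditions must hold
selected = df[(df.sepal_length >= 5.0) & (df.species == "virginica")]

# At least one condition must hold
either = df[(df.sepal_length < 5.0) | (df.species == "virginica")]

# Negate a condition
excluded = df[~(df.species == "setosa")]

print(selected.index.tolist())  # [2, 3]
print(either.index.tolist())    # [1, 2, 3]
print(excluded.index.tolist())  # [2, 3]
```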


### Changing Values

We can change the value of a column similarly to the way we access the DataFrame.

For example, to change all the values of a column:

In [45]:
iris.sepal_length = 0
iris

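| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 0 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 0 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 0 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 0 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 0 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 0 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 0 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 0 | 3.4 | 5.4 | 2.3 | virginica |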


To change the values of a row:

In [46]:
iris.iloc[0] = [0, 0, 0, 0, "ok"]
iris

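| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 0 | 0.0 | 0.0 | 0.0 | ok |
| 1 | 0 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 0 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 0 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 0 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 0 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 0 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 0 | 3.4 | 5.4 | 2.3 | virginica |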
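To change only some of the values, a condition and a column name can be passed to `loc` together. This is a minimal sketch on a three-row DataFrame invented for illustration:

```python
import pandas as pd

# A small hypothetical sample
df = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 2.5],
    "species": ["setosa", "setosa", "virginica"],
})

# Change one column, only in the rows that match the condition
df.loc[df.species == "setosa", "sepal_width"] = 0.0

# Change a single cell, addressed by row label and column name
df.loc[0, "species"] = "ok"

print(df)
```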


### Statistical Functions

We can use methods such as mean, std, mode, max, and min to compute the basic statistics we need:

In [47]:
iris.sepal_width.mean()

3.0340000000000003

In [48]:
iris.sepal_width.mode()

0    3.0
dtype: float64

In [49]:
iris.sepal_width.std()

0.5008489437240835
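Several statistics can be computed in one call with `agg`, and `groupby` computes a statistic separately for each group. The sketch below uses a small DataFrame invented for illustration.

```python
import pandas as pd

# A small hypothetical sample
df = pd.DataFrame({
    "sepal_width": [3.0, 3.0, 3.5, 2.5],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Several statistics of one column at once
stats = df.sepal_width.agg(["min", "max", "median"])
print(stats)

# One mean per species
per_species = df.groupby("species").sepal_width.mean()
print(per_species)
```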

### Plotting with Pandas

Pandas uses matplotlib to plot series. For example, we can plot a column using the plot method:

In [35]:
%matplotlib notebook

iris.sepal_width.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x2a0eb7b0e08>

We can also plot the histogram of the column like this:

In [51]:
iris.sepal_width.hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1220754e0>

# Exercise 28

Reload the IRIS dataset and compute the standard error of each numerical column per flower species, and store them into the DataFrame.

The standard error of the mean is computed like this:

$$SE = \sqrt{\frac{\sigma^2}{n}}$$

where $\sigma^2$ is the variance and $n$ is the sample size. For a normal distribution, the 95% confidence interval of the mean follows from it:

$$\mu \pm 1.96 \sqrt{\frac{\sigma^2}{n}}$$

where $\mu$ is the mean and 1.96 is the approximate value of the 97.5 percentile point of the normal distribution.

In [39]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# Collect the names of the numerical (float) columns
numerical_columns = []
for column in iris.columns:
    if iris[column].dtype == "float64":
        numerical_columns.append(column)

# Create the new columns as floats so they can hold the standard errors
for column in numerical_columns:
    iris[column + "_ste"] = 0.0

for species in iris.species.unique():
    species_iris = iris[iris.species == species]
    for column in numerical_columns:
        var = species_iris[column].var()
        n = species_iris[column].count()
        # Standard error: square root of the variance over the sample size
        ste = (var / n) ** (1/2)
        iris.loc[iris.species == species, column + "_ste"] = ste

iris

| | sepal_length | sepal_width | petal_length | petal_width | species | sepal_length_ste | sepal_width_ste | petal_length_ste | petal_width_ste |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.049850 | 0.053608 | 0.024560 | 0.014904 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.049850 | 0.053608 | 0.024560 | 0.014904 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.049850 | 0.053608 | 0.024560 | 0.014904 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.049850 | 0.053608 | 0.024560 | 0.014904 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.049850 | 0.053608 | 0.024560 | 0.014904 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 0.089927 | 0.045608 | 0.078050 | 0.038841 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 0.089927 | 0.045608 | 0.078050 | 0.038841 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 0.089927 | 0.045608 | 0.078050 | 0.038841 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 0.089927 | 0.045608 | 0.078050 | 0.038841 |
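As an alternative sketch, pandas can compute a per-group standard error without explicit loops. It assumes the standard error of the mean is defined as $\sqrt{s^2/n}$ with the sample variance $s^2$, which is what the `sem` method of a grouped column computes by default. The DataFrame and its values are invented for illustration.

```python
import pandas as pd

# A small hypothetical sample with two groups
df = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "petal_length": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

grouped = df.groupby("species")["petal_length"]

# Built-in standard error of the mean, one value per species
sem = grouped.sem()

# The same quantity written out: sqrt(sample variance / sample size)
manual = (grouped.var() / grouped.count()) ** 0.5

# Store the value of each species on every row of that species
df["petal_length_ste"] = df["species"].map(sem)

print(sem)
print(df)
```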


# Exercise 29

Imagine you are given a NumPy matrix and you want to transform this matrix into a pandas DataFrame, how can you do this? (Use the Internet to answer this question).

What about when you have a pandas DataFrame and you need a NumPy matrix?

In [53]:
import numpy as np

matrix = np.random.random((10, 10))

pd_matrix = pd.DataFrame(data=matrix)

pd_matrix

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.245712 | 0.037203 | 0.580942 | 0.868591 | 0.450839 | 0.019332 | 0.474156 | 0.522755 | 0.310662 | 0.497705 |
| 1 | 0.45862 | 0.673569 | 0.702974 | 0.574135 | 0.41359 | 0.535259 | 0.749459 | 0.312665 | 0.668747 | 0.88095 |
| 2 | 0.956952 | 0.046424 | 0.246616 | 0.90461 | 0.093465 | 0.372178 | 0.705437 | 0.836618 | 0.254034 | 0.171453 |
| 3 | 0.283044 | 0.216251 | 0.093299 | 0.720375 | 0.713038 | 0.380102 | 0.253251 | 0.694433 | 0.542765 | 0.39539 |
| 4 | 0.3361 | 0.091247 | 0.403522 | 0.290011 | 0.474667 | 0.781729 | 0.276244 | 0.800076 | 0.02596 | 0.533593 |
| 5 | 0.757662 | 0.86717 | 0.419666 | 0.4469 | 0.162172 | 0.698573 | 0.055458 | 0.425394 | 0.95727 | 0.072171 |
| 6 | 0.039678 | 0.102826 | 0.327672 | 0.513742 | 0.227006 | 0.769327 | 0.521846 | 0.446913 | 0.468475 | 0.05466 |
| 7 | 0.505711 | 0.146879 | 0.132481 | 0.087413 | 0.468893 | 0.841021 | 0.008547 | 0.766 | 0.520554 | 0.153011 |
| 8 | 0.188063 | 0.73068 | 0.820042 | 0.001094 | 0.734928 | 0.525328 | 0.414754 | 0.895966 | 0.134007 | 0.707739 |
| 9 | 0.197008 | 0.470113 | 0.459806 | 0.784453 | 0.798315 | 0.27092 | 0.777536 | 0.864515 | 0.034132 | 0.478453 |
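The DataFrame constructor also accepts row and column labels. As a sketch of a full round trip, the example below labels a small matrix and converts it back with `to_numpy`, which the pandas documentation recommends over the `values` attribute. The labels are invented for illustration.

```python
import numpy as np
import pandas as pd

matrix = np.arange(6).reshape(2, 3)

# NumPy -> pandas, with hypothetical row and column labels
df = pd.DataFrame(matrix, columns=["x", "y", "z"], index=["r0", "r1"])

# pandas -> NumPy: the labels are dropped, the values are kept
back = df.to_numpy()

print(type(back))                    # <class 'numpy.ndarray'>
print(np.array_equal(matrix, back))  # True
```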


In [54]:
matrix = pd_matrix.values

matrix

array([[0.245712  , 0.03720307, 0.58094173, 0.86859062, 0.45083903,
        0.01933194, 0.47415631, 0.52275535, 0.3106624 , 0.49770454],
       [0.45862033, 0.67356859, 0.70297385, 0.57413517, 0.4135902 ,
        0.53525875, 0.7494592 , 0.3126648 , 0.66874673, 0.88094981],
       [0.95695184, 0.04642446, 0.24661616, 0.90460992, 0.09346467,
        0.37217777, 0.70543722, 0.83661843, 0.25403395, 0.17145323],
       [0.28304352, 0.21625119, 0.0932993 , 0.72037532, 0.71303837,
        0.38010158, 0.25325118, 0.69443276, 0.54276483, 0.39538971],
       [0.33610045, 0.09124732, 0.4035224 , 0.2900107 , 0.47466655,
        0.78172935, 0.27624375, 0.80007631, 0.02596011, 0.53359339],
       [0.75766246, 0.86717023, 0.41966579, 0.44690011, 0.16217153,
        0.69857319, 0.0554579 , 0.42539397, 0.95727034, 0.07217068],
       [0.03967793, 0.1028258 , 0.32767173, 0.51374152, 0.22700639,
        0.76932675, 0.52184564, 0.44691333, 0.46847474, 0.05465985],
       [0.50571125, 0.14687948, 0.1324809