# 2 NumPy and Pandas

## The `numpy` library

In [1]:
import numpy as np

### `numpy.ndarray`

In [2]:
a = np.random.default_rng(42).random((2,3))

In [3]:
a

array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235]])

The array type that NumPy implements is the N-dimensional array, `ndarray`.

In [4]:
type(a)

numpy.ndarray

The elements of this array have the following data type:

In [5]:
a.dtype

dtype('float64')

The shape of the array is (2, 3): 2 rows and 3 columns.

In [6]:
a.shape

(2, 3)

Selecting a single element is straightforward: index the array as `a[m, n]`, where `m` is the row index and `n` is the column index, both counted from zero.

In [7]:
a[0, 1]

0.4388784397520523
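
The same indexing syntax extends to whole rows, whole columns, and negative indices. The following is a minimal sketch that rebuilds `a` with the same seed as above; the variable names `first_row`, `second_col`, and `last_elem` are illustrative and do not appear elsewhere in this chapter.

```python
import numpy as np

# Rebuild the array with the same seed and shape used above.
a = np.random.default_rng(42).random((2, 3))

first_row = a[0]        # the whole first row, shape (3,)
second_col = a[:, 1]    # the second column across all rows, shape (2,)
last_elem = a[-1, -1]   # negative indices count from the end

print(first_row.shape)        # (3,)
print(second_col.shape)       # (2,)
print(last_elem == a[1, 2])   # True
```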

### Broadcasting

NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape, as in the following example:

In [8]:
b = np.random.default_rng(51).random((2,3))

In [9]:
a

array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235]])

In [10]:
b

array([[0.9424412 , 0.30713836, 0.25490663],
       [0.22261852, 0.78620422, 0.21670804]])

In [11]:
a + b

array([[1.71639724, 0.7460168 , 1.11350455],
       [0.91998655, 0.88038157, 1.1923304 ]])

In [12]:
a - b

array([[-0.16848515,  0.13174008,  0.60369129],
       [ 0.4747495 , -0.69202687,  0.75891431]])

In [13]:
a * b

array([[0.72940806, 0.1347964 , 0.21886231],
       [0.15524704, 0.07404263, 0.21142521]])

In [14]:
a / b

array([[0.82122476, 1.42892747, 3.36828393],
       [3.13256962, 0.11978739, 4.50201262]])

NumPy’s broadcasting rule relaxes this requirement when the arrays’ shapes are compatible. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation:

In [15]:
a

array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235]])

In [16]:
a + 0.01

array([[0.78395605, 0.44887844, 0.86859792],
       [0.70736803, 0.10417735, 0.98562235]])

In [17]:
a - 0.01

array([[0.76395605, 0.42887844, 0.84859792],
       [0.68736803, 0.08417735, 0.96562235]])

In [18]:
a * 2

array([[1.5479121 , 0.87775688, 1.71719584],
       [1.39473606, 0.1883547 , 1.9512447 ]])

In [19]:
a / 2

array([[0.38697802, 0.21943922, 0.42929896],
       [0.34868401, 0.04708867, 0.48781118]])
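
Broadcasting also applies to two arrays of different shapes. NumPy compares the shapes starting from the trailing dimension, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below illustrates this with a hypothetical row vector `row` and column vector `col`, which are not part of the examples above.

```python
import numpy as np

a = np.random.default_rng(42).random((2, 3))   # shape (2, 3), as above
row = np.array([10.0, 20.0, 30.0])             # shape (3,)
col = np.array([[100.0], [200.0]])             # shape (2, 1)

# (2, 3) + (3,): `row` is added to each of the two rows.
sum_row = a + row
print(sum_row.shape)   # (2, 3)

# (2, 3) + (2, 1): `col` is added to each of the three columns.
sum_col = a + col
print(sum_col.shape)   # (2, 3)

# Shapes (2, 3) and (2,) disagree on the trailing dimension (3 vs 2),
# so NumPy refuses to broadcast them.
failed = False
try:
    a + np.array([1.0, 2.0])
except ValueError:
    failed = True
print(failed)          # True
```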

## The `pandas` library

In [20]:
import pandas as pd

### `Series`

`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*.

In [21]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [22]:
s

a   -1.232586
b   -0.220167
c    0.955787
d   -0.911245
e   -1.332573
dtype: float64

In [23]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [24]:
s['a']

-1.232586090096256

In [25]:
s['c':]

c    0.955787
d   -0.911245
e   -1.332573
dtype: float64
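
Slicing by label behaves differently from slicing by position: a label slice includes its end label, while a positional slice excludes its end position. The sketch below uses fixed values instead of random ones so the result is reproducible; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b':'d']    # label slice: includes both 'b' and 'd'
by_position = s.iloc[1:3]    # positional slice: excludes position 3

print(list(by_label.index))      # ['b', 'c', 'd']
print(list(by_position.index))   # ['b', 'c']
```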

If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [26]:
pd.Series(np.random.randn(5))

0   -0.289766
1   -0.340723
2    0.041206
3   -2.069280
4    1.463924
dtype: float64

### `DataFrame`

`DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types.

In [27]:
df = pd.read_csv('./data/WHO_first9cols.csv')

In [28]:
df.head()

| | Country | CountryID | Continent | Adolescent fertility rate (%) | Adult literacy rate (%) | Gross national income per capita (PPP international $) | Net primary school enrolment ratio female (%) | Net primary school enrolment ratio male (%) | Population (in thousands) total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1 | 1 | 151.0 | 28.0 | NaN | NaN | NaN | 26088.0 |
| 1 | Albania | 2 | 2 | 27.0 | 98.7 | 6000.0 | 93.0 | 94.0 | 3172.0 |
| 2 | Algeria | 3 | 3 | 6.0 | 69.9 | 5940.0 | 94.0 | 96.0 | 33351.0 |
| 3 | Andorra | 4 | 2 | NaN | NaN | NaN | 83.0 | 83.0 | 74.0 |
| 4 | Angola | 5 | 3 | 146.0 | 67.4 | 3890.0 | 49.0 | 51.0 | 16557.0 |


In [29]:
type(df)

pandas.core.frame.DataFrame

In [30]:
df.shape

(202, 9)

You can think of a `DataFrame` as a dict of `Series` objects: selecting a single column returns a `Series`.

In [31]:
df['Country'].head()

0    Afghanistan
1        Albania
2        Algeria
3        Andorra
4         Angola
Name: Country, dtype: object

In [32]:
type(df['Country'])

pandas.core.series.Series
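
The dict-of-`Series` view also works in the other direction: a `DataFrame` can be built from a dict whose values are `Series`. This is a minimal sketch with made-up values (they are not taken from the WHO file), and the names `names`, `population`, and `small` are illustrative.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the WHO data.
names = pd.Series(['X', 'Y', 'Z'])
population = pd.Series([10.0, 20.0, 30.0])

# Each dict key becomes a column name; each Series becomes a column.
small = pd.DataFrame({'Country': names, 'Population': population})

print(small.shape)                              # (3, 2)
print(list(small.columns))                      # ['Country', 'Population']
print(isinstance(small['Country'], pd.Series))  # True
```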

### Describing data

The `describe()` method reports summary statistics for each selected numeric column (the count of non-missing values, mean, standard deviation, minimum, quartiles, and maximum):

In [33]:
df[['Adolescent fertility rate (%)', 'Adult literacy rate (%)']].describe()

| | Adolescent fertility rate (%) | Adult literacy rate (%) |
|---|---|---|
| count | 177.0 | 131.0 |
| mean | 59.457627 | 78.871756 |
| std | 49.105286 | 20.41576 |
| min | 0.0 | 23.6 |
| 25% | 19.0 | 68.4 |
| 50% | 46.0 | 86.5 |
| 75% | 91.0 | 95.3 |
| max | 199.0 | 99.8 |


In [34]:
df.groupby('Continent')['Adult literacy rate (%)'].describe()

| Continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1 | 18.0 | 76.9 | 16.647911 | 28.0 | 74.15 | 81.9 | 87.55 | 93.3 |
| 2 | 26.0 | 97.911538 | 2.598665 | 87.4 | 97.225 | 98.75 | 99.475 | 99.8 |
| 3 | 42.0 | 61.690476 | 20.395346 | 23.6 | 48.625 | 67.3 | 80.8 | 91.8 |
| 4 | 1.0 | 91.6 | NaN | 91.6 | 91.6 | 91.6 | 91.6 | 91.6 |
| 5 | 22.0 | 87.940909 | 9.113865 | 69.1 | 81.675 | 90.3 | 94.475 | 99.8 |
| 6 | 14.0 | 87.607143 | 12.076962 | 57.3 | 89.0 | 91.45 | 92.675 | 98.9 |
| 7 | 8.0 | 69.8125 | 20.767037 | 47.5 | 49.575 | 67.3 | 90.75 | 96.3 |
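
Two details of these summaries are easier to see on a tiny example: `count` skips missing values, and the standard deviation of a group with a single observation is `NaN`. The sketch below uses a hypothetical four-row frame named `toy`; its column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in group 'a'.
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 5.0],
})

summary = toy['value'].describe()
print(summary['count'])   # 3.0, the NaN is not counted
print(summary['mean'])    # 3.0, i.e. (1 + 3 + 5) / 3

per_group = toy.groupby('group')['value'].describe()
print(per_group.loc['a', 'count'])          # 1.0
print(np.isnan(per_group.loc['a', 'std']))  # True: one observation only
print(per_group.loc['b', 'mean'])           # 4.0
```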


### Dealing with dates