# NumPy and pandas
## General
### NumPy
NumPy is used for numerical computations on arrays and matrices, such as the mean, median, percentiles and linear algebra operations. Install it with pip: `pip3 install numpy`.

### Pandas
Pandas is used for handling tabular datasets that usually combine columns of different data types (integer, float, nominal, etc.). Pandas requires NumPy. To install: `pip3 install pandas`.

## NumPy examples

### The basics

In [4]:
# zeros and ones. array shape
import numpy as np
a = np.zeros((2, 4))
b = np.ones((2, 4))
print(f"a:\n{a}")
print(f"b:\n{b}")
print(f"a+b:\n{a+b}")
print(f"a-2b:\n{a-2*b}")
print(f"shape:\n{a.shape}")

a:
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]]
b:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
a+b:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]
a-2b:
[[-2. -2. -2. -2.]
 [-2. -2. -2. -2.]]
shape:
(2, 4)


In [1]:
# creating arrays from lists and array types
import numpy as np
a = np.array([1, 2, 5])
b = np.array([2.0, 10, -1])
print(f"a+b{a + b}")
print(a.dtype)
print(b.dtype)
print((a+b).dtype)

a+b[ 3. 12.  4.]
int64
float64
float64


In [6]:
# numpy.arange. basic operations
import numpy as np
a = np.arange(0, 20, 5)
b = np.arange(0, 20, 5) - 10
print(f"a:{a}")
print(f"a-10:{a-10}")
print(f"a^2:{a ** 2}")
print(f"a-b:{a-b}")
print(f"cos(b * pi / 20):{np.cos(b * np.pi / 20.0)}")

a:[ 0  5 10 15]
a-10:[-10  -5   0   5]
a^2:[  0  25 100 225]
a-b:[10 10 10 10]
cos(b * pi / 20):[6.12323400e-17 7.07106781e-01 1.00000000e+00 7.07106781e-01]


In [8]:
# element-wise product vs. matrix product
import numpy as np
A = np.array([[0, 2], [1, 1]])
B = np.array([[-1, 1], [1, 1]])
print(f"A .* B =\n {A * B}")     # element-wise
print(f"A * B =\n {A.dot(B)}")  # matrix product

A .* B =
 [[0 2]
 [1 1]]
A * B =
 [[2 2]
 [0 2]]
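
As a small aside, the matrix product can also be written with the `@` operator (available since Python 3.5), which NumPy maps to `np.matmul`. The sketch below reuses the same `A` and `B` and checks that both spellings agree for 2D arrays; the variable names are illustrative.

```python
import numpy as np

A = np.array([[0, 2], [1, 1]])
B = np.array([[-1, 1], [1, 1]])

elementwise = A * B      # element-wise product
matmul_op = A @ B        # matrix product via the @ operator
matmul_dot = A.dot(B)    # matrix product via the dot method

print(matmul_op)
print(np.array_equal(matmul_op, matmul_dot))  # both spellings agree for 2D arrays
```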


In [20]:
# reshaping arrays
import numpy as np
x = np.arange(10)
print(x)
print(x.reshape(2, 5))

[0 1 2 3 4 5 6 7 8 9]
[[0 1 2 3 4]
 [5 6 7 8 9]]
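
A related convenience, sketched here as an aside: passing `-1` for one dimension lets NumPy infer it from the array size, while a target shape that does not match the number of elements raises `ValueError`.

```python
import numpy as np

x = np.arange(10)
y = x.reshape(-1, 5)   # NumPy infers the first dimension (2) from the size
print(y.shape)

try:
    x.reshape(3, 4)    # 10 elements cannot fill a 3x4 array
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(f"reshape failed: {err}")
```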


### NumPy statistics
The following example reads the temperatures recorded in NYC on the same day (9th April) over the last 150 years. The CSV file contains 3 columns, namely date, max day temp and min day temp. The year of each date is stored in a list and the two temperatures in NumPy arrays. The code then extracts some basic statistics: mean value, median value, max, min, and the 10th and 90th percentiles.

In [2]:
import csv
import numpy as np

years = []
max_t, min_t = np.array([]), np.array([])
# read the csv file of New York min and max temperatures of 9th April for the last 150 years:
with open('data_ny_temperatures.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for ir, row in enumerate(reader):
        if ir>0:
            max_t = np.append(max_t, float(row[1]))
            min_t = np.append(min_t, float(row[2]))
            years.append(int(row[0].split('-')[0]))

print(f"Average max-day temperature is {max_t.mean():.1f}")
print(f"Median max-day temperature is {np.median(max_t):.1f}")
print(f"Average min-day temperature is {min_t.mean():.1f}")
print(f"Median min-day temperature is {np.median(min_t):.1f}")

print(f"The maximum max-day temp was {np.max(max_t):.1f} in {years[np.argmax(max_t)]}")
print(f"The maximum min-day temp was {np.max(min_t):.1f} in {years[np.argmax(min_t)]}")
print(f"The minimum max-day temp was {np.min(max_t):.1f} in {years[np.argmin(max_t)]}")
print(f"The minimum min-day temp was {np.min(min_t):.1f} in {years[np.argmin(min_t)]}")

# years whose max-day temp is below the 10th / above the 90th percentile
max_t_p_10 = np.percentile(max_t, 10)
max_t_p_90 = np.percentile(max_t, 90)
years_max_10 = [y for i, y in enumerate(years) if max_t[i] < max_t_p_10]
print(years_max_10)
years_max_90 = [y for i, y in enumerate(years) if max_t[i] > max_t_p_90]
print(years_max_90)
# years whose min-day temp is below the 10th / above the 90th percentile
min_t_p_10 = np.percentile(min_t, 10)
min_t_p_90 = np.percentile(min_t, 90)
years_min_10 = [y for i, y in enumerate(years) if min_t[i] < min_t_p_10]
print(years_min_10)
years_min_90 = [y for i, y in enumerate(years) if min_t[i] > min_t_p_90]
print(years_min_90)

Average max-day temperature is 55.3
Median max-day temperature is 54.0
Average min-day temperature is 39.6
Median min-day temperature is 39.0
The maximum max-day temp was 86.0 in 1991
The maximum min-day temp was 68.0 in 1991
The minimum max-day temp was 39.0 in 1885
The minimum min-day temp was 25.0 in 1977
[1874, 1884, 1885, 1900, 1907, 1911, 1917, 1935, 1974, 1979, 1982, 1996, 1997, 2003]
[1871, 1879, 1921, 1929, 1934, 1945, 1959, 1968, 1970, 1981, 1991, 2001, 2002, 2013]
[1876, 1880, 1885, 1888, 1891, 1900, 1917, 1920, 1950, 1958, 1972, 1977, 1997, 2000]
[1871, 1895, 1915, 1921, 1922, 1929, 1959, 1968, 1970, 1980, 1981, 1991, 2002, 2012, 2013]
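
The year lists above are built with list comprehensions. A vectorized alternative is boolean masking, sketched below on a few made-up values (the `years` and `max_t` arrays here are hypothetical and are not taken from the CSV file).

```python
import numpy as np

# hypothetical data, for illustration only
years = np.array([2000, 2001, 2002, 2003, 2004])
max_t = np.array([50.0, 62.0, 41.0, 70.0, 55.0])

p_10 = np.percentile(max_t, 10)
p_90 = np.percentile(max_t, 90)

# a boolean array used as an index keeps only the positions that are True
cold_years = years[max_t < p_10]
warm_years = years[max_t > p_90]
print(cold_years, warm_years)
```

This requires `years` to be a NumPy array rather than a list, since plain lists do not support boolean indexing.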


A note on speed: if you need to append a large number of elements to a numpy array, it is much faster to append them to a list and then convert the list to a numpy array (instead of calling numpy.append() repeatedly). A list comprehension is faster still.

In [39]:
import numpy as np
import time

t1 = time.time()
a = np.array([])
for i in range(1, 10000):
    a = np.append(a, i)
t2 = time.time()
print(f"numpy.append(): {1000 * (t2 - t1):.2f} msecs")

t1 = time.time()
a = []
for i in range(1, 10000):
    a.append(i)
a = np.array(a)
t2 = time.time()
print(f"list append and numpy array conversion: {1000 * (t2 - t1):.2f} msecs")

t1 = time.time()
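# note: this builds 999 elements, while the two loops above build 9,999 each,
# so its timing is not directly comparable with theirs
a = [i for i in range(1, 1000)]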
a = np.array(a)
t2 = time.time()
print(f"list comprehension and numpy array conversion: {1000 * (t2 - t1):.2f} msecs")

numpy.append(): 87.97 msecs
list append and numpy array conversion: 1.96 msecs
list comprehension and numpy array conversion: 0.22 msecs
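
`np.append` returns a new array on every call, which is why the first loop is slow. When the final size is known in advance, another option (not part of the timings above) is to avoid appending altogether: either preallocate the array and fill it in place, or generate the values directly. A minimal sketch:

```python
import numpy as np

n = 10000

# preallocate once, then fill in place (no new array per element)
a = np.empty(n - 1)
for i in range(1, n):
    a[i - 1] = i

# for a simple range, let NumPy generate the values directly
b = np.arange(1, n, dtype=float)

print(np.array_equal(a, b))
```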


Speaking of statistics, two of the most important quantities used to describe a random variable (whatever quantity it measures) are the mean and the standard deviation. We have already seen the mean in some of the examples above. The standard deviation measures how spread out the values of the variable are around their mean. Below, we show how to compute the mean and std of a sequence and how to standardize its values so that they have a mean of 0 and a standard deviation of 1. This is a very common step in machine learning and data science, applied before training models and before predicting. An alternative is min-max normalization.

In [39]:
import numpy as np
import numpy.random
m, s, n_samples = 10, 5, 1000
x = numpy.random.normal(m, s, n_samples)
m_est = x.mean()
s_est = x.std()

print(f"mean is {m_est:.3f} and std is {s_est:.3f}")
# z = (x - m) / s
x_norm = (x - m_est) / s_est
print(f"after standardization mean is {x_norm.mean():.3f} and std is {x_norm.std():.3f}")

mean is 10.266 and std is 4.983
after standardization mean is 0.000 and std is 1.000
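
For comparison, min-max normalization can be sketched in the same way. This assumes the common definition `(x - min) / (max - min)`, which maps the values to the range [0, 1]; the seeded generator and variable names are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, for reproducibility
x = rng.normal(10, 5, 1000)

# assumes x is not constant, otherwise the denominator would be zero
x_minmax = (x - x.min()) / (x.max() - x.min())
print(f"min is {x_minmax.min():.3f} and max is {x_minmax.max():.3f}")
```

Unlike standardization, this rescaling is driven entirely by the two extreme values, so a single outlier can compress the rest of the data into a narrow range.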


### NumPy slicing and row/column operations

In [8]:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])
print("x:")
print(x)
print("\nx[1:, :-1]:")
print(x[1:, :-1])

x:
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]

x[1:, :-1]:
[[ 4  5]
 [ 7  8]
 [10 11]]


In [3]:
# global and row-wise or column-wise calculations
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])
print(f"global mean {x.mean()}")
print(f"global min {x.min()}")
print(f"global max {x.max()}")
print(f"column-wise mean {x.mean(axis=0)}")
print(f"row-wise mean {x.mean(axis=1)}")

global mean 6.5
global min 1
global max 12
column-wise mean [5.5 6.5 7.5]
row-wise mean [ 2.  5.  8. 11.]


### Broadcasting
Broadcasting in numpy is a very powerful mechanism that allows numpy operators to work on arrays of different shapes.

We saw previously that element-wise operations are possible in numpy when arrays have the same shape. Operations on arrays with different shapes are also possible, thanks to broadcasting. Two arrays can be broadcast together when, comparing their shapes dimension by dimension starting from the last one, each pair of sizes is either equal or one of them is 1 (a missing leading dimension is treated as 1). Below are some broadcasting examples:

In [16]:
# broadcasting examples
import numpy as np
# example 1:
x = np.array([[1, 2], [3, 4]])
print(x + 2)  # scalar and 2D array broadcasting

# example 2:
x = np.array([[1,2,3], [4,5,6], [7, 8, 9], [10, 11, 12]])
y = np.array([1, 2, 3])  # shape (3,), broadcast as a 1x3 row
print(f"add a {x.shape[0]}x{x.shape[1]} with a 1x{y.shape[0]} numpy array:")
print(x + y)

# example 3:
y = np.array([1, 2, 3, 4]).reshape(4,1)
print(f"add a {x.shape[0]}x{x.shape[1]} with a {y.shape[0]}x{y.shape[1]} numpy array:")
print(x + y)

[[3 4]
 [5 6]]
add a 4x3 with a 1x3 numpy array:
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]
 [11 13 15]]
add a 4x3 with a 4x1 numpy array:
[[ 2  3  4]
 [ 6  7  8]
 [10 11 12]
 [14 15 16]]
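
Whether two shapes are compatible can be checked without doing any arithmetic. The sketch below uses `np.broadcast_shapes` (assuming NumPy 1.20 or later) to compute the resulting shape for the two cases above, and shows what happens when the shapes do not match.

```python
import numpy as np

shape_row = np.broadcast_shapes((4, 3), (3,))    # 1D row against a 4x3 array
shape_col = np.broadcast_shapes((4, 3), (4, 1))  # 4x1 column against a 4x3 array
print(shape_row, shape_col)

try:
    # trailing sizes are 3 and 4: neither equal nor 1, so this fails
    np.broadcast_shapes((4, 3), (4,))
    incompatible = False
except ValueError as err:
    incompatible = True
    print(f"not broadcastable: {err}")
```

The last case is why a length-4 vector has to be reshaped to `(4, 1)` before it can be added to the rows of a 4x3 array.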


## Pandas
### Pandas data structures
There are two basic types in pandas: *Series* and *DataFrames*.
A Series is a 1D labeled array that can hold any data type (integers, strings, floats, etc.). To define a Series we need its data and its index. The index must have the same length as the data. If no index is given, the default is \[0, ..., len(data) - 1\].

#### Series

In [7]:
# series definition
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10), index=[f'index{i}' for i in range(10)])
print("series:"); print(s)
print("s.index"); print(s.index)

series:
index0    0.319514
index1   -0.305171
index2    0.606106
index3    1.519641
index4   -1.117003
index5    0.283825
index6   -0.465226
index7   -0.107709
index8   -2.135473
index9    1.015661
dtype: float64
s.index
Index(['index0', 'index1', 'index2', 'index3', 'index4', 'index5', 'index6',
       'index7', 'index8', 'index9'],
      dtype='object')
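
To illustrate the two remarks about the index made at the start of this section: when no index is passed, pandas creates a default integer index, and an index whose length differs from the data is rejected. A short sketch (the variable names and values are illustrative):

```python
import pandas as pd

s_default = pd.Series([2.5, 0.1, -1.0])
print(s_default.index)   # default integer index 0, 1, 2

try:
    pd.Series([2.5, 0.1, -1.0], index=['x', 'y'])   # 3 values, 2 labels
    rejected = False
except ValueError as err:
    rejected = True
    print(f"invalid index: {err}")
```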


In [22]:
# one can also initialize series from dict:
s = pd.Series({'a': 2.1, 'c': 1.9, 'b': 1, 'd': -1})
print("series:"); print(s)
print("s.index"); print(s.index)

series:
a    2.1
c    1.9
b    1.0
d   -1.0
dtype: float64
s.index
Index(['a', 'c', 'b', 'd'], dtype='object')


In [23]:
# indexing in series can be done by position (with .iloc) and by index label
print(s.iloc[1], s['c'])

1.9 1.9


In [24]:
# also Series shares functions from numpy arrays:
s.mean(), s.median()

(1.0, 1.45)

In [25]:
# ... and more functions:
np.cos(s)

a   -0.504846
c   -0.323290
b    0.540302
d    0.540302
dtype: float64

In [28]:
# slicing similar to numpy arrays:
s[:-2]

a    2.1
c    1.9
dtype: float64

In [27]:
s[s>0.5]

a    2.1
c    1.9
b    1.0
dtype: float64

In [31]:
# BUT, operations are not the same as in numpy. E.g. + aligns on the union of the indices involved
# NaN is assigned as the default value for indices that are not in both series
a = pd.Series({'a': 2.1, 'b': 1, 'c': -1})
b = pd.Series({'a': 1, 'd': 1, 'g': -1, 'c': -1})
a + b

a    3.1
b    NaN
c   -2.0
d    NaN
g    NaN
dtype: float64
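
If NaN is not the desired result for unmatched labels, one option is the `Series.add` method with `fill_value`, which substitutes that value for a label missing from one of the two series before adding. A sketch using the same two series (the name `total` is illustrative):

```python
import pandas as pd

a = pd.Series({'a': 2.1, 'b': 1, 'c': -1})
b = pd.Series({'a': 1, 'd': 1, 'g': -1, 'c': -1})

# labels present in only one series are paired with 0 instead of NaN
total = a.add(b, fill_value=0)
print(total)
```

Whether 0 is a sensible stand-in depends on the data; for a sum it means "treat the missing entry as contributing nothing".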

#### DataFrame
When your data is tabular, with a row index and a column index, the go-to choice is pandas.DataFrame. A DataFrame is a 2D data structure with columns of potentially different types. Conceptually, a DataFrame can be thought of as a data table such as one stored in a spreadsheet, a CSV file, a JSON file or a database.
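
As a first taste, here is a minimal sketch of building a small DataFrame from a dict of columns. The column names, row labels and values are all made up for illustration.

```python
import pandas as pd

# hypothetical values, for illustration only
df = pd.DataFrame(
    {
        "city": ["Athens", "Paris", "Oslo"],   # strings
        "population_m": [3.2, 2.1, 0.7],       # floats
        "is_capital": [True, True, True],      # booleans
    },
    index=["r0", "r1", "r2"],                  # row index
)
print(df.dtypes)                 # one dtype per column
print(df.loc["r1", "city"])      # select by row label and column label
```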