# Week 2: Data Representations

##  - Hands-on Practice - 


_Table of contents_

* $n$-dimensional arrays
* Pandas: Series
* Pandas: DataFrame

## NumPy Arrays

NumPy is widely used in Python for working with arrays. 

It is used in numerous data analytics and machine learning applications, either explicitly or implicitly.

Also, other popular data science libraries, e.g., Pandas or SciPy, are built upon NumPy.

NumPy provides a high-performance, $n$-dimensional array type, `ndarray`.

In [1]:
import numpy as np  # 'np' is a common alias for NumPy

# Create (declare and define its size) a new array using items from 
# another iterable object, e.g. a list
data = np.array([9, 7, 5, 4, 1, 3, 2, 8, 0, 6])
type(data)

numpy.ndarray

In [2]:
data

array([9, 7, 5, 4, 1, 3, 2, 8, 0, 6])

Let's declare and define a 2-dimensional array.

In [3]:
data = np.array([[9, 7, 5, 4, 1], [3, 2, 8, 0, 6]])
data

array([[9, 7, 5, 4, 1],
       [3, 2, 8, 0, 6]])

__Food-for-Thought: Question.__ What happens if we pass as an argument a list with an irregular structure?

In [4]:
# Try it out
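
One way to explore this question is the sketch below. It assumes a hypothetical ragged list, `ragged`, which is not part of the original notebook. The outcome depends on the NumPy version: recent releases raise a `ValueError`, while older ones may build a one-dimensional array of Python lists (possibly with a warning). Passing `dtype=object` explicitly asks for that one-dimensional array of lists.

```python
import numpy as np

# A hypothetical irregular (ragged) nested list: rows of different lengths
ragged = [[1, 2, 3], [4, 5]]

try:
    arr = np.array(ragged)
    print("Created array of dtype", arr.dtype, "and shape", arr.shape)
except ValueError as err:
    # Recent NumPy versions refuse to guess a shape for ragged input
    print("ValueError:", err)

# Explicitly asking for Python objects gives a 1-D array holding two lists
obj = np.array(ragged, dtype=object)
print(obj.shape, obj.dtype)
```

Either way, the result is not a regular 2-dimensional numeric array, so the fast element-wise operations shown later do not apply to it.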

### Basic attributes

The `ndarray` object has a number of attributes to use for handling the structure of data.

In [4]:
def info(data):
    return "Array of type {}, shape {}, with {} dimension{} and {} element{}".format(
        data.dtype,
        data.shape, 
        data.ndim,
        "s" if data.ndim > 1 else '',
        data.size,
        "s" if data.size > 1 else '',
    )


print(info(data))

Array of type int32, shape (2, 5), with 2 dimensions and 10 elements


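__Discussion.__ Array data types: what is `int64`? It is a 64-bit signed integer. The default integer type depends on the platform and the NumPy version (it is `int32` in most outputs shown here). See [here](https://numpy.org/devdocs/user/basics.types.html) for details. Remember, NumPy exhibits high performance when handling data because its core is implemented in C, a low-level programming language.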
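
As a small illustration of the discussion point, the sketch below inspects a few fixed-width integer types. The arrays `small` and `wide` are made up for this example. Each type stores every element in a fixed number of bytes, which determines the range of values it can hold.

```python
import numpy as np

# Size in bytes and value range of a few fixed-width integer types
for dtype in (np.int8, np.int32, np.int64):
    limits = np.iinfo(dtype)
    print(np.dtype(dtype).name, np.dtype(dtype).itemsize, limits.min, limits.max)

# An 8-bit array can be widened to 64 bits with astype()
small = np.array([100, 27], dtype=np.int8)
wide = small.astype(np.int64)
print(wide.dtype, wide.sum())
```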

### Populating arrays, part II

There are a number of functions to fill arrays with specific values.

In [5]:
# Fill array with zeros
data = np.zeros(10)
print(info(data))

Array of type float64, shape (10,), with 1 dimension and 10 elements


In [6]:
data = np.zeros((2, 5), dtype=np.int8)
print(info(data))

Array of type int8, shape (2, 5), with 2 dimensions and 10 elements


In [7]:
# Fill array with ones
data = np.ones(10)
print(info(data))

Array of type float64, shape (10,), with 1 dimension and 10 elements


In [8]:
# Fill array with a specific constant (of a given type)
data = np.full((2, 5), 5)
print(info(data))
print(data)

Array of type int32, shape (2, 5), with 2 dimensions and 10 elements
[[5 5 5 5 5]
 [5 5 5 5 5]]


We can create arrays using special ___range___ functions too, namely `arange` (typically for integers) and `linspace` (for evenly spaced floats).

In [12]:
# The arange() function has a similar signature 
# to the built-in range() function
# i.e. arange(from, to(excluding), increment by)
data = np.arange(0, 10, 2)
print(data)
print(info(data)) 

[0 2 4 6 8]
Array of type int32, shape (5,), with 1 dimension and 5 elements


In [18]:
# The linspace() function produces evenly spaced
# floating-point numbers. 
# 
# Note that the end value is included, i.e.
# linspace(from, to(inclusive), number of elements to be created)
data = np.linspace(0.0, 12.0, num=7)
print(data)
print(info(data))

[ 0.  2.  4.  6.  8. 10. 12.]
Array of type float64, shape (7,), with 1 dimension and 7 elements


### Reshaping arrays

The shape of an array is a tuple. Assigning a new tuple to it changes the dimensions of the array in place, provided the total number of elements stays the same.

In [20]:
data = np.array([[1, 2], [3, 4], [5, 6]])
print(data)
print(info(data))
print()

# Reshape array
data.shape = (6, 1)
print(data)
print(info(data))

[[1 2]
 [3 4]
 [5 6]]
Array of type int32, shape (3, 2), with 2 dimensions and 6 elements

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]
Array of type int32, shape (6, 1), with 2 dimensions and 6 elements


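The `reshape` function returns a new array object with the requested shape and leaves the shape of the original unchanged. Where possible, the result is a view that shares the original's data rather than a copy.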

In [21]:
y = np.reshape(data, 6)
print(y)
print(info(y))
print(data)

[1 2 3 4 5 6]
Array of type int32, shape (6,), with 1 dimension and 6 elements
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]
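
The following sketch checks whether the array returned by `reshape` shares data with the original. It rebuilds a small array like the one used here and uses `np.shares_memory` for the check; the names `y` and `z` are chosen for illustration.

```python
import numpy as np

data = np.arange(1, 7).reshape(6, 1)

# For a contiguous array, reshape returns a view on the same data
y = np.reshape(data, 6)
y[0] = 100
print(data[0, 0])                  # The change is visible through the original
print(np.shares_memory(data, y))

# Call copy() when an independent array is needed
z = np.reshape(data, 6).copy()
z[1] = -1
print(data[1, 0])                  # Unchanged
```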


### Array concatenation

Two or more arrays can be concatenated with the `concatenate` function. See [API reference](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) for similar functions.

In [22]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(np.concatenate((a, b), axis=0), "\n")
print(np.concatenate((a, b.T), axis=1))

[[1 2]
 [3 4]
 [5 6]] 

[[1 2 5]
 [3 4 6]]


### Identity and eye

An _identity array_ is a square matrix where the diagonal elements have value $1$ and all other elements are $0$.

In [23]:
I = np.identity(4)  # Create a 4x4 array
print(I)

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


The `eye` function constructs an $N \times M$ array with $1$s on the diagonal and $0$s elsewhere. An integer $k$ shifts the diagonal up ($k > 0$) or down ($k < 0$). For example, the following comparison is true for every element:

```python
np.eye(3, 3, k=0) == np.identity(3)
```

In [24]:
print(np.eye(4, 4, k=1))

[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]]


__Exercise.__ A [one-hot encoding](https://www.statology.org/one-hot-encoding-in-python/) is a vector with a $1$ at a particular index and $0$s elsewhere. It is typically used to represent categorical data. For example, if there are three categories, say $A$, $B$ and $C$, then $A$ is encoded as $[1, 0, 0]$, $B$ as $[0, 1, 0]$ and $C$ as $[0, 0, 1]$.

In [27]:
# Try it out
import pandas as pd
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})
print(df)

  team  points
0    A      25
1    A      12
2    B      15
3    B      14
4    B      19
5    B      23
6    C      25
7    C      29


In [28]:
from sklearn.preprocessing import OneHotEncoder

In [30]:
#creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

In [31]:
#perform one-hot encoding on 'team' column 
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

In [32]:
#merge one-hot encoded columns back with original DataFrame
final_df = df.join(encoder_df)

In [33]:
#view final df
print(final_df)

  team  points    0    1    2
0    A      25  1.0  0.0  0.0
1    A      12  1.0  0.0  0.0
2    B      15  0.0  1.0  0.0
3    B      14  0.0  1.0  0.0
4    B      19  0.0  1.0  0.0
5    B      23  0.0  1.0  0.0
6    C      25  0.0  0.0  1.0
7    C      29  0.0  0.0  1.0
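
Since this exercise sits in the section on `identity` and `eye`, here is an alternative sketch that uses NumPy only. It assumes a hypothetical list of labels, `labels`, and relies on the fact that row $i$ of an identity array is the one-hot vector for index $i$.

```python
import numpy as np

categories = ["A", "B", "C"]
labels = ["A", "A", "B", "C"]   # Hypothetical sample data

# Map each category to an integer code: A -> 0, B -> 1, C -> 2
index = {category: i for i, category in enumerate(categories)}
codes = np.array([index[label] for label in labels])

# Select one row of the identity array per label
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```

The `OneHotEncoder` approach is usually more convenient for real datasets, because it also handles unknown categories.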


### Summary statistics

NumPy provides some basic methods to extract summary statistics from an array. By default, the methods ignore the array's shape and use all elements:

In [35]:
data = np.array([[9, 7, 5, 4, 1], [3, 2, 8, 0, 6]])

print("Sum: ", data.sum())
print("Min: ", data.min())
print("Max: ", data.max())
print("Avg: ", data.mean())
print("Var: ", data.var())
print("Std: ", data.std())

Sum:  45
Min:  0
Max:  9
Avg:  4.5
Var:  8.25
Std:  2.8722813232690143


### Sum, product, min, max

Sometimes we are interested in the sum or product of array elements over a given axis:

In [36]:
data = np.array([[9, 7, 5, 4, 1], 
                 [3, 2, 8, 0, 6]])

# This is the sum over all axes
assert np.sum(data) == data.sum()

print("Sum over 0-axis:", np.sum(data, axis=0))
print("Sum over 1-axis:", np.sum(data, axis=1))
print()
print("Product over 0-axis:", np.prod(data, axis=0))
print()
print("Product over sum over 1-axis:", np.prod(np.sum(data, axis=1), axis=0))
print()
print("Minimum over 0-axis:", np.min(data, axis=0))
print("Maximum over 1-axis:", np.max(data, axis=1))

Sum over 0-axis: [12  9 13  4  7]
Sum over 1-axis: [26 19]

Product over 0-axis: [27 14 40  0  6]

Product over sum over 1-axis: 494

Minimum over 0-axis: [3 2 5 0 1]
Maximum over 1-axis: [9 8]


### Arithmetic operations

Simple arithmetic operations with arrays require operands of the same size and shape. NumPy makes it easier using __broadcasting__. For example, given array

```
X = [1, 2, 3, 4]
```

the operation $X \times 2$ is equivalent to:

```
[1, 2, 3, 4] * [2, 2, 2, 2]
```

In general, two arrays of different shape can be combined in the same expression __without copying data__ based on __two broadcasting rules__:

__Rule 1.__ Prepend `1`s to the shape of the array with fewer dimensions until both shapes have the same length. For example, if array `X` has shape `(3, 3)` and array `Y` has shape `(3,)`, then `Y`'s shape becomes `(1, 3)`.

__Rule 2.__ Dimensions of size `1` are repeated to match the size of the corresponding dimension of the other operand. In our example, array `Y`'s row is repeated 3 times to become a $3 \times 3$ array. Dimensions that differ and are not of size `1` are incompatible and raise an error.
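
The two rules can be sketched with concrete arrays. The values of `X` and `Y` are made up for this example, and `np.broadcast_to` is used only to display the expanded form that NumPy works with conceptually.

```python
import numpy as np

X = np.arange(9).reshape(3, 3)   # Shape (3, 3)
Y = np.array([10, 20, 30])       # Shape (3,)

# Rules 1 and 2 made explicit: (3,) -> (1, 3) -> (3, 3)
Y_expanded = np.broadcast_to(Y, X.shape)
print(Y_expanded)

# The addition applies the same expansion implicitly
result = X + Y
print(result)

# Shapes (3, 3) and (2,) are incompatible: 3 != 2 and neither is 1
try:
    X + np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)
```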

In [37]:
# Add 1 to every element in the array and multiply by 2
data = np.arange(1, 10, 1).reshape(3, 3)
#
# data is:
#
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
#
data = (data + 1) * 2
print(data)

[[ 4  6  8]
 [10 12 14]
 [16 18 20]]


Basic arithmetic functions operate element-wise on arrays. Given two arrays, `a` and `b` of dimensions $N \times M$:

| Description    |  Operator | Function                |
| :------------- | :---------| :---------------------- |
| Addition       | `a + b`   | `np.add(a, b)`          | 
| Subtraction    | `a - b`   | `np.subtract(a, b)`     |
| Multiplication | `a * b`   | `np.multiply(a, b)`     |
| Exponentiation | `a ** b`  | `np.power(a, b)`        | 
| True division  | `a / b`   | `np.divide(a, b)`       |
| Floor division | `a // b`  | `np.floor_divide(a, b)` |
| Remainder      | `a % b`   | `np.mod(a, b)`          |

For example:

In [38]:
a = np.array([[1, 2, 3, 4]], float)
b = np.array([[5, 6, 7, 8]], float)

# You can control print options with np.set_printoptions()
print(np.power(a, b))

# Try out the rest 

[[1.0000e+00 6.4000e+01 2.1870e+03 6.5536e+04]]
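
For the remaining rows of the table, a possible sketch is to compare each operator with its function form. The dictionary `pairs` is just a convenience for this example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4]], float)
b = np.array([[5, 6, 7, 8]], float)

# Each entry pairs the operator form with the equivalent function call
pairs = {
    "add": (a + b, np.add(a, b)),
    "subtract": (a - b, np.subtract(a, b)),
    "multiply": (a * b, np.multiply(a, b)),
    "divide": (a / b, np.divide(a, b)),
    "floor_divide": (a // b, np.floor_divide(a, b)),
    "mod": (a % b, np.mod(a, b)),
}

for name, (by_operator, by_function) in pairs.items():
    print(name, by_function, np.array_equal(by_operator, by_function))
```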


There are unary arithmetic operations too, e.g., `np.floor`, `np.ceil` and `np.rint`.

The `floor` of $x$ is the largest integer $i$ where $i \le x$.

The ceiling (`ceil`) of $x$ is the smallest integer $i$ where $i \ge x$.

The `rint` of $x$ rounds $x$ to the nearest integer (halfway cases are rounded to the nearest even integer).

In [40]:
data = np.array([1.1, 2.6, 3.2, 4.8, 5.9])

print(np.floor(data))
print(np.ceil(data))
print(np.rint(data))

[1. 2. 3. 4. 5.]
[2. 3. 4. 5. 6.]
[1. 3. 3. 5. 6.]


### `np.dot` & matrix multiplications

The `np.dot(a, b)` function computes the dot product of two arrays. Specifically,

* If both `a` and `b` are 1-D arrays (i.e., vectors), it returns the inner product of vectors ($\sum{(a_i \times b_i)}$).

* If both `a` and `b` are 2-D arrays, it is matrix multiplication (equiv. `a @ b`).

* If either `a` or `b` is a scalar, it is equivalent to `np.multiply(a, b)` or `a * b`.

In [41]:
a = np.array([1, 1, 1, 1], float)
b = np.array([1, 2, 3, 4], float)

# Multiply vectors a and b element-wise and sum the results 
np.dot(a, b)

10.0

In [42]:
a = np.array([[1, 2], [3, 4]], float)
b = np.array([[1, 2], [3, 4]], float)

#
# 1 2  @  1 2  =  (1x1 + 2x3) (1x2 + 2x4)   
# 3 4     3 4     (3x1 + 4x3) (3x2 + 4x4)
#
np.dot(a , b)

array([[ 7., 10.],
       [15., 22.]])

In [46]:
a = np.array([[1, 2], [3, 4]], float)
b = 3

c = np.dot(a, b)
print(c)
print()
d = np.multiply(a, b)
print(d)


[[ 3.  6.]
 [ 9. 12.]]

[[ 3.  6.]
 [ 9. 12.]]


### A note on matrix multiplication and matrix dimensions compatibility

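For `np.matmul` (and the `@` operator), if either argument array is $n$-dimensional, where $n > 2$, it is treated as a stack of 2-dimensional arrays residing in the last two indices, and the other array is broadcast accordingly. For plain 2-dimensional arguments, the inner dimensions must match: an $N \times K$ array multiplied by a $K \times M$ array gives an $N \times M$ result.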

In [50]:
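# Three 7 x 4 arrays, flattened by reshape into a single 21 x 4 matrix
a = np.ones([3, 7, 4]).reshape(21, 4)
print(a)
print()
# One 4 x 3 array (reshape drops the leading dimension of size 1)
c = np.ones([1, 4, 3]).reshape(4, 3)
print(c)

# Ordinary 2-D product: (21, 4) @ (4, 3) -> (21, 3)
np.matmul(a, c).shape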

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


(21, 3)
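
To see the stacking behaviour itself, the arrays can be left 3-dimensional. The sketch below is a variation on the cell above, with the names `a`, `c` and `out` reused for illustration.

```python
import numpy as np

# A stack of three 7 x 4 arrays
a = np.ones((3, 7, 4))
# A stack of one 4 x 3 array, broadcast to three
c = np.ones((1, 4, 3))

# Each of the three 7 x 4 matrices is multiplied by the same 4 x 3 matrix
out = np.matmul(a, c)
print(out.shape)

# Every entry is a sum of four products of ones
print(out[0, 0, 0])
```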

## About Pandas

NumPy arrays are optimised for __homogeneous, numerical__ data, accessed via integer indices. Some datasets, however, contain __heterogeneous, unstructured and missing__ data. 

A data application, therefore, needs to support custom indexing and transform data into an appropriate form for analysis. _Pandas_ is a popular library for dealing with such data.

Pandas is built atop `ndarray`s. Pandas objects are valid arguments for many NumPy operations and vice versa.

There are two key Pandas data collections, `Series` for one-dimensional data and `DataFrame` for two-dimensional data.

In [None]:
import pandas as pd

## Pandas Series

A `Series` is an enhanced one-dimensional `array` that supports non-integer indexing, including strings:

```
pd.Series(data=None, index=None, dtype=None)
```

where:

* `data` is an array, iterable or dictionary data structure;
* `index` is an array of hashable values of the same length as `data`. If not specified, the `Series` has integer indices by default (0, 1, 2 and so on). If `data` is a dictionary and `index` is given, values are looked up by matching the index labels against the dictionary's keys, and labels without a matching key get `NaN`.
* `dtype` is the data type of the Series (otherwise inferred from `data`).
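
The interaction between a dictionary and an explicit `index` can be sketched as follows. The label `"Grace"` is a made-up name that has no entry in the dictionary.

```python
import pandas as pd

ht = {"Christopher": 87, "Ada": 100, "Michael": 91}

# Only the requested labels are kept, in the order given;
# "Grace" has no matching key, so its value is NaN
s = pd.Series(ht, index=["Ada", "Michael", "Grace"])
print(s)
```

Because `NaN` is a floating-point value, the resulting `Series` holds floats rather than integers.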

In [59]:
data = np.array([87, 100, 91])
grades = pd.Series(data, index=['Christopher', 'Ada', 'Michael'])
print(grades)

# Alternatively, the same series can be defined as follows
ht = {"Christopher": 87, "Ada": 100, "Michael": 91}
grades = pd.Series(ht)
print()
print(grades)
# Note that the dtype changes from int32 (NumPy array) to int64 (dictionary)

Christopher     87
Ada            100
Michael         91
dtype: int32

Christopher     87
Ada            100
Michael         91
dtype: int64


A Series object is very similar to a dictionary. For example, an item in a Series can be accessed or modified using the index name as a key:

In [60]:
grades['Michael'] += 1
print(grades['Michael'])

92


Interestingly, if an index label is a string that is also a valid Python identifier, it can be accessed as an attribute:

In [61]:
grades.Michael += 1
print(grades.Michael)

93


__Iterating__ through a `Series` is similar to a dictionary:

In [62]:
for k, v in grades.items():
    print("{:7s} {:-3d}".format(k, v))
print("- A series of {} {} items".format(grades.size, grades.dtype))

Christopher  87
Ada     100
Michael  93
- A series of 3 int64 items


### Summary statistics

The `Series` object provides many methods for common tasks, including producing various statistics that summarise the central tendency, dispersion and shape of the data distribution.

In [63]:
grades.describe()

count      3.000000
mean      93.333333
std        6.506407
min       87.000000
25%       90.000000
50%       93.000000
75%       96.500000
max      100.000000
dtype: float64

## Pandas DataFrames

A `DataFrame` is an enhanced two-dimensional array, with custom row and column indices. Each column in a `DataFrame` is a `Series`.  Let's create a DataFrame from a dictionary:

In [64]:
data = {
    "Christopher": [ 85,  76, 43,  95],
    "Ada": [100, 100, 95, 100],
    "Michael": 65  # A scalar is broadcast to every row
}

grades = pd.DataFrame(data, index=["A", "B", "C", "D"])
grades

   Christopher  Ada  Michael
A           85  100       65
B           76  100       65
C           43   95       65
D           95  100       65


The dictionary keys become the `DataFrame`'s __columns__ and the dictionary values become the values in each column, one per row of the `DataFrame`.

A few observations:

* By default, row indices are integers, but we have assigned string names in the constructor.

* Be careful with the dimensions of the frame. Conveniently, in the example above, Pandas uses __broadcasting__ to fill in `Michael`'s missing row values from the single value given.

### Indexing

DataFrames are useful data representations when processing requires frequent indexing operations. As with `Series`, __string column indices__ are accessible as attributes too.

In [65]:
grades.Christopher  # Equivalent to grades["Christopher"]

A    85
B    76
C    43
D    95
Name: Christopher, dtype: int64

Each column is a `Series`, so we can access a specific "cell" (row, column) thus:

In [66]:
grades.Christopher.C

43

We can also access a specific cell using `at` (by labels) and `iat` (by integer positions), this time using (row, column) pairs:

In [67]:
grades.at['C', 'Christopher']

43

In [73]:
# Or, equivalently to the above
result = grades.iat[2, 0]
print(result)
print(grades)

43
   Christopher  Ada  Michael
A           85  100       65
B           76  100       65
C           43   95       65
D           95  100       65


The Pandas library provides two (optimised) indexers to access __rows__: __`loc`__ to access them by index name (or _label_) and __`iloc`__ to access them by _index number_. When a single row is selected, both return it as a `Series` object.

In [74]:
grades.loc['A']

Christopher     85
Ada            100
Michael         65
Name: A, dtype: int64

In [75]:
grades.iloc[1]

Christopher     76
Ada            100
Michael         65
Name: B, dtype: int64

### Slicing

It is possible to select more than one row:

In [76]:
grades.loc[['A', 'C']]

   Christopher  Ada  Michael
A           85  100       65
C           43   95       65


In [77]:
grades.iloc[[0, 2]]

   Christopher  Ada  Michael
A           85  100       65
C           43   95       65


Or, specify a range (e.g. the first three rows). When using numbers to specify a range, the end index is excluded; when using labels, the end index is included.

In [78]:
grades.iloc[0:3]  # Equivalent to grades.loc['A':'C']

   Christopher  Ada  Michael
A           85  100       65
B           76  100       65
C           43   95       65


Finally, we can slice a `DataFrame` by __combining rows $i$ with columns $j$__:

* $[[i_{1}, i_{2}, \dots]$__,__ $[j_{1}, j_{2}, \dots]]$ selects specific rows ($i_{1}$, $i_{2}$ and so on) and specific columns ($j_{1}$, $j_{2}$ and so on).

* $[i_{\mathrm{start}}:i_{\mathrm{end}}$__,__ $[j_{1}, j_{2}, \dots]]$ selects a range of rows (from $i_{\mathrm{start}}$ to $i_{\mathrm{end}}$) and specific columns.

* $[[i_{1}, i_{2}, \dots]$__,__ $j_{\mathrm{start}}:j_{\mathrm{end}}]$ selects specific rows and a range of columns (from $j_{\mathrm{start}}$ to $j_{\mathrm{end}}$).

* $[i_{\mathrm{start}}:i_{\mathrm{end}}$__,__ $j_{\mathrm{start}}:j_{\mathrm{end}}]$ selects a range of rows and a range of columns.

In [79]:
grades.iloc[0:2, [0, 2]]

   Christopher  Michael
A           85       65
B           76       65
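
The same combinations work with labels through `loc`. The sketch below rebuilds the `grades` frame so that it is self-contained; the variable names `by_label`, `by_label_cols` and `by_position` are chosen for illustration. Note that label ranges include the end label, whereas position ranges exclude the end index.

```python
import pandas as pd

grades = pd.DataFrame(
    {"Christopher": [85, 76, 43, 95], "Ada": [100, 100, 95, 100], "Michael": 65},
    index=["A", "B", "C", "D"],
)

# A range of rows (end label included) and specific columns
by_label = grades.loc["A":"C", ["Christopher", "Ada"]]

# Specific rows and a range of columns (end label included)
by_label_cols = grades.loc[["A", "D"], "Christopher":"Ada"]

# The position-based equivalent of the first selection (end index excluded)
by_position = grades.iloc[0:3, 0:2]

print(by_label)
print(by_label_cols)
```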


### Summary statistics

The `DataFrame` object provides many methods for common tasks, including producing various statistics that summarise the central tendency, dispersion and shape of the data distribution.

In [85]:
pd.set_option("display.precision", 1)
grades.describe() 

       Christopher    Ada  Michael
count          4.0    4.0      4.0
mean          74.8   98.8     65.0
std           22.5    2.5      0.0
min           43.0   95.0     65.0
25%           67.8   98.8     65.0
50%           80.5  100.0     65.0
75%           87.5  100.0     65.0
max           95.0  100.0     65.0


### Transposing a DataFrame

It is fairly easy to transpose a `DataFrame`, that is, to turn rows into columns and columns into rows: 

In [86]:
grades.T

               A    B   C    D
Christopher   85   76  43   95
Ada          100  100  95  100
Michael       65   65  65   65


__Exercise in class.__ Suppose that we want to get the average grade by course (that is, 'A' to 'D').

In [88]:
# Solve it here.
# Each row is a course, so average across the columns (axis=1)
grades.mean(axis=1)

A    83.3
B    80.3
C    67.7
D    86.7
dtype: float64

### From CSV to DataFrames and back

A Comma-Separated Values (CSV) file is a text file with one row (a list of comma-separated values) per line. The first row is typically, but optionally, a list of column names (the _header_). The `read_csv()` function reads a CSV file into a `DataFrame`.

There are numerous keyword arguments to the function (see [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)). For example, you can specify or override column names with `names`; specify the absence of a header with `header=None`; and specify which column should serve as an index with `index_col`. 

In [89]:
# The default column names are 'name' and 'score'. 
# You can override them by passing the following as
# arguments (header=0 replaces the existing header row):
# 
# names=['Student name', 'Grade'], header=0
#
df = pd.read_csv('grades.csv')

Use the `to_csv()` method to write or append a `DataFrame` to a CSV file.
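
The keyword arguments mentioned in this section can be tried without a file on disk. The sketch below assumes a small in-memory CSV text, `csv_text`, with the columns `name` and `score`; the contents are made up, and `io.StringIO` stands in for a real file such as `grades.csv`.

```python
import io

import pandas as pd

# Hypothetical file contents, with a header row
csv_text = "name,score\nAda,100\nMichael,91\n"

# Default: the first row is used as the header
df = pd.read_csv(io.StringIO(csv_text))

# Replace the existing header with custom column names
renamed = pd.read_csv(
    io.StringIO(csv_text), names=["Student name", "Grade"], header=0
)

# Use the 'name' column as the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col="name")

# Write the frame back as CSV text and read it again
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(renamed)
print(indexed)
```

Passing `index=False` to `to_csv()` omits the row index, so the written text has the same columns as the input.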

### Other supported file formats

Besides CSV files, you can read (respectively, write) data stored in various formats including:

* Excel spreadsheets (`read_excel`)
* HTML tables (`read_html`)
* JSON strings (`read_json`)
* Fixed-width formatted lines (`read_fwf`)

### A. Views versus copies

If you __slice__ an array or data frame for any reason other than __an immediate analysis or visualisation of the data__, you should __make a copy__ of that slice.

For example, the `pandas` library may raise a warning (`SettingWithCopyWarning`) if we try to modify the base data frame indirectly through a slice, but not the other way around.

In [90]:
df = pd.DataFrame({'x': np.arange(4), 'y': ['a', 'b', 'c', 'd']})
df

   x  y
0  0  a
1  1  b
2  2  c
3  3  d


In [91]:
# Let's create a slice of the first two rows
sl = df.iloc[:2,]
sl

   x  y
0  0  a
1  1  b


In [94]:
# Let's try to modify an element of the slice
sl.iloc[0, 0] = 10
print(sl)

    x  y
0  10  a
1   1  b


In [95]:
# But there is no warning if the slice is modified
# due to changes in the base data frame
df.iloc[0,0] = 10
sl

    x  y
0  10  a
1   1  b
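
A minimal sketch of the recommended practice, using the same example frame: calling `copy()` on the slice makes it independent of the base data frame, so changes on either side do not affect the other.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(4), 'y': ['a', 'b', 'c', 'd']})

# Make an explicit copy of the first two rows
sl = df.iloc[:2].copy()

# Modifying the copy leaves the base data frame untouched
sl.iloc[0, 0] = 10
print(df.iloc[0, 0])

# Modifying the base data frame leaves the copy untouched
df.iloc[1, 0] = 20
print(sl.iloc[1, 0])
```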
