# Data Management Notebook

As we learnt in the first lecture, one of the big factors driving the current AI wave is the availability of data. In this workbook we will therefore learn how to handle numerical data, moving on to imaging data later.

In this tutorial, we will learn to load data (stored as comma separated values or .csv files), access, summarise, select, and manipulate data using NumPy and pandas.

## NumPy
NumPy is the core library for scientific computing in Python.

In [56]:
import numpy as np

### NumPy Arrays
We will learn about arrays (focusing on 1D and 2D arrays) and some tools for working with them.
A NumPy array is a grid of values, all of the same type. Arrays can be initialised from lists, for example:


In [57]:
array1d = np.array([1, 2, 3])           # this is a list of values - 1D
array2d = np.array([[1,2,3],[4,5,6]])   # this is a list of lists - 2D
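
As a small illustration of the "same type" rule, the sketch below shows how NumPy settles on one common type when a list mixes integers and floats. The variable names `ints` and `mixed` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # one float in the list -> every value is stored as a float

print(ints.dtype)                # an integer type (the exact width depends on the platform)
print(mixed.dtype)               # float64
print(mixed)
```

The attribute `dtype` reports the type shared by all elements of the array.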

You can find the number of dimensions and the size of these arrays using shape. You can access the content using square brackets, either with a single number (for 1D arrays) or with a tuple of indices, one per dimension:

In [None]:
print('Shape array1d ', array1d.shape)
print('First value in array1d ', array1d[0])
print('Last value in array1d ', array1d[-1])
print('Values in array1d ', array1d)
array1d[0] = 5
print('Values in array1d after change:', array1d)

In [None]:
print('Shape array2d ', array2d.shape)
print('Value at 0,0 position ', array2d[0, 0])
print('Values in array2d:' )
print(array2d)
array2d[0, 0] = -1
print('Values in array2d after change:' )
print(array2d)

### Slicing
Slicing extracts part of an array by giving a range of indices in the form start:stop, where the element at stop is not included. Let's create two arrays, one with fixed values and one with random values, and slice them in different ways:

In [None]:
a = np.array([0,1,2,3,4,5,6,7,8,9])
print(a)
print(a.shape)
print(a[3:7])

In [None]:
b = np.random.random((10,4))
print(b)
print(b.shape)
print(b[:,2:4])
print(b[7:9,:])
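
One detail worth knowing when slicing: a basic slice is a view on the original array, not a copy, so writing into the slice also changes the original. The sketch below uses illustrative names (`base`, `part`, `safe`) that are not part of the lecture material.

```python
import numpy as np

base = np.arange(10)        # the values 0, 1, ..., 9
part = base[3:7]            # a view on elements 3, 4, 5 and 6 (index 7 is excluded)
part[0] = 100               # writing into the view also changes `base`
print(base[3])              # 100

safe = base[3:7].copy()     # .copy() gives an independent array
safe[0] = -1                # `base` is not affected this time
print(base[3])              # still 100
```

If you want to modify a slice without touching the original data, call `.copy()` first.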

### Boolean indexing
Arrays can also be indexed using conditions, which create Boolean arrays. For example, let's create an array of random numbers and find all values that are larger than 0.7.

In [None]:
rand1d = np.random.random((10,))
print(rand1d)
print(rand1d > 0.7)
print(rand1d[rand1d>0.7])
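
Conditions can also be combined and used for assignment. The following is a minimal sketch with fixed (made-up) values so that the result is reproducible; `values` and `mask` are illustrative names.

```python
import numpy as np

values = np.array([0.1, 0.8, 0.5, 0.95, 0.3])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (values > 0.4) & (values < 0.9)
print(mask)
print(values[mask])          # the values 0.8 and 0.5
print(mask.sum())            # 2, because True counts as 1

# A Boolean array can also select the elements to overwrite
values[values > 0.9] = 0.9
print(values)
```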

### Array math
Basic mathematical functions operate elementwise on arrays. Some examples include addition, subtraction, multiplication, division and square root.

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

print('x =', x)
print('y =',y)

In [None]:
# Elementwise addition --> the function produces an array as output
z = x + y # or z = np.add(x, y)
print(z)

In [None]:
# Elementwise difference --> the function produces an array as output
z = x - y # or z = np.subtract(x, y)
print(z)

In [None]:
# Elementwise product --> the function produces an array as output 
z = x * y # or z = np.multiply(x, y)
print(z)

In [None]:
# Elementwise division --> the function produces an array as output
z = x / y # or z = np.divide(x, y)
print(z)

There are also functions that work within a single array, for example square root or sum.

In [None]:
# Elementwise square root --> the function produces an array as output
z = np.sqrt(x)
print(z)

In [None]:
# Sum of the elements --> without an axis the output is a single number; with an axis it is an array
summall = np.sum(x)
print(summall)

colsums = np.sum(x,axis=0)
print(colsums)

rowsums = np.sum(x,axis=1)
print(rowsums)
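
The effect of the axis argument is easier to see on a non-square array. This is a small sketch with made-up values; `m`, `total`, `per_column` and `per_row` are illustrative names.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])           # shape (2, 3): 2 rows, 3 columns

total = np.sum(m)                   # no axis: a single number, 21
per_column = np.sum(m, axis=0)      # sums down the rows: one value per column, 5 7 9
per_row = np.sum(m, axis=1)         # sums across the columns: one value per row, 6 15

print(total)
print(per_column, per_column.shape)
print(per_row, per_row.shape)
```

The axis you name is the one that disappears from the shape: `axis=0` turns (2, 3) into (3,), and `axis=1` turns it into (2,).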

For more information, check the documentation of NumPy: https://numpy.org/doc/stable/reference/ 

## Pandas

Pandas is a Python package for data manipulation and analysis, especially good for numerical tables and time series. The name is said to come from "panel data", an econometrics term for longitudinal data sets, or from "Python data analysis"... you choose! Pandas is built on NumPy arrays, so many functions available in NumPy are also accessible in Pandas.


In [78]:
import pandas as pd

Let's load the data using pandas. You can also open the file using Excel or any other spreadsheet package.



In [79]:
# we are loading data from github. 
dataurl = 'https://github.com/rrr-uom-projects/MPiCRT-AI/raw/main/Data/titanic.csv' 
pax = pd.read_csv(dataurl, sep = ',')

# You can also write the path if done locally, for example
# dataloc = 'c:/temp/titanic.csv'
# pax = pd.read_csv(dataloc,sep=',')
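
If you want to try read_csv without a network connection or a local file, a minimal sketch is to read from a string. The three rows below are made up for illustration and are not taken from the Titanic file; `csv_text` and `demo` are illustrative names.

```python
import io
import pandas as pd

# A tiny, made-up CSV: a header line followed by one line per row
csv_text = """Name,Age,Fare
Alice,29,7.25
Bob,,71.28
Carol,35,8.05
"""

# io.StringIO makes the string behave like an open file
demo = pd.read_csv(io.StringIO(csv_text), sep=',')
print(demo.shape)    # (3, 3)
print(demo)
```

Notice the empty field in the second row: pandas reads it as a missing value.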

### Main data structure: DataFrame and Series

Pandas is built around data structures called _Series_ and _DataFrames_. 
pax above is a dataframe. We can check this easily:

In [None]:
print(type(pax))

A DataFrame is a 2D data structure of rows and columns, similar to a spreadsheet. Each column in the DataFrame is a Series: a 1D array of values (typically backed by a NumPy array) that also includes an index (corresponding to the row number in a spreadsheet). Keep in mind that indexes do not have to be numerical.

Let's see how many rows and columns this dataframe has, and what the columns are called.

In [None]:
print(pax.shape)
print(pax.columns)

A DataFrame keeps all its series in a structure analogous to a Python dictionary, mapping column names (keys) to Series (values). Let's select one series:

In [None]:
a = pax['Name'] # or pax.Name
print(type(a))
print(a)
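
The dictionary analogy can be made concrete by building a small DataFrame from a dictionary. This is a sketch with made-up values; `table` and `ages` are illustrative names.

```python
import pandas as pd

# Keys become column names, values become the columns
table = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [29.0, 41.0, 35.0],
})

ages = table['Age']             # look a column up by its key, as in a dictionary
print(type(ages))               # a pandas Series
print(list(table.keys()))       # ['Name', 'Age']
print('Age' in table)           # True: `in` checks the column names
```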

An important aspect we learnt about NumPy arrays is that they can only hold data of a single type. In contrast, a dataframe can contain several types, as each series is independent. Let's see what types of data we have in the dataset, using the function info().

In [None]:
print(pax.info())

Do you see that the non-null counts differ... why is that?

### Series operations
Numerical series can be used arithmetically, e.g., series_3 = series_1 + series_2. This operation aligns values with corresponding indices in series_1 and series_2, then adds them together to produce the new values in series_3.

For example, we could add the age and fare into a new column... although with not much meaning:

In [None]:
nonsensicalseries = pax.Age + pax.Fare
print(nonsensicalseries)
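
To see the alignment by index at work, here is a minimal sketch with two made-up series whose indices only partly overlap; `s1`, `s2` and `s3` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['c', 'b', 'd'])

# Values are matched by index label, not by position
s3 = s1 + s2
print(s3)
# 'b' -> 2 + 20 = 22 and 'c' -> 3 + 10 = 13
# 'a' and 'd' exist in only one of the series, so their result is NaN
```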

You can add a new series with the result of a mathematical operation very easily. For example:

In [None]:
print(pax.columns)
pax['nonsense'] = pax['Age'] + pax['Fare']
print(pax.columns)

How could we find the most expensive fare? 
Remember that the Series are based on NumPy!


In [None]:
# Try!


We can also do all sorts of string operations using the string functions available through the .str accessor of a Series (here some extra info: https://pandas.pydata.org/docs/user_guide/text.html).
Let's try and find out how many passengers were called Mary and show the names:

In [None]:
# First we need to cast the type of the Name series to str. 
# From pax.info() we saw that Name has 891 non-null values, and it is type object
pax['Name'] = pax['Name'].astype('string')
print(pax.info())

In [None]:
# Now, we can use the function count to count the number of times that paxname appears in each value of pax.Name
paxname = 'Mary' # Notice that this can also include regular expressions. Check for more info https://docs.python.org/3/library/re.html#module-re
marycounts = pax['Name'].str.count(paxname)
num = np.sum(marycounts)
print(num)

In [None]:
# we can use marycounts to select the cells that had Mary in them:
allmarys = pax['Name'][marycounts>0]
print(allmarys)
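
Note that str.count counts occurrences, so a name containing the pattern twice contributes 2 to the sum. If the goal is the number of names that contain the pattern at least once, str.contains is one alternative. The sketch below uses made-up names, and `names`, `counts` and `has_mary` are illustrative.

```python
import pandas as pd

names = pd.Series(
    ['Smith, Miss. Mary', 'Mary, Mrs. Mary Ann', 'Jones, Mr. John'],
    dtype='string',
)

counts = names.str.count('Mary')         # occurrences per name: 1, 2, 0
has_mary = names.str.contains('Mary')    # True, True, False

print(counts.sum())      # 3 occurrences in total
print(has_mary.sum())    # 2 names contain 'Mary'
print(names[has_mary])   # the Boolean series can be used to select the matching names
```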

We can also split strings 'easily'. For example, we could extract the surname of the names series.  

In [None]:
surnamefirstnames = pax['Name'].str.split(',')  # this splits the string by the token given (,)
print(surnamefirstnames)                        

In [None]:
pax['Surname'] = surnamefirstnames.str.get(0)   # here we get the first bit of the divided sentence
print(pax['Surname'][0:5])
print(pax['Name'][0:5])
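
The same split can give access to the rest of the name as well. This is a sketch on two made-up names, assuming the "Surname, Title. First names" layout; `parts`, `surname` and `rest` are illustrative names.

```python
import pandas as pd

names = pd.Series(['Smith, Miss. Mary', 'Jones, Mr. John'])

parts = names.str.split(',')             # each element becomes a list of two strings
surname = parts.str.get(0)               # the piece before the comma
rest = parts.str.get(1).str.strip()      # the piece after the comma, without the leading space

print(surname.tolist())   # ['Smith', 'Jones']
print(rest.tolist())      # ['Miss. Mary', 'Mr. John']
```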

### Accessing the data within the dataframe

Selecting a specific cell can be done in two alternative ways:
1. Selecting the series, then the index location,
2. Using 'at'. In this case, the index goes before the series label.

In [None]:
print(pax['Name'][100])

In [None]:
print(pax.at[100,'Name'])
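
Both approaches use the index label, which matters when the index is not numerical. Here is a minimal sketch with a made-up table indexed by strings; `table` is an illustrative name.

```python
import pandas as pd

table = pd.DataFrame(
    {'Age': [29.0, 41.0], 'Fare': [7.25, 71.28]},
    index=['alice', 'bob'],        # a non-numerical index
)

print(table['Age']['bob'])         # series first, then index label -> 41.0
print(table.at['bob', 'Age'])      # index label first, then series label -> 41.0
```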

You can also select multiple columns using 'loc', which creates a new dataframe with the selected data. For example, we can select Surname, age and fare for all passengers like this:

In [None]:
selcols = ['Surname','age','fare'] # this should produce an error... any idea why?
selpax = pax.loc[:,selcols]
print(selpax)

We could also decide to only get these data for the passengers called Mary.  In this case, we can use marycounts, obtained above:

In [None]:
marysdata = pax.loc[marycounts>0,selcols]
print(marysdata)
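
The general pattern is loc[rows, columns], where the rows can be given as a Boolean series and the columns as a list of labels. A small sketch with made-up data follows; `table`, `expensive` and `subset` are illustrative names.

```python
import pandas as pd

table = pd.DataFrame({
    'Surname': ['Smith', 'Jones', 'Brown'],
    'Age': [29.0, None, 35.0],
    'Fare': [7.25, 71.28, 8.05],
})

expensive = table['Fare'] > 8                          # Boolean series: False, True, True
subset = table.loc[expensive, ['Surname', 'Fare']]     # chosen rows, chosen columns

print(subset)
print(subset.shape)    # (2, 2)
```

The original index labels (here 1 and 2) are kept in the new dataframe.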

You can also access the dataframe row by row. For this, we can use iterrows(). Let's do that for marysdata:


In [None]:
for index, row in marysdata.iterrows():
    print(index, row['Surname'], row['Age'])

What are those pesky NaN (or nan)?? 

There are many online resources that can be of use.  For example:
- https://pandas.pydata.org/docs/reference/index.html
- https://pandas.pydata.org/docs/user_guide/index.html

