In [None]:
import numpy as np

# Numpy Basics

Numpy stands for **Numerical Python**. It is the module to use for intensive numerical computations, particularly processing of matrices.

The foundation of Numpy is the ndarray object that provides _vectorised_ arithmetic operations as well as _broadcasting_ capabilities. 

Vectorisation and broadcasting are complex topics in themselves and are outside the scope of this introductory course. For now, let's say only this: `for` loops in Python are computationally expensive and take a long time to run over large arrays. Many machine learning algorithms need to apply functions to vectors. Vectorisation and broadcasting let us do what a `for` loop would do, without having to write the loop ourselves.
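
As a rough sketch of the idea, the snippet below computes the same result twice: once with an explicit `for` loop and once with a single vectorised expression. The names `values`, `doubled_loop` and `doubled_vec` are made up for this illustration.

```python
import numpy as np

values = np.arange(5)              # array([0, 1, 2, 3, 4])

# explicit loop: visit each element in turn
doubled_loop = []
for v in values:
    doubled_loop.append(int(v) * 2)

# vectorised: one expression applied to every element at once
doubled_vec = values * 2

print(doubled_loop)                # [0, 2, 4, 6, 8]
print(doubled_vec)                 # [0 2 4 6 8]
```

Both versions give the same numbers; the vectorised one is shorter and, on large arrays, usually much faster.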


## Numpy arrays

Creating a numpy array is simple. First, you need to make sure you have imported the Numpy module:

```python
import numpy as np
```

Then, to create an array we may simply do the following:

```python
my_array = np.array([
  [1,2,3],
  [4,5,6],
  [7,8,9]]
)
```

We then have access to attributes such as `my_array.shape`, which tells us the dimensions of our array (in this example it returns `(3, 3)`), or `my_array.dtype`, which tells us the type of the elements stored in the array.

In the context of Numpy we will refer to numpy arrays simply as _arrays_ and they can be either vectors or matrices.

Another common way to create arrays is by asking numpy to instantiate a vector or matrix with dimensions provided by us, and filled with zeros or ones. For example, a matrix with three rows and four columns:

```python
my_array_zeros = np.zeros((3,4))
```

While numpy is almost always used for integers and floats, it supports other data types, including Booleans, strings, complex numbers and even Python objects. 
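
A small sketch of a few of these other data types follows. The variable names are invented for this example, and `dtype=float` is just one way of asking for a specific type when creating an array.

```python
import numpy as np

flags = np.array([True, False, True])        # Booleans
words = np.array(['ab', 'cde'])              # fixed-width strings
z = np.array([1 + 2j, 3 - 1j])               # complex numbers
as_float = np.array([1, 2, 3], dtype=float)  # request a dtype explicitly

print(flags.dtype)       # bool
print(words.dtype.kind)  # U (unicode string)
print(z.dtype)           # complex128
print(as_float.dtype)    # float64
```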

## Operations between Arrays and Scalars

Arrays are very different from native lists because they "understand" operations that involve vectors, matrices and scalars.

For example, if we add 5 to an array with a numeric dtype then 5 will be added to all of its elements using vectorisation (an optimised operation that is much faster than a for loop). All arithmetic operations between an array and a scalar work by _broadcasting_ the operator and the scalar to all the elements in the array.


In [None]:
# Creating numpy arrays

import numpy as np

my_array = np.array([
  [1,2,3],
  [4,5,6],
  [7,8,9]]
)

print('the dimensions of my_array are:', my_array.shape)
print('the data type of the elements in my_array is:', my_array.dtype)

my_array_zeros = np.zeros((3,4))
print(my_array_zeros)

my_array_ones = np.ones((3,4))
print(my_array_ones)


the dimensions of my_array are: (3, 3)
the data type of the elements in my_array is: int64
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]


In [None]:
# Vectorised operations between arrays and scalars
# What happens if we represent a vector as a native list
# and want to add five to each element?

l = [1,2,3,4]
try:
  l + 5 
except TypeError:
  print('cannae do this. Data types are incompatible')

# the only way is to write a for loop and add 5 to each element

cannae do this. Data types are incompatible


In [None]:
# Vectorised operations between arrays and scalars

l = np.array([1,2,3,4])
print('arithmetic between array and scalar broadcasts')
l + 5

arithmetic between array and scalar broadcasts


array([6, 7, 8, 9])

## Indexing and Slicing

Typically, when we select an element or a slice of an array, we get a view on the original array. That is, when we assign values to selected elements or slices we are modifying the original array. We will see this in practice in a short while.
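
To make the idea of a view concrete, here is a minimal sketch. The names `original`, `view` and `independent` are only for illustration; `.copy()` is the standard way to ask for separate data instead of a view.

```python
import numpy as np

original = np.arange(6)             # array([0, 1, 2, 3, 4, 5])

view = original[1:4]                # basic slice: a view on the same data
view[0] = 99                        # also changes original[1]
print(original)                     # [ 0 99  2  3  4  5]

independent = original[1:4].copy()  # explicit copy: separate data
independent[0] = -1                 # original is left untouched
print(original)                     # [ 0 99  2  3  4  5]
```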

### Getting single elements

Getting single elements works just like it does in native lists. We use square brackets and pass the index of the element we want to get, counting from index zero.

### Getting slices in vector arrays

Given an array such as:

```python
my_array = np.array([1,2,3,4,5,6,7])
```
I can get the first three elements by:

```python
my_array[0:3]
```
As always, the first index before the : is included, and the last one excluded.

### Getting slices in 2D arrays

The principle is the same, but we work in two dimensions:

`[from_row : to_row, from_col : to_col]`

### Counting backwards 

Indexes in numpy arrays can be counted from the end, using negative indexes. For example, given a 1D array, we can slice the last three elements using `[-3:]`


In [None]:
# 1D selecting and slicing

my_array = np.array([1,2,3,4,5,6,7])

# get first element
print('first element in my array: ', my_array[0])

# get first three elements
print('first 3 elements in my array: ', my_array[0:3])

# get 2nd to 4th elements
print('second to 4th element in my array: ', my_array[1:4])

# get last element, without resorting to knowing the size of the array
print('last element in my array: ', my_array[-1])

# get last 3 elements, without resorting to knowing the size of the array
print('last three elements in my array: ', my_array[-3:])

# get first 3 elements
print('first three elements in my array: ', my_array[:3])



first element in my array:  1
first 3 elements in my array:  [1 2 3]
second to 4th element in my array:  [2 3 4]
last element in my array:  7
last three elements in my array:  [5 6 7]
first three elements in my array:  [1 2 3]


In [None]:
# Two-dimensional slicing

my_array2 = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
print(my_array2, '\n')

# slice 1st row
print('first row: ', my_array2[0], '\n')

# slice 1st column
print('first column', my_array2[:,0], '\n')

# slice 1st two rows, 1st and 2nd col
print(my_array2[:2,:2])

#slice last column, 2nd and 3rd rows
print(my_array2[1:3, -1], '\n')


[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]] 

first row:  [1 2 3 4] 

first column [ 1  5  9 13] 

[[1 2]
 [5 6]]
[ 8 12] 



In [None]:
my_array2[:,2]

array([ 3,  7, 11, 15])


## Boolean Indexing

Suppose you have what is called a _rectangular_ data structure: a matrix where rows correspond to subjects and columns correspond to variables measured on those subjects. We may want to select a subset of subjects, for example those of a given gender if the subjects are people. Assuming there is a list of the genders of all the subjects, in the same order as they appear as rows in the matrix, we can easily produce a Boolean vector where, for example, we assign True to females. If we pass that Boolean vector inside the selector of an array (that is, the square brackets), numpy will pick those rows for which the corresponding element in the selector is True. We can combine this with other forms of slicing. The general form of the selector is:

```python
array[ array_of_booleans ]
```

where the length of the `array_of_booleans` must equal the number of rows in the `array`.
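
A short sketch of combining a Boolean selector with ordinary slicing is shown below. The array `data` and the mask `keep` are hypothetical; the `~` operator inverts a Boolean array.

```python
import numpy as np

data = np.arange(12).reshape((4, 3))          # 4 subjects, 3 variables
keep = np.array([True, False, True, False])   # one Boolean per row

print(data[keep])        # rows 0 and 2, all columns
print(data[keep, :2])    # rows 0 and 2, first two columns only
print(data[~keep])       # inverted mask: rows 1 and 3
```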

## Fancy Indexing

This type of selection is similar to the Boolean case, but here we pass a list of row index positions inside the square brackets (hence the double square brackets) to pick those rows out of our numpy array. The general form is as follows:

```python
array[[ list_of_row_indexes ]] 
```
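
As a minimal sketch, with a made-up array called `grid`: the rows come back in the order we list them, and, unlike a basic slice, the result of fancy indexing is a copy, not a view.

```python
import numpy as np

grid = np.arange(20).reshape((5, 4))   # 5 rows, 4 columns

picked = grid[[4, 0, 2]]     # rows 4, 0 and 2, in that order
print(picked)

picked[0, 0] = -1            # fancy indexing returns a copy ...
print(grid[4, 0])            # ... so grid is unchanged: 16
```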

## Making assignments on slices (broadcasting)

Whenever and in whichever form we slice an array, if we make an assignment using `=` the assignment will be _broadcast_ to the entire slice.

In [None]:
# let's assume we have seven subjects in our rectangular data structure
names =   np.array(['helen', 'mary', 'lucy', 'matt', 'lindsay', 'ben', 'luke'])
genders = np.array(['f',   'f',      'f',    'm',    'f',       'm',   'm'])

# let's create a dummy matrix of integers that has seven rows (subjects) and
# five columns (variables)
d = np.arange(35).reshape((7,5))
print(d, '\n')

# now let's create a Boolean version of genders where female is True
sel_f = genders == 'f'
print('sub matrix females: ', d[sel_f], '\n')

# let's assign the value zero to the rows that correspond to male
sel_m = genders != 'f'
d[sel_m] = 0
print('after assigning males to zero we get', d, '\n')

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]
 [25 26 27 28 29]
 [30 31 32 33 34]] 

sub matrix females:  [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [20 21 22 23 24]] 

after assigning males to zero we get [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [ 0  0  0  0  0]
 [20 21 22 23 24]
 [ 0  0  0  0  0]
 [ 0  0  0  0  0]] 



In [None]:
# fancy indexing: select rows 0, 2 and 3
d[[0,2,3]]

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [ 0,  0,  0,  0,  0]])

In [None]:
# 1D slice assignment 

my_array = np.zeros(10)
my_array[3:6] = -1
print('assignment to slice broadcast')
print(my_array)

assignment to slice broadcast
[ 0.  0.  0. -1. -1. -1.  0.  0.  0.  0.]

## Transposing

For many matrix operations we need to transpose a matrix. That means exchanging rows and columns. In Numpy, given a matrix $m$ we transpose simply by:

```python
m.T
```
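
As one hedged illustration of why transposing is useful, the sketch below multiplies a transposed matrix by the original using the `@` matrix-product operator. The names `m` and `gram` are chosen only for this example.

```python
import numpy as np

m = np.arange(6).reshape((2, 3))   # 2 rows, 3 columns
print(m.shape, m.T.shape)          # (2, 3) (3, 2)

# matrix product of the transposed matrix with the original
gram = m.T @ m                     # result has shape (3, 3)
print(gram)
```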

In [None]:
# let's see how transposing a matrix works

my_array = np.arange(32).reshape(8,4)
print('the original matrix is: ', '\n', my_array, '\n')

print('the transposed matrix is: ', '\n', my_array.T, '\n')



the original matrix is:  
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]] 

the transposed matrix is:  
 [[ 0  4  8 12 16 20 24 28]
 [ 1  5  9 13 17 21 25 29]
 [ 2  6 10 14 18 22 26 30]
 [ 3  7 11 15 19 23 27 31]] 



## Universal Functions

Universal functions (ufuncs) are special operations, implemented in compiled code inside Numpy, that transform vectors without resorting to explicit Python `for` loops. Because these operations are vectorised, they take advantage of low-level optimisation techniques that make them run faster. Importantly, they are available to us as simple operations.

In Numpy these operations can be _unary_, that is, applying to a single vector, or _binary_, operating on two input vectors.

This is better understood with programming examples. Check the [Numpy](https://numpy.org/doc/) documentation to learn about the many available vectorised functions you can use to process your arrays.

In [None]:
# Unary universal functions in Numpy

my_array = np.arange(10)
print('original array:\n', my_array, '\n')

print('Square root of the array:\n', np.sqrt(my_array), '\n\n')

print('original array:\n', my_array, '\n')

print('the array times 4:\n', my_array*4)

original array:
 [0 1 2 3 4 5 6 7 8 9] 

Square root of the array:
 [0.         1.         1.41421356 1.73205081 2.         2.23606798
 2.44948974 2.64575131 2.82842712 3.        ] 


original array:
 [0 1 2 3 4 5 6 7 8 9] 

the array times 4:
 [ 0  4  8 12 16 20 24 28 32 36]


In [None]:
# binary universal functions

x = np.random.randn(4)
y = np.random.randn(4)

print('array x has: ', x)
print('array y has: ', y)

print('the array of per-component max is ', np.maximum(x,y))

array x has:  [-0.06680882 -1.07043159  0.41487425 -0.58325531]
array y has:  [ 0.16322395 -0.45276835  0.31523851 -1.81854458]
the array of per-component max is  [ 0.16322395 -0.45276835  0.41487425 -0.58325531]


## Sorting

You are likely to use Numpy during your Machine Learning lessons. For now, the last bit of information we will introduce about Numpy is sorting.

For one-dimensional arrays it is very simple and unambiguous:

```python
my_array.sort()
```

But when our array has two or more dimensions we can specify the axis along which the sort works. Let's see this with examples.

In [None]:
# sorting numpy arrays

my_array = np.random.randn(6)

print('original 1D array has:', '\n', my_array, '\n')
my_array.sort()
print('sorted 1D array has:', '\n', my_array , '\n')

my_array = np.random.randn(25).reshape(5,5)
print('original 2D array has:', '\n', my_array, '\n')

# use axis 0 to have sorted columns
print('sorting columns')
my_array.sort(0)
print(my_array, '\n')

print('sorting rows')
# use axis 1 to have sorted rows
my_array.sort(1)
my_array

original 1D array has: 
 [ 1.440268   -0.4098845  -0.18418626  0.28291348 -0.87760655  0.88570487] 

sorted 1D array has: 
 [-0.87760655 -0.4098845  -0.18418626  0.28291348  0.88570487  1.440268  ] 

original 2D array has: 
 [[ 0.75066187 -0.11640946 -2.12272904  0.19788095 -1.10082345]
 [ 1.46552222 -1.26842548  0.4537372   0.82273143 -1.66312068]
 [-0.89739758 -1.3573491  -0.9967238   1.04257973  0.59784321]
 [ 0.24571419 -1.907682    2.06941822  0.4371083   0.64782353]
 [ 0.13964124  0.01277631 -0.59512089  0.34343148  0.84951675]] 

sorting columns
[[-0.89739758 -1.907682   -2.12272904  0.19788095 -1.66312068]
 [ 0.13964124 -1.3573491  -0.9967238   0.34343148 -1.10082345]
 [ 0.24571419 -1.26842548 -0.59512089  0.4371083   0.59784321]
 [ 0.75066187 -0.11640946  0.4537372   0.82273143  0.64782353]
 [ 1.46552222  0.01277631  2.06941822  1.04257973  0.84951675]] 

sorting rows


array([[-2.12272904, -1.907682  , -1.66312068, -0.89739758,  0.19788095],
       [-1.3573491 , -1.10082345, -0.9967238 ,  0.13964124,  0.34343148],
       [-1.26842548, -0.59512089,  0.24571419,  0.4371083 ,  0.59784321],
       [-0.11640946,  0.4537372 ,  0.64782353,  0.75066187,  0.82273143],
       [ 0.01277631,  0.84951675,  1.04257973,  1.46552222,  2.06941822]])
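
One detail worth noting about the cell above: `my_array.sort()` sorts the array in place. If we want to keep the original order, `np.sort` returns a sorted copy instead. A minimal sketch, using a made-up array called `scores`:

```python
import numpy as np

scores = np.array([3, 1, 2])

ordered = np.sort(scores)   # returns a sorted copy
print(scores)               # [3 1 2]  (unchanged)
print(ordered)              # [1 2 3]

scores.sort()               # sorts in place and returns None
print(scores)               # [1 2 3]
```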

## Saving and Loading Numpy files
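
A minimal sketch of saving and loading: `np.save` writes a single array to a binary `.npy` file and `np.load` reads it back. The file name `my_array.npy` and the use of a temporary directory are assumptions made for this illustration; in practice you would choose your own path.

```python
import os
import tempfile
import numpy as np

my_array = np.arange(6).reshape((2, 3))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_array.npy')   # hypothetical file name
    np.save(path, my_array)        # write the array to disk
    restored = np.load(path)       # read it back as a new array

print(np.array_equal(my_array, restored))   # True
```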

# Pandas Basics

Pandas is built on top of Numpy, but extends it in many ways. One of these is the use of labels for rows and columns, a convention that originates in native dictionaries. But perhaps the most interesting feature of Pandas is that it includes many methods to perform exploratory and statistical analytics directly on the core Pandas data type, the **DataFrame**.

## Series

A series is a one-dimensional structure that is just like a Numpy array, with the difference that it also stores information about the index of each row, which by default is initialised as integer numbers.

```python
import pandas as pd

sx = pd.Series([1,4,-9,0])
```

We will see shortly how we can manipulate the index part of a series and call the rows of our series with any name we like.






In [None]:
# from numpy to series
import pandas as pd

sx = pd.Series([1,4,-9,-3])
print('my first pandas series', sx, '\n')

# on a series object I can get the underlying numpy values
print('a series has an underlying numpy array:')
print(sx.values, '\n')

print('and an added index')
print(sx.index, '\n')

print('I can redefine the index:')
sx.index = ['a', 'b', 'c', 'd']
print(sx, '\n')

print('dictionary style indexing, no problem')
print(sx['c'], '\n')

print('Boolean indexing, no problem')
print(sx[[True, False, False, True]], '\n')

print('Fancy indexing, no problem')
print(sx[['a','c']], '\n')

print('check for index in Series, no problem')
'a' in sx

my first pandas series 0    1
1    4
2   -9
3   -3
dtype: int64 

a series has an underlying numpy array:
[ 1  4 -9 -3] 

and an added index
RangeIndex(start=0, stop=4, step=1) 

I can redefine the index:
a    1
b    4
c   -9
d   -3
dtype: int64 

dictionary style indexing, no problem
-9 

Boolean indexing, no problem
a    1
d   -3
dtype: int64 

Fancy indexing, no problem
a    1
c   -9
dtype: int64 



True

Indeed, if you have a native Python dictionary, it can be cast directly as a Pandas series:

In [None]:
my_dict = {'Lisboa': 12000, 'Porto': 14000, 'Faro': 1800, 'Aveiro': 8000}

my_series = pd.Series(my_dict)
my_series

Lisboa    12000
Porto     14000
Faro       1800
Aveiro     8000
dtype: int64

## DataFrame

We can safely think of a dataframe as a collection of series where:

1. All series share the same index
2. Each series has a name, which becomes the column name in the dataframe
3. Each series has its own dtype; not all series in a dataframe need to be of the same dtype
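
The three points above can be sketched by building a small dataframe from two named series. The names `cities`, `visitors` and `frame` are invented for this example, and `pd.concat(..., axis=1)` is just one of several ways to put series side by side.

```python
import pandas as pd

cities = pd.Series(['Lisboa', 'Porto'], name='city')
visitors = pd.Series([12000, 14000], name='visitors')

# combine two named series that share the default 0..n-1 index
frame = pd.concat([cities, visitors], axis=1)

print(frame.columns.tolist())    # ['city', 'visitors']
print(frame.dtypes)              # one dtype per column
print(type(frame['visitors']))   # each column is a Series
```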



In [3]:
import pandas as pd
data = {'region': ['murcia', 'murcia', 'murcia', 'valencia', 'valencia'],
        'year': [2000,2001,2002,2001,2002],
        'pop': [370, 382, 386, 740, 766]}

df = pd.DataFrame(data)
df

|   | region   | year | pop |
|---|----------|------|-----|
| 0 | murcia   | 2000 | 370 |
| 1 | murcia   | 2001 | 382 |
| 2 | murcia   | 2002 | 386 |
| 3 | valencia | 2001 | 740 |
| 4 | valencia | 2002 | 766 |


In [4]:
# in this case, because we are passing a dictionary Pandas assumes we want 
# everything from it, but we could have specified just a part of our dict

dfx = pd.DataFrame(data, columns=['region', 'year'])
dfx

|   | region   | year |
|---|----------|------|
| 0 | murcia   | 2000 |
| 1 | murcia   | 2001 |
| 2 | murcia   | 2002 |
| 3 | valencia | 2001 |
| 4 | valencia | 2002 |


In [5]:
# or even add a column we don't have values for yet
df = pd.DataFrame(data, columns=['region', 'year', 'debt'])
df


|   | region   | year | debt |
|---|----------|------|------|
| 0 | murcia   | 2000 | NaN  |
| 1 | murcia   | 2001 | NaN  |
| 2 | murcia   | 2002 | NaN  |
| 3 | valencia | 2001 | NaN  |
| 4 | valencia | 2002 | NaN  |


In [6]:
# getting rows by their index position using iloc

df.iloc[1]

region    murcia
year        2001
debt         NaN
Name: 1, dtype: object
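
While `iloc` selects by position, its counterpart `loc` selects by index label. The sketch below uses a small made-up dataframe called `small` with string labels so that the difference is visible. Note that a label slice with `loc` includes its end point, whereas a position slice with `iloc` excludes it.

```python
import pandas as pd

small = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['a', 'b', 'c'])

print(small.iloc[1])        # second row, by position
print(small.loc['b'])       # same row, by index label
print(small.iloc[0:2])      # position slice, end excluded: rows 'a' and 'b'
print(small.loc['a':'b'])   # label slice, end included: rows 'a' and 'b'
```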

In [7]:
# how to append a new row to a Pandas dataframe?
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead

new_row = {'region': 'valencia', 'year':2000, 'pop':736}
pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

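|   | region   | year | debt | pop   |
|---|----------|------|------|-------|
| 0 | murcia   | 2000 | NaN  | NaN   |
| 1 | murcia   | 2001 | NaN  | NaN   |
| 2 | murcia   | 2002 | NaN  | NaN   |
| 3 | valencia | 2001 | NaN  | NaN   |
| 4 | valencia | 2002 | NaN  | NaN   |
| 5 | valencia | 2000 | NaN  | 736.0 |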


In [8]:
# how to add a column to a pandas dataframe
# constant value, broadcasting

df['debt'] = 0
df

|   | region   | year | debt |
|---|----------|------|------|
| 0 | murcia   | 2000 | 0    |
| 1 | murcia   | 2001 | 0    |
| 2 | murcia   | 2002 | 0    |
| 3 | valencia | 2001 | 0    |
| 4 | valencia | 2002 | 0    |


In [10]:
import numpy as np
# how to add a column to a pandas dataframe
# automatically generated values in sequence

df['debt'] = np.arange(len(df))
df

|   | region   | year | debt |
|---|----------|------|------|
| 0 | murcia   | 2000 | 0    |
| 1 | murcia   | 2001 | 1    |
| 2 | murcia   | 2002 | 2    |
| 3 | valencia | 2001 | 3    |
| 4 | valencia | 2002 | 4    |


In [11]:
# how to add a column to a pandas dataframe
# specific values you got from e.g. a function

df['debt'] = [2.1, 2.0, 2.2, 1.5, 1.8]
df


|   | region   | year | debt |
|---|----------|------|------|
| 0 | murcia   | 2000 | 2.1  |
| 1 | murcia   | 2001 | 2.0  |
| 2 | murcia   | 2002 | 2.2  |
| 3 | valencia | 2001 | 1.5  |
| 4 | valencia | 2002 | 1.8  |


In [13]:
# selecting and slicing

dfx = df['year'] == 2001 # Boolean mask
dfy = df['debt'] > 0.2 # Boolean mask

# notice that here the "and" keyword does not work
# in this context we use the element-wise logical operators
# & for and
# | for or
# ~ for not

# also notice that when selecting and slicing gives you a single series
# that is the type you get instead of a DataFrame

df[dfx & dfy]

|   | region   | year | debt |
|---|----------|------|------|
| 1 | murcia   | 2001 | 2.0  |
| 3 | valencia | 2001 | 1.5  |
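
The `|` and `~` operators work in the same way as `&`. Here is a small sketch on a made-up dataframe called `table`. When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses.

```python
import pandas as pd

table = pd.DataFrame({'year': [2000, 2001, 2002],
                      'debt': [2.1, 2.0, 1.5]})

# each comparison needs its own parentheses, because & and | bind
# more tightly than == and >
either = table[(table['year'] == 2001) | (table['debt'] > 2.05)]
neither = table[~((table['year'] == 2001) | (table['debt'] > 2.05))]

print(either)    # rows 0 and 1
print(neither)   # row 2
```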


In [14]:
df['region'].unique()

array(['murcia', 'valencia'], dtype=object)

In [None]:
# selecting and slicing

df[df['year'] == 2002]

|   | region   | year | debt |
|---|----------|------|------|
| 2 | murcia   | 2002 | 2.2  |
| 4 | valencia | 2002 | 1.8  |
