**Carl Stermann-Lücke**

## Session 2: Working with Data using NumPy & Pandas
* NumPy arrays: creation, indexing, operations
* Pandas DataFrames: loading, filtering, describing data
* Real dataset mini-example (e.g., Iris or simple CSV)

# Numpy

Numpy is a library that provides data structures for vectors, matrices and tensors, and functions for performing computations on these data structures. These data structures are collectively called "Numpy Arrays".
In machine learning, we often deal with large collections of numbers that are organized by indices, that is, as tensors.
As you have learned before, python also provides lists, and you might think that you could build arrays out of nested lists. But numpy arrays are optimized to allow easy access to specific elements and parts of the array, to perform operations on every element, and to do so much faster than we could do with lists.

The official website of numpy: https://numpy.org

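Since numpy is a library, you need to import it. Numpy is not part of the python standard library, so a plain python installation does not include it. Some distributions (for example Anaconda) ship it already; otherwise, install it first, for example with `pip install numpy`.
The convention is to import numpy as np, like so: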

In [1]:
import numpy as np
# Run this cell before anything else

## Numpy Arrays

We can make a numpy array of arbitrary numbers:

In [2]:
np.array([3,2,1,0])

array([3, 2, 1, 0])

The above Numpy Array has one dimension. That means the position of every number in the array is identified by exactly one index. For example, the first number in the array has the index 0.

In [3]:
myArray = np.array([3,2,1,0])
myArray[0]

3

Numpy Arrays can have more than one dimension. Think of it as a matrix or a tensor.

In [4]:
# 2 dimensional numpy array
my2dArray = np.array([[3,2],[1,0]])

my2dArray

array([[3, 2],
       [1, 0]])

In [5]:
# 3 dimensional numpy array
my3dArray = np.array([[[7,6],[5,4]],[[3,2],[1,0]]])

my3dArray

array([[[7, 6],
        [5, 4]],

       [[3, 2],
        [1, 0]]])

You can access the individual elements by providing more than one index:

In [6]:
print(my2dArray[0,1])
print(my3dArray[0,1,1])

2
4
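
As a small sketch of how the indices relate to the nesting: giving fewer indices than the array has dimensions returns a sub-array, and the `ndim` attribute reports the number of dimensions. The variable name `firstBlock` is only illustrative.

```python
import numpy as np

my3dArray = np.array([[[7, 6], [5, 4]], [[3, 2], [1, 0]]])

# ndim is the number of dimensions, i.e. how many indices identify one element
print(my3dArray.ndim)        # 3

# Giving only the first index returns the 2-dimensional sub-array at that position
firstBlock = my3dArray[0]
print(firstBlock)            # [[7 6]
                             #  [5 4]]

# Comma-separated indices and chained brackets select the same element
print(my3dArray[0, 1, 1])    # 4
print(my3dArray[0][1][1])    # 4
```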


Now let's compare operations on lists vs numpy arrays:

In [7]:
list1 = [1,2,3,4]
list2 = [1,2,3,4]
list3 = [[1,2,3,4],[1,2,3,4]]

In [8]:
list1 + list2

[1, 2, 3, 4, 1, 2, 3, 4]

In [9]:
list3 + list1

[[1, 2, 3, 4], [1, 2, 3, 4], 1, 2, 3, 4]

In [10]:
list1 * list2

TypeError: can't multiply sequence by non-int of type 'list'

Since multiplying two lists with each other is not defined, you see an error here. Not so with numpy arrays...

In [11]:
numpyarray1 = np.array([1,2,3,4])
numpyarray2 = np.array([1,2,3,4])
numpyarray3 = np.array([[1,2,3,4],[1,2,3,4]])

In [12]:
numpyarray1 + numpyarray2

array([2, 4, 6, 8])

In [13]:
numpyarray3 + numpyarray1

array([[2, 4, 6, 8],
       [2, 4, 6, 8]])

In [14]:
numpyarray1 * numpyarray2

array([ 1,  4,  9, 16])

In [15]:
numpyarray3 * numpyarray1

array([[ 1,  4,  9, 16],
       [ 1,  4,  9, 16]])
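
The operations on `numpyarray3` (two rows) and `numpyarray1` (one row) work because numpy applies the one-dimensional array to every row, a behaviour that numpy calls broadcasting. The following sketch reuses the arrays from above and also shows how multiplying by a plain number differs between lists and arrays. The names `rowWiseSum` and `listVersion` are only illustrative.

```python
import numpy as np

list1 = [1, 2, 3, 4]
numpyarray1 = np.array([1, 2, 3, 4])
numpyarray3 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])

# Multiplying a list by an integer repeats the list
print(list1 * 2)            # [1, 2, 3, 4, 1, 2, 3, 4]

# Multiplying an array by a number multiplies every element
print(numpyarray1 * 2)      # [2 4 6 8]

# Shapes (2, 4) and (4,) are compatible: the 1-d array is added to each row
print(numpyarray3.shape, numpyarray1.shape)   # (2, 4) (4,)
rowWiseSum = numpyarray3 + numpyarray1

# The same result written out with lists needs an explicit loop over the rows
listVersion = [[a + b for a, b in zip(row, list1)] for row in numpyarray3.tolist()]
print(rowWiseSum.tolist() == listVersion)     # True
```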

The elements of a numpy array can be of any type, but all elements of one array share the same type. If you pass values of different types, numpy chooses a common type and converts every element to it. For instance, if you mix integers and strings, all elements become strings. The type is fixed when the array is created: if you later assign a value of a different type to a position in the array, the new value is converted to the array's type, if that is possible. Look at the following examples.

In [16]:
x1 = np.array([1,2,3,4])
x2 = np.array([1,2,3,"C"])
print(type(x1[0]), type(x2[0]))

<class 'numpy.int32'> <class 'numpy.str_'>


x1 is an array of integers.
x2 is an array of strings.
Now what happens if we replace the only "actual" string in x2 by an integer?

In [17]:
x2[3] = 4
print(x2)
print(type(x2[3]))

['1' '2' '3' '4']
<class 'numpy.str_'>


You see that, even though we have changed the last element of x2 to an integer, the type still remains string.

And what happens if we include a non-integer (float) in the integer array x1?

In [18]:
x1[3] = 7.7
print(x1)
print(type(x1[3]))

[1 2 3 7]
<class 'numpy.int32'>


You see that the type of the element at position 3 of x1 is still integer, even though we have put a float number there: the value 7.7 was cut off to 7 to fit the type of the array.

Does that also work with strings?

In [19]:
x1[3] = "C"
print(x1)
print(type(x1[3]))

ValueError: invalid literal for int() with base 10: 'C'

The string "C" cannot be converted to an integer, so we cannot put it into an integer array.

A side note: if you put integers, strings, and booleans into one numpy array, all elements become strings.
Because a numpy array holds elements of a single type, numpy can compute on it faster than python can on a list. Also note that if an array mixes booleans with number types (float, integer), numpy converts the boolean `True` to 1 and `False` to 0.
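
To get a feeling for the speed difference, one can time the same element-wise operation on a list and on an array. This is only a rough sketch using the standard-library `timeit` module. The array size, the repetition count and the function names are arbitrary choices, and the measured times depend on the machine, so no numbers are shown here.

```python
import timeit
import numpy as np

n = 100_000
numbersList = list(range(n))
numbersArray = np.array(numbersList)

def doubleList():
    # Pure python: visit every element in a loop
    return [x * 2 for x in numbersList]

def doubleArray():
    # Numpy: one operation on the whole array
    return numbersArray * 2

# Both approaches give the same values
print(doubleList() == doubleArray().tolist())

# Time 10 repetitions of each approach (results are in seconds)
listTime = timeit.timeit(doubleList, number=10)
arrayTime = timeit.timeit(doubleArray, number=10)
print("list: ", listTime)
print("array:", arrayTime)
```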

For our purposes, it only makes sense to have numpy arrays of integers, floats or booleans.
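
The type shared by all elements can be read from the `dtype` attribute of the array. As a small illustrative sketch (the variable names are made up for this example), the following shows the boolean-to-number conversion and one possible way to avoid the cut-off seen with `x1`, namely converting the integer array to floats with `astype` before assigning.

```python
import numpy as np

# Booleans mixed with numbers: True becomes 1, False becomes 0
mixedArray = np.array([True, False, 2, 3.5])
print(mixedArray.dtype)     # float64
print(mixedArray)           # [1.  0.  2.  3.5]

# An integer array silently cuts off the decimals of an assigned float ...
intArray = np.array([1, 2, 3, 4])
intArray[3] = 7.7
print(intArray)             # [1 2 3 7]

# ... so convert to a float array first if the decimals must be kept
floatArray = np.array([1, 2, 3, 4]).astype(float)
floatArray[3] = 7.7
print(floatArray)           # [1.  2.  3.  7.7]
```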

## Slicing: Accessing specific elements from a numpy array

Slicing means accessing a subsection of a numpy array.
The following examples show how it works.

Let's experiment with a simple one-dimensional numpy array:

In [21]:
simpleArray = np.array([19,8,7,1,5,4])

In [22]:
firstElement = simpleArray[0]
firstElement
# Numpy arrays start indexing with 0, just like lists.

In [23]:
lastElement = simpleArray[-1]
lastElement
# The second-to-last element would be simpleArray[-2].

In [24]:
withoutFirstAndLastElement = simpleArray[1:-1]
withoutFirstAndLastElement
# x:y means every element from position x (inclusive) to position y (exclusive). If x is not given, it takes elements from the start. If y is not given, it takes elements to the end.

How does that work with a two-dimensional numpy array?

In [25]:
myArray = np.array([[1,2,3],[4,5,6],[7,8,9]])

In [26]:
firstRow = myArray[0,:]
firstRow

In [27]:
firstColumn=myArray[:,0]
firstColumn

In [28]:
secondColumn = myArray[:,1]
secondColumn

In [29]:
oddRowsEvenColumns = myArray[1::2,0::2]
oddRowsEvenColumns
# x:y:z means every element from x (inclusive) to y (exclusive) in steps of z. Like before, if x is not given, it takes elements from the start. If y is not given, it takes elements to the end.
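
The slicing cells can be collected into one runnable sketch with the results written as comments, which makes it easier to check one's understanding of the `x:y:z` notation. The arrays are the same as in the cells above; the extra `[::2]` line is added only as one more illustration of a step.

```python
import numpy as np

simpleArray = np.array([19, 8, 7, 1, 5, 4])
print(simpleArray[0])       # 19
print(simpleArray[-1])      # 4
print(simpleArray[1:-1])    # [8 7 1 5]
print(simpleArray[::2])     # [19  7  5]

myArray = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(myArray[0, :])        # [1 2 3]
print(myArray[:, 0])        # [1 4 7]
print(myArray[:, 1])        # [2 5 8]

# Rows starting at index 1 in steps of 2, columns starting at index 0 in steps of 2
print(myArray[1::2, 0::2])  # [[4 6]]
```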

## Shape and Reshape

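You might want to change the dimensions of a numpy array. For this, you can use the `shape` attribute, which tells you the current dimensions, and the `reshape` method, which returns the same elements arranged in new dimensions.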

In [30]:
myarray = np.array([[2,5,6],[3,4,7]])

In [31]:
myarray.shape
# shape gives the dimensionality and the size of each dimension: here, myarray has 2 rows and 3 columns.

(2, 3)

In [32]:
myarray.reshape(-1,1)
# The size of the new array in each dimension is given. If you type -1, that dimension is going to be computed based on the number of elements in the original array.

array([[2],
       [5],
       [6],
       [3],
       [4],
       [7]])

In [33]:
myarray.reshape(-1)
# It's also possible to change the dimensionality. This is now a one-dimensional array.

array([2, 5, 6, 3, 4, 7])

In [34]:
myarray.reshape(2,1,-1)
# The dimensionality can also be increased. Only one parameter can be -1, because otherwise the sizes could not be inferred.
# The number of arguments inside the parentheses determines the dimensionality of the new array, and numpy replaces the -1 with the size that fits.
# The product of all sizes (with the -1 replaced) must equal the number of elements in the array.

array([[[2, 5, 6]],

       [[3, 4, 7]]])
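
As a short sketch of the rule that the sizes must multiply to the number of elements: the array below has 6 elements, so shapes such as (3, 2) or (2, 1, 3) work, while asking for 4 rows fails because 6 is not divisible by 4. The `size` attribute and the `try`/`except` are additions for this illustration; the latter is only there so that the example runs to the end.

```python
import numpy as np

myarray = np.array([[2, 5, 6], [3, 4, 7]])
print(myarray.size)                      # 6

print(myarray.reshape(3, 2))             # [[2 5]
                                         #  [6 3]
                                         #  [4 7]]
print(myarray.reshape(2, 1, -1).shape)   # (2, 1, 3)

# reshape does not change myarray itself; it still has its original shape
print(myarray.shape)                     # (2, 3)

# 6 elements cannot be arranged in 4 rows, so numpy raises a ValueError
try:
    myarray.reshape(4, -1)
except ValueError as error:
    print("reshape failed:", error)
```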

## linspace

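The linspace function creates evenly spaced numbers over an interval, with as many points as we ask for. This is useful whenever we need to discretize a line. For instance, `np.linspace(0,20,100)` places 100 points on a 20 meter stick, with the first point at 0 and the last at 20. These 100 points leave 99 gaps between them, so neighbouring points are 20/99 ≈ 0.202 meters apart, not 0.2. To chop the stick into 100 pieces of exactly 0.2 meters, we would need 101 points. The following code shows the case with 100 points: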

In [35]:
np.linspace(0,20,100)

array([ 0.        ,  0.2020202 ,  0.4040404 ,  0.60606061,  0.80808081,
        1.01010101,  1.21212121,  1.41414141,  1.61616162,  1.81818182,
        2.02020202,  2.22222222,  2.42424242,  2.62626263,  2.82828283,
        3.03030303,  3.23232323,  3.43434343,  3.63636364,  3.83838384,
        4.04040404,  4.24242424,  4.44444444,  4.64646465,  4.84848485,
        5.05050505,  5.25252525,  5.45454545,  5.65656566,  5.85858586,
        6.06060606,  6.26262626,  6.46464646,  6.66666667,  6.86868687,
        7.07070707,  7.27272727,  7.47474747,  7.67676768,  7.87878788,
        8.08080808,  8.28282828,  8.48484848,  8.68686869,  8.88888889,
        9.09090909,  9.29292929,  9.49494949,  9.6969697 ,  9.8989899 ,
       10.1010101 , 10.3030303 , 10.50505051, 10.70707071, 10.90909091,
       11.11111111, 11.31313131, 11.51515152, 11.71717172, 11.91919192,
       12.12121212, 12.32323232, 12.52525253, 12.72727273, 12.92929293,
       13.13131313, 13.33333333, 13.53535354, 13.73737374, 13.93
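
The spacing between neighbouring points can be checked directly. In this sketch, the optional `retstep=True` argument of `np.linspace` is used to return the step size along with the points; the variable names are illustrative.

```python
import numpy as np

# 100 points from 0 to 20 (both ends included) leave 99 gaps between them
points, step = np.linspace(0, 20, 100, retstep=True)
print(len(points))          # 100
print(step, 20 / 99)        # both values are approximately 0.202

# To cut a 20 meter stick into 100 pieces of 0.2 meters, 101 points are needed
cutPoints, pieceLength = np.linspace(0, 20, 101, retstep=True)
print(len(cutPoints))       # 101
print(pieceLength)          # 0.2
```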

## arange

Arange is a function that generates arrays with a chosen spacing between the elements. It is similar to linspace, with the difference that the last argument gives the size of the steps rather than the number of points, and that the end value is excluded. It works like python's `range`, except that it returns a numpy array (see the section about `range`). The following examples illustrate this.

In [39]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [40]:
np.arange(1,10)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [41]:
np.arange(1,10,2)

array([1, 3, 5, 7, 9])
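
For comparison, here is a small sketch placing `range`, `np.arange` and `np.linspace` side by side. Unlike `range`, `np.arange` also accepts non-integer steps. The variable names are illustrative.

```python
import numpy as np

# Same start, stop and step as the built-in range, but the result is an array
print(list(range(1, 10, 2)))     # [1, 3, 5, 7, 9]
print(np.arange(1, 10, 2))       # [1 3 5 7 9]

# The stop value is excluded, and the step may be a float
quarterSteps = np.arange(0, 1, 0.25)
print(quarterSteps)              # [0.   0.25 0.5  0.75]

# linspace takes the number of points instead and includes the stop value
fivePoints = np.linspace(0, 1, 5)
print(fivePoints)                # [0.   0.25 0.5  0.75 1.  ]
```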