# Excursion 1: Broadcasting, Advanced Array Manipulations and Indexing/Slicing

NumPy supports broadcasting, allowing operations between arrays of different shapes.

What does shape mean?

In [1]:
import numpy as np

array_a = np.array([1, 2, 3])
array_b = np.array([[10], [20], [30]])
print(array_a.shape)
print(array_b.shape)

(3,)
(3, 1)


NumPy can perform element-wise operations on arrays of different shapes by automatically expanding one or both arrays to compatible shapes. This lets you operate on them without explicitly resizing or replicating data, which makes computations more concise and efficient.

In [2]:
result = array_a + array_b
print("Array A:\n", array_a)
print("Array B:\n", array_b)
print("Broadcasting Result:\n", result)

Array A:
 [1 2 3]
Array B:
 [[10]
 [20]
 [30]]
Broadcasting Result:
 [[11 12 13]
 [21 22 23]
 [31 32 33]]
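
The shape rule behind this result can be checked directly. The following is a minimal sketch, assuming a NumPy version that provides `np.broadcast_shapes` (1.20 or newer). It shows the common shape both arrays are expanded to, and that incompatible shapes raise an error instead of being broadcast.

```python
import numpy as np

array_a = np.array([1, 2, 3])           # shape (3,)
array_b = np.array([[10], [20], [30]])  # shape (3, 1)

# Shapes are compared from the last dimension backwards; a dimension of
# size 1 (or a missing dimension) is stretched to match the other array.
common_shape = np.broadcast_shapes(array_a.shape, array_b.shape)
print(common_shape)  # (3, 3)

# Explicitly expanded versions of both arrays give the same sum
expanded_a = np.broadcast_to(array_a, common_shape)
expanded_b = np.broadcast_to(array_b, common_shape)
same_result = np.array_equal(expanded_a + expanded_b, array_a + array_b)
print(same_result)  # True

# Shapes (3,) and (2,) are not compatible, so NumPy raises a ValueError
incompatible = False
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as err:
    incompatible = True
    print("Cannot broadcast:", err)
```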


In practice, this makes NumPy especially powerful for working with matrices and the mathematics related to them.

Let's explore this quickly. We define two 2x2 matrices and compute both their matrix product (with [np.dot()](https://numpy.org/doc/2.1/reference/generated/numpy.dot.html)) and their element-wise product using NumPy:

In [3]:
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
print("Matrix A:\n", matrix_a)
print("Matrix B:\n", matrix_b)

matrix_a.shape

Matrix A:
 [[1 2]
 [3 4]]
Matrix B:
 [[5 6]
 [7 8]]


(2, 2)

In [4]:
matrix_product = np.dot(matrix_a, matrix_b)
print("Matrix Product:\n", matrix_product)

matrix_elementwise = matrix_a * matrix_b
print("Element-wise Product:\n", matrix_elementwise)

Matrix Product:
 [[19 22]
 [43 50]]
Element-wise Product:
 [[ 5 12]
 [21 32]]
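
As a small cross-check, the sketch below recomputes one entry of the matrix product by hand and compares `np.dot` with the `@` operator, which gives the same result for 2D arrays. The variable names `product_at` and `entry_00` are only chosen for this illustration.

```python
import numpy as np

matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

# For 2D arrays, the @ operator computes the same matrix product as np.dot
product_at = matrix_a @ matrix_b
print(np.array_equal(product_at, np.dot(matrix_a, matrix_b)))  # True

# Entry (0, 0) by hand: row 0 of A times column 0 of B, i.e. 1*5 + 2*7
entry_00 = matrix_a[0, 0] * matrix_b[0, 0] + matrix_a[0, 1] * matrix_b[1, 0]
print(entry_00)  # 19

# The element-wise product only multiplies entries at the same position
print((matrix_a * matrix_b)[0, 0])  # 5
```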


Finally, NumPy makes it very convenient to reshape arrays and to index and slice them:

In [5]:
original_array = np.arange(12)
reshaped_array = original_array.reshape(3, 4) # (rows, columns)
print("Original Array:\n", original_array)
print("Reshaped Array:\n", reshaped_array)

Original Array:
 [ 0  1  2  3  4  5  6  7  8  9 10 11]
Reshaped Array:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
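
Two details of `reshape` are worth a quick look. The sketch below is only an illustration: it uses `-1` to let NumPy infer one dimension, shows that the reshaped array here shares its data with the original, and shows that the total number of elements must not change.

```python
import numpy as np

original_array = np.arange(12)

# -1 asks NumPy to infer this dimension from the array size (12 / 3 = 4)
reshaped_array = original_array.reshape(3, -1)
print(reshaped_array.shape)  # (3, 4)

# In this case the reshaped array is a view: both names share the same data
reshaped_array[0, 0] = 100
print(original_array[0])  # 100

# 12 elements cannot be arranged as 5 x 3, so this raises a ValueError
reshape_failed = False
try:
    original_array.reshape(5, 3)
except ValueError as err:
    reshape_failed = True
    print("Invalid reshape:", err)
```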


In [6]:
print("Element at index 2:", original_array[2])
print("Slice (2:6):", original_array[2:6])

Element at index 2: 2
Slice (2:6): [2 3 4 5]


Indexing and slicing 2D arrays follows a similar logic:

In [7]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
#print("Original 2D Array:\n", matrix)
#print("First row:", matrix[0, :])  # All columns of the first row
#print("First column:", matrix[:, 0])  # All rows of the first column
#print("Submatrix (2x2 from top-left):\n", matrix[:2, :2])

print(matrix)
print("Row i=1:",  matrix[1],
      "with column i=2:", matrix[1][2]) # [row] [column]

[[1 2 3]
 [4 5 6]
 [7 8 9]]
Row i=1: [4 5 6] with column i=2: 6
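
The commented-out lines in the cell above hint at further slices. A short sketch of what they select, using the same `matrix`:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_row = matrix[0, :]      # all columns of the first row
first_column = matrix[:, 0]   # all rows of the first column
top_left = matrix[:2, :2]     # rows 0-1 and columns 0-1

print("First row:", first_row)        # [1 2 3]
print("First column:", first_column)  # [1 4 7]
print("Submatrix (2x2 from top-left):\n", top_left)

# matrix[1, 2] and matrix[1][2] address the same element (row 1, column 2)
print(matrix[1, 2] == matrix[1][2])   # True
```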


We will see that slicing data in NumPy differs from slicing in pandas. It is really important to understand these differences.
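
One of these differences can already be sketched here, assuming a pandas Series with the default integer labels: position-based slices exclude the end point in both libraries, while label-based slices with `.loc` include the end label.

```python
import numpy as np
import pandas as pd

values = np.arange(10, 15)   # [10 11 12 13 14]
series = pd.Series(values)   # default labels 0, 1, 2, 3, 4

# NumPy: positions 1 and 2, the end position 3 is excluded
print(values[1:3])

# pandas .iloc: position-based, end excluded as in NumPy
print(series.iloc[1:3].tolist())

# pandas .loc: label-based, the end label 3 is included
print(series.loc[1:3].tolist())
```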

# Excursion 2: Data manipulation

We now approach a first task to showcase the power of programming. 

So far, we have read in a file and structured its data in an object that proved quite useful, the pandas DataFrame. We also explored how to do mathematical operations and how to write a small interactive program. Now we will start to manipulate (i.e. work on) data so that we can use them in a different context and for different purposes.

We start with our *molecules.txt* file, which contains just a list of molecule names and their sum formulas. Using regular expressions as in Assignment 1, we can extract the chemical composition from these sum formulas and then use it, e.g. to pick only those molecules from our data set that contain at least one carbon atom (which we treat here as the organic molecules):

The `re` library provides very useful methods to find and replace regular expressions. 

In [8]:
import re

Let's start by building a regular expression. It is defined as a string with a leading `r` (a raw string). We can then use `re.search()` to find the first occurrence and `re.findall()` to find all occurrences of the regular expression in a string:

In [9]:
pattern = r"[0-9]*"
print(f're pattern: {pattern}')
string = "ahjdc6dsacM987cdsajkB"
digit_matches = re.findall(pattern, string)
print(f'{digit_matches}')

re pattern: [0-9]*
['', '', '', '', '', '6', '', '', '', '', '', '987', '', '', '', '', '', '', '', '']
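
The many empty strings appear because `*` also accepts zero digits. As a small sketch, using `+` (one or more) avoids them, and `re.search` returns a match object for the first occurrence only. The names `digit_runs` and `match` are just chosen for this example.

```python
import re

string = "ahjdc6dsacM987cdsajkB"

# + requires at least one digit, so no empty matches are returned
digit_runs = re.findall(r"[0-9]+", string)
print(digit_runs)  # ['6', '987']

# re.search returns a match object for the first occurrence (or None)
match = re.search(r"[A-Z]", string)
print(match.group(), match.start())  # M 10

# re.findall returns every capital letter in the string
print(re.findall(r"[A-Z]", string))  # ['M', 'B']
```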


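The regular expression `[0-9]*` matches zero or more consecutive digits, so `re.findall` returns the digit runs `'6'` and `'987'`, plus an empty string at every position where no digit is found. A pattern such as `[A-Z]` would instead match any single capital letter from A to Z.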

Other regular expressions exist: 

- `[abc]` matches any one character that is either 'a', 'b', or 'c'.
- `[a-z]` matches any one lowercase letter from 'a' to 'z'.
- `[0-9]` matches any one digit from '0' to '9'. Alternatively, use the `\d` metacharacter.
- `[^abc]` matches any one character that is not 'a', 'b', or 'c'.
- `[\w]` matches any single word character, i.e. a letter, a digit, or an underscore.
- `[\s]` matches any whitespace character, including space, tab, and newline.
- `*` matches zero or more repetitions of the preceding character or group.
- `(...)` captures parts of the match for later use.
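
To make the list above concrete, here is a short sketch that applies several of these building blocks to two made-up test strings (`text` and `"H2O"` are only example inputs):

```python
import re

text = "cab 42_x"

print(re.findall(r"[abc]", text))   # ['c', 'a', 'b']
print(re.findall(r"[^abc]", text))  # [' ', '4', '2', '_', 'x']
print(re.findall(r"\d", text))      # ['4', '2']
print(re.findall(r"\w", text))      # ['c', 'a', 'b', '4', '2', '_', 'x']
print(re.findall(r"\s", text))      # [' ']

# With groups, findall returns one tuple of captured parts per match
print(re.findall(r"([A-Z])(\d)", "H2O"))  # [('H', '2')]
```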

With that, we can build a pattern that matches an element symbol followed by the number of its occurrences:

In [10]:
pattern = r"([A-Z][a-z]*)([0-9]*)"
caffeine = 'C8H10N4O2'

elements = re.findall(pattern, caffeine)

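# loop over positions and look up each entry by its index
for i in range(len(elements)):
    print(elements[i])

# looping over the entries directly also works, but then no
# positional index is available, which can be useful sometimes
for element in elements:
    print(element)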

('C', '8')
('H', '10')
('N', '4')
('O', '2')
('C', '8')
('H', '10')
('N', '4')
('O', '2')
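
The tuples still contain the counts as strings. One possible next step, sketched here with a hypothetical helper `parse_formula` (not part of the original notebook), is to collect them in a dictionary with integer counts, treating an empty count as 1:

```python
import re

pattern = r"([A-Z][a-z]*)([0-9]*)"


def parse_formula(formula):
    """Return a dict mapping element symbols to integer counts."""
    counts = {}
    for symbol, number in re.findall(pattern, formula):
        # an empty string means the element occurs once
        count = int(number) if number else 1
        # add up repeated symbols, e.g. the two H groups in C2H5OH
        counts[symbol] = counts.get(symbol, 0) + count
    return counts


print(parse_formula("C8H10N4O2"))  # {'C': 8, 'H': 10, 'N': 4, 'O': 2}
print(parse_formula("C2H5OH"))     # {'C': 2, 'H': 6, 'O': 1}
```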


Regular expressions are incredibly powerful for tasks like:
- Searching for patterns in text.
- Extracting specific parts of a string.
- Replacing or modifying text based on patterns.
  
Common use cases for regular expressions are:
- Validation: Check if a string follows a specific format (e.g., email addresses, phone numbers).
- Extraction: Extract meaningful parts from text (e.g., dates, URLs, or chemical formulas).
- Search and Replace: Replace parts of a string that match a pattern.
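
The three use cases can be sketched with functions from the `re` module. The pattern `formula_pattern` below is only an assumption about what a simple sum formula looks like (capitalized element symbols with optional counts), not a complete chemical grammar.

```python
import re

# Assumed shape of a simple sum formula: one or more "Symbol + optional count"
formula_pattern = r"(?:[A-Z][a-z]*[0-9]*)+"

# Validation: does the whole string follow the format?
valid = bool(re.fullmatch(formula_pattern, "C8H10N4O2"))
invalid = bool(re.fullmatch(formula_pattern, "c8h10n4o2"))
print(valid, invalid)  # True False

# Extraction: pull all numbers out of the string
numbers = re.findall(r"[0-9]+", "C8H10N4O2")
print(numbers)  # ['8', '10', '4', '2']

# Search and replace: replace every run of digits with '#'
masked = re.sub(r"[0-9]+", "#", "C8H10N4O2")
print(masked)  # C#H#N#O#
```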

#### Expanding a pandas DataFrame 

That worked. Let's now use this to expand our pandas DataFrame of molecule names and molecule sum formulas with columns indicating the abundances of different elements. 

Let's first introduce a way to loop over the rows of a DataFrame, using `df.index`:

In [11]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

# print(df)
print(df.index)
print(f'**Type of df.index**: {type(df.index)}')

for idx in df.index:
    # idx == index over dataframe.index
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        print(element)

RangeIndex(start=0, stop=50, step=1)
**Type of df.index**: <class 'pandas.core.indexes.range.RangeIndex'>
('H', '2')
('O', '')
('C', '')
('H', '4')
('C', '2')
('H', '5')
('O', '')
('H', '')
('C', '6')
('H', '12')
('O', '6')
('C', '12')
('H', '22')
('O', '11')
('C', '')
('O', '2')
('C', '')
('O', '')
('O', '2')
('N', '2')
('N', '')
('H', '3')
('H', '2')
('O', '2')
('C', '')
('H', '3')
('C', '')
('O', '')
('O', '')
('H', '')
('C', '')
('H', '2')
('O', '')
('C', '6')
('H', '6')
('C', '6')
('H', '5')
('O', '')
('H', '')
('C', '7')
('H', '8')
('C', '3')
('H', '6')
('O', '')
('C', '4')
('H', '10')
('C', '4')
('H', '10')
('C', '3')
('H', '8')
('C', '5')
('H', '12')
('C', '6')
('H', '14')
('C', '7')
('H', '16')
('C', '8')
('H', '18')
('C', '2')
('H', '6')
('O', '2')
('C', '3')
('H', '8')
('O', '3')
('H', '')
('N', '')
('O', '3')
('H', '2')
('S', '')
('O', '4')
('H', '')
('Cl', '')
('Na', '')
('Cl', '')
('Ca', '')
('C', '')
('O', '3')
('Na', '')
('H', '')
('C', '')
('O', '3')
('Na', '')
('O', '

Now, we create a new column for each element with the element symbol as header and containing its abundance per molecule. 

To do so, each time we process a row and parse its sum formula, we check whether a column for the element already exists and create it if it does not. We use `df.columns` to access the columns of the DataFrame.

In [12]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:    
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        if element[0] not in df.columns:
            df[element[0]] = 0

df

Unnamed: 0,name,sum formula,H,O,C,N,S,Cl,Na,Ca,K,P
0,Water,H2O,0,0,0,0,0,0,0,0,0,0
1,Methane,CH4,0,0,0,0,0,0,0,0,0,0
2,Ethanol,C2H5OH,0,0,0,0,0,0,0,0,0,0
3,Glucose,C6H12O6,0,0,0,0,0,0,0,0,0,0
4,Sucrose,C12H22O11,0,0,0,0,0,0,0,0,0,0
5,Carbon dioxide,CO2,0,0,0,0,0,0,0,0,0,0
6,Carbon monoxide,CO,0,0,0,0,0,0,0,0,0,0
7,Oxygen,O2,0,0,0,0,0,0,0,0,0,0
8,Nitrogen,N2,0,0,0,0,0,0,0,0,0,0
9,Ammonia,NH3,0,0,0,0,0,0,0,0,0,0


The last step is to fill each entry with the count from the parsed sum formula (careful: our regular expression returns an empty string when the count is actually 1):

In [36]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:    
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        if element[0] not in df.columns:
            df[element[0]] = 0
        if element[1] == '':
            df.loc[idx, element[0]] = 1
        else:
            df.loc[idx, element[0]] = int(element[1])

df

Unnamed: 0,name,sum formula,H,O,C,N,S,Cl,Na,Ca,K,P
0,Water,H2O,2,1,0,0,0,0,0,0,0,0
1,Methane,CH4,4,0,1,0,0,0,0,0,0,0
2,Ethanol,C2H5OH,1,1,2,0,0,0,0,0,0,0
3,Glucose,C6H12O6,12,6,6,0,0,0,0,0,0,0
4,Sucrose,C12H22O11,22,11,12,0,0,0,0,0,0,0
5,Carbon dioxide,CO2,0,2,1,0,0,0,0,0,0,0
6,Carbon monoxide,CO,0,1,1,0,0,0,0,0,0,0
7,Oxygen,O2,0,2,0,0,0,0,0,0,0,0
8,Nitrogen,N2,0,0,0,2,0,0,0,0,0,0
9,Ammonia,NH3,3,0,0,1,0,0,0,0,0,0
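
Note that the loop above assigns each count, so when an element appears more than once in a formula (as in `C2H5OH`), the last occurrence overwrites the earlier one; the table accordingly shows H = 1 for ethanol. A possible variant that adds the counts up instead is sketched below. It uses a small stand-in DataFrame rather than `molecules.txt`, assuming the same two columns.

```python
import re
import pandas as pd

# Small stand-in for molecules.txt (assumption: same column names)
df = pd.DataFrame({
    'name': ['Water', 'Ethanol', 'Acetic acid'],
    'sum formula': ['H2O', 'C2H5OH', 'CH3COOH'],
})
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:
    for symbol, number in re.findall(pattern, df.loc[idx, 'sum formula']):
        if symbol not in df.columns:
            df[symbol] = 0
        # add to the running count instead of overwriting it
        df.loc[idx, symbol] += int(number) if number else 1

print(df)
```

With this variant, ethanol ends up with 6 hydrogen atoms and acetic acid with 2 carbon, 4 hydrogen, and 2 oxygen atoms.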


#### Slicing a pandas DataFrame: Data selection/reduction



Now we can finally slice our DataFrame to reduce it to organics only (i.e. rows with `df['C']>0`), which is an extremely simple operation in pandas.

In [39]:
# pandas: df.loc[row label, column label]
# NumPy 2D arrays are also indexed as [row, column], but by integer position only

# accessing using column name - pandas specific!! (very useful)
df.loc[1, 'sum formula']

# with iloc, using only indexes
df.iloc[1, 1]

organics = df.loc[df['C']>0, :]

organics.reset_index(inplace = True)

organics

Unnamed: 0,index,name,sum formula,H,O,C,N,S,Cl,Na,Ca,K,P
0,1,Methane,CH4,4,0,1,0,0,0,0,0,0,0
1,2,Ethanol,C2H5OH,1,1,2,0,0,0,0,0,0,0
2,3,Glucose,C6H12O6,12,6,6,0,0,0,0,0,0,0
3,4,Sucrose,C12H22O11,22,11,12,0,0,0,0,0,0,0
4,5,Carbon dioxide,CO2,0,2,1,0,0,0,0,0,0,0
5,6,Carbon monoxide,CO,0,1,1,0,0,0,0,0,0,0
6,11,Acetic acid,CH3COOH,1,1,1,0,0,0,0,0,0,0
7,12,Formaldehyde,CH2O,2,1,1,0,0,0,0,0,0,0
8,13,Benzene,C6H6,6,0,6,0,0,0,0,0,0,0
9,14,Phenol,C6H5OH,1,1,6,0,0,0,0,0,0,0
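
What happens in `df.loc[df['C']>0, :]` can be broken into two steps: the comparison first produces a boolean Series, and `.loc` then keeps the rows where it is `True`. The sketch below uses a small made-up DataFrame; `reset_index(drop=True)` is one way to avoid the extra `index` column seen above, and `.copy()` makes the subset explicitly independent of `df`.

```python
import pandas as pd

# Made-up example data with the element counts already filled in
df = pd.DataFrame({
    'name': ['Water', 'Methane', 'Carbon dioxide'],
    'sum formula': ['H2O', 'CH4', 'CO2'],
    'C': [0, 1, 1],
    'O': [1, 0, 2],
})

# Step 1: the comparison gives one True/False value per row
mask = df['C'] > 0
print(mask.tolist())  # [False, True, True]

# Step 2: keep the rows where the mask is True, then renumber them;
# drop=True discards the old index instead of adding an 'index' column
organics = df.loc[mask, :].copy().reset_index(drop=True)
print(organics)
```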


As boolean statements can be combined, we can also slice the DataFrame on combinations of conditions:

In [40]:
# df.loc == locate rows that match the statement
# df['C'] > 0 ... all rows with a value in 'C'-column bigger than 0
# df['O'] > 0 ... all rows with a value in 'O'-column bigger than 0
# & is a logical and == both criteria must be met!
# | would be a logical or: one or the other must be met (what about both???)


ox_organics = df.loc[(df['C']>0) & (df['O']>0), :]

ox_organics

Unnamed: 0,name,sum formula,H,O,C,N,S,Cl,Na,Ca,K,P
2,Ethanol,C2H5OH,1,1,2,0,0,0,0,0,0,0
3,Glucose,C6H12O6,12,6,6,0,0,0,0,0,0,0
4,Sucrose,C12H22O11,22,11,12,0,0,0,0,0,0,0
5,Carbon dioxide,CO2,0,2,1,0,0,0,0,0,0,0
6,Carbon monoxide,CO,0,1,1,0,0,0,0,0,0,0
11,Acetic acid,CH3COOH,1,1,1,0,0,0,0,0,0,0
12,Formaldehyde,CH2O,2,1,1,0,0,0,0,0,0,0
14,Phenol,C6H5OH,1,1,6,0,0,0,0,0,0,0
16,Acetone,C3H6O,6,1,3,0,0,0,0,0,0,0
24,Ethanediol,C2H6O2,6,2,2,0,0,0,0,0,0,0


Here the `&` operator ([bitwise AND](https://www.geeksforgeeks.org/difference-between-and-and-in-python/)) makes sure that a row is included in the final DataFrame only if both statements are true for that row. Similarly, `|` is the bitwise equivalent of OR.
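
A small sketch with made-up data shows the three element-wise operators side by side. It also answers the question from the comment above: `|` is an inclusive or, so rows that meet both conditions are kept as well. The parentheses around each comparison in `(df['C']>0) & (df['O']>0)` are needed because `&` and `|` bind more tightly than `>`.

```python
import pandas as pd

# Made-up element counts for four molecules
df = pd.DataFrame({
    'name': ['Water', 'Methane', 'Carbon dioxide', 'Nitrogen'],
    'C': [0, 1, 1, 0],
    'O': [1, 0, 2, 0],
})

has_c = df['C'] > 0
has_o = df['O'] > 0

both = df.loc[has_c & has_o, 'name'].tolist()    # AND: both conditions
either = df.loc[has_c | has_o, 'name'].tolist()  # OR: at least one condition
no_c = df.loc[~has_c, 'name'].tolist()           # NOT: condition is False

print(both)    # ['Carbon dioxide']
print(either)  # ['Water', 'Methane', 'Carbon dioxide']
print(no_c)    # ['Water', 'Nitrogen']
```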

#### Indexing a pandas DataFrame: The subtle differences

We have seen that indexing can be done in two ways for a `pd.DataFrame`:

In [48]:
organics['sum formula'][28]

'C2H4'

In [17]:
organics.loc[42, 'sum formula']

'C8H10N4O2'

The latter (`.loc`) is recommended because it ensures that the original `pd.DataFrame` is modified, while the former sometimes modifies only a copy, which is created by `df['A']` and then accessed with `df['A']['B']`. Currently this raises a warning but still works in some cases; in the future it will never work, see the warning message below:

In [44]:
organics['sum formula'][42] = 'c8h10n4o2'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  organics['sum formula'][42] = 'c8h10n4o2'
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  organics['sum formula'][42] = 'c8h10n4o2'
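
A way to avoid this warning, sketched here on a small made-up DataFrame, is to assign through a single `.loc` call and to take an explicit `.copy()` when a subset should be modified independently of the original:

```python
import pandas as pd

# Made-up two-row example
df = pd.DataFrame({
    'name': ['Methane', 'Caffeine'],
    'sum formula': ['CH4', 'C8H10N4O2'],
})

# One .loc call with row label and column name modifies df itself
df.loc[1, 'sum formula'] = 'c8h10n4o2'
print(df.loc[1, 'sum formula'])  # c8h10n4o2

# An explicit copy makes clear that the subset is a separate object
subset = df.loc[df['name'] == 'Methane', :].copy()
subset.loc[0, 'sum formula'] = 'ch4'
print(subset.loc[0, 'sum formula'])  # ch4
print(df.loc[0, 'sum formula'])      # CH4 (the original is unchanged)
```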


# Let's make a random choice

In [91]:
import numpy as np

kreuzerl_liste = ['Alexander', 'Vicent', 'Benjamin', 'Dmytro', 'Riddhi', 'Adéla', 'Balazs', 'Daniela', 'Laura', 'Itzi', 'Anton', 'Yagmur', 'Markus', 'Florian']

print(np.random.choice(kreuzerl_liste))

Yagmur
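
If the draw should be reproducible, one option is NumPy's `Generator` interface with a fixed seed. The sketch below uses placeholder names and an arbitrary seed of 42; both are assumptions made only for this example.

```python
import numpy as np

names = ['Ada', 'Grace', 'Linus']  # placeholder names

# A generator with a fixed seed always produces the same sequence of draws
first = np.random.default_rng(seed=42).choice(names)
second = np.random.default_rng(seed=42).choice(names)
print(first == second)  # True

# Draw two different names (no repetition) from one generator
rng = np.random.default_rng(seed=42)
pair = rng.choice(names, size=2, replace=False)
print(pair)
```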


# Excursion 3: Namespaces

We have seen that after importing `numpy`, we can access the available functions and classes with the `.` operator. 

This is because importing `numpy` creates a new *namespace* that contains all the functions belonging to the numpy module.

*Why are namespaces important?*

Namespaces help to organize and manage names so there are no naming conflicts, especially in larger programs.

#### Types of Namespaces in Python

Python has different kinds of namespaces depending on the **lifetime** and **scope** of the names. The main types are:

1. **Built-in Namespace**
2. **Global Namespace**
3. **Local Namespace**
4. **Enclosing Namespace**

Let's explore each of these with examples!


There is a built-in namespace. This includes all the built-in functions which can be just accessed by their names.
`print` and `len` are part of the **built-in namespace**:

In [20]:
print(len("Hello"))

5


Next comes the **global namespace**, which includes all variables defined at the top level of a script or notebook (capital letters are often used to name global variables):

In [21]:
MY_GLOBAL_VAR = 10


def my_function():
    print(MY_GLOBAL_VAR)


my_function()

10


So even though `MY_GLOBAL_VAR` was defined outside of the function `my_function()` (in the global namespace), Python knows where to look up its value when the name is used inside the function.

This is different if a variable is defined in a **local namespace**:

In [22]:
def my_function():
    my_local_var = 5
    print(my_local_var)


my_function()

5


The variable `my_local_var` cannot be accessed outside of the function `my_function()`!
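
A short sketch of what happens when the local name is used outside the function: Python raises a `NameError`, which is caught here only so that the example keeps running. The flag `caught` is just a helper for this illustration.

```python
def my_function():
    my_local_var = 5
    return my_local_var


print(my_function())  # 5

# Outside the function, the local name does not exist
caught = False
try:
    print(my_local_var)
except NameError as err:
    caught = True
    print("NameError:", err)
```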

This concept is continued with namespaces which **enclose** each other, so a function in a function for example:

In [23]:
def outer_function():
    my_enclosed_var = 20  # Enclosing namespace


    def inner_function():
        print(my_enclosed_var)  # Accesses 'my_enclosed_var' from the enclosing namespace


    inner_function()


outer_function()

20


#### The LEGB Rule: Name Resolution in Python

When Python encounters a name, it resolves it using the **LEGB rule**. Python searches for the name in the following order:

1. **L**ocal: The innermost scope (inside the function or method).
2. **E**nclosing: The scope of the enclosing function (if the function is nested).
3. **G**lobal: The top-level namespace of the module/script.
4. **B**uilt-in: The built-in namespace (e.g., `print()` or `len()`).

If Python cannot find the name in any of these scopes, it raises a `NameError`.

In [24]:
MY_GLOBAL_VAR = 1  # Global namespace


def outer():
    my_enclosed_var = 2  # Enclosing namespace


    def inner():
        my_local_var = 3  # Local namespace
        print(MY_GLOBAL_VAR)  # From global namespace
        print(my_enclosed_var)  # From enclosing namespace
        print(my_local_var)  # From local namespace


    inner()


outer()

1
2
3
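
The search order becomes most visible when the same name exists in several scopes. In the following sketch (the name `x` and the function names are chosen only for illustration), each lookup stops at the first scope in which `x` is found:

```python
x = "global"


def outer():
    x = "enclosing"

    def inner_with_local():
        x = "local"
        return x  # found in the local scope

    def inner_without_local():
        return x  # not local, so found in the enclosing scope

    return inner_with_local(), inner_without_local()


def no_local():
    return x  # neither local nor enclosing, so found in the global scope


print(outer())      # ('local', 'enclosing')
print(no_local())   # global
print(len("abc"))   # 3, because len is found in the built-in namespace
```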


#### Checking Namespaces

Python provides the built-in functions `globals()` and `locals()` to inspect namespaces:
- **`globals()`**: Returns the global namespace as a dictionary.
- **`locals()`**: Returns the local namespace as a dictionary (inside functions).

In [25]:
MY_GLOBAL_VAR = 10


def my_function():
    my_local_var = 5
    print("Local namespace:", locals())  # Shows the local variables
    print("Global namespace:", globals())  # Shows global variables


my_function()

Local namespace: {'my_local_var': 5}
Global namespace: {'__name__': '__main__', '__doc__': 'Automatically created module for IPython interactive environment', '__package__': None, '__loader__': None, '__spec__': None, ... (output truncated)
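
In a notebook, the global namespace also holds the cell history and other internal entries, which makes the raw output hard to read. One possible way to get an overview, sketched below, is to list only the names that do not start with an underscore. What exactly appears in `public_names` depends on the environment, so no output is shown for it.

```python
MY_GLOBAL_VAR = 10


def my_function():
    my_local_var = 5
    # copy the local namespace into a regular dictionary
    return dict(locals())


print(my_function())  # {'my_local_var': 5}

# Names in the global namespace without a leading underscore
public_names = sorted(name for name in globals() if not name.startswith('_'))
print(public_names)
```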

#### Modules and Namespaces

When you import a module, Python creates a **namespace for the module**. You can access objects within the module using the dot `.` operator.

Example:
```python
import numpy
print(numpy.pi)  # Accessing 'pi' from the 'numpy' namespace
```

If you import using `from numpy import pi`, `pi` is placed directly into the global namespace, avoiding the need for the `numpy.` prefix.
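
Because each module has its own namespace, functions with the same name can coexist without conflict. A brief sketch using the standard-library `math` module next to `numpy`:

```python
import math
import numpy as np
from math import pi

# The same constant is reachable under three names
print(pi == math.pi == np.pi)  # True

# Two different functions called sqrt, kept apart by their namespaces
print(math.sqrt(4))     # 2.0
print(np.sqrt([4, 9]))  # [2. 3.]
```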

Moreover, we can also give the module's namespace a **new name** (an alias):

In [26]:
import numpy as np

print(np.log10(10))
print(np.exp(1))
print(np.cos(np.pi))

1.0
2.718281828459045
-1.0
