## 1.1 Our tools

In this course we will work in [**jupyter notebooks**](https://jupyter.org/) running [**python**](https://www.python.org/). What are these?

**python** is a programming language: 

> "A programming language is a system of notation for writing computer programs.
> Programming languages are described in terms of their **syntax** (form) and **semantics** (meaning), usually defined by a formal language.
> Languages usually provide features such as a type system, **variables** and mechanisms for error handling.
> An implementation of a programming language is required in order to execute programs, namely a compiler or an interpreter.
> An interpreter directly executes the source code, while a compiler produces an executable program."
>
> \- *Wikipedia*

**jupyter notebooks** are web-browser based environments for interactively hosting programs and other content:

> Jupyter Notebook (formerly IPython Notebook) is a **web-based interactive computational environment** for creating notebook documents.
> Jupyter Notebook is built using several open-source libraries, including **Python** (...).
> A Jupyter Notebook application is (...) browser-based (...) containing a (...) list of input/output cells which can contain code, text, mathematics, plots and rich media.
>
> \- *Wikipedia*

So that's where we are: In a **jupyter notebook** where we can execute **python** along with explanations, plots, etc. 

What is the **alternative** to **jupyter**?

- On a Windows PC, it's easiest to install python via the [**Anaconda**](https://www.anaconda.com/download) distribution.
- On Linux and Mac, basic python is typically pre-installed.
- From a **terminal** (e.g., the Anaconda Powershell Prompt on Windows), python scripts (files which contain python code) can be run simply with ``python script.py``.
- You can also start an interactive python shell by running ``python``, i.e. by calling the python interpreter without a script.
- However, it might be more useful to work with **IPython**, which is also the underlying interpreter of **jupyter notebooks**. In the Anaconda Powershell Prompt you call it via ``ipython3``.

## 1.2 Let's start with some python code

The simplest program is the classic **hello world**. We can execute it directly here in our jupyter cell and see its output:

In [None]:
print('hello world')

Python also knows **basic math**, so it can be used as a simple calculator:

In [None]:
4 + 4

In [None]:
3 * 3

In [None]:
2 / 2

In [None]:
1 - 1

In [None]:
2**3

Python is *dynamically* typed (unlike, e.g., C++), so we can make the following assignments without declaring the types of ``width`` and ``height``:

In [None]:
width = 2
height = 4.0
# we calculate the area now
area = width*height
print(area)

As in almost every programming language, python lets us write text into the code that is not executed, so-called **comments**.

In python, a comment begins with a hash sign (`#`) and runs to the end of the line, e.g. `# this is a comment`:

In [None]:
# this is a comment and doesn't do anything!

### Coding style

This course is not a programming course in which you learn every detail of what code should look like and how to structure entire programs properly.

Coding style is important to make code easily readable for other programmers, and there are some official conventions on what python code should look like.
These can be found in the [PEP8](https://peps.python.org/pep-0008/) coding conventions.

I really suggest that everybody has a look at this. I will try to follow those conventions as much as possible, but I am no programmer, so don't learn my *bad* style. 

There is even more to learn on how to structure python code such that entire sets of programs and packages can be constructed and easily distributed. To learn this, I suggest you take the excellent course [164.373 Programming for Chemists](https://tiss.tuwien.ac.at/course/courseDetails.xhtml?dswid=4614&dsrid=750&semester=2025S&courseNr=164373).

### Variables, types and built-in functions

#### Variables and their types

We have now seen that python can handle basic math operations, define variables and operate on these variables. There are also basic built-in functions which perform certain tasks:
- ``2 + 2`` was a simple math operation
- variables were created by assignment, like ``width = 2``; afterwards the name of the variable was used to perform a mathematical operation
- it is important to realize that every variable in programming typically has a value and a type, with the latter defining the meaning of the variable
- numbers can be e.g. of type ``int`` (integer) or ``float`` (floating point number)
- another type is the string (``str``), i.e. a series of characters
- the third important type is the boolean (``bool``), which can be either ``True`` or ``False``
- a variable always has the type of the value that is currently assigned to it, so there is typically no need to specify the type of a variable in python
- python also offers some built-in functions, which execute certain pre-defined pieces of code and often take arguments (they are also referred to as callables)
- the ``print()`` function interacts with the console I/O and outputs its argument
- the ``type()`` function returns the type of an object


In [None]:
type(width)

In [None]:
type(height)

In [None]:
type(area)

In [None]:
text = 'i am a string'
type(text)

In [None]:
boolean = True
type(boolean)
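Because the type belongs to the value rather than to the name, reassigning a variable can change its type. A minimal sketch of this behaviour (the variable name `quantity` is only illustrative):

```python
# The type follows the value that is currently assigned to the name
quantity = 2
type_before = type(quantity)      # int

quantity = quantity / 4           # true division always yields a float
type_after = type(quantity)       # float

quantity = str(quantity)          # explicit conversion to a string
print(type_before, type_after, type(quantity))
print(isinstance(quantity, str))
```

The built-in `isinstance()` is often the more convenient way to test for a type inside a program.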

#### Built-in functions

Built-in functions are very powerful and extremely important for any python code. It is recommended that their names are never used as variable names for anything else. 

The following table shows some of the most important built-in functions in python3:


| Built-in function    | Usage                                                                             |
| -------------------- | ----------------------------------------------------------------------------------|
| ``abs()``            | returns the absolute value of a number                                            |
| ``all()``            | returns ``True`` if all elements of the iterable are true                         |
| ``any()``            | returns ``True`` if any element of the iterable is true                           |
| ``bool()``           | returns a Boolean value, i.e. one of ``True`` or ``False``                        |
| ``callable()``       | returns ``True`` if the object argument appears callable                          |
| ``float()``          | returns a floating-point number constructed from a number or a string             |
| ``help()``           | invokes the built-in help system                                                  |
| ``input()``          | reads a line from input, converts it to a string, and returns that                |
| ``int()``            | returns an integer object constructed from a number or a string                   |
| ``isinstance()``     | returns ``True`` if the object argument is an instance of the classinfo argument  |
| ``len()``            | returns the length (the number of items) of an object                             |
| ``list()``           | creates a list from an iterable                                                   |
| ``min()``            | returns the smallest item in an iterable or the smallest of two or more arguments |
| ``max()``            | returns the largest item in an iterable or the largest of two or more arguments   |
| ``open()``           | opens a file and returns a corresponding file object                              |
| ``print()``          | prints objects to the text output stream (by default the terminal, or a file)     |
| ``round()``          | returns number rounded to n digits precision after the decimal point              |
| ``sum()``            | sums the items of an iterable from left to right and returns the total            |
| ``str()``            | returns a str version of object                                                   |
| ``type()``           | returns the type of an object                                                     |

All other built-in functions, together with their documentation, can be found in the [Python Docs](https://docs.python.org/3/library/functions.html).
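A few of the functions from the table can be tried out in a short sketch. The list `masses` holds made-up numbers that serve only as illustration (lists are introduced in more detail further below):

```python
# Hypothetical values, used only for illustration
masses = [12.011, 1.008, 15.999, -0.5]

print(len(masses))                  # number of items
print(min(masses), max(masses))     # smallest and largest item
print(abs(min(masses)))             # absolute value of the smallest item
total = sum(masses)
print(round(total, 2))              # total rounded to two decimals
print(int('42') + 1)                # string converted to an integer
print(float('3.5') * 2)             # string converted to a float
print(str(42) + ' is now text')     # number converted to a string
print(any(m < 0 for m in masses), all(m < 0 for m in masses))
```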

In [None]:
string_number = 'hello world'

### User-defined functions

Besides the built-in functions (not all of which are functions in the strict python sense, more on that later), we as users can define our own functions.
Functions take arguments, operate on them and return values. They are used to structure code (separating out well-defined operations and giving them reasonable names), but also to make certain chunks of code reusable.

Let's see how we define functions in python:

In [None]:
def exponential(x):
    expo = 1 + x + x**2/2 + x**3/(3*2) + x**4/(4*3*2) + x**5/(5*4*3*2) + x**6/(6*5*4*3*2)
    return expo


def maximum(x, y):
    if x > y:
        return x
    elif y > x:
        return y
    else:
        raise ValueError("no maximum found")

Here we find several new features of python. Let's go through them step-by-step:

1. The function is defined by the keyword ``def`` followed by the name of the function and, in parentheses, the arguments which the function requires as input (for ``maximum``, two variables named ``x`` and ``y``). These variables are available inside the function for manipulation.
2. The colon marks the end of the defining line of the function. The next line starts with an indent (4 spaces or a tab). This is always the case in python after a colon that ends such a header line (``def``, ``if``, ``for``, ...). Indentation is generally used to group lines of code together in python, here the body of the function ``maximum``. It replaces the extensive use of braces in other programming languages and makes code easily readable, as the user is forced to use indents.
3. Inside the function, we have an ``if``, ``elif``, ``else`` statement. These are simple conditional statements. If the condition after ``if`` is ``True``, then the following indented code is executed. If it is ``False``, the next condition (``elif``) is tested, until we reach ``else``, whose code is executed only if none of the previous ``if`` or ``elif`` conditions was ``True``.
4. The keyword ``return`` marks the return value of the function. Whenever it is executed, the function is terminated and the value after ``return`` is available as the output of the function.
5. Last, the function contains a built-in exception (``ValueError``), i.e. an error which is raised when neither ``x > y`` nor ``y > x`` holds, for example when both numbers are equal, such that no maximum exists. It is used together with the ``raise`` statement, which terminates the execution of the code at the point where the error is raised.

Let's see how the function works in practice:

In [None]:
a = 4
b = 2
result = maximum(a, b)
print(result)
a = 2
print(maximum(a, b))

We see that the first execution of ``maximum`` resulted in the correct answer. Here we used the function such that its return value is saved in a new variable, called ``result`` and subsequently that variable was printed to the output. 

In the second execution we called the function inside the print function, which should directly print the maximum to the output. However, as the variable ``a`` was modified in the meantime to have the same value as ``b``, no maximum can be found and the function raises an error. Python execution is **aborted** and the error message points us to the line of code where the execution fails. This can be extremely helpful in big programming projects. In addition, errors can be handled in code, such that certain types of errors do not directly cause the program to fail. For more details see the [Python Docs](https://docs.python.org/3/tutorial/errors.html).
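Handling an error instead of letting the program abort is done with a ``try`` / ``except`` statement. A minimal sketch, with the function ``maximum`` repeated so that the cell runs on its own (using ``None`` as the fallback value is just one possible choice):

```python
def maximum(x, y):
    # same function as above, repeated so that this cell is self-contained
    if x > y:
        return x
    elif y > x:
        return y
    else:
        raise ValueError("no maximum found")


try:
    result = maximum(2, 2)
except ValueError as error:
    # only a ValueError is caught here; any other error would still abort
    print('Handled:', error)
    result = None

print(result)
```

The code after the ``try`` / ``except`` block keeps running, which is why the final ``print`` is still executed.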

### Iterables and loops

#### Lists

The next essential type we are getting to know is the ``list``, which is a so-called Iterable. So far our data types stored individual pieces of data, i.e. one number, one string, one boolean.
If we want to store a collection of data, specifically an ordered collection of items, then we use a ``list``, which is a new datatype by itself. 

Let's see how lists work:

In [None]:
fruits = ['apples', 'bananas', 'oranges']
type(fruits)

We can also check whether a certain element is contained in a list with a very intuitive syntax:

In [None]:
'apples' in fruits

We defined a list using square brackets ``[]`` with the elements of the list separated by commas; here we had a list of strings. With the statement ``'apples' in fruits`` we were able to check whether a certain value was in that list, i.e. it returned the boolean ``True`` as the string ``'apples'`` was an item in that list. This also works for other types: we can also have a list of numbers and check whether a certain number is in that list.

Our list named ``fruits`` contained three items. We can actually obtain the length of the list with the built-in function ``len()``:

In [None]:
len(fruits)

We can access individual items of that list using the square brackets and the index of the element. We always start at 0 for the left-most element, but we can also index from the other end (using negative numbers): 

In [None]:
fruits[0]

In [None]:
fruits[-1]

In [None]:
print(fruits[-2])
print(fruits[1])

Here we saw that ``fruits[-2]`` returns the same item as ``fruits[1]``.

``list`` is a powerful data type which also gives access to many built-in methods, which operate on list objects. Basically, methods are functions which are attached to every object of a certain type. We will see later how this is connected to the ``class``.

One of the built-in methods for lists allows us to add a new item at the end of the list. This built-in method is called ``list.append()``; see the [Python Docs](https://docs.python.org/3/tutorial/datastructures.html) for the other methods available for the type ``list``.

In [None]:
fruits.append('plums')
fruits

In [None]:
len(fruits)

We can also remove items with such built-in methods and we then see that our ``'apples' in fruits`` call now returns ``False``:

In [None]:
fruits.remove('apples')
'apples' in fruits

If we now try to remove ``'apples'`` once again, we will get a ``ValueError`` raised, because it is not in the list anymore:

In [None]:
fruits.remove('apples')

Interestingly, **strings work very similarly to lists**, i.e. individual letters of a string can be found and accessed with the same syntax:

In [None]:
my_name = 'Dominik'
print(my_name[0])
print('i' in my_name)

But the difference is that the built-in methods associated with ``list`` are not available:

In [None]:
my_name.append(' Stolzenburg')
print(my_name)

Here we get an ``AttributeError``, which shows that this method is not part of the data type ``str``.
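Strings cannot be changed in place, but a new string can be built from existing ones. A small sketch, assuming that concatenation with ``+`` is what we want here (the variable ``full_name`` is introduced only for this illustration):

```python
my_name = 'Dominik'

# '+' creates a new string; the original string object is not modified
full_name = my_name + ' Stolzenburg'

print(full_name)
print(my_name)
print(len(full_name))
```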

#### Loops

Iterables are most powerful because they can be used within loops. The simplest syntax to loop over all elements of a list is the ``for`` loop:

In [None]:
drinks = []
drinks.append('whiskey')
drinks.append('aperol')
drinks.append('spritzer')

for k in drinks:
    print(k)

In that case, everything after the colon and within the following indented block was executed for each element in the list ``drinks``. So in the first iteration the code is executed as if ``k = 'whiskey'``, in the next iteration as if ``k = 'aperol'``, etc.

If you imagine that the ``list`` can contain thousands of elements, then you can execute thousands of operations with just two lines of code. That's the **power of loops**.

While `for` loops are often used to loop over the elements of a list, they are also often used to execute a certain operation a defined number of times. If a `for` loop is designed for that purpose, we often loop over a *fake* list of integer numbers (one per execution), either written out explicitly or generated with the built-in `range()`:

In [None]:
for i in [0, 1, 2, 3, 4]:
    print(i)

In [None]:
for i in range(5):
    print(i)

There is a third purpose of a loop: we want to execute some code over and over again as long as a condition is `True`. For that purpose python has the built-in `while` loop.

That way, we can also run an operation five times (here printing 'Hello'), but with a different syntax:

In [None]:
counter = 0
while counter < 5:
    print('Hello')
    counter = counter + 1
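A `while` loop is most useful when the number of iterations is not known in advance. The following sketch halves a made-up starting amount until it drops below a threshold; the numbers and variable names are chosen only for illustration:

```python
# Halve a hypothetical starting amount until it drops below 1
amount = 100.0
steps = 0
while amount >= 1.0:
    amount = amount / 2
    steps = steps + 1

print(steps, amount)
```

Note that the condition must eventually become `False`, otherwise the loop never terminates.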

#### Dictionaries

So far, we have learned about a container for several items, the `list`, which stores objects in an ordered manner and lets us retrieve these objects according to their position in the `list`. Lists are powerful tools which we can loop over with the help of `for` loops.

Dictionaries are another data type which you can iterate over, but they are really more like a classical dictionary book or a phone book. They are about relating different things to each other. In a phone book, you look for the name of a certain person and then you find their phone number. Or in a chemistry database you look for a molecule's name and you might find its enthalpy of combustion.

Unlike lists, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be strings and numbers but also other immutable types. A dictionary is a collection of key-value pairs. It is defined using curly braces `{}`.

In [None]:
dictionary_example = {"name": "Alice", "age": 25, "city": "New York"}

for key in dictionary_example.keys():
    print(key)
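Looking up a single value by its key, as in the phone book analogy, can be sketched as follows. The dictionary `phone_book` and its entries are made up for this illustration:

```python
# Hypothetical phone book: names are the keys, numbers are the values
phone_book = {"Alice": "555-0101", "Bob": "555-0199"}

print(phone_book["Alice"])            # look up a value by its key
print("Carol" in phone_book)          # the membership test checks the keys

# .get() returns a fallback value instead of raising a KeyError
print(phone_book.get("Carol", "unknown"))
```

Accessing a missing key with square brackets raises a `KeyError`, so `.get()` is a convenient choice when a key may be absent.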

Iterating over the values corresponding to the keys can be done in two different ways:

In [None]:
for value in dictionary_example.values():
    print(value)

for key in dictionary_example.keys():
    print(dictionary_example[key])



Iterating over key-value pairs is possible as well, which showcases another strength of `for` loops:

In [None]:
print("\nIterating over key-value pairs:")
for k, v in dictionary_example.items():
    print(str(k) + " : " + str(v))

So dictionaries apparently have at least three built-in methods, which can be called using the `.` operator.

These are `dictionary.keys()`, `dictionary.values()` and `dictionary.items()`. 

We can also manipulate dictionaries similar to lists, i.e., we can add new items and delete them:

In [None]:
dictionary_example["country"] = "USA"
print(dictionary_example)

del dictionary_example["age"]
print(dictionary_example)

Note that the syntax for deleting an item is different compared to the `list`. However, the syntax used now would also work on a `list`:

In [None]:
print(fruits)
del fruits[-1]
print(fruits)

## 1.3 First steps in python

---
**EXERCISE**

Write a function (called ``mapping``), which applies another function (e.g., a quadratic polynomial) to each element (in that case of type ``float``) of an Iterable (here a ``list``). 

---

We will do that as an assignment within **JupyterHub**, so that you see how the assignments work.

## 1.4 Python packages

While there are so-called built-in functions, which are contained in any basic python installation, the true power of the programming language only unfolds when we use certain `packages`, which provide a lot more pre-defined functionality than basic python. The most important packages, which we will use very frequently during this course, are:

- [Numpy](https://numpy.org/): The fundamental package for scientific computing with Python
- [Matplotlib](https://matplotlib.org/): Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python
- [Pandas](https://pandas.pydata.org/): Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool
- [Scipy](https://scipy.org/): Fundamental algorithms for scientific computing in Python
- [Scikit-learn](https://scikit-learn.org/stable/): Simple and efficient tools for predictive data analysis including machine learning

So how can we access these packages?

In [None]:
import numpy

We do that with the `import` statement. However, this only works if the packages are installed in your python distribution (environment). On the JupyterHub you do not need to care about this, as the most important packages are pre-installed. On your local machine, how to install packages from the internet depends on your setup. It is important to know that packages typically depend on other packages, and therefore some sort of package management needs to be used for package installation (more info on the [lecture slides](https://tuwel.tuwien.ac.at/mod/resource/view.php?id=2485853)).

After the import statement, we can access all functionality of the packages:

In [None]:
print(numpy.log10(10))
print(numpy.exp(1))
print(numpy.cos(numpy.pi))
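The `import` statement comes in a few variants. The sketch below uses the standard-library module `math` (an assumption made only so that the example runs without any extra installation); the same forms work for packages such as numpy:

```python
import math                  # access via the full module name
import math as m             # access via a shorter alias
from math import cos, pi     # import selected names directly

print(math.log10(10))
print(m.exp(1))
print(cos(pi))
```

Short aliases are a widespread convention for frequently used packages, and the form `from ... import ...` is handy when only a few names are needed.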

### Guess the sum formula: An example of all we learned so far

We are now building a sum formula guessing program, in which the user is prompted with the trivial name of a certain substance and needs to guess its sum formula, with the program answering whether the guess was correct and counting the points.

- First, we will build a dictionary which contains all the assignments:
Remember the syntax for dictionaries:  ```my_dict = {key1: value1, key2: value2, ...}```

In [None]:
molecule_dict = {
    "Water": "H2O",
    "Methane": "CH4",
    "Ethanol": "C2H5OH",
    "Glucose": "C6H12O6",
    "Sucrose": "C12H22O11",
    "Carbon dioxide": "CO2",
    "Carbon monoxide": "CO",
    "Oxygen": "O2",
    "Nitrogen": "N2",
    "Ammonia": "NH3",
    "Hydrogen peroxide": "H2O2",
    "Acetic acid": "CH3COOH",
    "Formaldehyde": "CH2O",
    "Benzene": "C6H6",
    "Phenol": "C6H5OH",
    "Toluene": "C7H8",
    "Acetone": "C3H6O",
    "Butane": "C4H10",
    "Isobutane": "C4H10",
    "Propane": "C3H8",
    "Pentane": "C5H12",
    "Hexane": "C6H14",
    "Heptane": "C7H16",
    "Octane": "C8H18",
    "Ethanediol (Ethylene glycol)": "C2H6O2",
    "Glycerol": "C3H8O3",
    "Nitric acid": "HNO3",
    "Sulfuric acid": "H2SO4",
    "Hydrochloric acid": "HCl",
    "Sodium chloride (table salt)": "NaCl",
    "Calcium carbonate": "CaCO3",
    "Sodium bicarbonate (baking soda)": "NaHCO3",
    "Sodium hydroxide": "NaOH",
    "Potassium hydroxide": "KOH",
    "Urea": "CH4N2O",
    "Hydrogen cyanide": "HCN",
    "Phosphoric acid": "H3PO4",
    "Methanol": "CH3OH",
    "Propionic acid": "C2H5COOH",
    "Acetylene": "C2H2",
    "Ethylene": "C2H4",
    "Chloroform": "CHCl3",
    "Caffeine": "C8H10N4O2",
    "Citric acid": "C6H8O7",
    "Lactic acid": "C3H6O3",
    "Ascorbic acid (Vitamin C)": "C6H8O6",
    "Aniline": "C6H5NH2",
    "Formic acid": "HCOOH",
    "Oxalic acid": "C2H2O4",
    "Uric acid": "C5H4N4O3"
}

- Second, we pull a list of all compound names. As our solutions are stored in a dictionary, we will pull the keys of that dictionary:

In [None]:
compounds = list(molecule_dict.keys())

- Third, we need to make a random choice, such that one of these compound names is prompted to the user. For that we will use `numpy.random`, which provides all sorts of functions for random distributions. Using the function `np.random.choice()` on the list of compounds, you'll see that you typically obtain a different compound every time you run the code below:

In [None]:
import numpy as np

compound = np.random.choice(compounds)

print(compound)

- Fourth, after selection of a compound, we want to ask the user for input of the sum formula. For that we will use the built-in function `input()`:

In [None]:
sumformula = input('Sum Formula >')

- Last, we check if the user-input was correct using if clauses:

In [None]:
if molecule_dict[compound] == sumformula:
    print('Correct!')
else:
    print('Wrong formula, the correct answer would have been: '
          + str(molecule_dict[compound]))

---
**EXERCISE**

Write a function (called `guess_sum_formula()`) which, when executed, lets the user play the guess-the-sum-formula game five times in a row, counts the correct answers and returns the score when finished.

---

## 1.5 The power of Numpy

Now, we will explore the basics of NumPy and highlight how it compares to standard Python lists.

To do so, we will use a library called `time`, which allows us to measure the computational time of certain code snippets.

In [None]:
import time

### Creating NumPy Arrays and performance comparison

NumPy arrays can be created from Python lists or generated using built-in functions.


In [None]:
python_list = [1, 2, 3, 4, 5]
numpy_array = np.array(python_list)
print("Python List:", python_list)
print("NumPy Array:", numpy_array)

Generating an array of zeros:

In [None]:
zeros_array = np.zeros(5)
print("Array of Zeros:", zeros_array)

Generating an array of evenly spaced values:

In [None]:
linspace_array = np.linspace(0, 1, 5)  # 5 values from 0 to 1
print("Linspace Array:", linspace_array)

Let's compare the performance of a simple operation: adding 1 to each element in a list/array.

First, we define a large dataset: 

In [None]:
size = 10**7
python_list_large = list(range(size))
numpy_array_large = np.arange(size)

numpy_array_large

Measure time for Python list

In [None]:
start = time.time()
python_list_result = [x + 1 for x in python_list_large]
end = time.time()
print("Time taken with Python list:", end - start, "seconds")

Measure time for NumPy array

In [None]:
start = time.time()
numpy_array_result = numpy_array_large + 1
end = time.time()
print("Time taken with NumPy array:", end - start, "seconds")

**On typical hardware, the NumPy version is more than 10 times faster!**
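A single measurement with `time.time()` fluctuates from run to run. As a hedged alternative, the standard-library module `timeit` repeats a measurement several times. The sketch below times only the list version and uses a smaller, made-up dataset so that it runs quickly; the helper function `add_one_to_all` is introduced just for this example:

```python
import timeit

data = list(range(10**5))  # smaller than above so that the repeats stay quick


def add_one_to_all():
    return [x + 1 for x in data]


# call the function 10 times per measurement and repeat the measurement 5 times
timings = timeit.repeat(add_one_to_all, number=10, repeat=5)
best = min(timings) / 10
print("Best time per call:", best, "seconds")
```

Taking the minimum of the repeats is a common choice, because slower runs are usually caused by other processes on the machine.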

**NumPy offers powerful features such as:**
 - Efficient operations on large datasets
 - Broadcasting for arrays of different shapes
 - Easy-to-use mathematical functions
 - Advanced array manipulation capabilities
 - Simplified matrix mathematics
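Broadcasting, mentioned in the list above, can be illustrated with a small sketch. The arrays `matrix` and `row` are made up for this example:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# the row is "stretched" along the first axis to match the matrix
shifted = matrix + row
# a scalar is broadcast to every element
scaled = matrix * 2

print(shifted)
print(scaled)
```

No explicit loop is needed: NumPy applies the operation element by element after matching the shapes.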

## 1.6 Reading and writing to files

NumPy is the base module for most of our work, but now we will get to know another package which comes in even more handy for dataset manipulation and is very helpful for reading in datasets from files.

This module is named **pandas**. 

### Read/write files with built-in functions

First we have a look at how python reads in a file with built-in functions.

In the same directory as our lecture notebook we have a file named *molecules.txt*. It contains the same list of molecules which we hard-coded above as a dictionary.

To open it with a built-in function, we use `open()`, see its documentation [here](https://docs.python.org/3/library/functions.html#open).

In [None]:
file = open('molecules.txt', 'r')

file

Apparently, `open()` returns a file handle, i.e. a variable which stores the open file object, an `_io.TextIOWrapper`. As arguments it takes the name (the path) of the file and a second argument, which specifies the mode in which the file is opened (`'r'` stands for read).

Now we can loop over the content of the file using our `for` loop, and afterwards we close the file again with `file.close()` (this is important so that we can start from scratch when opening the file again below):

In [None]:
for line in file:
    print(line)

file.close()
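Forgetting `file.close()` is a common mistake. Python's `with` statement closes the file automatically when the indented block ends. The sketch below first writes a small temporary file with made-up content (the file name `example_molecules.txt` is hypothetical), so that it runs independently of the lecture files:

```python
import os
import tempfile

# Write a small hypothetical file so that this example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'example_molecules.txt')
with open(path, 'w') as file:
    file.write('Water,H2O\n')
    file.write('Methane,CH4\n')

# The file is closed automatically when the indented block ends
lines = []
with open(path, 'r') as file:
    for line in file:
        lines.append(line)

print(lines)
print(file.closed)
```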

Now, we want to store that information in an array or a list, such that we can work with it as previously. 

For that we could just append every line to a list which is empty at the beginning.

In [None]:
file = open('molecules.txt', 'r')

file_content = []

for line in file:
    file_content.append(line)

file.close()

file_content

When inspecting the list, we see that the lines contain a newline character `'\n'`, which also introduced the empty lines after each `print()` statement above. We should get rid of it. 

Moreover, we still haven't separated the names of the molecules from the sum formulas.

We will use the built-in method [`split()`](https://docs.python.org/3.3/library/stdtypes.html) for `str` to achieve both:

In [None]:
file = open('molecules.txt', 'r')

names = []
sum_formulas = []

for line in file:
    text_in_line = line.split('\n')[0]
    name, sum_formula = text_in_line.split(',')
    names.append(name)
    sum_formulas.append(sum_formula)

file.close()

names, sum_formulas

You see that the molecule names are now separated from the sum formulas in two separate lists, which can be easily combined into a dictionary again:

In [None]:
molecule_dict = {n: s for n, s in zip(names, sum_formulas)}

molecule_dict

Here, we get to know the built-in function `zip()` which iterates over several iterables in parallel, producing tuples with an item from each one. More details can be found [here](https://docs.python.org/3/library/functions.html#zip).
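What `zip()` produces becomes clearer on a small example. The two short lists below are made up and deliberately use new names, so that the lists read from the file stay untouched:

```python
short_names = ['Water', 'Methane', 'Ethanol']
short_formulas = ['H2O', 'CH4', 'C2H5OH']

# zip() pairs up the items position by position
pairs = list(zip(short_names, short_formulas))
print(pairs)

# dict() accepts such pairs directly, as an alternative to the comprehension
small_dict = dict(zip(short_names, short_formulas))
print(small_dict['Methane'])
```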

### Pandas

Now we want to combine the power of dictionaries with the power of numpy arrays. The data type that combines the advantages of both is the pandas DataFrame.

>pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
>
>\-https://pandas.pydata.org/

The pandas DataFrame is a fast and efficient object for data manipulation with integrated indexing. The advantage compared to a numpy array is that our data can be structured into named columns and named indices, which facilitates better organization.

To create a DataFrame with our data above, we construct a dictionary pointing to the list of names and list of sum formulas:

In [None]:
import pandas as pd  # pd is the most commonly used abbreviation for the pandas import

dataset = {'name': names, 'sum formula': sum_formulas}

df = pd.DataFrame(dataset)

df

We see that the pandas DataFrame is a beautifully structured table with numbered rows and named columns.

This makes precise indexing much easier and more meaningful:

In [None]:
caffeine = df['sum formula'][42]

caffeine

In practice we will mainly use the `.loc` indexer to select subsets of the DataFrame by row label and column name:

In [None]:
caffeine = df.loc[42, 'sum formula']

caffeine

Position-based indexing as known from NumPy is still available through the `.iloc` indexer, which addresses rows and columns purely by their integer position in the table:

In [None]:
caffeine = df.iloc[42, 1]

caffeine
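In the DataFrame above, row labels and row positions happen to coincide, so `.loc` and `.iloc` look interchangeable. The difference shows up as soon as the index labels are not `0, 1, 2, ...`. A small sketch with a made-up DataFrame `df_demo` and arbitrary index labels:

```python
import pandas as pd

# Hypothetical mini table with row labels 10, 20, 30 instead of 0, 1, 2
df_demo = pd.DataFrame(
    {"name": ["Water", "Methane", "Ethanol"],
     "sum formula": ["H2O", "CH4", "C2H5OH"]},
    index=[10, 20, 30],
)

by_label = df_demo.loc[20, "sum formula"]    # row with label 20, column by name
by_position = df_demo.iloc[1, 1]             # second row, second column

print(by_label, by_position)
```

Both expressions address the same cell here, but `df_demo.loc[1, "sum formula"]` would raise a `KeyError`, because no row carries the label `1`.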

Pandas is really practical for handling tabular data and it is also very handy for reading files, so that we do not need the cumbersome built-in file-reading functionality:

In [2]:
df2 = pd.read_csv('molecules.txt', names=['name', 'sum formula'])

caffeine = df2.loc[42, 'sum formula']

caffeine

'C8H10N4O2'

**Pandas offers:**

- Efficient data manipulation: Pandas provides powerful tools for filtering, aggregating, merging, reshaping, and cleaning data, enabling users to handle complex data transformation tasks with minimal code.
- Versatile data integration: It supports reading and writing data from various file formats, such as CSV, Excel, SQL, and JSON, making it easy to integrate data from multiple sources into a single workflow.
- Robust analytical capabilities: Pandas offers a wide range of built-in functions for descriptive statistics, time-series analysis, and data grouping, enabling comprehensive and detailed data exploration.
- Optimized performance: Its underlying integration with NumPy allows pandas to handle large datasets efficiently, leveraging vectorized operations and optimized memory usage for speed and scalability.

## 1.7 Data manipulation: a first example

We now approach a first task to showcase the power of programming. 

So far we could read in a file and structure the data within an object which seemed to be quite useful, the pandas DataFrame. We also explored how to do mathematical operations and how to write a little user-interactive program. But now we will start to manipulate (i.e. work on) some data such that we can use it in a different context and for different purposes.

We start with our *molecules.txt* file, where we just have a list of molecule names and their sum formulas. Now we want to extract the chemical composition from these sum formulas such that we can use it, e.g. to pick only those molecules from our data set which contain at least one carbon atom (which includes all organic molecules).

### Regular expressions

To do that we first need to transform the string representing the sum formula into a dictionary which gives the element and its abundance within that molecule. We will use regular expressions for that. Regular expressions are patterns used to match and manipulate text. They are incredibly powerful for tasks like:
- Searching for patterns in text.
- Extracting specific parts of a string.
- Replacing or modifying text based on patterns.
  
Common use cases for regular expressions are:
- Validation: Check if a string follows a specific format (e.g., email addresses, phone numbers).
- Extraction: Extract meaningful parts from text (e.g., dates, URLs, or chemical formulas).
- Search and Replace: Replace parts of a string that match a pattern.

The `re` module from the Python standard library provides functions to search for, extract, and replace text that matches a regular expression.
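As a minimal sketch of the three use cases listed above, the following cell applies `re.fullmatch`, `re.findall` and `re.sub` to some made-up strings. The example strings and the deliberately simple patterns are assumptions chosen for illustration only.

```python
import re

# Validation: does the whole string look like a date of the form YYYY-MM-DD?
date_pattern = r"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"
is_date = re.fullmatch(date_pattern, "2024-03-15") is not None
not_date = re.fullmatch(date_pattern, "15 March 2024") is not None

# Extraction: collect every run of digits ("+" means one or more)
numbers = re.findall(r"[0-9]+", "Water boils at 100 degrees and freezes at 0 degrees")

# Search and replace: substitute every run of digits by a placeholder
masked = re.sub(r"[0-9]+", "#", "Room 12, floor 3")

print(is_date, not_date)
print(numbers)
print(masked)
```

Note that `re.findall` returns the matches as strings, so they have to be converted with `int()` before doing arithmetic with them.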

In [None]:
import re

Let's start by building a regular expression. The pattern is written as a string, conventionally a raw string with a leading `r` so that backslashes are passed on to `re` unchanged. We can then use `re.search()` to find the first occurrence and `re.findall()` to find all occurrences of the pattern in a string:

In [None]:
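pattern = r"[A-Z]"
string = "ahjdcdsacMcdsajkB"
# re.findall returns a list with all matches, here all capital letters
capital_letters = re.findall(pattern, string)
# the first entry of that list is the first capital letter
capital_letters[0]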

The regular expression `[A-Z]` matches any single capital letter from A to Z, so `re.findall` returns a list of all capital letters in the string (here `['M', 'B']`), and the first entry of that list is the first capital letter.

Other common building blocks of regular expressions are:

- `[abc]` matches any one character that is either 'a', 'b', or 'c'.
- `[a-z]` matches any one lowercase letter from 'a' to 'z'.
- `[0-9]` matches any one digit from '0' to '9'. Alternatively, use the `\d` metacharacter.
- `[^abc]` matches any one character that is not 'a', 'b', or 'c'.
- `[\w]` matches any one word character, i.e. a letter, a digit, or an underscore.
- `[\s]` matches any one whitespace character, including space, tab, and newline.
- `*` matches zero or more repetitions of the preceding element.
- `(...)` captures parts of the match for later use.
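To see a few of these building blocks in action, here is a small sketch on a made-up string (the string `text` and the variable names are assumptions for illustration):

```python
import re

text = "pH 7, T = 298 K"

# [0-9] matches one digit at a time
single_digits = re.findall(r"[0-9]", text)

# a digit followed by zero or more further digits gives whole numbers
whole_numbers = re.findall(r"[0-9][0-9]*", text)

# a negated class: everything that is neither letter, digit, nor whitespace
punctuation = re.findall(r"[^a-zA-Z0-9\s]", text)

# two capture groups: with groups, re.findall returns one tuple per match
assignments = re.findall(r"([A-Z])\s=\s([0-9]*)", text)

print(single_digits)
print(whole_numbers)
print(punctuation)
print(assignments)
```

The last line shows why groups are useful: each match is returned as a tuple with one entry per `(...)`, which is exactly what we need to separate an element symbol from its count.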

With that we can build a pattern that matches the structure of an element symbol followed by the number of its occurrences:

In [None]:
pattern = r"([A-Z][a-z]*)([0-9]*)"

elements = re.findall(pattern, caffeine)

for element in elements:
    print(element)
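The tuples printed above can be turned into the dictionary of elements and abundances mentioned at the beginning of this section. The following is a minimal sketch. The helper name `parse_formula` is hypothetical, and the function assumes simple sum formulas without parentheses (something like `Ca(OH)2` is not handled).

```python
import re

def parse_formula(formula):
    """Return a dict mapping element symbols to their counts in a sum formula."""
    pattern = r"([A-Z][a-z]*)([0-9]*)"
    composition = {}
    for symbol, count in re.findall(pattern, formula):
        # an empty count string means the element occurs once
        n = int(count) if count else 1
        # add up repeated symbols, e.g. the two C in "CH3COOH"
        composition[symbol] = composition.get(symbol, 0) + n
    return composition

caffeine_composition = parse_formula("C8H10N4O2")
acetic_acid_composition = parse_formula("CH3COOH")

print(caffeine_composition)
print(acetic_acid_composition)
```

Using `composition.get(symbol, 0) + n` instead of a plain assignment means that a symbol appearing several times in a formula is summed rather than overwritten.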

#### Expanding a pandas DataFrame 

That worked. Let's now use this to expand our pandas DataFrame of molecule names and molecule sum formulas with columns indicating the abundances of different elements. 

Let's first introduce a way to loop over the rows of a DataFrame, using `df.index`:

In [None]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:    
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        print(element)

Now, we create a new column for each element with the element symbol as header and containing its abundance per molecule. 

To do so, each time we parse the sum formula of a row we need to check whether a column for the element already exists, and create it if it does not. We will use `df.columns` to access the columns of the DataFrame.

In [None]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:    
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        if element[0] not in df.columns:
            df[element[0]] = 0

df

The last step is to fill each entry with the count taken from the parsed sum formula (careful: our regular expression returns an empty string when the count is actually 1):

In [None]:
import pandas as pd

df = pd.read_csv('molecules.txt', names=['name', 'sum formula'])
pattern = r"([A-Z][a-z]*)([0-9]*)"

for idx in df.index:
    parsed = re.findall(pattern, df.loc[idx, 'sum formula'])
    for element in parsed:
        # create the element column (filled with zeros) on first encounter
        if element[0] not in df.columns:
            df[element[0]] = 0
        # an empty count string means the element occurs once
        if element[1] == '':
            df.loc[idx, element[0]] = 1
        else:
            df.loc[idx, element[0]] = int(element[1])

df
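For comparison, the same expansion can also be written without an explicit loop over rows and columns. This is only a sketch of one possible alternative: the small DataFrame below is a hypothetical stand-in for the contents of *molecules.txt*, and `parse_formula` is a hypothetical helper name.

```python
import re
import pandas as pd

# Hypothetical stand-in for the data read from molecules.txt
df = pd.DataFrame({
    "name": ["water", "caffeine", "carbon dioxide", "ammonia"],
    "sum formula": ["H2O", "C8H10N4O2", "CO2", "NH3"],
})

pattern = r"([A-Z][a-z]*)([0-9]*)"

def parse_formula(formula):
    # map each element symbol to its count; an empty count means 1
    return {symbol: int(count) if count else 1
            for symbol, count in re.findall(pattern, formula)}

# one dict per row becomes one column per element; absent elements are NaN
counts = pd.DataFrame(df["sum formula"].map(parse_formula).tolist(), index=df.index)
counts = counts.fillna(0).astype(int)

# attach the element columns to the original DataFrame
expanded = pd.concat([df, counts], axis=1)
print(expanded)
```

The explicit loop is easier to follow step by step, while this version lets pandas take care of creating the columns and filling in the missing entries.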

#### Slicing a pandas DataFrame: Data selection/reduction

Now we can finally slice our DataFrame so that it is reduced to only the organics (i.e. the rows with `df['C']>0`), which is an extremely simple operation in pandas.

In [None]:
organics = df.loc[df['C']>0, :]

organics

As boolean statements can be combined, we can also slice the DataFrame on combinations of conditions:

In [None]:
ox_organics = df.loc[(df['C']>0) & (df['O']>0), :]

ox_organics

Here we use the `&` operator ([bitwise AND](https://www.geeksforgeeks.org/difference-between-and-and-in-python/)), which pandas applies row by row, so that a row ends up in the final DataFrame only if both conditions are true for that row. Each condition has to be wrapped in parentheses because `&` binds more tightly than comparison operators such as `>`. Similarly, `|` is the bitwise equivalent of OR.
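The following sketch shows `&`, `|` and the negation operator `~` side by side. The small DataFrame is a hypothetical stand-in for the expanded DataFrame built above, so that the cell runs on its own.

```python
import pandas as pd

# Hypothetical element counts, standing in for the expanded DataFrame
df = pd.DataFrame({
    "name": ["water", "methane", "carbon dioxide", "ammonia"],
    "C": [0, 1, 1, 0],
    "O": [1, 0, 2, 0],
})

# each comparison yields a boolean Series with one True/False per row
has_c = df["C"] > 0
has_o = df["O"] > 0

both = df.loc[has_c & has_o, "name"].tolist()    # contains C AND O
either = df.loc[has_c | has_o, "name"].tolist()  # contains C OR O
no_c = df.loc[~has_c, "name"].tolist()           # does NOT contain C

print(both)
print(either)
print(no_c)
```

Storing the conditions in named variables first avoids the parentheses and often makes longer selections easier to read. The plain Python keywords `and`, `or` and `not` cannot be used here, since they raise a `ValueError` when applied to a whole Series.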