# Assignment 1

## Exercise 1: Vapor pressures

The vapor pressure of a substance if often approximated by the so-called Antoine equation: $ p = 10^{\left( a − \frac{b}{temp + c} \right)} $ with the vapor pressure $p$ in bar and the temperature $temp$ in Kelvin. The empirical parameters $a$, $b$, and $c$ are valid over a certain temperature range. 

**(a) Write a function called `vap_pres_antoine(temp, a, b, c)` which calculates the vapor pressure of a substance for which we know the input parameters $a$, $b$, and $c$ at a certain temperature $temp$.**  

In [1]:
def vap_pres_antoine(temp, a, b, c):
    p = 10 ** (a - (b / (temp + c)))
    
    return p


# print(vap_pres_antoine(293, 1, 1, 1))


**(b) Build a vapor pressure guesser (function called `guess_vap_press()`), which prompts the user input to guess the vapor pressure of a substance known by its trivial name. The vapor pressure does not need to be guessed precisiely, but should just be within the correct order of magnitude such that the answer by the user is correct. Besides a printed answer to the test ('correct within one order of magnitude' or 'wrong vapor pressure'), the function should return 0 when the vapor pressure is guessed wrongly, and it should return 1, when it is guessed correctly (i.e. within one order of magnitude).**

In [2]:
import numpy as np

vapor_pressures = {
    "water": 3169,           # in Pascal
    "ethanol": 7897,         # in Pascal
    "methanol": 12700,       # in Pascal
    "acetone": 24700,        # in Pascal
    "benzene": 12400,        # in Pascal
    "chloroform": 21500,     # in Pascal
    "diethyl_ether": 53700,  # in Pascal
    "n_hexane": 16000,       # in Pascal
    "toluene": 3800,         # in Pascal
    "acetic_acid": 1560,     # in Pascal
    "carbon_tetrachloride": 15000,  # in Pascal
    "ammonia": 86600,        # in Pascal
    "butane": 220000,        # in Pascal
    "pentane": 57000,        # in Pascal
    "propane": 101000,       # in Pascal
    "hydrogen_peroxide": 750, # in Pascal
    "formic_acid": 5610,     # in Pascal
    "hydrogen_sulfide": 202000,  # in Pascal
    "carbon_dioxide": 5700,  # in Pascal
    "sulfur_dioxide": 3000,  # in Pascal
    "hydrochloric_acid": 40000, # in Pascal
    "nitrogen": 75500,       # in Pascal
    "oxygen": 49200,         # in Pascal
    "argon": 27000,          # in Pascal
    "xenon": 77,             # in Pascal
    "mercury": 0.27,         # in Pascal
    "bromine": 2700,         # in Pascal
    "iodine": 0.031,         # in Pascal
    "phosphine": 200000,     # in Pascal
    "nitric_acid": 47,       # in Pascal
    "ethene": 581000,        # in Pascal
    "ethyne": 417000         # in Pascal
}

In [3]:
def guess_vap_pres(vap_pres_dict):
    trivial_name = input("Enter the trivial name of a substance: ")

    pressure_dict = int(vapor_pressures[trivial_name])
    pressure_user = int(input("Guess the pressure (in Pascal): "))

    if pressure_user > (pressure_dict * 10):
        print("Sorry, your guess is too high")
    elif pressure_user < (pressure_dict / 10):
        print("Sorry, your guess is too low")
    else:
        print("Correct! The actual pressure is {pressure_dict}, so you are within the limit.")

    return

# guess_vap_pres(vapor_pressures)


**(c) we will now read-in a file `organic_vapor_pressures.csv` which contains the compound names, CAS number, sum formulas and vapor pressures at 300 K of a set of organic molecules. Write a function `read_in_basic(filename)` and a function `read_in_pandas(filename)` to read in the file and which return its content as a pandas DataFrame `df` (with the same column names than the first row of the file). The former solution should use python built-in functions only and the latter can use pandas read-in. Please be aware that a semicolon is used as the list seperator in the file (look up the pandas docs to change the seperator in the read-in)**


In [4]:
import pandas as pd
import numpy as np

def read_in_basic(filename):
    file = open(filename)
    names = []
    cas_list = []
    sum_formulas = []
    pressures = []
    
    for line in file:
        text_in_line = line.split('\n')[0]
        # print(text_in_line)
        name, cas, sum_formula, pressure = text_in_line.split(';')
        names.append(name)
        cas_list.append(cas)
        sum_formulas.append(sum_formula)
        pressures.append(pressure)
    # print(cas_list)
    file.close()

    zipped = list(zip(names, cas_list, sum_formulas, pressures))
    df = pd.DataFrame(zipped, columns = zipped[0])
    df = df.drop(index=0)
    df = df.reset_index(drop = True)
    #print(df)
    return df

def read_in_pandas(filename):
    df = pd.read_csv(filename, sep = ';')
    #print(df)
    return df


read_in_basic('organic_vapor_pressures.csv')
read_in_pandas('organic_vapor_pressures.csv')

Unnamed: 0,Name,CAS,Sum Formula,P [bar] at 300 K
0,"2,3-Dimethylbutane",79-29-8,C6H14,0.388000
1,2-Methylpentane,107-83-5,C6H14,0.305000
2,3-Methylpentane,96-14-0,C6H14,0.273500
3,n-Hexane,110-54-3,C6H14,0.219000
4,"2,2-Dimethylbutane",75-83-2,C6H14,0.457000
...,...,...,...,...
89,Toluene,108-88-3,C7H8,0.041690
90,Cyclohexanol,108-93-0,C6H12O,0.000947
91,Cycloheptanol,502-41-0,C7H14O,0.000328
92,Cyclohexanone,108-94-1,C6H10O,0.006063


In [5]:
###BEGIN ALWAYS HIDDEN TEST
import pandas as pd
import numpy as np
import os

def test_read_in_functions():
    # Create a temporary CSV file for testing
    test_filename = "test_organic_vapor_pressures.csv"
    with open(test_filename, "w") as f:
        f.write("Name;CAS Number;Sum Formula;Vapor Pressure\n")
        f.write("Methanol;67-56-1;CH4O;128\n")
        f.write("Ethanol;64-17-5;C2H6O;59\n")
        f.write("Acetone;67-64-1;C3H6O;231\n")
    
    try:
        # Expected DataFrame
        expected_data = {
            "Name": ["Methanol", "Ethanol", "Acetone"],
            "CAS Number": ["67-56-1", "64-17-5", "67-64-1"],
            "Sum Formula": ["CH4O", "C2H6O", "C3H6O"],
            "Vapor Pressure": [128, 59, 231]
        }
        expected_df = pd.DataFrame(expected_data)

        # Run student functions
        df_basic = read_in_basic(test_filename)
        df_pandas = read_in_pandas(test_filename)

        expected_df["Vapor Pressure"] = expected_df["Vapor Pressure"].astype(float)
        df_basic["Vapor Pressure"] = df_basic["Vapor Pressure"].astype(float)
        df_pandas["Vapor Pressure"] = df_pandas["Vapor Pressure"].astype(float)

        # Test if the outputs are DataFrames
        assert isinstance(df_basic, pd.DataFrame), "read_in_basic() should return a DataFrame"
        assert isinstance(df_pandas, pd.DataFrame), "read_in_pandas() should return a DataFrame"

        # Test if column names match expected
        assert list(df_basic.columns) == list(expected_df.columns), "Column names do not match in read_in_basic()"
        assert list(df_pandas.columns) == list(expected_df.columns), "Column names do not match in read_in_pandas()"

        # Test if data matches expected
        assert df_basic.equals(expected_df), "Data in read_in_basic() does not match expected"
        assert df_pandas.equals(expected_df), "Data in read_in_pandas() does not match expected"

        print("All tests passed!")

    finally:
        # Ensure the temporary test file is deleted even if an error occurs
        if os.path.exists(test_filename):
            os.remove(test_filename)

# Run the test function
test_read_in_functions()
### END ALWAYS HIDDEN TEST


All tests passed!


**(d) Measure the time it takes for the two functions to read-in that file and store the outcome in the variables `time_basic` and `time_pandas`**

In [6]:
import time

start_basic = time.time()
read_in_basic('organic_vapor_pressures.csv')
end_basic = time.time()
time_basic = end_basic - start_basic


start_pandas = time.time()
read_in_pandas('organic_vapor_pressures.csv')
end_pandas = time.time()
time_pandas = end_pandas - start_pandas

print(f'time_basic = {time_basic} and time_pandas = {time_pandas}')
print(f'Time difference (basic - pandas) = {time_basic - time_pandas}')


time_basic = 0.0007066726684570312 and time_pandas = 0.000850677490234375
Time difference (basic - pandas) = -0.00014400482177734375


In [7]:
###BEGIN ALWAYS HIDDEN TEST
assert time_basic>0
assert time_pandas>0

assert time_basic<0.1
assert time_pandas<0.1
###END ALWAYS HIDDEN TEST

**(e) Now disentengle the sum formulas such that the DataFrame has new columns giving integer numbers for the number of atome of each element found in the molecules. Again use two approaches, a function `disentengle_sum_formulas_basic(df)` and `disentengle_sum_formulas_regex(df)` with the former using only basic python string manipulation (check out the [string class methods](https://docs.python.org/3/library/stdtypes.html#str) isdigit() and islower(), they might come in handy) and the latter using regular expressions. The functions should return the modified DataFrame. Again compare the time the two functions need for execution by storing them into `time_basic` and `time_regex`.**

In [8]:
import re
import pandas as pd
import time

def disentengle_sum_formulas_basic(df):
    c_atom_list = []
    h_atom_list = []
    o_atom_list = []

    for line in df['Sum Formula']:
        text_in_line = line.split('\n')[0]

        c_index = text_in_line.find('C')
        h_index = text_in_line.find('H')
        o_index = text_in_line.find('O')
        #print(f'C = {c_index} H = {h_index} O = {o_index}')

        c_number = text_in_line[c_index+1:h_index]

        if o_index == -1:
            h_number = text_in_line[h_index+1:len(text_in_line)]
            o_number = 0
        else:
            h_number = text_in_line[h_index+1:o_index]
            if o_index+1 > len(text_in_line)-1:
                o_number = 1
            else:
                o_number = text_in_line[o_index+1:len(text_in_line)]

        c_atom_list.append(c_number)
        h_atom_list.append(h_number)
        o_atom_list.append(o_number)

    df['C'] = c_atom_list
    df['H'] = h_atom_list
    df['O'] = o_atom_list

    return df


def disentengle_sum_formulas_regex(df):
    pattern = r"([A-Z][a-z]*)([0-9]*)"
    
    for idx in df.index:    
        parsed = re.findall(pattern, df.loc[idx, 'Sum Formula'])
        for element in parsed:
            if element[0] not in df.columns:
                    df[element[0]] = 0
            if element[1] == '':
                df.loc[idx, element[0]] = 1
            else:
                df.loc[idx, element[0]] = int(element[1])

    return df
    


time_basic = 0
time_regex = 0 
# YOUR CODE HERE

df = read_in_basic('organic_vapor_pressures.csv')

start_basic = time.time()
disentengle_sum_formulas_basic(df)
end_basic = time.time()
time_basic = end_basic - start_basic


start_pandas = time.time()
disentengle_sum_formulas_regex(df)
end_pandas = time.time()
time_pandas = end_pandas - start_pandas

print(f'time_basic = {time_basic} and time_pandas = {time_pandas}')
print(f'Time difference (basic - pandas) = {time_basic - time_pandas}')



time_basic = 0.0010952949523925781 and time_pandas = 0.02438068389892578
Time difference (basic - pandas) = -0.023285388946533203


In [9]:
import pandas as pd
import numpy as np

def test_disentangle_functions():
    # Sample data for testing
    test_data = {
        'Molecule': ['Water', 'Carbon Dioxide', 'Ethanol', 'Ammonia', 'Methane'],
        'Sum Formula': ['H2O', 'CO2', 'C2H6O', 'NH3', 'CH4']
    }
    df_test = pd.DataFrame(test_data)

    # Expected output for each molecule
    expected_results = {
        'H2O': {'H': 2, 'O': 1, 'C': 0, 'N': 0},
        'CO2': {'C': 1, 'O': 2, 'H': 0, 'N': 0},
        'C2H6O': {'C': 2, 'H': 6, 'O': 1, 'N': 0},
        'NH3': {'N': 1, 'H': 3, 'C': 0, 'O': 0},
        'CH4': {'C': 1, 'H': 4, 'O': 0, 'N': 0},
    }

    # Run both functions
    df_basic = disentengle_sum_formulas_basic(df_test.copy())
    df_regex = disentengle_sum_formulas_regex(df_test.copy())

    # Ensure output is a DataFrame
    assert isinstance(df_basic, pd.DataFrame), "disentengle_sum_formulas_basic() did not return a DataFrame"
    assert isinstance(df_regex, pd.DataFrame), "disentengle_sum_formulas_regex() did not return a DataFrame"

    # Ensure all expected elements exist as columns
    expected_elements = ['H', 'O', 'C', 'N']
    for element in expected_elements:
        assert element in df_basic.columns, f"Element {element} missing from basic implementation output"
        assert element in df_regex.columns, f"Element {element} missing from regex implementation output"

    # Check values against expected results
    for idx, row in df_basic.iterrows():
        formula = row['Sum Formula']
        for element, expected_value in expected_results[formula].items():
            assert row[element] == expected_value, f"Basic: Incorrect value for {element} in {formula}"

    for idx, row in df_regex.iterrows():
        formula = row['Sum Formula']
        for element, expected_value in expected_results[formula].items():
            assert row[element] == expected_value, f"Regex: Incorrect value for {element} in {formula}"

    # Ensure timing variables are set
    assert isinstance(time_basic, float), "time_basic is not a float"
    assert isinstance(time_regex, float), "time_regex is not a float"

    print("All tests passed.")

# Run the test
test_disentangle_functions()

AssertionError: Element N missing from basic implementation output

**(f) Now we want to filter the available data, which we do within a function `filter_low_vap_press(df)`, which returns compented and filtered vapor pressure DataFrame as specified below: First we need to convert the vapor pressures in bar to vapor pressures given in units of cm<sup>-3</sup>, for which we use the ideal gas law (at 300 K) for conversion, because we are interested in their phase-partitioning in the gas-phase (aerosol formation). The new column is to be named `'P [cm-3] at 300 K'`. In addition, calculate the molecular mass for these compounds in a new column (`M [g mol-1]`). Then we reduce the DataFrame `df` to those substances with a vapor pressure smaller than 10<sup>17</sup> cm<sup>-3</sup>.  What pattern do you observe by inspecting the remaining DataFrame (answer for the discussion, no text needed here)?** 

In [21]:
def filter_low_vap_press(df):
    # Define constants for conversion
    T = 300  # Temperature in K
    R = 8.314  # Gas constant in J/(mol·K)
    N_A = 6.022e23  # Avogadro's number
    conversion_prefactor = (N_A / (R * T * 1e6)) * 1e5  # Conversion factor for bar to cm^-3

    # Conversion to Volume
    df['P [cm-3] at 300 K'] = df['P [bar] at 300 K'] * conversion_prefactor

    df['M [g mol-1]'] = df['C'] * 12.011 + df['H'] * 1.008 + df['O'] * 15.999
    
    #print(df['P [cm-3] at 300 K'])
    
    df = df[df['P [cm-3] at 300 K'] < 10**(17)]

    #print(df['P [cm-3] at 300 K'])

    return df

df = read_in_pandas('organic_vapor_pressures.csv')
df = disentengle_sum_formulas_regex(df)
df = filter_low_vap_press(df)

0     9.367877e+18
1     7.363924e+18
2     6.603388e+18
3     5.287539e+18
4     1.103381e+19
          ...     
89    1.006564e+18
90    2.285472e+16
91    7.928894e+15
92    1.463779e+17
93    1.459506e+18
Name: P [cm-3] at 300 K, Length: 94, dtype: float64
35    3.295658e+16
41    3.072809e+16
48    8.291054e+16
51    9.392021e+15
53    3.148379e+15
54    9.109537e+15
55    1.036503e+16
56    1.235208e+16
73    5.408259e+16
75    4.345923e+16
76    8.641143e+16
77    4.104482e+16
78    7.701941e+16
80    8.324856e+16
84    4.097239e+16
85    5.825951e+16
86    6.719279e+15
87    1.888545e+16
90    2.285472e+16
91    7.928894e+15
Name: P [cm-3] at 300 K, dtype: float64


In [22]:
import pandas as pd
import numpy as np

# Define constants for conversion
T = 300  # Temperature in K
R = 8.314  # Gas constant in J/(mol·K)
N_A = 6.022e23  # Avogadro's number
conversion_prefactor = (N_A / (R * T * 1e6)) * 1e5  # Conversion factor for bar to cm^-3

def test_filter_low_vap_press():
    # Sample input data
    test_data = {
        'Molecule': ['Water', 'Carbon Dioxide', 'Ethanol', 'Large Molecule'],
        'Sum Formula': ['H2O', 'CO2', 'C2H6O', 'C30H62'],
        'H': [2, 0, 6, 62],
        'O': [1, 2, 1, 0],
        'C': [0, 1, 2, 30],
        'P [bar] at 300 K': [1.0, 0.04, 1e-3, 1e-10]  # Vapor pressures in bar
    }
    df_test = pd.DataFrame(test_data)

    # Run student's function
    df_filtered = filter_low_vap_press(df_test.copy())

    # Ensure output is a DataFrame
    assert isinstance(df_filtered, pd.DataFrame), "filter_low_vap_press() did not return a DataFrame"

    # Ensure the required columns exist
    required_columns = ['M [g mol-1]', 'P [cm-3] at 300 K']
    for col in required_columns:
        assert col in df_filtered.columns, f"Missing column: {col}"

    # Check molecular mass calculation
    expected_mass = {
        'H2O': 2 * 1.008 + 1 * 15.999,  
        'CO2': 1 * 12.011 + 2 * 15.999,  
        'C2H6O': 2 * 12.011 + 6 * 1.008 + 1 * 15.999,  
        'C30H62': 30 * 12.011 + 62 * 1.008  
    }

    for idx, row in df_test.iterrows():
        formula = row['Sum Formula']
        if formula in df_filtered['Sum Formula'].values:
            expected_value = expected_mass[formula]
            student_value = df_filtered.loc[df_filtered['Sum Formula'] == formula, 'M [g mol-1]'].values[0]
            assert np.isclose(student_value, expected_value, atol=0.001), f"Incorrect molecular mass for {formula}"

    # Check vapor pressure conversion
    for idx, row in df_test.iterrows():
        formula = row['Sum Formula']
        expected_vp_cm3 = row['P [bar] at 300 K'] * conversion_prefactor  # Convert bar to cm^-3
        if formula in df_filtered['Sum Formula'].values:
            student_vp_cm3 = df_filtered.loc[df_filtered['Sum Formula'] == formula, 'P [cm-3] at 300 K'].values[0]
            assert np.isclose(student_vp_cm3, expected_vp_cm3, rtol=0.1), f"Incorrect vapor pressure for {formula}"

    # Check if filtering was applied correctly (P < 1e17 cm^-3)
    assert (df_filtered['P [cm-3] at 300 K'] < 1e17).all(), "Filtered DataFrame contains high vapor pressure compounds"

    print("All tests passed!")

# Run the test
test_filter_low_vap_press()

0    2.414401e+19
1    9.657606e+17
2    2.414401e+16
3    2.414401e+09
Name: P [cm-3] at 300 K, dtype: float64
2    2.414401e+16
3    2.414401e+09
Name: P [cm-3] at 300 K, dtype: float64
All tests passed!


## Exercise 2: Dual mass spectrometer data

In this exercise we will deal with two sets of data from two complimentary mass spectrometers. These atmospheric-pressure interface mass spectrometers are used in climate and air pollution studies to identify gaseous trace compounds in the atmosphere. They use chemical ionization witch specific reagent ions to ionize the substances by forming adduct clusters. In this case we use two different instruments with two different reagent ions, one with Nitrate (NO<sub>3</sub><sup>-</sup>) and one with Bromide (Br<sup>-</sup>). The mass spectrometers were operated in Hyytiälä, Finland, where they detected gaseous substances which can form new aerosol particles due to their low vapor pressure. The raw dataset was already processed  by a skilled experimental chemist, i.e. the peaks in the mass spectrum were identified, fitted and averaged over a longer time period. Now, we have two excel sheets available which contain the identified sum formulas and their average concentration. With these we will get to know how to compute, slice and overlap datasets with pandas. A detailed guide for this can be found in the [pandas user guide](https://pandas.pydata.org/docs/dev/user_guide/merging.html).

**(a) use the pandas function [read_excel](https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.read_excel.html) to read in the two files `nitrate.xlsx` and `bromide.xlsx` and name the DataFrames `nitrate` and `bromide`, respectively. Apply a procedure to transform the sum formulas to columns containing the elemental abundances. Be careful with -, (NO3-) and (Br-) in the sum formulas, as they are not part of the analyte, but specify the ionization mode of detection. They should not end up in the elemental abundances list.**

In [None]:
import pandas as pd
import re

# YOUR CODE HERE
raise NotImplementedError()

**(b) now we want to clean the dataset, as we are only interested in organic compounds this time. Except for the primary ion peaks (NO3-), H1N1O3(NO3-), H2N2O6(NO3-) and (Br-), H2O1(Br-) we want to get rid of all inorganics in the dataset, i.e. create two dataframes nitrate_org and bromide_org, which only contain the primary ions of the ionization method and organic peaks (C>0). Recall how conditional statements can be used to reduce a DataFrame and note that they can be [combined](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing). Possibly, [pd.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) could be helpful. At the end also drop ([pd.drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)) all columns which show up only in inorganic molecular ions, namely 'Cl', 'I'.** 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**(c) Now we will identify molecules which are measured twice by the *same* instrument, e.g., they might be measured as the (NO3-) adduct (M+(NO3-)) and a deprotonated (M-) signal, so as different molecular ions, by writing a function called `remove_duplicate_molecules(df)`. There are two ways: One using [pd.duplicated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) and one using [pd.groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html). In either case, the raw signal in counts per second of the molecules measured via two different molecular ions should be summed and one of the duplicates removed from the DataFrame.**

In [None]:
import numpy as np

def remove_duplicate_molecules(df):
    df = df.copy()
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
### BEGIN ALWAYS HIDDEN TEST
import pandas as pd
import numpy as np

# Check if function exists
assert 'remove_duplicate_molecules' in globals(), "The function 'remove_duplicate_molecules' is not defined."

# Create a test DataFrame
data = {
    "sum formula": ["C6H6O2(NO3-)", "C6H6O2", "C5H10O", "C6H6O2(NO3-)", "C5H10O"],
    "detected mass": [172.024599, 110.036780, 86.073165, 172.024599, 86.073165],
    "mean cps": [100, 200, 300, 150, 400],
    "C": [6, 6, 5, 6, 5],
    "H": [6, 6, 10, 6, 10],
    "O": [2, 2, 1, 2, 1],
    "N": [0, 0, 0, 0, 0]
}

df_test = pd.DataFrame(data)

# Expected output: Remove duplicate elemental compositions, sum signals
expected_data = {
    "sum formula": ["C6H6O2(NO3-)", "C5H10O"],  # Only unique elemental compositions remain
    "detected mass": [172.024599, 86.073165],
    "mean cps": [450, 700],  # Summed signal values
    "C": [6, 5],
    "H": [6, 10],
    "O": [2, 1],
    "N": [0, 0]
}

expected_df = pd.DataFrame(expected_data)

# Run student function
df_cleaned = remove_duplicate_molecules(df_test)

# Check if output is a DataFrame
assert isinstance(df_cleaned, pd.DataFrame), "The output of 'remove_duplicate_molecules' is not a pandas DataFrame."

# Check the number of rows in the cleaned DataFrame
assert df_cleaned.shape[0] == expected_df.shape[0], f"Expected {expected_df.shape[0]} rows, but got {df_cleaned.shape[0]}."

# Check that sum formulas remain consistent
pd.testing.assert_series_equal(df_cleaned["sum formula"].reset_index(drop=True),
                               expected_df["sum formula"],
                               check_dtype=False,
                               obj="The 'sum formula' column does not match the expected output.")

# Check that elemental compositions remain correct
for element in ["C", "H", "O", "N"]:
    pd.testing.assert_series_equal(df_cleaned[element].reset_index(drop=True),
                                   expected_df[element],
                                   check_dtype=False,
                                   obj=f"The '{element}' column does not match the expected output.")

# Check that Raw Signal (cps) was summed correctly
pd.testing.assert_series_equal(df_cleaned["mean cps"].reset_index(drop=True),
                               expected_df["mean cps"],
                               check_dtype=False,
                               obj="The 'Raw Signal (cps)' column does not match the expected summed values.")

test = remove_duplicate_molecules(bromide_org)

assert np.isclose(test.loc[test['sum formula'] == 'C2H2O4-', 'mean cps'], 0.030224, rtol=1e-2)

print('All tests passed')

### END ALWAYS HIDDEN TEST

**(d) Currently we only have the raw signal per molecular ion ($cps(analyte)$) available in both mass spectra, but we want to transform this to meaningful concentrations ($c(analyte)$ in molecules per cm<sup>3</sup>) for both mass spectra. For that the data is normalized by the primary ion signals (to account for differences in the performance between calibration and actual measurement) and then multiplied by a calibration factor $f$.** 

For nitrate we use the following formula:
$$ c(analyte) = \frac{cps(analyte)}{cps(NO3-)+cps(HNO3(NO3-))+cps(H2N2O6)(NO3-)}\cdot f(nitrate)$$
with $f(nitrate)=2.6\cdot10^{9}$. 

And for bromide we use:
$$ c(analyte) = \frac{cps(analyte)}{cps(Br-)+cps(H2O(Br-))}\cdot f(bromide)$$
with $f(bromide)=3.0\cdot10^{9}$.

**Create two new DataFrames `nitrate_calib` and `bromide_calib` which originate from `bromide_org`and `nitrate_org` and have a new column giving the proper concentrations (named 'conc') for each measured molecule.**

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN ALWAYS HIDDEN TEST

# Check if variables exist
assert 'nitrate_calib' in globals(), "The variable 'nitrate_calib' is not defined."
assert 'bromide_calib' in globals(), "The variable 'bromide_calib' is not defined."

# Check if they are pandas DataFrames
assert isinstance(nitrate_calib, pd.DataFrame), "'nitrate_calib' is not a pandas DataFrame."
assert isinstance(bromide_calib, pd.DataFrame), "'bromide_calib' is not a pandas DataFrame."

# Check if the 'conc' column exists
assert 'conc' in nitrate_calib.columns, "The column 'conc' is missing in 'nitrate_calib'."
assert 'conc' in bromide_calib.columns, "The column 'conc' is missing in 'bromide_calib'."

assert np.isclose(bromide_calib.loc[2:, 'conc'].median(), 2e7, rtol=1e-1), "bromide_calib 'conc' are in the wrong order of magnitude"
assert np.isclose(nitrate_calib.loc[3:, 'conc'].median(), 7e6, rtol=1e-1), "nitrate_calib 'conc' are in the wrong order of magnitude"

print('All tests passed')
### END ALWAYS HIDDEN TEST

**(e) The organics measured in the two mass spectra (`nitrate_calib` and `bromide_calib`) need to be combined. For that we first need to identify the molecules measured in both instruments, i.e. we need to extract the overlap. We will use [pd.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for this. `pd.merge`  is commonly used to merge data from two DataFrames based on a common column(s), and when used with the `how='inner'` keyword, it performs an inner join returning only the rows where there is a match in both DataFrames based on the specified key columns. Drop the primary ions from both DataFrames, and also get rid of the columns for the halogenes which might be still in your `bromide_calib` (namely 'Br','Cl','I'). For the overlapping peaks we will then compare the measured concentrations and use the instrument with the higher concentration and discard the other. At the end, we combine everything together again into one DataFrame which contains all sum formulas, the elemental abundances, the instrument which was used and the concentration. Put your code into a function called `combine_bromide_nitrate(bromide, nitrate)` where you pass `bromide_calib` and `nitrate_calib` as arguments and which return the final mass spectrum as a DataFrame (with reset index, going from 0 to ).**

In [None]:
def combine_bromide_nitrate(bromide, nitrate):
    # YOUR CODE HERE
    raise NotImplementedError()