# Mepha - List
## Important
* PDF conversion with Tabula always comes out skewed, so the PDF is converted to an HTML table instead. This requires a few manual steps; follow the script.
* Set HCP & HCO manually!
* In 2018, Mepha calculated number of participants to values... recalculate everything.

In [1]:
import sys
sys.path.insert(0, '../../../lib/')

import numpy as np
import pandas as pd
import importlib
import pikepdf

import pdfexport
importlib.reload(pdfexport)

from pdfexport import *

## Unlock PDF
This part removes the password protection so that the PDF can be exported to HTML.

In [None]:
pdf = pikepdf.open('pkk-erfassungmepha-pharma-ag_eng_final_250619_ss.pdf')
pdf.save('unlocked.pdf')

## PDF to HTML
This step must be done **manually**!
* Open the PDF `unlocked.pdf` in Adobe Acrobat (not Reader!)
* File -> Export To -> HTML Web Page
* Save as `unlocked.html`

## Import HTML

In [2]:
df_list = pd.read_html("unlocked.html", "")
df = pd.concat(df_list)
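
One detail worth keeping in mind for the later steps: `pd.concat` keeps each table's own row labels, so the combined frame can contain repeated index values. The sketch below uses two made-up frames in place of the parsed tables (the column names `a` and `b` are placeholders, not taken from the PDF) to show what `ignore_index=True` changes.

```python
import pandas as pd

# Two toy frames standing in for tables returned by pd.read_html (hypothetical data)
tables = [
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}),
    pd.DataFrame({"a": [3, 4], "b": ["z", "w"]}),
]

# Default: each table keeps its own labels, so 0 and 1 appear twice
stacked = pd.concat(tables)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat(tables, ignore_index=True)
print(renumbered.index.tolist())
```

With repeated labels, a label taken from `iterrows()` is not a reliable row position, which matters for any step that addresses "the next row".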

## Format Table

In [16]:
df_export = df.copy()

#Rename Columns
df_export.columns = ['name', 'location', 'country', 'address', 'date', 'donations_grants', 'empty1', 'empty2', 'sponsorship', 'registration_fees','travel_accommodation', 'empty3', 'fees', 'related_expenses', 'total', 'empty4']

#Replace Strings
df_export.loc[df_export['name'].str.contains('Health Care Professionals', na=False), 'name'] = np.nan
df_export.loc[df_export['name'].str.contains('Health Care Organisations', na=False), 'name'] = np.nan

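#Multiline addresses are split across rows. Prepend each fragment to the address in the next row
address_col = df_export.columns.get_loc('address')
for pos in range(len(df_export) - 1):
    row = df_export.iloc[pos]
    if not isinstance(row['country'], str) and not isinstance(row['location'], str) and isinstance(row['address'], str):
        #Assign by position in one step; chained indexing would write to a copy
        df_export.iloc[pos + 1, address_col] = row['address'] + ' ' + df_export.iloc[pos + 1, address_col]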

#Shift rows without a name one column to the left
df_export[df_export['name'].isna()] = df_export[df_export['name'].isna()].shift(-1, axis='columns')

#Remove rows which have no values
df_export = df_export.dropna(subset=['name'], how='all')
df_export = df_export[~df_export['name'].str.contains("HCP")]
df_export = df_export.reset_index(drop=True)

#Move cells from the first page into the correct columns
copy = df_export['total'].isna()

df_export.loc[copy, 'total'] = df_export['empty3']
df_export.loc[copy, 'empty3'] = np.nan

df_export.loc[copy, 'fees'] = df_export['travel_accommodation']
df_export.loc[copy, 'travel_accommodation'] = np.nan

df_export.loc[copy, 'empty3'] = df_export['registration_fees']
df_export.loc[copy, 'registration_fees'] = np.nan

df_export.loc[copy, 'registration_fees'] = df_export['sponsorship']
df_export.loc[copy, 'sponsorship'] = np.nan

#Remove rows which have no values
df_export = df_export.dropna(subset=['total'], how='all')
df_export = df_export[df_export['country'] != '‐']
df_export = df_export.reset_index(drop=True)

#Remove empty
df_export.drop(columns=['date', 'empty1', 'empty2', 'empty3', 'empty4'], inplace=True)

#Remove Total and recalculate it
df_export['total'] = 0

#Convert to Numbers
df_export = cleanup_number(df_export)
df_export = amounts_to_number(df_export)
df_export = sum_amounts(df_export)


#Add Fields
df_export = add_uci(df_export)
df_export = add_plz(df_export)

#Add Type manually
first_hco = 'Hausärztlicher Qualitätszirkel Amriswil'
index = df_export[df_export['name'] == first_hco].index[0]
df_export.loc[0: index, 'type'] = 'hcp'
df_export.loc[index:, 'type'] = 'hco'

#Revert name
df_export = revert_name(df_export, ', ')

#basic string conversion
df_export = basic_string_conversion(df_export)

export_list(df_export, 'mepha')

Duplicates found. Please check for duplicates: df_export[df_export.duplicated()]
saved
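
The multiline-address step in the cell above combines a row that carries only an address fragment with the row that follows it. Below is a minimal sketch of that idea on made-up data, assuming the fragment always sits directly before the full record. It addresses rows by position and writes with a single `.iloc[row, column]` call, because chained indexing such as `df.iloc[i]['address'] = ...` may assign to a temporary copy and leave the frame unchanged.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 0 holds only the first line of an address,
# row 1 holds the rest of the record.
toy = pd.DataFrame({
    "name": [np.nan, "Example Org"],
    "location": [np.nan, "Example Town"],
    "country": [np.nan, "CH"],
    "address": ["c/o Example", "Musterstrasse 1"],
})

address_col = toy.columns.get_loc("address")
for pos in range(len(toy) - 1):
    row = toy.iloc[pos]
    is_fragment = (
        not isinstance(row["country"], str)
        and not isinstance(row["location"], str)
        and isinstance(row["address"], str)
    )
    next_address = toy.iloc[pos + 1, address_col]
    if is_fragment and isinstance(next_address, str):
        # Single positional assignment, so the write lands in `toy` itself
        toy.iloc[pos + 1, address_col] = row["address"] + " " + next_address

print(toy.loc[1, "address"])
```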


In [4]:
df_export[df_export.duplicated()]

| | name | location | country | address | plz | donations_grants | uci | sponsorship | registration_fees | travel_accommodation | fees | related_expenses | total | type | _export_information | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111 | Forum für medizinische Fortbildung | Baar | CH | Oberneuhofstrasse 6 | | | | 5700.0 | | | | | 5700.0 | hco | | mepha |
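
By default, `DataFrame.duplicated()` flags only the second and later occurrences of a row, so the output above does not show the earlier entry that row 111 repeats. To inspect every copy side by side and decide whether the PDF really lists the payment twice, `keep=False` can be used. A small sketch on made-up data:

```python
import pandas as pd

# Toy data (hypothetical): the first and last rows are identical
toy = pd.DataFrame({
    "name": ["Org A", "Org B", "Org A"],
    "location": ["Town 1", "Town 2", "Town 1"],
    "total": [100.0, 250.0, 100.0],
})

# duplicated() marks only the later copies by default
later_copies = toy[toy.duplicated()]

# keep=False marks every member of a duplicate group, which helps when
# comparing the entries against the source PDF
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

In the notebook the equivalent call would be `df_export[df_export.duplicated(keep=False)]`.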


In [15]:
#write_to_csv(df_export, 'tmp.csv')
#write_to_excel(df_export, 'tmp.xlsx', open=True)