# Data Preparation

`Overview`
This notebook handles the initial data processing pipeline:
- Loading raw data from source files
- Performing exploratory data analysis (EDA)
- Cleaning and handling missing values
- Feature preprocessing and engineering
- Exporting processed datasets for modeling

`Inputs`
- Raw data files from `../data/raw/` 

`Outputs`
- Processed datasets in `../data/processed/`
- EDA visualizations in `../reports/figures/`

`Dependencies`
- pandas
- numpy
- matplotlib
- seaborn

*Note: This is notebook 1 of the analysis pipeline*

In [1]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path

# Import custom modules
# from src.save_load import save_parquet

In [2]:
!which python


/Users/bertdepoorter/opt/anaconda3/envs/MDA/bin/python


Here we define the paths to the project-specific datasets, which are stored as CSV files. In the follow-up cell, we define the paths to the auxiliary dataset containing extra information on the CORDIS-HORIZON projects. This includes:
- Scientific vocabulary 
- legal basis documents
- organization
- project
- topics
- webItem 
- webLink

In [69]:
# Define the data directories relative to the notebook location
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)

raw_dir = f'{parent_dir}/data/raw'
interim_dir = f'{parent_dir}/data/interim'
processed_dir = f'{parent_dir}/data/processed'

# define file paths to project-specific files
data_report_path = f'{raw_dir}/reportSummaries.csv'
data_filereport_path = f'{raw_dir}/file_report.csv'
data_publications_path = f'{raw_dir}/projectPublications.csv'
data_deliverables_path = f'{raw_dir}/projectDeliverables.csv'



In [14]:
CORDIS_framework_docs_dir = f'{raw_dir}/cordis-HORIZONprojects-csv'

SciVoc_path = f'{CORDIS_framework_docs_dir}/euroSciVoc.csv'
legalBasis_path = f'{CORDIS_framework_docs_dir}/legalBasis.csv'
organization_path = f'{CORDIS_framework_docs_dir}/organization.csv'
project_path = f'{CORDIS_framework_docs_dir}/project.csv'
topics_path = f'{CORDIS_framework_docs_dir}/topics.csv'
webItems_path = f'{CORDIS_framework_docs_dir}/webItem.csv'
webLink_path = f'{CORDIS_framework_docs_dir}/webLink.csv'

## Inspect Reports

In [15]:
# get DataFrame keys
data_report = pd.read_csv(data_report_path, delimiter=';')
data_report.keys()

Index(['id', 'title', 'projectID', 'projectAcronym', 'attachment',
       'contentUpdateDate', 'rcn'],
      dtype='object')

In [7]:
data_report.head()

Unnamed: 0,id,title,projectID,projectAcronym,attachment,contentUpdateDate,rcn
0,101066069_PSHORIZON,Periodic Reporting for period 1 - ERASMUS (Ear...,101066069,ERASMUS,,2025-03-17 10:38:00,1267558
1,101073231_PSHORIZON,Periodic Reporting for period 1 - OncoProTools...,101073231,OncoProTools,/docs/results/horizon/101073/101073231_PS/2024...,2025-03-18 12:31:34,1270628
2,101068156_PSHORIZON,Periodic Reporting for period 1 - BLISS (Beta-...,101068156,BLISS,/docs/results/horizon/101068/101068156_PS/pict...,2025-03-05 11:47:45,1260626
3,101072180_PSHORIZON,Periodic Reporting for period 1 - Green2Ice (W...,101072180,Green2Ice,/docs/results/horizon/101072/101072180_PS/2023...,2025-02-14 10:36:27,1252991
4,101063407_PSHORIZON,Periodic Reporting for period 1 - GHost (His E...,101063407,GHost,/docs/results/horizon/101063/101063407_PS/pa-1...,2025-02-26 17:32:14,1257475


### Missing values
1. Check each column for missing values
2. Define a decision tree for handling missing values
3. Change values algorithmically
4. Store the updated dataframe in the interim directory
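
As a compact alternative for step 1, the per-column count of missing values can be obtained directly with `isnull().sum()`. This is a minimal sketch on a toy frame standing in for `data_report`; the helper name `count_missing` is hypothetical and not part of the project code.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, skipping complete columns."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Toy frame standing in for data_report (values are made up for illustration)
toy = pd.DataFrame({"id": [1, 2, 3], "attachment": [None, "a.png", None]})
missing_counts = count_missing(toy)
print(missing_counts)
```

Only columns that actually contain gaps are reported, which keeps the output short for wide tables.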


In [28]:
# look for missing values
report_missing = data_report.isnull()

# check which columns are missing data
for key in data_report:
    missing = report_missing[report_missing[key] == True]
    print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

For key id:
     0 elements are missing.
For key title:
     0 elements are missing.
For key projectID:
     0 elements are missing.
For key projectAcronym:
     0 elements are missing.
For key attachment:
     1861 elements are missing.
For key contentUpdateDate:
     0 elements are missing.
For key rcn:
     0 elements are missing.


Only the attachment column has missing values. These attachments refer to additional documents, mostly PNG images.
We can handle this in three ways:
- Look up the missing attachments manually
- Ignore this column during analysis
- If an attachment is present, add it to the dashboard when the user inspects a particular project. If not, leave it blank.

I recommend the last approach.
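
The recommended rule could be sketched as follows. The helper `attachment_or_blank` and the toy frame are assumptions for illustration only; the dashboard code itself is not part of this notebook.

```python
import pandas as pd


def attachment_or_blank(row: pd.Series) -> str:
    """Return the attachment path if present, otherwise an empty string."""
    value = row.get("attachment")
    return "" if pd.isna(value) else str(value)


# Toy frame standing in for data_report (paths are made up)
toy = pd.DataFrame({"projectID": [1, 2], "attachment": ["/docs/a.png", None]})
shown = [attachment_or_blank(row) for _, row in toy.iterrows()]
print(shown)
```

With this rule the raw column can stay untouched in the interim file, since the blank is only substituted at display time.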

In [None]:
# handle missing values

# define missing-values rule here

# change the missing values in dataframe
project_reports_interim = data_report
# save updated dataframe to data/interim (to_csv takes `sep`, not `delimiter`)
project_reports_interim.to_csv(f'{interim_dir}/reportSummaries_interim.csv', sep=';')

### Inspect other report file
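This CSV file does not contain useful information for the analysis: it is a delivery status log of the downloaded data packages. Note that it appears to be comma-delimited, so reading it with `delimiter=';'` collapses every row into a single column.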

In [71]:
data_filereport = pd.read_csv(data_filereport_path, delimiter=';')
data_filereport

Unnamed: 0,"filename,status, issue_cause downloadURL, issue_cause accessURL"
0,HORIZON Report summaries (individual XML files...
1,"HORIZON Projects,delivered,,"
2,"HORIZON Projects Deliverables,delivered,,"
3,"HORIZON Projects (individual XML files),delive..."
4,HORIZON Projects Deliverables (individual XML ...
5,"HORIZON Report summaries,delivered,,"
6,"HORIZON Publications,delivered,,"
7,"HORIZON Projects Deliverables,delivered,,"
8,"HORIZON Publications,delivered,,"
9,"HORIZON Projects Deliverables,delivered,,"
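
The single-column output above suggests that this file uses commas rather than semicolons. The sketch below reproduces the effect on an in-memory sample (made-up content, not the real file) and shows that the default comma separator splits the columns. The real file may need further options, which is an assumption to verify.

```python
import io

import pandas as pd

# In-memory stand-in for a comma-delimited status file
sample = "filename,status\nHORIZON Projects,delivered\n"

# Wrong separator: the whole line ends up in one column
wrong = pd.read_csv(io.StringIO(sample), delimiter=";")

# Default comma separator: columns are split as intended
right = pd.read_csv(io.StringIO(sample))

print(wrong.shape, right.shape)
```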


## Inspect deliverables

In [72]:
# Inspect Dataframe
data_deliverables = pd.read_csv(data_deliverables_path, delimiter=';')
data_deliverables.keys()

Index(['id', 'title', 'deliverableType', 'description', 'projectID',
       'projectAcronym', 'url', 'collection', 'contentUpdateDate', 'rcn'],
      dtype='object')

In [74]:
data_deliverables

Unnamed: 0,id,title,deliverableType,description,projectID,projectAcronym,url,collection,contentUpdateDate,rcn
0,101071179_10_DELIVHORIZON,Technical/scientific review meeting 2 documents,"Documents, reports",Draft agenda and presentations,101071179,SUSTAIN,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:48,1246973.0
1,101072491_9_DELIVHORIZON,MIRELAI project website,"Websites, patent fillings, videos etc.",MIRELAI project website,101072491,MIRELAI,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-26 08:02:37,1256521.0
2,101066116_3_DELIVHORIZON,"Communication, Dissemination & Outreach Plan","Documents, reports",The plan describes the planned measures to max...,101066116,Ship Clones,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:42,1246926.0
3,101064988_11_DELIVHORIZON,SINFONICA Knowledge map creation and System Ar...,"Documents, reports",The deliverable will be the output of Tasks 4....,101064988,SINFONICA,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:44,1246937.0
4,101071179_51_DELIVHORIZON,Report on Portfolio activities 01,"Documents, reports",The report will present the collaboration acti...,101071179,SUSTAIN,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:44,1246934.0
...,...,...,...,...,...,...,...,...,...,...
21919,101091749_16_DELIVHORIZON,Report on DBL state of play,"Documents, reports",Report on DBL state of play,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092491.0
21920,101091749_30_DELIVHORIZON,Demo-BLog website,"Websites, patent fillings, videos etc.",Demo-BLog website,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:26,1092484.0
21921,101091749_29_DELIVHORIZON,Demo-BLog project identity,Other,Demo-BLog project identity,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092496.0
21922,101091749_33_DELIVHORIZON,Promotional films (No1),"Websites, patent fillings, videos etc.",Promotional films (No1),101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092488.0


I manually edited the following lines of the raw CSV so that pandas can parse the file:
- 1412: something wrong in the deliverable description
- 1412: wrong use of delimiter
- 6677: wrong use of quotation marks
- 6678: wrong use of quotation marks
- 8812: use of delimiter inside string
- 8826: use of delimiter inside string
- 9360: use of delimiter inside string
- 9524: use of delimiter inside string
- 10128: use of delimiter inside string
- 13108: use of delimiter inside string
- 19931: use of delimiter inside string
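
Lines like these can be located programmatically before editing them by hand. The following is a rough sketch, not the procedure actually used here: it counts the fields on each record with the standard `csv` module and reports records whose count differs from the header. The function name `find_malformed_lines` is hypothetical.

```python
import csv
import io


def find_malformed_lines(handle, delimiter=";"):
    """Return (line_number, field_count) for records that do not match the header width."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))  # number of columns in the header
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the physical line on which the record ended
            bad.append((reader.line_num, len(row)))
    return bad


# Made-up sample: line 3 has an unquoted delimiter inside the title
sample = io.StringIO('id;title;rcn\n1;ok;10\n2;bad;title;11\n3;"quoted;ok";12\n')
bad_lines = find_malformed_lines(sample)
print(bad_lines)
```

When reading a real file, open it with `newline=''` as the `csv` documentation recommends. Broken quotation marks can make one record span several physical lines, so the reported numbers are a starting point for inspection rather than an exact diagnosis.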

### Missing values
Here we handle the missing values in the dataset

In [75]:
# look for missing values
deliverables_missing = data_deliverables.isnull()

# check which columns are missing data
for key in deliverables_missing.keys():
    missing = deliverables_missing[deliverables_missing[key] == True]
    print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

For key id:
     0 elements are missing.
For key title:
     0 elements are missing.
For key deliverableType:
     35 elements are missing.
For key description:
     15 elements are missing.
For key projectID:
     0 elements are missing.
For key projectAcronym:
     0 elements are missing.
For key url:
     1 elements are missing.
For key collection:
     0 elements are missing.
For key contentUpdateDate:
     0 elements are missing.
For key rcn:
     1 elements are missing.


We are missing elements in the following columns:
- deliverableType
    - option 1: change to `'Other'`
    - option 2: look up individual titles and add manually
- description
    - option 1: add empty string
    - option 2: inspect manually to gain more insight into what they represent
        - Update: the titles of the affected rows are descriptive enough. I suggest we copy the title values into the description column.
- url
    - 1 missing url. Add the URL of the main page of this project (SELFY, id = 101069748_16_DELIVHORIZON) instead of a link to the deliverable?
- rcn
    - 1 rcn is missing.
    - Looked this number up in the publication list based on projectAcronym = `'GeneBEcon'`. There the rcn is given as `1077637.0`.


In [77]:
# change unknown deliverable types to other
data_deliverables['deliverableType'] = data_deliverables['deliverableType'].fillna('Other') 

# change empty descriptions to title of that particular row
data_deliverables['description'] = data_deliverables['description'].fillna(data_deliverables['title'])

# change missing url to homepage of the particular project
data_deliverables['url'] = data_deliverables['url'].fillna('https://selfy-project.eu/')

# add missing rcn number
data_deliverables['rcn'] = data_deliverables['rcn'].fillna(1077637.0)
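
The `url` and `rcn` fills above work because each column has exactly one missing entry. If the raw file later gains more gaps, a blanket `fillna` would assign the SELFY homepage or the GeneBEcon rcn to unrelated rows. A more defensive variant, sketched here on a toy frame with made-up extra rows, restricts each fill to the row it was looked up for.

```python
import pandas as pd

# Toy frame standing in for data_deliverables (rows other than the first are made up)
toy = pd.DataFrame({
    "id": ["101069748_16_DELIVHORIZON", "x_1", "x_2"],
    "projectAcronym": ["SELFY", "GeneBEcon", "OTHER"],
    "url": [None, "https://example.org/a", None],
    "rcn": [1.0, None, None],
})

# Fill the URL only for the SELFY deliverable identified above
url_mask = (toy["id"] == "101069748_16_DELIVHORIZON") & toy["url"].isna()
toy.loc[url_mask, "url"] = "https://selfy-project.eu/"

# Fill the rcn only for the GeneBEcon project
rcn_mask = (toy["projectAcronym"] == "GeneBEcon") & toy["rcn"].isna()
toy.loc[rcn_mask, "rcn"] = 1077637.0

print(toy)
```

Rows that do not match either mask keep their missing values, so new gaps remain visible in the next missing-value check.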

In [79]:
# check whether filling executed correctly
data_deliverables[deliverables_missing.deliverableType == True]

Unnamed: 0,id,title,deliverableType,description,projectID,projectAcronym,url,collection,contentUpdateDate,rcn
823,101062643_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101062643,LUARC,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-09-26 17:38:03,973315.0
963,101039920_1_DELIVHORIZON,Data Management Plan,Other,Data Management Plan,101039920,AFREXTRACT,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:06:37,914351.0
973,101040355_1_DELIVHORIZON,Data Management Plan,Other,Drafting of Data Management Plan,101040355,NeuRoPROBE,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:05:36,914340.0
2526,101066209_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101066209,POSTURE,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:08:51,914373.0
4173,101068487_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101068487,TeaGre,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:08:07,914367.0
4174,101062409_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101062409,SuperElectro,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:07:53,914364.0
4179,101060143_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101060143,ODeLiCs,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:05:45,914342.0
4180,101063914_1_DELIVHORIZON,Data Management Plan,Other,The Data Management Plan describes the data ma...,101063914,ReCHISVac,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:09:04,914375.0
5733,101043589_1_DELIVHORIZON,Data Management Plan,Other,Data Management Plan,101043589,HybridExpress,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:05:08,914335.0
5740,101052538_1_DELIVHORIZON,Data Management Plan,Other,Data Management Plan,101052538,NovoGenePop,https://ec.europa.eu/research/participants/doc...,Project deliverable,2023-02-21 14:05:21,914338.0


In [80]:
# save updated dataframe to data/interim
data_deliverables.to_csv(f'{interim_dir}/projectdeliverables_interim.csv', sep=';')

## Inspect Publications

In [89]:
# Inspect Dataframe
data_publications = pd.read_csv(data_publications_path, delimiter=';')
data_publications.keys()

Index(['id', 'title', 'isPublishedAs', 'authors', 'journalTitle',
       'journalNumber', 'publishedYear', 'publishedPages', 'issn', 'isbn',
       'doi', 'projectID', 'projectAcronym', 'collection', 'contentUpdateDate',
       'rcn'],
      dtype='object')

Some entries in the publications CSV have been changed by hand, in order to allow loading them:
- 7588
- 7748

Both are from the same conference. Problem: the order of the columns was switched and one additional empty column was added, which caused the pandas loader to crash.

Next problems:
- 12036: wrong notation of authors names + use of ; delimiter inside string.
- 12043: start authors string with four " + use ; to separate names.
- 12099: same problem as stated above
- 12110: same problem
- 12115: same problem
- 12270: same problem
- 18735: same problem
- 24019: same problem




In [90]:
data_publications

Unnamed: 0,id,title,isPublishedAs,authors,journalTitle,journalNumber,publishedYear,publishedPages,issn,isbn,doi,projectID,projectAcronym,collection,contentUpdateDate,rcn
0,101040480_113381_PUBLIHORIZON,The Microwave Rotational Electric Resonance (R...,Peer reviewed articles,"Hamza El Hadki, Kenneth J. Koziol, Oum Keltoum...",Molecules,28,2023,,1420-3049,,10.3390/molecules28083419,101040480,LACRIDO,Project publication,2025-02-11 11:41:50,1243351
1,101040480_113371_PUBLIHORIZON,The microwave spectra of the conformers of n-b...,Peer reviewed articles,"Susanna L. Stephens, Eléonore Antonelli, Alexa...",Journal of Molecular Spectroscopy,397,2024,,0022-2852,,10.1016/j.jms.2023.111824,101040480,LACRIDO,Project publication,2025-02-11 11:01:16,1243327
2,101040480_113383_PUBLIHORIZON,Coupled internal rotations and 14N quadrupole ...,Peer reviewed articles,"Mike Barth, Isabelle Kleiner, Ha Vinh Lam Nguyen",The Journal of Chemical Physics,160,2024,,0021-9606,,10.1063/5.0213319,101040480,LACRIDO,Project publication,2025-02-11 11:40:00,1243350
3,101040480_113375_PUBLIHORIZON,"The Heavy Atom Structure, <i>“cis</i> effect” ...",Peer reviewed articles,"Truong Anh Nguyen, Isabelle Kleiner, Martin Sc...",ChemPhysChem,25,2024,,1439-4235,,10.1002/cphc.202400387,101040480,LACRIDO,Project publication,2025-02-11 11:11:17,1243343
4,101040480_113374_PUBLIHORIZON,"Structure determination of 2,5-difluorophenol ...",Peer reviewed articles,"K.P. Rajappan Nair, Kevin G. Lengsfeld, Philip...",Journal of Molecular Structure,1321,2024,,0022-2860,,10.1016/j.molstruc.2024.139971,101040480,LACRIDO,Project publication,2025-02-11 11:04:26,1243340
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24145,101060809_11542_PUBLIHORIZON,Review of International Studies,Peer reviewed articles,"Pogdda, S., Richmond, O., Visoka, G.",Review of International Studies,3,2023,,0260-2105,,10.1017/S0260210522000377,101060809,EMBRACE,Project publication,2024-02-13 10:14:03,1042445
24146,101061890_16827_PUBLIHORIZON,Dual-band electro-optically steerable antenna,Peer reviewed articles,"Dmytro Vovchuk, Anna Mikhailovskaya, Dmitry Do...",Journal of Optics,25 105601,2023,,2040-8986,,10.1088/2040-8986/acf1ae,101061890,DeepSight,Project publication,2024-02-19 09:22:09,1043146
24147,101061201_13995_PUBLIHORIZON,Positron Annihilation Study of RPV Steels Radi...,Peer reviewed articles,Vladimir Slugen; Tomas Brodziansky; Jana Simeg...,Materials; Volume 15; Issue 20; Pages: 7091,,2022,,1996-1944,,10.3390/ma15207091,101061201,DELISA- LTO,Project publication,2024-01-22 18:22:57,1035198
24148,101061201_13826_PUBLIHORIZON,Round Robin Tests for WWER Heat Exchange Tubes,Peer reviewed articles,"Roman Krajcovic, Michal Benak, Radim Kopriva, ...",e-Journal of Nondestructive Testing 28(7),,2023,,1435-4934,,10.58286/28273,101061201,DELISA- LTO,Project publication,2024-01-16 10:28:00,1033841


### Missing values
Here we inspect the missing data in this file and outline how we are going to treat these missing data points.

In [91]:
# look for missing values
publications_missing = data_publications.isnull()

# check which columns are missing data
for key in publications_missing.keys():
    missing = publications_missing[publications_missing[key] == True]
    if len(missing.id) > 0:
        print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

For key authors:
     75 elements are missing.
For key journalTitle:
     2099 elements are missing.
For key journalNumber:
     13622 elements are missing.
For key publishedPages:
     24142 elements are missing.
For key issn:
     7004 elements are missing.
For key isbn:
     23219 elements are missing.
For key doi:
     2293 elements are missing.


A considerable amount of data is missing in this file. Let us go through each column individually.
- authors:
    - This is unfortunate: it would have been very useful to decompose author strings into single authors and make the connections between them.
    - How to treat this: look into the article title string to check whether it contains more author information
- journalTitle:
    - Check the publication title. In some entries the whole article reference has simply been copy-pasted there
- journalNumber:
    - Not the most relevant parameter in my opinion. Replace all NaN with a zero placeholder
- publishedYear:
    - Look this up manually
- publishedPages:
    - Not the most relevant parameter in my opinion. Replace all NaN with a zero placeholder
- issn:
    - Not the most relevant parameter in my opinion. Replace all NaN with a zero placeholder
- isbn:
    - Not the most relevant parameter in my opinion. Replace all NaN with a zero placeholder
- doi:
    - Not worth recovering in bulk; pass `about:blank` as the URL.
- rcn:
    - Adjust this one manually.
        - Update: this entry was missing a value for authors, so all following fields were shifted one column to the left. Fixed this one manually.
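
Decomposing author strings into single authors, as mentioned for the authors column, could start from a heuristic like the one below. This is only a sketch under a strong assumption: names are separated by `;` when a semicolon is present and by `,` otherwise. The helper `split_authors` is hypothetical.

```python
from typing import List


def split_authors(authors: str) -> List[str]:
    """Split an author string on ';' if present, otherwise on ','."""
    separator = ";" if ";" in authors else ","
    return [name.strip() for name in authors.split(separator) if name.strip()]


comma_case = split_authors("Mike Barth, Isabelle Kleiner, Ha Vinh Lam Nguyen")
semicolon_case = split_authors("Vladimir Slugen; Tomas Brodziansky")
print(comma_case)
print(semicolon_case)
```

The heuristic fails for entries written as "Surname, Initial., Surname, Initial.", where commas also separate surname and initials, so such rows would need a separate rule or manual review.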



In [85]:
data_publications.keys()

Index(['id', 'title', 'isPublishedAs', 'authors', 'journalTitle',
       'journalNumber', 'publishedYear', 'publishedPages', 'issn', 'isbn',
       'doi', 'projectID', 'projectAcronym', 'collection', 'contentUpdateDate',
       'rcn'],
      dtype='object')

In [None]:
# check missing rcn. 
data_publications[publications_missing.rcn == True]

Unnamed: 0,id,title,isPublishedAs,authors,journalTitle,journalNumber,publishedYear,publishedPages,issn,isbn,doi,projectID,projectAcronym,collection,contentUpdateDate,rcn
19443,101058393_2261_PUBLIHORIZON,"FAIRChemistry _ Webinar 03_""What is a chemical...",WorldFAIR Chemistry,FAIRChemistry Webinar 03,3,2023,,,,10.5281/zenodo.7689655,101058393,WorldFAIR,Project publication,2023-08-22 09:48:09,970524,


In [94]:
# fill some gaps in the data structure
data_publications['isbn'] = data_publications['isbn'].fillna('0000-0000')
data_publications['issn'] = data_publications['issn'].fillna('0000-0000')
data_publications['publishedPages'] = data_publications['publishedPages'].fillna(0)
data_publications['doi'] = data_publications['doi'].fillna('about:blank')
data_publications['journalTitle'] = data_publications['journalTitle'].fillna('Miscellaneous')
data_publications['journalNumber'] = data_publications['journalNumber'].fillna(0)
data_publications['authors'] = data_publications['authors'].fillna('sine nome')


In [95]:
# check data_publications again
publications_missing = data_publications.isnull()

# check which columns are missing data
for key in publications_missing.keys():
    missing = publications_missing[publications_missing[key] == True]
    if len(missing.id) > 0:
        print(f'For key {key}:\n     {len(missing.id)} elements are missing.')

Now there are no empty entries left. We store the completed dataset in the interim folder.

In [97]:
# Save to intermediate
data_publications.to_csv(f'{interim_dir}/projectPublications_interim.csv', sep=';')

## Construct functions to access cleaned data

We now define some functions that allow easy access to all aspects of different projects. 


In [None]:
def get_project_data():
    pass
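
One possible shape for such an accessor is sketched below. It is an assumption, not the intended final design: the function name `get_project_data_sketch`, its signature, and the toy tables are hypothetical. It relies only on the `projectID` column that the reports, deliverables and publications tables all share.

```python
import pandas as pd


def get_project_data_sketch(project_id, tables):
    """Return, for each named table, the rows belonging to one project.

    tables: dict mapping a table name to a DataFrame with a 'projectID' column.
    """
    return {
        name: df[df["projectID"] == project_id].reset_index(drop=True)
        for name, df in tables.items()
    }


# Toy tables standing in for the cleaned interim datasets
toy_tables = {
    "reports": pd.DataFrame({"projectID": [1, 2], "title": ["a", "b"]}),
    "deliverables": pd.DataFrame({"projectID": [1, 1, 3], "title": ["d1", "d2", "d3"]}),
}
result = get_project_data_sketch(1, toy_tables)
print({name: len(df) for name, df in result.items()})
```

Passing the tables in explicitly keeps the function independent of notebook globals, which makes it easier to move into a module later.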

