# Data Preparation

`Overview`
This notebook handles the initial data processing pipeline:
- Loading raw data from source files
- Performing exploratory data analysis (EDA)
- Cleaning and handling missing values
- Feature preprocessing and engineering
- Exporting processed datasets for modeling

`Inputs`
- Raw data files from `../data/raw/` 

`Outputs`
- Processed datasets in `../data/processed/`
- EDA visualizations in `../reports/figures/`

`Dependencies`
- pandas
- numpy
- matplotlib
- seaborn

*Note: This is notebook 1 of the analysis pipeline*

In [1]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path

# Import custom modules
# from src.save_load import save_parquet

In [2]:
!which python


/Users/bertdepoorter/opt/anaconda3/envs/MDA/bin/python


Here we define the paths to the project-specific datasets, which are stored as CSV files. In the follow-up cell, we define the paths to the auxiliary dataset containing extra information on the CORDIS-HORIZON projects. This includes:
- scientific vocabulary (euroSciVoc)
- legal basis documents
- organization
- project
- topics
- webItem
- webLink

In [19]:
# Locate the project root: the parent of the notebook's working directory
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)

# define file paths to project-specific files
data_report_path = f'{parent_dir}/data/raw/reportSummaries.csv'
data_publications_path = f'{parent_dir}/data/raw/projectPublications.csv'
data_deliverables_path = f'{parent_dir}/data/raw/projectDeliverables.csv'



In [None]:
CORDIS_framework_docs_dir = f'{parent_dir}/data/raw/cordis-HORIZONprojects-csv'

SciVoc_path = f'{CORDIS_framework_docs_dir}/euroSciVoc.csv'
legalBasis_path = f'{CORDIS_framework_docs_dir}/legalBasis.csv'
organization_path = f'{CORDIS_framework_docs_dir}/organization.csv'
project_path = f'{CORDIS_framework_docs_dir}/project.csv'
topics_path = f'{CORDIS_framework_docs_dir}/topics.csv'
webItems_path = f'{CORDIS_framework_docs_dir}/webItem.csv'
webLink_path = f'{CORDIS_framework_docs_dir}/webLink.csv'
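Before loading anything, it can help to confirm that the expected files are actually present. The following is a minimal sketch, not part of the original notebook: the helper `missing_files` is a hypothetical name, and the example runs against a temporary directory so that it does not depend on the raw data being available.

```python
from pathlib import Path
import tempfile

# File names taken from the path definitions above
EXPECTED_FILES = [
    "euroSciVoc.csv",
    "legalBasis.csv",
    "organization.csv",
    "project.csv",
    "topics.csv",
    "webItem.csv",
    "webLink.csv",
]


def missing_files(base_dir, names):
    """Return the names that do not exist as regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]


# Demonstration on a temporary directory that holds only one of the files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "project.csv").touch()
    missing = missing_files(tmp, EXPECTED_FILES)

print(missing)
```

In the notebook itself, the same check could be run as `missing_files(CORDIS_framework_docs_dir, EXPECTED_FILES)`.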

## Inspect Reports

In [15]:
# Load the report summaries and list the column names
data_report = pd.read_csv(data_report_path, delimiter=';')
data_report.keys()

Index(['id', 'title', 'projectID', 'projectAcronym', 'attachment',
       'contentUpdateDate', 'rcn'],
      dtype='object')

In [18]:
data_report.head()

Unnamed: 0,id,title,projectID,projectAcronym,attachment,contentUpdateDate,rcn
0,101066069_PSHORIZON,Periodic Reporting for period 1 - ERASMUS (Ear...,101066069,ERASMUS,,2025-03-17 10:38:00,1267558
1,101073231_PSHORIZON,Periodic Reporting for period 1 - OncoProTools...,101073231,OncoProTools,/docs/results/horizon/101073/101073231_PS/2024...,2025-03-18 12:31:34,1270628
2,101068156_PSHORIZON,Periodic Reporting for period 1 - BLISS (Beta-...,101068156,BLISS,/docs/results/horizon/101068/101068156_PS/pict...,2025-03-05 11:47:45,1260626
3,101072180_PSHORIZON,Periodic Reporting for period 1 - Green2Ice (W...,101072180,Green2Ice,/docs/results/horizon/101072/101072180_PS/2023...,2025-02-14 10:36:27,1252991
4,101063407_PSHORIZON,Periodic Reporting for period 1 - GHost (His E...,101063407,GHost,/docs/results/horizon/101063/101063407_PS/pa-1...,2025-02-26 17:32:14,1257475


## Inspect deliverables

In [34]:
# Load the deliverables and list the column names
data_deliverables = pd.read_csv(data_deliverables_path, delimiter=';')
data_deliverables.keys()

Index(['id', 'title', 'deliverableType', 'description', 'projectID',
       'projectAcronym', 'url', 'collection', 'contentUpdateDate', 'rcn'],
      dtype='object')

In [36]:
data_deliverables

Unnamed: 0,id,title,deliverableType,description,projectID,projectAcronym,url,collection,contentUpdateDate,rcn
0,101071179_10_DELIVHORIZON,Technical/scientific review meeting 2 documents,"Documents, reports",Draft agenda and presentations,101071179,SUSTAIN,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:48,1246973.0
1,101072491_9_DELIVHORIZON,MIRELAI project website,"Websites, patent fillings, videos etc.",MIRELAI project website,101072491,MIRELAI,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-26 08:02:37,1256521.0
2,101066116_3_DELIVHORIZON,"Communication, Dissemination & Outreach Plan","Documents, reports",The plan describes the planned measures to max...,101066116,Ship Clones,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:42,1246926.0
3,101064988_11_DELIVHORIZON,SINFONICA Knowledge map creation and System Ar...,"Documents, reports",The deliverable will be the output of Tasks 4....,101064988,SINFONICA,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:44,1246937.0
4,101071179_51_DELIVHORIZON,Report on Portfolio activities 01,"Documents, reports",The report will present the collaboration acti...,101071179,SUSTAIN,https://ec.europa.eu/research/participants/doc...,Project deliverable,2025-02-03 17:40:44,1246934.0
...,...,...,...,...,...,...,...,...,...,...
21919,101091749_16_DELIVHORIZON,Report on DBL state of play,"Documents, reports",Report on DBL state of play,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092491.0
21920,101091749_30_DELIVHORIZON,Demo-BLog website,"Websites, patent fillings, videos etc.",Demo-BLog website,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:26,1092484.0
21921,101091749_29_DELIVHORIZON,Demo-BLog project identity,Other,Demo-BLog project identity,101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092496.0
21922,101091749_33_DELIVHORIZON,Promotional films (No1),"Websites, patent fillings, videos etc.",Promotional films (No1),101091749,Demo-BLog,https://ec.europa.eu/research/participants/doc...,Project deliverable,2024-05-10 13:03:27,1092488.0


I edited the following lines of the deliverables CSV by hand so that pandas can open the file:
- 1412: something wrong in the deliverable description
- 1412: incorrect use of the delimiter
- 6677: incorrect use of quotation marks
- 6678: incorrect use of quotation marks
- 8812: delimiter used inside a string
- 8826: delimiter used inside a string
- 9360: delimiter used inside a string
- 9524: delimiter used inside a string
- 10128: delimiter used inside a string
- 13108: delimiter used inside a string
- 19931: delimiter used inside a string
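Lines like these can be located programmatically rather than by trial and error. The sketch below is an illustration using only the standard library, not the method used to find the lines listed above. It reports every row whose number of fields differs from the header; `find_malformed_rows` is a hypothetical helper name and the sample rows are made up.

```python
import csv
import io


def find_malformed_rows(lines, delimiter=";"):
    """Return (line_number, field_count) for rows whose width differs from the header.

    `lines` is any iterable of text lines, for example a file opened with newline="".
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the physical line on which the row ended
            bad.append((reader.line_num, len(row)))
    return bad


# Made-up sample: line 2 quotes the delimiter correctly, line 3 does not
sample = io.StringIO(
    "id;title;projectID\n"
    '1;"Plan; draft";101\n'
    "2;Plan; draft;102\n"
    "3;Website;103\n"
)
problems = find_malformed_rows(sample)
print(problems)  # [(3, 4)]
```

This catches stray delimiters and extra columns. A row with unbalanced quotation marks may instead swallow the following lines, so it tends to show up as a single very long row.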

## Inspect Publications

In [17]:
# Load the publications and list the column names
data_publications = pd.read_csv(data_publications_path, delimiter=';')
data_publications.keys()

Index(['id', 'title', 'isPublishedAs', 'authors', 'journalTitle',
       'journalNumber', 'publishedYear', 'publishedPages', 'issn', 'isbn',
       'doi', 'projectID', 'projectAcronym', 'collection', 'contentUpdateDate',
       'rcn'],
      dtype='object')

Some entries in the publications CSV were changed by hand so that the file can be loaded:
- 7588
- 7748

Both are from the same conference. Problem: the order of the columns was switched and one additional empty column was added, which caused the pandas loader to crash.

Further problems:
- 12036: wrong notation of author names, plus use of the `;` delimiter inside a string.
- 12043: the authors string starts with four `"` characters and uses `;` to separate names.
- 12099: same problem as stated above
- 12110: same problem
- 12115: same problem
- 12270: same problem
- 18735: same problem
- 24019: same problem
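For reference, this is how a field that contains the delimiter or quotation marks should look once it is written correctly. The sketch below uses the standard `csv` module with made-up values (the author names and title are hypothetical, not taken from the dataset). The field is wrapped in quotation marks, and any quotation mark inside it is doubled.

```python
import csv
import io

# Made-up row: the authors field contains ';' and the title contains quotation marks
row = ["12036", "Doe, J.; Roe, R.", 'The "cis" effect']

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL)
writer.writerow(row)
line = buffer.getvalue()
print(line)

# Reading the line back with the same delimiter recovers the original fields
parsed = next(csv.reader(io.StringIO(line), delimiter=";"))
print(parsed)
```

A line written this way parses into the expected number of columns, so the `;` between author names no longer shifts the remaining fields.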




In [35]:
data_publications

Unnamed: 0,id,title,isPublishedAs,authors,journalTitle,journalNumber,publishedYear,publishedPages,issn,isbn,doi,projectID,projectAcronym,collection,contentUpdateDate,rcn
0,101040480_113381_PUBLIHORIZON,The Microwave Rotational Electric Resonance (R...,Peer reviewed articles,"Hamza El Hadki, Kenneth J. Koziol, Oum Keltoum...",Molecules,28,2023.0,,1420-3049,,10.3390/molecules28083419,101040480,LACRIDO,Project publication,2025-02-11 11:41:50,1243351.0
1,101040480_113371_PUBLIHORIZON,The microwave spectra of the conformers of n-b...,Peer reviewed articles,"Susanna L. Stephens, Eléonore Antonelli, Alexa...",Journal of Molecular Spectroscopy,397,2024.0,,0022-2852,,10.1016/j.jms.2023.111824,101040480,LACRIDO,Project publication,2025-02-11 11:01:16,1243327.0
2,101040480_113383_PUBLIHORIZON,Coupled internal rotations and 14N quadrupole ...,Peer reviewed articles,"Mike Barth, Isabelle Kleiner, Ha Vinh Lam Nguyen",The Journal of Chemical Physics,160,2024.0,,0021-9606,,10.1063/5.0213319,101040480,LACRIDO,Project publication,2025-02-11 11:40:00,1243350.0
3,101040480_113375_PUBLIHORIZON,"The Heavy Atom Structure, <i>“cis</i> effect” ...",Peer reviewed articles,"Truong Anh Nguyen, Isabelle Kleiner, Martin Sc...",ChemPhysChem,25,2024.0,,1439-4235,,10.1002/cphc.202400387,101040480,LACRIDO,Project publication,2025-02-11 11:11:17,1243343.0
4,101040480_113374_PUBLIHORIZON,"Structure determination of 2,5-difluorophenol ...",Peer reviewed articles,"K.P. Rajappan Nair, Kevin G. Lengsfeld, Philip...",Journal of Molecular Structure,1321,2024.0,,0022-2860,,10.1016/j.molstruc.2024.139971,101040480,LACRIDO,Project publication,2025-02-11 11:04:26,1243340.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24145,101060809_11542_PUBLIHORIZON,Review of International Studies,Peer reviewed articles,"Pogdda, S., Richmond, O., Visoka, G.",Review of International Studies,3,2023.0,,0260-2105,,10.1017/S0260210522000377,101060809,EMBRACE,Project publication,2024-02-13 10:14:03,1042445.0
24146,101061890_16827_PUBLIHORIZON,Dual-band electro-optically steerable antenna,Peer reviewed articles,"Dmytro Vovchuk, Anna Mikhailovskaya, Dmitry Do...",Journal of Optics,25 105601,2023.0,,2040-8986,,10.1088/2040-8986/acf1ae,101061890,DeepSight,Project publication,2024-02-19 09:22:09,1043146.0
24147,101061201_13995_PUBLIHORIZON,Positron Annihilation Study of RPV Steels Radi...,Peer reviewed articles,Vladimir Slugen; Tomas Brodziansky; Jana Simeg...,Materials; Volume 15; Issue 20; Pages: 7091,,2022.0,,1996-1944,,10.3390/ma15207091,101061201,DELISA- LTO,Project publication,2024-01-22 18:22:57,1035198.0
24148,101061201_13826_PUBLIHORIZON,Round Robin Tests for WWER Heat Exchange Tubes,Peer reviewed articles,"Roman Krajcovic, Michal Benak, Radim Kopriva, ...",e-Journal of Nondestructive Testing 28(7),,2023.0,,1435-4934,,10.58286/28273,101061201,DELISA- LTO,Project publication,2024-01-16 10:28:00,1033841.0


## Construct general data structure
The goal here is to design a versatile high-level class to interact with the data in a Python-friendly manner. 
Goals:
- Initialize the class once, which includes loading the data on the backend
- Add different columns as attributes
- Assign each project a unique label containing vital information
- Create a dictionary for each project with all information, linking back to the cleaned and post-processed datasets
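One possible shape for such a class is sketched below. This is only an illustration under stated assumptions, not the final design: the class names `ProjectRecord` and `ProjectData` and the label format `<projectID>_<projectAcronym>` are hypothetical. To stay self-contained, the sketch takes each table as a list of row dictionaries (with pandas, `df.to_dict("records")` produces that form), and it does not cover exposing columns as attributes.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectRecord:
    """All rows that belong to one project, grouped by source table."""

    project_id: str
    acronym: str
    tables: dict = field(default_factory=dict)

    @property
    def label(self):
        # Hypothetical label format: "<projectID>_<projectAcronym>"
        return f"{self.project_id}_{self.acronym}"


class ProjectData:
    """Index every table by project once, then serve per-project lookups."""

    def __init__(self, **tables):
        # tables maps a table name to a list of row dicts; each row needs
        # the 'projectID' and 'projectAcronym' columns
        self._projects = {}
        for name, rows in tables.items():
            for row in rows:
                pid = str(row["projectID"])
                record = self._projects.setdefault(
                    pid, ProjectRecord(pid, row["projectAcronym"])
                )
                record.tables.setdefault(name, []).append(row)

    def get_project_data(self, project_id):
        """Return one dictionary with everything known about a project."""
        record = self._projects[str(project_id)]
        return {
            "label": record.label,
            "projectID": record.project_id,
            "projectAcronym": record.acronym,
            "tables": record.tables,
        }


# Rows reduced to a few columns, taken from the outputs shown earlier
reports = [
    {"id": "101066069_PSHORIZON", "projectID": 101066069, "projectAcronym": "ERASMUS"},
]
deliverables = [
    {"id": "101071179_10_DELIVHORIZON", "projectID": 101071179, "projectAcronym": "SUSTAIN"},
    {"id": "101071179_51_DELIVHORIZON", "projectID": 101071179, "projectAcronym": "SUSTAIN"},
]

data = ProjectData(reports=reports, deliverables=deliverables)
info = data.get_project_data(101071179)
print(info["label"], len(info["tables"]["deliverables"]))
```

Grouping the rows once in `__init__` keeps each later lookup a plain dictionary access, which matches the goal of initializing the class a single time.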

In [None]:
def get_project_data():
    # Placeholder, not yet implemented
    pass

