# Data Preparation

`Overview`
This notebook handles the initial data processing pipeline:
- Loading raw data from source files
- Performing exploratory data analysis (EDA)
- Cleaning and handling missing values
- Feature preprocessing and engineering
- Exporting processed datasets for modeling

`Inputs`
- Raw data files from `../data/raw/` 

`Outputs`
- Processed datasets in `../data/processed/`
- EDA visualizations in `../reports/figures/`

`Dependencies`
- pandas
- numpy
- matplotlib
- seaborn

*Note: This is notebook 1 of the analysis pipeline*

In [1]:
# Imports 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import datetime
from pprint import pprint
from pathlib import Path
import csv
import sys

# Add the project root to sys.path so the custom modules can be imported
project_root = Path.cwd().parent  # assumes you're in /notebooks
sys.path.append(str(project_root))

# Import custom modules
from backend.etl.ingestion import inspect_bad_lines, auto_fix_row, robust_csv_reader
from backend.etl.cleaning import standardize_columns, clean_numeric_column, clean_date_column


In [2]:
!which python


'which' is not recognized as an internal or external command,
operable program or batch file.


Here we load the project-specific datasets as CSV files. In a follow-up cell, we load the auxiliary datasets containing extra information on the CORDIS-HORIZON projects. These include:
- Scientific vocabulary 
- legal basis documents
- organization
- project
- topics
- webItem 
- webLink

In [3]:
run_dir = os.getcwd()
print(run_dir)

c:\Users\suley\Desktop\ManaMa\MDA\EU_Horizon_Dashboard\notebooks


In [4]:
# Define the data directories relative to the project root
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)

raw_dir = f'{parent_dir}/data/raw'
interim_dir = f'{parent_dir}/data/interim'
processed_dir = f'{parent_dir}/data/processed'

# define file paths to project-specific files
data_report_path = f'{raw_dir}/reportSummaries.csv'
data_filereport_path = f'{raw_dir}/file_report.csv'
data_publications_path = f'{raw_dir}/projectPublications.csv'
data_deliverables_path = f'{raw_dir}/projectDeliverables.csv'



## Define functions for cleaning
The following functions are needed to load the datasets correctly without editing the raw files by hand.
- `inspect_bad_lines`: inspects lines that cannot be read directly
- `auto_fix_row`: fixes a row by merging its excess columns back together
- `robust_csv_reader`: loads CSV files robustly by applying `auto_fix_row` to the bad lines

Usage:
```
# check bad lines
project_df, problematic_lines = inspect_bad_lines(project_path)

# Inspect how many bad lines there are
print(f"DataFrame loaded with {len(project_df)} rows.")
print(f"Number of problematic lines: {len(problematic_lines)}")
```
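
The row-repair idea described above can be sketched as follows. This is a minimal illustration and not the code in `backend.etl.ingestion`: the helper name `merge_excess_fields`, its parameters, and the toy row are all assumptions. It also assumes that the surplus fields come from an unquoted delimiter inside one known text column.

```python
def merge_excess_fields(row, expected_columns, problematic_column, delimiter=";"):
    """Hypothetical helper: re-join fields that were split inside one text column."""
    excess = len(row) - expected_columns
    if excess <= 0:
        # Nothing to merge; return a copy so the caller's row is untouched.
        return list(row)
    start = problematic_column
    end = problematic_column + excess + 1
    # Glue the surplus pieces back together with the delimiter that split them.
    merged = delimiter.join(row[start:end])
    return row[:start] + [merged] + row[end:]


# Toy row (made up): the text field at index 2 was split into three pieces.
raw_row = ["101_1", "Some title", "first part", " second part", " third part", "ACRONYM"]
fixed_row = merge_excess_fields(raw_row, expected_columns=4, problematic_column=2)
print(fixed_row)
```

Rows that already have the expected number of fields are returned unchanged, so a helper like this can be applied to every line without first filtering for the bad ones.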

## Inspect Reports

In [5]:
# load the report summaries as a DataFrame
data_report = pd.read_csv(data_report_path, delimiter=';')


### Missing values



In [6]:
# look for missing values
report_missing = data_report.isnull()

# report the number of missing entries for every column
for key in data_report:
    n_missing = report_missing[key].sum()
    print(f'For key {key}:\n     {n_missing} elements are missing.')

For key id:
     0 elements are missing.
For key title:
     0 elements are missing.
For key projectID:
     0 elements are missing.
For key projectAcronym:
     0 elements are missing.
For key attachment:
     1861 elements are missing.
For key contentUpdateDate:
     0 elements are missing.
For key rcn:
     0 elements are missing.


## Inspect deliverables

In [7]:
# Inspect Dataframe
# account for problematic lines

deliverables_df, problematic_lines = inspect_bad_lines(data_deliverables_path)

print(f"DataFrame loaded with {len(deliverables_df)} rows.")
print(f"Number of problematic lines: {len(problematic_lines)}")
    

Found 11 problematic lines. Displaying the first 5:
Line 1416: ['101081964_11_DELIVHORIZON', 'Guideline for trial implementation', 'Documents, reports', 'This guideline will summarise and present agricultural practices performed at BioMonitor4CAP farms and research sites with links to local/regional nature conservation goals and targeted species. Based on these baseline information the guideline will present how, when and to what extend research groups of WPs 2 and 4 are integrated into the activities of WP3. Supports specifically achieving following outcomes (Part B, section 2.3): A and B ""Strategy on monitoring soil biodiversity at farm scale adopted by science, end users, and policy"" and E ""Roadmap on expanding enhancing application and implementation of agri-environmental measures committing to the preservation of biodiversity', ' particularly agroforestry."" This deliverable is an output of task 3.1.""', '101081964', 'BioMonitor4CAP', 'https://ec.europa.eu/research/participants

In [8]:
# Try loading with the robust CSV reader
data_deliverables = robust_csv_reader(data_deliverables_path)

# Check the number of rows in the DataFrame
print(f"DataFrame loaded with {len(data_deliverables)} rows.")
# Check the number of problematic lines
print(f"Number of problematic lines: {len(problematic_lines)}")
# Check the first few rows of the DataFrame
print(data_deliverables.head())
# Check the columns of the DataFrame
print(data_deliverables.columns)
# Check the data types of the columns
print(data_deliverables.dtypes)


DataFrame loaded with 21924 rows.
Number of problematic lines: 11
                          id  \
0  101071179_10_DELIVHORIZON   
1   101072491_9_DELIVHORIZON   
2   101066116_3_DELIVHORIZON   
3  101064988_11_DELIVHORIZON   
4  101071179_51_DELIVHORIZON   

                                               title  \
0    Technical/scientific review meeting 2 documents   
1                            MIRELAI project website   
2       Communication, Dissemination & Outreach Plan   
3  SINFONICA Knowledge map creation and System Ar...   
4                  Report on Portfolio activities 01   

                          deliverableType  \
0                      Documents, reports   
1  Websites, patent fillings, videos etc.   
2                      Documents, reports   
3                      Documents, reports   
4                      Documents, reports   

                                         description  projectID  \
0                     Draft agenda and presentations  101071179   

### Missing values
Here we handle the missing values in the dataset.

Elements are missing in the following columns:
- deliverableType
    - option 1: change to `'other'`
    - option 2: look up individual titles and add manually
- description
    - option 1: add empty string
    - option 2: inspect manually to gain more insight into what they represent
        - Update: the titles of the affected rows are closely related to their projects. I suggest we copy the title values into the description column.
- url
    - 1 URL is missing. Add the URL of the project's main page (SELFY, id = 101069748_16_DELIVHORIZON) instead of a link to the deliverable?
- rcn
    - 1 rcn is missing.
    - Looked this number up in the publication list based on projectAcronym = `'GeneBEcon'`. There the rcn is given as `1077637.0`.
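
The options listed above could look roughly like this in pandas. This is only a sketch on made-up toy data, not the logic inside `clean_deliverables`; the example rows and the placeholder URL are assumptions.

```python
import pandas as pd

# Toy stand-in for the deliverables table (all values are made up).
toy = pd.DataFrame({
    "title": ["Project website", "Kick-off report", "Data plan"],
    "deliverableType": ["Websites, patent fillings, videos etc.", None, "Data Management Plan"],
    "description": [None, "Minutes of the kick-off meeting", None],
    "url": ["https://example.org/a", None, "https://example.org/c"],
})

filled = toy.copy()
# deliverableType, option 1: use a catch-all category.
filled["deliverableType"] = filled["deliverableType"].fillna("other")
# description: fall back to the title of the same row.
filled["description"] = filled["description"].fillna(filled["title"])
# url: fall back to a project landing page (placeholder URL, hypothetical).
filled["url"] = filled["url"].fillna("https://example.org/project-home")

print(filled.isnull().sum())
```

Passing a Series to `fillna` aligns on the index, which is why the title of the same row ends up in the empty description.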


In [9]:
# Check whether the issues still persist
from backend.etl.cleaning import clean_deliverables, standardize_columns
print("Cleaning deliverables data...")


data_deliverables_cleaned = clean_deliverables(data_deliverables)

# Check for missing values in data_deliverables
print("Missing values in data_deliverables_cleaned:")
missing_values = data_deliverables_cleaned.isnull().sum()
for col, count in missing_values.items():
    if count > 0:
        print(f"  {col}: {count} missing values")

# Check specific issues mentioned previously
if 'deliverableType' in data_deliverables_cleaned.columns:
    print("\nDeliverableType unique values:")
    print(data_deliverables_cleaned['deliverableType'].unique())
    missing_type = data_deliverables_cleaned[data_deliverables_cleaned['deliverableType'].isnull()]
    if not missing_type.empty:
        print(f"\nSample of rows with missing deliverableType ({len(missing_type)} rows):")
        print(missing_type.head())

# Check for missing descriptions
if 'description' in data_deliverables_cleaned.columns:
    missing_desc = data_deliverables_cleaned[data_deliverables_cleaned['description'].isnull()]
    if not missing_desc.empty:
        print(f"\nSample of rows with missing description ({len(missing_desc)} rows):")
        print(missing_desc.head())

# Check for missing URLs
if 'url' in data_deliverables_cleaned.columns:
    missing_url = data_deliverables_cleaned[data_deliverables_cleaned['url'].isnull()]
    if not missing_url.empty:
        print(f"\nSample of rows with missing URL ({len(missing_url)} rows):")
        print(missing_url.head())

# Check for missing RCNs
if 'rcn' in data_deliverables_cleaned.columns:
    missing_rcn = data_deliverables_cleaned[data_deliverables_cleaned['rcn'].isnull()]
    if not missing_rcn.empty:
        print(f"\nSample of rows with missing RCN ({len(missing_rcn)} rows):")
        print(missing_rcn.head())



# Save the cleaned DataFrame to a new CSV file
cleaned_deliverables_path = os.path.join(interim_dir, 'projectDeliverables_interim.csv')
data_deliverables_cleaned.to_csv(cleaned_deliverables_path, index=False, sep=';')
print(f"Cleaned deliverables data saved to {cleaned_deliverables_path}")

Cleaning deliverables data...
Missing values in data_deliverables_cleaned:

DeliverableType unique values:
['Documents, reports' 'Websites, patent fillings, videos etc.'
 'Data Management Plan' 'Other' 'Demonstrators, pilots, prototypes'
 'Data sets, microdata, etc' 'Ethics Requirements' '']
Cleaned deliverables data saved to c:\Users\suley\Desktop\ManaMa\MDA\EU_Horizon_Dashboard/data/interim\projectDeliverables_interim.csv


## Inspect Publications

In [10]:
# Inspect Dataframe
data_publications = pd.read_csv(data_publications_path, delimiter=';')


In [11]:
publications_df, problematic_lines = inspect_bad_lines(data_publications_path, expected_columns=16)

print(f"DataFrame loaded with {len(publications_df)} rows.")
print(f"Number of problematic lines: {len(problematic_lines)}")

DataFrame loaded with 24150 rows.
Number of problematic lines: 0


In [12]:
publications_df.head()

Unnamed: 0,id,title,isPublishedAs,authors,journalTitle,journalNumber,publishedYear,publishedPages,issn,isbn,doi,projectID,projectAcronym,collection,contentUpdateDate,rcn
0,101040480_113381_PUBLIHORIZON,The Microwave Rotational Electric Resonance (R...,Peer reviewed articles,"Hamza El Hadki, Kenneth J. Koziol, Oum Keltoum...",Molecules,28,2023,,1420-3049,,10.3390/molecules28083419,101040480,LACRIDO,Project publication,2025-02-11 11:41:50,1243351
1,101040480_113371_PUBLIHORIZON,The microwave spectra of the conformers of n-b...,Peer reviewed articles,"Susanna L. Stephens, Eléonore Antonelli, Alexa...",Journal of Molecular Spectroscopy,397,2024,,0022-2852,,10.1016/j.jms.2023.111824,101040480,LACRIDO,Project publication,2025-02-11 11:01:16,1243327
2,101040480_113383_PUBLIHORIZON,Coupled internal rotations and 14N quadrupole ...,Peer reviewed articles,"Mike Barth, Isabelle Kleiner, Ha Vinh Lam Nguyen",The Journal of Chemical Physics,160,2024,,0021-9606,,10.1063/5.0213319,101040480,LACRIDO,Project publication,2025-02-11 11:40:00,1243350
3,101040480_113375_PUBLIHORIZON,"The Heavy Atom Structure, <i>“cis</i> effect” ...",Peer reviewed articles,"Truong Anh Nguyen, Isabelle Kleiner, Martin Sc...",ChemPhysChem,25,2024,,1439-4235,,10.1002/cphc.202400387,101040480,LACRIDO,Project publication,2025-02-11 11:11:17,1243343
4,101040480_113374_PUBLIHORIZON,"Structure determination of 2,5-difluorophenol ...",Peer reviewed articles,"K.P. Rajappan Nair, Kevin G. Lengsfeld, Philip...",Journal of Molecular Structure,1321,2024,,0022-2860,,10.1016/j.molstruc.2024.139971,101040480,LACRIDO,Project publication,2025-02-11 11:04:26,1243340


### Missing values
Here we inspect the missing data in this file and outline how we are going to treat these missing data points.

In [13]:
# look for missing values
publications_missing = data_publications.isnull()

# report only the columns that are missing data
for key in publications_missing.keys():
    n_missing = publications_missing[key].sum()
    if n_missing > 0:
        print(f'For key {key}:\n     {n_missing} elements are missing.')

For key authors:
     75 elements are missing.
For key journalTitle:
     2099 elements are missing.
For key journalNumber:
     13622 elements are missing.
For key publishedPages:
     24142 elements are missing.
For key issn:
     7004 elements are missing.
For key isbn:
     23219 elements are missing.
For key doi:
     2293 elements are missing.


A fair amount of data is missing in this file. Let us go through each column individually.
- authors:
    - This is unfortunate: it would have been very useful to decompose the author strings into single authors and make the connections.
    - How to treat this: check whether the article title string contains more author information.
- journalTitle:
    - Check the publication title. Sometimes the whole article reference has simply been copy-pasted there.
- journalNumber:
    - Not the most relevant parameter in my opinion. Set all NaN values to zero.
- publishedYear:
    - Manually look this up.
- publishedPages:
    - Not the most relevant parameter in my opinion. Set all NaN values to zero.
- issn:
    - Not the most relevant parameter in my opinion. Set all NaN values to zero.
- isbn:
    - Not the most relevant parameter in my opinion. Set all NaN values to zero.
- doi:
    - Pass `about:blank` as the URL.
- rcn:
    - Manually adjust this one.
        - Update: this entry was missing a value for authors, so all following fields shifted one column to the left. Manually fixed this one.
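
A minimal sketch of the treatment outlined above, on made-up toy rows. It is not the project's cleaning function. The helper column `author_list` is hypothetical, and filling with the string `"0"` rather than the number `0` is an assumption made to keep the columns as text.

```python
import pandas as pd

# Toy stand-in for the publications table (second row is made up and mostly empty).
pubs = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "authors": ["A. Author, B. Author", None],
    "journalNumber": ["28", None],
    "issn": ["1420-3049", None],
    "doi": ["10.3390/molecules28083419", None],
})

# Low-priority columns: replace missing values with zero (kept as text here).
low_priority = ["journalNumber", "issn"]
pubs[low_priority] = pubs[low_priority].fillna("0")

# doi: use about:blank as the fallback target.
pubs["doi"] = pubs["doi"].fillna("about:blank")

# authors: split each string into single authors; missing entries become an empty list.
pubs["author_list"] = pubs["authors"].apply(
    lambda s: [name.strip() for name in s.split(",")] if isinstance(s, str) else []
)

print(pubs[["journalNumber", "issn", "doi", "author_list"]])
```

Keeping the original `authors` column next to `author_list` leaves the raw string available for the manual checks mentioned in the list.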



The cell above returns an empty DataFrame. It was used to gather the information needed to look up the rcn in the other datasets.

In [14]:
# test cleaning function

In [15]:
# Save to intermediate
data_publications.to_csv(f'{interim_dir}/projectPublications_interim.csv', sep=';')

# Inspect CORDIS-HORIZON projects data files
This is the folder containing some more datasets on the different projects.

### Define data paths

In [16]:
# define file paths
SciVoc_path = f'{raw_dir}/euroSciVoc.csv'
legalBasis_path = f'{raw_dir}/legalBasis.csv'
organization_path = f'{raw_dir}/organization.csv'
project_path = f'{raw_dir}/project.csv'
topics_path = f'{raw_dir}/topics.csv'
webItems_path = f'{raw_dir}/webItem.csv'
webLink_path = f'{raw_dir}/webLink.csv'

In [17]:
# Shared options for loading the auxiliary CSV files
read_csv_options = {
    "delimiter": ";",
    "quotechar": '"',
    "escapechar": "\\",
    'doublequote': False,
    # "on_bad_lines": "skip",   # we skip lines that do not import properly for now
    "engine": "python"  # 'python' engine handles complex parsing better
}


## SciVoc dataset

In [18]:
# load
sci_voc_df = pd.read_csv(SciVoc_path, **read_csv_options)

# clean
from backend.etl.cleaning import clean_scivoc
sci_voc_cleaned = clean_scivoc(sci_voc_df)

# save to interim
sci_voc_cleaned.to_csv(f'{interim_dir}/euroSciVoc_interim.csv', sep=';')

In [19]:
sci_voc_df

Unnamed: 0,projectID,euroSciVocCode,euroSciVocPath,euroSciVocTitle,euroSciVocDescription
0,101159220,/21/33/121/621,/medical and health sciences/health sciences/i...,malaria,
1,101159220,/23/49/315/997/1613,/natural sciences/biological sciences/biochemi...,proteins,
2,101093997,/23/49/315/997/1611,/natural sciences/biological sciences/biochemi...,carbohydrates,
3,101126531,/23/49/341/325,/natural sciences/biological sciences/microbio...,virology,
4,101126531,/21/33/121/44109686/8132740,/medical and health sciences/health sciences/i...,coronaviruses,
...,...,...,...,...,...
34607,101112145,/21/35/153/22901471/633,/medical and health sciences/basic medicine/ne...,alzheimer,
34608,101112145,/29/101/553/1353,/social sciences/sociology/demography/mortality,mortality,
34609,101057477,/29/91/521/1299/1757,/social sciences/economics and business/econom...,productivity,
34610,101134907,/21/39/195,/medical and health sciences/clinical medicine...,paediatrics,


In [20]:
sci_paths = (
            sci_voc_df
            .groupby('projectID')['euroSciVocPath']
            .apply(list)
            .reset_index(name='sci_voc_paths')
        )

In [21]:
missing = sci_paths['sci_voc_paths'].isna()
sci_paths[missing]

Unnamed: 0,projectID,sci_voc_paths


In [22]:
sci_paths.explode('sci_voc_paths')

Unnamed: 0,projectID,sci_voc_paths
0,101039048,/natural sciences/chemical sciences/inorganic ...
0,101039048,/natural sciences/physical sciences/theoretica...
1,101039060,/humanities/history and archaeology/history/pr...
1,101039060,/natural sciences/computer and information sci...
1,101039060,/medical and health sciences/health sciences/n...
...,...,...
13527,190199469,/engineering and technology/nanotechnology
13527,190199469,/medical and health sciences/clinical medicine...
13528,190199874,/natural sciences/biological sciences/genetics
13528,190199874,/natural sciences/biological sciences/microbio...


## organization dataset

In [23]:
# load
organization_df = pd.read_csv(organization_path, delimiter=';')


# clean
from backend.etl.cleaning import clean_organization, standardize_columns
# Print the standardized column names
print("Standardized column names:")
print(organization_df.columns)
organization_cleaned = clean_organization(organization_df)

# save to interim
organization_cleaned.to_csv(f'{interim_dir}/organization_interim.csv', sep=';')

Standardized column names:
Index(['projectID', 'projectAcronym', 'organisationID', 'vatNumber', 'name',
       'shortName', 'SME', 'activityType', 'street', 'postCode', 'city',
       'country', 'nutsCode', 'geolocation', 'organizationURL', 'contactForm',
       'contentUpdateDate', 'rcn', 'order', 'role', 'ecContribution',
       'netEcContribution', 'totalCost', 'endOfParticipation', 'active'],
      dtype='object')
Dropped 101153 rows with missing organization ID.


  return pd.to_datetime(series, errors='coerce', infer_datetime_format=True)


## topics dataset

In [24]:
# load
topics_df = pd.read_csv(topics_path, **read_csv_options)

# clean
from backend.etl.cleaning import clean_topics
topics_cleaned = clean_topics(topics_df)

# save to interim
topics_cleaned.to_csv(f'{interim_dir}/topics_interim.csv', sep=';')

## Legal Basis dataset

In [25]:
# load
legal_basis_df = pd.read_csv(legalBasis_path, **read_csv_options)

# clean
from backend.etl.cleaning import clean_legalbasis, standardize_columns
legal_basis_cleaned = clean_legalbasis(legal_basis_df)

# save to interim
legal_basis_cleaned.to_csv(f'{interim_dir}/legalBasis_interim.csv', sep=';')

## webItem / webLink dataset

In [26]:
# load
web_items_df = pd.read_csv(webItems_path, **read_csv_options)
web_link_df = pd.read_csv(webLink_path, **read_csv_options)

# clean

from backend.etl.cleaning import clean_webitem, clean_weblink
web_items_cleaned = clean_webitem(web_items_df)
web_link_cleaned = clean_weblink(web_link_df)

# save to interim
web_items_cleaned.to_csv(f'{interim_dir}/webItems_interim.csv', sep=';')
web_link_cleaned.to_csv(f'{interim_dir}/webLink_interim.csv', sep=';')

## projects dataset

In [27]:
project_df, problematic_lines = inspect_bad_lines(project_path, expected_columns=20)

print(f"DataFrame loaded with {len(project_df)} rows.")
print(f"Number of problematic lines: {len(problematic_lines)}")

Found 127 problematic lines. Displaying the first 5:
Line 11: ['101114248', 'ELIA', 'CLOSED', 'Elia - Smart Assistant for English Learning', '2023-07-01', '2024-06-30', '0', '75000', 'HORIZON.3.2', 'HORIZON-EIE-2022-SCALEUP-02-02', '2023-06-05', 'HORIZON', 'HORIZON-EIE-2022-SCALEUP-02', 'HORIZON-EIE-2022-SCALEUP-02', 'HORIZON-CSA', '', "We are a startup founded exclusively by women holding the top positions CEO and CTO. Our vision is to reduce inequalities by helping people own their English. That's why we created a personal assistant – Elia. Elia is a tool for busy professionals or swamped students struggling with their English. It connects English learning to their daily activities, e.g. writing an email at work", ' watching videos on YouTube', ' or reading an article for a biology class. Because learning that is personalised and in context has been found to be the most effective form of learning. Elia started as a PhD project. Hence it\'s based on insights from cognitive linguistics

In [28]:
project_df = robust_csv_reader(project_path, expected_columns=20, problematic_column=14)

In [29]:
# clean the dataset
from backend.etl.cleaning import clean_project
project_cleaned = clean_project(project_df)

  return pd.to_datetime(series, errors='coerce', infer_datetime_format=True)
  return pd.to_datetime(series, errors='coerce', infer_datetime_format=True)


In [30]:
# save the cleaned dataframe to the interim folder
project_cleaned.to_csv(f'{interim_dir}/project_interim.csv', sep=';')

## Construct functions to access cleaned data

We now define some functions that give easy access to all aspects of the different projects.


- Merge datasets into one object
- Standardize column names so that they are compatible
- Create functions that allow access to project-specific data:
    - function argument: project name / acronym / identifier
    - function output: data class with project information as attributes
    - Or: approach this from a class init perspective

Open point: find a way to pass the loaded datasets to the class, so that we do not have to load the full dataset each time we initialize it.
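
One way to avoid reloading is to load the tables once into a container and hand out small per-project views. The sketch below only illustrates that idea and is not the `CORDIS_data` or `Project_data` implementation from `backend.classes`: the names `DataStore`, `ProjectView`, and `get_project` are hypothetical, and the organization names are made up.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ProjectView:
    """Hypothetical lightweight view on a single project."""
    project_id: int
    acronym: str
    organizations: pd.DataFrame


class DataStore:
    """Hypothetical container that keeps the already-loaded tables in memory."""

    def __init__(self, projects: pd.DataFrame, organizations: pd.DataFrame):
        self.projects = projects
        self.organizations = organizations

    def get_project(self, acronym: str) -> ProjectView:
        # Look the project up by acronym; no file is read at this point.
        match = self.projects[self.projects["acronym"] == acronym]
        if match.empty:
            raise KeyError(f"No project with acronym {acronym!r}")
        project_id = int(match.iloc[0]["id"])
        orgs = self.organizations[self.organizations["projectID"] == project_id]
        return ProjectView(project_id, acronym, orgs)


# Toy tables: the IDs and acronyms follow the notebook output, the organization names are made up.
projects = pd.DataFrame({"id": [101096150, 101093997], "acronym": ["BIOBoost", "GlycanTrigger"]})
organizations = pd.DataFrame({
    "projectID": [101096150, 101096150, 101093997],
    "name": ["Org A", "Org B", "Org C"],
})

store = DataStore(projects, organizations)  # tables are loaded once
view = store.get_project("BIOBoost")        # cheap lookup per project
print(view.project_id, len(view.organizations))
```

Each view only filters the in-memory tables, so creating many of them does not touch the CSV files again.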


In [31]:
import sys
import os
from pathlib import Path

# The project root was already added to sys.path in the first cell, so absolute imports work


from backend.classes import CORDIS_data

parent_dir = project_root
Data_structure = CORDIS_data(parent_dir, enrich=True)

# print the column names of the organization dataframe
print(Data_structure.organization_df.columns.tolist())

Enriching the projects dataset with temporal information.
Enriching the projects dataset with people and institutions information.
Enriching the projects dataset with financial information.
Enriching the projects dataset with thematic / scientific information.
['projectID', 'projectAcronym', 'organisationID', 'vatNumber', 'name', 'shortName', 'SME', 'activityType', 'street', 'postCode', 'city', 'country', 'nutsCode', 'geolocation', 'organizationURL', 'contactForm', 'contentUpdateDate', 'rcn', 'order', 'role', 'ecContribution', 'netEcContribution', 'totalCost', 'endOfParticipation', 'active']


In [32]:
Data_structure.project_df.head()

Unnamed: 0,id,acronym,status,title,startDate,endDate,totalCost,ecMaxContribution,legalBasis,topics,...,projectID_y,institutions,projectID,coordinator_name,ecContribution_per_year,totalCost_per_year,field_class,field,subfield,niche
0,101159220,PvSeroRDT,SIGNED,A point-of-care serological rapid diagnostic t...,2025-02-01,2030-01-31,,,HORIZON.2.1,HORIZON-JU-GH-EDCTP3-2023-02-02-two-stage,...,101159220,"[Institut Pasteur de Madagascar, INSTITUT PAST...",101159220,INSTITUT PASTEUR,,,"[medical and health sciences, natural sciences]","[biological sciences, health sciences]","[infectious diseases, biochemistry]","[biomolecules, malaria]"
1,101096150,BIOBoost,SIGNED,Boosting innovation agencies for bioeconomy va...,2023-02-01,2025-01-31,0.0,500000.0,HORIZON.3.2,HORIZON-EIE-2022-CONNECT-01-01,...,101096150,"[CLIC INNOVATION OY, FBCD AS, BIOECONOMY FOR C...",101096150,FBCD AS,500000.0,0.0,[other],[other],[other],[other]
2,101093997,GlycanTrigger,SIGNED,GLYCANS AS MASTER TRIGGERS OF HEALTH TO INTEST...,2023-01-01,2028-12-31,6771571.0,6771571.0,HORIZON.2.1,HORIZON-HLTH-2022-STAYHLTH-02-01,...,101093997,"[ACADEMISCH ZIEKENHUIS LEIDEN, LUDGER LIMITED,...",101093997,I3S - INSTITUTO DE INVESTIGACAO E INOVACAO EM ...,1354314.2,1354314.2,[natural sciences],[biological sciences],[biochemistry],[biomolecules]
3,101126531,CHIKVAX_CHIM,SIGNED,Late-stage clinical development of Chikungunya...,2023-06-01,2028-11-30,100000000.0,70000000.0,HORIZON.2.1,HORIZON-HLTH-2022-CEPI-15-01-IBA,...,101126531,[COALITION FOR EPIDEMIC PREPAREDNESS INNOVATIONS],101126531,COALITION FOR EPIDEMIC PREPAREDNESS INNOVATIONS,14000000.0,20000000.0,"[medical and health sciences, natural sciences]","[basic medicine, biological sciences, health s...","[pharmacology and pharmacy, infectious disease...","[virology, pharmaceutical drugs, RNA viruses]"
4,101113979,The Oater,CLOSED,The Oater develops a compact machine for hyper...,2023-07-01,2023-12-31,0.0,75000.0,HORIZON.3.2,HORIZON-EIE-2022-SCALEUP-02-02,...,101113979,[OIY SOLUTIONS GMBH],101113979,OIY SOLUTIONS GMBH,inf,,"[social sciences, natural sciences]","[computer and information sciences, biological...","[ecology, internet, business and management]","[internet of things, commerce, ecosystems]"


In [33]:
# store feature-enriched dataframe to the processed directory
Data_structure.export_dataframes(f'{processed_dir}/')

print(Data_structure.organization_df.columns.tolist())
print(Data_structure.project_df.columns.tolist())
print(Data_structure.data_deliverables.columns.tolist())
print(Data_structure.data_publications.columns.tolist())
print(Data_structure.sci_voc_df.columns.tolist())
print(Data_structure.legal_basis_df.columns.tolist())
print(Data_structure.topics_df.columns.tolist())
print(Data_structure.web_items_df.columns.tolist())

['projectID', 'projectAcronym', 'organisationID', 'vatNumber', 'name', 'shortName', 'SME', 'activityType', 'street', 'postCode', 'city', 'country', 'nutsCode', 'geolocation', 'organizationURL', 'contactForm', 'contentUpdateDate', 'rcn', 'order', 'role', 'ecContribution', 'netEcContribution', 'totalCost', 'endOfParticipation', 'active']
['id', 'acronym', 'status', 'title', 'startDate', 'endDate', 'totalCost', 'ecMaxContribution', 'legalBasis', 'topics', 'ecSignatureDate', 'frameworkProgramme', 'masterCall', 'subCall', 'fundingScheme', 'nature', 'objective', 'contentUpdateDate', 'rcn', 'grantDoi', 'duration_days', 'duration_months', 'duration_years', 'projectID_x', 'n_institutions', 'projectID_y', 'institutions', 'projectID', 'coordinator_name', 'ecContribution_per_year', 'totalCost_per_year', 'field_class', 'field', 'subfield', 'niche']
['id', 'title', 'deliverableType', 'description', 'projectID', 'projectacronym', 'url', 'collection', 'contentupdatedate', 'rcn']
['Unnamed: 0', 'id', '

In [34]:
Data_structure.list_of_acronyms()

Unnamed: 0,0
0,PvSeroRDT
1,BIOBoost
2,GlycanTrigger
3,CHIKVAX_CHIM
4,The Oater
...,...
15048,EUCYS2022
15049,RESAVER_2023
15050,Leiden2022-ECS-ESOF
15051,EUCYS2024


## Project_data class

In [35]:
from backend.classes import Project_data


p = Project_data(Data_structure, acronym="CLIMB")
summary = p.summary()
print(summary["financials"])

{'ec_total': 1622273.0, 'total_cost': 1622273.0, 'ec_sum_from_partners': 1622273.0, 'cost_sum_from_partners': '1622273', 'ec_per_deliverable': None, 'ec_per_publication': None}


Use `pprint` to print the financial summary in a readable format.

In [36]:
pprint(summary["financials"])

{'cost_sum_from_partners': '1622273',
 'ec_per_deliverable': None,
 'ec_per_publication': None,
 'ec_sum_from_partners': 1622273.0,
 'ec_total': 1622273.0,
 'total_cost': 1622273.0}


In [37]:
# Inspect a certain project
p = Project_data(Data_structure,acronym="BIOBoost")
p.inspect_project_data()


Project: BIOBoost (ID: 101096150)

Publications:
Empty DataFrame
Columns: [Unnamed: 0, id, title, isPublishedAs, authors, journalTitle, journalNumber, publishedYear, publishedPages, issn, isbn, doi, projectID, projectAcronym, collection, contentUpdateDate, rcn]
Index: []

Deliverables:
                              deliverableType  \
18191                      Documents, reports   
18192                      Documents, reports   
18193  Websites, patent fillings, videos etc.   
18194                      Documents, reports   
18195                    Data Management Plan   
18196                      Documents, reports   

                                             description  
18191  The project management handbook will provide c...  
18192  The PDEC provides information to all project p...  
18193  Interactive online tool showing and mapping ma...  
18194  The report includes information on innovation ...  
18195  The DMP provides clear information on the cons...  
18196  Assessm