# Collaboration Network Analysis

`Overview`
This notebook analyzes and visualizes collaboration networks between countries and institutions:
- Constructing network graphs from project partnership data
- Analyzing network metrics (centrality, clustering, community detection)
- Creating interactive network visualizations
- Identifying key collaboration patterns and influential nodes

`Inputs`
- Processed project and partnership data from `../data/processed/`

`Outputs`
- Network analysis results in `../data/results/`
- Interactive network visualizations in `../reports/figures/`

`Dependencies`
- NetworkX
- Matplotlib
- Plotly
- Pandas
- Community detection libraries

*Note: This is notebook 3 of the analysis pipeline focusing on collaboration patterns*
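
As a rough illustration of the network metrics listed in the overview (centrality, clustering, community detection), the minimal sketch below computes them with NetworkX on a small made-up graph. The institution names and edge weights are invented for illustration and do not come from the CORDIS data.

```python
import networkx as nx
from networkx.algorithms import community

# Toy collaboration graph: nodes are made-up institutions, edge weights
# stand for the number of projects two institutions share.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Inst A", "Inst B", 3),
    ("Inst A", "Inst C", 1),
    ("Inst B", "Inst C", 2),
    ("Inst C", "Inst D", 1),
    ("Inst D", "Inst E", 4),
    ("Inst D", "Inst F", 1),
    ("Inst E", "Inst F", 2),
])

# Degree centrality: fraction of the other nodes each institution is linked to
degree_centrality = nx.degree_centrality(G)
# Betweenness centrality: how often a node lies on shortest paths between others
betweenness = nx.betweenness_centrality(G)
# Average clustering: how strongly the neighbours of a node are linked to each other
avg_clustering = nx.average_clustering(G)
# Community detection via greedy modularity maximisation
communities = community.greedy_modularity_communities(G, weight="weight")

print(sorted(degree_centrality.items(), key=lambda kv: kv[1], reverse=True))
print(avg_clustering)
print([sorted(c) for c in communities])
```

Greedy modularity maximisation is only one of several community detection methods; it is used here because it ships with NetworkX and needs no extra dependency.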

In [1]:
# imports
import os
import sys
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import networkx as nx

# Import custom modules
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)
sys.path.append(parent_dir)

from backend.utils.plots import CORDISPlots
from backend.classes.cordis_data import CORDIS_data

from backend.classes.project_data  import Project_data

## Exploring the possibilities of the NetworkX package
We aim to create the following plots:
- Frequently collaborating institutions: indicate the 20 most frequent collaborations with lines on a map
- Frequently collaborating people: determined from the author lists of the publications
- Project-deliverable plots: which projects produce the largest number of deliverables
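
For the first plot, the most frequent institution pairs have to be counted first. The sketch below shows one possible way to do that, assuming an organization table with one row per participant and the columns `projectID` and `name` (as in `organization_df`). The helper `top_collaborations` and the toy rows are hypothetical and only serve as an illustration.

```python
from collections import Counter
from itertools import combinations

import networkx as nx
import pandas as pd

# Made-up stand-in for organization_df; only `projectID` and `name` are used
toy_orgs = pd.DataFrame({
    "projectID": [1, 1, 1, 2, 2, 3, 3, 3],
    "name": ["Uni A", "Uni B", "Lab C", "Uni A", "Uni B", "Uni A", "Lab C", "Uni D"],
})


def top_collaborations(org_df, n=20):
    """Return the n most frequent institution pairs as (pair, count) tuples."""
    pair_counts = Counter()
    for _, names in org_df.groupby("projectID")["name"]:
        # Sort the unique names so (A, B) and (B, A) count as the same pair
        for pair in combinations(sorted(names.unique()), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(n)


top_pairs = top_collaborations(toy_orgs, n=20)

# Weighted graph: one edge per pair, weight = number of shared projects
G = nx.Graph()
G.add_weighted_edges_from((a, b, count) for (a, b), count in top_pairs)

print(top_pairs)
```

Drawing the pairs as lines on a map would additionally need coordinates per institution, for example parsed from the `geolocation` column; that step is not covered here.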

In [3]:
# load the processed project and organization tables
projects_df = pd.read_csv(f'{parent_dir}/data/processed/projects.csv')
organization_df = pd.read_csv(f'{parent_dir}/data/processed/organizations.csv')

In [4]:
projects_df.keys()

Index(['id', 'acronym', 'status', 'title', 'start_date', 'end_date',
       'total_cost', 'ec_max_contribution', 'ec_signature_date',
       'framework_programme', 'master_call', 'sub_call', 'funding_scheme',
       'nature', 'objective', 'content_update_date', 'rcn', 'grant_doi',
       'duration_days', 'duration_months', 'duration_years', 'n_institutions',
       'coordinator_name', 'ec_contribution_per_year', 'total_cost_per_year',
       'field_class', 'field', 'sub_field', 'niche'],
      dtype='object')

In [5]:
projects_df['funding_scheme'].unique()

array(['HORIZON-JU-RIA', 'HORIZON-CSA', 'HORIZON-RIA', 'HORIZON-COFUND',
       'HORIZON-JU-CSA', 'HORIZON-EIT-KIC', 'HORIZON-TMA-MSCA-PF-EF',
       'CSA', 'HORIZON-AG-UN', 'HORIZON-AG', 'HORIZON-IA',
       'HORIZON-TMA-MSCA-SE', 'HORIZON-TMA-MSCA-PF-GF',
       'HORIZON-TMA-MSCA-Cofund-D', 'HORIZON-TMA-MSCA-Cofund-P',
       'MSCA-PF', 'EURATOM-CSA', 'HORIZON-TMA-MSCA-DN', 'HORIZON-AG-LS',
       'EURATOM-IA', 'HORIZON-TMA-MSCA-DN-ID', 'HORIZON-TMA-MSCA-DN-JD',
       'HORIZON-PCP', 'HORIZON-JU-IA', 'EURATOM-RIA', 'IA',
       'HORIZON-EIC-ACC-BF', 'HORIZON-EIC', 'EIC-ACC', 'HORIZON-EIC-ACC',
       'EURATOM-COFUND', 'RIA', 'HORIZON-ERC-POC', 'HORIZON-ERC',
       'HORIZON-ERC-SYG', 'ERC', 'ERC-POC', 'EIC'], dtype=object)

In [6]:
horizon_data = CORDIS_data(parent_dir, enrich=False)

In [7]:
horizon_data.sci_voc_df['euroSciVocTitle'].unique()[:20]

array(['malaria', 'proteins', 'carbohydrates', 'virology',
       'coronaviruses', 'vaccines', 'internet of things', 'e-commerce',
       'ecosystems', 'orthodontics', 'government systems',
       'energy and fuels', 'linguistics', 'obstetrics', 'microbiology',
       'mortality', 'diabetes', 'cardiology', 'obesity', 'software'],
      dtype=object)

In [8]:
horizon_data.organization_df.keys()

Index(['projectID', 'projectAcronym', 'organisationID', 'vatNumber', 'name',
       'shortName', 'SME', 'activityType', 'street', 'postCode', 'city',
       'country', 'nutsCode', 'geolocation', 'organizationURL', 'contactForm',
       'contentUpdateDate', 'rcn', 'order', 'role', 'ecContribution',
       'netEcContribution', 'totalCost', 'endOfParticipation', 'active'],
      dtype='object')

In [12]:
from backend.utils.plots import CORDISPlots

plots = CORDISPlots(horizon_data)

In [13]:
plots.plot_collaboration_network(field_filter='ecosystems', 
                                 max_projects=20, 
                                 min_participants=3, 
                                #  org_types=['REC'], 
                                 project_type =['HORIZON-ERC-SYG'])

KeyError: 'sci_voc_titles'

This works, but we definitely need to work on the filtering. Right now the plot is not clear at all. We optionally include filters on the scientific field, and later on we will also allow filtering on countries.

Implemented filters:
- `field_filter`: filter on the scientific field (a specific term, as specified in `sci_voc_title`). Gives a lot of control.
- `org_types`: filter on the type of organization: research, public, private, non-profit (the meaning of the acronyms, e.g. `REC`, still needs to be looked up)
- `max_projects`: maximum number of projects to plot
- `min_participants`: minimum number of participating institutes. Must be at least 2 (the minimum needed to form a collaboration)
- `countries`: filter on countries
- `scientific field`: high_level_discipline classification (EuroSciVocPath)
- `year`: start year of the project (start_year)
- `contribution`: contribution of the different countries
- `project_type`: corresponds to the different calls, e.g. ERC grants are prestigious research grants for big, ambitious, collaborative fundamental research projects
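
To make the filter semantics concrete, here is a minimal sketch of how `countries`, `org_types` and `min_participants` could be combined on the organization table. It is not the code inside `plot_collaboration_network`; the helper `filter_organizations` and the toy rows are hypothetical. Only the column names `projectID`, `name`, `country` and `activityType` are taken from `horizon_data.organization_df`.

```python
import pandas as pd

# Made-up rows; the column names follow horizon_data.organization_df
toy_orgs = pd.DataFrame({
    "projectID": [1, 1, 1, 2, 2, 3, 3, 3],
    "name": ["Uni A", "Uni B", "Lab C", "Uni A", "Uni B", "Uni A", "Lab C", "Uni D"],
    "country": ["NL", "DE", "FR", "NL", "DE", "NL", "NL", "BE"],
    "activityType": ["HES", "REC", "PRC", "HES", "REC", "HES", "REC", "HES"],
})


def filter_organizations(org_df, countries=None, org_types=None, min_participants=2):
    """Hypothetical helper: apply the optional filters, then drop small projects."""
    mask = pd.Series(True, index=org_df.index)
    if countries is not None:
        mask &= org_df["country"].isin(countries)
    if org_types is not None:
        mask &= org_df["activityType"].isin(org_types)
    filtered = org_df[mask]
    # Count the distinct participants left per project after filtering
    n_participants = filtered.groupby("projectID")["name"].transform("nunique")
    return filtered[n_participants >= min_participants]


dutch_only = filter_organizations(toy_orgs, countries=["NL"], org_types=["HES", "REC"])
print(dutch_only)
```

Note that in this sketch `min_participants` is applied after the other filters, so a project can drop out because too few of its participants survive the country or organization-type filter. Whether that is the desired behaviour is a design choice; filtering on `project_type` would additionally need the `funding_scheme` column of the projects table.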



## Refining the search per field

There are many different fields, each with its own subfields. We create a function that returns a filtered DataFrame, with flexible filtering at different levels.

Levels

In [None]:
class CORDISPlotter:
    def __init__(self, cordis_data):
        # cordis_data: a CORDIS_data instance such as horizon_data
        self.data = cordis_data

    def filter_field(self,
                     field_class=None,
                     field=None,
                     subfield=None):
        '''
        Filter the data on different levels of fields. In descending order:
        - field_class: e.g. natural sciences
        - field: e.g. biology
        - subfield: e.g. molecular biology
        '''
        # Not implemented yet
        raise NotImplementedError

We need to implement this in the `CORDIS_data` class. We add the following three fields:
- `field_class`
- `field`
- `subfield`

Some projects have several EuroSciVoc paths. To cover those, we store the three fields as lists for every project. To filter, one can check membership row by row, e.g. `df_project['field'].apply(lambda fields: 'biology' in fields)`, which yields a True/False mask to slice the DataFrame with.
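
A minimal sketch of this list-based filtering is given below, assuming the three columns already hold Python lists. The toy `df_project` and the standalone `filter_field` function are hypothetical illustrations, not the final `CORDIS_data` implementation.

```python
import pandas as pd

# Made-up projects; each field column holds a list because a project
# can have several EuroSciVoc paths
df_project = pd.DataFrame({
    "id": [101, 102, 103],
    "field_class": [["natural sciences"],
                    ["natural sciences", "engineering and technology"],
                    ["social sciences"]],
    "field": [["biology"], ["biology", "materials engineering"], ["sociology"]],
    "subfield": [["molecular biology"], ["ecology", "coatings"], ["demography"]],
})


def filter_field(df, field_class=None, field=None, subfield=None):
    """Keep the projects whose list columns contain every value that was given."""
    mask = pd.Series(True, index=df.index)
    criteria = {"field_class": field_class, "field": field, "subfield": subfield}
    for column, wanted in criteria.items():
        if wanted is not None:
            # Row-wise membership test: True if `wanted` is in that row's list
            mask &= df[column].apply(lambda values, w=wanted: w in values)
    return df[mask]


biology_projects = filter_field(df_project, field="biology")
print(biology_projects["id"].tolist())
```

If the lists are written to CSV and read back with `pd.read_csv`, they arrive as plain strings and would first have to be parsed (for example with `ast.literal_eval`) before this membership test behaves as intended.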