# Collaboration Network Analysis

`Overview`
This notebook analyzes and visualizes collaboration networks between countries and institutions:
- Constructing network graphs from project partnership data
- Analyzing network metrics (centrality, clustering, community detection)
- Creating interactive network visualizations
- Identifying key collaboration patterns and influential nodes

`Inputs`
- Processed project and partnership data from `../data/processed/`

`Outputs`
- Network analysis results in `../data/results/`
- Interactive network visualizations in `../reports/figures/`

`Dependencies`
- NetworkX
- Matplotlib
- Plotly
- Pandas
- Community detection libraries

*Note: This is notebook 3 of the analysis pipeline focusing on collaboration patterns*
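
As a minimal sketch of the metrics named in the overview (centrality, clustering, community detection), the toy graph below shows how they can be computed with NetworkX. The institution names and edge weights are made up for illustration, and greedy modularity is only one possible community detection method, not necessarily the one used later in the pipeline.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy collaboration graph: nodes are hypothetical institutions,
# edge weights count the number of shared projects.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Uni A", "Uni B", 3),
    ("Uni A", "Institute C", 1),
    ("Uni B", "Institute C", 2),
    ("Institute C", "Company D", 1),
    ("Company D", "Company E", 4),
])

# Fraction of the other nodes each node is directly connected to
degree_centrality = nx.degree_centrality(G)

# Fraction of a node's neighbour pairs that are themselves connected
clustering = nx.clustering(G)

# Partition of the nodes into communities (list of frozensets)
communities = greedy_modularity_communities(G, weight="weight")

most_central = max(degree_centrality, key=degree_centrality.get)
print(most_central, round(degree_centrality[most_central], 2))
print([sorted(c) for c in communities])
```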

In [1]:
# imports
import os
import sys
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import geopandas as gpd
from plotly.subplots import make_subplots
import networkx as nx

# Import custom modules
run_dir = os.getcwd()
parent_dir = os.path.dirname(run_dir)
sys.path.append(parent_dir)

from backend.utils.plots import CORDISPlots
from backend.classes.cordis_data import CORDIS_data

from backend.classes.project_data import Project_data

## Explore possibilities of NetworkX package
We aim to create the following plots: 
- Frequently collaborating institutions: indicate the 20 most frequent collaborations with lines on a map
- Frequently collaborating people: determine these from the author lists of the publications
- Project-deliverable plots: which projects produce the most deliverables
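
A rough sketch of how the most frequent institution pairs could be found before drawing them on a map. The `participants` table is hypothetical toy data that only mimics the `projectID` and `name` columns used in this notebook.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical participation table: one row per (project, institution)
participants = pd.DataFrame({
    "projectID": [1, 1, 1, 2, 2, 3, 3],
    "name": ["Uni A", "Uni B", "Institute C", "Uni B", "Uni A", "Uni A", "Institute C"],
})

pair_counts = Counter()
for names in participants.groupby("projectID")["name"].unique():
    # Sorting makes (A, B) and (B, A) count as the same collaboration
    pair_counts.update(combinations(sorted(names), 2))

top_n = 20
top_pairs = pair_counts.most_common(top_n)  # list of ((name_1, name_2), count)
print(top_pairs)
```

Each entry of `top_pairs` could then be drawn as one line between the two institutions, for instance with the line width scaled by the count.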

In [None]:
class CORDISPlotter:
    def __init__(self, cordis_data):
        self.data = cordis_data 
        
    def plot_collaboration_network(self, 
                                   field_filter=None, 
                                   org_types = None, 
                                   max_projects=1000, 
                                   min_participants=2, 
                                   countries=None,
                                   disciplines=None,
                                   year=None,
                                   contribution=None,
                                   project_type=None):
        '''
        Plot the collaboration network of institutions involved in projects.

        Parameters:
        -----------------
        - field_filter: Optional filter for scientific field
        - org_types: List of organization types to include in the plot.
            (default: ['HES', 'REC', 'PUB', 'PRC', 'SME']) => all types
        - max_projects: Maximum number of projects to include in the plot, to avoid cluttering
            (default: 1000, which is still too many)
        - min_participants: Minimum number of participating institutions in a project to be included.
            (default: 2, which is the minimum for a collaboration)
        - countries: List of countries whose institutions are included in the plot.
            (default: None, which means all countries are included)
        - disciplines: Optional list of disciplines; an institution must match all of them.
        - year: Optional start year of the project.
        - contribution: Optional minimum contribution.
        - project_type: Optional list of funding schemes (calls); a project must match all of them.

        Returns:
        -----------------
        - Plotly figure object
        '''
        df_proj = self.data.project_df
        df_org = self.data.organization_df
        # Filter for relevant organization types
        if org_types is None:
            org_types = ['HES', 'REC', 'PUB', 'PRC', 'SME']
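        elif isinstance(org_types, (list, tuple, np.ndarray)):
            # np.array is a function, not a type; check against np.ndarray and
            # convert to a list so the truthiness check below also works for arrays
            org_types = list(org_types)
        else:
            raise TypeError('org_types must be a list or a NumPy array')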

        # Apply scientific field filter if provided
        if field_filter:
            # Filter projects based on scientific field
            df_proj = df_proj[df_proj['sci_voc_titles'].apply(lambda field: field_filter in field if isinstance(field, str) else False)]
            project_ids = df_proj['projectID'].unique()
            df_org = df_org[df_org['projectID'].isin(project_ids)]
            
        if org_types:
            df_org =df_org[df_org['activityType'].astype(str).isin(org_types)]

        if countries:
            df_org = df_org[df_org['country'].astype(str).isin(countries)]

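        if disciplines:
            # NOTE: expects a 'discipline' column, which the organization_df
            # columns printed in this notebook do not list
            for discipline in disciplines:
                df_org = df_org[df_org['discipline'].astype(str).str.contains(discipline, na=False)]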

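        if year:
            # startDate is a project-level column, so filter the projects first
            # and then keep only the organizations of those projects
            df_proj = df_proj[df_proj['startDate'].astype(str).str.contains(str(year), na=False)]
            project_ids = df_proj['projectID'].unique()
            df_org = df_org[df_org['projectID'].isin(project_ids)]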

        if contribution:
            # organization_df stores the contribution in the 'ecContribution' column
            df_org = df_org[pd.to_numeric(df_org['ecContribution'], errors='coerce') >= contribution]
   
        if project_type:
            for call in project_type:
                df_proj = df_proj[df_proj['fundingScheme'].apply(lambda funding_scheme: call in funding_scheme if isinstance(funding_scheme, str) else False)]
                project_ids = df_proj['projectID'].unique()
                df_org = df_org[df_org['projectID'].isin(project_ids)]

        df_org = df_org[['projectID', 'name']].drop_duplicates()

        # Group by project and filter based on number of participants
        collab_df = df_org.groupby('projectID')['name'].apply(list).reset_index()
        collab_df = collab_df[collab_df['name'].apply(lambda x: len(x) >= min_participants)]
        collab_df = collab_df.head(max_projects)

        from itertools import combinations
        from collections import Counter

        # Build edge list
        edge_list = []
        for names in collab_df['name']:
            edge_list.extend(combinations(names, 2))
        edge_counts = Counter(edge_list)

        G = nx.Graph()
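        for (u, v), count in edge_counts.items():
            # (u, v) and (v, u) are the same undirected edge, so accumulate
            # the weight instead of overwriting it
            if G.has_edge(u, v):
                G[u][v]['weight'] += count
            else:
                G.add_edge(u, v, weight=count)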

        pos = nx.spring_layout(G, k=0.15, iterations=20)

        edge_x, edge_y = [], []
        for edge in G.edges():
            x0, y0 = pos[edge[0]]
            x1, y1 = pos[edge[1]]
            edge_x += [x0, x1, None]
            edge_y += [y0, y1, None]

        edge_trace = go.Scatter(
            x=edge_x, y=edge_y,
            line=dict(width=0.5, color='#888'),
            hoverinfo='none',
            mode='lines')

        node_x, node_y, node_text = [], [], []
        for node in G.nodes():
            x, y = pos[node]
            node_x.append(x)
            node_y.append(y)
            node_text.append(node)

        node_trace = go.Scatter(
            x=node_x, y=node_y,
            mode='markers+text',
            text=node_text,
            textposition='top center',
            marker=dict(
                showscale=False,
                color='blue',
                size=10,
                line_width=2))
        if field_filter:
            title = f'Institution Collaboration Network for {field_filter}'
        else:
            title = 'Institution Collaboration Network'
        fig = go.Figure(data=[edge_trace, node_trace],
                        layout=go.Layout(
                            title=title,
                            showlegend=False,
                            hovermode='closest',
                            margin=dict(b=20, l=5, r=5, t=40),
                            xaxis=dict(showgrid=False, zeroline=False),
                            yaxis=dict(showgrid=False, zeroline=False)))

        # Return the figure as documented; a notebook cell that ends with this
        # call will display it
        return fig

In [3]:
# Load the processed project and organization tables
projects_df = pd.read_csv(f'{parent_dir}/data/processed/project_df.csv')
organization_df = pd.read_csv(f'{parent_dir}/data/processed/organization_df.csv')

In [4]:
projects_df.keys()

Index(['id', 'acronym', 'status', 'title', 'startDate', 'endDate', 'totalCost',
       'ecMaxContribution', 'legalBasis', 'topics', 'ecSignatureDate',
       'frameworkProgramme', 'masterCall', 'subCall', 'fundingScheme',
       'nature', 'objective', 'contentUpdateDate', 'rcn', 'grantDoi',
       'duration_days', 'duration_months', 'duration_years', 'projectID_x',
       'n_institutions', 'projectID_y', 'institutions', 'projectID',
       'coordinator_name', 'ecContribution_per_year', 'totalCost_per_year',
       'sci_voc_titles', 'sci_voc_paths', 'topic_titles'],
      dtype='object')

In [16]:
projects_df['fundingScheme'].unique()

array(['HORIZON-JU-RIA', 'HORIZON-CSA', 'HORIZON-RIA', 'HORIZON-COFUND',
       'HORIZON-JU-CSA', 'HORIZON-EIT-KIC', 'HORIZON-TMA-MSCA-PF-EF',
       'CSA', 'HORIZON-AG-UN', 'HORIZON-AG', 'HORIZON-IA',
       'HORIZON-TMA-MSCA-SE', 'HORIZON-TMA-MSCA-PF-GF',
       'HORIZON-TMA-MSCA-Cofund-D', 'HORIZON-TMA-MSCA-Cofund-P',
       'MSCA-PF', 'EURATOM-CSA', 'HORIZON-TMA-MSCA-DN', 'HORIZON-AG-LS',
       'EURATOM-IA', 'HORIZON-TMA-MSCA-DN-ID', 'HORIZON-TMA-MSCA-DN-JD',
       'HORIZON-PCP', 'HORIZON-JU-IA', 'EURATOM-RIA', 'IA',
       'HORIZON-EIC-ACC-BF', 'HORIZON-EIC', 'EIC-ACC', 'HORIZON-EIC-ACC',
       'EURATOM-COFUND', 'RIA', 'HORIZON-ERC-POC', 'HORIZON-ERC',
       'HORIZON-ERC-SYG', 'ERC', 'ERC-POC', 'EIC'], dtype=object)

In [5]:
horizon_data = CORDIS_data(parent_dir, enrich=False)

In [6]:
horizon_data.sci_voc_df['euroSciVocTitle'].unique()[:20]

array(['malaria', 'proteins', 'carbohydrates', 'virology',
       'coronaviruses', 'vaccines', 'internet of things', 'e-commerce',
       'ecosystems', 'orthodontics', 'government systems',
       'energy and fuels', 'linguistics', 'obstetrics', 'microbiology',
       'mortality', 'diabetes', 'cardiology', 'obesity', 'software'],
      dtype=object)

In [10]:
horizon_data.organization_df.keys()

Index(['projectID', 'projectAcronym', 'organisationID', 'vatNumber', 'name',
       'shortName', 'SME', 'activityType', 'street', 'postCode', 'city',
       'country', 'nutsCode', 'geolocation', 'organizationURL', 'contactForm',
       'contentUpdateDate', 'rcn', 'order', 'role', 'ecContribution',
       'netEcContribution', 'totalCost', 'endOfParticipation', 'active'],
      dtype='object')

In [15]:
plots = CORDISPlotter(horizon_data)

In [18]:
plots.plot_collaboration_network(field_filter='ecosystems', 
                                 max_projects=20, 
                                 min_participants=3, 
                                #  org_types=['REC'], 
                                 project_type =['HORIZON-ERC-SYG'])

This works, but we definitely need to work on the filtering: the plot is not clear at all yet. We optionally include filters on the scientific field, and later on we will also allow filtering on countries.

Implemented filters:
- `field_filter`: filter on the scientific field (specific, as specified in `sci_voc_titles`). Gives a lot of control.
- `org_types`: filter on the type of organization (`activityType`), e.g. `HES` (higher or secondary education), `REC` (research organisations), `PUB` (public bodies), `PRC` (private for-profit entities)
- `max_projects`: maximum number of projects to plot
- `min_participants`: minimum number of participating institutes. Must be at least 2 (the minimum to form a collaboration)
- `countries`: filter on countries
- `disciplines`: high-level discipline classification (EuroSciVocPath)
- `year`: start year of the project (from `startDate`)
- `contribution`: minimum contribution of an organization to be included
- `project_type`: corresponds to the different calls (`fundingScheme`). E.g. ERC grants are prestigious research grants for big, ambitious, collaborative fundamental research projects
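
One way to keep such optional filters readable is to combine them into a single boolean mask. The sketch below is only an illustration: `filter_organizations` and the toy table are hypothetical, and only the column names (`activityType`, `country`, `ecContribution`) are taken from the organization table shown above.

```python
import pandas as pd


def filter_organizations(df_org, org_types=None, countries=None, min_contribution=None):
    """Hypothetical helper: apply only the filters that were actually passed."""
    mask = pd.Series(True, index=df_org.index)
    if org_types is not None:
        mask &= df_org["activityType"].isin(org_types)
    if countries is not None:
        mask &= df_org["country"].isin(countries)
    if min_contribution is not None:
        # Non-numeric values become NaN and are therefore excluded
        amounts = pd.to_numeric(df_org["ecContribution"], errors="coerce")
        mask &= amounts >= min_contribution
    return df_org[mask]


# Toy data, made up for illustration
toy = pd.DataFrame({
    "name": ["Uni A", "Institute C", "Company D"],
    "activityType": ["HES", "REC", "PRC"],
    "country": ["NL", "DE", "NL"],
    "ecContribution": [100000.0, 250000.0, 50000.0],
})

selected = filter_organizations(toy, countries=["NL"], min_contribution=75000)
print(selected["name"].tolist())
```

Using `is not None` rather than a truthiness check lets a caller pass an empty list to deliberately select nothing.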



## Refinement search per field

There are many different fields, each with its own subfields. We create a function that returns a filtered DataFrame with flexible filtering, allowing for different levels.

Levels, in descending order: `field_class`, `field`, `subfield`.

In [None]:
class CORDISPlotter:
    def __init__(self, cordis_data):
        self.data = cordis_data
    
    def filter_field(self, 
                     field_class=None, 
                     field=None,
                     subfield=None):
        '''
        Filter the data on different levels of fields. In descending order:
        - field_class: e.g. natural sciences
        - field: e.g. biology
        - subfield: e.g. molecular biology
        '''
        # Placeholder: the filtering logic still has to be implemented
        raise NotImplementedError

We need to implement this in the CORDIS_data class.
We add the following three fields:
- `field_class`
- `field`
- `subfield`

Some projects have several SciVocPaths. To cover those, we add the three fields as lists for all projects. To filter, one can check membership per row, e.g. `df_project['field'].apply(lambda fields: 'biology' in fields)`, which gives a True / False mask and thus a criterion to slice the DataFrame with.
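
A minimal sketch of this idea, assuming the paths look like `/level1/level2/level3/...` and are already available as Python lists per project. The helper `split_path` and the toy projects are hypothetical; the real implementation would live in the `CORDIS_data` class.

```python
import pandas as pd


def split_path(path):
    """Return the first three levels of a '/'-separated path, padded with None."""
    parts = [p for p in str(path).strip("/").split("/") if p]
    parts += [None] * (3 - len(parts))
    return parts[:3]


# Toy projects, made up for illustration; the path format is an assumption
projects = pd.DataFrame({
    "projectID": [1, 2],
    "sci_voc_paths": [
        ["/natural sciences/biological sciences/ecology/ecosystems",
         "/medical and health sciences/health sciences/infectious diseases"],
        ["/natural sciences/physical sciences/astronomy"],
    ],
})

# One [field_class, field, subfield] triple per path, then one list per level
levels = projects["sci_voc_paths"].apply(lambda paths: [split_path(p) for p in paths])
projects["field_class"] = levels.apply(lambda rows: [r[0] for r in rows])
projects["field"] = levels.apply(lambda rows: [r[1] for r in rows])
projects["subfield"] = levels.apply(lambda rows: [r[2] for r in rows])

# Row-wise membership test gives the boolean mask used for slicing
has_biology = projects["field"].apply(lambda fields: "biological sciences" in fields)
biology_projects = projects[has_biology]
print(biology_projects["projectID"].tolist())
```

If the lists are read back from a CSV file they arrive as strings, so they would need to be parsed into lists before this membership test works.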