#### Link to GitHub repo

https://github.com/tryllekunstneren/Assignment1_Group11/blob/main/Assignment1.ipynb

## Assignment 1: Computational Social Science 

##### Christian Warburg and Sofus Carstens

## Part 1: Web-scraping
Week 1, ex 3.

> **Exercise: Web-scraping the list of participants to the International Conference in Computational Social Science**   

> 1. Inspect the HTML of the page and use web-scraping to get the names of all researchers that contributed to the conference in 2023. The goal is the following: (i) get as many names as possible including: keynote speakers, chairs, authors of parallel talks and authors of posters; (ii) ensure that the collected names are complete and accurate as reported in the website (e.g. both first name and family name); (iii) ensure that no name is repeated multiple times with slightly different spelling.

> 2. Some instructions for success: 
>    * First, inspect the page through your web browser to identify the elements of the page that you want to collect. Ensure you understand the hierarchical structure of the page, and where the elements you are interested in are located within this nested structure.   
>    * Use the [BeautifulSoup Python package](https://pypi.org/project/beautifulsoup4/) to navigate through the hierarchy and extract the elements you need from the page. 
>    * You can use the [find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) method to find elements that match specific filters. Check the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) of the library for detailed explanations on how to set filters.  
>    * Parse the strings to ensure that you retrieve "clean" author names (e.g. remove commas, or other unwanted characters)
>    * The overall idea is to adapt the procedure I have used [here](https://nbviewer.org/github/lalessan/comsocsci2023/blob/master/additional_notebooks/ScreenScraping.ipynb) for the specific page you are scraping. 

> 3. Create the set of unique researchers that joined the conference and *store it into a file*.
>     * *Important:* If you notice any issue with the list of names you have collected (e.g. duplicate/incorrect names), come up with a strategy to clean your list as much as possible. 
>
##### Answer

To take care of duplicates, we converted the list of names into a set, and then back to a list. For typos in names, we used fuzzy matching with Python’s difflib. By comparing names with a similarity threshold (0.9), similar entries (e.g., names with typos) were grouped together, and a single representative was chosen for each group.

> 4. *Optional:* For a more complete representation of the field, include in your list: (i) the names of researchers from the programme committee of the conference, that can be found at [this link](https://ic2s2-2023.org/program_committee); (ii) the organizers of tutorials, that can be found at [this link](https://ic2s2-2023.org/tutorials)

> 5. How many unique researchers do you get?

##### Answer
We got 1524 unique authors

> 6. Explain the process you followed to web-scrape the page. Which choices did you make to accurately retrieve as many names as possible? Which strategies did you use to assess the quality of your final list? Explain your reasoning and your choices __(answer in max 150 words)__.
>
##### Answer

We started by sending a GET request to the target URL and parsing its HTML content using BeautifulSoup. Recognizing that names were enclosed within `<i>` tags, we extracted their text and replaced newline characters for cleaner data. To address cases where multiple names appeared in a single tag, we split the text using “, ” as a delimiter. Converting the resulting list to a set removed duplicates, ensuring only unique names remained. Finally, we created a DataFrame for further validation. We assessed the quality of our final list by verifying the unique count of names and manually inspecting the formatting to confirm consistency and accuracy in retrieval. This systematic approach enabled us to maximize the extraction of correctly formatted names while minimizing redundancy.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import difflib
import re
from collections import Counter

# URL of the webpage to scrape
link = 'https://ic2s2-2023.org/program'

# Send a GET request to the webpage
r = requests.get(link)

# Parse the content of the webpage with BeautifulSoup using the html.parser
soup = BeautifulSoup(r.content, 'html.parser')

# Find all <i> tags in the HTML (where names are located)
names = soup.find_all('i')

# Extract text from each <i> tag and clean up whitespace/newlines
raw_names = [tag.get_text(strip=True).replace("\n", " ") for tag in names]

# Use regex to split on comma followed by any whitespace, then remove quotes
names_list = [re.split(r',\s*', name) for name in raw_names]
names_flat = [name.strip('"') for sublist in names_list for name in sublist]

# Count the frequency of each name (optional)
name_counts = Counter(names_flat)

# Function to group similar names using fuzzy matching
def group_similar_names(names, threshold=0.9):
    names_copy = names.copy()
    groups = []
    while names_copy:
        base = names_copy.pop(0)
        group = [base]
        to_remove = []
        for other in names_copy:
            similarity = difflib.SequenceMatcher(None, base.lower(), other.lower()).ratio()
            if similarity >= threshold:
                group.append(other)
                to_remove.append(other)
        for name in to_remove:
            names_copy.remove(name)
        groups.append(group)
    return groups

# Group the names with a similarity threshold of 0.9
groups = group_similar_names(names_flat, threshold=0.9)

# Function to choose the best candidate from a group
def choose_best_candidate(group):
    return min(group, key=len)

final_names = []
removed_names = []

for group in groups:
    candidate = choose_best_candidate(group)
    final_names.append(candidate)
    removed = [name for name in group if name != candidate]
    removed_names.extend(removed)

# Create a DataFrame and export to CSV
df = pd.DataFrame(final_names, columns=['Author'])
df.to_csv('authors.csv', index=False)

# Print the results
print(f'Unique names in the list after typo correction: {len(df)}')
print(f'Number of names removed as typos: {len(removed_names)}')
print('Removed names:', removed_names)


Unique names in the list after typo correction: 1499
Number of names removed as typos: 35
Removed names: ['Xindi Wang', 'Luca Verginer', 'Luca Verginer', 'Duncan J. Watts', 'Duncan J. Watts', 'Duncan J. Watts', 'David M Rothschild', 'Anne C. Kroon', 'Michele Tizzoni', 'Michele Tizzoni', 'Michele Tizzoni', 'Fabio Carrella', 'Alessandro Flammini', 'Kathyrn R Fair', 'Woo-sung Jung', 'Lisette Espin Noboa', 'Nicholas A Christakis', 'Pantelis P. Analytis', 'Pantelis P Analytis', 'Sonja M Schmer Galunder', 'Bedoor AlShebli', 'Bedoor AlShebli', 'Bedoor AlShebli', 'Mariano Gaston Beiro', 'Diogo Pachecho', 'Marton Karsai', 'Marton Karsai', 'José Javier Ramasco', 'Federico Barrera-Lemarchand', 'Scott A. Hale', 'Scott A. Hale', 'Scott A. Hale', 'Marcos A. Oliveira', 'Matthew R DeVerna', 'Ana Maria Jaramillo']
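
Some scraped entries still carry a role label, for example `Chair: Claudia Wagner`, which shows up later in the author lookup log. A minimal sketch of an extra cleaning step follows, assuming the label always appears as `Chair:` at the start of the string. The helper name `clean_name` and the prefix pattern are illustrative and not part of the pipeline above.

```python
import re

# Hypothetical role label that may precede a name on the programme page.
ROLE_PREFIX = re.compile(r"^\s*chair\s*:\s*", flags=re.IGNORECASE)

def clean_name(raw):
    """Strip a leading role label and collapse runs of whitespace."""
    name = ROLE_PREFIX.sub("", raw)
    return re.sub(r"\s+", " ", name).strip()

print(clean_name("Chair: Claudia Wagner"))
print(clean_name("  Duncan  J.\nWatts "))
```

Applying such a step before the fuzzy grouping would let chair entries merge with the same person's entries from the talk listings.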


## Part 2: Ready Made vs Custom Made Data
Week 2, ex 1.

> **Exercise: Ready made data vs Custom made data** In this exercise, I want to make sure you have understood the key points of my lecture and the reading.

> 1. What are pros and cons of the custom-made data used in Centola's experiment (the first study presented in the lecture) and the ready-made data used in Nicolaides's study (the second study presented in the lecture)? You can support your arguments based on the content of the lecture and the information you read in Chapter 2.3 of the book __(answer in max 150 words)__.

##### Answer

Centola’s experiment used custom-made data, allowing for controlled conditions, precise measurement of social influence, and elimination of confounding variables. Researchers could manipulate network structures and directly observe behavioral changes, ensuring strong internal validity. However, the artificial setting may reduce external validity, as participants might behave differently in real-world contexts. Additionally, sample sizes are often smaller due to resource constraints. Nicolaides’s study used ready-made data from real-world sources, providing large-scale insights into disease transmission and high external validity. However, this data contains biases, lacks control over confounding factors, and may have measurement inaccuracies, such as missing data or inconsistencies in self-reported behaviors. Since observational data is not designed for experimental purposes, causality is harder to establish, requiring careful statistical modeling to infer relationships.

> 2. How do you think these differences can influence the interpretation of the results in each study? __(answer in max 150 words)__

##### Answer

Centola’s controlled experiment supports causal claims by isolating specific variables, making it easier to identify the mechanisms driving social contagion. However, the findings may not fully capture complex social behaviors outside the lab, limiting their generalizability. The artificial setting may also fail to reflect spontaneous, large-scale diffusion processes. In contrast, Nicolaides’s study reflects real-world patterns of human mobility and interactions, offering valuable insights into disease spread. Yet, the reliance on observational data means that multiple external factors, such as policy interventions or demographic differences, could influence the results. While Nicolaides’s study provides practical, large-scale implications, it requires careful interpretation to avoid conflating correlation with causation. These methodological differences shape how confidently each study’s results can be applied to broader social contexts, particularly in policymaking and behavioral interventions.

## Part 3: Gathering Research Articles using the OpenAlex API
Week 3, ex 1.

> **Exercise : Collecting Research Articles from IC2S2 Authors**
>
>In this exercise, we'll leverage the OpenAlex API to gather information on research articles authored by participants of the IC2S2 2024 (NOT 2023) conference, referred to as *IC2S2 authors*. **Before you start, please ensure you read through the entire exercise.**

> **Steps:**
>  
> 1. **Retrieve Data:** Starting with the *authors* you identified in Week 2, Exercise 2, use the OpenAlex API [works endpoint](https://docs.openalex.org/api-entities/works) to fetch the research articles they have authored. For each article, retrieve the following details:
>    - _id_: The unique OpenAlex ID for the work.
>    - _publication_year_: The year the work was published.
>    - _cited_by_count_: The number of times the work has been cited by other works.
>    - _author_ids_: The OpenAlex IDs for the authors of the work.
>    - _title_: The title of the work.
>    - _abstract_inverted_index_: The abstract of the work, formatted as an inverted index.
> 

>     **Important Note on Paging:** By default, the OpenAlex API limits responses to 25 works per request. For more efficient data retrieval, I suggest adjusting this limit to 200 works per request. Even with this adjustment, you will need to implement pagination to access all available works for a given query. This ensures you can systematically retrieve the complete set of works beyond the initial 200. Find guidance on implementing pagination [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/paging#cursor-paging).
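
Cursor paging can be sketched as a loop that starts with `cursor=*` and follows `meta.next_cursor` until it is empty. The sketch below assumes a caller-supplied `fetch_page(cursor)` function (a hypothetical name) that performs the HTTP request and returns the decoded JSON response. It is stubbed here so the loop runs without network access.

```python
def iter_all_results(fetch_page):
    """Yield every result by following OpenAlex-style cursor paging.

    `fetch_page(cursor)` is a hypothetical callable returning a dict shaped
    like an OpenAlex list response:
    {"meta": {"next_cursor": ...}, "results": [...]}.
    """
    cursor = "*"  # "*" requests the first page
    while cursor:
        page = fetch_page(cursor)
        yield from page["results"]
        # The last page carries no next cursor, which ends the loop.
        cursor = page["meta"].get("next_cursor")

# Stub standing in for the real request, for illustration only.
FAKE_PAGES = {
    "*": {"meta": {"next_cursor": "abc"}, "results": [{"id": "W1"}, {"id": "W2"}]},
    "abc": {"meta": {"next_cursor": None}, "results": [{"id": "W3"}]},
}

works = list(iter_all_results(FAKE_PAGES.__getitem__))
print([w["id"] for w in works])
```

With `requests`, `fetch_page` could wrap a GET to the works endpoint that passes `per-page=200` and the current `cursor` as query parameters.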

> 2. **Data Storage:** Organize the retrieved information into two Pandas DataFrames and save them to two files in a suitable format:
>    - The *IC2S2 papers* dataset should include: *id, publication\_year, cited\_by\_count, author\_ids*.
>    - The *IC2S2 abstracts* dataset should include: *id, title, abstract\_inverted\_index*.
>  
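
The `abstract_inverted_index` field maps each word to the positions where it occurs, so the plain abstract can be rebuilt by sorting words by position. A minimal sketch, with a made-up example index and the hypothetical helper name `abstract_from_inverted_index`:

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return ""  # some works have no abstract
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))

example = {"networks": [1, 4], "Social": [0], "shape": [2], "other": [3]}
print(abstract_from_inverted_index(example))
```

Note that writing the raw index to CSV stores the dictionary as a string, so a JSON file may round-trip more cleanly for this column.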

> **Filters:**
> To ensure the data we collect is relevant and manageable, apply the following filters:
> 
>    - Only include *IC2S2 authors* with a total work count between 5 and 5,000.
>    - Retrieve only works that have received more than 10 citations.
>    - Limit to works authored by fewer than 10 individuals.
>    - Include only works relevant to Computational Social Science (focusing on: Sociology OR Psychology OR Economics OR Political Science) AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science), as defined by their [Concepts](https://docs.openalex.org/api-entities/works/work-object#concepts). *Note*: here we only consider Concepts at *level=0* (the most coarse definition of concepts). 
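
The work-level filters above can be combined into a single `filter` string, where a comma means AND and a pipe means OR. The sketch below is one possible way to build it. The concept IDs are the level-0 IDs as we understand them and should be checked against OpenAlex before use, and `build_works_filter` is a hypothetical helper name.

```python
# Level-0 concept IDs (assumed values; verify against the OpenAlex concepts list).
SOCIAL = {
    "Sociology": "C144024400",
    "Psychology": "C15744967",
    "Economics": "C162324750",
    "Political science": "C17744445",
}
QUANTITATIVE = {
    "Mathematics": "C33923547",
    "Physics": "C121332964",
    "Computer science": "C41008148",
}

def build_works_filter(author_ids):
    """Return a works filter string for one or more short author IDs."""
    parts = [
        "author.id:" + "|".join(author_ids),            # any of these authors
        "cited_by_count:>10",                           # more than 10 citations
        "authors_count:<10",                            # fewer than 10 authors
        "concepts.id:" + "|".join(SOCIAL.values()),     # a social science concept
        "concepts.id:" + "|".join(QUANTITATIVE.values()),  # AND a quantitative one
    ]
    return ",".join(parts)

print(build_works_filter(["A1", "A2"]))
```

Because the two `concepts.id` entries are separate filter terms, a work has to match at least one concept from each group.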

> **Efficiency Tips:**
> Writing efficient code in this exercise is **crucial**. To speed up your process:
> - **Apply filters directly in your request:** When possible, use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) of the *works* endpoint to apply the filters above directly in your API request, ensuring only relevant data is returned. Learn about combining multiple filters [here](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists).  
> - **Bulk requests:** Instead of sending one request for each author, you can use the [filter parameter](https://docs.openalex.org/api-entities/works/filter-works) to query works by multiple authors in a single request. *Note: My testing suggests that you can only include up to 25 authors per request.*
> - **Use multiprocessing:** Implement multiprocessing to handle multiple requests simultaneously. I highly recommend [Joblib’s Parallel](https://joblib.readthedocs.io/en/stable/) function for that, and [tqdm](https://tqdm.github.io/) can help monitor progress of your jobs. Remember to stay within [the rate limit](https://docs.openalex.org/how-to-use-the-api/rate-limits-and-authentication) of 10 requests per second.
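
The bulk-request tip can be sketched by splitting the author IDs into batches of at most 25 and joining each batch with `|`. The helper name `chunked` and the placeholder IDs are illustrative only.

```python
def chunked(items, size=25):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder IDs for illustration; real IDs come from the authors dataset.
author_ids = [f"A{i}" for i in range(60)]
batches = list(chunked(author_ids))

print([len(batch) for batch in batches])
print("author.id:" + "|".join(batches[-1]))
```

Each batch then needs only one paged query instead of 25 separate ones, which reduces the number of requests considerably.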

> **Data Overview and Reflection questions:** Answer the following questions: 
> - **Dataset summary.** How many works are listed in your *IC2S2 papers* dataframe? How many unique researchers have co-authored these works? 
> - **Efficiency in code.** Describe the strategies you implemented to make your code more efficient. How did your approach affect your code's execution time? __(answer in max 150 words)__

> - **Filtering Criteria and Dataset Relevance** Reflect on the rationale behind setting specific thresholds for the total number of works by an author, the citation count, the number of authors per work, and the relevance of works to specific fields. How do these filtering criteria contribute to the relevance of the dataset you compiled? Do you believe any aspects of Computational Social Science research might be underrepresented or overrepresented as a result of these choices? __(answer in max 150 words)__

## Part 4: The Network of Computational Social Scientists
Week 4, ex 1. Please use the final dataset you collected from both authors and co-authors (IC2S2 2024).

In [None]:
import pandas as pd
import time
import json
import concurrent.futures
from tqdm import tqdm
from pyalex import Works, Authors, config

# Set your email (to use the polite pool)
pyalex_email = "s225083@dtu.dk"  
config.email = pyalex_email

# Increase retry settings for transient issues
config.max_retries = 5
config.retry_backoff_factor = 0.5
config.retry_http_codes = [429, 500, 503]

# --- Define Helper Functions ---

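def get_author_id(author_name):
    """
    Retrieve the OpenAlex ID for an author given their name.
    Returns None when no match is found or the request fails.
    """
    try:
        # search_filter issues a display_name.search query and returns the best matches first
        results = Authors().search_filter(display_name=author_name).get(per_page=1)
        return results[0].get("id") if results else None
    except Exception as e:
        # Report the error instead of hiding it, so failed lookups can be diagnosed
        print(f"Error fetching id for author {author_name}: {e}")
        return None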

import requests  # direct calls to the OpenAlex REST API for the works query

# Level-0 OpenAlex concept IDs (verify against the OpenAlex concepts list before relying on them)
SOCIAL_CONCEPTS = "C144024400|C15744967|C162324750|C17744445"  # Sociology, Psychology, Economics, Political science
QUANT_CONCEPTS = "C33923547|C121332964|C41008148"  # Mathematics, Physics, Computer science

def get_author_works(author_id, per_page=200):
    """
    Retrieve works for a given author applying the following filters:
      - More than 10 citations.
      - Authored by fewer than 10 individuals.
      - Relevant to Computational Social Science (Sociology OR Psychology OR Economics OR Political Science)
        AND intersecting with a quantitative discipline (Mathematics OR Physics OR Computer Science).
    """
    # Use the short ID (e.g. "A123") rather than the full URL in the filter
    short_id = author_id.rsplit("/", 1)[-1]
    # Comma = AND, pipe = OR. The listed concepts are all level 0, so no separate level filter is used.
    filter_string = ",".join([
        f"author.id:{short_id}",
        "cited_by_count:>10",
        "authors_count:<10",
        f"concepts.id:{SOCIAL_CONCEPTS}",
        f"concepts.id:{QUANT_CONCEPTS}",
    ])
    params = {"filter": filter_string, "per-page": per_page, "cursor": "*", "mailto": pyalex_email}

    all_works = []
    try:
        # Cursor paging: follow meta.next_cursor until the API returns none
        while params["cursor"]:
            response = requests.get("https://api.openalex.org/works", params=params, timeout=30)
            response.raise_for_status()
            payload = response.json()
            all_works.extend(payload["results"])
            params["cursor"] = payload["meta"].get("next_cursor")
            time.sleep(1)  # Pause between pages to respect rate limits
    except Exception as e:
        print(f"Error fetching works for author {author_id}: {e}")
    return all_works

def fetch_author_works_by_name(author_name):
    """
    Given an author name, retrieve their OpenAlex ID and fetch works using the above filters.
    """
    #print(f"\nFetching works for author: {author_name}")
    author_id = get_author_id(author_name)
    if not author_id:
        print(f"Could not retrieve id for {author_name}")
        return author_name, []
    works = get_author_works(author_id)
    time.sleep(1)
    return author_name, works

# --- Main Execution ---

def main():
    # Load the CSV containing the list of author names (column "Author")
    authors_df = pd.read_csv('authors.csv')
    author_names = authors_df["Author"].tolist()
    
    author_works = {}

    # Fetch works concurrently for each author using tqdm to monitor progress.
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(fetch_author_works_by_name, name): name for name in author_names}
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures), desc="Processing authors"):
            name = futures[future]
            try:
                author_name, works = future.result()
                author_works[author_name] = works
            except Exception as exc:
                print(f"Author {name} generated an exception: {exc}")

    # Aggregate and deduplicate works based on work ID.
    all_works = []
    for works in author_works.values():
        all_works.extend(works)
    works_dict = {work.get("id"): work for work in all_works}
    all_works = list(works_dict.values())
    print(f"Total works retrieved after deduplication: {len(all_works)}")
    
    # --- Create DataFrames for the Two Datasets ---

    def extract_paper_details(work):
        """
        For the IC2S2 papers dataset: extract id, publication_year, cited_by_count, and author_ids.
        """
        author_ids_list = []
        if "authorships" in work:
            author_ids_list = [auth.get("author", {}).get("id") for auth in work["authorships"] if auth.get("author", {}).get("id")]
        return {
            "id": work.get("id"),
            "publication_year": work.get("publication_year"),
            "cited_by_count": work.get("cited_by_count"),
            "author_ids": author_ids_list,
        }

    def extract_abstract_details(work):
        """
        For the IC2S2 abstracts dataset: extract id, title, and abstract_inverted_index.
        """
        return {
            "id": work.get("id"),
            "title": work.get("display_name", ""),
            "abstract_inverted_index": work.get("abstract_inverted_index"),
        }

    papers_data = [extract_paper_details(work) for work in all_works]
    abstracts_data = [extract_abstract_details(work) for work in all_works]

    papers_df = pd.DataFrame(papers_data)
    abstracts_df = pd.DataFrame(abstracts_data)

    # Save the DataFrames to CSV files.
    papers_df.to_csv("IC2S2_papers.csv", index=False)
    abstracts_df.to_csv("IC2S2_abstracts.csv", index=False)

    print("IC2S2 papers and abstracts datasets have been saved.")

if __name__ == '__main__':
    main()


Could not retrieve id for Chair: Claudia Wagner
Could not retrieve id for Jon Kleinberg
Could not retrieve id for Xinyi Wang
Could not retrieve id for Jonas L Juul
Could not retrieve id for Chloe Ahn
Could not retrieve id for Giuseppe Russo
Could not retrieve id for Giona Casiraghi
Could not retrieve id for Manoel Horta Ribeiro
Could not retrieve id for Almog Simchon
Could not retrieve id for luca verginer
Could not retrieve id for Matthew Edwards
Could not retrieve id for Manuel Vimercati
Could not retrieve id for Stephan Lewandowsky
Could not retrieve id for Arianna Pera
Could not retrieve id for Adam Sutton
Could not retrieve id for Mohammed Alsobay
Could not retrieve id for Matteo Palmonari
Could not retrieve id for David G. Rand
Could not retrieve id for Duncan J Watts
Could not retrieve id for Abdullah Almaatouq
Could not retrieve id for Santo Fortunato
Could not retrieve id for Francesco Rinaldi
Could not retrieve id for Francesco Tudisco
Could not retrieve id for Satyaki Sikdar
Could not retrieve id for Sara Venturini
Could not retrieve id for Isabella Loaiza
Could not retrieve id for Alex Pentland
Could not retrieve id for Takahiro Yabe
Could not retrieve id for Chair: Taha Yasseri
Could not retrieve id for Silvia De Sojo Caso
Could not retrieve id for Laura Alessandretti
Could not retrieve id for Rubén Rodríguez Casañ
Could not retrieve id for Antonio Ariño Villarroya
Could not retrieve id for Emil Bakkensen Johansen
Could not retrieve id for Mia Ann Jørgensen
Could not retrieve id for Sunny Rai
Could not retrieve id for Ashley Francisco
Could not retrieve id for Salvatore Giorgi
Could not retrieve id for Brenda Curtis
Could not retrieve id for Lyle Ungar
Could not retrieve id for Sharath Chandra Guntuku
Could not retrieve id for Allison Koenecke
Could not retrieve id for Eric Giannella
Could not retrieve id for Sune Lehmann
Could not retrieve id for Sharad Goel
Could not retrieve id for Robb Willer
Could not retrieve id for Mathias Wullum Nielsen
Could not retrieve id for Ziv Epstein
Could not retrieve id for Hause Lin
Could not retrieve id for Bramantyo Supriyatno
Could not retrieve id for Levin Brinkmann
Could not retrieve id for Iyad Rahwan
Could not retrieve id for Sandro Ferreira Sousa
Could not retrieve id for Vincenzo Nicosia
Could not retrieve id for Chair: David Garcia
Could not retrieve id for Tiancheng Hu
Could not retrieve id for Marilena Hohmann
Could not retrieve id for Yara Kyrychenko
Could not retrieve id for Michele Coscia
Could not retrieve id for Sander van der Linden
Could not retrieve id for Jon Roozenbeek
Could not retrieve id for Nikolaos Nakis
Could not retrieve id for Louis Boucherie
Could not retrieve id for Morten Mørup
Could not retrieve id for Abdulkadir Celikkanat
Could not retrieve id for Jacopo D'Ignazi
Could not retrieve id for Corrado Monti
Could not retrieve id for Homa Hosseinmardi
Could not retrieve id for Gianmarco De Francisci Morales
Could not retrieve id for Michele Starnini
Could not retrieve id for Sam Wolken
Could not retrieve id for Duncan Watts
Could not retrieve id for David Rothschild
Could not retrieve id for Li Zhang
Could not retrieve id for Isabelle Lorge
Could not retrieve id for Melanie Oyarzun
Could not retrieve id for Cristian E Candia
Could not retrieve id for Bernardo Garcia Bulle Bueno
Could not retrieve id for Xiaowen Dong
Could not retrieve id for Janet B. Pierrehumbert
Could not retrieve id for Chair: Milena Tsvetkova
Could not retrieve id for Morgan Ryan Frank
Could not retrieve id for Carlos Rodriguez-Sickert
Could not retrieve id for Esteban Moro
Could not retrieve id for Dingeman Jan Van der Laan
Could not retrieve id for Edwin De Jonge
Could not retrieve id for Marjolijn Das
Could not retrieve id for Veniamin Veselovsky
Could not retrieve id for Mark E Whiting
Could not retrieve id for Robert West
Could not retrieve id for Josep Perelló
Could not retrieve id for Marc Sadurní Parera
Could not retrieve id for Chair: Elisa Omodei
Could not retrieve id for Jamell Dacon
Could not retrieve id for Jiliang Tang
Could not retrieve id for Minsu Park
Could not retrieve id for Chao Yu
Could not retrieve id for Michael Macy
Could not retrieve id for Kat Albrecht
Could not retrieve id for Nicola Fanton
Could not retrieve id for Michael Roth
Could not retrieve id for Agnieszka Falenska
Could not retrieve id for Zeyneb Nahide Kaya
Could not retrieve id for Miquel Montero
Could not retrieve id for Chair: Sandra Gonzalez-Bailón
Could not retrieve id for Alexandra Segerberg
Could not retrieve id for Matteo Magnani
Could not retrieve id for Damian Trilling
Could not retrieve id for Rupert Kiddle
Could not retrieve id for Anne C Kroon
Could not retrieve id for Zilin Lin
Could not retrieve id for Roeland Dubèl
Could not retrieve id for Mónika Simon
Could not retrieve id for Susan Vermeer
Could not retrieve id for Kasper Welbers
Could not retrieve id for Mark Boukes
Could not retrieve id for Sukankana Chakraborty
Could not retrieve id for Leonardo Castro-Gonzalez
Could not retrieve id for Helen Margetts
Could not retrieve id for Jonathan Bright
Could not retrieve id for Lisa Merten
Could not retrieve id for Helena Sophia Rauxloh
Could not retrieve id for Judith Moeller
Could not retrieve id for Heidi Schulze
Could not retrieve id for Sebastian Stier
Could not retrieve id for Ardon Z. Shorr
Could not retrieve id for Dafna Yavetz
Could not retrieve id for Chair: Leo Ferres
Could not retrieve id for Sanjay Kairam
Could not retrieve id for Dylan Thurgood
Could not retrieve id for Isabel Lerch
Could not retrieve id for Simon Martin Breum
Could not retrieve id for Bojan Kostic
Could not retrieve id for Zhuangyuan Fan
Could not retrieve id for Becky P.Y. Loo
Could not retrieve id for Michael Szell
Could not retrieve id for Ting Lian
Could not retrieve id for Feiyang Zhang
Could not retrieve id for Bruno Lepri
Could not retrieve id for Massimiliano Luca
Could not retrieve id for Antonio Bucchiarone
Could not retrieve id for Annapaola Marconi
Could not retrieve id for Hugo Barbosa
Could not retrieve id for Laura Maria Alessandretti
Could not retrieve id for Simone Centellegher
Could not retrieve id for Surendra Hazarie
Could not retrieve id for Ronaldo Menezes
Could not retrieve id for Henrik Wolf
Could not retrieve id for Gourab Ghoshal
Could not retrieve id for Ane Rahbek Vierø
Could not retrieve id for Adam Frank
Could not retrieve id for Viktoria Spaiser
Could not retrieve id for Kelton R Minor
Could not retrieve id for Nicole Nisbett
Could not retrieve id for Chair: Clara Vandeweerdt
Could not retrieve id for Hannah-Marie Büttner
Could not retrieve id for Hendrik Meyer
Could not retrieve id for Patrick Zerrer
Could not retrieve id for Max Falkenberg
Could not retrieve id for Andrea Baronchelli
Could not retrieve id for Francesco Lamperti
Could not retrieve id for Roberto Mavilia
Could not retrieve id for Giorgio Tripodi
Could not retrieve id for Andrea mina
Could not retrieve id for Francesca Chiaromonte
Could not retrieve id for Kristoffer Lind Glavind
Could not retrieve id for Aaron J. Schwartz
Could not retrieve id for Christopher Danforth
Could not retrieve id for Chair: Andreia Sofia Teixeira
Could not retrieve id for Andreas Bjerre-Nielsen
Could not retrieve id for Michele Tizzani
Could not retrieve id for Daniela Paolotti
Could not retrieve id for Pietro Coletti
Could not retrieve id for Alessandro De Gaetano
Could not retrieve id for Christopher I Jarvis
Could not retrieve id for Amy Gimma
Could not retrieve id for Kerry Wong
Could not retrieve id for John 

> **Exercise: Constructing the Computational Social Scientists Network**
>
> In this exercise, we will create a network of researchers in the field of Computational Social Science using the NetworkX library. In our network, nodes represent authors of academic papers, with a direct link from node _A_ to node _B_ indicating a joint paper written by both. The link's weight reflects the number of papers written by both _A_ and _B_.

> **Part 1: Network Construction**
>

> 1. **Weighted Edgelist Creation:** Start with your dataframe of *papers*. Construct a _weighted edgelist_ where each list element is a tuple containing three elements: the _author ids_ of two collaborating authors and the total number of papers they've co-authored. Ensure each author pair is listed only once. 
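
One way to build the weighted edgelist is to count every unordered author pair across papers. The sketch below assumes each paper's authors are available as a Python list of IDs. The helper name `weighted_edgelist` and the sample IDs are illustrative.

```python
from collections import Counter
from itertools import combinations

def weighted_edgelist(author_lists):
    """Return (author_a, author_b, n_joint_papers) tuples, one per pair."""
    pair_counts = Counter()
    for authors in author_lists:
        # sorted(set(...)) gives each unordered pair a single canonical form
        # and ignores an author listed twice on the same paper.
        for a, b in combinations(sorted(set(authors)), 2):
            pair_counts[(a, b)] += 1
    return [(a, b, weight) for (a, b), weight in pair_counts.items()]

papers = [["A1", "A2", "A3"], ["A2", "A1"], ["A3"]]
edges = weighted_edgelist(papers)
print(sorted(edges))
```

If `author_ids` was saved to CSV as a list, each cell is read back as a string and may need `ast.literal_eval` before this step.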

> 2. **Graph Construction:**
>    - Use NetworkX to create an undirected [``Graph``](https://networkx.org/documentation/stable/reference/classes/graph.html).
>    - Employ the [`add_weighted_edges_from`](https://networkx.org/documentation/stable/reference/classes/generated/networkx.Graph.add_weighted_edges_from.html#networkx.Graph.add_weighted_edges_from) function to populate the graph with the weighted edgelist from step 1, c

> 3. **Node Attributes:**
>    - For each node, add attributes for the author's _display name_, _country_, _citation count_, and the _year of their first publication_ in Computational Social Science. The _display name_ and _country_ can be retrieved from your _authors_ dataset. The _year of their first publication_ and the _citation count_  can be retrieved from the _papers_ dataset.
>    - Save the network as a JSON file.
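
The two paper-derived attributes can be aggregated in a single pass over the papers. This sketch assumes that "citation count" means the sum of `cited_by_count` over an author's papers in the dataset, and that every paper has a publication year. Both are assumptions, and `author_attributes` is a hypothetical helper name.

```python
def author_attributes(papers):
    """Map each author ID to first publication year and total citations."""
    attributes = {}
    for paper in papers:
        year = paper["publication_year"]
        for author in paper["author_ids"]:
            entry = attributes.setdefault(
                author, {"first_publication_year": year, "citation_count": 0}
            )
            # Keep the earliest year seen and accumulate citations.
            entry["first_publication_year"] = min(entry["first_publication_year"], year)
            entry["citation_count"] += paper["cited_by_count"]
    return attributes

sample_papers = [
    {"author_ids": ["A1", "A2"], "publication_year": 2015, "cited_by_count": 40},
    {"author_ids": ["A1"], "publication_year": 2012, "cited_by_count": 11},
]
attrs = author_attributes(sample_papers)
print(attrs)
```

A dictionary of this shape can be passed to NetworkX's `set_node_attributes`, and `node_link_data` produces a structure that `json.dump` can write to a file.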

> **Part 2: Preliminary Network Analysis**
> Now, with the network constructed, perform a basic analysis to explore its features.

> 1. **Network Metrics:**
>    - What is the total number of nodes (authors) and links (collaborations) in the network? 
>    - Calculate the network's density (the ratio of actual links to the maximum possible number of links). Would you say that the network is sparse? Justify your answer.

>    - Is the network fully connected (i.e., is there a direct or indirect path between every pair of nodes within the network), or is it disconnected?
>    - If the network is disconnected, how many connected components does it have? A connected component is defined as a subset of nodes within the network where a path exists between any pair of nodes in that subset. 
>    - How many isolated nodes are there in your network?  An isolated node is defined as a node with no connections to any other node in the network.
>    - Discuss the results above on network density, and connectivity. Are your findings in line with what you expected? Why?  __(answer in max 150 words)__
> 
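
For an undirected network with N nodes and L links, the density is 2L / (N(N-1)). The pure-Python sketch below computes it together with connected components and isolated nodes on a toy graph, only to make the definitions concrete. The function names are illustrative.

```python
def density(n_nodes, n_edges):
    """Ratio of actual links to the maximum possible in an undirected graph."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

def connected_components(nodes, edges):
    """Return a list of node sets, one per connected component."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first traversal collects everything reachable from `node`.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components

nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C")]
components = connected_components(nodes, edges)
isolated = [c for c in components if len(c) == 1]
print(density(len(nodes), len(edges)), len(components), len(isolated))
```

On the real graph, NetworkX provides `nx.density`, `nx.number_connected_components` and `nx.isolates` for the same quantities.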

> 3. **Degree Analysis:**
>    - Compute the average, median, mode, minimum, and maximum degree of the nodes. Perform the same analysis for node strength (weighted degree). What do these metrics tell us 
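
Degree counts a node's distinct collaborators, while strength sums the weights of its links. A minimal sketch of both summaries, computed from a weighted edgelist with the standard library, is shown below. The edge data is made up, and nodes without any link do not appear in an edgelist, so isolated nodes would have to be added with degree 0 separately.

```python
import statistics
from collections import defaultdict

def degree_and_strength(edges):
    """Return ({node: degree}, {node: strength}) from (a, b, weight) tuples."""
    degree, strength = defaultdict(int), defaultdict(int)
    for a, b, weight in edges:
        for node in (a, b):
            degree[node] += 1         # one more distinct neighbour
            strength[node] += weight  # weight = number of joint papers
    return dict(degree), dict(strength)

def summarize(values):
    """Average, median, mode, minimum and maximum of a collection."""
    values = list(values)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
    }

edges = [("A1", "A2", 2), ("A1", "A3", 1), ("A2", "A3", 1), ("A3", "A4", 5)]
degree, strength = degree_and_strength(edges)
print(summarize(degree.values()))
print(summarize(strength.values()))
```

A mean well above the median in either summary would point to a few highly connected authors among many with only a handful of collaborators.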

> 4. **Top Authors:**
>    - Identify the top 5 authors by degree. What role do these nodes play in the network?

>    - Research these authors online. What areas do they specialize in? Do you think that their work aligns with the themes of Computational Social Science? If not, what could be possible reasons? __(answer in max 150 words)__