Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Rosa Muilu"
STUDENT_ID = "14387360"

---

# Analyzing Gender Distribution Among Scientific Authors in Computational Social Science

*Objective*: Understand the gender distribution of authors across different scientific disciplines using web scraping and API-based gender identification.

Gender diversity in research is crucial for ensuring diverse perspectives and approaches in scientific inquiry, and for the comprehensiveness and richness of research findings. A balanced gender representation can help challenge systemic biases that might otherwise marginalize or overlook significant areas of study. A diverse research community can also act as a role model, inspiring future generations of all genders to pursue scientific endeavors.

This assignment focuses on the question of the gender distribution of researchers in different disciplines, and on identifying how often women are the first or last author of publications. 

To do so, you will scrape a preprint website, and you will use the API genderize.io to identify the gender of the author based on their name.

1. Prepare: Identify a source and decide a scraping strategy

2. Scrape the list of articles and authors

3. Use API to identify gender 

4. Analyze gender distribution and authorship order

5. Reflect on your findings. 

6. Scrape the paper abstracts

### Setup and requirements
First make sure that you have the needed libraries for Python correctly installed.

In [2]:
# Selenium
# !pip install selenium
# !pip install webdriver-manager
# !pip install webdriver-manager --upgrade
# !pip install packaging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# driver.get("https://www.google.com")

In [3]:
# Request
#!pip install requests
import requests

In [4]:
# Beautifulsoup
#!pip install beautifulsoup4
from bs4 import BeautifulSoup

In [5]:
import pandas as pd
import numpy as np
import json
import pickle


## 1. Plan and strategize

We first need to decide which site to scrape and our strategy for doing so. We will focus on a preprint repository. Preprint repositories host and disseminate research papers before they are peer-reviewed and published in academic journals. They therefore give a view of the latest research.

There are several repositories that represent different scientific disciplines (e.g., PubMed for life sciences, arXiv for physics and computer science, JSTOR for humanities and social sciences, SocArxiv for social science, etc.).

We will here focus on arxiv.org, where many Computational Social Scientists publish, often under the category "Computers and Society".

You need to pick a page on ArXiv where you can get a representative sample of these research papers -- and which you are allowed to scrape.

1. Browse Arxiv.org, and select a page on the website where you can find a sample of research papers.
2. Check the robots.txt. Are you allowed to scrape the page you selected? (If not, you will have to choose another one!)
3. Decide a strategy for scraping the page as quickly and easily as possible to find the names of the authors for each paper, their titles, and a link to the pages.
4. Choose which Python libraries for scraping that you will use.
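
Step 2 above can also be checked programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the `example.org` URLs are invented for illustration and are not arXiv's actual rules; the real file must be read from the site's own `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to illustrate the check.
# It is NOT a copy of any real site's rules.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Allow: /list/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL.
list_allowed = parser.can_fetch("*", "https://example.org/list/cs.CY/recent")
search_allowed = parser.can_fetch("*", "https://example.org/search/?query=x")
print(list_allowed, search_allowed)
```

To check a live site instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads the file before the same `can_fetch()` calls.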

### Question 1: Which library is most suitable?

Given the structure of the website, which Python scraping library do you think is most appropriate to use? Motivate your choice in a few sentences.

I'm using Requests with Beautiful Soup, as the site is not dynamic and there is no need to scroll et cetera.

[Evaluation: This is an open question. Any motivation that makes sense is fine, but in general, requests make more sense for this page than selenium, since the site in question is not dynamic. Using selenium will be slower and more difficult.]

## 2. Scrape the list of articles and authors 

Implement your scraping strategy. Scrape the page and collect the information about the publication. 

- You will need to get (1) the link to the article, (2) the title of the article, (3) the names of all authors of the paper, in the same order as they appear on the paper. 
- You need to scrape 200 research papers.

- Note that you may need to iterate over multiple pages.
- Note that you need to handle possible exceptions and that your code needs to be able to restart if it crashes.
- Your final result should be a list of dicts, with keys 'title', 'url', and 'authors'. 'authors' should consist of a list where the authors are listed in the order that they were on the paper. 
- You need to clean and validate your data: check that all papers have authors, that all papers have titles, clean the texts to remove empty spaces and similar, etc.
- Store the resulting array persistently as a pickle with the name 'scraping_result.pkl'.

For instance: [{'title': 'How to use Large Language Models for Text Analysis', 'authors': ['Törnberg, Petter'], 'url':'https://arxiv.org/abs/2307.13106' } ...]
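
The cleaning and validation step described above could look roughly like the sketch below. The helper name `clean_records` and the toy records are assumptions made for illustration; it only normalizes whitespace and drops records that lack a title, a URL, or authors.

```python
def clean_records(records):
    """Normalize whitespace and drop records lacking a title, URL, or authors."""
    cleaned = []
    for record in records:
        # split() + join collapses newlines, tabs, and repeated spaces.
        title = " ".join(str(record.get("title", "")).split())
        url = str(record.get("url", "")).strip()
        authors = [" ".join(str(a).split()) for a in record.get("authors", [])]
        authors = [a for a in authors if a]  # discard empty author strings
        if title and url and authors:
            cleaned.append({"title": title, "url": url, "authors": authors})
    return cleaned


# Toy input, invented for illustration only.
raw = [
    {"title": "  A  Paper\n Title ", "url": " https://example.org/abs/1 ",
     "authors": ["Ada  Lovelace", " "]},
    {"title": "", "url": "https://example.org/abs/2", "authors": ["Grace Hopper"]},
    {"title": "No Authors", "url": "https://example.org/abs/3", "authors": []},
]
records = clean_records(raw)
print(records)
```

Author order is preserved because the list comprehension keeps the original sequence.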


In [13]:
data_list = ...

# YOUR CODE HERE
base_url = "https://export.arxiv.org/"
article_list = 'list/cs.CV/'
show = 200

url = f'{base_url}{article_list}pastweek?skip=0&show={show}'
response = requests.get(url)
if response.status_code == 200:
    print('done!')
else:
    print(response)

data_list = []
soup = BeautifulSoup(response.text, 'html.parser')
article_list_soup = soup.select('dd')

for article_soup in article_list_soup:
    authors = article_soup.find_all('a')
    authors_list = []
    for author in authors:
        name = author.get_text()
        name = name.replace('\n', '').strip()
        name = name.replace('Authors:', '')
        authors_list.append(name)
    title = article_soup.find('div', {'class': 'list-title mathjax'}).get_text()
    article = {
        'title': title[8:].strip(),
        'authors': authors_list
    }
    data_list.append(article)

identifiers = soup.find_all('a', {'title': 'Abstract'})

# Pair each article with its abstract link; both lists follow the page order.
for article, identifier in zip(data_list, identifiers):
    article['url'] = base_url + identifier['href']

data = pd.DataFrame(data_list)

done!


In [14]:
# Check if keys exists in dictionary
assert 'title' in data_list[0], "Key 'title' not found in dictionary"
assert 'authors' in data_list[0], "Key 'authors' not found in dictionary"
assert 'url' in data_list[0], "Key 'url' not found in dictionary"

In [15]:
data.to_pickle('scraping_result.pkl')

In [6]:
with open('scraping_result.pkl', 'rb') as file:
    df = pickle.load(file)

In [8]:
df.head()

Unnamed: 0,title,authors,url
0,Scaling Laws of Synthetic Images for Model Tra...,"[Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina...",https://export.arxiv.org//abs/2312.04567
1,Gen2Det: Generate to Detect,"[Saksham Suri, Fanyi Xiao, Animesh Sinha, Sean...",https://export.arxiv.org//abs/2312.04566
2,MuRF: Multi-Baseline Radiance Fields,"[Haofei Xu, Anpei Chen, Yuedong Chen, Christos...",https://export.arxiv.org//abs/2312.04565
3,EAGLES: Efficient Accelerated 3D Gaussians wit...,"[Sharath Girish, Kamal Gupta, Abhinav Shrivast...",https://export.arxiv.org//abs/2312.04564
4,Visual Geometry Grounded Deep Structure From M...,"[Jianyuan Wang, Nikita Karaev, Christian Ruppr...",https://export.arxiv.org//abs/2312.04563


## 3. Use Genderize.io to identify author gender

The next step is to identify the gender of the authors. To do so, we will use the free API genderize.io. 

1. Go to https://genderize.io/ and read the API documentation.
2. Do you need to register to use it? Do you need an API key? 
3. How do you call the API? What parameters do you need to send? 
4. What rate limiting is used? How long do you need to wait between calls?

You will use what you learned to carry out the following tasks.


#### Task 1: _identify_gender()_
Write a function _identify_gender(first_name)_ that takes a name, and uses the API to guess the gender. The function should send a request to genderize.io, and parse the resulting json to a dict. The function should return a dict with the data provided by the API.

In [9]:
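def identify_gender(first_name):
    # YOUR CODE HERE
    # Pass the name via `params` so that requests URL-encodes it (spaces, accents).
    response = requests.get('https://api.genderize.io', params={'name': first_name})
    # Parse the JSON body into a dict and return it unchanged.
    return response.json()
# Test
print(identify_gender("Sasha"))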

{'count': 14512, 'name': 'Sasha', 'gender': 'female', 'probability': 0.52}


#### Task 2: Identify gender of all authors

Your task is now to use your new function to identify the genders of all authors that you previously scraped. 

To do so, you first need to extract the first name of each author. You need to iterate over these names and use your function to identify the gender of the author.

Your result should be a dataframe with the following columns:

- article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability

_Author_order_ should be a number specifying the author's position in the author list of the publication (e.g., 0 = first author, 1 = second author, ...). _Estimated_gender_ should contain the gender returned by the API, and _gender_probability_ the certainty of that estimate, according to the API.
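
The reshaping into one row per author can be sketched with pandas as follows. The toy data is invented, and taking the first whitespace-separated token as the first name is a simplifying assumption that will not hold for every naming convention.

```python
import pandas as pd

# Toy data, invented for illustration only.
papers = pd.DataFrame({
    "url": ["https://example.org/abs/1", "https://example.org/abs/2"],
    "authors": [["Ada Lovelace", "Grace Hopper", "Alan Turing"], ["Emmy Noether"]],
})

# One row per (paper, author); explode keeps the order of each list.
authors_long = papers.explode("authors").reset_index(drop=True)
authors_long = authors_long.rename(
    columns={"url": "article_url", "authors": "author_full_name"}
)

# Position of each author within their paper: 0 = first author.
authors_long["author_order"] = authors_long.groupby("article_url").cumcount()

# Naive first-name guess: the first whitespace-separated token.
authors_long["first_name"] = authors_long["author_full_name"].str.split().str[0]
print(authors_long)
```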

Note:
- You will need to transform your dict to the dataframe shown above, with one author per line. (This means that each URL will be associated to multiple author names.)
- Make sure that you respect the rate limiting of the API. 
- Make sure that you handle exceptions and that your function continues running when a call fails.
- Note that you get a maximum of 1,000 free calls per day, so you need to make sure that you do not waste your API calls!
- The API may not have all names stored. For these names, store a _np.nan_ value as the gender.

Pickle the resulting dataframe with the name: 'author_gender.df.pkl'
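
Because the daily quota is limited, one way to avoid wasting calls is to query each distinct first name only once and reuse the answer. The sketch below is an assumption about how this could be organized, not a prescribed solution: `lookup_genders` and `fake_lookup` are hypothetical names, and the lookup function is passed in as a parameter so that the example runs without touching the real API.

```python
import time


def lookup_genders(first_names, lookup, delay_seconds=1.0, cache=None):
    """Call `lookup` once per distinct name and reuse cached answers."""
    cache = {} if cache is None else cache
    for name in first_names:
        if name in cache:
            continue  # already known, no API call needed
        try:
            result = lookup(name)
        except Exception as exc:  # network errors, malformed responses, ...
            print(f"lookup failed for {name!r}: {exc}")
            continue
        if "error" in result:
            # For example the request limit was reached: stop and resume later.
            break
        cache[name] = result
        time.sleep(delay_seconds)  # pause between calls
    return cache


# Stand-in for the real API call, so the example makes no network requests.
calls = []


def fake_lookup(name):
    calls.append(name)
    return {"name": name, "gender": None, "probability": 0.0}


cache = lookup_genders(["Ada", "Grace", "Ada"], fake_lookup, delay_seconds=0)
print(calls)
```

Passing a previously pickled `cache` back in on the next day lets the loop skip every name that was already resolved.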

In [19]:
import pickle
import pandas as pd
import requests
import numpy as np
import time
from ast import literal_eval

# Load your scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

df = data_list.explode('authors')
df['first_name'] = df['authors'].str.split(' ').str[0]
df.reset_index(drop=True, inplace=True)
df['author_order'] = df.groupby('url').cumcount()


df['gender'] = df['first_name'].apply(identify_gender)
display(df)



Unnamed: 0,title,authors,url,first_name,author_order,gender
0,Scaling Laws of Synthetic Images for Model Tra...,Lijie Fan,https://export.arxiv.org//abs/2312.04567,Lijie,0,"{'count': 53, 'name': 'Lijie', 'gender': 'male..."
1,Scaling Laws of Synthetic Images for Model Tra...,Kaifeng Chen,https://export.arxiv.org//abs/2312.04567,Kaifeng,1,"{'count': 18, 'name': 'Kaifeng', 'gender': 'ma..."
2,Scaling Laws of Synthetic Images for Model Tra...,Dilip Krishnan,https://export.arxiv.org//abs/2312.04567,Dilip,2,"{'count': 1956, 'name': 'Dilip', 'gender': 'ma..."
3,Scaling Laws of Synthetic Images for Model Tra...,Dina Katabi,https://export.arxiv.org//abs/2312.04567,Dina,3,"{'count': 98305, 'name': 'Dina', 'gender': 'fe..."
4,Scaling Laws of Synthetic Images for Model Tra...,Phillip Isola,https://export.arxiv.org//abs/2312.04567,Phillip,4,"{'count': 106320, 'name': 'Phillip', 'gender':..."
...,...,...,...,...,...,...
1180,AI-SAM: Automatic and Interactive Segment Anyt...,Alison D. Gernand,https://export.arxiv.org//abs/2312.03119,Alison,2,{'error': 'Request limit reached'}
1181,AI-SAM: Automatic and Interactive Segment Anyt...,Jeffery A. Goldstein,https://export.arxiv.org//abs/2312.03119,Jeffery,3,{'error': 'Request limit reached'}
1182,AI-SAM: Automatic and Interactive Segment Anyt...,James Z. Wang,https://export.arxiv.org//abs/2312.03119,James,4,{'error': 'Request limit reached'}
1183,ScAR: Scaling Adversarial Robustness for LiDAR...,Xiaohu Lu,https://export.arxiv.org//abs/2312.03085,Xiaohu,0,{'error': 'Request limit reached'}


In [20]:
df.to_pickle('gender_data.pkl')

In [10]:
with open('gender_data.pkl', 'rb') as file:
    df = pickle.load(file)

In [22]:
# request limit reached, completed next day
# Assign through .loc so the update is written to df itself rather than to a temporary copy
# (the index was reset earlier, so label 737 is also position 737).
df.loc[737:, 'gender'] = df.loc[737:, 'first_name'].apply(identify_gender)


In [26]:
df['gender_probability'] = df['gender'].apply(lambda x: x.get('probability'))
df['estimated_gender'] = df['gender'].apply(lambda x: x.get('gender'))

In [27]:
df['author_order'] = df.groupby('url').cumcount()

In [28]:
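new_names = {'url': 'article_url', 'authors': 'author_full_name'}
df = df.rename(columns=new_names)
# Select the required columns by name rather than by position
df = df[['article_url', 'author_full_name', 'first_name', 'author_order', 'estimated_gender', 'gender_probability']]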


In [30]:
df

Unnamed: 0,article_url,author_full_name,first_name,author_order,estimated_gender,gender_probability
0,https://export.arxiv.org//abs/2312.04567,Lijie Fan,Lijie,0,male,0.51
1,https://export.arxiv.org//abs/2312.04567,Kaifeng Chen,Kaifeng,1,male,1.00
2,https://export.arxiv.org//abs/2312.04567,Dilip Krishnan,Dilip,2,male,0.99
3,https://export.arxiv.org//abs/2312.04567,Dina Katabi,Dina,3,female,0.99
4,https://export.arxiv.org//abs/2312.04567,Phillip Isola,Phillip,4,male,1.00
...,...,...,...,...,...,...
1180,https://export.arxiv.org//abs/2312.03119,Alison D. Gernand,Alison,2,female,1.00
1181,https://export.arxiv.org//abs/2312.03119,Jeffery A. Goldstein,Jeffery,3,male,1.00
1182,https://export.arxiv.org//abs/2312.03119,James Z. Wang,James,4,male,1.00
1183,https://export.arxiv.org//abs/2312.03085,Xiaohu Lu,Xiaohu,0,male,0.98


In [31]:
assert 'article_url' in df.columns, "article_url column is missing"
assert 'author_full_name' in df.columns, "author_full_name column is missing"
assert 'first_name' in df.columns, "first_name column is missing"
assert 'author_order' in df.columns, "author_order column is missing"
assert 'estimated_gender' in df.columns, "estimated_gender column is missing"
assert 'gender_probability' in df.columns, "gender_probability column is missing"

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)

display(df.head(10))

Unnamed: 0,article_url,author_full_name,first_name,author_order,estimated_gender,gender_probability
0,https://export.arxiv.org//abs/2312.04567,Lijie Fan,Lijie,0,male,0.51
1,https://export.arxiv.org//abs/2312.04567,Kaifeng Chen,Kaifeng,1,male,1.0
2,https://export.arxiv.org//abs/2312.04567,Dilip Krishnan,Dilip,2,male,0.99
3,https://export.arxiv.org//abs/2312.04567,Dina Katabi,Dina,3,female,0.99
4,https://export.arxiv.org//abs/2312.04567,Phillip Isola,Phillip,4,male,1.0
5,https://export.arxiv.org//abs/2312.04567,Yonglong Tian,Yonglong,5,male,1.0
6,https://export.arxiv.org//abs/2312.04566,Saksham Suri,Saksham,0,male,1.0
7,https://export.arxiv.org//abs/2312.04566,Fanyi Xiao,Fanyi,1,male,0.53
8,https://export.arxiv.org//abs/2312.04566,Animesh Sinha,Animesh,2,male,1.0
9,https://export.arxiv.org//abs/2312.04566,Sean Chang Culatana,Sean,3,male,1.0


## 4. Analyze gender distribution and authorship order

Now that you have gathered the necessary data, you will use this data to answer some research questions about gender equality in CSS research. Note that in calculating this, you need to handle that the API may have failed to identify the gender of some authors.

1. What fraction of the authors included are women? 
2. What happens to this fraction if you only include authors for which the gender_probability is higher than 80%? 
3. Being the first or single author on a research paper is an important status signal among researchers: it often means that you did most of the work. What fraction of the first or single authors are women? 
4. Being the _last_ author on a research paper with _three or more authors_ is also an important status signal: it tends to mean that you were the one to acquire funding or lead the lab. What fraction of the last authors on papers with three or more authors are women?


In [32]:
with open('author_gender.df.pkl', 'rb') as file:
    df = pickle.load(file)

In [33]:
# YOUR CODE HERE
gender_df = df[['estimated_gender', 'gender_probability', 'author_order', 'article_url']]
gender_df = gender_df.dropna()

# 1st question
fractions = gender_df['estimated_gender'].value_counts(normalize=True)
female_proportion = fractions.get('female')
print(f'The fraction of women is {female_proportion}')

# 2nd question
gender_df_copy = gender_df[gender_df['gender_probability'] > 0.8]
fractions = gender_df_copy['estimated_gender'].value_counts(normalize=True)
female_proportion = fractions.get('female')
print(f'The fraction of women when including only cases when gender probability is higher than 80 % is {female_proportion}')

# 3rd question
gender_df_first_authors = gender_df[gender_df['author_order'] == 0]
fractions = gender_df_first_authors['estimated_gender'].value_counts(normalize=True)
female_proportion = fractions.get('female')
print(f'The fraction of women in first or single authors is {female_proportion}')

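# 4th question
# Note: this takes the last remaining author of every paper after dropna();
# it does not restrict the sample to papers with three or more authors.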
gender_df_last_authors = gender_df.groupby('article_url')['estimated_gender'].last()
fractions = gender_df_last_authors.value_counts(normalize=True)
female_proportion = fractions.get('female')
print(f'The fraction of women in last authors is {female_proportion}')

The fraction of women is 0.19929140832595216
The fraction of women when including only cases when gender probability is higher than 80 % is 0.14675324675324675
The fraction of women in first or single authors is 0.19021739130434784
The fraction of women in last authors is 0.1507537688442211
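
Question 4 concerns last authors on papers with three or more authors only. A minimal sketch of one way to apply that restriction is shown below, using invented toy data rather than the scraped results, so the number it produces says nothing about the real sample. The column names follow the dataframe described in section 3.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "article_url": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "author_order": [0, 1, 2, 0, 1, 0, 1, 2, 3],
    "estimated_gender": ["male", "male", "female", "female", "female",
                         "male", "female", "male", "male"],
})

# Number of authors per paper, aligned with each row.
n_authors = toy.groupby("article_url")["author_order"].transform("count")

# A row is a last author if its position is the final one on its paper.
is_last = toy["author_order"] == n_authors - 1
last_authors = toy[is_last & (n_authors >= 3)]

# Drop unknown genders only after selecting the last authors, so a missing
# value never promotes the second-to-last author into last place.
known = last_authors["estimated_gender"].dropna()
female_fraction = (known == "female").mean()
print(female_fraction)
```

Counting authors before any `dropna()` matters here: removing rows first would change both the author count and who appears to be last.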


## 5. Reflect on your findings

You have now carried out your analysis of the gender distribution in articles in CSS using scraped data. Reflect on your findings and method, and answer each of the following questions in a few sentences.

1. What are the implications of the observed gender distribution and author order in CSS? How do these distributions compare with your expectations?
2. How accurate do you think your findings are? What are the limitations of determining gender based solely on names? Are there cultural or regional nuances that the API might miss?
3. Reflect on the ethical considerations involved in scraping this data. 


1. The results indicate male-dominated authorship in a niche field of Computer Science. The difference between the share of women overall and the share of women in first or last author positions is not large. These distributions match my expectations, although I had hoped the imbalance would be smaller.

2. While such a skewed split does say something about the distribution, it reflects only 2-3 days of publishing. More data points would make the trend clearer. Determining gender from first names alone is not very accurate, as the same name can be gendered differently in different regions. For example, in Japan, the name 'Mika' is strongly female, but in Finland it is mainly male. Looking only at the first name also misses cultural differences in how gender relates to names across countries. Additionally, taking the first part of the name and assuming it is the first name will not be accurate for many Asian naming traditions.

3. While the site does allow the list data to be scraped, it is ethical not to do so extensively, as it increases bandwidth usage and might cause problems on the website's side. It is also ethical to abide by the website's own guidelines, such as using the export.arxiv.org mirror site instead of the normal site. 

## 6. Scrape the paper abstract

Your next task is to get the abstract for each paper. You will use these abstracts in a later exercise in the course, where we will use text analysis to examine whether the content of research papers is a function of the gender of the author. 

To do so, you need to iterate over the papers that you have already identified, and scrape the abstract from the URL listed. 

#### Task 1: scrape_abstract()
Write a function scrape_abstract(url) that goes to the research paper URL, and scrapes the content of the abstract. The function should return the abstract as a string, and nothing else.

In [34]:
import requests
from bs4 import BeautifulSoup

def scrape_abstract(url):
    """
    Fetch the abstract from the provided arXiv URL using Beautiful Soup.

    Parameters:
    - url (str): The URL of the arXiv paper.

    Returns:
    - str: The abstract of the paper.
    """

    # YOUR CODE HERE
    response = requests.get(url)
    time.sleep(1)
    
    soup = BeautifulSoup(response.text, 'html.parser')
    abstract = soup.find('blockquote', {'class': 'abstract mathjax'}).get_text()
    abstract = abstract[10:]
    return abstract
# Test
url = "https://export.arxiv.org/abs/2307.13106"
print(scrape_abstract(url))


 This guide introduces Large Language Models (LLM) as a highly versatile text
analysis method within the social sciences. As LLMs are easy-to-use, cheap,
fast, and applicable on a broad range of text analysis tasks, ranging from text
annotation and classification to sentiment analysis and critical discourse
analysis, many scholars believe that LLMs will transform how we do text
analysis. This how-to guide is aimed at students and researchers with limited
programming experience, and offers a simple introduction to how LLMs can be
used for text analysis in your own research project, as well as advice on best
practices. We will go through each of the steps of analyzing textual data with
LLMs using Python: installing the software, setting up the API, loading the
data, developing an analysis prompt, analyzing the text, and validating the
results. As an illustrative example, we will use the challenging task of
identifying populism in political texts, and show how LLMs move beyond the
existin
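
Slicing off a fixed number of characters only works while the label in front of the abstract has exactly that length. A slightly more defensive alternative is sketched below. The helper name `clean_abstract` is hypothetical, and the assumption that the extracted text starts with the label "Abstract:" is based on the slicing used above rather than on any documented page format.

```python
def clean_abstract(raw_text, label="Abstract:"):
    """Collapse whitespace and strip an optional leading label."""
    # split() + join turns newlines and repeated spaces into single spaces.
    text = " ".join(raw_text.split())
    if text.startswith(label):
        text = text[len(label):].lstrip()
    return text


example = clean_abstract("\nAbstract:  This guide\nintroduces LLMs. ")
unlabeled = clean_abstract("No label here")
print(example)
print(unlabeled)
```

Note that this also joins the hard-wrapped lines of the abstract into one line, which is usually convenient for later text analysis but differs from the raw output shown above.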

#### Task 2: Scrape all urls

You will now use your function to scrape all the URLs that you collected in step 2.

The following will provide instructions for how you can go about this task. However, there are several ways to do this, and you are free to choose your preferred method.

Prepare your data:

1. Load your list of dicts from step 2 (scraping_result.pkl)
2. Use it to create a dataframe. 
3. Add a column 'scraped' which should be False for all rows, and a column 'abstract' that should be None for all rows.
4. Store the dataframe persistently (e.g., by pickling it.)

The scraping procedure:

1. Load the persistent pickle as dataframe (so that if your computer crashes, the function will continue where you were)
2. Repeat the following steps until there are no more rows with scraped == False:
3. Fetch a random row with scraped == False
4. Go to the URL and scrape the abstract.
5. Set abstract column in the dataframe to the resulting abstract, set scraped to True.
6. Store the dataframe persistently as a pickle. 

Remember: 
- You may use another strategy. However, since you will be scraping many pages, you should expect your scraper to encounter problems along the way. You therefore need to make sure that you regularly store the results persistently.
- Make sure to handle any exceptions gracefully.
- Be respectful toward the website owners: wait at least one second between each call. 

Your final result should be a dataframe stored as 'scraped_abstracts.df.pkl', with filled 'abstract' and 'scraped' columns.
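
The procedure above can be sketched as a loop that saves a checkpoint after every successful scrape and leaves failed rows marked as not scraped, so that a later run picks them up again. This is one possible arrangement under stated assumptions: `scrape_pending` and `fake_scrape` are hypothetical names, the scraping function is passed in as a parameter, and the toy URLs are invented so that the example runs without network access.

```python
import random
import tempfile
import time
from pathlib import Path

import pandas as pd


def scrape_pending(df, scrape, checkpoint_path, delay_seconds=1.0):
    """Scrape rows with scraped == False in random order, saving after each success."""
    pending = df.index[~df["scraped"].astype(bool)].tolist()
    random.shuffle(pending)
    for idx in pending:
        try:
            df.at[idx, "abstract"] = scrape(df.at[idx, "url"])
            df.at[idx, "scraped"] = True
            df.to_pickle(checkpoint_path)  # persist progress immediately
        except Exception as exc:
            # Leave the row as scraped == False so that a later run retries it.
            print(f"failed for {df.at[idx, 'url']}: {exc}")
        time.sleep(delay_seconds)  # wait between requests
    return df


# Toy data and a stand-in scraper, invented for illustration only.
toy = pd.DataFrame({
    "url": ["u1", "u2", "u3"],
    "scraped": [False, False, False],
    "abstract": [None, None, None],
})


def fake_scrape(url):
    if url == "u2":
        raise ValueError("simulated failure")
    return f"abstract of {url}"


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "scraping_process.pkl"
    toy = scrape_pending(toy, fake_scrape, checkpoint, delay_seconds=0)
    reloaded = pd.read_pickle(checkpoint)

print(toy)
```

Running the function again on the reloaded checkpoint only revisits the rows that are still marked as not scraped.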

<!-- [Evaluation: ]
- Load dataframe as df
- Check that the len of df = len of the result list from question 2. 
- Check that each line has an abstract, with len() > 100 e.g.
 -->

In [35]:
with open('scraping_result.pkl', 'rb') as file:
    df = pickle.load(file)
df['scraped'] = False
df['abstract'] = None
with open('scraping_process.pkl', 'wb') as f:
    pickle.dump(df, f)

In [36]:
filepath = 'scraping_process.pkl'

def process_row(row):
    if not row['scraped']:
        url = row['url']
        abstract = scrape_abstract(url)
        row['abstract'] = abstract
        row['scraped'] = True
    return row

df = pd.read_pickle(filepath)
df = df.apply(process_row, axis=1)

# Rename and save the final dataframe
df.to_pickle('scraped_abstracts.df.pkl')