Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [2]:
NAME = "Rachna Mallara"
STUDENT_ID = "14444372"

---

# Analyzing Gender Distribution Among Scientific Authors in Computational Social Science

*Objective*: Understand the gender distribution of authors across different scientific disciplines using web scraping and API-based gender identification.

Gender diversity in research is crucial for ensuring diverse perspectives and approaches in scientific inquiry, and for the comprehensiveness and richness of research findings. A balanced gender representation can help challenge systemic biases that might otherwise marginalize or overlook significant areas of study. A diverse research community can also act as a role model, inspiring future generations of all genders to pursue scientific endeavors.

This assignment focuses on the question of the gender distribution of researchers in different disciplines, and on identifying how often women are the first or last author of publications. 

To do so, you will scrape a preprint website, and you will use the API genderize.io to identify the gender of the author based on their name.

1. Prepare: Identify a source and decide a scraping strategy

2. Scrape the list of articles and authors

3. Use API to identify gender 

4. Analyze gender distribution and authorship order

5. Reflect on your findings. 

6. Scrape the paper abstracts

### Setup and requirements
First make sure that you have the needed libraries for Python correctly installed.

In [1]:
# Selenium
# !pip install selenium
# !pip install webdriver-manager
# !pip install webdriver-manager --upgrade
# !pip install packaging
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from selenium.webdriver.common.by import By

# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# driver.get("https://www.google.com")

In [2]:
# Request
#!pip install requests
import requests

In [3]:
# Beautifulsoup
#!pip install beautifulsoup
from bs4 import BeautifulSoup

In [4]:
import pandas as pd
import numpy as np
import json

## 1. Plan and strategize

We first need to decide which site to scrape and our strategy for doing so. We will focus on a preprint repository. Preprint repositories host and disseminate research papers before they are peer-reviewed and published in academic journals. They therefore give a view of the latest research.

There are several repositories that represent different scientific disciplines (e.g., PubMed for life sciences, arXiv for physics and computer science, JSTOR for humanities and social sciences, SocArxiv for social science, etc.) 

We will here focus on arxiv.org, where many Computational Social Scientists publish, often under the category "Computers and Society".

You need to pick a page on ArXiv where you can get a representative sample of these research papers -- and which you are allowed to scrape.

1. Browse Arxiv.org, and select a page on the website where you can find a sample of research papers.
2. Check the robots.txt. Are you allowed to scrape the page you selected? (If not, you will have to choose another one!)
3. Decide a strategy for scraping the page as quickly and easily as possible to find the names of the authors for each paper, their titles, and a link to the pages.
4. Choose which Python libraries for scraping that you will use.

### Question 1: Which library is most suitable?

Given the structure of the website, which Python libraries for scraping do you think is appropriate to use? Motivate your choice in a few sentences.

_[Student answers here.]_

[Evaluation: This is an open question. Any motivation that makes sense is fine, but in general, requests make more sense for this page than selenium, since the site in question is not dynamic. Using selenium will be slower and more difficult.]

## 2. Scrape the list of articles and authors 

Implement your scraping strategy. Scrape the page and collect the information about the publication. 

- You will need to get (1) the link to the article, (2) the title of the article, (3) the names of all authors of the paper, in the same order as they appear on the paper. 
- You need to scrape 200 research papers.

- Note that you may need to iterate over multiple pages.
- Note that you need to handle possible exceptions and that your code needs to be able to restart if it crashes.
- You final result should be a list of dicts, with keys 'title', 'url', and 'authors'. 'authors' should consist of a list where the authors are listed in the order that they were on the paper. 
- You need to clean and validate your data: check that all papers have authors, that all papers have titles, clean the texts to remove empty spaces and similar, etc.
- Store the resulting array persistently as a pickle with the name 'scraping_result.pkl'.

For instance: [{'title': 'How to use Large Langauge Models for Text Analysis', 'authors': ['Törnberg, Petter'], 'url':'https://arxiv.org/abs/2307.13106' } ...]


In [5]:
import pickle
import copy
import requests
from bs4 import BeautifulSoup
import time

data_list = []
dict_papers = {}

for papers in range (0, 200, 50):
    #open the webpage using selenium
    url = f'https://arxiv.org/list/cs/pastweek?skip={papers}&show=50'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    for i in range(2, 102, 2):
        #extract title as string
        title = soup.select(f'#dlpage > dl > dd:nth-child({i}) > div > div.list-title.mathjax')
        title_text = title[0].get_text()[8:]
        title_text = title_text.strip("\n")
        dict_papers["title"] = title_text

        #extract author names as list of strings
        authors = soup.select(f'#dlpage > dl > dd:nth-child({i}) > div > div.list-authors')
        author_text = authors[0].get_text()[10:].split(", \n")
        author_text[-1] = author_text[-1].strip("\n")
        dict_papers["authors"] = author_text

        #get url as string
        url_paper = soup.select(f'#dlpage > dl > dt:nth-child({i-1}) > span > a:nth-child(1)')
        url_text = 'https://arxiv.org' + url_paper[0].get("href")
        dict_papers["url"] = url_text

        data_list.append(copy.deepcopy(dict_papers))
    
    time.sleep(15)

#print(data_list)

# YOUR CODE HERE
raise NotImplementedError()

NotImplementedError: 

In [8]:
#pickle the list

with open('scraping_result.pkl', 'wb') as f:
    pickle.dump(data_list, f)


In [9]:
# Check if keys exists in dictionary
assert 'title' in data_list[0], "Key 'title' not found in dictionary"
assert 'authors' in data_list[0], "Key 'author' not found in dictionary"
assert 'url' in data_list[0], "Key 'url' not found in dictionary"

## 3. Use Genderize.io to identify author gender

The next step is to identify the gender of the authors. To do so, we will use the free API genderdize.io. 

1. Go to https://genderize.io/ and read the API documnentation.
2. Do you need to register to use it? Do you need an API key? 
3. How do you call the API? What parameters do you need to send? 
4. What rate limiting is used? How long do you need to wait between calls?

You will use what you learned to carry out the following tasks.


#### Task 1: _identify_gender()_
Write a function _identify_gender(first_name)_ that takes a name, and uses the API to guess the gender. The function should send a request to genderize.io, and parse the resulting json to a dict. The function should return a dict with the data provided by the API.

In [73]:
#Written with help from code in this Github repository: https://github.com/acceptable-security/gender.py

def identify_gender(*args):
    base_url = "https://api.genderize.io/?name[]="

    list_names = list(args)

    # Batch the names into groups of 10 or less
    batches = [list_names[i:i + 10] for i in range(0, len(list_names), 10)]

    results_list = []

    for batch in batches:
        url = base_url + '&name[]='.join(batch)
        
        req = requests.get(url)
        results = json.loads(req.text)
        dict_name = {}

        for result in results:
            if 'error' not in result:
                if result["gender"] is not None:
                    dict_name["first_name"] = result["name"]
                    dict_name["estimated_gender"] = result["gender"]
                    dict_name["gender_probability"] = result["probability"]
                else:
                    dict_name["first_name"] = "None"
                    dict_name["estimated_gender"] = np.nan
                    dict_name["gender_probability"] = np.nan
            else:
                dict_name["first_name"] = "None"
                dict_name["estimated_gender"] = np.nan
                dict_name["gender_probability"] = np.nan
            
            results_list.append(copy.deepcopy(dict_name))

    return results_list

# Example function call
names = ["Owen", "Martha", "Google", "Ram", "Sushmita", "Karen", "Vivek", "Pulkit", "Hannah", "Rachna", "Xi"]
print(identify_gender(*names))

#raise NotImplementedError()

[{'first_name': 'None', 'estimated_gender': nan, 'gender_probability': nan}, {'first_name': 'Xi', 'estimated_gender': 'female', 'gender_probability': 0.97}]


#### Task 2: Identify gender of all authors

Your task is now to use your new function to identify the genders of all authors that you previously scraped. 

To do so, you first need to extract the first name of each author. You need to iterate over these names and use your function to identify the gender of the author.

Your result should be a dataframe with the following columns:

- article_url | author_full_name | first_name | author_order | estimated_gender | gender_probability

Author_order should be a number specifying where the author was in the author list for the publication (e.g., 0 = first author, 1 = second author...) _Estimated_gender_ should contain the API response on gender, and _gender_probability_ the certainty of the gender, according to the API.

Note:
- You will need to transform your dict to the dataframe shown above, with one author per line. (This means that each URL will be associated to multiple author names.)
- Make sure that you respect the rate limiting of the API. 
- Make sure that you handle exceptions and that your function continues 
- Note that you get a maximum of 1,000 free calls per day, so you need to make sure that you do not waste your API calls!
- The API may not have all names stored. For these names, store a _np.nan_ value as the gender.

Pickle the resulting dataframe with the name: 'author_gender.df.pkl'

In [61]:
import pickle
import pandas as pd
import requests
import numpy as np
import time

def name_splitter(writers):
    first_names = []
    
    for full_name in writers:
        first_name = full_name.split()[0]
        if '-' in first_name:
            first_name = first_name[0 : first_name.index('-')]
        first_names.append(first_name)
    
    return first_names
    

In [74]:
#main
# Load scraped data
with open('scraping_result.pkl', 'rb') as f:
    data_list = pickle.load(f)

authors = []
papers_df = [] #will be converted to df later

for record in data_list:
    authors = record["authors"]   #find list of authors
    #print(authors)
    first_names = name_splitter(authors) #obtain list of first names
    #print(first_names)
    results = identify_gender(*first_names) #obtain list of dicts with name and gender details
   
    for i, result in enumerate(results):
        papers_df.append([record["url"], authors[i], result["first_name"], i+1, result["estimated_gender"], result["gender_probability"]])

# convert to dataframe
papers_df = pd.DataFrame(papers_df, columns=['article_url', 'author_full_name', 'first_name', 'author_order', 'estimated_gender', 'gender_probability'])

print(papers_df.head())
# YOUR CODE HERE
#raise NotImplementedError()

                        article_url      author_full_name first_name  \
0  https://arxiv.org/abs/2311.10709         Rohit Girdhar       None   
1  https://arxiv.org/abs/2311.10708  Sai Saketh Rambhatla       None   
2  https://arxiv.org/abs/2311.10707         Xiaohui Zhang       None   
3  https://arxiv.org/abs/2311.10706          Zhongtian He       None   
4  https://arxiv.org/abs/2311.10702         Hamish Ivison       None   

   author_order  estimated_gender  gender_probability  
0             1               NaN                 NaN  
1             1               NaN                 NaN  
2             1               NaN                 NaN  
3             1               NaN                 NaN  
4             1               NaN                 NaN  


In [None]:
assert 'article_url' in df.columns, "article_url column is missing"
assert 'author_full_name' in df.columns, "author_full_name column is missing"
assert 'first_name' in df.columns, "first_name column is missing"
assert 'author_order' in df.columns, "author_order column is missing"
assert 'estimated_gender' in df.columns, "estimated_gender column is missing"
assert 'gender_probability' in df.columns, "gender_probability column is missing"

with open('author_gender.df.pkl', 'wb') as f:
    pickle.dump(df, f)

display(df.head(10))

## 4. Analyze gender distribution and authorship order

Now that you have gathered the necessary data, you will use this data to answer some research questions about gender equality in CSS research. Note that in calculating this, you need to handle that the API may have failed to identify the gender of some authors.

1. What fraction of the authors included are women? 
2. What happens to this fraction if you only include authors for which the gender_probability is higher than 80%? 
3. Being the first or single author on a research paper is an important status signal among researchers: it often means that you made the most work. What fraction of the first or single authors are women? 
4. Being the _last_ author on a research paper with _three or more authors_ is also an important status signal: it tends to mean that you were the one to acquire funding or lead the lab. What fraction of the last-authors on papers with three or more author are women?


In [None]:
# YOUR CODE HERE
raise NotImplementedError()


## 5. Reflect on your findings

You have now carried out your analysis of the gender distribution in articles in CSS using scraped data. Reflect on your findings and method, and answer each of the following questions in a few sentences.

1. What are the implications of the observed gender distribution and author order in CSS? How do these distributions compare with your expectations?
2. How accurate do you think your findings are? What are the limitations of determining gender based solely on names? Are there cultural or regional nuances that the API might miss?
3. Reflect on the ethical considerations involved in scraping this data. 


YOUR ANSWER HERE

## 6. Scrape the paper abstract

Your next task is to get the abstract for each paper. You will use these abstracts in a later exercise in the course, where we will use text analysis to examine whether the content of research papers are a function of the gender of the author. 

To do so, you need to iterate over the papers that you have already identified, and scrape the abstract from the URL listed. 

#### Task 1: scrape_abstract()
Write a function scrape_abstract(url) that goes to the research paper URL, and scrapes the content of the abstract. The function should return the abstract as a string, and nothing else.

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_abstract(url):
    """
    Fetch the abstract from the provided arXiv URL using XPath.

    Parameters:
    - url (str): The URL of the arXiv paper.

    Returns:
    - str: The abstract of the paper.
    """

    # YOUR CODE HERE
    raise NotImplementedError()

# Test
url = "https://arxiv.org/abs/2307.13106"
print(scrape_abstract(url))


#### Task 2: Scrape all urls

You will now use your function to scrape all the URLs that you collected in step 2.

The following will provide instructions for how you can go about this task. However, there are several ways to do this, and you are free to choose your preferred method.

Prepare your data:

1. Load your list of dicts from step 2 (scraping_result.pkl)
2. Use it to create a dataframe. 
3. Add a column 'scraped' which should be False for all rows, and a column 'abstract' that should be None for all rows.
4. Store the dataframe persistently (e.g., by pickling it.)

The scraping procedure:

1. Load the persitent pickle as dataframe (so that if your computer crashes, the function will continue where you were)
2. Repeat the following steps until there are no more rows with scraped == False:
3. Fetch a random row with scraped == False
4. Go to the URL and scrape the abstract.
5. Set abstract column in the dataframe to the resulting abstract, set scraped to True.
6. Store the dataframe persistently as a pickle. 

Remember: 
- You may use another strategy. However, since you will be scraping many pages, you should expect your scraper to encounter problems along the way. You therefore need to make sure that you regularly store the results persistently.
- Make sure to handle any exceptions gracefully.
- Be respectful toward the website owners: wait at least one second between each call. 

Your final result should be a dataframe stored as 'scraped_abstracts.df.pkl', with filled 'abstract' and 'scraped' columns.

<!-- [Evaluation: ]
- Load dataframe as df
- Check that the len of df = len of the result list from question 2. 
- Check that each line has an abstract, with len() > 100 e.g.
 -->

In [None]:
df = pd.DataFrame(data_list)

# YOUR CODE HERE
raise NotImplementedError()


# Rename and save the final dataframe
df.to_pickle('scraped_abstracts.df.pkl')