# U of T Statistical Sciences Research Program Project on Nobel Prize Winners

## Background

Li et al. state that 

> ... literature in the field of innovation shows that the prize-winning works by Nobel laureates tend to occur early within a career, providing evidence of precocious minds that break through in an exceptional way. By contrast, growing evidence shows that ordinary scientific careers are determined by the random impact rule, suggesting that age and creativity are not intertwined, and the most important work in a career occurs randomly within the sequence of works. Second, there is an acclaimed tradition in the history of science that emphasizes the role of individual genius in scientific discovery. However, one of the most fundamental shifts in science over the past century is the flourishing of large teams across all areas of science. This shift raises the question of whether Nobel laureates are unique in being solitary thinkers making guiding contributions.


## Question to explore

Among Nobel Prize winners in Physics, Chemistry, and Physiology or Medicine, what is the relationship between the scientific impact of a paper and the timing of that paper within a scientist's career? Does this relationship depend on a scientist's age, gender, team size, or prize category?

## Issues to consider

- How will you measure scientific impact?
  
- How will you measure timing of impact?
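
One possible way to make "timing" concrete is a paper's relative position within a career: years since the laureate's first recorded paper, divided by the span between their first and last recorded papers. The sketch below assumes the `Laureate ID` and `Pub year` column names from the laureate dataset; the rows are made-up toy values and `add_career_timing` is a hypothetical helper name.

```python
import pandas as pd

# Toy publication records (made-up values); column names follow the
# laureate dataset's "Laureate ID" and "Pub year" columns.
pubs = pd.DataFrame({
    "Laureate ID": [1, 1, 1, 2, 2],
    "Pub year": [1970, 1980, 1990, 2000, 2010],
})

def add_career_timing(df):
    """Add years since first paper and a 0-1 relative career position."""
    out = df.copy()
    by_laureate = out.groupby("Laureate ID")["Pub year"]
    first = by_laureate.transform("min")
    last = by_laureate.transform("max")
    out["career_year"] = out["Pub year"] - first
    # Leave the position undefined (NaN) when all papers share one year.
    span = (last - first).where(last > first)
    out["career_position"] = out["career_year"] / span
    return out

timed = add_career_timing(pubs)
print(timed)
```

Other definitions are equally defensible, for example the rank of the paper in the publication sequence or the laureate's age at publication, so this is only a starting point.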


## Data sources

You may use any data that is publicly available.

- [A dataset of publication records for Nobel laureates](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6NJ5RN). These data are used in ref. 1 and described in ref. 2.
  
- The [Crossref REST API](https://github.com/CrossRef/rest-api-doc) provides publication metadata, such as citation counts. Several libraries wrap this API, including [habanero](https://github.com/sckott/habanero) (Python) and [rcrossref](https://github.com/ropensci/rcrossref) (R).

- [The Nobel Prize Developer Zone](https://www.nobelprize.org/about/developer-zone-2/).  Endpoints for the API contain information on Laureates and Nobel Prizes.


## References

1. [Li, Jichao, Yian Yin, Santo Fortunato, and Dashun Wang. "Nobel laureates are almost the same as us." Nature Reviews Physics 1, no. 5 (2019): 301.](https://www-nature-com.myaccess.library.utoronto.ca/articles/s42254-019-0057-z)

2. [Li, Jichao, Yian Yin, Santo Fortunato, and Dashun Wang. "A dataset of publication records for Nobel laureates." Scientific data 6, no. 1 (2019): 33.](https://www.nature.com/articles/s41597-019-0033-6)

3. [Sinatra, Roberta, Dashun Wang, Pierre Deville, Chaoming Song, and Albert-László Barabási. "Quantifying the evolution of individual scientific impact." Science 354, no. 6312 (2016): aaf5239.](https://science.sciencemag.org/content/354/6312/aaf5239.short)




## Exploring data sources

### A dataset of publication records for Nobel Laureates 

In [71]:
# read in chem publication record

import pandas as pd

chem = pd.read_csv('Chemistry publication record.csv')

chem.head()

| | Laureate ID | Laureate name | Prize year | Title | Pub year | Paper ID | DOI | Journal | Affiliation | Is prize-winning paper |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20001 | stoddart, j | 2016 | a molecular shuttle | 1991 | 1976039000.0 | 10.1021/ja00013a096 | journal of the american chemical society | northwestern university | YES |
| 1 | 20001 | stoddart, j | 2016 | chemical synthesis of nanostructures | 1993 | 1963538000.0 | 10.1557/PROC-330-57 | mrs proceedings | northwestern university | NO |
| 2 | 20001 | stoddart, j | 2016 | formation and x ray crystal structure of pt h2... | 1981 | 1963552000.0 | 10.1039/C39810000851 | journal of the chemical society chemical commu... | northwestern university | NO |
| 3 | 20001 | stoddart, j | 2016 | single walled carbon nanotubes under the influ... | 2005 | 2095637000.0 | 10.1002/smll.200400070 | small | northwestern university | NO |
| 4 | 20001 | stoddart, j | 2016 | synthesis of medium heterocyclic rings from 6 ... | 1974 | 2095679000.0 | 10.1016/S0008-6215(00)82105-9 | carbohydrate research | northwestern university | NO |


### `habanero` library for publication data

#### Citation counts

- use [habanero](https://habanero.readthedocs.io/en/latest/) library to look up citation counts

In [72]:
from habanero import Crossref
cr = Crossref()

In [73]:
# pick the first chem paper via DOI in the dataset

a_chem_paper = chem['DOI'][0]

In [74]:
# get citation counts

from habanero import counts

counts.citation_count(doi = a_chem_paper)

667
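
To collect citation counts for every paper rather than a single DOI, the lookup above can be repeated in a loop. The following is a minimal sketch, not a tested pipeline: `citation_counts` is a hypothetical helper that takes the lookup as an argument so it can be tried offline, and the DOIs and counts in the demo are made up.

```python
import math

def citation_counts(dois, lookup):
    """Map each usable DOI to a citation count, or None if the lookup fails.

    `lookup` is any function that takes a DOI string and returns an int,
    e.g. lambda d: counts.citation_count(doi=d) with habanero.
    """
    results = {}
    for doi in dois:
        # Skip missing DOIs (None, NaN from pandas, or empty strings).
        if doi is None or (isinstance(doi, float) and math.isnan(doi)):
            continue
        if not str(doi).strip() or doi in results:
            continue  # also avoids repeating a request for a duplicate DOI
        try:
            results[doi] = lookup(doi)
        except Exception:
            results[doi] = None  # record the failure and keep going
    return results

# Offline stand-in for the real API call (made-up DOIs and numbers).
fake_counts = {"10.1000/a": 12, "10.1000/b": 3}
demo = citation_counts(
    ["10.1000/a", None, float("nan"), "10.1000/b", "10.1000/missing"],
    lookup=lambda d: fake_counts[d],
)
print(demo)
```

In the notebook this might be called as `citation_counts(chem['DOI'], lambda d: counts.citation_count(doi=d))`. With thousands of DOIs it is probably worth pausing between requests and saving results to disk as you go.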

#### Get number of publication authors

1. Extract (meta) data about `a_chem_paper`.

In [75]:
from habanero import cn
a_chem_paper_content = cn.content_negotiation(ids = a_chem_paper)

a_chem_paper_content.split(',')

[' @article{Anelli_1991',
 ' title={A molecular shuttle}',
 ' volume={113}',
 ' ISSN={1520-5126}',
 ' url={http://dx.doi.org/10.1021/ja00013a096}',
 ' DOI={10.1021/ja00013a096}',
 ' number={13}',
 ' journal={Journal of the American Chemical Society}',
 ' publisher={American Chemical Society (ACS)}',
 ' author={Anelli',
 ' Pier Lucio and Spencer',
 ' Neil and Stoddart',
 ' J. Fraser}',
 ' year={1991}',
 ' month=jun',
 ' pages={5131–5133} }\n']

2. Calculate the number of co-authors on `a_chem_paper_content`.

In [76]:
import re

# Find the BibTeX author field, e.g. 'author={A and B and C}'
match = re.search(r'author\s*=\s*{([^}]*)}', a_chem_paper_content)
if match:
    authors_str = match.group(1)
    # BibTeX separates authors with ' and '
    authors = [a.strip() for a in authors_str.split(' and ')]
    num_authors = len(authors)
    # Report only when the field was found; otherwise these names are undefined
    print("Number of authors:", num_authors)
    print("Number of co-authors:", num_authors - 1)
    print("Authors:", authors)
else:
    print("Author field not found.")


Number of authors: 3
Number of co-authors: 2
Authors: ['Anelli, Pier Lucio', 'Spencer, Neil', 'Stoddart, J. Fraser']
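
Parsing BibTeX with a regular expression can break on nested braces or unusual formatting. An alternative worth trying is the structured record returned by `cr.works(ids=...)`, whose `message` entry usually carries an `author` list. The sketch below assumes that structure; `count_authors` is a hypothetical helper, and the example record is hand-built from the three authors shown above rather than fetched from the API.

```python
def count_authors(message):
    """Return the number of authors listed in a Crossref work record.

    Returns None when the record has no 'author' field, so that missing
    metadata is not mistaken for a single-author paper.
    """
    authors = message.get("author")
    if not authors:
        return None
    return len(authors)

# Hand-built stand-in for cr.works(ids=a_chem_paper)["message"].
example_message = {
    "DOI": "10.1021/ja00013a096",
    "author": [
        {"given": "Pier Lucio", "family": "Anelli"},
        {"given": "Neil", "family": "Spencer"},
        {"given": "J. Fraser", "family": "Stoddart"},
    ],
}
team_size = count_authors(example_message)
print("Team size:", team_size)
```

Returning `None` for missing author metadata keeps those papers distinguishable from genuine solo-author papers when team size is later used as a covariate.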


###  Extract data on laureates using nobelprize.org API

- use the API to extract gender

In [77]:
import requests

BASE_URL = "https://api.nobelprize.org/2.1/laureates"


params = {
    "nobelPrizeYear": 2016,
    "nobelPrizeCategory": "che", # chemistry
    "limit": 25       
}


resp = requests.get(BASE_URL, params=params, timeout=30)
resp.raise_for_status()          # stop early on an HTTP error


payload = resp.json()            # the top-level JSON object
laureates = payload["laureates"] # list of laureate dicts


df = pd.json_normalize(laureates)

selected_cols = ['id', 'gender', 'knownName.en', 'familyName.en']

df[selected_cols]



| | id | gender | knownName.en | familyName.en |
|---|---|---|---|---|
| 0 | 933 | male | Bernard L. Feringa | Feringa |
| 1 | 931 | male | Jean-Pierre Sauvage | Sauvage |
| 2 | 932 | male | Sir J. Fraser Stoddart | Stoddart |
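
To use gender in an analysis, the API output has to be linked to the publication records, and the two sources spell names differently (`stoddart, j` versus `Stoddart`). One rough approach, sketched below, is to join on a lower-cased family name plus prize year. The rows are small illustrative stand-ins for the two tables shown above, `family_key` is a hypothetical column name, and family names with accents, particles, or duplicates within a year would need extra care.

```python
import pandas as pd

# Illustrative stand-ins for the publication and laureate tables.
pubs = pd.DataFrame({
    "Laureate name": ["stoddart, j", "stoddart, j", "sauvage, j"],
    "Prize year": [2016, 2016, 2016],
})
laureates = pd.DataFrame({
    "familyName.en": ["Feringa", "Sauvage", "Stoddart"],
    "gender": ["male", "male", "male"],
})
laureates["Prize year"] = 2016  # the year used in the API query

# Build a comparable key: the family name, lower-cased and trimmed.
pubs["family_key"] = (
    pubs["Laureate name"].str.split(",").str[0].str.strip().str.lower()
)
laureates["family_key"] = laureates["familyName.en"].str.strip().str.lower()

merged = pubs.merge(
    laureates[["family_key", "Prize year", "gender"]],
    on=["family_key", "Prize year"],
    how="left",                # keep every publication row
    validate="many_to_one",    # fail if a key matches several laureates
    indicator=True,            # adds a _merge column for checking matches
)
unmatched = merged[merged["_merge"] == "left_only"]
print(merged[["Laureate name", "Prize year", "gender"]])
print("Unmatched rows:", len(unmatched))
```

Inspecting `unmatched` after the join is a quick way to find laureates whose names need manual matching.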


## Statistical analyses

### Some suggestions and questions

1. Reproduce the analyses in [Li, Jichao, Yian Yin, Santo Fortunato, and Dashun Wang. "Nobel laureates are almost the same as us." Nature Reviews Physics 1, no. 5 (2019): 301.](https://www-nature-com.myaccess.library.utoronto.ca/articles/s42254-019-0057-z)
   
2. What are the dependent and independent variables related to exploring the relationship between scientific impact of a paper and the *timing* of the paper during a scientist's career?

3. How do age, gender, team size, and prize category affect this relationship?
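
As a starting point for question 2, one quantity to look at is where each scientist's most-cited paper falls in their publication sequence. Under the random impact rule described in the Background, these positions should be spread roughly evenly between 0 and 1 across scientists. The sketch below uses made-up careers, and `peak_position` is a hypothetical helper name.

```python
def peak_position(pub_years, citations):
    """Relative rank (0 to 1) of the most-cited paper in a career.

    Papers are ordered by publication year; ties in citations resolve
    to the earliest paper.
    """
    order = sorted(range(len(pub_years)), key=lambda i: pub_years[i])
    ranked = [citations[i] for i in order]
    best = max(range(len(ranked)), key=lambda i: ranked[i])
    return (best + 1) / len(ranked)

# Made-up toy careers: (publication years, citation counts).
careers = {
    "A": ([1990, 1985, 2000, 1995], [5, 50, 10, 7]),
    "B": ([2001, 2003, 2005], [1, 2, 30]),
}
positions = {name: peak_position(y, c) for name, (y, c) in careers.items()}
print(positions)
```

A histogram of these positions for laureates, split by prize category or team size, would be one way to compare them against a uniform distribution.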
   

### Statistical models that might be helpful

1. Linear regression
2. General linear model
3. Bayesian hierarchical model

