# INFO 2950 Group Project: What Makes a Nobel Prize Laureate?

## Research Questions:
1. Which region of the world produces the most Nobel Prize laureates?
2. At what age is a person most likely to win a Nobel Prize?
3. Does age at the time of winning the Nobel Prize vary between prize categories?
4. Is there a relationship between being affiliated with a higher-ranked university and winning the Nobel Prize?
    a. Which universities dominate each of the prize categories?
5. Does the Nobel Prize favor one gender over another?
6. What proportion of laureates have won the Nobel Prize twice or more?
7. What proportion of laureates are family members of other laureates?

## Data & Data Cleaning

In [309]:
import pandas as pd 
import numpy as np
import seaborn 
from matplotlib import pyplot
from datetime import datetime, date

In [310]:
nobel_data_raw = pd.read_csv("laureate.csv")
print(nobel_data_raw)

      id       firstname    surname        born        died  \
0      1  Wilhelm Conrad    Röntgen  1845-03-27  1923-02-10   
1      2  Hendrik Antoon    Lorentz  1853-07-18  1928-02-04   
2      3          Pieter     Zeeman  1865-05-25  1943-10-09   
3      4   Antoine Henri  Becquerel  1852-12-15  1908-08-25   
4      5          Pierre      Curie  1859-05-15  1906-04-19   
..   ...             ...        ...         ...         ...   
970  933      Bernard L.    Feringa  1951-05-18  0000-00-00   
971  934     Juan Manuel     Santos  0000-00-00  0000-00-00   
972  935          Oliver       Hart  1948-10-09  0000-00-00   
973  936           Bengt  Holmström  1949-04-18  0000-00-00   
974  937             Bob      Dylan  1941-05-24  0000-00-00   

               bornCountry bornCountryCode                bornCity  \
0    Prussia (now Germany)              DE  Lennep (now Remscheid)   
1          the Netherlands              NL                  Arnhem   
2          the Netherlands       

In [311]:
nobel_data_raw.shape

(975, 20)
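
The printed frame ends at row index 974 with `id` 937, which suggests that some laureates appear in more than one row. A minimal sketch, on hypothetical toy rows rather than the real file, of how one might compare the row count with the number of distinct laureates before computing any per-laureate proportions:

```python
import pandas as pd

# Hypothetical toy rows (not real data): id 2 appears twice
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "surname": ["A", "B", "B", "C"],
    "year": [1901.0, 1903.0, 1911.0, 1904.0],
})

n_rows = len(toy)                  # number of rows in the table
n_laureates = toy["id"].nunique()  # number of distinct laureate ids

# All rows whose id is shared with at least one other row
repeated = toy[toy["id"].duplicated(keep=False)]

print(n_rows, n_laureates)
print(repeated)
```

Whether a repeated `id` reflects a second prize or simply a second affiliation would still need to be checked against the `year` and `category` columns.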

In [312]:
print(nobel_data_raw['born'].dtypes)

object


In [313]:
nobel_data_raw['born'] = pd.to_datetime(nobel_data_raw['born'], format = '%Y-%m-%d', errors = 'coerce')
nobel_data_raw['died'] = pd.to_datetime(nobel_data_raw['died'], format = '%Y-%m-%d', errors = 'coerce')
#nobel_data_humans.dropna()
nobel_data_raw

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year,category,overallMotivation,share,motivation,name,city,country
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,Munich,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,Amsterdam,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,Paris,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,933,Bernard L.,Feringa,1951-05-18,NaT,the Netherlands,NL,Barger-Compascuum,,,,male,2016.0,chemistry,,3.0,"""for the design and synthesis of molecular mac...",University of Groningen,Groningen,the Netherlands
971,934,Juan Manuel,Santos,NaT,NaT,Colombia,CO,Bogotá,,,,male,2016.0,peace,,1.0,"""for his resolute efforts to bring the country...",,,
972,935,Oliver,Hart,1948-10-09,NaT,United Kingdom,GB,London,,,,male,2016.0,economics,,2.0,"""for their contributions to contract theory""",Harvard University,"Cambridge, MA",USA
973,936,Bengt,Holmström,1949-04-18,NaT,Finland,FI,Helsinki,,,,male,2016.0,economics,,2.0,"""for their contributions to contract theory""",Massachusetts Institute of Technology (MIT),"Cambridge, MA",USA
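
The placeholder `0000-00-00` is not a valid date, so `errors = 'coerce'` turns it into `NaT`. A minimal sketch on a hypothetical three-row frame (not the real file) showing the conversion and how missing dates can be counted with `isna()`, since comparing against `NaT` with `==` never matches:

```python
import pandas as pd

# Hypothetical toy data mimicking the 'born' and 'died' columns
toy = pd.DataFrame({
    "born": ["1845-03-27", "0000-00-00", "1941-05-24"],
    "died": ["1923-02-10", "0000-00-00", "0000-00-00"],
})

for col in ["born", "died"]:
    # Invalid strings such as '0000-00-00' become NaT instead of raising
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")

# isna() is the reliable missing-value test for datetime columns
missing_counts = toy[["born", "died"]].isna().sum()
print(missing_counts)
```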


In [314]:
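# Use the lifespan when a death date is known, otherwise the time elapsed since birth
age = (pd.Timestamp.now().normalize() - nobel_data_raw['born']).where(nobel_data_raw['died'].isnull(), other = nobel_data_raw['died'] - nobel_data_raw['born'])
# Convert to years using a mean Gregorian year (365.2425 days), the same length as
# np.timedelta64(1, 'Y'), which newer pandas versions reject as a divisor
age = age / pd.Timedelta(days = 365.2425)

# Whole years, rounded down (NaN where the birth date is missing)
nobel_data_raw['Age'] = np.floor(age)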
nobel_data_raw

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,...,gender,year,category,overallMotivation,share,motivation,name,city,country,Age
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,...,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany,77.0
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands,74.0
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands,78.0
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,...,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France,55.0
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,...,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France,46.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,933,Bernard L.,Feringa,1951-05-18,NaT,the Netherlands,NL,Barger-Compascuum,,,...,male,2016.0,chemistry,,3.0,"""for the design and synthesis of molecular mac...",University of Groningen,Groningen,the Netherlands,71.0
971,934,Juan Manuel,Santos,NaT,NaT,Colombia,CO,Bogotá,,,...,male,2016.0,peace,,1.0,"""for his resolute efforts to bring the country...",,,,
972,935,Oliver,Hart,1948-10-09,NaT,United Kingdom,GB,London,,,...,male,2016.0,economics,,2.0,"""for their contributions to contract theory""",Harvard University,"Cambridge, MA",USA,74.0
973,936,Bengt,Holmström,1949-04-18,NaT,Finland,FI,Helsinki,,,...,male,2016.0,economics,,2.0,"""for their contributions to contract theory""",Massachusetts Institute of Technology (MIT),"Cambridge, MA",USA,73.0
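
The `Age` column is the laureate's age at death (or today's age for living laureates), whereas the research questions about age concern the age at the time of winning. A hedged sketch of one way to approximate that, assuming the award year in `year` minus the birth year is close enough (the exact award date is not in the data); `age_at_award` is a hypothetical column name:

```python
import pandas as pd

# Hypothetical toy rows with the same 'born' and 'year' columns as the dataset
toy = pd.DataFrame({
    "surname": ["Röntgen", "Curie", "Santos"],
    "born": pd.to_datetime(["1845-03-27", "1859-05-15", None]),
    "year": [1901.0, 1903.0, 2016.0],
})

# Approximate age at award; rows with a missing birth date stay NaN
toy["age_at_award"] = toy["year"] - toy["born"].dt.year
print(toy[["surname", "age_at_award"]])
```

Because only the year of the award is used, the result can be off by one for laureates whose birthday falls after the award date in that year.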


In [315]:
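# Estimate current age (by birth year) for laureates with no recorded death date.
# A column cannot be tested with `if ... == NaT`; use a boolean mask from isna() instead.
alive = nobel_data_raw['died'].isna()
today = date.today()
nobel_data_raw.loc[alive, 'age'] = today.year - nobel_data_raw.loc[alive, 'born'].dt.year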

## Data Description 
### What are the observations (rows) and the attributes (columns)?

### Why was this dataset created?

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

### Who funded the creation of the dataset?

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

How many instances are there in total (of each type, if appropriate)?

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

What data does each instance consist of?

Is any information missing from individual instances?

Are there any errors, sources of noise, or redundancies in the dataset?

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

How was the data associated with each instance acquired?

### What processes might have influenced what data was observed and recorded and what was not?

### What preprocessing was done, and how did the data come to be in the form that you are using?

### If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

### Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box)


## Data Limitations

One primary limitation of our data is that it was collected in 2016, so it excludes the six years of laureates awarded since then.

## Exploratory Data Analysis