# 1. Introduction

The [Nobel Prize](https://www.nobelprize.org/) is one of the world's best-known awards. Aside from the honor, prestige, and substantial prize money, the recipient also gets a gold medal showing Alfred Nobel (1833-1896), who established the prize. Every year it is awarded to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

The first Nobel Prize was handed out in 1901. At the time, the prize was very Eurocentric and went mostly to men. How has this changed, if at all? That's what I'm trying to find out.

In [None]:
# Required modules
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Reading the data
nobel = pd.read_csv("../input/archive.csv")

# The first few winners
nobel.head(10)

Looking at the first few laureates, we can see that most of them were from Europe, and all of them were men. Does the full dataset show the same pattern?

In [None]:
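# Each row is one laureate, so a shared prize appears once per winner
print("Total laureate records (shared prizes counted once per winner):", len(nobel))
print("\nLaureates by sex and birth country:")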
display(nobel['Sex'].value_counts())
display(nobel['Birth Country'].value_counts().head(10))
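Raw counts are easier to compare once they are turned into shares. The sketch below shows how `value_counts(normalize=True)` does that on a small invented table (the rows are made up for illustration and are not taken from the Nobel data). It assumes that rows without a recorded sex, such as organizations, are stored as missing values.

```python
import pandas as pd

# Toy stand-in for the laureate table; every row here is invented.
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", None],
    "Birth Country": ["France", "Germany", "France", "Poland", None],
})

# normalize=True returns shares instead of counts.
# Missing values are dropped from the denominator by default.
sex_share = toy["Sex"].value_counts(normalize=True)

# dropna=False keeps the missing rows as their own bucket.
sex_share_with_missing = toy["Sex"].value_counts(normalize=True, dropna=False)

print(sex_share)
print(sex_share_with_missing)
```

Whether missing rows belong in the denominator is a judgment call, so it is worth stating which version a quoted percentage uses.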

# 2. Dominance of the USA

Men clearly dominate the Nobel Prizes, and the typical winner appears to be a man born in the United States of America. Yet, as we saw earlier, the earliest laureates were mostly European. When did the USA begin to dominate the charts? Let's find out.

In [None]:
# Proportion of US-born winners per decade
nobel['USA-born Winners'] = nobel['Birth Country'] == 'United States of America'
# Integer division bins each year into its decade (1901 -> 1900)
nobel['Decade'] = (nobel['Year'] // 10) * 10
prop_usa_winners = nobel[['USA-born Winners', 'Decade']].groupby('Decade', as_index = False).mean()

print("Proportion of winners born in the USA, per decade:")
display(prop_usa_winners)

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = [12, 7]
ax = sns.lineplot(x = "Decade", y = "USA-born Winners", data = prop_usa_winners)
# The values are fractions in [0, 1], so tell the formatter that 1.0 means 100%
ax.yaxis.set_major_formatter(PercentFormatter(1.0))

It seems that the United States became dominant in the 1930s and has not let go since. While its share has dropped somewhat in the last decade, it still holds the lead.
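The per-decade share rests on two small steps: binning each year into its decade, and taking the mean of a boolean column (True counts as 1, False as 0, so the mean is a proportion). A minimal sketch on invented rows, not the real data:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Year": [1901, 1909, 1910, 1935, 1939],
    "Birth Country": ["France", "Germany", "United States of America",
                      "United States of America", "Sweden"],
})

# Integer division is one way to bin a year into its decade.
toy["Decade"] = (toy["Year"] // 10) * 10

# Boolean flag; its mean within a group is the share of True values.
toy["USA-born"] = toy["Birth Country"] == "United States of America"
share = toy.groupby("Decade")["USA-born"].mean()

print(share)
```

Note that decades with few rows give noisy proportions, which is worth keeping in mind when reading the earliest and the most recent points of the line plot.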

# 3. Gender Differences

Let us now look at the other dominant group, men, and at how the gap between male and female winners varies across the prize categories.

In [None]:
# Proportion of female laureates per decade
nobel['Female Winners'] = nobel['Sex'] == 'Female'
prop_female_winners = nobel[['Female Winners', 'Decade', 'Category']].groupby(['Decade', 'Category'], as_index = False).mean()

ax = sns.lineplot(x = "Decade", y = "Female Winners", hue = "Category", data = prop_female_winners)
# The values are fractions in [0, 1], so tell the formatter that 1.0 means 100%
ax.yaxis.set_major_formatter(PercentFormatter(1.0))

The plot above is somewhat messy because the lines overlap, but it shows that the imbalance is very large, with Physics, Economics and Chemistry faring the worst. Medicine has been doing better since the 1960s, and Literature has also picked up. Peace, however, has by far the largest proportion of women in the last decade. Let us have a look at the first ever female Nobel Prize winner.

In [None]:
nobel[nobel['Sex'] == "Female"].nsmallest(1, 'Year')
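Because the lines in the gender plot overlap, the same per-decade, per-category shares can also be read as a table. The sketch below uses invented rows and assumes that laureates without a recorded sex (for example organizations) appear as missing values. A plain `== 'Female'` comparison counts such rows as non-female, which lowers the share, so the sketch drops them first.

```python
import pandas as pd

# Invented rows for illustration; the last row mimics an organization.
toy = pd.DataFrame({
    "Decade": [1900, 1900, 1900, 2000, 2000, 2000, 2000],
    "Category": ["Physics", "Physics", "Peace", "Physics",
                 "Peace", "Peace", "Peace"],
    "Sex": ["Male", "Female", "Male", "Male", "Female", "Male", None],
})

# Naive share for Peace in the 2000s: the missing row counts as "not female".
peace_2000 = toy[(toy["Decade"] == 2000) & (toy["Category"] == "Peace")]
naive_share = (peace_2000["Sex"] == "Female").mean()

# Share among individuals only: drop rows with no recorded sex first.
people = toy.dropna(subset=["Sex"]).copy()
people["Female"] = (people["Sex"] == "Female").astype(float)
table = people.pivot_table(index="Decade", columns="Category",
                           values="Female", aggfunc="mean")

print(naive_share)
print(table)
```

Which denominator is appropriate depends on the question being asked; the point is only that the two can differ noticeably in categories with many organizational winners.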

# 4. Ages of the Laureates

Let us now look at the last of the major factors: age. First, I will plot the ages of all laureates combined and see what we can glean from that. Then we will move on to the ages by category.

In [None]:
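# Missing or unparseable birth dates become NaT instead of raising an error
nobel['Birth Date'] = pd.to_datetime(nobel['Birth Date'], errors = 'coerce')
# Approximate age in the award year (ignores month and day)
nobel['Age'] = nobel['Year'] - nobel['Birth Date'].dt.year

# lmplot returns a FacetGrid; lowess = True requires statsmodels to be installed
g = sns.lmplot(x = 'Year', y = 'Age', data = nobel, lowess = True, aspect = 2, line_kws = {'color' : 'black'})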

The plot above is very informative. The smoothed line suggests that in 1901 a typical laureate was about 55 years old, whereas now the figure is over 65. We also see a larger spread in ages today, and many more laureates (most likely because more prizes are now shared, and because of the introduction of the Economics prize). Finally, notice the disruption in the awards during the Second World War.
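The age used in these plots is simply the award year minus the birth year. A small sketch on invented rows shows how `errors='coerce'` behaves and why some rows end up without an age:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Year": [1905, 1950, 1963],
    "Birth Date": ["1850-06-15", "not a date", None],
})

# Values that cannot be parsed, and missing values, both become NaT.
toy["Birth Date"] = pd.to_datetime(toy["Birth Date"], errors="coerce")

# Year difference only; NaT propagates to a missing age.
toy["Age"] = toy["Year"] - toy["Birth Date"].dt.year

print(toy)
```

Because month and day are ignored, the computed age can be one year higher than the laureate's actual age on the day of the award. Rows with a missing age are not drawn in the scatter plots.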

Lastly, let us look at the ages of the winners by category.

In [None]:
# One panel per category; lmplot returns a FacetGrid
g = sns.lmplot(x = 'Year', y = 'Age', data = nobel, row = 'Category', lowess = True, aspect = 2, line_kws = {'color' : 'black'})

We see that Chemistry, Literature, Medicine and Physics all follow the same pattern: the winners have gotten older over time. The trend is strongest in Physics, where laureates used to be younger than 50 and are now over 70 on average. Literature is the most stable. Economics is a newer category, and Peace has followed the opposite pattern, with winners getting younger over time, including one exceptionally young recent winner:

In [None]:
print(nobel.nsmallest(1, 'Age')['Full Name'])

There is a lot more to be learned from this very rich dataset of Nobel laureates. This notebook serves only as a short primer on the major trends of this prestigious award.