<a href="https://colab.research.google.com/github/yakubszatkowski/100_days_python/blob/master/push/Nobel_Prize_Analysis_(start).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup and Context

### Introduction

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Alfred Nobel dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

<img src=https://i.imgur.com/36pCx5Q.jpg>

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?

### Upgrade plotly (only Google Colab Notebook)

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, run the cell below and then restart the runtime.

In [332]:
%pip install --upgrade plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [333]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Import Statements

In [334]:
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

### Notebook Presentation

In [335]:
pd.options.display.float_format = '{:,.2f}'.format

### Read the Data

In [336]:
df_data = pd.read_csv('drive/MyDrive/Colab Notebooks/Day 78 - Analyzing Nobel Prize with Plotly, Matplotlib and Seaborn/nobel_prize_data.csv')

Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.


# Data Exploration & Cleaning

**Challenge**: Preliminary data exploration.
* What is the shape of `df_data`? How many rows and columns?
* What are the column names?
* In which year was the Nobel prize first awarded?
* Which year is the latest year included in the dataset?

In [337]:
df_data.shape

(962, 16)

In [338]:
df_data.columns

Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')

In [339]:
df_data.year.min()

1901

In [340]:
df_data.year.max()

2020

**Challenge**:
* Are there any duplicate values in the dataset?
* Are there NaN values in the dataset?
* Which columns tend to have NaN values?
* How many NaN values are there per column?
* Why do these columns have NaN values?

### Check for Duplicates and NaN Values

In [341]:
df_data.duplicated().values.any()

False

In [342]:
df_data.isna().values.any()

True

In [343]:
df_data.isna().sum()

year                       0
category                   0
prize                      0
motivation                88
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_city                31
birth_country             28
birth_country_current     28
sex                       28
organization_name        255
organization_city        255
organization_country     254
ISO                       28
dtype: int64

In [344]:
# filtering the rows where birth date is NaN and showing full name and organization name: these laureates are organizations, so the organization name is already in the full name
df_data[df_data.birth_date.isna()][['full_name', 'organization_name']].head()

Unnamed: 0,full_name,organization_name
24,Institut de droit international (Institute of ...,
60,Bureau international permanent de la Paix (Per...,
89,Comité international de la Croix Rouge (Intern...,
200,Office international Nansen pour les Réfugiés ...,
215,Comité international de la Croix Rouge (Intern...,


In [345]:
# other rows lack an organization name because the laureates were individuals not associated with any organization, especially in the Literature and Peace categories
df_data[df_data.organization_name.isna()][['category', 'laureate_type', 'full_name', 'organization_name']]

Unnamed: 0,category,laureate_type,full_name,organization_name
1,Literature,Individual,Sully Prudhomme,
3,Peace,Individual,Frédéric Passy,
4,Peace,Individual,Jean Henry Dunant,
7,Literature,Individual,Christian Matthias Theodor Mommsen,
9,Peace,Individual,Charles Albert Gobat,
...,...,...,...,...
932,Peace,Individual,Nadia Murad,
942,Literature,Individual,Peter Handke,
946,Peace,Individual,Abiy Ahmed Ali,
954,Literature,Individual,Louise Glück,


### Type Conversions

**Challenge**:
* Convert the `birth_date` column to Pandas `Datetime` objects
* Add a column called `share_pct` which has the laureates' share as a percentage in the form of a floating-point number.

#### Convert Year and Birth Date to Datetime

In [346]:
df_data.birth_date = pd.to_datetime(df_data.birth_date)
df_data.birth_date

0     1852-08-30
1     1839-03-16
2     1854-03-15
3     1822-05-20
4     1828-05-08
         ...    
957   1949-07-02
958          NaT
959   1965-06-16
960   1952-03-24
961   1931-08-08
Name: birth_date, Length: 962, dtype: datetime64[ns]

#### Add a Column with the Prize Share as a Percentage

In [347]:
df_data.prize_share.value_counts()

1/1    352
1/2    321
1/3    219
1/4     70
Name: prize_share, dtype: int64

In [348]:
# eval() evaluates an expression given as a string; acceptable here only because the column holds trusted fractions such as '1/2'
for equation in df_data.prize_share:
  print(f'{round(eval(equation)*100,1)}%')

100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
25.0%
25.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
50.0%
50.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
50.0%
50.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
50.0%
50.0%
100.0%
100.0%
100.0%
100.0%
50.0%
50.0%
100.0%
50.0%
50.0%
100.0%
1

In [349]:
# My solution
# used a list comprehension that evaluates the fraction string in each row, converted the result into a Series, and inserted it after the 5th column
prize_pct = pd.Series([round(eval(equation)*100, 2) for equation in df_data.prize_share])
try:  # df_data.insert() raises ValueError if the column already exists, which happens when this cell is re-run
  df_data.insert(5, 'prize_pct', prize_pct)
except ValueError:
  pass
df_data.head()

Unnamed: 0,year,category,prize,motivation,prize_share,prize_pct,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
0,1901,Chemistry,The Nobel Prize in Chemistry 1901,"""in recognition of the extraordinary services ...",1/1,100.0,Individual,Jacobus Henricus van 't Hoff,1852-08-30,Rotterdam,Netherlands,Netherlands,Male,Berlin University,Berlin,Germany,NLD
1,1901,Literature,The Nobel Prize in Literature 1901,"""in special recognition of his poetic composit...",1/1,100.0,Individual,Sully Prudhomme,1839-03-16,Paris,France,France,Male,,,,FRA
2,1901,Medicine,The Nobel Prize in Physiology or Medicine 1901,"""for his work on serum therapy, especially its...",1/1,100.0,Individual,Emil Adolf von Behring,1854-03-15,Hansdorf (Lawice),Prussia (Poland),Poland,Male,Marburg University,Marburg,Germany,POL
3,1901,Peace,The Nobel Peace Prize 1901,,1/2,50.0,Individual,Frédéric Passy,1822-05-20,Paris,France,France,Male,,,,FRA
4,1901,Peace,The Nobel Peace Prize 1901,,1/2,50.0,Individual,Jean Henry Dunant,1828-05-08,Geneva,Switzerland,Switzerland,Male,,,,CHE


In [350]:
# Angela's solution
separated_values = df_data.prize_share.str.split('/', expand=True)
other_prize_pct = pd.to_numeric(separated_values[0])/pd.to_numeric(separated_values[1])*100
other_prize_pct

0     100.00
1     100.00
2     100.00
3      50.00
4      50.00
       ...  
957    33.33
958   100.00
959    25.00
960    25.00
961    50.00
Length: 962, dtype: float64
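Both solutions above give the same percentages. As a third option that avoids calling `eval()` on raw strings, the standard-library `fractions.Fraction` can parse values such as `'1/3'` directly. This is only a sketch on a small hand-made sample (the `sample` Series is hypothetical, not the real dataset), and the name `share_pct` is taken from the challenge text:

```python
from fractions import Fraction

import pandas as pd

# Hypothetical sample of the prize_share column (not the real dataset)
sample = pd.Series(['1/1', '1/2', '1/3', '1/4'], name='prize_share')

# Fraction('1/3') parses the string directly, so no code is evaluated
share_pct = sample.map(lambda s: float(Fraction(s)) * 100).rename('share_pct')
print(share_pct.round(2))
```

On the real data the same `map` call could be applied to `df_data.prize_share`.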

# Three Questions I Want to Ask the Data

1st - Which countries received the most Nobel prizes (counted by `organization_country`)? Visualise the top 10 as a pie chart in plotly.

In [351]:
nobel_prizes_by_country = df_data.organization_country.value_counts().head(10)  # choosing first 10
nobel_prizes_by_country

United States of America    368
United Kingdom               93
Germany                      67
France                       38
Switzerland                  24
Japan                        18
Sweden                       17
Russia                       12
Netherlands                  11
Canada                        9
Name: organization_country, dtype: int64

In [352]:
prize_by_country_pie = px.pie(nobel_prizes_by_country, values=nobel_prizes_by_country.values, names=nobel_prizes_by_country.index, width=800, height=500)
prize_by_country_pie.update_traces(textinfo='value')  # this shows us the actual value instead of percentage
prize_by_country_pie.show()

2nd - How many Nobel prizes have been awarded in each category throughout history?

In [353]:
prizes_by_category = df_data.category.value_counts()
prizes_by_category

Medicine      222
Physics       216
Chemistry     186
Peace         135
Literature    117
Economics      86
Name: category, dtype: int64

In [354]:
len(prizes_by_category)

6

In [355]:
prizes_by_category_pie = px.pie(prizes_by_category, values=prizes_by_category.values, names=prizes_by_category.index, width=800, height=500)
prizes_by_category_pie.update_traces(textinfo='value')
prizes_by_category_pie.show()

3rd - How are the Nobel prizes in each category split by gender?

In [356]:
prizes_male = df_data[df_data.sex == 'Male']
prizes_female = df_data[df_data.sex == 'Female']
print(f'Amount of nobel prize winners by gender:\nMen: {len(prizes_male)}\nWomen: {len(prizes_female)}')

Amount of nobel prize winners by gender:
Men: 876
Women: 58


In [357]:
print('Categories won throughout the history by men.')
prizes_male.category.value_counts()

Categories won throughout the history by men.


Physics       212
Medicine      210
Chemistry     179
Literature    101
Peace          90
Economics      84
Name: category, dtype: int64

In [358]:
print('Categories won throughout the history by women.')
prizes_female.category.value_counts()

Categories won throughout the history by women.


Peace         17
Literature    16
Medicine      12
Chemistry      7
Physics        4
Economics      2
Name: category, dtype: int64

# Plotly Donut Chart: Percentage of Male vs. Female Laureates

**Challenge**: Create a [donut chart using plotly](https://plotly.com/python/pie-charts/) which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [359]:
prizes_by_gender = df_data.sex.value_counts()
prizes_by_gender

Male      876
Female     58
Name: sex, dtype: int64

In [360]:
prizes_by_gender_pie = px.pie(prizes_by_gender, names=prizes_by_gender.index, values=prizes_by_gender.values, width=800, height=500, hole=0.5)
prizes_by_gender_pie.update_traces(textinfo='value + percent')
prizes_by_gender_pie.show()

# Who were the first 3 Women to Win the Nobel Prize?

**Challenge**:
* What are the names of the first 3 female Nobel laureates?
* What did they win the prize for?
* What do you see in their `birth_country`? Were they part of an organisation?

In [361]:
prizes_female.sort_values('year').head(3)[['full_name', 'prize', 'birth_country', 'organization_name']]

Unnamed: 0,full_name,prize,birth_country,organization_name
18,"Marie Curie, née Sklodowska",The Nobel Prize in Physics 1903,Russian Empire (Poland),
29,"Baroness Bertha Sophie Felicita von Suttner, n...",The Nobel Peace Prize 1905,Austrian Empire (Czech Republic),
51,Selma Ottilia Lovisa Lagerlöf,The Nobel Prize in Literature 1909,Sweden,


# Find the Repeat Winners

**Challenge**: Did some people get a Nobel Prize more than once? If so, who were they?

In [362]:
df_data.full_name.duplicated().values.any()

True

In [363]:
df_data[df_data.full_name.duplicated(keep=False)].full_name.value_counts()

Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Marie Curie, née Sklodowska                                                          2
Linus Carl Pauling                                                                   2
Office of the United Nations High Commissioner for Refugees (UNHCR)                  2
John Bardeen                                                                         2
Frederick Sanger                                                                     2
Name: full_name, dtype: int64

In [364]:
name_count = df_data.full_name.value_counts()
name_count[name_count>1]

Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Frederick Sanger                                                                     2
Linus Carl Pauling                                                                   2
John Bardeen                                                                         2
Office of the United Nations High Commissioner for Refugees (UNHCR)                  2
Marie Curie, née Sklodowska                                                          2
Name: full_name, dtype: int64

# Number of Prizes per Category

**Challenge**:
* In how many categories are prizes awarded?
* Create a plotly bar chart with the number of prizes awarded by category.
* Use the color scale called `Aggrnyl` to colour the chart, but don't show a color axis.
* Which category has the most prizes awarded?
* Which category has the fewest number of prizes awarded?

In [365]:
# Already done this more or less earlier
prizes_by_category

Medicine      222
Physics       216
Chemistry     186
Peace         135
Literature    117
Economics      86
Name: category, dtype: int64

In [366]:
len(prizes_by_category)

6

In [367]:
# same data as before, now as a bar chart
prizes_by_category_bar = px.bar(prizes_by_category, color=prizes_by_category.values, width=800, height=500, color_continuous_scale='Aggrnyl')
prizes_by_category_bar.update_layout(title='', xaxis_title='Nobel Prize category', yaxis_title='Number of prizes', coloraxis_showscale=False)
prizes_by_category_bar.show()

**Challenge**:
* When was the first prize in the field of Economics awarded?
* Who did the prize go to?

In [398]:
df_data_economics = df_data[df_data.category == 'Economics']
first_prize_year = df_data_economics.year.min()
first_prize_year

1969

In [396]:
df_data_economics[df_data_economics.year == first_prize_year].min().full_name

'Jan Tinbergen'
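Note that `.min()` on a DataFrame takes the smallest value of each column independently, so the cell above can only ever return one name even when the prize was shared. A small sketch of listing every laureate for the first year instead, using a few illustrative rows rather than `df_data` (the `econ` frame is an assumption made for this example):

```python
import pandas as pd

# Illustrative rows only; the real notebook would filter df_data
econ = pd.DataFrame({
    'year': [1969, 1969, 1970],
    'category': ['Economics'] * 3,
    'full_name': ['Ragnar Frisch', 'Jan Tinbergen', 'Paul A. Samuelson'],
})

first_year = econ.year.min()
# Select the name column for all rows of that year instead of reducing with .min()
first_winners = econ.loc[econ.year == first_year, 'full_name'].tolist()
print(first_year, first_winners)
```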

# Male and Female Winners by Category

**Challenge**: Create a [plotly bar chart](https://plotly.com/python/bar-charts/) that shows the split between men and women by category.
* Hover over the bar chart. How many prizes went to women in Literature compared to Physics?

<img src=https://i.imgur.com/od8TfOp.png width=650>

In [425]:
df_category_sex = df_data.groupby(['category', 'sex'], as_index=False).agg({'prize': pd.Series.count})  # counting 'prize' is an arbitrary choice: any column without NaN values gives the same row counts
df_category_sex

Unnamed: 0,category,sex,prize
0,Chemistry,Female,7
1,Chemistry,Male,179
2,Economics,Female,2
3,Economics,Male,84
4,Literature,Female,16
5,Literature,Male,101
6,Medicine,Female,12
7,Medicine,Male,210
8,Peace,Female,17
9,Peace,Male,90
10,Physics,Female,4
11,Physics,Male,212


In [427]:
df_category_female = df_category_sex[df_category_sex.sex == 'Female']
df_category_male = df_category_sex[df_category_sex.sex == 'Male']

df_category_male

Unnamed: 0,category,sex,prize
1,Chemistry,Male,179
3,Economics,Male,84
5,Literature,Male,101
7,Medicine,Male,210
9,Peace,Male,90
11,Physics,Male,212
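The challenge asks for a single bar chart split by sex, which the filtered frames above do not yet produce. One possible sketch passes the grouped frame to `px.bar` with `color='sex'`. The counts are copied from the `value_counts()` outputs earlier in this notebook; the function name `category_sex_bar` and the axis titles are assumptions made for this example:

```python
import pandas as pd

# Counts copied from the value_counts() outputs earlier in this notebook
df_category_sex = pd.DataFrame({
    'category': ['Chemistry', 'Chemistry', 'Economics', 'Economics',
                 'Literature', 'Literature', 'Medicine', 'Medicine',
                 'Peace', 'Peace', 'Physics', 'Physics'],
    'sex': ['Female', 'Male'] * 6,
    'prize': [7, 179, 2, 84, 16, 101, 12, 210, 17, 90, 4, 212],
})


def category_sex_bar(df):
    """Return a stacked bar chart: one bar per category, coloured by sex."""
    import plotly.express as px  # local import so the data prep runs without plotly
    fig = px.bar(df, x='category', y='prize', color='sex')
    fig.update_layout(xaxis_title='Nobel Prize category', yaxis_title='Number of prizes')
    return fig


# The numbers the hover text would show for women
women = df_category_sex[df_category_sex.sex == 'Female'].set_index('category').prize
print(women['Literature'], women['Physics'])
```

In a notebook cell, `category_sex_bar(df_category_sex).show()` would render the chart.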


# Number of Prizes Awarded Over Time

**Challenge**: Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.
* Count the number of prizes awarded every year.
* Create a 5 year rolling average of the number of prizes (Hint: see previous lessons analysing Google Trends).
* Using Matplotlib superimpose the rolling average on a scatter plot.
* Show a tick mark on the x-axis for every 5 years from 1900 to 2020. (Hint: you'll need to use NumPy).

<img src=https://i.imgur.com/4jqYuWC.png width=650>

* Use the [named colours](https://matplotlib.org/3.1.0/gallery/color/named_colors.html) to draw the data points in `dodgerblue` while the rolling average is coloured in `crimson`.

<img src=https://i.imgur.com/u3RlcJn.png width=350>

* Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out?
* What could be the reason for the trend in the chart?
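The steps in this challenge could be sketched roughly as follows. The `demo` frame is synthetic random data standing in for `df_data`, so the resulting chart says nothing about the real trend; the function name `plot_prizes_per_year` is an assumption. Counting rows per year counts laureates, which is how the notebook's dataset is laid out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_data: one row per laureate
rng = np.random.default_rng(0)
demo = pd.DataFrame({'year': rng.integers(1901, 2021, size=900)})
demo['prize'] = 'demo prize'

prize_per_year = demo.groupby('year').count().prize        # rows per year
moving_average = prize_per_year.rolling(window=5).mean()   # first 4 values are NaN
ticks = np.arange(1900, 2021, step=5)                      # 1900, 1905, ..., 2020


def plot_prizes_per_year(per_year, rolling, xticks):
    """Scatter of yearly counts with the rolling average drawn on top."""
    import matplotlib.pyplot as plt  # local import keeps the data prep usable on its own
    fig, ax = plt.subplots(figsize=(16, 8))
    ax.set_xticks(xticks)
    ax.tick_params(axis='x', labelrotation=45)
    ax.set_xlim(1900, 2020)
    ax.scatter(per_year.index, per_year.values, c='dodgerblue', alpha=0.7, s=100)
    ax.plot(per_year.index, rolling.values, c='crimson', linewidth=3)
    return fig
```

Calling `plot_prizes_per_year(prize_per_year, moving_average, ticks)` in a notebook draws the figure.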


# Are More Prizes Shared Than Before?

**Challenge**: Investigate if more prizes are shared than before.

* Calculate the average prize share of the winners on a year by year basis.
* Calculate the 5 year rolling average of the percentage share.
* Copy-paste the cell from the chart you created above.
* Modify the code to add a secondary axis to your Matplotlib chart.
* Plot the rolling average of the prize share on this chart.
* See if you can invert the secondary y-axis to make the relationship even more clear.
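A possible outline for this challenge, again on synthetic data: the yearly mean of the share percentage is smoothed with a 5-year rolling window and drawn on a secondary axis created with `twinx()`, which is then inverted. The `demo` frame, the `share_pct` column name, and the function name `plot_share_trend` are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_data with the four share values seen in the dataset
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'year': rng.integers(1901, 2021, size=900),
    'share_pct': rng.choice([100.0, 50.0, 100 / 3, 25.0], size=900),
})

prize_per_year = demo.groupby('year').size()
count_rolling = prize_per_year.rolling(window=5).mean()
yearly_avg_share = demo.groupby('year')['share_pct'].mean()
share_rolling = yearly_avg_share.rolling(window=5).mean()


def plot_share_trend(per_year, count_roll, share_roll):
    """Prize counts on the left axis, rolling average share on an inverted right axis."""
    import matplotlib.pyplot as plt  # local import keeps the data prep usable on its own
    fig, ax1 = plt.subplots(figsize=(16, 8))
    ax2 = ax1.twinx()       # second y-axis sharing the same x-axis
    ax2.invert_yaxis()      # a falling share now points upwards, like the prize count
    ax1.scatter(per_year.index, per_year.values, c='dodgerblue', alpha=0.7, s=100)
    ax1.plot(per_year.index, count_roll.values, c='crimson', linewidth=3)
    ax2.plot(share_roll.index, share_roll.values, c='grey', linewidth=3)
    return fig
```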

# The Countries with the Most Nobel Prizes

**Challenge**:
* Create a Pandas DataFrame called `top20_countries` that has two columns. The `prize` column should contain the total number of prizes won.

<img src=https://i.imgur.com/6HM8rfB.png width=350>

* Is it best to use `birth_country`, `birth_country_current` or `organization_country`?
* What are some potential problems when using `birth_country` or any of the others? Which column is the least problematic?
* Then use plotly to create a horizontal bar chart showing the number of prizes won by each country. Here's what you're after:

<img src=https://i.imgur.com/agcJdRS.png width=750>

* What is the ranking for the top 20 countries in terms of the number of prizes?
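One way to build `top20_countries` is to group by country, count the rows, and keep the 20 largest. The sketch below assumes `birth_country_current` as the country column (one of the three candidates named above) and uses a tiny illustrative sample in place of `df_data`:

```python
import pandas as pd

# Tiny illustrative sample; the real notebook would use df_data
demo = pd.DataFrame({
    'birth_country_current': ['France', 'France', 'Poland',
                              'Germany', 'Germany', 'Germany'],
    'prize': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
})

top_countries = (demo.groupby('birth_country_current', as_index=False)
                     .agg({'prize': 'count'})
                     .sort_values('prize'))
# Ascending order puts the largest bar at the top of a horizontal bar chart
top20_countries = top_countries.tail(20)
print(top20_countries)
```

The result could then be passed to `px.bar(top20_countries, x='prize', y='birth_country_current', orientation='h')`.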

# Use a Choropleth Map to Show the Number of Prizes Won by Country

* Create this choropleth map using [the plotly documentation](https://plotly.com/python/choropleth-maps/):

<img src=https://i.imgur.com/s4lqYZH.png>

* Experiment with [plotly's available colours](https://plotly.com/python/builtin-colorscales/). I quite like the sequential colour `matter` on this map.

Hint: You'll need to use a 3 letter country code for each country.
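Following the hint, the `ISO` column can be kept in the grouping so that each country row carries its 3-letter code. A rough sketch with a small illustrative sample (the `demo` frame and the function name `prize_choropleth` are assumptions):

```python
import pandas as pd

# Small illustrative sample; the real notebook would use df_data
demo = pd.DataFrame({
    'birth_country_current': ['United States of America', 'United States of America',
                              'Poland', 'Germany'],
    'ISO': ['USA', 'USA', 'POL', 'DEU'],
    'prize': ['p1', 'p2', 'p3', 'p4'],
})

# Group on both columns so the ISO code survives the aggregation
df_countries = (demo.groupby(['birth_country_current', 'ISO'], as_index=False)
                    .agg({'prize': 'count'}))


def prize_choropleth(df):
    """World map shaded by prize count; locations are ISO 3-letter codes."""
    import plotly.express as px  # local import so the data prep runs without plotly
    fig = px.choropleth(df, locations='ISO', color='prize',
                        hover_name='birth_country_current',
                        color_continuous_scale=px.colors.sequential.matter)
    fig.update_layout(coloraxis_showscale=True)
    return fig
```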


# In Which Categories are the Different Countries Winning Prizes?

**Challenge**: See if you can divide up the plotly bar chart you created above to show which categories made up the total number of prizes. Here's what you're aiming for:

<img src=https://i.imgur.com/iGaIKCL.png>

* In which category are Germany and Japan the weakest compared to the United States?
* In which category does Germany have more prizes than the UK?
* In which categories does France have more prizes than Germany?
* Which category makes up most of Australia's Nobel prizes?
* Which category makes up half of the prizes in the Netherlands?
* Does the United States have more prizes in Economics than all of France? What about in Physics or Medicine?


The hard part is preparing the data for this chart!


*Hint*: Take a two-step approach. The first step is grouping the data by country and category. Then you can create a DataFrame that looks something like this:

<img src=https://i.imgur.com/VKjzKa1.png width=450>
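The two-step approach from the hint might look like this: count prizes per country and category, then merge in each country's total so the bars can be ordered by it. The sample data and the column names `cat_prize` and `total_prize` are assumptions made for this sketch.

```python
import pandas as pd

# Small illustrative sample; the real notebook would use df_data
demo = pd.DataFrame({
    'birth_country_current': ['United States of America'] * 3 + ['Germany'] * 2 + ['Japan'],
    'category': ['Physics', 'Physics', 'Medicine', 'Chemistry', 'Physics', 'Physics'],
    'prize': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
})

# Step 1: prizes per country and category
cat_country = (demo.groupby(['birth_country_current', 'category'], as_index=False)
                   .agg({'prize': 'count'}))
# Step 2: total prizes per country, merged back onto the per-category rows
totals = demo.groupby('birth_country_current', as_index=False).agg({'prize': 'count'})
merged = pd.merge(cat_country, totals, on='birth_country_current')
merged.columns = ['birth_country_current', 'category', 'cat_prize', 'total_prize']
merged = merged.sort_values('total_prize')
print(merged)
```

A chart call along the lines of `px.bar(merged, x='cat_prize', y='birth_country_current', color='category', orientation='h')` would then stack the categories within each country's bar.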


### Number of Prizes Won by Each Country Over Time

* When did the United States eclipse every other country in terms of the number of prizes won?
* Which country or countries were leading previously?
* Calculate the cumulative number of prizes won by each country in every year. Again, use the `birth_country_current` of the winner to calculate this.
* Create a [plotly line chart](https://plotly.com/python/line-charts/) where each country is a coloured line.
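The cumulative count can be obtained by counting prizes per country and year and then applying a per-country `cumsum()`. A minimal sketch on illustrative data (the `demo` frame and the `cumulative` column name are assumptions):

```python
import pandas as pd

# Small illustrative sample; the real notebook would use df_data
demo = pd.DataFrame({
    'birth_country_current': ['France', 'France', 'France',
                              'Germany', 'Germany', 'Germany'],
    'year': [1901, 1901, 1903, 1901, 1902, 1902],
    'prize': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
})

# groupby sorts by its keys, so the rows come out ordered by country, then year
yearly = (demo.groupby(['birth_country_current', 'year'])['prize']
              .count()
              .reset_index())
# Running total within each country
yearly['cumulative'] = yearly.groupby('birth_country_current')['prize'].cumsum()
print(yearly)
```

With this frame, `px.line(yearly, x='year', y='cumulative', color='birth_country_current')` would draw one line per country. Years in which a country won nothing have no row, so the line simply connects the years that do.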

# What are the Top Research Organisations?

**Challenge**: Create a bar chart showing the organisations affiliated with the Nobel laureates. It should look something like this:

<img src=https://i.imgur.com/zZihj2p.png width=600>

* Which organisations make up the top 20?
* How many Nobel prize winners are affiliated with the University of Chicago and Harvard University?
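A short sketch of the counting step, assuming `value_counts()` on `organization_name` is an acceptable way to rank organisations. It skips NaN values by default, which matters here because many laureates have no organisation. The sample Series is illustrative only:

```python
import pandas as pd

# Illustrative sample; None marks laureates without an organisation
demo = pd.Series(['Harvard University', None, 'University of Chicago',
                  'Harvard University', None], name='organization_name')

# value_counts() drops NaN and sorts descending; keep the first 20,
# then flip to ascending so the biggest bar ends up on top of a horizontal chart
top20_orgs = demo.value_counts()[:20].sort_values(ascending=True)
print(top20_orgs)
```

The same pattern should carry over to `organization_city` and `birth_city` in the city challenges.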

# Which Cities Make the Most Discoveries?

Where do major discoveries take place?

**Challenge**:
* Create another plotly bar chart graphing the top 20 organisation cities of the research institutions associated with a Nobel laureate.
* Where is the number one hotspot for discoveries in the world?
* Which city in Europe has had the most discoveries?

# Where are Nobel Laureates Born? Chart the Laureate Birth Cities

**Challenge**:
* Create a plotly bar chart graphing the top 20 birth cities of Nobel laureates.
* Use a named colour scale called `Plasma` for the chart.
* What percentage of the United States prizes came from Nobel laureates born in New York?
* How many Nobel laureates were born in London, Paris and Vienna?
* Out of the top 5 cities, how many are in the United States?


# Plotly Sunburst Chart: Combine Country, City, and Organisation

**Challenge**:

* Create a DataFrame that groups the number of prizes by organisation.
* Then use the [plotly documentation to create a sunburst chart](https://plotly.com/python/sunburst-charts/)
* Click around in your chart. What do you notice about Germany and France?


Here's what you're aiming for:

<img src=https://i.imgur.com/cemX4m5.png width=300>
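For the sunburst, the grouping needs all three levels (country, city, organisation) so that plotly can nest them. A hedged sketch using a few rows modelled on the `df_data.head()` output above; the function name `org_sunburst` is an assumption:

```python
import pandas as pd

# Illustrative rows; None marks a laureate without an organisation
demo = pd.DataFrame({
    'organization_country': ['Germany', 'Germany', 'Germany', None],
    'organization_city': ['Berlin', 'Berlin', 'Marburg', None],
    'organization_name': ['Berlin University', 'Berlin University',
                          'Marburg University', None],
    'prize': ['p1', 'p2', 'p3', 'p4'],
})

# Rows with NaN in any grouping column are dropped by groupby's default settings
country_city_org = (demo.groupby(['organization_country', 'organization_city',
                                  'organization_name'], as_index=False)
                        .agg({'prize': 'count'})
                        .sort_values('prize', ascending=False))


def org_sunburst(df):
    """Sunburst with country in the centre, then city, then organisation."""
    import plotly.express as px  # local import so the data prep runs without plotly
    return px.sunburst(df,
                       path=['organization_country', 'organization_city', 'organization_name'],
                       values='prize')
```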



# Patterns in the Laureate Age at the Time of the Award

How Old Are the Laureates When They Win the Prize?

**Challenge**: Calculate the age of the laureate in the year of the ceremony and add this as a column called `winning_age` to the `df_data` DataFrame. Hint: you can use [this](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) to help you.
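Since `birth_date` is already a datetime column, the `.dt.year` accessor gives the birth year, and subtracting it from `year` yields the age in the award year. A minimal sketch using the first two rows shown by `df_data.head()` above:

```python
import pandas as pd

# First two rows from the df_data.head() output earlier in this notebook
demo = pd.DataFrame({
    'year': [1901, 1901],
    'birth_date': pd.to_datetime(['1852-08-30', '1839-03-16']),
})

# Age in the award year; rows with a missing birth_date (NaT) would give NaN
demo['winning_age'] = demo.year - demo.birth_date.dt.year
print(demo)
```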



### Who were the oldest and youngest winners?

**Challenge**:
* What are the names of the youngest and oldest Nobel laureates?
* What did they win the prize for?
* What is the average age of a winner?
* 75% of laureates are younger than what age when they receive the prize?
* Use Seaborn to [create a histogram](https://seaborn.pydata.org/generated/seaborn.histplot.html) to visualise the distribution of laureate age at the time of winning. Experiment with the number of `bins` to see how the visualisation changes.

### Descriptive Statistics for the Laureate Age at Time of Award

* Calculate the descriptive statistics for the age at the time of the award.
* Then visualise the distribution in the form of a histogram using [Seaborn's .histplot() function](https://seaborn.pydata.org/generated/seaborn.histplot.html).
* Experiment with the `bin` size. Try 10, 20, 30, and 50.
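The descriptive statistics come straight from `describe()`, whose `75%` entry answers the quartile question above. The sketch below uses synthetic ages, so the printed numbers are not the real answers; the function name `plot_age_hist` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic ages standing in for df_data.winning_age
rng = np.random.default_rng(7)
demo = pd.DataFrame({'winning_age': rng.normal(60, 12, size=500).round()})

stats = demo.winning_age.describe()   # count, mean, std, min, quartiles, max
print(stats)


def plot_age_hist(df, bins=30):
    """Histogram of the age at the time of the award."""
    import seaborn as sns  # local import so the statistics run without seaborn
    return sns.histplot(data=df, x='winning_age', bins=bins)
```

Calling `plot_age_hist(demo, bins=10)` and then repeating with 20, 30, and 50 shows how the bin count changes the shape.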

### Age at Time of Award throughout History

Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?

**Challenge**

* Use Seaborn to [create a .regplot](https://seaborn.pydata.org/generated/seaborn.regplot.html?highlight=regplot#seaborn.regplot) with a trendline.
* Set the `lowess` parameter to `True` to show a moving average of the linear fit.
* According to the best fit line, how old were Nobel laureates in the years 1900-1940 when they were awarded the prize?
* According to the best fit line, what age would it predict for a Nobel laureate in 2020?
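`sns.regplot(data=df_data, x='year', y='winning_age', lowess=True)` draws the trend (the `lowess` option needs the `statsmodels` package), but it does not return numbers. To read a prediction off a straight best-fit line, one alternative is `np.polyfit`, which is a swapped-in technique rather than what the challenge prescribes. The data below is synthetic, so the result is not the real answer:

```python
import numpy as np

# Synthetic data: age rises by an assumed 0.15 years per calendar year, plus noise
rng = np.random.default_rng(42)
years = rng.integers(1901, 2021, size=300)
ages = 45 + 0.15 * (years - 1900) + rng.normal(0, 5, size=300)

# Degree-1 polynomial fit returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(years, ages, deg=1)
predicted_2020 = slope * 2020 + intercept
print(round(slope, 3), round(predicted_2020, 1))
```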


### Winning Age Across the Nobel Prize Categories

How does the age of laureates vary by category?

* Use Seaborn's [`.boxplot()`](https://seaborn.pydata.org/generated/seaborn.boxplot.html?highlight=boxplot#seaborn.boxplot) to show how the median, quartiles, max, and minimum values vary across categories. Which category has the longest "whiskers"?
* In which prize category are the average winners the oldest?
* In which prize category are the average winners the youngest?

**Challenge**
* Now use Seaborn's [`.lmplot()`](https://seaborn.pydata.org/generated/seaborn.lmplot.html?highlight=lmplot#seaborn.lmplot) and the `row` parameter to create 6 separate charts for each prize category. Again set `lowess` to `True`.
* What are the winning age trends in each category?
* Which category has the age trending up and which category has the age trending down?
* Is this `.lmplot()` telling a different story from the `.boxplot()`?
* Create another chart with Seaborn. This time use `.lmplot()` to put all 6 categories on the same chart using the `hue` parameter.
