## Nobel Prize Dataset
---

In [14]:
import pandas as pd
import plotly.express as px

In [4]:
df = pd.read_csv("nobel_prize_data.csv")

df.sample(5)

,year,category,prize,motivation,prize_share,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
845,2011,Medicine,The Nobel Prize in Physiology or Medicine 2011,"""for their discoveries concerning the activati...",1/4,Individual,Jules A. Hoffmann,1941-08-02,Echternach,Luxembourg,Luxembourg,Male,University of Strasbourg,Strasbourg,France,LUX
438,1973,Physics,The Nobel Prize in Physics 1973,"""for their experimental discoveries regarding ...",1/4,Individual,Leo Esaki,1925-03-12,Osaka,Japan,Japan,Male,IBM Thomas J. Watson Research Center,"Yorktown Heights, NY",United States of America,JPN
65,1911,Peace,The Nobel Peace Prize 1911,,1/2,Individual,Alfred Hermann Fried,1864-11-11,Vienna,Austria,Austria,Male,,,,AUT
297,1956,Physics,The Nobel Prize in Physics 1956,"""for their researches on semiconductors and th...",1/3,Individual,John Bardeen,1908-05-23,"Madison, WI",United States of America,United States of America,Male,University of Illinois,"Urbana, IL",United States of America,USA
466,1976,Medicine,The Nobel Prize in Physiology or Medicine 1976,"""for their discoveries concerning new mechanis...",1/2,Individual,Baruch S. Blumberg,1925-07-28,"New York, NY",United States of America,United States of America,Male,The Institute for Cancer Research,"Philadelphia, PA",United States of America,USA


### Initial checks
---

In [5]:
# Get an understanding of the dataset
df.shape                            # 962 rows by 16 columns
df.head()                           # 1901 is the earliest year
df.tail()                           # 2020 is the latest year
df.dtypes                           # year and prize_share need to be converted
df.isna().sum()                     # Missing values occur mostly in the later, more detailed columns,
                                    # not in the first few
df.duplicated(keep=False).sum()     # There are no duplicated rows
df.columns                          # Column names need adjusting; some columns could be removed


Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')
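
Raw missing-value counts are easier to compare across columns when expressed as a share of all rows. A minimal sketch, assuming a small hypothetical frame (`toy` and its values are illustrative, not the Nobel data):

```python
import pandas as pd

# Hypothetical toy frame standing in for the Nobel data
toy = pd.DataFrame({
    "year": [1901, 1903, 1911, 1956],
    "organization_name": [None, None, "Sorbonne University", "University of Illinois"],
})

# isna() gives booleans; their mean is the fraction of missing values per column
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

The same expression applied to `df` would rank the real columns by how incomplete they are.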

In [87]:
## Adjustments from the earlier analysis:

# Keep all columns, but rename them so they are easier to access
df.rename(columns={'year': "Year",
                   'category': "Category",
                   'prize': "Prize",
                   'motivation': "Motivation",
                   'prize_share': "Prize_Share",
                   'laureate_type': "Laureate_Type",
                   'full_name': "Full_Name",
                   'birth_date': "Birth_Date",
                   'birth_city': "Birth_City",
                   'birth_country': "Birth_Country",
                   'birth_country_current': "Birth_Country_Current",
                   'sex': "Sex",
                   'organization_name': "Organization_Name",
                   'organization_city': "Organization_City",
                   'organization_country': "Organization_Country",
                   'ISO': "ISO"}, inplace=True)

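# Convert the year to datetime. Without an explicit format, integer years
# would be read as nanoseconds since 1970 rather than as calendar years.
df["Year"] = pd.to_datetime(df["Year"].astype(str), format="%Y")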


# Convert the prize share from a string such as "1/4" to a decimal fraction
values = df["Prize_Share"].str.split("/", expand=True)     # Split into numerator and denominator columns
numerator = pd.to_numeric(values[0])
denominator = pd.to_numeric(values[1])
share_pct = numerator / denominator                         # A float Series, e.g. 0.25 for "1/4"
df["Share_Pct"] = share_pct                                 # Saved into a new column
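
How `pd.to_datetime` treats integer years with and without a format, and how the share split behaves, can be checked on a tiny example. This is a sketch on a hypothetical two-row frame (`toy` is an assumption; the share strings follow the pattern seen in the sample above):

```python
import pandas as pd

# Hypothetical two-row frame mimicking the relevant columns
toy = pd.DataFrame({"Year": [1911, 1956], "Prize_Share": ["1/2", "1/4"]})

# Bare integers are interpreted as nanoseconds since 1970-01-01
naive = pd.to_datetime(toy["Year"])

# Parsing the year as text with an explicit format gives 1 January of that year
parsed = pd.to_datetime(toy["Year"].astype(str), format="%Y")

# Split "numerator/denominator" into two integer columns and divide
parts = toy["Prize_Share"].str.split("/", expand=True).astype(int)
share = parts[0] / parts[1]

print(naive.dt.year.tolist(), parsed.dt.year.tolist(), share.tolist())
```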


### Visualisation
---


Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [69]:
men_and_women = df["Sex"].value_counts()
print(men_and_women)

fig = px.pie(names=men_and_women.index,
             values=men_and_women.values,
             title="Number of Nobel Laureates by Sex",
             hole=0.6)
fig.update_traces(textposition="outside", textinfo="percent+label")
fig.show()

Sex
Male      876
Female     58
Name: count, dtype: int64
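
The percentage question can be answered directly from these counts. A short sketch, with the counts copied from the output above and the total of 962 rows taken from the `df.shape` comment earlier; which denominator is the right one depends on whether prizes awarded to organisations (which have no recorded sex) should be included:

```python
# Counts copied from the value_counts() output above
male, female = 876, 58
total_rows = 962  # from df.shape; includes organisations with no recorded sex

# Share among laureates with a recorded sex
pct_of_individuals = female / (male + female) * 100
# Share among all rows in the dataset
pct_of_all_rows = female / total_rows * 100

print(f"{pct_of_individuals:.1f}% of individual laureates are women")
print(f"{pct_of_all_rows:.1f}% of all prize rows went to women")
```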


What are the names of the first 3 female Nobel laureates?

What did they win the prize for?

What do you see in their birth_country? Were they part of an organisation?



In [82]:
# The five earliest female laureates; the first three rows answer the question
first_female_laureates = df[df["Sex"] == "Female"].sort_values(by="Year", ascending=True)[:5]
first_female_laureates["Category"]           # Prize categories
first_female_laureates["Birth_Country"]      # Birth countries
first_female_laureates["Organization_Name"]  # Organisations (only this last expression is displayed)

18                     NaN
29                     NaN
51                     NaN
62     Sorbonne University
128                    NaN
Name: Organization_Name, dtype: object
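
Because a notebook cell only displays its last expression, selecting all the columns of interest at once makes the names, years and organisations visible together. A minimal sketch on a hypothetical frame (the laureate names, years and university below are placeholders, not real data):

```python
import pandas as pd

# Hypothetical stand-in for the renamed Nobel frame
toy = pd.DataFrame({
    "Year": [1905, 1903, 1909, 1902],
    "Full_Name": ["Laureate B", "Laureate A", "Laureate C", "Laureate D"],
    "Sex": ["Female", "Female", "Female", "Male"],
    "Organization_Name": [None, None, None, "Hypothetical University"],
})

# Filter to women, keep the columns of interest, and take the three earliest years
cols = ["Year", "Full_Name", "Organization_Name"]
first_three = toy.loc[toy["Sex"] == "Female", cols].sort_values("Year").head(3)
print(first_three)
```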


Did some people get a Nobel Prize more than once? If so, who were they?

In [140]:
# Flag every row whose (Full_Name, Birth_Date) pair appears more than once.
# The result is a boolean Series that can be used to filter df.
duplicated = df[["Full_Name", "Birth_Date"]].duplicated(keep=False)
# Unique names among the flagged rows, in order of first appearance
names = df[duplicated]["Full_Name"].unique().tolist()
print(f"There were {len(names)} multiple winners of the Nobel Prize. ")
for name in names:
    print(" -", name)


There were 6 multiple winners of the Nobel Prize. 
 - Marie Curie, née Sklodowska
 - Comité international de la Croix Rouge (International Committee of the Red Cross)
 - Linus Carl Pauling
 - Office of the United Nations High Commissioner for Refugees (UNHCR)
 - John Bardeen
 - Frederick Sanger
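
An alternative way to find repeat winners is to count how often each name occurs, which also shows how many prizes each one received. This is a sketch on hypothetical data, and it assumes that a full name alone identifies a laureate (the duplicate check above is stricter because it also compares birth dates):

```python
import pandas as pd

# Hypothetical names; each row represents one awarded prize
toy = pd.DataFrame({
    "Full_Name": ["Laureate A", "Laureate B", "Laureate A",
                  "Organisation X", "Organisation X"],
})

# Count prizes per name and keep only names that appear more than once
counts = toy["Full_Name"].value_counts()
repeat_winners = counts[counts > 1]
print(repeat_winners)
```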
