## Nobel Prize Dataset
---

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
df = pd.read_csv("nobel_prize_data.csv")

df.sample(5)

Unnamed: 0,year,category,prize,motivation,prize_share,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
376,1967,Chemistry,The Nobel Prize in Chemistry 1967,"""for their studies of extremely fast chemical ...",1/4,Individual,George Porter,1920-12-06,Stainforth,United Kingdom,United Kingdom,Male,Royal Institution of Great Britain,London,United Kingdom,GBR
407,1970,Peace,The Nobel Peace Prize 1970,,1/1,Individual,Norman E. Borlaug,1914-03-25,"Cresco, IA",United States of America,United States of America,Male,,,,USA
121,1925,Chemistry,The Nobel Prize in Chemistry 1925,"""for his demonstration of the heterogenous nat...",1/1,Individual,Richard Adolf Zsigmondy,1865-04-01,Vienna,Austrian Empire (Austria),Austria,Male,Goettingen University,Göttingen,Germany,AUT
509,1980,Literature,The Nobel Prize in Literature 1980,"""who with uncompromising clear-sightedness voi...",1/1,Individual,Czeslaw Milosz,1911-06-30,Šeteniai,Russian Empire (Lithuania),Lithuania,Male,,,,LTU
660,1995,Physics,The Nobel Prize in Physics 1995,"""for pioneering contributions to the developme...",1/2,Individual,Frederick Reines,1918-03-16,"Paterson, NJ",United States of America,United States of America,Male,University of California,"Irvine, CA",United States of America,USA


### Initial checks
---

In [3]:
### Get an understanding of the dataset
df.shape            # 962 rows by 16 columns
df.head()           # 1901 is the earliest year
df.tail()           # 2020 is the latest year
df.dtypes           # year and prize_share need to be converted
df.isna().sum()     # There are missing values in some columns, not the first ones
                    ## They appear mainly in the later, more detailed columns.
df.duplicated(keep=False).sum()         # There are no duplicated rows
df.columns          # Column names need to be adjusted


Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')

In [4]:
## Adjustments from the earlier analysis:

#### Keep all the columns, but rename them to make them easier to access
df.rename(columns= {'year': "Year",
                    'category':"Category",
                    'prize': "Prize",
                    'motivation':" Motivation",
                    'prize_share':"Prize_Share",
                    'laureate_type':"Laureate_Type",
                    'full_name' :"Full_Name",
                    'birth_date':"Birth_Date",
                    'birth_city':"Birth_City",
                    'birth_country':"Birth_Country",
                    'birth_country_current':"Birth Country_Current",
                    'sex':"Sex",
                    'organization_name':"Organization_Name",
                    'organization_city':"Organisation_City",
                    'organization_country':"Organisation_Country",
                    'ISO':"ISO"}, inplace=True)

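#### Convert the year to datetime. The explicit format matters: without it, integer
#### years are read as nanoseconds since the epoch and every row ends up in 1970.
df["Year"] = pd.to_datetime(df["Year"].astype(str), format="%Y")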


#### Convert the prize share from a fraction string ("1/4") to a float (0.25)
values = df["Prize_Share"].str.split("/", expand=True)      # Split into numerator and denominator columns
numerator = pd.to_numeric(values[0])
denominator = pd.to_numeric(values[1])
df["Share_Pct"] = numerator / denominator                   # Saved into a new column
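
A quick way to check how these two conversions behave is to run them on a small hand-made frame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not rows from the dataset. It also shows that `pd.to_datetime` reads bare integers as nanoseconds since 1970-01-01, so integer years need an explicit `format`.

```python
import pandas as pd

# Hypothetical stand-in for the real data
toy = pd.DataFrame({"Year": [1901, 1969], "Prize_Share": ["1/1", "1/4"]})

# Bare integers are interpreted as nanoseconds since the Unix epoch
naive_years = pd.to_datetime(toy["Year"]).dt.year

# Parsing the year as a "%Y" string gives the intended calendar year
parsed_years = pd.to_datetime(toy["Year"].astype(str), format="%Y").dt.year

# Split "1/4" into numerator and denominator, then divide
parts = toy["Prize_Share"].str.split("/", expand=True)
share = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])

print(naive_years.tolist())   # [1970, 1970]
print(parsed_years.tolist())  # [1901, 1969]
print(share.tolist())         # [1.0, 0.25]
```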


### Visualisation
---


Create a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?

In [5]:
men_and_women  = df["Sex"].value_counts()
print(men_and_women)

fig = px.pie(names=men_and_women.index, values=men_and_women.values, title="Number of Nobel Laureates by Sex", hole=0.6)
fig.update_traces(textposition="outside", textinfo="percent+label")
fig.show()

Sex
Male      876
Female     58
Name: count, dtype: int64
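
The percentage asked for in the question can be worked out from the same counts. A minimal sketch, with the two counts copied from the output above into a stand-in Series so that it runs on its own (in the notebook, `men_and_women` could be used directly):

```python
import pandas as pd

# Counts copied from the value_counts() output above
men_and_women = pd.Series({"Male": 876, "Female": 58}, name="count")

# Share of prizes that went to women, as a percentage
female_pct = men_and_women["Female"] / men_and_women.sum() * 100
print(f"{female_pct:.1f}% of the prizes went to women")  # 6.2% of the prizes went to women
```

Note that `value_counts()` skips rows where `Sex` is missing, so this is the share among rows with a recorded sex (934 of the 962 rows), not among all rows.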


What are the names of the first 3 female Nobel laureates?

What did they win the prize for?

What do you see in their birth_country? Were they part of an organisation?



In [6]:
# The first five female Nobel laureates (the first three are the top three rows)
first_female_laureates = df[df["Sex"]=="Female"].sort_values(by="Year", ascending=True)[:5]
first_female_laureates["Category"]          # Shows the categories of their prizes
first_female_laureates["Birth_Country"]     # Shows their birth countries
first_female_laureates["Organization_Name"] # Shows the organisation, if any

18                     NaN
29                     NaN
51                     NaN
62     Sorbonne University
128                    NaN
Name: Organization_Name, dtype: object


Did some people get a Nobel Prize more than once? If so, who were they?

In [7]:
duplicated = df[["Full_Name", "Birth_Date"]].duplicated(keep=False)     # Boolean mask: True for every row whose name and birth date occur more than once
names = df[duplicated]["Full_Name"].unique()                            # Unique names of the repeat winners
print(f"There were {len(names)} multiple winners of the Nobel Prize. ")
for name in names:
    print(" -", name)


There were 6 multiple winners of the Nobel Prize. 
 - Marie Curie, née Sklodowska
 - Comité international de la Croix Rouge (International Committee of the Red Cross)
 - Linus Carl Pauling
 - Office of the United Nations High Commissioner for Refugees (UNHCR)
 - John Bardeen
 - Frederick Sanger
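
To see what `duplicated(keep=False)` does, here is a sketch on a made-up frame (the names and dates are placeholders, not dataset rows). With `keep=False` every member of a repeated group is flagged, not only the later occurrences, which is what makes the mask usable for listing repeat winners.

```python
import pandas as pd

# Hypothetical rows: "Person A" appears twice with the same birth date
toy = pd.DataFrame({
    "Full_Name": ["Person A", "Person B", "Person A", "Person C"],
    "Birth_Date": ["1900-01-01", "1910-02-02", "1900-01-01", "1920-03-03"],
})

# keep=False marks every row of a repeated group
mask_all = toy[["Full_Name", "Birth_Date"]].duplicated(keep=False)
# The default (keep="first") leaves the first occurrence unmarked
mask_default = toy[["Full_Name", "Birth_Date"]].duplicated()

print(mask_all.tolist())      # [True, False, True, False]
print(mask_default.tolist())  # [False, False, True, False]

repeat_names = toy[mask_all]["Full_Name"].unique().tolist()
print(repeat_names)           # ['Person A']
```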


In how many categories are prizes awarded?

Create a plotly bar chart with the number of prizes awarded by category.

Use the color scale called Aggrnyl to colour the chart, but don't show a color axis.

Which category has the most prizes awarded?

Which category has the fewest prizes awarded?

In [21]:
df["Category"].nunique()            # There are six categories

#---------------------------------------- Number of Prizes by Category ------------------------------#
categories_prizes = df["Category"].value_counts()
chart = px.bar(x=categories_prizes.index,
               y=categories_prizes.values,
               title="Nobel Prizes by Category",
               labels={"x": "Categories", "y": "Prizes"},
               color_continuous_scale="Aggrnyl",
               color=categories_prizes.values,
               )
chart.update_layout(coloraxis_showscale=False)      # Hide the colour axis, as the task asks
chart.show()

            # Medicine has the most prizes and Economics has the fewest.

When was the first prize in the field of Economics awarded?

Who did the prize go to?

In [45]:
economics_subset = df[df["Category"]== "Economics"].sort_values(by = "Year" , ascending=True )
first_economics_prize = economics_subset.iloc[0,:]              # First position
economics_year = first_economics_prize["Year"].year
economics_who = first_economics_prize["Full_Name"]
print(f"The first economics prize went to { economics_who} in {economics_year}.")
first_economics_prize

The first economics prize went to Jan Tinbergen in 1969.


Year                                                   1969-01-01 00:00:00
Category                                                         Economics
Prize                    The Sveriges Riksbank Prize in Economic Scienc...
 Motivation              "for having developed and applied dynamic mode...
Prize_Share                                                            1/2
Laureate_Type                                                   Individual
Full_Name                                                    Jan Tinbergen
Birth_Date                                                      1903-04-12
Birth_City                                                       the Hague
Birth_Country                                                  Netherlands
Birth Country_Current                                          Netherlands
Sex                                                                   Male
Organization_Name                      The Netherlands School of Economics
Organisation_City        

Create a plotly bar chart that shows the split between men and women by category.

In [103]:
    # value_counts() on two columns returns a Series with a MultiIndex; reset_index() turns it into a flat DataFrame
dataset = df[["Sex", "Category"]].value_counts().reset_index()


            ## groupby gives the same counts. Use it when you want more control, e.g. to group by more columns

cat_men_women = df.groupby(['Category', 'Sex'],as_index=False).agg({'Prize': pd.Series.count})
cat_men_women.sort_values('Prize', ascending=False, inplace=True)
cat_men_women

Unnamed: 0,Category,Sex,Prize
11,Physics,Male,212
7,Medicine,Male,210
1,Chemistry,Male,179
5,Literature,Male,101
9,Peace,Male,90
3,Economics,Male,84
8,Peace,Female,17
4,Literature,Female,16
6,Medicine,Female,12
0,Chemistry,Female,7
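
The note that `groupby` gives the same counts as `value_counts` can be checked on a small made-up frame. This is a sketch only: the rows are invented for illustration, and `reset_index(name="count")` is used here so that the column name does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
toy = pd.DataFrame({
    "Category": ["Physics", "Physics", "Physics", "Peace", "Peace", "Peace"],
    "Sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Prize": ["p1", "p2", "p3", "p4", "p5", "p6"],
})

# Route 1: count the (Sex, Category) combinations directly
via_value_counts = toy[["Sex", "Category"]].value_counts().reset_index(name="count")

# Route 2: group by both columns and count the prizes in each group
via_groupby = toy.groupby(["Category", "Sex"], as_index=False).agg(count=("Prize", "count"))

# Put both results on the same index so they can be compared
a = via_value_counts.set_index(["Category", "Sex"])["count"].sort_index()
b = via_groupby.set_index(["Category", "Sex"])["count"].sort_index()
print(a.to_dict() == b.to_dict())  # True
```

The `groupby` route counts non-missing values of `Prize`, so the two would only differ if some rows had a missing `Prize`.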


In [96]:
    # name is the label that appears in the legend.
    ## Best to split the data into a separate dataset for each group, one per trace. The same applies to pie charts.


import plotly.graph_objects as go
male= dataset[dataset["Sex"]=="Male"]
female= dataset[dataset["Sex"]=="Female"]

fig = go.Figure(data=[
    go.Bar(name="Male", x=male["Category"], y=male["count"]),
    go.Bar(name="Female", x=female["Category"], y=female["count"])
])
# Change the bar mode
fig.update_layout(barmode='group',
                  title="Nobel Laureates by Gender",
                  )

fig.update_xaxes(title_text="Subject")

# Optional: Change Y-axis label too
fig.update_yaxes(title_text="Number of Nobel Laureates")

fig.show()
