# Index

## Prerequisites
[Packages and Data](#Packages-and-Data)

## Preprocessing
### Missing Values
[Missing Values Correction](#Missing-Values-Correction)
* Missing Value treatment for [Sex](#Sex)
* Missing Value treatment for [Category](#Category)
* Missing Value treatment for [Country](#Country)

### Removing Duplication
[Removing duplication of records](#Removing-duplication-of-records)

## EDA and Descriptive Analytics
* For countries: [Distribution of laureates/ prizes across countries](#Distribution-of-laureates/-prizes-across-countries)
* For categories: [Distribution across Categories](#Distribution-across-Categories)
* For time and age related stats: [Time and Age](#Time-and-Age)
* Prize Share: [Prize Share](#Prize-Share)
* Some interesting facts: [The first evers... and only evers...](#The-first-evers...-and-only-evers...)
* Motivation Analysis: [Motivations](#Motivations)




## Packages and Data

In [None]:
## Importing packages

# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

suppressWarnings(library(tidyverse)) # metapackage with lots of helpful functions
suppressWarnings(library(dplyr))
suppressWarnings(library(ggthemes))
suppressWarnings(library(httr))
suppressWarnings(library(kableExtra))
suppressWarnings(library(ggpubr))
suppressWarnings(library(treemap))
suppressWarnings(library(reshape2))
suppressWarnings(library(ggrepel))
suppressWarnings(library(grid))
suppressWarnings(library(gridExtra))
suppressWarnings(library(dbplyr))
suppressWarnings(library(ggalt))
suppressWarnings(library(tidytext))
suppressWarnings(library(textstem))
suppressWarnings(library(viridis))
suppressWarnings(library(wordcloud))
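# qdap is not preinstalled; installing it requires internet access to be enabled in the kernel settings
if (!requireNamespace("qdap", quietly = TRUE)) install.packages("qdap")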
suppressWarnings(library(qdap))
library(widyr)
library(igraph)
library(ggraph)
#install.packages("tidytext")
#install.packages("textdata")

## Running code

# In a notebook, you can run a single code cell by clicking in the cell and then hitting 
# the blue arrow to the left, or by clicking in the cell and pressing Shift+Enter. In a script, 
# you can run code by highlighting the code you want to run and then clicking the blue arrow
# at the bottom of this window.

## Reading in files

# You can access files from datasets you've added to this kernel in the "../input/" directory.
# You can see the files added to this kernel by running the code below. 

list.files(path = "../input")

## Saving data

# If you save any files or images, these will be put in the "output" directory. You 
# can see the output directory by committing and running your kernel (using the 
# Commit & Run button) and then checking out the compiled version of your kernel.

In [None]:
data<- read.csv('../input/nobel-laureates/archive.csv')
#str(data)
oc<- read.csv("../input/organizations-and-countries/Organizations_Countries.csv")
nobel_winner_all_pubs <- read.csv("../input/nobel-publications/nobel_winner_all_pubs.csv")

In [None]:
dim(data)

In [None]:
head(data,3)

In [None]:
# Replace the accented characters é and è with a plain e.
# This is needed to join the organizations-and-countries dataset to the main data.
data <- data.frame(lapply(data, function(x) {
                  gsub("[éè]", "e", x)
              }))

In [None]:
#join organization countries to get the missing country names
data<-merge(x=data,y=oc,by='Full.Name',all.x=TRUE)
# formatting country column from joined dataset
data$Country<-as.character(data$Country)
data$Country[is.na(data$Country)] <- ""
data$Country <- as.factor(data$Country)

## Missing Values Correction

Missing values in a few columns can only be replaced by a placeholder such as Undisclosed or Not Available, because the information cannot be recovered in any other way. Other values, such as countries, can be derived from cities or organizations.
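
The placeholder replacement described here follows one pattern throughout the notebook: convert the column to character, overwrite the blanks, and convert back to a factor. A minimal sketch of that pattern on toy data follows; the helper name `replace_blank` is hypothetical and is not part of the original notebook.

```r
# Hypothetical helper (an assumption, not from the notebook): replace blank
# strings in a factor or character column with a placeholder label.
replace_blank <- function(x, label) {
  x <- as.character(x)              # factors must become character before assignment
  x[!is.na(x) & x == ""] <- label   # overwrite only the blank entries
  as.factor(x)
}

# Toy column with two blanks
sex <- factor(c("Male", "", "Female", ""))
filled <- replace_blank(sex, "Organization")
table(filled)
```

Wrapping the three steps in one function would avoid repeating them for every column, at the cost of hiding the intermediate character conversion.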

In [None]:
# count the blank entries in each column (NA values are ignored)
for (col in colnames(data)){
    print(col)
    print(sum(data[[col]] == "", na.rm = TRUE))
}

### Sex

In [None]:
unique(subset(data,Sex=='',select=c("Laureate.Type","Sex")))

The Sex field is blank wherever the laureate type is Organization, so sex does not apply to these records. We will therefore replace the blanks with 'Organization'.

In [None]:
#Sex
data$Sex <- as.character(data$Sex)
data$Sex[data$Sex==""] <- "Organization"
data$Sex <- as.factor(data$Sex)
table(data$Sex)

### Category

In [None]:
table(data$Category)

There are no missing values in the category column.

In [None]:
data %>% 
    filter( !is.na(Category) )%>%
    group_by(Category,Sex, Laureate.Type)%>%
    summarise(counts=n())%>%
    arrange(desc(counts))

It looks like only the Peace category has the Laureate Type 'Organization'.

### Country

There are 3 country fields:
1. Birth.Country
2. Death.Country
3. Organization.Country

#### Organization Country

In [None]:
#Organization Country
#data %>% group_by(data$Organization.Country)%>%
#   count()

There are 253 blanks in Organization Country.

In [None]:
#Let us see if these fields have an organization name or a city
unique(subset(data,Organization.Country=='',select=c("Organization.Name","Organization.City","Organization.Country")))

In [None]:
#use the 2 organization names to populate the organization countries 
data$Organization.Country <- as.character(data$Organization.Country)
data$Organization.Country[(data$Organization.Country == "")&(data$Organization.Name == 'Institut Pasteur')]<- "France"
data$Organization.Country[(data$Organization.Country == "")&(data$Organization.Name == 'Howard Hughes Medical Institute')]<- "United States of America"


A few records have an organization name in the Full.Name column. Let us take all records where the full name is an organization name; we can get that list from the rows with a blank birth country.

In [None]:
unique(subset(data,( Birth.Country==''),select=c("Death.Country","Birth.Country","Full.Name","Organization.Country","Country")))

In [None]:
data$Country <- as.character(data$Country)
data$Organization.Country<-ifelse(data$Organization.Country=="", data$Country, data$Organization.Country)
data$Country <- as.factor(data$Country)

In [None]:
#the other countries are unavailable
data$Organization.Country[data$Organization.Country == ""]<-"Unavailable"
data$Organization.Country <- as.factor(data$Organization.Country)

#### Birth Country

In [None]:
# data %>% group_by(data$Birth.Country)%>%
#      count()

There are 26 blanks in Birth Country

In [None]:
unique(subset(data,( Birth.Country==''),select=c("Death.Country","Birth.Country","Full.Name","Organization.Country","Organization.Name","Country")))

In [None]:
data$Birth.Country <- as.character(data$Birth.Country)
data$Birth.Country[data$Birth.Country == ""]<-"Unavailable/Organization"
data$Birth.Country <- as.factor(data$Birth.Country)

#### Death Country

In [None]:
# data %>% group_by(data$Death.Country)%>%
#     count()

There are 364 blanks in Death Country

In [None]:
unique(subset(data,Death.Country=='',select=c("Death.City","Death.Country","Death.Date","Laureate.Type")))

Some records are blank because the laureate has not died yet. Others are blank because the Laureate Type is Organization, and the rest are simply unavailable. Let us group them all as Unavailable/Undisclosed.

In [None]:
data$Death.Country <- as.character(data$Death.Country)
data$Death.Country[data$Death.Country == ""]<-"Unavailable/Undisclosed"

In [None]:
data$Death.Country <- as.factor(data$Death.Country)

## Cleaning Dataset

### Removing duplication of records

In [None]:
test<-data[duplicated(data[,c("Year","Category","Prize.Share","Full.Name")])==TRUE,]
test<-test%>% arrange(Year,Category,Full.Name)
test$duplicated<- "1"
dim(test)
#head(test,3)

There are 58 records in this dataset that are duplicates of another record. This means that the Year, Category, Prize.Share and Full.Name values (the columns checked above) are all the same; only the affiliated organization differs. It is also important to note that, in these cases, one of the organizations is either a subset or a part of the other (for instance, Kaiser-Wilhelm-Institut (now Max-Planck-Institut) für Biochemie is a part of Berlin University (1939, Chemistry)).

Such records must be removed.
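
As a point of comparison with the merge-based removal in the next cell, the same idea can be sketched with `duplicated()` alone. This is an alternative under stated assumptions, not the notebook's method: the toy data frame and its values are made up, and this version always keeps the first occurrence of each key.

```r
# Toy data (hypothetical): the same laureate and prize listed under two affiliations
toy <- data.frame(
  Year = c(1939, 1939, 1950),
  Category = c("Chemistry", "Chemistry", "Physics"),
  Full.Name = c("A", "A", "B"),
  Organization.Name = c("Org 1", "Org 2", "Org 3"),
  stringsAsFactors = FALSE
)

# Columns that identify one laureate-prize combination
key <- c("Year", "Category", "Full.Name")

# duplicated() flags the second and later rows sharing a key; keep the rest
deduped <- toy[!duplicated(toy[, key]), ]
nrow(deduped)  # 2
```

Which affiliation survives depends on row order, so the data would need to be sorted first if a particular organization should be preferred.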

In [None]:
#join the duplicated row dataset to main dataset to remove them
data2<-merge(x=data,y=test[ , c('Year','Full.Name','Category','Prize.Share',"Laureate.ID", "duplicated","Sex","Organization.Name")],by=c('Year','Full.Name','Category','Prize.Share',"Laureate.ID","Sex","Organization.Name"),all.x=TRUE)
data3<-data2[(is.na(data2$duplicated)),]
data3$duplicated<-NULL
data<-data3
dim(data)

## Data Visualization and EDA

### Distribution of laureates/ prizes across countries

In [None]:
#function to clean (country) strings: if the name contains parentheses, keep only the text inside them
stripp<-function(x){
 re <- "\\(([^()]+)\\)"
 x<-as.character(x)
 if(grepl("\\(",x) == TRUE){
  y<-gsub(re, "\\1", str_extract_all(x, re)[[1]])
  return(as.character(y))
 }
 else{return(as.character(x))}
}
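
The function above handles one string at a time and is applied with `sapply` later on. A vectorised variant of the same idea might look like the sketch below; the name `inside_parens` and the sample inputs are illustrative assumptions, and for strings with several parenthesised parts it keeps only the last one.

```r
# Hypothetical vectorised variant: keep the text inside parentheses when present
inside_parens <- function(x) {
  x <- as.character(x)
  has_parens <- grepl("\\(([^()]+)\\)", x)
  # capture the parenthesised text and drop everything around it
  x[has_parens] <- sub(".*\\(([^()]+)\\).*", "\\1", x[has_parens])
  x
}

cleaned <- inside_parens(c("Prussia (Germany)", "France"))
cleaned  # "Germany" "France"
```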

#### Birth Country

In [None]:
#histogram for Birth Country
library(repr)
options(repr.plot.width=15, repr.plot.height=8)
byCountry<-as.data.frame(data %>% select(Birth.Country) %>% filter(Birth.Country != "" & Birth.Country != "Unavailable/Organization") %>% group_by(Birth.Country) %>% summarise(number=n()) %>% arrange(-number))
byCountry$name<-sapply(byCountry$Birth.Country,stripp)
byCountry<-as.data.frame(byCountry %>% select(number,name) %>% group_by(name) %>% summarise(sum=sum(number)))
colnames(byCountry)<-c('country','number')
byCountry$country<-as.character(byCountry$country)
g2<-ggplot(byCountry, aes(x=country, y=number)) + geom_bar(width = 1, stat="identity") + 
  xlab('Birth Country') + ylab('Number of Nobel Prizes') + theme(axis.text.x = element_text(angle=90, hjust=1)) +
  ggtitle('Distribution of Nobel Prizes per birth country') + 
  theme(plot.margin = margin(2,0.2,1,1, "cm"))
g2

The United States has the highest number of laureates, followed closely by Germany and the United Kingdom. Let us do the same by organization country.

#### Organization Country

In [None]:
#histogram for Organization Country
library(repr)
options(repr.plot.width=15, repr.plot.height=8)
byOC<-as.data.frame(data %>% select(Organization.Country) %>% filter(Organization.Country != "" & Organization.Country != "Unavailable") %>% group_by(Organization.Country) %>% summarise(number=n()) %>% arrange(-number))
byOC$name<-sapply(byOC$Organization.Country,stripp)
byOC<-as.data.frame(byOC %>% select(number,name) %>% group_by(name) %>% summarise(sum=sum(number)))
colnames(byOC)<-c('country','number')
byOC$country<-as.character(byOC$country)
g3<-ggplot(byOC, aes(x=country, y=number)) + geom_bar(width = 1, stat="identity") + 
  xlab('Organization Country') + ylab('Number of Nobel Prizes') + theme(axis.text.x = element_text(angle=90, hjust=1)) +
  ggtitle('Distribution of Nobel Prizes per organization country') + 
  theme(plot.margin = margin(2,0.2,1,1, "cm"))
g3

It looks like organizations from the USA have produced the most laureates. The UK is next, followed by Germany.

#### Trend of the top most country- USA

In [None]:
#by birth country
usa_perc <- data %>% mutate(usa_winner=ifelse(Birth.Country=="United States of America",TRUE,FALSE),
       decade= floor(as.numeric(as.character(Year))/10)*10)%>% group_by(decade)%>%summarize(proportion=mean(usa_winner==TRUE,na.rm=TRUE))
       
usa_perc %>%ggplot(aes(decade,proportion))+geom_line(color="blue")+geom_point(color="orange")+
labs(title="Proportion of USA born winners by each decade")+
geom_text(aes(label=paste0(as.character(round(proportion,2)*100),"%"), vjust = -0.2))+
scale_y_continuous(labels=scales::percent,limits=c(0.0,1.0),expand=c(0,0))+theme_solarized()

The proportion of USA-born winners seems to increase over the decades.
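
The per-decade proportion rests on two small steps: rounding each year down to its decade, and taking the mean of a logical flag. A base R sketch on made-up values (the years and flags below are not from the dataset):

```r
# Round years down to the start of their decade
years <- c(1901, 1909, 1910, 1955, 2016)
decade <- floor(years / 10) * 10
decade  # 1900 1900 1910 1950 2010

# The mean of a logical vector is the share of TRUE values
usa_winner <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
prop <- tapply(usa_winner, decade, mean)
prop
```

With very few prizes in a decade, a single laureate can move the proportion a lot, which is worth keeping in mind when reading the line chart.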

#### Sex and Birth Countries

In [None]:
#dataframe of birth country, sex and number of laureates
poppyrdf<-as.data.frame(data %>% select(Year,Sex,Category,Birth.Country) %>% 
  filter(Birth.Country != "" & Birth.Country != "Unavailable/Organization" & Sex!="Organization") %>% 
  group_by(Birth.Country,Sex) %>% 
  summarise(number=n()) %>% 
  arrange(-number))

#top 10 countries by birth country 
#head(byCountry[order(-byCountry$number),],10)
# 10 top most birth countries with most laureates: 
## United States of America,Germany,United Kingdom,France,Poland,
##Russia,Sweden,Japan,Italy,Austria

top10<-c("United States of America","Germany","United Kingdom","France","Poland","Russia","Sweden","Japan","Italy","Austria")
ppd_10<-poppyrdf %>% filter(Birth.Country %in% top10)
ppd_10$number<-ifelse(ppd_10$Sex=="Male",-1*ppd_10$number,ppd_10$number)
#ppd_10

In [None]:

# Breaks and labels for the count axis (horizontal after coord_flip);
# male counts are negative, so they are relabelled as positive numbers
brks <- seq(-250, 0, 50)
lbls <- as.character(seq(250, 0, -50))

# Plot
ggplot(ppd_10, aes(x = Birth.Country, y = number, fill = Sex)) + 
  geom_bar(stat = "identity") +   # draw the bars: males extend left, females right
  coord_flip() +  # Flip axes
  labs(title="Number of male and female laureates in top 10 Countries (birth)") +
  scale_y_continuous(breaks = brks,labels=lbls) + # Labels
  theme_tufte() +  # Tufte theme from ggthemes
  theme(plot.title = element_text(hjust = .5),   # Centre plot title
        axis.ticks = element_blank()) +
  scale_fill_brewer(palette = "Dark2")  # Color palette

All of these countries have more male laureates than female laureates. The UK, Russia and Japan (as birth countries) have not had a single female laureate so far.

### Distribution across Categories

#### Categories

In [None]:
cat1<-as.data.frame(data %>% select(Year,Sex,Category,Birth.Country,Laureate.Type) %>% 
  group_by(Category) %>% 
  summarise(number=n()) %>% 
  arrange(-number))
cat1$index<-paste0(cat1$Category,":",cat1$number)
treemap(dtf = cat1,
        index=c("index"),
        vSize="number",
        vColor="number",
        palette="Spectral",
        type="value",
        border.col=c("black"),
        fontsize.title = 18,
        algorithm="pivotSize",
        title ="Treemap of the distribution of number of laureates across categories",
        title.legend="Number of Laureates")

Clearly, Medicine is the largest category, with the most records. Physics is a close second.

#### Laureate Type and Categories

In [None]:
cat2<-as.data.frame(data %>% select(Year,Sex,Category,Birth.Country,Laureate.Type) %>% 
  group_by(Category,Laureate.Type) %>% 
  summarise(number=n()) %>% 
  arrange(-number))

ggdotchart(cat2, x = "Category", y = "number",
           color = "Laureate.Type",                      # Color by laureate type
           palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
           sorting = "ascending",                        # Sort values in ascending order
           add = "segments",                             # Add segments from y = 0 to dots
           dot.size = 6,                                 # Large dot size
           label = round(cat2$number,1),                 # Add the counts as dot labels
           ggtheme = theme_pubr()                        # ggplot2 theme
           )

We saw earlier that only the Peace category has awards under the laureate type 'Organization'. Of the 130 awards in that category, 30 were won by organizations and 100 by individuals. Let us now compare how each sex is represented in each category.

#### Sex and Categories

In [None]:
slopedf<-as.data.frame(data %>% select(Year,Sex,Category,Birth.Country) %>% 
  filter(Birth.Country != "" & Sex!="Organization") %>% 
  group_by(Category,Sex) %>% 
  summarise(number=n()) %>% 
  arrange(-number))

cat_sex<-dcast(slopedf, Category ~ Sex, value.var = "number")
cat_sex


In [None]:
gg_slope <- cat_sex %>%
  # flag categories where the female:male ratio exceeds 0.10 (used for colours)
  mutate(high_ratio = (Female/Male > 0.10)) %>%
  ggplot() +
  # add a line segment that goes from men to women for each discipline
  geom_segment(aes(x = 1, xend = 2, 
                   y = Male, 
                   yend = Female,
                   group = Category,
                   col = high_ratio), 
               size = 1.2) +
  # remove all axis stuff
  theme_classic() + 
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())+
  # add vertical lines that act as axis for men
  geom_segment(x = 1, xend = 1, 
               y = min(cat_sex$Female) - 2,
               yend = max(cat_sex$Male) + 1,
               col = "grey70", size = 0.5) +
  # add vertical lines that act as axis for women
  geom_segment(x = 2, xend = 2, 
               y = min(cat_sex$Female) - 2,
               yend = max(cat_sex$Male) + 1,
               col = "grey70", size = 0.5) +
  # add the words "men" and "women" above their axes
  geom_text(aes(x = x, y = y, label = label),
            data = data.frame(x = 1:2, 
                              y = 2 + max(cat_sex$Male),
                              label = c("men", "women")),
            col = "grey30",
            size = 6) +
  # add the label and number of laureates for each category next the men axis
  geom_text(aes(x = 1 - 0.03, 
                y = Male, 
                label = paste0(Category, ", ", Male)),
             col = "grey30", hjust = "right") +
  # add the number of laureates next to each point on the women axis
  geom_text(aes(x = 2 + 0.08, 
                y = Female, 
                label = paste0(Category, ", ",Female)),
            col = "grey30") +
  # set the limits of the x-axis so that the labels are not cut off
  scale_x_continuous(limits = c(0.5, 2.1)) 
            
gg_slope

Every category has more male laureates than female laureates. The 'Literature' and 'Peace' categories have a female:male ratio greater than 0.1 (highlighted in blue above). The others have a ratio of at most 0.1.
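
The ratio flag behind the colouring is simple arithmetic. The sketch below uses made-up counts for two hypothetical categories, X and Y, rather than the notebook's actual numbers.

```r
# Hypothetical counts, for illustration only
cat_counts <- data.frame(
  Category = c("X", "Y"),
  Female = c(12, 2),
  Male = c(90, 200)
)

# Female:male ratio and the 0.10 threshold used for the colours
cat_counts$ratio <- cat_counts$Female / cat_counts$Male
cat_counts$high_ratio <- cat_counts$ratio > 0.10
cat_counts
```

Here X has a ratio of about 0.13 and is flagged, while Y has a ratio of 0.01 and is not.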

#### Sex, Year and Categories

In [None]:
distr<-as.data.frame(data %>% select(Year,Sex,Category,Birth.Country) %>% 
  filter(Birth.Country != "") %>% 
  group_by(Year,Sex,Category) %>% 
  summarise(number=n()) %>% 
  arrange(-number))
distr$Year<- as.numeric(as.character(distr$Year))
g1<-ggplot(data=filter(distr,Sex=='Female'),aes(Year,Category,fill=number)) + 
  geom_tile(aes(fill = number),colour = "white",na.rm=TRUE) + 
  scale_fill_gradient(low="#3B9AB2",high="black",guide = guide_legend(title = "Female")) +
  theme(legend.position="right") + xlab('') + ylab('') + theme(axis.title.x = element_blank()) +xlim(1900,2015)
g2<-ggplot(data=filter(distr,Sex=='Male'),aes(Year,Category,fill=number)) + 
  geom_tile(aes(fill = number),colour = "white",na.rm=TRUE) + 
  scale_fill_gradient(low="#EBCC2A",high="black",guide = guide_legend(title = "Male")) +
  theme(legend.position="right") + xlab('') + ylab('') + theme(axis.title.x = element_blank())+xlim(1900,2015)


ggarrange(g1, g2, ncol = 1, nrow = 2)

We see that female laureates are scarce as compared to males. 

After 1975, female laureates appear more often in Peace, Medicine and Literature than in the other categories or in earlier years.

Many awards went to male laureates in Physics, Medicine and Chemistry, especially after 1950.

It looks like there was no Nobel Prize in Economics until the late 1960s. Further research reveals that the prize for Economics was only established in 1968.

Let us now look at the cumulative trend over the years too.

In [None]:
cum_trend<-as.data.frame(data %>% filter(Birth.Country != "Unavailable/Organization" ) %>% select(Sex,Year,Category) %>% group_by(Category,Sex,Year) %>% summarise(number=n()) %>% mutate(cs = cumsum(number)))
h2<-ggplot(data=cum_trend,aes(x=Year,y=cs,group=Sex)) +facet_wrap(.~Category)+ geom_line(aes(color=Sex),size=2) +
  scale_color_manual(values = c("red3","#006400")) + 
  ylab('Cumulated number of Nobel Prizes') + theme(axis.title.x = element_blank()) + scale_x_discrete(breaks = levels(cum_trend$Year)[c(T, rep(F, 9))])

h2

The cumulative count of female laureates seems to be increasing year over year for Physics, Medicine and Peace. Economics does not show a trend for females because there is a single point for the year 2009. 
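
A cumulative count only makes sense if the rows are in year order within each group. The base R sketch below shows the per-group running total on toy data; the groups P and Q and their counts are hypothetical.

```r
# Toy yearly counts for two hypothetical groups
yearly <- data.frame(
  Category = c("P", "P", "P", "Q", "Q"),
  Year = c(1901, 1903, 1911, 1901, 1902),
  number = c(1, 2, 1, 3, 1)
)

# Sort by group and year, then accumulate within each group
yearly <- yearly[order(yearly$Category, yearly$Year), ]
yearly$cs <- ave(yearly$number, yearly$Category, FUN = cumsum)
yearly$cs  # 1 3 4 3 4
```

Years with no prize for a group simply have no row, so a line drawn through these points spans such gaps.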

### Time and Age

Let us first create an Age column. The age of a laureate at the time of the award is taken as the difference between the year in which the Nobel Prize was awarded and the birth year.
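
As a quick illustration of this definition, assuming birth dates are stored as strings that start with a four-digit year (the format the next cell relies on):

```r
# Award years and birth dates for two laureates (YYYY-MM-DD format assumed)
year <- c("2014", "2007")
birth_date <- c("1997-07-12", "1917-08-21")

# Age = award year minus birth year
age <- as.numeric(year) - as.integer(substr(birth_date, 1, 4))
age  # 17 90
```

Because only the years are compared, the result can differ by one from the exact age on the day of the award.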

In [None]:
data$Age <- as.numeric(as.character(data$Year)) - as.integer(substr(data$Birth.Date, 1,4))
paste0("Maximum age is : ",max(data$Age, na.rm=TRUE)," years and minimum age is : ",min(data$Age, na.rm=TRUE)," years")

#### Age and Category

In [None]:
data %>% ggplot(aes(Category,Age,fill=Category))+geom_boxplot(alpha=0.5, na.rm=TRUE)+geom_jitter(na.rm=TRUE)+theme_solarized()+scale_fill_manual(values=heat.colors(6))+
   theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Distribution of Age across Categories")

It looks like Physics has the bulk of the youngest laureates, followed by Chemistry and Medicine. However, the two youngest laureates across all categories are points in Peace, both aged below 25. Economics has 4 people well above the 87.5 mark, who may be the oldest laureates across categories. Literature, Physics, Peace and Medicine each have a couple of people at the 87.5 borderline.

All categories have an average age of around 60-65 years.

#### Time, Age and Category

In [None]:
ggplot(data,aes(Year,Age,col=Category))+geom_point(na.rm=TRUE)+facet_wrap(~Category)+geom_smooth(method='loess',linetype = 'solid',na.rm=TRUE) +scale_color_manual(values=heat.colors(6))+theme_excel() + theme(
          axis.text.x = element_text(color="black",size=7, angle=90)) + scale_x_discrete(breaks = levels(data$Year)[c(T, rep(F, 9))])

Every category except Peace shows an increasing trend in laureate age. The age of laureates in the Peace category is decreasing over the years.

#### Time, Sex and Category

In [None]:
female_male_bydecade <- data %>% mutate(female_winner=ifelse(Sex=="Female",TRUE,FALSE),male_winner=ifelse(Sex=="Male",TRUE,FALSE),
                                        decade=floor(as.numeric(as.character(Year))/10)*10)%>%
group_by(decade,Category)%>%summarize(fperc=mean(female_winner,na.rm=TRUE),mperc=mean(male_winner,na.rm=TRUE))
    
#female_male_bydecade

female_male_bydecade %>%ggplot(aes(decade,fperc,color=Category,group=Category))+geom_line()+geom_point()+
labs(title="Proportion of female winners by each decade across categories")+
xlab("Decade") + ylab("Percentage of female winners")+
scale_y_continuous(labels=scales::percent,limits=c(0.0,1.0),expand=c(0,0))+
scale_x_continuous(breaks = seq(1900, 2020, 10))+theme_solarized()

The percentage of female winners seems to have increased in Literature, Medicine and Peace from 1970 onwards. The other categories show no evident trend.

In [None]:
female_male_bydecade %>%ggplot(aes(decade,mperc,color=Category,group=Category))+geom_line(size=1)+geom_point()+
labs(title="Proportion of male winners by each decade across categories")+
xlab("Decade") + ylab("Percentage of male winners")+
scale_y_continuous(labels=scales::percent,limits=c(0.0,1.0),expand=c(0,0))+
scale_x_continuous(breaks = seq(1900, 2020, 10))+theme_solarized()

Peace has a very evident downward trend in the percentage of male winners. Literature and Medicine also show a slight downward trend 1970 onwards.

#### Age and Country

In [None]:
# 10 top most birth countries with most laureates: 
## United States of America,Germany,United Kingdom,France,Poland,
##Russia,Sweden,Japan,Italy,Austria

# For these let us find the distribution of the age
data_top10<-data %>% filter(Birth.Country %in% top10)

data_top10 %>% ggplot(aes(Birth.Country,Age,fill=Birth.Country))+geom_boxplot(alpha=0.5, na.rm=TRUE)+geom_jitter(na.rm=TRUE)+theme_solarized()+scale_fill_manual(values=heat.colors(10))+
   theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Distribution of Age across the top 10 Birth Countries")

The plot for the USA has a large number of points, reflecting the large number of laureates born there. The distribution of age seems to be similar in most countries. The average age is around 60 for France, Italy, Russia, the UK and the USA. The average for Germany is below the 60 mark; moreover, many of its points lie below 60, indicating a younger set of laureates from Germany.

The boxes for some countries, such as Russia, Poland, Germany and Italy, are shorter than the others. This implies that most laureates from these countries were of a similar age when they received the prize, with little deviation from the average. The box for Austria, in contrast, is longer and has uneven sections, which indicates a large overall spread of ages: while some laureates from Austria were relatively young (below 60), a few others were above 70. The uneven sections show that ages are tightly clustered in some parts of the scale and more variable in others.
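
The height of a box is the interquartile range (IQR), so the visual comparison can also be expressed as a number. A sketch on made-up ages for two hypothetical countries, A and B:

```r
# Hypothetical ages: A is tightly clustered, B is widely spread
ages <- data.frame(
  country = rep(c("A", "B"), each = 5),
  age = c(58, 59, 60, 61, 62,
          40, 50, 60, 75, 85)
)

# IQR per country: a larger value corresponds to a taller box
spread <- tapply(ages$age, ages$country, IQR)
spread
```

Both groups have a median of 60, yet A has an IQR of 2 and B an IQR of 25, which is the kind of difference described for Austria versus the shorter boxes.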

#### Age and Year

In [None]:
ggplot(data,aes(Year,Age,size=Age,col=cut(Age,6)))+geom_point(alpha=0.5, na.rm=TRUE)+geom_smooth(method="loess",col="orange",se=FALSE, na.rm=TRUE)+ theme_solarized()+
   theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Age and Year at which Nobel Prize was Won")+
  geom_text_repel(aes(label=ifelse(Age<=25 |Age>87.5,paste0(as.character(Full.Name),":",as.character(Age)),"")),size=5, na.rm=TRUE) + scale_x_discrete(breaks = levels(data$Year)[c(T, rep(F, 9))])

The general trend of the age of the laureates with each passing year seems to be upwards. 

The youngest laureate is Malala Yousafzai, who won the Nobel Peace Prize at age 17 in 2014. The oldest is Leonid Hurwicz (aged 90 at the time of the award), who received the Economics Nobel Prize in 2007.

#### Youngest Winners (age)

In [None]:
data %>% select(Full.Name,Age,Year,Category,Organization.Name,Sex)%>%arrange(Age)%>%head(5)

#### Age and Sex

In [None]:
t1<-data %>% select(Full.Name,Age,Year,Category,Sex)%>%arrange(Age)%>%filter(Sex=="Female") %>%head(5)
t2<-data %>% select(Full.Name,Age,Year,Category,Sex)%>%arrange(Age)%>%filter(Sex=="Male") %>%head(5)
t3<-data %>% select(Full.Name,Age,Year,Category,Sex)%>%arrange(desc(Age))%>%filter(Sex=="Female") %>%head(5)
t4<-data %>% select(Full.Name,Age,Year,Category,Sex)%>%arrange(desc(Age))%>%filter(Sex=="Male") %>%head(5)
# Helper: wrap a data frame in a table grob with a title above and an italic footnote below
make_table_grob <- function(df, title_text, footnote_text) {
  tbl <- tableGrob(df)
  h <- grobHeight(tbl)
  w <- grobWidth(tbl)
  title <- textGrob(title_text, y=unit(0.60,"npc") + 0.5*h, 
                    vjust=0, gp=gpar(fontsize=12))
  footnote <- textGrob(footnote_text, 
                       x=unit(0.35,"npc") - 0.5*w,
                       y=unit(0.41,"npc") - 0.5*h, 
                       vjust=1, hjust=0, gp=gpar(fontface="italic"))
  gTree(children=gList(tbl, title, footnote))
}

gt  <- make_table_grob(t1, "Youngest Female Laureates", "Age of youngest: 17 years; Category: Peace")
gt2 <- make_table_grob(t2, "Youngest Male Laureates", "Age of youngest: 25 years; Category: Physics")
gt3 <- make_table_grob(t3, "Oldest Female Laureates", "Age of oldest: 88 years; Category: Literature")
gt4 <- make_table_grob(t4, "Oldest Male Laureates", "Age of oldest: 90 years; Category: Economics")
grid.arrange(top=textGrob("Youngest and Oldest Male and Female Laureates", gp=gpar(fontsize=15,font=8)),
  gt,gt2,gt3,gt4,
  nrow=2)

Some interesting observations:
1. The 5 youngest female laureates all won in the Peace category. The 5 youngest male laureates all won in Physics.
2. While the youngest female was aged 17 and the youngest male was aged 25, all the next-youngest laureates were aged in the range 31-33.
3. For the oldest laureates, the age range across both sexes seems to be 80-90 years.

### Prize Share

#### Prize counts and shares

In [None]:
data %>% group_by(Year,Category)%>%summarise(ct=n())%>%ggplot(aes(x=ct))+geom_bar(fill=rainbow(n=1),alpha=0.5)+ geom_text(stat = 'count',aes(label =..count.., vjust = -0.2))+ 
theme_excel()+
xlab("Number of laureates in a Category in a year") + ylab("Count of cases")+
theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Laureates per Prize")

The bar chart above shows how many prizes (category-year combinations) were awarded to one, two or three laureates. In other words, for a given category in a given year, the maximum number of people who have won an award is 3. There are 97 such cases.

In [None]:
# prizesharedf<-as.data.frame(data %>% select(Year,Category,Laureate.ID,Prize.Share) %>% 
#   group_by(Year,Category,Prize.Share) %>% 
#   summarise(number=n()) %>% 
#   arrange(-number))


 data %>% ggplot(aes(x=Prize.Share))+geom_bar(fill=rainbow(n=1),alpha=0.5)+ theme_excel()+ 
xlab("Prize Share") + ylab("Count of laureates per Prize Share")+
  geom_text(stat = 'count',aes(label =..count.., vjust = -0.2))+
    theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Laureates per Prize Share")

The chart shows 1/4 shares even though at most 3 laureates share a prize. This happens because a split prize is often awarded as follows: one half goes to one individual, X, and the other half is given jointly to two other individuals, Y and Z. This split is recorded as (1/2, 1/4, 1/4). For instance, 3 people won the 2011 Nobel Prize in Physics: Adam G. Riess (1/4), Brian Schmidt (1/4) and Saul Perlmutter (1/2). The shares are not 1/3 each but (1/2, 1/4, 1/4).
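
A simple sanity check on such a split is that the shares of one prize add up to 1. The sketch below assumes the shares are stored as "numerator/denominator" strings, as the (1/2, 1/4, 1/4) notation suggests.

```r
# Shares of one prize split between three laureates
shares <- c("1/2", "1/4", "1/4")

# Convert each "n/d" string to a number
share_value <- sapply(strsplit(shares, "/"),
                      function(p) as.numeric(p[1]) / as.numeric(p[2]))

sum(share_value)  # 1
```

Applied per Year and Category, a sum other than 1 would point to missing or duplicated laureate rows.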

#### Prize Share by each Category

In [None]:
#distribution of laureates across prizeshares and categories
table(data$Category,data$Prize.Share)

In [None]:
 data %>% ggplot(aes(x=Prize.Share,group=Category))+facet_wrap(~Category)+geom_bar(fill=rainbow(n=1),alpha=0.5)+ theme_solarized()+ 
xlab("Prize Share") + ylab("Count of laureates per Prize Share")+
  geom_text(stat = 'count',aes(label =..count.., vjust = -0.2))+
    theme(axis.text.x=element_text(vjust=0.5),legend.position='none',plot.title = element_text(size=12)) +labs(title="Laureates per Prize Share")

The smallest prize share is 1/4. Physics, Medicine and Chemistry have had laureates with a 1/4 share; as explained above, this comes from a (1/2, 1/4, 1/4) split among three laureates, not from four recipients. Literature, on the other hand, has the largest number of single prize recipients.

#### People and Organizations that have received the awards multiple times

In [None]:
data %>% group_by(Full.Name)%>%summarize(Prize_Count=n())%>%filter(Prize_Count>1)%>%arrange(desc(Prize_Count))%>%head(10)

There are 6 different individuals (or organizations) that have won more than one award when all categories are counted together. Let us see whether any of them won more than once in the same category.

In [None]:
data %>% group_by(Full.Name, Category)%>%summarize(Prize_Count=n())%>%filter(Prize_Count>1)%>%arrange(desc(Prize_Count))%>%head(10)

There are 4 such cases, as shown above. Let us see what we find when grouping by organization name.

In [None]:
data %>% group_by(Organization.Name)%>%summarize(Prize_Count=n())%>%filter(Prize_Count>1 & Organization.Name!="")%>%arrange(desc(Prize_Count))%>%head(10)

The University of California has the most Nobel Prizes, with 31. Harvard University is next. Let us look at the category-wise split.

In [None]:
data %>% group_by(Organization.Name, Category)%>%summarize(Prize_Count=n())%>%filter(Prize_Count>1 & Organization.Name!="")%>%arrange(desc(Prize_Count))%>%head(10)

The University of California is still at the top, with 13 prizes in Chemistry. Next is the University of Chicago with 12 prizes in Economics. Stanford University and Harvard University also feature in this top 10 list, with 9 and 8 Nobel Prizes in Physics respectively.

In [None]:
org_prizes<-data %>% group_by(Organization.Name, Category)%>%summarize(Prize_Count=n()) %>%filter(Prize_Count>1 & Organization.Name!="")
#org_prizes

op_select <- org_prizes[org_prizes$Prize_Count >=10, ]

ggplot(org_prizes, aes(x=Organization.Name, y=Prize_Count)) + 
  geom_point(aes(col=Category, size=Prize_Count)) +   # draw points
  geom_smooth(method="loess", se=F) + 
  ylim(c(0, 20)) + 
  geom_encircle(aes(x=Organization.Name, y=Prize_Count), 
                data=op_select, 
                color="red", 
                size=1, 
                expand=0.06) +   # encircle
  labs(subtitle="Prize Count Vs Organization", 
       y="Prize Count", 
       x="Organization", 
       title="Organizations with multiple Nobel Prizes", 
       caption="Source: Nobel Prize Dataset")+
theme(axis.text.x = element_text(angle=90, hjust=1)) 

Clearly, no organization has won multiple Nobel Prizes in the Literature category. The three highest counts for an organization in a single category are:
* University of California, Chemistry: 13
* University of Chicago, Economics: 12
* University of California, Physics: 10

(I don't know why there are 2 circles marking those 3 highest points. If someone can help me figure that out, it would be great!)
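
One possible explanation for the two circles: by default ggplot2 forms groups from the discrete variables mapped in a layer, so with `Organization.Name` on the x axis the three points fall into two groups (one per organization), and `geom_encircle()` would then draw one outline per group. The sketch below uses a toy stand-in for `op_select` and plain `geom_point()` to show the default grouping and how `group = 1` collapses it. Whether adding `group = 1` to the `aes()` of `geom_encircle()` then yields the single outline intended is an assumption that has not been verified here.

```r
library(ggplot2)

# Toy stand-in for op_select, using the three values listed above
top3 <- data.frame(
  Organization.Name = c("University of California", "University of California",
                        "University of Chicago"),
  Prize_Count       = c(13, 10, 12)
)

# Default grouping: ggplot2 creates one group per discrete x value
p_default <- ggplot(top3, aes(x = Organization.Name, y = Prize_Count)) +
  geom_point()
groups_default <- unique(layer_data(p_default)$group)

# Explicit grouping: group = 1 places all rows in a single group
p_single <- ggplot(top3, aes(x = Organization.Name, y = Prize_Count, group = 1)) +
  geom_point()
groups_single <- unique(layer_data(p_single)$group)

length(groups_default)  # number of groups without group = 1
length(groups_single)   # number of groups with group = 1
```

If the grouping is indeed the cause, the University of California points (same x position, two categories) form one group and the University of Chicago point forms the other.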


### The first evers... and only evers...

In this section we will look at the first-ever awards and some other interesting facts about the Nobel Prizes.

#### The first ever female Nobel Prize winner

In [None]:
data %>% filter(Sex=="Female")%>%select(Year,Full.Name,Category,Birth.Country,Prize.Share)%>%top_n(1,desc(Year))

It was Marie Curie, who won a 1/4 share of the 1903 Nobel Prize in Physics. Let us now look at the first female winner in each category.

#### First Ever female winner in every category

In [None]:
data %>% filter(Sex=="Female")%>%group_by(Category)%>%select(Year,Full.Name,Category,Birth.Country,Prize.Share)%>%top_n(1,desc(Year))

Interestingly, no female had won a Nobel Prize in Economics until Elinor Ostrom did so in 2009.
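
A note on the idiom used in these cells: `top_n(1, desc(Year))` returns the earliest year because ranking by `desc(Year)` puts the smallest year on top. In dplyr 1.0 and later, `slice_min()` states the same intent more directly. The sketch below runs on a small hand-made data frame (not the notebook's `data`), so the name strings are illustrative and may not match the dataset's spelling.

```r
library(dplyr)

# Toy rows; the real dataset has many more columns and rows
toy <- data.frame(
  Year      = c(1903, 1911, 1963, 2009),
  Full.Name = c("Marie Curie", "Marie Curie", "Maria Goeppert Mayer", "Elinor Ostrom"),
  Category  = c("Physics", "Chemistry", "Physics", "Economics"),
  stringsAsFactors = FALSE
)

# Earliest row per category: equivalent in spirit to top_n(1, desc(Year))
first_by_category <- toy %>%
  group_by(Category) %>%
  slice_min(Year, n = 1) %>%
  ungroup() %>%
  arrange(Year)

print(first_by_category)
```

Like `top_n()`, `slice_min()` keeps ties by default, so a year with several first-time winners in one category returns all of them.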

#### First ever male winner in every category

In [None]:
data %>% filter(Sex=="Male")%>%group_by(Category)%>%select(Year,Full.Name,Category,Birth.Country,Prize.Share)%>%top_n(1,desc(Year))

#### Only ever female to win multiple prizes

In [None]:
data %>% filter(Sex=="Female")%>%group_by(Sex,Full.Name)%>%summarize(Prize_Count=n())%>%select(Full.Name,Sex,Prize_Count)%>%filter(Prize_Count>1)

The only female laureate ever to win two Nobel Prizes is Marie Curie, who won in two different categories: Physics and Chemistry.

#### The (only) males to win multiple prizes

In [None]:
data %>% filter(Sex=="Male")%>%group_by(Sex,Full.Name)%>%summarize(Prize_Count=n())%>%select(Full.Name,Sex,Prize_Count)%>%filter(Prize_Count>1)

Three male laureates have each won two Nobel Prizes when all categories are counted together.

In [None]:
data %>% filter(Sex=="Male")%>%group_by(Sex,Full.Name,Category)%>%summarize(Prize_Count=n())%>%select(Full.Name,Sex,Prize_Count,Category)%>%filter(Prize_Count>1)

Two of them, Frederick Sanger and John Bardeen, won the Nobel Prize twice in the same category: Chemistry and Physics respectively.
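
The difference between the two groupings used above (by name only versus by name and category) can be seen on a small hand-made example. This is a sketch on toy rows, not the notebook's `data`; `count()` is used here as a shorthand for `group_by()` followed by `summarize(n())`.

```r
library(dplyr)

# Toy rows: one row per prize won
toy <- data.frame(
  Full.Name = c("Marie Curie", "Marie Curie", "John Bardeen", "John Bardeen",
                "Frederick Sanger", "Frederick Sanger",
                "Linus Pauling", "Linus Pauling", "Elinor Ostrom"),
  Category  = c("Physics", "Chemistry", "Physics", "Physics",
                "Chemistry", "Chemistry",
                "Chemistry", "Peace", "Economics"),
  stringsAsFactors = FALSE
)

# More than one prize overall, regardless of category
multi_any <- toy %>%
  count(Full.Name, name = "Prize_Count") %>%
  filter(Prize_Count > 1)

# More than one prize within the same category
multi_same <- toy %>%
  count(Full.Name, Category, name = "Prize_Count") %>%
  filter(Prize_Count > 1)

print(multi_any)
print(multi_same)
```

Laureates who appear in `multi_any` but not in `multi_same` are the ones whose prizes span different categories.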

### Motivations
(back to [Index](#Index))

#### Word Clouds and overview by category

In [None]:
data$Motivation<-as.character(data$Motivation)
motivation<-data %>%unnest_tokens(word,Motivation)%>%anti_join(stop_words,by="word")
motivation$lemm_words<- lemmatize_words(motivation$word, dictionary = lexicon::hash_lemmas)
#motive

In [None]:
motivation%>%
  count(lemm_words, sort = TRUE) %>%
  head(20) %>%   # rows are already sorted by count; exactly 20 rows keeps fill = plasma(n=20) valid
  mutate(lemm_words = reorder(lemm_words, n)) %>%
  ggplot() +
    geom_col(aes(lemm_words, n), fill = plasma(n=20)) +
    theme_solarized() +   # complete theme first, so the theme() tweaks below are not overwritten
    theme(legend.position = "none", 
          plot.title = element_text(hjust = 0.5),
          panel.grid.major = element_blank()) +
    xlab("Words") +
    ylab("Word Count") +
    ggtitle("Most Frequently Used Words by Nobel Laureates") +
    coord_flip()

Discovery and its variants are the most common words in the motivation texts.

In [None]:
#par(mfrow=c(2,3)) # tried putting it in a grid but a lot of words are not fitting so plotting every wordcloud 
#in a single row
nword <- motivation %>%
group_by (Category)%>%
count(lemm_words, sort = TRUE) 

#Medicine
nword_med<-nword %>% filter(Category=="Medicine")
w1<-wordcloud(words = nword_med$lemm_words, freq = nword_med$n, min.freq = 3,
         max.words=200,  
           colors=rainbow(200))

#Physics
nword_phys<-nword %>% filter(Category=="Physics")
w2<-wordcloud(words = nword_phys$lemm_words, freq = nword_phys$n, min.freq = 3,
         max.words=200,  
           colors=rainbow(200))

#Chemistry
nword_chem<-nword %>% filter(Category=="Chemistry")
w3<-wordcloud(words = nword_chem$lemm_words, freq = nword_chem$n, min.freq = 3,
         max.words=200,  
           colors=rainbow(200))

#Economics
nword_eco<-nword %>% filter(Category=="Economics")
w4<-wordcloud(words = nword_eco$lemm_words, freq = nword_eco$n, min.freq = 3,
         max.words=200,  
           colors=rainbow(200))

#Literature
nword_lit<-nword %>% filter(Category=="Literature")
w5<-wordcloud(words = nword_lit$lemm_words, freq = nword_lit$n, min.freq = 3,
         max.words=200, 
           colors=rainbow(200))

#Peace
nword_peace<-nword %>% filter(Category=="Peace")
w6<-wordcloud(words = nword_peace$lemm_words, freq = nword_peace$n, min.freq = 3,
         max.words=200,  
           colors=rainbow(200))

The wordcloud is organized by the frequency of the words. The minimum frequency for a word to occur on the wordcloud is set to 3.

Note: Discovery is a prominent word in Chemistry too. It could not be displayed because of its size.

#### Vocabulary

In [None]:
diversity_matrix<-qdap::diversity(motivation$lemm_words, motivation$Category)
diversity_matrix

In [None]:
plot(diversity_matrix, high = "red", low = "yellow", values = TRUE)

Interpreting the outputs:
Diversity statistics measure language "richness", that is, how expansive a speaker's vocabulary is. The results indicate similar richness of vocabulary across categories. This should not be surprising, as all of the text relates to Nobel Prize winning work.
1. wc: The word count for each category. Note that stop words have already been removed and the remaining key words lemmatized. A higher value means more text, which usually goes along with a larger vocabulary.
2. Shannon index: One of the probabilistic indices; it quantifies the diversity of the text based on the log of probabilities (or proportions). The more distinct words there are, and the more equal their proportional abundances, the harder it is to predict which word comes next. The higher the Shannon index, the higher the diversity.
3. Simpson index: In its basic form, Simpson's index is a similarity (dominance) index: the higher the value, the lower the diversity. To use it as a diversity index, subtract it from 1 (i.e. 1 - Simpson index). It gives more weight to dominant (more frequent) words.
4. Berger-Parker index: Equals the maximum probability value in the dataset, i.e. the proportional abundance of the most abundant word. It too is weighted towards the most frequent words.
5. Collision entropy: Also known as the Rényi entropy of order 2, this is a generalized measure of probabilistic information content. The higher the collision entropy, the higher the word diversity.
6. Brillouin index: Similar to the Shannon index, but it treats the data as the complete population rather than as a sample drawn from it.
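
To make the indices above concrete, here is a minimal base R sketch that computes them for a toy word-frequency vector. The counts are made up, and the formulas are the common textbook definitions with natural logarithms; `qdap::diversity()` may use different conventions (for example, it may report the Simpson statistic in the 1 - D form), so its help page is the authority for the notebook's actual output.

```r
# Hypothetical word counts for one category
counts <- c(discovery = 5, particle = 3, theory = 1, structure = 1)
N <- sum(counts)   # total number of words
p <- counts / N    # proportional abundance of each word

shannon       <- -sum(p * log(p))                            # Shannon index
simpson_dom   <- sum(counts * (counts - 1)) / (N * (N - 1))  # Simpson, dominance form
simpson_div   <- 1 - simpson_dom                             # Simpson, diversity form
berger_parker <- max(p)                                      # share of the most frequent word
collision     <- -log(sum(p^2))                              # Renyi entropy of order 2
brillouin     <- (lfactorial(N) - sum(lfactorial(counts))) / N

indices <- c(shannon = shannon, simpson_dom = simpson_dom, simpson_div = simpson_div,
             berger_parker = berger_parker, collision = collision, brillouin = brillouin)
print(round(indices, 3))
```

Making the counts more even (for example `c(3, 3, 2, 2)`) raises the Shannon index and the collision entropy and lowers the Berger-Parker index, which is the behaviour the descriptions above rely on.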

Physics has the highest word count, so its motivations contain the most words after stop-word removal. Literature uses the most diverse set of words, as it has both the highest Shannon index and the highest collision entropy.

In [None]:
pol <- polarity(motivation$lemm_words, motivation$Category)
plot(pol)

Sentiment, or polarity, can be studied by category. Medicine seems to have a slightly negative polarity, which is understandable: these motivations focus on curing ailments and health issues, which read as negative themes. The other categories show positive sentiment, all in a very similar range.

#### Word Pairs

In [None]:
title_word_pairs <- motivation%>% 
  pairwise_count(lemm_words,Full.Name, sort = TRUE, upper = FALSE)

set.seed(1234)
title_word_pairs %>%
  filter(n >= 8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "red3") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()

#### Bigrams

In [None]:
motivation_bigrams <- data %>%
  unnest_tokens(bigram, Motivation, token = "ngrams", n = 2)

bigrams_separated <- motivation_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) 
  

bigram_united <- bigrams_filtered %>%
  filter(word1 != word2) %>%
    unite(bigram, word1, word2, sep = " ")
bigram_counts <- bigram_united %>% 
group_by (Category)%>%
  count(bigram, sort = TRUE)


bigram_counts %>% arrange(desc(n))%>% head(20)%>%ggplot(aes(x=reorder(bigram,n),y=n,group=Category))+
facet_wrap(~Category)+
geom_bar(stat="identity",fill=heat.colors(n=20))+theme_fivethirtyeight()+labs(title="Most Frequently Used bigrams by Nobel Laureates across various categories")+coord_flip()+theme(legend.position = "none", 
          plot.title = element_text(hjust = 0.5),
          panel.grid.major = element_blank()) +
    xlab("") + 
    ylab("Word Count") 

What we see here are the most frequent bigrams in each category. For instance, in Physics, 'pioneering contributions' and 'elementary particles' are prominent, whereas in Chemistry 'organic synthesis' is prominent.
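
For readers unfamiliar with bigrams, the sketch below shows in base R how consecutive word pairs are formed and then filtered for stop words. It is a simplified illustration, not a reimplementation of `unnest_tokens()`: `make_bigrams()` and `toy_stop_words` are hypothetical, the input phrase is made up, and the tokenizer only handles plain ASCII letters.

```r
# Hypothetical helper: split text into lowercase words and pair each word with the next
make_bigrams <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Made-up phrase in the style of a prize motivation
example_bigrams <- make_bigrams(
  "for pioneering contributions to the theory of elementary particles"
)

# Tiny stand-in for tidytext's stop_words table
toy_stop_words <- c("for", "to", "the", "of")

# Keep only bigrams in which neither word is a stop word
parts    <- strsplit(example_bigrams, " ", fixed = TRUE)
keep     <- vapply(parts, function(b) !any(b %in% toy_stop_words), logical(1))
filtered <- example_bigrams[keep]

print(example_bigrams)
print(filtered)
```

Filtering after pairing, as the notebook does, means a stop word removes both bigrams it takes part in instead of silently joining words that were not adjacent in the original text.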

In [None]:
# nobel_winner_all_pubs %>% 
#   distinct(category)

# min(nobel_winner_all_pubs$prize_year)