# Course Project
## CSEN 1095 - Data Engineering
### German University in Cairo

Collaborators:
- Nada Hammouda
- Aya Ibrahim
- Habiba ElHussein
- Youssef Tarek - 37-3865

<div style="background-color:rgba(0, 0, 0, 0.6); text-align:center; vertical-align: middle; padding:40px 0;color:rgb(255,255,255);">
    <h1>Visual History of Nobel Prize Winners</h1>
</div>



- Project Website: https://yousseftarekkh.github.io/de-noble-prizes/
- Github: https://github.com/yousseftarekkh/de-noble-prizes/

### Overview & Motivation

This project aims to apply several steps of data refactoring and cleaning in order to visually analyze and introduce potential data fixes to the acquired data set. Such steps can help further identify relations, recognize hidden patterns in different countries, pay attention to trending categories and generally reach conclusions and answers about many proposed questions among the data records which in our case represents **<span style="color:green">Nobel Prize winners</span>**.

Initially, we used the following data set found at https://www.datacamp.com/projects/441. Additional data sets may be used in order to succesfully form a tidy data with less missing values and for the sake of adding more valuable records to the existing data set. All used references will be included in the next section.

### Related Work

We have been influenced by a lot of work after a research we conducted on this particular topic and we found the following sites interesting for a certain cause.
- https://www.datacamp.com/projects/441 - The mentioned project tasks included some interesting questions we needed to find answers to, thus we decided to re-engineer the data to properly formulate answers.
- https://www.kaggle.com/devisangeetha/nobel-prize-winners-story - Inspired on how well and organized that research was and the outcome that this research helped discovering.
- https://www.nobelprize.org/prizes/facts/nobel-prize-facts/ - Website contains a large bulk of Nobel Prize winners facts including both interesting and shocking events in the history.

### Questions

The main questions that motivated that project were:
1. 1
2. 2
3. 3


In [3]:
# modules we'll use
import pandas as pd
import numpy as np

# read in all our data
nps_df = pd.read_csv("data/archive.csv")

nps_df.sample(4)


Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
213,1939,Physics,The Nobel Prize in Physics 1939,"""for the invention and development of the cycl...",1/1,47,Individual,Ernest Orlando Lawrence,1901-08-08,"Canton, SD",United States of America,Male,University of California,"Berkeley, CA",United States of America,1958-08-27,"Palo Alto, CA",United States of America
472,1975,Economics,The Sveriges Riksbank Prize in Economic Scienc...,"""for their contributions to the theory of opti...",1/2,687,Individual,Tjalling C. Koopmans,1910-08-28,'s Graveland,Netherlands,Male,Yale University,"New Haven, CT",United States of America,1985-02-26,"New Haven, CT",United States of America
53,1909,Medicine,The Nobel Prize in Physiology or Medicine 1909,"""for his work on the physiology, pathology and...",1/1,303,Individual,Emil Theodor Kocher,1841-08-25,Berne,Switzerland,Male,Berne University,Berne,Switzerland,1917-07-27,Berne,Switzerland
256,1949,Peace,The Nobel Peace Prize 1949,,1/1,510,Individual,Lord (John) Boyd Orr of Brechin,1880-09-23,Kilmaurs,Scotland,Male,,,,1971-06-25,Edzell,Scotland


In [22]:
missing_values_count = nps_df.isnull().sum()
missing_values_count

Year                      0
Category                  0
Prize                     0
Motivation               88
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date               29
Birth City               28
Birth Country            26
Sex                      26
Organization Name       247
Organization City       253
Organization Country    253
Death Date              352
Death City              370
Death Country           364
dtype: int64

In [23]:
# How many total missing values do we have?
# shape returns the dimentionality of a dataframe (rows and columns), can you guess what product will do?
total_cells_nfl = np.product(nps_df.shape) 
total_missing_nfl = missing_values_count.sum()

# percent of data that is missing
percentage_missign_values_nfl = (total_missing_nfl/total_cells_nfl) * 100
print(percentage_missign_values_nfl)

11.672973282880404


In [24]:
nps_df[nps_df['Sex'].isnull()].head()

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
24,1904,Peace,The Nobel Peace Prize 1904,,1/1,467,Organization,Institut de droit international (Institute of ...,,,,,,,,,,
61,1910,Peace,The Nobel Peace Prize 1910,,1/1,477,Organization,Bureau international permanent de la Paix (Per...,,,,,,,,,,
90,1917,Peace,The Nobel Peace Prize 1917,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
206,1938,Peace,The Nobel Peace Prize 1938,,1/1,503,Organization,Office international Nansen pour les Réfugiés ...,,,,,,,,,,
222,1944,Peace,The Nobel Peace Prize 1944,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,


In [25]:
cleaned_nps_df = nps_df.copy()

#Changing the null values of the 'Sex' column to 'Organization'
cleaned_nps_df.loc[cleaned_nps_df['Sex'].isnull() & (cleaned_nps_df['Laureate Type'] == 'Organization'), 'Sex'] = 'Organization'

#Changing the null values of the 'Birth Date' column to 'none' as it doesnot exist
cleaned_nps_df.loc[cleaned_nps_df['Birth Date'].isnull() & (cleaned_nps_df['Laureate Type'] == 'Organization') & 
         (cleaned_nps_df['Sex'] == 'Organization'), 'Birth Date'] = 'none'

#Changing the null values of the 'Birth City' column to 'none' as it doesnot exist
cleaned_nps_df.loc[cleaned_nps_df['Birth City'].isnull() & (cleaned_nps_df['Laureate Type'] == 'Organization') & 
         (cleaned_nps_df['Sex'] == 'Organization'), 'Birth City'] = 'none'


#Changing the null values of the 'Birth Country' column to 'none' as it doesnot exist
cleaned_nps_df.loc[cleaned_nps_df['Birth Country'].isnull() & (cleaned_nps_df['Laureate Type'] == 'Organization') & 
         (cleaned_nps_df['Sex'] == 'Organization'), 'Birth Country'] = 'none'

missing_values_count = cleaned_nps_df.isnull().sum()
missing_values_count


Year                      0
Category                  0
Prize                     0
Motivation               88
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                3
Birth City                2
Birth Country             0
Sex                       0
Organization Name       247
Organization City       253
Organization Country    253
Death Date              352
Death City              370
Death Country           364
dtype: int64

In [26]:

cleaned_nps_df.loc[cleaned_nps_df['Organization Country'].isnull(), 'Organization City'].unique()

array([nan, 'Tunis'], dtype=object)

In [27]:
cleaned_nps_df.loc[cleaned_nps_df['Organization City'] == 'Tunis', 'Organization Country'] = 'Tunis'

missing_values_count = cleaned_nps_df.isnull().sum()
missing_values_count


Year                      0
Category                  0
Prize                     0
Motivation               88
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                3
Birth City                2
Birth Country             0
Sex                       0
Organization Name       247
Organization City       253
Organization Country    252
Death Date              352
Death City              370
Death Country           364
dtype: int64

In [28]:
#Imputing the 'Organization Name' from 'Full Name' based on the fact that the name of organization is the 'Full Name' when the laureate type is 'Organization'
cleaned_nps_df.loc[temp['Organization Name'].isnull() & (cleaned_nps_df['Laureate Type'] == 'Organization') &
         (cleaned_nps_df['Sex'] == 'Organization'), 'Organization Name'] = cleaned_nps_df['Full Name']

missing_values_count = cleaned_nps_df.isnull().sum()
missing_values_count

Year                      0
Category                  0
Prize                     0
Motivation               88
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                3
Birth City                2
Birth Country             0
Sex                       0
Organization Name       247
Organization City       253
Organization Country    252
Death Date              352
Death City              370
Death Country           364
dtype: int64