# Data Cleaning

We located a dataset in the Harvard Dataverse Repository containing data on the characteristics of individual Nobel Prize laureates. We downloaded the initial .csv file and then followed the process below to clean the data and turn it into a dataframe suitable for our research:

**Step 1:** Imported libraries for our data analysis.

In [1]:
import pandas as pd 
import numpy as np
import seaborn 
from matplotlib import pyplot
from datetime import datetime, date

**Step 2:** Loaded the CSV file, laureate.csv, into a pandas dataframe called nobel_data_raw and displayed its first five rows with .head() to check that laureate.csv was loaded correctly.

In [2]:
nobel_data_raw = pd.read_csv("laureate.csv")
nobel_data_raw.head()

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year,category,overallMotivation,share,motivation,name,city,country
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,Munich,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,Amsterdam,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,Paris,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France


**Step 3:** Printed out the number of rows and columns in nobel_data_raw to make sure that it matches the number of rows and columns of laureate.csv from the original source.

In [3]:
nobel_data_raw.shape

(975, 20)
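
One way to make this comparison explicit is to count the records in the file independently and compare that count with .shape. The following is a minimal sketch, assuming a small in-memory CSV as a stand-in for laureate.csv (it reuses a few values from the output above; the variable names are ours):

```python
import csv
import io

import pandas as pd

# Stand-in for laureate.csv (assumption); the real file has 975 rows and 20 columns.
sample_csv = io.StringIO(
    "id,firstname,surname,born\n"
    "1,Wilhelm Conrad,Röntgen,1845-03-27\n"
    "2,Hendrik Antoon,Lorentz,1853-07-18\n"
    "3,Pieter,Zeeman,1865-05-25\n"
)

frame = pd.read_csv(sample_csv)

# Count the records independently with the csv module, which handles quoted fields.
sample_csv.seek(0)
records = list(csv.reader(sample_csv))
expected_rows = len(records) - 1   # subtract the header row
expected_cols = len(records[0])

shape_matches = frame.shape == (expected_rows, expected_cols)
print(frame.shape, shape_matches)
```

With the real file, the same comparison could be run by opening laureate.csv instead of the in-memory sample.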

**Step 4:** Printed out the type of each column so that we know if we need to convert to another type in our upcoming data analysis.

In [4]:
nobel_data_raw.dtypes

id                     int64
firstname             object
surname               object
born                  object
died                  object
bornCountry           object
bornCountryCode       object
bornCity              object
diedCountry           object
diedCountryCode       object
diedCity              object
gender                object
year                 float64
category              object
overallMotivation     object
share                float64
motivation            object
name                  object
city                  object
country               object
dtype: object

**Step 5:** We converted the values in the 'born' and 'died' columns to datetime. Values that could not be parsed as dates, such as "0000-00-00", were coerced to NaT (errors = 'coerce'). Note that nobel_data is assigned from nobel_data_raw without copying, so both names refer to the same dataframe and the conversion applies to nobel_data_raw as well. We also printed the head of the dataframe to check that the columns were displaying datetime and NaT values properly.

In [5]:
nobel_data = nobel_data_raw
nobel_data['born'] = pd.to_datetime(nobel_data_raw['born'], format = '%Y-%m-%d', errors = 'coerce')
nobel_data['died'] = pd.to_datetime(nobel_data_raw['died'], format = '%Y-%m-%d', errors = 'coerce')
nobel_data.head()

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year,category,overallMotivation,share,motivation,name,city,country
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,Munich,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,Amsterdam,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,Paris,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France
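
To illustrate how errors = 'coerce' behaves, here is a minimal sketch on a few illustrative values (not a sample of the real data). It also works on an explicit .copy(), which is one option if the raw strings need to be preserved; the names raw, clean, and n_unparsed are ours:

```python
import pandas as pd

# Illustrative values only; "0000-00-00" is the placeholder mentioned in Step 5.
raw = pd.DataFrame({"born": ["1845-03-27", "0000-00-00", None]})

# Work on an explicit copy so the raw frame keeps its original strings.
clean = raw.copy()
clean["born"] = pd.to_datetime(clean["born"], format="%Y-%m-%d", errors="coerce")

# Count how many values could not be parsed (or were missing) and became NaT.
n_unparsed = int(clean["born"].isna().sum())
print(clean["born"].dtype, n_unparsed)
```

Counting the NaT values after conversion is a quick way to see how many dates were lost to coercion.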


**Step 6:** We computed each laureate's age in years at the time of their award by subtracting the birth year (extracted from the datetime values in the 'born' column) from the award "year". We stored the result in a new column called "age" because it will help us answer a few of our research questions. We applied np.floor so that the ages are whole numbers; because the calculation uses only calendar years, the result is the age reached during the award year rather than the exact age on the award date. We also used .fillna() to fill missing "age" values with 0 so that future models do not fail on NaN, which means an age of 0 marks a missing age rather than a real one. Finally, we converted the column to integers and used .head() to check that our dataframe includes the new column.

In [6]:
nobel_data['age'] = nobel_data['year'] - nobel_data['born'].dt.year
nobel_data['age'] = nobel_data['age'].apply(np.floor)
nobel_data['age'] = nobel_data['age'].fillna(0)
nobel_data['age'] = nobel_data['age'].astype(int)
nobel_data.head()

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,...,gender,year,category,overallMotivation,share,motivation,name,city,country,age
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,...,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany,56
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands,49
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands,37
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,...,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France,51
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,...,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France,44
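
An alternative to filling missing ages with 0 is to keep them as missing values in a nullable integer column. This is not what we did above; it is a minimal sketch on illustrative rows, assuming pandas' nullable "Int64" dtype is acceptable for the later modelling steps:

```python
import pandas as pd

# Illustrative rows: one individual and one record with no birth date.
df = pd.DataFrame({
    "year": [1901.0, 1917.0],
    "born": pd.to_datetime(["1845-03-27", None]),
})

# Award year minus birth year, stored as a nullable integer so missing stays <NA>.
df["age"] = (df["year"] - df["born"].dt.year).astype("Int64")

# Rows with a known age can then be selected without a sentinel value such as 0.
known_age = df[df["age"].notna()]
print(df["age"].tolist(), len(known_age))
```

The trade-off is that any code consuming "age" must handle missing values explicitly rather than relying on 0.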


**Step 7:** During this early analysis, we discovered six rows in our dataframe (index labels 932 to 937) that were empty of all data (except for gender: male) and did not correspond to any collection gaps that we could ascertain. We concluded that these rows were erroneously included and dropped them from the dataframe.

In [7]:
nobel_data = nobel_data.drop([932, 933, 934, 935, 936, 937])
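
The index labels above were identified by inspection. Rows like these could also be located programmatically; the following is a minimal sketch on an illustrative frame, where check_cols is an assumed subset of columns that every genuine record should have filled in:

```python
import pandas as pd

# Illustrative frame: row 1 mimics a record that is empty apart from id and gender.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "surname": ["Röntgen", None, "Zeeman"],
    "year": [1901.0, None, 1902.0],
    "category": ["physics", None, "physics"],
    "gender": ["male", "male", "male"],
})

# Assumed subset of columns; adjust to the columns of the real dataframe.
check_cols = ["surname", "year", "category"]

# A row counts as empty only if every checked column is missing.
empty_mask = df[check_cols].isna().all(axis=1)
empty_labels = df.index[empty_mask].tolist()

cleaned = df.drop(index=empty_labels)
print(empty_labels, cleaned.shape)
```

Requiring all checked columns to be missing keeps records that lack only some fields, such as a surname.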

**Step 8:** Created a new dataframe, nobel_data_valid, that excludes every row whose "age" value is 0, so that we can build models on the "age" column without laureates whose ages are missing. Most of these rows represent organizational winners rather than individuals, so age would not be relevant to our analysis in those cases.

In [8]:
nobel_data_valid = nobel_data.loc[nobel_data['age'] != 0]
nobel_data_valid.head()

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,...,gender,year,category,overallMotivation,share,motivation,name,city,country,age
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,...,male,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany,56
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands,49
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,...,male,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands,37
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,...,male,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France,51
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,...,male,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France,44
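
The filtering pattern can be sketched on a small illustrative frame. The sketch builds the boolean mask from the same dataframe that is being filtered and takes a .copy() of the result; the "age_group" column is hypothetical and only shows that the subset can be modified without touching the original:

```python
import pandas as pd

# Illustrative frame; an age of 0 marks a missing age, as in Step 6.
df = pd.DataFrame({
    "firstname": ["Wilhelm Conrad", "(organization)", "Pierre"],
    "age": [56, 0, 44],
})

# Build the mask from the frame being filtered so the index labels always line up.
mask = df["age"] != 0

# .copy() makes the subset an independent frame that can be modified safely.
valid = df.loc[mask].copy()

# Hypothetical derived column: the decade of life at the time of the award.
valid["age_group"] = (valid["age"] // 10) * 10
print(valid)
```

Taking a copy is optional, but it avoids ambiguity about whether later assignments affect the original dataframe.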


**Step 9:** Printed the column types again to confirm that the conversions made above are reflected in our dataframe. This cell was run after the cell in Step 10, so the output also lists the "anglosphere" column created there.

In [17]:
nobel_data.dtypes

id                            int64
firstname                    object
surname                      object
born                 datetime64[ns]
died                 datetime64[ns]
bornCountry                  object
bornCountryCode              object
bornCity                     object
diedCountry                  object
diedCountryCode              object
diedCity                     object
gender                       object
year                        float64
category                     object
overallMotivation            object
share                       float64
motivation                   object
name                         object
city                         object
country                      object
age                           int32
anglosphere                   int64
dtype: object

**Step 10:** Created an additional column, "anglosphere", flagging whether each laureate was born in one of the core Anglosphere nations (Australia, Canada, New Zealand, the UK, and the US), based on the 'bornCountryCode' column: 1 if so, 0 otherwise.

In [16]:
nobel_data["anglosphere"] = [1 if (x in ['US', 'GB', 'AU', 'CA', 'NZ']) else 0 for x in nobel_data['bornCountryCode']]
nobel_data.head()

Unnamed: 0,id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,...,year,category,overallMotivation,share,motivation,name,city,country,age,anglosphere
0,1,Wilhelm Conrad,Röntgen,1845-03-27,1923-02-10,Prussia (now Germany),DE,Lennep (now Remscheid),Germany,DE,...,1901.0,physics,,1.0,"""in recognition of the extraordinary services ...",Munich University,Munich,Germany,56,0
1,2,Hendrik Antoon,Lorentz,1853-07-18,1928-02-04,the Netherlands,NL,Arnhem,the Netherlands,NL,...,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Leiden University,Leiden,the Netherlands,49,0
2,3,Pieter,Zeeman,1865-05-25,1943-10-09,the Netherlands,NL,Zonnemaire,the Netherlands,NL,...,1902.0,physics,,2.0,"""in recognition of the extraordinary service t...",Amsterdam University,Amsterdam,the Netherlands,37,0
3,4,Antoine Henri,Becquerel,1852-12-15,1908-08-25,France,FR,Paris,France,FR,...,1903.0,physics,,2.0,"""in recognition of the extraordinary services ...",École Polytechnique,Paris,France,51,0
4,5,Pierre,Curie,1859-05-15,1906-04-19,France,FR,Paris,France,FR,...,1903.0,physics,,4.0,"""in recognition of the extraordinary services ...",École municipale de physique et de chimie indu...,Paris,France,44,0
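
The same flag can also be computed without a Python-level loop. The following is a minimal sketch using Series.isin on illustrative country codes; the constant name ANGLOSPHERE_CODES is ours, and the explicit "int64" cast is an assumption chosen to match the dtype shown in Step 9:

```python
import pandas as pd

# Country codes for the core Anglosphere nations, as listed in Step 10.
ANGLOSPHERE_CODES = ["US", "GB", "AU", "CA", "NZ"]

# Illustrative codes, including a missing value.
df = pd.DataFrame({"bornCountryCode": ["DE", "US", None, "GB"]})

# isin returns booleans (False for missing codes); cast them to 1/0.
df["anglosphere"] = df["bornCountryCode"].isin(ANGLOSPHERE_CODES).astype("int64")

# Equivalent list comprehension, for comparison with the cell above.
by_loop = [1 if code in ANGLOSPHERE_CODES else 0 for code in df["bornCountryCode"]]
print(df["anglosphere"].tolist(), by_loop)
```

Both versions treat a missing country code as 0, so laureates without a recorded birth country are counted as outside the Anglosphere.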
