# Homework 1: Scraping the Web, Nobel Prize Laureates

## Introduction

We are going to scrape the web to extract information about the Nobel Prize laureates. This homework is designed to familiarize you with some of the Python data structures.

## Getting the data

We are going to get the data of all the [Nobel Prize laureates in Physics from Wikipedia](https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics). I wrote a small web parser that loads the table into a pandas DataFrame. You do not need to fully understand how it works, but it does not hurt to try! It relies on the [`httplib2`](https://github.com/httplib2/httplib2) and [`bs4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries. Be sure to install them:

>In a terminal window, type:
```
source activate YOUR_ENVIRONMENT
pip install httplib2
pip install bs4
# The parser below asks BeautifulSoup for the "lxml" backend, so install it too
pip install lxml
source deactivate
```

In [None]:
# Render the plots inline in the Jupyter notebook
%matplotlib inline
# Import the necessary libraries
import pandas as pd
import numpy as np
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        return table_df.replace({"Year":{'':np.nan}})
        
    def parse_row(self, row):     
        # Keep the stripped text of every non-empty data cell
        columns = row.find_all("td")
        return [col.get_text().strip() for col in columns if col.get_text() != ""]
    
    def parse_header(self, row):     
        # Keep the stripped text of every non-empty header cell
        columns = row.find_all("th")
        return [col.get_text().strip() for col in columns if col.get_text() != ""]
    
    def clean_table(self, row):
        if not row.iloc[0].isdigit() and row.iloc[0] != '':
            return row.shift(1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()
nobel_df

## Data cleaning

As you can see, the data is a bit messy, so we need to clean it up. We need to:

>- clean the column names by changing them to: "Year", "Laureate", "Country", "Rationale".
- remove the rows where the Nobel Prize was not awarded (the ones with missing values). You can use the [`DataFrame.dropna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method with the argument `subset`.
- fill the missing values in the "Year" and "Rationale" columns. You can use the [`DataFrame.fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method with the argument `method='ffill'` (you can do that here because the rows are ordered by date). In recent pandas versions, where the `method` argument is deprecated, `DataFrame.ffill` does the same thing.
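
The three cleaning steps above can be tried on a toy table first. The sketch below uses invented values and only three columns, so it is an illustration of the calls rather than the solution for `nobel_df`; the name `toy_df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table (all values are invented).
toy_df = pd.DataFrame(
    {
        "Year": ["1901", np.nan, "1903", np.nan],
        "Name": ["A", np.nan, "B", "C"],
        "Why": ["x-rays", np.nan, "radioactivity", np.nan],
    }
)

# Rename every column at once by assigning a list of the same length.
toy_df.columns = ["Year", "Laureate", "Rationale"]

# Drop the rows that have no laureate, i.e. the years without an award.
toy_df = toy_df.dropna(subset=["Laureate"])

# Forward-fill: each missing value takes the last valid value above it.
toy_df = toy_df.ffill()

print(toy_df)
```

Forward-filling is only safe after the empty rows are dropped; otherwise a year without an award would inherit the laureate of the previous year.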

In [None]:
# TODO: clean the column names

# TODO: drop all the rows where the Nobel Prize was not awarded

# TODO: fill the missing values in the "Year" and "Rationale" columns

# Is your data clean?
nobel_df

In [None]:
# Let's check that our data set does not contain missing values anymore
nobel_df.isnull().any()

## Some questions about this data

Let's answer a few questions about this data (with code).

>- How many physicists got a Nobel Prize? Be careful about possible duplicates. You can look at the [`Series.nunique`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html) method.
- How many countries are in this data set? Be careful about possible duplicates.
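
Why duplicates matter can be seen on a tiny hand-made list (not the scraped data): a laureate who won twice appears on two rows, so the row count and the count of distinct names differ. The name `laureates` is only for illustration.

```python
import pandas as pd

# Hand-made list: John Bardeen appears twice because he was awarded twice.
laureates = pd.Series(["Marie Curie", "John Bardeen", "John Bardeen", "Max Planck"])

print(len(laureates))       # number of rows, duplicates included
print(laureates.nunique())  # number of distinct values
```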

In [None]:
# TODO: How many physicists got a Nobel Prize?
physicist_number =  # YOUR CODE

# TODO: How many countries are in this data set?
country_number =  # YOUR CODE

print(physicist_number)
print(country_number)

Maybe you have noticed that some values of the "Country" column contain two countries separated by a newline character (e.g. "Austria-Hungary\n Germany"). Let's try to observe the distribution of countries in this data set.

>- Use the [`pandas.Series.str.split`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html) function to split the column "Country" into a column "Country_list" of lists of countries.
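
As a small illustration of what `str.split` returns, here is a sketch on two hand-written values (the variable names are hypothetical). Note the leading space left on the second country, which is why the notebook calls `.str.strip()` afterwards.

```python
import pandas as pd

# Two hand-written values shaped like the "Country" column.
country = pd.Series(["Germany", "Austria-Hungary\n Germany"])

# Splitting on the newline turns every string into a list of strings.
country_list = country.str.split("\n")

print(country_list.tolist())
```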

In [None]:
# TODO: split the column "Country" into a column "Country_list" of lists of countries
nobel_df["Country_list"] =  # YOUR CODE

# We create a pandas Series from this new column to ease the analysis of the countries. Summing the lists, with an
# empty list as the start value, flattens the list of lists into one list of countries.
countries = pd.Series(sum(nobel_df["Country_list"].tolist(), [])).str.strip()
countries

Let's look at the distribution of countries:
>- Print each country with the number of times it appears in the `countries` pandas Series. You can use the [`pd.Series.value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function.
- Plot a bar plot ordered by those numbers. You can use the [`DataFrame.plot`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) method by changing the argument `kind`.
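
A minimal sketch of the counting step on an invented Series (the names `toy_countries` and `counts` are hypothetical): `value_counts` returns the counts already sorted from most to least frequent.

```python
import pandas as pd

# Invented values standing in for the flattened `countries` Series.
toy_countries = pd.Series(
    ["Germany", "France", "Germany", "United States", "Germany", "France"]
)

# One row per distinct value, sorted by decreasing count.
counts = toy_countries.value_counts()

print(counts)
```

Because the result is already ordered, calling `counts.plot(kind="bar")` on it (with matplotlib installed) draws the bars in that same order.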

In [None]:
# TODO: print the countries and the number of times each one appears in the countries pandas Series, ordered
# by decreasing number of appearances

In [None]:
# TODO: Plot a bar plot ordered by those numbers.

## What type of physics are those physicists practicing?

Let's try to gather some data to understand what type of physics is associated with each of those physicists. Ultimately, we want to extract the words that are characteristic of each physicist.

We extract the links to their Wikipedia pages to get access to their biographies.

In [None]:
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Text", "link"]).drop_duplicates()

link_df

We now need to merge this table with the `nobel_df` table. Use the [`pandas.DataFrame.merge`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) or [`pandas.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to do so.
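
To see how a merge can leave gaps, here is a sketch on two invented tables whose key columns have different names, as is the case for "Laureate" and "Text". All names and links are made up; `how="left"` and `indicator=True` are one possible choice of options, not the required one.

```python
import pandas as pd

# Invented tables: one laureate has no matching link.
left = pd.DataFrame(
    {"Laureate": ["A. Author", "B. Writer", "C. Nolink"],
     "Year": ["1901", "1902", "1903"]}
)
right = pd.DataFrame(
    {"Text": ["A. Author", "B. Writer", "Physics"],
     "link": ["/wiki/A", "/wiki/B", "/wiki/Physics"]}
)

# Keep every row of `left`; `indicator=True` adds a "_merge" column
# telling whether a match was found in `right`.
merged = left.merge(
    right, how="left", left_on="Laureate", right_on="Text", indicator=True
)

# Rows without a match have a missing link and "_merge" == "left_only".
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched["Laureate"].tolist())
```

A left merge keeps every laureate even when no link is found, which makes the missing values easy to spot and fix afterwards.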

In [None]:
# TODO: merge nobel_df and link_df into nobel_merged_df
nobel_merged_df =  # YOUR CODE
nobel_merged_df

>Did the merge work completely? Are there any missing values? If so, correct them.

Now we are going to extract all the words from the Wikipedia page of each of those physicists. The following function `get_text` extracts the text of a Wikipedia page as one long string.

>Use it to extract the text for each physicist into the "Bio" column. Use the function [`apply`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) to run it on every row instead of writing a loop.
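
As a reminder of how `apply` works on a single column, the sketch below calls a small helper on every element of a Series. The helper `build_url` and the links are invented for illustration and make no network request.

```python
import pandas as pd

# Invented relative links, indexed by an invented laureate name.
links = pd.Series(["/wiki/A", "/wiki/B"], index=["A. Author", "B. Writer"])


def build_url(link, root_website="https://en.wikipedia.org"):
    """Hypothetical helper: prepend the site root to a relative link."""
    return root_website + link


# `apply` calls the function once per element and keeps the index.
full_urls = links.apply(build_url)

print(full_urls)
```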

In [None]:
# Extracts the text of the Wikipedia page behind a link
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)

    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

nobel_merged_df.set_index("Laureate", inplace=True)

# TODO: extract the text of the wikipedia page associated to each physicist
nobel_merged_df["Bio"] =  # YOUR CODE
nobel_merged_df["Bio"]

We are going to remove all the punctuation along with the digits and set all the words to lower case. We import the `punctuation` constant from the `string` module.

In [None]:
from string import punctuation
print(punctuation)

Here is an example of how to remove the punctuation and the digits from one bio:

In [None]:
# Work on the first bio only; the DataFrame itself is left untouched
first_bio = nobel_merged_df["Bio"].iloc[0]
for p in punctuation + "1234567890":
    first_bio = first_bio.replace(p, '')
first_bio = first_bio.lower()

first_bio

>Write a function that does this for any string, then use the pandas `apply` to process all the bios.
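
One alternative to replacing the characters one at a time is a translation table, sketched below on an invented sentence. The helper name `strip_punctuation_and_digits` is hypothetical, and this is only one possible approach.

```python
from string import digits, punctuation

# Translation table that deletes every punctuation mark and every digit.
DELETE_TABLE = str.maketrans("", "", punctuation + digits)


def strip_punctuation_and_digits(text):
    """Remove punctuation and digits in one pass, then lower-case."""
    return text.translate(DELETE_TABLE).lower()


sample = "Born in 1879, Einstein won the 1921 Nobel Prize!"
cleaned = strip_punctuation_and_digits(sample)
print(repr(cleaned))
```

Removing a number leaves the spaces on both sides of it in place, so the cleaned text contains runs of two spaces.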

In [None]:
# TODO: write a function that removes the punctuation and the digits and sets every word to lower case
def clean_string(string):
    # TODO: your code goes here
    pass

# TODO: apply this function to the "Bio" column
nobel_merged_df["Bio"] =  # YOUR CODE  
nobel_merged_df["Bio"]

>Use the `str.split` function again to split each text on any whitespace character (i.e. "\s") into the "Bio_list" column.
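
Splitting on a single whitespace character behaves differently from splitting on runs of whitespace. The sketch below uses the standard `re` module on an invented string as a stand-in for the pandas call.

```python
import re

text = "born in  einstein"  # note the two consecutive spaces

# Splitting on each single whitespace character yields an empty string
# between two consecutive spaces.
split_on_each = re.split(r"\s", text)

# Python's `str.split()` without argument splits on runs of whitespace.
split_on_runs = text.split()

print(split_on_each)
print(split_on_runs)
```

In pandas, calling `Series.str.split()` with no pattern also splits on runs of whitespace.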

In [None]:
# TODO: split the "Bio" column as a column of lists of the words 
nobel_merged_df["Bio_list"] =  # YOUR CODE
nobel_merged_df["Bio_list"]

As you can see, there are a lot of empty elements in each of those lists. We can remove them using the [`filter`](http://book.pythontips.com/en/latest/map_filter.html) function or a list comprehension, along with the `apply` function.

>- Write a function that removes unwanted elements (by default `None` and empty strings) from a list.
- Apply that function to the "Bio_list" column.
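
Both approaches mentioned above are sketched here on an invented list, so that their results can be compared; the variable names are only for illustration.

```python
words = ["born", "", "in", None, "ulm"]
unwanted = {None, ""}

# List comprehension: keep the elements that are not unwanted.
kept_with_comprehension = [w for w in words if w not in unwanted]

# `filter` keeps the elements for which the function returns True;
# it returns an iterator, hence the call to `list`.
kept_with_filter = list(filter(lambda w: w not in unwanted, words))

print(kept_with_comprehension)
print(kept_with_filter)
```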

In [None]:
# TODO: Write a function that removes unwanted elements (by default `None` and "") from a list
def remove(list_to_clean, element_to_remove=(None, "")):
    # TODO: your code goes here
    pass

# TODO: apply that function to the "Bio_list" column
nobel_merged_df["Bio_list"] =  # YOUR CODE

We are going to use the [`nltk`](http://www.nltk.org/) library to help us clean this data. Be sure to install the library with `pip` or `conda`.

Use the `nltk.download('stopwords')` function to download the stopwords corpus. As you can see, the stopwords are common English words that do not carry significant information about a specific text.

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
set(stopwords.words('english'))

In each bag of words, we want to capture the words that are characteristic of a specific physicist. Many words in the English language are not useful for that. We call those words stopwords.

> Use your `remove` function to remove those words.

In [None]:
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

# TODO: remove the stop words
nobel_merged_df["Bio_list"] =  # YOUR CODE
nobel_merged_df["Bio_list"] 

>Write a function that removes the words that have only one character

In [None]:
# TODO: write a function that removes the words that have only one character
def remove_one(list_to_clean):
    pass

# TODO: apply this function to the "Bio_list" column
nobel_merged_df["Bio_list"] =  # YOUR CODE
nobel_merged_df["Bio_list"] 

We are going to remove all the words that appear too few times. This is an attempt to filter out the words that are not relevant to the particular physics at play.

> Write a function that removes all the words that occur fewer than a given number of times. It is your job to choose the threshold (if any!) on the number of occurrences that you feel gives satisfying results.
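
Counting occurrences can be done with `collections.Counter` from the standard library. The sketch below uses an invented word list and an arbitrary threshold of 2, which is an assumption and not a recommended value.

```python
from collections import Counter

words = ["quantum", "born", "quantum", "laser", "quantum", "laser"]

# Map each word to the number of times it appears in the list.
counts = Counter(words)

threshold = 2  # arbitrary value chosen for the illustration

# Keep the words that appear at least `threshold` times.
frequent = [w for w in words if counts[w] >= threshold]

print(counts)
print(frequent)
```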

In [None]:
# TODO: write a function that removes all the words that occur fewer than a given number of times
def remove_n_occurrance(list_to_clean, n = 1):
    pass 
 
# TODO: apply this function to the "Bio_list" column
nobel_merged_df["Bio_list"] =  # YOUR CODE
nobel_merged_df["Bio_list"] 

Finally, we are going to keep each word only once and remove the duplicates. You can use the function `set` to do so.

> Write a function that removes the duplicated words.
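
A short sketch of deduplication on an invented list: `set` drops the duplicates but does not keep the original order, while `dict.fromkeys` is a common alternative when the order of first appearance matters.

```python
words = ["quantum", "laser", "quantum", "atom", "laser"]

# A set keeps one copy of each word, in no particular order.
unique_unordered = set(words)

# Dictionary keys are unique and remember their insertion order.
unique_ordered = list(dict.fromkeys(words))

print(unique_unordered)
print(unique_ordered)
```

For the set operations used later in this homework, the order does not matter, so either form works.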

In [None]:
# TODO: write a function that keeps each element of a list only once
def remove_duplicates(list_to_clean):
    pass

# TODO: apply this function to the "Bio_list" column
nobel_merged_df["Bio_list"] =  # YOUR CODE
nobel_merged_df["Bio_list"] 

We are now going to try to guess, from this data, what type of physics those physicists are practicing. [Wikipedia](https://en.wikipedia.org/wiki/Physics) identifies 6 types of physics (to be debated!): [Nuclear physics](https://en.wikipedia.org/wiki/Nuclear_physics), [Particle physics](https://en.wikipedia.org/wiki/Particle_physics), [Atomic, molecular, and optical physics](https://en.wikipedia.org/wiki/Atomic,_molecular,_and_optical_physics), [Condensed matter physics](https://en.wikipedia.org/wiki/Condensed_matter_physics), [Astrophysics](https://en.wikipedia.org/wiki/Astrophysics) and [Physical cosmology](https://en.wikipedia.org/wiki/Physical_cosmology). We are going to get the text data from those pages and look at the set of words shared by two different pages: the page of a physicist and the page of a field.

> Use your previously written functions to clean the content of those Wikipedia pages.

In [None]:
physics_df = pd.DataFrame({"Field": ["Nuclear physics",
                                     "Particle physics", 
                                     "Atomic, molecular, and optical physics", 
                                     "Condensed matter physics", 
                                     "Astrophysics",
                                     "Physical_cosmology"],
                           "link": ["/wiki/Nuclear_physics",
                                     "/wiki/Particle_physics", 
                                     "/wiki/Atomic,_molecular,_and_optical_physics", 
                                     "/wiki/Condensed_matter_physics", 
                                     "/wiki/Astrophysics",
                                     "/wiki/Physical_cosmology"]})

physics_df.set_index("Field", inplace=True)

# TODO: gather and clean the text of the Wikipedia pages of those physics fields
physics_df["Text_data"] =  # YOUR CODE

physics_df["Text_data"]

>For each physicist, compute the number of words in their biography that also appear in each physics field page. You can use the [`intersection`](https://docs.python.org/3/library/stdtypes.html#frozenset.intersection) method of the `set` type.

In [None]:
# Example
list1 = [1, 2, 3, 4]
list2 = [3, 4, 5, 6]
set(list1).intersection(list2)

In [None]:
# TODO: Write a function that counts the number of words that two lists have in common
def intesect_count(list1, list2):
    pass

# TODO: create those columns
nobel_merged_df["Count_intersect_Nuclear"] =  # YOUR CODE
nobel_merged_df["Count_intersect_Particle"] =   # YOUR CODE
nobel_merged_df["Count_intersect_Atomic"] =  # YOUR CODE
nobel_merged_df["Count_intersect_Condensed"] =  # YOUR CODE
nobel_merged_df["Count_intersect_Astrophysics"] =  # YOUR CODE
nobel_merged_df["Count_intersect_Cosmology"] =  # YOUR CODE

nobel_merged_df

>For each physicist, compute the total number of unique words contained in their biography or in each physics field page (the size of the union of the two sets of words).

In [None]:
# TODO: Write a function that counts the total number of unique words contained in two lists
def total_count(list1, list2):
    pass

# TODO: create those columns
nobel_merged_df["Count_total_Nuclear"] =  # YOUR CODE
nobel_merged_df["Count_total_Particle"] =  # YOUR CODE
nobel_merged_df["Count_total_Atomic"] =  # YOUR CODE
nobel_merged_df["Count_total_Condensed"] =  # YOUR CODE
nobel_merged_df["Count_total_Astrophysics"] =  # YOUR CODE
nobel_merged_df["Count_total_Cosmology"] =  # YOUR CODE

nobel_merged_df

We can now try to estimate the probability that a word belongs both to the Wikipedia page of a physicist and to a physics field page, using the following approximation, where $P$ and $F$ denote the sets of words of the physicist page and of the field page:
\begin{equation}
p(\mbox{Same words for physicist P and field F}) \simeq \frac{\mbox{Number of words in both P and F}}{\mbox{Total number of words contained in P or F}}= \frac{|P\cap F|}{|P\cup F|}
\end{equation} 
This "probability" is known as the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index).
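
As a worked example of the ratio above, here are two small invented sets of words. They share 2 words out of 7 distinct words in total, so the index is 2/7.

```python
# Invented sets of words for one physicist page and one field page.
physicist_words = {"quantum", "electron", "atom", "laser"}
field_words = {"electron", "atom", "nucleus", "neutron", "proton"}

# Size of the intersection: words present in both sets.
intersection_size = len(physicist_words & field_words)

# Size of the union: distinct words present in at least one set.
union_size = len(physicist_words | field_words)

jaccard = intersection_size / union_size

print(intersection_size, union_size, jaccard)
```

The index is 0 when the two sets share no word and 1 when they are identical.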

>For each physicist, compute the Jaccard index between the words of the physicist page and the words of each physics field page.

In [None]:
# TODO: Compute those columns
nobel_merged_df["Proba_Nuclear"] =  # YOUR CODE
nobel_merged_df["Proba_Particle"] =  # YOUR CODE
nobel_merged_df["Proba_Atomic"] =  # YOUR CODE
nobel_merged_df["Proba_Condensed"] =  # YOUR CODE
nobel_merged_df["Proba_Astrophysics"] =  # YOUR CODE
nobel_merged_df["Proba_Cosmology"] =  # YOUR CODE

nobel_merged_df

>Use the [`DataFrame.idxmax`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html) method to find which field has the highest probability of intersection.
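
The sketch below shows `idxmax` on an invented table with three score columns and two rows. With `axis=1`, the maximum is searched along each row, and the result is the name of the winning column.

```python
import pandas as pd

# Invented scores for two hypothetical physicists.
scores = pd.DataFrame(
    {
        "Proba_Nuclear": [0.5, 0.1],
        "Proba_Particle": [0.3, 0.2],
        "Proba_Atomic": [0.2, 0.7],
    },
    index=["Physicist A", "Physicist B"],
)

# For each row, return the label of the column holding the largest value.
best_field = scores.idxmax(axis=1)

print(best_field)
```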

In [None]:
proba_cols = ["Proba_Nuclear",
              "Proba_Particle",
              "Proba_Atomic",
              "Proba_Condensed",
              "Proba_Astrophysics",
              "Proba_Cosmology"]

# Normalize each row so that the six scores of a physicist sum to 1
nobel_merged_df[proba_cols] = nobel_merged_df[proba_cols].apply(lambda x: x / sum(x), 1)

# TODO: Which field does each physicist belong to?

>Do you agree with this classification? How could we improve the analysis to get a better classification?