# Homework 2: Social Network Data Visualisation

- [Physics as network](#Physics as network)
- [Getting the Data](#Getting the Data)
- [Cleaning the data](#Cleaning the data)
- [Compiling the nodes data](#Compiling the nodes data)
- [Compiling the links data](#Compiling the links data)

In this homework, we are going to reuse the data set of the previous homework and look at it from a social network perspective. A social network is a construct where social actors can be represented as nodes linked by edges on a graph. As an example, here is the Facebook social network:

<img src="additional/facebook_world_friend_map.png" width=600>

We are using [d3.js](https://d3js.org/), a JavaScript data visualization library, to represent our networks. You can see an example of such a network [here](https://bl.ocks.org/mbostock/4062045).


## Physics as a social network  <a class="anchor" id="Physics as network"></a>

A social network can be represented as a graph with a set of nodes and a set of links between those nodes:

<img src="additional/small_undirected_network_labeled.png" width=300>

In the previous homework, we used data sets with physicists and physics domains. If we consider each physicist and each physics domain as a possible node on a network, the problem becomes building edges between them. Here is an example of how to represent a graph with Python data structures:

In [None]:
small_network = {
    "nodes": [
        {"id": "Albert Einstein"},
        {"id": "Paul Dirac"},
        {"id": "Niels Bohr"}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac"},
        {"source": "Albert Einstein", "target": "Niels Bohr"},
        {"source": "Paul Dirac", "target": "Niels Bohr"}
    ]
}

## We dump this network into a .json file
import json
import os

os.makedirs("./data", exist_ok=True)  # make sure the output folder exists
with open("./data/small_network.json", "w") as f:
    json.dump(small_network, f, indent=4)

In [2]:
import json

json.__version__

'2.0.9'

If you have a Mac, the following script opens a Safari window to visualize this network; otherwise, you can open the `small_network.html` file with Safari or Firefox. It does not work out of the box with Chrome, most likely because Chrome by default blocks a page opened from disk from loading other local files (here, the JSON data).

In [12]:
import os
os.system("open -a /Applications/Safari.app ./small_network.html --args --allow-file-access-from-files")
# os.system("open -a /Applications/Google\ Chrome.app ./small_network.html --args --allow-file-access-from-files")

0

Obviously, that is not a very interesting network, and we are going to build a more substantial one! D3.js expects the data in the specific format shown above, so we are going to shape our data to follow those requirements. Each node needs an "id" tag, and we can add other attributes if we wish. Each link needs a "source" and a "target" to connect two nodes, and can also carry other attributes.

In the following network, we are going to add a "length" attribute that captures how big each node is and a "value" attribute that captures how "strong" each link is. We are also going to distinguish between 2 types of nodes: the nodes for the physicists and the nodes for the physics domains. Those 2 sets of nodes will be distinguished by the "group" attribute. For example:

```
small_network = {
    "nodes": [
        {"id": "Albert Einstein", "group": 1, "length": 100},
        {"id": "Paul Dirac", "group": 1, "length": 200},
        {"id": "Niels Bohr", "group": 1, "length": 300}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.5},
        {"source": "Albert Einstein", "target": "Niels Bohr", "value": 0.4},
        {"source": "Paul Dirac", "target": "Niels Bohr", "value": 0.3}
    ]
}
```
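
Since every link refers to nodes by their "id", it can be worth checking that no link points at a node that was never declared before handing the data to the visualization. The following is a minimal sketch of such a check; `find_dangling_links` and `example_network` are hypothetical names introduced here for illustration and are not part of the homework code.

```python
def find_dangling_links(network):
    """Return the links whose source or target is not a declared node id."""
    node_ids = {node["id"] for node in network["nodes"]}
    return [
        link
        for link in network["links"]
        if link["source"] not in node_ids or link["target"] not in node_ids
    ]


# Made-up example: "Niels Bohr" is used in a link but is missing from the nodes.
example_network = {
    "nodes": [
        {"id": "Albert Einstein", "group": 1, "length": 100},
        {"id": "Paul Dirac", "group": 1, "length": 200},
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.5},
        {"source": "Albert Einstein", "target": "Niels Bohr", "value": 0.4},
    ],
}

dangling = find_dangling_links(example_network)
print(dangling)
```

An empty list means that every link connects two declared nodes.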

## Getting the Data <a class="anchor" id="Getting the Data"></a>

We are first going to gather the data needed for this project. We are going to extract the words in each Wikipedia page to understand the relation between each physicist and each physics domain.

In [None]:
## We get the nobel data set
import numpy as np
import pandas as pd
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        return table_df.replace({"Year":{'':np.nan}})
        
    def parse_row(self, row):     
        columns = row.find_all("td")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != '']
    
    def parse_header(self, row):     
        columns = row.find_all("th")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != ""]
    
    def clean_table(self, row):
        if not row.iloc[0].isdigit() and row.iloc[0] != '':
            return row.shift(1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()
nobel_df.columns = ["Year", "Laureate", "Country", "Rationale"]
nobel_df.dropna(subset=["Country"], inplace=True)
nobel_df.ffill(inplace=True)  # forward-fill gaps; fillna(method="ffill") is deprecated in recent pandas
nobel_df.drop(columns=["Year", "Country", "Rationale"], inplace=True)

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Laureate", "link"]).drop_duplicates()

nobel_df = nobel_df.merge(link_df, on="Laureate", how="left")
nobel_df.set_index("Laureate", inplace=True)
nobel_df.drop_duplicates(inplace=True)
nobel_df

We are also going to extract the links of each of the physics domains listed in the Research fields table of the [https://en.wikipedia.org/wiki/Physics](https://en.wikipedia.org/wiki/Physics) Wikipedia page.

In [None]:
## We get the physics links
url = "https://en.wikipedia.org/wiki/Physics"

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
physics_df = pd.DataFrame([[x.string.lower(), x["href"].lower()] for x in table.contents[2].find_all("a")],
                       columns=["Physics_domain", "link"]).drop_duplicates()

physics_df = physics_df.groupby("Physics_domain").first()
physics_df


## Cleaning the data <a class="anchor" id="Cleaning the data"></a>

>Use the code from your previous homework to create the different functions that get the word data and clean it.

In [None]:
from string import punctuation

## We get the bios
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)
    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

# TODO: copy your clean_string function from the previous homework
def clean_string(string):
    pass

# TODO: copy your remove function from the previous homework
def remove(list_to_clean, element_to_remove=[None, ""]):
    pass

# TODO: copy your remove_one function from the previous homework
def remove_one(list_to_clean):
    pass

We are now going to write a function that takes a dataframe with a "link" column and returns a column of lists of words. We basically aggregate all the above functions into one to reproduce what was done in the previous homework. Note that here we DO NOT use the function that keeps only one copy of each element of a list, nor the one that filters on the number of occurrences.

> Write a function that applies all the previous functions to clean a text.

In [None]:
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

# TODO: aggregate all the above functions into one to return a list of words from each link
def clean_everything(df):
    pass

physics_df["physics_list"] = clean_everything(physics_df)
nobel_df["physics_list"] = clean_everything(nobel_df)
nobel_df

We saw last time that many words in those Wikipedia pages are not relevant to physics concepts. We are going to attempt to filter them out with the following simple approach.

We are going to compile a set of all the unique words in the `nobel_df` lists and a set of all the unique words in the `physics_df` lists. By taking the intersection of those 2 sets, we can reduce the word corpus to something more relevant to physics.

>- Compile a set of all unique words in `nobel_df["physics_list"]`. You can use the [`sum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) method to concatenate the lists, then cast the resulting list to a [`set`](https://docs.python.org/2/library/sets.html).
- Compile a set of all unique words in `physics_df["physics_list"]`.
- Compile the intersection of those 2 sets using the `intersection` function.

In [None]:
# TODO: find all the words in nobel_df["physics_list"]
all_nobel_words =  # YOUR CODE

# TODO: find all the words in physics_df["physics_list"]
all_physics_words =  # YOUR CODE

# TODO: find the intersection of all_nobel_words and all_physics_words
physics_corpus =  # YOUR CODE

physics_corpus

In [None]:
print(len(physics_corpus), len(all_nobel_words), len(all_physics_words))
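
To make the filtering idea concrete, here is a small sketch on made-up word lists. It uses plain Python `set().union(...)` rather than the pandas `sum` suggested above, so it is an alternative illustration and not the expected homework answer; all the `toy_*` names and the word lists are invented for the example.

```python
# Toy word lists standing in for the "physics_list" columns (made-up data).
nobel_lists = [["quantum", "born", "theory"], ["theory", "university", "electron"]]
physics_lists = [["quantum", "electron", "field"], ["theory", "field"]]

# The union of all the lists gives the set of unique words on each side.
toy_nobel_words = set().union(*nobel_lists)
toy_physics_words = set().union(*physics_lists)

# Words present on both sides form the filtered corpus.
toy_corpus = toy_nobel_words.intersection(toy_physics_words)
print(sorted(toy_corpus))  # ['electron', 'quantum', 'theory']
```

Biographical words such as "born" or "university" only appear on the physicist side, so they drop out of the intersection.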

>Write a function that keeps only specific words from a list.

In [None]:
# TODO: write a function that keeps only specific words from a list
def keep_only(list_to_clean, corpus=physics_corpus):
    pass
    
nobel_df["physics_list_clean"] = nobel_df["physics_list"].apply(keep_only)
physics_df["physics_list_clean"] = physics_df["physics_list"].apply(keep_only)


## Compiling the nodes data <a class="anchor" id="Compiling the nodes data"></a>

For those 2 dataframes, we are going to create 2 additional columns:
    
>- Create a "length" column that counts the number of words in each list. This column will be used to set the size of the nodes in the network. Basically, we are saying: the more words in the Wikipedia page, the more significant the physicist or physics domain is.
- Create a "group" column with a single value for each of those dataframes: set the value to 1 in the `nobel_df` dataframe and 0 in the `physics_df` dataframe. This column will be used to distinguish the physicists from the physics domains and give them different colors in the network visualization.

In [None]:
# TODO: compute the length of each list
nobel_df["length"] =  # YOUR CODE
physics_df["length"] =  # YOUR CODE

# TODO: Set this column to 1
nobel_df["group"] =  # YOUR CODE
# TODO: Set this column to 0
physics_df["group"] =  # YOUR CODE

Let's concatenate those 2 dataframes into the `nodes_df` dataframe.

>Use the [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to do so and only keep the "length" and "group" columns. The concatenation needs to be done along the row axis.

In [None]:
# TODO: concatenate those two dataframes into the nodes_df dataframe.
# keep only the "length" and "group" columns.
nodes_df =  # YOUR CODE

nodes_df.index.name = "id"
nodes_df

From this dataframe, we can easily format the data as a list of dictionaries as the d3.js library expects the data to be. We have the "length" attribute for the size of the node, the "group" attribute to distinguish between physicists and physics domains and each node has a unique "id" tag represented by the names.  

In [None]:
nodes_list = list(nodes_df.reset_index().transpose().to_dict().values())
nodes_list
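
As a side note, pandas can also produce the same list of dictionaries directly with `to_dict(orient="records")`. The sketch below shows this on a made-up two-row table; `toy_nodes_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy nodes table (made-up values) indexed by "id", like nodes_df.
toy_nodes_df = pd.DataFrame(
    {"length": [100, 200], "group": [1, 0]},
    index=pd.Index(["Albert Einstein", "optics"], name="id"),
)

# One dictionary per row, with the index turned back into an "id" field.
toy_nodes_list = toy_nodes_df.reset_index().to_dict(orient="records")
print(toy_nodes_list)
```

Each dictionary then carries the "id", "length" and "group" keys that the node format requires.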


## Compiling the links data <a class="anchor" id="Compiling the links data"></a>

We have the nodes; now we need a way to connect them. We are going to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between each pair of Wikipedia pages. It is called the cosine similarity because the dot product between 2 vectors $\mathbf{A}$ and $\mathbf{B}$ can be expressed as:
\begin{equation}
\mathbf{A}\cdot\mathbf{B} = \Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2\cos\theta
\end{equation}
where $\theta$ is the angle between the 2 vectors. Equivalently:
\begin{equation}
\cos\theta = \frac{\mathbf{A}\cdot\mathbf{B}}{\Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2}
\end{equation}
In general $\cos\theta\in[-1,1]$: $\cos\theta = 1$ if the 2 vectors point in the same direction and $\cos\theta = -1$ if they point in opposite directions. The problem then becomes expressing a Wikipedia page as a vector. I suggest a simple approach here, but there are many ways to achieve this.

We defined earlier the corpus of physics words `physics_corpus`. Each word can be thought of as one vector of an orthogonal basis, defining the vector space in which our Wikipedia pages live. Each component of a page's vector is the number of times the corresponding word appears in that page. As an example, imagine a page $P$ represented by the following list of words:
```
P_list = ["data", "data", "science", "python", "python"]
```
And let's imagine that we have a simple word corpus:
```
corpus = {"engineering", "data", "science", "python"}
```
Then, taking the corpus words in the order listed above as the basis, $P$ could be represented by the vector
\begin{equation}
\mathbf{P}=\left(\begin{matrix}
  0  \\
  2  \\
  1  \\
  2
 \end{matrix}\right)
\end{equation}

We need to express each Wikipedia page as such a vector.
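
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch: the helper names `to_vector` and `cosine`, the fixed word order in `ordered_corpus` and the second page `Q_list` are assumptions made for the illustration, and `cosine` assumes that neither vector is all zeros.

```python
import math
from collections import Counter


def to_vector(words, ordered_corpus):
    """Count how often each corpus word appears in the list of words."""
    counts = Counter(words)
    return [counts[word] for word in ordered_corpus]


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A list (not a set) so that the order of the components is fixed.
ordered_corpus = ["engineering", "data", "science", "python"]
P_list = ["data", "data", "science", "python", "python"]
Q_list = ["engineering", "python"]  # hypothetical second page

P = to_vector(P_list, ordered_corpus)
Q = to_vector(Q_list, ordered_corpus)
print(P)  # [0, 2, 1, 2]
print(cosine(P, Q))
```

Because word counts are never negative, the similarity between two pages built this way lies between 0 and 1.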

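>Let's start by creating a dataframe whose columns are the index values of the `nodes_df` dataframe (`nodes_df.index.values`) and whose index is the whole `physics_corpus` set. You may need to convert the set to a list first, since recent pandas versions do not accept a set as an index.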

In [None]:
# TODO: create a data frame with the index of nodes_df as columns and physics_corpus as index
words_vector =  # YOUR CODE
words_vector

To fill this table we are going to use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function on the list of words contained in `nobel_df["physics_list_clean"]` and `physics_df["physics_list_clean"]`. 

>Write a function that takes a list and returns its value counts. The return value should be a pandas series with the words as index. We are using this function to populate the `words_vector` dataframe. Note that because `words_vector` already has an index, the values are placed in the right rows automatically.

In [None]:
# TODO: write a function that takes a list and returns a word count
def count_words(list_to_count):
    pass

words_vector.loc[:,nobel_df.index] = nobel_df["physics_list_clean"].apply(count_words).transpose()
words_vector.loc[:,physics_df.index] = physics_df["physics_list_clean"].apply(count_words).transpose()
words_vector

Many entries in this dataframe appear as `NaN`. We just need to replace those missing values with 0, since they indicate that the word was not found in those pages.

>Use the [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method to fill them with 0.

In [None]:
# TODO: fill the missing values
words_vector = # YOUR CODE
words_vector

>- Write a function that takes 2 vectors (2 pandas series) and returns the cosine similarity index. You can use the functions [`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.dot.html), [`pow`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pow.html) and `sum` if you like.
- Use this function to fill the `similarity_df` dataframe.
- Bonus points if you can compute this dataframe using matrix algebra ([`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.dot.html)) without having to iterate through the columns. Hint: create 2 dataframes, one that holds the dot products of `words_vector` with itself and one that holds the matrix of norm products. Then divide one matrix by the other element-wise. In this case, you would not need the `compute_similarity` function.

In [None]:
# TODO: write a function that takes 2 vectors and return the cosine similarity index
def compute_similarity(vect1, vect2):
    return vect1.dot(vect2) / (np.sqrt(vect1.pow(2).sum()) * np.sqrt(vect2.pow(2).sum()))

similarity_df = pd.DataFrame(columns=words_vector.columns, index=words_vector.columns, dtype=float)

# TODO: fill the similarity_df dataframe with the cosine similarity

# TODO: bonus points if you can compute this dataframe using matrix algebra 

similarity_df
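
The matrix-algebra hint can be tried out on a tiny made-up table before applying it to the real data. The sketch below is one possible way to follow the hint, not the only one; `toy_vectors` and its counts are invented, and `page_c` is deliberately twice `page_a` so that their similarity should come out as 1.

```python
import numpy as np
import pandas as pd

# Toy count table: rows are corpus words, columns are pages (made-up numbers).
toy_vectors = pd.DataFrame(
    {"page_a": [0, 2, 1, 2], "page_b": [1, 0, 0, 1], "page_c": [0, 4, 2, 4]},
    index=["engineering", "data", "science", "python"],
)

# Dot product of every page with every other page (pages x pages).
dot_products = toy_vectors.T.dot(toy_vectors)

# The diagonal holds each page's squared norm.
norms = np.sqrt(np.diag(dot_products.to_numpy()))

# Matrix of norm products, labelled like dot_products for element-wise division.
norm_products = pd.DataFrame(
    np.outer(norms, norms),
    index=dot_products.index,
    columns=dot_products.columns,
)

toy_similarity = dot_products / norm_products
print(toy_similarity.round(3))
```

The result is symmetric with ones on the diagonal, which is a quick sanity check for the full `similarity_df`.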

We are now going to "melt" `similarity_df` into a long dataframe: 

>- We need to reset the index of `similarity_df` ([`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)).
- Then we use the [`melt`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to melt the dataframe. Use the reset index as `id_vars`.

In [None]:
# TODO: reset the index and melt the dataframe
melted_df =  # YOUR CODE

melted_df.columns = ["source", "target", "value"]
melted_df

You can see that at this point we have something close to what is needed to create the links data. There are 3 things we need to do to finalize our dataset. First, rows where "source" equals "target" are unnecessary (a node does not need to be linked to itself). Second, we have duplicated links because in our case "source" and "target" play the same role and can be interchanged (our graph is not directed). Third, we need to subset the links because there are too many for the program to run efficiently.

Let's shuffle the data set row-wise ([`sample`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)). This helps us avoid biasing our link selection because of a prior alphabetical ordering of the data.

In [None]:
melted_df = melted_df.sample(frac=1.).reset_index(drop=True)
melted_df

We are then going to find the ("source", "target") pairs that are equal to ("target", "source") pairs. To do that, we are going to merge `melted_df` with itself where ("source", "target") = ("target", "source").

>- Merge it with itself with `left_on=["source", "target"]` and `right_on=["target", "source"]`. Pass the dataframe with its index reset using [`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html), so that the original row labels are kept in an "index" column.

In [None]:
# TODO: merge melted_df with itself
merged_df =  # YOUR CODE

merged_df

At this point, we can see that each ("source", "target") pair has a redundant ("target", "source") equivalent. This also highlights the cases where "source" = "target". To filter out the useless rows, we can simply pick either the ("source", "target") pair or the ("target", "source") pair to remove. Let's choose which one to remove by capturing the index we want to drop.

>- Look at the pair of columns `merged_df[["index_x", "index_y"]]` and simply choose the greater of the two in each row using [`max`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html). The unique values ([`unique`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)) of the resulting list are the indices to remove, and we can drop them using [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html).

In [None]:
# TODO: find the index to drop
index_to_drop =  # YOUR CODE
# TODO: use the index_to_drop to subset the melted_df dataframe
melted_df_sub =  # YOUR CODE
melted_df_sub
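
To check the result, it can help to compare it against a different way of removing self-links and mirrored pairs. The sketch below is an alternative technique, not the merge-based approach asked for above: it builds an order-independent key for each pair and keeps the first occurrence after shuffling. The node names, the placeholder values and all the `toy_*` names are made up for the illustration.

```python
import itertools

import pandas as pd

# Toy melted table with every (source, target) combination of 3 nodes.
nodes = ["bohr", "dirac", "optics"]
toy_melted = pd.DataFrame(
    list(itertools.product(nodes, nodes)), columns=["source", "target"]
)
toy_melted["value"] = 0.5  # placeholder similarity

# Shuffle so that the surviving orientation of each pair is not alphabetical.
toy_melted = toy_melted.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Step 1: drop the self-links.
no_self = toy_melted[toy_melted["source"] != toy_melted["target"]]

# Step 2: (a, b) and (b, a) get the same sorted key; keep the first of each.
pair_key = pd.Series(
    [tuple(sorted(pair)) for pair in zip(no_self["source"], no_self["target"])],
    index=no_self.index,
)
toy_sub = no_self[~pair_key.duplicated()]
print(toy_sub)
```

With 3 nodes there are 9 rows to start with and 3 undirected links left at the end, which matches the expected n(n-1)/2.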

We have filtered out quite a few rows, but there are still too many for the network simulation to run efficiently. For each source, we are going to keep the 10 highest values.

>- Group `melted_df_sub` by "source" using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and select the 10 targets that have the highest values using the [`nlargest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nlargest.html) method.
- The resulting pandas Series has a multiindex with 2 levels. We need to get the level 1 of the multiindex to know which rows to keep in `melted_df_sub`. You can get it using the function [`get_level_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html) on the index. 

In [None]:
# TODO: Group melted_df_sub by "source" using the groupby method and select the 10 
# targets that have the highest values using the nlargest method
largest_df =  # YOUR CODE
# TODO: get the level 1 of the multiindex
index_to_keep =  # YOUR CODE

links_df = melted_df_sub.loc[index_to_keep]
links_df
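
The shape of the `groupby` plus `nlargest` result is easier to see on a small made-up table. In this sketch we keep the 2 largest values per source instead of 10; `toy_links` and its numbers are invented for the illustration.

```python
import pandas as pd

# Toy links table (made-up values) with a default 0..5 row index.
toy_links = pd.DataFrame(
    {
        "source": ["a", "a", "a", "b", "b", "b"],
        "target": ["x", "y", "z", "x", "y", "z"],
        "value": [0.1, 0.9, 0.5, 0.3, 0.2, 0.8],
    }
)

# Series with a 2-level index: level 0 is the source, level 1 the original row label.
top2 = toy_links.groupby("source")["value"].nlargest(2)

# Level 1 tells us which rows of toy_links to keep.
rows_to_keep = top2.index.get_level_values(1)
toy_top_links = toy_links.loc[rows_to_keep]
print(toy_top_links)
```

Here rows 1 and 2 survive for source "a" and rows 3 and 5 for source "b".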

We need to cast this data frame as a list of dictionaries as we have done for the list of nodes. 

>Use code similar to that used for the nodes to create a list of links:

In [None]:
# TODO: create the list of links
links_list =  # YOUR CODE
links_list

We now create the final dictionary for the network and save it into a JSON file.

In [None]:
network_dict = {"nodes": nodes_list,
                "links": links_list}

with open("./data/physicists.json","w") as f:
    json.dump(network_dict, f, indent=4)

If you have a Mac, the following script opens a Safari window to visualize this network; otherwise, you can open the `index.html` file with Safari or Firefox. As before, it does not work out of the box with Chrome.

In [None]:
import os
os.system("open -a /Applications/Safari.app ./index.html")

Adjust the parameters to try to find nodes that tend to be grouped together. You can try to recreate the network with a different number of links.