# Homework 2: Social Network Data Visualisation

- [Physics as a social network](#Physics as network)
- [Getting the Data](#Getting the Data)
- [Cleaning the data](#Cleaning the data)
- [Compiling the nodes data](#Compiling the nodes data)
- [Compiling the links data](#Compiling the links data)

In this homework, we are going to reuse the data set from the previous homework and look at it from a social network perspective. A social network is a construct where social actors are represented as nodes linked by edges on a graph. As an example, here is the Facebook social network:

<img src="additional/facebook_world_friend_map.png" width=600>

We are using [d3.js](https://d3js.org/), a JavaScript data visualization library, to represent our networks. You can see an example of such a network [here](https://bl.ocks.org/mbostock/4062045).


## Physics as a social network  <a class="anchor" id="Physics as network"></a>

A social network can be represented as a graph with a set of nodes and a set of links between those nodes:

<img src="additional/small_undirected_network_labeled.png" width=300>

In the previous homework, we used data sets of physicists and physics domains. If we consider each physicist and each physics domain as a possible node on a network, the problem becomes building edges between them. Here is an example of how to represent a graph with Python data structures:

In [13]:
small_network = {
    "nodes": [
        {"id": "Albert Einstein"},
        {"id": "Paul Dirac"},
        {"id": "Niels Bohr"}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac"},
        {"source": "Albert Einstein", "target": "Niels Bohr"},
        {"source": "Paul Dirac", "target": "Niels Bohr"}
    ]
}

## We dump this network into a .json file
import json
with open("./data/small_network.json","w") as f:
    json.dump(small_network, f, indent=4)

If you have a Mac, the following script opens a Safari window to visualize this network; otherwise, you can open the `small_network.html` file with Safari or Firefox. It does not work with Chrome, most likely because Chrome by default prevents a page opened from the local file system from loading other local files such as the `.json` data.

In [14]:
import os
os.system("open -a /Applications/Safari.app ./small_network.html")

0

Obviously, that is not a very interesting network, and we are going to build a more substantial one! D3.js expects the data in the specific format shown above, and we are going to shape our data to follow those requirements. For each node, we need an "id" tag, and we can add other attributes if we wish. For the links, we need a "source" and a "target" to connect nodes, and we can add other attributes as well.

In the following network, we are going to add a "length" attribute that captures how big each node is and a "value" attribute for each link that captures how "strong" the link is. We are also going to distinguish between 2 types of nodes: the nodes for the physicists and the nodes for the physics domains. Those 2 sets of nodes are distinguished by the attribute "group". For example:

```
small_network = {
    "nodes": [
        {"id": "Albert Einstein", "group": 1, "length": 100},
        {"id": "Paul Dirac", "group": 1, "length": 200},
        {"id": "Niels Bohr", "group": 1, "length": 300}
    ],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.5},
        {"source": "Albert Einstein", "target": "Niels Bohr", "value": 0.4},
        {"source": "Paul Dirac", "target": "Niels Bohr", "value": 0.3}
    ]
}
```

## Getting the Data <a class="anchor" id="Getting the Data"></a>

We are first going to gather the data needed for this project. We are going to extract the words in each Wikipedia page to understand the relation between each physicist and each physics domain.

In [15]:
## We get the nobel data set
import numpy as np
import pandas as pd
from httplib2 import Http
from bs4 import BeautifulSoup, SoupStrainer

class Parser:
    
    def __init__(self, url):  
        http = Http()
        status, response = http.request(url)
        tables = BeautifulSoup(response, "lxml", 
                              parse_only=SoupStrainer("table", {"class":"wikitable sortable"}))
        self.table = tables.contents[1]
    
    def parse_table(self):      
        rows = self.table.find_all("tr")
        header = self.parse_header(rows[0])
        table_array = [self.parse_row(row) for row in rows[1:]]
        table_df = pd.DataFrame(table_array, columns=header).apply(self.clean_table, 1)
        return table_df.replace({"Year":{'':np.nan}})
        
    def parse_row(self, row):     
        columns = row.find_all("td")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != '']
    
    def parse_header(self, row):     
        columns = row.find_all("th")
        return [BeautifulSoup.get_text(col).strip() for col in columns if BeautifulSoup.get_text(col) != ""]
    
    def clean_table(self, row):
        if not row.iloc[0].isdigit() and row.iloc[0] != '':
            return row.shift(1)
        else:
            return row
        
url = "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics"        
parser = Parser(url)   
nobel_df = parser.parse_table()
nobel_df.columns = ["Year", "Laureate", "Country", "Rationale"]
nobel_df.dropna(subset=["Country"], inplace=True)
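# Forward-fill missing values (fillna(method="ffill") is deprecated in recent pandas)
nobel_df.ffill(inplace=True)
# axis must be passed by keyword in pandas >= 2.0
nobel_df.drop(["Year", "Country", "Rationale"], axis=1, inplace=True)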

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
link_df = pd.DataFrame([[x.string, x["href"]] for x in table.contents[1].find_all("a")],
                       columns=["Laureate", "link"]).drop_duplicates()

nobel_df = nobel_df.merge(link_df, on="Laureate", how="left")
nobel_df.set_index("Laureate", inplace=True)
nobel_df.drop_duplicates(inplace=True)
nobel_df

| Laureate | link |
|---|---|
| Wilhelm Conrad Röntgen | /wiki/Wilhelm_R%C3%B6ntgen |
| Hendrik Lorentz | /wiki/Hendrik_Lorentz |
| Pieter Zeeman | /wiki/Pieter_Zeeman |
| Antoine Henri Becquerel | /wiki/Henri_Becquerel |
| Pierre Curie | /wiki/Pierre_Curie |
| Maria Skłodowska-Curie | /wiki/Maria_Sk%C5%82odowska-Curie |
| Lord Rayleigh | /wiki/John_Strutt,_3rd_Baron_Rayleigh |
| Philipp Eduard Anton von Lenard | /wiki/Philipp_Lenard |
| Joseph John Thomson | /wiki/J._J._Thomson |
| Albert Abraham Michelson | /wiki/Albert_Abraham_Michelson |


We are also going to extract the links of each of the physics domains listed in the Research fields table of the [https://en.wikipedia.org/wiki/Physics](https://en.wikipedia.org/wiki/Physics) Wikipedia page.

In [16]:
## We get the physics links
url = "https://en.wikipedia.org/wiki/Physics"

http = Http()
status, response = http.request(url)

table = BeautifulSoup(response, "lxml", parse_only=SoupStrainer('table'))
physics_df = pd.DataFrame([[x.string.lower(), x["href"].lower()] for x in table.contents[2].find_all("a")],
                       columns=["Physics_domain", "link"]).drop_duplicates()

physics_df = physics_df.groupby("Physics_domain").first()
physics_df

| Physics_domain | link |
|---|---|
| accelerator physics | /wiki/accelerator_physics |
| acoustics | /wiki/acoustics |
| agrophysics | /wiki/agrophysics |
| antimatter | /wiki/antimatter |
| applied physics | /wiki/applied_physics |
| astrometry | /wiki/astrometry |
| astronomy | /wiki/astronomy |
| astrophysics | /wiki/astrophysics |
| atom | /wiki/atom |
| atomic and molecular astrophysics | /wiki/atomic_and_molecular_astrophysics |



## Cleaning the data <a class="anchor" id="Cleaning the data"></a>

>Use the code from your previous homework to create the functions that get the word data and clean it.

In [17]:
from string import punctuation

## We get the bios
def get_text(link, root_website = "https://en.wikipedia.org"):    
    http = Http()
    status, response = http.request(root_website + link)
    body = BeautifulSoup(response, "lxml", parse_only=SoupStrainer("div", {"id":"mw-content-text"}))
    return BeautifulSoup.get_text(body.contents[1])

# TODO: copy your clean_string function from the previous homework
def clean_string(string):
    for p in punctuation + "1234567890":
        string = string.replace(p,'').lower()  
    return string

# TODO: copy your remove function from the previous homework
def remove(list_to_clean, element_to_remove=[None, ""]):
    return [e for e in list_to_clean if e not in element_to_remove]

# TODO: copy your remove_one function from the previous homework
def remove_one(list_to_clean):
    return [e for e in list_to_clean if len(e) > 1]

We are now going to write a function that takes a dataframe with a "link" column and returns a column of lists of words. We basically aggregate all the above functions into one to reproduce what was done in the previous homework. Note that here we DO NOT use the function that keeps only the unique elements of each list, nor the one that filters on the number of occurrences.

> Write a function that applies all the previous functions to clean a text.

In [18]:
from nltk.corpus import stopwords
words_to_remove = set(stopwords.words('english'))

# TODO: aggregate all the above functions into one to return a list of words from each link
def clean_everything(df):
    return (df.apply(lambda x: get_text(x["link"]), axis=1)
            .apply(clean_string)
            # raw string avoids the invalid "\s" escape warning
            .str.split(r"\s")
            .apply(remove)
            .apply(lambda x: remove(x, words_to_remove))
            .apply(remove_one))


physics_df["physics_list"] = clean_everything(physics_df)
nobel_df["physics_list"] = clean_everything(nobel_df)
nobel_df

| Laureate | link | physics_list |
|---|---|---|
| Wilhelm Conrad Röntgen | /wiki/Wilhelm_R%C3%B6ntgen | [wilhelm, röntgen, born, wilhelm, conrad, rönt... |
| Hendrik Lorentz | /wiki/Hendrik_Lorentz | [confused, hendrikus, albertus, lorentz, ludvi... |
| Pieter Zeeman | /wiki/Pieter_Zeeman | [pieter, zeeman, born, may, zonnemaire, nether... |
| Antoine Henri Becquerel | /wiki/Henri_Becquerel | [uses, see, becquerel, disambiguation, antoine... |
| Pierre Curie | /wiki/Pierre_Curie | [pierre, curie, born, may, paris, france, died... |
| Maria Skłodowska-Curie | /wiki/Maria_Sk%C5%82odowska-Curie | [article, polish, physicist, uses, see, marie,... |
| Lord Rayleigh | /wiki/John_Strutt,_3rd_Baron_Rayleigh | [lord, rayleigh, om, prs, born, november, lang... |
| Philipp Eduard Anton von Lenard | /wiki/Philipp_Lenard | [waterfall, effect, redirects, illusory, visua... |
| Joseph John Thomson | /wiki/J._J._Thomson | [article, nobel, laureate, physicist, moral, p... |
| Albert Abraham Michelson | /wiki/Albert_Abraham_Michelson | [confused, athlete, albert, michelsen, albert,... |


We saw last time that many words in those Wikipedia pages are not relevant to physics concepts. We are going to attempt to filter them out with the following simple approach.

We are going to compile a set of all the unique words in the `nobel_df` lists and a set of all the unique words in the `physics_df` lists. By taking the intersection of those 2 sets, we can subset the words corpus to something more relevant to physics.

>- Compile a set of all unique words in `nobel_df["physics_list"]`. You can use the [`sum`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) method to concatenate the lists. You can cast the final list to a [`set`](https://docs.python.org/2/library/sets.html).
- Compile a set of all unique words in `physics_df["physics_list"]`.
- Compile the intersection of those 2 sets using the `intersection` function.
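
The three steps above can be sketched on toy data. The word lists below are made up for illustration and only stand in for the real `physics_list` columns; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical word lists standing in for the two "physics_list" columns.
toy_nobel = pd.Series([["quantum", "born", "paris"], ["quantum", "atom"]])
toy_physics = pd.Series([["atom", "energy"], ["quantum", "field", "atom"]])

# Summing a Series of lists concatenates them into one long list;
# casting to a set keeps each word only once.
all_toy_nobel_words = set(toy_nobel.sum())
all_toy_physics_words = set(toy_physics.sum())

# Keep only the words that appear on both sides.
toy_corpus = all_toy_nobel_words.intersection(all_toy_physics_words)
print(sorted(toy_corpus))
```

Words such as "born" or "paris" that only show up in the biographies drop out, which is the filtering effect we are after.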

In [19]:
# TODO: find all the words in nobel_df["physics_list"]
all_nobel_words = set(nobel_df["physics_list"].sum())

# TODO: find all the words in physics_df["physics_list"]
all_physics_words = set(physics_df["physics_list"].sum())

# TODO: find the intersection of all_nobel_words and all_physics_words
physics_corpus = all_nobel_words.intersection(all_physics_words)

physics_corpus

{'xps',
 'kivelson',
 'decoupling',
 'joyce',
 'jp',
 'mechanik',
 'paved',
 'consumer',
 'isolation',
 'reach',
 'haidt',
 'hawkes',
 'rubin',
 'proliferation',
 'eternal',
 'interface',
 'gott',
 'lanthanum',
 'unfavorable',
 'chown',
 'notesedit',
 'null',
 'core',
 'reviews',
 'dj',
 'connections',
 'matt',
 'pakistan',
 'scholar',
 'radioactivity',
 'electrodynamics',
 'chao',
 'mbar',
 'radiological',
 'ithaca',
 'rca',
 'documenting',
 'satisfactory',
 'heliumedit',
 'bibcodeapjc',
 'earliest',
 'enter',
 'frobenius',
 'conceived',
 'legal',
 'stem',
 'focuses',
 'joan',
 'spread',
 'randy',
 'duration',
 'charm',
 'manchester',
 'belongs',
 'swann',
 'sure',
 'towness',
 'cbb',
 'taub',
 'neptunium',
 'ferrara',
 'ground',
 'nonuniform',
 'corona',
 'north',
 'since',
 'observational',
 'zheng',
 'slowed',
 'pacific',
 'deliberate',
 'individuals',
 'bohr',
 'id',
 'molybdenum',
 'scarce',
 'iupac',
 'supercomputer',
 'kleppner',
 'signalling',
 'targeted',
 'subject',
 'doipt'

In [20]:
print(len(physics_corpus), len(all_nobel_words), len(all_physics_words))

12619 31053 33483


>Write a function that keeps only specific words from a list.

In [21]:
# TODO: write a function that keeps only specific words from a list
def keep_only(list_to_clean, corpus=physics_corpus):
    return [e for e in list_to_clean if e in corpus]
    
nobel_df["physics_list_clean"] = nobel_df["physics_list"].apply(keep_only)
physics_df["physics_list_clean"] = physics_df["physics_list"].apply(keep_only)


## Compiling the nodes data <a class="anchor" id="Compiling the nodes data"></a>

For those 2 dataframes, we are going to create 2 additional columns:

>- Create a column "length" that counts the number of words in each list. This column will be used to set the size of the nodes in the network. Basically, we are going to say: the more words in the Wikipedia page, the more significant the physicist or physics domain is.
- Create a column "group" holding a single value for each dataframe. Set the value to 1 in the `nobel_df` dataframe and 0 in the `physics_df` dataframe. This column will be used to distinguish the physicists from the physics domains and give them different colors in the network visualization.

In [22]:
# TODO: compute the length of each list
nobel_df["length"] = nobel_df["physics_list_clean"].apply(len)
physics_df["length"] = physics_df["physics_list_clean"].apply(len)

# TODO: Set this column to 1
nobel_df["group"] = 1
# TODO: Set this column to 0
physics_df["group"] = 0

Let's concatenate those 2 dataframes into the `nodes_df` dataframe.

>Use the [`pd.concat`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) function to do so and only keep the "length" and "group" columns. The concatenation needs to be done along the row axis.

In [23]:
# TODO: concatenate those two dataframes into the nodes_df dataframe.
# keep only the "length" and "group" columns.
nodes_df = pd.concat([nobel_df[["length", "group"]], physics_df[["length", "group"]]])

nodes_df.index.name = "id"
nodes_df

| id | length | group |
|---|---|---|
| Wilhelm Conrad Röntgen | 1245 | 1 |
| Hendrik Lorentz | 2710 | 1 |
| Pieter Zeeman | 984 | 1 |
| Antoine Henri Becquerel | 1156 | 1 |
| Pierre Curie | 1428 | 1 |
| Maria Skłodowska-Curie | 4941 | 1 |
| Lord Rayleigh | 1410 | 1 |
| Philipp Eduard Anton von Lenard | 1242 | 1 |
| Joseph John Thomson | 2889 | 1 |
| Albert Abraham Michelson | 2227 | 1 |


From this dataframe, we can easily format the data as a list of dictionaries, which is the form in which the d3.js library expects it. We have the "length" attribute for the size of the node, the "group" attribute to distinguish between physicists and physics domains, and a unique "id" tag for each node, given by the names.

In [24]:
nodes_list = list(nodes_df.reset_index().transpose().to_dict().values())
nodes_list

[{'group': 1, 'id': 'Wilhelm Conrad Röntgen', 'length': 1245},
 {'group': 1, 'id': 'Hendrik Lorentz', 'length': 2710},
 {'group': 1, 'id': 'Pieter Zeeman', 'length': 984},
 {'group': 1, 'id': 'Antoine Henri Becquerel', 'length': 1156},
 {'group': 1, 'id': 'Pierre Curie', 'length': 1428},
 {'group': 1, 'id': 'Maria Skłodowska-Curie', 'length': 4941},
 {'group': 1, 'id': 'Lord Rayleigh', 'length': 1410},
 {'group': 1, 'id': 'Philipp Eduard Anton von Lenard', 'length': 1242},
 {'group': 1, 'id': 'Joseph John Thomson', 'length': 2889},
 {'group': 1, 'id': 'Albert Abraham Michelson', 'length': 2227},
 {'group': 1, 'id': 'Gabriel Lippmann', 'length': 1472},
 {'group': 1, 'id': 'Guglielmo Marconi', 'length': 4077},
 {'group': 1, 'id': 'Karl Ferdinand Braun', 'length': 832},
 {'group': 1, 'id': 'Johannes Diderik van der Waals', 'length': 2187},
 {'group': 1, 'id': 'Wilhelm Wien', 'length': 741},
 {'group': 1, 'id': 'Nils Gustaf Dalén', 'length': 778},
 {'group': 1, 'id': 'Heike Kamerlingh-Onne
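
As a side note, pandas can produce the same kind of list of dictionaries more directly with `to_dict(orient="records")`. The sketch below uses a hypothetical two-row stand-in for `nodes_df` (the values are made up), so it only illustrates the call, not the real data.

```python
import pandas as pd

# Hypothetical two-row stand-in for nodes_df; the numbers are invented.
toy_nodes = pd.DataFrame(
    {"length": [100, 200], "group": [1, 0]},
    index=pd.Index(["Albert Einstein", "acoustics"], name="id"),
)

# reset_index turns the "id" index into a regular column, and
# orient="records" returns one dictionary per row.
toy_nodes_list = toy_nodes.reset_index().to_dict(orient="records")
print(toy_nodes_list)
```

Either approach yields one dictionary per node with the keys "id", "length" and "group"; only the key order inside each dictionary may differ.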


## Compiling the links data <a class="anchor" id="Compiling the links data"></a>

We have the nodes; now we need a way to connect them. We are going to compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between each pair of Wikipedia pages. It is called the cosine similarity because the dot product between 2 vectors $\mathbf{A}$ and $\mathbf{B}$ can be expressed as:
\begin{equation}
\mathbf{A}\cdot\mathbf{B} = \Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2\cos\theta
\end{equation}
where $\theta$ is the angle between the 2 vectors. Equivalently:
\begin{equation}
\cos\theta = \frac{\mathbf{A}\cdot\mathbf{B}}{\Vert\mathbf{A}\Vert_2\Vert\mathbf{B}\Vert_2}
\end{equation}
In general $\cos\theta\in[-1,1]$: specifically, $\cos\theta = 1$ if the 2 vectors point in the same direction and $\cos\theta = -1$ if they point in opposite directions. The problem then becomes expressing a Wikipedia page as a vector. I suggest a simple approach here, but there are many ways to achieve this.

We defined earlier the corpus of physics words `physics_corpus`. Each word can be thought of as one axis of an orthogonal basis, defining the vector space in which our Wikipedia pages live. Each component of a page's vector is the number of times the corresponding word appears in that page. As an example, imagine a page $P$ represented by the following list of words:
```
P_list = ["data", "data", "science", "python", "python"]
```
And let's imagine that we have a simple word corpus:
```
corpus = {"engineering", "data", "science", "python"}
```
Then in that basis, taking the words in the order (engineering, data, science, python), $P$ could be represented by the vector
\begin{equation}
\mathbf{P}=\left(\begin{matrix}
  0  \\
  2  \\
  1  \\
  2
 \end{matrix}\right)
\end{equation}
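
This toy vector can be checked in a few lines of plain Python. This is only a sketch: the `basis` list is an assumption introduced here, because a Python set has no guaranteed order and the vector components need one.

```python
from collections import Counter

P_list = ["data", "data", "science", "python", "python"]

# A set is unordered, so we fix an explicit word order for the components
# (this ordering is an assumption made for the sketch).
basis = ["engineering", "data", "science", "python"]

counts = Counter(P_list)

# A Counter returns 0 for words that never appear in the page.
P_vector = [counts[word] for word in basis]
print(P_vector)  # [0, 2, 1, 2]
```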

We need to express each Wikipedia page as such a vector.

>Let's start by creating a dataframe whose columns are all the index values of the `nodes_df` dataframe (`nodes_df.index.values`) and whose index is the whole `physics_corpus` set.

In [25]:
# TODO: create a data frame with the index of nodes_df as columns and physics_corpus as index
words_vector = pd.DataFrame(columns=nodes_df.index.values, index=physics_corpus)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
xps,,,,,,,,,,,...,,,,,,,,,,
kivelson,,,,,,,,,,,...,,,,,,,,,,
decoupling,,,,,,,,,,,...,,,,,,,,,,
joyce,,,,,,,,,,,...,,,,,,,,,,
jp,,,,,,,,,,,...,,,,,,,,,,
mechanik,,,,,,,,,,,...,,,,,,,,,,
paved,,,,,,,,,,,...,,,,,,,,,,
consumer,,,,,,,,,,,...,,,,,,,,,,
isolation,,,,,,,,,,,...,,,,,,,,,,
reach,,,,,,,,,,,...,,,,,,,,,,


To fill this table we are going to use the [`value_counts`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) function on the list of words contained in `nobel_df["physics_list_clean"]` and `physics_df["physics_list_clean"]`. 

>Write a function that takes a list and returns its value counts. The return value should be a pandas series with the words as index. We are using this function to populate the `words_vector` dataframe. Note that because `words_vector` already has an index, the values get populated at the right place automatically.
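
The automatic placement mentioned above is pandas index alignment. Here is a minimal sketch on the toy page from earlier; `corpus_order` and `toy_vectors` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Empty toy table: one row per corpus word, one column per page.
corpus_order = ["engineering", "data", "science", "python"]
toy_vectors = pd.DataFrame(columns=["P"], index=corpus_order)

# value_counts returns a Series indexed by word.
counts = pd.Series(["data", "data", "science", "python", "python"]).value_counts()

# Assigning a Series to a column aligns on the index labels, so each count
# lands on the row of its word; words absent from the page stay NaN.
toy_vectors["P"] = counts
print(toy_vectors)
```

The row for "engineering" stays `NaN` because the word never appears in the page, which is why a fill step is needed afterwards.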

In [26]:
# TODO: write a function that takes a list and returns a word count
def count_words(list_to_count):
    return pd.Series(list_to_count).value_counts()

words_vector.loc[:,nobel_df.index] = nobel_df["physics_list_clean"].apply(count_words).transpose()
words_vector.loc[:,physics_df.index] = physics_df["physics_list_clean"].apply(count_words).transpose()
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
xps,,,,,,,,,,,...,,,,,3,,,,,
kivelson,,,,,,,,,,,...,,,,,,,,,,
decoupling,,,,,,,,,,,...,,,,,,,,,,
joyce,,,,,,,,,,,...,,,,,,,,,,
jp,,,,,,,,,,,...,,,,,,,,,,
mechanik,,,,,,,,,,,...,,,,,,,,,,
paved,,,,,,,,,,,...,,,,1,,,,,,
consumer,,,,,,,,,,,...,,,,,,,,,,
isolation,,,,,,1,,,,,...,,1,,,,,,,,
reach,,,,,,1,,,,,...,,6,1,,,,3,,,


There are many entries in this dataframe that appear as `NaN`. We just need to replace those missing values by 0 since they indicate that no records of the word were found for those pages. 

>Use the [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method to fill the missing values with 0.

In [28]:
# TODO: fill the missing values
words_vector = words_vector.fillna(0)
words_vector

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
xps,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
kivelson,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
decoupling,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
joyce,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
jp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mechanik,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
paved,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
consumer,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
isolation,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
reach,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,6.0,1.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0


In [29]:
words_vector.tail(50)

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
wineland,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
yoichiro,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
li,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
hoax,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
elevated,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
edouard,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
merzbacher,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
phases,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,2.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0
wieman,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
madison,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


>- Write a function that takes 2 vectors (2 pandas series) and returns the cosine similarity. You can use the functions [`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.dot.html), [`pow`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pow.html) and `sum` if you like.
- Use this function to fill the `similarity_df` dataframe.
- Bonus points if you can compute this dataframe using matrix algebra ([`dot`](http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.dot.html)) without having to iterate through the columns. Hint: create 2 dataframes, one holding the dot products of `words_vector` with itself and one holding the matrix of norm products. Then divide one matrix by the other element-wise. In this case, you do not need the `compute_similarity` function.
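
To see why the matrix approach gives the same result as the pairwise loop, here is a small NumPy sketch on a made-up word-count matrix. It uses NumPy arrays rather than dataframes, and the matrix `X` and helper `cosine` are assumptions introduced only for this illustration.

```python
import numpy as np

# Hypothetical word-count matrix: rows are words, columns are pages.
X = np.array([[0.0, 1.0, 2.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 3.0, 0.0]])

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pairwise version: one cosine per pair of columns.
n_pages = X.shape[1]
pairwise = np.array([[cosine(X[:, i], X[:, j]) for j in range(n_pages)]
                     for i in range(n_pages)])

# Matrix version: all dot products at once, divided element-wise
# by the outer product of the column norms.
gram = X.T @ X
norms = np.linalg.norm(X, axis=0)
matrix_form = gram / np.outer(norms, norms)

print(np.allclose(pairwise, matrix_form))
```

Entry (i, j) of `gram` is the dot product of pages i and j, and entry (i, j) of the outer product is the product of their norms, so the element-wise ratio is exactly the cosine formula applied to every pair.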

In [17]:
# TODO: write a function that takes 2 vectors and return the cosine similarity index
def compute_similarity(vect1, vect2):
    return vect1.dot(vect2) / (np.sqrt(vect1.pow(2).sum()) * np.sqrt(vect2.pow(2).sum()))

similarity_df = pd.DataFrame(columns=words_vector.columns, index=words_vector.columns, dtype=float)

# TODO: fill the similarity_df dataframe with the cosine similarity
# for col1 in similarity_df.columns:
#     for col2 in similarity_df.columns:
#         similarity_df.loc[col1, col2] = compute_similarity(words_vector[col1], words_vector[col2])

# TODO: bonus points if you can compute this dataframe using matrix algebra 
norm = words_vector.pow(2).sum().pow(0.5)
similarity_df = words_vector.T.dot(words_vector) / norm.to_frame().dot(norm.to_frame().T)
similarity_df

Unnamed: 0,Wilhelm Conrad Röntgen,Hendrik Lorentz,Pieter Zeeman,Antoine Henri Becquerel,Pierre Curie,Maria Skłodowska-Curie,Lord Rayleigh,Philipp Eduard Anton von Lenard,Joseph John Thomson,Albert Abraham Michelson,...,superfluid,supernova,superstring theory,supersymmetry,surface physics,theory of everything,universe,vacuum energy,vehicle dynamics,weak
Wilhelm Conrad Röntgen,1.000000,0.185672,0.234898,0.246356,0.244603,0.262972,0.282291,0.332149,0.271266,0.232772,...,0.070318,0.062881,0.068566,0.082538,0.075815,0.096963,0.101659,0.063017,0.055909,0.099553
Hendrik Lorentz,0.185672,1.000000,0.370330,0.218688,0.179595,0.179237,0.204633,0.259935,0.179045,0.277143,...,0.122857,0.074520,0.217858,0.207668,0.073629,0.294773,0.168844,0.152462,0.060412,0.129237
Pieter Zeeman,0.234898,0.370330,1.000000,0.256890,0.209483,0.211097,0.216903,0.284003,0.207938,0.274881,...,0.095226,0.074682,0.071211,0.097682,0.074663,0.119827,0.092988,0.096831,0.045164,0.117186
Antoine Henri Becquerel,0.246356,0.218688,0.256890,1.000000,0.309496,0.278122,0.199439,0.246859,0.210313,0.216431,...,0.072759,0.086875,0.064611,0.089502,0.088283,0.098501,0.109708,0.089068,0.047054,0.100493
Pierre Curie,0.244603,0.179595,0.209483,0.309496,1.000000,0.799107,0.231689,0.214641,0.199662,0.199539,...,0.073302,0.061946,0.058904,0.087582,0.102917,0.097067,0.105868,0.067250,0.040900,0.097288
Maria Skłodowska-Curie,0.262972,0.179237,0.211097,0.278122,0.799107,1.000000,0.253163,0.243989,0.196505,0.233255,...,0.089919,0.068619,0.068480,0.100310,0.104205,0.118499,0.143015,0.064736,0.051924,0.115215
Lord Rayleigh,0.282291,0.204633,0.216903,0.199439,0.231689,0.253163,1.000000,0.232197,0.343531,0.350616,...,0.079355,0.058961,0.139607,0.121848,0.078933,0.174048,0.117191,0.068180,0.067522,0.116294
Philipp Eduard Anton von Lenard,0.332149,0.259935,0.284003,0.246859,0.214641,0.243989,0.232197,1.000000,0.314634,0.265652,...,0.138252,0.075308,0.118009,0.154903,0.102583,0.208230,0.162926,0.139365,0.066741,0.140633
Joseph John Thomson,0.271266,0.179045,0.207938,0.210313,0.199662,0.196505,0.343531,0.314634,1.000000,0.268181,...,0.089429,0.083015,0.108843,0.133921,0.104524,0.158400,0.140876,0.097033,0.076890,0.146132
Albert Abraham Michelson,0.232772,0.277143,0.274881,0.216431,0.199539,0.233255,0.350616,0.265652,0.268181,1.000000,...,0.087832,0.098430,0.077250,0.092427,0.072563,0.123208,0.134413,0.073893,0.064553,0.092076


We are now going to "melt" `similarity_df` into a long dataframe: 

>- We need to reset the index of `similarity_df` ([`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)).
- Then we use the [`melt`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) function to melt the dataframe. Use the reset index column as `id_vars`.
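
The effect of these two steps is easiest to see on a tiny square table. The 2×2 similarity matrix below is made up for illustration.

```python
import pandas as pd

# Hypothetical 2x2 similarity matrix between nodes "a" and "b".
toy_similarity = pd.DataFrame(
    [[1.0, 0.2], [0.2, 1.0]],
    index=["a", "b"], columns=["a", "b"],
)

# reset_index moves the row labels into a column named "index";
# melt then stacks every other column into (variable, value) pairs.
toy_melted = toy_similarity.reset_index().melt(id_vars=["index"])
toy_melted.columns = ["source", "target", "value"]
print(toy_melted)
```

Each of the 4 cells of the square table becomes one row of the long table, including the diagonal cells and both orientations of each pair.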

In [29]:
# TODO: reset the index and melt the dataframe
melted_df = similarity_df.reset_index().melt(id_vars=['index'])  

melted_df.columns = ["source", "target", "value"]
melted_df

| | source | target | value |
|---|---|---|---|
| 0 | Wilhelm Conrad Röntgen | Wilhelm Conrad Röntgen | 1.000000 |
| 1 | Hendrik Lorentz | Wilhelm Conrad Röntgen | 0.185672 |
| 2 | Pieter Zeeman | Wilhelm Conrad Röntgen | 0.234898 |
| 3 | Antoine Henri Becquerel | Wilhelm Conrad Röntgen | 0.246356 |
| 4 | Pierre Curie | Wilhelm Conrad Röntgen | 0.244603 |
| 5 | Maria Skłodowska-Curie | Wilhelm Conrad Röntgen | 0.262972 |
| 6 | Lord Rayleigh | Wilhelm Conrad Röntgen | 0.282291 |
| 7 | Philipp Eduard Anton von Lenard | Wilhelm Conrad Röntgen | 0.332149 |
| 8 | Joseph John Thomson | Wilhelm Conrad Röntgen | 0.271266 |
| 9 | Albert Abraham Michelson | Wilhelm Conrad Röntgen | 0.232772 |


You can see that at this point we have something close to what is necessary to create the links data. There are 3 things we need to do to finalize our dataset. First, rows where the "source" equals the "target" are unnecessary (a node does not need to be linked to itself). Second, we have duplicated links because in our case a "source" plays the same role as a "target" and the two can be interchanged (our graph is not directed). Third, we need to subset the links because there are too many for the program to run efficiently.

Let's shuffle the data set row-wise ([`sample`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)). This helps us avoid biasing our selection of links through the prior alphabetical ordering of the data.

In [21]:
melted_df = melted_df.sample(frac=1.).reset_index(drop=True)
melted_df

| | source | target | value |
|---|---|---|---|
| 0 | Adam G. Riess | Robert B. Laughlin | 0.438583 |
| 1 | materials physics | Dennis Gabor | 0.139132 |
| 2 | bcs theory | quantum information science | 0.084172 |
| 3 | David J. Wineland | supernova | 0.044755 |
| 4 | Peter Higgs | superstring theory | 0.137256 |
| 5 | Brian P. Schmidt | geophysics | 0.140246 |
| 6 | Shuji Nakamura | Charles Glover Barkla | 0.326525 |
| 7 | nanotechnology | gravitation physics | 0.127252 |
| 8 | spin | Martinus J. G. Veltman | 0.089726 |
| 9 | Murray Gell-Mann | Maria Goeppert-Mayer | 0.336182 |


We are then going to find the pairs ("source", "target") that are equal to pairs ("target", "source"). To do that, we are going to merge `melted_df` with itself where ("source", "target") = ("target", "source").

>- Merge it with itself with `left_on=["source", "target"]` and `right_on=["target", "source"]`. Pass the dataframe with its index reset using [`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html), so that the original row numbers are kept as a column.

In [22]:
# TODO: merge melted_df with itself
merged_df = pd.merge(melted_df.reset_index(),
                     melted_df.reset_index(),
                     left_on=["source", "target"],
                     right_on=["target", "source"])
merged_df

Unnamed: 0,index_x,source_x,target_x,value_x,index_y,source_y,target_y,value_y
0,0,Adam G. Riess,Robert B. Laughlin,0.438583,12689,Robert B. Laughlin,Adam G. Riess,0.438583
1,1,materials physics,Dennis Gabor,0.139132,10421,Dennis Gabor,materials physics,0.139132
2,2,bcs theory,quantum information science,0.084172,9616,quantum information science,bcs theory,0.084172
3,3,David J. Wineland,supernova,0.044755,60411,supernova,David J. Wineland,0.044755
4,4,Peter Higgs,superstring theory,0.137256,89186,superstring theory,Peter Higgs,0.137256
5,5,Brian P. Schmidt,geophysics,0.140246,71785,geophysics,Brian P. Schmidt,0.140246
6,6,Shuji Nakamura,Charles Glover Barkla,0.326525,41451,Charles Glover Barkla,Shuji Nakamura,0.326525
7,7,nanotechnology,gravitation physics,0.127252,27834,gravitation physics,nanotechnology,0.127252
8,8,spin,Martinus J. G. Veltman,0.089726,26729,Martinus J. G. Veltman,spin,0.089726
9,9,Murray Gell-Mann,Maria Goeppert-Mayer,0.336182,9795,Maria Goeppert-Mayer,Murray Gell-Mann,0.336182
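
To see what this self-merge does, here is a small sketch on a toy data frame. `toy_df` and `toy_merged` are hypothetical names and the values are invented: the frame has two mirrored pairs and one self-pair.

```python
import pandas as pd

# Toy data: (a, b) and (b, a) mirror each other, as do (a, c) and (c, a);
# (a, a) is a self-pair. Values are made up for illustration.
toy_df = pd.DataFrame({
    "source": ["a", "b", "a", "c", "a"],
    "target": ["b", "a", "c", "a", "a"],
    "value": [0.5, 0.5, 0.2, 0.2, 1.0],
})

# Each left row (source, target) is matched with the right row whose
# (target, source) is the same pair, i.e. its mirrored twin.
toy_merged = pd.merge(toy_df.reset_index(),
                      toy_df.reset_index(),
                      left_on=["source", "target"],
                      right_on=["target", "source"])

# index_x is the row label of a pair, index_y the label of its mirror.
print(toy_merged[["index_x", "source_x", "target_x", "index_y"]])
```

A self-pair is its own mirror, so it shows up with `index_x` equal to `index_y`.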


At this point, we can see that each pair of ("source", "target") has a redundant equivalent ("target", "source"). This also highlights the cases where "source" = "target". To filter out the useless rows, we can simply remove either the ("source", "target") pair or the ("target", "source") pair. Let's choose which one to remove by capturing the index we want to drop.

>- Look at the pair of columns `merged_df[["index_x", "index_y"]]` and simply choose the greater of the two using [`max`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html). By selecting only the unique values ([`unique`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html)) of the resulting list of indices, we obtain the indices to remove, and we can drop them using [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html).

In [23]:
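# TODO: find the index to drop
# Row-wise maximum of the two labels: the later row of each mirrored pair
# (and the row itself when "source" = "target").
index_to_drop = merged_df[["index_x", "index_y"]].max(axis=1).unique()
# TODO: use the index_to_drop to subset the melted_df dataframe
melted_df_sub = melted_df.drop(index_to_drop)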
melted_df_sub

Unnamed: 0,source,target,value
0,Adam G. Riess,Robert B. Laughlin,0.438583
1,materials physics,Dennis Gabor,0.139132
2,bcs theory,quantum information science,0.084172
3,David J. Wineland,supernova,0.044755
4,Peter Higgs,superstring theory,0.137256
5,Brian P. Schmidt,geophysics,0.140246
6,Shuji Nakamura,Charles Glover Barkla,0.326525
7,nanotechnology,gravitation physics,0.127252
8,spin,Martinus J. G. Veltman,0.089726
9,Murray Gell-Mann,Maria Goeppert-Mayer,0.336182
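
The whole deduplication step can be checked on a toy example. This is only a sketch with invented data; the names prefixed with `toy_` and the `pairs` check are illustrative and not part of the homework solution.

```python
import pandas as pd

# Two mirrored pairs and one self-pair (values are made up).
toy_df = pd.DataFrame({
    "source": ["a", "b", "a", "c", "a"],
    "target": ["b", "a", "c", "a", "a"],
    "value": [0.5, 0.5, 0.2, 0.2, 1.0],
})

toy_merged = pd.merge(toy_df.reset_index(),
                      toy_df.reset_index(),
                      left_on=["source", "target"],
                      right_on=["target", "source"])

# For each mirrored pair keep the row with the smaller label and drop the
# one with the larger label. A self-pair has index_x == index_y, so its
# only row is dropped as well.
toy_index_to_drop = toy_merged[["index_x", "index_y"]].max(axis=1).unique()
toy_sub = toy_df.drop(toy_index_to_drop)
print(toy_sub)

# Sanity check: every unordered pair now appears at most once.
pairs = [frozenset(p) for p in zip(toy_sub["source"], toy_sub["target"])]
print(len(pairs) == len(set(pairs)))
```

Because the data frame was shuffled beforehand, which of the two mirrored rows survives is effectively random rather than alphabetical.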


We have filtered out quite a few rows, but there are still too many for the network simulation to run efficiently. For each source, we are going to keep only the 5 highest values.

>- Group `melted_df_sub` by "source" using the [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method and select the 5 targets that have the highest values using the [`nlargest`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nlargest.html) method.
>- The resulting pandas Series has a MultiIndex with 2 levels: level 0 is the source and level 1 is the original row label. We need level 1 to know which rows to keep in `melted_df_sub`. You can get it using the function [`get_level_values`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html) on the index.

In [24]:
# TODO: Group melted_df_sub by "source" using the groupby method and select the 5
# targets that have the highest values using the nlargest method
largest_df = melted_df_sub.groupby("source")["value"].nlargest(5)
# TODO: get the level 1 of the multiindex
index_to_keep = largest_df.index.get_level_values(1)

links_df = melted_df_sub.loc[index_to_keep]
links_df

Unnamed: 0,source,target,value
41912,Aage Bohr,Leo James Rainwater,0.573368
53698,Aage Bohr,Val Logsdon Fitch,0.466007
43752,Aage Bohr,James Franck,0.456284
33703,Aage Bohr,Jerome I. Friedman,0.433905
29593,Aage Bohr,Manne Siegbahn,0.433859
24807,Abdus Salam,Sheldon Lee Glashow,0.329815
6097,Abdus Salam,Steven Weinberg,0.318946
41827,Abdus Salam,Nicolaas Bloembergen,0.303617
2355,Abdus Salam,Makoto Kobayashi,0.300355
2197,Abdus Salam,Isamu Akasaki,0.300234
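
The following sketch shows the shape of the intermediate result on a toy data frame. All `toy_` names, the row labels and the values are invented for illustration, and it keeps 2 rows per source instead of 5 to stay short.

```python
import pandas as pd

# Toy stand-in for melted_df_sub with non-consecutive row labels,
# as left over after a shuffle and a drop (values are made up).
toy_sub = pd.DataFrame({
    "source": ["a", "a", "a", "b", "b", "b"],
    "target": ["x", "y", "z", "x", "y", "z"],
    "value": [0.1, 0.9, 0.5, 0.3, 0.7, 0.2],
}, index=[10, 11, 12, 13, 14, 15])

# Series indexed by (source, original row label), largest values first
# within each source.
toy_largest = toy_sub.groupby("source")["value"].nlargest(2)
print(toy_largest)

# Level 1 holds the original row labels, which .loc can look up directly.
toy_index_to_keep = toy_largest.index.get_level_values(1)
toy_links = toy_sub.loc[toy_index_to_keep]
print(toy_links)
```

Changing the argument of `nlargest` is the simplest way to control how many links each source contributes to the network.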


In [32]:
melted_df_sub["target"].nunique()

336

We need to cast this data frame as a list of dictionaries as we have done for the list of nodes. 

>Use code similar to that used for the nodes to create a list of links:

In [176]:
# TODO: create the list of links
links_list = list(links_df.transpose().to_dict().values())
links_list

[{'source': 'Aage Bohr',
  'target': 'James Franck',
  'value': 0.45610418151761045},
 {'source': 'Aage Bohr',
  'target': 'Nicolaas Bloembergen',
  'value': 0.4505068386575566},
 {'source': 'Aage Bohr',
  'target': 'Manne Siegbahn',
  'value': 0.43346733345391864},
 {'source': 'Aage Bohr',
  'target': 'Jerome I. Friedman',
  'value': 0.4329679807019891},
 {'source': 'Aage Bohr',
  'target': 'Kai Manne Börje Siegbahn',
  'value': 0.42721838572556087},
 {'source': 'Abdus Salam',
  'target': 'particle physics',
  'value': 0.3182104275221652},
 {'source': 'Abdus Salam',
  'target': 'Andre Geim',
  'value': 0.31391980329161984},
 {'source': 'Abdus Salam',
  'target': 'Brian David Josephson',
  'value': 0.30472901252211},
 {'source': 'Abdus Salam',
  'target': 'Makoto Kobayashi',
  'value': 0.2942423632265061},
 {'source': 'Abdus Salam',
  'target': 'David J. Thouless',
  'value': 0.2937583888306586},
 {'source': 'Adam G. Riess',
  'target': 'Saul Perlmutter',
  'value': 0.75731772643865},
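
As an alternative sketch, pandas can build the same list of dictionaries directly with `to_dict(orient="records")`. The data frame `toy_links_df` below is invented for illustration; the comparison only shows that both approaches give the same result on it.

```python
import pandas as pd

# Toy stand-in for links_df (values and row labels are made up).
toy_links_df = pd.DataFrame({
    "source": ["Aage Bohr", "Abdus Salam"],
    "target": ["James Franck", "Steven Weinberg"],
    "value": [0.46, 0.32],
}, index=[7, 3])

# Approach used above: one dictionary per column of the transposed frame.
records_a = list(toy_links_df.transpose().to_dict().values())

# Alternative: one dictionary per row, without going through the index.
records_b = toy_links_df.to_dict(orient="records")

print(records_b)
print(records_a == records_b)
```

The `orient="records"` form does not depend on the row labels, whereas the transpose approach keys the intermediate dictionary by row label and therefore relies on the labels being unique.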


We now create the final dictionary for the network and save it to a JSON file.

In [177]:
network_dict = {"nodes": nodes_list,
                "links": links_list}

with open("./data/physicists.json","w") as f:
    json.dump(network_dict, f, indent=4)
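
Before opening the visualisation, it can help to check that every link points to an existing node, since a link to an unknown node id will typically break the layout. The helper `find_dangling_links` below is a hypothetical function written for this sketch, and it assumes that each node dictionary has an `"id"` key, as in `small_network`. The toy values are made up.

```python
def find_dangling_links(network):
    """Return the links whose source or target is not a known node id."""
    node_ids = {node["id"] for node in network["nodes"]}
    return [link for link in network["links"]
            if link["source"] not in node_ids
            or link["target"] not in node_ids]


# Toy network: "Niels Bohr" is used in a link but missing from the nodes.
toy_network = {
    "nodes": [{"id": "Albert Einstein"}, {"id": "Paul Dirac"}],
    "links": [
        {"source": "Albert Einstein", "target": "Paul Dirac", "value": 0.4},
        {"source": "Albert Einstein", "target": "Niels Bohr", "value": 0.3},
    ],
}

dangling = find_dangling_links(toy_network)
print(dangling)
```

Calling `find_dangling_links(network_dict)` on the real network should return an empty list if the nodes and links were built from the same data.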

If you have a Mac, the following script opens a Safari window to visualize this network; otherwise, you can just open the `index.html` file with Safari or Firefox. For some reason, it does not work with Chrome.

In [27]:
import os
os.system("open -a /Applications/Safari.app ./index.html")

0
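
One possible cause of the Chrome problem is that Chrome restricts pages opened from the local file system from loading other local files, such as the JSON data. This is an assumption; the exact cause has not been verified for this page. If that is the issue, serving the folder over a local HTTP server is a common workaround. The sketch below uses only the Python standard library; `start_local_server` is a hypothetical helper name and port 8000 is an arbitrary choice.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def start_local_server(directory=".", port=8000):
    """Serve `directory` on http://127.0.0.1:<port>/ in a background thread."""
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # daemon=True so the thread does not keep the Python process alive.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Usage from a notebook cell (not run here):
#   server = start_local_server(".", port=8000)
#   ...open http://127.0.0.1:8000/index.html in any browser...
#   server.shutdown()
```

From a terminal, running `python -m http.server 8000` in the homework folder has the same effect.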

Adjust the parameters to try to find nodes that tend to be grouped together. You can also try to recreate the network with a different number of links.