Take the accepted talks from IC2S2 and create a co-authorship graph and a person-concept graph in Neo4j.

The accepted talks are available at https://ic2s2.org/2017/elements/accepted_talks.html

In [1]:
from neo4jrestclient.client import GraphDatabase
from bs4 import BeautifulSoup
from collections import defaultdict

# Let's get the contents of that webpage and stash it locally.

```
$ curl -o accepted_talks.html https://ic2s2.org/2017/elements/accepted_talks.html
```

In [102]:
# check we can access the file and process it with Beautiful Soup
filename = "accepted_talks.html"
with open(filename, "r") as f:
    html = f.read()
# Name the parser explicitly; calling BeautifulSoup(html) without one triggers
# a warning that suggests BeautifulSoup([your markup], "lxml").
soup = BeautifulSoup(html, "lxml")


Goodness, the HTML is not very well structured, so we are going to have to infer what the different types of content are!

All of the titles are in bold, and the authors of each talk are siblings of the bold title element. Let's work from that!

In [189]:
# we want to extract the authors and titles of each talk
talks = []
section = soup.find('section', {'class': 'section', 'id': 'colgne'})
div = section.find("div")
btags = div.find_all("b")
for btag in btags:
    # .contents is a list of child nodes, so the title text is read as title[0] later on
    title = btag.contents
    # the author string sits two siblings after the bold title
    authors = btag.next_sibling.next_sibling
    author_list = authors.replace(" and ", ",").split(",")
    talks.append((title, author_list))
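
The author strings split this way keep their leading spaces (you can see them in the printed author lists further down), which is why `.strip()` shows up everywhere later. A minimal sketch of a helper that splits and cleans in one go, assuming authors are separated by commas and the word "and"; `split_authors` is a hypothetical name, not something used elsewhere in this notebook:

```python
import re

def split_authors(author_text):
    """Split a raw author string into a list of clean names."""
    # Treat commas and the word "and" (with whitespace on both sides) as separators.
    parts = re.split(r",|\s+and\s+", author_text)
    # Strip surrounding whitespace and drop empty fragments (e.g. from ", and").
    return [part.strip() for part in parts if part.strip()]

example = "Samuel Martin, Sylvie Huet and Serge Guimond"
print(split_authors(example))
```

Because the pattern requires whitespace around "and", a surname that merely contains those letters is left alone.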

In [190]:
# now that we have extracted the titles and authors from the page, I noticed that the talks are listed twice,
# so let's get a unique list of talks
unique_talks = []
for talk in talks:
    if talk not in unique_talks:
        unique_talks.append(talk)
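
The `talk not in unique_talks` check rescans the whole list for every talk, which is fine at this scale. For a longer page, a set of hashable keys would do the same job in one pass. This is only a sketch: `dedupe_talks` is a hypothetical helper, and it assumes each talk is a `(title, author_list)` pair as built above.

```python
def dedupe_talks(talks):
    """Return talks in first-seen order, dropping exact repeats."""
    seen = set()
    unique = []
    for title, authors in talks:
        # Lists are not hashable, so build a tuple-based key for the set.
        key = (tuple(title), tuple(author.strip() for author in authors))
        if key not in seen:
            seen.add(key)
            unique.append((title, authors))
    return unique

sample = [(["Talk one"], ["A", " B"]), (["Talk two"], ["C"]), (["Talk one"], ["A", " B"])]
print(len(dedupe_talks(sample)))
```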

In [58]:
# get a dumb set of stopwords
with open("SmartStoplist.txt") as f:
    stopwords = f.readlines()
stopwords_stripped = [word.strip() for word in stopwords]
strip_characters = [":", "?", "-", "'", "`", ";", ",", ".", "\"" ]

In [113]:
def clean_word(word):
    """
    OK, we are going to do a bit of cleaning up here, just a quick job, namely:
    normalise words to lowercase
    get rid of some punctuation
    unless the word is on our exception list, strip "s" from the end to get rid of plural variants (so dumb, but works right now)
    """
    candidate_word = word.lower().strip()
    current_word = candidate_word
    for char in strip_characters:
        current_word = current_word.replace(char, '')
    if current_word not in ["bias", "embeddedne", "ubiquitous", "analysis", "process", "strategie", "properties"]:
        # dumb way of getting rid of plurals, OK for a short analysis
        # (rstrip removes every trailing "s", so "congress" becomes "congre")
        cleaned_word = current_word.rstrip("s")
        return cleaned_word
    else:
        return current_word
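
The `rstrip("s")` trick is what produces keywords like "congre" and "strategie" in the output below. A slightly less dumb variant might look at the word ending before stripping. This is a rough sketch only: `strip_plural`, `PROTECTED_WORDS` and `SINGULAR_ENDINGS` are hypothetical names, and the rules are a heuristic that will still get some words wrong.

```python
PROTECTED_WORDS = {"bias"}               # hypothetical exception list
SINGULAR_ENDINGS = ("ss", "us", "is")    # e.g. congress, ubiquitous, analysis

def strip_plural(word):
    """Remove a plural ending from a lowercase word using simple suffix rules."""
    if word in PROTECTED_WORDS or word.endswith(SINGULAR_ENDINGS):
        return word
    if word.endswith("ies") and len(word) > 4:
        # strategies -> strategy, properties -> property
        return word[:-3] + "y"
    if word.endswith("s"):
        # networks -> network (only one trailing "s" is removed)
        return word[:-1]
    return word

for candidate in ["networks", "congress", "strategies", "analysis"]:
    print(candidate, strip_plural(candidate))
```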

In [114]:
def get_keywords(title):
    """
    Go through all of the words in a title and drop the words in our small stopword list,
    so that what is left looks a bit like keywords for the talk.
    """
    keywords = []
    for word in title.split():
        cleaned_word = clean_word(word)
        if cleaned_word not in stopwords_stripped:
            keywords.append(cleaned_word)
    return keywords

In [104]:
# now check what that looks like against our titles 
for talk in unique_talks:
    title = talk[0][0]
    keywords = get_keywords(title)
    print(title, keywords) # looks pretty good

School segregation in the digital space ['school', 'segregation', 'digital', 'space']
Bounded confidence in extreme opinion evolution : an online experiment ['bounded', 'confidence', 'extreme', 'opinion', 'evolution', '', 'online', 'experiment']
Do Twitter Data reflect the Link between economic and cultural Modernization? ['twitter', 'data', 'reflect', 'link', 'economic', 'cultural', 'modernization']
Modeling Echo Chambers on Social Media ['modeling', 'echo', 'chamber', 'social', 'media']
New Techniques for Privacy-preserving Record Linkage of Large-scale Social Science Data Sets ['technique', 'privacypreserving', 'record', 'linkage', 'largescale', 'social', 'science', 'data', 'set']
Ideological Creation of Market in Socialist China: Change in Economic Rhetoric in the People’s Daily, 1946-2003 ['ideological', 'creation', 'market', 'socialist', 'china', 'change', 'economic', 'rhetoric', 'people’', 'daily', '19462003']
The Network Dynamics of Social Influence in the Wisdom of Crowds ['ne

In [115]:
# let's get a unique set of keywords that we will use to populate our graph.
# if we use a defaultdict, we can count how often each keyword occurs and use that as its weight
unique_keywords = defaultdict(int)
for talk in unique_talks:
    title = talk[0][0]
    keywords = get_keywords(title)
    for keyword in keywords:
        unique_keywords[keyword] += 1

In [117]:
# print the position, keyword and count for every keyword that occurs more than twice
for index, keyword in enumerate(unique_keywords):
    if unique_keywords[keyword] > 2:
        print(index, keyword, unique_keywords[keyword])

2 digital 6
8 evolution 5
10 online 12
12 twitter 11
13 data 14
16 economic 4
17 cultural 5
19 modeling 5
22 social 26
23 media 8
26 record 3
28 largescale 3
33 market 5
41 network 25
42 dynamic 9
43 influence 3
44 wisdom 4
45 crowd 5
47 wikipedia 5
53 moral 4
54 foundation 3
55 political 9
58 analysis 9
63 effect 8
76 content 5
83 measuring 3
92 election 4
103 text 3
106 understanding 3
107 collective 3
110 learning 3
130 evidence 3
133 survey 3
145 pattern 4
156 complex 4
167 computational 5
175 global 4
197 urban 3
200 strategie 3
216 gender 4
219 community 3
223 group 3
227 bot 3
235 propaganda 3
276 model 5
292 suicide 3
301 difference 3
318 facebook 4
327 system 4
332 cooperation 3
354 mapping 3
367 prediction 3
374 public 3
385 user 3
411 protest 5
475 perspective 3
551 recommender 3
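
The first column in that listing is just the position of the keyword in the dictionary. If the aim is to see the most common keywords first, `collections.Counter` might be a better fit. A small sketch, assuming the input is a list of keyword lists like the ones `get_keywords` returns; `frequent_keywords` is a hypothetical helper, and it also skips the empty string that stripped punctuation leaves behind:

```python
from collections import Counter

def frequent_keywords(keyword_lists, min_count=3):
    """Return (keyword, count) pairs with count >= min_count, most common first."""
    counts = Counter(
        keyword
        for keywords in keyword_lists
        for keyword in keywords
        if keyword  # skip empty strings left behind by stripped punctuation
    )
    return [(kw, n) for kw, n in counts.most_common() if n >= min_count]

sample = [["social", "network", ""], ["social", "network"], ["social", "network"], ["network", "data"]]
print(frequent_keywords(sample))
```

`min_count=3` matches the `> 2` threshold used in the cell above.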


OK, now that we have titles, authors and keywords, it's time to start thinking about building a network of these objects to have a look at.

In [105]:
# connect to the database
db = GraphDatabase("http://localhost:7474", username="neo4j", password="sage")

In [120]:
# we want three kinds of nodes - people, talks, keywords 
person = db.labels.create("Person")
talk = db.labels.create("Talk")
keyword = db.labels.create("Keyword")

In [121]:
# we want to create a set of node objects for keywords and then add them to neo4j, with weights
keyword_node_map = {}
# for kw, weight in unique_keywords.items():
#     print(kw, weight)
#     keyword_node_map[kw] = db.nodes.create(name=kw, weight=weight)
#     keyword.add(keyword_node_map[kw])

# OK, we have the nodes in with weights. I don't want to push them in again, so I'm going to comment out this code

school 1
segregation 1
digital 6
space 1
bounded 1
confidence 1
extreme 1
opinion 1
evolution 5
 1
online 12
experiment 2
twitter 11
data 14
reflect 1
link 2
economic 4
cultural 5
modernization 1
modeling 5
echo 1
chamber 1
social 26
media 8
technique 1
privacypreserving 1
record 3
linkage 1
largescale 3
science 2
set 1
ideological 1
creation 1
market 5
socialist 1
china 2
change 2
rhetoric 1
people’ 1
daily 1
19462003 1
network 25
dynamic 9
influence 3
wisdom 4
crowd 5
read 1
wikipedia 5
theorybased 1
targeting 1
increase 1
technology 2
adoption 2
moral 4
foundation 3
political 9
discourse 2
comparative 1
analysis 9
speech 2
congre 1
japanese 1
diet 1
effect 8
missing 1
translation 1
analyse 2
european 1
legislative 1
sampling 1
attribute 2
fall 1
empire 1
americanization 1
english 1
sponsored 1
content 5
psychological 2
personality 1
profile 2
extremist 1
care 1
overhead 1
measuring 3
efficiency 2
charity 1
marketplace 2
spread 1
fake 1
voter 1
2016 2
general 2
election 4
statistical

In [130]:
# we want to create a set of node objects for authors and then add them to neo4j, we don't need weights for these people
unique_authors = []
for talk in unique_talks:
    for author in talk[1]:
        if author.strip() not in unique_authors:
            unique_authors.append(author.strip())

In [132]:
# # now generate nodes for our authors 
# author_node_map = {} 
# for author in unique_authors:
#     author_node_map[author] = db.nodes.create(name=author)
#     person.add(author_node_map[author])

# OK got those in 

In [137]:
# now generate a set of nodes for the talks
# talk = db.labels.create("Talk")
# titles = list(map(lambda x: x[0][0], unique_talks))
# titles
# talk_node_map = {}
# for title in titles:
#     talk_node_map[title] = db.nodes.create(name=title)
#     talk.add(talk_node_map[title])

Those are in now

In [2]:
# author_node_map
# keyword_node_map
talk_node_map

NameError: name 'talk_node_map' is not defined

Now we want to create the links between the nodes. There are a number of different links we want to make:

author - author links 
author - keyword links 
talk - author links

let's start with the talk - author links 
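
Before touching the database, it can help to spell out what those three link types look like for a single talk. The sketch below builds plain `(source, relationship, target)` tuples and does not talk to Neo4j at all. `Wrote` and `CoAuthor` are the relationship names used in this notebook; `talk_edges` and the `InterestedIn` label for author - keyword links are assumptions made up for the example.

```python
from itertools import combinations

def talk_edges(title, authors, keywords):
    """List every (source, relationship, target) edge implied by one talk."""
    names = [author.strip() for author in authors]
    edges = []
    for name in names:
        # talk - author link
        edges.append((name, "Wrote", title))
        # author - keyword links (hypothetical relationship name)
        for keyword in keywords:
            edges.append((name, "InterestedIn", keyword))
    # author - author links, one per unordered pair
    for a, b in combinations(sorted(set(names)), 2):
        edges.append((a, "CoAuthor", b))
    return edges

for edge in talk_edges("Modeling Echo Chambers", ["B Author", " A Author"], ["echo"]):
    print(edge)
```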

In [145]:
# create links between talks and authors 
# for talk in unique_talks:
#     title = talk[0][0]
#     authors = talk[1]
#     talk_node_id = talk_node_map[title]
#     print("talk:")
#     print(talk_node_id)
#     print("authors:")
#     for author in authors:
#         author_node_id = author_node_map[author.strip()]
#         print(author_node_id)
#         author_node_id.Wrote(talk_node_id)

talk:
<Neo4j Node: http://localhost:7474/db/data/node/991>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/618>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/992>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/619>
<Neo4j Node: http://localhost:7474/db/data/node/620>
<Neo4j Node: http://localhost:7474/db/data/node/621>
<Neo4j Node: http://localhost:7474/db/data/node/622>
<Neo4j Node: http://localhost:7474/db/data/node/623>
<Neo4j Node: http://localhost:7474/db/data/node/624>
<Neo4j Node: http://localhost:7474/db/data/node/625>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/993>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/626>
<Neo4j Node: http://localhost:7474/db/data/node/627>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/994>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/628>
<Neo4j Node: http://localhost:7474/db/data/node/629>
<Neo4j Node: http://localhost:7474/db/data/node/630>
<Neo4j Node: http://localhost:7474/db/d

<Neo4j Node: http://localhost:7474/db/data/node/640>
<Neo4j Node: http://localhost:7474/db/data/node/644>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1025>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/737>
<Neo4j Node: http://localhost:7474/db/data/node/738>
<Neo4j Node: http://localhost:7474/db/data/node/739>
<Neo4j Node: http://localhost:7474/db/data/node/740>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1026>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/741>
<Neo4j Node: http://localhost:7474/db/data/node/742>
<Neo4j Node: http://localhost:7474/db/data/node/743>
<Neo4j Node: http://localhost:7474/db/data/node/744>
<Neo4j Node: http://localhost:7474/db/data/node/745>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1027>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/746>
<Neo4j Node: http://localhost:7474/db/data/node/747>
<Neo4j Node: http://localhost:7474/db/data/node/748>
<Neo4j Node: http://localhost:7474/db/data/node/749

<Neo4j Node: http://localhost:7474/db/data/node/846>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1061>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/847>
<Neo4j Node: http://localhost:7474/db/data/node/848>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1062>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/686>
<Neo4j Node: http://localhost:7474/db/data/node/684>
<Neo4j Node: http://localhost:7474/db/data/node/849>
<Neo4j Node: http://localhost:7474/db/data/node/687>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1063>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/850>
<Neo4j Node: http://localhost:7474/db/data/node/851>
<Neo4j Node: http://localhost:7474/db/data/node/852>
<Neo4j Node: http://localhost:7474/db/data/node/853>
<Neo4j Node: http://localhost:7474/db/data/node/668>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1064>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/854>
<Neo4j Node: http://localhost:7474/

<Neo4j Node: http://localhost:7474/db/data/node/668>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1101>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/940>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1102>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/941>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1103>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/717>
<Neo4j Node: http://localhost:7474/db/data/node/719>
<Neo4j Node: http://localhost:7474/db/data/node/942>
<Neo4j Node: http://localhost:7474/db/data/node/811>
<Neo4j Node: http://localhost:7474/db/data/node/722>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1104>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/943>
<Neo4j Node: http://localhost:7474/db/data/node/944>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1105>
authors:
<Neo4j Node: http://localhost:7474/db/data/node/945>
talk:
<Neo4j Node: http://localhost:7474/db/data/node/1106>
authors:
<Ne

In [203]:
def gen_co_authorships(author_list):
    """
    Return a list of sorted [a, b] pairs, one for each co-authorship in author_list.
    """
    pairs = []
    for author in author_list:
        for target in author_list:
            a = author.strip()
            b = target.strip()
            pair = sorted([a, b])
            if pair not in pairs and a != b:
                pairs.append(pair)
    return pairs 

In [205]:
# create links between authors 
# we don't want to create a link if it already exists. It might be interesting to look at relationship weights, but let's not
# do that now.

# though we could check for existence of connections in the DB, the sad truth is that I don't know the query API well enough
# yet, and have about 15 minutes to finish this before I go out and pick up some fish, so I have to do this the quick way
author_author_map = [] 
global_connections = []
for talk in unique_talks:
    authors = talk[1]
    if len(authors) > 1:
        print(authors)
        co_authorships = gen_co_authorships(authors)
        for co_authorship in co_authorships:
            if co_authorship not in global_connections:
                global_connections.append(co_authorship)

['Samuel Martin', ' Sylvie Huet', ' Pascal Gend', ' Mohammed Amine Ait Oumajoud', ' Guillaume Deffuant', ' Armelle Nugier', 'Serge Guimond']
['Jochen Hirschle', 'Tuuli-Marja Kleiner']
['Kazutoshi Sasahara', ' Giovanni Luca Ciampaglia', ' Alessandro Flammini', 'Filippo Menczer']
['Rainer Schnell', 'Christian Borgs']
['Shilin Jia', 'Linzhuo Li']
['Joshua Becker', ' Devon Brackbill', 'Damon Centola']
['Florian Lemmerich', ' Philipp Singer', ' Robert West', ' Leila Zia', ' Ellery Wulczyn', ' Markus Strohmaier', 'Jure Leskovec']
['Lori Beaman', ' Ariel Benyishay', ' Jeremy Magruder', 'Ahmed Mushfiq Mobarak']
['Hiroki Takikawa', 'Takuto Sakamoto']
['Laura Hollink', ' Astrid van Aggelen', 'Jacco van Ossenbruggen']
['Claudia Wagner', ' Philipp Singer', ' Fariba Karimi', ' Jürgen Pfeffer', 'Markus Strohmaier']
['Bruno Gonçalves', ' Lucia Loureiro-Porto', ' Jose J. Ramasco', 'David Sanchez']
['Jaimie Park', ' Mahmoudreza Babaei', ' Przemyslaw Grabowicz', ' Krishna Gummadi', 'Sue Moon']
['Meysam 
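
If relationship weights do become interesting later, counting how many talks each pair of authors shares is a small change. A sketch using only the standard library, assuming each talk contributes one list of author names; `co_authorship_weights` is a hypothetical helper and nothing here has been pushed into the graph:

```python
from collections import Counter
from itertools import combinations

def co_authorship_weights(author_lists):
    """Count how many talks each unordered pair of authors shares."""
    weights = Counter()
    for authors in author_lists:
        # Strip, de-duplicate and sort so each pair always appears as (a, b) with a < b.
        names = sorted({author.strip() for author in authors})
        weights.update(combinations(names, 2))
    return weights

sample = [["Philipp Singer", " Markus Strohmaier"], ["Markus Strohmaier", "Philipp Singer", "Claudia Wagner"]]
for pair, weight in co_authorship_weights(sample).items():
    print(pair, weight)
```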

In [210]:
# we have now generated all global connections between authors, so we can create the links between these authors!
for connection in global_connections:
    source_node = author_node_map[connection[0]]
    target_node = author_node_map[connection[1]]
    source_node.CoAuthor(target_node)
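
One way to avoid duplicate links without tracking them in Python would be to let the database do the check, since Cypher's `MERGE` only creates a relationship when no matching one exists. The sketch below just builds the query text and parameters; it does not run them, and it has not been tested against neo4jrestclient. `co_author_merge_statements` is a hypothetical helper, and the `$name` parameter syntax assumes a Neo4j version that supports it.

```python
def co_author_merge_statements(pairs):
    """Build (query, parameters) tuples that link each author pair at most once."""
    query = (
        "MATCH (a:Person {name: $source}), (b:Person {name: $target}) "
        "MERGE (a)-[:CoAuthor]-(b)"
    )
    # One parameter dictionary per pair; the query text is shared.
    return [(query, {"source": source, "target": target}) for source, target in pairs]

for query, params in co_author_merge_statements([["Rainer Schnell", "Christian Borgs"]]):
    print(query, params)
```

Each pair could then be handed to whichever client is in use, keeping the names out of the query string itself.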

Interesting Cypher queries against this graph

MATCH (n:Keyword) WHERE n.weight > 4 RETURN n LIMIT 700