# ThingLinker

Python script by Simon Lindgren // [@simonlindgren](http://www.twitter.com/simonlindgren) // [simonlindgren.com](http://simonlindgren.com).

[Actor-Network theory](https://en.wikipedia.org/wiki/Actor%E2%80%93network_theory) (ANT) is about how human and non-human actants are connected in relational systems. It sees entities (humans, texts, machines, activities, ideas) as linked to each other in heterogeneous networks. Actors appear in any shape or material. The important thing is not whether they have human agency, but whether they have the capacity to make a difference in the course of action of other entities.

One way of analysing processes like these is to look at the mechanisms through which an actor is connected to other actors, and how those other actors are in turn linked to each other. Such analyses can be a starting point for closer assessments of things such as [obligatory passage points](https://en.wikipedia.org/wiki/Obligatory_passage_point), [interessement](https://en.wikipedia.org/wiki/Interessement), [enrolment, and mobilisation](https://en.wikipedia.org/wiki/Translation_(sociology)). In short, networks are continually made and re-made by actors who draw links and associations.

In [None]:
# Required Python libraries
import glob, re
import pandas as pd
import spacy
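# Set up spaCy with a small English model. Recent spaCy versions require the full
# model name; older versions also accepted the shortcut "en".
nlp = spacy.load("en_core_web_sm")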

### Importing text data
The ThingLinker starts from a text dataset that we want to analyse from the perspective of ANT. The data should be in the form of `.txt` files in a `data/` subdirectory next to the ThingLinker script. We consider each line in the input as a document, so lines could be units like tweets, blog posts, books, chapters, paragraphs, articles, article sections, etc. – all depending on how we want to calculate the links further on. ThingLinker will analyse how `Things` are `Linked` based on whether they co-occur _within_ the documents defined here.

We read the data into a list:

In [None]:
fs = glob.glob("data/*.txt")                # the files in our data directory
dataset = []                                # an empty dataset (Python list)

for f in fs:                                # iterate over the files
    with open(f, 'r') as infile:            # the file is closed automatically
        for l in infile:                    # read each line in the file
            dataset.append(l)               # add it to the dataset
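
What counts as a document is a choice made at import time. As a minimal sketch, assuming the raw text is already available as a string, the hypothetical helper `split_into_documents` below (not part of ThingLinker) shows how the same text yields different documents depending on whether lines or blank-line-separated paragraphs are used as the unit:

```python
def split_into_documents(text, unit="line"):
    """Split raw text into documents, by line or by blank-line-separated paragraph."""
    if unit == "paragraph":
        parts = text.split("\n\n")          # paragraphs are separated by a blank line
    else:
        parts = text.splitlines()           # default: one document per line
    # Normalise whitespace inside each document and drop empty ones
    return [" ".join(p.split()) for p in parts if p.strip()]


raw = "a b\nc\n\nd"
print(split_into_documents(raw))                    # ['a b', 'c', 'd']
print(split_into_documents(raw, unit="paragraph"))  # ['a b c', 'd']
```

Larger units produce more co-occurrences per document, so the choice directly affects how dense the resulting network becomes.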

We then clean the data of things that we do not want. The code below removes URLs and any non-alphanumeric characters, collapses double spaces and double line breaks, and drops empty lines. Note, however, that from the perspective of ANT, things such as URLs or emojis can definitely be interesting as actors, so the code below must be customised for the research task at hand.

In [None]:
clean = []

for line in dataset:
    line = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$', " ", line)
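    line = re.sub('[^0-9a-zA-Z]+', ' ', line)   # replace runs of non-alphanumeric characters with a space
    line = re.sub('  ', ' ', line)              # collapse double spaces
    line = re.sub(r'(\n\n)','\n', line)         # collapse double line breaks
    if line.strip():                            # keep only non-empty lines
        clean.append(line)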

dataset = clean
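
If URLs are of interest as actors, they can be set aside before the text is cleaned rather than thrown away. The following is only a sketch under stated assumptions: `URL_RE` is a deliberately simple, hypothetical pattern (not the one used above), and `split_urls` is an illustrative helper that is not part of ThingLinker.

```python
import re

# Hypothetical, simplified URL pattern: anything starting with http:// or https://
URL_RE = re.compile(r'https?://\S+')


def split_urls(line):
    """Return (cleaned_text, urls) for one document line."""
    urls = URL_RE.findall(line)                         # keep the URLs for later use
    text = URL_RE.sub(' ', line)                        # remove them from the text
    text = re.sub(r'[^0-9a-zA-Z]+', ' ', text).strip()  # same clean-up idea as above
    return text, urls


text, urls = split_urls("Read this https://example.org/post now!")
print(text)   # Read this now
print(urls)   # ['https://example.org/post']
```

The collected URLs could then be added to the tagged data as things of their own type, in the same way as the `MANUAL` entities described further down.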

In [None]:
# Inspect the dataset
dataset

### Extracting Things

As our next step, we use the [`spaCy`](https://spacy.io) Python library to extract what language technologists call '[Named Entities](https://en.wikipedia.org/wiki/Named-entity_recognition)'. These entities – 'Things' – are considered here as potential actors, in the sense of ANT. With `spaCy`, we will get the following tags:

- PERSON: People, including fictional.
- NORP: Nationalities or religious or political groups.
- FACILITY (`FAC` in recent spaCy models): Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LANGUAGE: Any named language.

It also extracts the following values:

- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including "%".
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: "first", "second", etc.
- CARDINAL: Numerals that do not fall under another type.

In addition to these, the ThingLinker will extract any number of `MANUAL` entities that are defined by the researcher depending on the issue at hand. Such entities could be things that are for some reason not caught by `spaCy` but still interesting to the analysis, or names of emotions, activities or any other thing. The entities to extract are defined in the `my_things` list.

The code below does the following:
* Sets up an empty list called `tagged` and writes a string of column names to it.
* Sets up a variable called `itemnumber` to use as a counter for processed items.
* Iterates through all the items in our `dataset` and does three things:
    - First, it uses the `nlp` model that we loaded with `spaCy` to extract Named Entities from the item.
    - Second, it checks if any user-defined things (stored in the `my_things` list) occur in the item.
    - Third, it adds any extracted things to `tagged`.

In [None]:
tagged = []
tagged.append("doc;type;thing") #column names

my_things = ['love', 'hate']

itemnumber = 0

for item in dataset:
    itemnumber = itemnumber + 1
    print("Parsing item " + str(itemnumber) + " of " + str(len(dataset)), end='\r')
    parsed = nlp(item.strip())
    ents = list(parsed.ents)
    sents = list(parsed.sents)
    for ent in ents:                            # named entities found by spaCy
        # no spaces after the separators, so values get no leading whitespace
        tagged.append(str(itemnumber) + ";" + ent.label_ + ";" + ent.text)
    for sent in sents:
        bag = str(sent).split()                 # sentence to bag-of-words
        for thing in my_things:                 # look for my things in the bag
            if thing in bag:
                tagged.append(str(itemnumber) + ";MANUAL;" + thing)
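
Note that the lookup of `my_things` in the loop above is case-sensitive, so `'love'` will not match `Love` at the start of a sentence. If that is not wanted, a case-insensitive variant could look roughly like the sketch below. The helper name `find_manual_things` is hypothetical and not part of ThingLinker.

```python
def find_manual_things(text, things):
    """Return the user-defined things that occur as whole tokens in text, ignoring case."""
    bag = {token.lower() for token in text.split()}     # lower-cased bag-of-words
    return [thing for thing in things if thing.lower() in bag]


print(find_manual_things("Love conquers all", ['love', 'hate']))  # ['love']
print(find_manual_things("lovely weather", ['love', 'hate']))     # []
```

Because whole tokens are compared, `lovely` does not count as an occurrence of `love`; matching on lemmas instead would be a further design choice.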

In [None]:
# Write to a csv file
with open("tagged.csv","w") as out:
    for line in tagged:
        out.write(line)
        out.write('\n')

# Read the csv into a dataframe (DataFrame.from_csv was removed in pandas 1.0)
tagged_df = pd.read_csv('tagged.csv', sep=';', skipinitialspace=True)

In the process above, we extracted data using some of the properties that `spaCy` parsed (namely `.ents` and `.sents`). We can inspect other available properties of the parsed data:

In [None]:
dir(parsed)
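
Not every extracted tag has to be treated as a potential actor. As a sketch, the value-type tags listed earlier (`DATE`, `TIME`, and so on) could be filtered out of the tagged table before any linking is done. The dataframe `toy_tagged` below is a made-up stand-in for `tagged_df`, and whether values should be dropped at all depends on the research question.

```python
import pandas as pd

# The value-type tags from the list above
VALUE_TYPES = ["DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# Made-up rows in the same doc/type/thing layout as tagged_df
toy_tagged = pd.DataFrame({
    "doc": [1, 1, 2],
    "type": ["PERSON", "DATE", "ORG"],
    "thing": ["Latour", "1987", "Gephi"],
})

# Keep only the rows whose type is not a value type
actors_only = toy_tagged[~toy_tagged["type"].isin(VALUE_TYPES)]
print(actors_only)
```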

### Linking Things

Having extracted the things we want, we now analyse how they are connected within the analysed documents. To do this, we first filter out the two columns that we need from the tagged dataframe:

In [None]:
df_filtered = tagged_df[['doc','thing']]

Second, we use `pandas.merge()` to get the [Cartesian product](https://www.reddit.com/r/explainlikeimfive/comments/1kznwi/eli5_cartesian_product/) of all the rows which have the same document, i.e. a list of all occurring pairs and the documents in which they occur.

In [None]:
# pid = pairs in docs
df_pid = pd.merge(df_filtered, df_filtered, on='doc')

# Keep only pairs where thing_x sorts before thing_y. This removes self-loops
# (thing_x == thing_y) and keeps each pair once instead of as both A-B and B-A.
df_pid = df_pid.query("thing_x < thing_y")
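
To see what the self-merge and the filter do, consider a toy example with a single document containing three distinct things (the data is made up for illustration). The merge produces every ordered combination, 3 × 3 = 9 rows, and the filter keeps each unordered pair exactly once:

```python
import pandas as pd

# One made-up document with three distinct things
toy = pd.DataFrame({"doc": [1, 1, 1], "thing": ["Latour", "Paris", "love"]})

all_pairs = pd.merge(toy, toy, on="doc")        # columns: doc, thing_x, thing_y
kept = all_pairs.query("thing_x < thing_y")     # drop self-loops and mirrored pairs

print(len(all_pairs), len(kept))                # 9 3
print(kept)
```

The comparison is a plain string comparison, so it only decides which of the two things ends up in `thing_x`; the resulting edges are undirected.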

Third, let's group the dataframe (`.groupby()`) by the pairs and count the number of distinct documents in which the pair appears.

In [None]:
df_grouped = df_pid.groupby(by=['thing_x', 'thing_y']).agg({'doc': 'nunique'})

Finally, we call `.reset_index()` to flatten the hierarchical grouping and get weighted pairs.

In [None]:
df_pairs = df_grouped.reset_index()

# Rename the columns into Gephi terminology
df_pairs.rename(columns={'thing_x' : 'Source', 'thing_y' : 'Target', 'doc': 'Weight'}, inplace=True)

df_pairs
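
The weighting logic of the steps above can also be expressed without pandas, which may help when checking results by hand. The sketch below uses made-up documents and counts, for every unordered pair of things, the number of documents in which both occur. It illustrates the idea and is not the code ThingLinker runs.

```python
from collections import Counter
from itertools import combinations

# Made-up input: document id -> things found in that document
docs = {
    1: ["Latour", "Paris", "love"],
    2: ["Paris", "Latour", "Paris"],    # repeated things count once per document
}

weights = Counter()
for things in docs.values():
    # sorted(set(...)) gives each distinct thing once, in a fixed order,
    # so every pair is generated once per document with a stable orientation
    for pair in combinations(sorted(set(things)), 2):
        weights[pair] += 1

for (source, target), weight in sorted(weights.items()):
    print(source, target, weight)
```

Here the pair `Latour` and `Paris` gets weight 2 because it occurs in both documents, while the two pairs involving `love` get weight 1.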

### Write files for network analysis

ThingLinker will output two files, a node table and an edge table, for use in further network analysis in, for example, [Gephi](https://gephi.org/users/supported-graph-formats/spreadsheet/).

In [None]:
# NODE TABLE

# Extract data from the tagged_df dataframe
df_nodes = tagged_df[['thing', 'thing', 'type']]

# Rename the columns into Gephi terminology
df_nodes.columns = ['Id', 'Label', 'Type']

# Remove duplicate rows
df_nodes = df_nodes.drop_duplicates()

# Write the nodetable to csv
df_nodes.to_csv('thinglinker_nodes.csv', index=False, header=True, sep=';')

print(len(df_nodes), "nodes")

In [None]:
# EDGE TABLE

df_pairs.to_csv('thinglinker_edges.csv', index=False, header=True, sep=';')

print(len(df_pairs), "edges")
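
Before moving on to a tool like Gephi, it can be useful to sanity-check the edge table. As a rough sketch, the snippet below parses text in the same semicolon-separated `Source;Target;Weight` layout and ranks things by weighted degree, i.e. the sum of the weights of their edges. The sample rows are made up; in practice the file `thinglinker_edges.csv` would be opened instead of the in-memory string.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the same layout as thinglinker_edges.csv
sample = "Source;Target;Weight\nLatour;Paris;2\nLatour;love;1\nParis;love;1\n"

degree = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample), delimiter=";"):
    weight = int(row["Weight"])
    degree[row["Source"]] += weight     # edges are undirected, so both
    degree[row["Target"]] += weight     # endpoints receive the weight

# Highest weighted degree first; ties are broken alphabetically
ranked = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)   # [('Latour', 3), ('Paris', 3), ('love', 2)]
```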