# 2a. Pre-Processing: Organisation Aliases

As mentioned in the previous step, organisation names are a serious problem: they are not applied uniformly, so the same organisation appears under several entries, e.g. "ESA", " ESA", "European Space Agency", "European Space Agency (ESA)". Making the names as uniform as possible matters, because only then will we have reliable data on the involvement of different organisations.

In [2]:
import pickle
import re
import json
from difflib import SequenceMatcher

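Unpickling the raw data from step 1. The cells below refer to it as input, which shadows Python's built-in function of that name for the rest of the notebook: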

In [None]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/1.HTML_Parsing/"
with open(directory+"IAC_raw_data.pickle", "rb") as handle:
  input = pickle.load(handle)

## 1. Simplified dict
This dict maps each paper_id (key) to that paper's organisation string (value) and drops everything else.

In [None]:
raw1 = {}
for key, value in input.items():
  for id, info in value.items():
    raw1.update({id: info["company"]})
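The same flattening can be sketched as a dict comprehension. The nested sample below is a hypothetical stand-in for the unpickled data (the outer keys and paper ids are made up); only the "company" field comes from the cell above.

```python
# Hypothetical stand-in for the unpickled data: outer key -> paper_id -> info.
nested = {
    "session_A": {"paper_1": {"company": "ESA; DLR"}},
    "session_B": {"paper_2": {"company": "NASA"}},
}

# Flatten to paper_id -> organisation string; the outer keys are not needed.
flat = {
    paper_id: info["company"]
    for papers in nested.values()
    for paper_id, info in papers.items()
}
print(flat)
```

As with update(), a paper_id that occurs under two outer keys keeps only the value seen last.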

## 2. Separating the organisation names
In some cases, authors from different organisations wrote a paper together. Their organisation names are stored as one string, separated by semicolons. This code separates them and puts them in a list.

In [None]:
raw2 = {}
for key, value in raw1.items():
  raw2.update({key: value.split(";")})

## 3. Removing empty spaces in organisation names
Some organisation names have a space at the beginning or end (or both). This code removes them. It also drops "N/A" placeholders and entries of two characters or fewer:

In [None]:
raw3 = {}
for key, value in raw2.items():
  raw3.update({key: []})
  for element in value:
    # skip placeholders and fragments of two characters or fewer
    if len(element) > 2 and element != "N/A":
      # strip(" ") removes spaces at both ends; comparing characters with
      # "is" tests object identity rather than equality and is unreliable
      raw3[key].append(element.strip(" "))
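Steps 2 and 3 can also be combined into one small helper. This is only a sketch: split_organisations is a hypothetical name, and unlike the cell above it strips before it filters, so a short fragment such as " ab" is dropped rather than kept.

```python
def split_organisations(raw):
    """Split a semicolon-separated string and tidy the individual names."""
    # str.strip() removes whitespace at both ends of each part
    parts = [part.strip() for part in raw.split(";")]
    # drop "N/A" placeholders and fragments of two characters or fewer
    return [part for part in parts if len(part) > 2 and part != "N/A"]

print(split_organisations("ESA; European Space Agency ;N/A; "))
```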

## 4. Harmonising organisation names
Many organisations have several aliases, partly due to misspellings and partly because the abbreviation is sometimes included and sometimes left out. I've compiled a dict (synonyms1) of organisations that appear particularly often, with a list of aliases as the value. Using regex, we'll iterate over the data; if any of the aliases occurs in an organisation name, we'll replace the name with the first alias in that list. This is identical to the key of synonyms1 for every entry except Thales, which becomes "Thales Alenia Space".

In [None]:
synonyms1 = {"ESA": ["ESA", "European Space Agency", "ESTEC"],
             "DLR": ["DLR", "German Aerospace Center", "Zentrum fr Luft- und Raumfahrt", "Zentrum fuer Luft- und Raumfahrt"],
             "CNES": ["CNES", "Centre National d'Etudes Spatiales"],
             "ASI": ["ASI", "Italian Space Agency", "Agenzia Spaziale Italiana"],
             "NASA": ["NASA", "National Aeronautics and Space Administration"],
             "ISRO": ["ISRO", "Indian Space Research Organization"],
             "KARI": ["KARI", "Korea Aerospace Research Institute"],
             "EUSPA": ["EUSPA", "European Union Agency for the Space Programme"],
             "ROSCOSMOS": ["ROSCOSMOS", "Federal Space Agency (ROSCOSMOS)"],
             "JAXA": ["JAXA", "Japan Aerospace Exploration Agency"],
             "UKSA": ["UKSA", "UK Space Agency"],
             "POLSA": ["POLSA", "Polish Space Agency"],
             "OHB": ["OHB", "OHB System"],
             "Airbus": ["Airbus"],
             "Thales": ["Thales Alenia Space", "Thales"],
             "Telespazio": ["Telespazio"]
             }
synonyms1_list = []
for key, value in synonyms1.items():
  for word in value:
    synonyms1_list.append(word)

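raw4 = {}
# one pattern over all aliases; re.escape keeps "(" and ")" literal
any_alias = "|".join(re.escape(word) for word in synonyms1_list)
for key, value in raw3.items():
  raw4.update({key: []})
  for element in value:
    if re.search(any_alias, element):
      # append the first alias of every organisation that occurs in the name
      for orga, list_of_aliases in synonyms1.items():
        if re.search("|".join(re.escape(word) for word in list_of_aliases), element):
          raw4[key].append(synonyms1[orga][0])
    else:
      # no alias found: keep the name unchanged
      raw4[key].append(element)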
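One caveat of plain substring matching is that a short alias such as "ASI" also matches inside longer words. A possible refinement, sketched here under my own assumptions, is to escape every alias and require that it is not embedded in a longer word. The helper names and the shortened alias table are hypothetical, and "ASIT Research Lab" is a made-up organisation.

```python
import re

# Hypothetical, shortened alias table in the same shape as synonyms1.
aliases = {
    "ESA": ["ESA", "European Space Agency", "ESTEC"],
    "ASI": ["ASI", "Italian Space Agency"],
    "ROSCOSMOS": ["ROSCOSMOS", "Federal Space Agency (ROSCOSMOS)"],
}

def compile_alias_patterns(table):
    """Build one compiled pattern per canonical name."""
    patterns = {}
    for canonical, names in table.items():
        # re.escape makes "(" and ")" match literally; the lookarounds
        # stop an alias from matching inside a longer word.
        body = "|".join(re.escape(name) for name in names)
        patterns[canonical] = re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")
    return patterns

def harmonise(name, patterns):
    """Return the canonical names found in `name`, or [name] if none match."""
    hits = [canonical for canonical, pattern in patterns.items()
            if pattern.search(name)]
    return hits or [name]

patterns = compile_alias_patterns(aliases)
print(harmonise("European Space Agency (ESA)", patterns))
print(harmonise("Federal Space Agency (ROSCOSMOS)", patterns))
print(harmonise("ASIT Research Lab", patterns))
```

Compiling the patterns once also avoids rebuilding the joined regex for every organisation name.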

## 5. Fixing Spelling Mistakes
Many people misspelled university:

In [None]:
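raw5 = {}
# raw string so that \S reaches the regex engine unchanged; matches any
# whitespace-free token that starts with "Un" and ends with "ty"
regex = r"Un\S*ty"
# "Uiniversity" does not start with "Un", so it needs its own pattern
regex2 = "Uiniversity"
for key, value in raw4.items():
  raw5.update({key: []})
  for element in value:
    if re.search(regex, element):
      new = re.sub(regex, "University", element)
      raw5[key].append(new)
    elif re.search(regex2, element):
      new = re.sub(regex2, "University", element)
      raw5[key].append(new)
    else:
      raw5[key].append(element)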
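To get a feel for what the first pattern does and does not catch, here is a small check on made-up organisation names (all four samples are hypothetical):

```python
import re

# Same idea as the pattern above: "Un", any non-space characters, then "ty".
pattern = re.compile(r"Un\S*ty")

samples = [
    "Univeristy of Bologna",  # transposed letters
    "Unversity of Tokyo",     # missing letter
    "Uiniversity of Rome",    # starts with "Ui", so the pattern misses it
    "Unity Space Ltd",        # not a misspelling, but it still matches
]
fixed = [pattern.sub("University", sample) for sample in samples]
for before, after in zip(samples, fixed):
    print(f"{before!r} -> {after!r}")
```

The last sample shows that the pattern is broad: any token of that shape is rewritten, so a manual look at the replaced names is worthwhile.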

## 6. Removing multiple mentions of the same organisation
When multiple authors from the same organisation wrote an abstract, their organisation name is often repeated. To avoid double-counting later on, I'll remove them:

In [None]:
raw6 = {}
for key, value in raw5.items():
  raw6.update({key: []})
  for element in value:
    if element not in raw6[key]:
      raw6[key].append(element)
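A shorter way to drop duplicates while keeping the original order is dict.fromkeys, since dicts preserve insertion order in Python 3.7+. A minimal sketch on hypothetical data:

```python
# Hypothetical sample in the same shape as raw5: paper_id -> list of names.
orgs_by_paper = {"paper_1": ["ESA", "DLR", "ESA"], "paper_2": ["NASA", "NASA"]}

# dict.fromkeys keeps the first occurrence of every name, in order.
deduplicated = {
    paper_id: list(dict.fromkeys(names))
    for paper_id, names in orgs_by_paper.items()
}
print(deduplicated)
```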

## 7. Looking for more Aliases
In step 4 I created a list of organisation aliases that are well known to me. The problem is far more pervasive, though: "TU München", for instance, is also written as "TU Munich". I use SequenceMatcher to compare every organisation name with every other one. If two names are highly similar (a ratio of at least 0.9 in the code below), they are likely spelling variants of the same organisation.

In [None]:
#creating a list of unique organisation names
to_check = []
for key, value in raw6.items():
  for orga in value:
    if orga not in to_check:
      to_check.append(orga)

In [None]:
#iterating over the list from above to check for similar organisation names
result1 = {}
checked = set()

for i, one in enumerate(to_check):
  # names already assigned to a group are not used as a new reference
  if one in checked:
    continue
  checked.add(one)
  result1.update({one: [one]})
  # only names further down the list still need to be compared
  for two in to_check[i + 1:]:
    if two in checked:
      continue
    sim = SequenceMatcher(None, one, two).ratio()
    if sim >= 0.9 and sim != 1:
      result1[one].append(two)
      checked.add(two)
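The grouping logic can be tried out on a handful of names. The sketch below wraps it in a hypothetical helper, group_similar, and uses made-up input; the 0.9 threshold is the one from the cell above.

```python
from difflib import SequenceMatcher

def group_similar(names, threshold=0.9):
    """Group each name with later names whose similarity ratio is >= threshold."""
    groups = {}
    assigned = set()
    for i, one in enumerate(names):
        if one in assigned:
            continue
        assigned.add(one)
        groups[one] = [one]
        # compare only against names further down the list
        for two in names[i + 1:]:
            if two in assigned:
                continue
            if SequenceMatcher(None, one, two).ratio() >= threshold:
                groups[one].append(two)
                assigned.add(two)
    return groups

names = ["European Space Agency", "European Space Agancy", "NASA"]
groups = group_similar(names)
print(groups)

# Translated names differ in too many characters to reach the threshold.
munich_ratio = SequenceMatcher(None, "TU München", "TU Munich").ratio()
print(munich_ratio < 0.9)
```

A single-letter typo in a long name clears the threshold easily, whereas a pair like "TU München" and "TU Munich" stays below 0.9, so such variants still need a manual entry in the alias table.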

In [None]:
#filtering out lists that have at least one alias
result2 = {}
for key, value in result1.items():
  if len(value) > 1:
    result2.update({key: value})

#As a rule of thumb, the alias with the longest name is probably the best one, as it usually includes the abbreviation
result3 = {}
for key, value in result2.items():
  tmp = sorted(value, key = len)
  result3.update({tmp[-1]: tmp})
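As a small illustration of the rule of thumb, sorting a group by length puts the most complete spelling last. The group below is a hypothetical example:

```python
# Hypothetical group of aliases for one organisation.
group = ["European Space Agency", "European Space Agency (ESA)", "ESA"]

# Sort by length; the last element is the longest alias.
ordered = sorted(group, key=len)
canonical = ordered[-1]
print(canonical)
```

Because sorted() is stable, two aliases of equal maximum length keep their original relative order, so the later one of the two ends up as the key.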

This works fairly well, but it's not perfect: for example, "University Bologna" was matched with "University Bonn". In other words, I have to double-check the groups manually and delete the false matches. I did this in another notebook and load the results here. They form a dictionary like synonyms1.

In [3]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/2.Pre-Processing/"
with open(directory+"2ai.synonyms2.pickle", "rb") as handle:
  synonyms2 = pickle.load(handle)

In [None]:
synonyms2_list = []
for key, value in synonyms2.items():
  for word in value:
    synonyms2_list.append(word)

raw7 = {}
for key, value in raw6.items():
  raw7.update({key: []})
  for element in value:
    if re.search("|".join(synonyms2_list), element):
      for orga, list_of_aliases in synonyms2.items():
        if re.search("|".join(list_of_aliases), element):
          raw7[key].append(synonyms2[orga][-1])
    elif re.search("|".join(synonyms2), element) is None:
      raw7[key].append(element)

The names taken from synonyms2 in the previous step carry a backslash in front of ( and ). This cell removes those backslashes:

In [None]:
raw8 = {}
for key, value in raw7.items():
  raw8.update({key: []})
  for element in value:
    tmp = re.sub(re.escape("\\"), "", element)
    raw8[key].append(tmp)
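For a literal character such as the backslash, str.replace gives the same result as the regex substitution. A minimal sketch, using a made-up organisation name:

```python
import re

# Hypothetical name that still carries backslashes before the parentheses.
escaped_name = r"Example Institute \(EI\)"

# Plain string replacement: remove every backslash.
cleaned = escaped_name.replace("\\", "")
print(cleaned)

# The regex version from the cell above yields the same string.
same_result = re.sub(re.escape("\\"), "", escaped_name) == cleaned
print(same_result)
```

Both variants remove every backslash in the name, not only those in front of parentheses.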

Removing multiple mentions of the same organisation (for the second time):

In [None]:
raw9 = {}
for key, value in raw8.items():
  raw9.update({key: []})
  for element in value:
    if element not in raw9[key]:
      raw9[key].append(element)

## 8. Conclusion

How many organisations have we been able to consolidate?

In [None]:
unique1 = []
for key, value in raw2.items():
  for element in value:
    if element not in unique1:
      unique1.append(element)

unique2 = []
for key, value in raw9.items():
  for element in value:
    if element not in unique2:
      unique2.append(element)

print(f"Originally: {len(unique1)} | After cleaning: {len(unique2)} | Reduced by: {len(unique1)-len(unique2)}.")

Originally: 4836 | After cleaning: 2947 | Reduced by: 1889.
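The two counting loops can be condensed with a set comprehension. The sketch below uses a hypothetical helper, count_unique, and made-up before/after data rather than the real dictionaries:

```python
def count_unique(orgs_by_paper):
    """Count distinct organisation names across all papers."""
    return len({name for names in orgs_by_paper.values() for name in names})

# Hypothetical before/after samples in the shape of raw2 and raw9.
before = {"paper_1": ["ESA", " ESA"], "paper_2": ["European Space Agency"]}
after = {"paper_1": ["ESA"], "paper_2": ["ESA"]}

print(f"Originally: {count_unique(before)} | After cleaning: {count_unique(after)}")
```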


Not bad. There are probably still many aliases left, but we've significantly improved the quality of the data. In the next step (Pre-Processing 2b), we'll categorise the organisations.

## 9. Exporting

In [None]:
# with open("2a.cleaned_organisation_names.pickle", "wb") as handle:
#   pickle.dump(raw9, handle, protocol=pickle.HIGHEST_PROTOCOL)
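Before relying on the exported file, a quick round trip confirms that the dictionary survives pickling unchanged. This sketch writes to a temporary directory, and the sample dict is a hypothetical stand-in for raw9:

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for raw9.
cleaned = {"paper_1": ["ESA", "DLR"], "paper_2": ["NASA"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "2a.cleaned_organisation_names.pickle"
    # write the dict, then read it back from the same file
    with open(path, "wb") as handle:
        pickle.dump(cleaned, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == cleaned)
```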