# 2c Abstract Cleaning
This last part of pre-processing cleans the abstracts. I mainly strip punctuation marks so that the abstracts are ready for computing word frequencies.

In [47]:
import pickle
import re

## 1. Input and Simplified Dictionary

In [48]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/1.HTML_Parsing/"
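with open(directory+"IAC_raw_data.pickle", "rb") as handle:
  raw_data = pickle.load(handle)  # named raw_data to avoid shadowing the built-in input()

In [49]:
# keep only the abstract text, keyed by ID
raw1 = {}
for key, value in raw_data.items():
  for abstract_id, info in value.items():  # abstract_id avoids shadowing the built-in id()
    raw1.update({abstract_id: info["abstract"]})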

## 2. Cleaning

Lowercasing:

In [50]:
raw2 = {}
for key, value in raw1.items():
  raw2.update({key: value.lower()})

Removing new line chars:

In [51]:
raw3 = {}
for key, value in raw2.items():
  raw3.update({key: re.sub("\n", " ", value)})


Removing punctuation:

I know this could be done in a single pass, but splitting it into several steps gives more control if I want to keep some punctuation characters.

In [52]:
punct = "[.:,;#=&%_<>!@$~/`-]"
raw4 = {}
for key, value in raw3.items():
  raw4.update({key: re.sub(punct, " ", value)})

In [53]:
punct = r"['\"*+?^|]"  # raw string avoids invalid-escape warnings; removes ' " * + ? ^ |
raw5 = {}
for key, value in raw4.items():
  raw5.update({key: re.sub(punct, " ", value)})

In [54]:
punct = r"[\\()]"  # removes ( ) and backslashes
raw6 = {}
for key, value in raw5.items():
  raw6.update({key: re.sub(punct, " ", value)})

In [55]:
punct = r"[\[\]]"  # removes [ ]
raw7 = {}
for key, value in raw6.items():
  raw7.update({key: re.sub(punct, " ", value)})

In [56]:
punct = "[{}]" #removes {}
raw8 = {}
for key, value in raw7.items():
  raw8.update({key: re.sub(punct, " ", value)})
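
The five passes above can also be folded into one helper that still leaves room to keep selected characters. This is only a minimal sketch: the function name `strip_punctuation` and its `keep` parameter are hypothetical, and it assumes that `string.punctuation` (the 32 ASCII punctuation characters) is the set to remove, which appears to match the union of the patterns used above.

```python
import re
import string

def strip_punctuation(text, keep=""):
    """Replace each ASCII punctuation character in `text` with a space,
    except for the characters listed in `keep` (hypothetical helper)."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    if not to_remove:
        return text
    # re.escape makes every character literal, so the order inside [] is irrelevant
    pattern = "[" + re.escape(to_remove) + "]"
    return re.sub(pattern, " ", text)

sample = "state-of-the-art (leo) systems: cost/benefit, 50% cheaper!"
print(strip_punctuation(sample))             # strips everything
print(strip_punctuation(sample, keep="-%"))  # keeps hyphens and percent signs
```

Applied to the notebook's data, this would be something like `{key: strip_punctuation(value) for key, value in raw3.items()}`.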

Use this to check whether the text still contains any characters other than letters, digits, or whitespace:

In [57]:
regex = r"[^a-zA-Z\d\s]"  # raw string avoids invalid-escape warnings
findings = []
unique = []
for key, value in raw8.items():
  if len(re.findall(regex, value)) > 0:
    findings.append(re.findall(regex, value))
    for item in re.findall(regex, value):
      if item not in unique:
        unique.append(item)
unique

[]
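
If the check ever returns something, it helps to know how often each leftover character occurs before deciding how to handle it. A minimal sketch, assuming the same dictionary layout (ID mapped to abstract text); the helper name `leftover_characters` and the toy data are made up for illustration.

```python
import re
from collections import Counter

def leftover_characters(abstracts):
    """Count every character that is not an ASCII letter, a digit, or whitespace."""
    counts = Counter()
    for text in abstracts.values():
        counts.update(re.findall(r"[^a-zA-Z\d\s]", text))
    return counts

# toy data, not taken from the real abstracts
toy = {"1": "caf\u00e9 in orbit", "2": "cost \u2013 benefit \u2013 risk"}
print(leftover_characters(toy).most_common())
```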

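Adding a white space at the beginning and end of each abstract:
When searching, each keyword is wrapped in a leading and a trailing space so that only whole words or phrases match, including multi-word keywords. Padding the abstracts the same way ensures that a keyword at the very start or end of an abstract is still found.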

In [58]:
whitespace = " "
raw9 = {}
for key, value in raw8.items():
  raw9.update({key: whitespace+value+whitespace})

Collapsing every run of two or more whitespace characters into a single space, i.e., there should never be two white spaces next to each other:

In [59]:
raw10 = {}
regex = r"\s\s+"  # raw string avoids invalid-escape warnings
for key, value in raw9.items():
  raw10.update({key: re.sub(regex, " ", value)})
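
One detail worth noting: `\s\s+` only matches runs of two or more whitespace characters, so a single tab between two words would stay a tab and a space-delimited keyword search would miss it. A minimal sketch of a variant that pads and normalizes in one step, assuming every whitespace run (including a lone tab) should become a single space; `pad_and_collapse` is a hypothetical name.

```python
import re

def pad_and_collapse(text):
    """Pad `text` with one space on each side, then turn every whitespace run
    (spaces, tabs, newlines) into a single space."""
    return re.sub(r"\s+", " ", " " + text + " ")

print(repr(pad_and_collapse("new\tspace   economy")))
```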

Testing search:

In [60]:
word = " space economy "

findings = []
check = {}
nums = 0

for key, value in raw10.items():
  matches = re.findall(word, value)  # run the search once per abstract
  findings.append([key, len(matches), matches])
  check.update({key: [value]})
  nums += len(matches)
print(nums, findings)

241 [['46973', 0, []], ['42767', 0, []], ['43919', 0, []], ['44070', 0, []], ['47451', 0, []], ['42151', 0, []], ['43137', 0, []], ['47620', 0, []], ['42912', 0, []], ['48336', 0, []], ['48462', 0, []], ['44673', 0, []], ['47170', 0, []], ['44749', 0, []], ['47705', 0, []], ['42105', 0, []], ['46952', 0, []], ['45892', 0, []], ['45653', 0, []], ['43477', 0, []], ['44021', 0, []], ['47739', 0, []], ['43874', 0, []], ['45428', 0, []], ['46078', 0, []], ['45436', 0, []], ['46266', 0, []], ['44955', 0, []], ['48621', 0, []], ['45018', 0, []], ['43122', 0, []], ['47399', 2, [' space economy ', ' space economy ']], ['46663', 0, []], ['44051', 0, []], ['47544', 0, []], ['47046', 0, []], ['46690', 0, []], ['48553', 0, []], ['43737', 0, []], ['48538', 0, []], ['45677', 0, []], ['43278', 0, []], ['44630', 0, []], ['44793', 0, []], ['42126', 0, []], ['42768', 0, []], ['47241', 0, []], ['43479', 0, []], ['48035', 0, []], ['44704', 0, []], ['45848', 0, []], ['48443', 0, []], ['48222', 0, []], ['443
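
A caveat with space-delimited keywords: `re.findall` returns non-overlapping matches, so when the same keyword occurs twice in a row, the shared space is consumed by the first match and the second occurrence is not counted. A minimal sketch of a lookaround-based alternative, under the assumption that such back-to-back occurrences should both count; `count_keyword` is a hypothetical helper.

```python
import re

def count_keyword(keyword, text):
    """Count occurrences of `keyword` delimited by spaces, including
    back-to-back occurrences that share a delimiting space."""
    # lookbehind/lookahead require the spaces without consuming them
    pattern = r"(?<= )" + re.escape(keyword.strip()) + r"(?= )"
    return len(re.findall(pattern, text))

text = " the space economy space economy grows "
print(len(re.findall(" space economy ", text)), count_keyword(" space economy ", text))
```

`re.escape` also keeps a keyword that happens to contain regex metacharacters from being interpreted as a pattern.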

In [61]:
check["45529"]

[' projects of the new space economy such as spaceshiptwo and new shepard are on their way to shifting the paradigms in space tourism and transportation being privately financed they are also changing the way in which highly complex and formerly government financed systems are now being developed looking to the future we can envision humanity moving to space an opportunity that will be available to many more of us as a result of these new paradigms one of the core issues we encounter when we start the development of complex systems such as the suborbital transportation and space tourism systems is what are the concepts available and how these concepts can be represented using strictly defined ontology and model semantics in this work we present a model based concept framework that aims to address this issue first a concept framework methodology is presented after which we demonstrate its applicability to suborbital human spaceflight missions such as spaceshiptwo and new shepard the ana

## 3. Exporting

In [64]:
with open("2c.cleaned_abstracts.pickle", "wb") as f:
  pickle.dump(raw10, f, protocol = pickle.HIGHEST_PROTOCOL)