# Consolidation
This step consolidates the outputs of the earlier steps: the raw data from step 1, the pre-processing from step 2, and the topic modelling from step 3. The new data structure, "central_dict", holds each cleaned abstract together with all the information needed for the analysis in step 5.

In [2]:
import os
os.getcwd()

'C:\\Users\\Admin\\IAC Analysis\\4. Consolidation'

In [1]:
import pickle
import pandas as pd

## 1. Importing

These are the raw inputs from step 1 (HTML_Parsing):

In [2]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/1.HTML_Parsing/"
with open(directory+"IAC_raw_data.pickle", "rb") as handle:
  raw_data = pickle.load(handle)

This is the simplified dictionary containing organisation names and types from step 2 (Pre-Processing). Its keys are the paper ids and its values are the organisation names and their types.

In [3]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/2.Pre-Processing/"
with open(directory+"2b.orga_names_and_types.pickle", "rb") as handle:
  orga_names_types = pickle.load(handle)

This is the simplified dictionary containing the cleaned abstracts from step 2 (Pre-Processing). This is used for the lexical search.

In [4]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/2.Pre-Processing/"
with open(directory+"2c.cleaned_abstracts.pickle", "rb") as handle:
  abstracts = pickle.load(handle)

This is the simplified dictionary containing topic numbers and names from step 3 (Topic Modeling).

In [5]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/3.Topic_Modeling/"
with open(directory+"3a.doc_topic_name_number.pickle", "rb") as handle:
  topic_name_number = pickle.load(handle)

This is a dictionary with top2vec ids as keys and the corresponding paper_id as value:

In [6]:
directory = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/3.Topic_Modeling/"
with open(directory+"3a.top2vec_paper_id_dict.pickle", "rb") as handle:
  top2vec_paper_id_dict = pickle.load(handle)
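
The five loading cells above repeat the same open-and-load pattern. As a sketch only, a small helper could wrap it. The name `load_pickle` and the demo file are hypothetical and not part of the original notebook; the demo writes a throwaway file rather than touching the project folders. Note that `pickle.load` should only be used on files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load a single pickle file and return the stored object."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Demonstration with a temporary file (placeholder content, not project data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pickle"
    with demo_path.open("wb") as handle:
        pickle.dump({"12345": "cleaned abstract text"}, handle)
    demo_abstracts = load_pickle(demo_path)
```

With such a helper, each import cell would reduce to a single call with the relevant directory and file name.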

## 2. Creating the Central Dict:

The central dict will contain all the relevant information. Its keys are the year and the paper id joined by an underscore (year_paper_id). Its values contain the country, region, organisation(s) and organisation types, cleaned abstract, Top2Vec id, and topic number and name.
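
For orientation, the sketch below shows how a key is formed and what one finished entry might look like. All values are made-up placeholders, not records from the data set. The "Name" key inside the organisation entry is an assumption, since only the "Type" key is visible in this notebook, and the helper `make_combined_id` is hypothetical.

```python
def make_combined_id(year, paper_id):
    """Join year and paper id with an underscore, e.g. '2018_12345'."""
    return year + "_" + paper_id


example_key = make_combined_id("2018", "12345")

# Hypothetical entry; every value here is a placeholder.
example_entry = {
    example_key: {
        "Paper_id": "12345",
        "Top2Vec_id": "2018_1",
        "Year": "2018",
        "Country": "Italy",
        "Region": "Europe",
        "Topic Number": 3,
        "Topic Name": "example topic",
        # "Name" is an assumed key; only "Type" is used in this notebook.
        "Organisations": [{"Name": "Example University", "Type": "University"}],
        "Abstract": "cleaned abstract text",
    }
}
```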

### 2a. Adding the organisation types directly into the raw_data dict from step 1 (HTML Parsing):

In [7]:
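# Attach the organisation names and types to every paper entry in raw_data.
# paper_id replaces the original loop variable "id", which shadowed the Python builtin.
for key, papers in raw_data.items():
  for paper_id, info in papers.items():
    info["organisations"] = orga_names_types[paper_id]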
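
The merge above raises a KeyError if a paper id from step 1 has no entry in the step 2 dictionary. A possible pre-check, sketched here with toy data, lists such ids before merging. The function name `find_missing_ids` and the demo dictionaries are hypothetical, and the top-level keys of raw_data are assumed to group papers (for example by year).

```python
def find_missing_ids(raw_data, orga_names_types):
    """Return paper ids found in raw_data but absent from orga_names_types."""
    missing = []
    for papers in raw_data.values():
        for paper_id in papers:
            if paper_id not in orga_names_types:
                missing.append(paper_id)
    return missing


# Toy data with placeholder ids; "222" has no organisation entry.
demo_raw = {"2018": {"111": {"year": "2018"}, "222": {"year": "2018"}}}
demo_orgas = {"111": [{"Type": "University"}]}
missing_ids = find_missing_ids(demo_raw, demo_orgas)
```

An empty result would indicate that every paper can be merged safely.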

### 2b. Creating the Central Dict:

In [8]:
central_dict = {}
for key, papers in raw_data.items():
  for paper_id, info in papers.items():
    # The key combines year and paper id: "<year>_<paper_id>".
    combined_id = info["year"]+"_"+info["paper_id"]
    # Fields initialised as empty here are filled in steps 2c to 2g.
    central_dict[combined_id] = {"Paper_id": info["paper_id"],
                                 "Top2Vec_id": "",
                                 "Year": info["year"],
                                 "Country": info["country"],
                                 "Region": "",
                                 "Topic Number": 0,
                                 "Topic Name": "",
                                 "Organisations": info["organisations"],
                                 "Abstract": ""
                                 }

### 2c. Adding the cleaned abstracts:

In [9]:
# Look up each cleaned abstract by paper id.
for combined_id, info in central_dict.items():
  info["Abstract"] = abstracts[info["Paper_id"]]

### 2d. Adding regions (continents)
The data should also be searchable by region. The lists below make that possible and are applied to the central dict in the cell after them. Note that the lists only contain countries that appear in the data set at least once; a country that appears in future data would need to be added to the relevant list.

In [10]:
Europe = ["Italy", "Germany", "Belgium", "United Kingdom", "The Netherlands", "France", "Austria", "Russian Federation", "Greece", "Portugal", "Malta", "Ukraine", "Spain", "Poland", "Luxembourg", "Switzerland", "Norway",
          "Denmark", "Sweden", "Romania", "Ireland", "Czech Republic", "Hungary", "Finland", "Estonia", "Belarus", "Slovenia", "Slovak Republic", "Lithuania", "Cyprus", "Latvia", "Bulgaria", "Serbia", "Iceland", "Croatia"]

Asia = ["Israel", "Japan", "China", "Iran", "United Arab Emirates", "Taiwan, China", "India", "Hong Kong", "Thailand", "Korea, Republic of", "Bangladesh", "Indonesia", "Pakistan", "Malaysia",
        "Singapore, Republic of", "The Philippines", "Korea, Democratic People's Republic of", "Nepal", "Kuwait", "Turkey", "Taipei", "Kazakhstan", "Cambodia", "Bahrain", "Saudi Arabia", "Qatar", "Jordan",
        "Azerbaijan", "Mongolia", "Sri Lanka"]

Americas = ["United States", "Canada", "Brazil", "Peru", "Mexico", "Chile", "Ecuador", "Paraguay", "Bolivia", "Colombia", "Costa Rica", "Puerto Rico", "Argentina", "Venezuela", "Uruguay", "Panama", "Guatemala",
            "Honduras", "Guyana", "French Guiana", "Netherlands Antilles"]

Oceania = ["Australia", "New Zealand"]

Africa = ["Ghana", "South Africa", "Nigeria", "Ethiopia", "Sudan", "Kenya", "Algeria", "Botswana", "Zimbabwe", "Togo", "Angola", "Egypt", "Cameroon", "Morocco", "Mauritius", "La Reunion"]

#Adding a searchable dict:
countries_dict = {"Europe": Europe, "Asia": Asia, "Americas": Americas, "Oceania": Oceania, "Africa": Africa}

In [None]:
for key, value in central_dict.items():
  for region, countries in countries_dict.items():
    if value["Country"] in countries:
      value["Region"] = region
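
Countries missing from every list keep an empty "Region" without any warning. One way to spot them, sketched here with toy data, is to invert the region lists into a country-to-region lookup and collect the countries that did not match. The names `country_to_region`, `demo_countries`, and `demo_central` are hypothetical, and "Unlisted Country" is a placeholder.

```python
# Toy stand-ins for countries_dict and central_dict (placeholder values).
demo_countries = {"Europe": ["Italy", "Germany"], "Asia": ["Japan"]}
demo_central = {
    "2018_1": {"Country": "Italy", "Region": ""},
    "2018_2": {"Country": "Japan", "Region": ""},
    "2018_3": {"Country": "Unlisted Country", "Region": ""},
}

# Invert region -> countries into country -> region for direct lookups.
country_to_region = {
    country: region
    for region, countries in demo_countries.items()
    for country in countries
}

for info in demo_central.values():
    info["Region"] = country_to_region.get(info["Country"], "")

# Countries that were not found in any region list.
unassigned = sorted(
    {info["Country"] for info in demo_central.values() if not info["Region"]}
)
```

Any name left in `unassigned` would need to be added to one of the region lists.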

### 2e. Adding organisation type counts

In [None]:
# Count how many organisations of each type are attached to each paper.
orga_type_freq = {}
for key, value in central_dict.items():
  orga_type_freq[key] = {"University": 0, "Space Agency": 0, "Other Research Institution": 0, "Company": 0, "Other": 0, "Unknown": 0}
  for orga in value["Organisations"]:
    if "Type" in orga:
      orga_type_freq[key][orga["Type"]] += 1

# Add the counts to central_dict, such that each orga_type is an individual key next to things like Country and Topic Name:
for key, value in central_dict.items():
  value.update(orga_type_freq[key])
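
The same counting could also be written with `collections.Counter`. This is only a sketch, assuming each organisation is a dict with a "Type" key. The helper `count_orga_types` is hypothetical, and unlike the cell above it silently ignores type names outside the six known ones instead of raising a KeyError.

```python
from collections import Counter

ORGA_TYPES = ["University", "Space Agency", "Other Research Institution",
              "Company", "Other", "Unknown"]


def count_orga_types(organisations):
    """Return a dict with one count per known organisation type."""
    counts = Counter(orga["Type"] for orga in organisations if "Type" in orga)
    # Report every known type, including those with a count of zero.
    return {orga_type: counts.get(orga_type, 0) for orga_type in ORGA_TYPES}


# Toy input with placeholder organisations.
demo_counts = count_orga_types(
    [{"Type": "University"}, {"Type": "Company"}, {"Type": "University"}]
)
```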

### 2f. Adding top2vec ids
Top2Vec ids differ from the paper ids. They are numbered consecutively within each year and prefixed with the year, for example "2018_1" for the first paper of that year.

In [None]:
for key, value in central_dict.items():
  value["Top2Vec_id"] = top2vec_paper_id_dict[value["Paper_id"]]
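
The description of top2vec_paper_id_dict in section 1 gives Top2Vec ids as keys and paper ids as values, whereas the cell above looks entries up by paper id. If the pickle really is keyed by Top2Vec id, the mapping would need to be inverted first. A minimal sketch with made-up ids follows; the variable names are hypothetical.

```python
# Placeholder mapping: Top2Vec id -> paper id.
top2vec_to_paper = {"2018_1": "12345", "2018_2": "67890"}

# Invert it to paper id -> Top2Vec id.
paper_to_top2vec = {
    paper_id: top2vec_id for top2vec_id, paper_id in top2vec_to_paper.items()
}

# The inversion is only lossless if every paper id occurs exactly once.
inversion_is_lossless = len(paper_to_top2vec) == len(top2vec_to_paper)
```

If the lengths differ, at least two Top2Vec ids share a paper id and the lookup would need a different key, such as the combined year and paper id.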

### 2g. Adding topics

In [None]:
for key, value in central_dict.items():
  value["Topic Number"] = topic_name_number[value["Top2Vec_id"]]["topic number"]
  value["Topic Name"] = topic_name_number[value["Top2Vec_id"]]["topic name"]

## 3. Exporting

Central dict:

In [None]:
with open("4.central_dict.pickle", "wb") as f:
  pickle.dump(central_dict, f, protocol= pickle.HIGHEST_PROTOCOL)

Countries/regions dict:

In [11]:
with open("4.countries_dict.pickle", "wb") as f:
  pickle.dump(countries_dict, f, protocol = pickle.HIGHEST_PROTOCOL)
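
As an optional sanity check, an exported file can be read back and compared with the object in memory. The sketch below uses a temporary directory and toy data; the helper `dump_and_verify` is hypothetical and not part of the original notebook.

```python
import pickle
import tempfile
from pathlib import Path


def dump_and_verify(obj, path):
    """Pickle obj to path, reload it, and report whether both are equal."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)
    return reloaded == obj


# Toy data written to a throwaway location.
with tempfile.TemporaryDirectory() as tmp:
    round_trip_ok = dump_and_verify(
        {"Europe": ["Italy"], "Asia": ["Japan"]}, Path(tmp) / "demo.pickle"
    )
```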

## 4. Conclusion
Now, all the relevant information is stored in central_dict. In step 5, I'll analyse the data.