# Exploitation and Analysis
This is the final part of analysing abstracts submitted to the IAC.

## 1. Importing Libraries and Files

In [1]:
import os
import pickle
import pandas as pd
import re
import plotly.express as px

In the previous step, I created the central_dict; it contains all the necessary information. I import it and transform it into a df.

In [30]:
#importing central_dict, where all information is stored

with open(r"C:\Users\Admin\IAC Analysis\4. Consolidation\4.central_dict.pickle", "rb") as f:
    central_dict = pickle.load(f)
    
#transforming central_dict into df
central_df = pd.DataFrame.from_dict(central_dict).transpose()
central_df.sample(3)

Unnamed: 0,Paper_id,Top2Vec_id,Year,Country,Region,Topic Number,Topic Name,Organisations,Abstract,University,Space Agency,Other Research Institution,Company,Other,Unknown
2021_63711,63711,2021_1661,2021,Ukraine,Europe,18,Engines,"[{'Name': 'Yuzhnoye State Design Office', 'Typ...",the rd861k hypergolic high altitude liquid pr...,0,0,0,0,1,0
2019_50816,50816,2019_1705,2019,Belgium,Europe,11,Space Traffic Management (STM),"[{'Name': 'ESA', 'Type': 'Space Agency'}, {'Na...",the need for future space traffic management ...,1,1,0,0,0,0
2019_54576,54576,2019_1898,2019,United States,Americas,58,"Space Generation Advisory Council, Young Profe...","[{'Name': 'Moon Village Association (MVA)', 'T...",the out astronaut project is a science outrea...,1,0,0,0,1,1


### Creating some basic dictionaries/lists:

In [31]:
all_years = list(central_df["Year"].unique())

all_countries = list(central_df["Country"].unique())

all_regions = list(central_df["Region"].unique())

all_topic_nums = list(central_df["Topic Number"].unique())

all_topic_names = list(central_df["Topic Name"].unique())

all_organisation_types = ["University", "Space Agency", "Other Research Institution", "Company", "Other", "Unknown"]

In [49]:
with open(r"C:\Users\Admin\IAC Analysis\4. Consolidation\4.countries_dict.pickle", "rb") as f:
  countries_dict = pickle.load(f)

## 2. Countries

Let's take a look at how much different countries contributed to the IAC over time.

First, I define a function that returns a DF, specifying how many papers came from a set of countries:

In [None]:
def query_year_country(Countries = all_countries):
  result = pd.DataFrame()
  for year in all_years:
    add = central_df.loc[((central_df["Year"] == year)) & (central_df["Country"].isin(Countries))]["Country"].value_counts() #creating a series per year
    result[year] = add #and adding the series to the DF object
    result = result.fillna(0).astype(int)

  #Adding Total column
  result["Total"] = result.iloc[:, 0:].sum(axis = 1).astype(int)

  return result
query_year_country(Countries = ["Germany", "South Africa"])
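
The same table can also be built in a single step. The following is a minimal sketch, assuming a toy stand-in for `central_df`; the data in `toy_df` and the helper name `count_papers` are made up for illustration. It uses `pd.crosstab`, which keeps every country that appears in any of the years:

```python
import pandas as pd

# Toy stand-in for central_df (hypothetical data, for illustration only)
toy_df = pd.DataFrame({
    "Year": ["2018", "2018", "2019", "2019", "2019"],
    "Country": ["Germany", "Germany", "Germany", "South Africa", "Italy"],
})

def count_papers(df, countries):
    """Count papers per country (rows) and year (columns), plus a total."""
    subset = df[df["Country"].isin(countries)]
    counts = pd.crosstab(subset["Country"], subset["Year"])
    counts["Total"] = counts.sum(axis=1)
    return counts

print(count_papers(toy_df, ["Germany", "South Africa"]))
```

Here, South Africa has no paper in 2018 but still gets a row, with a 0 in that column.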

Second, I define a function to create a sun-burst graph:

In [53]:
def graph_compare_region_country(Years_or_Total):
  df = query_year_country()
  regions = []
  for country in df.index:
    for key, value in countries_dict.items():
      if country in value:
        regions.append(key)
  df.insert(0, "Region", regions)
  fig = px.sunburst(df, path = ["Region", df.index], values = Years_or_Total,
                     title = f"Contributions by Region and Country: {Years_or_Total}")
  fig.show()
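
The region lookup in the function above walks through `countries_dict` for every country. As a sketch of an alternative, the mapping can be inverted once so that each country resolves to exactly one region. The dictionary excerpt and the placeholder label "Unknown" are assumptions for illustration:

```python
# Hypothetical excerpt of countries_dict (region -> list of countries)
regions_to_countries = {
    "Europe": ["Germany", "Italy"],
    "Africa": ["Ghana", "South Africa"],
}

# Invert the mapping once: country -> region
country_to_region = {
    country: region
    for region, countries in regions_to_countries.items()
    for country in countries
}

countries = ["Italy", "Ghana", "Atlantis"]
# Countries missing from the mapping get an explicit placeholder label,
# so the result always has one entry per country
regions = [country_to_region.get(c, "Unknown") for c in countries]
print(regions)
```

With this approach, the list of regions always has the same length as the list of countries, which `df.insert` requires.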

Here, we can see the total number of contributions between 2018 and 2022 by region and country. Surprisingly, Europe has contributed almost half of all papers. This may be because the conference took place in Europe in two of the five years (2018: Bremen, 2022: Paris).

In [54]:
graph_compare_region_country("Total")

Indeed, if we look at 2019, when the IAC took place in Washington D.C., the Americas are the largest contributor:

In [57]:
graph_compare_region_country("2019")

Next, let's take a closer look at the development of contributions over time. The space sector has been experiencing a significant change. Known as New Space, the space sector is in the process of moving from a field dominated by a few nations and large defense contractors to a sector comprising many emerging states (e.g., UAE and China). In addition, the space sector is commercialising, with many products and services increasingly provided by start-ups instead of large defense contractors (e.g., Lockheed Martin). The demand side, too, is changing: while governments remain important customers, commercial entities are becoming increasingly important. We'll explore the role of companies further down. For now, let's consider how old and new actors have changed their engagement with the space sector:

In [65]:
def graph_compare_country(Countries = all_countries):
   df = query_year_country(Countries).transpose().drop(["Total"], axis = 0)
   fig = px.line(df, x = df.index, y = df.columns,
                 title = f"Contributions of Countries over Time", 
                 labels = {"value" : "Number of Papers", "index" : "Years", "variable": "Countries"})
   fig.show()

Russia is one of the original space nations. Since the Russian invasion of Ukraine, Russia has become much more isolated, which is reflected in the decreasing number of submissions in 2022. Interestingly, for all three countries, the IAC in 2019 (Washington D.C.) was the lowest point in the five-year period. It may have been the least accessible location for them, possibly due to visa issues. China, an emerging space nation like India, has significantly decreased its contributions, whereas India is growing its engagement.

In [64]:
graph_compare_country(["China", "India", "Russian Federation"])

## 3. Organisation Types

The following functions create a DF of the organisation types and visualise it, respectively.

In [79]:
def query_year_orgatype():
  result = pd.DataFrame()
  for orga_type in ["University", "Space Agency", "Company"]:
    result[orga_type] = central_df.loc[(central_df[orga_type] > 0)]["Year"].value_counts()
    result = result.sort_index()
  return result

In [80]:
def graph_compare_orgatype_year():
  df = query_year_orgatype()
  
  fig = px.line(df,
                labels = {"value": "Number of Papers", "index": "Years", "variable": "Organisation Type"},
                title = "Number of Papers by Organisation Type"
                )
  fig.show()

We can see that the IAC is a predominantly academic event. The role of universities is increasing significantly. The involvement of companies is strong, but growing very slowly. Engagement from space agencies has stagnated over the five-year period.

In [81]:
graph_compare_orgatype_year()

In [95]:
def query_orgatype_country():
  result = pd.DataFrame()
  for orga_type in ["University", "Space Agency", "Company"]:
    result[orga_type] = central_df.loc[(central_df[orga_type] > 0)]["Country"].value_counts()
  
  result = result.fillna(0).astype("int").head(15)
  return result

In [96]:
def graph_compare_orgatype_country(Orga_Type = "University"):
  df = query_orgatype_country()
  s = df[Orga_Type].sort_values(ascending = False)

  fig = px.bar(s, x = s.index, y = Orga_Type,
               labels = { Orga_Type: "Number of Contributions (Total)","index": "Countries"},
               title = f"Total Contributions by Organisation Type ({Orga_Type})"
               )
  fig.show()

In the graphs below, we can see that the US, Germany, and Italy are always top contributors. China and Russia are mostly contributing through universities and to a much lesser degree through their space agencies or companies. This suggests that the majority of engagement from these countries is academic and less politically or economically motivated.

In [100]:
for orga in ["University", "Space Agency", "Company"]:
    graph_compare_orgatype_country(Orga_Type = orga)

It is possible that one paper was written by several organisations of different types. How much do universities cooperate with companies and are there differences between countries?

In [103]:
def query_orgatype_coop():
  # only coop between uni and company (but not space agency) by country
  add1 = central_df.loc[(central_df["University"] > 0) & (central_df["Company"] > 0) & (central_df["Space Agency"] == 0)]["Country"].value_counts()

  #coop uni-space agency
  add2 = central_df.loc[(central_df["University"] > 0) & (central_df["Space Agency"] > 0) & (central_df["Company"] == 0)]["Country"].value_counts()

  #coop space agency-company
  add3 = central_df.loc[(central_df["Space Agency"] > 0) & (central_df["Company"] > 0) & (central_df["University"] == 0)]["Country"].value_counts()

  #coop space agency-company-uni
  add4 = central_df.loc[(central_df["University"] > 0) & (central_df["Company"] > 0) & (central_df["Space Agency"] > 0)]["Country"].value_counts()

  result = pd.DataFrame()
  result["University-Company"] = add1
  result["University-Space Agency"] = add2
  result["Space Agency-Company"] = add3
  result["Space Agency-Company-University"] = add4
  result = result.fillna(0).astype(int)

  return result.head(15)
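
The four masks above assign every cooperating paper to exactly one category. As a rough sketch of the same idea, each paper can be labelled with the combination of organisation types it involves and the labels can then be tabulated. The toy data and the helper `coop_label` are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for central_df (hypothetical data)
toy_df = pd.DataFrame({
    "Country": ["Germany", "Germany", "Italy"],
    "University": [1, 0, 2],
    "Space Agency": [0, 1, 1],
    "Company": [1, 1, 1],
})

def coop_label(row):
    """Join the names of all organisation types with at least one author."""
    types = [t for t in ["University", "Space Agency", "Company"] if row[t] > 0]
    return "-".join(types) if len(types) > 1 else "No cooperation"

toy_df["Cooperation"] = toy_df.apply(coop_label, axis=1)

# Papers per country and cooperation type
print(pd.crosstab(toy_df["Country"], toy_df["Cooperation"]))
```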

In [105]:
def graph_orgatype_coop():
  df = query_orgatype_coop()

  fig = px.bar(df,
               labels = {"value": "Number of Papers", "index": "Country", "variable": "Type of Cooperation:"},
               title = "Cooperation between Organisation Types (Total)"
               )
  fig.show()

The graph below shows cooperation between different organisation types by country. In Germany for instance, the engagement between its space agency (DLR) and companies is stronger than elsewhere.

In [106]:
graph_orgatype_coop()

## 4. Topics

In [114]:
def df_aggregated_topics():
  
  #df
  df = central_df.copy()
  new_df = pd.DataFrame()
  for year in all_years:
    add = df.loc[df["Year"] == year]["Topic Name"].value_counts()
    new_df[year] = add

  #dropping unknown topics
  new_df = new_df.drop(["???"], axis = 0).fillna(0).astype(int).transpose()

  return new_df
  
df = df_aggregated_topics()

The table below lists the most frequent topics. Notice that there is a mixture of technical ("Satellite control") and policy-relevant topics ("Education"). As suggested above, "Commerce and Economy" is one of the most important themes in the space sector, given New Space and the trend towards commercialisation.

In [116]:
df = df_aggregated_topics().transpose()
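#Note: the exponent 1/5 uses the number of years; the conventional CAGR uses the number of periods (4 for 2018-2022)
df["CAGR"] = (((df["2022"]/df["2018"]).pow((1/5)))-1)*100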
df.head(7)

Unnamed: 0,2018,2019,2020,2021,2022,CAGR
Satellite control,113,80,114,116,137,3.927008
Education,98,98,97,76,124,4.818785
"Legal, Treaty, Jurisdiction",62,52,67,43,56,-2.015074
Communication,62,48,60,60,73,3.320437
Commerce and Economy,61,52,51,52,70,2.790657
Systems Engineering,59,47,39,33,60,0.336708
Artemis,58,92,45,46,70,3.832667


Looking at the CAGR of the above topics, we notice that they barely grew or even declined in relative terms (the total CAGR is ca. 4%).
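
For reference, the compound annual growth rate is the ratio of the last to the first value, raised to the power of one over the number of periods, minus one. The sketch below reproduces the value for "Satellite control" with the exponent 1/5 used in this notebook and, for comparison, with the exponent 1/4 that the textbook definition would give for the four year-to-year steps between 2018 and 2022. The helper name `cagr` is an assumption for illustration:

```python
def cagr(first, last, periods):
    """Compound annual growth rate in percent over `periods` compounding periods."""
    return ((last / first) ** (1 / periods) - 1) * 100

# "Satellite control": 113 papers in 2018, 137 papers in 2022
as_in_notebook = cagr(113, 137, 5)  # exponent 1/5 (number of years)
textbook = cagr(113, 137, 4)        # exponent 1/4 (number of periods)

print(round(as_in_notebook, 2))  # about 3.93, matching the table above
print(round(textbook, 2))        # about 4.93
```

The choice of exponent shifts all growth rates but does not change the ranking of the topics.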

Let's take a look at some topics that became increasingly relevant. First, there is an emerging country (Brazil) and questions regarding the development of its space sector. Then, there are two topics regarding important problems the space sector faces: Dispute Resolution and Debris Mitigation.

The advent of cheaper access to space has also raised questions about Space Tourism. Finally, some technical topics have increased significantly, mainly motivated by specific missions that are in the planning/execution phase.

In [117]:
df.sort_values(by = "CAGR", ascending = False).head(7)

Unnamed: 0,2018,2019,2020,2021,2022,CAGR
Brazil,2,10,7,10,19,56.871743
"Disputes, Arbitration, Courts, Resolution",2,9,3,2,13,45.406115
Tourism,11,5,25,30,44,31.950791
"Debris, Remeditation, Mitigation",12,29,18,23,41,27.855826
Lava,6,5,10,15,20,27.225964
"Moons (Jupiter, Saturn)",13,18,17,21,36,22.594738
Space Rider,7,5,8,8,18,20.791088


Looking at the absolute numbers above, one notices that even the top topics make up only a small proportion of all papers. In general, the discourse at the IAC is fairly diverse:

In [118]:
def graph_compare_topics_pie(Year):
  #df
  df = df_aggregated_topics().transpose()

  #fig
  fig = px.pie(df, values = df[Year], names = df.index,
               template = "plotly_dark", title = f"Distribution of Topics in {Year}")
  fig.update_traces(textposition='inside', textinfo='percent+label')

  fig.show()

graph_compare_topics_pie("2022")

-----------------------------

Adding the countries:

In [5]:
path = "/content/drive/MyDrive/Colab Notebooks/ESPI_Codes/IAC_Analysis/4.Central_Dict/"
with open(path+"4.countries_dict.pickle", "rb") as f:
  countries_dict = pickle.load(f)

# 2. Lexical Query

Explanation regarding the organisation-type conditions: if several are set to True, only papers that satisfy all of the selected conditions are included. E.g., if OnlyUniversity and OnlySpaceAgency are both True, only papers with at least one university and at least one space agency among their organisations are included.
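
The AND logic described above can be sketched on a toy table. The data and the helper `filter_by_orga_types` are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for central_df: number of organisations of each type per paper
toy_df = pd.DataFrame(
    {"University": [1, 0, 2], "Space Agency": [1, 1, 0]},
    index=["paper_a", "paper_b", "paper_c"],
)

def filter_by_orga_types(df, required_types):
    """Keep only papers with at least one organisation of every required type."""
    mask = pd.Series(True, index=df.index)
    for orga_type in required_types:
        mask &= df[orga_type] > 0
    return df[mask]

both = filter_by_orga_types(toy_df, ["University", "Space Agency"])
print(both.index.tolist())
```

Only `paper_a` has both a university and a space agency, so it is the only paper that remains. With an empty list of required types, no paper is filtered out.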

In [None]:
def lexical_query(Words, 
                  Countries = all_countries, 
                  OnlyUniversity = False, 
                  OnlySpaceAgency = False, 
                  OnlyOtherResearchInstitution = False, 
                  OnlyCompany = False, 
                  OnlyOther = False):
  
  #Making the input lower case and adding a space before and after
  words_input = [" "+word.lower()+" " for word in Words]

  #Creating a df (tmp1) on the basis of central_df and counting the frequency of input words
  tmp1 = central_df.copy()
  for word in words_input:
    tmp1[word] = tmp1["Abstract"].str.count(re.escape(word)) #re.escape: characters such as "." are matched literally, not as regex wildcards

  #Creating conditions for including organisation types
  #a selected type requires at least one organisation of that type (> 0);
  #an unselected type imposes no restriction (> -1)
  cond_only_university = tmp1["University"] > (0 if OnlyUniversity else -1)
  cond_only_space_agency = tmp1["Space Agency"] > (0 if OnlySpaceAgency else -1)
  cond_only_other_research_institution = tmp1["Other Research Institution"] > (0 if OnlyOtherResearchInstitution else -1)
  cond_only_company = tmp1["Company"] > (0 if OnlyCompany else -1)
  cond_only_other = tmp1["Other"] > (0 if OnlyOther else -1)

  #Creating a dictionary on the basis of tmp1
  tmp2 = {}
  for word in words_input:
    tmp2.update({word: {}})
    for year in all_years:
      tmp2[word].update({year: tmp1.loc[(tmp1["Year"] == year) & 
                                        (tmp1["Country"].isin(Countries)) & 
                                        (cond_only_university) &
                                        (cond_only_space_agency) &
                                        (cond_only_other_research_institution) &
                                        (cond_only_company) &
                                        (cond_only_other)
                                        ][word].sum()})

  result = pd.DataFrame.from_dict(tmp2).transpose()

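  #Calculating compound annual growth rate (CAGR)
  #Note: the exponent uses the number of years (5); the conventional CAGR uses the number of periods (4)
  #Note: all_years is assumed to be in chronological order
  num_years = len(all_years)
  first_year = all_years[0]
  last_year = all_years[-1]

  #not cast to int because the CAGR is NaN when the count in the first year is 0
  result["CAGR (%)"] = (((result[last_year]/result[first_year]).pow((1/num_years))-1)*100).round()

  #figure (created but not displayed; call fig.show() to display it)
  df = result.drop(axis = 1, columns = ["CAGR (%)"]).transpose()
  fig = px.line(df, x = df.index, y = words_input)

  return result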
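
Padding each word with spaces, as the function above does, misses occurrences at the very start or end of an abstract and occurrences followed by punctuation. A possible alternative, sketched here under the assumption that abstracts are plain lower-case strings, uses regex word boundaries. The helper `count_phrase` and the sample text are made up for illustration:

```python
import re

def count_phrase(text, phrase):
    """Count whole-word occurrences of `phrase` in `text`, ignoring case."""
    # re.escape treats characters such as "." literally; \b marks word boundaries
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

abstract = "iot devices support industry 4.0. the iot market grows"

print(count_phrase(abstract, "iot"))                 # 2
print(count_phrase(abstract, "industry 4.0"))        # 1
print(count_phrase("industry 4x0", "industry 4.0"))  # 0
print(abstract.count(" iot "))                       # 1 (padded search misses the first "iot")
```

Note that `\b` only works as intended for phrases that start and end with a letter, digit, or underscore.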

In [None]:
lexical_query(Words = ["internet of things", "iot", "industry 4.0", "edge","connectivity",  "communication", "communications", "telecommunication", "telecommunications"])

Unnamed: 0,2018,2019,2020,2021,2022,CAGR (%)
internet of things,14,11,15,18,20,7.0
iot,15,35,44,66,58,31.0
industry 4.0,4,4,9,2,8,15.0
edge,37,39,91,41,80,17.0
connectivity,37,19,48,48,67,13.0
communication,413,381,468,496,560,6.0
communications,155,219,112,179,231,8.0
telecommunication,55,43,31,32,49,-2.0
telecommunications,27,29,26,20,46,11.0


In [None]:
#@title Lexical Query { run: "auto", vertical-output: true }
#@markdown Enter words here. Separate words with a semicolon (;) and avoid white spaces:

###Words
Words = "economy;business;new space;trade;supply chain" #@param {type: "string"}
#note: the trailing ";" produces an empty last entry, which appears as an empty row in the output
Words = Words+";"
Words = Words.split(";")
#Words = Words.split(",")
#w = list(Words)

###Organisation Types
#@markdown If "Only_University" is selected, only publications that have at least one university as author will be included (options are non-exclusive).
Only_University = False #@param {type: "boolean"}
Only_Company = True #@param {type: "boolean"}
Only_Space_Agency = False #@param {type: "boolean"}

###Countries
#@markdown Selection Regions:
Countries = []

All_Countries = True #@param {type: "boolean"}
if All_Countries is True:
  Countries = list(all_countries) #copy, so that the += below does not modify all_countries in place

Europe = False #@param {type:"boolean"}
if Europe is True:
  Countries += countries_dict["Europe"]

Americas = False #@param {type:"boolean"}
if Americas is True:
  Countries += countries_dict["Americas"]

Asia = False #@param {type:"boolean"}
if Asia is True:
  Countries += countries_dict["Asia"]

Africa = False #@param {type:"boolean"}
if Africa is True:
  Countries += countries_dict["Africa"]

Oceania = False #@param {type:"boolean"}
if Oceania is True:
  Countries += countries_dict["Oceania"]


###Function
lexical_query(Words = Words,
              Countries = Countries,
              OnlyUniversity = Only_University,
              OnlySpaceAgency = Only_Space_Agency,
              OnlyCompany = Only_Company)
#@markdown ---
#@markdown Output

Unnamed: 0,2018,2019,2020,2021,2022,CAGR (%)
economy,29,17,17,33,54,13.0
business,67,65,63,65,71,1.0
new space,40,33,26,44,40,0.0
trade,26,36,12,10,32,4.0
supply chain,13,2,8,6,16,4.0
,0,0,0,0,0,


In [None]:
list(countries_dict.keys())

In [None]:
Countries
#all_countries

['Australia',
 'Israel',
 'Italy',
 'Belgium',
 'Germany',
 'Ghana',
 'Japan',
 'United Kingdom',
 'China',
 'The Netherlands',
 'United States',
 'France',
 'Iran',
 'Canada',
 'Austria',
 'United Arab Emirates',
 'South Africa',
 'Russian Federation',
 'Brazil',
 'Greece',
 'Portugal',
 'Taiwan, China',
 'Malta',
 'Ukraine',
 'Spain',
 'Poland',
 'India',
 'Luxembourg',
 'Peru',
 'Switzerland',
 'Norway',
 'Nigeria',
 'Mexico',
 'Hong Kong',
 'Ethiopia',
 'Denmark',
 'Thailand',
 'Sweden',
 'Korea, Republic of',
 'Bangladesh',
 'Indonesia',
 'Romania',
 'New Zealand',
 'Pakistan',
 'Malaysia',
 'Chile',
 'Ireland',
 'Czech Republic',
 'Hungary',
 'Singapore, Republic of',
 'Costa Rica',
 'Ecuador',
 'La Reunion',
 'Paraguay',
 'Finland',
 'The Philippines',
 'Kuwait',
 'Estonia',
 'Netherlands Antilles',
 'Slovenia',
 'Bolivia',
 'Slovak Republic',
 'Colombia',
 'Belarus',
 "Korea, Democratic People's Republic of",
 'Sudan',
 'Nepal',
 'Lithuania',
 'Turkey',
 'Cyprus',
 'Kenya',
 'T

In [None]:
Words = "[pnt,space,space economy]" #@param {type: "string"}
#w = Words.split(",")
w = list(Words) #note: list() on a string returns its individual characters, not the words
print(Words)
print(type(w))

[pnt,space,space economy]
<class 'list'>


In [None]:
Words = "pnt;space;space economy" #@param {type: "string"}
Words = Words.split(";")
print(Words)
print(type(Words))

['pnt', 'space', 'space economy']
<class 'list'>


In [None]:
type(w)

list

In [None]:
for word in w:
  print(word)

[
p
n
t
,
s
p
a
c
e
,
s
p
a
c
e
 
e
c
o
n
o
m
y
]


In [None]:
lexical_query(Words = w)

Unnamed: 0,2018,2019,2020,2021,2022,CAGR (%)
pnt,3,0,1,18,35,63.0
space,7010,6462,6919,6312,9724,7.0
space economy,30,39,40,58,74,20.0


In [None]:
s = input("Enter words: ")
s

Enter words: pnt, space


'pnt, space'

In [None]:
x = s.split(",")
x

['pnt', ' space']

In [None]:
lexical_query(Words = x)

Unnamed: 0,2018,2019,2020,2021,2022,CAGR (%)
pnt,3,0,1,18,35,63.0
space,0,0,0,0,0,
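
The zeros for "space" are most likely caused by the leading blank in `' space'`: together with the padding added inside `lexical_query`, the search pattern starts with two spaces and no longer matches. A minimal sketch of a parsing helper that avoids this (the name `parse_words` is an assumption for illustration):

```python
def parse_words(raw, sep=","):
    """Split a user-supplied string, strip whitespace, and drop empty entries."""
    return [part.strip() for part in raw.split(sep) if part.strip()]

print(parse_words("pnt, space"))
print(parse_words("economy;business;", sep=";"))
```

The first call returns `['pnt', 'space']`; the second returns `['economy', 'business']`, without an empty entry for the trailing separator.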


In [None]:
l1 = [1,2]
l2 = [3, 4, "s"]
l1+l2

[1, 2, 3, 4, 's']

In [None]:
countries_dict["Africa"]

['Ghana',
 'South Africa',
 'Nigeria',
 'Ethiopia',
 'Sudan',
 'Kenya',
 'Algeria',
 'Botswana',
 'Zimbabwe',
 'Togo',
 'Angola',
 'Egypt',
 'Cameroon',
 'Morocco',
 'Mauritius',
 'La Reunion']

In [None]:
dropdown = '1st option' #@param ["1st option", "2nd option", "3rd option"]

In [None]:
#@title Lexical Query


Query1 = "space", "some" #@param {type: "list"},
Query2 = "entertainment" #@param {type: "string"},
Query3 = "pnt" #@param {type: "string"},

t = list(Query1)
Only_University = True #@param {type: "boolean"}

print(t)
print(type(t))



['space', 'some']
<class 'list'>


# 3. Lexical Exploration

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
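#stopwords
#note: this rebinds the name stopwords from the NLTK corpus reader to a list,
#so the import cell above has to be run again before re-running this cell
stopwords = stopwords.words("english")
stopwords.append("space")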

In [None]:
#tokenising, removing words that are two letters or less and words that are part of the stopwords list
tokenised = {}
for year in all_years:
  tokenised.update({year: []})
for key, value in central_dict.items():
  for word in word_tokenize(value["Abstract"]):
    if len(word) > 2 and word not in stopwords:
      tokenised[value["Year"]].append(word)

In [None]:
#Creating a nested dict with word as key and a dictionary with years as keys and 0 as value
frequencies_raw = {}
for key, value in tokenised.items():
  for word in value:
    frequencies_raw.update({word: {}})
for key, value in tokenised.items():
  for word in value:
    frequencies_raw[word].update({key: 0})

In [None]:
#Actually counting the frequencies:
for key, value in tokenised.items():
  for word in value:
    frequencies_raw[word][key] += 1
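
The initialisation and counting steps above can also be combined with `collections.Counter`. This is only a sketch of an equivalent approach on made-up tokens; `tokens_by_year` stands in for the `tokenised` dictionary:

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the tokenised dictionary (year -> list of tokens)
tokens_by_year = {
    "2018": ["debris", "tourism", "debris"],
    "2019": ["debris", "lunar"],
}

# word -> Counter mapping each year to the word's frequency in that year
word_year_counts = defaultdict(Counter)
for year, tokens in tokens_by_year.items():
    for word, n in Counter(tokens).items():
        word_year_counts[word][year] = n

print(dict(word_year_counts["debris"]))
print(word_year_counts["lunar"]["2018"])  # a Counter returns 0 for missing years
```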

In [None]:
#Removing words with a total frequency of 25 or less
frequencies = {}
for key, value in frequencies_raw.items():
  counter = 0
  for year, freq in value.items():
    counter += freq
  if counter > 25:
    frequencies.update({key: value})

In [None]:
#Creating a DF
all_word_frequencies_df = pd.DataFrame.from_dict(frequencies).transpose().fillna(0).astype(int)

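#Calculating CAGR
#Note: the exponent 1/5 uses the number of years; the conventional CAGR uses the number of periods (4 for 2018-2022)
all_word_frequencies_df["CAGR (%)"] = (((all_word_frequencies_df["2022"]/all_word_frequencies_df["2018"]).pow((1/5)))-1)*100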
all_word_frequencies_df = all_word_frequencies_df.sort_values(by = "CAGR (%)", ascending = False)
#all_word_frequencies_df

In [None]:
interesting_words = ["dispute", "disputes", "court", "travelers", "passengers", "tourist", "inequality", "poor", "poverty", "ethics", "floods", "pollution", "crisis", "warming",
                     "drought", "emissions", "renewable", "relief", "vleo", "jobs", "harm","congested", "cleaning", "skill", "digitalization", "stm", "sdgs", "threatening", "geopolitical", 
                     "vulnerabilities", "entertainment", "city", "cities", "nuclear", "military" ,"privacy", "attack", "asat", "fishing", "depletion", "insurance", "jamming", "antarctica",
                     "resilient", "combat"] #duplicate entries removed so that no row appears twice
selection_word_frequencies_df = all_word_frequencies_df.loc[interesting_words].sort_values(by = "CAGR (%)", ascending = False)

In [None]:
#Exporting
with pd.ExcelWriter("results_IAC_word_freq_07_11_22.xlsx") as writer:
  selection_word_frequencies_df.to_excel(writer, sheet_name = "Selection") 
  all_word_frequencies_df.to_excel(writer, sheet_name = "All Words")

In [None]:
#total number of words for normalisation
total_num_words = {}
for key, value in tokenised.items():
  total_num_words.update({key: len(value)})
total_num_words

{'2018': 423767,
 '2019': 363143,
 '2020': 371356,
 '2021': 362519,
 '2022': 528566}
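
Since the number of tokens differs considerably between years, raw word counts are not directly comparable. A minimal sketch of the normalisation, using the totals above and made-up raw counts (`raw_counts` and `per_10k` are assumptions for illustration):

```python
# Totals per year as printed above
total_num_words = {"2018": 423767, "2019": 363143, "2020": 371356,
                   "2021": 362519, "2022": 528566}

# Hypothetical raw counts of a single word per year
raw_counts = {"2018": 40, "2019": 35, "2020": 38, "2021": 36, "2022": 55}

def per_10k(raw, totals):
    """Express raw counts as occurrences per 10,000 tokens of the same year."""
    return {year: raw[year] / totals[year] * 10_000 for year in raw}

for year, value in per_10k(raw_counts, total_num_words).items():
    print(year, round(value, 3))
```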

# 4. Countries

## 4.1 Papers by Country

This function takes countries as arguments. It returns a df showing the number of papers by country and a total.

In [6]:
def query_year_country(Countries = all_countries):
  result = pd.DataFrame()
  for year in all_years:
    add = central_df.loc[((central_df["Year"] == year)) & (central_df["Country"].isin(Countries))]["Country"].value_counts() #creating a series per year
    result[year] = add #and adding the series to the DF object
    result = result.fillna(0).astype(int)

  #Adding Total column
  result["Total"] = result.iloc[:, 0:].sum(axis = 1).astype(int)

  return result
query_year_country(Countries = ["Germany", "South Africa"])

Unnamed: 0,2018,2019,2020,2021,2022,Total
Germany,439,180,124,140,243,1126
South Africa,11,10,7,4,1,33


## 4.2 Comparing Countries

This function returns a line graph to compare the number of contributions per country. It relies on the function above for the dataframe.

In [29]:
def graph_compare_country(Countries = all_countries):
   df = query_year_country(Countries).transpose().drop(["Total"], axis = 0)
   fig = px.line(df, x = df.index, y = df.columns, template = "plotly_dark",
                 title = f"Contributions of Countries over Time", labels = {"value" : "Number of Papers", "index" : "Years", "variable": "Countries"})
   fig.show()
graph_compare_country(countries_dict["Asia"])
graph_compare_country(all_countries)

## 4.3 Sunburst: Contributions by Country and Region

A. This function takes a year or total ("2019" or "Total") as argument to return a sunburst plot to compare different regions in terms of contributions.

In [42]:
def graph_compare_region_country(Years_or_Total):
  df = query_year_country()
  regions = []
  for country in df.index:
    for key, value in countries_dict.items():
      if country in value:
        regions.append(key)
  df.insert(0, "Region", regions)
  fig = px.sunburst(df, path = ["Region", df.index], values = Years_or_Total,
                    template = "plotly_dark", title = f"Contributions by Region and Country: {Years_or_Total}")
  fig.show()

B. The plan was to create subplots with all years and the total. This would have required restructuring the DF, so it is left for another time. Here are the function calls for all years and the total:

In [None]:
graph_compare_region_country("Total")
graph_compare_region_country("2018")
graph_compare_region_country("2019")
graph_compare_region_country("2020")
graph_compare_region_country("2021")
graph_compare_region_country("2022")

# 5. Topics

This function returns a df to show the number of abstracts per topic per year. This is used in the following functions.

In [None]:
def df_aggregated_topics():
  
  #df
  df = central_df.copy()
  new_df = pd.DataFrame()
  for year in all_years:
    add = df.loc[df["Year"] == year]["Topic Name"].value_counts()
    new_df[year] = add

  #dropping unknown topics
  new_df = new_df.drop(["???"], axis = 0).fillna(0).astype(int).transpose()

  return new_df
  
df = df_aggregated_topics()

In [None]:
df = df_aggregated_topics().transpose()
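#Note: the exponent 1/5 uses the number of years; the conventional CAGR uses the number of periods (4 for 2018-2022)
df["CAGR"] = (((df["2022"]/df["2018"]).pow((1/5)))-1)*100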
df
#df.sort_values(by = "CAGR", ascending = False).head(25)

Unnamed: 0,2018,2019,2020,2021,2022,CAGR
Satellite control,113,80,114,116,137,3.927008
Education,98,98,97,76,124,4.818785
Communication,62,48,60,60,73,3.320437
"Legal, Treaty, Jurisdiction",62,52,67,43,56,-2.015074
Commerce and Economy,61,52,51,52,70,2.790657
...,...,...,...,...,...,...
Emirates Mars Mission,5,5,10,12,6,3.713729
Blockchain,4,7,12,7,6,8.447177
"UAE, Mohammed bin Rashid Space Centre",4,10,21,23,8,14.869835
Brazil,2,10,7,10,19,56.871743


This function returns a plot to show the development of all topics over time:

In [None]:
def graph_compare_topics_line():
  
  #df
  new_df = df_aggregated_topics()

  #fig
  fig = px.line(new_df, x = new_df.index, y = new_df.columns,
                template = "plotly_dark", labels = {"value": "Number of Papers (per year)", "index": "Years", "variable" : "Topic"},
                title = "Topic Frequency")
  fig.show()
graph_compare_topics_line()

This function takes year as an argument to return a pie chart with the distribution of all topics in the given year:

In [None]:
def graph_compare_topics_pie(Year):
  #df
  df = df_aggregated_topics().transpose()

  #fig
  fig = px.pie(df, values = df[Year], names = df.index,
               template = "plotly_dark", title = f"Distribution of Topics in {Year}")
  fig.update_traces(textposition='inside', textinfo='percent+label')

  fig.show()

graph_compare_topics_pie("2022")

This query function returns the papers that match the given conditions (args), i.e., the selected topics, countries, and organisation types:

In [None]:
def topic_query(Topics = all_topic_names, 
                Countries = all_countries, 
                OnlyUniversity = False, 
                OnlySpaceAgency = False, 
                OnlyOtherResearchInstitution = False, 
                OnlyCompany = False, 
                OnlyOther = False):
  
  #copying the df from the central_df
  result = central_df.copy()

  #Creating conditions for including organisation types
  #a selected type requires at least one organisation of that type (> 0);
  #an unselected type imposes no restriction (> -1)
  cond_only_university = result["University"] > (0 if OnlyUniversity else -1)
  cond_only_space_agency = result["Space Agency"] > (0 if OnlySpaceAgency else -1)
  cond_only_other_research_institution = result["Other Research Institution"] > (0 if OnlyOtherResearchInstitution else -1)
  cond_only_company = result["Company"] > (0 if OnlyCompany else -1)
  cond_only_other = result["Other"] > (0 if OnlyOther else -1)


  result = result.loc[(result["Topic Name"].isin(Topics)) & 
             (result["Country"].isin(Countries)) & 
             (cond_only_university) &
             (cond_only_space_agency) &
             (cond_only_other_research_institution) &
             (cond_only_company) &
             (cond_only_other)
             ]
  return result

In [None]:
topic_query(Topics = ["Education"], Countries = ["Germany"], OnlyUniversity = True, OnlySpaceAgency = True)

Unnamed: 0,Paper_id,Top2Vec_id,Year,Country,Region,Topic Number,Topic Name,Organisations,Abstract,University,Space Agency,Other Research Institution,Company,Other,Unknown
2020_60041,60041,2020_115,2020,Germany,Europe,1,Education,"[{'Name': 'Technical University Dresden', 'Typ...",the trend towards smaller satellites and mega...,9,1,1,1,0,0
2021_63253,63253,2021_159,2021,Germany,Europe,1,Education,"[{'Name': 'DLR', 'Type': 'Space Agency'}, {'Na...",in 2018 the international institute for astro...,1,1,0,0,0,2


# 6. Organisation Types

## 6.1 Contributions by OrgaType over Time

In [None]:
#This function provides the df

def query_year_orgatype():
  result = pd.DataFrame()
  for orga_type in ["University", "Space Agency", "Company"]:
    result[orga_type] = central_df.loc[(central_df[orga_type] > 0)]["Year"].value_counts()
  return result

In [69]:
#This function creates the graph from the data above

def graph_compare_orgatype_year():
  df = query_year_orgatype()
  
  fig = px.line(df, template = "plotly_dark",
                labels = {"value": "Number of Papers", "index": "Years", "variable": "Organisation Type"},
                title = "Number of Papers by Organisation Type"
                )
  fig.show()

## 6.2 OrgaTypes by Country

This df shows the number of papers per country and per organisation type (i.e., a paper where at least one organisation of that type was involved in making the paper):

In [71]:
def query_orgatype_country():
  result = pd.DataFrame()
  for orga_type in ["University", "Space Agency", "Company"]:
    result[orga_type] = central_df.loc[(central_df[orga_type] > 0)]["Country"].value_counts()
  
  result = result.fillna(0).astype("int")
  return result

In [78]:
def graph_compare_orgatype_country(Orga_Type = "University"):
  df = query_orgatype_country()
  s = df[Orga_Type].sort_values(ascending = False)

  fig = px.bar(s, x = s.index, y = Orga_Type,
               template = "plotly_dark",
               labels = { Orga_Type: "Number of Contributions (Total)","index": "Country"},
               title = f"Total Contributions by Organisation Type ({Orga_Type})"
               )
  fig.show()

## 6.3 Cooperation between Organisation Types

This shows the different types of cooperation between organisation types:

In [101]:
def query_orgatype_coop():
  # only coop between uni and company (but not space agency) by country
  add1 = central_df.loc[(central_df["University"] > 0) & (central_df["Company"] > 0) & (central_df["Space Agency"] == 0)]["Country"].value_counts()

  #coop uni-space agency
  add2 = central_df.loc[(central_df["University"] > 0) & (central_df["Space Agency"] > 0) & (central_df["Company"] == 0)]["Country"].value_counts()

  #coop space agency-company
  add3 = central_df.loc[(central_df["Space Agency"] > 0) & (central_df["Company"] > 0) & (central_df["University"] == 0)]["Country"].value_counts()

  #coop space agency-company-uni
  add4 = central_df.loc[(central_df["University"] > 0) & (central_df["Company"] > 0) & (central_df["Space Agency"] > 0)]["Country"].value_counts()

  result = pd.DataFrame()
  result["University-Company"] = add1
  result["University-Space Agency"] = add2
  result["Space Agency-Company"] = add3
  result["Space Agency-Company-University"] = add4
  result = result.fillna(0).astype(int)

  return result

In [102]:
def graph_orgatype_coop():
  df = query_orgatype_coop()

  fig = px.bar(df, template = "plotly_dark",
               labels = {"value": "Number of Papers", "index": "Country", "variable": "Type of Cooperation:"},
               title = "Cooperation between Organisation Types (Total)"
               )
  fig.show()

graph_orgatype_coop()