# Transforming raw data into a pandas DataFrame

_Foreword_

The goal of this notebook is to transform the filtered data into a pandas dataframe, which will be much easier to handle in further computations.

I import the modules and functions I will need later.

In [1]:
import json
from tqdm import tqdm
import pandas as pd
from myfunctions import recreation_abstract,compute_list_score,get_authors,\
    dicowithfilteredref,get_references,get_citing_works
import time
import pickle
import math

I load the filtered data that I scraped from OpenAlex.

In [2]:
# The context manager closes the file even if unpickling fails
with open('data_creation_variables/filtered_data', 'rb') as infile_filtered_data:
    filtered_data = pickle.load(infile_filtered_data)

Again, here are the concepts I am interested in, with their OpenAlex ids.

In [3]:
concept_ids = {
    'Authentication protocole': 'C21564112',
    'Biometrics': 'C184297639',
    'Blockchain': 'C2779687700',
    # 'Database Encryption': '',
    'Differential Privacy': 'C23130292',
    'Digital rights management': 'C537843408',
    'Digital signature': 'C118463975',
    'Disk Encryption': 'C9368797',
    'Distributed algorithm': 'C130120984',
    'Electronic voting': 'C2780612046',
    # 'Email encryption': '',
    'Functional encryption': 'C2780746774',
    'Hardware acceleration': 'C13164978',
    'Hardware security module': 'C39217717',
    'Hash function': 'C99138194',
    'Homomorphic encryption': 'C158338273',
    'Identity management': 'C555379026',
    # 'Identity-based encryption': '',
    'Key management': 'C17886624',
    'Link encryption': 'C69254412',
    'Post-quantum cryptography': 'C108277079',
    # 'Private set intersection': '',
    'Public-key cryptography': 'C203062551',
    'Quantum key distribution': 'C95466800',
    'Quantum cryptography': 'C144901912',
    'Random number generation': 'C201866948',
    # 'Searchable symmetric encryption': '',
    'Symmetric-key algorithm': 'C65302260',
    'Threshold cryptosystem': 'C123744220',
    'Trusted Computing': 'C2776831232',
    # 'Trusted execution environment': '',
    'Tunneling protocol': 'C76885553',
    'Zero-knowlegde proof': 'C176329583'}

In [4]:
# list of the concept names, in the same order as in concept_ids
mylistofconcepts = list(concept_ids)

I keep only the referenced works that belong to my set of papers, that is, the papers published between 2002 and 2022 that are related to encryption technologies. This is what the function "dicowithfilteredref" below does.

In [5]:
helpdico= dicowithfilteredref(filtered_data)
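The filtering step described above can be sketched as follows. This is only an illustration: `dicowithfilteredref` lives in `myfunctions` and its actual implementation is not shown here, so the function name `filter_references`, its arguments and the toy ids are all assumptions.

```python
def filter_references(ids, referenced_works):
    """Map each paper id to the references that are themselves in `ids`.

    Hypothetical stand-in for dicowithfilteredref; the real function may differ.
    """
    known_ids = set(ids)  # a set makes each membership test cheap
    return {
        paper_id: [ref for ref in refs if ref in known_ids]
        for paper_id, refs in zip(ids, referenced_works)
    }


# Toy data: W9 and W7 are not in the dataset, so they are dropped
toy_ids = ["W1", "W2", "W3"]
toy_refs = [["W2", "W9"], [], ["W1", "W2", "W7"]]
filtered_refs = filter_references(toy_ids, toy_refs)
print(filtered_refs)
```

Building the set of known ids once matters at this scale: with several hundred thousand papers, testing membership against a list would be far slower.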

I now prepare a dictionary that I will turn into a dataframe. To do so, I first compute all the lists that will become its columns.

In [6]:
fulldata_df = {}

We create a list of abstracts. OpenAlex provides each abstract as an inverted index (each word with its positions in the text), so we rebuild the plain text from it.

In [7]:
abstract_list = list(map(recreation_abstract, filtered_data['abstract_inverted_index']))
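As an illustration of the reconstruction, here is a minimal sketch assuming the inverted index is a dictionary mapping each word to the list of its positions. The name `rebuild_abstract` is hypothetical; the notebook's `recreation_abstract` may handle edge cases differently.

```python
def rebuild_abstract(inverted_index):
    """Rebuild plain text from a {word: [positions]} inverted index.

    Hypothetical sketch, not the notebook's recreation_abstract.
    """
    if not inverted_index:  # missing abstracts are assumed to be None or empty
        return ""
    # Flatten to (position, word) pairs, then order them by position
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


toy_index = {"the": [0, 3], "key": [1], "signs": [2], "message": [4]}
toy_abstract = rebuild_abstract(toy_index)
print(toy_abstract)
```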

We create a list of authors from the authorship information attached to each paper.

In [8]:
authors_list = list(map(get_authors, filtered_data['authorships']))
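A possible shape for this extraction is sketched below, assuming each entry of `authorships` is a dictionary with an `author` sub-dictionary holding `id` and `display_name`. Whether `get_authors` returns ids or names is not visible in this notebook, so returning ids is an assumption, as is the name `extract_author_ids`.

```python
def extract_author_ids(authorships):
    """Return the author ids of one paper, in authorship order.

    Hypothetical sketch; the notebook's get_authors may return other fields.
    """
    return [
        entry["author"]["id"]
        for entry in (authorships or [])  # tolerate a missing authorship list
        if entry.get("author")            # skip entries without author data
    ]


toy_authorships = [
    {"author": {"id": "A1", "display_name": "First Author"}},
    {"author": {"id": "A2", "display_name": "Second Author"}},
]
toy_authors = extract_author_ids(toy_authorships)
print(toy_authors)
```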

We create a list of references (as OpenAlex ids) for each paper, keeping only the references to papers in our dataset.

In [9]:
references_list = list(map(lambda x: get_references(x, helpdico), tqdm(filtered_data['id'])))

100%|█████████████████████████████████████████████████████████████████████| 285716/285716 [00:00<00:00, 2068915.10it/s]


For each paper, we create a list of scores, one per concept of interest, measuring how strongly the paper is attributed to that concept.

In [10]:
scores_list = list(map(compute_list_score, filtered_data['concepts']))
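One way to build such a score list is sketched below. It assumes each paper's concepts are dictionaries with an `id` and a `score`, that ids may carry a URL prefix, and that a concept absent from the paper gets a score of 0. These are assumptions, and `scores_for_concepts` is a hypothetical name, not the notebook's `compute_list_score`.

```python
def scores_for_concepts(paper_concepts, wanted_ids):
    """Return one score per wanted concept id, 0.0 when the paper lacks it.

    Hypothetical sketch; the real compute_list_score may differ.
    """
    score_by_id = {
        # Keep only the last path segment so URL-style ids match short ids
        concept["id"].rsplit("/", 1)[-1]: concept["score"]
        for concept in paper_concepts
    }
    return [score_by_id.get(concept_id, 0.0) for concept_id in wanted_ids]


# 'C0000' is a placeholder id for a concept outside the list of interest
toy_wanted = ["C184297639", "C2779687700"]  # Biometrics, Blockchain
toy_paper_concepts = [
    {"id": "https://openalex.org/C2779687700", "score": 0.8},
    {"id": "https://openalex.org/C0000", "score": 0.5},
]
toy_scores = scores_for_concepts(toy_paper_concepts, toy_wanted)
print(toy_scores)
```

Returning the scores in a fixed order is what allows the `concepts` and `score_concepts` columns to be paired position by position later on.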

I add all the information I am interested in to make a dataframe out of it.

In [11]:
fulldata_df['id']=filtered_data['id']
fulldata_df['title']=filtered_data['title']
fulldata_df['publication_date']=filtered_data['publication_date']
fulldata_df['author']=authors_list
fulldata_df['referenced_works']=references_list
fulldata_df['abstract']=abstract_list
fulldata_df['concepts']=len(filtered_data['id'])*[mylistofconcepts]
fulldata_df['score_concepts']=scores_list
fulldata_df['year']=filtered_data['year']
fulldata_df['month']=filtered_data['month']

In [12]:
df_full = pd.DataFrame(fulldata_df)

In [13]:
print(f'In my dataset, there are {df_full.id.nunique()} papers.')

In my dataset, there are 285716 papers.


I save this version of df_full because I will use it later in some computations.

In [14]:
df_full_notexploded = df_full
df_full_notexploded.to_pickle('data_creation_variables/df_full_notexploded')

I explode all the list-valued columns to obtain a dataframe without lists in its cells, and I save this dataframe.

In [15]:
# I do not explode that in order to avoid memory errors
df_full = df_full.explode('author')
df_full = df_full.explode('referenced_works')
# Exploding two columns together (pandas >= 1.3) keeps each concept aligned with its score
df_full = df_full.explode(['concepts', 'score_concepts'])
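To see what the successive explodes do to the row count, here is a small sketch on toy data (the ids, authors and scores are made up for illustration). Each explode multiplies the rows of a paper by the length of the exploded list, which is why the final dataframe is much larger than the non-exploded one.

```python
import pandas as pd

# Toy dataframe with list-valued cells (illustrative values only)
toy_df = pd.DataFrame({
    "id": ["W1", "W2"],
    "author": [["A1", "A2"], ["A3"]],
    "concepts": [["Blockchain", "Biometrics"], ["Blockchain", "Biometrics"]],
    "score_concepts": [[0.9, 0.0], [0.1, 0.7]],
})

# One row per (paper, author)
toy_exploded = toy_df.explode("author")
# One row per (paper, author, concept); the two lists are exploded in lockstep
toy_exploded = toy_exploded.explode(["concepts", "score_concepts"])

# W1 has 2 authors x 2 concepts and W2 has 1 author x 2 concepts
print(len(toy_exploded))
```

Exploding `concepts` and `score_concepts` in separate calls would instead produce every combination of concept and score, so the paired call is what preserves the meaning of each score.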

In [16]:
df_full.to_pickle('data_creation_variables/df_full')