# Creating a dataframe of citation counts for each paper

_Foreword_

The goal of this notebook is to create a pandas dataframe that contains, for each paper, several citation counts (cumulative citations up to a given month or year, and citations received within that month or year). The citations counted come from papers related to encryption technologies and published from 2002 to 2022.
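To make the target columns concrete, here is a minimal sketch of how the four counts could be derived for one paper from a table of citation events. The column names `cited`, `year` and `month`, the helper `citation_counts`, and the choice to count the current month inclusively are all assumptions made for illustration. The actual computation is done by `creating_dfcit` from `myfunctions`, whose implementation is not shown in this notebook and may differ.

```python
import pandas as pd

def citation_counts(citations, paper, year, month):
    """Return illustrative citation counts for `paper` at (year, month).

    `citations` has one row per citation event with the hypothetical columns
    'cited' (paper id), 'year' (int) and 'month' (int, 1-12).
    """
    c = citations[citations["cited"] == paper]
    in_year = c["year"] == year
    in_month = in_year & (c["month"] == month)
    before_year = c["year"] < year
    # Assumption: "up to this month" includes the month itself.
    up_to_month = before_year | (in_year & (c["month"] <= month))
    return {
        "citfortheyear": int(in_year.sum()),
        "citforthemonth": int(in_month.sum()),
        "cituptothistime_year": int((before_year | in_year).sum()),
        "cituptothistime_month": int(up_to_month.sum()),
    }

# Toy data: paper "A" is cited once in 2003 and twice in 2004.
toy = pd.DataFrame({
    "cited": ["A", "A", "A", "B"],
    "year": [2003, 2004, 2004, 2004],
    "month": [5, 1, 7, 7],
})
counts = citation_counts(toy, "A", 2004, 3)
```

In this toy example, the yearly counts for March 2004 already include the July 2004 citation, while the monthly cumulative count does not.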

Importing the necessary libraries.

In [1]:
import pandas as pd
import pickle
from myfunctions import creating_dfcit
import time
from tqdm import tqdm
import math

Loading dfcitbasic and df_full_cleaned; the latter was cleaned and processed by the file "dataexploration_full_data".

In [2]:
# The with-statement closes the file even if unpickling fails.
with open('data_creation_variables/dfcitbasic', 'rb') as infile_dfcitbasic:
    dfcitbasic = pickle.load(infile_dfcitbasic)

In [3]:
# The with-statement closes the file even if unpickling fails.
with open('../exploratory_analysis/data_exploratory_analysis/df_full_cleaned', 'rb') as infile_data_full:
    df_full = pickle.load(infile_data_full)

Defining some variables.

In [4]:
myyears = list(range(2002, 2023))  # 2002 through 2022 inclusive
mymonths = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
            'August', 'September', 'October', 'November', 'December']
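As a rough illustration of what these two lists imply, the sketch below enumerates every (year, month) pair. It assumes that the final dataframe holds one row per paper and per pair, which the `paper`, `year` and `month` columns used later suggest but which is ultimately decided inside `creating_dfcit`. The names `years`, `months` and `grid` are illustrative only.

```python
from itertools import product

years = list(range(2002, 2023))  # same span as myyears
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Every (year, month) combination, in chronological order.
grid = list(product(years, months))
rows_per_paper = len(grid)  # 21 years * 12 months
```

Under that assumption, each paper contributes 252 rows, which helps explain why the computation below is split into batches.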

In [5]:
# I take all the papers from df_full, because dfcitbasic contains only the citing papers, not all the papers I have.
listpapers = list(set(df_full.paper.tolist()))

In [6]:
# I delete df_full because it is a huge object that takes up a lot of memory.
del df_full

Defining some variables for my computations.

In [7]:
numberofpapers=len(listpapers)
fractionnumber=100

In [8]:
list_listpaper=[]

In [9]:
step = numberofpapers // fractionnumber  # integer division, same result as flooring

In [10]:
# I split the papers into fractionnumber subsets; the last subset also takes the remainder.
for i in tqdm(range(fractionnumber)):
    start = i*step
    end = (i+1)*step
    if i == fractionnumber-1:
        mypapers = listpapers[start:]
    else:
        mypapers = listpapers[start:end]
    list_listpaper.append(mypapers)

100%|█████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 33306.63it/s]
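The splitting logic above can be sketched as a small reusable function. The name `split_into_chunks` is hypothetical and not part of the notebook; the behaviour mirrors the loop, with a fixed `step` and a last chunk that absorbs whatever is left over.

```python
def split_into_chunks(items, n_chunks):
    """Split `items` into `n_chunks` consecutive slices.

    Every slice has len(items) // n_chunks elements, except the last one,
    which also receives the remainder.
    """
    step = len(items) // n_chunks
    chunks = []
    for i in range(n_chunks):
        start = i * step
        # None as the slice end means "up to the end of the list".
        end = None if i == n_chunks - 1 else (i + 1) * step
        chunks.append(items[start:end])
    return chunks

chunks = split_into_chunks(list(range(10)), 3)
```

With 10 items and 3 chunks, the first two chunks hold 3 items each and the last holds 4, so the last chunk can be noticeably larger when the remainder is big.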


In [11]:
# I check that no paper appears in more than one subset (nothing is printed if the subsets are disjoint).
for i in range(fractionnumber):
    for j in range(fractionnumber):
        lst1 = list_listpaper[i]
        lst2 = list_listpaper[j]
        res = list(set(lst1) & set(lst2))
        if i !=j:
            if len(res)!=0:
                print(len(res))
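The pairwise comparison above builds two sets for each of the 100 x 100 pairs. A cheaper alternative, sketched here under the assumption that no subset contains internal duplicates (true in this notebook, since `listpapers` is built from a set), is to compare the total number of elements with the number of distinct elements. The helper name `chunks_are_disjoint` is hypothetical.

```python
def chunks_are_disjoint(chunks):
    """Return True if no element appears in more than one chunk.

    Assumes each chunk has no duplicates of its own; otherwise an internal
    duplicate would also make this return False.
    """
    total = sum(len(chunk) for chunk in chunks)
    distinct = len(set().union(*chunks))
    return total == distinct

disjoint = chunks_are_disjoint([["p1", "p2"], ["p3"], ["p4", "p5"]])
overlapping = chunks_are_disjoint([["p1", "p2"], ["p2", "p3"]])
```

This makes a single pass over the data, but unlike the loop above it does not report which pair of subsets overlaps.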

In [12]:
listindex= [0,20,40,60,80,100]

In [13]:
list_listpaper_compute=[]

In [14]:
# I group the 100 subsets into 5 lists of 20 subsets each.
for j in range(5):
    list_listpaper_compute.append(list_listpaper[listindex[j]:listindex[j+1]])

In [15]:
# I check again that no paper appears in more than one subset across the 5 groups.
for i in range(5):
    for j in range(5):
        list1 = list_listpaper_compute[i]
        list2 = list_listpaper_compute[j]
        for element1 in list1:
            for element2 in list2:
                res = list(set(element1) & set(element2))
                if i !=j:
                    if element1!=element2:
                        if len(res)!=0:
                            print(len(res))

In [16]:
# The checks above print nothing, so the subsets have no papers in common.
items = ['paper', 'year', 'month', 'cituptothistime_year','cituptothistime_month', 'citforthemonth', 'citfortheyear']

In [17]:
# For each of the 5 groups, I build a dictionary with all the information, turn it into a pandas dataframe and save it.
for j in range(5):
    finaldicocitation = {'paper': [],
                         'year': [],
                         'month': [],
                         'cituptothistime_year': [],
                         'cituptothistime_month': [],
                         'citforthemonth': [],
                         'citfortheyear': []
                         }
    # One dataframe per subset of papers in this group.
    list_df = [creating_dfcit(myyears, mymonths, dfcitbasic, x) for x in tqdm(list_listpaper_compute[j])]
    for element in list_df:
        for item in items:
            # extend() appends in place instead of rebuilding the whole list each time.
            finaldicocitation[item].extend(element[item].tolist())
    newfile = pd.DataFrame(finaldicocitation)
    newfile.to_pickle('data_creation_variables/finaldicocitation_full'+str(j))

100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [41:09<00:00, 123.48s/it]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [42:06<00:00, 126.34s/it]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [42:22<00:00, 127.14s/it]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [42:31<00:00, 127.57s/it]
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [40:41<00:00, 122.10s/it]
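Accumulating column lists in a dictionary works, but the same stacking can be sketched with `pd.concat`, which appends dataframes row-wise directly. The helper `combine_frames` and the toy frames below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def combine_frames(frames, columns):
    """Stack dataframes row-wise, keeping only `columns`, with a fresh index."""
    return pd.concat([frame[columns] for frame in frames], ignore_index=True)

# Two toy partial results standing in for the per-subset dataframes.
part_a = pd.DataFrame({"paper": ["A"], "year": [2002], "citfortheyear": [1]})
part_b = pd.DataFrame({"paper": ["B"], "year": [2003], "citfortheyear": [4]})
combined = combine_frames([part_a, part_b], ["paper", "year", "citfortheyear"])
```

`ignore_index=True` renumbers the rows from 0, which matches what `pd.DataFrame(finaldicocitation)` produces from plain lists.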


Finally, I combine these five dataframes into a single dataframe and save it.

In [18]:
finaldicocitation = {'paper': [],
                     'year': [],
                     'month': [],
                     'cituptothistime_year': [],
                     'cituptothistime_month': [],
                     'citforthemonth': [],
                     'citfortheyear': []
                     }

In [19]:
for j in range(5):
    with open('data_creation_variables/finaldicocitation_full'+str(j), 'rb') as f:
        dfdicocit = pickle.load(f)
    for item in items:
        # extend() appends in place instead of rebuilding the whole list each time.
        finaldicocitation[item].extend(dfdicocit[item].tolist())

In [20]:
newfile = pd.DataFrame(finaldicocitation)

In [21]:
newfile.to_pickle('data_creation_variables/finaldicocitation')
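After saving, it can be worth reading the file back to confirm that nothing was lost. The sketch below does this on a small sample dataframe in a temporary directory; the helper `roundtrip_pickle` and the file name `sample_citations` are hypothetical, and the real file would be `data_creation_variables/finaldicocitation`.

```python
import os
import tempfile

import pandas as pd

def roundtrip_pickle(df, path):
    """Write `df` to `path` with to_pickle and read it back with read_pickle."""
    df.to_pickle(path)
    return pd.read_pickle(path)

sample = pd.DataFrame({"paper": ["A", "B"], "citfortheyear": [1, 4]})
with tempfile.TemporaryDirectory() as tmpdir:
    restored = roundtrip_pickle(sample, os.path.join(tmpdir, "sample_citations"))

# equals() compares values, dtypes and index.
roundtrip_ok = restored.equals(sample)
```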