This notebook cleans the data retrieved from the KB long-term-stable data set versions. Duplicates are removed and the cleaned data is saved to a new file. The item_ids are saved as well so that the references can be retrieved later. A following notebook contains the preliminary analysis and visualization of the data.

In [18]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
#import re
#import numpy as np
#import matplotlib as mpl

# Fonts for plots
#mpl.rcParams['font.serif'] = 'Times New Roman'
#plt.rcParams['font.family'] = 'serif'

#pd.set_option('display.max_columns', None)

# Path to data retrieval and storage
path = "C:/Users/kleinow/ownCloud/MA_Neuro"

In [19]:
# Load KB data; 12 data frames
# cn1 indicates "computational neuroscience", whereas cn2 indicates "computational neurosciences" (note the -s)

# CompNeuro found in keyword section 
cn1 = pd.read_csv(path + '/cn1.csv')
cn2 = pd.read_csv(path + '/cn2.csv')

#CompNeuro found in title section
cn1_t = pd.read_csv(path + '/cn1_titles.csv')
cn2_t = pd.read_csv(path + '/cn2_titles.csv')

#CompNeuro found in abstract section
cn1_abs = pd.read_csv(path + '/cn1_abs.csv')
cn2_abs = pd.read_csv(path + '/cn2_abs.csv')

# Corresponding abstract retrieval for each of the data frames above
cn1_abstracts = pd.read_csv(path + '/cn1_abstracts.csv')
cn2_abstracts = pd.read_csv(path + '/cn2_abstracts.csv')
cn1_t_abstracts = pd.read_csv(path + '/cn1_titles_abstracts.csv')
cn2_t_abstracts = pd.read_csv(path + '/cn2_titles_abstracts.csv')
cn1_abs_abstracts = pd.read_csv(path + '/cn1_abs_abstracts.csv')
cn2_abs_abstracts = pd.read_csv(path + '/cn2_abs_abstracts.csv')



In [20]:
# Merge abstracts to data frames
# Attention: Contrary to the KB overview, the abstract column is called "abstract" here, not "abstract_text"
cn1 = cn1.merge(cn1_abstracts[['item_id', 'abstract']], on='item_id', how='left')
cn2 = cn2.merge(cn2_abstracts[['item_id', 'abstract']], on='item_id', how='left')

cn1_t = cn1_t.merge(cn1_t_abstracts[['item_id', 'abstract']], on='item_id', how='left')
cn2_t = cn2_t.merge(cn2_t_abstracts[['item_id', 'abstract']], on='item_id', how='left')

cn1_abs = cn1_abs.merge(cn1_abs_abstracts[['item_id', 'abstract']], on='item_id', how='left')
cn2_abs = cn2_abs.merge(cn2_abs_abstracts[['item_id', 'abstract']], on='item_id', how='left')
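
A left merge adds rows whenever an item_id occurs more than once in the abstracts table, so it can be worth guarding against that. The following is a minimal sketch on toy data, not part of the original workflow; the helper name merge_abstracts and the toy frames are hypothetical. It relies on the validate argument of DataFrame.merge, which raises pandas.errors.MergeError when the stated key relationship does not hold.

```python
import pandas as pd

def merge_abstracts(items, abstracts):
    """Left-merge abstracts onto items (hypothetical helper).

    Fails loudly if an item_id occurs more than once in the abstracts table,
    because that would silently multiply rows in the result.
    """
    return items.merge(
        abstracts[['item_id', 'abstract']],
        on='item_id',
        how='left',
        validate='many_to_one',  # keys on the right-hand side must be unique
    )

# Toy data standing in for cn1 and cn1_abstracts
items = pd.DataFrame({'item_id': ['A', 'B', 'C'],
                      'item_title': ['t1', 't2', 't3']})
abstracts = pd.DataFrame({'item_id': ['A', 'B'],
                          'abstract': ['text a', 'text b']})

merged = merge_abstracts(items, abstracts)
# The row count is unchanged; items without an abstract get NaN
```

With this check in place, a duplicated item_id in one of the abstract files would stop the cell instead of inflating the row counts reported later.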

In [21]:
# Stack the cn2 data frames onto the cn1 data frames and remove duplicate rows
# (pd.concat replaces DataFrame.append, which was deprecated and is removed in pandas 2.0)
cn = pd.concat([cn1, cn2]).drop_duplicates().reset_index(drop=True) # method chaining
cn_t = pd.concat([cn1_t, cn2_t]).drop_duplicates().reset_index(drop=True)
cn_abs = pd.concat([cn1_abs, cn2_abs]).drop_duplicates().reset_index(drop=True)

# Comparison of the length of the data frames
lengths = {
    "DataFrame": ["cn1", "cn2", "cn_combined", "cn1_t", "cn2_t", "cn_t_combined", "cn1_abs", "cn2_abs", "cn_abs_combined"],
    "Length": [
        len(cn1), len(cn2), len(cn),
        len(cn1_t), len(cn2_t), len(cn_t),
        len(cn1_abs), len(cn2_abs), len(cn_abs)
    ]
}

length_df = pd.DataFrame(lengths)
length_df




   DataFrame        Length
0  cn1                 705
1  cn2                  11
2  cn_combined         716
3  cn1_t                72
4  cn2_t                 0
5  cn_t_combined        72
6  cn1_abs             903
7  cn2_abs              22
8  cn_abs_combined     924


Note: The KB search returned no hits at all for "computational neurosciences" in the title section.
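
The length table also shows that cn1_abs and cn2_abs have 903 + 22 = 925 rows together, while the combined frame has 924, so one fully identical row was dropped. A minimal sketch of the stacking step on toy data follows; the helper stack_and_dedupe is hypothetical. It uses pd.concat, which also works in pandas 2.0 and later, where DataFrame.append is no longer available.

```python
import pandas as pd

def stack_and_dedupe(frames):
    """Stack data frames, drop fully identical rows, and report how many
    rows were removed (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True)
    deduped = combined.drop_duplicates().reset_index(drop=True)
    return deduped, len(combined) - len(deduped)

# Toy frames: the row for item 'B' appears in both
a = pd.DataFrame({'item_id': ['A', 'B'], 'pubyear': [2001, 2002]})
b = pd.DataFrame({'item_id': ['B', 'C'], 'pubyear': [2002, 2003]})

combined, n_removed = stack_and_dedupe([a, b])
print(len(combined), n_removed)
```

Note that drop_duplicates without a subset only removes rows that agree in every column; two records for the same publication that differ in any field are both kept.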

In [22]:
# Combine the results for the keyword section, title, and abstract
cn_full = pd.concat([cn, cn_t, cn_abs]).drop_duplicates().reset_index(drop=True)
cn_full.shape # (1587, 51)



(1587, 51)

Combining the results for the keyword section, title, and abstract yields a data frame with 1587 entries. In the process, 125 duplicate rows were removed from the 1712 rows of the three combined data frames.
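
The figures in this paragraph can be checked against the length table with a few lines of arithmetic. The sketch below only re-uses the row counts printed in the outputs above; the variable names are illustrative.

```python
# Row counts taken from the notebook outputs above
combined_lengths = {'cn': 716, 'cn_t': 72, 'cn_abs': 924}
rows_after_dedup = 1587

rows_before_dedup = sum(combined_lengths.values())
removed = rows_before_dedup - rows_after_dedup

print(rows_before_dedup, removed)  # 1712 125
```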

In [23]:
# Add a column with the title in lower case
cn_full['title_lower'] = cn_full['item_title'].str.lower()
cn_full.title_lower.nunique() # only 1552 unique titles

duplicated_mask = cn_full['title_lower'].duplicated(keep=False) # keep = False marks all duplicates as True, not just the subsequent ones after the first occurrence

non_unique = cn_full[duplicated_mask]
non_unique # shows all the duplicated titles

Unnamed: 0,item_id,fk_repository_history,pubyear,pubmonth,wos_pubdate_online,item_title,scopus_item_title_non_eng,first_author,doi,pmid,source_title,scopus_source_id,book_series_title,scopus_issue_title,pages,first_page,last_page,article_number,volume,issue,wos_special_issue,source_type,item_type,prepublication_item,languages,publisher_hash,wos_orga1_count,country_count,author_count,ref_count,source_ref_count,wos_aff_complete,german,vendor_pagecount,pagecount,wos_ci,keyword,class_name,cit_3_years,cit_5_years,cit_all_years,fncr_3_years,fncr_5_years,fncr_all_years,hc_3_years,hc_5_years,hc_all_years,oa_status,oa_url,scopus_oa_licence,abstract,title_lower
32,WOS:000389557000015,52378983,2016,10.0,,Correlation between videogame mechanics and ex...,,"Mondejar, Tania",10.1016/j.jbi.2016.08.006,27507089.0,JOURNAL OF BIOMEDICAL INFORMATICS,,,,131-140,131,140,,63,,,Journal,{Article},False,{eng},91FB1203535B6B2DFCC9A73FF21E16ED,2,1,5,60,34,True,False,10,,{SCI},"{""serious games"",""pervasive health"",""health ga...","{""Computer Science, Interdisciplinary Applicat...",15,24,35,2.740283,1.939175,1.706871,"{""(\""Computer Science, Interdisciplinary Appli...","{""(\""Computer Science, Interdisciplinary Appli...","{""(\""Computer Science, Interdisciplinary Appli...",{bronze},,,"{""This paper addresses a different point of vi...",correlation between videogame mechanics and ex...
177,WOS:000184147100024,33473482,2003,,,Noise in a randomly and sparsely connected exc...,,"Vibert, JF",10.1117/12.488736,,"FLUCTUATIONS AND NOISE IN BIOLOGICAL, BIOPHYSI...",,Proceedings of SPIE,,210-223,210,223,,5110,,,Book in series,"{""Proceedings Paper""}",False,{eng},BCB3C7C10049BF423156171372C87C7F,1,1,2,46,28,False,False,14,,{ISTP},"{""stochastic model"",noise,""computational neuro...","{Biology,Neurosciences,Physiology}",0,0,0,0.000000,0.000000,0.000000,"{""(Neurosciences,0.0)"",""(Physiology,0.0)"",""(Bi...","{""(Neurosciences,0.0)"",""(Physiology,0.0)"",""(Bi...","{""(Neurosciences,0.0)"",""(Physiology,0.0)"",""(Bi...",,,,"{""The mechanisms involved in respiratory rhyth...",noise in a randomly and sparsely connected exc...
215,WOS:000376684300020,7930336,2015,,,Can Videogames Improve Executive Functioning? ...,,"Mondejar, Tania",10.1007/978-3-319-26508-7_20,,"AMBIENT INTELLIGENCE FOR HEALTH, AMIHEALTH 2015",,Lecture Notes in Computer Science,,201-212,201,212,,9456,,,Book in series,"{""Proceedings Paper""}",False,{eng},A810C1CFE74304008ED15F4DE50D638B,2,1,7,20,9,True,False,12,,{ISTP},"{""serious games"",videogames,""health games"",""ex...","{""Computer Science, Artificial Intelligence"",""...",0,0,0,0.000000,0.000000,0.000000,"{""(\""Computer Science, Artificial Intelligence...","{""(\""Computer Science, Artificial Intelligence...","{""(\""Computer Science, Artificial Intelligence...",,,,"{""Nowadays, we are living a different use and ...",can videogames improve executive functioning? ...
265,WOS:000264956700003,20687095,2009,3.0,,Cortical basis of communication: Local computa...,,"Alexandre, Frederic",10.1016/j.neunet.2009.01.006,19217253.0,NEURAL NETWORKS,,,,126-133,126,133,,22,2,SI,Journal,"{Article,""Proceedings Paper""}",False,{eng},36695547BCB15CDE880C3A93E6613FDF,1,1,1,36,23,False,False,8,,"{SCI,ISTP}","{""computational neurosciences"",communication,""...","{""Computer Science, Artificial Intelligence"",N...",2,2,2,0.357224,0.163626,0.050769,"{""(\""Computer Science, Artificial Intelligence...","{""(\""Computer Science, Artificial Intelligence...","{""(\""Computer Science, Artificial Intelligence...",{green_submitted},,,"{""Human communication emerges from cortical pr...",cortical basis of communication: local computa...
277,WOS:000339937700001,73952925,2014,6.0,,Toward a new cognitive neuroscience: modeling ...,,"Gramann, Klaus",10.3389/fnhum.2014.00444,24994978.0,FRONTIERS IN HUMAN NEUROSCIENCE,,,,,,,ARTN 444,8,,,Journal,"{""Editorial Material""}",False,{eng},AE82F3AAACEA2E17B49C961053D8021A,4,3,5,22,22,True,True,3,,{SCI},"{""mobile brain/body imaging"",eeg,fnirs,""brain ...","{Neurosciences,Psychology}",12,21,48,5.735813,5.445466,6.641124,"{""(Psychology,1.0)"",""(Neurosciences,1.0)""}","{""(Psychology,1.0)"",""(Neurosciences,1.0)""}","{""(Psychology,1.0)"",""(Neurosciences,1.0)""}","{gold,green_published}",,,,toward a new cognitive neuroscience: modeling ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1582,WOS:000166212100001,32739148,2000,10.0,,Reorganization of the human CNS - Neurophysiol...,,"Schalow, G",,11252267.0,GENERAL PHYSIOLOGY AND BIOPHYSICS,,,,11-+,11.0,+,,19.0,,,Journal,{Review},False,{eng},EC33B56496AD65B211024449B13F430D,2,2,2,199,75,False,False,225,,"{SCI,SSCI}","{""human neurophysiology"",""integrative cns func...","{""Biochemistry & Molecular Biology"",Biophysics...",0,2,7,0.000000,0.055891,0.048093,"{""(Physiology,0.0)"",""(Biophysics,0.0)"",""(Bioch...","{""(Physiology,0.0)"",""(Biophysics,0.0)"",""(Bioch...","{""(Physiology,0.0)"",""(Biophysics,0.0)"",""(Bioch...",,,,"{""The key strategies on which the discovery of...",reorganization of the human cns - neurophysiol...
1583,WOS:000894965000001,70541596,2022,11.0,,Virtual Intelligence: A Systematic Review of t...,,"Zavala Hernandez, Jesus Gerardo",10.3390/brainsci12111552,36421877.0,BRAIN SCIENCES,,,,,,,ARTN 1552,12.0,11,,Journal,{Review},False,{eng},25F9B168B49CB5A71E37CDBD049CA31D,1,1,2,61,50,True,False,16,,{SCI},"{""computational architectures"",""brain function...",{Neurosciences},0,0,0,0.000000,0.000000,0.000000,"{""(Neurosciences,0.0)""}","{""(Neurosciences,0.0)""}","{""(Neurosciences,0.0)""}","{gold,green_published}",,,"{""The functioning of the brain has been a comp...",virtual intelligence: a systematic review of t...
1584,WOS:000321898300005,6794670,2013,8.0,,"Action, Outcome, and Value: A Dual-System Fram...",,"Cushman, Fiery",10.1177/1088868313495594,23861355.0,PERSONALITY AND SOCIAL PSYCHOLOGY REVIEW,,,,273-292,273.0,292,,17.0,3,,Journal,{Review},False,{eng},B8ADD20B8EB01FC20E5B741C8EA09C6E,1,1,1,157,128,True,False,20,,{SSCI},"{""dual-system theory"",morality,emotion,reasoni...","{""Psychology, Social""}",26,75,224,3.365482,3.958603,3.676859,"{""(\""Psychology, Social\"",1.0)""}","{""(\""Psychology, Social\"",1.0)""}","{""(\""Psychology, Social\"",1.0)""}",,,,"{""Dual-system approaches to psychology explain...","action, outcome, and value: a dual-system fram..."
1585,WOS:000392212200010,1259555,2010,,,Grid-wide neuroimaging data federation in the ...,,"Michel, Franck",10.3233/978-1-60750-583-9-112,20543431.0,HEALTHGRID APPLICATIONS AND CORE TECHNOLOGIES,,Studies in Health Technology and Informatics,,112-123,112.0,123,,159.0,,,Book in series,"{""Proceedings Paper""}",False,{eng},80877105732DB11D666064BE0F5904B0,9,1,17,18,12,True,False,12,,{ISTP},"{""distributed data management"",""relational dat...","{""Health Care Sciences & Services"",""Medical In...",1,1,6,1.716707,0.737614,1.644438,"{""(Medical Informatics,0.02910798)"",""(Health C...","{""(Medical Informatics,0.0)"",""(Health Care Sci...","{""(Medical Informatics,0.0)"",""(Health Care Sci...",,,,"{""Grid technologies are appealing to deal with...",grid-wide neuroimaging data federation in the ...


In [24]:
# Count how often each lower-cased title occurs
dup_counts = cn_full.groupby('title_lower').size().reset_index(name='count')

# Filter rows where count > 1 to get duplicates
dup_overview = dup_counts[dup_counts['count'] > 1].sort_values(by='count', ascending=False)

dup_overview

      title_lower                                        count
 151  a subsequent closed-form description of propag...      2
1210  reorganization of the human cns - neurophysiol...      2
 953  modelling honeybee visual guidance in a 3-d en...      2
 976  multi-site voxel-based morphometry - not quite...      2
1037  neuronal circuit-based computer modeling as a ...      2
1042  neuronal reorganization through oscillator for...      2
1049  neuroplasticity and the brain connectome: what...      2
1057  noise in a randomly and sparsely connected exc...      2
1219  resolving the biophysics of axon transmembrane...      2
 815  integrating fmri and single-cell data of visua...      2


Check for generic titles that may belong to different publications rather than being true duplicates.

In [25]:
#cn_full[cn_full['title_lower'] == 'computational psychiatry'] # different articles!!
#cn_full[cn_full['title_lower'] == 'computational physics of the mind'] # same article
#cn_full[cn_full['title_lower'] == 'dynamical complexity in cognitive neural networks'] # same article
#cn_full[cn_full['title_lower'] == 'introduction to machine learning for brain imaging'] # same article
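
The manual look-ups above could be supported by a rough automatic pre-screen. The following is a minimal sketch, assuming that rows sharing a title but carrying different DOIs are candidates for being different publications; the helper name and the toy DOIs are made up for illustration. Rows without a DOI are not caught by this heuristic, so a manual inspection remains necessary.

```python
import pandas as pd

def titles_with_distinct_dois(df):
    """Return lower-cased titles that occur more than once and carry more
    than one distinct DOI (hypothetical helper, heuristic only)."""
    dup = df[df['title_lower'].duplicated(keep=False)]
    # nunique ignores missing DOIs (NaN) by default
    n_dois = dup.groupby('title_lower')['doi'].nunique()
    return sorted(n_dois[n_dois > 1].index)

# Toy data with invented DOIs
toy = pd.DataFrame({
    'title_lower': ['computational psychiatry', 'computational psychiatry',
                    'same paper', 'same paper', 'unique title'],
    'doi': ['10.1/a', '10.1/b', '10.2/x', '10.2/x', '10.3/z'],
})

suspects = titles_with_distinct_dois(toy)
print(suspects)
```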


Remove all duplicate titles except the rows with the title "computational psychiatry", which are different articles.

In [26]:
# Set aside the rows titled "computational psychiatry" (different articles sharing one title)
comppsy = cn_full[cn_full['title_lower'] == 'computational psychiatry']
# For all other titles, drop duplicates and keep the first occurrence
filtered_cn_full = cn_full[cn_full['title_lower'] != 'computational psychiatry'].drop_duplicates(subset='title_lower', keep='first')

# "Remerge" the filtered df and the "computational psychiatry" rows
cn_full = pd.concat([filtered_cn_full, comppsy]).sort_index()

cn_full.shape # (1553, 52)

(1553, 52)
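
The same result can be expressed with a single boolean mask, which avoids splitting and re-joining the data frame. This is a minimal sketch under the assumption that more protected titles might be added later; the helper dedupe_titles and its keep_all parameter are hypothetical.

```python
import pandas as pd

def dedupe_titles(df, keep_all=()):
    """Keep the first row per title, but keep every row whose title is
    listed in keep_all (hypothetical helper)."""
    protected = df['title_lower'].isin(keep_all)
    first_occurrence = ~df['title_lower'].duplicated(keep='first')
    return df[protected | first_occurrence]

# Toy data: 'a' is a true duplicate, 'cp' stands for a protected title
toy = pd.DataFrame({'title_lower': ['a', 'b', 'a', 'cp', 'cp']})

result = dedupe_titles(toy, keep_all=['cp'])
print(list(result.index))  # [0, 1, 3, 4]
```

Because the mask is applied to the original frame, the row order and index are preserved without a subsequent sort_index.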

In [27]:
# Save clean data to csv
cn_full.to_csv('cn_items_clean.csv', index=False)

list_item_ids = cn_full['item_id'].tolist()
with open('list_item_ids.txt', 'w') as file: # "Context manager" saves item ids to txt file
    for item in list_item_ids:
        file.write(str(item) + '\n')

In [28]:
# Convert the list to a comma-separated string of quoted values for the KB SQL query of references
wos_ids = ', '.join(f"'{id_}'" for id_ in list_item_ids)
#print(wos_ids)
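
Some database systems limit how many literals an IN list may contain. Whether this applies to the KB database is an assumption here, but if it does, the ID string can be built in chunks. The sketch below is illustrative only: the helper quoted_id_chunks and the default chunk size are hypothetical, and the example IDs are taken from the table output above.

```python
def quoted_id_chunks(ids, chunk_size=1000):
    """Yield comma-separated strings of quoted IDs, at most chunk_size IDs
    per string (hypothetical helper; the default size is arbitrary)."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        yield ', '.join(f"'{id_}'" for id_ in chunk)

# Three item_ids from the output above, split into chunks of two
example_ids = ['WOS:000389557000015', 'WOS:000184147100024', 'WOS:000376684300020']
chunks = list(quoted_id_chunks(example_ids, chunk_size=2))

for chunk in chunks:
    print(chunk)
```

Each chunk can then be placed into its own IN (...) clause, and the query results concatenated afterwards.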


After the duplicate checks, we assume that the dataset is clean. It contains 1553 publications, and the preliminary analysis can now begin.