# Grouping and Combining Web of Science and Scopus Files

This notebook groups and combines records exported from Web of Science (WOS) and Scopus. The goal is a single CSV file that consolidates the records from both sources, which will be used for modeling in the next steps.

In [1]:
# Imports
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import os

# Global list of search keywords (filled from the WOS file names below)
key_words = []

### Create a pandas DataFrame with WOS article records

In [2]:
# Directory containing the Scopus .csv exports
directory = 'data/articles/scopus'

# List to store the DataFrames
dataframes = []

# Iterate over all files in the specified folder
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory, filename)
        # Print the search keyword encoded in the file name
        print(filename.split('.')[0])

ai
artificial intelligence
automation
autonomous
big data
chatbot
computer vision
data mining
data science
fuzzy logic
genetic algorithm
Internet of Things
IoT
k-means
machine learning
natural language processing
neural network
prediction
recommendation
robot
smart technologies
soft computing
text mining


In [5]:
# Directory containing the .xls files
directory = 'data/articles/wos'

# List to store the DataFrames
dataframes = []

# Iterate over all files in the specified folder
for filename in os.listdir(directory):
    if filename.endswith('.xls'):
        file_path = os.path.join(directory, filename)
        # Read the .xls file and add it to the DataFrame list
        df = pd.read_excel(file_path)
        key_words.append(filename.split('.')[0])
        df['Source'] = directory.split('/')[-1]+'-'+filename.split('.')[0]
        dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

# Drop duplicate records that share the same "Article Title"
df_wos = combined_df.drop_duplicates(subset=['Article Title']).reset_index(drop=True)

df_wos = df_wos[['Article Title','Authors','Publication Year','DOI','Abstract','Author Keywords','Publisher','Source']]
df_wos.columns = ['Title','Authors','Year','DOI','Abstract','Keywords','Publisher','Source']
df_wos.head()

Unnamed: 0,Title,Authors,Year,DOI,Abstract,Keywords,Publisher,Source
0,AI voice bots: a services marketing research a...,"Klaus, P; Zaichkowsky, J",2020,10.1108/JSM-01-2019-0043,Purpose This paper aims to document how AI has...,Big data; Customer service; Robotics; Service ...,EMERALD GROUP PUBLISHING LTD,wos-ai
1,Artificial intelligence (AI) competencies for ...,"Mikalef, P; Islam, N; Parida, V; Singh, H; Alt...",2023,10.1016/j.jbusres.2023.113998,The deployment of Artificial Intelligence (AI)...,Artificial intelligence; B2B marketing; AI com...,ELSEVIER SCIENCE INC,wos-ai
2,Collaboration with machines in B2B marketing: ...,"Gaczek, P; Leszczynski, G; Mouakher, A",2023,10.1016/j.indmarman.2023.09.007,This paper links negative emotions to AI and e...,Decision-making; Human-AI partnership; Custome...,ELSEVIER SCIENCE INC,wos-ai
3,Machine learning and AI in marketing - Connect...,"Ma, LY; Sun, BH",2020,10.1016/j.ijresmar.2020.04.005,Artificial intelligence (AI) agents driven by ...,Artificial intelligence (AI); Machine learning...,ELSEVIER,wos-ai
4,A strategic framework for artificial intellige...,"Huang, MH; Rust, RT",2021,10.1007/s11747-020-00749-9,The authors develop a three-stage framework fo...,Artificial intelligence; Machine learning; Mec...,SPRINGER,wos-ai


In [6]:
print(f'Number of articles in Web of Science Data Base: {len(df_wos)}')

Number of articles in Web of Science Data Base: 873


### Create DataFrame Pandas with Scopus records

In [9]:
# Directory containing the Scopus .csv exports
directory = 'data/articles/scopus'

# List to store the DataFrames
dataframes = []

# Iterate over all files in the specified folder
for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        file_path = os.path.join(directory, filename)
        # Read the .csv file and add it to the DataFrame list
        df = pd.read_csv(file_path)
        df['Source'] = directory.split('/')[-1]+'-'+filename.split('.')[0]
        dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
combined_df = pd.concat(dataframes, ignore_index=True)

# Drop duplicate records that share the same "Title"
df_scopus = combined_df.drop_duplicates(subset=['Title']).reset_index(drop=True)

df_scopus = df_scopus[['Title','Authors','Year','DOI','Abstract','Author Keywords','Publisher','Source']]
df_scopus.columns = ['Title','Authors','Year','DOI','Abstract','Keywords','Publisher','Source']
df_scopus.head()

Unnamed: 0,Title,Authors,Year,DOI,Abstract,Keywords,Publisher,Source
0,Out of the fog: fog computing-enabled AI to su...,Hornik J.; Ofir C.; Rachamim M.,2024,10.1007/s11301-024-00441-0,Marketing and consumer research use a variety ...,Artificial intelligence (AI); Digital marketin...,Springer Nature,scopus-ai
1,Artificial Intelligence and Machine Learning: ...,Volkmar G.; Fischer P.M.; Reinecke S.,2022,10.1016/j.jbusres.2022.04.007,Companies neither fully exploit the potential ...,Artificial Intelligence; Decision- Making; Del...,Elsevier Inc.,scopus-ai
2,When AI meets store layout design: a review,Nguyen K.; Le M.; Martin B.; Cil I.; Fookes C.,2022,10.1007/s10462-022-10142-3,An efficient store layout presents merchandise...,Business intelligence; CCTV visual intelligenc...,Springer Nature,scopus-ai
3,Deploying artificial intelligence in services ...,Hermann E.; Williams G.Y.; Puntoni S.,2023,10.1007/s11747-023-00986-8,Despite offering substantial opportunities to ...,Artificial intelligence; Ethics; Justice; Serv...,Springer,scopus-ai
4,Studying the Relationship between Artificial I...,Sabharwal D.; Sood R.S.; Verma M.,2022,10.31620/JCCC.12.22/10,Introduction – Current study examines the rela...,Artificial intelligence; Communication technol...,Amity University,scopus-ai


In [10]:
print(f'Number of articles in Scopus Data Base: {len(df_scopus)}')

Number of articles in Scopus Data Base: 1572
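
The WOS and Scopus loading cells above follow the same pattern: read every export in a folder, tag each row with its origin, and concatenate. A minimal sketch of that pattern as one reusable function is shown below. The name `load_exports`, its parameters, and the tiny demo files are hypothetical and only serve as an illustration. Note that `Path.stem` removes only the last extension, so it matches `filename.split('.')[0]` only for file names with a single dot.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_exports(directory, suffix, reader):
    """Read every file ending in `suffix`, tag rows with their origin, and concatenate."""
    directory = Path(directory)
    frames = []
    for path in sorted(directory.glob(f'*{suffix}')):
        df = reader(path)
        # Same tag format as the cells above: "<database>-<search keyword>"
        df['Source'] = f'{directory.name}-{path.stem}'
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f'no {suffix} files found in {directory}')
    return pd.concat(frames, ignore_index=True)


# Demonstration on a throwaway folder holding two made-up exports
with tempfile.TemporaryDirectory() as tmp:
    scopus_dir = Path(tmp) / 'scopus'
    scopus_dir.mkdir()
    pd.DataFrame({'Title': ['A', 'B']}).to_csv(scopus_dir / 'ai.csv', index=False)
    pd.DataFrame({'Title': ['C']}).to_csv(scopus_dir / 'robot.csv', index=False)
    demo = load_exports(scopus_dir, '.csv', pd.read_csv)

print(demo['Source'].tolist())
```

With the real folders, the calls would presumably look like `load_exports('data/articles/wos', '.xls', pd.read_excel)` and `load_exports('data/articles/scopus', '.csv', pd.read_csv)`.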


### Group WOS and Scopus

In [11]:
df_all = pd.concat([df_scopus,df_wos])
df_all.head()

Unnamed: 0,Title,Authors,Year,DOI,Abstract,Keywords,Publisher,Source
0,Out of the fog: fog computing-enabled AI to su...,Hornik J.; Ofir C.; Rachamim M.,2024,10.1007/s11301-024-00441-0,Marketing and consumer research use a variety ...,Artificial intelligence (AI); Digital marketin...,Springer Nature,scopus-ai
1,Artificial Intelligence and Machine Learning: ...,Volkmar G.; Fischer P.M.; Reinecke S.,2022,10.1016/j.jbusres.2022.04.007,Companies neither fully exploit the potential ...,Artificial Intelligence; Decision- Making; Del...,Elsevier Inc.,scopus-ai
2,When AI meets store layout design: a review,Nguyen K.; Le M.; Martin B.; Cil I.; Fookes C.,2022,10.1007/s10462-022-10142-3,An efficient store layout presents merchandise...,Business intelligence; CCTV visual intelligenc...,Springer Nature,scopus-ai
3,Deploying artificial intelligence in services ...,Hermann E.; Williams G.Y.; Puntoni S.,2023,10.1007/s11747-023-00986-8,Despite offering substantial opportunities to ...,Artificial intelligence; Ethics; Justice; Serv...,Springer,scopus-ai
4,Studying the Relationship between Artificial I...,Sabharwal D.; Sood R.S.; Verma M.,2022,10.31620/JCCC.12.22/10,Introduction – Current study examines the rela...,Artificial intelligence; Communication technol...,Amity University,scopus-ai


In [12]:
print(f'Number of articles in Scopus and Web of Science Data Base: {len(df_all)}')

Number of articles in Scopus and Web of Science Data Base: 2445


In [13]:
# Remove duplicate titles; the first occurrence (the Scopus record) is kept
df_all = df_all.drop_duplicates(subset=['Title'])

# Remove records with a missing abstract
df_all = df_all[~df_all['Abstract'].isna()]

df_all.reset_index(inplace=True, drop=True)

print(f'Number of articles less duplicates and empty abstract: {len(df_all)}')

Number of articles less duplicates and empty abstract: 2174
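
`drop_duplicates` compares titles as exact strings, so the same article can survive twice when the two databases differ in capitalization, punctuation, or spacing. One possible refinement, not part of the notebook above, is to deduplicate on a normalized key. The sketch below assumes a hypothetical helper `normalize_title` and a three-row illustrative sample; it is not the real data.

```python
import re

import pandas as pd


def normalize_title(title):
    """Lowercase the title and collapse every non-alphanumeric run into one space."""
    return re.sub(r'[^a-z0-9]+', ' ', str(title).lower()).strip()


# Illustrative sample: rows 0 and 1 differ only in case and trailing whitespace
sample = pd.DataFrame({
    'Title': ['When AI meets store layout design: a review',
              'When AI Meets Store Layout Design: A Review ',
              'Machine learning and AI in marketing'],
    'Abstract': ['text a', 'text b', None],
    'Source': ['scopus-ai', 'wos-ai', 'wos-ai'],
})

sample['title_key'] = sample['Title'].map(normalize_title)
deduped = (
    sample.drop_duplicates(subset=['title_key'])  # keeps the first occurrence
          .dropna(subset=['Abstract'])            # drops records without an abstract
          .drop(columns='title_key')
          .reset_index(drop=True)
)
print(len(deduped))
```

Whether such aggressive normalization is appropriate depends on the corpus; distinct articles with near-identical titles would be merged, so a DOI comparison may be a safer key where DOIs are present.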


### Save all data to a .csv file

In [15]:
df_all.to_csv('data/articles/all_articles1.csv', index=False)
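
Because abstracts contain commas, quotes, and other punctuation, it can be worth confirming that the saved file reads back with the same shape. The following is a minimal sketch of such a round-trip check on made-up rows written to a temporary file; the variable names and the file name `all_articles_demo.csv` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up records whose abstracts contain commas and quotation marks
articles = pd.DataFrame({
    'Title': ['First demo title', 'Second demo title'],
    'Year': [2022, 2023],
    'Abstract': ['Uses commas, semicolons; and "quotes".', 'Plain abstract text.'],
    'Source': ['scopus-ai', 'wos-ai'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'all_articles_demo.csv'
    articles.to_csv(out_path, index=False)  # same call style as the cell above
    restored = pd.read_csv(out_path)

# The file should come back with identical columns and row count
roundtrip_ok = (
    list(restored.columns) == list(articles.columns)
    and len(restored) == len(articles)
)
print(roundtrip_ok)
```

The same comparison could be run against `df_all` and `data/articles/all_articles1.csv` after the save step.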