# Reference List Analysis

This notebook analyzes the article references stored in the CSV files in the `data/references` folder. We load the files, combine them into a single DataFrame, and produce summary statistics such as:

- Total number of articles
- Distribution by publication year
- List of unique journals
- Additional descriptive statistics

This analysis will help us quickly get an overview of the articles we plan to download.

In [21]:
import os
import glob
import pandas as pd

# Build the path relative to this notebook's location:
# go up one directory from notebooks/ to reach the project root
references_dir = os.path.join('..', 'data', 'references')

# Print absolute path to help debug
print(f"Looking for files in: {os.path.abspath(references_dir)}")

# Check if directory exists
if not os.path.exists(references_dir):
    print(f"Directory does not exist: {references_dir}")
    # Create if needed
    # os.makedirs(references_dir)

Looking for files in: /workspaces/tsi-sota-ai/data/references


In [22]:
import os
import pandas as pd

# Use absolute path or proper relative path
# Option 1: Absolute path
references_dir = '/workspaces/tsi-sota-ai/data/references'

# Option 2: Relative path (going up one directory from notebooks)
# references_dir = os.path.join('..', 'data', 'references')

print(f"Looking for files in: {references_dir}")

# Dictionary with exact filenames and their corresponding DataFrame names
file_mapping = {
    '1.2.2.1 LR - The Specialist Shortage and its Impact.csv': '1_specialists_df',
    '1.2.2.2 LR - AI Applications in SCM Decision Support.csv': '2_aiscm_df',
    '1.2.2.3 LR - Human-AI Collaboration in SCM.csv': '3_humanai_df',
    '1.2.2.4 LR - Challenges and Limitations of LLMs in SCM.csv': '4_challenges_df',
    '1.2.2.5 LR - Decision-Making Processes.csv': '5_decision_df',
    '1.2.2.6 LR - Agents.csv': '6_agents_df'
}

# Initialize dictionary to store DataFrames
dataframes = {}

# Read each CSV file
for filename, df_name in file_mapping.items():
    file_path = os.path.join(references_dir, filename)
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        dataframes[df_name] = df
        print(f"Loaded {filename} into {df_name} with shape {df.shape}")
    else:
        print(f"File not found: {file_path}")

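# The names in file_mapping start with a digit, so they are not valid Python
# identifiers and cannot be used as variables. Access each DataFrame through
# the dictionary instead, e.g. dataframes['1_specialists_df'].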

# Print basic info about each DataFrame
for name, df in dataframes.items():
    print(f"\n{name} info:")
    df.info()  # info() prints its report directly and returns None

Looking for files in: /workspaces/tsi-sota-ai/data/references
Loaded 1.2.2.1 LR - The Specialist Shortage and its Impact.csv into 1_specialists_df with shape (94, 12)
Loaded 1.2.2.2 LR - AI Applications in SCM Decision Support.csv into 2_aiscm_df with shape (69, 12)
Loaded 1.2.2.3 LR - Human-AI Collaboration in SCM.csv into 3_humanai_df with shape (97, 12)
Loaded 1.2.2.4 LR - Challenges and Limitations of LLMs in SCM.csv into 4_challenges_df with shape (54, 12)
Loaded 1.2.2.5 LR - Decision-Making Processes.csv into 5_decision_df with shape (110, 12)
Loaded 1.2.2.6 LR - Agents.csv into 6_agents_df with shape (169, 12)

1_specialists_df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94 entries, 0 to 93
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           94 non-null     object 
 1   title          94 non-null     object 
 2   doi            94 non-null     object 
 3   authors        94 non-n

## Combine DataFrames

Combine all individual DataFrames into one for analysis.

In [23]:
# Combine all DataFrames into one
references_df = pd.concat(dataframes.values(), ignore_index=True)
print(f"Combined references data shape: {references_df.shape}")

Combined references data shape: (593, 12)
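
Once the frames are concatenated, the combined DataFrame no longer records which CSV each row came from. The following is a minimal sketch of one way to keep that information, assuming a `source` column is wanted. The column name and the small example frames are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the loaded CSVs; in the notebook the real
# `dataframes` dictionary would be used instead.
example_frames = {
    "1_specialists_df": pd.DataFrame({"doi": ["10.1/a", "10.1/b"], "year": [2021, 2022]}),
    "6_agents_df": pd.DataFrame({"doi": ["10.1/a"], "year": [2021]}),
}

# Passing a dict to pd.concat uses its keys as an outer index level.
# reset_index(level="source") turns that level into an ordinary column,
# and the second reset_index discards the per-file row numbers.
combined = (
    pd.concat(example_frames, names=["source", "row"])
    .reset_index(level="source")
    .reset_index(drop=True)
)

print(combined)
print(combined["source"].value_counts())
```

With a `source` column in place, per-topic counts can be obtained with `value_counts()` or `groupby("source")`.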


## Data Overview

Below are the first few rows of the combined references DataFrame to inspect its structure.

In [24]:
references_df.head()

| | date | title | doi | authors | journal | short_journal | volume | year | publisher | issue | page | abstract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-09-13 | Transformative Procurement Trends: Integrating... | 10.3390/logistics7030063 | [{'author_name': 'Areej Althabatah', 'author_s... | Logistics | Logistics | 7.0 | 2023 | MDPI AG | 3.0 | 63 | Background: the advent of Industry 4.0 (I4.0) ... |
| 1 | 2021-10-07 | Exploring Progress with Supply Chain Risk Mana... | 10.3390/logistics5040070 | [{'author_name': 'Remko van Hoek', 'author_slu... | Logistics | Logistics | 5.0 | 2021 | MDPI AG | 4.0 | 70 | Background: In response to calls for actionabl... |
| 2 | 2023-12-01 | Exploring Applications and Practical Examples ... | 10.3390/logistics7040091 | [{'author_name': 'João Reis', 'author_slug': '... | Logistics | Logistics | 7.0 | 2023 | MDPI AG | 4.0 | 91 | Background: Material Requirements Planning (MR... |
| 3 | 2021-09-27 | Sustainable Innovations in the Food Industry t... | 10.3390/logistics5040066 | [{'author_name': 'Saurabh Sharma', 'author_slu... | Logistics | Logistics | 5.0 | 2021 | MDPI AG | 4.0 | 66 | The agri-food sector is an endless source of e... |
| 4 | 2021-04-01 | Artificial Intelligence (AI): Multidisciplinar... | 10.1016/j.ijinfomgt.2019.08.002 | [{'author_name': 'Yogesh K. Dwivedi', 'author_... | International Journal of Information Management | International Journal of Information Management | 57.0 | 2021 | Elsevier BV | | 101994 | As far back as the industrial revolution, sign... |
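
The `authors` column appears to hold a string representation of a list of dictionaries rather than a real list. The following is a hedged sketch of how such values could be parsed, assuming they are valid Python literals. Only the `author_name` key is visible in the output above; the helper `parse_author_names` and the sample rows are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical sample mirroring the visible shape of the `authors` column.
sample = pd.DataFrame({
    "authors": [
        "[{'author_name': 'Ada Example'}, {'author_name': 'Bob Sample'}]",
        None,
    ]
})

def parse_author_names(raw):
    """Return the list of author names, or an empty list if parsing fails."""
    if not isinstance(raw, str):
        return []
    try:
        entries = ast.literal_eval(raw)  # safe evaluation of Python literals only
    except (ValueError, SyntaxError):
        return []
    if not isinstance(entries, list):
        return []
    return [e["author_name"] for e in entries
            if isinstance(e, dict) and "author_name" in e]

sample["author_names"] = sample["authors"].apply(parse_author_names)
sample["n_authors"] = sample["author_names"].apply(len)
print(sample[["author_names", "n_authors"]])
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals and cannot execute arbitrary code from the CSV.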


## Summary Statistics

In [25]:
# Total number of articles
total_articles = references_df.shape[0]
print(f"Total number of articles: {total_articles}")

# Distribution by publication year (assuming 'year' column exists)
if 'year' in references_df.columns:
    year_distribution = references_df['year'].value_counts().sort_index()
    print("\nPublication Year Distribution:")
    print(year_distribution)
else:
    print("The column 'year' is not found in the data.")

# List of unique journals (assuming 'journal' column exists)
if 'journal' in references_df.columns:
    unique_journals = references_df['journal'].unique()
    print(f"\nUnique journals ({len(unique_journals)}):")
    print(unique_journals)
else:
    print("The column 'journal' is not found in the data.")

# Publisher distribution if 'publisher' column exists
if 'publisher' in references_df.columns:
    publisher_distribution = references_df['publisher'].value_counts()
    print("\nPublisher Distribution (top 10):")
    print(publisher_distribution.head(10))
else:
    print("The column 'publisher' is not found in the data.")

Total number of articles: 593

Publication Year Distribution:
year
2008      1
2010      3
2011      2
2012      1
2013      4
2014      5
2015      2
2016      3
2017      6
2018     24
2019     22
2020     58
2021    113
2022    121
2023    182
2024     46
Name: count, dtype: int64

Unique journals (23):
['Logistics' 'International Journal of Information Management'
 'Transport and Telecommunication Journal'
 'International Journal of Information Systems and Project Management'
 'Applied System Innovation' 'Sustainable Operations and Computers'
 'Smart Cities' 'Management Science' 'Big Data and Cognitive Computing'
 'Iet Collaborative Intelligent Manufacturing'
 'Frontiers in Artificial Intelligence' 'Science'
 'Frontiers in Robotics and Ai' 'Journal of Big Data'
 'Machine Learning and Knowledge Extraction'
 'Journal of Artificial Intelligence Research'
 'Nature Machine Intelligence'
 'Transportation Research Interdisciplinary Perspectives'
 'Transactions of the Association for Compu

## DOI Analysis and Missing Data

In [26]:
# Check for missing DOIs
missing_dois = references_df['doi'].isna().sum()
print(f"Number of entries with missing DOIs: {missing_dois}")

# Show the most frequent DOIs; a count above 1 means the same article
# appears more than once in the combined data
if missing_dois != len(references_df):
    print("\nMost frequent DOIs:")
    print(references_df['doi'].value_counts().head())

Number of entries with missing DOIs: 0

Most frequent DOIs:
doi
10.3390/logistics7030063      15
10.3390/logistics6030048      11
10.1186/s40537-020-00329-2     9
10.3390/logistics7010001       7
10.3389/frai.2023.1264372      7
Name: count, dtype: int64
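
The counts above show that some DOIs occur many times, so the total of 593 rows includes repeated articles. The following is a minimal sketch of deduplicating by DOI, assuming the first occurrence of each DOI should be kept. The `doi_norm` column and the sample data are illustrative; in the notebook the same steps would be applied to `references_df`.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `references_df`.
sample = pd.DataFrame({
    "doi": ["10.1/A", "10.1/a ", "10.1/b", "10.1/c", "10.1/c"],
    "title": ["Paper A", "Paper A", "Paper B", "Paper C", "Paper C"],
})

# DOIs are case-insensitive, so normalise case and whitespace before comparing.
sample["doi_norm"] = sample["doi"].str.strip().str.lower()

# duplicated() marks every occurrence after the first as True.
n_duplicates = int(sample.duplicated(subset="doi_norm").sum())
unique_refs = (
    sample.drop_duplicates(subset="doi_norm", keep="first")
    .reset_index(drop=True)
)

print(f"Rows: {len(sample)}, duplicate rows: {n_duplicates}, "
      f"unique DOIs: {len(unique_refs)}")
```

Whether to deduplicate depends on the goal: per-topic counts need the repeats, while a download list needs each DOI only once.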


## Abstract Analysis

Analyzing abstracts can help us understand the content distribution and identify potential data quality issues.

In [27]:
if 'abstract' in references_df.columns:
    # Calculate abstract lengths
    references_df['abstract_length'] = references_df['abstract'].apply(lambda x: len(str(x)) if pd.notnull(x) else 0)
    
    # Basic statistics
    print("Abstract Statistics:")
    print(f"Mean length: {references_df['abstract_length'].mean():.2f} characters")
    print(f"Median length: {references_df['abstract_length'].median():.0f} characters")
    print(f"Shortest abstract: {references_df['abstract_length'].min()} characters")
    print(f"Longest abstract: {references_df['abstract_length'].max()} characters")
    
    # Check for missing abstracts
    missing_abstracts = references_df['abstract'].isna().sum()
    print(f"\nNumber of entries with missing abstracts: {missing_abstracts}")
else:
    print("The column 'abstract' is not found in the data.")

Abstract Statistics:
Mean length: 1437.88 characters
Median length: 1396 characters
Shortest abstract: 0 characters
Longest abstract: 3071 characters

Number of entries with missing abstracts: 6
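
Because missing abstracts are counted as length 0, they pull the mean down and account for the reported minimum of 0 characters. The following sketch shows one way to compute the same statistics over only the abstracts that are present. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be `references_df`.
sample = pd.DataFrame(
    {"abstract": ["short text", None, "a somewhat longer abstract"]}
)

# .str.len() leaves missing abstracts as NaN instead of turning them into 0.
lengths = sample["abstract"].str.len()
present = lengths.dropna()

print(f"Abstracts present: {len(present)} of {len(sample)}")
print(f"Mean length (present only): {present.mean():.2f} characters")
print(f"Shortest (present only): {int(present.min())} characters")
print(f"Longest (present only): {int(present.max())} characters")
```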


## Save Processed Data

Save the processed DataFrame for future use.

In [None]:
import json
import os

# Create proper paths relative to project root
# Option 1: Using absolute path
data_dir = '/workspaces/tsi-sota-ai/data'

# Option 2: Using relative path
# data_dir = os.path.join('..', 'data')  # Go up one level from notebooks/

# Ensure directory exists
os.makedirs(data_dir, exist_ok=True)
print(f"Using data directory: {data_dir}")

# Save to JSON using the correct path
output_json = os.path.join(data_dir, 'references_analysis.json')
references_df.to_json(output_json, orient='records', indent=2)
print(f"Saved processed data to: {output_json}")

# Save basic statistics to a separate file
stats_dict = {
    'total_articles': total_articles,
    'unique_journals': len(references_df['journal'].unique()) if 'journal' in references_df.columns else 0,
    'year_range': f"{references_df['year'].min()}-{references_df['year'].max()}" if 'year' in references_df.columns else 'N/A',
    # .sum() returns NumPy integers, which json.dump cannot serialize; cast to int
    'missing_dois': int(missing_dois) if 'doi' in references_df.columns else 'N/A',
    'missing_abstracts': int(missing_abstracts) if 'abstract' in references_df.columns else 'N/A'
}

stats_json = os.path.join(data_dir, 'references_stats.json')
with open(stats_json, 'w') as f:
    json.dump(stats_dict, f, indent=2)
print(f"Saved statistics to: {stats_json}")

OSError: Cannot save file into a non-existent directory: 'data'
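
The `OSError` above is what pandas raises when the target directory does not exist. This typically happens when a relative path such as `'data'` is resolved against the notebook's working directory rather than the project root. The following is a minimal sketch of locating the project root independently of the working directory, assuming the root is the nearest ancestor that contains a `data` directory. The helper `find_project_root` is hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")

# Demonstration on a temporary layout that mimics <root>/data and <root>/notebooks.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data").mkdir()
    (root / "notebooks").mkdir()
    found = find_project_root(root / "notebooks")
    print(found == root)
```

In the notebook, `data_dir = find_project_root(Path.cwd()) / "data"` could then replace the hard-coded absolute path, under the same assumption about the repository layout.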