Identify papers that are not assigned to any domain, and remove from the final dataset those that are not relevant to any domain.

1. Create file with papers that aren't in any domain.

In [1]:
import pandas as pd

# Step 1: Load the cleaned merged dataset
cleaned_merged_dataset_path = "../output/cleaned_merged_dataset.csv"
cleaned_merged_df = pd.read_csv(cleaned_merged_dataset_path)

# Step 2: Identify papers not included (NaN or 0) in any domain column
# Fill NaN with 0, then check that every value is 0 across each row from the 10th column (index 9) onward.
not_included_mask = (cleaned_merged_df.iloc[:, 9:].fillna(0) == 0).all(axis=1)
not_included_df = cleaned_merged_df[not_included_mask].copy()

# Step 3: Save the dataset with papers that are not included in any of the reviews
not_included_output_path = "../output/not_included_papers.csv"
not_included_df.to_csv(not_included_output_path, index=False)

print(f"Saved {len(not_included_df)} 'not included' papers to {not_included_output_path}")


Saved 186 'not included' papers to ../output/not_included_papers.csv


These papers were then screened in ASReview. The resulting dataset is "relevant_unassigned_papers.csv".

2. Create a dataset that contains only assigned papers.

In [2]:
# Identify papers included in at least one domain

# Step 1: load the cleaned merged dataset
cleaned_merged_dataset_path = "../output/cleaned_merged_dataset.csv"
cleaned_merged_df = pd.read_csv(cleaned_merged_dataset_path)

# Step 2: Identify papers included in at least one domain
included_mask = (cleaned_merged_df.iloc[:, 9:].fillna(0) != 0).any(axis=1)
included_df = cleaned_merged_df[included_mask].copy()

# Step 3: save the dataset with papers that are included in at least one of the reviews
included_output_path = "../output/included_papers.csv"
included_df.to_csv(included_output_path, index=False)

print(f"Saved {len(included_df)} 'included' papers to {included_output_path}") 

Saved 865 'included' papers to ../output/included_papers.csv
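
The two masks used above are logical complements, so every paper should land in exactly one of the two files (with the counts printed above, 186 + 865 = 1051 rows in total). A minimal sketch of that sanity check follows, run on a hypothetical toy frame; the `meta_*` and `domain_*` column names are assumptions for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame: 9 metadata columns followed by two domain columns.
meta = {f"meta_{i}": ["a", "b", "c", "d"] for i in range(9)}
toy_df = pd.DataFrame({**meta,
                       "domain_a": [1, 0, None, 0],
                       "domain_b": [0, 0, None, 1]})

# Same positional selection as in the cells above: 10th column onward.
domain_flags = toy_df.iloc[:, 9:].fillna(0)
not_included_mask = (domain_flags == 0).all(axis=1)
included_mask = (domain_flags != 0).any(axis=1)

# Every row must fall into exactly one of the two groups.
assert (included_mask ^ not_included_mask).all()
assert included_mask.sum() + not_included_mask.sum() == len(toy_df)
```

Running the same two assertions on `cleaned_merged_df` would confirm that no paper is lost or counted twice between the two output files.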


3. Merge relevant_unassigned_papers and included_papers

In [3]:
# Merge ../data/relevant_unassigned_papers.csv and ../output/included_papers.csv
# (included_df is reused from the previous cell)

# Step 1: Load the relevant unassigned papers
relevant_unassigned_papers_path = "../data/relevant_unassigned_papers.csv"
relevant_unassigned_df = pd.read_csv(relevant_unassigned_papers_path)

# Step 2: Merge the relevant unassigned papers with the included papers
merged_df = pd.concat([included_df, relevant_unassigned_df], ignore_index=True)

# Step 3: Check for duplicates using the 'title' column (exact, case-sensitive match)
duplicates_mask = merged_df.duplicated(subset="title", keep=False)
duplicates_df = merged_df[duplicates_mask].copy()
# Print titles that are duplicated
print("Duplicated titles:")
print(duplicates_df["title"].value_counts())

# Step 4: Save the merged final dataset
merged_output_path = "../output/final_dataset.csv"
merged_df.to_csv(merged_output_path, index=False)


Duplicated titles:
Series([], Name: count, dtype: int64)
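
Note that `duplicated(subset="title")` compares titles exactly, so two titles that differ only in capitalization or surrounding whitespace are not flagged. If a case-insensitive check is wanted, one possible approach is to compare a normalized copy of the title. The sketch below uses hypothetical toy titles, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy titles; the second differs only by case and trailing space.
toy_df = pd.DataFrame(
    {"title": ["Deep Learning", "deep learning ", "Graph Theory", None]}
)

# Normalize a copy of the title; the original column stays untouched.
normalized_title = toy_df["title"].str.strip().str.lower()

# Guard against missing titles, which duplicated() would treat as equal to each other.
duplicates_mask = normalized_title.duplicated(keep=False) & normalized_title.notna()
duplicates_df = toy_df[duplicates_mask]

print(duplicates_df["title"].tolist())
```

Applied to `merged_df`, this could surface near-duplicates that the exact comparison misses; whether any exist in the real data is not known from the output above.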


4. Check for missing DOIs.

In [6]:
# Check final_dataset.csv for missing values in doi column
final_dataset_path = "../output/final_dataset.csv"
final_df = pd.read_csv(final_dataset_path)

# Check for missing values in the 'doi' column
missing_doi_mask = final_df["doi"].isnull()
missing_doi_df = final_df[missing_doi_mask].copy()
print(f"Missing DOIs: {len(missing_doi_df)}")

# Save the dataset with missing DOIs
missing_doi_output_path = "../output/missing_doi.csv"
missing_doi_df.to_csv(missing_doi_output_path, index=False)



Missing DOIs: 41
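
The check above counts only values that pandas reads as NaN. `pd.read_csv` already turns empty fields into NaN by default, but a DOI cell containing only whitespace would still count as present. A stricter variant, sketched here on hypothetical placeholder values, treats blank strings as missing too.

```python
import pandas as pd

# Hypothetical placeholder values: one DOI-like string, one missing, two blank.
toy_df = pd.DataFrame({"doi": ["10.1000/example", None, "", "   "]})

# Treat NaN, empty strings, and whitespace-only strings as missing.
doi_as_text = toy_df["doi"].fillna("").astype(str).str.strip()
missing_doi_mask = doi_as_text == ""

print(f"Missing DOIs: {missing_doi_mask.sum()}")
```

On the real `final_df` this may or may not change the count of 41; it only matters if some DOI cells contain blanks rather than being empty.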
