# <b>(Phase 2) Screening the ERIC API Search Results with Python for a Systematic Review of P-20 Tracking</b>
---
This Jupyter Notebook imports the articles from the ERIC API Search (Searches 1-4) and conducts the first round of screening. It uses a by-hand journal screening file that is available in the GitHub project.

<ul>
    <li> <b>Language Screening (Non-English)</b></li>
    <li> <b>Geography Code Screening (Non-US contexts)</b></li> 
    <li> <b>Publication Type</b></li>
    <li> <b>Journal Screen (By Hand)</b></li>   
    <li> <b>Less Relevant Subjects Screen</b></li>
    <li> <b>Publication Type Screen (#2)</b></li>
</ul>


<b>Project:</b> Systematic Literature Review of Tracking in P-20 Education<br>

---
### Import the CSV from the ERIC API Search

In [14]:
import pandas as pd
from IPython.display import display

# Load data from specified file path
file_path = r"C:\Users\sbaser\OneDrive - University of Georgia\Shared Folders\P-20 Parallels and Perils\Data Collection\Phase 2 - Literature Review Search\ERIC API Resources\ERIC Searches\Phase 2 Searches\Tracking_API_Output_Phase_2b_Search 1.csv"
master_records = pd.read_csv(file_path)

# Display total records loaded
print("Initial Data Load")
print(f"Total records loaded: {master_records.shape[0]}")

# Set display options for scrollable view if the DataFrame is large
pd.options.display.max_rows = 10  # Adjust as needed for number of rows to display at once
pd.options.display.max_columns = 10  # Adjust as needed for number of columns to display at once

# Display the DataFrame
display(master_records)

Initial Data Load
Total records loaded: 10272


Unnamed: 0,id,source_search,title,author,publicationdateyear,...,institution,iesfunded,ieswwcreviewed,ieslinkwwcreviewguide,isbn
0,EJ061632,Search 1,"Use of a ""Balance-Sheet"" Procedure to Improve ...","Mann, Leon",1972,...,,,,,
1,EJ203027,Search 1,A Procedure to Estimate the Probability of Err...,"Ackerson, Gary E., And Others",1978,...,,,,,
2,EJ201539,Search 1,Entwicklung eines Einstufungstests fuer Deutsc...,"Kummer, Manfred, And Others",1978,...,,,,,
3,EJ196127,Search 1,The Principal and Special Education Placement.,"Yoshida, Roland K., And Others",1978,...,,,,,
4,EJ199000,Search 1,Curriculum Tracking and Educational Stratifica...,"Alexander, Karl L., And Others",1978,...,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
10267,EJ1431054,Search 2,Announcing New NHSDA Programming Connecting Se...,"Susan McGreevy-Nichols, Marissa Finkelstein",2024,...,,,,,
10268,EJ1426660,Search 3,Student Perceptions of Glover/Curwen Hand Sign...,Whitney Mayo,2024,...,,,,,
10269,EJ1418576,Search 3,Development of Comprehension Monitoring Skill ...,"Kunyu Xu, Yu-Min Ku, Chenlu Ma, Chien-Hui Lin,...",2024,...,,,,,
10270,EJ1407862,Search 2,Is &quot;Option B&quot; a Viable Plan B? Schoo...,"Amy E. Stich, George Spencer, Brionna Johnson,...",2024,...,,,,,


---
## <ins>Screening the Results for Relevant P-20 Tracking Articles</ins>

### Screening 1: Screen based on language (Non-English) & geography code (Non-US contexts)

This section of the Jupyter notebook performs the initial screening of articles based on two criteria: language and geography code. The purpose of this screening is to filter out articles that are not relevant to the scope of the study.

<b>a. Language Screening (Non-English):</b>
<ul>
    <li>Articles are first screened based on their language metadata. Only articles that are in English or have missing language information are retained. Non-English articles are identified and removed during this step.</li>
    <li>The count of articles for each language is displayed both before and after the screening to ensure transparency and accuracy in the filtering process.</li>
</ul>

<b>b. Geography Code Screening (Non-US contexts):</b>
<ul>
    <li>Following the language screening, articles are further filtered based on their geography code. The geography code screening is aimed at identifying and removing articles that pertain to non-US contexts, as determined through a preliminary analysis in Excel.</li>
    <li>The list of non-US geographical identifiers, along with the associated article counts, is provided to offer insight into the scope of this screening. The total number of articles before and after this screening is also displayed.</li>
</ul>
<p>This dual screening process is designed to refine the dataset by focusing on articles that are both in English and relevant to the US context.</p>
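As a minimal sketch of the two filters described above, the toy example below applies them to a hypothetical four-row DataFrame. The column names `language` and `identifiersgeo` match the ERIC export used in this notebook; the sample rows and the shortened `non_us` list are invented for illustration only.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the ERIC export.
toy = pd.DataFrame({
    "id": ["A1", "A2", "A3", "A4"],
    "language": ["English", None, "German", "English"],
    "identifiersgeo": [None, "Georgia", None, "Canada"],
})

# Screen 1a: keep records that are in English or have no language recorded.
keep_language = toy["language"].isna() | toy["language"].eq("English")

# Screen 1b: drop exact matches against a (shortened, illustrative) non-US list.
non_us = ["Canada", "Germany"]
keep_geo = ~toy["identifiersgeo"].isin(non_us)

screened = toy[keep_language & keep_geo]
print(screened["id"].tolist())
```

Because `isin` compares the whole field value, a record is removed only when its geography field equals one of the listed identifiers exactly.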

In [7]:
######################################################
# Screening 1a: Language (Non-English)
######################################################

# Replace NaN with a placeholder ("Missing") for easier filtering to ensure that NaNs are handled correctly.
master_records['language'] = master_records['language'].fillna('Missing')

# List of languages to remove (in this case, any language that is not 'English' or 'Missing')
languages_to_remove = master_records['language'].unique()
languages_to_remove = [lang for lang in languages_to_remove if lang not in ['Missing', 'English']]

# Print the languages that will be removed along with the count of associated articles
print("Languages to be removed that are not English or Missing:")
for language in languages_to_remove:
    count = master_records[master_records['language'] == language].shape[0]
    print(f"- {language}: {count:,}")

# Total articles before removal
total_before = master_records.shape[0]
print(f"\nTotal articles (Before Screen 1a): {total_before:,}")

# Count the records that are in the list of languages to be removed
num_records_to_remove = master_records['language'].isin(languages_to_remove).sum()
print(f"Total articles that are in the list of languages to be removed: {num_records_to_remove:,}")

# Remove the records that are not in English or Missing
filtered_records = master_records[(master_records['language'] == 'Missing') | (master_records['language'] == 'English')]

# Total articles after removal
total_after = filtered_records.shape[0]
print(f"Total articles (After Screen 1a): {total_after:,}")

# Apply the filtering and update the master_records DataFrame
master_records = filtered_records

Languages to be removed that are not English or Missing:
- German: 1
- French: 1

Total articles (Before Screen 1a): 10,272
Total articles that are in the list of languages to be removed: 2
Total articles (After Screen 1a): 10,270


In [8]:
######################################################
# Screening 1b: Geography code (Non-US contexts)
######################################################

# List of geographical identifiers to remove
identifiers_to_remove = [
    'Australia', 'Canada', 'Canada (Ottawa)', 'Canada (Winnipeg)', 'Germany',
    'Guam', 'Guatemala', 'Guyana', 'Iran', 'Israel', 'Italy', 'Japan',
    'South Africa', 'South Africa (Cape Town)', 'Spain (Barcelona)', 'Turkey',
    'United Kingdom', 'United Kingdom (Aberdeen)', 'United Kingdom (Birmingham)',
    'United Kingdom (England)', 'United Kingdom (Great Britain)', 'United Kingdom (Reading)',
    'USSR'
]

# Function to count and remove records based on exact matching
def remove_geo_records(df, identifiers):
    # Print the identifiers that will be removed along with the count of associated articles
    print("Identifiers to be removed with associated article counts:")
    for identifier in identifiers:
        count = df[df['identifiersgeo'] == identifier].shape[0]
        print(f"- {identifier}: {count:,} articles")
        
    # Total articles before removal
    total_before = df.shape[0]
    print(f"\nTotal articles (Before Screen 1b): {total_before:,}")

    # Count the records that are in the list of identifiers (to be removed)
    num_records_to_remove = df['identifiersgeo'].isin(identifiers).sum()
    print(f"Total articles that are in the list of identifiers to be removed: {num_records_to_remove:,}")

    # Remove the records
    filtered_df = df[~df['identifiersgeo'].isin(identifiers)]

    # Total articles after removal
    total_after = filtered_df.shape[0]
    print(f"Total articles (After Screen 1b): {total_after:,}")

    return filtered_df

# Apply the removal function to master_records
master_records = remove_geo_records(master_records, identifiers_to_remove)

Identifiers to be removed with associated article counts:
- Australia: 7 articles
- Canada: 6 articles
- Canada (Ottawa): 1 articles
- Canada (Winnipeg): 1 articles
- Germany: 1 articles
- Guam: 1 articles
- Guatemala: 1 articles
- Guyana: 1 articles
- Iran: 1 articles
- Israel: 2 articles
- Italy: 1 articles
- Japan: 1 articles
- South Africa: 3 articles
- South Africa (Cape Town): 1 articles
- Spain (Barcelona): 1 articles
- Turkey: 2 articles
- United Kingdom: 4 articles
- United Kingdom (Aberdeen): 1 articles
- United Kingdom (Birmingham): 1 articles
- United Kingdom (England): 5 articles
- United Kingdom (Great Britain): 3 articles
- United Kingdom (Reading): 1 articles
- USSR: 1 articles

Total articles (Before Screen 1b): 10,270
Total articles that are in the list of identifiers to be removed: 47
Total articles (After Screen 1b): 10,223


### Screening 2: Screen based on Article Type

This section of the Jupyter notebook performs a screening of articles based on their publication type. 

To narrow our search to empirical articles, we asked ERIC how it distinguishes between types of publications (e.g., opinion, research, evaluation). ERIC staff responded that there is no metadata field identifying the research design or type of study, but that ERIC does sort publications into its own categories under “Reports”: “General,” “Descriptive,” “Evaluative,” “Research,” and “Research-practitioner Partnerships.” ERIC experts could not guarantee that no empirical research appears in categories outside of “Research,” but stated that it is a “reasonable assumption that Reports-Research contains the most appropriate content.” Thus, we excluded articles that fell outside of “Reports-Research.”

After analyzing the results and communicating with staff at ERIC, we have chosen to limit our dataset to articles that meet the following criteria for publicationtype:
<ul>
    <li>Articles that are classified solely as "Journal Articles" (n=30).</li>
    <li>Articles that are classified as both "Journal Articles" and "Reports - Research."</li>
</ul>

<b>Article Type: Screening Process</b>
<ul>
    <li>Articles are first screened based on their publication type metadata. Only articles that either are solely categorized as "Journal Articles" or that include both "Journal Articles" and "Reports - Research" are retained. Any article that combines "Journal Articles" with other types, but does not include "Reports - Research," is filtered out during this step.</li>
    <li>The count of articles for each publication type is displayed before the screening begins, with the two retained types ("Journal Articles" and "Reports - Research") marked with an asterisk (*).</li>
    <li>After the screening, the total number of articles that have been filtered out and the total number of articles remaining are displayed.</li>
</ul>
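The retention rule above can be sketched on a few hypothetical `publicationtype` strings. This is an illustrative variant rather than the notebook's own filter: it splits each value into individual types instead of testing substrings, and it assumes that records with no publication type should be dropped. The helper name `keep_publication_type` and the sample values are made up for this example.

```python
import pandas as pd

def keep_publication_type(value):
    """True for 'Journal Articles' alone, or together with 'Reports - Research'."""
    if not isinstance(value, str):
        return False  # assumption: records without a publication type are dropped
    types = {t.strip() for t in value.split(",")}
    has_both = {"Journal Articles", "Reports - Research"} <= types
    return types == {"Journal Articles"} or has_both

# Hypothetical publicationtype values, comma-separated as in the ERIC export.
toy_types = pd.Series([
    "Journal Articles",
    "Journal Articles, Reports - Research",
    "Journal Articles, Opinion Papers",
    "Journal Articles, Reports - Research, Tests/Questionnaires",
    None,
])

kept = toy_types[toy_types.apply(keep_publication_type)]
print(kept.tolist())
```

Note that a record tagged with a third type is still retained as long as both "Journal Articles" and "Reports - Research" are present, which is why a second publication type screen is applied later.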

In [9]:
######################################################
# Screening 2: Publication Type (Must be "Journal Articles" alone or "Journal Articles" with "Reports - Research")
######################################################

# Define the publication types required
required_types = ["Journal Articles", "Reports - Research"]

# Function to filter based on publication types
def filter_publication_type(df, required_types):
    # Print the publication types and associated article counts
    print("Publication types and associated article counts:")
    
    # Split the 'publicationtype' field on commas, then explode the list into separate rows
    # This allows counting each publication type individually, even when multiple types are listed in a single row
    publication_counts = df['publicationtype'].str.split(', ').explode().value_counts()

    # Iterate over each unique publication type and its associated count
    for p_type, count in publication_counts.items():
        # Check if the publication type is one of the required types to keep
        marker = "*" if p_type in required_types else ""
        
        # Print the publication type, the count of articles, and add a "*" marker if it is a required type
        print(f"{marker}{p_type}: {count:,} articles")

    # Total articles before removal
    total_before = df.shape[0]
    print(f"\nTotal articles (Before Screening 2): {total_before:,}")

    # Filter the records:
    # - Keep records where "Journal Articles" is the only type
    # - OR keep records where "Journal Articles" AND "Reports - Research" are present
    filtered_df = df[df['publicationtype'].apply(lambda x: x == "Journal Articles" or 
                                                 ("Journal Articles" in x and "Reports - Research" in x))]

    # Total articles filtered out
    total_filtered_out = total_before - filtered_df.shape[0]
    print(f"Total articles filtered out: {total_filtered_out:,}")

    # Total articles after removal
    total_after = filtered_df.shape[0]
    print(f"Total articles (After Screening 2): {total_after:,}")

    return filtered_df

# Apply the filter function to master_records
master_records = filter_publication_type(master_records, required_types)

Publication types and associated article counts:
*Journal Articles: 10,223 articles
*Reports - Research: 6,113 articles
Reports - Descriptive: 2,026 articles
Reports - Evaluative: 1,335 articles
Opinion Papers: 533 articles
Information Analyses: 418 articles
Tests/Questionnaires: 330 articles
Guides - Non-Classroom: 103 articles
Guides - Classroom - Teacher: 62 articles
Speeches/Meeting Papers: 31 articles
Reports - General: 29 articles
Legal/Legislative/Regulatory Materials: 22 articles
Numerical/Quantitative Data: 17 articles
Historical Materials: 17 articles
Reference Materials - Bibliographies: 7 articles
ERIC Publications: 5 articles
Collected Works - Serials: 4 articles
Collected Works - General: 3 articles
Guides - General: 2 articles
Book/Product Reviews: 2 articles
Reference Materials - General: 1 articles
Collected Works - Serial: 1 articles
Collected Works - Proceedings: 1 articles

Total articles (Before Screening 2): 10,223
Total articles filtered out: 4,080
Total articles (After Screening 2): 6,143

### Screening 3: Journal Screen

This section of the Jupyter notebook screens articles based on the journals in which they were published. Phase II included a by-hand screening of 1,508 distinct journals. Each journal was reviewed, using ERIC's descriptors and internet searches, to determine whether it conducted peer review or published empirical articles. Journals that did not conduct peer review, or that primarily published commentary or opinion pieces, were excluded, as were journals in a language other than English. We also used a pivot table of the output to check whether the articles from each journal were based in the US, empirical, or relevant. If none of a journal's articles met the criteria, the journal was also excluded.

<b>Journal Source: Screening Process</b>
<ul>
    <li>Articles are screened based on the journal in which they were published. The names of the journals in the dataset are matched with the screening list created during the journal review process.</li>
    <li>To ensure accurate matching, all journal names in both the dataset and the screening list are converted to lowercase. This accounts for any differences in letter casing that could prevent a match.</li>
    <li>The screening process checks how many journal names in the dataset match those in the screening list and how many do not. A report is generated to show the number of matches and non-matches.</li>
    <li>Non-matching sources are identified and reviewed. A list of these sources, along with the number of associated articles, is provided for further analysis.</li>
    <li>After the screening is complete, the total number of articles that have been filtered out and the total number remaining are displayed. A summary is provided, showing the breakdown of the exclusion types.</li>
</ul>

This screening process ensures that only articles from relevant journals are retained in the dataset. Journals that do not meet the inclusion criteria or that contain irrelevant articles are systematically excluded.
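A minimal sketch of the matching step, using hypothetical journal names and a two-row screening list. The column names `source` and `source_screen` and the `Include` / `Exclude(...)` labels follow this notebook; the `validate="many_to_one"` argument is an optional safeguard added here, not part of the original cell.

```python
import pandas as pd

# Hypothetical articles and screening decisions (journal names are invented).
articles = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "source": ["Journal of Examples", "SAMPLE OPINION DIGEST", "Unlisted Review"],
})
screen = pd.DataFrame({
    "source": ["journal of examples", "Sample Opinion Digest"],
    "source_screen": ["Include", "Exclude(Article Not Relevant)"],
})

# Normalize case (and surrounding whitespace) on both sides before matching.
articles["source"] = articles["source"].str.strip().str.lower()
screen["source"] = screen["source"].str.strip().str.lower()

# validate="many_to_one" raises a MergeError if a journal appears twice in the
# screening list, which would otherwise silently duplicate article rows.
merged = articles.merge(screen, how="left", on="source", validate="many_to_one")

# Journals missing from the screening list have NaN in 'source_screen'.
unmatched = merged[merged["source_screen"].isna()]

# With na=False, unmatched rows are treated as "not excluded" and are kept.
retained = merged[~merged["source_screen"].str.startswith("Exclude", na=False)]
print(unmatched["id"].tolist(), retained["id"].tolist())
```

In this sketch the unmatched article survives the filter, so it is worth confirming that the non-match count is zero before relying on the result.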


In [10]:
######################################################
# Screening 3: Journal Screen
######################################################

# URL of the Journal Screen Excel file from your GitHub repository
file_url = 'https://raw.githubusercontent.com/s-baser/P20Tracking_Review/main/Journal_Screen.xlsx'

# Load the entire sheet from the Journal Screen Excel file
df_full_sheet = pd.read_excel(file_url, sheet_name='source_screen')

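# Slice the DataFrame to extract the table (ERIC_screen) from A1 to B1509, ensuring proper headers
# iloc[1:1509] selects positions 1 through 1508 (1,508 rows); columns A (0) and B (1).
# .copy() makes df_screening independent of df_full_sheet so the column edits below are safe.
df_screening = df_full_sheet.iloc[1:1509, 0:2].copy()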

# Rename the columns to reflect the proper header names
df_screening.columns = ['source', 'source_screen']

# Convert 'source' columns in both DataFrames to lowercase to ensure case-insensitive matching
df_screening['source'] = df_screening['source'].str.lower()
master_records['source'] = master_records['source'].str.lower()

# Total articles before removal in Screening 3
total_before_screening_3 = master_records.shape[0]

# Merge the screening data into the main data (master_records) on the 'source' column
master_records = pd.merge(master_records, df_screening, how='left', on='source')

# Check for matches and non-matches
matches = master_records['source_screen'].notna().sum()
non_matches = master_records['source_screen'].isna().sum()

# Print merge summary
print("\n\033[1mMerge Summary\033[0m")
print("Merge successful of source (journal) from master and screen data frames")
print(f"  Total matches in merge: {matches:,}")
print(f"  Total non-matches in merge: {non_matches:,}")

# Filter out the articles that need to be excluded based on the 'source_screen' column
master_records_filtered = master_records[~master_records['source_screen'].str.startswith('Exclude', na=False)]

# Total articles filtered out in Screening 3
total_filtered_out_screening_3 = total_before_screening_3 - master_records_filtered.shape[0]

# Total articles after removal in Screening 3
total_after_screening_3 = master_records_filtered.shape[0]

# Summary of Screening 3
print("\n\033[1mSummary of Screening 3\033[0m")
print(f"Total: {df_screening['source'].nunique():,} journals | {total_before_screening_3:,} articles")
print(f"  Included Total: {df_screening[df_screening['source_screen'].str.startswith('Include', na=False)]['source'].nunique()} journals | {total_after_screening_3:,} articles")
print(f"  Excluded Total: {df_screening[df_screening['source_screen'].str.startswith('Exclude', na=False)]['source'].nunique()} journals | {total_filtered_out_screening_3:,} articles")

# Breakdown of Exclude Types
exclude_types_updated = df_screening[df_screening['source_screen'].str.startswith('Exclude')]['source_screen'].value_counts()
for exclude_type, count in exclude_types_updated.items():
    count_articles = master_records[master_records['source_screen'] == exclude_type].shape[0]
    print(f"   - {exclude_type}: {count} journals | {count_articles:,} articles")

# Summary of totals
print(f"\nTotal articles (Before Screening 3): {total_before_screening_3:,}")
print(f"Total articles filtered out (Screening 3): {total_filtered_out_screening_3:,}")
print(f"Total articles (After Screening 3): {total_after_screening_3:,}")


Merge Summary
Merge successful of source (journal) from master and screen data frames
  Total matches in merge: 6,143
  Total non-matches in merge: 0

Summary of Screening 3
Total: 1,507 journals | 6,143 articles
  Included Total: 629 journals | 4,306 articles
  Excluded Total: 878 journals | 1,837 articles
   - Exclude(Article Not Relevant): 576 journals | 1,294 articles
   - Exclude: 287 journals | 519 articles
   - Exclude(Non-US Journal Context): 9 journals | 24 articles
   - Exclude(Reference): 5 journals | 0 articles
   - Exclude(Literature Review): 1 journals | 0 articles

Total articles (Before Screening 3): 6,143
Total articles filtered out (Screening 3): 1,837
Total articles (After Screening 3): 4,306


In [11]:
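# Count the number of articles for each unique value in the 'source_search' column.
# Note: master_records is the merged frame from before the 'Exclude' filter (6,143 rows),
# so these counts describe the data before the Screening 3 exclusions.
# Use master_records_filtered instead to count the articles retained after Screening 3.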
articles_per_search = master_records['source_search'].value_counts()

# Print the number of articles for each search in 'source_search'
sorted_searches = articles_per_search.loc[['Search 1', 'Search 2', 'Search 3', 'Search 4']]
print("\nNumber of Articles in Each Search (source_search):")
for search, count in sorted_searches.items():
    print(f"{search}: {count:,} articles")


Number of Articles in Each Search (source_search):
Search 1: 1,724 articles
Search 2: 2,280 articles
Search 3: 2,134 articles
Search 4: 5 articles
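
The per-search tally above selects fixed labels with `.loc`, which raises a `KeyError` if a search has no remaining articles. A hedged alternative, shown here on hypothetical data in which "Search 4" is absent, is to `reindex` the counts with a fill value of zero.

```python
import pandas as pd

# Hypothetical records in which no article came from 'Search 4'.
toy = pd.DataFrame({"source_search": ["Search 1", "Search 2", "Search 2", "Search 3"]})

expected_searches = ["Search 1", "Search 2", "Search 3", "Search 4"]

# reindex keeps the requested order and fills missing searches with 0.
per_search = toy["source_search"].value_counts().reindex(expected_searches, fill_value=0)

for search, count in per_search.items():
    print(f"{search}: {count:,} articles")
```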


### Screening 4: Less Relevant Subjects Screen

This section of the Jupyter notebook screens out articles whose subjects are not relevant to tracking in education (e.g., eye tracking). The screening identified obvious examples of unrelated subjects and used pivot tables for quick validation. The majority of removed articles originated from Search 2.

<b>Subject Source: Screening Process</b>
<ul>
    <li> Used Excel pivot tables to analyze unique subject counts for each article and identify subjects for removal.</li>
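    <li> The output below reports 468 matches against the less relevant subjects. Because one article can carry several of these subjects, the article count falls by fewer than that, from 4,306 to 3,909.</li>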
</ul>

This screening process ensures that only articles with relevant subjects are retained in the dataset. Subjects that do not meet the inclusion criteria or are irrelevant are systematically excluded.
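The sketch below illustrates the subject screen on three hypothetical records and shows why the number of subject matches can differ from the number of articles removed. The `subject` column name and comma-separated format follow this notebook; the sample rows and the shortened `less_relevant` set are assumptions for illustration.

```python
import pandas as pd

less_relevant = {"eye movements", "anxiety"}  # shortened, illustrative list

# Hypothetical records; 'subject' is a comma-separated string as in the ERIC export.
toy = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "subject": [
        "Track System (Education), Ability Grouping",
        "Eye Movements, Anxiety",
        "Anxiety, Academic Achievement",
    ],
})

# One row per (article, subject); explode keeps the original article index.
exploded = toy["subject"].str.split(",").explode().str.strip().str.lower()
flagged = exploded[exploded.isin(less_relevant)]

subject_matches = len(flagged)              # counts every flagged subject tag
articles_removed = flagged.index.nunique()  # counts each flagged article once
kept = toy[~toy.index.isin(flagged.index)]

print(subject_matches, articles_removed, kept["id"].tolist())
```

Here article A2 carries two flagged subjects, so three matches correspond to only two removed articles. Reporting the unique-index count alongside the per-subject counts keeps the before and after totals reconcilable.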

In [12]:
######################################################
# Screening 4: Subject Filtering
######################################################

# Define the subjects from both lists that are considered less relevant, now alphabetized.
less_relevant_subjects = sorted([
    'anxiety', 'athletes', 'audiovisual communications', 'client characteristics (human services)',
    'cognitive ability', 'cognitive processes', 'dentistry', 'eye movements', 'faculty development',
    'family life', 'food service', 'foods instruction', 'handwriting', 'muscular strength',
    'patients', 'perinatal influences', 'premature infants', 'psychiatric services', 
    'psychiatry', 'school space', 'smoking', 'video technology', 'workstations'
])

# Split the 'subject' column into individual subjects, explode them into separate rows, and strip/clean the data.
exploded_subjects = master_records_filtered['subject'].str.split(',').explode().str.strip().str.lower()

# Create a boolean mask for subjects to be removed
mask = exploded_subjects.isin(less_relevant_subjects)

# Filter out rows where the subject is in the less relevant subjects, but we need to use the original index
filtered_master_records = master_records_filtered[~master_records_filtered.index.isin(exploded_subjects[mask].index)]

# Create a dictionary to store counts of each removed subject
removed_subjects_count = {subject: exploded_subjects[exploded_subjects == subject].count() for subject in less_relevant_subjects}

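# Sum the per-subject match counts.
# Note: this totals subject matches, not unique articles. An article tagged with several
# flagged subjects is counted once per match, so the sum can exceed the number of rows dropped.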
total_removed_articles = sum(removed_subjects_count.values())

# Print the counts for each less relevant subject
print("\nNumber of records removed for each less relevant subject:")
for subject, count in removed_subjects_count.items():
    print(f"{subject}: {count}")

# Summary of Screening 4
print("\n\033[1mSummary of Screening 4\033[0m")
print(f"Total articles before Screening 4: {len(master_records_filtered):,}")
print(f"Total articles removed (Screening 4): {total_removed_articles:,}")
print(f"Total articles after Screening 4: {len(filtered_master_records):,}")

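# Count the number of articles for each unique value in the 'source_search' column.
# Note: this still uses master_records (the pre-exclusion merged frame), so the counts
# match the earlier tally; use filtered_master_records for counts after Screening 4.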
articles_per_search = master_records['source_search'].value_counts()

# Print the number of articles for each search in 'source_search'
sorted_searches = articles_per_search.loc[['Search 1', 'Search 2', 'Search 3', 'Search 4']]
print("\n\033[1mNumber of Articles in Each Search (source_search):\033[0m")
for search, count in sorted_searches.items():
    print(f"{search}: {count:,} articles")


Number of records removed for each less relevant subject:
anxiety: 42
athletes: 5
audiovisual communications: 1
client characteristics (human services): 1
cognitive ability: 42
cognitive processes: 89
dentistry: 2
eye movements: 121
faculty development: 90
family life: 3
food service: 1
foods instruction: 1
handwriting: 3
muscular strength: 1
patients: 5
perinatal influences: 1
premature infants: 1
psychiatric services: 1
psychiatry: 7
school space: 1
smoking: 5
video technology: 44
workstations: 1

Summary of Screening 4
Total articles before Screening 4: 4,306
Total articles removed (Screening 4): 468
Total articles after Screening 4: 3,909

Number of Articles in Each Search (source_search):
Search 1: 1,724 articles
Search 2: 2,280 articles
Search 3: 2,134 articles
Search 4: 5 articles


### Screening 5: Publication Type Screen (#2)

This section of the Jupyter notebook refines the dataset by performing an additional screening based on publication type, following a prior round of subject-based analysis.

To ensure the dataset remains focused on relevant, empirical research, we further screen out articles categorized under non-research publication types. Specifically, we remove articles labeled with types such as "Opinion Papers" or "Guides" that may still be tagged alongside valid types like "Journal Articles" or "Reports - Research."

<b>Publication Type: Screening Process</b>
<ul> 
    <li>Articles classified under non-research types such as "Opinion Papers," "Guides - Non-Classroom," "Speeches/Meeting Papers," and similar categories are removed, even if they also include "Journal Articles" or "Reports - Research."</li> 
    <li>A count of articles for each publication type is displayed before screening begins, with non-research types highlighted for removal.</li> 
    <li>The screening removes 26 articles.</li> 
</ul>

This additional filtering ensures that only articles with relevant research types remain in the dataset.
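A minimal sketch of the removal rule on hypothetical records. It checks each comma-separated type against the removal list using set logic, which is one possible alternative to substring testing; the helper name `has_unwanted_type`, the sample rows, and the shortened `remove_types` set are assumptions for illustration.

```python
import pandas as pd

remove_types = {"Opinion Papers", "Guides - Non-Classroom"}  # shortened, illustrative list

# Hypothetical records; 'publicationtype' is comma-separated as in the ERIC export.
toy = pd.DataFrame({
    "id": ["A1", "A2", "A3"],
    "publicationtype": [
        "Journal Articles, Reports - Research",
        "Journal Articles, Reports - Research, Opinion Papers",
        "Journal Articles, Guides - Non-Classroom",
    ],
})

def has_unwanted_type(value):
    """True if any comma-separated type in `value` is in remove_types."""
    if not isinstance(value, str):
        return False  # assumption: records with no type are left for later review
    return not remove_types.isdisjoint(t.strip() for t in value.split(","))

kept = toy[~toy["publicationtype"].apply(has_unwanted_type)]
print(kept["id"].tolist())
```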

In [13]:
######################################################
# Screening 5: Remove Specific Publication Types
######################################################

# Define the publication types to be removed
remove_types = [
    "Opinion Papers", "Guides - Non-Classroom", "Guides - Classroom - Teacher", 
    "Speeches/Meeting Papers", "Legal/Legislative/Regulatory Materials", 
    "Reference Materials - Bibliographies", "Guides - General"
]

# Function to filter out unwanted publication types
def filter_out_unwanted_types(df, remove_types):
    # Print the publication types and associated article counts
    print("Publication types and associated article counts before screening:")

    # Split the 'publicationtype' field on commas, then explode the list into separate rows
    # This allows counting each publication type individually, even when multiple types are listed in a single row
    publication_counts = df['publicationtype'].str.split(', ').explode().value_counts()

    # Iterate over each unique publication type and its associated count
    for p_type, count in publication_counts.items():
        marker = "*" if p_type in remove_types else ""
        print(f"{marker}{p_type}: {count:,} articles")

    
    # Total articles before removal
    total_before = df.shape[0]
    print("\n\033[1mSummary of Screening 5\033[0m")
    print(f"Total articles (Before Screening 5): {total_before:,}")

    # Filter the records:
    # - Remove any record where at least one of the remove_types is present in the 'publicationtype' field
    filtered_df = df[~df['publicationtype'].apply(lambda x: any(r_type in x for r_type in remove_types))]

    # Total articles filtered out
    total_filtered_out = total_before - filtered_df.shape[0]
    print(f"Total articles filtered out: {total_filtered_out:,}")

    # Total articles after removal
    total_after = filtered_df.shape[0]
    print(f"Total articles (After Screening 5): {total_after:,}")

    return filtered_df

# Apply the Screening 5 filter function to the results from Screening 4
final_filtered_records = filter_out_unwanted_types(filtered_master_records, remove_types)


# Count the number of articles for each unique value in the 'source_search' column
articles_per_search = final_filtered_records['source_search'].value_counts()

# Print the number of articles for each search in 'source_search'
sorted_searches = articles_per_search.loc[['Search 1', 'Search 2', 'Search 3', 'Search 4']]
print("\n\033[1mNumber of Articles in Each Search (source_search):\033[0m")
for search, count in sorted_searches.items():
    print(f"{search}: {count:,} articles")


Publication types and associated article counts before screening:
Journal Articles: 3,909 articles
Reports - Research: 3,886 articles
Tests/Questionnaires: 150 articles
Information Analyses: 75 articles
*Opinion Papers: 11 articles
*Speeches/Meeting Papers: 10 articles
Numerical/Quantitative Data: 8 articles
*Guides - Non-Classroom: 3 articles
Reports - Descriptive: 3 articles
Reports - Evaluative: 3 articles
*Guides - Classroom - Teacher: 2 articles
Reports - General: 1 articles
Historical Materials: 1 articles

Summary of Screening 5
Total articles (Before Screening 5): 3,909
Total articles filtered out: 26
Total articles (After Screening 5): 3,883

Number of Articles in Each Search (source_search):
Search 1: 1,370 articles
Search 2: 1,472 articles
Search 3: 1,037 articles
Search 4: 4 articles


------
## <ins>Saving the output</ins>

### Save Searches to CSV
This section of the notebook exports the screened ERIC records from our four searches to a single CSV file with a timestamped filename.
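
As a portable sketch of the export step, the example below writes a small hypothetical DataFrame to a timestamped CSV and reads it back. The temporary directory and the `screened_records_` filename prefix are placeholders for this illustration, not the project's actual folder or naming scheme.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the screened records.
records = pd.DataFrame({"id": ["A1", "A2"], "source_search": ["Search 1", "Search 2"]})

# A temporary directory stands in for the project folder (assumption for this sketch).
output_dir = Path(tempfile.mkdtemp())
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = output_dir / f"screened_records_{timestamp}.csv"

# index=False keeps the pandas row index out of the file.
records.to_csv(output_path, index=False)

# Read the file back to confirm the export round-trips.
reloaded = pd.read_csv(output_path)
print(f"Exported {len(reloaded):,} records to {output_path.name}")
```

Writing with `index=False` avoids the extra `Unnamed: 0` column that appears when a CSV saved with its index is loaded again.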

In [26]:
# Imports needed for building the timestamped output path
import os
from datetime import datetime

# Define the path and filename with the current date and time
base_path = "C:\\Users\\sbaser\\OneDrive - University of Georgia\\Shared Folders\\P-20 Parallels and Perils\\Data Collection\\Phase 2 - Literature Review Search\\ERIC API Resources\\ERIC Searches\\"
current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"Tracking_API_Output_Search2d_{current_time}.csv"
full_path = os.path.join(base_path, filename)

# Apply the Screening 5 filter function to the results from Screening 4
final_filtered_records = filter_out_unwanted_types(filtered_master_records, remove_types)

# Export the final_filtered_records DataFrame to a CSV file
final_filtered_records.to_csv(full_path, index=False)

# Print a message confirming the export
print(f"Data exported successfully to {full_path}")

Publication types and associated article counts before screening:
Journal Articles: 3,927 articles
Reports - Research: 3,904 articles
Tests/Questionnaires: 151 articles
Information Analyses: 75 articles
*Opinion Papers: 11 articles
*Speeches/Meeting Papers: 10 articles
Numerical/Quantitative Data: 8 articles
*Guides - Non-Classroom: 3 articles
Reports - Descriptive: 3 articles
Reports - Evaluative: 3 articles
*Guides - Classroom - Teacher: 2 articles
Reports - General: 1 articles
Historical Materials: 1 articles

Summary of Screening 5
Total articles (Before Screening 5): 3,927
Total articles filtered out: 26
Total articles (After Screening 5): 3,901
Data exported successfully to C:\Users\sbaser\OneDrive - University of Georgia\Shared Folders\P-20 Parallels and Perils\Data Collection\Phase 2 - Literature Review Search\ERIC API Resources\ERIC Searches\Tracking_API_Output_Search2d_20240906_083927.csv
