<a href="https://colab.research.google.com/github/selgebali/Colabs/blob/main/publisherIDs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Publisher Metadata Analysis Using DataCite API

## Overview
This project is a Python script that queries the DataCite API and analyzes the publisher metadata in DOI records. It counts how many records include each publisher attribute (name, scheme URI, publisher identifier, and publisher identifier scheme) and how many include all four, then plots the counts as a bar graph with `Plotly`.

The counts give a quick measure of metadata completeness for the attributes needed to identify a publisher unambiguously.

## Features
- **Data Extraction from DataCite API**: The script queries the DataCite API to collect information about publisher attributes from DOI metadata.
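- **Pagination Support**: Follows the `links.next` URL returned with each response to iterate through the result pages. The DataCite API documentation describes cursor pagination as being enabled with `page[cursor]=1`; without that parameter, page-number pagination applies and is limited to the first 10,000 records, so add it to the starting URL when complete coverage is needed.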
- **Attribute Analysis**: Counts the number of occurrences for each specific publisher attribute (name, scheme URI, publisher identifier, identifier scheme).
- **Visualization with Plotly**: The final count of each attribute is visualized in a bar chart, making it easy to understand the completeness and distribution of publisher metadata.

## Prerequisites
To run this project, you will need the following:
- Python 3.x
- `requests` (for making API requests to the DataCite endpoint)
- `pandas` (for handling tabular data)
- `plotly` (for visualizing data)

You can install all the dependencies with the following command:

```sh
pip install requests pandas plotly
```

## Usage
### 1. Initialize API Variables
The script starts by defining the starting URL for the DataCite test API (`api.test.datacite.org`), requesting 1000 records per page to keep the number of requests low:

```python
url = "https://api.test.datacite.org/dois?publisher=true&page[size]=1000"  # Starting URL
```
This URL can be modified to change the page size or filter the data more precisely.
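
As a minimal sketch of how the starting URL could be assembled from parameters instead of being edited by hand, the helper below rebuilds the same URL with the standard library. The function name `build_start_url` and its `extra` argument are hypothetical and not part of the script; the `query` filter in the second call is used only as an illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.test.datacite.org/dois"


def build_start_url(page_size=1000, extra=None):
    """Return a starting URL with the given page size and optional extra filters.

    `extra` is a hypothetical dict of additional query parameters.
    """
    params = {"publisher": "true", "page[size]": page_size}
    params.update(extra or {})
    # safe="[]" keeps the square brackets readable instead of percent-encoding them
    return f"{BASE_URL}?{urlencode(params, safe='[]')}"


default_url = build_start_url()
smaller_pages_url = build_start_url(page_size=100, extra={"query": "climate"})
print(default_url)
print(smaller_pages_url)
```

With the defaults, the helper reproduces the starting URL used by the script, so it can be dropped in without changing behavior.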

### 2. Data Extraction
The script iteratively sends requests to the API to retrieve DOI metadata. It counts the number of entries where specific publisher fields are present:
- **Publisher Name**: Checks if the publisher name is present.
- **Scheme URI**: Checks if a scheme URI exists for the publisher.
- **Publisher Identifier**: Verifies if a publisher identifier is available.
- **Publisher Identifier Scheme**: Checks if the publisher identifier scheme is specified.

The loop continues until a response no longer provides a `next` link, at which point every page the API exposes for the query has been processed.
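
The counting logic described above can be sketched on a small in-memory sample, which makes it possible to check without calling the API. The function name `count_publisher_fields`, the `allFields` key, and the sample records (including the made-up identifier) are assumptions for illustration; only the four attribute names come from the script.

```python
from collections import Counter

PUBLISHER_FIELDS = (
    "name",
    "schemeUri",
    "publisherIdentifier",
    "publisherIdentifierScheme",
)


def count_publisher_fields(records):
    """Count, per attribute, the records whose publisher has a non-empty value."""
    counts = Counter()
    for record in records:
        publisher = record.get("attributes", {}).get("publisher")
        if not isinstance(publisher, dict):
            continue  # skip missing or plain-string publishers
        present = [field for field in PUBLISHER_FIELDS if publisher.get(field)]
        counts.update(present)
        if len(present) == len(PUBLISHER_FIELDS):
            counts["allFields"] += 1
    return counts


# Made-up sample records shaped like items in the API's "data" list
sample = [
    {"attributes": {"publisher": {
        "name": "Example Press",
        "schemeUri": "https://ror.org",
        "publisherIdentifier": "https://ror.org/00example",
        "publisherIdentifierScheme": "ROR",
    }}},
    {"attributes": {"publisher": {"name": "Example Repository"}}},
    {"attributes": {"publisher": None}},
]

counts = count_publisher_fields(sample)
print(dict(counts))
```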

### 3. Print and Analyze Counts
Once all pages are fetched, the script prints the counts for each of the publisher fields:

```python
print("Publisher Name Count:", publisher_name_count)
print("Scheme URI Count:", scheme_uri_count)
print("Publisher Identifier Count:", publisher_identifier_count)
print("Publisher Identifier Scheme Count:", publisher_identifier_scheme_count)
print("All Fields Present Count:", all_fields_present_count)
```
Each count is the number of retrieved DOI records in which that attribute is present and non-empty; the last count is the number of records in which all four attributes are present.
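
Raw counts are easier to compare as shares. The sketch below converts them to percentages, using the counts from the notebook's recorded output. The script does not track the total number of records, so the publisher name count is used as the denominator here; that choice and the function name `completeness` are assumptions.

```python
def completeness(counts_by_field, total):
    """Return each count as a percentage of `total`, rounded to two decimals."""
    if total == 0:
        return {field: 0.0 for field in counts_by_field}
    return {
        field: round(100 * count / total, 2)
        for field, count in counts_by_field.items()
    }


# Counts taken from the notebook's recorded output
counts_by_field = {
    "Publisher Name": 9997,
    "Scheme URI": 614,
    "Publisher Identifier": 614,
    "Publisher Identifier Scheme": 614,
    "All Fields Present": 614,
}

# Assumption: records with a publisher name serve as the baseline
shares = completeness(counts_by_field, total=counts_by_field["Publisher Name"])
for field, share in shares.items():
    print(f"{field}: {share}%")
```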

### 4. Visualize Results
The script creates a bar graph using `Plotly` to visualize the counts for each publisher attribute:
- **Custom Color Palette**: A custom color palette is used for better visual distinction between bars.
- **Font Customization**: Arial font is used consistently across labels and titles.
- **Enhanced Axes and Layout**: The x and y axes are customized for a clean look, and values are displayed directly on the bars.

The final visualization presents a clear view of metadata completeness for publisher information.

### Running the Script
You can run the script by saving it to a file and executing it as follows:

```sh
python script_name.py
```
This will generate an interactive bar graph in your browser showing the count of publisher attributes.

## Customization
- **API Endpoint**: The starting URL can be modified to query other aspects of the DataCite API or to adjust the page size.
- **Plot Customization**: The visualization can be customized further by modifying the color palette, font settings, and graph layout.
- **Attribute Selection**: You can adjust the script to analyze different attributes from the API response.

### Example Modifications
- To change the color scheme, modify the `custom_colors` variable:

```python
custom_colors = ['#FF5733', '#33FF57', '#3357FF', '#FF33A1', '#A133FF']
```
- To visualize additional fields, you can expand the `fields` and `counts` lists accordingly, based on new metadata attributes you wish to analyze.
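
When fields are added, the `fields` and `counts` lists have to stay the same length and in the same order. One way to guarantee that, sketched below, is to keep a single dictionary and derive both lists from it, repeating the palette if there are more fields than colors. The helper name `plot_inputs` is hypothetical; the counts and colors are taken from the notebook.

```python
from itertools import cycle, islice


def plot_inputs(counts_by_field, palette):
    """Derive aligned field, count, and color lists from one dictionary."""
    fields = list(counts_by_field)
    counts = [counts_by_field[field] for field in fields]
    # Repeat the palette as needed so every field gets a color
    colors = list(islice(cycle(palette), len(fields)))
    return fields, counts, colors


counts_by_field = {
    "Publisher Name": 9997,
    "Scheme URI": 614,
    "Publisher Identifier": 614,
}
fields, counts, colors = plot_inputs(counts_by_field, ["#243B54", "#00B1E2"])
print(fields, counts, colors)
```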

## Output
- **Attribute Counts**: The console will display the total counts for each publisher attribute.
- **Interactive Bar Graph**: The bar graph generated by `Plotly` provides an easy-to-understand visualization of how well the metadata is populated across different DOI records.

## Debugging and Error Handling
- **Request Errors**: The script does not catch exceptions raised during API requests, so a connectivity or endpoint failure stops the run with a traceback that can be used for debugging.
- **Pagination Handling**: If there are no more pages, the loop stops to avoid unnecessary requests.
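
A minimal sketch of how pagination and request errors could be handled together is shown below. It is not the script's own code: `iter_pages`, `fetch_json`, the `max_pages` cap, and the repeated-URL guard are assumptions added for illustration. The fetcher is passed in as an argument so the loop can be exercised with fake pages, without network access.

```python
def fetch_json(url, timeout=30):
    """Fetch one page with requests, raising on network or HTTP errors."""
    import requests  # imported here so the offline demo below runs without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.json()


def iter_pages(start_url, fetcher, max_pages=None):
    """Yield each page, following links.next until it is absent."""
    url = start_url
    seen = set()
    pages = 0
    while url and url not in seen:  # guard against a repeated next link
        if max_pages is not None and pages >= max_pages:
            break
        seen.add(url)
        page = fetcher(url)
        yield page
        pages += 1
        url = page.get("links", {}).get("next")


# Offline demo with two fake pages standing in for API responses
fake_pages = {
    "page-1": {"data": [{"id": "a"}], "links": {"next": "page-2"}},
    "page-2": {"data": [{"id": "b"}], "links": {}},
}


def fake_fetcher(url):
    return fake_pages[url]


pages = list(iter_pages("page-1", fake_fetcher))
print(len(pages))
```

Against the real API, the same loop would be called as `iter_pages(url, fetch_json)`.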

## Further Development and Contributions
- **Additional Metadata Analysis**: Extend the script to include analysis of other fields like contributors, funding information, etc.
- **Reporting**: Export the results to a CSV or generate a PDF report summarizing the findings.
- **Visualization Enhancements**: Add more types of visualizations (e.g., pie charts) for better analysis of categorical data.
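
As a rough sketch of the CSV reporting idea, the counts could be written with the standard library's `csv` module. The function name `counts_to_csv` is hypothetical; the column names mirror the `Field` and `Count` columns of the script's DataFrame, and the counts come from the notebook's recorded output.

```python
import csv
import io


def counts_to_csv(counts_by_field, stream):
    """Write one 'Field,Count' row per attribute to an open text stream."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Field", "Count"])
    for field, count in counts_by_field.items():
        writer.writerow([field, count])


# Write to an in-memory buffer; pass an open file object to save to disk instead
buffer = io.StringIO()
counts_to_csv({"Publisher Name": 9997, "Scheme URI": 614}, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```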




In [None]:
import requests
import pandas as pd
import plotly.express as px

# Step 1: Initialize the variables
url = "https://api.test.datacite.org/dois?publisher=true&page[size]=1000"  # Starting URL
publisher_name_count = 0
scheme_uri_count = 0
publisher_identifier_count = 0
publisher_identifier_scheme_count = 0
all_fields_present_count = 0

# Step 2: Loop through the API pages using the cursor method
while url:
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # Stop with an HTTPError on 4xx/5xx responses
    data = response.json()

    # Iterate through the 'data' to count entries with values in the specified fields
    for item in data['data']:
        publisher = item.get('attributes', {}).get('publisher', None)

        # Only inspect publisher objects (skips None and plain-string publishers)
        if isinstance(publisher, dict):
            name_present = bool(publisher.get('name'))
            scheme_uri_present = bool(publisher.get('schemeUri'))
            publisher_identifier_present = bool(publisher.get('publisherIdentifier'))
            publisher_identifier_scheme_present = bool(publisher.get('publisherIdentifierScheme'))

            # Count each individual field
            if name_present:
                publisher_name_count += 1
            if scheme_uri_present:
                scheme_uri_count += 1
            if publisher_identifier_present:
                publisher_identifier_count += 1
            if publisher_identifier_scheme_present:
                publisher_identifier_scheme_count += 1

            # Count when all fields are present
            if name_present and scheme_uri_present and publisher_identifier_present and publisher_identifier_scheme_present:
                all_fields_present_count += 1

    # Step 3: Move to the next page if available
    url = data['links'].get('next')

# Step 4: Print the counts
print("Publisher Name Count:", publisher_name_count)
print("Scheme URI Count:", scheme_uri_count)
print("Publisher Identifier Count:", publisher_identifier_count)
print("Publisher Identifier Scheme Count:", publisher_identifier_scheme_count)
print("All Fields Present Count:", all_fields_present_count)



Publisher Name Count: 9997
Scheme URI Count: 614
Publisher Identifier Count: 614
Publisher Identifier Scheme Count: 614
All Fields Present Count: 614


In [None]:
# Step 5: Prepare data for the bar graph
fields = ['Publisher Name', 'Scheme URI', 'Publisher Identifier', 'Publisher Identifier Scheme', 'All Fields Present']
counts = [publisher_name_count, scheme_uri_count, publisher_identifier_count, publisher_identifier_scheme_count, all_fields_present_count]

# Step 6: Create the bar graph with Plotly, using the custom color palette and Arial font
custom_colors = ['#243B54', '#00B1E2', '#5B88B9', '#46BCAB', '#90D7CD', '#BC2B66']

df = pd.DataFrame({
    'Field': fields,
    'Count': counts
})

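# color='Field' gives each bar its own palette entry; without it every bar uses the first color
fig = px.bar(df, x='Field', y='Count', color='Field',
             title='Count of Publisher Fields in DOI Metadata',
             labels={'Field':'Publisher Attribute', 'Count':'Count'},
             color_discrete_sequence=custom_colors)
fig.update_layout(showlegend=False)  # The x-axis labels already name each bar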

# Step 7: Update font, layout, background, and customizations for x and y axes
fig.update_layout(
    font=dict(family="Arial", size=18),
    title_font=dict(family="Arial", size=18),
    xaxis_title_font=dict(family="Arial", size=18),
    yaxis_title_font=dict(family="Arial", size=18),
    width=2000,
    height=900,
    plot_bgcolor='white',  # Set background color to white
    paper_bgcolor='white'  # Set the surrounding paper background to white
)

# Add text (counts) to each bar
fig.update_traces(texttemplate='%{y}', textposition='outside', textfont=dict(
    family="Arial",
    size=20,
    color="#243B54",
    weight="bold"
))

# Step 8: Add toggles for x and y axis customizations
fig.update_xaxes(
    showgrid=False,  # Hide gridlines
    showline=True,   # Show axis line
    linewidth=2,     # Line width
    linecolor='black',  # Axis line color
    ticks='outside',  # Show ticks outside the plot
    tickfont=dict(family='Arial', size=18, color='black')  # Customize tick labels
)

fig.update_yaxes(
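    showgrid=False,   # Hide gridlines; set to True to apply the two settings below
    gridwidth=1,     # Gridline width (only used when showgrid=True)
    gridcolor='lightgrey',  # Gridline color (only used when showgrid=True)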
    showline=True,   # Show axis line
    linewidth=2,     # Line width
    linecolor='black',  # Axis line color
    ticks='outside',  # Show ticks outside the plot
    tickfont=dict(family='Arial', size=18, color='black')  # Customize tick labels
)

# Step 9: Show the bar graph
fig.show()