# Wellcome Trust: 360Giving Bulk Data

This notebook is the crawler for Wellcome Trust's 360Giving bulk data. 360Giving defines a [standard tabular format](https://standard.threesixtygiving.org/en/latest/#) for publishing grants data from UK philanthropic funders. Depending on the funder, the data is published as CSV, XLS, or XLSX.

### Notes:
- Wellcome Trust also has a [grant explorer](https://wellcome.ac.uk/grant-funding) which is a web interface to their grants data. This crawler is for the bulk data only.
- 360Giving maintains a registry; however, [its Wellcome Trust entry](https://grantnav.threesixtygiving.org/org/360G-wellcome#) is out of date relative to [Wellcome Trust's own publication](https://wellcome.org/grant-funding/funded-people-and-projects) in the same format. At the time of writing, the two sources cover the following date ranges:

| Site | Date Range | Data | 
| --- | --- | --- |
| [360Giving](https://grantnav.threesixtygiving.org/org/360G-wellcome#) | 2005-10-01 -- 2022-05-04 | [xlsx](https://cms.wellcome.org/sites/default/files/2022-05/Wellcome-grants-awarded-1-October-2005-to-04-05-2022.xlsx) |
| [Wellcome Funded Projects](https://wellcome.org/grant-funding/funded-people-and-projects) | 2005-10-01 -- 2023-09-30 | [xlsx](https://wellcome.org/sites/default/files/2023-11/Wellcome-grants-awarded-1-October-2005-to-30-September-2023.xlsx) |

Because the registry entry lags behind Wellcome Trust's own publication, the latest data is not always available from the 360Giving registry. This crawler uses Wellcome Trust's own publication as the source of truth, but future revisions should check both sources for the most recent data.
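
One way to automate that check is to compare the coverage end dates embedded in the two file names from the table above. The following is only a sketch under that assumption: the helper names `coverage_end` and `latest_source` are hypothetical, and the file-name patterns are inferred from the two URLs listed here, so they may not hold for future exports.

```python
import re
from datetime import date, datetime
from typing import Iterable, Optional

# The two download URLs from the table above.
CANDIDATE_URLS = [
    "https://cms.wellcome.org/sites/default/files/2022-05/Wellcome-grants-awarded-1-October-2005-to-04-05-2022.xlsx",
    "https://wellcome.org/sites/default/files/2023-11/Wellcome-grants-awarded-1-October-2005-to-30-September-2023.xlsx",
]


def coverage_end(url: str) -> Optional[date]:
    """Parse the end date that follows '-to-' in the file name, if any."""
    match = re.search(r"-to-([^/]+)\.xlsx$", url)
    if match is None:
        return None
    # Assumed formats: '04-05-2022' (day-month-year) and '30-September-2023'.
    # '%B' expects English month names, which holds in the default C locale.
    for fmt in ("%d-%m-%Y", "%d-%B-%Y"):
        try:
            return datetime.strptime(match.group(1), fmt).date()
        except ValueError:
            continue
    return None


def latest_source(urls: Iterable[str]) -> str:
    """Return the URL whose file name advertises the latest end date."""
    dated = [(coverage_end(url), url) for url in urls]
    dated = [(end, url) for end, url in dated if end is not None]
    if not dated:
        raise ValueError("No URL contained a recognisable end date.")
    return max(dated)[1]


print(latest_source(CANDIDATE_URLS))
```

For the two URLs above, this selects the Wellcome-hosted file covering grants up to 30 September 2023, which matches the source this crawler uses.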

In [1]:
from oic_scrape.items import AwardItem, AwardParticipant
import pandas as pd
from datetime import datetime, timedelta
from attrs import asdict
from currency_converter import ECB_URL, CurrencyConverter, RateNotFoundError

## Parameters and Configuration

In [2]:
THREESIXTY_G_DATA_URL = "https://wellcome.org/sites/default/files/2023-11/Wellcome-grants-awarded-1-October-2005-to-30-September-2023.xlsx"
FUNDER_ORG_NAME = "The Wellcome Trust"
FUNDER_ROR_ID = "https://ror.org/029chgv08"
OUTPUT_LOCATION = "data/wellcome-trust--giving360_grants.jsonl"
OUTPUT_FORMAT = "jsonl"

In [3]:
def validate_output_format(format):
    """
    Validates the output file format.

    Args:
        format (str): The output format to be validated.

    Returns:
        bool: True if the format is valid (json, jsonl, or jsonlines), False otherwise.
    """
    # A single membership test replaces the chained comparisons.
    return format.lower() in {"json", "jsonl", "jsonlines"}


if validate_output_format(OUTPUT_FORMAT):
    if OUTPUT_FORMAT.lower() == "jsonl" or OUTPUT_FORMAT.lower() == "jsonlines":
        output_format_lines = True
    else:
        output_format_lines = False
else:
    raise ValueError("Output format should be either 'json' or 'jsonl'/'jsonlines'.")

In [4]:
c = CurrencyConverter(ECB_URL)
df = pd.read_excel(
    THREESIXTY_G_DATA_URL, sheet_name="General Report", header=0, engine="calamine"
)
_crawled_at = datetime.utcnow()

In [5]:
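# Use the Wellcome ROR ID from the configuration cell. Importing FUNDER_ORG_NAME
# and FUNDER_ORG_ROR_ID from the Sloan spider would replace the Wellcome values
# with the Sloan ones.
FUNDER_ORG_ROR_ID = FUNDER_ROR_ID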


awards = []
for ix, row in df.iterrows():
    source = "wellcome.org_360giving-export"
    grant_id = f"360g::{row['Identifier']}"
    funder_org_name = row["Funding Org:Name"]
    recipient_org_name = row["Recipient Org:Name"]
    funder_org_ror_id = (
        FUNDER_ORG_ROR_ID if funder_org_name == FUNDER_ORG_NAME else None
    )
    recipient_org_location = row["Recipient Org:Country"]
    pi_name = str(row["Lead Applicant"])

    named_participants = []
    # Empty spreadsheet cells load as NaN, which is truthy, so test with pd.notna.
    if pd.notna(row["Lead Applicant"]):
        named_participants.append(
            AwardParticipant(
                full_name=row["Lead Applicant"],
                last_name=row["Applicant Surname"],
                grant_role="Lead Applicant",
                is_pi=True,
            )
        )
    if pd.notna(row["Other Applicant(s)"]):
        for name in str(row["Other Applicant(s)"]).split(","):
            named_participants.append(
                AwardParticipant(
                    full_name=name.strip(),  # drop the space left after each comma
                    grant_role="Other Applicant",
                    is_pi=False,
                )
            )

    if pd.notna(row["Sponsor(s)"]):
        for name in str(row["Sponsor(s)"]).split(","):
            named_participants.append(
                AwardParticipant(
                    full_name=name.strip(),
                    grant_role="Sponsor",
                    is_pi=False,
                )
            )

    grant_year = pd.to_datetime(row["Award Date"], errors="coerce").year
    grant_start_date = pd.to_datetime(
        row["Planned Dates:Start Date"], errors="coerce"
    ).date()
    grant_end_date = pd.to_datetime(
        row["Planned Dates:End Date"], errors="coerce"
    ).date()
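    # Leave the duration empty when either planned date is missing (NaT).
    grant_duration = (
        f"{(grant_end_date - grant_start_date).days} days"
        if pd.notna(grant_start_date) and pd.notna(grant_end_date)
        else None
    )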

    if row['Amount Awarded'] and float(row['Amount Awarded']) > 0:
        award_amount = float(row['Amount Awarded'])
        award_currency = row['Currency']
        try:
            award_amount_usd = c.convert(award_amount, 'GBP', 'USD', date=grant_start_date)
            comment = f"`award_amount_usd` converted from GBP to USD using ECB exchange rate on {grant_start_date}."
        except RateNotFoundError:
            # No ECB rate for the start date: fall back to the last few days
            # before 2023-10-01, the end of the period covered by this export.
            award_amount_usd = None
            comment = f'Could not find exchange rate for GBP to USD on {grant_start_date} or in the 5 days before 2023-10-01.'
            for i in range(1, 6):
                new_date = datetime(2023, 10, 1) - timedelta(days=i)
                try:
                    award_amount_usd = c.convert(award_amount, 'GBP', 'USD', date=new_date)
                    comment = f"`award_amount_usd` converted from GBP to USD using ECB exchange rate on {new_date}, rather than {grant_start_date}."
                    break
                except RateNotFoundError:
                    continue
    else:
        award_amount = None
        award_currency = None
        award_amount_usd = None
        comment = None
 
    grant_title = row["Title"]
    grant_description = row["Description"]
    program_of_funder = row["Grant Programme:Title"]
    raw_source_data = row.to_json()

    award = AwardItem(
        _crawled_at=_crawled_at,
        source=source,
        grant_id=grant_id,
        funder_org_name=funder_org_name,
        funder_org_ror_id=funder_org_ror_id,
        recipient_org_name=recipient_org_name,
        recipient_org_location=recipient_org_location,
        pi_name=pi_name,
        named_participants=named_participants,
        grant_year=grant_year,
        grant_start_date=grant_start_date,
        grant_end_date=grant_end_date,
        grant_duration=grant_duration,
        award_amount=award_amount if award_amount else None,
        award_currency=award_currency if award_currency else None,
        award_amount_usd=award_amount_usd if award_amount_usd else None,
        comments=comment if comment else None,
        grant_title=grant_title,
        grant_description=str(grant_description),
        program_of_funder=program_of_funder,
        raw_source_data=raw_source_data,
        _award_schema_version="0.1.0",
    )
    awards.append(asdict(award))
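
The currency conversion in the loop above retries with other dates when the ECB data has no rate for the requested day. A library-free sketch of the general walk-back idea follows. It is an illustration only, not how `currency_converter` works internally: the function `find_rate_walking_back`, the `rates` mapping, and the rate value are all hypothetical placeholders.

```python
from datetime import date, timedelta
from typing import Mapping, Optional, Tuple


def find_rate_walking_back(
    rates: Mapping[date, float],
    target: date,
    max_days_back: int = 5,
) -> Optional[Tuple[date, float]]:
    """Return (date_used, rate) for the target day or the nearest earlier
    day within max_days_back, or None when no rate is available."""
    for offset in range(max_days_back + 1):
        candidate = target - timedelta(days=offset)
        if candidate in rates:
            return candidate, rates[candidate]
    return None


# Placeholder table with a made-up rate; real rates would come from the ECB data.
example_rates = {date(2023, 9, 29): 1.22}

# 2023-10-01 is a Sunday, so the lookup walks back to Friday 2023-09-29.
found = find_rate_walking_back(example_rates, date(2023, 10, 1))
print(found)
```

Returning the date that was actually used makes it straightforward to record the substitution in the award's `comments` field, as the loop above does.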

In [6]:
grants_df = pd.DataFrame(awards)
grants_df.to_json(OUTPUT_LOCATION, orient="records", lines=output_format_lines)
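
After the export, a quick read-back can confirm that the file is valid JSON Lines. The sketch below assumes the `jsonl` output format configured above; the helper name `count_jsonl_records` is hypothetical and not part of `oic_scrape`.

```python
import json
from pathlib import Path


def count_jsonl_records(path: str) -> int:
    """Parse every non-empty line as JSON and return the number of records.

    Raises ValueError if a line is not valid JSON or is not a JSON object.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            if not isinstance(record, dict):
                raise ValueError(f"Line {line_number} is not a JSON object.")
            count += 1
    return count
```

Calling `count_jsonl_records(OUTPUT_LOCATION)` and comparing the result with `len(grants_df)` would show whether every award was written.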