The Yahoo Knowledge Graph team at Verizon Media is responsible for providing critical COVID-19 data that feeds into Yahoo properties like Yahoo News, Yahoo Finance, and Yahoo Weather. The COVID-19 datasets include country, state, and county level information updated on a rolling basis, with updates occurring approximately hourly.
The COVID-19 datasets are constructed entirely from primary (government and public agency) sources with a clear attribution of the primary sources used for each geographical region. While other aggregations of COVID-19 data are already available, we believe ours to be the only open source COVID-19 dataset that is constructed entirely from primary sources with clear attribution back to those sources. Our hope is that additional transparency will enable more accurate analysis, aiding researchers who seek to understand and prevent further spread of the disease.
Released together with the COVID-19 dataset are two other open source projects:
- An API powered by Elide, which provides JSON-API and GraphQL interfaces to the COVID-19 public data
- A dashboard to explore the data
The data is organized by region and by time. The time dimension has two views: a snapshot of the latest updates received for all regions, and the updates reported by each region for a given date. The schema will evolve as the COVID-19 pandemic develops and as local governments and agencies improve their ability to collect and present their data to the public, so please check back regularly.
We welcome data feeds or links to web pages that you would like us to crawl, extract, and merge into the overall stats. Feel free to submit an issue.
Provides general information about the regions covered in the dataset, such as geographical location and links to other public data sources.
Field | Type | Description |
---|---|---|
id | xsd:string | a unique identifier for the region |
type | list of xsd:string | a list of type classifications for the region, for example Country, StateAdminArea, or CountyAdminArea |
woeId | xsd:string | WhereOnEarth unique identifier for the region |
wikiId | xsd:string | the main Wikipedia page name of the region; can be used as a unique key |
countryCode | xsd:string | two-letter country code (ISO 3166-1 alpha-2) |
stateCode | xsd:string | two-letter state abbreviation code (FIPS 5-2) |
countyCode | xsd:string | US county code (FIPS 6-4) |
label | xsd:string | the English name of the region |
latitude | xsd:float | latitude in decimal number format |
longitude | xsd:float | longitude in decimal number format |
population | xsd:integer | the population residing in the region |
parentId | list of xsd:string | a list of parent geopolitical regions for the region; only direct parents that exist in the dataset are listed, not the full possible hierarchy |
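
As a rough illustration of how the fields above might be consumed, the following Python sketch parses a few metadata rows and indexes regions by their direct parents. The tab-separated layout, the comma separator inside list-valued fields, and all sample ids and values are assumptions made for the example; none of them is confirmed by the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The tab-separated layout with a header row, the
# comma separator inside list-valued fields, and the ids are all assumptions.
SAMPLE = (
    "id\ttype\tlabel\tpopulation\tparentId\n"
    "region-a\tCountry\tExampleland\t1000\t\n"
    "region-b\tStateAdminArea\tExample State\t400\tregion-a\n"
    "region-c\tCountyAdminArea\tExample County\t50\tregion-b\n"
)


def load_regions(text, list_sep=","):
    """Parse metadata rows into dicts keyed by id, splitting list-valued fields."""
    regions = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["type"] = [t for t in row["type"].split(list_sep) if t]
        row["parentId"] = [p for p in row["parentId"].split(list_sep) if p]
        row["population"] = int(row["population"]) if row["population"] else None
        regions[row["id"]] = row
    return regions


def children_index(regions):
    """Map each parent id to the ids of its direct children."""
    index = defaultdict(list)
    for region_id, row in regions.items():
        for parent in row["parentId"]:
            index[parent].append(region_id)
    return dict(index)


regions = load_regions(SAMPLE)
children = children_index(regions)
```

Because `parentId` holds only direct parents, walking the full hierarchy means following the index one level at a time.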
Provides detailed case counts of COVID-19 in each region on [DATE] in local time for that region. Each entry (row) in the daily file represents a single region.
Please be aware that different sources release data at different and often unpredictable frequencies. The `by-region-[DATE]` numbers will be updated as sources release data for the given date for their region. In some cases, data for a given region is not released until many days after that calendar date has elapsed everywhere in the world. As a result, the same `by-region-[DATE]` file may show different aggregate statistics for the same date depending on when it is accessed. Generally speaking, `by-region-[DATE]` data more than one week old is stable.
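
The one-week rule of thumb can be expressed as a small check. This is only a sketch: the function name `is_probably_stable` and the `settle` parameter are hypothetical, and "more than one week" is interpreted here as strictly more than seven days.

```python
from datetime import date, timedelta


def is_probably_stable(reference_date, today, settle=timedelta(days=7)):
    """Return True when the daily file's date is more than `settle` in the past.

    Hypothetical helper: it encodes only the rule of thumb stated above and
    offers no guarantee that a source will not revise older numbers.
    """
    return today - reference_date > settle


eight_days_old = is_probably_stable(date(2020, 5, 1), today=date(2020, 5, 9))
seven_days_old = is_probably_stable(date(2020, 5, 2), today=date(2020, 5, 9))
```

Here `eight_days_old` is `True` and `seven_days_old` is `False`, since exactly seven days does not count as more than one week.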
Field | Type | Description |
---|---|---|
regionId | xsd:string | see id above |
label | xsd:string | see above |
totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region up to the given date (aggregate) |
totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
numPositiveTests | xsd:integer | the daily count of people who tested positive for COVID-19 |
numDeaths | xsd:integer | the daily count of fatalities as a result of COVID-19 |
numRecoveredCases | xsd:integer | the daily count of people who recovered from COVID-19 |
diffNumPositiveTests | xsd:integer | the difference in the number of positive cases found between two consecutive days |
diffNumDeaths | xsd:integer | the difference in the number of deaths between two consecutive days |
avgWeeklyConfirmedCases | xsd:float | 7-day moving average of daily new confirmed cases |
avgWeeklyDeaths | xsd:float | 7-day moving average of daily new deaths |
referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
lastUpdatedDate | xsd:datetime | last update time of the entry |
dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |
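
To make the derived fields concrete, the sketch below computes a day-over-day difference and a 7-day moving average from a series of daily counts. It is an illustration only: the sample numbers are invented, and both the trailing window and the choice to emit `None` until seven days of data exist are assumptions, not the dataset's documented method.

```python
def consecutive_diff(daily):
    """Difference between each day's count and the previous day's count.

    The first day has no predecessor, so its difference is None.
    """
    return [None] + [today - prev for prev, today in zip(daily, daily[1:])]


def trailing_average(daily, window=7):
    """Moving average over the current day and the (window - 1) days before it.

    Assumption: days with fewer than `window` observations yield None.
    """
    averages = []
    for i in range(len(daily)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(daily[i + 1 - window : i + 1]) / window)
    return averages


# Invented daily counts for one region over eight consecutive days.
daily = [10, 12, 9, 15, 20, 18, 14, 21]
diffs = consecutive_diff(daily)      # analogous to diffNumPositiveTests
averages = trailing_average(daily)   # analogous to avgWeeklyConfirmedCases
```

With this input, the first defined average (day seven) is 98 / 7 = 14.0, and the day-eight difference is 21 - 14 = 7.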
Provides the latest figures for each region.
The schema for the latest file is similar to the `by-region-[DATE]` schema above.
There are two main differences:
- All daily counts, daily diffs, and moving averages are removed. Daily numbers in the latest file can be misleading because they depend on the time of day at which the data was collected.
- referenceDate: In the daily files, referenceDate always matches the filename, and represents the date in local time for the relevant data reported by the source for that region when that source was last consulted. In the latest file, referenceDate will differ across regions, representing the latest date on which the source for a given region was consulted.
Note that because different regions report at different and often unpredictable frequencies, the latest figures for one region may be many days older than the latest figures for another region. For this reason, stable `by-region-[DATE]` numbers are required for an accurate comparison of growth rates in different regions. Generally speaking, `by-region-[DATE]` data more than one week old is stable.
Field | Type | Description |
---|---|---|
regionId | xsd:string | see id above |
label | xsd:string | see above |
totalConfirmed | xsd:integer | the total number of confirmed cases of COVID-19 in the region up to the given date (aggregate) |
totalDeaths | xsd:integer | the total number of fatalities from COVID-19 in the region |
totalRecoveredCases | xsd:integer | the total number of people recovered from COVID-19 in the region (aggregate) |
totalTestedCases | xsd:integer | the total number of people tested for COVID-19 in the region (aggregate) |
referenceDate | xsd:date | the date associated with the COVID-19 data according to the local timezone of the region |
lastUpdatedDate | xsd:datetime | last update time of the entry |
dataSource | xsd:anyURI | the source attribution for the COVID-19 data in the current entry |
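
Since referenceDate differs across regions in the latest file, it can help to measure how far each region lags behind the most recently reported one before comparing totals. The sketch below does that for a few invented rows; the region ids and values are hypothetical, and it assumes referenceDate is serialized as a plain `YYYY-MM-DD` string with no timezone suffix.

```python
from datetime import date

# Invented rows shaped like entries in the latest file (subset of fields).
latest = [
    {"regionId": "region-a", "totalConfirmed": 120, "referenceDate": "2020-05-10"},
    {"regionId": "region-b", "totalConfirmed": 95, "referenceDate": "2020-05-07"},
    {"regionId": "region-c", "totalConfirmed": 40, "referenceDate": "2020-05-10"},
]


def reference_lag_days(rows):
    """Days each region's referenceDate trails the newest referenceDate seen."""
    dates = {r["regionId"]: date.fromisoformat(r["referenceDate"]) for r in rows}
    newest = max(dates.values())
    return {region_id: (newest - d).days for region_id, d in dates.items()}


lags = reference_lag_days(latest)
stale = sorted(region_id for region_id, lag in lags.items() if lag > 0)
```

In this example `region-b` trails the others by three days, so its total is not directly comparable with theirs.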
Please contact yk-covid-19-os@verizonmedia.com with any questions.
Thank you to everyone who contributed to this project!
The Yahoo Knowledge Graph COVID-19 Dataset is made available under a Creative Commons CC-BY-NC 4.0 license. No express permission from Verizon Media is required for noncommercial uses; only compliance with the CC-BY-NC 4.0 license, including attribution, is required.
Verizon Media may consider granting royalty-free commercial licenses upon request. If you are interested in making commercial use of the Yahoo COVID-19 Dataset, please submit a request.