# Compare and Analyze Data on COVID-19 Infections Provided by the [Robert Koch Institute (RKI)](https://www.rki.de/EN/Home/homepage_node.html)

The [Robert Koch Institute (RKI)](https://www.rki.de/EN/Home/homepage_node.html) is the federal government agency responsible for disease control and prevention in Germany. It publishes data about COVID-19 for all of Germany through various channels (see [this](https://www.rki.de/EN/Content/infections/epidemiology/outbreaks/COVID-19/COVID19.html) page for an overview).

This notebook analyzes and compares data from two sources that the RKI updates daily but that differ in their level of detail. The main objective is to understand how the very fine-grained numbers provided via GitHub can be aggregated such that they match what is shown in [RKI's COVID-19 dashboard](https://corona.rki.de/).

## Preliminaries

In [1]:
import json

import datetime as dt
import numpy as np
import pandas as pd

from urllib.request import urlopen

In [2]:
# define constants that are used for formatting output 
 
INTEGER_FORMAT_STR = "{:,.0f}"
FOUR_DIGIT_YEAR_FORMAT_STR = "{:.0f}"
LOW_PRECISION_FLOAT_FORMAT_STR = "{:,.1f}"
MEDIUM_PRECISION_FLOAT_FORMAT_STR = "{:,.3f}"
HIGH_PRECISION_FLOAT_FORMAT_STR = "{:,.6f}" 
PERCENT_FORMAT_STR = "{:.2%}"
PER_THOUSAND_FORMATTER = lambda p: f"{p*1000:.2f}\u2030"
DATE_FORMATTER = lambda d: d.strftime("%m-%d")
DUMMY_FORMAT_STR = "{:}"

FORMAT_MAPPER = {"area": INTEGER_FORMAT_STR,
                 "category": DUMMY_FORMAT_STR,
                 "cases": INTEGER_FORMAT_STR,
                 "cases last 7 days": INTEGER_FORMAT_STR,
                 "cases per 100k": LOW_PRECISION_FLOAT_FORMAT_STR,
                 "cases per pop.": PERCENT_FORMAT_STR,
                 "cases last 7 days per 100k": LOW_PRECISION_FLOAT_FORMAT_STR,
                 "deaths": INTEGER_FORMAT_STR,
                 "deaths last 7 days": INTEGER_FORMAT_STR,
                 "deaths per 100k": HIGH_PRECISION_FLOAT_FORMAT_STR,
                 "deaths per pop.": PER_THOUSAND_FORMATTER,
                 "deaths last 7 days per 100k": MEDIUM_PRECISION_FLOAT_FORMAT_STR,
                 "death rate": PERCENT_FORMAT_STR,
                 "death rate last 7 days": PERCENT_FORMAT_STR,
                 "recovered": INTEGER_FORMAT_STR,
                 "recovered last 7 days": INTEGER_FORMAT_STR,
                 "recovered per 100k": LOW_PRECISION_FLOAT_FORMAT_STR,
                 "recovered per pop.": PERCENT_FORMAT_STR,
                 "recovered last 7 days per 100k": LOW_PRECISION_FLOAT_FORMAT_STR,
                 "district name": DUMMY_FORMAT_STR ,
                 "population": INTEGER_FORMAT_STR,
                 "state ID": DUMMY_FORMAT_STR,
                 "state name": DUMMY_FORMAT_STR ,
                 "state pop.": INTEGER_FORMAT_STR,
                 "update time": DATE_FORMATTER,
}
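
The mapper above mixes two kinds of formatters: `str.format` templates and callables. As a minimal sketch of how such entries turn numbers into text, the following applies a reduced copy of the mapper to made-up sample values. The helper `apply_formatter` is hypothetical and only mimics the way a pandas `Styler` treats string and callable formatters.

```python
# Hypothetical helper (not part of the notebook): apply a formatter that is
# either a str.format template or a callable.
def apply_formatter(formatter, value):
    if callable(formatter):
        return formatter(value)
    return formatter.format(value)

# reduced copy of FORMAT_MAPPER with one entry per kind of formatter
demo_mapper = {
    "cases": "{:,.0f}",                                   # integer with thousands separator
    "cases per pop.": "{:.2%}",                           # percentage
    "deaths per pop.": lambda p: f"{p*1000:.2f}\u2030",   # per-thousand value
}

# made-up sample values, for illustration only
demo_values = {"cases": 4134779, "cases per pop.": 0.04973, "deaths per pop.": 0.00112}

formatted = {key: apply_formatter(demo_mapper[key], value) for key, value in demo_values.items()}
print(formatted)
```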

## Load and Analyze Data From the [NPGEO Corona Hub 2020](https://npgeo-corona-npgeo-de.hub.arcgis.com/)

For all German districts, up-to-date COVID-19 data is available via [this](https://opendata.arcgis.com/datasets/917fc37a709542548cc3be077a786c17_0) page. This data appears to be the basis for the [COVID-19 dashboard](https://corona.rki.de/).

In [3]:
# URL for loading the data as a comma-separated values (CSV) file

RKI_ARCGIS_URL = "https://opendata.arcgis.com/datasets/917fc37a709542548cc3be077a786c17_0.csv"

In [4]:
# dictionary for renaming and harmonizing column names 

RKI_ARCGIS_COLUMN_NAME_MAPPER = {"cases7_per_100k": "cases last 7 days per 100k",
                                 "cases7_lk" : "cases last 7 days",
                                 "county": "district name",
                                 "cases_per_100k": "cases per 100k",
                                 "cases_per_population": "cases per pop.",
                                 "cases": "cases",
                                 "deaths": "deaths",
                                 "death_rate": "death rate",
                                 "death7_lk": "deaths last 7 days",
                                 "EWZ": "population",
                                 "BL_ID": "state ID",
                                 "EWZ_BL": "state pop.",
                                 "last_update": "update time",
                                 "RS": "district ID",
                                 "BEZ": "category",
                                 "GEN": "municipality name",
                                 "BL": "state name",
                                 "KFL": "area",
}

RKI_ARCGIS_VALUE_CONVERTERS = {"last_update": lambda s: dt.datetime.strptime(s, "%d.%m.%Y, %H:%M Uhr")}
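
The converter for `last_update` expects German-style timestamps. The following sketch applies the same format string to a sample string; the sample value is an assumption about what the column contains and is not taken from the actual data.

```python
import datetime as dt

# same format string as in RKI_ARCGIS_VALUE_CONVERTERS
parse_update_time = lambda s: dt.datetime.strptime(s, "%d.%m.%Y, %H:%M Uhr")

# made-up sample value in the assumed layout "day.month.year, hour:minute Uhr"
sample = "18.09.2021, 00:00 Uhr"
parsed = parse_update_time(sample)
print(parsed)
```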

### Read Data

In [5]:
# load main data, but restrict the created dataframe to the most relevant columns 

RKI_ARCGIS_COVID_BY_DISTRICT = \
    pd.read_csv(RKI_ARCGIS_URL, usecols=list(RKI_ARCGIS_COLUMN_NAME_MAPPER.keys()),
                converters=RKI_ARCGIS_VALUE_CONVERTERS)\
        .rename(columns=RKI_ARCGIS_COLUMN_NAME_MAPPER)\
        .sort_values(by="district ID")\
        .set_index("district ID")

# add a column for the number of deaths within the last seven days, normalized to a population of 100,000 people

RKI_ARCGIS_COVID_BY_DISTRICT["deaths last 7 days per 100k"] = \
    10**5 * RKI_ARCGIS_COVID_BY_DISTRICT["deaths last 7 days"] \
    / RKI_ARCGIS_COVID_BY_DISTRICT["population"]

In [6]:
# print date of most recent entry in the loaded data

print(RKI_ARCGIS_COVID_BY_DISTRICT["update time"].max().strftime("last update is from %Y-%m-%d"))

last update is from 2021-09-18


## Load and Analyze RKI's Data on COVID-19 From [GitHub](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland)

The repository ["SARS-CoV-2 Infektionen in Deutschland"](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland) (SARS-CoV-2 Infections in Germany) contains
up-to-date numbers of COVID-19 cases in Germany. The data appears to be what the districts report to the RKI, as it lists new cases by reporting date, onset of disease (reference date), age group, sex and district.

In [Readme.md](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland/blob/master/Readme.md), an explanation for the data is provided (in German). 

In [7]:
# define constants for the URLs that give access to data

RKI_GITHUB_REPOSITORY_URL = "https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland"
RKI_GITHUB_RAW_DATA_BASE_URL = RKI_GITHUB_REPOSITORY_URL + "/raw/master"
RKI_GITHUB_ZENODO_REL_URL = "/.zenodo.json"
RKI_GITHUB_COVID_INFECTIONS_REL_URL = "/Aktuell_Deutschland_SarsCov2_Infektionen.csv"

In [8]:
# dictionary for renaming and harmonizing column names 

RKI_GITHUB_COLUMN_NAME_MAPPER = {
    "IdLandkreis": "district ID", 
    "Gemeindename": "municipality name", 
    "Flaeche": "area", 
    "EW_insgesamt": "population",
    "EW_maennlich": "population male", 
    "EW_weiblich" : "population female",
    "Altersgruppe": "age group", 
    "Geschlecht": "sex", 
    "Meldedatum": "reporting date", 
    "Refdatum": "reference date",
    "IstErkrankungsbeginn": "is start of desease", 
    "NeuerFall": "is new case", 
    "NeuerTodesfall": "is new death", 
    "NeuGenesen": "is new recovered",
    "AnzahlFall": "cases", 
    "AnzahlTodesfall": "deaths", 
    "AnzahlGenesen": "recovered",
}

unknown_converter = lambda s : "unknown" if s == "unbekannt" else s

RKI_GITHUB_VALUE_CONVERTERS = {
    "Altersgruppe": unknown_converter,
    "Geschlecht": lambda s: "F" if s == "W" else unknown_converter(s), 
}

# dictionary for setting the data type for some of the columns
# (recent pandas versions require an explicit unit such as "datetime64[ns]")

RKI_GITHUB_COLUMN_TYPES_MAPPER = {
    "district ID": "int16",
    "reporting date": "datetime64[ns]", 
    "reference date": "datetime64[ns]",
    "is start of desease": "int32", 
    "is new case": "int8", 
    "is new death": "int8", 
    "is new recovered": "int8",
}

### Read Metadata

The RKI publishes metadata based on [zenodo's](https://about.zenodo.org/) JSON format. Here, it is used to detect the publication date of the data. Typically, this is shortly after 3 AM of the current day (local time in Germany). 

In [9]:
RKI_GITHUB_METADATA = json.loads(urlopen(RKI_GITHUB_RAW_DATA_BASE_URL + RKI_GITHUB_ZENODO_REL_URL).read())
RKI_GITHUB_PUBLICATION_DATE = pd.to_datetime(RKI_GITHUB_METADATA["publication_date"])
print(f"publication date is {RKI_GITHUB_PUBLICATION_DATE:%Y-%m-%d %H:%M:%S%z}")

publication date is 2021-09-18 03:40:19+0200
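
As a self-contained sketch of the parsing step, the snippet below uses a hand-written JSON string instead of the real `.zenodo.json` file. The timestamp layout of the sample is an assumption chosen to reproduce the output above.

```python
import json

import pandas as pd

# made-up stand-in for the downloaded metadata; only the key used above is included
sample_metadata = json.loads('{"publication_date": "2021-09-18T03:40:19+02:00"}')

# pd.to_datetime keeps the UTC offset, so %z can be used when formatting
publication_date = pd.to_datetime(sample_metadata["publication_date"])
text = f"{publication_date:%Y-%m-%d %H:%M:%S%z}"
print(text)
```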


### Read Data

Load data describing COVID-19 infections and deaths etc.

In [10]:
RKI_GITHUB_COVID_INFECTIONS = pd.read_csv(RKI_GITHUB_RAW_DATA_BASE_URL + RKI_GITHUB_COVID_INFECTIONS_REL_URL, 
                                    converters=RKI_GITHUB_VALUE_CONVERTERS)\
                            .rename(columns=RKI_GITHUB_COLUMN_NAME_MAPPER)\
                            .astype(RKI_GITHUB_COLUMN_TYPES_MAPPER)

### Trying to Understand [Readme.md](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland/blob/master/Readme.md)

According to [Readme.md](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland/blob/master/Readme.md), the values for the number of 
infected (column `cases`, `AnzahlFall` in the original data), deceased (column `deaths`, `AnzahlTodesfall` in the original data) and recovered 
(column `recovered`, `AnzahlGenesen` in the original data) people are **[natural numbers](https://en.wikipedia.org/wiki/Natural_number)** (i.e. elements of {1,2,3 ...}). 

However, negative values seem to occur in the data. 

In [11]:
# show that there are rows for which the value in one of the columns "cases", "deaths" or "recovered" is negative
# (.any() also works if no such row exists, whereas .value_counts()[True] would raise a KeyError)

((RKI_GITHUB_COVID_INFECTIONS["cases"] < 0) |
    (RKI_GITHUB_COVID_INFECTIONS["deaths"] < 0) |
        (RKI_GITHUB_COVID_INFECTIONS["recovered"] < 0))\
            .any()

True

Hence, despite what's stated in `Readme.md`, values of the respective columns are **integers** (i.e. elements of {... -3, -2, -1, 0,1,2,3 ...}) and **not** natural numbers. 

If one wants to determine the total number of COVID-19 cases reported in Germany, `Readme.md` seems to imply that only those rows should be considered for which the value in 
column `is new case` (`NeuerFall` in the original data) is not 0, because 0 indicates that the respective case was already reported previously. 

Additionally, one would think that rows for which the value in column `is new case` is -1 should be subtracted, because this value indicates that cases have been falsely reported in the past.

However, it appears that 
$$
n = -1 \Leftrightarrow c < 0
$$
where $n$ stands for the value in column `is new case` and $c$ for the value in column `cases` of a row. This means that the values of such rows should **not be subtracted but added** (since they are already negative).

In [12]:
# verify that in all rows of dataframe RKI_GITHUB_COVID_INFECTIONS, the value in column "is new case" == -1 iff the value in column "cases" is < 0

equiv = lambda a,b: ((~a) | b) & ((~b) | a) 
equiv(RKI_GITHUB_COVID_INFECTIONS["is new case"] == -1, RKI_GITHUB_COVID_INFECTIONS["cases"] < 0).all()

True
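
For boolean columns, the logical equivalence used above amounts to element-wise equality. The following sketch checks this on a small made-up table; the values are illustrative and not taken from the RKI data.

```python
import pandas as pd

# made-up rows: the flag is -1 exactly where the count is negative
toy = pd.DataFrame({"is new case": [0, 1, -1, 0, -1],
                    "cases":       [3, 2, -1, 5, -2]})

a = toy["is new case"] == -1
b = toy["cases"] < 0

# equivalence written as two implications, as in the cell above
equiv = lambda a, b: ((~a) | b) & ((~b) | a)

# for boolean Series, "a if and only if b" is the same as "a == b"
same_as_equality = bool((equiv(a, b) == (a == b)).all())
holds = bool((a == b).all())
print(same_as_equality, holds)
```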

Hence, it should be correct to simply add all values of column `cases`, where `is new case` is not 0 to get the total number of cases in Germany ...

In [13]:
# print the sum for column "cases", filtered by "is new case" != 0 (i.e. is new case) 

print(f'{RKI_GITHUB_COVID_INFECTIONS.loc[RKI_GITHUB_COVID_INFECTIONS["is new case"] != 0, "cases"].sum():,d}')

8,901


... but this number is **way too small**. 

The number that is shown on the dashboard can be retrieved from the data by adding _all positive values_ in column `cases`. 

In [14]:
print("total cases for Germany:")
# print the sum for all positive values in column "cases" 
print(f' GitHub: {RKI_GITHUB_COVID_INFECTIONS.loc[RKI_GITHUB_COVID_INFECTIONS["cases"] > 0, "cases"].sum():,d}')
# sum of all cases in the data from ARCGIS
print(f' ARCGIS: {RKI_ARCGIS_COVID_BY_DISTRICT["cases"].sum():,d}')

total cases for Germany:
 GitHub: 4,134,779
 ARCGIS: 4,134,779
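
The difference between the two filters can be illustrated on a small made-up table. Given the equivalence verified above (the flag is -1 exactly when the count is negative), keeping all positive counts gives the same sum as excluding the rows whose flag is -1, because rows with a count of 0 contribute nothing. The numbers below are invented for illustration.

```python
import pandas as pd

# made-up rows that respect "is new case == -1 if and only if cases < 0"
toy = pd.DataFrame({"is new case": [0, 0, 1, -1, 0],
                    "cases":       [10, 7, 2, -1, 4]})

# filter used in cell 13: only rows whose flag is not 0
sum_flag_nonzero = toy.loc[toy["is new case"] != 0, "cases"].sum()

# filter used in cell 14: only rows with a positive count
sum_positive = toy.loc[toy["cases"] > 0, "cases"].sum()

# equivalent formulation in terms of the flag: drop only the rows flagged -1
sum_flag_not_minus_one = toy.loc[toy["is new case"] != -1, "cases"].sum()

print(sum_flag_nonzero, sum_positive, sum_flag_not_minus_one)
```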


### Compute Totals

Based on the interpretation of the data described above, the sum of columns `cases`, `deaths` and `recovered` is calculated for each district.

For the sake of convenience, a new dataframe is defined, in which negative values for the number of COVID-19 cases, deaths and recovered patients are set to zero. This is used later for the aggregation of data.

In [15]:
# define a dataframe in which the negative values of RKI_GITHUB_COVID_INFECTIONS are replaced by zero

RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES = \
    pd.concat([RKI_GITHUB_COVID_INFECTIONS[["district ID", "age group", "sex", "reporting date", "reference date"]], 
              RKI_GITHUB_COVID_INFECTIONS[["cases", "deaths", "recovered"]].apply(lambda a: np.maximum(a,0))], 
              axis="columns")
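
As a sketch on made-up data, the same replacement of negative values can also be written with `DataFrame.clip`, which may read more directly than `np.maximum` inside `apply`. The toy table below is hypothetical.

```python
import numpy as np
import pandas as pd

# made-up rows containing a few negative corrections
toy = pd.DataFrame({"district ID": [1001, 1001, 1002],
                    "cases":       [5, -1, 3],
                    "deaths":      [0, 0, -1],
                    "recovered":   [4, -1, 2]})
value_columns = ["cases", "deaths", "recovered"]

# approach used in the cell above
via_maximum = toy[value_columns].apply(lambda a: np.maximum(a, 0))

# alternative: clip all values at a lower bound of 0
via_clip = toy[value_columns].clip(lower=0)

without_negatives = pd.concat([toy[["district ID"]], via_clip], axis="columns")
print(without_negatives)
```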

In [16]:
#  sum up "cases", "deaths", "recovered" for each district
RKI_GITHUB_COVID_BY_DISTRICT_TOTALS = \
    RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES[["district ID", "cases", "deaths", "recovered"]].groupby(by="district ID").sum()

# copy population size for each district from RKI_ARCGIS_COVID_BY_DISTRICT
RKI_GITHUB_COVID_BY_DISTRICT_TOTALS["population"] = RKI_ARCGIS_COVID_BY_DISTRICT["population"]

### Compute Numbers per 100K People 

The sums in columns `cases`, `deaths` and `recovered` are normalized by the population size of the district. 
For the district's population size, the data from the [NPGEO Corona Hub 2020](https://npgeo-corona-npgeo-de.hub.arcgis.com/) is used.

In [17]:
# define a new dataframe by dividing the totals by the population size of each district and multiplying that with 100,000

RKI_GITHUB_COVID_BY_DISTRICT_PER_100K = \
    pd.DataFrame(data=10**5 * RKI_GITHUB_COVID_BY_DISTRICT_TOTALS[["cases", "deaths", "recovered"]].values \
                 / np.array(3 * [RKI_GITHUB_COVID_BY_DISTRICT_TOTALS["population"].values]).T,
                 index=RKI_GITHUB_COVID_BY_DISTRICT_TOTALS.index,
                 columns=["cases per 100k", "deaths per 100k", "recovered per 100k"])
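
A possible alternative to tiling the population array is `DataFrame.div` with `axis="index"`, which divides every column by the population of the matching district. The sketch below uses the totals of two districts from the result table further down; the district IDs 1 and 2 are placeholders, not the real IDs.

```python
import pandas as pd

# totals and population of LK Traunstein and LK Berchtesgadener Land as shown in
# the result table of this notebook; the index values are placeholder IDs
totals = pd.DataFrame({"cases": [12237, 6967],
                       "deaths": [218, 102],
                       "recovered": [11391, 6525],
                       "population": [177485, 106327]},
                      index=pd.Index([1, 2], name="district ID"))

# divide each column by the district's population (aligned on the index) and scale to 100,000
per_100k = totals[["cases", "deaths", "recovered"]]\
    .div(totals["population"], axis="index")\
    .mul(10**5)\
    .add_suffix(" per 100k")

print(per_100k.round(1))
```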

### Compute Totals for the Last Seven Days

The values in columns `cases`, `deaths` and `recovered` of the last seven days are summed up.

In [18]:
# define the date of the earliest data that should be considered
number_of_days = 7 
cut_off_date = np.datetime64(RKI_GITHUB_PUBLICATION_DATE.date() - dt.timedelta(days=number_of_days))

# compute sums for data that has been reported on or after the cut_off_date
RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS = \
        RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES.loc[RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES["reporting date"] >= cut_off_date]\
                .groupby(by="district ID").sum()\
                .rename(columns={"cases": "cases last 7 days", "deaths": "deaths last 7 days", "recovered": "recovered last 7 days"})

# ensure that there is data for each district by filling missing data with zeros
# define a dataframe containing the most recent data of each district
df = RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES[["district ID", "reporting date"]].sort_values(by=["district ID", "reporting date"])\
        .groupby(["district ID"]).last()

missing_rows = df.loc[df["reporting date"] < cut_off_date].index.shape[0]
if  missing_rows > 0:
    # there are districts that have not reported data within the last 7 days
    # create dataframe containing the zeroes filling the missing data 
    ddf = pd.DataFrame(data=np.zeros((missing_rows, RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS.shape[1]), dtype=np.int64), 
                       index=df.loc[df["reporting date"] < cut_off_date].index,
                       columns=RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS.columns)
    # append zeros
    # (pd.concat is used because DataFrame.append is not available in recent pandas versions)
    RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS = pd.concat([RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS, ddf])
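
The whole step (filter by cut-off date, sum per district, fill districts without recent reports with zeros) can be sketched on made-up data. Here `reindex` with `fill_value=0` is used as one possible alternative to building the zero rows by hand; all dates and counts are invented.

```python
import pandas as pd

# made-up reports for three districts; district 2 has no report in the last 7 days
toy = pd.DataFrame({
    "district ID": [1, 1, 2, 3],
    "reporting date": pd.to_datetime(["2021-09-10", "2021-09-15", "2021-09-01", "2021-09-17"]),
    "cases": [4, 2, 7, 1],
    "deaths": [0, 1, 0, 0],
    "recovered": [3, 0, 6, 0],
})

publication_date = pd.Timestamp("2021-09-18")
cut_off_date = publication_date - pd.Timedelta(days=7)

# keep only reports on or after the cut-off date and sum them per district
recent = toy.loc[toy["reporting date"] >= cut_off_date]
last_7_days = recent.groupby("district ID")[["cases", "deaths", "recovered"]].sum()

# make sure every district has a row; districts without recent reports get zeros
all_districts = sorted(toy["district ID"].unique())
last_7_days = last_7_days.reindex(all_districts, fill_value=0).add_suffix(" last 7 days")

print(last_7_days)
```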

### Compute Numbers for the Last Seven Days per 100K People 

The sums for the last seven days are normalized by the districts' population size.

In [19]:
# define a new dataframe computing values by dividing the totals for the last seven days by the population size of each district and multiplying it with 100,000
 
RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_PER_100K = \
    pd.DataFrame(data=10**5 * RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS.values \
                 / np.array(3 * [RKI_GITHUB_COVID_BY_DISTRICT_TOTALS.loc[RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS.index, "population"].values]).T,
                 index=RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS.index,
                 columns=["cases last 7 days per 100k", "deaths last 7 days per 100k", "recovered last 7 days per 100k"])

### Create a DataFrame Containing all Values Derived From the COVID-19 Data from GitHub

In [20]:
# combine all of the dataframes with data that has been derived from RKI_GITHUB_COVID_INFECTIONS into one dataframe, 
# only the district names are taken from RKI_ARCGIS_COVID_BY_DISTRICT

RKI_GITHUB_COVID_BY_DISTRICT = \
    pd.concat([RKI_ARCGIS_COVID_BY_DISTRICT["district name"],
               RKI_GITHUB_COVID_BY_DISTRICT_TOTALS, 
               RKI_GITHUB_COVID_BY_DISTRICT_PER_100K, 
               RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_TOTALS,
               RKI_GITHUB_COVID_BY_DISTRICT_LAST_7_DAYS_PER_100K], axis="columns")

## Verify that Derived Data From the [COVID-19 Data from GitHub](https://github.com/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland) and the Data From [NPGEO Corona Hub 2020](https://npgeo-corona-npgeo-de.hub.arcgis.com/) are (More or Less) Identical

In [21]:
EPSILON = 10**-11 # threshold for treating float values as zero
common_columns = list(set(RKI_ARCGIS_COVID_BY_DISTRICT.columns) & set(RKI_GITHUB_COVID_BY_DISTRICT.columns)) 
common_numerical_columns = [c for c in common_columns if RKI_GITHUB_COVID_BY_DISTRICT[c].dtypes != object]

# return True if the absolute values of all differences in the common numerical columns are smaller than the threshold

np.max(np.max(np.abs(RKI_ARCGIS_COVID_BY_DISTRICT[common_numerical_columns] - RKI_GITHUB_COVID_BY_DISTRICT[common_numerical_columns]))) < EPSILON

True
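
The comparison relies on pandas aligning both dataframes on their index and columns before subtracting. The following sketch shows this on two made-up tables whose rows are in a different order and whose float values differ by a tiny rounding error.

```python
import pandas as pd

EPSILON = 10**-11  # same threshold as above

# made-up tables with identical content up to a tiny floating point deviation
a = pd.DataFrame({"cases": [100, 200], "cases per 100k": [56.3, 188.1]}, index=[1, 2])
b = pd.DataFrame({"cases": [200, 100], "cases per 100k": [188.1, 56.3 + 1e-13]}, index=[2, 1])

# the subtraction aligns rows by index label, so the row order does not matter
max_abs_diff = (a - b).abs().to_numpy().max()
matches = bool(max_abs_diff < EPSILON)
print(matches)
```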

## Show Some Data

Based on the result of the verification above, it seems that the interpretation of the data provided via GitHub is correct, or at least identical to what is shown
in the COVID-19 dashboard. Hence, the following list of the most affected districts in Germany can be assumed to be correct.

In [22]:
n = 30
RKI_GITHUB_COVID_BY_DISTRICT.sort_values(by="cases last 7 days per 100k", ascending=False).head(n).style.hide_index().format(FORMAT_MAPPER)

| district name | cases | deaths | recovered | population | cases per 100k | deaths per 100k | recovered per 100k | cases last 7 days | deaths last 7 days | recovered last 7 days | cases last 7 days per 100k | deaths last 7 days per 100k | recovered last 7 days per 100k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LK Traunstein | 12237 | 218 | 11391 | 177485 | 6894.7 | 122.827281 | 6418.0 | 393 | 0 | 4 | 221.4 | 0.0 | 2.3 |
| LK Berchtesgadener Land | 6967 | 102 | 6525 | 106327 | 6552.4 | 95.930479 | 6136.7 | 206 | 0 | 3 | 193.7 | 0.0 | 2.8 |
| SK Bremerhaven | 5549 | 108 | 4713 | 113557 | 4886.5 | 95.106422 | 4150.3 | 218 | 1 | 0 | 192.0 | 0.881 | 0.0 |
| SK Neustadt a.d.Weinstraße | 2103 | 39 | 1784 | 53306 | 3945.1 | 73.162496 | 3346.7 | 99 | 0 | 1 | 185.7 | 0.0 | 1.9 |
| SK Rosenheim | 4588 | 72 | 4222 | 63591 | 7214.9 | 113.223569 | 6639.3 | 114 | 1 | 0 | 179.3 | 1.573 | 0.0 |
| SK Offenbach | 10641 | 185 | 9888 | 130892 | 8129.6 | 141.337897 | 7554.3 | 232 | 0 | 0 | 177.2 | 0.0 | 0.0 |
| LK Rosenheim | 15742 | 468 | 14265 | 261721 | 6014.8 | 178.816373 | 5450.5 | 451 | 0 | 3 | 172.3 | 0.0 | 1.1 |
| LK Lippe | 19794 | 407 | 18597 | 346970 | 5704.8 | 117.301208 | 5359.8 | 579 | 0 | 44 | 166.9 | 0.0 | 12.7 |
| SK Wuppertal | 24383 | 495 | 22675 | 355004 | 6868.4 | 139.435049 | 6387.3 | 578 | 0 | 19 | 162.8 | 0.0 | 5.4 |
| LK Ahrweiler | 5508 | 55 | 4662 | 130479 | 4221.4 | 42.152377 | 3573.0 | 211 | 0 | 0 | 161.7 | 0.0 | 0.0 |


Equally, the following totals for all of Germany appear to be correct:

In [23]:
columns = ["population", "cases", "cases last 7 days", "deaths", "deaths last 7 days", "recovered", "recovered last 7 days"]
RKI_GITHUB_COVID_BY_DISTRICT[columns].sum().to_frame().T.style.hide_index().format(FORMAT_MAPPER)

| population | cases | cases last 7 days | deaths | deaths last 7 days | recovered | recovered last 7 days |
|---|---|---|---|---|---|---|
| 83148406 | 4134779 | 59846 | 92920 | 43 | 3882704 | 979 |


Likewise, the sums for the last 7 days per 100,000 people can be computed:

In [24]:
columns = ["cases last 7 days", "deaths last 7 days", "recovered last 7 days"]
(10**5 * RKI_GITHUB_COVID_BY_DISTRICT[columns].sum() / RKI_GITHUB_COVID_BY_DISTRICT["population"].sum())\
    .to_frame().T.rename(columns={c:c+" per 100k" for c in columns}).style.hide_index().format(FORMAT_MAPPER)

| cases last 7 days per 100k | deaths last 7 days per 100k | recovered last 7 days per 100k |
|---|---|---|
| 72.0 | 0.052 | 1.2 |


However, the following numbers, which should reflect the increase of cases, deaths and recovered people during the last day, differ from what the RKI reports. 
This is a mystery ...

In [25]:
columns = ["district ID", "reporting date"]
RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES.sort_values(by=columns).groupby(by=columns).sum()\
    .groupby(by="district ID").last().sum()\
        .to_frame().T.style.hide_index().format(FORMAT_MAPPER)

| cases | deaths | recovered |
|---|---|---|
| 6619 | 2 | 27 |


### Compute Totals by Age Group and Sex

In order to close this notebook with something that is not mysterious, the total number of cases, deaths and recovered people is computed by age group and sex. 
This is again in sync with the data of the COVID-19 dashboard.

In [26]:
RKI_GITHUB_COVID_BY_AGE_GROUP_AND_SEX_TOTALS = \
    RKI_GITHUB_COVID_INFECTIONS_WITHOUT_NEGATIVES[["age group", "sex", "cases", "deaths", "recovered"]].groupby(by=["age group", "sex"]).sum()
RKI_GITHUB_COVID_BY_AGE_GROUP_AND_SEX_TOTALS.style.format(FORMAT_MAPPER)

| age group | sex | cases | deaths | recovered |
|---|---|---|---|---|
| A00-A04 | F | 54164 | 9 | 50781 |
| A00-A04 | M | 58420 | 3 | 54715 |
| A00-A04 | unknown | 1422 | 0 | 1193 |
| A05-A14 | F | 164478 | 4 | 149229 |
| A05-A14 | M | 180035 | 3 | 163051 |
| A05-A14 | unknown | 3915 | 0 | 3042 |
| A15-A34 | F | 619415 | 74 | 592206 |
| A15-A34 | M | 633929 | 129 | 605344 |
| A15-A34 | unknown | 7721 | 0 | 7017 |
| A35-A59 | F | 790023 | 1239 | 764462 |
