# Exploration of the datasets

This notebook is dedicated to the exploration and understanding of the three datasets related to breast cancer.  
The objective here is to analyze their structure, quality, and characteristics in order to design an appropriate cleaning and preparation strategy later.

In [2]:
# Import libraries
import pandas as pd

## Loading datasets into DataFrames

In [3]:
df_screen = pd.read_csv('../data/breast_cancer_screening.csv')
df_death = pd.read_csv('../data/death_due_to_cancer.csv')
df_exam = pd.read_csv('../data/breast_exam_income.csv')

Three CSV files were downloaded from **data.europa.eu**:

1. `breast_cancer_screening.csv`: percentage of women participating in breast cancer screening programs (2000–2021).

https://data.europa.eu/data/datasets/75kk9hje0s7cm2idhpvvww?locale=en
<br/><br/>

2. `death_due_to_cancer.csv`: death rates due to cancer by sex and country.

https://data.europa.eu/data/datasets/is1cbzt2xixmwv630aoqpw?locale=en
<br/><br/>

3. `breast_exam_income.csv`: percentage of women who reported having a breast examination (X-ray) by income level and age.

https://data.europa.eu/data/datasets/otvi02wdhgtfmmgvvkvgxa?locale=en
<br/><br/>

These datasets together allow us to analyze how screening participation, mortality, and socioeconomic factors interact in the context of breast cancer awareness.

## 1. Breast cancer screening dataset

This dataset contains information about the **percentage of women screened for breast cancer** (mammography) in each European country between **2000 and 2021**.

We begin by inspecting its structure and identifying potential quality issues such as missing values, duplicates, or redundant columns.

In [94]:
# Display first few rows
df_screen.head()

| | DATAFLOW | LAST UPDATE | freq | unit | source | icd10 | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | CONF_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESTAT:HLTH_PS_SCRE(1.0) | 12/07/22 11:00:00 | A | PC | PRG | C50 | BE | 2001 | 50.0 | NaN | NaN |
| 1 | ESTAT:HLTH_PS_SCRE(1.0) | 12/07/22 11:00:00 | A | PC | PRG | C50 | BE | 2002 | 54.0 | NaN | NaN |
| 2 | ESTAT:HLTH_PS_SCRE(1.0) | 12/07/22 11:00:00 | A | PC | PRG | C50 | BE | 2003 | 53.7 | NaN | NaN |
| 3 | ESTAT:HLTH_PS_SCRE(1.0) | 12/07/22 11:00:00 | A | PC | PRG | C50 | BE | 2004 | 55.9 | NaN | NaN |
| 4 | ESTAT:HLTH_PS_SCRE(1.0) | 12/07/22 11:00:00 | A | PC | PRG | C50 | BE | 2005 | 56.6 | NaN | NaN |


- `DATAFLOW`: Dataset identifier
- `LAST UPDATE`: Date and time when the dataset was last updated (12/07/22)
- `freq`: Data collection frequency - here `A` stands for annual
- `unit`: Measurement unit - `PC` means percentage of the target population
- `source`: Indicates the type of screening program. Example: `PRG` refers to organized public screening programs
- `icd10`: ICD-10 code for the disease. `C50` corresponds to breast cancer (malignant neoplasm of the breast)
- `geo`: Country code (e.g., `BE` = Belgium, `FR` = France, `DE` = Germany)
- `TIME_PERIOD`: Year of data collection (from 2000 to 2021)
- `OBS_VALUE`: Observed value - the percentage of women screened for breast cancer in that country and year
- `OBS_FLAG`: Observation flag indicating data status (e.g., provisional, estimated, or missing)
- `CONF_STATUS`: Confidence status of the observation

In [95]:
# Shape of the dataset
print(df_screen.shape)

(963, 11)


The dataset contains 963 rows and 11 columns, which indicates a moderate dataset size, manageable for exploratory analysis.

In [96]:
# Information
print(df_screen.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 963 entries, 0 to 962
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATAFLOW     963 non-null    object 
 1   LAST UPDATE  963 non-null    object 
 2   freq         963 non-null    object 
 3   unit         963 non-null    object 
 4   source       963 non-null    object 
 5   icd10        963 non-null    object 
 6   geo          963 non-null    object 
 7   TIME_PERIOD  963 non-null    int64  
 8   OBS_VALUE    963 non-null    float64
 9   OBS_FLAG     314 non-null    object 
 10  CONF_STATUS  0 non-null      float64
dtypes: float64(2), int64(1), object(8)
memory usage: 82.9+ KB
None


From the `info()` output, we can see:
- `OBS_VALUE` (the main variable of interest) has no missing values.  
- `CONF_STATUS` is entirely missing and should be dropped.  
- `OBS_FLAG` has 649 missing values (963 - 314, about 67%) and limited analytical use.  
- Other columns like `DATAFLOW`, `LAST UPDATE`, `freq`, and `unit` are fully populated. The unique-value counts further down show that they are also constant, meaning they do not provide useful variation.

These findings suggest a relatively clean dataset, with a few redundant columns that can be safely removed.
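Whether a column is constant cannot be read from `info()` alone; counting distinct values per column is a more direct check. The following is a minimal sketch on a toy frame (the values are illustrative, not taken from the real file):

```python
import pandas as pd

# Toy frame mimicking part of df_screen's structure (illustrative values only)
toy = pd.DataFrame({
    "freq": ["A", "A", "A"],
    "unit": ["PC", "PC", "PC"],
    "geo": ["BE", "FR", "DE"],
    "OBS_VALUE": [50.0, 54.0, 53.7],
    "CONF_STATUS": [None, None, None],
})

# dropna=False counts NaN as a value, so an all-missing column is reported as constant too
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```

Applied to `df_screen`, the same two lines would list the candidate columns for removal in one step.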

In [97]:
# Overview of missing values (%)
missing = df_screen.isna().mean().sort_values(ascending=False) * 100
print(missing)

CONF_STATUS    100.000000
OBS_FLAG        67.393562
DATAFLOW         0.000000
LAST UPDATE      0.000000
freq             0.000000
unit             0.000000
source           0.000000
icd10            0.000000
geo              0.000000
TIME_PERIOD      0.000000
OBS_VALUE        0.000000
dtype: float64


The percentage of missing values shows that:
- `CONF_STATUS`: 100% missing  
- `OBS_FLAG`: 67% missing  
- All other columns are complete.

Since these two columns do not add analytical value, the best approach is to drop them from the dataset.
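Before `OBS_FLAG` is dropped, it may be worth tabulating which flags occur and on which rows, since a flag can explain an unusual value. This is a sketch on a toy frame; the flag codes used here are placeholders, not a statement about the codes present in the real file:

```python
import pandas as pd

# Toy data; the flag codes "b" and "e" are placeholders for illustration
toy = pd.DataFrame({
    "geo": ["BE", "BE", "FR", "FR", "DE"],
    "OBS_VALUE": [50.0, 54.0, 0.04, 61.8, 74.2],
    "OBS_FLAG": [None, "b", "e", None, "b"],
})

# Count each flag, keeping missing entries visible in the tally
flag_counts = toy["OBS_FLAG"].value_counts(dropna=False)
print(flag_counts)

# Rows that carry any flag, to review before the column is removed
flagged = toy[toy["OBS_FLAG"].notna()]
print(flagged)
```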

In [98]:
# Count duplicates
print(df_screen.duplicated().sum())

0


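There are no fully duplicated rows in the dataset. This does not by itself establish uniqueness by country and year: `source` and `icd10` each take two values, so a country-year pair can appear more than once until those filters are applied.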
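A key-level check complements the full-row duplicate count. The sketch below uses a toy frame in which `"SRVY"` is a hypothetical second source code; it shows how `duplicated(subset=...)` distinguishes the two notions of uniqueness:

```python
import pandas as pd

# Toy frame; "SRVY" is a hypothetical second value of `source`
toy = pd.DataFrame({
    "source": ["PRG", "SRVY", "PRG", "PRG"],
    "icd10": ["C50", "C50", "C50", "C50"],
    "geo": ["BE", "BE", "BE", "FR"],
    "TIME_PERIOD": [2001, 2001, 2002, 2001],
    "OBS_VALUE": [50.0, 48.0, 54.0, 60.0],
})

# Full-row duplicates: what df.duplicated() measures by default
full_row_dupes = int(toy.duplicated().sum())

# Repeats of the (country, year) pair, ignoring the other columns
country_year_dupes = int(toy.duplicated(subset=["geo", "TIME_PERIOD"]).sum())

# Repeats of the full candidate key
full_key_dupes = int(
    toy.duplicated(subset=["source", "icd10", "geo", "TIME_PERIOD"]).sum()
)

print(full_row_dupes, country_year_dupes, full_key_dupes)
```

In this toy case no row is fully duplicated, yet one country-year pair repeats because two sources report it.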

In [99]:
# Summary of numeric columns
print(df_screen.describe())

       TIME_PERIOD   OBS_VALUE  CONF_STATUS
count   963.000000  963.000000          0.0
mean   2012.401869   57.852347          NaN
std       5.376421   20.479383          NaN
min    2000.000000    0.040000          NaN
25%    2008.000000   46.025000          NaN
50%    2013.000000   61.800000          NaN
75%    2017.000000   74.200000          NaN
max    2021.000000   95.200000          NaN


The descriptive statistics show that:
- Screening rates (`OBS_VALUE`) range from 0.04% to 95.2%. The upper end is plausible for participation rates, while a minimum of 0.04% is unusually low and worth checking (for example against `OBS_FLAG`) before analysis.  
- The time period spans from 2000 to 2021, covering over two decades of public health data.  
- The mean screening rate is around 57.9%, with a relatively high standard deviation (20.5), indicating significant differences between countries and years.

In [100]:
# Unique values in categorical fields
for col in df_screen.select_dtypes('object'):
    print(f"{col}: {df_screen[col].nunique()} unique values")

DATAFLOW: 1 unique values
LAST UPDATE: 1 unique values
freq: 1 unique values
unit: 1 unique values
source: 2 unique values
icd10: 2 unique values
geo: 36 unique values
OBS_FLAG: 6 unique values


### Preliminary Cleaning Decisions

Based on the analysis above, the following actions will be applied:
1. Drop irrelevant columns: `DATAFLOW`, `LAST UPDATE`, `freq`, `CONF_STATUS`, and `OBS_FLAG`.  
2. Filter rows where `icd10 == 'C50'` (breast cancer only).  
3. Keep only rows where `source == 'PRG'` (organized screening programs).  
4. Rename columns for clarity:  
   - `geo` --> `country`  
   - `TIME_PERIOD` --> `year`  
   - `OBS_VALUE` --> `screening_rate`

These transformations will produce a clean dataset ready for analysis and integration with the mortality and income datasets.
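The four decisions above could be sketched as a single function. This is an illustration on toy rows rather than the project's actual cleaning code; `clean_screening` is a hypothetical name, and `"SRVY"` and `"C53"` are stand-ins for the second `source` and `icd10` values:

```python
import pandas as pd

def clean_screening(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preliminary cleaning decisions to a screening-like frame."""
    # 1. Drop administrative or mostly empty columns
    out = df.drop(columns=["DATAFLOW", "LAST UPDATE", "freq", "CONF_STATUS", "OBS_FLAG"])
    # 2-3. Keep breast cancer rows from organized screening programs
    out = out[(out["icd10"] == "C50") & (out["source"] == "PRG")]
    # 4. Rename columns for clarity
    out = out.rename(columns={
        "geo": "country",
        "TIME_PERIOD": "year",
        "OBS_VALUE": "screening_rate",
    })
    return out.reset_index(drop=True)

# Toy rows with the same columns as df_screen (illustrative values only)
toy = pd.DataFrame({
    "DATAFLOW": ["ESTAT:HLTH_PS_SCRE(1.0)"] * 3,
    "LAST UPDATE": ["12/07/22 11:00:00"] * 3,
    "freq": ["A"] * 3,
    "unit": ["PC"] * 3,
    "source": ["PRG", "SRVY", "PRG"],
    "icd10": ["C50", "C50", "C53"],
    "geo": ["BE", "BE", "BE"],
    "TIME_PERIOD": [2001, 2001, 2001],
    "OBS_VALUE": [50.0, 48.0, 60.0],
    "OBS_FLAG": [None, None, "e"],
    "CONF_STATUS": [float("nan")] * 3,
})

cleaned = clean_screening(toy)
print(cleaned)
```

Returning a new frame instead of modifying `df` in place keeps the raw data available for later checks.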

## 2. Death rate by cancer dataset

This dataset provides information on death rates due to cancer.  
Our interest is in breast cancer deaths (ICD-10 code C50) among women, so one of the first things to verify is which cancer codes this extract actually contains.

In [101]:
# Display first few rows
df_death.head()

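| | DATAFLOW | LAST UPDATE | freq | unit | age | icd10 | sex | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | CONF_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESTAT:TPS00116(1.0) | 21/03/25 11:00:00 | A | RT | TOTAL | C | F | AT | 2011 | 205.15 | NaN | NaN |
| 1 | ESTAT:TPS00116(1.0) | 21/03/25 11:00:00 | A | RT | TOTAL | C | F | AT | 2012 | 206.3 | NaN | NaN |
| 2 | ESTAT:TPS00116(1.0) | 21/03/25 11:00:00 | A | RT | TOTAL | C | F | AT | 2013 | 198.24 | NaN | NaN |
| 3 | ESTAT:TPS00116(1.0) | 21/03/25 11:00:00 | A | RT | TOTAL | C | F | AT | 2014 | 202.41 | NaN | NaN |
| 4 | ESTAT:TPS00116(1.0) | 21/03/25 11:00:00 | A | RT | TOTAL | C | F | AT | 2015 | 194.97 | NaN | NaN |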


- `DATAFLOW`: Dataset identifier  
- `LAST UPDATE`: Date and time when the dataset was last updated (`21/03/25 11:00:00`)  
- `freq`: Data frequency - `A` stands for annual reporting
- `unit`: Unit of measurement - `RT` indicates rate per 100 000 inhabitants, standardized by age
- `age`: Age category of the population (e.g., `TOTAL` means all age groups combined)
- `icd10`: ICD-10 classification code for the type of cancer
- `sex`: Gender of the population
  - `F` = Female, `M` = Male, `T` = Total (both sexes combined)
- `geo`: Country code
- `TIME_PERIOD`: Year of data collection
- `OBS_VALUE`: Observed death rate for that cancer type, country, and year (per 100 000 inhabitants)
- `OBS_FLAG`: Observation flag - marks data status (e.g., provisional, estimated)
- `CONF_STATUS`: Confidence status indicator

In [102]:
# Shape of the dataset
print(df_death.shape)

(1248, 12)


The dataset contains 1248 rows and 12 columns. This size is manageable for exploration and visualization.

In [103]:
# Information
print(df_death.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1248 entries, 0 to 1247
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATAFLOW     1248 non-null   object 
 1   LAST UPDATE  1248 non-null   object 
 2   freq         1248 non-null   object 
 3   unit         1248 non-null   object 
 4   age          1248 non-null   object 
 5   icd10        1248 non-null   object 
 6   sex          1248 non-null   object 
 7   geo          1248 non-null   object 
 8   TIME_PERIOD  1248 non-null   int64  
 9   OBS_VALUE    1248 non-null   float64
 10  OBS_FLAG     54 non-null     object 
 11  CONF_STATUS  0 non-null      float64
dtypes: float64(2), int64(1), object(9)
memory usage: 117.1+ KB
None


From the `info()` output, we can observe the following:

- All key columns such as `icd10`, `sex`, `geo`, `TIME_PERIOD`, and `OBS_VALUE` are complete - there are no missing values in these fields.  
- The column `CONF_STATUS` is completely empty (0 non-null values) and can be dropped.  
- The column `OBS_FLAG` contains only 54 non-null values (around 4%), meaning it is mostly missing and not useful for analysis.  
- The columns `DATAFLOW`, `LAST UPDATE`, `freq`, and `unit` are fully filled but constant - they do not provide analytical information.  
- The dataset uses appropriate data types:  
  - `TIME_PERIOD` is stored as integer (year).  
  - `OBS_VALUE` is float, representing the mortality rate.  
  - Other fields are stored as object (string), which is fine for categorical variables.

Overall, the dataset is clean and well-structured, with only a few redundant or low-value columns that will be removed during the cleaning phase.


In [104]:
# Overview of missing values (%)
missing = df_death.isna().mean().sort_values(ascending=False) * 100
print(missing)

CONF_STATUS    100.000000
OBS_FLAG        95.673077
DATAFLOW         0.000000
LAST UPDATE      0.000000
freq             0.000000
unit             0.000000
age              0.000000
icd10            0.000000
sex              0.000000
geo              0.000000
TIME_PERIOD      0.000000
OBS_VALUE        0.000000
dtype: float64


The percentage of missing values shows that:
- `CONF_STATUS`: 100% missing  
- `OBS_FLAG`: 96% missing  
- All other columns are complete.

Since these two columns do not add analytical value, the best approach is to drop them from the dataset.

In [105]:
# Count duplicates
print(df_death.duplicated().sum())

0


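There are no fully duplicated rows in the dataset. Because `sex` takes several values for each country-year pair, records are expected to be unique by country, year, and sex rather than by country and year alone.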

In [106]:
# Summary of numeric columns
print(df_death.describe())

       TIME_PERIOD    OBS_VALUE  CONF_STATUS
count  1248.000000  1248.000000          0.0
mean   2016.461538   265.340537          NaN
std       3.445783    76.266042          NaN
min    2011.000000   102.710000          NaN
25%    2013.000000   209.465000          NaN
50%    2016.000000   249.455000          NaN
75%    2019.000000   308.577500          NaN
max    2022.000000   504.430000          NaN


The descriptive statistics show that:

- The dataset covers the period from 2011 to 2022, which provides more than a decade of mortality data
- The mean mortality rate (`OBS_VALUE`) across all sexes, countries, and years is approximately 265 deaths per 100 000 inhabitants
- The minimum value (~ 103) and maximum value (~ 504) indicate large variations between countries and sexes, which is expected
- The standard deviation (~ 76) confirms this variability across European countries
- The `CONF_STATUS` column contains no values and will be removed during cleaning

Overall, the data distribution appears realistic for standardized mortality rates.
The next step is to inspect the categorical columns and confirm which codes are available before preparing the filter for female breast cancer (`icd10 == 'C50'`, `sex == 'F'`).


In [107]:
# Unique values in categorical fields
for col in df_death.select_dtypes('object'):
    print(f"{col}: {df_death[col].nunique()} unique values")

DATAFLOW: 1 unique values
LAST UPDATE: 1 unique values
freq: 1 unique values
unit: 1 unique values
age: 1 unique values
icd10: 1 unique values
sex: 3 unique values
geo: 35 unique values
OBS_FLAG: 1 unique values


The categorical column analysis reveals that:

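- `DATAFLOW`, `LAST UPDATE`, `freq`, `unit`, and `age` each have only one unique value, meaning they are constant across the dataset. The administrative ones can be safely removed; `unit` and `age` carry no variation and serve only to document what the rate measures.
- `icd10` also contains only one unique code, `C` (all cancers combined), rather than a specific type. A filter on `C50` would therefore return no rows from this file; breast-cancer-specific rates require an extract that includes the `C50` code.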
- `sex` has three unique values - `F` (female), `M` (male), and `T` (total).
  For our study, we will focus only on female mortality (F).
- `geo` includes 35 country codes, representing the European countries in the dataset.
- `OBS_FLAG` has one unique value and provides no additional information.

These findings confirm that several metadata columns are redundant, and only a subset of variables (`geo`, `TIME_PERIOD`, `sex`, and `OBS_VALUE`) will be needed for analysis.
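A filter on a value that is absent from a column silently returns an empty frame, so a small guard may be useful before filtering. In this sketch `filter_code` is a hypothetical helper and the toy rows are illustrative:

```python
import pandas as pd

def filter_code(df: pd.DataFrame, column: str, code: str) -> pd.DataFrame:
    """Filter df[column] == code, failing loudly if the code never occurs."""
    available = sorted(df[column].unique())
    if code not in available:
        raise ValueError(
            f"{code!r} not found in {column!r}; available values: {available}"
        )
    return df[df[column] == code]

# Toy frame where icd10 holds only the all-cancer code "C"
toy = pd.DataFrame({
    "icd10": ["C", "C", "C", "C"],
    "sex": ["F", "M", "T", "F"],
    "geo": ["AT", "AT", "AT", "BE"],
    "OBS_VALUE": [205.15, 310.0, 250.0, 210.0],
})

females = filter_code(toy, "sex", "F")
print(len(females))

try:
    filter_code(toy, "icd10", "C50")
except ValueError as err:
    print(err)
```

Raising an error is one design choice; logging a warning and returning the empty frame would be an alternative if the pipeline should continue.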

### Preliminary Cleaning Decisions

Based on the analysis above, the following actions will be applied:

1. **Drop irrelevant or redundant columns:**  
   `DATAFLOW`, `LAST UPDATE`, `freq`, `CONF_STATUS`, and `OBS_FLAG` - they contain constant values or are mostly empty.  

2. **Filter records:**  
   - Keep only rows where `sex == 'F'` (female).  
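   - Apply `icd10 == 'C50'` (breast cancer) only once an extract containing that code is available. The current file holds `C` (all cancers combined) only, so this filter would leave no rows.  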

3. **Rename columns for clarity:**  
   - `geo` --> `country`  
   - `TIME_PERIOD` --> `year`  
   - `OBS_VALUE` --> `mortality_rate`  

4. **Keep only relevant columns:**  
   `country`, `year`, `unit`, `age`, `sex`, `icd10`, `mortality_rate`.

5. **Check data types:**  
   - `year` --> integer (already `int64` after loading)  
   - `mortality_rate` --> float (already `float64` after loading)  

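These transformations will produce a clean dataset of female mortality rates across European countries from 2011 to 2022, ready for comparison with screening participation data. With the current file these are all-cancer rates (`icd10 = 'C'`); breast-cancer-specific rates depend on obtaining the `C50` series.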
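One possible way to set up the comparison with the screening data is a join on the renamed `country` and `year` columns. The sketch below is an assumption about how that step might look, not a description of the project's pipeline, and uses toy values:

```python
import pandas as pd

# Toy cleaned frames using the renamed columns (illustrative values only)
screening = pd.DataFrame({
    "country": ["AT", "AT", "BE"],
    "year": [2011, 2012, 2011],
    "screening_rate": [55.0, 56.5, 60.2],
})
mortality = pd.DataFrame({
    "country": ["AT", "AT", "FR"],
    "year": [2011, 2012, 2011],
    "mortality_rate": [205.15, 206.30, 200.00],
})

# Inner join keeps only country-year pairs present in both frames;
# validate raises if either side repeats a key, which would duplicate rows
merged = screening.merge(
    mortality,
    on=["country", "year"],
    how="inner",
    validate="one_to_one",
)
print(merged)
```

The `validate` argument only succeeds if each frame has been filtered down to one row per country and year, which makes it a cheap check that the earlier filters were applied.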

## 3. Self-reported breast examination dataset

This dataset contains information on the percentage of women who reported having a breast examination (by X-ray), categorized by age group and income quintile.
It allows us to study socioeconomic disparities in breast cancer screening participation across European countries.

In [108]:
# Display first few rows
df_exam.head()

| | DATAFLOW | LAST UPDATE | freq | duration | age | quant_inc | unit | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | CONF_STATUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESTAT:HLTH_EHIS_PA7I(1.0) | 24/02/22 23:00:00 | A | NEV | TOTAL | QU1 | PC | AT | 2019 | 30.2 | NaN | NaN |
| 1 | ESTAT:HLTH_EHIS_PA7I(1.0) | 24/02/22 23:00:00 | A | NEV | TOTAL | QU1 | PC | BE | 2019 | 35.2 | NaN | NaN |
| 2 | ESTAT:HLTH_EHIS_PA7I(1.0) | 24/02/22 23:00:00 | A | NEV | TOTAL | QU1 | PC | BG | 2019 | 58.2 | NaN | NaN |
| 3 | ESTAT:HLTH_EHIS_PA7I(1.0) | 24/02/22 23:00:00 | A | NEV | TOTAL | QU1 | PC | CY | 2019 | 55.2 | NaN | NaN |
| 4 | ESTAT:HLTH_EHIS_PA7I(1.0) | 24/02/22 23:00:00 | A | NEV | TOTAL | QU1 | PC | CZ | 2019 | 34.6 | NaN | NaN |


- `DATAFLOW`: Dataset identifier 
- `LAST UPDATE`: Date and time when the dataset was last updated in the Eurostat database (`24/02/22 23:00:00`).
- `freq`: Data frequency - `A` indicates annual data collection
- `duration`: Indicates the recency or frequency of the examination
  - Example: `NEV` = Never had a breast X-ray
- `age`: Age group of respondents (e.g., `40-49`, `50-69`, or `TOTAL` for all ages)
- `quant_inc`: Income quintile of respondents:
  - `QU1` = lowest income,
  - `QU5` = highest income
- `unit`: Measurement unit - `PC` stands for **percentage** of respondents in that category
- `geo`: Country code
- `TIME_PERIOD`: Year of data collection (here: `2019`)
- `OBS_VALUE`: Observed percentage of women in that country, age group, and income level who fall into the given `duration` category (for `NEV`, the share who never had a breast X-ray)
- `OBS_FLAG`: Observation flag
- `CONF_STATUS`: Confidence status indicator

In [109]:
# Shape of the dataset
print(df_exam.shape)

(17472, 12)


The dataset contains 17 472 rows and 12 columns, representing a wide range of country, age group, and income quintile combinations.
This larger size reflects the detailed breakdown by age, income, and duration since the last examination, making it valuable for studying socioeconomic and demographic disparities in breast examination participation.

In [110]:
# Information
print(df_exam.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17472 entries, 0 to 17471
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   DATAFLOW     17472 non-null  object 
 1   LAST UPDATE  17472 non-null  object 
 2   freq         17472 non-null  object 
 3   duration     17472 non-null  object 
 4   age          17472 non-null  object 
 5   quant_inc    17472 non-null  object 
 6   unit         17472 non-null  object 
 7   geo          17472 non-null  object 
 8   TIME_PERIOD  17472 non-null  int64  
 9   OBS_VALUE    17217 non-null  float64
 10  OBS_FLAG     1030 non-null   object 
 11  CONF_STATUS  0 non-null      float64
dtypes: float64(2), int64(1), object(9)
memory usage: 1.6+ MB
None


From the `info()` output, we can observe the following:

- The dataset contains 17 472 entries and 12 columns, which confirms that it covers many combinations of country, age group, income quintile, and duration
- Most columns are complete (`DATAFLOW`, `LAST UPDATE`, `freq`, `duration`, `age`, `quant_inc`, `unit`, and `geo` all have 17 472 non-null values)
- The column `OBS_VALUE` has 17 217 non-null values (255 missing, about 1.5 %), which is acceptable for large survey datasets
- `OBS_FLAG` contains 1 030 non-null values (~ 6 %), meaning it is mostly missing and provides little analytical value
- `CONF_STATUS` is entirely missing (0 non-null values) and will be dropped
- The data types are appropriate:
  - `TIME_PERIOD` is stored as integer (year)
  - `OBS_VALUE` is float (percentage)
  - The remaining columns are categorical strings (`object` type)

Overall, the dataset is well-structured, with minimal missing data in key analytical columns.
The only columns to be removed later are administrative or empty (`CONF_STATUS`, `OBS_FLAG`, `DATAFLOW`, `LAST UPDATE`, and `freq`).


In [111]:
# Overview of missing values (%)
missing = df_exam.isna().mean().sort_values(ascending=False) * 100
print(missing)

CONF_STATUS    100.000000
OBS_FLAG        94.104853
OBS_VALUE        1.459478
DATAFLOW         0.000000
LAST UPDATE      0.000000
freq             0.000000
duration         0.000000
age              0.000000
quant_inc        0.000000
unit             0.000000
geo              0.000000
TIME_PERIOD      0.000000
dtype: float64


The percentage of missing values in each column shows that:

- `CONF_STATUS`: 100% missing - this column is entirely empty and will be dropped
- `OBS_FLAG`: 94% missing - contains very few valid entries and provides little analytical value
- `OBS_VALUE`: 1.46% missing - only a small fraction of records (255 rows) lack a reported percentage, so these rows can be dropped without materially reducing coverage
- All other columns (`DATAFLOW`, `LAST UPDATE`, `freq`, `duration`, `age`, `quant_inc`, `unit`, `geo`, `TIME_PERIOD`) are complete with no missing values

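These results confirm that the dataset is overall complete and reliable.
On the basis of missing values alone, `CONF_STATUS` and `OBS_FLAG` are the columns to remove during cleaning, as they are empty or nearly empty; the constant administrative columns (`DATAFLOW`, `LAST UPDATE`, `freq`) are removed for a separate reason.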
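An overall missing rate can hide concentration in a few countries, so it may help to break the share of missing `OBS_VALUE` down by group before dropping rows. This is a sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "geo": ["AT", "AT", "BE", "BE", "BG", "BG"],
    "OBS_VALUE": [30.2, np.nan, 35.2, 40.1, np.nan, np.nan],
})

# Percentage of missing OBS_VALUE per country, highest first
missing_by_country = (
    toy["OBS_VALUE"]
    .isna()
    .groupby(toy["geo"])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(missing_by_country)
```

The same pattern works with `duration`, `age`, or `quant_inc` as the grouping column.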


In [112]:
# Count duplicates
print(df_exam.duplicated().sum())

0


There are no duplicate rows in the dataset, confirming that each record is unique.

In [113]:
# Summary of numeric columns
print(df_exam.describe())

       TIME_PERIOD     OBS_VALUE  CONF_STATUS
count      17472.0  17217.000000          0.0
mean        2019.0     25.547726          NaN
std            0.0     25.067589          NaN
min         2019.0      0.000000          NaN
25%         2019.0      6.600000          NaN
50%         2019.0     15.800000          NaN
75%         2019.0     37.900000          NaN
max         2019.0    100.000000          NaN


The descriptive statistics show that:

- All data were collected in 2019, meaning this dataset provides a cross-sectional snapshot (not a time series)
- The reported breast examination rates (`OBS_VALUE`) range from 0% to 100%, which is expected for a percentage-based indicator
- The mean of `OBS_VALUE` is approximately 25.5 %, but this average pools every `duration` category (including `NEV`, never examined), age group, and income level, so it should not be read as an overall examination rate; a meaningful average requires fixing a `duration` category first
- The standard deviation (~ 25.1) indicates large variation between groups: some combinations have very low shares, while others approach 100 %
- The column `CONF_STATUS` contains no values and will be dropped during cleaning

Overall, the values are realistic and consistent, showing strong variability likely linked to socioeconomic and age-related differences in screening behavior.
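Because `OBS_VALUE` is reported separately for each `duration` category, an average taken within one category is easier to interpret than the pooled figure from `describe()`. In the sketch below, `"Y_LT1"` is a hypothetical category code used only for illustration:

```python
import pandas as pd

# Toy frame; "Y_LT1" is a hypothetical duration code
toy = pd.DataFrame({
    "duration": ["NEV", "NEV", "Y_LT1", "Y_LT1"],
    "geo": ["AT", "BE", "AT", "BE"],
    "OBS_VALUE": [30.0, 40.0, 20.0, 10.0],
})

# Pooled mean across all categories: hard to interpret on its own
overall_mean = toy["OBS_VALUE"].mean()

# Mean within each duration category
mean_by_duration = toy.groupby("duration")["OBS_VALUE"].mean()

print(overall_mean)
print(mean_by_duration)
```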


In [114]:
# Unique values in categorical fields
for col in df_exam.select_dtypes('object'):
    print(f"{col}: {df_exam[col].nunique()} unique values")

DATAFLOW: 1 unique values
LAST UPDATE: 1 unique values
freq: 1 unique values
duration: 7 unique values
age: 13 unique values
quant_inc: 6 unique values
unit: 1 unique values
geo: 32 unique values
OBS_FLAG: 1 unique values


The categorical column analysis reveals that:

- `DATAFLOW`, `LAST UPDATE`, `freq`, and `unit` each have only one unique value, meaning they are constant across the dataset and can be safely removed
- `duration` has 7 unique values, representing different time intervals since the last breast X-ray (e.g., *never*, *within 1 year*, *within 2 years*, etc.)
- `age` contains 13 unique categories, corresponding to different age groups of respondents (e.g., `30–39`, `40–49`, `50–59`, etc.)
- `quant_inc` has 6 unique values, covering the five income quintiles (`QU1` to `QU5`) plus possibly one total or undefined category
- `geo` includes 32 country codes, representing the European countries covered in the dataset
- `OBS_FLAG` has only one unique value, making it redundant and suitable for removal

These results confirm that the dataset provides rich variation across age, income, and duration, while several administrative fields can be safely dropped during the cleaning phase.

### Preliminary Cleaning Decisions

Based on the analysis above, the following actions will be applied:
1. **Drop irrelevant or redundant columns:**  
   `DATAFLOW`, `LAST UPDATE`, `freq`, `CONF_STATUS`, and `OBS_FLAG`: these contain constant or mostly missing administrative information.  

2. **Drop missing values in `OBS_VALUE`:**  
   Only about 1.5% of entries are missing, so removing them is the simplest and cleanest approach.  

3. **Rename columns for clarity:**  
   - `geo` --> `country`  
   - `TIME_PERIOD` --> `year`  
   - `age` --> `age_group`  
   - `quant_inc` --> `income_quintile`  
   - `OBS_VALUE` --> `exam_rate`

4. **Keep only relevant analytical columns:**  
   `country`, `year`, `duration`, `age_group`, `income_quintile`, `unit`, and `exam_rate`.

5. **Check data types:**  
   - `year` --> integer (already `int64` after loading)  
   - `exam_rate` --> float (already `float64` after loading)  

6. **Sort and reindex:**  
   Sort by `country` and `income_quintile` for easier cross-country and income-based comparison.

After these steps, the dataset will be clean, consistent, and ready for analysis of how income and age influence breast examination rates across Europe.
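The steps listed above could be combined roughly as follows. This is a sketch under the stated decisions, with `clean_exam` as a hypothetical function name and toy rows standing in for the real file:

```python
import numpy as np
import pandas as pd

def clean_exam(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preliminary cleaning decisions to an exam-like frame."""
    # Drop administrative or mostly empty columns
    out = df.drop(columns=["DATAFLOW", "LAST UPDATE", "freq", "CONF_STATUS", "OBS_FLAG"])
    # Remove rows without a reported percentage
    out = out.dropna(subset=["OBS_VALUE"])
    # Rename columns for clarity
    out = out.rename(columns={
        "geo": "country",
        "TIME_PERIOD": "year",
        "age": "age_group",
        "quant_inc": "income_quintile",
        "OBS_VALUE": "exam_rate",
    })
    # Sort for cross-country and income-based comparison, then reindex
    return out.sort_values(["country", "income_quintile"]).reset_index(drop=True)

# Toy rows with the same columns as df_exam (illustrative values only)
toy = pd.DataFrame({
    "DATAFLOW": ["ESTAT:HLTH_EHIS_PA7I(1.0)"] * 3,
    "LAST UPDATE": ["24/02/22 23:00:00"] * 3,
    "freq": ["A"] * 3,
    "duration": ["NEV"] * 3,
    "age": ["TOTAL"] * 3,
    "quant_inc": ["QU1", "QU2", "QU1"],
    "unit": ["PC"] * 3,
    "geo": ["BE", "AT", "AT"],
    "TIME_PERIOD": [2019, 2019, 2019],
    "OBS_VALUE": [35.2, np.nan, 30.2],
    "OBS_FLAG": [None, "u", None],
    "CONF_STATUS": [np.nan] * 3,
})

cleaned = clean_exam(toy)
print(cleaned)
```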

## Conclusion

After exploring the three datasets, we can conclude the following:

- All datasets are well-structured and consistent, with clear variables and minimal missing data.  
- Several **administrative columns** (such as `DATAFLOW`, `LAST UPDATE`, `freq`, `CONF_STATUS`, and `OBS_FLAG`) were identified as **irrelevant or redundant** and will be removed during the cleaning phase.  
- The **screening dataset** (`breast_cancer_screening.csv`) contains reliable yearly data on mammography participation from **2000 to 2021**.  
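- The **mortality dataset** (`death_due_to_cancer.csv`) provides age-standardized **death rates per 100 000 inhabitants** by sex. It currently contains only the all-cancer code (`C`), so the planned filter for **female breast cancer (ICD-10: C50)** requires an extract that includes that code; until then only the `sex == 'F'` filter can be applied.  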
- The **self-reported examination dataset** (`breast_exam_income.csv`) gives a detailed snapshot of **2019** participation rates across **age groups** and **income quintiles**, revealing potential socioeconomic disparities.

Overall, the three datasets are complementary and ready for the **data cleaning and preparation phase**, which will be implemented in `prep.py`.  
This will ensure uniform column naming, consistent data types, and filtered subsets focused on **female breast cancer screening and mortality** across Europe.