The cancer hospital records (RHC) database is built from data generated by several hospitals in the State of São Paulo (Brazil) and is coordinated by the Fundação Oncocentro de São Paulo (FOSP). The dataset is well documented, but the documentation is available only in Portuguese.
The RHC:
- started its activities on 01/01/2000;
- is updated quarterly;
- has had its records anonymized;
- consists of analytical cases (patients who arrived at the institution without prior treatment, whether or not already diagnosed).
RHC objectives:
- understand and improve the care provided to cancer patients during all stages of treatment;
- expand access to the state database and make it possible to perform specific tabulations and analyses.
Pediatric tumors differ from adult tumors in location and morphology and can be analyzed according to the International Classification of Childhood Cancer (ICCC; CICI in Portuguese), which defines the main diagnostic groups.
The above information was adapted from the FOSP.
Understand the data and make it more suitable for machine learning.
This joint project was started at the Information Technology Center Renato Archer.
Team members:
Notebooks are located in the /notebooks directory:
- 1_Glossary.ipynb: short description in English of the dataset variables
- 2_EDA.ipynb: presents an exploratory data analysis (EDA) of the dataset
The /src directory contains some Python code:
- format_conversion.py: helps to convert the dataset from .dbf to .csv
- utility.py: contains some functions that help to explore the dataset
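As an illustration of the kind of exploration step such helpers support, the sketch below counts missing and distinct values per column of a converted .csv file. It is not the code in utility.py: the function name `summarize_columns`, the UTF-8 encoding, and the rule that blank cells count as missing are all assumptions made for this example.

```python
import csv


def summarize_columns(csv_path, encoding="utf-8"):
    """Return {column: {"missing": int, "distinct": int}} for a CSV file.

    Hypothetical helper: empty or whitespace-only cells count as missing.
    """
    with open(csv_path, newline="", encoding=encoding) as fh:
        reader = csv.DictReader(fh)
        names = reader.fieldnames or []
        missing = {name: 0 for name in names}
        distinct = {name: set() for name in names}
        for row in reader:
            for name in names:
                # Short rows yield None for absent cells; treat them as blank.
                value = (row.get(name) or "").strip()
                if value:
                    distinct[name].add(value)
                else:
                    missing[name] += 1
    return {
        name: {"missing": missing[name], "distinct": len(distinct[name])}
        for name in names
    }
```

Calling `summarize_columns` on the converted file gives a quick view of which variables are sparsely filled. The sets of distinct values are held in memory, so for very wide or high-cardinality tables a streaming estimate may be preferable.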
- Create a virtual environment of your choice
- Clone this GitHub repository
- Install project dependencies:
pip install -r requeriments.txt
- Inside the repository, create a new directory called data
- Download the dataset from the FOSP website and store it in the /data directory
- Convert the dataset from .dbf (original format) to .csv format
- The Python code and the dataset are located in the subfolders /src and /data, respectively.
- To convert the file, run the following command in a terminal, adjusting the file paths as needed:
python format_conversion.py --input "dbf file name" --output "csv file name"
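For readers curious about what a .dbf to .csv conversion involves, here is a minimal standard-library sketch. It is not the contents of format_conversion.py: the function names `read_dbf`, `dbf_to_csv`, and `main` are hypothetical, the file is assumed to follow the plain dBase III layout (32-byte header, 32-byte field descriptors, fixed-width records), and the `latin-1` encoding is an assumption that may not match the FOSP file. Only the `--input` and `--output` options mirror the command above.

```python
import argparse
import csv
import struct


def read_dbf(path, encoding="latin-1"):
    """Return (field_names, rows) from a dBase III-style .dbf file."""
    with open(path, "rb") as fh:
        header = fh.read(32)
        # Bytes 4-7: record count; 8-9: header size; 10-11: record size.
        n_records, header_len, record_len = struct.unpack("<IHH", header[4:12])

        # Field descriptors are 32 bytes each; a 0x0D byte ends the list.
        fields = []
        while True:
            first = fh.read(1)
            if first in (b"\r", b""):
                break
            desc = first + fh.read(31)
            name = desc[:11].split(b"\x00", 1)[0].decode(encoding)
            fields.append((name, desc[16]))  # (name, width in bytes)

        fh.seek(header_len)
        rows = []
        for _ in range(n_records):
            record = fh.read(record_len)
            if len(record) < record_len:
                break  # truncated file
            if record[:1] == b"*":
                continue  # record flagged as deleted
            row, offset = [], 1  # byte 0 is the deletion flag
            for _name, width in fields:
                raw = record[offset:offset + width]
                row.append(raw.decode(encoding).strip())
                offset += width
            rows.append(row)
    return [name for name, _ in fields], rows


def dbf_to_csv(dbf_path, csv_path, encoding="latin-1"):
    """Write the .dbf contents to a UTF-8 .csv; return the number of rows."""
    names, rows = read_dbf(dbf_path, encoding)
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(names)
        writer.writerows(rows)
    return len(rows)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Convert .dbf to .csv")
    parser.add_argument("--input", required=True, help="path to the .dbf file")
    parser.add_argument("--output", required=True, help="path to the .csv file")
    args = parser.parse_args(argv)
    count = dbf_to_csv(args.input, args.output)
    print(f"Wrote {count} records to {args.output}")
```

To use the sketch from a terminal, call `main()` under an `if __name__ == "__main__":` guard. All values are written as text, so numeric and date columns would still need to be parsed when the .csv is loaded.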