In [2]:
import os
import pandas as pd

# Identification and evaluation of public lung cancer datasets (kaggle.com)

## 1.- Lung Cancer Structured Clinical Dataset

A Comprehensive Lung Cancer Structured Clinical Dataset:

Lung cancer is a type of cancer that begins in the lungs, often associated with smoking but also linked to factors like air pollution, genetic predisposition, and exposure to toxins such as asbestos and radon. It is one of the leading causes of cancer-related deaths globally, as it is often diagnosed in advanced stages.

There are two main types: Non-Small Cell Lung Cancer (NSCLC), the most common form, and Small Cell Lung Cancer (SCLC), which is more aggressive but less common.

Symptoms include persistent cough, shortness of breath, chest pain, and unexplained weight loss.

Early detection through screenings such as low-dose CT scans significantly improves outcomes.
Treatment options include surgery, chemotherapy, radiation, immunotherapy, and targeted therapies, tailored to the cancer's type and stage.

Preventive measures like avoiding smoking and reducing exposure to environmental risk factors are key to reducing lung cancer incidence.

Dataset Link:
https://www.kaggle.com/arifcuet14

https://www.kaggle.com/datasets/arifcuet14/lung-cancer-structured-clinical-dataset


In [3]:
# Machine-specific absolute path to the raw CSV; adjust it for your environment
csv_path = os.path.abspath('C:/Users/scarv/Downloads/08_semestre_ingenieria_en_informatica_2025/01.-Capstone_709V/CAPSTONE_709v/CAPSTONE_709V/pulmonpredic_ml/pulmonpredic_crispdm/data/raw/lung-cancer-structured-clinical-dataset/lung_cancer_data.csv')

# Load the dataset and preview the first five rows
data = pd.read_csv(csv_path)
data.head()

Unnamed: 0,Patient_ID,Age,Gender,Smoking_History,Years_Smoked,Pack_Years,Family_History_Cancer,Occupation,Exposure_to_Toxins,Residential_Area,...,Previous_Cancer_Diagnosis,Tumor_Size_cm,Metastasis_Status,Stage_of_Cancer,Treatment_Type,Survival_Years,Follow_Up_Visits,Medication_Response,Symptom_Progression,Year_of_Diagnosis
0,1,69,Male,Never,30,3,False,Farmer,False,Urban,...,True,11.02,True,III,Surgery,12,24,Good,Stable,2007
1,2,32,Female,Former,6,61,False,Office Worker,False,Urban,...,False,14.29,True,II,Chemotherapy,6,12,Poor,Stable,2009
2,3,89,Male,Never,2,9,True,Office Worker,True,Rural,...,False,9.47,False,III,Chemotherapy,6,15,Good,Worsening,2015
3,4,78,Female,Never,11,69,False,Factory Worker,True,Urban,...,False,2.22,False,IV,Chemotherapy,13,25,Moderate,Improving,2012
4,5,38,Male,Former,11,57,False,Farmer,False,Rural,...,False,8.26,False,III,Palliative,3,4,Good,Stable,2014
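
The absolute path in the cell above only resolves on the author's machine. A minimal sketch of building the same location relative to a project root with `pathlib` follows. `PROJECT_ROOT` and `raw_csv_path` are hypothetical names, and the `data/raw/...` layout is assumed from the path shown in the cell.

```python
from pathlib import Path

# Assumption: the notebook runs from a directory that contains the
# "pulmonpredic_crispdm" project folder; adjust PROJECT_ROOT otherwise.
PROJECT_ROOT = Path("pulmonpredic_crispdm")


def raw_csv_path(dataset_dir: str, filename: str, root: Path = PROJECT_ROOT) -> Path:
    """Return <root>/data/raw/<dataset_dir>/<filename> (hypothetical helper)."""
    return root / "data" / "raw" / dataset_dir / filename


csv_path = raw_csv_path(
    "lung-cancer-structured-clinical-dataset", "lung_cancer_data.csv"
)
print(csv_path.as_posix())
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(csv_path)` would work without converting it to a string.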


## 2.- Lung Cancer Dataset
### Data about lung cancer focused on individuals diagnosed with cancer.
### About Dataset

This dataset contains data about lung cancer mortality. It is a comprehensive collection of patient
information focused on individuals diagnosed with cancer.

Description of columns:

* id: A unique identifier for each patient in the dataset.
* age: The age of the patient at the time of diagnosis.
* gender: The gender of the patient (e.g., male, female).
* country: The country or region where the patient resides.
* diagnosis_date: The date on which the patient was diagnosed with lung cancer.
* cancer_stage: The stage of lung cancer at the time of diagnosis (e.g., Stage I, Stage II,
    Stage III, Stage IV).
* family_history: Indicates whether there is a family history of cancer (e.g., yes, no).
* smoking_status: The smoking status of the patient (e.g., current smoker, former smoker,
    never smoked, passive smoker).
* bmi: The Body Mass Index of the patient at the time of diagnosis.
* cholesterol_level: The cholesterol level of the patient (value).
* hypertension: Indicates whether the patient has hypertension (high blood pressure) (e.g.,
  yes, no).
* asthma: Indicates whether the patient has asthma (e.g., yes, no).
* cirrhosis: Indicates whether the patient has cirrhosis of the liver (e.g., yes, no).
* other_cancer: Indicates whether the patient has had any other type of cancer in addition to
  the primary diagnosis (e.g., yes, no).
* treatment_type: The type of treatment the patient received (e.g., surgery, chemotherapy,
  radiation, combined).
* end_treatment_date: The date on which the patient completed their cancer treatment or died.
* survived: Indicates whether the patient survived (e.g., yes, no).
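
Because `diagnosis_date` and `end_treatment_date` are both dates, one feature that can be derived from them is the number of days between diagnosis and the end of treatment. The sketch below assumes ISO-formatted date strings (`YYYY-MM-DD`); the two sample rows and the column name `treatment_days` are illustrative only.

```python
import pandas as pd

# Two illustrative rows with ISO-formatted dates
sample = pd.DataFrame(
    {
        "diagnosis_date": ["2016-04-05", "2023-04-20"],
        "end_treatment_date": ["2017-09-10", "2024-06-17"],
    }
)

# Parse both columns as datetimes, then take the difference in whole days
start = pd.to_datetime(sample["diagnosis_date"], format="%Y-%m-%d")
end = pd.to_datetime(sample["end_treatment_date"], format="%Y-%m-%d")
sample["treatment_days"] = (end - start).dt.days  # hypothetical column name

print(sample)
```

Since `end_treatment_date` records either the completion of treatment or the patient's death, the derived value mixes both cases and should be read together with `survived`.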

### About this file

This dataset was assembled for educational purposes and may include synthetic or simulated data to reflect patterns seen in lung cancer diagnosis and treatment.
It was inspired by various healthcare-related learning resources and is not intended to represent real patient data.
This dataset is suitable for machine learning modeling, analysis, and educational projects — not for clinical or diagnostic use.
For real-world datasets, users are encouraged to explore verified sources like:

* SEER (https://seer.cancer.gov/)
* TCGA (https://portal.gdc.cancer.gov/)
* MIMIC-IV (https://physionet.org/)

### File Information

A single CSV file containing 890,000 rows and 17 columns of patient data.

Dataset Link:
https://www.kaggle.com/khwaishsaxena

https://www.kaggle.com/datasets/khwaishsaxena/lung-cancer-dataset

In [4]:
# Machine-specific absolute path to the raw CSV; adjust it for your environment
csv_path = os.path.abspath('C:/Users/scarv/Downloads/08_semestre_ingenieria_en_informatica_2025/01.-Capstone_709V/CAPSTONE_709v/CAPSTONE_709V/pulmonpredic_ml/pulmonpredic_crispdm/data/raw/lung_cancer_dataset/lung_cancer.csv')

# Load the dataset and preview the first five rows
data = pd.read_csv(csv_path)
data.head()

Unnamed: 0,id,age,gender,country,diagnosis_date,cancer_stage,family_history,smoking_status,bmi,cholesterol_level,hypertension,asthma,cirrhosis,other_cancer,treatment_type,end_treatment_date,survived
0,1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,0,0,1,0,Chemotherapy,2017-09-10,0
1,2,50.0,Female,Netherlands,2023-04-20,Stage III,Yes,Passive Smoker,41.2,280,1,1,0,0,Surgery,2024-06-17,1
2,3,65.0,Female,Hungary,2023-04-05,Stage III,Yes,Former Smoker,44.0,268,1,1,0,0,Combined,2024-04-09,0
3,4,51.0,Female,Belgium,2016-02-05,Stage I,No,Passive Smoker,43.0,241,1,1,0,0,Chemotherapy,2017-04-23,0
4,5,37.0,Male,Luxembourg,2023-11-29,Stage I,No,Passive Smoker,19.7,178,0,0,0,0,Combined,2025-01-08,0
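
The file description states 17 columns, so a cheap sanity check after loading is to compare the loaded column names against the documented ones. A minimal sketch, assuming the column names listed in the description of this dataset; `EXPECTED_COLUMNS` and `schema_diff` are hypothetical names, and the sample is a single row copied from the preview above.

```python
import io

import pandas as pd

# Column names as documented in the dataset description (17 in total)
EXPECTED_COLUMNS = [
    "id", "age", "gender", "country", "diagnosis_date", "cancer_stage",
    "family_history", "smoking_status", "bmi", "cholesterol_level",
    "hypertension", "asthma", "cirrhosis", "other_cancer",
    "treatment_type", "end_treatment_date", "survived",
]


def schema_diff(df: pd.DataFrame, expected: list) -> dict:
    """Report columns that are missing from, or unexpected in, df."""
    actual = list(df.columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "unexpected": [c for c in actual if c not in expected],
    }


# One-row stand-in for the real CSV, taken from the preview above
csv_text = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "1,64.0,Male,Sweden,2016-04-05,Stage I,Yes,Passive Smoker,29.4,199,"
    "0,0,1,0,Chemotherapy,2017-09-10,0\n"
)
df = pd.read_csv(io.StringIO(csv_text))
report = schema_diff(df, EXPECTED_COLUMNS)
print(df.shape, report)
```

If the real file carries an extra index column, it would show up under `unexpected`; on the full file, `df.shape[0]` can be compared against the documented 890,000 rows in the same way.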


## 3.- Lung Cancer Prediction

### Air Pollution, Alcohol, Smoking & Risk of Lung Cancer
### About Dataset

This dataset contains information on patients with lung cancer, including their age, gender, air pollution exposure, alcohol use, dust allergy, occupational hazards,
genetic risk, chronic lung disease, balanced diet, obesity, smoking, passive smoking, chest pain, coughing of blood, fatigue, weight loss, shortness of breath, wheezing,
swallowing difficulty, clubbing of finger nails, and snoring.

### How to use the dataset

Lung cancer is the leading cause of cancer death worldwide, accounting for roughly 1.76 million deaths in 2018. The majority of lung cancer cases are attributed to smoking,
but exposure to air pollution is also a risk factor. A new study has found that air pollution may be linked to an increased risk of lung cancer, even in nonsmokers.

The study, which was published in the journal Nature Medicine, looked at data from over 462,000 people in China who were followed for an average of six years.
The participants were divided into two groups: those who lived in areas with high levels of air pollution and those who lived in areas with low levels of air pollution.

The researchers found that the people in the high-pollution group were more likely to develop lung cancer than those in the low-pollution group. They also found that the
risk was higher in nonsmokers than smokers, and that the risk increased with age.

While this study does not prove that air pollution causes lung cancer, it does suggest that there may be a link between the two. More research is needed to confirm these
findings and to determine what effect different types and levels of air pollution may have on lung cancer risk.

### Research Ideas

*   Predicting the likelihood of a patient developing lung cancer
*   Identifying risk factors for lung cancer
*   Determining the most effective treatment for a patient with lung cancer

### About this file

This dataset contains information on patients with lung cancer, including their age, gender, air pollution exposure, alcohol use, dust allergy, occupational hazards, genetic risk, chronic lung disease, balanced diet, obesity, smoking status, passive smoker status, chest pain, coughing of blood, fatigue levels, weight loss, shortness of breath, wheezing, swallowing difficulty, clubbing of finger nails, frequent colds, dry coughs, and snoring. By analyzing this data we can gain insight into what causes lung cancer and how best to treat it.

*   Age: The age of the patient. (Numeric)
*   Gender: The gender of the patient. (Categorical)
*   Air Pollution: The level of air pollution exposure of the patient. (Categorical)
*   Alcohol use: The level of alcohol use of the patient. (Categorical)
*   Dust Allergy: The level of dust allergy of the patient. (Categorical)
*   OccuPational Hazards: The level of occupational hazards of the patient. (Categorical)
*   Genetic Risk: The level of genetic risk of the patient. (Categorical)
*   chronic Lung Disease: The level of chronic lung disease of the patient. (Categorical)
*   Balanced Diet: The level of balanced diet of the patient. (Categorical)
*   Obesity: The level of obesity of the patient. (Categorical)
*   Smoking: The level of smoking of the patient. (Categorical)
*   Passive Smoker: The level of passive smoking exposure of the patient. (Categorical)
*   Chest Pain: The level of chest pain of the patient. (Categorical)
*   Coughing of Blood: The level of coughing of blood of the patient. (Categorical)
*   Fatigue: The level of fatigue of the patient. (Categorical)
*   Weight Loss: The level of weight loss of the patient. (Categorical)
*   Shortness of Breath: The level of shortness of breath of the patient. (Categorical)
*   Wheezing: The level of wheezing of the patient. (Categorical)
*   Swallowing Difficulty: The level of swallowing difficulty of the patient. (Categorical)
*   Clubbing of Finger Nails: The level of clubbing of finger nails of the patient. (Categorical)
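
The column names in this file mix casing and contain spaces (for example `OccuPational Hazards` and `chronic Lung Disease`). A minimal sketch of normalizing them to snake_case before analysis; `to_snake_case` is a hypothetical helper, not part of pandas.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


columns = ["Patient Id", "OccuPational Hazards", "chronic Lung Disease", "Level"]
renamed = {col: to_snake_case(col) for col in columns}
print(renamed)
```

On a loaded DataFrame the same function could be passed to `rename`, as in `data.rename(columns=to_snake_case)`, which returns a new DataFrame with the normalized names.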


### License

See the dataset description for more information.

Dataset Link: https://www.kaggle.com/datasets/thedevastator/cancer-patients-and-air-pollution-a-new-link

In [5]:
# Machine-specific absolute path to the raw CSV; adjust it for your environment
csv_path = os.path.abspath('C:/Users/scarv/Downloads/08_semestre_ingenieria_en_informatica_2025/01.-Capstone_709V/CAPSTONE_709v/CAPSTONE_709V/pulmonpredic_ml/pulmonpredic_crispdm/data/raw/lung_cancer_prediction/cancer_patient_data_sets.csv')

# Load the dataset and preview the first five rows
data = pd.read_csv(csv_path)
data.head()

Unnamed: 0,index,Patient Id,Age,Gender,Air Pollution,Alcohol use,Dust Allergy,OccuPational Hazards,Genetic Risk,chronic Lung Disease,...,Fatigue,Weight Loss,Shortness of Breath,Wheezing,Swallowing Difficulty,Clubbing of Finger Nails,Frequent Cold,Dry Cough,Snoring,Level
0,0,P1,33,1,2,4,5,4,3,2,...,3,4,2,2,3,1,2,3,4,Low
1,1,P10,17,1,3,1,5,3,4,2,...,1,3,7,8,6,2,1,7,2,Medium
2,2,P100,35,1,4,5,6,5,5,4,...,8,7,9,2,1,4,6,7,2,High
3,3,P1000,37,1,7,7,7,7,6,7,...,4,2,3,1,4,5,6,7,5,High
4,4,P101,46,1,6,8,7,7,7,6,...,3,2,4,1,4,2,4,2,3,High
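
The `Level` column holds the labels `Low`, `Medium`, and `High`, which have a natural order. One way to turn them into an ordered numeric target is an ordered categorical. This is a sketch of that option under the assumption that these three labels are the only ones present, not necessarily how the project encodes the target. The five labels are those of the preview rows above.

```python
import pandas as pd

# Labels of the five preview rows shown above
levels = pd.Series(["Low", "Medium", "High", "High", "High"], name="Level")

# Declare the order explicitly so that Low < Medium < High
ordered = pd.Categorical(levels, categories=["Low", "Medium", "High"], ordered=True)

# Integer codes follow the declared category order: Low=0, Medium=1, High=2
codes = pd.Series(ordered.codes, name="level_code")  # hypothetical column name
print(codes.tolist())
```

Any label outside the declared categories would receive the code `-1`, which makes unexpected values easy to spot before modeling.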
