# Link to Data 

https://www.kaggle.com/datasets/tarekmasryo/cancer-risk-factors-dataset/data

# 🧬 Cancer Risk Factors Dataset: Summary

## General Information
- **Total Observations (Rows):** 2,000  
- **Total Features (Columns):** 20 (the raw CSV also carries a `Patient_ID` identifier column, which is dropped before modeling)  
- **File Size:** ~312 KB  
- **Primary Goal:** Predict a person's **cancer risk level** based on lifestyle, environmental, and genetic factors.  
- **Target Variable:** `Risk_Level` (categorical: *Low*, *Medium*, *High*)  
- **Alternative Target (optional):** `Cancer_Type`, the type of cancer diagnosed (for secondary analysis).

---

## Data Types Overview
| Data Type | Count | Example Columns |
|------------|--------|----------------|
| **Numeric (float64)** | 2 | `Overall_Risk_Score`, `BMI` |
| **Integer (int64)** | 16 | `Age`, `Smoking`, `Obesity`, `Family_History`, etc. |
| **Categorical (object)** | 2 | `Cancer_Type`, `Risk_Level` |
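
The counts in this table can be checked against the loaded data. The following is a minimal sketch, assuming the frame has already been read with `pd.read_csv`. The helper name `dtype_counts` and the small stand-in frame are illustrative and not part of the dataset.

```python
import pandas as pd


def dtype_counts(frame: pd.DataFrame) -> dict:
    """Return how many columns the frame has of each pandas dtype."""
    counts = frame.dtypes.astype(str).value_counts()
    return {dtype: int(n) for dtype, n in counts.items()}


# Tiny stand-in frame (hypothetical values) with one column of each kind.
sample = pd.DataFrame(
    {
        "Age": [68, 74],
        "BMI": [28.0, 25.4],
        "Risk_Level": ["Medium", "Low"],
    }
)
print(dtype_counts(sample))
```

On the real data, the same call would be `dtype_counts(df)`. The label reported for text columns can differ between pandas versions.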

---

## Feature Descriptions
| Column | Type | Description |
|--------|------|-------------|
| `Cancer_Type` | object | Cancer type (diagnosed cases); may be used as a secondary target. |
| `Age` | int | Age of the individual (years). |
| `Gender` | int | Encoded gender (e.g., 0 = Female, 1 = Male). |
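| `Smoking` | int | Smoking level on an integer scale (the sample rows show values from 6 to 10, so it is not a 0/1 flag). |
| `Alcohol_Use` | int | Alcohol consumption level on an integer scale (sample values range from 2 to 10). |
| `Obesity` | int | Obesity level on an integer scale (sample values range from 2 to 8). |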
| `Family_History` | int | Family history of cancer (0/1). |
| `Diet_Red_Meat` | int | Frequency of red meat consumption. |
| `Diet_Salted_Processed` | int | Intake of salted or processed foods. |
| `Fruit_Veg_Intake` | int | Fruit and vegetable consumption frequency. |
| `Physical_Activity` | int | Physical activity frequency. |
| `Air_Pollution` | int | Exposure to polluted environments. |
| `Occupational_Hazards` | int | Exposure to hazardous work conditions. |
| `BRCA_Mutation` | int | BRCA genetic mutation (0/1). |
| `H_Pylori_Infection` | int | Infection status of *H. pylori* (0/1). |
| `Calcium_Intake` | int | Daily calcium intake. |
| `Overall_Risk_Score` | float | Computed combined risk score. |
| `BMI` | float | Body Mass Index (kg/m²). |
| `Physical_Activity_Level` | int | Encoded physical activity level. |
| `Risk_Level` | object | **Target variable**: overall cancer risk (*Low*, *Medium*, *High*). |
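
Because all the integer columns share the same dtype, it can help to check which ones are really 0/1 flags and which hold wider-ranged scores. The sketch below is one way to do that. The function name `split_binary_and_scaled` is hypothetical, and the stand-in frame reuses a few values from the five sample rows printed by `f.head()`.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def split_binary_and_scaled(frame: pd.DataFrame):
    """Split integer columns into 0/1 flags and wider-ranged scores."""
    binary, scaled = [], []
    for name in frame.columns:
        if not is_integer_dtype(frame[name]):
            continue  # skip float and text columns
        # A column counts as binary when every value is 0 or 1.
        if frame[name].isin([0, 1]).all():
            binary.append(name)
        else:
            scaled.append(name)
    return binary, scaled


# Values copied from the first five rows of the dataset.
sample = pd.DataFrame(
    {
        "Age": [68, 74, 55, 61, 67],
        "Smoking": [7, 8, 7, 6, 10],
        "Family_History": [0, 0, 0, 0, 0],
        "BRCA_Mutation": [1, 0, 0, 0, 0],
        "BMI": [28.0, 25.4, 28.6, 32.1, 25.1],
    }
)
binary, scaled = split_binary_and_scaled(sample)
print(binary)  # ['Family_History', 'BRCA_Mutation']
print(scaled)  # ['Age', 'Smoking']
```

Five rows are too few to draw conclusions. Running the helper on the full frame gives the reliable answer.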

---

## Predictive Objective
Given an individual's **risk factors**, the goal is to train a machine learning model that predicts:
> **`Risk_Level` → how likely a person is to develop cancer (Low / Medium / High).**

This makes the dataset suitable for:
- **Preventive analysis:** Identify high-risk individuals before diagnosis.  
- **Classification modeling:** Fit classifiers such as Logistic Regression, Random Forest, or XGBoost.  
- **Feature importance studies:** Understand which factors contribute most to cancer risk.
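
The classification and feature-importance uses listed above can be sketched together with scikit-learn. This is only an illustration under stated assumptions: the function name `fit_risk_classifier`, the 25% hold-out split, and the hyperparameters are arbitrary choices, and the demo frame is synthetic (random values, not drawn from the dataset). Non-numeric columns such as `Cancer_Type` are simply left out of the features here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_risk_classifier(frame: pd.DataFrame, target: str = "Risk_Level", seed: int = 0):
    """Fit a random forest on the numeric columns.

    Returns the model, its held-out accuracy, and sorted feature importances.
    """
    features = frame.drop(columns=[target]).select_dtypes(include="number")
    labels = frame[target]
    # Stratify so each risk level keeps its share in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    importances = pd.Series(model.feature_importances_, index=features.columns)
    return model, model.score(X_test, y_test), importances.sort_values(ascending=False)


# Synthetic demo data: the label depends only on the made-up Smoking score.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "Smoking": rng.integers(0, 11, size=120),
        "Air_Pollution": rng.integers(0, 11, size=120),
        "BMI": rng.normal(27.0, 4.0, size=120),
    }
)
demo["Risk_Level"] = pd.cut(
    demo["Smoking"], bins=[-1, 3, 7, 10], labels=["Low", "Medium", "High"]
).astype(str)

model, accuracy, importances = fit_risk_classifier(demo)
print(f"held-out accuracy: {accuracy:.2f}")
print(importances)
```

On the real data, the call would be `fit_risk_classifier(df)` once the identifier column has been dropped.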

---

## Notes
- All 2,000 rows have **no missing values**.  
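- Most features are **integer-encoded**: binary flags (e.g., `Family_History`, `BRCA_Mutation`) or multi-level scores (e.g., `Smoking`, `Air_Pollution`).  
- Continuous features (`BMI`, `Overall_Risk_Score`) should be scaled for scale-sensitive models such as Logistic Regression; tree-based models such as Random Forest and XGBoost do not require it.  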
- `Cancer_Type` can be explored as a secondary task once risk levels are predicted.
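
The first and third notes can be checked and applied with pandas alone. The sketch below assumes z-score standardization as the scaling method, which the notes do not specify. The helper names `count_missing` and `standardize` are hypothetical, and the stand-in frame reuses the `BMI` and `Overall_Risk_Score` values from the five sample rows.

```python
import pandas as pd


def count_missing(frame: pd.DataFrame) -> int:
    """Total number of missing cells across the whole frame."""
    return int(frame.isna().sum().sum())


def standardize(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy with the given columns rescaled to zero mean and unit variance."""
    scaled = frame.copy()
    for name in columns:
        col = scaled[name].astype(float)
        # Population standard deviation (ddof=0), matching the usual z-score.
        scaled[name] = (col - col.mean()) / col.std(ddof=0)
    return scaled


# Values copied from the first five rows of the dataset.
sample = pd.DataFrame(
    {
        "BMI": [28.0, 25.4, 28.6, 32.1, 25.1],
        "Overall_Risk_Score": [0.398696, 0.424299, 0.605082, 0.318449, 0.524358],
    }
)
print(count_missing(sample))  # 0
print(standardize(sample, ["BMI", "Overall_Risk_Score"]).round(3))
```

When a train/test split is used, the mean and standard deviation would normally be computed on the training rows only and then applied to the test rows.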

---

### 🧭 Summary
> This dataset is designed for **predicting overall cancer risk levels** from a person's **health, lifestyle, and genetic factors**, helping to flag **high-risk individuals** for early screening and prevention.


In [3]:
import pandas as pd

f = pd.read_csv("cancer-risk-factors.csv")

f.head()

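| | Patient_ID | Cancer_Type | Age | Gender | Smoking | Alcohol_Use | Obesity | Family_History | Diet_Red_Meat | Diet_Salted_Processed | ... | Physical_Activity | Air_Pollution | Occupational_Hazards | BRCA_Mutation | H_Pylori_Infection | Calcium_Intake | Overall_Risk_Score | BMI | Physical_Activity_Level | Risk_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LU0000 | Breast | 68 | 0 | 7 | 2 | 8 | 0 | 5 | 3 | ... | 4 | 6 | 3 | 1 | 0 | 0 | 0.398696 | 28.0 | 5 | Medium |
| 1 | LU0001 | Prostate | 74 | 1 | 8 | 9 | 8 | 0 | 0 | 3 | ... | 1 | 3 | 3 | 0 | 0 | 5 | 0.424299 | 25.4 | 9 | Medium |
| 2 | LU0002 | Skin | 55 | 1 | 7 | 10 | 7 | 0 | 3 | 3 | ... | 1 | 8 | 10 | 0 | 0 | 6 | 0.605082 | 28.6 | 2 | Medium |
| 3 | LU0003 | Colon | 61 | 0 | 6 | 2 | 2 | 0 | 6 | 2 | ... | 6 | 4 | 8 | 0 | 0 | 8 | 0.318449 | 32.1 | 7 | Low |
| 4 | LU0004 | Lung | 67 | 1 | 10 | 7 | 4 | 0 | 6 | 3 | ... | 9 | 10 | 9 | 0 | 0 | 5 | 0.524358 | 25.1 | 2 | Medium |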


# Clean Up the Data

In [4]:
df = f.drop(columns=["Patient_ID"])