### I320 Project 1
### Name: Sophie Gill

## About the Dataset

**Dataset:** Wisconsin Diagnostic Breast Cancer (WDBC)  
**Source:** UCI Machine Learning Repository / Kaggle  
**File:** `data.csv`  
**Target Variable:** `diagnosis` (M = Malignant, B = Benign)

In [3]:
# Load the dataset
import pandas as pd
df = pd.read_csv('data.csv')

# Verify dataset is loaded
df.head(3)

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |


## Part 1: 10-Point Data Inspection

In [4]:
# Step 1: Shape
df.shape

(569, 33)

Findings: The dataset contains 569 rows and 33 columns. Each row represents one Fine Needle Aspirate (FNA) biopsy sample from a breast mass, and the columns hold the ID, the diagnosis, and the features measured for that sample.

In [5]:
# Step 2: Column Names
df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

Findings: The column names include id, diagnosis, and 10 base features (radius, texture, etc.), each appearing with three suffixes: _mean, _se, and _worst. There is also an 'Unnamed: 32' column. The features follow a structured naming convention of `[characteristic]_[calculation type]`. Additionally, concavity, concave points, and fractal_dimension may need further clinical research to understand how they are calculated.
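
The naming convention described above can be checked programmatically. The following is a minimal sketch that uses only a subset of the column names from the output; the variable names (`columns`, `by_characteristic`) are illustrative and not part of the original notebook.

```python
from collections import defaultdict

# A subset of the column names shown in the df.columns output.
columns = [
    "radius_mean", "concave points_mean", "fractal_dimension_mean",
    "radius_se", "concave points_se", "fractal_dimension_se",
    "radius_worst", "concave points_worst", "fractal_dimension_worst",
]

# Split on the LAST underscore only, so "fractal_dimension" stays intact.
by_characteristic = defaultdict(list)
for name in columns:
    characteristic, calculation = name.rsplit("_", 1)
    by_characteristic[characteristic].append(calculation)

for characteristic, calculations in by_characteristic.items():
    print(f"{characteristic:20}: {calculations}")
```

Splitting from the right matters here because one characteristic (`fractal_dimension`) contains an underscore itself and another (`concave points`) contains a space.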

In [6]:
# Step 3: Data Types
df.dtypes

id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst     

Findings: All columns are numeric (int64 or float64) except for diagnosis, which is categorical (object).
There are no data types that appear incorrect; however, 'Unnamed: 32' is technically a float but contains no data as seen in the following step.

In [7]:
# Step 4: First Look
df.head()

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


Findings: The values are raw numerical measurements on very different scales (mostly between 0 and 1 for the shape and SE features, and much higher for area and perimeter); they have not been standardized. The diagnosis values are either M (Malignant) or B (Benign). The only unusual or unexpected values come from the 'Unnamed: 32' column mentioned before, which is immediately visible as being full of NaN values.

In [8]:
# Step 5: Last Look
df.tail()

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 926424 | M | 21.56 | 22.39 | 142.0 | 1479.0 | 0.111 | 0.1159 | 0.2439 | 0.1389 | ... | 26.4 | 166.1 | 2027.0 | 0.141 | 0.2113 | 0.4107 | 0.2216 | 0.206 | 0.07115 | NaN |
| 565 | 926682 | M | 20.13 | 28.25 | 131.2 | 1261.0 | 0.0978 | 0.1034 | 0.144 | 0.09791 | ... | 38.25 | 155.0 | 1731.0 | 0.1166 | 0.1922 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | NaN |
| 566 | 926954 | M | 16.6 | 28.08 | 108.3 | 858.1 | 0.08455 | 0.1023 | 0.09251 | 0.05302 | ... | 34.12 | 126.7 | 1124.0 | 0.1139 | 0.3094 | 0.3403 | 0.1418 | 0.2218 | 0.0782 | NaN |
| 567 | 927241 | M | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | ... | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 | NaN |
| 568 | 92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.0 | 0.0 | ... | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0 | 0.0 | 0.2871 | 0.07039 | NaN |


Findings: There is a clean ending to the data; the indices end at 568 (for 569 total rows). The structure and formatting of the last rows are consistent and match the first rows completely.

In [9]:
# Step 6: Memory Usage
df.memory_usage()

Index                       132
id                         4552
diagnosis                  4552
radius_mean                4552
texture_mean               4552
perimeter_mean             4552
area_mean                  4552
smoothness_mean            4552
compactness_mean           4552
concavity_mean             4552
concave points_mean        4552
symmetry_mean              4552
fractal_dimension_mean     4552
radius_se                  4552
texture_se                 4552
perimeter_se               4552
area_se                    4552
smoothness_se              4552
compactness_se             4552
concavity_se               4552
concave points_se          4552
symmetry_se                4552
fractal_dimension_se       4552
radius_worst               4552
texture_worst              4552
perimeter_worst            4552
area_worst                 4552
smoothness_worst           4552
compactness_worst          4552
concavity_worst            4552
concave points_worst       4552
symmetry

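Findings: The dataset uses approximately 146.8 KB of memory (150,348 bytes: 33 columns at 4,552 bytes each, plus 132 bytes for the index). Because `memory_usage()` was called without `deep=True`, the `diagnosis` column is counted by its 8-byte object references only, so the true footprint is somewhat larger. By data science standards, this is a small dataset.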
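
To get a single total rather than a per-column listing, the per-column values can be summed. The sketch below uses a small hypothetical frame (`toy`) in place of `df`, so the byte counts it prints are not those of the real dataset; it only illustrates the difference between the shallow and deep measurements.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (one string column, one float column).
toy = pd.DataFrame({"diagnosis": ["M", "B", "B"], "radius_mean": [17.99, 12.3, 9.5]})

# Shallow: for object columns, only the references are counted.
shallow_bytes = toy.memory_usage().sum()

# Deep: also inspects the stored string values themselves.
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(f"deep total in KB: {deep_bytes / 1024:.1f}")
```

On the real data, the same two calls applied to `df` would give the shallow total reported above and a slightly larger deep total.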

In [10]:
# Step 7: Missing Values
df.isnull().sum()

id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_dimension_worst      0
Unnamed:

Findings: Only 'Unnamed: 32' has null/missing values, with 100% of the column missing. However, this does not indicate a suspicious pattern as it was likely created by a trailing comma error in the CSV file.
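
The trailing-comma explanation can be reproduced on a tiny example. This is a sketch under the assumption that the real file ends every line, including the header, with a comma; the three data rows below are a hypothetical excerpt built from values shown earlier, not the actual file contents.

```python
import io
import pandas as pd

# Hypothetical excerpt; note the trailing comma on every line, header included.
csv_text = (
    "id,diagnosis,radius_mean,\n"
    "842302,M,17.99,\n"
    "842517,M,20.57,\n"
    "92751,B,7.76,\n"
)

toy = pd.read_csv(io.StringIO(csv_text))
print(toy.columns.tolist())   # the empty fourth header is auto-named by position
print(toy.isnull().sum())

# Drop every column that is entirely null, without hard-coding its name.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

With three named columns the empty one is labeled `Unnamed: 3`; with 32 named columns, as in this dataset, the same mechanism yields `Unnamed: 32`.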

In [11]:
# Step 8: Duplicates
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
564    False
565    False
566    False
567    False
568    False
Length: 569, dtype: bool

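Findings: Every displayed value is False, and because all 569 patient IDs are unique (confirmed in Step 10), no two rows can be identical. There are therefore no duplicate rows.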
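
Since the boolean output above is truncated to ten rows, a count is easier to read than the full Series. The following sketch shows one way to get it, using a hypothetical three-row frame (`toy`) that deliberately contains a repeated row so both checks have something to find.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; the last row repeats the first.
toy = pd.DataFrame({
    "id": [842302, 842517, 842302],
    "diagnosis": ["M", "M", "M"],
    "radius_mean": [17.99, 20.57, 17.99],
})

n_duplicate_rows = int(toy.duplicated().sum())   # identical rows after their first occurrence
ids_unique = toy["id"].is_unique                 # True only if no id repeats

print(f"duplicate rows: {n_duplicate_rows}")
print(f"all ids unique: {ids_unique}")
```

Applied to `df`, a duplicate count of 0 and `is_unique` of True would match the findings stated above.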

In [12]:
# Step 9: Basic Statistics
df.describe()

|  | id | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 0.0 |
| mean | 30371830.0 | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | ... | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 | NaN |
| std | 125020600.0 | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | ... | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 | NaN |
| min | 8670.0 | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | ... | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 | NaN |
| 25% | 869218.0 | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | ... | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 | NaN |
| 50% | 906024.0 | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | ... | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 | NaN |
| 75% | 8813129.0 | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | ... | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 | NaN |
| max | 911320500.0 | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | ... | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 | NaN |


Findings: The radius_mean ranges from 6.981 to 28.11. The area_mean range is 143.5 to 2501.0, and the concavity_mean range is 0.0 to 0.4268. No values appear impossible. A concavity of 0.0 is unusual but plausible for nuclei whose contours have no concave portions.

In [13]:
# Step 10: Unique Counts
df.nunique()

id                         569
diagnosis                    2
radius_mean                456
texture_mean               479
perimeter_mean             522
area_mean                  539
smoothness_mean            474
compactness_mean           537
concavity_mean             537
concave points_mean        542
symmetry_mean              432
fractal_dimension_mean     499
radius_se                  540
texture_se                 519
perimeter_se               533
area_se                    528
smoothness_se              547
compactness_se             541
concavity_se               533
concave points_se          507
symmetry_se                498
fractal_dimension_se       545
radius_worst               457
texture_worst              511
perimeter_worst            514
area_worst                 544
smoothness_worst           411
compactness_worst          529
concavity_worst            539
concave points_worst       492
symmetry_worst             500
fractal_dimension_worst    535
Unnamed:

Findings: Diagnosis, being categorical, has few unique values (2). All the other measurement columns have many unique values (most with 400+). Additionally, the number of unique IDs does match the number of rows (569).

## Part 2: Data Dictionary

Complete the following data dictionary for the **key columns**. For each column, you must:
1. **Research** the clinical meaning
2. **Identify** the feature type (Continuous, Discrete, Categorical-Nominal, Categorical-Ordinal, Binary, Identifier)
3. **Document** the valid values/range you observe
4. **Note** any issues or questions

| Column | Description | Feature Type | Valid Values/Range | Notes/Issues |
|--------|-------------|--------------|-------------------|--------------|
| `id` |Unique patient identification number |Identifier |8670 - 911320502 |Unique for every row |
| `diagnosis` |Tumor classification |Binary |{M, B} |Target variable |
| `radius_mean` |Avg distance from nucleus center to points on perimeter |Continuous |6.981 - 28.11 |Larger = more likely malignant |
| `texture_mean` |Standard deviation of gray-scale values |Continuous |9.71 - 39.28 |Larger = more irregular surface |
| `perimeter_mean` |Boundary length of the nucleus |Continuous |43.79 - 188.5 |Correlated with radius |
| `area_mean` |Area of the nucleus |Continuous |143.5 - 2501.0 |Malignant cells typically larger |
| `smoothness_mean` |Local variation in radius lengths |Continuous |0.05263 - 0.1634 |Lower values = smoother borders, higher values = irregular borders |
| `compactness_mean` |(perimeter² / area) - 1.0 |Continuous |0.01938 - 0.3454 |Lower = closer to circular, higher = more irregular |
| `concavity_mean` |Severity of concave portions of the contour |Continuous |0.0 - 0.4268 |Higher = more indentations |
| `concave points_mean` |Number of concave portions of the contour |Continuous |0.0 - 0.2012 |Malignant tumors have more concave points |
| `symmetry_mean` |Symmetry of the nucleus |Continuous |0.106 - 0.304 |Lower = more symmetric/likely benign |
| `fractal_dimension_mean` |Avg boundary complexity |Continuous |0.04996 - 0.09744 |Higher = more irregular border |

### Clinical Research Questions

**1. What is computer-aided diagnosis (CAD) in radiology? How does it assist physicians in detecting breast cancer?**
CAD is a technology that uses pattern recognition and other sophisticated algorithms to analyze medical images as a form of assistance or second opinion for radiologists. This is helpful as it reduces human error, picking out features that may have otherwise been overlooked. The software scans the images for specific indicators of cancer and makes any areas of concern known to the physician. 

**2. What is the relationship between tumor size and prognosis in breast cancer? Why is early detection so important?**
In breast cancer, tumor size is one of the most crucial factors for determining a patient's prognosis. There is an inverse relationship between the size of the tumor and the patient's long-term survival rate - typically, the smaller the tumor, the more favorable the prognosis. Thus, identifying cancer before it becomes large enough to cause symptoms is ideal for improving survival. When breast cancer is detected early (still in the "localized" stage and confined to the breast), the 5-year relative survival rate is 99%. This drops sharply to around 30% if the cancer is diagnosed at an advanced, metastatic stage.

**3. Explain what "compactness" measures mathematically (perimeter² / area - 1.0). Why might cancer cells have different compactness than normal cells?**
Compactness is a shape-based metric that describes the irregularity of a cell nucleus from its perimeter and area, using the formula perimeter² / area - 1.0. For a fixed area, a circle has the shortest possible perimeter, so any indentation or bulge in the boundary lengthens the perimeter without adding much area and raises the compactness value. Physicians look for this irregular shape, as cancer cells have higher activity inside the nucleus that can cause the nuclear envelope to bulge or become distorted.
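
A short worked example may make the formula concrete. The sketch below applies a literal reading of perimeter² / area - 1.0 to three idealized shapes; the `compactness` helper and the shapes are illustrative assumptions, not part of the dataset. Under this literal reading a circle gives 4π - 1 (about 11.57), which is the smallest value any shape can reach. The values stored in `compactness_mean` (0.019 to 0.345) are on a different scale, so the example illustrates only the direction of the effect, not the dataset's exact computation.

```python
import math

def compactness(perimeter: float, area: float) -> float:
    """Literal reading of the documented formula: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

circle = compactness(2 * math.pi * 1.0, math.pi * 1.0 ** 2)   # circle of radius 1
square = compactness(4 * 1.0, 1.0 ** 2)                        # square of side 1
sliver = compactness(2 * (10.0 + 0.1), 10.0 * 0.1)             # thin 10 x 0.1 rectangle

print(f"circle: {circle:.2f}")
print(f"square: {square:.2f}")
print(f"sliver: {sliver:.2f}")
```

The ordering circle < square < sliver shows the score rising as the outline departs further from a circle, which is the property the clinical interpretation relies on.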

**4. What is the difference between "mean," "standard error," and "worst" measurements in this dataset? Why might the "worst" values be particularly important for diagnosis?**
In the context of this dataset, 'mean' represents the average value of a specific feature across all cell nuclei captured in the image, providing a baseline for the sample. 'Standard error' measures the variability or spread of the feature among the nuclei, with a high SE indicating the cell nuclei are inconsistent in size or shape. 'Worst' is the mean of the three largest values recorded for a feature in the sample. This doesn't always indicate something 'bad', but it shows the extreme end of the measurements. These 'worst' values are often the most important for diagnosis, as a tumor might have a small cluster of aggressive, irregular cells (indicating malignant growth) that can be identified only by analyzing extreme values.

## Part 3: Data Validation

### 3.1 Diagnosis Distribution Validation
Write code to check:
- How many patients have malignant (M) tumors?
- How many patients have benign (B) tumors?
- What is the percentage of each?

In [37]:
# Get raw counts for M and B
diagnosis_counts = df['diagnosis'].value_counts()

# Get percentages
diagnosis_percentages = df['diagnosis'].value_counts(normalize=True) * 100

# Display results
print("Diagnosis Counts:")
print(diagnosis_counts)
print(" ")
print("Diagnosis Percentages:")
print(diagnosis_percentages)

Diagnosis Counts:
diagnosis
B    357
M    212
Name: count, dtype: int64
 
Diagnosis Percentages:
diagnosis
B    62.741652
M    37.258348
Name: proportion, dtype: float64


Findings: The dataset is moderately imbalanced, with a clear majority of benign cases (about 63% benign versus 37% malignant). This distribution is acceptable: in real-world data the imbalance is typically more pronounced, with about 80% of breast biopsies turning out to be benign (American Cancer Society). Having proportionally more malignant cases than that is also beneficial for training a machine learning model.

### 3.2 Empty Column Validation 

Write code to examine all columns for any that might be completely empty or contain only null values.

In [38]:
# Check for null values in all columns
null_counts = df.isnull().sum()

print("Columns with missing values:")
print(null_counts)

Columns with missing values:
id                           0
diagnosis                    0
radius_mean                  0
texture_mean                 0
perimeter_mean               0
area_mean                    0
smoothness_mean              0
compactness_mean             0
concavity_mean               0
concave points_mean          0
symmetry_mean                0
fractal_dimension_mean       0
radius_se                    0
texture_se                   0
perimeter_se                 0
area_se                      0
smoothness_se                0
compactness_se               0
concavity_se                 0
concave points_se            0
symmetry_se                  0
fractal_dimension_se         0
radius_worst                 0
texture_worst                0
perimeter_worst              0
area_worst                   0
smoothness_worst             0
compactness_worst            0
concavity_worst              0
concave points_worst         0
symmetry_worst               0
fractal_di

Findings: There are no empty columns in the dataset other than one named 'Unnamed: 32', which contains only null values. As it contains no data, this column should be dropped from the dataframe before continuing with any machine learning. Empty columns may exist in a dataset due to formatting issues, such as trailing commas in the CSV that are interpreted as an additional empty column. 

In [39]:
# Drop the empty column
df = df.drop(columns=['Unnamed: 32'])

### 3.3 Feature Range Validation
Write code to check if the "worst" measurements are always greater than or equal to the "mean" measurements for the same characteristic.

In [14]:
# Check if 'worst' always >= to 'mean' for radius
radius_check = (df['radius_worst'] >= df['radius_mean']).all()

# Check for all relevant feature pairs
features = ['radius', 'texture', 'perimeter', 'area', 'smoothness', 
            'compactness', 'concavity', 'concave points', 'symmetry', 'fractal_dimension']

all_checks = {f: (df[f'{f}_worst'] >= df[f'{f}_mean']).all() for f in features}

print(f"Is radius_worst always >= radius_mean? {radius_check}")
print(" ")
print("Check for all feature pairs:")
for feature, result in all_checks.items():
    print(f"{feature:20}: {result}")

Is radius_worst always >= radius_mean? True
 
Check for all feature pairs:
radius              : True
texture             : True
perimeter           : True
area                : True
smoothness          : True
compactness         : True
concavity           : True
concave points      : True
symmetry            : True
fractal_dimension   : True


Findings: `radius_worst` is always greater than or equal to `radius_mean`, and the second part of the output shows the same holds for all 10 characteristics in the dataset. This is expected: the "worst" value is the mean of the three largest measurements in a sample, so it can never fall below the mean of all measurements. If a 'mean' value were ever higher than its 'worst' value, it would indicate a data integrity error (for example, a calculation error or a data entry error).
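
The reasoning behind this check can be illustrated on simulated measurements. The sketch below assumes, as described in the clinical research section, that "worst" is the mean of the three largest values in a sample; the `radii` array is randomly generated and hypothetical, not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-nucleus radius measurements for one sample (not from the dataset).
radii = rng.normal(loc=14.0, scale=3.0, size=30)

mean_value = radii.mean()
worst_value = np.sort(radii)[-3:].mean()   # mean of the three largest values

print(f"mean:  {mean_value:.3f}")
print(f"worst: {worst_value:.3f}")
print(f"worst >= mean: {worst_value >= mean_value}")
```

Whatever the seed or distribution, the average of the top three values cannot be smaller than the average of all values, which is why a failed check would point to a data problem rather than to biology.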

## Part 4: Create Clinical Area Groups
Create a new column called `area_category` that categorizes tumors into clinically-meaningful groups based on `area_mean`.

### Quartile-Based Categories
Use these categories based on the quartile distribution of the data:

| Area Category | Area Range | Clinical Rationale |
|---------------|------------|-------------------|
| Q1 - Smallest | < 25th percentile | Smallest tumors, likely early stage |
| Q2 - Below Average | 25th to 50th percentile | Small tumors |
| Q3 - Above Average | 50th to 75th percentile | Moderate size tumors |
| Q4 - Largest | > 75th percentile | Largest tumors, may indicate advanced disease |

**Hint:** First calculate the quartile values using `df['area_mean'].quantile([0.25, 0.50, 0.75])`


In [41]:
# Calculate the quartile values
quartiles = df['area_mean'].quantile([0.25, 0.50, 0.75])
print("Quartiles:")
print(quartiles)

# Create the area_category column using qcut, define 4 bins
df['area_category'] = pd.qcut(df['area_mean'], q=4, labels=['Q1 - Smallest', 'Q2 - Below Average', 'Q3 - Above Average', 'Q4 - Largest'])

# Show distribution of area categories
print("\nCounts per Area Category:")
print(df['area_category'].value_counts().sort_index())

Quartiles:
0.25    420.3
0.50    551.1
0.75    782.7
Name: area_mean, dtype: float64

Counts per Area Category:
area_category
Q1 - Smallest         144
Q2 - Below Average    141
Q3 - Above Average    142
Q4 - Largest          142
Name: count, dtype: int64


In [42]:
# Calculate percentage of malignant diagnoses in each area category
malignancy_rates = df.groupby('area_category', observed=True)['diagnosis'].apply(lambda x: (x == 'M').mean() * 100)

print("Malignancy Rate by Area Category:")
print(malignancy_rates)

Malignancy Rate by Area Category:
area_category
Q1 - Smallest          2.083333
Q2 - Below Average     9.929078
Q3 - Above Average    41.549296
Q4 - Largest          95.774648
Name: diagnosis, dtype: float64


### Analysis Questions:

**1. What are the actual quartile boundaries (area values) you calculated?**
The quartile boundaries for area_mean are: 25th Percentile (Q1) - 420.3, 50th Percentile (Median) - 551.1, and 75th Percentile (Q3) - 782.7. 


**2. How many tumors are in each area category? (Should be approximately equal)**
The distribution is nearly equal across the four categories, as expected with quartile splitting. Q1 - Smallest: 144, Q2 - Below Average: 141, Q3 - Above Average: 142, and Q4 - Largest: 142.
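
One reason the counts are close but not identical is that `pd.qcut` uses right-inclusive bin edges, so a value exactly equal to a quartile boundary falls into the lower category. The sketch below demonstrates this on a hypothetical five-value series (`toy`); whether ties at the boundaries fully explain the 144/141/142/142 split in the real data is an assumption that was not verified here.

```python
import pandas as pd

# Hypothetical toy values; 2.0 is exactly the 25th percentile of this series.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

labels = ["Q1", "Q2", "Q3", "Q4"]
categories, edges = pd.qcut(toy, q=4, labels=labels, retbins=True)

print(edges)                                    # bin edges taken from the quantiles
print(categories.value_counts().sort_index())  # Q1 receives both 1.0 and 2.0
```

Passing `retbins=True` on the real `area_mean` column would return the exact edges used for the four categories, which can then be compared with the quartile values printed above.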


**3. What is the malignancy rate (percentage) for each area category?**
As seen in the output, the malignancy rates for these specific bins are approximately:
Q1 = 2%
Q2 = 10%
Q3 = 42%
Q4 = 96%.

**4. How does malignancy rate change as you move from Q1 to Q4? What does this suggest about the relationship between tumor area and diagnosis?**
There is a dramatic, non-linear increase in the malignancy rate as the tumor area increases: the rate rises modestly from Q1 to Q2 (about 2% to 10%) and then jumps sharply in Q3 and Q4 (about 42% and 96%). While a tumor in the smallest quartile has a very low probability of being malignant, a tumor in the largest quartile is almost certainly malignant. This suggests a strong positive association between tumor size (as measured by area_mean) and malignancy, supporting area_mean as a significant predictor of diagnosis in this dataset.

## Part 5: Research Questions

### 5.1 Write Three Answerable Questions 
Write three questions that THIS dataset can answer. Remember: the data can show relationships and patterns, but cannot prove causation.

**1. A question about texture and diagnosis:**
Is there a statistically significant difference in the average texture_mean between malignant and benign tumors, and does high texture variability correlate with a higher malignancy rate?

**2. A question about the standard error (_se) features:**
How does the area_se (standard error of the area) compare between the four area_category quartiles, and do malignant tumors consistently show higher within-sample variability (SE) than benign ones?

**3. A question about symmetry combined with another feature:**
To what extent does the combination of low symmetry_mean and high compactness_mean increase the likelihood of a malignant diagnosis compared to just looking at symmetry alone?

### 5.2 Identify One Question the Data CANNOT Answer

Write one question about patient demographics or survival that this dataset cannot answer, and explain why.

**Question:** What is the average age of patients diagnosed with malignant tumors, and how does their survival rate compare to those with benign tumors?

**Why the data cannot answer this:**
This dataset is focused strictly on image-based measurements of cell nuclei (fine needle aspirate features). It lacks any demographic data (there is no information regarding patient age, ethnicity, or medical history) and any longitudinal or outcome data. The dataset only provides a "snapshot" diagnosis of Malignant/Benign at the time of the biopsy; it does not track patient outcomes, treatments, or survival length after the diagnosis was made.

### 5.3 Grouping Analysis

Answer this question using a groupby analysis: 

**"What is the average compactness_mean for each diagnosis category (M vs B)?"**

In [43]:
# Calculate the average compactness_mean for Malignant (M) and Benign (B)
compactness_analysis = df.groupby('diagnosis')['compactness_mean'].mean()

print("Average Compactness Mean by Diagnosis:")
print(compactness_analysis)

Average Compactness Mean by Diagnosis:
diagnosis
B    0.080085
M    0.145188
Name: compactness_mean, dtype: float64


**Interpretation: How does compactness differ between malignant and benign tumors? What might explain this difference based on your understanding of what compactness measures?** The output shows that malignant tumors have a substantially higher average compactness (approximately 0.145) than benign tumors (approximately 0.080). In this dataset, the average malignant compactness score is about 1.8 times the benign one.
Based on the mathematical definition of compactness, a higher value indicates that the perimeter is much longer than it would be for a smooth, simple shape (like a circle) of the same area. Biologically, benign cells generally maintain regular, smooth oval shapes. In contrast, malignant cells grow rapidly, causing the nuclear boundary to become jagged or "bumpy." This increased surface irregularity lengthens the perimeter, leading to a higher compactness score.

## Part 6: Target Variable Analysis (Bonus)

The `diagnosis` column is our **target variable** - what we're trying to predict. Analyze its relationship with key features.


In [44]:
# Show the distribution of diagnosis
diagnosis_dist = df['diagnosis'].value_counts()
print("Diagnosis Distribution:")
print(diagnosis_dist)

# Calculate summary statistics for 3 key features grouped by diagnosis
key_features = ['radius_mean', 'concavity_mean', 'perimeter_worst']
summary_stats = df.groupby('diagnosis')[key_features].agg(['mean', 'std', 'max'])

print("\nSummary Statistics for Key Features:")
print(summary_stats)

Diagnosis Distribution:
diagnosis
B    357
M    212
Name: count, dtype: int64

Summary Statistics for Key Features:
          radius_mean                  concavity_mean                    \
                 mean       std    max           mean       std     max   
diagnosis                                                                 
B           12.146524  1.780512  17.85       0.046058  0.043442  0.4108   
M           17.462830  3.203971  28.11       0.160775  0.075019  0.4268   

          perimeter_worst                    
                     mean        std    max  
diagnosis                                    
B               87.005938  13.527091  127.1  
M              141.370330  29.457055  251.2  


### Bonus Questions

**1. What percentage of patients in this dataset have malignant tumors?** Based on the output in Part 3.1 and Part 6, there are 212 malignant cases out of 569 total patients - approximately 37.3%.


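**2. Which feature shows the largest difference between malignant and benign tumors?** Of the three features analyzed in Part 6, perimeter_worst shows the largest absolute difference. The mean for benign tumors is about 87.0, while the mean for malignant tumors is about 141.4. This represents a large increase of over 60%, making it a very strong separator between the two classes. Because the three features are on different scales, the relative difference is also worth noting: concavity_mean differs most in relative terms, with a malignant mean (about 0.161) roughly 3.5 times the benign mean (about 0.046).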
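
The comparison can be made explicit with a few lines of arithmetic. The sketch below copies the group means from the Part 6 output and computes both an absolute difference and a malignant-to-benign ratio for each feature; the dictionary and variable names are illustrative only.

```python
# Group means copied from the Part 6 output: (benign, malignant).
group_means = {
    "radius_mean": (12.146524, 17.462830),
    "concavity_mean": (0.046058, 0.160775),
    "perimeter_worst": (87.005938, 141.370330),
}

for feature, (benign, malignant) in group_means.items():
    absolute = malignant - benign
    ratio = malignant / benign
    print(f"{feature:16} absolute diff = {absolute:8.3f}   M/B ratio = {ratio:.2f}")

# Which feature ranks first depends on the yardstick used.
largest_absolute = max(group_means, key=lambda f: group_means[f][1] - group_means[f][0])
largest_ratio = max(group_means, key=lambda f: group_means[f][1] / group_means[f][0])
print(largest_absolute, largest_ratio)
```

Absolute differences favor features measured in large units, so a ratio (or a difference scaled by the standard deviation) is often a fairer basis for comparing features on different scales.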

**3. Why does class imbalance matter for machine learning classification? (You may need to research this)**
Class imbalance (63% Benign vs. 37% Malignant here) matters because many machine learning models are trained to minimize overall error, which rewards getting the majority class right. If a dataset is heavily imbalanced, a model might simply learn to predict the majority class (Benign) every time to achieve high accuracy, while failing to actually identify the "minority" cases (Malignant). In a medical context, this is dangerous because a false negative (telling a sick patient they are healthy) is much worse than a false positive.
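
The risk described above can be quantified with the class counts from Part 3.1. The sketch below evaluates a hypothetical "model" that predicts Benign for every patient; it is not a trained classifier, only a baseline that shows why accuracy alone can mislead.

```python
# Class counts copied from the Part 3.1 output.
n_benign, n_malignant = 357, 212
total = n_benign + n_malignant

# A "model" that predicts Benign for every patient.
true_negatives = n_benign        # benign cases correctly called benign
false_negatives = n_malignant    # malignant cases wrongly called benign
true_positives = 0               # no malignant case is ever flagged

accuracy = (true_positives + true_negatives) / total
recall_malignant = true_positives / (true_positives + false_negatives)

print(f"accuracy:           {accuracy:.3f}")
print(f"recall (malignant): {recall_malignant:.3f}")
```

This baseline is right for every benign patient and wrong for every malignant one, so any real model should be judged against it and on recall for the malignant class, not on accuracy alone.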

**4. If you were building a diagnostic model, which 3 features would you prioritize based on your analysis? Justify your choices.** First, area_mean (or area_worst), because the Part 4 analysis shows the malignancy rate climbing from about 2% in the smallest quartile to about 96% in the largest. Next, perimeter_worst, because the "worst" measurements capture the most extreme nuclei in a sample and, of the three features summarized in Part 6, it shows the largest absolute gap between the B and M groups. Lastly, concavity_mean, because it captures the biological irregularity of cancer cells: the Part 6 stats show malignant tumors have over 3 times the average concavity of benign ones.
