# Your Project Title

#### Judith Romero

In this assignment, you will create a notebook that uses data from the ISLP module and web-scraped data from Wikipedia. The goal is to extract real-world data, process it, and present it in a user-friendly format.

<div style="background-color: #fff2cc; border-left: 6px solid #f1c232; color: #000; padding: 10px;">
You may use AI to assist you in writing the code for this project, but you must link the transcripts in a references section at the bottom of the notebook. The exposition should be your own, though. 
Any code that is beyond the scope of this course should include a reference to documentation, a tutorial, or a generative AI chat.
</div>

## Part 1: Selecting a Dataset

ISLP is the Python companion to *An Introduction to Statistical Learning*. It includes several pedagogically curated datasets across domains (marketing, finance, health, etc.).

Skim the ISLP documentation: [https://islp.readthedocs.io/en/latest/index.html](https://islp.readthedocs.io/en/latest/index.html). Open the **"Datasets used in ISLP"** page and browse the available options.

The dataset I have chosen for my project is the **BrainCancer** dataset from the `ISLP` Python package, which accompanies *An Introduction to Statistical Learning*. The dataset contains information on **88 brain cancer patients**, where each row represents a single patient and each column represents a medical or survival-related variable.

The features of this dataset are:

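- **sex**: the patient's sex (Male or Female)
- **diagnosis**: the type of brain tumor (for example, meningioma or glioma)
- **loc**: the location of the tumor in the brain (infratentorial or supratentorial)
- **ki**: Karnofsky Index, a score of the patient's functional health status (higher is better)
- **gtv**: gross tumor volume, a measure of tumor size (in cubic centimeters)
- **stereo**: the stereotactic radiation method used (SRS or SRT)
- **status**: survival indicator (1 = patient died, 0 = censored)
- **time**: follow-up time until death or censoring (in months)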


## Part 2: Loading Data from a Library

Install the necessary libraries:

> **Note:** Package installation commands are disabled in this Jupyter environment.  
> The code assumes the `ISLP` package is available, as it is in the course environment.
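
Since packages cannot be installed from inside this notebook, it can help to check up front which imports are actually available. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and `lxml` is listed because `pandas.read_html` relies on an HTML parser such as `lxml`.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Modules this notebook relies on (ISLP for the data, lxml for pd.read_html).
required = ["pandas", "matplotlib", "ISLP", "lxml"]
missing = missing_modules(required)
if missing:
    print("Not available in this environment:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Running this check first makes a later `ModuleNotFoundError` or `ImportError` easier to interpret, because it separates environment problems from mistakes in the analysis code.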


In [8]:
import pandas as pd
from ISLP import load_data

BrainCancer = load_data('BrainCancer')




In [None]:
BrainCancer.head()


In [None]:
BrainCancer.info()


In [None]:
BrainCancer.describe()


In [None]:
BrainCancer['sex'].value_counts()


This dataset contains information on 88 patients diagnosed with brain cancer. Based on a preliminary look, the dataset is well organized and manageable. There are roughly equal numbers of male and female patients, and certain diagnoses, such as meningioma, occur more frequently than others.

One noteworthy feature is the `status` variable, which shows that some observations are censored because not all patients died during the study period. Numerical variables such as tumor size (`gtv`) and the Karnofsky Index (`ki`) vary across patients, reflecting differences in health and disease severity.
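
Because `status` marks censoring, a plain average of `time` would understate survival: censored patients were still alive when their follow-up ended. One standard way to account for this is the Kaplan-Meier estimator. The sketch below is a minimal hand-rolled version on made-up numbers (the `times` and `events` lists are hypothetical and are not taken from BrainCancer); in practice a dedicated survival analysis library would normally be used instead.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death was observed.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Deaths observed exactly at time t.
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical toy data: five patients, two of them censored (event = 0).
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"t = {t:>2}: estimated survival = {s:.3f}")
```

Censored patients never count as deaths, but they do stay in the at-risk group until their follow-up ends, which is how the estimator uses their partial information.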

### Guiding Question

How are tumor diagnosis and patient health status, as measured by the Karnofsky Index, related to survival time in brain cancer patients?

This question is relevant because identifying the characteristics associated with longer or shorter survival can help physicians and researchers make better treatment decisions and improve patient care.
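
One hedged way to start on this question is to summarize follow-up time within groups. The sketch below uses a small made-up DataFrame with the same column names as BrainCancer (the values are hypothetical), and splitting the Karnofsky Index at 80 is an arbitrary illustrative choice. Because `time` includes censored observations, these medians describe follow-up time rather than true survival.

```python
import pandas as pd

# Hypothetical rows that mimic the BrainCancer columns; not real patient data.
toy = pd.DataFrame({
    "diagnosis": ["Meningioma", "Meningioma", "HG glioma",
                  "HG glioma", "LG glioma", "Meningioma"],
    "ki": [90, 70, 60, 80, 90, 80],
    "status": [0, 0, 1, 1, 0, 1],
    "time": [50.0, 40.0, 6.0, 10.0, 30.0, 20.0],
})

# Bin the Karnofsky Index into two groups (the cutoff of 80 is an assumption).
toy["ki_group"] = toy["ki"].map(lambda k: "KI >= 80" if k >= 80 else "KI < 80")

# Number of patients, observed deaths, and median follow-up time per diagnosis.
by_diagnosis = toy.groupby("diagnosis").agg(
    patients=("time", "size"),
    deaths=("status", "sum"),
    median_time=("time", "median"),
)
print(by_diagnosis)

# The same median, split by Karnofsky group.
by_ki = toy.groupby("ki_group")["time"].median()
print(by_ki)
```

With the real data, `toy` would be replaced by the loaded `BrainCancer` DataFrame, and the group sizes should be checked before reading much into any single median.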

## Part 3: Scraping Data from Wikipedia

I chose the Wikipedia page "List of cancer mortality rates in the United States" as the supplemental dataset. This page provides broader context for cancer outcomes by listing age-adjusted mortality rates (per 100,000 population) for different cancer types in the United States between 2013 and 2017.

In [5]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_cancer_mortality_rates_in_the_United_States"

tables = pd.read_html(url)

# Use the first table on the page
cancer_mortality = tables[0]

cancer_mortality.head()


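
Taking `tables[0]` assumes the mortality table is the first one on the page, which can change whenever the article is edited. A slightly more defensive sketch picks the table by its column headers instead. The helper `pick_table` and the keyword `"rate"` are hypothetical choices, and the stand-in DataFrames below replace the real `pd.read_html` output so the example runs offline. Note that `pd.read_html` needs an HTML parser such as `lxml` to be installed, and that it also accepts a `match` argument that filters tables by the text they contain.

```python
import pandas as pd


def pick_table(tables, keyword):
    """Return the first DataFrame whose column labels mention `keyword`."""
    for table in tables:
        labels = " ".join(str(col) for col in table.columns).lower()
        if keyword.lower() in labels:
            return table
    raise ValueError(f"no table with a column containing {keyword!r}")


# Stand-ins for the list that pd.read_html(url) would return (made-up values).
tables = [
    pd.DataFrame({"Note": ["This article needs updating"]}),
    pd.DataFrame({"Type": ["Cancer A", "Cancer B"], "Mortality rate": [10.0, 5.0]}),
]

picked = pick_table(tables, "rate")
print(picked)
```

Raising an error when no table matches is a deliberate choice here: it fails loudly instead of silently analyzing the wrong table.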

In [None]:
cancer_clean = cancer_mortality.copy()

# Rename columns for clarity
cancer_clean.columns = ["Cancer Type", "Mortality Rate (per 100k)"]

# Convert mortality rate column to numeric
cancer_clean["Mortality Rate (per 100k)"] = pd.to_numeric(
    cancer_clean["Mortality Rate (per 100k)"],
    errors="coerce"
)

# Drop rows with missing values
cancer_clean = cancer_clean.dropna()

cancer_clean.head()
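
Scraped Wikipedia cells often carry footnote markers or thousands separators, and `pd.to_numeric(..., errors="coerce")` silently turns such values into missing values that `dropna()` then discards. The sketch below strips that debris first and reports which rows are lost. The sample strings are made up, and the exact patterns on the live page are an assumption.

```python
import pandas as pd

RATE = "Mortality Rate (per 100k)"

# Hypothetical raw cells as they might come out of pd.read_html.
raw = pd.DataFrame({
    "Cancer Type": ["Cancer A", "Cancer B", "Cancer C", "Cancer D"],
    RATE: ["38.5[1]", "1,013.9", "n/a", " 13.9 "],
})

rate = (
    raw[RATE]
    .astype(str)
    .str.replace(r"\[[^\]]*\]", "", regex=True)  # drop footnote markers like [1]
    .str.replace(",", "", regex=False)           # drop thousands separators
    .str.strip()
)

clean = raw.assign(**{RATE: pd.to_numeric(rate, errors="coerce")})

# Report what would be dropped before actually dropping it.
dropped = clean[clean[RATE].isna()]
print(f"Dropping {len(dropped)} row(s):", dropped["Cancer Type"].tolist())

clean = clean.dropna(subset=[RATE]).reset_index(drop=True)
print(clean)
```

Printing the dropped rows makes it easy to confirm that only genuinely non-numeric entries were removed.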


### Present the scraped data
The cleaned dataset shows age-adjusted cancer mortality rates per 100,000 people in the United States for 2013 to 2017. Each row shows a cancer type and its mortality rate.

This dataset complements the BrainCancer dataset, which focuses on individual patient survival outcomes, by providing population-level context for cancer outcomes.

## Part 4: Visualizing and Analyzing the Data

### Visualization 1

In [6]:
import matplotlib.pyplot as plt

plt.hist(BrainCancer['time'], bins=15)
plt.xlabel("Survival Time")
plt.ylabel("Number of Patients")
plt.title("Distribution of Survival Time for Brain Cancer Patients")
plt.show()



This histogram shows the distribution of survival times for patients in the BrainCancer dataset. Most patients have comparatively short follow-up or survival times, while fewer patients have long ones.

The distribution shows that survival varies widely across patients, which may reflect factors such as tumor type, the patient's health, and the course of treatment.
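
To look at one of those possible factors, follow-up time can be compared across diagnoses with a box plot. This is a minimal sketch on made-up values that reuse the BrainCancer column names; with the real data, `toy` would be replaced by the loaded DataFrame, and any rows with a missing diagnosis would need to be handled first. Censored patients are included, so the boxes describe follow-up time rather than survival.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values using the BrainCancer column names; not real patient data.
toy = pd.DataFrame({
    "diagnosis": ["Meningioma"] * 4 + ["HG glioma"] * 4 + ["LG glioma"] * 3,
    "time": [45.0, 60.0, 30.0, 52.0, 6.0, 11.0, 9.0, 20.0, 35.0, 28.0, 50.0],
})

# One list of follow-up times per diagnosis (groupby sorts the group names).
groups = {name: grp["time"].tolist() for name, grp in toy.groupby("diagnosis")}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
# boxplot places the boxes at positions 1..n, so label those positions.
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("Diagnosis")
ax.set_ylabel("Follow-up Time")
ax.set_title("Follow-up Time by Diagnosis (toy data)")
plt.show()
```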

### Visualization 2

In [None]:
# Sort by highest mortality rate
top_cancers = cancer_clean.sort_values(
    "Mortality Rate (per 100k)",
    ascending=False
).head(10)

plt.barh(top_cancers["Cancer Type"], top_cancers["Mortality Rate (per 100k)"])
plt.xlabel("Mortality Rate (per 100k)")
plt.ylabel("Cancer Type")
plt.title("Top 10 Cancer Types by Mortality Rate in the U.S.")
plt.gca().invert_yaxis()
plt.show()


This bar chart shows the ten cancer types in the United States with the highest age-adjusted mortality rates per 100,000 people. Lung and colorectal cancer have considerably higher death rates than the other cancer types.

This chart provides population-level context for cancer outcomes: mortality differs widely by cancer type, which is consistent with the poor survival associated with aggressive malignancies such as some brain tumors.

## Part 5: Executive Summary

This project investigated the following guiding question: **How are tumor diagnosis and patient health status, as measured by the Karnofsky Index, related to survival time in patients with brain cancer?** To address it, I examined the BrainCancer dataset from the ISLP package, which includes clinical and survival data for 88 brain cancer patients. To provide population-level context, I also used a supplemental dataset scraped from Wikipedia that reports age-adjusted cancer mortality rates in the United States.

The distribution of survival times in the BrainCancer data shows substantial variation across patients. The survival time histogram shows that many patients have comparatively short survival or follow-up times, while a smaller share live considerably longer. This suggests that tumor characteristics and individual patient traits play an important role in survival. The Wikipedia mortality data show that some cancer types have much higher death rates than others, which is consistent with the idea that disease type and severity strongly affect cancer outcomes.

Several limitations should be noted. First, the BrainCancer dataset is small, so it may not be representative of all patients with brain cancer. Second, the data are observational, so causal relationships cannot be established. Third, the Wikipedia data are population-level statistics and cannot be linked directly to individual patient outcomes.

As a next step, a larger and more recent clinical dataset could be used to improve representativeness. Fitting a survival model, such as a Cox proportional hazards model or a machine learning-based survival model, could give more precise estimates of how tumor type and health status affect survival.
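
For reference, the Cox proportional hazards model mentioned above expresses the hazard for a patient with covariates $x_1, \dots, x_p$ as a shared baseline hazard scaled by an exponential term. Which covariates to include (for example, diagnosis and Karnofsky Index) is an assumption for illustration, not something fitted in this project:

$$
h(t \mid x) = h_0(t)\, \exp\left(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\right)
$$

Here $h_0(t)$ is the unspecified baseline hazard, and $\exp(\beta_j)$ is the hazard ratio associated with a one-unit increase in $x_j$ when the other covariates are held fixed. Censored patients still contribute to the fit through the partial likelihood, so they are not discarded.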


## Part 6: References

- ISLP BrainCancer dataset documentation.  
  https://islp.readthedocs.io/en/latest/datasets.html  
  (Accessed December 9, 2025)

- Wikipedia contributors. *List of cancer mortality rates in the United States*.  
  https://en.wikipedia.org/wiki/List_of_cancer_mortality_rates_in_the_United_States  

- pandas documentation: *read_html API reference*.  
  https://pandas.pydata.org/docs/reference/api/pandas.read_html.html  

- Microsoft Copilot. *AI companion by Microsoft*.  
  https://copilot.microsoft.com  


