# COGS 188 - Final Project

# Predicting Prostate Cancer Outcomes Using Medical and Lifestyle Data
##### [Link to Github Repository](https://github.com/danielrmerskine/COGS188_Final_Project)

## Group members

- Daniel Erskine
- Alec Slim
- Jeff Ung
- Rohun Kulshrestha

# Abstract 
Prostate cancer is one of the most common cancers among men and a leading cause of cancer-related deaths, making early detection critical for improving patient outcomes. Our project therefore aims to develop a supervised machine learning model that predicts the likelihood of early-stage prostate cancer using non-invasive medical and lifestyle data. Using a dataset of 27,945 patients with 30 relevant features, we preprocess the data by encoding categorical variables and normalizing numerical values to ensure consistency. We then train a neural network with two hidden layers, using ReLU activations and a sigmoid output to predict a risk score between 0 and 1. To assess model performance, we evaluate accuracy, precision, recall, and F1-score, which together capture the trade-off between missed cancers and false alarms. Our goal is to create a robust, data-driven diagnostic tool that could assist healthcare professionals in detecting prostate cancer earlier, which could lead to better-informed treatment strategies.

# Background

Prostate cancer is the second most common type of cancer to develop and the second leading cause of cancer death in men in the United States. Some methods of diagnosing prostate cancer, such as a digital rectal exam or a prostate biopsy, are quite invasive and can be subject to human error. Neural networks give us the opportunity to take a more data-driven approach by analyzing medical and family history. Below are some examples of past work in which researchers used AI-powered tools to support prostate cancer care, helping healthcare professionals detect the cancer early and decide on possible treatment strategies.

The first resource<a name=”note1”></a>[<sup>[1]</sup>](#note1) I found related to this project described a method for predicting whether a patient’s prostate cancer was benign or malignant. Roffman and his fellow researchers used a multiparameterized neural network that takes patient health information as input to predict the risk of prostate cancer. The team concluded that their neural network demonstrated high specificity and low sensitivity, so it could potentially be used as a non-invasive method for cancer risk assessment. We hope to build a more relevant neural network than Roffman and his team for two reasons: our dataset is much more up to date (Roffman’s dataset contained data from 1997 to 2015 and ours is from 2019 to 2024), and it has a much higher rate of malignant tumors (about 1.6% in Roffman’s dataset versus 30% in ours). In this way, we hope to expand on their research.

The next resource<a name=”note2”></a>[<sup>[2]</sup>](#note2) I found related to this project is similar to the previous one, except that Esteban and his team use far more modeling techniques (classification tree, random forest, neural networks, and more) than Roffman. Their dataset is based on 4799 patients in Catalonia, Spain, which is less than one twentieth of the size of Roffman’s dataset: this is slightly concerning. Esteban and his team used an 80-20 training and validation split. The modeling technique with the best performance was XGBoost. The most influential parameters on the result were digital rectal examination and family history, which makes sense logically.

The third resource we explored <a name=”note3”></a>[<sup>[3]</sup>](#note3) was a study where Talaat and her team used a convolutional neural network (ResNet50) with a large dataset of annotated medical images for early detection of prostate cancer. Talaat and her team’s model achieved an accuracy rate of 95.24%. The study also discussed the ethical implications of balancing overdiagnosis and early detection and how it is still debated. 

The fourth resource<a name=”note4”></a>[<sup>[4]</sup>](#note4) we dove into reviewed the current landscape of using AI-powered diagnostic tools to help give clinicians valuable insights from medical data that could be used to improve patient outcomes. Although the authors of this research did not create any machine learning models like the previously mentioned studies, they gave us more knowledge as to what we can include in the ethics and privacy sections of our project. Agrawal concluded that although there is a bright future for these AI-powered tools, there are still regulatory hurdles and ethical considerations that we must consider as a society.

The fifth resource<a name=”note5”></a>[<sup>[5]</sup>](#note5) is another review that did not include the creation of a machine learning model but instead gave us some insight into the current state of machine learning in prostate cancer diagnosis. Olabanjo and his team found that the United States has produced the most research on prostate cancer diagnosis with machine learning, that magnetic resonance images are the most commonly used data in image-based studies, and that transfer learning is the most common machine learning approach for diagnosing prostate cancer. Four of the six researchers on Olabanjo’s team were from Nigeria, and they discussed the higher prevalence and mortality rate of prostate cancer in developing countries.

The final resource<a name=”note6”></a>[<sup>[6]</sup>](#note6) is yet another review, this one performed by medical doctors, and it does not involve any computer scientists creating machine learning models. Riaz and his team explored the role of AI in many different stages of prostate cancer medical treatment: prostate cancer drug discovery, clinical trials, and clinical practice guidelines. The authors also discussed how human-AI collaboration will become more and more symbiotic in cancer care and will be used to augment and enhance human decision making with accurate and real-time data. They emphasized that human oversight and domain expertise are hugely important when implementing AI in prostate cancer care. This review will further help us write the ethics and privacy section of our project, thanks to the standards and ethical frameworks explored by Riaz and his team.

Overall, we explored three studies that used AI implementations to detect prostate cancer in patients around the world and three reviews that will help our team consider the ethical concerns of allowing machine learning models to make decisions affecting the human body. Daniel Erskine’s (me) father is a radiologist and we asked him to explain some of the intricacies of these medical procedures so we gained more knowledge on the topic.

# Problem Statement

Our goal is to develop a supervised machine learning model for early prostate cancer detection that uses non-invasive patient data (i.e., medical history, lifestyle factors, and routine clinical test results). By integrating these diverse data sources with advanced machine learning techniques, we aim to improve diagnostic accuracy, reduce reliance on invasive procedures, and ultimately enhance patient outcomes. 

Our model will be a binary classifier that outputs a risk score between 0 and 1, representing the probability of early-stage prostate cancer. This makes the problem quantifiable, since the risk score is derived from a defined set of numerical inputs and logical decision boundaries. The input data comprises standardized, objective variables that are routinely collected in clinical settings, ensuring that the problem is both measurable and replicable. Our model's performance will be evaluated using metrics such as accuracy, precision, recall (sensitivity), and F1-score.




# Data

Dataset: https://www.kaggle.com/datasets/ankushpanday1/prostate-cancer-prediction-dataset

This dataset has 27,945 observations and 30 variables. The variables consist of the following: Patient_ID, Age, Family_History, Race_African_Ancestry, PSA_Level, DRE_Result, Biopsy_Result, Difficulty_Urinating, Weak_Urine_Flow, Blood_in_Urine, Pelvic_Pain, Back_Pain, Erectile_Dysfunction, Cancer_Stage, Treatment_Recommended, Survival_5_Years, Exercise_Regularly, Healthy_Diet, BMI, Smoking_History, Alcohol_Consumption, Hypertension, Diabetes, Cholesterol_Level, Screening_Age, Follow_Up_Required, Prostate_Volume, Genetic_Risk_Factors, Previous_Cancer_History, Early_Detection. 

Each observation consists of patient information such as health metrics (BMI, PSA level, age, etc.) as well as patient predispositions such as drinking and smoking history. Some of the critical variables are PSA level, previous cancer history, and prostate volume. The dataset does not need any cleaning at the moment because there are no missing values or incomplete observations, but this may change depending on specific circumstances when designing the neural network.


# Proposed Solution

To address our problem, we propose a supervised learning approach using a neural network to detect early-stage prostate cancer from non-invasive patient data. The process begins with preprocessing our data (non-predictive features are removed, categorical variables are one-hot encoded, and numerical features are standardized). This ensures that all input data are formatted consistently for the training phase. 

The neural network itself is designed with two hidden layers using ReLU activations and concludes with a single output neuron paired with a sigmoid activation function, producing a risk score between 0 and 1. The model will be trained using binary cross-entropy loss and optimized via the Adam optimizer, with a learning rate scheduler to dynamically adjust training based on validation performance. 

To validate our model, the data is split into training, validation, and test sets, and performance is evaluated using metrics like accuracy, precision, recall, and F1 score. To ensure our solution is reproducible, we will provide clear and detailed documentation that covers steps like training, plotting, and evaluation. We will also provide a `requirements.txt` file with all library dependencies and set random seeds to ensure consistent results. 
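As a rough illustration of the architecture described above, the following NumPy sketch runs a forward pass through two ReLU hidden layers and a sigmoid output, then computes the binary cross-entropy loss. The hidden layer widths (64 and 32), the input width of 27, the He-style initialization, and the random toy inputs are assumptions made for illustration; they are not taken from our training code, and no training step is shown.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_features, hidden1=64, hidden2=32, seed=0):
    # Layer widths are hypothetical; weights use He-style scaling.
    rng = np.random.default_rng(seed)
    sizes = [n_features, hidden1, hidden2, 1]
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, X):
    # Two hidden layers with ReLU, then one sigmoid output neuron.
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3).ravel()

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities so log() never receives exactly 0 or 1.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Toy data: 8 hypothetical patients with 27 standardized features each.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 27))
y = rng.integers(0, 2, size=8)

params = init_params(n_features=27)
risk = forward(params, X)          # one risk score per patient, between 0 and 1
loss = binary_cross_entropy(y, risk)
print(risk.round(3), round(loss, 4))
```

Each entry of `risk` plays the role of the risk score described above; an optimizer such as Adam would adjust the weights to lower `loss`.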

<br>

# Evaluation Metrics

Accuracy is the proportion of correctly predicted instances (both positive and negative) out of the total number of instances.
<br>
<br>

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$   


<br>
<br>
Precision is the ratio of true positives to the sum of true and false positives, indicating the correctness of positive predictions.
<br>
<br>

$$\text{Precision} = \frac{TP}{TP + FP}$$

<br>
<br>
Recall (Sensitivity) is the ratio of true positives to the sum of true positives and false negatives, reflecting the model's ability to capture all positive instances.
<br>
<br>

$$\text{Recall} = \frac{TP}{TP + FN}$$

<br>
<br>
F1-Score is the harmonic mean of precision and recall, providing a balance between the two. 
<br>
<br>

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

<br>
<br>

Where:
- **TP** = True Positives
- **TN** = True Negatives
- **FP** = False Positives
- **FN** = False Negatives
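
The four formulas above can be sketched as one small helper. This is a minimal illustration in plain Python; the function name and the confusion-matrix counts in the example are hypothetical and are not results from our model.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 patients.
m = classification_metrics(tp=30, tn=50, fp=10, fn=10)
print(m)
```

With these made-up counts, 80 of 100 predictions are correct, and precision and recall are both 30/40.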

# Data Analysis

As previously stated, our dataset has a total of 27,945 observations and 30 features. We separated the features into groups based on their characteristics to get a better understanding of the dataset.

### Feature Groups

| Group Label | Identifier | Demographic Information | Medical History | Lifestyle | Diagnosis | Symptoms |
|---|---|---|---|---|---|---|
| 1  | Patient_ID | Age | Genetic_Risk_Factors | Alcohol_Consumption | PSA_Level | Difficulty_Urinating |
| 2 |  | Race_African_Ancestry | Hypertension | Exercise_Regularly | DRE_Result | Weak_Urine_Flow |
| 3 |  | Family_History | Diabetes | Smoking_History | Prostate_Volume | Blood_in_Urine |
| 4 |  |  | Cholesterol_Level | Healthy_Diet | Early_Detection | Pelvic_Pain |
| 5 |  |  | Previous_Cancer_History |  | Cancer_Stage | Back_Pain |
| 6 |  |  | BMI |  | Biopsy_Result | Erectile_Dysfunction |
| 7 |  |  | Screening_Age |  | Treatment_Recommended |  |
| 8 |  |  | Follow_Up_Required |  | Survival_5_Years |  |

### Feature Explanation

| Feature | Explanation | 
|---|---|
| Patient_ID | Unique identifier for each patient in our dataset (Integer) |
| Age | Current age of the patient at time of data collection (Integer) |
| Race_African_Ancestry | Whether a patient is of African Ancestry, who are at higher risk for prostate cancer (Yes/No) |
| Family_History | Whether a patient has a family history of prostate cancer (Yes/No) |
| Genetic_Risk_Factors | Whether a patient has genetic markers associated with prostate cancer (Yes/No) |
| Hypertension | Whether a patient has high blood pressure (Yes/No) |
| Diabetes | Whether a patient has diabetes (Yes/No) |
| Cholesterol_Level | Patient's cholesterol level ("Normal" or "High") |
| Previous_Cancer_History | Whether a patient has had cancer before (Yes/No) |
| BMI | A patient's body mass index which is a measure of body fat (Number) |
| Screening_Age | Age at which the patient was first screened for prostate cancer (Integer) |
| Follow_Up_Required | Whether a follow-up medical visit was recommended after diagnosis (Yes/No) |
| Alcohol_Consumption | Level of a patient's alcohol consumption ("Low" or "Moderate" or "High") |
| Exercise_Regularly | Whether a patient exercises frequently (Yes/No) |
| Smoking_History | Whether a patient has a history of smoking (Yes/No) |
| Healthy_Diet | Whether a patient eats healthy food regularly (Yes/No) |
| PSA_Level | Blood test result that gives a patient's Prostate-Specific Antigen Level, which is a protein produced by the prostate gland (Number) |
| DRE_Result | Result of a Digital Rectal Exam ("Normal" or "Abnormal") |
| Prostate_Volume | Size of the prostate gland measured in cubic centimeters (Number) |
| Early_Detection | Whether prostate cancer was detected early (Yes/No) |
| Cancer_Stage | Stage of a patient's prostate cancer ("Localized" or "Advanced" or "Metastatic") |
| Biopsy_Result | Result of a prostate biopsy ("Benign" or "Malignant") |
| Treatment_Recommended | Type of treatment the medical professional recommended ("Active Surveillance" or "Hormone Therapy" or "Immunotherapy" or "Radiation" or "Surgery") |
| Survival_5_Years | Whether a patient survived five years after their diagnosis (Yes/No) |
| Difficulty_Urinating | Whether a patient experiences trouble urinating (Yes/No) |
| Weak_Urine_Flow | Whether the flow of a patient's urine is weak when they are urinating (Yes/No) |
| Blood_in_Urine | Whether there is blood in a patient's urine (Yes/No) |
| Pelvic_Pain | Whether the patient has pain in their pelvis (Yes/No) |
| Back_Pain | Whether the patient has pain in their back (Yes/No) |
| Erectile_Dysfunction | Whether the patient has erectile dysfunction (Yes/No) |

## Feature Correlation with Biopsy Result

### Loading and Cleaning Data

In [None]:
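import pandas as pd

# Load the Kaggle prostate cancer dataset from the local CSV file.
file = "prostate_cancer_prediction.csv"
df = pd.read_csv(file)

# Drop the arbitrary identifier and the screening-age column.
drop = ["Patient_ID", "Screening_Age"]
cleanDf = df.drop(columns=drop, errors="ignore")

# Work on a copy so cleanDf stays untouched while we encode.
encodedDf = cleanDf.copy()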

This code reads our dataset (`prostate_cancer_prediction.csv`) into a dataframe. We decided to remove `Patient_ID` because it is completely arbitrary and has no relation to the other data. We decided to remove `Screening_Age` because there is already an `Age` feature that serves the same purpose. These features are removed from our original dataframe to create `cleanDf`, and `cleanDf` is then copied to create a new dataframe that will be used for encoding.

In [None]:
binaryCols = [
    "Family_History", "Race_African_Ancestry", "Difficulty_Urinating", "Weak_Urine_Flow", 
    "Blood_in_Urine", "Pelvic_Pain", "Exercise_Regularly", "Healthy_Diet", "Smoking_History", 
    "Hypertension", "Diabetes", "Genetic_Risk_Factors", "Previous_Cancer_History", "Early_Detection", 
    "Survival_5_Years", "Back_Pain", "Erectile_Dysfunction", "Follow_Up_Required"
]

# Map Yes/No answers to 1/0; Series.map turns any other value into NaN.
for col in binaryCols:
    if col in encodedDf.columns:
        encodedDf[col] = encodedDf[col].map({"Yes": 1, "No": 0})

We then determined which features were binary, meaning that their only values were `Yes` and `No`. We used this list to encode every `Yes` as `1` and every `No` as `0`.

In [None]:
encodedDf["Biopsy_Result"] = encodedDf["Biopsy_Result"].map({"Malignant": 1, "Benign": 0})

encodedDf["DRE_Result"] = encodedDf["DRE_Result"].map({"Normal": 0, "Abnormal": 1})

encodedDf["Alcohol_Consumption"] = encodedDf["Alcohol_Consumption"].map({"Low": 1, "Moderate": 2, "High": 3})

encodedDf["Cholesterol_Level"] = encodedDf["Cholesterol_Level"].map({"Normal": 0, "High": 1})

encodedDf["Cancer_Stage"] = encodedDf["Cancer_Stage"].map({"Localized": 1, "Advanced": 2, "Metastatic": 3})

encodedDf["Treatment_Recommended"] = encodedDf["Treatment_Recommended"].map({"Active Surveillance": 1, "Hormone Therapy": 2, "Immunotherapy": 3, "Radiation": 4, "Surgery": 5})

We then had to determine a system for encoding the features in our dataset that did not have binary "Yes" and "No" responses. The features `Biopsy_Result`, `DRE_Result`, and `Cholesterol_Level` all had binary responses in a different form, so it was easy to determine what to encode as `1` or `0`. `Alcohol_Consumption` was slightly more difficult because it has three values, but since the values are ordinal we were able to determine which should be `1`, `2`, and `3`. Encoding `Cancer_Stage` required a little more research into what each of the three stages of cancer means. `Localized` cancer is still located only inside the prostate. `Advanced` cancer has spread to tissue near the prostate. `Metastatic` cancer has spread to other parts of the body through the bloodstream or lymphatic system. We gave `Localized` a value of `1`, `Advanced` a value of `2`, and `Metastatic` a value of `3` to reflect their increasing severity. We did the same for `Treatment_Recommended`. If `Active Surveillance` was the recommended treatment, the cancer was considered low risk, while if `Surgery` was the recommended treatment, the cancer was considered high risk. `Hormone Therapy`, `Immunotherapy`, and `Radiation` all fell between those two in terms of severity, so we gave them corresponding values between `1` and `5`.
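One caveat of the `.map` approach above is that any label missing from the dictionary (for example a misspelling such as "Localize") silently becomes `NaN`. The sketch below shows one way to guard against that. The helper name `encode_ordinal` and the toy series are hypothetical and are not part of our notebook.

```python
import pandas as pd

stage_map = {"Localized": 1, "Advanced": 2, "Metastatic": 3}

def encode_ordinal(series, mapping):
    """Map labels to integers and fail loudly on labels missing from the mapping."""
    encoded = series.map(mapping)
    # Rows that were present in the input but produced NaN were not in the mapping.
    unmapped = series[encoded.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped labels in {series.name!r}: {sorted(unmapped)}")
    return encoded

# Toy example with the three stage labels.
toy = pd.Series(["Localized", "Metastatic", "Advanced"], name="Cancer_Stage")
encoded_toy = encode_ordinal(toy, stage_map)
print(encoded_toy.tolist())

# A misspelled label raises instead of silently turning into NaN.
bad = pd.Series(["Localized", "Localize"], name="Cancer_Stage")
try:
    encode_ordinal(bad, stage_map)
    raised = False
except ValueError:
    raised = True
```

The same helper could wrap each of the mappings above, assuming the columns contain only the labels listed in the feature table.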

In [None]:
biopsyCorrelations = encodedDf.corr(numeric_only=True)["Biopsy_Result"].sort_values(ascending=False)

print("Ranked Feature Correlations with Biopsy Result:")
print(biopsyCorrelations)

The code above calculates the Pearson correlation coefficient between `Biopsy_Result` and every other feature in the dataset. It then prints the coefficients ranked from highest to lowest. In a Pearson correlation, values range from -1 (a strong negative correlation) to 1 (a strong positive correlation). Excluding the trivial self-correlation of `Biopsy_Result` with itself, our values range from `-0.015013` to `0.009137`. 

The feature with the largest absolute correlation is `Early_Detection`, with a coefficient of `-0.015013`: cases that were not detected early are very slightly more likely to have a malignant biopsy result. The largest positive coefficient belongs to `Weak_Urine_Flow` at `0.009137`. The next two largest in magnitude are `Family_History` and `Healthy_Diet`, with coefficients of `-0.008705` and `-0.007066` respectively. Both are negative, so in this dataset patients with a healthy diet and, counterintuitively, patients with a family history of prostate cancer are marginally less likely to have a malignant biopsy result. At these magnitudes, none of the differences is large enough to be meaningful.

The feature with the weakest correlation is `Survival_5_Years`, meaning that whether a patient survived five years after their diagnosis has almost no linear relationship with whether their biopsy was malignant or benign. The Pearson correlation coefficient of `Survival_5_Years` is `0.000453`. The features with the next weakest correlations are `Follow_Up_Required` and `Age`, with correlation coefficients of `0.000590` and `0.000912` respectively. 

Based on the correlation coefficients we have calculated, every feature has only a very weak correlation with the biopsy result. Our strongest correlation has an absolute value of `0.015013`, which is very small. We now know that none of the features in our dataset exhibit a strong linear relationship with `Biopsy_Result`.
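Because "strongest" in the discussion above refers to absolute value while the printed ranking is sorted by signed value, it can help to rank by magnitude directly. The following is a minimal sketch on a toy dataframe. The helper name `rank_by_abs_correlation` and the columns `feature_a` and `feature_b` are made up for illustration.

```python
import pandas as pd

def rank_by_abs_correlation(df, target):
    """Return signed Pearson correlations with `target`, ordered by magnitude."""
    corr = df.corr(numeric_only=True)[target].drop(target)  # drop the self-correlation
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Toy data: feature_a is the exact opposite of the target, feature_b partly follows it.
toy = pd.DataFrame({
    "Biopsy_Result": [0, 1, 0, 1, 1, 0],
    "feature_a": [1, 0, 1, 0, 0, 1],
    "feature_b": [0, 1, 0, 1, 0, 0],
})
ranked = rank_by_abs_correlation(toy, "Biopsy_Result")
print(ranked)
```

Applied to `encodedDf`, the same call would put `Early_Detection` first even though its coefficient is negative.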

# Results

After preprocessing all of the data, we trained our multi-layer perceptron model for up to 100 epochs, with early stopping to halt training once the validation loss hit a plateau. We recorded accuracy, precision, recall, and F1 score to evaluate performance, as described below.


### Hyperparameters

To start our hyperparameter tuning, we set the learning rate to 0.001. We chose this value because it allowed the model's weights to update and converge steadily during training. After observing the model's performance, we added a learning rate scheduler that reduces the learning rate by a factor of 0.1 when the validation loss plateaus, allowing finer adjustments in the later stages of training. We chose the Adam optimizer for its adaptive learning rate properties and robust performance, which help balance rapid convergence with stability. 


### Optimization
To refine our neural network and improve its performance on prostate cancer detection, we implemented several optimization strategies. 

A learning rate scheduler (`ReduceLROnPlateau`) was used to adjust the learning rate based on validation loss. When the validation loss plateaued for 5 epochs, the scheduler reduced the learning rate by a factor of 0.1. This allowed the optimizer to take smaller, more precise steps as the model converged, leading to finer tuning of the weights in later epochs. 

Another strategy was early stopping with a patience of 10 epochs. If no improvement in validation loss was observed for 10 epochs in a row, training was stopped and the best-performing model was saved. This helps ensure that the model does not continue training past the point of optimal generalization, which helps prevent overfitting. 

To help monitor our results, both training and validation losses were recorded at each epoch and then plotted to visualize the model's convergence behavior over time. Plotting both curves alongside each other offered valuable insight into how different learning rates influenced training dynamics. 
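The plateau rule described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the library's `ReduceLROnPlateau` implementation, whose exact counting and threshold behavior may differ. The class name and the validation losses are hypothetical.

```python
class PlateauLRScheduler:
    """Simplified reduce-on-plateau rule: cut the learning rate after `patience` stale epochs."""

    def __init__(self, lr=0.001, factor=0.1, patience=5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Validation loss improved: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                # Plateau detected: shrink the learning rate and start counting again.
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Hypothetical validation losses: three improvements, then five flat epochs.
scheduler = PlateauLRScheduler()
val_losses = [0.70, 0.65, 0.64] + [0.64] * 5
lrs = [scheduler.step(v) for v in val_losses]
print(lrs)
```

In this toy run the learning rate stays at 0.001 until the fifth flat epoch, where it drops by a factor of 0.1.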
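The early stopping rule can be illustrated the same way. The sketch below is framework-independent; the class name `EarlyStopping` and the simulated losses are assumptions for illustration, and a real training loop would save the model weights at the point marked in the comment.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without a better validation loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record this epoch's validation loss and return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            # A real loop would checkpoint the model weights here.
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated validation losses: best value at epoch 2, no improvement afterwards.
stopper = EarlyStopping(patience=10)
simulated = [0.70, 0.66, 0.65] + [0.66] * 20
stopped_at = None
for epoch, loss in enumerate(simulated):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

Here training stops at epoch 12, ten epochs after the best validation loss at epoch 2.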



### Metrics

The metrics defined in the `Evaluation Metrics` section provide insight into the model's overall predictive performance. In addition, we investigated the effect of different learning rates on these outcomes. 

So, looking at the starting rate of `0.001`,

![accuracy graph](4a6f4f2b-95a9-4dda-86f8-b2eeee526c52.png)

We can observe that our model achieved a test accuracy of __61%__, matching its validation accuracy of 61%, which suggests that performance on the test set is consistent with what we saw during validation. The graph above shows the training and validation loss. The precision of the model was __31.33%__, meaning that of all the cases the model flagged as malignant, 31.33% were actually malignant and the rest were false positives. The recall was __0.2256__, meaning the model detects only 22.56% of actual malignant cases, which is far from ideal. Finally, the F1-score came out to __0.2623__, which is consistent with the precision and recall above. 


Moving on to a starting learning rate of `0.00001`,

<img src="lr_0.00001.png">

From this, we can observe that our model achieved a test accuracy of __55%__ with a precision of __0.3002__, a recall of __0.3762__, and an F1 score of __0.3339__.

Although a lower learning rate slows down the training process, we think the trade-off is worth it to get a higher recall. Since our problem is to determine whether or not patients have prostate cancer, missing a malignant case is far more consequential than incorrectly flagging a benign case. Therefore, we think a model with higher recall, even at lower accuracy, is better in this context.
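The reported F1 scores can be cross-checked from the reported precision and recall, and the accuracies can be put in context with a majority-class baseline. The baseline below assumes that the roughly 30% malignant rate mentioned in the Background also holds for the test split, which we did not verify separately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for each starting learning rate.
f1_lr_1e3 = f1_score(0.3133, 0.2256)   # learning rate 0.001
f1_lr_1e5 = f1_score(0.3002, 0.3762)   # learning rate 0.00001
print(round(f1_lr_1e3, 4), round(f1_lr_1e5, 4))

# Accuracy of always predicting "benign", under the assumed 30% malignant rate.
malignant_rate = 0.30
baseline_accuracy = 1 - malignant_rate
print(baseline_accuracy)
```

Both computed F1 values round to the figures reported above. Under the stated assumption, both test accuracies fall below the always-benign baseline, which is one more reason to weigh recall rather than accuracy alone when comparing the two runs.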

# Discussion

### Interpreting the result

We believe that the most important takeaway from the results of our project is that our model learns the training data well but does not generalize well to new data. The training loss decreases with more iterations, which means that the model is finding patterns in the training set. The same is not true of the validation loss, which slightly increases with more iterations. This means the model may be overfitting. Overall, our model performs well on familiar data, but its predictive accuracy on new data is not as good as we hoped.

A second point is that the increasing gap between the training and validation loss may suggest that the model is memorizing the training dataset rather than learning patterns that carry over to the validation set, which is the typical signature of overfitting. Because training loss keeps decreasing while validation loss stays flat or rises slightly, our model could benefit from additional regularization techniques.

Another observation is that feature selection or further preprocessing may play a crucial role in improving performance. The model may be missing some key features that could help it generalize to new data. We do not know what those features would be, because we thought we did a fairly good job processing our data. Perhaps a dataset with more relevant data would help as well.

Lastly, one of the more important takeaways from our results is that tuning the learning rate has a significant impact on recall, which is important for prostate cancer detection. When we used a learning rate of `0.001`, our model achieved a test accuracy of `61%`, but the recall was only `22.56%`, meaning that it failed to correctly identify most malignant cases. When we changed the learning rate to `0.00001`, recall increased to `37.62%`. This shows us that with a more conservative learning rate, the model's ability to detect cancer improved.


### Limitations

One of the main limitations of our model is that it may suffer from overfitting, as discussed above. We believe this because of the widening gap between training and validation loss as training progresses. The training loss continues to decrease with more iterations, but the validation loss tends to stay the same or even slightly increase. This could potentially be solved by increasing the amount of training data or by adding more regularization techniques.

Another limitation is that our dataset may be demographically imbalanced. We discussed this in the Ethics & Privacy section of our proposal, but the only patient demographic information we have is whether or not a patient is of African ancestry. We do not know whether a patient was White, Hispanic, Asian, or Native American. Our dataset may underrepresent one of these groups without our knowing it. If only one group can be identified, African ancestry is the most important to include, because this group has the highest rates of prostate cancer and is historically underrepresented in medical data.

The last limitation we wanted to discuss is that all the members of our group had very limited domain knowledge about prostate cancer before starting this project. We had to learn about the various procedures and features that appear in a dataset about patients with prostate cancer. Given the resources available to us, we had to do most of this learning online. While we believe that we did a thorough job researching each topic, we do not have the medical qualifications and education of an actual medical professional. We thought it was important for us to acknowledge this and mention it here.


### Future work
The most obvious area for improvement in our model would be reducing overfitting, as discussed extensively above. We believe our model is overfitting because the training loss continues to decrease while the validation loss stays roughly the same or slightly increases. This means that the model is struggling to generalize to new data. We could address this problem by adding further regularization techniques such as dropout, on top of the early stopping we already use.

Another place where further analysis could improve our results is feature analysis. We used a Pearson correlation analysis and found that all the features have a fairly weak correlation with `Biopsy_Result`. Perhaps another technique, such as principal component analysis (PCA), would let us refine the dataset further and get better results.

A final step that could improve our results would be to wait for more data on prostate cancer to become available. The dataset we used is said to be updated annually on Kaggle, so if we came back and did this project again in 5 years I imagine that we would get different results. We could try to find another open source dataset on prostate cancer online, but we truly believe that this Kaggle dataset is probably the best free dataset we could find on the topic. We just need to give it more time to gather more data.
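As a rough sketch of what PCA would involve, the following NumPy example projects a small synthetic matrix onto its leading principal components using the singular value decomposition. The function name `pca` and the synthetic data are illustrative assumptions; in practice the features would usually be standardized first, and we have not run this on our dataset.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]

# Synthetic data: the second column is almost a multiple of the first, the third is noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2 * base + 0.01 * rng.normal(size=(200, 1)), rng.normal(size=(200, 1))])

projected, ratio = pca(X, n_components=2)
print(projected.shape, ratio.round(3))
```

Because two of the three synthetic columns are nearly redundant, two components capture almost all of the variance here. Whether a similar reduction would help on our features is an open question.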

### Ethics & Privacy

As stated by the prompt for this section, almost every ML project has ethical implications. A ML project in the healthcare industry is certain to have ethical implications 

The primary ethical issue that arises with our project about the dataset itself and the concept of patient data confidentiality. Patient data confidentially means that a patient's health information cannot be used or shared without your written consent, unless certain laws allow it. Although our dataset does not elaborate on the origin of their data besides stating that it is webscraper from “12 Health data Websites,” I believe that our data is ethically sourced due to the fact that these health data websites would certainly be shut down and sued if found exposing patient’s healthcare data without consent. The Health Insurance Portability and Accountability Act (HIPAA) and General Data Protection Regulation (GDPR) both help protect our healthcare data from being stolen.

Another ethical consideration is potential bias in the dataset. If the dataset that we use to train our neural network is disproportionately represented by people from specific demographic groups, the neural network may produce results that misdiagnose groups of people that are not represented correctly in the dataset. We must ensure that the dataset we use is diverse so that our model is fair and accurate. Although we do not have an ethnicity parameter for our dataset, we do have an “Race_African_Ancestry” parameter which 20% of our dataset has as true. This leads me to believe that we will not have an issue of underrepresentation due to 13.7% of the population that our dataset was drawn from (the United States) being of African ancestry. Olabanjo and his team discussed the fact that developing countries have a higher prevalence and mortality rate of prostate cancer, even if the work done in this project is based on a dataset from the United States population, hopefully the results can be used to reduce the amount of prostate cancer around the world.

Over-reliance on AI models in medical decision making is another risk we must consider. Neural networks, such as the one we propose to make, should be used to assist in diagnosis but not be used as a replacement. Physicians will need to remain present and confirm the results given by a machine learning model. AI models may not have all the contextual information needed to give a correct diagnosis. As a society, we need to use the outputs of these large language models as recommendations that can be used to assist medical professions in their work, but not be used to replace them completely.

The last ethical consideration we have is informed consent. From student athletes getting a routine checkup to elderly men getting a digital rectal exam for prostate cancer, patients should be aware of how and when their healthcare data is being used. Patients should be given the option to keep their data private. This consideration matters most for the creator of the dataset we are using, but we believe that we should explore it nonetheless.

If neural networks in healthcare cause ethical problems in the future, we as a society need to be ready to recognize them and know how to deal with them. We imagine that in the future, healthcare companies will employ large numbers of data scientists for this exact purpose.

### Conclusion

In conclusion, our model demonstrates potential: it achieves solid training performance and makes meaningful strides in understanding prostate cancer diagnosis through machine learning. Although challenges remain in generalizing to new data, the model’s ability to learn from the training set provides a foundation for further improvement. The results emphasize the need for refinement before the model can be clinically applicable. By addressing overfitting with regularization techniques and exploring better data preprocessing strategies, we believe the model can become more robust and better at handling unseen data. Future work could involve using more diverse datasets to ensure better representation and generalizability, as well as further hyperparameter tuning and advanced techniques such as ensemble methods. With these enhancements, we are optimistic that our model can contribute to advancing AI-assisted diagnostics in prostate cancer and beyond.

# Footnotes

<a name="note1"></a>1.[^](#note1): Roffman, D. A., Hart, G. R., Leapman, M. S., Yu, J. B., Guo, F. L., Ali, I., & Deng, J. (2018, December). Development and validation of a multiparameterized artificial neural network for prostate cancer risk prediction and stratification. *JCO Clinical Cancer Informatics*. https://pmc.ncbi.nlm.nih.gov/articles/PMC6873987/<br>
<a name="note2"></a>2.[^](#note2): Esteban, L. M., Borque-Fernando, Á., Escorihuela, M. E., Esteban-Escaño, J., Abascal, J. M., Servian, P., & Morote, J. (2025, February 4). Integrating radiological and clinical data for clinically significant prostate cancer detection with machine learning techniques. *Scientific Reports*. https://www.nature.com/articles/s41598-025-88297-6<br>
<a name="note3"></a>3.[^](#note3): Talaat, F. M., El-Sappagh, S., Alnowaiser, K., & Hassan, E. (2024, January 24). Improved prostate cancer diagnosis using a modified ResNet50-based deep learning architecture. *BMC Medical Informatics and Decision Making*. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-024-02419-0<br>
<a name="note4"></a>4.[^](#note4): Agrawal, S., & Vagha, S. (2024, August 5). A comprehensive review of artificial intelligence in prostate cancer care: State-of-the-art diagnostic tools and future outlook. *Cureus*. https://pmc.ncbi.nlm.nih.gov/articles/PMC11374581/#sec3<br>
<a name="note5"></a>5.[^](#note5): Olabanjo, O., Wusu, A., Asokere, M., Afisi, O., Okugbesan, B., Olabanjo, O., Folorunso, O., & Mazzara, M. (2023, September 19). Application of machine learning and deep learning models in prostate cancer diagnosis using medical images: A systematic review. *MDPI*. https://www.mdpi.com/2813-2203/2/3/39<br>
<a name="note6"></a>6.[^](#note6): Riaz, I., Harmon, S., & Chen, Z. (2024, June 27). Applications of artificial intelligence in prostate cancer care: A path to enhanced efficiency and outcomes. *American Society of Clinical Oncology Educational Book*. https://ascopubs.org/doi/10.1200/EDBK_438516<br>
