# Final Tutorial: Cervical Cancer Risk Factors

## Analysis by Sarah Chiariello

### Identifying the Dataset

Cervical Cancer Risk Classification:
https://www.kaggle.com/loveall/cervical-cancer-risk-classification


### Project Description & Plan

Forty years ago, cervical cancer was one of the leading causes of cancer death for women in the United States. Because cervical cancer is one of the most preventable types of cancer, this statistic reflected poorly on the healthcare infrastructure of the time. With more regularly scheduled Pap test screenings, abnormal cells in the cervix could be detected early, and the United States was able to bring down cervical cancer mortality. Sadly, the same cannot be said for developing healthcare systems around the world. Many women worldwide lack the access to healthcare that would enable early detection through screening, at a young age, for the human papillomavirus (HPV) types that increase the risk of cervical cancer later in life. Women with abnormal cells at a young age who do not get regular examinations are at a higher risk of localized cancer, which can lead to invasive cancer by the age of 50. While cervical cancer rates have declined in the US, death rates for African American women are twice as high as those for Caucasian women, and rates of invasive cervical cancer in Hispanic women are more than twice those of Caucasian women. These racial disparities may result from less advanced healthcare systems worldwide, socioeconomic patterns, and low screening rates due to high poverty levels. They may also result from a lack of access to transportation, health insurance, or language translators.

Cervical cancer involves multiple risk factors and must be diagnosed by a healthcare professional after a biopsy examination. Increased sexual activity can raise the risk of contracting HPV, the main risk factor for cervical cancer. HPV is a sexually transmitted infection that can cause abnormal cell growth within the cervix, which a Pap test screens for. Family history of cervical cancer can also be a risk factor, along with the use of oral hormonal contraceptive pills. Other risk factors include, but are not limited to, smoking, past STDs, number of children, and history of diagnosed cervical cancer.

The purpose of this project is to examine data on cervical cancer risk factors and, hopefully, draw meaningful insights about the correlations, and possibly the causal links, among certain risk factors, socioeconomic factors, healthcare structures, and current racial disparities. Note: this may be a lofty goal.


### Related Questions & Hypotheses

With this project, I hope to analyze cervical cancer risk factors.

1. Are there any factors that carry more weight than the others?
2. What do the number of sexual partners and the age at first sexual intercourse mean as risk factors, and how much weight do they carry in the likelihood of developing cancer through the spread of HPV?
3. Hormonal contraceptives are believed to increase the risk of cervical cancer if used for more than a certain number of years. Is this the case? At how many years of use does the risk begin to increase?
4. A further goal of mine would be to find another dataset about race, socioeconomic class, access to a clinic, etc., to further assess how those factors may weigh in on the likelihood of a woman contracting HPV or developing cervical cancer.
5. Does a history of having an STD mediate or moderate the likelihood of developing abnormal cells in the cervix? Does it have more to do with behavioral changes or with pathology? How can we tell?
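
One way to start on question 3 is to bin the years of hormonal contraceptive use and compare the share of positive biopsies in each bin. The sketch below is only an illustration: the values in `toy` are made up (not taken from the dataset), the bin edges are arbitrary, and the column name `HC years (binned)` is a hypothetical helper. A binned rate like this is descriptive and does not by itself establish a threshold or causation.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the dataset.
toy = pd.DataFrame({
    "Hormonal Contraceptives (years)": [0.0, 0.25, 3.0, 7.0, 10.0, 15.0, 15.0, 0.08],
    "Biopsy": [0, 0, 0, 0, 1, 1, 0, 0],
})

# Bin edges are arbitrary choices for illustration; right=False makes each
# bin include its left edge, e.g. [10, 20) for the "10+" label.
bins = [0, 1, 5, 10, 20]
labels = ["<1", "1-5", "5-10", "10+"]
toy["HC years (binned)"] = pd.cut(
    toy["Hormonal Contraceptives (years)"],
    bins=bins,
    labels=labels,
    right=False,
)

# Share of positive biopsies and number of rows in each bin.
rate_by_bin = (
    toy.groupby("HC years (binned)", observed=False)["Biopsy"]
       .agg(["mean", "count"])
)
print(rate_by_bin)
```

On the real data, the `count` column matters as much as the `mean`: bins with only a handful of women give very unstable rates.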

### Data

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\schia\schiariello.github.io\project_data\cervical_raw.csv")
df


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN (Cervical Intraepithelial Neoplasia),Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4,15,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
1,15,1,14,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
2,34,1,?,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
3,52,5,16,4,1,37,37,1,3,0,...,?,?,1,0,1,0,0,0,0,0
4,46,3,21,4,0,0,0,1,15,0,...,?,?,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,34,3,18,0,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
854,32,2,19,1,0,0,0,1,8,0,...,?,?,0,0,0,0,0,0,0,0
855,25,2,17,0,0,0,0,1,0.08,0,...,?,?,0,0,0,0,0,0,1,0
856,33,2,24,2,0,0,0,1,0.08,0,...,?,?,0,0,0,0,0,0,0,0
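
The `?` placeholders and the `Unnamed: 0` column in the output above can also be handled when the file is loaded. The following is a minimal sketch, assuming the `?` strings always mean "missing"; `sample_csv` is a hypothetical three-row stand-in for `cervical_raw.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for cervical_raw.csv (not the real file).
sample_csv = io.StringIO(
    ",Age,First sexual intercourse,STDs: Time since last diagnosis\n"
    "0,18,15,?\n"
    "1,34,?,?\n"
    "2,41,17,21\n"
)

# na_values="?" treats the placeholder as missing at load time, so the
# affected columns are parsed as numbers instead of strings.
# index_col=0 uses the first column as the index instead of "Unnamed: 0".
sample = pd.read_csv(sample_csv, na_values="?", index_col=0)

print(sample.dtypes)
print(sample.isna().sum())
```

Loading this way keeps the numeric columns numeric from the start, which avoids a separate conversion step later.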


In [6]:
# Replace the "?" placeholders with NaN in this one column.
# Assigning the result back avoids calling inplace=True on a column
# selection, which newer pandas versions warn about or ignore.
col = "STDs: Time since last diagnosis"
df[col] = df[col].replace("?", np.nan)
display(df[col])

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
      ... 
853    NaN
854    NaN
855    NaN
856    NaN
857    NaN
Name: STDs: Time since last diagnosis, Length: 858, dtype: object

In [8]:
df.loc[15:26]

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN (Cervical Intraepithelial Neoplasia),Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
15,40,3,18,2,0,0,0,1,15.0,0,...,?,,0,0,0,0,0,0,0,0
16,41,4,21,3,0,0,0,1,0.25,0,...,?,,0,0,0,0,0,0,0,0
17,43,3,15,8,0,0,0,1,3.0,0,...,?,,0,0,0,0,0,0,0,0
18,42,2,20,?,0,0,0,1,7.0,1,...,?,,0,0,0,0,0,0,0,0
19,40,2,27,?,0,0,0,0,0.0,1,...,?,,0,0,0,0,0,0,0,0
20,43,2,18,4,0,0,0,1,15.0,0,...,?,,0,0,0,0,0,0,0,0
21,41,3,17,4,0,0,0,1,10.0,0,...,21,21.0,0,0,0,0,0,0,0,0
22,40,1,18,1,0,0,0,1,0.25,0,...,2,2.0,0,0,0,0,0,1,1,1
23,40,1,20,2,0,0,0,1,15.0,0,...,?,,1,0,1,0,1,1,0,1
24,40,3,15,3,0,0,0,1,3.0,0,...,?,,0,0,0,0,0,0,0,0
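
The slice above still shows `?` in columns such as `Num of pregnancies` and `STDs: Time since first diagnosis`, because only one column was cleaned. A possible next step is to apply the same replacement to the whole frame and then convert the columns to numeric types. This is a sketch on a hypothetical miniature frame (`toy`), assuming every column in the dataset is meant to be numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mimicking a few columns of the dataset.
toy = pd.DataFrame({
    "Age": [40, 41, 40],
    "Num of pregnancies": ["2", "?", "1"],
    "STDs: Time since first diagnosis": ["?", "21", "2"],
})

# Replace every "?" in the frame with NaN, then convert each column to a
# numeric dtype. errors="coerce" turns any other non-numeric text into NaN.
cleaned = toy.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Because `errors="coerce"` silently converts unexpected text to `NaN`, it is worth comparing the missing-value counts before and after the conversion.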
