https://schiariello.github.io/

# Final Tutorial: Cervical Cancer Risk Factors

## Analysis by Sarah Chiariello

### Identifying the Dataset

Cervical Cancer Risk Classification:
https://www.kaggle.com/loveall/cervical-cancer-risk-classification


### Project Description & Plan

Forty years ago, cervical cancer was one of the leading causes of cancer death for women in the United States. Because cervical cancer is one of the most preventable types of cancer, this statistic left much to be desired from the healthcare infrastructure of the time. Thanks to earlier detection of abnormal cells in the cervix through more regularly scheduled Pap test screenings, the United States was able to bring down cervical cancer mortality rates. Sadly, the same cannot be said for developing healthcare systems around the world. Many women worldwide lack access to healthcare that would enable early detection of cervical cancer through screening in their youth for the human papillomavirus (HPV) types that can increase their risk of cervical cancer in the future. Women with abnormal cells at a young age who do not get regular examinations are at a higher risk of localized cancer, which can lead to invasive cancer by the age of 50. While cervical cancer rates have declined in the US, death rates for African American women are twice as high as those for Caucasian women, and rates of invasive cervical cancer in Hispanic women are more than twice those of Caucasian women. These racial disparities may result from less advanced healthcare systems worldwide, socioeconomic patterns, and low screening rates due to high poverty levels. They may also result from a lack of access to transportation, health insurance, or language translators.

Cervical cancer involves multiple risk factors and must be diagnosed by a healthcare professional after a biopsy examination. Increased sexual activity raises the risk of contracting HPV, the main risk factor for cervical cancer. HPV is a sexually transmitted infection that can cause abnormal cell growth within the cervix, which Pap tests regularly screen for. Family history of cervical cancer can also be a risk factor, along with the use of oral hormonal contraceptive pills. Other risk factors include, but are not limited to, smoking, past STDs, number of children, and history of diagnosed cervical cancer.

The purpose of this project is to examine data regarding cervical cancer risk factors and, hopefully, draw meaningful insights about causation and correlation among certain risk factors, socioeconomic factors, healthcare structures, and current racial disparities. Note: this may be a lofty goal.


### Related Questions & Hypotheses

With this project, I hope to analyze cervical cancer risk factors.
- Are there any factors that have a heavier weight than the others?
- What do the number of sexual partners or the age of first sexual intercourse mean as risk factors and how does their weight affect the possibility of getting cancer due to the spread of HPV?
- Hormonal contraceptives are believed to increase risk for cervical cancer if used for over a certain number of years. Is this borne out in the dataset?
- A further goal of mine would be to find another dataset about race/socioeconomic class/access to a clinic/etc. to further assess how those factors may weigh in on the likelihood of a woman contracting HPV or cervical cancer.
- Does a history of having an STD mediate or moderate the likelihood of developing abnormal cells in the cervix? Is it more to do with behavioral changes, or pathology? How can we tell?
- I hypothesize that I will be able to find trends in the following:
    - A higher number of sexual partners goes with an increased risk of cervical cancer
    - Taking hormonal contraceptives for longer goes with an increased risk of cervical cancer
    - Having contracted STDs, specifically HPV, goes with an increased risk of cervical cancer
- I hope to create behavioral insights regarding sexual behavior as it relates to cervical cancer
- I hope to find another dataset to relate risk of cervical cancer to what was discussed under "Project Description" like access to a clinic, accessibility of screening tests, regularity of visits to a clinic in underdeveloped neighborhoods, transportation time to the nearest clinic as it relates to risk of cervical cancer, neighborhood poverty levels, ability to make a follow up appointment, etc.

### Data

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"..\schiariello.github.io\project_data\cervical_raw.csv")
df


Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN (Cervical Intraepithelial Neoplasia),Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4,15,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
1,15,1,14,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
2,34,1,?,1,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
3,52,5,16,4,1,37,37,1,3,0,...,?,?,1,0,1,0,0,0,0,0
4,46,3,21,4,0,0,0,1,15,0,...,?,?,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,34,3,18,0,0,0,0,0,0,0,...,?,?,0,0,0,0,0,0,0,0
854,32,2,19,1,0,0,0,1,8,0,...,?,?,0,0,0,0,0,0,0,0
855,25,2,17,0,0,0,0,1,0.08,0,...,?,?,0,0,0,0,0,0,1,0
856,33,2,24,2,0,0,0,1,0.08,0,...,?,?,0,0,0,0,0,0,0,0


In [8]:
# Replace the "?" placeholders with NaN, in accordance with best practices for tidy data.

df = df.replace("?", np.nan)
display(df)

Unnamed: 0,Age,Number of sexual partners,First sexual intercourse,Num of pregnancies,Smokes,Smokes (years),Smokes (packs/year),Hormonal Contraceptives,Hormonal Contraceptives (years),IUD,...,STDs: Time since first diagnosis,STDs: Time since last diagnosis,Dx:Cancer,Dx:CIN (Cervical Intraepithelial Neoplasia),Dx:HPV,Dx,Hinselmann,Schiller,Citology,Biopsy
0,18,4,15,1,0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
1,15,1,14,1,0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
2,34,1,,1,0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
3,52,5,16,4,1,37,37,1,3,0,...,,,1,0,1,0,0,0,0,0
4,46,3,21,4,0,0,0,1,15,0,...,,,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
853,34,3,18,0,0,0,0,0,0,0,...,,,0,0,0,0,0,0,0,0
854,32,2,19,1,0,0,0,1,8,0,...,,,0,0,0,0,0,0,0,0
855,25,2,17,0,0,0,0,1,0.08,0,...,,,0,0,0,0,0,0,1,0
856,33,2,24,2,0,0,0,1,0.08,0,...,,,0,0,0,0,0,0,0,0
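
With the "?" placeholders converted to NaN, it becomes straightforward to measure how much of each column is missing. The sketch below is only an illustration: it uses a small frame named toy whose values are invented, borrowing three of the real column names, and reports the count and share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [18, 15, 34, 52],
    "First sexual intercourse": ["15", "14", "?", "16"],
    "STDs: Time since first diagnosis": ["?", "?", "?", "2"],
})

# Same replacement as in the cell above.
toy = toy.replace("?", np.nan)

# Number and share of missing entries per column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Running the same two lines on the full df would show which columns are too sparse to be useful on their own.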


In [10]:
# List the variable types for each column of the data, to ensure they are as we intend to use them later.

# float64, int64 = quantitative
# object = categorical

df.dtypes

Age                                             int64
Number of sexual partners                      object
First sexual intercourse                       object
Num of pregnancies                             object
Smokes                                         object
Smokes (years)                                 object
Smokes (packs/year)                            object
Hormonal Contraceptives                        object
Hormonal Contraceptives (years)                object
IUD                                            object
IUD (years)                                    object
STDs                                           object
STDs (number)                                  object
STDs:condylomatosis                            object
STDs:cervical condylomatosis                   object
STDs:vaginal condylomatosis                    object
STDs:vulvo-perineal condylomatosis             object
STDs:syphilis                                  object
STDs:pelvic inflammatory dis

By checking df.dtypes, we can see how pandas is categorizing each variable. There are specific ways in which we want to categorize our own data: either quantitative (like "Age", properly read by pandas as int64) or categorical (like "Smokes"). Several quantitative columns, such as "Number of sexual partners", are improperly read as object because the raw file marked their missing values with "?".
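
To see why a column of numbers can end up typed as object, here is a minimal sketch. It assumes a tiny in-memory CSV with invented values (not rows from the dataset) in which a single "?" appears, and then shows one way to recover a numeric column with pd.to_numeric.

```python
import io
import pandas as pd

# Tiny CSV with invented values; "?" marks a missing entry, as in the raw file.
raw = io.StringIO(
    "Age,Number of sexual partners\n"
    "18,4.0\n"
    "15,?\n"
    "34,1.0\n"
)
toy = pd.read_csv(raw)

# A single "?" makes pandas read the whole column as text instead of numbers.
partners_is_numeric = pd.api.types.is_numeric_dtype(toy["Number of sexual partners"])
age_is_numeric = pd.api.types.is_numeric_dtype(toy["Age"])
print(toy.dtypes)

# errors="coerce" turns anything unparseable (here "?") into NaN,
# which leaves a float column behind.
converted = pd.to_numeric(toy["Number of sexual partners"], errors="coerce")
print(converted.dtype)
```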

Note: For the "STDs: ..." columns, we will have to decide whether to treat the data as quantitative (the number of times a participant has gotten the particular STD in question) or categorical (whether or not the participant has contracted the STD in this column). I will decide this at a later date, so for now I am leaving these columns alone, typed as categorical.

Depending on whether we want to examine STDs in general as they relate to cervical cancer or the risk of each particular STD with regard to later cervical cancer, we can combine those columns later on to create a tidier dataset.
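
As a rough sketch of what combining the STD columns could look like, the example below builds both views from a toy frame. The flag values are invented, and the names std_count and any_std are hypothetical. On the real data, the column list would need to be written out by hand so that columns such as "STDs: Time since first diagnosis" are not swept in.

```python
import pandas as pd

# Invented 0/1 flags for three of the "STDs:..." columns (None = not recorded).
toy = pd.DataFrame({
    "STDs:condylomatosis": [0.0, 1.0, 0.0, None],
    "STDs:syphilis": [0.0, 1.0, 0.0, None],
    "STDs:cervical condylomatosis": [0.0, 0.0, 1.0, None],
})

# Explicit list of the per-disease flag columns to combine.
std_cols = [
    "STDs:condylomatosis",
    "STDs:syphilis",
    "STDs:cervical condylomatosis",
]

# Quantitative view: how many different STDs were flagged per participant.
# min_count=1 keeps the result NaN when every flag in the row is missing.
toy["std_count"] = toy[std_cols].sum(axis=1, min_count=1)

# Categorical view: whether any STD was flagged at all (NaN if unknown).
toy["any_std"] = toy["std_count"].gt(0).where(toy["std_count"].notna())

print(toy[["std_count", "any_std"]])
```

Keeping the original columns alongside the combined ones means either view stays available for later analysis.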

In [27]:
# This block will convert from categorical to quantitative and vice-versa as deemed necessary by me.

# Counts and durations: parse as numbers (anything unparseable becomes NaN).
numeric_cols = [
    'Number of sexual partners',
    'First sexual intercourse',
    'Num of pregnancies',
    'Smokes (years)',
    'Smokes (packs/year)',
    'Hormonal Contraceptives (years)',
    'IUD (years)',
    'STDs: Time since first diagnosis',
    'STDs: Time since last diagnosis',
]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Diagnosis and test-result flags: store as strings so they are treated as categorical.
categorical_cols = [
    'Dx:Cancer',
    'Dx:CIN (Cervical Intraepithelial Neoplasia)',
    'Dx:HPV',
    'Dx',
    'Hinselmann',
    'Schiller',
    'Citology',
    'Biopsy',
]
df[categorical_cols] = df[categorical_cols].astype(str)

In [28]:
# Check that the column types have been changed as intended

df.dtypes

Age                                              int64
Number of sexual partners                      float64
First sexual intercourse                       float64
Num of pregnancies                             float64
Smokes                                          object
Smokes (years)                                 float64
Smokes (packs/year)                            float64
Hormonal Contraceptives                         object
Hormonal Contraceptives (years)                float64
IUD                                             object
IUD (years)                                    float64
STDs                                            object
STDs (number)                                   object
STDs:condylomatosis                             object
STDs:cervical condylomatosis                    object
STDs:vaginal condylomatosis                     object
STDs:vulvo-perineal condylomatosis              object
STDs:syphilis                                   object
STDs:pelvi
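
The conversion cell stores the diagnosis and test columns as strings. One possible alternative, which is not what the cells above do, is the dedicated category dtype in pandas. The sketch below assumes a toy frame with invented values and shows that the category dtype keeps missing entries as missing while still supporting simple counts.

```python
import numpy as np
import pandas as pd

# Invented values; "Smokes" contains one missing entry.
toy = pd.DataFrame({
    "Biopsy": [0, 0, 1, 0],
    "Smokes": ["0.0", "1.0", np.nan, "0.0"],
})

# Convert both columns to the pandas category dtype.
for col in ["Biopsy", "Smokes"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)

# Missing values stay missing rather than becoming a category of their own.
smokes_missing = toy["Smokes"].isna().sum()

# Counts per category work as they would for any other column.
biopsy_counts = toy["Biopsy"].value_counts()
print(biopsy_counts)
```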

### Next Steps & Justifications

Future plans for this project and the dataset include creating a tidier version of the data, with shorter, more succinct variable names. I held off on this for the first Milestone since I have not yet determined which risk factors to delve into. For example, creating a tidier dataset by merging all of the STD columns into a single categorical "Have you gotten an STD" column may hinder insights, which I do not want to do just yet. Currently, I recognize the need to create a dataset that is long rather than wide, but I have not determined how best to melt the data down without losing information that I may want to access later. I kept the lengthy column names for the sake of an audience with less knowledge of cervical cancer and medical jargon. In order to efficiently call and work with the variables, I recognize that I will have to update the column names later on.
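
As a minimal sketch of what melting could look like, the example below reshapes a toy wide frame into long form and then pivots it back to confirm that nothing is lost. The values are invented, and participant_id, std_type, and diagnosed are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Invented rows using real column names plus a hypothetical participant id.
wide = pd.DataFrame({
    "participant_id": [0, 1],
    "Age": [18, 52],
    "STDs:syphilis": [0, 1],
    "STDs:condylomatosis": [0, 0],
})

# One row per (participant, STD) pair instead of one column per STD.
long_df = wide.melt(
    id_vars=["participant_id", "Age"],
    value_vars=["STDs:syphilis", "STDs:condylomatosis"],
    var_name="std_type",
    value_name="diagnosed",
)

# Shorten the labels now that the column name carries the "STD" meaning.
long_df["std_type"] = long_df["std_type"].str.replace("STDs:", "", regex=False)
print(long_df)

# Pivoting back recovers the original flags, so the reshape is reversible.
restored = long_df.pivot(index="participant_id", columns="std_type", values="diagnosed")
print(restored)
```

Because the long form can be pivoted back, melting only the STD columns would not by itself discard any per-disease information.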

In the future, I will also hone my questions while searching for a race/SES/access-to-healthcare dataset to compare against these risk factors and add to them.