*This notebook has been built from this one:* https://www.kaggle.com/randyrose2017/for-beginners-using-keras-to-build-models

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

print(check_output(["ls", "kag_risk_factors_cervical_cancer.csv"]).decode("utf8"))
# Make sure the CSV file is in the same directory as this notebook.

# 1. Data observation

Here, **observation** means inspecting the type and structure of the data and checking for missing values. Let's import the data.

In [None]:
df_full = pd.read_csv('kag_risk_factors_cervical_cancer.csv')
df_full

In [None]:
df_full.info()

Missing values appear to be encoded as '?', which forces every affected column to the `object` dtype. 
Before computing anything, we have to replace '?' with NaN and convert the object columns to a numeric type.

In [None]:
df_fullna = df_full.replace('?', np.nan)

In [None]:
df_fullna.isnull().sum() #check NaN counts in different columns
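To make the effect of the '?' placeholder concrete, here is a minimal sketch on a made-up frame. The column names `age` and `partners` and their values are hypothetical and are not taken from the dataset; `pd.to_numeric` is used here as one possible way to do the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: '?' marks a missing value, as in the CSV above.
toy = pd.DataFrame({'age': [18, 25, 31], 'partners': ['2', '?', '4']})
dtype_before = toy['partners'].dtype      # a string/object dtype, not a number

toy_na = toy.replace('?', np.nan)         # '?' becomes a real NaN
toy_na['partners'] = pd.to_numeric(toy_na['partners'])  # strings -> floats
n_missing = int(toy_na['partners'].isnull().sum())

print(dtype_before, '->', toy_na['partners'].dtype, '| missing:', n_missing)
```

Once the placeholder is a real NaN, `isnull()` can count it and the column can hold a numeric dtype.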

# 2. Data-preprocessing

In [None]:
df = df_fullna.copy()  # work on a copy so df_fullna stays untouched
df = df.apply(pd.to_numeric, errors='coerce')  # convert_objects() was removed from pandas; to_numeric replaces it
df.info()  # every column is now numeric, so we can compute statistics and fill the NaN values

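Now it's time to fill all the NaN values. <br/>
For continuous variables, we fill with the median value, which is robust to outliers.
For the categorical variables `Smokes`, `Hormonal Contraceptives` and `STDs`, we fill with 1; `IUD` and `IUD (years)` are filled with 0.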

In [None]:
# continuous variables: fill with the median; binary flags: fill with a constant
df['Number of sexual partners'] = df['Number of sexual partners'].fillna(df['Number of sexual partners'].median())
df['First sexual intercourse'] = df['First sexual intercourse'].fillna(df['First sexual intercourse'].median())
df['Num of pregnancies'] = df['Num of pregnancies'].fillna(df['Num of pregnancies'].median())
df['Smokes'] = df['Smokes'].fillna(1)
df['Smokes (years)'] = df['Smokes (years)'].fillna(df['Smokes (years)'].median())
df['Smokes (packs/year)'] = df['Smokes (packs/year)'].fillna(df['Smokes (packs/year)'].median())
df['Hormonal Contraceptives'] = df['Hormonal Contraceptives'].fillna(1)
df['Hormonal Contraceptives (years)'] = df['Hormonal Contraceptives (years)'].fillna(df['Hormonal Contraceptives (years)'].median())
df['IUD'] = df['IUD'].fillna(0) # Under suggestion
df['IUD (years)'] = df['IUD (years)'].fillna(0) #Under suggestion
df['STDs'] = df['STDs'].fillna(1)
df['STDs (number)'] = df['STDs (number)'].fillna(df['STDs (number)'].median())
df['STDs:condylomatosis'] = df['STDs:condylomatosis'].fillna(df['STDs:condylomatosis'].median())
df['STDs:cervical condylomatosis'] = df['STDs:cervical condylomatosis'].fillna(df['STDs:cervical condylomatosis'].median())
df['STDs:vaginal condylomatosis'] = df['STDs:vaginal condylomatosis'].fillna(df['STDs:vaginal condylomatosis'].median())
df['STDs:vulvo-perineal condylomatosis'] = df['STDs:vulvo-perineal condylomatosis'].fillna(df['STDs:vulvo-perineal condylomatosis'].median())
df['STDs:syphilis'] = df['STDs:syphilis'].fillna(df['STDs:syphilis'].median())
df['STDs:pelvic inflammatory disease'] = df['STDs:pelvic inflammatory disease'].fillna(df['STDs:pelvic inflammatory disease'].median())
df['STDs:genital herpes'] = df['STDs:genital herpes'].fillna(df['STDs:genital herpes'].median())
df['STDs:molluscum contagiosum'] = df['STDs:molluscum contagiosum'].fillna(df['STDs:molluscum contagiosum'].median())
df['STDs:AIDS'] = df['STDs:AIDS'].fillna(df['STDs:AIDS'].median())
df['STDs:HIV'] = df['STDs:HIV'].fillna(df['STDs:HIV'].median())
df['STDs:Hepatitis B'] = df['STDs:Hepatitis B'].fillna(df['STDs:Hepatitis B'].median())
df['STDs:HPV'] = df['STDs:HPV'].fillna(df['STDs:HPV'].median())
df['STDs: Time since first diagnosis'] = df['STDs: Time since first diagnosis'].fillna(df['STDs: Time since first diagnosis'].median())
df['STDs: Time since last diagnosis'] = df['STDs: Time since last diagnosis'].fillna(df['STDs: Time since last diagnosis'].median())
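The same filling pattern can be written more compactly by grouping columns. The following is only a sketch on hypothetical toy data: the names `years` and `flag`, the list `median_cols` and the dict `constant_fill` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names only mimic the dataset's style.
toy = pd.DataFrame({
    'years': [1.0, np.nan, 3.0, 10.0],   # continuous variable
    'flag':  [0.0, 1.0, np.nan, 0.0],    # binary category
})

median_cols = ['years']        # assumed list of continuous columns
constant_fill = {'flag': 1}    # assumed constant per categorical column

# DataFrame.median() returns one median per column;
# fillna() with a Series aligns on the column names.
toy[median_cols] = toy[median_cols].fillna(toy[median_cols].median())
toy = toy.fillna(constant_fill)

print(toy)
```

Keeping the column lists in one place makes it easier to see which rule applies to which variable.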

In [None]:
# for categorical variable
df = pd.get_dummies(data=df, columns=['Smokes','Hormonal Contraceptives','IUD','STDs',
                                      'Dx:Cancer','Dx:CIN','Dx:HPV','Dx','Hinselmann','Citology','Schiller'])
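As an illustration of what `pd.get_dummies` does to a single binary column, here is a small sketch on a hypothetical three-row frame (the values are made up):

```python
import pandas as pd

# Hypothetical toy frame with one binary column and one numeric column.
toy = pd.DataFrame({'Smokes': [0.0, 1.0, 0.0], 'Age': [18, 25, 31]})

# Each distinct value of 'Smokes' becomes its own indicator column,
# named '<column>_<value>'; untouched columns are kept as they are.
encoded = pd.get_dummies(data=toy, columns=['Smokes'])

print(list(encoded.columns))
```

The original `Smokes` column is dropped and replaced by `Smokes_0.0` and `Smokes_1.0`, so later code has to refer to the new column names.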

In [None]:
df.isnull().sum()  # no null values left

In [None]:
df 

Now we have the full data set `df` for computation.<br/>
We are ready to split the data into train/test sets, define features and labels, and normalize.

In [None]:
df_data = df.copy()  # keep a snapshot of the preprocessed data
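One possible way to carry out the split and the normalization is sketched below with scikit-learn. The data are random and hypothetical, and the 80/20 split ratio and the `random_state` value are assumptions chosen for illustration, not settings prescribed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical features
y = (rng.random(100) < 0.1).astype(int)            # hypothetical rare positive label

# Hold out 20% of the rows as a test set (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no test-set statistics leak into training.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```

Fitting the scaler before splitting would let the test rows influence the mean and standard deviation, which is why the fit uses `X_train` alone.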

## Data set description

In [None]:
df.describe()

In [None]:
fig, (ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8) = plt.subplots(8,1,figsize=(20,40))
sns.countplot(x='Age', data=df, ax=ax1)
sns.countplot(x='Number of sexual partners', data=df, ax=ax2)
sns.countplot(x='Num of pregnancies', data=df, ax=ax3)
sns.countplot(x='Smokes (years)', data=df, ax=ax4)
sns.countplot(x='Hormonal Contraceptives (years)', data=df, ax=ax5)
sns.countplot(x='IUD (years)', data=df, ax=ax6)
sns.countplot(x='STDs (number)', data=df, ax=ax7)
sns.countplot(x='Biopsy', data=df, ax=ax8)

### Inspecting the priors
Now inspect the targets, which here are the results of the biopsy. Compute below the priors of sick vs. healthy patients.

In [None]:
targets = np.array(df['Biopsy'])

### START Your code here




### END Your code here

print('Prior for healthy = ', p_healthy)
print('Prior for sick    = ', p_sick)

**What is your observation about this dataset (in terms of priors)?**  COMPLETE

Let's now assume that 3 teams A, B and C compete on Kaggle on this task and report their results on an independent test set (with n_test = 300).
 - Team A: 91.3% overall accuracy
 - Team B: 93.5% overall accuracy
 - Team C: 96.5% overall accuracy

**Which system would you buy?** COMPLETE

**What would you need to make a better decision?**  COMPLETE