\newpage

1. A classification problem based on the dataset by @misc_chronic_kidney_disease_336 is to classify whether or not an individual has chronic kidney disease based on the following predictor variables: age, blood pressure (mm/Hg), specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, blood glucose random (mgs/dl), blood urea (mgs/dl), serum creatinine (mgs/dl), sodium (mEq/L), potassium (mEq/L), hemoglobin (gms), packed cell volume, white blood cell count (cells/cmm), red blood cell count (millions/cmm), hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, and anemia.

2. We import the data and transform it as necessary.

In [9]:
import pandas as pd
import numpy as np
from ucimlrepo import fetch_ucirepo 
import certifi

chronic_kidney_disease = fetch_ucirepo(id=336) 
  
# data (as pandas dataframes) 
X = chronic_kidney_disease.data.features 
y = chronic_kidney_disease.data.targets

data = pd.concat([X, y], axis = 1)


To determine whether any variable transformations are necessary, we compare the data type of each variable in the dataframe with its description in the data dictionary.

In [10]:
print(data.dtypes)

age      float64
bp       float64
sg       float64
al       float64
su       float64
rbc       object
pc        object
pcc       object
ba        object
bgr      float64
bu       float64
sc       float64
sod      float64
pot      float64
hemo     float64
pcv      float64
wbcc     float64
rbcc     float64
htn       object
dm        object
cad       object
appet     object
pe        object
ane       object
class     object
dtype: object


We can see that age, blood pressure (bp), blood glucose random (bgr), blood urea (bu), serum creatinine (sc), sodium (sod), potassium (pot), hemoglobin (hemo), packed cell volume (pcv), white blood cell count (wbcc), and red blood cell count (rbcc) are all numerical (float) values, as they should be according to the data dictionary. However, specific gravity (sg), albumin (al), and sugar (su) should be nominal but are stored as floats. Red blood cells (rbc), pus cell (pc), pus cell clumps (pcc), bacteria (ba), hypertension (htn), diabetes mellitus (dm), coronary artery disease (cad), appetite (appet), pedal edema (pe), anemia (ane), and class should also be nominal, but they are currently stored as generic object columns. We correct these data types by converting the noted variables to categorical (nominal) variables.

In [11]:
columns_to_convert = ['sg','al','su','rbc','pc','pcc','ba','htn','dm','cad','appet','pe','ane','class']
for col in columns_to_convert:
    data[col] = pd.Categorical(data[col])
print(data.dtypes)

age       float64
bp        float64
sg       category
al       category
su       category
rbc      category
pc       category
pcc      category
ba       category
bgr       float64
bu        float64
sc        float64
sod       float64
pot       float64
hemo      float64
pcv       float64
wbcc      float64
rbcc      float64
htn      category
dm       category
cad      category
appet    category
pe       category
ane      category
class    category
dtype: object
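
When converting text columns to categorical, it can help to inspect the distinct levels of each column, because stray whitespace or tab characters in the raw text would otherwise become separate categories. The sketch below illustrates this on a small made-up dataframe (`toy`, not the kidney data) with a hypothetical helper `clean_categories`. Whether the real columns need this cleaning is an assumption that should be checked first, for example with `value_counts()`.

```python
import pandas as pd

# Toy data standing in for text columns; the values are made up for illustration.
toy = pd.DataFrame({
    "dm": ["yes", "no", "\tno", " yes", None],
    "class": ["ckd", "ckd\t", "notckd", "ckd", "notckd"],
})

def clean_categories(df, columns):
    """Strip surrounding whitespace from text columns, then convert to category."""
    out = df.copy()
    for col in columns:
        # .str.strip() removes leading/trailing spaces and tabs and
        # leaves missing values as missing.
        out[col] = out[col].str.strip().astype("category")
    return out

cleaned = clean_categories(toy, ["dm", "class"])
for col in ["dm", "class"]:
    print(col, list(cleaned[col].cat.categories))
```

Without the stripping step, `"no"` and `"\tno"` would be treated as two different levels of the same variable.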


In order to make the variables match the abbreviations found in the data dictionary, we rename wbcc and rbcc to wc and rc, respectively. 

In [13]:
data = data.rename(columns={'wbcc': 'wc', 'rbcc': 'rc'})
print(data.columns)

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class'],
      dtype='object')


Finally, we scale the numerical (float) predictor variables by standardization.

In [15]:
from sklearn.preprocessing import StandardScaler
num_col = data.select_dtypes(include='float64').columns
scale = StandardScaler()
data[num_col] = scale.fit_transform(data[num_col])
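
Standardization rescales each numerical column to have mean zero and unit variance. As a minimal sketch of the underlying computation, the example below applies the formula directly with pandas on made-up numbers (`toy` is not the kidney data). It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows, and missing values remain missing.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns with one missing value; not the kidney data.
toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0, 51.0],
    "bp": [80.0, 50.0, np.nan, 70.0],
})

# z = (x - mean) / std, using the population standard deviation (ddof=0).
# pandas skips NaN when computing the mean and std, so missing entries
# are ignored in the statistics and stay NaN in the result.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
```

Note that the statistics here are computed on all rows. If the data are later split into training and test sets, one common alternative is to fit the scaler on the training rows only and apply it to the test rows.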

3. We explore the data and give a detailed description of the dataset.

In [8]:
print(data.iloc[0:2,0:12])
print(data.iloc[0:2,12:26])

    age    bp    sg   al   su  rbc      pc         pcc          ba    bgr  \
0  48.0  80.0  1.02  1.0  0.0  NaN  normal  notpresent  notpresent  121.0   
1   7.0  50.0  1.02  4.0  0.0  NaN  normal  notpresent  notpresent    NaN   

     bu   sc  
0  36.0  1.2  
1  18.0  0.8  
   sod  pot  hemo   pcv    wbcc  rbcc  htn   dm cad appet  pe ane class
0  NaN  NaN  15.4  44.0  7800.0   5.2  yes  yes  no  good  no  no   ckd
1  NaN  NaN  11.3  38.0  6000.0   NaN   no   no  no  good  no  no   ckd
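
The first two rows already contain several NaN entries, so a per-column count of missing values is a natural part of the exploration. The following is a minimal sketch on a small made-up dataframe (`toy`); the column names mirror the dataset, but the values and the resulting counts are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up frame with a mix of text and numeric columns; not the kidney data.
toy = pd.DataFrame({
    "rbc": [np.nan, np.nan, "normal"],
    "bgr": [121.0, np.nan, 423.0],
    "hemo": [15.4, 11.3, 9.6],
})

# Count missing entries per column and express them as a share of all rows.
missing = toy.isna().sum().sort_values(ascending=False)
summary = pd.DataFrame({
    "n_missing": missing,
    "share": (missing / len(toy)).round(2),
})

print(summary)
```

Applied to the full dataframe, the same pattern would show which predictors are most affected by missing values before any modeling decisions are made.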


\newpage

## Bibliography