# Part 1 
## 10 Point Inspection

In [6]:
import pandas as pd
df = pd.read_csv("kidney_disease.csv")
df.head()

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd


### 1. Shape

In [7]:
print(df.shape)

(400, 26)


The dataset has 400 rows and 26 columns.
Each row holds the data for one patient.

### 2. Columns

In [5]:
print(df.columns)

Index(['id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
       'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')


Most column abbreviations need to be looked up; only id, age, bp (blood pressure), hemo (hemoglobin), and classification are self-explanatory.

sg - Specific gravity (density of urine relative to distilled water)
al - Albumin (0-5 scale)
su - Sugar (0-5 scale)
rbc - Red blood cells in urine (normal/abnormal)
pc - Pus cells in urine (normal/abnormal)
pcc - Pus cell clumps (present/notpresent)
ba - Bacteria (present/notpresent)
bgr - Blood glucose random
bu - Blood urea
sc - Serum creatinine
sod - Sodium
pot - Potassium
pcv - Packed cell volume
wc - White blood cell count
rc - Red blood cell count
htn - Hypertension (yes/no)
dm - Diabetes mellitus (yes/no)
cad - Coronary artery disease (yes/no)
appet - Appetite (good/poor)
pe - Pedal edema (Foot swelling) (yes/no)
ane - Anemia (yes/no)
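
If the abbreviations become hard to track later, one option is to rename the columns. The sketch below uses a tiny made-up frame rather than the real file, and the long names in `COLUMN_NAMES` are hypothetical choices of mine, not names defined by the dataset.

```python
import pandas as pd

# Hypothetical descriptive names; only the abbreviations come from the dataset.
COLUMN_NAMES = {
    "sg": "specific_gravity",
    "al": "albumin",
    "su": "sugar",
    "htn": "hypertension",
    "dm": "diabetes_mellitus",
    "cad": "coronary_artery_disease",
}

# Tiny made-up frame standing in for the real data.
toy = pd.DataFrame({"sg": [1.02], "al": [1.0], "htn": ["yes"]})

# rename() ignores mapping keys that are not present in the frame.
renamed = toy.rename(columns=COLUMN_NAMES)
print(list(renamed.columns))
```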

### 3. Data Types


In [8]:
print(df.dtypes)

id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object


id, age, bp, sg, al, su, bgr, bu, sc, sod, pot, and hemo are stored as numeric columns (int64 or float64).

rbc, pc, pcc, ba, htn, dm, cad, appet, pe, ane, and classification are categorical columns, and they are all stored as objects.

pcv (packed cell volume), wc (white blood cell count), and rc (red blood cell count) hold numeric values but are also stored as objects, so they need to be converted before any numeric analysis.
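
One possible way to convert such columns is `pd.to_numeric` with `errors="coerce"`. This is a minimal sketch on made-up values. The stray tab and the `"?"` entry are assumptions about what might force a column to object dtype, not values confirmed in this file.

```python
import pandas as pd

# Made-up values; "\t43" and "?" are assumed examples of entries that
# would keep a column from being parsed as numeric.
toy = pd.DataFrame(
    {
        "pcv": ["44", "38", "\t43", "?"],
        "wc": ["7800", "6000", None, "6700"],
        "rc": ["5.2", None, "3.9", "4.6"],
    },
    dtype=object,
)

converted = toy.copy()
for col in ["pcv", "wc", "rc"]:
    # Strip surrounding whitespace first, then coerce anything that still
    # cannot be parsed to NaN instead of raising an error.
    converted[col] = pd.to_numeric(converted[col].str.strip(), errors="coerce")

print(converted.dtypes)
print(converted.isnull().sum())
```

Coercion silently turns unparseable entries into missing values, so it is worth comparing the missing counts before and after the conversion.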


### 4. First Look

In [9]:
print(df.head())

   id   age    bp     sg   al   su     rbc        pc         pcc          ba  \
0   0  48.0  80.0  1.020  1.0  0.0     NaN    normal  notpresent  notpresent   
1   1   7.0  50.0  1.020  4.0  0.0     NaN    normal  notpresent  notpresent   
2   2  62.0  80.0  1.010  2.0  3.0  normal    normal  notpresent  notpresent   
3   3  48.0  70.0  1.005  4.0  0.0  normal  abnormal     present  notpresent   
4   4  51.0  80.0  1.010  2.0  0.0  normal    normal  notpresent  notpresent   

   ...  pcv    wc   rc  htn   dm  cad appet   pe  ane classification  
0  ...   44  7800  5.2  yes  yes   no  good   no   no            ckd  
1  ...   38  6000  NaN   no   no   no  good   no   no            ckd  
2  ...   31  7500  NaN   no  yes   no  poor   no  yes            ckd  
3  ...   32  6700  3.9  yes   no   no  poor  yes  yes            ckd  
4  ...   35  7300  4.6   no   no   no  good   no   no            ckd  

[5 rows x 26 columns]


The columns are a mix of numeric and categorical values. Once the abbreviations are known, the data is easy to read.

I don't see any unexpected values in these rows; they all seem plausible given that these patients have CKD.

NaN stands for "not a number". pandas uses it as a placeholder for missing data, so each NaN here is a missing value.

### 5. Last Look


In [11]:
print(df.tail())

      id   age    bp     sg   al   su     rbc      pc         pcc          ba  \
395  395  55.0  80.0  1.020  0.0  0.0  normal  normal  notpresent  notpresent   
396  396  42.0  70.0  1.025  0.0  0.0  normal  normal  notpresent  notpresent   
397  397  12.0  80.0  1.020  0.0  0.0  normal  normal  notpresent  notpresent   
398  398  17.0  60.0  1.025  0.0  0.0  normal  normal  notpresent  notpresent   
399  399  58.0  80.0  1.025  0.0  0.0  normal  normal  notpresent  notpresent   

     ...  pcv    wc   rc  htn  dm  cad appet  pe ane classification  
395  ...   47  6700  4.9   no  no   no  good  no  no         notckd  
396  ...   54  7800  6.2   no  no   no  good  no  no         notckd  
397  ...   49  6600  5.4   no  no   no  good  no  no         notckd  
398  ...   51  7200  5.9   no  no   no  good  no  no         notckd  
399  ...   53  6800  6.1   no  no   no  good  no  no         notckd  

[5 rows x 26 columns]


The last rows are clean and consistent with the first rows. The only difference I notice is that the last rows are all patients without CKD (notckd), whereas the first rows are all patients with CKD. This suggests that the rows are ordered by classification, with the CKD patients at the top of the dataset.
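
Row order like this matters if the data is later split by position, since a slice from the top would contain only one class. The sketch below, on a made-up frame with the same layout, checks whether the labels are grouped into contiguous runs and then shuffles the rows. The `random_state=0` seed is an arbitrary choice.

```python
import pandas as pd

# Made-up frame with the same layout: all "ckd" rows first, then "notckd".
toy = pd.DataFrame({
    "id": range(6),
    "classification": ["ckd"] * 3 + ["notckd"] * 3,
})

labels = toy["classification"]

# Count runs of identical consecutive labels; if there is exactly one run
# per label, the rows are grouped by class.
n_runs = int((labels != labels.shift()).sum())
is_grouped = n_runs == labels.nunique()
print(is_grouped)

# Shuffle all rows reproducibly and rebuild a clean 0..n-1 index.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```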

### 6. Memory


In [13]:
print(df.memory_usage())
print(df.memory_usage(deep=True).sum()/1e6)

Index              132
id                3200
age               3200
bp                3200
sg                3200
al                3200
su                3200
rbc               3200
pc                3200
pcc               3200
ba                3200
bgr               3200
bu                3200
sc                3200
sod               3200
pot               3200
hemo              3200
pcv               3200
wc                3200
rc                3200
htn               3200
dm                3200
cad               3200
appet             3200
pe                3200
ane               3200
classification    3200
dtype: int64
0.325639


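The dataset uses about 0.326 MB (325.639 KB) of memory, which is small by data science standards. The per-column figures above are shallow counts: for object columns, the 3200 bytes cover only the 8-byte references (400 * 8), not the strings themselves, which is why the deep total is larger than the sum of the listed values.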
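
To illustrate the difference between shallow and deep memory counts, here is a small sketch on a made-up yes/no column of 400 strings. It also shows the `category` dtype as one possible way to shrink repeated labels. Exact byte counts depend on the platform and pandas version, so none are quoted here.

```python
import pandas as pd

# Made-up column of 400 repeated strings, mimicking a yes/no field.
toy = pd.DataFrame({"htn": ["yes", "no"] * 200}, dtype=object)

# Shallow: counts only the references stored in the column.
shallow = toy["htn"].memory_usage(index=False)

# Deep: also counts the Python string objects the references point to.
deep = toy["htn"].memory_usage(index=False, deep=True)

# Category dtype stores each distinct label once plus small integer codes.
as_category = toy["htn"].astype("category").memory_usage(index=False, deep=True)

print(shallow, deep, as_category)
```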

### 7. Missing


In [14]:
print(df.isnull().sum())

id                  0
age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64
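
Raw counts are easier to compare as a share of rows; for example, the 152 missing rbc values are 38% of the 400 rows. A minimal sketch of that calculation on a made-up frame:

```python
import pandas as pd

# Made-up frame; None marks a missing entry.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [48.0, 7.0, None, 51.0],
    "rbc": [None, None, "normal", None],
})

# isnull() gives booleans, so the column mean is the fraction missing.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```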


### 8. Duplicates


In [44]:
print(df.duplicated().sum())

0


### 9. Statistics


In [45]:
print(df.describe())

               id         age          bp          sg          al          su  \
count  400.000000  391.000000  388.000000  353.000000  354.000000  351.000000   
mean   199.500000   51.483376   76.469072    1.017408    1.016949    0.450142   
std    115.614301   17.169714   13.683637    0.005717    1.352679    1.099191   
min      0.000000    2.000000   50.000000    1.005000    0.000000    0.000000   
25%     99.750000   42.000000   70.000000    1.010000    0.000000    0.000000   
50%    199.500000   55.000000   80.000000    1.020000    0.000000    0.000000   
75%    299.250000   64.500000   80.000000    1.020000    2.000000    0.000000   
max    399.000000   90.000000  180.000000    1.025000    5.000000    5.000000   

              bgr          bu          sc         sod         pot        hemo  
count  356.000000  381.000000  383.000000  313.000000  312.000000  348.000000  
mean   148.036517   57.425722    3.072454  137.528754    4.627244   12.526437  
std     79.281714   50.503006 

### 10. Unique


In [46]:
print(df.nunique())

id                400
age                76
bp                 10
sg                  5
al                  6
su                  6
rbc                 2
pc                  2
pcc                 2
ba                  2
bgr               146
bu                118
sc                 84
sod                34
pot                40
hemo              115
pcv                44
wc                 92
rc                 49
htn                 2
dm                  5
cad                 3
appet               2
pe                  2
ane                 2
classification      3
dtype: int64
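
Some of these counts are higher than the column descriptions suggest: dm and cad are yes/no columns but show 5 and 3 unique values, and classification shows 3 instead of 2. That points to inconsistent spellings of the same label. The sketch below shows one way to inspect and clean such a column. The stray tab and leading space in the toy values are an assumed cause, not something confirmed in this file.

```python
import pandas as pd

# Made-up values; the stray tab and space are assumed examples of dirty labels.
toy = pd.DataFrame({"dm": ["yes", "no", "\tno", " yes", "\tyes"]}, dtype=object)

# value_counts() lists every distinct spelling, which nunique() only counts.
print(toy["dm"].value_counts(dropna=False))

# Strip surrounding whitespace so variants collapse into one label.
cleaned = toy["dm"].str.strip()
print(cleaned.nunique())
```

Printing `repr()` of the unique values can also help, since tabs and spaces are hard to see in normal output.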
