# Data manipulation with Pandas

## Introduction

### Goal:
This notebook walks through the basic data manipulation steps in Pandas that I started with. It is meant mainly for display purposes, but it can also serve as useful notes.

### Documentation:
The Stroke Prediction Dataset is used for the analysis. The dataset is public and available on Kaggle at <a href= "https://www.kaggle.com/fedesoriano/stroke-prediction-dataset" target="_blank">this link</a>. 

### Data Dictionary:
- id: unique identifier
- gender: 'Male', 'Female' or 'Other'
- age: age of the patient
- hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease 
- ever_married: 'No' or 'Yes'
- work_type: 'Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'
- Residence_type: 'Rural' or 'Urban'
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: 'formerly smoked', 'never smoked', 'smokes', 'Unknown'
- stroke: 1 if the patient had a stroke or 0 if not

## Different ways of data manipulation

In [129]:
# importing libraries
import pandas as pd

print('Load Libraries- Done')
print ('-'*127)  

Load Libraries- Done
-------------------------------------------------------------------------------------------------------------------------------


In [130]:
# importing dataset
df = pd.read_csv('healthcare-data.csv')
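
As a side note, `pd.read_csv` accepts options that make the load step more explicit. The sketch below does not touch `healthcare-data.csv`: it reads three records (copied from the preview shown in this notebook) from an in-memory string that stands in for the file, and pins the dtype of two columns. The dtype choices are illustrative assumptions, not something the original notebook does.

```python
import io
import pandas as pd

# Stand-in for the real CSV file: three records copied from the dataset preview.
csv_text = """id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int64", "stroke": "int8"},  # illustrative dtype choices
)

print(sample.dtypes)
print(sample["bmi"].isna().sum())  # the empty bmi field is read as NaN
```

Passing `dtype` up front avoids a later conversion step and makes unexpected values in those columns fail at load time.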

In [131]:
# visual preview of the dataset - first three rows
df.head(3)

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |


In [132]:
# visual preview of the dataset - last three rows
df.tail(3)

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| 5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| 5109 | 44679 | Female | 44.0 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |


In [133]:
# checking dataset dimension
print(f'There are {df.shape[0]} rows/observations and {df.shape[1]} columns/variables in the dataset')
print ('-'*125) 

There are 5110 rows/observations and 12 columns/variables in the dataset
-----------------------------------------------------------------------------------------------------------------------------


In [134]:
# accessing the column headers of the dataset
df.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [135]:
# summarizing the dataset: dimensions, headers, missing data, % of missing data and unique values
print ('Rows/Observations     : ' , df.shape[0])  
print ('Columns/Variables  : ' , df.shape[1]) 
print ('-'*127,'\n','Variables : \n\n', df.columns.tolist()) 
print ('-'*127,'\nMissing values :\n\n', df.isnull().sum().sort_values(ascending=False))
print ('-'*127,'\nPercent of missing :\n\n', round(df.isna().sum() / df.isna().count() * 100, 2)) 
print ('-'*127,'\nUnique values :  \n\n', df.nunique())  
print ('-'*127)  

Rows/Observations     :  5110
Columns/Variables  :  12
------------------------------------------------------------------------------------------------------------------------------- 
 Variables : 

 ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke']
------------------------------------------------------------------------------------------------------------------------------- 
Missing values :

 bmi                  201
stroke                 0
smoking_status         0
avg_glucose_level      0
Residence_type         0
work_type              0
ever_married           0
heart_disease          0
hypertension           0
age                    0
gender                 0
id                     0
dtype: int64
------------------------------------------------------------------------------------------------------------------------------- 
Percent of missing :

 id                   0.00
g
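
The percentage of missing values can also be computed as the column-wise mean of the boolean mask, since `True` counts as 1. This is a minimal sketch on a four-row frame whose `age` and `bmi` values are copied from the first rows shown elsewhere in this notebook; the name `toy` is a placeholder.

```python
import numpy as np
import pandas as pd

# Four rows of age/bmi; the second bmi value is missing.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# isna() gives a True/False mask; its mean is the fraction of missing values.
pct_missing = toy.isna().mean().mul(100).round(2)
print(pct_missing)
```

This gives the same numbers as `round(df.isna().sum() / df.isna().count() * 100, 2)` with a shorter expression.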

In [136]:
# checking variable types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
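
`df.info()` reports five `object` columns, and each of them holds only a few distinct labels. One option, which this notebook does not use, is to convert such columns to the `category` dtype; on larger frames this typically reduces memory use. A sketch on a small hypothetical frame (the name `toy` and its rows are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "ever_married": ["Yes", "Yes", "Yes", "No"],
    "age": [67.0, 61.0, 80.0, 14.0],
})

# Columns to convert are listed explicitly; numeric columns are left untouched.
label_cols = ["gender", "ever_married"]
toy = toy.astype({col: "category" for col in label_cols})

print(toy.dtypes)
print(toy["gender"].cat.categories.tolist())
```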


In [137]:
# accessing the categorical variables of the dataset
cat_col = df.select_dtypes('object').columns
cat_col

Index(['gender', 'ever_married', 'work_type', 'Residence_type',
       'smoking_status'],
      dtype='object')

In [138]:
# printing the count of each unique value of the categorical variables
for column in cat_col:
    print ('-'*127)
    print(df[column].value_counts())
    print ('-'*127)  

-------------------------------------------------------------------------------------------------------------------------------
Female    2994
Male      2115
Other        1
Name: gender, dtype: int64
-------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------
Yes    3353
No     1757
Name: ever_married, dtype: int64
-------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------
Private          2925
Self-employed     819
children          687
Govt_job          657
Never_worked       22
Name: work_type, dtype: int64
-------------------------------------------------------------------------------------------
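
Relative frequencies are often easier to compare than raw counts, and `value_counts` accepts `normalize=True` for that purpose. A minimal sketch, using a small hypothetical Series rather than the real `work_type` column:

```python
import pandas as pd

# Hypothetical four-value sample; not the real column.
work_type = pd.Series(
    ["Private", "Private", "Self-employed", "children"], name="work_type"
)

# normalize=True returns fractions; scale to percentages for readability.
shares = work_type.value_counts(normalize=True).mul(100).round(1)
print(shares)
```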

In [139]:
# accessing the numeric variables of the dataset
num_col = df.select_dtypes('number').columns
num_col

Index(['id', 'age', 'hypertension', 'heart_disease', 'avg_glucose_level',
       'bmi', 'stroke'],
      dtype='object')

In [140]:
# accessing a specific column of the dataset, for example, 'ever_married'
df.ever_married

0       Yes
1       Yes
2       Yes
3       Yes
4       Yes
       ... 
5105    Yes
5106    Yes
5107    Yes
5108    Yes
5109    Yes
Name: ever_married, Length: 5110, dtype: object
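
Attribute access (`df.ever_married`) is convenient, but bracket access works for every column name, including names that contain spaces or clash with DataFrame methods. The sketch below uses a small hypothetical frame; the `count` column is invented purely to illustrate the clash.

```python
import pandas as pd

toy = pd.DataFrame({"ever_married": ["Yes", "No"], "count": [3, 5]})

# Both spellings return the same Series for an ordinary column name.
same = toy["ever_married"].equals(toy.ever_married)

# The attribute form resolves to the DataFrame.count method, not the column.
is_method = callable(toy.count)

# Bracket access always returns the column.
count_col = toy["count"]
print(same, is_method, count_col.tolist())
```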

In [141]:
# accessing part of a specific column, for example, the first five rows of the 'ever_married' column
df.ever_married[0:5]

0    Yes
1    Yes
2    Yes
3    Yes
4    Yes
Name: ever_married, dtype: object

In [142]:
# accessing several columns of the dataset
df[['age', 'bmi']]

|  | age | bmi |
|---|---|---|
| 0 | 67.0 | 36.6 |
| 1 | 61.0 | NaN |
| 2 | 80.0 | 32.5 |
| 3 | 49.0 | 34.4 |
| 4 | 79.0 | 24.0 |
| ... | ... | ... |
| 5105 | 80.0 | NaN |
| 5106 | 81.0 | 40.0 |
| 5107 | 35.0 | 30.6 |
| 5108 | 51.0 | 25.6 |


In [143]:
# accessing a specific row through integer-location indexing (iloc), for example, index #1
df.iloc[1]

id                           51676
gender                      Female
age                             61
hypertension                     0
heart_disease                    0
ever_married                   Yes
work_type            Self-employed
Residence_type               Rural
avg_glucose_level           202.21
bmi                            NaN
smoking_status        never smoked
stroke                           1
Name: 1, dtype: object

In [144]:
# accessing multiple specific rows through integer-location indexing (iloc)
df.iloc[1011:1014]

|  | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1011 | 8521 | Male | 71.0 | 0 | 0 | Yes | Private | Rural | 227.91 | 31.6 | formerly smoked | 0 |
| 1012 | 72779 | Female | 14.0 | 0 | 0 | No | children | Urban | 131.77 | 31.0 | Unknown | 0 |
| 1013 | 45824 | Female | 77.0 | 1 | 0 | Yes | Self-employed | Urban | 102.01 | 29.5 | Unknown | 0 |


In [145]:
# accessing a specific item of the dataset, for example, the age of observation/row #1013
df.iloc[1013, 2]

77.0
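
`iloc` addresses cells by position, so `df.iloc[1013, 2]` relies on `age` being the third column. Label-based access with `loc` (or `at` for a single cell) avoids that dependency. This is a sketch using rows 1011 to 1013 from the output above, trimmed to three columns; the name `rows` is a placeholder.

```python
import pandas as pd

# Rows 1011-1013, keeping their original index labels.
rows = pd.DataFrame(
    {
        "id": [8521, 72779, 45824],
        "gender": ["Male", "Female", "Female"],
        "age": [71.0, 14.0, 77.0],
    },
    index=[1011, 1012, 1013],
)

by_label = rows.loc[1013, "age"]   # row label and column name
by_at = rows.at[1013, "age"]       # same lookup, optimized for one cell
by_position = rows.iloc[2, 2]      # third row, third column of this sample

print(by_label, by_at, by_position)
```

On the full dataset the row labels and row positions coincide because the index is a default `RangeIndex`; after filtering or sorting they no longer do, which is where `loc` and `iloc` start to differ.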

In [146]:
# iterating over each row of the dataset (this prints every row, so the output is very long)
for index, row in df.iterrows():
    print(index, row)

0 id                              9046
gender                          Male
age                               67
hypertension                       0
heart_disease                      1
ever_married                     Yes
work_type                    Private
Residence_type                 Urban
avg_glucose_level             228.69
bmi                             36.6
smoking_status       formerly smoked
stroke                             1
Name: 0, dtype: object
1 id                           51676
gender                      Female
age                             61
hypertension                     0
heart_disease                    0
ever_married                   Yes
work_type            Self-employed
Residence_type               Rural
avg_glucose_level           202.21
bmi                            NaN
smoking_status        never smoked
stroke                           1
Name: 1, dtype: object
...

Name: 3438, dtype: object
3439 id                      21661
gender                 Female
age                        68
hypertension                0
heart_disease               0
ever_married              Yes
work_type            Govt_job
Residence_type          Urban
avg_glucose_level      228.05
bmi                      51.9
smoking_status        Unknown
stroke                      0
Name: 3439, dtype: object
3440 id                      18837
gender                   Male
age                      0.56
hypertension                0
heart_disease               0
ever_married               No
work_type            children
Residence_type          Urban
avg_glucose_level       98.23
bmi                      14.1
smoking_status        Unknown
stroke                      0
Name: 3440, dtype: object
3441 id                           57777
gender                      Female
age                             59
hypertension                     0
heart_disease                    0
ever_married

3587 id                             14000
gender                        Female
age                               72
hypertension                       1
heart_disease                      1
ever_married                     Yes
work_type                    Private
Residence_type                 Urban
avg_glucose_level             198.32
bmi                             31.3
smoking_status       formerly smoked
stroke                             0
Name: 3587, dtype: object
3588 id                          23047
gender                       Male
age                            43
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Urban
avg_glucose_level          100.16
bmi                          59.7
smoking_status       never smoked
stroke                          0
Name: 3588, dtype: object
3589 id                           6827
gender                       Male
age          

Name: 3715, dtype: object
3716 id                     44831
gender                Female
age                       69
hypertension               0
heart_disease              0
ever_married              No
work_type            Private
Residence_type         Urban
avg_glucose_level      59.31
bmi                     31.4
smoking_status        smokes
stroke                     0
Name: 3716, dtype: object
3717 id                             68420
gender                        Female
age                               13
hypertension                       0
heart_disease                      0
ever_married                      No
work_type                   children
Residence_type                 Urban
avg_glucose_level              63.22
bmi                             18.5
smoking_status       formerly smoked
stroke                             0
Name: 3717, dtype: object
3718 id                          39632
gender                     Female
age                            53
hypertension 

Name: 3846, dtype: object
3847 id                          66083
gender                       Male
age                            62
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Rural
avg_glucose_level          145.46
bmi                          40.1
smoking_status       never smoked
stroke                          0
Name: 3847, dtype: object
3848 id                          21238
gender                     Female
age                            43
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Urban
avg_glucose_level           74.86
bmi                          26.9
smoking_status       never smoked
stroke                          0
Name: 3848, dtype: object
3849 id                      70992
gender                 Female
age                         8
h

Name: 4029, dtype: object
4030 id                      12990
gender                   Male
age                         9
hypertension                0
heart_disease               0
ever_married               No
work_type            children
Residence_type          Rural
avg_glucose_level       84.17
bmi                      17.4
smoking_status        Unknown
stroke                      0
Name: 4030, dtype: object
4031 id                          14414
gender                     Female
age                            34
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Rural
avg_glucose_level           85.79
bmi                            32
smoking_status       never smoked
stroke                          0
Name: 4031, dtype: object
4032 id                     46343
gender                Female
age                       79
hypertension               0
heart_disease          

Name: 4155, dtype: object
4156 id                          37728
gender                     Female
age                            26
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Urban
avg_glucose_level           68.99
bmi                          22.2
smoking_status       never smoked
stroke                          0
Name: 4156, dtype: object
4157 id                      47410
gender                 Female
age                        14
hypertension                0
heart_disease               0
ever_married               No
work_type            children
Residence_type          Rural
avg_glucose_level      111.76
bmi                      24.8
smoking_status        Unknown
stroke                      0
Name: 4157, dtype: object
4158 id                          56450
gender                       Male
age                            25
hypertension                    0
hea

Name: 4287, dtype: object
4288 id                      47886
gender                 Female
age                        43
hypertension                1
heart_disease               0
ever_married              Yes
work_type            Govt_job
Residence_type          Rural
avg_glucose_level       56.94
bmi                      45.3
smoking_status        Unknown
stroke                      0
Name: 4288, dtype: object
4289 id                             21407
gender                          Male
age                               39
hypertension                       0
heart_disease                      0
ever_married                     Yes
work_type                    Private
Residence_type                 Rural
avg_glucose_level             117.03
bmi                             40.3
smoking_status       formerly smoked
stroke                             0
Name: 4289, dtype: object
4290 id                          34026
gender                     Female
age                            60
h

Name: 4390, dtype: object
4391 id                     63312
gender                  Male
age                       16
hypertension               0
heart_disease              0
ever_married              No
work_type            Private
Residence_type         Urban
avg_glucose_level      80.55
bmi                     23.5
smoking_status        smokes
stroke                     0
Name: 4391, dtype: object
4392 id                      55681
gender                 Female
age                         7
hypertension                0
heart_disease               0
ever_married               No
work_type            children
Residence_type          Rural
avg_glucose_level       63.98
bmi                        23
smoking_status        Unknown
stroke                      0
Name: 4392, dtype: object
4393 id                     63804
gender                Female
age                       27
hypertension               0
heart_disease              0
ever_married              No
work_type            Priv

Name: 4498, dtype: object
4499 id                          56547
gender                       Male
age                            54
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Rural
avg_glucose_level           57.56
bmi                          27.5
smoking_status       never smoked
stroke                          0
Name: 4499, dtype: object
4500 id                             13598
gender                          Male
age                               60
hypertension                       0
heart_disease                      0
ever_married                     Yes
work_type              Self-employed
Residence_type                 Urban
avg_glucose_level             227.23
bmi                               40
smoking_status       formerly smoked
stroke                             0
Name: 4500, dtype: object
4501 id                      24246
gender                   

Name: 4656, dtype: object
4657 id                      21852
gender                   Male
age                         2
hypertension                0
heart_disease               0
ever_married               No
work_type            children
Residence_type          Rural
avg_glucose_level       96.47
bmi                      19.5
smoking_status        Unknown
stroke                      0
Name: 4657, dtype: object
4658 id                             24711
gender                        Female
age                               55
hypertension                       0
heart_disease                      0
ever_married                     Yes
work_type                   Govt_job
Residence_type                 Urban
avg_glucose_level              99.44
bmi                               25
smoking_status       formerly smoked
stroke                             0
Name: 4658, dtype: object
4659 id                     21967
gender                Female
age                       20
hypertension    

Name: 4776, dtype: object
4777 id                          10119
gender                       Male
age                            79
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Rural
avg_glucose_level           69.34
bmi                            29
smoking_status       never smoked
stroke                          0
Name: 4777, dtype: object
4778 id                           48127
gender                        Male
age                             53
hypertension                     0
heart_disease                    0
ever_married                   Yes
work_type            Self-employed
Residence_type               Urban
avg_glucose_level           109.09
bmi                           26.3
smoking_status              smokes
stroke                           0
Name: 4778, dtype: object
4779 id                           65892
gender                      Female
age      

Name: 4904, dtype: object
4905 id                     49925
gender                Female
age                       60
hypertension               0
heart_disease              0
ever_married             Yes
work_type            Private
Residence_type         Rural
avg_glucose_level      84.54
bmi                     23.4
smoking_status        smokes
stroke                     0
Name: 4905, dtype: object
4906 id                          72696
gender                     Female
age                            53
hypertension                    0
heart_disease                   0
ever_married                  Yes
work_type                 Private
Residence_type              Urban
avg_glucose_level           70.51
bmi                          54.1
smoking_status       never smoked
stroke                          0
Name: 4906, dtype: object
4907 id                             39708
gender                          Male
age                               55
hypertension                       0
hea

5047 id                             25102
gender                        Female
age                               51
hypertension                       0
heart_disease                      0
ever_married                     Yes
work_type                   Govt_job
Residence_type                 Urban
avg_glucose_level              95.16
bmi                             42.7
smoking_status       formerly smoked
stroke                             0
Name: 5047, dtype: object
5048 id                     28788
gender                  Male
age                       40
hypertension               0
heart_disease              0
ever_married             Yes
work_type            Private
Residence_type         Urban
avg_glucose_level     191.15
bmi                      NaN
smoking_status        smokes
stroke                     0
Name: 5048, dtype: object
5049 id                          29028
gender                     Female
age                            41
hypertension                    0
heart

In [147]:
# iterating over a single column of the dataset, for example, the 'age' column
for index, row in df.iterrows():
    print(index, row['age'])

0 67.0
1 61.0
2 80.0
3 49.0
4 79.0
5 81.0
6 74.0
7 69.0
8 59.0
9 78.0
10 81.0
11 61.0
12 54.0
13 78.0
14 79.0
15 50.0
16 64.0
17 75.0
18 60.0
19 57.0
20 71.0
21 52.0
22 79.0
23 82.0
24 71.0
25 80.0
26 65.0
27 58.0
28 69.0
29 59.0
30 57.0
31 42.0
32 82.0
33 80.0
34 48.0
35 82.0
36 74.0
37 72.0
38 58.0
39 49.0
40 78.0
41 54.0
42 82.0
43 63.0
44 60.0
45 76.0
46 75.0
47 58.0
48 81.0
49 39.0
50 76.0
51 78.0
52 79.0
53 77.0
54 63.0
55 63.0
56 82.0
57 78.0
58 73.0
59 54.0
60 56.0
61 80.0
62 67.0
63 45.0
64 75.0
65 78.0
66 70.0
67 76.0
68 59.0
69 80.0
70 76.0
71 67.0
72 66.0
73 63.0
74 52.0
75 80.0
76 80.0
77 79.0
78 51.0
79 43.0
80 59.0
81 66.0
82 79.0
83 68.0
84 58.0
85 54.0
86 61.0
87 70.0
88 47.0
89 74.0
90 79.0
91 81.0
92 57.0
93 80.0
94 45.0
95 78.0
96 70.0
97 58.0
98 57.0
99 69.0
100 64.0
101 77.0
102 74.0
103 81.0
104 57.0
105 58.0
106 50.0
107 54.0
108 79.0
109 53.0
110 79.0
111 80.0
112 76.0
113 45.0
114 68.0
115 71.0
116 61.0
117 74.0
118 38.0
119 77.0
120 58.0
121 53.0
122 80.0
123

1003 54.0
1004 39.0
1005 26.0
1006 6.0
1007 41.0
1008 42.0
1009 9.0
1010 55.0
1011 71.0
1012 14.0
1013 77.0
1014 50.0
1015 49.0
1016 51.0
1017 79.0
1018 63.0
1019 66.0
1020 20.0
1021 37.0
1022 22.0
1023 60.0
1024 39.0
1025 53.0
1026 55.0
1027 63.0
1028 57.0
1029 82.0
1030 56.0
1031 41.0
1032 8.0
1033 34.0
1034 75.0
1035 57.0
1036 72.0
1037 21.0
1038 51.0
1039 15.0
1040 24.0
1041 30.0
1042 82.0
1043 62.0
1044 79.0
1045 19.0
1046 45.0
1047 5.0
1048 57.0
1049 31.0
1050 61.0
1051 27.0
1052 61.0
1053 53.0
1054 76.0
1055 57.0
1056 9.0
1057 34.0
1058 51.0
1059 61.0
1060 19.0
1061 50.0
1062 80.0
1063 13.0
1064 55.0
1065 67.0
1066 30.0
1067 67.0
1068 82.0
1069 5.0
1070 81.0
1071 48.0
1072 66.0
1073 38.0
1074 8.0
1075 47.0
1076 27.0
1077 53.0
1078 27.0
1079 36.0
1080 50.0
1081 32.0
1082 58.0
1083 73.0
1084 62.0
1085 50.0
1086 51.0
1087 19.0
1088 30.0
1089 45.0
1090 30.0
1091 28.0
1092 70.0
1093 0.32
1094 23.0
1095 18.0
1096 41.0
1097 52.0
1098 77.0
1099 34.0
1100 67.0
1101 1.64
1102 23.0
1103 59

1952 52.0
1953 72.0
1954 52.0
1955 61.0
1956 15.0
1957 1.56
1958 6.0
1959 3.0
1960 18.0
1961 53.0
1962 58.0
1963 31.0
1964 29.0
1965 5.0
1966 40.0
1967 75.0
1968 52.0
1969 39.0
1970 40.0
1971 78.0
1972 39.0
1973 17.0
1974 45.0
1975 0.56
1976 13.0
1977 26.0
1978 42.0
1979 44.0
1980 3.0
1981 42.0
1982 25.0
1983 41.0
1984 51.0
1985 20.0
1986 25.0
1987 18.0
1988 37.0
1989 51.0
1990 2.0
1991 38.0
1992 64.0
1993 60.0
1994 22.0
1995 71.0
1996 32.0
1997 32.0
1998 63.0
1999 0.24
2000 54.0
2001 25.0
2002 80.0
2003 31.0
2004 53.0
2005 35.0
2006 31.0
2007 60.0
2008 0.56
2009 21.0
2010 78.0
2011 59.0
2012 0.64
2013 10.0
2014 60.0
2015 11.0
2016 48.0
2017 50.0
2018 69.0
2019 20.0
2020 22.0
2021 55.0
2022 57.0
2023 29.0
2024 32.0
2025 54.0
2026 37.0
2027 58.0
2028 41.0
2029 72.0
2030 0.48
2031 32.0
2032 54.0
2033 79.0
2034 56.0
2035 45.0
2036 6.0
2037 45.0
2038 60.0
2039 65.0
2040 57.0
2041 58.0
2042 8.0
2043 18.0
2044 49.0
2045 2.0
2046 52.0
2047 63.0
2048 57.0
2049 50.0
2050 12.0
2051 35.0
2052 35.

2836 32.0
2837 27.0
2838 44.0
2839 20.0
2840 52.0
2841 57.0
2842 29.0
2843 16.0
2844 35.0
2845 5.0
2846 63.0
2847 59.0
2848 63.0
2849 52.0
2850 50.0
2851 43.0
2852 27.0
2853 30.0
2854 8.0
2855 75.0
2856 14.0
2857 23.0
2858 6.0
2859 37.0
2860 38.0
2861 3.0
2862 26.0
2863 58.0
2864 57.0
2865 58.0
2866 76.0
2867 68.0
2868 79.0
2869 34.0
2870 75.0
2871 11.0
2872 71.0
2873 40.0
2874 24.0
2875 0.64
2876 82.0
2877 32.0
2878 81.0
2879 33.0
2880 79.0
2881 62.0
2882 39.0
2883 60.0
2884 48.0
2885 24.0
2886 70.0
2887 17.0
2888 56.0
2889 3.0
2890 65.0
2891 72.0
2892 10.0
2893 29.0
2894 44.0
2895 46.0
2896 33.0
2897 63.0
2898 0.24
2899 55.0
2900 56.0
2901 50.0
2902 78.0
2903 63.0
2904 31.0
2905 65.0
2906 51.0
2907 60.0
2908 69.0
2909 23.0
2910 46.0
2911 16.0
2912 26.0
2913 44.0
2914 56.0
2915 23.0
2916 38.0
2917 18.0
2918 63.0
2919 23.0
2920 32.0
2921 8.0
2922 77.0
2923 41.0
2924 34.0
2925 25.0
2926 35.0
2927 15.0
2928 1.64
2929 4.0
2930 33.0
2931 28.0
2932 37.0
2933 50.0
2934 76.0
2935 72.0
2936 16

3727 54.0
3728 14.0
3729 45.0
3730 51.0
3731 8.0
3732 52.0
3733 39.0
3734 13.0
3735 69.0
3736 71.0
3737 73.0
3738 54.0
3739 10.0
3740 26.0
3741 41.0
3742 71.0
3743 46.0
3744 15.0
3745 29.0
3746 8.0
3747 21.0
3748 56.0
3749 14.0
3750 78.0
3751 36.0
3752 57.0
3753 79.0
3754 26.0
3755 22.0
3756 72.0
3757 54.0
3758 8.0
3759 62.0
3760 28.0
3761 50.0
3762 7.0
3763 33.0
3764 55.0
3765 25.0
3766 25.0
3767 37.0
3768 58.0
3769 45.0
3770 60.0
3771 66.0
3772 80.0
3773 38.0
3774 11.0
3775 63.0
3776 19.0
3777 17.0
3778 19.0
3779 40.0
3780 49.0
3781 69.0
3782 46.0
3783 78.0
3784 63.0
3785 3.0
3786 1.8
3787 18.0
3788 46.0
3789 8.0
3790 53.0
3791 38.0
3792 74.0
3793 24.0
3794 78.0
3795 60.0
3796 12.0
3797 32.0
3798 5.0
3799 40.0
3800 19.0
3801 28.0
3802 61.0
3803 44.0
3804 50.0
3805 50.0
3806 18.0
3807 1.64
3808 37.0
3809 5.0
3810 39.0
3811 65.0
3812 26.0
3813 42.0
3814 34.0
3815 45.0
3816 43.0
3817 40.0
3818 35.0
3819 2.0
3820 61.0
3821 64.0
3822 32.0
3823 23.0
3824 51.0
3825 52.0
3826 75.0
3827 40.0


4582 76.0
4583 46.0
4584 23.0
4585 9.0
4586 53.0
4587 4.0
4588 62.0
4589 37.0
4590 82.0
4591 33.0
4592 3.0
4593 14.0
4594 16.0
4595 40.0
4596 18.0
4597 29.0
4598 56.0
4599 33.0
4600 2.0
4601 36.0
4602 30.0
4603 31.0
4604 16.0
4605 58.0
4606 19.0
4607 47.0
4608 59.0
4609 40.0
4610 26.0
4611 17.0
4612 30.0
4613 19.0
4614 78.0
4615 55.0
4616 59.0
4617 57.0
4618 33.0
4619 35.0
4620 32.0
4621 55.0
4622 28.0
4623 25.0
4624 45.0
4625 34.0
4626 33.0
4627 65.0
4628 62.0
4629 36.0
4630 31.0
4631 54.0
4632 53.0
4633 44.0
4634 77.0
4635 67.0
4636 48.0
4637 42.0
4638 72.0
4639 49.0
4640 1.32
4641 45.0
4642 63.0
4643 33.0
4644 32.0
4645 0.48
4646 63.0
4647 70.0
4648 57.0
4649 8.0
4650 54.0
4651 37.0
4652 59.0
4653 78.0
4654 59.0
4655 10.0
4656 21.0
4657 2.0
4658 55.0
4659 20.0
4660 38.0
4661 33.0
4662 14.0
4663 32.0
4664 32.0
4665 68.0
4666 70.0
4667 24.0
4668 44.0
4669 39.0
4670 81.0
4671 19.0
4672 69.0
4673 42.0
4674 8.0
4675 8.0
4676 28.0
4677 66.0
4678 66.0
4679 47.0
4680 78.0
4681 65.0
4682 78.
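
The loop above builds a full row Series for every record just to read one field. A minimal sketch of two lighter alternatives follows. It assumes a small hypothetical frame named `sample` in place of the stroke dataset, so the printed values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the stroke dataset.
sample = pd.DataFrame({"id": [9046, 51676, 31112], "age": [67.0, 61.0, 80.0]})

# Series.items() yields (index, value) pairs for one column
# without constructing a row object per record.
pairs = [(index, age) for index, age in sample["age"].items()]
print(pairs)

# itertuples() yields lightweight namedtuples when several fields are needed.
for row in sample.itertuples():
    print(row.Index, row.age)
```

For anything beyond printing, column-wise operations such as `sample["age"].mean()` usually avoid the explicit loop altogether.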

In [148]:
# filtering the dataset by a specific value, for example, 'Male' in the 'gender' column
df.loc[df['gender'] == 'Male']

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,formerly smoked,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,never smoked,1
13,8213,Male,78.0,0,1,Yes,Private,Urban,219.84,,Unknown,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5097,64520,Male,68.0,0,0,Yes,Self-employed,Urban,91.68,40.8,Unknown,0
5098,579,Male,9.0,0,0,No,children,Urban,71.88,17.5,Unknown,0
5099,7293,Male,40.0,0,0,Yes,Private,Rural,83.94,,smokes,0
5100,68398,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0
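
The filter above works by passing a boolean Series to `.loc`. The sketch below spells that mask out and shows two equivalent or related forms, `query()` and `isin()`. The frame `sample` is a hypothetical stand-in for the dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Other"],
    "age": [67.0, 61.0, 80.0, 26.0],
})

# The comparison produces a boolean Series with one entry per row.
mask = sample["gender"] == "Male"
males = sample.loc[mask]

# query() expresses the same filter as a string expression.
males_q = sample.query("gender == 'Male'")

# isin() keeps rows whose value matches any item of a list.
not_female = sample.loc[sample["gender"].isin(["Male", "Other"])]

print(males)
print(not_female)
```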


In [149]:
# summary statistics of the numeric columns
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0
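
In the table above, `bmi` has a count of 4909 against 5110 for every other column, because `describe()` counts only non-missing values; the difference of 201 is the number of missing `bmi` entries. A small sketch of the same effect, assuming a hypothetical three-row frame `sample`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing bmi value.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "avg_glucose_level": [228.69, 202.21, 105.92],
    "bmi": [36.6, np.nan, 32.5],
})

stats = sample.describe()          # numeric columns only by default
missing = sample.isna().sum()      # missing values per column

print(stats.loc["count"])
print(missing)
```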


In [150]:
# summary statistics of all columns, including the non-numeric ones
df.describe(include='all')

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
count,5110.0,5110,5110.0,5110.0,5110.0,5110,5110,5110,5110.0,4909.0,5110,5110.0
unique,,3,,,,2,5,2,,,4,
top,,Female,,,,Yes,Private,Urban,,,never smoked,
freq,,2994,,,,3353,2925,2596,,,1892,
mean,36517.829354,,43.226614,0.097456,0.054012,,,,106.147677,28.893237,,0.048728
std,21161.721625,,22.612647,0.296607,0.226063,,,,45.28356,7.854067,,0.21532
min,67.0,,0.08,0.0,0.0,,,,55.12,10.3,,0.0
25%,17741.25,,25.0,0.0,0.0,,,,77.245,23.5,,0.0
50%,36932.0,,45.0,0.0,0.0,,,,91.885,28.1,,0.0
75%,54682.0,,61.0,0.0,0.0,,,,114.09,33.1,,0.0
max,72940.0,,82.0,1.0,1.0,,,,271.74,97.6,,1.0


In [151]:
# full statistics of the dataset, transposed so that each dataset column becomes a row
df.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,5110,,,,36517.8,21161.7,67.0,17741.2,36932.0,54682.0,72940.0
gender,5110,3.0,Female,2994.0,,,,,,,
age,5110,,,,43.2266,22.6126,0.08,25.0,45.0,61.0,82.0
hypertension,5110,,,,0.097456,0.296607,0.0,0.0,0.0,0.0,1.0
heart_disease,5110,,,,0.0540117,0.226063,0.0,0.0,0.0,0.0,1.0
ever_married,5110,2.0,Yes,3353.0,,,,,,,
work_type,5110,5.0,Private,2925.0,,,,,,,
Residence_type,5110,2.0,Urban,2596.0,,,,,,,
avg_glucose_level,5110,,,,106.148,45.2836,55.12,77.245,91.885,114.09,271.74
bmi,4909,,,,28.8932,7.85407,10.3,23.5,28.1,33.1,97.6
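
With `include='all'`, text columns fill `unique`, `top` and `freq` while numeric columns fill `mean`, `std` and the quantiles; the remaining cells are NaN. The sketch below reproduces this on a hypothetical two-column frame `sample` and transposes the result.

```python
import pandas as pd

# Hypothetical toy frame with one text and one numeric column.
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 49.0],
})

full = sample.describe(include="all")
wide = full.T   # one row per original column

print(wide[["count", "unique", "top", "freq", "max"]])
```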


In [152]:
# sorting the dataset alphabetically by the 'Residence_type' column
df.sort_values('Residence_type')

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
2554,72369,Female,14.0,0,0,No,children,Rural,65.41,19.5,Unknown,0
2094,6199,Female,52.0,0,0,Yes,Govt_job,Rural,107.27,30.1,Unknown,0
2095,4635,Female,68.0,0,0,Yes,Private,Rural,97.96,31.3,never smoked,0
2097,24342,Female,23.0,0,0,No,Private,Rural,112.30,26.6,Unknown,0
4060,68330,Female,69.0,0,0,Yes,Self-employed,Rural,110.96,25.9,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2297,58359,Female,71.0,1,0,Yes,Private,Urban,129.97,44.2,smokes,0
2298,59347,Male,62.0,0,0,Yes,Private,Urban,124.26,33.4,never smoked,0
2299,12849,Female,28.0,0,0,Yes,Private,Urban,87.92,32.5,Unknown,0
2304,58253,Male,5.0,0,0,No,children,Urban,71.92,18.2,Unknown,0


In [153]:
# sorting the dataset in the opposite (descending) order
df.sort_values('Residence_type', ascending=False)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
2887,62923,Female,17.0,0,0,No,Private,Urban,87.39,24.6,Unknown,0
2862,52790,Female,26.0,0,0,No,Govt_job,Urban,123.81,39.0,never smoked,0
2865,51963,Male,58.0,0,0,Yes,Private,Urban,69.24,27.6,never smoked,0
2866,13375,Male,76.0,0,0,Yes,Private,Urban,192.39,31.0,never smoked,0
...,...,...,...,...,...,...,...,...,...,...,...,...
3013,52968,Female,45.0,0,0,Yes,Self-employed,Rural,149.15,33.5,Unknown,0
1082,59894,Female,58.0,0,0,Yes,Govt_job,Rural,109.56,23.1,never smoked,0
3015,51421,Female,54.0,0,0,Yes,Private,Rural,65.38,25.9,Unknown,0
3017,33009,Male,76.0,0,0,Yes,Self-employed,Rural,221.80,44.7,formerly smoked,0
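
In both sorted outputs above, rows that share the same `Residence_type` do not appear in their original index order. The default `kind='quicksort'` is not guaranteed to be stable, so the order of tied rows is unspecified. A minimal sketch, assuming a hypothetical frame `sample`, of requesting a stable algorithm instead:

```python
import pandas as pd

# Hypothetical toy frame with repeated sort keys.
sample = pd.DataFrame({
    "Residence_type": ["Urban", "Rural", "Urban", "Rural", "Urban"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
})

# mergesort is a stable algorithm: tied rows keep their original relative order.
stable = sample.sort_values("Residence_type", kind="mergesort")
print(stable)
```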


In [154]:
# sorting the dataset by several criteria, for example, by 'ever_married', then by 'work_type'
df.sort_values(['ever_married', 'work_type'])

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
19,25226,Male,57.0,0,1,No,Govt_job,Urban,217.08,,Unknown,1
34,14248,Male,48.0,0,0,No,Govt_job,Urban,84.20,29.7,never smoked,1
114,71639,Female,68.0,0,0,No,Govt_job,Urban,82.10,27.1,Unknown,1
115,53401,Male,71.0,1,1,No,Govt_job,Rural,216.94,30.9,never smoked,1
168,44993,Female,79.0,1,0,No,Govt_job,Urban,98.02,22.3,formerly smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5088,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,Unknown,0
5097,64520,Male,68.0,0,0,Yes,Self-employed,Urban,91.68,40.8,Unknown,0
5100,68398,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0


In [155]:
# sorting the dataset by one column in ascending order and by another in descending order
df.sort_values(['ever_married', 'work_type'], ascending=[True, False])

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
162,69768,Female,1.32,0,0,No,children,Urban,70.37,,Unknown,1
245,49669,Female,14.00,0,0,No,children,Rural,57.93,30.9,Unknown,1
249,30669,Male,3.00,0,0,No,children,Rural,95.12,18.0,Unknown,0
282,33759,Female,3.00,0,0,No,children,Urban,73.74,16.0,Unknown,0
290,55680,Male,13.00,0,0,No,children,Urban,114.84,18.3,Unknown,0
...,...,...,...,...,...,...,...,...,...,...,...,...
5081,37680,Male,55.00,0,0,Yes,Govt_job,Rural,108.35,40.8,formerly smoked,0
5092,56799,Male,76.00,0,0,Yes,Govt_job,Urban,82.35,38.9,never smoked,0
5093,32235,Female,45.00,1,0,Yes,Govt_job,Rural,95.02,,smokes,0
5096,41512,Male,57.00,0,0,Yes,Govt_job,Rural,76.62,28.2,never smoked,0
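
The `ascending` list is matched to the sort columns by position. String comparison is case-sensitive and uppercase letters sort before lowercase ones, which is why `children` leads the descending `work_type` order in the output above. A small sketch with a hypothetical frame `sample`:

```python
import pandas as pd

# Hypothetical toy frame; every (ever_married, work_type) pair is unique.
sample = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes", "No"],
    "work_type": ["Govt_job", "children", "Private", "Govt_job"],
})

# 'ever_married' ascending, 'work_type' descending within each group.
ordered = sample.sort_values(["ever_married", "work_type"], ascending=[True, False])
print(ordered)
```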


In [156]:
# adding a new column to the dataset; a missing 'bmi' value makes the sum NaN
df['New'] = df['avg_glucose_level'] + df['bmi']
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,New
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,265.29
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,138.42


In [157]:
# removing a column from the dataset
df = df.drop(columns=['New'])
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1


In [158]:
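# another way of adding a new column: sum the columns at positions 8 and 9
# ('avg_glucose_level' and 'bmi'); unlike '+', sum() skips NaN, so row 1 gets 202.21
df['New'] = df.iloc[:, 8:10].sum(axis=1)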
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,New
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1,265.29
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1,202.21
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1,138.42
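
The two ways of building the `New` column disagree on row 1: `+` returns NaN when `bmi` is missing, while `sum(axis=1)` skips the missing value. The sketch below compares both on a hypothetical frame `sample`, and also shows `min_count`, which makes `sum` return NaN unless enough values are present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has no bmi.
sample = pd.DataFrame({
    "avg_glucose_level": [228.69, 202.21],
    "bmi": [36.6, np.nan],
})
cols = ["avg_glucose_level", "bmi"]

plus = sample["avg_glucose_level"] + sample["bmi"]   # NaN propagates
skipped = sample[cols].sum(axis=1)                    # NaN is skipped
strict = sample[cols].sum(axis=1, min_count=2)        # NaN unless both values exist

print(pd.DataFrame({"plus": plus, "skipped": skipped, "strict": strict}))
```

Selecting the columns by name, as here, also keeps the code working if the column positions change.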


In [159]:
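# reordering columns: move the last column ('New') so that it sits right after 'bmi'
cols = list(df.columns)
df = df[cols[0:10] + [cols[-1]] + cols[10:12]]
print(cols)  # cols was captured before the reorder, so this prints the original order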

['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke', 'New']


In [160]:
# viewing the dataset after reordering it
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,New,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1
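
The slice-based reorder depends on hard-coded positions (10 and 12). As a sketch of an alternative, the column can be moved by name with `pop()` and `insert()`. The frame `sample` and the variable `target` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical toy frame with 'New' as the last column.
sample = pd.DataFrame({
    "bmi": [36.6],
    "smoking_status": ["formerly smoked"],
    "stroke": [1],
    "New": [265.29],
})

# Look up the position right after 'bmi' instead of hard-coding it.
target = sample.columns.get_loc("bmi") + 1

# pop() removes the column and returns it; insert() puts it back at target.
sample.insert(target, "New", sample.pop("New"))
print(list(sample.columns))
```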


In [161]:
# adding empty columns (filled with empty strings) to the dataset
df['Comments'] = ''
df['Notes'] = ''
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,New,smoking_status,stroke,Comments,Notes
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,
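
Several placeholder columns can also be added in one call with `assign()`, which returns a new frame and leaves the original untouched. The sketch assumes a hypothetical frame `sample`. It also shows `pd.NA` as a placeholder for cases where "missing" is a better description than an empty string.

```python
import pandas as pd

# Hypothetical toy frame.
sample = pd.DataFrame({"id": [9046, 51676]})

# Empty-string placeholders, added in one step.
with_text = sample.assign(Comments="", Notes="")

# A missing-value placeholder; isna() then reports these cells as missing.
with_missing = sample.assign(Comments=pd.NA)

print(with_text)
print(with_missing["Comments"].isna())
```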


In [162]:
# adding a new column filled with a constant value, for example, 1
df['Count'] = 1
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,New,smoking_status,stroke,Comments,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1


In [163]:
# renaming columns: 'New' to 'Total' and 'Comments' to 'Comment'
df.rename(columns={'New': 'Total', 'Comments': 'Comment'}, inplace=True)
df.head(3)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
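
By default `rename()` silently ignores keys that do not match any column, so a typo in a column name goes unnoticed. A minimal sketch, assuming a hypothetical frame `sample`, of the non-inplace form and of `errors='raise'`, which turns such a typo into a `KeyError`:

```python
import pandas as pd

# Hypothetical toy frame.
sample = pd.DataFrame({"New": [265.29], "Comments": [""]})

# Without inplace=True, rename() returns a new frame and keeps the original.
renamed = sample.rename(columns={"New": "Total", "Comments": "Comment"})

# 'Nwe' is a deliberate typo; errors='raise' reports it.
try:
    sample.rename(columns={"Nwe": "Total"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(renamed.columns), typo_detected)
```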


In [164]:
# filtering the dataset by two conditions that must both hold, for example, 'age' > 50 and 'gender' == 'Female'
df.loc[(df['age'] > 50) & (df['gender'] == 'Female')]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,,,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,117.19,never smoked,1,,,1
8,27419,Female,59.0,0,0,Yes,Private,Rural,76.15,,76.15,Unknown,1,,,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,82.77,Unknown,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5087,26214,Female,63.0,0,0,Yes,Self-employed,Rural,75.93,34.7,110.63,formerly smoked,0,,,1
5088,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,107.09,Unknown,0,,,1
5102,45010,Female,57.0,0,0,Yes,Private,Rural,77.93,21.7,99.63,never smoked,0,,,1
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1


In [165]:
# filtering rows that meet at least one of two conditions, for example 'age' > 50 or 'gender' is 'Female'
df.loc[(df['age'] > 50) | (df['gender'] == 'Female')]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,165.20,never smoked,0,,,1
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,113.59,never smoked,0,,,1
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,191.89,formerly smoked,0,,,1


In [166]:
# filtering the dataset by three conditions
df.loc[(df['age'] > 50) & (df['gender'] == 'Female') & (df['Residence_type'] == 'Urban')]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,117.19,never smoked,1,,,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,82.77,Unknown,1,,,1
12,12175,Female,54.0,0,0,Yes,Private,Urban,104.51,27.3,131.81,smokes,1,,,1
14,5317,Female,79.0,0,1,Yes,Private,Urban,214.09,28.2,242.29,never smoked,1,,,1
18,27458,Female,60.0,0,0,No,Private,Urban,89.22,37.8,127.02,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5085,53525,Female,72.0,0,0,Yes,Private,Urban,83.89,33.1,116.99,formerly smoked,0,,,1
5086,65411,Female,51.0,0,0,Yes,Private,Urban,152.56,21.8,174.36,Unknown,0,,,1
5088,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,107.09,Unknown,0,,,1
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
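
When several conditions are combined, naming each boolean mask can make the filter easier to read; each comparison still needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` or `==`. The sketch below uses a hypothetical four-row sample and also shows `DataFrame.query` as an equivalent form.

```python
import pandas as pd

# Hypothetical sample with the three columns used in the filters above
sample = pd.DataFrame({
    'age': [67.0, 61.0, 49.0, 69.0],
    'gender': ['Male', 'Female', 'Female', 'Female'],
    'Residence_type': ['Urban', 'Rural', 'Urban', 'Urban'],
})

# Named masks avoid a long, heavily parenthesized expression
older = sample['age'] > 50
female = sample['gender'] == 'Female'
urban = sample['Residence_type'] == 'Urban'

by_mask = sample.loc[older & female & urban]

# The same filter written as a query string
by_query = sample.query("age > 50 and gender == 'Female' and Residence_type == 'Urban'")

print(by_mask)
```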


In [167]:
# filtering rows whose value contains a given substring, for example 'Self' in the 'work_type' column
df.loc[df['work_type'].str.contains('Self')]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,,,1
15,58202,Female,50.0,1,0,Yes,Self-employed,Rural,167.41,30.9,198.31,never smoked,1,,,1
21,13861,Female,52.0,1,0,Yes,Self-employed,Urban,233.29,48.9,282.19,never smoked,1,,,1
22,68794,Female,79.0,0,0,Yes,Self-employed,Urban,228.70,26.6,255.30,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5088,22190,Female,64.0,1,0,Yes,Self-employed,Urban,76.89,30.2,107.09,Unknown,0,,,1
5097,64520,Male,68.0,0,0,Yes,Self-employed,Urban,91.68,40.8,132.48,Unknown,0,,,1
5100,68398,Male,82.0,1,0,Yes,Self-employed,Rural,71.97,28.3,100.27,never smoked,0,,,1
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,165.20,never smoked,0,,,1


In [168]:
# filtering rows whose value does not contain a given substring ('~' negates the mask), for example 'Self' in the 'work_type' column
df.loc[~df['work_type'].str.contains('Self')]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,215.21,formerly smoked,1,,,1
6,53882,Male,74.0,1,1,Yes,Private,Rural,70.09,27.4,97.49,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5103,22127,Female,18.0,0,0,No,Private,Urban,82.85,46.9,129.75,Unknown,0,,,1
5104,14180,Female,13.0,0,0,No,children,Rural,103.08,18.6,121.68,Unknown,0,,,1
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,191.89,formerly smoked,0,,,1
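
The `work_type` column has no missing values here, so the negated mask works as written. If a text column does contain missing values, `str.contains` can return missing entries in the mask, which may break negation or indexing depending on the pandas version. A minimal sketch on a hypothetical Series, using the `na=False` argument to treat missing strings as non-matches:

```python
import pandas as pd

# Hypothetical column with one missing value
work = pd.Series(['Private', 'Self-employed', None, 'children'])

# na=False makes missing strings count as "does not contain"
has_self = work.str.contains('Self', na=False)

# The mask is now purely boolean, so it can be negated safely
without_self = work[~has_self]

print(without_self)
```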


In [169]:
# filtering the dataset with a regular expression ('|' means "or")
# re is imported here because its flags are used in the following cells
import re

df.loc[df['Residence_type'].str.contains('Urban|Rural', regex=True)]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,165.20,never smoked,0,,,1
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,113.59,never smoked,0,,,1
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,191.89,formerly smoked,0,,,1


In [170]:
# regular expressions are case-sensitive by default; the re.I flag makes the match ignore case
df.loc[df['Residence_type'].str.contains('Urban|Rural', flags=re.I, regex=True)]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,165.20,never smoked,0,,,1
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,113.59,never smoked,0,,,1
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,191.89,formerly smoked,0,,,1


In [171]:
# filtering values that start with particular letters, for example 'ur' ('^' anchors the match at the start; re.I ignores case)
df.loc[df['Residence_type'].str.contains('^ur[a-z]*', flags=re.I, regex=True)]

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
5,56669,Male,81.0,0,0,Yes,Private,Urban,186.21,29.0,215.21,formerly smoked,1,,,1
7,10434,Female,69.0,0,0,No,Private,Urban,94.39,22.8,117.19,never smoked,1,,,1
9,60491,Female,78.0,0,0,Yes,Private,Urban,58.57,24.2,82.77,Unknown,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5101,36901,Female,45.0,0,0,Yes,Private,Urban,97.95,24.5,122.45,Unknown,0,,,1
5103,22127,Female,18.0,0,0,No,Private,Urban,82.85,46.9,129.75,Unknown,0,,,1
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,83.75,never smoked,0,,,1
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,165.20,never smoked,0,,,1
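
The same "starts with, ignoring case" filter can be written in a few equivalent ways. The sketch below compares them on a hypothetical Series; the values 'urban' and 'Suburban' are assumptions added to show the effect of case and of the `^` anchor, and do not occur in the dataset.

```python
import re
import pandas as pd

# Hypothetical values; 'Suburban' contains 'ur' but does not start with it
residence = pd.Series(['Urban', 'Rural', 'urban', 'Suburban'])

# Regular expression with the re.I flag, as in the cell above
with_flag = residence.str.contains('^ur', flags=re.I, regex=True)

# case=False is a shorthand for a case-insensitive match
with_case = residence.str.contains('^ur', case=False, regex=True)

# Without a regular expression: lowercase first, then test the prefix
with_lower = residence.str.lower().str.startswith('ur')

print(with_flag.tolist())
```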


In [172]:
# replacing one value by another in the dataset, for example, 'Govt_job' by 'Government_job'
df.loc[df['work_type'] == 'Govt_job', 'work_type'] = 'Government_job'
df.loc[df['work_type'] == 'Government_job']

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
11,12095,Female,61.0,0,1,Yes,Government_job,Rural,120.46,36.8,157.26,smokes,1,,,1
19,25226,Male,57.0,0,1,No,Government_job,Urban,217.08,,217.08,Unknown,1,,,1
20,70630,Female,71.0,0,0,Yes,Government_job,Rural,193.94,22.4,216.34,smokes,1,,,1
34,14248,Male,48.0,0,0,No,Government_job,Urban,84.20,29.7,113.90,never smoked,1,,,1
44,7937,Male,60.0,1,0,Yes,Government_job,Urban,213.03,20.2,233.23,smokes,1,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5090,4211,Male,26.0,0,0,No,Government_job,Rural,100.85,21.0,121.85,smokes,0,,,1
5092,56799,Male,76.0,0,0,Yes,Government_job,Urban,82.35,38.9,121.25,never smoked,0,,,1
5093,32235,Female,45.0,1,0,Yes,Government_job,Rural,95.02,,95.02,smokes,0,,,1
5096,41512,Male,57.0,0,0,Yes,Government_job,Rural,76.62,28.2,104.82,never smoked,0,,,1
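
For a plain value-for-value substitution, `Series.replace` with a mapping is an alternative to the `.loc` assignment. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical 'work_type' values
work = pd.Series(['Govt_job', 'Private', 'Govt_job', 'children'])

# replace() returns a new Series; values not in the mapping are kept as they are
renamed = work.replace({'Govt_job': 'Government_job'})

print(renamed.tolist())
```

The mapping form extends naturally to several substitutions at once, since each key is replaced by its own value.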


In [173]:
# setting the same value in several columns for the rows that meet a condition
df.loc[df['bmi'] < 28, ['Comment', 'Notes']] = 'Normal'
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,Normal,Normal,1


In [174]:
# setting a different value in each of several columns for the rows that meet a condition
df.loc[df['bmi'] < 28, ['Comment', 'Notes']] = ['Normal', 'Check']
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,265.29,formerly smoked,1,,,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,202.21,never smoked,1,,,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,138.42,never smoked,1,,,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,205.63,smokes,1,,,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,198.12,never smoked,1,Normal,Check,1
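
Rows with a missing `bmi` are left unchanged by this assignment, because a comparison with a missing value evaluates to `False`. The sketch below reproduces the pattern on a hypothetical four-row sample so that the mask can be inspected directly.

```python
import pandas as pd

# Hypothetical sample; the second row has a missing bmi
sample = pd.DataFrame({
    'bmi': [36.6, None, 24.0, 22.8],
    'Comment': ['', '', '', ''],
    'Notes': ['', '', '', ''],
})

# NaN < 28 is False, so the row with missing bmi is not selected
low_bmi = sample['bmi'] < 28

# One value per target column, applied to every selected row
sample.loc[low_bmi, ['Comment', 'Notes']] = ['Normal', 'Check']

print(sample)
```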


In [175]:
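# mean of the numeric columns for each 'work_type'
# numeric_only=True skips text columns, which recent pandas versions no longer drop automatically
df.groupby('work_type').mean(numeric_only=True)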

Unnamed: 0_level_0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,Total,stroke,Count
work_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Government_job,36516.936073,50.879756,0.111111,0.054795,107.779772,30.522063,137.047504,0.050228,1.0
Never_worked,38274.409091,16.181818,0.0,0.0,96.042727,25.545455,121.588182,0.0,1.0
Private,36951.227009,45.503932,0.096068,0.054017,106.796844,30.304625,135.920366,0.05094,1.0
Self-employed,35551.288156,60.201465,0.175824,0.098901,112.645446,30.211871,141.234212,0.079365,1.0
children,35769.432314,6.841339,0.0,0.001456,94.400277,20.038003,113.971601,0.002911,1.0


In [176]:
# mean of the numeric columns for each 'work_type', sorted by mean 'age' in ascending order
df.groupby('work_type').mean(numeric_only=True).sort_values('age')

Unnamed: 0_level_0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,Total,stroke,Count
work_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
children,35769.432314,6.841339,0.0,0.001456,94.400277,20.038003,113.971601,0.002911,1.0
Never_worked,38274.409091,16.181818,0.0,0.0,96.042727,25.545455,121.588182,0.0,1.0
Private,36951.227009,45.503932,0.096068,0.054017,106.796844,30.304625,135.920366,0.05094,1.0
Government_job,36516.936073,50.879756,0.111111,0.054795,107.779772,30.522063,137.047504,0.050228,1.0
Self-employed,35551.288156,60.201465,0.175824,0.098901,112.645446,30.211871,141.234212,0.079365,1.0


In [177]:
# mean of the numeric columns for each 'work_type', sorted by mean 'age' in descending order
df.groupby('work_type').mean(numeric_only=True).sort_values('age', ascending=False)

Unnamed: 0_level_0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,Total,stroke,Count
work_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Self-employed,35551.288156,60.201465,0.175824,0.098901,112.645446,30.211871,141.234212,0.079365,1.0
Government_job,36516.936073,50.879756,0.111111,0.054795,107.779772,30.522063,137.047504,0.050228,1.0
Private,36951.227009,45.503932,0.096068,0.054017,106.796844,30.304625,135.920366,0.05094,1.0
Never_worked,38274.409091,16.181818,0.0,0.0,96.042727,25.545455,121.588182,0.0,1.0
children,35769.432314,6.841339,0.0,0.001456,94.400277,20.038003,113.971601,0.002911,1.0
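
The mean of `id` or `Count` carries little meaning, so it can help to select the columns of interest before aggregating. A minimal sketch on a hypothetical sample, which also avoids any dependence on how text columns are handled:

```python
import pandas as pd

# Hypothetical sample with one text column besides the grouping key
sample = pd.DataFrame({
    'work_type': ['Private', 'Private', 'children', 'Self-employed'],
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [40.0, 50.0, 6.0, 60.0],
    'bmi': [30.0, 28.0, 20.0, None],
})

# Select the numeric columns explicitly, then aggregate and sort
means = (
    sample.groupby('work_type')[['age', 'bmi']]
    .mean()
    .sort_values('age', ascending=False)
)

print(means)
```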


In [178]:
# sum of the numeric columns for each 'work_type'
df.groupby('work_type').sum(numeric_only=True)

Unnamed: 0_level_0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,Total,stroke,Count
work_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Government_job,23991627,33428.0,73,36,70811.31,19228.9,90040.21,33,657
Never_worked,842037,356.0,0,0,2112.94,562.0,2674.94,0,22
Private,108082339,133099.0,281,158,312380.77,85186.3,397567.07,149,2925
Self-employed,29116505,49305.0,144,81,92256.62,23414.2,115670.82,65,819
children,24573600,4700.0,0,1,64852.99,13445.5,78298.49,2,687


In [179]:
# count of non-missing values in each column for each 'work_type'
df.groupby(['work_type']).count()

Unnamed: 0_level_0,id,gender,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,Total,smoking_status,stroke,Comment,Notes,Count
work_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Government_job,657,657,657,657,657,657,657,657,630,657,657,657,657,657,657
Never_worked,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22
Private,2925,2925,2925,2925,2925,2925,2925,2925,2811,2925,2925,2925,2925,2925,2925
Self-employed,819,819,819,819,819,819,819,819,775,819,819,819,819,819,819
children,687,687,687,687,687,687,687,687,671,687,687,687,687,687,687


In [180]:
# number of rows for each combination of 'gender' and 'work_type', counted through the 'Count' column
df.groupby(['gender', 'work_type'])['Count'].count()

gender  work_type     
Female  Government_job     399
        Never_worked        11
        Private           1754
        Self-employed      504
        children           326
Male    Government_job     258
        Never_worked        11
        Private           1170
        Self-employed      315
        children           361
Other   Private              1
Name: Count, dtype: int64
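
Counting through a helper column works because `Count` has no missing values. Without such a column, `size()` counts rows per group directly, whereas `count()` on a column skips its missing values (this is why `bmi` shows lower counts in the table above). A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample; one bmi value is missing
sample = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male'],
    'work_type': ['Private', 'Private', 'Private', 'children'],
    'bmi': [24.0, None, 30.1, 18.6],
})

grouped = sample.groupby(['gender', 'work_type'])

# size() counts rows, including rows with missing values
rows_per_group = grouped.size()

# count() counts only the non-missing values of the selected column
bmi_per_group = grouped['bmi'].count()

print(rows_per_group)
print(bmi_per_group)
```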

In [181]:
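# reading a large file in pieces with the 'chunksize' parameter of read_csv
# note: the loop variable is named df, so after the loop df holds only the last chunk
for df in pd.read_csv('healthcare-data.csv', chunksize=50):
    print('Chunk')
    print(df)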

Chunk
       id  gender  age  hypertension  heart_disease ever_married  \
0    9046    Male   67             0              1          Yes   
1   51676  Female   61             0              0          Yes   
2   31112    Male   80             0              1          Yes   
3   60182  Female   49             0              0          Yes   
4    1665  Female   79             1              0          Yes   
5   56669    Male   81             0              0          Yes   
6   53882    Male   74             1              1          Yes   
7   10434  Female   69             0              0           No   
8   27419  Female   59             0              0          Yes   
9   60491  Female   78             0              0          Yes   
10  12109  Female   81             1              0          Yes   
11  12095  Female   61             0              1          Yes   
12  12175  Female   54             0              0          Yes   
13   8213    Male   78             0      

Chunk
        id  gender  age  hypertension  heart_disease ever_married  \
450   8233    Male   72             0              1          Yes   
451  46436    Male   13             0              0           No   
452  23221    Male   29             0              0           No   
453  31830    Male   59             0              0          Yes   
454  15296  Female   42             0              0          Yes   
455   7351    Male   13             0              0           No   
456  66196    Male   75             0              1           No   
457  17718  Female   33             1              0          Yes   
458  31164  Female   43             0              0          Yes   
459  48272  Female   11             0              0           No   
460   2893  Female    7             0              0           No   
461  34376  Female   16             0              0           No   
462  18498  Female   44             0              0           No   
463  56735  Female   78     

Chunk
        id  gender  age  hypertension  heart_disease ever_married  \
800  51125  Female   66             0              0          Yes   
801  29077  Female   77             0              0          Yes   
802   4970    Male   79             0              0          Yes   
803  58291  Female   52             0              0          Yes   
804  18616  Female   41             0              0          Yes   
805     99  Female   31             0              0           No   
806  55529    Male   39             0              0          Yes   
807  12204  Female   51             0              0           No   
808  21397  Female   40             0              0          Yes   
809  64633  Female   48             0              0          Yes   
810  23016    Male   55             0              0          Yes   
811  18412    Male   41             0              0          Yes   
812  67412  Female   39             0              0          Yes   
813  37545    Male   41     

1249       0  
Chunk
         id  gender  age  hypertension  heart_disease ever_married  \
1250  70031  Female   71             1              0          Yes   
1251  23604    Male    4             0              0           No   
1252  46576    Male    2             0              0           No   
1253  31293    Male   11             0              0           No   
1254  70610  Female   45             0              0          Yes   
1255   6044    Male   22             0              0           No   
1256  62284    Male   63             0              0          Yes   
1257   5821  Female   50             0              0          Yes   
1258  22295  Female   25             0              0           No   
1259  27583    Male   49             0              0          Yes   
1260   9696    Male   39             0              0          Yes   
1261   1164  Female   43             0              0           No   
1262  48781    Male   67             0              0          Yes   

Chunk
         id  gender  age  hypertension  heart_disease ever_married  \
1700   5686    Male   35             0              0          Yes   
1701   4789    Male    8             0              0           No   
1702    897    Male    3             0              0           No   
1703  69553  Female   29             0              0          Yes   
1704  58438    Male   36             0              0           No   
1705  29104  Female   19             0              0           No   
1706  26862  Female   41             0              0          Yes   
1707  38036  Female   23             0              0           No   
1708  36666    Male   14             0              0           No   
1709  16316    Male   35             0              0          Yes   
1710  61365    Male   45             0              0          Yes   
1711  12512  Female   52             1              0          Yes   
1712  31835    Male   19             0              0           No   
1713   4099  F

Chunk
         id  gender  age  hypertension  heart_disease ever_married  \
2200  17295  Female   31             0              0          Yes   
2201  55466  Female   69             0              1          Yes   
2202  65419    Male   73             0              1          Yes   
2203  34448  Female   56             0              0          Yes   
2204  14406  Female   80             0              1          Yes   
2205    924  Female   60             0              0          Yes   
2206  71339  Female   40             0              0          Yes   
2207  31443  Female   30             0              0          Yes   
2208  49672  Female   66             0              0          Yes   
2209    394    Male   78             1              0          Yes   
2210  63362  Female   37             0              0          Yes   
2211  59928  Female   41             0              0          Yes   
2212  62289  Female   34             0              0          Yes   
2213  59464  F

Chunk
         id  gender    age  hypertension  heart_disease ever_married  \
2700  54756  Female  59.00             0              0          Yes   
2701  19590    Male  48.00             0              0          Yes   
2702  23332  Female  42.00             0              0          Yes   
2703  16971  Female  26.00             0              0           No   
2704  11727    Male  39.00             0              0          Yes   
2705  60255  Female  34.00             0              0           No   
2706  38796  Female  54.00             0              0          Yes   
2707  46498  Female  57.00             0              0          Yes   
2708  41042  Female   1.56             0              0           No   
2709  35069  Female  50.00             1              1           No   
2710  61103  Female  64.00             1              0          Yes   
2711  25095    Male  44.00             0              0          Yes   
2712  55607    Male  38.00             0              0   

Chunk
         id  gender    age  hypertension  heart_disease ever_married  \
3250  36958  Female  32.00             0              0          Yes   
3251  14877    Male   0.56             0              0           No   
3252  65988  Female  26.00             0              0           No   
3253  50001  Female  34.00             0              0          Yes   
3254  27034  Female  65.00             0              0          Yes   
3255   8950  Female  15.00             0              0           No   
3256  31850  Female  17.00             0              0           No   
3257  14288  Female  71.00             0              0          Yes   
3258   3180  Female  42.00             0              0          Yes   
3259  13899    Male  30.00             0              0          Yes   
3260  23730  Female  75.00             0              0          Yes   
3261   6011    Male   9.00             0              0           No   
3262  14376    Male  47.00             0              0   

Chunk
         id  gender    age  hypertension  heart_disease ever_married  \
3700  56075  Female  58.00             0              0          Yes   
3701  46130  Female  57.00             0              0          Yes   
3702   7730    Male  31.00             0              0           No   
3703  12380    Male  43.00             0              0          Yes   
3704  15324  Female  40.00             0              0           No   
3705  11658    Male   1.08             0              0           No   
3706  22778    Male  34.00             0              0          Yes   
3707   4128  Female  55.00             0              0          Yes   
3708  36825  Female  39.00             0              0          Yes   
3709   1454  Female  42.00             0              0           No   
3710  12674    Male  44.00             0              0          Yes   
3711  55375    Male  69.00             1              0          Yes   
3712   3726    Male  16.00             0              0   

4149       0  
Chunk
         id  gender  age  hypertension  heart_disease ever_married  \
4150  47456    Male   30             0              0          Yes   
4151  56139    Male    8             0              0           No   
4152  12857    Male   55             0              0          Yes   
4153  40980    Male   79             1              0          Yes   
4154  47668  Female   49             0              0          Yes   
4155  72792  Female   53             1              0          Yes   
4156  37728  Female   26             0              0          Yes   
4157  47410  Female   14             0              0           No   
4158  56450    Male   25             0              0           No   
4159   9189  Female   20             0              0           No   
4160  71966  Female   18             0              0           No   
4161  59272    Male   38             0              0          Yes   
4162  45563  Female   72             0              1          Yes   

4549       0  
Chunk
         id  gender   age  hypertension  heart_disease ever_married  \
4550  65351    Male  11.0             0              0           No   
4551  61830    Male  51.0             0              0          Yes   
4552  71777    Male  74.0             1              1          Yes   
4553  69059  Female  42.0             0              0          Yes   
4554  11908  Female  69.0             0              0          Yes   
4555  24955  Female  22.0             0              0           No   
4556  61477  Female  25.0             0              0           No   
4557    724    Male  17.0             0              0           No   
4558  22614    Male  64.0             0              0           No   
4559  61997  Female  50.0             0              0          Yes   
4560   6605    Male  52.0             1              0          Yes   
4561  46987  Female  65.0             0              1          Yes   
4562  70428  Female  37.0             0              0  

4949       0  
Chunk
         id  gender  age  hypertension  heart_disease ever_married  \
4950  66650  Female   17             0              0           No   
4951  59945  Female   23             0              0           No   
4952  16245    Male   51             1              0          Yes   
4953  68094  Female   46             0              0          Yes   
4954  64661  Female   81             0              0           No   
4955  61376    Male   38             0              0          Yes   
4956  47236  Female   50             0              0          Yes   
4957    875  Female   34             0              0           No   
4958  63986    Male   60             0              0          Yes   
4959  55410  Female   50             0              0          Yes   
4960  63287  Female   49             0              0          Yes   
4961   3720  Female    2             0              0           No   
4962  20274    Male   47             0              0          Yes   
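
In practice, chunks are usually reduced to a running result rather than printed. The sketch below accumulates two totals across chunks, using a separate loop variable so that an existing `df` is not overwritten. The in-memory CSV text and its values are hypothetical stand-ins for 'healthcare-data.csv'.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'healthcare-data.csv'
csv_text = "id,age,stroke\n1,67,1\n2,61,1\n3,80,0\n4,49,0\n5,79,1\n"

total_rows = 0
total_strokes = 0

# Each iteration yields a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    total_strokes += int(chunk['stroke'].sum())

print(total_rows, total_strokes)
```

Only one chunk is held in memory at a time, which is the point of `chunksize` for files that do not fit in memory as a whole.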

In [182]:
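# summary statistics of 'age' with custom percentiles
# df is now the last chunk from the loop above, hence count = 10
print(df['age'].describe(percentiles=[0.25, 0.50, 0.75, 0.85, 0.90, 1]))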

count    10.000000
mean     50.600000
std      24.967534
min      13.000000
25%      37.250000
50%      48.000000
75%      74.250000
85%      80.650000
90%      81.100000
100%     82.000000
max      82.000000
Name: age, dtype: float64
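
When only the percentiles are needed, `Series.quantile` returns them without the other summary statistics. A minimal sketch on a hypothetical Series of ages, using the default linear interpolation:

```python
import pandas as pd

# Hypothetical ages, evenly spaced to keep the arithmetic easy to follow
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# quantile() accepts a single fraction or a list of fractions between 0 and 1
quartiles = ages.quantile([0.25, 0.5, 0.75])

# describe() labels the same percentiles as '25%', '50%', '75%', and so on
summary = ages.describe(percentiles=[0.25, 0.5, 0.75, 0.9])

print(quartiles)
print(summary)
```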


In [183]:
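# aggregating 'bmi' per 'smoking_status' with several functions, sorted by the maximum
# a list keeps the column order fixed, whereas a set has no defined order
temp = df.groupby('smoking_status')['bmi'].agg(['max', 'median', 'mean']).sort_values('max', ascending=True)
temp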

Unnamed: 0_level_0,max,median,mean
smoking_status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
formerly smoked,25.6,25.6,25.6
never smoked,40.0,29.45,30.15
Unknown,46.9,25.35,29.05
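
Named aggregation is another way to fix both the order and the names of the result columns. The sketch below applies it to a hypothetical sample; the output names `bmi_max`, `bmi_median` and `bmi_mean` are illustrative choices.

```python
import pandas as pd

# Hypothetical sample with two smoking categories
sample = pd.DataFrame({
    'smoking_status': ['smokes', 'smokes', 'never smoked', 'never smoked'],
    'bmi': [30.0, 20.0, 25.0, 35.0],
})

# Each keyword names an output column and the function that fills it
summary = (
    sample.groupby('smoking_status')['bmi']
    .agg(bmi_max='max', bmi_median='median', bmi_mean='mean')
    .sort_values('bmi_max')
)

print(summary)
```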


In [184]:
# selecting the first two rows and no columns (':0' is an empty column slice), which leaves only the index labels
temp1 = temp.iloc[0:2,:0]
temp1

formerly smoked
never smoked


In [185]:
# checking the column types: the selection has no columns, so the result is an empty Series
temp1.dtypes

Series([], dtype: object)

In [186]:
# converting the index labels to a nested list (one inner list per row)
temp2 = temp1.reset_index().values.tolist()
temp2

[['formerly smoked'], ['never smoked']]
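
If a flat list of labels is enough, it can be taken from the index directly. The sketch below rebuilds a small frame with the same index labels as `temp` (the frame itself is a hypothetical reconstruction) and compares both forms.

```python
import pandas as pd

# Hypothetical reconstruction of `temp` with only the 'max' column
temp = pd.DataFrame(
    {'max': [25.6, 40.0, 46.9]},
    index=pd.Index(['formerly smoked', 'never smoked', 'Unknown'], name='smoking_status'),
)

# Nested list, as produced in the cell above
nested = temp.iloc[0:2, :0].reset_index().values.tolist()

# Flat list taken straight from the index
flat = temp.index[:2].tolist()

print(nested)
print(flat)
```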

In [187]:
# converting the column headers to lowercase
df.columns = df.columns.str.lower()
df.head(2)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
5100,68398,Male,82,1,0,Yes,Self-employed,Rural,71.97,28.3,never smoked,0
5101,36901,Female,45,0,0,Yes,Private,Urban,97.95,24.5,Unknown,0
