# **Data Collection Notebook**

## Objectives

* Fetch dataset from Kaggle.
* Run a preliminary data exploration.
* Save dataset in the appropriate format.

## Inputs

* Synthetic hospital records by MIT.

## Outputs

* Generate Dataset: outputs/datasets/collection/patientReadmission.csv



---

# Set working directory

In [1]:
import os
os.chdir(os.path.dirname(os.getcwd()))
current_dir = os.getcwd()
current_dir

'/workspaces/patient-readmission-predictor'
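
Note that the cell above moves one level up every time it runs, so running it twice leaves the notebook outside the project folder. A minimal sketch of a re-runnable alternative follows. The helper name `find_project_root` and the use of `requirements.txt` as the marker file are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the closest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")


# Possible usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(os.getcwd()))
```

Because the search starts at the current directory, calling it again from the project root returns the same folder instead of climbing further.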

# Install necessary packages

All packages needed to fetch the data are listed in the requirements.txt file and are installed from it below.

In [2]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Fetch data from Kaggle

Configure access token

In [4]:

import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
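
The `chmod 600` command restricts the token file to read and write access for its owner only. As a hedged sketch, the same step can be done in plain Python, which also lets the cell fail early with a clear message when the token is missing. The helper name `secure_token` is hypothetical, and the sketch assumes a POSIX file system.

```python
import os
import stat


def secure_token(token_path="kaggle.json"):
    """Restrict the token file to owner read/write (the equivalent of chmod 600).

    Hypothetical helper; raises early if the token file is missing.
    Returns the resulting permission bits.
    """
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)  # same bits as 0o600
    return stat.S_IMODE(os.stat(token_path).st_mode)
```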

The data is downloaded from www.kaggle.com.

Define folders


In [5]:
KaggleDatasetPath = "prasad22234/hospital-data-for-patient-readmission-prediction"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading hospital-data-for-patient-readmission-prediction.zip to inputs/datasets/raw
100%|████████████████████████████████████████| 537k/537k [00:00<00:00, 1.47MB/s]


Unzip the downloaded file, then delete the zip archive and the kaggle.json token.

In [6]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/hospital-data-for-patient-readmission-prediction.zip
  inflating: inputs/datasets/raw/Healthcare Data Analysis for readmission.csv  
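
The shell commands above depend on `unzip` and `rm` being available. A minimal sketch of a portable alternative using only the standard library is shown below. The helper name `extract_and_remove` is hypothetical, and unlike the cell above it leaves `kaggle.json` in place.

```python
import zipfile
from pathlib import Path


def extract_and_remove(folder):
    """Extract every .zip archive in `folder` into it, then delete the archive.

    Hypothetical helper mirroring the unzip/rm shell commands.
    Returns the names of the extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # reached only if the extraction did not raise
    return extracted


# Possible usage in the notebook:
# extract_and_remove("inputs/datasets/raw")
```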


---

# Load and Inspect patient admission data

In [3]:
import pandas as pd

df = pd.read_csv("inputs/datasets/raw/Healthcare Data Analysis for readmission.csv")
df.tail(10)

Unnamed: 0,hospital_name,Admission_date,hospital_id,hospital_beds_available,occupied_beds,hospital_ward,patient_id,patient_gender,patient_age,patient_race,...,doctor_id,doctor_name,doctor_specialty,patient_assigned_doctor,patient_checkin_date,patient_checkout_date,patient_disease,patient_length_of_stay,discharge_status,readmission
9990,The Johns Hopkins Hospital,02/01/2024,9232,380,180,ICU,8808,Female,22,White,...,9233,Tammy Nunez,Neurology,True,07/07/2024,18-07-2024,Influenza (Flu),5,Transferred,0
9991,The Johns Hopkins Hospital,05/02/2023,3417,130,150,Maternity,5997,Female,16,Hispanic,...,1906,Julie Weiss,Oncology,True,10/07/2024,07/07/2024,Irritable Bowel Syndrome (IBS),7,Deceased,1
9992,The Johns Hopkins Hospital,05/01/2022,4453,360,310,Pediatrics,3816,Male,14,Asian,...,2577,David Little,Oncology,True,30-06-2024,02/07/2024,Stroke,13,Transferred,1
9993,The Johns Hopkins Hospital,15-02-2022,2544,130,70,Maternity,5109,Male,63,Black,...,6270,Donna Freeman,Hepatology or Gastroenterology,False,01/07/2024,03/07/2024,Migraine,20,Deceased,1
9994,The Johns Hopkins Hospital,21-08-2021,5155,410,290,ICU,9207,Male,84,Hispanic,...,7795,Edward Cook,Pulmonology,True,13-07-2024,28-06-2024,Cancer,14,Deceased,1
9995,The Johns Hopkins Hospital,25-11-2023,5218,300,390,Maternity,1869,Male,77,Hispanic,...,6974,Jonathan Munoz,Gastroenterology,True,21-07-2024,24-06-2024,Epilepsy,20,Transferred,0
9996,The Johns Hopkins Hospital,28-03-2023,2265,100,380,ICU,2847,Male,75,Black,...,2135,Brandon Torres,Neurology,True,22-07-2024,24-06-2024,Influenza (Flu),29,Transferred,1
9997,The Johns Hopkins Hospital,13-02-2021,7011,330,230,ER,8324,Male,57,Other,...,1827,Richard Turner,Hepatology or Gastroenterology,False,01/07/2024,15-07-2024,Cancer,11,Recovered,0
9998,The Johns Hopkins Hospital,04/01/2022,2471,230,400,Maternity,6911,Female,63,Other,...,3097,Jonathan Thompson,Cardiology or Internal Medicine,False,06/07/2024,06/07/2024,COVID-19,24,Recovered,0
9999,The Johns Hopkins Hospital,31-05-2023,4143,100,100,ER,3081,Female,3,White,...,3577,Louis Luna MD,Rheumatology,False,21-07-2024,17-07-2024,Celiac Disease,23,Deceased,0


DataFrame Summary

Check column names, data types and non-null counts

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   hospital_name            10000 non-null  object
 1   Admission_date           10000 non-null  object
 2   hospital_id              10000 non-null  int64 
 3   hospital_beds_available  10000 non-null  int64 
 4   occupied_beds            10000 non-null  int64 
 5   hospital_ward            10000 non-null  object
 6   patient_id               10000 non-null  int64 
 7   patient_gender           10000 non-null  object
 8   patient_age              10000 non-null  int64 
 9   patient_race             10000 non-null  object
 10  patient_sat_score        10000 non-null  int64 
 11  patient_first_initial    10000 non-null  object
 12  patient_last_name        10000 non-null  object
 13  patient_waittime         10000 non-null  int64 
 14  department_referral      10000 non-null

Check the unique values of the target variable

In [5]:
df['readmission'].unique()

array([1, 0])

### Remove sensitive personal information

In [9]:
sensitive_cols = [
    'patient_first_initial',
    'patient_last_name',
    'doctor_name',
]

df = df.drop(columns=sensitive_cols)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   hospital_name            10000 non-null  object
 1   Admission_date           10000 non-null  object
 2   hospital_id              10000 non-null  int64 
 3   hospital_beds_available  10000 non-null  int64 
 4   occupied_beds            10000 non-null  int64 
 5   hospital_ward            10000 non-null  object
 6   patient_id               10000 non-null  int64 
 7   patient_gender           10000 non-null  object
 8   patient_age              10000 non-null  int64 
 9   patient_race             10000 non-null  object
 10  patient_sat_score        10000 non-null  int64 
 11  patient_waittime         10000 non-null  int64 
 12  department_referral      10000 non-null  object
 13  time_slot                10000 non-null  object
 14  doctor_id                10000 non-null
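
Once the columns are gone, running the drop cell a second time raises a `KeyError`. As a hedged sketch, passing `errors="ignore"` to `DataFrame.drop` makes the step safe to repeat. The helper name `drop_sensitive` and the sample values below are made up for illustration and do not come from the dataset.

```python
import pandas as pd


def drop_sensitive(frame, columns):
    """Return a copy of `frame` without `columns`; absent columns are skipped."""
    return frame.drop(columns=columns, errors="ignore")


sample = pd.DataFrame(
    {
        "patient_id": [1, 2],
        "patient_last_name": ["A", "B"],  # made-up values for illustration only
        "readmission": [0, 1],
    }
)
sensitive_cols = ["patient_first_initial", "patient_last_name", "doctor_name"]

cleaned = drop_sensitive(sample, sensitive_cols)
cleaned = drop_sensitive(cleaned, sensitive_cols)  # a second call is a no-op
print(list(cleaned.columns))  # ['patient_id', 'readmission']
```

The trade-off is that a misspelled column name is silently skipped, so it is worth confirming the result with `df.info()` as the notebook already does.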

# Save file 

Create the destination folder outputs/datasets/collection and save the dataset there.

In [10]:
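import os

output_path = "outputs/datasets/collection"
try:
    # Create the destination folder; an existing folder is reported, not fatal
    os.makedirs(name=output_path)
except Exception as e:
    print(e)

# Save the cleaned dataset without the DataFrame index
df.to_csv(f"{output_path}/patientReadmission.csv", index=False)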

[Errno 17] File exists: 'outputs/datasets/collection'
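
The `[Errno 17] File exists` message only means the folder was created on an earlier run; the file is still saved. A minimal sketch that avoids the message by passing `exist_ok=True` to `os.makedirs` follows. The helper name `ensure_folder` is hypothetical.

```python
import os


def ensure_folder(path):
    """Create `path` (and any missing parents) without failing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Possible usage in the notebook:
# folder = ensure_folder("outputs/datasets/collection")
# df.to_csv(os.path.join(folder, "patientReadmission.csv"), index=False)
```

Dropping the broad `except Exception` also means that genuine failures, such as a permission error, surface as exceptions instead of being printed and ignored.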


---

# Conclusions and Next Steps

* The raw data has been downloaded from Kaggle.
* Preliminary checks were run and columns with sensitive personal information were removed.
* The output file was saved to 'outputs/datasets/collection'.
* The next notebook will tackle data exploration.