<a href="https://colab.research.google.com/github/taiwotman/TaiwotmanGoogleColab/blob/main/COVID_HOSPITAL_TREATMENT_Predicting_Patient's_Length_of_Stay(LOS)_using_Kaggle_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**[COVID HOSPITAL TREATMENT](https://www.kaggle.com/arashnic/covid19-hospital-treatment)**

Dataset available on Kaggle for download: [data](https://www.kaggle.com/arashnic/covid19-hospital-treatment/download)

---




**Mount content from Google Drive**


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Install Java 8**

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

**Install pyspark libraries**

In [3]:
!pip install -q findspark
!pip install pyspark




**Set JAVA_HOME and SPARK_HOME**

In [15]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/usr/local/lib/python3.7/dist-packages/pyspark"
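
The hard-coded `SPARK_HOME` above embeds a Python version (`python3.7`), so it may stop matching when the Colab runtime changes. As a minimal sketch, assuming `pyspark` was installed with `pip` into the active environment, the install directory can be looked up instead of typed by hand. The helper name `locate_package_dir` is hypothetical and not part of the original notebook.

```python
import importlib.util
import os


def locate_package_dir(name):
    """Return the directory of an importable top-level package, or None if it is not installed."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Only set SPARK_HOME when pyspark can actually be found in this environment
spark_home = locate_package_dir("pyspark")
if spark_home is not None:
    os.environ["SPARK_HOME"] = spark_home
```

If the lookup returns `None`, the environment variable is left untouched, so the explicit assignment from the cell above still applies.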


**Ensure Spark is set up and running**


In [16]:
import findspark 
findspark.find()

'/usr/local/lib/python3.7/dist-packages/pyspark'

In [18]:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local')\
.appName("Predicting LOS for High Risk Patient")\
.getOrCreate()

In [19]:
spark

**Read the file from the mounted Drive into a Spark DataFrame**

In [20]:
filepath = "/content/drive/MyDrive/Colab Notebooks/data/host_train.csv" #Change this to your data filepath

df = spark.read.option("header", "true").csv(filepath)


**Data Preparation**


In [21]:
df.printSchema()

root
 |-- case_id: string (nullable = true)
 |-- Hospital: string (nullable = true)
 |-- Hospital_type: string (nullable = true)
 |-- Hospital_city: string (nullable = true)
 |-- Hospital_region: string (nullable = true)
 |-- Available_Extra_Rooms_in_Hospital: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Ward_Type: string (nullable = true)
 |-- Ward_Facility: string (nullable = true)
 |-- Bed_Grade: string (nullable = true)
 |-- patientid: string (nullable = true)
 |-- City_Code_Patient: string (nullable = true)
 |-- Type of Admission: string (nullable = true)
 |-- Illness_Severity: string (nullable = true)
 |-- Patient_Visitors: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Admission_Deposit: string (nullable = true)
 |-- Stay_Days: string (nullable = true)



In [22]:
df.count()

318438

In [32]:
df.show(30)

+-------+--------+-------------+-------------+---------------+---------------------------------+------------+---------+-------------+---------+---------+-----------------+-----------------+----------------+----------------+-----+-----------------+---------+
|case_id|Hospital|Hospital_type|Hospital_city|Hospital_region|Available_Extra_Rooms_in_Hospital|  Department|Ward_Type|Ward_Facility|Bed_Grade|patientid|City_Code_Patient|Type of Admission|Illness_Severity|Patient_Visitors|  Age|Admission_Deposit|Stay_Days|
+-------+--------+-------------+-------------+---------------+---------------------------------+------------+---------+-------------+---------+---------+-----------------+-----------------+----------------+----------------+-----+-----------------+---------+
|      1|       8|            2|            3|              2|                                3|radiotherapy|        R|            F|      2.0|    31397|              7.0|        Emergency|         Extreme|               2|51-

**Observations on Dataframe Schema**


\begin{array}{ccccc}
Column\:Name & Critical\:Factor & Data\:Type & Transformation\:Required & Transformation\\
\hline
case\_id & No & String & No & N/A\\
Hospital & Yes & String & Yes & String\:to\:integer\\
Hospital\_type & Yes & String & Yes & String\:to\:integer\\
Hospital\_city & Yes & String & Yes & String\:to\:integer\\
Hospital\_region & Yes & String & Yes & String\:to\:integer\\
Available\_Extra\_Rooms\_in\_Hospital & Yes & String & Yes & String\:to\:integer\\
Department & Yes & String & Yes & String\:to\:index\\
Ward\_Type & Yes & String & Yes & String\:to\:index\\
Ward\_Facility & Yes & String & Yes & String\:to\:index\\
Bed\_Grade & Yes & String & Yes & String\:to\:integer\\
patientid & No & String & Yes & String\:to\:integer\\
City\_Code\_Patient & Yes & String & Yes & String\:to\:integer\\
Type\:of\:Admission & Yes & String & Yes & String\:to\:index\\
Illness\_Severity & Yes & String & Yes & String\:to\:index\\
Patient\_Visitors & Yes & String & Yes & String\:to\:index\\
Age & Yes & String & Yes & String\:to\:index\\
Admission\_Deposit & Yes & String & Yes & String\:to\:integer\\
Stay\_Days & Target\:variable\:(label) & String & Yes & String\:to\:index\\
\end{array}
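
To make the "String to index" entries in the table concrete, here is a minimal pure-Python sketch of label indexing. It assumes the intended transformation is the kind performed by Spark ML's `StringIndexer` with its default frequency-descending ordering, which the notebook does not state explicitly. The function name `fit_string_index` and the sample severity values are hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index: most frequent label gets 0, ties broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# Hypothetical sample of an Illness_Severity-like column (not taken from the dataset)
sample = ["Extreme", "Minor", "Extreme", "Moderate", "Minor", "Extreme"]
mapping = fit_string_index(sample)
indexed = [mapping[label] for label in sample]
print(mapping)
print(indexed)
```

The indices carry no numeric meaning beyond identifying a category, which is why ordered columns such as `Age` ranges may deserve an explicit ordering rather than a frequency-based one.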




**First Level Transformation**

In [31]:
## Rename columns "Type of Admission" and "patientid"
df2 = df.withColumnRenamed("Type of Admission", "Type_of_Admission")\
      .withColumnRenamed("patientid", "Patient_id")

## Convert all column names to lower case for uniformity
df3 = df2.toDF(*[c.lower() for c in df2.columns])
df3.printSchema()



root
 |-- case_id: string (nullable = true)
 |-- hospital: string (nullable = true)
 |-- hospital_type: string (nullable = true)
 |-- hospital_city: string (nullable = true)
 |-- hospital_region: string (nullable = true)
 |-- available_extra_rooms_in_hospital: string (nullable = true)
 |-- department: string (nullable = true)
 |-- ward_type: string (nullable = true)
 |-- ward_facility: string (nullable = true)
 |-- bed_grade: string (nullable = true)
 |-- patient_id: string (nullable = true)
 |-- city_code_patient: string (nullable = true)
 |-- type_of_admission: string (nullable = true)
 |-- illness_severity: string (nullable = true)
 |-- patient_visitors: string (nullable = true)
 |-- age: string (nullable = true)
 |-- admission_deposit: string (nullable = true)
 |-- stay_days: string (nullable = true)
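
The renaming step above handles the one column containing spaces by hand. As a hedged sketch, the same normalization can be written as a small reusable function that works on any list of column names. `normalize_column_name` is a hypothetical helper, and the names below are a subset of the schema printed earlier.

```python
import re


def normalize_column_name(name):
    """Replace runs of whitespace with a single underscore and lower-case the result."""
    return re.sub(r"\s+", "_", name.strip()).lower()


original = ["case_id", "Hospital_type", "Type of Admission", "Stay_Days"]
normalized = [normalize_column_name(c) for c in original]
print(normalized)
```

In the notebook this could be applied as `df.toDF(*[normalize_column_name(c) for c in df.columns])`. Note that it would not turn `patientid` into `patient_id`; that rename still has to be explicit.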



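In [None]:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
## Columns marked "String to integer" in the observations table (lower-cased names)
string_to_integer_list = ["hospital", "hospital_type", "hospital_city", "hospital_region", "available_extra_rooms_in_hospital",
                          "bed_grade", "patient_id", "city_code_patient", "admission_deposit"]
for c in string_to_integer_list:
    df3 = df3.withColumn(c, F.col(c).cast(IntegerType()))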
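
The sample row shown earlier stores `Bed_Grade` as `2.0` and `City_Code_Patient` as `7.0`, so some "String to integer" columns hold integers written with a decimal point. Python's own `int("2.0")` raises `ValueError`, and how Spark's cast treats such strings is worth checking on a few rows before relying on it. The sketch below is a plain-Python reference for the intended result, not the notebook's method; `to_int_or_none` is a hypothetical helper.

```python
def to_int_or_none(text):
    """Parse an integer-like string such as "7" or "2.0"; return None for anything else."""
    if text is None:
        return None
    try:
        number = float(text.strip())
    except ValueError:
        return None
    # is_integer() is False for NaN, infinities, and values with a fractional part
    return int(number) if number.is_integer() else None


print([to_int_or_none(v) for v in ["2.0", "7", "", "Extreme", None]])
```

Comparing a handful of casted values against a reference like this is a cheap way to confirm that no rows silently became null during the conversion.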