# `Clinics Dataset`

## Import
Import **numpy** and **pandas**.

[**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/index.html) is a Python library that provides data structures and data analysis tools.

In [1]:
import numpy as np
import pandas as pd

## Reading the Clinics and Appointments Datasets

In [2]:
clinics_df = pd.read_csv("../datasets_backup/clinics.csv", encoding='ISO-8859-1')

In [3]:
appointments_df = pd.read_csv("../datasets_backup/appointments.csv")

## Clinics Dataset Information

In [4]:
clinics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53962 entries, 0 to 53961
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   clinicid      53962 non-null  object
 1   hospitalname  17538 non-null  object
 2   IsHospital    53962 non-null  bool  
 3   City          53962 non-null  object
 4   Province      53962 non-null  object
 5   RegionName    53962 non-null  object
dtypes: bool(1), object(5)
memory usage: 2.1+ MB


## Appointments Dataset Information

In [5]:
appointments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9752932 entries, 0 to 9752931
Data columns (total 11 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   pxid        object
 1   clinicid    object
 2   doctorid    object
 3   apptid      object
 4   status      object
 5   TimeQueued  object
 6   QueueDate   object
 7   StartTime   object
 8   EndTime     object
 9   type        object
 10  Virtual     object
dtypes: object(11)
memory usage: 818.5+ MB
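
The appointments frame above holds 11 `object` columns and takes over 800 MB, yet only `clinicid` is used by the merge later in this notebook. One option, sketched below under that assumption, is to load just that column with `usecols`. The in-memory CSV is a made-up stand-in for `appointments.csv`, and the name `appointments_ids_df` is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../datasets_backup/appointments.csv (made-up rows).
sample_csv = io.StringIO(
    "pxid,clinicid,doctorid,apptid,status\n"
    "P1,C1,D1,A1,Complete\n"
    "P2,C1,D2,A2,Queued\n"
    "P3,C2,D1,A3,Complete\n"
)

# usecols restricts parsing to the listed columns; the others are never loaded.
appointments_ids_df = pd.read_csv(sample_csv, usecols=["clinicid"])

print(appointments_ids_df.columns.tolist())
print(len(appointments_ids_df))
```

With the real file, the same call would take the path used in cell 3 in place of `sample_csv`.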


## Creating a Copy of the Datasets

In [6]:
clinics_copy_df = clinics_df.copy()

In [7]:
clinics_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53962 entries, 0 to 53961
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   clinicid      53962 non-null  object
 1   hospitalname  17538 non-null  object
 2   IsHospital    53962 non-null  bool  
 3   City          53962 non-null  object
 4   Province      53962 non-null  object
 5   RegionName    53962 non-null  object
dtypes: bool(1), object(5)
memory usage: 2.1+ MB


In [8]:
appointments_copy_df = appointments_df.copy()

In [9]:
appointments_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9752932 entries, 0 to 9752931
Data columns (total 11 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   pxid        object
 1   clinicid    object
 2   doctorid    object
 3   apptid      object
 4   status      object
 5   TimeQueued  object
 6   QueueDate   object
 7   StartTime   object
 8   EndTime     object
 9   type        object
 10  Virtual     object
dtypes: object(11)
memory usage: 818.5+ MB


## Merging the Clinics and Appointments Datasets

In [10]:
# TODO: remove this cell; copying the copy onto itself has no effect
clinics_copy_df = clinics_copy_df.copy()

In [11]:
clinics_copy_df = clinics_copy_df.merge(appointments_copy_df, on='clinicid')

## Check Results of Merge

In [12]:
clinics_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9751401 entries, 0 to 9751400
Data columns (total 16 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   clinicid      object
 1   hospitalname  object
 2   IsHospital    bool  
 3   City          object
 4   Province      object
 5   RegionName    object
 6   pxid          object
 7   doctorid      object
 8   apptid        object
 9   status        object
 10  TimeQueued    object
 11  QueueDate     object
 12  StartTime     object
 13  EndTime       object
 14  type          object
 15  Virtual       object
dtypes: bool(1), object(15)
memory usage: 1.2+ GB
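
The merged frame has 9,751,401 rows, slightly fewer than the 9,752,932 rows in the appointments data. That is consistent with `merge` defaulting to an inner join, which drops rows whose `clinicid` has no match on the other side. A minimal sketch of how unmatched appointment rows could be counted, using made-up toy frames rather than the real data:

```python
import pandas as pd

# Toy stand-ins for clinics_copy_df and appointments_copy_df (made-up values).
clinics_toy = pd.DataFrame({"clinicid": ["C1", "C2", "C3"]})
appointments_toy = pd.DataFrame({"clinicid": ["C1", "C1", "C2", "C9"]})

# True where the appointment's clinicid also exists in the clinics table.
has_match = appointments_toy["clinicid"].isin(clinics_toy["clinicid"])

# Appointments that an inner join would drop.
unmatched_count = int((~has_match).sum())

# Row count of the default (inner) merge on the same toy data.
inner_rows = len(clinics_toy.merge(appointments_toy, on="clinicid"))

print(unmatched_count)
print(inner_rows)
```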


## Drop the Appointment Columns Added by the Merge

In [13]:
clinics_copy_df = clinics_copy_df.drop(columns=['pxid', 'doctorid', 'apptid', 'status', 'TimeQueued', 'QueueDate', 'StartTime', 'EndTime', 'type', 'Virtual'])

In [14]:
clinics_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9751401 entries, 0 to 9751400
Data columns (total 6 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   clinicid      object
 1   hospitalname  object
 2   IsHospital    bool  
 3   City          object
 4   Province      object
 5   RegionName    object
dtypes: bool(1), object(5)
memory usage: 455.7+ MB


## Checking if the Clinic ID Column Has Duplicates

In [15]:
clinicIdHasDuplicates = clinics_copy_df['clinicid'].duplicated().any()

print('Does the clinicid column have duplicates: ' + str(clinicIdHasDuplicates))

Does the clinicid column have duplicates: True


## Dropping Duplicates in Clinic ID Column

In [16]:
clinics_copy_df = clinics_copy_df.drop_duplicates(subset=['clinicid'], keep='first')

## Check Results of Dropping Duplicates

In [17]:
print(clinics_copy_df[clinics_copy_df['clinicid'].duplicated()])

Empty DataFrame
Columns: [clinicid, hospitalname, IsHospital, City, Province, RegionName]
Index: []


In [18]:
clinics_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25232 entries, 0 to 9751400
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   clinicid      25232 non-null  object
 1   hospitalname  9651 non-null   object
 2   IsHospital    25232 non-null  bool  
 3   City          25232 non-null  object
 4   Province      25232 non-null  object
 5   RegionName    25232 non-null  object
dtypes: bool(1), object(5)
memory usage: 1.2+ MB
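
The merge, column drop, and de-duplication above reduce the clinics table to the clinics that have at least one appointment. Assuming that is the intent, a lighter sketch with the same goal filters on `clinicid` membership directly, which avoids building the multi-million-row intermediate frame. The toy frames and the names `via_merge` and `via_isin` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the clinics and appointments data (made-up values).
clinics_toy = pd.DataFrame(
    {
        "clinicid": ["C1", "C2", "C3"],
        "City": ["Manila", "Makati", "Pasig"],
    }
)
appointments_toy = pd.DataFrame(
    {
        "clinicid": ["C2", "C1", "C1", "C9"],
        "apptid": ["A1", "A2", "A3", "A4"],
    }
)

# Route used in this notebook: inner merge, drop appointment columns, de-duplicate.
via_merge = (
    clinics_toy.merge(appointments_toy, on="clinicid")
    .drop(columns=["apptid"])
    .drop_duplicates(subset=["clinicid"], keep="first")
)

# Alternative: keep clinics whose clinicid appears in the appointments data.
via_isin = clinics_toy[
    clinics_toy["clinicid"].isin(appointments_toy["clinicid"])
].drop_duplicates(subset=["clinicid"], keep="first")

print(sorted(via_merge["clinicid"]))
print(sorted(via_isin["clinicid"]))
```

Both routes keep the same clinic IDs on the toy data; the `isin` version also keeps the original index of the clinics frame instead of the merged one.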


## Checking for Null Values in Dataset Columns

In [19]:
clinics_copy_df.isnull().any()

clinicid        False
hospitalname     True
IsHospital      False
City            False
Province        False
RegionName      False
dtype: bool
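
`isnull().any()` only reports whether a column contains at least one null. A short sketch of how per-column null counts could be obtained with `isnull().sum()`, shown on a made-up frame rather than the real data:

```python
import pandas as pd

# Made-up rows; the hospital name is illustrative only.
toy = pd.DataFrame(
    {
        "clinicid": ["C1", "C2", "C3"],
        "hospitalname": ["Example Hospital", None, None],
    }
)

# Summing the boolean mask counts the nulls in each column.
null_counts = toy.isnull().sum()

print(null_counts)
```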

## Listing the Unique Values in the `City`, `Province`, and `RegionName` Columns

In [20]:
print(clinics_copy_df['City'].unique())

['Taguig' 'Manila' 'Makati' 'San Juan' 'Quezon City' 'Santa Rosa City'
 'Batangas City' 'Iloilo City' 'Las Piñas' 'Bilar' 'Malabon' 'Caloocan'
 'Muntinlupa' 'Calape' 'Cebu City' 'Pasay' 'Cabanatuan City'
 'Antipolo City' 'Taytay' 'Pasig' 'Cainta' 'Mandaluyong' 'San Leonardo'
 'Imus City' 'Bacoor City' 'Dasmariñas City' 'Silang'
 'San Jose del Monte City' 'Marikina' 'San Mateo' 'Davao City'
 'Kidapawan City' 'Tagbilaran City' 'Mandaue City' 'Balamban'
 'Bacolod City' 'Cagayan de Oro' 'Guagua' 'Lubao' 'San Fernando City'
 'Rizal' 'Asuncion' 'Tagum City' 'Cardona' 'Morong' 'Tanay'
 'Meycauayan City' 'Santo Domingo' 'Trece Martires City' 'Angono'
 'Mabalacat City' 'Angeles City' 'Zamboanga City' 'Ozamiz City' 'Bocaue'
 'Pulilan' 'Baliuag' 'Lucena City' 'Valenzuela' 'Koronadal City'
 'Dumaguete City' 'Sibulan' 'Navotas' 'Butuan City' 'Malolos City'
 'Santiago City' 'San Pedro City' 'Cabuyao City' 'Kalilangan'
 'Baguio City' 'Urdaneta City' 'Lapu-Lapu City' 'Alicia' 'Dagupan City'
 'Legazpi 

In [21]:
print(clinics_copy_df['Province'].unique())

['Manila' 'Laguna' 'Batangas' 'Iloilo' 'Bohol' 'Cebu' 'Nueva Ecija'
 'Rizal' 'Cavite' 'Bulacan' 'Davao del Sur' 'Cotabato' 'Negros Occidental'
 'Misamis Oriental' 'Pampanga' 'Davao del Norte' 'Zamboanga del Sur'
 'Misamis Occidental' 'Quezon' 'South Cotabato' 'Negros Oriental'
 'Agusan del Norte' 'Isabela' 'Bukidnon' 'Benguet' 'Pangasinan' 'La Union'
 'Albay' 'Lanao del Norte' 'Oriental Mindoro' 'Occidental Mindoro'
 'Marinduque' 'Camarines Sur' 'Zambales' 'Angeles' 'Samar'
 'Surigao del Norte' 'Mountain Province' 'Davao Occidental' 'Ilocos Norte'
 'Aklan' 'Aurora' 'Cagayan' 'Kalinga' 'Abra' 'Davao Oriental'
 'Zamboanga Sibugay' 'Nueva Vizcaya' 'Ilocos Sur' 'Palawan' 'Bataan'
 'Leyte' 'Capiz' 'Southern Leyte' 'Tarlac' 'Sorsogon' 'Sultan Kudarat'
 'Quirino' 'Agusan del Sur' 'Camarines Norte' 'Maguindanao'
 'Dinagat Islands' 'Zamboanga del Norte' 'Masbate' 'Catanduanes' 'Biliran'
 'Antique' 'Lanao del Sur' 'Batanes' 'Compostela Valley' 'Surigao del Sur'
 'Basilan' 'Taguig' 'Eastern Samar

In [22]:
print(clinics_copy_df['RegionName'].unique())

['National Capital Region (NCR)' 'CALABARZON (IV-A)'
 'Western Visayas (VI)' 'Central Visayas (VII)' 'Central Luzon (III)'
 'Davao Region (XI)' 'SOCCSKSARGEN (Cotabato Region) (XII)'
 'Northern Mindanao (X)' 'Zamboanga Peninsula (IX)' 'Caraga (XIII)'
 'Cagayan Valley (II)' 'Cordillera Administrative Region (CAR)'
 'Ilocos Region (I)' 'Bicol Region (V)' 'MIMAROPA (IV-B)'
 'Eastern Visayas (VIII)'
 'Bangsamoro Autonomous Region in Muslim Mindanao (BARMM)']


## Check for Rows Where `hospitalname` Is Null but `IsHospital` Is True

In [23]:
invalid_rows = clinics_copy_df[clinics_copy_df['hospitalname'].isnull() & clinics_copy_df['IsHospital']]

invalid_rows

Empty DataFrame
Columns: [clinicid, hospitalname, IsHospital, City, Province, RegionName]
Index: []
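
The check above looks for rows flagged as hospitals that have no name. As an optional complement, the sketch below also looks in the reverse direction, for rows that carry a hospital name while `IsHospital` is `False`. The toy frame and the variable names are made up for illustration, and whether such rows count as invalid is an assumption about the data, not something the source states.

```python
import pandas as pd

# Made-up rows covering each combination of name presence and hospital flag.
toy = pd.DataFrame(
    {
        "clinicid": ["C1", "C2", "C3", "C4"],
        "hospitalname": ["Example Hospital", None, "Example Medical Center", None],
        "IsHospital": [True, False, False, True],
    }
)

name_missing = toy["hospitalname"].isnull()

# Flagged as a hospital but no name recorded (the check done in this notebook).
hospital_without_name = toy[name_missing & toy["IsHospital"]]

# Name recorded but not flagged as a hospital (the reverse direction).
name_without_hospital_flag = toy[~name_missing & ~toy["IsHospital"]]

print(hospital_without_name["clinicid"].tolist())
print(name_without_hospital_flag["clinicid"].tolist())
```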


## Updating the Original DataFrame and CSV with the Changes Made in the Copy

In [24]:
clinics_df = clinics_copy_df.copy()

In [25]:
clinics_df.to_csv('../cleaned_datasets/clinics_cleaned.csv', index=False) 