# `Doctors Dataset`

## Import
Import **numpy** and **pandas**.

[**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/index.html) is a software library for Python which provides data structures and data analysis tools.

In [1]:
import numpy as np
import pandas as pd

## Reading the Doctors Dataset

In [2]:
doctors_df = pd.read_csv("doctors.csv", encoding='ISO-8859-1')

## Doctors Dataset Information

In [3]:
doctors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60024 entries, 0 to 60023
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   doctorid       60024 non-null  object 
 1   mainspecialty  27175 non-null  object 
 2   age            20028 non-null  float64
dtypes: float64(1), object(2)
memory usage: 1.4+ MB


## Creating a Copy of the Dataset

In [4]:
doctors_copy_df = doctors_df.copy()

In [5]:
doctors_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60024 entries, 0 to 60023
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   doctorid       60024 non-null  object 
 1   mainspecialty  27175 non-null  object 
 2   age            20028 non-null  float64
dtypes: float64(1), object(2)
memory usage: 1.4+ MB


## Checking if the Doctor ID Column Has Duplicates

In [6]:
doctor_id_has_duplicates = doctors_copy_df['doctorid'].duplicated().any()

print('Does the doctorid column have duplicates: ' + str(doctor_id_has_duplicates))

Does the doctorid column have duplicates: False
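
As a small illustration of how this check behaves, `duplicated()` marks every repeat of a value after its first occurrence, so `.any()` becomes `True` as soon as one ID repeats. The `toy_ids` Series below is hypothetical and not taken from `doctors.csv`:

```python
import pandas as pd

# Hypothetical IDs for illustration only; "A1" appears twice.
toy_ids = pd.Series(["A1", "B2", "A1", "C3"])

# duplicated() flags each repeat after the first occurrence of a value.
dup_mask = toy_ids.duplicated()
has_duplicates = bool(dup_mask.any())

print(dup_mask.tolist())  # [False, False, True, False]
print(has_duplicates)     # True
```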


### `mainspecialty` column

In [7]:
doctors_copy_df['mainspecialty'].nunique()

3974

In [8]:
print(doctors_copy_df['mainspecialty'].unique())

['General Medicine' 'Family Medicine' 'Vascular Medicine' ...
 'Bone alignment therapy' 'Healot Biomekaniks' 'Allergist']


In [9]:
# Count occurrences of each value in 'mainspecialty'
specialty_counts = doctors_copy_df['mainspecialty'].value_counts()

# Keep only the values that occur exactly once (i.e., have no duplicates)
unique_specialties = specialty_counts[specialty_counts == 1].index.tolist()

print("Values in mainspecialty column without duplicates:", unique_specialties)


Values in mainspecialty column without duplicates: ['ORTHODONTIST/Veneers', 'General Medicine/Occupational Health Physcian', 'Adult Psychiatry and Behavioral Medicine', 'Faye Garciano', 'Child psychiatrist ', 'INTERNAL MEDICINE-ADULT DISEASE SPECIALIST', 'OB-Gyn Ultrasound Subspecialist', 'General Physician, Medical Acupuncturist ', 'Family Medicine, Ophthalmology', 'IM,Surgery,Pedia', 'Psychotherapy, Psychosocial Counseling, Mental Health Consultation, Psychological Assessment', 'Orthopedics/Orthopedic Surgery', 'Need Consultant Of Doctor Before Vaccine', 'Infectiius disease', 'OTOLARYNGOLOGY HEAND NECK SURGERY', 'General Dentistry, Orthodontics, Oral Surgery, ', 'banking', 'Anesthesiology, Pain Medicine', 'Pediatric Hematology and Oncology ', 'Neuropsychological assessment and Neurodevelopmental evaluation, Applied Behavioral Analysis Intervention and Behavioural Support Programs', 'Diagnostic Imaging', 'Otolaryngology, Head and Neck Specialty (ENT-HNS)', 'Therapy', 'Adult and Pediat
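
Many of the one-off values above differ from common specialties only in letter case or stray whitespace (for example `'general practitioner'` versus `'General Practitioner'`, or the trailing space in `'Child psychiatrist '`). A minimal sketch of one possible normalization pass follows; the sample values are hypothetical, and this step is not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample echoing the kinds of variants seen in the output above.
specialties = pd.Series([
    "General Practitioner",
    "general practitioner",
    "Child psychiatrist ",
    "Child Psychiatrist",
    None,
])

# Trim surrounding whitespace, collapse internal runs of spaces, and lowercase.
normalized = (
    specialties.str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.lower()
)

print(specialties.nunique())  # 4
print(normalized.nunique())   # 2
```

This only merges spelling variants that differ in case or spacing; typos and free-text entries would still need manual review.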

## Replacing NaN/Non-Integer Values with Sentinel Values in Age Column

In [10]:
doctors_copy_df.isnull().any()

doctorid         False
mainspecialty     True
age               True
dtype: bool

## Check for NaN Values in Age Column

In [11]:
print(doctors_copy_df.loc[doctors_copy_df['age'].isnull()])

                               doctorid         mainspecialty  age
24     ED3D2C21991E3BEF5E069713AF9FA6CA                   NaN  NaN
28     EC8956637A99787BD197EACD77ACCE5E                   NaN  NaN
31     65B9EEA6E1CC6BB9F0CD2A47751A186F                   NaN  NaN
33     A97DA629B098B75C294DFFDC3E463904                   NaN  NaN
38     7F6FFAA6BB0B408017B62254211691B5                   NaN  NaN
...                                 ...                   ...  ...
60016  3DC09677E0FDB539A31D497C4FB25F20  general practitioner  NaN
60017  39D96AC1450B2D517807DC8A94B26C17         Ophthalmology  NaN
60020  4473D870B5E31FAA40D2C45E1FF6DC27                   NaN  NaN
60021  A4F554EB2C0934E7FDE2511E8C1573BA                   NaN  NaN
60022  E540A361D93D37A33BB2F55D43DA79D9  General Practitioner  NaN

[39996 rows x 3 columns]


## Convert NaN Values into Sentinel Values

In [12]:
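# The sentinel is stored as the string '9999', so 'age' holds mixed types until it is cast to int later
doctors_copy_df.loc[doctors_copy_df['age'].isnull(), 'age'] = str(9999)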
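
The cell above writes the sentinel as a string into a column that was read as `float64`. One alternative, assuming a numeric sentinel is acceptable for later steps, is to fill and cast in a single chain so the column never mixes types. The `ages` Series and the `SENTINEL_AGE` name are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks a missing value.
ages = pd.Series([41.0, np.nan, 26.0, np.nan])

SENTINEL_AGE = 9999  # same sentinel value as the notebook, kept numeric

# Fill missing ages and cast to int; the column stays numeric throughout.
ages_filled = ages.fillna(SENTINEL_AGE).astype(int)

print(ages_filled.tolist())  # [41, 9999, 26, 9999]
```

With this variant, any later comparison would use the number `9999` rather than the string `'9999'`.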

## Check Results of Conversion

In [13]:
doctors_copy_df.isnull().any()

doctorid         False
mainspecialty     True
age              False
dtype: bool

In [14]:
doctors_copy_df.loc[doctors_copy_df['age'] == str(9999)]

                               doctorid         mainspecialty   age
24     ED3D2C21991E3BEF5E069713AF9FA6CA                   NaN  9999
28     EC8956637A99787BD197EACD77ACCE5E                   NaN  9999
31     65B9EEA6E1CC6BB9F0CD2A47751A186F                   NaN  9999
33     A97DA629B098B75C294DFFDC3E463904                   NaN  9999
38     7F6FFAA6BB0B408017B62254211691B5                   NaN  9999
...                                 ...                   ...   ...
60016  3DC09677E0FDB539A31D497C4FB25F20  general practitioner  9999
60017  39D96AC1450B2D517807DC8A94B26C17         Ophthalmology  9999
60020  4473D870B5E31FAA40D2C45E1FF6DC27                   NaN  9999
60021  A4F554EB2C0934E7FDE2511E8C1573BA                   NaN  9999


## Checking for Negative/Non-Integer Values in Age Column

In [15]:
# Flag ages whose string form is not all digits (floats such as 41.0 are flagged because of the decimal point)
doctors_non_integer_rows = doctors_copy_df[~doctors_copy_df['age'].astype(str).str.isdigit()]

print("Rows with non-integer values in the 'age' column:")
print(doctors_non_integer_rows)

Rows with non-integer values in the 'age' column:
                               doctorid         mainspecialty   age
0      AD61AB143223EFBC24C7D2583BE69251      General Medicine  41.0
1      D09BF41544A3365A46C9077EBB5E35C3       Family Medicine  43.0
2      FBD7939D674997CDB4692D34DE8633C4     Vascular Medicine  26.0
3      28DD2C7955CE926456240B2FF0100BDE     Otolaryngologists  34.0
4      35F4A8D465E6E1EDC05F3D8AB658C551     General Dentistry  50.0
...                                 ...                   ...   ...
60012  D1EEFDEF38B0DC9967B2482ED5676157    General Pediatrics  37.0
60014  FD1D25ED94DA6BA29824C667A7093312  General Practitioner  26.0
60018  178B515E6825D004EBAF232DD6977CCD      General Medicine  29.0
60019  CD532DBEF6547A66D2138FAB49AA3B94  General Practitioner  33.0
60023  23BA85862DD19C3550E7C0F0AF84C7ED     Internal Medicine  38.0

[20028 rows x 3 columns]
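
The 20,028 rows flagged above match the non-null count of `age` reported by `info()`: each was flagged because its string form (such as `'41.0'`) contains a decimal point, not because the age is fractional or negative. A sketch of a stricter check is shown below, assuming the goal is to find ages that are missing, negative, or not whole numbers. The sample values are hypothetical:

```python
import pandas as pd

# Hypothetical mixed column: a float, a string sentinel, a negative, a fraction, junk.
ages = pd.Series([41.0, "9999", -3, 27.5, "abc"], dtype=object)

# Coerce to numbers; anything unparseable becomes NaN.
numeric = pd.to_numeric(ages, errors="coerce")

# Valid ages are present, non-negative, and have no fractional part.
is_valid = numeric.notna() & (numeric >= 0) & (numeric % 1 == 0)

print(is_valid.tolist())  # [True, True, False, False, False]
```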


## Converting Data Type of Age Column into Integer

In [16]:
doctors_copy_df['age'] = doctors_copy_df['age'].astype(int)

## Updating the Original DataFrame and CSV with the Changes Made in the Copy

In [17]:
doctors_df.update(doctors_copy_df)
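
For reference, `DataFrame.update` modifies the calling frame in place, aligns on index and column labels, and only overwrites with non-NA values from the other frame. The toy frames below are hypothetical and only sketch that behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one missing age and one missing specialty.
original = pd.DataFrame({
    "age": [41.0, np.nan],
    "mainspecialty": ["General Medicine", None],
})

cleaned = original.copy()
cleaned["age"] = cleaned["age"].fillna(9999)

# In-place update: the filled age is copied over, the missing specialty stays missing.
original.update(cleaned)

print(original["age"].tolist())                   # [41.0, 9999.0]
print(original["mainspecialty"].isna().tolist())  # [False, True]
```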

In [18]:
doctors_df.to_csv('doctors.csv', index=False) 

# `Patients Dataset`

## Reading the Patients Dataset

In [19]:
patients_df = pd.read_csv("px.csv", dtype={"pxid": "string", "age": "string"})

In [20]:
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6507813 entries, 0 to 6507812
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    string
 1   age     string
 2   gender  object
dtypes: object(1), string(2)
memory usage: 149.0+ MB


## Creating a Copy of the Dataset

In [21]:
patients_copy_df = patients_df.copy()

In [22]:
patients_copy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6507813 entries, 0 to 6507812
Data columns (total 3 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   pxid    string
 1   age     string
 2   gender  object
dtypes: object(1), string(2)
memory usage: 149.0+ MB


## Checking for Null Values in the Columns

In [23]:
patients_copy_df.isnull().any()

pxid      False
age       False
gender    False
dtype: bool

## Adding Sentinel Values to Null Age Values

In [24]:
print(patients_copy_df.loc[patients_copy_df['age'].isnull()])

Empty DataFrame
Columns: [pxid, age, gender]
Index: []


In [25]:
patients_copy_df.loc[patients_copy_df['age'].isnull(), 'age'] = str(9999)

In [26]:
# Confirm that no null values remain in any column
patients_copy_df.isnull().any()

pxid      False
age       False
gender    False
dtype: bool

## Check for Negative Values in Age Column

In [27]:
# Flag ages that are not all digits; this also catches negative values, since '-' is not a digit
non_integer_rows = patients_copy_df[~patients_copy_df['age'].astype(str).str.isdigit()]

print("Rows with non-integer values in the 'age' column:")
print(non_integer_rows)

Rows with non-integer values in the 'age' column:
Empty DataFrame
Columns: [pxid, age, gender]
Index: []


## Convert Negative Values to Sentinel Values

Since `str.isdigit()` returns `False` for any string that contains a minus sign, negative ages are flagged as non-integers. We replace them with the sentinel value.
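
A quick sketch of that behaviour, using hypothetical age strings:

```python
import pandas as pd

# Hypothetical age strings for illustration.
ages = pd.Series(["34", "-5", "41.0"], dtype="string")

# isdigit() is True only when every character in the string is a digit.
flags = ages.str.isdigit()

print(flags.tolist())  # [True, False, False]
```

The minus sign in `"-5"` and the decimal point in `"41.0"` both cause the check to fail.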

In [28]:
# Restrict the assignment to the 'age' column so that pxid and gender are left untouched
patients_copy_df.loc[~patients_copy_df['age'].astype(str).str.isdigit(), 'age'] = str(9999)
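
When assigning through `.loc`, a row mask on its own selects every column of the matching rows, while a row mask plus a column label limits the write to that column. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical patient rows; the second has a negative age string.
toy = pd.DataFrame({
    "pxid": ["P1", "P2"],
    "age": ["30", "-4"],
    "gender": ["F", "M"],
})

bad_age = ~toy["age"].str.isdigit()

# Row mask plus column label: only 'age' changes in the flagged rows.
toy.loc[bad_age, "age"] = "9999"

print(toy.loc[1].tolist())  # ['P2', '9999', 'M']
```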

## Check if There are Still Negative Values

In [29]:
print(patients_copy_df[~patients_copy_df['age'].astype(str).str.isdigit()])


Empty DataFrame
Columns: [pxid, age, gender]
Index: []


## Converting the Age Column Back into Int

In [30]:
patients_copy_df['age'] = patients_copy_df['age'].astype(int)

## Updating the Original DataFrame and CSV with the Changes Made in the Copy

In [31]:
patients_df.update(patients_copy_df)

In [32]:
patients_df.to_csv('px.csv', index=False) 