**About Dataset**
This dataset contains information on medical appointments in Brazil. It includes:

* Patient Demographics (age, gender, etc.).
* Appointment Details (date, neighborhood).
* Medical Conditions (e.g., diabetes, alcoholism).
* Whether the patient showed up for their appointment or not.




In [9]:
import kagglehub
import pandas as pd

In [10]:
path = kagglehub.dataset_download("joniarroba/noshowappointments")

print(" Path to dataset files:", path)

 Path to dataset files: /root/.cache/kagglehub/datasets/joniarroba/noshowappointments/versions/5





In [3]:
# Load the dataset from the folder returned by kagglehub
df = pd.read_csv(f"{path}/KaggleV2-May-2016.csv")
df.head()

index,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [11]:
# Check for Missing values
df.isnull().sum()

column,null_count
patientid,0
appointmentid,0
gender,0
scheduledday,0
appointmentday,0
age,0
neighbourhood,0
scholarship,0
hipertension,0
diabetes,0


No missing values in the dataset.

In [17]:
# Count fully duplicated rows
df.duplicated().sum()

np.int64(0)

In [18]:
df = df.drop_duplicates()

The count above is 0, so drop_duplicates() leaves the data unchanged here; it is kept as a safeguard.
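The check above compares entire rows, so two records that share an appointment ID but differ in any other field would not be flagged. A minimal sketch of a key-based check follows, using a made-up toy frame rather than the real data; treating the appointment ID as a unique key is an assumption, and the column is named `AppointmentID` before the renaming step and `appointmentid` after it.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "appointmentid": [101, 102, 102, 103],
    "patientid": [1, 2, 2, 3],
    "age": [30, 45, 46, 8],
})

# Full-row check: the two rows with ID 102 differ in age, so nothing is flagged.
full_row_dupes = toy.duplicated().sum()

# Key-based check: the same appointmentid appears twice, so one row is flagged.
key_dupes = toy.duplicated(subset=["appointmentid"]).sum()

print(full_row_dupes, key_dupes)
```

If the key-based count were non-zero on the real data, the conflicting rows would need a manual look before dropping anything.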

In [19]:
# Clean column names
df.columns = df.columns.str.lower().str.replace("-", "_").str.replace(" ", "_")
df.columns

Index(['patientid', 'appointmentid', 'gender', 'scheduledday',
       'appointmentday', 'age', 'neighbourhood', 'scholarship', 'hipertension',
       'diabetes', 'alcoholism', 'handcap', 'sms_received', 'no_show'],
      dtype='object')

Column names are converted to lowercase, and spaces or hyphens are replaced with underscores.
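Two of the source headers are misspelled (`hipertension`, `handcap`). As an optional extra step, they could be renamed right after the cleanup. The sketch below runs on an empty toy frame with a subset of the original headers; the corrected names `hypertension` and `handicap` are a suggestion, not something the notebook uses, and the remaining cells keep the original spellings.

```python
import pandas as pd

# Empty toy frame with a subset of the original headers.
toy = pd.DataFrame(
    columns=["PatientId", "Hipertension", "Handcap", "SMS_received", "No-show"]
)

# Same cleanup as in the notebook: lowercase, then hyphens and spaces to underscores.
toy.columns = (
    toy.columns.str.lower()
    .str.replace("-", "_", regex=False)
    .str.replace(" ", "_", regex=False)
)

# Optional: fix the misspelled headers (suggested names, not from the notebook).
toy = toy.rename(columns={"hipertension": "hypertension", "handcap": "handicap"})

print(list(toy.columns))
```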

In [20]:
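# Encode the target as 0/1 and normalize gender labels
# Note: run this cell only once; re-mapping the 0/1 values would produce NaN
df["no_show"] = df["no_show"].map({"No": 0, "Yes": 1})
df["gender"] = df["gender"].str.upper()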
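`Series.map` with a dictionary returns NaN for any value that is not a key, so an unexpected label (or a second run of the cell) would silently create missing values. A minimal sketch of a sanity check, on made-up toy labels where `"no"` is a deliberately inconsistent entry:

```python
import pandas as pd

# Toy labels; the lowercase "no" is a deliberately inconsistent value.
toy = pd.DataFrame({"no_show": ["No", "Yes", "no", "Yes"]})

mapping = {"No": 0, "Yes": 1}
encoded = toy["no_show"].map(mapping)

# Any label missing from the mapping shows up as NaN after map().
unmapped = toy.loc[encoded.isna(), "no_show"].unique().tolist()

print(unmapped)
```

An empty `unmapped` list would indicate that every label was encoded.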

In [22]:
# Convert Date columns to Datetime
df["scheduledday"] = pd.to_datetime(df["scheduledday"])
df["appointmentday"] = pd.to_datetime(df["appointmentday"])

In [23]:
# Extract useful features like weekday
df["appointment_weekday"] = df["appointmentday"].dt.day_name()
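Another feature that could be derived from the two date columns is the number of days between scheduling and the appointment. This is a sketch, not part of the original notebook: the column name `waiting_days` is hypothetical, and the second toy row is made up. Because `appointmentday` carries no time of day while `scheduledday` does, both are truncated to midnight first so that a same-day booking counts as 0 days rather than a negative value.

```python
import pandas as pd

# Toy rows; the first pair matches the first row of df.head(), the second is made up.
toy = pd.DataFrame({
    "scheduledday": pd.to_datetime(["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"]),
    "appointmentday": pd.to_datetime(["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"]),
})

# Truncate both timestamps to midnight, then take the difference in whole days.
toy["waiting_days"] = (
    toy["appointmentday"].dt.normalize() - toy["scheduledday"].dt.normalize()
).dt.days

print(toy["waiting_days"].tolist())
```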

In [24]:
# Check data types (note that patientid is stored as float64)
df.dtypes

column,dtype
patientid,float64
appointmentid,int64
gender,object
scheduledday,"datetime64[ns, UTC]"
appointmentday,"datetime64[ns, UTC]"
age,int64
neighbourhood,object
scholarship,int64
hipertension,int64
diabetes,int64
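The `patientid` column is an identifier, yet it is stored as `float64`, which is why it prints in decimal form. One possible fix is to cast it to an integer type after confirming that every value is a whole number. The sketch below uses the three rounded IDs shown in the `df.head()` output as toy data; whether all IDs in the full file are whole numbers is an assumption that the check is meant to verify.

```python
import pandas as pd

# Toy IDs copied from the df.head() output (displayed values, possibly rounded).
toy = pd.DataFrame(
    {"patientid": [29872500000000.0, 558997800000000.0, 4262962000000.0]}
)

# Cast only if no value has a fractional part, so no information is lost.
is_whole = toy["patientid"] % 1 == 0
if is_whole.all():
    toy["patientid"] = toy["patientid"].astype("int64")

print(toy["patientid"].dtype)
```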


In [25]:
# Handle outliers / inconsistent data
# Check for invalid (negative) ages
df[df["age"] < 0]

index,patientid,appointmentid,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show,appointment_weekday
99832,465943200000000.0,5775010,F,2016-06-06 08:58:13+00:00,2016-06-06 00:00:00+00:00,-1,ROMÃO,0,0,0,0,0,0,0,Monday


In [26]:
# Keep only rows with a non-negative age (drops the single -1 record)
df = df[df["age"] >= 0]

In [27]:
# Summarize age to check the minimum and maximum
df["age"].describe()

statistic,age
count,110526.0
mean,37.089219
std,23.110026
min,0.0
25%,18.0
50%,37.0
75%,55.0
max,115.0
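The maximum of 115 is unusually high but not impossible, so it may be worth looking at how many rows sit in the upper tail before deciding whether to keep them. A minimal sketch on made-up toy ages; the cutoff of 100 is an arbitrary, hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Toy ages (made up); only the 0 to 115 range mirrors the describe() output.
toy = pd.DataFrame({"age": [0, 8, 37, 62, 102, 115, 115]})

threshold = 100  # hypothetical cutoff for "very old" records

# Count how often each age at or above the threshold occurs.
oldest = toy.loc[toy["age"] >= threshold, "age"].value_counts().sort_index()

print(oldest)
```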
