# Project: Investigate No-show medical appointments dataset.

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
### This notebook investigates a dataset of more than 100k medical appointments in Brazil. The data focuses on missed appointments.

#### The dataset contains the following columns:

- PatientId : Patient identification number.

- AppointmentID : Appointment identification number.

- Gender : Patient gender, male or female.

- ScheduledDay : The date on which the appointment was scheduled.

- AppointmentDay : The date of the appointment.

- Age : Age of the patient.

- Neighbourhood : Location of the hospital.

- Scholarship : Whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.

- Hipertension : Whether the patient is hypertensive.

- Diabetes : Whether the patient is diabetic.

- Alcoholism : Whether the patient is an alcoholic.

- Handcap : Indicates whether the patient is handicapped.

- SMS_received : Whether an SMS message was sent to the patient.

- No-show : ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
#### Questions:
###### 1- Do waiting days affect whether patients show up?
###### 2- Does gender affect whether patients show up?
###### 3- Do chronic diseases such as hypertension and diabetes affect whether patients show up?
###### 4- Are patients who received an SMS more likely to show up?

<a id='wrangling'></a>
## Data Wrangling


### Data Loading & Inspection

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import datetime as dt

In [None]:
dat = pd.read_csv('../input/no-show-appointments-dataset-investigation/noshowappointments-kagglev2-may-2016.csv')
dat.head(3)

In [None]:
dat.shape

##### This dataset contains 110,527 rows & 14 columns

In [None]:
dat.isna().sum()

In [None]:
dat.duplicated().sum()

##### There are no duplicates or missing values.

In [None]:
dat.info()

##### Some column names contain typos and spelling errors. I will rename them.

##### Some column dtypes should be changed to make the data easier to read and work with.

In [None]:
dat.Gender.unique()

##### In the Gender column, F stands for female and M for male.

In [None]:
(dat.Age.unique(),dat.Age.min(),dat.Age.max())

##### There is an invalid age value (-1) that should be dropped.
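A minimal sketch of how the invalid ages could be counted before they are dropped. The `toy` frame is made up for illustration and stands in for `dat`; keeping age 0 is an assumption (it is treated here as a valid age for infants).

```python
import pandas as pd

# Hypothetical stand-in for `dat`; the values are illustrative only.
toy = pd.DataFrame({"Age": [34, -1, 0, 62, 115]})

# Boolean mask of rows whose age is below zero (impossible values).
invalid_age = toy["Age"] < 0
n_invalid = int(invalid_age.sum())

# Keep only the valid rows; .copy() gives an independent frame to modify later.
toy_clean = toy[~invalid_age].copy()
print(n_invalid, len(toy_clean))
```

Counting first makes it clear how many rows the filter removes, which is worth knowing before any data is discarded.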

In [None]:
(dat.Scholarship.unique(),
 dat.Hipertension.unique(),
 dat.Diabetes.unique(),
 dat.Alcoholism.unique(),
 dat.SMS_received.unique())

##### All of these columns have 2 values {0 for no & 1 for yes}.

In [None]:
dat.Handcap.unique()

##### This column has 5 unique values: 0 for no disabilities, and 1, 2, 3 or 4 for the number of disabilities.


### Data Cleaning Process

In [None]:
dat.rename(columns= {'PatientId':'Patient_id',
                         'AppointmentID':'Appointment_id',
                         'ScheduledDay':'Scheduled_day',
                         'AppointmentDay':'Appointment_day',
                         'Hipertension':'Hypertension',
                         'Handcap':'Handicap',
                         'No-show':'No_show'}, inplace=True)
dat.columns

##### Columns renamed..

In [None]:
dat['Patient_id'] = dat['Patient_id'].astype('int64')

In [None]:
# pd.to_datetime parses the ISO timestamps; .dt.date keeps only the calendar date
dat.Scheduled_day = pd.to_datetime(dat.Scheduled_day).dt.date

In [None]:
dat.Appointment_day = pd.to_datetime(dat.Appointment_day).dt.date

In [None]:
dat.dtypes

##### Type conversions done.

In [None]:
# Drop the invalid age; .copy() avoids chained-assignment warnings in later cells
dat = dat[dat.Age != -1].copy()

In [None]:
(dat.Age.unique(),dat.Age.min(),dat.Age.max())

##### Invalid age value removed.

In [None]:
dat.No_show.unique()

##### The column says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
##### I'll change it to a 0/1 encoding {0 for show - 1 for no-show}, which is easier to aggregate.

In [None]:
# Map the labels to integers in one step and assign the result back
dat['No_show'] = dat['No_show'].map({'No': 0, 'Yes': 1})

##### No_show column format changed to 0/1..

In [None]:
dat.head(10)

### That's Better...
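The cleaning steps above can be gathered into one function so they are easy to rerun. This is only a sketch: `clean_appointments` is a hypothetical helper name, and `toy_raw` is a small made-up frame with a subset of the columns, not real rows from the dataset.

```python
import pandas as pd

def clean_appointments(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper bundling the cleaning steps used in this notebook."""
    df = raw.rename(columns={
        "PatientId": "Patient_id",
        "AppointmentID": "Appointment_id",
        "ScheduledDay": "Scheduled_day",
        "AppointmentDay": "Appointment_day",
        "Hipertension": "Hypertension",
        "Handcap": "Handicap",
        "No-show": "No_show",
    })
    # Keep only the calendar date of each timestamp.
    for col in ["Scheduled_day", "Appointment_day"]:
        df[col] = pd.to_datetime(df[col]).dt.date
    # Drop impossible ages and encode the target as 0 (show) / 1 (no-show).
    df = df[df["Age"] >= 0].copy()
    df["No_show"] = df["No_show"].map({"No": 0, "Yes": 1})
    return df

# Made-up example rows (illustrative only).
toy_raw = pd.DataFrame({
    "PatientId": [1.0, 2.0, 3.0],
    "AppointmentID": [10, 11, 12],
    "ScheduledDay": ["2016-04-29T18:00:00Z", "2016-04-27T08:30:00Z", "2016-05-02T09:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z", "2016-05-02T00:00:00Z"],
    "Age": [62, -1, 8],
    "Hipertension": [1, 0, 0],
    "Handcap": [0, 0, 0],
    "No-show": ["No", "Yes", "Yes"],
})
toy_clean = clean_appointments(toy_raw)
print(toy_clean[["Patient_id", "Age", "No_show"]])
```

Because the function returns a new frame, the raw data stays untouched and the cleaning can be repeated after any change to the steps.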

<a id='eda'></a>
## Exploratory Data Analysis


### Question 1 (Do waiting days impact show up?)

###### - I will create a new column for the waiting days between scheduling & appointment.

In [None]:
# Convert to datetime so the difference is a timedelta, then take the whole days
dat['Wait_days'] = (pd.to_datetime(dat.Appointment_day) - pd.to_datetime(dat.Scheduled_day)).dt.days
dat.tail(5)

In [None]:
dat.Wait_days.min()

###### There are erroneous negative values that should be excluded.

In [None]:
dat = dat[dat['Wait_days'] >=0]

In [None]:
dat.Wait_days.unique()

###### Done..

In [None]:
# No_show is 0/1, so the mean per group is the no-show proportion
wait_show_prob = dat.groupby('Wait_days')['No_show'].mean().to_frame()
wait_show_prob.head(10)

In [None]:
sns.set_theme(style="darkgrid")
plt.figure(figsize=(10, 5), dpi=80)
sns.lineplot(data=wait_show_prob, x="Wait_days", y="No_show",linewidth=2)
plt.xlabel('Waiting days')
plt.ylabel('Absence proportion')
plt.title('Waiting days vs Absence proportion',fontsize=15);

###### 1- From the plot, over the first 60 waiting days the absence proportion rises with the waiting period.
###### 2- Same-day appointments have a remarkably low no-show ratio of only 4.6%.
###### 3- For waiting periods longer than 60 days, the no-show ratio is highly variable & unpredictable.
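The variability beyond 60 days may partly come from small group sizes, which is an assumption that the analysis above does not check. A rough sketch of how to look at group size and no-show rate together, using made-up data in place of `dat` and arbitrary bucket edges:

```python
import pandas as pd

# Hypothetical stand-in for `dat`; values are illustrative only.
toy = pd.DataFrame({
    "Wait_days": [0, 0, 0, 3, 10, 25, 45, 70, 120],
    "No_show":   [0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# Bucket the waiting period; the bin edges are an arbitrary choice.
bins = [-1, 0, 7, 30, 60, 365]
labels = ["same day", "1-7", "8-30", "31-60", "61+"]
toy["Wait_bucket"] = pd.cut(toy["Wait_days"], bins=bins, labels=labels)

# Number of appointments and no-show rate per bucket.
summary = toy.groupby("Wait_bucket", observed=False)["No_show"].agg(["count", "mean"])
print(summary)
```

A bucket with only a handful of appointments gives a proportion that can swing widely, so the `count` column helps judge how much weight each rate deserves.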


### Question 2 (Does gender impact show up?)

In [None]:
# Proportion of missed appointments per gender (mean of the 0/1 No_show flag)
gen_prop = dat.groupby('Gender')['No_show'].mean()
gen_prop

In [None]:
locations = [1, 2]
heights = gen_prop
labels = ['Female', 'Male']
plt.figure(figsize=(3,5))
plt.bar(locations, heights, tick_label=labels, color=['pink', 'blue'], edgecolor='white')
plt.xlabel('Gender')
plt.ylabel('Absence proportion')
plt.title('Gender vs Absence proportion',fontsize=12);

###### There is no noticeable difference between males & females.
###### The no-show ratio is 20.3% for females & 19.9% for males.
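One way to check whether a gap this small could be due to chance is a two-proportion z-test. The sketch below uses only the standard library; `two_proportion_z` is a hypothetical helper, and the counts passed to it are illustrative, not taken from the dataset.

```python
import math

def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled rate under the null hypothesis of equal proportions.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative counts only: no-shows and totals for two groups.
z, p = two_proportion_z(x1=203, n1=1000, x2=199, n2=1000)
print(round(z, 3), round(p, 3))
```

With the real data, `x1` and `x2` would be the no-show counts per gender and `n1` and `n2` the number of appointments per gender.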

### Question 3 (Are chronic diseases like hypertension and diabetes affecting patient's show up?)

In [None]:
dat.Diabetes.hist(figsize=(5,3));

In [None]:
dat.Diabetes.sum()/dat.Diabetes.count()*100

In [None]:
dat.Hypertension.hist(figsize=(5,3));

In [None]:
dat.Hypertension.sum()/dat.Hypertension.count()*100

###### Around 7.2% of the appointments in the dataset are for diabetic patients & 19.7% are for hypertensive patients.

In [None]:
# Proportion of missed appointments by diabetes status
dia_prop = dat.groupby('Diabetes')['No_show'].mean()
dia_prop

In [None]:
# Proportion of missed appointments by hypertension status
ten_prop = dat.groupby('Hypertension')['No_show'].mean()
ten_prop

In [None]:
locations = [1, 2]
heights = dia_prop
labels = ['Non-Diabetic', 'Diabetic']
plt.figure(figsize=(3,5))
plt.bar(locations, heights, tick_label=labels, color=['blue', 'red'], edgecolor='black')
plt.xlabel('Diabetes Status')
plt.ylabel('Absence proportion')
plt.title('Diabetes vs Absence proportion',fontsize=12);

In [None]:
locations = [1, 2]
heights = ten_prop
labels = ['Non-Hypertensive', 'Hypertensive']
plt.figure(figsize=(3,5))
plt.bar(locations, heights, tick_label=labels, color=['green', 'red'], edgecolor='black')
plt.xlabel('Hypertension status')
plt.ylabel('Absence proportion')
plt.title('Hypertension vs Absence proportion',fontsize=12);

###### As we can see:
######   The no-show ratio is 18% for diabetic patients & 20.4% for non-diabetic patients.
######   The no-show ratio is 17.3% for hypertensive patients & 20.9% for non-hypertensive patients.
##### So, in general, patients with these chronic diseases are more likely to show up.
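The per-group proportions in this notebook all follow the same pattern, which could be wrapped in a small function. A sketch under assumptions: `no_show_rate` is a hypothetical helper name and `toy` is made-up data standing in for `dat`.

```python
import pandas as pd

def no_show_rate(df: pd.DataFrame, by: str) -> pd.Series:
    """Hypothetical helper: share of missed appointments per value of `by`."""
    # No_show is 0/1, so the group mean equals the no-show proportion.
    return df.groupby(by)["No_show"].mean()

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Diabetes":     [0, 0, 0, 0, 1, 1],
    "Hypertension": [0, 1, 0, 1, 1, 1],
    "No_show":      [1, 0, 1, 0, 0, 0],
})
for col in ["Diabetes", "Hypertension"]:
    print(no_show_rate(toy, col))
```

The same call works for any categorical column, such as `Gender` or `SMS_received`, as long as `No_show` is already encoded as 0/1.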

### Question 4 (Are the patients who received a SMS more committed to show up?)

In [None]:
sms_receive = dat.SMS_received.sum()
sms_not_receive = dat.SMS_received.count() - sms_receive
labels = ['Received', 'Not Received']
sizes = [sms_receive, sms_not_receive]
explode = (0, 0.2)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
plt.title('SMS receivers ratio')
ax1.axis('equal')
plt.show()

###### An SMS was received for only 32.1% of the appointments.

In [None]:
# Proportion of missed appointments by SMS status
sms_prop = dat.groupby('SMS_received')['No_show'].mean()
sms_prop

In [None]:
locations = [1, 2]
heights = sms_prop
labels = ['Not Received', 'Received']
plt.figure(figsize=(3,5))
plt.bar(locations, heights, tick_label=labels, color=['cyan', 'blue'], edgecolor='black')
plt.xlabel('SMS status')
plt.ylabel('Absence proportion')
plt.title('SMS receive vs Absence proportion',fontsize=12);

###### Surprisingly, patients who received an SMS were less likely to show up.
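The two SMS groups may differ in how far in advance their appointments were booked, which would make this comparison harder to interpret. This is an assumption, not something the analysis above establishes. A sketch of one possible check on made-up data: compare the share of SMS recipients for same-day and advance bookings, then compare no-show rates among advance bookings only.

```python
import pandas as pd

# Hypothetical stand-in for `dat`; values are illustrative only.
toy = pd.DataFrame({
    "Wait_days":    [0, 0, 0, 5, 5, 12, 12, 40],
    "SMS_received": [0, 0, 0, 0, 1, 1, 0, 1],
    "No_show":      [0, 0, 0, 1, 0, 1, 1, 0],
})
# Label each appointment as booked on the same day or in advance.
toy["Booking"] = (toy["Wait_days"] == 0).map({True: "same day", False: "advance"})

# Share of appointments with an SMS, per booking type.
sms_share = toy.groupby("Booking")["SMS_received"].mean()

# No-show rate by SMS status among appointments booked in advance only.
advance = toy[toy["Booking"] == "advance"]
rate_advance = advance.groupby("SMS_received")["No_show"].mean()
print(sms_share)
print(rate_advance)
```

If same-day appointments rarely come with an SMS in the real data, the overall comparison would mix waiting time with SMS status, and the restricted comparison would be the fairer one.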

<a id='conclusions'></a>
## Conclusions
#### In this project, I analyzed the no-show medical appointments dataset.
#### Only some of the dataset's variables were analyzed.
###### 1- In general, a patient is more likely to show up when the time between scheduling and the appointment is shorter.
###### 2- Same-day appointments have the lowest no-show ratio (4.6%).
###### 3- Gender has no noticeable effect on whether a patient shows up.
###### 4- Hypertensive or diabetic patients are more likely to show up.
###### 5- Patients who received an SMS were less likely to show up, so the messages may not be worth their cost.


#### Limitations
###### - Waiting times between scheduling & appointment run up to 6 months, which is abnormal for an ill person waiting to see a doctor and may cause unexpected results.
###### - Patients younger than 14 or older than 75 usually cannot get to an appointment without a family member, so the age of that family member would be a useful addition to the data.
###### - The time the SMS was sent is not given. Was it on the scheduling day (which is of little use) or on the day of the appointment (which is more useful as a reminder)?
###### - Only the number of handicapping conditions is given, not their type. Some conditions, such as blindness & physical disability, make it harder to get to an appointment alone.