# Project: No-show appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **Data Set**: This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

* ‘ScheduledDay’ tells us on what day the patient set up their appointment.
* ‘Neighbourhood’ indicates the location of the hospital.
* ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
* ‘No-show’ says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.

> Importing the necessary libraries

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling


### Importing the Data
> Extracting the data stored as rows and columns into a DataFrame

In [58]:
appt_df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')

### Descriptive Summary

>  Generating descriptive statistics that summarize the different aspects of a dataset's distribution.

In [59]:
appt_df.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


> Viewing the first five rows of the DataFrame

In [60]:
appt_df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


### Data Cleaning
> Correcting misspelled column names and removing invalid values that would distort the analysis

In [61]:
appt_df.rename(columns = {'Hipertension': 'Hypertension',
                'Handcap': 'Handicap','No-show':'No_show'}, inplace = True)
print(appt_df.columns)

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hypertension',
       'Diabetes', 'Alcoholism', 'Handicap', 'SMS_received', 'No_show'],
      dtype='object')


> Checking each column for invalid values and NaN

In [62]:
print('Age:-> ',sorted(appt_df.Age.unique()))
print('Gender:->',appt_df.Gender.unique())
print('Diabetes:-> ',appt_df.Diabetes.unique())
print('Alcoholism:-> ',appt_df.Alcoholism.unique())
print('Hypertension:-> ',appt_df.Hypertension.unique())#Hypertension
print('Handicap:-> ',appt_df.Handicap.unique())#Handicap
print('Scholarship:-> ',appt_df.Scholarship.unique())
print('SMS Received:-> ',appt_df.SMS_received.unique())
print('No Show:-> ',appt_df.No_show.unique())

Age:->  [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]
Gender:-> ['F' 'M']
Diabetes:->  [0 1]
Alcoholism:->  [0 1]
Hypertension:->  [1 0]
Handicap:->  [0 1 2 3 4]
Scholarship:->  [0 1]
SMS Received:->  [0 1]
No Show:->  ['No' 'Yes']
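
The unique-value printout above surfaces odd values such as an age of -1, but it does not count missing entries directly. A minimal sketch of an explicit check follows; the small `toy` frame and its values are invented stand-ins for `appt_df`, and the 0 to 95 range is an assumption chosen to match the filter applied later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for appt_df; the values are invented for illustration only.
toy = pd.DataFrame({
    'Age': [-1, 0, 37, 115, np.nan],
    'Handicap': [0, 1, 4, 0, 2],
})

# Number of missing entries per column.
missing_per_column = toy.isna().sum()

# Rows whose Age is outside 0..95; NaN fails between(), so it is flagged too.
bad_age_rows = toy[~toy['Age'].between(0, 95)]

print(missing_per_column)
print(bad_age_rows)
```

On the real data the same two expressions can be applied to `appt_df` directly.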


> Converting the date columns from strings to datetime type

In [63]:
# pd.to_datetime parses the ISO 8601 strings (including the trailing 'Z') in one vectorised call
appt_df.AppointmentDay = pd.to_datetime(appt_df.AppointmentDay)
appt_df.ScheduledDay   = pd.to_datetime(appt_df.ScheduledDay)

> Adding the day of the week of the appointment to the DataFrame

In [64]:
appt_df['WeekDay'] = pd.to_datetime(appt_df['AppointmentDay']).apply(lambda x: x.isoweekday())
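
`isoweekday()` numbers the days from 1 (Monday) to 7 (Sunday). The sketch below applies it to one date from the sample rows and one invented weekend date, and compares it with the vectorised `dt.dayofweek` accessor, which counts from 0 instead. The variable names are illustrative only.

```python
import pandas as pd

# 2016-04-29 appears in the sample rows; 2016-05-01 is an invented weekend date.
days = pd.Series(pd.to_datetime(['2016-04-29', '2016-05-01']))

# Element-wise call, as in the cell above: 1 = Monday ... 7 = Sunday.
iso = days.apply(lambda d: d.isoweekday())

# Vectorised equivalent: dayofweek is 0 = Monday ... 6 = Sunday, so add 1.
vectorised = days.dt.dayofweek + 1

print(iso.tolist())
print(vectorised.tolist())
```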

> Removing age outliers by keeping only patients aged 0 to 95

In [65]:
# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
appt_df = appt_df[(appt_df.Age >= 0) & (appt_df.Age <= 95)].copy()
appt_df.shape

(110480, 15)

### Map No_show column

> Encode the No_show column as 0s and 1s:
> * 0 = showed up to the appointment
> * 1 = did not show up to the appointment (missed it)

In [66]:
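# Assign the result back instead of using a chained inplace replace, which may not modify appt_df
appt_df['No_show'] = appt_df['No_show'].map({'No': 0, 'Yes': 1})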
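
When encoding labels with a dictionary, any label missing from the dictionary becomes NaN under `Series.map`, which makes it easy to verify that nothing was left unmapped. A small sketch on invented labels:

```python
import pandas as pd

# Invented labels standing in for the No_show column.
labels = pd.Series(['No', 'Yes', 'No', 'No'], name='No_show')

# Labels absent from the dictionary would turn into NaN.
encoded = labels.map({'No': 0, 'Yes': 1})

# True only if every label was covered by the mapping.
all_mapped = bool(encoded.notna().all())

print(encoded.tolist(), all_mapped)
```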

### Calculating the Patient's Awaiting Time

> AwaitingTime is calculated as the absolute number of days (including fractions of a day) between the scheduling timestamp and the appointment date

In [67]:
appt_df['AwaitingTime'] = appt_df["AppointmentDay"].sub(appt_df["ScheduledDay"], axis=0)
# Convert to days; abs() because e.g. same-day bookings give a negative difference (AppointmentDay has no time of day)
appt_df["AwaitingTime"] = (appt_df["AwaitingTime"] / np.timedelta64(1, 'D')).abs()
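
`ScheduledDay` carries a time of day while `AppointmentDay` is recorded at midnight, so a same-day booking produces a small negative difference. The sketch below shows an alternative that compares calendar dates only and yields whole days. It is an assumption about one possible approach, not the computation used above. The first pair of timestamps comes from the first sample row; the second pair is invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'ScheduledDay': pd.to_datetime(['2016-04-29T18:38:08Z', '2016-04-27T08:00:00Z']),
    'AppointmentDay': pd.to_datetime(['2016-04-29T00:00:00Z', '2016-05-02T00:00:00Z']),
})

# Raw difference in fractional days: negative for the same-day booking.
raw_days = (toy['AppointmentDay'] - toy['ScheduledDay']) / pd.Timedelta(days=1)

# Drop the time of day before subtracting, so the result is whole days.
whole_days = (toy['AppointmentDay'].dt.normalize()
              - toy['ScheduledDay'].dt.normalize()).dt.days

print(raw_days.round(3).tolist())
print(whole_days.tolist())
```

With date-only differences, a negative value would indicate an appointment dated before it was scheduled, which `abs()` would otherwise hide.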

> Creating a new column named "DayOfWeek" which contains the day of the week of the appointment (1 = Monday, 7 = Sunday); it holds the same values as "WeekDay"

In [None]:
appt_df['DayOfWeek'] = pd.to_datetime(appt_df['AppointmentDay']).apply(lambda x: x.isoweekday())

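> The running count of appointments missed by each patient (the current appointment is included in the count).

In [None]:
# Grouped cumsum keeps the original row index, so the result aligns with appt_df
appt_df['Num_App_Missed'] = appt_df.groupby('PatientId')['No_show'].cumsum()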
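
A grouped cumulative sum produces a running total that includes the current row. The sketch below shows this on an invented three-patient frame, together with a variant that counts only earlier misses. The `Prev_Missed` column is a hypothetical addition and is not part of the analysis above; both columns assume the rows are already in chronological order.

```python
import pandas as pd

# Invented data: one row per appointment, in chronological order.
toy = pd.DataFrame({
    'PatientId': [1, 1, 1, 2, 2, 3],
    'No_show':   [1, 0, 1, 0, 1, 0],
})

# Running count of missed appointments per patient, current row included.
toy['Num_App_Missed'] = toy.groupby('PatientId')['No_show'].cumsum()

# Misses strictly before the current appointment (hypothetical variant).
toy['Prev_Missed'] = toy['Num_App_Missed'] - toy['No_show']

print(toy)
```

The second column avoids using the outcome of the current appointment when the count is later compared against that same outcome.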

> Checking the data types for all the columns in the data frame

In [None]:
appt_df.info()

<a id='eda'></a>
## Exploratory Data Analysis

### How does each feature on its own relate to whether people show up?

> Checking the percentage of people who miss their scheduled appointments.

In [None]:
no_show = appt_df["No_show"].value_counts()
print(no_show)
percent_no_show = no_show[1]/ no_show.sum() * 100
print("Percentage of people who miss their scheduled appointments:",percent_no_show )
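
The same percentage can also be read off with `value_counts(normalize=True)`, or as the mean of the 0/1 column. A sketch on ten invented values:

```python
import pandas as pd

# Invented 0/1 outcomes: 1 = missed appointment.
no_show = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 0, 0], name='No_show')

# Share of each outcome, expressed in percent.
shares = no_show.value_counts(normalize=True) * 100
percent_missed = shares.loc[1]

# The mean of a 0/1 column is the fraction of ones.
percent_missed_via_mean = no_show.mean() * 100

print(percent_missed, percent_missed_via_mean)
```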

> Calculating the mean no-show rate for each value of Gender, Hypertension, Alcoholism and Diabetes

In [None]:
columns_of_df = ['Gender','Hypertension','Alcoholism','Diabetes']
for r in columns_of_df :
    print(appt_df.groupby(r)['No_show'].mean())

> Calculating the no-show rate for people who did and did not receive an SMS.

In [None]:
appt_df.groupby('SMS_received')['No_show'].mean()

### Conclusion:
> Receiving an SMS does not appear to produce a significant change in whether people show up for their scheduled appointments.
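
Whether a difference between two no-show rates is statistically significant can be checked with a two-proportion z-test. The sketch below uses only the standard library. The function name `two_proportion_z_test` and the counts are hypothetical and are not taken from this dataset.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled rate under the null hypothesis of equal proportions.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: missed appointments out of total, with and without SMS.
z, p = two_proportion_z_test(30, 100, 20, 100)
print(round(z, 3), round(p, 3))
```

For the real data, the four inputs would be the missed and total counts in each `SMS_received` group.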

### What are the important factors that affect whether a patient shows up for their scheduled appointment?

In [None]:
def prob_show(dataset, group_by):
    # Count shows (0) and no-shows (1) for every value of the grouping column
    counts = pd.crosstab(index = dataset[group_by], columns = dataset['No_show']).reset_index()
    # Probability of showing up: '0' means showed up, '1' means DID NOT show up for the appointment
    counts['probShowUp'] = counts[0] / (counts[1] + counts[0])
    return counts[[group_by, 'probShowUp']]
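
To see what `prob_show` returns, the sketch below applies the same cross-tabulation to an invented frame. The function is repeated so the snippet runs on its own, and the `toy` data is an assumption for illustration.

```python
import pandas as pd

def prob_show(dataset, group_by):
    # Count shows (0) and no-shows (1) per value of the grouping column.
    counts = pd.crosstab(index=dataset[group_by],
                         columns=dataset['No_show']).reset_index()
    counts['probShowUp'] = counts[0] / (counts[0] + counts[1])
    return counts[[group_by, 'probShowUp']]

# Invented data: four appointments at age 20, two at age 60.
toy = pd.DataFrame({
    'Age':     [20, 20, 20, 20, 60, 60],
    'No_show': [0, 1, 1, 0, 0, 0],
})

result = prob_show(toy, 'Age')
print(result)
```

Each row of the result is one value of the grouping column with the fraction of appointments at that value that were attended.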


> Plotting the probability that a person shows up for an appointment against Age and Num_App_Missed.

In [None]:
sns.lmplot(data = prob_show(appt_df, 'Age'), x = 'Age', y = 'probShowUp', fit_reg = True)
plt.xlim(0, 100)
plt.title('Probability of showing up depending on Age')
plt.show()

> The no-show rate of medical appointments depends strongly on the age of the patient: patients aged 14 to 24 have a higher no-show rate. The no-show rate then decreases for patients older than about 80.

#### Number of Appointments Missed by Patient

> The probability of showing up decreases with the number of previously missed appointments

In [None]:
sns.lmplot(data = prob_show(appt_df, 'Num_App_Missed'), x = 'Num_App_Missed', 
           y = 'probShowUp', fit_reg = True)
plt.title('Probability of showing up with respect to Number of missed appointments')
plt.ylim(0, 1)
plt.show()

> The number of missed appointments and Age are good predictors of whether patients show up.

#### Probability of showing up by number of handicaps a person presents

In [None]:
# prob_categorical is not defined in this notebook, so the prob_show helper defined above is reused
sns.barplot(data = prob_show(appt_df, 'Handicap'), x = 'Handicap', y = 'probShowUp')
plt.title('Probability of showing up by number of handicaps a person presents')
plt.ylabel('Probability')
plt.show()

> Handicap is the total number of handicaps a person presents. The probability of showing up decreases as the number of handicaps increases, especially beyond 2 handicaps.

#### Probability of showing up based on Day of the week

In [None]:
# prob_categorical is not defined in this notebook, so the prob_show helper defined above is reused
sns.barplot(data = prob_show(appt_df, 'DayOfWeek'), x = 'DayOfWeek', y = 'probShowUp')
plt.title('Probability of showing up based on Day of the week')
plt.ylabel('Probability')
plt.show()

> Analyzing the probability of showing up with respect to the day of the week shows that the probability decreases on weekends.

<a id='conclusions'></a>
## Conclusions

* Certain age groups appear to be more likely to miss appointments.
* SMS reminders did not increase show-ups.
* The number of missed appointments, AwaitingTime, and Age are good predictors of showing up.
* Patients with scholarships (low income) appeared to have a higher percentage of not attending appointments.
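
The cells above do not show the computation behind the scholarship and waiting-time findings. One way to check such claims is to group the no-show rate by the flag and by waiting-time bins. The sketch below does this on invented data; the bin edges, the labels, and the `WaitBin` column name are arbitrary assumptions.

```python
import pandas as pd

# Invented rows standing in for appt_df.
toy = pd.DataFrame({
    'Scholarship':  [0, 0, 0, 1, 1, 0, 1, 0],
    'AwaitingTime': [0.2, 1.5, 10.0, 3.0, 20.0, 0.7, 45.0, 6.0],
    'No_show':      [0, 0, 1, 0, 1, 0, 1, 0],
})

# Mean of the 0/1 No_show column = no-show rate per group.
rate_by_scholarship = toy.groupby('Scholarship')['No_show'].mean()

# Bucket the waiting time (in days) into assumed ranges.
bins = [0, 1, 7, 30, 365]
labels = ['up to 1 day', '1-7 days', '7-30 days', 'over 30 days']
toy['WaitBin'] = pd.cut(toy['AwaitingTime'], bins=bins, labels=labels,
                        include_lowest=True)
rate_by_wait = toy.groupby('WaitBin', observed=True)['No_show'].mean()

print(rate_by_scholarship)
print(rate_by_wait)
```

Applied to the real `appt_df`, the two grouped means would give the rates that the corresponding bullet points summarize.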