# Analyzing Doctor Appointment No-Shows

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

We want to analyze a data set to see if we can discover patterns or indicators about whether or not someone is more likely to miss a doctor appointment.

Similar to how airline companies overbook their flights on the assumption that some passengers will not show up, we may want to consider "overbooking" our doctor appointments on the assumption that some patients will miss them. However, we do not want to be arbitrary in our scheduling.

We can leverage data from over 100,000 records of doctor appointments to uncover insights that help us predict whether a patient will show up for an appointment, based on a number of factors (health conditions, welfare status, location of doctor's office, and other relevant data points).

This project investigates that data and looks for insights that can help us make better, more efficient, and more profitable scheduling decisions.

In [128]:
import pandas as pd
from datetime import datetime

<a id='wrangling'></a>
## Data Wrangling

Import the data into a pandas DataFrame and make some general observations.

### General Properties

In [129]:
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
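
The general observations mentioned above can be sketched as follows. This is a minimal illustration that uses a hypothetical three-row sample in place of the real CSV file; the timestamp values are invented for the example, and only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# The values are invented; the column names follow this notebook.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z", "2016-04-29T16:07:23Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z", "2016-04-29T00:00:00Z"],
    "Scholarship": [0, 1, 0],
    "No-show": ["No", "Yes", "No"],
})

# Number of rows and columns.
n_rows, n_cols = sample.shape

# Missing values per column; all zeros means there are no nulls.
null_counts = sample.isnull().sum()

# How often each value of the target column occurs.
no_show_counts = sample["No-show"].value_counts()

print(n_rows, n_cols)
print(null_counts)
print(no_show_counts)
```

Running the same three checks on `df` gives the row count, the null counts, and the split between "Yes" and "No" in the No-show column.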

### Data Cleaning

There are no null values, but there are four changes to make:<br>
1. Convert the No-show column to a more readable, intuitive name and value. I will change it from the negative to the positive by renaming the column to "ApptResult" with each corresponding value as "Present" or "Missed."<br>
2. Rename the Scholarship column to WelfareScholarship so that its meaning is clearer.<br>
3. To utilize the dates provided, I will convert the scheduled date and the appointment date to dates and calculate the difference between them.<br>
4. Remove same-day / walk-in appointments.<br>

Upon making these changes, I will export to a new, cleaned CSV file for future use.

#### Clean the No-show column

I want to first change the column name from No-show to ApptResult. This column name will be more intuitive to interpret.

In [132]:
df.rename(columns = {'No-show' : 'ApptResult'} , inplace=True)

Once we rename the column from No-show (which is in the negative), we need to rename the ApptResult column values from a "Yes" or "No" value to "Missed" or "Present", respectively.

In [133]:
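# Map both values in one step and assign the result back to the column;
# calling replace(..., inplace=True) on a selected column may not update df in newer pandas.
df['ApptResult'] = df['ApptResult'].replace({"No": "Present", "Yes": "Missed"})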

#### Rename the Scholarship column to WelfareScholarship

In [148]:
df.rename(columns = {'Scholarship' : 'WelfareScholarship'} , inplace=True)

#### Clean the date columns and add a new column that calculates the difference

Convert the AppointmentDay and ScheduledDay columns to datetime format, keeping only the date.

In [135]:
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'], format="%Y-%m-%d").dt.date
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'], format="%Y-%m-%d").dt.date

Add a "DateDifference" column to the dataframe, calculated as the appointment date minus the scheduled date.

In [136]:
df['DateDifference'] = df['AppointmentDay'] - df['ScheduledDay']

Retain only the rows where DateDifference is greater than 0. We can assume that rows with a difference of 0 are same-day or walk-in appointments, and that rows with a negative difference (scheduled after the appointment date) are data-entry errors.

In [137]:
# Compare against an explicit Timedelta rather than the string '0 days'.
df = df[df['DateDifference'] > pd.Timedelta(0)]
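
The date handling and filtering above can be illustrated end to end on a small example. This is a sketch under assumptions: the three sample rows are hypothetical, and it uses `.dt.normalize()` (which keeps a datetime dtype with the time set to midnight) instead of the `.dt.date` call used in this notebook.

```python
import pandas as pd

# Hypothetical sample: one same-day row, one row scheduled six days ahead,
# and one row scheduled after the appointment date.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z", "2016-05-05T10:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z", "2016-05-04T00:00:00Z"],
})

# Parse the timestamps and drop the time of day so only calendar dates are compared.
for col in ["ScheduledDay", "AppointmentDay"]:
    sample[col] = pd.to_datetime(sample[col]).dt.normalize()

# Appointment date minus scheduled date, as a timedelta column.
sample["DateDifference"] = sample["AppointmentDay"] - sample["ScheduledDay"]

# Keep only appointments booked at least one day in advance.
kept = sample[sample["DateDifference"] > pd.Timedelta(0)]

print(kept)
```

Only the row scheduled six days ahead survives the filter; the same-day row and the row with a negative difference are dropped.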

#### Export the cleaned data to a new CSV

In [149]:
df.to_csv('cleaned-noshowappointments.csv', index=False)

In [150]:
df = pd.read_csv('cleaned-noshowappointments.csv')
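
A CSV file stores only text, so date and time-difference columns come back as strings unless they are converted again on reload. The sketch below shows one way to restore them, assuming a small hypothetical frame and an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# Hypothetical cleaned frame with two date columns and their difference.
cleaned = pd.DataFrame({
    "ScheduledDay": pd.to_datetime(["2016-04-27", "2016-05-01"]),
    "AppointmentDay": pd.to_datetime(["2016-05-03", "2016-05-02"]),
})
cleaned["DateDifference"] = cleaned["AppointmentDay"] - cleaned["ScheduledDay"]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the date columns; the difference column is converted separately.
reloaded = pd.read_csv(buffer, parse_dates=["ScheduledDay", "AppointmentDay"])
reloaded["DateDifference"] = pd.to_timedelta(reloaded["DateDifference"])

print(reloaded.dtypes)
```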

<a id='eda'></a>
## Exploratory Data Analysis

In [157]:
df = pd.read_csv('cleaned-noshowappointments.csv')

### Does the location of the facility affect the likelihood of a missed appointment?
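
One way to approach this question is to compute the share of missed appointments per location. The sketch below assumes the location is stored in a column named `Neighbourhood`, which is an assumption not confirmed in this notebook, and it runs on invented sample data, so the numbers say nothing about the real data set.

```python
import pandas as pd

# Invented sample data; "Neighbourhood" is an assumed column name.
sample = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B", "C"],
    "ApptResult": ["Missed", "Present", "Present", "Missed", "Missed", "Present"],
})

# True where the appointment was missed.
missed = sample["ApptResult"].eq("Missed")

# Appointment count and missed share per location, highest share first.
by_location = (
    missed.groupby(sample["Neighbourhood"])
    .agg(appointments="count", missed_rate="mean")
    .sort_values("missed_rate", ascending=False)
)

print(by_location)
```

Keeping the appointment count next to the rate helps flag locations with too few appointments for the rate to be meaningful.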

### Does a person's welfare status affect the likelihood of missing an appointment?
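
A possible starting point is a row-normalized cross-tabulation of welfare status against the appointment result. This sketch assumes that `WelfareScholarship` is encoded as 0 or 1, and it uses invented sample data, so the output is illustrative only.

```python
import pandas as pd

# Invented sample data; the 0/1 encoding of WelfareScholarship is an assumption.
sample = pd.DataFrame({
    "WelfareScholarship": [0, 0, 0, 0, 1, 1],
    "ApptResult": ["Present", "Present", "Present", "Missed", "Missed", "Present"],
})

# Each row sums to 1, so the "Missed" column is the missed share per group.
rates = pd.crosstab(
    sample["WelfareScholarship"],
    sample["ApptResult"],
    normalize="index",
)

print(rates)
```

A difference between the two rows would be an observed association in the sample, not evidence that welfare status causes missed appointments.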

### Do a person's health conditions affect the likelihood of missing an appointment?
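
If each health condition is stored as a 0/1 flag column, the missed share can be compared for patients with and without each condition. The column names `Hypertension` and `Diabetes` below are hypothetical placeholders, and the sample data is invented for illustration.

```python
import pandas as pd

# Invented sample data; the condition column names are hypothetical placeholders.
sample = pd.DataFrame({
    "Hypertension": [1, 1, 0, 0, 0, 0],
    "Diabetes": [0, 1, 0, 1, 0, 0],
    "ApptResult": ["Present", "Present", "Missed", "Missed", "Missed", "Present"],
})
condition_cols = ["Hypertension", "Diabetes"]

# True where the appointment was missed.
missed = sample["ApptResult"].eq("Missed")

# One row per condition: missed share without (0) and with (1) the condition.
summary = pd.DataFrame(
    {col: missed.groupby(sample[col]).mean() for col in condition_cols}
).T.rename(columns={0: "without_condition", 1: "with_condition"})

print(summary)
```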

<a id='conclusions'></a>
## Conclusions
