## Project Blueprint

Table of Contents
1) Introduction
2) Questions
3) Data Wrangling
Load packages & gathering data
Assessing data
Cleaning and trimming data
4) Exploratory Data Analysis
Research Question 1
Research Question 2
Research Question 3
Research Question 4
Research Question 5
Research Question 6
Research Question 7
5) Conclusions
6) References


Table of Contents
1. Introduction
1.1. First impressions
1.2. Questions
2. Data Wrangling
2.1. General Properties
2.2. Data Cleaning
3. Exploratory Data Analysis
3.1. Age
3.2. Waiting days
3.2.1. Analysing the decrease after a month
3.3. Received sms
3.4. Appointment week day
3.5. Gender
3.6. Neighbourhood
3.7. Patient Id
3.8. Answering questions
4. Conclusion


# Project: Exploratory Analysis of Medical Appointment No-shows Dataset
<br><br>
## Table of Contents<br>
<ul>
<li><a href="#intro">Introduction</a></li>
    <li style="margin-left:3%"> Initial observations</li>
    <li style="margin-left:3%"> Questions</li><br>
<li><a href="#wrangling">Data Wrangling</a></li><br>
<li><a href="#eda">Exploratory Data Analysis</a></li><br>
<li><a href="#conclusions">Conclusions</a></li><br>
</ul>

<a id='intro'></a>
## Introduction

<br><br>
> The focus of this analysis is to explore whether or not patients show up for their appointments.



#### Import libraries and load dataset

In [2]:
# Import necessary libraries and packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# load dataset to a pandas dataframe
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')

In [4]:
# view top of dataset
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [5]:

# view a random sample of 5 rows
df.sample(5)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
87417,7644285000000.0,5759516,M,2016-06-01T13:08:28Z,2016-06-01T00:00:00Z,6,GRANDE VITÓRIA,0,0,0,0,0,0,No
57300,1228913000000.0,5661143,M,2016-05-05T07:14:40Z,2016-05-05T00:00:00Z,57,MARIA ORTIZ,0,0,0,0,0,0,No
108712,9326398000000.0,5775246,M,2016-06-06T09:12:12Z,2016-06-06T00:00:00Z,28,JABOUR,0,0,0,0,0,0,No
12948,71185350000000.0,5656030,F,2016-05-04T07:32:02Z,2016-05-04T00:00:00Z,37,CONSOLAÇÃO,0,0,0,0,0,0,No
1162,3821870000000.0,5638919,M,2016-04-29T08:35:06Z,2016-04-29T00:00:00Z,1,INHANGUETÁ,0,0,0,0,0,0,No


In [6]:
# view number of rows and columns
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns')

Dataset has 110527 rows and 14 columns


In [7]:
# view summary statistics for the numeric columns
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


### Initial observations
> * The dataset contains over 110k observations and 14 features.
> * The __"No-show"__ feature appears to be the only dependent variable, most likely having 'Yes' and 'No' entries (where 'Yes' implies the patient did not show up for the appointment, while 'No' indicates that they did show up).
> * The minimum age is -1, which is not a valid age. This requires further investigation.
> * The **PatientId** and **AppointmentID** columns do not appear to be helpful to this analysis. It isn't obvious whether the dataset accounts for multiple appointment bookings for the same patient.
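
The observations above can each be checked with a short query. The sketch below uses a small toy dataframe as a stand-in for `df` (the rows are illustrative and not taken from the dataset); the same expressions should apply to the real dataframe, assuming the column names shown in the `df.head()` output.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only
toy = pd.DataFrame({
    "PatientId": [1.0, 1.0, 2.0, 3.0],
    "AppointmentID": [10, 11, 12, 13],
    "Age": [62, 62, -1, 8],
    "No-show": ["No", "Yes", "No", "No"],
})

# Distinct values of the dependent variable
no_show_values = sorted(toy["No-show"].unique())

# Rows with an impossible (negative) age
negative_age_rows = toy[toy["Age"] < 0]

# Patients who appear in more than one appointment row
appointments_per_patient = toy["PatientId"].value_counts()
repeat_patients = appointments_per_patient[appointments_per_patient > 1]

print(no_show_values)
print(len(negative_age_rows))
print(repeat_patients.index.tolist())
```

If `repeat_patients` is non-empty on the real data, the dataset does record multiple bookings per patient.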

#### Key Note:
* Note that the analysis in this project is strictly descriptive and therefore has no predictive power.
* In this analysis, I will focus on the **no_show**, **age**, **hypertension**, **sms_received**, **gender**, **scheduled_day**, and **appointment_day** variables.

### Questions <br>

From the dataset documentation and preliminary observations above, here are some questions I hope to explore in this project.

> * My primary question is whether the features I have chosen to explore are associated with a patient's likelihood of missing or showing up for their appointment, and which features show the strongest association with no-shows.
> * Which gender is more likely to miss an appointment, and how much of the difference is due to that gender's prevalence in the dataset? Comparing proportions rather than raw counts should help here.
> * What is the relationship between longer waiting times (time between the scheduling date and the actual appointment) and the likelihood of missing an appointment?
> * Are older or younger patients more likely to miss their appointments?
> * Do the patients who received an SMS tend to show up for their appointment(s)?
> * What is the relationship between hypertension and the tendency to not show for an appointment? 
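
Two of these questions rely on derived quantities: a no-show proportion per group, and a waiting time in days. The sketch below shows one way they might be computed, on a toy dataframe with illustrative rows. The `missed` and `waiting_days` column names are hypothetical helpers, not columns in the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M"],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T16:08:27Z",
                     "2016-04-20T16:19:04Z", "2016-04-29T17:29:31Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z",
                       "2016-05-04T00:00:00Z", "2016-05-02T00:00:00Z"],
    "No-show": ["No", "No", "Yes", "Yes"],
})

# Encode the outcome as 1 = missed the appointment, 0 = showed up
toy["missed"] = (toy["No-show"] == "Yes").astype(int)

# The mean of a 0/1 column is the proportion of no-shows within each group,
# which is not affected by how many rows each gender has
rate_by_gender = toy.groupby("Gender")["missed"].mean()

# AppointmentDay carries no time of day, so compare calendar dates only
scheduled = pd.to_datetime(toy["ScheduledDay"]).dt.normalize()
appointment = pd.to_datetime(toy["AppointmentDay"]).dt.normalize()
toy["waiting_days"] = (appointment - scheduled).dt.days

print(rate_by_gender)
print(toy["waiting_days"].tolist())
```

Dropping the time of day before subtracting avoids negative waiting times for same-day appointments, where the scheduling timestamp falls later in the day than the midnight appointment timestamp.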

<a id='wrangling'></a>
## Data Wrangling  <br>

In this section, the data is assessed and cleaned, with attention to missing and anomalous values.



### General Properties/Assessment
<br>
Here, I want to get a high-level overview of the dataset characteristics/properties, and also try to identify any faults or inconsistencies in the data quality and overall structure. I will be looking out for data types, missing values, duplicates, errant/outlier values, etc.
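
A minimal sketch of the assessment checks described above, run on a toy dataframe with illustrative rows; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are illustrative only
toy = pd.DataFrame({
    "AppointmentID": [10, 11, 11, 13],
    "Age": [62, 56, 56, None],
    "No-show": ["No", "No", "No", "Yes"],
})

# Data type of each column
column_types = toy.dtypes

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(column_types)
print(missing_per_column)
print(duplicate_rows)
```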



### Data Cleaning

