# Project: Investigate a Dataset - Medical Appointment No Shows

## Table of Contents
<ul>
<li><a href="#intro">1. Introduction</a></li>
<li><a href="#wrangling">2. Data Wrangling</a></li>
<li><a href="#eda">3. Exploratory Data Analysis</a></li>
<li><a href="#conclusions">4. Conclusions</a></li>
<li><a href="#references">5. References</a></li>
</ul>

<a id='intro'></a>
## 1. Introduction

This analysis is part of the Udacity Data Analysis Nanodegree Program and aims to explore a dataset of medical appointment no-shows in Brazil. The analysis is divided into four main parts:
1. *Introduction*, where the initial info is provided and the problem is set.
2. *Data Wrangling*, where the data is cleaned and prepared for analysis.
3. *Exploratory Data Analysis*, where key patterns are to be found.
4. *Conclusions*, in which the insights are described.


### Dataset Description 
This dataset contains over 100K medical appointments held in Brazil. The information comes from the Brazilian public health system, known as SUS (*Sistema Único de Saúde*, Unified Health System), one of the largest health systems in the world and entirely free of charge. It is used by over 220 million Brazilians.

The dataset has 14 columns:
1. **PatientID**, Identification of a patient.
2. **AppointmentID**, Identification of each appointment.
3. **Gender**, "M" for Male and "F" for Female.
4. **ScheduledDay**, the day the patient set up the appointment.
5. **AppointmentDay**, the day of the appointment.
6. **Age**, age in years.
7. **Neighbourhood**, the location of the appointment.
8. **Scholarship**, whether the patient is enrolled in *Bolsa Família*.
9. **Hipertension**, True or False.
10. **Diabetes**, True or False.
11. **Alcoholism**, True or False.
12. **Handcap**, True or False.
13. **SMS_received**, True or False.
14. **No-show**, categorical: "Yes" if the patient missed the appointment, "No" if the patient attended.


### Question(s) for Analysis


1. What is the mean time between the scheduled day and the appointment day?
    - Is there a direct relation between this time difference and no-shows?
2. Is SMS a key factor in reducing no-shows?
3. Regarding location, which neighbourhood has the highest no-show rate?

<a id='wrangling'></a>
## 2. Data Wrangling

In this section of the report, the data is loaded, checked for cleanliness, and cleaned as necessary.

In [1]:
# import necessary packages for analysis and data visualization
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [2]:
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [4]:
df.duplicated().value_counts()

False    110527
dtype: int64

As the previous outputs show, there are no missing values and no duplicated rows.
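The two checks above can also be made explicit. The sketch below is only an illustration on a made-up toy frame (the values and the `subset` columns are assumptions, not part of the dataset): it counts missing values per column and contrasts full-row duplicates with duplicates on a subset of columns, which `df.duplicated()` alone does not detect.

```python
import pandas as pd

# Toy data standing in for the real file (values are made up for illustration)
toy = pd.DataFrame({
    "PatientId": [1.0, 1.0, 2.0, 3.0],
    "AppointmentID": [10, 11, 12, 13],
    "Age": [30, 30, 45, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Rows that are identical in every column
full_row_duplicates = toy.duplicated().sum()

# Rows that repeat once the appointment ID is ignored
partial_duplicates = toy.duplicated(subset=["PatientId", "Age"]).sum()

print(missing_per_column)
print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicates ignoring AppointmentID: {partial_duplicates}")
```

Because every AppointmentID is distinct, a full-row check can report no duplicates even when a patient appears several times with otherwise identical values.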

In [5]:
# Change column names to lower case
df.columns = df.columns.str.lower()

# Rename misspelled column names
df.rename(columns={'hipertension': 'hypertension',
                   'handcap': 'handicap',
                   'no-show': 'noshow'},
          inplace=True)

### Categorical Data

#### PatientId
The column is loaded as *float64*, but it holds the identifier of the patient, so numerical operations on it are not meaningful.

In [6]:
# Change the dtype of PatientId to string
# The PatientId is changed to int first, just to remove the decimals and have a cleaner look
df['patientid'] = df['patientid'].astype('int64').astype(str)

In [7]:
# Check the number of unique patient IDs.
df.patientid.nunique()

62299

The number of unique patients is lower than the number of appointments, which indicates that some patients made more than one appointment.
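To see how appointments are spread across patients, `value_counts` can be used. This is a minimal sketch on made-up patient IDs (the toy values are assumptions), not a result from the dataset.

```python
import pandas as pd

# Toy patient IDs; "a" and "b" appear more than once
toy = pd.DataFrame({"patientid": ["a", "b", "a", "c", "a", "b"]})

# Number of appointments for each patient, most frequent first
appointments_per_patient = toy["patientid"].value_counts()

# Number of patients with more than one appointment
repeat_patients = (appointments_per_patient > 1).sum()

print(appointments_per_patient)
print(f"Patients with more than one appointment: {repeat_patients}")
```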

#### Appointment ID
As with the patient ID, this column should be a string.

In [8]:
# Change the dtype of AppointmentID
df['appointmentid'] = df['appointmentid'].astype(str)

In [9]:
df['appointmentid'].nunique()

110527

This column contains only unique values, generated by a system to uniquely identify appointments. For the analysis, this column is used as the index, which makes the index more meaningful.

In [10]:
df.set_index('appointmentid', drop=True, inplace=True)

#### Gender and No-Show
Both columns are checked for inconsistencies and converted to the categorical dtype for memory efficiency.

In [11]:
#Checking for the different values
df.gender.unique()

array(['F', 'M'], dtype=object)

In [12]:
# Changing the dtype to category
df['gender'] = df['gender'].astype('category')

In [13]:
df.noshow.unique()

array(['No', 'Yes'], dtype=object)

In [14]:
# Changing the dtype to category
df['noshow'] = df['noshow'].astype('category')

#### Neighbourhood

In [15]:
df.neighbourhood.unique()

array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
       'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
       'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
       'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
       'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
       'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
       'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
       'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
       'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
       'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
       'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
       'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
       'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
       'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANTA LUÍZA',
       'SANTA LÚCIA', 'BARRO VERMELHO', 'ESTRELINHA', 'FORTE SÃO 

In [16]:
df.neighbourhood.nunique()

81

The Neighbourhood column has 81 unique values, and there is no need to change anything.

### Datetime data
Both columns, AppointmentDay and ScheduledDay, should be in datetime format.

In [17]:
# Convert both columns to datetime and normalize them to keep only the date information
df['appointmentday']= pd.to_datetime(df['appointmentday']).dt.normalize()
df['scheduledday'] = pd.to_datetime(df['scheduledday']).dt.normalize()

# Create a month column for easy data manipulation
df['appointmentday_month'] = df['appointmentday'].dt.month_name()
df['scheduledday_month'] = df['scheduledday'].dt.month_name()

It can be presumed that every ScheduledDay falls on or before its AppointmentDay.

To check this, a new column with the difference between the two dates is created and used as a filter.

In [18]:
# Create a new column 'betweendays'
df['betweendays'] = df['appointmentday'] - df['scheduledday']

# Create a column with the value as a float for easy data manipulation
df['betweendays_float'] = df['betweendays'].dt.total_seconds() / (24 * 60 * 60)

# Check for possible inconsistencies
df['betweendays'].describe()

count                        110527
mean     10 days 04:24:31.828602965
std      15 days 06:07:11.673762786
min               -6 days +00:00:00
25%                 0 days 00:00:00
50%                 4 days 00:00:00
75%                15 days 00:00:00
max               179 days 00:00:00
Name: betweendays, dtype: object

In [19]:
# Check how many rows have inconsistent data.
df[df['betweendays'].dt.days < 0].shape

(5, 17)

Since only 5 rows have invalid data, they are dropped from the dataframe.

In [20]:
df.drop(
    df[df['betweendays'].dt.days < 0].index,
    inplace=True
)

Let's transform the float values in *betweendays_float* into a categorical type. Grouping the data this way will help in further analysis.

In [21]:
def categorical_betweendays(row):
    if row['betweendays_float'] == 0:
        return '0 Days'
    elif row['betweendays_float'] <=3:
        return '1-3 Days'
    elif row['betweendays_float'] <=7:
        return '4-7 Days'
    elif row['betweendays_float'] <=14:
        return '8-14 Days'
    elif row['betweendays_float'] <=30:
        return '15-30 Days'
    else:
        return '30+ Days'

In [22]:
df['betweendays_cate'] = df.apply(categorical_betweendays, axis=1)
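
A vectorized alternative to the row-by-row function is `pd.cut`. The sketch below is an assumption-laden illustration, not the notebook's method: it uses made-up waiting times and rounds them to whole days before binning. The rounding step matters because recorded outputs later in this notebook show a 3-day wait labelled '4-7 Days' and a maximum of 179.00000000000003, which suggests that the float conversion can land slightly above the whole-day value.

```python
import pandas as pd

# Hypothetical waiting times; the 3-day value carries a tiny float error on purpose
days_float = pd.Series([0.0, 1.0, 3.0000000000000004, 7.0, 14.0, 30.0, 45.0])

# Rounding to whole days removes the float noise before binning
days = days_float.round().astype(int)

# Bin edges matching the categories used in this notebook
bins = [-1, 0, 3, 7, 14, 30, float("inf")]
labels = ["0 Days", "1-3 Days", "4-7 Days", "8-14 Days", "15-30 Days", "30+ Days"]

# right=True (the default) makes each bin include its upper edge, e.g. (0, 3]
categories = pd.cut(days, bins=bins, labels=labels)

print(categories.tolist())
```

`pd.cut` also returns an ordered categorical, so the categories sort in their natural order rather than alphabetically.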

### Numeric data
For the numeric data, the columns 'Scholarship', 'Hypertension', 'Diabetes', 'Alcoholism', 'Handicap' and 'SMS_received' are expected to contain only the integers zero and one.

In [23]:
df[['scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received']].describe()

| | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received |
|---|---|---|---|---|---|---|
| count | 110522.0 | 110522.0 | 110522.0 | 110522.0 | 110522.0 | 110522.0 |
| mean | 0.09827 | 0.197255 | 0.071868 | 0.030401 | 0.022231 | 0.32104 |
| std | 0.297681 | 0.397928 | 0.25827 | 0.171689 | 0.161493 | 0.466878 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


As can be seen, all columns contain only zeros and ones, except Handicap, where the maximum value is four.

As stated in [this discussion on Kaggle](https://www.kaggle.com/datasets/joniarroba/noshowappointments/discussion/29699), the value is the number of disabilities a person has. For example, a person who is blind and has difficulty walking has a total value of 2.

Given this, we check how many rows have a value greater than one and leave the column as it is.

In [24]:
print(
    f"""
Number of appointments where handicap > 1: {df[df.handicap > 1].shape[0]}
Percentage: {df[df.handicap > 1].shape[0] / df.shape[0] * 100:.3f} % 
    """
)


Number of appointments where handicap > 1: 199
Percentage: 0.180 % 
    


Only approximately 0.2% of the rows have more than one disability.

#### Age column
For the Age column, we need to check whether there are any inconsistent values.

In [25]:
df.age.describe()

count    110522.000000
mean         37.089041
std          23.110064
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: age, dtype: float64

We see outliers: there is a negative value and there are values over 100.

The negative value has no meaning, so it will be dropped.

As for the values over 100, although they are hard to believe, we will check the number of cases and see how they impact the data.

In [26]:
# Check for ages below 0 years
df[df.age < 0]

| appointmentid | patientid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | noshow | appointmentday_month | scheduledday_month | betweendays | betweendays_float | betweendays_cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5775010 | 465943158731293 | F | 2016-06-06 00:00:00+00:00 | 2016-06-06 00:00:00+00:00 | -1 | ROMÃO | 0 | 0 | 0 | 0 | 0 | 0 | No | June | June | 0 days | 0.0 | 0 Days |


In [27]:
# Drop that row
df.drop(
    df[df.age < 0].index,
    inplace= True
)

In [28]:
df[df.age > 100]

| appointmentid | patientid | gender | scheduledday | appointmentday | age | neighbourhood | scholarship | hypertension | diabetes | alcoholism | handicap | sms_received | noshow | appointmentday_month | scheduledday_month | betweendays | betweendays_float | betweendays_cate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5651757 | 976294799775439 | F | 2016-05-03 00:00:00+00:00 | 2016-05-03 00:00:00+00:00 | 102 | CONQUISTA | 0 | 0 | 0 | 0 | 0 | 0 | No | May | May | 0 days | 0.0 | 0 Days |
| 5700278 | 31963211613981 | F | 2016-05-16 00:00:00+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | May | 3 days | 3.0 | 4-7 Days |
| 5700279 | 31963211613981 | F | 2016-05-16 00:00:00+00:00 | 2016-05-19 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | May | 3 days | 3.0 | 4-7 Days |
| 5562812 | 31963211613981 | F | 2016-04-08 00:00:00+00:00 | 2016-05-16 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | Yes | May | April | 38 days | 38.0 | 30+ Days |
| 5744037 | 31963211613981 | F | 2016-05-30 00:00:00+00:00 | 2016-05-30 00:00:00+00:00 | 115 | ANDORINHAS | 0 | 0 | 0 | 0 | 1 | 0 | No | May | May | 0 days | 0.0 | 0 Days |
| 5751563 | 234283596548 | F | 2016-05-31 00:00:00+00:00 | 2016-06-02 00:00:00+00:00 | 102 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 0 | No | June | May | 2 days | 2.0 | 1-3 Days |
| 5717451 | 748234579244724 | F | 2016-05-19 00:00:00+00:00 | 2016-06-03 00:00:00+00:00 | 115 | SÃO JOSÉ | 0 | 1 | 0 | 0 | 0 | 1 | No | June | May | 15 days | 15.0 | 15-30 Days |


For ages over 100, there are few cases, only 7.

Looking at the PatientId, 4 of those appointments belong to the same patient, all with appointment days in May 2016.

<a id='eda'></a>
## 3. Exploratory Data Analysis

As a step to prepare for further analysis, a helper function is defined to explore the columns.

In [45]:
def get_statistcs(df, col_name, height=400, width=400):
    """
    Print summary statistics for a column in a pandas DataFrame and show a basic violin plot.
        
    Args:
        df: pandas DataFrame.
        col_name: str, column name. Numeric dtype only.
        height: int, graph height. Default = 400
        width: int, graph width. Default = 400
    Return:
        None. The statistics are printed and the violin plot is displayed.
    """
    print(f"""
    Mean: {df[col_name].mean():.3f} 
    Std deviation: {np.std(df[col_name]):.3f} 
    Max value: {df[col_name].max()}
    Min value: {df[col_name].min()}
    Median: {df[col_name].median()}

    Number of unique values: {df[col_name].nunique()}
""")
    
    fig = px.violin(df, y=col_name, box=True, height=height, width=width, points='outliers')
    fig.show()


### 1. What is the mean time between Scheduled day and the appointment day? 

In [46]:
get_statistcs(df, 'betweendays_float')


    Mean: 10.184 
    Std deviation: 15.255 
    Max value: 179.00000000000003
    Min value: 0.0
    Median: 4.0

    Number of unique values: 129



In [30]:
# Find the betweendays mean
print('---')
print(f'Mean time between schedule and appointment:\n{df.betweendays.mean()}')
print('---')

# Plot this mean for each appointment month
df_fig = pd.DataFrame(df.groupby('appointmentday_month').betweendays_float.mean()).reset_index()

fig = px.bar(
    df_fig,
    x = 'appointmentday_month',
    y = 'betweendays_float',
    height= 400,
    width= 500,
    title='Mean days between Schedule and Appointment',
    text_auto='.2f'
)
fig.show()

---
Mean time between schedule and appointment:
10 days 04:25:27.412889858
---


As the graph and the statistics above show, the mean time between the scheduled day and the appointment day is about **10 days**. The standard deviation is about 15 days, larger than the mean, so the data is widely dispersed. Together with a median of only 4 days, this suggests that the data does not follow a normal distribution.

There is no significant variation over time.
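
A mean well above the median, as with the mean of about 10 days against the median of 4 days here, is a typical sign of a right-skewed distribution. The sketch below illustrates this on made-up waiting times (the values are assumptions, not taken from the dataset) using the `skew` method of a pandas Series.

```python
import pandas as pd

# Made-up waiting times in days: many short waits and a few very long ones
waits = pd.Series([0, 0, 0, 1, 2, 4, 4, 7, 15, 30, 60])

summary = {
    "mean": waits.mean(),
    "median": waits.median(),
    "skew": waits.skew(),  # positive values indicate a long right tail
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For a distribution like this, the median describes a typical wait better than the mean does.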

#### 1.1. Is there a direct relation between this time difference and the no-shows?

In [31]:
# Group data by noshow
df_fig = df.groupby('noshow')['betweendays_float'].mean().reset_index()

fig = px.bar(
    df_fig,
    x = 'noshow',
    y = 'betweendays_float',
    height= 400,
    width= 500,
    title='Mean days between by noshow',
    text_auto='.2f'
)
fig.show()

As the graph above suggests, there is a relation between no-shows and the number of days between scheduling and the appointment.

In [55]:
# Count appointments by waiting-time category and noshow
df_fig = df.groupby(['betweendays_cate','noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='betweendays_cate',
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
df_fig.sort_values(by='noshow_rate', inplace=True)

fig = px.bar(
    df_fig,
    x = 'betweendays_cate',
    y = 'noshow_rate',
    # color = 'noshow',
    # barmode='group',
    height= 400,
    width= 500,
    title='Noshow rate over between days',
    text_auto='.2f'
)
fig.show()

The graph shows that the **no-show rate increases with the time between scheduling and the appointment day**. The lowest value occurs when the appointment takes place on the same day it is scheduled, where the no-show rate is less than 5%. In this case, the no-shows may be associated with emergencies.
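
The no-show rate per category can also be computed without the count and pivot steps, since the mean of a boolean column is the share of `True` values. The sketch below runs on a made-up toy frame (the values are assumptions) and is meant as an equivalent shortcut, not as a replacement for the cell above.

```python
import pandas as pd

# Toy appointments: four same-day and two long-wait appointments
toy = pd.DataFrame({
    "betweendays_cate": ["0 Days", "0 Days", "0 Days", "0 Days", "30+ Days", "30+ Days"],
    "noshow": ["No", "No", "No", "Yes", "Yes", "No"],
})

# True where the patient missed the appointment; the group mean is the no-show share
rate = (
    toy["noshow"].eq("Yes")
    .groupby(toy["betweendays_cate"])
    .mean()
    .mul(100)
    .rename("noshow_rate")
)

print(rate)
```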

### 2. Is SMS a key factor to reduce no-shows?

In [33]:
get_statistcs(df, 'sms_received')


    Mean: 0.321 
    Max value: 1
    Min value: 0
    Median: 0.0
    
    Number of unique values: 2



The violin plot above is not a meaningful visualization, since SMS data is effectively categorical. The relevant value is the mean, which shows the share of appointments for which an SMS was received, about 32%.

Therefore, it is interesting to analyze whether there is any relationship with location, as SMS may not be sent in some locations.

In [34]:
df_fig = df.groupby(['sms_received', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='sms_received',
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

fig = px.bar(
    df_fig,
    x = 'sms_received',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Noshow rate if sms_received',
    text_auto='.1f'
)
fig.show()

The result is unexpected. To investigate further, the no-show rate is broken down by SMS status and by the time between scheduling and the appointment.
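
Before comparing no-show rates, it can help to check whether receiving an SMS is itself tied to the waiting time. One way is a cross-tabulation with `pd.crosstab`, sketched below on made-up toy data (the values are assumptions); `normalize="index"` turns the counts into shares within each waiting-time category.

```python
import pandas as pd

# Toy data: no SMS for same-day appointments, mostly SMS for long waits
toy = pd.DataFrame({
    "betweendays_cate": ["0 Days"] * 3 + ["15-30 Days"] * 4,
    "sms_received": [0, 0, 0, 0, 1, 1, 1],
})

# Share of appointments with and without SMS inside each waiting-time category
share = pd.crosstab(
    toy["betweendays_cate"],
    toy["sms_received"],
    normalize="index",
)

print(share)
```

If the share of SMS differs strongly between categories, a plain comparison of no-show rates by SMS status mixes the effect of the SMS with the effect of the waiting time.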

In [35]:
df_fig = df.groupby(['sms_received','betweendays_cate', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index=['sms_received','betweendays_cate'],
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

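# Split the rates by SMS status so that each trace plots its own rows
received = df_fig[df_fig['sms_received'] == 1]
not_received = df_fig[df_fig['sms_received'] == 0]

fig = go.Figure(
    data=[
        go.Bar(x=received['betweendays_cate'], y=received['noshow_rate'], name="SMS_received"),
        go.Bar(x=not_received['betweendays_cate'], y=not_received['noshow_rate'], name="SMS_not_received")
    ]
)
fig.update_layout(barmode='group')
fig.show()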

In [36]:
df_fig

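| | sms_received | betweendays_cate | No | Yes | noshow_rate |
|---|---|---|---|---|---|
| 0 | 0 | 0 Days | 36770 | 1792 | 4.647062 |
| 1 | 0 | 1-3 Days | 9223 | 2715 | 22.742503 |
| 2 | 0 | 15-30 Days | 4325 | 2526 | 36.87053 |
| 3 | 0 | 30+ Days | 2479 | 1489 | 37.525202 |
| 4 | 0 | 4-7 Days | 6386 | 2313 | 26.589263 |
| 5 | 0 | 8-14 Days | 3326 | 1695 | 33.758215 |
| 6 | 1 | 0 Days | 0 | 0 | NaN |
| 7 | 1 | 1-3 Days | 0 | 0 | NaN |
| 8 | 1 | 15-30 Days | 7385 | 3135 | 29.80038 |
| 9 | 1 | 30+ Days | 4474 | 1936 | 30.202808 |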


In [37]:
df_fig = df.groupby(['sms_received','betweendays_cate', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index=['sms_received','betweendays_cate'],
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

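# One bar per waiting-time category; without color and barmode the rates would be stacked
fig = px.bar(
    df_fig,
    x = 'sms_received',
    y = 'noshow_rate',
    color = 'betweendays_cate',
    barmode = 'group',
    height= 400,
    width= 500,
    title='Noshow rate if sms_received',
    text_auto='.1f'
)
fig.show()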

### 3. About location, which neighbourhood has the highest no-show rate?

In [56]:
get_statistcs(df_fig, 'noshow_rate')


    Mean: 24.738 
    Std deviation: 9.751 
    Max value: 33.002505299672386
    Min value: 4.647061874384108
    Median: 27.723197102068045

    Number of unique values: 6



In [59]:
# Count appointments by neighbourhood and noshow
df_fig = df.groupby(['neighbourhood', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='neighbourhood',
    columns='noshow',
    values='patientid'
).reset_index()

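# Keep only neighbourhoods with more than 10 appointments
# (boolean indexing drops the other rows; `where` would keep them as NaN rows)
df_fig = df_fig[df_fig['Yes'] + df_fig['No'] > 10].copy()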

# Create noshow_rate column
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

# Sort values and get the top 10
df_fig1 = df_fig.sort_values(by='noshow_rate', ascending=False).head(10)

fig = px.bar(
    df_fig1,
    x = 'neighbourhood',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Top 10 neighbourhood by noshow rate',
    text_auto='.1f'
)
fig.show()

In [60]:
df_fig2 = df_fig.sort_values(by='noshow_rate', ascending=False).tail(10)

fig = px.bar(
    df_fig2,
    x = 'neighbourhood',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Bottom 10 neighbourhood by noshow rate',
    text_auto='.1f'
)
fig.show()

In [62]:
get_statistcs(df_fig, 'noshow_rate')


    Mean: 19.887 
    Std deviation: 3.080 
    Max value: 28.91849529780564
    Min value: 8.571428571428571
    Median: 19.758848697005057

    Number of unique values: 77



When analyzing the no-show rate by neighbourhood, some regions have a higher rate than others. However, the summary statistics and the violin plot show less dispersion than in the previous analyses (a standard deviation of about 3 percentage points around a mean of about 20%), with a shape closer to a normal distribution.
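
The per-neighbourhood rate and the minimum-volume filter can be combined in one grouped aggregation. The sketch below uses made-up data and a hypothetical threshold of 2 appointments so that the toy example stays small (the notebook uses 10); the column names `missed`, `appointments` and `min_appointments` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy appointments in three made-up neighbourhoods
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 4 + ["B"] * 3 + ["C"],
    "noshow": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "Yes"],
})

# Appointment count and no-show share per neighbourhood in a single pass
stats = (
    toy.assign(missed=toy["noshow"].eq("Yes"))
    .groupby("neighbourhood")["missed"]
    .agg(appointments="count", noshow_rate="mean")
)
stats["noshow_rate"] *= 100

# Hypothetical threshold for the toy data; the notebook filters at 10
min_appointments = 2
filtered = (
    stats[stats["appointments"] > min_appointments]
    .sort_values("noshow_rate", ascending=False)
)

print(filtered)
```

Keeping the appointment count next to the rate makes it easier to judge how much weight a high or low rate deserves.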

---

<a id='conclusions'></a>
## 4. Conclusions

The analysis addressed the three questions posed in the introduction:

1. **Time between scheduling and appointment.** The mean is about 10 days, with a median of 4 days and a standard deviation of about 15 days. The no-show rate rises with this waiting time: it is below 5% for same-day appointments and reaches about 33% in the highest category.
2. **SMS.** About 32% of the appointments had an SMS received. The comparison of no-show rates by SMS status gave an unexpected result, and the breakdown by waiting-time category shows no SMS for same-day appointments, so the two factors are hard to separate with this analysis.
3. **Neighbourhood.** Among neighbourhoods with more than 10 appointments, the no-show rate has a mean of about 19.9% and a standard deviation of about 3.1 percentage points, ranging from about 8.6% to 28.9%.

**Limitations.** No statistical tests were performed, so these findings are descriptive and do not imply causation. The rows with negative waiting times or a negative age were dropped, and the ages above 100 were kept without verification.


<a id='references'></a>
## 5. References
- [Applying heatmaps for categorical data analysis](https://www.kaggle.com/code/tsilveira/applying-heatmaps-for-categorical-data-analysis/notebook)
- [Predict Show/NoShow - EDA+Visualization+Model](https://www.kaggle.com/code/samratp/predict-show-noshow-eda-visualization-model)

- [Sistema Único de Saúde (Wikipedia)](https://en.wikipedia.org/wiki/Sistema_%C3%9Anico_de_Sa%C3%BAde)

- [Project data by Kaggle](https://www.kaggle.com/datasets/joniarroba/noshowappointments)
