# Project: Investigate a Dataset (No-show appointments)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
>This analysis addresses one important question:
(What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?)
The analysis goes through several stages: data wrangling and data cleaning, which together prepare the data for the third stage,
exploratory data analysis, where we look for relationships in the data. Finally, we draw conclusions that can help us improve our performance.

>Stay focused!

<a id='wrangling'></a>
## Data Wrangling
> In this step we will import the libraries that will help us in our analysis and explore the data to find out which operations are needed to make it clean and tidy.

In [None]:
#importing libraries.
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#loading the dataset using pandas.
df = pd.read_csv("../input/no-show-appointment/noshowappointments-kagglev2-may-2016.csv")

> After loading the data we need to assess it.

In [None]:
#showing first 5 rows in data 
df.head()

In [None]:
#showing important information about data 
df.info()

In [None]:
#showing some statistics about data
df.describe()

In [None]:
#showing the number of duplicated rows
df.duplicated().sum()

>Some queries to make the data clearer.

In [None]:
#showing the Python type of the first value in some columns
#(printing each one, because a notebook cell only displays its last expression)
for col in ['Gender', 'ScheduledDay', 'AppointmentDay', 'Neighbourhood', 'No-show']:
    print(col, type(df[col][0]))

In [None]:
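#showing first element in some columns
#(printing each one, because a notebook cell only displays its last expression)
print(df.ScheduledDay[0])
print(df.AppointmentDay[0])
print(df.PatientId[0])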

In [None]:
#checking whether any ages are equal to 0
#(Age is numeric, so compare with the number 0, not the string '0')
df.query("Age == 0")
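
A quick way to quantify the suspicious ages is to count them. The sketch below assumes a DataFrame with a numeric `Age` column; the helper name `age_anomalies` and the toy data are hypothetical and exist only for illustration.

```python
import pandas as pd

def age_anomalies(frame: pd.DataFrame) -> dict:
    """Count rows whose Age is exactly zero or below zero."""
    return {
        "zero": int((frame["Age"] == 0).sum()),
        "negative": int((frame["Age"] < 0).sum()),
    }

# toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({"Age": [0, 23, -1, 45, 0]})
print(age_anomalies(toy))
```

In the notebook, the same helper could be called as `age_anomalies(df)` to see how many rows each cleaning decision would touch.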

### Decisions after assessing
- remove [ PatientId , AppointmentID , ScheduledDay , AppointmentDay ] because these columns are not used in our analysis
- rename the "No-show" column to "No_show" so that it follows the rules for naming variables
- rename the "Hipertension" column to "Hypertension" because this is a typo
- replace ages equal to zero with the mean age, on the assumption that an age of zero is not a valid value
- make negative ages positive because an age cannot be negative

### Data Cleaning 
> After assessing our data I found some problems that need to be cleaned.
In this step we will clean the data so that it is ready for exploration.


In [None]:
#make a copy of the data to keep the original dataset safe.
df_clean = df.copy()

> It's time to implement the decisions!

In [None]:
#delete columns using pandas.
df_clean.drop(['PatientId','AppointmentID','ScheduledDay','AppointmentDay'] , axis=1 , inplace = True)

In [None]:
#rename column to be easy to understand.
df_clean.rename(columns={'No-show': 'No_show'}, inplace=True)

In [None]:
#rename column to fix the typo in its name.
df_clean.rename(columns={'Hipertension': 'Hypertension'}, inplace=True)

In [None]:
#replacing ages with value 0 with the mean of ages
#(assigning the result back, because inplace=True on a selected column may not update df_clean)
df_clean['Age'] = df_clean['Age'].replace(0, df_clean['Age'].mean())

In [None]:
#making negative ages positive
df_clean['Age'] = df_clean['Age'].abs()

## Testing 

In [None]:
#showing important information about data 
df_clean.info()

In [None]:
#showing some statistics about data
df_clean.describe()
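
Beyond reading `info()` and `describe()` by eye, the cleaning decisions can be checked with explicit assertions. This is a minimal sketch, assuming the column names used in this notebook; `check_cleaned` and the toy frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if one of the cleaning decisions was not applied."""
    dropped = {"PatientId", "AppointmentID", "ScheduledDay", "AppointmentDay"}
    assert dropped.isdisjoint(frame.columns), "dropped columns are still present"
    assert "No_show" in frame.columns and "No-show" not in frame.columns
    assert "Hypertension" in frame.columns and "Hipertension" not in frame.columns
    assert (frame["Age"] > 0).all(), "zero or negative ages remain"

# toy frame, made up for illustration, shaped like the cleaned data
toy_clean = pd.DataFrame({
    "Age": [23.0, 37.1, 1.0],
    "Hypertension": [0, 1, 0],
    "No_show": ["No", "Yes", "No"],
})
check_cleaned(toy_clean)  # passes silently when every check holds
```

In the notebook, calling `check_cleaned(df_clean)` after the cleaning cells would fail loudly if any step had been skipped.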

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question
- (What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?)

>First, we need some general information about the data and the attendance rate.

In [None]:
#showing some histograms which explain some information about data 
df_clean.hist(figsize=(12,12));

In [None]:
#showing ratio of attendance
df_clean.No_show.value_counts().plot.bar(color=['green','red']);
#naming the title of plot
plt.title("ratio of attendance")
#naming xlabel (the bars are the No_show values, where "No" means the patient showed up)
plt.xlabel("No_show (No = showed up)")
#naming ylabel
plt.ylabel("count");

- The attendance rate is higher than the absence rate: roughly four times as many patients showed up as did not
- Most patients did not suffer from Alcoholism or Handcap
- About 10% of patients are enrolled in the Brazilian welfare program
- About 19% of patients suffer from hypertension
- The number of patients who received an SMS is about half the number who did not
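
The percentages quoted above can be read off the histograms, but they can also be computed directly. The sketch below shows one way to do it, assuming 0/1 indicator columns and a `No_show` column as in this notebook; the helper names `share_of_ones` and `class_shares` and the toy data are hypothetical.

```python
import pandas as pd

def share_of_ones(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Percentage of rows equal to 1 in each 0/1 indicator column."""
    return frame[columns].eq(1).mean().mul(100).round(1)

def class_shares(frame: pd.DataFrame, column: str = "No_show") -> pd.Series:
    """Percentage of rows in each class of `column`."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)

# toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Scholarship": [0, 0, 1, 0],
    "SMS_received": [1, 0, 0, 1],
    "No_show": ["No", "No", "Yes", "No"],
})
print(share_of_ones(toy, ["Scholarship", "SMS_received"]))
print(class_shares(toy))
```

On the real data this would be called with `df_clean` and the indicator columns of interest, for example `share_of_ones(df_clean, ["Scholarship", "Hypertension", "SMS_received"])`.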


>Secondly, we want to know the relationship between each column and showing up.

In [None]:
# helper that plots, for each value of column x, the proportion of patients in each No_show class
def relation(x):
    (df_clean.groupby([x, 'No_show']).size()
             .unstack(x)
             # normalise each column so the bars show proportions, not raw counts
             .apply(lambda col: col / col.sum())
             .plot.bar(title="discovering reasons of show up and no show", rot=0, width=.9,
                       color=['blue', 'orange'], ylabel="proportion of patients"));

In [None]:
relation('Gender')

In [None]:
relation("Scholarship")

In [None]:
relation('Diabetes')       

In [None]:
relation("Hypertension")

In [None]:
relation("Alcoholism")

> There is no clear relationship between showing up and Gender, Scholarship, Diabetes, Hypertension, Alcoholism, or Handcap.

In [None]:
relation("SMS_received")

> There is an unexpected relationship between receiving messages and showing up: the no-show percentage among patients who received a message is greater than among patients who did not receive one.
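
To put numbers on this observation, the same proportions that `relation` plots can be produced as a table. This is a sketch, assuming the `SMS_received` and `No_show` columns of this notebook; `no_show_rate_by` and the toy data are hypothetical. It only quantifies the association and does not explain it.

```python
import pandas as pd

def no_show_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Share of each No_show class within each value of `column` (rows sum to 1)."""
    return pd.crosstab(frame[column], frame["No_show"], normalize="index")

# toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1],
    "No_show": ["No", "No", "No", "Yes", "No", "Yes"],
})
rates = no_show_rate_by(toy, "SMS_received")
print(rates)
```

In the notebook, `no_show_rate_by(df_clean, "SMS_received")` would give the no-show share for recipients and non-recipients side by side.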

In [None]:
#size of plot 
plt.figure(figsize=[20,20])
#count patients per neighbourhood and No_show value
e = df_clean.groupby(['Neighbourhood','No_show']).size().unstack()
e.Yes.plot(kind='bar' , alpha=.5 , color = 'red' , label= 'no show')
e.No.plot(kind='bar' , alpha=.5 , color = 'green' , label= 'show')
plt.legend()
plt.title("The relation between neighbourhood and showing up")
plt.xlabel("Neighbourhood")
plt.ylabel("patients")

>The plot suggests a clear relationship between Neighbourhood and showing up.
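
Because the bar chart shows raw counts, large neighbourhoods dominate it regardless of how often their patients miss appointments. A possible complement, sketched below, is to compare no-show rates per neighbourhood. The helper `neighbourhood_rates`, its `min_patients` parameter, and the toy data are all hypothetical and only illustrate the idea.

```python
import pandas as pd

def neighbourhood_rates(frame: pd.DataFrame, min_patients: int = 1) -> pd.DataFrame:
    """No-show rate per neighbourhood, skipping neighbourhoods with few patients."""
    counts = frame.groupby(["Neighbourhood", "No_show"]).size().unstack(fill_value=0)
    # make sure both classes exist as columns even if one never occurs
    counts = counts.reindex(columns=["No", "Yes"], fill_value=0)
    total = counts.sum(axis=1)
    result = pd.DataFrame({"patients": total, "no_show_rate": counts["Yes"] / total})
    kept = result[result["patients"] >= min_patients]
    return kept.sort_values("no_show_rate", ascending=False)

# toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B", "C"],
    "No_show": ["No", "No", "No", "Yes", "Yes", "No", "Yes"],
})
print(neighbourhood_rates(toy, min_patients=2))
```

The `min_patients` threshold is a design choice: a neighbourhood with a single patient would otherwise show a rate of 0% or 100%, which says little about the neighbourhood itself.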

<a id='conclusions'></a>
## Conclusions



- There appears to be a strong relation between Neighbourhood and showing up
- There is something unexpected about the relation between SMS and attendance
- There is no obvious relation between showing up and Hypertension, Diabetes, Alcoholism, or Handcap

### Limitations
> We could not detect a correlation between showing up and Hypertension, Diabetes, Alcoholism, or Handcap.