# **EXPLORATORY DATA ANALYSIS ON GHANA'S HEALTH INFRASTRUCTURE**

## **Description**

This dataset contains the total number of health facilities in Ghana by region and district, together with the type of each facility. It was published by the Health Sector through the Ghana Open Data Initiative (https://data.gov.gh/dataset/health-facilities) and released as of **2016-02-05**.

## **Objective**

The general objective of this exploratory data analysis is to understand the health infrastructure of Ghana.

Specific Objectives

1. To examine the types of facility tiers in the nation.

2. To analyse the distribution of these health facility tiers across the nation by region.

3. To examine the types of facilities most widespread across the nation (e.g. hospitals, clinics).

4. To analyse and determine whether most health facilities are state-owned or private.

In [None]:
import os
os.getcwd()

In [None]:
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")

#Importing libraries for data analysis and cleaning
import numpy as np
import pandas as pd

#importing visualisation libraries for data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
init_notebook_mode(connected=True)

#load datasets
tiers = pd.read_csv('../input/health-facilities-gh/health-facility-tiers.csv')
facilities = pd.read_csv('../input/health-facilities-gh/health-facilities-gh.csv')

## **PART 1 : DATA PREPROCESSING**

#### **Dataset 1**

Here, I will check for null values and duplicates so that the data is in a meaningful shape for exploration.

In [None]:
#checking dataset 1
tiers.head()

There are no null values in the tiers dataset, which has a total of 1475 rows and 3 columns (Region, Facility and Tier).

In [None]:
#checking the general summary of the dataset
tiers.info()

In [None]:
#checking for duplicated data
tiers.duplicated().sum()

In [None]:
#examining the 20 duplicated rows together with their originals (keep=False shows every copy)
tiers.loc[tiers.duplicated(keep=False),:]

In [None]:
#removing duplicated data
tiers = tiers.drop_duplicates()

#confirming 
tiers.duplicated().sum()
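
The difference between the two `duplicated` calls used above can be seen on a small example. This is only a sketch on hypothetical toy rows (not taken from the dataset): the default call flags only the repeated copies, while `keep=False` flags every row that belongs to a duplicated group.

```python
import pandas as pd

# Toy data (hypothetical rows, used only to illustrate the behaviour)
toy = pd.DataFrame({
    "Region": ["Ashanti", "Ashanti", "Volta", "Volta", "Volta"],
    "Facility": ["A Clinic", "A Clinic", "B Hospital", "B Hospital", "C CHPS"],
    "Tier": [3, 3, 2, 2, 3],
})

# Default keep="first": only the second and later copies are flagged
n_extra_copies = int(toy.duplicated().sum())

# keep=False: every row in a duplicated group is flagged, originals included
n_rows_in_groups = int(toy.duplicated(keep=False).sum())

# drop_duplicates keeps the first copy of each group
deduped = toy.drop_duplicates()

print(n_extra_copies, n_rows_in_groups, len(deduped))
```

The first number is the count of rows that `drop_duplicates` removes; the second is the count of rows displayed when inspecting with `keep=False`.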

#### **Dataset 2**

In [None]:
#checking dataset 2
facilities.head()

In [None]:
#general info of dataset 2
facilities.info()

It seems there are some null (missing) values. Let's explore this further.

In [None]:
facilities.isnull().sum()

In [None]:
#Investigating missing rows in the town column
facilities[facilities['Town'].isnull()]

Since I won't be working with specific townships, I will leave the data as it is. Dropping the rows with missing towns would lose information in the other columns.

The same applies to the longitude and latitude columns, which both have 24 missing values.
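
One way to keep the rows with missing values while still being able to plot coordinates is to filter only at the point where the coordinates are needed. The sketch below uses a hypothetical toy frame with the same column names; it is an illustration of the idea, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in Town, Latitude and Longitude
toy = pd.DataFrame({
    "FacilityName": ["A", "B", "C"],
    "Town": ["X", None, "Z"],
    "Latitude": [5.6, 6.7, np.nan],
    "Longitude": [-0.2, -1.6, np.nan],
})

# Missing values per column, as in facilities.isnull().sum()
missing_per_column = toy.isnull().sum()

# The full table is kept for counts by region, type and ownership
n_all = len(toy)

# Only the map view needs complete coordinates
mappable = toy.dropna(subset=["Latitude", "Longitude"])

print(missing_per_column)
print(n_all, len(mappable))
```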

In [None]:
#checking for duplicated data
facilities.duplicated().sum()

In [None]:
#examining those 30 duplicated rows
facilities.loc[facilities.duplicated(),:]

In [None]:
#removing duplicated rows
facilities = facilities.drop_duplicates()

In [None]:
#Investigating categorical data. This is to identify any duplicates resulting from many possible factors
no_regions = facilities['Region'].unique()

for x in no_regions:
    print(x)

There is no duplication or error in the category 'Region' above. Let's proceed to the category 'Type'.

In [None]:
#Identifying the available facility types across the nation and cross-checking for errors
no_types = facilities['Type'].unique()

for x in no_types:
    print(x)

'Clinic' appears twice: once with a capital 'C' and once with a lower-case 'c'.
This must be addressed, since the two spellings would be counted as separate categories.
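
Spelling variants of this kind can also be detected programmatically instead of by reading the printed list. The following is a minimal sketch on a hypothetical list of labels: it builds a normalised key (trimmed, whitespace collapsed, lower-cased) and reports every key that has more than one raw spelling. It only catches case and spacing differences; a transposition such as 'CPHS' or a different word such as 'Centre' would still need manual review.

```python
import pandas as pd

# Hypothetical sample of raw 'Type' labels
types = pd.Series([
    "Clinic", "clinic", "CHPS", "Hospital",
    "Municipal Health Directorate", "Municipal  Health Directorate",
])

# Normalised key: trim, collapse runs of whitespace, lower-case
key = types.str.strip().str.replace(r"\s+", " ", regex=True).str.lower()

# Distinct raw spellings per key; keep keys with more than one spelling
variants = types.groupby(key).unique()
collisions = variants[variants.map(len) > 1]

print(collisions)
```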

In [None]:
#investigating category clinic under 'Type'
facilities[facilities['Type'] == 'clinic']

In [None]:
#fixing the error: merging 'clinic' into 'Clinic' (single .loc call avoids chained assignment)
facilities.loc[[2010, 2056], 'Type'] = 'Clinic'

From the data above, 'CPHS' is a misspelled version of the actual '**CHPS**', and this must be addressed.

In [None]:
#Investigating the misspelled 'CPHS'
facilities[facilities['Type'] == 'CPHS']

In [None]:
#correcting that error (single .loc call avoids chained assignment)
facilities.loc[646, 'Type'] = 'CHPS'

In [None]:
#investigating the 'DHD' entry and reassigning it to 'Municipal Health Directorate'
facilities[facilities['Type'] == 'DHD']
facilities.loc[1250, 'Type'] = 'Municipal Health Directorate'

In [None]:
#investigating all misclassified CHPS types and ownerships
pd.set_option('display.max_rows', None)
facilities[facilities['FacilityName'].str.contains('CHPS')]

In [None]:
#duplicated facility: one row is classified as Clinic and the other as CHPS, with the same Lat and Long
facilities.loc[[2183,2187]]

The facility is Taifa CHPS, which means it belongs under the category 'CHPS'. Let's address this in the code below.

In [None]:
#dropping the duplicate row that is classified as 'Clinic'
facilities = facilities.drop([2183])

#misclassified 'CHPS' under type
facilities.loc[[953,1008,1163,1166,2516,3103,3265]]

In [None]:
#correcting the wrongly classified 'Clinic' to the right category 'CHPS'
facilities.loc[[953, 1008, 1163, 1166, 2516, 3103, 3265], 'Type'] = 'CHPS'

The Municipal Health Directorate category has also been created twice.
Let's investigate and address this.

In [None]:
#duplicated. The spelling of it was spaced out
facilities[facilities['Type'] == 'Municipal  Health Directorate']

In [None]:
#merging the two spellings into one
facilities.loc[1134, 'Type'] = 'Municipal Health Directorate'

The category 'Centre' belongs to the category 'Health Centre'. Let's address this.

In [None]:
#Investigating category 'Centre'
facilities[facilities['Type'] == 'Centre']

In [None]:
#reassigning to its correct category 'Health Centre'
facilities.loc[[99, 667], 'Type'] = 'Health Centre'

In [None]:
#Identifying the types of ownership for these facilities across the nation
no_ownship = facilities['Ownership'].unique()

for x in no_ownship:
    print(x)

#### The categories above have a few errors that must be addressed to make them meaningful.

No 1 : The categories 'Islamic' and 'Muslim' are the same and must be merged.

No 2 : 'Clinic' under this dataset is a type of facility, not an ownership.

No 3 : 'Maternity Home' under this dataset is a type of facility, not an ownership.

No 4 : 'Private' and 'Government' have each been created twice. Both have the same issue with the capitalisation of the first letter: 'G'/'g' and 'P'/'p'.

No 5 : 'NGO' and 'Mission' are theoretically the same.
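
Label-level corrections such as No 1, No 4 and No 5 can be written as a value mapping rather than as edits to individual row indices. The sketch below uses a hypothetical `fixes` dictionary and toy labels to show the idea with `Series.replace`; it is an alternative formulation, not the method this notebook applies. Items No 2 and No 3 are judgment calls about individual facilities, so they are left out of the mapping.

```python
import pandas as pd

# Hypothetical sample of raw 'Ownership' labels
ownership = pd.Series([
    "Government", "government", "Private", "private",
    "Muslim", "Islamic", "Mission", "NGO",
])

# Hypothetical mapping that mirrors items No 1, No 4 and No 5 above
fixes = {
    "government": "Government",
    "private": "Private",
    "Muslim": "Islamic",
    "Mission": "NGO",
}

# replace() rewrites every exact match and leaves other labels untouched
cleaned = ownership.replace(fixes)

print(cleaned.value_counts())
```

A mapping applies to every matching row, so it does not depend on the row labels staying the same after rows are dropped or the file is updated.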

**No 1 : The category 'Islamic' and 'Muslim'**

In [None]:
#Investigating category 'Muslim'
facilities[facilities['Ownership'] == 'Muslim']

#Adding it to category 'Islamic'
facilities.loc[930, 'Ownership'] = 'Islamic'

**No 2 : The category of 'Clinic'**

In [None]:
#Investigating 'Clinic'
facilities[facilities['Ownership'] == 'Clinic']

#Since the clinic is a rural clinic ('Adadiem Rural Clinic'), it is reasonable to assign the ownership as 'Government'

#reassigning to 'Government'
facilities.loc[971, 'Ownership'] = 'Government'

**No 3 : The category of 'Maternity Home'**

In [None]:
#Investigating 'Maternity Home'
facilities[facilities['Ownership'] == 'Maternity Home']

There isn't much information available for correcting this error ('Maternity Home'). I will reassign it to 'Government', since that is the most common ownership.

In [None]:
#checking the most owned facilities
facilities['Ownership'].value_counts().head()

In [None]:
#reassigning to 'Government'
facilities.loc[[969, 970], 'Ownership'] = 'Government'

**No 4 : The category of government and private**

In [None]:
#investigating government
facilities[facilities['Ownership'] == 'government']

In [None]:
#fixing the error
facilities.loc[[2127, 3209, 3226, 3228, 3229, 3230], 'Ownership'] = 'Government'

In [None]:
#investigating private
facilities[facilities['Ownership'] == 'private']

In [None]:
#fixing the error
facilities.loc[[1413, 1608], 'Ownership'] = 'Private'

**No 5: The category of Missions and NGO**

In [None]:
#Investigating 'missions'
facilities[facilities['Ownership'] == 'Mission']

#reassigning to NGO
facilities.loc[3398, 'Ownership'] = 'NGO'

From the data above, it is clear that 'CHPS' facilities are government-owned. There are 4 misclassified ones that must be corrected.

In [None]:
#wrongly classified Ownerships
facilities.loc[[704,2463,3265,3312]]

In [None]:
#correcting these to their right category 'Government'
facilities.loc[[704, 2463, 3265, 3312], 'Ownership'] = 'Government'

## **PART 2 : DATA ANALYSIS**

## **OBJECTIVE 1**

*Examining the types of facility tiers in the nation*.

There are two facility tiers in the data, **Tier 2 and Tier 3**, as identified by the code below.
*264* health facilities fall under Tier 2, which is **18%** of the total across Ghana, while Tier 3 is the larger group, with *1,191* facilities or **82%** of the total.

In [None]:
print('Two types of health facility tiers. Tier 2 and Tier 3:',tiers['Tier'].unique())
print('\n')
print('Tier 3 coverage percentage in Ghana:',round(100* len(tiers[tiers['Tier'] == 3])/len(tiers['Tier'])),'%')
print('With a total number of',tiers[tiers['Tier'] == 3]['Tier'].count())

print('\n')
print('Tier 2 coverage percentage in Ghana:',round(100* len(tiers[tiers['Tier'] == 2])/len(tiers['Tier'])),'%')
print('With a total number of',tiers[tiers['Tier'] == 2]['Tier'].count())

print('\n')
ylabel='Count'
xlabel='Types of Health Facility Tiers'
ax = tiers['Tier'].value_counts().plot(kind='bar',figsize=(12,5),title='Number of health facility Tiers in Ghana',color='red');
ax.autoscale(axis='x',tight=True)
ax.set(xlabel=xlabel, ylabel=ylabel);
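
The counts and percentages printed above can also be produced in a single table with `value_counts`. This sketch rebuilds a toy `Tier` column from the two counts quoted in the text (264 and 1,191) so that the arithmetic can be checked; the `summary` frame and its column names are illustrative choices.

```python
import pandas as pd

# Toy column rebuilt from the counts quoted in the text:
# 264 Tier 2 facilities and 1,191 Tier 3 facilities
tier = pd.Series([2] * 264 + [3] * 1191, name="Tier")

# Absolute counts per tier
counts = tier.value_counts()

# Shares per tier as percentages, rounded to one decimal place
share = (tier.value_counts(normalize=True) * 100).round(1)

# Both Series share the tier values as index, so they align by label
summary = pd.DataFrame({"count": counts, "percent": share})

print(summary)
```

Rounded to whole numbers, the two shares are the 18% and 82% reported for this objective.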

## **OBJECTIVE 2**

*Analysing the distribution of these health facility tiers across the nation by region*.


The bar plot shows that Tier 3 is the most common tier across the nation.
It is also the most common tier in the urban regions, that is, Greater Accra and Ashanti.
Greater Accra has a total of 448 Tier 3 facilities and 79 Tier 2 facilities, and the Ashanti region has 223 Tier 3 facilities and 86 Tier 2 facilities.

From the data visualised below, the northern part of Ghana, which includes the Northern, Upper West and Upper East regions, has a low count of health facilities in general (both Tier 2 and Tier 3) compared with the urban regions. The three regions together have *12* Tier 2 facilities and *106* Tier 3 facilities.

In [None]:
tiers_per_region = tiers.groupby(['Region','Tier']).count()
tiers_per_region

In [None]:
plt.figure(figsize=(15,5));
plt.title('Count of Health facility Tiers per Region');
sns.countplot(data=tiers,x = 'Region',hue='Tier');
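
Totals for a group of regions, such as the three northern regions discussed above, are easier to read from a Region by Tier table. The sketch below builds such a table with `pd.crosstab` on hypothetical toy rows; the region labels and counts are placeholders and may not match the spelling used in the dataset.

```python
import pandas as pd

# Toy data (hypothetical rows; region labels are placeholders)
toy = pd.DataFrame({
    "Region": ["Greater Accra", "Greater Accra", "Greater Accra",
               "Ashanti", "Ashanti", "Northern"],
    "Tier": [3, 3, 2, 3, 2, 3],
})

# One row per region, one column per tier, cells hold facility counts
table = pd.crosstab(toy["Region"], toy["Tier"])

# Column totals for a chosen subset of regions; regions that are absent
# from the table are filled with zeros instead of raising an error
northern = ["Northern", "Upper East", "Upper West"]
subset_total = table.reindex(northern, fill_value=0).sum()

print(table)
print(subset_total)
```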

In [None]:
max_long = facilities['Longitude'].max()
min_long = facilities['Longitude'].min()
max_lat = facilities['Latitude'].max()
min_lat = facilities['Latitude'].min()

In [None]:
facilities['FacilityName'] = facilities['FacilityName'].str.lower()
tiers['Facility'] = tiers['Facility'].str.lower()
merged = pd.merge(facilities, tiers, left_on=['FacilityName'], right_on=['Facility'])
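
Because the merge above is an inner join on facility names, any name that does not match exactly after lower-casing is dropped without warning. A quick way to see how many rows match is an outer merge with `indicator=True`. This is a sketch on hypothetical toy frames; the `key` column and the extra `str.strip()` step are assumptions added for the illustration and are not part of the notebook's merge.

```python
import pandas as pd

# Toy frames (hypothetical names) mimicking the two datasets
facilities_toy = pd.DataFrame(
    {"FacilityName": ["Alpha Clinic", "Beta Hospital", "Gamma CHPS"]}
)
tiers_toy = pd.DataFrame(
    {"Facility": ["alpha clinic ", "BETA HOSPITAL", "Delta Clinic"],
     "Tier": [3, 2, 3]}
)

# Normalised join key: trimmed and lower-cased
facilities_toy["key"] = facilities_toy["FacilityName"].str.strip().str.lower()
tiers_toy["key"] = tiers_toy["Facility"].str.strip().str.lower()

# Outer merge keeps unmatched rows; '_merge' records where each row came from
check = pd.merge(facilities_toy, tiers_toy, on="key", how="outer", indicator=True)
match_report = check["_merge"].value_counts()

print(match_report)
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop, so they show how much of each dataset is missing from the tier map.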

In [None]:
data = []
for index, tier in enumerate(merged['Tier'].unique()):
    facils = merged[merged['Tier'] == tier]
    data.append(
        go.Scattergeo(
        lon = facils['Longitude'],
        lat = facils['Latitude'],
        text = facils['FacilityName'],
        mode = 'markers',
        marker_color = index,
        name = "Tier " + str(tier)
        )
    )

layout = dict(
        title = 'Health facilities in Ghana based on Tier',
        geo = dict(
        scope = 'africa',
        landcolor = "rgb(212, 212, 212)",
        subunitcolor = "rgb(255, 255, 255)",
        lonaxis = dict(
            showgrid = True,
            gridwidth = 0.5,
            range= [ min_long - 5, max_long + 5 ],
            dtick = 5
        ),
        lataxis = dict (
            showgrid = True,
            gridwidth = 0.5,
            range= [ min_lat - 1, max_lat + 1 ],
            dtick = 5
        )
    )
)
fig = dict(data = data, layout = layout)
go.Figure(fig)

## **OBJECTIVE 3**

*Examining the five most common types of health facility in Ghana and their spread across the regions*

From the analysis, the five most common types of health facility in Ghana are Clinics, Health Centres, CHPS, Maternity Homes and Hospitals.

*Greater Accra* has the highest count of **Clinics**, with a total of **281**, and the *Ashanti region* comes second with **268** Clinics.

The second most common type across the regions is the **Health Centre**, which is most numerous in the *Volta region* with a count of **201**, while the Ashanti region has **132**.

**CHPS** facilities are most numerous in the *Western Region* (**126**), with a count of **101** in the Central Region.

**Maternity Homes** are mostly found in the Ashanti region, with a count of **112**.

**Hospitals** are scarcest in the *Upper West* and *Upper East* regions, while the *Ashanti Region* and *Greater Accra* have the most hospitals, at **105** and **101** respectively.




In [None]:
#Examining the overall count of the Health Facilities in Ghana
ylabel='Count'
xlabel='Types of Health Facilities'
ax1 = facilities['Type'].value_counts().plot(kind='bar',figsize=(11,5),title='The most common health facilities in Ghana');
ax1.autoscale(axis='x',tight=True)
ax1.set(xlabel=xlabel, ylabel=ylabel);

In [None]:
#Investigating the 5 most common health facilities and their total counts
facilities['Type'].value_counts().head()

Let's analyse the distribution of the five most common health facilities across the regions of Ghana.

In [None]:
df2 = facilities[facilities['Type'].str.contains('Clinic')]
df2 =df2['Region'].value_counts()

df3 = facilities[facilities['Type'].str.contains('Health Centre')]
df3=df3['Region'].value_counts()

df4 = facilities[facilities['Type'].str.contains('CHPS')]
df4=df4['Region'].value_counts()

df5 = facilities[facilities['Type'].str.contains('Maternity Home')]
df5 = df5['Region'].value_counts()

df6 = facilities[facilities['Type'].str.contains('Hospital')]
df6 = df6['Region'].value_counts()

per_reg = pd.concat([df2, df3,df4,df5,df6], axis=1).reset_index()
per_reg.columns = ['Region','Clinic','Health Centre','CHPS','Maternity Home','Hospital']
per_reg = per_reg.set_index('Region')
per_reg
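
The filters above use `str.contains`, which matches substrings. Whether that differs from an exact comparison depends on the labels present in the `Type` column. The sketch below uses hypothetical toy labels (including an assumed 'District Hospital' label that may not exist in the dataset) to show how the two approaches can give different counts.

```python
import pandas as pd

# Toy data (hypothetical labels, chosen to show the difference)
toy = pd.DataFrame({
    "Region": ["Ashanti", "Ashanti", "Volta", "Volta"],
    "Type": ["Hospital", "District Hospital", "Clinic", "Hospital"],
})

# Substring match: 'District Hospital' is counted as a hospital too
by_contains = toy[toy["Type"].str.contains("Hospital")]["Region"].value_counts()

# Exact match: only rows labelled exactly 'Hospital' are counted
by_equality = toy[toy["Type"] == "Hospital"]["Region"].value_counts()

print(by_contains)
print(by_equality)
```

If the substring behaviour is intended, the counts describe a family of related types rather than a single label.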

In [None]:
#Analysing the highest count per each health facility
per_reg.describe().loc['max']

**A visual representation of the five most common health facilities in Ghana per region**

In [None]:
per_reg.iplot(kind='bar',barmode='stack',title='Distribution of the five most common health facilities per regional area',xTitle='Regions',yTitle='Count')

In [None]:
data = []
for index, region in enumerate(facilities['Region'].unique()):
    selected_facilities = facilities[facilities['Region'] == region]
    data.append(
        go.Scattergeo(
        lon = selected_facilities['Longitude'],
        lat = selected_facilities['Latitude'],
        text = selected_facilities['FacilityName'],
        mode = 'markers',
        marker_color = index,
        name = region
        )
    )

layout = dict(
        title = 'Health facilities in Ghana based on Region',
        geo = dict(
        scope = 'africa',
        landcolor = "rgb(212, 212, 212)",
        subunitcolor = "rgb(255, 255, 255)",
        lonaxis = dict(
            showgrid = True,
            gridwidth = 0.5,
            range= [ min_long - 5, max_long + 5 ],
            dtick = 5
        ),
        lataxis = dict (
            showgrid = True,
            gridwidth = 0.5,
            range= [ min_lat - 1, max_lat + 1 ],
            dtick = 5
        )
    )
)
fig = dict(data = data, layout = layout)
go.Figure(fig)




## **OBJECTIVE 4**

*To analyse and determine whether most health facilities are state-owned or private.*

From the analysis below, the pie chart shows that the *Government* owns **59.1%** of the health facilities while *Private* owners account for **31.4%**. This indicates that most health facilities in Ghana are **state-owned (Government)**, with a total of **2202** facilities.

In [None]:
#Structuring into a dataframe (naming the columns explicitly keeps this stable across pandas versions)
grp_ownships = facilities['Ownership'].value_counts().rename_axis('Type').reset_index(name='Ownership')
grp_ownships['Percentage Ownerships'] = round(100 * (grp_ownships['Ownership']/grp_ownships['Ownership'].sum()),1)

#Pie chart 
fig = px.pie(grp_ownships, values='Ownership', names='Type',
             title='Ownership Percentages')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

In [None]:
#Complete list of ownerships
grp_ownships

## **CONCLUSIONS**

The analysis conducted above gives a clearer understanding of Ghana's health infrastructure.

1. There are more Tier 3 health facilities in the country than Tier 2, with Tier 3 covering 82% and Tier 2 covering 18%.
2. Tier 3 facilities are most concentrated in the capital region (Greater Accra).
3. Of the five most common types of health facility in Ghana, Clinics are the most common, with a total count of 1151.
4. Finally, most of the health facilities in Ghana are state-owned (Government).