Original notebook: https://colab.research.google.com/drive/1FOczd5Gq9NSvEcosZcpTCWltEgN9q6X1?usp=sharing#scrollTo=70aae7d57446dca3

## Data Preparation

### Import Libraries

In [None]:
import pandas as pd
import plotly.express as px
from geopy.geocoders import Nominatim
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np

# ignore pandas warning
import warnings

warnings.filterwarnings("ignore")

### Load Data

First we load the data on health facilities in Kenya.    
We rename the `District` column to `Sub County`, since districts were renamed to sub-counties.

In [None]:
health_facilities_df = pd.read_excel("data/health-facilities-data-kenya.xlsx")
# rename district to sub county
health_facilities_df = health_facilities_df.rename(columns={"District": "Sub County"})
health_facilities_df.head(5)

Our analysis focuses on Nairobi County, so we filter the dataset down to facilities in Nairobi.    

In [None]:
health_facilities_df_nairobi = health_facilities_df[
    health_facilities_df["County"] == "Nairobi"
]

health_facilities_df_nairobi.head(5)

We then load data on population density by sub-county. This will be used to determine whether facilities are accessible.    
We rename the `National/ County` field to `Sub County`, since sub-counties are the rows we are targeting.

In [None]:
population_density_by_subcounty_df = pd.read_csv(
    "data/kenya-population-and-area-population-density_by_subcounty.csv"
)
population_density_by_subcounty_df = population_density_by_subcounty_df.rename(
    columns={"National/ County": "Sub County"}
)
population_density_by_subcounty_df.tail(10)

### Data Cleaning

At a quick glance, the names in the `Sub County` column of the `population_density_by_subcounty_df` dataframe have some inconsistencies.    
We'll remove the trailing spaces and the stray `*` and `.` characters at the end of some sub-county names.    
Since we are only dealing with Nairobi County for now, and only the 'Kibra' sub-county is affected there, this simple cleanup should be sufficient.

In [None]:
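# Remove literal "*" and "." characters (regex=False) and surrounding whitespace
population_density_by_subcounty_df["Sub County"] = (
    population_density_by_subcounty_df["Sub County"]
    .str.replace("*", "", regex=False)
    .str.replace(".", "", regex=False)
    .str.strip()
)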
population_density_by_subcounty_df.tail(10)
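
Both `*` and `.` are regular-expression metacharacters, and the default value of the `regex` argument to `str.replace` has differed between pandas versions. The following is a minimal sketch on made-up names (not the real dataset), assuming literal matching is what is wanted, with `regex=False` stated explicitly:

```python
import pandas as pd

# Made-up sub-county names carrying the kinds of debris described above
names = pd.Series(["Kibra*", "Langata ", "Mathare.", "Westlands"])

cleaned = (
    names.str.replace("*", "", regex=False)  # literal asterisk
    .str.replace(".", "", regex=False)       # literal full stop, not "any character"
    .str.strip()                             # leading and trailing whitespace
)
print(cleaned.tolist())
```

Passing `regex=False` makes the intent independent of the installed pandas version.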

The data in `health_facilities_df_nairobi` is pretty clean as is.

### Data Preprocessing

First let's determine how many facilities are in each sub-county in Nairobi.

In [None]:
nairobi_facilities_count = (
    health_facilities_df_nairobi.groupby("Sub County")
    .size()
    .reset_index(name="Number of Facilities")
    .sort_values(by="Number of Facilities", ascending=False)
)
nairobi_facilities_count

The population density dataframe counts all the Embakasi sub-counties as a single sub-county, and likewise for Dagoretti, so we merge them in the facility counts as well.    
The merge was done by hand in Excel, and the edited result is loaded below.

In [None]:
nairobi_facilities_count = pd.read_csv("data/nairobi-health-facilities-locations.csv")
nairobi_facilities_count
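
The same merge could probably be done in pandas rather than Excel, which would keep the step reproducible. The sketch below is not what the notebook did: it uses made-up counts and assumes the split sub-counties differ only by a trailing compass direction.

```python
import pandas as pd

# Made-up counts; the real values would come from nairobi_facilities_count
counts = pd.DataFrame({
    "Sub County": ["Embakasi East", "Embakasi West", "Dagoretti North",
                   "Dagoretti South", "Kibra"],
    "Number of Facilities": [10, 12, 7, 9, 5],
})

# Drop a trailing compass direction so "Embakasi East" becomes "Embakasi"
counts["Sub County"] = counts["Sub County"].str.replace(
    r"\s+(North|South|East|West|Central)$", "", regex=True
)

# Sum the counts of rows that now share a name
merged = (
    counts.groupby("Sub County", as_index=False)["Number of Facilities"]
    .sum()
    .sort_values(by="Number of Facilities", ascending=False)
)
print(merged)
```

The pattern only matches a direction preceded by whitespace at the end of the name, so a name such as "Westlands" is left untouched.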

#### Geocode facilities

A good way to check the distribution of facilities is to plot them on a map, which gives near-immediate visual insight. Since we only have facility counts per sub-county, we geocode each sub-county name rather than individual facilities.

In [None]:
def get_coordinates(row):
    """Geocode the sub-county named in a pandas row.

    Args:
        row (pandas.core.series.Series): row with a "Sub County" field

    Returns:
        tuple: (latitude, longitude), or (None, None) if the lookup fails
    """
    geolocator = Nominatim(user_agent="kenya_healthcare")
    address = f"{row['Sub County']}, Nairobi, Kenya"

    try:
        # The network request is what can raise, so it belongs inside the try
        location = geolocator.geocode(address)
    except Exception:
        return None, None

    if location:
        return location.latitude, location.longitude
    return None, None

In [None]:
nairobi_facilities_count[["Latitude", "Longitude"]] = nairobi_facilities_count.apply(
    get_coordinates, axis=1, result_type="expand"
)
nairobi_facilities_count = nairobi_facilities_count.sort_values(
    by="Number of Facilities", ascending=False
)
nairobi_facilities_count
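
Nominatim is a shared public service, and `apply` issues one request per row. One possible refinement, sketched below, is to cache lookups by address so that a repeated address triggers only one request. `fake_geocode` is a hypothetical stand-in for `geolocator.geocode` so the example runs offline, and its coordinates are placeholders rather than real geocoding results.

```python
from functools import lru_cache

calls = []  # records every simulated network request


def fake_geocode(address):
    """Hypothetical stand-in for geolocator.geocode; returns placeholder coordinates."""
    calls.append(address)
    known = {"Westlands, Nairobi, Kenya": (-1.27, 36.81)}  # placeholder values
    return known.get(address)


@lru_cache(maxsize=None)
def cached_coordinates(address):
    """Return (lat, lon) for an address, or (None, None) if it is not found."""
    result = fake_geocode(address)
    return result if result else (None, None)


for name in ["Westlands", "Westlands", "Unknown Place"]:
    print(name, cached_coordinates(f"{name}, Nairobi, Kenya"))
print(f"Requests made: {len(calls)}")
```

geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that spaces out successive calls, which may be worth combining with a cache when geocoding many rows.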

## Exploratory Data Analysis

### Descriptive Statistics

Here we analyze basic statistics of the health facilities in Nairobi.

In [None]:
print(f"Data shape: {health_facilities_df_nairobi.shape}")
print("-----------------------------")
print(health_facilities_df_nairobi.info())

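We have a total of 942 facilities in the Nairobi dataset. The facility code and facility name columns have the same non-null count as the number of rows, so neither contains null values. Note that `info()` reports non-null counts only and does not by itself confirm that the values are unique.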
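
Uniqueness and missing values can also be checked directly. This is a small sketch on made-up rows; the column names `Facility Code` and `Facility Name` are assumed from the description above and may differ in the real file.

```python
import pandas as pd

# Made-up rows; column names are assumptions about the facilities dataset
df = pd.DataFrame({
    "Facility Code": [101, 102, 103],
    "Facility Name": ["Clinic A", "Clinic B", "Clinic C"],
})

codes_unique = df["Facility Code"].is_unique   # True if no code repeats
names_unique = df["Facility Name"].is_unique   # True if no name repeats
missing_per_column = df.isna().sum()           # count of nulls in each column

print(codes_unique, names_unique)
print(missing_per_column)
```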

### Data Visualizations

Here we visualize various data in the dataset.

In [None]:
sns.set_style("whitegrid")

Let's look at the number of facilities in each sub-county in Nairobi.

In [None]:
sns.barplot(
    data=nairobi_facilities_count, x="Number of Facilities", y="Sub County", orient="y"
).set_title("Number of Health Facilities in Nairobi");

Now let's view the data on a map.

In [None]:
fig = px.scatter_mapbox(
    data_frame=nairobi_facilities_count,
    lat="Latitude",
    lon="Longitude",
    size="Number of Facilities",
    zoom=10,
    mapbox_style="open-street-map",
    hover_name="Sub County",
)
fig.show()

Embakasi sub-county has the highest number of facilities, followed by Dagoretti, while Westlands has the fewest.    
The high counts for Embakasi and Dagoretti arise because each combines several sub-counties: Dagoretti is split into North and South, and Embakasi into North, South, East and West.

Let's have a look at the facility types in Nairobi.

In [None]:
facility_types = (
    health_facilities_df_nairobi["Type"].value_counts().sort_values(ascending=False)
)

In [None]:
# Keep the Axes returned by barplot so the title and labels are set on it
ax = sns.barplot(x=facility_types.values, y=facility_types.index, orient="y")
ax.set_title("Types of Health Facilities in Nairobi")
ax.set_xlabel("Facility Type Count")
ax.set_ylabel("");

Medical facilities are the most common type of health facility in Nairobi followed by dispensaries.

Let us now look at the job titles of those in charge of the health facilities in Nairobi.

In [None]:
job_title_incharge = health_facilities_df_nairobi["Job Title of in Charge"].value_counts()
job_title_incharge

A pie chart would be a good way to summarize this data.

In [None]:
plt.figure(figsize=(6, 6))
plt.pie(job_title_incharge, labels=job_title_incharge.index, autopct="%1.1f%%");
plt.title("Job Title of Those in Charge of Health Facilities in Nairobi", fontweight="bold");

As we can see, most of the people in charge of health facilities in Nairobi hold the title `Nursing Officer in Charge`.

What about ownership? Who owns most of the health facilities in Nairobi?

In [None]:
facilities_ownership = (
    health_facilities_df_nairobi["Owner"].value_counts().sort_values(ascending=False)
)
facilities_ownership_df = pd.DataFrame(facilities_ownership)
facilities_ownership_df = facilities_ownership_df.reset_index()
facilities_ownership_df

In [None]:
ax = sns.barplot(
    data=facilities_ownership_df,
    x= "count",
    y="Owner",
)
ax.set_title("Health Facility Ownership in Nairobi");
plt.xlabel("Facility Ownership Count");
plt.ylabel("");

From the above we can see that most health facilities in Nairobi are privately owned, with government ownership first appearing at around fifth place under the `Local Authority` label.

## Feature Engineering

Here we derive ratios that relate the health facilities in Nairobi to the population they serve, since these affect accessibility.

We start by combining the facility counts in `nairobi_facilities_count` with the population data in `population_density_by_subcounty_df`.    
This tells us how many people the facilities in each Nairobi sub-county are expected to serve.

In [None]:
nairobi_facilities_count.sort_values(by="Sub County", ascending=True, inplace=True)
nairobi_facilities_count

In [None]:
# Nairobi population data per sub-county
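# Rows from position 385 onwards are taken to be the Nairobi sub-counties;
# .copy() avoids modifying a slice of the original dataframe
nairobi_population_df = population_density_by_subcounty_df[385:].copy()
nairobi_population_df["Sub County"] = (
    nairobi_population_df["Sub County"]
    .str.replace(".", "", regex=False)
    .str.replace(" ", "", regex=False)
    .str.replace("'", "", regex=False)
)
nairobi_population_df = nairobi_population_df.sort_values(by="Sub County", ascending=True)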
nairobi_population_df

In [None]:
nairobi_facility_population_df = pd.merge(nairobi_facilities_count,
                                          nairobi_population_df,
                                          on='Sub County',
                                          how="left")
nairobi_facility_population_df = nairobi_facility_population_df.dropna()
nairobi_facility_population_df["Population"] = nairobi_facility_population_df["Population"].astype(int)
nairobi_facility_population_df
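
Because the left merge is followed by `dropna()`, any sub-county whose name does not match exactly in both tables is silently dropped. The sketch below, on made-up rows, shows one way to list such names before dropping them, using the `indicator` argument of `pd.merge`:

```python
import pandas as pd

# Made-up rows; "Langata" and "Lang'ata" deliberately fail to match
facilities = pd.DataFrame({
    "Sub County": ["Kibra", "Langata", "Starehe"],
    "Number of Facilities": [5, 8, 11],
})
population = pd.DataFrame({
    "Sub County": ["Kibra", "Lang'ata"],
    "Population": [100_000, 150_000],
})

# indicator=True adds a "_merge" column saying where each row was found
checked = pd.merge(facilities, population, on="Sub County", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Sub County"].tolist()
print(unmatched)
```

Any names listed this way are candidates for a spelling fix before the merge rather than being lost to `dropna()`.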

## Hypothesis Testing

Our hypothesis when we started the analysis was: **Regions with higher population density have a greater number of healthcare facilities compared to regions with lower population density.**

Rationale: **Denser populations might drive a higher demand for healthcare services, leading to more health facilities.**

According to a report by the Kenyan government, https://www.countdown2030.org/wp-content/uploads/2023/02/Infrastructure-Policy.pdf:    
*The average national health facility density is 2.2 per 10,000 population which is slightly above the target of 2 per 10,000 population. (2018, KHFA). However, it is noted that there are geographical disparities with 33 (70%) counties having health facility densities of 2 per 10,000 population and above apart from Nandi, Kwale, Uasin Gishu, Nairobi, Busia, Bomet, Trans Nzoia, Kakamega, Narok, Vihiga, Wajir, Kisii, Bungoma and Mandera Counties with facility density of below 2 per 10,000 population*.

With this target in mind, we can check whether most sub-counties in Nairobi meet the minimum health facility density of 2 per 10,000 population.

In [None]:
nairobi_facility_population_df["Facilities per 10,000 People"] = (nairobi_facility_population_df["Number of Facilities"] / nairobi_facility_population_df["Population"]) * 10000
nairobi_facility_population_df

From the above data we can determine that most of the sub-counties in Nairobi meet the threshold health facility density of 2 per 10,000 people.
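
To make "most" concrete, the density can be compared against the target and the share of sub-counties meeting it computed. This is a sketch with made-up figures, not the notebook's data; `TARGET_PER_10K` and the `Meets Target` column are names introduced here for illustration.

```python
import pandas as pd

TARGET_PER_10K = 2  # target of 2 facilities per 10,000 population quoted above

# Made-up figures for illustration only
df = pd.DataFrame({
    "Sub County": ["A", "B", "C"],
    "Number of Facilities": [50, 30, 90],
    "Population": [200_000, 300_000, 300_000],
})

df["Facilities per 10,000 People"] = (
    df["Number of Facilities"] / df["Population"] * 10_000
)
df["Meets Target"] = df["Facilities per 10,000 People"] >= TARGET_PER_10K

# The mean of a boolean column is the fraction of True values
share = df["Meets Target"].mean()
print(df[["Sub County", "Facilities per 10,000 People", "Meets Target"]])
print(f"Share of sub-counties meeting the target: {share:.0%}")
```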

Next we determine the correlation between population and the number of health facilities in a sub-county.

In [None]:
correlation_matrix = nairobi_facility_population_df.corr(numeric_only=True)
correlation = correlation_matrix.loc['Population', 'Number of Facilities']
print(f'The Correlation of population and number of health facilities in Nairobi sub-counties is: {correlation}')

We can see a correlation coefficient of 0.747 between the number of facilities and population, which indicates a strong positive correlation.
This means that sub-counties with larger populations also tend to have more health facilities.
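
For reference, `DataFrame.corr` computes Pearson's correlation coefficient by default. The sketch below computes the same quantity by hand with NumPy on made-up values, to show what the number measures:

```python
import numpy as np

# Made-up values for illustration only
population = np.array([100_000, 200_000, 300_000, 400_000], dtype=float)
facilities = np.array([20, 35, 70, 75], dtype=float)

# Pearson's r: co-variation of the two variables divided by the
# product of their individual spreads
pop_dev = population - population.mean()
fac_dev = facilities - facilities.mean()
r = (pop_dev * fac_dev).sum() / np.sqrt((pop_dev**2).sum() * (fac_dev**2).sum())

print(round(r, 3))
```

With only a handful of sub-counties, a single coefficient can shift noticeably if one point changes, so it is best read alongside the scatter plot.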

To better visualize this relationship, here is a linear regression fit:

In [None]:
population = nairobi_facility_population_df["Population"]
facilities = nairobi_facility_population_df["Number of Facilities"]

# Reshape data for scikit-learn
population = [[x] for x in population]  # Convert to 2D array

# Create and fit the linear regression model
model = LinearRegression()
model.fit(population, facilities)

# Generate predictions for the entire population range
population_range = [[x] for x in range(min(population)[0], max(population)[0] + 1, 1000)]  # Adjusted step size
predicted_facilities = model.predict(population_range)

# Plot the scatter plot and regression line
plt.scatter(population, facilities, color='blue', label='Actual Data')
plt.plot(population_range, predicted_facilities, color='red', linewidth=2, label='Regression Line')

plt.xlabel('Population')
plt.ylabel('Number of Facilities')
plt.title('Regression of Number of Facilities on Population')
plt.legend()
plt.grid(True)
plt.show()
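
The fitted line can also be summarized numerically. The sketch below uses `np.polyfit` on made-up values (swapped in for scikit-learn purely to keep the example short) to report the slope as extra facilities per 10,000 additional people, together with the coefficient of determination:

```python
import numpy as np

# Made-up values for illustration only
population = np.array([100_000, 200_000, 300_000, 400_000], dtype=float)
facilities = np.array([20, 45, 55, 80], dtype=float)

# Least-squares straight line: facilities ≈ slope * population + intercept
slope, intercept = np.polyfit(population, facilities, deg=1)

# R^2: share of the variance in facilities explained by the line
predicted = slope * population + intercept
ss_res = ((facilities - predicted) ** 2).sum()
ss_tot = ((facilities - facilities.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"Extra facilities per 10,000 people: {slope * 10_000:.2f}")
print(f"R^2: {r_squared:.3f}")
```

With the scikit-learn model above, the corresponding values would be available as `model.coef_[0]` and `model.score(population, facilities)`.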

With this we have established that:


1.   Sub-counties in Nairobi with larger populations tend to have more health facilities.
2.   Most sub-counties in Nairobi meet the minimum health facility density suggested by the Kenyan government in 2020.



## Limitations
* Data for some sub-counties was unavailable because of inconsistencies in how fields were named.
* Because we had little time to prepare, we could not go into more depth.

## End Credits

This is a notebook by: Kennedy, Paul, Pauline and Antony 👌