# 1.0 Overview
Urbanization is the movement of people from rural areas to cities and towns. It shows up as a rise in the proportion of the population living in urban areas.
Urbanization affects many aspects of human society, including socio-economic factors such as GDP, population growth, healthcare access, employment opportunities and development. In Kenya, urbanization has influenced each of these factors.

## 1.1 Research Questions
* What is the effect of urbanization on Kenya's GDP?
* How has GDP trended as urbanization has increased in Kenya?
* What does urbanization mean for development in Kenya?
* How has urbanization influenced healthcare access in Kenya?
* How has urbanization influenced unemployment in Kenya?

## 1.2 Objectives
* Determine the effect that urbanization has had on the GDP of Kenya.
* Find the trend of GDP with the increase in urbanization in Kenya.
* Determine what urbanization has done to development in Kenya.
* Determine the influence that urbanization has had on healthcare access in Kenya.
* Determine the influence of urbanization on unemployment in Kenya.

# 2.0 Data Understanding
The study uses open-source data to investigate how urbanization has influenced Kenyan society. The datasets cover GDP, healthcare, unemployment, and urban and rural population.

In [1]:
# load dependencies
import pandas as pd
import numpy as np
import ee
import geemap
import folium
import geopandas as gpd
import geopy as gpy
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# project root and shapefile paths relative to it
base_dir = '/home/teofilo_acholla_ligawa_gafna/Documents/python_practice/collaborations/dte_datathon/'
ke_urban_target_path = 'src/data/ke_urbanareas/ke_urbanareas.shp'
ke_major_towns_path = 'src/data/ke_major-towns/ke_major-towns.shp'  # defined here, not read in this cell
ke_urban_file_path = os.path.join(base_dir, ke_urban_target_path)

# read the urban areas shapefile
ke_urban_shapefile = gpd.read_file(ke_urban_file_path)

# display the GeoDataFrame
ke_urban_shapefile

Unnamed: 0,URID,URNAME,geometry
0,UR,"Urban and associated areas, rural settlements","POLYGON ((36.71925 -1.31900, 36.71900 -1.31917..."
1,UR,"Urban and associated areas, rural settlements","POLYGON ((36.79016 -1.32067, 36.79011 -1.32072..."
2,UR,"Urban and associated areas, rural settlements","POLYGON ((36.67917 -1.33357, 36.67925 -1.33371..."
3,UR,"Urban and associated areas, rural settlements","POLYGON ((36.90098 -1.34641, 36.90098 -1.34760..."
4,UR,"Urban and associated areas, rural settlements","POLYGON ((36.70578 -1.36347, 36.70481 -1.36330..."
...,...,...,...
292,UR,"Urban and associated areas, rural settlements","POLYGON ((36.74226 -1.26365, 36.74298 -1.26377..."
293,UR,"Urban and associated areas, rural settlements","POLYGON ((36.87193 -1.27477, 36.87241 -1.27460..."
294,UR,"Urban and associated areas, rural settlements","POLYGON ((36.68344 -1.28486, 36.68357 -1.28469..."
295,UR,"Urban and associated areas, rural settlements","POLYGON ((36.91741 -1.28943, 36.91751 -1.28892..."


In [3]:
# plot urban areas in Kenya

# Create a Folium map of Kenya, centered on Nairobi's coordinates
m = folium.Map(location=[-1.2921, 36.8219], zoom_start=6)

# Add the GeoDataFrame as a GeoJSON layer
folium.GeoJson(ke_urban_shapefile).add_to(m)

# show map
m

The region of interest is Kenya, and the map is centered on it. The blue marks indicate the country's urban areas. The map shows that urban areas are unevenly distributed: they are sparse towards the eastern, southern and northern parts of the country, mostly concentrated in the centre, with a good number towards the west.

In [4]:
# save the map
m.save(base_dir+'src/data/ke_urbanareas/kenyan_urbanareas_folium_map.html')

## 2.1 Preliminary Data Inspection
> In this phase of the study, we inspect the structure and metadata of the datasets.

In [6]:
# load data
df1 = pd.read_csv(base_dir+'src/data/agriculture-electricity-health.csv')
df2 = pd.read_csv(base_dir+'src/data/gdp-population-rural.csv')

# preview
df1.head()

Unnamed: 0,time,agricultural_land_%_of_land_area,access to_electricity_%_of_population,access_to_electricity_rural_%_of_rural_population,access_to_electricity_urban_%_of_urban_population,access_to_clean_fuels_and_technologies_for_cooking_urban_%_of_urban_population,access_to_clean_fuels_and_technologies_for_cooking_rural_%_of_rural_population,access_to_clean_fuels_and_technologies_for_cooking_%_of_population,adolescents_out_of_school_%_of_lower_secondary_school_age,adults_ages_15-49_newly_infected_with_hiv,agriculture_forestry_and_fishing_value_added_annual_%_growth,"hospital_beds_per_1,000_people",literacy_rate_adult_total_%_of_people_ages_15_and_above,"physicians_per_1,000_people"
0,1960,..,..,..,..,..,..,..,..,..,..,1.25048005580902,..,0.092
1,1961,44.2773307094915,..,..,..,..,..,..,..,..,..,..,..,..
2,1962,44.2861158941561,..,..,..,..,..,..,..,..,..,..,..,..
3,1963,44.2949010788207,..,..,..,..,..,..,..,..,..,..,..,..
4,1964,44.3036862634853,..,..,..,..,..,..,..,..,..,..,..,..


In [7]:
# preview df2
df2.head()

Unnamed: 0,time,forest_area_%_of_land_area,urban_population_%_of_total_population,urban_population_growth_annual_%,urban_population,urban_land_area_sq_km,rural_population,rural_population_%_of_total_population,rural_population_growth_annual_%,population_density_people_per_sq_km_of_land_area,population_growth_annual_%,unemployment_total_%_of_total_labor_force,gdp_growth_annual_%,gdp_current_US$,current_health_expenditure_%_of_gdp,domestic_general_government_health_expenditure_%_of_current_health_expenditure,domestic_general_government_health_expenditure_%_of_gdp,proportion_of_population_pushed_below_the_$2.15_$_2017_PPP_poverty_line_by_out_of_pocket_health_care_expenditure_%
0,1960,..,7.362,..,570661,..,7180774,92.638,..,..,..,..,..,791265500.0,..,..,..,..
1,1961,..,7.565,6.46796851804438,608791,..,7438679,92.435,3.52861026239367,14.1397020065362,3.74797688027386,..,-7.77463490371655,792959500.0,..,..,..,..
2,1962,..,7.774,6.57819145208457,650185,..,7713393,92.226,3.62648892733818,14.6951154373265,3.85285696473711,..,9.45735874072126,868111400.0,..,..,..,..
3,1963,..,8.038,7.25096775695646,699081,..,7998119,91.962,3.62482204432135,15.2813016129599,3.91148088133147,..,8.77834021621184,926589300.0,..,..,..,..
4,1964,..,8.318,7.37167699069845,752562,..,8294825,91.682,3.6425437339829,15.8965931053871,3.94748519609761,..,4.96446728844091,998759300.0,..,..,..,..
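
The `urban_population_%_of_total_population` column in the preview follows from the two population counts, which matches the definition of urbanization as the share of people living in urban areas. The block below is a minimal sketch of that cross-check using only the first two rows printed above; the frame name `sample` and the column `urban_share_pct` are illustrative and not part of the notebook.

```python
import pandas as pd

# First two rows of df2 as printed in the preview above (1960 and 1961).
sample = pd.DataFrame({
    "time": [1960, 1961],
    "urban_population": [570661, 608791],
    "rural_population": [7180774, 7438679],
})

# Urban share = urban / (urban + rural), expressed as a percentage.
total = sample["urban_population"] + sample["rural_population"]
sample["urban_share_pct"] = (100 * sample["urban_population"] / total).round(3)

print(sample[["time", "urban_share_pct"]])
```

The computed shares round to 7.362 and 7.565, the same values the preview reports for 1960 and 1961.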


In [8]:
# metadata of df1
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 14 columns):
 #   Column                                                                          Non-Null Count  Dtype 
---  ------                                                                          --------------  ----- 
 0   time                                                                            63 non-null     int64 
 1   agricultural_land_%_of_land_area                                                63 non-null     object
 2   access to_electricity_%_of_population                                           63 non-null     object
 3   access_to_electricity_rural_%_of_rural_population                               63 non-null     object
 4   access_to_electricity_urban_%_of_urban_population                               63 non-null     object
 5   access_to_clean_fuels_and_technologies_for_cooking_urban_%_of_urban_population  63 non-null     object
 6   access_to_clean_fuels_and_te

In [10]:
# metadata of df2
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 18 columns):
 #   Column                                                                                                              Non-Null Count  Dtype  
---  ------                                                                                                              --------------  -----  
 0   time                                                                                                                63 non-null     int64  
 1   forest_area_%_of_land_area                                                                                          63 non-null     object 
 2   urban_population_%_of_total_population                                                                              63 non-null     float64
 3   urban_population_growth_annual_%                                                                                    63 non-null     object 
 4   urban_population      

In [13]:
# summary statistics of df1
df1.describe()

Unnamed: 0,time
count,63.0
mean,1991.0
std,18.330303
min,1960.0
25%,1975.5
50%,1991.0
75%,2006.5
max,2022.0


In [14]:
# summary statistics of df2
df2.describe()

Unnamed: 0,time,urban_population_%_of_total_population,urban_population,rural_population,rural_population_%_of_total_population,gdp_current_US$
count,63.0,63.0,63.0,63.0,63.0,63.0
mean,1991.0,17.60746,5470449.0,20896010.0,82.39254,23184470000.0
std,18.330303,6.077777,4375918.0,9781360.0,6.077777,31291720000.0
min,1960.0,7.362,570661.0,7180774.0,70.998,791265500.0
25%,1975.5,13.2085,1833618.0,12043470.0,77.7675,3366944000.0
50%,1991.0,17.043,4076385.0,19841850.0,82.957,8209121000.0
75%,2006.5,22.2325,8334010.0,29147010.0,86.7915,28891850000.0
max,2022.0,29.002,15669050.0,38358440.0,92.638,113420000000.0


In [22]:
# check duplicates
print(df1.duplicated().sum())
print(df2.duplicated().sum())


# missing values
print(f"df1 :\n {df1.isna().sum()}")
print()
print(f"df2 :\n {df2.isna().sum()}")

0
0
df1 :
 time                                                                              0
agricultural_land_%_of_land_area                                                  0
access to_electricity_%_of_population                                             0
access_to_electricity_rural_%_of_rural_population                                 0
access_to_electricity_urban_%_of_urban_population                                 0
access_to_clean_fuels_and_technologies_for_cooking_urban_%_of_urban_population    0
access_to_clean_fuels_and_technologies_for_cooking_rural_%_of_rural_population    0
access_to_clean_fuels_and_technologies_for_cooking_%_of_population                0
adolescents_out_of_school_%_of_lower_secondary_school_age                         0
adults_ages_15-49_newly_infected_with_hiv                                         0
agriculture_forestry_and_fishing_value_added_annual_%_growth                      0
hospital_beds_per_1,000_people                                   

The data types of the columns are not correct. The `time` variable should be a datetime, and the other variables should be floats. This will be dealt with during data preparation.
There are missing values, but they are not registered as missing: the previews show them stored as the placeholder `..`, which is why `isna()` reports zero. These will be converted to proper missing values. The criterion is to attempt to convert each value to a float; if the conversion fails, the value is registered as missing.
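
The conversion rule described above (try to parse each value as a float, otherwise mark it missing) can be sketched with `pd.to_numeric` and `errors="coerce"`. The frame below is an illustrative stand-in for `df1`, built from a few values in the preview; the names `raw`, `clean` and `value_cols` are assumptions made for this example only.

```python
import pandas as pd

# Illustrative stand-in for df1: values copied from the preview above,
# with ".." marking entries that are not recorded.
raw = pd.DataFrame({
    "time": [1960, 1961, 1962],
    "agricultural_land_%_of_land_area": ["..", "44.2773307094915", "44.2861158941561"],
    "physicians_per_1,000_people": ["0.092", "..", ".."],
})

# Try to parse every non-time column as a number; anything that cannot be
# parsed (such as "..") becomes NaN instead of raising an error.
value_cols = raw.columns.drop("time")
clean = raw.copy()
clean[value_cols] = clean[value_cols].apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean.isna().sum())
```

If `..` turns out to be the only placeholder in the files, passing `na_values=".."` to `pd.read_csv` would likely give the same result at load time.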

## Data Preparation
> In this phase of the study, we shall prepare the data for analysis.

### Validity

To check validity, we verify that the data is in the correct format.
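
One possible shape for such a check is sketched below, under the assumption that "correct format" means a datetime `time` column and float columns everywhere else, as stated in the data inspection notes. The helper `invalid_columns` and the frame `frame` are hypothetical names introduced for illustration; in the notebook the check would run on `df1` and `df2` once the `..` placeholders have been handled.

```python
import pandas as pd

# Illustrative frame with values taken from the df2 preview.
frame = pd.DataFrame({
    "time": [1960, 1961, 1962],
    "gdp_current_US$": [791265500.0, 792959500.0, 868111400.0],
})

# Interpret the integer year as a date (1 January of that year).
frame["time"] = pd.to_datetime(frame["time"].astype(str), format="%Y")


def invalid_columns(df, time_col="time"):
    """Return the names of columns whose dtype is not the expected one."""
    bad = []
    for name, dtype in df.dtypes.items():
        if name == time_col:
            ok = pd.api.types.is_datetime64_any_dtype(dtype)
        else:
            ok = pd.api.types.is_float_dtype(dtype)
        if not ok:
            bad.append(name)
    return bad


print(invalid_columns(frame))
```

An empty list means every column has the expected type; any name in the list points to a column that still needs conversion.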