<a href="https://colab.research.google.com/github/stogaja/Tanzanian-Water-Project/blob/main/TANZANIA_WATER_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. **Defining the Question** 

Tanzania is the largest country in East Africa, with a population of 52 million people. Of those 52 million, 23 million have no choice but to drink dirty water from unsafe sources, 44 million do not have access to adequate sanitation, and 4000 children die from preventable diseases due to unsafe water. Safe water is scarce, and women and children often have to spend two to seven hours collecting clean water (WaterAid, 2016). This is quite the predicament. Water is a basic need and a right for all human beings. The Tanzanian Ministry of Water agrees and, together with Taarifa, aims to improve sanitation conditions in the country.
Water is fundamental to life and the environment; it plays a central role in both economic and social development activities. Water touches all spheres of human life, including domestic use, livestock, fisheries, wildlife, industry and energy, recreation, and other socio-economic activities. It plays a pivotal role in poverty alleviation through the enhancement of food security, domestic hygiene, and the environment. The availability of safe and clean water raises the standard of living, while its inadequacy poses serious health risks and leads to a decline in living standards and life expectancy. Major freshwater sources in Tanzania include lakes, rivers, streams, dams, and groundwater. However, these are not evenly distributed across the country. Some areas lack both surface and groundwater sources. Population growth and urbanization put serious pressure on the quantity and quality of available water. The sustainability of present and future human life and the environment depends mainly on proper water resources management.


### a) Specifying the Question

Water is supplied to different parts of Tanzania mainly through underground pipes. Although this is an initiative to curb the water problem, over 24 million people are still affected by the crisis; that is almost half of the population. This has resulted in poor sanitation, a lack of safe drinking water, and overcrowding at water sources. The adverse effects include disease outbreaks and generally very slow economic growth. The project aims to address these problems by predicting which pipes are operating well, which ones need repairs, and which ones are not working at all, since optimally functioning pipes mean smooth delivery of water to where it's needed.

### b) Defining the Metric for Success

The project will be considered a success when we can classify pumps into 3 categories namely:

* functional : the waterpoint is operational and there are no repairs needed

* functional needs repair : the waterpoint is operational, but needs repairs

* non functional : the waterpoint is not operational
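
One way to turn this success criterion into a number is to compare predicted status labels against the true ones. The sketch below is only an illustration: it uses a handful of made-up labels (not project data) and plain pandas to compute overall accuracy and a confusion table. The variable names `actual` and `predicted` are assumptions introduced here.

```python
import pandas as pd

# Hypothetical labels for six waterpoints (illustrative only, not project data)
actual = pd.Series(["functional", "non functional", "functional needs repair",
                    "functional", "non functional", "functional"], name="actual")
predicted = pd.Series(["functional", "non functional", "functional",
                       "functional", "functional", "functional"], name="predicted")

# Share of waterpoints whose predicted class matches the true class
accuracy = (actual == predicted).mean()

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(actual, predicted)

print(f"accuracy: {accuracy:.2f}")
print(confusion)
```

The confusion table is worth reading alongside accuracy, because a model can score well overall while rarely predicting a less common class.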

### c) Understanding the context

### d) Recording the Experimental Design

### e) Data Relevance

The data has been proven to be valid and was provided by the Tanzania Water Ministry.

# **2. Importing Libraries.**

In [None]:
# Importing the necessary libraries
#
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

#  **3. Reading the Data**

In [None]:
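# Loading the csv file
# Note: this is a pre-signed S3 link (X-Amz-Expires=86400, i.e. valid for 24 hours
# from the signing date 2022-06-27), so it may need replacing with a fresh link or a local copy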
df=pd.read_csv("https://drivendata-prod.s3.amazonaws.com/data/7/public/4910797b-ee55-40a7-8668-10efd5c1b960.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220627%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220627T100628Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=17f0185a23d5355302ebac77192ec6d2769226c8b35fc762a85ed4d324e13ae0")
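
Because a time-limited download link stops working once it expires, reading from a saved local copy tends to be more reproducible. The following is a minimal sketch under that assumption: the helper `load_values` and the file name `values_demo.csv` are hypothetical, and the demo writes a tiny stand-in file with made-up rows so the snippet runs on its own.

```python
import os
import tempfile

import pandas as pd


def load_values(path):
    """Read the waterpoint table from a local CSV file."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} not found; download the CSV and save it there first"
        )
    return pd.read_csv(path)


# Demo with a tiny stand-in file (made-up rows), so the sketch runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), "values_demo.csv")
pd.DataFrame({"basin": ["X", "Y"], "amount_tsh": [0.0, 25.0]}).to_csv(
    demo_path, index=False
)
demo_df = load_values(demo_path)
print(demo_df.shape)
```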

# Exploring the data

In [None]:
# checking the shape (size) of the dataset
print("The dataset consists of", df.shape[0], "rows and", df.shape[1], "columns")

In [None]:
#a preview of the data 
df.head()

In [None]:
# checking the column names
df.columns

* amount_tsh : Total static head (amount of water available to waterpoint)

* date_recorded : The date the row was entered

* funder : Who funded the well

* gps_height : Altitude of the well

* installer : Organization that installed the well

* longitude : GPS coordinate

* latitude : GPS coordinate

* wpt_name : Name of the waterpoint if there is one

* num_private : Private use or not

* basin : Geographic water basin

* subvillage : Geographic location

* region : Geographic location

* region_code : Geographic location (coded)

* district_code : Geographic location (coded)

* lga : Geographic location

* ward : Geographic location

* population : Population around the well

* public_meeting : True/False

* recorded_by : Group entering this row of data

* scheme_management : Who operates the waterpoint

* scheme_name : Who operates the waterpoint

* permit : If the waterpoint is permitted

* construction_year : Year the waterpoint was constructed

* extraction_type : The kind of extraction the waterpoint uses

* extraction_type_group : The kind of extraction the waterpoint uses

* extraction_type_class : The kind of extraction the waterpoint uses

* management : How the waterpoint is managed

* management_group : How the waterpoint is managed

* payment : What the water costs

* payment_type : What the water costs

* water_quality : The quality of the water

* quality_group : The quality of the water

* quantity : The quantity of water

* quantity_group : The quantity of water

* source : The source of the water

* source_type : The source of the water

* source_class : The source of the water

* waterpoint_type : The kind of waterpoint

* waterpoint_type_group : The kind of waterpoint

In [None]:
# checking the data type of each column
df.dtypes

In [None]:
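# The construction year should be a datetime data type.
# The column holds integer years, and astype('datetime64[ns]') would read them as
# nanoseconds since 1970-01-01, so we parse the values as years instead.
# Values that are not valid four-digit years become NaT.
df['construction_year'] = pd.to_datetime(df['construction_year'].astype(str), format='%Y', errors='coerce')
df.dtypes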
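
To see how integer years behave under different conversions, here is a small sketch with made-up values. It assumes that a value such as `0` stands in for an unknown year; that assumption is not confirmed by the source.

```python
import pandas as pd

# Made-up construction years; 0 stands in for an unknown year (assumption)
years = pd.Series([1999, 2008, 0])

# Casting integers directly treats them as nanoseconds since 1970-01-01
as_nanoseconds = years.astype("datetime64[ns]")

# Parsing the values as four-digit years keeps the calendar year;
# entries that are not valid years become NaT
as_years = pd.to_datetime(years.astype(str), format="%Y", errors="coerce")

print(as_nanoseconds.dt.year.tolist())
print(as_years.dt.year.tolist())
```

If only the year matters for later analysis, keeping the column as an integer and masking invalid values is a reasonable alternative to a full datetime type.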

#  **4. Data Preparation**

# Data Cleaning.

### a) Validity

In [None]:
# Preview a random sample of 10 records to see whether all records are appropriately ordered
df.sample(10)

### b) Accuracy

### c) Uniformity

In [None]:
#checking if columns are properly named 
df.columns

Columns have uniform naming.

### d) Completeness

In [None]:
# here we check for missing values
# Counting the number of missing values per column, sorted from largest to smallest

Total = df.isnull().sum().sort_values(ascending=False)

# Calculating percentages
percent_1 = df.isnull().sum()/df.isnull().count()*100

# rounding off to one decimal point
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)

# creating a dataframe to show the values
missing_data = pd.concat([Total, percent_2], axis=1, keys=['Total', '%'])
missing_data
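
The same summary can be wrapped in a small reusable function, which makes it easy to rerun after each cleaning step. This is a sketch only: `missing_summary` is a hypothetical helper name and the toy frame holds made-up values.

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values per column, largest first."""
    total = frame.isnull().sum()
    # mean() of a boolean mask is the fraction of missing entries
    percent = (frame.isnull().mean() * 100).round(1)
    summary = pd.concat([total, percent], axis=1, keys=["Total", "%"])
    return summary.sort_values("Total", ascending=False)


# Made-up frame to illustrate the output shape
toy = pd.DataFrame({"funder": ["A", None, None, "B"], "basin": ["X", "Y", "Z", "X"]})
toy_summary = missing_summary(toy)
print(toy_summary)
```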

In [None]:
# let's encode the dataframe columns that have the object datatype
# (each distinct value becomes its own column, so columns with many
# distinct values make the result very wide)

from sklearn.preprocessing import OneHotEncoder

# let's store all categorical columns in a variable
cat_cols = df.select_dtypes(include=['object']).columns.to_list()

# instantiate the one hot encoder
# (sparse_output replaces the older sparse argument in scikit-learn >= 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False, drop="first")

# apply the one hot encoder logic
encoder_vars_array = one_hot_encoder.fit_transform(df[cat_cols])

# get the encoded feature names from the categorical variables
# (get_feature_names_out replaces the removed get_feature_names)
encoder_feature_names = one_hot_encoder.get_feature_names_out(cat_cols)

# create a dataframe to hold the one hot encoded variables
encoder_vars_df = pd.DataFrame(encoder_vars_array, columns=encoder_feature_names)

# concatenate the new dataframe back to the original input variables dataframe
df1 = pd.concat([df.reset_index(drop=True), encoder_vars_df.reset_index(drop=True)], axis=1)

# drop the original categorical columns as they are not needed anymore
df1.drop(cat_cols, axis=1, inplace=True)

# dtypes is an attribute, not a method
df1.dtypes
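
One-hot encoding creates one column per distinct value, so it can help to encode only the columns with few distinct values. The sketch below shows that idea with `pandas.get_dummies` instead of `OneHotEncoder`. The helper `encode_low_cardinality`, the threshold `max_levels`, and the toy values are all assumptions made for illustration.

```python
import pandas as pd


def encode_low_cardinality(frame, max_levels=20):
    """One-hot encode non-numeric columns with at most max_levels distinct values."""
    non_numeric = [
        c for c in frame.columns if not pd.api.types.is_numeric_dtype(frame[c])
    ]
    keep = [c for c in non_numeric if frame[c].nunique() <= max_levels]
    # drop_first=True plays the same role as drop="first" in OneHotEncoder
    return pd.get_dummies(frame, columns=keep, drop_first=True), keep


# Made-up rows: basin has 2 distinct values, wpt_name has 3
toy = pd.DataFrame({
    "basin": ["Rufiji", "Pangani", "Rufiji"],
    "wpt_name": ["a", "b", "c"],
    "population": [10, 20, 30],
})
encoded, encoded_cols = encode_low_cardinality(toy, max_levels=2)
print(encoded_cols)
print(encoded.columns.tolist())
```

Columns that exceed the threshold are left untouched here; whether to drop them, group rare values, or encode them another way is a separate modelling decision.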


In [None]:
# let's fill the missing values with the mode, since these are categorical variables
# work on a copy so that the original df is left untouched
df1 = df.copy()
cols_with_missing = ['scheme_name', 'scheme_management', 'installer', 'funder',
                     'public_meeting', 'permit', 'subvillage']
for col in cols_with_missing:
    # mode() returns a Series, so take its first entry as the fill value
    df1[col] = df1[col].fillna(df1[col].mode().iloc[0])

# let's check for missing values again
print(df1.isnull().sum())
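
A detail worth knowing when filling with the mode: `Series.mode()` returns a Series rather than a single value, and `fillna` aligns a Series argument on the index. The toy example below, with made-up labels, shows the difference between passing the Series and passing its first entry.

```python
import pandas as pd

# Made-up categorical values with two missing entries (index 1 and 3)
toy = pd.Series(["VWC", None, "VWC", None, "WUG"])

# mode() returns a Series (there can be ties), indexed from 0.
# Passing it to fillna aligns on the index, so only matching labels are filled.
filled_with_series = toy.fillna(toy.mode())

# Taking the first entry gives a scalar that fills every missing value
filled_with_scalar = toy.fillna(toy.mode().iloc[0])

print(filled_with_series.isnull().sum())
print(filled_with_scalar.isnull().sum())
```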


In [None]:
#we impute the missing values with the string "N/A"
df.scheme_name= df.scheme_name.fillna('N/A')
df.scheme_management = df.scheme_management.fillna('N/A')
df.installer = df.installer.fillna('N/A')
df.funder = df.funder.fillna('N/A')
df.public_meeting = df.public_meeting.fillna('N/A')
df.permit = df.permit.fillna('N/A')
df.subvillage = df.subvillage.fillna('N/A')

In [None]:
# Checking for missing values
print(df.isnull().sum())

### e) Consistency

In [None]:
# Check for duplicates
df.duplicated().sum()

No duplicate rows were found in our dataset.

# **Exploratory Data Analysis**

### a) Univariate analysis

In [None]:
df.describe()

In [None]:
# selecting the categorical columns to plot
categorical = [
    'basin', 'region',
    'public_meeting', 'recorded_by',
    'scheme_management', 'permit',
    'extraction_type_group', 'extraction_type_class',
    'management', 'management_group', 'payment_type',
    'quality_group', 'quantity_group',
    'source', 'source_type', 'source_class',
    'waterpoint_type_group',
]

# let's make a for loop to draw countplots for our categorical variables.
for col in categorical:
    sns.countplot(y=col, data=df)
    plt.title(f"countplot of {col}")
    plt.show()
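
Alongside the plots, a table of percentage shares per category can make small groups easier to compare. This is a minimal sketch: `category_shares` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def category_shares(frame, column):
    """Percentage of rows per category, counting missing values as their own group."""
    counts = frame[column].value_counts(normalize=True, dropna=False)
    return (counts * 100).round(1)


# Made-up values for one categorical column
toy = pd.DataFrame({"quantity_group": ["enough", "dry", "enough", "enough"]})
shares = category_shares(toy, "quantity_group")
print(shares)
```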


In [None]:
numerical=['amount_tsh','gps_height','population']
numerical