<a href="https://colab.research.google.com/github/stogaja/Tanzanian-Water-Project/blob/main/TANZANIA_WATER_PROJECT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Defining the Question**

Tanzania is the largest country in East Africa, with a population of 52 million people. Of those 52 million, 23 million have no choice but to drink dirty water from unsafe sources, 44 million do not have access to adequate sanitation, and 4,000 children die from preventable diseases caused by unsafe water. Safe water is scarce, and women and children often have to spend two to seven hours collecting clean water (WaterAid, 2016). Water is a basic need and a right for all human beings. The Tanzanian Ministry of Water agrees and, together with Taarifa, aims to improve sanitation conditions in the country.
Water is fundamental to life and the environment; it plays a central role in both economic and social development activities. Water touches all spheres of human life, including domestic use, livestock, fisheries, wildlife, industry and energy, recreation, and other socio-economic activities. It plays a pivotal role in poverty alleviation through the enhancement of food security, domestic hygiene, and the environment. The availability of safe and clean water raises the standard of living, while its inadequacy poses serious health risks and leads to a decline in living standards and life expectancy. Major freshwater sources in Tanzania include lakes, rivers, streams, dams, and groundwater. However, these are not evenly distributed across the country, and some areas lack both surface and groundwater sources. Population growth and urbanization put serious pressure on the quantity and quality of available water. The sustainability of present and future human life and of the environment depends mainly on proper water resources management.


### a) Specifying the Question

Water is supplied to different parts of Tanzania mainly through pipes laid underground. Although this is an initiative to curb the water problem, over 24 million people are still affected by the crisis, which is almost half of the population. The result is poor sanitation, a lack of safe drinking water, and overcrowding at water sources; the adverse effects include disease outbreaks and very slow economic growth. This project aims to address these problems by predicting which pipes are operating well, which need repairs, and which are not working at all, since optimally functioning pipes mean smooth delivery of water to where it is needed.

### b) Defining the Metric for Success

### c) Understanding the context

### d) Recording the Experimental Design

### e) Data Relevance

# **2. Importing Libraries.**

In [None]:
# Importing the necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# **3. Reading the Data**

In [None]:
# Loading the CSV file
df=pd.read_csv("https://drivendata-prod.s3.amazonaws.com/data/7/public/4910797b-ee55-40a7-8668-10efd5c1b960.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARVBOBDCYQTZTLQOS%2F20220623%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220623T211344Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=a162a39b1967610ed2b01516c4cd0fe80a414a8e6cdca24befeb2515c350ccf9")
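
The link above is a pre-signed URL (its `X-Amz-Expires=86400` parameter limits it to 24 hours), so the cell may fail when the notebook is rerun later. A minimal sketch of one way to guard against that, assuming a local copy of the CSV can be kept: the helper `load_with_cache`, the file name, the toy columns, and the placeholder URL are all hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_with_cache(url: str, cache_path: Path) -> pd.DataFrame:
    """Read the CSV from cache_path if it exists, otherwise download and cache it."""
    if cache_path.exists():
        return pd.read_csv(cache_path)
    frame = pd.read_csv(url)
    frame.to_csv(cache_path, index=False)
    return frame


# Demonstration with a tiny stand-in file, so no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "training_values.csv"  # hypothetical file name
    pd.DataFrame({"id": [1, 2], "value": [0.0, 25.0]}).to_csv(cache, index=False)
    # The URL is a placeholder; it is never contacted because the cache exists.
    demo = load_with_cache("https://example.invalid/data.csv", cache)

print(demo.shape)
```

In the notebook itself, the same helper could be called with the real download link and a path in the working directory.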

# Exploring the data

In [None]:
# Checking the shape (size) of the dataset
print("The dataset consists of", df.shape[0], "rows and", df.shape[1], "columns")

In [None]:
# A preview of the data
df.head()

In [None]:
# Checking the column names
df.columns

In [None]:
# Checking the data type of each column
df.dtypes

In [None]:
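# The construction year should be a datetime data type.
# Casting year numbers directly with astype('datetime64[ns]') would read them as
# nanoseconds since 1970-01-01, so we parse them as years; invalid values become NaT.
df['construction_year'] = pd.to_datetime(df['construction_year'].astype(str), format='%Y', errors='coerce')
df.dtypes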
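
To illustrate how a column of year numbers behaves under a datetime conversion, here is a small sketch on toy data (not the project's dataset). It assumes the years are stored as integers and that `0` stands in for an unknown year; both are assumptions made for the illustration.

```python
import pandas as pd

# Toy data: integer years, with 0 as an assumed placeholder for "unknown".
years = pd.Series([1999, 2008, 0], name="construction_year")

# A direct cast treats each integer as nanoseconds since 1970-01-01,
# so every value ends up in the year 1970.
as_nanoseconds = years.astype("datetime64[ns]")

# Parsing the values as four-digit years instead; entries that are not
# valid years are coerced to NaT.
as_years = pd.to_datetime(years.astype(str), format="%Y", errors="coerce")

print(as_nanoseconds)
print(as_years)
```

Note that the second approach turns invalid years into missing values, which will then show up in any later completeness check.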

# **4. Data Preparation**

# Data Cleaning.

### a) Validity

In [None]:
# Preview a random sample of 10 records to see whether all records are appropriately ordered
df.sample(10)

### b) Accuracy

### c) Uniformity

### d) Completeness

In [None]:
# Here we check for missing values
# Counting the missing values in each column, sorted from largest to smallest
Total = df.isnull().sum().sort_values(ascending=False)

# Calculating the percentage of missing values in each column
percent_1 = df.isnull().sum() / df.isnull().count() * 100

# Rounding off to one decimal place
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)

# Creating a dataframe to show the values
missing_data = pd.concat([Total, percent_2], axis=1, keys=['Total', '%'])
missing_data
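
The same summary can be wrapped in a small reusable function. The sketch below is one possible version, run on a toy dataframe; the function name `missing_summary` and the toy columns are hypothetical and not part of the project's data.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest count first."""
    total = frame.isnull().sum()
    # The mean of a boolean mask is the fraction of missing entries.
    percent = (frame.isnull().mean() * 100).round(1)
    summary = pd.DataFrame({"Total": total, "%": percent})
    return summary.sort_values("Total", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "funder": ["A", None, "B", None],
    "permit": [True, False, None, True],
    "amount": [1.0, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

Building both columns from the same unsorted index avoids any reliance on the two series being sorted the same way before they are combined.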

In [None]:
# We impute the missing values with the string "N/A"
cols_to_fill = ['scheme_name', 'scheme_management', 'installer', 'funder',
                'public_meeting', 'permit', 'subvillage']
df[cols_to_fill] = df[cols_to_fill].fillna('N/A')
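
One thing worth checking after this kind of imputation is what it does to column types. The sketch below uses toy data and assumes that a column such as `permit` holds true/false values (an assumption, since the column contents are not shown here); filling its gaps with a string leaves booleans and text mixed in the same column.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "permit": [True, None, False],
    "installer": ["X", None, "Y"],
})

# Fill each column's gaps with the placeholder string.
filled = toy.fillna({"permit": "N/A", "installer": "N/A"})

# No missing values remain, but "permit" now mixes booleans with a string.
print(filled.isnull().sum())
print(filled["permit"].map(type).value_counts())
```

Whether this matters depends on the later modelling steps; if the column is going to be encoded as a category anyway, a third "N/A" level may be acceptable.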

In [None]:
# Checking for missing values
print(df.isnull().sum())

### e) Consistency

In [None]:
# Check for duplicates
df.duplicated().sum()

No duplicate rows were found in our data set.
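
An exact-row check can miss records that are identical except for an identifier. As a hedged sketch on toy data, assuming the dataset has a unique identifier column (called `id` here, which is hypothetical, as are the other column names), duplicates can also be counted while ignoring that column:

```python
import pandas as pd

# Toy data for illustration only: rows 0 and 1 differ only in "id".
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "A", "B"],
    "height": [100, 100, 250],
})

# Exact duplicates across all columns.
exact = toy.duplicated().sum()

# Duplicates when the identifier column is left out of the comparison.
feature_cols = toy.columns.drop("id").tolist()
ignoring_id = toy.duplicated(subset=feature_cols).sum()

print(exact, ignoring_id)
```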