# Tanzanian faulty pumps prediction

## Problem statement
In Tanzania, access to clean and potable water is essential for the health and well-being of its citizens. However, many water pumps 
across the country are faulty, leading to water shortages and posing significant health risks to communities. To address this issue 
and promote access to clean water, we aim to develop a predictive model that can identify faulty water pumps based on various 
features such as pump age, location, type, and condition. By accurately predicting which water pumps are faulty, authorities and
organizations can prioritize maintenance and repair efforts, ensuring that clean and safe water is readily available to all 
Tanzanians.
### Stakeholder
The Ministry of Water in Tanzania is a key stakeholder in addressing the issue of faulty water pumps and promoting access to clean
and potable water across the country. As the government body responsible for water resource management and infrastructure
development, the Ministry plays a crucial role in ensuring that water supply systems are well-maintained and functional.
By leveraging predictive modeling to identify faulty water pumps, the Ministry can efficiently allocate resources for maintenance
and repair activities, thereby improving the reliability and accessibility of clean water for Tanzanian communities.

### Objectives
1. To predict the functionality of water pumps: Develop a predictive model to classify water pumps into functional, non-functional, and functional needs repair categories based on various features such as amount_tsh, gps_height, waterpoint_type, and others.
2. To identify factors influencing water pump functionality: Conduct exploratory data analysis to identify the key factors (e.g., funder, installer, water quality) that influence the functionality of water pumps and their maintenance needs.
3. To optimize water pump maintenance strategies: Use historical data on water pump failures and repairs to optimize maintenance schedules and resource allocation, ensuring timely repairs and minimizing downtime of water pumps.
4. To assess the geographical distribution of water pump functionality: Analyze the geographical distribution of functional and non-functional water pumps to identify regions with high repair needs and prioritize interventions for improved access to clean water.
5. To evaluate the impact of funding sources on water pump functionality: Investigate the relationship between funding sources and water pump functionality to assess the effectiveness of different funding mechanisms in ensuring sustainable access to clean water.




## Data understanding

In [1]:
# importing relevant modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import random
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
import warnings

In [9]:
# displaying first few rows of the labels set
df1 = pd.read_csv('training_set_labels.csv')
df1.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [3]:
# displaying first few rows of the training set
df2 = pd.read_csv('training_set_values.csv')
df2.head(10)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
5,9944,20.0,2011-03-13,Mkinga Distric Coun,0,DWE,39.172796,-4.765587,Tajiri,0,...,per bucket,salty,salty,enough,enough,other,other,unknown,communal standpipe multiple,communal standpipe
6,19816,0.0,2012-10-01,Dwsp,0,DWSP,33.36241,-3.766365,Kwa Ngomho,0,...,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
7,54551,0.0,2012-10-09,Rwssp,0,DWE,32.620617,-4.226198,Tushirikiane,0,...,unknown,milky,milky,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
8,53934,0.0,2012-11-03,Wateraid,0,Water Aid,32.7111,-5.146712,Kwa Ramadhan Musa,0,...,never pay,salty,salty,seasonal,seasonal,machine dbh,borehole,groundwater,hand pump,hand pump
9,46144,0.0,2011-08-03,Isingiro Ho,0,Artisan,30.626991,-1.257051,Kwapeto,0,...,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump


In [4]:
# loading the test set (df3) and displaying it
df3 = pd.read_csv('test_set_values.csv')
df3

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,...,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,...,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14845,39307,0.0,2011-02-24,Danida,34,Da,38.852669,-6.582841,Kwambwezi,0,...,never pay,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
14846,18990,1000.0,2011-03-21,Hiap,0,HIAP,37.451633,-5.350428,Bonde La Mkondoa,0,...,annually,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
14847,28749,0.0,2013-03-04,,1476,,34.739804,-4.585587,Bwawani,0,...,never pay,soft,good,insufficient,insufficient,dam,dam,surface,communal standpipe,communal standpipe
14848,33492,0.0,2013-02-18,Germany,998,DWE,35.432732,-10.584159,Kwa John,0,...,never pay,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe


In [7]:
# stacking the training values (df2) and test values (df3) into one frame
merged_df = pd.concat([df2, df3], ignore_index=True)
merged_df

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
74245,39307,0.0,2011-02-24,Danida,34,Da,38.852669,-6.582841,Kwambwezi,0,...,never pay,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
74246,18990,1000.0,2011-03-21,Hiap,0,HIAP,37.451633,-5.350428,Bonde La Mkondoa,0,...,annually,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
74247,28749,0.0,2013-03-04,,1476,,34.739804,-4.585587,Bwawani,0,...,never pay,soft,good,insufficient,insufficient,dam,dam,surface,communal standpipe,communal standpipe
74248,33492,0.0,2013-02-18,Germany,998,DWE,35.432732,-10.584159,Kwa John,0,...,never pay,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe


In [10]:
# attaching the labels; the inner join on 'id' keeps only rows that have a status_group
final_merged_df = pd.merge(df1, merged_df, on='id')
final_merged_df

Unnamed: 0,id,status_group,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,functional,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,functional,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,functional,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,non functional,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,functional,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59395,60739,functional,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,...,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
59396,27263,functional,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,...,annually,soft,good,enough,enough,river,river/lake,surface,communal standpipe,communal standpipe
59397,37057,functional,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,...,monthly,fluoride,fluoride,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
59398,31282,functional,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
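
The label merge above relies on every `id` appearing once on each side. As a minimal sketch on toy data (the frames `labels` and `values` and their contents are made up for illustration, not taken from the dataset), the `validate` and `indicator` options of `pd.merge` can make that assumption explicit and show which rows have no label:

```python
import pandas as pd

# Toy stand-ins for the labels file and the stacked train + test values.
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})
values = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gps_height": [1390, 0, 686, 263, 12],
})

# validate raises a MergeError if an id is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
checked = pd.merge(labels, values, on="id", how="outer",
                   validate="one_to_one", indicator=True)
print(checked["_merge"].value_counts())

# An inner join (the pd.merge default) keeps only the rows that have a label.
labelled = pd.merge(labels, values, on="id", validate="one_to_one")
print(labelled.shape)
```

Rows marked `right_only` in the toy output correspond to unlabelled rows, which is what the test-set rows would be if the same check were run on the real frames.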


In [None]:
#Understanding the general information of the data
merged_df.info()

In [None]:
# checking the shape of the data
merged_df.shape

In [None]:
merged_df.duplicated().sum()

### Defining Variables

**Independent Variables (Predictors):**
- Amount_tsh
- Gps_height
- Waterpoint_type
- Funder
- Installer
- Water_quality
- Payment_type
- Region
- Latitude
- Longitude

**Dependent Variable (Target):**nt.
 the waterpoint.
 the waterpoint.f the waterpoint.f the waterpoint.on_code
- district_code
- population
- construction_year


Categorical Variables:
- **status_group:** The status of the waterpoint (e.g., functional, non-functional, functional needs repair).
- **funder:** The organization or individual who funded the waterpoint.
- **region:** The geographic region where the waterpoint is located.
- **extraction_type:** The mechanism used to extract water from the waterpoint.
- **payment:** The type of payment required to access the waterpoint.
- **water_quality:** The quality of the water provided by the waterpoint.
- **source:** The source of the water (e.g., river, well, spring).

Numeric Variables:
- **gps_height:** The altitude of the waterpoint.
- **construction_year:** The year when the waterpoint was constructed.
- **longitude:** The longitude coordinate of the waterpoint.
- **latitude:** The latitude coordinate of the waterpoint.


In [None]:
merged_df.isnull().sum()

## Data preparation and cleaning

### Creating a new dataframe containing only the needed variables

In [None]:
# List of columns to include in the new DataFrame
selected_columns = ['status_group', 'funder', 'gps_height', 'region', 'extraction_type', 'payment', 'water_quality', 'source',
                    'construction_year', 'longitude', 'latitude']

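# Create a new DataFrame with only the selected columns.
# final_merged_df is used because merged_df has no 'status_group' column.
new_df = final_merged_df[selected_columns].copy()

new_df

In [None]:
new_df.shape

new_df has 59400 rows and 11 columns, one for each name in selected_columns.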

In [None]:
new_df.info()

In [None]:
#Understanding the descriptive statistics of the data
new_df.describe()

### Checking for missing values

In [None]:
#Checking for null values in the training set
new_df.isnull().sum()

### Dealing with missing values

In [None]:
unique_counts = new_df['funder'].isna().value_counts()
unique_counts

In [None]:
missing_funders = new_df[new_df['funder'].isna()]
missing_funders

In [None]:
# filling missing funders with 'Unknown', then rechecking for null values
new_df['funder'] = new_df['funder'].fillna('Unknown')
new_df.isnull().sum()


Missing funder values are assigned the placeholder 'Unknown', signifying that the information is unavailable for these entries. The rows are kept rather than dropped because their other columns may still carry important information.
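
To make the placeholder approach concrete, here is a small sketch on a made-up frame (the values in `toy` are illustrative only). It reports the share of missing values per column and then fills the gaps by assignment, which keeps every row:

```python
import numpy as np
import pandas as pd

# Toy frame with two missing funders; not real dataset rows.
toy = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef", np.nan],
    "gps_height": [1390, 0, 263, 1567],
})

# Percentage of missing values in each column.
missing_share = toy.isna().mean().mul(100)
print(missing_share)

# Assigning the result back avoids in-place modification of a column view.
toy["funder"] = toy["funder"].fillna("Unknown")
print(toy)
```

Looking at the missing share first helps judge whether a placeholder is reasonable or whether a column is too sparse to be useful.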

## Exploratory data analysis

### Checking for outliers
Outliers will be addressed systematically, one category at a time, to ensure comprehensive analysis.

#### Status_group

In [None]:
#checking unique categories in status_group 
unique_values = new_df['status_group'].unique()
unique_values

In [None]:
#merging 'functional needs repair' into 'functional'
new_df['status_group'] = new_df['status_group'].replace('functional needs repair', 'functional')

# Check the unique values again
print(new_df['status_group'].value_counts())

In [None]:
# checking for outliers in status_group
plt.figure(figsize=(4,4))

sns.countplot(x="status_group", data=new_df)
plt.title("Distribution of status_group")
plt.xlabel("status_group")
plt.ylabel("Count")
plt.show()

There are no outliers in the status group

#### Funder

In [None]:
unique_values = new_df['funder'].unique()
value_counts = new_df['funder'].value_counts()
value_counts

In [None]:
#checking for outliers in funder using a count plot
plt.figure(figsize=(10, 4))
sns.countplot(x='funder', data=new_df)
plt.xticks(rotation=90, fontsize=8)  # Rotate the x-axis labels by 90 degrees and adjust font size
plt.tight_layout()  # Adjust layout to prevent overlapping labels
plt.show()


In [None]:
# Set the minimum count below which a funder is treated as an outlier
min_count = 50

# Getting the counts of each funder
funder_counts = new_df['funder'].value_counts()

# Identifying the outliers (funders with counts below 50)
outliers = funder_counts[funder_counts < min_count].index

# Create a new column to categorize funders as eligible or outliers
new_df['funder_category'] = np.where(new_df['funder'].isin(outliers), 'Outlier', 'Eligible')

# Set the color palette
sns.set_palette("Greens_d")

# Plot the count plot for funder category
plt.figure(figsize=(8, 4))
sns.countplot(y='funder_category', data=new_df, dodge=False)

# Display the plot
plt.show()





##### Eligible Funders vs. Outliers

The count plot above illustrates the distribution of funders categorized as "Eligible" and "Outlier" based on the threshold set in the code. Here's a summary of the findings:

- **Eligible Funders:** These are funders that appear at least 50 times in the dataset.
- **Outliers:** These are funders that appear fewer than 50 times.

As observed in the plot, the number of outliers is significantly higher than the count of eligible funders. However, we cannot disregard the outliers, as they may contain valuable insights or represent specific cases of interest.
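
When reading such a split, it can help to separate two counts: how many distinct funders are rare, and how many rows those funders account for. The sketch below does this on toy data and also shows one common alternative to flagging, namely grouping rare funders under a single label. The threshold of 3 and the label "Other" are assumptions chosen for the toy example, not values from this notebook:

```python
import pandas as pd

# Toy data: two frequent funders and three rare ones.
toy = pd.DataFrame({"funder": ["A"] * 5 + ["B"] * 4 + ["C", "D", "E"]})
min_count = 3  # hypothetical threshold for the toy data

counts = toy["funder"].value_counts()
rare = counts[counts < min_count].index

# Keep frequent funders as they are and relabel the rare ones.
toy["funder_grouped"] = toy["funder"].where(~toy["funder"].isin(rare), "Other")

summary = pd.Series({
    "rare_funders": len(rare),
    "frequent_funders": len(counts) - len(rare),
    "rows_from_rare_funders": int(toy["funder"].isin(rare).sum()),
    "rows_from_frequent_funders": int((~toy["funder"].isin(rare)).sum()),
})
print(summary)
```

In the toy data the rare funders outnumber the frequent ones, yet they contribute fewer rows, so the two views can point in opposite directions.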



#### GPS height

In [None]:
#checking for outliers for gps_height
plt.figure(figsize=(10,2))

sns.boxplot(x = 'gps_height', data = new_df)

# Display the plot
plt.show()

In [None]:
# Find the mode (most frequent value) of the 'gps_height' column;
# mode() returns a Series, so take its first entry
mode_value = new_df['gps_height'].mode()[0]

# Display the mode
print("Mode of 'gps_height' column:", mode_value)


The mode of the 'gps_height' column is 0, meaning that this is the most common value in the dataset. Because a box plot is positioned by its quartiles, the prevalence of 0 strongly influences where the box sits.

With so many readings at 0, the median (second quartile) likely lies close to this value, resulting in a box plot skewed towards lower values. Consequently, the majority of the data is concentrated at the lower end of the scale.

The presence of a whisker starting below 0 at -90 may suggest data recorded at elevations below a predefined reference datum. In this context, these points below 0 are not considered outliers.

On the other hand, the longer upper whisker compared to the lower one suggests greater dispersion or variability in the upper range of the data (maximum). This could hint at the presence of outliers or extreme values towards higher elevations.

However, it's important to note that we are not removing these outliers. They might represent genuine data points and carry valuable information. Blindly removing them could lead to the loss of valuable insights and potentially bias the analysis or conclusions drawn from the data.
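
For reference, seaborn's box plot by default draws whiskers to the furthest points within 1.5 times the interquartile range (IQR) of the box, and anything beyond is shown as an individual point. A minimal sketch of that rule, using made-up heights rather than the real column:

```python
import pandas as pd

# Toy elevations with many zeros, loosely mimicking the shape described above.
heights = pd.Series([0, 0, 0, 0, 0, 0, 10, 50, 100, 2000], name="gps_height")

q1, q3 = heights.quantile([0.25, 0.75])
iqr = q3 - q1

# Points outside these fences are the ones a box plot draws individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = heights[(heights < lower_fence) | (heights > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(flagged)
```

Because the zeros pull both quartiles down, the upper fence is low and genuinely high elevations are flagged, which is one reason to inspect flagged points before removing them.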

#### Region

In [None]:
# Get the order of regions based on their counts
region_order = new_df['region'].value_counts().index

# Plot the count plot with specified order
plt.figure(figsize=(12, 6))
sns.countplot(x='region', data=new_df, order=region_order, palette='Greens_r')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.xlabel('Region')  # Add x-axis label
plt.ylabel('Count')  # Add y-axis label
plt.title('Distribution of Data by Region')  # Add plot title
plt.tight_layout()  # Adjusting layout to prevent clipping of labels
plt.show()


#### Extraction_type 

In [None]:
new_df['extraction_type'].unique()

In [None]:
#checking for outliers in Extraction_type
sns.set_palette("Greens_r")
extraction_order = new_df['extraction_type'].value_counts().index
# Plot the count plot for Extraction_type
plt.figure(figsize=(12, 4))
sns.countplot(x='extraction_type', data=new_df, order=extraction_order, palette='Greens_r')
plt.xticks(rotation=45)  # Rotate the x-axis labels by 45 degrees
plt.title('Count of Waterpoints by Extraction Type')
plt.xlabel('Extraction Type')
plt.ylabel('Count')
# Display the plot
plt.show()

In [None]:
# Get value counts of 'Extraction_type' and sort by counts in descending order
extraction_type_counts = new_df['extraction_type'].value_counts().sort_values(ascending=False)

# Display unique values in 'Extraction_type' with counts
print(extraction_type_counts)


In the 'extraction_type' column, the majority of water pumps fall into the following categories:

- Gravity: 26,780 pumps
- Nira/Tanira: 8,154 pumps
- Other: 6,430 pumps
- Submersible: 4,764 pumps
- Swn 80: 3,670 pumps
- Mono: 2,865 pumps
- India Mark II: 2,400 pumps
- Afridev: 1,770 pumps
- KSB: 1,415 pumps

However, there are some categories with notably fewer pumps, such as 'Other - Rope Pump', 'Other - Swn 81', 'Windmill', 'India Mark III', 'CEMO', 'Other - Play Pump', 'Walimi', 'Climax', and 'Other - Mkulima/Shinyanga'. These could be outliers in terms of pump distribution, indicating less common or specialized pump types. We therefore cannot simply remove them, as they may hold significance in the dataset.

#### Construction year

In [None]:
#checking for outliers for construction_year
plt.figure(figsize=(8, 2))

sns.boxplot(x = 'construction_year', data = new_df)

# Display the plot
plt.show()

The box plot shows implausible years in the dataset. A construction year of 0 is impossible for a water pump, so this value most likely stands in for a missing year.

In [None]:
#Displaying unique years and their value counts
new_df['construction_year'].value_counts()

A large group of records, 20709 in total, is miscategorized as year '0'.

The code below generates a scatter plot visualizing the geographical locations of water pumps in Tanzania, with each point representing a water pump. The x-axis represents the longitude coordinates, and the y-axis represents the latitude coordinates. The color of each point is determined by the construction year of the water pump, with a colormap ('Greens') used to provide different hues of green corresponding to different years. The size of each point is fixed ('s=100') for better visibility, and transparency ('alpha=0.4') is applied to avoid overlapping points. Finally, a colorbar is added to the plot to provide a visual reference for the construction years.


In [None]:
plt.figure(figsize=(8,6))
# Filter the DataFrame using .loc and multiple conditions
filtered_df = new_df.loc[(new_df['longitude'] > 0) & (new_df['latitude'] < 0) & (new_df['construction_year'] > 0)]
plt.scatter(x=filtered_df['longitude'], 
            y=filtered_df['latitude'],
            alpha=0.4,
            s=100,  
            c=filtered_df["construction_year"], 
            cmap='Greens')
plt.title("Construction Years & Locations of Waterpumps in Tanzania", 
          fontsize=12, fontweight='bold')
plt.colorbar(label='Construction Year')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()


From the scatter plot above, it is clear that most pumps were installed between 2000 and 2010. Therefore, the year '0' values are replaced below with years drawn at random from the range 2000 - 2010.

In [None]:
# Replace year 0 with later years (i.e., 2000 - 2010)
new_df['construction_year'] = new_df['construction_year'].apply(lambda x: np.random.randint(2000, 2011) if x == 0 else x)
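
Because the replacement years are random, rerunning the notebook gives different values each time. The sketch below shows one way to make the same idea reproducible and vectorized, using a seeded NumPy generator on a toy Series. The seed 42 and the toy years are arbitrary assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, fixed for reproducibility
years = pd.Series([0, 1999, 0, 2008, 0, 1975], name="construction_year")

# Boolean mask of the placeholder entries.
missing = years == 0

# integers() samples from [low, high), so 2011 itself is never drawn.
years.loc[missing] = rng.integers(2000, 2011, size=missing.sum())
print(years)
```

Only the masked entries change, so the recorded years stay untouched.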

In [None]:
#rechecking for outliers for construction_year
plt.figure(figsize=(8, 2))

sns.boxplot(x = 'construction_year', data = new_df)

# Display the plot
plt.show()

Outliers in the construction years falling between 1960 and 1978 represent years that may hold significant importance in the dataset. While they deviate from the majority of the construction years, they could signify historical data or specific events related to water pump installation during that period. Therefore, it's crucial to retain these outliers in the dataset for a comprehensive analysis and understanding of the trends and patterns over time.


#### Payment

In [None]:
new_df['payment'].value_counts()

In [None]:
# Define the order of payment categories
sorted_payments = new_df['payment'].value_counts().index

# checking for outliers in payment
sns.set_palette("Greens_r")

# Plot the count plot for payment
plt.figure(figsize=(8, 4))
sns.countplot(x='payment', data=new_df, order=sorted_payments, palette='Greens_r')
plt.xticks(rotation=45)  # Rotate the x-axis labels by 45 degrees
plt.title('Count of Waterpoints by payment')
plt.xlabel('payment')
plt.ylabel('Count')
# Display the plot
plt.show()

The distribution of payment types, as observed in the count plot above, reveals an interesting trend. The "never pay" category dominates the dataset, indicating that a significant portion of water points in the dataset do not require any payment. This could be due to various reasons, such as government subsidies or community initiatives aimed at providing free access to water.

In contrast, the paid categories exhibit a more even distribution, with multiple categories having similar counts. This distribution suggests that while there are options for paid water access, they are not as prevalent as the "never pay" category. This observation might be attributed to the socioeconomic factors prevalent in the area. Residents who cannot afford paid water services may opt for the free "never pay" option, resulting in its higher prevalence in the dataset.

Therefore, the presence of multiple paid categories with similar counts does not necessarily indicate outliers. Instead, it reflects the diverse payment options available and the socioeconomic dynamics influencing water access in the region.
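
Statements such as "dominates" are easier to verify when counts are expressed as shares. A minimal sketch with invented category counts (the toy Series does not reproduce the real distribution):

```python
import pandas as pd

# Toy payment column; the proportions are made up for illustration.
payment = pd.Series(
    ["never pay"] * 6 + ["pay per bucket"] * 2 + ["pay monthly", "unknown"],
    name="payment",
)

# normalize=True returns fractions instead of raw counts.
shares = payment.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

The same call on `new_df['payment']` would give the percentage of waterpoints in each payment category.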

#### Water quality

In [None]:
new_df['water_quality'].value_counts()

In [None]:
# Define the order of water_quality categories
sorted_water_quality = new_df['water_quality'].value_counts().index

# Set the color palette to shades of green
palette = sns.color_palette("Greens_r", len(sorted_water_quality))

# Plot the count plot for water_quality
plt.figure(figsize=(8, 4))
sns.countplot(x='water_quality', data=new_df, order=sorted_water_quality, palette=palette)
plt.xticks(rotation=45)  # Rotate the x-axis labels by 45 degrees
plt.title('Count of Waterpoints by water_quality')
plt.xlabel('water_quality')
plt.ylabel('Count')
# Display the plot
plt.show()


The count plot above indicates that the most prevalent category is "soft". This suggests that most water sources provide satisfactory water quality. It could also mean most people prefer soft water, leading to its prevalence.
Next, we find the "salty" category, which exhibits a considerably lower count in comparison to "soft." This suggests that while some water sources may have elevated salinity levels, they are less common than those providing "soft" water. Then there are the "milky" and "coloured" categories, which may raise concerns regarding water quality. These categories, while not as frequent as "soft" or "salty," suggest the presence of impurities or contaminants that could affect the desirability of the water.

Another notable category is "salty abandoned," which indicates water sources that have been abandoned, likely due to high salinity levels. This category, although less common, highlights instances where water quality issues have led to the abandonment of waterpoints.

Lastly, we have the "fluoride" and "fluoride abandoned" categories, which indicate the presence of fluoride in the water. While fluoride is beneficial in controlled amounts for dental health, excessive levels can be harmful. The presence of "fluoride abandoned" suggests instances where water sources have been abandoned due to excessive fluoride.

Generally, the plot reveals a diverse landscape of water quality categories, with "soft" being the predominant category. While certain categories may raise concerns, such as "salty abandoned" or "fluoride," they do not appear to be outliers but are rather indicative of the range of water quality issues present across the waterpoints in our dataset.

#### Source

In [None]:
new_df['source'].value_counts()

In [None]:
# Define the order of source categories
sorted_source = new_df['source'].value_counts().index

# Set the color palette to shades of green, one shade per source category
palette = sns.color_palette("Greens_r", len(sorted_source))

# Plot the count plot for source
plt.figure(figsize=(8, 4))
sns.countplot(x='source', data=new_df, order=sorted_source, palette=palette)
plt.xticks(rotation=45)  # Rotate the x-axis labels by 45 degrees
plt.title('Count of Waterpoints by source')
plt.xlabel('source')
plt.ylabel('Count')
# Display the plot
plt.show()

"Spring" and "shallow well" emerge as the most prevalent sources, followed closely by "machine dbh" and "river." These categories exhibit relatively high counts, indicating their widespread usage as water sources.

Next in line is "rainwater harvesting," although its count is notably lower compared to the preceding categories. "Hand dtw," "lake," and "dam" follow, each with decreasing counts.

Finally, we have the categories of "unknown" and "other," which appear to represent sources with less distinct categorization or sources not captured by the specified categories.

Overall, while there is variation in the counts across different water source categories, there are no outliers that significantly deviate from the expected distribution. Instead, the distribution reflects the diverse range of water sources utilized across waterpoints in our dataset.

#### Longitude

In [None]:
#checking for outliers for longitude
plt.figure(figsize=(8, 2))

sns.boxplot(x = 'longitude', data = new_df)

# Display the plot
plt.show()

The plot shows the presence of outliers.

In [None]:
new_df['longitude'].value_counts()

A large number of records are miscategorized as longitude '0'. These are dealt with below by redistributing them to a range where pumps occur more frequently.

In [None]:
# plotting a scatter plot to show the majority of longitude points
plt.figure(figsize=(8,6))
# Filter the DataFrame using .loc and multiple conditions
filtered_df = new_df.loc[(new_df['longitude'] > 0) & (new_df['latitude'] < 0)]
plt.scatter(x=filtered_df['longitude'], 
            y=filtered_df['latitude'],
            alpha=0.4,
            s=100,  
            c=filtered_df["longitude"], 
            cmap='Greens')
plt.title("Locations of Waterpumps in Tanzania", 
          fontsize=12, fontweight='bold')
plt.colorbar(label='Longitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()



It's evident that there are more water pumps located at longitudes greater than 34 degrees than those located at longitudes less than 34 degrees. This suggests a higher concentration of water points towards the eastern side of the region under consideration. Longitude, representing the east-west position on the Earth's surface, indicates that the area to the east of 34 degrees longitude may have higher population densities or other factors contributing to the need for more water access points compared to the western region.


In [None]:
# Replace longitude 0 with a random integer longitude from 32 to 41, where pumps are more prevalent
# (np.random.randint excludes the upper bound, so 42 is never drawn)
new_df['longitude'] = new_df['longitude'].apply(lambda x: np.random.randint(32, 42) if x == 0 else x)

In [None]:
# confirming redistribution of the '0' category
new_df['longitude'].value_counts()

In [None]:
#rechecking for outliers for longitude
plt.figure(figsize=(8, 2))

sns.boxplot(x = 'longitude', data = new_df)

# Display the plot
plt.show()

#### Latitude

In [None]:
#checking for outliers for latitude
plt.figure(figsize=(8, 2))

sns.boxplot(x = 'latitude', data = new_df)

# Display the plot
plt.show()

There do not seem to be outliers in latitude, but further analysis is still necessary.

In [None]:
new_df['latitude'].value_counts()

One latitude value, -2.000000e-08, occurs 1812 times and seems to be miscategorized. This will be dealt with below.

In [None]:
# Plotting a scatter plot to show the majority of latitude points
plt.figure(figsize=(8,6))
# Filter the DataFrame using .loc and multiple conditions
filtered_df = new_df.loc[(new_df['longitude'] > 0) & (new_df['latitude'] < 0)]
plt.scatter(x=filtered_df['longitude'], 
            y=filtered_df['latitude'],
            alpha=0.4,
            s=100,  
            c=filtered_df["latitude"], 
            cmap='Greens')
plt.title("Locations of Waterpumps in Tanzania", 
          fontsize=12, fontweight='bold')
plt.colorbar(label='Latitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()


A scatter plot of pump locations shows that most pumps lie between latitudes -1 and -8. The placeholder latitudes will be randomly distributed across this range.
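
One way to spread placeholder rows across the observed range is to draw continuous values from a uniform distribution with a fixed seed, so that the result is reproducible. This is an alternative sketch, not the notebook's own approach, which draws integers with `np.random.randint`. The function name `impute_uniform`, the seed, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def impute_uniform(series, mask, low, high, seed=42):
    """Replace masked entries with uniform random draws from [low, high)."""
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    result = series.astype(float).copy()
    result.loc[mask] = rng.uniform(low, high, size=int(mask.sum()))
    return result

# Toy data for illustration only (not from the dataset)
toy_lat = pd.Series([-2e-08, -6.45, -2e-08, -3.21])
is_placeholder = np.isclose(toy_lat, -2e-08, rtol=0.0, atol=1e-6)
imputed = impute_uniform(toy_lat, is_placeholder, low=-8.0, high=-1.0)
print(imputed)
```

Unlike integer draws, continuous draws avoid stacking all imputed pumps on a handful of whole-degree latitudes.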

In [None]:
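# Replace the placeholder latitude (about -2e-08) with random integers in [-8, -1), the range where most pumps lie
# np.isclose avoids relying on exact floating-point equality
new_df['latitude'] = new_df['latitude'].apply(lambda x: np.random.randint(-8, -1) if np.isclose(x, -2e-08, rtol=0, atol=1e-6) else x)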


In [None]:
#confirming the redistribution of misplaced category
new_df['latitude'].value_counts()

### Distribution of variables before log transformation

#### Categorical variables

In [None]:
# Distribution before onehot encoding
palette = sns.color_palette("Greens_r")

# Selecting categorical variables
categorical_features = new_df.select_dtypes(include=['object'])

# Plot count plots for each categorical variable with dark green color palette
for feature in categorical_features.columns:
    plt.figure(figsize=(4, 4))  # Set the figure size
    sns.countplot(x=feature, data=new_df, palette=palette)
    plt.title(f'Count Plot of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
    plt.show()  # Display the plot


#### Numerical variables

In [None]:
# Print unique values in the 'gps_height' column
print(new_df['gps_height'].unique())


In [None]:
# Get unique values in the 'gps_height' column
unique_gps_heights = new_df['gps_height'].unique()

# Sort unique values in descending order
sorted_unique_gps_heights = sorted(unique_gps_heights, reverse=True)

# Print each unique value from highest to lowest
for value in sorted_unique_gps_heights:
    print(value)


In [None]:
# Distribution before transformation
numerical_features = new_df.select_dtypes(include=['int64', 'float64'])
n_features = len(numerical_features.columns)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols  # ceiling division: no spare empty row

# Create a grid of subplots; squeeze=False keeps axes 2-D even with a single row
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(15, 5), squeeze=False)
axes = axes.ravel()

# Plot the distribution of numerical features
for ax, feature in zip(axes, numerical_features.columns):
    sns.histplot(new_df[feature].dropna(), kde=False, ax=ax)
    ax.set_title(f"Distribution of {feature}")
    ax.set_xlabel(feature)
    ax.set_ylabel("Frequency")

# Remove empty subplots
for ax in axes[n_features:]:
    fig.delaxes(ax)

# Adjust layout
plt.tight_layout()
plt.show()


### Distribution of variables after log transformation

In [None]:
# Select numerical columns
numerical_columns = new_df.select_dtypes(include=['int64', 'float64']).columns

# Log transform numerical variables, handling zero and negative values
for col in numerical_columns:
    # Work on a copy of the values so the original column is left unchanged
    values = new_df[col]

    # Handling negative values: shift so that the minimum becomes 0
    if (values < 0).any():
        values = values - values.min()

    # np.log1p computes log(1 + x), so zeros need no extra offset
    new_df[col + '_log'] = np.log1p(values)

# Display the DataFrame after log transformation
new_df.head()


In [None]:
# Plot the distribution of numerical features after log transformation
plt.figure(figsize=(12, 8))

# Loop through each numerical feature
for col in numerical_columns:
    # Plot the distribution after log transformation
    sns.histplot(new_df[col + '_log'], kde=True, label=col, alpha=0.6)

plt.title('Distribution of Numerical Features after Log Transformation')
plt.xlabel('Value (Log Transformed)')
plt.ylabel('Frequency')
plt.legend()
plt.show()
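
`np.log1p(x)` computes log(1 + x), so zeros need no manual offset and only columns with negative values need shifting. The sketch below shows this on toy values and recovers the originals with `np.expm1`. The helper `shifted_log1p` and the toy values are hypothetical, and this is an illustration rather than the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

def shifted_log1p(values: pd.Series):
    """Return (log1p of the shifted values, the shift that was applied)."""
    # Shift only if negatives are present, so that the minimum becomes 0
    shift = float(-values.min()) if (values < 0).any() else 0.0
    return np.log1p(values + shift), shift

# Toy data for illustration only (not from the dataset)
toy = pd.Series([-90.0, 0.0, 10.0, 1000.0])
logged, shift = shifted_log1p(toy)

# Invert the transformation: expm1 undoes log1p, then remove the shift
restored = np.expm1(logged) - shift
print(logged.round(3).tolist())
```

Keeping the shift alongside the transformed values makes it possible to map model inputs back to their original scale.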


In [None]:
# Select categorical columns
categorical_columns = new_df.select_dtypes(include=['object']).columns

# Perform one-hot encoding
one_hot_encoded_df = pd.get_dummies(new_df, columns=categorical_columns)

# Display the one-hot encoded DataFrame
one_hot_encoded_df.head()
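
When encoding features for a classifier, the target column is usually kept out of the one-hot step so that it does not end up among the predictors. The following is a minimal sketch, assuming `status_group` is the target as stated in the objectives. The toy frame and its values are illustrative only.

```python
import pandas as pd

# Toy data for illustration only (not from the dataset)
toy_df = pd.DataFrame({
    'waterpoint_type': ['hand pump', 'communal standpipe', 'hand pump'],
    'gps_height': [1390, 686, 263],
    'status_group': ['functional', 'non functional', 'functional'],
})

# Separate the target before encoding
features = toy_df.drop(columns=['status_group'])
target = toy_df['status_group']

# Encode only the non-numeric feature columns
categorical_columns = features.select_dtypes(exclude=['number']).columns
encoded = pd.get_dummies(features, columns=categorical_columns)
print(encoded.columns.tolist())
```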


In [None]:
# A correlation heat map between variables
numerical_features = new_df.select_dtypes(include=['int64', 'float64'])
target_variable = new_df['status_group']  # status_group being the target variable

# Computing correlation matrix
correlation_matrix = numerical_features.corr()

# Plotting heatmap of correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='Greens', fmt=".2f", annot_kws={"size": 10})
plt.title('Correlation Heatmap')
plt.show()