## MODELLING TANZANIA WATER POINTS

### Data pre-processing

I will drop columns that carry no meaningful information for the model and convert the target variable from its string labels (functional, non functional, functional needs repair) to the integers 0, 1 and 2 so that it can be used for modelling. I will import the cleaned data from my EDA notebook instead of repeating the cleaning here, which helps reduce the run time of this notebook.

Importing my EDA notebook

In [2]:
from ipynb.fs.full.EDA import *

In [4]:
df.head()

Unnamed: 0,gps_height,longitude,latitude,basin,region,district_code,lga,ward,population,public_meeting,permit,extraction_type_group,management,payment,water_quality,quantity,source,waterpoint_type,status_group,construction_year_bins
0,1390,34.938093,-9.856322,Lake Nyasa,Iringa,5,Ludewa,Mundindi,109,True,False,gravity,vwc,pay annually,soft,enough,spring,communal standpipe,functional,"(1990, 2000]"
1,1399,34.698766,-2.147466,Lake Victoria,Mara,2,Serengeti,Natta,280,True,True,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe,functional,"(2000, 2010]"
2,686,37.460664,-3.821329,Pangani,Manyara,4,Simanjiro,Ngorika,250,True,True,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe multiple,functional,"(2000, 2010]"
3,263,38.486161,-11.155298,Ruvuma / Southern Coast,Mtwara,63,Nanyumbu,Nanyumbu,58,True,True,submersible,vwc,never pay,soft,dry,machine dbh,communal standpipe multiple,non functional,"(1980, 1990]"
4,0,31.130847,-1.825359,Lake Victoria,Kagera,1,Karagwe,Nyakasimbi,281,True,True,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe,functional,"(1990, 2000]"


I will create a copy of the cleaned DataFrame `df` from the EDA notebook so that the original cleaned data stays unchanged.

In [5]:
#create a copy of the clean df
clean_df = df.copy()

In [7]:
clean_df.head()

Unnamed: 0,gps_height,longitude,latitude,basin,region,district_code,lga,ward,population,public_meeting,permit,extraction_type_group,management,payment,water_quality,quantity,source,waterpoint_type,status_group,construction_year_bins
0,1390,34.938093,-9.856322,Lake Nyasa,Iringa,5,Ludewa,Mundindi,109,True,False,gravity,vwc,pay annually,soft,enough,spring,communal standpipe,functional,"(1990, 2000]"
1,1399,34.698766,-2.147466,Lake Victoria,Mara,2,Serengeti,Natta,280,True,True,gravity,wug,never pay,soft,insufficient,rainwater harvesting,communal standpipe,functional,"(2000, 2010]"
2,686,37.460664,-3.821329,Pangani,Manyara,4,Simanjiro,Ngorika,250,True,True,gravity,vwc,pay per bucket,soft,enough,dam,communal standpipe multiple,functional,"(2000, 2010]"
3,263,38.486161,-11.155298,Ruvuma / Southern Coast,Mtwara,63,Nanyumbu,Nanyumbu,58,True,True,submersible,vwc,never pay,soft,dry,machine dbh,communal standpipe multiple,non functional,"(1980, 1990]"
4,0,31.130847,-1.825359,Lake Victoria,Kagera,1,Karagwe,Nyakasimbi,281,True,True,gravity,other,never pay,soft,seasonal,rainwater harvesting,communal standpipe,functional,"(1990, 2000]"


#### Converting the target variable to the numerical values 0, 1, 2

The target variable `status_group` holds the string labels functional, functional needs repair and non functional. I map them to 0, 1 and 2 respectively so that the model can work with them.

In [8]:
target_variable = {'functional':0, 
                   'non functional': 2, 
                   'functional needs repair': 1} 
clean_df['status_group'] = clean_df['status_group'].replace(target_variable)

In [9]:
clean_df['status_group'].value_counts()

0    32259
2    22824
1     4317
Name: status_group, dtype: int64

- 0 = functional water points

- 1 = functional water points that need repair

- 2 = non-functional water points
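
As a hedged aside, the same encoding can be written with `Series.map`, which makes it possible to check that every label was covered by the dictionary. This is a minimal sketch on a toy series, not the notebook's data; the name `status` and its values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for clean_df['status_group']; the values are illustrative only.
status = pd.Series(['functional', 'non functional',
                    'functional needs repair', 'functional'])

target_variable = {'functional': 0,
                   'functional needs repair': 1,
                   'non functional': 2}

# map() turns any label missing from the dictionary into NaN,
# so a misspelled label is detectable instead of being silently kept.
encoded = status.map(target_variable)
unmapped = status[encoded.isna()].unique()
assert len(unmapped) == 0, f"unmapped labels: {unmapped}"

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 2, 1, 0]
```

With `replace`, as used in the cell above, an unexpected label would simply be left as a string, so the check is the main reason to consider `map` here.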



I will drop the `lga` and `ward` columns, since the `region` column already carries similar location information.

In [10]:
# drop the lga and ward columns since the region column carries similar information
clean_df.drop(columns=['lga', 'ward'], inplace=True)

#### Converting boolean columns of True/False values to 0 and 1

In [13]:
# convert True/False in the permit column to 1/0
clean_df['permit'] = clean_df['permit'].astype(bool).astype(int)

In [14]:
#convert True/False in public meeting column to 0-1
clean_df['public_meeting'] = clean_df['public_meeting'].astype(bool).astype(int)
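
One thing worth checking before this conversion is whether the columns still contain missing values, because `astype(bool)` treats `NaN` as truthy. The sketch below illustrates that behaviour on a made-up series; it assumes nothing about whether the EDA notebook already filled the missing values in `permit` and `public_meeting`.

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value (not taken from the dataset).
permit = pd.Series([True, False, np.nan], dtype=object)

# astype(bool) treats NaN as truthy, so the missing value becomes 1.
naive = permit.astype(bool).astype(int)

# Counting missing values first makes that behaviour visible.
n_missing = int(permit.isna().sum())

print(naive.tolist(), n_missing)  # [1, 0, 1] 1
```

If the count is zero for both columns, the conversion above is safe as written.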

#### Splitting the features into numerical and categorical groups

I will divide my features into two groups: `numerical_variables`, to be scaled using a scaler, and `categorical_variables`, to be encoded using an encoder.

In [16]:
# numerical features placed in a variable 
numerical_variables = ['gps_height','longitude','latitude','district_code','population','public_meeting','permit'] 

In [17]:
# categorical features placed in a variable
categorical_variables = ['basin', 'region', 'extraction_type_group', 'management', 'payment',
                         'water_quality', 'quantity', 'source', 'waterpoint_type',
                         'construction_year_bins']
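
To illustrate how two such lists are typically consumed, here is a minimal sketch that scales the numerical columns and one-hot encodes the categorical ones in a single step. The choice of scikit-learn's `StandardScaler`, `OneHotEncoder` and `ColumnTransformer` is an assumption, since the text above does not name the scaler or encoder; the toy frame reuses a few values from the first rows shown earlier and shortens both lists for readability.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; column names follow the notebook.
toy = pd.DataFrame({
    'gps_height': [1390, 1399, 686, 263],
    'population': [109, 280, 250, 58],
    'quantity': ['enough', 'insufficient', 'enough', 'dry'],
})

# Shortened versions of the notebook's lists, for this sketch only.
numerical_variables = ['gps_height', 'population']
categorical_variables = ['quantity']

# Each transformer is applied only to the columns listed for it.
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numerical_variables),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_variables),
])

X = preprocessor.fit_transform(toy)
print(X.shape)  # (4, 5): 2 scaled columns + 3 one-hot columns
```

Fitting the transformer on the training split only, and then applying it to the test split, keeps information from the test data out of the scaling statistics.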

### Converting the ternary target to a binary target

I will convert my target variable from three classes to two. Functional water points and water points that are functional but need repair are combined into class 1, and non functional water points become class 0.

In [18]:
# convert ternary classes to binary class
status_group_dict = {0: 1, 1: 1, 2: 0}
clean_df['status_group'] = clean_df['status_group'].replace(status_group_dict)

In [19]:
#print the status group value count
clean_df['status_group'].value_counts()

1    36576
0    22824
Name: status_group, dtype: int64

Now we have two categories to predict (binary):

- 1 = functional water points (including those that need repair)

- 0 = non-functional water points
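
The two classes are not evenly sized, which matters when reading accuracy scores later. As a small worked example, the snippet below computes the class shares from the counts printed by `value_counts()` above, along with the accuracy a model would get by always predicting the larger class. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {1: 36576, 0: 22824}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most common class.
majority_baseline = max(shares.values())

print(total, round(majority_baseline, 4))  # 59400 0.6158
```

Any model in the following sections should therefore be judged against roughly 62% accuracy rather than 50%.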




### MODEL 1: Logistic Regression