### Index

1. [Importing the libraries](#1.-Importing-the-libraries)
2. [Reading the dataset](#2.-Reading-the-dataset)
3. [Data Cleaning](#3.-Data-Cleaning)

## 1. Importing the libraries

In [1]:
from glob import glob

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt

## 2. Reading the dataset

In [2]:
files = glob('data/*.csv')

files

['data/test.csv',
 'data/SubmissionFormat.csv',
 'data/training.csv',
 'data/labels.csv']

In [3]:
# Reading the training dataset
training_data = pd.read_csv('data/training.csv')

# Reading the labels dataset
training_labels = pd.read_csv('data/labels.csv')

In [4]:
training_data.shape

(59400, 40)

In [5]:
training_labels.shape

(59400, 2)

In [6]:
# Merging the training data and labels
training_data = training_data.merge(training_labels, on='id')

training_data.shape

(59400, 41)
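The row count stays at 59400 after the merge, which suggests each `id` matches exactly one label. As a minimal sketch of how that assumption could be enforced rather than inferred from the shape, pandas' `merge` accepts a `validate` argument that raises an error when the keys are not one-to-one. The `features` and `labels` frames below are small made-up stand-ins, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frames standing in for training_data and training_labels
features = pd.DataFrame({"id": [1, 2, 3], "amount_tsh": [6000.0, 0.0, 25.0]})
labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "functional", "non functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if `id` is
# duplicated on either side, so a silent row blow-up cannot happen
merged = features.merge(labels, on="id", how="inner", validate="one_to_one")

# With an inner join, a matching row count also confirms no ids were dropped
print(merged.shape, len(merged) == len(features))
```

The same `validate` argument could be passed in the merge cell above if a stricter check is wanted.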

## 3. Data Cleaning

In [7]:
# Looking at the first rows of the dataset
training_data.head()

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe | functional |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe | non functional |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |


In [8]:
for col in training_data.columns:
    print(col, training_data[col].nunique())

id 59400
amount_tsh 98
date_recorded 356
funder 1897
gps_height 2428
installer 2145
longitude 57516
latitude 57517
wpt_name 37400
num_private 65
basin 9
subvillage 19287
region 21
region_code 27
district_code 20
lga 125
ward 2092
population 1049
public_meeting 2
recorded_by 1
scheme_management 12
scheme_name 2696
permit 2
construction_year 55
extraction_type 18
extraction_type_group 13
extraction_type_class 7
management 12
management_group 5
payment 7
payment_type 7
water_quality 8
quality_group 6
quantity 5
quantity_group 5
source 10
source_type 7
source_class 3
waterpoint_type 7
waterpoint_type_group 6
status_group 3


The counts above show that the `id` column has 59400 unique values, which equals the number of rows in the data, so `id` uniquely identifies each record.
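As a minimal sketch, the same observation can be checked programmatically instead of by reading the printed counts. The frame `toy` below is a made-up stand-in for `training_data`; the second check, which flags columns with a single distinct value (as `recorded_by` has in the output above), is an extra illustration and not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the values are for illustration only
toy = pd.DataFrame({
    "id": [69572, 8776, 34310],
    "recorded_by": ["A", "A", "A"],
})

# True when every id appears exactly once
id_is_unique = toy["id"].is_unique

# Columns with a single distinct value are constant across all rows
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

print(id_is_unique, constant_cols)
```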

### Checking for columns that contain the same data

In [9]:
training_data['extraction_type'].value_counts()

gravity                      26780
nira/tanira                   8154
other                         6430
submersible                   4764
swn 80                        3670
mono                          2865
india mark ii                 2400
afridev                       1770
ksb                           1415
other - rope pump              451
other - swn 81                 229
windmill                       117
india mark iii                  98
cemo                            90
other - play pump               85
walimi                          48
climax                          32
other - mkulima/shinyanga        2
Name: extraction_type, dtype: int64

In [10]:
training_data['extraction_type_group'].value_counts()

gravity            26780
nira/tanira         8154
other               6430
submersible         6179
swn 80              3670
mono                2865
india mark ii       2400
afridev             1770
rope pump            451
other handpump       364
other motorpump      122
wind-powered         117
india mark iii        98
Name: extraction_type_group, dtype: int64

In [11]:
training_data['extraction_type_class'].value_counts()

gravity         26780
handpump        16456
other            6430
submersible      6179
motorpump        2987
rope pump         451
wind-powered      117
Name: extraction_type_class, dtype: int64
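The three sets of counts suggest that each column is a coarser grouping of the previous one: for example, `submersible` (4764) plus `ksb` (1415) in `extraction_type` adds up to the 6179 `submersible` rows in `extraction_type_group`. Matching totals are only a hint, so the sketch below shows one way to test the nesting directly. The helper `is_refinement` is a hypothetical name introduced here, and `toy` is a small made-up frame whose rows mirror the mapping the counts imply.

```python
import pandas as pd

# Hypothetical toy rows; the real check would run on training_data
toy = pd.DataFrame({
    "extraction_type": [
        "gravity", "ksb", "submersible", "cemo", "climax", "gravity",
    ],
    "extraction_type_group": [
        "gravity", "submersible", "submersible",
        "other motorpump", "other motorpump", "gravity",
    ],
})

def is_refinement(df, fine, coarse):
    """Return True if each value of `fine` maps to exactly one value of `coarse`."""
    return bool((df.groupby(fine)[coarse].nunique() == 1).all())

# The finer column determines the coarser one, but not the other way round
fine_to_coarse = is_refinement(toy, "extraction_type", "extraction_type_group")
coarse_to_fine = is_refinement(toy, "extraction_type_group", "extraction_type")

# A cross-tabulation shows which fine categories fall into which group
mapping = pd.crosstab(toy["extraction_type"], toy["extraction_type_group"])

print(fine_to_coarse, coarse_to_fine)
print(mapping)
```

If the check holds on the real data, the coarser columns add no information beyond the finer one, which is one possible reason to keep only a single level of the hierarchy.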