## Inspecting the Training Data
In this section we inspect the training data and group variables for further analysis.

### Importing the Train Test Split
Below we import the train-test split that we created in the [ImportData](ImportData.ipynb) notebook. The `train_test_split` object is a dictionary whose values are dataframes.

In [1]:
import pickle

# Load the dictionary of dataframes created in the ImportData notebook.
with open('../data/train_test_split.pkl', mode='rb') as f:
    train_test_split = pickle.load(f)

### Extracting Training Data

In [2]:
X_train = train_test_split['X_train']
y_train = train_test_split['y_train']

### Inspecting the Shape of the Training Data
We observe that our training data contains 53,460 observations of thirty-nine features and one target.

In [3]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)

X_train shape:  (53460, 39)
y_train shape:  (53460,)


### Inspecting Data Types
Below we see that, of the thirty-nine features, nine are numerical (six `int64` and three `float64`) and thirty have the `object` dtype, which we treat as categorical.

In [4]:
X_train.dtypes.value_counts()

object     30
int64       6
float64     3
dtype: int64
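
The same split can be made programmatically with `select_dtypes`. The snippet below is a minimal sketch on a small hypothetical frame named `toy`, not on `X_train` itself; the column values are invented for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for X_train; the real frame has 39 columns.
toy = pd.DataFrame({
    'amount_tsh': [0.0, 25.0, 500.0],
    'gps_height': [1390, 0, 686],
    'basin': ['Lake Nyasa', 'Pangani', 'Rufiji'],
})

# 'number' matches both integer and floating-point columns.
numerical_cols = toy.select_dtypes(include='number').columns.tolist()
# Everything that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Numerical:  ', numerical_cols)
print('Categorical:', categorical_cols)
```

Applied to `X_train`, the two lists should contain nine and thirty names respectively, matching the counts above.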

The list below shows the data type for each feature.

In [5]:
X_train.dtypes.sort_index()

amount_tsh               float64
basin                     object
construction_year          int64
date_recorded             object
district_code              int64
extraction_type           object
extraction_type_class     object
extraction_type_group     object
funder                    object
gps_height                 int64
installer                 object
latitude                 float64
lga                       object
longitude                float64
management                object
management_group          object
num_private                int64
payment                   object
payment_type              object
permit                    object
population                 int64
public_meeting            object
quality_group             object
quantity                  object
quantity_group            object
recorded_by               object
region                    object
region_code                int64
scheme_management         object
scheme_name               object
source    

Note that the `date_recorded` variable contains string-encoded dates, so pandas stores it as `object` rather than as a datetime type. Otherwise, the assigned data types seem reasonable.

In [6]:
X_train[['date_recorded']].head()

id,date_recorded
31080,2012-10-10
17282,2013-02-16
72545,2011-03-20
44490,2012-10-12
67816,2013-02-04
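
If a true datetime column is needed later, the strings can be parsed with `pd.to_datetime`. This is a sketch under the assumption that every entry follows the `YYYY-MM-DD` pattern seen in the sample above; the series `dates` is a hypothetical stand-in for `X_train['date_recorded']`.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of `date_recorded`.
dates = pd.Series(['2012-10-10', '2013-02-16', '2011-03-20'], name='date_recorded')

# An explicit format makes malformed entries raise instead of being guessed at.
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

# Once parsed, date components are available through the .dt accessor.
print(parsed.dt.year.tolist())
print(parsed.dt.month.tolist())
```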


### Checking for Missing Values
Below we check for features with missing values. There are seven features with missing values and thirty-two without.

In [7]:
X_train.isnull().any().value_counts()

False    32
True      7
dtype: int64
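
Knowing how much of each feature is missing is often more useful than knowing only whether anything is missing. The following sketch shows one way to tabulate counts and shares; the frame `toy` and its values are hypothetical, chosen only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical stand-in for X_train with a few missing entries.
toy = pd.DataFrame({
    'funder': ['Roman', None, 'Unicef', None],
    'permit': [True, False, None, True],
    'basin': ['Pangani', 'Rufiji', 'Pangani', 'Lake Nyasa'],
})

# Number of missing entries per feature, and the same as a fraction of rows.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Keep only features that actually have missing values, largest first.
summary = pd.DataFrame({'count': missing_count, 'share': missing_share})
summary = summary[summary['count'] > 0].sort_values('count', ascending=False)
print(summary)
```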

The list below shows which features have missing values.

In [8]:
X_train.isnull().any().sort_index()

amount_tsh               False
basin                    False
construction_year        False
date_recorded            False
district_code            False
extraction_type          False
extraction_type_class    False
extraction_type_group    False
funder                    True
gps_height               False
installer                 True
latitude                 False
lga                      False
longitude                False
management               False
management_group         False
num_private              False
payment                  False
payment_type             False
permit                    True
population               False
public_meeting            True
quality_group            False
quantity                 False
quantity_group           False
recorded_by              False
region                   False
region_code              False
scheme_management         True
scheme_name               True
source                   False
source_class             False
source_t

### Classification of Features
Below we group features by the type of data that they contain. We referred to the competition [feature descriptions](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/) when grouping variables.

In [9]:
feature_classification = {
    'Geospatial': [
        'longitude',
        'latitude',
        'gps_height'
    ],
    'Regional': [
        'region',
        'region_code',
        'lga',
        'district_code',
        'ward',
        'subvillage'
    ],
    'Water': [
        'basin',
        'water_quality',
        'quality_group',
        'quantity',
        'quantity_group',
        'source',
        'source_class',
        'source_type'
    ],
    'WaterpointNumerical': [
        'amount_tsh',
        'population'
    ],
    'WaterpointCategorical': [
        'wpt_name',
        'extraction_type',
        'extraction_type_class',
        'extraction_type_group',
        'waterpoint_type',
        'waterpoint_type_group'
    ],
    'Management': [
        'management',
        'management_group',
        'payment',
        'payment_type',
        'permit', 
        'scheme_management', 
        'scheme_name'
    ],
    'Installation': [
        'construction_year',
        'installer',
        'funder'
    ],
    'Data Collection': [
        'date_recorded',
        'recorded_by'
    ],
    'Unknown': [
        'num_private',
        'public_meeting'
    ]
}

In [10]:
# Print every group with its features and tick each feature off the column list.
feature_count = 0
cols = list(X_train.columns)
for key, features in feature_classification.items():
    print('- ', key)
    for feature in features:
        print('\t- ', feature)
        cols.remove(feature)  # raises ValueError if the feature is not a remaining column
        feature_count += 1
print('\nTotal Feature Count: ', feature_count)
print('Unclassified Features: ', cols)

-  Geospatial
	-  longitude
	-  latitude
	-  gps_height
-  Regional
	-  region
	-  region_code
	-  lga
	-  district_code
	-  ward
	-  subvillage
-  Water
	-  basin
	-  water_quality
	-  quality_group
	-  quantity
	-  quantity_group
	-  source
	-  source_class
	-  source_type
-  WaterpointNumerical
	-  amount_tsh
	-  population
-  WaterpointCategorical
	-  wpt_name
	-  extraction_type
	-  extraction_type_class
	-  extraction_type_group
	-  waterpoint_type
	-  waterpoint_type_group
-  Management
	-  management
	-  management_group
	-  payment
	-  payment_type
	-  permit
	-  scheme_management
	-  scheme_name
-  Installation
	-  construction_year
	-  installer
	-  funder
-  Data Collection
	-  date_recorded
	-  recorded_by
-  Unknown
	-  num_private
	-  public_meeting

Total Feature Count:  39
Unclassified Features:  []
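
For later analysis it can be convenient to look up the group of a given feature. The sketch below inverts the grouping into a feature-to-group mapping and checks explicitly that no feature is listed twice. The dictionary `groups` is an abbreviated, hypothetical excerpt of `feature_classification`, and the names `counts`, `duplicates`, and `feature_to_group` are introduced only for this example.

```python
from collections import Counter

# Abbreviated excerpt of the grouping above, for illustration only.
groups = {
    'Geospatial': ['longitude', 'latitude', 'gps_height'],
    'Installation': ['construction_year', 'installer', 'funder'],
    'Unknown': ['num_private', 'public_meeting'],
}

# Count how often each feature is listed across all groups.
counts = Counter(f for features in groups.values() for f in features)
duplicates = sorted(f for f, n in counts.items() if n > 1)

# Reverse lookup: feature name -> group name.
feature_to_group = {f: g for g, features in groups.items() for f in features}

print('Duplicates:', duplicates)
print('installer ->', feature_to_group['installer'])
```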


### Future Work
Write a class that wraps the pickled train-test split.
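
One possible shape for such a wrapper is sketched below. It is not an existing implementation: the class name `TrainTestSplit` and the method `from_pickle` are hypothetical, and the sketch assumes only the two keys used in this notebook, `X_train` and `y_train`.

```python
import pickle


class TrainTestSplit:
    """Hypothetical thin wrapper around the pickled dictionary of splits."""

    REQUIRED_KEYS = ('X_train', 'y_train')

    def __init__(self, splits):
        # Fail early if the dictionary lacks a key this notebook relies on.
        missing = [k for k in self.REQUIRED_KEYS if k not in splits]
        if missing:
            raise KeyError(f'missing keys: {missing}')
        self._splits = splits

    @classmethod
    def from_pickle(cls, path):
        # Only unpickle files from a trusted source.
        with open(path, mode='rb') as f:
            return cls(pickle.load(f))

    @property
    def X_train(self):
        return self._splits['X_train']

    @property
    def y_train(self):
        return self._splits['y_train']


# In-memory usage with placeholder values instead of dataframes.
split = TrainTestSplit({'X_train': [[1, 2]], 'y_train': [0]})
print(split.X_train, split.y_train)
```

Exposing the splits as read-only properties keeps typos such as a misspelled dictionary key from going unnoticed until later cells.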