## Inspecting the Training Data
In this section we inspect the training data and group variables for further analysis.

### Importing the Train Test Split
Below we load the train-test split that we created in the [ImportData](ImportData.ipynb) notebook. The `train_test_split` object is a dictionary whose values are dataframes.

In [None]:
import pickle

# Load the pickled dictionary of dataframes
with open('../data/train_test_split.pkl', mode='rb') as f:
    train_test_split = pickle.load(f)

### Extracting Training Data

In [None]:
X_train = train_test_split['X_train']
y_train = train_test_split['y_train']
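
Before indexing into a loaded dictionary, it can help to confirm that the expected keys are present. The following is a minimal sketch and not part of the original workflow: it round-trips a small stand-in dictionary through a temporary pickle file. The names `demo_split`, `expected_keys`, and `missing_keys` are hypothetical, and the toy values stand in for the real dataframes.

```python
import os
import pickle
import tempfile

# Stand-in for the real split (hypothetical toy values, not the real dataframes)
demo_split = {'X_train': [[1, 2], [3, 4]], 'y_train': [0, 1]}

# Write and re-read the dictionary through a temporary pickle file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_split.pkl')
    with open(path, mode='wb') as f:
        pickle.dump(demo_split, f)
    with open(path, mode='rb') as f:
        loaded = pickle.load(f)

# Fail early with a clear message if a key we rely on is absent
expected_keys = {'X_train', 'y_train'}
missing_keys = expected_keys - set(loaded)
if missing_keys:
    raise KeyError(f'Missing keys: {sorted(missing_keys)}')
print('Available keys: ', sorted(loaded))
```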

### Inspecting the Shape of the Training Data
We observe that our training data contains 53,460 observations of thirty-nine features and one target.

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
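
Beyond printing the shapes, one might also confirm that the features and the target describe the same observations. The sketch below assumes both objects are pandas objects sharing an index. `X_demo` and `y_demo` are hypothetical toy stand-ins for `X_train` and `y_train`.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train (hypothetical values)
X_demo = pd.DataFrame({
    'amount_tsh': [0.0, 25.0, 50.0],
    'basin': ['a', 'b', 'a'],
})
y_demo = pd.Series(['x', 'y', 'x'], name='target')

# The row counts and the row labels should both match
same_length = len(X_demo) == len(y_demo)
same_index = X_demo.index.equals(y_demo.index)
print('Same number of rows: ', same_length)
print('Same index: ', same_index)
```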

### Inspecting Data Types
Below we see that, of the thirty-nine features, nine are numerical and thirty are categorical.

In [None]:
X_train.dtypes.value_counts()
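
To see which columns fall into each group, and not only how many, `select_dtypes` can split the column names by data type. This is a minimal sketch on a toy frame. `X_demo` and its values are hypothetical, and it treats every non-numeric column as categorical, which is an assumption.

```python
import pandas as pd

# Toy frame with two numeric and two string columns (hypothetical values)
X_demo = pd.DataFrame({
    'amount_tsh': [0.0, 25.0],
    'population': [120, 0],
    'basin': ['a', 'b'],
    'date_recorded': ['2011-03-14', '2013-02-25'],
})

# Numeric columns versus everything else
numerical_cols = X_demo.select_dtypes(include='number').columns.tolist()
categorical_cols = X_demo.select_dtypes(exclude='number').columns.tolist()
print('Numerical: ', numerical_cols)
print('Categorical: ', categorical_cols)
```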

The list below shows the data type for each feature.

In [None]:
X_train.dtypes.sort_index()

Note that the `date_recorded` variable contains string-encoded dates. Otherwise, the assigned data types seem reasonable.

In [None]:
X_train[['date_recorded']].head()
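
String-encoded dates can be parsed into a datetime type with `pd.to_datetime`. The sketch below assumes the strings follow the `YYYY-MM-DD` format, which the notebook does not state. `dates_demo` and its values are hypothetical. With `errors='coerce'`, any string that does not match the format becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Toy string-encoded dates (hypothetical values; format is an assumption)
dates_demo = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-02-25', '2012-10-01'],
})

# Parse the strings; values that do not match the format become NaT
parsed = pd.to_datetime(dates_demo['date_recorded'], format='%Y-%m-%d', errors='coerce')
print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Once parsed, components such as the year or month are available through the `.dt` accessor.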

### Checking for Missing Values
Below we check for features with missing values. There are seven features with missing values and thirty-two without.

In [None]:
X_train.isnull().any().value_counts()

The list below shows which features have missing values.

In [None]:
X_train.isnull().any().sort_index()
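
Knowing how much of each feature is missing is often as useful as knowing which features are affected. A minimal sketch on a toy frame follows. `X_demo`, its values, and `missing_summary` are hypothetical and do not reflect the real data.

```python
import pandas as pd

# Toy frame with gaps in two of three columns (hypothetical values)
X_demo = pd.DataFrame({
    'funder': ['a', None, 'b', None],
    'installer': ['a', 'b', None, 'b'],
    'basin': ['x', 'y', 'x', 'y'],
})

# Count and fraction of missing values per feature
missing_summary = pd.DataFrame({
    'missing_count': X_demo.isnull().sum(),
    'missing_fraction': X_demo.isnull().mean(),
})

# Keep only the affected features, most incomplete first
missing_summary = missing_summary[missing_summary['missing_count'] > 0]
missing_summary = missing_summary.sort_values('missing_count', ascending=False)
print(missing_summary)
```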

### Classification of Features
Below we group features by the type of data that they contain. We referred to the competition [feature descriptions](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/) when grouping variables.

In [None]:
feature_classification = {
    'Geospatial': [
        'longitude',
        'latitude',
        'gps_height'
    ],
    'Regional': [
        'region',
        'region_code',
        'lga',
        'district_code',
        'ward',
        'subvillage'
    ],
    'Water': [
        'basin',
        'water_quality',
        'quality_group',
        'quantity',
        'quantity_group',
        'source',
        'source_class',
        'source_type'
    ],
    'WaterpointNumerical': [
        'amount_tsh',
        'population'
    ],
    'WaterpointCategorical': [
        'wpt_name',
        'extraction_type',
        'extraction_type_class',
        'extraction_type_group',
        'waterpoint_type',
        'waterpoint_type_group'
    ],
    'Management': [
        'management',
        'management_group',
        'payment',
        'payment_type',
        'permit',
        'scheme_management',
        'scheme_name'
        'scheme_name'
    ],
    'Installation': [
        'construction_year',
        'installer',
        'funder'
    ],
    'Data Collection': [
        'date_recorded',
        'recorded_by'
    ],
    'Unknown': [
        'num_private',
        'public_meeting'
    ]
}
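
A group-to-features dictionary like this one is convenient to write, but later analysis may need the reverse lookup from a feature to its group. The sketch below inverts a small stand-in dictionary. `demo_classification` and `feature_to_group` are hypothetical names, and the approach assumes each feature appears in only one group.

```python
# Small stand-in for the full classification (subset of groups, for illustration)
demo_classification = {
    'Geospatial': ['longitude', 'latitude'],
    'Installation': ['construction_year', 'installer'],
}

# Invert the mapping: feature name -> group name
feature_to_group = {
    feature: group
    for group, features in demo_classification.items()
    for feature in features
}
print(feature_to_group['installer'])
```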

In [None]:
feature_count = 0
cols = list(X_train.columns)  # features not yet assigned to a group
for key, features in feature_classification.items():
    print('- ', key)
    for feature in features:
        print('\t- ', feature)
        # Raises ValueError if a feature is misspelled or listed twice
        cols.remove(feature)
        feature_count += 1
print('\nTotal Feature Count: ', feature_count)
print('Unclassified Features: ', cols)
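
The loop above stops with a `ValueError` at the first feature that is listed twice or is not a column. As an alternative, the sketch below reports every such problem at once using sets and a counter. All names are hypothetical, and the toy classification contains deliberate mistakes for illustration.

```python
from collections import Counter

# Toy column list and classification with deliberate mistakes
demo_columns = ['longitude', 'latitude', 'basin', 'funder']
demo_classification = {
    'Geospatial': ['longitude', 'latitude'],
    'Water': ['basin', 'longitude'],  # 'longitude' is listed twice
    'Unknown': ['num_private'],       # not in demo_columns
}

# Flatten the groups into a single list of assigned features
assigned = [f for features in demo_classification.values() for f in features]

# Features assigned to more than one group
duplicated = sorted(f for f, n in Counter(assigned).items() if n > 1)
# Features that are not columns of the data
unknown = sorted(set(assigned) - set(demo_columns))
# Columns that no group covers
unclassified = sorted(set(demo_columns) - set(assigned))

print('Duplicated: ', duplicated)
print('Not in columns: ', unknown)
print('Unclassified: ', unclassified)
```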