## Inspecting the Regional Feature Group
Tanzania is organized into 31 [administrative regions](https://en.wikipedia.org/wiki/Regions_of_Tanzania). Each region is subdivided into [districts](https://en.wikipedia.org/wiki/Districts_of_Tanzania). Each district contains wards, which in turn contain sub-villages. The variables below describe the location of a waterpoint in terms of this naming scheme. The regions and districts are also coded, but I was not able to reconcile the codes with any publicly available encoding scheme.
* `region` (`region_code`) - Names (codes) for top-level administrative regions.
* `lga` (`district_code`) - Names (codes) for districts, which divide regions.
* `ward` - Names for wards, which divide districts.
* `subvillage` - Names for sub-villages, which presumably subdivide wards.
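One way to test whether the nesting described above actually holds in the data is to count how many distinct parents each child name appears under. The sketch below does this for districts and regions on a small made-up frame; the values (`Region A`, `District X`, and so on) are hypothetical and only stand in for `X_train`.

```python
import pandas as pd

# Made-up rows standing in for X_train; the names are illustrative only.
toy = pd.DataFrame({
    'region': ['Region A', 'Region A', 'Region B', 'Region B'],
    'lga': ['District X', 'District X', 'District Y', 'District X'],
})

# Number of distinct regions that each district name appears under.
parents_per_lga = toy.groupby('lga')['region'].nunique()

# District names found under more than one region break strict nesting.
ambiguous = parents_per_lga[parents_per_lga > 1]
print(ambiguous)
```

The same pattern would apply one level down, for example grouping by `ward` and counting distinct `lga` values.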

In [None]:
from data_utilities import DataVisualization
viz = DataVisualization()
X_train = viz.X_train

In this group, only the variable `subvillage` has missing values. 

In [None]:
regional = ['region', 'lga', 'ward', 'subvillage']
print(X_train[regional].isnull().any())

In [None]:
for col in regional:
    print(f'- {col} has {len(X_train[col].unique())} unique values.')

In [None]:
regions = sorted(X_train['region'].unique())
print(f'Our data contains information about waterpoints in the following {len(regions)} regions:\n')
print(regions)

The function below takes a list of column names (here just `region`) and draws a bar plot of the waterpoints.

In [None]:
viz.barplot_waterpoints(['region']);

Below I have counted all of the combinations of values of `region` and `region_code`. It looks like some regions have multiple codes, but the region codes do not provide substantially more information than the region names.
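A count of this kind can be sketched with `groupby`. The toy frame below is made up and only stands in for `X_train`; the region names and codes in it are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for X_train; names and codes are illustrative only.
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South'],
    'region_code': [1, 1, 11, 2, 2],
})

# Number of rows for each (region, region_code) pair.
pair_counts = toy.groupby(['region', 'region_code']).size()
print(pair_counts)

# Number of distinct codes per region name; values above 1 flag
# regions that carry more than one code.
codes_per_region = toy.groupby('region')['region_code'].nunique()
print(codes_per_region[codes_per_region > 1])
```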

In [None]:
districts = sorted(X_train['lga'].unique())
print(f'Our data contains information about waterpoints in the following {len(districts)} districts:\n')
print(districts)

### Conclusions
Both `region` and `lga` seem to be very clean features. I am inclined to drop `region_code` and `district_code`, since they do not seem to relate clearly to regions and districts, respectively. Both `ward` and `subvillage` have a huge number of classes, which will be computationally expensive to include in models. I will fill missing values in the `subvillage` feature with the string `'missing'`.
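As a minimal sketch of that last step, the fill can be done with `fillna`. The frame below is made up for illustration; in the notebook the same call would be applied to the `subvillage` column of `X_train`.

```python
import pandas as pd

# Made-up column standing in for X_train['subvillage'].
toy = pd.DataFrame({'subvillage': ['Village A', None, 'Village B', None]})

# Replace missing sub-village names with an explicit placeholder category.
toy['subvillage'] = toy['subvillage'].fillna('missing')
print(toy['subvillage'].value_counts())
```

Using a placeholder string keeps the rows in the data and lets a model treat "unknown sub-village" as its own category.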

### Future Work
It might be possible to impute missing values in the `subvillage` feature from geospatial data using k-nearest neighbors.
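A rough sketch of that idea follows, under several assumptions: the coordinate columns are called `latitude` and `longitude` (these names do not appear in this section), plain Euclidean distance in degrees is an acceptable proxy for ground distance, and the helper `knn_impute_labels` is a hypothetical hand-rolled function written for illustration. In practice a library classifier such as scikit-learn's `KNeighborsClassifier` would likely be the more usual choice.

```python
import numpy as np
import pandas as pd

def knn_impute_labels(df, target, coords, k=3):
    """Fill missing `target` labels with the majority label of the k nearest labeled rows."""
    known = df[df[target].notna()]
    known_xy = known[coords].to_numpy(dtype=float)
    filled = df[target].copy()
    for idx in df.index[df[target].isna()]:
        point = df.loc[idx, coords].to_numpy(dtype=float)
        # Euclidean distance in degrees; only a rough proxy for ground distance.
        dist = np.linalg.norm(known_xy - point, axis=1)
        nearest = np.argsort(dist)[:k]
        # Majority vote among the k neighbors; ties go to the first label in sorted order.
        filled.loc[idx] = known[target].iloc[nearest].mode().iloc[0]
    return filled

# Made-up waterpoints; coordinates and labels are illustrative only.
toy = pd.DataFrame({
    'latitude': [-6.0, -6.01, -6.02, -9.0, -9.01, -6.005],
    'longitude': [35.0, 35.01, 35.0, 33.0, 33.01, 35.005],
    'subvillage': ['A', 'A', 'B', 'C', 'C', None],
})
toy['subvillage'] = knn_impute_labels(toy, 'subvillage', ['latitude', 'longitude'])
print(toy)
```

With this many classes, the imputed labels would need to be checked against held-out known values before being trusted.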