### Inspecting the Geospatial Feature Group

In [None]:
import matplotlib.pyplot as plt
from data_utilities import DataWrapper
data = DataWrapper()
X_train = data.X_train

In [None]:
def geospatial_scatter(df):
    """Scatter waterpoints by longitude and latitude, colored by gps_height."""
    fig = plt.figure(figsize=(10, 10))
    plt.scatter(x='longitude', y='latitude', c='gps_height', data=df)
    cbar = plt.colorbar()
    cbar.set_label('GPS Height (meters)', rotation=270)
    plt.title('Spatial Coordinates of Waterpoints')
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    return fig

In [None]:
fig = geospatial_scatter(X_train)
fig.savefig('../images/geo-spatial_coordinates.png', bbox_inches='tight')

Notice that some waterpoints have latitude and longitude equal to zero. These are almost surely missing values that have been encoded with dummy values. Based on the geography of Tanzania, we would expect our data to fall within the following intervals.
* Longitude: [29.6, 40.4]
* Latitude: [-11.7, -0.8]
* Altitude (meters): [0, 5895]

Let's take a look at the values that fall outside these intervals.

In [None]:
geospatialErrors = X_train.query(
    'longitude < 29.6 or longitude > 40.4 '
    'or latitude < -11.7 or latitude > -0.8 '
    'or gps_height < 0 or gps_height > 5895'
)
print(f'There are {len(geospatialErrors)} rows, roughly {round(len(geospatialErrors)/len(X_train)*100)}% of our data, with geospatial coordinates that fall outside our bounds.')
print(geospatialErrors[['longitude', 'latitude', 'gps_height']].value_counts())

geospatial_scatter(geospatialErrors);

Apart from the zero-encoded missing values, every flagged row is a coastal waterpoint with a negative `gps_height`. This suggests that `gps_height` may refer to the altitude of the bottom of a well in some cases. Therefore, our initial bounds need to be relaxed to allow for waterpoints with heights below sea level.

In [None]:
geospatialErrors = X_train.query(
    'longitude < 29.6 or longitude > 40.4 '
    'or latitude < -11.7 or latitude > -0.8 '
    'or gps_height > 5895'
)
print(f'There are {len(geospatialErrors)} rows, roughly {round(len(geospatialErrors)/len(X_train)*100)}% of our data, with geospatial coordinates that fall outside our bounds.')
print(geospatialErrors[['longitude', 'latitude', 'gps_height']].value_counts())

# Reuse the plotting helper defined earlier instead of repeating its body.
fig = geospatial_scatter(geospatialErrors)
#fig.savefig('../images/geo-spatial_coordinates.png', bbox_inches='tight')

#### Conclusion
About 3% of our training data has zero-encoded missing values in the Geospatial feature group. While these values are not ideal, they should not have a substantial negative effect on the model, so we will leave them in place. Below we revise our bounds and note that `gps_height` may not always refer to the surface altitude of a waterpoint; in some cases it may incorporate the depth of a well.
* Longitude: [29.6, 40.4]
* Latitude: [-11.7, -0.8]
* GPS Height: at most 5895 (no lower bound)
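
The revised bounds above can be kept in one place and applied as a boolean mask, which avoids retyping the query string each time. The following is a minimal sketch, not part of the original analysis: the names `BOUNDS` and `out_of_bounds_mask` are hypothetical, and the toy rows are made up for illustration.

```python
import pandas as pd

# Revised bounds from the conclusion; None means no limit on that side.
# BOUNDS and out_of_bounds_mask are hypothetical names used for illustration.
BOUNDS = {
    "longitude": (29.6, 40.4),
    "latitude": (-11.7, -0.8),
    "gps_height": (None, 5895),
}


def out_of_bounds_mask(df, bounds=BOUNDS):
    """Return a boolean Series that is True where any column violates its bounds."""
    mask = pd.Series(False, index=df.index)
    for col, (low, high) in bounds.items():
        if low is not None:
            mask |= df[col] < low
        if high is not None:
            mask |= df[col] > high
    return mask


# Toy rows for illustration only; these are not taken from the real dataset.
toy = pd.DataFrame({
    "longitude": [35.0, 0.0, 39.2],
    "latitude": [-6.0, 0.0, -6.8],
    "gps_height": [1200, 0, -15],
})
flags = out_of_bounds_mask(toy)
print(flags.tolist())  # [False, True, False]
```

With the real data, `X_train[out_of_bounds_mask(X_train)]` should select the same rows as the relaxed query, assuming the three columns are numeric.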

#### Future Work
Imputing missing values based on regional data would be a good way to improve handling of the zero-encoded missing values. Thanks to Kristen for this recommendation.
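
One way this imputation could look is sketched below. It is only an illustration under stated assumptions: it assumes a grouping column named `region` exists, treats `longitude == 0` as the marker for a zero-encoded row, and uses the group median as the fill value. The function name `impute_zero_coordinates` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_zero_coordinates(df, group_col="region"):
    """Replace zero-encoded longitude/latitude with the median of the row's group.

    Assumptions (not confirmed by the analysis above): `group_col` exists in
    `df`, and longitude == 0 marks a zero-encoded missing coordinate pair.
    """
    out = df.copy()
    missing = out["longitude"] == 0
    # Turn the dummy zeros into NaN so they do not distort the group medians.
    out.loc[missing, ["longitude", "latitude"]] = np.nan
    for col in ["longitude", "latitude"]:
        # The median skips NaN, so only valid coordinates contribute.
        group_median = out.groupby(group_col)[col].transform("median")
        out[col] = out[col].fillna(group_median)
    return out


# Toy data for illustration only; these are not rows from the real dataset.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B"],
    "longitude": [35.0, 36.0, 0.0, 39.0, 0.0],
    "latitude": [-6.0, -7.0, 0.0, -10.0, 0.0],
})
imputed = impute_zero_coordinates(toy)
print(imputed)
```

If every row in a group is zero-encoded, the group median is undefined and those rows stay as NaN, so a fallback such as the overall median would still be needed.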