# Overview

Geocoding is the process of taking a text-based description of a location and returning geographic coordinates, usually a latitude/longitude pair. Reverse geocoding goes the other way: it converts geographic coordinates (latitude, longitude) into a human-readable address or place.

How can this be useful? The competition data already gives us latitude/longitude coordinates, so we can reverse geocode every point. That generates a lot of new data that could be used as model features or as inputs to post-processing.

Is this allowed in the competition? In short, yes. As stated by Rule 7c, external data is allowed as long as it is publicly available. Since this data is generated using an open source library, it is allowed.

I hope you find this useful and if you do, leave an upvote.

In [None]:
import pandas as pd
import numpy as np
import json
import os
import glob

import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn2_circles
import seaborn as sns
from tqdm.notebook import tqdm
import pathlib
import plotly
import plotly.express as px
from pathlib import Path
import pyproj
from pyproj import Proj, transform
from geopy.geocoders import Nominatim
from scipy import stats

!pip install reverse_geocoder
import reverse_geocoder as rg

In [None]:
INPUT = '../input/google-smartphone-decimeter-challenge'
base_train = pd.read_csv(INPUT + '/' + 'baseline_locations_train.csv')
base_test = pd.read_csv(INPUT + '/' + 'baseline_locations_test.csv')
sample_sub = pd.read_csv(INPUT + '/' + 'sample_submission.csv')
base_test.head(1)

In [None]:
# ground_truth
p = pathlib.Path(INPUT)
gt_files = list(p.glob('train/*/*/ground_truth.csv'))
print('ground_truth.csv count : ', len(gt_files))

gts = []
for gt_file in tqdm(gt_files):
    gts.append(pd.read_csv(gt_file))
ground_truth = pd.concat(gts)

#display(ground_truth.head())

# Reverse Geocoder Setup

In [None]:
base_test["geom"] = base_test["latDeg"].map(str) + "," + base_test["lngDeg"].map(str)

In [None]:
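def geocoder(data):
    # Nominatim is the OpenStreetMap geocoding service; geopy requires a user_agent string
    locator = Nominatim(user_agent="myGeocoder")
    # "lat,lon" string built in the previous cell
    coordinates = data['geom']
    location = locator.reverse(coordinates)
    # .raw is the full response as a dict (place_id, lat, lon, address, boundingbox, ...)
    return location.raw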
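
Each call to `locator.reverse` is a network request to the public Nominatim server, whose usage policy asks for at most one request per second. Here is a minimal sketch of a wrapper that spaces out real requests and caches repeated coordinates. The names `make_cached_reverse`, `reverse_fn`, `min_delay_seconds`, and `precision` are hypothetical and not part of geopy; the lookup function is passed in, so any callable that takes `(lat, lon)` works. Rounding to 5 decimal places (roughly 1 m of latitude) is an assumption about how close two points must be to share a result.

```python
import time


def make_cached_reverse(reverse_fn, min_delay_seconds=1.0, precision=5):
    """Wrap reverse_fn(lat, lon) with a cache and a minimum delay between real calls."""
    cache = {}
    last_call = [None]  # time of the last real lookup, None before the first one

    def cached_reverse(lat, lon):
        # Nearby points collapse onto the same rounded key
        key = (round(lat, precision), round(lon, precision))
        if key not in cache:
            if last_call[0] is not None:
                wait = min_delay_seconds - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            # Only cache misses reach the underlying lookup
            cache[key] = reverse_fn(*key)
            last_call[0] = time.monotonic()
        return cache[key]

    return cached_reverse
```

With the setup above, something like `make_cached_reverse(lambda lat, lon: locator.reverse((lat, lon)).raw)` should work for a `Nominatim` instance named `locator`, assuming its `reverse` method is given a `(lat, lon)` tuple as the query.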

# EDA with Reverse Geocoder

### Example 1

In [None]:
# I am using the first point to demonstrate
ex_point = base_test.iloc[0]
ex_point

In [None]:
ex_output = geocoder(ex_point)
ex_output

### Example 2

In [None]:
# I am using a random point to demonstrate
ex_point = base_test.iloc[420]
ex_point

In [None]:
ex_output = geocoder(ex_point)
ex_output

Notice that the `address` section can be broken down further:

In [None]:
ex_output["address"]
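
To turn a result like this into model features, the nested response can be flattened into one row per point. The sketch below works on a plain dict shaped like a Nominatim response. `flatten_result` is a hypothetical helper, and the `sample` values are made up for illustration, not real geocoder output. It assumes `boundingbox` is a list of four strings ordered south, north, west, east.

```python
def flatten_result(raw, address_keys=("road", "man_made", "postcode")):
    """Flatten a Nominatim-style result dict into a single flat feature dict."""
    address = raw.get("address", {})
    features = {
        "place_id": raw.get("place_id"),
        "adj_lat": float(raw["lat"]),
        "adj_lon": float(raw["lon"]),
    }
    # Missing address parts become None instead of raising KeyError
    for key in address_keys:
        features[key] = address.get(key)
    # Bounding box size in degrees, when a bounding box is present
    bbox = raw.get("boundingbox")
    if bbox:
        south, north, west, east = map(float, bbox)
        features["bbox_height_deg"] = north - south
        features["bbox_width_deg"] = east - west
    return features


# Made-up example values, not real geocoder output
sample = {
    "place_id": 1,
    "lat": "37.5",
    "lon": "-122.25",
    "address": {"road": "Example Road", "postcode": "00000"},
    "boundingbox": ["37.25", "37.75", "-122.5", "-122.0"],
}
print(flatten_result(sample))
```

A list of such dicts can be passed straight to `pd.DataFrame` to get one row per geocoded point.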

### Mini Dataset For A Single Path

Since every call to my geocoder function takes quite a while to run, the example below uses only every fifth data point in a single path.

Here are the features I will take a closer look at: `place_id` (the numerical id of the place); `lat`, or adjusted lat (a latitude value essentially generated by snapping the point to a structure); `lon`, or adjusted lon (the longitude equivalent); `road` (the name of the road the point falls on); and `man_made` (the name of a man-made structure the point falls on).

In [None]:
ex_base = base_test[base_test.phone == '2021-04-02-US-SJC-1_Pixel4']
ex_base = ex_base[::5]
ex_base.reset_index(drop=True, inplace=True)
ex_base.tail(2)

In [None]:
rows = []

for i in tqdm(range(len(ex_base))):
    data = geocoder(ex_base.iloc[i])
    address = data.get("address", {})
    # Not every result has these keys, so fall back to the string 'nan'
    rows.append({'place_id': data["place_id"],
                 'adj_lat': data["lat"],
                 'adj_lon': data["lon"],
                 'road': address.get("road", 'nan'),
                 'man_made': address.get("man_made", 'nan')})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
ex_data = pd.DataFrame(rows, columns=["place_id", "adj_lat", "adj_lon", "road", "man_made"])

In [None]:
ex_data = ex_data.join(ex_base)
# The geocoder returns lat/lon as strings; convert the numeric columns explicitly
num_cols = ["place_id", "adj_lat", "adj_lon"]
ex_data[num_cols] = ex_data[num_cols].apply(pd.to_numeric)
ex_data.head(3)

### EDA of Mini Dataset

In [None]:
print('place_id:')
print(ex_data.place_id.value_counts(), '\n')

print('road:')
print(ex_data.road.value_counts(), '\n')

print('man_made:')
print(ex_data.man_made.value_counts(), '\n')

print('adj_lat:')
print('mean', np.mean(ex_data.adj_lat))
print('median', np.median(ex_data.adj_lat))
# pandas' mode avoids the [0][0] indexing that breaks on newer SciPy versions
print('mode', ex_data.adj_lat.mode().iloc[0], '\n')

print('adj_lon:')
print('mean', np.mean(ex_data.adj_lon))
print('median', np.median(ex_data.adj_lon))
print('mode', ex_data.adj_lon.mode().iloc[0])

In [None]:
# Compute the counts once instead of on every loop iteration
counts = ex_data.place_id.value_counts()

# Keep place ids seen more than 6 times; lump the rest into "other"
frequent = counts[counts > 6]
place_ids = list(frequent.index) + ["other"]
place_id_values = list(frequent.values) + [counts[counts <= 6].sum()]

plt.pie(place_id_values)
plt.legend(place_ids, bbox_to_anchor=(1.2,0.5), loc="center right", fontsize=10,
           bbox_transform=plt.gcf().transFigure)
plt.title("Place ID")
plt.show()

In [None]:
plt.pie(ex_data.road.value_counts())
plt.legend(ex_data.road.value_counts().index, bbox_to_anchor=(1.2,0.5), loc="center right", fontsize=10, 
           bbox_transform=plt.gcf().transFigure)
plt.title("Road")
plt.show()

In [None]:
plt.pie(ex_data.man_made.value_counts())
plt.legend(ex_data.man_made.value_counts().index, bbox_to_anchor=(1.4,0.5), loc="center right", fontsize=10, 
           bbox_transform=plt.gcf().transFigure)
plt.title("Man Made")
plt.show()

Here is what the path looks like on a map.

In [None]:
fig = px.scatter_mapbox(ex_data,

                    # Here, plotly gets the (x, y) coordinates
                    lat="latDeg",
                    lon="lngDeg",
                    text='phoneName',

                    # Here, plotly picks the color of each series
                    # (`labels` expects a dict of column name -> display name, so it is omitted)
                    color="collectionName",

                    zoom=12,
                    center={"lat": ex_data.latDeg.mean(), "lon": ex_data.lngDeg.mean()},
                    height=600,
                    width=800)
# 'open-street-map' tiles need no access token (Stamen tiles have moved to Stadia Maps hosting)
fig.update_layout(mapbox_style='open-street-map')
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.update_layout(title_text="Original")
fig.show()

Here is a comparison of the lat/lon path in the test file vs. the adjusted results. Unfortunately, the adjusted results are subpar.

In [None]:
fig, (ax_orig, ax_adj) = plt.subplots(nrows=1, ncols=2, figsize=(11, 5))

# Longitude goes on the x-axis and latitude on the y-axis, matching the axis labels
ax_orig.set_title('original')
ax_orig.set_xlabel('lon')
ax_orig.set_ylabel('lat')
ax_orig.plot(ex_data.lngDeg, ex_data.latDeg, color='blue')

ax_adj.set_title('rev geocoded')
ax_adj.set_xlabel('lon')
ax_adj.set_ylabel('lat')
ax_adj.plot(ex_data.adj_lon, ex_data.adj_lat, color='red')
plt.show()
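
One way to put a number on how far the adjusted points drift from the originals is the great-circle distance between each original and adjusted pair. This is a minimal sketch using the haversine formula with a mean Earth radius of 6,371 km, which is an approximation. `haversine_m` is a hypothetical helper, not something from the libraries imported above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# One degree of longitude along the equator, in meters
print(round(haversine_m(0.0, 0.0, 0.0, 1.0)))
```

Applied to the mini dataset, `[haversine_m(a, b, c, d) for a, b, c, d in zip(ex_data.latDeg, ex_data.lngDeg, ex_data.adj_lat, ex_data.adj_lon)]` gives a per-point offset in meters, assuming the adjusted columns have already been converted to numbers.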

# Conclusion

Reverse geocoders can definitely be useful for generating more data to use as model features or as inputs to post-processing.

Some good features to look at are place id, road, man_made, bounding box, postcode, and commercial, among others.

The snapped coordinates returned by the reverse geocoder trace the path less well than the original baseline coordinates; however, we can use other external data tools such as openstreetmap.org.