# Author: Tiffany Seeley

GitHub repo: https://github.com/tiffsea

Demo video: https://www.youtube.com/watch?v=gJMHbW3MK2w

Project: React-Maps-UK-Housing-Prices

## Installation:

Install geopy first, either from the Anaconda Prompt (or another command prompt/terminal) on your machine or from a notebook cell as shown here, so that the geocoders import successfully.

In [1]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


Import pandas for data handling and Nominatim, the geocoder we will call through geopy.

In [2]:
#prereq libraries
import pandas as pd
from geopy.geocoders import Nominatim

## First, format your data file

Prepare a CSV file that contains your address data and name it `data.csv`.
- If your address data file has a different name, change the file name in the first line of the code below.

In [3]:
#read in the csv file; column headers are taken from its first row
df = pd.read_csv("data.csv", #file name
                 sep=',') #separator

In [4]:
# print the data
df.head()

Unnamed: 0,COMPANY,STREET_ADDRESS,CITY,STATE,DISTRICT,COUNTY,COUNTRY,POSTCODE
0,University of Sussex,,Brighton,,,,UK,


Let's concatenate a few columns to build one long address string for each row in our `data.csv` file. This column will be used as the query term for the geopy API.

In [5]:
#create a new column called `query` by joining the selected address columns
df['query'] = df['COMPANY'] + " " + df['CITY'] + " " +  df['COUNTRY']

#print new column with index -first 5 rows only
df.iloc[0:5, 8:9]

Unnamed: 0,query
0,University of Sussex Brighton UK
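One thing to watch with plain `+` concatenation: if any of the joined columns is missing for a row, pandas produces `NaN` for the whole query. A minimal sketch of a more forgiving approach is below. The helper name `build_query` and the second sample row are made up for illustration.

```python
import pandas as pd

# Sample rows; the second one has a missing CITY value (made-up data).
df = pd.DataFrame({
    "COMPANY": ["University of Sussex", "Example Company"],
    "CITY": ["Brighton", None],
    "COUNTRY": ["UK", "UK"],
})

def build_query(row, columns=("COMPANY", "CITY", "COUNTRY")):
    """Join the non-missing values of the given columns with single spaces."""
    parts = [str(row[c]).strip() for c in columns if pd.notna(row[c])]
    return " ".join(p for p in parts if p)

# Apply the helper row by row to create the query column.
df["query"] = df.apply(build_query, axis=1)
print(df["query"].tolist())
```

Rows with a missing part still get a usable, shorter query instead of `NaN`.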


Next we will remove duplicate addresses, keeping the first instance of each query address.
- __TIP__: if applicable, sort the data first so that the most current address is the one that is kept.

In [6]:
#remove duplicate addresses (new concat column) but keep first instance
df.drop_duplicates(subset ='query', keep ='first', inplace = True)

#print some useful info: row length and shape size
print("data row x columns is {}\ndata row count is {}".format(df.shape,len(df.index)))

#print first rows as sample
df.head()

data row x columns is (1, 9)
data row count is 1


Unnamed: 0,COMPANY,STREET_ADDRESS,CITY,STATE,DISTRICT,COUNTY,COUNTRY,POSTCODE,query
0,University of Sussex,,Brighton,,,,UK,,University of Sussex Brighton UK
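The sorting tip could look something like the sketch below. It assumes your file has a date column, here called `LAST_UPDATED`, which is a hypothetical name that does not exist in the sample data. The address values are placeholders.

```python
import pandas as pd

# Two rows share the same query; LAST_UPDATED is a hypothetical column.
df = pd.DataFrame({
    "query": ["University of Sussex Brighton UK"] * 2,
    "STREET_ADDRESS": ["old address", "new address"],
    "LAST_UPDATED": ["2019-01-15", "2021-06-30"],
})

# Parse the dates so they sort chronologically rather than as text.
df["LAST_UPDATED"] = pd.to_datetime(df["LAST_UPDATED"])

# Newest first, then keep the first (newest) row for each query.
deduped = (
    df.sort_values("LAST_UPDATED", ascending=False)
      .drop_duplicates(subset="query", keep="first")
      .reset_index(drop=True)
)
print(deduped)
```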


### *OPTIONAL*:
Drop columns you won't use. This can be particularly helpful when your data file is enormous. Otherwise, feel free to skip the next step.

In [7]:
#drop columns we won't use
df = df.drop(columns=['DISTRICT','COUNTY'])
                 
#print row length and shape size
print("data row x columns is {}\ndata row count is {}".format(df.shape,len(df.index)))

#print first rows as sample
df.head()

data row x columns is (1, 7)
data row count is 1


Unnamed: 0,COMPANY,STREET_ADDRESS,CITY,STATE,COUNTRY,POSTCODE,query
0,University of Sussex,,Brighton,,UK,,University of Sussex Brighton UK


Geopy returns latitude and longitude coordinates plus a specific address for each query. Let's create 3 columns in our df to hold them.

In [8]:
#create 3 new columns to store lat/long/address - initialise to empty strings
df['location_lat'] = ""
df['location_long'] = ""
df['location_address'] = ""

#print first rows to sample
df.head()

Unnamed: 0,COMPANY,STREET_ADDRESS,CITY,STATE,COUNTRY,POSTCODE,query,location_lat,location_long,location_address
0,University of Sussex,,Brighton,,UK,,University of Sussex Brighton UK,,,


## Second, use Geopy to fetch geocode data

In [9]:
'''
**Get Lat/Long Data with GeoPy**
---------------------

The code below calls the geopy API using the concatenated column of address values. We use this column as a query key 
to pull back the corresponding lat/long coordinates and address.
'''

geolocator = Nominatim(user_agent="myApp")

for i in df.index:
    try:
        #try to fetch the address from geopy (returns None when nothing matches)
        location = geolocator.geocode(df.loc[i,'query'])
    except Exception:
        #catch request errors (e.g. timeouts) and treat them as no result
        location = None

    if location is not None:
        #store lat/long/address in the matching row
        df.loc[i,'location_lat'] = location.latitude
        df.loc[i,'location_long'] = location.longitude
        df.loc[i,'location_address'] = location.address
    else:
        #no value returned: keep the columns as empty strings
        df.loc[i,'location_lat'] = ""
        df.loc[i,'location_long'] = ""
        df.loc[i,'location_address'] = ""

#print first rows as sample
df.head()

Unnamed: 0,COMPANY,STREET_ADDRESS,CITY,STATE,COUNTRY,POSTCODE,query,location_lat,location_long,location_address
0,University of Sussex,,Brighton,,UK,,University of Sussex Brighton UK,50.868,-0.0877856,"University of Sussex, Southern Ring Road, Brig..."
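If you want to test the lookup logic without hitting the network, one option is to pass the geocoding function in as a parameter. The sketch below is only an illustration: `stub_geocode`, `StubLocation` and `geocode_queries` are made-up names, and the stub stands in for `geolocator.geocode`. For real runs you would pass `geolocator.geocode` instead, and for larger files it is probably worth wrapping it in geopy's `RateLimiter` (from `geopy.extra.rate_limiter`) to space out requests.

```python
from collections import namedtuple

# Stand-in for the location object; only the three attributes used above.
StubLocation = namedtuple("StubLocation", ["latitude", "longitude", "address"])

def stub_geocode(query):
    """Hypothetical offline geocoder: knows one query, returns None otherwise."""
    known = {
        "University of Sussex Brighton UK":
            StubLocation(50.868, -0.0877856, "stub address"),
    }
    return known.get(query)

def geocode_queries(queries, geocode):
    """Return one result dict per query; failed lookups get None values."""
    rows = []
    for query in queries:
        try:
            location = geocode(query)
        except Exception:
            # Treat request errors the same as "no match".
            location = None
        rows.append({
            "query": query,
            "location_lat": getattr(location, "latitude", None),
            "location_long": getattr(location, "longitude", None),
            "location_address": getattr(location, "address", None),
        })
    return rows

results = geocode_queries(
    ["University of Sussex Brighton UK", "no such place"], stub_geocode)
for row in results:
    print(row)
```

The list of dicts can be turned into a dataframe with `pd.DataFrame(results)` if you prefer to merge it back on the `query` column.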


## You're done! Now save your results to a new data file.

In [10]:
#write the contents thus far to new csv file
df.to_csv('geopy_data.csv')

Check your local directory to see if the new file was saved properly. 
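Note that `to_csv` also writes the dataframe index by default, which shows up as an extra `Unnamed: 0` column when the file is read back in. If you don't need the index, a small variation is to pass `index=False`. The sketch below uses a temporary folder and a cut-down dataframe purely for illustration.

```python
import os
import tempfile

import pandas as pd

# Cut-down version of the results dataframe (values taken from the output above).
df = pd.DataFrame({
    "query": ["University of Sussex Brighton UK"],
    "location_lat": [50.868],
    "location_long": [-0.0877856],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "geopy_data.csv")
    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)
    # Read the file back to check what was saved.
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```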

#### Resources:
Check out the [GeoPy documentation](https://geopy.readthedocs.io/en/stable/#data) to discover additional data points (e.g., apparently you can fetch `altitude` as well) that can be queried using this API.

TIP: apparently geopy can also be used for [**Calculating Distance**](https://geopy.readthedocs.io/en/stable/#module-geopy.distance).
- Geopy can calculate the distance between two points using either the geodesic distance or the great-circle distance. The geodesic distance is the default and is available as the function `geopy.distance.distance`.
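
To give a feel for what a great-circle distance is, here is a minimal sketch using the haversine formula with only the standard library. This is not geopy's implementation. The function name `great_circle_km`, the mean Earth radius constant and the second coordinate pair (roughly central London) are assumptions for illustration. In practice you would most likely call geopy's own distance functions instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for a spherical model

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# First point: the coordinates returned for the sample query above.
# Second point: approximate central London (illustrative values).
sussex = (50.868, -0.0877856)
london = (51.5074, -0.1278)

distance = great_circle_km(*sussex, *london)
print(f"{distance:.1f} km")
```

Because this treats the Earth as a perfect sphere, the result will differ slightly from a geodesic distance, which uses an ellipsoidal model.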