Read the code alongside the notes below for context.

# The Location Analytics process includes:

    1. Address Standardization
    
    2. Address Normalization
    
    3. Address Validation
    
    4. Address Verification
    
    5. Geocoding
    
    6. Feeding into a Machine Learning model with Active Learning (new addresses are continuously added to the databases, and old addresses are continuously verified and validated)
    
    7. Reiterating the whole process 

# Address Standardization

Address standardization ensures that the relevant details (i.e. street number, apartment number, street name, city, state, and postal code) are in the correct format and pin to a specific, unique location (i.e. latitude and longitude).

## Some of the reasons for Address discrepancies can be:

1. Incomplete information - Missing street names, block numbers, or zip codes

2. Invalid information - Fake addresses and zip codes

3. Incorrect information - Typos, misspellings, formatting of abbreviations (e.g. CAL instead of CA for California)

4. Inaccurate information - Wrong addresses or house numbers

Also, due to population growth, urbanization, and new construction, new addresses are constantly being added. For example, in the United States the Postal Service (USPS) adds 4,221 addresses to its delivery network every day.

### Data Preparation Steps:

    1. normalization
    2. stemming
    3. lemmatization
    4. segmentation (tokenization)
    5. text rebuild

## Normalization

Normalization consists of transforming the text into a canonical form so that addresses are easy to compare.

Steps for Normalization:

    1. Standardize encoding
    2. Remove punctuation
    3. Transform to lowercase
    4. Remove stopwords
    5. Separate prefixes and suffixes that don’t contain information
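
As a rough illustration, the first four steps could be sketched in Python as follows. The function name `normalize_address` and the contents of `STOPWORDS` are assumptions made for this example, not part of the repository code, and step 5 (prefix and suffix separation) is left out because it depends on the address conventions of the target country.

```python
import re
import unicodedata

# Hypothetical stopword list for illustration; a real list depends on the data.
STOPWORDS = {"the", "of", "near", "opp", "opposite"}


def normalize_address(text: str) -> str:
    """Return a lowercase, accent-free, punctuation-free form of an address."""
    # 1. Standardize encoding: decompose accented characters, then drop the accents
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 2. Transform to lowercase
    text = text.lower()
    # 3. Replace punctuation with spaces so tokens stay separated
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. Drop stopwords and collapse repeated whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)


print(normalize_address("Opp. Mahakali Caves, Andheri (E), Mumbai"))
```

Replacing punctuation with a space rather than deleting it keeps tokens such as `Caves,Andheri` from being merged into one word.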

## Stemming

Stemming is the process of reducing words in different forms (conjugated verbs, plurals) to a root form. This step is not very useful for addresses because most address terms do not vary in form. Proper names, for example, are very common in addresses and don’t benefit much from stemming.

## Lemmatization

Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed together.

## Segmentation

Segmentation is the task of breaking up the text into tokens, so each token can be analysed separately. 
In our case the address field can be broken down into prefixes, location, complements and suffixes.

This is helpful because we can now match each part of the address with an existing canonical form without a lot of noise. Each of the fields can be further processed to extract more information, like the PIN code.

## Parsing

The next step is to parse the address. Parsing consists of breaking up the address string into the specific fields that compose it. To parse, we have to assume a structure for the address.
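
To make the "assumed structure" concrete, here is a minimal regex-based sketch. It assumes a US-style layout of `<number> <street>, <city>, <state> <zip>` with the state and zip optional. The pattern, the field names, and the `parse_address` helper are illustrative assumptions, not the parser used in the repository.

```python
import re

# Assumed structure: "<number> <street>, <city>, <state> <zip>".
# State (two letters) and zip (five digits) are both optional.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>[^,]+?)\s*,"
    r"\s*(?P<city>[^,]+?)\s*"
    r"(?:,\s*(?P<state>[A-Za-z]{2}))?"
    r"(?:[\s,]+(?P<zip>\d{5}))?\s*$"
)


def parse_address(address: str):
    """Return a dict of address fields, or None if the structure does not fit."""
    match = ADDRESS_PATTERN.match(address)
    if match is None:
        return None
    # Optional fields that were absent come back as None
    return match.groupdict()


print(parse_address("2701 S Vermont Ave, Los Angoles, CA 90007"))
print(parse_address("7900 S Westen Ave, Los Angeles, 90047"))
```

Fields that are missing from the input come back as `None`, which makes incomplete addresses easy to flag before the match phase.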

## Rebuilding the Text
This task consists of rebuilding the normalized text into its final form. It is done after the match phase.

## Identification and Match
After cleaning up and normalizing the text we need to check if the value of the address exists in our in-house database. 

## Two approaches:

1. Match with an existing database

2. Named Entity Recognition on the address

### Match with database
If we have a database with data that is considered correct, our job is to match the target addresses with the ones in this database. This is a matching problem. We can attack this problem with the following steps:

1. Split the address by field (prefix, location, suffixes)

2. Retrieve match candidates (search engine)

3. Match the address with the candidates by similarity

For this approach we’re going to work directly on the text patterns, without any kind of machine learning. The canonical database is usually provided by the Post Office.

Matching two addresses is a way to check whether they are the same address. For example, let’s say that we have the following entries in our canonical database:

    Pincode   Location         City     State
    400093    Mahakali Caves   Mumbai   MH
    NaN       Mahakali Road    Mumbai   MH

How do we figure out which one is the best match? 
We could try an exact match: only location strings that are exactly the same count as the same address. 
But this would miss lots of entries that could have typing errors but are otherwise valid addresses.




## Match
One approach is to retrieve candidates from the in-house database that are similar to the address we want to normalize. Search engines do that using different strategies. We’re not going to detail this process, so let’s just say that our search engine returned candidates to be compared.

For each of these candidates we do a comparison with our target address using some metric of similarity. There are several such metrics:

    1. Jaro distance
    2. Jaro-Winkler distance
    3. Cosine distance

I've chosen Jaro-Winkler distance. We compare the target address with each of the candidates and rank by the similarity between them.
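
The ranking step can be sketched in plain Python as below. This is a textbook implementation of Jaro-Winkler written for illustration, not the code from the repository. The helper names `jaro`, `jaro_winkler`, and `rank_candidates` are assumptions, as are the 0.1 prefix scale and the 4-character prefix limit (the commonly used defaults). The functions return a similarity, where 1.0 means identical, so candidates are sorted in descending order.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for strings sharing a prefix (up to 4 chars)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def rank_candidates(target: str, candidates):
    """Return (candidate, similarity) pairs, most similar first."""
    scored = [(c, jaro_winkler(target, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for candidate, score in rank_candidates(
    "mahakali cave road", ["mahakali caves", "mahakali road"]
):
    print(f"{score:.3f}  {candidate}")
```

Because the prefix bonus rewards a shared beginning, this metric tends to suit addresses where the leading tokens are the most reliable part of the string.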

## Search engines

Search engines usually already perform a string similarity comparison to retrieve the candidates, so they could, in principle, compute the similarity score without us having to program it ourselves. But sometimes the search engine's similarity algorithm cannot be tuned to a specific type of text, such as addresses. We also have more information than just the Location string, like the postal code and suffixes, which could help in the decision process.

## NER
Instead of using regular expressions to break up the address text into components we could create a Named Entity Recognizer and let it separate the address by fields.

1. Tag the canonical database with relevant tags

2. Train a CRF with the tagged database

3. Classify each address

4. Match the classified entities with the canonical base
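
As a sketch of what the tagging and training input might look like, the snippet below builds per-token feature dictionaries, the shape that sequence-labelling CRF libraries such as sklearn-crfsuite typically take as input. Training itself is omitted. The tag names (`LOCATION`, `CITY`, `STATE`, `PINCODE`), the chosen features, and the helper names are all assumptions made for illustration.

```python
def token_features(tokens, i):
    """Describe token i and its immediate neighbours as a feature dict."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "length": len(token),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def address_to_features(address: str):
    """Split an address on commas and whitespace; return one dict per token."""
    tokens = address.replace(",", " ").split()
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical hand-tagged training example: one label per token
tagged_example = [
    ("Mahakali", "LOCATION"),
    ("Caves", "LOCATION"),
    ("Mumbai", "CITY"),
    ("MH", "STATE"),
    ("400093", "PINCODE"),
]

X = [address_to_features("Mahakali Caves, Mumbai, MH 400093")]
y = [[label for _, label in tagged_example]]
print(X[0][4], y[0][4])
```

A CRF trained on many such `(X, y)` pairs would then predict one label per token for each new address.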

## Decision process
After text normalization and matching we hopefully have a list of candidates, each with a similarity score between the target and a canonical address. How do we decide whether a candidate is indeed the correct address? We can set a score threshold, for example based on our experience, and test the error rate. We can also create a classification model and train it on some manually labelled entries.
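
A minimal sketch of the threshold approach: accept the top-scoring candidate only if it clears the threshold, then measure the error rate on a small hand-checked sample. The function names, the sample data, and the threshold values are hypothetical.

```python
def best_match(scored_candidates, threshold):
    """Return the top-scoring candidate if it clears the threshold, else None."""
    if not scored_candidates:
        return None
    candidate, score = max(scored_candidates, key=lambda pair: pair[1])
    return candidate if score >= threshold else None


def error_rate(labelled_examples, threshold):
    """Fraction of hand-checked examples where the decision is wrong.

    Each example is (scored_candidates, expected), where expected is the
    correct canonical address or None if no candidate is correct.
    """
    errors = sum(
        1
        for scored, expected in labelled_examples
        if best_match(scored, threshold) != expected
    )
    return errors / len(labelled_examples)


# Hypothetical hand-checked sample
labelled = [
    ([("mahakali caves", 0.95), ("mahakali road", 0.70)], "mahakali caves"),
    ([("mahakali road", 0.85)], None),
    ([("mahakali road", 0.88)], "mahakali road"),
]

for threshold in (0.80, 0.90):
    print(threshold, error_rate(labelled, threshold))
```

Sweeping the threshold over a labelled sample like this shows the trade-off between accepting wrong matches and rejecting correct ones.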

## Address Normalization

See the Address Normalization folder for the code.

# Geocoding using Geopy

We can also build something similar to this in-house.

In [3]:
pip install geopy

Collecting geopy
  Downloading geopy-2.3.0-py3-none-any.whl (119 kB)
Collecting geographiclib<3,>=1.52
  Downloading geographiclib-2.0-py3-none-any.whl (40 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-2.0 geopy-2.3.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile
  Using cached et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
messy_address = pd.read_excel("messy address.xlsx")

In [2]:
messy_address.head()

Unnamed: 0,Raw Address,Issue
0,"800 S Figueroa St, Los Angeles, CA",missing zipcode
1,"515 W 7th St 2nd floor, Los Angeles, CA 90014",missing state
2,"2701 S Vermont Ave, Los Angoles, CA 90007",misspelled city
3,"7511 Raymonds Ave, Los Angeles, CA 90044",misspelled street name
4,"7900 S Westen Ave, Los Angeles, 90047","misspelled street name, missing state"


In [16]:
# OpenStreetMap Free API 
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Verified email address from OpenStreetMap")
# Google Maps Paid API
# from geopy.geocoders import GoogleV3
# geolocator = GoogleV3(api_key='Your Google Maps API Key')

In [9]:
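def extract_clean_address(address):
    # Return the geocoder's formatted address, or '' if the lookup fails
    try:
        location = geolocator.geocode(address)
        return location.address
    except Exception:
        # geocode() returns None when nothing matches, so .address raises
        # AttributeError; service errors (timeouts, quota) also end up here
        return ''
# Apply to the address column directly instead of row-wise over the frame
messy_address['clean address'] = messy_address['Raw Address'].apply(extract_clean_address)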

In [None]:
def extract_lat_long(address):
    # Return [latitude, longitude], or '' when the address cannot be geocoded
    try:
        location = geolocator.geocode(address)
        return [location.latitude, location.longitude]
    except Exception:
        # geocode() returns None when nothing matches, so the attribute
        # access raises AttributeError; service errors also end up here
        return ''
# Geocode each address, then split the result into two columns
messy_address['lat_long'] = messy_address['Raw Address'].apply(extract_lat_long)
messy_address['latitude'] = messy_address['lat_long'].apply(lambda x: x[0] if x != '' else '')
messy_address['longitude'] = messy_address['lat_long'].apply(lambda x: x[1] if x != '' else '')
# The temporary combined column is no longer needed
messy_address.drop(columns=['lat_long'], inplace=True)
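
The two cells above call the geocoder twice for every address, and the public Nominatim service asks clients to stay at or below one request per second. One possible way to handle both, sketched under those assumptions, is to geocode each distinct address once with a pause between calls and reuse the result for both the clean address and the coordinates. The helper `geocode_unique` and its `delay_seconds` parameter are hypothetical; `geocode_fn` stands for any callable such as `geolocator.geocode`.

```python
import time


def geocode_unique(addresses, geocode_fn, delay_seconds=1.0):
    """Geocode each distinct address once and return {address: result}.

    A result is whatever geocode_fn returns, or None if the call fails.
    """
    results = {}
    for address in addresses:
        if address in results:
            continue  # already looked up, reuse the stored result
        try:
            results[address] = geocode_fn(address)
        except Exception:
            results[address] = None
        # Pause between requests to respect the service's rate limit
        time.sleep(delay_seconds)
    return results


# Intended usage with the notebook's objects (not executed here):
#   results = geocode_unique(messy_address['Raw Address'], geolocator.geocode)
#   messy_address['clean address'] = messy_address['Raw Address'].map(
#       lambda a: results[a].address if results[a] is not None else '')
```

With the results cached in a dictionary, the latitude and longitude columns can be filled from the same lookups without further network calls.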