## Preflight checks

1) Install the following:
- plotly-express
- googlemaps
- tqdm
- pandas and spaCy, plus the spaCy model ```en_core_web_sm```, if you don't already have them

2) Set up an account for the Google Places API: https://developers.google.com/maps/documentation/places/web-service/overview?hl=PL


3) Download the data and put it in the ```data``` folder one level above this notebook, which is where the code below looks for it:

wget https://raw.githubusercontent.com/MarkHershey/CompleteTrumpTweetsArchive/master/data/realDonaldTrump_in_office.csv
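
Before running the notebook, it can help to confirm that the packages from step 1 are importable. The following is a minimal sketch using only the standard library. The list of module names is an assumption based on the imports in the first cell, and the check does not cover the spaCy model itself.

```python
import importlib.util

# Import names (not pip names) used by the notebook's first cell.
REQUIRED_MODULES = ["plotly", "googlemaps", "tqdm", "pandas", "spacy"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```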

In [25]:
# System tools
import os
import sys
sys.path.append(os.path.join(".."))

# Google API helper functions
import googlemaps
from utils.google_utils import get_placeid, process_id

# Data analysis
import pandas as pd
from collections import Counter
from tqdm import tqdm

# Plotly for visualisation; set the renderer explicitly so figures display in Jupyter
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'

# NLP
import spacy
nlp = spacy.load("en_core_web_sm")

## Doing NER with ```spaCy```

Here, we're only focusing on ```GPE``` in the data, but feel free to explore other kinds of entity!

__Read in the data__

In [26]:
input_file = os.path.join("..", "data", "realDonaldTrump_in_office.csv")

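Remember that the data format is a little irregular (for example, there are spaces after the delimiters), so ```read_csv``` needs a few extra arguments. If you're using your own data, all of this boilerplate might not be necessary.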

In [27]:
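# 'sep' and 'delimiter' are aliases, so pass only one of them;
# skipinitialspace drops the spaces that follow each comma
data = pd.read_csv(input_file,
                   sep=",",
                   encoding="utf-8",
                   skipinitialspace=True)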

__Extract GPEs with ```spaCy```__

In [28]:
doc = nlp("My name is Ross")
# For every entity in the doc object
for ent in doc.ents:
    # print the entity text and its NER label (NB: .label_ not .label)
    print(ent.text, ent.label_)

Ross PERSON


In [29]:
# create empty list for our entities
gpe_ents = []

The next cell extracts only the mentions of entities tagged with ```GPE``` by ```spaCy``` in the tweets.

On my computer (MacBook Pro 2017, 16GB RAM, 2.8 GHz Intel Core i7) it takes just under 2 minutes.

In [30]:
# Process tweets in batches of 200 with nlp.pipe. Try different sizes!
# You might also want to use the 'disable' flag in nlp.pipe to skip components you don't need

for doc in tqdm(nlp.pipe(data["Tweet Text"], batch_size=200)):
    for entity in doc.ents:
        if entity.label_ == "GPE":
            gpe_ents.append(entity.text)

23075it [01:42, 225.14it/s]
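
The 'disable' flag mentioned in the cell above could be used roughly as follows. This is a sketch, not a benchmarked recipe: it assumes the tagger and parser are not needed for NER in the loaded model, and the function name `extract_gpes` is made up for illustration.

```python
def extract_gpes(nlp, texts, batch_size=200):
    """Collect the text of every GPE entity found in `texts`.

    `nlp` is a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm").
    """
    gpes = []
    # Assumption: the tagger and parser are not needed for NER in this model,
    # so they are switched off for this call only.
    for doc in nlp.pipe(texts, batch_size=batch_size, disable=["tagger", "parser"]):
        gpes.extend(ent.text for ent in doc.ents if ent.label_ == "GPE")
    return gpes


# Usage in the notebook (not run here):
# gpe_ents = extract_gpes(nlp, data["Tweet Text"])
```

Whether this is noticeably faster depends on the model and the machine, so it is worth timing both versions on a small slice of the data first.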


__Count and group__

Use a Python ```Counter()``` object to count how often each entity occurs.

In [31]:
counted_gpes = Counter(gpe_ents)

In [32]:
# Create dataframe from dict
locations = pd.DataFrame.from_dict(counted_gpes, orient='index')

In [33]:
# Reset the index
locations.reset_index(level=0, inplace=True)
locations.columns = ["location", "count"]

__Explore data__

In [34]:
# Show a random sample of 10 entities
locations.sample(10)

Unnamed: 0,location,count
1020,@NRA,1
559,NY,3
20,Nov.14,1
717,@Hyundai,2
744,Lexington,3
314,Singapore,16
134,Stock M,1
267,@DHSgov,1
956,Brooklyn,3
89,Connecticut,12
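
The sample above includes entries such as Twitter handles and date fragments that are clearly not places. One way to drop the most obvious ones before geocoding is a simple heuristic filter. The rules below are assumptions chosen for illustration, and they will also reject real place names that contain digits.

```python
def looks_like_place(name):
    """Heuristic filter: reject Twitter handles and strings containing digits."""
    name = name.strip()
    if not name or name.startswith("@"):
        return False
    if any(ch.isdigit() for ch in name):
        return False
    return True


sample = ["@NRA", "NY", "Nov.14", "@Hyundai", "Lexington", "Singapore"]
print([name for name in sample if looks_like_place(name)])
# prints ['NY', 'Lexington', 'Singapore']
```

In the notebook this could be applied with `locations[locations["location"].map(looks_like_place)]`.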


In [35]:
# Show 10 most frequent
locations.sort_values("count", ascending=False).head(10)

Unnamed: 0,location,count
3,America,695
6,U.S.,670
50,China,661
18,Russia,439
21,the United States,396
26,Florida,248
40,USA,237
7,Mexico,230
78,Pennsylvania,208
11,Iran,206
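
Several of the top entries are different spellings of the same country, so their counts are split. A minimal sketch of folding such variants together is shown below. The alias table is hypothetical, and treating "America" as the United States is itself an assumption you may not want to make.

```python
from collections import Counter

# Hypothetical alias table; extend it after inspecting your own counts.
ALIASES = {
    "America": "United States",
    "U.S.": "United States",
    "USA": "United States",
    "the United States": "United States",
}


def merge_aliases(counts, aliases):
    """Return a new Counter with aliased names folded into one canonical name."""
    merged = Counter()
    for name, n in counts.items():
        merged[aliases.get(name, name)] += n
    return merged


counts = Counter({"America": 695, "U.S.": 670, "China": 661,
                  "the United States": 396, "USA": 237})
print(merge_aliases(counts, ALIASES).most_common(2))
# prints [('United States', 1998), ('China', 661)]
```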


## Geocoding

You might not be interested in exploring entities tagged as ```GPE```, in which case you don't need to worry about geocoding!

If you are interested in this kind of spatial question, though, you'll need an API key for Google Places.

When you have one, save it to a file called ```api-key.txt``` in the ```data``` folder. The key looks like a string of random numbers and letters.

API keys are specific to you and only you, so don't share yours!

In [36]:
with open("../data/api-key.txt", "r") as f:
    # strip the trailing newline that many editors add
    google_key = f.read().strip()
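
An alternative to a key file is an environment variable, which makes it harder to commit the key by accident. The sketch below tries the environment first and falls back to the file. The variable name `GOOGLE_API_KEY` and the function name are assumptions made up for this example, not something the notebook or the Google client requires.

```python
import os
from pathlib import Path


def load_api_key(path="../data/api-key.txt", env_var="GOOGLE_API_KEY"):
    """Return the API key from an environment variable, else from a file.

    `GOOGLE_API_KEY` is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the key file; strip surrounding whitespace and newlines.
        key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError("API key is empty")
    return key


# Usage in the notebook (not run here):
# google_key = load_api_key()
```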

The following cells show how to interface with Google Places via the API.

We use two helper functions imported from ```utils/google_utils.py```: ```get_placeid()``` and ```process_id()```.
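
The helpers themselves live in ```utils/google_utils.py``` and are not shown in this notebook. Purely as a guess at what such helpers might do, the sketch below looks up a place ID with the client's `find_place()` method and then resolves it with `reverse_geocode()`, which also accepts a place ID. The function names ending in `_sketch` and the returned field names are assumptions inferred from the columns used later, not the actual implementation.

```python
def get_placeid_sketch(name, client):
    """Look up a place name and return the first candidate's place ID, or None."""
    response = client.find_place(name, "textquery", fields=["place_id"])
    candidates = response.get("candidates", [])
    return candidates[0]["place_id"] if candidates else None


def process_id_sketch(place_id, client):
    """Resolve a place ID to the address and coordinate fields used for plotting."""
    results = client.reverse_geocode(place_id)
    if not results:
        return None
    top = results[0]
    geometry = top["geometry"]
    return {
        "formatted_address": top["formatted_address"],
        "lat": geometry["location"]["lat"],
        "lon": geometry["location"]["lng"],
        "location_type": geometry["location_type"],
    }
```

Both functions take an already configured `googlemaps.Client` as `client`, so they make no network calls until one is passed in.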

In [37]:
# Set up googlemaps clients
gc_rate  = 100 # Geocoding queries per second
pl_rate  = 100 # Places queries per second

gc_client = googlemaps.Client(key=google_key, queries_per_second=gc_rate) # For Geocoding API
pl_client = googlemaps.Client(key=google_key, queries_per_second=pl_rate) # For Places API

This will be time consuming: around 15 minutes.

__REMEMBER THE GOOGLE PLACES BUSINESS MODEL: API requests can cost money $$__

Free options also exist, such as MapQuest https://developer.mapquest.com/

If you use MapQuest, you'll need to use a different Python library https://geocoder.readthedocs.io/providers/MapQuest.html

In [38]:
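# Perform geocoding
google_geocode_output = {}
# Iterate over the location names (iterating over the dataframe itself yields column names)
for loc in tqdm(locations["location"]):
    placeid = get_placeid(loc, pl_client)
    if placeid:
        google_geocode_output[loc] = process_id(placeid, gc_client)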

  0%|          | 2/1190 [00:00<01:28, 13.47it/s]
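
Because each request takes time and may be billed, it can be worth caching results on disk so that a rerun only queries names it has not seen before. The following is a minimal sketch using a JSON file. The function name, the `lookup` callback and the cache file name are all hypothetical.

```python
import json
from pathlib import Path


def geocode_with_cache(names, lookup, cache_path):
    """Geocode `names` with `lookup`, reusing results saved in a JSON file.

    `lookup` is any function mapping a name to a JSON-serialisable result
    (or None); `cache_path` is a hypothetical file name for this sketch.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    try:
        for name in names:
            if name not in cache:  # only query names we have not seen before
                cache[name] = lookup(name)
    finally:
        # Save whatever we have, even if the loop was interrupted.
        path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```

In the notebook, `lookup` would wrap the `get_placeid()` and `process_id()` calls from the cell above.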


__Explore results__

In [39]:
# Convert the output to a dataframe, indexed by location name
google_geodata = pd.DataFrame.from_dict(google_geocode_output, orient='index')

In [40]:
# Save for future use

#google_geodata.to_csv("google_geodata.csv", sep="\t")
#google_geodata = pd.read_csv("google_geodata.csv", sep="\t")

In [41]:
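# Merge dataframes; google_geodata is indexed by location name.
# (If you reloaded it from the CSV above, the index is instead a column
# called "Unnamed: 0", so use right_on="Unnamed: 0".)
merged_data = pd.merge(locations, google_geodata, left_on="location", right_index=True)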
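
The column name `Unnamed: 0` only exists once a dataframe with an unnamed index has been written to CSV and read back. The sketch below shows both situations with a tiny stand-in dataframe. The coordinates are made-up illustrative values, not geocoding results.

```python
import io

import pandas as pd

# Tiny stand-in for google_geodata: location names are the (unnamed) index.
geodata = pd.DataFrame.from_dict(
    {"China": {"lat": 35.0, "lon": 105.0}}, orient="index")
locations = pd.DataFrame({"location": ["China"], "count": [661]})

# In memory, the location names are the index, so merge on the index.
merged = pd.merge(locations, geodata, left_on="location", right_index=True)

# After a CSV round trip the old index comes back as a column "Unnamed: 0".
buffer = io.StringIO()
geodata.to_csv(buffer, sep="\t")
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")
merged_reloaded = pd.merge(locations, reloaded,
                           left_on="location", right_on="Unnamed: 0")

print(list(reloaded.columns))
```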

In [None]:
# Group data by location for plotting, summing the mention counts
subset = merged_data.groupby(["formatted_address", "lat", "lon", "location_type"])["count"].sum().reset_index()

In [None]:
subset.sort_values("count", ascending=False).head(20)

__Visualise__

In [None]:
# Show locations on a map, using the grouped data from above
fig = px.scatter_geo(subset, lat='lat', lon='lon',
                     hover_name="formatted_address", size="count",
                     color='location_type',
                     projection="natural earth")

In [None]:
fig.show()