# 3. Location Extraction and SpaCy Word Vectorization

Using the SpaCy module's Named Entity Recognition (NER) functionality, we pulled important features from the Tweet text. We first formatted the Tweets so that SpaCy's algorithm could parse them more reliably. We then compared the Tweets and their extracted locations against a dataset of interstate exits and cross streets to look up GPS coordinates for known entities. These coordinates were added to the Tweet dataset to help build queries for mapping purposes.

- [**Import Libraries**](#Import-Libraries)
- [**SpaCy Preprocessing**](#SpaCy-Preprocessing)
  - [SpaCy Processing Function](#Run-SpaCy-Location-Extraction)
- [**GPS Coordinate Extraction**](#GPS-Coordinate-Extraction-using-Interstate-Exits)
- [**SpaCy Visualization**](#SpaCy-Visualization)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import spacy
import string
import datetime

from spacy import displacy

In [2]:
# Read in csv with Tweets
twitter_closures = pd.read_csv("../data/Cleaned_Tweets/cleaned_historic_official_07312019.csv")
rt_closures = pd.read_csv("../data/Cleaned_Tweets/cleaned_RT_official_08022019.csv")

exits = pd.read_csv("../data/interstate_exits.csv")

twitter_closures = twitter_closures[['date', 'text', 'type', 'username', 'tweet', 'state', 'road_closure']]
rt_closures = rt_closures[['date', 'text', 'type', 'username', 'tweet', 'state', 'road_closure']]

# Print DF shape
print(twitter_closures.shape)
print(rt_closures.shape)

# Show head of real time tweets
rt_closures.head()

(24054, 7)
(200, 7)


| | date | text | type | username | tweet | state | road_closure |
|---|---|---|---|---|---|---|---|
| 0 | 2019-08-02 13:12:24 | updated crash in duval on sr-202 butler blv... | official | fl511_northeast | Updated Crash in Duval on SR-202 Butler Blv... | Florida | 1 |
| 1 | 2019-08-02 13:06:27 | good news project wrapped early #anjtraf... | official | ActionTraffic | Good News Project wrapped early #ANJTraf... | Florida | 0 |
| 2 | 2019-08-02 13:04:02 | roosevelt boulevard and san juan avenue railro... | official | JSOPIO | Roosevelt Boulevard and San Juan Avenue railro... | Florida | 0 |
| 3 | 2019-08-02 12:46:32 | year to date in 2019 \n101 traffic fatalities ... | official | JSOPIO | Year to Date in 2019 \n101 Traffic Fatalities ... | Florida | 0 |
| 4 | 2019-08-02 11:48:59 | jacksonville there will be lane closure on n... | official | fl511_northeast | Jacksonville There will be lane closure on N... | Florida | 1 |


In [3]:
exits['crossSt'] = exits['crossSt'].fillna('None')

## SpaCy Preprocessing

SpaCy recognizes locations more accurately when certain words, such as "at" or "in", come before the entity. We modified the Tweet strings accordingly so that SpaCy gathers more location entities.

We specifically ask SpaCy for three entity labels: countries, cities, and states (GPE); non-GPE locations such as mountain ranges and bodies of water (LOC); and buildings, airports, highways, and bridges (FAC).

In [4]:
# Create new columns to hold the modified tweet text and the extracted locations.
twitter_closures['modified_text'] = ''
twitter_closures['location'] = ''
rt_closures['modified_text'] = ''
rt_closures['location'] = ''

# Show modified DF
twitter_closures.head(2)

| | date | text | type | username | tweet | state | road_closure | modified_text | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-10-11 16:39:51+00:00 | the pioh for the sr 138 i-20 is going on now u... | official | GDOTATL | The PIOH for the SR 138 I-20 is going on now u... | Georgia | 0 | | |
| 1 | 2016-10-10 19:10:23+00:00 | we appreciate all the hard work our crews have... | official | GDOTATL | We appreciate all the hard work our crews have... | Georgia | 0 | | |


In [5]:
# keys are treated as regex patterns by Series.replace(..., regex=True)
# replacements keep their surrounding spaces so expanded words do not fuse with neighbors
format_dict = {"hwy": "highway ",
               "blvd": "boulevard",
               " st": " street",
               "CR ": "County Road ",
               "SR ": "State Road ",
               "I-": "Interstate ",
               "EB ": "Eastbound ",
               "WB ": "Westbound ",
               "SB ": "Southbound ",
               "NB ": "Northbound ",
               " on ": " at ",
               " E ": " East ",
               " W ": " West ",
               " S ": " South ",
               " N ": " North ",
               "mi ": "mile ",
               "between ": "at ",
               "Between ": "at ",
               " In ": " in ",
               " in ": " at "}

In [6]:
def spacy_cleaner(df, col, word_dict):
    modified_text = "At " + df[col].replace(word_dict, regex=True)
    modified_text = modified_text.str.title()
    return modified_text
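
The keys in `format_dict` are substrings, so a key such as `" st"` can also match the start of a longer word like "still". As a minimal sketch of one alternative, assuming whole-word matching is acceptable, the standard-library `re` module can anchor each abbreviation with `\b`. The `ABBREVIATIONS` subset, the `expand_abbreviations` helper, and the sample sentence are hypothetical and are not part of the notebook's pipeline.

```python
import re

# Hypothetical subset of abbreviations; keys are whole tokens, not substrings
ABBREVIATIONS = {
    "hwy": "highway",
    "blvd": "boulevard",
    "st": "street",
    "EB": "Eastbound",
    "WB": "Westbound",
}

# One alternation wrapped in \b so "st" cannot match inside "still"
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")


def expand_abbreviations(text):
    """Replace whole-word abbreviations with their long forms."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


# Made-up sentence for illustration only
sample = "Crash on Main st still blocking EB lanes near Beach blvd"
print(expand_abbreviations(sample))
# Crash on Main street still blocking Eastbound lanes near Beach boulevard
```

The trade-off is that word boundaries will not expand prefixes such as `I-` in `I-295`, which would still need a dedicated pattern.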

In [7]:
# run the text cleaning function and test results
twitter_closures['modified_text'] = spacy_cleaner(twitter_closures, 'tweet', format_dict)
rt_closures['modified_text'] = spacy_cleaner(rt_closures, 'tweet', format_dict)

In [8]:
twitter_closures['username'].unique()

array(['GDOTATL', 'SCDOTMidlands', 'SCDOTPeeDee', 'SCDOTLowCountry',
       'SCDOTPiedmont', '511statewideva', 'fl511_panhandl', '511Georgia',
       'fl511_state', 'fl511_northeast', 'fl511_southeast',
       'fl511_southwest', 'fl511_tampabay', 'fl511_central',
       '511centralva', '511hamptonroads', '511northernva',
       'NCDOT_Westmtn', 'NCDOT_Triangle', 'NCDOT_Piedmont',
       'NCDOT_Charlotte', 'NCDOT_Asheville', 'NCDOT_Scoast',
       'NCDOT_Ncoast'], dtype=object)

In [9]:
# convert date column to datetime
twitter_closures['date'] = pd.to_datetime(twitter_closures['date'])
rt_closures['date'] = pd.to_datetime(rt_closures['date'])

In [10]:
# for ease of use of the historic tweets, only take tweets posted after
# October 6, 2016 and before October 9, 2016,
# the period when Hurricane Matthew hit Jacksonville
twitter_closures = twitter_closures[(twitter_closures['date'] > '2016-10-6') & (twitter_closures['date'] < '2016-10-9')]

In [11]:
# only use tweets that contain road closures and are from 'fl511_northeast'
# .copy() makes loc_df an independent DataFrame rather than a slice of twitter_closures
loc_df = twitter_closures[(twitter_closures['road_closure'] == 1) & (twitter_closures['username'] == 'fl511_northeast')].copy()
loc_df.shape

(285, 9)

In [12]:
# only take tweets that contain road closures from the real time set
# .copy() makes rt_loc_df an independent DataFrame rather than a slice of rt_closures
rt_loc_df = rt_closures[(rt_closures['road_closure'] == 1)].copy()
rt_loc_df.shape

(129, 9)

## Run SpaCy Location Extraction 

**WARNING:** SpaCy is computationally expensive, so extracting these locations will take time.

In [13]:
def get_loc(df, text_column, location_column):

    # instantiate the spacy model once, instead of reloading it for every row
    nlp = spacy.load("en_core_web_sm")

    # entity labels that we treat as places
    place_labels = {'GPE', 'FAC', 'LOC'}

    results = []

    # loop through the text and the current value of the location column
    for text, current in zip(df[text_column], df[location_column]):

        # create document from modified text column
        doc = nlp(text)

        # put every entity labelled as a place in a set
        locations = {ent.text for ent in doc.ents if ent.label_ in place_labels}

        # keep the existing value when no place entity is found
        results.append(locations if locations else current)

    # return a new column aligned to df's index, without modifying df in place
    return pd.Series(results, index=df.index, name=location_column)
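
Because the cost is dominated by running the model, one common way to speed this step up is to load the model once and stream all texts through `nlp.pipe`, which processes them in batches. The sketch below shows only the filtering logic and takes the pipe as an argument, so it runs without a model installed; `extract_places` and `fake_pipe` are hypothetical names, and `fake_pipe` is a stand-in for spaCy rather than part of it.

```python
from types import SimpleNamespace

# Entity labels treated as places, matching the labels used in this notebook
PLACE_LABELS = {"GPE", "FAC", "LOC"}


def extract_places(texts, pipe):
    """Return one set of place-entity strings per input text.

    `pipe` is any callable that maps an iterable of strings to an iterable of
    doc-like objects exposing `.ents`, for example `nlp.pipe` on a loaded model.
    """
    return [
        {ent.text for ent in doc.ents if ent.label_ in PLACE_LABELS}
        for doc in pipe(texts)
    ]


def fake_pipe(texts):
    """Stand-in for nlp.pipe: tags the word 'Duval' as GPE and nothing else."""
    for text in texts:
        ents = [SimpleNamespace(text="Duval", label_="GPE")] if "Duval" in text else []
        yield SimpleNamespace(ents=ents)


print(extract_places(["Crash At Duval", "No Place Here"], fake_pipe))
# [{'Duval'}, set()]
```

With a real model, the call would look like `extract_places(loc_df['modified_text'], nlp.pipe)` after `nlp = spacy.load("en_core_web_sm")`.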

In [14]:
loc = get_loc(loc_df, 'modified_text', 'location')


In [15]:
rt_loc = get_loc(rt_loc_df, 'modified_text', 'location')


In [16]:
loc_df['location'] = loc
print(loc_df.shape)
loc_df.head()

(285, 9)



| | date | text | type | username | tweet | state | road_closure | modified_text | location |
|---|---|---|---|---|---|---|---|---|---|
| 6649 | 2016-10-08 23:33:38+00:00 | new disabled vehicle in duval on sr-202 but... | official | fl511_northeast | NEW Disabled vehicle in Duval on SR-202 But... | Florida | 1 | At New Disabled Vehicle At Duval At Sr-202 ... | {Kernan Blvd Right Shoulder Blocked, Blvd East} |
| 6651 | 2016-10-08 23:18:05+00:00 | new disabled vehicle in duval on i-295 w nort... | official | fl511_northeast | NEW Disabled vehicle in Duval on I-295 W nort... | Florida | 1 | At New Disabled Vehicle At Duval At Interstat... | {San Jose, Interstate 295 West North} |
| 6657 | 2016-10-08 21:58:06+00:00 | new unconfirmed disabled vehicle in duval on ... | official | fl511_northeast | NEW Unconfirmed disabled vehicle in Duval on ... | Florida | 1 | At New Unconfirmed Disabled Vehicle At Duval ... | {Interstate 10 East Ramp To Interstate 95} |
| 6660 | 2016-10-08 21:33:05+00:00 | update disabled vehicle in duval on i-295 w n... | official | fl511_northeast | UPDATE Disabled vehicle in Duval on I-295 W n... | Florida | 1 | At Update Disabled Vehicle At Duval At Inters... | {Interstate 295 West North} |
| 6662 | 2016-10-08 21:28:40+00:00 | new disabled vehicle in duval on i-295 w nort... | official | fl511_northeast | NEW Disabled vehicle in Duval on I-295 W nort... | Florida | 1 | At New Disabled Vehicle At Duval At Interstat... | {Interstate 295 West North} |


In [17]:
rt_loc_df['location'] = rt_loc
print(rt_loc_df.shape)
rt_loc_df.head()

(129, 9)



| | date | text | type | username | tweet | state | road_closure | modified_text | location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-08-02 13:12:24 | updated crash in duval on sr-202 butler blv... | official | fl511_northeast | Updated Crash in Duval on SR-202 Butler Blv... | Florida | 1 | At Updated Crash At Duval At Sr-202 Butler ... | {Interstate, Butler Blvd} |
| 4 | 2019-08-02 11:48:59 | jacksonville there will be lane closure on n... | official | fl511_northeast | Jacksonville There will be lane closure on N... | Florida | 1 | At Jacksonville There Will Be Lane Closure A... | {New Kings Rd Northbound From Dunn Ave, Woodle... |
| 7 | 2019-08-02 11:12:34 | new disabled vehicle in duval on i-295 e sout... | official | fl511_northeast | New Disabled vehicle in Duval on I-295 E sout... | Florida | 1 | At New Disabled Vehicle At Duval At Interstat... | {Interstate 295 East South, Dames Point} |
| 9 | 2019-08-02 10:45:18 | new planned construction in clay on sr-21 b... | official | fl511_northeast | New Planned construction in Clay on SR-21 B... | Florida | 1 | At New Planned Construction At Clay At Sr-21 ... | |
| 10 | 2019-08-02 10:34:38 | 635a- big 3 travel times no problems on th... | official | ActionTraffic | 635a- BIG 3 travel times no problems on th... | Florida | 1 | At 635A- Big 3 Travel Times No Problems At... | |


In [18]:
loc_df.to_csv("../data/Loc_Extracted/tweet_locations_sample_08022019.csv", index = False)
rt_loc_df.to_csv("../data/Loc_Extracted/rt_locations_sample_08022019.csv", index = False)

## GPS Coordinate Extraction using Interstate Exits

Using our collected dataset of geolocated interstate exits, we then attempt to match strings that reference interstate names and exit numbers to coordinates in that dataset. The first of the following two functions searches each Tweet for an interstate, a direction, and an exit; the second attempts to attach the matching GPS coordinates to the Tweet.

In [19]:
# function to extract interstate, exit number, and direction
# (i_df is kept in the signature for the existing calls but is not used here)
def exit_extractor(df, col, i_df):
    
    # instantiate lists for exit data
    exits = []
    interstates = []
    direction = []
    
    # loop through text column
    for item in df[col]:
        
        # look for "interstate" in text
        if 'Interstate' in item:
            
            # use regex to extract interstate and number from text
            i_string = re.search(r'Interstate (\S+)', item)
            interstates.append(i_string.group(0))
            
            # use regex (raw strings, so \d is not read as a string escape)
            # to extract the direction following the interstate number
            d_string = re.search(r"(i-\d*|Interstate \d*) (South|North|East|West)*", item)
            d_string = re.search(r"South|North|East|West", d_string.group(0))
            
            # append the direction to the list, or null if none was matched
            direction.append(d_string.group(0) if d_string else np.nan)
                             
            # find "exit" in text
            if 'Exit' in item:
                
                # use regex to extract "Exit" and the exit number from text
                e_string = re.search(r'Exit (\S+)', item)
                exits.append(e_string.group(0))
            
            # add "none" when no exit is found   
            else:
                exits.append("None")
                
        # add "none" to exits and interstates if no interstate is found
        else:
            interstates.append("None")
            exits.append("None")
            direction.append("None")
            
    # create a new dataframe from the interstate and exit lists
    new_df = pd.DataFrame(data = interstates, columns = ['interstate'])
    new_df['exits'] = exits
    new_df['direction'] = direction
    
    # return new dataframe
    return new_df
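
To see what these patterns capture, they can be tried on a single string. The sketch below is a simplified, single-pass variant of the regular expressions above, run on a made-up Tweet written in the title-cased style that the preprocessing step produces; the text and the `parsed` dictionary are illustrative only.

```python
import re

# Made-up example text, not taken from the dataset
text = "At New Crash At Duval At Interstate 95 North At Exit 351 Right Lane Blocked"

# "Interstate" plus the token that follows it
interstate = re.search(r"Interstate (\S+)", text)

# direction word immediately after the interstate number
direction = re.search(r"Interstate \d+ (South|North|East|West)", text)

# "Exit" plus the token that follows it
exit_match = re.search(r"Exit (\S+)", text)

parsed = {
    "interstate": interstate.group(0) if interstate else None,
    "direction": direction.group(1) if direction else None,
    "exit": exit_match.group(0) if exit_match else None,
}
print(parsed)
# {'interstate': 'Interstate 95', 'direction': 'North', 'exit': 'Exit 351'}
```

Checking each match for `None` before calling `.group()` avoids an `AttributeError` on Tweets where a pattern is absent.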

In [20]:
# function to extract longitude and latitude, if available
def loc_extractor(new_df, i_df):
    
    lat = []
    long = []
    
    # loop through the new dataframe
    for index, row in new_df.iterrows():
        
        # find rows that have both an interstate and exit extracted
        if (row['interstate'] != "None") and (row['exits'] != "None") and row['direction'] != "None":
            
            # attempt to add lat and long based on exit and interstate strings
            try:    
                mask = (i_df['interstate'].str.contains(row['interstate'])) & (i_df['exits'].str.contains(row['exits']))
                
                # add lat and long to list
                lat.append(i_df[mask].iloc[0]['lat'])
                long.append(i_df[mask].iloc[0]['long'])
            
            # if the lookup fails (for example, no matching row), append null to lat and long
            # print index where the error occurred
            except Exception:
                print(f"No exit found at {index}")

                lat.append(np.nan)
                long.append(np.nan)
        # if no exit is found, add null values to lat and long
        else:
            lat.append(np.nan)
            long.append(np.nan)
            
    # add lat and long to new dataframe
    new_df['lat'] = lat
    new_df['long'] = long
    
    return new_df
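
One thing to watch with substring matching such as `str.contains` is that a short exit label may also match longer ones, so taking the first hit can return the wrong row. The sketch below illustrates the difference with an exact-match lookup on plain Python data. The rows, column values, and coordinates are hypothetical placeholders, and the real exits dataset may label exits differently.

```python
# Hypothetical rows standing in for the interstate exits dataset
exit_rows = [
    {"interstate": "Interstate 95", "exits": "Exit 35", "lat": 1.0, "long": -1.0},
    {"interstate": "Interstate 95", "exits": "Exit 351", "lat": 2.0, "long": -2.0},
]

# Index coordinates by the exact (interstate, exit) pair
coords = {(r["interstate"], r["exits"]): (r["lat"], r["long"]) for r in exit_rows}


def lookup(interstate, exit_name):
    """Return (lat, long) for an exact match, or (None, None) if absent."""
    return coords.get((interstate, exit_name), (None, None))


# Substring matching finds both rows when searching for "Exit 35"
substring_hits = [r["exits"] for r in exit_rows if "Exit 35" in r["exits"]]
print(substring_hits)                        # ['Exit 35', 'Exit 351']

# Exact matching returns only the intended row, or a null pair when absent
print(lookup("Interstate 95", "Exit 35"))    # (1.0, -1.0)
print(lookup("Interstate 95", "Exit 999"))   # (None, None)
```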

In [21]:
loc_df = pd.read_csv("../data/Loc_Extracted/tweet_locations_sample_08012019.csv")
rt_loc_df = pd.read_csv("../data/Loc_Extracted/rt_locations_sample_08022019.csv")

In [22]:
# run coordinate extraction on historic dataset
e_df = loc_extractor(exit_extractor(loc_df, 'modified_text', exits), exits)
final_df = pd.concat([loc_df, e_df], axis = 1)

No exit found at 157
No exit found at 160
No exit found at 167
No exit found at 220
No exit found at 230
No exit found at 253
No exit found at 265
No exit found at 267


In [23]:
# run coordinate extraction on real time dataset
rt_e_df = loc_extractor(exit_extractor(rt_loc_df, 'modified_text', exits), exits)
final_rt_df = pd.concat([rt_loc_df, rt_e_df], axis = 1)

No exit found at 39
No exit found at 40


In [24]:
final_rt_df.dropna().shape

(35, 14)

In [25]:
final_df.dropna().shape

(97, 14)

In [26]:
final_df.to_csv("../data/Loc_Extracted/tweet_locations_extract_08012019.csv", index = False)
final_rt_df.to_csv("../data/Loc_Extracted/rt_locations_extract_08022019.csv", index = False)

## SpaCy Visualization

In [27]:
nlp = spacy.load("en_core_web_sm")
text = final_rt_df.iloc[50]['modified_text']
doc = nlp(text)
displacy.render(doc, style="ent", jupyter = True)

In [28]:
text = final_rt_df.iloc[72]['modified_text']
doc = nlp(text)
displacy.render(doc, style="ent", jupyter = True)