# Lab 02: Georeferencing Location-based Social Media
In this tutorial, we will learn:
- How to extract information (e.g., tweets) from location-based social media (e.g., Twitter)
- How to identify locational (e.g., place name) information from text-based data (e.g., tweets, newspapers)
- How to link the identified locations to metric-based coordinates on the surface of the Earth

Several libraries/packages are needed in this tutorial. Use `pip` or `conda` to install them:
- [tweepy](https://pypi.org/project/tweepy/): the library for accessing the Twitter API
- [spaCy](https://spacy.io/usage): the library for natural language processing
- [spacy-dbpedia-spotlight](https://pypi.org/project/spacy-dbpedia-spotlight/): a small library that links entities recognized by spaCy to DBpedia entities.
- ... 

## Part 1: Extracting (geo)text from Twitter
This part explains how to extract text-based unstructured information from the social media platform Twitter via its API. A similar pipeline can be used to extract information from other types of social media/Web services (e.g., Foursquare, Yelp, Flickr, etc.).

Twitter is a useful data source for studying the social impacts of events and activities. In this part, we are going to learn how to collect Twitter data using its API. Specifically, we are going to focus on geotagged Twitter data.

First, Twitter requires users of its API to be authenticated by the system. One simple way to obtain such authentication is to register a Twitter account. This is the approach we are going to take in this tutorial.

Go to the website of Twitter: https://twitter.com/ , and click “Sign up” at the upper right corner. You can skip this step if you already have a Twitter account.

After you have registered/logged in to your Twitter account, we are going to obtain the keys that are necessary to use the Twitter API. Go to https://apps.twitter.com/ , and sign in using the Twitter account you have just created (sometimes the browser will automatically sign you in).

After you have signed in, click the button “Create New App”. Then fill in the necessary information to create an app. Note that you might need to add your phone number to your Twitter account in order to do so. If you don't like it, feel free to remove your phone number from your account after you have finished your project.

Then you will be directed to a page (see example below) asking you for a name of your App. Give it a name that you want. 

![Get Keys from Twitter Developer](lab2-fig1.png)

Click `Get keys`. It will then generate an API Key, an API Key Secret, and a Bearer Token (see below for an example). Make sure you copy and paste them into a safe place (e.g., a text editor). We will need these credentials later.

![Authentication Example](lab2-fig2.png)

Next, we also need to obtain the Access Token and its secret. To do so, go to `Projects & Apps` --> select your App. Then click `Keys and tokens`, and click `Generate` to the right of `Access Token and Secret` (see below). Again, make sure you record them in a safe place; we will need them later. Note that if for some reason you lose your tokens and secrets, this page is where you regenerate them.

![Access Token Example](lab2-fig3.png)

Once you have your Twitter app set-up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.

In [1]:
import os
import tweepy as tw
import pandas as pd

To access the Twitter API, you will need four things from your Twitter App page. These keys are located in your app settings under the `Keys and tokens` tab.
- API key
- API key secret
- access token
- access token secret

Below I have put in my credentials. You should use your own! Remember not to share them with anyone else, because these values are specific to your app.

In [5]:
api_key= '5TX6isrDz92kOC1s7qsTFWq5F'
api_key_secret= 'DL4Gw2WLNo2bK538lL5GeNYCtiwlsuYUHOlW8NCSQszK3ac101'
access_token= '1582847729486684180-VH5N9AEb2zyyFyLOj5BuD8I9ca0ils'
access_token_secret = 'e0cvkxJz9AWvu9dq0Fb48r61vkmIaA1JqLLyhEhms5FGt'
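
Since these values should not be shared, one option is to keep them out of the notebook and read them from environment variables instead. The sketch below is only one possible way to do this; the variable names (`TWITTER_API_KEY`, etc.) and the helper `load_credentials` are hypothetical, not something Twitter or tweepy requires. The demo uses a plain dictionary as a stand-in for the real environment so that it runs without any setup.

```python
import os

# Hypothetical variable names (an assumption): use any names you like,
# as long as they match what you set in your own environment.
CREDENTIAL_VARS = (
    "TWITTER_API_KEY",
    "TWITTER_API_KEY_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}


# Demo with a plain dict standing in for the real environment.
fake_env = {name: "placeholder-" + str(i) for i, name in enumerate(CREDENTIAL_VARS)}
creds = load_credentials(fake_env)
print(sorted(creds))
```

In a real session you would call `load_credentials()` with no argument and assign, for example, `api_key = creds["TWITTER_API_KEY"]`.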

With these credentials, we can next create an API object in Python that builds the connection between this Jupyter notebook and Twitter:

In [6]:
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

For example, now we can send tweets using your API access. Note that your tweet needs to be 280 characters or less:

In [12]:
# Post a tweet from Python
api.update_status("Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience")
# Your tweet has been posted!

Status(_api=<tweepy.api.API object at 0x7fe4200e7700>, _json={'created_at': 'Wed Oct 19 22:55:47 +0000 2022', 'id': 1582868138827296768, 'id_str': '1582868138827296768', 'text': "Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience", 'truncated': False, 'entities': {'hashtags': [{'text': 'DataScience', 'indices': [92, 104]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="https://ruizhugeographer.com/" rel="nofollow">GEOGM0068-Zhu</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1582847729486684180, 'id_str': '1582847729486684180', 'name': 'Richard Chu', 'screen_name': 'GEOGM0068', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 0, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Wed Oct 19 21:35:

If you go to your Twitter account and check your Profile, you will see the tweet has been posted! Congrats on your first post via Python!
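
The 280-character limit mentioned earlier can be checked before posting. The helper below (`fits_in_tweet`, a hypothetical name) is a rough sketch that uses Python's plain `len()`. Twitter's own counting weights links and some characters (e.g. emoji) differently, so treat the result as an approximation.

```python
TWEET_LIMIT = 280  # the limit stated in this tutorial


def fits_in_tweet(text, limit=TWEET_LIMIT):
    """Approximate check: True if the text has at most `limit` characters."""
    return len(text) <= limit


message = ("Hello Twitter, I'm sending the first message via Python to you! "
           "I learnt it from GEOGM0068. #DataScience")
print(len(message), fits_in_tweet(message))
```

You could guard the posting call with it, e.g. only call `api.update_status(message)` when `fits_in_tweet(message)` is true.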

Note that you may see an error like "453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve". This means you need to elevate your access. To do so: (1) go to Products --> Twitter API v2; (2) click the "Elevated" tab (or "Academic Research" if you need it for your dissertation later); (3) click `Apply`, then fill in the form (you can choose No for many of the questions). See the screenshot below for (1) and (2):

![Elevate your access](lab2-fig4.png)

Next, let's search for English-language tweets about `#energycrisis`. Note that this search endpoint only returns tweets from roughly the past seven days, so the results are recent posts. Many posts will match. To keep the illustration simple and to save requests (you have a limited number of requests via this API), we only request 5 from the list.

In [61]:
search_words = "#energycrisis"
tweets = tw.Cursor(api.search_tweets,
              q=search_words,
              lang="en").items(5)
tweets

<tweepy.cursor.ItemIterator at 0x7fe4101d0f10>

Here, you see an object that you can iterate or loop over (i.e. an `ItemIterator`) to access the data collected. Each item in the iterator has various attributes that you can access to get information about each tweet, including:

- the text of the tweet
- who sent the tweet
- the date the tweet was sent

and more. The code below loops through the object and saves the time of the tweet, the user who posted it, the text of the tweet, as well as the user location to a pandas `DataFrame`:

In [62]:
import pandas as pd

# create dataframe
columns = ['Time', 'User', 'Tweet', 'Location']

data = []
for tweet in tweets:
    data.append([tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location])

df = pd.DataFrame(data, columns=columns)
df

Unnamed: 0,Time,User,Tweet,Location
0,2022-10-19 23:58:43+00:00,wildbluethistle,Lonely inside https://t.co/XFcwiIFhXa #goodmus...,
1,2022-10-19 23:58:12+00:00,DrowerR,RT @DaveDavos2: @Bowenchris That will NEVER de...,"Melbourne, Victoria"
2,2022-10-19 23:57:48+00:00,grahamtfn,RT @lucycowan83: We're encouraging voluntary o...,"Glasgow, Scotland"
3,2022-10-19 23:57:16+00:00,SocEntEdinburgh,RT @socialprintandc: For further details on ho...,Edinburgh
4,2022-10-19 23:53:03+00:00,lindakillian,RT @LevittFlisser: High electric bill poised t...,
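
Notice that the `Location` column is free text from the user's profile and is sometimes empty. As a small sketch of how one might keep only tweets that carry some location text, the code below repeats the same field extraction on stand-in objects (built with `SimpleNamespace`, not real API results) and filters out blank locations. The helper names `tweet_to_row` and `has_location` are hypothetical.

```python
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Pick the same four fields as the loop above."""
    return {
        "Time": tweet.created_at,
        "User": tweet.user.screen_name,
        "Tweet": tweet.text,
        "Location": tweet.user.location,
    }


def has_location(row):
    """True if the free-text profile location is not blank."""
    return bool(row["Location"] and row["Location"].strip())


# Stand-ins mimicking two rows of the output above (shortened text, and a
# string instead of the datetime that tweepy would return for created_at).
fake_tweets = [
    SimpleNamespace(created_at="2022-10-19 23:58:43+00:00", text="Lonely inside ...",
                    user=SimpleNamespace(screen_name="wildbluethistle", location="")),
    SimpleNamespace(created_at="2022-10-19 23:57:16+00:00", text="RT @socialprintandc: ...",
                    user=SimpleNamespace(screen_name="SocEntEdinburgh", location="Edinburgh")),
]

rows = [tweet_to_row(t) for t in fake_tweets]
located = [row for row in rows if has_location(row)]
print([row["User"] for row in located])
```

A list of dictionaries like `located` can be passed directly to `pd.DataFrame(located)` if you prefer to continue in pandas.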


We can further save the dataframe to a local csv file (structured data):

In [63]:
df.to_csv('tweets_example.csv')

Note that there is another way of writing the query to the Twitter API, which might be more intuitive to some users. For example, you can replace `tweets = tw.Cursor(api.search_tweets,q=search_words,lang="en").items(5)` with something like:

In [71]:
tweets2 = api.search_tweets(q=search_words, lang="en", count=5)  # count is a number of tweets
data2 = []
for tweet2 in tweets2:
    data2.append([tweet2.created_at, tweet2.user.screen_name, tweet2.text, tweet2.user.location])
df2 = pd.DataFrame(data2, columns=columns)
df2

Unnamed: 0,Time,User,Tweet,Location
0,2022-10-20 00:00:59+00:00,WhosFibbing,RT @DaveDavos2: @Bowenchris That will NEVER de...,Everywhere
1,2022-10-19 23:59:12+00:00,SocEntEdinburgh,RT @socialprintandc: For further details on ho...,Edinburgh
2,2022-10-19 23:58:43+00:00,wildbluethistle,Lonely inside https://t.co/XFcwiIFhXa #goodmus...,
3,2022-10-19 23:58:12+00:00,DrowerR,RT @DaveDavos2: @Bowenchris That will NEVER de...,"Melbourne, Victoria"
4,2022-10-19 23:57:48+00:00,grahamtfn,RT @lucycowan83: We're encouraging voluntary o...,"Glasgow, Scotland"


To learn more about the key function `search_tweets()`, check its documentation [here](https://docs.tweepy.org/en/stable/api.html#tweepy.API.search_tweets). Try setting some other parameters yourself to see what you get.
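
One parameter that is relevant to geotagged data is `geocode`, which the tweepy documentation describes as a `"latitude,longitude,radius"` string with the radius given in `mi` or `km`. The helper below (`make_geocode`, a hypothetical name) is a minimal sketch for building such a string; the coordinates are an arbitrary example point (roughly central Bristol, UK) chosen only for illustration.

```python
def make_geocode(latitude, longitude, radius_km):
    """Build a 'latitude,longitude,radius' string with the radius in km."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if radius_km <= 0:
        raise ValueError("radius must be positive")
    return f"{latitude},{longitude},{radius_km}km"


# Arbitrary example point with a 10 km radius.
geocode = make_geocode(51.4545, -2.5879, 10)
print(geocode)

# With the `api` object and `search_words` from above, the call would look like:
# tweets_nearby = api.search_tweets(q=search_words, lang="en", geocode=geocode, count=5)
```

The commented-out call is left inactive because it needs valid credentials and uses up API requests.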

## Part 2: Basic Natural Language Processing and Geoparsing
To extract places (or other categories) from text-based (unstructured) data, we need to do some basic Natural Language Processing (NLP), such as tokenization and Part-of-Speech analysis. All these operations can be done through the library `spaCy`. 

Ideally, you would run the experiment on the tweets you collected in Part 1. However, tweets can be very heterogeneous and noisy, so here we use a clean example (you could also take a long news article from the web) to show how to use `spaCy` and to make sure all the key points are covered in one example.

First, make sure you have installed and imported `spaCy`:

In [82]:
import spacy

`spaCy` comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, parts of speech (POS) tagging, named entity recognition (NER), transforming to word vectors etc.

If you are dealing with a particular language, you can load the spaCy model specific to that language using the `spacy.load()` function. For example, we want to load the English version:

In [83]:
# Load the small English model: https://spacy.io/models
nlp=spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x7fe3f00d2fd0>

This returns a Language object that comes ready with multiple built-in capabilities. 

Now let's say you have your text data in a string. What can be done to understand the structure of the text?

First, call the loaded nlp object on the text. It should return a processed Doc object.

In [84]:
# Parse text through the `nlp` model
my_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""

my_doc = nlp(my_text)
type(my_doc)

spacy.tokens.doc.Doc

Hmmm, it is a Doc object. But wait, what exactly is a Doc object?

It is a sequence of tokens that contains not just the original text but all the results produced by the spaCy model after processing the text. Useful information such as the lemma of the text, whether it is a stop word or not, named entities, the word vector of the text and so on are pre-computed and readily stored in the Doc object.

So first, what is a token? 

As you have learnt from the lecture, tokens are the individual text units that make up the text. Typically a token can be a word, a punctuation mark, a space, etc. Tokenization is the process of converting a text into smaller sub-texts, based on certain predefined rules. For example, sentences are tokenized into words (and optionally punctuation), and paragraphs into sentences, depending on the context.
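
To make "predefined rules" concrete, here is a toy tokenizer built on a single regular expression. This is not how spaCy tokenizes internally (its rules are considerably more elaborate); it is only a sketch of the idea, and `simple_tokenize` is a hypothetical name.

```python
import re

# One rule: a token is either a run of word characters or a single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)


sentence = ("Citizens who had their main investment in the share-market "
            "are facing a great loss.")
tokens = simple_tokenize(sentence)
print(tokens)
```

With this rule, the hyphenated word `share-market` is split into three tokens (`share`, `-`, `market`), and the final full stop becomes a token of its own.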

Each token in `spacy` has different attributes that tell us a great deal of information.

Let’s see the token texts on `my_doc`. The string which the token represents can be accessed through the `token.text` attribute.

In [85]:
# Printing the tokens of a doc
for token in my_doc:
  print(token.text)

The
economic
situation
of
the
country
is
on
edge
,
as
the
stock


market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment


in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off


thousands
of
people
to
reduce
labor
cost


The tokens above include punctuation and common words like “a”, “the”, “of”, etc. Such common words are called stop words; they usually add little to the meaning of your text, so we can clean them up.

Token attributes let us clean out noisy tokens such as stop words, punctuation, and whitespace. First, we show whether each token is a stop word or punctuation, and then we use this information to remove them.

In [86]:
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
  print(token.text,'--',token.is_stop,'---',token.is_punct)

The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False

 -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False

 -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -

In [88]:
# Remove stop words, punctuation, and whitespace tokens
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct and not token.is_space]

for token in my_doc_cleaned:
  print(token.text)

economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
thousands
people
reduce
labor
cost


To get the POS tagging of your text, you use code like:

In [89]:
for token in my_doc_cleaned:
  print(token.text,'---- ',token.pos_)

economic ----  ADJ
situation ----  NOUN
country ----  NOUN
edge ----  NOUN
stock ----  NOUN
market ----  NOUN
crashed ----  VERB
causing ----  VERB
loss ----  NOUN
millions ----  NOUN
Citizens ----  NOUN
main ----  ADJ
investment ----  NOUN
share ----  NOUN
market ----  NOUN
facing ----  VERB
great ----  ADJ
loss ----  NOUN
companies ----  NOUN
lay ----  VERB
thousands ----  NOUN
people ----  NOUN
reduce ----  VERB
labor ----  NOUN
cost ----  NOUN


You will see that each word (token) is now associated with a POS tag: whether it is a noun, an adjective, a verb, and so on. POS tags can often help us disambiguate the meaning of words (or places in GIR).
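
As a small follow-up, the tags can be tallied or used as a filter. The sketch below works on (token, tag) pairs copied from the output above rather than on a live `Doc`, so it runs without loading a model.

```python
from collections import Counter

# (token, POS tag) pairs copied from the output above.
tagged = [
    ("economic", "ADJ"), ("situation", "NOUN"), ("country", "NOUN"),
    ("edge", "NOUN"), ("stock", "NOUN"), ("market", "NOUN"),
    ("crashed", "VERB"), ("causing", "VERB"), ("loss", "NOUN"),
    ("millions", "NOUN"), ("Citizens", "NOUN"), ("main", "ADJ"),
    ("investment", "NOUN"), ("share", "NOUN"), ("market", "NOUN"),
    ("facing", "VERB"), ("great", "ADJ"), ("loss", "NOUN"),
    ("companies", "NOUN"), ("lay", "VERB"), ("thousands", "NOUN"),
    ("people", "NOUN"), ("reduce", "VERB"), ("labor", "NOUN"), ("cost", "NOUN"),
]

# Count how often each tag occurs, and keep only the nouns.
pos_counts = Counter(pos for _, pos in tagged)
nouns = [word for word, pos in tagged if pos == "NOUN"]
print(pos_counts.most_common())
```

On the live objects from this notebook, the equivalent tally would be `Counter(token.pos_ for token in my_doc_cleaned)`.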

Btw, if you don't know what "ADJ" means, you can use code like:

In [90]:
spacy.explain('ADJ')

'adjective'

You can also use `spaCy` to do Named Entity Recognition (including place name identification, or geoparsing). For instance:

In [91]:
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

for entity in doc.ents:
    print(entity.text,'--- ',entity.label_)

Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  NORP


What is "GPE"? 

In [92]:
spacy.explain('GPE')

'Countries, cities, states'
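
For geoparsing we are mostly interested in entities that refer to places. A minimal sketch of that filtering step is shown below. It works on (text, label) pairs copied from the NER output above, so it runs without a model. Treating `GPE`, `LOC` and `FAC` as the "place" labels is an assumption made here for illustration, and `place_names` is a hypothetical helper name.

```python
# Labels treated as "places" in this sketch (an assumption, adjust as needed).
PLACE_LABELS = {"GPE", "LOC", "FAC"}


def place_names(entities, labels=PLACE_LABELS):
    """Return entity texts whose label is in `labels`, keeping their order."""
    return [text for text, label in entities if label in labels]


# (text, label) pairs copied from the NER output above.
entities = [
    ("Tony Stark", "PERSON"), ("StarkEnterprises", "ORG"),
    ("Emily Clark", "PERSON"), ("Microsoft", "ORG"),
    ("Manchester", "GPE"), ("Bible", "WORK_OF_ART"), ("French", "NORP"),
]

places = place_names(entities)
print(places)
```

With the `doc` object from the cell above, the same helper could be called as `place_names((ent.text, ent.label_) for ent in doc.ents)`.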

spaCy also provides a special visualization for NER through `displacy`. Using the `displacy.render()` function, you can set `style='ent'` to visualize the entities.

In [93]:
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

So far, you have learnt the basics of retrieving information from social media like Twitter, as well as basic NLP operations and named entity recognition (geoparsing is part of it). I suggest you play with what you have learnt so far: use new data to experiment with these functions, change their parameters, combine these skills with what you learnt in Tutorial 1 (e.g., geopandas), etc.

Next week, we will try more cool libraries and examples related to geoparsing and geocoding. 

TBC next week ... 