# Annotating tweets for location extraction and geocoding

This notebook is used to annotate tweets so that F-score statistics can be computed for location extraction.

## Dataset structure

The tweet dataset is not filtered for a particular topic; it covers a range of topics, and the tweets vary in length. All tweets are in English.

**For the F-score test:**  
- Determine if a tweet contains a location or not.
- If unsure, label the tweet in the class you are more certain of. 

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import json
import os


## Loading Data

Set the `annotator_name` variable to your annotator ID.

In [3]:
# pandas and numpy are already imported in the first cell
# Read the first 300 rows of the entity-annotated tweets and store them as JSON
df = pd.read_csv('df_location_entities1.csv', nrows=300)
df.to_json('ss_california_tweets.json')
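
The JSON round trip used here can be sketched on a toy frame. The column values and the file name `toy_tweets.json` below are made up for illustration. The point of the sketch is that `to_json` with its default orientation, followed by `json.load`, returns the row index as strings and missing values as `None`, which `pd.isna` still recognises.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

# Toy data standing in for the tweet table (values are illustrative only)
toy = pd.DataFrame({
    'text': ['tweet a', 'tweet b', 'tweet c'],
    'place': ['Oakland; CA', np.nan, 'San Jose; CA'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_tweets.json')  # hypothetical file name
    toy.to_json(path)                            # default layout: {column: {index: value}}
    with open(path) as f:
        raw = json.load(f)

restored = pd.DataFrame(raw)

print(list(restored.index))               # index labels come back as strings
print(restored['place'].isna().tolist())  # NaN was written as null and reads back as missing
```

Because the index labels are strings after reloading, code that looks rows up by integer label would need to convert them first.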

In [96]:
df.replace(np.nan, '', regex=True, inplace=True)

In [97]:
dataset_file_names = ('ss_california_tweets.json','ss_california_tweets.json')

# Remember to replace annotator_name with your own name
annotator_name = 'Rufai_Vitoria_FS'

for fn in dataset_file_names:
    print(fn)
    with open(fn) as f:
        df = pd.DataFrame(json.load(f))
    display(df.head()[['text', 'clean_text']])

ss_california_tweets.json


Unnamed: 0,text,clean_text
0,"I'm at My Home Gym in Pacifica, CA https://t....","Im at My Home Gym in Pacifica, CA"
1,_styledbym.e killed it with this #shadowroot #...,styledbym.e killed it with this shadowroot col...
2,Primigi Classic loafers for your boy or girl. ...,Primigi Classic loafers for your boy or girl. ...
3,Warriors single game tickets go on sale at 10...,Warriors single game tickets go on sale at 10...
4,I'm at Hardly Strictly Bluegrass in San Franc...,Im at Hardly Strictly Bluegrass in San Franci...


ss_california_tweets.json


Unnamed: 0,text,clean_text
0,"I'm at My Home Gym in Pacifica, CA https://t....","Im at My Home Gym in Pacifica, CA"
1,_styledbym.e killed it with this #shadowroot #...,styledbym.e killed it with this shadowroot col...
2,Primigi Classic loafers for your boy or girl. ...,Primigi Classic loafers for your boy or girl. ...
3,Warriors single game tickets go on sale at 10...,Warriors single game tickets go on sale at 10...
4,I'm at Hardly Strictly Bluegrass in San Franc...,Im at Hardly Strictly Bluegrass in San Franci...


In [98]:
df

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t....",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA","[Pacifica, CA]",[],[],[]
1,1,_styledbym.e killed it with this #shadowroot #...,,-121.989751,38.355840,styledbym.e killed it with this shadowroot col...,[],[],[],[]
2,2,Primigi Classic loafers for your boy or girl. ...,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. ...,[],[],[Primigi],[]
3,3,Warriors single game tickets go on sale at 10...,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10...,[],[],[],[]
4,4,I'm at Hardly Strictly Bluegrass in San Franc...,San Francisco; CA,-122.489542,37.771727,Im at Hardly Strictly Bluegrass in San Franci...,"[San Francisco, CA]",[],[],[]
...,...,...,...,...,...,...,...,...,...,...
295,308,Next level. Wonder what's on that boat. @ Ocea...,"San Francisco, CA",-122.511657,37.774898,Next level. Wonder whats on that boat. at Oce...,[],[Ocean Beach Parking Lot],[],[]
296,309,Bright yellow with textured flower pattern #19...,,-122.266190,37.806500,Bright yellow with textured flower pattern 196...,[],[],[],[]
297,310,“@HoodieAllen: The difference a year makes #tr...,7d62cffe6f98f349,-121.882252,37.328329,at HoodieAllen: The difference a year makes tr...,[],[],[],[]
298,311,Drinking a Pumking by @stbcbeer at @beerrevol...,Oakland; CA,-122.276000,37.797100,Drinking a Pumking by at stbcbeer at at beerr...,[],[],[],[]


## Annotation

### Helper function

This function loads the data (resuming from a partially annotated .json file if one exists) and saves it after every annotation.

This means that annotation can be picked up again whenever desired. Intermediate and final results are saved under the original filename with `_annotated` and the annotator name appended.

Only the specified labels (`0` and `1` by default) are accepted as input; `p` prints a progress bar, `q` quits, and any other key shows the help text.
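
The save-and-resume behaviour described above can be illustrated without any interactive input. This is a minimal sketch under stated assumptions, not the helper used in this notebook: the function `annotate`, the key `label_demo`, the toy tweets, and the file name are all hypothetical, and scripted answers replace `input()`.

```python
import json
import os
import tempfile


def annotate(records, path, answers, label_key='label_demo'):
    """Label unlabelled records from an iterator of answers, saving after each one."""
    # Resume from the partially annotated file when it exists
    if os.path.isfile(path):
        with open(path) as f:
            records = json.load(f)
    for record in records:
        if record.get(label_key) is not None:
            continue  # already labelled in an earlier session
        answer = next(answers, None)
        if answer is None:
            break  # no more input: stands in for the annotator quitting
        record[label_key] = answer
        with open(path, 'w') as f:
            json.dump(records, f)  # persist progress after every label
    return records


tweets = [{'text': 'toy tweet 1'}, {'text': 'toy tweet 2'}, {'text': 'toy tweet 3'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_annotated.json')
    first = annotate(tweets, path, iter(['1']))        # session 1: one label, then stop
    second = annotate(tweets, path, iter(['0', '1']))  # session 2: continues with the second tweet

print([r.get('label_demo') for r in second])
```

The second call never asks about the first tweet again, because its label is read back from the saved file.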

In [99]:

def annotate_tweet_df(fn, possible_labels=('0', '1')):
    def process_input(user_input):
        if user_input in possible_labels:
            return user_input
        elif user_input.startswith('p'):
            progressbar(compute_annotation_progress(), max_num=len(df))
            vc = df[label_column_name].value_counts()
            print('labels\t',  ', '.join([str(k)+': ' + str(v) for k,v in zip(vc.keys(), vc.values)]))
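        elif user_input.startswith('q'):
            # Stop annotating; progress has already been saved after every label
            raise KeyboardInterrupt('Annotation paused by user')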
        else:
            print(help_text)

        return process_input(input('\t'))

    def compute_annotation_progress():
        if label_column_name not in df.keys():
            return 0
        return len(df) - df[label_column_name].isna().sum()

    def progressbar(it, max_num, size=60):
        finished = int(round((it / max_num * size))) if it > 0 else 0
        rest = size - finished
        print('[' + finished * '|' + rest * '.' + ']\t', it, '/', max_num)

    help_text = '\n'.join(['Possible Commands', str(possible_labels) + '\tpossible labels',
                           'h\tshow this help', 'p\tshow progress', 'q\tquit', ''])

    label_column_name = 'label_' + annotator_name
    annotated_df_fn = fn.split('.json')[0] + '_annotated' + annotator_name + '.json'

    if os.path.isfile(annotated_df_fn):
        print(annotated_df_fn, 'already exists, continuing previous annotation process')
        with open(annotated_df_fn) as f:
            df = pd.DataFrame(json.load(f))
    else:
        with open(fn) as f:
            df = pd.DataFrame(json.load(f))

    nb_annotated_tweets = compute_annotation_progress()
    if label_column_name in df.keys():
        print('Labels from', annotator_name, 'already in data!')
        if compute_annotation_progress() < len(df):
            print('Continuing annotation,', nb_annotated_tweets, 'of', len(df), 'already annotated')
        else:
            return
    else:
        df[label_column_name] = np.nan

    print(help_text)
    print('Starting annotation for', len(df) - nb_annotated_tweets, 'tweets:')
    for index, row in df.iterrows():
        if not pd.isna(row[label_column_name]):
            continue
        print(row.text)
        label = process_input(input('\t'))
        if label is not None:
            df.loc[index, label_column_name] = label
        df.to_json(annotated_df_fn)

    print('Finished!\nSaved results as', annotated_df_fn, '\n')
    
    


## Annotation Task

Please refer to the annotation guide file for annotation examples. If something is not clear, feel free to ask.  

IMPORTANT:   
- For the **F-score**, we are only interested in the presence or absence of a location within the tweet. The context in which the location is mentioned is not important.

The label is either `1` if the tweet has a location or `0` otherwise.  

For more information see the annotation guide.

#### To start annotating, run the cell below.  
Press q to pause the annotation (the red error is intended behaviour).  
Press p to show your progress.  
Press h to see all possible commands.

In [101]:
list(map(annotate_tweet_df, dataset_file_names))

ss_california_tweets_annotatedRufai_Vitoria_FS.json already exists, continuing previous annotation process
Labels from Rufai_Vitoria_FS already in data!
ss_california_tweets_annotatedRufai_Vitoria_FS.json already exists, continuing previous annotation process
Labels from Rufai_Vitoria_FS already in data!


[None, None]


In [103]:
with open('ss_california_tweets_annotatedRufai_Vitoria_FS.json', 'r') as datafile:
    data = json.load(datafile)
    df2 = pd.DataFrame(data)

In [104]:
df2

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC,label_Rufai_Vitoria_FS
0,0,"I'm at My Home Gym in Pacifica, CA https://t....",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA","[Pacifica, CA]",[],[],[],1
1,1,_styledbym.e killed it with this #shadowroot #...,,-121.989751,38.355840,styledbym.e killed it with this shadowroot col...,[],[],[],[],0
2,2,Primigi Classic loafers for your boy or girl. ...,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. ...,[],[],[Primigi],[],0
3,3,Warriors single game tickets go on sale at 10...,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10...,[],[],[],[],0
4,4,I'm at Hardly Strictly Bluegrass in San Franc...,San Francisco; CA,-122.489542,37.771727,Im at Hardly Strictly Bluegrass in San Franci...,"[San Francisco, CA]",[],[],[],1
...,...,...,...,...,...,...,...,...,...,...,...
295,308,Next level. Wonder what's on that boat. @ Ocea...,"San Francisco, CA",-122.511657,37.774898,Next level. Wonder whats on that boat. at Oce...,[],[Ocean Beach Parking Lot],[],[],1
296,309,Bright yellow with textured flower pattern #19...,,-122.266190,37.806500,Bright yellow with textured flower pattern 196...,[],[],[],[],0
297,310,“@HoodieAllen: The difference a year makes #tr...,7d62cffe6f98f349,-121.882252,37.328329,at HoodieAllen: The difference a year makes tr...,[],[],[],[],0
298,311,Drinking a Pumking by @stbcbeer at @beerrevol...,Oakland; CA,-122.276000,37.797100,Drinking a Pumking by at stbcbeer at at beerr...,[],[],[],[],1


In [105]:
len(df2)

300

In [106]:
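def parse_values(x):
    # The entity columns hold lists stored as strings; an empty list is '[]'
    # (length 2), so any longer string contains at least one entity.
    if len(x) > 2:
        return 1
    else:
        return 0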

In [107]:
df2['GPE'] = df2['GPE'].apply(parse_values)
df2['FAC'] = df2['FAC'].apply(parse_values)
df2['ORG'] = df2['ORG'].apply(parse_values)
df2['LOC'] = df2['LOC'].apply(parse_values)
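
The length check above relies on each entity cell being a string such as `[]` or `[Primigi]`. As a hedged alternative, the sketch below strips the brackets and tests whether anything is left, which also copes with extra whitespace or an empty string. The helper name `has_entity` and the sample cells are assumptions made for illustration.

```python
def has_entity(cell):
    """Return 1 if a stringified list such as '[Pacifica, CA]' is non-empty, else 0."""
    # Remove surrounding whitespace and bracket characters, then check what remains
    inner = str(cell).strip().strip('[]').strip()
    return int(bool(inner))


samples = ['[Pacifica, CA]', '[]', '[Primigi]', '']
flags = [has_entity(cell) for cell in samples]
print(flags)
```

It could be applied in the same way as `parse_values`, for example `df2['GPE'].apply(has_entity)`.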

In [108]:
df2

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC,label_Rufai_Vitoria_FS
0,0,"I'm at My Home Gym in Pacifica, CA https://t....",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA",1,0,0,0,1
1,1,_styledbym.e killed it with this #shadowroot #...,,-121.989751,38.355840,styledbym.e killed it with this shadowroot col...,0,0,0,0,0
2,2,Primigi Classic loafers for your boy or girl. ...,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. ...,0,0,1,0,0
3,3,Warriors single game tickets go on sale at 10...,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10...,0,0,0,0,0
4,4,I'm at Hardly Strictly Bluegrass in San Franc...,San Francisco; CA,-122.489542,37.771727,Im at Hardly Strictly Bluegrass in San Franci...,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
295,308,Next level. Wonder what's on that boat. @ Ocea...,"San Francisco, CA",-122.511657,37.774898,Next level. Wonder whats on that boat. at Oce...,0,1,0,0,1
296,309,Bright yellow with textured flower pattern #19...,,-122.266190,37.806500,Bright yellow with textured flower pattern 196...,0,0,0,0,0
297,310,“@HoodieAllen: The difference a year makes #tr...,7d62cffe6f98f349,-121.882252,37.328329,at HoodieAllen: The difference a year makes tr...,0,0,0,0,0
298,311,Drinking a Pumking by @stbcbeer at @beerrevol...,Oakland; CA,-122.276000,37.797100,Drinking a Pumking by at stbcbeer at at beerr...,0,0,0,0,1


In [163]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import f1_score

# Features: binary entity indicators; target: the annotator's label
x = df2[['GPE','FAC','ORG','LOC']]
Y = df2['label_Rufai_Vitoria_FS']

# Split x and Y into 80% training data and 20% test data
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, test_size = 0.2, random_state=42)

# Create the classifier object
we = DecisionTreeClassifier(random_state=42)

# Train the classifier using the training data
we = we.fit(x_train,Y_train)

# Predict labels for the test data
predictions = we.predict(x_test)

# Calculate the accuracy score
scoree = accuracy_score(Y_test,predictions)

# Calculate the macro-averaged F1 score
# (stored as f1_macro so the imported f1_score function is not overwritten)
f1_macro = f1_score(Y_test, predictions, average='macro')

# Calculate F-beta scores: beta = 0 reduces to precision and beta = 1 is the F1 score.
# Recall itself is not computed in this cell.
f_score_precision = fbeta_score(Y_test, predictions, average='macro', beta=0)
f_score_beta1 = fbeta_score(Y_test, predictions, average='macro', beta=1)

In [164]:
f_score = [[f_score_precision, f_score_beta1, f1_macro]]

In [165]:
f_df = pd.DataFrame(f_score, columns=['precision', 'fbeta_1', 'f1_score'])


In [166]:
f_df

Unnamed: 0,precision,fbeta_1,f1_score
0,0.912709,0.89011,0.89011
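
How the `beta` parameter shifts the F-beta score between precision and recall can be checked by hand. The sketch below is a plain-Python illustration for a single positive class, with made-up labels; the function name `precision_recall_fbeta` is hypothetical, and it does not reproduce the macro averaging used in the cell above.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0, positive=1):
    """Compute precision, recall and F-beta for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Toy labels (illustrative only): 2 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_fbeta(y_true, y_pred, beta=1)
_, _, f_beta0 = precision_recall_fbeta(y_true, y_pred, beta=0)
_, _, f_beta_large = precision_recall_fbeta(y_true, y_pred, beta=1000)

print(precision, recall, f1)
print(f_beta0, f_beta_large)
```

With `beta=0` the score equals precision, with `beta=1` it is the harmonic mean of precision and recall, and for large `beta` it approaches recall.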
