# Overview

### Using this [BERT base multilingual NER model](https://huggingface.co/Davlan/bert-base-multilingual-cased-ner-hrl?text=%E1%80%99%E1%80%BC%E1%80%AD%E1%80%AF%E1%80%84%E1%80%BA%E1%80%90%E1%80%BD%E1%80%84%E1%80%BA+%E1%80%85%E1%80%85%E1%80%BA%E1%80%80%E1%80%B1%E1%80%AC%E1%80%84%E1%80%BA%E1%80%85%E1%80%AE%E1%80%9A%E1%80%AC%E1%80%89%E1%80%BA%E1%80%90%E1%80%94%E1%80%BA%E1%80%B8%E1%80%80%E1%80%AD%E1%80%AF+%E1%80%99%E1%80%AD%E1%80%AF%E1%80%84%E1%80%BA%E1%80%B8%28%E1%81%81%E1%81%81%29%E1%80%9C%E1%80%AF%E1%80%B6%E1%80%B8%E1%80%96%E1%80%BC%E1%80%84%E1%80%B7%E1%80%BA%E1%80%96%E1%80%B1%E1%80%AC%E1%80%80%E1%80%BA%E1%80%81%E1%80%BD%E1%80%B2%E1%80%90%E1%80%AD%E1%80%AF%E1%80%80%E1%80%BA%E1%80%81%E1%80%AD%E1%80%AF%E1%80%80%E1%80%BA%E1%80%9B%E1%80%AC+%E1%80%81%E1%80%BC%E1%80%B1%E1%80%AC%E1%80%80%E1%80%BA%E1%80%A6%E1%80%B8%E1%80%9E%E1%80%B1%E1%81%8A+%E1%80%9E%E1%80%AF%E1%80%B6%E1%80%B8%E1%80%A6%E1%80%B8%E1%80%92%E1%80%8F%E1%80%BA%E1%80%9B%E1%80%AC%E1%80%9B%E1%80%9F%E1%80%AF%E1%80%95%E1%80%BC%E1%80%8A%E1%80%BA%E1%80%9E%E1%80%B0%E1%80%B7%E1%80%80%E1%80%AC%E1%80%80%E1%80%BD%E1%80%9A%E1%80%BA%E1%80%9B%E1%80%B1%E1%80%B8%E1%80%A1%E1%80%96%E1%80%BD%E1%80%B2%E1%80%B7) to identify entities in our landmine tweets.

`bert-base-multilingual-cased-ner-hrl` is a Named Entity Recognition (NER) model for 10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese, and Chinese). It is a `bert-base-multilingual-cased` (mBERT) model fine-tuned on an aggregation of data from those 10 languages, and it recognizes three entity types: location (LOC), organization (ORG), and person (PER).

In [1]:
import pandas as pd
import numpy as np
import re
import os
import datetime as dt

## Get Tweets

In [4]:
from mysql.connector import (connection)
from mysql.connector import Error

In [5]:
# fetch values from environment variables and set the target database
hostname = os.environ['MYSQL_HOST']
username = os.environ['MYSQL_USER']
password = os.environ['MYSQL_PASSWORD']
dbname = 'found'

# establish a connection to the target database (dbname) in your MySQL service
cnx = connection.MySQLConnection(user=username,
                                 password=password,
                                 host=hostname,
                                 database=dbname)
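
Reading `os.environ[...]` directly raises a bare `KeyError` when a variable is unset. The following is a minimal sketch, assuming the same three variable names as above, of collecting the settings with a clearer error message. The helper name `load_db_settings` is hypothetical and not part of the original notebook.

```python
import os

REQUIRED_VARS = ("MYSQL_HOST", "MYSQL_USER", "MYSQL_PASSWORD")

def load_db_settings(env=os.environ, database="found"):
    """Collect connection settings, failing early with a readable message."""
    # Treat unset and empty variables the same way
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["MYSQL_HOST"],
        "user": env["MYSQL_USER"],
        "password": env["MYSQL_PASSWORD"],
        "database": database,
    }
```

The returned dictionary uses the same keyword names as the connection call above, so it could be passed along as `connection.MySQLConnection(**load_db_settings())`.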


In [6]:
# Define the query.
# Adjust the INTERVAL to change the look-back window, e.g. the last 7 days:
# query = ("SELECT * FROM landmine_tweets WHERE date > CURDATE() - INTERVAL 7 day")
query = ("SELECT * FROM landmine_tweets WHERE date > CURDATE() - INTERVAL 2 day")
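
Since the look-back window is changed by editing the query string by hand, a small builder can make that a parameter. This is a sketch only; the function name `build_recent_tweets_query` and its arguments are assumptions introduced for illustration.

```python
def build_recent_tweets_query(days=2, table="landmine_tweets"):
    """Return the SELECT statement for tweets newer than `days` days."""
    days = int(days)  # coerce so that only an integer is ever interpolated
    if days < 1:
        raise ValueError("days must be a positive integer")
    return f"SELECT * FROM {table} WHERE date > CURDATE() - INTERVAL {days} DAY"

# Same window as the query used in this notebook
query = build_recent_tweets_query(days=2)
print(query)
```

Coercing `days` to `int` keeps arbitrary text out of the SQL string; the table name is assumed to come from trusted code rather than user input.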

In [9]:
df = pd.read_sql_query(query, cnx)
df = df.drop_duplicates(subset=['tweet', 'handle'], keep="first")
df.columns = df.columns.str.lower()

## Language detection

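Language detection, performed further below, uses spaCy and reports languages as [ISO 639 codes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). The next cells first load the multilingual NER model and its tokenizer.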

In [11]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")

model = AutoModelForTokenClassification.from_pretrained("Davlan/bert-base-multilingual-cased-ner-hrl")


In [12]:
from transformers import pipeline



In [29]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

### Clean tweets

In [14]:
def clean_tweets(df, tweets):
    """
    Clean a text column and return the cleaned dataframe.
    df = dataframe
    tweets (string) = column name containing tweets
    """
    # work on a copy so a filtered slice can be passed in without a SettingWithCopyWarning
    df = df.copy()

    # lowercase text
    df[tweets] = df[tweets].str.lower()

    # remove URLs
    df[tweets] = df[tweets].map(lambda x: re.sub(r'https?://\S*', ' ', x))

    # remove URL cutoffs (a backslash followed by non-whitespace characters)
    df[tweets] = df[tweets].map(lambda x: re.sub(r'\\\S*', ' ', x))

    # replace newlines with spaces
    df[tweets] = df[tweets].map(lambda x: re.sub(r'\n', ' ', x))

    # remove picture URLs
    df[tweets] = df[tweets].map(lambda x: re.sub(r'pic\.twitter\.com/\S*', ' ', x))

    # remove blog/map type
    df[tweets] = df[tweets].map(lambda x: re.sub(r'blog/maps/info/\S*', ' ', x))

    # remove hashtags
    df[tweets] = df[tweets].map(lambda x: re.sub(r"#\w*", "", x))

    # remove HTML-escaped ampersands
    df[tweets] = df[tweets].map(lambda x: re.sub(r"&amp;", "", x))

    # remove @mentions
    df[tweets] = df[tweets].map(lambda x: re.sub(r"@\w*", "", x))

    # remove single quotation marks
    df[tweets] = df[tweets].map(lambda x: re.sub("'", "", x))

    # replace characters that are not word characters with a space (\w already covers digits)
    df[tweets] = df[tweets].map(lambda x: re.sub(r"[^\w]", " ", x))

    # remove all characters that are not letters
    #df[tweets] = df[tweets].map(lambda x: re.sub("[^a-zA-Z]", " ", x))

    # collapse runs of whitespace into a single space
    df[tweets] = df[tweets].map(lambda x: re.sub(r"\s{2,}", " ", x))

    # drop duplicate rows
    #df.drop_duplicates(subset='text', keep='first', inplace=True)

    return df
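
To see what the cleaning does to a single tweet, the sketch below applies a subset of the same steps to one string using only the standard library. The function name `clean_text` and the sample tweet are made up for illustration; unlike `clean_tweets`, it also strips leading and trailing whitespace.

```python
import re

def clean_text(text):
    """Apply the main cleaning steps to one tweet string."""
    text = text.lower()
    text = re.sub(r"https?://\S*", " ", text)           # full URLs
    text = re.sub(r"pic\.twitter\.com/\S*", " ", text)  # picture links
    text = text.replace("\n", " ")                      # newlines
    text = re.sub(r"[#@]\w*", "", text)                 # hashtags and @mentions
    text = text.replace("&amp;", "")                    # HTML-escaped ampersands
    text = text.replace("'", "")                        # apostrophes
    text = re.sub(r"[^\w]", " ", text)                  # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

# Hypothetical sample tweet
sample = "RT @user: IED found near Kabul!\nDetails: https://t.co/abc #landmine"
print(clean_text(sample))  # rt ied found near kabul details
```

Note that lowercasing turns the retweet marker `RT` into `rt`, which matters for any later filter that compares against the uppercase form.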

## Detect language

In [15]:
import spacy
from spacy.language import Language
from spacy_language_detection import LanguageDetector

In [16]:
import en_core_web_sm
# use a separate name so the `nlp` NER pipeline created earlier is not overwritten
spacy_nlp = en_core_web_sm.load()

In [17]:
def get_lang_detector(nlp, name):
    return LanguageDetector(seed=42)  # We use the seed 42


In [18]:
nlp_model = spacy.load("en_core_web_sm")
Language.factory("language_detector", func=get_lang_detector)
nlp_model.add_pipe('language_detector', last=True)

<spacy_language_detection.spacy_language_detector.LanguageDetector at 0x7f6d43c262b0>

In [19]:
def get_language(text):
    # return the language code detected for the given text
    return nlp_model(text)._.language['language']

Create language column

In [20]:
df["language"]= df["tweet"].apply(get_language)
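
Language detection on short tweets can be unreliable; in the `non_eng.head()` output further down, for example, a Korean tweet is labelled `de`. The sketch below assumes the detector's result dictionary also carries a `score` key next to `language`, and falls back to a placeholder when that score is low. The helper `pick_language`, the threshold of 0.8, and the sample dictionaries are all hypothetical.

```python
def pick_language(detection, min_score=0.8, fallback="unknown"):
    """Return the detected code, or `fallback` when confidence is low."""
    if detection.get("score", 0.0) < min_score:
        return fallback
    return detection["language"]

# Hypothetical detector outputs, shaped like the dict read in get_language
examples = [
    {"language": "fr", "score": 0.99},
    {"language": "de", "score": 0.41},
]
print([pick_language(d) for d in examples])  # ['fr', 'unknown']
```

Rows marked with the fallback value could then be reviewed separately instead of being routed to either NER model.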

In [21]:
df.to_csv("landmine_tweets.csv", index = False)

### Create a new table for non-English tweets

In [22]:
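# copy the filtered rows so that later column assignments do not raise SettingWithCopyWarning
non_eng = df[df["language"] != "en"].copy()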

In [26]:
non_eng.to_csv("landmine_tweets_non_eng_processed.csv", index = False)

In [23]:
# copy the filtered rows so that cleaning does not write into a slice of df
en = df[df["language"] == "en"].copy()

In [24]:
en =  clean_tweets(en, 'tweet')


In [25]:
en.to_csv("landmine_tweets_eng_processed.csv", index = False)

### Run the model on non-English tweets

In [30]:
non_eng["ner"] = non_eng["tweet"].apply(nlp)



In [31]:
non_eng.head()

| | id | query_term | handle | tweet | date | language | ner |
|---|---|---|---|---|---|---|---|
| 31 | 1635587352990040064 | ied AND bomb | Borjr3 | @abu_mburu @BazengaaKE @TransportKE @KURAroads... | 2023-03-14 | id | [{'entity': 'B-LOC', 'score': 0.60220397, 'ind... |
| 48 | 1635679803549405187 | ied AND explosive | geo_politie | Alerte info \n\nUne voiture piégée a explosé à... | 2023-03-14 | fr | [{'entity': 'B-LOC', 'score': 0.58652264, 'ind... |
| 49 | 1635678429529137152 | ied AND explosive | newsapius | 간헐적 폭발 장애(Intermittent Explosive Disorder, IED... | 2023-03-14 | de | [] |
| 431 | 1636512342358859776 | ied AND bomb | ddarek75 | RT @JarekKociszewsk: IED w północnym Izraelu n... | 2023-03-16 | pl | [{'entity': 'B-ORG', 'score': 0.99467397, 'ind... |
| 571 | 1636156623461707779 | ied AND explosive | anas_philippe | RT @geo_politie: Alerte info \n\nUne voiture p... | 2023-03-16 | fr | [{'entity': 'B-ORG', 'score': 0.96765924, 'ind... |


In [32]:
ner_records = []
for row_dict in non_eng.to_dict(orient="records"):
    for ner_tag_dict in nlp(row_dict["tweet"]):  # the pipeline returns a list of per-token tag dicts
        ner_records.append({**row_dict, **ner_tag_dict})
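
The loop above turns one row per tweet into one row per tagged token by merging the tweet's fields with each tag dictionary. The sketch below shows that pattern in isolation; `flatten_ner` and `fake_tagger` are hypothetical stand-ins, with `fake_tagger` replacing the real pipeline so the example runs without downloading a model.

```python
def flatten_ner(rows, tagger):
    """Return one merged record per (tweet, tagged token) pair."""
    records = []
    for row in rows:
        for tag in tagger(row["tweet"]):
            # Keys from `tag` win if both dicts share a key
            records.append({**row, **tag})
    return records

def fake_tagger(text):
    """Stand-in for the NER pipeline: tags the token 'Paris' only."""
    return [
        {"entity": "B-LOC", "score": 0.99, "index": i + 1, "word": w}
        for i, w in enumerate(text.split()) if w == "Paris"
    ]

rows = [
    {"id": 1, "tweet": "Explosion à Paris"},
    {"id": 2, "tweet": "no entities here"},
]
records = flatten_ner(rows, fake_tagger)
print(records)
```

Tweets for which the tagger returns an empty list contribute no records, which is why rows with `[]` in the `ner` column do not appear in the final results table.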

In [33]:
def create_ner_tag_df_multi(records_list, model_type):
    # build the long-format NER table: one row per tagged token
    df = pd.DataFrame(records_list)
    # drop retweet markers; copy so the assignments below do not write into a slice
    df = df[df['word'] != 'RT'].copy()
    df['sentence_number'] = 1
    df['ner_model'] = model_type
    df = df.rename(columns={'id': 'document_id',
                            'tweet': 'sentence_text',
                            'word': 'tagged_text',
                            'index': 'tagged_text_loci',
                            'score': 'probability',
                            'entity': 'ner_tag'})
    df = df[['document_id', 'sentence_number', 'sentence_text', 'tagged_text', 'ner_tag', 'tagged_text_loci', 'ner_model', 'probability']].copy()
    # store identifiers and positions as strings
    df['document_id'] = df['document_id'].astype(str)
    df['tagged_text_loci'] = df['tagged_text_loci'].astype(str)
    return df
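
Because the multilingual pipeline returns one dictionary per tagged token, multi-token names and WordPiece fragments (words starting with `##`) end up on separate rows. The following sketch, using hypothetical sample tokens, shows one way to group adjacent `B-`/`I-` tokens back into whole entities. The helper `merge_bio` is not part of the notebook; the transformers pipeline also offers an `aggregation_strategy` argument for this purpose.

```python
def merge_bio(tokens):
    """Group adjacent B-/I- tokens of the same type into entity strings."""
    entities = []
    current = None      # [entity_type, text] of the entity being built
    last_index = None   # position of the previous tagged token
    for tok in tokens:
        prefix, _, etype = tok["entity"].partition("-")
        word, index = tok["word"], tok["index"]
        adjacent = last_index is not None and index == last_index + 1
        if current and adjacent and word.startswith("##"):
            current[1] += word[2:]            # glue a WordPiece continuation
        elif current and adjacent and prefix == "I" and current[0] == etype:
            current[1] += " " + word          # next word of the same entity
        else:
            if current:
                entities.append(tuple(current))
            current = [etype, word]           # start a new entity
        last_index = index
    if current:
        entities.append(tuple(current))
    return entities

# Hypothetical token-level output
tokens = [
    {"entity": "B-PER", "index": 1, "word": "Angela"},
    {"entity": "I-PER", "index": 2, "word": "Mer"},
    {"entity": "I-PER", "index": 3, "word": "##kel"},
    {"entity": "B-LOC", "index": 6, "word": "Berlin"},
]
print(merge_bio(tokens))  # [('PER', 'Angela Merkel'), ('LOC', 'Berlin')]
```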

In [34]:
ner_df = create_ner_tag_df_multi(ner_records, "Davlan/bert-base-multilingual-cased-ner-hrl")

In [35]:
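def isEnglish(s):
    # True when the string contains only ASCII characters (a rough proxy for English)
    return s.isascii()

In [36]:
ner_df["language"] = ner_df["tagged_text"].apply(isEnglish)

In [37]:
# keep only tagged tokens that contain non-ASCII characters, then drop the helper column
ner_df = ner_df[~ner_df["language"]].drop(columns="language")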
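
The ASCII check is only a rough proxy for "English": it also discards non-English tokens that happen to use only ASCII letters. A short sketch with made-up tokens illustrates the effect; `is_ascii_only` is a hypothetical name that mirrors `isEnglish`.

```python
def is_ascii_only(text):
    """Mirror of isEnglish: True if every character is ASCII."""
    return text.isascii()

# Hypothetical tagged tokens
tokens = ["IED", "Izraelu", "północnym", "##ée", "北京"]
kept = [t for t in tokens if not is_ascii_only(t)]
print(kept)  # ['północnym', '##ée', '北京']
```

Here the Polish word `Izraelu` is dropped along with `IED`, because neither contains a non-ASCII character.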

In [38]:
ner_df.to_csv("landmine_tweets_ner_results_non_eng.csv", index = False)

## Run model for English

In [39]:
from tner import TransformersNER

In [40]:
model = TransformersNER("tner/bert-large-tweetner7-2020")

2023-03-21 19:54:36 INFO     initialize language model with `tner/bert-large-tweetner7-2020`
2023-03-21 19:54:43 INFO     use CRF
2023-03-21 19:54:43 INFO     loading pre-trained CRF layer
2023-03-21 19:54:43 INFO     label2id: {'B-corporation': 0, 'B-creative_work': 1, 'B-event': 2, 'B-group': 3, 'B-location': 4, 'B-person': 5, 'B-product': 6, 'I-corporation': 7, 'I-creative_work': 8, 'I-event': 9, 'I-group': 10, 'I-location': 11, 'I-person': 12, 'I-product': 13, 'O': 14}
2023-03-21 19:54:43 INFO     device   : cpu
2023-03-21 19:54:43 INFO     gpus     : 0


Create a function to parse the NER output dictionaries

In [41]:
def create_dict(i, model_dict):
    results_dict = {
        'tagged_text_loci': i,
        'prediction': model_dict['prediction'][0][i],
        'probability': model_dict['probability'][0][i],
        'input': model_dict['input'][0][i],
    }
    return results_dict

In [42]:
tner_records = []
for row_dict in en.to_dict(orient="records"):
    # predict() takes a list of texts; pass one tweet and read the first result
    ner_tag_dict = model.predict([row_dict["tweet"]])
    for i in range(len(ner_tag_dict['prediction'][0])):
        results_dict = create_dict(i, ner_tag_dict)
        tner_records.append({**row_dict, **results_dict})
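
The English model's result is read as three parallel lists (`input`, `prediction`, `probability`), each indexed first by sentence and then by token. The sketch below shows an equivalent way to walk those lists with `zip`. The function `token_records` and the `fake_output` dictionary are hypothetical and only mimic the keys used in `create_dict`.

```python
def token_records(model_dict, sentence=0):
    """Return one dict per token for the given sentence of a batch result."""
    columns = zip(
        model_dict["input"][sentence],
        model_dict["prediction"][sentence],
        model_dict["probability"][sentence],
    )
    return [
        {"tagged_text_loci": i, "input": tok, "prediction": tag, "probability": prob}
        for i, (tok, tag, prob) in enumerate(columns)
    ]

# Hypothetical result for one tweet, shaped like the keys used in create_dict
fake_output = {
    "input": [["bomb", "in", "kabul"]],
    "prediction": [["B-event", "O", "B-location"]],
    "probability": [[0.68, 0.99, 0.97]],
}
records = token_records(fake_output)
print(records)
```

Every token gets a record here, including those tagged `O`; the filtering of untagged tokens happens later when the results table is built.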

2023-03-21 19:54:52 INFO     encode all the data: 1
100%|██████████| 1/1 [00:00<00:00,  2.83it/s]

In [43]:
stop_words = ["i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their","theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was","were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and","but","if","or","because","as","until","while","of","at","by","for","with","about","against","between","into","through","during","before","after","above","below","to","from","up","down","in","out","on","off","over","under","again","further","then","once","here","there","when","where","why","how","all","any","both","each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very","s","t","can","will","just","don","should","now", "ied", "IED"]

def create_ner_tag_df_en(records_list, model_type):
    # build the long-format NER table: one row per tagged token
    df = pd.DataFrame(records_list)
    # drop retweet markers; tweets were lowercased during cleaning, so compare case-insensitively
    df = df[df['input'].str.lower() != 'rt']
    # keep only tokens that received an entity tag; copy before assigning new columns
    df = df[df['prediction'] != 'O'].copy()
    df['sentence_number'] = 1
    df['ner_model'] = model_type
    df = df.rename(columns={'id': 'document_id',
                            'tweet': 'sentence_text',
                            'input': 'tagged_text',
                            'prediction': 'ner_tag'})
    df = df[['document_id', 'sentence_number', 'sentence_text', 'tagged_text', 'ner_tag', 'tagged_text_loci', 'ner_model', 'probability']]
    # drop tagged tokens that are stop words
    df = df[~df['tagged_text'].isin(stop_words)].copy()
    df['document_id'] = df['document_id'].astype(str)
    df['tagged_text_loci'] = df['tagged_text_loci'].astype(str)
    return df

In [44]:
tner_df = create_ner_tag_df_en(tner_records, "tner/bert-large-tweetner7-2020")

In [45]:
tner_df

| | document_id | sentence_number | sentence_text | tagged_text | ner_tag | tagged_text_loci | ner_model | probability |
|---|---|---|---|---|---|---|---|---|
| 2 | 1635764993353195526 | 1 | rt an israeli motorist was seriously wounded b... | israeli | B-person | 2 | tner/bert-large-tweetner7-2020 | 0.996506 |
| 9 | 1635764993353195526 | 1 | rt an israeli motorist was seriously wounded b... | bomb | B-event | 9 | tner/bert-large-tweetner7-2020 | 0.679351 |
| 12 | 1635764993353195526 | 1 | rt an israeli motorist was seriously wounded b... | north | B-location | 12 | tner/bert-large-tweetner7-2020 | 0.999964 |
| 25 | 1635742122350444550 | 1 | rt an israeli motorist was seriously wounded b... | israeli | B-person | 2 | tner/bert-large-tweetner7-2020 | 0.996506 |
| 32 | 1635742122350444550 | 1 | rt an israeli motorist was seriously wounded b... | bomb | B-event | 9 | tner/bert-large-tweetner7-2020 | 0.679351 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32979 | 1637894168637853727 | 1 | rt on june 4 2009 captain meraj and his qrf we... | captain | B-person | 5 | tner/bert-large-tweetner7-2020 | 0.999683 |
| 32980 | 1637894168637853727 | 1 | rt on june 4 2009 captain meraj and his qrf we... | meraj | I-person | 6 | tner/bert-large-tweetner7-2020 | 0.999725 |
| 32989 | 1637894168637853727 | 1 | rt on june 4 2009 captain meraj and his qrf we... | police | B-corporation | 15 | tner/bert-large-tweetner7-2020 | 0.874685 |
| 32990 | 1637894168637853727 | 1 | rt on june 4 2009 captain meraj and his qrf we... | transport | I-corporation | 16 | tner/bert-large-tweetner7-2020 | 0.573862 |


In [46]:
tner_df.to_csv("landmine_tweets_ner_results_eng.csv", index = False)