# Text Preparation before NER Annotations

NER annotation involves manually annotating a sample of tweets with the following NER tags:
- PER
- ADDR
- CITY

so that a supervised learning model can be trained for classification.
Unlike the previous task, where each tweet received a single label, here I annotate individual tokens within a tweet with different tags.
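
To make the difference between tweet-level and token-level labels concrete, here is a minimal sketch. It assumes an annotated record that stores character-offset spans as `[start, end, tag]` under a `label` key; that layout, the sample sentence, and the helper name `spans_to_token_tags` are illustrative assumptions, not taken from the actual export.

```python
import re

# Hypothetical annotated record (made-up text); spans are [start, end, tag]
# with character offsets, end exclusive.
record = {
    "text": "Ayşe Yılmaz Örnek Mahallesi Kahramanmaraş enkaz altında",
    "label": [[0, 11, "PER"], [12, 27, "ADDR"], [28, 41, "CITY"]],
}

def spans_to_token_tags(text, spans):
    """Give each whitespace-delimited token the tag of the span that fully covers it, else 'O'."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: token is outside every annotated span
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                tag = label
                break
        tagged.append((match.group(), tag))
    return tagged

token_tags = spans_to_token_tags(record["text"], record["label"])
for token, tag in token_tags:
    print(f"{token}\t{tag}")
```

A single tweet can therefore carry several tags at once, which is why this task needs a different annotation setup than the one-label-per-tweet classification task.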

I only discovered the pre-annotation requirements for the text classification and NER tasks along the way, in fact after I had finished manually annotating the 10,000 sample tweets in Doccano. I therefore needed a second preparation step before annotating the tweets with NER tags.

As you can see in sampling_for_classification.ipynb, I did not perform any text cleaning or preparation before annotating the data for emergency calls. This decision caused problems when I was about to perform the second annotation with NER tags. In particular, I realized that many users abbreviate address terms and even city names. For example:
   - cd, cad, cd. instead of "caddesi" (comparable to st. or str. for "street")
   - mh., mah, mah. instead of "mahallesi" (neighborhood)
   - Kahramanmaraş, the epicenter city, is referred to in several different ways: maraş, kmaraş, k.maraş, kahramanmaraş...

I decided to standardize these abbreviations as far as possible, since address detection is a very important part of the app.
    
## Dataset
- The sample I used for NER annotation consists of the tweets I annotated as emergency calls in the previous task (tweet classification) on Doccano.
- Note that I initially designed 3 labels:
    - "Rescue_call": tweets about people who are still under the rubble and waiting for help
    - "Urgent_need": tweets about urgent food, clothing, fuel or shelter needs of people who are on the streets
    - "Other"

    However, due to the low occurrence of the "Urgent_need" category in the dataset, I decided to merge the first two categories under the label emergency_call.
    
### Note:
- I performed these modifications during the second task (NER) because I had already finished the annotations for the first task, and re-annotating on Doccano would have taken too long. The final model, however, applies these modifications at the very first stage. See app.py for details.

In [9]:
import pandas as pd
import json
import re

In [4]:
with open('earthquake10K.json', 'r') as file:
    data = json.load(file)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,id,text,index,label,Comments
0,46001,Arama Kurtarma ekipleri heryere yetişemiyor kı...,108332,[Other],[]
1,46002,"Marketleri, dükkanları, ölmüş insanları yağmal...",66352,[Other],[]
2,46003,Arkadaşlar böyle bir uygulama varmış. İlaçları...,32462,[Other],[]
3,46004,Adıyamanda destek yok. Çok fazla bina yıkıldı ...,84613,[Other],[]
4,46005,"Turunçlu mahallesi samandag yolu uzeri, saray ...",22536,[Rescue_call],[]


In [5]:
# Each label is stored as a one-element list (e.g. ['Other']); keep only the string.
df['label'] = df['label'].str[0]

df.head()

Unnamed: 0,id,text,index,label,Comments
0,46001,Arama Kurtarma ekipleri heryere yetişemiyor kı...,108332,Other,[]
1,46002,"Marketleri, dükkanları, ölmüş insanları yağmal...",66352,Other,[]
2,46003,Arkadaşlar böyle bir uygulama varmış. İlaçları...,32462,Other,[]
3,46004,Adıyamanda destek yok. Çok fazla bina yıkıldı ...,84613,Other,[]
4,46005,"Turunçlu mahallesi samandag yolu uzeri, saray ...",22536,Rescue_call,[]


In [7]:
def merge_urgents(x):
    # Collapse the two original emergency categories into a single label.
    if x == 'Urgent_need' or x == 'Rescue_call':
        return 'emergency_call'
    else:
        return x

df['label'] = df['label'].apply(merge_urgents)
set(df['label'])

{'Other', 'emergency_call'}

In [8]:
help_data = df[df['label'] == 'emergency_call'].reset_index(drop=False)

print("There are", len(help_data), "tweets annotated as emergency call. This corresponds to", 
      len(help_data)/10000 * 100, "% of the sample.")

There are 2304 tweets annotated as emergency call. This corresponds to 23.04 % of the sample.


In [113]:
help_data.head()

Unnamed: 0,level_0,id,text,index,label,Comments
0,4,46005,"Turunçlu mahallesi samandag yolu uzeri, saray ...",22536,emergency_call,[]
1,8,46009,MARAŞ ON İKİ ŞUBAT CADDESİ KÜLTÜRKENT SİTESİ A...,84306,emergency_call,[]
2,10,46011,ŞAZİBEY MAHALLESİ HAYDAR ALİYEV BULVARI YUNUS ...,2888,emergency_call,[]
3,11,46012,"Su otele yardım edin, çok çocuk var içeride. #...",36316,emergency_call,[]
4,13,46014,Malatya nergiz sitesi #malatyadeprem #Turkey #...,22886,emergency_call,[]


In [10]:
abbreviations = [
    ('apt', 'Apartmanı'),
    ('Apt', 'Apartmanı'),
    ('APT', 'Apartmanı'),
    ('apart', 'Apartmanı'),
    ('Apart', 'Apartmanı'),
    ('APART', 'Apartmanı'),
    ('sok', 'Sokak'),
    ('sk', 'Sokak'),
    ('Sok', 'Sokak'),
    ('Sk', 'Sokak'),
    ('SOK', 'Sokak'),
    ('SK', 'Sokak'),
    ('cad', 'Caddesi'),
    ('Cad', 'Caddesi'),
    ('CAD', 'Caddesi'),
    ('cd', 'Caddesi'),
    ('Cd', 'Caddesi'),
    ('CD', 'Caddesi'),
    ('bşk', 'başkanlığı'),
    ('bul', 'Bulvarı'),
    ('blv', 'Bulvarı'),
    ('Blv', 'Bulvarı'),
    ('BLV', 'Bulvarı'),
    ('bulv', 'Bulvarı'),
    ('Bulv', 'Bulvarı'),
    ('BULV', 'Bulvarı'),
    ('mey', 'meydanı'),
    ('meyd', 'meydanı'),
    ('ecz', 'Eczanesi'),
    ('Ecz', 'Eczanesi'),
    ('ECZ', 'Eczanesi'),
    ('mh', 'Mahallesi'),
    ('mah', 'Mahallesi'),
    ('Mh', 'Mahallesi'),
    ('Mah', 'Mahallesi'),
    ('MH', 'Mahallesi'),
    ('MAH', 'Mahallesi'),
    ('şb', 'şube'),
    ('maraş', 'Kahramanmaraş'),
    ('maras', 'Kahramanmaraş'),
    ('Maraş', 'Kahramanmaraş'),
    ('Maras', 'Kahramanmaraş'),
    ('MARAŞ', 'Kahramanmaraş'),
    ('MARAS', 'Kahramanmaraş'),
    ('kmaraş', 'Kahramanmaraş'),
    ('kmaras', 'Kahramanmaraş'),
    ('KMaraş', 'Kahramanmaraş'),
    ('KMaras', 'Kahramanmaraş'),
    ('KMARAŞ', 'Kahramanmaraş'),
    ('KMARAS', 'Kahramanmaraş'),
    ('antep', 'Gaziantep'),
    ('Antep', 'Gaziantep'),
    ('ANTEP', 'Gaziantep'),
    ('anteb', 'Gaziantep'),
    ('Anteb', 'Gaziantep'),
    ('ANTEB', 'Gaziantep'),
    ('Urfa', 'Şanlıurfa'),
    ('urfa', 'Şanlıurfa'),
    ('URFA', 'Şanlıurfa'),
    
    ]

def normalize_abbreviations(text):
    # Handle the dotted "k.maraş" variants first; they are plain string
    # replacements and only need to run once, not once per abbreviation.
    for variant in ('k.maraş', 'K.maraş', 'K.Maraş', 'k.maras', 'K.maras', 'K.Maras'):
        text = text.replace(variant, 'Kahramanmaraş')
    # Replace each abbreviation only where it appears as a whole word.
    for abbreviation, replacement in abbreviations:
        text = re.sub(rf'\b{re.escape(abbreviation)}\b', replacement, text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r'\s\s+', ' ', text)


Let's try it out:

In [11]:
normalize_abbreviations('K.maraş, KMARAŞ, Maras')

'Kahramanmaraş, Kahramanmaraş, Kahramanmaraş'
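
As a side note, the same idea can be sketched more compactly with one case-insensitive pattern instead of listing every capitalization. This is only an illustrative alternative, not the function used in this notebook: the names `ABBREVIATION_MAP` and `normalize_compact` are hypothetical, the table is a small subset, and unlike the function above it also swallows a trailing dot after the abbreviation.

```python
import re

# Hypothetical compact table: one lowercase key per abbreviation (subset only).
ABBREVIATION_MAP = {
    "cad": "Caddesi",
    "cd": "Caddesi",
    "mah": "Mahallesi",
    "mh": "Mahallesi",
    "sok": "Sokak",
    "sk": "Sokak",
    "maraş": "Kahramanmaraş",
    "kmaraş": "Kahramanmaraş",
}

# Longest keys first so that e.g. "kmaraş" is tried before "maraş".
_keys = sorted(ABBREVIATION_MAP, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b\.?",
    flags=re.IGNORECASE,
)

def normalize_compact(text):
    """Expand whole-word abbreviations regardless of case; drop a trailing dot."""
    def expand(match):
        # Fall back to the original text if the lowercased key is unknown.
        return ABBREVIATION_MAP.get(match.group(1).lower(), match.group(0))
    return _PATTERN.sub(expand, text)

print(normalize_compact("Atatürk CAD. no 5, Yeni mh"))
```

One caveat with this approach: Python's default case mapping does not follow Turkish rules for I/İ and ı/i, so keys containing those letters would need extra care.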

In [13]:
help_data['text'] = help_data['text'].apply(lambda x: normalize_abbreviations(x))

In [14]:
help_data['text'].tolist()[:10]

['Turunçlu mahallesi samandag yolu uzeri, saray market yanı 95/B Defne-Hatay Enkazda kalanlardan biri Nilay Oltacı İletişim 05161646506 #Turkey #CristianoRonaldo #hatayyardimbekliyor #hatayiskenderun #HalkTV #özgürdemirtaş #fulyaöztürk #EnkazAltında #tahaduymaz',
 'Kahramanmaraş ON İKİ ŞUBAT CADDESİ KÜLTÜRKENT SİTESİ ARKADAŞIMIN AMCASI KUZENİ ENKAZ ALTINDA LÜTFEN YARDIM EDİN VİLDAN GEZER ALİ GEZER #Kahramanmaras #deprem #Hatay #ENKAZALTİNDAYİM #Kahramanmaraş #Hatay #Turkey',
 'ŞAZİBEY MAHALLESİ HAYDAR ALİYEV BULVARI YUNUS APARTMANI A BLOK ACİLEN EKİBE İHTİYACIMIZ VAR LÜTFEN SESİMİZİ DUYURUN YARDIM EDİN #Turkey #Kahramanmaras #onikisubat #marasyardım #Marasayetisemiyoruz @haluklevent @ahbap @ekrem_imamoglu @berkcanguven @OguzhanUgur @efeuygac',
 'Su otele yardım edin, çok çocuk var içeride. #deprem #seferberlik #Turkey #YARDIMEDİN https://t.co/AP6oMgciQB',
 'Malatya nergiz sitesi #malatyadeprem #Turkey #TurkeyEarthquake #PrayForTurkey #sondakikadeprem https://t.co/j4mOfXZo7s',
 '@_BadBi

In [16]:
sample_list = []

for i in range(len(help_data)):
    sample_dict = {}
    sample_dict["index"] = str(help_data["index"][i])
    sample_dict["text"] = help_data["text"][i]
    sample_list.append(sample_dict)

For token classification tasks, which perform many-to-many classification, Doccano requires JSONL files.
Therefore, we will save our sample in that format.

In [17]:
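def json_to_jsonl(json_data, output_file):
    # Write one JSON object per line; keep Turkish characters readable
    # instead of escaping them as \uXXXX sequences.
    with open(output_file, 'w', encoding='utf-8') as f:
        for item in json_data:
            json_string = json.dumps(item, ensure_ascii=False)
            f.write(json_string + '\n')

json_to_jsonl(sample_list, 'JSONL_sample_NER.jsonl')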
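
Before uploading, it can be worth checking that a JSONL file reads back as expected. The following is a minimal sketch of such a round-trip check; the helper `read_jsonl`, the temporary file name, and the two records are made up for illustration and are not part of the dataset.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny illustrative records (hypothetical, same keys as sample_list entries).
records = [
    {"index": "1", "text": "Örnek Mahallesi Deneme Sokak"},
    {"index": "2", "text": "Kahramanmaraş merkez"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    # One JSON object per line, which is what makes the file JSONL.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    loaded = read_jsonl(path)

print(len(loaded), "records read back")
```

Because each line is an independent JSON document, the file can be read line by line without loading one large array into memory.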

#### Now we're ready to annotate the tweets using Doccano!
For more information, check out the [tutorial](https://doccano.github.io/doccano/).
- After finishing the annotation of 10,000 tweets on Doccano, I downloaded the annotated tweets as a JSONL file named admin2.jsonl.