# FILTERING AND SAMPLING THE DATA

The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/swaptr/turkey-earthquake-tweets) and contains tweets about the earthquakes that hit Turkey and Syria on 6 February 2023. It includes the text of each tweet, user profile information, the time and location of each tweet, and the number of likes, retweets, and replies. It also includes any hashtags, mentions, and links used in the tweets.

- Since the dataset contains over 400,000 tweets in more than 60 languages, I will first filter the corpus down to the Turkish-language tweets.
- Then I will randomly sample a subcorpus for manual annotation with Doccano.

In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split
import json
import warnings
warnings.filterwarnings('ignore')

In [8]:
pd.set_option('display.max_columns', None) 
df = pd.read_csv("/Users/yagmuraslan/Desktop/JEDHA/FINAL_PROJECT/tweets.csv")
df.head()

Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,2023-02-21 03:30:04+00:00,तुर्की में सोमवार देर रात भूंकप के तेज झटके मह...,"['ATDigital', 'Turkey', 'Earthquake', 'TurkeyE...",0.0,0.0,19727712.0,True,hi,,,Twitter Media Studio
1,2023-02-21 03:29:07+00:00,New search &amp; rescue work is in progress in...,"['Hatay', 'earthquakes', 'Türkiye', 'TurkiyeQu...",1.0,0.0,5697.0,True,en,,,Twitter Web App
2,2023-02-21 03:29:04+00:00,Can't imagine those who still haven't recovere...,"['Turkey', 'earthquake', 'turkeyearthquake2023...",0.0,0.0,1.0,False,en,,,Twitter for Android
3,2023-02-21 03:28:06+00:00,its a highkey sign for all of us to ponder ove...,"['turkeyearthquake2023', 'earthquake', 'Syria']",0.0,0.0,3.0,False,en,,,Twitter for Android
4,2023-02-21 03:27:38+00:00,Turkiye Earthquake: तुर्किए में फिर आया भूकंप ...,"['turkey', 'earthquake', 'turkiye', 'india', '...",0.0,0.0,17.0,False,und,,,Twitter for Android


In [9]:
print("The number of tweets in the dataset is", len(df))

The number of tweets in the dataset is 478052


In [12]:
print("The dataset consists of tweets posted between", df['date'].min().split(" ")[0], "and", 
      df['date'].max().split(" ")[0], "on the earthquakes.")

The dataset consists of tweets posted between 2023-02-06 and 2023-02-21 on the earthquakes.


In [5]:
df['date'].min()

'2023-02-06 00:00:00+00:00'

In [17]:
print("The dataset contains tweets in", len(df['language'].unique()), "different languages.")

The dataset contains tweets in 66 different languages.
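
A caveat on counts like the one above: `unique()` treats a missing value as its own entry, while `nunique()` and `groupby` drop missing values by default. If the `language` column contains any `NaN`, the number printed here will be one higher than the number of groups in a `groupby` count. The sketch below illustrates this on a small, made-up column; it does not use the real dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two languages and one missing value (hypothetical data).
lang = pd.Series(["tr", "en", "tr", np.nan], name="language")

n_with_nan = len(lang.unique())   # unique() keeps NaN as its own entry
n_without_nan = lang.nunique()    # nunique() drops NaN by default
n_groups = lang.groupby(lang).count().shape[0]  # groupby also drops NaN keys

print(n_with_nan, n_without_nan, n_groups)
```

Using `nunique()` gives a language count that matches what a `groupby` or `value_counts` table reports.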


## Selecting the tweets in Turkish

In [7]:
df.groupby(df['language'])['language'].count().sort_values()

language
dv          4
my          6
km         11
hy         13
am         13
        ...  
qht     13479
ar      17059
qme     38829
tr     140532
en     189626
Name: language, Length: 65, dtype: int64

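As the counts above show, most of the tweets (roughly 69%) were posted in either Turkish or English. The table lists 65 languages rather than the 66 reported earlier, most likely because `unique()` counts missing values as a separate entry whereas `groupby` drops them. This project focuses on Turkish-language tweets only, so let's filter them.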

In [21]:
tr = df[df['language'] == 'tr'].reset_index(drop = True)
tr.head()

Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,2023-02-21 01:19:22+00:00,Hayatını kaybeden çocukların anısına bazı enka...,"['earthquake', 'DEPREMANI', 'depremoldu', 'dep...",1.0,0.0,149.0,False,tr,,,Twitter for Android
1,2023-02-21 01:09:45+00:00,Vatan hainleri yine TAG açmış: #70ildeOkullarK...,"['70ildeOkullarKapatılsın', 'deprem', 'earthqu...",3.0,1.0,0.0,False,tr,,,Twitter for iPhone
2,2023-02-21 00:56:12+00:00,2023 Bizi Sal Artık 🤦🏻‍♀️ #earthquake #turke...,"['earthquake', 'turkeyearthquake2023', 'Turkey...",0.0,0.0,1252.0,False,tr,,,Twitter for Android
3,2023-02-21 00:53:57+00:00,Türkiye'nin Güneyi ve Suriye'de 6.4 büyüklüğün...,"['Turkey', 'earthquake', 'Syria']",0.0,0.0,3338.0,False,tr,,,Twitter for Android
4,2023-02-21 00:36:52+00:00,Selocum onlar istifa etmiyor. Devlet malı deni...,"['earthquake', 'Erdbeben', 'depremoldu', 'Turk...",0.0,0.0,297.0,False,tr,"Coordinates(longitude=40.149462, latitude=37.8...","Place(fullName='Diyarbakır, Türkiye', name='Di...",Twitter for Android


In [22]:
tr['content'].tolist()[:10]

['Hayatını kaybeden çocukların anısına bazı enkazların üzerine balon astılar.  #earthquake #DEPREMANI #depremoldu #depremhatay #deprem #turkeyearthquake2023 #TurkeyEarthquake #turkiyeearthquake #Syria #syriaearthquake #afaddeprem #tribute #HopeForTurkeyNow #HopeForSyriaNow https://t.co/5Tw4QTSqTA',
 'Vatan hainleri yine TAG açmış: #70ildeOkullarKapatılsın  SİZ ÇENENİZİ KAPATIN!  EN İYİ PSİKOLOJİK TEDAVİ MERKEZLERİ OKULLAR AÇIK OLACAK, AÇIK KALACAK!  MALLIĞINIZDAN BIKTIK!   EĞİTİM HER ŞEYDEN DAHA ÖNEMLİ!  KALIN KAFANIZA SOKUN BUNU! #deprem #earthquake #Turkey #meb #sondakika',
 '2023 Bizi Sal Artık 🤦🏻\u200d♀️   #earthquake #turkeyearthquake2023 #Turkey #samandag',
 "Türkiye'nin Güneyi ve Suriye'de 6.4 büyüklüğündeki deprem şoku bir kez daha hissedildi. Son günlerde meydana gelen deprem nedeniyle 50 bin ölüm korkusu var ve yüzlerce bina yıkıldı Allah bizi affetsin, bize merhamet etsin. #Turkey #earthquake #Syria",
 'Selocum onlar istifa etmiyor. Devlet malı deniz yemeyen domuz diyerek yi

## Sampling for Annotation

- To train the tweet classification model, I first chose to manually annotate 10,000 tweets according to whether or not they are an emergency call.
- I will draw the annotation sample from the tweets posted in the first 72 hours after the earthquake, because this period contains a higher share of emergency and rescue calls. This gives a less imbalanced training set and lets me train the model on more emergency tweets, which should improve the model's predictive power when it is applied to the entire dataset.
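
The 72-hour window described above can also be expressed with parsed timestamps rather than string comparison. The following is a minimal sketch on a made-up frame, not the real data; the variable names `toy`, `start`, and `cutoff` are illustrative, and the window start (midnight UTC on 6 February 2023) is an assumption chosen to match the earliest date in the dataset.

```python
import pandas as pd

# Hypothetical toy frame with the same ISO-style date strings as the dataset.
toy = pd.DataFrame({
    "date": [
        "2023-02-06 01:30:00+00:00",
        "2023-02-08 23:59:54+00:00",
        "2023-02-09 00:00:00+00:00",
        "2023-02-21 03:30:04+00:00",
    ],
    "language": ["tr", "tr", "tr", "en"],
})

# Parse to timezone-aware timestamps so the comparison is chronological,
# not lexicographic.
toy["date"] = pd.to_datetime(toy["date"], utc=True)

start = pd.Timestamp("2023-02-06", tz="UTC")  # assumed start of the window
cutoff = start + pd.Timedelta(hours=72)       # 72 hours after the start

# Keep rows inside the half-open interval [start, cutoff).
in_window = toy[(toy["date"] >= start) & (toy["date"] < cutoff)]
print(len(in_window))
```

Parsing the dates makes the cutoff explicit and keeps the filter correct even if the strings were ever stored with mixed UTC offsets.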

### First 72 hours

In [26]:
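# Keep tweets posted before 2023-02-09 00:00 UTC (the first three days of the dataset).
# The dates are ISO-formatted strings, so this string comparison orders them
# chronologically as long as they all share the same UTC offset.
df3days = df[df['date'] < '2023-02-09 00:00:00+00:00']
tr3 = df3days[df3days['language'] == 'tr'].reset_index(drop = True)
tr3.head()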

Unnamed: 0,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,2023-02-08 23:59:54+00:00,DONATE VIA BANK TRANSFER: SWIFT: ISBKTRIS İ...,"['HelpTurkey', 'earthquake', 'earthquakeinturk...",0.0,1.0,30.0,False,tr,,,Twitter Web App
1,2023-02-08 23:59:52+00:00,6) Ara ara hastanın nefes alıp almadığını kont...,"['Deprem', 'Turkey']",0.0,0.0,358.0,False,tr,,,Twitter for Android
2,2023-02-08 23:59:50+00:00,#AFADGaziantep lütfen o binada Sude ve ailesin...,"['AFADGaziantep', 'gaziantepdeprem', 'Turkey']",0.0,0.0,284.0,False,tr,,,Twitter for iPhone
3,2023-02-08 23:59:44+00:00,71.saat 5. Kat 75 yaşındaki Bekir amcamız seni...,"['bekir', 'deprem', 'Gaziantep', 'Turkey', 'Tu...",6.0,1.0,84.0,False,tr,,,Twitter for Android
4,2023-02-08 23:59:38+00:00,ENKAZIN ALTINDAN AFAD EKİPLERİNE YEMEK SÖZÜ #...,"['Hatay', 'deprem', 'Turkey', 'enkazaltındayım...",177.0,25.0,1223.0,False,tr,,,Twitter for Android


### Sampling

In [27]:
# Hold out 10,000 tweets for annotation; train_test_split returns both subsets directly.
rest, sample = train_test_split(tr3, test_size = 10000, random_state = 42)
len(sample)

10000
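
As an aside, a fixed-size random sample can also be drawn with pandas alone. The sketch below uses `DataFrame.sample` on a small, made-up frame (`toy`, `sample_toy`, and `rest_toy` are illustrative names). It is an alternative to the `train_test_split` call above, not a drop-in replacement: with the same seed it is not guaranteed to select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the filtered frame: 100 rows, default RangeIndex.
toy = pd.DataFrame({"content": [f"tweet {i}" for i in range(100)]})

# Draw 10 rows without replacement; random_state makes the draw repeatable.
sample_toy = toy.sample(n=10, random_state=42)

# Everything that was not drawn remains available for later use.
rest_toy = toy.drop(sample_toy.index)

print(len(sample_toy), len(rest_toy))
```

Either approach keeps the original row labels on the sampled rows, which is what makes it possible to trace an annotated tweet back to its source row.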

In [28]:
sample = sample.reset_index(drop=False)
sample.head()

Unnamed: 0,index,date,content,hashtags,like_count,rt_count,followers_count,isVerified,language,coordinates,place,source
0,108332,2023-02-06 11:17:08+00:00,Arama Kurtarma ekipleri heryere yetişemiyor kı...,"['deprem', 'Hatay', 'Gaziantep', 'Turkey', 'is...",2.0,1.0,61.0,False,tr,,,Twitter for iPhone
1,66352,2023-02-06 23:02:41+00:00,"Marketleri, dükkanları, ölmüş insanları yağmal...","['YazıklarOlsun', 'deprem', 'Turkey', 'Enkaz']",2.0,2.0,73.0,False,tr,,,Twitter for Android
2,32462,2023-02-07 19:07:43+00:00,Arkadaşlar böyle bir uygulama varmış. İlaçları...,"['deprem', 'hataydepremi', 'hatayyardimbekliyo...",0.0,0.0,30.0,False,tr,,,Twitter for Android
3,84613,2023-02-06 18:41:58+00:00,Adıyamanda destek yok. Çok fazla bina yıkıldı ...,"['deprem', 'adiyamandeprem', 'AFAD', 'afadadiy...",1.0,1.0,1.0,False,tr,,,Twitter for Android
4,22536,2023-02-08 01:00:38+00:00,"Turunçlu mahallesi samandag yolu uzeri, saray ...","['Turkey', 'CristianoRonaldo', 'hatayyardimbek...",0.0,0.0,0.0,False,tr,,,Twitter for iPhone


Now, let's keep only the variables of interest: the row index (each tweet's position in `tr3`, stored in the `index` column) and the tweet content.

#### Note:
- Doccano requires the input to be a JSON file with a key `text` for the content and a key `label` for the labels; see the [tutorial page](https://doccano.github.io/doccano/tutorial/#import-a-dataset) for further documentation.
- Therefore, we will save our sample dataset in the required format.
- Since we don't have a label column yet, we can leave the `label` key out.

In [31]:
# Build one record per tweet: the original row index (as a string) and the tweet text.
sample_list = [
    {"index": str(idx), "text": text}
    for idx, text in zip(sample["index"], sample["content"])
]

In [32]:
sample_list[:5]

[{'index': '108332',
  'text': 'Arama Kurtarma ekipleri heryere yetişemiyor kısmi seferberlik ilan edilmeli durum çok kötü  #deprem #Hatay #Gaziantep #Turkey #iskenderun'},
 {'index': '66352',
  'text': 'Marketleri, dükkanları, ölmüş insanları yağmalayan şerefsizler; duyarsızca dalga geçen iğrenç yaratıklar... #YazıklarOlsun #deprem #Turkey #Enkaz'},
 {'index': '32462',
  'text': 'Arkadaşlar böyle bir uygulama varmış. İlaçları temin edip gönderebiliriz. Yayalım. #deprem #hataydepremi #hatayyardimbekliyor #Turkey #ilaç #seferberlik #sondakikadeprem https://t.co/JhHaJdyIK0'},
 {'index': '84613',
  'text': 'Adıyamanda destek yok. Çok fazla bina yıkıldı hatayda aynı şekilde . Oradaki insnaların da yardıma ihtiyacı var . #deprem  #adiyamandeprem  #AFAD  #afadadiyaman  #Turkey  #HalukLevent  #seferberlik'},
 {'index': '22536',
  'text': 'Turunçlu mahallesi samandag yolu uzeri, saray market yanı 95/B Defne-Hatay Enkazda kalanlardan biri Nilay Oltacı  İletişim 05161646506 #Turkey #CristianoRon

In [34]:
# ensure_ascii=False writes Turkish characters as-is instead of \uXXXX escapes.
with open("sample_list_10K.json", "w", encoding="utf-8") as file:
    json.dump(sample_list, file, ensure_ascii=False)
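
Before uploading, it can be worth checking that the file round-trips cleanly. The sketch below writes two made-up records in the same shape as `sample_list` to a temporary file and reads them back. The records and the file name `records.json` are hypothetical; only the standard-library `json` calls mirror the cell above.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as sample_list.
records = [
    {"index": "0", "text": "Çok fazla bina yıkıldı #deprem"},
    {"index": "1", "text": "İlaçları temin edip gönderebiliriz."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.json")  # illustrative file name

    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)

    # The raw text keeps Turkish characters readable instead of \uXXXX escapes.
    with open(path, encoding="utf-8") as f:
        raw = f.read()

    # Parsing the file back should give exactly the records that were written.
    loaded = json.loads(raw)

print(loaded == records, "ı" in raw)
```

Opening the file with `encoding="utf-8"` on both write and read avoids platform-dependent default encodings, which matters for text containing characters such as `ı`, `ş`, and `ğ`.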

#### Now we're ready to annotate the tweets using Doccano!
For more information, check out the [tutorial](https://doccano.github.io/doccano/).
- After annotating the 10,000 tweets in Doccano, I downloaded the annotated tweets as a JSON file named `earthquake10K.json`.