# AMR Silver Data Creation
Steps:
1. Get 200k sentences from the Indo4B corpus.
2. Generate AMR graphs for those sentences using an AMR parser.

Note: this notebook was created in Google Colab.

## Get 200k sentences from the Indo4B corpus
- Note: only formal Indonesian sentences are taken (e.g., news articles).

In [None]:
!wget https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_wot_uncased_blanklines.tar.xz

--2022-01-18 09:01:08--  https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_wot_uncased_blanklines.tar.xz
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.145.128, 74.125.143.128, 173.194.69.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.145.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6017328328 (5.6G) [application/x-tar]
Saving to: ‘dataset_wot_uncased_blanklines.tar.xz’


2022-01-18 09:03:03 (50.4 MB/s) - ‘dataset_wot_uncased_blanklines.tar.xz’ saved [6017328328/6017328328]



In [None]:
!tar -xf dataset_wot_uncased_blanklines.tar.xz

In [None]:
!ls /content/processed_uncased_blanklines

bppt.txt		kompas.txt	       talpco_indonesia.txt
conllu_all_uncased.txt	opensubtitles.txt      tempo.txt
frog_storytelling.txt	oscar_all_uncased.txt  wikipedia_conllu.txt
jw300.txt		parallel_corpus.txt    wiki.txt


In [None]:
FOLDER_PATH = '/content/processed_uncased_blanklines'

In [None]:
import os
import random
import math

In [None]:
# Peek at the first five lines of tempo.txt; len(sent) includes the trailing newline
with open(os.path.join(FOLDER_PATH, 'tempo.txt')) as f:
  for _ in range(5):
    sent = f.readline()
    print(f'length: {len(sent)}', ' | ',  f'sentence: {sent}')

length: 76  |  sentence: partai persatuan bangsa adalah partai yang telah memperoleh nur dari allah.

length: 111  |  sentence: partai ini pulalah yang benar-benar dilahirkan oleh para ulama dan fusi dari partai-partai islam di indonesia.

length: 176  |  sentence: hal itu disampaikan ketua umum ppp hamzah haz dalam sambutannya dalam perayaan hari lahir (harlah) partai itu yang ke-29 di gelora bung karno, senayan, jakarta, minggu (21/4).

length: 1  |  sentence: 

length: 163  |  sentence: di hadapan puluhan ribu pedukungnya, hamzah haz yang saat ini menjadi orang nomor dua di republik ini menegaskan, ppp sampai saat ini masih tetap kukuh dan besar.



In [None]:
def get_sentences_from_corpus(file_path, cnt_sentences=100000, max_iter=1000000, min_len=1, max_len=384):
  """Collect up to max_iter eligible sentences from file_path, then return
  cnt_sentences of them taken at evenly spaced positions, in shuffled order."""
  list_sentences = []
  with open(file_path) as f:
    for sent in f:  # iteration stops at end of file
      if len(list_sentences) >= max_iter:
        break
      # len(sent) counts the trailing newline; skip lines that contain quotes,
      # fall outside [min_len, max_len], or have fewer than two tokens
      if ("'" not in sent and '"' not in sent
          and min_len <= len(sent) <= max_len and len(sent.split()) >= 2):
        list_sentences.append(sent.strip())

  # choose evenly spaced indices over the collected sentences (not random sampling)
  cnt_sentences = min(cnt_sentences, len(list_sentences))
  if cnt_sentences <= 0:
    return []
  step = math.floor(len(list_sentences) / cnt_sentences)
  list_chosen_idx_sent = list(range(0, step * cnt_sentences, step))

  # shuffle so the output order does not follow the corpus order
  random.shuffle(list_chosen_idx_sent)
  print(list_chosen_idx_sent[0:10])
  final_list_sentences = [list_sentences[i] for i in list_chosen_idx_sent]

  return final_list_sentences
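
The selection step in `get_sentences_from_corpus` takes evenly spaced indices and then shuffles them. The sketch below isolates that idea on a toy input. `evenly_spaced_indices` and its `seed` parameter are hypothetical names used only for illustration; they are not part of the notebook.

```python
import math
import random

def evenly_spaced_indices(n_items, n_wanted, seed=None):
  """Return n_wanted indices spread evenly over range(n_items), in shuffled order."""
  n_wanted = min(n_wanted, n_items)
  if n_wanted <= 0:
    return []
  step = math.floor(n_items / n_wanted)
  # step * n_wanted <= n_items, so every index is valid
  indices = list(range(0, step * n_wanted, step))
  # a seeded generator makes the shuffle repeatable (assumption: the notebook does not seed)
  random.Random(seed).shuffle(indices)
  return indices

idx = evenly_spaced_indices(n_items=10, n_wanted=3, seed=0)
print(sorted(idx))  # [0, 3, 6]
```

With 10 items and 3 wanted, the step is 3, so the chosen positions are 0, 3, and 6. Items past `step * n_wanted` are never selected, which is worth keeping in mind when the corpus size is not a multiple of the requested count.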
  

### Get 100k sentences from Indo4B Tempo (news articles from Tempo)

In [None]:
list_sentences = get_sentences_from_corpus(os.path.join(FOLDER_PATH, 'tempo.txt'))

[47006, 72672, 165006, 77628, 216, 121350, 62066, 84880, 137600, 3238]


In [None]:
len(list_sentences)

100000

In [None]:
list_sentences[0:10]

['heran, sudah puluhan kali pemilihan pansus, baru kali ini berjalan alot dan sampai harus berjalan melalui voting.',
 'masalah ini seperti upah buruh, aksi buruh, dan masalah-masalah keamanan lainnya.',
 'pengadilan in absentia digelar karena hendra rahardja kini masih berada di australia dan menolak diekstradisi ke indonesia.',
 'tapi harus diingat, konsepnya itu penanggulangan pertama dilakukan oleh pemda.',
 'forum silaturahmi ulama yang beranggotakan para ulama di cisarua, bogor, jawa barat meminta aparat kepolisian untuk membebaskan santri-santrinya.',
 'menurutnya, kekurangbijakan mpr dalam mengambil keputusan justru akan melahirkan sesuatu yang menyakitkan bagi bangsa ini.',
 'meski demikian, menurut dia, tetap diperlukan surat resmi ke interpol.',
 'jadi tak semuanya berindikasi komunis.',
 'setelah itu presiden dan wapres melakukan acara foto bersama dengan para anggota kontingen yang dilanjutkan dengan acara ramah ramah di lingkungan wisma negara.',
 'saya harus menghindari 

In [None]:
for split in range(5):
  start_idx = split*20000
  stop_idx = start_idx + 20000
  with open(os.path.join(FOLDER_PATH, f'20k_indo4b_tempo_{split+1}.txt'), 'w') as f:
    for i in range(start_idx, stop_idx):
      f.write(list_sentences[i])
      f.write('\n')
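
The split loop above hard-codes five files of 20,000 lines each. A more general version might look like the sketch below, which derives the number of files from the list length. `write_splits` is a hypothetical helper, not something defined in the notebook, and the toy call writes to a temporary directory rather than to `FOLDER_PATH`.

```python
import os
import tempfile

def write_splits(sentences, out_dir, prefix, split_size):
  """Write sentences into numbered files of at most split_size lines; return the paths."""
  paths = []
  for split, start in enumerate(range(0, len(sentences), split_size), start=1):
    path = os.path.join(out_dir, f'{prefix}_{split}.txt')
    with open(path, 'w') as f:
      for sent in sentences[start:start + split_size]:
        f.write(sent + '\n')
    paths.append(path)
  return paths

# Toy usage: three sentences, two per file
with tempfile.TemporaryDirectory() as tmp_dir:
  paths = write_splits(['a b', 'c d', 'e f'], tmp_dir, 'demo', split_size=2)
  print([os.path.basename(p) for p in paths])  # ['demo_1.txt', 'demo_2.txt']
```

Under these assumptions, `write_splits(list_sentences, FOLDER_PATH, '20k_indo4b_tempo', 20000)` would produce the same file names as the loop above, and slicing avoids an `IndexError` if fewer than 100,000 sentences were collected.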

### Get 100k sentences from Indo4B Kompas (news articles from Kompas)

In [None]:
list_sentences = get_sentences_from_corpus(os.path.join(FOLDER_PATH, 'kompas.txt'))

[7381, 86937, 72410, 27866, 32574, 33056, 10622, 95412, 42741, 11803]


In [None]:
len(list_sentences)

100000

In [None]:
list_sentences[0:10]

['di situ diceritakan bagaimana anaknya minta izin tidak bisa pulang lebaran, karena ada janji main ski di alpen, yang lain lagi masih magang di ibm di new york, dan seterusnya.',
 'meski belum disampaikan, untuk sementara, pemerintah telah menetapkan angka defisit anggaran sebesar rp 56 trilyun lebih, atau 3,8 persen dari produk domestik bruto (pdb).',
 'mendagri hari sabarno dalam tanggapannya mengenai berbagai minderheidsnota itu mengatakan bahwa pemerintah tidak pernah membuat peraturan yang mendiskriminasi perempuan dan laki-laki.',
 'berbeda dalam penyelenggaraan komunikasi bergerak (selular), terdapat pilihan provider yang berkompetisi sehat memberikan pelayanan terbaiknya.',
 'dalam arti tidak mengizinkan satu pesawat pun lepas landas.',
 'untuk itu, tidak tanggung-tanggung, sembilan dari sepuluh orang amerika mendukung dilakukan tindakan militer.',
 'utang piutang',
 'kebijakan itu ditempuh untuk mengantisipasi ekspektasi masyarakat terhadap inflasi yang cukup tinggi dewasa in

In [None]:
for split in range(5):
  start_idx = split*20000
  stop_idx = start_idx + 20000
  with open(os.path.join(FOLDER_PATH, f'20k_indo4b_kompas_{split+1}.txt'), 'w') as f:
    for i in range(start_idx, stop_idx):
      f.write(list_sentences[i])
      f.write('\n')

In [None]:
%cd /content/processed_uncased_blanklines

/content/processed_uncased_blanklines


In [None]:
!zip -q 200k_sentences_indo4b.zip *_indo4b_kompas_*.txt *_indo4b_tempo_*.txt

## Generate AMR
- Generate AMR for all 200k sentences.
- Note: the AMR parser usage follows this repo: https://github.com/banditelol/amr_parser

### Setup

In [None]:
!git clone https://github.com/taufiqhusada/amr_parser.git
%cd /content/amr_parser

Cloning into 'amr_parser'...
remote: Enumerating objects: 2542, done.
remote: Counting objects: 100% (2542/2542), done.
remote: Compressing objects: 100% (2093/2093), done.
remote: Total 2542 (delta 489), reused 2483 (delta 438), pack-reused 0
Receiving objects: 100% (2542/2542), 3.76 MiB | 20.91 MiB/s, done.
Resolving deltas: 100% (489/489), done.
/content/amr_parser


In [None]:
!pip install -r requirements.txt
!pip install pandas --upgrade

In [None]:
%cd /content/amr_parser
!chmod +x update-anago.sh
!./update-anago.sh -d /usr/local/lib/python3.7/dist-packages/anago

/content/amr_parser
commit_hash is unset, using default hash directory
anago directory: '/usr/local/lib/python3.7/dist-packages/anago'
commit hash: '9afccaa5bcc232676f9c2b59faa4c9531fb25190'
check https://raw.githubusercontent.com/banditelol/anago/9afccaa5bcc232676f9c2b59faa4c9531fb25190 for included files
updating callbacks.py
2022-01-19 01:00:29 URL:https://raw.githubusercontent.com/banditelol/anago/9afccaa5bcc232676f9c2b59faa4c9531fb25190/anago/callbacks.py [1257/1257] -> "/usr/local/lib/python3.7/dist-packages/anago/callbacks.py" [1]
updating layers.py
2022-01-19 01:00:29 URL:https://raw.githubusercontent.com/banditelol/anago/9afccaa5bcc232676f9c2b59faa4c9531fb25190/anago/layers.py [25660/25660] -> "/usr/local/lib/python3.7/dist-packages/anago/layers.py" [1]
updating models.py
2022-01-19 01:00:30 URL:https://raw.githubusercontent.com/banditelol/anago/9afccaa5bcc232676f9c2b59faa4c9531fb25190/anago/models.py [4931/4931] -> "/usr/local/lib/python3.7/dist-packages/anago/models.py" [1]


In [None]:
# Download the required resources
import stanfordnlp
import stanza
import nltk

stanza.download('id')
nltk.download('punkt')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.2.1.json:   0%|   …

2022-01-19 01:00:40 INFO: Downloading default packages for language: id (Indonesian)...


Downloading http://nlp.stanford.edu/software/stanza/1.2.1/id/default.zip:   0%|          | 0.00/201M [00:00<?,…

2022-01-19 01:01:16 INFO: Finished downloading models and saved to /root/stanza_resources.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
%cd /content/amr_parser
# Download Adylan's Pretrained model
!wget https://storage.googleapis.com/riset_amr/adylan/pretrained_feature_models.zip -O pretrained.zip
!unzip pretrained.zip
!rm pretrained.zip

# Download NER model
!wget https://storage.googleapis.com/riset_amr/pretrained_model/ner/model_ner_12514_softmax_v5_w2v_100_POS_LSTM_EmbNotTrainable_OOV-20210926T165506Z-001.zip
!unzip model_ner_12514_softmax_v5_w2v_100_POS_LSTM_EmbNotTrainable_OOV-20210926T165506Z-001.zip -d pretrained
!rm model_ner_12514_softmax_v5_w2v_100_POS_LSTM_EmbNotTrainable_OOV-20210926T165506Z-001.zip

# Download Encoder-Decoder Model
!wget https://storage.googleapis.com/riset_amr/adylan/pretrained_model_and_encoder.zip -O saved_model.zip 
!unzip saved_model.zip -d saved_model 
!rm saved_model.zip

In [None]:
%cd /content/amr_parser

/content/amr_parser


In [None]:
!python amr_parser.py --predict --file /content/processed_uncased_blanklines/20k_indo4b_kompas_1.txt --output /content/AMR_20k_indo4b_kompas_1.txt

In [None]:
!python amr_parser.py --predict --file /content/processed_uncased_blanklines/20k_indo4b_kompas_2.txt --output /content/AMR_20k_indo4b_kompas_2.txt

In [None]:
!python amr_parser.py --predict --file /content/processed_uncased_blanklines/20k_indo4b_kompas_5.txt --output /content/AMR_20k_indo4b_kompas_5.txt


In [None]:
# ... repeat for the remaining Kompas and Tempo split files
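
Rather than one cell per file, the remaining runs could be scripted. The sketch below builds the same `amr_parser.py --predict --file ... --output ...` command for every split, assuming the file naming used earlier in this notebook (five Kompas and five Tempo files). `build_parser_commands` and `run_all` are hypothetical helpers, and `run_all` is defined but not called here because each run parses 20,000 sentences.

```python
import os
import subprocess

INPUT_DIR = '/content/processed_uncased_blanklines'
OUTPUT_DIR = '/content'
PARSER_DIR = '/content/amr_parser'  # amr_parser.py is run from this directory

def build_parser_commands(sources=('kompas', 'tempo'), n_splits=5):
  """Return one argument list per split file, mirroring the cells above."""
  commands = []
  for source in sources:
    for split in range(1, n_splits + 1):
      name = f'20k_indo4b_{source}_{split}.txt'
      commands.append([
        'python', 'amr_parser.py', '--predict',
        '--file', f'{INPUT_DIR}/{name}',
        '--output', f'{OUTPUT_DIR}/AMR_{name}',
      ])
  return commands

def run_all(commands, skip_existing=True):
  """Run each command in turn, optionally skipping outputs that already exist."""
  for cmd in commands:
    output_path = cmd[-1]
    if skip_existing and os.path.exists(output_path):
      continue
    subprocess.run(cmd, check=True, cwd=PARSER_DIR)

commands = build_parser_commands()
print(len(commands))          # 10
print(' '.join(commands[0]))  # same command as the first parser cell above
# run_all(commands)  # uncomment to actually launch the parser
```

Skipping existing outputs lets an interrupted Colab session resume without re-parsing files that already finished, though a partially written output file would need to be deleted by hand first.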