# Automatic Part-of-Speech Tagger with Lánnang-uè Source Language Tagging

## Steps

1. Set up the workspace by running the code under **Setting the workspace**.
2. Load the files. Instructions can be found under the header **Loading the files**.
3. Click **Cell** and then **Run All**.
4. Wait for the program to finish. (The circle at the top right turns from black to white once the kernel is idle.)
5. Check output file `output.txt` in the directory of this notebook.

### Setting the workspace
#### Installation and package import

In [1]:
%run -i 'functions_pack.py'
import csv
import re
import nltk
import operator
import unicodedata
import unidecode
from nltk import FreqDist
from nltk.probability import ConditionalFreqDist
from collections import Counter
from collections import defaultdict
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
import string as stringg


#files = fileload('file_a', 'file_b', 'file_c', 'file_d')
files = fileload('stringreplace','taggedcorpus','dictsplit','Corpus_LannangCorpus_2020_pretagged') #replace the arguments as needed

### Loading the Tagalog and English dictionaries

In [2]:
#Dictionaries
#English Dictionary
from nltk.corpus import words
word_list = words.words()

eng_dic = {}
for index,word in enumerate(word_list):
    eng_dic[word] = index
    
#Tagalog Dictionary 
with open('tagalog_dict.txt', 'r',encoding='utf-8-sig') as txtFile:
    contents = txtFile.read()
    tagwordlist = contents.split('\n')

tag_dic = {}
for index,word in enumerate(tagwordlist):
    tag_dic[word] = index

### Loading the String Replace Dictionary: A dictionary that replaces any strings of your choice

The `dictrep` function is from `functions_pack.py`. It takes in a CSV file and loads it into the program as a dictionary.



In [3]:
repdict = dictrep(files[0])
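
As a rough illustration of what a string-replace dictionary does, the sketch below loads two-column CSV rows into a dict and applies each entry to a string. The names `load_replacements` and `apply_replacements`, the two-column layout, and the sample entries are assumptions made for illustration; the actual `dictrep` and `findrep` in `functions_pack.py` may behave differently.

```python
import csv
import io
import re

def load_replacements(csv_text):
    """Hypothetical loader: each CSV row is (old_string, new_string)."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}

def apply_replacements(text, replacements):
    """Replace every literal occurrence of each key with its value."""
    for old, new in replacements.items():
        # re.escape treats the key as plain text; the lambda keeps the
        # replacement literal even if it contains backslashes.
        text = re.sub(re.escape(old), lambda _m, new=new: new, text)
    return text

# Made-up sample entries: delete "(laughs)", rewrite "[unclear]" as "xxx".
sample_csv = "(laughs),\n[unclear],xxx\n"
replacements = load_replacements(sample_csv)
cleaned = apply_replacements("Hm (laughs) ok [unclear]", replacements)
```

Escaping the keys means entries such as `(laughs)` are matched as literal text and not interpreted as regular-expression groups.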

### Loading the Tagged Corpus

The `openaslist` function is from `functions_pack.py`. It takes in a CSV file and returns a list of tuples in the following format: ('word', 'gloss').

The `permtag` function is from `functions_pack.py`. It takes in a CSV file and returns a list of tuples in the following format: ('punctuation', 'gloss'). `'permtag'` is a file with additional punctuation data for training the model.

In [4]:
puncdata = permtag('permtag')
data = openaslist(files[1])    

#### Preparations

In [5]:
dat = []
for row in data:
    newrow = []
    replacedword = findrep(row[0], repdict, re)
    newrow.append(replacedword)
    newrow.append(row[1])
    dat.append(newrow)
data = dat

#### Combining the lists

In [6]:
joineddata = puncdata+ data
data = joineddata

#### Flattening the diacritics (if any)

In [7]:
# Replace each word with its unaccented form, keeping the (word, gloss) order
data = [(strip_accents(row[0],unicodedata), row[1]) for row in data]
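
For readers without access to `functions_pack.py`, one common way to flatten diacritics is to decompose each character (Unicode NFD) and drop the combining marks. The function below is a minimal sketch of that technique under the name `strip_accents_sketch`, which is hypothetical; it is not necessarily how the notebook's `strip_accents` is implemented.

```python
import unicodedata

def strip_accents_sketch(text):
    """Remove combining diacritics while keeping the base letters."""
    # NFD splits a character such as 'î' into 'i' plus a combining circumflex.
    decomposed = unicodedata.normalize('NFD', text)
    # Category 'Mn' (nonspacing mark) covers the combining accents.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

flattened = strip_accents_sketch('Ok, so hîge part threè diaû lo kô')
```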

### Creating frequency dictionaries

In [8]:
freqdict = {}
for (word, pos) in data:
    if word not in freqdict:
        freqdict[word] = {}
    if pos not in freqdict[word]:
        freqdict[word][pos] = 1
    else:
        freqdict[word][pos] += 1
glossdict = {}
for (word, pos) in data:
    if pos not in glossdict:
        glossdict[pos] = {}
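
The nested counting above can also be written with `defaultdict` and `Counter`, both of which the notebook already imports. The sketch below is an equivalent formulation on made-up sample pairs; the helper `most_frequent_tag` is a hypothetical addition showing one way such a frequency dictionary might be queried.

```python
from collections import Counter, defaultdict

def build_freqdict(pairs):
    """Count how often each word occurs with each POS tag."""
    freq = defaultdict(Counter)
    for word, pos in pairs:
        freq[word][pos] += 1
    return freq

def most_frequent_tag(freq, word, default='?'):
    """Return the tag seen most often for a word, or a default if unseen."""
    if word not in freq:
        return default
    return freq[word].most_common(1)[0][0]

# Made-up sample data for illustration only.
pairs = [('ko', 'PRT'), ('ko', 'PRT'), ('ko', 'ADV'), ('part', 'NC')]
freq = build_freqdict(pairs)
```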


### Loading the Word Split Dictionary: A dictionary of words to split

In [9]:
dictsplit = dictsplitterload(files[2]) 
    
dictionarysplit = {}
for k,v in dictsplit:
    dictionarysplit[k]=v


### Getting a trained model

#### Sentencize the tagged corpus data.

In [10]:
sentences = list(tosentences(data))
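
The sketch below shows one plausible way to group a flat list of (word, tag) pairs into sentences. It assumes that sentences are delimited by a boundary token tagged `'<SB'` (the marker this notebook appends later when tokenizing) and that the boundary token itself is dropped; `to_sentences` is a hypothetical name, and the real `tosentences` may differ.

```python
def to_sentences(tagged_tokens, boundary_tag='<SB'):
    """Yield lists of (word, tag) pairs, split at sentence-boundary tokens."""
    sentence = []
    for word, tag in tagged_tokens:
        if tag == boundary_tag:
            # Yield even an empty sentence so sentence i still lines up
            # with row i of the source table.
            yield sentence
            sentence = []
        else:
            sentence.append((word, tag))
    if sentence:
        # Trailing tokens without a closing boundary form a final sentence.
        yield sentence

flat = [('Hm', '?'), ('#', '<SB'), ('ok', '?'), (',', '?'), ('#', '<SB')]
grouped = list(to_sentences(flat))
```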

#### Check out number of words and sentences.

In [11]:
print("Sentences:", len(sentences), "Words:", len(data))

Sentences: 1357 Words: 15611


#### Split the model data into two parts: dev and training

Note that we are training a model using Lánnang-uè data.

In [12]:
dev = sentences[:135]
training = sentences[135:]
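
The hard-coded 135 corresponds to roughly 10% of the 1357 sentences. If the corpus size changes, a fraction-based split avoids editing the number by hand. This is a minimal sketch; `split_dev_train` and `dev_fraction` are hypothetical names, and like the cell above it takes the first sentences as dev without shuffling.

```python
def split_dev_train(sentences, dev_fraction=0.1):
    """Use the first dev_fraction of the sentences as dev, the rest as training."""
    n_dev = int(len(sentences) * dev_fraction)
    return sentences[:n_dev], sentences[n_dev:]

# Stand-in data with the same number of sentences as the tagged corpus.
dev_part, train_part = split_dev_train(list(range(1357)))
```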

#### Turn data into feature vectors

In [13]:
x_train = [sent2features(s,freqdict, operator, unicodedata) for s in training]
y_train = [sent2gloss(s,freqdict, operator, unicodedata) for s in training]
x_dev = [sent2features(s,freqdict, operator, unicodedata) for s in dev]
y_dev = [sent2gloss(s,freqdict, operator, unicodedata) for s in dev]
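
A CRF expects each sentence as a list of per-token feature dicts plus a parallel list of labels. The sketch below illustrates that shape. Only the `'word.lower()'` key is confirmed by this notebook (it is read back later); every other feature, and the function names themselves, are assumptions. The real `sent2features` also takes `freqdict`, `operator`, and `unicodedata`, which are not modelled here.

```python
def word2features(sentence, i):
    """Build a feature dict for the token at position i (illustrative features)."""
    word = sentence[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-2:]': word[-2:],
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        features['-1:word.lower()'] = sentence[i - 1][0].lower()
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features['+1:word.lower()'] = sentence[i + 1][0].lower()
    else:
        features['EOS'] = True  # end of sentence
    return features

def sent2features_sketch(sentence):
    return [word2features(sentence, i) for i in range(len(sentence))]

def sent2gloss_sketch(sentence):
    return [tag for _word, tag in sentence]

example = [('Ok', 'YNM'), (',', '<C'), ('so', 'COOCJ')]
x_example = sent2features_sketch(example)
y_example = sent2gloss_sketch(example)
```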

#### Training Proper: Scikit-Learn Conditional Random Field

In [14]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

#### Prediction and Comparison

In [15]:
# Training
crf.fit(x_train, y_train)
labels = list(crf.classes_)


# Prediction and comparison on the dev set
y_pred = crf.predict(x_dev)
# Note: the reported figure is the weighted F1 score, not plain accuracy
metric = metrics.flat_f1_score(y_dev, y_pred, average='weighted', labels=labels)
print("Your model based on the test data is ", float(metric)*100, "% accurate.")

Your model based on the test data is  81.63382403356209 % accurate.
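
The figure printed above comes from `flat_f1_score` with `average='weighted'`, which is a support-weighted F1 and can differ from token-level accuracy. The pure-Python sketch below computes both on made-up nested label lists to show the difference; the function names are hypothetical and the code is meant as an illustration of the definitions, not a replacement for `sklearn_crfsuite.metrics`.

```python
from collections import Counter

def _flatten(y_true, y_pred):
    return [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]

def flat_accuracy(y_true, y_pred):
    """Share of tokens whose predicted tag equals the gold tag."""
    pairs = _flatten(y_true, y_pred)
    return sum(t == p for t, p in pairs) / len(pairs)

def weighted_f1(y_true, y_pred):
    """Per-label F1, averaged with weights equal to each label's gold count."""
    pairs = _flatten(y_true, y_pred)
    support = Counter(t for t, _ in pairs)
    predicted = Counter(p for _, p in pairs)
    correct = Counter(t for t, p in pairs if t == p)
    total = 0.0
    for label, n in support.items():
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(pairs)

# Made-up gold and predicted tags for two short sentences.
gold = [['A', 'B'], ['A']]
pred = [['A', 'A'], ['A']]
acc = flat_accuracy(gold, pred)
f1w = weighted_f1(gold, pred)
```

In this toy case two of three tokens are correct, yet the weighted F1 is lower than the accuracy because label `B` is never predicted.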




#### Converting Predictions to Nested Sentence Format

In [16]:
# With corrected data beside
predsentcor = []
for i, s in enumerate(x_dev):
    sentence = []
    for j, w in enumerate(s):
        sentence.append((w['word.lower()'],y_pred[i][j],y_dev[i][j]))
    predsentcor.append(sentence)

# Without corrected data beside
predsent = []
for i, s in enumerate(x_dev):
    sentence = []
    for j, w in enumerate(s):
        sentence.append((w['word.lower()'],y_pred[i][j]))
    predsent.append(sentence)

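# Convert NSF to CSV format (if needed)
dat = towords(predsent)
with open('output.csv','w',encoding='utf-8-sig',newline='') as out:
    writer = csv.writer(out)  # quotes tokens such as ',' so the two columns stay intact
    for w,pos in dat:
        writer.writerow([w,pos])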

### Loading the New Data (Unseen Data)

#### Load

In [17]:
overallorig = openprocessraworig(files[3])

#### Flatten

In [18]:
newoverall=[]
for row in overallorig:
    newrow = []
    for i,item in enumerate(row):
        if i < 18:
            newrow.append(row[i])
        elif i == 18:
            newword = findrep(row[18], repdict, re)
            newrow.append(newword)
        else:
            newrow.append(row[i])
    newoverall.append(newrow)
overallorig = newoverall

overallorig


[['tag',
  'no',
  'filename',
  'year',
  'region',
  'order',
  'begin',
  'end',
  'duration',
  'interlocutor.no',
  'lg.clause',
  'conversant.code',
  'subcorpus',
  'interlocutor.id',
  'transcriber.id',
  'index.file',
  'filemax',
  'filepercent',
  'utterance'],
 ['<CLIN-18-68:1>',
  '1',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7491',
  '00:02.6',
  '00:06.8',
  '00:04.3',
  '68',
  'l',
  'A',
  'CLIN',
  'PC0068',
  '1',
  '1',
  '435',
  '0.002298851',
  'Ok, so hîge part threè diaû lo kô, e te lobê part kò si'],
 ['<CLIN-18-68:2>',
  '2',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7492',
  '00:07.4',
  '00:09.7',
  '00:02.3',
  '68',
  'l',
  'A',
  'CLIN',
  'PC0068',
  '1',
  '2',
  '435',
  '0.004597701',
  ' yá casual interview lâng ko sī'],
 ['<CLIN-18-1:3>',
  '3',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7493',
  '00:09.2',
  '00:09.7',
  '00:00.5',
  '1',
  'x',
  'B',
  'CLIN',
  'PC0001',
  '1',
  '3',
  '435',
  '0.006896552',
  'Hm'],
 ['<CLIN-18-68:4>',


In [19]:
overall = openprocessraw(files[4])

In [20]:
newoverall=[]
for row in overall:
    newrow = []
    stripz = []
    for i,item in enumerate(row):
        if i < 18:
            newrow.append(row[i])
        elif i == 18:
            newword = findrep(row[18], repdict, re)
            newrow.append(newword)
            
            stripped = strip_accents(item,unicodedata)
            stripz.append(stripped)               
         
            newrow.append(''.join(stripz))
            
        else:
            newrow.append(row[i])
    newoverall.append(newrow)
    
    

    
overall = newoverall

overall

[['tag',
  'no',
  'filename',
  'year',
  'region',
  'order',
  'begin',
  'end',
  'duration',
  'interlocutor.no',
  'lg.clause',
  'conversant.code',
  'subcorpus',
  'interlocutor.id',
  'transcriber.id',
  'index.file',
  'filemax',
  'filepercent',
  'utterance',
  'utterance'],
 ['<CLIN-18-68:1>',
  '1',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7491',
  '00:02.6',
  '00:06.8',
  '00:04.3',
  '68',
  'l',
  'A',
  'CLIN',
  'PC0068',
  '1',
  '1',
  '435',
  '0.002298851',
  'Ok, so hîge part threè diaû lo kô, e te lobê part kò si',
  'Ok, so hige part three diau lo ko, e te lobe part ko si'],
 ['<CLIN-18-68:2>',
  '2',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7492',
  '00:07.4',
  '00:09.7',
  '00:02.3',
  '68',
  'l',
  'A',
  'CLIN',
  'PC0068',
  '1',
  '2',
  '435',
  '0.004597701',
  ' yá casual interview lâng ko sī',
  ' ya casual interview lang ko si'],
 ['<CLIN-18-1:3>',
  '3',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7493',
  '00:09.2',
  '00:09.7',
  '00:00.5',

#### Sentence Extractor

In [21]:
# Original: sentencesonly
sentencesonly = []
for sentence in overall:
    sentencesonly.append(sentence[18])

# Duplicate: sentencesonly2
sentencesonly2 = []
for sentence in overallorig:
    sentencesonly2.append(sentence[18])

#### Token Extractor
Tokenizing and converting to splittable format

In [22]:
#Original: This is the original file that we want to put our content back to later
toksenorig = []
for sentence in sentencesonly2:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        tokentag = []
        tokentag.append(token)
        tokentag.append('?')
        tokentag = tuple(tokentag)
        toksenorig.append(tokentag)
    toksenorig.append(('#','<SB'))


#Duplicate/Working Token File
toksen = []
for sentence in sentencesonly:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        #unaccented_token = strip_accents(token,unicodedata) #remove if you want accent
        tokentag = []
        tokentag.append(token)
        tokentag.append('?')
        tokentag = tuple(tokentag)
        toksen.append(tokentag)
    toksen.append(('#','<SB'))
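
The token format produced above is a flat list of `(token, '?')` pairs with a `('#', '<SB')` marker after each utterance. The sketch below reproduces that shape with a simple regular expression so it runs without NLTK data. The regex is a stand-in chosen for this example and will not split text exactly as `nltk.word_tokenize` does; `tokenize_with_placeholders` is a hypothetical name.

```python
import re

# Runs of word characters, or any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_with_placeholders(utterances, boundary=('#', '<SB')):
    """Return (token, '?') pairs with a boundary marker after each utterance."""
    tokens = []
    for utterance in utterances:
        for token in TOKEN_PATTERN.findall(utterance):
            tokens.append((token, '?'))
        tokens.append(boundary)
    return tokens

placeholder_tokens = tokenize_with_placeholders(['Ok, so', 'Hm'])
```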


### Splitting merged tokens using the Word Split Dictionary

In [23]:
dictionarysplit = {}
for k,v in dictsplit:
    dictionarysplit[k]=v

toksencorr = []
for w in toksen:
    if w[0] in dictionarysplit:
        tokens = dictionarysplit[w[0]].split(' ')
        for t in tokens:
            toksencorr.append((t, w[1]))
    else:
        toksencorr.append(w)


toksenorigcorr = []
for w in toksenorig:
    if w[0] in dictionarysplit:
        tokens = dictionarysplit[w[0]].split(' ')
        for t in tokens:
            toksenorigcorr.append((t, w[1]))
    else:
        toksenorigcorr.append(w)
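
Because the same splitting logic is applied to both token lists, it can be factored into a single function. The sketch below shows this with a made-up dictionary entry; `split_merged` is a hypothetical name, and the entry `'gonna' -> 'going to'` is only an example and is not taken from the actual Word Split Dictionary.

```python
def split_merged(tokens, split_dict):
    """Expand tokens found in split_dict into their space-separated parts."""
    result = []
    for word, tag in tokens:
        if word in split_dict:
            # Every part inherits the tag of the merged token.
            result.extend((part, tag) for part in split_dict[word].split(' '))
        else:
            result.append((word, tag))
    return result

example_split_dict = {'gonna': 'going to'}  # hypothetical entry
split_tokens = split_merged(
    [('I', '?'), ('gonna', '?'), ('#', '<SB')], example_split_dict
)
```

Applying one function to both the accented and the unaccented token lists helps keep the two lists the same length, which the later re-accenting step relies on.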


#### Stripping the diacritics

In [24]:
corrtoksencorr = []
for word in toksencorr:
    stripped = strip_accents(word[0],unicodedata)
    corrtoksencorr.append((stripped, word[1]))

#### Use corrected tokens and sentencize them (the pre-tagging data)

In [25]:
predata = list(tosentences(corrtoksencorr))

#### Check length of sentences and words of the pre-tagging data

In [26]:
print("Sentences:", len(predata), "Words:", len(corrtoksencorr))


Sentences: 41917 Words: 493037


#### Extract features from the pre-tagging data

In [27]:
predatafeatx = [sent2features(s,freqdict, operator, unicodedata) for s in predata]
predatafeaty = [sent2gloss(s,freqdict, operator, unicodedata) for s in predata]

#### Tag using the trained model

In [28]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

x_total = [sent2features(s,freqdict, operator, unicodedata) for s in sentences]
y_total = [sent2gloss(s,freqdict, operator, unicodedata) for s in sentences]

crf.fit(x_total, y_total)
labels = list(crf.classes_)

postdata = crf.predict(predatafeatx)


#### Combine the output with unaccented token and autotagged gloss

In [29]:
predsentcor = []
for i, s in enumerate(predatafeatx):
    sentence = []
    for j, w in enumerate(s):
        sentence.append((w['word.lower()'],postdata[i][j]))
    predsentcor.append(sentence)

#### Make unaccented token accented.

In [30]:
list1 = list(tosentences(toksenorigcorr))
list2 = predsentcor
flist1 = towords(list1)
flist2 = towords(list2)
listed1 = [list(elem) for elem in flist1]
listed2 = [list(elem) for elem in flist2]
word = []
for sentence in listed1:
    word.append(sentence[0])
gloss = []
for sentence in listed2:
    gloss.append(sentence[1])
final = list(zip(word,gloss))
accentpredsentcor = list(tosentences(final))
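
The re-accenting step pairs the i-th accented token with the i-th predicted tag, so it only works if both flat lists have the same length. Since `zip` silently truncates to the shorter input, a length check can catch misalignment early. The sketch below illustrates this; `restore_accents` is a hypothetical name and the sample tokens are illustrative.

```python
def restore_accents(accented_tokens, tagged_unaccented):
    """Pair each original (accented) word with the tag predicted for its
    unaccented counterpart at the same position."""
    if len(accented_tokens) != len(tagged_unaccented):
        raise ValueError(
            'token lists are misaligned: %d vs %d'
            % (len(accented_tokens), len(tagged_unaccented))
        )
    return [
        (accented, tag)
        for accented, (_plain, tag) in zip(accented_tokens, tagged_unaccented)
    ]

accented = ['hîge', 'part', 'kô']
tagged = [('hige', 'DEM'), ('part', 'NC'), ('ko', 'PRT')]
reaccented = restore_accents(accented, tagged)
```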

### Revise the New Data (Unseen Data)

Put the output of `predsentcor` into `overall`.

#### Turn into list of strings first

In [31]:
#With Accent
taggedsent_withaccent = []
for sent in accentpredsentcor:
    collector = []
    for tw in sent:
        #collector.append(tw[0]+'/'+tw[1])
        #collector.append('<'+tw[1]+'>'+ tw[0]+'<'+'/'+tw[1]+'>')
        collector.append(tw[0]+'_'+tw[1])
    collector.append('.')
    joined = ' '.join(collector)
    taggedsent_withaccent.append(joined)

    
#Without Accent
taggedsent = []
for sent in predsentcor:
    collector = []
    for tw in sent:
        #collector.append(tw[0]+'/'+tw[1])
        #collector.append('<'+tw[1]+'>'+ tw[0]+'<'+'/'+tw[1]+'>')
        collector.append(tw[0]+'_'+tw[1])
    collector.append('.')
    joined = ' '.join(collector)
    taggedsent.append(joined)
    
#With Accent + Language
taggedsent_withaccent_lg = []
for sent in accentpredsentcor:
    collector = []
    for tw in sent:
        hokkienlist = ['la','di','in','bo','ho','ya','o','ko','ka',
                       'an','e','ti','lo','tui','i','u','sang', 'si',
                      'lan', 'ma','kai',
                      'tsia','a', 'nan', 'lai','ai', 'it', 'kana','lak','tio','gun',
                      'La','Di','In','Bo','Ho','Ya','O','Ko','Ka',
                       'An','E','Ti','Lo','Tui','I','U','Sang', 'Si',
                      'Lan', 'Ma','Kai',
                      'Tsia','A', 'Nan', 'Lai','Ai', 'It', 'Kana','Lak','Tio','Gun']
        fillerlist = ['uhm','uhhm','umm','mm','mmm','hm','hmm','hmmm', 'uh','Uhm','Uhhm','Umm','Mm','Mmm','Hm','Hmm','Hmmm', 'Uh']
        punclist = list(stringg.punctuation)
        punclist.append('--')
        punclist.append('...')
        
        if tw[0] in punclist:
            collector.append(tw[0]+'_P_'+tw[1])
        elif tw[0] in fillerlist:
            collector.append(tw[0]+'_X_'+tw[1])
        elif tw[0] in hokkienlist:
            collector.append(tw[0]+'_H_'+tw[1])     
        elif tw[0] in tag_dic:
            collector.append(tw[0]+'_T_'+tw[1])
        elif tw[0] in eng_dic:
            collector.append(tw[0]+'_E_'+tw[1])
        else:
            collector.append(tw[0]+'_H_'+tw[1])
    collector.append('.')
    joined = ' '.join(collector)
    taggedsent_withaccent_lg.append(joined)



    
    
    
#Without Accent + Language
taggedsent_lg = []
for sent in predsentcor:
    collector = []
    for tw in sent:
    
        hokkienlist = ['la','di','in','bo','ho','ya','o','ko','ka',
                       'an','e','ti','lo','tui','i','u','sang', 'si',
                      'lan', 'ma','kai',
                      'tsia','a', 'nan', 'lai','ai', 'it', 'kana','lak','tio','gun',
                      'La','Di','In','Bo','Ho','Ya','O','Ko','Ka',
                       'An','E','Ti','Lo','Tui','I','U','Sang', 'Si',
                      'Lan', 'Ma','Kai',
                      'Tsia','A', 'Nan', 'Lai','Ai', 'It', 'Kana','Lak','Tio','Gun']
        fillerlist = ['uhm','uhhm','umm','mm','mmm','hm','hmm','hmmm', 'uh','Uhm','Uhhm','Umm','Mm','Mmm','Hm','Hmm','Hmmm', 'Uh']
        punclist = list(stringg.punctuation)
        punclist.append('--')
        punclist.append('...')
        
        if tw[0] in punclist:
            collector.append(tw[0]+'_P_'+tw[1])
        elif tw[0] in fillerlist:
            collector.append(tw[0]+'_X_'+tw[1])
        elif tw[0] in hokkienlist:
            collector.append(tw[0]+'_H_'+tw[1])     
        elif tw[0] in tag_dic:
            collector.append(tw[0]+'_T_'+tw[1])
        elif tw[0] in eng_dic:
            collector.append(tw[0]+'_E_'+tw[1])
        else:
            collector.append(tw[0]+'_H_'+tw[1])
    collector.append('.')
    joined = ' '.join(collector)
    taggedsent_lg.append(joined)
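
The source-language rule used in the two language-tagging loops above is a fixed priority order: punctuation (`P`), fillers (`X`), a short list of Hokkien function words (`H`), the Tagalog dictionary (`T`), the English dictionary (`E`), and Hokkien as the fallback. The sketch below expresses that rule as one function with the word lists built once as sets. The word lists here are small subsets, the `tagalog_words` and `english_words` arguments are toy stand-ins for `tag_dic` and `eng_dic`, and lowercasing before lookup is a design choice of this sketch (the cell above lists capitalized variants explicitly instead).

```python
import string

HOKKIEN_WORDS = {'la', 'di', 'ko', 'si', 'lo', 'e'}   # subset for illustration
FILLERS = {'uhm', 'hm', 'hmm', 'uh'}                  # subset for illustration
PUNCTUATION = set(string.punctuation) | {'--', '...'}

def source_language(token, tagalog_words, english_words):
    """Return P, X, H, T, or E following the priority order described above."""
    lowered = token.lower()
    if token in PUNCTUATION:
        return 'P'
    if lowered in FILLERS:
        return 'X'
    if lowered in HOKKIEN_WORDS:
        return 'H'
    if lowered in tagalog_words:
        return 'T'
    if lowered in english_words:
        return 'E'
    return 'H'  # unknown words default to Hokkien

def format_token(word, pos, tagalog_words, english_words):
    """Build the word_LANGUAGE_POS string used in the tagged output."""
    return word + '_' + source_language(word, tagalog_words, english_words) + '_' + pos

toy_tagalog = {'lang'}
toy_english = {'part', 'so'}
formatted = format_token('part', 'NC', toy_tagalog, toy_english)
```

Building the lists once outside the loop also avoids recreating them for every token, which matters at the scale of several hundred thousand tokens.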


#### Replace

In [32]:
for k,v in enumerate(overall):
    v.append(taggedsent_withaccent[k])
    v.append(taggedsent[k])
    v.append(taggedsent_withaccent_lg[k])
    v.append(taggedsent_lg[k])

overall

[['tag',
  'no',
  'filename',
  'year',
  'region',
  'order',
  'begin',
  'end',
  'duration',
  'interlocutor.no',
  'lg.clause',
  'conversant.code',
  'subcorpus',
  'interlocutor.id',
  'transcriber.id',
  'index.file',
  'filemax',
  'filepercent',
  'utterance',
  'utterance',
  'utterance_PRT .',
  'utterance_PRT .',
  'utterance_E_PRT .',
  'utterance_E_PRT .'],
 ['<CLIN-18-68:1>',
  '1',
  'PC0001-CLIN18.eaf',
  '18',
  'MNL',
  '7491',
  '00:02.6',
  '00:06.8',
  '00:04.3',
  '68',
  'l',
  'A',
  'CLIN',
  'PC0068',
  '1',
  '1',
  '435',
  '0.002298851',
  'Ok, so hîge part threè diaû lo kô, e te lobê part kò si',
  'Ok, so hige part three diau lo ko, e te lobe part ko si',
  'Ok_YNM ,_<C so_COOCJ hîge_DEM part_NC threè_COOCJ diaû_ADV lo_PERF kô_PRT ,_<C e_MDL te_ADJ lobê_ADJ part_NC kò_PRT si_COP .',
  'ok_YNM ,_<C so_COOCJ hige_DEM part_NC three_COOCJ diau_ADV lo_PERF ko_PRT ,_<C e_MDL te_ADJ lobe_ADJ part_NC ko_PRT si_COP .',
  'Ok_E_YNM ,_P_<C so_E_COOCJ hîge_H_DEM p

#### Export

Write `overall` to a tab-separated text file. The output file will be in the directory of this notebook.

In [38]:
with open('output.txt','w',encoding='utf-8-sig') as out:
    for row in overall:
        # each row holds 24 string columns; write them tab-separated
        out.write('\t'.join(row)+'\n')
        
overall


### Word column

In [34]:
overall

simp = []
for index,row in enumerate(overall):
    simp.append((index,row[0],row[13],row[22]))
    
wordsimp = []
for index,tag,iden,utterance in simp:
    for word in utterance.split():
        if len(list(word.split('_'))) < 2:
            language = '?'
            spword = '?'
            POSword = '?'
            tup = [tag,iden,index,language, spword,POSword]
            wordsimp.append(tup)
        else:
            language = list(word.split('_'))[1]
            spword = list(word.split('_'))[0]
            POSword = list(word.split('_'))[2]
            tup = [tag,iden,index,language,spword,POSword]
            wordsimp.append(tup)
wordsimp = wordsimp[1:]

wordsimp[0][0] = 'tag' 
wordsimp[0][1] = 'id' 
wordsimp[0][2] = 'sentencenumber' 
wordsimp[0][3] = 'language'
wordsimp[0][4] = 'word' 
wordsimp[0][5] = 'POS' 

wordsimp

[['tag', 'id', 'sentencenumber', 'language', 'word', 'POS'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'E', 'Ok', 'YNM'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'P', ',', '<C'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'E', 'so', 'COOCJ'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'hîge', 'DEM'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'E', 'part', 'NC'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'threè', 'COOCJ'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'diaû', 'ADV'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'lo', 'PERF'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'kô', 'PRT'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'P', ',', '<C'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'e', 'MDL'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'E', 'te', 'ADJ'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'lobê', 'ADJ'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'E', 'part', 'NC'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'kò', 'PRT'],
 ['<CLIN-18-68:1>', 'PC0068', 1, 'H', 'si', 'COP'],
 ['<CLIN-18-68:1>', 'PC0068', 1, '?', '?', '?'],
 ['<CLIN-18-68:2>', 'PC0068', 2, 'H', 'yá', 'IN
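
Each tagged token has the form `word_LANGUAGE_POS`, and the sentence-final `.` carries no tags, which is why rows of `?` appear in the output above. The sketch below parses one such token with `rsplit`, which splits from the right so that a word that itself contains an underscore is kept whole. `parse_tagged_token` is a hypothetical name, and it returns the fields in the order (word, language, POS).

```python
def parse_tagged_token(token):
    """Split 'word_LANGUAGE_POS' into its three parts, or return '?' placeholders."""
    parts = token.rsplit('_', 2)
    if len(parts) != 3:
        return ('?', '?', '?')
    word, language, pos = parts
    return (word, language, pos)

parsed = [parse_tagged_token(t) for t in 'hîge_H_DEM ,_P_<C .'.split()]
```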

#### Export

In [35]:
with open('out_wordcolumn.txt','w',encoding='utf-8-sig') as out:
    for a,b,c,d,e,f in wordsimp:
        out.write(a + '\t' + str(b) +'\t' + str(c) +'\t'+ str(d) +'\t'+ str(e) +'\t' + str(f) +'\n')
        

### Make into paragraph string (if needed)

In [33]:
collector = []
for row in overall:
    collector.append(row[0])
    collector.append(row[18])
    collector.append('#')
final = " ".join(collector)


with open('output_txt_nodiac_tagged.txt','w',encoding='utf-8-sig') as out:
    for row in overall:
        # row[21]: tagged utterance without diacritics
        out.write(row[0] + '\n'+ row[21] +'\n' +'\n')
        
with open('output_txt_diac_tagged.txt','w',encoding='utf-8-sig') as out:
    for row in overall:
        # row[20]: tagged utterance with diacritics
        out.write(row[0] + '\n'+ row[20] +'\n' +'\n')


In [37]:
final

