# Using regex to explore consonant clusters in WikiPron


The goal of this homework is to identify, count, and compare <b>word-initial and word-final consonant clusters</b> in two languages: English and Arabic.


- Created by Geoff Bacon; lightly touched up by Terry Regier, Aug 30 2020.
- [WikiPron](https://github.com/kylebgorman/wikipron) is a great project, [check it out](https://www.aclweb.org/anthology/2020.lrec-1.521.pdf)!

In [1]:
import re

In [2]:
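# Read in English words (one space-separated pronunciation per line).
# The IPA symbols are non-ASCII, so the encoding is given explicitly.
with open("data-WP/eng_rp.txt", encoding="utf-8") as file:
    eng_words = file.read().split("\n")
# Uncomment the following lines if you want to look at the words before processing them.
#for w in eng_words:
#    print(w)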

In [3]:
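# Read in Arabic words (one space-separated pronunciation per line).
# The IPA symbols are non-ASCII, so the encoding is given explicitly.
with open("data-WP/ara.txt", encoding="utf-8") as file:
    ara_words = file.read().split("\n")
# Uncomment the following lines if you want to look at the words before processing them.
#for w in ara_words:
#    print(w)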
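
One detail of `file.read().split("\n")` is worth knowing: if the file ends with a newline, the resulting list has an empty string as its last element, and that element is included when `len()` is taken later on. The following is a minimal sketch of a loader that skips blank lines, assuming one space-separated pronunciation per line. The helper name `read_pronunciations` is hypothetical, and the demo uses a small temporary file rather than the real data.

```python
import os
import tempfile


def read_pronunciations(path):
    """Return the non-blank lines of a file, one pronunciation per line."""
    with open(path, encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]


# Demo on a small temporary file that ends with a newline.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("s t a ɪ l\nl ɛ t s\n")
    demo_path = tmp.name

# The approach used in the cells above: keeps a trailing empty string.
with open(demo_path, encoding="utf-8") as file:
    naive = file.read().split("\n")

# The hypothetical helper: blank lines are dropped.
cleaned = read_pronunciations(demo_path)
os.remove(demo_path)

print(len(naive), len(cleaned))
```

With thousands of words the extra element changes the percentages only slightly, so this is a refinement rather than a correction.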

In [4]:
# We simplify the task by focusing only on consonant clusters made up of the two phonemes /s/ and /t/.
# Phonemes in the data are separated by single spaces, so each additional consonant is matched as " [st]".

# A regex for consonant clusters of /s/ and/or /t/ anywhere in the word: one [st], then one or more " [st]".
cc_RE = r"[st](?: [st])+"
# The same pattern anchored with ^ to match WORD-INITIAL clusters only.
wicc_RE = r"^[st](?: [st])+"
# The same pattern anchored with $ to match WORD-FINAL clusters only.
wfcc_RE = r"[st](?: [st])+$"
print(cc_RE)
print(wicc_RE)
print(wfcc_RE)

[st](?: [st])+
^[st](?: [st])+
[st](?: [st])+$
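
To see what each pattern picks up before running it on the full word lists, it can help to try the three regexes on a handful of pronunciations. The sketch below uses three forms that appear in the English output further down, plus `k æ t` as an assumed example of a word with no cluster. The `results` dictionary is only an illustration device.

```python
import re

cc_RE = r"[st](?: [st])+"
wicc_RE = r"^[st](?: [st])+"
wfcc_RE = r"[st](?: [st])+$"

samples = ["s t a ɪ l ɪ s t", "p ə ʊ s t s", "l ɛ t s", "k æ t"]

# For each word: all clusters found, plus whether one is initial or final.
results = {}
for w in samples:
    results[w] = (
        re.findall(cc_RE, w),
        bool(re.search(wicc_RE, w)),
        bool(re.search(wfcc_RE, w)),
    )
    print(w, "\t", results[w])
```

Note that `re.findall` returns every non-overlapping match, so a word such as `s t a ɪ l ɪ s t` yields two clusters, while the anchored patterns can match at most once per word.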


In [5]:
# Now search for consonant clusters in the English words you read in.
# Print out - and count - what you find.
cc_count = 0
wicc_count = 0
wfcc_count = 0
for w in eng_words:
    cc_results = re.findall(cc_RE, w)
    wicc_results = re.findall(wicc_RE, w)
    wfcc_results = re.findall(wfcc_RE, w)
    if cc_results:
        cc_count += 1
        print(w, "\t", cc_results, end="")
        if wicc_results:
            wicc_count += 1
            print('  initial  ', end="")
        if wfcc_results:
            wfcc_count += 1
            print('  final  ', end="")
        print()

ɡ æ n s t 	 ['s t']  final  
t w ɪ k s t 	 ['s t']  final  
ɑː l s t 	 ['s t']  final  
æ b ə k ɒ s t 	 ['s t']  final  
ə b e ɪ s t 	 ['s t']  final  
ə b e ɪ t s 	 ['t s']  final  
æ b ə t ɪ s t 	 ['s t']  final  
æ b ə t s 	 ['t s']  final  
ə b d j uː s t 	 ['s t']  final  
æ b ə ɹ ɪ s t w ɪ θ 	 ['s t']
ə b ɛ t s 	 ['t s']  final  
ə b h ɒ ɹ n̩ t s 	 ['t s']  final  
ə b a ɪ d ɪ s t 	 ['s t']  final  
æ b ə l ɪ ʃ n̩ ɪ s t 	 ['s t']  final  
æ b ə l ɪ ʃ n̩ ɪ s t s 	 ['s t s']  final  
ə b ɹ ɛ s t 	 ['s t']  final  
æ b ɹ ə ɡ e ɪ ʃ n̩ ɪ s t 	 ['s t']  final  
æ b s n̩ t s 	 ['t s']  final  
æ b s ə l j uː t ɪ s t 	 ['s t']  final  
æ b s ə l uː t ɪ s t 	 ['s t']  final  
ə b s t e ɪ n 	 ['s t']
æ b s t iː m ɪ ə s 	 ['s t']
æ b s t i m i ə s l i 	 ['s t']
ə b s t i m i ə s l i 	 ['s t']
ə b s t ɛ n ʃ n̩ 	 ['s t']
ə b s t ɛ n ʃ n̩ ɪ z m̩ 	 ['s t']
ə b s t ɛ n ʃ n̩ ɪ s t 	 ['s t', 's t']  final  
ə b s t ɜː ɹ d͡ʒ 	 ['s t']
ə b s t ɝː ɹˑ d͡ʒ n̩ t 	 ['s t']
ə b s t ɜː ɹ s 	 ['s t']
æ b s 

ɔː d ɪ t s 	 ['t s']  final  
ɔː t s 	 ['t s']  final  
ɔː ɡ ə s t 	 ['s t']  final  
ɔː ɡ ʌ s t 	 ['s t']  final  
ə ɡ ʌ s t ə 	 ['s t']
ɔː ɡ ʌ s t ə n 	 ['s t']
a ʊ ɡ ʊ s t 	 ['s t']  final  
ɔː ɡ ə s t iː n 	 ['s t']
ɔː ɡ ʌ s t ɪ n 	 ['s t']
ɔ ɡ ə s t s 	 ['s t s']  final  
ɔː ɡ ʌ s t ə s 	 ['s t']
a ʊ̯ ʃ v ɪ t s 	 ['t s']  final  
a ʊ̯ ʃ w ɪ t s 	 ['t s']  final  
ɒ s t ɪ n 	 ['s t']
ɔː s t ə n ɛ s k 	 ['s t']
ɔː s t iː n ɪ ə n 	 ['s t']
ɒ s t ɪ n a ɪ t 	 ['s t']
ɒ s t ə 	 ['s t']
ɒ s t ə ɹ 	 ['s t']
ɔː s t ə 	 ['s t']
ɔː s t ə ɹ 	 ['s t']
ɒ s t ɪ ə ɹ 	 ['s t']
ɔː s t ɪ ə ɹ 	 ['s t']
a ʊ s t ə l ɪ t s 	 ['s t', 't s']  final  
ɔː s t ə l ɪ t s 	 ['s t', 't s']  final  
ɒ s t ɪ n 	 ['s t']
ɒ s t ɪ n a ɪ t 	 ['s t']
ɒ s t ɪ n a ɪ t s 	 ['s t', 't s']  final  
ɒ s t ɹ ə l 	 ['s t']
ɔː s t ɹ ə l 	 ['s t']
ɒ s t ɹ ə l e ɪ ʒ ə 	 ['s t']
ɒ s t ɹ e ɪ l i j ə 	 ['s t']
ɒ s t ɹ e ɪ l iː ə 	 ['s t']
ɔː s t ɹ e ɪ l ɪ j ə 	 ['s t']
ɔː s t ɹ e ɪ l ɪ ə 	 ['s t']
ɒ s t ɹ e ɪ l i j ə n 	 ['s t']
ɒ 

k ɒ n s t ə t ɪ v 	 ['s t']
k ə n s t e ɪ t ɪ v 	 ['s t']
k ɒ n s t ə l e ɪ ʃ ə n 	 ['s t']
k ɒ n s t ə l e ɪ ʃ ə n ə l i 	 ['s t']
k ɒ n s t ə n e ɪ ʃ ə n 	 ['s t']
k ɒ n s t ɪ p e ɪ t 	 ['s t']
k ɒ n s t ə p e ɪ t ə d 	 ['s t']
k ə n s t ɪ t j u ə n s ɨ z 	 ['s t']
k ə n s t ɪ t ʃ ʊ ə n s ɨ z 	 ['s t']
k ɒ n s t ɪ t j uː t 	 ['s t']
k ɒ n s t ɪ t j uː t ɪ d 	 ['s t']
k ɒ n s t ɪ t j uː t s 	 ['s t', 't s']  final  
k ɒ n s t ɪ t j uː ʃ ə n 	 ['s t']
k ɒ n s t ɪ t ʃ uː ʃ ə n 	 ['s t']
k ɒ n s t ɪ t j uː ʃ ə n ə l 	 ['s t']
k ɒ n s t ɪ t j uː ʃ ə n ə l ɪ z ə m 	 ['s t']
k ɒ n s t ɪ t j uː ʃ ə n ə l i 	 ['s t']
k ə n s t ɹ e ɪ n 	 ['s t']
k ə n s t ɹ e ɪ n d 	 ['s t']
k ə n s t ɹ e ɪ n t 	 ['s t']
k ə n s t ɹ e ɪ n t s 	 ['s t', 't s']  final  
k ə n s t ɹ ɪ k t 	 ['s t']
k ə n s t ɹ uː ə l 	 ['s t']
k ɒ n s t ɹ ʌ k t 	 ['s t']
k ə n s t ɹ ʌ k t 	 ['s t']
k ə n s t ɹ ʌ k t ə d 	 ['s t']
k ə n s t ɹ ʌ k ʃ ə n 	 ['s t']
k ə n s t ɹ ʌ k t ɪ v 	 ['s t']
k ə n s t ɹ uː 	 ['s t']
k ə n s t ɹ 

f ɑː s t ə 	 ['s t']
f æ s t ɪ d i ə s 	 ['s t']
f ə s t ɪ d i ə s 	 ['s t']
f æ s t ɪ d ʒ i ɪ t 	 ['s t']
f æ s t n ə s 	 ['s t']
f e ɪ t s 	 ['t s']  final  
f ɒ l t s 	 ['t s']  final  
f ɔː l t s 	 ['t s']  final  
f a ʊ s t ɪ ə n 	 ['s t']
f ə ʊ s ɛ s t 	 ['s t']  final  
f iː s t 	 ['s t']  final  
f ɛ d ə ɹ ə l ɪ s t 	 ['s t']  final  
f ɛ d ɹ ə l ɪ s t 	 ['s t']  final  
f a ɪ s t 	 ['s t']  final  
f a ɪ s t i 	 ['s t']
f ɛ m ə n ɪ s t 	 ['s t']  final  
f ɛ m ə n ɪ s t s 	 ['s t s']  final  
f ɛ s t 	 ['s t']  final  
f ɛ s t 	 ['s t']  final  
f ɛ s t ə ɹ 	 ['s t']
f ɛ s t ə v ə l 	 ['s t']
f ɛ s t ɪ v 	 ['s t']
f ɛ s t ɪ v ə s 	 ['s t']
f ɛ s t uː n 	 ['s t']
f ɛ s t ʃ ɹ ɪ f t 	 ['s t']
f ɛ t ɪ ʃ ɪ s t 	 ['s t']  final  
f j uː d ə l ɪ s t 	 ['s t']  final  
f ɪ d l̩ s t ɪ k s 	 ['s t']
f ɪ ɛ s t ə 	 ['s t']
f ɪ f t s 	 ['t s']  final  
f a ɪ t s 	 ['t s']  final  
f ɪ l ɪ b ʌ s t ə ɹ 	 ['s t']
f a ɪ n ɪ s t 	 ['s t']  final  
f ɪ ŋ ɡ ə ɹ p ə ʊ s t 	 ['s t']  final  
f ɜː s

l iː t s p iː k 	 ['t s']
l ɛ f t ɪ s t 	 ['s t']  final  
l iː d ʒ ɪ s t 	 ['s t']  final  
l a ɪ b n ɪ t s 	 ['t s']  final  
l a ɪ p n ɪ t s 	 ['t s']  final  
l ɛ s t ə 	 ['s t']
l ɛ s t ə ɹ ʃ ə ɹ 	 ['s t']
l ɛ s t ɹ ɪ ə n 	 ['s t']
l ɛ n s t ə ɹ 	 ['s t']
l iː s t ə 	 ['s t']
l ɛ n ə n ɪ s t 	 ['s t']  final  
l ɛ n ɪ n ɪ s t 	 ['s t']  final  
l ɛ m s t ə 	 ['s t']
l ɛ s t 	 ['s t']  final  
l ɛ t s 	 ['t s']  final  
l ɛ t o ʊ v ɪ t s a ɪ t 	 ['t s']
l ɛ t s 	 ['t s']  final  
l uː k ə ʊ ɪ ɹ ɪ θ ɹ ə ʊ b l æ s t ə ʊ s ɪ s 	 ['s t']
l e ɪ v ə ɹ p ɒ s t a ɪ 	 ['s t']
l e ɪ t ə n s t ə ʊ n 	 ['s t']
l a ɪ s n̩ s t 	 ['s t']  final  
l ɪ k t ə n s t a ɪ n 	 ['s t']
l a ɪ f s t a ɪ l 	 ['s t']
l a ɪ f s t a ɪ l ə ɹ 	 ['s t']
l a ɪ t s 	 ['t s']  final  
l a ɪ t s e ɪ b ə ɹ 	 ['t s']
l ʌ ɪ t s ə m 	 ['t s']
l ɪ m ɪ t s 	 ['t s']  final  
l i ŋ ɡ w ɪ s t 	 ['s t']  final  
l ɪ ŋ ɡ w ɪ s t ə ɹ 	 ['s t']
l ɪ ŋ ɡ w ɪ s t ɪ k 	 ['s t']
l ɪ ŋ ɡ w ɪ s t ɪ k s 	 ['s t']
l ɪ p s t ɪ k 	 ['s t']

p ɒ s t ɪ k 	 ['s t']
p ɒ s t iː ʃ 	 ['s t']
p ə ʊ s t i 	 ['s t']
p ɒ s t ə l 	 ['s t']
p ɒ s t ɪ l 	 ['s t']
p ɒ s t ɪ l ɪ ə n 	 ['s t']
p ə ʊ s t ɪ ŋ 	 ['s t']
p ə ʊ s t ɪ ŋ z 	 ['s t']
p ə ʊ s t d ʒ ʊ d ɪ s 	 ['s t']
p ə ʊ s t l æ p s ɛ ə ɹ ɪ ə n 	 ['s t']
p ə ʊ s t l uː d 	 ['s t']
p ə ʊ s t m ə n 	 ['s t']
p ə ʊ s t n a ʃ ə n ə l 	 ['s t']
p ə ʊ s t ɒ p ə ɹ ə t ɪ v 	 ['s t']
p ə ʊ s t p ɹ a n d ɪ ə l 	 ['s t']
p ə ʊ s t p ɹ a n d ɪ ə l i 	 ['s t']
p ɒ s t ɹ iː m ə ʊ d ʒ ɛ n ɪ t j ʊ ə 	 ['s t']
p ɒ s t ɹ iː m ə ʊ d ʒ ɛ n ɪ t ʃ ə 	 ['s t']
p ə ʊ s t s 	 ['s t s']  final  
p ə ʊ s t s k ɹ ɪ p t ə m 	 ['s t s']
p ɒ s t j ʊ l ə n t 	 ['s t']
p ɒ s t ʃ ʊ l ə n t 	 ['s t']
p ɒ s t j ʊ l e ɪ t ə 	 ['s t']
p ɒ s t j ʊ l ɑː t ə 	 ['s t']
p ɒ s t j ʊ l e ɪ t 	 ['s t']
p ɒ s t j ʊ l ə t 	 ['s t']
p ɒ s t j ʊ l e ɪ t ə ɹ i 	 ['s t']
p ɒ s t ʃ ə ɹ ə l 	 ['s t']
p ɒ s t ʃ ə 	 ['s t']
p ə ʊ s t v ə ʊ k æ l ɪ k 	 ['s t']
p o ʊ t i t s ə 	 ['t s']
p ɒ t s 	 ['t s']  final  
p ɒ t s d æ m 	 ['t s']

s t e ɪ ʃ ə n ə ɹ i 	 ['s t']  initial  
s t ɛ ɪ ʃ ə n d 	 ['s t']  initial  
s t e ɪ ʃ ə n ə ɹ i 	 ['s t']  initial  
s t e ɪ ʃ ə n z 	 ['s t']  initial  
s t e ɪ t ɪ z ə m 	 ['s t']  initial  
s t e ɪ t ɪ s t 	 ['s t', 's t']  initial    final  
s t ə t ɪ s t ɪ k ə l 	 ['s t', 's t']  initial  
s t ə t ɪ s t ɪ k l̩ i 	 ['s t', 's t']  initial  
s t æ t ɪ s t ɪ ʃ n̩ 	 ['s t', 's t']  initial  
s t ə t ɪ s t ɪ k s 	 ['s t', 's t']  initial  
s t e ɪ t ɪ v 	 ['s t']  initial  
s t a t ə ʊ s ɪ s t 	 ['s t', 's t']  initial    final  
s t e ɪ t ɔ ɪ d 	 ['s t']  initial  
s t a t j ʊ ə 	 ['s t']  initial  
s t a t ʃ ʊ ə 	 ['s t']  initial  
s t æ t ʃ ʊ ə ɹ i 	 ['s t']  initial  
s t æ t j uː 	 ['s t']  initial  
s t æ t ʃ uː 	 ['s t']  initial  
s t a t j ʊ ɛ s k 	 ['s t']  initial  
s t a t ʃ ʊ ɛ s k 	 ['s t']  initial  
s t æ t ʃ ə 	 ['s t']  initial  
s t e ɪ t ə s 	 ['s t']  initial  
s t æ t ʃ uː t 	 ['s t']  initial  
s t æ t j ʊ t ə ɹ ɪ 	 ['s t']  initial  
s t a t ʃ uː v ə l ɪ z m̩

s t a ɪ 	 ['s t']  initial  
s t ɪ d ʒ i ə n 	 ['s t']  initial  
s t a ɪ l 	 ['s t']  initial  
s t a ɪ l ʃ iː t 	 ['s t']  initial  
s t a ɪ l a ɪ 	 ['s t']  initial  
s t a ɪ l ɪ s t 	 ['s t', 's t']  initial    final  
s t a ɪ l ɪ s t ɪ k 	 ['s t', 's t']  initial  
s t ʌ ɪ l ʌ ɪ t 	 ['s t']  initial  
s t a ɪ l a ɪ t iː z 	 ['s t']  initial  
s t ʌ ɪ l ʌ ɪ t iː z 	 ['s t']  initial  
s t ʌ ɪ l ʌ ɪ t s 	 ['s t', 't s']  initial    final  
s t a ɪ l ə f ə ʊ n 	 ['s t']  initial  
s t a ɪ l ə s 	 ['s t']  initial  
s t a ɪ m i 	 ['s t']  initial  
s t ɪ p t ɪ k 	 ['s t']  initial  
s t a ɪ ð 	 ['s t']  initial  
s t ɪ k s 	 ['s t']  initial  
s t ɜː d 	 ['s t']  initial  
s ʌ b k ə m m ɪ t t i 	 ['t t']
s ʌ b d ʒ ɛ k t s 	 ['t s']  final  
s ʌ b p ɛ ɹ i ɒ s t ɪ ə l 	 ['s t']
s ə b s ɪ s t 	 ['s t']  final  
s ə b s ɪ s t ə n s 	 ['s t']
s ʌ b s t æ n d ə d 	 ['s t']
s ə b s t æ n ʃ ə l 	 ['s t']
s ə b s t a n ʃ ɪ e ɪ t 	 ['s t']
s ʌ b s t ɪ t j u t 	 ['s t']
s ʌ b s t ɪ t u t 	 ['s t

In [6]:
num_words = len(eng_words)
print((cc_count/num_words)*100, "percent of words had consonant clusters containing /s/ and/or /t/.")
print((wicc_count/num_words)*100, "percent of words had word-initial consonant clusters with /s/ and/or /t/.")
print((wfcc_count/num_words)*100, "percent of words had word-final consonant clusters with /s/ and/or /t/.")

8.486987207763566 percent of words had consonant clusters containing /s/ and/or /t/.
1.388619320688134 percent of words had word-initial consonant clusters with /s/ and/or /t/.
2.6272606969563297 percent of words had word-final consonant clusters with /s/ and/or /t/.


In [7]:
# Now search for consonant clusters in the Arabic words you read in.
# Print out - and count - what you find.
cc_count = 0
wicc_count = 0
wfcc_count = 0
for w in ara_words:
    cc_results = re.findall(cc_RE, w)
    wicc_results = re.findall(wicc_RE, w)
    wfcc_results = re.findall(wfcc_RE, w)
    if cc_results:
        cc_count += 1
        print(w, "\t", cc_results, end="")
        if wicc_results:
            wicc_count += 1
            print('  initial  ', end="")
        if wfcc_results:
            wfcc_count += 1
            print('  final  ', end="")
        print()

ʔ a t t uː n 	 ['t t']
ʔ u ħ a s s a n 	 ['s s']
ʔ u ħ a s s a n a 	 ['s s']
ʔ u ħ a s s a n u 	 ['s s']
ʔ u ħ a s s i n 	 ['s s']
ʔ u ħ a s s i n a 	 ['s s']
ʔ u ħ a s s i n u 	 ['s s']
ʔ u s t aː z 	 ['s t']
ʔ u s t aː ð 	 ['s t']
ʔ u s t u r aː l i j j 	 ['s t']
ʔ u s t u r aː l i j aː 	 ['s t']
ʔ a s tˤ u r l aː b 	 ['s t']
ʔ u s tˤ u w aː n a h 	 ['s t']
ʔ u s tˤ uː r a 	 ['s t']
ʔ u s tˤ uː l 	 ['s t']
ʔ a ɣ u s tˤ u s 	 ['s t']
ʔ a f ɣ aː n i s t aː n 	 ['s t']
ʔ a m a s s a 	 ['s s']
ʔ a m a s s i 	 ['s s']
ʔ a m a s s u 	 ['s s']
ʔ u m a s s a 	 ['s s']
ʔ u m a s s i 	 ['s s']
ʔ u m a s s u 	 ['s s']
ʔ o r k e s t r aː 	 ['s t']
ʔ uː z b a k i s t aː n 	 ['s t']
ʔ i s t aː d 	 ['s t']
ʔ i s t i d j oː 	 ['s t']
ʔ i s t uː d i j oː 	 ['s t']
ʔ i s t oː n i j aː 	 ['s t']
ʔ i s tˤ a b l 	 ['s t']
i t t i ħ aː d 	 ['t t']
i t t a dˤ a ħ a 	 ['t t']
i t t a ʕ a d a 	 ['t t']
i t t i h aː m 	 ['t t']
i s t a ʔ n a f a 	 ['s t']
i s t a b ħ a r a 	 ['s t']
i s t a b a d d a 	 ['s t'

In [8]:
num_words = len(ara_words)
print((cc_count/num_words)*100, "percent of words had consonant clusters containing /s/ and/or /t/.")
print((wicc_count/num_words)*100, "percent of words had word-initial consonant clusters with /s/ and/or /t/.")
print((wfcc_count/num_words)*100, "percent of words had word-final consonant clusters with /s/ and/or /t/.")

3.0374289633548894 percent of words had consonant clusters containing /s/ and/or /t/.
0.0 percent of words had word-initial consonant clusters with /s/ and/or /t/.
0.03919263178522438 percent of words had word-final consonant clusters with /s/ and/or /t/.
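
The English and Arabic cells run the same counting logic twice. One possible way to avoid the repetition is a small helper that takes any list of pronunciations and returns the three percentages. This is only a sketch: the function name `cluster_percentages` and the toy word list are assumptions for illustration, not part of the homework.

```python
import re

cc_RE = r"[st](?: [st])+"
wicc_RE = r"^[st](?: [st])+"
wfcc_RE = r"[st](?: [st])+$"


def cluster_percentages(words):
    """Return (any, initial, final) cluster percentages for a word list."""
    n = len(words)
    if n == 0:
        return (0.0, 0.0, 0.0)
    any_count = sum(1 for w in words if re.search(cc_RE, w))
    initial_count = sum(1 for w in words if re.search(wicc_RE, w))
    final_count = sum(1 for w in words if re.search(wfcc_RE, w))
    return tuple(100 * c / n for c in (any_count, initial_count, final_count))


# Toy data: three of the four words contain a cluster.
toy_words = ["s t a ɪ l", "l ɛ t s", "k æ t", "p ə ʊ s t s"]
toy_result = cluster_percentages(toy_words)
print(toy_result)
```

Calling `cluster_percentages(eng_words)` and `cluster_percentages(ara_words)` should give the same three numbers as the cells above, since an initial or final match always implies a match of the unanchored pattern.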


# Observations

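- Consonant clusters containing /s/ and/or /t/ are more common in English than in Arabic overall: about 8.5% of English words have one, versus about 3.0% of Arabic words.
- Arabic has no such clusters word-initially (0.0%) and very few word-finally (about 0.04%, versus about 2.6% in English). Many of the matches it does have are actually geminates, i.e. a doubled /s s/ or /t t/, rather than clusters of two different consonants.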
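
The gemination point can be checked directly by splitting the matches into geminates (every phoneme in the match is the same) and clusters of different consonants. The sketch below is one way to do that on four pronunciations taken from the Arabic output above. The helper name `is_geminate` is hypothetical.

```python
import re

cc_RE = r"[st](?: [st])+"


def is_geminate(cluster):
    """True if every phoneme in the matched cluster is the same, e.g. 's s'."""
    return len(set(cluster.split())) == 1


samples = ["ʔ u ħ a s s a n", "ʔ u s t aː z", "i t t i ħ aː d", "ʔ i s t aː d"]

geminates, true_clusters = [], []
for w in samples:
    for cluster in re.findall(cc_RE, w):
        if is_geminate(cluster):
            geminates.append(cluster)
        else:
            true_clusters.append(cluster)

print("geminates:", geminates)
print("other clusters:", true_clusters)
```

Running the same split over all of `ara_words` and `eng_words` would show how much of each language's total comes from gemination.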

# Directions for possible extensions (optional)

- Look for clusters involving more consonants in each language, or any consonants in the language.
- Some phonemes have representations that are more than one character long. This is why the WikiPron pronunciations are presented in space-separated form: the spaces make the phoneme segmentation explicit. I have avoided searching for such multi-character phonemes in this homework, to keep things simple. Note that the character-level regexes can still match part of one: in the Arabic output, `s tˤ` is reported as `s t`. If you would like to extend this to handle such phonemes, you will need to represent each phoneme as a *string* rather than as a character, and search for 2 or more consecutive instances of such strings.
- Do the same thing for other languages. For example, [PHOIBLE](https://phoible.org/) has phoneme inventories for many languages; you would first need to download the data and then convert it to the format of the existing English and Arabic data.
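
As a starting point for the first two extensions, here is a minimal sketch that treats each phoneme as a string: it splits the pronunciation on spaces and collects runs of two or more consecutive phonemes that belong to a given consonant set. The function name `find_clusters` and the small consonant sets are assumptions for illustration. A real extension would need a full consonant inventory for each language.

```python
def find_clusters(pronunciation, consonants):
    """Return runs of 2+ consecutive phonemes that are all in `consonants`."""
    clusters, run = [], []
    for phoneme in pronunciation.split():
        if phoneme in consonants:
            run.append(phoneme)
        else:
            # A non-consonant ends the current run; keep it if long enough.
            if len(run) >= 2:
                clusters.append(" ".join(run))
            run = []
    # Handle a run that reaches the end of the word.
    if len(run) >= 2:
        clusters.append(" ".join(run))
    return clusters


# With only /s/ and /t/, the emphatic /tˤ/ is a different phoneme: no match.
plain = find_clusters("ʔ u s tˤ uː l", {"s", "t"})
# Adding /tˤ/ to the (illustrative) set reports the cluster with its real members.
emphatic = find_clusters("ʔ u s tˤ uː l", {"s", "t", "tˤ"})
# Ordinary single-character clusters still work as before.
english = find_clusters("p ə ʊ s t s", {"s", "t"})

print(plain, emphatic, english)
```

Because phonemes are compared as whole strings, this approach does not report `s t` for `s tˤ` the way the character-level regex does. A cluster is word-initial if the pronunciation starts with it and word-final if the pronunciation ends with it, which `str.startswith` and `str.endswith` can check.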