# Do pidgins really exist? Do creoles come from pidgin?

This video features a lesson given by Prof. Michel DeGraff, a Haitian creolist.
The speech is plain and uncontrolled, and the speaker has a strong accent.
There is no background noise.
Google's ASR system was used to caption the video automatically.
Mistakes in the output will be used as a basis for developing the comprehension test and for evaluating the system's performance in different conditions.

## Methodology 

I import all the modules I will need.

In [1]:
import collections
import random
import re

import jiwer
import nbconvert
import nltk
import nltk.corpus
from jiwer import wer
from nltk.corpus import stopwords, wordnet

print("done!")

done!


## Evaluation metrics

When evaluating an ASR system, it is important to keep in mind the nature of the speaker and the condition of the audio; these measurements need to be interpreted accordingly.

Word Error Rate (WER) is a comparison measure: it expresses the distance between the word sequence produced by an ASR system and the reference sequence.

WER = (S + D + I) / N1 = (S + D + I) / (H + S + D)

where I = total number of insertions, D = total number of deletions, S = total number of substitutions, H = total number of hits, and N1 = total number of reference words.

Cons: it is not a true percentage, because it has no upper bound (e.g. WER can reach 200%). Used on its own, it says little about the system's performance.
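
To make the missing upper bound concrete, here is a minimal sketch of WER computed as a word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical and this is not how jiwer implements the measure; it only illustrates the definition above. The test sentences reuse words from the lesson.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # hit or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution in three reference words
print(word_error_rate("the pidgin existed", "the pigeon existed"))
# One substitution plus two insertions against a single reference word
print(word_error_rate("creole", "career girl grill"))
```

The second call returns 3.0, i.e. a WER of 300%, because insertions are counted in the numerator but not in the denominator.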

Match Error Rate (MER) is the proportion of input/output word matches that are errors, i.e. the probability that a given match is incorrect.

MER = (S + D + I) / N = 1 − H/N, where N = H + S + D + I

Word Information Lost (WIL) is a simple approximation to the proportion of word information lost between the reference and the output.

WIL = 1 − H² / ((H + S + D) (H + S + I))

In [25]:
# Read the human transcription (reference) and the automatic captions (hypothesis)
with open("Text files/Transcription.txt", "r", encoding="utf8") as infile:
    ref = infile.read().lower()

with open("Text files/auto_cc.txt", "r", encoding="utf8") as infile2:
    hyp = infile2.read().lower()

# Replace every run of non-word characters (punctuation, whitespace) with a single space
reference = " ".join(re.split(r"\W+", ref))
hypothesis = " ".join(re.split(r"\W+", hyp))

measures = jiwer.compute_measures(reference, hypothesis)

# Note: len() on a string counts characters, not words
print("{0:15}  {1}".format('N reference:', len(reference)))
print("{0:15}  {1}".format('N hypothesis:', len(hypothesis)))

print()

for measure, val in measures.items():
    if measure != 'wip':
        print("{0:15}  {1}".format(measure, val))
        

N reference:     8758
N hypothesis:    8608

wer              0.06647058823529411
mer              0.06592765460910152
wil              0.10091514960188008
hits             1601
substitutions    62
deletions        37
insertions       14


With a WER of roughly 6.6% (113 errors over 1,700 reference words), these results are good and in line with reported Google ASR performance, which is generally regarded as among the best available.
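
As a sanity check, the three rates can be recomputed from the raw counts printed above. The sketch below assumes the formulas from Morris et al. (2004), including WIL = 1 − H² / (N1 · N2) with N1 = H + S + D and N2 = H + S + I; the helper name `error_rates` is hypothetical and not part of jiwer.

```python
def error_rates(hits, substitutions, deletions, insertions):
    """Return (WER, MER, WIL) computed from alignment counts."""
    errors = substitutions + deletions + insertions
    n_ref = hits + substitutions + deletions   # N1: words in the reference
    n_hyp = hits + substitutions + insertions  # N2: words in the hypothesis
    wer = errors / n_ref
    mer = errors / (hits + errors)
    wil = 1 - hits ** 2 / (n_ref * n_hyp)
    return wer, mer, wil


# Counts copied from the output above
wer, mer, wil = error_rates(hits=1601, substitutions=62, deletions=37, insertions=14)
print(f"wer {wer:.4f}  mer {mer:.4f}  wil {wil:.4f}")
```

The counts give N1 = 1,700 reference words, which confirms that the `N reference` and `N hypothesis` figures above (8758 and 8608) are character counts rather than word counts.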

### Line by line comparison

I created a text file containing both the input and the output texts. I will use this document to detect shifts or changes in meaning caused by transcription errors of a morphological or syntactic nature.

In this cell I POS-tag and lemmatise the two texts, then retrieve only the nouns and their frequencies to see how many keywords were transcribed correctly.

In [9]:
lemmatizer = nltk.WordNetLemmatizer()

def noun_counts(path):
    """Return the lemma frequencies of all nouns in the text file at `path`."""
    with open(path, "r", encoding="utf8") as infile:
        text = infile.read().lower()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with "NN"
    return collections.Counter(lemmatizer.lemmatize(word)
                               for word, pos in tagged if pos.startswith("NN"))

def print_counts(title, counts):
    """Print nouns longer than three characters with their frequencies."""
    print(title)
    for word, val in sorted(counts.items()):
        if len(word) > 3:
            print("{} {} {} ".format(word, val, "|"), end=" ")

# TRANSCRIPTION
count_ref = noun_counts("Text files/Transcription.txt")

# AUTOMATIC CAPTIONS
count_hyp = noun_counts("Text files/auto_cc.txt")

# LET'S PRINT BOTH
print_counts("Input nouns:\n", count_ref)
print("\n")
print_counts("Output nouns:\n", count_hyp)


Input nouns:

absence 2 |  affix 4 |  africa 1 |  african 3 |  anybody 1 |  anything 2 |  archival 1 |  argument 1 |  asia 1 |  attention 1 |  background 1 |  barrier 1 |  bell 1 |  bias 1 |  bickerton 3 |  business 1 |  candidate 1 |  child 6 |  china 1 |  choice 1 |  claim 5 |  coast 1 |  code 1 |  communication 1 |  community 5 |  complexity 1 |  concern 1 |  condition 1 |  creation 1 |  creole 12 |  data 9 |  debate 1 |  definition 2 |  difference 1 |  doubt 1 |  erectus 1 |  europe 1 |  evidence 6 |  example 2 |  fact 6 |  family 1 |  fifth 1 |  franca 1 |  french 1 |  generation 4 |  guess 1 |  hearing 1 |  history 2 |  home 1 |  homo 1 |  hypothesis 1 |  idea 2 |  kind 3 |  l'ouverture 1 |  language 18 |  latin 1 |  leader 1 |  level 1 |  line 1 |  marker 1 |  mean 1 |  nobody 2 |  observer 1 |  okay 3 |  order 1 |  part 1 |  patois 1 |  pattern 2 |  people 9 |  period 1 |  perspective 1 |  pidgin 21 |  piece 1 |  point 6 |  question 6 |  rachel 1 |  recording 4 |  remember 2 | 

By cross-referencing the noun frequency data with the text alignment, we can see how the original keywords were transcribed:

In [11]:
print("Topic keywords:\n")
      
for word, val in sorted(count_ref.items()): 
    if word in ["pidgin", "creole", "affix", "bickerton", "structure"]:
        print("{} {} {} ".format(word, val,  "|"), end = " ")
        
print("\n\n")   

print("Output:\n")
print("{0:11} {1} {2}".format("affix 4 ", "=" , "affix 4"))
print("{0:10} {1}".format("\nbickerton 4 =", "bickerton 1 + biggerton 1 + bikerton 1 + because 1"))
print("{0:12} {1} {2}".format("\ncreole 12", "=", "creole 6 + career 2 + girl 1 + grill 1 + critical 1 + curl 1"))
print("{0:12} {1} {2}".format("\npidgin 21", "=", "pigeon 22 + pitching 1"))
print("{0:12} {1} {2}".format("\nstructure 4", "=", "structure 4"))

Topic keywords:

affix 4 |  bickerton 3 |  creole 12 |  pidgin 21 |  structure 4 |  


Output:

affix 4     = affix 4

bickerton 4 = bickerton 1 + biggerton 1 + bikerton 1 + because 1

creole 12   = creole 6 + career 2 + girl 1 + grill 1 + critical 1 + curl 1

pidgin 21   = pigeon 22 + pitching 1

structure 4 = structure 4
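
Part of this cross-referencing can be automated by subtracting one `collections.Counter` from the other. The sketch below is only an illustration: the counts are toy values copied from the keyword table above, not recomputed from the text files, and the names `missing` and `spurious` are hypothetical.

```python
from collections import Counter

# Toy counts copied from the keyword table above (not recomputed from the files)
count_ref = Counter({"pidgin": 21, "creole": 12, "affix": 4, "structure": 4})
count_hyp = Counter({"pigeon": 22, "creole": 6, "affix": 4, "structure": 4,
                     "career": 2, "pitching": 1})

# Counter subtraction keeps only positive differences
missing = count_ref - count_hyp   # nouns that occur less often in the captions
spurious = count_hyp - count_ref  # nouns that occur more often in the captions

print(missing)
print(spurious)
```

Subtraction shows which keywords were lost and which words appeared in their place, but it cannot say which spurious word replaced which keyword; that mapping still needs the line-by-line alignment.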


The following errors, caused mainly by co-articulation, are problematic because the captions are linguistically well-formed but their meaning differs from the original sentences:

In [12]:
print("(an original speech system) that's extremely \033[94mrich using structure\033[0m\n(an original speech system) that's extremely \033[94mreducing structure\033[0m")
print("\n")
print("They are really simplest \033[94mimpossible\033[0m system -- in fact, even un-language-like \nthey're really simply system \033[94mpossible\033[0m in fact even uh unlanguage like")
print("\n")
print("there was a system where all the French \033[94msuffixes\033[0m were gone")
print("there was a system where all the french \033[94maffixes\033[0m were gone")
print("\n")
print("In the cases above, the human-made caption is wrong and the automatic one is correct")
print("\n")
print("so in the history of \033[94mHaitian Creole\033[0m")
print("so in the history of \033[94masian creole\033[0m")
print("\n")
print("if there is a recording, we're like, OK, oh, \033[94mthen a pidgin did exist\033[0m.")
print("if there is a recording we're like okay oh then \033[94mthen it didn't exist\033[0m")
print("\n")


(an original speech system) that's extremely **rich using structure**
(an original speech system) that's extremely **reducing structure**


They are really simplest **impossible** system -- in fact, even un-language-like 
they're really simply system **possible** in fact even uh unlanguage like


there was a system where all the French **suffixes** were gone
there was a system where all the french **affixes** were gone


In the cases above, the human-made caption is wrong and the automatic one is correct


so in the history of **Haitian Creole**
so in the history of **asian creole**


if there is a recording, we're like, OK, oh, **then a pidgin did exist**.
if there is a recording we're like okay oh then **then it didn't exist**




On the other hand, the following sentences are almost nonsensical:

In [6]:
print("you might go to a stage where you \033[94mproduce Italian verbs without affixes\033[0m")
print("you might go through a state where you \033[94mput your satanic verbs without affixes\033[0m")
print("\n")
print("so in terms of just a mix, you see-- so \033[94min Haitian creole revolution\033[0m")
print("so in terms of just the mix you see so \033[94minhibition critical revolution\033[0m")

you might go to a stage where you **produce Italian verbs without affixes**
you might go through a state where you **put your satanic verbs without affixes**


so in terms of just a mix, you see-- so **in Haitian creole revolution**
so in terms of just the mix you see so **inhibition critical revolution**


Proper nouns were transcribed as follows:

In [12]:
print("Bickerton:\n")
print("\033[94mBickerton has used\033[0m something of that kind of argument")
print("\033[94mbecause use\033[0m something of that that kind of argument")
print("\n")
print("there was anything like what \033[94mBickerton\033[0m posits")
print("there was anything like what \033[94mbiggerton\033[0m posits")
print("\n")
print("remember that for \033[94mBickerton\033[0m, the \033[94mcrucial fifth piece\033[0m (of his theory) is")
print("remember for \033[94mbikerton\033[0m you know the \033[94mcrucial face\033[0m is")
print("\n\n")
print("Toussaint L'Ouverture:")
print("\n")
print("So someone like \033[94mToussaint L'Ouverture\033[0m, for example, the well-known Haitian leader")
print("so someone like \033[94mlouis veracio\033[0m for example you know the well-known haitian leader")

Bickerton:

**Bickerton has used** something of that kind of argument
**because use** something of that that kind of argument


there was anything like what **Bickerton** posits
there was anything like what **biggerton** posits


remember that for **Bickerton**, the **crucial fifth piece** (of his theory) is
remember for **bikerton** you know the **crucial face** is



Toussaint L'Ouverture:


So someone like **Toussaint L'Ouverture**, for example, the well-known Haitian leader
so someone like **louis veracio** for example you know the well-known haitian leader
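
Misrecognised proper nouns such as the variants of Bickerton above could be flagged with fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the standard library; the candidate list and the 0.7 cutoff are illustrative assumptions, not values used in this notebook.

```python
import difflib

# Caption words taken from the examples above; a real run would use
# every token of the automatic captions instead.
caption_words = ["because", "biggerton", "bikerton", "posits", "argument"]

# Return up to three caption words whose similarity to the name is at least 0.7
matches = difflib.get_close_matches("bickerton", caption_words, n=3, cutoff=0.7)
print(matches)
```

This catches near-spellings like `bikerton` and `biggerton`, but not `because`, which shares too few characters with the name; substitutions of that kind still have to be found through the alignment.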


One automatic caption reads better than the human transcription, though:

In [13]:
print("you're already creating a bias against the area that--")
print("you're already increasing your bias against the idea that")
print("\n")
print("(African people were learning French like French people were learning Latin in the past)")

you're already creating a bias against the area that--
you're already increasing your bias against the idea that


(African people were learning French like French people were learning Latin in the past)
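
The differing spans in the sentence pairs of this section were marked by hand. As a rough sketch of how they could be located automatically, the standard library's `difflib.SequenceMatcher` can align two sentences word by word; the helper name `word_diffs` is hypothetical, and lowercasing plus whitespace splitting is a simplifying assumption.

```python
import difflib


def word_diffs(reference, hypothesis):
    """Return (operation, reference words, hypothesis words) for each mismatch."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [(tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


print(word_diffs("so in the history of Haitian Creole",
                 "so in the history of asian creole"))
```

For this pair the only reported difference is the replacement of `haitian` by `asian`. Punctuation is not stripped here, so sentences from the human transcription would need the same normalisation as in the WER cell first.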


### References 

Filippidou, F., & Moussiades, L. (2020). A Benchmarking of IBM, Google and Wit Automatic Speech Recognition Systems. Springer International Publishing. https://doi.org/10.1007/978-3-030-49161-1

MIT OpenCourseWare 24.908, Spring 2017
Creole Languages and Caribbean Identities - Michel DeGraff
Lesson 1. Do "Pidgins" exist? Do creoles come from pidgin?

Morris, A. C., Maier, V., & Green, P. (2004). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. 8th International Conference on Spoken Language Processing, ICSLP 2004, June, 2765–2768.