<a href="https://colab.research.google.com/github/seojeongmin123/NLP_2024/blob/main/00_SelfTaught_Voca_Pronunciation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🐹 🐾 🍩 Vocabulary Drills

# 📘 **Self-Taught Pronunciation**

**Table of Contents:**  
This notebook uses **{gTTS}** text-to-speech and the CMU Pronouncing Dictionary.  

* Exposure to Keyword pronunciation (using 📍_frequency distribution, gTTS_)
* English rhyming (using 📍_CMU dictionary_): e.g., night, right, bite, etc.
* Learning English vowels with rhyming words.
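
The rhyming idea in the list above can be sketched in a few lines. This is a minimal illustration, not the notebook's own method: it assumes that two words rhyme when their pronunciations match from the last primary-stressed vowel to the end, and it uses a tiny hand-written lexicon (`SAMPLE_PRON`, a hypothetical name) in the same ARPAbet notation as the CMU dictionary. The helper names `rhyme_part` and `rhymes_with` are also made up for this example.

```python
# Hypothetical mini-lexicon in CMU (ARPAbet) notation. The real dictionary
# has over 100,000 entries and can be loaded through nltk.corpus.cmudict.
SAMPLE_PRON = {
    "night": ["N", "AY1", "T"],
    "right": ["R", "AY1", "T"],
    "bite":  ["B", "AY1", "T"],
    "light": ["L", "AY1", "T"],
    "boat":  ["B", "OW1", "T"],
    "cat":   ["K", "AE1", "T"],
}

def rhyme_part(phones):
    """Return the phonemes from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):  # a trailing "1" marks primary stress
            return tuple(phones[i:])
    return tuple(phones)  # no stressed vowel found: compare the whole word

def rhymes_with(word, lexicon):
    """List the other words in the lexicon that share the word's rhyme part."""
    target = rhyme_part(lexicon[word])
    return sorted(w for w, p in lexicon.items()
                  if w != word and rhyme_part(p) == target)

print(rhymes_with("night", SAMPLE_PRON))  # → ['bite', 'light', 'right']
```

Because the vowel symbol (here `AY`) is part of the rhyme key, grouping words by `rhyme_part` also groups them by vowel, which is what makes rhyming words useful for vowel practice.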


# 👽😇😡 Vocabulary Drills

## Enchanting folktales from Korea (Mimi Kim of Washington University & Angela Lee-Smith of Yale University)

[Article about their digital transformation project](https://ealc.wustl.edu/news/folktales-cultural-literacy)

### <font color = 'red'>**Enchanting folktales from Korea**</font>
- **[The old man with the lump on his face 혹부리 할아버지](https://www.youtube.com/watch?v=pyPNKS6IjB8&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=1)**
- **[The tale of the green frog 청개구리](https://www.youtube.com/watch?v=pVBqT0gh37I&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=2)**
- **[The Loving Brothers 의좋은 형제](https://www.youtube.com/watch?v=7UpB0t9R4hY&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=3)**
- **[The Tiger and The Persimmon 호랑이와 곶감](https://www.youtube.com/watch?v=JSfSnjtgnqo&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=4)**
- **[The Idler That Became a Cow 소가 된 게으름뱅이](https://www.youtube.com/watch?v=9ku3JSW7Vlw&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=5)**
- **[Sister Sun and Brother Moon 해님 달님](https://www.youtube.com/watch?v=yM5GOtrPzEE&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=6)**
- **[The Golden Ax, The Silver Ax](https://www.youtube.com/watch?v=xHzxGLuc6MA&list=PLgJYJCNIVKi1XNC3swicgU-9nRpnf82Lm&index=7)**


💾 Sample text: [Article about digital transformation](https://raw.githubusercontent.com/MK316/workshop22/main/data/TheHeron.txt) Copy it and have it ready to paste below :-)

In [None]:
#@markdown 🔳 Paste the text here for analysis: (text)
text = input()

Mimi Kim, teaching professor of Korean language, is the co-author of a new Korean language textbook that uses folktales as a springboard for language learning. The 21 stories that make up Tigers, Fairies, and Gods: Enchanting Folktales from Korea progress through increasingly challenging levels of diction and vocabulary while developing students’ cultural literacy. Every semester, Mimi Kim asks her Korean language students at WashU to write their own Korean-style folktales. She says that folktales are “rich in cultural context, and so they provide a very effective backdrop for discussions about cultural practices and perspectives. Students really get into our discussions about commonalities and differences between Korean folktales and the stories that they grew up with in their own communities.” Kim sees cultural literacy as a key tool for the language classroom. Tigers, Fairies, and Gods grew out of that idea. The book is an unusually rich visual object, with vibrant illustrations don

> 🔲 **Preprocessing of the text**

Set up for processing: no action required.

In [None]:
#@markdown 🔳 Install and import packages
%%capture
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("punkt")

!pip install corpus-toolkit

#@markdown 🔳 Create a folder named "txtdata" for further processing
import os
os.makedirs("txtdata", exist_ok=True)  # does not fail if the cell is run twice

#@markdown 🔳 Write text to a file under 'txtdata' folder

with open('txtdata/mytext.txt','w') as f:
  f.write(text)


> 🔲 **Frequency analysis:** your actions required (2 times)

In [None]:
#@markdown 🔳 Tokenize, getting frequency list with tagging information

from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"\w+")  # raw string: keep runs of word characters only
words = retokenize.tokenize(text)
print('Before stopwords: %d'%len(words))

# Lower case
words = [w.lower() for w in words]
#@markdown 🔳 Remove stopwords for frequency distribution analysis:

# import stopwords from nltk.corpus

from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words('english'))  # build the set once, not once per word
words = [w for w in words if w not in stop_words]
print('After stopwords: %d'%len(words))

#@markdown 🔳 POS tagging

from corpus_toolkit import corpus_tools as ct

brown_corp = ct.ldcorpus("txtdata") #load and read text files under 'txtdata' directory
tok_corp = ct.tokenize(brown_corp)  #tokenize corpus - by default this lemmatizes as well
brown_freq = ct.frequency(tok_corp) #creates a frequency dictionary

ct.write_corpus("tagged_txt",ct.tag(ct.ldcorpus("txtdata")))

tagged_freq = ct.frequency(ct.reload("tagged_txt"))
# ct.head(tagged_freq, hits = 10)

#@markdown 🔳 Result saving as a csv file with POS information

import pandas as pd
# Convert the {tagged_word: frequency} dictionary into a two-column DataFrame
df = pd.DataFrame(list(tagged_freq.items()))

df.columns=['Tagged_words','Freq']

mycol = list(df['Tagged_words'])

# print(df)

# Word, POS into dataframe

wlist = []
cat = []

for w in mycol:
  # Split on the last underscore only, so the POS tag is always the final part
  w1 = w.rsplit("_", 1)
  wlist.append(w1[0])
  cat.append(w1[1])

df['Word'] = wlist
df['POS'] = cat

#@markdown 🔳 🚩 Sorting by? Answer [pop up box]

print("Sorting by Frequency (type '1'), POS & Freq (type '2'), or by Word alphabetically (type '3')")
sorting = input().strip()

# if/elif chain so the fallback message appears only for invalid input
if sorting == "1":
  df = df.sort_values(by=['Freq'], ascending = False)
elif sorting == "2":
  df = df.sort_values(by=['POS', 'Freq'], ascending = False)
elif sorting == "3":
  df = df.sort_values(by=['Word'], ascending = True)
else:
  print("Type 1, 2, or 3")
df['Index'] = range(1,len(df['POS'])+1)

df = df[["Index", "POS", "Word","Freq"]]
# print(df.to_string(index=False))

#@markdown 🔳 🚩 Saving file? Answer [pop up box]

print('Save it as a file? (y/n)')
saving = input().strip().lower()

if saving == "y":
  df.to_csv('pos_wordlist.csv')
  print('File is saved: pos_wordlist.csv')
elif saving == "n":
  print('No file will be saved.')

df.head()


Before stopwords: 285
After stopwords: 167
Processing mytext.txt (1 of 1 files)
Processing mytext.txt (1 of 1 files)
Processing 1.txt (1 of 1 files)
Sorting by Frequency (type '1'), POS & Freq (type '2'), or by Word alphabetically (type '3')
2
Type 1, 2, or 3
Save it as a file? (y/n)
y
File is saved: pos_wordlist.csv


|    | Index | POS  | Word  | Freq |
|----|-------|------|-------|------|
| 59 | 1     | VERB | say   | 3    |
| 81 | 2     | VERB | grow  | 3    |
| 16 | 3     | VERB | use   | 2    |
| 2  | 4     | VERB | teach | 1    |
| 24 | 5     | VERB | make  | 1    |

## **[1] Generating audio file of word reading**  
Input: the DataFrame `df` produced by the frequency analysis above.

In [None]:
#@markdown 🚩 {gTTS} package installation and import
%%capture
!pip install gTTS
from gtts import gTTS
from IPython.display import Audio

In [None]:
#@markdown Word reading by gTTS:

#@markdown 🚩 Select word POS:

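word_POS_select = "ADJ" #@param ["NOUN","VERB","ADJ","ADV","PROPN","ALL"]

# "ALL" keeps every row; otherwise keep only the selected POS
if word_POS_select == "ALL":
  wordlist = df
else:
  wordlist = df[df['POS'] == word_POS_select]
wordlist = wordlist.sort_values(by=['Word'], ascending = True)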

collist = list(wordlist['Word'])

print(collist)

#@markdown 🚩 Language to choose: (English, Korean, French, Spanish)
def tts(mytext):
  text_to_say = mytext

  # Step ⓵ Language to choose:
  language_to_choose = "en" #@param ["en", "ko", "fr", "es"]
  language = language_to_choose
  print("Play language accent: %s"%language_to_choose)

  gtts_object = gTTS(text = text_to_say,
                     lang = language,
                     slow = False)

  # gTTS writes MP3 data, so save with an .mp3 extension
  gtts_object.save("mytext.mp3")
  return Audio("mytext.mp3")

# Join the word list with '! ': joining with '.' tends to make gTTS read two words as one.
text_to_say = '! '.join(collist)
intro_text = "Okay. I'm going to read a wordlist, so repeat after me. "  # trailing space separates the intro from the first word
text_to_say = intro_text + text_to_say
tts(text_to_say)


['-', 'challenging', 'cross', 'cultural', 'effective', 'evocative', 'key', 'korean', 'linguistic', 'little', 'natural', 'new', 'own', 'rich', 'useful', 'vibrant', 'visual', 'vocabulary']
Play language accent: en
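
The printed word list above begins with `'-'`, a token that contains no letters and therefore adds nothing to a pronunciation drill. One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to drop such tokens before building the text that is passed to `tts()`. The helper name `build_prompt` is hypothetical.

```python
def build_prompt(words, intro="Okay. I'm going to read a wordlist, so repeat after me."):
    """Keep only tokens that contain at least one letter, then join them with '! '."""
    spoken = [w for w in words if any(ch.isalpha() for ch in w)]
    return intro + " " + "! ".join(spoken)

prompt = build_prompt(["-", "challenging", "cross", "cultural"])
print(prompt)
# → Okay. I'm going to read a wordlist, so repeat after me. challenging! cross! cultural
```

The resulting string could then be passed to the notebook's `tts()` function in place of the manually joined text.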
