<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setting-up-Environment" data-toc-modified-id="Setting-up-Environment-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setting up Environment</a></span></li><li><span><a href="#Initializing-Storage-Variables" data-toc-modified-id="Initializing-Storage-Variables-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Initializing Storage Variables</a></span></li><li><span><a href="#Wikipedia-Scraping" data-toc-modified-id="Wikipedia-Scraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wikipedia Scraping</a></span><ul class="toc-item"><li><span><a href="#Extrating-Chinese-food-names" data-toc-modified-id="Extrating-Chinese-food-names-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Extrating Chinese food names</a></span></li><li><span><a href="#Extracting-all-other-cuisines" data-toc-modified-id="Extracting-all-other-cuisines-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extracting all other cuisines</a></span></li><li><span><a href="#Data-Cleaning" data-toc-modified-id="Data-Cleaning-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Data Cleaning</a></span></li></ul></li><li><span><a href="#html-Scraping" data-toc-modified-id="html-Scraping-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>html Scraping</a></span><ul class="toc-item"><li><span><a href="#Adding-Japanese-food-from-html-manually" data-toc-modified-id="Adding-Japanese-food-from-html-manually-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Adding Japanese food from html manually</a></span></li><li><span><a href="#Adding-Korean-food-from-html-with-beautiful-soup" data-toc-modified-id="Adding-Korean-food-from-html-with-beautiful-soup-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Adding Korean food from html with beautiful soup</a></span></li><li><span><a href="#Exporting-Corpus" data-toc-modified-id="Exporting-Corpus-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Exporting 
Corpus</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Setting up Environment

In [1]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
import wikipedia

# Initializing Storage Variables

In [2]:
df_corpus = pd.DataFrame(columns=["Food", "Cuisine"])
df_corpus

Unnamed: 0,Food,Cuisine


In [3]:
cuisines = ["Chinese", "Malay", "Indian", "Cross-cultural",
            "Seafood", "Fruit", "Desserts", "Drinks and beverages"]

# Wikipedia Scraping
Documentation at https://wikipedia.readthedocs.io/en/latest/code.html

In [4]:
wiki = wikipedia.page("Singaporean cuisine")
links = wiki.links

In [5]:
wiki.sections

[]

## Extracting Chinese food names

In [6]:
txt_chinese = wiki.section("Chinese") #section(section_title)
txt_chinese

'The dishes that comprise "Singaporean Chinese cuisine" today were originally brought to Singapore by the early southern Chinese immigrants (Hokkien, Teochew, Cantonese, Hakka and Hainanese). They were then adapted to suit the local availability of ingredients, while absorbing influences from Malay, Indian and other cooking traditions.\nMost of the names of Singaporean Chinese dishes were derived from dialects of southern China, Hokkien (Min Nan) being the most common. As there was no common system for transliterating these dialects into the Latin alphabet, it is common to see different variants on the same name for a single dish. For example, bah kut teh may also be spelt bak kut teh, and char kway tiao may also be spelt char kuay teow.\n\nBak kut teh (肉骨茶; ròu gǔ chá), pork rib soup made with a variety of Chinese herbs and spices.\nBeef kway teow (牛肉粿条; niú ròu guǒ tiáo), flat rice noodles stir-fried with beef, served dry or with soup.\nBak chang (肉粽; ròu zòng), glutinous rice dumpli

In [7]:
len(txt_chinese)

6763

In [8]:
def get_corpus(txt, start_chars, end_chars):
    start, end = 0, 0
    corpus = []
    new = False
    for i in range(len(txt)):
        if txt[i] in start_chars:
            start = i+1 # start of food name just after start_char
            new = True # found start of new word flag
        if txt[i] in end_chars:
            end = i # end of food name, non inclusive of txt[i]
        if new and end > start:
            while txt[start] == " ":
                start += 1 # remove space at start
            while txt[end-1] == " ":
                end -= 1 # remove space at end
            corpus.append(txt[start: end].lower()) # change to lower case
            new = False # word copied out, prevents duplicates
    return corpus

Chinese cuisine is the only section in which every food name is bounded by ```'\n'``` and ```'('``` characters, e.g. ```'\nSliced fish soup (鱼片汤; yú piàn tāng)'```, so we extract it first, step by step, as a worked example.
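To make the boundary rule concrete, here is a minimal line-based sketch on a made-up snippet. `names_before` and `sample` are hypothetical names used only for illustration; this is not the notebook's `get_corpus`, just a rough equivalent of the same idea under the assumption that each dish sits on its own line.

```python
def names_before(txt, end_chars):
    """Return the lower-cased text between each newline and the first end character."""
    names = []
    # Text before the first newline is skipped, since a name must follow '\n'.
    for line in txt.split("\n")[1:]:
        positions = [line.find(c) for c in end_chars if c in line]
        if positions and min(positions) > 0:
            names.append(line[:min(positions)].strip(" ").lower())
    return names

# Hypothetical sample shaped like the Wikipedia section text.
sample = (
    "Intro sentence about the cuisine.\n"
    "Bak kut teh (肉骨茶; ròu gǔ chá), pork rib soup.\n"
    "Sliced fish soup (鱼片汤; yú piàn tāng), fish soup.\n"
)
print(names_before(sample, ["(", "/"]))
```

Lines without any end character (such as the intro sentence here) produce no entry, which is why only summary lines that happen to contain a bracket or slash leak into the corpus.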

In [9]:
corpus_chinese = get_corpus(txt_chinese, ['\n'], ['(', '/'])
corpus_chinese

['most of the names of singaporean chinese dishes were derived from dialects of southern china, hokkien',
 'bak kut teh',
 'beef kway teow',
 'bak chang',
 'bak chor mee',
 'ban mian',
 'chai tow kway',
 'char kway teow',
 'char siu',
 'crab bee hoon',
 'drunken prawns',
 'duck rice',
 'fish ball noodles',
 'fish soup bee hoon',
 'frog leg porridge',
 'hae mee',
 'hainanese chicken rice',
 'har cheong gai',
 'hokkien mee',
 'hum chim peng',
 'kuay chap',
 'mee pok',
 'min chiang kueh',
 "pig's brain soup",
 "pig's organ soup",
 'popiah',
 'shredded chicken noodles',
 'sliced fish soup',
 'soon kway',
 'teochew porridge',
 'turtle soup',
 'vegetarian bee hoon',
 'yong tau foo',
 'youtiao']

In [10]:
corpus_chinese = corpus_chinese[1:]
corpus_chinese_size = len(corpus_chinese)
print(corpus_chinese_size)

33


In [11]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": corpus_chinese,
                                     "Cuisine": ["Chinese"]*len(corpus_chinese)})])
df_corpus

Unnamed: 0,Food,Cuisine
0,bak kut teh,Chinese
1,beef kway teow,Chinese
2,bak chang,Chinese
3,bak chor mee,Chinese
4,ban mian,Chinese
5,chai tow kway,Chinese
6,char kway teow,Chinese
7,char siu,Chinese
8,crab bee hoon,Chinese
9,drunken prawns,Chinese


## Extracting all other cuisines

Viewing the Wikipedia page shows that food names in all other cuisines are bounded by ```'\n'``` at the start and by ```','```, ```'('``` or ```'/'``` at the end.

As with the Chinese section, the section summary may contain a substring bounded by the same characters, which will need to be removed.

In [12]:
for cuisine in cuisines[1:]: # exclude "Chinese"
    txt = wiki.section(cuisine) #section(section_title)
    corpus = get_corpus(txt, ['\n'], [',', '(', '/'])
    print("\n", cuisine, ", Corpus size =", len(corpus))
    print(corpus)


 Malay , Corpus size = 27
['acar', 'assam pedas', 'ayam penyet', 'bakso', 'begedil', 'curry puff', 'dendeng paru', 'goreng pisang', 'gulai daun ubi', 'keropok', 'ketupat', 'lemak siput', 'lontong', 'nagasari', 'nasi goreng', 'nasi padang', 'otak-otak', 'pecel lele', 'rawon', 'rojak bandung', 'roti john', 'sambal', 'satay', 'sayur lodeh', 'soto', 'soto ayam', 'tumpeng']

 Indian , Corpus size = 10
['appam', 'dosa', 'murtabak', 'naan', 'roti prata', 'soup kambing', 'soup tulang', 'soup tulang merah', 'tandoori chicken', 'vadai']

 Cross-cultural , Corpus size = 20
['ayam buah keluak', 'biryani', 'cereal prawns', 'chili crab pasta', 'curry laksa', 'fish head curry', 'kari debal', 'kari lemak ayam', 'katong laksa', 'kueh pie tee', 'kway teow goreng', 'mee rebus', 'mee siam', 'mee goreng', 'mee soto', 'rojak', 'sambal kangkong', 'satay bee hoon', 'tauhu goreng', '"western food" in hawker centres where "singapore-style" chicken chop']

 Seafood , Corpus size = 5
['black pepper crab', 'chill

From the results above, we realised that only two entries need to be removed: the summary text (the first element of the corpus) for "Fruit", and the last element for "Cross-cultural".

However, to keep the code reusable, we shall filter the food names by length instead, keeping only names of at most 40 characters.
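A minimal sketch of the length filter, using two entries from the "Cross-cultural" output above. The constant name `MAX_NAME_LEN` is an assumption for illustration.

```python
MAX_NAME_LEN = 40  # hypothetical constant; the notebook uses the literal 40

candidates = [
    "rojak",
    '"western food" in hawker centres where "singapore-style" chicken chop',
]
# Keep only entries short enough to plausibly be a dish name.
kept = [name for name in candidates if len(name) <= MAX_NAME_LEN]
print(kept)
```

A fixed cut-off is a heuristic: it removes long sentence fragments, but short fragments can still pass and need manual cleaning later.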

Running the above loop again, we have:

In [13]:
corpus_size = corpus_chinese_size
for cuisine in cuisines[1:]: # exclude "Chinese"
    txt = wiki.section(cuisine) #section(section_title)
    corpus = get_corpus(txt, ['\n'], [',', '(', '/'])
    corpus = list(filter(lambda x: len(x) <= 40, corpus))
    print("\n", cuisine, ", Corpus size =", len(corpus))
    print(corpus)
    
    corpus_size += len(corpus)
    df_corpus = pd.concat([df_corpus,
                           pd.DataFrame({"Food": corpus,
                                         "Cuisine": [cuisine]*len(corpus)})])


 Malay , Corpus size = 27
['acar', 'assam pedas', 'ayam penyet', 'bakso', 'begedil', 'curry puff', 'dendeng paru', 'goreng pisang', 'gulai daun ubi', 'keropok', 'ketupat', 'lemak siput', 'lontong', 'nagasari', 'nasi goreng', 'nasi padang', 'otak-otak', 'pecel lele', 'rawon', 'rojak bandung', 'roti john', 'sambal', 'satay', 'sayur lodeh', 'soto', 'soto ayam', 'tumpeng']

 Indian , Corpus size = 10
['appam', 'dosa', 'murtabak', 'naan', 'roti prata', 'soup kambing', 'soup tulang', 'soup tulang merah', 'tandoori chicken', 'vadai']

 Cross-cultural , Corpus size = 19
['ayam buah keluak', 'biryani', 'cereal prawns', 'chili crab pasta', 'curry laksa', 'fish head curry', 'kari debal', 'kari lemak ayam', 'katong laksa', 'kueh pie tee', 'kway teow goreng', 'mee rebus', 'mee siam', 'mee goreng', 'mee soto', 'rojak', 'sambal kangkong', 'satay bee hoon', 'tauhu goreng']

 Seafood , Corpus size = 5
['black pepper crab', 'chilli crab', 'oyster omelette', 'sambal lala', 'sambal stingray']

 Fruit , C

To verify that the corpus was built correctly, we check that the number of rows in the dataframe equals ```corpus_size```.

In [14]:
print("Dataframe size =", df_corpus.shape)
print("Expected corpus size =", corpus_size)

Dataframe size = (112, 2)
Expected corpus size = 112


## Data Cleaning

A visual inspection of the corpus shows that the following food names need to be modified:

**Chinese**
- 'beef kway teow', 'hainanese chicken rice', 'shredded chicken noodles' *- add alternative name*

**Malay**
- 'nasi goreng', 'nasi padang', 'roti john' *- add alternative name*

**Indian**
- 'roti prata' *- add alternative name*

**Desserts**
- 'kuih or kueh' *- change to alternative name*
- 'kueh lapis is a rich' *- trim to 'kueh lapis'*
- 'lapis sagu is also a popular kueh with layers of alternating colour and a sweet' *- trim to 'lapis sagu'*

**Drinks and beverages**
- 'chin chow drink', 'yuenyeung coffee' *- add alternative name*

In [15]:
df_corpus["alternative names"] = df_corpus["Food"]

In [16]:
wrong_names = ['kueh lapis is a rich',
               'lapis sagu is also a popular kueh with layers of alternating colour and a sweet',
               'kuih or kueh']

# Dropping wrong rows
df_corpus = df_corpus[~df_corpus["Food"].isin(wrong_names)]
df_corpus.shape

(110, 3)

In [17]:
# Adding corrected names
names = ['kueh lapis', 'lapis sagu']
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": names,
                                     "Cuisine": ['Desserts']*2})])

# Adding Alternative names
cus = 'Chinese'
names = ['beef kway teow', 'hainanese chicken rice', 'shredded chicken noodles']
alternative = ['kway teow', 'chicken rice', 'chicken noodles']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Malay'
names = ['nasi goreng', 'nasi padang', 'nasi goreng', 'nasi padang', 'roti john']
alternative = ['nasigoreng', 'nasipadang', 'nasi-goreng', 'nasi-padang', 'roti-john']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Indian'
names = ['roti prata', 'roti prata', 'roti prata']
alternative = ['rotiprata', 'roti-prata', 'prata']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Desserts'
names = ['kueh', 'kueh']
alternative = ['kuih', 'kueh']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

cus = 'Drinks and beverages'
names = ['chin chow drink', 'chin chow drink', 'chin chow drink', 'yuenyeung coffee']
alternative = ['chin chow', 'chinchow', 'chin-chow', 'yuenyeung']

df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": names,
                                     "alternative names": alternative,
                                     "Cuisine": [cus]*len(alternative)})])

corpus_size = df_corpus.shape[0]
df_corpus.shape

(129, 3)
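Several of the alternative names above are mechanical variants of a multi-word name (words joined together, or hyphenated). As a sketch only, such variants could be generated rather than typed out. `spelling_variants` is a hypothetical helper, and hand-picked aliases such as 'prata' would still need to be added manually.

```python
def spelling_variants(name):
    """Return joined and hyphenated variants of a space-separated name."""
    words = name.split()
    if len(words) < 2:
        return []  # single words have no such variants
    return ["".join(words), "-".join(words)]

# Build (Food, alternative name) pairs in the same long format as the dataframe:
# one row per alternative name, with the canonical name repeated.
rows = [(food, alt)
        for food in ["nasi goreng", "roti prata"]
        for alt in spelling_variants(food)]
print(rows)
```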

Finally, we generate some summary statistics of our corpus.

In [18]:
print(df_corpus.shape)
print(df_corpus["Cuisine"].value_counts())

(129, 3)
Chinese                 36
Malay                   32
Cross-cultural          19
Drinks and beverages    13
Indian                  13
Desserts                11
Seafood                  5
Name: Cuisine, dtype: int64
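These counts are per row, so every alternative name adds one to its cuisine's total (for example, Chinese is 33 dishes plus 3 alternative-name rows). A small sketch of counting distinct dishes instead, on made-up rows with the same three fields; the data here is illustrative only, not the full corpus.

```python
from collections import Counter

# (Food, Cuisine, alternative name) tuples; illustrative subset only.
rows = [
    ("roti prata", "Indian", "roti prata"),
    ("roti prata", "Indian", "prata"),
    ("dosa", "Indian", "dosa"),
    ("satay", "Malay", "satay"),
]

# Row counts include one row per alternative name.
rows_per_cuisine = Counter(cuisine for _, cuisine, _ in rows)

# Distinct dishes: de-duplicate on (Food, Cuisine) before counting.
distinct_foods = {(food, cuisine) for food, cuisine, _ in rows}
foods_per_cuisine = Counter(cuisine for _, cuisine in distinct_foods)
print(rows_per_cuisine, foods_per_cuisine)
```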


# html Scraping

In [19]:
import codecs

In [20]:
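def get_corpus(txt, start_str, end_str):
    # String-delimiter version for raw HTML; this redefinition replaces the
    # character-based get_corpus used in the Wikipedia section above.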
    start, end = 0, 0
    corpus = []
    new = False
    i=0
    
    while len(txt) > len(end_str):
        start = txt.find(start_str) + len(start_str)
        new = True # found start of new word flag
        end = txt.find(end_str)
        
        if new and end > start and end-start < 40:
            while txt[start] == " ":
                start += 1 # remove space at start
            while txt[end-1] == " ":
                end -= 1 # remove space at end
            corpus.append(txt[start: end].lower()) # change to lower case
            new = False # word copied out, prevents duplicates
        
            #update search space when word is found
            txt = txt[end:]
            i+=1
        elif end > start and end-start >= 40:
            txt = txt[start+len(start_str):] # for start to be updated
        else:
            txt = txt[start:] # end < start, for end to be updated
        
        if start == -1 or end == -1:
            break
    print("found", i, "food names")
    return corpus

## Adding Japanese food from html manually

In [21]:
cuisine = "Japanese"
with codecs.open("japfood.html", 'r', 'utf-8') as f:  # close the file after reading
    txt = f.read()
txt

'<!DOCTYPE html>\n<!-- saved from url=(0067)https://www.japancentre.com/en/pages/156-30-must-try-japanese-foods -->\n<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <link rel="alternate" type="application/rss+xml" title="Japancentre blog » Feed" href="http://blog.japancentre.com/feed/">\n\n          <link rel="alternate" hreflang="en" href="https://www.japancentre.com/en/pages/156-30-must-try-japanese-foods">\n      <link rel="alternate" hreflang="ja" href="https://www.japancentre.com/ja/pages/156-30-must-try-japanese-foods">\n      <link rel="alternate" hreflang="fr" href="https://www.japancentre.com/fr/pages/156-30-must-try-japanese-foods">\n      <link rel="alternate" hreflang="it" href="https://www.japancentre.com/it/pages/156-30-must-try-japanese-foods">\n      <link rel="alternate" hreflang="zh" href="https://www.japancentre.com/zh/pages/156-30-must-try-japanese-foods">\n    

In [22]:
corpus = get_corpus(txt, ". ", "</h")
corpus = list(filter(lambda x: len(x) <= 40, corpus))
print("\n", cuisine, ", Corpus size =", len(corpus))
print(corpus)

found 31 food names

 Japanese , Corpus size = 31
['sushi', 'udon', 'tofu', 'tempura', 'yakitori', 'sashimi', 'ramen', 'donburi', 'natto', 'oden', 'tamagoyaki', 'soba', 'tonkatsu', 'kashipan', 'sukiyaki', 'miso soup', 'okonomiyaki', 'mentaiko', 'nikujaga', 'curry rice', 'unagi no kabayaki', 'shabu shabu', 'onigiri', 'gyoza', 'takoyaki', 'kaiseki ryori', 'edamame', 'yakisoba', 'chawanmushi', 'wagashi', 'api_processed="true"></script></body>']
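The last entry above appears to be page-footer residue: the end marker `</h` also matches a closing `</html>` tag. As a sketch, assuming the dish names sit in numbered heading tags (which the `</h` marker suggests but the notebook does not show), a regular expression anchored on the number and the heading tag would skip it. The sample HTML is made up.

```python
import re

# Made-up HTML shaped like a numbered list article.
sample_html = (
    "<h2>1. Sushi</h2><p>Vinegared rice. Often with fish.</p>"
    "<h2>2. Miso Soup</h2><p>A staple.</p></body></html>"
)

# Capture only the text between "N. " and the closing heading tag.
pattern = re.compile(r"<h\d[^>]*>\s*\d+\.\s*(.*?)\s*</h\d>")
names = [m.lower() for m in pattern.findall(sample_html)]
print(names)
```

Because the pattern requires a heading tag on both sides, sentences containing ". " inside paragraphs and the final `</html>` are ignored.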


In [23]:
corpus = corpus[:-1] # drop the last entry, which is HTML residue rather than a food name
corpus

['sushi',
 'udon',
 'tofu',
 'tempura',
 'yakitori',
 'sashimi',
 'ramen',
 'donburi',
 'natto',
 'oden',
 'tamagoyaki',
 'soba',
 'tonkatsu',
 'kashipan',
 'sukiyaki',
 'miso soup',
 'okonomiyaki',
 'mentaiko',
 'nikujaga',
 'curry rice',
 'unagi no kabayaki',
 'shabu shabu',
 'onigiri',
 'gyoza',
 'takoyaki',
 'kaiseki ryori',
 'edamame',
 'yakisoba',
 'chawanmushi',
 'wagashi']

In [24]:
# Check for duplicate names
for x in df_corpus["alternative names"]:
    for food in corpus:
        if food in x or x in food:
            print("'" + x + "',")

In [25]:
corpus_size += len(corpus)
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": corpus,
                                     "Cuisine": [cuisine]*len(corpus),
                                     "alternative names": corpus})])


In [26]:
print("Dataframe size =", df_corpus.shape)
print("Expected corpus size =", corpus_size)

Dataframe size = (159, 3)
Expected corpus size = 159


## Adding Korean food from html with beautiful soup

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

cuisine = "Korean"

# html = urlopen('koreanfood.html').read().decode("utf-8")
with codecs.open("koreanfood.html", 'r', 'utf-8') as f:  # read the saved page, then close it
    html = f.read()

# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
corpus = list(map(lambda x: x.get_text(), bs.find_all(["h1"])))
corpus

['\n\t\t\t\t\t\t\tSouth Korean Food: 29 of the Best Tasting Dishes\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '1. Chili Pickled Cabbage (Kimchi 김치)',
 '2. Samgyeopsal (삼겹살)',
 '3. Pork Bulgogi (Daeji Bulgogi 불고기)',
 '4. Korean Barbecue (Gogigui 고기구이)',
 '5. Hangover Stew (Haejangguk 해장국)',
 '6. Soft Tofu Stew (Sundubu\xa0Jjigae 순두부찌게)',
 '7. Mixed Seafood Stew',
 '8. Kimchi Stew (Kimchi Jjigae\xa0김치찌개)',
 '9. Fish Stew (Saengseon Jjigae 생선찌개)',
 '10. Spicy Stir Fried Octopus (Nakji Bokkeum\xa0낙지볶음)',
 '11. Korean Ox Bone Soup (Seolleongtang\xa0설렁탕)',
 '12. Hotpot Mixed Rice (Dolsot Bibimbap\xa0돌솥 비빔밥)',
 '13. Korean Mixed Rice (Cold Bibimbap 비빔밥)',
 '14. Steamed Mandu Dumplings (Jjinmandu 찐만두)',
 '15. Deep Fried Mandu (Yaki Mandu)',
 '16. Noodles in Ice Soup (Mul Naengmyeon 물 냉면)',
 '17. Mixed Cold Noodles (Bibim Naengmyeon 비빔 냉면)',
 '18. Kimchi Fried Rice (Kimchi Bokkeumbap\xa0김치 볶음밥)',
 '19. Fried Sweet Potato Noodles (Japchae 잡채)',
 '20. Mung Bean Pancake (Bindaetteok 빈대떡)',
 '21. Korean Blood Sa

In [28]:
corpus = corpus[1:] # removing article title
alternative = []

# Removing start and end extra characters
start_str = '. '
end_str = ' ('
for i in range(len(corpus)):
    x = corpus[i]
    corpus[i] = x[x.find(start_str) + len(start_str):]
    alternative.append(x[x.find(start_str) + len(start_str): x.find(end_str)])
alternative

['Chili Pickled Cabbage',
 'Samgyeopsal',
 'Pork Bulgogi',
 'Korean Barbecue',
 'Hangover Stew',
 'Soft Tofu Stew',
 'Mixed Seafood Ste',
 'Kimchi Stew',
 'Fish Stew',
 'Spicy Stir Fried Octopus',
 'Korean Ox Bone Soup',
 'Hotpot Mixed Rice',
 'Korean Mixed Rice',
 'Steamed Mandu Dumplings',
 'Deep Fried Mandu',
 'Noodles in Ice Soup',
 'Mixed Cold Noodles',
 'Kimchi Fried Rice',
 'Fried Sweet Potato Noodles',
 'Mung Bean Pancake',
 'Korean Blood Sausage',
 'Octopus Mixed Plat',
 'Gimbap 김',
 'Korean Chicken Skewers',
 'Korean Side Dishes',
 'Tornado Potatoe',
 'Gooey Deep Fried Snack',
 'Korean Tempura',
 'Red Rice Cakes']
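Entries such as 'Mixed Seafood Ste' and 'Gimbap 김' lose their last character because `str.find` returns -1 when `' ('` is absent, and a slice ending at -1 drops the final character. A minimal sketch of a guard for that case; `split_heading` is a hypothetical helper, not part of the notebook.

```python
def split_heading(heading, start_str=". ", end_str=" ("):
    """Return the text after start_str and before end_str, tolerating a missing marker."""
    begin = heading.find(start_str)
    # If the start marker is missing, keep the heading from its first character.
    begin = 0 if begin == -1 else begin + len(start_str)
    end = heading.find(end_str, begin)
    # If the end marker is missing, keep everything to the end of the heading.
    if end == -1:
        end = len(heading)
    return heading[begin:end].strip()

print(split_heading("7. Mixed Seafood Stew"))
print(split_heading("2. Samgyeopsal (삼겹살)"))
```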

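Further fine-tuning by hand and adding alternative names. Each entry of ```corpus``` is paired with the entry at the same position in ```alternative```, so a dish is repeated once for each of its alternative names.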

In [29]:
corpus = ['Chili Pickled Cabbage (Kimchi 김치)',
            'Samgyeopsal (삼겹살)',
            'Pork Bulgogi (Daeji Bulgogi 불고기)',
            'Korean Barbecue (Gogigui 고기구이)',
            'Korean Barbecue (Gogigui 고기구이)',
            'Korean Barbecue (Gogigui 고기구이)',
            'Korean Barbecue (Gogigui 고기구이)',
            'Korean Barbecue (Gogigui 고기구이)',
            'Hangover Stew (Haejangguk 해장국)',
            'Soft Tofu Stew (Sundubu\xa0Jjigae 순두부찌게)',
            'Mixed Seafood Stew',
            'Kimchi Stew (Kimchi Jjigae\xa0김치찌개)',
            'Fish Stew (Saengseon Jjigae 생선찌개)',
            'Spicy Stir Fried Octopus (Nakji Bokkeum\xa0낙지볶음)',
            'Korean Ox Bone Soup (Seolleongtang\xa0설렁탕)',
            'Mixed Rice (Bibimbap 비빔밥)',
            'Mixed Rice (Bibimbap 비빔밥)',
            'Mixed Rice (Bibimbap 비빔밥)',
            'Steamed Mandu Dumplings (Jjinmandu 찐만두)',
            'Deep Fried Mandu (Yaki Mandu)',
            'Noodles in Ice Soup (Mul Naengmyeon 물 냉면)',
            'Mixed Cold Noodles (Bibim Naengmyeon 비빔 냉면)',
            'Kimchi Fried Rice (Kimchi Bokkeumbap\xa0김치 볶음밥)',
            'Fried Sweet Potato Noodles (Japchae 잡채)',
            'Fried Sweet Potato Noodles (Japchae 잡채)',
            'Mung Bean Pancake (Bindaetteok 빈대떡)',
            'Korean Blood Sausage (Soondae 순대)',
            'Octopus Mixed Plate',
            'Gimbap 김밥',
            'Korean Chicken Skewers (Dakkochi 닭꼬치)',
            'Korean Side Dishes (Banchan반찬)',
            'Tornado Potatoes',
            'Tornado Potatoes',
            'Gooey Deep Fried Snack (Hotteok 호떡)',
            'Korean Tempura (Twigim 튀김)',
            'Red Rice Cakes (Tteokbokki 떡볶이)',
            'Red Rice Cakes (Tteokbokki 떡볶이)',
            'Korean Food',
            'Army Stew (Budae Jjigae)',
            'Army Stew (Budae Jjigae)',
            'korean fried chicken',
            'korean fried chicken']

In [30]:
alternative = ['Chili Pickled Cabbage',
                'Samgyeopsal',
                'Bulgogi',
                'Korean Barbecue',
                'Korean BBQ',
                'K BBQ',
                'k BBQ',
                'k-BBQ',
                'Hangover Stew',
                'Soft Tofu Stew',
                'Mixed Seafood Stew',
                'Kimchi Stew',
                'Fish Stew',
                'Fried Octopus',
                'Ox Bone Soup',
                'Hotpot Mixed Rice',
                'Korean Mixed Rice',
                'Bibimbap',
                'Mandu Dumplings',
                'Deep Fried Mandu',
                'Ice Soup',
                'Cold Noodles',
                'Kimchi Fried Rice',
                'Sweet Potato Noodles',
                'Japchae',
                'Mung Bean Pancake',
                'Korean Blood Sausage',
                'Octopus Mixed Plate',
                'Gimbap',
                'Korean Chicken Skewers',
                'Korean Side Dishes',
                'Tornado Potatoe',
                'Tornado Potato',
                'Gooey Deep Fried Snack',
                'Korean Tempura',
                'Red Rice Cakes',
                'Tteokbokki',
                'Korean',
                'Army Stew',
                'Budae Jjigae',
                'korean fried chicken',
                'chir chir']
alternative = list(map(lambda x: x.lower(), alternative))

In [31]:
print(len(corpus))
print(len(alternative))

42
42


In [32]:
# Check for duplicate names
for x in df_corpus["alternative names"]:
    for food in alternative:
        if food in x or x in food:
            print("'" + x + "',")

'tofu',
'tempura',
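The check above prints only the existing name. A sketch that reports both sides of each overlap makes it easier to decide whether a match is a true duplicate or just a shared word. `find_overlaps` and the sample lists are illustrative assumptions.

```python
def find_overlaps(existing, new):
    """Return (existing_name, new_name) pairs where one string contains the other."""
    return [(old, cand)
            for old in existing
            for cand in new
            if cand in old or old in cand]

# Illustrative samples: names already in the corpus vs. candidate names to add.
existing = ["tofu", "tempura", "sushi"]
new = ["soft tofu stew", "korean tempura", "kimchi stew"]
print(find_overlaps(existing, new))
```

Here both matches are distinct dishes that merely share a word with an existing entry, so neither would be dropped as a duplicate.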


In [33]:
corpus_size += len(corpus)
df_corpus = pd.concat([df_corpus,
                       pd.DataFrame({"Food": corpus,
                                     "Cuisine": [cuisine]*len(corpus),
                                     "alternative names": alternative})])


In [34]:
print("Dataframe size =", df_corpus.shape)
print("Expected corpus size =", corpus_size)

Dataframe size = (201, 3)
Expected corpus size = 201


## Exporting Corpus

In [35]:
df_corpus.to_csv("../Instagram/" + "corpus_wikipedia.csv", index=False)

# References

- Japanese food: https://www.japancentre.com/en/pages/156-30-must-try-japanese-foods
- Korean food: https://migrationology.com/south-korean-food-dishes/

In [36]:
df_corpus[df_corpus["Cuisine"]=="Korean"]

Unnamed: 0,Food,Cuisine,alternative names
0,Chili Pickled Cabbage (Kimchi 김치),Korean,chili pickled cabbage
1,Samgyeopsal (삼겹살),Korean,samgyeopsal
2,Pork Bulgogi (Daeji Bulgogi 불고기),Korean,bulgogi
3,Korean Barbecue (Gogigui 고기구이),Korean,korean barbecue
4,Korean Barbecue (Gogigui 고기구이),Korean,korean bbq
5,Korean Barbecue (Gogigui 고기구이),Korean,k bbq
6,Korean Barbecue (Gogigui 고기구이),Korean,k bbq
7,Korean Barbecue (Gogigui 고기구이),Korean,k-bbq
8,Hangover Stew (Haejangguk 해장국),Korean,hangover stew
9,Soft Tofu Stew (Sundubu Jjigae 순두부찌게),Korean,soft tofu stew
