## Data Cleaning and Normalizing Task

### Job description:
#### 1. Translate data from Vietnamese to English:

a. **Kokotaru**: 'Assert/cleaned_recipes_translated.txt'

b. **Kitchenart**: 'Assert/cleaned_recipes_2_translated.txt'


#### 2. Clean and normalize the data, converting the 'txt' files to a 'csv' file

   | Name of dish | Ingredient 1  | Ingredient 2  |...            |
   | -------------|---------------| --------------|---------------|
   | ...          |0/1            |0/1            |...            |
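
As a rough illustration of the target layout, the sketch below builds such a 0/1 table from a few made-up dishes using only the standard library. The dish names, the ingredients, and the helper `to_binary_rows` are hypothetical; the notebook itself builds the table with pandas later on.

```python
def to_binary_rows(dishes):
    """Turn {dish: [ingredients]} into a header plus one 0/1 row per dish."""
    # Column order: every distinct ingredient, sorted alphabetically.
    columns = sorted({ing for ings in dishes.values() for ing in ings})
    header = ["Name of dish"] + columns
    rows = [
        [name] + [1 if col in ings else 0 for col in columns]
        for name, ings in dishes.items()
    ]
    return header, rows


# Made-up sample data, for illustration only.
sample = {
    "Milk Tea": ["milk", "sugar", "tea"],
    "Corn Dessert": ["corn", "sugar"],
}
header, rows = to_binary_rows(sample)
print(header)
for row in rows:
    print(row)
```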


### Install the libraries

In [1]:
!pip install nltk



### Import shared libraries and functions

In [2]:
from Library_Used import *
from Shared_Functions import *

### Read data from files

**black_list.txt**, **units.txt**, and **key_words.txt** are text files containing word lists. These words are removed from each data line so that only the food ingredients remain.

- **black_list.txt** contains noise words that describe the properties and preparation methods of ingredients.

- **units.txt** contains units of measurement and quantity words for an ingredient.

- **key_words.txt** is also a list of noise words, like **black_list.txt**, but each key word marks the start of a noisy phrase: the key word and everything after it are removed.

In [3]:
units = read_from_file("./Assert/units.txt","r_b_line")
key_words = read_from_file("./Assert/key_words.txt","r_b_line")
black_list = read_from_file("./Assert/black_list.txt","r_b_line")
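
`read_from_file` is imported from `Shared_Functions`, and its source is not shown in this notebook. The sketch below is a guess at what the `"r_b_line"` mode might do, assuming it returns one stripped, non-empty entry per line. The name `read_word_list` and this behavior are assumptions, not the actual helper.

```python
from pathlib import Path
import tempfile


def read_word_list(path):
    """Hypothetical stand-in for read_from_file(path, "r_b_line")."""
    text = Path(path).read_text(encoding="utf-8")
    # Keep one entry per line, dropping blank lines and surrounding whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Demo with a temporary file so the sketch runs without the project data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "units.txt"
    demo_path.write_text("cup\ntablespoon\n\n gram \n", encoding="utf-8")
    demo_units = read_word_list(demo_path)

print(demo_units)
```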

### Set up tools to use the *nltk* library

- Download the necessary data (WordNet and OMW) for the **lemmatizer**.
- Create a **lemmatizer** to reduce words to their singular form.

In [4]:
nltk.download('wordnet')
nltk.download('omw-1.4')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\huyen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\huyen\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
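
`WordNetLemmatizer.lemmatize` takes a single word at a time, so a multi-word ingredient such as `olive oils` may come back unchanged. One possible workaround, shown here as an assumption rather than something this notebook does, is to lemmatize only the last word of the phrase. The helper `lemmatize_last_word` and the toy `strip_s` function are hypothetical; in the notebook you would pass `lemmatizer.lemmatize` instead of `strip_s`.

```python
def lemmatize_last_word(phrase, lemmatize):
    """Apply a single-word lemmatizer to the final word of a phrase."""
    words = phrase.split()
    if not words:
        return phrase
    words[-1] = lemmatize(words[-1])
    return " ".join(words)


def strip_s(word):
    """Toy stand-in for lemmatizer.lemmatize: drops one trailing 's'."""
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word


print(lemmatize_last_word("olive oils", strip_s))
print(lemmatize_last_word("egg whites", strip_s))
```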


### Data Cleaning

In [5]:
def clean_ingredient(ingredient: str, units: list, key_words:list, black_list:list):
    """
    Cleans one raw ingredient line using the given units, key words, and black list.
    Parameters:
        ingredient (str): a raw string that may contain one or more ingredients along with descriptions
        units (list): units of measurement and quantity words
        key_words (list): words that signal the start of a noise phrase in the string
        black_list (list): noise words

    Returns:
        clean_parts (list): list containing one or more cleaned ingredients from the input string 'ingredient'
    """
    # Step 1: Normalize Unicode and convert the string to lowercase
    ingredient = unicodedata.normalize("NFKD", ingredient)
    ingredient = ingredient.lower()

    # Step 2 (before splitting into ingredients): remove text inside parentheses
    ingredient = re.sub(r"\s*\(.*?\)", "", ingredient)

    # Step 3: Split into separate ingredients on ',', ';', 'and', 'or', 'with', 'in'
    parts = re.split(r",|;|\band\b|\bor\b|\bwith\b|\bin\b", ingredient, flags=re.IGNORECASE)
    clean_parts = []

    # Iterate over each part produced by step 3:
    for part in parts:
        # Step 4.1: Remove non-ASCII characters
        part = re.sub(r"[^\x00-\x7F]+", "", part)

        # Step 4.2: Remove malformed fractions (digits followed by '/' and letters)
        part = re.sub(r"\b\d+/[a-zA-Z]+\b", "", part)

        # Step 4.3: Remove quantities and units of measurement
        units_pattern = r"\b\d*[\d/]*\s*(" + "|".join(map(re.escape, units)) + r")\b"
        part = re.sub(units_pattern, "", part, flags=re.IGNORECASE)
        part = re.sub(r"\d+", "", part)

        # Step 4.4: Replace '+', '/', '.', '*', '%', '>' with a space, as well as
        # hyphens that touch whitespace or sit between digits
        part = re.sub(r"[+/.*%>]", " ", part)
        part = re.sub(r"(?<=\s)-|-(?=\s)|(?<=\d)-(?=\d)", " ", part)

        # Step 4.5: If the part contains "of", keep only what follows the first "of"
        if " of " in part:
            part = part.split(" of ", 1)[-1]

        # Step 4.6: Use key_words to locate noise phrases; delete the key word and everything after it
        for word in key_words:
            pattern = rf"\b{re.escape(word)}\b.*"
            part = re.sub(pattern, "", part, flags=re.IGNORECASE)

        # Step 4.7: Delete black_list words found in the string, but only when the word is not
        # attached to a '-', since a hyphenated word can be part of the ingredient name
        # (e.g. all-purpose flour)
        for word in black_list:
            pattern = rf"(?<!-)\b{re.escape(word)}\b(?!-)"
            part = re.sub(pattern, "", part, flags=re.IGNORECASE)

        # Step 4.8: Collapse extra whitespace
        part = re.sub(r"\s+", " ", part)
        clean_part = part.strip()

        # Step 4.9: Normalize the ingredient to its singular form (e.g. apples -> apple)
        clean_part = lemmatizer.lemmatize(clean_part)

        # Step 5: Collect non-empty results; the list of cleaned ingredients is returned after the loop
        if clean_part: 
            clean_parts.append(clean_part)
    
    return clean_parts
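
To make the individual steps concrete, the sketch below replays a few of them (parenthesis removal, splitting, quantity and unit removal, whitespace cleanup) on one made-up ingredient line. The `demo_units` list and the input string are illustrative assumptions; the real lists come from `units.txt`, `key_words.txt`, and `black_list.txt`, and the key word, black list, and lemmatizer steps are left out here.

```python
import re

demo_units = ["cup", "cups", "g", "tablespoon"]  # assumed sample, not the real units.txt
line = "2 cups all-purpose flour (sifted), 100g sugar and 1 tablespoon vanilla"

# Step 2: drop parenthesised notes (after lowercasing, as in step 1).
line = re.sub(r"\s*\(.*?\)", "", line.lower())

# Step 3: split into separate ingredients on separators and connecting words.
parts = re.split(r",|;|\band\b|\bor\b|\bwith\b|\bin\b", line)

# Step 4.3: remove quantity + unit, then any leftover digits.
units_pattern = r"\b\d*[\d/]*\s*(" + "|".join(map(re.escape, demo_units)) + r")\b"
cleaned = []
for part in parts:
    part = re.sub(units_pattern, "", part)
    part = re.sub(r"\d+", "", part)
    # Step 4.8: collapse whitespace and keep only non-empty results.
    part = re.sub(r"\s+", " ", part).strip()
    if part:
        cleaned.append(part)

print(cleaned)
```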

### Data Normalizing

In [6]:
def convert_to_plural(word):
    """
    Converts the singular form of a word to its plural form using simple suffix rules
    Parameters: word (string)
    Returns: plural form of word (string)
    """
    
    endings = ('s', 'ss', 'sh', 'ch', 'z', 'x')

    if word.endswith('f'):
        word = word[:-1] + 'ves'

    elif word.endswith('y'):
        
        if word[-2] in 'aeiou':  
            word = word + 's'
        else:  
            word = word[:-1] + 'ies'
    
    elif any(word.endswith(ending) for ending in endings):
        word = word + 'es'

    else:  
        word = word + 's'

    return word
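
A few sample inputs show what these suffix rules produce, including cases where the simple rules diverge from standard English. The block carries a condensed copy of the rules (named `pluralize` here) so that it runs on its own; the word list is made up for illustration.

```python
def pluralize(word):
    """Condensed copy of the suffix rules in convert_to_plural."""
    if word.endswith("f"):
        return word[:-1] + "ves"
    if word.endswith("y"):
        # Vowel before 'y' -> add 's'; consonant before 'y' -> 'ies'.
        return word + "s" if word[-2] in "aeiou" else word[:-1] + "ies"
    if word.endswith(("s", "sh", "ch", "z", "x")):
        return word + "es"
    return word + "s"


samples = ["leaf", "berry", "turkey", "peach", "egg", "tomato", "knife"]
plurals = {word: pluralize(word) for word in samples}
for word, plural in plurals.items():
    print(word, "->", plural)
```

With these rules, `tomato` and `knife` come out as `tomatos` and `knifes`, so spellings such as `tomatoes` or `knives` in the source text would not be produced by this function alone.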

In [7]:
def normalize_recipes_to_dataframe(file_path, units, key_words, black_list):
    """
    Processes a text file of recipes into a DataFrame of dishes and their cleaned ingredients.

    Parameters:
        file_path (str): Path to a text file containing recipes.
        units (list): List of units to clean ingredients.
        key_words (list): List of keywords to clean ingredients.
        black_list (list): List of blacklisted words to remove from ingredients.

    Returns:
        pd.DataFrame: DataFrame with a 'title' column and an 'ingredients' column (list of cleaned ingredients per dish).
        list: Sorted list of unique ingredients.
    """
    combined_data = []

    content = read_from_file(file_path, "r_b_str")

    dishes = content.split('-' * 50)
    for dish in dishes:
        title_match = re.search(r"Title:\s*(.+)", dish)
        ingredients_match = re.search(r"Ingredients:\s*(.+)", dish, re.DOTALL)

        if title_match and ingredients_match:

            title = title_match.group(1).strip()

            ingredients_raw = ingredients_match.group(1).strip()
            ingredients_lines = ingredients_raw.splitlines() 

            if not ingredients_lines:  
                combined_data.append({
                    'title': title,
                    'ingredients': [] 
                })
                continue

            clean_ingredients = []
            for line in ingredients_lines:
                if ':' in line or '=>' in line:
                    continue
                clean_ingredients.extend(clean_ingredient(line, units, key_words, black_list))

            combined_data.append({
                'title': title,
                'ingredients': clean_ingredients
            })

    df = pd.DataFrame(combined_data)

    all_ingredients = sorted(set(ingredient.strip().lower()
                                 for ingredients in df["ingredients"]
                                 for ingredient in ingredients))
    return df, all_ingredients
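
The parsing above implies a particular layout for the input file: one block per dish, separated by a line of 50 hyphens, each with a `Title:` line and an `Ingredients:` section. The sketch below runs the same two regular expressions on a small sample; the recipe text is invented for illustration and is not taken from the data files.

```python
import re

sample = (
    "Title: Milk Tea\n"
    "Ingredients:\n"
    "For the tea:\n"
    "200ml milk\n"
    "2 tablespoons sugar\n"
    + "-" * 50 + "\n"
    "Title: Corn Dessert\n"
    "Ingredients:\n"
    "2 ears corn\n"
)

parsed = []
for dish in sample.split("-" * 50):
    title_match = re.search(r"Title:\s*(.+)", dish)
    ingredients_match = re.search(r"Ingredients:\s*(.+)", dish, re.DOTALL)
    if title_match and ingredients_match:
        lines = ingredients_match.group(1).strip().splitlines()
        # Sub-headings such as "For the tea:" are skipped, as in the function above.
        lines = [ln for ln in lines if ":" not in ln and "=>" not in ln]
        parsed.append((title_match.group(1).strip(), lines))

print(parsed)
```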

### Cleaning and Normalizing data from Kitchenart Website

In [8]:
file_path = "./Assert/cleaned_recipes_2_translated.txt"
df_kitchenart, all_ingredients = normalize_recipes_to_dataframe(file_path, units, key_words, black_list)

print(f"Number of dishes: {df_kitchenart.shape[0]}")
print(f"Number of ingredients: {len(all_ingredients)}")

Number of dishes: 682
Number of ingredients: 797


### Cleaning and Normalizing data from Kokotaru Website

The ingredient data collected from the Kokotaru website is much noisier and more complex than the Kitchenart data, so it cannot be cleaned with the same simple rules.

Instead, we search the Kokotaru data for the ingredients already extracted from the Kitchenart website, then run a second pass over the lines with no match to extract new ingredients.
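
This dictionary-style lookup can be sketched as follows. Sorting the known ingredients by length, longest first, matters because a regex alternation takes the first alternative that matches, so `cream cheese` must be tried before `cream`. The ingredient list and the sample line are invented for illustration.

```python
import re

known = ["cream", "cheese", "cream cheese", "sugar"]  # assumed sample vocabulary
line = "250g cream cheese, softened, plus 2 tbsp sugar"

# Longest names first so multi-word ingredients win over their sub-words.
ordered = sorted(known, key=len, reverse=True)
pattern = r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b"
longest_first = re.findall(pattern, line)
print(longest_first)

# For comparison: the unsorted order lets "cream" shadow "cream cheese".
naive_pattern = r"\b(?:" + "|".join(map(re.escape, known)) + r")\b"
naive = re.findall(naive_pattern, line)
print(naive)
```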

#### Find ingredients in all_ingredients that appear in the file 'cleaned_recipes_translated.txt'

In [9]:
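def process_file(file_path, all_ingredients):
    """
    Searches each ingredient line of a recipe file for known ingredients.

    Parameters:
        file_path (str): Path to a text file containing recipes.
        all_ingredients (list): Known ingredients to search for.

    Returns:
        pd.DataFrame: DataFrame with 'title' and 'ingredients' for dishes with at least one match.

    Side effects: writes matched ingredients to ./Testing/output1.txt and
    unmatched lines to ./Testing/output2.txt or ./Testing/output3.txt.
    """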
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
        
    all_ingredients_sub = sorted(all_ingredients, key=len, reverse=True)        
    processed_lines = []
    output1_lines = [] 
    output2_lines = []  
    output3_lines = []  
    combined_data = []  
    in_ingredients_section = False
    current_dish_name = None
    found_ingredients = []  
    output2_dishes = [] 

    for line in lines:
        original_line = line.strip()

        if original_line.startswith("Title: "):
            if current_dish_name:
                if found_ingredients:
                    output1_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(found_ingredients) + "\n" + "-" * 50)
                    combined_data.append({
                        'title': current_dish_name,
                        'ingredients': found_ingredients
                    })
                
                if output2_dishes:
                    if any(len(dish) > 50 for dish in output2_dishes) or any("byDang Ngoc Linh" in dish for dish in output2_dishes):
                        output3_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(output2_dishes) + "\n" + "-" * 50)
                    else:
                        output2_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(output2_dishes) + "\n" + "-" * 50)

            current_dish_name = original_line.split("Title: ", 1)[1]
            found_ingredients = []  
            output2_dishes.clear() 
            processed_lines.append(original_line)
            continue

        if original_line.startswith("Ingredients:"):
            in_ingredients_section = True
            processed_lines.append(original_line)
            continue

        if original_line == "-" * 50:
            in_ingredients_section = False
            processed_lines.append(original_line)
            continue

        if in_ingredients_section:
            if ":" in original_line:
                processed_lines.append(original_line)
                continue

            regex_ingredients = []
            for ingredient in all_ingredients_sub:
                regex_ingredients.append(re.escape(ingredient))  
                plural_ingredient = convert_to_plural(ingredient) 
                if plural_ingredient != ingredient:
                    regex_ingredients.append(re.escape(plural_ingredient))

           
            matched_ingredients = re.findall(r'\b(?:' + '|'.join(regex_ingredients) + r')\b', original_line)

            filtered_ingredients = []
            for m in matched_ingredients:
                clean_ingredient = m.strip()

                singular_ingredient = lemmatizer.lemmatize(clean_ingredient)
                plural_ingredient = convert_to_plural(singular_ingredient) 
                
                if re.search(r'\bthe\b\s+' + re.escape(clean_ingredient), original_line):
                    continue 

                if singular_ingredient in all_ingredients and singular_ingredient not in filtered_ingredients:
                    filtered_ingredients.append(singular_ingredient)

                elif plural_ingredient in original_line and singular_ingredient not in filtered_ingredients:
                    filtered_ingredients.append(singular_ingredient)
            
            for ingredient in filtered_ingredients:
                found_ingredients.append(ingredient)

            if not filtered_ingredients:
                output2_dishes.append(original_line)

            processed_lines.append(original_line)
            continue

        processed_lines.append(original_line)

    if found_ingredients:
        output1_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(found_ingredients) + "\n" + "-" * 50)
        combined_data.append({
            'title': current_dish_name,
            'ingredients': found_ingredients
        })

    if output2_dishes:
        if any(len(dish) > 50 for dish in output2_dishes) or any("byDang Ngoc Linh" in dish for dish in output2_dishes):
            output3_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(output2_dishes) + "\n" + "-" * 50)
        else:
            output2_lines.append(f"Title: {current_dish_name}\nIngredients:\n" + "\n".join(output2_dishes) + "\n" + "-" * 50)

    with open("./Testing/output2.txt", "w", encoding="utf-8") as output2_file:
        output2_file.write("\n".join(output2_lines))

    with open("./Testing/output3.txt", "w", encoding="utf-8") as output3_file:
        output3_file.write("\n".join(output3_lines))

    with open("./Testing/output1.txt", "w", encoding="utf-8") as output1_file:
        output1_file.write("\n".join(output1_lines))

    df = pd.DataFrame(combined_data)
    return df
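
The routing of unmatched lines at the end of each dish can be isolated into a small helper. In this sketch, `route_unmatched` is a hypothetical name; the 50-character threshold and the byline check mirror the conditions used in `process_file`. The reading that long lines are more likely prose than ingredient entries is an assumption about the intent, and the sample lines are made up.

```python
def route_unmatched(lines, max_len=50, byline="byDang Ngoc Linh"):
    """Return "output3" if any line is long or carries the byline, else "output2"."""
    has_long_line = any(len(line) > max_len for line in lines)
    has_byline = any(byline in line for line in lines)
    return "output3" if has_long_line or has_byline else "output2"


print(route_unmatched(["2 ripe persimmons", "a pinch of saffron"]))
print(route_unmatched(["Posted byDang Ngoc Linh"]))
```

Only the dishes routed to `output2.txt` are short enough to be passed through the regular cleaning function in the second pass.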


#### First pass: find the Kokotaru ingredients that are already in "all_ingredients"

In [10]:
file_path = "./Assert/cleaned_recipes_translated.txt"
df_kokotaru_1 = process_file(file_path, all_ingredients)


#### Second pass: reuse "normalize_recipes_to_dataframe" to extract new ingredients from the unmatched lines saved in output2.txt

In [11]:
file_path_2 = "./Testing/output2.txt"
df_kokotaru_2, all_ingredients_2 = normalize_recipes_to_dataframe(file_path_2, units, key_words, black_list)

### Merge dataframes from two websites Kitchenart and Kokotaru, create binary dataframe

#### Merge dataframe

In [12]:
df_kokotaru = pd.concat([df_kokotaru_1, df_kokotaru_2], ignore_index=True)
df_kokotaru = df_kokotaru.groupby('title', as_index=False).agg({
    'ingredients': lambda x: list(set().union(*x))  # Merge the ingredient lists of rows with the same title
})

all_ingredients = list(set(all_ingredients+all_ingredients_2))
combined_df = pd.concat([df_kokotaru, df_kitchenart], ignore_index=True)
combined_df

|      | title | ingredients |
| ---- | ----- | ----------- |
| 0 | 10 common problems and mistakes when making bread | [bread] |
| 1 | 11 ways to use leftover egg yolks | [cream cheese, chocolate, egg yolk, custard, tea] |
| 2 | 12 types of nuts for baking | [skin, seed, hazelnut, chestnut, peanut, walnu... |
| 3 | 14 ways to make Sponge cake/Gato for birthday ... | [fruit, rice, cake base, cake, whipping cream,... |
| 4 | 14 ways to use leftover egg whites | [egg white, fruit, chestnut, sweet, chocolate,... |
| ... | ... | ... |
| 1052 | Corn Dessert | [corn, white corn, sugar, salt, tapioca starch... |
| 1053 | Sweet and Sour Vegetables (Caponata) | [olive oil, eggplant, shallot, tomato, caper, ... |
| 1054 | Milk Tea | [milk, sugar, vanilla, cornstarch, pistachio, ... |
| 1055 | Cheese Chiffon Cake | [milk, cream cheese, unsalted butter, all-purp... |
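
What the `groupby`/`agg` step computes can be shown without pandas: ingredient lists that share a title are merged into one de-duplicated list. This is a plain-Python sketch on made-up rows, not the notebook's code. Going through a `set` does not preserve order, so the sketch sorts each result to keep the output stable.

```python
from collections import defaultdict

rows = [  # made-up (title, ingredients) pairs
    ("Milk Tea", ["milk", "sugar"]),
    ("Corn Dessert", ["corn"]),
    ("Milk Tea", ["sugar", "tea"]),
]

# Collect the ingredients of every row that shares a title into one set.
merged = defaultdict(set)
for title, ingredients in rows:
    merged[title].update(ingredients)

result = {title: sorted(ings) for title, ings in merged.items()}
print(result)
```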


#### Convert to binary dataframe; check and merge ingredients that appear in both singular and plural forms

In [13]:
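def convert_to_binary_df(combined_df, all_ingredients):
    """
    Builds a 0/1 DataFrame (one row per dish, one column per ingredient) and
    merges each plural column into its singular counterpart.

    Returns the binary DataFrame and the updated, sorted ingredient list.
    """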
    binary_df = pd.DataFrame(0, index=combined_df['title'], columns=all_ingredients)
    
    for _, row in combined_df.iterrows():
        ingredients = row['ingredients'] 
        for ingredient in ingredients:
            if ingredient in binary_df.columns:
                binary_df.loc[row['title'], ingredient] = 1
    
    binary_df.insert(0, 'Name of dish', binary_df.index)
    binary_df = binary_df.reset_index(drop=True) 


    colnames = list(binary_df.columns)
    for col in colnames:
        if col == 'Name of dish':
            continue
    
        plural_col = convert_to_plural(col)
        
        if plural_col in colnames:
            binary_df[col] = binary_df[col] | binary_df[plural_col]
            
            binary_df.drop(columns=[plural_col], inplace=True)

            all_ingredients.remove(plural_col)
            all_ingredients = sorted(all_ingredients,key=str.lower)
    return binary_df, all_ingredients

binary_df, all_ingredients= convert_to_binary_df(combined_df, all_ingredients)
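
The singular/plural merge amounts to a bitwise OR of two columns followed by dropping the plural one. Below is a plain-Python sketch on a made-up table, with a deliberately simplified `+ "s"` plural rule standing in for `convert_to_plural`:

```python
table = {  # made-up columns: ingredient -> 0/1 flag per dish
    "egg": [1, 0, 0],
    "eggs": [0, 1, 0],
    "milk": [0, 0, 1],
}

for name in list(table):
    plural = name + "s"  # simplified stand-in for convert_to_plural
    if name in table and plural in table:
        # A dish has the ingredient if either spelling was flagged.
        table[name] = [a | b for a, b in zip(table[name], table[plural])]
        del table[plural]

print(table)
```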


In [14]:
binary_df.to_csv("../Assert/ingredients.csv", index=False, encoding='utf-8')

with open("./Testing/ingredients.txt", "w", encoding="utf-8") as file:
    for ingredient in all_ingredients:
        file.write(ingredient + "\n")


print(f"Number of dishes: {binary_df.shape[0]}")
print(f"Number of ingredients: {len(all_ingredients)}")

Number of dishes: 1057
Number of ingredients: 813
