## Data Collection Task

### Job description:
1. Websites to scrape:\
   a. [kokotaru](https://kokotaru.com/) - Total: ~500 posts.\
   b. [kitchenart](https://cook.kitchenart.vn/cong-thuc-nau-an/) - Total: ~800 articles.
2. Data to scrape:

   | Name of dish | Main ingredients     |
   | -------------|----------------------|
   | ...          |...                   |


### Install the libraries

In [2]:
# !pip install bs4
# !pip install selenium
# !pip install numpy
# !pip install pandas
# !pip install requests
# !pip install tqdm

### Import shared libraries and functions

In [3]:
from Libraries_Used import *
from Shared_Functions import *
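
`Shared_Functions` is not shown in this notebook. The sketch below is only a guess at what its file helpers might look like, inferred from how `write_to_file` and `read_from_file` are called later on. The `_sketch` function names and the meaning given to each mode string (`'w_b_str'`, `'w_b_element'`, `'r_b_str'`, `'r_b_line'`) are assumptions, not the project's actual code.

```python
from pathlib import Path


def write_to_file_sketch(path, data, mode):
    """Hypothetical stand-in for write_to_file; the mode meanings are assumed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    if mode == 'w_b_str':
        # Assumed: write a single string as UTF-8 bytes.
        target.write_bytes(data.encode('utf-8'))
    elif mode == 'w_b_element':
        # Assumed: write one list element per line.
        target.write_bytes('\n'.join(data).encode('utf-8'))
    else:
        raise ValueError(f"Unknown mode: {mode}")


def read_from_file_sketch(path, mode):
    """Hypothetical stand-in for read_from_file; the mode meanings are assumed."""
    text = Path(path).read_bytes().decode('utf-8')
    if mode == 'r_b_str':
        # Assumed: return the whole file as one string.
        return text
    if mode == 'r_b_line':
        # Assumed: return the non-empty lines as a list.
        return [line for line in text.splitlines() if line.strip()]
    raise ValueError(f"Unknown mode: {mode}")
```

Writing bytes with an explicit UTF-8 encoding keeps Vietnamese dish names intact regardless of the platform's default encoding.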

### **Comment:**
The two websites load their content differently, so the article URLs are collected in two separate sections.
1. **Kokotaru:** More articles load as the user scrolls down the page.
2. **Kitchenart:** Articles are split across numbered pages of 20 articles each.

---

### KOKOTARU WEBSITE ARTICLE URLS PARSING

#### Interaction:
* The page loads more articles as the user scrolls.
#### Description:
* All articles are listed on the single kokotaru homepage.

### Step 1: Retrieve the page content with Selenium

In [4]:
KOKOTARU_BASE_URL = 'https://kokotaru.com/'

**Step 1.1:** Create a Selenium Chrome browser.\
**Step 1.2:** Scroll until the page has loaded all of its content.

In [5]:
def kokotaru_page_loader(BASE_URL: str) -> str:
    """Scroll the page to the bottom until no new content loads, then return its HTML."""
    driver = webdriver.Chrome()
    max_scrolls = 100
    progress_bar = tqdm(desc="Scrolling", ncols=100, leave=True, unit="scroll", total=max_scrolls)

    try:
        driver.get(BASE_URL)
        last_height = driver.execute_script("return document.body.scrollHeight")

        for _ in range(max_scrolls):
            # Jump to the bottom and give the page time to load more articles.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(5)

            new_height = driver.execute_script("return document.body.scrollHeight")
            progress_bar.update(1)

            # An unchanged height means nothing new was loaded, so stop.
            if new_height == last_height:
                break

            last_height = new_height

        html_content = driver.page_source
    finally:
        # Always release the browser and the progress bar, even on errors.
        driver.quit()
        progress_bar.close()

    print("Parsing completed.")

    return html_content
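
The stopping rule used above (keep scrolling until the page height stops growing, or until a scroll limit is reached) can be tried without a browser. The following is a minimal sketch of that rule only; `scroll_until_stable` and `FakePage` are hypothetical names introduced for illustration, and the toy page simply grows by a fixed amount a few times.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, max_scrolls=100, pause=0.0):
    """Scroll until the height stops changing; return the number of scrolls made."""
    last_height = get_height()
    scrolls = 0
    while scrolls < max_scrolls:
        scroll_to_bottom()
        time.sleep(pause)  # give the page time to load more content
        scrolls += 1
        new_height = get_height()
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls


class FakePage:
    """Toy page whose height grows by 100 on each of the first `loads` scrolls."""

    def __init__(self, loads):
        self.height = 1000
        self.loads = loads

    def scroll(self):
        if self.loads > 0:
            self.height += 100
            self.loads -= 1


page = FakePage(loads=3)
count = scroll_until_stable(lambda: page.height, page.scroll)
print(count, page.height)  # 4 1300
```

The loop needs one extra scroll after the last successful load, because only an unchanged height shows that the end was reached.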

In [6]:
# kokotaru_html = kokotaru_page_loader(KOKOTARU_BASE_URL)

**Step 1.3:** Save the content to `kokotaru_html_content.html`

In [7]:
# write_to_file("Assert/kokotaru_html_content.html",kokotaru_html, 'w_b_str')

### Step 2: Get the article links on the homepage

In [8]:
# kokotaru_content = read_from_file("Assert/kokotaru_html_content.html", 'r_b_str')

In [9]:
def get_kokotaru_articles_urls(html_content) -> list:
    urls = []
    soup = BeautifulSoup(html_content, 'html.parser')
    article_headers = soup.find_all("article", { "class" : "entry-preview" })
    for header in article_headers:
        found = header.find('a', class_='cs-overlay-link')
        url = found['href'] if found else None
        if url is not None:
            urls.append(url)
    return urls

In [10]:
# kokotaru_article_urls = get_kokotaru_articles_urls(kokotaru_content)

Write the retrieved URLs to `kokotaru_article_urls.txt`

In [11]:
# write_to_file('Assert/kokotaru_article_urls.txt', kokotaru_article_urls, 'w_b_element')

### KITCHENART WEBSITE ARTICLE URLS PARSING

#### Interaction:
* Standard pagination.
#### Description:
* Each page contains 20 articles.

### Step 1: Get the urls leading to the articles on the current page

In [12]:
def kitchenart_article_urls_onepage(page_url: str):
    """Return the article URLs on one listing page, or page_url itself (a str) on failure."""
    u_list = []

    try:
        response = requests.get(page_url, timeout=10)
    except requests.exceptions.RequestException as err:
        # Network errors are reported the same way as bad status codes.
        print(f'Requests error: {err}')
        return page_url

    if response.status_code != 200:
        print(f"Failed to access {page_url}. Status code: {response.status_code}")
        return page_url

    soup = BeautifulSoup(response.content, "html.parser")
    articles_list = soup.find_all("a", {"class": "recipe-card__title-link"})

    for article in articles_list:
        link = article.get('href')
        if link:
            u_list.append(link)

    return u_list

### Step 2: Get the url leading to the next page

**Solution 1:** Get the next page through the current page.

In [13]:
def kitchenart_find_next_page_url(html_content) -> str:
    
    next_page = ""
    
    soup = BeautifulSoup(html_content,"html.parser")

    link_found = soup.find("a", {"class" : "next page-numbers"})
        
    if link_found is not None:
        
        next_page = link_found.get("href")
            
        
    return next_page
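
Solution 1 only finds a single next-page link; to walk the whole listing it has to be called in a loop. The sketch below shows one way such a loop might look. `collect_pages_by_next_link` is a hypothetical helper, and the fetching and link-finding steps are passed in as functions so that the loop can be tried on a toy site without network access.

```python
def collect_pages_by_next_link(start_url, fetch_html, find_next_page_url, max_pages=1000):
    """Follow "next" links from start_url and return the visited page URLs in order."""
    visited = []
    current = start_url
    # Stop on an empty link, on a repeated page, or at the page limit.
    while current and current not in visited and len(visited) < max_pages:
        visited.append(current)
        current = find_next_page_url(fetch_html(current))
    return visited


# Toy site: each "page" is represented only by the URL of its next page.
toy_site = {"/page/1/": "/page/2/", "/page/2/": "/page/3/", "/page/3/": ""}
pages = collect_pages_by_next_link("/page/1/", toy_site.get, lambda html: html)
print(pages)  # ['/page/1/', '/page/2/', '/page/3/']
```

With the real site, `fetch_html` could be something like `lambda url: requests.get(url, timeout=10).content` and `find_next_page_url` could be `kitchenart_find_next_page_url`; a pause between requests would still be needed.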

**Solution 2:** Build each page URL from a page counter, starting at the first page, and try to connect to it.

In [14]:
def get_all_page_urls(BASE_URL: str) -> list:
    urls = []
    failed_urls = []
    next_page = BASE_URL
    page_counter = 2
    
    print("Starting to gather URLs from the website:")
    progress_bar = tqdm(desc="Page Loading", unit="page")
    
    while True:
        response = requests.get(next_page)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            urls.append(next_page)
            print(f"Page {len(urls)} loaded successfully: {next_page}")
            
            find_nextpage = soup.find("a", {"class": "next page-numbers"})
            
            if find_nextpage is None:
                print("No more pages to load.")
                break
            
            next_page = BASE_URL + f'page/{page_counter}/'
            page_counter += 1
            
            time.sleep(3)
            
            progress_bar.update(1)
        else:
            failed_urls.append(next_page)
            print(f"Failed to load page {page_counter - 1}: {next_page}")
            break
    
    progress_bar.close()
    
    if failed_urls:
        print("\nRetrying failed URLs...")
        retry_failed_urls = []
        
        retry_progress_bar = tqdm(total=len(failed_urls), desc="Retrying Failed URLs", unit="page")
        
        for url in failed_urls:
            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, "html.parser")
                urls.append(url)
                print(f"Retry successful: {url}")
            else:
                retry_failed_urls.append(url)
                print(f"Retry failed for: {url}")
            
            time.sleep(3)
            retry_progress_bar.update(1)
        
        retry_progress_bar.close()
        
        if retry_failed_urls:
            print(f"Failed to load {len(retry_failed_urls)} URLs after retrying.")
        else:
            print("All URLs loaded successfully after retry.")
    
    print(f"\nTotal pages loaded: {len(urls)}")
    return urls
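
Solution 2 builds each page URL by appending `page/N/` to the base URL. The sketch below isolates that step using `urllib.parse.urljoin`; `build_page_url` is a hypothetical helper name, and treating page 1 as the base URL itself mirrors how the loop above starts.

```python
from urllib.parse import urljoin


def build_page_url(base_url, page_number):
    """Return the listing URL for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    if page_number == 1:
        return base_url
    # urljoin replaces the last path segment unless the base ends with "/".
    base = base_url if base_url.endswith('/') else base_url + '/'
    return urljoin(base, f'page/{page_number}/')


print(build_page_url('https://cook.kitchenart.vn/cong-thuc-nau-an/', 2))
# https://cook.kitchenart.vn/cong-thuc-nau-an/page/2/
```

Normalising the trailing slash avoids a malformed URL if the base is ever written without it, which plain string concatenation would not catch.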

### Step 3: Retrieve all article URLs from all existing pages

In [15]:
KITCHENART_BASE_URL = 'https://cook.kitchenart.vn/cong-thuc-nau-an/'

**Step 3.1:** Get all the page links. If a page is not accessible (for example, because of an error on the website's side), crawling stops and that page is retried once. If the retry also fails, the page is skipped.

In [16]:
# next_pages = get_all_page_urls(KITCHENART_BASE_URL)
# next_pages

**Step 3.2:** Write the page links to the file `kitchenart_pages_urls.txt`

In [17]:
# write_to_file('Assert/kitchenart_pages_urls.txt', next_pages, 'w_b_element')

**Step 3.3:** The function below collects the article links from every page. Each page is tried at most twice; a page that still fails on the second attempt is skipped.

In [18]:
def find_all_article_urls_in_all_pages(PAGE_URLS: list) -> list:
    u_list = []
    failed_urls = []
    
    progress_bar = tqdm(total=len(PAGE_URLS), desc="Page Loading", unit="page")
    
    for current_page in PAGE_URLS:
        tmp_url = kitchenart_article_urls_onepage(current_page)
        
        if isinstance(tmp_url, str):
            failed_urls.append(current_page)
        else:
            u_list.extend(tmp_url)
        
        time.sleep(3)
        progress_bar.update(1)
    
    progress_bar.close()
    print(f"Articles found: {len(u_list)}")
    print('---------------------------------')
    print(f"Failed URLs: {len(failed_urls)}")
    
    if failed_urls:
        print("Retry is starting...")
        retry_failed_urls = []
        
        retry_progress_bar = tqdm(total=len(failed_urls), desc="Retrying Failed URLs", unit="page")
        
        for current_page in failed_urls:
            tmp_url = kitchenart_article_urls_onepage(current_page)
            
            if isinstance(tmp_url, str):
                retry_failed_urls.append(current_page)
            else:
                u_list.extend(tmp_url)
            
            time.sleep(3)
            retry_progress_bar.update(1)
        
        retry_progress_bar.close()
        print(f"Articles found after retry: {len(u_list)}")
        print('---------------------------------')
        print(f"Failed URLs after retry: {len(retry_failed_urls)}")
    
    print("Crawling completed.")
    return u_list
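
The first pass and the retry pass above run the same loop twice. A minimal sketch of how the two passes might be folded into one helper follows. `fetch_with_retries` is a hypothetical name, and it assumes that a failed fetch is signalled by `None`, which differs from `kitchenart_article_urls_onepage` (that function returns the page URL as a string on failure).

```python
import time


def fetch_with_retries(fetch_one, items, attempts=2, pause=0.0):
    """Call fetch_one on each item, retrying failures; return (results, failed)."""
    results = {}
    pending = list(items)
    for _ in range(attempts):
        still_failing = []
        for item in pending:
            outcome = fetch_one(item)
            if outcome is None:  # assumed failure signal
                still_failing.append(item)
            else:
                results[item] = outcome
            time.sleep(pause)  # be polite to the server between requests
        pending = still_failing
        if not pending:
            break
    return results, pending


# Toy fetcher: "page-b" fails once, "page-c" always fails.
calls = {}

def flaky(url):
    calls[url] = calls.get(url, 0) + 1
    if url == "page-c" or (url == "page-b" and calls[url] == 1):
        return None
    return [url + "/article-1"]


results, failed = fetch_with_retries(flaky, ["page-a", "page-b", "page-c"], attempts=2)
print(sorted(results), failed)  # ['page-a', 'page-b'] ['page-c']
```

Changing `attempts` is then enough to try each page more often, without copying the loop again.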


### Step 4: Run the program to retrieve all article urls

In [19]:
# page_urls = read_from_file('Assert/kitchenart_pages_urls.txt', 'r_b_line')

In [20]:
# kitchenart_article_urls = find_all_article_urls_in_all_pages(page_urls)

**Step 4.1:** Write these urls into the file `kitchenart_article_urls.txt`

In [21]:
# write_to_file('Assert/kitchenart_article_urls.txt', kitchenart_article_urls, 'w_b_element')

---

### KOKOTARU WEBSITE ARTICLE DATA PARSING

#### Type 1: Ingredients are presented consistently and have distinguishing tags.

In [22]:
def fetch_ingredients_type1(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            # Title tag
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find ingredients
            ingredient_div = soup.find("div", class_="wprm-recipe-ingredient-group")
            
            if ingredient_div:
                ingredients = ingredient_div.get_text(separator="\n").strip()
                return title, ingredients  
            else:
                return title, None 
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [23]:
def fetch_ingredients_type2(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            # Find title: <h1>, class "entry-title"
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find ingredients: an <h2> whose text contains "tỉ lệ", "thành phần", or "nguyên liệu" (case-insensitive)
            ingredient_header = None
            for h2 in soup.find_all("h2", class_="has-vivid-red-color has-text-color wp-block-heading"):
                h2_text = h2.get_text().strip().lower()
                if any(keyword in h2_text for keyword in ["tỉ lệ", "thành phần", "nguyên liệu"]):
                    ingredient_header = h2
                    break
            
            ingredients = None  
            
            # Collect the ingredients listed under the header
            if ingredient_header:
                ingredients_list = []
                
                # The first <ul> after the header holds the ingredient list
                ul_tag = ingredient_header.find_next("ul")
                
                if ul_tag:
                    for li in ul_tag.find_all("li"):
                        ingredients_list.append(li.get_text().strip())
                
                # Join the list into a newline-separated string
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [24]:
def fetch_ingredients_type3(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            

            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            

            ingredient_header = soup.find("h3", string=lambda text: "Nguyên liệu" in text if text else False)
            
            ingredients = None  
            
            if ingredient_header:
                ingredients_list = []
                
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3":  # Stop when you encounter the next <h3> tag ("How to" section)
                        break
                    if sibling.name == "p" and "text-align" in sibling.get("style", ""):
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [25]:
def fetch_ingredients_type4(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find the first <strong>, <span>, or <h3> tag whose text contains "nguyên liệu" (ingredients)
            ingredient_header = soup.find(lambda tag: tag.name in ["strong", "span", "h3"] and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                ul_tag = ingredient_header.find_next("ul")
                
                if ul_tag:
                    for li in ul_tag.find_all("li"):
                        ingredients_list.append(li.get_text().strip())
                
                if not ingredients_list:
                    for sibling in ingredient_header.find_next_siblings():
                        if sibling.name == "h3" or sibling.name == "strong":
                            break
                        if sibling.name == "p":
                            ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [26]:
def fetch_ingredients_type5(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find the <h4> tag containing the keyword "nguyên liệu"
            ingredient_header = soup.find(lambda tag: tag.name == "h4" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                # Find all <p> tags following the <h4> tag until the next <h3> tag
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3":
                        break
                    if sibling.name == "p":
                        ingredients_list.append(sibling.get_text().strip())
                
                # Combine all ingredient lines into a single string
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [27]:
def fetch_ingredients_type6(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            ingredient_start = soup.find(lambda tag: tag.name == "p" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_start:
                ingredients_list = []
                
                for sibling in ingredient_start.find_next_siblings():
                    if sibling.name == "p" and "cách làm" in sibling.get_text().lower():
                        break

                    if sibling.name == "p" and sibling.get_text().strip():
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [28]:
def fetch_ingredients_type7(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            ingredient_header = soup.find(lambda tag: tag.name == "h3" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3" and "cách làm" in sibling.get_text().lower():
                        break
                    if sibling.name == "p" and sibling.get_text().strip():
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

#### Get all ingredients from a list of URLs for a given article format

In [29]:

def get_all_ingredients_from_urls(url_list, type):
    ingredients_dict = {}
    not_found_urls = []
    # Map each article format number to its extractor.
    fetchers = {
        1: fetch_ingredients_type1,
        2: fetch_ingredients_type2,
        3: fetch_ingredients_type3,
        4: fetch_ingredients_type4,
        5: fetch_ingredients_type5,
        6: fetch_ingredients_type6,
        7: fetch_ingredients_type7,
    }
    if type not in fetchers:
        raise ValueError(f"Unknown article type: {type}")
    fetch_ingredients_type = fetchers[type]

    with ThreadPoolExecutor(max_workers=8) as executor:
        with tqdm(total=len(url_list), desc="Page Loading", unit="page") as progress_bar:
            future_to_url = {executor.submit(fetch_ingredients_type, url): url for url in url_list}
            
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    result = future.result()
                    if result:
                        title, ingredients = result
                        if ingredients:
                            ingredients_dict[title] = ingredients
                        else:
                            not_found_urls.append(url)
                    else:
                        not_found_urls.append(url)
                except Exception as e:
                    print(f"Error processing {url}: {e}")
                    # Keep the URL so that a later pass can try it again.
                    not_found_urls.append(url)
                
                progress_bar.update(1)
                time.sleep(2)
    
    return ingredients_dict, not_found_urls
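
The cells that follow call this function once per article format, each time passing in only the URLs that the previous format failed to parse. A minimal, single-threaded sketch of that cascade is shown here; `run_extractor_cascade` and the two toy extractors are hypothetical, and each extractor is assumed to return `(title, ingredients)` or `None`, like the `fetch_ingredients_type*` functions.

```python
def run_extractor_cascade(urls, extractors):
    """Try each extractor in order on the URLs that earlier extractors could not parse."""
    found = {}
    remaining = list(urls)
    for extractor in extractors:
        next_remaining = []
        for url in remaining:
            result = extractor(url)
            if result and result[1]:
                title, ingredients = result
                found[title] = ingredients
            else:
                next_remaining.append(url)
        remaining = next_remaining
    return found, remaining


# Toy extractors standing in for two article formats.
def by_recipe_card(url):
    return ("Dish A", "1 egg") if url == "u1" else ("Other", None)

def by_heading(url):
    return ("Dish B", "2 cups flour") if url == "u2" else None


found, remaining = run_extractor_cascade(["u1", "u2", "u3"], [by_recipe_card, by_heading])
print(found, remaining)
# {'Dish A': '1 egg', 'Dish B': '2 cups flour'} ['u3']
```

Because results are keyed by title, two articles with the same title (including the fallback `"Unknown Title"`) would overwrite each other; keying by URL would avoid that.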

In [30]:
kokotaru_article_links = read_from_file("Assert/kokotaru_article_urls_full.txt", 'r_b_line')

In [31]:
all_ingredients_type1, type_1_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 1)
print("Ingredients Dictionary:", all_ingredients_type1)
print("Not Found URLs:", type_1_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type1))
print("Not Found URLs Len:", len(type_1_not_found))

Page Loading: 100%|██████████| 458/458 [15:17<00:00,  2.00s/page]

Ingredients Dictionary: {'Bánh mì cà rốt': '600\n \ng\n \nbột bánh mì\n1\n \ntsp\n \nmuối\n30\n \ng\n \nđường trắng\n200\n \nml\n \nsữa tươi\n - \nấm\n12\n \ng\n \nmen khô\n170\n \nml\n \nnước ép cà rốt\n120\n \ng\n \ncà rốt bào sợi\n40\n \ng\n \nbơ nhạt', 'Bánh hành chiên': '1\n \nquả\n \ntrứng\n1\n \ntsp\n \nmen khô\n¼\n \ntsp\n \nbaking soda\n¼\n \ntsp\n \nbaking powder\n1\n \ntsp\n \nmuối\n200\n \nml\n \nnước ấm\n300\n \ng\n \nbột mì đa dụng\n100\n \ng\n \nhành lá\n - \nthái nhỏ\n50\n \ng\n \nvừng trắng', 'Salted caramel cheesecake': 'Đế bánh\n1\n \ncup\n \nbột mì nguyên cám\n - \nwhole wheat flour\n½\n \ntsp\n \nmuối\n1\n \ntsp\n \nbaking powder\n¼\n \ncup\n \nđường nâu\n - \nbrown sugar\n100\n \ng\n \nbơ nhạt\n - \nlạnh, cắt khối vuông nhỏ', 'Apple crisp cheesecake pie': 'Đế bánh:\n1\n \ncup\n \nbột mì nguyên cám\n - \nwhole meal flour\n½\n \ntsp\n \nmuối\n1\n \ntsp\n \nbaking powder\n¼\n \ncup\n \nđường nâu\n100\n \ng\n \nbơ nhạt\n - \nlạnh, cắt hạt lựu', 'Pumpkin buns': 'Vỏ bán
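
In the type 1 output above, the amount, unit, and name of one ingredient are separated by `"\n \n"`, a note is attached with `"\n - \n"`, and ingredients are separated by a plain `"\n"`. The sketch below is one possible way to turn such a string into one line per ingredient. `tidy_type1_ingredients` is a hypothetical helper, and the separator patterns are inferred only from the output shown here.

```python
import re


def tidy_type1_ingredients(raw):
    """Turn a type 1 ingredient string into a list with one ingredient per item."""
    text = raw.replace('\xa0', ' ')
    # "\n - \n" attaches a note to the ingredient before it.
    text = re.sub(r'\n[ \t]*-[ \t]*\n', ' - ', text)
    # "\n \n" separates amount, unit, and name within one ingredient.
    text = re.sub(r'\n[ \t]+\n', ' ', text)
    return [line.strip() for line in text.split('\n') if line.strip()]


raw = '600\n \ng\n \nbột bánh mì\n1\n \ntsp\n \nmuối\n200\n \nml\n \nsữa tươi\n - \nấm'
print(tidy_type1_ingredients(raw))
# ['600 g bột bánh mì', '1 tsp muối', '200 ml sữa tươi - ấm']
```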




In [32]:
all_ingredients_type2, type_2_not_found = get_all_ingredients_from_urls(type_1_not_found, 2)
print("Ingredients Dictionary:", all_ingredients_type2)
print("Not Found URLs:", type_2_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type2))
print("Not Found URLs Len:", len(type_2_not_found))

Page Loading: 100%|██████████| 382/382 [12:45<00:00,  2.00s/page]

Ingredients Dictionary: {'Cách làm hạt đác rim lá dứa': '500g hạt đác tươi\n150g đường\n200ml nước lá dứa', 'Cách làm hạt đác rim dâu tây': '500g hạt đác tươi\n180g đường trắng (bạn có thể dùng đường phèn)\n150g dâu tây, thái lát', 'Cách làm hạt đác rim chanh leo': '500g hạt đác tươi\n3-4 quả chanh leo\n200g đường trắng (hoặc đường phèn)', 'Cách làm Bánh quy dứa': '330g (2 1/4 cups) bột mì đa dụng\n1/2 tsp baking powder\n1/4 tsp muối\n250g (1cup) bơ nhạt, để mềm\n110g (1/2 cup) đường trắng\n1 quả trứng to\n1 tbs nước ép dứa (dứa tươi – không phải loại nước hoa quả ép đóng chai nhé)', 'Cách làm bánh vừng giòn tan – Sesame tuiles': '150g đường bột\n210g cake flour\n1/8 tsp bột nutmeg\n140g lòng trắng trứng\n140g bơ chảy\n3g vỏ chanh bào vụn\n30g vừng trắng\nThêm 15g vừng nữa để rắc lên mặt bánh.', 'Cách làm bánh pancake dứa – Pineapple pancake': 'Dứa (chuẩn bị khoảng 1-2 quả tùy độ to nhỏ), gọt vỏ, bỏ mắt, thái lát tròn độ dày 1cm.\n225g bột self-rising flour\n1 tsp baking powder\n35g dừ




In [34]:
all_ingredients_type3, type_3_not_found = get_all_ingredients_from_urls(type_2_not_found, 3)
print("Ingredients Dictionary:", all_ingredients_type3)
print("Not Found URLs:", type_3_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type3))
print("Not Found URLs Len:", len(type_3_not_found))

Page Loading: 100%|██████████| 332/332 [11:06<00:00,  2.01s/page]

Ingredients Dictionary: {'Cách làm bánh cupcake socola kem phomai': '* Nhân cream cheese:\n– 227g cream cheese, để nhiệt độ phòng.\n– 50g đường trắng\n– 1 quả trứng to\n– 1/2 tsp vanilla\n* Bánh chocolate:\n– 210g bột mì đa dụng\n– 170g đường nâu\n– 30g bột cacao\n– 1 tsp baking soda\n– 1/4 tsp muối\n– 240ml nước\n– 80ml dầu ăn\n– 1 tbs dấm trắng (distilled vinegar – hoặc bạn có thể dùng dấm tự nhiên – natural vinegar hoặc dấm táo – apple cider vinegar)\n– 1 tsp vanilla', 'Công thức làm bánh chanh vàng – Lemon bars': '(khuôn vuông 20x20cm)', 'Công thức bánh Pavlova dâu tây': '– 120g lòng trắng trứng\n– 140g đường trắng hạt mịn hoặc đường bột\n– 1 tsp dấm trắng\n– 1/2 tbs bột ngô\n– 1 cup (240ml) whipping cream\n– 1 tbs đường trắng\n– 1/2 tsp tinh chất vanilla', 'Pancake ngô ngọt': '– 200-250g ngô ngọt đã tách hạt\n– 20ml sữa tươi làm nóng\n– 15g bột mì đa dụng\n– 1 chút muối và tiêu\n– 2 lòng trắng trứng to hoặc 3 lòng trắng trứng nhỏ\n– ít dầu ăn để bôi chảo', 'Công thức bánh cà rốt n




In [35]:
all_ingredients_type4, type_4_not_found = get_all_ingredients_from_urls(type_3_not_found, 4)
print("Ingredients Dictionary:", all_ingredients_type4)
print("Not Found URLs:", type_4_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type4))
print("Not Found URLs Len:", len(type_4_not_found))

Page Loading: 100%|██████████| 310/310 [10:21<00:00,  2.01s/page]

Ingredients Dictionary: {'Lỗi thường gặp khi làm bánh': 'Nhiều bạn có bộ thìa đong, bộ cup đong nhưng lại chưa để ý đong thế nào là chuẩn, có người đong võng (ít hơn) và có người đong vồng (nhiều hơn). Vậy cần dùng thìa đong, cup đong như thế nào cho đúng?\nCách đong chuẩn là bạn đong thật đầy, không cần ấn chặt, sau đó dùng 1 dụng cụ gạt phẳng mặt, như vậy sẽ có được lượng đong chuẩn. Điều này áp dụng với nguyên liệu khô, còn nguyên liệu lỏng thì đương nhiên bạn đong đầy đến mặt là ko thế đong quá được nữa ?', 'Cách làm bánh bông lan trứng muối chà bông': '7 quả trứng to, tách riêng lòng trắng, đỏ\n50g sữa tươi\n50g dầu hạt cải / dầu hướng dương\n1/2 tsp tinh chất vanilla\n75g bột mì cake flour\n75g bột ngô\n60g đường trắng\n1 tsp cream of tartar\n1/8 tsp muối', 'Cách làm bánh Bông lan/Gato HongKong lá dứa': '4 quả trứng, tách riêng lòng trắng và lòng đỏ\n80g đường trắng\n1/4 tsp cream of tartar\n70g\xa0nước cốt lá dứa đặc\n35g dầu hạt cải (canola oil), hoặc dầu hướng dương (sunflower




In [36]:
all_ingredients_type5, type_5_not_found = get_all_ingredients_from_urls(type_4_not_found, 5)
print("Ingredients Dictionary:", all_ingredients_type5)
print("Not Found URLs:", type_5_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type5))
print("Not Found URLs Len:", len(type_5_not_found))

Page Loading: 100%|██████████| 92/92 [03:05<00:00,  2.02s/page]

Ingredients Dictionary: {}
Not Found URLs: ['https://kokotaru.com/banh-mi-nuong-trung-thit-hun-khoi/', 'https://kokotaru.com/banh-sinh-nhat-trang-tri-fondant-lam-theo-yeu-cau/', 'https://kokotaru.com/tim-hieu-ve-noi-chien-khong-dau/', 'https://kokotaru.com/pizza-banh-mi-banh-pizza-lam-tu-banh-mi-sandwich/', 'https://kokotaru.com/walnut-quince-yogurt-cake-banh-quince-sua-chua-hat-oc-cho/', 'https://kokotaru.com/poppy-seeds-quince-cake-banh-qua-quince-hat-poppy/', 'https://kokotaru.com/cach-lam-hat-dac-rim-dua-thom/', 'https://kokotaru.com/cach-che-bien-hat-dac-tuoi-dac-san-nha-trang/', 'https://kokotaru.com/9-loai-banh-pancake-ngon-khong-can-dung-lo-nuong/', 'https://kokotaru.com/kinh-nghiem-lua-chon-va-su-dung-may-tron-bot-mixer/', 'https://kokotaru.com/van-de-thuong-gap-voi-cookies/', 'https://kokotaru.com/nhung-vat-dung-co-ban-cua-lam-banh/', 'https://kokotaru.com/14-cach-lam-banh-bong-lan-gato-cho-trang-tri-banh-kem-sinh-nhat/', 'https://kokotaru.com/banh-he-nuong-chao/', 'https://k




In [37]:
all_ingredients_type6, type_6_not_found = get_all_ingredients_from_urls(type_5_not_found, 6)
print("Ingredients Dictionary:", all_ingredients_type6)
print("Not Found URLs:", type_6_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type6))
print("Not Found URLs Len:", len(type_6_not_found))

Page Loading: 100%|██████████| 92/92 [03:05<00:00,  2.02s/page]

Ingredients Dictionary: {'Pizza bánh mì': 'Có thể bạn cũng sẽ gặp phải hoàn cảnh giống như mình. Thỉnh thoảng khi làm pizza, mình hay bị làm thừa topping, bao gồm cả xốt cà chua tự làm và các nguyên liệu rau, thịt khác. Khi còn thừa một ít như vậy, mình không thích khi cứ phải để dành chúng trong tủ lạnh, nếu bảo phải trộn thêm một mẻ bột làm đế bánh pizza thì cũng ngại. Nhưng nếu sử dụng bánh mì sandwich thì thật dễ dàng và tiện lợi. Bánh mì sandwich là một trong những thực phẩm luôn có sẵn trong bếp nhà mình, đặc biệt là khi đang sống ở NZ thì bánh mì sandwich đã trở thành thực phẩm chính hơn cả cơm.\nĐiều đặc biệt hơn là, khi bạn dùng bánh mì làm đế pizza, bạn có nhiều lựa chọn loại bánh mì, gồm cả những loại bánh mì lúa mạch đen, bánh mì nguyên cám, bánh mì ngũ cốc. Những loại bánh mì này sẽ tăng thêm dinh dưỡng cho món bánh pizza của bạn.\nVí dụ như bữa nay, mình chọn loại bánh mì ngũ cốc nguyên hạt, bánh nướng xong cả nhà đều thích và xử lý rất nhanh gọn, ngon lành. Khi làm loại 
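
Some entries in the type 4 and type 6 outputs are paragraphs of prose rather than ingredient lists, because the keyword "nguyên liệu" also occurs in ordinary text. One possible heuristic for flagging such entries for manual review is sketched below. `split_suspicious` and the 200-character line limit are assumptions chosen for illustration and would need tuning against the real data.

```python
def looks_like_ingredient_list(ingredients, max_line_length=200):
    """Return True if every non-empty line is short enough to be an ingredient."""
    lines = [line for line in ingredients.split('\n') if line.strip()]
    if not lines:
        return False
    return all(len(line) <= max_line_length for line in lines)


def split_suspicious(ingredients_dict, max_line_length=200):
    """Split a title -> ingredients dict into (kept, suspicious) dicts."""
    kept, suspicious = {}, {}
    for title, ingredients in ingredients_dict.items():
        if looks_like_ingredient_list(ingredients, max_line_length):
            kept[title] = ingredients
        else:
            suspicious[title] = ingredients
    return kept, suspicious


sample = {
    "Recipe": "500g hạt đác tươi\n150g đường",
    "Essay": "x" * 500,  # stands in for a long paragraph of prose
}
kept, suspicious = split_suspicious(sample)
print(list(kept), list(suspicious))  # ['Recipe'] ['Essay']
```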




In [38]:
all_ingredients_type7, type_7_not_found = get_all_ingredients_from_urls(type_6_not_found, 7)
print("Ingredients Dictionary:", all_ingredients_type7)
print("Not Found URLs:", type_7_not_found)
print("Ingredients Dictionary Len:", len(all_ingredients_type7))
print("Not Found URLs Len:", len(type_7_not_found))

Page Loading: 100%|██████████| 59/59 [01:59<00:00,  2.03s/page]

Ingredients Dictionary: {}
Not Found URLs: ['https://kokotaru.com/banh-mi-nuong-trung-thit-hun-khoi/', 'https://kokotaru.com/cach-che-bien-hat-dac-tuoi-dac-san-nha-trang/', 'https://kokotaru.com/poppy-seeds-quince-cake-banh-qua-quince-hat-poppy/', 'https://kokotaru.com/banh-he-nuong-chao/', 'https://kokotaru.com/hinh-anh-va-ten-goi-cac-loai-khuon-dung-trong-lam-banh/', 'https://kokotaru.com/banh-sinh-nhat-trang-tri-fondant-lam-theo-yeu-cau/', 'https://kokotaru.com/lam-banh-cho-nguoi-bi-tieu-duong/', 'https://kokotaru.com/cach-lam-hat-dac-rim-dua-thom/', 'https://kokotaru.com/cach-lam-sua-dau-phong-me-den/', 'https://kokotaru.com/cach-lam-banh-mi-gion-tan-cho-salad-va-sup-croutons/', 'https://kokotaru.com/cac-loai-kem-phu-icing-frosting/', 'https://kokotaru.com/phan-loai-lo-nuong-kinh-nghiem-su-dung-lo-nuong-gia-dinh/', 'https://kokotaru.com/cach-lam-che-dau-xanh-pho-tai/', 'https://kokotaru.com/ten-goi-va-cong-dung-cac-loai-dao-nha-bep/', 'https://kokotaru.com/cac-loai-duong-lam-banh/'
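
After the seven passes, the per-type dictionaries still have to be combined into the two-column table from the job description. The sketch below shows one way this might be done with the standard `csv` module. `merge_in_order` and `write_dish_table` are hypothetical helpers, and letting earlier passes win on duplicate titles is an assumption.

```python
import csv
import io


def merge_in_order(dicts):
    """Merge title -> ingredients dicts; on duplicate titles the earlier dict wins."""
    merged = {}
    for current in dicts:
        for title, ingredients in current.items():
            merged.setdefault(title, ingredients)
    return merged


def write_dish_table(rows, out):
    """Write the merged dict as a CSV table with the two required columns."""
    writer = csv.writer(out)
    writer.writerow(['Name of dish', 'Main ingredients'])
    for title, ingredients in rows.items():
        writer.writerow([title, ingredients])


merged = merge_in_order([
    {'Bánh A': '1 g x\n2 g y'},
    {'Bánh B': '3 g z', 'Bánh A': 'ignored duplicate'},
])
buffer = io.StringIO()
write_dish_table(merged, buffer)
print(buffer.getvalue())
```

The `csv` writer quotes fields that contain newlines, so multi-line ingredient strings survive a round trip. When writing to a real file, opening it with `newline=''` and `encoding='utf-8'` keeps line endings and Vietnamese text intact.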




### KITCHENART WEBSITE ARTICLE DATA PARSING

In [39]:
def fetch_ingredients_and_title(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
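            # Guard against pages without a post header to avoid an AttributeError.
            post_header = soup.find("div", class_="post-header")
            title_tag = post_header.find("h1") if post_header else None
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"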
            
            ingredient_tags = soup.find_all("span", class_="ingredient-item__title")
            ingredients = '***'.join(tag.get_text().strip() for tag in ingredient_tags)
            
            return title, ingredients
        else:
            return None, None
    except requests.exceptions.RequestException:
        return None, None

def fetch_ingredients_from_multiple_urls(url_list):
    all_ingredients = {}
    failed_urls = []
    
    with ThreadPoolExecutor(max_workers=8) as executor:
        with tqdm(total=len(url_list), desc="Fetching Ingredients", unit="article") as progress_bar:
            future_to_url = {executor.submit(fetch_ingredients_and_title, url): url for url in url_list}
            
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                title, ingredients = future.result()
                
                if title and ingredients:
                    all_ingredients[title] = ingredients
                else:
                    failed_urls.append(url)
                
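                progress_bar.update(1)
                # All requests were already submitted to the pool above, so this
                # sleep only paces result handling; it does not rate-limit the requests.
                time.sleep(3)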
    
    print(f"\nInitial fetch completed. Failed URLs: {len(failed_urls)}")
    
    # Initialise up front so the check below works even when the first pass has no failures.
    retry_1_failed_urls = []
    if failed_urls:
        print("\nRetrying failed URLs (1st attempt)...")
        
        with tqdm(total=len(failed_urls), desc="Retrying Failed URLs (1st attempt)", unit="article") as retry_progress_bar:
            for url in failed_urls:
                title, ingredients = fetch_ingredients_and_title(url)
                
                if title and ingredients:
                    all_ingredients[title] = ingredients
                else:
                    retry_1_failed_urls.append(url)
                
                retry_progress_bar.update(1)
                time.sleep(3)
        
        print(f"\n1st retry completed. Failed URLs after 1st retry: {len(retry_1_failed_urls)}")
    
    if retry_1_failed_urls:
        print("\nRetrying failed URLs (2nd attempt)...")
        retry_2_failed_urls = []
        
        with tqdm(total=len(retry_1_failed_urls), desc="Retrying Failed URLs (2nd attempt)", unit="article") as retry_progress_bar:
            for url in retry_1_failed_urls:
                title, ingredients = fetch_ingredients_and_title(url)
                
                if title and ingredients:
                    all_ingredients[title] = ingredients
                else:
                    retry_2_failed_urls.append(url)
                
                retry_progress_bar.update(1)
                time.sleep(3)
        
        print(f"\n2nd retry completed. Final failed URLs: {len(retry_2_failed_urls)}")
    else:
        retry_2_failed_urls = []

    print("\nFetching Completed")
    print(f"Total successful pages: {len(all_ingredients)}")
    print(f"Failed URLs after all retries: {len(retry_2_failed_urls)}")
    
    if retry_2_failed_urls:
        print("\nFinal Failed URL List:")
        for failed_url in retry_2_failed_urls:
            print(f" - {failed_url}")

    return all_ingredients, retry_2_failed_urls
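
The two retry passes above repeat the same loop. As a rough sketch only, the passes could be folded into a single loop over rounds. The names `fetch_with_retries`, `fetch` and `max_rounds` are hypothetical and not part of the notebook. This version runs sequentially, without the thread pool or the `time.sleep` pauses, and takes the fetch function as a parameter so it can be tried offline.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# (title, ingredients) as returned by fetch_ingredients_and_title; (None, None) on failure.
FetchResult = Tuple[Optional[str], Optional[str]]


def fetch_with_retries(
    urls: Iterable[str],
    fetch: Callable[[str], FetchResult],
    max_rounds: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Call `fetch` on every URL, then retry only the failures, up to `max_rounds` passes."""
    results: Dict[str, str] = {}
    pending = list(urls)

    for round_no in range(1, max_rounds + 1):
        if not pending:
            break  # nothing left to retry

        still_failing: List[str] = []
        for url in pending:
            title, ingredients = fetch(url)
            if title and ingredients:
                results[title] = ingredients
            else:
                still_failing.append(url)

        print(f"Round {round_no}: {len(still_failing)} URL(s) still failing")
        pending = still_failing

    # `pending` now holds the URLs that failed in every round.
    return results, pending
```

With the notebook's function, a call might look like `fetch_with_retries(url_list, fetch_ingredients_and_title)`, where the first round plays the role of the initial fetch and the next two correspond to the two retries.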

In [41]:
url_list = read_from_file('Tesing/kitchenart_article_urls_first_25.txt', 'r_b_line')
all_ingredients, failed_urls = fetch_ingredients_from_multiple_urls(url_list)

Fetching Ingredients: 100%|██████████| 25/25 [01:15<00:00,  3.01s/article]



Initial fetch completed. Failed URLs: 6

Retrying failed URLs (1st attempt)...


Retrying Failed URLs (1st attempt): 100%|██████████| 6/6 [00:19<00:00,  3.29s/article]



1st retry completed. Failed URLs after 1st retry: 5

Retrying failed URLs (2nd attempt)...


Retrying Failed URLs (2nd attempt): 100%|██████████| 5/5 [00:16<00:00,  3.23s/article]


2nd retry completed. Final failed URLs: 5

Fetching Completed
Total successful pages: 20
Failed URLs after all retries: 5

Final Failed URL List:
 - https://cook.kitchenart.vn/cong-thuc-nau-an/thit-kho-tieu-cay-nong-dam-vi-mon-an-khong-the-buong-dua-ngay-dong/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/banh-tortilla-phu-trung-nuong-bua-sang-nhanh-gon-du-chat-va-vo-cung-thom-ngon/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/khoai-tay-tam-gia-vi-nuong-kieu-au/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cach-lam-canh-dau-phu-cay-kieu-han/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cach-lam-hat-de-ngao-mat-ong-dac-san-khong-the-thieu-vao-mua-dong/
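
In the run above, five URLs were still failing after the first retry and five after the second. One possible reading (an assumption, not something the output confirms) is that these pages fail for a non-transient reason, such as a layout without `ingredient-item__title` spans, rather than because of network errors. Since `fetch_ingredients_and_title` returns `(None, None)` or an empty ingredient string without saying why, a small labelling helper could make the cause visible. `classify_failure` below is a hypothetical helper, and using it would require the fetch function to also pass along the HTTP status code.

```python
from typing import Optional


def classify_failure(
    status_code: Optional[int],
    title: Optional[str],
    ingredients: Optional[str],
) -> str:
    """Return a short label for why a fetch counted as failed (hypothetical helper)."""
    if status_code is None:
        return "request_error"       # no response at all: timeout, connection error, ...
    if status_code != 200:
        return f"http_{status_code}"  # server answered, but not with 200 OK
    if not title:
        return "missing_title"
    if not ingredients:
        return "no_ingredient_tags"  # page loaded, but the ingredient selector matched nothing
    return "ok"
```

Counting these labels over the failed URLs (for example with `collections.Counter`) would show whether retrying is likely to help or whether a different selector is needed for those pages.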





In [42]:
# The dictionary is keyed by article title, not by URL.
for title, ingredients in all_ingredients.items():
    print(f"\nIngredients for {title}:")
    print(ingredients)


Ingredients for Vịt Om Sấu – Món Ăn Bất Chấp Mọi Thời Tiết  Khiến Bao Người Mê Mẩn Dù Đông Hay Hè:
1 con vịt (làm sạch, chặt nhỏ)***2 củ tỏi băm nhỏ***2 củ hành tím băm nhỏ***3 tbs dầu hào***2 tsp muối***1 tsp tiêu xay***1 tbs đường***1 tbs dầu ăn***10-15 quả sấu chín***1,5L nước (có thể dùng nước ninh xương hoặc nước dừa)***20ml nước cốt dừa (nếu dùng nước dừa thì bỏ qua)***15-20 củ khoai sọ (gọt vỏ, bổ đôi)***Rau muống***Rau húng, mùi tàu ăn kèm***Bún tươi

Ingredients for Bánh Mì Cuộn Phô Mai Chiên Xù Giòn Rụm Bên Ngoài, Tan Chảy Bên Trong:
6 lát bánh mì sandwich***50g phô mai creamcheese***50g phô mai mozzarella***2 quả trứng (đánh tan)***50g bột chiên xù

Ingredients for Đổi Vị Với Phở Gà Trộn Lạ Miệng, Thơm Ngon Ăn Mãi Mà Không Thấy Ngán:
½ con gà ta***1 miếng gừng nhỏ (nạo vỏ, đập dập)***250ml nước***1 tsb muối***2 tbs xì dầu***2 tbs đường***1 tsp giấm tỏi***1 củ hành tây***500g bánh phở tươi***1 nắm giá***Rau mùi, hành lá***Hành phi***Lạc rang

Ingredients for Khoai Tây Nướng 