## Data Collection Task

### Job description:
1. Websites to scrape:\
   a. [kokotaru](https://kokotaru.com/) - Total: ~ 500 posts.\
   b. [kitchenart](https://cook.kitchenart.vn/cong-thuc-nau-an/) - Total: ~ 800 articles.
2. Data to scrape:

   | Name of dish | Main ingredients     |
   | -------------|----------------------|
   | ...          |...                   |
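
The target table holds one row per dish. Below is a minimal sketch of writing such rows to CSV, assuming the scraped results end up in a dictionary that maps each dish name to a newline-separated ingredient string. The helper name `recipes_to_csv` and the sample data are hypothetical and only illustrate the shape of the output.

```python
import csv
import io


def recipes_to_csv(recipes: dict) -> str:
    """Serialize {dish name: ingredients text} into a two-column CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name of dish", "Main ingredients"])
    for name, ingredients in recipes.items():
        # Collapse the multi-line ingredient text into one comma-separated cell.
        cell = ", ".join(
            line.strip() for line in ingredients.splitlines() if line.strip()
        )
        writer.writerow([name, cell])
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
sample = {"Pho": "rice noodles\nbeef\nstar anise"}
print(recipes_to_csv(sample))
```

The `csv` module quotes any cell that contains a comma, so the ingredient list stays in a single column.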


### Install the libraries

In [122]:
!pip install bs4
!pip install selenium
!pip install numpy
!pip install pandas
!pip install requests
!pip install tqdm




[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip







### Import shared libraries and functions

In [123]:
from Libraries_Used import *
from Shared_functions import *

### **Comment:**
The two websites load their content differently, so the article URLs are collected in two separate sections:
1. **Kokotaru:** more articles load as the user scrolls down the page.
2. **Kitchenart:** articles are split across numbered pages, with 20 articles per page.

---

### KOKOTARU WEBSITE ARTICLE URLS PARSING

#### Interact:
* The page loads more content as the user scrolls down.
#### Describe:
* All articles are listed on the kokotaru homepage.
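
The idea behind scroll-based loading is to keep scrolling until the page height stops growing. Below is a minimal sketch of that loop, written without a browser so the logic can be checked on its own. `scroll_until_stable` and `FakePage` are hypothetical names; with Selenium, the two callables would wrap `driver.execute_script(...)` calls, with a pause between scrolling and measuring.

```python
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        max_scrolls: int = 100) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = get_height()
    scrolls = 0
    while scrolls < max_scrolls:
        scroll_to_bottom()
        scrolls += 1
        new_height = get_height()
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached.
            break
        last_height = new_height
    return scrolls


class FakePage:
    """Stand-in for a browser page: grows by 1000px per scroll, up to a limit."""

    def __init__(self, height: int = 1000, limit: int = 4000):
        self.height = height
        self.limit = limit

    def scroll(self) -> None:
        self.height = min(self.height + 1000, self.limit)


page = FakePage()
print(scroll_until_stable(lambda: page.height, page.scroll))
```

The `max_scrolls` cap guards against pages that never stop growing.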

### Step 1: Retrieve website content using Selenium library.

In [124]:
KOKOTARU_BASE_URL = 'https://kokotaru.com/'

**Step 1.1:** Create selenium chrome browser.\
**Step 1.2:** Let the page load all the content.

In [125]:
def kokotaru_page_loader(BASE_URL: str) -> str:
    driver = webdriver.Chrome()
    driver.get(BASE_URL)

    last_height = driver.execute_script("return document.body.scrollHeight")
    
    max_scrolls = 100
    progress_bar = tqdm(desc="Scrolling", ncols=100, leave=True, unit="scroll", total=max_scrolls)
    
    scroll_count = 0
    
    while scroll_count < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        
        new_height = driver.execute_script("return document.body.scrollHeight")
        
        progress_bar.update(1)
        scroll_count += 1
        

        if new_height == last_height:
            break
        
        last_height = new_height
    
    html_content = driver.page_source
    driver.quit()
    
    progress_bar.close() 
    print("Parsing completed.")
    
    return html_content

In [126]:
# kokotaru_html = kokotaru_page_loader(KOKOTARU_BASE_URL)

**Step 1.3:** Save the content to `kokotaru_html_content.html`

In [127]:
# write_to_file("Assert/kokotaru_html_content.html",kokotaru_html, 'w_b_str')

### Step 2: Get the article links on the homepage

In [128]:
# kokotaru_content = read_from_file("Assert/kokotaru_html_content.html", 'r_b_str')

In [129]:
def get_kokotaru_articles_urls(html_content) -> list:
    urls = []
    soup = BeautifulSoup(html_content, 'html.parser')
    article_headers = soup.find_all("article", { "class" : "entry-preview" })
    for header in article_headers:
        found = header.find('a', class_='cs-overlay-link')
        url = found['href'] if found else None
        if url is not None:
            urls.append(url)
    return urls

In [130]:
# kokotaru_article_urls = get_kokotaru_articles_urls(kokotaru_content)

Record the retrieved urls in `kokotaru_article_urls.txt`
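
The `write_to_file` and `read_from_file` helpers come from `Shared_functions`, which is not shown here, so their exact behavior is unknown. Below is a minimal sketch of an equivalent round trip that stores one URL per line. The names `save_lines` and `load_lines` are hypothetical and are not the notebook's helpers.

```python
from pathlib import Path
import tempfile


def save_lines(path, items) -> None:
    """Write each item on its own line, UTF-8 encoded."""
    Path(path).write_text("\n".join(items) + "\n", encoding="utf-8")


def load_lines(path) -> list:
    """Read the file back as a list, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Round trip through a temporary directory, using placeholder URLs.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "urls.txt"
    save_lines(target, ["https://example.com/a", "https://example.com/b"])
    print(load_lines(target))
```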

In [131]:
# write_to_file('Assert/kokotaru_article_urls.txt', kokotaru_article_urls, 'w_b_element')

### KITCHENART WEBSITE ARTICLE URLS PARSING

#### Interact:
* Standard numbered pagination.
#### Describe:
* Each page includes 20 articles.
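
With 20 articles per page, the number of listing pages can be estimated from the article count. Below is a minimal sketch, assuming listing pages follow the `page/N/` pattern used later in this notebook. The helper name `build_page_urls` is hypothetical, and since the ~800 total is only an estimate, the real page count still has to be confirmed by following the pagination.

```python
import math


def build_page_urls(base_url: str, total_articles: int, per_page: int = 20) -> list:
    """Build listing-page URLs: the base URL first, then base_url + 'page/N/'."""
    page_count = math.ceil(total_articles / per_page)
    urls = [base_url] if page_count >= 1 else []
    urls.extend(f"{base_url}page/{n}/" for n in range(2, page_count + 1))
    return urls


pages = build_page_urls("https://cook.kitchenart.vn/cong-thuc-nau-an/", 800)
print(len(pages))
print(pages[1])
```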

### Step 1: Get the urls leading to the articles on the current page

In [132]:
def kitchenart_article_urls_onepage(page_url: str):
    """Return the article URLs found on one listing page.

    On failure the page URL itself (a str) is returned, so the caller can
    tell a failed page apart from a list of results.
    """
    u_list = []

    # Network errors are handled here instead of crashing the whole crawl
    try:
        response = requests.get(page_url, timeout=30)
    except requests.exceptions.RequestException as err:
        print(f'Requests error: {err}')
        return page_url

    if response.status_code != 200:
        print(f"Failed to access {page_url}. Status code: {response.status_code}")
        return page_url

    soup = BeautifulSoup(response.content, "html.parser")

    for article in soup.find_all("a", {"class": "recipe-card__title-link"}):
        link = article.get('href')
        if link:
            u_list.append(link)

    return u_list

### Step 2: Get the url leading to the next page

**Solution 1:** Get the next page through the current page.

In [133]:
def kitchenart_find_next_page_url(html_content) -> str:
    next_page = ""

    soup = BeautifulSoup(html_content, "html.parser")

    # The "next" pagination link is absent on the last page
    link_found = soup.find("a", {"class": "next page-numbers"})

    if link_found is not None:
        next_page = link_found.get("href")

    return next_page

**Solution 2:** Walk through the pages from first to last, requesting each one to confirm that it is reachable.

In [134]:
def get_all_page_urls(BASE_URL: str) -> list:
    urls = []
    failed_urls = []
    next_page = BASE_URL
    page_counter = 2
    
    print("Starting to gather URLs from the website:")
    progress_bar = tqdm(desc="Page Loading", unit="page")
    
    while True:
        response = requests.get(next_page)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            urls.append(next_page)
            print(f"Page {len(urls)} loaded successfully: {next_page}")
            
            find_nextpage = soup.find("a", {"class": "next page-numbers"})
            
            if find_nextpage is None:
                print("No more pages to load.")
                break
            
            next_page = BASE_URL + f'page/{page_counter}/'
            page_counter += 1
            
            time.sleep(3)
            
            progress_bar.update(1)
        else:
            failed_urls.append(next_page)
            print(f"Failed to load page {page_counter - 1}: {next_page}")
            break
    
    progress_bar.close()
    
    if failed_urls:
        print("\nRetrying failed URLs...")
        retry_failed_urls = []
        
        retry_progress_bar = tqdm(total=len(failed_urls), desc="Retrying Failed URLs", unit="page")
        
        for url in failed_urls:
            response = requests.get(url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, "html.parser")
                urls.append(url)
                print(f"Retry successful: {url}")
            else:
                retry_failed_urls.append(url)
                print(f"Retry failed for: {url}")
            
            time.sleep(3)
            retry_progress_bar.update(1)
        
        retry_progress_bar.close()
        
        if retry_failed_urls:
            print(f"Failed to load {len(retry_failed_urls)} URLs after retrying.")
        else:
            print("All URLs loaded successfully after retry.")
    
    print(f"\nTotal pages loaded: {len(urls)}")
    return urls

### Step 3: Retrieve all post urls in all existing pages

In [135]:
KITCHENART_BASE_URL = 'https://cook.kitchenart.vn/cong-thuc-nau-an/'

**Step 3.1:** Get all the page links. If a page cannot be reached (for example, because of a server-side error), pagination stops and that page is retried once. If the retry also fails, the page is skipped.
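
A bounded retry policy of this kind can be sketched independently of the network layer. In the minimal sketch below, the fetch function is passed in and the retry count is a parameter. `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the sketch is not the notebook's `get_all_page_urls`.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], Optional[str]],
                       url: str,
                       retries: int = 3,
                       delay: float = 0.0) -> Optional[str]:
    """Call fetch(url); on a None result, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < retries:
            # Wait before the next attempt to avoid hammering the server.
            time.sleep(delay)
    return None


# Hypothetical fetcher that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch(url: str) -> Optional[str]:
    calls["count"] += 1
    return "<html>ok</html>" if calls["count"] >= 3 else None


print(fetch_with_retries(flaky_fetch, "https://example.com/page/2/"))
print(calls["count"])
```

Returning `None` after the last attempt lets the caller decide whether to skip the page or record it as failed.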

In [136]:
# next_pages = get_all_page_urls(KITCHENART_BASE_URL)
# next_pages

**Step 3.2:** Write the page links to the file `kitchenart_pages_urls.txt`

In [137]:
# write_to_file('Assert/kitchenart_pages_urls.txt', next_pages, 'w_b_element')

**Step 3.3:** This function retrieves all the article links on each page. Each page is requested at most twice; if the second attempt also fails, the page is abandoned.

In [138]:
def find_all_article_urls_in_all_pages(PAGE_URLS: list) -> list:
    u_list = []
    failed_urls = []
    
    progress_bar = tqdm(total=len(PAGE_URLS), desc="Page Loading", unit="page")
    
    for current_page in PAGE_URLS:
        tmp_url = kitchenart_article_urls_onepage(current_page)
        
        if isinstance(tmp_url, str):
            failed_urls.append(current_page)
        else:
            u_list.extend(tmp_url)
        
        time.sleep(3)
        progress_bar.update(1)
    
    progress_bar.close()
    print(f"Articles found: {len(u_list)}")
    print('---------------------------------')
    print(f"Failed URLs: {len(failed_urls)}")
    
    if failed_urls:
        print("Retry is starting...")
        retry_failed_urls = []
        
        retry_progress_bar = tqdm(total=len(failed_urls), desc="Retrying Failed URLs", unit="page")
        
        for current_page in failed_urls:
            tmp_url = kitchenart_article_urls_onepage(current_page)
            
            if isinstance(tmp_url, str):
                retry_failed_urls.append(current_page)
            else:
                u_list.extend(tmp_url)
            
            time.sleep(3)
            retry_progress_bar.update(1)
        
        retry_progress_bar.close()
        print(f"Articles found after retry: {len(u_list)}")
        print('---------------------------------')
        print(f"Failed URLs after retry: {len(retry_failed_urls)}")
    
    print("Crawling completed.")
    return u_list


### Step 4: Run the program to retrieve all article urls

In [139]:
# page_urls = read_from_file('Assert/kitchenart_pages_urls.txt', 'r_b_line')

In [140]:
# kitchenart_article_urls = find_all_article_urls_in_all_pages(page_urls)

**Step 4.1:** Write these urls into the file `kitchenart_article_urls.txt`

In [141]:
# write_to_file('Assert/kitchenart_article_urls.txt', kitchenart_article_urls, 'w_b_element')

---

### KOKOTARU WEBSITE ARTICLE DATA PARSING

#### Type 1: Ingredients are presented consistently and have distinguishing tags.

In [142]:
def fetch_ingredients_type1(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            #Title tag
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            #Find ingredients
            ingredient_div = soup.find("div", class_="wprm-recipe-ingredient-group")
            
            if ingredient_div:
                ingredients = ingredient_div.get_text(separator="\n").strip()
                return title, ingredients  
            else:
                return title, None 
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [143]:
def fetch_ingredients_type2(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            # Find title: <h1>, class "entry-title"
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find ingredients: an <h2> whose text contains "tỉ lệ", "thành phần", or "nguyên liệu" (case-insensitive)
            ingredient_header = None
            for h2 in soup.find_all("h2", class_="has-vivid-red-color has-text-color wp-block-heading"):
                h2_text = h2.get_text().strip().lower()
                if any(keyword in h2_text for keyword in ["tỉ lệ", "thành phần", "nguyên liệu"]):
                    ingredient_header = h2
                    break
            
            ingredients = None

            if ingredient_header:
                ingredients_list = []

                # Collect the items of the first <ul> after the header
                ul_tag = ingredient_header.find_next("ul")

                if ul_tag:
                    for li in ul_tag.find_all("li"):
                        ingredients_list.append(li.get_text().strip())

                # Join the list into a single newline-separated string
                ingredients = "\n".join(ingredients_list) if ingredients_list else None

            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [144]:
def fetch_ingredients_type3(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            

            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            

            ingredient_header = soup.find("h3", string=lambda text: "Nguyên liệu" in text if text else False)
            
            ingredients = None  
            
            if ingredient_header:
                ingredients_list = []
                
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3":  # Stop when you encounter the next <h3> tag ("How to" section)
                        break
                    if sibling.name == "p" and "text-align" in sibling.get("style", ""):
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [145]:
def fetch_ingredients_type4(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find the first <strong>, <span>, or <h3> tag whose text contains "nguyên liệu" (ingredients)
            ingredient_header = soup.find(lambda tag: tag.name in ["strong", "span", "h3"] and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                ul_tag = ingredient_header.find_next("ul")
                
                if ul_tag:
                    for li in ul_tag.find_all("li"):
                        ingredients_list.append(li.get_text().strip())
                
                if not ingredients_list:
                    for sibling in ingredient_header.find_next_siblings():
                        if sibling.name == "h3" or sibling.name == "strong":
                            break
                        if sibling.name == "p":
                            ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [146]:
def fetch_ingredients_type5(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            # Find the <h4> tag containing the keyword "nguyên liệu"
            ingredient_header = soup.find(lambda tag: tag.name == "h4" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                # Find all <p> tags following the <h4> tag until the next <h3> tag
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3":
                        break
                    if sibling.name == "p":
                        ingredients_list.append(sibling.get_text().strip())
                
                # Combine all ingredient lines into a single string
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [147]:
def fetch_ingredients_type6(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            ingredient_start = soup.find(lambda tag: tag.name == "p" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_start:
                ingredients_list = []
                
                for sibling in ingredient_start.find_next_siblings():
                    if sibling.name == "p" and "cách làm" in sibling.get_text().lower():
                        break

                    if sibling.name == "p" and sibling.get_text().strip():
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

In [148]:
def fetch_ingredients_type7(url):
    try:
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
            title_tag = soup.find("h1", class_="entry-title")
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"
            
            ingredient_header = soup.find(lambda tag: tag.name == "h3" and "nguyên liệu" in tag.get_text().lower())
            
            ingredients = None
            
            if ingredient_header:
                ingredients_list = []
                
                for sibling in ingredient_header.find_next_siblings():
                    if sibling.name == "h3" and "cách làm" in sibling.get_text().lower():
                        break
                    if sibling.name == "p" and sibling.get_text().strip():
                        ingredients_list.append(sibling.get_text().strip())
                
                ingredients = "\n".join(ingredients_list) if ingredients_list else None
            
            return title, ingredients
        else:
            print(f"Failed to retrieve {url}. Status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

#### Get all ingredients from urls list by type of article format
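
Because articles come in several layouts, one alternative to running each parser over the whole URL list is to try the parsers in order on each article and keep the first non-empty result. Below is a minimal sketch of that fallback chain. It is a different arrangement from the one used in this notebook, and the extractors shown are hypothetical stand-ins that work on raw HTML strings.

```python
from typing import Callable, Iterable, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def first_successful(html: str, extractors: Iterable[Tuple[str, Extractor]]):
    """Return (extractor name, ingredients) from the first extractor that finds any."""
    for name, extract in extractors:
        ingredients = extract(html)
        if ingredients:
            return name, ingredients
    return None, None


# Hypothetical stand-ins for layout-specific parsers.
def by_recipe_card(html: str) -> Optional[str]:
    return "flour\nsugar" if "wprm-recipe" in html else None


def by_heading(html: str) -> Optional[str]:
    return "rice\nwater" if "nguyên liệu" in html.lower() else None


chain = [("recipe_card", by_recipe_card), ("heading", by_heading)]
print(first_successful("<h3>Nguyên liệu</h3>", chain))
```

Ordering the chain from the most specific layout to the most generic keeps loose matchers from claiming pages that a stricter parser would handle better.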

In [149]:

def get_all_ingredients_from_urls(url_list, type):
    ingredients_dict = {}
    not_found_urls = []

    # Map each article layout type to its parser
    fetchers = {
        1: fetch_ingredients_type1,
        2: fetch_ingredients_type2,
        3: fetch_ingredients_type3,
        4: fetch_ingredients_type4,
        5: fetch_ingredients_type5,
        6: fetch_ingredients_type6,
        7: fetch_ingredients_type7,
    }
    # Fail early on an unknown type instead of hitting a NameError later
    if type not in fetchers:
        raise ValueError(f"Unknown article type: {type}")
    fetch_ingredients_type = fetchers[type]


    with ThreadPoolExecutor(max_workers=8) as executor:
        with tqdm(total=len(url_list), desc="Page Loading", unit="page") as progress_bar:
            future_to_url = {executor.submit(fetch_ingredients_type, url): url for url in url_list}
            
            for future in as_completed(future_to_url):
                url = future_to_url[future]
                try:
                    result = future.result()
                    if result:
                        title, ingredients = result
                        if ingredients:
                            ingredients_dict[title] = ingredients
                        else:
                            not_found_urls.append(url)
                    else:
                        not_found_urls.append(url)
                except Exception as e:
                    print(f"Error processing {url}: {e}")
                    # Keep track of URLs that failed with an unexpected error
                    not_found_urls.append(url)
                
                progress_bar.update(1)
                time.sleep(2)
    
    return ingredients_dict, not_found_urls

In [150]:
# kokotaru_article_links = read_from_file("Assert/kokotaru_article_urls_full.txt", 'r_b_line')

In [151]:
# # all_ingredients_type1, type_1_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 1)
# print("Ingredients Dictionary:", all_ingredients_type1)
# print("Not Found URLs:", type_1_not_found)
# # print("Ingredients Dictionary Len:", len(all_ingredients_type1))
# # print("Not Found URLs Len:", len(type_1_not_found))

In [152]:
# # all_ingredients_type2, type_2_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 2)
# print("Ingredients Dictionary:", all_ingredients_type2)
# print("Not Found URLs:", type_2_not_found)
# # print("Ingredients Dictionary Len:", len(all_ingredients_type2))
# # print("Not Found URLs Len:", len(type_2_not_found))

In [153]:
# all_ingredients_type3, type_3_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 3)
# # print("Ingredients Dictionary:", all_ingredients_type3)
# # print("Not Found URLs:", type_3_not_found)
# print("Ingredients Dictionary Len:", len(all_ingredients_type3))
# print("Not Found URLs Len:", len(type_3_not_found))

In [154]:
# all_ingredients_type4, type_4_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 4)
# # print("Ingredients Dictionary:", all_ingredients_type4)
# # print("Not Found URLs:", type_4_not_found)
# print("Ingredients Dictionary Len:", len(all_ingredients_type4))
# print("Not Found URLs Len:", len(type_4_not_found))

In [155]:
# all_ingredients_type5, type_5_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 5)
# # print("Ingredients Dictionary:", all_ingredients_type5)
# # print("Not Found URLs:", type_5_not_found)
# print("Ingredients Dictionary Len:", len(all_ingredients_type5))
# print("Not Found URLs Len:", len(type_5_not_found))

In [156]:
# all_ingredients_type6, type_6_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 6)
# # print("Ingredients Dictionary:", all_ingredients_type6)
# # print("Not Found URLs:", type_6_not_found)
# print("Ingredients Dictionary Len:", len(all_ingredients_type6))
# print("Not Found URLs Len:", len(type_6_not_found))

In [157]:
# all_ingredients_type7, type_7_not_found = get_all_ingredients_from_urls(kokotaru_article_links, 7)
# # print("Ingredients Dictionary:", all_ingredients_type7)
# # print("Not Found URLs:", type_7_not_found)
# print("Ingredients Dictionary Len:", len(all_ingredients_type7))
# print("Not Found URLs Len:", len(type_7_not_found))

In [158]:
# kokotaru_combined_dict = {}
# # Kokotaru website
# kokotaru_combined_dict.update(all_ingredients_type1)
# kokotaru_combined_dict.update(all_ingredients_type2)
# kokotaru_combined_dict.update(all_ingredients_type3)
# kokotaru_combined_dict.update(all_ingredients_type4)
# kokotaru_combined_dict.update(all_ingredients_type5)
# kokotaru_combined_dict.update(all_ingredients_type6)
# kokotaru_combined_dict.update(all_ingredients_type7)

In [159]:
# kokotaru_combined_dict

In [160]:
# with open("cleaned_recipes.txt", "w", encoding="utf-8") as f:
#     for title, ingredients in kokotaru_combined_dict.items():
#         ingredients = ingredients.replace("***", "\n")
#         f.write(f"Title: {title}\n")
#         f.write(f"Ingredients:\n{ingredients}\n")
#         f.write("-" * 50 + "\n")

### KITCHENART WEBSITE ARTICLE DATA PARSING
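
The functions in this section fetch articles in fixed-size batches with a pause between batches, and the URL list itself is read from several smaller files. How those files were produced is not shown. Below is a minimal sketch of splitting a list into consecutive chunks, which is the same slicing idea used for batching. The helper name `split_into_chunks` and the sample URLs are hypothetical.

```python
def split_into_chunks(items: list, chunk_size: int) -> list:
    """Split a list into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Placeholder URLs, for illustration only.
urls = [f"https://example.com/recipe-{n}" for n in range(7)]
for index, chunk in enumerate(split_into_chunks(urls, 3)):
    print(index, len(chunk))
```

Smaller chunks make it easier to resume after a failure, since only the affected chunk needs to be fetched again.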

In [161]:
def fetch_ingredients_and_title(url):
    try:
        response = requests.get(url, timeout=30)
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            
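            # Guard against pages without a post header to avoid an AttributeError
            header = soup.find("div", class_="post-header")
            title_tag = header.find("h1") if header else None
            title = title_tag.get_text().strip() if title_tag else "Unknown Title"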
            
            ingredient_tags = soup.find_all("span", class_="ingredient-item__title")
            ingredients = '***'.join(tag.get_text().strip() for tag in ingredient_tags)
            
            return title, ingredients
        else:
            return None, None
    except requests.exceptions.RequestException:
        return None, None

def process_batches(url_list, batch_size, delay, max_workers=1):
    all_ingredients = {}
    failed_urls = []
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for i in range(0, len(url_list), batch_size):
            batch = url_list[i:i + batch_size]
            future_to_url = {executor.submit(fetch_ingredients_and_title, url): url for url in batch}
            
            batch_success_count = 0
            batch_failure_count = 0
            
            with tqdm(total=len(batch), desc=f"Fetching Batch {i // batch_size + 1}", unit="article") as progress_bar:
                for future in as_completed(future_to_url):
                    url = future_to_url[future]
                    title, ingredients = future.result()
                    
                    if title and ingredients:
                        all_ingredients[title] = ingredients
                        batch_success_count += 1
                    else:
                        failed_urls.append(url)
                        batch_failure_count += 1
                    
                    progress_bar.update(1)
                    time.sleep(5)
            
            # Display results for the completed batch
            print(f"\nBatch {i // batch_size + 1} completed.")
            print(f"Successful URLs in this batch: {batch_success_count}")
            print(f"Failed URLs in this batch: {batch_failure_count}")
            
            # Pause between batches
            if i + batch_size < len(url_list):
                print(f"Waiting {delay} seconds before next batch...\n")
                time.sleep(delay)
    
    return all_ingredients, failed_urls

def get_all_ingredients_from_urls_ver2(url_list):
    # Initial fetch
    print("Starting initial fetch...")
    all_ingredients, failed_urls = process_batches(url_list, batch_size=50, delay=180)

    print(f"\nInitial fetch completed. Failed URLs: {len(failed_urls)}")
    
    print(f"Waiting {180} seconds before Retry 1 for failed URLs...\n")
    time.sleep(180)
    
    # Retry 1 for failed URLs
    retry_1_failed_urls = []
    if failed_urls:
        print("\nRetrying failed URLs (1st attempt)...")
        retry_1_ingredients, retry_1_failed_urls = process_batches(failed_urls, batch_size=50, delay=180)
        all_ingredients.update(retry_1_ingredients)
        print(f"\n1st retry completed. Failed URLs after 1st retry: {len(retry_1_failed_urls)}")

    print(f"Waiting {180} seconds before Retry 2 for failed URLs...\n")
    time.sleep(180)    
    
    # Retry 2 for remaining failed URLs
    retry_2_failed_urls = []
    if retry_1_failed_urls:
        print("\nRetrying failed URLs (2nd attempt)...")
        retry_2_ingredients, retry_2_failed_urls = process_batches(retry_1_failed_urls, batch_size=50, delay=180)
        all_ingredients.update(retry_2_ingredients)
        print(f"\n2nd retry completed. Failed URLs after 2nd retry: {len(retry_2_failed_urls)}")
        
    print(f"Waiting {180} seconds before Retry 3 for failed URLs...\n")
    time.sleep(180)    
    
    # Retry 3 for remaining failed URLs
    retry_3_failed_urls = []
    if retry_2_failed_urls:
        print("\nRetrying failed URLs (3rd attempt)...")
        retry_3_ingredients, retry_3_failed_urls = process_batches(retry_2_failed_urls, batch_size=50, delay=180)
        all_ingredients.update(retry_3_ingredients)
        print(f"\n3rd retry completed. Final failed URLs: {len(retry_3_failed_urls)}")

    # Summary of results
    print("\nFetching Completed")
    print(f"Total successful pages: {len(all_ingredients)}")
    print(f"Failed URLs after all retries: {len(retry_3_failed_urls)}")
    
    if retry_3_failed_urls:
        print("\nFinal Failed URL List:")
        for failed_url in retry_3_failed_urls:
            print(f" - {failed_url}")

    return all_ingredients, retry_3_failed_urls


In [162]:
# url_list_0 = read_from_file('Assert/kitchenart_article_urls copy 0.txt', 'r_b_line')
# all_ingredients_0, failed_urls_0 = get_all_ingredients_from_urls_ver2(url_list_0)

In [163]:
# url_list_1 = read_from_file('Assert/kitchenart_article_urls copy 1.txt', 'r_b_line')
# all_ingredients_1, failed_urls_1 = get_all_ingredients_from_urls_ver2(url_list_1)

In [164]:
# url_list_2 = read_from_file('Assert/kitchenart_article_urls copy 2.txt', 'r_b_line')
# all_ingredients_2, failed_urls_2 = get_all_ingredients_from_urls_ver2(url_list_2)

In [165]:
# url_list_3 = read_from_file('Assert/kitchenart_article_urls copy 3.txt', 'r_b_line')
# all_ingredients_3, failed_urls_3 = get_all_ingredients_from_urls_ver2(url_list_3)

In [166]:
# url_list_4 = read_from_file('Assert/kitchenart_article_urls copy 4.txt', 'r_b_line')
# all_ingredients_4, failed_urls_4 = get_all_ingredients_from_urls_ver2(url_list_4)

In [167]:
# url_list_5 = read_from_file('Assert/kitchenart_article_urls copy 5.txt', 'r_b_line')
# all_ingredients_5, failed_urls_5 = get_all_ingredients_from_urls_ver2(url_list_5)

In [168]:
# url_list_6 = read_from_file('Assert/kitchenart_article_urls copy 6.txt', 'r_b_line')
# all_ingredients_6, failed_urls_6 = get_all_ingredients_from_urls_ver2(url_list_6)

In [169]:
# url_list_7 = read_from_file('Assert/kitchenart_article_urls copy 7.txt', 'r_b_line')
# all_ingredients_7, failed_urls_7 = get_all_ingredients_from_urls_ver2(url_list_7)

In [170]:
# url_list_7 = read_from_file('Assert/kitchenart_article_urls copy 7.txt', 'r_b_line')
# all_ingredients_7, failed_urls_7 = get_all_ingredients_from_urls_ver2(url_list_7)

In [171]:
# url_list_8 = read_from_file('Assert/kitchenart_article_urls copy 8.txt', 'r_b_line')
# all_ingredients_8, failed_urls_8 = get_all_ingredients_from_urls_ver2(url_list_8)

In [172]:
# url_list_9 = read_from_file('Assert/kitchenart_article_urls copy 9.txt', 'r_b_line')
# all_ingredients_9, failed_urls_9 = get_all_ingredients_from_urls_ver2(url_list_9)

In [173]:
# url_list_10 = read_from_file('Assert/kitchenart_article_urls copy 10.txt', 'r_b_line')
# all_ingredients_10, failed_urls_10 = get_all_ingredients_from_urls_ver2(url_list_10)

In [174]:
# url_list_11 = read_from_file('Assert/kitchenart_article_urls copy 11.txt', 'r_b_line')
# all_ingredients_11, failed_urls_11 = get_all_ingredients_from_urls_ver2(url_list_11)

In [175]:
# url_list_12 = read_from_file('Assert/kitchenart_article_urls copy 12.txt', 'r_b_line')
# all_ingredients_12, failed_urls_12 = get_all_ingredients_from_urls_ver2(url_list_12)

In [176]:
url_list_13 = read_from_file('Assert/kitchenart_article_urls copy 13.txt', 'r_b_line')
all_ingredients_13, failed_urls_13 = get_all_ingredients_from_urls_ver2(url_list_13)

Starting initial fetch...


Fetching Batch 1: 100%|██████████| 50/50 [04:11<00:00,  5.03s/article]



Batch 1 completed.
Successful URLs in this batch: 16
Failed URLs in this batch: 34

Initial fetch completed. Failed URLs: 34
Waiting 180 seconds before Retry 1 for failed URLs...


Retrying failed URLs (1st attempt)...


Fetching Batch 1: 100%|██████████| 34/34 [02:50<00:00,  5.01s/article]



Batch 1 completed.
Successful URLs in this batch: 5
Failed URLs in this batch: 29

1st retry completed. Failed URLs after 1st retry: 29
Waiting 180 seconds before Retry 2 for failed URLs...


Retrying failed URLs (2nd attempt)...


Fetching Batch 1: 100%|██████████| 29/29 [02:25<00:00,  5.02s/article]



Batch 1 completed.
Successful URLs in this batch: 2
Failed URLs in this batch: 27

2nd retry completed. Final failed URLs: 27
Waiting 180 seconds before Retry 3 for failed URLs...


Retrying failed URLs (2nd attempt)...


Fetching Batch 1: 100%|██████████| 27/27 [02:15<00:00,  5.03s/article]


Batch 1 completed.
Successful URLs in this batch: 2
Failed URLs in this batch: 25

2nd retry completed. Final failed URLs: 25

Fetching Completed
Total successful pages: 25
Failed URLs after all retries: 25

Final Failed URL List:
 - https://cook.kitchenart.vn/cong-thuc-nau-an/com-bento-3-mon/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cooking-show-14-ngo-nuong-chanh-ot-kieu-mexico/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cooking-show-13-banh-chocolate-5-phut/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cooking-show-12-cha-ga-goi-la-nep/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/mut-man/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/mut-man-2/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/ca-ro-ran-sot-me-chua-ngot/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/thit-vien-xiu-mai/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/my-y-voi-ca-tim-nuong-cai-bo-xoi-va-pho-mai-ricotta/
 - https://cook.kitchenart.vn/cong-thuc-nau-an/cooking-show-10-pho-mai-que-chien-gion/




In [177]:
# url_list_14 = read_from_file('Assert/kitchenart_article_urls copy 14.txt', 'r_b_line')
# all_ingredients_14, failed_urls_14 = get_all_ingredients_from_urls_ver2(url_list_14)
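
The per-chunk cells above differ only in the chunk number, so they could be driven by one loop instead of being commented in and out by hand. The sketch below is one way to do that. `fetch_chunks` is a hypothetical helper, and the reader and fetcher are passed in as arguments so the block stays self-contained. In this notebook they would be `read_from_file` and `get_all_ingredients_from_urls_ver2`.

```python
def fetch_chunks(chunk_ids, read_urls, fetch,
                 path_template='Assert/kitchenart_article_urls copy {}.txt'):
    """Run the fetcher over several URL chunk files and merge the results.

    read_urls(path) -> list of URLs
    fetch(url_list) -> (ingredients_dict, failed_urls)
    """
    combined = {}
    failed_by_chunk = {}
    for chunk_id in chunk_ids:
        path = path_template.format(chunk_id)
        url_list = read_urls(path)
        ingredients, failed = fetch(url_list)
        combined.update(ingredients)
        # Keep failures per chunk so a single chunk can be re-run later.
        failed_by_chunk[chunk_id] = failed
    return combined, failed_by_chunk
```

A possible call, assuming the chunk files are numbered 0 to 14 as in the cells above: `fetch_chunks(range(15), lambda p: read_from_file(p, 'r_b_line'), get_all_ingredients_from_urls_ver2)`.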

In [182]:
# Kitchenart website
kitchenart_combined_dict = {}
kitchenart_combined_dict.update(all_ingredients_13)

In [183]:
with open("cleaned_recipes_13.txt", "a", encoding="utf-8") as f:
    for title, ingredients in kitchenart_combined_dict.items():
        ingredients = ingredients.replace("***", "\n")
        f.write(f"Title: {title}\n")
        f.write(f"Ingredients:\n{ingredients}\n")
        f.write("-" * 50 + "\n")

### Group into a DataFrame and save the dataset

In [180]:
# combined_dict = {}
# # combined_dict.update(kokotaru_combined_dict)
# combined_dict.update(kitchenart_combined_dict)

In [181]:
# with open("Food_and_Ingredients.txt", "w", encoding="utf-8") as f:
#     for title, ingredients in combined_dict.items():
#         ingredients = ingredients.replace("***", "\n")
#         f.write(f"Title: {title}\n")
#         f.write(f"Ingredients:\n{ingredients}\n")
#         f.write("-" * 50 + "\n")
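
The target format from the job description is a two-column table (name of dish, main ingredients). As a hedged sketch, the combined dictionary could be turned into such rows and written to CSV with the standard library. The same rows could also be passed to `pandas.DataFrame`. The helper names, the `"; "` joiner, and the output filename are assumptions, and the `***` separator is taken from the `replace("***", "\n")` calls above.

```python
import csv

FIELDNAMES = ["Name of dish", "Main ingredients"]


def ingredients_to_rows(combined, separator="***"):
    """Convert {title: 'a***b***c'} into one table row per dish."""
    rows = []
    for title, ingredients in combined.items():
        # Split on the separator and drop empty or whitespace-only pieces.
        items = [part.strip() for part in ingredients.split(separator) if part.strip()]
        rows.append({"Name of dish": title, "Main ingredients": "; ".join(items)})
    return rows


def save_rows_to_csv(rows, path):
    """Write the rows to a UTF-8 CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `save_rows_to_csv(ingredients_to_rows(kitchenart_combined_dict), "Food_and_Ingredients.csv")` would write one row per recipe. The filename here is illustrative.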