# 📂 DATA COLLECTION

## 🎯 Objective

The goal of this step is to collect a high-quality dataset of sports images with corresponding captions for the Image Captioning task. We utilize two data sources:
- Google Images Crawling - Scraping sports images and generating captions with the GPT-4-Vision model.
- UIT-ViIC Dataset - A publicly available dataset containing Vietnamese captions for images.

## 🌐 Data Sources

**1️⃣ Google Images Crawling**

- Reason: Besides Pinterest, Google Images hosts a vast collection of sports-related images.
- Approach:
    - Use web scraping techniques to extract images and descriptions.
    - Aim to annotate the images by hand.
- Challenges:
    - Ensuring high-quality, relevant captions from the model.
    - Handling duplicates and low-quality images.
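
One way to approach the duplicate problem is to hash each file's bytes and keep only the first file per digest. The sketch below is an assumption about how this could be done, not the project's actual pipeline. `find_exact_duplicates` is a hypothetical helper, and it only catches byte-identical files; resized or re-encoded copies would need a perceptual hash instead.

```python
import hashlib
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def find_exact_duplicates(folder):
    """Return paths whose bytes match an earlier file in the folder tree (hypothetical helper)."""
    seen = {}        # digest -> first path seen with that content
    duplicates = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTS:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)   # same bytes as an earlier file
        else:
            seen[digest] = path
    return duplicates
```

The returned paths can be reviewed before deletion, which is safer than removing files inside the scan itself.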

## 📥 Data Collection Methods

### 1️⃣ Self-crawl from Google Images

#### Import libraries

In [18]:
!pip install icrawler
from icrawler.builtin import GoogleImageCrawler
from icrawler.builtin import BingImageCrawler
from icrawler.downloader import Downloader
import os

# Disable warnings
import warnings
warnings.filterwarnings('ignore')

# Disable logging from icrawler
import logging
logging.getLogger('icrawler').setLevel(logging.CRITICAL)
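
The crawl output further down in this notebook still shows `INFO` lines, and their third field reads `feeder`, `parser`, or `downloader` rather than `icrawler`. Assuming that field is the logger name, a sketch like the following would raise the level on those loggers directly. `silence_loggers` is a hypothetical helper, and the name list is inferred from the printed output, not taken from icrawler's documentation.

```python
import logging

def silence_loggers(names=("feeder", "parser", "downloader"), level=logging.CRITICAL):
    """Raise the threshold on each named logger so lower-severity records are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)

silence_loggers()
```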




#### Define functions/classes

In [22]:
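import shutil

# Delete a folder of downloaded images (and everything inside it)
def remove_images(folder):
    try:
        shutil.rmtree(folder)
    except FileNotFoundError:
        pass  # nothing to remove
    except OSError as e:
        print(e)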

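def crawl_images(sport, num_images, save_dir):
    # Search both engines with the same query and split the quota evenly
    keyword = sport + " action shots"
    num = num_images // 2
    # Google images are saved first (no index offset)
    crawler = GoogleImageCrawler(storage={"root_dir": save_dir})
    crawler.crawl(keyword=keyword, max_num=num, min_size=(300, 300), max_size=None)
    # Bing writes to the same folder; the offset keeps its files from overwriting Google's
    crawler2 = BingImageCrawler(storage={"root_dir": save_dir})
    crawler2.crawl(keyword=keyword, max_num=num, min_size=(300, 300), max_size=None, file_idx_offset=num)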
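
To make the split in `crawl_images` concrete, the sketch below computes which file indices each crawler could use. It assumes icrawler names files with a zero-padded six-digit index starting at 1 plus `file_idx_offset` (consistent with the `000100.jpg` check used later in this notebook, but an assumption nonetheless). `planned_index_ranges` is a hypothetical helper that does no crawling.

```python
def planned_index_ranges(num_images):
    """Return the (first, last) file index each crawler may write, under the assumed naming."""
    num = num_images // 2          # same even split as crawl_images
    google = (1, num)              # offset 0: indices 1..num
    bing = (num + 1, 2 * num)      # offset num: indices num+1..2*num
    return {"google": google, "bing": bing}

ranges = planned_index_ranges(200)
for engine, (first, last) in ranges.items():
    print(f"{engine}: {first:06d} to {last:06d}")
```

Because the offset equals Google's quota, the two ranges cannot overlap even when Google returns fewer images than requested. An odd `num_images` loses one image to the integer division.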
    

#### TEST

In [None]:
crawl_images("soccer", 10, "../data/raw_images/soccer")

2025-04-14 00:14:27,319 - INFO - feeder - thread feeder-001 exit
2025-04-14 00:14:28,461 - INFO - parser - parsing result page https://www.google.com/search?q=soccer+action+shots&ijn=0&start=0&tbs=&tbm=isch
2025-04-14 00:14:28,757 - INFO - downloader - image #1	https://i.pinimg.com/736x/59/77/f2/5977f28d3e23f3a6268cfb3e45325c93.jpg
2025-04-14 00:14:29,569 - ERROR - downloader - Response status code 400, file https://media.istockphoto.com/id/860880772/photo/determined-bicycle-kick-on-a-soccer-match.jpg
2025-04-14 00:14:29,658 - INFO - downloader - image #2	https://i.pinimg.com/originals/db/ae/90/dbae904063ae000a82dc6032eb7d4f45.jpg
2025-04-14 00:14:30,225 - ERROR - downloader - Response status code 400, file https://media.istockphoto.com/id/500240235/photo/soccer-player-kicking-ball.jpg
2025-04-14 00:14:30,355 - INFO - downloader - image #3	https://c8.alamy.com/comp/P73AX4/england-columbia-soccer-moscow-july-03-2018-harry-kane-england-9-drives-controls-the-ball-action-full-size-single-a

2025-04-14 00:14:39,731 - INFO - parser - downloaded image reached max num, thread parser-001 is ready to exit
2025-04-14 00:14:39,732 - INFO - parser - thread parser-001 exit
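
The output above mixes successful downloads with `Response status code 400` errors. A small parser can tally both per host, which helps spot sources that consistently refuse requests. The sketch below assumes the log format shown above and that the log text has been captured into a string; `summarize_download_log` is a hypothetical helper.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Match the two downloader line shapes that appear in the crawl output.
OK_RE = re.compile(r"INFO - downloader - image #\d+\s+(\S+)")
ERR_RE = re.compile(r"ERROR - downloader - Response status code (\d+), file (\S+)")

def summarize_download_log(log_text):
    """Count successful and failed downloads per host from icrawler-style log text."""
    ok, failed = Counter(), Counter()
    for line in log_text.splitlines():
        if m := OK_RE.search(line):
            ok[urlparse(m.group(1)).netloc] += 1
        elif m := ERR_RE.search(line):
            # Key failures by (host, HTTP status code)
            failed[(urlparse(m.group(2)).netloc, int(m.group(1)))] += 1
    return ok, failed
```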


#### MAIN

In [7]:
with open("../data/metadata/sports_cate.txt", "r", encoding="utf-8") as file:
    sports_list = [line.strip() for line in file if line.strip()]  # ignore blank lines
    
# List a few sports as a quick check
sports_list[:5]

['Soccer', 'Volleyball', 'Baseball', 'Tennis', 'Basketball']

In [24]:
remove_images("../data/raw_images")

In [25]:
for sport in sports_list:
    # Skip sports whose folder already contains image 000100 from an earlier crawl
    sport_name = sport.lower().replace(" ", "_")
    save_dir = f"../data/raw_images/{sport_name}"
    if os.path.exists(f"{save_dir}/000100.jpg") or os.path.exists(f"{save_dir}/000100.png"):
        continue
    crawl_images(sport, 200, save_dir)
    
print("🎉 Done")

2025-04-14 00:15:29,147 - INFO - feeder - thread feeder-001 exit
2025-04-14 00:15:30,200 - INFO - parser - parsing result page https://www.google.com/search?q=Soccer+action+shots&ijn=0&start=0&tbs=&tbm=isch
2025-04-14 00:15:30,496 - INFO - downloader - image #1	https://i.pinimg.com/736x/59/77/f2/5977f28d3e23f3a6268cfb3e45325c93.jpg
2025-04-14 00:15:31,295 - ERROR - downloader - Response status code 400, file https://media.istockphoto.com/id/860880772/photo/determined-bicycle-kick-on-a-soccer-match.jpg
2025-04-14 00:15:31,368 - INFO - downloader - image #2	https://i.pinimg.com/originals/db/ae/90/dbae904063ae000a82dc6032eb7d4f45.jpg
2025-04-14 00:15:32,121 - ERROR - downloader - Response status code 400, file https://media.istockphoto.com/id/500240235/photo/soccer-player-kicking-ball.jpg
2025-04-14 00:15:32,249 - INFO - downloader - image #3	https://c8.alamy.com/comp/P73AX4/england-columbia-soccer-moscow-july-03-2018-harry-kane-england-9-drives-controls-the-ball-action-full-size-single-a

🎉 Done
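
The skip check in the loop above depends on one specific filename existing. An alternative, sketched below under the assumption that "enough images already downloaded" is what the check is meant to capture, is to count the image files in the folder. `needs_crawl` and its `min_images` threshold are hypothetical and not part of the original notebook.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def needs_crawl(save_dir, min_images=100):
    """Return True if save_dir is missing or holds fewer than min_images image files."""
    if not os.path.isdir(save_dir):
        return True
    count = sum(
        1 for name in os.listdir(save_dir)
        if name.lower().endswith(IMAGE_EXTS)
    )
    return count < min_images
```

A count-based check keeps working after files are renamed, whereas a filename-based check does not.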


In [27]:
crawl_images("running", 200, "../data/raw_images/running")

2025-04-14 01:14:07,532 - INFO - feeder - thread feeder-001 exit
2025-04-14 01:14:08,649 - INFO - parser - parsing result page https://www.google.com/search?q=running+action+shots&ijn=0&start=0&tbs=&tbm=isch
2025-04-14 01:14:09,556 - ERROR - downloader - Response status code 400, file https://media.istockphoto.com/id/589985118/photo/action-shot-of-running-girl.jpg
2025-04-14 01:14:09,741 - INFO - downloader - image #1	https://thumbs.dreamstime.com/b/action-shot-sporty-young-man-running-outdoors-start-pathway-blue-sky-background-copy-space-around-44061816.jpg
2025-04-14 01:14:11,418 - INFO - downloader - image #2	https://www.barksdalephoto.com/img/sprinter.jpg
2025-04-14 01:14:12,576 - INFO - downloader - image #3	https://jacoblund.com/cdn/shop/products/08a73602be21965df61186e3ac7c5ed8.jpg
2025-04-14 01:14:13,151 - INFO - downloader - image #4	https://images.squarespace-cdn.com/content/v1/5b7dded11aef1dc9d40697f9/6dd7335b-af17-41b4-a0cc-33acbf419c5d/Ava+Nkadi_Track+%26+Field_Action+Shot
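
Since some requests fail or may return unexpected content, it can be worth checking that each saved file really is an image before renaming or captioning it. The sketch below compares a file's leading bytes against the standard JPEG and PNG signatures. `looks_like_image` is a hypothetical helper, and it only verifies the header, not that the whole file decodes.

```python
JPEG_MAGIC = b"\xff\xd8\xff"          # JPEG start-of-image marker plus next marker prefix
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"      # 8-byte PNG signature

def looks_like_image(path):
    """Return True if the file starts with a JPEG or PNG signature."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    return head.startswith(JPEG_MAGIC) or head.startswith(PNG_MAGIC)
```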

In [29]:
root_dir = "../data/raw_images"
supported_exts = [".jpg", ".jpeg", ".png"]

for sport in os.listdir(root_dir):
    sport_path = os.path.join(root_dir, sport)
    if not os.path.isdir(sport_path):
        continue  # skip regular files

    files = sorted(os.listdir(sport_path))  # sort for a stable, tidy order
    count = 0

    for file in files:
        ext = os.path.splitext(file)[1].lower()
        if ext not in supported_exts:
            continue  # skip non-image files

        new_name = f"{sport}_{count}{ext}"
        src = os.path.join(sport_path, file)
        dst = os.path.join(sport_path, new_name)

        # Caution: on POSIX, os.rename silently replaces an existing dst,
        # so run this once on freshly crawled folders only
        os.rename(src, dst)
        count += 1

    print(f"✅ Renamed {count} images in folder '{sport}'.")

✅ Renamed 131 images in folder 'archery'.
✅ Renamed 87 images in folder 'athletics'.
✅ Renamed 103 images in folder 'badminton'.
✅ Renamed 113 images in folder 'baseball'.
✅ Renamed 128 images in folder 'basketball'.
✅ Renamed 117 images in folder 'boxing'.
✅ Renamed 124 images in folder 'cycling'.
✅ Renamed 146 images in folder 'equestrianism'.
✅ Renamed 128 images in folder 'golf'.
✅ Renamed 114 images in folder 'skiing'.
✅ Renamed 148 images in folder 'soccer'.
✅ Renamed 114 images in folder 'surfing'.
✅ Renamed 113 images in folder 'swimming'.
✅ Renamed 136 images in folder 'tennis'.
✅ Renamed 111 images in folder 'volleyball'.
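
Renaming files in place to `{sport}_{n}` can clash with names that already exist if the cell is run a second time on the same folders. A common way around this, sketched below as an assumption rather than the notebook's own method, is to rename in two passes: first to unique temporary names, then to the final names. `rename_sequentially` is a hypothetical helper.

```python
import os
import uuid

SUPPORTED_EXTS = (".jpg", ".jpeg", ".png")

def rename_sequentially(folder, prefix):
    """Rename images in folder to prefix_0, prefix_1, ... without overwriting any of them."""
    files = sorted(
        f for f in os.listdir(folder)
        if os.path.splitext(f)[1].lower() in SUPPORTED_EXTS
    )
    token = uuid.uuid4().hex  # unique tag so temporary names cannot clash with real files
    staged = []
    # Pass 1: move every image to a temporary name.
    for i, name in enumerate(files):
        ext = os.path.splitext(name)[1].lower()
        tmp = f".tmp_{token}_{i}{ext}"
        os.rename(os.path.join(folder, name), os.path.join(folder, tmp))
        staged.append((tmp, f"{prefix}_{i}{ext}"))
    # Pass 2: no image holds a final name any more, so these renames cannot collide.
    for tmp, final in staged:
        os.rename(os.path.join(folder, tmp), os.path.join(folder, final))
    return len(staged)
```

Calling `rename_sequentially("../data/raw_images/soccer", "soccer")` would then be safe to repeat, at the cost of two renames per file.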
