# Data scraping

Images for a range of search terms are scraped from Google Images. The aim is to cover as many motifs as possible that could appear on digitized slides. An existing, larger dataset could have been used instead, but I chose this approach to keep more control over the data and to make the task more challenging. Most of the search terms were generated with ChatGPT.

### Imports

In [1]:
import os
import requests
from bs4 import BeautifulSoup

### Scraping

In [2]:
search_terms = [
    "landscape", "forest", "mountain", "beach", "river", "sunset", "desert", "village", "lake", "waterfall", "airport", "harbor",
    "skyscraper", "bridge", "street", "house", "monuments", "church", "castle", "ruins", "lighthouse", "tunnel", "building", "bar", "stadium",
    "flower", "tree", "fruit", "vegetable", "diving", "letters", "architecture",
    "bedroom", "living room", "kitchen", "bathroom", "office", "library", "classroom", "gym", "apartment interior",
    "dog", "cat", "bird", "horse", "wildlife", "animals", "pets",
    "portrait", "family photos", "children playing", "people at work", "crowds", "concert", "wedding", "protest", "people",
    "computer", "chair", "table", "bed", "band", "book", "newspaper", "party", "birthday", "christmas",
    "vacation photos", "travel photography", "sports", "swimming", "hiking", "winter sports", "football", "basketball", "tennis", "golf", "hockey", "surfing",
    "sculpture", "musical instrument", "museums", "theaters", "drawings", "military",
    "car", "bicycle", "boat", "airplane", "train", "furniture", "kitchen appliances", "books",
    "food", "drinks", "restaurant", "cooking", "baking", "picnic", "store", "hotel", "hospital", "school", "factory", "construction site", "farm", "Public transportation", 
    "sunrise", "supermarket", "tradition", "market", "sightseeing", "tourist attractions", "landmarks", "historical sites", "zoo", "park", "camping"
]


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

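# make sure the output directory exists before writing any files
os.makedirs('../data/raw', exist_ok=True)

for term in search_terms:
    # pass the query via params so requests URL-encodes terms that contain spaces
    page = requests.get('https://www.google.com/search',
                        params={'q': term, 'tbm': 'isch'}, headers=headers, timeout=10)
    soup = BeautifulSoup(page.text, 'html.parser')

    # take up to 20 images per search term
    images = soup.find_all('img')[1:21]  # skip the first image, which is the Google logo

    for i, image in enumerate(images):
        image_url = image.get('src', '')

        # skip tags without a fetchable URL (missing src or inline data: URIs)
        if not image_url.startswith('http'):
            print(f"No downloadable URL for image {i+1} for search term '{term}'.")
            continue

        # download and save the image
        response = requests.get(image_url, timeout=10)
        if response.status_code == 200:
            with open(f'../data/raw/{term}_{i+1}.jpg', 'wb') as f:
                f.write(response.content)
        else:
            print(f"Could not download image {i+1} for search term '{term}'.")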

print("Pictures downloaded successfully.")

Pictures downloaded successfully.
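
The final message is printed even if individual downloads failed, so it can be worth checking how many files were actually saved per search term. The following is a minimal sketch, not part of the original notebook. It assumes the `<term>_<n>.jpg` naming scheme used above, and the helper names `count_images_per_term` and `terms_below_target` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_images_per_term(directory):
    """Count saved files per search term, assuming names like '<term>_<n>.jpg'."""
    counts = Counter()
    for path in Path(directory).glob("*.jpg"):
        # split on the last underscore so terms that contain spaces stay intact
        term, _, index = path.stem.rpartition("_")
        if term and index.isdigit():
            counts[term] += 1
    return counts


def terms_below_target(counts, terms, target=20):
    """Return the terms with fewer than `target` saved images (hypothetical helper)."""
    return [term for term in terms if counts.get(term, 0) < target]
```

Calling `terms_below_target(count_images_per_term('../data/raw'), search_terms)` would then list the search terms that ended up with fewer than 20 images and may need to be scraped again.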
