# Digital image archiving, APIs, and webscraping

![gallery](gallery.jpeg)

In [20]:
import getpass
import json
import os
import re
import string
from urllib.parse import urlparse

import nltk
import pandas as pd
import requests
import seaborn as sns
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')  # fetch the stopword list if it is not already cached

lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words('english'))
punct = list(string.punctuation)
sns.set()

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## How can we programmatically access images in a way that facilitates research?

An API (Application Programming Interface) is a set of rules and protocols that allows different software applications to communicate with each other. It defines how requests and responses should be structured, enabling developers to access and use the functionality of another service, library, or platform without needing to understand its internal workings.

APIs should be your first choice when gathering large quantities of data, because they generally return that data in structured form, which makes it easy to manipulate.

Microsoft makes Bing image search available as an API; so do other search providers. The Bing API is useful because it gives good metadata on the images it finds. But first, let's look at a more intuitive API.

### The Project Gutenberg API

[Project Gutenberg](https://www.gutenberg.org/) provides electronic copies of a large variety of out-of-copyright texts. Its catalogue can be accessed through the [Gutendex API](https://gutendex.com/). The `requests` library in Python can query this API by passing the relevant parameters (see the Gutendex documentation for the full list).

In [6]:
# Define the API root url:

gut = 'https://gutendex.com/books/'

In [7]:
# Query by topic (here, 'death')

params = {'topic': 'death'}
death = requests.get(gut, params=params).json()  # parse the JSON response into a Python dictionary
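To see what `requests` does with the parameter dictionary, the request can be prepared without being sent, which exposes the final URL. This is a minimal sketch: the `languages` filter is an extra Gutendex parameter added here purely for illustration and is not part of the query above.

```python
import requests

gut = 'https://gutendex.com/books/'
params = {'topic': 'death', 'languages': 'en'}  # 'languages' is an illustrative extra filter

# Build the request without sending it, so the encoded URL can be inspected offline
prepared = requests.Request('GET', gut, params=params).prepare()
print(prepared.url)
```

Each key/value pair becomes one `key=value` entry in the query string, so adding a filter is just a matter of adding a key to `params`.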

In [8]:
print(death['results'][2]['summaries'][0])

"Baron Trump's Marvellous Underground Journey" by Ingersoll Lockwood is a children's novel written in the late 19th century. This imaginative tale follows the adventures of a young baron named Wilhelm Heinrich Sebastian von Troomp, also known as Baron Trump, alongside his loyal dog, Bulger. Together, they embark on a fantastical journey in search of the mysterious portals to a 'World within a World,' guided by ancient manuscripts and their sense of curiosity.  The opening of the story introduces us to Baron Trump and his concerns for his less-than-happy companion, Bulger, who is weary of the familiar surroundings of Castle Trump. After discovering a musty manuscript by Don Fum, which suggests the existence of an underground world, the baron feels compelled to leave home for adventure. His departure is filled with heartfelt farewells from his parents and preparations for what promises to be a thrilling expedition. As Baron Trump and Bulger journey northward through Russia, they face var

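Indexing into the response one record at a time gets tedious, so it can help to flatten the `results` list into a table. The sketch below assumes the response shape used above (a `results` list whose items carry `id`, `title`, `authors` and `summaries`); `books_to_frame` is a hypothetical helper name, and the sample payload is hand-written with placeholder values rather than real API output.

```python
import pandas as pd

def books_to_frame(payload):
    """Flatten a Gutendex-style response dictionary into one row per book."""
    rows = []
    for book in payload.get('results', []):
        summaries = book.get('summaries') or ['']  # tolerate missing or empty summaries
        rows.append({
            'id': book.get('id'),
            'title': book.get('title'),
            'authors': '; '.join(a.get('name', '') for a in book.get('authors', [])),
            'summary': summaries[0],
        })
    return pd.DataFrame(rows)

# Hypothetical payload in the assumed response shape (all values are placeholders)
sample = {
    'count': 1,
    'results': [
        {'id': 1, 'title': 'Example Title',
         'authors': [{'name': 'Surname, Forename'}],
         'summaries': ['An example summary.']},
    ],
}
books = books_to_frame(sample)
print(books[['title', 'authors']])
```

With the live response, `books_to_frame(death)` would give one row per book returned for the topic.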
## Unsplash is a free high-quality image API. How can we access it?

In [25]:
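# Read the access key from the environment (UNSPLASH_ACCESS_KEY is an assumed variable name),
# or prompt for it, rather than hard-coding the credential in the notebook
api_key = os.environ.get("UNSPLASH_ACCESS_KEY") or getpass.getpass("Unsplash access key: ")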

def search_unsplash(query, per_page=10):
    """Search Unsplash for `query` and return the result metadata as a DataFrame."""
    url = "https://api.unsplash.com/search/photos"
    headers = {
        "Authorization": f"Client-ID {api_key}"
    }
    params = {
        "query": query,
        "per_page": per_page
    }

    # Make the API request (the timeout stops the call from hanging indefinitely)
    response = requests.get(url, headers=headers, params=params, timeout=30)

    if response.status_code == 200:
        data = response.json()
        results = data["results"]

        # Flatten the nested JSON records into one row per image, one column per field
        df = pd.json_normalize(results)

        return df
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return None




def download_images(image_urls, save_dir="unsplash_images"):
    """Download each URL in `image_urls` into `save_dir`, numbering the files sequentially."""
    # Create the save directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)

    # Loop through the list of URLs and download each image
    for i, url in enumerate(image_urls):
        try:
            # Send a GET request to the image URL, streaming the body in chunks
            response = requests.get(url, stream=True, timeout=30)
            response.raise_for_status()  # Raise an error for bad status codes

            # Build a sequential file name; the .png extension is fixed here and
            # is not checked against the actual image format
            file_name = f"image_{i + 1}.png"
            file_path = os.path.join(save_dir, file_name)

            # Save the image to the specified directory
            with open(file_path, "wb") as file:
                for chunk in response.iter_content(chunk_size=8192):
                    file.write(chunk)
            print(f"Downloaded: {file_path}")
        except requests.exceptions.RequestException as e:
            print(f"Failed to download {url}: {e}")
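The download loop writes every file with the same extension, whatever the server actually returns. One possible refinement, sketched here under the assumption that the server sends a meaningful `Content-Type` header, is to derive the extension from that header. `image_file_name` and the `EXTENSIONS` mapping are hypothetical names introduced for this example, and the mapping is deliberately small.

```python
# Small, deliberately incomplete mapping from MIME type to file extension
EXTENSIONS = {
    'image/jpeg': '.jpg',
    'image/png': '.png',
    'image/webp': '.webp',
    'image/gif': '.gif',
}

def image_file_name(index, content_type, default='.jpg'):
    """Build a file name such as 'image_1.jpg' from a Content-Type header value."""
    # Drop any parameters, e.g. 'image/png; charset=binary' -> 'image/png'
    mime = (content_type or '').split(';')[0].strip().lower()
    return f"image_{index}{EXTENSIONS.get(mime, default)}"

print(image_file_name(1, 'image/jpeg'))
print(image_file_name(2, 'image/png; charset=binary'))
```

Inside `download_images`, the fixed name could then be replaced with `image_file_name(i + 1, response.headers.get('Content-Type'))`.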

In [26]:
d = search_unsplash("death")

In [27]:
d

Unnamed: 0,id,slug,created_at,updated_at,promoted_at,width,height,color,blur_hash,description,...,user.social.instagram_username,user.social.portfolio_url,user.social.twitter_username,user.social.paypal_email,topic_submissions.experimental.status,user.links.following,user.links.followers,topic_submissions.current-events.status,topic_submissions.current-events.approved_on,topic_submissions.color-theory.status
0,rPWsIbJDeX8,red-rose-shallow-focus-photography-rPWsIbJDeX8,2018-06-14T18:35:24Z,2025-05-12T11:42:17Z,,3456,5184,#262626,"L14xSj?G5Q9[x^t8M|Rj5,Nb=J,?",,...,matreding,,,,,,,,,
1,j3R9C-Xqe1w,lit-candle-in-hand-j3R9C-Xqe1w,2019-12-18T20:44:28Z,2025-05-12T12:01:41Z,,5304,7952,#262626,L86%.-I;9]=xWXj@j@WW0}$%=xEM,THE light,...,Jphotography__,http://www.jphotography2012.myportfolio.com,,,,,,,,
2,1PtM6b85sdw,a-black-and-white-photo-of-a-human-skull-1PtM6...,2019-09-12T10:29:34Z,2025-05-12T11:57:08Z,,6000,4000,#262626,L655II%M009Ft7t7WBM{9FM{?b-;,human skull in black - proud of death,...,ahmedadlyraslan,https://ko-fi.com/ahmedadly,,,rejected,,,,,
3,IjaKTePIu60,grayscale-photography-of-gray-tombstone-IjaKTe...,2019-02-10T16:29:05Z,2025-05-12T14:58:34Z,,3840,5760,#8c8c8c,LMD]rH00%Mt7_3M{WBofIUofRjRj,,...,,http://fernandamarin.com,,,,https://api.unsplash.com/users/feymarin/following,https://api.unsplash.com/users/feymarin/followers,,,
4,g32g_IBprFA,grayscale-photography-of-cemetery-g32g_IBprFA,2015-06-24T18:04:31Z,2025-05-12T11:30:24Z,,4288,2848,#f3f3f3,LgGIo.xut7xu~qt7xut7WBayxut7,,...,davideragusa,https://davideragusa.com,davideragusa,,,https://api.unsplash.com/users/davideragusa/fo...,https://api.unsplash.com/users/davideragusa/fo...,,,
5,ExV72ahe4sE,man-in-white-and-black-jacket-and-pants-sittin...,2020-04-04T15:52:34Z,2025-05-12T12:08:08Z,,6370,4247,#262626,LB6u9T?wS%o#XU%g?bx]RjRjWCof,,...,grantwhitty,http://grantwhitty.com,,,,https://api.unsplash.com/users/grantwhitty/fol...,https://api.unsplash.com/users/grantwhitty/fol...,,,
6,xesvLZQ1_bc,gray-skull-lot-xesvLZQ1_bc,2019-08-13T10:18:52Z,2025-05-12T11:55:45Z,,4000,6000,#262626,L9EMLDIUM{j[~qWBj[ofIUt7RjWB,,...,gabormolnar92,https://gabormolnar.dev,gabormolnar92,,,,,,,
7,fuGPLDhQBo8,man-in-yellow-jacket-and-pants-holding-white-a...,2020-10-06T08:06:37Z,2025-05-12T12:19:45Z,,3038,3958,#a6a6a6,"LLG8[xs+M_byIUR.Rloc0Ms,xut6",Consequences of the pandemic,...,isaac.q.q,https://iqq.es/,,,,,,approved,2020-10-07T09:37:10Z,
8,BXOXnQ26B7o,selective-focus-photo-of-brown-and-blue-hourgl...,2017-07-27T07:07:17Z,2025-05-12T14:41:17Z,2017-07-27T10:51:16Z,6000,4000,#0c2626,LRDIj{$*j?of0fa|oejsRjNHNbWV,"Eventually everything hits the bottom, and all...",...,aronvisuals,http://linktr.ee/aronvisuals,aronvisuals,,,,,,,rejected
9,W1J8mMlkmXY,man-holding-gray-dagger-W1J8mMlkmXY,2018-11-20T20:19:59Z,2025-05-12T11:46:32Z,2018-11-21T09:16:51Z,3756,4694,#262626,L23[#dM{4n-;ofayayj[00xu?bIA,I and me are always too deep in conversation.,...,reziiz,,RezaHasannia2,,,https://api.unsplash.com/users/rezahasannia/fo...,https://api.unsplash.com/users/rezahasannia/fo...,,,
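The dotted column names in this table (for example `user.links.following`) come from `pd.json_normalize`, which flattens nested dictionaries by joining their keys with a dot. The sketch below shows the mechanism on a hand-written record; the record is hypothetical, with placeholder values, and only mimics the nested shape of a search result.

```python
import pandas as pd

# Hypothetical record in a nested shape similar to one search result (placeholder values)
sample_results = [
    {
        'id': 'abc123',
        'description': 'placeholder description',
        'urls': {'regular': 'https://example.com/regular.jpg',
                 'thumb': 'https://example.com/thumb.jpg'},
        'user': {'username': 'someone',
                 'links': {'html': 'https://example.com/someone'}},
    }
]

flat = pd.json_normalize(sample_results)
print(sorted(flat.columns))

# Keep only the columns needed for a small image archive
archive = flat[['id', 'description', 'urls.regular', 'user.username']]
```

Selecting a handful of columns this way makes the metadata far easier to read than the full frame, which has a column for every nested field.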


In [13]:
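# Collect the image URLs from the search results in `d`; this assumes json_normalize
# flattened each result's nested `urls` object into a 'urls.regular' column
urls = d['urls.regular'].tolist()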

### Now, let's measure the emotional variation of any text using the VAD norms

In [None]:
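vad = pd.read_csv('vad.csv', index_col=0)  # VAD norms, indexed by word
vad = vad[["V.Mean.Sum", "A.Mean.Sum", "D.Mean.Sum"]]
vad.columns = ['valence', 'arousal', 'dominance']

def vad_data(word_list):
    """Return the mean valence, arousal and dominance of the words found in the VAD norms."""
    word_list = [i.lower() for i in word_list]
    # Keep only the rows for words that have an entry in the norms
    norms = [vad.loc[i] for i in word_list if i in vad.index]
    # Average each dimension across the matched words (the result is empty if nothing matched)
    norms_vad = pd.DataFrame(norms).mean()
    return norms_vad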
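To make the averaging concrete, here is a self-contained sketch of the same idea applied to a raw string. It does not read `vad.csv`: `toy_vad` holds made-up values for two words, `score_text` is a hypothetical helper, and the regular-expression tokenizer is a crude stand-in for `word_tokenize`.

```python
import re
import pandas as pd

# Toy norms with made-up values, standing in for the real `vad` table
toy_vad = pd.DataFrame(
    {'valence': [8.0, 2.0], 'arousal': [6.0, 4.0], 'dominance': [7.0, 3.0]},
    index=['happy', 'sad'],
)

def score_text(text, norms):
    """Average the norms over every token of `text` that appears in `norms`."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (an assumption)
    matched = [t for t in tokens if t in norms.index]
    # Repeated words count once per occurrence, as in vad_data
    return norms.loc[matched].mean()

scores = score_text('Happy, happy and sad.', toy_vad)
print(scores)
```

Words without an entry in the norms ("and" here) are simply skipped, so the scores describe only the part of the text the norms cover.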