In [1]:
from IPython.display import HTML
HTML("""
<link href="https://fonts.googleapis.com/css2?family=Open+Sans&display=swap" rel="stylesheet">

<style>

div.text_cell_render h1 {
    font-size: 1.5em;
    line-height:1.4em;
    text-align:center;
    }

div.text_cell_render { 
    font-family: 'Open Sans';
    font-size:1.1em;
    line-height:1.5em;
    padding-left:3em;
    padding-right:3em;
    }
</style>
""")

# Working with Web Data

In the previous notebook we identified webpages that host a specific image. The URLs of these webpages are stored in .json files. In this notebook, we use those URLs to download the pages, extract their textual content, and prepare the text for analysis. We also extract general features of the webpages (the metadata). To get a sense of the .json files, we start by opening one and inspecting its list of URLs.

First, we define some basic variables. Then we gather the list of .json files and open the first one.

In [8]:
import os
import json

base_path = "D:/react-data/npg"
photo = "npg"
photo_folder = os.path.join(base_path, photo)

# Count the iteration folders with os.listdir. The "source" folder is skipped because it contains no .json files.
num_iterations = len([
    fol for fol in os.listdir(photo_folder)
    if os.path.isdir(os.path.join(photo_folder, fol)) and "source" not in fol
])
start_iter = 1
range_iter = [str(i) for i in range(start_iter, num_iterations + 1)]

list_jsons = []

# Loop through the iteration folders and gather the .json files in each one.
# endswith(".json") avoids matching names that merely contain ".json" (e.g. "x.json.bak").
for iteration in range_iter:
    iteration_folder = os.path.join(photo_folder, photo + "_" + iteration)
    list_json_in_iteration_folder = [os.path.join(iteration_folder, js) for js in os.listdir(iteration_folder) if js.endswith(".json")]
    list_jsons += list_json_in_iteration_folder

Now we can open the .json files with the json module from Python's standard library. You can inspect or "walk" the data by selecting dictionary keys by name (```json_data['responses']```) or by slicing lists (```json_data['responses'][0:10]```). To find the URLs, navigate to the ```pagesWithMatchingImages``` list.

In [17]:
with open(list_jsons[0], 'r', encoding='utf-8') as fr:
    json_data = json.load(fr)

# Show the first two elements of the list:
json_data['responses'][0]['webDetection']['pagesWithMatchingImages'][0:2]

[{'pageTitle': '&#39;The Girl in the Picture&#39; from Vietnam visits Philly this week for two ...',
  'partialMatchingImages': [{'url': 'https://www.inquirer.com/resizer/iytjLFruXq6zbI7pVdDfZrXTDNU=/1400x932/smart/arc-anglerfish-arc2-prod-pmn.s3.amazonaws.com/public/X3VW2BH4ZBFBBMOJCHTLJL52B4.jpg'}],
  'url': 'https://www.inquirer.com/arts/girl-in-vietnam-picture-kim-phuc-hannibal-locumbe-20191205.html'},
 {'pageTitle': 'The &#39;Napalm Girl&#39; To Share Story Of Hope During Free Event ...',
  'partialMatchingImages': [{'url': 'https://wpr-public.s3.amazonaws.com/wprorg/styles/facebook/s3/field/image/ap_431676074992.jpg?itok=3jjMa-gm'}],
  'url': 'https://www.wpr.org/napalm-girl-share-story-hope-during-free-event-saturday'}]
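The page titles in the output above contain HTML entities such as `&#39;` (an apostrophe). If the titles are going to be used as text, they can be decoded first. The following is a minimal sketch using `html.unescape` from the standard library; `sample_pages` and `clean_title` are hypothetical names introduced here for illustration, and the sample entry simply copies one record from the output above.

```python
import html

# Hypothetical sample that mirrors one entry of 'pagesWithMatchingImages' shown above.
sample_pages = [
    {
        "pageTitle": "The &#39;Napalm Girl&#39; To Share Story Of Hope During Free Event ...",
        "url": "https://www.wpr.org/napalm-girl-share-story-hope-during-free-event-saturday",
    },
]

def clean_title(page):
    """Return the page title with HTML entities decoded ('' if the key is missing)."""
    return html.unescape(page.get("pageTitle", ""))

clean_titles = [clean_title(page) for page in sample_pages]
print(clean_titles[0])
```

In the notebook itself, the same function could be applied to the entries of ```json_data['responses'][0]['webDetection']['pagesWithMatchingImages']```.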

In [19]:
# Show the first five page URLs in the list:
[j['url'] for j in json_data['responses'][0]['webDetection']['pagesWithMatchingImages'][0:5]]

['https://www.inquirer.com/arts/girl-in-vietnam-picture-kim-phuc-hannibal-locumbe-20191205.html',
 'https://www.wpr.org/napalm-girl-share-story-hope-during-free-event-saturday',
 'https://www.theguardian.com/world/video/2015/oct/26/girl-photo-vietnam-war-kim-phuc-laser-treatment-napalm-burns-video',
 'https://time.com/5527944/napalm-girl-dresden-peace-price-james-nachtwey/',
 'https://time.com/4485344/napalm-girl-war-photo-facebook/']
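So far we have only looked at the first .json file. To gather the page URLs from every file, the same lookup can be wrapped in a function and applied to each path. The sketch below is one possible way to do this, not a fixed recipe: `extract_page_urls` and `collect_page_urls` are hypothetical helper names, and the code assumes every file has the `responses` / `webDetection` / `pagesWithMatchingImages` structure shown above, skipping any response where those keys are absent.

```python
import json

def extract_page_urls(json_data):
    """Return the page URLs found in one loaded web-detection result."""
    urls = []
    for response in json_data.get("responses", []):
        # Assumption: a response without matching pages simply lacks these keys.
        pages = response.get("webDetection", {}).get("pagesWithMatchingImages", [])
        for page in pages:
            url = page.get("url")
            if url:
                urls.append(url)
    return urls

def collect_page_urls(json_paths):
    """Load each .json file and return the unique page URLs in first-seen order."""
    seen = set()
    unique_urls = []
    for path in json_paths:
        with open(path, "r", encoding="utf-8") as fr:
            data = json.load(fr)
        for url in extract_page_urls(data):
            if url not in seen:
                seen.add(url)
                unique_urls.append(url)
    return unique_urls
```

Calling `collect_page_urls(list_jsons)` would then return a single de-duplicated list of URLs to download in the next step.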