In [1]:
import wikipedia
from typing import Dict
from bs4 import BeautifulSoup

# Data Exploration: Using the Wikipedia-Provided Image Captions

While the `wikipedia` package is a good starting point, it lacks a few key features, some of which look as though they are already implemented but do not work as expected.

 - Image Captions: The package provides an `images` array, but the images are in no way linked to their captions.
 - Sections: The package also provides a `sections` array, but it has been empty for every page checked so far.

The `wikipedia.WikipediaPage` class will need to be extended to handle these two situations properly.

This seems like an essential step, as wiki pages can easily exceed the token limit for `gpt-4`. It will be useful to break the page down into pieces, then let the OpenAI GPT functions decide which pieces of information are needed to properly answer the user's question.

### Extended Class

The new class adds the following attributes, using `BeautifulSoup` to parse the HTML of the page:
 - `image_captions`: Dictionary with the image URL as the key and the caption as the value.
 - `indexed_content`: Dictionary that splits the page content into sections. The key is the section title, the value is a list of that section's paragraphs.

In [2]:
class WikiPage(wikipedia.WikipediaPage):

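    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Note: this replaces the inherited `html()` method with the parsed document.
        # The parser is named explicitly; `_build_sections` relies on html.parser
        # not wrapping the fragment in <html>/<body> tags.
        self.html = BeautifulSoup(self.html(), "html.parser")
        self.image_captions = self._get_all_image_captions()
        self.indexed_content = self._build_sections()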

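    def _get_all_image_captions(self) -> Dict[str, str]:
        """Map each figure's image URL to its caption text."""
        figures = self.html.find_all(name="figure")

        data = {}

        for fig in figures:
            try:
                # First <img> inside the first <a> of the figure
                img_src = fig.find_all(name="a")[0].find_all(name="img")[0].attrs['src']
                img_caption = fig.find_all(name='figcaption')[0].text
                data[img_src] = img_caption
            except IndexError:
                # Skip figures that lack a link, an image, or a caption
                pass

        return data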
    
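    def _build_sections(self) -> Dict[str, list]:
        """Group paragraph text under the most recent <h2> section title."""
        sections = {}

        # Paragraphs before the first heading form the lead section
        current_section = "Summary"
        sections[current_section] = []

        # Walk the direct children of the top-level content element
        for child in list(self.html.children)[0].children:
            if child.name == 'h2':
                current_section = child.find(name="span").text
                sections[current_section] = []
            elif child.name == 'p':
                sections[current_section].append(child.text)
        return sections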
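
The `src` values collected by `_get_all_image_captions` are stored exactly as they appear in the page HTML, which on Wikipedia is usually a protocol-relative URL starting with `//` (the `image_captions` output further down shows this). Below is a minimal sketch of one way to turn them into absolute URLs before handing them to anything that downloads images. The helper name `absolute_captions`, the `BASE_URL` constant, and the sample data are hypothetical and not part of the `wikipedia` package; the sketch assumes HTTPS is the desired scheme.

```python
from typing import Dict
from urllib.parse import urljoin

# Assumption: images should be fetched over HTTPS, so the base URL supplies that scheme.
BASE_URL = "https://en.wikipedia.org"


def absolute_captions(captions: Dict[str, str], base_url: str = BASE_URL) -> Dict[str, str]:
    """Return a copy of `captions` with absolute image URLs and trimmed caption text."""
    return {
        urljoin(base_url, src): caption.strip()
        for src, caption in captions.items()
    }


# Hypothetical sample in the same shape as `WikiPage.image_captions`.
sample = {
    "//upload.wikimedia.org/example/220px-Example.jpg": " An example caption \n",
}
normalized = absolute_captions(sample)
```

`urljoin` takes the scheme from the base URL when the second argument starts with `//`, so the host in the original `src` is preserved.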

In [4]:
test = WikiPage(title="Dinosaur")

In [5]:
list(test.indexed_content.keys())

['Summary',
 'Definition',
 'History of study',
 'Evolutionary history',
 'Classification',
 'Paleobiology',
 'Origin of birds',
 'Extinction of major groups',
 'Cultural depictions',
 'See also',
 'Further reading',
 'Notes',
 'Bibliography',
 'References']
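
With the page split by section, it becomes possible to check how much text each section holds before sending anything to the model. The sketch below is one rough way to do that, not a definitive implementation: it assumes about 4 characters per token (a rule of thumb, not a real tokenizer), and the names `estimate_tokens`, `section_sizes`, `fits_budget`, and the sample data are hypothetical. Sections such as `'See also'` may map to empty lists, because `_build_sections` only collects `<p>` elements, so empty sections are skipped.

```python
from typing import Dict, List

# Assumption: roughly 4 characters per token for English text (rule of thumb only).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer would be more accurate."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def section_sizes(indexed_content: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each non-empty section title to the estimated token count of its text."""
    sizes = {}
    for title, paragraphs in indexed_content.items():
        text = "\n".join(paragraphs).strip()
        if text:  # skip sections that have no paragraph text
            sizes[title] = estimate_tokens(text)
    return sizes


def fits_budget(sizes: Dict[str, int], wanted: List[str], budget: int) -> List[str]:
    """Keep requested sections, in order, while the running total stays within budget."""
    chosen, used = [], 0
    for title in wanted:
        cost = sizes.get(title)
        if cost is not None and used + cost <= budget:
            chosen.append(title)
            used += cost
    return chosen


# Hypothetical, shortened stand-in for `test.indexed_content`.
sample_content = {
    "Summary": ["a" * 400],
    "Definition": ["b" * 800, "c" * 400],
    "See also": [],
}
sizes = section_sizes(sample_content)
picked = fits_budget(sizes, ["Summary", "Definition", "See also"], budget=150)
```

The section titles and their estimated sizes could then be offered to the model as a table of contents, so that only the sections it asks for are loaded into the prompt.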

In [8]:
test.image_captions

{'//upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Neognathae.jpg/220px-Neognathae.jpg': 'Birds are avian dinosaurs, and in phylogenetic taxonomy are included in the group Dinosauria.',
 '//upload.wikimedia.org/wikipedia/commons/thumb/e/ec/LA-Triceratops_mount-2.jpg/250px-LA-Triceratops_mount-2.jpg': 'Triceratops skeleton, Natural History Museum of Los Angeles County',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/44/Dromaeosaurus_skull_en.svg/220px-Dromaeosaurus_skull_en.svg.png': 'Labeled diagram of a typical archosaur skull, the skull of Dromaeosaurus',
 '//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Sprawling_and_erect_hip_joints_-_horizontal.svg/220px-Sprawling_and_erect_hip_joints_-_horizontal.svg.png': 'Hip joints and hindlimb postures of: (left to right) typical reptiles (sprawling), dinosaurs and mammals (erect), and rauisuchians (pillar-erect)',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0e/William_Buckland_c1845.jpg/170px-William_Buckland_c1845.jpg': 'W