# Random Objects and Hidden Hierarchies
By Sara Harvey & Chantal Brousseau

## Analyzing the Ingenium Collection 
In this notebook we explore questions about the context of artifacts and what that context means for digitizing collections.  Our central purpose is to create a random artifact generator using data from the Ingenium collection, and then to use the same data set to visualize the most common categories of technology and how they are distributed within the collection.  We will look at how these categories reflect the museums in the Ingenium network and what this could mean for categorical representation, that is, what is prioritized within the imagined hierarchy the collections produce.  We want our random artifact generator to give people a chance to find new objects that they otherwise would not have known to search for, and to encourage them to consider how the collection is curated, allowing them to make new connections amongst these artifacts.  

## Data
As previously stated, the data we are using is from the Ingenium collection.  When thinking about this data, we ask questions about where the data came from and its relevance in the museum.  Assuming this represents most, if not all, of the artifacts in the Ingenium collection, we know that it was collected by Ingenium "to represent the products and processes of all areas of science and technology" [^1].  This data is meant to represent all areas of technology, but we must assume this is a Western definition of technology, which could mean some areas are left out or misclassified.  Our notebook adds a drawback of its own: because of the way it works, we were not able to use any entries that did not have an image.  Those artifacts have been left out, which means that people using our notebook will not be presented with anything that is not visually represented within the digital collections.  We are also assuming that these entries contain all the information given to Ingenium, even when it looks like the table is missing data.  We will be using all the information the data gives us, but it is important to note that, as with much of the digitized data available online, we may not have it all. 


[^1]: "Artifact Open Data Set: mash up the past, map the future." Ingenium.  https://ingeniumcanada.org/collection-research/artifact-open-data-set-mash-up 


### Installing and importing what we need
This is where we install all the packages we will use in this notebook.  `pandas` is for manipulating data, `requests` is for fetching the artifact images, `ipywidgets` provides the button for the randomiser, and `plotly` and `numpy` are for visualizing the data.  We then import everything we need and set the display settings for our notebook.  Displaying all rows of the very large data set is too demanding for Jupyter Notebook, so instead we only display the entirety of the columns.

In [None]:
!pip install ipywidgets
!pip install pandas
!pip install requests
!pip install plotly
!pip install numpy

In [None]:
# for randomiser
import requests
import random
import pandas as pd
from IPython.display import Image, display, HTML
import ipywidgets as widgets

# for visualizations
import plotly.graph_objects as go
import plotly.express as px
import numpy as np

pd.set_option('display.max_colwidth', None)
pd.options.display.max_columns = None

### Reading and manipulating our data
Here we are importing the data into a `pandas` data frame to use in our notebook, making it easier to further manipulate. When we import all the data we can see that there are a total of 108463 rows/entries in the collection. That's a lot of data!

In [None]:
df = pd.read_csv("https://dhmuse.netlify.app/data/cstmc-CSV-en.csv")
# or download the file, save it in this directory, and read the local copy instead:
# df = pd.read_csv('cstmc-CSV-en.csv')
df

Now we take out all the artifacts that do not have an image, because the random artifact generator needs an image to pull up.  Once we do this, we can see that the total number of artifacts decreases drastically, from 108463 to only 67631.  

In this process, looking at this shrunken data set, we can ask: what artifacts were prioritized in the process of full digitization? Or perhaps even, what artifacts were deemed worth digitizing through photography? Seeing this decrease helps us understand that there are still artifacts that were either deemed unworthy of digitization, or they could not be digitized due to the condition they were in, whether that be caused simply by aging or poor preservation.  
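
To put a number on that decrease, here is a small worked calculation using only the two row counts quoted above. The variable names are our own, chosen for illustration.

```python
# Row counts quoted in the text above
total_rows = 108463        # all entries in the data set
rows_with_image = 67631    # entries left after dropping rows without an image

# How many entries were excluded, and what share of the whole that is
excluded_rows = total_rows - rows_with_image
excluded_share = excluded_rows / total_rows * 100

print(f"{excluded_rows} entries excluded ({excluded_share:.1f}% of the data set)")
```

In other words, a little over a third of the entries never reach the randomiser.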

In [None]:
# remove any artifacts that do not have images
# (.copy() gives us an independent data frame to modify below)
df = df[df['image'].notna()].copy()

# replace NaN in date column
df['BeginDate'] = df['BeginDate'].fillna('n.d.')

# replace NaN in general desc column
df['GeneralDescription'] = df['GeneralDescription'].fillna('No further desc.')

# replace remaining NaN
df = df.fillna('Unknown')

df

### Creating the randomiser
Here we are selecting a random artifact to be displayed.  The artifact is displayed with:

- Title
- Year
- Image
- Group 
- Category
- Manufacturing information
- Context
- Link to Ingenium archive

We chose to show more contextual information not just as a play on the concept of taking an artifact out of context only for it to be placed heavily into its own context, but also to highlight for those using this notebook what kind of explanatory information objects are archived with. How frequently is this context given? When there is context, how sufficient is the information in situating the artifact in its respective "place" in history? 

The randomiser gives the user the opportunity to discover new things, but also to consider the artifact outside the context of the exhibit or the other artifacts it might be displayed with.  By allowing the user to see only one artifact at a time, we are giving them the opportunity to find new meaning in the artifact and to discover a new artifact in an "unconventional" way.   
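
As a minimal sketch of the random selection itself, the example below draws one row from a toy table with `DataFrame.sample()`. The rows are invented for illustration and are not taken from the Ingenium data. Passing `random_state` is optional; we include it only to show how a draw can be made repeatable, for example when sharing a particular "random" find with someone else.

```python
import pandas as pd

# Toy stand-in for the artifact table; these rows are invented for illustration
toy = pd.DataFrame({
    "ObjectName": ["Camera", "Propeller", "Telephone", "Lathe"],
    "group1": ["Photography", "Aviation", "Communications", "Industrial Technology"],
})

# sample() with no arguments returns one randomly chosen row as a one-row data frame
pick = toy.sample()
print(pick.iloc[0]["ObjectName"])

# Passing the same random_state twice reproduces the same draw
first = toy.sample(random_state=1)
second = toy.sample(random_state=1)
print(first.equals(second))
```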


In [None]:
# "randomizing" function
def display_random(b):
    out.clear_output()
    # Randomly select a single record from the data frame
    randomArtifact = df.sample()
    
    
    artiNum = randomArtifact.iloc[0]['artifactNumber']
    imgURL = randomArtifact.iloc[0]['image']
    objName = randomArtifact.iloc[0]['ObjectName']
    genDesc = randomArtifact.iloc[0]['GeneralDescription']
    yearMan = randomArtifact.iloc[0]['BeginDate']
    
    model = randomArtifact.iloc[0]['model']
    
    group = randomArtifact.iloc[0]['group1']
    cate = randomArtifact.iloc[0]['category1']
    
    canCon = randomArtifact.iloc[0]['ContextCanada']
    funcCon = randomArtifact.iloc[0]['ContextFunction']
    techCon = randomArtifact.iloc[0]['ContextTechnical']
    
    # creating string for "Manufacturing" displayed information,
    # skipping any field that was filled in as 'Unknown'
    manuFields = ['Manufacturer', 'ManuCity', 'ManuProvince', 'ManuCountry']
    manuParts = [str(randomArtifact.iloc[0][field]) for field in manuFields
                 if randomArtifact.iloc[0][field] != 'Unknown']

    # joining avoids a trailing comma when the later fields are unknown
    manufacturing = ', '.join(manuParts) if manuParts else 'Unknown'
    
    # formatting year (isinstance also catches numpy float values)
    if isinstance(yearMan, float):
        yearMan = int(yearMan)

    # Display the record
    with out:
        display(HTML(f'<h3>{objName} [{genDesc}, {yearMan}]</h3>'))

        # displays image
        display(Image(requests.get(imgURL).content))
        
        # contextual information
        display(HTML(f'<p><b>Group:</b> {group}</p>'))
        display(HTML(f'<p><b>Category:</b> {cate}</p>'))

        
        display(HTML(f'<p><b>Model:</b> {model}</p>'))

        display(HTML(f'<p><b>Manufacturing:</b> {manufacturing}</p>'))

        display(HTML(f'<p><b>Canadian Context:</b> {canCon.capitalize()}</p>'))
        display(HTML(f'<p><b>Functional Context:</b> {funcCon.capitalize()}</p>'))
        display(HTML(f'<p><b>Technical Context:</b> {techCon.capitalize()}</p>'))
        
        # links to archival record on Ingenium website
        display(HTML(f'<a href="https://ingeniumcanada.org/ingenium/collection-research/collection-item.php?id={artiNum}">Further Information</a>'))



# Create a button to launch the randomness
# (named `button` so it does not overwrite the plotly `go` import above)
button = widgets.Button(description='Randomise!')
out = widgets.Output()
# calls the "randomizing" function
button.on_click(display_random)
display(button)
display(out)

# hit shift + 'O' to see the whole output without the scroll bar!

### Visualizing and Putting the Data into Context 

To take our analysis further, why not look at the most common group of technology that has been photographically preserved? 

In the following cell, we are counting the values in the column containing the group a given artifact belongs to in order to gain a more macroscopic view of this portion of the Ingenium collections. Once we do this we can then group on a smaller scale to see any further trends or changes.   


In [None]:
# counting the artifacts in each 'group1' value
group1 = df['group1'].value_counts()

# calculating percentage of each group
group1per = df['group1'].value_counts(normalize=True) * 100

# combining count and percentage into one dataframe; building it from the
# two series directly avoids relying on the column names reset_index()
# produces, which differ between pandas versions
groupdf = pd.DataFrame({
    'Group': group1.index,
    'Count': group1.values,
    'Percentage': group1per.reindex(group1.index).values,
})

groupdf

Looking at the following graph, we see that aviation is by far the largest group with 10314 artifacts, and communications follows far behind with 6230. Next come industrial technology and photography, with 4642 and 4550 respectively. 

When thinking about artifacts in these groups, we can also think about what might happen if an artifact falls into two groups, and where it may be placed in such a situation.  Perhaps aviation is so high due to the lack of alternate groups that could have better described many of the artifacts within it, or perhaps the group of 'unknown's is likewise so vast because of this exact situation.
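
One way to start probing the two-group question is to count how often a secondary group is recorded alongside the first. The sketch below assumes the data set has a secondary column, which we call `group2` here; that name is an assumption on our part and should be checked against the actual column list. The rows are invented for illustration. Note also that in this notebook missing values were replaced with 'Unknown' earlier, so on the real data frame the test would be `!= 'Unknown'` rather than `notna()`.

```python
import pandas as pd

# Toy rows invented for illustration; 'group2' is a hypothetical column name
toy = pd.DataFrame({
    "ObjectName": ["Aerial camera", "Propeller", "Radio set"],
    "group1": ["Aviation", "Aviation", "Communications"],
    "group2": ["Photography", None, None],
})

# Flag artifacts that have a second group recorded
has_second_group = toy["group2"].notna()
print(f"{has_second_group.sum()} of {len(toy)} toy artifacts sit in two groups")

# Count which pairs of groups occur together
pairs = toy[has_second_group].groupby(["group1", "group2"]).size()
print(pairs)
```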

In [None]:
fig = px.bar(groupdf, y='Count', x='Group', text='Percentage', labels={
    'Count': '# of Occurrences',
    'Group': 'Group',
    'Percentage': 'Percent'
                 }, color='Percentage',
                title='Artifacts Arranged by Group')

fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide', xaxis_tickangle=45)

fig.update_layout(
    autosize=False,
    width=950,
    height=800,)
fig

### Looking at Smaller Categories
Here we are grouping by category, a classification meant to be more specific than the broader groups.  We wanted to take a more microscopic look at the division of artifacts in order to better understand how artifacts are categorized and to see if there is any correlation with the larger groups.  From this we can see that there is still a large difference between the first-place category, commemorative, and the second, tools & equipment-trades, but there are also more categories, which gives us a better sense of what the collection contains.  
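
To look at how categories line up with the larger groups, one option is a cross-tabulation of the two columns. This is a sketch on invented toy rows rather than the real data; the column names `group1` and `category1` are the ones used in this notebook, but the values are made up.

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    "group1": ["Aviation", "Aviation", "Aviation", "Photography"],
    "category1": ["Commemorative", "Commemorative", "Aircraft parts", "Cameras"],
})

# Rows are groups, columns are categories, cells are artifact counts
table = pd.crosstab(toy["group1"], toy["category1"])
print(table)

# normalize="index" turns each row into shares that sum to 1,
# showing how each group is split across categories
row_shares = pd.crosstab(toy["group1"], toy["category1"], normalize="index")
print(row_shares)
```

On the full data set the same two calls could be run on `df['group1']` and `df['category1']`, although the resulting table would be wide.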

In [None]:
# counting the artifacts in each 'category1' value
category1 = df['category1'].value_counts()

# calculating percentage of each category
category1per = df['category1'].value_counts(normalize=True) * 100

# combining count and percentage into one dataframe; building it from the
# two series directly avoids relying on the column names reset_index()
# produces, which differ between pandas versions
categorydf = pd.DataFrame({
    'Category': category1.index,
    'Count': category1.values,
    'Percentage': category1per.reindex(category1.index).values,
})

categorydf

We chose to use a treemap to see if there may be an implied hierarchy amongst the digitized, photo documented artifacts, whether this hierarchy was created intentionally or not.

Note that the largest category count is 'unknown'. Why might these artifacts have been commemorated through photography yet still deemed unidentifiable? 

Seeing that 'commemorative' objects were the greatest part of the known collection by a large margin, we can infer that commemoration is seen as important within the Ingenium collections.  Why might there be such significance placed on commemorative objects? Further, WHAT is being deemed commemorative?   

In [None]:
# dropping "Unknown" rows for this visualisation for a clearer look at what's "really" there
categorydf.replace('Unknown', np.nan, inplace = True)

categorydf = categorydf.dropna()

fig = px.treemap(categorydf, path=['Category'], values='Count',
                  color='Percentage', hover_data=['Percentage'],
                  color_continuous_scale='RdBu',
                  color_continuous_midpoint=np.average(categorydf['Percentage'], weights=categorydf['Percentage']))

fig.data[0].hovertemplate = '%{label}<br> Count: %{value}'

fig.update_layout(
    autosize=False,
    width=900,
    height=800,)

fig

## Conclusion 
The purpose of this notebook was to generate random artifacts from the Ingenium collection in order to allow the user to find new artifacts or think about previously seen ones in a new way.  In thinking about this, it is important to note that we had to exclude more than a third of the artifacts in the collection (40832 of 108463), as they do not have images. A patron might question why so much was excluded, and what was excluded from the collection.  If we had kept all of the artifacts, perhaps the information in our graphs could have been drastically different and told a completely different story.  We wanted to be able to showcase the variety of artifacts that the Ingenium collection contains, yet also help users understand there are more ways to think about these pieces than just how they are displayed in exhibits.   

To extend our notebook, we could look at which artifacts are missing images and how that could reflect the implied hierarchy of artifacts.  Perhaps there is an entire category that has not been digitized yet, or another way the hierarchy can be seen from the data we do have.
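
A possible starting point for that extension is to measure, per category, what share of entries has no image. The sketch below uses invented toy rows and a placeholder image URL. On the real data it would need to run on the data frame as first loaded, before the rows without images are removed and before missing values are replaced with 'Unknown'.

```python
import pandas as pd

# Toy rows invented for illustration; the URL is a placeholder
toy = pd.DataFrame({
    "category1": ["Cameras", "Cameras", "Aircraft parts", "Aircraft parts"],
    "image": ["http://example.org/a.jpg", None, None, None],
})

# isna() marks entries without an image; the mean of those True/False
# flags within each category is the share of entries missing an image
missing_share = toy["image"].isna().groupby(toy["category1"]).mean() * 100

print(missing_share.sort_values(ascending=False))
```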

We could also continue to organize by each column to see how Ingenium further classified their artifacts, and if the labels given reflect anything.  This would help us understand how certain artifacts are labeled, and consider if there may be a better way to do this that would better reflect what is being archived. 

Users may also want to compare the Ingenium data with data from other GLAM institutions that relate to science and technology to observe any differences in how they digitize their material.  Perhaps Ingenium does not have the resources that other museums have to digitize certain artifacts, or vice versa.  Nonetheless, the comparisons that could be made with any aspect of this data can help us see what we should focus on in further digitization efforts.  If there are entire categories that cannot be digitized, we should be wondering why and what could be done instead to preserve them in an accessible format.  Does this mean only artifacts that can be digitized can be preserved, or are there other ways of preserving them that we have not discovered yet?  


## References 
- Notebook inspired by the notebook ["A random item from Museums Victoria's collections!"](https://glam-workbench.net/museumsvictoria/#a-random-item-from-museums-victorias-collections) from Tim Sherratt's [GLAM Workbench](https://glam-workbench.net/)
- [Open data for Ingenium collections](https://ingeniumcanada.org/collection-research/open-data) 
- Referenced Melanie Walsh's [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html) for Python instruction

## Further Readings 

Bivens, Joy, and Ben Garcia, Porchia Moore, nikhil trivedi, Aletheia Wittman. ‘Collections: How We Hold the Stuff We Hold in Trust’ in _MASSAction, Museums As Site for Social Action_, toolkit, 125-139.  https://static1.squarespace.com/static/58fa685dff7c50f78be5f2b2/t/59dcdd27e5dd5b5a1b51d9d8/1507646780650/TOOLKIT_10_2017.pdf 

Houghton, Bernadette.  "Preservation Challenges in the Digital Age." _D-Lib Magazine_.  July/August 2016.  http://www.dlib.org/dlib/july16/houghton/07houghton.html

Kelly, Linda.  "The (post) digital visitor: What has (almost) twenty years of museum audience research revealed?" Museums and the Web.  https://mw2016.museumsandtheweb.com/paper/the-post-digital-visitor-what-has-almost-20-years-of-museum-audience-research-revealed/

Lincoln, Matthew D. "Some problems with GLAM data on GitHub" _Matthew Lincoln, PhD_ (blog). https://matthewlincoln.net/2016/01/06/some-problems-with-glam-data-on-github.html

saywhatnathan.  "Why do we collect?" _Archival Decolonist_ (blog). https://archivaldecolonist.com/2018/08/18/why-do-we-collect/ 

Wong, Amelia.  "The whole story, and then some: ‘digital storytelling’ in evolving museum practice." Museums and the Web. https://mw2015.museumsandtheweb.com/paper/the-whole-story-and-then-some-digital-storytelling-in-evolving-museum-practice/
