# Structuring a complex object OLR with Python
For simple objects, building an OLR is straightforward: running `ls > filelist.txt` gives you a file listing for a directory, which you can paste into your OLR. But what about complex objects? And what if a batch mixes simple and complex objects?

That's where `pyobstruct.py` comes in. It asks the user for the absolute path of the files on staging, parses the listing into a DataFrame, then writes it out as a CSV.

Although it "works" now, `pyobstruct` could use improvement. The files come out in an unpredictable order, so that e.g. `b100291089_2.tif` comes a line before `b100291089_1.tif`. It also does not print the single "object" row that carries object-level information. This notebook tinkers with the script to fix both problems.

In [2]:
import pandas as pd
import sys
import os
import numpy as np

Let's isolate the first part of the script, which seems to be pretty solid. It takes the absolute path of the directory and reports some useful information, such as how many objects there are.

In [3]:
user_path = input("Enter the absolute path of the directory containing all objects: ")
os.chdir(user_path)
folders = [name for name in os.listdir(".") if os.path.isdir(name)]

# Make a list of the files within the folders
files = []
for f in folders:
    # os.listdir returns the names in arbitrary order
    files.append(os.listdir(f))

# Remove any Thumbs.db files in our list
for file in files:
    while 'Thumbs.db' in file: file.remove('Thumbs.db')    
    
# Give the user some textual output of the files, then report total number of objects
print("Here is a sample of the files:")
print('\n',files[0:15])
print('\n')
print("There are",len(files),"total objects")

Enter the absolute path of the directory containing all objects:  /mnt/digital-staging/Mexican-Broadsides/batch3/Working_Files


Here is a sample of the files:

 [['b100291089_2.tif', 'b100291089_1.tif'], ['b10343608x_1.tif'], ['b9675820x_1.tif'], ['b103565759_1.tif', 'b103565759_2.tif'], ['b103720285_1.tif'], ['b103436133_1.tif'], ['b100710001_1.tif'], ['b103785012_1.tif', 'b103785012_2.tif'], ['b10065731x_1.tif'], ['b103784457_1.tif', 'b103784457_2.tif'], ['b103719593_1.tif'], ['b100311556_1.tif'], ['b103836895_2.tif', 'b103836895_1.tif'], ['b100753668_1.tif'], ['b103722154_2.tif', 'b103722154_1.tif']]


There are 111 total objects
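
The listing step can also be written without `os.chdir`, which avoids changing the working directory for the rest of the session. The following is only a sketch under assumptions: the function name `list_objects`, the `IGNORED` set, and the throwaway folder and file names are all made up for illustration, and the files are kept in whatever order the filesystem returns them, just like the cell above.

```python
import tempfile
from pathlib import Path

IGNORED = {"Thumbs.db"}  # assumption: the only junk file we need to skip


def list_objects(root):
    """Map each sub-folder name to the list of file names inside it."""
    listing = {}
    for folder in Path(root).iterdir():
        if folder.is_dir():
            listing[folder.name] = [
                f.name for f in folder.iterdir()
                if f.is_file() and f.name not in IGNORED
            ]
    return listing


# Tiny demonstration on a temporary directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("bib_a/bib_a_2.tif", "bib_a/bib_a_1.tif",
                 "bib_a/Thumbs.db", "bib_b/bib_b_1.tif"):
        path = Path(tmp, name)
        path.parent.mkdir(exist_ok=True)
        path.touch()
    demo_listing = list_objects(tmp)

print(demo_listing)
print("There are", len(demo_listing), "total objects")
```

Returning a dictionary keyed by folder name also keeps each file list attached to its folder, so nothing depends on two separate lists staying in step.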


As you can see, in several objects the second TIFF (`_2`) is listed before the first (`_1`). `os.listdir` makes no promise about order, so let's wrap each listing in `sorted()`.

In [4]:
user_path = input("Enter the absolute path of the directory containing all objects: ")
os.chdir(user_path)
folders = [name for name in os.listdir(".") if os.path.isdir(name)]

# Make a list of the files within the folders
files = []
for f in folders:
    # sorted() returns a new, alphabetically ordered list
    files.append(sorted(os.listdir(f)))

# Remove any Thumbs.db files in our list
# (the lists are already sorted, and removing items keeps their order)
for file in files:
    while 'Thumbs.db' in file: file.remove('Thumbs.db')
    

# Give the user some textual output of the files, then report total number of objects
print("Here is a sample of the files:")
print('\n',files[0:15])
print('\n')
print("There are",len(files),"total objects")

Enter the absolute path of the directory containing all objects:  /mnt/digital-staging/Mexican-Broadsides/batch3/Working_Files


Here is a sample of the files:

 [['b100291089_1.tif', 'b100291089_2.tif'], ['b10343608x_1.tif'], ['b9675820x_1.tif'], ['b103565759_1.tif', 'b103565759_2.tif'], ['b103720285_1.tif'], ['b103436133_1.tif'], ['b100710001_1.tif'], ['b103785012_1.tif', 'b103785012_2.tif'], ['b10065731x_1.tif'], ['b103784457_1.tif', 'b103784457_2.tif'], ['b103719593_1.tif'], ['b100311556_1.tif'], ['b103836895_1.tif', 'b103836895_2.tif'], ['b100753668_1.tif'], ['b103722154_1.tif', 'b103722154_2.tif']]


There are 111 total objects


Nice, now the files appear in order.
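
One caveat: `sorted()` compares strings character by character, so an object with ten or more files would list `_10` before `_2`. None of the objects in this batch are that large, but a "natural" sort key is a possible safeguard. The helper `natural_key` and the page numbers below are hypothetical, shown only as a sketch.

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]


# Made-up page numbers for illustration
pages = ["b100291089_10.tif", "b100291089_2.tif", "b100291089_1.tif"]

plain = sorted(pages)                      # string order: _1, _10, _2
natural = sorted(pages, key=natural_key)   # numeric order: _1, _2, _10

print(plain)
print(natural)
```

To try it in the cell below, `sorted(os.listdir(f))` would become `sorted(os.listdir(f), key=natural_key)`.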

In [5]:
user_path = input("Enter the absolute path of the directory containing all objects: ")
os.chdir(user_path)
folders = [name for name in os.listdir(".") if os.path.isdir(name)]

# Make a list of the files within the folders
files = []
for f in folders:
    files.append(sorted(os.listdir(f)))

# Remove any Thumbs.db files in our list
for file in files:
    while 'Thumbs.db' in file: file.remove('Thumbs.db')    
    
# Give the user some textual output of the files, then report total number of objects
print("Here is a sample of the files:")
print('\n',files[0:15])
print('\n')
print("There are",len(files),"total objects")

# Make a dictionary where the key is the bib/folder, and the value is one or more files
listing_dict = dict(zip(folders, files))

# Make a dictionary of a dataFrame using the above dictionary 
#dict_of_df = {k: pd.DataFrame(v) for k,v in listing_dict.items()}

# Make THAT a dataFrame
#df = pd.concat(dict_of_df, names=['bib','filename'])

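# One row per bib, one column per file slot.
# Note: this assumes no object has more than two files.
df = pd.DataFrame.from_dict(listing_dict, orient='index', columns=['file_1', 'file_2'])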

# Give the user a sample of the dataFrame
print(df[0:50])


Enter the absolute path of the directory containing all objects:  /mnt/digital-staging/Mexican-Broadsides/batch3/Working_Files


Here is a sample of the files:

 [['b100291089_1.tif', 'b100291089_2.tif'], ['b10343608x_1.tif'], ['b9675820x_1.tif'], ['b103565759_1.tif', 'b103565759_2.tif'], ['b103720285_1.tif'], ['b103436133_1.tif'], ['b100710001_1.tif'], ['b103785012_1.tif', 'b103785012_2.tif'], ['b10065731x_1.tif'], ['b103784457_1.tif', 'b103784457_2.tif'], ['b103719593_1.tif'], ['b100311556_1.tif'], ['b103836895_1.tif', 'b103836895_2.tif'], ['b100753668_1.tif'], ['b103722154_1.tif', 'b103722154_2.tif']]


There are 111 total objects
                      file_1            file_2
b100291089  b100291089_1.tif  b100291089_2.tif
b10343608x  b10343608x_1.tif              None
b9675820x    b9675820x_1.tif              None
b103565759  b103565759_1.tif  b103565759_2.tif
b103720285  b103720285_1.tif              None
b103436133  b103436133_1.tif              None
b100710001  b100710001_1.tif              None
b103785012  b103785012_1.tif  b103785012_2.tif
b10065731x  b10065731x_1.tif              None
b103784457  b1037
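
The `columns=['file_1', 'file_2']` argument hard-codes two file slots, and pandas raises an error if any folder holds more files than there are column names. A possible way around that, sketched here with a made-up listing, is to generate the column names from the largest object.

```python
import pandas as pd

# Made-up listing; the third object has three files.
listing_dict = {
    "bib_a": ["bib_a_1.tif", "bib_a_2.tif"],
    "bib_b": ["bib_b_1.tif"],
    "bib_c": ["bib_c_1.tif", "bib_c_2.tif", "bib_c_3.tif"],
}

# One column per file slot, sized to the object with the most files.
widest = max(len(names) for names in listing_dict.values())
columns = [f"file_{i}" for i in range(1, widest + 1)]

df = pd.DataFrame.from_dict(listing_dict, orient="index", columns=columns)
print(df)
```

Shorter lists are padded with missing values, just as the single-file objects are in the output above.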

In [6]:
df2 = df.stack()
df2

b100291089  file_1    b100291089_1.tif
            file_2    b100291089_2.tif
b10343608x  file_1    b10343608x_1.tif
b9675820x   file_1     b9675820x_1.tif
b103565759  file_1    b103565759_1.tif
            file_2    b103565759_2.tif
b103720285  file_1    b103720285_1.tif
b103436133  file_1    b103436133_1.tif
b100710001  file_1    b100710001_1.tif
b103785012  file_1    b103785012_1.tif
            file_2    b103785012_2.tif
b10065731x  file_1    b10065731x_1.tif
b103784457  file_1    b103784457_1.tif
            file_2    b103784457_2.tif
b103719593  file_1    b103719593_1.tif
b100311556  file_1    b100311556_1.tif
b103836895  file_1    b103836895_1.tif
            file_2    b103836895_2.tif
b100753668  file_1    b100753668_1.tif
b103722154  file_1    b103722154_1.tif
            file_2    b103722154_2.tif
b103719829  file_1    b103719829_1.tif
b103503584  file_1    b103503584_1.tif
b103831137  file_1    b103831137_1.tif
b103618685  file_1    b103618685_1.tif
            file_2    b10

We'll have to reset the index so that the bib number is repeated on every row ("filled down") instead of appearing only on the first row of each group.

In [7]:
df2 = df2.reset_index()
df2.index.name = None
print(df2)

        level_0 level_1                 0
0    b100291089  file_1  b100291089_1.tif
1    b100291089  file_2  b100291089_2.tif
2    b10343608x  file_1  b10343608x_1.tif
3     b9675820x  file_1   b9675820x_1.tif
4    b103565759  file_1  b103565759_1.tif
5    b103565759  file_2  b103565759_2.tif
6    b103720285  file_1  b103720285_1.tif
7    b103436133  file_1  b103436133_1.tif
8    b100710001  file_1  b100710001_1.tif
9    b103785012  file_1  b103785012_1.tif
10   b103785012  file_2  b103785012_2.tif
11   b10065731x  file_1  b10065731x_1.tif
12   b103784457  file_1  b103784457_1.tif
13   b103784457  file_2  b103784457_2.tif
14   b103719593  file_1  b103719593_1.tif
15   b100311556  file_1  b100311556_1.tif
16   b103836895  file_1  b103836895_1.tif
17   b103836895  file_2  b103836895_2.tif
18   b100753668  file_1  b100753668_1.tif
19   b103722154  file_1  b103722154_1.tif
20   b103722154  file_2  b103722154_2.tif
21   b103719829  file_1  b103719829_1.tif
22   b103503584  file_1  b10350358
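
The default column names `level_0`, `level_1` and `0` are not very descriptive. As a sketch, the stack and reset steps can be chained and the columns named along the way. The names `bib`, `slot` and `filename` are hypothetical, and the rest of this notebook keeps using the default names. The explicit `dropna()` is an assumption-proofing step, since whether `stack()` itself drops empty slots may depend on the pandas version.

```python
import pandas as pd

# Small stand-in for the wide DataFrame built above (made-up names).
wide = pd.DataFrame(
    {"file_1": ["bib_a_1.tif", "bib_b_1.tif"],
     "file_2": ["bib_a_2.tif", None]},
    index=["bib_a", "bib_b"],
)

long_df = (
    wide.stack()                        # one row per (bib, file slot)
        .dropna()                       # discard empty slots explicitly
        .rename_axis(["bib", "slot"])   # label the two index levels
        .reset_index(name="filename")   # turn them into ordinary columns
)
print(long_df)
```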

Next, we want a DataFrame of the 'dupe' ARKs, i.e. the `level_0` values that occur on more than one row. An identifier with more than one file is a complex object. First, let's flag the repeats.

In [8]:
df_duplicates = df2.duplicated(subset=['level_0'])
df_duplicates

0      False
1       True
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10      True
11     False
12     False
13      True
14     False
15     False
16     False
17      True
18     False
19     False
20      True
21     False
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
124     True
125    False
126     True
127    False
128    False
129    False
130     True
131    False
132    False
133     True
134    False
135     True
136    False
137     True
138    False
139     True
140    False
141    False
142    False
143    False
144     True
145    False
146    False
147    False
148     True
149    False
150    False
151    False
152    False
153    False
Length: 154, dtype: bool
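
Note that `duplicated()` leaves the first occurrence of each value marked `False` by default. A small sketch with made-up identifiers shows the difference between that default and `keep=False`, which flags every row of a repeated value.

```python
import pandas as pd

bibs = pd.Series(["bib_a", "bib_a", "bib_b", "bib_c", "bib_c"], name="level_0")

later_copies = bibs.duplicated()            # default keep="first"
all_members = bibs.duplicated(keep=False)   # every row of a repeated bib

print(later_copies.tolist())   # [False, True, False, False, True]
print(all_members.tolist())    # [True, True, False, True, True]

# The identifiers of the complex objects, each listed once
complex_bibs = bibs[all_members].unique()
print(complex_bibs)
```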

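`duplicated()` only marks the second and later occurrences, so to pull out every row that belongs to a complex object we select all rows whose `level_0` value appears among the flagged ones.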

In [9]:
arks = df2["level_0"]
dupe_df = df2[arks.isin(arks[arks.duplicated()])]

In [10]:
dupe_arks = dupe_df["level_0"]
dupe_arks

0      b100291089
1      b100291089
4      b103565759
5      b103565759
9      b103785012
10     b103785012
12     b103784457
13     b103784457
16     b103836895
17     b103836895
19     b103722154
20     b103722154
24     b103618685
25     b103618685
29     b103506421
30     b103506421
38     b103831034
39     b103831034
43     b100751726
44     b100751726
46     b103618454
47     b103618454
51     b103609441
52     b103609441
53     b100659615
54     b100659615
55     b103705545
56     b103705545
58     b100294753
59     b100294753
          ...    
101    b103785954
102    b103785954
110    b103504497
111    b103504497
113    b103833493
114    b103833493
115    b100532032
116    b100532032
118     b9672979x
119     b9672979x
121    b100295915
122    b100295915
123    b103722683
124    b103722683
125    b103618636
126    b103618636
129    b103585242
130    b103585242
132    b100757698
133    b100757698
134    b100311647
135    b100311647
136    b100525519
137    b100525519
138     b9

In [11]:
dupe_arks = dupe_arks.unique()
dupe_arks

array(['b100291089', 'b103565759', 'b103785012', 'b103784457',
       'b103836895', 'b103722154', 'b103618685', 'b103506421',
       'b103831034', 'b100751726', 'b103618454', 'b103609441',
       'b100659615', 'b103705545', 'b100294753', 'b100713671',
       'b100757054', 'b100531088', 'b103786387', 'b103722634',
       'b103568414', 'b103705429', 'b100294790', 'b100313723',
       'b100294935', 'b100310758', 'b103836512', 'b100489114',
       'b103785954', 'b103504497', 'b103833493', 'b100532032',
       'b9672979x', 'b100295915', 'b103722683', 'b103618636',
       'b103585242', 'b100757698', 'b100311647', 'b100525519',
       'b96726647', 'b100532263', 'b103609192'], dtype=object)

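Now we can make a DataFrame with one row per dupe ARK. These rows will become the object-level header rows. (Note that this reuses the name `dupe_df`.)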

In [16]:
dupe_df = pd.DataFrame({'level_0':dupe_arks})
dupe_df

      level_0
0  b100291089
1  b103565759
2  b103785012
3  b103784457
4  b103836895
5  b103722154
6  b103618685
7  b103506421
8  b103831034
9  b100751726


Now we can concatenate the two DataFrames. The header rows only have a `level_0` value, so their other columns are filled with NaN.

In [63]:
final_df = pd.concat([df2, dupe_df], sort=True, ignore_index=True)

In [70]:
final_df

      level_0 level_1                 0
0  b100291089  file_1  b100291089_1.tif
1  b100291089  file_2  b100291089_2.tif
2  b10343608x  file_1  b10343608x_1.tif
3   b9675820x  file_1   b9675820x_1.tif
4  b103565759  file_1  b103565759_1.tif
5  b103565759  file_2  b103565759_2.tif
6  b103720285  file_1  b103720285_1.tif
7  b103436133  file_1  b103436133_1.tif
8  b100710001  file_1  b100710001_1.tif
9  b103785012  file_1  b103785012_1.tif


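Finally, we can sort the DataFrame so that each complex object's empty header row comes first, followed by its files. `na_position='first'` is what puts the header row (whose `level_1` is missing) at the top of its group.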

In [72]:
final_df = final_df.sort_values(by=['level_0','level_1'], na_position='first')
final_df

        level_0 level_1                 0
154  b100291089     NaN               NaN
0    b100291089  file_1  b100291089_1.tif
1    b100291089  file_2  b100291089_2.tif
168  b100294753     NaN               NaN
58   b100294753  file_1  b100294753_1.tif
59   b100294753  file_2  b100294753_2.tif
176  b100294790     NaN               NaN
84   b100294790  file_1  b100294790_1.tif
85   b100294790  file_2  b100294790_2.tif
178  b100294935     NaN               NaN
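
The header-row steps (find the repeated identifiers, build one empty row for each, concatenate, sort) can be seen end to end on a toy example. Everything below uses made-up identifiers, and the column name `filename` stands in for the `0` column. It is a sketch of the same idea, not a replacement for the cells above.

```python
import pandas as pd

rows = pd.DataFrame({
    "level_0": ["bib_b", "bib_a", "bib_a"],
    "level_1": ["file_1", "file_2", "file_1"],
    "filename": ["bib_b_1.tif", "bib_a_2.tif", "bib_a_1.tif"],
})

# One header row per identifier that appears more than once (a complex object).
repeated = rows["level_0"].duplicated(keep=False)
headers = pd.DataFrame({"level_0": rows.loc[repeated, "level_0"].unique()})

# Header rows have no level_1 or filename, so those cells become NaN.
combined = pd.concat([rows, headers], ignore_index=True)

# Group by identifier; the NaN in level_1 puts each header row first.
combined = combined.sort_values(
    by=["level_0", "level_1"], na_position="first"
).reset_index(drop=True)
print(combined)
```

Here `bib_a` gets a header row followed by its two files, while the simple object `bib_b` keeps its single file row and no header.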


In [76]:
# Uncomment to print out!
# final_df.to_csv('~/Documents/md_structured.csv')
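
By default `to_csv` also writes the DataFrame index as an extra, unnamed first column. If that column is not wanted in the OLR, `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of a real file, and the column names `bib`, `slot` and `filename` are hypothetical placeholders for whatever the OLR template expects.

```python
import io

import pandas as pd

# Small stand-in for final_df (made-up identifiers).
final_df = pd.DataFrame({
    "level_0": ["bib_a", "bib_a", "bib_a"],
    "level_1": [None, "file_1", "file_2"],
    0: [None, "bib_a_1.tif", "bib_a_2.tif"],
})

# Hypothetical, more readable column names for the export
export = final_df.rename(columns={"level_0": "bib", "level_1": "slot", 0: "filename"})

buffer = io.StringIO()              # stands in for a real file path
export.to_csv(buffer, index=False)  # index=False drops the row-number column
csv_lines = buffer.getvalue().splitlines()
print("\n".join(csv_lines))
```

Missing values are written as empty fields, so the header row of a complex object comes out as the identifier followed by empty cells.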