# DickensAssignmentValidator

MUCEP Task 1 (Dr. Pierre-Paul Bitton)

Author: Shawon Ibn Kamal\
Email: sikamal@mun.ca

### Updates to exisiting program

I made a few changes in the existing curator program to work with it efficiently and sorted out a few bugs. Here's the list:

- The files getting read from is renamed to "DataFiles" from "Files".
- The output csv files are being stored in a folder named "OutputFiles".
- Renamed "NotMatched.csv" to "MissingMeta.csv" in order to avoid confusion with "MissingFiles.csv".
- Stored the program in git, currently a private repo to me. I think it is a good way to track updates,\
  we can work on it if you are interested
- Fixed few minor bugs in DickensAssignment.py program


In [107]:
import pandas as pd
from Levenshtein import distance # pip install python-Levenshtein

### Run DickensAssignment.py



In [108]:
exec(open('DickensAssignment.py').read())

4905 no. of files
4093 match found
812 match not found
Complete


### Compare OutputFiles with OutputFiles_2020_07_14

In [115]:
# Load old outputs
df_old_result = pd.read_csv('OutputFiles_2020_07_14/Result.csv', engine='python')
df_old_missing_files = pd.read_csv('OutputFiles_2020_07_14/MissingFiles.csv', engine='python')
df_old_not_matched_files = pd.read_csv('OutputFiles_2020_07_14/MissingMeta.csv', engine='python')

# Load new outputs
df_new_result = pd.read_csv('OutputFiles_2020_07_14/Result.csv', engine='python')
df_new_missing_files = pd.read_csv('OutputFiles_2020_07_14/MissingFiles.csv', engine='python')
df_new_missing_meta = pd.read_csv('OutputFiles_2020_07_14/MissingMeta.csv', engine='python')

df_new_result = df_new_result.sort_values(by='FileName')
df_new_result

Unnamed: 0,FileName,institutionCode,collectionCode,catalogueNumber,class,order,family,genus,specificEpithet,infraspecificEpithet,...,verbatimElevation,eventDate,measurementDeterminedDate,Patch,LightAngle1,LightAngle2,ProbeAngle1,ProbeAngle2,Replicate,Comments
2553,AM.H.AMNH278606.00000001,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,Head,0,0,0,0,1,
2554,AM.H.AMNH278606.00000002,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,Head,0,0,0,0,2,
2555,AM.H.AMNH278606.00000003,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,Head,0,0,0,0,3,
2556,AM.H.AMNH278606.00000004,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,Head,0,0,0,0,4,
2557,AM.H.AMNH278606.00000005,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,Head,0,0,0,0,5,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
845,TE.U.USNM605260.00000001,USNM,,605260,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1987-4-3,,Chest,0,0,0,0,1,
846,TE.U.USNM605260.00000002,USNM,,605260,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1987-4-3,,Chest,0,0,0,0,2,
847,TE.U.USNM605260.00000003,USNM,,605260,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1987-4-3,,Chest,0,0,0,0,3,
848,TE.U.USNM605260.00000004,USNM,,605260,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1987-4-3,,Chest,0,0,0,0,4,


In [116]:
df_diff_result = pd.concat([df_old_result,df_new_result]).drop_duplicates(keep=False)
df_diff_missing_files = pd.concat([df_old_missing_files,df_new_missing_files]).drop_duplicates(keep=False)
df_diff_not_matched_files = pd.concat([df_old_not_matched_files,df_new_not_matched_files]).drop_duplicates(keep=False)

if (df_diff_result.size == 0):
    print("Results are the same")
else:
    print("Results have ", df_diff_result.size, " differences")
    
if (df_diff_missing_files.size == 0):
    print("MissingFiles are the same")
else:
    print("MissingFiles hav ", df_diff_missing_files.size, " differences")

if (df_new_missing_meta.size == 0):
    print("NotMatchedFiles are the same")
else:
    print("NotMatchedFiles have ", df_new_missing_meta.size, " differences")


Results are the same
MissingFiles are the same
NotMatchedFiles have  812  differences


### Check to see if MissingMetaData entries are due to typo

In [138]:
# returns true if two strings are same or different by 1 letter
def similar(s1, s2):
    lengthCheck = ((len(s1) == len(s2)+1) or (len(s1) + 1 == len(s2)) or (len(s1) == len(s2)))
#     print(distance(s1, s2))
#     print(lengthCheck)
    return distance(s1, s2) <= 2 and lengthCheck

def includes(fullstring, substrings=[]):
    count = 0
    for each_substring in substrings:
        if fullstring.find(each_substring) != -1:
            count += 1
    return count

# Testing
print(similar("Hello", "Helyo"))
print(similar("Hello", "Hello"))

print(includes("I like data", ["like", "data"]))

True
True
2


In [127]:
# copy the filenames from the directory
filenames = [name for path, subdirs, files in os.walk("DataFiles")
             for name in files]

df_data_files = pd.DataFrame({'filename':filenames}).sort_values(by='filename')

df_template = pd.read_csv('template.csv', engine='python')

Unnamed: 0,FileName,institutionCode,collectionCode,catalogueNumber,class,order,family,genus,specificEpithet,infraspecificEpithet,...,verbatimElevation,eventDate,measurementDeterminedDate,Patch,LightAngle1,LightAngle2,ProbeAngle1,ProbeAngle2,Replicate,Comments
0,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,2012-7-20,,,0,0,0,0,,
1,,MZUSP,,97287,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,2013-6-26,,,0,0,0,0,,
2,,MZUSP,,76792,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,2007-1-20,,,0,0,0,0,,
3,,MZUSP,,86474,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,2009-7-16,,,0,0,0,0,,
4,,AMNH,,278606,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,1930-11-15,,,0,0,0,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,,LSUMNS,,164077,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1997-3-23,,,0,0,0,0,,
223,,USNM,,461912,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,1957-2-14,,,0,0,0,0,,
224,,USNM,,145488,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,,,,0,0,0,0,,
225,,FMNH,,282539,Aves,Trogoniformes,Trogonidae,Trogon,rufus,tenellus,...,,,,,0,0,0,0,,


In [135]:
df_template['key'] = 0
df_new_missing_meta['key'] = 0
merged_data_files_and_missing_data = df_template.merge(df_new_missing_meta, how='outer')
merged_data_files_and_missing_data.head(50)

Unnamed: 0,FileName,institutionCode,collectionCode,catalogueNumber,class,order,family,genus,specificEpithet,infraspecificEpithet,...,measurementDeterminedDate,Patch,LightAngle1,LightAngle2,ProbeAngle1,ProbeAngle2,Replicate,Comments,key,notmatched
0,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180686.00000001.Master.Transmission
1,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180686.00000002.Master.Transmission
2,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180686.00000003.Master.Transmission
3,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180686.00000004.Master.Transmission
4,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180686.00000005.Master.Transmission
5,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180687.00000001.Master.Transmission
6,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180687.00000002.Master.Transmission
7,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180687.00000003.Master.Transmission
8,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180687.00000004.Master.Transmission
9,,MZUSP,,95838,Aves,Trogoniformes,Trogonidae,Trogon,rufus,amazonicus,...,,,0,0,0,0,,,0,TE.F.B.LSU180687.00000005.Master.Transmission


In [147]:
merged_data_files_and_missing_data['similarity'] = merged_data_files_and_missing_data.apply(lambda row : includes(row['notmatched'], [str(row['catalogueNumber']), row['institutionCode']]), axis=1)

In [151]:
merged_data_files_and_missing_data = merged_data_files_and_missing_data[['institutionCode', 'catalogueNumber', 'notmatched', 'similarity']][merged_data_files_and_missing_data['similarity'] > 1].sort_values(by='similarity', ascending=False)
print(merged_data_files_and_missing_data.size)
merged_data_files_and_missing_data.head(50)

108


Unnamed: 0,institutionCode,catalogueNumber,notmatched,similarity
23144,CM,72696,AM.M.CM972696.00000002.Master.Transmission,2
23165,CM,72696,AM.U.CM972696.00000003.Master.Transmission,2
43376,MNRJ,4359,CH.R.MNRJ44359.00000002.csv,2
23143,CM,72696,AM.M.CM972696.00000001.Master.Transmission,2
43377,MNRJ,4359,CH.R.MNRJ44359.00000005.csv,2
23145,CM,72696,AM.M.CM972696.00000003.Master.Transmission,2
23146,CM,72696,AM.M.CM972696.00000004.Master.Transmission,2
23147,CM,72696,AM.M.CM972696.00000005.Master.Transmission,2
23149,CM,72696,AM.R.CM972696.00000002.Master.Transmission,2
23156,CM,72696,AM.S.CM972696.00000004.Master.Transmission,2
