There are 3 tables in the database that contain general information about the cats:
- IndividualAnimalRecordsTZ_ZV_WA_SA_MG_DM
- FullTaxonList IndividualAnimalRecordsTZ_NAT_WA
- PrevalenceIncludedFelidsTZ_ZV_WA_SA_MG_DM

We would like to combine these into one unified list of cats whose general data can then be joined to blood data.  Because some cats appear in more than one table, we merge the records using combine_first, so that a value missing in one table is filled in from another.  Each table has a different set of columns; here we keep only the columns that appear in all 3 tables.

The final unified animal list contains 703 cats and is at:  
/processed_data/unified_animal_list.csv
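
As a minimal sketch of what combine_first does, the toy example below merges two tiny frames indexed by ARKS Number. The ids and values are invented for illustration; only the column names are borrowed from the tables.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two of the tables; ids and values are invented.
left = pd.DataFrame(
    {"Sex": ["F", np.nan], "Zoo Location": [np.nan, "WA"]},
    index=pd.Index(["A1", "A2"], name="ARKS Number"),
)
right = pd.DataFrame(
    {"Sex": ["M", "M", "F"], "Zoo Location": ["TZ", "SA", "DM"]},
    index=pd.Index(["A1", "A2", "A3"], name="ARKS Number"),
)

# Values in `left` win; `right` only fills the gaps and contributes
# rows whose index is missing from `left`.
merged = left.combine_first(right)
print(merged)
```

Note that where both frames hold a non-null value (Sex for A1), the calling frame's value is kept, so the order of the tables decides which one is trusted.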

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [2]:
def intersection(lst1, lst2): 
    # returns a list of the intersection of 2 lists
    return list(set(lst1) & set(lst2)) 

def intersection_of3(lst1, lst2, lst3): 
    # returns a list of the intersection of 3 lists
    return list(set(lst1) & set(lst2) & set(lst3)) 

In [3]:
def Diff(li1, li2): 
    # returns the difference between 2 lists
    return (list(set(li1) - set(li2))) 

In [4]:
# read 3 files with similar information

data_dir = '../Data/big_cats/Access_DB_table_exports/'
tab1 = pd.read_excel(data_dir + 'IndividualAnimalRecordsTZ_ZV_WA_SA_MG_DM.xlsx')
tab1_arks = list(tab1['ARKS Number'].unique())

tab2 = pd.read_excel(data_dir + 'FullTaxonList IndividualAnimalRecordsTZ_NAT_WA.xlsx')
tab2_arks = list(tab2['ARKS Number'].unique())

tab3 = pd.read_excel(data_dir + 'PrevalenceIncludedFelidsTZ_ZV_WA_SA_MG_DM.xlsx')
tab3_arks = list(tab3['ARKS Number'].unique())

print('tab1 has this many unique arks: ' + str(len(tab1_arks)))
print('tab2 has this many unique arks: ' + str(len(tab2_arks)))
print('tab3 has this many unique arks: ' + str(len(tab3_arks)))

tab1 has this many unique arks: 490
tab2 has this many unique arks: 648
tab3 has this many unique arks: 424


In [5]:
overlap1_2 = len(intersection(tab1_arks, tab2_arks))
print("there are " + str(overlap1_2) + " overlapping records between 1 and 2")

overlap2_3 = len(intersection(tab2_arks, tab3_arks))
print("there are " + str(overlap2_3) + " overlapping records between 2 and 3")

overlap1_3 = len(intersection(tab1_arks, tab3_arks))
print("there are " + str(overlap1_3) + " overlapping records between 1 and 3")

there are 439 overlapping records between 1 and 2
there are 384 overlapping records between 2 and 3
there are 419 overlapping records between 1 and 3


In [6]:
print('this many arks appear in all 3 tables: ' + str(len(intersection_of3(tab1_arks, tab2_arks, tab3_arks))))

this many arks appear in all 3 tables: 383


In [7]:
all_arks = tab1_arks + tab2_arks + tab3_arks
print('total unique arks: ' + str(len(set(all_arks))))

total unique arks: 703
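
As a cross-check, the total should agree with the inclusion-exclusion formula applied to the counts printed above. The sketch below just redoes that arithmetic; the variable names are ours.

```python
# Counts copied from the printed output above.
n1, n2, n3 = 490, 648, 424       # unique ARKS numbers per table
n12, n23, n13 = 439, 384, 419    # pairwise overlaps
n123 = 383                       # present in all three tables

# Inclusion-exclusion: add the table sizes, subtract the pairwise
# overlaps, then add back the triple overlap.
total = n1 + n2 + n3 - n12 - n23 - n13 + n123
print(total)  # 703
```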


# see if any of the 3 tables have multiple entries for the same ARKS Number
## (they don't)

In [8]:
tab1['ARKS Number'].value_counts()

980508                     1
740008                     1
K0095                      1
A20124                     1
820013                     1
A20046                     1
A80262                     1
970005                     1
B20133                     1
B10353                     1
A40081                     1
960995                     1
B00026                     1
880015                     1
MGTiger7                   1
820014                     1
820029                     1
980202                     1
960095                     1
830014                     1
980123                     1
A70193                     1
A30022                     1
840016                     1
880045                     1
A70603                     1
A70071                     1
B10071                     1
810056                     1
990294                     1
730013                     1
A50018                     1
540002                     1
800014MZ                   1
A80000        

In [9]:
tab2['ARKS Number'].value_counts()

980508                                    1
810057                                    1
A50431                                    1
B20133                                    1
970005                                    1
810123                                    1
A80262                                    1
A20046                                    1
670005WA                                  1
820013                                    1
720008                                    1
A20124                                    1
A30021                                    1
990294                                    1
B29101                                    1
A90698                                    1
B30002                                    1
760107                                    1
980082                                    1
A10253                                    1
A40081                                    1
950062                                    1
980202                          

In [10]:
tab3['ARKS Number'].value_counts()

980508         1
A90698         1
730013         1
990294         1
K0095          1
A20124         1
820013         1
A20046         1
A80262         1
970005         1
B20133         1
B10353         1
A40081         1
950062         1
960995         1
800029         1
MGTiger7       1
820014         1
820029         1
980202         1
960095         1
830014         1
980123         1
A70193         1
A30022         1
B29101         1
B30002         1
B20032         1
980082         1
970620         1
A80000         1
750011TZ       1
910001         1
900248         1
A10081         1
MGSnLeopard    1
840002         1
911073         1
700022         1
810001         1
A30376         1
A50007         1
920534         1
930229         1
840026         1
A50305         1
840033         1
890027         1
A10012         1
B19036         1
950078         1
810057         1
790025         1
880045         1
A70603         1
A70071         1
A79193         1
A90017         1
A70070        
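
Reading long value_counts listings by eye is error-prone. A possible programmatic check, sketched on an invented miniature table (only the column name comes from the notebook), uses Series.is_unique and Series.duplicated:

```python
import pandas as pd

# Hypothetical miniature table with one repeated id.
toy = pd.DataFrame({"ARKS Number": ["980508", "K0095", "A20124", "K0095"]})

ids = toy["ARKS Number"]
print(ids.is_unique)  # False here, because 'K0095' occurs twice

# keep=False marks every occurrence of a repeated value.
dupes = ids[ids.duplicated(keep=False)].unique().tolist()
print(dupes)  # ['K0095']
```

Applied to the real tables, `tab1['ARKS Number'].is_unique` would be expected to return True for each of the three.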

# look at columns

### in tab2 rename 'Acute kidney disease' to 'Acute kidney failure'
The latter appears in tables 1 and 3.  We assume the two columns are equivalent.

In [11]:
tab2 = tab2.rename(columns={'Acute kidney disease': 'Acute kidney failure'})

### look at what's in common in all tables

In [12]:
cols1 = list(tab1.columns)
cols2 = list(tab2.columns)
cols3 = list(tab3.columns)

In [13]:
print('number of columns in table 1: ' + str(len(cols1)))
print('number of columns in table 2: ' + str(len(cols2)))
print('number of columns in table 3: ' + str(len(cols3)))

number of columns in table 1: 40
number of columns in table 2: 58
number of columns in table 3: 52


In [14]:
cols_in_common = intersection_of3(cols1, cols2, cols3)
cols_in_common

['Date of Death',
 'Species (common name)',
 'Transacted internationally',
 'Zoo Location',
 'GIT',
 'Dam ARKS Number',
 'Infection',
 'Neoplasia',
 'Date  of Birth',
 'ARKS Number',
 'Animal Name',
 'Neonatal',
 'Biochem data',
 'CKD - co-morbidity',
 'CKD - cause of mortality',
 'Respiratory/Cardiovascular',
 'Status (alive/deceased/transacted)',
 'Endocrine',
 'Hand-raised',
 'Unknown',
 'Housing (confined enclosure/open range)',
 'Pathology Report Numbers',
 'Sire ARKS Number',
 'International transaction',
 'Date of Transaction',
 'Sibling ARKS Number',
 'Urine data',
 'Neurological',
 'Trauma',
 'Sex',
 'Excluded from study?',
 'Comments:',
 'Age at Death',
 'Post Mortem Report Number',
 'Acute kidney failure',
 'Old Age debility',
 'Cause of Death']
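
Because the helper goes through set(), the column order above is arbitrary and can change between runs. If a stable order is wanted, one option is to keep the order of the first list. The helper below is a hypothetical alternative, not part of the notebook, shown on toy column lists:

```python
def ordered_intersection(first, *others):
    """Items of `first` that occur in every list in `others`, in `first`'s order."""
    other_sets = [set(o) for o in others]
    return [item for item in first if all(item in s for s in other_sets)]

# Toy column lists; in the notebook these would be cols1, cols2, cols3.
a = ["ARKS Number", "Animal Name", "Sex", "Renal diet?"]
b = ["Sex", "ARKS Number", "GAN ID", "Animal Name"]
c = ["Animal Name", "ARKS Number", "Sex", "Microchip Number"]

common = ordered_intersection(a, b, c)
print(common)  # ['ARKS Number', 'Animal Name', 'Sex']
```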

# see what we're missing if we only take cols in common

In [15]:
Diff(cols1, cols_in_common)

['Renal diet?', 'Age in years at death', 'Pyometra']

In [16]:
Diff(cols2, cols_in_common)

['Date of first low USG',
 'Date of USG less than 1,035',
 'Hand-raised1',
 'Date of Onset of Azoatemia',
 'Sibling ARKS Number1',
 'Included in study',
 'International transaction1',
 'Onset of azotaemia',
 'First report of CKD CS',
 'Validated:',
 'GAN ID',
 'Date of diagnosis',
 'Transacted internationally1',
 'Ante-mortem Dx of CKD',
 'Sire ARKS Number1',
 'Alternative ARKS Nos',
 'Studbook Number',
 'Dam ARKS Number1',
 'Comments re kidney function',
 'Pyometra',
 'Housing (confined enclosure/open range)1']

In [17]:
Diff(cols3, cols_in_common)

['Date of first low USG',
 'Alternative ARKS Nos',
 'Date of USG less than 1,035',
 'Studbook Number',
 'Date of diagnosis',
 'Comments re kidney function',
 'Date of Onset of Azoatemia',
 'Pyometra cause of death',
 'Ante-mortem Dx of CKD',
 'Pyometra (survived)',
 'Onset of azotaemia',
 'First report of CKD CS',
 'Age in years at death',
 'Microchip Number',
 'GAN ID']

# Combine first: a table at a time

In [18]:
# only keep columns in common
tab1 = tab1[cols_in_common]
tab2 = tab2[cols_in_common]
tab3 = tab3[cols_in_common]

In [19]:
# list of arks that appear in all 3 tables
arks_in_all3tables = intersection_of3(tab1_arks, tab2_arks, tab3_arks)

# list of arks that appear in 2 tables
arks_in_tab1_and_tab2 = intersection(tab1_arks, tab2_arks)
arks_in_tab1_and_tab3 = intersection(tab1_arks, tab3_arks)
arks_in_tab2_and_tab3 = intersection(tab2_arks, tab3_arks)

# TAB 1
tab1_arks_appearing_elsewhere = set(arks_in_all3tables + arks_in_tab1_and_tab2 + arks_in_tab1_and_tab3)
# tab 1 unique arks = tab 1 arks - arks in all 3 - arks in 1 and 2 - arks in 1 and 3
arks_unique_tab1 = Diff(list(tab1['ARKS Number'].unique()), tab1_arks_appearing_elsewhere)

# TAB 2
tab2_arks_appearing_elsewhere = set(arks_in_all3tables + arks_in_tab1_and_tab2 + arks_in_tab2_and_tab3)
arks_unique_tab2 = Diff(list(tab2['ARKS Number'].unique()), tab2_arks_appearing_elsewhere)

# TAB 3
tab3_arks_appearing_elsewhere = set(arks_in_all3tables + arks_in_tab1_and_tab3 + arks_in_tab2_and_tab3)
arks_unique_tab3 = Diff(list(tab3['ARKS Number'].unique()), tab3_arks_appearing_elsewhere)

In [20]:
print('number of arks which appear in all tables:  ' + str(len(arks_in_all3tables)))
print()
print('number of arks that only appear in tab 1:  ' + str(len(arks_unique_tab1)))
print('number of arks that only appear in tab 2:  ' + str(len(arks_unique_tab2)))
print('number of arks that only appear in tab 3:  ' + str(len(arks_unique_tab3)))

number of arks which appear in all tables:  383

number of arks that only appear in tab 1:  15
number of arks that only appear in tab 2:  208
number of arks that only appear in tab 3:  4
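
The same groups can also be written directly with set operators, which may be easier to audit. The sketch below uses toy id sets (the real inputs would be tab1_arks, tab2_arks and tab3_arks) and checks that the seven regions of the Venn diagram cover every id exactly once.

```python
# Toy id sets standing in for the three tables.
s1 = {"a", "b", "c", "d"}
s2 = {"b", "c", "e", "f"}
s3 = {"c", "d", "f", "g"}

only_1 = s1 - s2 - s3          # in table 1 and nowhere else
only_2 = s2 - s1 - s3
only_3 = s3 - s1 - s2
in_1_2_not_3 = (s1 & s2) - s3
in_1_3_not_2 = (s1 & s3) - s2
in_2_3_not_1 = (s2 & s3) - s1
in_all = s1 & s2 & s3

regions = [only_1, only_2, only_3,
           in_1_2_not_3, in_1_3_not_2, in_2_3_not_1, in_all]

# The regions are disjoint, so their sizes add up to the size of the union.
assert sum(len(r) for r in regions) == len(s1 | s2 | s3)
print(sorted(only_1), sorted(in_all))  # ['a'] ['c']
```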


In [21]:
# for the ARKS numbers that appear in ONLY one table ('unique' to table 1, 2 or 3),
# build a new master data frame; the merged records are added to it later

# stack the rows unique to tab 1, tab 2 and tab 3
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
master = pd.concat([
    tab1[tab1['ARKS Number'].isin(arks_unique_tab1)],
    tab2[tab2['ARKS Number'].isin(arks_unique_tab2)],
    tab3[tab3['ARKS Number'].isin(arks_unique_tab3)],
], sort=True)

In [22]:
# confirm the row count and that each arks number appears only once
assert master['ARKS Number'].is_unique
print('At this point there are ' + str(len(master)) + ' rows in master and each arks number appears only once.')

At this point there are 227 rows in master and each arks number appears only once.


## now take care of arks numbers that appear in tab 1 and tab 2 but not 3

In [23]:
arks_in_tab1_tab2_not_tab3 = Diff(arks_in_tab1_and_tab2, arks_in_all3tables)
print('number of arks in tab 1 and tab 2 but not tab 3: ' + str(len(arks_in_tab1_tab2_not_tab3)))

number of arks in tab 1 and tab 2 but not tab 3: 56


In [24]:
# temporary data frames of the overlapping arks records in tab 1 and table 2
temp1 = tab1[tab1['ARKS Number'].isin(arks_in_tab1_tab2_not_tab3)]
temp2 = tab2[tab2['ARKS Number'].isin(arks_in_tab1_tab2_not_tab3)]

In [25]:
# check
len ( set(list(temp1['ARKS Number'].unique()) + list(temp2['ARKS Number'].unique()) ) )

56

In [26]:
# set index to arks number (reassign rather than modify the filtered slice in place)
temp1 = temp1.set_index('ARKS Number')
temp2 = temp2.set_index('ARKS Number')

# now use combine first to fill in any null values
# "Update null elements with value in the same location in other." e.g. df.combine_first(df_other)
combined_df_1_2 = temp1.combine_first(temp2).reset_index()

# append to master (pd.concat replaces DataFrame.append, removed in pandas 2.0)
master = pd.concat([master, combined_df_1_2], sort=True)

print('at this point there are ' + str(len(master)) + ' rows in master and each arks number appears only once.')

at this point there are 283 rows in master and each arks number appears only once.


## now take care of arks numbers that appear in tab 1 and tab 3 but not 2

In [27]:
arks_in_tab1_tab3_not_tab2 = Diff(arks_in_tab1_and_tab3, arks_in_all3tables)
print('number of arks in tab 1 and tab 3 but not tab 2: ' + str(len(arks_in_tab1_tab3_not_tab2)))

number of arks in tab 1 and tab 3 but not tab 2: 36


In [28]:
# temporary data frames of the overlapping arks records in tab 1 and table 3
temp1 = tab1[tab1['ARKS Number'].isin(arks_in_tab1_tab3_not_tab2)]
temp3 = tab3[tab3['ARKS Number'].isin(arks_in_tab1_tab3_not_tab2)]

In [29]:
# check
len ( set(list(temp1['ARKS Number'].unique()) + list(temp3['ARKS Number'].unique()) ) )

36

In [30]:
# set index to arks number (reassign rather than modify the filtered slice in place)
temp1 = temp1.set_index('ARKS Number')
temp3 = temp3.set_index('ARKS Number')

# now use combine first to fill in any null values
combined_df_1_3 = temp1.combine_first(temp3).reset_index()

# append to master (pd.concat replaces DataFrame.append, removed in pandas 2.0)
master = pd.concat([master, combined_df_1_3], sort=True)

print('at this point there are ' + str(len(master)) + ' rows in master and each arks number appears only once.')

at this point there are 319 rows in master and each arks number appears only once.


## now take care of arks numbers that appear in tab 2 and tab 3 but not 1

In [31]:
arks_in_tab2_tab3_not_tab1 = Diff(arks_in_tab2_and_tab3, arks_in_all3tables)
print('number of arks in tab 2 and tab 3 but not tab 1: ' + str(len(arks_in_tab2_tab3_not_tab1)))

number of arks in tab 2 and tab 3 but not tab 1: 1


In [32]:
# temporary data frames of the overlapping arks records in tab 2 and table 3
temp2 = tab2[tab2['ARKS Number'].isin(arks_in_tab2_tab3_not_tab1)]
temp3 = tab3[tab3['ARKS Number'].isin(arks_in_tab2_tab3_not_tab1)]

In [33]:
# check
len ( set(list(temp2['ARKS Number'].unique()) + list(temp3['ARKS Number'].unique()) ) )

1

In [34]:
# set index to arks number (reassign rather than modify the filtered slice in place)
temp2 = temp2.set_index('ARKS Number')
temp3 = temp3.set_index('ARKS Number')

# now use combine first to fill in any null values
combined_df_2_3 = temp2.combine_first(temp3).reset_index()

# append to master (pd.concat replaces DataFrame.append, removed in pandas 2.0)
master = pd.concat([master, combined_df_2_3], sort=True)

print('at this point there are ' + str(len(master)) + ' rows in master and each arks number appears only once.')

at this point there are 320 rows in master and each arks number appears only once.


## now take care of arks numbers that appear in all 3 tables

In [35]:
# temporary data frames of the arks records that appear in all 3 tables
temp1 = tab1[tab1['ARKS Number'].isin(arks_in_all3tables)]
temp2 = tab2[tab2['ARKS Number'].isin(arks_in_all3tables)]
temp3 = tab3[tab3['ARKS Number'].isin(arks_in_all3tables)]

In [36]:
# check
len ( set( list(temp1['ARKS Number'].unique()) + list(temp2['ARKS Number'].unique()) + list(temp3['ARKS Number'].unique()) ) )

383

In [37]:
# set index to arks number (reassign rather than modify the filtered slice in place)
temp1 = temp1.set_index('ARKS Number')
temp2 = temp2.set_index('ARKS Number')
temp3 = temp3.set_index('ARKS Number')

# now use combine first to fill in any null values
# priority order: tab 1, then tab 2, then tab 3
combined_df = temp1.combine_first(temp2).combine_first(temp3).reset_index()

# append to master (pd.concat replaces DataFrame.append, removed in pandas 2.0)
master = pd.concat([master, combined_df], sort=True)

print('at this point there are ' + str(len(master)) + ' rows in master and each arks number appears only once.')

at this point there are 703 rows in master and each arks number appears only once.
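
Since combine_first takes the union of the two indexes, the case-by-case steps above could probably be collapsed into a single chain over the full tables, with the same priority order (tab 1, then tab 2, then tab 3). This is a sketch on invented toy tables, not the code that produced the list, and row order may differ from the stepwise result.

```python
import numpy as np
import pandas as pd

key = "ARKS Number"
# Invented toy tables that already share the same columns.
t1 = pd.DataFrame({key: ["a", "b"], "Sex": ["F", np.nan]})
t2 = pd.DataFrame({key: ["b", "c"], "Sex": ["M", np.nan]})
t3 = pd.DataFrame({key: ["c", "d"], "Sex": ["F", "M"]})

# Chaining over the full tables covers every case at once:
# ids found in one table, in two tables, or in all three.
unified = (
    t1.set_index(key)
      .combine_first(t2.set_index(key))
      .combine_first(t3.set_index(key))
      .reset_index()
)
print(unified)
```

Here id b takes its Sex from t2 because t1 holds a null, and id c falls through to t3 for the same reason.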


# write unified animal list to csv

In [44]:
master.to_csv('../Data/big_cats/processed_data/unified_animal_list.csv', index = False)
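
Joining this list to blood data could then look roughly like the sketch below. Both frames are invented, and the blood table with its Creatinine column is a hypothetical placeholder; only 'ARKS Number' is a real column name.

```python
import pandas as pd

# Invented stand-ins for the unified animal list and a blood table.
animals = pd.DataFrame({"ARKS Number": ["a", "b"], "Sex": ["F", "M"]})
blood = pd.DataFrame(
    {"ARKS Number": ["a", "a", "b", "z"], "Creatinine": [1.2, 1.5, 2.0, 0.9]}
)

# validate='many_to_one' raises an error if an ARKS number is repeated
# in `animals`; indicator=True records which side each row came from.
joined = blood.merge(
    animals, on="ARKS Number", how="left",
    validate="many_to_one", indicator=True,
)

# Blood rows whose animal is missing from the unified list.
unmatched = joined.loc[joined["_merge"] == "left_only", "ARKS Number"].tolist()
print(unmatched)  # ['z']
```

The validate argument relies on each ARKS number appearing only once in the unified list, which is the property checked throughout this notebook.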