# Fix column names 

This notebook:
* removes the empty column from the `pathogenic` sheet
* fixes column names so that they are consistent across the Excel sheets

```
pathogenic: 
"Location in  Genome release  37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)
"Location in Genome release  38 (hg38)" (orig) -> "Location in Genome release 38 (hg38)" (new)
"Protein or mRNA variants" (orig) -> "Protein or mRNA Variants" (new)
" Functional outcome (MLCL/CL ratio)" (orig) -> "Functional outcome (MLCL/CL ratio)" (new)
"Taffazin Functional motifs " (orig) -> "Taffazin Functional motifs" (new)

benign:
"Location in  Genome release 37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)
"Splicing prediction" (orig) -> "Splicing Prediction" (new)

vus:
"Location in  Genome release 37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)

exon5:
"Genome Assembly Release 37" (orig) -> "Location in Genome release 37 (hg19)" (new)
"Genome Assembly Release 38" (orig) -> "Location in Genome release 38 (hg38)" (new)
```
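The renames listed above could also be kept as data and applied in one call per sheet. The following is only a minimal sketch: `PATHOGENIC_RENAMES` and the toy DataFrame are hypothetical names introduced for illustration, and the notebook itself renames the columns one at a time in the cells below.

```python
import pandas as pd

# Hypothetical mapping, transcribed from the list above for the pathogenic sheet.
PATHOGENIC_RENAMES = {
    'Location in  Genome release  37 (hg19)': 'Location in Genome release 37 (hg19)',
    'Location in Genome release  38 (hg38)': 'Location in Genome release 38 (hg38)',
    'Protein or mRNA variants': 'Protein or mRNA Variants',
    ' Functional outcome (MLCL/CL ratio)': 'Functional outcome (MLCL/CL ratio)',
    'Taffazin Functional motifs ': 'Taffazin Functional motifs',
}

# Toy frame standing in for the real sheet (column names only, no real data).
toy = pd.DataFrame(columns=list(PATHOGENIC_RENAMES) + ['Location'])

# errors='raise' makes rename fail with a KeyError if an expected original
# name is absent, instead of silently leaving the columns unchanged.
renamed = toy.rename(columns=PATHOGENIC_RENAMES, errors='raise')
print(list(renamed.columns))
```

With `errors='raise'`, a misspelled original name is reported immediately, which plays the same role as the manual `in` checks used later in this notebook.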

In [1]:
import pandas as pd
import os
import helpers

In [2]:
input_path = '../database_original/[Data Only] Human TAFAZZIN Variants Database_v07-20-2023.xlsx'
output_path = helpers.create_database_output_path()

version number: 0000
database output path: ../database_versions/0000_2023-10-30-12-58-35-482589_Human-TAFAZZIN-Variants-Database.xlsx


# Load data

In [3]:
if not os.path.exists(input_path):
    print('Please download the original database into `database_original` folder from https://drive.google.com/drive/folders/1O2MKa5FHsvq3hyjOVSsOZf37xkwKYAJ8 ')

In [4]:
xls = pd.ExcelFile(input_path)
sheet_names = xls.sheet_names
print(sheet_names)

['1.PATHOGENIC LIKELY_v07062023', '2.VUS_v01222023', '3.BENIGN_v09012021', '4.EXON 5_v01012020']


In [5]:
pathogenic_sheet_names = [a for a in sheet_names if 'PATHOGENIC' in a] 
assert len(pathogenic_sheet_names) == 1, 'we expect just one "PATHOGENIC" sheet'
df_pathogenic = pd.read_excel(xls, pathogenic_sheet_names[0])
    
vus_sheet_names = [a for a in sheet_names if 'VUS' in a] 
assert len(vus_sheet_names) == 1, 'we expect just one "VUS" sheet'
df_vus = pd.read_excel(xls, vus_sheet_names[0])
    
benign_sheet_names = [a for a in sheet_names if 'BENIGN' in a] 
assert len(benign_sheet_names) == 1, 'we expect just one "BENIGN" sheet'
df_benign = pd.read_excel(xls, benign_sheet_names[0])
    
exon5_sheet_names = [a for a in sheet_names if 'EXON 5' in a] 
assert len(exon5_sheet_names) == 1, 'we expect just one "EXON 5" sheet'
df_exon5 = pd.read_excel(xls, exon5_sheet_names[0])    
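The cell above repeats the same "filter, assert, read" pattern four times. As a sketch of how the lookup part could be factored out, the hypothetical helper `find_single_sheet` below returns the one matching sheet name or raises an error; it is not part of the `helpers` module used in this notebook.

```python
def find_single_sheet(sheet_names, keyword):
    """Return the only sheet name containing `keyword`, or raise ValueError."""
    matches = [name for name in sheet_names if keyword in name]
    if len(matches) != 1:
        raise ValueError(f'expected exactly one {keyword!r} sheet, found {matches}')
    return matches[0]


# Sheet names copied from the output printed above.
example_names = ['1.PATHOGENIC LIKELY_v07062023', '2.VUS_v01222023',
                 '3.BENIGN_v09012021', '4.EXON 5_v01012020']

selected = {key: find_single_sheet(example_names, key)
            for key in ('PATHOGENIC', 'VUS', 'BENIGN', 'EXON 5')}
print(selected)
```

Raising `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.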

In [6]:
print(df_pathogenic.shape)
df_pathogenic.head(3)

(406, 16)


Unnamed: 0,Location,Location in Genome release 37 (hg19),Location in Genome release 38 (hg38),Protein Variant Type,Impact of Variant,DNA Modifications,Protein or mRNA variants,Functional outcome (MLCL/CL ratio),Taffazin Functional motifs,Method of Validation,References,Source,Additional variants in other genes,Location and Order of Discovery,Notes,Unnamed: 15
0,Exon 1,X:153640189,X:154411852,Frameshift,,c.9_10dupG,p.His4Alafs*130,MLCL/CL elevated,,,Ref. 1 (Pat.1); Ref. 80; Ref. 113,,,1-1,,
1,Exon 1,X:153640197 - 153640198,X: 154411860 - 154411861,Frameshift,,c.18_22dup,p.Pro8fs*,,,,Ref. 140; Ref.83,ClinVar,,1-12,,
2,Exon 1,X:153640219_241,X:154411882_904,Frameshift,,c.39_60del22,p.Pro14Alafs*19,MLCL/CL elevated,,,Ref. 95 (Pat. 1),,"Mitochondrial: m.1555A>G in 12S rRNA, homoplasmic",1-11,,


In [7]:
print(df_vus.shape)
df_vus.head(3)

(126, 12)


Unnamed: 0,Location,Location in Genome release 37 (hg19),Location in Genome release 38 (hg38),DNA Modifications,Protein or mRNA Variants,References & population frequency,Source,SIFT prediction,PolyPhen2 prediction,Amino acid conservation & comments,Additional variants in other genes,Notes
0,Exon 1,X:153640193,X:154411856,c.13G>T,p.Val5Leu,gnomad exomes; Ashk Jewish female 5/99652,ClinVar; LMM 2011; GeneDX 2017; Invitae 2020;...,tolerated; 0.14,0.984,Vertebrates 100% Val; invertebrates have Ile,,
1,Exon 1,X:153640197_198,X:154411860_861,c.17_18insA,fs*,Ref. 57,,,,,,Not in ClinVar
2,Exon 1,X:153640207,X:154411870,c.27C>G,p.Phe9Leu,,ClinVar; Invitae 2020; Ambry 2020,deleterious,benign,vertebrates 100%,,Not in ExAC


In [8]:
print(df_benign.shape)
df_benign.head(3)

(178, 13)


Unnamed: 0,Location,Location in Genome release 37 (hg19),Location in Genome release 38 (hg38),DNA Modifications,Protein or mRNA Variants,References & population frequency,Source,Amino acid conservation & comments,SIFT prediction,PolyPhen2 prediction,Splicing prediction,Additional variants in other genes,Notes
0,5'UTR,X:153640060-153640061,X:154411724-154411725,c.-119 (or '-121_-119) insert/del T,,ExAC 5786/6746 alleles; 3249 homozy; 1111 hemizy,ClinVar,,,,,,Benign
1,5'UTR,X:153640093,X:154411756,c.-88G > C,,ExAC 28/6688; 13 hemizyg; Ref. 4; Ref. 14,ClinVar Jun 2016,,,,,,
2,5'UTR,X:153640097,X:154411760,c.-84C>G,,,,,,,,,


In [9]:
print(df_exon5.shape)
df_exon5.head(3)

(11, 13)


Unnamed: 0,Location,Classification,Genome Assembly Release 37,Genome Assembly Release 38,DNA Modifications,Protein or mRNA Variants,References & population frequency,Source,Amino acid conservation & comments,SIFT prediction,PolyPhen2 prediction,Splicing Prediction,Notes
0,Exon 5,VUS,X:153642485,X:154414148_52,c.418_422del ACAGGinsA,p.Arg142ThrfsX41,ExAC 1/79840; 1 hemizyg (LOW CONFIDENCE),,,,,,Not in ClinVar
1,Exon 5,VUS,X:153642486,X:154414149,c.419C>T,p.Thr140Ile,Ref. 83,ClinVar Jan 2014,Not all primates,Tolerated; 0.36,0.0,Acceptor much reduced; donor small reduction,VUS; not in ExAC
2,Exon 5,VUS,X:153642492_95,X:154414155_58,c.425_428delGGCA or c.419_422delCAGG,p.Arg142ThrfsX41,Ref. 83,ClinVar Feb 2017,Should only manifest in FL variant,,,,Likely path. Var ID 423852; Not in ExAC


# Inspect column names

In [10]:
columns_all = list(df_pathogenic.columns) + list(df_benign.columns) + \
            list(df_vus.columns) + list(df_exon5.columns)

In [11]:
columns_not_in_all = [x for x in set(columns_all) if columns_all.count(x) != 4]

In [12]:
print('Column names, which are not present in all 4 sheets:')
sorted(columns_not_in_all)

Column names, which are not present in all 4 sheets:


[' Functional outcome (MLCL/CL ratio)',
 'Additional variants in other genes',
 'Amino acid conservation & comments',
 'Classification',
 'Genome Assembly Release 37',
 'Genome Assembly Release 38',
 'Impact of Variant',
 'Location and Order of Discovery',
 'Location in  Genome release  37 (hg19)',
 'Location in  Genome release 37 (hg19)',
 'Location in Genome release  38 (hg38)',
 'Location in Genome release 38 (hg38)',
 'Method of Validation',
 'PolyPhen2 prediction',
 'Protein Variant Type',
 'Protein or mRNA Variants',
 'Protein or mRNA variants',
 'References',
 'References & population frequency',
 'SIFT prediction',
 'Splicing Prediction',
 'Splicing prediction',
 'Taffazin Functional motifs ',
 'Unnamed: 15']
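The same kind of list can be computed with `collections.Counter`. This is a sketch on toy column lists, with the hypothetical helper name `columns_not_shared`; unlike the `list.count` approach above, it counts each name at most once per sheet, so a name duplicated within a single sheet would not be mistaken for a shared one.

```python
from collections import Counter


def columns_not_shared(*column_lists):
    """Return sorted names that do not appear in every given list of columns."""
    # set(cols) ensures a name is counted once per sheet
    counts = Counter(name for cols in column_lists for name in set(cols))
    return sorted(name for name, n in counts.items() if n != len(column_lists))


# Toy column lists, not the real sheets.
toy_a = ['Location', 'Notes', 'Splicing prediction']
toy_b = ['Location', 'Notes', 'Splicing Prediction']

odd = columns_not_shared(toy_a, toy_b)
print(odd)
```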

### Column names that are missing from some sheets, but seem to be OK:

--> no action needed

In [13]:
columns_missing_seem_ok = ['Functional outcome (MLCL/CL ratio)', # only in df_pathogenic (the original name has a leading space and is fixed below)
    'Additional variants in other genes', # not in exon5
    'Amino acid conservation & comments', # not in df_pathogenic
    'Classification', # only in exon5
    'Impact of Variant', # only in df_pathogenic
    'Location and Order of Discovery', # only in df_pathogenic
    'Method of Validation'] # only in df_pathogenic

In [14]:
len(columns_not_in_all)

24

In [15]:
columns_not_in_all = [x for x in columns_not_in_all if x not in columns_missing_seem_ok]

In [16]:
len(columns_not_in_all)

18

### Column names that are missing from some sheets, where renaming is not enough and more parsing is needed:

--> TODO: work left to be done in the next notebooks

In [17]:
columns_missing_more_parsing_needed = ['PolyPhen2 prediction', # column not in df_pathogenic, but this info can be extracted from 'Method of Validation'
    'SIFT prediction', # column not in df_pathogenic, but this info can be extracted from 'Method of Validation'
    'Protein Variant Type', # only in df_pathogenic, should be added also to other sheets
    'References', # only in df_pathogenic, should be made consistent with other sheets, where similar column contains population frequency, see below
    'References & population frequency'] # not in df_pathogenic, see above

In [18]:
columns_not_in_all = [x for x in columns_not_in_all if x not in columns_missing_more_parsing_needed]

In [19]:
len(columns_not_in_all)

13

# Rename columns:

#### ' Functional outcome (MLCL/CL ratio)':

reason: typo: space at the beginning

In [20]:
new_col_name = 'Functional outcome (MLCL/CL ratio)'

In [21]:
columns_all.count(' Functional outcome (MLCL/CL ratio)')

1

In [22]:
' Functional outcome (MLCL/CL ratio)' in df_pathogenic

True

In [23]:
df_pathogenic.rename(columns={' Functional outcome (MLCL/CL ratio)': new_col_name.strip()}, inplace=True)

In [24]:
assert new_col_name in df_pathogenic.columns

In [25]:
columns_not_in_all.remove(' Functional outcome (MLCL/CL ratio)')

In [26]:
len(columns_not_in_all)

12

#### 'Genome Assembly Release 37'
#### 'Location in  Genome release  37 (hg19)'
#### 'Location in  Genome release 37 (hg19)'

reason: typo: names inconsistent, spaces in names
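The whitespace part of this fix (doubled, leading, or trailing spaces) could also be handled generically. The sketch below uses a hypothetical helper `normalize_whitespace`; it would not cover names such as 'Genome Assembly Release 37', which differ in wording and still need an explicit mapping.

```python
import re


def normalize_whitespace(name):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', name).strip()


# Original column names taken from the list printed earlier in this notebook.
examples = ['Location in  Genome release  37 (hg19)',
            ' Functional outcome (MLCL/CL ratio)',
            'Taffazin Functional motifs ']

cleaned = [normalize_whitespace(n) for n in examples]
print(cleaned)
```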

In [27]:
new_col_name = 'Location in Genome release 37 (hg19)' # new name with fixed spaces

In [28]:
columns_all.count('Genome Assembly Release 37')

1

In [29]:
'Genome Assembly Release 37' in df_exon5

True

In [30]:
df_exon5.rename(
    columns={'Genome Assembly Release 37': new_col_name}, inplace=True)

In [31]:
columns_all.count('Location in  Genome release  37 (hg19)')

1

In [32]:
df_pathogenic.rename(
    columns={'Location in  Genome release  37 (hg19)': new_col_name}, inplace=True)

In [33]:
columns_all.count('Location in  Genome release 37 (hg19)')

2

In [34]:
'Location in  Genome release 37 (hg19)' in df_vus

True

In [35]:
'Location in  Genome release 37 (hg19)' in df_benign

True

In [36]:
df_benign.rename(columns={
    'Location in  Genome release 37 (hg19)': new_col_name}, inplace=True)

In [37]:
df_vus.rename(
    columns={'Location in  Genome release 37 (hg19)': new_col_name}, inplace=True)

In [38]:
assert new_col_name in df_exon5

In [39]:
assert new_col_name in df_pathogenic

In [40]:
assert new_col_name in df_benign.columns

In [41]:
assert new_col_name in df_vus.columns

In [42]:
columns_not_in_all.remove('Genome Assembly Release 37')

In [43]:
columns_not_in_all.remove('Location in  Genome release  37 (hg19)')

In [44]:
columns_not_in_all.remove('Location in  Genome release 37 (hg19)')

In [45]:
len(columns_not_in_all)

9

#### 'Genome Assembly Release 38'
#### 'Location in Genome release  38 (hg38)'
#### 'Location in Genome release 38 (hg38)'

reason: typo: names inconsistent, spaces in names

In [46]:
new_col_name = 'Location in Genome release 38 (hg38)'

In [47]:
columns_all.count('Genome Assembly Release 38')

1

In [48]:
'Genome Assembly Release 38' in df_exon5

True

In [49]:
df_exon5.rename(
    columns={'Genome Assembly Release 38': new_col_name}, inplace=True)

In [50]:
columns_all.count('Location in Genome release  38 (hg38)')

1

In [51]:
'Location in Genome release  38 (hg38)' in df_pathogenic

True

In [52]:
df_pathogenic.rename(
    columns={'Location in Genome release  38 (hg38)': new_col_name}, inplace=True)

In [53]:
assert new_col_name in df_exon5

In [54]:
assert new_col_name in df_pathogenic

In [55]:
assert new_col_name in df_benign

In [56]:
assert new_col_name in df_vus

In [57]:
columns_not_in_all.remove('Genome Assembly Release 38')

In [58]:
columns_not_in_all.remove('Location in Genome release  38 (hg38)')

In [59]:
columns_not_in_all.remove('Location in Genome release 38 (hg38)')

In [60]:
len(columns_not_in_all)

6

#### 'Protein or mRNA Variants'
#### 'Protein or mRNA variants'

reason: typo: names inconsistent, capitalization differs ('Variants' vs. 'variants')
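The two names in this heading differ only in letter case. A small sketch for spotting such pairs automatically is shown here; `case_variants` is a hypothetical helper and the input is a toy list rather than the real `columns_all`.

```python
from collections import defaultdict


def case_variants(names):
    """Group names that are equal when compared case-insensitively."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # keep only groups that contain more than one distinct spelling
    return [sorted(group) for group in groups.values() if len(group) > 1]


toy_names = ['Protein or mRNA Variants', 'Protein or mRNA variants',
             'Splicing Prediction', 'Splicing prediction', 'Notes']

variants = case_variants(toy_names)
print(variants)
```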

In [61]:
new_col_name = 'Protein or mRNA Variants'

In [62]:
assert new_col_name in df_benign

In [63]:
assert new_col_name in df_vus

In [64]:
assert new_col_name in df_exon5

In [65]:
df_pathogenic.rename(
    columns={'Protein or mRNA variants': new_col_name}, inplace=True)

In [66]:
assert new_col_name in df_pathogenic

In [67]:
columns_not_in_all.remove('Protein or mRNA Variants')

In [68]:
columns_not_in_all.remove('Protein or mRNA variants')

In [69]:
len(columns_not_in_all)

4

#### 'Taffazin Functional motifs '

reason: space at the end of the column name

In [70]:
columns_all.count('Taffazin Functional motifs ')

1

In [71]:
'Taffazin Functional motifs ' in df_pathogenic

True

In [72]:
new_col_name = 'Taffazin Functional motifs'

In [73]:
df_pathogenic.rename(
    columns={'Taffazin Functional motifs ': new_col_name}, inplace=True)

In [74]:
assert new_col_name in df_pathogenic

In [75]:
columns_not_in_all.remove('Taffazin Functional motifs ')

In [76]:
len(columns_not_in_all)

3

#### 'Splicing Prediction'
#### 'Splicing prediction'

reason: inconsistent column names

TODO why is this column not in pathogenic and vus?

In [77]:
columns_all.count('Splicing prediction')

1

In [78]:
columns_all.count('Splicing Prediction')

1

In [79]:
'Splicing prediction' in df_benign

True

In [80]:
'Splicing Prediction' in df_exon5

True

In [81]:
new_col_name = 'Splicing Prediction'

In [82]:
df_benign.rename(
    columns={'Splicing prediction': new_col_name}, inplace=True)

In [83]:
assert new_col_name in df_benign

In [84]:
assert new_col_name in df_exon5

In [85]:
columns_not_in_all.remove('Splicing Prediction')

In [86]:
columns_not_in_all.remove('Splicing prediction')

In [87]:
len(columns_not_in_all)

1

#### Remove 'Unnamed: 15' column

reason: the column contains only a single space in one row, probably inserted by mistake

In [88]:
'Unnamed: 15' in df_pathogenic

True

In [89]:
columns_all.count('Unnamed: 15')

1

In [90]:
df_pathogenic['Unnamed: 15'].value_counts()

     1
Name: Unnamed: 15, dtype: int64

In [91]:
df_pathogenic[~df_pathogenic['Unnamed: 15'].isna()]['Unnamed: 15'].iloc[0]

' '
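The manual inspection above shows that the column holds nothing but one whitespace string. As a sketch, the same check could be generalized to any sheet; `effectively_empty_columns` is a hypothetical helper and the DataFrame is a toy example, not the real data.

```python
import pandas as pd


def effectively_empty_columns(df):
    """Return names of columns whose values are all NaN or whitespace-only strings."""
    empty = []
    for col in df.columns:
        values = df[col].dropna()
        # all() is True for an empty sequence, so all-NaN columns qualify too
        if all(isinstance(v, str) and v.strip() == '' for v in values):
            empty.append(col)
    return empty


toy = pd.DataFrame({'Location': ['Exon 1', 'Exon 2'],
                    'Unnamed: 15': [' ', None]})

to_drop = effectively_empty_columns(toy)
print(to_drop)
```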

In [92]:
df_pathogenic.drop(['Unnamed: 15'], axis=1, inplace=True)

In [93]:
assert 'Unnamed: 15' not in df_pathogenic

In [94]:
columns_not_in_all.remove('Unnamed: 15')

In [95]:
assert len(columns_not_in_all) == 0, 'all columns should have been handled by now'

# Save data

In [96]:
#! pip install xlsxwriter

In [97]:
# create a Pandas Excel writer using XlsxWriter as the engine;
# the context manager saves and closes the file on exit
# (`writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
    # write each dataframe to a different worksheet
    df_pathogenic.to_excel(writer, sheet_name=pathogenic_sheet_names[0], index=False)
    df_vus.to_excel(writer, sheet_name=vus_sheet_names[0], index=False)
    df_benign.to_excel(writer, sheet_name=benign_sheet_names[0], index=False)
    df_exon5.to_excel(writer, sheet_name=exon5_sheet_names[0], index=False)

print(f'Output was saved to {output_path}')

Output was saved to ../database_versions/0000_2023-10-30-12-58-35-482589_Human-TAFAZZIN-Variants-Database.xlsx


# Load what was saved and compare with original version

In [98]:
! diff {output_path} "{input_path}" 

Binary files ../database_versions/0000_2023-10-30-12-58-35-482589_Human-TAFAZZIN-Variants-Database.xlsx and ../database_original/[Data Only] Human TAFAZZIN Variants Database_v07-20-2023.xlsx differ


In [99]:
# remove working dataframes to make sure we are comparing only the saved excels
del df_pathogenic
del df_vus
del df_benign
del df_exon5

### Load original data

In [100]:
xls_orig = pd.ExcelFile(input_path)

df_pathogenic_orig = pd.read_excel(xls_orig, pathogenic_sheet_names[0])
df_vus_orig = pd.read_excel(xls_orig, vus_sheet_names[0])
df_benign_orig = pd.read_excel(xls_orig, benign_sheet_names[0])
df_exon5_orig = pd.read_excel(xls_orig, exon5_sheet_names[0])    

In [101]:
xls_new = pd.ExcelFile(output_path)

df_pathogenic_new = pd.read_excel(xls_new, pathogenic_sheet_names[0])
df_vus_new = pd.read_excel(xls_new, vus_sheet_names[0])
df_benign_new = pd.read_excel(xls_new, benign_sheet_names[0])
df_exon5_new = pd.read_excel(xls_new, exon5_sheet_names[0])    

#### Show column renaming:

In [102]:
def print_diffs(df, df_orig):
    # compare column names position by position;
    # extra trailing columns in either frame are ignored
    for new_name, orig_name in zip(df.columns, df_orig.columns):
        if new_name != orig_name:
            print(f'difference: "{orig_name}" (orig) -> "{new_name}" (new)')

In [103]:
print_diffs(df_pathogenic_new, df_pathogenic_orig)

difference: "Location in  Genome release  37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)
difference: "Location in Genome release  38 (hg38)" (orig) -> "Location in Genome release 38 (hg38)" (new)
difference: "Protein or mRNA variants" (orig) -> "Protein or mRNA Variants" (new)
difference: " Functional outcome (MLCL/CL ratio)" (orig) -> "Functional outcome (MLCL/CL ratio)" (new)
difference: "Taffazin Functional motifs " (orig) -> "Taffazin Functional motifs" (new)


In [104]:
print_diffs(df_benign_new, df_benign_orig)

difference: "Location in  Genome release 37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)
difference: "Splicing prediction" (orig) -> "Splicing Prediction" (new)


In [105]:
print_diffs(df_vus_new, df_vus_orig)

difference: "Location in  Genome release 37 (hg19)" (orig) -> "Location in Genome release 37 (hg19)" (new)


In [106]:
print_diffs(df_exon5_new, df_exon5_orig)

difference: "Genome Assembly Release 37" (orig) -> "Location in Genome release 37 (hg19)" (new)
difference: "Genome Assembly Release 38" (orig) -> "Location in Genome release 38 (hg38)" (new)


#### Show that nothing else changed in the data, except for the column renaming:

In [107]:
df_benign_orig.columns = df_benign_new.columns
assert df_benign_new.equals(df_benign_orig)

In [108]:
df_vus_orig.columns = df_vus_new.columns
assert df_vus_new.equals(df_vus_orig)

In [109]:
df_exon5_orig.columns = df_exon5_new.columns
assert df_exon5_new.equals(df_exon5_orig)

##### Slightly more difficult comparison for df_pathogenic, because we removed the one empty column:

In [110]:
df_pathogenic_orig.columns = list(df_pathogenic_new.columns) + ['Unnamed: 15']

In [111]:
df_pathogenic_new.shape

(406, 15)

In [112]:
df_pathogenic_orig.shape

(406, 16)

In [113]:
df_pathogenic_orig['Unnamed: 15'].value_counts() # one more column in orig, but doesn't contain anything

     1
Name: Unnamed: 15, dtype: int64

In [114]:
assert df_pathogenic_new.equals(df_pathogenic_orig[df_pathogenic_new.columns]), 'dfs should be the same apart from the unnamed column'
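The comparisons above could be wrapped in one helper that also reports where two frames differ. This is a sketch only: `assert_same_data` is a hypothetical name, the frames are toy data, and `pandas.testing.assert_frame_equal` is used in place of `DataFrame.equals` because its error message points at the first mismatch.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_same_data(df_new, df_orig, dropped=()):
    """Check equality by position, ignoring column names and `dropped` columns of df_orig."""
    trimmed = df_orig.drop(columns=list(dropped))  # returns a new frame
    trimmed.columns = df_new.columns  # compare by position, not by name
    assert_frame_equal(df_new, trimmed)


# Toy frames: one renamed column and one dropped, effectively empty column.
orig = pd.DataFrame({'Splicing prediction': ['a', None],
                     'Unnamed: 15': [' ', None]})
new = pd.DataFrame({'Splicing Prediction': ['a', None]})

assert_same_data(new, orig, dropped=['Unnamed: 15'])
print('frames hold the same data')
```

Because `drop` returns a new frame, the original DataFrame keeps its column names, unlike the cells above, which overwrite `columns` in place.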