# Read and Standardize Mutation Information
This notebook reads a .csv file with one mutation per line. It is a template that you can modify for your specific use case.

For subsequent analysis, the input file must contain at least the following 3 fields:
* **var_id** with the genomic location using the [HGVS sequence variant nomenclature](http://varnomen.hgvs.org/recommendations/general/), e.g. chr5:g.149440497C>T
* **annotation** short label, e.g., cancer type
* **color** color for visualization ([list of colors](https://github.com/3dmol/3Dmol.js/blob/master/3Dmol/colors.js#L45-L192)), e.g., to color by cancer type
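Before passing a file to later steps, it can help to confirm that the three required fields are present. The following is a minimal sketch, not part of the original notebook; the helper name `missing_columns` and the in-memory example row are assumptions made for illustration.

```python
import pandas as pd

# Column names required by the subsequent analysis (taken from the list above)
REQUIRED_COLUMNS = ["var_id", "annotation", "color"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [name for name in required if name not in df.columns]


# Small in-memory table standing in for a real input file (illustrative values)
example = pd.DataFrame({
    "var_id": ["chr5:g.149440497C>T"],
    "annotation": ["example label"],
    "color": ["blue"],
})

print(missing_columns(example))  # []
```

An empty list means all required fields are available; any names returned would need to be added before continuing.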

This notebook is an example that standardizes the variant nomenclature:
1. Read the file with your mutation information
2. Create a column 'var_id' with the genomic location using the [HGVS sequence variant nomenclature](http://varnomen.hgvs.org/recommendations/general/), e.g. chr5:g.149440497C>T (the code in this notebook omits the `chr` prefix, e.g. 5:g.149440497C>T)
3. Filter out any variants that are not SNVs
4. Save the file as 'mutations.csv'

The mutations.csv file is the input for the next step: MapTo3DStructures

In [1]:
import pandas as pd

#### Input parameters (specify your input file name below)

In [2]:
input_file_name = "../data/example-grch37.csv"

#input_file_name = <path to your input file> # mutation info (chromosome number and position required)

output_file_name = 'mutations.csv' # contains mutation info in standard format (e.g., chr5:g.149440497C>T)

## Read input file and remove mutations that are not SNVs

In [3]:
df = pd.read_csv(input_file_name)
pd.options.display.max_columns = None # show all columns

Filter out any variants that are not SNVs

In [4]:
# Optional: keep only missense variants (requires an 'ANN[*].EFFECT' column in the input file)
# df = df[df['ANN[*].EFFECT'] == 'missense_variant']
df.head()

Unnamed: 0,ID,CHROM,POS,REF,ALT,annotation,color
0,rs147776857,6,52619766,C,T,GSTA2 missense mutation,blue
1,rs121913460,9,133738358,A,T,ABL1 missense mutation,green
2,rs34933751,11,5246945,G,T,HBB missense mutation,purple
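The filter in the cell above is commented out and selects by effect rather than by variant type. One possible way to keep only SNVs is to require that REF and ALT are each a single, different nucleotide. This is a sketch under that assumption, not the notebook's own method; the function name `keep_snvs` and the third example row are hypothetical.

```python
import pandas as pd

# Assumption: an SNV has exactly one base in REF and one different base in ALT
BASES = ["A", "C", "G", "T"]


def keep_snvs(df):
    """Return only the rows whose REF and ALT are single, different nucleotides."""
    ref = df["REF"].astype(str).str.upper()
    alt = df["ALT"].astype(str).str.upper()
    is_snv = ref.isin(BASES) & alt.isin(BASES) & (ref != alt)
    return df[is_snv].reset_index(drop=True)


example = pd.DataFrame({
    "CHROM": [6, 9, 1],
    "POS": [52619766, 133738358, 100],  # last row is a made-up deletion
    "REF": ["C", "A", "GA"],
    "ALT": ["T", "T", "G"],
})

snvs = keep_snvs(example)
print(len(snvs))  # 2
```

Applied to the notebook's data, this would be `df = keep_snvs(df)` before the `var_id` column is created.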


## Create a new column `var_id` with a standard variant identifier

In [5]:
# The G2S service no longer accepts the standard 'chr' prefix, so this version is disabled
# def var_id(chrom, pos, ref, alt):
#    return "chr" + str(chrom) + ":g." + str(pos) + ref + ">" + alt

In [6]:
def var_id(chrom, pos, ref, alt):
    """Build a variant identifier without the 'chr' prefix, e.g. 6:g.52619766C>T."""
    return str(chrom) + ":g." + str(pos) + ref + ">" + alt

In [7]:
df['var_id'] = df.apply(lambda x: var_id(x['CHROM'], x['POS'], x['REF'], x['ALT']), axis=1)

In [8]:
df.head()

Unnamed: 0,ID,CHROM,POS,REF,ALT,annotation,color,var_id
0,rs147776857,6,52619766,C,T,GSTA2 missense mutation,blue,6:g.52619766C>T
1,rs121913460,9,133738358,A,T,ABL1 missense mutation,green,9:g.133738358A>T
2,rs34933751,11,5246945,G,T,HBB missense mutation,purple,11:g.5246945G>T
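A quick sanity check on the generated identifiers can catch malformed rows before the file is saved. The sketch below uses a regular expression that accepts identifiers with or without the `chr` prefix; the pattern, the accepted chromosome names (1-2 digits, X, Y, M, MT), and the function name `is_valid_var_id` are assumptions for illustration, not a full HGVS validator.

```python
import re

# Assumed pattern: optional 'chr', chromosome name, ':g.', position, REF>ALT
VAR_ID_PATTERN = re.compile(r"^(?:chr)?(\d{1,2}|X|Y|MT?):g\.(\d+)([ACGT])>([ACGT])$")


def is_valid_var_id(value):
    """Return True if value looks like a single-nucleotide substitution identifier."""
    return bool(VAR_ID_PATTERN.match(str(value)))


print(is_valid_var_id("6:g.52619766C>T"))      # True
print(is_valid_var_id("chr5:g.149440497C>T"))  # True
print(is_valid_var_id("6:52619766C>T"))        # False
```

With the notebook's DataFrame, `df['var_id'].map(is_valid_var_id).all()` would report whether every row passes the check.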


In [9]:
df.to_csv(output_file_name, index=False)

## Now run the next step
Map mutations to 3D Structure: [2-MapTo3DStructures.ipynb](./2-MapTo3DStructures.ipynb)