In [9]:
# Import all the essential libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns

## 4: Clinical disease data Project
1) Gene name

2) Mutation ID number

3) Mutation Position (chromosome & position)

4) Mutation value (reference & alternate bases)

5) Clinical significance (CLNSIG)

6) Disease that is implicated


### VCF file description (Summarized from version 4.1)

```
* The VCF specification:

VCF is a text file format which contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also can contain genotype information on samples for each position.

* Fixed fields:

There are 8 fixed fields per record. All data lines are **tab-delimited**. In all cases, missing values are specified with a dot (‘.’). 

1. CHROM - chromosome number
2. POS - position as a DNA nucleotide count (bases) along the chromosome
3. ID - The unique identifier for each mutation
4. REF - reference base(s)
5. ALT - alternate base(s)
6. QUAL - quality
7. FILTER - filter status
8. INFO - a semicolon-separated series of keys with values in the format: <key>=<data>

```
### Applicable INFO field specifications

```
GENEINFO = <Gene name>
CLNSIG =  <Clinical significance>
CLNDN = <Disease name>
```
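The `<key>=<data>` layout of the INFO field can be illustrated with a small parser. This is only a minimal sketch: the helper name `parse_info` is hypothetical, and the input string is taken from the sample data in this notebook rather than from the real file.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a {key: value} dict.

    Entries without '=' (flag-style keys) are stored with the value True.
    """
    fields = {}
    for entry in info.split(";"):
        entry = entry.strip()  # tolerate stray spaces around an entry
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


record = parse_info("GENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer")
print(record.get("GENEINFO"))
print(record.get("CLNDN"))
# A key that is absent simply returns None
print(parse_info("GENEINFO=AGG;CLNDN=Heart_dis").get("CLNSIG"))
```

Using `dict.get` means a record that lacks `CLNSIG` or `CLNDN` yields `None` instead of raising an error.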

### Sample ClinVar data (vcf file format - not exactly the same as the file to download!)

```
##fileformat=VCFv4.1
##fileDate=2019-03-19
##source=ClinVar
##reference=GRCh38							
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	949523	rs786201005	C	T	.	.	GENEINFO=ISG15;CLNSIG=5
1	949696	rs672601345	C	CG	.	.	GENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer
1	949739	rs672601312	G	T	.	.	GENEINFO=ISG15;CLNDBN=Cancer
1	955597	rs115173026	G	T	.	.	GENEINFO=AGRN;CLNSIG=2; CLNDN=Cancer
1	955619	rs201073369	G	C	.	.	GENEINFO=AGG;CLNDN=Heart_dis 
1	957640	rs6657048	C	T	.	.	GENEINFO=AGG;CLNSIG=3;CLNDN=Heart_dis 
1	976059	rs544749044	C	T	.	.	GENEINFO=AGG;CLNSIG=0;CLNDN=Heart_dis 
```

In [13]:
# 4) Your code here - can use as many code blocks as you would like
df = pd.read_csv('clinvar_final.txt', sep='\t', comment='#', header=None, low_memory=False)
pd.set_option('display.max_colwidth', None)
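For comparison, here is a minimal sketch of reading VCF-style text where the header line starts with `#`, as in the sample above. The text is a shortened copy of the sample data held in memory, and the `columns` list is supplied by hand because `comment='#'` skips the `#CHROM` line. In `clinvar_final.txt` the header row apparently has no leading `#`, which is why it shows up as row 0 in the output below.

```python
import io

import pandas as pd

# Shortened copy of the sample ClinVar text (not the real file)
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "##source=ClinVar\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t949523\trs786201005\tC\tT\t.\t.\tGENEINFO=ISG15;CLNSIG=5\n"
    "1\t949696\trs672601345\tC\tCG\t.\t.\tGENEINFO=ISG15;CLNSIG=5;CLNDN=Cancer\n"
)

# Column names supplied manually, in the order of the sample header line
columns = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

toy = pd.read_csv(
    io.StringIO(vcf_text),
    sep="\t",
    comment="#",   # skips the meta-information and '#CHROM' lines
    header=None,
    names=columns,
    dtype=str,     # keep every field as text; convert later if needed
)
print(toy.shape)
print(toy[["CHROM", "POS", "ID"]])
```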

In [14]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO
1,1,1014042,475283,G,A,.,.,"AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;ALLELEID=446939;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014042G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=143888043"
2,1,1014122,542074,C,T,.,.,"AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014122C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=150861311"
3,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014143C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:147571.0003;GENEINFO=ISG15:9636;MC=SO:0001587|nonsense;ORIGIN=1;RS=786201005"
4,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014179C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1553169766"


In [15]:
df.shape

(102322, 8)

In [16]:
# Extract the essential features from the INFO column (column 7) into new columns
df['GENEINFO'] = df[7].str.extract(r'GENEINFO=([^;]+)', expand=False)
df['CLNSIG'] = df[7].str.extract(r'CLNSIG=([^;]+)', expand=False)
df['CLNDN'] = df[7].str.extract(r'CLNDN=([^;]+)', expand=False)
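The extracted `GENEINFO` values have the form `SYMBOL:ID`, and some records list several genes separated by `|` (for example `MUTYH:4595|TOE1:114034`, which appears later in this notebook). A minimal sketch of splitting such a value follows. The helper name `split_geneinfo` is hypothetical, and this step is not part of the analysis in this notebook.

```python
def split_geneinfo(geneinfo: str) -> list:
    """Turn 'SYMBOL:ID|SYMBOL:ID' into a list of (symbol, gene_id) tuples.

    gene_id is an int when the part after ':' is numeric, otherwise None.
    """
    pairs = []
    for item in geneinfo.split("|"):
        symbol, _, gene_id = item.partition(":")
        pairs.append((symbol, int(gene_id) if gene_id.isdigit() else None))
    return pairs


print(split_geneinfo("ISG15:9636"))
print(split_geneinfo("MUTYH:4595|TOE1:114034"))
```

Splitting this way would let a record that overlaps two genes be counted once per gene rather than as a separate combined label.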

In [17]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,GENEINFO,CLNSIG,CLNDN
0,CHROM,POS,ID,REF,ALT,FILTER,QUAL,INFO,,,
1,1,1014042,475283,G,A,.,.,"AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619;ALLELEID=446939;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014042G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Benign;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=143888043",ISG15:9636,Benign,Immunodeficiency_38_with_basal_ganglia_calcification
2,1,1014122,542074,C,T,.,.,"AF_ESP=0.00015;AF_EXAC=0.00010;ALLELEID=514926;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014122C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=150861311",ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
3,1,1014143,183381,C,T,.,.,"ALLELEID=181485;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014143C>T;CLNREVSTAT=no_assertion_criteria_provided;CLNSIG=Pathogenic;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=OMIM_Allelic_Variant:147571.0003;GENEINFO=ISG15:9636;MC=SO:0001587|nonsense;ORIGIN=1;RS=786201005",ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
4,1,1014179,542075,C,T,.,.,"ALLELEID=514896;CLNDISDB=MedGen:C4015293,OMIM:616126,Orphanet:ORPHA319563;CLNDN=Immunodeficiency_38_with_basal_ganglia_calcification;CLNHGVS=NC_000001.11:g.1014179C>T;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=ISG15:9636;MC=SO:0001583|missense_variant;ORIGIN=1;RS=1553169766",ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification


In [18]:
df.columns=['CHROM', 'POS', 'ID', 'REF', 'ALT', 'FILTER', 'QUAL', 'INFO', 'GENEINFO', 'CLNSIG', 'CLNDN']
df=df.drop(0)

In [19]:
df=df[['CHROM', 'POS', 'ID', 'REF', 'ALT', 'GENEINFO', 'CLNSIG', 'CLNDN']]
df.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN
1,1,1014042,475283,G,A,ISG15:9636,Benign,Immunodeficiency_38_with_basal_ganglia_calcification
2,1,1014122,542074,C,T,ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
3,1,1014143,183381,C,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
4,1,1014179,542075,C,T,ISG15:9636,Uncertain_significance,Immunodeficiency_38_with_basal_ganglia_calcification
5,1,1014217,475278,C,T,ISG15:9636,Benign,Immunodeficiency_38_with_basal_ganglia_calcification


In [20]:
df.dtypes

CHROM       object
POS         object
ID          object
REF         object
ALT         object
GENEINFO    object
CLNSIG      object
CLNDN       object
dtype: object
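Because every column is `object` (text), sorting by `ID` or `POS` compares character by character rather than by numeric value. The sketch below shows the difference on a toy frame whose values are copied from rows shown in this notebook. Converting with `pd.to_numeric` is one option and is not a step the analysis itself performs.

```python
import pandas as pd

# Toy frame with text columns, values copied from rows shown in this notebook
toy = pd.DataFrame({
    "POS": ["1014143", "94000870", "224503834"],
    "ID": ["183381", "99460", "100515"],
})

# Text sort: '100515' < '183381' < '99460' because '1' sorts before '9'
print(toy.sort_values("ID")["ID"].tolist())

# Numeric sort after conversion; errors='coerce' turns unparseable text into NaN
numeric = toy.copy()
for col in ["POS", "ID"]:
    numeric[col] = pd.to_numeric(numeric[col], errors="coerce")
print(numeric.sort_values("ID")["ID"].tolist())
```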

In [21]:
df.CLNDN.value_counts()

CLNDN
not_specified                                                           14400
Hereditary_cancer-predisposing_syndrome                                  2951
Limb-girdle_muscular_dystrophy,_type_2J|Dilated_cardiomyopathy_1G        2588
Lynch_syndrome                                                           1717
not_specified|not_provided                                               1190
                                                                        ...  
Neuroblastoma_3|Large_Cell/Anaplastic_Medulloblastoma                       1
Retinoblastoma|Neuroblastoma_3                                              1
Pilocytic_astrocytoma                                                       1
B_Lymphoblastic_Leukemia/Lymphoma_with_Hyperdiploidy|Neuroblastoma_3        1
Cowden_syndrome|Hereditary_cancer-predisposing_syndrome                     1
Name: count, Length: 6139, dtype: int64

In [22]:
df.CLNSIG.value_counts()

CLNSIG
Uncertain_significance                                                                      47980
Likely_benign                                                                               17885
Pathogenic                                                                                  12313
Likely_pathogenic                                                                            6269
Benign                                                                                       6138
Conflicting_interpretations_of_pathogenicity                                                 5404
Benign/Likely_benign                                                                         3338
Pathogenic/Likely_pathogenic                                                                  854
risk_factor                                                                                    98
association                                                                                    70
drug_response

In [23]:
df.isnull().sum()

CHROM           0
POS             0
ID              0
REF             0
ALT             0
GENEINFO     4718
CLNSIG       1797
CLNDN       12670
dtype: int64

In [24]:
df=df[df['CLNSIG'].isin(['Pathogenic', 'Pathogenic/Likely_pathogenic', 'Pathogenic,_risk_factor', 'Pathogenic/Likely_pathogenic,_other', 
                                         'Pathogenic,_other', 'Pathogenic,_Affects', 'Pathogenic,_protective', 'Pathogenic,_association,_protective',
                                         'Likely_pathogenic,_association','Likely_pathogenic,_other','Likely_pathogenic,_risk_factor'])]
df.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN
3,1,1014143,183381,C,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
9,1,1014316,161455,C,CG,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
10,1,1014359,161454,G,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
25,1,1022225,243036,G,A,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
27,1,1022313,243037,A,T,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome


In [25]:
df.CLNSIG.value_counts()

CLNSIG
Pathogenic                             12313
Pathogenic/Likely_pathogenic             854
Pathogenic,_risk_factor                   11
Pathogenic/Likely_pathogenic,_other        8
Likely_pathogenic,_risk_factor             5
Pathogenic,_other                          2
Pathogenic,_Affects                        2
Pathogenic,_association,_protective        1
Likely_pathogenic,_association             1
Pathogenic,_protective                     1
Likely_pathogenic,_other                   1
Name: count, dtype: int64

In [26]:
df.isnull().sum()

CHROM          0
POS            0
ID             0
REF            0
ALT            0
GENEINFO    2231
CLNSIG         0
CLNDN       1908
dtype: int64

In [27]:
for_grading = df.copy()  # Work on a copy so the original DataFrame is not modified

# Fill the missing values with 'Not_Given'. The final analysis drops these rows instead,
# so this copy only demonstrates the alternative.
for_grading['GENEINFO'] = for_grading['GENEINFO'].fillna('Not_Given')
for_grading['CLNDN'] = for_grading['CLNDN'].fillna('Not_Given')

for_grading.head(10)


Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN
3,1,1014143,183381,C,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
9,1,1014316,161455,C,CG,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
10,1,1014359,161454,G,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
25,1,1022225,243036,G,A,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
27,1,1022313,243037,A,T,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
47,1,1041354,574478,CGCCCGCCAGGAGAATGTCTTCAAGAAGTTCGACG,C,AGRN:375790,Pathogenic,"Myasthenic_syndrome,_congenital,_8"
50,1,1041582,126556,C,T,AGRN:375790,Pathogenic,"Congenital_myasthenic_syndrome|Myasthenic_syndrome,_congenital,_8"
64,1,1042136,243038,T,TC,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
271,1,1050473,243039,G,A,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
281,1,1050575,18241,G,C,AGRN:375790,Pathogenic,"Congenital_myasthenic_syndrome|Myasthenic_syndrome,_congenital,_8"


In [28]:
len(df)

13199

In [29]:
df = df.dropna()

In [30]:
len(df)

9095

In [31]:
df.isnull().sum()

CHROM       0
POS         0
ID          0
REF         0
ALT         0
GENEINFO    0
CLNSIG      0
CLNDN       0
dtype: int64

In [32]:
df.sort_values(by=['ID'])

Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN
78132,2,224503834,100515,A,C,CUL3:8452,Pathogenic,"Pseudohypoaldosteronism,_type_2|Pseudohypoaldosteronism_type_2E"
78129,2,224503823,100516,C,T,CUL3:8452,Pathogenic,"Pseudohypoaldosteronism,_type_2|Pseudohypoaldosteronism_type_2E"
78133,2,224503848,100517,T,C,CUL3:8452,Pathogenic,"Pseudohypoaldosteronism,_type_2|Pseudohypoaldosteronism_type_2E"
78134,2,224503850,100518,A,C,CUL3:8452,Pathogenic,"Pseudohypoaldosteronism,_type_2|Pseudohypoaldosteronism_type_2E"
78130,2,224503825,100519,G,A,CUL3:8452,Pathogenic,"Pseudohypoaldosteronism,_type_2|Pseudohypoaldosteronism_type_2E"
...,...,...,...,...,...,...,...,...
12639,1,94000870,99460,G,A,ABCA4:24,Pathogenic,Retinal_dystrophy|Visual_loss|Macular_degeneration|Blindness|Stargardt_disease_1|not_provided
12623,1,93997981,99473,G,T,ABCA4:24,Pathogenic,Stargardt_disease_1|not_provided
12621,1,93997932,99476,G,A,ABCA4:24,Pathogenic,Cone/cone-rod_dystrophy|Stargardt_disease_1|not_provided
13393,1,94098794,99505,C,A,ABCA4:24,Pathogenic,Stargardt_disease_1|Retinitis_pigmentosa_19|not_provided


Checking for Duplicates

In [33]:
duplicates=df.duplicated(subset=['ID'], keep=False)
df[duplicates]

Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN


In [34]:
df.drop_duplicates(subset=['ID'])

Unnamed: 0,CHROM,POS,ID,REF,ALT,GENEINFO,CLNSIG,CLNDN
3,1,1014143,183381,C,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
9,1,1014316,161455,C,CG,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
10,1,1014359,161454,G,T,ISG15:9636,Pathogenic,Immunodeficiency_38_with_basal_ganglia_calcification
25,1,1022225,243036,G,A,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
27,1,1022313,243037,A,T,AGRN:375790,Pathogenic,Congenital_myasthenic_syndrome
...,...,...,...,...,...,...,...,...
102112,3,172448408,7633,C,T,GHSR:2693,Pathogenic,"Short_stature,_idiopathic,_autosomal"
102179,3,177046179,559628,A,AT,TBL1XR1:79718,Pathogenic,Fitzsimmons-Guilbert_syndrome
102224,3,179199073,585023,T,C,PIK3CA:5290,Pathogenic,"CAPILLARY_MALFORMATION_OF_THE_LOWER_LIP,_LYMPHATIC_MALFORMATION_OF_FACE_AND_NECK,_ASYMMETRY_OF_FACE_AND_LIMBS,_AND_PARTIAL/GENERALIZED_OVERGROWTH,_SOMATIC"
102274,3,179203760,376498,G,A,PIK3CA:5290,Pathogenic/Likely_pathogenic,Non-Hodgkin_lymphoma|Neoplasm_of_the_breast|Neoplasm_of_the_large_intestine|Squamous_cell_carcinoma_of_the_head_and_neck|Malignant_melanoma_of_skin|Uterine_cervical_neoplasms|Glioblastoma|Cowden_syndrome|Malignant_neoplasm_of_body_of_uterus|not_provided


In [35]:
df.CHROM.value_counts()

CHROM
2    4409
1    2650
3    2036
Name: count, dtype: int64

## Chromosome 1

In [36]:
chrom_1 = df[df.CHROM=='1']
print("num of mutations associated with chromosome 1: ", len(chrom_1))
# Number of unique genes with mutations associated with chromosome 1
print("num of unique genes associated with chrom 1 mutations: ", len(chrom_1.GENEINFO.unique()))


num of mutations associated with chromosome 1:  2650
num of unique genes associated with chrom 1 mutations:  342


In [37]:
#display of most common genes associated with chrom 1 mutations
chrom_1.GENEINFO.value_counts()

GENEINFO
USH2A:7399                135
ASPM:259266               130
LMNA:4000                  94
ABCA4:24                   93
FH:2271                    90
                         ... 
ERMAP:114625                1
PTPRF:5792                  1
MUTYH:4595|TOE1:114034      1
TOE1:114034                 1
IGSF3:3321                  1
Name: count, Length: 342, dtype: int64

In [38]:
#to determine the common diseases associated with mutations in chromosome 1
print('Diseases Associated with Chromosome 1:',chrom_1.CLNDN.value_counts())

Diseases Associated with Chromosome 1: CLNDN
Primary_autosomal_recessive_microcephaly_5                                                                 102
Usher_syndrome,_type_2A|Retinitis_pigmentosa_39                                                             48
Inborn_genetic_diseases                                                                                     45
Hereditary_cancer-predisposing_syndrome                                                                     36
Chédiak-Higashi_syndrome                                                                                    34
                                                                                                          ... 
Stargardt_disease_1|Retinitis_pigmentosa_19|not_provided                                                     1
Visual_impairment|Central_scotoma|Macular_degeneration|Retinal_atrophy|Stargardt_disease_1|not_provided      1
Bile_acid_synthesis_defect,_congenital,_5                          

**Key Takeaways from Chromosome 1**
- Num of mutations associated with chromosome 1: 2650
- Num of unique genes associated with chrom 1 mutations: 342
- The most common genes associated with chromosome 1 mutations from highest are:
  - USH2A:7399
  - ASPM:259266
  - LMNA:4000
  - ABCA4:24
  - FH:2271
- The most common diseases associated with chrom 1 mutations from highest are:
  - Primary_autosomal_recessive_microcephaly_5
  - Usher_syndrome,_type_2A|Retinitis_pigmentosa_39
  - Hereditary_cancer-predisposing_syndrome
  - Chédiak-Higashi_syndrome



## Chromosome 2

In [39]:
chrom_2 = df[df.CHROM=='2']
print("num of mutations associated with chromosome 2: ", len(chrom_2))
print('\n','Diseases Associated with Chromosome 2:',chrom_2.CLNDN.value_counts())
print("num of unique genes associated with chrom 2 mutations: ", len(chrom_2.GENEINFO.unique()))
print('\n',chrom_2.GENEINFO.value_counts())


num of mutations associated with chromosome 2:  4409

 Diseases Associated with Chromosome 2: CLNDN
Lynch_syndrome                                                                                                                                                                                     357
Ehlers-Danlos_syndrome,_type_4                                                                                                                                                                     314
Primary_pulmonary_hypertension                                                                                                                                                                     297
Hereditary_cancer-predisposing_syndrome                                                                                                                                                            252
Severe_myoclonic_epilepsy_in_infancy                                                                    

**Key Takeaways from Chromosome 2**
- Num of mutations associated with chromosome 2: 4409
- Num of unique genes associated with chrom 2 mutations: 256
- The most common genes associated with chromosome 2 mutations from highest are:
  - MSH2:4436                            
  - MSH6:2956                            
  - COL3A1:1281                          
  - BMPR2:659                            
  - SCN1A:6323|LOC102724058:102724058
- The most common diseases associated with chrom 2 mutations from highest are:
  - Lynch_syndrome
  - Ehlers-Danlos_syndrome,_type_4
  - Primary_pulmonary_hypertension
  - Hereditary_cancer-predisposing_syndrome
  - Severe_myoclonic_epilepsy_in_infancy

## Chromosome 3

In [40]:
chrom_3 = df[df.CHROM=='3']

print("num of mutations associated with chromosome 3: ", len(chrom_3))
print('\n', 'Diseases Associated with Chromosome 3:', chrom_3.CLNDN.value_counts())
print("num of unique genes associated with chrom 3 mutations: ", len(chrom_3.GENEINFO.unique()))
print('\n', chrom_3.GENEINFO.value_counts())


num of mutations associated with chromosome 3:  2036

 Diseases Associated with Chromosome 3: CLNDN
Lynch_syndrome                                                                                                                                      310
Biotinidase_deficiency                                                                                                                              101
Hereditary_cancer-predisposing_syndrome                                                                                                              57
Von_Hippel-Lindau_syndrome                                                                                                                           46
Deficiency_of_ferroxidase                                                                                                                            42
                                                                                                                                            

**Key Takeaways from Chromosome 3**

- Num of mutations associated with chromosome 3: 2036
- Num of unique genes associated with chrom 3 mutations: 177
- The most common genes associated with chromosome 3 mutations from highest are:
  - MLH1:4292
  - BTD:686
  - SCN5A:6331
  - VHL:7428|LOC107303340:107303340
  - CASR:846
- The most common diseases associated with chrom 3 mutations from highest are:
  - Lynch_syndrome
  - Biotinidase_deficiency
  - Hereditary_cancer-predisposing_syndrome
  - Von_Hippel-Lindau_syndrome
  - Deficiency_of_ferroxidase
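The three per-chromosome summaries repeat the same steps, so they could also be gathered in one pass with `groupby`. This is a minimal sketch on a toy frame: the rows are made up for illustration (only the gene labels come from this notebook), and the result column names `n_mutations`, `n_unique_genes`, and `top_gene` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned DataFrame (rows invented for illustration)
toy = pd.DataFrame({
    "CHROM": ["1", "1", "1", "2", "2", "2", "3"],
    "GENEINFO": [
        "ISG15:9636", "ISG15:9636", "AGRN:375790",
        "MSH2:4436", "MSH2:4436", "MSH6:2956",
        "MLH1:4292",
    ],
})

summary = toy.groupby("CHROM").agg(
    n_mutations=("GENEINFO", "size"),        # rows per chromosome
    n_unique_genes=("GENEINFO", "nunique"),  # distinct GENEINFO labels
    top_gene=("GENEINFO", lambda s: s.value_counts().idxmax()),
)
print(summary)
```

Applied to the real `df`, the same call should reproduce the mutation and unique-gene counts listed in the takeaways, provided `GENEINFO` has no missing values at that point.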

## Gene MSH2:4436  

In [41]:
MSH2=df[df.GENEINFO=='MSH2:4436']
len(MSH2)

608

In [42]:
MSH2.POS.value_counts()

POS
47463086    5
47403373    3
47429929    2
47412448    2
47410370    2
           ..
47414381    1
47414377    1
47414370    1
47414368    1
47482812    1
Name: count, Length: 540, dtype: int64

In [43]:
MSH2.CLNDN.value_counts()

CLNDN
Lynch_syndrome                                                                                                                                                                   257
Hereditary_cancer-predisposing_syndrome                                                                                                                                           99
Hereditary_cancer-predisposing_syndrome|Lynch_syndrome                                                                                                                            48
Hereditary_nonpolyposis_colon_cancer                                                                                                                                              38
Lynch_syndrome|not_provided                                                                                                                                                       20
Hereditary_cancer-predisposing_syndrome|Lynch_syndrome|not_provided                      

**Key Takeaways from MSH2:4436 on Chromosome 2**
- The positions with the most mutations in gene 'MSH2:4436' are 47463086 (5 records) and 47403373 (3 records).
- Lynch_syndrome and hereditary cancers are the most common diseases associated with gene 'MSH2:4436'.
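Many `CLNDN` values combine several disease names with `|`, so the same disease is spread over multiple labels in `value_counts()`. A minimal sketch of counting each disease separately follows. The toy series uses label shapes seen in the MSH2 output, but the rows and counts are invented.

```python
import pandas as pd

# Toy CLNDN column (labels in the style of the MSH2 output, counts invented)
clndn = pd.Series([
    "Lynch_syndrome",
    "Hereditary_cancer-predisposing_syndrome|Lynch_syndrome",
    "Lynch_syndrome|not_provided",
    "Hereditary_cancer-predisposing_syndrome",
])

# Split each label on '|' and give every disease name its own row before counting
per_disease = clndn.str.split("|").explode().value_counts()
print(per_disease)
```

With this view a record labeled `Hereditary_cancer-predisposing_syndrome|Lynch_syndrome` contributes to both diseases instead of forming a third category.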

# Assumptions:
4) What assumptions are made during this exploratory analysis?



**1. Filtering Clinical Significance Records:**
   - I extract only `GENEINFO`, `CLNSIG`, and `CLNDN` from the `INFO` column because those are the only features needed.
   - I filter the DataFrame to include only records with these clinical significance classifications:
      - 'Pathogenic'
      - 'Pathogenic/Likely_pathogenic'
      - 'Pathogenic,_risk_factor'
      - 'Pathogenic/Likely_pathogenic,_other'
      - 'Pathogenic,_other'
      - 'Pathogenic,_Affects'
      - 'Pathogenic,_protective'
      - 'Pathogenic,_association,_protective'
      - 'Likely_pathogenic,_association'
      - 'Likely_pathogenic,_other'
      - 'Likely_pathogenic,_risk_factor'
      
      This is to focus on pathogenic mutations.
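Since the filter is a hand-written list, it can help to check which labels mentioning "pathogenic" fall outside it. The sketch below is an audit idea and not a step of the analysis. `PATHOGENIC_LABELS` is a hypothetical name for the list used in the filtering cell, and the toy `clnsig` series is invented. In the real data, the earlier `value_counts()` output suggests that plain `Likely_pathogenic` (6269 records) is one such label.

```python
import pandas as pd

# Labels kept by the filtering cell (copied from that cell)
PATHOGENIC_LABELS = [
    "Pathogenic", "Pathogenic/Likely_pathogenic", "Pathogenic,_risk_factor",
    "Pathogenic/Likely_pathogenic,_other", "Pathogenic,_other",
    "Pathogenic,_Affects", "Pathogenic,_protective",
    "Pathogenic,_association,_protective", "Likely_pathogenic,_association",
    "Likely_pathogenic,_other", "Likely_pathogenic,_risk_factor",
]

# Toy CLNSIG column (invented rows)
clnsig = pd.Series([
    "Pathogenic", "Likely_pathogenic", "Benign",
    "Pathogenic/Likely_pathogenic",
    "Conflicting_interpretations_of_pathogenicity",
    "Likely_pathogenic,_other",
])

# Labels that mention 'pathogenic' (any case) but are not in the whitelist
mentions = clnsig[clnsig.str.lower().str.contains("pathogenic", regex=False)]
excluded = sorted(set(mentions) - set(PATHOGENIC_LABELS))
print(excluded)
```

Whether the excluded labels should be kept is a judgment call for the analysis. The check only makes the choice visible.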

**2. Handling Null Values:**
   - Records with NaNs or null values will be removed. My boss wants information only about mutations with complete data:
      1. Gene name
      2. Mutation ID number
      3. Mutation Position (chromosome & position)
      4. Mutation value (reference & alternate bases)
      5. Clinical significance (CLNSIG)
      6. Disease implicated
   - I made another df replacing the missing values with 'Not_Given', but I do not continue my analysis with this df.

**3. Removal of Duplicates:**
   - Duplicates based on unique mutation IDs will be removed, although none were found.

**4. Unit of Observation:**
   - Each row represents a single mutation, and there is no overlap with mutations.
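The unit-of-observation assumption can be checked directly. The MSH2 output shows several records at the same position (for example 47463086), so the sketch below separates "same position" from "same variant" using `(CHROM, POS, REF, ALT)` as the key. The REF and ALT values here are invented for illustration, and this check is not part of the analysis.

```python
import pandas as pd

# Toy rows: two records share a position but differ in the alternate base
toy = pd.DataFrame({
    "CHROM": ["2", "2", "2"],
    "POS": ["47463086", "47463086", "47403373"],
    "REF": ["G", "G", "C"],   # invented values
    "ALT": ["A", "T", "T"],   # invented values
})

same_position = toy.duplicated(subset=["CHROM", "POS"], keep=False)
same_variant = toy.duplicated(subset=["CHROM", "POS", "REF", "ALT"], keep=False)

print("records sharing a position:", int(same_position.sum()))
print("records that are the same variant:", int(same_variant.sum()))
```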

**5. Representativeness of Presented Dataset:**
   - I assume the presented dataset is representative of the original, and that the modifications have not affected the reliability of the data.

**6. Independence of Mutations in Different Chromosomes:**
   - I assume mutations in one chromosome do not influence those in another.

4) Findings / What would you present to your boss?

See the Deliverables section.

# Deliverables

In [None]:
#the following are the first 100 mutations organized by ID in the data set
df.sort_values(by=['ID']).head(100)

In [None]:
#below is the frequency of mutations associated with each chromosome 
plt.hist(df.CHROM)
plt.xlabel('Chromosome Number')
plt.ylabel('Frequency')
plt.title('Histogram of Chromosome Frequencies')

In [None]:
#mutations surrounding MSH2:4436
MSH2.head()

## Final report
In this analysis I started off by cleaning the data and isolating pathogenic mutations according to the clinical significance column. I defined dangerous mutations as those labeled Pathogenic, Pathogenic/Likely_pathogenic, or a pathogenic label combined with another annotation (for example risk_factor or other). The first 100 of these mutations can be seen above in the first deliverable table. There were a total of 9095 mutations after filtering according to the criteria you had provided me with. To better assess which mutations are important for our company, I decided to divide the mutations by chromosome and analyze each group separately.

The analysis of mutations on Chromosome 1 revealed a total of 2,650 mutations, associated with 342 unique genes. Notable genes include USH2A, ASPM, LMNA, ABCA4, and FH. The most common diseases linked to Chromosome 1 mutations include: Primary Autosomal Recessive Microcephaly 5, Usher Syndrome Type 2A with Retinitis Pigmentosa 39, Hereditary Cancer-Predisposing Syndrome, and Chédiak-Higashi Syndrome.

Chromosome 2 exhibits a higher mutation count, totaling 4,409, with 256 unique genes involved. Predominant genes include MSH2, MSH6, COL3A1, BMPR2, and SCN1A. The prevalent diseases associated with Chromosome 2 mutations encompass Lynch Syndrome, Ehlers-Danlos Syndrome Type 4, Primary Pulmonary Hypertension, Hereditary Cancer-Predisposing Syndrome, and Severe Myoclonic Epilepsy in Infancy.

Lastly, with respect to Chromosome 3, the analysis identifies 2,036 mutations associated with 177 unique genes. Key genes include MLH1, BTD, SCN5A, VHL, and CASR. The prevalent diseases linked to Chromosome 3 mutations comprise Lynch Syndrome, Biotinidase Deficiency, Hereditary Cancer-Predisposing Syndrome, Von Hippel-Lindau Syndrome, and Deficiency of Ferroxidase.

Overall, as you can see in the histogram in the Deliverables section, chromosome 2 included the most mutations in the data set. More importantly, I identified gene MSH2 on chromosome 2, which had a total of 608 mutations associated with it. This is more than any other gene in our data set. Therefore, I decided to delve deeper into chromosome 2, and into MSH2 in particular. In doing so I found that positions 47463086 and 47403373 are associated with the highest number of MSH2 mutations. The predominant diseases linked to MSH2 include Lynch Syndrome and Hereditary Cancers. These findings provide insight into the specific genomic locations and diseases related to the 'MSH2:4436' gene on Chromosome 2. Our company may benefit from investing more resources in investigating MSH2 on chromosome 2, with a particular focus on positions 47463086 and 47403373.







