# What is this about?

Notebook for https://www.kaggle.com/alsgroup/end-als

We prepare a list of genes that "interact" with known ALS (amyotrophic lateral sclerosis) genes.
The list of ALS genes is taken from: https://alsod.ac.uk/

Results are saved to: https://github.com/chervov/genes/blob/main/genes_interacting_with_ALS_genes_by_BIOGRID.csv
They can be downloaded like this:

genes_interacting_with_ALS_genes_by_BIOGRID = pd.read_csv('https://raw.githubusercontent.com/chervov/genes/main/genes_interacting_with_ALS_genes_by_BIOGRID.csv')


Interactions are taken from the BIOGRID database.

We get 7012 genes interacting with ALS genes and order them by number of interactions.
Some of the ALS genes themselves appear at the top of the list, which gives some indication that the list is meaningful.

However, it is not completely clear how meaningful the ranking is. The top gene is UBC, one of the sources of ubiquitin: "Ubiquitin is a small protein that exists in all eukaryotic cells. It performs its myriad functions through conjugation to a large range of target proteins".
So it is probably just a top interacting gene in general, not specific to ALS; this needs to be checked.
The same may apply to other top genes in the list.
Another example: TP53 is a famous gene, the most studied ever, but it is cancer related (the "guardian of the genome"), and it is not clear whether its relation to ALS is meaningful.
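
One possible way to run this check is to compare, for each partner gene, the number of its interaction records that involve an ALS gene with the total number of its human interaction records. The sketch below is only an illustration under assumptions: it uses a tiny made-up table with the same two symbol columns as the BIOGRID file, the helper name `als_interaction_share` is hypothetical, and `GENE1`, `GENE2`, `GENE3` are placeholder symbols rather than real results.

```python
import pandas as pd

COL_A = 'Official Symbol Interactor A'
COL_B = 'Official Symbol Interactor B'


def als_interaction_share(df, genes_als):
    """Per gene: total records, records with an ALS-gene partner, and their ratio.

    Hypothetical helper. `df` is assumed to be already restricted to
    human-human interaction records.
    """
    # Look at every record from both sides: (A, B) and (B, A).
    pairs = pd.concat(
        [
            df[[COL_A, COL_B]].rename(columns={COL_A: 'gene', COL_B: 'partner'}),
            df[[COL_A, COL_B]].rename(columns={COL_B: 'gene', COL_A: 'partner'}),
        ],
        ignore_index=True,
    )
    total = pairs.groupby('gene').size()
    with_als = pairs[pairs['partner'].isin(genes_als)].groupby('gene').size()
    out = pd.DataFrame({'total': total, 'with_als': with_als}).fillna(0)
    out['with_als'] = out['with_als'].astype(int)
    out['share'] = out['with_als'] / out['total']
    return out.sort_values('share', ascending=False)


# Made-up toy records: UBC is a hub, GENE1 has fewer partners overall.
toy = pd.DataFrame({
    COL_A: ['SOD1', 'UBC', 'UBC', 'UBC', 'GENE1'],
    COL_B: ['UBC', 'GENE1', 'GENE2', 'GENE3', 'SOD1'],
})
result = als_interaction_share(toy, genes_als=['SOD1'])
print(result)
```

In this toy table UBC and GENE1 have the same raw count of records with an ALS gene, but the hub gene UBC gets a lower share. On the real data, a gene with a high count but a low share would be a candidate "general hub" rather than an ALS-specific partner.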

Many things can be analysed further. For the moment, this is what we have.


----------

Information on the BIOGRID file:

Protein-protein interaction data file from the BIOGRID database.

File: BIOGRID-ALL-4.3.195.tab3.txt

Almost 2 million interaction records in the dataset.

More than 70 organisms.

About 20K genes for human, 8K for mouse, etc.
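
Summary numbers like these can be recomputed from the table itself. The following is a minimal sketch, assuming the organism and symbol column names used later in this notebook; the helper name `genes_per_organism` and the three toy records are illustrative assumptions, not data from the BIOGRID file.

```python
import pandas as pd


def genes_per_organism(df):
    """Count distinct gene symbols per organism over both interactor columns.

    Hypothetical helper for illustration.
    """
    side_a = df[['Organism Name Interactor A', 'Official Symbol Interactor A']].rename(
        columns={'Organism Name Interactor A': 'organism',
                 'Official Symbol Interactor A': 'symbol'})
    side_b = df[['Organism Name Interactor B', 'Official Symbol Interactor B']].rename(
        columns={'Organism Name Interactor B': 'organism',
                 'Official Symbol Interactor B': 'symbol'})
    long = pd.concat([side_a, side_b], ignore_index=True)
    return long.groupby('organism')['symbol'].nunique().sort_values(ascending=False)


# Made-up toy records with the same column names as the BIOGRID table.
toy = pd.DataFrame({
    'Official Symbol Interactor A': ['SOD1', 'TP53', 'Sod1'],
    'Official Symbol Interactor B': ['UBC', 'SOD1', 'Trp53'],
    'Organism Name Interactor A': ['Homo sapiens', 'Homo sapiens', 'Mus musculus'],
    'Organism Name Interactor B': ['Homo sapiens', 'Homo sapiens', 'Mus musculus'],
})
counts = genes_per_organism(toy)
print(len(counts), 'organisms')
print(counts)
```

Applied to the full `df` loaded below, `len(counts)` would give the number of organisms and `counts['Homo sapiens']` the number of distinct human gene symbols.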



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
i = 0
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        i += 1
        if i <= 5:  # print only the first 5 file paths
            print(os.path.join(dirname, filename))
print('Printed', min(i, 5), 'filenames out of', i)
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
url = 'https://raw.githubusercontent.com/chervov/genes/main/genes_ALS_from_alsod_ac_uk.csv'
df_genes_alsod = pd.read_csv(url)
df_genes_alsod

In [None]:

# Protein protein interactions data
fn = '/kaggle/input/protein-protein-interactions/BIOGRID-ALL-4.3.195.tab3/BIOGRID-ALL-4.3.195.tab3.txt'
df = pd.read_csv(fn, sep ='\t')
df

In [None]:
genes_ALS = list( df_genes_alsod['Gene symbol'])
print(len(genes_ALS), genes_ALS )

In [None]:
# Keep only human-human interactions; maskA / maskB select records with an ALS gene on side A / B.
is_human = (df['Organism Name Interactor A'] == 'Homo sapiens') & (df['Organism Name Interactor B'] == 'Homo sapiens')
maskA = df['Official Symbol Interactor A'].isin(genes_ALS) & is_human
maskB = df['Official Symbol Interactor B'].isin(genes_ALS) & is_human
print(maskA.sum(), maskB.sum())
# Partner genes seen on both sides (intersection) and on either side (union).
partners_A = set(df['Official Symbol Interactor A'][maskB])
partners_B = set(df['Official Symbol Interactor B'][maskA])
print(len(partners_A & partners_B))
print(len(partners_A | partners_B))

In [None]:
print( df[maskA]['Official Symbol Interactor B'].value_counts() )

In [None]:
print( df[maskB]['Official Symbol Interactor A'].value_counts() )

In [None]:
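# Name the counts explicitly: pandas >= 2.0 names value_counts() output 'count',
# which would make the join below fail on overlapping column names.
d1 = df[maskA]['Official Symbol Interactor B'].value_counts().rename('Official Symbol Interactor B')
d2 = df[maskB]['Official Symbol Interactor A'].value_counts().rename('Official Symbol Interactor A')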
d = d1.to_frame().join(d2, how = 'outer')
d = d.fillna(0)
d['Count All Interactions'] = d.iloc[:,0] + d.iloc[:,1]
d = d.join( df_genes_alsod.set_index('Gene symbol') )
d = d.sort_values('Count All Interactions',ascending = False)
d = d[ ['Count All Interactions', 'Gene name', 'Category' , 'Official Symbol Interactor A', 'Official Symbol Interactor B']  ]
d.columns = ['Count All BIOGRID Interactions', 'Gene name', 'Relation to ALS' , 'Count Left Interactions', 'Count Right Interactions']  

print(d.columns)
d.index.name = 'Gene symbol'
d.head(30)


In [None]:
d.to_csv('genes_interacting_with_ALS_genes_by_BIOGRID.csv')

In [None]:
import matplotlib.pyplot as plt


In [None]:
plt.figure(figsize = (20,6))
plt.plot(d['Count All BIOGRID Interactions'].values, '*-')  # column was renamed above
plt.show()

In [None]:
d['Count All BIOGRID Interactions'].describe()

In [None]:
# Check that the saved result can be downloaded from GitHub
genes_interacting_with_ALS_genes_by_BIOGRID = pd.read_csv('https://raw.githubusercontent.com/chervov/genes/main/genes_interacting_with_ALS_genes_by_BIOGRID.csv')
genes_interacting_with_ALS_genes_by_BIOGRID