# THESIS - Preprocessing 


Four datasets are used:

- RNA-seq for HEK293 cells (Sun et al) --> HEK  
- RNA-seq for HepG2 (Wold, ENCODE) --> Hep 
- ENCODE eCLIP for protein-RNA interactions (Wold, ENCODE)
- miCLIP for m6A modifications (HEK293 cells, Linder et al)

First, the RNA-seq data are filtered and combined to obtain the set of genes that are expressed in both the HEK293 and HepG2 cell lines.


The preprocessing works on dataframes, for which the pandas library is well suited.

In [1]:
import pandas as pd 
import numpy as np 
from pybedtools import BedTool
import pybedtools
#import pybiomart
import scanpy as sc



# 1. The dataset RNA-seq for HEK293 cells (Sun et al)
The HEK293 RNA-seq dataset (Sun et al) is loaded into the dataframe dfHEK.
The function len() shows that it contains 57905 rows.
Transcripts that are not expressed must be removed: 32396 rows remain after filtering.
Note that the two RNA-seq files use different genome versions, so one of them must be lifted to match the other. This is done with an R script that uses the function useMart() from the biomaRt package (TODO: maybe copy the script here).
I have decided to lift HEK293 to hg38/GRCh38 and to use the lifted version for the comparison.
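
The filtering step described above can be sketched on a toy table. This is only a minimal sketch and not the exact filter used in this notebook: it assumes that a gene counts as expressed when at least one raw-count column is non-zero, and the column names and values below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the HEK293 table; values are invented for illustration.
toy = pd.DataFrame({
    "gene_id": ["ENSG_A", "ENSG_B", "ENSG_C"],
    "HEK293NK_SEQ1": [2, 0, 0],
    "HEK293NK_SEQ2": [1, 0, 5],
    "HEK293NK_SEQ1_RPKM": [0.03, 0.0, 0.0],
})

# Assumption: raw-count columns are those that are neither the identifier nor RPKM.
count_cols = [c for c in toy.columns if c != "gene_id" and not c.endswith("_RPKM")]

# Keep a gene if any replicate has a non-zero count.
expressed = toy[(toy[count_cols] != 0).any(axis=1)]
print(len(toy), "->", len(expressed))
```

Selecting the count columns by name pattern avoids listing every replicate by hand, but the pattern has to be adapted to the real column names.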


In [68]:
dfHEK = pd.read_excel("HEK293.xlsx")

gene_id                                                 ENSG00000257341
length                                                             5069
HEK293NK-SEQ1                                                         2
HEK293NK-SEQ1_RPKM                                                 0.03
HEK293NK-SEQ2                                                         1
HEK293NK-SEQ2_RPKM                                                 0.01
HEK293NK-SEQ3                                                         1
HEK293NK-SEQ3_RPKM                                                 0.01
HEK293-SEQ1                                                           0
HEK293-SEQ1_RPKM                                                    0.0
HEK293-SEQ2                                                           0
HEK293-SEQ2_RPKM                                                    0.0
HEK293-SEQ3                                                           0
HEK293-SEQ3_RPKM                                                

In [69]:
#check if the gene_id actually changed correctly: at position 22859 --> 'ENSG00000005955'
dfHEK.loc[22859]

gene_id                                                 ENSG00000005955
length                                                            45542
HEK293NK-SEQ1                                                      3534
HEK293NK-SEQ1_RPKM                                                 6.65
HEK293NK-SEQ2                                                      3664
HEK293NK-SEQ2_RPKM                                                 6.54
HEK293NK-SEQ3                                                      4111
HEK293NK-SEQ3_RPKM                                                 6.94
HEK293-SEQ1                                                        3181
HEK293-SEQ1_RPKM                                                   6.56
HEK293-SEQ2                                                        3078
HEK293-SEQ2_RPKM                                                    6.3
HEK293-SEQ3                                                        3267
HEK293-SEQ3_RPKM                                                

In [70]:
len(dfHEK)

57905

In [71]:
#dropping transcripts without expression.
dfHEK.columns = dfHEK.columns.str.replace('-', '_')
dfHEK = dfHEK[(dfHEK.HEK293NK_SEQ1 != 0.00) | (dfHEK.HEK293NK_SEQ2 != 0.00)]
len(dfHEK)

32396

In [76]:
dfVersion = pd.read_excel("inconsistenciesENSEMBL_noNaN.xlsx", usecols = "B, C")

dictionary = pd.Series(dfVersion['ensembl_gene_id.y'].values,index=dfVersion['ensembl_gene_id.x']).to_dict()

#Problem: the mapping has duplicate keys: for example ENSG00000257341 is linked both to 'ENSG00000257341' and 'ENSG00000213145' --> a dictionary keeps only one value per key,
#so the function replace is using the same ENSG00000257341, ignoring the other.
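
One way to make the duplicate-key problem visible before building the dictionary is to count how many distinct targets each source id has. The sketch below uses placeholder ids rather than real lift-over results, and it assumes that ambiguous ids are set aside for manual review instead of being resolved automatically.

```python
import pandas as pd

# Toy mapping table; the IDs are placeholders, not real lift-over results.
mapping = pd.DataFrame({
    "ensembl_gene_id.x": ["OLD1", "OLD1", "OLD2"],
    "ensembl_gene_id.y": ["OLD1", "NEW1b", "NEW2"],
})

# Count how many distinct targets each source ID maps to.
n_targets = mapping.groupby("ensembl_gene_id.x")["ensembl_gene_id.y"].nunique()
ambiguous_ids = n_targets[n_targets > 1].index.tolist()

# Assumption: ambiguous IDs are excluded here and reviewed by hand later.
unambiguous = mapping[~mapping["ensembl_gene_id.x"].isin(ambiguous_ids)]
id_map = dict(zip(unambiguous["ensembl_gene_id.x"], unambiguous["ensembl_gene_id.y"]))
print(ambiguous_ids, id_map)
```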

In [77]:
#keep just the lifted ids --> these will be only protein-coding genes, since CCDS is used as the unique key.
#dfVersion = pd.read_excel("incostintenciesENSEMBL.xlsx")
dfHEK = dfHEK.replace({"gene_id": dictionary})

In [78]:
#check if the gene_id actually changed correctly: at position 22859 --> 'ENSG00000278311' --> it's right 
dfHEK.loc[22859]

gene_id                                                 ENSG00000278311
length                                                            45542
HEK293NK_SEQ1                                                      3534
HEK293NK_SEQ1_RPKM                                                 6.65
HEK293NK_SEQ2                                                      3664
HEK293NK_SEQ2_RPKM                                                 6.54
HEK293NK_SEQ3                                                      4111
HEK293NK_SEQ3_RPKM                                                 6.94
HEK293_SEQ1                                                        3181
HEK293_SEQ1_RPKM                                                   6.56
HEK293_SEQ2                                                        3078
HEK293_SEQ2_RPKM                                                    6.3
HEK293_SEQ3                                                        3267
HEK293_SEQ3_RPKM                                                

In [79]:
#dataframe without unexpressed genes and with genome version hg38 
dfHEK


Unnamed: 0,gene_id,length,HEK293NK_SEQ1,HEK293NK_SEQ1_RPKM,HEK293NK_SEQ2,HEK293NK_SEQ2_RPKM,HEK293NK_SEQ3,HEK293NK_SEQ3_RPKM,HEK293_SEQ1,HEK293_SEQ1_RPKM,HEK293_SEQ2,HEK293_SEQ2_RPKM,HEK293_SEQ3,HEK293_SEQ3_RPKM,GeneSymbol,KO,GO
1,ENSG00000227232,15444,705,4.77,812,5.21,1121,6.80,732,5.43,690,5.07,804,5.39,WASH7P,K18461,_
7,ENSG00000238009,44272,1,0.01,2,0.01,0,0.00,1,0.01,3,0.02,1,0.01,RP11-34P13.7,_,_
9,ENSG00000233750,3812,1,0.01,0,0.00,0,0.00,0,0.00,0,0.00,1,0.01,CICP27,K12581,_
10,ENSG00000237683,4479,4,0.04,10,0.09,7,0.06,9,0.09,12,0.12,11,0.10,AL627309.1,_,_
14,ENSG00000241860,32389,13,0.05,4,0.01,2,0.01,14,0.06,6,0.02,2,0.01,RP11-34P13.13,_,_
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57433,ENSG00000231341,855,2,0.06,1,0.03,2,0.05,0,0.00,1,0.03,1,0.03,VDAC1P6,K05862,_
57434,ENSG00000235001,1220,19,0.19,7,0.07,12,0.11,8,0.09,9,0.10,8,0.08,EIF4A1P2,K03257,_
57589,ENSG00000215414,741,17,0.29,9,0.14,17,0.26,8,0.15,10,0.18,10,0.17,PSMA6P1,K02730,_
57691,ENSG00000185275,243,636,64.62,682,65.72,317,28.92,361,40.24,394,43.54,355,35.78,CD24P4,K06469,_


'''I tried to input just the expressed genes into the R script, but the problem persists:
the gene_id column in HEK293 has 57905 entries (32396 now), but
once the merge function is applied in the script, only 29285 elements are kept.
NB: out37 and out38 contain 29107 and 28126 CCDS respectively,
which means that not all of the gene_ids contained in the file HEK293 have a corresponding CCDS in Mart37 or Mart38.
I have checked for duplicates in HEK293 and there are just 2.
The problem must be related to the number of CCDS.
"The Consensus CDS (CCDS) project is a collaborative effort to identify a core set of human and mouse protein CODING REGIONS
that are consistently annotated and of high quality.
The long term goal is to support convergence towards a standard set of gene annotations." [https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi]'''
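
The R merge itself is not shown here, but the same check can be sketched in pandas: a left merge with an indicator column lists exactly which gene ids have no CCDS partner. The table and column names below (hek, mart, ccds) are hypothetical stand-ins for the HEK293 table and the Mart output.

```python
import pandas as pd

# Toy tables; IDs and CCDS values are placeholders.
hek = pd.DataFrame({"gene_id": ["G1", "G2", "G3", "G4"]})
mart = pd.DataFrame({"gene_id": ["G1", "G3"], "ccds": ["CCDS1", "CCDS3"]})

# A left merge keeps every HEK293 row; `_merge` records whether a match was found.
merged = hek.merge(mart, on="gene_id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "gene_id"].tolist()
print(unmatched)
```

Inspecting the unmatched ids (for example their biotype) would show whether the missing rows are simply non-coding genes, which have no CCDS by definition.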



# 2. The dataset RNA-seq for HepG2 (Wold, ENCODE)
The HepG2 RNA-seq dataset (Wold, ENCODE) is loaded into the dataframe dfHep and processed analogously, removing non-expressed transcripts. The original table has 207507 rows, and 96044 remain after filtering. Since the table has one row per transcript_id, these are transcript-level rows, so the number of distinct gene identifiers may be smaller.
Note: the gene_ids are not all in ENSEMBL format (for example the numeric and spike-in identifiers). This is not a problem.


In [80]:
dfHep = pd.read_excel("HepG2.xlsx")
print(len(dfHep))
dfHep

207507


Unnamed: 0,transcript_id,gene_id,length,effective_length,expected_count,TPM,FPKM,IsoPct,posterior_mean_count,posterior_standard_deviation_of_count,pme_TPM,pme_FPKM,IsoPct_from_pme_TPM,TPM_ci_lower_bound,TPM_ci_upper_bound,TPM_coefficient_of_quartile_variation,FPKM_ci_lower_bound,FPKM_ci_upper_bound,FPKM_coefficient_of_quartile_variation
0,10904,10904,93,0,0.0,0.00,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,12954,12954,94,0,0.0,0.00,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,12956,12956,72,0,0.0,0.00,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,12958,12958,82,0,0.0,0.00,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,12960,12960,73,0,0.0,0.00,0.00,0.0,0.0,0.0,0.00,0.00,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
207502,tSpikein_ERCC-00165,gSpikein_ERCC-00165,872,773,182.0,4.79,5.10,100.0,182.0,0.0,4.70,5.11,100.0,4.027980,5.379590,0.049978,4.379740,5.849960,0.049976
207503,tSpikein_ERCC-00168,gSpikein_ERCC-00168,1024,925,1.0,0.02,0.02,100.0,1.0,0.0,0.04,0.05,100.0,0.001247,0.102962,0.475355,0.001355,0.111972,0.475281
207504,tSpikein_ERCC-00170,gSpikein_ERCC-00170,1023,924,68.0,1.50,1.60,100.0,68.0,0.0,1.48,1.61,100.0,1.146680,1.843870,0.080920,1.243590,2.001790,0.080810
207505,tSpikein_ERCC-00171,gSpikein_ERCC-00171,505,406,7125.0,357.14,380.37,100.0,7125.0,0.0,348.37,378.75,100.0,340.237000,356.454000,0.007965,370.223000,387.840000,0.007956


In [81]:
#dropping all transcripts that are not expressed 
dfHep = dfHep[dfHep.TPM != 0.00]
len(dfHep)
#check the shape of the set

96044
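
Because the HepG2 table has one row per transcript, the count above refers to transcripts. If a gene-level view is needed, one option is to sum TPM over the transcripts of each gene before filtering. This is a sketch on invented values, assuming that the summed TPM is an acceptable gene-level measure for this purpose.

```python
import pandas as pd

# Toy transcript-level table in the spirit of the HepG2 file; values are invented.
tx = pd.DataFrame({
    "transcript_id": ["T1", "T2", "T3", "T4"],
    "gene_id": ["ENSG01.3", "ENSG01.3", "ENSG02.1", "ENSG03.7"],
    "TPM": [0.0, 2.5, 0.0, 1.0],
})

# Sum TPM over the transcripts of each gene, then keep genes with TPM > 0.
gene_tpm = tx.groupby("gene_id")["TPM"].sum()
expressed_genes = gene_tpm[gene_tpm > 0].index.tolist()
print(expressed_genes)
```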

# 3. Identify an intersection of genes expressed in HEK293 and HepG2 cell lines 
The objective is to compare the RNA-seq datasets on the genes that are expressed in both cell lines.
For this reason, it is crucial to inspect the gene_id formats in the two tables and ensure that they are compatible.
In this case, the identifiers in the HepG2 dataset carry a version suffix after a dot, which is absent in HEK293: this difference must be removed before the two columns are compared.

In [82]:
#dfHep was produced by filtering, so take an explicit copy first;
#this avoids the SettingWithCopyWarning when a new column is added
dfHep = dfHep.copy()

#for the comparison, since dfHep has more detailed gene_ids that also contain the version, I have decided to remove the part after the first '.'
dfHep['gene_id_new'] = dfHep['gene_id'].str.split('.', n=1).str[0]


The two sets are now ready to be intersected: the final intersection comprises 20953 elements (previous counts: 20875 [14474]).

In [83]:
#intersecting the two columns so that set_common_genes contains just the ids in common
listHEK = dfHEK['gene_id'].to_list() 
listHep = dfHep['gene_id_new'].to_list()
set_common_genes = set(listHEK).intersection(set(listHep))
len(set_common_genes)

20953

# 4. CLIP data 

At this point the CLIP files (eCLIP and miCLIP) can be loaded. The aim is to keep only the genes that are contained in the set of commonly expressed genes obtained from the RNA-seq data.
Note: the eCLIP file contains no ENSEMBL ids, only the locus --> the loci must be compared with gene annotations to obtain the ids.
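
The logic of assigning a gene id to a locus can be illustrated with plain pandas on a toy example. This is only a sketch of the idea: coordinates, gene names and column names are invented, and intervals are assumed to be 0-based and half-open as in BED files. For files of realistic size, a strand-aware bedtools intersect (the tool used through pybedtools in this notebook) is likely the more practical route.

```python
import pandas as pd

# Toy site and gene tables; coordinates and names are invented.
sites = pd.DataFrame({
    "chrom": ["chr10", "chr10", "chr11"],
    "pos": [150, 900, 50],
    "strand": ["-", "+", "+"],
})
genes = pd.DataFrame({
    "chrom": ["chr10", "chr10", "chr11"],
    "start": [100, 800, 10],
    "end": [200, 1000, 100],
    "strand": ["-", "-", "+"],
    "gene_id": ["GENE_A", "GENE_B", "GENE_C"],
})

# Pair every site with every gene on the same chromosome and strand,
# then keep pairs where the site lies inside the gene (0-based, half-open).
pairs = sites.merge(genes, on=["chrom", "strand"])
hits = pairs[(pairs["pos"] >= pairs["start"]) & (pairs["pos"] < pairs["end"])]
print(hits[["chrom", "pos", "strand", "gene_id"]])
```

The second site is dropped because the only gene at that position lies on the opposite strand, which is why the strand column should not be lost when the files are read.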

In [26]:
#seems better to use the .txt file, in the .bed file the information linked to + and - is lost. 
eCLIP = BedTool("eCLIP.bed.txt")

#eCLIP = eCLIP.to_dataframe(disable_auto_names=True, header= None)

eCLIP.head()

chr10	100176311	100176312	B045_1-4_chr10_r_c635[k=4][m=2]_AGACT	2	-	AGACT	4	0.5
 chr10	101090591	101090592	B045_1-4_chr10_f_c695[k=4][m=2]_GGACT	2	+	GGACT	4	0.5
 chr10	101370996	101370997	B045_1-4_chr10_r_c643[k=4][m=2]_GAACG	2	-	GAACG	4	0.5
 chr10	101503020	101503021	B045_1-4_chr10_f_c698[k=4][m=2]_TGACA	2	+	TGACA	4	0.5
 chr10	101515460	101515461	B045_1-4_chr10_f_c699[k=11][m=2]_GGACT	2	+	GGACT	11	0.181818
 chr10	101515593	101515594	B045_1-4_chr10_f_c701[k=8][m=2]_GGACA	2	+	GGACA	8	0.25
 chr10	101636699	101636700	B045_1-4_chr10_r_c647[k=6][m=2]_GGACC	2	-	GGACC	6	0.333333
 chr10	101949128	101949129	B045_1-4_chr10_r_c654[k=6][m=2]_GTACT	2	-	GTACT	6	0.333333
 chr10	102035014	102035015	B045_1-4_chr10_r_c657[k=10][m=2]_TCACA	2	-	TCACA	10	0.2
 chr10	102107901	102107902	B045_1-4_chr10_f_c704[k=4][m=2]_AGACG	2	+	AGACG	4	0.5
 

In [27]:
miCLIP = BedTool("miCLIP.bed")

#miCLIP = miCLIP.to_dataframe(disable_auto_names=True, header= None)

miCLIP.head()

chr10	98416554	98416555	B045_1-4_chr10_r_c635[k=4][m=2]_AGACT	1	-	chr10	98416198	98446935	ENSG00000107521.20|HPS1|mRNA;ENSG00000264610.1|MIR4685|miRNA	.	-	1
 chr10	99330834	99330835	B045_1-4_chr10_f_c695[k=4][m=2]_GGACT	1	+	chr10	99329356	99394330	ENSG00000119946.12|CNNM1|mRNA	.	+	1
 chr10	99611239	99611240	B045_1-4_chr10_r_c643[k=4][m=2]_GAACG	1	-	chr10	99610522	99620609	ENSG00000155287.11|SLC25A28|mRNA	.	-	1
 chr10	99743263	99743264	B045_1-4_chr10_f_c698[k=4][m=2]_TGACA	1	+	chr10	99659509	99756134	ENSG00000198018.7|ENTPD7|mRNA;ENSG00000119929.13|CUTC|mRNA	.	+	1
 chr10	99755703	99755704	B045_1-4_chr10_f_c699[k=11][m=2]_GGACT	1	+	chr10	99659509	99756134	ENSG00000198018.7|ENTPD7|mRNA;ENSG00000119929.13|CUTC|mRNA	.	+	1
 chr10	99755836	99755837	B045_1-4_chr10_f_c701[k=8][m=2]_GGACA	1	+	chr10	99659509	99756134	ENSG00000198018.7|ENTPD7|mRNA;ENSG00000119929.13|CUTC|mRNA	.	+	1
 chr10	99876942	99876943	B045_1-4_chr10_r_c647[k=6][m=2]_GGACC	1	-	chr10	99875577	100009947	ENSG00000107554.17|DNMBP|

In [29]:
mi_with_e = miCLIP.intersect(eCLIP) 
mi_with_e.head()
len(mi_with_e)

chr11	1010823	1010824	B045_1-4_chr11_f_c66[k=10][m=2]_GGACT	1	+	chr11	924881	1012245	ENSG00000183020.15|AP2A2|mRNA;ENSG00000222561.1|RNU6-1025P|snRNA	.	+	1
 chr11	214561	214562	B045_1-4_chr11_f_c0[k=9][m=2]_GAACT	1	+	chr11	207708	215113	ENSG00000177963.15|RIC8A|mRNA;ENSG00000283920.1|MIR6743|miRNA	.	+	1
 chr11	214595	214596	B045_1-4_chr11_f_c1[k=8][m=2]_GTACT	1	+	chr11	207708	215113	ENSG00000177963.15|RIC8A|mRNA;ENSG00000283920.1|MIR6743|miRNA	.	+	1
 chr11	252727	252728	B045_1-4_chr11_f_c4[k=22][m=2]_AGACT	1	+	chr11	236966	252984	ENSG00000185627.19|PSMD13|mRNA	.	+	1
 chr11	252887	252888	B045_1-4_chr11_f_c5[k=6][m=2]_GTACT	1	+	chr11	236966	252984	ENSG00000185627.19|PSMD13|mRNA	.	+	1
 chr11	490529	490530	B045_1-4_chr11_f_c16[k=5][m=2]_AGACC	1	+	chr11	448268	491399	ENSG00000174915.12|PTDSS2|mRNA	.	+	1
 chr11	532404	532405	B045_1-4_chr11_r_c15[k=4][m=2]_GGACC	1	-	chr11	532242	537321	ENSG00000174775.18|HRAS|mRNA	.	-	1
 chr11	567189	567190	B045_1-4_chr11_r_c17[k=6][m=3]_TTACA	1	-	chr11	56565

70

In [16]:
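#NOTE: the output {0, ..., 8} suggests that eCLIP and miCLIP were dataframes when this cell ran:
#set() over a dataframe collects its column labels, so this intersects column names, not binding sites
eCLIP = set(eCLIP)
miCLIP = set(miCLIP)
mi_with_eCLIP = eCLIP.intersection(miCLIP)
mi_with_eCLIP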

{0, 1, 2, 3, 4, 5, 6, 7, 8}

In [2]:
from pybiomart import Server

server = Server(host='http://www.ensembl.org')

dataset = (server.marts['ENSEMBL_MART_ENSEMBL']
                 .datasets['hsapiens_gene_ensembl'])

dataset.query(attributes=['ensembl_gene_id', 'external_gene_name'],
              filters={'chromosome_name': ['1','2']})

Unnamed: 0,Gene stable ID,Gene name
0,ENSG00000290825,DDX11L2
1,ENSG00000223972,DDX11L1
2,ENSG00000227232,WASH7P
3,ENSG00000278267,MIR6859-1
4,ENSG00000243485,MIR1302-2HG
...,...,...
9997,ENSG00000291147,
9998,ENSG00000220804,LINC01881
9999,ENSG00000224160,CICP10
10000,ENSG00000244528,SEPTIN14P2


In [5]:
import pybiomart  #needed here: only Server was imported above

dataset = pybiomart.Dataset(name='hsapiens_gene_ensembl',
                  host='http://www.ensembl.org')
dataset

<biomart.Dataset name='hsapiens_gene_ensembl', display_name=''>

In [9]:
annot = sc.queries.biomart_annotations(
        "hsapiens",
        ["ensembl_gene_id", "start_position", "end_position", "chromosome_name"],
    ).set_index("ensembl_gene_id")
#adata.var[annot.columns] = annot

In [4]:
annot

Unnamed: 0_level_0,start_position,end_position,chromosome_name
ensembl_gene_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ENSG00000210049,577,647,MT
ENSG00000211459,648,1601,MT
ENSG00000210077,1602,1670,MT
ENSG00000210082,1671,3229,MT
ENSG00000209082,3230,3304,MT
...,...,...,...
ENSG00000162543,20186096,20196050,1
ENSG00000134686,33323623,33431095,1
ENSG00000159023,28887091,29120046,1
ENSG00000198216,181317690,181813262,1


In [None]:
#accessing all of the files in a certain path automatically - in this case just checking .bed files from the protein HNRNPC
import os

file_names = os.listdir("C:/Users/sofia/Desktop/THESIS/PROJECT/ENCODE/HNRNPC_HepG2")
for file_name in file_names:
    # add all of the next code for each file
    pass
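
Since the .bed files sit in sub-folders such as fold-0, a recursive search may be more convenient than a flat directory listing. The sketch below uses pathlib from the standard library. The helper name list_bed_files is hypothetical, and the demonstration runs on a temporary folder rather than on the real ENCODE directory.

```python
import tempfile
from pathlib import Path

def list_bed_files(root):
    """Return all .bed files under `root`, including sub-folders, sorted by path."""
    return sorted(Path(root).rglob("*.bed"))

# Demonstration on a temporary folder that mimics a fold-0/ sub-directory.
with tempfile.TemporaryDirectory() as tmp:
    fold = Path(tmp) / "fold-0"
    fold.mkdir()
    (fold / "negative-1.fold-0.bed").write_text("chr1\t1\t2\n")
    (fold / "notes.txt").write_text("ignored\n")
    found = [p.name for p in list_bed_files(tmp)]

print(found)
```

Each returned path can be passed directly to pd.read_csv inside the loop.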


In [2]:
#how to access all of the files in an automatic way? 
df = pd.read_csv("C:/Users/sofia/Desktop/THESIS/PROJECT/ENCODE/HNRNPC_HepG2/fold-0/negative-1.fold-0.bed",sep="\t", names = ['Chromosome', 'Start', 'End', 'Gene_id', '.', '-'])
df.astype(str)  #note: the result is not assigned, so df itself is unchanged
df.head()


Unnamed: 0,Chromosome,Start,End,Gene_id,.,-
0,chr17,44396141,44396142,ENSG00000186566.13|GPATCH8|mRNA;ENSG0000028304...,.,-
1,chr12,53616956,53616957,ENSG00000267281.2|ATF7-NPFF|mRNA;ENSG000001395...,.,-
2,chr21,46241398,46241399,ENSG00000160294.11|MCM3AP|mRNA;ENSG00000239415...,.,-
3,chr11,72924243,72924244,ENSG00000137478.15|FCHSD2|mRNA;ENSG00000206638...,.,-
4,chr7,34521715,34521716,ENSG00000197085.11|NPSR1-AS1|lncRNA;ENSG000001...,.,-


In [25]:
#it doesn't work: the split yields up to 168 columns instead of 3, because one Gene_id cell can list many annotations. Is there a way to split Gene_id on ';', create a new row for every piece and copy Chr, Start, End from the original row?
annots = df['Gene_id'].str.split(';', expand=True)
annots.columns = ['biotype','gene_name','gene_type']
df_annotated = pd.concat([df, annots], axis=1)

ValueError: Length mismatch: Expected axis has 168 elements, new values have 3 elements
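
One possible answer to the question in the cell above is the pandas method explode, which turns each element of a list-valued cell into its own row and repeats the other columns. The sketch below runs on invented rows that only imitate the shape of the real table.

```python
import pandas as pd

# Toy rows in the same shape as the BED table above; values are invented.
toy = pd.DataFrame({
    "Chromosome": ["chr17", "chr12"],
    "Start": [100, 200],
    "End": [101, 201],
    "Gene_id": ["ENSG01.2|AAA|mRNA;ENSG02.1|BBB|lncRNA", "ENSG03.5|CCC|mRNA"],
})

# Split on ';' into lists, then give every list element its own row.
# The other columns (Chromosome, Start, End) are repeated automatically.
long_df = toy.assign(Gene_id=toy["Gene_id"].str.split(";")).explode("Gene_id")
long_df = long_df.reset_index(drop=True)
print(long_df)
```

No for loop is needed, and the number of annotations per row no longer matters.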

In [3]:
#since I don't want to use a for loop, should I leave all of these columns ? I would rather not. 
annots = df['Gene_id'].str.split(';', expand=True)
new = annots[0].pivot()
new

AttributeError: 'Series' object has no attribute 'pivot'

In [10]:
#use this function to get a nice df- it doesn't consider that there are more than just three cells in 'name'
def read_bindingsites_4fields(path: str) -> pd.DataFrame:
    columns = ['chrom','start','end','name','.','-']
    df = pd.read_csv(path, header=None, index_col=None, sep="\t", names=columns, \
        dtype={'chrom': str, 'start': int, 'end': int, 'name': str, '.': str, '-': str})

    # We explode annotations from the 'name' column
    annots = df['name'].str.split(';', expand=True)
    annots.columns = ['biotype','gene_name','gene_type']

    df_annotated = pd.concat([df, annots], axis=1)
    return df_annotated


In [12]:
df12= read_bindingsites_4fields("C:/Users/sofia/Desktop/THESIS/PROJECT/ENCODE/HNRNPC_HepG2/fold-0/negative-1.fold-0.bed")

ValueError: Length mismatch: Expected axis has 168 elements, new values have 3 elements

In [None]:
# I can use this once I divided the dataframe optimally
eclip_df = eclip_df[eclip_df['Gene_id'].isin(genes_list)]
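
Once every row holds a single annotation, the isin filter above can be applied after splitting the annotation into its three fields and dropping the version suffix from the gene id. This is a sketch on invented values: eclip_long and common_genes are hypothetical stand-ins for the long-format table and for set_common_genes.

```python
import pandas as pd

# Toy long-format table (one annotation per row); values are invented.
eclip_long = pd.DataFrame({
    "Chromosome": ["chr17", "chr17", "chr12"],
    "annotation": ["ENSG01.2|AAA|mRNA", "ENSG02.1|BBB|lncRNA", "ENSG03.5|CCC|mRNA"],
})
common_genes = {"ENSG01", "ENSG03"}  # placeholder for set_common_genes

# Each annotation has exactly three '|'-separated fields in this toy data.
parts = eclip_long["annotation"].str.split("|", expand=True)
parts.columns = ["gene_id", "gene_name", "gene_type"]

# Drop the version suffix (".2", ".13", ...) from the gene ID only.
parts["gene_id"] = parts["gene_id"].str.replace(r"\.\d+$", "", regex=True)

annotated = pd.concat([eclip_long[["Chromosome"]], parts], axis=1)
filtered = annotated[annotated["gene_id"].isin(common_genes)]
print(filtered)
```

Restricting the regular expression to the gene_id column leaves dots inside gene names untouched.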

In [141]:
#replacing ; to then further process the column NO
df=df.replace(regex=[';'],value='|')
df

Unnamed: 0,Chromosome,Start,End,Gene_id,.,-
0,chr17,44396141,44396142,ENSG00000186566.13|GPATCH8|mRNA|ENSG0000028304...,.,-
1,chr12,53616956,53616957,ENSG00000267281.2|ATF7-NPFF|mRNA|ENSG000001395...,.,-
2,chr21,46241398,46241399,ENSG00000160294.11|MCM3AP|mRNA|ENSG00000239415...,.,-
3,chr11,72924243,72924244,ENSG00000137478.15|FCHSD2|mRNA|ENSG00000206638...,.,-
4,chr7,34521715,34521716,ENSG00000197085.11|NPSR1-AS1|lncRNA|ENSG000001...,.,-
...,...,...,...,...,...,...
3997,chr4,84715381,84715382,ENSG00000163625.16|WDFY3|mRNA|ENSG00000252062....,.,-
3998,chr12,50539944,50539945,ENSG00000066084.13|DIP2B|mRNA|ENSG00000207136....,.,+
3999,chr11,65040348,65040349,ENSG00000213465.8|ARL2|mRNA|ENSG00000273003.1|...,.,+
4000,chr2,97880185,97880186,ENSG00000075568.17|TMEM131|mRNA|ENSG0000023871...,.,-


In [142]:
#removing the version suffix (e.g. ".13") after each id NO
#regex=True is needed: recent pandas versions treat the pattern as a literal string by default
#note: this also strips ".N" endings from gene names
df['Gene_id'] = df.Gene_id.str.replace(r'\.\d+', '', regex=True)
df['Gene_id']

0       ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|A...
1       ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574...
2       ENSG00000160294|MCM3AP|mRNA|ENSG00000239415|AP...
3       ENSG00000137478|FCHSD2|mRNA|ENSG00000206638|RN...
4       ENSG00000197085|NPSR1-AS1|lncRNA|ENSG000001970...
                              ...                        
3997    ENSG00000163625|WDFY3|mRNA|ENSG00000252062|RNU...
3998    ENSG00000066084|DIP2B|mRNA|ENSG00000207136|RNU...
3999    ENSG00000213465|ARL2|mRNA|ENSG00000273003|ARL2...
4000    ENSG00000075568|TMEM131|mRNA|ENSG00000238719|R...
4001    ENSG00000039123|MTREX|mRNA|ENSG00000039123|MTR...
Name: Gene_id, Length: 4002, dtype: object

In [143]:
#column Gene_id to list; this loses the row indices. Is there another way to do it without losing that information?
list_gene_id = df['Gene_id'].tolist()
df.head()
list_gene_id
#if one of the gene ids is found in set_common_genes, return the index? Do we need the index? What are we interested in?

['ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA|ENSG00000186566|GPATCH8|mRNA|ENSG00000283045|AC103703|lncRNA',
 'ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG00000267281|ATF7-NPFF|mRNA|ENSG00000139574|NPFF|mRNA|ENSG00000170653|ATF7|mRNA|ENSG0000

In [144]:
len(list_gene_id)

4002

In [145]:
#apparently it's not mandatory to remove duplicates, matching_genes contains the same nr of genes using both filtered and unfiltered lists
list_gene_id = list( dict.fromkeys(list_gene_id) )
len(list_gene_id)

705

In [146]:
#could work but I probably lost information on the chromosomes
list_common_genes = list(set_common_genes)
#jList is one long string, so `i in jList` below is a substring test
jList = "|".join(list_gene_id)

matching_genes = [i for i in list_common_genes if i in jList]
len(matching_genes)

892

In [147]:
len(jList)

162738

In [148]:
#I think the length of the lists makes sense but I am not sure that I am addressing the problem in the best way. 
matching_genes = list( dict.fromkeys(matching_genes) )
len(matching_genes)

892
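
An alternative that keeps the chromosome information is to match per row instead of joining everything into one string. The sketch below extracts whole ids with a regular expression and keeps the rows where at least one id is in the common set. The ids are shortened placeholders, and common stands in for set_common_genes. Real Ensembl gene ids normally have a fixed length, so the substring test above is unlikely to produce false hits, but matching whole ids removes that risk.

```python
import pandas as pd

# Toy table; IDs are shortened placeholders, not real Ensembl identifiers.
sites = pd.DataFrame({
    "Chromosome": ["chr17", "chr12", "chr21"],
    "Gene_id": ["ENSG001|AAA|mRNA|ENSG002|BBB|lncRNA", "ENSG003|CCC|mRNA", "ENSG0010|DDD|mRNA"],
})
common = {"ENSG001", "ENSG003"}  # placeholder for set_common_genes

# Extract whole IDs per row, so "ENSG001" cannot match inside "ENSG0010".
ids_per_row = sites["Gene_id"].str.findall(r"ENSG\d+")
keep = ids_per_row.apply(lambda ids: bool(common.intersection(ids)))

# Rows (with their chromosomes) that mention at least one common gene.
matching_rows = sites[keep]

# Distinct common genes seen anywhere in the file.
all_ids = set().union(*ids_per_row)
matching_ids = sorted(common & all_ids)
print(matching_rows["Chromosome"].tolist(), matching_ids)
```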

In [7]:
#apparently, there are no gene ids in this file, right?? - yes, just the range of positions 
df2 = pd.read_csv("C:/Users/sofia/Desktop/THESIS/PROJECT/GSE63753_hek293.abcam.CIMS.m6A.9536.bed.txt", sep='\t', header = None)
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,chr10,100176311,100176312,B045_1-4_chr10_r_c635[k=4][m=2]_AGACT,2,-,AGACT,4,0.5
1,chr10,101090591,101090592,B045_1-4_chr10_f_c695[k=4][m=2]_GGACT,2,+,GGACT,4,0.5
2,chr10,101370996,101370997,B045_1-4_chr10_r_c643[k=4][m=2]_GAACG,2,-,GAACG,4,0.5
3,chr10,101503020,101503021,B045_1-4_chr10_f_c698[k=4][m=2]_TGACA,2,+,TGACA,4,0.5
4,chr10,101515460,101515461,B045_1-4_chr10_f_c699[k=11][m=2]_GGACT,2,+,GGACT,11,0.181818
