# Hepatic ERa-Controlled Genes Validation 
## Overview
This notebook filters and validates estrous-cycle-associated genes from Villa 2012 (DOI: 10.1073/pnas.1205797109), resulting in a set of genes that are transcriptionally controlled by Estrogen Receptor Alpha (ERa) in the human liver.

## Findings from Villa 2012 
Villa 2012 identifies a set of genes putatively controlled by ERa in mouse liver by sequencing the transcriptomes of mouse liver harvested at different points in the estrous cycle, during which ERa levels undergo known fluctuations. These candidate genes are confounded by estrous cycle transcriptional changes unrelated to ERa, and findings in mice may not translate directly to humans.

## Validation Path #1: Cross reference with human ERa target genes
This strategy cross-references the genetic findings from Villa 2012 with validated ERa targets in humans from the TRRUST V2 database. Filtering hepatic ERa targets in mice against general ERa targets in humans leaves a subset of candidate hepatic ERa targets in humans. Data for the ERa targets is available at https://www.grnpedia.org/trrust/result.php?gene=ESR1&species=human&confirm=0, and PubMed IDs for each validated ERa target are available online and in the CSV files loaded by this notebook.
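
The cross-referencing step described above amounts to a set intersection on gene symbols. The following is a minimal sketch on toy tables, not the notebook's actual data: the gene names (`GENE_A` and so on) are placeholders rather than results, and only the column names `most enriched genes` and `Target` are taken from this notebook. The reference list is de-duplicated first because a target symbol can appear on more than one row, as in the TRRUST list printed by this notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables; gene names are placeholders, not results.
candidates = pd.DataFrame({"most enriched genes": ["GENE_A", "GENE_B", "GENE_C"]})
human_targets = pd.DataFrame({"Target": ["GENE_B", "GENE_B", "GENE_D"]})

# De-duplicate the reference list so repeated rows do not inflate counts.
target_set = set(human_targets["Target"].dropna())

# Keep only candidate rows whose gene symbol is also in the reference set.
validated = candidates[candidates["most enriched genes"].isin(target_set)]

print(f"Unique reference targets: {len(target_set)}")
print(f"Validated candidates: {validated['most enriched genes'].tolist()}")
```

Counting the set rather than the raw list reports unique genes, which is usually the number of interest when comparing gene lists.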

## Validation Path #2: Cross reference with human hepatic ERa ChIP-seq
This strategy cross-references the genetic findings from Villa 2012 with hepatic ERa binding sites from human ChIP-seq experiments (DOI: 10.3390/ijms22031461). Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique used to identify DNA regions bound by specific proteins, such as transcription factors, by isolating protein-DNA complexes, sequencing the associated DNA fragments, and mapping them to the genome.

The binding of ERα to a gene's regulatory region in ChIP-seq data suggests that ERα may play a role in its transcriptional regulation. However, binding alone does not confirm functional control, as transcriptional activation or repression depends on additional factors such as co-regulator recruitment and chromatin accessibility. Genes that both exhibit estrous cycle-dependent expression in mouse liver and are bound by ERα in human liver therefore provide stronger, though still indirect, evidence for ERα-mediated transcriptional regulation in humans.
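
Because binding alone is not conclusive, it can help to record which lines of evidence support each candidate gene instead of keeping a single yes/no filter. The sketch below is one possible way to do that, under stated assumptions: the data are placeholders, and the names `normalize`, `gene`, `chip_bound`, `known_target`, and `n_evidence` are hypothetical and do not come from this notebook. Matching mouse and human symbols by upper-casing is only a rough heuristic and is not a substitute for a proper orthology mapping.

```python
import pandas as pd

def normalize(symbols):
    """Strip whitespace and upper-case symbols so 'Gene_a ' matches 'GENE_A'."""
    return {str(s).strip().upper() for s in symbols if pd.notna(s)}

# Placeholder gene symbols (hypothetical, not results from any dataset).
candidates = pd.DataFrame({"gene": ["Gene_a", "GENE_B ", "gene_c"]})
chip_bound = normalize(["GENE_A", "GENE_C", "GENE_X"])   # stand-in for ChIP-seq genes
known_targets = normalize(["GENE_C", "GENE_Y"])          # stand-in for curated targets

# Normalize the candidate symbols the same way before comparing.
key = candidates["gene"].str.strip().str.upper()

# One boolean column per line of evidence, plus a simple tally.
candidates["chip_bound"] = key.isin(chip_bound)
candidates["known_target"] = key.isin(known_targets)
candidates["n_evidence"] = candidates[["chip_bound", "known_target"]].sum(axis=1)

print(candidates)
```

Keeping the evidence columns separate makes it possible to rank candidates later, for example by treating genes supported by both sources as the highest-confidence subset.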

In [11]:
import pandas as pd

# Load DataFrames
villa_df = pd.read_csv("estrogen enriched genes liver Villa 2012.csv")
TRRUST_df = pd.read_csv("TRRUST_ERa_Human_General_Targets.csv")
hep_chip_df = pd.read_csv("Liver_ERa_human_CHiP-seq_DOI_10.3390_ijms22031461.csv")

# Function to explore DataFrames
def explore_df(df, name):
    print(f"\nExploring {name}:")
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("Head:\n", df.head(5))
    print("\nMissing Values:\n", df.isnull().sum())
    print("-" * 40)

# Explore each DataFrame
explore_df(villa_df, "Villa DF")
explore_df(TRRUST_df, "TRRUST DF")
explore_df(hep_chip_df, "Hep Chip DF")



Exploring Villa DF:
Shape: (47, 4)
Columns: ['regulation stage', 'GO category PANTHER DB', 'GO group p-val', 'most enriched genes']
Head:
                                 regulation stage  \
0  Genes upregulated in metestrus (low E2) stage   
1  Genes upregulated in metestrus (low E2) stage   
2  Genes upregulated in metestrus (low E2) stage   
3  Genes upregulated in metestrus (low E2) stage   
4  Genes upregulated in metestrus (low E2) stage   

                         GO category PANTHER DB GO group p-val  \
0  Lipid, fatty acid and cholesterol metabolism      5.10X10^6   
1  Lipid, fatty acid and cholesterol metabolism      5.10X10^7   
2  Lipid, fatty acid and cholesterol metabolism      5.10X10^8   
3  Lipid, fatty acid and cholesterol metabolism      5.10X10^9   
4  Lipid, fatty acid and cholesterol metabolism     5.10X10^10   

  most enriched genes  
0               SAMD8  
1                ACLY  
2              CYP2R1  
3                SC5D  
4               CYB5B  

Missi

In [22]:
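# Helper: print the length and contents of a gene list

def list_stat(genes, name):
    # `genes` avoids shadowing the built-in `list`; the length includes duplicates
    print(f"{name} gene list len: {len(genes)} \n containing genes: {genes}")

# Extract the gene-symbol columns as lists for filtering
TRRUST_gene_list = TRRUST_df['Target'].to_list()
list_stat(TRRUST_gene_list, "TRRUST")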

hep_chip_gene_list = hep_chip_df['liver ERa chipseq binding genes doi 10.3390 ijms22031461'].to_list()
list_stat(hep_chip_gene_list, 'Hep CHiP')

# Filter the Villa genes by each validation strategy and save the results
villa_df_TRRUST_val = villa_df[villa_df['most enriched genes'].isin(TRRUST_gene_list)]
explore_df(villa_df_TRRUST_val, "Villa Genes TRRUST Validated")
villa_df_TRRUST_val.to_csv("villa_ERa_hep_genes_TRRUST_validated.csv")

villa_df_hep_chip_val = villa_df[villa_df['most enriched genes'].isin(hep_chip_gene_list)]
explore_df(villa_df_hep_chip_val,"Villa Genes Hep CHiP Validated")
villa_df_hep_chip_val.to_csv("villa_ERa_hep_genes_CHiP_validated.csv")


TRRUST gene list len: 86 
 containing genes: ['ABCG2', 'AHR', 'AMH', 'AR', 'AVP', 'BCL2', 'BCL2', 'BLM', 'BRCA1', 'BTG2', 'CCNA2', 'CCND1', 'CCND1', 'CD24', 'CDH1', 'CDK4', 'CDKN1A', 'CDKN1A', 'CDKN1B', 'CEBPB', 'CHAT', 'CRH', 'CRHBP', 'CTNNB1', 'CTSD', 'CXCL12', 'CYP19A1', 'CYP1A1', 'CYP1B1', 'CYP2C19', 'DCT', 'E2F1', 'EGFR', 'ESR1', 'ESRRA', 'F12', 'FLT1', 'FOS', 'FOXP1', 'GREB1', 'GREB1', 'HSPB1', 'HTRA2', 'JAK2', 'JUN', 'JUN', 'JUNB', 'KDR', 'KRT19', 'MDM2', 'MICB', 'MMP13', 'MTA3', 'MYC', 'NQO1', 'NR5A2', 'NRF1', 'OXT', 'PELP1', 'PGR', 'PGR', 'PLAC1', 'PLAC1', 'PMAIP1', 'PTMA', 'RARA', 'RET', 'RUNX2', 'SERPINB9', 'SERPINE1', 'SP1', 'TAC3', 'TERT', 'TFF1', 'TFF1', 'TGFA', 'TGFA', 'TP53', 'TYMS', 'UGT1A4', 'UGT2B15', 'VEGFA', 'VEGFA', 'YWHAQ', 'ZEB1', 'ZFHX3']
Hep CHiP gene list len: 128 
 containing genes: ['ABCB1', 'ABCB11', 'ABCC3', 'ABCG5', 'ABCG8', 'AHRR', 'AHRR', 'ALDH1A1', 'ALDH2', 'ARNT', 'CEBPA', 'CEBPB', 'CEBPD', 'CEBPG', 'CES1', 'CES2', 'CYP11A1', 'CYP17A1', 'CYP1A1', 'CY