## All of Us Alzheimer's disease data exploration

[The All of Us Research Program](https://www.researchallofus.org/) is a biomedical data platform, and all data must be analyzed in the platform's secure cloud environment.

This tutorial covers three tasks on the All of Us dataset:
1. analyze Alzheimer's disease (AD) patients' genomic data
2. analyze AD patients' diagnosis codes
3. analyze patients with AD-related diseases aged 65 to 75



The code and files for each task are in that task's folder:
```
.
└── allofus/
    ├── genomics/
    │   ├── code/
    │   │   ├── combine_all_samples.py	
    │   │   ├── create_run.sh		
    │   │   ├── sample_snp1.py
    │   │   ├── copy_vcf.py		
    │   │   ├── gwas_select.py		
    │   │   └── table_cohort.py	
    │   └── gwas_reference.csv
    ├── diagnosis_code/
    │   ├── code/
    │   │   ├── download.py			
    │   │   ├── get_patient_code_matrix.py
    │   │   ├── get_cohort_table.py		
    │   │   ├── run_get_patient_matrix.py
    │   │   └── get_observation_time.py	
    │   └── disease_names.csv
    └── age65_75/
        ├── code/
        │   ├── get_apoe_ids.py			
        │   ├── run_get_geno_score.py
        │   ├── get_dict.py			
        │   ├── run_get_patient_matrix.py
        │   ├── get_geno_score.py		
        │   ├── run_logit.py
        │   ├── get_patient_code_matrix.py	
        │   └── run_run_get_geno_score.py	
        └── disease_names.csv
```

### Task 1: Analysis of Alzheimer's disease patients' genomic data

In this genomic task, we first created a cohort with the conditions "Alzheimer's disease, unspecified", "Alzheimer's disease", "Alzheimer's disease with late onset", and "Other Alzheimer's disease". We then extracted the patients' whole genome sequencing data (VCF files). Details about creating an All of Us cohort can be found [here](https://support.researchallofus.org/hc/en-us/articles/360039585591-Selecting-participants-using-the-Cohort-Builder-tool), and details about selecting genomic data can be found [here](https://support.researchallofus.org/hc/en-us/articles/4558187754772-Selecting-Genomic-data-using-the-Genomic-Extraction-tool).

#### Step 1 -- Data extraction
The genomic data extraction takes approximately 1 hour. Once it finishes, copy the VCF files into directories in the current analysis environment with the script `genomics/code/copy_vcf.py`.

#### Step 2 -- Select GWAS reference
We can download a GWAS reference file from the [GWAS Catalog](https://www.ebi.ac.uk/gwas/api/search/downloads/full) and select SNPs by p-value using the command

`python genomics/code/gwas_select.py downloaded_gwas_file_path output_filtered_gwas_file_path`

The output filtered GWAS reference will be used for further analysis. 
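
The p-value selection can be sketched as below. This is a minimal illustration, not the contents of `gwas_select.py`: the column names `SNPS`, `MAPPED_GENE`, and `P-VALUE` are assumed to follow the GWAS Catalog download format, the function name `select_gwas_snps` is hypothetical, and the `5e-8` threshold is an assumed genome-wide significance cutoff. The toy rows use placeholder identifiers.

```python
import io

import pandas as pd


def select_gwas_snps(gwas, p_threshold=5e-8):
    """Keep SNP-gene associations whose p-value is below the threshold.

    Column names are assumed to match the GWAS Catalog download.
    """
    gwas = gwas.copy()
    # Coerce p-values to numbers; unparsable entries become NaN and fail the filter.
    gwas["P-VALUE"] = pd.to_numeric(gwas["P-VALUE"], errors="coerce")
    kept = gwas[gwas["P-VALUE"] < p_threshold]
    # One row per SNP-gene pair is enough for the downstream matrices.
    kept = kept[["SNPS", "MAPPED_GENE", "P-VALUE"]]
    return kept.drop_duplicates(subset=["SNPS", "MAPPED_GENE"]).reset_index(drop=True)


# Toy tab-separated input with placeholder SNP and gene names.
toy = pd.read_csv(
    io.StringIO(
        "SNPS\tMAPPED_GENE\tP-VALUE\n"
        "rsA\tGENE_A\t1E-12\n"
        "rsB\tGENE_B\t0.02\n"
        "rsC\tGENE_C\t3E-9\n"
    ),
    sep="\t",
)
filtered = select_gwas_snps(toy)
print(filtered)
```

On a real download, the same function would be applied to `pd.read_csv(downloaded_gwas_file_path, sep="\t")` and the result written with `to_csv`.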

#### Step 3 -- Generate two summarized matrices
For the created cohort, we generate two matrices:
1. a sample-by-SNP matrix where each value is the number of reference alleles;
2. a binary sample-by-gene matrix where 1 indicates that a gene is mutated in that patient. We consider a gene mutated if at least one of its SNPs is mutated.

To achieve this, we run the script `genomics/code/create_run.sh`, which generates scripts based on `genomics/code/sample_snp1.py` that can be run in parallel. Please change the filtered GWAS file path in the `sample_snp1.py` script accordingly. The output files will be saved under the `genomics/sample_snp` and `genomics/sample_gene` folders, respectively.

This step takes approximately 1 hour. After `genomics/code/create_run.sh` finishes, we can combine all outputs into a single matrix using the script `genomics/code/combine_all_samples.py`. The sample-by-SNP matrix will be saved as `sample_snp_all.csv` and the sample-by-gene matrix will be saved as `sample_gene_all.csv`.
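
The relationship between the two matrices can be illustrated with a small sketch. It assumes, as described above, that the sample-by-SNP entries count reference alleles (0, 1, or 2), and it further assumes that a SNP counts as mutated when fewer than two reference alleles are present. The function name `snp_to_gene_matrix` and the toy SNP and gene names are hypothetical; the repository scripts may define this differently.

```python
import pandas as pd


def snp_to_gene_matrix(sample_snp, snp_to_gene):
    """Collapse reference-allele counts into a binary sample-by-gene matrix."""
    # Assumption: fewer than 2 reference alleles means the SNP is mutated.
    snp_mutated = sample_snp < 2
    # Look up the gene of every SNP column, in column order.
    genes = pd.Series(snp_to_gene).reindex(snp_mutated.columns)
    # Group SNPs by gene: a gene is mutated if any of its SNPs is mutated.
    gene_mutated = snp_mutated.T.groupby(genes.values).any().T
    return gene_mutated.astype(int)


# Toy data: three samples, three SNPs mapped to two genes.
sample_snp = pd.DataFrame(
    {"snp1": [2, 1, 2], "snp2": [2, 2, 0], "snp3": [2, 2, 2]},
    index=["s1", "s2", "s3"],
)
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B"}
sample_gene = snp_to_gene_matrix(sample_snp, snp_to_gene)
print(sample_gene)
```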


#### Step 4 -- Generate frequency tables for the cohort
We calculated the frequency of mutation for each gene using the script `genomics/code/table_cohort.py`. This code will return 3 tables:
1. the count and proportion of each mutated gene in the whole cohort
2. the count and proportion of each mutated gene in the male cohort
3. the count and proportion of each mutated gene in the female cohort
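
A minimal sketch of how such tables could be computed from the binary sample-by-gene matrix is shown below. The function names and the `"Male"`/`"Female"` labels are assumptions for illustration; the actual demographic coding used by `table_cohort.py` may differ.

```python
import pandas as pd


def mutation_frequency(sample_gene):
    """Count and proportion of samples in which each gene is mutated."""
    return pd.DataFrame(
        {"count": sample_gene.sum(axis=0), "proportion": sample_gene.mean(axis=0)}
    )


def cohort_tables(sample_gene, sex):
    """Frequency tables for the whole, male, and female cohorts."""
    # Align the sex labels to the rows of the matrix.
    sex = sex.reindex(sample_gene.index)
    return {
        "all": mutation_frequency(sample_gene),
        "male": mutation_frequency(sample_gene.loc[sex == "Male"]),
        "female": mutation_frequency(sample_gene.loc[sex == "Female"]),
    }


# Toy data: four samples and two placeholder genes.
sample_gene = pd.DataFrame(
    {"GENE_A": [1, 0, 1, 1], "GENE_B": [0, 0, 1, 0]},
    index=["s1", "s2", "s3", "s4"],
)
sex = pd.Series(["Male", "Female", "Female", "Male"], index=sample_gene.index)
tables = cohort_tables(sample_gene, sex)
print(tables["all"])
```

The same pattern applies to the patient-by-disease matrix in Task 2, with diseases in place of genes.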


### Task 2: Analysis of Alzheimer's disease patients' diagnosis codes
For the Alzheimer's disease patient cohort created in Task 1, we further explored diagnosis codes, i.e. diagnoses of other diseases, in Task 2. We focused on AD-related diseases from the published literature (disease names are stored in `diagnosis_code/disease_names.csv`). Following guidance from the All of Us website, we downloaded all patient information to the `diagnosis_code/patient_disease` folder. Example code for downloading data is in `diagnosis_code/code/download.py`.

#### Step 1 -- Identify observation time for each patient
For each patient, we define the observation time as the date 5 years before the patient's first Alzheimer's disease diagnosis. We then focus on the diseases diagnosed after the observation time.

This step can be done by the script `diagnosis_code/code/get_observation_time.py`. The script will return two files: one is the observation time for each patient, and the other is the demographic information for each patient.
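
The observation-time rule can be sketched as follows. The column names `person_id` and `condition_start_date` are assumed to follow the OMOP-style tables exposed by the platform, and `observation_start` is a hypothetical function name; the toy dates are made up for illustration.

```python
import pandas as pd


def observation_start(ad_diagnoses, years_before=5):
    """Date `years_before` years before each patient's first AD diagnosis."""
    # Earliest AD diagnosis date per patient.
    first_ad = ad_diagnoses.groupby("person_id")["condition_start_date"].min()
    # Shift that date back by the chosen number of calendar years.
    return (first_ad - pd.DateOffset(years=years_before)).rename("observation_start")


# Toy AD diagnosis records for two patients.
ad_diagnoses = pd.DataFrame(
    {
        "person_id": [1, 1, 2],
        "condition_start_date": pd.to_datetime(
            ["2018-03-10", "2016-06-01", "2020-01-15"]
        ),
    }
)
obs = observation_start(ad_diagnoses)
print(obs)
```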

#### Step 2 -- Generate patient by diagnosis code matrix
We generated a binary patient-by-disease matrix where 1 indicates that a patient was diagnosed with the disease after the patient's observation time. This can be done with `diagnosis_code/code/run_get_patient_matrix.py`, which runs the `get_patient_code_matrix.py` script for each disease. The outputs will be saved under the `diagnosis_code/patient_code` folder.
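
As a rough sketch of this step, the matrix can be built by keeping only diagnoses recorded on or after each patient's observation time and then cross-tabulating patients against diseases. The column names, the function name `patient_disease_matrix`, and the choice to include the observation date itself are assumptions; the disease names are placeholders.

```python
import pandas as pd


def patient_disease_matrix(diagnoses, observation_start, diseases):
    """Binary matrix: 1 if the patient has the disease on/after observation start."""
    merged = diagnoses.merge(
        observation_start.rename("observation_start"),
        left_on="person_id",
        right_index=True,
    )
    # Assumption: a diagnosis on the observation date itself is counted.
    after = merged[merged["condition_start_date"] >= merged["observation_start"]]
    # Count diagnoses per patient and disease, then cap the counts at 1.
    matrix = pd.crosstab(after["person_id"], after["disease"]).clip(upper=1)
    # Add all-zero rows and columns for patients or diseases with no record.
    return matrix.reindex(index=observation_start.index, columns=diseases, fill_value=0)


# Toy inputs: three patients, two placeholder diseases.
obs_start = pd.Series(
    pd.to_datetime(["2011-06-01", "2015-01-15", "2010-01-01"]), index=[1, 2, 3]
)
diagnoses = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 2],
        "disease": ["disease_a", "disease_b", "disease_a", "disease_a"],
        "condition_start_date": pd.to_datetime(
            ["2015-01-01", "2009-01-01", "2016-01-01", "2017-01-01"]
        ),
    }
)
matrix = patient_disease_matrix(diagnoses, obs_start, ["disease_a", "disease_b"])
print(matrix)
```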

#### Step 3 -- Generate frequency tables for the cohort
We calculated the frequency of each disease in the Alzheimer's disease patient cohort using the script `diagnosis_code/code/get_cohort_table.py`. This code will return 3 tables:

1. the count and proportion of each disease in the whole cohort
2. the count and proportion of each disease in the male cohort
3. the count and proportion of each disease in the female cohort

### Task 3: Analysis of patients aged from 65 to 75 with AD-related diseases
We want to compare the AD cohort with a general cohort. To achieve this, we first created a general cohort of patients aged 65 to 75 and extracted their demographic and disease EHR information to the `age65_75/EHR_data` folder.


#### Step 1 -- Explore the diagnosis codes
Similar to Task 2, we can generate 3 disease frequency tables by setting 2012-01-01 (5 years prior to the time of analysis) as the observation time for all patients. The corresponding code is available at `age65_75/code/get_patient_code_matrix.py` and `age65_75/code/run_get_patient_matrix.py`. The output results will be stored under the `age65_75/patient_code` folder.

#### Step 2 -- Explore genomic data
Similar to Task 1, we extracted the whole genome sequencing data for the age 65-75 cohort. One difference is that we only focus on the two SNPs that define the ApoE ε4 variant: rs429358 and rs7412.

This step can be implemented by the script `age65_75/code/run_run_get_geno_score.py`. The output results will be stored under the `age65_75/sample_geno` and the `age65_75/sample_score` folders. 

We then explored the ApoE results and recorded whether each patient is ApoE mutated using `age65_75/code/get_apoe_ids.py`. The output is the file `age65_75/apoe_ids.csv`.
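
For orientation, the standard way to read ApoE status from these two SNPs can be sketched as below. This is not the logic of `get_apoe_ids.py`: it assumes forward-strand allele strings such as `"TC"`, it assumes that "ApoE mutated" means carrying at least one ε4 allele, and the function name `apoe_status` is hypothetical. The ε2, ε3, and ε4 haplotypes correspond to the rs429358/rs7412 allele pairs T/T, T/C, and C/C; the ambiguous double heterozygote is resolved as ε2/ε4 by convention.

```python
# Keys are (rs429358 genotype, rs7412 genotype) with alleles sorted alphabetically.
APOE_GENOTYPE = {
    ("TT", "TT"): "e2/e2",
    ("TT", "CT"): "e2/e3",
    ("TT", "CC"): "e3/e3",
    ("CT", "CT"): "e2/e4",  # conventional reading of the double heterozygote
    ("CT", "CC"): "e3/e4",
    ("CC", "CC"): "e4/e4",
}


def apoe_status(rs429358, rs7412):
    """Return (ApoE genotype, carries e4) or (None, None) if not in the table."""
    key = ("".join(sorted(rs429358.upper())), "".join(sorted(rs7412.upper())))
    genotype = APOE_GENOTYPE.get(key)
    if genotype is None:
        # Missing calls or combinations outside the common haplotypes.
        return None, None
    return genotype, "e4" in genotype


print(apoe_status("TC", "CC"))
print(apoe_status("TT", "TT"))
```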

#### Step 3 -- Fit logistic regression between disease and ApoE status
To further explore the relationship between AD-related disease status and the ApoE gene, we fit logistic regressions that also include demographic information, using the equation
\begin{align*}
is\_disease \sim is\_ApoE+is\_Female+is\_White+is\_HispanicorLatino+age
\end{align*}

This step can be implemented by the script `age65_75/code/run_logit.py`, and the model results will be stored in `age65_75/logit_result.csv`.
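
A model of this form can be fit with a formula interface, as in the sketch below. It assumes `statsmodels` is available and is not necessarily how `run_logit.py` is written; the function name `fit_disease_logit` is hypothetical. The data are simulated with a random outcome purely to show the mechanics, so the coefficients carry no meaning.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_disease_logit(df):
    """Fit the logistic regression of disease status on ApoE and demographics."""
    formula = "is_disease ~ is_ApoE + is_Female + is_White + is_HispanicorLatino + age"
    return smf.logit(formula, data=df).fit(disp=False)


# Simulated toy data; the outcome is random and unrelated to the predictors.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame(
    {
        "is_ApoE": rng.integers(0, 2, n),
        "is_Female": rng.integers(0, 2, n),
        "is_White": rng.integers(0, 2, n),
        "is_HispanicorLatino": rng.integers(0, 2, n),
        "age": rng.integers(65, 76, n),
        "is_disease": rng.integers(0, 2, n),
    }
)
result = fit_disease_logit(toy)
# Collect coefficients, odds ratios, and p-values in one table.
summary = pd.DataFrame(
    {
        "coef": result.params,
        "odds_ratio": np.exp(result.params),
        "p_value": result.pvalues,
    }
)
print(summary)
```

One such model would be fit per disease, with `is_disease` taken from the corresponding column of the patient-by-disease matrix.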