<h1 style="color:blue"> Data retrieval from GEO</h1>

In this exercise, we download data from the NCBI GEO database programmatically. The exercise is based on the example at https://geoparse.readthedocs.io/en/latest/usage.html#examples.

GEOparse is a Python library for accessing the Gene Expression Omnibus (GEO) database. `GEOparse.get_GEO()` looks up the given accession ID in GEO and downloads the data to the specified directory. For a series accession (GSE), the result is loaded into a `GEOparse.GSE` object. See the documentation at https://geoparse.readthedocs.io/en/latest/introduction.html#features.
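To make the download step less of a black box, here is a minimal sketch of how the location of a series file can be derived from its accession ID. It assumes the FTP layout that GEOparse prints in its download log for GSE1563 further down in this notebook; `series_soft_url` is a hypothetical helper written for illustration and is not part of GEOparse.

```python
import re

# Root folder as it appears in the GEOparse download log (assumed stable).
GEO_FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/geo/series"


def series_soft_url(accession):
    """Build the SOFT family-file URL for a GEO series accession (hypothetical helper)."""
    match = re.fullmatch(r"GSE(\d+)", accession.strip().upper())
    if match is None:
        raise ValueError("Not a GEO series accession: %r" % accession)
    digits = match.group(1)
    # Series are grouped into folders in which the last three digits are replaced by "nnn".
    folder = "GSE%snnn" % digits[:-3]
    name = "GSE%s" % digits
    return "%s/%s/%s/soft/%s_family.soft.gz" % (GEO_FTP_ROOT, folder, name, name)


print(series_soft_url("GSE1563"))
```

Note that the helper rejects a bare number: the accession needs its `GSE` prefix, which is also what `GEOparse.get_GEO()` expects in its `geo` argument.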


Along the way, we will practice exploring an unfamiliar data structure.

## Import libraries

The first step is to import the required Python libraries. 


In [1]:
import GEOparse
# To read, write and process tabular data:
import pandas as pd

## Exercise 1

Let's download an example data set from the study "Kidney Transplant Rejection and Tissue Injury by Gene Profiling of Biopsies and Peripheral Blood Lymphocytes" by Flechner et al. (2004) (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2041877/).

In [None]:
# Check your current working folder if necessary:
import os
os.getcwd()

In [3]:
# Download the data set with GEOparse (it is available in the GEO database under accession ID GSE1563)

kidney_data = GEOparse.get_GEO(geo="GSE1563", destdir="./")



28-Jun-2022 15:36:48 DEBUG utils - Directory ./ already exists. Skipping.
28-Jun-2022 15:36:48 INFO GEOparse - Downloading ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz to ./GSE1563_family.soft.gz
100%|█████████████████████████████████████████████████████████████████████████████| 9.62M/9.62M [00:07<00:00, 1.31MB/s]
28-Jun-2022 15:36:57 DEBUG downloader - Size validation passed
28-Jun-2022 15:36:57 DEBUG downloader - Moving C:\Users\Aleksi\AppData\Local\Temp\tmptv8quqtf to C:\Users\Aleksi\Documents\UEF_laskennallinen_biomed\CBM101\C_Data_resources\GSE1563_family.soft.gz
28-Jun-2022 15:36:58 DEBUG downloader - Successfully downloaded ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1563/soft/GSE1563_family.soft.gz
28-Jun-2022 15:36:58 INFO GEOparse - Parsing ./GSE1563_family.soft.gz: 
28-Jun-2022 15:36:58 DEBUG GEOparse - DATABASE: GeoMiame
28-Jun-2022 15:36:58 DEBUG GEOparse - SERIES: GSE1563
28-Jun-2022 15:36:58 DEBUG GEOparse - PLATFORM: GPL8300
28-
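The log above shows GEOparse parsing a gzipped SOFT file and reporting the entities it finds (DATABASE, SERIES, PLATFORM). As a rough sketch of what that file looks like from the outside, the snippet below counts the entity declarations in a SOFT file using only the standard library. It relies on the SOFT convention that entity lines begin with `^`; `count_soft_entities` is a hypothetical helper, not a GEOparse function.

```python
import gzip
from collections import Counter
from pathlib import Path


def count_soft_entities(path):
    """Count entity declarations (lines starting with '^') in a gzipped SOFT file."""
    counts = Counter()
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("^"):
                # Entity lines look like "^SERIES = GSE1563"; keep the part before "=".
                entity_type = line[1:].split("=", 1)[0].strip()
                counts[entity_type] += 1
    return counts


# File name taken from the download log; the check keeps the cell safe to run
# even if the file has not been downloaded yet.
soft_path = Path("./GSE1563_family.soft.gz")
if soft_path.exists():
    print(count_soft_entities(soft_path))
```

The SAMPLE and PLATFORM counts give a quick sanity check against what you later find in `kidney_data.gsms` and `kidney_data.gpls`.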

<div class='alert alert-warning'>
<h4>Exercise 1.</h4>Inspect your downloaded data. a) What data type is it?

In [None]:
# Ex1


In [None]:
# %load solutions/ex1_1a.py

<div class='alert alert-warning'>
 b) What does it contain? Try accessing its different contents.
Hint: use `dir`, or type `kidney_data.` and press Tab.

In [None]:
# b)


In [None]:
# %load solutions/ex1_1b.py
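The output of `dir` on an unfamiliar object can be long and cluttered with internal names. One possible way to tidy it up, shown here as a sketch, is to skip names starting with an underscore and separate plain data attributes from methods. `public_attributes` and the `Toy` class are hypothetical and only serve as an illustration; the same call should work on `kidney_data`.

```python
def public_attributes(obj):
    """Split an object's public attribute names into data attributes and methods."""
    data, methods = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and special names
        try:
            value = getattr(obj, name)
        except Exception:
            continue  # some attributes may fail to evaluate; ignore them
        if callable(value):
            methods.append(name)
        else:
            data.append(name)
    return data, methods


class Toy:
    """Stand-in object so the example runs without downloading anything."""

    def __init__(self):
        self.name = "toy"
        self.metadata = {"title": ["toy object"]}

    def show(self):
        return self.name


data, methods = public_attributes(Toy())
print("Data attributes:", data)
print("Methods:", methods)
```

In the notebook you could then call `public_attributes(kidney_data)` and look at the two lists separately.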

<div class='alert alert-warning'>
c) Look into the GSMs (samples) of `kidney_data`.
Hint: you can repeat the Tab trick to go deeper, e.g. type `kidney_data.gsms.` and press Tab.

In [None]:
kidney_data.gsms

In [None]:
# c)


In [None]:
# %load solutions/ex1_1c.py

### Printing a summary
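Once we know how the object is organized, we can print a summary of the first sample (GSM), including its metadata and the first rows of its data table: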

In [None]:
# A GSM (sample) contains information about the conditions and preparation of the sample

print("GSM example:\n-------------")
for gsm_name, gsm in kidney_data.gsms.items():
    print("Name: ", gsm_name)
    print("Metadata:")
    # Each metadata value is a list of strings, so join it for printing
    for key, value in gsm.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print()
    print(gsm.table.head())
    break  # stop after the first sample
    

The platform (GPL) can be summarized in the same way:

In [None]:
# A GPL (platform) contains a tab-delimited table with the array definition, e.g. mappings from probe IDs to RefSeq IDs

print()
print("GPL example:\n-------------")
for gpl_name, gpl in kidney_data.gpls.items():
    print("Name: ", gpl_name)
    print("Metadata:")
    # Each metadata value is a list of strings, so join it for printing
    for key, value in gpl.metadata.items():
        print(" - %s : %s" % (key, ", ".join(value)))
    print("Table data:")
    print(gpl.table.head())
    break  # stop after the first platform
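The GSM and GPL loops above share the same structure, so the metadata part can be factored into small functions. The sketch below is one way to do that. `format_metadata` and `print_first_entity` are hypothetical helpers, and the toy dictionary only mimics the shape used above (each entry has a `.metadata` dict whose values are lists of strings).

```python
from types import SimpleNamespace


def format_metadata(metadata):
    """Turn a metadata dict (key -> list of strings) into printable lines."""
    return [" - %s : %s" % (key, ", ".join(values)) for key, values in metadata.items()]


def print_first_entity(label, entities):
    """Print name and metadata of the first entry in a dict such as .gsms or .gpls."""
    name, entity = next(iter(entities.items()))
    print("%s example:\n-------------" % label)
    print("Name: ", name)
    print("Metadata:")
    for line in format_metadata(entity.metadata):
        print(line)
    return name


# Toy stand-in with made-up content so the example runs without a download.
toy_samples = {
    "SAMPLE_A": SimpleNamespace(metadata={"title": ["toy sample"], "channel": ["1", "2"]}),
}
print_first_entity("GSM", toy_samples)
```

With a loaded data set, the calls would be `print_first_entity("GSM", kidney_data.gsms)` and `print_first_entity("GPL", kidney_data.gpls)`, followed by `.table.head()` on the entry if you also want the data table.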

<div class='alert alert-warning'>
<h4>Exercise 2. </h4>a)
Now your task is to load the data set from the study "A circadian gene expression atlas in mammals assayed by microarray" by Zhang et al. (http://www.pnas.org/content/111/45/16219.long). The data is available in the GEO database under accession ID GSE54650 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54650).

In [None]:
# Ex2


In [None]:
# %load solutions/ex1_2a.py

<div class='alert alert-warning'>
b) Use the GSM and GPL example code above to print information about the data.

In [None]:
# b)

In [None]:
# %load solutions/ex1_2b.py