# Overview

[Script of Scripts (SoS)](https://vatlab.github.io/sos-docs/index.html#content) is a scripting language designed for the execution of workflows that involve the analysis of data in multiple languages.


SoS Notebook is a web-based notebook environment that allows the use of multiple scripting languages in a single notebook, with data flowing freely within and across languages. It enables researchers to perform sophisticated bioinformatics analyses using the most suitable tools for different parts of the workflow, without the limitations of a particular language or the complications of cross-language communication. If you are interested in `SoS`, you can read more about it in [this paper](https://academic.oup.com/bioinformatics/article/34/21/3768/5001386).

In this notebook, you will explore some basic functions of `sos` for analysing UK Biobank data, using a simulated dataset. Before you start, make sure that the main kernel for this notebook is set to SoS (after installation, you can select the SoS kernel in the top right corner), and that each cell uses the correct kernel (the drop-down list should show at least three options: `Python`, `R`, `SoS`).

Please follow the instructions throughout this notebook and run all the cells. Add cells under each question (in `Python`, `SoS` or `markdown`) to answer.


# Intro -- set up

In [10]:
# set the kernel of this cell as SoS
print("hello world!")

hello world!


In [11]:
# set the kernel of this cell as Python
print("hello world!")

hello world!


As you can see, both cells give the same result, so you can write Python directly in a SoS cell. Before we start any analysis, let's import the required packages.

In [12]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns # pretty plotting, similar to ggplot2
import matplotlib.pyplot as plt # base plots

import statsmodels.api as sm # similar to glm() in R
import statsmodels.formula.api as smf

from scipy.stats import norm
from scipy.stats import t
from sklearn.preprocessing import scale

If an import fails with `No module named 'xxxx'`, that module is not installed. Please refer to [this page](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) to install the required modules with micromamba. Alternatively, you can install them using `conda`.
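To see which modules are missing before running the imports, a small check like the one below may help. This is only a sketch: the helper name `find_missing` is hypothetical, and the list contains the top-level import names used in the import cell above (note that `sklearn` is distributed as the `scikit-learn` package).

```python
import importlib.util

# Top-level import names used in this notebook.
# Note: the `sklearn` module is installed via the `scikit-learn` package.
required = ["numpy", "pandas", "seaborn", "matplotlib", "statsmodels", "scipy", "sklearn"]

def find_missing(modules):
    """Return the names in `modules` that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(required)
print("Missing modules:", missing if missing else "none")
```

Any name printed here still needs to be installed in the environment that the notebook kernel runs in.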

# Download the toy data and preview

Please download the data file `data/toy_data.tsv` and import it in Python.

In [13]:
bd = pd.read_table("~/student_test_2024/data/toy_data.tsv", low_memory=False)
bd.head()

Unnamed: 0,ID,f.31.0.0,f.33.0.0,f.42009.0.0,f.42007.0.0
0,633627,0,1966-10-26,,
1,542438,0,1962-08-23,,
2,727287,1,1966-02-14,,
3,355926,1,1951-06-24,,
4,714555,0,1952-06-02,,


You can also view it in SoS using the magic [`%preview`](https://vatlab.github.io/sos-docs/doc/user_guide/magic_preview.html):

In [16]:
# set the kernel of this cell as SoS
%preview -n "~/student_test_2024/data/toy_data.tsv"

ID	f.31.0.0	f.33.0.0	f.42009.0.0	f.42007.0.0
633627	0	1966-10-26	NA	NA
542438	0	1962-08-23	NA	NA
727287	1	1966-02-14	NA	NA
355926	1	1951-06-24	NA	NA

# Explore the "first occurrence" phenotypes

As you can see, the raw data are not very easy to understand!

The first thing you want to understand is the column names, which you can look up in the UK Biobank [Showcase](https://biobank.ndph.ox.ac.uk/showcase/index.cgi). For example, searching for `31` shows that this field represents [sex](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=31). Please read about the data coding of this field (and of the other columns) and answer the following questions.
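As a convenience for looking up fields, the sketch below splits each column name into its parts and prints the matching Showcase URL. It assumes the columns follow the `f.<field ID>.<instance>.<array index>` naming pattern seen in the toy data; the helper name `parse_ukb_column` is hypothetical and not part of any library.

```python
def parse_ukb_column(name):
    """Split a column name like 'f.31.0.0' into (field_id, instance, array_index).

    Assumes the 'f.<field ID>.<instance>.<array index>' naming pattern.
    """
    prefix, field_id, instance, array_index = name.split(".")
    if prefix != "f":
        raise ValueError(f"unexpected column name: {name}")
    return int(field_id), int(instance), int(array_index)

# Column names taken from the toy data preview above
for col in ["f.31.0.0", "f.33.0.0", "f.42009.0.0", "f.42007.0.0"]:
    field_id, instance, array_index = parse_ukb_column(col)
    print(f"{col}: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id={field_id}")
```

Opening each printed URL shows the description and data coding of that field.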

**Question 1: How many females and males are in the toy data? What is the average age of all females as of January 1, 2024?**

**Answer:**

In [None]:



The easiest place to start is with the UK Biobank's pre-processed "[first occurrence](https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=593)" disease phenotypes. These phenotypes have been generated from several data sources and are encoded by 3-character [ICD-10 codes](https://icd.who.int/browse10/2019/en), a widely used international standard for cataloguing human diseases. Each disease (specified by an ICD-10 code) is encoded in two fields: one with the data source where the first occurrence was observed (encoding described [here](https://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=2171)), and another with the date when that event happened.

The first thing we need to do is identify which fields correspond to our disease of interest, which we can do by searching the UK Biobank [Showcase](https://biobank.ndph.ox.ac.uk/showcase/index.cgi). For example, searching for "crohn's disease" reveals that field 131626 contains the date of the first reported diagnosis and field 131627 contains the source where the diagnosis was reported.

Let's first rename the columns containing the first reported occurrence source and date for Crohn's disease (CD) and ulcerative colitis (UC) to human-readable names, and use those to explore the data.