# AI in Medicine
From vectors and matrices to artificial intelligence

## Data Science - Basics I
### Python Programming: *numpy* and *pandas*

Instructors:
- Dominique Sydow, AG Volkamer, Charité (dominique.sydow@charite.de)
- Moritz Seiler, AG Ritter, Charité (moritz.seiler@charite.de)

January 2020

## 1. Aims of this session

In this talktorial, you will get started with **data science**. Using the **Python packages *numpy* and *pandas***, you will load, describe, and manipulate a dataset on the early detection and tracking of **Alzheimer’s disease** for further use in visualization, machine learning, and artificial intelligence later this week.

## 2. Learning goals

### Theory

* Data science
* The *numpy* library
* The *pandas* library

### Practical

* Dataset
* Read data with *pandas* as DataFrame
* Look at data
* Look at individual DataFrame components
* Sort data
* Group data
* Subset data
* Drop columns
* Write data

## 3. References

- Data science, machine learning, artificial intelligence
  - http://varianceexplained.org/r/ds-ml-ai/
- Vectors, matrices, tensors
  - https://www.quantstart.com/articles/scalars-vectors-matrices-and-tensors-linear-algebra-for-deep-learning-part-1/
  - https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
- Numpy
  - https://scipy-lectures.org/intro/numpy/array_object.html
- Pandas
  - https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955
  - https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/#iloc-selection
  - https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c

## 4. Theory

### Data Science

#### What's the difference between data science, machine learning, and artificial intelligence? 

Taken from David Robinson's blog post: http://varianceexplained.org/r/ds-ml-ai/

The fields do have a great deal of **overlap**, but they are **not interchangeable**.

- **Data science** produces **insights**
- **Machine learning** (ML) produces **predictions**
- **Artificial intelligence** (AI) produces **actions**

#### Data science produces insights
  - “The average patient has a 70% chance of survival” (descriptive: describe a dataset)
  - “Different patients have different chances of survival” (exploratory: find relationships you did not know about)  
  - “A randomized experiment shows that patients assigned to Alice are more likely to survive than those assigned to Bob” (causal: find out what happens to one variable when you make another variable change)

#### Machine learning produces predictions
  - “Predict whether this patient will go into sepsis”
  - “Predict whether this image has a bird in it”

#### Artificial intelligence produces actions
  - Game-playing algorithms (Deep Blue, AlphaGo)
  - Robotics and control theory (motion planning, walking a bipedal robot)
  - Optimization (Google Maps choosing a route)

### The *numpy* library

#### Overview

* Role: Scientific computing (with arrays)
* Website: https://numpy.org/
> NumPy is the fundamental package for scientific computing with Python. It contains among other things:
  * a powerful N-dimensional array object
  * sophisticated (broadcasting) functions
  * tools for integrating C/C++ and Fortran code
  * useful linear algebra, Fourier transform, and random number capabilities
* Documentation: https://numpy.org/devdocs/

#### Applications

- Create vectors (1D), matrices (2D), tensors (>= 3D) in the form of arrays
- Use a large collection of high-level mathematical functions to operate on these arrays

![](https://res.cloudinary.com/practicaldev/image/fetch/s--oTgfo1EL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/adhiraiyan/DeepLearningWithTF2.0/master/notebooks/figures/fig0201a.png)
Figure source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
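
The two applications listed above can be illustrated with a minimal sketch. The array values are toy numbers chosen only for illustration; they are not taken from the course dataset.

```python
import numpy as np

# Toy values chosen only for illustration
vector = np.array([1.0, 2.0, 3.0])            # 1D array, shape (3,)
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2D array, shape (2, 2)
tensor = np.zeros((2, 3, 4))                  # 3D array, shape (2, 3, 4)

# ndim gives the number of dimensions of each array
print(vector.ndim, matrix.ndim, tensor.ndim)

# High-level mathematical functions operate on whole arrays at once
vector_mean = vector.mean()   # arithmetic mean of all elements
matrix_t = matrix.T           # transpose
product = matrix @ matrix     # matrix multiplication
scaled = matrix * 10          # broadcasting: the scalar is applied to every element
```

Each of these operations runs on the whole array without an explicit Python loop, which is the main reason to prefer arrays over nested lists for numerical work.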

### The *pandas* library

#### Overview

* Role: Data manipulation and analysis
* Website: https://pandas.pydata.org/
> *pandas* is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
* Documentation: https://pandas.pydata.org/pandas-docs/stable/

#### Applications

Taken from: https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955

*pandas* is capable of many tasks including:

* Reading/writing many different data formats
* Selecting subsets of data
* Calculating across rows and down columns
* Finding and filling missing data
* Applying operations to independent groups within the data
* Reshaping data into different forms
* Visualization through matplotlib and seaborn

#### DataFrame and Series

The *pandas* library has two main containers of data, the DataFrame and the Series. The DataFrame is used more than the Series, so let’s take a look at its components.

![DataFrame anatomy](https://cdn-images-1.medium.com/max/1600/1*ZSehcrMtBWN7_qCWq_HiSg.png)
Figure source: https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
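
To make the relation between the two containers concrete, here is a small sketch. The table `toy` is hypothetical: its column names mimic the course dataset, but the values and the row labels are made up.

```python
import pandas as pd

# Hypothetical mini table; column names mimic the course dataset, values are made up
toy = pd.DataFrame(
    {"AGE": [74.2, 82.4, 60.9], "PTGENDER": ["Male", "Male", "Female"]},
    index=["patient_a", "patient_b", "patient_c"],
)

ages = toy["AGE"]  # selecting a single column returns a Series

print(type(toy).__name__)   # name of the table's class
print(type(ages).__name__)  # name of the column's class
print(ages.index.equals(toy.index))  # the Series keeps the row labels of the DataFrame
```

A DataFrame can thus be thought of as a collection of Series that share one index.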

## 5. Practical

### 5.1. Dataset

The dataset contains observations from a longitudinal multi-site study for the early detection and tracking of **Alzheimer’s disease**. Alzheimer’s disease is an irreversible neurodegenerative disease resulting in a loss of mental function caused by the deterioration of brain tissue. 

For this study, patients

- with Alzheimer’s disease (AD)
- with mild cognitive impairment (MCI) as well as 
- with no signs of cognitive impairment / healthy controls (CN) 

were recruited at multiple sites. The overall goal was to track the progression of the disease using biomarkers together with clinical measures, in order to assess the brain’s structure and function over the course of the disease states.

We use two files during our course:

* File with raw data containing personal and health-related patient data: `alzheimers_disease_rand.csv`
* File containing descriptions for column names: `dictionary.csv`

### 5.2. Read data with *pandas* as DataFrame

In [1]:
import numpy as np
import pandas as pd

The *pandas* library provides the function `read_csv` to read a comma-separated values (csv) file into a DataFrame.

In [2]:
data = pd.read_csv('./data/alzheimers_disease_rand.csv', delimiter=',')



Set the number of rows displayed in this Jupyter notebook (default 60).

In [3]:
pd.set_option('display.max_rows', 110)  # Set the number of rows shown in the notebook

Read in the csv file containing the description of column names to check what kind of data we are working with.

In [4]:
pd.read_csv('./data/dictionary.csv', delimiter=',')

Unnamed: 0,column names,text
0,RID,Participant roster ID
1,VISCODE,Visit code
2,SITE,Site
3,EXAMDATE,Date
4,DX_bl,Baseline Dx
5,AGE,Age
6,PTGENDER,Sex
7,PTEDUCAT,Education
8,WORK,Profession
9,PTETHCAT,Ethnicity


Note: The first time, we read the csv file and assigned the result to the variable `data`, so no output is shown in this notebook. The second time, we read the csv file without assigning the result to a variable, so the output is shown directly in this notebook.

Reduce the number of rows displayed in this Jupyter notebook for further use.

In [5]:
pd.set_option('display.max_rows', 20)
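
As an alternative to setting an option and resetting it by hand, *pandas* also offers `pd.option_context`, which changes an option only inside a `with` block. The following is a minimal sketch; the variable names are chosen for illustration only.

```python
import pandas as pd

rows_before = pd.get_option("display.max_rows")

# Inside the with-block the option is changed; afterwards it is restored automatically
with pd.option_context("display.max_rows", 110):
    rows_inside = pd.get_option("display.max_rows")

rows_after = pd.get_option("display.max_rows")
print(rows_before, rows_inside, rows_after)
```

This can be handy when a single long table should be shown in full without changing the display settings for the rest of the notebook.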

### 5.3. Look at data

First of all, let's set pandas to display all columns in our DataFrame by increasing the `'display.max_columns'` option.

In [6]:
pd.set_option('display.max_columns', 110)

#### DataFrame dimensionality

Show the number of rows and columns (= dimensionality/shape) of the table in the form of `(number of rows, number of columns)`.

In [7]:
data.shape

(14532, 110)

#### DataFrame head/tail

Have a look at the first few rows of the table.

In [8]:
data.head()  # data.head(10)

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
0,128,bl,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,,,,,,,,,,,,,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0.0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,,,,,,,,,,,,,,,,,,,1.36665,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
1,129,bl,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
2,129,m06,164,2006-03-13,AD,81.4,Male,18,Elementary school teacher,Not Hisp/Latino,White,Married,1,1.05803,,,,,,6.0,19.0,30.0,10.0,24.0,19.0,2.0,6.0,100.0,,19.0,135.0,12.0,,,,,,,,,,,,,,,,,31851.0,88580.0,5445.0,1100055.0,2426.0,14401.0,16975.0,1906429.0,Dementia,-15.092,-13.4932,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.498289,5.96721,6.0,6.0,2019-12-04 04:19:56.0,
3,129,m12,164,2006-09-12,AD,81.3,Male,18,Communication,Not Hisp/Latino,White,Married,1,1.0969,,,601.4,251.7,24.18,3.5,24.0,35.0,10.0,17.0,31.0,2.0,7.0,100.0,0.0,21.0,126.0,17.0,,,,,,,,,,,,,,,,,35572.0,90099.0,5156.0,1095635.0,1595.0,14618.0,17333.0,1903819.0,Dementia,-21.4587,-20.2909,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.999316,11.9672,12.0,12.0,2019-12-04 04:19:56.0,
4,129,m24,164,2007-09-12,AD,80.5,Male,18,Accounting,Not Hisp/Latino,White,Married,1,1.03258,,,,,,8.0,25.67,37.67,10.0,19.0,23.0,1.0,5.0,100.0,0.0,16.0,275.0,14.0,,,,,,,,,,,,,,,,,88263.0,97420.0,5138.0,1088555.0,1174.0,14034.0,16401.0,1903419.0,Dementia,-20.1324,-20.3426,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,1.99863,23.9344,24.0,24.0,2019-12-04 04:19:56.0,


Have a look at the last few rows of the table.

In [9]:
data.tail()  # data.tail(10)

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
14527,6223,m24,252,2019-12-12,SMC,74.6,Male,18,RN,Not Hisp/Latino,White,Married,1.0,,,,,,,0.0,10.0,12.0,1.0,27.0,40.0,7.0,3.0,27.2727,10.0,,67.0,0.0,26.0,1.75,1.33333,1.0,1.0,1.0,1.25,1.26316,1.0,1.11111,1.0,1.0,1.16667,1.25,1.07895,,,,,,,,,,,-2.56502,-1.67829,2017-12-04,0.0,8.0,9.0,1.0,29,45.0,11.0,2.0,14.2857,15.0,,64.0,0.0,1.84363,1.73679,,,,,,,,,30.0,1.75,1.33333,1.0,1.0,1.0,1.25,1.26316,1.25,1.11111,1.0,1.4,1.33333,1.5,1.23684,,,,,,1.0049,2.02053,24.1967,24.0,24.0,2019-12-14 04:20:30.0,
14528,4832,m96,190,2019-12-16,EMCI,60.9,Male,14,painting contractor,Not Hisp/Latino,White,Married,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2012-05-16,2.0,8.0,12.0,4.0,30,30.0,6.0,4.0,44.4444,8.0,,77.0,5.0,-1.96982,-1.56211,,18008.0,7733.0,1091811.0,4085.0,19721.0,22308.0,1479077.0,19.0,3.125,2.44444,1.0,1.4,1.16667,2.0,1.94872,2.125,2.0,1.14286,1.8,1.5,2.0,1.76923,>1700,261.3,23.94,1.28876,,0.946004,7.58385,90.8197,90.0,96.0,2019-12-17 04:20:22.0,
14529,6262,m24,194,2019-12-03,CN,60.7,Male,12,Surgeon,Not Hisp/Latino,Black,Never married,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2017-12-14,0.0,7.33,8.33,1.0,29,55.0,9.0,7.0,46.6667,8.0,,51.0,0.0,-0.937768,0.187803,,,,,,,,,26.0,1.5,1.0,1.0,1.0,1.16667,1.0,1.13158,1.0,1.0,1.0,1.0,1.0,1.0,1.0,,,,,,,1.96851,23.5738,24.0,24.0,2019-12-17 04:20:25.0,
14530,6736,m12,159,2019-11-26,LMCI,60.7,Male,10,Insurance Broker,Not Hisp/Latino,White,Married,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2018-11-20,2.5,15.67,21.67,5.0,28,31.0,4.0,4.0,57.1429,4.0,,199.0,5.0,-6.7288,-7.39761,,,,,,,,,25.0,1.25,1.11111,1.2,1.0,1.33333,1.25,1.18919,2.625,1.44444,1.5,2.8,2.33333,2.75,2.15789,,,,1.19248,,,1.01574,12.1639,12.0,12.0,2019-12-17 04:20:26.0,
14531,6960,bl,169,2019-12-11,,81.3,Female,16,,Not Hisp/Latino,Black,Divorced,,,,,,,,1.0,10.0,16.0,5.0,30.0,46.0,9.0,9.0,69.2308,10.0,,86.0,0.0,24.0,2.25,1.55556,2.0,2.2,2.33333,2.25,2.05263,1.5,1.22222,1.5,1.0,1.0,1.25,1.26316,,,,,,,,,,,-1.92281,-1.78967,2019-12-11,1.0,10.0,16.0,5.0,30,46.0,9.0,9.0,69.2308,10.0,,86.0,0.0,-1.92281,-1.78967,,,,,,,,,24.0,2.25,1.55556,2.0,2.2,2.33333,2.25,2.05263,1.5,1.22222,1.5,1.0,1.0,1.25,1.26316,,,,,,,0.0,0.0,0.0,0.0,2019-12-17 04:20:26.0,


**Exercise**

Why is there a column named "Unnamed: 109" with `NaN` values? Check the csv file for answers.

#### DataFrame information

Get information about a DataFrame including:
- Index dtype
- Column dtypes
- Non-null values
- Memory usage

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14532 entries, 0 to 14531
Columns: 110 entries, RID to Unnamed: 109
dtypes: float64(85), int64(4), object(21)
memory usage: 12.2+ MB


![pandas dtypes overview](https://pbpython.com/images/pandas_dtypes.png)
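
The dtype summary printed by `info()` can also be inspected per column. The sketch below uses a hypothetical mini table (made-up values, column names borrowed from the dataset). Note that the dtype reported for text columns may depend on the installed *pandas* version.

```python
import pandas as pd

# Hypothetical mini table with made-up values
toy = pd.DataFrame({
    "RID": [128, 129],      # whole numbers
    "AGE": [74.2, 82.4],    # decimal numbers
    "DX_bl": ["CN", "AD"],  # text
})

print(toy.dtypes)                         # dtype of every column
dtype_counts = toy.dtypes.value_counts()  # how many columns share each dtype
print(dtype_counts)
```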

#### DataFrame column names

Get all column names.

In [11]:
data.columns

Index(['RID', 'VISCODE', 'SITE', 'EXAMDATE', 'DX_bl', 'AGE', 'PTGENDER',
       'PTEDUCAT', 'WORK', 'PTETHCAT',
       ...
       'PTAU_bl', 'FDG_bl', 'PIB_bl', 'AV45_bl', 'Years_bl', 'Month_bl',
       'Month', 'M', 'update_stamp', 'Unnamed: 109'],
      dtype='object', length=110)

#### Descriptive statistics on DataFrame data

Use the `describe` method to see how the data is distributed (numerical features only!).

In [12]:
data.describe()

Unnamed: 0,RID,SITE,AGE,PTEDUCAT,FDG,PIB,AV45,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,mPACCtrailsB,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M
count,14532.0,14532.0,14529.0,14532.0,3631.0,223.0,2669.0,10431.0,10298.0,10203.0,10324.0,10341.0,10215.0,10216.0,10188.0,10116.0,8309.0,3801.0,9899.0,10440.0,6329.0,6463.0,6443.0,6392.0,6433.0,6307.0,6396.0,6451.0,6514.0,6519.0,6383.0,6437.0,6230.0,6345.0,6512.0,1.0,7952.0,7446.0,6803.0,7690.0,6470.0,6470.0,6470.0,7954.0,10332.0,14502.0,14436.0,14524.0,14532.0,14489.0,14489.0,14488.0,14466.0,14523.0,7126.0,14282.0,14465.0,14526.0,14526.0,3.0,13146.0,11824.0,13417.0,11693.0,11693.0,11693.0,13581.0,7278.0,7359.0,7353.0,7328.0,7356.0,7200.0,7310.0,7357.0,7334.0,7341.0,7239.0,7286.0,7003.0,7179.0,7332.0,10204.0,152.0,6749.0,14529.0,14532.0,14532.0,14531.0
mean,2689.323218,230.022296,73.522541,16.052092,1.205512,1.782337,1.189418,2.087528,11.494013,17.530563,5.102867,26.7214,35.26001,4.155149,4.079309,56.712539,8.614541,37.364904,117.866249,5.368103,23.276347,2.057865,1.718384,1.388757,1.408397,1.527153,1.823171,1.674671,2.110136,1.698362,1.58268,1.6812,1.761065,1.968348,1.803508,1.76316,238351.771127,42158.511818,6690.837719,1011558.0,3780.983,17141.19459,19202.884853,1533709.0,-5.810488,9.704883,15.253412,4.809877,27.668318,37.150459,4.548623,4.25849,53.672283,7.977207,40.198569,109.817883,3.163844,-4.912862,-4.682232,-5.80532,38593.604138,6908.691644,1027019.0,3758.083,17500.660737,19751.74412,1535276.0,497.9593,2.09003,1.717113,1.365063,1.388882,1.49491,1.792442,1.663902,1.986185,1.552675,1.393904,1.505861,1.594714,1.780169,1.639943,1.250349,1.581759,1.193165,2.422421,29.003632,28.925473,28.746198
std,2050.550766,120.005597,7.092087,2.796145,0.159062,0.423372,0.227379,2.852369,8.364001,11.371985,3.108959,3.931502,13.744249,2.879895,10.634914,103.915605,6.345335,14.446934,77.273785,8.061208,4.735876,0.795149,0.640174,0.545718,0.559745,0.631117,0.757081,0.551183,0.996786,0.829791,0.850826,0.913555,0.974632,1.006557,0.853222,,138006.638622,23291.051914,1396.681459,113163.6,19197.7,2830.513114,3138.488244,168002.3,7.830616,5.818237,8.594518,2.844647,2.393435,12.167704,2.754395,2.595281,34.852861,5.369075,12.416141,67.177912,5.697284,5.755696,5.486149,4.651175,21693.85245,1237.887877,109814.9,14432.52,2660.002223,2964.206076,163649.3,28683.37,0.774728,0.621605,0.524857,0.541176,0.60674,0.733149,0.52917,0.887123,0.694449,0.640527,0.727897,0.813544,0.876843,0.685036,0.224457,0.305296,0.219321,2.559634,30.652204,30.620796,30.468485
min,128.0,155.0,0.0,4.0,0.636743,1.095,0.805364,0.0,0.0,0.0,0.0,0.0,0.0,-8.0,-1035.0,-9409.09,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.76316,31851.0,5652.0,2216.0,6597.0,1040.0,2707.0,8044.0,296.884,-60.9891,0.0,0.0,0.0,2.0,0.0,-8.0,-28.0,-400.0,0.0,0.0,0.0,0.0,-23.4185,-23.4185,-10.6964,5652.0,2995.0,6792.0,1425.0,3274.0,9375.0,18077.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.693671,1.155,0.805364,0.0,0.0,0.0,0.0
25%,806.0,174.0,69.0,14.0,1.1082,1.3575,1.00744,0.0,6.0,9.0,3.0,25.0,25.0,2.0,3.0,27.2727,3.0,29.0,65.0,0.0,21.0,1.5,1.22222,1.0,1.0,1.0,1.25,1.25641,1.25,1.0,1.0,1.0,1.0,1.0,1.127407,1.76316,107924.0,25130.75,5828.5,933652.0,2905.0,15267.25,17181.75,1417092.0,-10.3459,5.67,9.0,2.0,26.0,28.0,3.0,3.0,25.0,4.0,33.0,66.0,0.0,-8.92579,-8.48042,-7.988745,22997.0,6106.0,950068.0,3061.0,15753.0,17802.0,1421003.0,22.0,1.5,1.22222,1.0,1.0,1.0,1.25,1.27027,1.25,1.0,1.0,1.0,1.0,1.0,1.10811,1.16083,1.36,1.01536,0.495551,5.93443,6.0,6.0
50%,2171.0,194.0,73.5,16.0,1.21781,1.85,1.10645,1.0,9.33,15.0,5.0,28.0,34.0,4.0,4.0,60.0,9.0,38.0,89.0,1.0,24.0,1.875,1.55556,1.14286,1.2,1.33333,1.75,1.53846,1.75,1.33333,1.14286,1.2,1.33333,1.75,1.45946,1.76316,259473.5,37452.0,6760.0,1009262.0,3482.0,17180.0,19241.0,1522356.0,-3.74311,8.67,13.67,5.0,28.0,36.0,4.0,4.0,50.0,8.0,41.0,87.0,0.0,-3.882295,-3.59036,-5.28109,33594.0,6987.0,1025128.0,3594.0,17428.0,19683.0,1522877.0,24.0,2.0,1.55556,1.14286,1.2,1.33333,1.75,1.53846,1.75,1.25,1.14286,1.2,1.16667,1.5,1.37838,1.25604,1.49,1.10814,1.52498,18.2623,18.0,18.0
75%,4615.0,269.0,78.5,18.0,1.310275,2.1275,1.35369,3.0,15.0,23.33,8.0,29.0,45.0,6.0,6.0,100.0,14.0,47.0,141.0,8.0,27.0,2.5,2.11111,1.57143,1.6,1.83333,2.25,1.94872,2.875,2.118055,1.85714,2.0,2.25,2.75,2.28205,1.76316,350677.5,53423.5,7581.0,1088223.0,4013.0,19024.0,21393.75,1641200.0,0.066128,12.33,20.0,7.0,30.0,46.0,7.0,6.0,85.7143,12.0,48.0,124.0,4.0,-0.493413,-0.509204,-3.35978,48932.0,7713.0,1099823.0,4114.0,19216.0,21755.0,1641093.0,26.0,2.5,2.11111,1.5,1.6,1.8,2.25,1.92105,2.625,1.77778,1.5,1.8,2.0,2.25,1.96429,1.33996,1.835,1.36015,3.54278,42.4262,42.0,42.0
max,6963.0,1094.0,93.1,20.0,1.75121,2.9275,2.66921,18.0,70.0,85.0,29.33,30.0,75.0,42.0,15.0,100.0,100.0,83.0,996.0,161.0,30.0,26.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,1.76316,722839.0,162729.0,56830.0,1486041.0,1141372.0,29952.0,32193.0,2110293.0,12.5208,42.67,54.67,28.33,30.0,71.0,42.0,15.0,100.0,100.0,80.0,300.0,165.0,6.26433,7.4125,-1.43847,145113.0,50745.0,1486041.0,1163222.0,29952.0,32193.0,2110293.0,1850330.0,26.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,3.94872,18.62,2.2825,2.02556,14.0862,168.689,168.0,168.0


#### Exercise

Note: You can click on "> Solution" to get the programmatic answer to the exercise questions (copy-paste it in the blank cell above each solution and execute the cell to get the answer). In this first exercise however, you can also simply check the outputs above for answers.

What is the number of rows of this DataFrame? Why are some counts lower than that value?

<details>
<summary> > Solution</summary>
    
```python
data.shape[0]  # Number of rows in DataFrame
```
    
</details>

What is the patients' mean age?

<details>
<summary> > Solution</summary>
    
```python
data.describe().loc["mean", "AGE"]  # Patient's mean age
```
    
</details>

What is the minimum and maximum number of years that patients spent on their education?

<details>
<summary> > Solution</summary>
    
```python
data.describe().loc[["min", "max"], "PTEDUCAT"]  # Minimum and maximum number of years patients spent on education
```
    
</details>
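
Regarding the first question above: the `count` row of `describe` only counts non-missing values, so columns that contain `NaN` report a count lower than the number of rows. A minimal sketch with a hypothetical mini table (made-up values) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical mini table: the MOCA column has two missing values
toy = pd.DataFrame({"AGE": [74.2, 82.4, 60.9], "MOCA": [26.0, np.nan, np.nan]})

n_rows = toy.shape[0]                  # number of rows
counts = toy.describe().loc["count"]   # non-missing values per column
missing = toy.isna().sum()             # missing values per column

print(n_rows)
print(counts)
print(missing)
```

For every column, `count` plus the number of missing values adds up to the number of rows.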

### 5.4. Look at individual DataFrame components

Taken from: https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c

A *pandas* DataFrame is composed of three components: 
- **index**
- **columns** 
- data (**values**)

Let's extract each of these components into their own variables and inspect them:

In [13]:
index = data.index
index

RangeIndex(start=0, stop=14532, step=1)

In [14]:
columns = data.columns
columns

Index(['RID', 'VISCODE', 'SITE', 'EXAMDATE', 'DX_bl', 'AGE', 'PTGENDER',
       'PTEDUCAT', 'WORK', 'PTETHCAT',
       ...
       'PTAU_bl', 'FDG_bl', 'PIB_bl', 'AV45_bl', 'Years_bl', 'Month_bl',
       'Month', 'M', 'update_stamp', 'Unnamed: 109'],
      dtype='object', length=110)

In [15]:
values = data.values
values

array([[128, 'bl', 164, ..., 0.0, '2019-12-04 04:19:56.0', nan],
       [129, 'bl', 164, ..., 0.0, '2019-12-04 04:19:56.0', nan],
       [129, 'm06', 164, ..., 6.0, '2019-12-04 04:19:56.0', nan],
       ...,
       [6262, 'm24', 194, ..., 24.0, '2019-12-17 04:20:25.0', nan],
       [6736, 'm12', 159, ..., 12.0, '2019-12-17 04:20:26.0', nan],
       [6960, 'bl', 169, ..., 0.0, '2019-12-17 04:20:26.0', nan]],
      dtype=object)

In [16]:
type(index)

pandas.indexes.range.RangeIndex

In [17]:
type(columns)

pandas.indexes.base.Index

In [18]:
type(values)  # Important: Get numpy array from pandas DataFrame/Series!

numpy.ndarray

Both `index` and `columns` are *pandas* `Index` objects (the `RangeIndex` shown for the rows is a specialized kind of `Index`). This object is quite powerful in itself, but for now you can just think of it as a sequence of labels for either the rows or the columns.

The type of `values` is a *numpy* `ndarray`, which stands for n-dimensional array and is the primary container of data in the *numpy* library. *pandas* is built directly on top of *numpy*, and it is this array that is responsible for the bulk of the workload.
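
The array above has `dtype=object` because the DataFrame mixes numbers and text. If only numerical columns are selected, the resulting array gets a numerical dtype. The sketch below uses a hypothetical mini table with made-up values. It also shows the `to_numpy` method, which exists in recent *pandas* versions and may not be available in the older version used to produce the outputs in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini table with numerical columns only
toy = pd.DataFrame({"AGE": [74.2, 82.4], "PTEDUCAT": [16, 18]})

arr = toy.values             # same attribute as used above
arr_alt = toy.to_numpy()     # equivalent method in recent pandas versions

print(type(arr), arr.dtype, arr.shape)
```

Since the table combines a float column and an integer column, *numpy* picks a common dtype that can hold both.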

### 5.5. Sort data

We can sort data by the values of one or multiple columns (or rows) using the `sort_values` method.

For simplicity, we will only sort our data by one column, e.g. by the patients' examination date.

In [19]:
data.sort_values(by='EXAMDATE', ascending=False).head()  # Sorted by latest examination date

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
14528,4832,m96,190,2019-12-16,EMCI,60.9,Male,14,painting contractor,Not Hisp/Latino,White,Married,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2012-05-16,2.0,8.0,12.0,4.0,30,30.0,6.0,4.0,44.4444,8.0,,77.0,5.0,-1.96982,-1.56211,,18008.0,7733.0,1091811.0,4085.0,19721.0,22308.0,1479077.0,19.0,3.125,2.44444,1.0,1.4,1.16667,2.0,1.94872,2.125,2.0,1.14286,1.8,1.5,2.0,1.76923,>1700,261.3,23.94,1.28876,,0.946004,7.58385,90.8197,90.0,96.0,2019-12-17 04:20:22.0,
14522,5261,m84,182,2019-12-12,EMCI,76.5,Male,16,engineer,Not Hisp/Latino,White,Married,1.0,,,,,,,,,,,,,,,,,,,1.0,,2.125,1.44444,1.0,1.0,1.16667,1.25,1.39474,2.0,1.33333,1.2,1.6,1.16667,3.0,1.64865,,,,,,,,,,,,,2013-05-15,0.5,10.0,18.0,8.0,28,32.0,3.0,7.0,100.0,14.0,,70.0,0.0,-4.99839,-3.60723,,43152.0,6351.0,944750.0,,,,1213242.0,25.0,1.875,1.44444,1.0,1.0,1.0,1.75,1.35897,1.625,1.11111,1.0,1.4,1.0,2.75,1.38462,746.8,174.8,16.13,1.26256,,1.1861,6.57632,78.7541,78.0,84.0,2019-12-17 04:20:24.0,
14527,6223,m24,252,2019-12-12,SMC,74.6,Male,18,RN,Not Hisp/Latino,White,Married,1.0,,,,,,,0.0,10.0,12.0,1.0,27.0,40.0,7.0,3.0,27.2727,10.0,,67.0,0.0,26.0,1.75,1.33333,1.0,1.0,1.0,1.25,1.26316,1.0,1.11111,1.0,1.0,1.16667,1.25,1.07895,,,,,,,,,,,-2.56502,-1.67829,2017-12-04,0.0,8.0,9.0,1.0,29,45.0,11.0,2.0,14.2857,15.0,,64.0,0.0,1.84363,1.73679,,,,,,,,,30.0,1.75,1.33333,1.0,1.0,1.0,1.25,1.26316,1.25,1.11111,1.0,1.4,1.33333,1.5,1.23684,,,,,,1.0049,2.02053,24.1967,24.0,24.0,2019-12-14 04:20:30.0,
14521,4351,m96,155,2019-12-12,CN,70.0,Male,20,secretary,Not Hisp/Latino,White,Married,1.0,,,,,,,,22.0,34.0,9.0,24.0,25.0,-1.0,5.0,100.0,2.0,,159.0,2.0,15.0,1.625,1.125,1.33333,1.0,1.0,1.0,1.22222,2.125,1.11111,1.0,1.2,1.4,1.66667,1.42857,,,,,,,,,,,-15.358,-13.3325,2011-10-04,0.0,6.0,11.0,4.0,28,44.0,10.0,4.0,30.7692,12.0,,64.0,0.0,-2.80238,-1.74772,,39314.0,8754.0,1280619.0,4097.0,23174.0,29432.0,1906709.0,25.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.25,1.0,1.0,1.0,1.0,1.0,1.05128,713.7,371.6,38.07,1.43957,,1.49611,8.18891,98.0656,96.0,96.0,2019-12-13 04:20:22.0,
14531,6960,bl,169,2019-12-11,,81.3,Female,16,,Not Hisp/Latino,Black,Divorced,,,,,,,,1.0,10.0,16.0,5.0,30.0,46.0,9.0,9.0,69.2308,10.0,,86.0,0.0,24.0,2.25,1.55556,2.0,2.2,2.33333,2.25,2.05263,1.5,1.22222,1.5,1.0,1.0,1.25,1.26316,,,,,,,,,,,-1.92281,-1.78967,2019-12-11,1.0,10.0,16.0,5.0,30,46.0,9.0,9.0,69.2308,10.0,,86.0,0.0,-1.92281,-1.78967,,,,,,,,,24.0,2.25,1.55556,2.0,2.2,2.33333,2.25,2.05263,1.5,1.22222,1.5,1.0,1.0,1.25,1.26316,,,,,,,0.0,0.0,0.0,0.0,2019-12-17 04:20:26.0,


In [20]:
data.sort_values(by='EXAMDATE', ascending=True).head()  # Sorted by earliest examination date

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
10,131,bl,164,2005-09-07,CN,75.7,Male,16,,Not Hisp/Latino,White,Married,0,1.29343,,,547.3,337.0,33.43,0.0,8.67,14.67,4.0,29.0,37.0,4.0,4.0,44.4444,12.0,38.0,90.0,0.0,,,,,,,,,,,,,,,,,32253.0,34061.0,7068.0,1116634.0,4428.0,24789.0,21616.0,1640772.0,CN,-1.95295,-1.64932,2005-09-07,0.0,8.67,14.67,4.0,29,37.0,4.0,4.0,44.4444,12.0,38.0,90.0,0.0,-1.95295,-1.64932,,34061.0,7068.0,1116634.0,4428.0,24789.0,21616.0,1640772.0,,,,,,,,,,,,,,,,547.3,337.0,33.43,1.29343,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
0,128,bl,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,,,,,,,,,,,,,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0.0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,,,,,,,,,,,,,,,,,,,1.36665,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
1,129,bl,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
11507,134,bl,164,2005-09-19,CN,85.8,Female,18,pharmaceutical salesperson,Not Hisp/Latino,White,Widowed,0,1.2745,,,>1700,440.2,37.51,0.0,5.0,7.0,2.0,28.0,51.0,7.0,3.0,25.0,16.0,49.0,51.0,0.0,,,,,,,,,,,,,,,,,32267.0,18751.0,6072.0,948687.0,4194.0,14043.0,20073.0,1396073.0,CN,0.530658,1.10288,2005-09-19,0.0,5.0,7.0,2.0,28,51.0,7.0,3.0,25.0,16.0,49.0,51.0,0.0,0.530658,1.10288,,18751.0,6072.0,948687.0,4194.0,14043.0,20073.0,1396073.0,,,,,,,,,,,,,,,,>1700,440.2,37.51,1.2745,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
20,133,bl,175,2005-10-06,AD,75.3,Male,10,chartered accountant,Hisp/Latino,More than one,Married,1,,,,,,,6.0,27.33,40.33,10.0,20.0,17.0,2.0,3.0,75.0,0.0,9.0,300.0,17.0,,,,,,,,,,,,,,,,,59346.0,25699.0,6725.0,875793.0,2046.0,12067.0,15379.0,1353521.0,Dementia,-19.9104,-19.6431,2005-10-06,6.0,27.33,40.33,10.0,20,17.0,2.0,3.0,75.0,0.0,9.0,300.0,17.0,-19.9104,-19.6431,,25699.0,6725.0,875793.0,2046.0,12067.0,15379.0,1353521.0,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,


#### Exercise

Sort the data by age, youngest patients first, and show only the top 3 rows. How old is the youngest patient?

<details>
<summary> > Solution</summary>
    
```python
data.sort_values(by='AGE', ascending=True).head(3)
```
    
</details>

### 5.6. Group data

From https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

> By “group by” we are referring to a process involving one or more of the following steps:
> * **Splitting** the data into groups based on some criteria.
> * **Applying** a function to each group independently.
> * **Combining** the results into a data structure.


**Splitting**: Split data into groups based on criteria, e.g. the civil status.

In [21]:
data.groupby('PTMARRY')

<pandas.core.groupby.DataFrameGroupBy object at 0x7f23c3affba8>

**Applying and combining**: Apply a function to each group, e.g. get the number of entries in each group (method `size`).

In [22]:
data.groupby('PTMARRY').size()

PTMARRY
Divorced          1209
Married          11116
Never married      454
Unknown             53
White                3
Widowed           1697
dtype: int64

The per-group results are combined into a single data structure, here a `Series`.

In [23]:
type(data.groupby('PTMARRY').size())

pandas.core.series.Series

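Side note: What does "White" mean as a civil status? It is most likely an artifact in the data: in the output of `first` below, the values in the `White` row appear shifted by one column (e.g. `Air Force` under `PTETHCAT` and `Married` under `APOE4`), which suggests a parsing problem in these three rows, possibly caused by an unquoted comma in the `WORK` field.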
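
The three steps can also be spelled out by hand. The following is a minimal sketch on a small, made-up DataFrame (the variable `toy` and its values are assumptions for illustration, not part of the dataset); it shows that looping over the groups and counting rows yields the same result as `size`.

```python
import pandas as pd

# Hypothetical toy data standing in for the PTMARRY and AGE columns
toy = pd.DataFrame({
    'PTMARRY': ['Married', 'Divorced', 'Married', 'Widowed', 'Married'],
    'AGE': [74.2, 78.3, 82.4, 73.4, 67.1],
})

# Splitting: iterating over a GroupBy object yields (group name, sub-DataFrame) pairs
sizes = {}
for name, group in toy.groupby('PTMARRY'):
    # Applying: count the rows of each group
    sizes[name] = len(group)

# Combining: collect the per-group results in one Series
manual = pd.Series(sizes)

# The built-in shortcut performs all three steps at once
builtin = toy.groupby('PTMARRY').size()

print(manual.to_dict())
print(builtin.to_dict())
```

In practice the built-in methods (`size`, `first`, `mean`, ...) are preferable; the explicit loop is only meant to make the split-apply-combine idea visible.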

We can also apply other functions to our grouped data. 

Return the first element in each group (`first`).

In [24]:
data.groupby('PTMARRY').first()

PTMARRY,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
Divorced,140,bl,175,2005-11-04,CN,78.3,Female,12,legal secretary,Hisp/Latino,White,0,1.25096,1.58,1.35463,1582.0,203.6,16.68,0.0,4.33,8.33,4.0,29.0,45.0,6.0,4.0,36.3636,10.0,30.0,101.0,0.0,21.0,1.375,1.33333,1.0,1.0,1.16667,1.0,1.17949,1.125,1.11111,1.0,1.0,1.0,1.0,1.05128,,59356.0,46281.0,6729.0,861745.0,3580.0,13781.0,17795.0,1269537.0,CN,-3.33317,-2.51998,2005-11-04,0,4.33,8.33,4.0,29,45.0,6.0,4.0,36.3636,10.0,30.0,101.0,0.0,-3.33317,-2.51998,,46281.0,6729.0,861745.0,3580.0,13781.0,17795.0,1269537.0,26.0,1.375,1.22222,1.0,1.2,1.0,1.25,1.17949,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1582.0,203.6,16.68,1.25096,1.6725,1.04955,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
Married,128,bl,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,0,1.36665,2.3575,1.32679,741.5,239.7,22.83,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,27.0,1.25,1.44444,1.0,1.0,1.16667,1.75,1.25641,1.625,1.33333,1.14286,1.2,1.0,1.0,1.25641,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,27.0,1.25,1.44444,1.0,1.0,1.16667,1.75,1.25641,1.625,1.33333,1.14286,1.2,1.0,1.0,1.25641,741.5,239.7,22.83,1.36665,1.49,1.32679,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
Never married,5406,m24,253,2015-09-29,SMC,66.4,Male,16,Registered Nurse,Not Hisp/Latino,Black,0,1.25096,1.265,0.982566,731.8,101.2,10.08,0.0,2.0,3.0,1.0,30.0,38.0,6.0,5.0,50.0,12.0,40.0,67.0,0.0,27.0,1.625,1.11111,1.14286,1.2,1.66667,1.5,1.35897,1.75,1.11111,1.33333,1.2,1.25,1.25,1.4,,389179.0,33187.0,8297.0,1165500.0,4945.0,20148.0,21190.0,1656457.0,CN,1.86257,1.6424,2013-09-17,0,3.0,5.0,1.0,30,42.0,7.0,2.0,20.0,15.0,40.0,72.0,0.0,3.0546,2.36566,,33187.0,8297.0,1165500.0,4945.0,20148.0,21190.0,1656457.0,28.0,1.875,1.0,1.0,1.0,2.0,2.0,1.4359,1.625,1.11111,1.28571,1.0,1.25,1.0,1.25,731.8,101.2,10.08,1.25096,,0.983143,2.03149,24.3279,24.0,24.0,2019-12-04 04:20:18.0,
Unknown,1393,bl,247,2007-03-16,CN,73.4,Male,16,Educator,Not Hisp/Latino,White,0,1.25804,,1.00469,980.4,116.8,10.4,0.0,5.0,6.0,1.0,30.0,56.0,10.0,1.0,6.66667,14.0,69.0,50.0,0.0,25.0,2.25,2.11111,1.0,1.2,1.2,2.0,1.68421,2.75,2.33333,1.0,1.0,1.0,2.5,1.94444,,80695.0,19980.0,8657.0,1195613.0,4324.0,19230.0,21954.0,1731299.0,CN,4.27238,2.93081,2007-03-16,0,5.0,6.0,1.0,30,56.0,10.0,1.0,6.66667,14.0,69.0,50.0,0.0,4.27238,2.93081,,19980.0,8657.0,1195613.0,4324.0,19230.0,21954.0,1731299.0,25.0,2.25,2.11111,1.0,1.2,1.2,2.0,1.68421,2.75,2.33333,1.0,1.0,1.0,2.5,1.94444,980.4,116.8,10.4,1.25804,,1.00469,0.0,0.0,0.0,0.0,2019-12-04 04:20:06.0,
White,982,m12,164,2007-09-25,LMCI,60.7,Male,13,Military,Air Force,Not Hisp/Latino,Married,1.0,1.2512,1.435,,,,,2.5,17.33,29.33,10.0,27.0,22.0,5.0,5.0,100.0,0.0,29.0,161.0,14.0,26.0,3.14286,2.71429,1.16667,1.6,2.5,4.0,2.48571,2.375,1.55556,1.33333,1.2,2.0,2.0,1.76316,,78789.0,56830.0,6597.0,1141372.0,2780.0,16818.0,16880.0,1846260,Dementia,-11.5923,-11.7945,2006-09-29,1.0,17.33,28.33,9,27.0,24.0,4.0,6.0,100.0,2.0,37.0,165.0,3.0,-9.65127,-10.6964,,50745.0,6834.0,1163222.0,3440.0,17074.0,18077.0,1850330.0,26.0,3.14286,2.71429,1.16667,1.6,2.5,4.0,2.48571,2.375,1.55556,1.33333,1.2,2.0,2.0,1.76316,1308.0,213.7,18.62,1.18734,1.5675,,0.988364,11.8361,12.0,12,2019-12-04 04:20:03.0
Widowed,149,bl,164,2005-11-08,CN,73.4,Male,14,Secretary,Not Hisp/Latino,Black,0,1.36097,2.27,0.964111,1647.0,181.1,16.74,0.0,4.0,8.0,4.0,26.0,40.0,7.0,7.0,63.6364,8.0,43.0,114.0,0.0,22.0,1.125,1.11111,1.0,1.0,1.0,1.0,1.05128,1.25,1.11111,1.14286,1.0,1.0,1.0,1.10256,,32428.0,21895.0,8316.0,1040561.0,3622.0,18025.0,19499.0,1568537.0,CN,-5.37952,-6.12939,2005-11-08,0,4.0,8.0,4.0,26,40.0,7.0,7.0,63.6364,8.0,43.0,114.0,0.0,-5.37952,-6.12939,,21895.0,8316.0,1040561.0,3622.0,18025.0,19499.0,1568537.0,25.0,2.75,2.22222,1.14286,1.4,1.4,1.5,1.84211,1.875,1.0,1.0,1.0,1.0,1.75,1.25641,1647.0,181.1,16.74,1.36097,1.9025,1.73413,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,


Return the mean of each numerical column for each group (`mean`):

In [25]:
data.groupby('PTMARRY').mean()

PTMARRY,RID,SITE,AGE,PTEDUCAT,FDG,PIB,AV45,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,mPACCtrailsB,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M
Divorced,3291.65426,223.070306,70.829032,16.191067,1.24445,1.64,1.160615,1.38164,9.692649,14.474584,4.128,27.392449,39.161066,4.858633,3.844548,47.425462,10.141457,38.839286,102.003534,2.965157,24.5703,2.019086,1.657964,1.374131,1.348843,1.508141,1.850129,1.641122,1.798203,1.405836,1.361824,1.408663,1.488619,1.7066,1.5291,,270371.363914,31515.77561,6881.013722,1006044.0,3516.545,17315.25,18998.447917,1463609.0,-3.903863,8.332983,12.68683,3.826015,28.025641,40.929577,5.202983,4.025684,44.961698,9.201159,40.381201,94.118289,1.66055,-3.461543,-3.164937,,29943.786822,7114.693401,1024601.0,3632.987,17789.262846,19573.578063,1475378.0,25.10037,2.056265,1.673695,1.380301,1.388564,1.505799,1.813058,1.652824,1.692674,1.337108,1.242633,1.311714,1.34362,1.572745,1.420171,1.28054,1.742308,1.150095,2.171089,25.999675,25.957816,25.7866
Married,2641.822688,230.516823,73.121623,16.198813,1.196356,1.796808,1.190302,2.21535,11.861407,18.073441,5.255935,26.584115,34.616825,4.013591,4.213266,59.053171,8.361783,37.106649,118.497411,5.756061,23.131192,2.069483,1.728869,1.39138,1.423182,1.531277,1.835863,1.68439,2.165452,1.757264,1.61122,1.72693,1.805617,2.011747,1.851985,,235688.480256,43605.737458,6679.47335,1021657.0,3453.634,17258.479726,19362.160379,1553296.0,-6.127014,10.037228,15.762419,4.976598,27.588881,36.52585,4.393215,4.288486,55.04165,7.723623,40.142704,110.331072,3.402205,-5.190128,-4.924596,,39932.131867,6901.944605,1037495.0,3587.217,17625.947628,19919.337894,1555202.0,23.41527,2.108906,1.722058,1.368785,1.399488,1.496558,1.809537,1.673526,2.05015,1.606915,1.419309,1.546458,1.640734,1.822297,1.687627,1.24139,1.543079,1.196235,2.443094,29.257051,29.174433,29.014575
Never married,3079.861233,247.359031,73.142952,16.288546,1.255222,1.574167,1.129356,1.050459,8.969072,13.779149,3.844311,27.997006,39.84985,4.732733,3.90991,48.169036,11.003636,39.308411,110.350769,2.705521,24.443439,1.810006,1.6474,1.351972,1.322646,1.500682,1.728979,1.578666,1.627326,1.392489,1.30853,1.360714,1.420957,1.619565,1.454961,,255590.510288,40377.38565,6829.151111,989061.9,3517.871,16647.980861,19105.272727,1482008.0,-2.996565,7.62815,11.896872,3.572687,28.495595,39.973568,4.764317,3.995595,47.159533,10.07489,41.952128,109.788889,1.721854,-2.480422,-2.574614,,37799.0401,6977.434783,1003797.0,3633.925,17030.840399,19769.970075,1484635.0,24.83142,1.830189,1.652464,1.344593,1.314286,1.497782,1.799812,1.585091,1.584209,1.383333,1.270781,1.31128,1.379476,1.584651,1.416639,1.277663,,1.136882,2.167781,25.960063,25.803965,25.638767
Unknown,2941.018868,250.0,68.7,15.509434,1.306954,,1.018476,0.959459,7.33325,11.316154,3.325,28.65,38.769231,6.025641,3.74359,42.5357,9.9375,62.2,82.842105,1.631579,24.588235,2.419048,2.131745,1.386939,1.308571,1.552525,1.7,1.816112,2.03311,1.516617,1.189796,1.35625,1.331884,1.830357,1.557582,,292398.205882,28855.878788,8057.655172,1131052.0,3967.667,19804.5,21685.5,1568945.0,-1.784398,8.213774,12.270377,3.792453,29.622642,39.490566,5.716981,2.90566,33.165607,7.943396,69.0,85.521739,1.113208,-2.080783,-1.935192,,27664.211538,8053.133333,1148639.0,3991.452,19908.238095,21757.952381,1570393.0,21.97826,2.347826,2.134661,1.381988,1.304348,1.425362,1.48913,1.763424,2.09058,1.509962,1.329062,1.347826,1.499999,1.88587,1.618948,1.308734,,1.020542,1.853266,22.193632,22.018868,21.396226
White,3135.0,208.666667,69.366667,16.333333,1.0,1.2512,1.435,,1.666667,13.666667,20.666667,5.666667,28.333333,35.0,3.333333,5.333333,74.0741,3.333333,32.0,99.333333,5.0,26.0,3.14286,2.71429,1.16667,1.6,2.5,4.0,2.48571,2.375,1.55556,1.33333,1.2,2.0,2.0,1.76316,,100246.5,40338.0,6771.0,1093406.0,2743.5,16969.5,19219.0,-6.449037,1.166667,11.22,17.22,4.666667,28.0,31.666667,2.666667,5.333333,75.6614,4.333333,38.5,128.666667,2.0,-5.46822,-5.80532,,37793.0,6813.0,1104006.0,3357.0,16902.5,19647.0,1725966.0,26.0,3.14286,2.71429,1.16667,1.6,2.5,4.0,2.48571,2.375,1.55556,1.33333,1.2,2.0,2.0,18.62,1.18734,1.5675,,0.667121,7.989067,8.0
Widowed,2458.21921,226.511491,78.326458,14.945197,1.223788,1.774485,1.235103,2.063839,11.259461,17.450841,5.179916,26.769231,35.263783,4.273113,3.424061,51.019401,8.210856,37.775551,128.854167,5.11405,22.719585,2.039231,1.70144,1.394168,1.393874,1.523945,1.744265,1.658422,2.141702,1.639487,1.66964,1.709104,1.797419,2.003234,1.817711,,226681.464052,40968.026995,6443.295958,951225.8,3271.105,16236.510057,18166.413793,1469116.0,-6.039665,9.122096,14.746559,4.758986,27.692398,37.732386,4.957963,4.343991,53.368305,8.086034,40.00639,118.372176,2.898405,-4.882019,-4.823704,,36126.831365,6706.048816,964204.1,3371.2,16519.855951,18690.333586,1461396.0,23.64103,2.033463,1.724068,1.325367,1.343875,1.474285,1.660643,1.624628,1.97243,1.451835,1.418806,1.499909,1.60459,1.753841,1.610115,1.260279,1.664583,1.246703,2.551963,30.560797,30.496759,30.194461
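
Depending on the installed pandas version, `mean` on a grouped DataFrame that still contains text columns may raise a `TypeError` instead of silently skipping those columns (newer pandas releases behave this way). The sketch below, on a made-up DataFrame (`toy` and its values are assumptions for illustration), shows two ways to restrict the aggregation to numerical columns explicitly.

```python
import pandas as pd

# Hypothetical toy data with one text column (PTGENDER) besides the grouping column
toy = pd.DataFrame({
    'PTMARRY': ['Married', 'Divorced', 'Married', 'Divorced'],
    'PTGENDER': ['Male', 'Female', 'Male', 'Female'],
    'AGE': [74.0, 78.0, 82.0, 70.0],
    'MMSE': [28.0, 29.0, 20.0, 27.0],
})

# Option 1: select the numerical columns of interest before aggregating
means_selected = toy.groupby('PTMARRY')[['AGE', 'MMSE']].mean()

# Option 2: ask pandas to ignore all non-numerical columns
means_numeric = toy.groupby('PTMARRY').mean(numeric_only=True)

print(means_selected)
print(means_numeric)
```

Selecting columns explicitly (option 1) also keeps the output small, which helps with wide tables like the one used here.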


#### Exercise

Group the data by the patients' occupation. If you wanted to statistically analyse the patients' occupations, which problems in the data would you need to solve first?

<details>
<summary> > Solution</summary>
    
```python
data.groupby('WORK').size()
```
    
</details>
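
One possible first clean-up step, sketched under assumptions: free-text entries such as occupations often differ only in capitalisation or surrounding whitespace. The example below uses a small made-up Series (`work` and its values are illustrative, loosely modelled on entries visible in the tables above) and normalises the strings before counting.

```python
import pandas as pd

# Hypothetical free-text occupations, including a missing value
work = pd.Series(
    ['Secretary', 'secretary ', 'Registered Nurse', 'RN', None],
    name='WORK',
)

# Strip surrounding whitespace and unify capitalisation
cleaned = work.str.strip().str.lower()

# value_counts ignores missing values by default
counts_raw = work.value_counts()
counts_clean = cleaned.value_counts()

print(counts_raw)
print(counts_clean)
```

Note that simple string normalisation does not merge abbreviations with their long form (`rn` and `registered nurse` remain separate categories); that would require a manual mapping or a more elaborate approach.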

### 5.7. Subset data

We can select data by column label (e.g. column name):

- A single label (as *string*), e.g. `'a'`
- A *list* or array of labels `['a', 'b', 'c']`

Select data by a single label (column name) in the form of a string or list:

In [26]:
data['AGE'].head()  # Returns Series

0    74.2
1    82.4
2    81.4
3    81.3
4    80.5
Name: AGE, dtype: float64

In [27]:
data[['AGE']].head()  # Returns DataFrame

Unnamed: 0,AGE
0,74.2
1,82.4
2,81.4
3,81.3
4,80.5


Select data by a list of labels (column names):

In [28]:
data[['PTGENDER', 'AGE']].head()

Unnamed: 0,PTGENDER,AGE
0,Male,74.2
1,Male,82.4
2,Male,81.4
3,Male,81.3
4,Male,80.5


We can check for each element of a column whether it fulfills a certain condition, which returns a `Series` of boolean values (`True` if the condition is met, `False` otherwise).

In [29]:
data['PTGENDER'] == 'Female'  # Per row: Is the patient female?

0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
         ...  
14522    False
14523    False
14524     True
14525     True
14526    False
14527    False
14528    False
14529    False
14530    False
14531     True
Name: PTGENDER, dtype: bool

This `Series` of boolean values can be used to select only those rows of the DataFrame that fulfill the condition.

**Important**: The number of boolean values must be equal to the number of DataFrame rows. 

In [30]:
data[data['PTGENDER'] == 'Female']

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
15,132,bl,253,2005-11-29,LMCI,80.0,Female,13,Transportation Operation Sales,Not Hisp/Latino,White,Married,0,,,,,,,0.5,18.67,25.67,7.0,25.0,30.0,1.0,5.0,83.3333,3.0,34.0,168.0,0.0,,,,,,,,,,,,,,,,,33012.0,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,MCI,-10.3423,-11.136500,2005-11-29,0.5,18.67,25.67,7.0,25,30.0,1.0,5.0,83.33330,3.0,34.0,168.0,0.0,-10.342300,-11.13650,,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,,,,,,,,,,,,,,,,,,,,,,0.000000,0.00000,0.0,0.0,2019-12-04 04:19:56.0,
16,132,m06,253,2006-06-01,LMCI,79.4,Female,13,clerical worker,Not Hisp/Latino,White,Married,0,,,,,,,1.0,19.00,26.00,7.0,21.0,41.0,6.0,7.0,63.6364,,32.0,145.0,0.0,,,,,,,,,,,,,,,,,33031.0,40161.0,5536.0,957215.0,2479.0,18461.0,18689.0,1514266.0,MCI,-14.7828,-15.110800,2005-11-29,0.5,18.67,25.67,7.0,25,30.0,1.0,5.0,83.33330,3.0,34.0,168.0,0.0,-10.342300,-11.13650,,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,,,,,,,,,,,,,,,,,,,,,,0.503765,6.03279,6.0,6.0,2019-12-04 04:19:56.0,
17,132,m12,253,2006-11-20,LMCI,80.3,Female,13,Special Education Teacher,Not Hisp/Latino,White,Married,0,,,,,,,1.5,18.67,25.67,7.0,26.0,31.0,4.0,8.0,100.0000,0.0,33.0,153.0,0.0,,,,,,,,,,,,,,,,,62329.0,40672.0,5528.0,954562.0,2419.0,17851.0,18289.0,1512746.0,MCI,-10.4261,-10.898500,2005-11-29,0.5,18.67,25.67,7.0,25,30.0,1.0,5.0,83.33330,3.0,34.0,168.0,0.0,-10.342300,-11.13650,,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,,,,,,,,,,,,,,,,,,,,,,0.974675,11.67210,12.0,12.0,2019-12-04 04:19:56.0,
18,132,m18,253,2007-05-15,LMCI,80.0,Female,13,RN,Not Hisp/Latino,White,Married,0,,,,,,,2.0,11.33,21.33,9.0,24.0,31.0,3.0,6.0,85.7143,,33.0,94.0,2.0,,,,,,,,,,,,,,,,,72772.0,41166.0,5202.0,921976.0,2278.0,16852.0,17430.0,1482666.0,MCI,-12.5145,-11.594200,2005-11-29,0.5,18.67,25.67,7.0,25,30.0,1.0,5.0,83.33330,3.0,34.0,168.0,0.0,-10.342300,-11.13650,,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,,,,,,,,,,,,,,,,,,,,,,1.456540,17.44260,18.0,18.0,2019-12-04 04:19:56.0,
19,132,m36,253,2008-12-08,LMCI,81.2,Female,13,Book keeper,Not Hisp/Latino,White,Married,0,,,,,,,2.0,17.67,26.67,9.0,22.0,26.0,3.0,6.0,85.7143,0.0,28.0,209.0,6.0,,,,,,,,,,,,,,,,,160845.0,46545.0,4903.0,924397.0,2308.0,17675.0,17356.0,1492366.0,MCI,-15.6707,-16.399400,2005-11-29,0.5,18.67,25.67,7.0,25,30.0,1.0,5.0,83.33330,3.0,34.0,168.0,0.0,-10.342300,-11.13650,,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,,,,,,,,,,,,,,,,,,,,,,3.025330,36.22950,36.0,36.0,2019-12-04 04:19:56.0,
22,136,bl,164,2005-11-10,AD,73.9,Female,12,Professor of Neurophysiology,Not Hisp/Latino,White,Married,1,1.115320,,,357.4,329.9,31.26,5.0,12.33,24.33,10.0,24.0,20.0,2.0,5.0,100.0000,2.0,27.0,100.0,11.0,,,,,,,,,,,,,,,,,32253.0,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,Dementia,-13.917,-12.786000,2005-11-10,5,12.33,24.33,10.0,24,20.0,2.0,5.0,100.00000,2.0,27.0,100.0,11.0,-13.917000,-12.78600,,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,,,,,,,,,,,,,,,,357.4,329.9,31.26,1.11532,,,0.000000,0.00000,0.0,0.0,2019-12-04 04:19:56.0,
23,136,m06,164,2006-05-09,AD,74.4,Female,12,Registered Nurse,Not Hisp/Latino,White,Married,1,1.075580,,,,,,2.5,14.33,24.33,9.0,26.0,20.0,1.0,3.0,75.0000,,25.0,218.0,12.0,,,,,,,,,,,,,,,,,31901.0,29615.0,5313.0,989042.0,2417.0,15889.0,18590.0,1459662.0,Dementia,-11.1382,-11.852500,2005-11-10,5,12.33,24.33,10.0,24,20.0,2.0,5.0,100.00000,2.0,27.0,100.0,11.0,-13.917000,-12.78600,,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,,,,,,,,,,,,,,,,357.4,329.9,31.26,1.11532,,,0.492813,5.90164,6.0,6.0,2019-12-04 04:19:56.0,
24,136,m12,164,2006-11-09,AD,74.9,Female,12,Retail Store Owner/paint,Not Hisp/Latino,White,Married,1,1.007210,,,340,287.7,27.37,4.0,23.33,35.33,9.0,25.0,21.0,2.0,5.0,100.0000,1.0,24.0,144.0,15.0,,,,,,,,,,,,,,,,,94390.0,31565.0,5260.0,1010657.0,2182.0,15756.0,18013.0,1467272.0,Dementia,-13.0401,-12.485300,2005-11-10,5,12.33,24.33,10.0,24,20.0,2.0,5.0,100.00000,2.0,27.0,100.0,11.0,-13.917000,-12.78600,,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,,,,,,,,,,,,,,,,357.4,329.9,31.26,1.11532,,,0.996578,11.93440,12.0,12.0,2019-12-04 04:19:56.0,
25,136,m24,164,2007-11-07,AD,73.4,Female,12,Director of Purchasing,Not Hisp/Latino,White,Married,1,0.959766,,,,,,5.0,19.33,30.33,10.0,28.0,15.0,0.0,3.0,100.0000,0.0,16.0,280.0,20.0,,,,,,,,,,,,,,,,,94371.0,35428.0,4862.0,989150.0,2253.0,15649.0,18014.0,1469532.0,Dementia,-11.9584,-12.211800,2005-11-10,5,12.33,24.33,10.0,24,20.0,2.0,5.0,100.00000,2.0,27.0,100.0,11.0,-13.917000,-12.78600,,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,,,,,,,,,,,,,,,,357.4,329.9,31.26,1.11532,,,1.990420,23.83610,24.0,24.0,2019-12-04 04:19:56.0,
26,140,bl,175,2005-11-04,CN,78.3,Female,12,legal secretary,Hisp/Latino,White,Divorced,0,1.250960,,,1582,203.6,16.68,0.0,4.33,8.33,4.0,29.0,45.0,6.0,4.0,36.3636,10.0,30.0,101.0,0.0,,,,,,,,,,,,,,,,,59356.0,46281.0,6729.0,861745.0,3580.0,13781.0,17795.0,1269537.0,CN,-3.33317,-2.519980,2005-11-04,0,4.33,8.33,4.0,29,45.0,6.0,4.0,36.36360,10.0,30.0,101.0,0.0,-3.333170,-2.51998,,46281.0,6729.0,861745.0,3580.0,13781.0,17795.0,1269537.0,,,,,,,,,,,,,,,,1582,203.6,16.68,1.25096,,,0.000000,0.00000,0.0,0.0,2019-12-04 04:19:56.0,
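
Boolean conditions can also be combined. The following is a minimal sketch on a made-up DataFrame (`toy` and its values are assumptions for illustration); it uses the element-wise operators `&` (and) and `~` (not), and `.loc` to select rows and a column in one step.

```python
import pandas as pd

# Hypothetical toy data standing in for the PTGENDER and AGE columns
toy = pd.DataFrame({
    'PTGENDER': ['Male', 'Female', 'Female', 'Male'],
    'AGE': [74.2, 80.0, 73.9, 67.1],
})

# Each condition yields one boolean value per row
is_female = toy['PTGENDER'] == 'Female'
is_older = toy['AGE'] > 75

# Combine conditions element-wise: & (and), | (or), ~ (not)
female_and_older = toy[is_female & is_older]
not_female = toy[~is_female]

# .loc takes a row mask and a column label at the same time
ages_of_women = toy.loc[is_female, 'AGE']

print(female_and_older)
print(not_female)
print(ages_of_women)
```

When conditions are written inline, each one needs its own parentheses, e.g. `toy[(toy['PTGENDER'] == 'Female') & (toy['AGE'] > 75)]`, because `&` binds more tightly than `==` and `>`.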


#### Exercise

Get all patients' civil/marital status (column name *PTMARRY*). Return the selection as (a) Series and (b) DataFrame.

<details>
<summary> > Solution</summary>
    
```python
data['PTMARRY']    # (a) Returns Series
data[['PTMARRY']]  # (b) Returns DataFrame
```
    
</details>

Get all patients' age and gender.

<details>
<summary> > Solution</summary>
    
```python
data[['AGE', 'PTGENDER']]
```
    
</details>

Get all patients' age and gender but show only the last 5.

<details>
<summary> > Solution</summary>
    
```python
data[['AGE', 'PTGENDER']].tail(5)
```
    
</details>

Select only divorced patients.

<details>
<summary> > Solution</summary>
    
```python
data[data['PTMARRY'] == 'Divorced']
```
    
</details>

How many patients are divorced?

<details>
<summary> > Solution</summary>
    
```python
data[data['PTMARRY'] == 'Divorced'].shape[0]
```
    
</details>

Select cross-sectional baseline data, i.e. where column with name `VISCODE` (visit code) is `bl` (baseline).

<details>
<summary> > Solution</summary>
    
```python
data[data['VISCODE'] == 'bl']
```
    
</details>

How many rows (i.e. follow-up visits) are dropped by this selection?

<details>
<summary> > Solution</summary>
    
```python
data.shape[0] - data[data['VISCODE'] == 'bl'].shape[0]
```
    
</details>

Save this selection to the new variable `data_reduced` (use `copy()` to make a copy of this object’s indices and data). Use this variable from now on.

In [31]:
data_reduced = data[data['VISCODE'] == 'bl'].copy()
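
Why `copy()`? A selection made independent of the original DataFrame can be modified safely; without the copy, pandas may warn (e.g. with a `SettingWithCopyWarning`, depending on the version) when the selection is changed later. A minimal sketch on a made-up DataFrame (`toy`, `baseline`, and the column `AGE_ROUNDED` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with a baseline ('bl') and a follow-up ('m06') visit
toy = pd.DataFrame({
    'VISCODE': ['bl', 'm06', 'bl'],
    'AGE': [74.2, 74.7, 82.4],
})

# Independent copy of the baseline rows
baseline = toy[toy['VISCODE'] == 'bl'].copy()

# Adding a column to the copy leaves the original untouched
baseline['AGE_ROUNDED'] = baseline['AGE'].round()

print(list(toy.columns))
print(list(baseline.columns))
```

Note that the selection keeps the original index labels (here `0` and `2`), which matters for label-based operations such as `drop`.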

### 5.8. Drop columns and rows

#### Drop rows and columns by label

We can use the `drop` method to remove rows (`axis=0`, default) or columns (`axis=1`) by their labels. It returns a new DataFrame and leaves the original unchanged.

In [32]:
data_reduced.head(3)

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
0,128,bl,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,,,,,,,,,,,,,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0.0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,,,,,,,,,,,,,,,,,,,1.36665,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
1,129,bl,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
5,130,bl,175,2005-11-08,LMCI,67.1,Male,10,,Hisp/Latino,White,Married,0,,,,1501.0,153.1,13.29,1.0,14.33,21.33,6.0,27.0,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,,,,,,,,,,,,,,,,,64621.0,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,MCI,-8.54931,-9.60664,2005-11-08,1.0,14.33,21.33,6.0,27,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,-8.54931,-9.60664,,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,,,,,,,,,,,,,,,,1501.0,153.1,13.29,,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,


In [33]:
data_reduced.drop(0, axis=0).head(3)  # Remove the row with index label 0

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
1,129,bl,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
5,130,bl,175,2005-11-08,LMCI,67.1,Male,10,,Hisp/Latino,White,Married,0,,,,1501.0,153.1,13.29,1.0,14.33,21.33,6.0,27.0,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,,,,,,,,,,,,,,,,,64621.0,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,MCI,-8.54931,-9.60664,2005-11-08,1.0,14.33,21.33,6.0,27,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,-8.54931,-9.60664,,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,,,,,,,,,,,,,,,,1501.0,153.1,13.29,,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
10,131,bl,164,2005-09-07,CN,75.7,Male,16,,Not Hisp/Latino,White,Married,0,1.29343,,,547.3,337.0,33.43,0.0,8.67,14.67,4.0,29.0,37.0,4.0,4.0,44.4444,12.0,38.0,90.0,0.0,,,,,,,,,,,,,,,,,32253.0,34061.0,7068.0,1116634.0,4428.0,24789.0,21616.0,1640772.0,CN,-1.95295,-1.64932,2005-09-07,0.0,8.67,14.67,4.0,29,37.0,4.0,4.0,44.4444,12.0,38.0,90.0,0.0,-1.95295,-1.64932,,34061.0,7068.0,1116634.0,4428.0,24789.0,21616.0,1640772.0,,,,,,,,,,,,,,,,547.3,337.0,33.43,1.29343,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
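
Note that `drop` looks up index labels, not positions. After subsetting, the index has gaps (here `0, 1, 5, 10, ...`), so a label may no longer exist. A minimal sketch on a made-up DataFrame with such a gapped index (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with index labels 0, 1, 5 (as left over after subsetting)
toy = pd.DataFrame(
    {'RID': [128, 129, 130], 'AGE': [74.2, 82.4, 67.1]},
    index=[0, 1, 5],
)

# Drop by label: removes the row labelled 5 (which sits at position 2)
without_label_5 = toy.drop(5, axis=0)

# The keyword `columns` is an alternative to drop('RID', axis=1)
without_rid = toy.drop(columns='RID')

# Dropping by position can be done with iloc instead
without_first = toy.iloc[1:]

# A label that does not exist raises a KeyError
try:
    toy.drop(2, axis=0)
    label_2_found = True
except KeyError:
    label_2_found = False

print(without_label_5.index.tolist())
print(label_2_found)
```

All calls above return new objects; `toy` itself still has three rows and two columns afterwards.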


In [34]:
data_reduced.drop('RID', axis=1).head(3)  # Remove RID column

Unnamed: 0,VISCODE,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
0,bl,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,,,,,,,,,,,,,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0.0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,,,,,,,,,,,,,,,,,,,1.36665,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
1,bl,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
5,bl,175,2005-11-08,LMCI,67.1,Male,10,,Hisp/Latino,White,Married,0,,,,1501.0,153.1,13.29,1.0,14.33,21.33,6.0,27.0,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,,,,,,,,,,,,,,,,,64621.0,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,MCI,-8.54931,-9.60664,2005-11-08,1.0,14.33,21.33,6.0,27,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,-8.54931,-9.60664,,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,,,,,,,,,,,,,,,,1501.0,153.1,13.29,,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,


In [35]:
data_reduced.drop(['RID', 'VISCODE'], axis=1).head(3)  # Remove RID and VISCODE columns

Unnamed: 0,SITE,EXAMDATE,DX_bl,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,EXAMDATE_bl,CDRSB_bl,ADAS11_bl,ADAS13_bl,ADASQ4_bl,MMSE_bl,RAVLT_immediate_bl,RAVLT_learning_bl,RAVLT_forgetting_bl,RAVLT_perc_forgetting_bl,LDELTOTAL_BL,DIGITSCOR_bl,TRABSCOR_bl,FAQ_bl,mPACCdigit_bl,mPACCtrailsB_bl,FLDSTRENG_bl,Ventricles_bl,Hippocampus_bl,WholeBrain_bl,Entorhinal_bl,Fusiform_bl,MidTemp_bl,ICV_bl,MOCA_bl,EcogPtMem_bl,EcogPtLang_bl,EcogPtVisspat_bl,EcogPtPlan_bl,EcogPtOrgan_bl,EcogPtDivatt_bl,EcogPtTotal_bl,EcogSPMem_bl,EcogSPLang_bl,EcogSPVisspat_bl,EcogSPPlan_bl,EcogSPOrgan_bl,EcogSPDivatt_bl,EcogSPTotal_bl,ABETA_bl,TAU_bl,PTAU_bl,FDG_bl,PIB_bl,AV45_bl,Years_bl,Month_bl,Month,M,update_stamp,Unnamed: 109
0,164,2005-09-08,CN,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,,,,,,,,,,,,,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.23545,2005-09-08,0.0,10.67,18.67,5.0,28,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,-4.41005,-4.23545,,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,,,,,,,,,,,,,,,,,,,1.36665,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
1,164,2005-09-12,AD,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.0,31.0,8.0,20.0,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,,,,,,,,,,,,,,,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.2332,2005-09-12,4.5,22.0,31.0,8.0,20,22.0,1.0,4.0,100.0,2.0,25.0,148.0,10.0,-16.6244,-16.2332,,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,,,,,,,,,,,,,,,,741.5,239.7,22.83,1.08355,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
5,175,2005-11-08,LMCI,67.1,Male,10,,Hisp/Latino,White,Married,0,,,,1501.0,153.1,13.29,1.0,14.33,21.33,6.0,27.0,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,,,,,,,,,,,,,,,,,64621.0,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,MCI,-8.54931,-9.60664,2005-11-08,1.0,14.33,21.33,6.0,27,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,-8.54931,-9.60664,,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,,,,,,,,,,,,,,,,1501.0,153.1,13.29,,,,0.0,0.0,0.0,0.0,2019-12-04 04:19:56.0,
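
Note that `drop()` returns a new DataFrame and leaves the original untouched, which is why `RID` is still available in `data_reduced` for the second call above. The following is a minimal sketch on a small made-up DataFrame (the name `toy` and all values are invented for illustration and are not taken from the dataset). It also shows the `columns=` keyword, which is an equivalent spelling of passing the labels together with `axis=1`.

```python
import pandas as pd

# Toy data; values are invented for illustration only
toy = pd.DataFrame({
    'RID': [1, 2, 3],
    'VISCODE': ['bl', 'bl', 'bl'],
    'AGE': [70.0, 65.5, 80.1],
})

# Both calls remove the same two columns
dropped_axis = toy.drop(['RID', 'VISCODE'], axis=1)
dropped_kw = toy.drop(columns=['RID', 'VISCODE'])

print(list(dropped_axis.columns))  # remaining column names
print(list(toy.columns))           # the original DataFrame is unchanged
```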


#### Exercise

We want to remove all columns whose names contain the patterns **\*_bl** or **Ecog\***.

1. Look again at column names.

In [36]:
list(data_reduced.columns)

['RID',
 'VISCODE',
 'SITE',
 'EXAMDATE',
 'DX_bl',
 'AGE',
 'PTGENDER',
 'PTEDUCAT',
 'WORK',
 'PTETHCAT',
 'PTRACCAT',
 'PTMARRY',
 'APOE4',
 'FDG',
 'PIB',
 'AV45',
 'ABETA',
 'TAU',
 'PTAU',
 'CDRSB',
 'ADAS11',
 'ADAS13',
 'ADASQ4',
 'MMSE',
 'RAVLT_immediate',
 'RAVLT_learning',
 'RAVLT_forgetting',
 'RAVLT_perc_forgetting',
 'LDELTOTAL',
 'DIGITSCOR',
 'TRABSCOR',
 'FAQ',
 'MOCA',
 'EcogPtMem',
 'EcogPtLang',
 'EcogPtVisspat',
 'EcogPtPlan',
 'EcogPtOrgan',
 'EcogPtDivatt',
 'EcogPtTotal',
 'EcogSPMem',
 'EcogSPLang',
 'EcogSPVisspat',
 'EcogSPPlan',
 'EcogSPOrgan',
 'EcogSPDivatt',
 'EcogSPTotal',
 'FLDSTRENG',
 'IMAGEUID',
 'Ventricles',
 'Hippocampus',
 'WholeBrain',
 'Entorhinal',
 'Fusiform',
 'MidTemp',
 'ICV',
 'DX',
 'mPACCdigit',
 'mPACCtrailsB',
 'EXAMDATE_bl',
 'CDRSB_bl',
 'ADAS11_bl',
 'ADAS13_bl',
 'ADASQ4_bl',
 'MMSE_bl',
 'RAVLT_immediate_bl',
 'RAVLT_learning_bl',
 'RAVLT_forgetting_bl',
 'RAVLT_perc_forgetting_bl',
 'LDELTOTAL_BL',
 'DIGITSCOR_bl',
 'TRABSCOR_b

2. Get all column names with the patterns **\*_bl** or **Ecog\*** and save them in a list `column_names_unwanted`.

Let's walk through the following `for` loop with a condition (`if-else` statement):

- We initialize an empty list `column_names_unwanted`, which will store the names of all columns that we want to drop from our DataFrame.
- We iterate over all column names (`data_reduced.columns`) using the `for` loop.
- For each column name (`column_name`) we test a condition:
  - If the column name contains the patterns **\*_bl** or **Ecog\***, append this column name to our `column_names_unwanted` list.
  - Else do nothing.

In [37]:
column_names_unwanted = []

for column_name in data_reduced.columns:
    if ('_bl' in column_name) or ('Ecog' in column_name):
        column_names_unwanted.append(column_name)
    else:
        pass

column_names_unwanted

['DX_bl',
 'EcogPtMem',
 'EcogPtLang',
 'EcogPtVisspat',
 'EcogPtPlan',
 'EcogPtOrgan',
 'EcogPtDivatt',
 'EcogPtTotal',
 'EcogSPMem',
 'EcogSPLang',
 'EcogSPVisspat',
 'EcogSPPlan',
 'EcogSPOrgan',
 'EcogSPDivatt',
 'EcogSPTotal',
 'EXAMDATE_bl',
 'CDRSB_bl',
 'ADAS11_bl',
 'ADAS13_bl',
 'ADASQ4_bl',
 'MMSE_bl',
 'RAVLT_immediate_bl',
 'RAVLT_learning_bl',
 'RAVLT_forgetting_bl',
 'RAVLT_perc_forgetting_bl',
 'DIGITSCOR_bl',
 'TRABSCOR_bl',
 'FAQ_bl',
 'mPACCdigit_bl',
 'mPACCtrailsB_bl',
 'FLDSTRENG_bl',
 'Ventricles_bl',
 'Hippocampus_bl',
 'WholeBrain_bl',
 'Entorhinal_bl',
 'Fusiform_bl',
 'MidTemp_bl',
 'ICV_bl',
 'MOCA_bl',
 'EcogPtMem_bl',
 'EcogPtLang_bl',
 'EcogPtVisspat_bl',
 'EcogPtPlan_bl',
 'EcogPtOrgan_bl',
 'EcogPtDivatt_bl',
 'EcogPtTotal_bl',
 'EcogSPMem_bl',
 'EcogSPLang_bl',
 'EcogSPVisspat_bl',
 'EcogSPPlan_bl',
 'EcogSPOrgan_bl',
 'EcogSPDivatt_bl',
 'EcogSPTotal_bl',
 'ABETA_bl',
 'TAU_bl',
 'PTAU_bl',
 'FDG_bl',
 'PIB_bl',
 'AV45_bl',
 'Years_bl',
 'Month_bl']
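
The same selection can also be written as a list comprehension. One detail worth noticing: the test `'_bl' in column_name` is case-sensitive, so the column `'LDELTOTAL_BL'` (upper-case suffix) does not appear in the list above. The sketch below uses a short hand-picked list of column names; the variable names `column_names`, `unwanted`, and `unwanted_any_case` are assumptions made for illustration. Whether `'LDELTOTAL_BL'` should be dropped as well is a decision for the analysis, which this sketch does not settle.

```python
# A few column names copied from the listing above
column_names = ['RID', 'DX_bl', 'EcogPtMem', 'MMSE', 'LDELTOTAL_BL', 'MMSE_bl']

# Same test as the for loop, written as a list comprehension
unwanted = [name for name in column_names if '_bl' in name or 'Ecog' in name]

# Case-insensitive variant: compare against the lower-cased name
unwanted_any_case = [
    name for name in column_names
    if '_bl' in name.lower() or 'ecog' in name.lower()
]

print(unwanted)
print(unwanted_any_case)
```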

3. Drop all unwanted columns listed in `column_names_unwanted`.

<details>
<summary> > Solution</summary>
    
```python
data_reduced.drop(column_names_unwanted, axis=1)
```
    
</details>

4. Save this selection by overwriting the variable `data_reduced`.

In [38]:
data_reduced = data_reduced.drop(column_names_unwanted, axis=1).copy()  # Note .copy()!!

How many columns are left in our dataset?

<details>
<summary> > Solution</summary>
    
```python
data_reduced.shape
```
    
</details>

5. Also drop the last column, which contains only NaN values.

In [39]:
data_reduced = data_reduced.drop('Unnamed: 109', axis=1).copy()
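
If the name of an empty column is not known in advance, one possible alternative is `dropna()` with `axis=1` and `how='all'`, which removes every column that consists only of NaN values. This is a sketch on invented toy data (the name `toy` and its values are assumptions), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; the second column contains only NaN values
toy = pd.DataFrame({
    'MMSE': [28.0, 20.0, np.nan],
    'Unnamed: 109': [np.nan, np.nan, np.nan],
})

# how='all' drops a column only if every value in it is NaN,
# so 'MMSE' (one NaN, two values) is kept
toy_clean = toy.dropna(axis=1, how='all')

print(list(toy_clean.columns))
```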

#### Drop rows that contain NaN values in certain columns

We want to drop rows that contain NaN values in important columns such as the 
- diagnosis (`'DX'`) 
- Mini Mental Status Examination (`'MMSE'`) and
- Hippocampus volume (`'Hippocampus'`).

1. Check if these columns contain NaN values.

The diagnosis is a categorical parameter and can be checked for NaN values using the `unique()` function, which returns an array of the unique values contained in the selected column.

In [40]:
data_reduced['DX'].unique()

array(['CN', 'Dementia', 'MCI', nan], dtype=object)

The MMSE and hippocampus volume parameters are numerical values. Here, we can check if any (`any()`) NaN value (`isnull()`) exists in the selected column.

In [41]:
data_reduced['MMSE'].isnull().values.any()

False

In [42]:
data_reduced['Hippocampus'].isnull().values.any()

True
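
Instead of a yes/no answer per column, it can be useful to count the NaN values in several columns at once. A minimal sketch, assuming a small invented DataFrame `toy` that reuses the three column names from above (the values are made up for illustration): `isnull()` marks each NaN as `True`, and `sum()` counts the `True` values per column.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration only
toy = pd.DataFrame({
    'DX': ['CN', np.nan, 'MCI'],
    'MMSE': [28.0, 20.0, 27.0],
    'Hippocampus': [8340.0, np.nan, np.nan],
})

# Number of NaN values per selected column
nan_counts = toy[['DX', 'MMSE', 'Hippocampus']].isnull().sum()

print(nan_counts)
```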

2. Now drop the rows using the function `dropna()` with the parameters `axis=0` (for index) and `subset` to specify which columns are to be checked for NaN values.

In [43]:
data_reduced.dropna(axis=0, subset=['DX', 'MMSE', 'Hippocampus'])

Unnamed: 0,RID,VISCODE,SITE,EXAMDATE,AGE,PTGENDER,PTEDUCAT,WORK,PTETHCAT,PTRACCAT,PTMARRY,APOE4,FDG,PIB,AV45,ABETA,TAU,PTAU,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,DIGITSCOR,TRABSCOR,FAQ,MOCA,FLDSTRENG,IMAGEUID,Ventricles,Hippocampus,WholeBrain,Entorhinal,Fusiform,MidTemp,ICV,DX,mPACCdigit,mPACCtrailsB,LDELTOTAL_BL,Month,M,update_stamp
0,128,bl,164,2005-09-08,74.2,Male,16,technical writer and editor,Not Hisp/Latino,White,Married,0,1.36665,,,,,,0.0,10.67,18.67,5.0,28.0,44.0,4.0,6.0,54.5455,10.0,34.0,112.0,0.0,,,35479.0,118226.0,8340.0,1229736.0,4182.0,16562.0,27930.0,1984658.0,CN,-4.41005,-4.235450,10.0,0.0,0.0,2019-12-04 04:19:56.0
1,129,bl,164,2005-09-12,82.4,Male,18,Secretary,Not Hisp/Latino,White,Married,1,1.08355,,,741.5,239.7,22.83,4.5,22.00,31.00,8.0,20.0,22.0,1.0,4.0,100.0000,2.0,25.0,148.0,10.0,,,32241.0,84599.0,5318.0,1129825.0,1790.0,15507.0,18425.0,1920689.0,Dementia,-16.6244,-16.233200,2.0,0.0,0.0,2019-12-04 04:19:56.0
5,130,bl,175,2005-11-08,67.1,Male,10,,Hisp/Latino,White,Married,0,,,,1501,153.1,13.29,1.0,14.33,21.33,6.0,27.0,37.0,7.0,4.0,36.3636,4.0,25.0,271.0,0.0,,,64621.0,39604.0,6870.0,1154979.0,3981.0,19035.0,19614.0,1679439.0,MCI,-8.54931,-9.606640,4.0,0.0,0.0,2019-12-04 04:19:56.0
10,131,bl,164,2005-09-07,75.7,Male,16,,Not Hisp/Latino,White,Married,0,1.29343,,,547.3,337,33.43,0.0,8.67,14.67,4.0,29.0,37.0,4.0,4.0,44.4444,12.0,38.0,90.0,0.0,,,32253.0,34061.0,7068.0,1116634.0,4428.0,24789.0,21616.0,1640772.0,CN,-1.95295,-1.649320,12.0,0.0,0.0,2019-12-04 04:19:56.0
15,132,bl,253,2005-11-29,80.0,Female,13,Transportation Operation Sales,Not Hisp/Latino,White,Married,0,,,,,,,0.5,18.67,25.67,7.0,25.0,30.0,1.0,5.0,83.3333,3.0,34.0,168.0,0.0,,,33012.0,39822.0,5342.0,927504.0,2276.0,17966.0,17802.0,1485836.0,MCI,-10.3423,-11.136500,3.0,0.0,0.0,2019-12-04 04:19:56.0
20,133,bl,175,2005-10-06,75.3,Male,10,chartered accountant,Hisp/Latino,More than one,Married,1,,,,,,,6.0,27.33,40.33,10.0,20.0,17.0,2.0,3.0,75.0000,0.0,9.0,300.0,17.0,,,59346.0,25699.0,6725.0,875793.0,2046.0,12067.0,15379.0,1353521.0,Dementia,-19.9104,-19.643100,0.0,0.0,0.0,2019-12-04 04:19:56.0
22,136,bl,164,2005-11-10,73.9,Female,12,Professor of Neurophysiology,Not Hisp/Latino,White,Married,1,1.11532,,,357.4,329.9,31.26,5.0,12.33,24.33,10.0,24.0,20.0,2.0,5.0,100.0000,2.0,27.0,100.0,11.0,,,32253.0,26819.0,5479.0,1033537.0,2674.0,16758.0,19736.0,1471182.0,Dementia,-13.917,-12.786000,2.0,0.0,0.0,2019-12-04 04:19:56.0
26,140,bl,175,2005-11-04,78.3,Female,12,legal secretary,Hisp/Latino,White,Divorced,0,1.25096,,,1582,203.6,16.68,0.0,4.33,8.33,4.0,29.0,45.0,6.0,4.0,36.3636,10.0,30.0,101.0,0.0,,,59356.0,46281.0,6729.0,861745.0,3580.0,13781.0,17795.0,1269537.0,CN,-3.33317,-2.519980,10.0,0.0,0.0,2019-12-04 04:19:56.0
31,141,bl,253,2005-10-18,80.5,Male,18,Homemaker,Not Hisp/Latino,White,Married,1,,,,,,,0.0,7.00,9.00,2.0,29.0,50.0,5.0,3.0,25.0000,19.0,63.0,49.0,0.0,,,33053.0,33418.0,6731.0,942729.0,4307.0,14949.0,17272.0,1500997.0,CN,3.70528,2.999520,19.0,0.0,0.0,2019-12-04 04:19:56.0
34,142,bl,164,2005-10-13,64.6,Male,9,Ombudsman,Not Hisp/Latino,Black,Married,1,1.38702,,,550.6,170.5,15.88,0.0,10.33,14.33,4.0,28.0,40.0,8.0,6.0,50.0000,9.0,46.0,79.0,0.0,,,32291.0,17310.0,7306.0,936538.0,3463.0,15931.0,17596.0,1351994.0,CN,-2.97098,-3.141490,9.0,0.0,0.0,2019-12-04 04:19:56.0
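
Like `drop()`, `dropna()` returns a new DataFrame: the call above displays the filtered rows but does not change `data_reduced` unless the result is assigned to a variable. The sketch below illustrates this on invented toy data (the names `toy` and `complete` and all values are assumptions). It also shows that NaN values in columns outside `subset` (here `'FDG'`) do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration only
toy = pd.DataFrame({
    'DX': ['CN', np.nan, 'MCI', 'Dementia'],
    'MMSE': [28.0, 20.0, 27.0, 20.0],
    'Hippocampus': [8340.0, 5318.0, np.nan, 6725.0],
    'FDG': [np.nan, 1.08, 1.2, np.nan],
})

# Keep only rows without NaN in the three columns of interest
complete = toy.dropna(axis=0, subset=['DX', 'MMSE', 'Hippocampus'])

print(toy.shape)       # original DataFrame is unchanged
print(complete.shape)  # rows with NaN in DX or Hippocampus are gone
```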


**Exercise**

How many patients have values in all three columns: diagnosis, MMSE, and hippocampus volume?

<details>
<summary> > Solution</summary>
    
```python
data_reduced.shape
data_reduced.dropna(axis=0, subset=['DX', 'MMSE', 'Hippocampus']).shape
```
    
</details>

### 5.9 Write data

Save our reduced DataFrame `data_reduced` to a CSV file next to the original CSV file.

In [44]:
data_reduced.to_csv('data/alzheimers_disease_reduced.csv', index=False)
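
The parameter `index=False` tells *pandas* not to write the row index as an extra first column. A small sketch of the difference, assuming an invented DataFrame `toy` and writing to in-memory text buffers (`io.StringIO`) instead of files so that nothing is created on disk:

```python
import io

import pandas as pd

# Toy data with a non-consecutive index; values are invented for illustration
toy = pd.DataFrame({'RID': [128, 129], 'DX': ['CN', 'Dementia']}, index=[0, 5])

# Write once without and once with the index
buffer_without_index = io.StringIO()
toy.to_csv(buffer_without_index, index=False)

buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)

# Compare the header lines of both outputs
header_without = buffer_without_index.getvalue().splitlines()[0]
header_with = buffer_with_index.getvalue().splitlines()[0]

print(header_without)
print(header_with)  # starts with an empty field for the unnamed index
```

When a file written with the index is read back with `pd.read_csv()`, that unnamed first column typically shows up as `Unnamed: 0`.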

## 6. Discussion

With the help of **Python's *numpy* and *pandas* packages**, we performed basic **data science** tasks on data for the early detection and tracking of **Alzheimer's disease**: 

- Read in a CSV file
- Looked at basic descriptive statistics of the dataset
- Selected columns and rows of importance
- Saved the reduced dataset for later sessions

## 7. Quiz

- Name typical steps during data cleaning/preparation.
- Name challenges/artifacts you can face when working with a dataset.
- Name important measures in descriptive statistics.