<p style="text-align:center; font-size: 28px;">CS-C4100 - Digital Health and Human Behavior (2025)</p>

In [None]:
ID = ""  # Your Student Id

<div class="alert alert-info">

<details open>
    <summary style="font-size: 22px"><b>⚠️ General Tips</b></summary>

* Please review [the general course info](https://mycourses.aalto.fi/course/view.php?id=46377) carefully.
* Feel free to create new code cells. This is a good way to write extra tests you may want to run. Note that all cells are executed during evaluation. 
* **Never** create new cells by menu commands "Edit/Copy Cells" and "Edit/Paste Cells ...". These commands create cells with duplicate ids and make autograding impossible. Use menu commands "Insert/Insert Cell ..." or the button with a plus sign to insert new cells.
* When you write code required to solve an assignment, we highly recommend you to insert the code where it says `# YOUR CODE HERE`.
* If your notebook is broken (e.g. accidentally removed a hidden tests cell, etc.), you can always re-fetch the assignment. To do so, rename the existing folder and fetch the assignment again.
* Do not forget to "Submit" your solution in the "Nbgrader→Assignment List" page after finishing your work. 
* To better understand how doing assignments in JupyterLab works, please refer to the [documentation](https://scicomp.aalto.fi/aalto/jupyterhub/nbgrader-jupyterlab/).
    
**Note:** Exercise sessions take place on Mondays 14:15 - 15:45 via [Zoom](https://aalto.zoom.us/j/61175427911) and in-person (Tietotekniikka, C111).

</details>
</div>

---


# Assignment 1.1: Prerequisites Assessment

We will use **Python** as the programming language for all the coding assignments in this course. 
To refresh your skills, see the [Python tutorial](https://docs.python.org/3/tutorial/).
> Please note that if you do not have any previous Python experience, the course may require more than 5 ECTS of workload.

<div class="alert alert-info">

### 📚 Learning outcomes
Upon completing this assignment, you will be able to: 

1. Understand the characteristics of digital health data and their applications.
2. Apply common analysis tasks for digital health data.
3. Become acquainted with the basic usage of all the techniques you will need in future assignments and project work.

</div>

For those of you already familiar with Python coding in the data science and machine learning field, this assessment will also give you a quick recap. More detailed concepts and non-trivial techniques will be introduced in the respective assignments.

To complete the assignments, you will need some general knowledge of data science, data visualization and a tiny bit of machine learning.

> Don't panic even if you have **absolute zero** or **very limited** experience with the topics mentioned above: the course is aimed primarily at **undergraduate students**, and the tasks should not be too difficult to solve.


### Table of Contents

* [Chapter 1: Concept of digital health data](#chapter_1)
* [Chapter 2: Data analysis workflow](#chapter_2)
* [Chapter 3: Practice: Data analysis workflow](#chapter_3)
    * [Chapter 3.1: Importing basic packages](#chapter_3_1)
    * [Chapter 3.2: Concept of data types](#chapter_3_2)
    * [Chapter 3.3: Data handling and preprocessing](#chapter_3_3)
    * [Chapter 3.4: Plotting techniques (visualization)](#chapter_3_4)
* [Chapter 4: Basic Concepts of Machine Learning](#chapter_4)
    * [Chapter 4.1: Supervised learning](#chapter_4_1)
    * [Chapter 4.2: Unsupervised learning](#chapter_4_2)
    * [Chapter 4.3: Natural language processing (NLP)](#chapter_4_3)
    
* [References](#references)

<hr />

<a class="anchor" id="chapter_1"></a>
## 1. Concept of digital health data

Owing to the rapid adoption of electronic health systems and the ubiquity of mobile and wearable devices, researchers now have access to a rich and non-intrusive source of personal data called *digital health data*. According to the World Health Organization (WHO), digital health is an umbrella term that covers multiple aspects, such as mobile health (mHealth), digital health records, and personalized medicine [[1]](#references). By collecting and processing personal data from different sources, researchers can extract meaningful behavioural features such as the number of steps per day, sleep duration each night, or the number of outgoing/incoming calls.

So, what can we do with those features? Health and behavioural digital features are commonly used in predictive modeling or association discovery. For instance, by monitoring measures related to sleep (total sleep time, time in bed), researchers have found significant positive associations between total sleep time and depression [[2]](#references). In the same manner, the duration of incoming and outgoing calls per day and the number and duration of incoming calls per day have been used to predict bipolar disorder episodes [[3]](#references), thus providing an opportunity for relapse intervention. Moreover, natural language processing techniques have been used to infer evidence of suicide risk from social media posts [[4]](#references). Digital health applications have been continuously gaining traction in the healthcare landscape and are playing an increasingly important role in policy-making.

In this course, we will mainly deal with behavioural data. Due to its multi-sensor and longitudinal nature, behavioural data possesses the following characteristics:

1. **Noise and missingness**. Noise and missingness occur for various reasons: participants dropping out of surveys, data entry errors, or even the sudden death of individuals. Dealing with noise and missingness is one of the most important parts of any digital health data project.

2. **Unstructured and fragmented**. As mentioned above, digital health data can be gathered from various sources and is thus highly fragmented. The data often comes in unstandardized formats (physician notes, blog posts), which makes it difficult to process and digest. Curating and maintaining steady data streams is a challenge in itself.

3. **Privacy and security**. Digital health data is extremely personal and sensitive and, thus, must be handled with utmost care. Problems surrounding privacy policies and proper handling of digital health data will be discussed later in the course.

Inevitably, proper cleaning and preprocessing of the data takes up much of the time allocated to a digital health project. In this assignment, we will go through all the steps of a typical digital health data analysis workflow, with extra focus on the preprocessing phase. We will also have a chance to practice on a real-life longitudinal dataset.


<hr />

<a class="anchor" id="chapter_2"></a>
## 2. Data analysis workflow
A common data analysis workflow often consists of five stages. We will provide a description for each stage and some corresponding packages that you can use.

**1. Loading**

In this stage, we simply read/load data from different sources: APIs, CSV files, databases, etc. One of the most popular frameworks for reading and manipulating data is [pandas](https://pandas.pydata.org/). Although there are many ways to gather the data, we will only use pandas to read from CSV files in this course.

**2. Preprocessing**

Undoubtedly, this is the most time-consuming step in the workflow. There are several things we could do here. We can start by investigating the data types and checking various statistics (min, max, mean, and standard deviation values). Then, we can examine data quality (visualizing missingness, detecting outliers), drop duplicates, or scale and standardize values. The number of steps can vary significantly depending on the data. Once again, [pandas](https://pandas.pydata.org/) provides many helpful functions for those tasks. On top of that, we often use [numpy](https://numpy.org/) for data manipulation. For specific computational problems, such as calculating distances or sampling from different distributions, we can use [scikit-learn](https://scikit-learn.org/stable/) or [scipy](https://www.scipy.org/).

**3. Exploratory data analysis (EDA)**

A picture is worth a thousand words. Data visualization helps us gain insights from our data, discover underlying structures, and reveal relationships between variables. The most common visualization library is [matplotlib.pyplot](https://matplotlib.org/). For more polished and advanced plots, refer to [seaborn](https://seaborn.pydata.org/). Finally, if you need interactive plots, use [plotly](https://plotly.com/).

**4. Modeling**

Modeling refers to applying algorithms to our dataset to achieve a desired output. The output varies based on our needs. We could predict some metrics or find some associations between the variables. We could also discover clusters within our data or interpret the structure of some texts. Regardless of the task, you can use [scikit-learn](https://scikit-learn.org/stable/) to build the model that suits your needs. To use traditional statistical models, refer to [statsmodels](https://www.statsmodels.org/stable/index.html). Natural language processing frameworks are discussed at the end of this assignment. More advanced methods like deep learning will not be covered in this course.

**5. Evaluation**

The final step is interpreting and visualizing the results we have obtained from our model. This step includes choosing the best model (through cross-validation and AIC comparison) or examining the quality of our model (checking goodness-of-fit and reliability of estimates). Once we are satisfied with the results, we can make some nice plots to present them, for example on a poster or in a paper. We will, however, not go too deep into the details of this step in this assignment.

For a nice overview of the workflow, refer to this [link](https://aaltoscicomp.github.io/data-analysis-workflows-course/chapter-1-understanding/).
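
The stages above can be sketched on a few rows of made-up data. This is only a minimal illustration, not part of the assignment: the CSV content and the column names `steps` and `sleep_hours` are hypothetical, and a simple correlation stands in for the modeling and evaluation stages.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
raw_csv = io.StringIO(
    "steps,sleep_hours\n"
    "4000,6.5\n"
    "4000,6.5\n"   # exact duplicate of the previous row
    "9000,\n"      # missing sleep value
    "7000,7.0\n"
    "12000,8.0\n"
)

# 1. Loading
df = pd.read_csv(raw_csv)

# 2. Preprocessing: drop duplicate rows, then rows with missing values
clean = df.drop_duplicates().dropna()

# 3. Exploratory data analysis: summary statistics
summary = clean.describe()
print(summary)

# 4./5. Stand-in for modeling and evaluation: correlation between the columns
corr = clean["steps"].corr(clean["sleep_hours"])
print("Correlation:", corr)
```

In a real project each stage is far more involved, but the order of operations stays the same.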

<div class="alert alert-success">

<details open>
    <summary style="font-size: 20px"><b>References for Extra Reading</b></summary>
    
- [15 Python libraries for data science](https://www.dataquest.io/blog/15-python-libraries-for-data-science/)
- [Numpy tutorial](https://cs231n.github.io/python-numpy-tutorial/#numpy)
- [Pandas userguide](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [An introduction to machine learning with scikit-learn](https://scikit-learn.org/1.4/tutorial/basic/tutorial.html)

</details>
</div>

<hr />

<a class="anchor" id="chapter_3"></a>
## 3. Practice: Data analysis workflow
Enough with the theory! In this chapter, we will put what we have just learnt into practice. We will perform a basic analysis workflow on a real dataset ([Pima Indian diabetes](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)).

<a class="anchor" id="chapter_3_1"></a>
### 3.1. Importing basic packages

Run the next cell to import the baseline packages.

In [None]:
import numpy as np  # Scientific computing
import pandas as pd  # Data loading and processing
import os  # OS operations
import matplotlib.pyplot as plt  # For generating figures
import seaborn as sns  # For generating visualizations, better support with pandas than matplotlib
import json  # Json package
from datetime import datetime  # To handle objects with a certain date format
from tqdm.notebook import tqdm  # Progress bar

if 'AALTO_JUPYTERHUB' in os.environ:
    # using jupyter sharedata directory
    DATA = '/coursedata/pa1/'
else:
    DATA = '../../data/pa1/'

np.random.seed(123)  # To set a random seed
plt.rcParams['font.size'] = 20

<a class="anchor" id="chapter_3_2"></a>
### 3.2. Concept of data types

<!-- https://www.linkedin.com/posts/spotlit-ai_introducing-data-science-activity-6811549253794619394-m1eS/ -->
To understand the data, knowing the data types you are dealing with is essential. It helps to select the right tools for visualization and further analysis.


**Numerical data:**
- Discrete data - data that consists of numerical values that are distinct and separate. In other words, this data type cannot be measured but can be counted. Examples include the number of students in a class, the number of languages an individual speaks, and shoe sizes.
- Continuous data - data that consists of numerical values that cannot be counted but can be measured. Examples: height, weight. This type of data can be divided into interval data (ordered units that have the same difference but no absolute zero, for example, temperature in degrees Celsius) and ratio data (the same as interval data, except that it does have an absolute zero, for example, height and weight).


**Categorical data:**


- Nominal data - the data that represents distinct categories with no order. Examples include marital status (married, single, divorced, widowed, and so on) and the languages an individual speaks (English, Finnish, Swedish).
- Ordinal data - the data that represents ordered categories. Examples include education (elementary, high school, undergraduate, graduate) and customer satisfaction (satisfied, somewhat satisfied, unsatisfied). The main limitation of this data type is that the distances (differences) between the values are not known.
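
As a rough illustration of how these four types can be represented in `pandas`, consider the sketch below. The toy DataFrame and its column names are hypothetical and not part of the course data.

```python
import pandas as pd

# Hypothetical toy data covering the four data types described above
people = pd.DataFrame({
    "languages_spoken": [1, 3, 2],                            # numerical, discrete
    "height_cm": [171.5, 180.2, 165.0],                       # numerical, continuous
    "marital_status": ["single", "married", "single"],       # categorical, nominal
    "education": ["graduate", "high school", "elementary"],  # categorical, ordinal
})

# Nominal: categories without any order
people["marital_status"] = people["marital_status"].astype("category")

# Ordinal: categories with an explicit order
education_levels = ["elementary", "high school", "undergraduate", "graduate"]
people["education"] = pd.Categorical(
    people["education"], categories=education_levels, ordered=True
)

print(people.dtypes)

# Ordered categories support comparisons and min/max
lowest_education = people["education"].min()
above_elementary = people["education"] > "elementary"
print(lowest_education)
print(above_elementary.tolist())
```

Note that the ordered categorical encodes only the ranking of the levels, not the distance between them, which matches the limitation of ordinal data mentioned above.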


> Extra Material for data types: [link 1](http://www.intellspot.com/data-types/), and [link 2](https://towardsdatascience.com/data-types-in-statistics-347e152e8bee).





#### Identifying the data type

<div class="alert alert-warning">

<p style="font-size: 20px">&#x1F4DD; <b>Question</b></p>


Take a look at the list of variables below and try to identify their data types.

<ol>
    <li>Dogs' breeds</li>
    <li>Lengths of the athletes' jumps</li>
    <li>Songs' positions in the charts</li>
    <li>Number of the times a student missed a class</li>
    <li>Nationalities</li>
</ol>

<p>Answer options:</p>
<ol>
    <li>Numerical discrete data</li>
    <li>Numerical continuous data</li>
    <li>Categorical nominal data</li>
    <li>Categorical ordinal data</li>
</ol>

</div>

<div class="alert alert-info">

<details>
    <summary style="font-size: 20px"><b>Answers</b></summary>
    <ul>
        <li>Dogs' breeds: 3</li>
        <li>Lengths of the athletes' jumps: 2</li>
        <li>Songs' positions in the charts: 4</li>
        <li>Number of the times a student missed a class: 1</li>
        <li>Nationalities: 3</li>
    </ul>
</details>
</div>

<a class="anchor" id="chapter_3_3"></a>
### 3.3. Data handling and processing

This chapter will teach us to process and manipulate our data with the handy package `pandas`.


> Furthermore, there are various data manipulation procedures other than those introduced in this chapter. Check the complete [tutorial](https://scikit-learn.org/stable/modules/preprocessing.html) if you are interested.


First, let's look at the description of our data.

In [None]:
data = pd.read_csv(os.path.join(DATA, 'diabetes_data.csv'))

<table>
  <tr>
    <th>Column</th>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>Pregnancies</td>
    <td>Numerical</td>
    <td>Number of times pregnant</td>
  </tr>
  <tr>
    <td>Glucose</td>
    <td>Numerical</td>
    <td>Plasma glucose concentration at 2 hours in an oral glucose tolerance test</td>
  </tr>
  <tr>
    <td>BloodPressure</td>
    <td>Numerical</td>
    <td>Diastolic blood pressure (mm Hg)</td>
  </tr>
    <tr>
    <td>SkinThickness</td>
    <td>Numerical</td>
    <td>Triceps skin fold thickness (mm)</td>
  </tr>
    <tr>
    <td>Insulin</td>
    <td>Numerical</td>
    <td>2-Hour serum insulin (mu U/ml)</td>
  </tr>
  <tr>
    <td>BMI</td>
    <td>Numerical</td>
    <td>Body mass index (weight in kg/(height in m)^2)</td>
  </tr>
  <tr>
    <td>DiabetesPedigreeFunction</td>
    <td>Numerical</td>
    <td>A function that determines the risk of type 2 diabetes based on family history. The larger the function, the higher the risk of type 2 diabetes</td>
  </tr>
  <tr>
    <td>Age</td>
    <td>Numerical</td>
    <td>Age (years)</td>
  </tr>
  <tr>
    <td>Outcome</td>
    <td>Boolean</td>
    <td>1 indicates that the person has diabetes; 0 indicates otherwise</td>
  </tr>  
</table>

In any case, it is always worthwhile to look at the data's statistics before doing any actual plotting. Now, let's look at our data's description by calling `pandas` built-in function `describe()`.


In [None]:
data.describe()

Notice that some features, such as `BMI` and `BloodPressure`, contain zero values, which make no sense in a medical setting. Looking at the data's description helps us detect possible outliers early.
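
Besides reading the `min` row of `describe()`, zeros can also be counted directly. The sketch below uses a small hypothetical DataFrame whose column names mirror the diabetes data; the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the diabetes data
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],       # zero is a valid count here
    "BloodPressure": [72, 0, 64, 80],  # zero is physiologically implausible
    "BMI": [33.6, 26.6, 0.0, 28.1],    # zero is physiologically implausible
})

# Count how many entries equal zero in each column
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Whether a zero is suspicious depends on the variable: it is a perfectly valid value for `Pregnancies` but not for `BMI`.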

#### 3.3.1. Detecting and handling duplicate values

A real dataset often contains duplicate records for various reasons (technical errors, entry errors). We should always check our data for duplicates and get rid of them.

First, we can check for duplicates using the `duplicated` function. 

In [None]:
duplicates = data[data.duplicated()]
print("Number of duplications:", len(duplicates))

There are quite a lot of duplicates. We can get rid of them by calling the `drop_duplicates` function from `pandas`.

In [None]:
# Drop the duplicate values
data = data.drop_duplicates()

# Check for the changes
data[data.duplicated()]
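
Both `duplicated` and `drop_duplicates` accept the `keep` and `subset` parameters, which control what counts as a duplicate. The following sketch shows their effect on hypothetical records; `patient_id` and `visit` are made-up column names.

```python
import pandas as pd

# Hypothetical toy records
records = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "visit": ["a", "a", "a", "a", "b"],
})

# Default (keep="first"): flag every repeat of an earlier row
n_repeats = records.duplicated().sum()
print(n_repeats)

# keep=False: flag all members of a duplicated group, including the first one
all_members = records[records.duplicated(keep=False)]
print(all_members)

# subset: compare only the selected columns when looking for duplicates
deduped_by_patient = records.drop_duplicates(subset="patient_id", keep="first")
print(deduped_by_patient)
```

Using `subset` is helpful when an identifier column should be unique even though other columns differ between the repeated rows.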

#### 3.3.2. Detecting and handling missing values


Missing values are one of the most common problems in digital health data. The mechanism behind missing data can vary: it could arise from technical issues with the sensors or from participants opting out of the study. In general, there are three different missingness mechanisms:


1. **Missing Completely at Random (MCAR)**: The probability that a value is missing does not depend on either the observed or missing values of the response. For example, a participant in the study suddenly moves to a new area and no longer takes part in the survey. Frequently, we *want* the missing data to be MCAR.


2. **Missing at Random (MAR)**: When missingness depends only on observed values and not on values that are missing. For example, a participant does not answer some of the survey questions because the question is not written in their first language.


3. **Missing Not at Random (MNAR)**: This happens when neither MCAR nor MAR holds: the probability of missingness depends on the missing values themselves or on factors that were not observed. Example: wealthy people often withhold their income when surveyed. We can only make educated guesses about that phenomenon but cannot confirm it from the data. Therefore, the nature of the missingness in income data is unknown. This is the worst-case scenario because we cannot make an assumption about the missing values.
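
To see why the mechanism matters, the sketch below simulates the income example. All numbers are hypothetical, and withholding exactly the top 30% of incomes is a deliberately extreme assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes (arbitrary units) for 10,000 survey participants
income = rng.lognormal(mean=10.0, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(income.size) < 0.3

# MNAR: missingness depends on the unobserved income itself;
# here the top 30% of incomes are withheld (an extreme assumption)
mnar_missing = income > np.quantile(income, 0.7)

true_mean = income.mean()
mcar_mean = income[~mcar_missing].mean()
mnar_mean = income[~mnar_missing].mean()

print("True mean:          ", true_mean)
print("Observed mean, MCAR:", mcar_mean)
print("Observed mean, MNAR:", mnar_mean)
```

Under MCAR the observed mean stays close to the true mean, whereas under MNAR it is systematically too low, and nothing in the observed data alone reveals this.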


Let us see how many missing values there are in our data.


In [None]:
# Check for the missing values
data.isnull().any()

It appears that our data is complete. However, as we saw in the output of `describe()` above, some of the entries contain suspicious values, for example, zeroes in `BloodPressure`, `Insulin`, and `BMI`. This is probably due to the encoding process applied to the original data, where missing values were replaced with zeroes. We can reverse this process and observe the changes.

In [None]:
# Assign NaN to the zero values in BP, Insulin and BMI
cols = ["BloodPressure", "Insulin", "BMI"]
data[cols] = data[cols].replace({'0': np.nan, 0: np.nan})

# Check the number of missing values in each column
data.isnull().sum()

Now, it seems that there are far more missing values than we thought. One of the most common strategies in this situation is simply dropping all the records with NaNs.

In [None]:
# Drop records which contain missing values
data_dropped = data.dropna()
print("Number of rows after dropping missing records:", len(data_dropped))

# Assert that there are no NaNs left in the filtered dataset
assert not data_dropped.isnull().any().any()

Even though we have got rid of all the missing values, the resulting dataset shrinks significantly in size because we have removed 374 records with no `Insulin` information. In this scenario, it is better to simply drop the `Insulin` column.

In [None]:
data = data.drop(columns="Insulin")

<div class="alert alert-info">

<p style="font-size: 20px"><b>🔍 Note</b></p>
A common rule of thumb is that you can remove the rows with missing values if they account for less than 5% of the size of your dataset.

</div>
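
One way to apply such a rule is to compute the share of missing values before deciding what to drop. The sketch below checks it per column on hypothetical toy data; the 5% threshold is the rule of thumb from the note above, not a hard limit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 100 rows, few missing values in "a", many in "b"
toy = pd.DataFrame({"a": np.arange(100.0), "b": np.arange(100.0)})
toy.loc[:2, "a"] = np.nan     # 3 missing values (label slicing is inclusive)
toy.loc[50:89, "b"] = np.nan  # 40 missing values

# Share of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Columns below the threshold: dropping their incomplete rows costs little
threshold = 0.05
cheap_to_drop = missing_share[missing_share < threshold].index.tolist()
print(cheap_to_drop)
```

For a column like `b` in this sketch, dropping rows would discard a large part of the data, so dropping the column or imputing the values is usually the better option.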

#### 3.3.3. Handling missing values for numerical values

Although dropping missing values is the most convenient method, it is usually not optimal. As we have seen in the example above, the resulting dataset can become much smaller, which can bias the statistics computed from it. To preserve the sample size, we can use an imputer instead. [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) fills in missing values with a statistic (e.g. mean, median, ...) computed from the observed data.


In [None]:
from sklearn.impute import SimpleImputer  # Load Simple Imputer

# SimpleImputer fills missing values with a statistic of the observed values in the same column

# Define the imputer. We use 'mean' as the imputing strategy.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")

# Impute the BloodPressure column
data["BloodPressure"] = imp_mean.fit_transform(data[["BloodPressure"]]).ravel()
# Impute the BMI column
data["BMI"] = imp_mean.fit_transform(data[["BMI"]]).ravel()
# See the result after imputation
data.head()

In [None]:
# Check the number of missing data
data.isnull().sum()
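
The choice of strategy matters when a column contains extreme values. The following sketch compares the `mean` and `median` strategies on a hypothetical toy column; the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value and one extreme value
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, np.nan, 100.0]})

fill_values = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(toy[["value"]]).ravel()
    # statistics_ holds the fill value learned from the observed entries
    fill_values[strategy] = imputer.statistics_[0]
    print(strategy, "->", filled)

print(fill_values)
```

The mean is pulled towards the extreme value, while the median is not, so the median is often the safer choice for skewed features.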

#### 3.3.4. Feature Scaling (Normalization)

Sometimes, our data might consist of multiple features (also known as *dimensions*), each with its own scale and variance. Feature scaling is [necessary](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) in many situations. For example, some algorithms, such as clustering methods, require the data to be normalized before it is fed into the model.

Among the available scalers, [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is one of the most commonly used.


In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data_scaled = data.copy().astype(np.float64)
data_scaled.iloc[:, :-2] = sc.fit_transform(data_scaled.iloc[:, :-2])  # Exclude the last 2 columns
data_scaled

Wondering what the data looks like after scaling? Let's make a plot for each variable:

In [None]:
data_scaled.iloc[:, :-2].hist(figsize=(12, 12))
plt.suptitle("Histogram of features")
plt.show()

The scaled features are now on a comparable scale, each with mean 0 and unit variance.
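
Standardization itself is a simple formula, $z = (x - \mu) / \sigma$, applied to each column. The sketch below checks this on a hypothetical toy matrix by comparing a manual computation with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Manual standardization per column: z = (x - mean) / std
# (np.std uses the population standard deviation, ddof=0, by default)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(axis=0), z_sklearn.std(axis=0))
```

Note that scaling changes only the location and spread of each feature; the shape of its distribution stays the same.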

#### 3.3.5. Exercise: Preprocessing data
Now, write your own data processing scripts based on what you have learned in this chapter so far.

The `hepatitis.csv` dataset contains basic information about patients and requires some preprocessing.

In [None]:
# Loading the dataset
df = pd.read_csv(os.path.join(DATA, "hepatitis.csv"))

# Inspect the data
df.head()

In [None]:
# Exploring basic details

def basic_details(df):
    details = pd.DataFrame()
    details['Missing values'] = df.isnull().sum()
    details['N unique values'] = df.nunique()
    details['dtype'] = df.dtypes
    return details

print(basic_details(df))

Here, in the "Missing values" column, you can see that `alk_phosphate`, `albumin` and `protime` columns consist of plenty of missing values. Try to improve the quality of the data using the methods you have learned above.

<div class="alert alert-warning">

#### 📝 Task 1: Handle missing values using `dropna()`

You should remove all the missing values from the dataset.

</div>

In [None]:
df_dropped = df.copy()  # Treat df_dropped as original df
                        # and perform all the manipulations on df_dropped
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden tests



<div class="alert alert-warning">

#### 📝 Task 2: Handle missing values using `SimpleImputer`
Instead of removing the missing values, impute the following columns with their mean values: `alk_phosphate`, `albumin`, `protime`, `bilirubin`.

</div>

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

df_imputed = df.copy()  # Treat df_imputed as the original df and perform all the manipulations on df_imputed

In [None]:
# Impute the following columns with their mean values: alk_phosphate, albumin, protime, bilirubin

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden tests


<div class="alert alert-warning">

#### 📝 Task 3: Scale data using `StandardScaler`

Scale the following columns with Standard scaler: `alk_phosphate` and `albumin`.

</div>

In [None]:
# Define the scaler
sc = StandardScaler()

df_scale = df.copy()  # Treat df_scale as the original df and perform scaling on df_scale
df_scale

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Hidden tests


<a class="anchor" id="chapter_3_4"></a>
### 3.4. Plotting techniques (visualization)
In this chapter, you can view snippets and practice various plotting functions with seaborn and matplotlib.

#### 3.4.1. Univariate plot

A univariate plot is used to visualize the distribution of one variable. Probably the most commonly used univariate plot is the histogram. 

In [None]:
# Initialize a figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

title = fig.suptitle("Histogram of Blood Pressure")
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

# Make a histogram with matplotlib
ax1.hist(data['BloodPressure'])
ax1.set_xlabel("value")
ax1.set_ylabel("frequency")
ax1.set_title("matplotlib")

# Make a histogram with seaborn
sns.histplot(data["BloodPressure"], ax=ax2)
ax2.set(xlabel="value", ylabel="frequency", title="seaborn")
plt.show()

You can easily see that there are quite a few outlier values in the data (Blood pressure = 0). Consequently, we will need to get rid of these values in the preprocessing step.

#### 3.4.2. Multivariate plot

A multivariate plot shows the relationship between several variables. Most of the time, we are interested in plots that show the relationship between two variables. We call them bivariate plots.

Below, we showcase some insightful plot types, such as the scatter plot, the boxplot, the violin plot, and the correlation heatmap.

##### Scatter plot
First, let's look at the scatter plot. We often use this kind of plot to reveal the relationship between two variables.

In [None]:
# Initialize a figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

title = fig.suptitle("Blood pressure versus Age")
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

# Make a scatter plot with matplotlib
ax1.scatter(data["BloodPressure"], data["Age"])
ax1.set_xlabel("Blood Pressure (mm Hg)")
ax1.set_ylabel("Age")
ax1.set_title("matplotlib")

# Make a scatter plot with seaborn
sns.scatterplot(x=data["BloodPressure"], y=data["Age"], ax=ax2)
ax2.set_title("seaborn")
ax2.set_xlabel("Blood Pressure (mm Hg)")

plt.show()

You can also combine the histogram and scatter plots into one, by calling `seaborn`'s `pairplot` function.

In [None]:
sns.pairplot(data, diag_kind="hist", hue="Outcome")
plt.suptitle("Pairwise relationships plot", y=1.03)
plt.show()

<div class="alert alert-warning">

#### 📝 Task 4

Do you notice any patterns or abnormality from the above plot?

Write your answer in the provided cell below (And please don't say no!):

</div>

YOUR ANSWER HERE

---

##### Boxplot/Violin plot
When there are grouping factors within our data (like gender and origin), boxplot often helps reveal the trend within each group.

Now, since there is no grouping factor in our current data, we will come up with an arbitrary one. Let's divide people into different age groups.

In [None]:
# Young: age < 32
# Middle age: age >= 32 and age < 50
# Old: age >= 50
data['AgeGroup'] = data['Age'].apply(
    lambda x: 'Young' if x < 32 else ('Middle age' if 32 <= x < 50 else 'Old')
)
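
The same grouping can also be expressed with `pd.cut`, which bins numerical values into labelled intervals. This is only an alternative sketch on hypothetical ages, using the same thresholds as above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the bin edges follow the thresholds used above
ages = pd.Series([21, 31, 32, 49, 50, 67])

# right=False makes each bin include its left edge:
# [-inf, 32), [32, 50), [50, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 32, 50, np.inf],
    labels=["Young", "Middle age", "Old"],
    right=False,
)
print(age_group.tolist())
```

A side benefit of `pd.cut` is that the result is an ordered categorical, so plots show the groups in their natural order.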

Alright, let's make a boxplot. This time we will use `pandas` built-in function `boxplot` for a change. In fact, `pandas` provides many useful built-in plotting functions that you can explore [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

In [None]:
# Initialize a figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

# Make a box plot with pandas
data.boxplot(column=['BloodPressure'], by=["AgeGroup"], ax=ax1)
ax1.set_ylabel("Blood Pressure (mm Hg)")
ax1.set_xlabel("Age group")
ax1.set_title("pandas")

# Make a box plot with seaborn
sns.boxplot(x=data['AgeGroup'], y=data["BloodPressure"], ax=ax2)
ax2.set_xlabel("Age group")
ax2.set_ylabel("Blood Pressure (mm Hg)")
ax2.set_title("seaborn")

plt.show()

<div class="alert alert-info">

<p style="font-size: 20px"><b>🔍 Note</b></p>
Notice the patterns, quantiles, and outliers among different groups. Are they <i>significantly</i> different?

</div>

---

A **violin plot** is another way to visualize group differences. It is more informative than a plain boxplot, since it also shows the full distribution of the data.

In [None]:
# Make a violin plot with seaborn
fig, ax = plt.subplots(1, 1, figsize = (12,6))
fig.suptitle("Blood pressure distribution among age groups")
sns.violinplot(x=data['AgeGroup'], y=data['BloodPressure'], ax=ax)
ax.set_xlabel("Age group")
ax.set_ylabel("Blood Pressure (mm Hg)")
plt.show()

##### Correlation heatmap

Correlation is a statistical measure that describes the extent to which two variables are linearly related. It ranges from -1 to 1. We use a correlation heatmap to show the strength of correlation between different variables. 

In [None]:
numerical_cols = data.columns[:-2]

fig, ax = plt.subplots(figsize=(14, 14))  # Sample figsize in inches
sns.heatmap(data[numerical_cols].corr(), annot=True, linewidths=.5, ax=ax)
ax.set_title("Correlation heatmap of features", y=1.05, fontsize=32)
plt.show()
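
Each cell of such a heatmap holds a Pearson correlation coefficient, which is the `pandas` default for `corr()`. As a rough sketch on hypothetical toy measurements, the coefficient can be computed by hand and compared with the `pandas` result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

# Pearson correlation: covariance divided by the product of the
# standard deviations, written here with deviations from the mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy["x"].corr(toy["y"])  # Pearson is the default method
print(r_manual, r_pandas)
```

Keep in mind that this coefficient captures only linear relationships; two variables can be strongly related in a non-linear way and still have a correlation close to 0.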

#### 3.4.3. Practice: Make some plots by yourself

Now, it's time for you to make some plots for a real dataset. You should use the `seaborn` package for this assignment.

In [None]:
# Load sample dataset
df = pd.read_csv(os.path.join(DATA, '1503960366_dailyCaloriesAndSteps.csv'), sep=";")
df.head()

<div class="alert alert-warning">

#### 📝 Task 5: Make a plot for `ActivityDay` and `Calories`

</div>

<div class="alert alert-success">

<details>
    <summary style="font-size: 20px"><b>💡 Hint</b></summary>

For timeseries data, [line plot](https://seaborn.pydata.org/generated/seaborn.lineplot.html) is the most informative method for visualization.


</details>
</div>

In [None]:
# Calories line plot

plt.figure(figsize=(16, 4))

# YOUR CODE HERE
raise NotImplementedError()

plt.title("Daily Calories Burned", pad=20, fontsize=24)
plt.xticks(rotation=45)

plt.show()

<div class="alert alert-warning">

#### 📝 Task 6: Similarly, make a plot for `ActivityDay` and `TotalSteps`

</div>

In [None]:
# TotalSteps line plot

plt.figure(figsize=(16, 4))

# YOUR CODE HERE
raise NotImplementedError()

plt.title("Daily Activity Intensity", pad=20, fontsize=24)
plt.xticks(rotation=45)

plt.show()

<div class="alert alert-info">

<p style="font-size: 20px"><b>🔍 Note</b></p>
Can you spot any recurring patterns in this plot?

</div>

<div class="alert alert-warning">

#### 📝 Task 7: Make histograms of `Calories` and `TotalSteps`
Comply with the following criteria:
- set parameter `kde=True` to see the kernel density estimate
- set parameters `height=6` and `aspect=2` to fix the size of the plot


The [sns.displot](https://seaborn.pydata.org/generated/seaborn.displot.html) function is a figure-level interface for drawing distribution plots onto a `FacetGrid`. It gives you more control over the plot, but because it creates its own figure, it does not pick up all the settings made through `matplotlib`.


For instance, if you want to adjust the size of the `FacetGrid`, you should define `height` and `aspect` in the `sns.displot` function rather than using the `figsize=` definition in the `plt.figure()` function.

</div>

In [None]:
# Part 1 - Calories histogram

# YOUR CODE HERE
raise NotImplementedError()

plt.title("Histogram of Calories burned")

plt.xticks(rotation=45)
plt.show()

In [None]:
# Part 2 - TotalSteps histogram

# Set the same parameters as those in the above section, but using TotalSteps column

# YOUR CODE HERE
raise NotImplementedError()

plt.title("Histogram of Total Steps")

plt.show()

<div class="alert alert-warning">

#### 📝 Task 8: Make a scatter plot of `Calories` AND `TotalSteps`

</div>


In [None]:
# Scatter plot

plt.figure(figsize=(12, 6))

# YOUR CODE HERE
raise NotImplementedError()

# Saving the title text for later
title_text = plt.gca().title.get_text()

plt.show()

<div class="alert alert-info">
<p style="font-size: 20px"><b>🔍 Note</b></p>
The following cell serves as a reminder to <b>always</b> include a title in your plots.
</div>

In [None]:
### Sanity Checks. Do not modify.

assert len(title_text) > 0, "You shouldn't leave the title empty"

<hr>

<a class="anchor" id="chapter_4"></a>
## 4. Basic Concepts of Machine Learning

This chapter introduces various techniques like clustering, linear regression and classification with several hands-on coding sections.


> Again, do not panic if you do not have any experience with machine learning! In future assignments, we will only cover a **tiny bit** of it with good explanations.


Before going into the concepts, we define a few technical terms that will be used repeatedly below.


1. **Dependent variable (DV)**: the effect that we want to measure. Some literature refers to it as the *outcome* or the *response variable*.


2. **Independent variable(s) (IVs)**: the factors we believe to affect the dependent variable. They are often called *explanatory variables* or *predictor variables*.


3. **Statistical model**: a mathematical description of the relationship between variables that lets us draw inferences. In other words, a model is a simplified and convenient way to represent a real-world phenomenon.


We often define a model as follows:


$$DV \sim \alpha + IV_1 + IV_2 + ... + IV_n$$


where $\alpha$ is an independent term (we call it the *intercept*).
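
The formula-style notation above leaves the coefficients implicit. Written out in full, the same model reads roughly as follows (the symbols $\beta_i$ and $\varepsilon$ are our own notation for this illustration and are not used elsewhere in this notebook):

$$DV = \alpha + \beta_1 IV_1 + \beta_2 IV_2 + ... + \beta_n IV_n + \varepsilon$$

where each $\beta_i$ is the coefficient estimated for $IV_i$, and $\varepsilon$ is the part of the dependent variable that the model does not explain.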



<a class="anchor" id="chapter_4_1"></a>
### 4.1. Supervised learning

[Supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) is a group of learning algorithms that are trained using labeled data. The supervised learning algorithm learns from this labeled data through an iterative process, allowing it to predict new unseen data. Supervised learning is further divided into Regression and Classification algorithms.


#### 4.1.1. Linear Regression


Regression analysis is a statistical methodology that estimates a relationship between a dependent variable and a set of independent variables. Linear regression is the most basic form of regression analysis and is used extensively in digital health data analysis [[5, 6]](#references). The relationship between the variables is assumed to be linear in linear regression. Typically, we can achieve two tasks by using linear regression:


1. Predicting the value of a variable based on the value of other variables.


2. Estimating the strength and direction of the relationship between a dependent variable and a set of independent variables.


There are a few articles ([here](https://towardsdatascience.com/linear-regression-explained-1b36f97b7572) and [here](https://medium.com/analytics-vidhya/understanding-the-linear-regression-808c1f6941c0)) that explain the concept of linear regression in detail.



**Splitting the dataset**

Typically, we would like to split the dataset into two subsets: one for training the model and one for testing. We can easily achieve this by using the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from [scikit-learn](https://scikit-learn.org/stable/index.html).

In [None]:
# Splitting the dataset into train and test sets

from sklearn.model_selection import train_test_split

# Independent variables: we will use a few variables to predict the outcome
iv_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Age", "DiabetesPedigreeFunction"]
X = data_scaled[iv_cols]
y = data_scaled["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

print("Train sample size:", len(X_train))
print("Test sample size:", len(X_test))

We will use the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model from scikit-learn for a simple prediction task. Now, we formally define our model as follows:

$$Outcome \sim Pregnancies + Glucose + BloodPressure + SkinThickness + Age + DiabetesPedigreeFunction$$

In [None]:
# Importing LinearRegression model from the sklearn library
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(X_train, y_train)  # Training Linear Regression model

print("[1] R-squared score: ", lin_reg.score(X_train, y_train))  # R-squared (coefficient of determination) on the training data
print("[2] Calculated coefficients: ", lin_reg.coef_)  # Calculated coefficients for the linear regression problem
print("[3] Independent term: ", lin_reg.intercept_)  # Independent term in the linear model

print("[4] Predicting the first 10 datapoints from test set: ", lin_reg.predict(X_test.head(10)))

Let us iterate through our results:


[1] The R-squared score shows how well our model captures the variance of the dependent variable (the higher, the better; the maximum value is 1). A value of 0.28 is relatively low, which means our model explains only a small share of the variance.


[2] Coefficients show the strength and direction of the relationship between the dependent and independent variables. You can see that `Glucose` has the largest coefficient (about 0.2).


[3] Independent term is the value of the intercept. Combining this with the coefficients above, we can yield the following estimate:


$$\begin{aligned}
Outcome = {} & 0.355 + 0.088 \times Pregnancies + 0.201 \times Glucose + 0.012 \times BloodPressure \\
& + 0.026 \times SkinThickness + 0.005 \times Age + 0.055 \times DiabetesPedigreeFunction
\end{aligned}$$


[4] Finally, we can predict the test set. There are various ways to validate this prediction result, but we will only delve into a few details in this assignment.
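
To make the link between the fitted equation and `predict` more concrete, here is a minimal sketch on synthetic data (the data, the variable names, and the true coefficients are made up for illustration and are not part of the diabetes dataset). It rebuilds the predictions by hand from `intercept_` and `coef_`, and also computes R-squared on the held-out test set, which is one simple way to validate a model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y = 0.5 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 0.5 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
model = LinearRegression().fit(X_tr, y_tr)

# A prediction is the intercept plus the coefficient-weighted sum of the inputs
manual_pred = model.intercept_ + X_te @ model.coef_
sklearn_pred = model.predict(X_te)
print("Manual and sklearn predictions agree:", np.allclose(manual_pred, sklearn_pred))

# R-squared on the training data and on data the model has not seen
r2_train = model.score(X_tr, y_tr)
r2_test = model.score(X_te, y_te)
print("R-squared (train):", r2_train)
print("R-squared (test): ", r2_test)
```

A test score that is much lower than the training score would suggest that the model does not generalize well.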


**What if we did not normalize the data?**

Now you will see why standardization/normalization is important. We will fit the same model again, but with the original data.

In [None]:
X_orig = data[iv_cols]
y_orig = data["Outcome"]
X_orig_train, X_orig_test, y_orig_train, y_orig_test = train_test_split(X_orig, y_orig, test_size=0.2, random_state=2)

lin_reg = LinearRegression().fit(X_orig_train, y_orig_train)  # Train the Linear Regression model

print("[1] R-squared score:", lin_reg.score(X_orig_train, y_orig_train))  # R-squared (coefficient of determination) on the training data
print("[2] Calculated coefficients:", lin_reg.coef_)  # Calculate coefficients for the linear regression problem
print("[3] Independent term:", lin_reg.intercept_)  # Independent term in the linear model

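See how the estimates have changed? The coefficient of `Glucose` now looks much smaller, while that of `DiabetesPedigreeFunction` looks much larger, because the variables are not in the same value range. Each coefficient is expressed per unit of its own variable, so without scaling the magnitudes of the coefficients cannot be compared directly.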
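
The sketch below uses synthetic data to illustrate what changes and what does not when features are rescaled. The two features, their scales, and the use of `StandardScaler` are assumptions made for this example only. In the raw fit the feature with small values gets the larger coefficient, while after standardization the ordering flips, even though both models fit the data equally well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300

# Two hypothetical features on very different scales
big = rng.normal(loc=120, scale=30, size=n)     # values in the hundreds
small = rng.normal(loc=0.5, scale=0.3, size=n)  # values below one
X_raw = np.column_stack([big, small])
y_sim = 0.01 * big + 0.5 * small + rng.normal(scale=0.2, size=n)

# Fit on the raw features
raw_fit = LinearRegression().fit(X_raw, y_sim)

# Fit on standardized features (zero mean, unit variance)
scaler = StandardScaler().fit(X_raw)
X_std = scaler.transform(X_raw)
std_fit = LinearRegression().fit(X_std, y_sim)

print("Raw coefficients:         ", raw_fit.coef_)
print("Standardized coefficients:", std_fit.coef_)

# Rescaling the features does not change the quality of the fit
r2_raw = raw_fit.score(X_raw, y_sim)
r2_std = std_fit.score(X_std, y_sim)
print("R-squared (raw):", r2_raw, "| R-squared (standardized):", r2_std)
```

Each standardized coefficient equals the raw coefficient multiplied by the standard deviation of its feature, which is why only the standardized values can be compared with each other.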

#### 4.1.2. Logistic regression

[Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is another regression analysis method. A logistic model predicts the probability that an observation belongs to a class.

It's highly recommended to read [this article](https://www.analyticsvidhya.com/blog/2021/04/beginners-guide-to-logistic-regression-using-python/) to learn more about logistic regression.

> You **don't need to** understand the mathematical parts; a general idea of what regression does is enough.

In [None]:
# Import LogisticRegression model from the sklearn library
from sklearn.linear_model import LogisticRegression

# Define the logistic regression model
log_reg = LogisticRegression(random_state=0)

# Train Logistic Regression model
log_reg.fit(X_train, y_train)

# Validate the model with test samples
print("Probability estimates:\n", log_reg.predict_proba(X_test[:10]))  # Class probabilities for the first 10 test samples
print("Accuracy:", log_reg.score(X_test, y_test))  # Mean accuracy on the test data and labels
print("Predicted labels for the first 10 test datapoints:", log_reg.predict(X_test[:10]))

As you can see from above, logistic regression yields the probability that a sample belongs to a group. Logistic regression is particularly useful when the dependent variable is categorical, most commonly with two classes, as is the case for `Outcome` here.
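
As a small illustration of how the probabilities relate to the predicted labels, the sketch below fits a logistic regression on a synthetic two-class dataset (generated with `make_classification`; none of these names or values come from the diabetes data). Each row of `predict_proba` sums to 1, and the predicted label is simply the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class dataset (hypothetical, for illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy[:5])  # One column per class, ordered as in clf.classes_
labels = clf.predict(X_toy[:5])

print("Classes:", clf.classes_)
print("Probabilities:\n", proba)
print("Row sums:", proba.sum(axis=1))

# The predicted label is the class with the highest estimated probability
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print("Labels from predict:      ", labels)
print("Labels from probabilities:", labels_from_proba)
```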

<a class="anchor" id="chapter_4_2"></a>
### 4.2. Unsupervised learning

<p><b><a href="https://en.wikipedia.org/wiki/Unsupervised_learning">Unsupervised learning</a></b> is a group of learning algorithms that are fed unlabeled data. It is up to the algorithm to find hidden structures, patterns, or relationships within the data.

Clustering algorithms and Dimensionality reduction are two commonly used methods for unsupervised machine learning tasks.</p>
<h4>Examples</h4>
<b><a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html"> K-means clustering algorithm:</a></b>

> Read more about clustering <a href="https://www.geeksforgeeks.org/clustering-in-machine-learning/">here</a> and <a href="https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/">here</a>, and about K-means clustering <a href="https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1">here</a>

In [None]:
from sklearn.cluster import KMeans  # Import KMeans model from the sklearn library

df_0 = X[["BloodPressure", "Age"]].copy()
df_0["pred_label"] = KMeans(n_clusters=2, random_state=0, n_init="auto") \
    .fit_predict(df_0) 

sns.scatterplot(x="Age", y="BloodPressure", hue="pred_label", data=df_0)
plt.legend(loc="lower right")

The K-means algorithm attempts to separate the data points into clusters. As you can observe from the above plot, there is a cut-off point between the two clusters, which hints that the `BloodPressure` level changes with age. It is hard to conclude anything from the plot alone, but it gives you a general idea of how clustering works. Feel free to tinker with the `n_clusters` parameter if you want more clusters.
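
If you want to see what changing `n_clusters` does, here is a minimal sketch on synthetic data (three blobs generated with `make_blobs`; the dataset and the range of cluster counts are assumptions for illustration). It prints the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`, for several cluster counts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (hypothetical)
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = km.inertia_  # Sum of squared distances to the closest cluster centre
    print(f"n_clusters={k}: inertia={km.inertia_:.1f}")
```

The inertia keeps decreasing as clusters are added, so a lower value alone does not mean a better clustering. A common heuristic is to look for the point where the decrease levels off.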

<a class="anchor" id="chapter_4_3"></a>
### 4.3. Natural Language Processing (NLP) Techniques
We introduce some NLP techniques here, which will be further used in assignment 5. Don't worry; it's not hard if you are willing to spend an hour reading the concepts, and we will **NOT** use extra techniques other than those covered here.

Some terms you need to understand [[7]](#references):
- Corpus: a body of text (plural: corpora).
- Lexicon: Words and their meanings.
- Token: each unit produced when a text is split up according to some rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token if you tokenize a paragraph into sentences. So basically, tokenizing means splitting a body of text into sentences or words.

Below, we present a simple and commonly used template. We use the Python `nltk` module. NLTK (Natural Language Toolkit) is a large toolkit that aims to help you with the entire NLP methodology.



In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize  # Import String tokenizer

import ssl

# Bypass ssl to enable downloading punkt package
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

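nltk.download("punkt", quiet=True)  # Download the punkt tokenizer models
nltk.download("punkt_tab", quiet=True)  # Newer NLTK releases look for "punkt_tab" instead (assumed to be harmless on older ones)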

In [None]:
text = "How do we improve health and well-being using digital data? Is it a kind of magic?"

In [None]:
# Tokens of each complete sentence
sent_token = sent_tokenize(text)

sent_token

In [None]:
# Tokens of each word, respectively, in a list
word_token = word_tokenize(text)

word_token

In [None]:
from nltk.stem import PorterStemmer  # Word stem extractor

ps = PorterStemmer()

# Extract the stems of each word token
word_stem = [ps.stem(w) for w in word_tokenize(text)]

print(word_stem)

In [None]:
# Compare the word stem and the word token
np.column_stack([word_stem, word_token])

### 4.4. Extra reading: Statistical test


Statistical tests are often conducted in behavioural studies to measure the difference in some variables between multiple groups. Before going into the details of statistical testing, we first need to understand the concepts of the *null hypothesis* and *alternate hypothesis*.


A *null hypothesis* ($H_0$) proposes that no significant difference exists in a set of observations. The *alternate hypothesis* ($H_a$), on the other hand, is the direct opposite of the *null hypothesis*.


**Why does this matter?**


In almost every research problem, we develop multiple hypotheses and work to disprove or support them. For example, to test the effect of a new drug, the gold standard in clinical practice is the randomized controlled trial (RCT). In an RCT, the participants are divided into treatment and control groups. At the end of the experiment, the researchers compare an outcome between the two groups, e.g., the cholesterol level. The hypotheses in this case are then defined as:


$H_0$: There is no significant difference in the cholesterol level between the two groups


$H_a$: There is a significant difference in the cholesterol level


Of course, we would want to *reject* the null hypothesis and confidently say there is a significant difference between the two groups. This is where the statistical tests come into play. A statistical test describes how much the relationship between variables differs from the null hypothesis of no relationship. The test yields a statistic that we can compare against a critical value to decide whether to reject or fail to reject the null hypothesis.


Choosing a suitable statistical test is often confusing. Statistical tests are divided into two categories: parametric and non-parametric. In parametric tests, certain assumptions must hold, e.g., independence of observations, homogeneity of variance, and normality of data. Non-parametric tests, on the other hand, do not require these assumptions, but they often have lower statistical power than parametric tests.


Here, we will briefly explain two statistical tests: the t-test (parametric) and the permutation test (non-parametric). You can follow [this link](https://www.scribbr.com/statistics/statistical-tests/) for a more in-depth explanation of statistical testing.

**Independent t-test**

We use an independent t-test to compare the means of a continuous outcome between two separate groups of individuals (male/female, treatment/control).

In [None]:
from scipy import stats
stats.ttest_ind(data_scaled[data_scaled.Outcome == 0]["Glucose"], data_scaled[data_scaled.Outcome == 1]["Glucose"])

Above, we compared `Glucose` levels between those who have diabetes and those who do not. You can see that the p-value is extremely low, so we can reject the null hypothesis that the two group means are equal.
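
For reference, here is a minimal sketch of how the result of `stats.ttest_ind` can be read programmatically. The two groups are synthetic, and the significance level `alpha = 0.05` is a conventional choice that we assume here; neither comes from the dataset above. The sketch also shows the `equal_var=False` option (Welch's t-test), which may be preferable when the homogeneity-of-variance assumption is doubtful.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with different means (hypothetical values)
rng = np.random.default_rng(7)
group_0 = rng.normal(loc=100, scale=15, size=80)
group_1 = rng.normal(loc=120, scale=15, size=80)

# Standard independent t-test (assumes equal variances in both groups)
result = stats.ttest_ind(group_0, group_1)
print("t statistic:", result.statistic)
print("p-value:    ", result.pvalue)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_0, group_1, equal_var=False)
print("Welch p-value:", welch.pvalue)

# Decision rule with an assumed significance level of 0.05
alpha = 0.05
reject_null = result.pvalue < alpha
print("Reject the null hypothesis:", reject_null)
```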

**Permutation test**


Permutation tests follow a three-step procedure:


1. Compute a test statistic for the original data


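2. Resample the data *n* times by randomly shuffling the group labels, and compute the test statistic each time. The *n* values form a sampling distribution of the test statistic under the null hypothesis.


3. Compute the p-value as the proportion of resampled test statistics that are as extreme as, or more extreme than, the one initially observed.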


The permutation test is practical when the data generation mechanism is unknown, i.e., we assume we do not know the underlying distribution. A nice visual explanation can be found [here](https://www.jwilber.me/permutationtest/).
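
The three steps above can be sketched with NumPy as follows. This is a minimal illustration under several assumptions of our own: the function name `permutation_test`, the difference in means as the test statistic, the two-sided comparison, and the "+1" correction (which keeps the p-value from being exactly zero) are choices made for this example, and the two groups are synthetic.

```python
import numpy as np

def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a, dtype=float)
    group_b = np.asarray(group_b, dtype=float)

    # Step 1: test statistic for the original data
    observed = group_a.mean() - group_b.mean()

    # Step 2: shuffle the pooled data and recompute the statistic n times
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    perm_stats = np.empty(n_resamples)
    for i in range(n_resamples):
        shuffled = rng.permutation(pooled)
        perm_stats[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # Step 3: proportion of resampled statistics at least as extreme as the observed one
    n_extreme = np.sum(np.abs(perm_stats) >= abs(observed))
    p_value = (n_extreme + 1) / (n_resamples + 1)
    return observed, p_value

# Synthetic example: the treatment group is shifted by one standard deviation
rng = np.random.default_rng(42)
control = rng.normal(loc=0.0, scale=1.0, size=50)
treatment = rng.normal(loc=1.0, scale=1.0, size=50)

obs_diff, p_val = permutation_test(treatment, control)
print("Observed difference in means:", obs_diff)
print("Permutation p-value:", p_val)
```

Because the test only relies on shuffling the observed values, it makes no assumption about the shape of the underlying distribution.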

<hr>

<a class="anchor" id="references"></a>
## References

[[1]](https://www.who.int/europe/health-topics/digital-health) Digital health | World Health Organization. Retrieved 10 Oct 2024.

[[2]](https://www.frontiersin.org/articles/10.3389/fpsyt.2021.625247/full) Moshe, I., Terhorst, Y., Opoku Asare, K., Sander, L., Ferreira, D., & Baumeister, H. et al. (2021). Predicting Symptoms of Depression and Anxiety Using Smartphone and Wearable Data. Frontiers In Psychiatry, 12. doi: 10.3389/fpsyt.2021.625247

[[3]](https://doi.org/10.1111/bdi.12332) Faurholt-Jepsen, M., Vinberg, M., Frost, M., Christensen, E., Bardram, J., & Kessing, L. (2015). Smartphone data as an electronic biomarker of illness activity in bipolar disorder. Bipolar Disorders, 17(7), 715-728. doi: 10.1111/bdi.12332

[[4]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5659860/) De Choudhury, M., Kiciman, E., Dredze, M., Coppersmith, G., & Kumar, M. (2016). Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. Proceedings Of The 2016 CHI Conference On Human Factors In Computing Systems. doi: 10.1145/2858036.2858207

[[5]](https://mhealth.jmir.org/2019/2/e11638/) Do, D., Garfein, R., Cuevas-Mota, J., Collins, K., & Liu, L. (2019). Change in Patient Comfort Using Mobile Phones Following the Use of an App to Monitor Tuberculosis Treatment Adherence: Longitudinal Study. JMIR Mhealth And Uhealth, 7(2), e11638. doi: 10.2196/11638

[[6]](https://link.springer.com/chapter/10.1007/978-3-030-61527-7_43) Unnikrishnan, V., Shah, Y., Schleicher, M., Strandzheva, M., Dimitrov, P., & Velikova, D. et al. (2020). Predicting the Health Condition of mHealth App Users with Large Differences in the Number of Recorded Observations - Where to Learn from?. Discovery Science, 659-673. doi: 10.1007/978-3-030-61527-7_43

[[7]](https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/) Tokenize text using NLTK in python | GeeksforGeeks. Retrieved 10 Oct 2024.