# Advanced Epi II: Introduction to Python
## Lesson 2
- Lists in depth
- NumPy
- Importing data with pandas
---
### Lists
You can create **lists within a list**:

In [75]:
# area variables (in square meters)
hall = 11.25
kit = 18.0
liv = 20.0
bed = 10.75
bath = 9.50

# house information as list of lists
house = [["hallway", hall],
         ["kitchen", kit],
         ["living room", liv],
         ["bedroom", bed],
         ["bathroom", bath]]

In [76]:
print(house)
print(type(house))

[['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]
<class 'list'>


### Key Points About Indexing
- **Zero-based Indexing:** The first row has index `0`, the second has index `1`, and so on.
- **Slicing Excludes the End:** When selecting a range, e.g., `df[1:5]`, it includes **index 1 to 4**, but **not 5**.
- **Custom Indexing:** You can assign custom index values (e.g., setting a column as the index).


**Subset** elements of a list:

In [77]:
# Create the areas list
areas = ["hallway", 11.25, "kitchen", 18.0, "living room", 20.0, "bedroom", 10.75, "bathroom", 9.50]

In [78]:
print(areas[1]) # second element

11.25


In [79]:
print(areas[0])

hallway


In [80]:
print(house[1])

['kitchen', 18.0]


In [81]:
print(areas[-1]) # last element

9.5


In [82]:
print(areas[-5]) # the fifth-to-last element

20.0


**Slicing**: select multiple elements from your list.

In [83]:
downstairs = areas[:6]

In [84]:
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5]


In [85]:
print(downstairs)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0]


In [86]:
upstairs = areas[-4:]

In [87]:
print(upstairs)

['bedroom', 10.75, 'bathroom', 9.5]


In [88]:
print(areas[2:6])

['kitchen', 18.0, 'living room', 20.0]


Subsetting lists of lists:

In [89]:
print(house)

[['hallway', 11.25], ['kitchen', 18.0], ['living room', 20.0], ['bedroom', 10.75], ['bathroom', 9.5]]


In [90]:
house[-1][1]

9.5

**Extending** a list:

In [91]:
areas = areas + ["poolhouse", 24.5]
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5, 'poolhouse', 24.5]


In [None]:
# not .append()! It adds the whole list as ONE nested element
areas.append(["cellar", 20.4])
print(areas)

In [95]:
areas.extend(["cellar", 20.4])
print(areas)

['hallway', 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5, 'poolhouse', 24.5, ('cellar', 20.4), ['cellar', 20.4], 'cellar', 20.4]
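The difference between `+`, `.append()` and `.extend()` is easier to see on a fresh list. The following is a minimal sketch; the list `rooms` and the other variable names are made up for illustration and are independent of the `areas` list above.

```python
# Hypothetical example list, independent of `areas`
rooms = ["hallway", 11.25]

appended = rooms.copy()
appended.append(["cellar", 20.4])    # adds ONE element: the whole list, nested

extended = rooms.copy()
extended.extend(["cellar", 20.4])    # adds each item of the list separately

combined = rooms + ["cellar", 20.4]  # builds a new list, `rooms` stays unchanged

print(appended)  # ['hallway', 11.25, ['cellar', 20.4]]
print(extended)  # ['hallway', 11.25, 'cellar', 20.4]
print(combined)  # ['hallway', 11.25, 'cellar', 20.4]
print(len(appended), len(extended))  # 3 4
```

Note that `.append()` and `.extend()` modify the list in place and return `None`, whereas `+` returns a new list that has to be assigned to a name.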


Changing elements of lists:

In [99]:
# .insert()
areas.insert(1, 23)  # inserts 23 at index 1 and shifts the following elements to the right
print(areas) 

['hallway', 23, 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5, 'poolhouse', 24.5, ('cellar', 20.4), ['cellar', 20.4], 'cellar', 20.4]


In [100]:
areas[1] = 99  # overwrites the element at index 1 (the 23 inserted above) with 99
print(areas) 

['hallway', 99, 11.25, 'kitchen', 18.0, 'living room', 20.0, 'bedroom', 10.75, 'bathroom', 9.5, 'poolhouse', 24.5, ('cellar', 20.4), ['cellar', 20.4], 'cellar', 20.4]


In [101]:
my_list = [10, 20, 30, 40, 50]
my_list[1:3] = [99, 88]
print(my_list)  

[10, 99, 88, 40, 50]


**Deleting** elements of a list:

In [None]:
del areas[-2:-1]  # removes only the second-to-last element (the end index -1 is excluded)
print(areas)
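Besides `del`, lists offer `.pop()` and `.remove()` for deleting elements. The sketch below uses a small made-up list (`rooms`) so that the result of each step is easy to follow.

```python
# Hypothetical example list
rooms = ["hallway", 11.25, "kitchen", 18.0, "bedroom", 10.75]

del rooms[-2:-1]         # the slice -2:-1 covers only the second-to-last element
print(rooms)             # ['hallway', 11.25, 'kitchen', 18.0, 10.75]

last = rooms.pop()       # removes AND returns the last element
print(last)              # 10.75

rooms.remove("kitchen")  # removes the first element equal to "kitchen"
print(rooms)             # ['hallway', 11.25, 18.0]
```

`del` and `.pop()` work by position, while `.remove()` works by value and raises a `ValueError` if the value is not in the list.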

### Exercise: Advanced List Manipulation
Try to solve the following challenges using Python list operations:

**Modify Elements:**
   - Given the list `numbers = [3, 6, 9, 12, 15]`, replace the second element with `99`.
   - Swap the first and last elements in the list.

**Extend Lists:**
   - Add `[18, 21, 24]` to `numbers` in three different ways (`append()`, `extend()`, `+` operator).

**List Filtering & Comprehensions:**
   - Given `values = [10, 23, 45, 66, 77, 89, 90]`, create a new list containing only even numbers.
   - Create a new list where each number is squared if it is greater than 50.

**Index-Based Selection:**
   - From `letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']`, extract every second letter.
   - Reverse the order of the list using slicing.

**Nested Lists:**
   - Given `matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, extract the middle element (`5`).
   - Flatten the nested list into a single list `[1, 2, 3, 4, 5, 6, 7, 8, 9]` using list comprehension.
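One possible way to approach several of these challenges is sketched below. It is not the only solution, and the result names (`evens`, `squared_if_large`, `every_second`, `reversed_letters`, `middle`, `flat`) are chosen here for illustration. The second comprehension assumes that numbers of 50 or less should be kept unchanged.

```python
# Modify elements
numbers = [3, 6, 9, 12, 15]
numbers[1] = 99                                    # replace the second element
numbers[0], numbers[-1] = numbers[-1], numbers[0]  # swap first and last
print(numbers)  # [15, 99, 9, 12, 3]

# Filtering and comprehensions
values = [10, 23, 45, 66, 77, 89, 90]
evens = [v for v in values if v % 2 == 0]
squared_if_large = [v ** 2 if v > 50 else v for v in values]
print(evens)             # [10, 66, 90]
print(squared_if_large)  # [10, 23, 45, 4356, 5929, 7921, 8100]

# Index-based selection
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
every_second = letters[::2]       # start:stop:step with step 2
reversed_letters = letters[::-1]  # step -1 walks the list backwards
print(every_second)  # ['a', 'c', 'e', 'g']

# Nested lists
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
middle = matrix[1][1]                        # second row, second column
flat = [x for row in matrix for x in row]    # outer loop first, then inner loop
print(middle, flat)
```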



### Import packages and libraries
In Python, we can use **packages** (collections of modules) to access **pre-built functions and tools**. This allows us to **analyze data efficiently** without writing everything from scratch.

For epidemiological modeling and analysis, useful packages include **`epipy`** and **`lifelines`**, which provide statistical tools for public health.

---

If you are <ins>not</ins> using Jupyter Notebook but another IDE (e.g. PyCharm), you first need to install an external package (**only once**) via the terminal before you can use it. Inside a Jupyter Notebook cell, the same command works with a leading `!` (`!pip install lifelines`).

```bash
pip install lifelines  # Install the lifelines package
```

Import a whole package:

In [None]:
import numpy as np

# Example: Simulating the spread of a disease over 10 days
days = np.arange(1, 11)  # Days 1 to 10
infection_rates = np.exp(-0.1 * days)  # Exponential decay of infections

In [None]:
print("Days:", days)
print("Infection rates:", infection_rates)

Import specific functions:

In [None]:
from math import sqrt  # Import only the square root function

# Example: Calculate the standard deviation of infection rates
infection_rates = [0.1, 0.15, 0.2, 0.25, 0.22]
mean_rate = sum(infection_rates) / len(infection_rates)

variance = sum((x - mean_rate) ** 2 for x in infection_rates) / len(infection_rates)
std_dev = sqrt(variance)

print(f"Standard Deviation of Infection Rates: {std_dev:.4f}")
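The manual calculation above divides by `len(infection_rates)`, which gives the population standard deviation. As a cross-check, the standard library's `statistics` module offers both variants. This is a small sketch; the names `pop_sd` and `sample_sd` are introduced here for illustration.

```python
from statistics import pstdev, stdev

infection_rates = [0.1, 0.15, 0.2, 0.25, 0.22]

pop_sd = pstdev(infection_rates)    # population SD: divides by n
sample_sd = stdev(infection_rates)  # sample SD: divides by n - 1

print(f"Population SD (n):     {pop_sd:.4f}")
print(f"Sample SD     (n - 1): {sample_sd:.4f}")
```

With only a handful of observations the two values differ noticeably, so it is worth stating which one is reported.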

Multiple functions:

In [None]:
from random import randint, uniform  # Import multiple functions

# Example: Simulate the daily number of infections in a small town
daily_cases = [randint(5, 20) for x in range(7)]
reproduction_numbers = [uniform(1.5, 3.5) for x in range(7)]

print("Daily Infections:", daily_cases)
print("R₀ Values:", reproduction_numbers)

You can also import **everything** from a package. However, this is **not recommended** because it leads to namespace pollution and makes debugging more difficult.

In [None]:
from math import *  # Imports everything (not recommended)
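To get a feeling for what "namespace pollution" means, the sketch below imports `math` under its own name and counts how many public names a star import would have added. It also shows one concrete collision: `math.pow` has the same name as the built-in `pow`, so a star import would silently replace the built-in. The variable `public_names` is introduced here for illustration.

```python
import math

# Every public name in the module would land in your namespace with `from math import *`
public_names = [name for name in dir(math) if not name.startswith("_")]
print(len(public_names), "names would be added by a star import")
print("pow" in public_names)  # True: it would shadow the built-in pow

print(pow(2, 3))       # built-in pow keeps integers -> 8
print(math.pow(2, 3))  # math.pow always returns a float -> 8.0
```

Keeping the `math.` prefix makes it obvious at every call site which function is being used.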

### NumPy
![NumPy Logo](https://upload.wikimedia.org/wikipedia/commons/3/31/NumPy_logo_2020.svg)

**NumPy** (*Numerical Python*) is one of the most important libraries for numerical calculations in Python.  
It is widely used in **epidemiology, data analysis and scientific research**.

**Key Features of NumPy:**
- Supports **multi-dimensional arrays**
- Enables **fast element-wise calculations**
- Provides **advanced mathematical operations**
- Optimized for **performance and memory efficiency**

Advantages over normal Python lists:
- **Faster & more efficient** (thanks to optimized C implementation)
- **Memory-saving** as it stores **homogeneous data** (e.g. only numbers)
- **Mathematical operations** can be applied to entire arrays (**vectorization**)
- **Practical functions** for statistics, simulations and scientific computing

**Common NumPy Functions**
| Function | Description |
|----------|-------------|
| `np.array([1,2,3])` | Creates a NumPy array |
| `np.zeros((3,3))` | Creates a 3×3 array filled with zeros |
| `np.ones((2,2))` | Creates a 2×2 array filled with ones |
| `np.linspace(0,10,5)` | Generates 5 evenly spaced numbers between 0 and 10 |
| `np.random.rand(3,3)` | Generates a 3×3 matrix with uniformly distributed random numbers between 0 and 1 |
| `np.mean(arr)` | Computes the mean of an array |
| `np.std(arr)` | Computes the standard deviation |
| `np.min(arr)` | Finds the minimum value in an array |
| `np.max(arr)` | Finds the maximum value in an array |
| `arr[arr > 5]` | Returns all elements greater than 5 |
| `np.arange(0, 10, 2)` | Generates an array with values from 0 up to (but not including) 10 in steps of 2 |
| `np.eye(4)` | Creates a 4×4 identity matrix |
| `np.full((3,3), 7)` | Creates a 3×3 matrix filled with the value 7 |
| `np.random.randint(10, size=(3,3))` | Generates a 3×3 matrix with random integers from 0 to 9 |
| `np.random.randn(3,3)` | Generates a 3×3 matrix with normally distributed random numbers |
| `np.sqrt(arr)` | Computes the square root of each element in an array |
| `np.exp(arr)` | Computes the exponential (e^x) of each element |
| `np.log(arr)` | Computes the natural logarithm (ln) of each element |
| `np.dot(A, B)` | Computes the dot product of two matrices |
| `np.sum(arr, axis=0)` | Sums along the first axis (column-wise sum) |
| `np.sum(arr, axis=1)` | Sums along the second axis (row-wise sum) |
| `np.argmax(arr)` | Returns the index of the maximum value |
| `np.argmin(arr)` | Returns the index of the minimum value |
| `np.cumsum(arr)` | Computes the cumulative sum of an array |
| `np.diff(arr)` | Computes the difference between consecutive elements |
| `np.sort(arr)` | Sorts an array |
| `np.unique(arr)` | Returns the unique values in an array |
| `np.concatenate((arr1, arr2), axis=0)` | Concatenates two arrays along axis 0 |
| `np.reshape(arr, (2,5))` | Reshapes an array to the given shape |
| `np.transpose(arr)` | Transposes a matrix |
| `np.linalg.inv(A)` | Computes the inverse of a matrix |
| `np.linalg.det(A)` | Computes the determinant of a matrix |
| `np.linalg.eig(A)` | Computes the eigenvalues and eigenvectors of a matrix |
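The `axis` argument and the functions `np.cumsum` and `np.diff` from the table are easiest to understand on a tiny array. The following sketch uses made-up numbers: rows stand for two hypothetical regions and columns for five days.

```python
import numpy as np

# Hypothetical data: 2 regions (rows) x 5 days (columns)
cases = np.arange(10).reshape(2, 5)
print(cases)
# [[0 1 2 3 4]
#  [5 6 7 8 9]]

per_day = np.sum(cases, axis=0)     # collapses the rows: one total per column (day)
per_region = np.sum(cases, axis=1)  # collapses the columns: one total per row (region)
print(per_day)      # [ 5  7  9 11 13]
print(per_region)   # [10 35]

running_total = np.cumsum(per_day)  # cumulative cases over time
print(running_total)                # [ 5 12 21 32 45]

increments = np.diff(running_total)  # differences between consecutive days
print(increments)                    # [ 7  9 11 13]
```

`np.diff` undoes `np.cumsum` except for the first element, which is why `increments` has one entry fewer than `per_day`.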


In [2]:
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

print(arr)
type(arr)

[10 20 30 40 50]


numpy.ndarray

In [3]:
# Difference 1: elementwise calculation
## with array
print(arr + 10)  # add 10 to every element of the array
print(arr * 2)  # double each element

[20 30 40 50 60]
[ 20  40  60  80 100]


In [97]:
## with list
list_example = [10, 20, 30, 40, 50]
list_example * 3

[10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50]

In [98]:
list_example + 10

TypeError: can only concatenate list (not "int") to list

In [None]:
print([x + 10 for x in list_example]) 

In [5]:
# Difference 2: NumPy is more efficient for big arrays
large_list = list(range(100000000))
large_array = np.array(large_list)

In [6]:
## time difference
import time
start_time = time.time()
[x * 2 for x in large_list]
print("List calculation took:", time.time() - start_time, "seconds")

start_time = time.time()
large_array * 2
print("Array calculation took:", time.time() - start_time, "seconds")

List calculation took: 3.612644910812378 seconds
Array calculation took: 0.18405914306640625 seconds
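A single measurement with `time.time()` can fluctuate from run to run. A slightly more robust alternative is the standard library's `timeit` module, which repeats the measurement. The sketch below assumes a smaller size (`n = 1_000_000`) so that it also runs comfortably on machines with little memory; the exact timings depend on your computer.

```python
import timeit
import numpy as np

n = 1_000_000                 # assumed size, smaller than in the cell above
numbers_list = list(range(n))
numbers_array = np.arange(n)

# Run each calculation three times and keep the fastest run
list_time = min(timeit.repeat(lambda: [x * 2 for x in numbers_list], number=1, repeat=3))
array_time = min(timeit.repeat(lambda: numbers_array * 2, number=1, repeat=3))

print(f"List calculation took:  {list_time:.4f} seconds")
print(f"Array calculation took: {array_time:.4f} seconds")
```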


In [11]:
# Difference 3: 2D arrays
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)
print(matrix.shape)  # shows the dimensions

[[1 2 3]
 [4 5 6]
 [7 8 9]]
(3, 3)


In [12]:
# Difference 4: NumPy-methods for simple statistics
random_array = np.random.randint(0, 100, size=10)  # random integers from 0 to 99 (upper bound is exclusive)
print(random_array)

[94 88 84  4 67 31  9 94 53  2]


In [13]:
print("Mean:", np.mean(random_array))
print("STD:", np.std(random_array))
print("Minimum:", np.min(random_array))
print("Maximum:", np.max(random_array))

Mean: 52.6
STD: 36.28277828391867
Minimum: 2
Maximum: 94


In [14]:
# Difference 5: logical indexing
## with array
print(random_array[random_array > 50])  # shows all values greater than 50

[94 88 84 67 94 53]


In [None]:
## with list
list_example = [10, 20, 30, 40, 50]
print([x for x in list_example if x > 25])

In [7]:
np.arange(1, 11)  # creates an array from 1 to 10

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [8]:
np.linspace(0, 1, 5)  # 5 evenly distributed values between 0 and 1

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [9]:
np.zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [10]:
np.ones((2,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [20]:
np.random.randint(10, size=(3,3))

array([[6, 0, 9],
       [5, 7, 9],
       [3, 0, 2]])

In [21]:
np.random.rand(3,3)

array([[0.26360693, 0.8645443 , 0.03446848],
       [0.58738278, 0.79332431, 0.2220018 ],
       [0.7389586 , 0.6353606 , 0.65374993]])

In [22]:
np.random.randn(3,3)

array([[-1.11339989,  1.54158804,  1.69457915],
       [ 0.43779866,  0.1870225 , -0.50762927],
       [-0.38195129,  0.85872162,  1.23629716]])

Many operations are available both as a NumPy function (`np.max(arr)`) and as a method of the array (`arr.max()`):

In [50]:
print(np.max(arr))

50


In [51]:
print(arr.max())

50


Indexing and slicing:

In [None]:
print(arr[0])
print(arr[-1])
print(arr[1:4])
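Indexing and slicing also work on 2D arrays, where the row and column positions are separated by a comma. A short sketch, reusing the 3×3 `matrix` from the earlier cell:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(matrix[1, 2])    # row 1, column 2 -> 6
print(matrix[0])       # the whole first row -> [1 2 3]
print(matrix[:, 0])    # the whole first column -> [1 4 7]
print(matrix[:2, 1:])  # rows 0-1 and columns 1-2
# [[2 3]
#  [5 6]]
```

As with lists, counting starts at 0 and the end of a slice is excluded.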

### Exercise

A city with **1,000,000 inhabitants** is experiencing an influenza outbreak. The number of **daily new infections** has been recorded for **30 days**.

**Steps to solve:**
1. **Create a NumPy array** with random daily new infections between **500 and 5000**, set the seed to 42.
2. **Calculate the cumulative sum** of infections over time (total infections at each day).
3. **Find the day with the highest infection rate**, returning both the index and the value.
4. **Write a function `normalize_infections()` that scales the infection numbers between 0 and 1** using `np.min()` and `np.max()`.

In [None]:
import numpy as np

# Generate an array of random daily infections (between 500 and 5000) for 30 days
np.random.seed(42)  # set seed for reproducibility
daily_infections = np.random.randint(500, 5000, size=30)  # upper bound 5000 is exclusive
print("Daily new infections:", daily_infections)

# Calculate cumulative infections over time
cumulative_infections = np.cumsum(daily_infections)
print("Cumulative infections:", cumulative_infections)

# Find the day with the highest infection rate
highest_infection_day = np.argmax(daily_infections)  # Index of highest infection day (0-based)
highest_infection_value = daily_infections[highest_infection_day]  # Value at that index
print("Day with highest infections (index):", highest_infection_day)
print("Day with highest infections (value):", highest_infection_value)

# Function to normalize infection numbers
def normalize_infections(arr):
    return (arr - np.min(arr)) / (np.max(arr) - np.min(arr))

normalized_infections = normalize_infections(daily_infections)
print("Normalized infection rates:", normalized_infections)


### Working with directories

A **directory** (or folder) is a structured location used to store files on a computer. Directories help in organizing code, datasets, configurations, and outputs in programming projects. Managing directories efficiently is crucial for ensuring smooth workflow and maintainability of code.

#### **Why is Directory Organization Important?**
- **Maintains Clean Code Structure:** Organizing files logically prevents confusion and makes projects more readable.
- **Enhances Reproducibility:** A well-structured directory allows other users (or your future self) to understand and rerun the code easily.
- **Avoids Path-Related Errors:** Using the correct directory structure prevents issues when accessing or modifying files.
- **Enables Automation:** A well-structured directory makes automated processing (e.g., batch processing of multiple files) easier.

#### **Best Practices for Organizing Directories**
- Use **clear and consistent folder structures**: Keep related files together (e.g., scripts, datasets, and outputs should be in separate folders).
- Store **datasets in a `data/` folder**, Python scripts in `src/`, and results in `results/`.
- Avoid using spaces in directory names; instead, use underscores (`_`) or dashes (`-`).
- Use **relative paths instead of absolute paths** to keep your project portable across different machines.
- Automate directory creation using Python’s `os` and `pathlib` modules.
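The folder layout described above could be created with `pathlib` roughly as follows. This is a sketch under assumptions: the project name `my_project` is made up, and a temporary directory stands in for the real project location so that running the example leaves nothing behind.

```python
import tempfile
from pathlib import Path

# A temporary folder stands in for the place where your project would live
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_project"  # hypothetical project name

    for name in ["data", "src", "results"]:
        # parents=True also creates `my_project`; exist_ok=True avoids an error on re-runs
        (project / name).mkdir(parents=True, exist_ok=True)

    created = sorted(p.name for p in project.iterdir())
    print(created)  # ['data', 'results', 'src']
```

In a real project you would replace the temporary directory with a path of your own, for example the folder that contains your notebook.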

#### **Relative vs. Absolute Paths**
A **relative path** specifies the file location **relative** to the current working directory, while an **absolute path** specifies the full location from the root of the file system.

![directory](directory.png "Directory")
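The difference between the two kinds of paths can be inspected with `pathlib`. The sketch below only builds path objects and does not require the file to exist; the `data/` subfolder is an assumption for illustration.

```python
from pathlib import Path

# Relative path: interpreted starting from the current working directory
relative = Path("data") / "healthcare-dataset-stroke-data.csv"

# Absolute path: the same location written out from the root of the file system
absolute = Path.cwd() / relative

print("Working directory:", Path.cwd())
print(relative, "->", relative.is_absolute())  # False
print(absolute, "->", absolute.is_absolute())  # True
```

Because the relative path contains no machine-specific folders, it keeps working when the whole project folder is copied to another computer.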

### Importing Data  

Pandas is one of the most widely used Python libraries for **data manipulation and analysis**. It is built on top of NumPy and provides data structures like **DataFrames** and **Series** that make working with structured data easy.
- **DataFrames:** The core structure of Pandas, similar to an Excel table or SQL table, allows for easy data organization.
- **Handling Missing Data:** Pandas provides functions to detect, fill, or remove missing values.
- **Efficient Indexing & Filtering:** Quickly locate, filter, and manipulate data using labels or conditions.
- **Powerful Grouping & Aggregation:** Easily summarize and compute statistics on large datasets.
- **Data Import & Export:** Supports reading and writing in various formats such as CSV, Excel, JSON, and SQL.
- **Integration with Other Libraries:** Works seamlessly with NumPy, Matplotlib, and Scikit-learn.

![pandas logo](https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg)

**Common Pandas Functions**
| Function | Description |
|----------|-------------|
| `pd.read_csv("file.csv")` | Reads a CSV file into a DataFrame |
| `df.head(n)` | Displays the first `n` rows of the DataFrame |
| `df.info()` | Shows summary information about the DataFrame |
| `df.describe()` | Provides statistics about numerical columns |
| `df.isnull().sum()` | Counts missing values per column |
| `df.dropna()` | Removes rows with missing values |
| `df.fillna(value)` | Replaces missing values with a specified value |
| `df.apply(function, axis=0)` | Applies a function to each column (`axis=0`) or each row (`axis=1`) of a DataFrame |
| `df["column"]` | Selects a single column |
| `df[["col1", "col2"]]` | Selects multiple columns |
| `df[df["column"] > 50]` | Filters rows based on a condition |
| `df.sort_values(by="column")` | Sorts the DataFrame by a specific column |
| `df.groupby("category").mean(numeric_only=True)` | Groups data and calculates mean values of the numerical columns |
| `df.to_csv("output.csv")` | Saves the DataFrame as a CSV file |


In [32]:
# create dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}

In [33]:
type(data)

dict

In [35]:
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(data)
df

| | Name | Age | City |
|---|------|-----|------|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Los Angeles |
| 2 | Charlie | 35 | Chicago |


In [36]:
# Load the dataset (assuming it's in the same folder as the notebook)
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Display the first few rows
df.head()

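| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |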


In [None]:
df_stata = pd.read_stata("example_dataset.dta")

If your dataset is not in the same folder as your Jupyter Notebook, you need to provide the path to the file, either relative to the notebook or as a full (absolute) path. Use the line that matches your operating system:

In [None]:
df = pd.read_csv("/Advanced Epidemiology II/L2/healthcare-dataset-stroke-data.csv")  # Mac/Linux
df = pd.read_csv("C:/Advanced Epidemiology II/L2/healthcare-dataset-stroke-data.csv")  # Windows

In [37]:
# Display basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [38]:
# Show the first 5 rows
df.head()

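| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|----|--------|-----|--------------|---------------|--------------|-----------|----------------|-------------------|-----|----------------|--------|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |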


In [None]:
# Show summary statistics for numerical columns
df.describe()

In [None]:
print(df["work_type"].nunique())  # Number of unique work types

In [None]:
# Count missing values in each column
df.isnull().sum()

In [None]:
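# Replace missing BMI values with the mean BMI.
# Assigning the result back avoids the chained `inplace=True` pattern, which newer pandas versions warn about.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())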

In [None]:
df.isnull().sum()

In [None]:
df_senior = df[df["age"] >= 60]
df_senior.head()

**Practice**:
- Find all female patients who smoke.
- Find all patients with hypertension and heart disease.

In [None]:
# Solution
female_smokers = df[(df["gender"] == "Female") & (df["smoking_status"] == "smokes")]
hypertension_heart_disease = df[(df["hypertension"] == 1) & (df["heart_disease"] == 1)]

Grouping and aggregations:

In [None]:
df.groupby("smoking_status")["avg_glucose_level"].median()
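Several statistics can be computed per group in one step with `.agg()`. The sketch below uses a tiny made-up DataFrame (`toy`) with the same column names as the stroke dataset, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the stroke dataset
toy = pd.DataFrame({
    "smoking_status": ["smokes", "never smoked", "smokes", "never smoked"],
    "avg_glucose_level": [100.0, 80.0, 140.0, 90.0],
})

# One row per group, one column per requested statistic
summary = toy.groupby("smoking_status")["avg_glucose_level"].agg(["count", "mean", "median"])
print(summary)
```

The result is itself a DataFrame indexed by the grouping column, so a single value can be looked up with `summary.loc["smokes", "mean"]`.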

**Practice**:
- Count how many people in each residence type (Urban vs. Rural) had a stroke.
- Find the mean BMI for each `gender` and `work_type` combination.

In [None]:
# Solution
print(df.groupby("Residence_type")["stroke"].sum())  # Count of strokes by residence type

bmi_means = df.groupby(["gender", "work_type"])["bmi"].mean()
print(bmi_means)

Creating new columns:

In [None]:
df["elderly"] = df["age"].apply(lambda x: "Yes" if x >= 65 else "No")

**Practice**: 
- Create a new column "high_risk": `1` if `bmi > 30` AND `avg_glucose_level > 200`, `0` otherwise.
- Add a new column `bmi_category`, which classifies `bmi` as:
   - "Underweight" if BMI < 18.5
   - "Normal" if 18.5 ≤ BMI < 25
   - "Overweight" if 25 ≤ BMI < 30
   - "Obese" if BMI ≥ 30

In [None]:
# Solution
df["high_risk"] = df.apply(lambda x: 1 if x["bmi"] > 30 and x["avg_glucose_level"] > 200 else 0, axis=1)
print(df[["bmi", "avg_glucose_level", "high_risk"]].head())  # Show new column

def classify_bmi(bmi):
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

df["bmi_category"] = df["bmi"].apply(classify_bmi)
print(df[["bmi", "bmi_category"]].head())
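The same two columns can also be built without `apply`, using vectorized comparisons and `pd.cut`. This is an alternative sketch on made-up data (`toy`), not a replacement for the solution above. One difference worth knowing: with `classify_bmi`, a missing BMI falls through to the `else` branch and is labeled "Obese", whereas `pd.cut` leaves missing values missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including boundary values and one missing BMI
toy = pd.DataFrame({
    "bmi": [17.0, 18.5, 24.9, 25.0, 31.2, np.nan],
    "avg_glucose_level": [90.0, 210.0, 150.0, 99.0, 230.0, 250.0],
})

# Combine two boolean Series with & and convert True/False to 1/0
toy["high_risk"] = ((toy["bmi"] > 30) & (toy["avg_glucose_level"] > 200)).astype(int)

# right=False makes each interval include its lower edge: [18.5, 25), [25, 30), ...
toy["bmi_category"] = pd.cut(
    toy["bmi"],
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(toy)
```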

In [None]:
df.head()

Sorting and ranking:

In [None]:
df.sort_values(by="avg_glucose_level", ascending=False).head(10)

**Practice**:
- Find the top 5 oldest patients.
- Find the 5 patients with the lowest BMI.

In [None]:
# Solution
print(df.sort_values(by="age", ascending=False).head(5))  # Oldest patients
print(df.sort_values(by="bmi", ascending=True).head(5))  # Lowest BMI patients

Saving modified data:

In [None]:
df.to_csv("cleaned_healthcare_data.csv", index=False)
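The effect of `index=False` can be checked by writing a file and reading it back. The sketch below uses a tiny made-up DataFrame and a temporary folder; the file names are chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "bmi": [36.6, 32.5]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    without_index = Path(tmp) / "without_index.csv"
    with_index = Path(tmp) / "with_index.csv"

    toy.to_csv(without_index, index=False)  # only the data columns are written
    toy.to_csv(with_index)                  # the row index is written as an extra first column

    reloaded = pd.read_csv(without_index)
    reloaded_with_index = pd.read_csv(with_index)

print(list(reloaded.columns))             # ['id', 'bmi']
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'id', 'bmi']
```

Without `index=False`, the saved row index comes back as a column named `Unnamed: 0` when the file is read again.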