# Worksheet week 5 Monday

## Exercise 1 - NumPy + Matplotlib

**Draw plots for the data in experimental_results.txt**

1. Read the file into a NumPy array `data` (use `np.genfromtxt(path)`)
2. Draw the distributions of both columns as histograms in a single figure using `plt.subplot()`, and save the figure to `feature_distributions.png`

3. Draw a single 2D-scatter plot for `data`, where each row is a 2D point. Save it to `scatter_plot.png`
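
As a minimal sketch of step 1, the snippet below parses a small in-memory stand-in for the file. The sample values and the layout (two whitespace-separated numeric columns, one sample per row) are assumptions for illustration and are not taken from `experimental_results.txt`.

```python
import io
import numpy as np

# Hypothetical stand-in for experimental_results.txt (assumed layout:
# two whitespace-separated numeric columns, one sample per row).
fake_file = io.StringIO("0.5 1.0\n1.5 2.0\n2.5 3.0\n")

# genfromtxt accepts a path or a file-like object
sample = np.genfromtxt(fake_file)

print(sample.shape)   # (rows, columns)
print(sample[:, 0])   # first column, one value per row
print(sample[:, 1])   # second column
```

Slicing with `[:, 0]` and `[:, 1]` is what gives each histogram its own column.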

In [None]:
import numpy as np
from matplotlib import pyplot as plt
data = np.genfromtxt("experimental_results.txt")
# feature distributions

plt.figure()
plt.subplot(211)
# Histogram of normal distribution
plt.hist(data[:,0], bins=100)

plt.subplot(212)
plt.hist(data[:,1], bins=100)

plt.savefig('feature_distributions.png')
# Scatter plot of 2D-Gauss dist

# Creating a new figure
plt.figure()
plt.scatter(data[:,0],data[:,1])
plt.savefig('scatter_plot.png')

## Exercise 2 - NumPy masking

### 2.1 Simple boolean mask

- Create a NumPy array with the numbers 1–10
- Create a mask that selects numbers greater than 5
- Use the mask to extract those numbers

In [None]:
# Your code here
import numpy as np

a = np.arange(1, 11)

mask = a > 5
result = a[mask]

print(result)
# [ 6  7  8  9 10]

### 2.2 Multiple conditions
- Create a NumPy array with the numbers 1–8
- Create a mask that selects numbers greater than 3 and less than 8
- Print the selected values

In [None]:
# Your code here
a = np.arange(1, 9)

# Parentheses are required: & binds tighter than the comparisons
mask = (a > 3) & (a < 8)
result = a[mask]

print(result)
# [4 5 6 7]

### 2.3 Mask and modify
- Use `np.arange` to create a NumPy array with the numbers 10–30 and a step size of 5
- Replace all values greater than 20 with -1
- Print the modified array

In [None]:
# Your code here
# The stop value is exclusive, so use 31 to include 30
a = np.arange(10, 31, 5)

a[a > 20] = -1

print(a)
# [10 15 20 -1 -1]
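
Assigning through a mask changes the array in place. If the original values are still needed, `np.where` is one possible alternative; the sketch below is an optional addition and not part of the exercise. The name `replaced` is chosen here for illustration.

```python
import numpy as np

a = np.arange(10, 31, 5)

# np.where(condition, x, y) takes x where the condition holds and y elsewhere,
# and returns a new array instead of modifying `a`
replaced = np.where(a > 20, -1, a)

print(a)         # original array is unchanged
print(replaced)  # values above 20 replaced with -1
```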

### 2.4 2D masking
Given the array:
```python
b = np.array([[4, 9, 2],
              [7, 6, 1],
              [8, 3, 5]])
```
- Create a mask for even numbers
- Use it to extract all even values


In [None]:
# Your code here
b = np.array([[4, 9, 2],
              [7, 6, 1],
              [8, 3, 5]])

mask = b % 2 == 0
result = b[mask]

print(result)
# [4 2 6 8]

### 2.5 Mask rows
Given the array:
```python
b = np.array([[4, 9, 2],
              [7, 6, 1],
              [8, 3, 5]])
```
- Select rows where the sum of the row is greater than 15

In [None]:
# Your code here
row_sums = b.sum(axis=1)
mask = row_sums > 15

result = b[mask]

print(result)
# [[8 3 5]]
# (row sums are [15 14 16], so only the last row is strictly greater than 15)

### 2.6 Mask NaN values
Given the array:
```python
c = np.array([1.2, np.nan, 3.4, np.nan, 5.6])
```
- Create a mask that selects only the non-NaN values
- Compute the mean of those values

In [None]:
# your code here
c = np.array([1.2, np.nan, 3.4, np.nan, 5.6])

# ~ inverts the boolean mask, so True marks the non-NaN entries
mask = ~np.isnan(c)
values = c[mask]

mean_value = values.mean()

print(values)
# [1.2 3.4 5.6]
print(mean_value)
# 3.4
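
To see why the mask matters, the sketch below compares a plain mean with the masked mean, and with `np.nanmean`, which skips NaN entries on its own. It is an optional cross-check rather than the intended solution; the variable names are chosen here for illustration.

```python
import numpy as np

c = np.array([1.2, np.nan, 3.4, np.nan, 5.6])

# A plain mean propagates NaN, so the result is NaN
plain_mean = c.mean()

# Masking first and using np.nanmean should agree with each other
masked_mean = c[~np.isnan(c)].mean()
nan_aware_mean = np.nanmean(c)

print(plain_mean)
print(masked_mean, nan_aware_mean)
```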

## Exercise 3

1. Use `range` to create a list with the values 0 to 99 and use it to initialize a `series` without providing an index (labels). Print the series and think about why the index looks the way it does.

2. Now do the same, but set the index (labels) to run from 100 to 199.

3. Convert the entries of the `series` to strings using the `astype()` method, and compute the length of each entry using the string accessor (`.str`).

Worksheet on GitHub

In [None]:
import pandas as pd

# Initialize the series without an explicit index
series = pd.Series(range(100))
print(series)

# Set the index (labels) to 100-199
series = pd.Series(range(100), index=range(100, 200))
print(series)

# Convert to string series and calculate the length
series = series.astype('str')
print(series.str.len())
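
The difference between the two series can be sketched on a smaller example. The snippet uses 5 entries instead of 100 to keep the output short, and the variable names are chosen here for illustration. It shows that pandas assigns labels 0, 1, 2, ... when no index is given, and that labels and positions are separate once a custom index is set.

```python
import pandas as pd

default_index = pd.Series(range(5))
custom_index = pd.Series(range(5), index=range(100, 105))

print(default_index.index)  # labels generated by pandas
print(custom_index.index)   # labels supplied by us

# .loc looks up by label, .iloc by position
by_label = custom_index.loc[100]
by_position = custom_index.iloc[0]
print(by_label, by_position)
```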

## Exercise 4 - Pandas

#### Life Expectancy (WHO) - life-expectancy-data.csv


**Dataset** contains the following columns:

- Country
- Year 
- immunization factors:
    - Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
    - Measles: Number of reported measles cases per 1000 population
    - Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
    - Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)

- mortality factors:
    - Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
    - HIV/AIDS: Deaths from HIV/AIDS per 1000 live births (0-4 years)
    - under-five deaths: Number of under-five deaths per 1000 population
    - infant deaths: Number of Infant Deaths per 1000 population

- economic factors:
    - GDP: Gross Domestic Product per capita (in USD)
    - Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1). Measures how good a country is at utilizing its resources
    - Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
    - percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)

- social factors:
    - Population: Population of the country
    - Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
    - Status: Developed or Developing status
    - Schooling: Number of years of schooling (years)

- other health-related factors:
    - BMI: Average Body Mass Index of entire population
    - thinness 1-19 years: Prevalence of thinness among children and adolescents aged 10 to 19 (%)
    - thinness 5-9 years: Prevalence of thinness among children aged 5 to 9 (%)

- Life expectancy: Life expectancy in years

### 4.1 Data Collection
Make sure you have the `life-expectancy-data.csv` file.
- Install the pandas module if you have not already done that
- Read the file to a pandas dataframe variable

In [None]:
# Import pandas and use pd.read_csv() to read in the data
import pandas as pd
# Adjust the path if your copy of the file is stored under a different name
df = pd.read_csv('life-expectancy-data.csv')

### 4.2 Basic Dataset Info

We can get a basic understanding of the dataset by the following:

- Check the dataset shape
    - `len(df)`
    - `df.columns`
- Check the feature types
    - `df.dtypes`
- Take a first look at the data 
    - `df.head(n)` - show the first n rows (default 5)
    - `df.sample(n)` - show n random rows (default 1)
- Check the dataset info
    - `df.info()`
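
A minimal sketch of these calls on a toy DataFrame. The data below is hypothetical and is not the WHO dataset. `df.shape` is used here as one more way to get both the row and column counts at once.

```python
import pandas as pd

# Hypothetical toy data, not the WHO dataset
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Year": [2000, 2001, 2002, 2003],
    "GDP": [1.5, 2.5, None, 4.0],
})

n_rows, n_cols = toy.shape            # (number of rows, number of columns)
print(n_rows, n_cols)
print(toy.dtypes)                     # one dtype per column
print(toy.head(2))                    # first 2 rows
print(toy.sample(2, random_state=0))  # 2 random rows, seeded for repeatability
toy.info()                            # prints its summary directly
```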

In [None]:
# print the number of samples by len(df)
print(len(df))


In [None]:
# print the columns by df.columns
print(df.columns)

In [None]:
# print the dataset feature types by df.dtypes
print(df.dtypes)

In [None]:
# print the dataset info by df.info()
df.info()

In [None]:
# print the first 10 rows by df.head()
df.head(10)

In [None]:
# print 20 random rows by df.sample()
df.sample(20)

### 4.3 Data Cleaning

**Detecting missing values**

We can detect missing values using `df.isna()` (or equivalently `df.isnull()`).

In [None]:
# Please use df.isna().sum() to calculate the number of missing values for each column
df.isna().sum()

**Dropping missing values**

The simplest strategy is to remove all samples with missing values from the dataset.

- `df.dropna()` drops rows (samples) with NaN values

- `df.dropna(axis='columns')` drops columns (features) with NaN values
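
The effect of the two `dropna` variants can be sketched on a small DataFrame. The toy data and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 9.0],
})

# isna() gives a boolean frame; summing counts the True values per column
missing_per_column = toy.isna().sum()

rows_kept = toy.dropna()                # keeps only rows without any NaN
cols_kept = toy.dropna(axis="columns")  # keeps only columns without any NaN

print(missing_per_column)
print(rows_kept)
print(cols_kept)
```

Both calls return a new DataFrame and leave `toy` unchanged, which is why the exercise stores the result in `clean_df`.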

In [None]:
# Please remove all samples with missing values. 
clean_df = df.dropna()

# Then check the number of missing values for each column and confirm that all missing values have been removed.
print(clean_df.isna().sum())

# Also print out the length of the cleaned data. This tells you how many samples are filtered out.
print(len(clean_df))

## Exercise 4 - Pandas indexing and slicing

Given: 
```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
    "age": [24, 30, 18, 35, 28, 40],
    "city": ["Aarhus", "Copenhagen", "Aarhus", "Odense", "Copenhagen", "Aarhus"],
    "score": [88, 92, 79, 85, 90, 73]
}

df = pd.DataFrame(data)
```

### 4.1 Basic indexing (use square brackets)

1.	Select the age column using column indexing.
2.	Select the first three rows using slicing.

In [None]:
# Your code here
# 1
df["age"]
# 2
df[:3]

### 4.2 .loc 
1.	Use .loc to select the row with index 3.
2.	Use .loc to select the columns name and city for all rows.
3.	Use .loc to select rows with indices 1 through 4 (inclusive).
4.	Use .loc to select rows where the index is 2–5 and only the age and score columns.
5.	Use .loc to select all rows where city is “Aarhus”.

In [None]:
# Your code here
# 1
df.loc[3]
# 2
df.loc[:, ["name", "city"]]
# 3
df.loc[1:4]
# 4
df.loc[2:5, ["age", "score"]]
# 5
df.loc[df["city"] == "Aarhus"]

### 4.3 .iloc
1.	Use .iloc to select the first row.
2.	Use .iloc to select the first three rows and first two columns.
3.	Use .iloc to select the last column.
4.	Use .iloc to select rows 2 to 4 (note Python slicing rules).
5.	Use .iloc to select every second row.

In [None]:
# Your code here
# 1
df.iloc[0]
# 2
df.iloc[:3, :2]
# 3
df.iloc[:, -1]
# 4
df.iloc[2:5]
# 5
df.iloc[::2]
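
With the default index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses the same data but shifts the labels to 10–15 (an assumption made here purely for illustration) so that the two slicing rules become visible.

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
    "age": [24, 30, 18, 35, 28, 40],
    "city": ["Aarhus", "Copenhagen", "Aarhus", "Odense", "Copenhagen", "Aarhus"],
    "score": [88, 92, 79, 85, 90, 73]
}
df = pd.DataFrame(data)

# Shift the labels so they no longer match the positions
shifted = df.copy()
shifted.index = range(10, 16)

by_label = shifted.loc[11:13]     # labels 11, 12, 13: the end label is included
by_position = shifted.iloc[1:3]   # positions 1 and 2: the end position is excluded

print(by_label)
print(by_position)
```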

### 4.4 Boolean indexing
1.	Select all rows where age is greater than 30.
2.	Select all rows where score is at least 90.
3.	Select rows where city is “Copenhagen” and score > 85.
4.	Select the names of people who are under 25.
5.	Select rows where age is between 25 and 35 (inclusive).

In [None]:
# Your code here
# 1
df[df["age"] > 30]
# 2
df[df["score"] >= 90]
# 3
df[(df["city"] == "Copenhagen") & (df["score"] > 85)]
# 4
df.loc[df["age"] < 25, "name"]
# 5
df[df["age"].between(25, 35, inclusive="both")]
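
As an optional aside, the same kind of selection can also be written with `DataFrame.query` or `Series.isin`. The sketch below is not part of the exercise; it only checks that the `query` form matches the boolean mask from task 3. The variable names are chosen here for illustration.

```python
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"],
    "age": [24, 30, 18, 35, 28, 40],
    "city": ["Aarhus", "Copenhagen", "Aarhus", "Odense", "Copenhagen", "Aarhus"],
    "score": [88, 92, 79, 85, 90, 73]
}
df = pd.DataFrame(data)

# Boolean mask: each condition needs its own parentheses around it
mask_result = df[(df["city"] == "Copenhagen") & (df["score"] > 85)]

# The same filter written as a query string
query_result = df.query("city == 'Copenhagen' and score > 85")

# isin selects rows whose value is in a list of candidates
isin_result = df[df["city"].isin(["Aarhus", "Odense"])]

print(mask_result)
print(isin_result)
```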