# **Data Collection**

## Objectives

* Fetch data from [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) and save raw data to inputs/datasets
* Inspect the data, prepare it for analysis, and save it in outputs/datasets/collection

## Inputs

* [Dataset](https://archive.ics.uci.edu/dataset/863/maternal+health+risk): Ahmed, M. (2020). Maternal Health Risk [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D.

## Outputs

* Maternal health risk dataset: outputs/datasets/collection/maternal-health-risk-dataset.csv

## Additional Comments

* For the purpose of this project the data is pushed to a public repository. We are aware that in a real-life application the data would remain private.
* The data is anonymized and has a Creative Commons license (CC BY 4.0).


---

# Install and Import Packages for Data Collection

Install the ucimlrepo package to access UCI Machine Learning Repository datasets.

Run the following command in your terminal:

`pip install ucimlrepo`

Import packages

In [None]:
import os
import pandas as pd

from ucimlrepo import fetch_ucirepo


# Change working directory

We need to change the working directory from its current folder to its parent folder
* Access current directory with os.getcwd()

In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/ML-maternal-health-risk/jupyter_notebooks'

Make the parent of the current directory the new current directory, and confirm new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
* os.getcwd() gets the current directory

In [6]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New current directory set to {current_dir}.")

New current directory set to /workspaces/ML-maternal-health-risk.
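
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `project_root` is hypothetical and not part of the original notebook:

```python
import os


def project_root(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent folder only if current_dir is the notebook folder."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already at the project root (or somewhere else): leave the path unchanged
    return current_dir


# Safe to run repeatedly: the directory changes at most once
os.chdir(project_root(os.getcwd()))
```

With this guard the directory only changes while the notebook folder is still the current directory, so executing the cell twice has no further effect.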


# Collect Data

Import the maternal health risk dataset into the code.

The code to import the dataset is provided on the [UCI repository website](https://archive.ics.uci.edu/dataset/863/maternal+health+risk).

In [None]:
# fetch dataset
maternal_health_risk = fetch_ucirepo(id=863)

# data (as pandas dataframes)
X = maternal_health_risk.data.features
y = maternal_health_risk.data.targets



Store data in dataframe df

In [8]:
df = pd.DataFrame(X)
df["RiskLevel"] = y
df

Unnamed: 0,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,high risk
1,35,140,90,13.0,98.0,70,high risk
2,29,90,70,8.0,100.0,80,high risk
3,30,140,85,7.0,98.0,70,high risk
4,35,120,60,6.1,98.0,76,low risk
...,...,...,...,...,...,...,...
1009,22,120,60,15.0,98.0,80,high risk
1010,55,120,90,18.0,98.0,60,high risk
1011,35,85,60,19.0,98.0,86,high risk
1012,43,120,90,18.0,98.0,70,high risk
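
An equivalent way to combine features and target in one step is `pd.concat`. The sketch below uses small toy frames in place of `X` and `y` (the names `X_demo`, `y_demo` and `df_demo` are illustrative only), so it runs without fetching the dataset:

```python
import pandas as pd

# Toy stand-ins for the features and targets returned by fetch_ucirepo
X_demo = pd.DataFrame({"Age": [25, 35], "BS": [15.0, 13.0]})
y_demo = pd.DataFrame({"RiskLevel": ["high risk", "low risk"]})

# Join column-wise on the shared row index; a new frame is returned
df_demo = pd.concat([X_demo, y_demo], axis=1)
print(df_demo)
```

Because `pd.concat` returns a new DataFrame, the feature frame itself keeps only its original columns.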


Inspect dataset metadata

In [None]:
# metadata
print("Metadata:")
print(maternal_health_risk.metadata)

# print abstract
print("\nAbstract:")
print(maternal_health_risk.metadata.abstract)

# print additional_info summary
print("\nAdditional information - Summary:")
print(maternal_health_risk.metadata.additional_info.summary)




Metadata:
{'uci_id': 863, 'name': 'Maternal Health Risk', 'repository_url': 'https://archive.ics.uci.edu/dataset/863/maternal+health+risk', 'data_url': 'https://archive.ics.uci.edu/static/public/863/data.csv', 'abstract': 'Data has been collected from different hospitals, community clinics, maternal health cares from the rural areas of Bangladesh through the IoT based risk monitoring system.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1013, 'num_features': 6, 'feature_types': ['Real', 'Integer'], 'demographics': ['Age'], 'target_col': ['RiskLevel'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2020, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DP5D', 'creators': ['Marzia Ahmed'], 'intro_paper': {'ID': 274, 'type': 'NATIVE', 'title': 'Review and Analysis of Risk Factor of Maternal Health in Remote Area Using the Internet of Things (IoT)', 

Save raw data to inputs/datasets as csv file

In [None]:
# create inputs/datasets folder
try:
    os.makedirs(name='inputs/datasets/')
except Exception as e:
    print(e)

# save dataset as csv file
df.to_csv('inputs/datasets/maternal-health-risk-dataset-raw.csv', index=False)



[Errno 17] File exists: 'inputs/datasets/'
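
The `[Errno 17]` message is expected whenever the folder already exists. As a possible alternative to the try/except pattern, `os.makedirs` accepts `exist_ok=True`, which makes repeated calls silent. A minimal sketch, using a temporary folder in place of the project root (the variable names are illustrative):

```python
import os
import tempfile

# A temporary folder stands in for the project root in this sketch
with tempfile.TemporaryDirectory() as tmp_root:
    target = os.path.join(tmp_root, "inputs", "datasets")
    os.makedirs(target, exist_ok=True)  # creates the nested folders
    os.makedirs(target, exist_ok=True)  # second call is a no-op, no error
    folder_created = os.path.isdir(target)

print(folder_created)
```

Other errors, such as missing write permissions, are still raised rather than printed, which may or may not be desirable in a notebook.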


---

# Data Preparation

Load and inspect raw data, prepare for next steps - exploratory data analysis, data cleaning and feature engineering (see following notebooks).

Note that if you ran the cells above, the data is already stored as df.

In [11]:
df = pd.read_csv('inputs/datasets/maternal-health-risk-dataset-raw.csv')

In [12]:
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BS,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,high risk
1,35,140,90,13.0,98.0,70,high risk
2,29,90,70,8.0,100.0,80,high risk
3,30,140,85,7.0,98.0,70,high risk
4,35,120,60,6.1,98.0,76,low risk


Inspect dataset variable information

In [13]:
maternal_health_risk.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,Age,Feature,Integer,Age,Any ages in years when a women during pregnant.,,no
1,SystolicBP,Feature,Integer,,"Upper value of Blood Pressure in mmHg, another...",,no
2,DiastolicBP,Feature,Integer,,"Lower value of Blood Pressure in mmHg, another...",,no
3,BS,Feature,Integer,,Blood glucose levels is in terms of a molar co...,mmol/L,no
4,BodyTemp,Feature,Integer,,,F,no
5,HeartRate,Feature,Integer,,A normal resting heart rate,bpm,no
6,RiskLevel,Target,Categorical,,Predicted Risk Intensity Level during pregnanc...,,no


Check data types of variables (they already seem to be appropriate for our analysis)

In [14]:
df.dtypes

Age              int64
SystolicBP       int64
DiastolicBP      int64
BS             float64
BodyTemp       float64
HeartRate        int64
RiskLevel       object
dtype: object
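
The metadata reports that the dataset has no missing values. A quick way to confirm this on the loaded data is to count NaN entries per column. The sketch below runs on a toy frame (`df_demo` is a stand-in, not the real dataset); on the actual data the same call would be made on `df`:

```python
import pandas as pd

# Toy frame standing in for df; the real check would run on the loaded data
df_demo = pd.DataFrame({
    "Age": [25, 35, 29],
    "BodyTemp": [98.0, 98.0, 100.0],
    "RiskLevel": ["high risk", "high risk", "low risk"],
})

# Number of missing (NaN) values in each column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```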

There are a few steps we want to take in preparing the dataset for further analysis:

* Rename BS to BloodSugar
* Convert BodyTemp to degrees Celsius since this fits our business case better
* Most relevant: Convert RiskLevel to numeric, since the ML algorithm will require numeric target variables.

Rename BS column to BloodSugar

In [15]:
df.rename(columns={"BS": "BloodSugar"}, inplace=True)
df.head(3)

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,98.0,86,high risk
1,35,140,90,13.0,98.0,70,high risk
2,29,90,70,8.0,100.0,80,high risk


Convert BodyTemp from degrees Fahrenheit to degrees Celsius

In [None]:
df['BodyTemp'] = df.apply(lambda x: round((x['BodyTemp']-32)*(5/9), 1), axis=1)
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,36.7,86,high risk
1,35,140,90,13.0,36.7,70,high risk
2,29,90,70,8.0,37.8,80,high risk
3,30,140,85,7.0,36.7,70,high risk
4,35,120,60,6.1,36.7,76,low risk
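
The conversion uses C = (F - 32) * 5/9. The same result can also be obtained without a row-wise `apply` by operating on the whole column at once. A minimal sketch on toy values (`temps_f` and `temps_c` are illustrative names; 98.0 and 100.0 match the readings shown in the tables):

```python
import pandas as pd

# Toy Fahrenheit readings standing in for the BodyTemp column
temps_f = pd.Series([98.0, 100.0, 101.0], name="BodyTemp")

# C = (F - 32) * 5 / 9, applied to every element, rounded to one decimal
temps_c = ((temps_f - 32) * 5 / 9).round(1)
print(temps_c.tolist())  # [36.7, 37.8, 38.3]
```

Column-wise arithmetic is usually faster than `apply` with `axis=1`, although for a dataset of about a thousand rows the difference is negligible.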




Convert target datatype to numeric

In [17]:
df["RiskLevel"].unique()

array(['high risk', 'low risk', 'mid risk'], dtype=object)

We choose the following mapping:
* low risk &rarr; 0
* mid risk &rarr; 1
* high risk &rarr; 2

Note that in the first step we keep the datatype unchanged (object) by replacing the labels with string digits, and convert the column to numeric afterwards. A first attempt raised a FutureWarning stating that replace will change its dtype-casting behaviour in a future version; replacing strings with strings avoids relying on that behaviour.

In [18]:
df["RiskLevel"] = df["RiskLevel"].replace({
                                    "low risk": "0",
                                    "mid risk": "1",
                                    "high risk": "2"
                                    })
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,36.7,86,2
1,35,140,90,13.0,36.7,70,2
2,29,90,70,8.0,37.8,80,2
3,30,140,85,7.0,36.7,70,2
4,35,120,60,6.1,36.7,76,0
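
An alternative that we did not use here is `Series.map` with an integer-valued dictionary, which yields integers in a single step and does not go through `replace`. A minimal sketch on toy data (the names `risk`, `risk_codes` and `risk_numeric` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy target column with the three labels found in the dataset
risk = pd.Series(["high risk", "low risk", "mid risk", "high risk"])

risk_codes = {"low risk": 0, "mid risk": 1, "high risk": 2}

# map() looks up each label; a label missing from the dict would become NaN
risk_numeric = risk.map(risk_codes)
print(risk_numeric.tolist())  # [2, 0, 1, 2]
print(risk_numeric.dtype)
```

One difference to keep in mind: `replace` leaves unknown labels untouched, whereas `map` turns them into NaN, which would also change the column to a float dtype.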


In [19]:
df.tail()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
1009,22,120,60,15.0,36.7,80,2
1010,55,120,90,18.0,36.7,60,2
1011,35,85,60,19.0,36.7,86,2
1012,43,120,90,18.0,36.7,70,2
1013,32,120,65,6.0,38.3,76,1


Check RiskLevel datatype

In [20]:
df["RiskLevel"].dtype

dtype('O')

Convert datatype to integer

In [21]:
df = df.astype({"RiskLevel": "int"})
df.dtypes

Age              int64
SystolicBP       int64
DiastolicBP      int64
BloodSugar     float64
BodyTemp       float64
HeartRate        int64
RiskLevel        int64
dtype: object

In [22]:
df.head()

Unnamed: 0,Age,SystolicBP,DiastolicBP,BloodSugar,BodyTemp,HeartRate,RiskLevel
0,25,130,80,15.0,36.7,86,2
1,35,140,90,13.0,36.7,70,2
2,29,90,70,8.0,37.8,80,2
3,30,140,85,7.0,36.7,70,2
4,35,120,60,6.1,36.7,76,0


---

## Push files to Repo

Save processed dataset to outputs folder

In [None]:
try:
    # create outputs/datasets/collection folder
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    print(e)



[Errno 17] File exists: 'outputs/datasets/collection'


In [None]:
df.to_csv(
    "outputs/datasets/collection/maternal-health-risk-dataset.csv",
    index=False
    )
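
To confirm that a saved file can be read back as expected, a round-trip check can help. The sketch below writes a toy frame to a temporary folder and reloads it; `df_demo`, `demo.csv` and the other names are illustrative, and on the real data the same check would use the output path above:

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same kinds of columns as the processed dataset
df_demo = pd.DataFrame({
    "Age": [25, 35],
    "BodyTemp": [36.7, 37.8],
    "RiskLevel": [2, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    # index=False keeps the row index out of the file
    df_demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Column names and the integer target should survive the round trip
print(reloaded.dtypes)
```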

---

# Conclusions and Next Steps

We collected the maternal health risk dataset and inspected it. We then processed it to prepare for further analysis and stored it in the outputs folder.

In the next notebook we will carry out an extensive analysis of the dataset, including exploratory data analysis, a correlation study and more detailed analyses of the most relevant variables.