# **Data Collection**

## Objectives

* Fetch data from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/) and save the raw data to inputs/datasets
* Inspect the data, prepare it for analysis and save the result to outputs/datasets/collection

## Inputs

* [Dataset](https://archive.ics.uci.edu/dataset/863/maternal+health+risk): Ahmed, M. (2020). Maternal Health Risk [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D.

## Outputs

* Raw dataset: inputs/datasets/maternal-health-risk-dataset-raw.csv
* Prepared maternal health risk dataset: outputs/datasets/collection/maternal-health-risk-dataset.csv

## Additional Comments

* For the purpose of this project, the data is pushed to a public repository. We are aware that in a real-life application the data would remain private.
* The data is anonymized and has a Creative Commons license (CC BY 4.0).


---

# Install and Import Packages for Data Collection

Install the ucimlrepo package to access datasets from the UCI Machine Learning Repository

In [None]:
%pip install ucimlrepo

Import packages

In [None]:
import os
import pandas as pd

from ucimlrepo import fetch_ucirepo 

# Change working directory

We need to change the working directory from its current folder to its parent folder
* Access current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory, and confirm the change
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory
* os.getcwd() gets the current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New current directory set to {current_dir}.")
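
As a small illustration of what os.path.dirname() returns, the sketch below applies it to a hypothetical path. The path is an assumption made for this example and not necessarily this project's layout.

```python
import os

# Hypothetical notebook location, used only for illustration
example_dir = "/workspace/project/jupyter_notebooks"

# os.path.dirname() strips the last path component and returns the parent
parent_dir = os.path.dirname(example_dir)
print(parent_dir)
```

Note that rerunning the cell above moves the working directory up one more level each time, so it should be run only once per session.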

# Collect Data

Import the maternal health risk dataset.

The code to import the dataset can be found on the [UCI repository website](https://archive.ics.uci.edu/dataset/863/maternal+health+risk).

In [None]:
# fetch dataset 
maternal_health_risk = fetch_ucirepo(id=863) 
  
# data (as pandas dataframes) 
X = maternal_health_risk.data.features 
y = maternal_health_risk.data.targets 

Combine features and target in a single dataframe df

In [None]:
df = pd.DataFrame(X)
df["RiskLevel"] = y
df
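
An alternative way to combine features and targets is pd.concat, which returns a new dataframe and leaves both inputs untouched. The sketch below uses made-up toy values (not the real dataset) and assumes both frames share the same index.

```python
import pandas as pd

# Toy stand-ins for the features and targets (values are made up)
features = pd.DataFrame({"Age": [25, 35], "BS": [7.0, 13.0]})
targets = pd.DataFrame({"RiskLevel": ["low risk", "high risk"]})

# Join side by side on the shared index; neither input is modified
combined = pd.concat([features, targets], axis=1)
print(combined.columns.tolist())
```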

Inspect dataset metadata

In [None]:
# metadata 
print("Metadata:")
print(maternal_health_risk.metadata)

# print abstract
print("\nAbstract:")
print(maternal_health_risk.metadata.abstract)

# print additional_info summary
print("\nAdditional information - Summary:")
print(maternal_health_risk.metadata.additional_info.summary)


Save the raw data to inputs/datasets as a csv file

In [None]:
# create inputs/datasets folder (no error if it already exists)
os.makedirs('inputs/datasets/', exist_ok=True)

# save dataset as csv file
df.to_csv('inputs/datasets/maternal-health-risk-dataset-raw.csv', index=False)

---

# Data Preparation

Load and inspect the raw data and prepare it for the next steps: exploratory data analysis, data cleaning and feature engineering (see the following notebooks).

Note that if you ran the cells above, the data is already stored as df.

In [None]:
df = pd.read_csv('inputs/datasets/maternal-health-risk-dataset-raw.csv')

In [None]:
df.head()

Inspect dataset variable information

In [None]:
maternal_health_risk.variables

Check data types of variables (they already seem to be appropriate for our analysis)

In [None]:
df.dtypes

There are a few steps we want to take in preparing the dataset for further analysis:

* Rename BS to BloodSugar
* Convert BodyTemp from Fahrenheit to degrees Celsius, since this fits our business case better
* Most relevant: Convert RiskLevel to numeric, since the ML algorithm will require a numeric target variable.

Rename BS column to BloodSugar

In [None]:
df.rename(columns={"BS": "BloodSugar"}, inplace=True)
df.head(3)

Convert BodyTemp from Fahrenheit to degrees Celsius

In [None]:
# vectorized conversion: C = (F - 32) * 5 / 9, rounded to one decimal
df['BodyTemp'] = ((df['BodyTemp'] - 32) * 5 / 9).round(1)
df.head()
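
As a quick worked check of the conversion formula C = (F - 32) * 5 / 9, the sketch below applies it to a few made-up Fahrenheit readings (illustrative values, not taken from the dataset).

```python
import pandas as pd

# Made-up Fahrenheit readings for illustration
temps_f = pd.Series([98.0, 98.6, 103.0], name="BodyTemp")

# C = (F - 32) * 5 / 9, applied to the whole column at once
temps_c = ((temps_f - 32) * 5 / 9).round(1)
print(temps_c.tolist())
```

Here 98.6 °F maps to 37.0 °C, which is a convenient reference point for verifying the result.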

Convert target datatype to numeric

In [None]:
df["RiskLevel"].unique()

We choose the following mapping:
* low risk &rarr; 0
* mid risk &rarr; 1
* high risk &rarr; 2

Note that in the first step we keep the datatype unchanged (object) by mapping to the strings "0", "1" and "2", and only convert to a numeric type afterwards. A first attempt that replaced the labels directly with integers raised a FutureWarning stating that replace will change its dtype casting behaviour in a future pandas version. Keeping the two steps separate avoids depending on that behaviour.

In [None]:
df["RiskLevel"] = df["RiskLevel"].replace({"low risk": "0", "mid risk": "1", "high risk": "2"})
df.head()

In [None]:
df.tail()

Check RiskLevel datatype

In [None]:
df["RiskLevel"].dtype

Convert datatype to integer

In [None]:
df = df.astype({"RiskLevel": "int"})
df.dtypes

In [None]:
df.head()
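
As an alternative sketch (not the approach used in this notebook), Series.map with an integer dictionary encodes the labels in one step. Labels missing from the dictionary become NaN, so an explicit check can catch unexpected values. The example uses made-up toy labels, and the name risk_map is introduced only for illustration.

```python
import pandas as pd

# Same mapping as above, but with integer values
risk_map = {"low risk": 0, "mid risk": 1, "high risk": 2}

# Toy labels for illustration
labels = pd.Series(["high risk", "low risk", "mid risk"], name="RiskLevel")

# map() returns NaN for any label that is not a key of risk_map
encoded = labels.map(risk_map)
if encoded.isna().any():
    raise ValueError("Found RiskLevel labels that are not in risk_map")

encoded = encoded.astype("int64")
print(encoded.tolist())
```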

## Push files to Repo

Save the processed dataset to the outputs folder

In [None]:
# create outputs/datasets/collection folder (no error if it already exists)
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/maternal-health-risk-dataset.csv", index=False)
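
To confirm that a saved file can be read back unchanged, a round-trip check along the following lines may help. This is a minimal sketch on a made-up sample written to a temporary folder, so it does not touch the project's outputs; the file name sample.csv is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up sample with the same kind of columns as the prepared dataset
sample = pd.DataFrame({"BloodSugar": [7.0, 13.0], "RiskLevel": [0, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    out_path = os.path.join(out_dir, "sample.csv")

    # Write without the index, then read the file back
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# True when values, column names and dtypes all match
print(reloaded.equals(sample))
```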

---

# Conclusions and Next Steps

We collected the maternal health risk dataset and inspected it. We then processed it to prepare for further analysis and stored it in the outputs folder.

In the next notebook we will carry out an extensive analysis of the dataset, including exploratory data analysis, a correlation study and more detailed analyses of the most relevant variables.