# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generates two datasets:
    1. outputs/datasets/collection/**PredictiveMaintenanceTest**.csv
    2. outputs/datasets/collection/**PredictiveMaintenanceTrain**.csv

## Additional Comments
* The data is from a publicly accessible Kaggle repo found [here](https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance) and comes pre-divided into separate testing and training sets.
* For the learning context of this project, we host the data in a publicly accessible repo at [GitHub](https://github.com/roeszler/filter-maintenance-predictor).
* In the workplace, we would never push data to a public repository due to the security exposure it represents.

---

# Change working directory

The notebooks are stored in a subfolder. When running the notebook in the editor, we change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* `os.path.dirname()` = gets the parent directory
* `os.chdir()` = defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Fetch data from Kaggle

#### The Kaggle package has been pre-installed to fetch the data, using:
`pip install kaggle==1.5.12`

#### It has been recorded in requirements.txt, which is loaded on initialization, using:

`pip3 freeze --local > requirements.txt`

#### 1. Download a .JSON file (authentication token) from Kaggle and include it in the root directory
* kaggle.json

#### 2. Recognize the token in the session

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
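
The same step can be sketched in plain Python, which also fails early if the token is missing. This is a minimal sketch, not part of the kaggle package: `prepare_kaggle_token` is a hypothetical helper name, and it assumes a POSIX file system where owner-only permissions are meaningful.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict the token's permissions.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    # Equivalent of `chmod 600`: read/write for the owner only
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    # The Kaggle client reads this variable to locate kaggle.json
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token
```

In this notebook it would be called as `prepare_kaggle_token(os.getcwd())`, after the working directory has been changed to the project root.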

#### 3. Define the Kaggle dataset and destination folder, then download the data.

Kaggle url: [/prognosticshse/preventive-to-predicitve-maintenance](https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance).
* **Note** the misspelling of 'predictive'

The following function: 
* Retrieves the Kaggle dataset
* Creates a destination folder for the data to be placed
* Downloads it to the destination folder
* Unzips the downloaded file
* Deletes the **.zip** file and unused data
* Optionally removes the **kaggle.json** file used to access the dataset on Kaggle (this step is commented out)

In [None]:
KaggleDatasetPath = 'prognosticshse/preventive-to-predicitve-maintenance'
DestinationFolder = 'inputs/datasets/raw'   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm {DestinationFolder}/*.pdf \
  && rm {DestinationFolder}/*.mat
# Uncomment to delete the token once the download is complete:
# ! rm kaggle.json
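
For environments without `unzip` and `rm`, the extract-and-clean part of this step can be sketched with the standard library. This is an illustrative alternative, not the method used in this notebook: `extract_and_clean` and `drop_suffixes` are hypothetical names, and the download itself is still assumed to be done by the Kaggle command above.

```python
import zipfile
from pathlib import Path


def extract_and_clean(folder, drop_suffixes=(".pdf", ".mat")):
    """Unzip every archive in `folder`, then delete the archives and unused files.

    Hypothetical helper for illustration. Returns the remaining file names, sorted.
    """
    folder = Path(folder)
    # Materialise the listing first, because files are deleted inside the loop
    for archive in list(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
        archive.unlink()  # remove the .zip once its contents are extracted
    for path in list(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in drop_suffixes:
            path.unlink()  # remove file types the project does not use
    return sorted(p.name for p in folder.iterdir())
```

A call such as `extract_and_clean('inputs/datasets/raw')` would leave only the extracted files that are not `.pdf` or `.mat`.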

---

# Load and Inspect Kaggle data

#### Load Data to Inspect
We could combine the two datasets; however, as they are supplied as two sets with slightly different content, we inspect each one separately.

In [None]:
import pandas as pd
df_test = pd.read_csv('inputs/datasets/raw/Test_Data_CSV.csv')
df_train = pd.read_csv('inputs/datasets/raw/Train_Data_CSV.csv')

#### DataFrame Summary

In [None]:
df_test.info()

In [None]:
df_train.info()
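
Because the two sets are said to differ slightly in content, it can help to list which columns they share and which are unique to each. The following is a minimal sketch; `compare_columns` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd


def compare_columns(left, right):
    """Return the columns shared by both frames and those unique to each.

    Hypothetical helper for illustration.
    """
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "shared": sorted(left_cols & right_cols),
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }
```

Calling `compare_columns(df_test, df_train)` on the frames loaded above would show any column present in one set but not the other.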

---

# Explore Data


`pandas_profiling` and `ipywidgets` have been pre-installed with:

* `pip install pandas-profiling`

* `pip install ipywidgets`

Remember to update requirements.txt afterwards.

#### To explore the **Test** dataset:

In [None]:
from pandas_profiling import ProfileReport
pandas_report_test = ProfileReport(df=df_test, minimal=True)
pandas_report_test.to_notebook_iframe()

#### To explore the **Train** dataset:

In [None]:
pandas_report_train = ProfileReport(df=df_train, minimal=True)
pandas_report_train.to_notebook_iframe()

#### To view the distributions of the data?

---

## Considerations

#### We note that neither dataset has **missing data**.
* This is apart from the difference we already know about: **df_test** includes RUL and **df_train** does not.

In [None]:
pandas_report_train

In [None]:
pandas_report_test
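
The absence of missing data can also be checked directly on the DataFrames, independently of the profiling report. This is a minimal sketch; `missing_summary` is a hypothetical helper name.

```python
import pandas as pd


def missing_summary(df):
    """Count missing values per column, keeping only columns that have any.

    Hypothetical helper for illustration. An empty result means no missing data.
    """
    counts = df.isna().sum()
    return counts[counts > 0]
```

If the observation above holds, both `missing_summary(df_train)` and `missing_summary(df_test)` should return an empty Series.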

#### Extend Data_No of df_test dataset

A comparison between sets reveals that the **Data_No** variable:
* Is a categorical variable presented as an integer
* Restarts at the beginning of each dataset

This has the potential to confound subsequent analysis between the sets, where the analysis erroneously treats *Data_No* as a discrete value and/or a duplicate entry. To help avoid confusion, we alter the values in the **df_test dataset** to be a continuation of the bins seen in the **df_train dataset**.

This is as simple as adding the total number of unique bins in the df_train set to each *Data_No* value in the df_test set:

Quick reminder of the tables we are working with

Calculate the total number of test sets in **df_train**

Continue the numbering in the next set

Replace new data references into **df_test**

Check new Data_No values in both sets
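
The renumbering steps listed above can be sketched as follows. This is a minimal sketch rather than the notebook's original cells: `continue_data_no` is a hypothetical helper name, and it assumes that `Data_No` is an integer column numbered consecutively from 1 in both sets.

```python
import pandas as pd


def continue_data_no(df_train, df_test, column="Data_No"):
    """Return a copy of df_test whose bin labels continue after those in df_train.

    Hypothetical helper for illustration. Assumes the bins in both sets are
    numbered consecutively from 1.
    """
    n_train_bins = df_train[column].nunique()  # total number of bins in df_train
    shifted = df_test.copy()  # leave the original frame untouched
    shifted[column] = shifted[column] + n_train_bins  # continue the numbering
    return shifted
```

Usage would be `df_test = continue_data_no(df_train, df_test)`, after which `df_test['Data_No'].unique()` and `df_train['Data_No'].unique()` can be compared to confirm that the two sets no longer share any labels.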

---

## Save Datasets

#### Combine datasets

In [None]:
# combined_list = [df_test, df_train, df_validate]
# df = pd.concat(combined_list)
# df

#### Save the files to an output folder

In [None]:
import os
# Create outputs/datasets/collection; do nothing if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

df_train.to_csv('outputs/datasets/collection/PredictiveMaintenanceTrain.csv', index=False)
df_test.to_csv('outputs/datasets/collection/PredictiveMaintenanceTest.csv', index=False)
# df_validate.to_csv(f'outputs/datasets/collection/PredictiveMaintenanceValidate.csv',index=False)
# df.to_csv(f'outputs/datasets/collection/FilterMaintenancePredictorDataset.csv',index=False)

Now push the changes to your GitHub repo, using the Git commands (`git add`, `git commit`, `git push`).

---

# Conclusions and Next steps

#### Conclusions: 
* The data was supplied without missing observations
* The Data_No references were repeated across the two sets and have been renumbered

#### Next Steps:
* Data Cleaning

---