# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **Data Collection Notebook**

## Objectives


* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

### Inputs

*   Kaggle JSON file - the authentication token.

### Outputs

* Two datasets, to be combined into one total dataset in a later step:
    1. outputs/datasets/collection/**PredictiveMaintenanceTest**.csv
    2. outputs/datasets/collection/**PredictiveMaintenanceTrain**.csv
    * Combined: outputs/datasets/collection/**PredictiveMaintenanceTotal**.csv

### Additional Comments
* The data is from a publicly accessible Kaggle repo found [here](https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance) and comes pre-divided into distinctly different Testing and Training data.
* For the purposes of the learning context of this project, we are hosting the data in a publicly accessible repo at [GitHub](https://github.com/roeszler/filter-maintenance-predictor).
* In the workplace, we would never push data to a public repository because of the security exposure it represents.

---

# Change working directory

The notebooks are stored in a subfolder. When running the notebook in the editor, we change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

To make the parent of the current directory the new current directory:
* `os.path.dirname()` = gets the parent directory
* `os.chdir()` = defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
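
The same step can also be written with `pathlib`. This is a minimal sketch rather than the notebook's own method; the helper name `move_to_parent` is hypothetical.

```python
import os
from pathlib import Path


def move_to_parent() -> Path:
    """Make the parent of the current working directory the new working directory."""
    parent = Path.cwd().parent
    os.chdir(parent)
    return parent


# Usage (run once only; each call moves one level further up):
# new_dir = move_to_parent()
# print(f"Current directory set to {new_dir}")
```

Because every call moves one level up, re-running the cell by accident would leave the session outside the project root, which is why the notebook confirms the directory afterwards.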

---

# Fetch data from Kaggle

#### 1. Download a .JSON file (authentication token) from Kaggle and include it in the root directory
* kaggle.json

#### 2. Recognize the token in the session

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
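
For environments where shell commands are not available, the cell above could be approximated in pure Python. This is a sketch under assumptions: the helper name `register_kaggle_token` is hypothetical, and the permission check is only meaningful on POSIX systems.

```python
import os
import stat
from pathlib import Path


def register_kaggle_token(config_dir: str) -> Path:
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Equivalent of `chmod 600`: read and write for the owner only
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token


# Usage, assuming kaggle.json sits in the current working directory:
# register_kaggle_token(os.getcwd())
```

Failing early with a clear error when the token is missing is a design choice of this sketch; the shell version would only report that `chmod` could not find the file.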

#### 3. Define the Kaggle dataset, and destination folder and download it.

Kaggle url: [/prognosticshse/preventive-to-predicitve-maintenance](https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance) .
* **Note** the misspelling of 'predictive'

The following cell: 
* Retrieves and defines the Kaggle dataset
* Creates a destination folder for the data to be placed in
* Downloads it to the destination folder
* Unzips the downloaded file
* Deletes the **.zip** file and unused data (**.pdf** and **.mat** files)
* Optionally removes the **kaggle.json** file used to access the dataset on Kaggle (this last step is commented out in the cell)

In [None]:
KaggleDatasetPath = 'prognosticshse/preventive-to-predicitve-maintenance'
DestinationFolder = 'inputs/datasets/raw'   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm {DestinationFolder}/*.pdf \
  && rm {DestinationFolder}/*.mat

# Uncomment the next line to also delete the token once the download is complete
# ! rm kaggle.json
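
The unzip and clean-up steps can also be sketched in Python with the standard library, which avoids relying on `unzip` and `rm` being installed. The helper name `extract_and_clean` and its `drop_suffixes` parameter are hypothetical; the download itself is still assumed to be done by the `kaggle` command above.

```python
import zipfile
from pathlib import Path


def extract_and_clean(folder: str, drop_suffixes=(".pdf", ".mat")) -> list:
    """Unzip every archive in folder, delete the archives and any unused file types."""
    dest = Path(folder)
    # list() so that deleting files does not interfere with the directory scan
    for archive in list(dest.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        archive.unlink()
    for path in list(dest.iterdir()):
        if path.is_file() and path.suffix.lower() in drop_suffixes:
            path.unlink()
    # Return the names that remain so the result can be inspected
    return sorted(p.name for p in dest.iterdir())


# Usage:
# extract_and_clean("inputs/datasets/raw")
```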

---

# Load and Inspect Kaggle data

#### Load Data to Inspect
We could combine both datasets. However, as they have been supplied as two sets with slightly different content, we will inspect each of them separately.

In [None]:
import pandas as pd
df_test = pd.read_csv('inputs/datasets/raw/Test_Data_CSV.csv')
df_train = pd.read_csv('inputs/datasets/raw/Train_Data_CSV.csv')

#### DataFrame Summary

In [None]:
df_test.info()

In [None]:
df_train.info()
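
Since the two sets are described as having slightly different content, a quick structural comparison can complement reading the two `info()` outputs side by side. This is a minimal sketch; the helper name `compare_columns` is hypothetical and the toy frames only stand in for `df_train` and `df_test`.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report columns unique to each frame and shared columns whose dtypes differ."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    shared = left_cols & right_cols
    return {
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
        "dtype_mismatch": sorted(
            c for c in shared if left[c].dtype != right[c].dtype
        ),
    }


# Toy frames for illustration only (not the real data)
toy_a = pd.DataFrame({"Data_No": [1, 2], "RUL": [10.0, 9.0]})
toy_b = pd.DataFrame({"Data_No": [1, 2], "Time": [0.1, 0.2]})
column_report = compare_columns(toy_a, toy_b)
print(column_report)
```

With the real data the call would be `compare_columns(df_train, df_test)`.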

---

# Explore Data


#### To explore the **Test** dataset:

In [None]:
df_test.head()

In [None]:
from pandas_profiling import ProfileReport
pandas_report_test = ProfileReport(df=df_test, minimal=True)
pandas_report_test.to_notebook_iframe()

### Main observations of the **Test** Dataset :

* There are no missing cells.

* Differential Pressure contains zeros and has a **reverse exponential** shaped distribution.
    * This matches what we expect: the beginning of each test set has a period where the filter is clean and the difference in pressure is negligible.
    * Consequently, the measures of distribution (Mean, Median, Mode, Skewness, Kurtosis) reflect the reverse exponential shape.

* Most of the **Dust_Feed** was run at 60mm<sup>3</sup>/s
    * possibly manipulate data to make the range of test sets more evenly distributed

* There are more than three times as many A3 Medium Dust observations (47.9%) as A2 Fine Dust observations (14.8%), with A4 Coarse Dust tests making up the remaining 37.3%.
    * possibly manipulate data to make the range of test sets more evenly distributed
    
* The RUL target distribution is right or **positively skewed** at 0.71.
    * Confirmed by the **Mean** of **111.48** > **Median** of **93.5**
    * An ideal normal distribution has mean, median and mode similar in value and a skewness measure approaching zero
    * Kurtosis, a measure of the distribution's tails, is -0.34. This is relatively low in value and negative, indicating few outliers.
    * Similar to **differential pressure**, this shape is what we expect for a variable that progresses to zero.
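
The shape measures quoted in these observations can be reproduced directly with pandas. The sketch below uses a small made-up series, not the real RUL column, and the helper name `shape_summary` is hypothetical. Note that `Series.kurt()` reports excess kurtosis, so a normal distribution scores close to zero.

```python
import pandas as pd


def shape_summary(values: pd.Series) -> dict:
    """Collect the measures used to judge skewness and tail weight."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),
        "kurtosis": values.kurt(),  # excess kurtosis (normal distribution is about 0)
    }


# Toy right-skewed values for illustration only
toy_rul = pd.Series([5, 10, 15, 20, 25, 40, 60, 120, 200], name="RUL")
summary = shape_summary(toy_rul)
print(summary)
```

In this toy series the mean sits above the median and the skew is positive, which is the same pattern of right skew described for the RUL target.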

### Early Conclusions
* Use box plot visualization to investigate this skewness further.
* We will consider manipulating the data at the feature engineering stage to reduce the effect of skewness, for example:
    * Random Forest Selection (Bagging)
    * Logarithmic transformation
    * Adjusting the data so that the range of test sets is more evenly distributed
    * Feature Scaling
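
As an illustration of the logarithmic transformation option, the sketch below applies `numpy.log1p` to a made-up right-skewed series and compares the skewness before and after. The values are illustrative only; `log1p` is used here as an assumption because it is defined at zero, which matters for a variable that contains zeros.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values for illustration only (not the real data)
toy = pd.Series([1, 2, 3, 5, 8, 13, 21, 34, 55, 89], dtype=float)

# log1p computes log(1 + x), so a value of 0 maps to 0 instead of -inf
log_toy = np.log1p(toy)

skew_before = toy.skew()
skew_after = log_toy.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Whether this transformation suits the real features would still need to be checked at the feature engineering stage.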

#### Note: 
The tails of this dataset's observations have been deliberately removed at random points (right censoring). This needs to be considered when engineering the distributions. In light of this, and depending on our Principal Component Analysis (PCA), a Random Forest Selection (Bagging) may prove to be the preferred method to engineer this set.

---

#### To explore the **Train** dataset:

In [None]:
df_train.head()

In [None]:
pandas_report_train = ProfileReport(df=df_train, minimal=True)
pandas_report_train.to_notebook_iframe()

What group do the zeros appear in mostly?
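
One way to answer this question is to compute the share of zero readings per dust group. This is a sketch on made-up data; the column names `Differential_pressure` and `Dust` are assumptions about the dataset, and the helper name `zero_share_by_group` is hypothetical.

```python
import pandas as pd


def zero_share_by_group(df: pd.DataFrame, value_col: str, group_col: str) -> pd.Series:
    """Fraction of rows in each group where value_col equals zero, highest first."""
    is_zero = df[value_col].eq(0)
    return is_zero.groupby(df[group_col]).mean().sort_values(ascending=False)


# Toy frame for illustration only; column names are assumed
toy_df = pd.DataFrame({
    "Dust": ["A2", "A2", "A3", "A3", "A3", "A4"],
    "Differential_pressure": [0.0, 1.2, 0.0, 0.0, 3.4, 5.0],
})
zero_share = zero_share_by_group(toy_df, "Differential_pressure", "Dust")
print(zero_share)
```

With the real data the call might look like `zero_share_by_group(df_train, 'Differential_pressure', 'Dust')`, once the actual column names are confirmed from `df_train.info()`.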

### Main observations of the **Train** Dataset :

* There are also no missing cells.

* Differential Pressure contains zeros and has the same **reverse exponential** shaped distribution as df_test.
    * This matches what we expect: the beginning of each test set has a period where the filter is clean and the difference in pressure is negligible.
    * Consequently, the measures of distribution (Mean, Median, Mode, Skewness, Kurtosis) reflect the same reverse exponential shape.


* **Dust_Feed** is a bit more evenly spread through the data, ranging from about 27% of observations at 158.5mm<sup>3</sup>/s to around 20% for feeds between 60mm<sup>3</sup>/s and 118mm<sup>3</sup>/s.
    * In a live project, we would check with the stakeholders as to possible reasons for this and confirm that it represents typical data seen in practice
    * possibly manipulate data to make the range of test sets more evenly distributed


* A3 Medium Dust remains the largest proportion of the dust observations (47.9%), while the proportions of A2 Fine Dust (28.2%) and A4 Coarse Dust (23.7%) are approximately the same.
    * We would also check this with the stakeholders in a live workplace project
    * possibly manipulate data to make the range of test sets more evenly distributed
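
The category proportions of the two sets can be placed side by side with `value_counts(normalize=True)`. This is a minimal sketch on made-up data; the column name `Dust` is an assumption and the helper name `compare_proportions` is hypothetical.

```python
import pandas as pd


def compare_proportions(train: pd.DataFrame, test: pd.DataFrame, column: str) -> pd.DataFrame:
    """Percentage of rows per category in each set, aligned on the category label."""
    return pd.DataFrame({
        "train_pct": train[column].value_counts(normalize=True) * 100,
        "test_pct": test[column].value_counts(normalize=True) * 100,
    }).fillna(0).round(1)


# Toy frames for illustration only; the column name is assumed
toy_train = pd.DataFrame({"Dust": ["A2", "A3", "A3", "A4"]})
toy_test = pd.DataFrame({"Dust": ["A3", "A3", "A4", "A4"]})
proportions = compare_proportions(toy_train, toy_test, "Dust")
print(proportions)
```

A category missing from one set shows up as 0 rather than as a missing value, which makes imbalances between the sets easier to spot.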

#### Reminder Note: 
The tails of this dataset's observations have been deliberately removed at random points (right censoring). This needs to be considered when engineering the distributions of this dataset. In light of this, and of further Principal Component Analysis (PCA), a Random Forest Selection (Bagging) may prove to be the preferred method to engineer this set.

---

## Save Dataset

#### Save the files to an outputs/../collection folder

In [None]:
import os
# Create the outputs/datasets/collection folder; no error is raised if it already exists
os.makedirs(name='outputs/datasets/collection', exist_ok=True)

df_train.to_csv('outputs/datasets/collection/PredictiveMaintenanceTrain.csv', index=False)
df_test.to_csv('outputs/datasets/collection/PredictiveMaintenanceTest.csv', index=False)
# df_total.to_csv(f'outputs/datasets/collection/PredictiveMaintenanceTotal.csv',index=False)
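
Before pushing, it can be worth confirming that each saved file reads back with the same shape. The sketch below is one way to do that; the helper name `save_and_verify` is hypothetical and the toy frame stands in for `df_train` or `df_test`.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df: pd.DataFrame, folder: str, filename: str) -> str:
    """Write df to folder/filename as CSV and check that it reloads with the same shape."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape changed on reload: {df.shape} -> {reloaded.shape}")
    return path


# Demonstration in a temporary folder with toy data
with tempfile.TemporaryDirectory() as tmp:
    toy_frame = pd.DataFrame({"Data_No": [1, 2], "RUL": [10.0, 9.0]})
    saved_path = save_and_verify(toy_frame, os.path.join(tmp, "collection"), "Toy.csv")
    saved_name = os.path.basename(saved_path)
    saved_exists = os.path.exists(saved_path)
```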

Now push the changes to your GitHub Repo, using the Git commands (git add, git commit, git push)

---

# Conclusions and Next steps

#### Conclusions: 
* Data supplied without missing observations

#### Next Steps:
* Clean Data
    * Create Total dataset
    * Extend Data_No references
    * Determine Correlations
    * Manage Missing Data
    * Review Outliers
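
As a preview of the first two next steps, the sketch below shows one possible way to build a total dataset while keeping the `Data_No` identifiers unique. It assumes that "Extend Data_No references" means offsetting the identifiers of one set so they continue after the other; that reading, and the helper name `combine_sets`, are assumptions rather than the project's confirmed approach.

```python
import pandas as pd


def combine_sets(train: pd.DataFrame, test: pd.DataFrame, id_col: str = "Data_No") -> pd.DataFrame:
    """Stack train and test, shifting the test identifiers past the largest train identifier."""
    shifted = test.copy()
    shifted[id_col] = shifted[id_col] + train[id_col].max()
    return pd.concat([train, shifted], ignore_index=True)


# Toy frames for illustration only
toy_train_set = pd.DataFrame({"Data_No": [1, 1, 2], "RUL": [20, 10, 30]})
toy_test_set = pd.DataFrame({"Data_No": [1, 2], "RUL": [15, 25]})
toy_total = combine_sets(toy_train_set, toy_test_set)
print(toy_total)
```

Copying the test frame before shifting keeps the original `df_test` unchanged, so the separately saved files stay consistent with the raw data.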

---