# **Import and Review Datasets**

## Objectives

* Fetch data from Kaggle and save it as raw data

### Inputs

* Trash Locator
* House Price Predictor
* Skin Checker
* Disease Screener
* Dog Emotions
* Filter Maintenance

### Outputs

* Raw CSV files downloaded to `inputs/datasets/raw`
* Combined, test and train datasets saved as CSV files in `outputs/datasets/collection`

### Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print('You set a new current directory')

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Fetch data from Kaggle

After importing your **kaggle.json** token file, run the following so that the session recognizes it

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

We are using the following Kaggle URL: [https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance](https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance)

![image.png](/workspace/dataset-testing/static/img/PPM_Dataset_Kaggle.png)

Get the dataset path from the Kaggle URL
* When viewing the dataset on Kaggle, the dataset path is everything after `https://www.kaggle.com/datasets/`; here it is `prognosticshse/preventive-to-predicitve-maintenance`.
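
The same extraction can be done programmatically. This is a minimal sketch, assuming the URL follows the usual `https://www.kaggle.com/datasets/<owner>/<dataset>` layout; the helper name `kaggle_dataset_path` is hypothetical and not part of the Kaggle tooling:

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the '<owner>/<dataset>' part of a Kaggle dataset URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split('/') if p]
    # Expected layout: ['datasets', '<owner>', '<dataset>', ...]
    if len(parts) < 3 or parts[0] != 'datasets':
        raise ValueError(f'Not a Kaggle dataset URL: {url}')
    return '/'.join(parts[1:3])


url = 'https://www.kaggle.com/datasets/prognosticshse/preventive-to-predicitve-maintenance'
dataset_path = kaggle_dataset_path(url)
print(dataset_path)
```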

The following commands:
* Retrieve the Kaggle dataset
* Create a destination folder for the data
* Download the dataset to the destination folder
* Unzip the downloaded file
* Delete the **.zip** file
* Delete the unused copies of the data stored as MATLAB **.mat** files
* Optionally remove the **kaggle.json** token used to access the dataset on Kaggle (this step is commented out in the cell)
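
For environments without `unzip` and `rm`, the extract-and-clean-up steps can be sketched in plain Python. This is an illustrative alternative, not the method used in this notebook; the function name `extract_and_clean` and the demo file contents are assumptions, and the download step itself is not covered:

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_clean(dest_folder, drop_suffixes=('.mat',)):
    """Unzip every archive in dest_folder, then delete the archives and unwanted files."""
    dest = Path(dest_folder)
    for archive in list(dest.glob('*.zip')):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        archive.unlink()  # delete the .zip once it has been extracted
    for path in list(dest.iterdir()):
        if path.is_file() and path.suffix.lower() in drop_suffixes:
            path.unlink()  # delete unused copies such as MATLAB .mat files
    return sorted(p.name for p in dest.iterdir())


# Demo on a throwaway folder holding a made-up archive
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / 'demo.zip', 'w') as zf:
        zf.writestr('Train_Data_CSV.csv', 'Data_No\n1\n')
        zf.writestr('Data.mat', 'placeholder')
    remaining = extract_and_clean(tmp)

print(remaining)
```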

In [None]:
KaggleDatasetPath = 'prognosticshse/preventive-to-predicitve-maintenance'
DestinationFolder = 'inputs/datasets/raw'   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm {DestinationFolder}/*.mat
# Uncomment to also delete the Kaggle token once the download has finished
# ! rm kaggle.json

---

# Load and Inspect Kaggle data

#### Load Data to Inspect

In [None]:
import pandas as pd
df_test = pd.read_csv('inputs/datasets/raw/Test_Data_CSV.csv')
df_train = pd.read_csv('inputs/datasets/raw/Train_Data_CSV.csv')

#### Data Composition

In [None]:
df_test.to_numpy()
df_test.shape

In [None]:
df_train.to_numpy()
df_train.shape

In [None]:
test_size = float(df_test.size)
train_size = float(df_train.size)
print(f'Train Data Shape {df_train.shape}; is {(train_size / (train_size + test_size))*100:.2f}% of the total data')
print(f'Test Data Shape {df_test.shape}; is {(test_size / (train_size + test_size))*100:.2f}% of the total data')
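
`DataFrame.size` counts cells (rows times columns), so when the two frames have different numbers of columns the percentages above are shares of cells rather than of rows. A minimal sketch of a row-based split follows; the frames `toy_train` and `toy_test` and their `x` column are made up for illustration and are not the Kaggle data:

```python
import pandas as pd

toy_train = pd.DataFrame({'Data_No': [1, 1, 1, 2, 2, 2],
                          'x': [0.1, 0.2, 0.3, 0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({'Data_No': [1, 1],
                         'x': [0.1, 0.2],
                         'RUL': [5.0, 5.0]})

# len() counts rows only, so an extra column does not skew the split
n_train, n_test = len(toy_train), len(toy_test)
train_share = n_train / (n_train + n_test) * 100
test_share = n_test / (n_train + n_test) * 100

print(f'Train rows: {n_train} ({train_share:.2f}%)')
print(f'Test rows: {n_test} ({test_share:.2f}%)')
```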

In [None]:
df_train.columns.to_list()

In [None]:
df_train.Data_No.count()

In [None]:
df_train['Data_No'].unique()
# df_train['Data_No'].nunique()

In [None]:
df_test.columns.to_list()

In [None]:
df_test_np = df_test.to_numpy()
df_test_np

In [None]:
df_test['Data_No'].to_list()

In [None]:
# Collect the value of every column in the final row of the test data
last_test_row = [df_test[col].iloc[-1] for col in df_test.columns]
print(last_test_row)

### List of the observations at the end of each life test
Used to answer the question:
* **Did the filter fail at the end of the test?**
    * This helps indicate whether the test belongs to the right-censored test group
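
One way to pick out these boundary observations is to compare each row's `Data_No` with its neighbour's: a row is the last of its test when the next row belongs to a different test. A minimal sketch on a made-up frame (the `dP` column and its values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    'Data_No': [1, 1, 1, 2, 2, 3],
    'dP': [0.1, 0.4, 0.9, 0.2, 0.5, 0.3],
})

# shift(1) exposes the previous row: True where a new test starts
first_rows = toy[toy.Data_No != toy.Data_No.shift(1)]
# shift(-1) exposes the next row: True where a test ends
last_rows = toy[toy.Data_No != toy.Data_No.shift(-1)]

print(first_rows.index.to_list())
print(last_rows.index.to_list())
```

When the rows of each test are contiguous, as in this toy frame, `toy.groupby('Data_No').tail(1)` selects the same final rows.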

In [None]:
df_test[df_test.Data_No != df_test.Data_No.shift(1)]

In [None]:
df_test[df_test.Data_No != df_test.Data_No.shift(-1)]

In [None]:
# df_train['Data_No'].value_counts().unique()
# df_train['Data_No'].value_counts()
count = df_train.groupby(['Data_No']).count()
# print(count.head())
print(count)

In [None]:
df_test.columns[6]

### Test Data
* Includes the Remaining Useful Life (RUL) target variable, sourced from live measurements

In [None]:
df_test

Test DataFrame Summary

In [None]:
df_test.info()
df_test.head()

In [None]:
df_test.isnull().sum()

In [None]:
df_test.describe()

In [None]:
df_test.corr()

### List of the observations at the end of each life test
Used to answer the question:
* **Did the filter fail at the end of the test?**
    * This helps indicate whether the test belongs to the right-censored test group

In [None]:
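# A DataFrame cannot be called like a function; assumed intent: first rows of test number 2
df_test[df_test.Data_No == 2].head()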

In [None]:
df_test[df_test.Data_No != df_test.Data_No.shift(-1)]

### Train Data

In [None]:
df_train

Train DataFrame Summary

In [None]:
df_train.info()
df_train.head()

In [None]:
df_train.isnull().sum()

In [None]:
df_train.describe()

In [None]:
df_train.corr()

In [None]:
df_train[df_train.Data_No != df_train.Data_No.shift(1)]

In [None]:
df_train[df_train.Data_No != df_train.Data_No.shift(-1)]

---

# Push Files to Repo

Add RUL column to Train Data

In [None]:
# df_train['RUL'].unique()
# del df_train['RUL']
# df_train.insert(loc=6, column='RUL', value='Training Data', allow_duplicates=False)
df_train.insert(loc=6, column='RUL', value=0.0, allow_duplicates=False)
df_train

In [None]:
# Only the last expression in a cell is displayed, so print the dtype explicitly
print(df_train.dtypes['RUL'])
df_train.info()

#### Combine Files

In [None]:
combined_list = [df_test, df_train]
df = pd.concat(combined_list)
df


In [None]:
pip install openpyxl

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv('outputs/datasets/collection/FilterMaintenancePredictorDataset.csv', index=False)
df_test.to_csv('outputs/datasets/collection/Test_FilterMaintenancePredictorDataset.csv', index=False)
df_train.to_csv('outputs/datasets/collection/Train_FilterMaintenancePredictorDataset.csv', index=False)

---

# Notes Section

#### Combine Train & Test Data

In [None]:
import pandas as pd
df_test = pd.read_csv('inputs/datasets/raw/Test_Data_CSV.csv')
df_train = pd.read_csv('inputs/datasets/raw/Train_Data_CSV.csv')
combined_list = [df_test, df_train]
df = pd.concat(combined_list)
df

In [None]:
df.info()
df.head()

In [None]:
df[df.duplicated(subset=['Data_No'])]

## Impute Missing Remaining Useful Life (RUL) Data

* The RUL information of the test data is not an estimate, rather the actual time when the experiment exceeded the threshold. 
    * In order to define a specific test problem, the measurements in the test data set are right-censored at random points and only the corresponding RUL information is provided.
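
To make the right-censoring concrete, the sketch below truncates one made-up run at a random point and records how much of the run was cut off as its RUL. The column name `dP`, the run length, and counting RUL in rows rather than in time units are all illustrative assumptions, not properties of the Kaggle files:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# A complete run to failure: 10 observations of a single test
full_run = pd.DataFrame({'Data_No': 1, 'dP': np.linspace(0.0, 9.0, 10)})

# Right-censor the run: keep only the first `cut` rows, chosen at random
cut = int(rng.integers(low=1, high=len(full_run)))
observed = full_run.iloc[:cut].copy()

# The RUL is what was hidden: the observations between the cut and the failure
observed['RUL'] = len(full_run) - cut

print(observed.tail(1))
```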

#### Test Overfitting

#### Test Underfitting

Check missing data:
* https://docs.google.com/document/d/1yXb5g5IU7IldBpND1FbIfHIyGGmNSlXogarAyBlNO_g/edit?usp=sharing
* df.isnull()
* df.isnull().sum()

In [None]:
df.isnull()
df.isnull().sum()

You can calculate the value to be filled in.
* The example below calculates the mean of the `RUL` column and inserts it wherever `RUL` is missing.
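
A minimal sketch of mean imputation on a made-up frame (the values are illustrative). It assigns the filled column back rather than calling `fillna(..., inplace=True)` on a column selection, which recent pandas versions warn about:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Data_No': [1, 1, 2, 2],
                    'RUL': [10.0, np.nan, 30.0, np.nan]})

# mean() skips NaN by default, so the fill value here is (10 + 30) / 2
toy['RUL'] = toy['RUL'].fillna(toy['RUL'].mean())

print(toy['RUL'].to_list())
```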


In [None]:
# Assign the result back; inplace=True on a column selection is deprecated in recent pandas
df['RUL'] = df['RUL'].fillna(df['RUL'].mean())
df

## Manage .mat files in Python?

The data was created in MATLAB and is provided as a `Data.mat` file.
* The source contributor has also uploaded the data as CSV files.
* They indicate that although the file structure differs slightly between the .csv and .mat files, the variable names have been kept.
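
As a quick orientation to how `scipy.io.loadmat` represents a .mat file, the sketch below writes a tiny made-up file with `savemat` and reads it back. The variable name `pressure` and its values are assumptions for illustration and say nothing about the contents of `Data.mat`:

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.mat')
    savemat(path, {'pressure': np.array([1.0, 2.0, 3.0])})
    contents = loadmat(path)

# loadmat returns a dict; keys wrapped in double underscores are file metadata
variables = {k: v for k, v in contents.items() if not k.startswith('__')}
for name, value in variables.items():
    print(name, value.shape)
```

MATLAB arrays are at least two-dimensional, which is why a one-dimensional NumPy array comes back as a single-row matrix.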

#### Install scipy

In [None]:
pip install scipy

Import `scipy.io` and load the file with its `loadmat` function

In [None]:
import scipy.io as sio

# Path is relative to the project root set as the working directory earlier.
# Requires the .mat files to be kept, i.e. skip the `rm ... *.mat` step when downloading.
mat_fname = 'inputs/datasets/raw/Data.mat'
mat_contents = sio.loadmat(mat_fname)
sorted(mat_contents.keys())

In [None]:
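# 'None' is unlikely to be a variable stored in the file; list the stored variable names instead
[key for key in mat_contents if not key.startswith('__')]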

Alternatively, import `loadmat` directly:

In [None]:
from scipy.io import loadmat
annots_data = loadmat('inputs/datasets/raw/Data.mat')
annots_fine = loadmat('inputs/datasets/raw/Particle size distribution_ISO_12103_1_A2_Fine.mat')
annots_medium = loadmat('inputs/datasets/raw/Particle size distribution_ISO_12103_1_A3_Medium.mat')
annots_coarse = loadmat('inputs/datasets/raw/Particle size distribution_ISO_12103_1_A4_Coarse.mat')
print(annots_data)
# print('---')
# print(annots_fine)
# print('---')
# print(annots_medium)
# print('---')
# print(annots_coarse)

---


## Push files to Repo

#### Combine Files

In [None]:
combined_list = [df_test, df_train]
df = pd.concat(combined_list)
df


In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv('outputs/datasets/collection/FilterMaintenancePredictorDataset.csv', index=False)
df_test.to_csv('outputs/datasets/collection/Test_FilterMaintenancePredictorDataset.csv', index=False)
df_train.to_csv('outputs/datasets/collection/Train_FilterMaintenancePredictorDataset.csv', index=False)