### This is the statement for the evaluation of **Course 3: Machine Learning applied to Predictive Maintenance in the Industry**
#### The solution can be found in [this accompanying notebook](https://www.kaggle.com/brjapon/course3-solution-crwu-bearing-feat-importance)

**NOTE:** *Errors raised when running the notebook mean that you have to fix or complete those lines with the right commands.*

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import seaborn as sns
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Fault type identification
There are 10 classes: nine bearing defects plus the normal (healthy) condition:

- **Ball_007_1**: Ball defect (0.007 inch)
- **Ball_014_1**: Ball defect (0.014 inch)
- **Ball_021_1**: Ball defect (0.021 inch)
- **IR_007_1**: Inner race fault (0.007 inch)
- **IR_014_1**: Inner race fault (0.014 inch)
- **IR_021_1**: Inner race fault (0.021 inch)
- **Normal_1**: Normal
- **OR_007_6_1**: Outer race fault (0.007 inch, data collected from 6 O'clock position)
- **OR_014_6_1**: Outer race fault (0.014 inch, 6 O'clock)
- **OR_021_6_1**: Outer race fault (0.021 inch, 6 O'clock)

## Get the data
The file we will read is the result of preprocessing the raw data files (folder `/kaggle/input/cwru-bearing-datasets/raw/`).

Each time series segment contains 2048 points. Given that the sampling frequency is 48 kHz, each segment covers about 0.043 seconds (2048 / 48000).
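
As a quick check of the arithmetic above, the duration of one segment is the number of points divided by the sampling frequency. The variable names below are illustrative and do not come from the dataset files.

```python
# Duration of one segment = number of samples / sampling frequency
n_points = 2048            # samples per segment, as stated above
sampling_rate_hz = 48_000  # 48 kHz

segment_duration_s = n_points / sampling_rate_hz
print(f"{segment_duration_s:.4f} s")
```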

In [None]:
data_time = pd.read_csv("../input/cwru-bearing-datasets/feature_time_48k_2048_load_1.csv")
data_time

## Split into train and test datasets

In [None]:
train_data, test_data = train_test_split(data_time, test_size = 750, stratify = data_time['fault'], random_state = 1234)
print( train_data['fault'].value_counts(), "\n\n", test_data['fault'].value_counts())
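
To see what `stratify` does in the split above, here is a minimal sketch on a hypothetical toy frame (the labels `A` to `D` and the column `feature` are made up for illustration). With stratification, each label keeps the same share in the train and test subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 4 labels with 25 rows each (not the bearing data)
toy = pd.DataFrame({
    "feature": range(100),
    "fault": ["A", "B", "C", "D"] * 25,
})

toy_train, toy_test = train_test_split(
    toy, test_size=20, stratify=toy["fault"], random_state=0
)

# Per-label counts in each subset
print(toy_train["fault"].value_counts().sort_index())
print(toy_test["fault"].value_counts().sort_index())
```

An integer `test_size` is read as an absolute number of rows, so the 20 test rows here are shared equally among the four balanced labels.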

## Scale features in training set

In [None]:
# Scale each column to have zero mean and standard deviation equal to 1
scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data.iloc[:,:-1]) # Skip last column 'fault'
pd.DataFrame(train_data_scaled).describe()

In [None]:
test_data_scaled = (test_data.iloc[:,:-1].values - scaler.mean_)/np.sqrt(scaler.var_)
pd.DataFrame(test_data_scaled).describe()
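
The test set is scaled with the mean and variance learned from the training set, never with its own statistics. The sketch below, on synthetic data (all array and variable names are assumptions for illustration), suggests that the manual formula gives the same result as calling `transform` on the fitted scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(50, 3))  # synthetic training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(10, 3))   # synthetic test features

toy_scaler = StandardScaler().fit(X_train)

# Manual scaling with the training statistics
manual = (X_test - toy_scaler.mean_) / np.sqrt(toy_scaler.var_)
# Same operation through the scaler API
via_transform = toy_scaler.transform(X_test)

print(np.allclose(manual, via_transform))
```

One practical difference: `StandardScaler` replaces a zero standard deviation by 1 internally, so `transform` avoids a division by zero for constant columns, whereas the manual formula does not.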

## Train a model using Random Forest Classifier
Instantiate the `RandomForestClassifier` model from scikit-learn and fit it to the training data.
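
For orientation only, here is a minimal sketch of the general fit pattern on synthetic data generated with `make_classification`. The names `X_toy`, `y_toy` and `toy_model` are hypothetical, and the hyperparameters are smaller than those used in this exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem (not the bearing data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

toy_model = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
toy_model.fit(X_toy, y_toy)  # features first, labels second

print(toy_model.classes_)             # class labels seen during fit, in sorted order
print(toy_model.score(X_toy, y_toy))  # accuracy on the training data
```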

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Initialize the model
rf_model = RandomForestClassifier(n_estimators= 300, max_features = "sqrt", n_jobs = -1, random_state = 38)

# Train the model
# rf_model.fit(...)

## Model Evaluation
Now get predictions from the model, compute the confusion matrix and produce a classification report.
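
As a sketch of how these three metric functions are called, the example below uses a handful of hypothetical labels rather than model output. In the exercise, the predicted labels would come from the fitted model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels, not taken from the bearing dataset
y_true = ["Ball", "Ball", "IR", "IR", "Normal", "Normal"]
y_pred = ["Ball", "IR", "IR", "IR", "Normal", "Normal"]

# Rows are true labels, columns are predicted labels, both in sorted label order
toy_cm = confusion_matrix(y_true, y_pred)
toy_accuracy = accuracy_score(y_true, y_pred)

print(toy_cm)
print("Accuracy:", toy_accuracy)
print(classification_report(y_true, y_pred))
```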

In [None]:
from sklearn.metrics import confusion_matrix,classification_report, accuracy_score

In [None]:
# Training data prediction
# ...
# Testing data prediction
# ...

Plot the confusion matrices:

In [None]:
# Confusion matrix for training data 
# train_confu_matrix = ...

# Confusion matrix for test data 
# test_confu_matrix = ...

In [None]:
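# Sort the labels so the tick labels match the row/column order used by confusion_matrix
fault_type = np.sort(data_time['fault'].unique())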

plt.figure(1,figsize=(18,8))

plt.subplot(121)
sns.heatmap(train_confu_matrix, annot= True,fmt = "d",
xticklabels=fault_type, yticklabels=fault_type, cmap = "Blues", cbar = False)
plt.title('Training Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.subplot(122)
sns.heatmap(test_confu_matrix, annot = True, fmt = "d",
xticklabels=fault_type, yticklabels=fault_type, cmap = "Blues", cbar = False)
plt.title('Test Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')

plt.show()

In [None]:
# Model Accuracy, how often is the classifier correct?
#print("Accuracy:", accuracy_score(...))

In [None]:
# Classification report (test set)
#class_report = classification_report(...)
print(class_report)

- **recall**    = for each fault type, the fraction of samples that actually belong to it and were classified correctly = `TP / (TP + sum(FN))`
- **precision** = for each fault type, the fraction of samples predicted as that type that really belong to it = `TP / (TP + sum(FP))`
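
The two formulas above can be checked directly on a confusion matrix. In this sketch, with hypothetical integer labels, the diagonal holds the true positives, each row sum equals `TP + sum(FN)` and each column sum equals `TP + sum(FP)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted
tp = np.diag(cm)

recall = tp / cm.sum(axis=1)     # row sums = TP + FN
precision = tp / cm.sum(axis=0)  # column sums = TP + FP

# Compare with the per-class values computed by scikit-learn
print(np.allclose(recall, recall_score(y_true, y_pred, average=None)))
print(np.allclose(precision, precision_score(y_true, y_pred, average=None)))
```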

Refer to [Understanding Data Science Classification Metrics in Scikit-Learn in Python](https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019) for an explanation of these metrics.

## Feature importance

In [None]:
# Obtain feature importance
# ...
# Check that the importances sum to 1
# ...

In [None]:
# Retrieve features' names
#features = ...

# And count them
# num_features = ...

# Sort features by descending importance
indices = np.argsort(feature_importance)[::-1]

# Reorder the feature names by descending importance
features_sorted = np.asarray(features)[indices]

In [None]:
# Bar plot of feature importance (descending order)
#   Place the code below
# ...
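
The feature-importance steps above can be sketched on synthetic data as follows. The model, data and feature names (`f0`, `f1`, ...) are hypothetical. A fitted random forest exposes the attribute `feature_importances_`, whose values are normalized to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem with 6 features (not the bearing data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_toy.shape[1])])  # hypothetical names

toy_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

importances = toy_model.feature_importances_
print(np.isclose(importances.sum(), 1.0))

# Indices that sort the importances from highest to lowest
order = np.argsort(importances)[::-1]
for name, value in zip(feature_names[order], importances[order]):
    print(f"{name}: {value:.3f}")
```

The sorted names and values are in the form a bar plot needs: one label and one bar height per feature.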