## Libraries and Exploratory Analysis
Import the libraries and load the dataset used in the analysis.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

filepath = "C:/Users/WALDMJN/OneDrive - Schaeffler/Uni/Data Exploration Project/Pred Maintenance Project/Predictive-Maintenance/Data/predictive_maintenance.csv"
df = pd.read_csv(filepath)
df.sample(10)

In [None]:
df.apply(lambda x: x.nunique())

#### Additional Insights:

- UDI and Product ID are unique for every row, so they can be removed.
- Target is already encoded in Failure Type.
- 10,000 rows, no missing values.
- 6 distinct Failure Type values: No Failure (1) plus 5 failure varieties.
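
The uniqueness observation can be checked programmatically. The following is a minimal sketch on a small stand-in frame; the name `toy` and its values are assumptions for illustration, not the project data.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "UDI": [1, 2, 3, 4],
    "Product ID": ["M14860", "L47181", "L47182", "H29424"],
    "Type": ["M", "L", "L", "H"],
})

# A column with as many distinct values as rows behaves like a pure identifier
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)
```

Applied to `df`, the same comprehension would list the columns that are candidates for removal.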

In [None]:
df['Failure Type'].value_counts()

#### Descriptive Statistics:

- Air Temperature: Ranges from 295.3 K to 304.5 K, with a mean of 300.0 K.
- Process Temperature: Ranges from 305.7 K to 313.8 K, with a mean of 310.0 K.
- Rotational Speed: Ranges from 1168 to 2886 rpm, with a mean of 1538.8 rpm.
- Torque: Ranges from 3.8 to 76.6 Nm, with a mean of 40.0 Nm.
- Tool Wear: Ranges from 0 to 253 minutes, with a mean of 108 minutes.
- Target: Majority (96.6%) are labeled 0 (No Failure), and 3.4% are labeled 1 (Failure).
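
The class shares quoted for Target can be derived with `value_counts(normalize=True)`. A short sketch on a made-up series (the 29:1 split is an assumption chosen only to mimic the imbalance, not the real counts):

```python
import pandas as pd

# Hypothetical, strongly imbalanced label column
target = pd.Series([0] * 29 + [1], name="Target")

# Relative frequency of each class, expressed in percent
share = target.value_counts(normalize=True).mul(100).round(1)
print(share)
```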

#### 1.1 ID Columns

In [None]:
# Strip the leading letter so the rest of the ID can be parsed as a number
df['Product ID'] = df['Product ID'].apply(lambda x: x[1:])
df['Product ID'] = pd.to_numeric(df['Product ID'])

# Histogram of Product ID, colored by Type
sns.histplot(data=df, x='Product ID', hue='Type')
plt.show()

UDI appears to be an index number, while Product ID serves as an identification number. Therefore, both columns can be omitted.

In [None]:
df = df.drop(["UDI", "Product ID"], axis = 1)
df.head()

Check for missing values:

In [None]:
df.isna().sum()

#### 1.2 Target anomalies

Inconsistencies between "Target" and "Failure Type":

In [None]:
fail_df = df[df['Target'] == 1]
fail_df['Failure Type'].value_counts()

In [None]:
fail_df[fail_df['Failure Type'] == 'No Failure']

9 rows are labeled "1" in the Target column even though their Failure Type is "No Failure". Because the two labels contradict each other, these 9 entries are deleted before they can distort the analysis.

In [None]:
indexPossibleFailure = fail_df[fail_df['Failure Type'] == 'No Failure'].index
df.drop(indexPossibleFailure, axis=0, inplace=True)

In [None]:
df.shape[0]

In [None]:
fail_df = df[df['Target'] == 0]
fail_df['Failure Type'].value_counts()

In [None]:
fail_df[fail_df['Failure Type'] == 'Random Failures']

The same holds in the other direction: 18 rows have the Failure Type "Random Failures" even though their Target is "0".

In [None]:
indexPossibleFailure = fail_df[fail_df['Failure Type'] == 'Random Failures'].index
df.drop(indexPossibleFailure, axis=0, inplace=True)
df.shape[0]

27 instances were removed (0.27% of the entire dataset), of which:

- 9 were labeled Failure in 'Target' but No Failure in 'Failure Type'
- 18 were labeled No Failure in 'Target' but Random Failures in 'Failure Type'
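
Both kinds of mismatch can also be located in a single pass. The sketch below uses a hypothetical toy frame and assumes that a row is inconsistent whenever exactly one of the two labels indicates a failure; it is an alternative to the two-step filtering above, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in with one mismatch in each direction (illustrative values only)
toy = pd.DataFrame({
    "Target": [0, 0, 0, 1, 1, 1],
    "Failure Type": ["No Failure", "No Failure", "Random Failures",
                     "Power Failure", "No Failure", "Tool Wear Failure"],
})

# Overview of every Target / Failure Type combination
print(pd.crosstab(toy["Target"], toy["Failure Type"]))

# Inconsistent: exactly one of the two labels says "failure"
type_says_failure = toy["Failure Type"] != "No Failure"
inconsistent = toy[(toy["Target"] == 1) != type_says_failure]

# Drop the contradictory rows
clean = toy.drop(inconsistent.index)
print(len(inconsistent), len(clean))
```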

#### 1.3 Outliers inspection

In [None]:
df.describe()

The maximum of Rotational Speed and Torque lies far above the third quartile, which suggests that both features contain outliers. To make this more concrete, we take a closer look with boxplots and use histograms to understand the distributions.

In [None]:
df['Tool wear [min]'] = df['Tool wear [min]'].astype('float64')
df['Rotational speed [rpm]'] = df['Rotational speed [rpm]'].astype('float64')

features = [col for col in df.columns
            if df[col].dtype=='float64' or col =='Type']

num_features = [feature for feature in features  if df[feature].dtype=='float64']


fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18,7))
fig.suptitle('Numeric features histogram')
for j, feature in enumerate(num_features):
    sns.histplot(ax=axs[j // 3, j % 3], data=df, x=feature)
plt.show()


fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(18,7))
fig.suptitle('Numeric features boxplot')
for j, feature in enumerate(num_features):
    sns.boxplot(ax=axs[j // 3, j % 3], data=df, x=feature)
plt.show()

The boxplots highlight possible outliers in the features mentioned above. For Torque, however, the flagged points are most likely an artifact of the boxplot detection rule rather than genuine anomalies. For Rotational Speed, the distribution is right-skewed, and it is plausible that the few observations with very high Rotational Speed are the ones likely to fail.
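
Boxplots drawn with default settings flag points that lie more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule follows; the helper `iqr_outlier_mask` and the sample speeds are hypothetical and serve only to show how a single extreme value in a skewed feature gets flagged.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up rotational speeds with one very high reading
speed = pd.Series([1400, 1450, 1500, 1520, 1550, 1600, 1650, 2800])
mask = iqr_outlier_mask(speed)
print(speed[mask].tolist())
```

Counting `mask.sum()` per feature would give a rough measure of how many points the boxplots mark, without deciding whether they should be removed.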

#### 1.4 Feature Engineering

We add three further features, Power [W], Overstrain [minNm], and Heat dissipation [rpminK], using the following formulas:

- Power [W] = Torque [Nm] * ((2 * pi * Rotational speed [rpm]) / 60)
- Overstrain [minNm] = Torque [Nm] * Tool wear [min]
- Heat dissipation [rpminK] = abs(Air temperature [K] - Process temperature [K]) * Rotational speed [rpm]

In [None]:
df['Power [W]'] = df['Torque [Nm]'] * (2 * np.pi * df['Rotational speed [rpm]'] / 60.0)
df['Overstrain [minNm]'] = df['Torque [Nm]'] * df['Tool wear [min]']
df['Heat dissipation [rpminK]'] = abs(df['Air temperature [K]'] - df['Process temperature [K]']) * df['Rotational speed [rpm]']

df.head(5)
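
As a sanity check, the three formulas can be evaluated by hand for a single operating point. The input values below are made up for illustration and are not taken from the dataset.

```python
import math

# Hypothetical single operating point
torque_nm = 40.0
speed_rpm = 1500.0
tool_wear_min = 100.0
air_temp_k = 300.0
process_temp_k = 310.0

# Power: torque times angular velocity (rpm converted to rad/s)
power_w = torque_nm * (2 * math.pi * speed_rpm / 60)

# Overstrain: torque times accumulated tool wear
overstrain_minnm = torque_nm * tool_wear_min

# Heat dissipation: absolute temperature difference times rotational speed
heat_dissipation = abs(air_temp_k - process_temp_k) * speed_rpm

print(round(power_w, 2), overstrain_minnm, heat_dissipation)
```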

In [None]:
df[['Power [W]', 'Overstrain [minNm]', 'Heat dissipation [rpminK]']].describe()

In [None]:
df[['Power [W]', 'Overstrain [minNm]', 'Heat dissipation [rpminK]']].plot.box(subplots=True,
                                                                                      figsize=(15,5))
plt.suptitle('Machine Failure boxplot')
plt.show()

- Machine fails if power < 3500 W or power > 9000 W (the outlier values get a power failure).
- Machine fails if overstrain > 11,000 minNm (the outlier values get an overstrain failure).
- Machine fails if heat dissipation < 11,868 rpminK (the inlier values get a heat dissipation failure).
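
These thresholds can be turned into boolean rule flags. The sketch below runs on a hypothetical toy frame and reads the power rule as a failure outside the 3500 W to 9000 W band, as the remark about outlier values suggests; the column `Any rule triggered` is an illustrative name and not part of the dataset.

```python
import pandas as pd

# Toy values for the engineered features (illustrative only)
toy = pd.DataFrame({
    "Power [W]": [2500.0, 6000.0, 9500.0, 7000.0, 6000.0],
    "Overstrain [minNm]": [5000.0, 12000.0, 3000.0, 8000.0, 5000.0],
    "Heat dissipation [rpminK]": [15000.0, 14000.0, 13000.0, 11000.0, 15000.0],
})

# One flag per rule, using the thresholds stated in the text
power_risk = (toy["Power [W]"] < 3500) | (toy["Power [W]"] > 9000)
overstrain_risk = toy["Overstrain [minNm]"] > 11000
heat_risk = toy["Heat dissipation [rpminK]"] < 11868

toy["Any rule triggered"] = power_risk | overstrain_risk | heat_risk
print(toy)
```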

In [None]:
# Keep rows with tool wear between 150 and 300 min
filtered_df = df[(df['Tool wear [min]'] >= 150) & (df['Tool wear [min]'] <= 300)]

# Tool wear failures within that range, counted per 10-minute interval
twf_df = filtered_df[filtered_df['Failure Type'] == 'Tool Wear Failure']
bins = np.arange(150, 310, 10)
failure_count = twf_df.groupby(
    pd.cut(twf_df['Tool wear [min]'], bins=bins, right=False), observed=True).size()

most_failures_interval = failure_count.idxmax()
most_failures_count = failure_count.max()
total_failures_count = twf_df.shape[0]


plt.figure(figsize=(12, 8))
plt.hist(twf_df['Tool wear [min]'], bins=bins, color='red', alpha=0.7, label='Tool Wear Failures')
plt.axvline(twf_df['Tool wear [min]'].mean(), color='blue', linestyle='dashed', linewidth=2, label='Mean Tool Wear Time')
plt.axvline(twf_df['Tool wear [min]'].median(), color='green', linestyle='dashed', linewidth=2, label='Median Tool Wear Time')
plt.xlabel('Tool wear time [min]')
plt.ylabel('Number of Failures')
plt.title('Distribution of Tool Wear Failures (Filtered Data)')
plt.legend()
plt.grid(True)
plt.show()

print("Interval with the most failures:", most_failures_interval)
print("Number of failures in the most frequent interval:", most_failures_count)
print("Total number of tool wear failures:", total_failures_count)


- Machine fails if 190 min < tool wear < 260 min (the inlier values get a tool wear failure).

#### 1.5 Relation plotting

In [None]:
print(df.columns)

##### Rotational Speed and Torque

In [None]:
plt.figure(figsize=(12,5))
sns.scatterplot(x='Rotational speed [rpm]', y='Torque [Nm]', hue='Target', alpha=0.85, data=df, palette='inferno', s = 50)
plt.show()

##### Torque and Tool wear

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='Torque [Nm]', y='Tool wear [min]', hue='Target', alpha=0.85, data=df, palette='inferno', s = 50)
plt.show()

##### Torque and Process temperature

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='Torque [Nm]', y='Process temperature [K]', hue='Target', alpha=0.85, data=df, palette='inferno', s = 50)
plt.show()

##### Rotational speed and Air temperature

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='Rotational speed [rpm]', y='Air temperature [K]', hue='Target', alpha=0.85, data=df, palette='inferno', s = 70)
plt.show()

##### Rotational speed and tool wear

In [None]:
plt.figure(figsize=(7,7))
sns.scatterplot(x='Rotational speed [rpm]', y='Tool wear [min]', hue='Target', alpha=0.85, data=df, palette='inferno', s = 70)
plt.show()

##### Process temperature and Air temperature

In [None]:
plt.figure(figsize=(15,10))
sns.scatterplot(x='Process temperature [K]', y='Air temperature [K]', hue='Target', alpha=0.90, data=df,s = 100, palette='inferno')
plt.show()

- Torque and rotational speed are highly correlated.
- Process temperature and air temperature are also highly correlated.
- This confirms the earlier assumption that torque and rotational speed play a significant role in identifying failures.
- There is a range of normal conditions in which the machines operate.
- Machines tend to fail when operating above or below this normal range.

## Correlation Analysis

In [None]:
numeric_df = df.select_dtypes(include=[np.number])

corr_matrix = numeric_df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='inferno', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()

There is a strong correlation between process temperature and air temperature, as well as between rotational speed and torque.
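
The strongest relationships can also be read off the matrix programmatically instead of from the heatmap. The following sketch ranks feature pairs by absolute correlation; it runs on synthetic data whose dependencies are built in by assumption (torque falling with speed, process temperature tracking air temperature), so the numbers are illustrative and are not results from the dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in data with two built-in dependencies (assumed, not measured)
rng = np.random.default_rng(0)
n = 500
speed = rng.normal(1500, 150, n)
air = rng.normal(300, 2, n)
toy = pd.DataFrame({
    "Air temperature [K]": air,
    "Process temperature [K]": air + 10 + rng.normal(0, 0.5, n),
    "Rotational speed [rpm]": speed,
    "Torque [Nm]": 115 - 0.05 * speed + rng.normal(0, 2, n),
    "Tool wear [min]": rng.uniform(0, 250, n),
})

corr = toy.corr()

# Visit each unordered pair once and sort by absolute correlation
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs[:3]:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Sorting by absolute value keeps strongly negative relationships near the top of the list, which a ranking by raw value would push to the bottom.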