Project: prediction of the mechanical properties using the alloy composition and temperature.



Conclusion: random forest regression showed the best performance for this task. Data scaling, outlier removal and skewness correction noticeably improved the performance.

Methods: machine learning and deep learning

Dataset: "Mechanical properties of low alloy steels" from Kaggle. It contains alloy composition, temperature and mechanical properties.

Context: currently there are no precise theoretical methods to predict the mechanical properties of steels. All the available methods are backed by statistics and extensive physical testing of the materials. Since testing every material with a different composition is a highly tedious task (imagine the number of possibilities!), let's apply our knowledge of machine learning and statistics to solve this problem.

Content: this dataset contains the compositions, in weight percent, of low-alloy steels, along with the temperatures at which the steels were tested and the values of the mechanical properties observed during the tests. The alloy code is a string unique to each alloy. Weight percentages of alloying metals and impurities such as aluminum, copper, manganese, nitrogen, nickel, cobalt, carbon, etc. are given in columns. The temperature in degrees Celsius for each test is given in a column. Lastly, the mechanical properties, including tensile strength, yield strength, elongation and reduction in area, are given in separate columns. The dataset contains 915 rows.

Link to the dataset:

https://www.kaggle.com/rohannemade/mechanical-properties-of-low-alloy-steels

STEP 1: Learning the dataset and feature engineering

In [None]:
# importing libraries
import numpy as np
import pandas as pd

In [None]:
# importing the dataset
df = pd.read_csv("../input/mechanical-properties-of-low-alloy-steels/MatNavi Mechanical properties of low-alloy steels.csv")

Exploring the dataset and performing feature engineering

In [None]:
# showing the first five rows of the dataset
df.head()

In [None]:
# showing the column names
list(df.columns)

In [None]:
# removing the leading space from the column names
column_map = {' C': 'C',
              ' Si': 'Si',
              ' Mn': 'Mn',
              ' P': 'P',
              ' S': 'S',
              ' Ni': 'Ni',
              ' Cr': 'Cr',
              ' Mo': 'Mo',
              ' Cu': 'Cu',
              ' Al': 'Al',
              ' N': 'N',
              ' Temperature (°C)': 'Temperature (°C)',
              ' 0.2% Proof Stress (MPa)': '0.2% Proof Stress (MPa)',
              ' Tensile Strength (MPa)': 'Tensile Strength (MPa)',
              ' Elongation (%)': 'Elongation (%)',
              ' Reduction in Area (%)': 'Reduction in Area (%)'}
df.rename(columns=column_map, inplace=True)
list(df.columns)
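
As a possible shortcut, the same clean-up can be sketched with the pandas string accessor instead of an explicit mapping. The toy frame below is hypothetical and only imitates the leading spaces in the real headers; it is an illustration, not the cell used in this notebook.

```python
import pandas as pd

# Toy frame whose headers imitate the leading spaces in the CSV (values are made up).
demo = pd.DataFrame({" C": [0.12], " Si": [0.36], "V": [0.0], " Temperature (°C)": [27]})

# Strip leading and trailing whitespace from every column name in one step.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

Unlike the explicit mapping, this strips whitespace from every column, so it should only be used when no header is meant to keep its surrounding spaces.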

In [None]:
# showing the column names
list(df.columns)

In [None]:
# showing column data types and non-null counts
df.info()

In [None]:
# showing summary statistics of the dataset
df.describe()

In [None]:
# removing the Alloy code column because it is an identifier, not a feature
df.drop('Alloy code', axis='columns', inplace=True)

In [None]:
df.head()

Visualize data columns

Explore distribution, skewness, outliers and other statistical properties

In [None]:
# plotting C variable
boxplot = df.boxplot(column=["C"]);

In [None]:
# plotting Si variable
boxplot = df.boxplot(column=["Si"])

In [None]:
# plotting Mn variable
boxplot = df.boxplot(column=["Mn"]);

In [None]:
# plotting S variable
boxplot = df.boxplot(column=["S"]);

In [None]:
# plotting P variable
boxplot = df.boxplot(column=["P"]);

In [None]:
# plotting Ni variable
boxplot = df.boxplot(column=["Ni"]);

In [None]:
# plotting non-zero Ni variable
df_Ni = df["Ni"]
df_Ni_nonzero = df_Ni[df_Ni != 0]
df_Ni_nonzero.to_frame().boxplot(column=["Ni"])

In [None]:
# plotting Cr variable
boxplot = df.boxplot(column=["Cr"]);

In [None]:
# plotting Mo variable
boxplot = df.boxplot(column=["Mo"]);

In [None]:
# plotting Cu variable
boxplot = df.boxplot(column=["Cu"]);

In [None]:
# calculating skewness of Cu variable
cu_skewness = df["Cu"].skew()
cu_skewness

In [None]:
# trying log transformation to improve the skewness
log_cu_skewness = np.log(df["Cu"]).skew()
log_cu_skewness

In [None]:
# trying square root transformation to improve the skewness
sqrt_cu_skewness = np.sqrt(df["Cu"]).skew()
sqrt_cu_skewness

The square root transformation gave a low skewness, so Cu is replaced with sqrt(Cu).
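
The comparison of transformations can be sketched as one small helper. The name `compare_skewness` and the synthetic log-normal sample are assumptions made for illustration; they are not part of the notebook. The helper returns NaN for the log entry when the column contains zeros, because the logarithm is not finite there, which is one reason a square root can be the more practical choice for a column with zero entries.

```python
import numpy as np
import pandas as pd

def compare_skewness(values: pd.Series) -> pd.Series:
    """Return the sample skewness of the raw, square-root and log-transformed values.

    The log entry is NaN when the series contains zeros or negative values.
    """
    raw = values.skew()
    sqrt = np.sqrt(values).skew() if (values >= 0).all() else np.nan
    log = np.log(values).skew() if (values > 0).all() else np.nan
    return pd.Series({"raw": raw, "sqrt": sqrt, "log": log})

# Synthetic, strictly positive, right-skewed sample (not the steel data).
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
skews = compare_skewness(sample)
print(skews)

# A series with a zero entry: the log transformation is skipped.
with_zero = compare_skewness(pd.Series([0.0, 1.0, 2.0, 10.0]))
print(with_zero)
```

The transformation whose skewness is closest to zero is the natural candidate for replacing the original column.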

In [None]:
# creating sqrt(Cu) column
df["sqrt(Cu)"] = np.sqrt(df['Cu'])

In [None]:
# plotting sqrt(Cu) variable
boxplot = df.boxplot(column=["sqrt(Cu)"]);

In [None]:
# removing Cu column
df.drop(["Cu"], axis=1, inplace=True)
df.head()

In [None]:
# plotting V variable
boxplot = df.boxplot(column=["V"]);

In [None]:
# plotting non-zero V variable
df_V = df["V"]
df_V_nonzero = df_V[df_V != 0]
df_V_nonzero.to_frame().boxplot(column=["V"])

In [None]:
# plotting Al variable
boxplot = df.boxplot(column=["Al"]);

In [None]:
# plotting Al values by row index
df["Al"].plot(); # There are no zeros in the Al variable

In [None]:
# calculating skewness of Al variable
al_skewness = df["Al"].skew()
al_skewness

Al variable is highly skewed

In [None]:
# trying log transformation to improve the skewness
log_al_skewness = np.log(df["Al"]).skew()
log_al_skewness

In [None]:
# trying square root transformation to improve the skewness
sqrt_al_skewness = np.sqrt(df["Al"]).skew()
sqrt_al_skewness

The log transformation gave better results than the square root transformation

In [None]:
# creating log(Al) column
df["log(Al)"] = np.log(df['Al'])

In [None]:
# plotting log(Al) variable
boxplot = df.boxplot(column=["log(Al)"]);

In [None]:
# removing Al column
df.drop(["Al"], axis=1, inplace=True)
df.head()

In [None]:
# plotting N variable
boxplot = df.boxplot(column=["N"]);

In [None]:
# plotting Ceq variable
boxplot = df.boxplot(column=["Ceq"]);

In [None]:
# plotting Ceq non-zero variable
df_Ceq = df["Ceq"]
df_Ceq_nonzero = df_Ceq[df_Ceq != 0]
df_Ceq_nonzero.to_frame().boxplot(column=["Ceq"])

In [None]:
# plotting Nb + Ta variable
ax = df["Nb + Ta"].value_counts().sort_index().plot.bar(xlabel="Nb + Ta", ylabel="Frequency", figsize=(2,6), rot=45);

In [None]:
# plotting Temperature (°C) variable
boxplot = df.boxplot(column=["Temperature (°C)"]);

In [None]:
# plotting 0.2% Proof Stress (MPa) variable
boxplot = df.boxplot(column=["0.2% Proof Stress (MPa)"]);

In [None]:
# plotting Tensile Strength (MPa) variable
boxplot = df.boxplot(column=["Tensile Strength (MPa)"]);

Tensile Strength (MPa) column has an outlier. Let's detect and remove it.

In [None]:
# detecting the outlier
df.loc[df["Tensile Strength (MPa)"] > 1000]

In [None]:
# deleting the outlier (row index 626, as found in the previous cell)
df.drop(626, inplace=True)
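
A hard-coded row index breaks as soon as the file is re-sorted or filtered. As a sketch of an alternative, the row can be removed with the same condition used for detection. The toy frame and the `threshold` name below are hypothetical.

```python
import pandas as pd

# Toy data with one implausibly large value (made-up numbers, arbitrary index labels).
demo = pd.DataFrame({"Tensile Strength (MPa)": [450, 520, 6000, 610]},
                    index=[10, 11, 12, 13])

threshold = 1000  # same cut-off as the detection condition
mask = demo["Tensile Strength (MPa)"] > threshold

# Keep only the rows that do not exceed the threshold.
cleaned = demo.loc[~mask]
print(cleaned)
```

This removes every row above the cut-off, so it is worth checking how many rows the mask selects before dropping them.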

In [None]:
# plotting Tensile Strength (MPa) variable
boxplot = df.boxplot(column=["Tensile Strength (MPa)"]);

In [None]:
# plotting Elongation (%) variable
boxplot = df.boxplot(column=["Elongation (%)"]);

In [None]:
# calculating skewness of Elongation (%) variable
elongation_skewness = df["Elongation (%)"].skew()
elongation_skewness

Elongation (%) variable is highly skewed

In [None]:
# trying log transformation to improve the skewness
log_el_skewness = np.log(df["Elongation (%)"]).skew()
log_el_skewness

Let's replace Elongation (%) variable with log(Elongation (%)) to fix the skewness

In [None]:
# creating log(Elongation (%)) column
df["log(Elongation (%))"] = np.log(df['Elongation (%)'])

In [None]:
# plotting log(Elongation (%)) variable
boxplot = df.boxplot(column=["log(Elongation (%))"]);

In [None]:
# removing Elongation column
df.drop(["Elongation (%)"], axis=1, inplace=True)
df.head()

In [None]:
# plotting Reduction in Area (%) variable
boxplot = df.boxplot(column=["Reduction in Area (%)"]);

In [None]:
# calculating skewness of Reduction in Area (%) variable
Reduction_skewness = df["Reduction in Area (%)"].skew()
Reduction_skewness

Reduction in Area (%) variable is only moderately skewed, so it is left untransformed

In [None]:
# reordering the dataframe columns in original order
df = df[['C',
 'Si',
 'Mn',
 'P',
 'S',
 'Ni',
 'Cr',
 'Mo',
 'sqrt(Cu)',
 'V',
 'log(Al)',
 'N',
 'Ceq',
 'Nb + Ta',
 'Temperature (°C)',
 '0.2% Proof Stress (MPa)',
 'Tensile Strength (MPa)',
 'log(Elongation (%))',
 'Reduction in Area (%)']]

In [None]:
df.head()

In [None]:
# importing libraries for heatmap building 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# calculating a correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

In [None]:
# drawing a heatmap
plt.figure(figsize = (20,20))
ax = sns.heatmap(corr_matrix, annot=True, square=True, cmap='Blues')
plt.show()
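
A large heatmap can be hard to read for a single target. One way to complement it, sketched here on synthetic data, is to rank the features by the absolute value of their correlation with that target. The linear relationship used to generate the toy tensile strength is invented for the example and says nothing about real steels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
temperature = rng.uniform(20, 650, n)
carbon = rng.uniform(0.1, 0.4, n)
noise = rng.normal(0, 5, n)

# Invented relationship, used only to have something to correlate.
demo = pd.DataFrame({
    "Temperature (°C)": temperature,
    "C": carbon,
    "Tensile Strength (MPa)": 700 - 0.5 * temperature + 100 * carbon + noise,
})

target = "Tensile Strength (MPa)"
# Take the target's column of the correlation matrix, drop the self-correlation,
# and sort the features by absolute correlation.
ranking = demo.corr()[target].drop(target).abs().sort_values(ascending=False)
print(ranking)
```

Pearson correlation only captures linear association, so a low rank does not prove that a feature is irrelevant to a non-linear model such as a random forest.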

STEP 2: Choosing the best-performing machine learning model

Multiple linear regression for 0.2% Proof Stress (MPa)

In [None]:
# defining variables
X = df.iloc[:, :-4].values
y = df.iloc[:, -4].values

In [None]:
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# training the multiple Linear regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=0)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
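
For reference, the R squared score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand with NumPy; the helper name `r_squared` and the numbers are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([300.0, 350.0, 400.0, 450.0])

perfect = r_squared(y_true, y_true)                                # exact predictions
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
reversed_order = r_squared(y_true, y_true[::-1])                   # worse than the mean
print(perfect, baseline, reversed_order)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.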

Decision tree regression for 0.2% Proof Stress (MPa)

In [None]:
# training the decision tree regression model on the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=0)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

Random forest regression for 0.2% Proof Stress (MPa)

In [None]:
# training the random forest regression model on the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)

In [None]:
# predicting the test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=0)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
score_1 = r2_score(y_test, y_pred)
score_1

Discussion: random forest seems to be a suitable model for this type of data.
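
A single train/test split can favour one model by chance. As a hedged sketch, the three models can be compared with 5-fold cross-validation inside a scaling pipeline. The data below come from `make_regression`, not from the steel dataset, and `n_estimators=50` is an arbitrary choice to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 15 features (the real notebook would pass X and y).
X_demo, y_demo = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=42)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # The scaler is refitted on the training part of every fold, so no test data leaks in.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()

for name, score in cv_scores.items():
    print(f"{name}: mean R squared = {score:.3f}")
```

On this synthetic, purely linear data the linear model is expected to win; the ranking on the steel data can differ, so only the procedure carries over.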

Random forest regression for Tensile Strength (MPa)

In [None]:
# defining variables
X = df.iloc[:, :-4].values
z = df.iloc[:, -3].values

In [None]:
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, z_train, z_test = train_test_split(X, z, test_size = 0.20, random_state=42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# training the random forest regression model on the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train, z_train)

In [None]:
# predicting the test set results
z_pred = regressor.predict(X_test)
np.set_printoptions(precision=0)
print(np.concatenate((z_pred.reshape(len(z_pred),1), z_test.reshape(len(z_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
score_2 = r2_score(z_test, z_pred)
score_2

Random forest regression for log(Elongation (%))

In [None]:
# defining variables
X = df.iloc[:, :-4].values
v = df.iloc[:, -2].values

In [None]:
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, v_train, v_test = train_test_split(X, v, test_size = 0.20, random_state=42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# training the random forest regression model on the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train, v_train)

In [None]:
# predicting the test set results
v_pred = regressor.predict(X_test)
np.set_printoptions(precision=1)
print(np.concatenate((v_pred.reshape(len(v_pred),1), v_test.reshape(len(v_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
score_3 = r2_score(v_test, v_pred)
score_3
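
Because this target is the logarithm of the elongation, the predictions and the score above are on the log scale. To report values in percent elongation, the predictions can be transformed back with the exponential. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up elongation values in percent and their log-scale counterparts.
elongation_true = np.array([12.0, 18.0, 25.0, 40.0])
log_true = np.log(elongation_true)

# Pretend log-scale predictions with small invented errors.
log_pred = log_true + np.array([0.05, -0.02, 0.01, -0.04])

# Back-transform to percent elongation.
elongation_pred = np.exp(log_pred)

# Absolute error in percentage points on the original scale.
abs_error = np.abs(elongation_pred - elongation_true)
print(elongation_pred)
print(abs_error)
```

An R squared score computed on the back-transformed values will generally differ from the log-scale score, so it helps to state which scale a reported score refers to.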

Random forest regression for Reduction in Area (%)

In [None]:
# defining variables
X = df.iloc[:, :-4].values
w = df.iloc[:, -1].values

In [None]:
# splitting the dataset into the training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, w_train, w_test = train_test_split(X, w, test_size = 0.20, random_state=42)

In [None]:
# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# training the random forest regression model on the training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()
regressor.fit(X_train, w_train)

In [None]:
# predicting the test set results
w_pred = regressor.predict(X_test)
np.set_printoptions(precision=1)
print(np.concatenate((w_pred.reshape(len(w_pred),1), w_test.reshape(len(w_test),1)),1))

In [None]:
# test scoring
from sklearn.metrics import r2_score
score_4 = r2_score(w_test, w_pred)
score_4

R squared score for random forest with default parameters:

In [None]:
print(f"0.2% Proof Stress (MPa) score is {score_1:.3f}")
print(f"Tensile Strength (MPa) score is {score_2:.3f}")
print(f"log(Elongation (%)) score is {score_3:.3f}")
print(f"Reduction in Area (%) score is {score_4:.3f}")

Results discussion: random forest regression showed high predictive performance for both strength properties and reasonably high performance for the deformation properties (elongation and reduction in area).
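
The four random forest sections repeat the same steps with a different target column. A possible way to condense them is a loop over the last four columns, sketched below. The helper `score_forest_per_target`, the synthetic frame and `n_estimators=50` are assumptions for illustration. Scaling is left out of the sketch because tree-based models split on thresholds and are insensitive to monotonic rescaling of the features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def score_forest_per_target(frame, n_targets=4, seed=42):
    """Fit one random forest per target column and return the test R squared scores.

    The last n_targets columns of the frame are treated as targets,
    all columns before them as features.
    """
    X = frame.iloc[:, :-n_targets].values
    scores = {}
    for target in frame.columns[-n_targets:]:
        y = frame[target].values
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.20, random_state=seed)
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X_train, y_train)
        scores[target] = r2_score(y_test, model.predict(X_test))
    return scores

# Synthetic frame: three features followed by four targets (not the steel data).
rng = np.random.default_rng(0)
features = rng.normal(size=(150, 3))
demo = pd.DataFrame(features, columns=["f1", "f2", "f3"])
for i in range(4):
    demo[f"t{i + 1}"] = features @ rng.normal(size=3) + rng.normal(scale=0.1, size=150)

forest_scores = score_forest_per_target(demo)
for target, score in forest_scores.items():
    print(f"{target} score is {score:.3f}")
```

With the notebook's frame, the call would be `score_forest_per_target(df)`, given that the four mechanical properties are the last four columns.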

STEP 3: Building and optimising the deep learning model

Building the ANN #1 for prediction of 0.2% Proof Stress (MPa)

In [None]:
# importing tensorflow
import tensorflow as tf

In [None]:
tf.__version__

In [None]:
# initializing the ANN
ann_1 = tf.keras.models.Sequential()

In [None]:
# showing a shape of X_train array
X_train.shape

In [None]:
# adding the input layer and the first hidden layer
ann_1.add(tf.keras.layers.Dense(units=15, activation='relu'))

In [None]:
# adding the second hidden layer
ann_1.add(tf.keras.layers.Dense(units=240, activation='sigmoid'))

In [None]:
# adding the third hidden layer
ann_1.add(tf.keras.layers.Dense(units=60, activation='relu'))

In [None]:
# adding the output layer
ann_1.add(tf.keras.layers.Dense(units=1))

Training the ANN#1

In [None]:
# compiling the ANN
ann_1.compile(optimizer = 'adam', loss = 'mean_squared_error')

In [None]:
# training the ANN model on the Training set
ann_1.fit(X_train, y_train, batch_size = 64, epochs = 600)

In [None]:
ann_1.summary()

Predicting the results of the Test set

In [None]:
y_pred = ann_1.predict(X_test)
np.set_printoptions(precision=2)

Scoring the ANN#1 prediction

In [None]:
# y_test scoring
from sklearn.metrics import r2_score
ann_1_score = r2_score(y_test, y_pred)
ann_1_score
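
One detail worth keeping in mind: for a network ending in `Dense(units=1)`, `predict` returns an array of shape `(n_samples, 1)`, while `y_test` has shape `(n_samples,)`. `r2_score` accepts this combination, but hand-written arithmetic on the two arrays broadcasts to a square matrix. The NumPy-only sketch below uses made-up numbers to show the pitfall and the `ravel()` fix.

```python
import numpy as np

y_test_demo = np.array([300.0, 350.0, 400.0])         # shape (3,)
y_pred_demo = np.array([[310.0], [340.0], [405.0]])   # shape (3, 1), like a Dense(1) output

# Subtracting directly broadcasts (3,) against (3, 1) into a (3, 3) matrix.
wrong = y_test_demo - y_pred_demo

# Flattening the predictions first gives the intended element-wise residuals.
right = y_test_demo - y_pred_demo.ravel()

print(wrong.shape, right.shape)
print(right)
```

Flattening the predictions once, right after `predict`, avoids the issue in any later residual plots or custom metrics.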

Building the ANN #2 for prediction of Tensile Strength (MPa)

In [None]:
# initializing the ANN
ann_2 = tf.keras.models.Sequential()

In [None]:
# adding the input layer and the first hidden layer
ann_2.add(tf.keras.layers.Dense(units=15, activation='relu'))

In [None]:
# adding the second hidden layer
ann_2.add(tf.keras.layers.Dense(units=240, activation='sigmoid'))

In [None]:
# adding the third hidden layer
ann_2.add(tf.keras.layers.Dense(units=60, activation='relu'))

In [None]:
# adding the output layer
ann_2.add(tf.keras.layers.Dense(units=1))

Training the ANN#2

In [None]:
# compiling the ANN
ann_2.compile(optimizer = 'adam', loss = 'mean_squared_error')

In [None]:
# training the ANN model on the Training set
ann_2.fit(X_train, z_train, batch_size = 64, epochs = 750)

Predicting the results of the Test set

In [None]:
z_pred = ann_2.predict(X_test)
np.set_printoptions(precision=2)

Scoring the ANN#2 prediction

In [None]:
# z_test scoring
from sklearn.metrics import r2_score
ann_2_score = r2_score(z_test, z_pred)
ann_2_score

Building the ANN #3 for prediction of log(Elongation (%))

In [None]:
# initializing the ANN
ann_3 = tf.keras.models.Sequential()

In [None]:
# adding the input layer and the first hidden layer
ann_3.add(tf.keras.layers.Dense(units=15, activation='sigmoid'))

In [None]:
# adding the second hidden layer
ann_3.add(tf.keras.layers.Dense(units=60, activation='sigmoid'))

In [None]:
# adding the third hidden layer
ann_3.add(tf.keras.layers.Dense(units=240, activation='sigmoid'))

In [None]:
# adding the output layer
ann_3.add(tf.keras.layers.Dense(units=1))

Training the ANN#3

In [None]:
# compiling the ANN
ann_3.compile(optimizer = 'adam', loss = 'mean_squared_error')

In [None]:
# training the ANN model on the Training set
ann_3.fit(X_train, v_train, batch_size = 64, epochs = 5000)

Predicting the results of the Test set

In [None]:
v_pred = ann_3.predict(X_test)
np.set_printoptions(precision=3)

Scoring the ANN#3 prediction

In [None]:
# v_test scoring
from sklearn.metrics import r2_score
ann_3_score = r2_score(v_test, v_pred)
ann_3_score

Building the ANN #4 for prediction of Reduction in Area (%)

In [None]:
# initializing the ANN
ann_4 = tf.keras.models.Sequential()

In [None]:
# adding the input layer and the first hidden layer
ann_4.add(tf.keras.layers.Dense(units=15, activation='sigmoid'))

In [None]:
# adding the second hidden layer
ann_4.add(tf.keras.layers.Dense(units=15, activation='sigmoid'))

In [None]:
# adding the third hidden layer
ann_4.add(tf.keras.layers.Dense(units=240, activation='sigmoid'))

In [None]:
# adding the output layer
ann_4.add(tf.keras.layers.Dense(units=1))

Training the ANN#4

In [None]:
# compiling the ANN
ann_4.compile(optimizer = 'adam', loss = 'mean_squared_error')

In [None]:
# training the ANN model on the Training set
ann_4.fit(X_train, w_train, batch_size = 64, epochs = 1000)

Predicting the results of the Test set

In [None]:
w_pred = ann_4.predict(X_test)
np.set_printoptions(precision=2)

Scoring the ANN#4 prediction

In [None]:
# w_test scoring
from sklearn.metrics import r2_score
ann_4_score = r2_score(w_test, w_pred)
ann_4_score

R squared score for the ANNs:

In [None]:
print(f"0.2% Proof Stress (MPa) score is {ann_1_score:.3f}")
print(f"Tensile Strength (MPa) score is {ann_2_score:.3f}")
print(f"log(Elongation (%)) score is {ann_3_score:.3f}")
print(f"Reduction in Area (%) score is {ann_4_score:.3f}")

Discussion: the ANNs perform similarly to random forest regression for all four dependent variables. The R squared score is about 2% lower for the ANNs.

Conclusion: random forest regression showed the best performance for this task. Data scaling, outlier removal and skewness correction noticeably improved the performance.