# **Predicting a Multi-Stage Continuous-Flow Manufacturing Process**
**Problem description**
The input and output data were collected from a multi-stage continuous-flow manufacturing line and stored in a CSV file named "continuous_factory_process". The manufacturing line is divided into 2 stages.
* **Stage 1**: Has 3 machines that run in parallel. After a machine finishes processing its product (it is not clear whether this is a finished product or an intermediate material), the product is combined with the products from the other machines on the conveyor. The output is measured at 15 locations on the conveyor.
* **Stage 2**: The output of stage 1 is fed into stage 2 by the conveyor. Stage 2 has 2 machines that run in series. After machine 5 finishes processing the product, the output is measured at 15 locations.
* **Goal of the problem**: 1. Predict the output of stage 1. 2. Predict the output of stage 2.

# Prediction concept
I decided to create 3 models for this problem:
1. Model A: This model predicts the stage 1 output only. Its inputs are the ambient conditions, the machine 1-3 parameters, and the combination zone parameters.
2. Model B: This model follows the same concept as model A but predicts the stage 2 output only. Its inputs are the ambient conditions, the machine 4-5 parameters, and the exit zone parameters.
3. Model C: This model has the same output as model B but different inputs. According to the problem description, the output of stage 1 is fed into stage 2, so the stage 1 output may be useful as an input for predicting the stage 2 output. Thus model C uses the stage 1 output, the ambient conditions, the machine 4-5 parameters, and the exit zone parameters as inputs.

# Let's start data preparation

**First, import the libraries**
* Pandas for reading the CSV file and creating data frames
* Statsmodels.api for the feature selection process
* Numpy for array data management
* Scikit-learn for splitting the train/test datasets and creating the prediction models
* Time for recording the process time
* Warnings for suppressing some system warnings

In [None]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
import time
import warnings
warnings.filterwarnings('ignore')

Start recording the time before importing the raw data file.

In [None]:
#Start time record for process
start_time = time.time()

In [None]:
print("**************************Import raw data********************************")

The raw data file (CSV) is read with pandas read_csv. The time_stamp column is used as the index of the data.
* The raw data info is printed to show the data types, the number of columns, and the missing (null) values in each column
* The raw data head is printed to show the column headers and some values in the columns

In [None]:
raw_data = pd.read_csv("../input/multistage-continuousflow-manufacturing-process/continuous_factory_process.csv", index_col="time_stamp")
print(raw_data.info())
print(raw_data.head())

In the output section of the raw data there are setpoint columns (I think they are only reference values, not measured outputs). I decided to remove them from the raw data.
To remove the setpoint columns:
* From raw_data.info() above, I collected the positional indices of the setpoint columns and stored them in the set_point variable
* The set_point_name variable is the list of setpoint column names; it is used to select the columns to drop, and the result is stored in the set_data variable
* print(set_data.info()) shows the info of the data after the columns are dropped

In [None]:
#Drop setpoint data
print("*************************Remove set point from raw data*****************************")
set_point = [42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114]
set_point_name = list(raw_data.columns[set_point])
set_data = raw_data.drop(columns = set_point_name, axis=1)
print(set_data.info())
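
The hard-coded index list above depends on the exact column order of the file. As a minimal sketch of an alternative, the setpoint columns could be selected by name instead. This assumes that the setpoint columns carry "Setpoint" in their name, which I have not verified against the file; the toy column names below are illustrative only.

```python
import pandas as pd

# Toy frame; the column names are hypothetical and only mimic the naming style of the real file.
demo = pd.DataFrame({
    "Stage1.Output.Measurement0.U.Actual": [12.1, 12.3],
    "Stage1.Output.Measurement0.U.Setpoint": [12.0, 12.0],
    "Machine1.RawMaterial.Property1": [11.5, 11.6],
})

# Collect every column whose name contains "Setpoint" and drop them in one call.
setpoint_cols = [c for c in demo.columns if "Setpoint" in c]
without_setpoints = demo.drop(columns=setpoint_cols)
print(list(without_setpoints.columns))
```

Selecting by name keeps working if columns are added or reordered, at the cost of relying on a naming convention.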

The raw data also contains some columns with many zero values; those columns can probably be dropped from the prediction model.
* I decided to use a for loop to count and show the columns in which more than 30% of the values are 0
* The for loop iterates over the positions of set_data.columns
* CC holds the values of set_data in column i
* CC is an np.array, so I use np.count_nonzero on the condition CC == 0 to count the zeros, then divide by the total number of values in column i to get the percentage of zeros
* The "if" statement selects the columns with more than 30% zeros. The 30% threshold does not come from theory; I set it myself.
* The name of each selected column is stored in the "DROP" variable and printed.

In [None]:
#Count 0 value in data
print("***************************Check 0 more than 30% in columns**************************")
for i in range(len(set_data.columns[:])):
    CC = set_data.values[:,i]
    CCN = np.count_nonzero(CC == 0) / len(CC) * 100  # % of zeros; len(CC) replaces the hard-coded row count
    if (CCN>30):
        DROP = (set_data.columns[i])
        print ("    Column Number ",i," Name ",DROP, " Value =","{:.2f}".format(CCN ))
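
The same zero-percentage check can be written without an explicit loop. The following is a minimal sketch on a toy data frame (the column names and values are made up for illustration), using the same 30% threshold as above.

```python
import pandas as pd

# Toy data; values are illustrative only.
demo = pd.DataFrame({
    "a": [0, 0, 1, 2],   # half of the values are zero
    "b": [1, 2, 3, 4],   # no zeros
    "c": [0, 5, 6, 7],   # a quarter of the values are zero
})

# (demo == 0) is a boolean frame; its column-wise mean is the share of zeros.
zero_pct = (demo == 0).mean() * 100

# Keep only the columns above the 30% threshold.
mostly_zero = zero_pct[zero_pct > 30]
print(mostly_zero)
```

The index of `mostly_zero` holds the column names, so it could be passed directly to `drop(columns=...)` instead of copying positional indices by hand.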


After getting the column names above, I removed those columns. The columns to remove are stored in the "Drop_list" (indices) and "Drop_list_name" (names) variables. The data after removal is stored in "Araw_data", and its info is printed.

In [None]:
#Drop columns 0 more than 30% of output
Drop_list = [42,46,47,48,52,55,74]
Drop_list_name = list(set_data.columns[Drop_list])
Araw_data = set_data.drop(columns= Drop_list_name, axis=1 )
#Araw_list = list(Araw_data.columns)
print("********************************Columns after drop*******************************")
Araw_data.info()  # info() prints by itself; wrapping it in print() would also print "None"

After the data preparation process (removing the setpoint columns and the mostly-zero columns), the data is split into inputs and outputs for models A, B, and C. The split uses positional column indices as the reference.
* Model A uses the ambient condition columns, the machine 1-3 columns, and the combination zone columns as inputs

In [None]:
#Data for model A
Input_A = Araw_data.values[:,0:41]
Input_AN = Araw_data.columns[0:41]
Input_AA = pd.DataFrame(data=Input_A, columns=Input_AN)
Input_AAN = pd.DataFrame(data=Input_A) #Non name
Output_A = Araw_data.values[:,41:50]
Output_AN = Araw_data.columns[41:50]
Output_AA = pd.DataFrame(data=Output_A, columns=Output_AN)
print("Input for model A")
print(Input_AA.info())
print("Output for model A")
print(Output_AA.info())

* Model B uses the ambient condition columns, the machine 4-5 columns, and the exit zone columns as inputs

In [None]:
#Data for model B
IN  = [0,1,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
Input_B = Araw_data.values[:,IN]
Input_BN = Araw_data.columns[IN]
Input_BAN = pd.DataFrame(data=Input_B) #Non name
Input_BA = pd.DataFrame(data=Input_B, columns=Input_BN)
Output_B = Araw_data.values[:,64:78]
Output_BN = Araw_data.columns[64:78]
Output_BA = pd.DataFrame(data=Output_B, columns=Output_BN)
print("Input for model B")
print(Input_BA.info())
print("Output for model B")
print(Output_BA.info())

* Model C uses the ambient condition columns, the stage 1 output (actual values, not predictions), the machine 4-5 columns, and the exit zone columns as inputs

In [None]:
#Data for model C
INC  = [0,1,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]
Input_C = Araw_data.values[:,INC]
Input_CN = Araw_data.columns[INC]
Input_CA = pd.DataFrame(data=Input_C, columns=Input_CN)
Input_CAN = pd.DataFrame(data=Input_C) #Non name
Output_CA = Output_BA
print("Input for model C")
print(Input_CA.info())
print("Output for model C")
print(Output_CA.info())

# Let's start the prediction session

**Model A**
> Feature selection
1. Feature selection: I used p-value-based backward elimination, inside a for loop, to find which input columns are related to each output column
2. The relationship is estimated by fitting an sm.OLS model
3. The p-value is used to remove input columns that have a weak statistical relationship with the output column. The 0.05 threshold corresponds to a 5% significance level (95% confidence), which is the commonly used value. At each step, the column with the largest p-value is removed if that p-value is above 0.05, and the model is refitted.
4. The list of inputs that affect each output is stored in the "selected_features" variable. The list is not printed because it is hard to read in the console; only the total and selected feature counts are printed.
5. The selected inputs are then used to build the model through the "x_selected" variable

> Data split and prediction algorithm
1. The data (input is x_selected, output is Yi) is split into train and test sets in a 70:30 ratio with the train_test_split function from the scikit-learn library
2. The "Decision tree", "Poly-Support vector machine", and "K-nearest neighbor" algorithms are used to predict the model output. All of them are used as regression models.
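
The 70:30 split can be sketched on toy data as below. Note that `random_state=42` is my addition for illustration: the split in this notebook does not fix a seed, so its scores change from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative values only).
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 keeps 70% of the rows for training and 30% for testing.
# random_state is an assumption added here so the split is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(x_train.shape, x_test.shape)
```

Fixing the seed makes it easier to compare the three algorithms, since they are then scored on the same test rows in every run.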

>Support vector machine
* Support vector machine (SVM) is a popular machine learning algorithm. SVM is widely used for classification, but it can also be used for linear and non-linear regression. In this prediction model, SVM was selected because it is flexible and works well when the data is complex. Poly-SVM is an SVM with a polynomial kernel; here the kernel has degree 3 (in trial runs, degree 3 and degree 4 showed no difference in accuracy). The input data of Poly-SVM is standardized before training, so that features on different scales contribute comparably to the polynomial kernel.
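
As a minimal sketch of the standardization step, the example below builds the same kind of pipeline on synthetic data (the data and the train/test cut are made up for illustration). Inside the pipeline, the scaler is fitted on the training rows only and then reused for the test rows.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative values only).
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(500, 100, 200)])
y = X[:, 0] + 0.01 * X[:, 1]

# After standardization every column has mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))

# Same settings as the Poly-SVM used in this notebook.
model = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2, kernel="poly", degree=3))
model.fit(X[:150], y[:150])
print(model.score(X[150:], y[150:]))
```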

>Decision trees
* In this prediction model, the decision tree algorithm was selected because it is fast to compute and gave high accuracy.

> K-nearest neighbor
* In this prediction model, the K-nearest neighbor algorithm was selected because it is simple and powerful, and it has no real training phase (it only stores the training data).
* Only the weights parameter of the scikit-learn K-nearest neighbor regressor was changed from its default, to "distance".
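
To illustrate what weights="distance" changes, here is a minimal sketch on a tiny made-up data set with 2 neighbors (the notebook itself uses 5). With uniform weights the prediction is the plain mean of the neighbors; with distance weights the closer neighbor counts more.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny illustrative data set: one feature, four points.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
query = np.array([[0.9]])  # nearest neighbors are x=1 (close) and x=0 (far)

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

pred_uniform = uniform.predict(query)[0]    # plain mean of the two neighbor targets
pred_distance = distance.predict(query)[0]  # weighted by 1/distance, pulled toward x=1
print(pred_uniform, pred_distance)
```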

> Accuracy
* The % prediction accuracy is taken from the score function, which for scikit-learn regressors returns the coefficient of determination (R²), multiplied by 100 here.
1. The "Decision tree" has the highest average accuracy. Its accuracy is around 40% or higher, except for Stage1.Output.Measurement4.U.Actual and Stage1.Output.Measurement10.U.Actual.
2. The "Poly-SVM" has the lowest average accuracy compared with the other algorithms. Maybe the Poly-SVM algorithm is not suitable for this problem.
3. The "K-nearest neighbor" also has low accuracy. Its highest accuracy is around 50%, at Stage1.Output.Measurement9.U.Actual.

In [None]:
#Feature selection and prediction for model A
print("********************************Prediction model A (Stage 1)*******************************")
for n in range (len(Output_AN)):
    Yi = Output_AA.values [:,n];
    # Backward Elimination #for feature selection
    cols = list(Input_AAN.columns)
    pmax = 1
    while (len(cols) > 0):
       p = []
       X_1 = Input_AAN[cols]
       X_1 = sm.add_constant(X_1)
       model = sm.OLS(Yi, X_1).fit()
       p = pd.Series(model.pvalues.values[1:], index=cols)
       pmax = max(p)
       feature_with_p_max = p.idxmax()
       if (pmax > 0.05):
          cols.remove(feature_with_p_max)
       else:
          break
    selected_features = cols
    #print(Output_AN[n], "(Selected feature) =", list(Input_AA.columns[cols])) #list of selected feature model A
    print(Output_AN[n],"     Total feature","(",len(Input_AN),")","Selected feature",len(selected_features))
    x_selected = Input_AA.values[:, selected_features]
    x_train, x_test, y_train, y_test = train_test_split(x_selected, Yi, test_size=0.3)

    # SVM-poly
    svr_poly = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2, kernel='poly', degree=3))
    svr_poly = svr_poly.fit(x_train, y_train)
    Svr_poly = svr_poly.score(x_test, y_test)  # R^2; can be negative, so it is not wrapped in abs()
    Svr_poly = "{:.2f}".format(Svr_poly * 100)

    # Decision tree
    # max_features='auto' is no longer accepted by recent scikit-learn releases; the default also uses all features
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(x_train, y_train)
    CRF = clf.score(x_test, y_test)  # R^2
    CRF = "{:.2f}".format(CRF * 100)

    # KNN
    neigh = KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='auto', leaf_size=30, p=2,
                                metric='minkowski', metric_params=None, n_jobs=None)
    neigh = neigh.fit(x_train, y_train)
    KNN = neigh.score(x_test, y_test)  # R^2
    KNN = "{:.2f}".format(KNN * 100)

    # Accuracy
    print("             %Prediction accuracy,", "Tree ", CRF, " SVM-Poly ", Svr_poly, "KNN ", KNN)

**Model B**

> Feature selection and data split
1. Feature selection and data splitting for model B are the same as for model A. The only difference is the variable names.

>Model B accuracy
1. Unlike model A, the "Poly-SVM" has the highest average accuracy, but it is only slightly higher than that of the K-nearest neighbor.
2. The "Decision tree" changed rank the most, from the highest to the lowest. A possible reason is that the values within each output column vary widely from point to point.

In [None]:
#Feature selection and prediction for model B
print("********************************Prediction model B (Stage 2)*******************************")
for m in range (len(Output_BN)):
    YBi = Output_BA.values [:,m];
    # Backward Elimination #for feature selection
    colsB = list(Input_BAN.columns)
    pBmax = 1
    while (len(colsB) > 0):
       pB = []
       X_2 = Input_BAN[colsB]  # use the model B inputs (was Input_AAN, the model A inputs)
       X_2 = sm.add_constant(X_2)
       model = sm.OLS(YBi, X_2).fit()
       pB = pd.Series(model.pvalues.values[1:], index=colsB)
       pBmax = max(pB)
       feature_with_p_maxB = pB.idxmax()
       if (pBmax > 0.05):
          colsB.remove(feature_with_p_maxB)
       else:
          break
    selected_featuresB = colsB
    #print(Output_BN[m], "(Selected feature) =", list(Input_BA.columns[colsB])) #list of selected feature model B
    print(Output_BN[m],"     Total feature","(",len(Input_BN),")","Selected feature",len(selected_featuresB))
    x_selectedB = Input_BA.values[:, selected_featuresB]  # model B inputs (was Input_AA)
    x_trainB, x_testB, y_trainB, y_testB = train_test_split(x_selectedB, YBi, test_size=0.3)

    # SVM-poly
    svr_polyB = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2, kernel='poly', degree=3))
    svr_polyB = svr_polyB.fit(x_trainB, y_trainB)
    Svr_polyB = svr_polyB.score(x_testB, y_testB)  # R^2; can be negative, so it is not wrapped in abs()
    Svr_polyB = "{:.2f}".format(Svr_polyB * 100)

    # Decision tree
    # max_features='auto' is no longer accepted by recent scikit-learn releases; the default also uses all features
    clfB = tree.DecisionTreeRegressor()
    clfB = clfB.fit(x_trainB, y_trainB)
    CRFB = clfB.score(x_testB, y_testB)  # R^2
    CRFB = "{:.2f}".format(CRFB * 100)

    # KNN
    neighB = KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='auto', leaf_size=30, p=2,
                                metric='minkowski', metric_params=None, n_jobs=None)
    neighB = neighB.fit(x_trainB, y_trainB)
    KNNB = neighB.score(x_testB, y_testB)  # R^2
    KNNB = "{:.2f}".format(KNNB * 100)

    # Accuracy
    print("             %Prediction accuracy,", "Tree ", CRFB, " SVM-Poly ", Svr_polyB, "KNN ", KNNB)

**Model C**

> Feature selection and data split
1. Feature selection and data splitting for model C are the same as for model A. The only difference is the variable names.

>Model C accuracy
1. The average accuracy of all algorithms seems higher than in model B. This suggests that the stage 1 output strongly affects the stage 2 output.
2. In model C, the "K-nearest neighbor" has the highest average accuracy.


In [None]:
print("*********************Prediction model C (Output Stage 1 sent to input stage 2 )*************")
for r in range (len(Output_BN)):
    YCi = Output_CA.values [:,r];
    # Backward Elimination #for feature selection
    colsC = list(Input_CAN.columns)
    pCmax = 1
    while (len(colsC) > 0):
       pC = []
       X_3 = Input_CAN[colsC]
       X_3 = sm.add_constant(X_3)
       model = sm.OLS(YCi, X_3).fit()
       pC = pd.Series(model.pvalues.values[1:], index=colsC)
       pCmax = max(pC)
       feature_with_p_maxC = pC.idxmax()
       if (pCmax > 0.05):
          colsC.remove(feature_with_p_maxC)
       else:
          break
    selected_featuresC = colsC
    #print(Output_BN[r], "(Selected feature) =", list(Input_CA.columns[colsC])) #list of selected feature model C
    print(Output_BN[r],"     Total feature","(",len(Input_CN),")","Selected feature",len(selected_featuresC))
    x_selectedC = Input_CA.values[:, selected_featuresC]
    x_trainC, x_testC, y_trainC, y_testC = train_test_split(x_selectedC, YCi, test_size=0.3)

    # SVM-poly
    svr_polyC = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2, kernel='poly', degree=3))
    svr_polyC = svr_polyC.fit(x_trainC, y_trainC)
    Svr_polyC = svr_polyC.score(x_testC, y_testC)  # R^2; can be negative, so it is not wrapped in abs()
    Svr_polyC = "{:.2f}".format(Svr_polyC * 100)

    # Decision tree
    # max_features='auto' is no longer accepted by recent scikit-learn releases; the default also uses all features
    clfC = tree.DecisionTreeRegressor()
    clfC = clfC.fit(x_trainC, y_trainC)
    CRFC = clfC.score(x_testC, y_testC)  # R^2
    CRFC = "{:.2f}".format(CRFC * 100)

    # KNN
    neighC = KNeighborsRegressor(n_neighbors=5, weights='distance', algorithm='auto', leaf_size=30, p=2,
                                metric='minkowski', metric_params=None, n_jobs=None)
    neighC = neighC.fit(x_trainC, y_trainC)
    KNNC = neighC.score(x_testC, y_testC)  # R^2
    KNNC = "{:.2f}".format(KNNC * 100)

    # Accuracy
    print("             %Prediction accuracy,", "Tree ", CRFC, " SVM-Poly ", Svr_polyC, "KNN ", KNNC)

**Process time** shows how long it took to compute the prediction models.

In [None]:
#Finish process time
finish_time = time.time()
elapsed = finish_time - start_time  # do not reuse the name "time", which would shadow the time module
print('Process time', "{:.2f}".format(elapsed), " sec")