Domain  
Semiconductor manufacturing process  

Business Context  
A complex modern semiconductor manufacturing process is normally under constant surveillance through signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and engineers typically have many more signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. Process engineers may then use these signals to determine the key factors contributing to yield excursions downstream in the process. This enables an increase in process throughput, a decrease in time to learning, and a reduction in per-unit production costs. These signals can be used as features to predict the yield type, and by analyzing and trying out different combinations of features, the essential signals that impact the yield type can be identified.  

Objective  
We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyze whether all the features are required to build the model or not.   
Dataset description  
   
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.                                                                        2  
sensor-data.csv : (1567, 592)  
The data consists of 1567 examples, each with 591 features. The dataset presented in this case represents a selection of such features, where each example represents a single production entity with associated measured features, and the labels represent a simple pass/fail yield for in-house line testing. In the target column, “-1” corresponds to a pass and “1” corresponds to a fail; the data time stamp is for that specific test point.  
Steps   
1. Import the necessary libraries, read the provided CSV as a dataframe, and perform the steps below. (5 points)
   a. Check a few observations and the shape of the dataframe
   b. Check for missing values. Impute the missing values if there are any
   c. Univariate analysis - check the frequency count of the target column and the distribution of the first few features (sensors)
   d. Perform bivariate analysis and check for correlation
   e. Drop irrelevant columns
2. Standardize the data (3 points)
3. Segregate the dependent column ("Pass/Fail") from the dataframe, and split the dataset into training and testing sets (70:30 split) (2 points)
4. Build a logistic regression, random forest, and xgboost classifier model and print the confusion matrix for the test data (10 points)
5. Apply sampling techniques to handle the imbalanced classes (5 points)
6. Build a logistic regression, random forest, and xgboost classifier model after resampling the data and print the confusion matrix for the test data (10 points)
7. Apply Grid Search CV to get the best hyperparameters for any one of the above models (5 points)
8. Build a classifier model using the above best hyperparameters and check the accuracy and confusion matrix (5 points)
9. Report feature importance and mention your comments (2 points)
10. Report your findings and inferences ( 3 points)     
Further Questions ( Optional) -  
1. Check for outliers and impute them as required.
2. Apply PCA to get rid of redundant features and reduce the dimension of the data
3. Try cross-validation techniques to get better results
4. Try the OneClassSVM model to get better recall
Learning Outcomes  
● Feature Importance
● Sampling
● SMOTE
● Grid Search
● Random Forest
● Exploratory Data Analysis
● Logistic Regression

In [None]:
# Widen the display of python output
# This is done to avoid ellipsis appearing which restricts output view in row or column
import pandas as pd
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

In [None]:
# Importing Signals Data file
# This is same as the signals dataset provided
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv('../input/uci-semcom/uci-secom.csv')
dataset = pd.DataFrame(data)
dataset.head()

In [None]:
#Dataset has 1,567 rows and 592 columns
dataset.shape

In [None]:
#Add a prefix to the column names for ease of understanding
dataset.columns = 'feature_' + dataset.columns

In [None]:
#Rename the time column and Pass_Fail column as they are not features 
dataset.rename(columns = {'feature_Time':'Time'}, inplace = True) 
dataset.rename(columns = {'feature_Pass/Fail':'Pass_Fail'}, inplace = True)

In [None]:
#All variables except Time and Pass_Fail (whether the process entity passed or not) are float. 
#Time is read as a string (object) and Pass_Fail is an integer variable
dataset.dtypes

In [None]:
#5 number summary of the columns 
#The counts are varying for each feature so there might be issues with some of the values like missing values
#Few variables like feature_586 etc have negative values as well
#There also appears to be several features with outliers as there is significant 
#difference between 75% and max value
dataset.describe().transpose()

In [None]:
#Check for null values - Most of the features are having null values
dataset.isnull().sum()

In [None]:
#Check for NA values - isna() is an alias of isnull(), so the counts match the cell above
dataset.isna().sum()

In [None]:
#Fill NA with zero in every column of the signal dataset for missing value imputation
#Assumption: a missing value means that we did not have any signal from that feature
df = dataset.iloc[:,1:]
df = df.fillna(0)

In [None]:
df2 = dataset.iloc[:,0]
result = pd.concat([df, df2], axis=1).reindex(df.index)

In [None]:
#Some features seem to have the same values throughout for example feature_5
#We would need to drop such features and those which are highly correlated to each other in future steps
result.head()

In [None]:
#We can see that all the NA values are removed now from the dataframe
result.isna().sum()

In [None]:
#We can see that all the null values are also removed now from the dataframe
result.isnull().sum()

In [None]:
#The classes for the Pass_Fail variable are not balanced - 93.4% of observations passed  
#while 6.6% of observations failed. 
#So if an algorithm just assigns the value -1 to all observations, it will still achieve 93.4% accuracy, 
#So our selected model should have better accuracy than 93.4% to be called a good model 
result["Pass_Fail"].value_counts(normalize=True)
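
The baseline argument in the comments above can be checked with a small calculation. This is a minimal sketch on a synthetic label vector with roughly the same 93.4% / 6.6% split (the labels are made up for illustration and are not the sensor data). It suggests that a majority-class predictor reaches the quoted accuracy while finding none of the failures, so recall on the fail class is worth tracking alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with roughly the class split quoted above (assumption: not the real data)
y_true = np.array([-1] * 934 + [1] * 66)

# A "model" that always predicts the majority class (-1 = pass)
y_majority = np.full_like(y_true, -1)

baseline_accuracy = accuracy_score(y_true, y_majority)
fail_recall = recall_score(y_true, y_majority, pos_label=1, zero_division=0)

print("Baseline accuracy:", baseline_accuracy)
print("Recall on the fail class:", fail_recall)
```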

In [None]:
#The bar plot below also shows us that the classes are not balanced
result["Pass_Fail"].value_counts().plot(kind="bar");

In [None]:
# Get the correlation matrix (numeric columns only; Time is a string column)
corr = result.select_dtypes(include='number').corr()
#sns.heatmap(corr,annot=True);
#mask = np.zeros_like(corr)
#mask[np.triu_indices_from(mask)] = True
#with sns.axes_style("white"):
#    f, ax = plt.subplots(figsize=(20, 20))
#    ax = sns.heatmap(corr, mask=mask, vmax=.3, annot=True, square=True);

In [None]:
print(corr)

In [None]:
#The correlation matrix is too large to view here, so it is exported as a CSV file
corr.to_csv("correlation.csv")

In [None]:
#Based on the correlation CSV, a lot of columns have the same value throughout and therefore no variance
#These show up as blank values in the correlation CSV
#Removing such columns from the dataframe below - the 116 columns listed are blank
result.drop(['feature_5','feature_13','feature_42','feature_49','feature_52','feature_69','feature_97','feature_141',
             'feature_149','feature_178','feature_179','feature_186','feature_189','feature_190','feature_191','feature_192',
             'feature_193','feature_194','feature_226','feature_229','feature_230','feature_231','feature_232','feature_233',
             'feature_234','feature_235','feature_236','feature_237','feature_240','feature_241','feature_242','feature_243',
             'feature_256','feature_257','feature_258','feature_259','feature_260','feature_261','feature_262','feature_263',
             'feature_264','feature_265','feature_266','feature_276','feature_284','feature_313','feature_314','feature_315',
             'feature_322','feature_325','feature_326','feature_327','feature_328','feature_329','feature_330','feature_364',
             'feature_369','feature_370','feature_371','feature_372','feature_373','feature_374','feature_375','feature_378',
             'feature_379','feature_380','feature_381','feature_394','feature_395','feature_396','feature_397','feature_398',
             'feature_399','feature_400','feature_401','feature_402','feature_403','feature_404','feature_414','feature_422',
             'feature_449','feature_450','feature_451','feature_458','feature_461','feature_462','feature_463','feature_464',
             'feature_465','feature_466','feature_481','feature_498','feature_501','feature_502','feature_503','feature_504',
             'feature_505','feature_506','feature_507','feature_508','feature_509','feature_512','feature_513','feature_514',
             'feature_515','feature_528','feature_529','feature_530','feature_531','feature_532','feature_533','feature_534',
             'feature_535','feature_536','feature_537','feature_538'],axis=1,inplace=True)
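
The hand-written list above could also be derived from the data. The following is a minimal sketch on a toy dataframe with hypothetical column names; it assumes that "no variance" means a column holds at most one distinct value, which may not match exactly how the blanks in the correlation CSV were read off.

```python
import pandas as pd

# Toy frame standing in for the sensor data (hypothetical column names)
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [5.0, 5.0, 5.0, 5.0],   # constant -> no variance
    "feature_c": [0.0, 0.0, 0.0, 0.0],   # constant -> no variance
    "feature_d": [0.1, 0.4, 0.2, 0.9],
})

# A column is constant when it holds at most one distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.columns.tolist())
```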

In [None]:
#Remove the highly collinear features from results dataframe
def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model 
        to generalize and improves the interpretability of the model.

    Inputs: 
        x: features dataframe
        threshold: features with correlations greater than this value are removed

    Output: 
        dataframe that contains only the non-highly-collinear features
    '''

    # Calculate the correlation matrix (numeric columns only, so string columns such as Time are skipped)
    corr_matrix = x.select_dtypes(include='number').corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i+1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)

    return x
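
The nested loop above inspects, for every column, its correlation with each earlier column. The same rule can be sketched without explicit loops by masking the correlation matrix to its strict upper triangle. This is an illustrative alternative on synthetic data; the helper name `drop_collinear` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_collinear(frame, threshold):
    """Drop the later column of every pair whose |correlation| >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return frame.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

reduced, dropped = drop_collinear(toy, threshold=0.70)
print(dropped)
print(reduced.columns.tolist())
```

As in the loop version, the first column of a correlated pair is kept and the later one is dropped, so the outcome depends on column order.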

In [None]:
#Remove columns having more than 70% correlation
#Both positive and negative correlations are considered here
result = remove_collinear_features(result,0.70)

In [None]:
result.head()

In [None]:
#After dropping the highly correlated variables we have 197 columns and 1,567 rows
result.shape

In [None]:
#Observation is that most of the variables distribution are right skewed with long tails and outliers 
#

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure(figsize=(20,10))
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=10,ax=ax)
        ax.set_title(var_name+" Distribution")
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()


#test = pd.DataFrame(np.random.randn(30, 9), columns=map(str, range(9)))
#Most of the variables are approximately normally distributed except for feature_4, feature_7,
#feature_11, feature_12, feature_15, feature_16
draw_histograms(result, result.columns[0:15], 5, 3)

In [None]:
result.Time.dtype

In [None]:
#Parse the Time column once and derive the calendar parts from it
time_index = pd.DatetimeIndex(result['Time'])
result['year'] = time_index.year
result['month'] = time_index.month
result['date'] = time_index.day
result['week_day'] = time_index.weekday
result['start_time'] = time_index.time
result['hour'] = time_index.hour
result['min'] = time_index.minute

In [None]:
result.head()

In [None]:
#This consists of only year 2008
result.year.unique()

In [None]:
#This consists of all the months of 2008
result.month.unique()

In [None]:
#All the dates of the month are not there, might be related to production on certain days only
result.date.unique()

In [None]:
#All seven weekdays are present, so production happens on all 7 days
#In pandas, 0 stands for Monday, 1 for Tuesday ... 6 for Sunday
result.week_day.unique()
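
The weekday codes are easy to misread, so a quick check helps before interpreting the weekday plots. This small sketch builds one full week starting on a known Monday and prints the code pandas assigns to each day; the date range is chosen only for illustration.

```python
import pandas as pd

# 2024-01-01 was a Monday; build one full week from there
week = pd.date_range("2024-01-01", periods=7, freq="D")
codes = pd.DataFrame({"day_name": week.day_name(), "weekday": week.weekday})
print(codes)
```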

In [None]:
#We see that the failures (Pass_Fail=1) peak in August which is also the peak for pass.
#August and September are months with most product and most failures as well
#The failures seem to subside from September onwards post some correction 
#(May-Aug we see more failures than passes)
sns.distplot( result[result.Pass_Fail == -1]['month'], color = 'g');
sns.distplot( result[result.Pass_Fail == 1]['month'], color = 'r');

In [None]:
#The failures tend to decrease towards month end and is in close sync with pass population
sns.distplot( result[result.Pass_Fail == -1]['date'], color = 'g');
sns.distplot( result[result.Pass_Fail == 1]['date'], color = 'r');

In [None]:
#Failures appear to be more towards start and end of the week rather than in the middle of the week
sns.distplot( result[result.Pass_Fail == -1]['week_day'], color = 'g');
sns.distplot( result[result.Pass_Fail == 1]['week_day'], color = 'r');

In [None]:
#There is no specific trend in terms of hours, it seems to be fairly distributed
sns.distplot( result[result.Pass_Fail == -1]['hour'], color = 'g');
sns.distplot( result[result.Pass_Fail == 1]['hour'], color = 'r');

In [None]:
#There is no specific trend in terms of minutes, it seems to be fairly distributed
sns.distplot( result[result.Pass_Fail == -1]['min'], color = 'g');
sns.distplot( result[result.Pass_Fail == 1]['min'], color = 'r');

In [None]:
#Pairplot of the dataset without name and status columns
#Most of the independent variables have a positive skew
#sns.pairplot(dataset3);

In [None]:
# Create a boxplot for all the features by target (Pass_Fail) column
# One common observation is that almost all have outliers, so outlier removal/correction 
# will be required in future steps
# we see certain features with very few observations like feature 4, 8, 9, 10, 11
result.boxplot(column = ['feature_0',
'feature_1',
'feature_2',
'feature_3',
'feature_4',
'feature_8',
'feature_9',
'feature_10',
'feature_11'
], by='Pass_Fail', figsize = (20,20));


In [None]:
#There appears to be very low variation in the features, so we can drop such features
result["feature_11"].unique()

In [None]:
result_num = result.drop(['Pass_Fail','Time','start_time'],axis=1)

In [None]:
#Contains all the features
result_num.head()

In [None]:
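#Drop columns with a standard deviation below the threshold 
#Note: drop() returns a new dataframe; the result is not assigned here, so result_num itself is unchanged
threshold = 0.2
result_num.drop(result_num.std()[result_num.std() < threshold].index.values, axis=1)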

In [None]:
result_num.shape
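
The shape check above is a good place to confirm whether the low-variance columns were actually removed. The sketch below uses a toy dataframe with hypothetical columns to show that `DataFrame.drop` returns a new object unless its result is assigned back (or `inplace=True` is passed).

```python
import pandas as pd

toy = pd.DataFrame({
    "wide": [0.0, 5.0, 10.0, 15.0],      # std well above the threshold
    "narrow": [1.00, 1.01, 1.00, 1.01],  # std far below the threshold
})
threshold = 0.2

stds = toy.std()
low_std_cols = stds[stds < threshold].index.tolist()

toy.drop(columns=low_std_cols)            # result discarded: toy is unchanged
shape_without_assignment = toy.shape

toy = toy.drop(columns=low_std_cols)      # result kept: the column is gone
shape_with_assignment = toy.shape

print(shape_without_assignment, shape_with_assignment)
```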

In [None]:
#The biggest challenge in the dataset seems to be the presence of outliers in almost all variables
#Another challenge is that, except for a few variables, there is not very good separation between observations 
#having failure and those that have passed
#There is also the challenge of domain knowledge, as none of the features are named, so we are not able
#to apply any intuitive understanding

In [None]:
#Create a copy of the dataset to hold the data after outlier treatment
#After identifying outliers we replace them with the median
pd_data = result_num.copy()
#pd_data.head()

#pd_data2 = pd_data.drop(columns=['name'],axis=1)
#pd_data2 = pd_data2.apply(replace,axis=1)
from scipy import stats

#Define a function to replace outliers on the max side with the median
#Note: the cut-off is Q3 + 1.0*IQR, which is tighter than the common 1.5*IQR rule
def outlier_removal_max(var):
    var = np.where(var > var.quantile(0.75)+ stats.iqr(var),var.quantile(0.50),var)
    return var

#Define a function to replace outliers on the min side with the median
#Note: the cut-off is Q1 - 1.0*IQR
def outlier_removal_min(var):
    var = np.where(var < var.quantile(0.25) - stats.iqr(var),var.quantile(0.50),var)
    return var

#Loop over the columns and replace the outliers on the min and max side
for column in pd_data:
    pd_data[column] = outlier_removal_max(pd_data[column])
    pd_data[column] = outlier_removal_min(pd_data[column])
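
For comparison, the same median replacement can be sketched with the multiplier exposed as a parameter. This is an illustrative variant, not the notebook's method: the helper name `replace_outliers_with_median` is hypothetical, and the default `k=1.5` follows the common Tukey convention rather than the 1.0 multiplier used above.

```python
import numpy as np
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (series < lower) | (series > upper)
    return series.mask(is_outlier, series.median())

toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0])
cleaned = replace_outliers_with_median(toy)
print(cleaned.tolist())
```

Here the quartiles are 11.5 and 14.5, so the fences sit at 7 and 19 and only the value 100 is replaced by the median, 13.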
    

In [None]:
pd_data2 = pd_data.copy()

In [None]:
pd_data2["Pass_Fail"] = result["Pass_Fail"]

In [None]:
#Plotting sample boxplot to check if outliers are removed or not
#They are removed from the boxplots so we can now go for PCA
pd_data2.boxplot(column = ['feature_64',
'feature_67',
'feature_71',
'feature_72',
'feature_74',
'feature_75',
'feature_76',
'feature_77',
'feature_78',
'feature_79',
'feature_80'
], by='Pass_Fail', figsize = (20,10));

In [None]:
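#Drop columns with a standard deviation below the threshold 
#Note: drop() returns a new dataframe; the result is not assigned here, so pd_data itself is unchanged
threshold = 0.2
pd_data.drop(pd_data.std()[pd_data.std() < threshold].index.values, axis=1)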

In [None]:
pd_data.shape

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
scaler.fit(pd_data)

In [None]:
#pd_data_scaled = scaler.transform(pd_data)
pd_data_scaled = pd_data.copy()

In [None]:
pd_data_scaled[pd_data_scaled.columns] = scaler.fit_transform(pd_data[pd_data.columns])

In [None]:
pd_data_scaled.head()

In [None]:
# PCA
# Step 1 - Create covariance matrix

cov_matrix = np.cov(pd_data_scaled.T)
print('Covariance Matrix \n', cov_matrix)

In [None]:
# Step 2 - Get eigenvalues and eigenvectors
# eigh is used because a covariance matrix is symmetric; it returns real eigenvalues
eig_vals, eig_vecs = np.linalg.eigh(cov_matrix)
print('Eigen Vectors \n', eig_vecs)
print('\n Eigen Values \n', eig_vals)

In [None]:
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)

In [None]:
plt.plot(var_exp)

In [None]:
# Plotting shows that PCA is not giving us much benefit
plt.figure(figsize=(10 , 5))
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
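
The manual eigen-decomposition above can be cross-checked against scikit-learn's `PCA`. This is a minimal sketch on synthetic data (not the sensor data); it compares the two explained-variance ratios and picks the smallest number of components reaching a 95% cumulative ratio, where the 95% target is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: five columns with very different spreads
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Manual route: eigenvalues of the covariance matrix, largest first
eig_vals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
manual_ratio = eig_vals / eig_vals.sum()

# Library route
pca = PCA().fit(X)
ratios_match = np.allclose(manual_ratio, pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(ratios_match, n_components_95)
```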

In [None]:
#result_z.describe().transpose()

In [None]:
#Let us scale the data before plotting histogram or boxplot
#This will help us visualize better since there are more than 200 variables
#from scipy.stats import zscore
#result2 = result.drop("Time",axis=1)
#result3 = result2.drop("Pass_Fail",axis=1)
#result_z = pd_data.apply(zscore)
#result_z = pd.DataFrame(result_z , columns  = result_z.columns)
#result_z.describe().transpose()

In [None]:
result_z2 = pd_data_scaled

In [None]:
#Copy over the target column to the scaled datasets
result_z2["Pass_Fail"] = result["Pass_Fail"]

In [None]:
#Check the shape of the result_z2 dataset
#It has 1,567 rows and 202 columns
result_z2.shape

In [None]:
result_z2.head()

In [None]:
%matplotlib inline


# Numerical libraries
import numpy as np   

# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

from sklearn.metrics import r2_score

# to handle data in form of rows and columns 
import pandas as pd    

# importing ploting libraries
import matplotlib.pyplot as plt   

#importing seaborn for statistical plots
import seaborn as sns

In [None]:
#Candidate columns to drop from result_z2: year, month, date, week_day, hour

In [None]:
# separating the dependent and independent data

x = result_z2.iloc[:,:201]
y = result_z2["Pass_Fail"]

# getting the shapes of new data sets x and y
print("shape of x:", x.shape)
print("shape of y:", y.shape)

In [None]:
def makeOverSamplesADASYN(X,y):
 #Inputs
 #X → independent variables as a DataFrame
 #y → dependent variable as a pandas Series
 from imblearn.over_sampling import ADASYN 
 sm = ADASYN()
 #fit_resample is the current imbalanced-learn method (fit_sample was removed)
 X, y = sm.fit_resample(X, y)
 return(X,y)

In [None]:
x_samp, y_samp = makeOverSamplesADASYN(x, y)

In [None]:
y_samp.head().unique()

In [None]:
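# splitting the oversampled data into train and test sets
# 70% data is for training and 30% is for test
# Note: because oversampling happened before the split, synthetic samples can end up in the test set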

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_samp, y_samp, test_size = 0.3, random_state = 0)

# getting the shapes - 70:30 split
print("shape of x_train: ", x_train.shape)
print("shape of x_test: ", x_test.shape)
print("shape of y_train: ", y_train.shape)
print("shape of y_test: ", y_test.shape)

In [None]:
y_train.head()

In [None]:
x_train.head()

In [None]:
lasso = Lasso(alpha=0.1)
lasso.fit(x_train,y_train)
print ("Lasso model:", (lasso.coef_))

In [None]:
#Finding optimal no. of clusters with the elbow method
#Assumption: the scaled feature matrix x is the data to cluster
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(x)
    prediction=model.predict(x)
    meanDistortions.append(sum(np.min(cdist(x, model.cluster_centers_, 'euclidean'), axis=1))
                           / x.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')

In [None]:
# splitting the original (not resampled) data into train and test sets
# 70% data is for training and 30% is for test

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

# getting the shapes - 70:30 split
print("shape of x_train: ", x_train.shape)
print("shape of x_test: ", x_test.shape)
print("shape of y_train: ", y_train.shape)
print("shape of y_test: ", y_test.shape)

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Return the model statistics
def fit_n_print(model, X_train, X_test, y_train, y_test):  # take the model, and data as inputs
    from sklearn import metrics
    from sklearn.model_selection import cross_val_score
    
    model.fit(X_train, y_train)   # fit the model with the train data

    pred = model.predict(X_test)  # make predictions on the test set

    score = round(model.score(X_test, y_test), 3)   # compute accuracy score for test set
    mae = mean_absolute_error(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    r2 = r2_score(y_test, pred)
   
    return score, mae, mse, r2  # return all the metrics


In [None]:
#Function to display confusion matrix
def disp_confusion_matrix(model_name, model, X_test, y_test):
    from sklearn.metrics import confusion_matrix
    y_pred = model.predict(X_test)
    conf_mat = confusion_matrix(y_test, y_pred)
    df_conf_mat = pd.DataFrame(conf_mat)
    #ax = plt.axes()
    #plt.title()
    plt.figure(figsize = (10,7))
    plt.suptitle("Confusion matrix: "+model_name)
    sns.heatmap(df_conf_mat, annot=True,cmap='Blues', fmt='g')
    #ax.set_title()
    #plt.show();
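
Since accuracy alone is misleading on imbalanced classes, it can help to read recall and precision for the fail class straight from the confusion matrix. The sketch below uses hypothetical labels and predictions, and fixes the label order with `labels=[-1, 1]` so that the four cells unpack as true negatives, false positives, false negatives, and true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (-1 = pass, 1 = fail)
y_true = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, 1])
y_pred = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1, -1])

# Fix the label order so rows/columns are always [pass, fail]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

fail_recall = tp / (tp + fn)      # share of real failures that were caught
fail_precision = tp / (tp + fp)   # share of predicted failures that were real
print(tn, fp, fn, tp, fail_recall, fail_precision)
```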

In [None]:
# Function to display roc curve and auc
def disp_roc_curve(model_name, model, X_test, y_test):    
    from sklearn.metrics import roc_curve
    from sklearn.metrics import roc_auc_score
    from matplotlib import pyplot
    # generate a no skill prediction (majority class)
    ns_probs = [0 for _ in range(len(y_test))]
    # predict probabilities
    lr_probs = model.predict_proba(X_test)
    #lr_probs = model.predict(X_test)
    # keep probabilities for the positive outcome only
    lr_probs = lr_probs[:, 1]
    # calculate scores
    ns_auc = roc_auc_score(y_test, ns_probs)
    lr_auc = roc_auc_score(y_test, lr_probs)
    # summarize scores
    #print('Random: ROC AUC=%.3f' % (ns_auc))
    print(model_name + ': ROC AUC=%.3f' % (lr_auc))
    # calculate roc curves
    ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
    lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
    # plot the roc curve for the model
    pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='Random')
    pyplot.plot(lr_fpr, lr_tpr, marker='.', label=model_name)
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    # show the legend
    pyplot.legend()
    # show the plot
    pyplot.show()

In [None]:
#Install XGBoost if not installed
!pip install xgboost

In [None]:
from xgboost import XGBClassifier

In [None]:
#Define different classifiers including Logistic Regression, Random Forest and XG Boost
#We have created a pipeline to do PCA first and then do the modeling part

#from sklearn.calibration import CalibratedClassifierCV
#from sklearn.linear_model import LogisticRegression
#lr = LogisticRegression(random_state=0)
#lr_model = CalibratedClassifierCV(lr) 

from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline 


pipe_lr = Pipeline([('pca', PCA(n_components=10)), ('lr', LogisticRegression(random_state=1))]) 
pipe_lr.fit(x_train, y_train) 
print('Test Accuracy - Logistic Regression: %.4f' % pipe_lr.score(x_test, y_test))

pipe_rf = Pipeline([('pca', PCA(n_components=10)), ('rf', RandomForestClassifier(n_estimators=50,
                                                                                random_state=1))]) 
pipe_rf.fit(x_train, y_train) 
print('Test Accuracy - Random Forest: %.4f' % pipe_rf.score(x_test, y_test)) 

pipe_xgb = Pipeline([('pca', PCA(n_components=10)), ('xg',XGBClassifier(random_state=1))]) 
pipe_xgb.fit(x_train, y_train) 
print('Test Accuracy - XG Boost: %.4f' % pipe_xgb.score(x_test, y_test)) 

#from sklearn.ensemble import StackingClassifier
#estimators = [('dt', dt),('rf', rf),('bg', bg), ('gb', gb), ('ab', ab)]
#estimators = [('lr', lr_model),('rf', rf_model),('xgb', xgb_model)]

#reg = StackingClassifier(estimators=estimators)

In [None]:
#This javascript code disables autoscroll

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
#We can see that even though Logistic Regression and Random Forest have higher accuracy, 
#they have not classified any observation in the failure class correctly
#XGBoost, though slightly lower in accuracy, has classified 1 observation in the failure class correctly
for model, model_name in zip([pipe_lr,pipe_rf, pipe_xgb], ['Logistic Regression','Random Forest', 
                                                      'XG Boost']):
    disp_confusion_matrix(model_name, model, x_test, y_test);

In [None]:
#Install imbalanced library if not installed
!pip install -U imbalanced-learn

In [None]:
#!pip install scikit-learn==0.23.1

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
import imblearn
from imblearn.over_sampling import RandomOverSampler

In [None]:
# Count of records before oversampling
print("Before Upsampling, count of pass '-1':{}".format(sum(y_train==-1)))
print("Before Upsampling, count of fail '1':{}".format(sum(y_train==1)))

In [None]:
def makeOverSamplesSMOTE(X,y):
 #Inputs
 #X → independent variables as a DataFrame
 #y → dependent variable as a pandas Series
 from imblearn.over_sampling import SMOTE
 sm = SMOTE()
 #fit_resample is the current imbalanced-learn method (fit_sample was removed)
 X, y = sm.fit_resample(X, y)
 return X,y

In [None]:
def makeOverSamplesADASYN(X,y):
 #Inputs
 #X → independent variables as a DataFrame
 #y → dependent variable as a pandas Series
 from imblearn.over_sampling import ADASYN 
 sm = ADASYN()
 #fit_resample is the current imbalanced-learn method (fit_sample was removed)
 X, y = sm.fit_resample(X, y)
 return(X,y)

In [None]:
#Use the SMOTE technique to oversample
sm = SMOTE(sampling_strategy=1,k_neighbors=5,random_state = 1)

In [None]:
sm_x_train,sm_y_train = sm.fit_resample(x_train, y_train)

In [None]:
# Count of records after oversampling
print("After Upsampling,counts of label '-1':{}".format(sum(sm_y_train==-1)))
print("After Upsampling,counts of label '1':{}".format(sum(sm_y_train==1)))
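
Resampling only the training split, as done above, keeps the test set at its natural class ratio. The sketch below illustrates that pattern on synthetic data using plain random oversampling from `sklearn.utils.resample`. Note that this is a simpler stand-in and not SMOTE: it duplicates existing minority rows instead of synthesizing new ones, and all names and sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # synthetic features
y = np.array([-1] * 180 + [1] * 20)           # synthetic imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Random oversampling of the minority class, applied to the training split only
minority = y_tr == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(X_tr[minority], y_tr[minority],
                              replace=True, n_samples=n_majority, random_state=0)

X_bal = np.vstack([X_tr[~minority], X_min_up])
y_bal = np.concatenate([y_tr[~minority], y_min_up])

print("train counts:", (y_bal == -1).sum(), (y_bal == 1).sum())
print("test counts: ", (y_te == -1).sum(), (y_te == 1).sum())
```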

In [None]:
pipe_lr2 = Pipeline([('pca', PCA(n_components=10)), ('lr', LogisticRegression(random_state=1))]) 
pipe_lr2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - Logistic Regression: %.4f' % pipe_lr2.score(x_test, y_test))

pipe_rf2 = Pipeline([('pca', PCA(n_components=10)), ('rf', RandomForestClassifier(n_estimators=50,
                                                                                random_state=1))]) 
pipe_rf2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - Random Forest: %.4f' % pipe_rf2.score(x_test, y_test)) 

pipe_xgb2 = Pipeline([('pca', PCA(n_components=10)), ('xg',XGBClassifier(random_state=1))]) 
pipe_xgb2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - XG Boost: %.4f' % pipe_xgb2.score(x_test, y_test)) 


In [None]:
#We can see that Logistic Regression has an accuracy of 73%,
#while Random Forest and XG Boost have much better accuracy of 87% and 86% 
#Based on the confusion matrix, more observations are classified under 1 now compared to earlier:
#13, 6 and 8 observations for Logistic Regression, Random Forest and XG Boost, so Logistic Regression is better
#even though its accuracy is lower
for model, model_name in zip([pipe_lr2,pipe_rf2, pipe_xgb2], ['Logistic Regression','Random Forest', 
                                                      'XG Boost']):
    disp_confusion_matrix(model_name, model, x_test, y_test);

In [None]:
#Using Cluster centroids method to undersample
from collections import Counter
from imblearn.under_sampling import ClusterCentroids 
print('Original dataset shape {}'.format(Counter(y_train)))
cc = ClusterCentroids(random_state=1)
x_res, y_res = cc.fit_resample(x_train, y_train)

In [None]:
# Count of records after downsampling
print("After Downsampling,counts of label '-1':{}".format(sum(y_res==-1)))
print("After Downsampling,counts of label '1':{}".format(sum(y_res==1)))

In [None]:
pipe_lr3 = Pipeline([('pca', PCA(n_components=10)), ('lr', LogisticRegression(random_state=1))]) 
pipe_lr3.fit(x_res, y_res) 
print('Test Accuracy - Logistic Regression: %.4f' % pipe_lr3.score(x_test, y_test))

pipe_rf3 = Pipeline([('pca', PCA(n_components=10)), ('rf', RandomForestClassifier(n_estimators=50,
                                                                                random_state=1))]) 
pipe_rf3.fit(x_res, y_res) 
print('Test Accuracy - Random Forest: %.4f' % pipe_rf3.score(x_test, y_test)) 

pipe_xgb3 = Pipeline([('pca', PCA(n_components=10)), ('xg',XGBClassifier(random_state=1))]) 
pipe_xgb3.fit(x_res, y_res) 
print('Test Accuracy - XG Boost: %.4f' % pipe_xgb3.score(x_test, y_test)) 

In [None]:
#We can see that Logistic Regression has an accuracy of 33%,
#while Random Forest and XG Boost have better accuracy of 44% and 49% 
#Based on the confusion matrix, more observations are classified under 1 now compared to upsampling:
#20, 17 and 17 observations for Logistic Regression, Random Forest and XG Boost, so Logistic Regression is better
#even though its accuracy is lower
for model, model_name in zip([pipe_lr3,pipe_rf3, pipe_xgb3], ['Logistic Regression','Random Forest', 
                                                      'XG Boost']):
    disp_confusion_matrix(model_name, model, x_test, y_test);

In [None]:
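#Try using KFold cross validation with Upsampling as we have more accuracy there
#Note: cross_val_score below refits each pipeline on the original x, y (no upsampling inside the folds)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

num_folds = 50
seed = 1

#random_state only takes effect when shuffle=True (recent scikit-learn versions raise an error otherwise)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)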
#model = LogisticRegression()

pipe_lr2 = Pipeline([('pca', PCA(n_components=10)), ('lr', LogisticRegression(random_state=1))]) 
pipe_lr2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - Logistic Regression: %.4f' % pipe_lr2.score(x_test, y_test))
results = cross_val_score(pipe_lr2, x, y, cv=kfold)
#print(results)
#pred = cross_val_predict(pipe_lr2,sm_x_train, sm_y_train, cv=kfold)
print("CV Accuracy - Logistic Regression: %.4f (%.4f)" % (results.mean(), results.std()))


pipe_rf2 = Pipeline([('pca', PCA(n_components=10)), ('rf', RandomForestClassifier(n_estimators=50,
                                                                                random_state=1))]) 
pipe_rf2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - Random Forest: %.4f' % pipe_rf2.score(x_test, y_test)) 
results = cross_val_score(pipe_rf2, x, y, cv=kfold)
#print(results)
#pred = cross_val_predict(pipe_rf2,sm_x_train, sm_y_train, cv=kfold)
print("CV Accuracy - Random Forest: %.4f (%.4f)" % (results.mean(), results.std()))


pipe_xgb2 = Pipeline([('pca', PCA(n_components=10)), ('xg',XGBClassifier(random_state=1))]) 
pipe_xgb2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - XG Boost: %.4f' % pipe_xgb2.score(x_test, y_test)) 
results = cross_val_score(pipe_xgb2, x, y, cv=kfold)
#print(results)
#pred = cross_val_predict(pipe_xgb2,sm_x_train, sm_y_train, cv=kfold)
print("CV Accuracy - XG Boost: %.4f (%.4f)" % (results.mean(), results.std()))
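
With 50 folds on 1567 rows, each validation fold holds only about 31 observations, so a plain KFold split can produce folds with very few or no failures. A stratified split that also reports recall on the fail class is one possible alternative. The sketch below is an illustration under assumptions, not the notebook's method: it uses a synthetic dataset from `make_classification`, 5 folds, and `class_weight='balanced'` in place of resampling. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor data (roughly 7% positives); not the real dataset.
X_demo, y_demo = make_classification(n_samples=600, n_features=30, n_informative=8,
                                     weights=[0.93, 0.07], random_state=1)

# Scaling and PCA live inside the pipeline so they are refit on each training fold.
pipe_demo = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('lr', LogisticRegression(class_weight='balanced', max_iter=1000, random_state=1)),
])

# Stratification keeps the class ratio roughly constant across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_out = cross_validate(pipe_demo, X_demo, y_demo, cv=skf,
                        scoring=['accuracy', 'recall'])

print('CV accuracy: %.4f' % cv_out['test_accuracy'].mean())
print('CV recall (minority class): %.4f' % cv_out['test_recall'].mean())
```

The `'recall'` scorer treats label 1 as the positive class, which matches the notebook's coding of 1 as fail.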

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from time import time

#disp_confusion_matrix(model_name, model, x_test, y_test);

In [None]:
#Except for XG Boost, none of the models predicted a single observation as a failure
for model, model_name in zip([pipe_lr2,pipe_rf2, pipe_xgb2], ['Logistic Regression','Random Forest', 
                                                      'XG Boost']):
    y_pred = cross_val_predict(model, x, y, cv=kfold)
    conf_mat = confusion_matrix(y, y_pred)
    print(conf_mat)

In [None]:
#Using Grid Search to search the hyper parameter space
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
rf = RandomForestClassifier(n_estimators=50,random_state=1)                                                                             
grid_search = GridSearchCV(rf, param_grid=param_grid)
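start = time()
grid_search.fit(sm_x_train, sm_y_train)
#Report how long the search took and how many parameter combinations were tried
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))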
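
Before running an exhaustive search it can help to count how many models will be fit. The sketch below does that arithmetic for the grid above. The name `demo_grid` is hypothetical, and the fold count of 5 is an assumption: it is scikit-learn's default when `cv` is not passed (version 0.22 and later).

```python
from math import prod

# Same values as the param_grid above, copied under a separate (hypothetical) name.
demo_grid = {"max_depth": [3, None],
             "max_features": [1, 3, 10],
             "min_samples_split": [2, 3, 10],
             "min_samples_leaf": [1, 3, 10],
             "bootstrap": [True, False],
             "criterion": ["gini", "entropy"]}

# An exhaustive grid tries every combination: multiply the number of options per key.
n_candidates = prod(len(values) for values in demo_grid.values())

cv_folds = 5  # assumption: GridSearchCV default when cv is not given
total_fits = n_candidates * cv_folds

print('Parameter combinations:', n_candidates)
print('Cross-validation fits:', total_fits)
```

Each of those fits trains a 50-tree forest, and one more fit follows when the best estimator is refit on the full training set.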

In [None]:
#Get the best parameters for Random Forest
grid_search.best_params_

In [None]:
#Mean Test Scores across the models
grid_search.cv_results_['mean_test_score']

In [None]:
#Best Model Parameters
grid_search.best_estimator_

In [None]:
#Construct RandomForest with the best model parameters
pipe_rf2 = Pipeline([('pca', PCA(n_components=10)), ('rf', RandomForestClassifier(bootstrap=False, max_features=3, min_samples_split=3,
                       n_estimators=50, random_state=1))]) 
pipe_rf2.fit(sm_x_train, sm_y_train) 
print('Test Accuracy - Random Forest: %.4f' % pipe_rf2.score(x_test, y_test))

In [None]:
#Able to classify 6 observations correctly
disp_confusion_matrix('Random Forest - Grid Search', pipe_rf2, x_test, y_test);

In [None]:
!pip install rfpimp

In [None]:
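#Identify which features matter most to the Random Forest pipeline using permutation importance
#Note: the metric passed to rfpimp below is R^2; for a classifier, accuracy is the more usual choice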
from sklearn.metrics import r2_score
from rfpimp import permutation_importances

def r2(rf, X_train, y_train):
    return r2_score(y_train, rf.predict(X_train))

perm_imp_rfpimp = permutation_importances(pipe_rf2, sm_x_train, sm_y_train, r2)
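
As a possible cross-check, scikit-learn ships its own `permutation_importance` in `sklearn.inspection`, which accepts a classification metric such as accuracy and can be evaluated on held-out data. The sketch below is an illustration on synthetic data, not a replacement for the rfpimp result above. All names ending in `_demo` and the data itself are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 8 features; not the sensor dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=1)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle each column n_repeats times and record the drop in held-out accuracy.
result = permutation_importance(rf_demo, X_te, y_te, scoring='accuracy',
                                n_repeats=10, random_state=1)

# Rank features from most to least important by mean accuracy drop.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking[:3]:
    print('feature_%d: %.4f +/- %.4f' % (idx, result.importances_mean[idx],
                                         result.importances_std[idx]))
```

Measuring the drop on held-out rows rather than training rows reduces the chance of rewarding features the forest has merely memorised.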

In [None]:
#Get the feature importance
perm_imp_rfpimp.Importance.plot(kind="bar",figsize=(50,20))

In [None]:
#Feature Importance is as below
perm_imp_rfpimp

In [None]:
#Try OneClassSVM with oversampled train data with SMOTE
from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel ='rbf', degree=3, gamma=0.1,nu=0.005, max_iter=-1)

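#OneClassSVM is unsupervised: the y argument is accepted but ignored
model.fit(sm_x_train, sm_y_train)
#Use predict here; fit_predict would refit the model on the test data
y_pred = model.predict(x_test)
#Share of test observations labelled -1 (OneClassSVM marks outliers as -1, inliers as 1)
accuracy = (len(y_pred[y_pred == -1])/len(y_pred))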
print('Test Accuracy - OneClassSVM (Oversampled): %.4f' % accuracy)
#print(len(y_pred[y_pred == -1]))
#print(len(y_pred))
#print(accuracy)

In [None]:
# evaluating the model
# printing the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm ,annot = True, cmap = 'summer')

In [None]:
#Try OneClassSVM with undersampled train data
from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel ='rbf', degree=3, gamma=0.1,nu=0.005, max_iter=-1)

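#OneClassSVM is unsupervised: the y argument is accepted but ignored
model.fit(x_res, y_res)
#Use predict here; fit_predict would refit the model on the test data
y_pred = model.predict(x_test)
#Share of test observations labelled -1 (OneClassSVM marks outliers as -1, inliers as 1)
accuracy = (len(y_pred[y_pred == -1])/len(y_pred))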
print('Test Accuracy - OneClassSVM (Undersampled): %.4f' % accuracy)

In [None]:
# evaluating the model
# printing the confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm ,annot = True, cmap = 'summer')
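
OneClassSVM is a novelty detector: it learns the shape of one class and returns 1 for inliers and -1 for outliers, which is the opposite sign convention to this dataset (-1 = pass, 1 = fail). One common way to use it, sketched below under assumptions, is to fit on pass observations only and then map its output back to the dataset's labels before scoring. The data are synthetic, the names are hypothetical, and the `nu` and `gamma` values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
# Synthetic stand-in data: "pass" rows cluster near 0, "fail" rows are shifted away.
X_pass = rng.normal(0.0, 1.0, size=(300, 5))
X_fail = rng.normal(4.0, 1.0, size=(20, 5))

X_train_pass = X_pass[:200]                  # fit on pass rows only
X_eval = np.vstack([X_pass[200:], X_fail])   # held-out pass rows plus all fail rows
y_eval = np.array([-1] * 100 + [1] * 20)     # notebook coding: -1 = pass, 1 = fail

scaler = StandardScaler().fit(X_train_pass)
ocsvm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
ocsvm.fit(scaler.transform(X_train_pass))

raw_pred = ocsvm.predict(scaler.transform(X_eval))   # 1 = inlier, -1 = outlier
y_pred_demo = np.where(raw_pred == 1, -1, 1)         # inlier -> pass, outlier -> fail

print(confusion_matrix(y_eval, y_pred_demo, labels=[-1, 1]))
print('Fail recall: %.4f' % recall_score(y_eval, y_pred_demo, pos_label=1))
```

After the mapping, the confusion matrix and recall compare like with like, so they can be read the same way as those of the supervised models.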

In [None]:
#Summary
#We tried Logistic Regression, Random Forest and XG Boost on the imbalanced classes.
#Logistic Regression had the lowest accuracy, while Random Forest and XG Boost performed similarly.
#On the positive side, Logistic Regression classified more observations as failures than the other two algorithms.
#We tried two sampling techniques: SMOTE (oversampling) and a centroid-based method (undersampling).
#Oversampling gave better accuracy than undersampling, but undersampling classified more observations
#in the minority class.
#We applied Z-score scaling to both datasets and used PCA with n_components set to 10.
#K-fold cross validation improved accuracy a fair bit, to about 93%,
#but the models continued to misclassify the minority class.
#We also used Grid Search to tune the Random Forest hyperparameters, which gave 89% accuracy.
#Feature importance showed feature_64, feature_55 and feature_45 to be the top three features.
#Lastly, we tried OneClassSVM on the undersampled and oversampled data, with similar accuracy of about 84%.
#We were not able to exceed 93.4% accuracy when trying to improve the classifier
#on the failure observations.