# Case Study of equipment’s signal quality

PROJECT OBJECTIVE: Build a regressor that uses the measured signal parameters to predict the signal strength or 
quality.
DOMAIN:  Electronics and Telecommunication 
CONTEXT: A communications equipment manufacturing company has a product that emits informative signals. 
The company wants a machine learning model that predicts the equipment’s signal quality from 
various parameters.
DATA DESCRIPTION: The data set contains information on various signal tests performed: 
        1. Parameters: Various measurable signal parameters. 
        2. Signal_Quality: Final signal strength or quality (stored in the Signal_Strength column used below) 

In [None]:
#%tensorflow_version 2.x
import tensorflow as tf
tf.__version__

1. Import data. 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats 
import matplotlib.pyplot as plt
from tensorflow import keras
#from keras.models import Sequential
#from keras.layers import Dense
#from sklearn.model_selection import StratifiedKFold
%matplotlib inline
#Test Train Split
from sklearn.model_selection import train_test_split
#Feature Scaling library
from sklearn.preprocessing import StandardScaler
#import pickle
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras import regularizers, optimizers
from sklearn.metrics import r2_score
from tensorflow.keras.models import load_model

In [None]:
# Seed the Python, NumPy and TensorFlow random number generators for reproducibility
import random
seed = 7
random.seed(seed)
np.random.seed(seed)
tf.random.set_seed(seed)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
#Read the data as a data frame
mydata = pd.read_csv('../input/part-123-signalcsv/Part- 123 - Signal.csv')
mydata.head(20)

2. Data analysis & visualisation 

In [None]:
# Shape of the data 
mydata.shape

There are 1599 rows and 12 columns in data


In [None]:
# Data type of each attribute 
mydata.info()   # it gives information about the data and data types of each attribute

All the parameters are floating point; only Signal_Strength is an integer.

In [None]:
# Checking the presence of missing values
null_counts = mydata.isnull().sum()  # This prints the columns with the number of null values they have
print (null_counts)

There are no null values in the data

In [None]:
# 5 point summary of numerical attributes
mydata.describe()

Looking at the 11 parameters:
Parameter 3 ranges between 0 and 1.
The maximum value of Parameter 5 is 0.6.
Parameter 8 has a very narrow range, between 0.9 and 1.004.
Standard deviation is lowest for Parameter 8 (0.001887).
'Signal_Strength' takes the values 3.5, 4.0, 5.0, 6.0, 7.0 and 7.5. The non-integer values 3.5 and 7.5 suggest this summary was produced after the outlier capping applied later in the notebook.

In [None]:
# studying the distribution of each attribute
cols = list(mydata)
for c in cols:
    # histplot with a KDE overlay is used because distplot is deprecated in recent seaborn releases
    sns.histplot(mydata[c], kde=True, stat='density', color='blue')
    plt.show()
    print('Distribution of ', c)
    print('Mean is:', mydata[c].mean())
    print('Median is:', mydata[c].median())
    print('Mode is:', mydata[c].mode())
    print('Standard deviation is:', mydata[c].std())
    print('Skewness is:', mydata[c].skew())
    print('Maximum is:', mydata[c].max())
    print('Minimum is:', mydata[c].min())

Mean, median and mode are close to each other for every column except Parameter 7.
Parameter 3 is trimodal, and Signal_Strength takes only a few discrete values, so it behaves like a class label.
All of them are positively skewed.
Standard deviation is highest for Parameter 7 (32.895324478299074).
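
To make the link between skewness and the gap between mean and median concrete, here is a minimal sketch on made-up values (not the signal data). The series `sample` is purely hypothetical; the point is only that a long right tail pulls the mean above the median and gives a positive `skew()`.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large.
sample = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 10, 20])

mean = sample.mean()
median = sample.median()
skew = sample.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean:.2f}, median={median:.2f}, skew={skew:.2f}")
```

When mean and median nearly coincide, as reported for most parameters above, the skew is usually mild even if it is positive.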

In [None]:
sns.countplot(x=mydata['Signal_Strength'])    # Distribution of the column 'Signal_Strength'
plt.show()

Class 5.0 in 'Signal_Strength' has the highest count.

In [None]:
#plt.figure(figsize = (50,50))
sns.pairplot(mydata,diag_kind='kde')
plt.show()

1. Parameter 6 and Parameter 7 are highly correlated with each other and have almost zero correlation with the other parameters.
2. Parameter 1 is positively correlated with Parameter 3 and Parameter 8 and negatively correlated with Parameter 2 and Parameter 9.
3. Parameter 4 has very low correlation with the other parameters.

In [None]:
# Checking the presence of outliers
l = len(mydata)
col = list(mydata.columns)
for i in np.arange(len(col)):
    sns.boxplot(x=mydata[col[i]], color='cyan')
    plt.show()
    print('Boxplot of ', col[i])
    # quartiles and whisker limits (1.5 * IQR rule) for this attribute
    Q1 = mydata[col[i]].quantile(0.25)
    Q2 = mydata[col[i]].quantile(0.50)
    Q3 = mydata[col[i]].quantile(0.75)
    IQR = Q3 - Q1
    L_W = Q1 - 1.5 * IQR
    U_W = Q3 + 1.5 * IQR
    print('Q1 is : ', Q1)
    print('Q2 is : ', Q2)
    print('Q3 is : ', Q3)
    print('IQR is:', IQR)
    print('Lower Whisker, Upper Whisker : ', L_W, ',', U_W)
    # rows outside the whiskers count as outliers
    bools = (mydata[col[i]] < L_W) | (mydata[col[i]] > U_W)
    print('Out of ', l, ' rows in data, number of outliers are:', bools.sum())

Parameter 4 has the highest number of outliers which is 155.
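
The whisker rule used in the loop above can be sketched on its own as follows. The helper name `iqr_bounds` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker limits given by the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values, not taken from the signal data.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
lower, upper = iqr_bounds(toy)

# A value is flagged when it falls outside the whiskers.
n_outliers = int(((toy < lower) | (toy > upper)).sum())
print(lower, upper, n_outliers)
```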

In [None]:
# Function to detect, and optionally cap, outliers using the 1.5 * IQR rule
def detect_treate_outliers(df, operation):
    rows = []  # one summary tuple per column that contains outliers
    for col in df.columns:
        # only int64/float64 columns are checked; a column named 'HR' is skipped if present
        if (df[col].dtype == 'int64' or df[col].dtype == 'float64') and (col != 'HR'):
            IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
            lower_boundary = df[col].quantile(0.25) - (1.5 * IQR)
            upper_boundary = df[col].quantile(0.75) + (1.5 * IQR)
            up_cnt = df[df[col] > upper_boundary][col].shape[0]
            lw_cnt = df[df[col] < lower_boundary][col].shape[0]
            if (up_cnt + lw_cnt) > 0:
                rows.append((col, IQR, lower_boundary, upper_boundary, up_cnt + lw_cnt))
                if operation == 'update':
                    # cap values outside the whiskers at the whisker limits (modifies df in place)
                    df.loc[df[col] > upper_boundary, col] = upper_boundary
                    df.loc[df[col] < lower_boundary, col] = lower_boundary
    ndf = pd.DataFrame(rows, columns=['Features', 'IQR', 'Lower Boundary', 'Upper Boundary', 'Outlier Count'])
    # 'update' returns the capped data; any other operation returns the per-column summary
    if operation == 'update':
        return (len(rows), df)
    else:
        return (len(rows), ndf)

In [None]:
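# Treat outliers by capping: values below the lower whisker are set to the lower whisker and
# values above the upper whisker are set to the upper whisker.
# Every numeric column is processed, including the target column Signal_Strength.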
count,df=detect_treate_outliers(mydata,'update')
if count>0:
    print('Updating dataset')
    mydata=df
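
For a single column, the same capping can be written more compactly with `Series.clip`. This is a minimal sketch on made-up values; the helper name `cap_to_whiskers` is hypothetical and it returns a new Series rather than modifying the data in place.

```python
import pandas as pd

def cap_to_whiskers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside the k * IQR whiskers at the whisker limits."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values, not taken from the signal data.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
capped = cap_to_whiskers(toy)

# The extreme value is pulled back to the upper whisker; the row count is unchanged.
print(capped.max(), len(capped))
```

Capping keeps every row, which matters for a data set of this size, but it also creates new boundary values that did not exist in the raw data.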

In [None]:
# studying correlation between the attributes
b_corr=mydata.corr()
plt.subplots(figsize =(12, 7)) 
sns.heatmap(b_corr,annot=True)

Treating a correlation coefficient between ±0.50 and ±1 as high:
Parameter 1 is highly correlated with Parameter 3, Parameter 8 and Parameter 9.
Parameter 6 and Parameter 7 are highly correlated.
However, since none of these correlations is as high as about 0.8, no features are dropped.
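
Reading strong pairs off a heatmap by eye is error-prone, so a small helper can list them instead. The sketch below runs on synthetic columns; the function name `strong_pairs` and the 0.5 default threshold are assumptions that mirror the rule of thumb stated above.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.5):
    """List column pairs whose absolute Pearson correlation is at least threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

# Synthetic data: b is almost a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = strong_pairs(toy)
print(pairs)
```

Calling `strong_pairs(mydata, threshold=0.8)` would be one way to check the decision not to drop any feature.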

3. Design, train, tune and test a neural network regressor. 

In [None]:
X = mydata.drop("Signal_Strength", axis=1)
y = mydata['Signal_Strength']

In [None]:
from sklearn.model_selection import train_test_split

# splitting to create test data
X_vtrain, X_test, y_vtrain, y_test = train_test_split(X, y, test_size=.30, random_state=seed)

In [None]:
X_vtrain.shape

In [None]:
# splitting to create training and validation data
X_train, X_val, y_train, y_val = train_test_split(X_vtrain, y_vtrain, test_size=.20, random_state=seed)

In [None]:
X_train.shape
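
The two successive splits leave roughly 0.7 × 0.8 = 56% of the rows for training, 0.7 × 0.2 = 14% for validation and 30% for testing. The sketch below reproduces the split sizes on placeholder arrays with the same row count as reported by `mydata.shape`; the `_demo` arrays are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1599  # row count reported by mydata.shape
X_demo = np.zeros((n_rows, 11))  # placeholder features
y_demo = np.zeros(n_rows)        # placeholder target

# First split: hold out 30% as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=7)
# Second split: take 20% of the remainder as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.20, random_state=7)

print(len(X_tr), len(X_va), len(X_te))
```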

In [None]:
# Initialize Sequential model
model_reg = tf.keras.models.Sequential()

# Normalize input data
model_reg.add(tf.keras.layers.BatchNormalization(input_shape=(11,)))

# Add final Dense layer for prediction - Tensorflow.keras declares weights and bias automatically
model_reg.add(tf.keras.layers.Dense(1))

In [None]:
# Compile the model - add mean squared error as loss and stochastic gradient descent as optimizer
model_reg.compile(optimizer='sgd', loss='mse')


In [None]:
model_reg.fit(X_train, y_train, validation_data=(X_val,y_val),epochs=100, batch_size=10)
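
The network above has no hidden layer: a normalization layer feeds a single Dense unit, so at prediction time it computes a linear (affine) function of the inputs. The NumPy sketch below shows the underlying idea, gradient descent on the mean squared error of a linear model, on synthetic data. It uses full-batch updates and a fixed standardization instead of mini-batch SGD and batch normalization, so it is an analogy under those assumptions, not a reimplementation of the Keras model.

```python
import numpy as np

# Synthetic regression problem with known coefficients (not the signal data).
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.7 + rng.normal(scale=0.01, size=500)

# Standardize the inputs once, standing in for the normalization layer.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

w = np.zeros(3)
b = 0.0
lr = 0.1  # learning rate
for _ in range(500):
    err = X_std @ w + b - y                     # residuals of the current model
    w -= lr * (2 * X_std.T @ err / len(y))      # gradient of the MSE with respect to w
    b -= lr * (2 * err.mean())                  # gradient of the MSE with respect to b

mse = float(np.mean((X_std @ w + b - y) ** 2))
print(mse)
```

Adding one or more hidden Dense layers with a non-linear activation is what would turn this into a neural network regressor in the usual sense.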

4. Pickle the model for future use.

In [None]:
# save the model
model_reg.save("model_reg.h5") #using h5 extension
print("model saved!!!")

In [None]:
# load the model
model_rr = load_model('model_reg.h5')

Trying to pickle the model raises:
TypeError: cannot pickle 'weakref' object
Working around the 'weakref' object would require importing dill and weakref, and the model still cannot be saved with pickle, so 
I have used save() to save the model and load_model() to load it.


In [None]:
# Save the model to a file in the current working directory (pickle attempt, kept for reference)

#Pkl_Filename = "Pickle_RR_Model.pkl"  
#with open(Pkl_Filename, 'wb') as file:  
#    pickle.dump(model_reg, file)

In [None]:
# Load the Model back from file

#with open(Pkl_Filename, 'rb') as file:  
#    Pickled_RR_Model = pickle.load(file)

#Pickled_RR_Model

In [None]:
y_pred = model_rr.predict(X_test)

In [None]:
# print the first five predictions
for pred in y_pred[:5]:
    print(pred)


In [None]:
print(y_test.head())

The first 5 elements of y_pred and y_test are close.

In [None]:
score_r = r2_score(y_test,y_pred)
print(score_r)
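
For reference, the R² score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A minimal NumPy sketch on made-up numbers follows; `r2_manual` is a hypothetical helper, and the `ravel()` call is there because Keras predictions typically come back with shape (n, 1).

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()  # flatten (n, 1) predictions
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions, not model output.
y_true_demo = [5, 6, 5, 7, 4]
y_pred_demo = [[5.2], [5.7], [5.1], [6.4], [4.6]]
score_demo = r2_manual(y_true_demo, y_pred_demo)
print(score_demo)
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for a model that does worse than that baseline.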

In [None]:
#summary of regression model
model_rr.summary()

# Part 2

# PROJECT OBJECTIVE: Build a classifier that uses these parameters to determine the signal strength or quality.

Steps 1 and 2 are the same as for the regressor above.

3. Design, train, tune and test a neural network classifier.

In [None]:
# counting the number of classes in output
mydata['Signal_Strength'].value_counts()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# One-hot encode the target; 8 columns cover the integer class indices 0 through 7
yc = to_categorical(y, num_classes=8)
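
What one-hot encoding produces can be sketched with plain NumPy. The helper `one_hot` below is a hypothetical stand-in rather than the Keras implementation, and it assumes labels are cast to integers before indexing, so a non-integer label such as 3.5 is truncated to 3 in this sketch. It is worth confirming how such labels are mapped in the real pipeline.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map each label to a row with a single 1 at the label's integer index."""
    idx = np.asarray(labels).astype(int)  # assumption: integer cast truncates 3.5 -> 3
    return np.eye(num_classes, dtype="float32")[idx]

# Hypothetical labels, including two non-integer values.
labels_demo = [3.5, 5.0, 7.5]
encoded = one_hot(labels_demo, num_classes=8)

print(encoded.shape)           # one row per label, one column per class index
print(encoded.argmax(axis=1))  # recovers the integer index of each label
```

Columns for indices that never occur in the data stay all zero, so the softmax layer carries a few outputs that are never the correct class.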

In [None]:
# splitting the one-hot encoded data to create the test set
Xcv_train, Xc_test, ycv_train, yc_test = train_test_split(X, yc, test_size=.30, random_state=seed)

In [None]:
print("Shape of ycv_train:", ycv_train.shape)
print("One value of ycv_train:", ycv_train[0])

In [None]:
# splitting the remaining data into training and validation sets
Xc_train, Xc_val, yc_train, yc_val = train_test_split(Xcv_train, ycv_train, test_size=.20, random_state=seed)

In [None]:
print("Shape of y_train:", yc_train.shape)
print("One value of y_train:", yc_train[0])

In [None]:
model_class = Sequential()
model_class.add(Dense(11, activation='relu'))
model_class.add(Dense(8, activation='relu'))
model_class.add(Dense(8, activation='softmax'))

In [None]:
# Compile the model
model_class.compile(loss="categorical_crossentropy", metrics=["accuracy"], optimizer="sgd")

# Fit the model
model_class.fit(x=Xc_train, y=yc_train, batch_size=20, epochs=100, validation_data=(Xc_val, yc_val))

4. Pickle the model for future use.

In [None]:
# save the model
model_class.save("model_class.h5") #using h5 extension
print("model saved!!!")

In [None]:
# load the model
model_cl = load_model('model_class.h5')

In [None]:
# calculate score of training data
score = model_cl.evaluate(Xc_train, yc_train, verbose=0)
print(score)

In [None]:
# score of test data
score_t = model_cl.evaluate(Xc_test, yc_test, verbose=0)
print( score_t)
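
With `metrics=["accuracy"]`, `evaluate` returns a list holding the loss followed by the accuracy. The sketch below computes both quantities by hand with NumPy on made-up softmax outputs, to show what the two numbers mean. The function names and the `eps` clipping constant are illustrative assumptions.

```python
import numpy as np

def categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))

def accuracy(y_true, probs):
    """Share of rows where the most probable class matches the one-hot label."""
    return float(np.mean(y_true.argmax(axis=1) == probs.argmax(axis=1)))

# Hypothetical one-hot labels and predicted probabilities for three samples.
y_true_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
probs_demo = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3]])

loss_demo = categorical_crossentropy(y_true_demo, probs_demo)
acc_demo = accuracy(y_true_demo, probs_demo)
print(loss_demo, acc_demo)
```

Comparing the training and test scores printed above is a quick check for overfitting: a large gap between the two accuracies would point to it.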

In [None]:
#summary of classification model
model_cl.summary()