# Confusion Matrix
A confusion matrix is a table that summarizes how a classification model's predictions compare with the known true values of a test set. It shows not only how many predictions were correct, but also which kinds of errors the model makes.

For a binary classifier, the confusion matrix contains four counts:

- True positives (TP): cases where the actual value is positive and the predicted value is also positive.
- False positives (FP): cases where the actual value is negative, but the predicted value is positive.
- False negatives (FN): cases where the actual value is positive, but the predicted value is negative.
- True negatives (TN): cases where the actual value is negative and the predicted value is also negative.

A typical layout for the confusion matrix is as follows:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

The values in the cells of the confusion matrix provide important information about the performance of a classification algorithm. For example, the true positive rate (TPR) or sensitivity is given by TP / (TP + FN), and the false positive rate (FPR) or fall-out is given by FP / (FP + TN).
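
The two formulas above can be sketched in a few lines of Python. The helper name `rates` and the counts passed to it are made up for illustration; they are not taken from the dataset used later in this notebook.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) computed from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity: share of actual positives that were found
    fpr = fp / (fp + tn)  # fall-out: share of actual negatives flagged as positive
    return tpr, fpr

# Illustrative counts only (hypothetical, not from the notebook's data)
tpr, fpr = rates(tp=40, fn=10, fp=5, tn=45)
print(tpr, fpr)  # 0.8 0.1
```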





# Random Over Sampler
RandomOverSampler is a resampling technique for balancing the class distribution of a dataset. It is useful when the training data contains far fewer instances of one class than of the others, a common situation in classification problems.

It works by randomly selecting instances from the minority class, with replacement, and adding duplicate copies of them until the minority class has as many instances as the majority class. Each minority instance has an equal chance of being selected for duplication. No new information is created, but the more balanced training set can help a classifier pay more attention to the minority class.
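
To make the duplication step concrete, here is a minimal pure-Python sketch of random oversampling. It is an illustration of the idea only, not the imblearn implementation; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Row positions belonging to this class
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample with replacement; k is 0 for the majority class
        extra = rng.choices(idx, k=target - count)
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out

# Toy data (hypothetical): four "Pass" rows and two "Fail" rows
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y = ["Pass", "Pass", "Pass", "Pass", "Fail", "Fail"]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 4 rows
```

Every added row is an exact copy of an existing minority row, which is why oversampling balances the classes without adding new information.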

RandomOverSampler can be implemented in Python using the imblearn library. Here's an example code snippet that shows how to use the RandomOverSampler technique to balance the class distribution of a dataset:

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)

In this code, X is the feature matrix of the original dataset and y is the corresponding target vector. The RandomOverSampler object is created with random_state set to 0 so that the result is reproducible. The fit_resample method then performs the oversampling and returns the resampled feature matrix (X_resampled) and target vector (y_resampled). Resampling should be applied to the training data only, so that the test set keeps the original class distribution.





In [4]:
# Core data libraries
import numpy as np
import pandas as pd
# Plotting libraries
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
# Train/test splitting and feature scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Oversampling for the imbalanced target
from imblearn.over_sampling import RandomOverSampler
# Model and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings(action='ignore')

# Loading the Dataset

In [6]:
df=pd.read_csv('/kaggle/input/uci-semcom/uci-secom.csv')
#showing the dataset
df

Unnamed: 0,Time,0,1,2,3,4,5,6,7,8,...,581,582,583,584,585,586,587,588,589,Pass/Fail
0,2008-07-19 11:55:00,3030.93,2564.00,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,1.5005,...,,0.5005,0.0118,0.0035,2.3630,,,,,-1
1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,1.4966,...,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.0060,208.2045,-1
2,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,1.4436,...,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602,1
3,2008-07-19 14:43:00,2988.72,2479.90,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,1.4882,...,73.8432,0.4990,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432,-1
4,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.5200,1.5334,100.0,100.3967,0.1235,1.5031,...,,0.4800,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1562,2008-10-16 15:13:00,2899.41,2464.36,2179.7333,3085.3781,1.4843,100.0,82.2467,0.1248,1.3424,...,203.1720,0.4988,0.0143,0.0039,2.8669,0.0068,0.0138,0.0047,203.1720,-1
1563,2008-10-16 20:49:00,3052.31,2522.55,2198.5667,1124.6595,0.8763,100.0,98.4689,0.1205,1.4333,...,,0.4975,0.0131,0.0036,2.6238,0.0068,0.0138,0.0047,203.1720,-1
1564,2008-10-17 05:26:00,2978.81,2379.78,2206.3000,1110.4967,0.8236,100.0,99.4122,0.1208,,...,43.5231,0.4987,0.0153,0.0041,3.0590,0.0197,0.0086,0.0025,43.5231,-1
1565,2008-10-17 06:01:00,2894.92,2532.01,2177.0333,1183.7287,1.5726,100.0,98.7978,0.1213,1.4622,...,93.4941,0.5004,0.0178,0.0038,3.5662,0.0262,0.0245,0.0075,93.4941,-1


# Getting the Preliminary Information

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Time to Pass/Fail
dtypes: float64(590), int64(1), object(1)
memory usage: 7.1+ MB


In [10]:
df.isna().sum().sum()

41951

# Creating the Preprocessing Function

In [38]:
def preprocess_inputs(df):
    """Clean the raw data, then split it into scaled train and test sets."""
    df = df.copy()
    # Drop the timestamp column; it is not used as a feature
    df = df.drop('Time', axis=1)
    # Drop columns in which 25 percent or more of the values are missing
    missing_value_columns = df.columns[df.isna().mean() >= 0.25]
    df = df.drop(missing_value_columns, axis=1)
    # Fill the remaining missing values with the mean of each column
    for column in df.columns:
        df[column] = df[column].fillna(df[column].mean())
    # Map the numeric target to readable labels (-1 = Pass, 1 = Fail)
    df['Pass/Fail'] = df['Pass/Fail'].replace({-1: 'Pass', 1: 'Fail'})
    # Separate the target from the features, then split 70% train / 30% test
    y = df['Pass/Fail']
    x = df.drop('Pass/Fail', axis=1)
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, shuffle=True)

    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    scaler.fit(x_train)
    x_train = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns)
    x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns)
    return x_train, x_test, y_train, y_test

In [39]:
x_train,x_test,y_train,y_test=preprocess_inputs(df)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(1096, 558)
(471, 558)
(1096,)
(471,)


In [40]:
x_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,576,577,582,583,584,585,586,587,588,589
0,-0.200281,-0.421654,0.057919,1.021407,-0.044666,0.0,-0.878635,0.264960,0.716638,0.497380,...,-0.239953,0.135944,0.590527,-0.417709,-0.343266,-0.402621,2.673832,0.210107,0.070672,-0.726511
1,-0.145113,-2.974374,-0.491014,0.619776,-0.047619,0.0,-0.413869,0.102483,-0.119735,-0.534902,...,-0.248876,0.408564,-0.665353,-0.007473,-0.025042,-0.007542,-0.661785,-0.413974,-0.034180,-0.014569
2,1.066664,2.108695,0.072615,-0.716576,-0.053433,0.0,0.112894,-0.135818,-0.422513,0.584520,...,4.741332,6.357410,0.444494,-0.215062,-0.320535,-0.208850,-0.734298,-0.391280,-0.488541,0.089615
3,0.231413,-0.433902,0.093336,-0.182054,-0.059399,0.0,-0.799686,0.102483,-0.437449,1.006817,...,-0.243253,-0.333695,-0.636146,0.180345,0.065879,0.176107,0.135863,0.505127,0.000771,-0.088571
4,2.537733,0.479453,1.511068,1.888743,-0.042825,0.0,-2.285158,-0.005836,-0.463246,1.469333,...,-0.270378,0.013917,1.203863,-0.234833,-0.025042,-0.232932,0.393688,-0.697647,-0.698246,-0.662703
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1091,-1.154283,0.236991,0.980594,0.373776,-0.049628,0.0,0.258582,0.189137,-0.533849,-1.446528,...,-0.237162,0.267805,-0.636146,-0.150809,-0.115963,-0.143099,-0.863211,0.527821,0.385230,1.115121
1092,-0.266070,0.740037,0.202971,-0.946108,-0.055258,0.0,2.562014,0.113314,-0.442880,0.323098,...,4.566131,4.012566,-0.110429,-0.131038,-0.184154,-0.126649,0.208376,0.425699,0.385230,-0.159046
1093,1.078663,0.123759,0.022129,-1.182704,-0.051307,0.0,0.876881,-0.070827,0.857844,-1.567184,...,-0.188056,1.966007,0.415288,0.778400,0.497754,0.742391,-0.863211,0.527821,0.385230,1.115121
1094,-0.963814,0.476328,-0.866264,-0.815029,-0.057665,0.0,0.393145,0.037491,1.182346,-2.264310,...,-0.264030,-0.696188,-0.840592,-0.022301,0.111340,-0.018043,0.667628,-0.005485,0.070672,-0.482522


In [37]:
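# Class balance of the full dataset (x is not defined at notebook level, so use df)
df['Pass/Fail'].replace({-1:'Pass',1:'Fail'}).value_counts()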

Pass    1463
Fail     104
Name: Pass/Fail, dtype: int64

In [34]:
# Confirm that no feature column still contains missing values
x_train.columns[x_train.isna().sum()>0]

Index([], dtype='object')

In [42]:
model=LogisticRegression()
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.89171974522293

In [44]:
y_pred=model.predict(x_test)

In [45]:
cm=confusion_matrix(y_test,y_pred)


In [46]:
cm

array([[  8,  21],
       [ 30, 412]])

In [47]:
y_test.value_counts()

Pass    442
Fail     29
Name: Pass/Fail, dtype: int64

# RandomOverSampling

In [51]:
oversampling=RandomOverSampler(random_state=1)
x_train_os,y_train_os=oversampling.fit_resample(x_train,y_train)


In [52]:
model_os=LogisticRegression()
model_os.fit(x_train_os,y_train_os)
model_os.score(x_test,y_test)

0.8471337579617835

In [53]:
y_pred=model_os.predict(x_test)

In [54]:
cm_os=confusion_matrix(y_test,y_pred)

In [55]:
cm_os

array([[  9,  20],
       [ 52, 390]])

In [56]:
cm

array([[  8,  21],
       [ 30, 412]])
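
The two matrices can be compared side by side with a small helper. This is a sketch that retypes the reported numbers by hand; the function name `summarize` is hypothetical, and the ("Fail", "Pass") row and column order is assumed from scikit-learn's sorted-label convention.

```python
# Matrices reported above; rows are actual classes, columns are predicted classes,
# in the assumed order ("Fail", "Pass").
cm = [[8, 21], [30, 412]]
cm_os = [[9, 20], [52, 390]]

def summarize(m):
    """Return the counts that matter most for the rare 'Fail' class."""
    total = sum(m[0]) + sum(m[1])
    return {
        "fails_caught": m[0][0],   # actual Fail predicted as Fail
        "fails_missed": m[0][1],   # actual Fail predicted as Pass
        "false_alarms": m[1][0],   # actual Pass predicted as Fail
        "fail_recall": m[0][0] / sum(m[0]),
        "accuracy": (m[0][0] + m[1][1]) / total,
    }

baseline = summarize(cm)
oversampled = summarize(cm_os)
for name in baseline:
    print(f"{name:>13}: {baseline[name]:8.3f} -> {oversampled[name]:8.3f}")
```

On these numbers, the oversampled model catches one more failure (9 instead of 8 out of 29) at the cost of 22 more false alarms (52 instead of 30), which is why accuracy drops from about 0.892 to about 0.847.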