**Attribute information:**

1. **target**: DIE (1), LIVE (2)
2. **age**: 10, 20, 30, 40, 50, 60, 70, 80
3. **gender**: male (1), female (2)

For the binary attributes 4 to 15 below: yes = 1, no = 2.

4. **steroid**: no, yes 
5. **antivirals**: no, yes 
6. **fatigue**: no, yes 
7. **malaise**: no, yes 
8. **anorexia**: no, yes 
9. **liverBig**: no, yes 
10. **liverFirm**: no, yes 
11. **spleen**: no, yes 
12. **spiders**: no, yes
13. **ascites**: no, yes 
14. **varices**: no, yes
15. **histology**: no, yes


16. **bilirubin**: 0.39, 0.80, 1.20, 2.00, 3.00, 4.00
17. **alk**: 33, 80, 120, 160, 200, 250
18. **sgot**: 13, 100, 200, 300, 400, 500
19. **albu**: 2.1, 3.0, 3.8, 4.5, 5.0, 6.0
20. **protime**: 10, 20, 30, 40, 50, 60, 70, 80, 90

Missing values (NAs) are represented by "?" in the raw file.

## Dataset Reading and Pre-Processing steps

Import the required libraries.

In [1]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

In [2]:
#Code to ignore warnings
import warnings
warnings.filterwarnings("ignore")

#### 1. Read the HEPATITIS dataset and check the data shapes

In [3]:
## Read "hepatitis.csv" using pandas
# target =  1: Die; 2: Live 
data = pd.read_csv("hepatitis.csv", na_values="?")

In [4]:
data.shape

(155, 21)

In [5]:
data.head()

Unnamed: 0,ID,target,age,gender,steroid,antivirals,fatigue,malaise,anorexia,liverBig,...,spleen,spiders,ascites,varices,bili,alk,sgot,albu,protime,histology
0,1,2,30,2,1.0,2,2.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,2,2,50,1,1.0,2,1.0,2.0,2.0,1.0,...,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,3,2,78,1,2.0,2,1.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,4,2,31,1,,1,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,5,2,34,1,2.0,2,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


#### 2. Check basic summary statistics of the data

In [6]:
data.describe()

Unnamed: 0,ID,target,age,gender,steroid,antivirals,fatigue,malaise,anorexia,liverBig,...,spleen,spiders,ascites,varices,bili,alk,sgot,albu,protime,histology
count,155.0,155.0,155.0,155.0,154.0,155.0,154.0,154.0,154.0,145.0,...,150.0,150.0,150.0,150.0,149.0,126.0,151.0,139.0,88.0,155.0
mean,78.0,1.793548,41.2,1.103226,1.506494,1.845161,1.350649,1.603896,1.792208,1.827586,...,1.8,1.66,1.866667,1.88,1.427517,105.325397,85.89404,3.817266,61.852273,1.451613
std,44.888751,0.40607,12.565878,0.30524,0.501589,0.362923,0.47873,0.490682,0.407051,0.379049,...,0.40134,0.475296,0.341073,0.32605,1.212149,51.508109,89.65089,0.651523,22.875244,0.499266
min,1.0,1.0,7.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,0.3,26.0,14.0,2.1,0.0,1.0
25%,39.5,2.0,32.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,...,2.0,1.0,2.0,2.0,0.7,74.25,31.5,3.4,46.0,1.0
50%,78.0,2.0,39.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,1.0,85.0,58.0,4.0,61.0,1.0
75%,116.5,2.0,50.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,1.5,132.25,100.5,4.2,76.25,2.0
max,155.0,2.0,78.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,2.0,2.0,2.0,2.0,8.0,295.0,648.0,6.4,100.0,2.0


#### 3. Check for value counts in target variable

In [7]:
data.target.value_counts()

target
2    123
1     32
Name: count, dtype: int64

#### 4. Check the datatype of each variable

In [8]:
data.dtypes

ID              int64
target          int64
age             int64
gender          int64
steroid       float64
antivirals      int64
fatigue       float64
malaise       float64
anorexia      float64
liverBig      float64
liverFirm     float64
spleen        float64
spiders       float64
ascites       float64
varices       float64
bili          float64
alk           float64
sgot          float64
albu          float64
protime       float64
histology       int64
dtype: object
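
The binary columns that contain missing values show up as `float64` above, while complete ones such as `antivirals` stay `int64`. The following minimal sketch illustrates why, using a tiny made-up CSV (the rows and the `csv_text` name are assumptions for illustration, not taken from `hepatitis.csv`): once "?" is read as `NaN`, pandas has to store the column as floats.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that uses "?" for missing values.
csv_text = "ID,target,age,steroid,antivirals\n1,2,30,1,2\n2,2,50,?,1\n"

demo = pd.read_csv(io.StringIO(csv_text), na_values="?")

# steroid has one missing value, so it becomes float64;
# antivirals is complete, so it stays int64.
print(demo.isna().sum())
print(demo.dtypes)
```

This is also why the categorical columns are cast back to integers after imputation and before creating dummy variables.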

In [9]:
unique_counts = data.nunique()
print(unique_counts)
# for column, count in unique_counts.items():
#     print(f"Column '{column}' has {count} unique values.")


ID            155
target          2
age            49
gender          2
steroid         2
antivirals      2
fatigue         2
malaise         2
anorexia        2
liverBig        2
liverFirm       2
spleen          2
spiders         2
ascites         2
varices         2
bili           34
alk            83
sgot           84
albu           29
protime        44
histology       2
dtype: int64


In [10]:
cat_cols = data.columns[data.nunique() < 5]

In [11]:
num_cols = data.columns[data.nunique() >= 5]

#### 5. Drop columns which are not significant

In [12]:
data.drop(["ID"], axis = 1, inplace=True)
num_cols = data.columns[data.nunique() >= 5]

In [13]:
data.head()

Unnamed: 0,target,age,gender,steroid,antivirals,fatigue,malaise,anorexia,liverBig,liverFirm,spleen,spiders,ascites,varices,bili,alk,sgot,albu,protime,histology
0,2,30,2,1.0,2,2.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,85.0,18.0,4.0,,1
1,2,50,1,1.0,2,1.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,0.9,135.0,42.0,3.5,,1
2,2,78,1,2.0,2,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,96.0,32.0,4.0,,1
3,2,31,1,,1,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,0.7,46.0,52.0,4.0,80.0,1
4,2,34,1,2.0,2,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,,200.0,4.0,,1


#### 6. Identify the Categorical Columns and store them in a variable cat_cols and numerical into num_cols

In [14]:
num_cols = ["age", "bili", "alk", "sgot", "albu", "protime"]
cat_cols = ['gender', 'steroid', 'antivirals', 'fatigue', 'malaise', 'anorexia', 'liverBig', 
            'liverFirm', 'spleen', 'spiders', 'ascites', 'varices', 'histology']

#### 7. Checking the null values

In [15]:
data.isna().sum()

target         0
age            0
gender         0
steroid        1
antivirals     0
fatigue        1
malaise        1
anorexia       1
liverBig      10
liverFirm     11
spleen         5
spiders        5
ascites        5
varices        5
bili           6
alk           29
sgot           4
albu          16
protime       67
histology      0
dtype: int64

In [16]:
data.isnull().sum()

target         0
age            0
gender         0
steroid        1
antivirals     0
fatigue        1
malaise        1
anorexia       1
liverBig      10
liverFirm     11
spleen         5
spiders        5
ascites        5
varices        5
bili           6
alk           29
sgot           4
albu          16
protime       67
histology      0
dtype: int64

#### 8. Split the data into X and y

In [17]:
X = data.drop(["target"], axis = 1)

In [18]:
y = data["target"]

In [19]:
print(X.shape, y.shape)

(155, 19) (155,)


#### 9. Split the data into X_train, X_test, y_train, y_test with test_size = 0.20 using sklearn

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [21]:
## Print the shape of X_train, X_test, y_train, y_test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(124, 19)
(31, 19)
(124,)
(31,)


#### 10. Check null values in train and test, check value_counts in y_train and y_test

In [22]:
print(y_train.value_counts()/X_train.shape[0])

target
2    0.790323
1    0.209677
Name: count, dtype: float64


In [23]:
print(y_test.value_counts()/X_test.shape[0])

target
2    0.806452
1    0.193548
Name: count, dtype: float64
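
The class proportions in train and test differ slightly because the split is purely random. As a hedged sketch (not part of the original notebook), passing `stratify=y` to `train_test_split` keeps the proportions as close as the sample sizes allow. The toy frame below is made up and only mirrors the 123/32 class balance; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels mirroring the class balance above: 123 "live" (2) and 32 "die" (1).
y_demo = pd.Series([2] * 123 + [1] * 32, name="target")
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

With only 32 cases in the minority class, stratifying is a cheap way to make sure the test set is not unusually easy or hard by chance.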


In [24]:
# null values in train
X_train.isna().sum()

age            0
gender         0
steroid        1
antivirals     0
fatigue        1
malaise        1
anorexia       1
liverBig       7
liverFirm      8
spleen         4
spiders        4
ascites        4
varices        4
bili           6
alk           23
sgot           4
albu          13
protime       53
histology      0
dtype: int64

In [25]:
# null values in test
X_test.isna().sum()

age            0
gender         0
steroid        0
antivirals     0
fatigue        0
malaise        0
anorexia       0
liverBig       3
liverFirm      3
spleen         1
spiders        1
ascites        1
varices        1
bili           0
alk            6
sgot           0
albu           3
protime       14
histology      0
dtype: int64

#### 11. Impute the Categorical Columns with mode and Numerical columns with median

In [26]:
df_cat_train = X_train[cat_cols]
df_cat_test = X_test[cat_cols]

In [27]:
# Impute on train
# df_cat_train = df_cat_train.fillna(df_cat_train.mode().iloc[0])

# Impute on test
# df_cat_test = df_cat_test.fillna(df_cat_train.mode().iloc[0])

In [28]:
from sklearn.impute import SimpleImputer
cat_imputer = SimpleImputer(strategy='most_frequent')
cat_imputer.fit(df_cat_train)

In [29]:
df_cat_train = pd.DataFrame(cat_imputer.transform(df_cat_train), columns=cat_cols)

In [30]:
df_cat_test = pd.DataFrame(cat_imputer.transform(df_cat_test), columns=cat_cols)

In [31]:
df_num_train = X_train[num_cols]
df_num_test = X_test[num_cols]

In [32]:
# Impute on train
# df_num_train = df_num_train.fillna(df_num_train.mean())

#Impute on test
# df_num_test = df_num_test.fillna(df_num_train.mean())

In [33]:
num_imputer = SimpleImputer(strategy='median')
num_imputer.fit(df_num_train[num_cols])

In [34]:
df_num_train = pd.DataFrame ( num_imputer.transform(df_num_train), columns= num_cols)

In [35]:
df_num_test =  pd.DataFrame(num_imputer.transform(df_num_test), columns=num_cols)

In [36]:
# Combine numeric and categorical in train
X_train = pd.concat([df_num_train, df_cat_train], axis = 1)

# Combine numeric and categorical in test
X_test = pd.concat([df_num_test, df_cat_test], axis = 1)

In [37]:
X_train.isna().sum()

age           0
bili          0
alk           0
sgot          0
albu          0
protime       0
gender        0
steroid       0
antivirals    0
fatigue       0
malaise       0
anorexia      0
liverBig      0
liverFirm     0
spleen        0
spiders       0
ascites       0
varices       0
histology     0
dtype: int64

In [38]:
X_test.isna().sum()

age           0
bili          0
alk           0
sgot          0
albu          0
protime       0
gender        0
steroid       0
antivirals    0
fatigue       0
malaise       0
anorexia      0
liverBig      0
liverFirm     0
spleen        0
spiders       0
ascites       0
varices       0
histology     0
dtype: int64
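
The manual steps above (split the columns, fit each imputer on train, transform train and test, concatenate) can also be expressed in one object. The sketch below is one possible alternative using `ColumnTransformer`, shown on a small made-up frame; the names `train_demo`, `test_demo`, `num_demo` and `cat_demo` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

num_demo = ["age", "bili"]
cat_demo = ["steroid", "fatigue"]

# Tiny made-up train and test frames with a few missing values.
train_demo = pd.DataFrame({
    "age": [30, 50, 78, 31],
    "bili": [1.0, np.nan, 0.7, 0.9],
    "steroid": [1.0, 1.0, 2.0, np.nan],
    "fatigue": [2.0, 1.0, 1.0, 1.0],
})
test_demo = pd.DataFrame({
    "age": [34.0], "bili": [np.nan], "steroid": [np.nan], "fatigue": [np.nan],
})

# Median for numeric columns, most frequent value for categorical columns.
imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_demo),
    ("cat", SimpleImputer(strategy="most_frequent"), cat_demo),
])
imputer.fit(train_demo)  # statistics are learned from train only

out_cols = num_demo + cat_demo  # output order follows the transformer order
train_filled = pd.DataFrame(imputer.transform(train_demo),
                            columns=out_cols, index=train_demo.index)
test_filled = pd.DataFrame(imputer.transform(test_demo),
                           columns=out_cols, index=test_demo.index)
print(test_filled)
```

Passing `index=` when rebuilding the frames keeps the row labels aligned with the original data, which the plain `pd.DataFrame(...)` calls above do not do.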

#### Convert all the categorical columns to Integer Format before dummification (2.0 as 2 etc.)

In [39]:
# Train
X_train[cat_cols] = X_train[cat_cols].astype('int')

# Test
X_test[cat_cols] = X_test[cat_cols].astype('int')

#### 12. Dummify the Categorical columns

In [40]:
## Convert Categorical Columns to Dummies
# Train
X_train = pd.get_dummies(X_train, columns=cat_cols, drop_first=True)

# Test
X_test = pd.get_dummies(X_test, columns=cat_cols, drop_first=True)

In [41]:
X_train.columns

Index(['age', 'bili', 'alk', 'sgot', 'albu', 'protime', 'gender_2',
       'steroid_2', 'antivirals_2', 'fatigue_2', 'malaise_2', 'anorexia_2',
       'liverBig_2', 'liverFirm_2', 'spleen_2', 'spiders_2', 'ascites_2',
       'varices_2', 'histology_2'],
      dtype='object')

In [42]:
X_test.columns

Index(['age', 'bili', 'alk', 'sgot', 'albu', 'protime', 'gender_2',
       'steroid_2', 'antivirals_2', 'fatigue_2', 'malaise_2', 'anorexia_2',
       'liverBig_2', 'liverFirm_2', 'spleen_2', 'spiders_2', 'ascites_2',
       'varices_2', 'histology_2'],
      dtype='object')
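
Here the train and test columns happen to match, but calling `get_dummies` separately on each set is only safe when every level appears in both. A minimal sketch of one way to guard against this, assuming the levels are known in advance (1 and 2 for the binary attributes): declare them as a fixed `Categorical` before creating dummies. The small frames below are made up for illustration.

```python
import pandas as pd

train_cat = pd.DataFrame({"ascites": [1, 2, 2, 1]})
test_cat = pd.DataFrame({"ascites": [2, 2]})  # level 1 never occurs in this test set

# Fix the set of levels so both frames produce the same dummy columns.
levels = [1, 2]
for frame in (train_cat, test_cat):
    frame["ascites"] = pd.Categorical(frame["ascites"], categories=levels)

train_dummies = pd.get_dummies(train_cat, columns=["ascites"], drop_first=True)
test_dummies = pd.get_dummies(test_cat, columns=["ascites"], drop_first=True)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```

Without the fixed categories, `drop_first=True` would drop the only level present in `test_cat` and leave no `ascites` column at all.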

#### 13. Scale the numeric attributes ["age", "bili", "alk", "sgot", "albu", "protime"]

In [43]:
from sklearn.preprocessing import StandardScaler

In [44]:
#num_cols = ["age", "bili", "alk", "sgot", "albu", "protime"]
scaler = StandardScaler()

scaler.fit(X_train.loc[:,num_cols])

# scale on train
X_train.loc[:,num_cols] = scaler.transform(X_train.loc[:,num_cols])
#X_train[num_cols] = scaler.transform(X_train[num_cols])

# scale on test
X_test.loc[:,num_cols] = scaler.transform(X_test.loc[:,num_cols])

## MODEL BUILDING - SVM

In [45]:
from sklearn.svm import SVC

In [46]:
# Create a SVC classifier using a linear kernel
linear_svm = SVC(kernel='linear', C=1, random_state=0)

In [47]:
# Train the classifier
linear_svm.fit(X=X_train, y= y_train)

In [48]:
## Predict
train_predictions = linear_svm.predict(X_train)
test_predictions = linear_svm.predict(X_test)

### Train data accuracy
from sklearn.metrics import accuracy_score,f1_score,confusion_matrix

print("TRAIN Conf Matrix : \n", confusion_matrix(y_train, train_predictions))
print("\nTRAIN DATA ACCURACY",accuracy_score(y_train,train_predictions))
print("\nTrain data f1-score for class '1'",f1_score(y_train,train_predictions,pos_label=1)) 

# pos_label tells f1_score which label to treat as the positive class.
# It defaults to 1; here the labels are 1 (die) and 2 (live), so it is passed explicitly for each class.
# This matters in binary classification problems where the labels might not be 0 and 1 (e.g., "spam" and "not spam").
print("\nTrain data f1-score for class '2'",f1_score(y_train,train_predictions,pos_label=2))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix(y_test, test_predictions))
print("\nTEST DATA ACCURACY",accuracy_score(y_test,test_predictions))
print("\nTest data f1-score for class '1'",f1_score(y_test,test_predictions,pos_label=1))
print("\nTest data f1-score for class '2'",f1_score(y_test,test_predictions,pos_label=2))

TRAIN Conf Matrix : 
 [[17  9]
 [ 4 94]]

TRAIN DATA ACCURACY 0.8951612903225806

Train data f1-score for class '1' 0.7234042553191489

Train data f1-score for class '2' 0.9353233830845772


--------------------------------------


TEST Conf Matrix : 
 [[ 4  2]
 [ 1 24]]

TEST DATA ACCURACY 0.9032258064516129

Test data f1-score for class '1' 0.7272727272727272

Test data f1-score for class '2' 0.9411764705882353
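
As a worked check, the test metrics can be recomputed by hand from the test confusion matrix printed above. This sketch assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, both in sorted label order (1, then 2).

```python
import numpy as np

# Test confusion matrix from the output above.
# Rows: actual class 1, 2. Columns: predicted class 1, 2.
cm = np.array([[4, 2],
               [1, 24]])

tp = cm[0, 0]  # actual 1, predicted 1
fn = cm[0, 1]  # actual 1, predicted 2
fp = cm[1, 0]  # actual 2, predicted 1

precision = tp / (tp + fp)  # 4 / 5
recall = tp / (tp + fn)     # 4 / 6
f1_class1 = 2 * precision * recall / (precision + recall)  # 8 / 11
accuracy = np.trace(cm) / cm.sum()                          # 28 / 31

print(precision, recall, f1_class1, accuracy)
```

The F1 score for class 1 is noticeably lower than the accuracy because only 6 of the 31 test cases belong to class 1, so two misses weigh heavily.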


####  Non Linear SVM (RBF)

The Radial Basis Function (RBF) is a commonly used kernel in SVC:<br>

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \, \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$

where $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2$ is the squared Euclidean distance between two data points x<sub>i</sub> and x<sub>j</sub>, and $\gamma$ is the gamma parameter of the kernel.
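
To make the kernel concrete, here is a small sketch that evaluates the usual RBF form, exp(-gamma * squared distance), by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the value of `gamma` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary 2-D points, stored as 1-row matrices.
x_i = np.array([[1.0, 2.0]])
x_j = np.array([[2.0, 0.0]])
gamma = 0.01

# Squared Euclidean distance: (1 - 2)^2 + (2 - 0)^2 = 5
sq_dist = np.sum((x_i - x_j) ** 2)

k_manual = np.exp(-gamma * sq_dist)
k_sklearn = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]

print(sq_dist, k_manual, k_sklearn)
```

The kernel value is 1 for identical points and decays towards 0 as the points move apart; gamma controls how fast it decays.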

For practical purposes, the key point is that an SVC classifier using an RBF kernel has two hyperparameters to tune: gamma and C.

<strong>Gamma:</strong>

- Gamma is a parameter of the RBF kernel and controls how far the influence of a single training point reaches, and therefore the shape of the decision region. When gamma is low, each point's influence reaches far, so the ‘curve’ of the decision boundary is gentle and the decision region is very broad. When gamma is high, the influence is very local, so the ‘curve’ of the decision boundary is sharp, which creates islands of decision regions around individual data points.

<strong>C:</strong>

- C is a parameter of the SVC learner and is the penalty for misclassifying a data point. When C is small, the classifier tolerates misclassified data points (high bias, low variance). When C is large, the classifier is heavily penalized for misclassified data and therefore bends over backwards to avoid any misclassified data points (low bias, high variance).
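
The effect of gamma described above can be observed on a toy problem. This is a rough sketch on synthetic two-moons data (not the hepatitis data); the dataset, the gamma values and the name `train_acc` are illustrative assumptions. Training accuracy typically rises as gamma grows, which is a sign of the model fitting individual points rather than a guarantee of better test performance.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic, non-linearly separable toy data.
X_demo, y_demo = make_moons(n_samples=200, noise=0.3, random_state=0)

train_acc = {}
for gamma in [0.01, 1, 100]:
    model = SVC(kernel="rbf", gamma=gamma, C=1).fit(X_demo, y_demo)
    # Accuracy on the same data the model was trained on.
    train_acc[gamma] = model.score(X_demo, y_demo)
    print(f"gamma={gamma}: training accuracy={train_acc[gamma]:.3f}, "
          f"support vectors={model.n_support_.sum()}")
```

A held-out set or cross-validation is still needed to decide which gamma generalizes best.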


<strong>Kernel Trick:</strong><br>
Imagine you have a two-dimensional dataset that is not linearly separable, and you would like to classify it using an SVM. This looks impossible, because no straight line separates the classes. However, if we transform the two-dimensional data to a higher dimension, say three or even ten dimensions, we may be able to find a hyperplane that separates the data.

<img src="kernel_trick.png">

The problem is that if we have a large dataset containing, say, millions of examples, the transformation will take a long time to run.<br>
Fortunately, the SVM only needs the result of the dot product (x<sub>i</sub>.x<sub>j</sub>) between pairs of transformed points, not the transformed points themselves.<br>
<br>If there is a function that computes this dot product directly from the original points, and the result is the same as when we transform the data into the higher dimension first, we never need to perform the transformation. Such a function is called a kernel function.<br>
<br>In essence, the kernel trick gives us the benefit of working in a higher-dimensional space without the cost of explicitly transforming the data.
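
A small worked example may help here. The sketch below uses the degree-2 polynomial kernel (chosen because its feature map is easy to write down; the RBF kernel corresponds to an infinite-dimensional map) and checks that the kernel on the original 2-D points equals the dot product after an explicit map to 3-D. The function names `phi` and `poly2_kernel` are made up for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map from 2-D to 3-D for the degree-2 polynomial kernel."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Kernel computed directly in the original 2-D space: (a . b)^2."""
    return np.dot(a, b) ** 2

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 0.5])

explicit = np.dot(phi(x_i), phi(x_j))  # transform first, then dot product
via_kernel = poly2_kernel(x_i, x_j)    # no transformation needed

print(explicit, via_kernel)
```

Both routes give the same number, but the kernel route never builds the higher-dimensional vectors.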

In [49]:
## Create an SVC object and print it to see the arguments
svc = SVC(kernel='rbf', random_state=0, gamma=0.01, C=1)
svc

In [50]:
## Train the model
svc.fit(X=X_train, y= y_train)

In [51]:
## Predict
train_predictions = svc.predict(X_train)
test_predictions = svc.predict(X_test)

### Train data accuracy

print("TRAIN Conf Matrix : \n", confusion_matrix(y_train, train_predictions))
print("\nTRAIN DATA ACCURACY",accuracy_score(y_train,train_predictions))
print("\nTrain data f1-score for class '1'",f1_score(y_train,train_predictions,pos_label=1))
print("\nTrain data f1-score for class '2'",f1_score(y_train,train_predictions,pos_label=2))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix(y_test, test_predictions))
print("\nTEST DATA ACCURACY",accuracy_score(y_test,test_predictions))
print("\nTest data f1-score for class '1'",f1_score(y_test,test_predictions,pos_label=1))
print("\nTest data f1-score for class '2'",f1_score(y_test,test_predictions,pos_label=2))

TRAIN Conf Matrix : 
 [[ 5 21]
 [ 0 98]]

TRAIN DATA ACCURACY 0.8306451612903226

Train data f1-score for class '1' 0.32258064516129037

Train data f1-score for class '2' 0.9032258064516129


--------------------------------------


TEST Conf Matrix : 
 [[ 1  5]
 [ 0 25]]

TEST DATA ACCURACY 0.8387096774193549

Test data f1-score for class '1' 0.2857142857142857

Test data f1-score for class '2' 0.9090909090909091


### SVM with Grid Search for Parameter Tuning

In [52]:
## Use Grid Search for parameter tuning

from sklearn.model_selection import GridSearchCV

svc_grid = SVC()
 
param_grid = { 
                'C': [0.001, 0.01, 0.1, 1, 10, 100 ],
                'gamma': [0, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 
                'kernel':['linear', 'rbf', 'poly' ]
             }

svc_cv_grid = GridSearchCV(estimator = svc_grid, param_grid = param_grid, cv = 5, verbose=3)

In [53]:
## Fit the grid search model
svc_cv_grid.fit(X=X_train, y=y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[CV 1/5] END ...C=0.001, gamma=0, kernel=linear;, score=0.760 total time=   0.0s
[CV 2/5] END ...C=0.001, gamma=0, kernel=linear;, score=0.800 total time=   0.0s
[CV 3/5] END ...C=0.001, gamma=0, kernel=linear;, score=0.800 total time=   0.0s
[CV 4/5] END ...C=0.001, gamma=0, kernel=linear;, score=0.800 total time=   0.0s
[CV 5/5] END ...C=0.001, gamma=0, kernel=linear;, score=0.792 total time=   0.0s
[CV 1/5] END ......C=0.001, gamma=0, kernel=rbf;, score=0.760 total time=   0.0s
[CV 2/5] END ......C=0.001, gamma=0, kernel=rbf;, score=0.800 total time=   0.0s
[CV 3/5] END ......C=0.001, gamma=0, kernel=rbf;, score=0.800 total time=   0.0s
[CV 4/5] END ......C=0.001, gamma=0, kernel=rbf;, score=0.800 total time=   0.0s
[CV 5/5] END ......C=0.001, gamma=0, kernel=rbf;, score=0.792 total time=   0.0s
[CV 1/5] END .....C=0.001, gamma=0, kernel=poly;, score=0.760 total time=   0.0s
[CV 2/5] END .....C=0.001, gamma=0, kernel=pol

[CV 4/5] END C=0.001, gamma=0.0001, kernel=linear;, score=0.800 total time=   0.0s
[CV 5/5] END C=0.001, gamma=0.0001, kernel=linear;, score=0.792 total time=   0.0s
[CV 1/5] END .C=0.001, gamma=0.0001, kernel=rbf;, score=0.760 total time=   0.0s
[CV 2/5] END .C=0.001, gamma=0.0001, kernel=rbf;, score=0.800 total time=   0.0s
[CV 3/5] END .C=0.001, gamma=0.0001, kernel=rbf;, score=0.800 total time=   0.0s
[CV 4/5] END .C=0.001, gamma=0.0001, kernel=rbf;, score=0.800 total time=   0.0s
[CV 5/5] END .C=0.001, gamma=0.0001, kernel=rbf;, score=0.792 total time=   0.0s
[CV 1/5] END C=0.001, gamma=0.0001, kernel=poly;, score=0.760 total time=   0.0s
[CV 2/5] END C=0.001, gamma=0.0001, kernel=poly;, score=0.800 total time=   0.0s
[CV 3/5] END C=0.001, gamma=0.0001, kernel=poly;, score=0.800 total time=   0.0s
[CV 4/5] END C=0.001, gamma=0.0001, kernel=poly;, score=0.800 total time=   0.0s
[CV 5/5] END C=0.001, gamma=0.0001, kernel=poly;, score=0.792 total time=   0.0s
[CV 1/5] END C=0.001, ga

In [54]:
# Get the best parameters
svc_cv_grid.best_params_

{'C': 0.001, 'gamma': 1, 'kernel': 'poly'}
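
Two variations on the search above may be worth trying; neither is part of the original notebook. First, `gamma` has no effect on the linear kernel, so a list of per-kernel grids avoids redundant fits. Second, `GridSearchCV` scores by accuracy by default, which can favour the majority class; a scorer built with `make_scorer` can target the F1 score of class 1 instead. The sketch runs on synthetic imbalanced data (an assumption for self-containment), with labels shifted to 1 and 2 to mimic the target coding.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic imbalanced data: roughly 20% minority class.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, weights=[0.2, 0.8], random_state=0
)
y_demo = y_demo + 1  # relabel 0/1 as 1/2, so class 1 is the minority

# One grid per kernel: gamma is only searched where it matters.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]},
]

# Score each candidate by the F1 of class 1 rather than by accuracy.
minority_f1 = make_scorer(f1_score, pos_label=1, zero_division=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(), param_grid, scoring=minority_f1, cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The grid values here are deliberately small to keep the example fast; they are not tuned recommendations.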

In [55]:
svc_best = svc_cv_grid.best_estimator_

In [56]:
## Predict
train_predictions = svc_best.predict(X_train)
test_predictions = svc_best.predict(X_test)

print("TRAIN DATA ACCURACY",accuracy_score(y_train,train_predictions))
print("\nTrain data f1-score for class '1'",f1_score(y_train,train_predictions,pos_label=1))
print("\nTrain data f1-score for class '2'",f1_score(y_train,train_predictions,pos_label=2))

### Test data accuracy
print("\n\n--------------------------------------\n\n")
print("TEST DATA ACCURACY",accuracy_score(y_test,test_predictions))
print("\nTest data f1-score for class '1'",f1_score(y_test,test_predictions,pos_label=1))
print("\nTest data f1-score for class '2'",f1_score(y_test,test_predictions,pos_label=2))

TRAIN DATA ACCURACY 0.9516129032258065

Train data f1-score for class '1' 0.8695652173913044

Train data f1-score for class '2' 0.9702970297029703


--------------------------------------


TEST DATA ACCURACY 0.8709677419354839

Test data f1-score for class '1' 0.6666666666666666

Test data f1-score for class '2' 0.92
