<a href="https://colab.research.google.com/github/tranhuudan-02573/MachineLearning/blob/main/Lab_5_20130218_TranHuuDan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This lab is to deal with **SVM** to classification tasks and compare its performance with other competitive algorithms. In general, **SVM** is one of the most popular and widely used supervised machine learning algorithms.

*   **Deadline: 23:59, 17/03/2023**



# Import libraries

In [6]:
# code
from sklearn import svm
from sklearn import datasets
from sklearn.linear_model import LogisticRegression 
from sklearn import tree
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from prettytable import PrettyTable


#Task 1. 
For breast cancer dataset (https://tinyurl.com/3vme8hr3) which could be loaded from datasets in sklearn as follows:

```
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()
```

*   1.1.	Apply SVM algorithm to above dataset using linear kernel.
*   1.2.	Compare the obtained results with other competitive algorithms (Logistic Regression, Decision Tree, kNN) based on metrics: accuracy, precision, recall, f1 measures.



In [7]:

# code
cancer = datasets.load_breast_cancer()
# print(cancer)
X_train, X_test, y_train, y_test = train_test_split(cancer['data'], cancer['target'], test_size=0.3, random_state=1)
clf = svm.SVC(kernel='linear') 
clf.fit(X_train,y_train)
y_predSvm = clf.predict(X_test)



classifier = LogisticRegression()
classifier.fit(X_train, y_train) 
y_predLogic = classifier.predict(X_test)

decision = tree.DecisionTreeClassifier()
decision.fit(X_train, y_train) 
y_predDes = decision.predict(X_test)

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_predKnn = neigh.predict(X_test)


t = PrettyTable(["Regression","accuracy","precision","recall","f1"])
t.add_row(["DecisionTree",accuracy_score(y_predDes,y_predSvm),
           precision_score(y_predDes,y_predSvm),
           recall_score(y_predDes,y_predSvm,average='macro'),
           f1_score(y_predDes,y_predSvm,average='macro')])
t.add_row(["LogisticRegression",accuracy_score(y_predLogic,y_predSvm),
           precision_score(y_predLogic,y_predSvm),
           recall_score(y_predLogic,y_predSvm,average='macro'),
           f1_score(y_predLogic,y_predSvm,average='macro')])
t.add_row(["KNeighbors",accuracy_score(y_predKnn,y_predSvm),
           precision_score(y_predKnn,y_predSvm),
           recall_score(y_predKnn,y_predSvm,average='macro'),
           f1_score(y_predKnn,y_predSvm,average='macro')])

print(t)








+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Regression     |      accuracy      |     precision      |       recall       |         f1         |
+--------------------+--------------------+--------------------+--------------------+--------------------+
|    DecisionTree    | 0.9415204678362573 | 0.9553571428571429 | 0.9353056900726393 | 0.9353056900726393 |
| LogisticRegression | 0.9824561403508771 | 0.9821428571428571 | 0.9788288288288288 |  0.98066850058409  |
|     KNeighbors     | 0.935672514619883  | 0.9464285714285714 | 0.9274774774774774 | 0.9291178354749972 |
+--------------------+--------------------+--------------------+--------------------+--------------------+


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


#Task 2. 

*   1.1.	Perform SVM algorithm to **Iris dataset** using **linear kernel**.
*   1.2.	Compare the obtained results in 1.1 with SVM using other kernels (**Polynomial Kernel, Gaussian Kernel, Sigmoid Kernel, Radial Basis Function Kernel**). Some metrics could be used: accuracy, precision, recall, f1 measures





In [11]:
# code
from sklearn import datasets
df = datasets.load_iris()
# print(df)

X_train, X_test, y_train, y_test = train_test_split(df['data'], df['target'], test_size=0.3, random_state=1)
clf = svm.SVC(kernel='linear') 
clf.fit(X_train,y_train)
y_predLinear = clf.predict(X_test)

clfPoly = svm.SVC(kernel='poly') 
clfPoly.fit(X_train,y_train)
y_predPoly = clfPoly.predict(X_test)

clfrbf = svm.SVC(kernel='rbf') 
clfrbf.fit(X_train,y_train)
y_predrbf = clfPoly.predict(X_test)

clfSig = svm.SVC(kernel='sigmoid') 
clfSig.fit(X_train,y_train)
y_predSig = clfPoly.predict(X_test)

t = PrettyTable(["Kernel","accuracy","precision","recall","f1"])

t.add_row(["Polynomial",accuracy_score(y_predPoly,y_predLinear),
           precision_score(y_predPoly,y_predLinear,average='macro'),
           recall_score(y_predPoly,y_predLinear,average='macro'),
           f1_score(y_predPoly,y_predLinear,average='macro')])
t.add_row(["Sigmoid",accuracy_score(y_predSig,y_predLinear),
           precision_score(y_predSig,y_predLinear,average='macro'),
           recall_score(y_predSig,y_predLinear,average='macro'),
           f1_score(y_predSig,y_predLinear,average='macro')])
t.add_row(["Radial Basis Function",accuracy_score(y_predrbf,y_predLinear),
           precision_score(y_predrbf,y_predLinear,average='macro'),
           recall_score(y_predrbf,y_predLinear,average='macro'),
           f1_score(y_predrbf,y_predLinear,average='macro')])
print(t)





+-----------------------+--------------------+--------------------+--------------------+--------------------+
|         Kernel        |      accuracy      |     precision      |       recall       |         f1         |
+-----------------------+--------------------+--------------------+--------------------+--------------------+
|       Polynomial      | 0.9777777777777777 | 0.9814814814814815 | 0.9761904761904763 | 0.9781305114638448 |
|        Sigmoid        | 0.9777777777777777 | 0.9814814814814815 | 0.9761904761904763 | 0.9781305114638448 |
| Radial Basis Function | 0.9777777777777777 | 0.9814814814814815 | 0.9761904761904763 | 0.9781305114638448 |
+-----------------------+--------------------+--------------------+--------------------+--------------------+


#Task 3. 
Compare the performance of selected classification algorithms (Decision Tree, kNN, Logistic Regression) and SVM (using different kernels) with mnist dataset based on accuracy, precision, recall, f1 measures.


In [13]:
from sklearn import datasets
mnist = datasets.load_digits()

classifier = LogisticRegression(random_state = 0) 
X_train, X_test, y_train, y_test = train_test_split(mnist['data'], mnist['target'], test_size=0.3, random_state=1)

clf = svm.SVC(kernel='linear') 
clf.fit(X_train,y_train)
y_predSvm = clf.predict(X_test)



classifier = LogisticRegression()
classifier.fit(X_train, y_train) 
y_predLogic = classifier.predict(X_test)

decision = tree.DecisionTreeClassifier()
decision.fit(X_train, y_train) 
y_predDes = decision.predict(X_test)

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_predKnn = neigh.predict(X_test)

t = PrettyTable(["Regression","accuracy","precision","recall","f1"])
t.add_row(["DecisionTree",accuracy_score(y_predDes,y_predSvm),
           precision_score(y_predDes,y_predSvm,average='macro'),
           recall_score(y_predDes,y_predSvm,average='macro'),
           f1_score(y_predDes,y_predSvm,average='macro')])
t.add_row(["LogisticRegression",accuracy_score(y_predLogic,y_predSvm),
           precision_score(y_predLogic,y_predSvm,average='macro'),
           recall_score(y_predLogic,y_predSvm,average='macro'),
           f1_score(y_predLogic,y_predSvm,average='macro')])
t.add_row(["KNeighbors",accuracy_score(y_predKnn,y_predSvm),
           precision_score(y_predKnn,y_predSvm,average='macro'),
           recall_score(y_predKnn,y_predSvm,average='macro'),
           f1_score(y_predKnn,y_predSvm,average='macro')])

print(t)



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Regression     |      accuracy      |     precision      |       recall       |         f1         |
+--------------------+--------------------+--------------------+--------------------+--------------------+
|    DecisionTree    | 0.8629629629629629 | 0.8654180382458847 | 0.8639498808927959 | 0.8631575175700776 |
| LogisticRegression | 0.9740740740740741 | 0.9738742131910236 | 0.9737831557427713 | 0.9735512672256377 |
|     KNeighbors     | 0.9796296296296296 | 0.9790506587692549 | 0.9792336340469445 | 0.9791108239414659 |
+--------------------+--------------------+--------------------+--------------------+--------------------+


#Task 4. 
Compare the performance of selected classification algorithms (Decision Tree, kNN, Logistic Regression) and SVM (using different kernels) with **credit card dataset** based on accuracy, precision, recall, f1 measures.

*   Give some comments on the obtained results
*   Identify issues with dataset, and propose the solutions to these issues



In [14]:
# code
import pandas as pd
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/MyDrive/ML/lab5'

data = pd.read_csv('creditcard.csv')

dataSet = data.head(30000)
X = dataSet.iloc[:, :-1]
y = dataSet.iloc[:, -1]

classifier = LogisticRegression(random_state = 0) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

clf = svm.SVC(kernel='linear') 
clf.fit(X_train,y_train)
y_predSvm = clf.predict(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train) 
y_predLogic = classifier.predict(X_test)

decision = tree.DecisionTreeClassifier()
decision.fit(X_train, y_train) 
y_predDes = decision.predict(X_test)

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_predKnn = neigh.predict(X_test)

t = PrettyTable(["Regression","accuracy","precision","recall","f1"])
t.add_row(["DecisionTree",accuracy_score(y_predDes,y_predSvm),
           precision_score(y_predDes,y_predSvm,average='macro'),
           recall_score(y_predDes,y_predSvm,average='macro'),
           f1_score(y_predDes,y_predSvm,average='macro')])
t.add_row(["LogisticRegression",accuracy_score(y_predLogic,y_predSvm),
           precision_score(y_predLogic,y_predSvm,average='macro'),
           recall_score(y_predLogic,y_predSvm,average='macro'),
           f1_score(y_predLogic,y_predSvm,average='macro')])
t.add_row(["KNeighbors",accuracy_score(y_predKnn,y_predSvm),
           precision_score(y_predKnn,y_predSvm,average='macro'),
           recall_score(y_predKnn,y_predSvm,average='macro'),
           f1_score(y_predKnn,y_predSvm,average='macro')])

print(t)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/MyDrive/ML/lab5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


+--------------------+--------------------+--------------------+--------------------+--------------------+
|     Regression     |      accuracy      |     precision      |       recall       |         f1         |
+--------------------+--------------------+--------------------+--------------------+--------------------+
|    DecisionTree    | 0.9971111111111111 | 0.8115530303030303 | 0.7338732158786798 | 0.7671326667091267 |
| LogisticRegression | 0.9977777777777778 | 0.7911096256684492 | 0.7911096256684492 | 0.7911096256684492 |
|     KNeighbors     | 0.9975555555555555 | 0.5416666666666666 | 0.9987775061124695 | 0.5763110818190378 |
+--------------------+--------------------+--------------------+--------------------+--------------------+


#Finally,
Save a copy in your Github. Remember renaming the notebook.