
# CS3033/CS6405 - Data Mining - Second Assignment

### Submission

This assignment is **due on 07/04/22 at 23:59**. You should submit a single .ipynb file with your Python code and analysis electronically via Canvas.
Please note that this assignment will account for 25 Marks of your module grade.

### Declaration

By submitting this assignment, I agree to the following:

<font color="red">“I have read and understand the UCC academic policy on plagiarism, and agree to the requirements set out thereby in relation to plagiarism and referencing. I confirm that I have referenced and acknowledged properly all sources used in the preparation of this assignment.
I declare that this assignment is entirely my own work based on my personal study. I further declare that I have not engaged the services of another to either assist me in, or complete this assignment”</font>

### Objective

The Boolean satisfiability (SAT) problem consists in determining whether a Boolean formula F is satisfiable or not. F is represented by a pair (X, C), where X is a set of Boolean variables and C is a set of clauses in Conjunctive Normal Form (CNF). Each clause is a disjunction of literals (a variable or its negation). This problem is one of the most widely studied combinatorial problems in computer science. It is the classic NP-complete problem. Over the past number of decades, a significant amount of research work has focused on solving SAT problems with both complete and incomplete solvers.
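To make the CNF representation concrete, here is a minimal sketch of a brute-force satisfiability check. The function name `is_satisfiable` and the integer encoding of literals (DIMACS-style: `k` for a variable, `-k` for its negation) are assumptions chosen for illustration; real SAT solvers use far more sophisticated search.

```python
from itertools import product


def is_satisfiable(num_vars, clauses):
    """Brute-force SAT check for a CNF formula.

    Each clause is a list of non-zero integers: k means variable k
    is true, -k means variable k is false.
    """
    for assignment in product([False, True], repeat=num_vars):
        # A clause is satisfied if at least one of its literals is true;
        # the formula is satisfied if every clause is.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return True
    return False


# (x1 OR NOT x2) AND (x2 OR x3) AND (NOT x1 OR NOT x3)
satisfiable_formula = [[1, -2], [2, 3], [-1, -3]]
# (x1) AND (NOT x1)
unsatisfiable_formula = [[1], [-1]]

print(is_satisfiable(3, satisfiable_formula))    # True
print(is_satisfiable(1, unsatisfiable_formula))  # False
```

The loop tries all 2^n assignments, which is why this approach only works for tiny formulas.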

One of the most successful approaches is an algorithm portfolio, where a solver is selected among a set of candidates depending on the problem type. Your task is to create a classifier that takes as input the SAT instance's features and identifies the class.

In this project, we represent SAT problems with a vector of 327 features with general information about the problem, e.g., number of variables, number of clauses, the fraction of horn clauses in the problem, etc. There is no need to understand the features to be able to complete the assignment.


The original dataset is available at:
https://github.com/bprovanbessell/SATfeatPy/blob/main/features_csv/all_features.csv



# I used the following paper as a reference for this assignment:
http://www.cs.ucc.ie/~osullb/pubs/classification.pdf

## Data Preparation

In [1]:
import numpy as np
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


df = pd.read_csv("https://raw.githubusercontent.com/andvise/DataAnalyticsDatasets/main/train_dataset.csv", index_col=0)
df.head()

| | c | v | clauses_vars_ratio | vars_clauses_ratio | vcg_var_mean | vcg_var_coeff | vcg_var_min | vcg_var_max | vcg_var_entropy | vcg_clause_mean | ... | rwh_0_max | rwh_1_mean | rwh_1_coeff | rwh_1_min | rwh_1_max | rwh_2_mean | rwh_2_coeff | rwh_2_min | rwh_2_max | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 608 | 71 | 8.56338 | 0.116776 | 0.045172 | 0.173688 | 0.029605 | 0.060855 | 2.802758 | 0.045172 | ... | 5078250.0 | 1056.695041 | 1.0 | 2.981935e-09 | 2113.390083 | 1081.900778 | 1.0 | 1.30208e-29 | 2163.801556 | matching |
| 1 | 615 | 70 | 8.785714 | 0.113821 | 0.049617 | 0.168633 | 0.03252 | 0.069919 | 2.607264 | 0.049617 | ... | 5469376.0 | 1207.488426 | 1.0 | 6.927306e-28 | 2414.976852 | 1186.623627 | 1.0 | 3.491123e-120 | 2373.247255 | matching |
| 2 | 926 | 105 | 8.819048 | 0.113391 | 0.033385 | 0.186444 | 0.017279 | 0.047516 | 3.022879 | 0.033385 | ... | 4297025.0 | 441.327046 | 1.0 | 1.194627e-76 | 882.654092 | 474.697562 | 1.0 | 0.0 | 949.395124 | matching |
| 3 | 603 | 70 | 8.614286 | 0.116086 | 0.049799 | 0.133441 | 0.033167 | 0.063018 | 2.688342 | 0.049799 | ... | 6640651.0 | 1181.583331 | 1.0 | 2.437278e-30 | 2363.166661 | 1149.059132 | 1.0 | 4.67009e-147 | 2298.118264 | matching |
| 4 | 228 | 43 | 5.302326 | 0.188596 | 0.067319 | 0.162581 | 0.048246 | 0.087719 | 2.203308 | 0.067319 | ... | 2437500.0 | 1091.423921 | 0.999966 | 0.03723599 | 2182.810606 | 1296.888087 | 1.0 | 6.307424e-06 | 2593.776167 | matching |


In [2]:
# Label or target variable
df['target'].value_counts()

tseitin           298
dominating        294
cliquecoloring    268
php               266
subsetcard        263
op                201
tiling            120
5clique           108
3color            104
matching          102
5color             98
4color             98
3clique            98
4clique            94
Name: target, dtype: int64
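The class counts above are uneven (the largest class has roughly three times as many instances as the smallest), so a purely random split can shift the class proportions between train and test. The following is a minimal sketch, using hypothetical toy labels rather than the real dataset, of how the `stratify` argument of `train_test_split` preserves the proportions. It is one possible option, not what the cells in this notebook do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels mimicking an imbalanced target column.
labels = pd.Series(["tseitin"] * 300 + ["4clique"] * 100, name="target")
features = pd.DataFrame({"feature": np.arange(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=42
)

# With stratify, both splits keep the original 75% / 25% class ratio.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```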

# Tasks

## Basic models and evaluation (5 Marks)

Using Scikit-learn, train and evaluate a decision tree classifier using 70% of the dataset for training and 30% for testing. For this part of the project, we are not interested in optimising the parameters; we just want to get an idea of the dataset.

### I was having issues with infinite and very large values, so I added a check that replaces infinite values with NaN and drops the affected columns.

In [3]:
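# Some data cleaning: drop columns that contain missing values
df = df.dropna(axis=1)
# Check whether any infinite values remain
print('Infinity Values:', df.isin([np.inf, -np.inf]).values.any())

# Replace infinite values with NaN, then drop the affected columns
df = df.replace([np.inf, -np.inf], np.nan).dropna(axis=1)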


Infinity Values: True
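Because the cleaning step drops whole columns, it can be useful to see which ones were removed. The sketch below applies the same `replace` and `dropna` pattern to a hypothetical three-row frame (the column names `a` and `b` are made up for illustration) and lists the dropped columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "b" contains an infinite value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.5, np.inf, 1.5],
    "target": ["php", "op", "php"],
})

# Convert +/-inf to NaN, then drop every column that contains a NaN.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna(axis=1)

dropped_columns = sorted(set(toy.columns) - set(cleaned.columns))
print("Dropped columns:", dropped_columns)  # Dropped columns: ['b']
```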


In [4]:
# Separate the target variable from the features
df_target = df['target']
df_features = df.drop('target', axis=1)
# Split the data into train (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=0.3, random_state=42)

# Min-max scale the features (fitted on the training set only)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Encode the string class labels as integers
encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

# Reduce the features to 50 principal components
pca = PCA(n_components=50)
training_features_pca = pca.fit_transform(X_train)
testing_features_pca = pca.transform(X_test)

# Train a decision tree with default hyperparameters
dc_one = DecisionTreeClassifier(random_state=42)
dc_one.fit(training_features_pca, y_train)
y_pred = dc_one.predict(testing_features_pca)

# Evaluate the model on the held-out test set
print('Accuracy Score:', accuracy_score(y_test, y_pred))

Accuracy Score: 0.9447513812154696


## Robust evaluation (10 Marks)

In this section, we are interested in more rigorous techniques by implementing more sophisticated methods, for instance:
* Hold-out and cross-validation.
* Hyper-parameter tuning.
* Feature reduction.
* Feature selection.
* Feature normalisation.

Your report should provide concrete information about your reasoning; everything should be well-explained.

The key to getting good marks is to show that you evaluated different methods and that you correctly selected the configuration.
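Several of the techniques listed above can be combined in a single scikit-learn `Pipeline` searched by `GridSearchCV`. The sketch below is one possible arrangement, not necessarily the configuration used in this notebook: it places the scaler inside the pipeline so that each cross-validation fold fits it on its own training portion, and it uses synthetic data from `make_classification` as a stand-in for the SAT features (an assumption for illustration only).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the SAT feature matrix (assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=42
)
# Hold-out split: the test part is never seen during the grid search.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Normalisation and feature reduction live inside the pipeline, so each
# CV fold fits them on its own training portion only.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("pca", PCA()),
    ("predictor", DecisionTreeClassifier(random_state=42)),
])
param_grid = {
    "pca__n_components": [5, 10],
    "predictor__criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Hold-out accuracy:", search.score(X_te, y_te))
```

Since `GridSearchCV` refits the best configuration on the full training split by default, `search.score` evaluates that refitted pipeline on the hold-out data.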

In [5]:

# PCA followed by a decision tree; n_components=50 is only a placeholder,
# the grid search below overrides it
classifier = Pipeline([
    ('pca', PCA(n_components=50)),
    ('predictor', DecisionTreeClassifier(random_state=42))
])

# Hyper-parameter grid: number of principal components and tree settings
params = {"pca__n_components": [8, 10, 12],
          "predictor__criterion": ["gini", "entropy", "log_loss"],
          "predictor__splitter": ["best", "random"],
          "predictor__max_features": ["sqrt", "log2"],
          }

# 5-fold cross-validated grid search on the training set
cross_validate_dc = GridSearchCV(classifier, params, cv=5, scoring='accuracy')
cross_validate_dc.fit(X_train, y_train)

# Refit with the best parameters and test on unseen data
cross_validate_dc.best_params_
classifier.set_params(**cross_validate_dc.best_params_)
classifier.fit(X_train, y_train)
accuracy_score(y_test, classifier.predict(X_test))


0.9323204419889503

# Feature selection

In [6]:
# Remove low-variance features (variance below 0.03 on the scaled data)
variance_filter = VarianceThreshold(threshold=0.03)
X_train_filtered = variance_filter.fit_transform(X_train)
X_test_filtered = variance_filter.transform(X_test)

# Select the k best features by ANOVA F-score, then fit a decision tree
classifier_2 = Pipeline([
    ("feature_selection", SelectKBest(f_classif)),
    ("predictor", DecisionTreeClassifier(random_state=0))
])

params = {"feature_selection__k": [20, 40, 80, 'all'],
          "predictor__criterion": ["gini", "entropy", "log_loss"],
          "predictor__splitter": ["best", "random"],
          "predictor__max_features": ["sqrt", "log2", None]
          }

# 10-fold cross-validated grid search on the filtered training set
cross_validate_feature_dc = GridSearchCV(classifier_2, params, scoring='accuracy', cv=10)
cross_validate_feature_dc.fit(X_train_filtered, y_train)

# Refit with the best parameters and test on unseen data
cross_validate_feature_dc.best_params_
classifier_2.set_params(**cross_validate_feature_dc.best_params_)
classifier_2.fit(X_train_filtered, y_train)
accuracy_score(y_test, classifier_2.predict(X_test_filtered))

0.9958563535911602

## New classifier (10 Marks)

Replicate the previous task for a classifier different than K-NN and decision trees. Briefly describe your choice.
Try to create the best model for the given dataset.


Save your best model to your GitHub repository and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv

This link currently contains a sample of the training set. The real test set will be released after the submission. I should be able to run the code cell independently, so load all the libraries you need as well.

# SVM

I decided to use a support vector machine (SVM) as my model of choice. Support vector machines draw hyperplanes to separate the points for classification. In this case, the hyperplanes are drawn in the space of the SAT instances' features, which should (hopefully) correctly identify the class of each SAT problem.
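As a small illustration of the separating hyperplane idea, the sketch below fits a linear-kernel `SVC` on hypothetical two-dimensional toy points (not the SAT features) and reads the hyperplane parameters from `coef_` and `intercept_`. For problems with more than two classes, such as this one, scikit-learn's `SVC` combines several such binary separators in a one-vs-one scheme.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data: two linearly separable clusters.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svc = SVC(kernel="linear")
svc.fit(X_toy, y_toy)

# For a linear kernel the decision boundary is the hyperplane w . x + b = 0.
w, b = svc.coef_[0], svc.intercept_[0]
print("w =", w, "b =", b)

# Points on either side of the hyperplane get different labels.
print(svc.predict([[0.1, 0.1], [1.0, 0.9]]))
```

`coef_` is only available for the linear kernel; with `rbf`, `poly`, or `sigmoid` the boundary is a hyperplane in an implicit feature space and cannot be read off this way.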

In [7]:
from sklearn import svm

# PCA followed by a support vector classifier; n_components=50 is only a
# placeholder, the grid search below overrides it
new_classifier = Pipeline([
    ('pca', PCA(n_components=50)),
    ('predictor', svm.SVC(random_state=42)),
])
params = {"pca__n_components": [10, 20, 30],
          "predictor__kernel": ["linear", "poly", "rbf", "sigmoid"],
          }

# 5-fold cross-validated grid search on the training set
cross_validate_feature_svm = GridSearchCV(new_classifier, params, cv=5, scoring='accuracy')
cross_validate_feature_svm.fit(X_train, y_train)

# Refit with the best parameters and test on unseen data
cross_validate_feature_svm.best_params_
new_classifier.set_params(**cross_validate_feature_svm.best_params_)
new_classifier.fit(X_train, y_train)
accuracy_score(y_test, new_classifier.predict(X_test))

0.988950276243094

In [11]:
from sklearn.ensemble import RandomForestClassifier

# PCA followed by a random forest, evaluated the same way as the SVM above
new_classifier = Pipeline([
    ('pca', PCA(n_components=50)),
    ('predictor', RandomForestClassifier(random_state=42)),
])
params = {"pca__n_components": [10, 20, 30, 50, 80, 100],
          "predictor__criterion": ["gini", "entropy", "log_loss"],
          "predictor__max_features": ["sqrt", "log2"],
          }

# 5-fold cross-validated grid search on the training set
cross_validate_feature_rf = GridSearchCV(new_classifier, params, cv=5, scoring='accuracy')
cross_validate_feature_rf.fit(X_train, y_train)

# Refit with the best parameters and test on unseen data
cross_validate_feature_rf.best_params_
new_classifier.set_params(**cross_validate_feature_rf.best_params_)
new_classifier.fit(X_train, y_train)
accuracy_score(y_test, new_classifier.predict(X_test))


0.9792817679558011

# <font color="blue">FOR GRADING ONLY</font>

Save your best model to your GitHub repository and create a single code cell that loads it and evaluates it on the following test dataset:
https://github.com/andvise/DataAnalyticsDatasets/blob/main/test_dataset.csv


In [None]:
from joblib import dump, load
from io import BytesIO
import requests

# INSERT YOUR MODEL'S URL
mLink = 'URL_OF_YOUR_MODEL_SAVED_IN_YOUR_GITHUB_REPOSITORY?raw=true'
# mfile = BytesIO(requests.get(mLink).content)
# model = load(mfile)
# YOUR CODE HERE
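The grading cell above is left as a template. As a hedged illustration of the save-and-load round trip it asks for, the sketch below trains a small pipeline on synthetic data, serialises it with `joblib.dump` into an in-memory buffer, and restores it with `joblib.load`. The data, the pipeline contents, and the names `demo_model` and `restored_model` are assumptions for illustration; the real cell would download the saved file from the model URL and read the test CSV instead.

```python
from io import BytesIO

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): the real cell would read the test CSV.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Bundle preprocessing and classifier so the saved file is self-sufficient.
demo_model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("predictor", SVC(random_state=42)),
])
demo_model.fit(X_demo, y_demo)

# Serialise to an in-memory buffer; writing to disk would instead pass a
# file name (for example a hypothetical "model.joblib") to dump.
buffer = BytesIO()
dump(demo_model, buffer)
buffer.seek(0)

# Loading from a file-like object mirrors load(BytesIO(<downloaded bytes>)).
restored_model = load(buffer)
print("Accuracy:", accuracy_score(y_demo, restored_model.predict(X_demo)))
```

Saving the whole pipeline, rather than only the final estimator, means the loading cell does not need to repeat the scaling step by hand.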