# Machine Learning

## Task 2.3 - Own Approach
We follow two distinct approaches:
 2.3.1 - Apply a clustering model (KModes) to find the centroids of clusters on subsets of the dataset, grouped by adoption speed. The idea is to determine the most common characteristics of the members of each class.
 2.3.2 - Use association rule mining together with CAR to perform classification.

In [84]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import seaborn as sns
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes
from imblearn.under_sampling import TomekLinks
import category_encoders as ce
from sklearn.metrics import accuracy_score

In [85]:
PetFinder_dataset = pd.read_csv("PetFinder_dataset_pp.csv")
PetFinder_dataset.shape

(12987, 28)

In [86]:
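# Keep only the categorical/binned columns, then group the rows by adoption-speed class
dropped_columns = ['Age', 'Quantity', 'Fee', 'State', 'PhotoAmt', 'Polarity', 'Subjectivity',
                   'DescWords', 'Adopted', 'AdoptionSpeed', 'SubjectivityBin', 'PolarityBin']
groupedBy_AdoptionSpeed = PetFinder_dataset.drop(dropped_columns, axis=1).groupby('InitialAdoptionSpeed')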


In [87]:
groupedBy_AdoptionSpeed.first()

InitialAdoptionSpeed,Type,Gender,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Hasname,Breed,Color,AgeBin,FeeBin,PhotoAmtBin,DescwordsBin
0,Cat,Male,Medium,Medium,Not Sure,Not Sure,Not Sure,Healthy,0,Domestic,BlackOther,"(-0.1, 3.0]","(-0.1, 0.0]","(0.99, 3.99]","(17.0, 25.0]"
1,Cat,Male,Medium,Long,No,No,Not Sure,Healthy,1,Domestic,Black,"(3.0, 12.0]","(0.0, 300.0]","(0.99, 3.99]","(71.0, 93.0]"
2,Cat,Male,Small,Short,No,No,No,Healthy,1,Purebreed,BlackOther,"(-0.1, 3.0]","(0.0, 300.0]","(0.99, 3.99]","(56.0, 71.0]"
3,Dog,Male,Medium,Medium,Yes,Yes,No,Healthy,1,Mixed Breed,BrownOther,"(-0.1, 3.0]","(-0.1, 0.0]","(3.99, 30.0]","(56.0, 71.0]"
4,Cat,Female,Medium,Medium,Not Sure,Not Sure,Not Sure,Healthy,1,Domestic,BlackOther,"(3.0, 12.0]","(-0.1, 0.0]","(0.99, 3.99]","(44.0, 56.0]"


## 2.3.1 - Clustering Data - Binary Classification

In [88]:
km = KModes(n_clusters=1, init='Cao', n_init=10, verbose=1)

Initialization method and algorithm are deterministic. Setting n_init to 1.
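
With `n_clusters=1` every row is assigned to the same cluster, so the single centroid should coincide with the most frequent value of each column (up to tie-breaking). The centroids printed in the following cells can therefore be cross-checked with plain pandas. A minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the preprocessed data (values are hypothetical).
toy = pd.DataFrame({
    "InitialAdoptionSpeed": [0, 0, 0, 1, 1, 1],
    "Type":   ["Cat", "Cat", "Dog", "Dog", "Dog", "Cat"],
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
})

# Per class, take the most frequent value of every remaining column.
# mode() can return several values on ties; iloc[0] keeps the first one.
modes_per_class = toy.groupby("InitialAdoptionSpeed").agg(
    lambda column: column.mode().iloc[0]
)
print(modes_per_class)
```

If the per-class modes and the KModes centroids disagree on the real data, the difference is most likely due to ties between equally frequent values.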


### Cluster of data where adoption speed is 0

In [89]:
cluster_0 = km.fit_predict(groupedBy_AdoptionSpeed.get_group(0))

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 2360.0


In [90]:
# Print the cluster centroids
print(km.cluster_centroids_)

[['Cat' 'Female' 'Medium' 'Short' 'No' 'Yes' 'No' 'Healthy' '1'
  'Mixed Breed' 'BlackOther' '(-0.1, 3.0]' '(-0.1, 0.0]' '(0.99, 3.99]'
  '(-0.001, 9.0]']]


### Cluster of data where adoption speed is 1

In [91]:
cluster_1 = km.fit_predict(groupedBy_AdoptionSpeed.get_group(1))

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 16838.0


In [92]:
# Print the cluster centroids
print(km.cluster_centroids_)

[['Cat' 'Female' 'Medium' 'Short' 'No' 'Yes' 'No' 'Healthy' '1'
  'Mixed Breed' 'BlackOther' '(-0.1, 3.0]' '(-0.1, 0.0]' '(0.99, 3.99]'
  '(-0.001, 9.0]']]


### Cluster of data where adoption speed is 2

In [93]:
cluster_2 = km.fit_predict(groupedBy_AdoptionSpeed.get_group(2))

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 20866.0


In [94]:
# Print the cluster centroids
print(km.cluster_centroids_)

[['Dog' 'Female' 'Medium' 'Short' 'No' 'Yes' 'No' 'Healthy' '1'
  'Mixed Breed' 'BlackOther' '(-0.1, 3.0]' '(-0.1, 0.0]' '(0.99, 3.99]'
  '(17.0, 25.0]']]


### Cluster of data where adoption speed is 3

In [95]:
cluster_3 = km.fit_predict(groupedBy_AdoptionSpeed.get_group(3))

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 16666.0


In [96]:
# Print the cluster centroids
print(km.cluster_centroids_)

[['Dog' 'Female' 'Medium' 'Short' 'Yes' 'Yes' 'No' 'Healthy' '1'
  'Mixed Breed' 'BlackOther' '(-0.1, 3.0]' '(-0.1, 0.0]' '(0.99, 3.99]'
  '(135.0, 1257.0]']]


### Cluster of data where adoption speed is 4

In [97]:
cluster_4 = km.fit_predict(groupedBy_AdoptionSpeed.get_group(4))

Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run 1, iteration: 1/100, moves: 0, cost: 21379.0


In [98]:
# Print the cluster centroids
print(km.cluster_centroids_)

[['Dog' 'Female' 'Medium' 'Short' 'Yes' 'Yes' 'No' 'Healthy' '1'
  'Mixed Breed' 'BlackOther' '(3.0, 12.0]' '(-0.1, 0.0]' '(0.99, 3.99]'
  '(-0.001, 9.0]']]


# Results and Discussion

These centroids give us deeper insight into the data. For example:

Cluster 0 has this centroid:
[['Cat' 'Female' 'Medium' 'Short' 'No' 'Yes' 'No' 'Healthy' '1' 'Mixed Breed' 'BlackOther' '(-0.1, 3.0]' '(-0.1, 0.0]' '(0.99, 3.99]' '(-0.001, 9.0]']]
It tells us that the typical animal with AdoptionSpeed 0 is a female cat aged 0-3 months, of medium maturity size and with short fur.

Cluster 4 has this centroid:
[['Dog' 'Female' 'Medium' 'Short' 'Yes' 'Yes' 'No' 'Healthy' '1' 'Mixed Breed' 'BlackOther' '(3.0, 12.0]' '(-0.1, 0.0]' '(0.99, 3.99]' '(-0.001, 9.0]']]
It tells us that the typical animal with AdoptionSpeed 4 is a young (3-12 months) female dog.

Reading the centroids from AdoptionSpeed 0 to 4, the most common Type is Cat for classes 0 and 1 and Dog for classes 2, 3 and 4, which suggests that cats tend to be adopted faster than dogs.
Several other attributes (Female, Medium, Short, Healthy, Mixed Breed, BlackOther) are identical in all five centroids, so they do not help to tell the classes apart.
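
The cat-versus-dog reading above rests on a single most-frequent value per class, which hides how close the split actually is. One way to check it is to look at the share of each Type within every adoption-speed class. A minimal sketch on made-up data (the frame `toy` is hypothetical; on the real data the two columns of `PetFinder_dataset` would be used instead):

```python
import pandas as pd

# Hypothetical sample; the real counts will differ.
toy = pd.DataFrame({
    "InitialAdoptionSpeed": [0, 0, 0, 1, 1, 4, 4, 4],
    "Type": ["Cat", "Cat", "Dog", "Cat", "Dog", "Dog", "Dog", "Cat"],
})

# Row-normalised contingency table: each row sums to 1.
type_share = pd.crosstab(
    toy["InitialAdoptionSpeed"], toy["Type"], normalize="index"
)
print(type_share.round(2))
```

A class where the shares are close to 50/50 would mean the modal Type says little about that class.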

Applying the KModes algorithm directly to the whole dataset has the drawback that it misses class 0 (AdoptionSpeed = 0), because that class has fewer instances than the others. We therefore clustered each class separately to see which attribute values characterise it.

## 2.3.2 - Association Rule Binary Classification

In [124]:
encoder = ce.BinaryEncoder(cols=['Type' ,'Age' ,'Gender' , 'MaturitySize' , 'FurLength' , 'Vaccinated' ,'Dewormed' , 'Sterilized' , 'Health' , 'Fee' , 'PhotoAmt' , 'Breed' , 'Color'])
df_binary = encoder.fit_transform(PetFinder_dataset).drop(['DescWords','AgeBin','FeeBin','PhotoAmtBin','PolarityBin','SubjectivityBin','DescwordsBin','AdoptionSpeed','Polarity','Subjectivity','InitialAdoptionSpeed'],axis=1)

In [125]:
df_binary.head(3)

Unnamed: 0,Type_0,Type_1,Age_0,Age_1,Age_2,Age_3,Age_4,Age_5,Age_6,Age_7,...,Hasname,Breed_0,Breed_1,Breed_2,Color_0,Color_1,Color_2,Color_3,Color_4,Adopted
0,0,1,0,0,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,1,True
1,0,1,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,True
2,1,0,0,0,0,0,0,0,1,0,...,1,0,1,1,0,0,0,1,0,True


In [126]:
from pyarc import TransactionDB
from pyarc.algorithms import (
    top_rules,
    createCARs,
    M1Algorithm,
    M2Algorithm
)

In [127]:
df_binary = df_binary.sample(frac=1)

In [128]:
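# Training sample. Note: the dataset has 12987 rows, so head(12993) returns every row,
# including the 2000 rows that are used as the test sample below.
a = df_binary.head(12993)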

In [129]:
b = df_binary.tail(2000)
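
Since the dataset has 12987 rows, `head(12993)` selects the whole shuffled frame, so the 2000 rows taken with `tail(2000)` also appear in the training sample. A disjoint split could look like the sketch below; `frame` and `test_size` are hypothetical stand-ins, and this is one possible fix rather than the split used for the results reported here:

```python
import pandas as pd

# Hypothetical stand-in for df_binary; only the row handling matters here.
frame = pd.DataFrame({
    "feature": range(100),
    "Adopted": [i % 2 == 0 for i in range(100)],
})

test_size = 20  # assumed hold-out size, analogous to the 2000 rows above

# Shuffle once (fixed seed for repeatability), then cut into two disjoint parts.
shuffled = frame.sample(frac=1, random_state=0)
train = shuffled.iloc[:-test_size]
test = shuffled.iloc[-test_size:]

print(len(train), len(test))
```

With such a split, the accuracy is measured only on rows the rule miner has never seen.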

## 2.3.2.1 Finding Associations

In [130]:
txns_train = TransactionDB.from_DataFrame(a)
txns_test = TransactionDB.from_DataFrame(b)

# Mine candidate association rules, starting from the given confidence and support
rules = top_rules(txns_train.string_representation,
                  appearance=txns_train.appeardict,
                  init_conf=0.8,
                  init_support=0.8)
rules.sort(reverse=True)

# Convert them to class association rules (CARs) and keep at most the top 1000
cars = createCARs(rules)
cars.sort(reverse=True)
cars = cars[:1000]

print("len(rules)", len(cars))

# Build one classifier with the M1 algorithm and one with the M2 algorithm
m1clf = M1Algorithm(cars, txns_train).build()
m2clf = M2Algorithm(cars, txns_train).build()

# True class labels of the test transactions
actual = [label.value for label in txns_test.class_labels]

pred = m1clf.predict_all(txns_test)
predM2 = m2clf.predict_all(txns_test)

# accuracy_score expects (y_true, y_pred)
acc = accuracy_score(actual, pred)
accM2 = accuracy_score(actual, predM2)

Running apriori with setting: confidence=0.8, support=0.8, minlen=2, maxlen=3, MAX_RULE_LEN=57
Rule count: 9126, Iteration: 1
Target rule count satisfied: 1000
len(rules) 1000


In [131]:
acc

0.728

In [132]:
accM2

0.728

# Results and Discussion

In this task we used an association rule model for classification, namely the CAR method (Classification based on Association Rules). We set both the confidence and the support to 0.8, and the run log reports a maximum rule length of 57. Both values were set high so that only rules with a high chance of occurrence are kept.
We used the top 1000 rules, i.e. those with the highest support, confidence and lift. Both the M1 and the M2 classifier reach an accuracy of 0.728 on the 2000-row test sample. Because head(12993) covers all 12987 rows of the dataset, these test rows were also part of the training transactions, so this figure is likely optimistic.
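
To make the two thresholds concrete: for a class association rule "antecedent => class", support is the fraction of all rows that match both the antecedent and the class, and confidence is the fraction of antecedent-matching rows that also have the class. A minimal sketch on made-up data (the frame `toy` and the helper `rule_metrics` are hypothetical and independent of pyarc):

```python
import pandas as pd

# Hypothetical binary-encoded rows with a class column.
toy = pd.DataFrame({
    "Type_1":  [1, 1, 1, 0, 1],
    "Hasname": [1, 1, 0, 1, 1],
    "Adopted": [True, True, False, True, True],
})

def rule_metrics(df, antecedent, class_col, class_value):
    """Support and confidence of the rule: antecedent => class_col == class_value."""
    matches_antecedent = pd.Series(True, index=df.index)
    for column, value in antecedent.items():
        matches_antecedent &= df[column] == value
    matches_rule = matches_antecedent & (df[class_col] == class_value)
    support = matches_rule.mean()                                # share of all rows
    confidence = matches_rule.sum() / matches_antecedent.sum()   # share of antecedent rows
    return support, confidence

support, confidence = rule_metrics(
    toy, {"Type_1": 1, "Hasname": 1}, "Adopted", True
)
print(support, confidence)
```

Here three of the five rows match the whole rule and all three antecedent-matching rows are adopted, so the rule has support 0.6 and confidence 1.0.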

## 2.3.3 - Association Rule Multiclass Classification

In [109]:
encoder = ce.BinaryEncoder(cols=['Type' ,'Age' ,'Gender' , 'MaturitySize' , 'FurLength' , 'Vaccinated' ,'Dewormed' , 'Sterilized' , 'Health' , 'Fee' , 'PhotoAmt' , 'Breed' , 'Color'])
df_binary = encoder.fit_transform(PetFinder_dataset).drop(['DescWords','AgeBin','FeeBin','PhotoAmtBin','PolarityBin','SubjectivityBin','DescwordsBin','AdoptionSpeed','Polarity','Subjectivity','Adopted'],axis=1)

In [110]:
df_binary.head(3)

Unnamed: 0,Type_0,Type_1,Age_0,Age_1,Age_2,Age_3,Age_4,Age_5,Age_6,Age_7,...,Hasname,Breed_0,Breed_1,Breed_2,Color_0,Color_1,Color_2,Color_3,Color_4,InitialAdoptionSpeed
0,0,1,0,0,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,1,2
1,0,1,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
2,1,0,0,0,0,0,0,0,1,0,...,1,0,1,1,0,0,0,1,0,3


In [111]:
from pyarc import TransactionDB
from pyarc.algorithms import (
    top_rules,
    createCARs,
    M1Algorithm,
    M2Algorithm
)

In [112]:
df_binary = df_binary.sample(frac=1)

In [113]:
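# Training sample. Note: the dataset has 12987 rows, so head(12993) returns every row,
# including the 2000 rows that are used as the test sample below.
a = df_binary.head(12993)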

In [114]:
b = df_binary.tail(2000)

## 2.3.3.1 Finding Associations

In [115]:
txns_train = TransactionDB.from_DataFrame(a)
txns_test = TransactionDB.from_DataFrame(b)

# Mine candidate association rules, starting from the given confidence and support
rules = top_rules(txns_train.string_representation,
                  appearance=txns_train.appeardict,
                  init_conf=0.8,
                  init_support=0.5)
rules.sort(reverse=True)

# Convert them to class association rules (CARs) and keep at most the top 1000
cars = createCARs(rules)
cars.sort(reverse=True)
cars = cars[:1000]

print("len(rules)", len(cars))

# Build one classifier with the M1 algorithm and one with the M2 algorithm
m1clf = M1Algorithm(cars, txns_train).build()
m2clf = M2Algorithm(cars, txns_train).build()

# True class labels of the test transactions
actual = [label.value for label in txns_test.class_labels]

pred = m1clf.predict_all(txns_test)
predM2 = m2clf.predict_all(txns_test)

# accuracy_score expects (y_true, y_pred)
acc = accuracy_score(actual, pred)
accM2 = accuracy_score(actual, predM2)

Running apriori with setting: confidence=0.8, support=0.5, minlen=2, maxlen=3, MAX_RULE_LEN=57
Rule count: 20960, Iteration: 1
Target rule count satisfied: 1000
len(rules) 1000


In [116]:
acc

0.366

In [117]:
accM2

0.366

# Results and Discussion

As in the binary case, we used the CAR method (Classification based on Association Rules). We set the confidence to 0.8 and the support to 0.5, and the run log reports a maximum rule length of 57. The support value is lower here because with multiple classes each pattern occurs less often.
Both classifiers reach a low accuracy of 0.366 (about 37%), which is almost the same as the accuracy we achieved in supervised learning for the multi-class classification task.
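
A useful reference point for an accuracy figure like this is the score obtained by always predicting the most frequent class. A minimal sketch, assuming a made-up label list (the values in `actual_labels` and the helper `majority_baseline` are hypothetical; on the real data the list `actual` from the cell above would be used):

```python
from collections import Counter

# Hypothetical test labels (adoption-speed classes 0 to 4).
actual_labels = [2, 4, 1, 2, 3, 4, 2, 0, 1, 2]

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

baseline_label, baseline_accuracy = majority_baseline(actual_labels)
print(baseline_label, baseline_accuracy)
```

If the classifier's accuracy is not clearly above this baseline, the mined rules add little beyond the class distribution.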