# TXMeans by Riccardo Guidotti
**Notebook by Edoardo Gabrielli**

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setting-Python-Path" data-toc-modified-id="Setting-Python-Path-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setting Python Path</a></span></li><li><span><a href="#Import-Packages" data-toc-modified-id="Import-Packages-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Packages</a></span><ul class="toc-item"><li><span><a href="#Internal-Packages" data-toc-modified-id="Internal-Packages-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Internal Packages</a></span></li><li><span><a href="#Other-Packages" data-toc-modified-id="Other-Packages-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Other Packages</a></span></li></ul></li><li><span><a href="#Visualize-Data" data-toc-modified-id="Visualize-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Visualize Data</a></span></li><li><span><a href="#Application-of-algorithm" data-toc-modified-id="Application-of-algorithm-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Application of algorithm</a></span></li></ul></div>

## Setting Python Path

In [1]:
import os
import sys
from pathlib import Path

# TODO: find a better way to add the package to the Python path.
home = str(Path.home())

## MODIFY HERE! ##
# This must point to the folder where you cloned the repo (relative to your home directory).
Folder_Cloned_In = '/Università/2DM/Esercitazioni/' # Change this.
##################

# Full dir name
path_to_lib = home + Folder_Cloned_In

if os.path.isdir(path_to_lib + 'TXMeans'):
    print(f'My Home is: {home}')
    print(f'I cloned in: {path_to_lib}')
    # Add dirs to Python Path 
    sys.path.insert(0, path_to_lib + 'TXMeans/code')
    sys.path.insert(0, path_to_lib + 'TXMeans/code/algorithms')
else:
    print("Can't find the TXMeans directory.")
    print('Your current working directory is:')
    print(os.getcwd())

My Home is: /home/edo
I cloned in: /home/edo/Università/2DM/Esercitazioni/


## Import Packages

### Internal Packages

In [2]:
import algorithms.txmeans
from algorithms.txmeans import TXmeans # The model class (sklearn-style usage)
from algorithms.txmeans import remap_items, count_items, sample_size # Utility functions
from algorithms.txmeans import basket_list_to_bitarray, basket_bitarray_to_list # Convert baskets to and from bitarrays
from generators.datamanager import read_uci_data # Reads the data into basket format
from validation.validation_measures import delta_k, purity, normalized_mutual_info_score # Validation measures
from algorithms.util import jaccard_bitarray
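
The bitarray conversion and `jaccard_bitarray` imported above suggest that baskets are compared with a Jaccard distance computed on bit vectors. I have not reproduced the package's own implementation here; the following is only a minimal sketch of the idea, using plain Python integers as bit masks. The function name `jaccard_distance_bits` and the toy baskets are hypothetical.

```python
def jaccard_distance_bits(a: int, b: int) -> float:
    """Jaccard distance between two baskets encoded as integer bit masks."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty baskets are treated as identical
    intersection = bin(a & b).count("1")
    return 1.0 - intersection / union


# Hypothetical baskets over 5 items; bit i is set when item i is in the basket.
basket_a = 0b10110  # items 1, 2, 4
basket_b = 0b00111  # items 0, 1, 2
distance = jaccard_distance_bits(basket_a, basket_b)
print(distance)
```

With a bit encoding, intersection and union reduce to a bitwise AND and OR followed by a bit count, which is usually much cheaper than comparing Python lists item by item.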

### Other Packages

In [3]:
import pandas as pd
import numpy as np
import IPython.display as ipd
import datetime

## Visualize Data

In [4]:
mushrooms = pd.read_csv(path_to_lib + 'TXMeans/dataset/mushrooms.csv')
ipd.display(mushrooms.head(3))

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
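
Each row of this table is a set of categorical values, so it can be read as a "basket" of items. How `read_uci_data` actually encodes the items is not shown in this notebook; the sketch below only illustrates the general idea under the assumption that an item is a `column=value` string. The helper `rows_to_baskets` and the three-column excerpt are hypothetical.

```python
import csv
import io

# Hypothetical three-column excerpt in the same style as mushrooms.csv.
toy_csv = "class,cap-shape,odor\np,x,p\ne,b,l\n"


def rows_to_baskets(lines, class_column):
    """Return (basket, label) pairs; each basket is a set of 'column=value' items."""
    baskets = []
    for row in csv.DictReader(lines):
        label = row.pop(class_column)  # the target is kept apart from the items
        basket = {f"{column}={value}" for column, value in row.items()}
        baskets.append((basket, label))
    return baskets


toy_baskets = rows_to_baskets(io.StringIO(toy_csv), "class")
for basket, label in toy_baskets:
    print(label, sorted(basket))
```

Prefixing each value with its column name keeps, for example, `odor=p` distinct from `class=p`, which matters because the same single-letter codes are reused across columns.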


## Application of algorithm

In [5]:
# Set the path to the data (it is passed to the function that creates the baskets)
path = path_to_lib + 'TXMeans/dataset/'
dataset_name = 'mushrooms.csv'
filename = path + dataset_name

# Initialize the model
txmeans_model = TXmeans()
# Index of the target class column (if there is none, use the index of a column that is not useful for clustering)
class_index = 0

# Optionally, the indexes of dataset columns to skip
skipcolumnsindex = set()

# Returns a list of (basket, target label) pairs and the maps of every value in the dataset.
baskets_real_labels, maps = read_uci_data(filename, class_index=class_index, skipcolumnsindex=skipcolumnsindex)

print(dataset_name, len(baskets_real_labels))

# Separate the baskets from the real labels
baskets_list = list()
real_labels = list()
for basket, label in baskets_real_labels:
    baskets_list.append(basket)
    real_labels.append(label)

# Speed up the Jaccard distance: remap the items and encode the baskets as bitarrays
baskets_list, map_newitem_item, map_item_newitem = remap_items(baskets_list)
baskets_list = basket_list_to_bitarray(baskets_list, len(map_newitem_item))

# Number of baskets (one per data row)
nbaskets = len(baskets_list)

# Number of distinct items
nitems = count_items(baskets_list)

start_time = datetime.datetime.now()

# Compute the size of the random subsample used to speed up the fit
nsample = sample_size(nbaskets, 0.05, conf_level=0.99, prob=0.5)

# Fit the model
txmeans_model.fit(baskets_list, nbaskets, nitems, random_sample=nsample)

end_time = datetime.datetime.now()
running_time = end_time - start_time

# Get the clusters found by the model
res = txmeans_model.clustering

# Number of iterations the model needed to converge
iter_count = txmeans_model.iter_count

# Initialize the predicted labels
pred_labels = [0] * len(real_labels)

# Initialize the list of clusters
baskets_clusters = list()
for label, cluster in enumerate(res):
    # Revert the bitarray transform.
    cluster_list = basket_bitarray_to_list(cluster['cluster']).values()
    for bid, bitarr in cluster['cluster'].items():
        # Assign the cluster label to every basket in the cluster
        pred_labels[bid] = label
    # Store each cluster once (not once per basket)
    baskets_clusters.append(cluster_list)

# Measures of clustering quality with respect to the target attribute
print('delta_k', delta_k(real_labels, pred_labels))
print('normalized_mutual_info_score', normalized_mutual_info_score(real_labels, pred_labels))
print('purity', purity(real_labels, pred_labels))
print('running_time', running_time)

print(f'Num of Clusters: {len(np.unique(np.array(pred_labels)))}')

mushrooms.csv 8124
delta_k 5
normalized_mutual_info_score 0.3214890407167497
purity 0.895371738060069
running_time 0:00:00.653447
Num of Clusters: 7
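
The arguments passed to `sample_size` above (a 0.05 margin of error, a 0.99 confidence level and a proportion of 0.5) look like the inputs of a classical sample-size calculation. I have not checked how the package computes it; the sketch below assumes Cochran's formula with a finite population correction, and the function name `cochran_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def cochran_sample_size(population, margin_error, conf_level=0.99, prob=0.5):
    """Sample size for estimating a proportion in a finite population.

    Assumption: Cochran's formula with finite population correction;
    the package's own sample_size may use a different formula.
    """
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Sample size for an infinite population
    n0 = z ** 2 * prob * (1 - prob) / margin_error ** 2
    # Shrink it for a finite population
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


n_needed = cochran_sample_size(8124, 0.05, conf_level=0.99, prob=0.5)
print(n_needed)
```

Under this assumption only a few hundred of the 8124 baskets are needed, which is consistent with using a subsample to keep the running time low.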


You can also get the clusters' medoids:

In [6]:
txmeans_model.medioids

[4370, 3538, 2205, 2288, 87, 145, 7044]
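
The list has one entry per cluster (7 entries for 7 clusters), so each value is presumably the id of the basket that represents its cluster. How the model selects that basket is not shown here; as a minimal sketch, assuming the usual definition of the representative as the element with the smallest total Jaccard distance to the other elements of the cluster, it could be computed like this. The helper names and the toy cluster are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two baskets stored as sets."""
    union = a | b
    if not union:
        return 0.0  # two empty baskets are treated as identical
    return 1.0 - len(a & b) / len(union)


def medoid_index(baskets):
    """Index of the basket with the smallest total distance to all the others."""
    totals = [
        sum(jaccard_distance(basket, other) for other in baskets)
        for basket in baskets
    ]
    return min(range(len(baskets)), key=totals.__getitem__)


# Hypothetical cluster of four baskets
toy_cluster = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"x"}]
best = medoid_index(toy_cluster)
print(best, toy_cluster[best])
```

Unlike a centroid, the representative chosen this way is always a real basket from the data, which is what makes it usable with categorical items.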