## LCM as a preprocessing step
LCM looks for closed itemsets whose support is at least a user-supplied minimum support. An itemset is closed when no strict superset of it has the same support.
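
To make the notion of a closed itemset concrete, here is a minimal brute-force sketch on toy data. It is not how LCM works internally (LCM avoids enumerating every candidate); the helper names `support` and `closed_itemsets` are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closed_itemsets(transactions, min_supp):
    """Brute-force search for closed frequent itemsets (toy data only)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    # Step 1: keep every itemset whose support reaches min_supp.
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            supp = support(candidate, transactions)
            if supp >= min_supp:
                frequent[candidate] = supp

    # Step 2: an itemset is closed if no strict superset has the same support.
    return {
        itemset: supp
        for itemset, supp in frequent.items()
        if not any(
            itemset < other and supp == other_supp
            for other, other_supp in frequent.items()
        )
    }


toy = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}]
closed = closed_itemsets(toy, min_supp=2)
for itemset, supp in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), supp)
```

In this toy dataset `{1}` and `{2}` are frequent but not closed, because `{1, 2}` occurs in exactly the same three transactions.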

#### Load the chess dataset

In [1]:
from skmine.datasets.fimi import fetch_chess
chess = fetch_chess()
chess.head()

0    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
1    [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...
2    [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...
3    [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...
4    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
Name: chess, dtype: object

In [2]:
chess.shape

(3196,)

#### fit_discover()
`fit_discover` makes pattern discovery more user-friendly: it returns nicely formatted patterns, one itemset per row along with its support,
instead of the traditional tabular format used in the `scikit` community.

In [3]:
from skmine.preprocessing import LCM
lcm = LCM(min_supp=2000, n_jobs=4)
# minimum support of 2000, running on 4 processes
%time patterns = lcm.fit_discover(chess)

CPU times: user 463 ms, sys: 376 ms, total: 839 ms
Wall time: 9.19 s


In [4]:
patterns.shape

(68967, 2)

Rendering patterns in this format makes post hoc analysis easier.

Here we keep only the patterns that contain strictly more than 3 items.

In [5]:
patterns[patterns.itemset.map(len) > 3]

| | itemset | support |
|---|---|---|
| 12 | (5, 7, 17, 58) | 2204 |
| 14 | (5, 9, 17, 58) | 2088 |
| 18 | (5, 7, 21, 58) | 2034 |
| 22 | (5, 7, 25, 58) | 2539 |
| 24 | (5, 9, 25, 58) | 2365 |
| ... | ... | ... |
| 68956 | (29, 40, 48, 52, 54) | 2012 |
| 68957 | (36, 40, 48, 52, 54) | 2012 |
| 68964 | (29, 40, 52, 70) | 2001 |
| 68965 | (29, 40, 58, 70) | 2006 |


`Note`

Even with a very high minimum support threshold (2000 out of 3196 transactions), we discovered 68,967 patterns from only 3196 original transactions.
This is a good illustration of the so-called **pattern explosion problem**.
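
The effect can be reproduced at a much smaller scale. The sketch below is a toy illustration with made-up transactions, and `count_frequent_itemsets` is a hypothetical helper. It counts all frequent itemsets, not only closed ones, so it shows the general phenomenon rather than the exact quantity LCM reports.

```python
from itertools import combinations


def count_frequent_itemsets(transactions, min_supp):
    """Count non-empty itemsets contained in at least `min_supp` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    count = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            if sum(1 for t in transactions if candidate <= t) >= min_supp:
                count += 1
    return count


# Four dense transactions over 10 items: transaction i holds every item but i.
all_items = set(range(10))
transactions = [all_items - {i} for i in range(4)]

# Require each pattern to appear in at least 3 of the 4 transactions (75%).
n_patterns = count_frequent_itemsets(transactions, min_supp=3)
print(len(transactions), "transactions ->", n_patterns, "frequent itemsets")
```

Dense data, where most transactions share most items, is what drives the count up, and the chess dataset is dense in the same way.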

------------
We can also get the top-k patterns by support with a single line of code.

In [6]:
patterns.nlargest(10, columns=['support'])  # top 10 patterns

| | itemset | support |
|---|---|---|
| 0 | (58,) | 3195 |
| 8824 | (52,) | 3185 |
| 2670 | (52, 58) | 3184 |
| 10651 | (29,) | 3181 |
| 29 | (29, 58) | 3180 |
| 8853 | (29, 52) | 3170 |
| 10686 | (40,) | 3170 |
| 386 | (40, 58) | 3169 |
| 2699 | (29, 52, 58) | 3169 |
| 9036 | (40, 52) | 3159 |
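
Several patterns can share the same support, so a cut at exactly k rows may split a group of ties. As a minimal sketch, assuming `patterns` is an ordinary pandas `DataFrame` with a `support` column, the `keep` parameter of `DataFrame.nlargest` controls this. The frame below is hand-built for illustration, reusing a few rows from the output above.

```python
import pandas as pd

# Hand-built stand-in for the `patterns` frame returned by fit_discover.
patterns = pd.DataFrame(
    {
        "itemset": [(58,), (52,), (52, 58), (29,), (29, 52), (40,), (40, 58)],
        "support": [3195, 3185, 3184, 3181, 3170, 3170, 3169],
    }
)

# Default behaviour (keep="first"): exactly 5 rows, ties broken by row order.
top5 = patterns.nlargest(5, columns=["support"])

# keep="all": every row tied with the last selected value is kept as well.
top5_with_ties = patterns.nlargest(5, columns=["support"], keep="all")

print(len(top5), len(top5_with_ties))
```

With `keep="all"` the result can hold more than k rows, which is often preferable when the ranking feeds a later analysis step.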


---------------
#### fit_transform()
LCM can also be used as a preprocessing step to work on tabular data

In [7]:
lcm = LCM(min_supp=2000, n_jobs=4)
lcm.fit_transform(chess)

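| | 3 | 5 | 7 | 9 | 11 | 15 | 17 | 21 | 25 | 27 | ... | 54 | 56 | 58 | 60 | 62 | 64 | 66 | 70 | 72 | 74 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004 | 2027 | 2050 | 2007 | 2050 | 2026 | 2089 | 2035 | 2003 | 2134 | ... | 2012 | 2002 | 2000 | 2001 | 2001 | 2001 | 2001 | 2000 | 2001 | 2007 |
| 1 | 2004 | 2027 | 2134 | 2007 | 0 | 2026 | 2089 | 2035 | 2003 | 2134 | ... | 2012 | 2002 | 2000 | 2001 | 2001 | 2001 | 2001 | 2000 | 2001 | 2007 |
| 2 | 2004 | 2027 | 2134 | 2007 | 0 | 0 | 2089 | 2035 | 2003 | 2134 | ... | 2012 | 2002 | 2000 | 2001 | 2001 | 2001 | 2001 | 2000 | 2001 | 2007 |
| 3 | 2004 | 2027 | 2050 | 2007 | 2050 | 2026 | 2089 | 2035 | 2003 | 2134 | ... | 2012 | 2002 | 2000 | 2001 | 2001 | 2001 | 2001 | 2000 | 2001 | 2007 |
| 4 | 2004 | 2027 | 2050 | 2007 | 2050 | 2026 | 2089 | 2035 | 2003 | 2134 | ... | 2012 | 2002 | 2000 | 2001 | 2001 | 2001 | 2001 | 2000 | 2001 | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3191 | 0 | 2027 | 2050 | 2007 | 2050 | 0 | 2089 | 2035 | 0 | 2134 | ... | 2002 | 2013 | 2000 | 0 | 2013 | 2036 | 0 | 2000 | 0 | 2013 |
| 3192 | 0 | 2027 | 2050 | 2007 | 2050 | 0 | 2089 | 2035 | 0 | 2134 | ... | 2002 | 2013 | 2000 | 0 | 2013 | 2036 | 0 | 2000 | 0 | 2013 |
| 3193 | 0 | 2027 | 2050 | 2007 | 2050 | 0 | 2089 | 2035 | 0 | 2134 | ... | 2002 | 2013 | 2000 | 0 | 2013 | 2036 | 0 | 2000 | 0 | 2013 |
| 3194 | 0 | 2034 | 0 | 2007 | 2129 | 0 | 2089 | 2089 | 0 | 2205 | ... | 2012 | 2068 | 2068 | 0 | 2068 | 2101 | 0 | 0 | 0 | 2068 |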


Setting a lower threshold would fill the cells that currently contain 0s, but it would also lead to longer run times.