## Modeling_AssociationRules.ipynb
### Due: April 23

Here we explore association rule mining on our anime watchlists.
Each transaction is the set of anime watched by one user. We use the Apriori algorithm
to generate the frequent itemsets from these transactions, then find the strongest association rules to use as predictors
of what anime a person might like to watch given other anime they have enjoyed.
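
To make the two core quantities concrete, here is a minimal sketch on made-up watchlists. The titles `A`, `B`, `C` and the helper names `support` and `confidence` are illustrative assumptions and are not part of the notebook or of mlxtend.

```python
# Toy watchlists; each transaction is the set of anime one user has watched.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C together) / support(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

print(support({"A", "B"}, transactions))       # 3 of the 5 transactions contain both
print(confidence({"A"}, {"B"}, transactions))  # 3 of the 4 transactions containing A
```

A rule such as `{A} -> {B}` is useful as a predictor when its confidence is high and the itemset `{A, B}` is frequent enough to be trusted.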

In [1]:
import pandas as pd



### Load in Dataset and Do some Data Preprocessing

In [2]:
df_anime_lists = pd.read_csv("./data/animelists_cleaned.csv")
df_anime_lists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 11 columns):
 #   Column               Dtype  
---  ------               -----  
 0   username             object 
 1   anime_id             int64  
 2   my_watched_episodes  int64  
 3   my_start_date        object 
 4   my_finish_date       object 
 5   my_score             int64  
 6   my_status            int64  
 7   my_rewatching        float64
 8   my_rewatching_ep     int64  
 9   my_last_updated      object 
 10  my_tags              object 
dtypes: float64(1), int64(5), object(5)
memory usage: 2.6+ GB


In [3]:
def getMemUsageInMB(df : pd.DataFrame) -> float:
    return df.memory_usage(deep=True).sum() / 2**20

In [4]:
orig_mem = getMemUsageInMB(df_anime_lists)
print(f"Total memory used by dataframe: {orig_mem :.2f} MB")

Total memory used by dataframe: 10738.10 MB


In [5]:
# fraction of NaN values in each column
df_anime_lists.isna().mean()

username               0.000008
anime_id               0.000000
my_watched_episodes    0.000000
my_start_date          0.000000
my_finish_date         0.000000
my_score               0.000000
my_status              0.000000
my_rewatching          0.219864
my_rewatching_ep       0.000000
my_last_updated        0.000000
my_tags                0.936274
dtype: float64

In [6]:
# keep only `username` and `anime_id`; the remaining columns are not used for association rule mining.
df_anime_lists.drop(columns=df_anime_lists.columns[2:], inplace=True)

# downcast the integer columns (here `anime_id`) to the smallest integer type that fits.
intCols = df_anime_lists.select_dtypes('int').columns
df_anime_lists[intCols] = df_anime_lists[intCols].apply(pd.to_numeric, downcast='integer')

# downcast float columns; none remain after the drop above, so this is a no-op kept for generality.
fcols = df_anime_lists.select_dtypes('float').columns
df_anime_lists[fcols] = df_anime_lists[fcols].apply(pd.to_numeric, downcast='float')

In [7]:
tmp = getMemUsageInMB(df_anime_lists)
print(f"Total memory used by dataframe: {tmp :.2f} MB")

# memory reduced by:
print(f"Memory Usage Reduction: {((orig_mem - tmp)/orig_mem) * 100}%")
del tmp

Total memory used by dataframe: 2088.10 MB
Memory Usage Reduction: 80.55431970379016%


In [8]:
df_anime_lists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 2 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   username  object
 1   anime_id  int32 
dtypes: int32(1), object(1)
memory usage: 358.0+ MB


### Formulate the Transactions and Find Frequent Itemsets

In [9]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [10]:
transactions = df_anime_lists.groupby('username')['anime_id'].apply(list)
transactions

username
----phoebelyn       [21, 120, 853, 957, 1571, 1579, 1698, 1735, 1,...
---L-AND-AME-4EV                                           [20, 1535]
--AnimeBoy--        [21, 59, 74, 210, 232, 249, 853, 1557, 1735, 2...
--Etsuko--          [3092, 4814, 7054, 7674, 9926, 11013, 11123, 1...
--FallenAngel--     [21, 59, 210, 249, 269, 853, 857, 957, 1579, 1...
                                          ...                        
zzshinzozz          [21, 232, 249, 1735, 7054, 9513, 9863, 10800, ...
zzvl                [120, 269, 853, 4224, 6045, 7054, 9926, 10800,...
zzz275              [59, 120, 853, 1571, 1698, 2104, 4477, 1, 16, ...
zzzcielo            [21, 74, 269, 853, 857, 957, 1698, 1735, 3731,...
zzzzz-chan          [21, 59, 120, 210, 232, 269, 853, 857, 1735, 3...
Name: anime_id, Length: 108709, dtype: object
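
The grouping step can be pictured in plain Python. This is only a sketch on hypothetical `(username, anime_id)` rows; the helper name `build_transactions` is an assumption, and unlike the `apply(list)` call above it uses a set, so duplicate entries for the same user would be dropped.

```python
from collections import defaultdict

# Hypothetical (username, anime_id) rows standing in for the cleaned watchlist table.
rows = [
    ("alice", 21), ("alice", 120), ("bob", 20),
    ("alice", 21),  # duplicate entry for the same user
    ("bob", 1535),
]

def build_transactions(rows):
    """Collect each user's anime ids into one sorted, duplicate-free list."""
    by_user = defaultdict(set)
    for username, anime_id in rows:
        by_user[username].add(anime_id)
    return {user: sorted(ids) for user, ids in by_user.items()}

transactions_sketch = build_transactions(rows)
print(transactions_sketch)
```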

In [11]:
# convert the ids into their English Anime Name
df_anime = pd.read_csv("./data/anime_cleaned.csv")
df_anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6668 entries, 0 to 6667
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   anime_id         6668 non-null   int64  
 1   title            6668 non-null   object 
 2   title_english    3438 non-null   object 
 3   title_japanese   6663 non-null   object 
 4   title_synonyms   4481 non-null   object 
 5   image_url        6666 non-null   object 
 6   type             6668 non-null   object 
 7   source           6668 non-null   object 
 8   episodes         6668 non-null   int64  
 9   status           6668 non-null   object 
 10  airing           6668 non-null   bool   
 11  aired_string     6668 non-null   object 
 12  aired            6668 non-null   object 
 13  duration         6668 non-null   object 
 14  rating           6586 non-null   object 
 15  score            6668 non-null   float64
 16  scored_by        6668 non-null   int64  
 17  rank          

In [12]:
# some `title_english` values are missing, so we use the `title` column instead,
# which is always present (often the romanized Japanese title, e.g. `Shingeki no Kyojin`).

# build a mapping from anime_id to title
animeIdToName = dict(zip(df_anime.anime_id, df_anime.title))

# replace each id in every transaction with its title
transactions = transactions.apply(lambda x: [animeIdToName[i] for i in x])

# to save on memory usage, keep only the columns of interest: [anime_id, title]
df_anime.drop(columns=df_anime.columns[2:], inplace=True)
df_anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6668 entries, 0 to 6667
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   anime_id  6668 non-null   int64 
 1   title     6668 non-null   object
dtypes: int64(1), object(1)
memory usage: 104.3+ KB


In [13]:
# running out of mem, so delete the dataframe objects currently in RAM
del df_anime_lists
del df_anime

In [60]:
MIN_SUPPORT = 0.26

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=MIN_SUPPORT, use_colnames=True)

frequent_itemsets

|  | support | itemsets |
|---|---|---|
| 0 | 0.295017 | (Accel World) |
| 1 | 0.279075 | (Air) |
| 2 | 0.344323 | (Akame ga Kill!) |
| 3 | 0.569962 | (Angel Beats!) |
| 4 | 0.429983 | (Ano Hi Mita Hana no Namae wo Bokutachi wa Mad... |
| ... | ... | ... |
| 5447 | 0.263796 | (Shingeki no Kyojin, Steins;Gate, Sword Art On... |
| 5448 | 0.275221 | (Shingeki no Kyojin, Death Note, Steins;Gate, ... |
| 5449 | 0.271431 | (Shingeki no Kyojin, Mirai Nikki (TV), Steins;... |
| 5450 | 0.265912 | (Shingeki no Kyojin, Mirai Nikki (TV), Steins;... |
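
For intuition about what `apriori` does, here is a small level-wise sketch in plain Python. It is not mlxtend's implementation; the function name `apriori_sketch` and the toy transactions are assumptions made for illustration. It relies on the Apriori property: an itemset can only be frequent if all of its subsets are frequent.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the support threshold.
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
    return frequent

toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent_sketch = apriori_sketch(toy, min_support=0.6)
for itemset, sup in sorted(frequent_sketch.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

On the toy data, all singletons and pairs meet the threshold, while the triple `{A, B, C}` appears in only 2 of 5 transactions and is dropped.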


In [66]:
MIN_CONFIDENCE = 0.5
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=MIN_CONFIDENCE)
rules

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Accel World) | (Sword Art Online) | 0.295017 | 0.575003 | 0.276518 | 0.937295 | 1.630070 | 0.106882 | 6.777762 | 0.548282 |
| 1 | (Akame ga Kill!) | (Angel Beats!) | 0.344323 | 0.569962 | 0.289139 | 0.839732 | 1.473312 | 0.092888 | 2.683239 | 0.489962 |
| 2 | (Angel Beats!) | (Akame ga Kill!) | 0.569962 | 0.344323 | 0.289139 | 0.507295 | 1.473312 | 0.092888 | 1.330770 | 0.747043 |
| 3 | (Akame ga Kill!) | (Code Geass: Hangyaku no Lelouch) | 0.344323 | 0.622957 | 0.268975 | 0.781171 | 1.253973 | 0.054477 | 1.723002 | 0.308894 |
| 4 | (Akame ga Kill!) | (Death Note) | 0.344323 | 0.748153 | 0.297170 | 0.863055 | 1.153580 | 0.039563 | 1.839031 | 0.203047 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33708 | (Sword Art Online, Steins;Gate) | (Shingeki no Kyojin, Death Note, Angel Beats!,... | 0.426754 | 0.315576 | 0.261689 | 0.613209 | 1.943140 | 0.127016 | 1.769493 | 0.846703 |
| 33709 | (Death Note, Steins;Gate) | (Sword Art Online, Shingeki no Kyojin, Angel B... | 0.449586 | 0.325235 | 0.261689 | 0.582068 | 1.789683 | 0.115468 | 1.614533 | 0.801654 |
| 33710 | (Sword Art Online, Death Note) | (Shingeki no Kyojin, Steins;Gate, Angel Beats!... | 0.476557 | 0.304979 | 0.261689 | 0.549126 | 1.800534 | 0.116350 | 1.541495 | 0.849393 |
| 33711 | (Mirai Nikki (TV)) | (Angel Beats!, Shingeki no Kyojin, Steins;Gate... | 0.497309 | 0.295385 | 0.261689 | 0.526211 | 1.781440 | 0.114792 | 1.487191 | 0.872617 |
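
The metric columns can all be derived from three supports. The sketch below recomputes them for rule 0, `(Accel World) -> (Sword Art Online)`, from the rounded values printed in the table. The function name `rule_metrics` is hypothetical, and the formulas are the standard textbook definitions as we understand them, not code taken from mlxtend, so small differences in the last digits are expected.

```python
def rule_metrics(ant_support, cons_support, joint_support):
    """Derive the usual rule metrics from antecedent, consequent and joint support."""
    confidence = joint_support / ant_support
    lift = confidence / cons_support
    leverage = joint_support - ant_support * cons_support
    # Conviction is unbounded when the rule never fails (confidence == 1).
    conviction = (1 - cons_support) / (1 - confidence) if confidence < 1 else float("inf")
    # Zhang's metric normalises leverage into the range [-1, 1].
    zhang_denominator = max(
        joint_support * (1 - ant_support),
        ant_support * (cons_support - joint_support),
    )
    zhang = leverage / zhang_denominator if zhang_denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }

# Supports of rule 0 as printed above (rounded to six decimals).
metrics = rule_metrics(ant_support=0.295017, cons_support=0.575003, joint_support=0.276518)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent, which is why lift is used as the tie-breaker after confidence when ranking rules.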


In [16]:
def viewTopKRules(rules : pd.DataFrame, metric : str | list[str], k : int, ascending : bool = False) -> pd.DataFrame:
    '''
    Return the first k rows of the rules dataframe after sorting.

    metric - column name (or list of column names) to sort by
    k - number of rules to return
    ascending - sort in ascending order if True, descending order (default) otherwise
    '''
    return rules.sort_values(by=metric, ascending=ascending).head(k)

In [17]:
viewTopKRules(rules, metric=['confidence', 'lift'], k=5)

|  | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 6044 | (Shingeki no Kyojin, Sword Art Online II) | (Sword Art Online) | 0.324159 | 0.575003 | 0.32277 | 0.995715 | 1.731669 | 0.136378 | 99.181918 | 0.62518 |
| 2328 | (Sword Art Online II, Angel Beats!) | (Sword Art Online) | 0.297565 | 0.575003 | 0.296268 | 0.995641 | 1.731541 | 0.125167 | 97.502151 | 0.60145 |
| 5134 | (Death Note, Sword Art Online II) | (Sword Art Online) | 0.30486 | 0.575003 | 0.303425 | 0.995293 | 1.730935 | 0.128129 | 90.287346 | 0.607471 |
| 5924 | (No Game No Life, Sword Art Online II) | (Sword Art Online) | 0.296314 | 0.575003 | 0.294861 | 0.995095 | 1.730591 | 0.124479 | 86.645593 | 0.599931 |
| 7670 | (Tengen Toppa Gurren Lagann, Code Geass: Hangy... | (Code Geass: Hangyaku no Lelouch) | 0.293987 | 0.622957 | 0.292165 | 0.993805 | 1.595303 | 0.109024 | 60.858216 | 0.528545 |


### Create recommendations from association rules. Start with a simple search.

In [18]:
def recommendAnimeSimple( userList : list[str] ) -> list[tuple[str, int]]:
    '''
    Given a list of anime the user watches, return a list of anime recommendations.

    Returns all the anime from the consequents of association rules whose antecedents contain any one of the anime in userList.
    Each anime in userList is matched on its own; combinations of anime are not considered.

    Output data format:
        (anime_recommendation, score)
            score - integer position of the rule this rec came from, counted from the top rule (lower is better)
    '''
    assert isinstance(userList, list), 'input not of type list'
    
    retList = []
    retAnime = set()

    # sort the rules once, strongest first, instead of re-sorting for every anime
    sortedRules = rules.sort_values(by=['confidence', 'lift'], ascending=False)

    # loop over each anime in the user's list and add to our recommendations
    # the consequents of every rule that has that anime in its antecedents
    for anime in userList:
        # score is the rule's position in the sorted rules: the higher the score,
        # the further away from the top rules the anime rec comes from
        for score, (index, row) in enumerate(sortedRules.iterrows()):
            if anime in row.antecedents:
                for a in row.consequents:
                    if a in retAnime: continue
                    retAnime.add(a)
                    retList.append((a, score))

    # sort return list of anime by ascending score
    retList = sorted(retList, key=lambda x : x[1])
    
    return retList

In [19]:
recommendAnimeSimple(['Naruto', "Toradora!"])

[('Code Geass: Hangyaku no Lelouch', 53),
 ('Naruto', 95),
 ('Death Note', 108),
 ('Shingeki no Kyojin', 699),
 ('Sword Art Online', 945),
 ('Bleach', 1360),
 ('Naruto: Shippuuden', 1542),
 ('Code Geass: Hangyaku no Lelouch R2', 1554),
 ('Clannad: After Story', 1578),
 ('Fullmetal Alchemist: Brotherhood', 1886),
 ('Angel Beats!', 2350),
 ('Steins;Gate', 2713),
 ('Elfen Lied', 2856),
 ('Fullmetal Alchemist', 3054),
 ('Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai.', 3124),
 ('Soul Eater', 3666),
 ('No Game No Life', 3843),
 ('Toradora!', 4074),
 ('Tengen Toppa Gurren Lagann', 4325),
 ('K-On!', 5053),
 ('Mahou Shoujo Madoka★Magica', 5203),
 ('Fairy Tail', 5386),
 ('Durarara!!', 5785),
 ('Clannad', 5796),
 ('One Piece', 6392),
 ('Suzumiya Haruhi no Yuuutsu', 6403),
 ('Mirai Nikki (TV)', 6579),
 ('Highschool of the Dead', 6584),
 ('Ao no Exorcist', 6665),
 ('Sen to Chihiro no Kamikakushi', 6680),
 ('Bakemonogatari', 6851),
 ('Kaichou wa Maid-sama!', 7648),
 ('Higurashi no Naku Ko
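
Note that the output above includes 'Naruto' and 'Toradora!', which are the very titles passed in. One possible post-processing step, sketched here with a hypothetical helper named `drop_watched`, filters those out; the sample pairs are the first few entries of the output above.

```python
def drop_watched(recommendations, watched):
    """Remove recommendations whose title is already in the user's list."""
    watched = set(watched)
    return [rec for rec in recommendations if rec[0] not in watched]

# First few (title, score) pairs from the output above.
recs = [
    ("Code Geass: Hangyaku no Lelouch", 53),
    ("Naruto", 95),
    ("Death Note", 108),
    ("Shingeki no Kyojin", 699),
]
filtered = drop_watched(recs, ["Naruto", "Toradora!"])
print(filtered)
```

Because the helper only looks at the first element of each tuple, it would work unchanged on tuples that carry more than one score.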

### Create a better recommendAnime function.
The simple version above checks each anime in the user's list one at a time: for every rule whose antecedents contain that anime, it adds the rule's consequents to the recommendation list. This ignores the information carried by several anime appearing TOGETHER in the user's list. To make the recommendations more meaningful, we instead look for the largest subset of the user's list that matches a rule's antecedents exactly and return that rule's consequents.

In [20]:
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

In [21]:
[x for x in powerset(['Naruto', 'Toradora!'])]

[(), ('Naruto',), ('Toradora!',), ('Naruto', 'Toradora!')]

In [22]:
tmp = ['Naruto', 'Toradora!', 'Shingeki no Kyojin']
# all non-empty subsets as frozensets, largest first
tmp = [frozenset(x) for x in powerset(tmp) if len(x) >= 1]
tmp = sorted(tmp, key=len, reverse=True)
tmp

[frozenset({'Naruto', 'Shingeki no Kyojin', 'Toradora!'}),
 frozenset({'Naruto', 'Toradora!'}),
 frozenset({'Naruto', 'Shingeki no Kyojin'}),
 frozenset({'Shingeki no Kyojin', 'Toradora!'}),
 frozenset({'Naruto'}),
 frozenset({'Toradora!'}),
 frozenset({'Shingeki no Kyojin'})]

In [23]:
# great, the order doesn't matter in the frozenset comparison.
frozenset({'Naruto', 'Toradora!'}) == frozenset({'Toradora!', 'Naruto'})

True

In [40]:
def recommendAnimeBetter( userList : list[str] , short : bool = False) -> list[tuple[str, int, int]]:
    '''
    Given a list of anime the user watches, return a list of anime recommendations.

    This is the better version: it first matches rules whose antecedents equal the largest subsets of userList.
    It then adds more recommendations from rules matching smaller and smaller subsets, down to single anime from userList.

    Input data:
        userList - list of user anime to make predictions off of.
        short - boolean to toggle whether to return a short list of recommended anime or not. the short version 
            is just the consequent anime(s) in the first association rule whose antecedent matches with
            the longest subset in userList.

    Return data format:
        (anime_recommendation, score1, score2)
            score1 - the number of anime used to match the rule for this rec (higher is better)
            score2 - how far away from the strongest/top rule this rec was found at (lower is better)
    
    '''
    assert isinstance(userList, list), 'input not of type list'
    
    retList = []
    retAnime = set()

    # create all subsets with at least 1 anime, largest first
    subsets = [frozenset(x) for x in powerset(userList) if len(x) >= 1]
    subsets = sorted(subsets, key=len, reverse=True)

    # sort the rules once, strongest first, instead of re-sorting for every subset
    sortedRules = rules.sort_values(by=['confidence', 'lift'], ascending=False)

    for subset in subsets:
        matched = False
        for score2, (index, row) in enumerate(sortedRules.iterrows()):
            if subset != row.antecedents:
                continue
            matched = True

            for a in row.consequents:
                # don't add this anime if we already got it
                if a in retAnime: continue

                retAnime.add(a)
                retList.append((a, len(subset), score2))

            # only capture the first matching rule if short is True
            if short:
                break

        # in short mode, stop at the first (largest) subset that matched a rule
        if short and matched:
            break
    
    # sort return list by descending number of anime matched (score1)
    retList = sorted(retList, key=lambda x : x[1], reverse=True)
    
    return retList
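
Scanning every rule for every subset becomes slow as the rule table grows. One alternative, sketched below on hypothetical toy rules, is to index the rules by antecedent once and then look each subset up directly. The names `toy_rules`, `index_by_antecedent` and `recommend` are assumptions for illustration, and this sketch ranks by confidence only and omits the `score2` value the function above returns.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rules as (antecedent, consequent, confidence) tuples.
toy_rules = [
    (frozenset({"A", "B"}), frozenset({"C"}), 0.9),
    (frozenset({"A"}), frozenset({"B"}), 0.8),
    (frozenset({"A"}), frozenset({"D"}), 0.6),
    (frozenset({"B"}), frozenset({"C", "E"}), 0.7),
]

def index_by_antecedent(rules):
    """Map each antecedent to its rules, strongest (highest confidence) first."""
    index = defaultdict(list)
    for antecedent, consequent, conf in sorted(rules, key=lambda r: r[2], reverse=True):
        index[antecedent].append((consequent, conf))
    return index

def recommend(user_list, index):
    """Return (title, subset_size) pairs, matching the largest subsets first."""
    seen, out = set(), []
    for size in range(len(user_list), 0, -1):
        for subset in combinations(user_list, size):
            # One dict lookup replaces a full pass over the rule table.
            for consequent, _ in index.get(frozenset(subset), []):
                for title in sorted(consequent):
                    if title not in seen:
                        seen.add(title)
                        out.append((title, size))
    return out

rule_index = index_by_antecedent(toy_rules)
toy_recs = recommend(["A", "B"], rule_index)
print(toy_recs)
```

The index is built once per rule table, so each call only pays for the number of subsets of the user's list rather than for the number of rules.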

In [67]:
recommendAnimeBetter(['Naruto', 'Toradora!', 'Shingeki no Kyojin'])

[('Death Note', 2, 2774),
 ('Code Geass: Hangyaku no Lelouch', 2, 8640),
 ('Naruto: Shippuuden', 2, 9880),
 ('Fullmetal Alchemist: Brotherhood', 2, 10111),
 ('Angel Beats!', 2, 10486),
 ('Bleach', 2, 10991),
 ('Clannad', 2, 11446),
 ('Sword Art Online', 2, 11773),
 ('Shingeki no Kyojin', 2, 12910),
 ('Elfen Lied', 2, 12922),
 ('Durarara!!', 2, 14836),
 ('Soul Eater', 2, 14951),
 ('Bakemonogatari', 2, 15679),
 ('Steins;Gate', 2, 16403),
 ('Suzumiya Haruhi no Yuuutsu', 2, 16528),
 ('Tengen Toppa Gurren Lagann', 2, 17482),
 ('Mirai Nikki (TV)', 2, 12158),
 ('Toradora!', 2, 13958),
 ('Ao no Exorcist', 2, 14595),
 ('Fairy Tail', 2, 15433),
 ('Tokyo Ghoul', 2, 17109),
 ('Highschool of the Dead', 2, 17700),
 ('Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai.', 2, 14927),
 ('Another', 2, 15101),
 ('No Game No Life', 2, 15168),
 ('Noragami', 2, 17728),
 ('Psycho-Pass', 2, 18017),
 ('Naruto', 2, 19293),
 ('One Punch Man', 2, 19727),
 ('Hataraku Maou-sama!', 2, 19980),
 ('Code Geass: Hang

### Save Rules to Disk
It is time-consuming to recompute the association rules every time we want to make predictions with them, so we save the rules to disk using `pickle`.

In [26]:
import pickle

In [68]:
with open(f'rules_minsup_{MIN_SUPPORT}_conf_{MIN_CONFIDENCE}.pkl', 'wb') as f:
    pickle.dump(rules, f)

In [69]:
del rules

In [70]:
with open(f'rules_minsup_{MIN_SUPPORT}_conf_{MIN_CONFIDENCE}.pkl', 'rb') as f:
    rules = pickle.load(f)
    print(recommendAnimeBetter(['Naruto', 'Toradora!', 'Shingeki no Kyojin'], short=False))

[('Death Note', 2, 2774), ('Code Geass: Hangyaku no Lelouch', 2, 8640), ('Naruto: Shippuuden', 2, 9880), ('Fullmetal Alchemist: Brotherhood', 2, 10111), ('Angel Beats!', 2, 10486), ('Bleach', 2, 10991), ('Clannad', 2, 11446), ('Sword Art Online', 2, 11773), ('Shingeki no Kyojin', 2, 12910), ('Elfen Lied', 2, 12922), ('Durarara!!', 2, 14836), ('Soul Eater', 2, 14951), ('Bakemonogatari', 2, 15679), ('Steins;Gate', 2, 16403), ('Suzumiya Haruhi no Yuuutsu', 2, 16528), ('Tengen Toppa Gurren Lagann', 2, 17482), ('Mirai Nikki (TV)', 2, 12158), ('Toradora!', 2, 13958), ('Ao no Exorcist', 2, 14595), ('Fairy Tail', 2, 15433), ('Tokyo Ghoul', 2, 17109), ('Highschool of the Dead', 2, 17700), ('Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai.', 2, 14927), ('Another', 2, 15101), ('No Game No Life', 2, 15168), ('Noragami', 2, 17728), ('Psycho-Pass', 2, 18017), ('Naruto', 2, 19293), ('One Punch Man', 2, 19727), ('Hataraku Maou-sama!', 2, 19980), ('Code Geass: Hangyaku no Lelouch R2', 2, 20159)
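
The round trip can be checked on a small stand-in object. This sketch uses a hypothetical list of dicts (`toy_rules`) instead of the real rules dataframe and writes to a temporary directory; it shows that `frozenset` values survive pickling intact, which the antecedent matching depends on.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the rules table: plain dicts holding frozensets.
toy_rules = [
    {"antecedents": frozenset({"A"}), "consequents": frozenset({"B"}), "confidence": 0.8},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.26, 0.5

with tempfile.TemporaryDirectory() as tmpdir:
    # Same file naming scheme as the cells above, so the thresholds are recorded in the name.
    path = os.path.join(tmpdir, f"rules_minsup_{MIN_SUPPORT}_conf_{MIN_CONFIDENCE}.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_rules, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_rules)
```

Since unpickling can execute arbitrary code, only load pickle files that you created yourself or otherwise trust.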