## Modeling_AssociationRules.ipynb
### Due: April 23

Here we apply association rule mining to our anime watchlists.
Each transaction is the set of anime watched by one user. We use the Apriori algorithm
to generate the frequent itemsets from these transactions, then take the strongest association rules as predictors
of what anime a person might like to watch, given other anime they have enjoyed.
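
As a quick reference for the metrics used later, here is a minimal sketch of support, confidence, and lift computed by hand. The four watchlists are made up for illustration and the helper names (`support`, `confidence`, `lift`) are our own, not part of any library.

```python
# Toy watchlists: the titles appear in the dataset, but these lists are invented.
transactions = [
    {"Naruto", "Bleach", "Death Note"},
    {"Naruto", "Bleach"},
    {"Naruto", "Death Note"},
    {"Death Note", "Steins;Gate"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"Naruto", "Bleach"}, transactions))       # 0.5
print(confidence({"Bleach"}, {"Naruto"}, transactions))  # 1.0
print(lift({"Bleach"}, {"Naruto"}, transactions))        # 1.333...
```

A lift above 1 means the antecedent makes the consequent more likely than its baseline popularity alone would suggest.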

In [1]:
import pandas as pd



### Load in Dataset and Do some Data Preprocessing

In [2]:
df_anime_lists = pd.read_csv("./data/animelists_cleaned.csv")
df_anime_lists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 11 columns):
 #   Column               Dtype  
---  ------               -----  
 0   username             object 
 1   anime_id             int64  
 2   my_watched_episodes  int64  
 3   my_start_date        object 
 4   my_finish_date       object 
 5   my_score             int64  
 6   my_status            int64  
 7   my_rewatching        float64
 8   my_rewatching_ep     int64  
 9   my_last_updated      object 
 10  my_tags              object 
dtypes: float64(1), int64(5), object(5)
memory usage: 2.6+ GB


In [3]:
def getMemUsageInMB(df : pd.DataFrame) -> float:
    return df.memory_usage(deep=True).sum() / 2**20

In [4]:
orig_mem = getMemUsageInMB(df_anime_lists)
print(f"Total memory used by dataframe: {orig_mem :.2f} MB")

Total memory used by dataframe: 10738.10 MB


In [5]:
# fraction of NaN values in each column
df_anime_lists.isna().mean()

username               0.000008
anime_id               0.000000
my_watched_episodes    0.000000
my_start_date          0.000000
my_finish_date         0.000000
my_score               0.000000
my_status              0.000000
my_rewatching          0.219864
my_rewatching_ep       0.000000
my_last_updated        0.000000
my_tags                0.936274
dtype: float64

In [6]:
# keep only the two columns needed for association rule mining: username and anime_id.
df_anime_lists.drop(columns=df_anime_lists.columns[2:], inplace=True)

# downcast the remaining integer column (anime_id) to the smallest integer type that fits.
intCols = df_anime_lists.select_dtypes('int').columns
df_anime_lists[intCols] = df_anime_lists[intCols].apply(pd.to_numeric, downcast='integer')

# downcast float data types (no float columns remain after the drop above, so this is a no-op here).
fcols = df_anime_lists.select_dtypes('float').columns
df_anime_lists[fcols] = df_anime_lists[fcols].apply(pd.to_numeric, downcast='float') 
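
To see what the integer downcast does in isolation, here is a small sketch on a made-up Series (the ids are illustrative, not read from the dataset). `pd.to_numeric` with `downcast='integer'` picks the smallest signed integer type that can hold every value.

```python
import pandas as pd

# Made-up ids; the largest value does not fit in int16 (max 32767).
ids = pd.Series([21, 120, 37000], dtype="int64")
small = pd.to_numeric(ids, downcast="integer")

print(ids.dtype, "->", small.dtype)  # int64 -> int32
```

Downcasting only affects numeric columns, so a string column such as `username` is left as it is.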

In [7]:
tmp = getMemUsageInMB(df_anime_lists)
print(f"Total memory used by dataframe: {tmp :.2f} MB")

# memory reduced by:
print(f"Memory Usage Reduction: {((orig_mem - tmp)/orig_mem) * 100}%")
del tmp

Total memory used by dataframe: 2088.10 MB
Memory Usage Reduction: 80.55431970379016%


In [8]:
df_anime_lists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31284030 entries, 0 to 31284029
Data columns (total 2 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   username  object
 1   anime_id  int32 
dtypes: int32(1), object(1)
memory usage: 358.0+ MB
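
The `358.0+ MB` reported by `info()` is much lower than the 2088.10 MB from `getMemUsageInMB` because `info()` measures shallowly by default: for object columns it counts only the pointers, and the `+` marks the figure as a lower bound. The sketch below shows the difference on a made-up Series; `dtype=object` is set explicitly as an assumption, so that it mirrors the `object` column in the output above.

```python
import pandas as pd

# Made-up usernames stored as Python objects, like the `username` column.
names = pd.Series(["--AnimeBoy--", "zzzcielo", "zzvl"] * 1000, dtype=object)

shallow = names.memory_usage(deep=False)  # pointers only
deep = names.memory_usage(deep=True)      # pointers plus the string objects

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```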


### Formulate the Transactions and Find Frequent Itemsets

In [9]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [10]:
transactions = df_anime_lists.groupby('username')['anime_id'].apply(list)
transactions

username
----phoebelyn       [21, 120, 853, 957, 1571, 1579, 1698, 1735, 1,...
---L-AND-AME-4EV                                           [20, 1535]
--AnimeBoy--        [21, 59, 74, 210, 232, 249, 853, 1557, 1735, 2...
--Etsuko--          [3092, 4814, 7054, 7674, 9926, 11013, 11123, 1...
--FallenAngel--     [21, 59, 210, 249, 269, 853, 857, 957, 1579, 1...
                                          ...                        
zzshinzozz          [21, 232, 249, 1735, 7054, 9513, 9863, 10800, ...
zzvl                [120, 269, 853, 4224, 6045, 7054, 9926, 10800,...
zzz275              [59, 120, 853, 1571, 1698, 2104, 4477, 1, 16, ...
zzzcielo            [21, 74, 269, 853, 857, 957, 1698, 1735, 3731,...
zzzzz-chan          [21, 59, 120, 210, 232, 269, 853, 857, 1735, 3...
Name: anime_id, Length: 108709, dtype: object

In [11]:
# load the anime metadata so we can map anime ids to titles
df_anime = pd.read_csv("./data/anime_cleaned.csv")
df_anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6668 entries, 0 to 6667
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   anime_id         6668 non-null   int64  
 1   title            6668 non-null   object 
 2   title_english    3438 non-null   object 
 3   title_japanese   6663 non-null   object 
 4   title_synonyms   4481 non-null   object 
 5   image_url        6666 non-null   object 
 6   type             6668 non-null   object 
 7   source           6668 non-null   object 
 8   episodes         6668 non-null   int64  
 9   status           6668 non-null   object 
 10  airing           6668 non-null   bool   
 11  aired_string     6668 non-null   object 
 12  aired            6668 non-null   object 
 13  duration         6668 non-null   object 
 14  rating           6586 non-null   object 
 15  score            6668 non-null   float64
 16  scored_by        6668 non-null   int64  
 17  rank          

In [12]:
# some of the title_english values are missing, so we use the regular `title` column instead
# (every row has one; the titles are mostly romanized Japanese, e.g. "Shingeki no Kyojin").
# to save on memory usage, we only keep the columns of interest: [anime_id, title]

# get mapping from ids into names
animeIdToName = dict(zip(df_anime.anime_id, df_anime.title))

transactions = transactions.apply(lambda x: [animeIdToName[i] for i in x])

# remove all other columns except the anime_id and title.
df_anime.drop(columns=df_anime.columns[2:], inplace=True)
df_anime.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6668 entries, 0 to 6667
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   anime_id  6668 non-null   int64 
 1   title     6668 non-null   object
dtypes: int64(1), object(1)
memory usage: 104.3+ KB


In [13]:
# running out of mem, so delete the dataframe objects currently in RAM
del df_anime_lists
del df_anime

In [14]:
MIN_SUPPORT = 0.30

te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=MIN_SUPPORT, use_colnames=True)

frequent_itemsets

| | support | itemsets |
|---|---|---|
| 0 | 0.344323 | (Akame ga Kill!) |
| 1 | 0.569962 | (Angel Beats!) |
| 2 | 0.429983 | (Ano Hi Mita Hana no Namae wo Bokutachi wa Mad... |
| 3 | 0.450699 | (Another) |
| 4 | 0.464626 | (Ao no Exorcist) |
| ... | ... | ... |
| 1376 | 0.305025 | (Sword Art Online, Shingeki no Kyojin, Mirai N... |
| 1377 | 0.317389 | (Sword Art Online, Shingeki no Kyojin, Steins;... |
| 1378 | 0.320893 | (Sword Art Online, Shingeki no Kyojin, Steins;... |
| 1379 | 0.303287 | (Sword Art Online, Shingeki no Kyojin, Mirai N... |
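
To make the level-wise idea behind Apriori concrete, here is a toy re-implementation in plain Python. It is a sketch for illustration only, not mlxtend's code: the function name `toy_apriori` and the four watchlists are assumptions, and it makes no attempt to scale to the full dataset.

```python
from itertools import combinations

def toy_apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([item]) for t in sets for item in t})
    result = dict(current)
    k = 2
    while current:
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate survives only if all of its (k-1)-subsets are frequent.
        pruned = {c for c in joined
                  if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = keep_frequent(pruned)
        result.update(current)
        k += 1
    return result

# Invented watchlists for the demo.
watchlists = [
    ["Naruto", "Bleach", "Death Note"],
    ["Naruto", "Bleach"],
    ["Naruto", "Death Note"],
    ["Death Note", "Steins;Gate"],
]
found = toy_apriori(watchlists, min_support=0.5)
for itemset, sup in sorted(found.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if every one of its subsets is frequent, so most candidates are discarded before their support is ever counted.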


In [16]:
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)
rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Shingeki no Kyojin) | (Akame ga Kill!) | 0.573338 | 0.344323 | 0.316883 | 0.552698 | 1.605173 | 0.119469 | 1.465848 | 0.883637 |
| 1 | (Akame ga Kill!) | (Shingeki no Kyojin) | 0.344323 | 0.573338 | 0.316883 | 0.920307 | 1.605173 | 0.119469 | 5.353800 | 0.575000 |
| 2 | (Sword Art Online) | (Akame ga Kill!) | 0.575003 | 0.344323 | 0.316938 | 0.551193 | 1.600804 | 0.118951 | 1.460935 | 0.883098 |
| 3 | (Akame ga Kill!) | (Sword Art Online) | 0.344323 | 0.575003 | 0.316938 | 0.920467 | 1.600804 | 0.118951 | 5.343656 | 0.572407 |
| 4 | (Angel Beats!) | (Ano Hi Mita Hana no Namae wo Bokutachi wa Mad... | 0.569962 | 0.429983 | 0.365315 | 0.640946 | 1.490631 | 0.120241 | 1.587552 | 0.765382 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5661 | (Steins;Gate, Toradora!) | (Sword Art Online, Shingeki no Kyojin) | 0.395781 | 0.480337 | 0.305770 | 0.772574 | 1.608399 | 0.115662 | 2.284975 | 0.626038 |
| 5662 | (Sword Art Online) | (Shingeki no Kyojin, Steins;Gate, Toradora!) | 0.575003 | 0.335869 | 0.305770 | 0.531772 | 1.583271 | 0.112645 | 1.418392 | 0.866821 |
| 5663 | (Shingeki no Kyojin) | (Sword Art Online, Steins;Gate, Toradora!) | 0.573338 | 0.339365 | 0.305770 | 0.533316 | 1.571513 | 0.111200 | 1.415595 | 0.852363 |
| 5664 | (Steins;Gate) | (Sword Art Online, Shingeki no Kyojin, Toradora!) | 0.521236 | 0.364422 | 0.305770 | 0.586626 | 1.609742 | 0.115821 | 1.537537 | 0.791167 |
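
As a sanity check on how the columns relate, the sketch below recomputes confidence, lift, leverage, and conviction for rule 0, (Shingeki no Kyojin) -> (Akame ga Kill!), from its three support values using the standard definitions. The inputs are copied from the output above; Zhang's metric is left out.

```python
# Values copied from rule 0: (Shingeki no Kyojin) -> (Akame ga Kill!)
antecedent_support = 0.573338
consequent_support = 0.344323
support = 0.316883

confidence = support / antecedent_support                     # P(consequent | antecedent)
lift = confidence / consequent_support                        # > 1 means positive association
leverage = support - antecedent_support * consequent_support  # gap from independence
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.6f} lift={lift:.6f} "
      f"leverage={leverage:.6f} conviction={conviction:.6f}")
```

The results agree with the table to within rounding of the six-decimal inputs.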


In [17]:
def viewTopKRules(rules : pd.DataFrame, metric : str | list[str], k : int, ascending : bool = False) -> pd.DataFrame:
    '''
    return a subset of the rules dataframe 
    given the metric / metrics to sort by
    top k to return
    and whether to be sorted in ascending or descending order
    '''
    return rules.sort_values(by=metric, ascending=ascending).head(k)

In [18]:
viewTopKRules(rules, metric=['confidence', 'lift'], k=5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 4326 | (Sword Art Online II, Shingeki no Kyojin) | (Sword Art Online) | 0.324159 | 0.575003 | 0.32277 | 0.995715 | 1.731669 | 0.136378 | 99.181918 | 0.62518 |
| 3767 | (Sword Art Online II, Death Note) | (Sword Art Online) | 0.30486 | 0.575003 | 0.303425 | 0.995293 | 1.730935 | 0.128129 | 90.287346 | 0.607471 |
| 1105 | (Sword Art Online II) | (Sword Art Online) | 0.358894 | 0.575003 | 0.356585 | 0.993567 | 1.727933 | 0.15022 | 66.06079 | 0.657105 |
| 5073 | (Code Geass: Hangyaku no Lelouch R2, Tengen To... | (Code Geass: Hangyaku no Lelouch) | 0.321841 | 0.622957 | 0.319679 | 0.993283 | 1.594466 | 0.119186 | 56.134527 | 0.549769 |
| 2646 | (Code Geass: Hangyaku no Lelouch R2, Darker th... | (Code Geass: Hangyaku no Lelouch) | 0.323175 | 0.622957 | 0.32093 | 0.993055 | 1.594099 | 0.119606 | 54.288054 | 0.550639 |


### Create recommendations from association rules. Start with a simple search.

In [117]:
def recommendAnimeSimple( userList : list[str] ) -> list[tuple[str, int]]:
    '''
    Given a list of anime the user watches, return a list of anime recommendations.

    Returns all the anime from the consequents of association rules whose antecedents contain any one of the anime in userList.
    Each anime in userList is matched on its own; combinations of anime are not considered.
    Anime that are already in userList are not filtered out of the result.

    Output data format:
        (anime_recommendation, score)
            score - rank of the rule this rec came from in the sorted rule list (lower is better)
    '''
    assert isinstance(userList, list), 'input not of type list'
    
    retList = []
    retAnime = set()

    # sort once, strongest rules first, instead of re-sorting for every anime
    sortedRules = rules.sort_values(by=['confidence', 'lift'], ascending=False)

    # for each anime in the user's list, walk the rules from strongest to weakest and
    # collect the consequents of every rule whose antecedents contain that anime
    for anime in userList:
        # score is the rule's position in the sorted list: the higher the score,
        # the further away from the top rules the anime rec comes from
        for score, (index, row) in enumerate(sortedRules.iterrows()):
            if anime in row.antecedents:
                for a in row.consequents:
                    if a in retAnime: continue
                    retAnime.add(a)
                    retList.append((a, score))

    # sort return list of anime by ascending score
    retList = sorted(retList, key=lambda x : x[1])
    
    return retList

In [118]:
recommendAnimeSimple(['Naruto', "Toradora!"])

[('Code Geass: Hangyaku no Lelouch', 34),
 ('Death Note', 75),
 ('Shingeki no Kyojin', 608),
 ('Sword Art Online', 663),
 ('Code Geass: Hangyaku no Lelouch R2', 998),
 ('Bleach', 1040),
 ('Naruto: Shippuuden', 1160),
 ('Fullmetal Alchemist: Brotherhood', 1199),
 ('Clannad: After Story', 1343),
 ('Fullmetal Alchemist', 1927),
 ('Elfen Lied', 2090),
 ('No Game No Life', 2470),
 ('Ano Hi Mita Hana no Namae wo Bokutachi wa Mada Shiranai.', 2514),
 ('Soul Eater', 2541),
 ('Another', 2571),
 ('Naruto', 2956),
 ('Toradora!', 3302),
 ('Angel Beats!', 3478),
 ('Durarara!!', 3704),
 ('Clannad', 3712),
 ('Tengen Toppa Gurren Lagann', 3747),
 ('K-On!', 3832),
 ('Fairy Tail', 3860),
 ('Steins;Gate', 3875),
 ('One Piece', 4166),
 ('Suzumiya Haruhi no Yuuutsu', 4177),
 ('Mahou Shoujo Madoka★Magica', 4721),
 ('Highschool of the Dead', 4771),
 ('Sen to Chihiro no Kamikakushi', 4813),
 ('Ao no Exorcist', 4845),
 ('Mirai Nikki (TV)', 4863),
 ('Bakemonogatari', 4907),
 ('Kaichou wa Maid-sama!', 4970),
 ('

### Create a better recommendAnime function.
The simple version above checks each anime in the user's list one at a time and collects the consequents of every rule whose antecedents contain it. This ignores the information carried by anime appearing TOGETHER in the user's list. Instead, we want to match the largest possible subsets of the user's list against the rule antecedents first and return those rules' consequents, so that the recommendations are more meaningful.
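
A small sketch of the difference between the two matching strategies, using three hypothetical rules written as `(antecedents, consequents)` pairs. The rules are invented for illustration and are not taken from the mined output.

```python
# Hypothetical rules as (antecedents, consequents) pairs.
toy_rules = [
    (frozenset({"Naruto"}), frozenset({"Bleach"})),
    (frozenset({"Toradora!"}), frozenset({"Clannad"})),
    (frozenset({"Naruto", "Toradora!"}), frozenset({"Angel Beats!"})),
]
user = frozenset({"Naruto", "Toradora!"})

# Simple approach: a rule matches if it shares at least one title with the user.
any_match = [c for a, c in toy_rules if a & user]

# Joint approach: a rule matches only if its antecedents are exactly the user's set.
exact_match = [c for a, c in toy_rules if a == user]

print(any_match)    # all three consequents
print(exact_match)  # only the rule that uses both titles together
```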

In [78]:
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

In [80]:
[x for x in powerset(['Naruto', 'Toradora!'])]

[(), ('Naruto',), ('Toradora!',), ('Naruto', 'Toradora!')]

In [103]:
tmp = ['Naruto', 'Toradora!', 'Shingeki no Kyojin']
tmp = [x for x in powerset(tmp) if len(x) >= 1]
tmp = sorted(tmp, key=lambda x: len(x), reverse=True)
tmp = [frozenset(x) for x in tmp]
tmp

[frozenset({'Naruto', 'Shingeki no Kyojin', 'Toradora!'}),
 frozenset({'Naruto', 'Toradora!'}),
 frozenset({'Naruto', 'Shingeki no Kyojin'}),
 frozenset({'Shingeki no Kyojin', 'Toradora!'}),
 frozenset({'Naruto'}),
 frozenset({'Toradora!'}),
 frozenset({'Shingeki no Kyojin'})]

In [114]:
# great, the order doesn't matter in the frozenset comparison.
frozenset({'Naruto', 'Toradora!'}) == frozenset({'Toradora!', 'Naruto'})

True

In [115]:
def recommendAnimeBetter( userList : list[str] ) -> list[tuple[str, int, int]]:
    '''
    Given a list of anime the user watches, return a list of anime recommendations.

    This is the better version: it first tries rules whose antecedents match as many of the anime in userList as possible.
    Then it adds more recommendations from rules that match fewer and fewer anime, down to rules matching a single anime from userList.

    Return data format:
        (anime_recommendation, score1, score2)
            score1 - the number of anime used to match the rule for this rec (higher is better)
            score2 - rank of the rule this rec came from in the sorted rule list (lower is better)
    
    '''
    assert isinstance(userList, list), 'input not of type list'
    
    retList = []
    retAnime = set()

    # create all subsets with at least 1 anime, largest subsets first
    subsets = [x for x in powerset(userList) if len(x) >= 1]
    subsets = sorted(subsets, key=len, reverse=True)
    subsets = [frozenset(x) for x in subsets]

    # sort once, strongest rules first, instead of re-sorting for every subset
    sortedRules = rules.sort_values(by=['confidence', 'lift'], ascending=False)

    for subset in subsets:
        for score2, (index, row) in enumerate(sortedRules.iterrows()):
            if subset == row.antecedents:
                for a in row.consequents:

                    # don't add this anime if we already got it
                    if a in retAnime: continue

                    retAnime.add(a)

                    score1 = len(subset)
                    retList.append((a, score1, score2))
    
    # sort return list by descending number of matched anime (score1); the sort is stable,
    # so recs with the same score1 keep the order in which they were found
    retList = sorted(retList, key=lambda x : x[1], reverse=True)
    
    return retList
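
One possible alternative to enumerating the powerset, which grows as 2^n with the length of the user's list, is to test each rule's antecedents with the frozenset subset operator `<=`. A rule's antecedents equal some subset of the user's list exactly when they are a subset of it, so this selects the same rules in a single pass. The sketch below is not the notebook's method: `matching_rules` is a hypothetical helper, and the rules are invented `(antecedents, consequents)` pairs rather than the mined DataFrame.

```python
def matching_rules(user_titles, rule_pairs):
    """Return the (antecedents, consequents) pairs whose antecedents the user
    has fully watched, with the largest antecedents first."""
    watched = frozenset(user_titles)
    hits = [(a, c) for a, c in rule_pairs if a <= watched]
    return sorted(hits, key=lambda pair: len(pair[0]), reverse=True)

# Hypothetical rules for the demo.
toy_rules = [
    (frozenset({"Naruto"}), frozenset({"Bleach"})),
    (frozenset({"Naruto", "Toradora!"}), frozenset({"Angel Beats!"})),
    (frozenset({"Naruto", "One Piece"}), frozenset({"Fairy Tail"})),
]

hits = matching_rules(["Naruto", "Toradora!", "Shingeki no Kyojin"], toy_rules)
for antecedents, consequents in hits:
    print(sorted(antecedents), "->", sorted(consequents))
```

The third rule is skipped because 'One Piece' is not in the user's list.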

In [116]:
recommendAnimeBetter(['Naruto', 'Toradora!', 'Shingeki no Kyojin'])

[('Death Note', 2, 427),
 ('Sword Art Online', 2, 663),
 ('Fullmetal Alchemist: Brotherhood', 2, 1199),
 ('Naruto: Shippuuden', 2, 1437),
 ('Angel Beats!', 2, 875),
 ('Steins;Gate', 2, 1255),
 ('Mirai Nikki (TV)', 2, 1481),
 ('Code Geass: Hangyaku no Lelouch', 2, 1652),
 ('Clannad', 2, 1781),
 ('Durarara!!', 2, 1957),
 ('Bakemonogatari', 2, 2040),
 ('Elfen Lied', 2, 2337),
 ('Bleach', 1, 2017),
 ('Fullmetal Alchemist', 1, 3318),
 ('Shingeki no Kyojin', 1, 3535),
 ('Soul Eater', 1, 3648),
 ('Toradora!', 1, 3672),
 ('Code Geass: Hangyaku no Lelouch R2', 1, 4190),
 ('Tengen Toppa Gurren Lagann', 1, 4302),
 ('Fairy Tail', 1, 4329),
 ('Suzumiya Haruhi no Yuuutsu', 1, 4582),
 ('One Piece', 1, 4624),
 ('Highschool of the Dead', 1, 4771),
 ('Sen to Chihiro no Kamikakushi', 1, 4813),
 ('Ao no Exorcist', 1, 4845),
 ('Darker than Black: Kuro no Keiyakusha', 1, 5219),
 ('Neon Genesis Evangelion', 1, 5453),
 ('Cowboy Bebop', 1, 5548),
 ('Naruto', 1, 3679),
 ('Ano Hi Mita Hana no Namae wo Bokutachi 