<small><i>Updated February 2019 - This notebook was created by [Santi Seguí](https://ssegui.github.io/). </i></small>

<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;"><a class="anchor" id="what-is-a-recommender"></a><h3>Non-Personalized recommender systems</h3><br></div>

<br>
<p>A non-personalized recommender system is one that makes the same recommendations for everyone. </p>

The simplest example is a retailer that shows the ten (or some number) most popular products on their homepage. <br>
Some examples: <br><br>
IMDB: MOVIE RANKING
![alt IMDB](images/np1.png)

___
Amazon: Top Recommendations
![alt Amazon](images/np2.png)
___
Amazon: Product Association
![alt Amazon](images/np3.png)
___
Reddit: News Recommendations
![alt Reddit](images/np4.png)


<p>There are <b>several</b> cases, but <b>two main</b> approaches:</p>

1. Aggregated opinion recommenders
2. Basic product association recommenders
<br>
<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;"><a class="anchor" id="what-is-a-recommender"></a><h3>Aggregated opinion recommenders</h3><br></div>

Usually, the problem is posed as a learning-to-rank problem. But what seems straightforward becomes a really complicated question: <b>How do you rank your rated items, and which logic do you use to display them?</b>

In order to score/rank items we first have to <b>understand the business case</b>. Of course, several factors play a role. For instance:

* Which information do we have about the items? Bought / Seen / Rated / ... 
* From how many users do we have information about a particular item?
* How old is that information?


## EXAMPLE: Non-Personalised Recommender using MovieLens Dataset
We will work with the well-known MovieLens dataset (http://grouplens.org/datasets/movielens/), collected by the GroupLens research group. Several versions of this dataset are available with different amounts of data, from the 100K-rating version to the 20M-rating version. Although results on a bigger dataset are expected to be better, we will work with one of the smaller versions: the MovieLens 1M dataset (ml-1m.zip), which is the one loaded by the code below. Working with a small version keeps the computational cost low.

With a unix machine the dataset can be downloaded with the following code:



In [229]:
!pip install wget



In [230]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m.zip -d "data/"

/bin/sh: wget: command not found
unzip:  cannot find or open ml-1m.zip, ml-1m.zip.zip or ml-1m.zip.ZIP.


If you are working on a Windows machine, please go to the website, download the 1M version and extract it to the subdirectory named "data/ml-1m/".
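
As a platform-independent alternative to `wget`/`unzip`, the download can also be sketched in pure Python with the standard library. The helper name `download_and_extract` is hypothetical, and the example URL in the usage note assumes the archive is still hosted at that address.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir="data"):
    """Download a zip archive from `url` and extract it into `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]   # file name taken from the URL
    if not archive.exists():                  # skip the download if already present
        urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

For example, `download_and_extract("http://files.grouplens.org/datasets/movielens/ml-1m.zip")` should leave the files under `data/ml-1m/`, which is the path the loading code expects.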

Once you have downloaded and unzipped the file into a directory, you can create a DataFrame with the following code:

In [231]:
# Real Netflix scale: 50,000,000 users and 100,000 items
%autosave 150
%matplotlib inline
import pandas as pd
import numpy as np
import math
import matplotlib.pylab as plt

# Load the dataset. The ml-1m files use '::' as separator,
# which requires the Python parsing engine.
# users.dat columns: UserID::Gender::Age::Occupation::Zip-code
u_cols = ['user_id', 'sex', 'age', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-1m/users.dat', sep='::', names=u_cols, engine='python')

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-1m/ratings.dat', sep='::', names=r_cols, engine='python')

# movies.dat columns: MovieID::Title::Genres
m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv('data/ml-1m/movies.dat', sep='::', names=m_cols, usecols=range(3), encoding='latin-1', engine='python')

# Build the DataFrame
data = pd.merge(pd.merge(ratings, users), movies)
data = data[['user_id','title', 'movie_id','rating','genres','sex','age']]


print("The dataset has "+ str(data.shape[0]) +" ratings")
print("The dataset has ", data.user_id.nunique()," users")
print("The dataset has ", data.movie_id.nunique(), " movies")
data.head()


Autosaving every 150 seconds


The dataset has 1000209 ratings
The dataset has  6040  users
The dataset has  3706  movies


Unnamed: 0,user_id,title,movie_id,rating,genres,sex,age
0,1,One Flew Over the Cuckoo's Nest (1975),1193,5,Drama,F,1
1,2,One Flew Over the Cuckoo's Nest (1975),1193,5,Drama,M,56
2,12,One Flew Over the Cuckoo's Nest (1975),1193,4,Drama,M,25
3,15,One Flew Over the Cuckoo's Nest (1975),1193,4,Drama,M,25
4,17,One Flew Over the Cuckoo's Nest (1975),1193,5,Drama,M,50


If you explore the dataset in detail, you will see that it consists of:

* 1,000,209 ratings from 6,040 users on 3,706 movies. Ratings range from 1 to 5.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip code).

### 2.1 Top movies ranking. 
The simplest way to show the ranking is by using the mean rating.

In [232]:
mean_score = data.groupby(['title'])[['rating']].mean().rename(columns = {'rating': 'mean_rating'})
mean_score.sort_values(by='mean_rating',ascending=False).head(10)

Unnamed: 0_level_0,mean_rating
title,Unnamed: 1_level_1
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


<div class="alert alert-error" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:14px;">
What do you think about the output? </div><br>
Now, let's rank by mean rating again, but using only those movies with more than 20 ratings.

In [233]:
size = data.groupby('title').size()
mean_score.loc[size>20].sort_values(by='mean_rating',ascending=False).head(10)

Unnamed: 0_level_0,mean_rating
title,Unnamed: 1_level_1
Sanjuro (1962),4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954),4.56051
"Shawshank Redemption, The (1994)",4.554558
"Godfather, The (1972)",4.524966
"Close Shave, A (1995)",4.520548
"Usual Suspects, The (1995)",4.517106
Schindler's List (1993),4.510417
"Wrong Trousers, The (1993)",4.507937
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.491489
Raiders of the Lost Ark (1981),4.477725


<div class="alert alert-error" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:14px;">
Any other idea?<br>
How can you improve it?</div>

### There are other measures, such as the <b>Damped Mean</b>.


* <b>Problem:</b> There is low confidence with few ratings
* <b>Solution:</b> Assume that, without evidence, everything is average.

<br>
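$\mathrm{damped\_mean}(i) = \frac{\sum_u r_{u,i} + k \mu}{n_i +  k}$
<br><br>where $r_{u,i}$ is the rating given by user $u$ to item $i$, $n_i$ is the number of ratings of item $i$, $\mu$ is the global mean rating, and $k$ controls the strength of the required evidence.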

<br>


In [234]:
k = 10

mean_score_movies  = data.groupby('movie_id')[['rating']].mean().rename(columns = {'rating': 'mean_rating'}).reset_index()
sum_ratings_movie = data.groupby('movie_id')[['rating']].sum().rename(columns = {'rating': 'num_ratings'}).reset_index()
sum_ratings_movie['num_ratings_factor'] = sum_ratings_movie['num_ratings'] + k *(data['rating'].mean())

count_ratings = data.groupby('movie_id')[['rating']].count().rename(columns = {'rating': 'count_rating'}).reset_index()
count_ratings['count_rating_factor'] = count_ratings['count_rating'] + k

ratings_damped = pd.merge(sum_ratings_movie,
                         count_ratings[['movie_id','count_rating','count_rating_factor']],
                         on=['movie_id'],how='left')

ratings_damped['damped_mean']=ratings_damped['num_ratings_factor']/ratings_damped['count_rating_factor']

ratings_mean_damped=pd.merge(data[['title','movie_id']].drop_duplicates(),
                             ratings_damped[['movie_id','damped_mean']],
                             on=['movie_id'],how='left')

ratings_mean_damped = ratings_mean_damped.sort_values(by='damped_mean', ascending=False)
ratings_mean_damped.head()

Unnamed: 0,title,movie_id,damped_mean
167,"Shawshank Redemption, The (1994)",318,4.550208
1092,Seven Samurai (The Magnificent Seven) (Shichin...,2019,4.545166
669,"Godfather, The (1972)",858,4.520741
259,"Usual Suspects, The (1995)",50,4.511888
29,"Close Shave, A (1995)",745,4.50647



## Ranking Considerations

+ Confidence
 - How confident are we that this item is good?
 
+ Risk tolerance
 - High-risk, high-reward
 - Conservative recommendations

+ Domain and business considerations
 - Age
 - System goals

ANOTHER EXAMPLE DOMAIN CONSTRAINT: TIME

<p><b>REDDIT:</b> Old stories are not interesting even though they might have a high net upvotes score! How does Reddit deal with this?</p>
<br>
$$\log_{10}\max( 1,| U - D | ) +  \frac{\operatorname{sign}(U-D) \cdot t_{post}}{45000} $$ 
<br>
where $U$ is the number of upvotes, $D$ is the number of downvotes and $t_{post}$ is the time at which the story was posted (in seconds). 
* In Reddit, time and votes were treated independently.
* The log term has a damping effect on votes. The idea is that votes 11 to 100 should have the same influence as votes 1 to 10. Obviously, a post with 1000 votes should be better than a post with 1 vote, but is a post with 2000 votes much better than one with 1000 votes? The log decreases the marginal value of later votes.
* The $sign(U-D)$ term is useful to bury items with a negative net score (as Reddit only wants to show the popular ones!)
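
A minimal sketch of this score in Python, written directly from the formula above. The function name `reddit_hot` is hypothetical, and `t_post` is assumed to be expressed in seconds relative to some fixed reference date.

```python
import math


def reddit_hot(ups, downs, t_post):
    """log10(max(1, |U - D|)) + sign(U - D) * t_post / 45000."""
    net = ups - downs
    sign = (net > 0) - (net < 0)   # -1, 0 or 1
    return math.log10(max(1, abs(net))) + sign * t_post / 45000


# A story with 10 net votes posted at t = 90000 s ...
newer = reddit_hot(10, 0, 90000)
# ... ties with a story that has 100 net votes but was posted 45000 s earlier.
older = reddit_hot(100, 0, 45000)
print(newer, older)
```

Under this formula, a tenfold increase in net votes is worth exactly 45000 seconds (12.5 hours) of freshness, which is why old stories drop out of the ranking even with many votes.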

<p><b>HACKER NEWS:</b> </p>

$$ \frac{(U - D  +1)^\alpha}{(t_{now} - t_{post})^\gamma} P $$

* Numerator is related to popularity
* Denominator is related to the age factor, with a gravity effect controlled by the $\gamma$ parameter
* $P$ is a penalty term for each news item
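
The formula can be sketched as a small function. The name `hacker_news_score`, the default values of `alpha` and `gamma`, the use of hours as the time unit and the clamping of negative net votes are all assumptions made for illustration; they are not taken from this notebook.

```python
def hacker_news_score(ups, downs, age_hours, alpha=0.8, gamma=1.8, penalty=1.0):
    """(U - D + 1)^alpha / (t_now - t_post)^gamma * P, with the age given in hours."""
    if age_hours <= 0:
        raise ValueError("age_hours must be positive")
    # Assumption: clamp at zero so that a negative base is never raised to a fractional power.
    popularity = max(ups - downs + 1, 0) ** alpha
    return popularity / age_hours ** gamma * penalty


fresh = hacker_news_score(20, 0, age_hours=1)     # few votes, very recent
stale = hacker_news_score(200, 0, age_hours=10)   # ten times the votes, ten hours old
print(fresh, stale)
```

With these illustrative parameters the recent story outranks the older one despite having far fewer votes, which is the gravity effect described by the $\gamma$ exponent.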

![alt hackers](images/hackers.png)


<h3>It was really famous, and then it started getting worse.</h3>
![alt Zagat](images/zagat.png)



In [235]:
# Why?

<div class="alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;"><a class="anchor" id="what-is-a-recommender"></a><h3>Basic product association recommenders
</h3><br> People who buy X also buy Y. </div>


In [236]:
# Let's read a dataset that contains a list of market baskets

# Read data/groceries.csv: one basket per line, items separated by commas
market_data = []
items = set()   # set of distinct items seen so far
with open("data/groceries.csv") as f:
    for l in f:
        basket = l.rstrip().split(',')
        market_data.append(basket)
        items.update(basket)

print("Number of different items", len(items))
print("Number of rows ", len(market_data))


print("An example:", market_data[3])

Number of different items 171
Number of rows  9835
An example: ['pip fruit', 'yogurt', 'cream cheese ', 'meat spreads']


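One of the simplest ways to measure the association between products is the fraction of baskets containing $X$ that also contain $Y$: $$score(Y|X) = \frac{|X \land Y|}{|X|}$$ where $|X|$ is the number of baskets that contain $X$ and $|X \land Y|$ is the number of baskets that contain both $X$ and $Y$.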

In [237]:
# Which is the top associated product with "yogurt"?

In [238]:
def top_associated_products(product,N = 5):
    d = {}
    times = 0
    for l in market_data:
        if product in l:
            times = times + 1
            for i in l:
                if i != product: 
                    if(i in d):
                        d[i] += 1.0
                    else:
                        d[i] = 1.0

    for k in d:
        d[k] =   d[k] / times
    sorted_list=sorted(d.items(), key=lambda x: x[1],reverse=True)
    return sorted_list[:N]

In [239]:
s = top_associated_products('yogurt',N = 3)
print(s)

[('whole milk', 0.40160349854227406), ('other vegetables', 0.3112244897959184), ('rolls/buns', 0.24635568513119532)]


In [240]:
# Which is the top associated product with "rice"?
s = top_associated_products('rice',N = 3)
print(s) 

[('whole milk', 0.6133333333333333), ('other vegetables', 0.52), ('root vegetables', 0.41333333333333333)]


In [241]:
# Which is the top associated product with "rum"?
s = top_associated_products('rum',N = 3)
print(s)

[('whole milk', 0.38636363636363635), ('other vegetables', 0.3409090909090909), ('tropical fruit', 0.20454545454545456)]


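What happens? Is it a good measure? It has a problem with popular items: products such as whole milk appear in so many baskets that they top the list for almost any $X$.
<br>
Let's check this other formula, which divides by how often $Y$ appears in the baskets that do not contain $X$:
$$score(Y|X) = \frac{ \frac{|X \land Y|}{|X|}} {  \frac{|\neg X \land Y|}{|\neg X|} }  $$
where $|\cdot|$ counts the baskets that satisfy the condition and $\neg X$ means that $X$ is not in the basket.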

In [242]:
from collections import defaultdict

def top_associated_products2(product,N = 5):
    d = {}                     # final score per item
    d_yes = {}                 # counts in baskets that contain `product`
    d_not = defaultdict(int)   # counts in baskets that do not contain `product`
    times, times_not = 0, 0
    for l in market_data:
        if product in l:
            times = times + 1
            for i in l:
                if i != product: 
                    if(i in d_yes):
                        d_yes[i] += 1.0
                    else:
                        d_yes[i] = 1.0
        else:
            times_not = times_not + 1
            for i in l:
                if(i in d_not):
                    d_not[i] += 1.0
                else:
                    d_not[i] = 1.0
                        
    for k in d_yes:
        if(d_not[k] == 0):
            d[k] = 0
        else:
            d[k] =  ( d_yes[k] *times_not) / (times * d_not[k])
    sorted_list=sorted(d.items(), key=lambda x: x[1],reverse=True)
    return sorted_list[:N]

In [243]:
s = top_associated_products2('yogurt',N = 3)
print(s)

[('kitchen utensil', 6.168367346938775), ('preservation products', 6.168367346938775), ('meat spreads', 4.626275510204081)]


In [244]:
# Which is the top associated product with "rice"?
s = top_associated_products2('rice',N = 3)
print(s)

[('decalcifier', 20.02051282051282), ('canned fruit', 18.590476190476192), ('organic products', 18.590476190476192)]


In [245]:
# Which is the top associated product with "rum"?
s = top_associated_products2('rum',N = 3)
print(s)

[('artif. sweetener', 14.834848484848484), ('specialty vegetables', 13.907670454545455), ('cooking chocolate', 9.271780303030303)]


#### Let's check this last formula:
$$ score(Y|X) = \frac{P(X \land Y)}{P(X)\,P(Y)}   $$

In [246]:
def top_associated_products3(product,N = 5):
    d = {}                     # co-occurrence counts with `product`, then final scores
    times = defaultdict(int)   # number of occurrences of each item
    for l in market_data:
        for item in l:
            if item in times: #already exist
                times[item] += 1
            else:
                times[item] =1
        if product in l:
            for i in l:
                if i != product: 
                    if(i in d):
                        d[i] += 1.0
                    else:
                        d[i] = 1.0
                        
    for k in d:
        d[k] =  ( d[k] /len(market_data) ) / ((times[k]/len(market_data)) * times[product] /(len(market_data)))
        
    sorted_list=sorted(d.items(), key=lambda x: x[1],reverse=True)
    return sorted_list[:N]

In [247]:
s = top_associated_products3('yogurt',N = 3)
print(s)

[('baby food', 7.168367346938775), ('kitchen utensil', 3.5841836734693877), ('preservation products', 3.5841836734693877)]


In [248]:
# Which is the top associated product with "rice"?
s = top_associated_products3('rice',N = 3)
print(s)

[('decalcifier', 17.484444444444446), ('canned fruit', 16.391666666666666), ('organic products', 16.391666666666666)]


In [249]:
# Which is the top associated product with "rum"?
s = top_associated_products3('rum',N = 3)
print(s)

[('artif. sweetener', 13.970170454545455), ('specialty vegetables', 13.148395721925132), ('cooking chocolate', 8.940909090909091)]


In [250]:
s = top_associated_products3('baby food',N = 3)
print(s)

[('finished products', 153.671875), ('soups', 146.7910447761194), ('cake bar', 75.65384615384615)]
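
To see why the first and the last formula behave so differently, both can be computed side by side on a tiny example. In the association-rule literature they are commonly called confidence and lift. The sketch below uses hypothetical toy baskets and a hypothetical helper `association_scores`; unlike the functions above, it counts each item at most once per basket.

```python
from collections import Counter


def association_scores(baskets, x):
    """Return {y: (confidence, lift)} for every item y bought together with x."""
    n = len(baskets)
    item_count = Counter(item for b in baskets for item in set(b))
    pair_count = Counter(item for b in baskets if x in b
                         for item in set(b) if item != x)
    scores = {}
    for y, n_xy in pair_count.items():
        confidence = n_xy / item_count[x]                                 # P(Y | X)
        lift = (n_xy / n) / ((item_count[x] / n) * (item_count[y] / n))   # P(X, Y) / (P(X) P(Y))
        scores[y] = (confidence, lift)
    return scores


# Hypothetical baskets: milk is in almost every basket, tonic only appears with rum.
baskets = [
    ["milk", "bread"], ["milk", "eggs"], ["milk", "rum", "tonic"],
    ["milk", "bread", "eggs"], ["milk", "rum", "tonic"], ["milk", "rum"],
    ["bread"], ["milk"],
]
scores = association_scores(baskets, "rum")
for item, (confidence, lift) in scores.items():
    print(item, round(confidence, 3), round(lift, 3))
```

Milk gets the highest confidence simply because it is everywhere, while lift favours tonic, whose presence is much more specific to baskets containing rum. The `baby food` result above shows the flip side: with very rare items, lift can become very large on the basis of only a few baskets.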


## APRIORI Algorithm
Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

![alt apriori](images/apriori.png)

<b>Apriori principle</b>: Any subset of a frequent itemset must be frequent

> Step 1: Find the frequent itemsets: the sets of items that have minimum support.
> -  A subset of a frequent itemset must also be a frequent itemset, i.e. if {1,2} is a frequent itemset, both {1} and {2} must be frequent itemsets
> - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)

> Step 2: Use the frequent itemsets to generate association rules
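
The two quantities used in these steps can be illustrated on a handful of transactions. This is a minimal sketch with hypothetical helper names (`support`, `confidence`) and hypothetical toy transactions, independent of the implementation that follows.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy transactions
transactions = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "yogurt"}, {"milk", "bread", "yogurt"}, {"bread"},
)]

s_milk = support({"milk"}, transactions)                # 3 of 4 transactions
s_pair = support({"milk", "bread"}, transactions)       # 2 of 4 transactions
c_rule = confidence({"milk"}, {"bread"}, transactions)  # rule: milk ==> bread
print(s_milk, s_pair, c_rule)
```

Adding an item to an itemset can never increase its support, so `s_pair <= s_milk`. This is the property that lets Apriori discard every candidate that contains an infrequent subset.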

![alt apriori2](images/apriori2.png)

Reference : 
[Fast algorithms for mining association rules](http://www-cgi.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-721/summaries/12.pdf)


In [251]:
#adapted code from https://github.com/asaini/Apriori
from itertools import chain, combinations
from collections import defaultdict


def subsets(arr):
    """ Returns non empty subsets of arr"""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])


def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
    """calculates the support for items in the itemSet and returns a subset
    of the itemSet each of whose elements satisfies the minimum support"""
    _itemSet = set()
    localSet = defaultdict(int)

    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1
                localSet[item] += 1

    for item, count in localSet.items():
        support = float(count)/len(transactionList)

        if support >= minSupport:
            _itemSet.add(item)

    return _itemSet


def joinSet(itemSet, length):
    """Join a set with itself and returns the n-element itemsets"""
    return set([i.union(j) for i in itemSet for j in itemSet if len(i.union(j)) == length])


def getItemSetTransactionList(data_iterator):
    transactionList = list()
    itemSet = set()
    for record in data_iterator:
        transaction = frozenset(record)
        transactionList.append(transaction)
        for item in transaction:
            itemSet.add(frozenset([item]))              # Generate 1-itemSets
    return itemSet, transactionList


def runApriori(data_iter, minSupport, minConfidence):
    """
    run the apriori algorithm. data_iter is a record iterator
    Return both:
     - items (tuple, support)
     - rules ((pretuple, posttuple), confidence)
    """
    
    itemSet, transactionList = getItemSetTransactionList(data_iter)
    freqSet = defaultdict(int)
    largeSet = dict()
    # Global dictionary which stores (key=n-itemSets,value=support)
    # which satisfy minSupport

    assocRules = dict()
    # Dictionary which stores Association Rules
    
    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)

    
    currentLSet = oneCSet
    k = 2
    while(currentLSet != set([])):
        largeSet[k-1] = currentLSet
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1

    def getSupport(item):
            """local function which Returns the support of an item"""
            return float(freqSet[item])/len(transactionList)

    toRetItems = []
    for key, value in list(largeSet.items()):
        toRetItems.extend([(tuple(item), getSupport(item))
                           for item in value])

    toRetRules = []
    for key, value in list(largeSet.items())[1:]:
        for item in value:
            _subsets = map(frozenset, [x for x in subsets(item)])
            for element in _subsets:
                remain = item.difference(element)
                if len(remain) > 0:
                    confidence = getSupport(item)/getSupport(element)
                    if confidence >= minConfidence:
                        toRetRules.append(((tuple(element), tuple(remain)),
                                           confidence))
    return toRetItems, toRetRules


def printResults(items, rules):
    """prints the generated itemsets sorted by support and the confidence rules sorted by confidence"""
    for item, support in sorted(items, key = lambda x: float(x[1])):
        print("item: %s , %.3f" % (str(item), support))
    print("\n------------------------ RULES:")
    for rule, confidence in sorted(rules, key=lambda x: float(x[1])):
        pre, post = rule
        print("Rule: %s ==> %s , %.3f" % (str(pre), str(post), confidence))


def dataFromFile(fname):
    """Generator that reads the file and yields one record (a frozenset of items) per line"""
    with open(fname, 'r') as file_iter:
        for line in file_iter:
            line = line.strip().rstrip(',')   # Remove trailing comma
            record = frozenset(line.split(','))
            yield record

In [252]:
inFile = dataFromFile('data/groceries.csv')
minSupport = 0.04
minConfidence = 0.4
items, rules =  runApriori(inFile, minSupport, minConfidence)
printResults(items, rules)

item: ('soda', 'whole milk') , 0.040
item: ('white bread',) , 0.042
item: ('tropical fruit', 'whole milk') , 0.042
item: ('rolls/buns', 'other vegetables') , 0.043
item: ('chicken',) , 0.043
item: ('other vegetables', 'yogurt') , 0.043
item: ('other vegetables', 'root vegetables') , 0.047
item: ('frozen vegetables',) , 0.048
item: ('whole milk', 'root vegetables') , 0.049
item: ('chocolate',) , 0.050
item: ('napkins',) , 0.052
item: ('beef',) , 0.052
item: ('curd',) , 0.053
item: ('butter',) , 0.055
item: ('whole milk', 'yogurt') , 0.056
item: ('rolls/buns', 'whole milk') , 0.057
item: ('pork',) , 0.058
item: ('coffee',) , 0.058
item: ('margarine',) , 0.059
item: ('frankfurter',) , 0.059
item: ('domestic eggs',) , 0.063
item: ('brown bread',) , 0.065
item: ('whipped/sour cream',) , 0.072
item: ('fruit/vegetable juice',) , 0.072
item: ('whole milk', 'other vegetables') , 0.075
item: ('pip fruit',) , 0.076
item: ('canned beer',) , 0.078
item: ('newspapers',) , 0.080
item: ('bottled beer'
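
Once the rules are available, they can be turned into recommendations for a given basket. The sketch below assumes the `((antecedent, consequent), confidence)` format returned by `runApriori`; the helper `recommend_from_rules` and the toy rules are hypothetical and only meant to illustrate the idea.

```python
def recommend_from_rules(rules, basket, top_n=3):
    """Suggest items from rules whose antecedent is fully contained in `basket`."""
    basket = set(basket)
    best = {}
    for (pre, post), conf in rules:
        if set(pre) <= basket:             # the rule applies to this basket
            for item in post:
                if item not in basket:     # do not recommend what is already there
                    best[item] = max(best.get(item, 0.0), conf)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical rules in the ((antecedent, consequent), confidence) format
toy_rules = [
    ((("yogurt",), ("whole milk",)), 0.40),
    ((("butter",), ("whole milk",)), 0.50),
    ((("yogurt", "butter"), ("other vegetables",)), 0.45),
    ((("rum",), ("tonic",)), 0.90),
]
suggestions = recommend_from_rules(toy_rules, ["yogurt", "butter"])
print(suggestions)
```

Each candidate item keeps the highest confidence among the rules that fire, so the rule on `rum` is ignored for a basket that does not contain it.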

### Exercise: Create a Product Association Recommender with the MovieLens Dataset
Explain the obtained results and conclusions.

In [253]:
max2=data
# max2 = max2.loc[max2['user_id'] == 1]
# max2.groupby(['user_id'], sort=False)['rating'].max()
# For each user, keep only the movies that received that user's highest rating
idx = max2.groupby(['user_id'])['rating'].transform('max') == max2['rating']
# max3 = max2[idx].sample(100000)
max3 = max2[idx]
len(max3)


226570

In [254]:
# max3['title'] = max3.groupby(['user_id'])['title'].transform(lambda x: ','.join(x))
max_list = max3.groupby('user_id')['title'].apply(list)
len(max_list)


6040

In [255]:
max_list2 = [x for x in max_list if len(x) >1]
len(max_list2)

5984

In [259]:
above_mean = data.copy()  # copy, so that the column added below does not modify data

In [260]:
len(above_mean)

1000209

In [261]:
above_mean['mean_rating']  = above_mean['rating'].groupby(above_mean['user_id']).transform('mean')


In [262]:
above_mean = above_mean.drop(above_mean[above_mean.rating < above_mean.mean_rating].index)



In [263]:
above_mean['title'] = above_mean.groupby(['user_id'])['title'].transform(lambda x: ','.join(x))


In [264]:
above_mean = above_mean[['title']]
# df3 = df2[0:10000]
above_mean.head()



Unnamed: 0,title
0,"One Flew Over the Cuckoo's Nest (1975),Bug's L..."
1,"One Flew Over the Cuckoo's Nest (1975),Awakeni..."
2,"One Flew Over the Cuckoo's Nest (1975),Christm..."
3,"One Flew Over the Cuckoo's Nest (1975),Erin Br..."
4,"One Flew Over the Cuckoo's Nest (1975),Beauty ..."


In [272]:
minSupport = 0.04
minConfidence = 0.5
items, rules =  runApriori(max_list2, minSupport, minConfidence)
printResults(items, rules)

item: ('Game, The (1997)',) , 0.040
item: ('Big Chill, The (1983)',) , 0.040
item: ('Bonnie and Clyde (1967)',) , 0.040
item: ('Platoon (1986)', "Schindler's List (1993)") , 0.040
item: ('Bridge on the River Kwai, The (1957)', 'Raiders of the Lost Ark (1981)') , 0.040
item: ('Star Wars: Episode IV - A New Hope (1977)', 'Willy Wonka and the Chocolate Factory (1971)') , 0.040
item: ('Lawrence of Arabia (1962)', 'Casablanca (1942)') , 0.040
item: ('Good Will Hunting (1997)', 'Matrix, The (1999)') , 0.040
item: ('Glory (1989)', "Schindler's List (1993)") , 0.040
item: ("One Flew Over the Cuckoo's Nest (1975)", 'Taxi Driver (1976)') , 0.040
item: ('Monty Python and the Holy Grail (1974)', 'Star Wars: Episode VI - Return of the Jedi (1983)') , 0.040
item: ("Ferris Bueller's Day Off (1986)", 'Matrix, The (1999)') , 0.040
item: ('Pulp Fiction (1994)', 'Citizen Kane (1941)') , 0.040
item: ('Toy Story 2 (1999)', 'Saving Private Ryan (1998)') , 0.040
item: ('Graduate, The (1967)', 'Raiders of the

Rule: ('Star Wars: Episode IV - A New Hope (1977)', 'Star Wars: Episode VI - Return of the Jedi (1983)') ==> ('Star Wars: Episode V - The Empire Strikes Back (1980)', 'Raiders of the Lost Ark (1981)') , 0.534
Rule: ('Godfather, The (1972)', 'Pulp Fiction (1994)') ==> ('Silence of the Lambs, The (1991)',) , 0.535
Rule: ('Shawshank Redemption, The (1994)', 'Braveheart (1995)') ==> ('American Beauty (1999)',) , 0.535
Rule: ('Silence of the Lambs, The (1991)', 'Matrix, The (1999)') ==> ('American Beauty (1999)',) , 0.535
Rule: ('Saving Private Ryan (1998)', 'Sixth Sense, The (1999)') ==> ('Matrix, The (1999)',) , 0.535
Rule: ('Saving Private Ryan (1998)', 'Sixth Sense, The (1999)') ==> ('Star Wars: Episode IV - A New Hope (1977)',) , 0.535
Rule: ('Terminator, The (1984)', 'Raiders of the Lost Ark (1981)') ==> ('Godfather, The (1972)',) , 0.535
Rule: ('Good Will Hunting (1997)',) ==> ('American Beauty (1999)',) , 0.535
Rule: ('Terminator, The (1984)',) ==> ('Star Wars: Episode V - The Empir