# CSCI4022 Homework 4; Item Sets

## Due Friday, March 12 at 11:59 pm to Canvas

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems, except the **form of your output for #3**.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.

---

In [10]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import itertools
import csv


***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (Theory: A Priori Properties; 10 pts) 
In using triangular arrays to store item basket data, we defined the function $a[k] :=$ count for the pair $\{i, j\}$, where $1 \leq i < j \leq n$, with

$k=(i-1)\left(n-\frac{i}{2}\right) +j -i$


This formula involves dividing an arbitrary integer $i$ by 2. Yet $k$ is an index, so we need $k$ to be an integer. Prove that $k$ will, in fact, be an integer.


Solution: 

If $i$ is even, then $\dfrac{i}{2}$ is an integer, so every term in the formula for $k$ (i.e. $(i-1)$, $\left(n-\dfrac{i}{2}\right)$, $j$ and $i$) is an integer, and therefore $k$ is an integer.


If $i$ is odd, then $(i-1)$ is even. Writing $(i-1)\left(n-\dfrac{i}{2}\right) = (i-1)\,n - \dfrac{(i-1)\,i}{2}$, the first term is an integer, and in the second term the even factor $(i-1)$ cancels the 2 in the denominator, so the whole product is an integer. Adding the integers $j$ and $-i$ keeps $k$ an integer.
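
As a numerical sanity check on the argument above, the formula can be evaluated with integer arithmetic only. This is a minimal sketch, not part of the assigned proof; the helper name `tri_index` is hypothetical, and it relies on the fact that $(i-1)(2n-i)$ is always even because one of the two factors is even.

```python
def tri_index(i, j, n):
    """1-indexed position k of the pair {i, j}, with 1 <= i < j <= n."""
    # (i - 1) * (2n - i) is always even: if i is odd then (i - 1) is even,
    # and if i is even then (2n - i) is even, so the floor division is exact.
    return (i - 1) * (2 * n - i) // 2 + j - i


n = 5
# Visit the pairs in row-major order: (1,2), (1,3), ..., (4,5).
ks = [tri_index(i, j, n) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
print(ks)
```

For $n = 5$ the ten pairs map onto the indices 1 through 10 with no gaps or repeats, which is what a triangular array needs.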

***
<a/ id='p2'></a>
[Back to top](#top)
# Problem 2 (Theory: Item Baskets; 10 pts)

Suppose we have 20 distinct items numbered 1 to 20. Each basket is constructed by including the item numbered `k` with probability $1/k$, independent of other items.  As a result, all baskets will include item 1, half will include item 2, and so forth.  What are all of the *itemsets* expected to be frequent at a support threshold of 3%?

Note: You may use simulation if you prefer, but I suspect you may find the pen-and-paper answer is easier.

Solution:

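All itemsets of size 1 are expected to be frequent: the rarest item, item 20, appears with probability $1/20 = 5\%$, so out of 100 baskets we expect at least 5 baskets containing each item, which exceeds the support threshold of 3 percent. More generally, because items are included independently, an itemset $I$ is expected to appear in a fraction $\prod_{k \in I} 1/k$ of baskets, so a larger itemset is also expected to be frequent whenever the product of its item numbers is at most 33 (since $1/0.03 \approx 33.3$), for example $\{2,3\}$ or $\{1,2,3,5\}$.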
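
A short enumeration can cross-check the pen-and-paper reasoning. The sketch below assumes independence between items, so the expected support of an itemset is the product of $1/k$ over its items; the function name `expected_frequent_itemsets` and its parameters are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import prod


def expected_frequent_itemsets(n_items=20, support_pct=3):
    """Itemsets whose expected support, prod(1/k), is at least support_pct percent."""
    frequent = []
    for size in range(1, n_items + 1):
        found_any = False
        for itemset in combinations(range(1, n_items + 1), size):
            # prod(1/k) >= support_pct / 100  <=>  100 >= support_pct * prod(k),
            # which avoids floating-point comparisons entirely.
            if 100 >= support_pct * prod(itemset):
                frequent.append(itemset)
                found_any = True
        if not found_any:
            break  # monotonicity: no larger itemset can be frequent either
    return frequent


itemsets = expected_frequent_itemsets()
print(len(itemsets))
print(max(itemsets, key=len))
```

The early `break` uses the same monotonicity property that A-Priori relies on: once no itemset of a given size is frequent, no larger one can be.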

***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 3 (Practice: Candidate Items; 25 pts)

In the A-Priori algorithm, there is a step in which we create a candidate list of frequent itemsets of size $k+1$ as we prune the frequent itemsets of size $k$.  In this problem we will create two functions to do that formally.

#### Part A:

There are two types of data objects in which we might be holding the frequency counts of itemsets.  If $k=2$, they may be stored in a triangular array.  Create a function `Cand_Trips` that takes a triangular array and returns all valid candidate triples as a list.  Recall that the itemset $\{i,j,k\}$ is only a candidate if all 3 of the itemsets in $\{\{i,j\}, \{i,k\}, \{k,j\}\}$ are frequent.

Some usage notes:

- The first input argument is `Triang_Counts`,  a zero-indexed triangular (numeric) array, by same convention as introduced in class.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- The return array `Candidates` should be a list of 3-index lists of the item numbers of the triples.  So a final answer for some input might be:

`Candidates` =
    `[[0,3,4], [1,2,7]]`

- An implementation note: there are two fundamentally different ways to think about implementing this function.  Option 1 involves thinking about the elements of `Triang_Counts` in terms of their locations on the corresponding *triangular matrix*: scan row $i$ for a pair of frequent pairs $\{\{i,j\}, \{i,k\}\}$ and then check if $\{j,k\}$ is in fact frequent.  Option 2 scans all of `Triang_Counts` for frequent item pairs (the "pruning" step) and saves those in some object with their indices, then scans *that* object for candidates.  Both are valid for this problem, but option 2 may generalize to higher $k$ better...
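
To make Option 1 concrete, here is a minimal sketch of the row-scan idea. It is one possible reading of the note above, not a prescribed solution: the names `tri_pos` and `cand_trips_rowscan` are hypothetical, the number of items `n` is passed in explicitly as an assumption, and the array is assumed to be stored row by row with 0-indexed items.

```python
def tri_pos(i, j, n):
    """Position of pair (i, j), 0 <= i < j < n, in a 0-indexed row-major triangular array."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def cand_trips_rowscan(triang_counts, s, n):
    """For each row i, pair up frequent partners j < k and check that {j, k} is frequent."""
    candidates = []
    for i in range(n):
        # Items j > i such that the pair {i, j} meets the support threshold.
        partners = [j for j in range(i + 1, n) if triang_counts[tri_pos(i, j, n)] >= s]
        for a, j in enumerate(partners):
            for k in partners[a + 1:]:
                # {i, j} and {i, k} are frequent; the triple needs {j, k} as well.
                if triang_counts[tri_pos(j, k, n)] >= s:
                    candidates.append([i, j, k])
    return candidates


counts = [10, 7, 3, 2, 6, 4, 3, 3, 2, 0]  # the 5-item test array from Part B
print(cand_trips_rowscan(counts, 6, n=5))
```

Because each triple is only generated from its smallest item's row, no triple is produced twice and the output comes out in sorted order.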

In [2]:
def Cand_Trips(Triang_Counts, s):
    # Recover the number of items n from the array length m = n(n-1)/2,
    # so the function works for any valid triangular array.
    m = len(Triang_Counts)
    number_items = int((1 + np.sqrt(1 + 8 * m)) / 2)
    # Row and column indices (i < j) of every entry, in the same row-major order as the array.
    rows, cols = np.triu_indices(number_items, k=1)
    # Pruning step: keep only the pairs whose count meets the support threshold.
    frequent_pairs = {(int(i), int(j))
                      for i, j, c in zip(rows, cols, Triang_Counts) if c >= s}
    result = []
    # Go through every possible triple; it is a candidate only if all 3 of its
    # pair components (e.g. [1,2,3] has components [1,2], [1,3], [2,3]) are frequent.
    for triple in itertools.combinations(range(number_items), 3):
        if all(pair in frequent_pairs for pair in itertools.combinations(triple, 2)):
            result.append(list(triple))
    return result


#### Part B:

A quick test case.  Below is a matrix $M$ and code including its corresponding triangular array.  

$M=\begin{bmatrix}
\cdot &10&7&3&2\\
\cdot &\cdot&6&4&3\\
\cdot &\cdot&\cdot&3&2\\
\cdot &\cdot&\cdot&\cdot&0\\
\cdot &\cdot&\cdot&\cdot&\cdot\\
\end{bmatrix}$
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [3]:
Triang_Counts=[10,7,3,2,6,4,3,3,2,0]

#Check that...
print(Cand_Trips(Triang_Counts, 1))# returns all the possible triples except those that contain BOTH items 3 and 4.
print(Cand_Trips(Triang_Counts, 6)) #returns only the triple [[0,1,2]]

[[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]
[[0, 1, 2]]


#### Part C:

Suppose instead that our $k=2$ item counts were stored in a list of the form e.g.
`Pairs_Counts` =
    `[[0,1,12], [0,2,0], [0,3,11], ..., [7,8,103]]`
    
where each element is a triple storing the two item indices and their count, $[i,j,c_{ij}]$. 

Create a function `Cand_Trips_List` that takes in a list of pairs counts and returns all valid candidate triples as a list.  

Some usage notes:

- The first input argument is `Pairs_Counts`,  a zero-indexed list of triples.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- The return array `Candidates` should be a list of 3-element lists, as above.

You should **not** convert the input list `Pairs_Counts` into a triangular array as part of your function.  After all, sometimes we use the list format for pairs because it saves memory compared to the triangular array format!  You may be able to borrow heavily from the logic of your first function, though!
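
The memory remark above can be made concrete with a rough count of stored integers. This is a sketch under stated assumptions: the triples list is assumed to store only pairs that actually occur (zero-count pairs omitted), each index or count is assumed to cost one integer, and the helper name `storage_cost` is hypothetical.

```python
def storage_cost(n_items, n_pairs_seen):
    """Integers stored by each layout (illustrative helper, not part of the assignment)."""
    triangular = n_items * (n_items - 1) // 2  # one count per possible pair
    triples = 3 * n_pairs_seen                 # i, j and the count per stored pair
    return triangular, triples


print(storage_cost(1000, 50_000))
```

Under these assumptions the list format wins whenever fewer than one third of all possible pairs appear in some basket.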

In [4]:
def Cand_Trips_List(Pairs_Counts, s):
    # Pruning step: keep only the pairs whose count meets the support threshold.
    frequent_pairs = {frozenset((i, j)) for i, j, c in Pairs_Counts if c >= s}
    # Only items that appear in some frequent pair can belong to a candidate triple,
    # so the item list is read from the data instead of being entered manually.
    items = sorted(set().union(*frequent_pairs))
    result = []
    # Go through every possible triple; it is a candidate only if all 3 of its
    # pair components (e.g. [1,2,3] has components [1,2], [1,3], [2,3]) are frequent.
    for triple in itertools.combinations(items, 3):
        if all(frozenset(pair) in frequent_pairs
               for pair in itertools.combinations(triple, 2)):
            result.append(list(triple))
    return result
#Cand_Trips_List(Pairs_Counts,6)

#### Part D:

Do the test case again.  Below is the list representation of the same matrix $M$ from part B.  
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [5]:
Pairs_Counts=[[0,1,10], [0,2,7], [0,3,3], [0,4,2],\
             [1,2,6],[1,3,4], [1,4,3],\
             [2,3,3],[2,4,2],\
             [3,4,0]]
#Pairs_Counts

#Check that...
print(Cand_Trips_List(Pairs_Counts, 1))# returns all the possible triples except those that contain BOTH items 3 and 4.
print(Cand_Trips_List(Pairs_Counts, 6))# returns only the triple [[0,1,2]]

[[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]
[[0, 1, 2]]


#### Part E

Describe *in words* how you would generalize your code in part D to work for generating candidate quadruples $[i_1, i_2, i_3, i_4]$ from an input list of triples counts (each element of the form $[i, j, k, c_{ijk}]$).

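I would generate all possible candidates of size four from the items using `itertools.combinations`. Then, for each candidate quadruple, I would generate its four subset triples and check that every one of them appears among the frequent triples, i.e. the triples in the input list whose count $c_{ijk}$ is at least $s$. Only the quadruples that pass this check are kept as candidates.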
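
The description above can be sketched for a general itemset size $k$. This is one possible generalization, not the course's reference solution; the function name `candidates_from_counts` is hypothetical, and every row of the input is assumed to have the form $[i_1, \dots, i_k, c]$ with the same $k$.

```python
from itertools import combinations


def candidates_from_counts(counts, s):
    """counts: list of [i_1, ..., i_k, c]; returns candidate (k+1)-itemsets as sorted lists."""
    # Pruning step: keep the k-itemsets whose count meets the threshold.
    frequent = {frozenset(row[:-1]) for row in counts if row[-1] >= s}
    if not frequent:
        return []
    k = len(next(iter(frequent)))
    # Only items seen in some frequent k-itemset can appear in a candidate.
    items = sorted(set().union(*frequent))
    return [list(cand) for cand in combinations(items, k + 1)
            if all(frozenset(sub) in frequent for sub in combinations(cand, k))]


triples_counts = [[0, 1, 2, 5], [0, 1, 3, 4], [0, 2, 3, 6], [1, 2, 3, 3], [0, 1, 4, 9]]
print(candidates_from_counts(triples_counts, 3))
```

Storing the frequent itemsets as `frozenset` objects makes each membership check a constant-time lookup that does not depend on the order of the items.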

# Problem 4 (Practice: A-Priori.  Not due this week.)

This problem is **not on this homework**.  It is repeated on the homework due next week on Friday, Mar 19, which also contains your first PageRank/power iteration problem(s).  But it is contained here in brief in case you wish to start on it early, because it involves using A-Priori and includes some data wrangling/munging that you might enjoy extra time on.

Consider the Online Retail data set provided in `onlineretail.csv`.  This includes over 500,000 purchases from an online retailer.

We want to use the baskets (marked by `InvoiceNo`) and the items (marked by `StockCode` and/or `Description`) to perform an item basket analysis.

This data set is small enough to run directly from main memory, so you may do that if you wish.  You may also complete this problem using only the first 100,000 entries of the .csv if you wish for shorter computational time.  Be very explicit which you are using.

#### a)  There are some odd entries in the data set.  Make sure that you're discarding any transactions and items with no `Description`, non-positive `Quantity`, or non-positive `Unit Price`.


#### b) For our first iteration, we will use just `StockCode` for the items.  Use `StockCode` to create a table of frequent single items at 1% support threshold.  For convenience on this part of the problem and part c), you may choose to discard all items with non-integer values in `StockCode`.  Was 1% an appropriate support threshold?  Describe why or why not.

#### c) Use A-priori to find all frequent  pairs of items from your set of frequent items in b).  Use whatever support threshold you feel is most appropriate.

#### d) Use a hash table to hash items from their `Descriptions`.  Include a check to minimize and fix any collisions, as in nb08.

#### e) Use A-priori to find all frequent items and all frequent pairs of items from your hashed data set in part d).

#### f) Did any frequent items appear in part d) that did not in part b)?  If so, list them.
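
For orientation, the two A-Priori passes that parts b) and c) call for can be sketched on toy data. This is a minimal illustration, not the required implementation: the function name `apriori_pairs`, the toy baskets, and the absolute count threshold `s` are all assumptions, and support is counted as the number of baskets containing an item or pair.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, s):
    """Two-pass A-Priori sketch: returns (frequent items, frequent pairs) with basket counts."""
    # Pass 1: count the baskets containing each item (set() ignores repeats in a basket).
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item: c for item, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs whose members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in set(basket) if item in frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= s}
    return frequent_items, frequent_pairs


toy = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'd'], ['a', 'b', 'c']]
items, pairs = apriori_pairs(toy, s=3)
print(items)
print(pairs)
```

Driving the second pass by basket, rather than by candidate pair, means each basket is read once and only pairs that actually co-occur are ever counted.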

In [235]:
df = pd.read_csv('/users/tamer/desktop/advanced_ds/data/onlineretail.csv',nrows=100000,names = ['InvoiceNo','StockCode','Description','Quantity','InvoiceDate','UnitPrice','CustomerID','Country'],encoding='latin1')  
#drop columns we don't need
df = df.drop(columns = ['InvoiceDate','CustomerID','Country'])
#replace empty values in Description with NaN
df['Description'] = df['Description'].replace('',np.nan,regex=True)
#if a StockCode value contains a non-digit character, replace it with NaN
df['StockCode'] = df['StockCode'].replace('[^0-9]',np.nan,regex=True)
#convert string numbers to numeric values; anything unparseable becomes NaN
df['Quantity']= pd.to_numeric(df['Quantity'],errors='coerce')
df['UnitPrice']= pd.to_numeric(df['UnitPrice'],errors='coerce')
#replace any non-positive value with NaN
df['Quantity'] = df['Quantity'].where(df['Quantity'] > 0.0)
df['UnitPrice'] = df['UnitPrice'].where(df['UnitPrice'] > 0.0)
#last but not least, drop rows containing NaN
df = df.dropna()
print(len(df))
df.head(50)

84562


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice
2,536365,71053,WHITE METAL LANTERN,6.0,3.39
6,536365,22752,SET 7 BABUSHKA NESTING BOXES,2.0,7.65
7,536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6.0,4.25
8,536366,22633,HAND WARMER UNION JACK,6.0,1.85
9,536366,22632,HAND WARMER RED POLKA DOT,6.0,1.85
10,536367,84879,ASSORTED COLOUR BIRD ORNAMENT,32.0,1.69
11,536367,22745,POPPY'S PLAYHOUSE BEDROOM,6.0,2.1
12,536367,22748,POPPY'S PLAYHOUSE KITCHEN,6.0,2.1
13,536367,22749,FELTCRAFT PRINCESS CHARLOTTE DOLL,8.0,3.75
14,536367,22310,IVORY KNITTED MUG COSY,6.0,1.65


In [236]:
#build the baskets: map each InvoiceNo to the list of StockCodes on that invoice
#(sort=False keeps the invoices in their order of first appearance)
baskets = df.groupby('InvoiceNo', sort=False)['StockCode'].apply(list).to_dict()


In [228]:
#print(baskets)

{'536365': ['71053', '22752', '21730'], '536366': ['22633', '22632'], '536367': ['84879', '22745', '22748', '22749', '22310', '84969', '22623', '22622', '21754', '21755', '21777', '48187'], '536368': ['22960', '22913', '22912', '22914'], '536369': ['21756'], '536370': ['22728', '22727', '22726', '21724', '21883', '10002', '21791', '21035', '22326', '22629', '22659', '22631', '22661', '21731', '22900', '21913', '22540', '22544', '22492'], '536371': ['22086'], '536372': ['22632', '22633'], '536373': ['71053', '20679', '37370', '21871', '21071', '21068', '82483', '82486', '82482', '22752', '21730'], '536374': ['21258'], '536375': ['71053', '20679', '37370', '21871', '21071', '21068', '82483', '82486', '82482', '22752', '21730'], '536376': ['22114', '21733'], '536377': ['22632', '22633'], '536378': ['22386', '21033', '20723', '21094', '20725', '21559', '22352', '21212', '21975', '21977', '84991', '21931', '21929'], '536380': ['22961'], '536381': ['22139', '84854', '22411', '82567', '21672'

In [237]:
item_count = {}
one_percent = round(len(baskets) * (1/100))#one percent of basket count

#loop over rows and count how often each item appears, saved in a dictionary as {item:count}
#note: this counts rows, so an item listed twice on one invoice is counted twice
for i in df['StockCode']:
    if i not in item_count:
        item_count.update({i:1})
    else:
        item_count[i] = item_count[i] +1

frequent_item = {}
for i in item_count:
    if item_count[i]> one_percent:
        frequent_item.update({i:item_count[i]})


In [248]:
print(len(frequent_item))

846


In [238]:
pair = list(itertools.combinations(frequent_item,2))
pair_len = len(pair)

In [216]:
print(pair[1])

('71053', '22633')


In [239]:
item_pair_count = {}
#Count each and every pair of items.
for x in range(pair_len):
    count = 0
    for i in baskets:
        #if both items are in the same basket then count++
        if(pair[x][0] in baskets[i]):
            if(pair[x][1] in baskets[i]):
                count = count + 1
    item_pair_count.update({pair[x]:count})


In [249]:
frequent_item_pair = {}
for i in item_pair_count:
    if item_pair_count[i] > 35:
        frequent_item_pair.update({i:item_pair_count[i]})
print(len(frequent_item_pair))

1967
