# Homework 2. Frequent itemset


- Name: 김수연
- Student ID: 201800839
- Submission date: 2020 3/28 Sat.

*Remark. Do not import numpy, pandas, sklearn, or any module implementing the solution directly*

## Frequent itemset
- ***Support*** is an indication of how frequently the itemset $X$ appears in the dataset $T$.
- The support of $X$ with respect to $T$ is the proportion of transactions $t$ in the dataset that contain the itemset $X$.

$$
{\displaystyle \mathrm {supp} (X)={\frac {|\{t\in T;X\subseteq t\}|}{|T|}}} 
$$

- A frequent itemset is an itemset whose support is $\ge$ ***min_sup***.
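
As a minimal sketch of the definition above, the helper below counts the transactions that contain every item of $X$ and divides by $|T|$. The function name `support` and the four sample baskets are illustrative assumptions, not part of the assignment.

```python
def support(itemset, transactions):
    """Return the fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    # A transaction t counts when X is a subset of t.
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Hypothetical baskets, used only for illustration.
baskets = [
    ['apple', 'beer', 'rice'],
    ['apple', 'beer'],
    ['milk', 'mango'],
    ['milk', 'beer', 'rice'],
]

min_sup = 0.5
print(support(['beer'], baskets))                     # 0.75 (3 of 4 baskets)
print(support(['beer', 'rice'], baskets))             # 0.5  (2 of 4 baskets)
print(support(['beer', 'rice'], baskets) >= min_sup)  # True, so the pair is frequent
```

With `min_sup = 0.5`, both `{beer}` and `{beer, rice}` qualify as frequent in this sample, while `{mango}` (support 0.25) would not.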

## Data set

- Each line of the data below can be read as a market basket containing the items you want to buy.

In [1]:
# DO NOT EDIT THIS CELL
data_str = 'apple,beer,rice,chicken\n'
data_str += 'apple,beer,rice\n'
data_str += 'apple,beer\n'
data_str += 'apple,mango\n'
data_str += 'milk,beer,rice,chicken\n'
data_str += 'milk,beer,rice\n'
data_str += 'milk,beer\n'
data_str += 'milk,mango'

## Problem 1 (2 pts)

- Define a function ***gen_record*** that yields one record (a list of items) on each call to ***next***.
- It must be a generator.
- Use ***yield*** instead of ***return***.

In [2]:
# YOUR CODE MUST BE HERE 

def gen_record(s):
    """Split the string s into lines, then split each line on commas,
       and yield one list of items per line."""
    tmp = s.split('\n')
    for i in tmp:
        yield i.split(',')

In [3]:
# DO NOT EDIT THIS CELL 
test = gen_record(data_str)
next(test)

['apple', 'beer', 'rice', 'chicken']

**Your output must be:**
```
['apple', 'beer', 'rice', 'chicken']
```

In [4]:
# DO NOT EDIT THIS CELL
next(test)

['apple', 'beer', 'rice']

**Your output must be:**
```
['apple', 'beer', 'rice']
```
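
The two calls to `next` above each pull exactly one record, because a generator function pauses at every `yield` and resumes on the next request. The sketch below illustrates that behavior with a hypothetical generator `split_lines` and a two-line sample string; neither is part of the assignment.

```python
def split_lines(s):
    """Yield one list of items per line of s (hypothetical helper)."""
    for line in s.split('\n'):
        yield line.split(',')


records = split_lines('a,b\nc')   # the body has not started running yet
first = next(records)             # runs up to the first yield
second = next(records)            # resumes and runs to the second yield
print(first, second)              # ['a', 'b'] ['c']

# Once the records are exhausted, next() raises StopIteration;
# passing a default value returns that value instead.
print(next(records, None))        # None
```

Wrapping the generator in `list(...)`, as the later test cells do, simply drains it until `StopIteration`.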

## Problem 2 (10 pts)

- Define a function ***gen_frequent_1_itemset*** that generates the frequent 1-itemsets.
- It must be a generator.
- We want to find the frequent 1-itemsets (itemsets containing exactly 1 item).

In [5]:
# YOUR CODE MUST BE HERE

def gen_frequent_1_itemset(dataset, min_sup=0.5):
    """Yield every item whose support is at least min_sup.

    Collect the distinct items, count the transactions that contain
    each one, and compare count / len(dataset) against min_sup.
    """
    # Collect the distinct items across all transactions.
    item_set = set()
    for record in dataset:
        for item in record:
            item_set.add(item)

    for item in item_set:
        # Count the transactions that contain this item.
        cnt = 0
        for record in dataset:
            if item in record:
                cnt += 1
        # Yield the item when its support reaches min_sup.
        if cnt / len(dataset) >= min_sup:
            yield item


In [6]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.5):
    print(item)
print('No more items')

milk
rice
apple
beer
No more items


**Your output must be:**
```
rice
beer
milk
apple
No more items
```

In [7]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.7):
    print(item)
print('No more items')

beer
No more items


**Your output must be:**
```
beer
No more items
```

In [8]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.2):
    print(item)
print('No more items')

milk
rice
apple
mango
chicken
beer
No more items


**Your output must be:**
```
rice
chicken
beer
mango
milk
apple
No more items
```
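
The printed order above differs from the expected listing because the items are collected in a `set`, and Python does not guarantee an iteration order for sets of strings. As a hedged sketch, assuming only the membership of the result matters, the hypothetical variant `frequent_1_sorted` below counts each item once per transaction in a dictionary and yields the frequent items in alphabetical order, so repeated runs print the same sequence.

```python
def frequent_1_sorted(dataset, min_sup=0.5):
    """Yield frequent single items in alphabetical order (illustrative variant)."""
    counts = {}
    for record in dataset:
        # set(record) ensures an item repeated within one basket counts once.
        for item in set(record):
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):
        if counts[item] / len(dataset) >= min_sup:
            yield item


# The same eight baskets as data_str, written out so the block runs on its own.
baskets = [
    ['apple', 'beer', 'rice', 'chicken'],
    ['apple', 'beer', 'rice'],
    ['apple', 'beer'],
    ['apple', 'mango'],
    ['milk', 'beer', 'rice', 'chicken'],
    ['milk', 'beer', 'rice'],
    ['milk', 'beer'],
    ['milk', 'mango'],
]

print(list(frequent_1_sorted(baskets, 0.5)))  # ['apple', 'beer', 'milk', 'rice']
print(list(frequent_1_sorted(baskets, 0.7)))  # ['beer']
```

Comparing `sorted(...)` of two results is another way to check them against the expected output without depending on set order.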

## Problem 3 (10 pts)

- Define a function ***gen_frequent_2_itemset*** that generates the frequent 2-itemsets.
- It must be a generator.
- We want to find the frequent 2-itemsets (itemsets containing exactly 2 items).

In [9]:
# YOUR CODE MUST BE HERE

def gen_frequent_2_itemset(dataset, min_sup=0.5):
    """Yield every pair of items whose support is at least min_sup.

    Build the candidate pairs from the distinct items, count the
    transactions that contain both items of a pair, and compare
    count / len(dataset) against min_sup.
    """
    # Collect the distinct items across all transactions.
    item_set = set()
    for record in dataset:
        for item in record:
            item_set.add(item)

    item_list = list(item_set)

    # Pair each item only with the items after it, so no pair repeats
    # and no item is paired with itself.
    set_2item = [(x, y)
                 for i, x in enumerate(item_list)
                 for y in item_list[i + 1:]]

    for x, y in set_2item:
        # Count the transactions that contain both items of the pair.
        cnt = 0
        for record in dataset:
            if x in record and y in record:
                cnt += 1
        # Yield the pair when its support reaches min_sup.
        if cnt / len(dataset) >= min_sup:
            yield (x, y)


In [10]:
# DO NOT EDIT THIS CELL
data = list(gen_record(data_str))
for item in gen_frequent_2_itemset(data, 0.5):
    print(item)
print('No more items')

('rice', 'beer')
No more items


**Your output must be:**
```
('beer', 'rice')
No more items
```

In [11]:
# DO NOT EDIT THIS CELL
data = list(gen_record(data_str))
for item in gen_frequent_2_itemset(data, 0.3):
    print(item)
print('No more items')

('milk', 'beer')
('rice', 'beer')
('apple', 'beer')
No more items


**Your output must be:**
```
('beer', 'rice')
('beer', 'milk')
('apple', 'beer')
No more items
```

In [12]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_2_itemset(dataset, 0.2):
    print(item)
print('No more items')

('milk', 'rice')
('milk', 'beer')
('rice', 'apple')
('rice', 'chicken')
('rice', 'beer')
('apple', 'beer')
('chicken', 'beer')
No more items


**Your output must be:**
```
('chicken', 'rice')
('beer', 'rice')
('beer', 'chicken')
('beer', 'milk')
('milk', 'rice')
('apple', 'rice')
('apple', 'beer')
No more items
```
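
The pair listings above agree with the expected output only up to ordering, both of the pairs and of the items inside each pair. A minimal sketch of one way to make the result reproducible, assuming the standard-library `itertools` module is permitted by the remark at the top: the hypothetical `frequent_2_sorted` enumerates only the pairs that actually occur in each basket, stores each pair as an alphabetically sorted tuple, and yields the frequent ones in sorted order.

```python
from itertools import combinations


def frequent_2_sorted(dataset, min_sup=0.5):
    """Yield frequent item pairs as alphabetically sorted tuples (illustrative)."""
    counts = {}
    for record in dataset:
        # Sorting the distinct items makes every pair come out as (smaller, larger).
        for pair in combinations(sorted(set(record)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    for pair in sorted(counts):
        if counts[pair] / len(dataset) >= min_sup:
            yield pair


# The same eight baskets as data_str, written out so the block runs on its own.
baskets = [
    ['apple', 'beer', 'rice', 'chicken'],
    ['apple', 'beer', 'rice'],
    ['apple', 'beer'],
    ['apple', 'mango'],
    ['milk', 'beer', 'rice', 'chicken'],
    ['milk', 'beer', 'rice'],
    ['milk', 'beer'],
    ['milk', 'mango'],
]

print(list(frequent_2_sorted(baskets, 0.5)))
# [('beer', 'rice')]
print(list(frequent_2_sorted(baskets, 0.3)))
# [('apple', 'beer'), ('beer', 'milk'), ('beer', 'rice')]
```

Counting pairs per basket also avoids testing candidate pairs that never appear together, which matters more as the number of distinct items grows.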

## Ethics:
If you cheat, you will receive the negative of the total points.
For example, if the homework total is 22 points and you cheat, you get -22.

## What to submit
- Run **all cells**
- Go to "File -> Print Preview"
- Print the page as a PDF
- Submit the PDF file in Google Classroom
- Late homework is not accepted
- Your homework will be graded on the basis of correctness and programming skills