# Task 1

A file contains purchase data from a grocery store: purchase IDs, item names, and quantities. The task is to find which pairs of items were purchased together most often, i.e. to look for patterns in the data that could benefit the store. Write a function that returns a table with the top 5 pairs of items purchased together.

In [2]:
import pandas as pd
import numpy as np

In [78]:
def top5_pairs(df, id_col, item_col, quantity_col):
    '''
    Finds the top 5 pairs of items purchased together.

        Parameters:
            df: dataframe with one row per item in a purchase
            id_col: name of the purchase id column
            item_col: name of the item name column
            quantity_col: name of the item quantity column

        Returns a table with the top 5 pairs of items and the number of
        purchases in which each pair appears together.
    '''
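    # drop rows with nulls and repeated (purchase, item) rows, which would make pivot() fail;
    # set the quantity column to 1 so it marks the occurrence of an item in a purchase
    df = df.dropna().drop_duplicates(subset=[id_col, item_col]).copy()
    df[quantity_col] = 1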
    
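    # item-by-purchase indicator matrix (1 if the item is in the purchase, else 0)
    tab_1 = df.pivot(index=item_col, columns=id_col, values=quantity_col).fillna(0)
    tab_2 = tab_1.T
    # entry (a, b) of the product is the number of purchases containing both a and b
    pairs = tab_1.dot(tab_2)
    # zero out the diagonal and the upper triangle so each pair is counted only once
    pairs = pairs.mask(np.triu(pairs.to_numpy()).astype(bool)).fillna(0)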
    
    # reshaping into a long result table: one row per pair, sorted by frequency
    pairs = (
        pairs.stack().to_frame().reset_index(1).rename(columns={item_col: 'item_2'})
        .reset_index().rename(columns={item_col: 'item_1', 0: 'Frequency'})
        .sort_values(by='Frequency', ascending=False).reset_index(drop=True)
        .query('Frequency != 0').head(5)
    )
    return pairs
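
The matrix product used above may be easier to follow on a small example. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its English item names are assumptions, not part of the given dataset). It uses `pd.crosstab` instead of `pivot` to build the indicator matrix, which is a substitution made here for brevity.

```python
import pandas as pd

# Hypothetical toy data: three purchases with illustrative item names.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "item": ["dill", "cucumber", "parsley", "dill", "cucumber", "parsley"],
})

# Item-by-purchase indicator matrix: 1 if the item appears in the purchase.
indicator = (pd.crosstab(toy["item"], toy["id"]) > 0).astype(int)

# Entry (a, b) of the product counts the purchases that contain both a and b;
# the diagonal holds the number of purchases containing each single item.
cooccurrence = indicator.dot(indicator.T)
print(cooccurrence)
```

The product is symmetric, which is why `top5_pairs` keeps only one triangle of it and discards the diagonal.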

In [82]:
# loading the given dataset
data = pd.read_csv('https://stepik.org/media/attachments/lesson/409319/test1_completed.csv')
data.head(10)

Unnamed: 0,id,Товар,Количество
0,17119,Лимон,1.1
1,17119,Лимон оранжевый,0.7
2,17119,Лук-порей,10.0
3,17119,Лук репчатый,2.5
4,17119,Малина свежая,1.0
5,17119,Морковь немытая,1.4
6,17119,Черешня сушеная,1.8
7,17530,Лимон оранжевый,0.25
8,17530,Изюм Султана,0.5
9,17530,Капуста цветная,2.0


In [81]:
# using the function on the given dataset ('Товар' is the item column, 'Количество' is the quantity column)
top5_pairs(data, 'id', 'Товар', 'Количество')

Unnamed: 0,item_1,item_2,Frequency
0,Укроп,Огурцы Луховицкие,431.0
1,Укроп,Петрушка,408.0
2,Огурцы Луховицкие,Арбуз,345.0
3,Огурцы Луховицкие,Кабачки,326.0
4,Укроп,Кинза,303.0
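
One way to cross-check a result like this is to count pairs directly, purchase by purchase, without building a matrix. The sketch below is an alternative approach, not the method used above; `top_pairs_by_counting` and the toy data are hypothetical names introduced for illustration. It assumes that only rows missing an id or an item name should be dropped.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def top_pairs_by_counting(df, id_col, item_col, n=5):
    """Count unordered item pairs per purchase (hypothetical helper)."""
    counts = Counter()
    clean = df.dropna(subset=[id_col, item_col])
    for _, items in clean.groupby(id_col)[item_col]:
        # sorted(set(...)) removes repeated items and fixes the order
        # inside each pair, so (a, b) and (b, a) are counted as one pair
        counts.update(combinations(sorted(set(items)), 2))
    rows = [(a, b, freq) for (a, b), freq in counts.most_common(n)]
    return pd.DataFrame(rows, columns=["item_1", "item_2", "Frequency"])


# Hypothetical toy data: three purchases with illustrative item names.
toy = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3],
    "item": ["dill", "cucumber", "parsley", "dill", "cucumber", "parsley"],
})
result = top_pairs_by_counting(toy, "id", "item")
print(result)
```

The order of the two items within a row may differ from the output of `top5_pairs`, since here each pair is stored in alphabetical order. The frequencies should agree when the input has no missing values.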
