# Understanding the data and gathering requirements

- dataset:
    - Say we have a dataset with the following columns:
    1. items - the actual product bought by the customer.
    2. group_cat - the key used to group similar items.
        e.g. red apple and green apple belong to the same group. An item cannot be part of any other group.
    3. order_count - the number of orders placed for a particular item.
- requirement:
    - The task is to split the above dataset into n batches, subject to the following requirements:
    1. All items of a group_cat must land in the same batch, so no two batches share an item.
    2. All batches should have approximately equal totals of order_count.
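
The two requirements can be turned into a small check. The sketch below is one possible validator and is not part of the original notebook: the names `validate_batches`, `group_totals` and the toy fruit groups are hypothetical, and it assumes the per-group order totals are already available as a dict.

```python
from collections import Counter

def validate_batches(group_totals, batches):
    """Check requirement 1 and return per-batch order totals for requirement 2.

    group_totals: dict mapping group_cat -> summed order_count (assumed input shape)
    batches: list of lists of group_cat values
    """
    seen = Counter(g for batch in batches for g in batch)
    # requirement 1: every group appears in exactly one batch
    duplicated = [g for g, c in seen.items() if c > 1]
    missing = [g for g in group_totals if g not in seen]
    if duplicated or missing:
        raise ValueError(f"duplicated={duplicated}, missing={missing}")
    # requirement 2: the caller compares these totals to judge "approximately equal"
    return [sum(group_totals[g] for g in batch) for batch in batches]

# toy data, made up for illustration
toy_totals = {"apples": 10, "pears": 7, "plums": 4}
totals = validate_batches(toy_totals, [["apples"], ["pears", "plums"]])
print(totals, max(totals) - min(totals))
```

How large a gap between the biggest and smallest total still counts as "approximately equal" is left open by the requirements, so the helper only reports the totals.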
    

In [None]:
def fn_name(arguments):
    # process the arguments
    # meet the requirements and return
    return 

# create a dataset that looks similar to the input

In [1]:
group_cat = ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'GH', 'HI', 'IJ']
print(len(group_cat))

9


In [2]:
import random
random.randint(0, 10)

10

In [9]:
random.randint(0, 10)

3

In [11]:
# AB
[ random.randint(1000, 1500) for x in range(random.randint(1, 4))]

[1495, 1277, 1010]

In [12]:
# BC
[ random.randint(1501, 3000) for x in range(random.randint(1, 4))]

[1740, 1988]

In [13]:
import random
# one list of 1-4 item ids per group; ids for group i fall in [(i+1)*1000, (i+1)*1000+999]
items = []
for i in range(len(group_cat)):
    items.append([random.randint((i+1)*1000, (i+1)*1000+999) for _ in range(random.randint(1, 4))])

In [15]:
group_cat[0]

'AB'

In [14]:
items[0]

[1483, 1983, 1520]

In [17]:
import pandas as pd
data = pd.DataFrame({'items': items,
                    'group_cat':group_cat})

In [18]:
data.head()

                items  group_cat
0  [1483, 1983, 1520]         AB
1              [2655]         BC
2        [3294, 3334]         CD
3        [4409, 4030]         DE
4              [5696]         EF


In [19]:
data = data.explode('items')

In [20]:
data.head()

   items  group_cat
0   1483         AB
0   1983         AB
0   1520         AB
1   2655         BC
2   3294         CD


In [21]:
data['order_count'] = [random.randint(1, 20) for x in range(data.shape[0])]

In [22]:
data.head()

   items  group_cat  order_count
0   1483         AB            5
0   1983         AB            3
0   1520         AB            3
1   2655         BC            9
2   3294         CD           14


# steps to achieve the batching logic

In [23]:
data.shape

(17, 3)

In [26]:
nbatches = 2
batches = [[] for x in range(nbatches)]

In [27]:
batches

[[], []]

In [28]:
batch_count = [0 for x in range(nbatches)]

In [29]:
batch_count

[0, 0]

In [31]:
data_grp = data.groupby(['group_cat']).agg({'order_count':'sum', 'items':list}).reset_index()
data_grp

   group_cat  order_count               items
0         AB           11  [1483, 1983, 1520]
1         BC            9              [2655]
2         CD           15        [3294, 3334]
3         DE           15        [4409, 4030]
4         EF           19              [5696]
5         FG           11        [6930, 6574]
6         GH           30        [7232, 7634]
7         HI           37  [8418, 8406, 8349]
8         IJ            9              [9778]


In [33]:
import numpy as np
np.argmin([1, 10, -5, -5, 0, 500])

2

In [34]:
data_grp.shape[0]

9

In [35]:
data_grp['group_cat'].iloc[0]

'AB'

In [36]:
batches[0]

[]

In [37]:
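# greedy assignment: each group goes to the batch with the smallest running total
for i in range(data_grp.shape[0]):
    idx = np.argmin(batch_count)  # index of the least-loaded batch (first one on ties)
    batches[idx].append(data_grp['group_cat'].iloc[i])
    batch_count[idx] += data_grp['order_count'].iloc[i]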

In [38]:
batches

[['AB', 'DE', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'EF', 'HI']]

In [39]:
batch_count

[76, 80]
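
To see how the loop arrives at `[76, 80]`, the same greedy rule can be replayed in plain Python on the group totals from the `data_grp` output. This is only an illustrative trace: `group_totals` and `greedy_trace` are names introduced here, and `min(range(n), key=...)` stands in for `np.argmin` (both pick the first minimum on ties).

```python
# group totals copied from the data_grp output
group_totals = [("AB", 11), ("BC", 9), ("CD", 15), ("DE", 15), ("EF", 19),
                ("FG", 11), ("GH", 30), ("HI", 37), ("IJ", 9)]

def greedy_trace(pairs, n):
    """Assign each (group, count) pair to the least-loaded batch, printing every step."""
    batches = [[] for _ in range(n)]
    totals = [0] * n
    for group, count in pairs:
        idx = min(range(n), key=totals.__getitem__)  # first least-loaded batch
        batches[idx].append(group)
        totals[idx] += count
        print(f"{group} ({count:>2}) -> batch {idx}, totals now {totals}")
    return batches, totals

trace_batches, trace_totals = greedy_trace(group_totals, 2)
```

The order totals add up to 156, so a perfectly even split would be 78 per batch; this pass lands within 2 of that. The outcome depends on the order in which groups are visited, which here is the sorted key order produced by `groupby`.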

# write the function once the logic works

In [40]:
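def batching(df: pd.DataFrame, n: int):
    """Greedily split the groups in df into n batches with roughly equal order_count totals."""
    # total order_count (and member items) per group, so a group is always assigned as a whole
    df_grp = df.groupby(['group_cat']).agg({'order_count':'sum', 'items':list}).reset_index()
    batches = [[] for _ in range(n)]      # group_cat values assigned to each batch
    batch_count = [0 for _ in range(n)]   # running order_count total per batch
    for i in range(df_grp.shape[0]):
        idx = np.argmin(batch_count)  # least-loaded batch so far
        batches[idx].append(df_grp['group_cat'].iloc[i])
        batch_count[idx] += df_grp['order_count'].iloc[i]
    return batches, batch_count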

In [41]:
new_batches, new_batch_count = batching(data, 2)

In [42]:
batches

[['AB', 'DE', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'EF', 'HI']]

In [43]:
new_batches

[['AB', 'DE', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'EF', 'HI']]

In [45]:
batch_count

[76, 80]

In [46]:
new_batch_count

[76, 80]
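
The function returns group labels rather than rows. One way to carry the result back to the item level is a lookup from `group_cat` to batch index. The sketch below uses plain Python and the first rows shown by `data.head()`; the names `group_to_batch`, `sample_rows` and `labelled` are introduced here for illustration.

```python
# batches as returned by batching(data, 2)
new_batches = [['AB', 'DE', 'FG', 'GH', 'IJ'], ['BC', 'CD', 'EF', 'HI']]

# invert the list of batches into a group_cat -> batch index lookup
group_to_batch = {group: idx for idx, batch in enumerate(new_batches) for group in batch}

# (items, group_cat) pairs copied from data.head()
sample_rows = [(1483, 'AB'), (1983, 'AB'), (1520, 'AB'), (2655, 'BC'), (3294, 'CD')]

# attach the batch index to every row via its group
labelled = [(item, group, group_to_batch[group]) for item, group in sample_rows]
print(labelled)
```

With the DataFrame at hand, the same lookup could be applied as `data['group_cat'].map(group_to_batch)` to add a batch column.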