# Split / Expand and Melt - Code Snippets

- Expand a column of list values into separate columns (with different index options)
- Expand a column of string values into separate columns
- Melt the expanded columns back into long format

Data and inspiration taken from [here](https://medium.com/datadriveninvestor/how-to-build-a-recommendation-system-for-purchase-data-step-by-step-d6d7a78800b6).


In [1]:
import pandas as pd

In [2]:
"""load and check initial data"""

transactions = pd.read_csv('trx_data.csv')
transactions.head()

Unnamed: 0,customerId,products
0,0,20
1,1,2|2|23|68|68|111|29|86|107|152
2,2,111|107|29|11|11|11|33|23
3,3,164|227
4,5,2|2


In [3]:
"""step 1: split products string into a list of integers"""

transactions['products'] = transactions['products'].apply(lambda x: [int(i) for i in x.split('|')])

In [4]:
# check result
transactions.head()

Unnamed: 0,customerId,products
0,0,[20]
1,1,"[2, 2, 23, 68, 68, 111, 29, 86, 107, 152]"
2,2,"[111, 107, 29, 11, 11, 11, 33, 23]"
3,3,"[164, 227]"
4,5,"[2, 2]"
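
To try step 1 without `trx_data.csv`, the following is a minimal sketch that rebuilds the five rows shown above by hand. The frame name `sample` is a stand-in chosen here, not part of the original notebook, and `.str.split('|')` is used in place of the plain `x.split('|')` inside the lambda.

```python
import pandas as pd

# Hand-built stand-in for the first rows of trx_data.csv (values copied from the output above)
sample = pd.DataFrame({
    'customerId': [0, 1, 2, 3, 5],
    'products': ['20',
                 '2|2|23|68|68|111|29|86|107|152',
                 '111|107|29|11|11|11|33|23',
                 '164|227',
                 '2|2'],
})

# Split on the pipe character, then cast every token to int
sample['products'] = (
    sample['products']
    .str.split('|')
    .apply(lambda ids: [int(i) for i in ids])
)
```

Each cell of `sample['products']` is then a Python list of integers, which is the shape the split demos below expect.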


---

In [5]:
"""SPLIT DEMO 1, FROM LIST - note 'head' is only called for demonstration purpose"""

transactions.head().set_index('customerId')['products'].apply(pd.Series).reset_index()

Unnamed: 0,customerId,0,1,2,3,4,5,6,7,8,9
0,0,20.0,,,,,,,,,
1,1,2.0,2.0,23.0,68.0,68.0,111.0,29.0,86.0,107.0,152.0
2,2,111.0,107.0,29.0,11.0,11.0,11.0,33.0,23.0,,
3,3,164.0,227.0,,,,,,,,
4,5,2.0,2.0,,,,,,,,
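
The product ids show up as floats (`20.0`) because shorter lists are padded with `NaN`. As a possible alternative, the sketch below builds the wide frame directly from the list of lists and casts it to pandas' nullable `Int64` dtype so the ids stay integers. It assumes a pandas version that supports nullable integers; `sample` and `wide` are illustrative names, not from the original notebook.

```python
import pandas as pd

# Small illustrative frame with lists of different lengths
sample = pd.DataFrame({
    'customerId': [0, 1, 3],
    'products': [[20], [2, 2, 23], [164, 227]],
})

# Build all columns at once; rows with fewer items are padded with missing values
wide = pd.DataFrame(
    sample['products'].tolist(),
    index=pd.Index(sample['customerId'], name='customerId'),
)

# Nullable integers keep 20 as 20 (not 20.0) and mark the padding as <NA>
wide = wide.astype('Int64').reset_index()
```

Building the frame in one constructor call also avoids creating one `pd.Series` per row, which tends to matter on larger data.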


In [6]:
# explanation: why set_index (without it, customerId is lost and only the default int index remains)

transactions.head()['products'].apply(pd.Series)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,20.0,,,,,,,,,
1,2.0,2.0,23.0,68.0,68.0,111.0,29.0,86.0,107.0,152.0
2,111.0,107.0,29.0,11.0,11.0,11.0,33.0,23.0,,
3,164.0,227.0,,,,,,,,
4,2.0,2.0,,,,,,,,


In [7]:
# explanation: why reset_index (sets a new int index, customerId was only temporarily set as index but is preserved)

transactions.head().set_index('customerId')['products'].apply(pd.Series)

customerId,0,1,2,3,4,5,6,7,8,9
0,20.0,,,,,,,,,
1,2.0,2.0,23.0,68.0,68.0,111.0,29.0,86.0,107.0,152.0
2,111.0,107.0,29.0,11.0,11.0,11.0,33.0,23.0,,
3,164.0,227.0,,,,,,,,
5,2.0,2.0,,,,,,,,
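
The `set_index` / `reset_index` round trip is one way to carry `customerId` through the expansion. A rough sketch of another option, assuming the row index of the frame is left untouched, is to expand first and then glue the id column back on with `pd.concat`. The names `sample`, `via_index` and `via_concat` are made up for this comparison.

```python
import pandas as pd

sample = pd.DataFrame({
    'customerId': [0, 1, 5],
    'products': [[20], [2, 2, 23], [2, 2]],
})

# Variant used above: park customerId in the index, expand, then restore it as a column
via_index = sample.set_index('customerId')['products'].apply(pd.Series).reset_index()

# Alternative: expand on the default index and concatenate customerId column-wise.
# This relies on both pieces sharing the same row index.
expanded = sample['products'].apply(pd.Series)
via_concat = pd.concat([sample['customerId'], expanded], axis=1)
```

Both frames should hold the same columns and values; the index variant is the safer choice when the row index is not a clean default range.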


In [8]:
"SPLIT DEMO 2, FROM STRING: Split directly from str, no list transformation at the start"

trx = transactions.copy()

# transform list back to a comma-separated string, e.g. [2, 2, 23] -> '2, 2, 23'
trx['products'] = trx['products'].apply(lambda x: ', '.join(str(i) for i in x))

# split string with expand
trx.head().set_index('customerId')['products'].str.split(', ', expand=True).reset_index()

Unnamed: 0,customerId,0,1,2,3,4,5,6,7,8,9
0,0,20,,,,,,,,,
1,1,2,2,23,68,68,111,29,86,107,152
2,2,111,107,29,11,11,11,33,23,,
3,3,164,227,,,,,,,,
4,5,2,2,,,,,,,,


**Note:** If the categories are known in advance and you want to distribute values directly into one column per category, see the Starbucks capstone challenge repository (one-hot encoding of channels in data prep) or the 'split and dummy movies' notebook.
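
For orientation, here is a minimal sketch of what such a one-hot encoding could look like on the pipe-separated strings, using `Series.str.get_dummies`. This is not necessarily how the referenced repository or notebook does it, and the names `sample`, `dummies` and `encoded` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'customerId': [0, 1, 5],
    'products': ['20', '2|2|23', '2|2'],
})

# One indicator column per distinct product id: 1 if the id occurs in the string, else 0
dummies = sample['products'].str.get_dummies(sep='|')

# Attach the indicators to the customer ids
encoded = pd.concat([sample[['customerId']], dummies], axis=1)
```

Note that the indicator columns are labelled with the product ids as strings, and that repeated purchases collapse to a single `1`, so this encoding answers "bought or not" rather than "how often".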

---

In [9]:
"""MELT DEMO - note 'head' is only called for demonstration purpose"""

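# wide format: one row per customer, one column per list position
wide = transactions.head(2).set_index('customerId')['products'].apply(pd.Series).reset_index()

(
    pd.melt(wide, id_vars=['customerId'], value_name='products')
    .dropna()                                        # remove the NaN padding of shorter lists
    .drop(['variable'], axis=1)                      # the former list position is not needed
    .groupby(['customerId', 'products'])
    .agg({'products': 'count'})                      # purchases per customer and product
    .rename(columns={'products': 'purchase_count'})  # rename before reset_index to avoid a name clash
    .reset_index()
    .rename(columns={'products': 'productId'})
)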

Unnamed: 0,customerId,productId,purchase_count
0,0,20.0,1
1,1,2.0,2
2,1,23.0,1
3,1,29.0,1
4,1,68.0,2
5,1,86.0,1
6,1,107.0,1
7,1,111.0,1
8,1,152.0,1


---