# Create craft beer consumption data

This notebook creates a fake dataframe that relates user purchases to craft beers.

The output dataframe was constructed from two Kaggle datasets, which you can find here:

### [eCommerce purchase history from electronics store](https://www.kaggle.com/datasets/mkechinov/ecommerce-purchase-history-from-electronics-store)

### [Craft Beers Dataset](https://www.kaggle.com/datasets/nickhould/craft-cans/code)

In [1]:
import pandas as pd

In [2]:
beers = pd.read_csv("data/beers.csv", index_col="id")

del beers['Unnamed: 0']

beers.head()

id,abv,ibu,name,style,brewery_id,ounces
1436,0.05,,Pub Beer,American Pale Lager,408,12.0
2265,0.066,,Devil's Cup,American Pale Ale (APA),177,12.0
2264,0.071,,Rise of the Phoenix,American IPA,177,12.0
2263,0.09,,Sinister,American Double / Imperial IPA,177,12.0
2262,0.075,,Sex and Candy,American IPA,177,12.0


In [3]:
df = pd.read_csv("data/user-item-interactions.csv")

In [4]:
df.tail()

,event_time,order_id,product_id,category_id,category_code,brand,price,user_id
2633516,2020-11-21 10:10:01 UTC,2388440981134693942,1515966223526602848,2.268105e+18,electronics.smartphone,oppo,138.87,1.515916e+18
2633517,2020-11-21 10:10:13 UTC,2388440981134693943,1515966223509089282,2.268105e+18,electronics.smartphone,apple,418.96,1.515916e+18
2633518,2020-11-21 10:10:30 UTC,2388440981134693944,1515966223509089917,2.268105e+18,appliances.personal.scales,vitek,12.48,1.515916e+18
2633519,2020-11-21 10:10:30 UTC,2388440981134693944,2273948184839454837,2.268105e+18,,moulinex,41.64,1.515916e+18
2633520,2020-11-21 10:10:30 UTC,2388440981134693944,1515966223509127566,2.268105e+18,appliances.kitchen.blender,redmond,53.22,1.515916e+18


## Remove NA values

Unfortunately, rows with a missing user_id are not useful for our analysis, so we need to drop them.

In [5]:
df.dropna(subset=['user_id'], inplace=True)

df.user_id.isna().sum()

0
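As a minimal sketch of what this step does, the toy frame below (made-up values, not the real interactions file) shows that `dropna(subset=...)` removes only the rows where `user_id` is missing. The names `toy` and `cleaned` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the interactions table; the values are made up.
toy = pd.DataFrame({
    "product_id": [10, 11, 12, 13],
    "user_id": [1.0, np.nan, 2.0, np.nan],
})

# Drop only the rows whose user_id is missing.
cleaned = toy.dropna(subset=["user_id"])

rows_dropped = len(toy) - len(cleaned)
remaining_missing = int(cleaned["user_id"].isna().sum())
print(rows_dropped, remaining_missing)
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison; either form should give the same rows.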

## Transform product_id hashes into sequential ints

In [6]:
df['product_id'] = df['product_id'].rank(method='dense', ascending=False).astype(int)
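To make the dense ranking above concrete, here is a small sketch on made-up ids (the series `ids` is an assumption for illustration). It also shows `pd.factorize` as one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Made-up large ids standing in for the hashed product_id values.
ids = pd.Series([9000, 7000, 9000, 8000, 7000])

# Dense ranking gives equal ids the same rank and leaves no gaps.
# With ascending=False the largest id becomes 1.
dense_ids = ids.rank(method="dense", ascending=False).astype(int)
print(dense_ids.tolist())

# pd.factorize is an alternative: codes start at 0 and follow first appearance.
codes, uniques = pd.factorize(ids)
print(codes.tolist())
```

With dense ranking the new ids depend on the ordering of the original values, while `factorize` depends on the order in which the values appear in the data.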

In [7]:
# Keep only product ids lower than or equal to 2409, which are the ids we have in our
# beers dataset.


print(f"Dataset size before cutting {df.shape}")
print(f"Unique product_id values before cutting {df.product_id.unique().shape}")
df = df.loc[df["product_id"] <= 2409]


print(f"Dataset size after cutting {df.shape}")
print(f"Unique product_id values after cutting {df.product_id.unique().shape}")


Dataset size before cutting (564169, 8)
Unique product_id values before cutting (20964,)
Dataset size after cutting (28144, 8)
Unique product_id values after cutting (2409,)


We kept only about 5% of the rows (28,144 of 564,169) and about 11% of the unique products (2,409 of 20,964), what a pain...
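The share that survives the cut can be checked directly from the shapes printed above. This is a quick arithmetic sketch; the variable names are illustrative.

```python
# Row and unique-product counts copied from the printed output above.
rows_before, rows_after = 564169, 28144
products_before, products_after = 20964, 2409

row_share = rows_after / rows_before
product_share = products_after / products_before

print(f"rows kept: {row_share:.1%}")
print(f"products kept: {product_share:.1%}")
```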

## Transform user_id hashes into sequential ints

In [8]:
df['user_id'] = df['user_id'].rank(method='dense', ascending=False).astype(int)

In [9]:
df = df[["product_id", "order_id", "user_id"]]
df

,product_id,order_id,user_id
10028,2187,2310942933416673977,8147
15261,2059,2316136337087922698,8642
15947,2057,2316746918924911613,8563
16236,2133,2316899056305045560,9990
16866,1988,2317477536566608607,9437
...,...,...,...
2633493,1369,2388440981134693923,10046
2633494,388,2388440981134693923,10046
2633497,1082,2388440981134693926,4460
2633500,389,2388440981134693928,4290


## Store dataset

In [10]:
df.to_csv('data/user_interactions.csv')

In [11]:
pd.read_csv("data/user_interactions.csv").head()

,Unnamed: 0,product_id,order_id,user_id
0,10028,2187,2310942933416673977,8147
1,15261,2059,2316136337087922698,8642
2,15947,2057,2316746918924911613,8563
3,16236,2133,2316899056305045560,9990
4,16866,1988,2317477536566608607,9437
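The reloaded frame carries an extra `Unnamed: 0` column because `to_csv` writes the index by default. If that column is not wanted, one option is to pass `index=False` when writing. The sketch below uses a small made-up frame and an in-memory buffer instead of the real file, so `toy` and the buffer are assumptions for illustration.

```python
import io

import pandas as pd

# Small frame with a non-sequential index, like the filtered interactions.
toy = pd.DataFrame(
    {"product_id": [2187, 2059], "user_id": [8147, 8642]},
    index=[10028, 15261],
)

# Default to_csv writes the index as an unnamed first column,
# which read_csv then loads as "Unnamed: 0".
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())

# index=False skips the index on write, so only the real columns come back.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(without_index.columns.tolist())
```

If the original row labels matter, keeping the default and reading back with `index_col=0` would preserve them instead.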
