## Purchase correspondence (Chinese-English)

In this notebook, we'll try to find the English equivalent of each CS:GO purchase and associate it with a monetary value.

In [7]:
import pandas as pd

Import the main dataset (`output.csv`; the cell below reads the dated snapshot `2023-01-11-output.csv`) into `df_purchases`.

Let's drop the existing `datetime` column and recompute it from `timestamp`, to make sure we're working in UTC (the original datetime was created in the server's local time zone, which depends on where the server was located).

The two relevant columns here are `src`, which indicates the lootbox that was opened (different lootboxes can have different contents), and `out` (*outcome*), which is what the user got from the lootbox.

In [8]:
#df_purchases = pd.read_csv('output.csv')
df_purchases = pd.read_csv('2023-01-11-output.csv')
print("Head of original dataframe:")
display(df_purchases.head(5))

# Drop the 'time' column, because it's redundant with 'timestamp'
print(f"Are the 'timestamp' and 'time' columns equal? {df_purchases['time'].equals(df_purchases['timestamp'])}")
df_purchases.drop(columns='time', inplace=True)

# Drop the datetime column, and compute it using UTC time based on the timestamp.
df_purchases.drop(columns='datetime', inplace=True)
df_purchases['datetimeUTC'] = pd.to_datetime(df_purchases['timestamp'], unit='s') # Epoch seconds are UTC-based: the result is tz-naive, but its values represent UTC

# Rearrange so the datetimeUTC is the first column
cols = df_purchases.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_purchases = df_purchases[cols]
del cols

# Sort by purchase datetime
df_purchases.sort_values(by='timestamp', inplace=True)

# Reset index
df_purchases.reset_index(drop=True, inplace=True)

#df_purchases['datetime'] = pd.to_datetime(df_purchases['datetime'])
#df_purchases.set_index('datetimeUTC', inplace=True) # May not be a good idea to set the datetimeUTC as index, as it can contain duplicated values
print("Head of prepared dataframe:")
display(df_purchases.head(5))

Head of original dataframe:


Unnamed: 0,datetime,timestamp,user,src,out,time
0,2022-12-13 16:14:51,1670966091,AA***-4PGE,光谱 2 号武器箱,Tec-9 | 碎蛋白石,1670966091
1,2022-12-13 16:14:48,1670966088,AF***-K6NC,“头号特训”武器箱,SG 553 | 危险距离,1670966088
2,2022-12-13 16:14:24,1670966064,SR***-E8UJ,2022年里约热内卢锦标赛炙热沙城 II 纪念包,截短霰弹枪（纪念品） | 旱地之花,1670966064
3,2022-12-13 16:14:13,1670966053,AV***-2LQJ,梦魇武器箱,截短霰弹枪 | 灵应牌,1670966053
4,2022-12-13 16:14:12,1670966052,SQ***-MCPJ,反冲武器箱,截短霰弹枪 | 么么,1670966052


Are the 'timestamp' and 'time' columns equal? True
Head of prepared dataframe:


Unnamed: 0,datetimeUTC,timestamp,user,src,out
0,2022-12-13 20:37:31,1670963851,SQ***-MCPJ,2022年里约热内卢锦标赛炙热沙城 II 纪念包,P90（纪念品） | 沙漠 DDPAT
1,2022-12-13 20:51:17,1670964677,AQ***-7LRL,CS:GO 10周年印花胶囊,印花 | 给个闪 TT
2,2022-12-13 20:51:27,1670964687,SQ***-MAQC,反冲武器箱,UMP-45 | 路障
3,2022-12-13 20:51:50,1670964710,SQ***-MAQC,反冲武器箱,格洛克 18 型（StatTrak™） | 冬季战术
4,2022-12-13 20:52:04,1670964724,SC***-KRSG,棱彩2号武器箱,AUG | 汤姆猫
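
The timestamp-to-UTC step can be checked in isolation. The following is a minimal sketch, not part of the original notebook: it takes two values from the `timestamp` column shown above and converts them both ways. The variable names (`timestamps`, `naive_utc`, `aware_utc`) are made up for this illustration.

```python
import pandas as pd

# Two epoch timestamps (in seconds) taken from the `timestamp` column shown above
timestamps = pd.Series([1670963851, 1670964677])

# Unix epoch seconds count from 1970-01-01 00:00:00 UTC, so the result is
# tz-naive but its wall-clock values are UTC
naive_utc = pd.to_datetime(timestamps, unit="s")

# Passing utc=True attaches the UTC time zone explicitly
aware_utc = pd.to_datetime(timestamps, unit="s", utc=True)

print(naive_utc.iloc[0])
print(aware_utc.iloc[0])
```

Keeping the column tz-aware is a design choice: it avoids ambiguity if the data is later compared with timestamps in another time zone, at the cost of slightly noisier output.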


Importing the df with the English-Chinese correspondence for the outcomes (`df_weapons_skins_.pkl`). It will be called `df_out` since it will be used to find the contents of the `out` column in the `df_purchases`.

In [9]:
#df_weapons_skins.to_pickle("./df_weapons_skins_.pkl")
df_out = pd.read_pickle('df_weapons_skins_.pkl')
display(df_out.sample(5))

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
434,Regular Stickers,普通贴纸,Sneaky Beaky Like,三寸鸟七寸嘴,,https://wiki.cs.money/regular-stickers,,,[High Grade],[3],$ 0.33,,,https://wiki.cs.money/stickers/sticker-sneaky-...,[Community Sticker Capsule 1],[1 号社区印花胶囊],[https://wiki.cs.money/capsules/community-stic...,[https://wiki.cs.money/zh/capsules/community-s...,https://wiki.cs.money/zh/regular-stickers
208,Knives,刀,★ Stiletto Knife,短剑（★）,https://wiki.cs.money/_next/static/images/stil...,https://wiki.cs.money/weapons/stiletto-knife,Blue Steel,蓝钢,"[★ StatTrak™, Covert]","[4, 7]",$ 261.81 - $ 327.08,$ 195.51 - $ 208.55,,https://wiki.cs.money/weapons/stiletto-knife/b...,"[Danger Zone Case, Horizon Case]","[“头号特训”武器箱, 地平线武器箱]","[https://wiki.cs.money/cases/danger-zone-case,...",[https://wiki.cs.money/zh/cases/danger-zone-ca...,
1498,Heavy,重型武器,MAG-7,MAG-7,https://wiki.cs.money/_next/static/images/mag-...,https://wiki.cs.money/weapons/mag-7,Heat,炽热,"[StatTrak™, Restricted]","[nan, 4]",$ 0.33 - $ 2.51,$ 0.73 - $ 7.81,,https://wiki.cs.money/weapons/mag-7/heat,[The Chroma 2 Collection],[幻彩 2 号收藏品],[https://wiki.cs.money/collections/the-chroma-...,[https://wiki.cs.money/zh/collections/the-chro...,
1425,Tournament Stickers,大赛贴纸,Gambit Esports,Gambit Esports,,https://wiki.cs.money/tournament-stickers,Boston 2018,2018年波士顿锦标赛,"[High Grade, Remarkable, Exotic, Exotic]","[3, 4, 5, 5]",$ 0.89,,,https://wiki.cs.money/stickers/sticker-gambit-...,"[Boston 2018 Legends (Holo/Foil), Boston 2018 ...","[2018年波士顿锦标赛传奇（全息/闪亮）（含 100 Thieves）, 2018年波士顿...",[https://wiki.cs.money/capsules/boston-2018-le...,[https://wiki.cs.money/zh/capsules/boston-2018...,https://wiki.cs.money/zh/tournament-stickers
407,Regular Stickers,普通贴纸,Purple Jaggyfish,带刺鱼（紫色）,,https://wiki.cs.money/regular-stickers,,,[High Grade],[3],$ 0.40,,,https://wiki.cs.money/stickers/sticker-purple-...,[Riptide Surf Shop Sticker Collection],[“激流冲浪店”印花收藏品],[https://wiki.cs.money/capsules/riptide-surf-s...,[https://wiki.cs.money/zh/capsules/riptide-sur...,https://wiki.cs.money/zh/regular-stickers


### Parsing columns from df_purchases

The column `out` from the main dataframe `df_purchases` should be roughly equivalent to `Weapon_zh` and `Skin_Name_zh` from `df_out`. However, it also includes details like the grade and type (stattrak, souvenir) of the skin. We'll likely have to parse that information.

Let's try with one of the first purchases in `df_purchases` (the row at position 1):

In [10]:
df_purchases.iloc[1]

datetimeUTC    2022-12-13 20:51:17
timestamp               1670964677
user                    AQ***-7LRL
src                 CS:GO 10周年印花胶囊
out                    印花 | 给个闪 TT
Name: 1, dtype: object

We want to find *印花 | 给个闪 TT* in `df_out`. There appears to be a separator character (*|*) between the weapon name and the skin. Let's check whether this holds for the whole `out` column in the `df_purchases` dataframe.

In [11]:
print(df_purchases.shape)
print(df_purchases['out'].str.contains(r'\|').sum()) # "|" must be escaped: unescaped, it is the regex alternation operator and the pattern matches every string

(614508, 5)
614445
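
To see why the escaping matters, here is a small sketch on two values copied from the outputs in this notebook (the `sample` series is made up for the illustration). It compares the unescaped pattern, the escaped pattern, and a plain substring search with `regex=False`.

```python
import pandas as pd

# One value with a separator and one without, both copied from the data above
sample = pd.Series(["印花 | 给个闪 TT", "准将胸章"])

# Unescaped "|" means "empty pattern OR empty pattern", which matches anything
unescaped = sample.str.contains("|")

# Escaping it (in a raw string) searches for the literal character
escaped = sample.str.contains(r"\|")

# regex=False treats the pattern as a plain substring, so no escaping is needed
literal = sample.str.contains("|", regex=False)

print(unescaped.tolist(), escaped.tolist(), literal.tolist())
```

Using `regex=False` is arguably the least error-prone option when the pattern is a fixed string.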


In [12]:
# Rows without a vertical bar "|" in the 'out' column
display(df_purchases[~df_purchases['out'].str.contains(r'\|')])

Unnamed: 0,datetimeUTC,timestamp,user,src,out
17222,2022-12-14 12:41:02,1671021662,A8***-W4NG,系列 3 收藏胸章胶囊,准将胸章
21174,2022-12-14 16:05:10,1671033910,AR***-ESVJ,《半衰期：爱莉克斯》收藏胸章胶囊,联合军头盔胸章
22640,2022-12-14 17:15:45,1671038145,SK***-PVTQ,《半衰期：爱莉克斯》收藏胸章胶囊,生命值胸章
25051,2022-12-14 19:21:12,1671045672,A2***-74NJ,《半衰期：爱莉克斯》收藏胸章胶囊,λ 胸章
25680,2022-12-14 19:51:52,1671047512,SK***-PVTQ,《半衰期：爱莉克斯》收藏胸章胶囊,λ 胸章
...,...,...,...,...,...
557286,2023-01-09 13:33:07,1673271187,AC***-B8QL,系列 3 收藏胸章胶囊,守护者 3 号胸章
586094,2023-01-10 19:10:06,1673377806,AM***-YNEQ,棱彩2号武器箱,短剑（★）
600494,2023-01-11 08:58:00,1673427480,S7***-9GQN,系列 3 收藏胸章胶囊,运河水城胸章
602053,2023-01-11 10:16:18,1673432178,SW***-XHRC,系列 3 收藏胸章胶囊,守护者 3 号胸章


OK, so the overwhelming majority (99.99%) of `out` fields contain a vertical bar "|"; only 63 of the 614,508 rows do not. However, some cells seem to contain more than one *|*. We have to figure out how many there are and what that means.

In [13]:
# for instance, this one
display(df_purchases.iloc[61]['out'] )

'印花 | Brollan | 2022年安特卫普锦标赛'

In [14]:
counts = df_purchases['out'].str.count(r"\|")

print(df_purchases['out'].tail(5))
print(counts.value_counts())

614503    SSG 08 | 主机001
614504      加利尔 AR | 毁灭者
614505       G3SG1 | 净化者
614506       FN57 | 焰色反应
614507      PP-野牛 | 神秘碑文
Name: out, dtype: object
1    381383
2    233062
0        63
Name: out, dtype: int64


There are cells with 1, 2 or zero "|". Let's study some examples of each.

In [15]:
# Show 5 of each kind
print(df_purchases.loc[counts == 0, 'out'].head(5)) # Zero "|"
print(df_purchases.loc[counts == 1, 'out'].head(5)) # One "|"
print(df_purchases.loc[counts == 2, 'out'].head(5)) # Two "|"

#df_purchases.loc[counts == 0, 'out'] # Prints all occurrences of zero "|"
#df_purchases.loc[counts == 1, 'out'] # Prints all occurrences of one "|"
del counts

17222       准将胸章
21174    联合军头盔胸章
22640      生命值胸章
25051       λ 胸章
25680       λ 胸章
Name: out, dtype: object
0           P90（纪念品） | 沙漠 DDPAT
1                   印花 | 给个闪 TT
2                   UMP-45 | 路障
3    格洛克 18 型（StatTrak™） | 冬季战术
4                     AUG | 汤姆猫
Name: out, dtype: object
17    印花 | FaZe Clan（闪耀）| 2022年里约热内卢锦标赛
54           印花 | nafany | 2022年安特卫普锦标赛
56           印花 | interz | 2022年安特卫普锦标赛
60              印花 | arT | 2022年安特卫普锦标赛
61          印花 | Brollan | 2022年安特卫普锦标赛
Name: out, dtype: object


Let's split that cell into three columns: `out_1`, `out_2` and `out_3`. We still haven't figured out why they are divided that way.

In [16]:
df_purchases[['out_1', 'out_2', 'out_3']] =  df_purchases['out'].str.split("|", expand=True)
display(df_purchases.sample(5))

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1,out_2,out_3
429660,2023-01-04 09:42:25,1672825345,S9***-8VQE,裂空武器箱,内格夫 | 飞羽,内格夫,飞羽,
279464,2022-12-29 23:20:32,1672356032,SU***-YMGQ,命悬一线武器箱,FN57 | 焰色反应,FN57,焰色反应,
184323,2022-12-26 11:35:50,1672054550,AQ***-F6VN,2022年里约热内卢锦标赛竞争组印花胶囊,印花 | GamerLegion | 2022年里约热内卢锦标赛,印花,GamerLegion,2022年里约热内卢锦标赛
346196,2023-01-01 08:39:17,1672562357,ST***-9CPL,2022年里约热内卢锦标赛竞争组印花胶囊,印花 | FURIA | 2022年里约热内卢锦标赛,印花,FURIA,2022年里约热内卢锦标赛
142472,2022-12-24 22:08:51,1671919731,AU***-NRBN,2022年安特卫普锦标赛传奇组印花胶囊,印花 | BIG | 2022年安特卫普锦标赛,印花,BIG,2022年安特卫普锦标赛
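
The split can be tried on a handful of values first. This is a sketch on three strings copied from the outputs above, with an extra whitespace-stripping step that is an addition of this example (the notebook itself strips later, when extracting the parentheses). The names `out` and `parts` are made up for the illustration.

```python
import pandas as pd

# One example for each separator count (one, two, and zero "|")
out = pd.Series([
    "P90（纪念品） | 沙漠 DDPAT",
    "印花 | Brollan | 2022年安特卫普锦标赛",
    "准将胸章",
])

# A single-character pattern is treated as a literal by str.split,
# so "|" does not need escaping here
parts = out.str.split("|", expand=True)

# The separator is " | ", so each piece keeps a leading or trailing space
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["out_1", "out_2", "out_3"]

print(parts)
```

Rows with fewer than two separators get missing values in the unused columns, which is why a later cell calls `fillna('')`.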


There are also cases where the cell string contains parentheses. For instance:

`2023-01-05 22:33:51    格洛克 18 型（StatTrak™） | 冬季战术`

Let's see whether this applies to all three `out_*` columns or just the first one. Note that this is not the ASCII parenthesis "(" but the full-width character '（'.

In [17]:
counts = df_purchases['out_1'].str.count("（") # the full-width "（" is not a regex metacharacter, so it needs no escaping
print(counts.value_counts())
print(df_purchases.loc[counts == 1, 'out_1'].iloc[0]) # Example of `out_1` with one "（"

counts = df_purchases['out_2'].str.count("（")
print(counts.value_counts())
print(df_purchases.loc[counts == 2, 'out_2'].iloc[0]) # A couple of records have two pairs of parentheses. We'll have to take this into account.

counts = df_purchases['out_3'].str.count("（")
print(counts.value_counts())
print(counts.value_counts())
del counts

0    559025
1     55483
Name: out_1, dtype: int64
P90（纪念品） 
0.0    547408
1.0     67035
2.0         2
Name: out_2, dtype: int64
 魔性探员 特别空勤团（SAS）（全息）
0.0    233062
Name: out_3, dtype: int64


So `out_1` and `out_2` can contain these parentheses, and two records (the same item) contain two pairs of them. In that case, the relevant pair seems to be the last one. Let's extract those values and put them in other columns (`out_1_par` and `out_2_par`). Note: some strings (at least in `out_2`) end in an invisible character (most likely the space left over from the ` | ` separator, or possibly a tab), hence the `.strip()`.

In [18]:
# In Python 3 every str is Unicode, so the u prefix is optional; raw strings (r'...') are used for the regex patterns. e.g. 截短霰弹枪（纪念品）
df_purchases['out_1_par'] = df_purchases['out_1'].str.strip().str.extract(r'（(.*?)）')
df_purchases['out_2_par'] = df_purchases['out_2'].str.strip().str.extract(r'（([^（）]*)）$')

# Also, let's put the part with no parentheses in another column
df_purchases['out_1_nopar'] = df_purchases['out_1'].str.replace(r'（[^（）]*）(?=[^（）]*$)', '', regex=True)
df_purchases['out_2_nopar'] = df_purchases['out_2'].str.replace(r'（[^（）]*）(?=[^（）]*$)', '', regex=True)

display(df_purchases.head(10))

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1,out_2,out_3,out_1_par,out_2_par,out_1_nopar,out_2_nopar
0,2022-12-13 20:37:31,1670963851,SQ***-MCPJ,2022年里约热内卢锦标赛炙热沙城 II 纪念包,P90（纪念品） | 沙漠 DDPAT,P90（纪念品）,沙漠 DDPAT,,纪念品,,P90,沙漠 DDPAT
1,2022-12-13 20:51:17,1670964677,AQ***-7LRL,CS:GO 10周年印花胶囊,印花 | 给个闪 TT,印花,给个闪 TT,,,,印花,给个闪 TT
2,2022-12-13 20:51:27,1670964687,SQ***-MAQC,反冲武器箱,UMP-45 | 路障,UMP-45,路障,,,,UMP-45,路障
3,2022-12-13 20:51:50,1670964710,SQ***-MAQC,反冲武器箱,格洛克 18 型（StatTrak™） | 冬季战术,格洛克 18 型（StatTrak™）,冬季战术,,StatTrak™,,格洛克 18 型,冬季战术
4,2022-12-13 20:52:04,1670964724,SC***-KRSG,棱彩2号武器箱,AUG | 汤姆猫,AUG,汤姆猫,,,,AUG,汤姆猫
5,2022-12-13 20:52:30,1670964750,AV***-QDXL,梦魇武器箱,P2000 | 升天,P2000,升天,,,,P2000,升天
6,2022-12-13 20:52:40,1670964760,AC***-69RE,梦魇武器箱,MAG-7 | 先见之明,MAG-7,先见之明,,,,MAG-7,先见之明
7,2022-12-13 20:52:44,1670964764,AV***-QDXL,梦魇武器箱,MP7 | 幽幻深渊,MP7,幽幻深渊,,,,MP7,幽幻深渊
8,2022-12-13 20:52:54,1670964774,AV***-QDXL,梦魇武器箱,G3SG1 | 梦之林地,G3SG1,梦之林地,,,,G3SG1,梦之林地
9,2022-12-13 20:53:01,1670964781,AC***-69RE,梦魇武器箱,MAG-7 | 先见之明,MAG-7,先见之明,,,,MAG-7,先见之明


Let's check the case with two pairs of parentheses, just in case:

In [19]:
boolean_mask = df_purchases['out_2'] == u' 魔性探员 特别空勤团（SAS）（全息）' # to find the index of that specific value
display(df_purchases.loc[boolean_mask])
del boolean_mask

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1,out_2,out_3,out_1_par,out_2_par,out_1_nopar,out_2_nopar
242196,2022-12-28 14:22:48,1672237368,A8***-7SPC,魔性探员胶囊,印花 | 魔性探员 特别空勤团（SAS）（全息）,印花,魔性探员 特别空勤团（SAS）（全息）,,,全息,印花,魔性探员 特别空勤团（SAS）
596059,2023-01-11 04:47:27,1673412447,AT***-ZZFA,魔性探员胶囊,印花 | 魔性探员 特别空勤团（SAS）（全息）,印花,魔性探员 特别空勤团（SAS）（全息）,,,全息,印花,魔性探员 特别空勤团（SAS）
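
The "last pair of parentheses" logic can be verified on a few strings. This is a simplified sketch using values from the outputs above; it anchors both patterns at the end of the string, which assumes the strings have already been stripped. The names `names`, `last_par` and `no_par` are made up for the illustration.

```python
import pandas as pd

# Two pairs of parentheses, one pair, and none (values copied from the data above)
names = pd.Series([
    "魔性探员 特别空勤团（SAS）（全息）",
    "格洛克 18 型（StatTrak™）",
    "沙漠 DDPAT",
])

# Capture the content of the last full-width parenthesis pair: the character
# class forbids nested parentheses and "$" pins the match to the end
last_par = names.str.extract(r"（([^（）]*)）$", expand=False)

# Remove only that trailing pair, leaving any earlier pair untouched
no_par = names.str.replace(r"（[^（）]*）$", "", regex=True)

print(last_par.tolist())
print(no_par.tolist())
```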


So far so good. Now we observe that some of these new columns (especially `out_2_par`) contain two (or more?) values separated by a full-width comma (，). For instance: *qikert（闪耀，冠军）→ 闪耀，冠军*. Let's see which columns are affected and split those too.

In [20]:
counts = df_purchases['out_1'].str.count("，")
print(counts.value_counts()) # 0

counts = df_purchases['out_1_par'].str.count("，")
print(counts.value_counts()) # 0

counts = df_purchases['out_2'].str.count("，")
print(counts.value_counts()) # 4589 rows contain one comma

counts = df_purchases['out_2_par'].str.count("，")
print(counts.value_counts()) # 4574 rows contain one comma

counts = df_purchases['out_3'].str.count("，")
print(counts.value_counts()) # 0

del counts

# Apparently some rows (15) have a comma in out_2 that is not present in out_2_par. Let's find which ones.
df_purchases['out_2_par'] = df_purchases['out_2_par'].fillna('')
mask = df_purchases['out_2'].str.contains("，") & ~df_purchases['out_2_par'].str.contains("，")
display(df_purchases[mask])
del mask
# These are cases like "Laura Shigihara - 好好干，好好活" or "悄悄进村，打枪不要（全息）", where the comma is part of the item name. We don't have to split or do anything with them.

0    614508
Name: out_1, dtype: int64
0.0    55483
Name: out_1_par, dtype: int64
0.0    609856
1.0      4589
Name: out_2, dtype: int64
0.0    62463
1.0     4574
Name: out_2_par, dtype: int64
0.0    233062
Name: out_3, dtype: int64


Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1,out_2,out_3,out_1_par,out_2_par,out_1_nopar,out_2_nopar
146627,2022-12-25 02:04:40,1671933880,S5***-J9VJ,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
146645,2022-12-25 02:04:52,1671933892,S5***-J9VJ,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
268072,2022-12-29 13:18:07,1672319887,SM***-LATJ,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
271271,2022-12-29 16:01:32,1672329692,AH***-3PPG,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
281587,2022-12-30 01:11:42,1672362702,S4***-UJEE,战术大师音乐盒集,音乐盒 | Laura Shigihara - 好好干，好好活,音乐盒,Laura Shigihara - 好好干，好好活,,,,音乐盒,Laura Shigihara - 好好干，好好活
360035,2023-01-01 20:23:54,1672604634,SU***-NJXL,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
360875,2023-01-01 21:05:02,1672607102,SU***-NJXL,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
360928,2023-01-01 21:06:01,1672607161,SU***-NJXL,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要
371826,2023-01-02 06:51:43,1672642303,SG***-62UL,战术大师音乐盒集,音乐盒 | Laura Shigihara - 好好干，好好活,音乐盒,Laura Shigihara - 好好干，好好活,,,,音乐盒,Laura Shigihara - 好好干，好好活
372707,2023-01-02 07:34:52,1672644892,SC***-NNXQ,2021社区印花胶囊,印花 | 悄悄进村，打枪不要（全息）,印花,悄悄进村，打枪不要（全息）,,,全息,印花,悄悄进村，打枪不要


So only `out_2_par` contains commas that need to be split.

In [21]:
df_purchases[['out_2_par_1', 'out_2_par_2']] =  df_purchases['out_2_par'].str.split("，", expand=True)
display(df_purchases.sample(5))

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1,out_2,out_3,out_1_par,out_2_par,out_1_nopar,out_2_nopar,out_2_par_1,out_2_par_2
68859,2022-12-22 05:43:25,1671687805,S4***-NSUQ,2022年里约热内卢锦标赛传奇组亲笔签名胶囊,印花 | Snappi | 2022年里约热内卢锦标赛,印花,Snappi,2022年里约热内卢锦标赛,,,印花,Snappi,,
612359,2023-01-11 19:15:30,1673464530,SU***-LLUJ,2022年里约热内卢锦标赛竞争组印花胶囊,印花 | 00 Nation（金色）| 2022年里约热内卢锦标赛,印花,00 Nation（金色）,2022年里约热内卢锦标赛,,金色,印花,00 Nation,金色,
528180,2023-01-08 12:11:37,1673179897,SM***-7JRA,裂空武器箱,SSG 08 | 主机001,SSG 08,主机001,,,,SSG 08,主机001,,
383434,2023-01-02 17:16:52,1672679812,SV***-7NUE,梦魇武器箱,截短霰弹枪 | 灵应牌,截短霰弹枪,灵应牌,,,,截短霰弹枪,灵应牌,,
69597,2022-12-22 06:32:56,1671690776,AN***-J7TL,命悬一线武器箱,XM1014 | 锈蚀烈焰,XM1014,锈蚀烈焰,,,,XM1014,锈蚀烈焰,,
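
To illustrate why the split is applied to the parenthesised part rather than to `out_2` itself, here is a sketch on three values seen above. It assumes already-stripped strings; the names `out_2`, `par` and `par_parts` are local to this example.

```python
import pandas as pd

# A comma inside the parentheses, a comma in the item name, and a comma with no parentheses
out_2 = pd.Series([
    "qikert（闪耀，冠军）",
    "悄悄进村，打枪不要（全息）",
    "Laura Shigihara - 好好干，好好活",
])

# Extract the trailing parenthesised part first, then split only that part
par = out_2.str.extract(r"（([^（）]*)）$", expand=False).fillna("")
par_parts = par.str.split("，", expand=True)

print(par_parts)
```

Only the first row yields a second part; the commas that belong to the item names are left alone because they never reach the split.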


It seems that we are done parsing the columns. Let's rearrange the dataframe and drop the redundant info:

In [22]:
df_purchases = df_purchases[['datetimeUTC', 'timestamp', 'user', 'src', 'out', 'out_1_nopar', 'out_1_par', 'out_2_nopar', 'out_2_par_1', 'out_2_par_2', 'out_3']]
df_purchases = df_purchases.fillna('')
display(df_purchases.sample(15))

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
374283,2023-01-02 08:56:59,1672649819,SY***-6APE,命悬一线武器箱,FN57 | 焰色反应,FN57,,焰色反应,,,
384788,2023-01-02 18:18:50,1672683530,SY***-BYVL,命悬一线武器箱,SG 553 | 阿罗哈,SG 553,,阿罗哈,,,
369456,2023-01-02 04:44:26,1672634666,A8***-YADA,2022年里约热内卢锦标赛挑战组印花胶囊,印花 | Bad News Eagles（闪耀）| 2022年里约热内卢锦标赛,印花,,Bad News Eagles,闪耀,,2022年里约热内卢锦标赛
424010,2023-01-04 04:56:05,1672808165,AM***-YAVL,2020 RMR 传奇组战队胶囊,印花 | Natus Vincere（全息）| 2020 RMR,印花,,Natus Vincere,全息,,2020 RMR
32102,2022-12-20 20:22:38,1671567758,SB***-XLFN,光谱 2 号武器箱,M4A1 消音型（StatTrak™） | 破碎铅秋,M4A1 消音型,StatTrak™,破碎铅秋,,,
409437,2023-01-03 16:07:20,1672762040,AN***-JWFA,地平线武器箱,P90（StatTrak™） | 牵引力,P90,StatTrak™,牵引力,,,
186646,2022-12-26 13:38:00,1672061880,AT***-W4BL,反冲武器箱,MAC-10 | 萌猴迷彩,MAC-10,,萌猴迷彩,,,
475429,2023-01-06 01:46:57,1672969617,A9***-HYNN,反冲武器箱,格洛克 18 型 | 冬季战术,格洛克 18 型,,冬季战术,,,
31601,2022-12-20 20:01:27,1671566487,S9***-D6UN,2022年里约热内卢锦标赛竞争组亲笔签名胶囊,印花 | ANNIHILATION | 2022年里约热内卢锦标赛,印花,,ANNIHILATION,,,2022年里约热内卢锦标赛
282688,2022-12-30 02:06:02,1672365962,AR***-F8TC,2022年里约热内卢锦标赛挑战组亲笔签名胶囊,印花 | Ax1Le | 2022年里约热内卢锦标赛,印花,,Ax1Le,,,2022年里约热内卢锦标赛


The next step is to figure out what exactly is each column, and match them with the English translation.

### Matching purchases with scraped data from https://wiki.cs.money/

Let's practice with a sample row

In [23]:
display(df_purchases.iloc[[0]])

Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
0,2022-12-13 20:37:31,1670963851,SQ***-MCPJ,2022年里约热内卢锦标赛炙热沙城 II 纪念包,P90（纪念品） | 沙漠 DDPAT,P90,纪念品,沙漠 DDPAT,,,


#### `out` column

`out_1_nopar` seems to be the weapon name.
`out_2_nopar` seems to be the skin name.

Let's see if we can find P90 in `df_out`:

In [24]:
search = df_out[df_out['Weapon_zh'].str.contains("P90")]
display(search)
print(len(search))
del search

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
1222,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Emerald Dragon,翡翠之龙,"[StatTrak™, Classified]","[nan, 5]",$ 47.63 - $ 335.76,$ 137.12 - $ 1 046.00,,https://wiki.cs.money/weapons/p90/emerald-dragon,[The Bravo Collection],[英勇收藏品],[https://wiki.cs.money/collections/the-bravo-c...,[https://wiki.cs.money/zh/collections/the-brav...,
1223,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Run and Hide,豹走,"[Souvenir, Classified]","[nan, 5]",$ 142.49 - $ 310.05,,$ 14.52 - $ 35.73,https://wiki.cs.money/weapons/p90/run-and-hide,[The Ancient Collection],[远古收藏品],[https://wiki.cs.money/collections/the-ancient...,[https://wiki.cs.money/zh/collections/the-anci...,
1224,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Astral Jörmungandr,星辰巨蟒,"[Normal, Restricted]","[1, 4]",$ 186.39 - $ 225.83,,,https://wiki.cs.money/weapons/p90/astral-jormu...,[The Norse Collection],[挪威人收藏品],[https://wiki.cs.money/collections/the-norse-c...,[https://wiki.cs.money/zh/collections/the-nors...,
1225,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Fallout Warning,辐射警告,"[Souvenir, Industrial Grade]","[nan, 2]",$ 2.58 - $ 73.24,,$ 2.26 - $ 123.45,https://wiki.cs.money/weapons/p90/fallout-warning,[The Nuke Collection],[核子危机收藏品],[https://wiki.cs.money/collections/the-nuke-co...,[https://wiki.cs.money/zh/collections/the-nuke...,
1226,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Death by Kitty,喵之萌杀,"[StatTrak™, Covert]","[nan, 6]",$ 32.75 - $ 60.93,$ 92.69 - $ 265.99,,https://wiki.cs.money/weapons/p90/death-by-kitty,[The eSports 2013 Collection],[电竞 2013 收藏品],[https://wiki.cs.money/collections/the-esports...,[https://wiki.cs.money/zh/collections/the-espo...,
1227,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Cold Blooded,冷血杀手,"[StatTrak™, Classified]","[nan, 5]",$ 44.67 - $ 60.30,$ 55.85 - $ 83.41,,https://wiki.cs.money/weapons/p90/cold-blooded,[The Arms Deal 2 Collection],[军火交易 2 收藏品],[https://wiki.cs.money/collections/the-arms-de...,[https://wiki.cs.money/zh/collections/the-arms...,
1228,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Asiimov,二西莫夫,"[StatTrak™, Covert]","[nan, 6]",$ 5.16 - $ 40.28,$ 17.20 - $ 144.55,,https://wiki.cs.money/weapons/p90/asiimov,[The Breakout Collection],[突围收藏品],[https://wiki.cs.money/collections/the-breakou...,[https://wiki.cs.money/zh/collections/the-brea...,
1229,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Baroque Red,巴洛克之红,"[Normal, Mil-Spec]","[1, 3]",$ 9.16 - $ 19.99,,,https://wiki.cs.money/weapons/p90/baroque-red,[The Canals Collection],[运河水城收藏品],[https://wiki.cs.money/collections/the-canals-...,[https://wiki.cs.money/zh/collections/the-cana...,
1230,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Glacier Mesh,冰川网格,"[Souvenir, Mil-Spec]","[nan, 3]",$ 2.35 - $ 16.40,,$ 1.14 - $ 110.15,https://wiki.cs.money/weapons/p90/glacier-mesh,[The Vertigo Collection],[殒命大厦收藏品],[https://wiki.cs.money/collections/the-vertigo...,[https://wiki.cs.money/zh/collections/the-vert...,
1231,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Sunset Lily,日落百合,"[Normal, Industrial Grade]","[1, 2]",$ 10.33 - $ 14.08,,,https://wiki.cs.money/weapons/p90/sunset-lily,[The St. Marc Collection],[圣马克镇收藏品],[https://wiki.cs.money/collections/the-st-marc...,[https://wiki.cs.money/zh/collections/the-st-m...,


41


Indeed, it is the name of a weapon. That particular one has 39 different skins (plus two other entries that contain that name). Now let's find out what `out_2_nopar` (沙漠 DDPAT) corresponds to:

In [25]:
search = df_out[df_out['Skin_Name_zh'].str.contains("沙漠 DDPAT")]
display(search)
del search

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
1259,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Desert DDPAT,沙漠 DDPAT,"[Souvenir, Consumer Grade]","[nan, 1]",$ 0.08 - $ 0.15,,$ 0.03,https://wiki.cs.money/weapons/p90/desert-ddpat,[The 2021 Dust 2 Collection],[2021 炙热沙城 II 收藏品],[https://wiki.cs.money/collections/the-2021-du...,[https://wiki.cs.money/zh/collections/the-2021...,


OK, it found one particular skin. We can see that it can come in different grades, which determine the value. When a purchase carries no grade marker, we can assume it's the basic version (no StatTrak™, no Souvenir). Note, however, that this sample purchase is marked 纪念品 (Souvenir) in `out_1_par`, so `Value_Souvenir` ($0.03) may be the more relevant figure here. The skin has a rarity level of 1, and the market value of the basic version ranges from $0.08 to $0.15. At some point we'll have to determine which value to use.
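
The lookup and the value question can be sketched together on toy data. The following is only an illustration under stated assumptions, not the notebook's method: `purchases` and `skins` are small stand-ins for `df_purchases` and `df_out` (values copied from the rows above), `parse_value_range` is a hypothetical helper, and choosing `Value_Souvenir` whenever `out_1_par` is 纪念品 is an assumption. The keys are stripped before merging because the parsed columns may still carry the spaces around the ` | ` separator.

```python
import re
import pandas as pd

def parse_value_range(value):
    """Hypothetical helper: '$ 0.08 - $ 0.15' -> (0.08, 0.15); missing -> (None, None)."""
    if not isinstance(value, str) or not value.strip():
        return (None, None)
    # Drop "$" and any whitespace, including the space used as thousands separator
    amounts = [float(re.sub(r"[\s$]", "", part)) for part in value.split("-")]
    return (min(amounts), max(amounts))

# Toy stand-in for df_purchases (note the stray spaces left by the split)
purchases = pd.DataFrame({
    "out_1_nopar": ["P90 "],
    "out_1_par": ["纪念品"],
    "out_2_nopar": [" 沙漠 DDPAT"],
})

# Toy stand-in for df_out
skins = pd.DataFrame({
    "Weapon_zh": ["P90", "P90"],
    "Skin_Name_zh": ["沙漠 DDPAT", "二西莫夫"],
    "Skin_Name": ["Desert DDPAT", "Asiimov"],
    "Value": ["$ 0.08 - $ 0.15", "$ 5.16 - $ 40.28"],
    "Value_Souvenir": ["$ 0.03", None],
})

# Strip the keys, then match on (weapon, skin)
for key in ["out_1_nopar", "out_2_nopar"]:
    purchases[key] = purchases[key].str.strip()
matched = purchases.merge(
    skins,
    how="left",
    left_on=["out_1_nopar", "out_2_nopar"],
    right_on=["Weapon_zh", "Skin_Name_zh"],
)

# Assumption: Souvenir purchases are priced from Value_Souvenir, the rest from Value
row = matched.iloc[0]
value_column = "Value_Souvenir" if row["out_1_par"] == "纪念品" else "Value"
low, high = parse_value_range(row[value_column])
print(row["Skin_Name"], value_column, low, high)
```

A left merge keeps purchases that find no match, so unmatched rows can be inspected afterwards instead of silently disappearing.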

Let's explore the other fields we found in `df_purchases`.

In [26]:
print("Records without out_1_nopar:", len(df_purchases.query("out_1_nopar == ''")))
print("Records without out_1_par:", len(df_purchases.query("out_1_par == ''")))
print("Records without out_2_nopar:", len(df_purchases.query("out_2_nopar == ''")))
print("Records without out_2_par_1:", len(df_purchases.query("out_2_par_1 == ''")))
print("Records without out_2_par_2:", len(df_purchases.query("out_2_par_2 == ''")))
print("Records with out_2_par_1 and not our_2_par_2:", len(df_purchases.query("out_2_par_1 != '' and out_2_par_2 == ''")))
print("Records without out_2_par_1 but with our_2_par_2:", len(df_purchases.query("out_2_par_1 == '' and out_2_par_2 != ''")))
print("Records without out_3:", len(df_purchases.query("out_3 == ''")))
print("Records without out_2_nopar but with out_3:", len(df_purchases.query("out_2_nopar == '' and out_3 != ''")))
print("Records without out_2_par_1 but with out_3:", len(df_purchases.query("out_2_par_1 == '' and out_3 != ''")))
print("Records with out1_par and out_3:", len(df_purchases.query("out_1_par != '' and out_3 != ''")))
print("Records with out1_par and out_2_par_2:", len(df_purchases.query("out_1_par != '' and out_2_par_2 != ''")))
print("Records with out_2_nopar, out_2_par_1, out_2_par_2 and out_3:", len(df_purchases.query("out_2_nopar != '' and out_2_par_1 != '' and out_2_par_2 != '' and out_3 != ''")))
print("Records with out_1_nopar, out_2_nopar, out_2_par_1, out_2_par_2 and out_3:", len(df_purchases.query("out_1_nopar != '' and out_2_nopar != '' and out_2_par_1 != '' and out_2_par_2 != '' and out_3 != ''")))
print("Records with out_1_nopar, out_1_par, out_2_nopar, out_2_par_1, out_2_par_2 and out_3:", len(df_purchases.query("out_1_nopar != '' and out_1_par != '' and out_2_nopar != '' and out_2_par_1 != '' and out_2_par_2 != '' and out_3 != ''")))
print("Records with out_2_nopar, out_2_par_1 and out_2_par_2 but no out_3:", len(df_purchases.query("out_2_nopar != '' and out_2_par_1 != '' and out_2_par_2 != '' and out_3 == ''")))
print("Records with out_2_par_2 but no out_3:", len(df_purchases.query("out_2_par_2 != '' and out_3 == ''")))
print("Records with out_3 but no out_2_par_2:", len(df_purchases.query("out_2_par_2 == '' and out_3 != ''")))
print("Records with out_2_par_1 but no out_3:", len(df_purchases.query("out_2_par_1 != '' and out_3 == ''")))
print("Records with out_1_par and out_2_nopar:", len(df_purchases.query("out_1_par != '' and (out_2_nopar != '')")))

# Print some example rows
display(df_purchases.query("out_1_nopar != '' and out_1_par != ''").sample(5))
display(df_purchases.query("out_2_par_1 != '' and out_3 == ''").sample(5))
display(df_purchases.query("out_1_nopar != '' and out_2_nopar != '' and out_2_par_1 != '' and out_2_par_2 != '' and out_3 != ''").sample(5))

Records without out_1_nopar: 0
Records without out_1_par: 559025
Records without out_2_nopar: 63
Records without out_2_par_1: 547471
Records without out_2_par_2: 609934
Records with out_2_par_1 and not our_2_par_2: 62463
Records without out_2_par_1 but with our_2_par_2: 0
Records without out_3: 381446
Records without out_2_nopar but with out_3: 0
Records without out_2_par_1 but with out_3: 168361
Records with out1_par and out_3: 0
Records with out1_par and out_2_par_2: 0
Records with out_2_nopar, out_2_par_1, out_2_par_2 and out_3: 4574
Records with out_1_nopar, out_2_nopar, out_2_par_1, out_2_par_2 and out_3: 4574
Records with out_1_nopar, out_1_par, out_2_nopar, out_2_par_1, out_2_par_2 and out_3: 0
Records with out_2_nopar, out_2_par_1 and out_2_par_2 but no out_3: 0
Records with out_2_par_2 but no out_3: 0
Records with out_3 but no out_2_par_2: 228488
Records with out_2_par_1 but no out_3: 2336
Records with out_1_par and out_2_nopar: 55466


Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
417127,2023-01-03 22:50:42,1672786242,SU***-9JHA,裂空武器箱,P90（StatTrak™） | 集装箱,P90,StatTrak™,集装箱,,,
603164,2023-01-11 11:18:38,1673435918,SV***-QXXJ,梦魇武器箱,P2000（StatTrak™） | 升天,P2000,StatTrak™,升天,,,
476233,2023-01-06 02:27:37,1672972057,SZ***-98TG,“突围大行动”武器箱,格洛克 18 型（StatTrak™） | 水灵,格洛克 18 型,StatTrak™,水灵,,,
232208,2022-12-28 05:48:43,1672206523,AU***-WYGN,2022年里约热内卢锦标赛炙热沙城 II 纪念包,新星（纪念品） | 流沙,新星,纪念品,流沙,,,
341557,2023-01-01 04:35:16,1672547716,ST***-SZGN,梦魇武器箱,SCAR-20（StatTrak™） | 暗夜活死鸡,SCAR-20,StatTrak™,暗夜活死鸡,,,


Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
502427,2023-01-07 14:08:13,1673100493,AD***-QCWE,CS:GO 10周年印花胶囊,印花 | 蓝钻（闪耀）,印花,,蓝钻,闪耀,,
214039,2022-12-27 13:56:34,1672149394,A3***-4JNC,CS:GO 10周年印花胶囊,印花 | 我很好（人质）,印花,,我很好,人质,,
464725,2023-01-05 16:05:30,1672934730,A8***-7LQJ,CS:GO 10周年印花胶囊,印花 | 宙斯的认可（全息）,印花,,宙斯的认可,全息,,
295426,2022-12-30 12:52:01,1672404721,SR***-4KNN,点亮中国 2 号印花胶囊,印花 | 年年有鱼（全息）,印花,,年年有鱼,全息,,
36336,2022-12-21 01:00:08,1671584408,SJ***-XTDJ,CS:GO 10周年印花胶囊,印花 | 宙斯的认可（全息）,印花,,宙斯的认可,全息,,


Unnamed: 0,datetimeUTC,timestamp,user,src,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
612172,2023-01-11 19:05:31,1673463931,AT***-CJVJ,2022年里约热内卢锦标赛冠军亲笔签名胶囊,印花 | FL1T（闪耀，冠军）| 2022年里约热内卢锦标赛,印花,,FL1T,闪耀,冠军,2022年里约热内卢锦标赛
201153,2022-12-27 03:00:48,1672110048,SP***-U6GQ,2022年里约热内卢锦标赛冠军亲笔签名胶囊,印花 | Jame（全息，冠军）| 2022年里约热内卢锦标赛,印花,,Jame,全息,冠军,2022年里约热内卢锦标赛
572894,2023-01-10 03:39:18,1673321958,SA***-BWUJ,2022年里约热内卢锦标赛冠军亲笔签名胶囊,印花 | qikert（闪耀，冠军）| 2022年里约热内卢锦标赛,印花,,qikert,闪耀,冠军,2022年里约热内卢锦标赛
410010,2023-01-03 16:37:47,1672763867,SU***-5ASA,2022年里约热内卢锦标赛冠军亲笔签名胶囊,印花 | n0rb3r7（闪耀，冠军）| 2022年里约热内卢锦标赛,印花,,n0rb3r7,闪耀,冠军,2022年里约热内卢锦标赛
274372,2022-12-29 18:34:58,1672338898,AN***-BJYJ,2022年里约热内卢锦标赛冠军亲笔签名胶囊,印花 | qikert（闪耀，冠军）| 2022年里约热内卢锦标赛,印花,,qikert,闪耀,冠军,2022年里约热内卢锦标赛


A few observations about the data:
* The field `out_1_nopar` (weapon name) seems to be compulsory (it's always present), and `out_2_nopar` (skin name) seems to be almost compulsory (present in 99.99% of cases; only 63 rows lack it).
* `out_2_par_2` depends on `out_2_par_1` (obviously, otherwise it wouldn't have been parsed).
* `out_3` depends on `out_2_nopar` (skin name), for the same reason.
* `out_1_par` and `out_3` never appear together.
* `out_1_par` and `out_2_par_2` never appear together.
* No row has all fields, but 4,574 rows have `out_1_nopar`, `out_2_nopar`, `out_2_par_1`, `out_2_par_2` and `out_3`.
* `out_2_par_2` depends on `out_3`, but `out_3` can be there without `out_2_par_2`.
* `out_2_par_1` does not strictly depend on `out_3`, but they usually appear together (only 2,336 of the 67,037 rows with `out_2_par_1` have no `out_3`).
* If `out_1_par` is there, `out_2_par_2` and `out_3` will be empty; presumably `out_2_par_1` too, although that combination was not counted above.
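
Pairwise checks like the ones above can also be produced in one go with a co-occurrence matrix. This is a sketch on a made-up four-row frame (`toy`), not on the real data; applied to `df_purchases` after the `fillna('')` step, the same lines would show every pair at once, including the `out_1_par` / `out_2_par_1` combination.

```python
import pandas as pd

# Made-up rows mimicking the parsed columns (empty string = field absent)
toy = pd.DataFrame({
    "out_1_par":   ["纪念品", "", "", "StatTrak™"],
    "out_2_par_1": ["", "闪耀", "全息", ""],
    "out_2_par_2": ["", "冠军", "", ""],
    "out_3":       ["", "2022年里约热内卢锦标赛", "", ""],
})

# 1 where the field is present, 0 where it is empty
present = toy.ne("").astype(int)

# Entry (a, b) counts the rows where both a and b are present;
# the diagonal counts the rows where each field is present at all
cooccurrence = present.T.dot(present)
print(cooccurrence)
```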

Apparently, `out_1_par` corresponds to the grade (whether it's `Normal`, `StatTrak™` or `Souvenir`). If the field is empty, we can assume it's `Normal`. `out_2_par_1` seems to refer to the grade of a sticker (闪耀 translates to *Shiny* or *Glitter*). Let's search for that string in the whole `df_out` dataframe:

In [27]:
# It seems that 2023-01-03 10:27:40	 is referring to this item: https://wiki.cs.money/zh/stickers/sticker-broky-glitter-champion-antwerp-2022
import re
def find_in_df(string, df):
    #find if the string is in any of the cells
    bool_df = df.apply(lambda x: x.str.contains(re.escape(string)))
    # filter the rows where the string is in any of the cells, and return it as a dataframe
    return df[bool_df.any(axis=1)]

display(df_purchases['out_1_par'].value_counts())
display(find_in_df("闪耀", df_out))
display(df_purchases['out_2_par_1'].value_counts())

               559025
StatTrak™       35375
纪念品             19213
★                 851
★ StatTrak™        44
Name: out_1_par, dtype: int64

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
850,Tournament Stickers,大赛贴纸,Blue Gem (Glitter),蓝钻（闪耀）,,https://wiki.cs.money/tournament-stickers,,,[Remarkable],[4],$ 4.22,,,https://wiki.cs.money/stickers/sticker-blue-ge...,[10 Year Birthday Sticker Capsule],[CS:GO 10周年印花胶囊],[https://wiki.cs.money/capsules/10-year-birthd...,[https://wiki.cs.money/zh/capsules/10-year-bir...,https://wiki.cs.money/zh/tournament-stickers
1538,Tournament Stickers,大赛贴纸,Go Boom (Glitter),毁灭吧（闪耀）,,https://wiki.cs.money/tournament-stickers,,,[Remarkable],[4],$ 0.58,,,https://wiki.cs.money/stickers/sticker-go-boom...,[10 Year Birthday Sticker Capsule],[CS:GO 10周年印花胶囊],[https://wiki.cs.money/capsules/10-year-birthd...,[https://wiki.cs.money/zh/capsules/10-year-bir...,https://wiki.cs.money/zh/tournament-stickers
2578,Graffiti,涂鸦,Shining Star (Violent Violet),闪耀之星 (纯紫),,https://wiki.cs.money/graffiti,,,[Base Grade],[1],$ 0.42,,,https://wiki.cs.money/graffiti/sealed-graffiti...,[],[],[],[],https://wiki.cs.money/zh/graffiti


        547471
闪耀       36331
冠军       18067
全息       10282
金色        1504
闪亮         667
人质          75
T           58
反恐精英        25
透镜          21
FBI          4
IDF          2
SAS          1
Name: out_2_par_1, dtype: int64

There seem to be 4 grades of stickers: Basic, Glitter (闪耀), Holo (全息) and Gold (金色). *Basic* stickers have no value in the `out_2_par_1` column. These grades seem to correspond to the *High Grade*, *Remarkable*, *Exotic* and *Extraordinary* categories, which have rarity values of 3, 4, 5 and 6 respectively.

| Grade_en | Grade_zh | Corresponding category | Rarity | Color   | Note              |
|----------|----------|------------------------|--------|---------|-------------------|
| Basic    |          | High Grade             | 3      | Blue    | Just for stickers |
| Glitter  | 闪耀     | Remarkable             | 4      | Purple  | Just for stickers |
| Holo     | 全息     | Exotic                 | 5      | Magenta | Just for stickers |
| Gold     | 金色     | Extraordinary          | 6      | Red     | Just for stickers |
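
The table can be turned into a small lookup. This is a sketch only: the names `STICKER_GRADES` and `sticker_grade` are hypothetical, and values of `out_2_par_1` that the table does not cover (for example 闪亮 or 冠军) deliberately return `None` instead of a guessed grade.

```python
# Grade_zh -> (Grade_en, corresponding category, rarity), copied from the table above
STICKER_GRADES = {
    "":   ("Basic", "High Grade", 3),
    "闪耀": ("Glitter", "Remarkable", 4),
    "全息": ("Holo", "Exotic", 5),
    "金色": ("Gold", "Extraordinary", 6),
}

def sticker_grade(out_2_par_1):
    """Look up the grade for a raw `out_2_par_1` value (None if not in the table)."""
    # Missing values (None / NaN) are treated like the empty string, i.e. Basic
    key = out_2_par_1.strip() if isinstance(out_2_par_1, str) else ""
    return STICKER_GRADES.get(key)

print(sticker_grade(" 闪耀 "))
print(sticker_grade(None))
print(sticker_grade("闪亮"))
```

With pandas, the same lookup could be applied to a whole column with `df_purchases['out_2_par_1'].map(sticker_grade)`, restricted to sticker rows.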

Let's now find out what *冠军* (which translates as *Champion*) in `out_2_par_2` corresponds to.

In [28]:
display(df_purchases['out_2_par_2'].value_counts())
display(find_in_df("冠军", df_out))

      609934
冠军      4574
Name: out_2_par_2, dtype: int64

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
2027,Tournament Stickers,大赛贴纸,broky (Champion),broky（冠军）,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-broky-c...,[Antwerp 2022 Champions Autograph Capsule],[2022年安特卫普锦标赛冠军亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2075,Tournament Stickers,大赛贴纸,karrigan (Champion),karrigan（冠军）,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-karriga...,[Antwerp 2022 Champions Autograph Capsule],[2022年安特卫普锦标赛冠军亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2099,Tournament Stickers,大赛贴纸,rain (Champion),rain (Champion),,https://wiki.cs.money/tournament-stickers,Antwerp 2022,rain（冠军）| 2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-rain-ch...,[Antwerp 2022 Champions Autograph Capsule],[2022年安特卫普锦标赛冠军亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2105,Tournament Stickers,大赛贴纸,ropz (Champion),ropz（冠军）,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-ropz-ch...,[Antwerp 2022 Champions Autograph Capsule],[2022年安特卫普锦标赛冠军亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2126,Tournament Stickers,大赛贴纸,Twistzz (Champion),Twistzz（冠军）,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-twistzz...,[Antwerp 2022 Champions Autograph Capsule],[2022年安特卫普锦标赛冠军亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers


Apparently it's just part of the sticker name; it does not seem to refer to its grade or rarity. Here, *Champion* probably marks a commemorative sticker indicating that a specific player (e.g. *broky*) won a tournament. In this specific case, it most likely refers to the item *Sticker | broky (Glitter, Champion) | Antwerp 2022* found here: https://wiki.cs.money/stickers/sticker-broky-glitter-champion-antwerp-2022

Finally, `out_3` seems to refer to the name of a tournament, and is likely only used for Tournament Stickers. For instance, *2022年安特卫普锦标赛* translates to *Antwerp Championship 2022*. If we search for fields with that text, we find 152 items. For the list of tournaments, see https://en.wikipedia.org/wiki/Counter-Strike:_Global_Offensive_Major_Championships

In [29]:
display(df_purchases['out_3'].value_counts())
display(find_in_df("2022年安特卫普锦标赛", df_out))

                     381446
 2022年里约热内卢锦标赛       199392
 2022年安特卫普锦标赛         25754
 2020 RMR              6709
 2021年斯德哥尔摩锦标赛          675
 2019年柏林锦标赛             402
 2019年卡托维兹锦标赛            45
 2018年伦敦锦标赛              26
 2015年科隆锦标赛              14
 2016年科隆锦标赛              10
 2017年克拉科夫锦标赛             9
 2014年科隆锦标赛               9
 2016年 MLG 哥伦布锦标赛         6
 2018年波士顿锦标赛              5
 2017年亚特兰大锦标赛             3
 2015年卢日-纳波卡锦标赛           2
 2015年克卢日-纳波卡锦标赛          1
Name: out_3, dtype: int64

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
1815,Tournament Stickers,大赛贴纸,m0NESY,m0NESY,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.08,,,https://wiki.cs.money/stickers/sticker-m0nesy-...,[Antwerp 2022 Challengers Autograph Capsule],[2022年安特卫普锦标赛挑战组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
1836,Tournament Stickers,大赛贴纸,s1mple,s1mple,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.06,,,https://wiki.cs.money/stickers/sticker-s1mple-...,[Antwerp 2022 Legends Autograph Capsule],[2022年安特卫普锦标赛传奇组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-l...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
1837,Tournament Stickers,大赛贴纸,rox,rox,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.06,,,https://wiki.cs.money/stickers/sticker-rox-ant...,[Antwerp 2022 Contenders Autograph Capsule],[2022年安特卫普锦标赛竞争组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
1839,Tournament Stickers,大赛贴纸,s1mple,s1mple,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.06,,,https://wiki.cs.money/stickers/sticker-s1mple-...,[Antwerp 2022 Legends Autograph Capsule],[2022年安特卫普锦标赛传奇组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-l...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
1876,Tournament Stickers,大赛贴纸,Imperial Esports,Imperial Esports,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.04,,,https://wiki.cs.money/stickers/sticker-imperia...,[Antwerp 2022 Challengers Sticker Capsule],[2022年安特卫普锦标赛挑战组印花胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2157,Tournament Stickers,大赛贴纸,Outsiders,Outsiders,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-outside...,[Antwerp 2022 Contenders Sticker Capsule],[2022年安特卫普锦标赛竞争组印花胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2158,Tournament Stickers,大赛贴纸,PGL,PGL,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-pgl-ant...,"[Antwerp 2022 Legends Sticker Capsule, Antwerp...","[2022年安特卫普锦标赛传奇组印花胶囊, 2022年安特卫普锦标赛挑战组印花胶囊, 202...",[https://wiki.cs.money/capsules/antwerp-2022-l...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2159,Tournament Stickers,大赛贴纸,Renegades,Renegades,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-renegad...,[Antwerp 2022 Contenders Sticker Capsule],[2022年安特卫普锦标赛竞争组印花胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers
2160,Tournament Stickers,大赛贴纸,Team Spirit,Team Spirit,,https://wiki.cs.money/tournament-stickers,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-team-sp...,[Antwerp 2022 Contenders Sticker Capsule],[2022年安特卫普锦标赛竞争组印花胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,https://wiki.cs.money/zh/tournament-stickers


#### `src` column

This column refers to *containers* where a skin or a sticker may appear. These can be:

|           Type_en | URL                                     |
|------------------:|-----------------------------------------|
|             Cases |             https://wiki.cs.money/cases |
|       Collections |       https://wiki.cs.money/collections |
| Souvenir Packages | https://wiki.cs.money/souvenir-packages |
| Agent Collections | https://wiki.cs.money/agent-collections |
|  Sticker Capsules |          https://wiki.cs.money/capsules |
|       Patch Packs |             https://wiki.cs.money/packs |
| Graffiti Capsules | https://wiki.cs.money/graffiti-capsules |
|   Music Kit Boxes |   https://wiki.cs.money/music-kit-boxes |
|     Pins Capsules |     https://wiki.cs.money/pins-capsules |

These are not found in the `df_out` dataframe.

In [30]:
display(find_in_df("梦魇武器箱", df_out))

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh


Instead, they are in the `df_src.pkl` dataframe, which contains all the types of *lootboxes* that we scraped from the website. We'll call it `df_src`.

In [31]:
df_src = pd.read_pickle('df_src.pkl')
display(df_src.sample(5))

# Let's search for the same lootbox names in df_src
display(find_in_df("梦魇武器箱", df_src))
display(find_in_df("2022年安特卫普锦标赛冠军亲笔签名胶囊", df_src))

Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
107,Souvenir Packages,https://wiki.cs.money/souvenir-packages,纪念品包,https://wiki.cs.money/zh/souvenir-packages,Antwerp 2022 Ancient Souvenir Package,安特卫普 2022 远古遗迹纪念包,[Default],[1],$ 3.61,https://wiki.cs.money/cases/antwerp-2022-ancie...
96,Souvenir Packages,https://wiki.cs.money/souvenir-packages,纪念品包,https://wiki.cs.money/zh/souvenir-packages,Rio 2022 Mirage Souvenir Package,2022年里约热内卢锦标赛荒漠迷城纪念包,[Default],[1],—,https://wiki.cs.money/cases/rio-2022-mirage-so...
37,Collections,https://wiki.cs.money/collections,藏品,https://wiki.cs.money/zh/collections,The St. Marc Collection,圣马克镇收藏品,[Default],[1],,https://wiki.cs.money/collections/the-st-marc-...
158,Pins Capsules,https://wiki.cs.money/pins-capsules,勋章胶囊,https://wiki.cs.money/zh/pins-capsules,Collectible Pins Capsule Series 1,系列 1 收藏胸章胶囊,[Default],[1],$ 12.21,https://wiki.cs.money/pins-capsules/collectibl...
10,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Prisma Case,棱彩武器箱,[Default],[1],$ 0.16,https://wiki.cs.money/cases/prisma-case


Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
1,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Dreams & Nightmares Case,梦魇武器箱,[Default],[1],$ 0.50,https://wiki.cs.money/cases/dreams-nightmares-...


Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
128,Sticker Capsules,https://wiki.cs.money/capsules,印花胶囊,https://wiki.cs.money/zh/capsules,Antwerp 2022 Champions Autograph Capsule,2022年安特卫普锦标赛冠军亲笔签名胶囊,[Default],[1],$ 0.23,https://wiki.cs.money/capsules/antwerp-2022-ch...


Translating `df_src`

So, the `src` column seems to be equivalent to the `lootbox_zh` field in the lootbox dataframe `df_src`.

However, `df_src` has some missing translations, which we'll add manually.
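
A minimal sketch of the translation pattern on toy data, assuming (as with 'Recoil Case' later in this notebook) that an untranslated lootbox carries its English name in `lootbox_zh`. The `toy_*` frames and `manual_en_zh` are hypothetical stand-ins for `df_src`, `df_purchases` and the manual dictionary.

```python
import pandas as pd

# Toy lootbox table: the first entry was never translated on the website
toy_src = pd.DataFrame({
    "lootbox_en": ["Recoil Case", "Dreams & Nightmares Case"],
    "lootbox_zh": ["Recoil Case", "梦魇武器箱"],
})

# Manual English -> Chinese fixes
manual_en_zh = {"Recoil Case": "反冲武器箱"}

# map() returns NaN for names without a manual fix; fillna() keeps the existing value
toy_src["lootbox_zh"] = toy_src["lootbox_en"].map(manual_en_zh).fillna(toy_src["lootbox_zh"])

# Translate the purchases with a plain lookup; unknown lootboxes stay NaN
zh_to_en = dict(zip(toy_src["lootbox_zh"], toy_src["lootbox_en"]))
toy_purchases = pd.DataFrame({"src": ["反冲武器箱", "梦魇武器箱", "猎杀者武器箱"]})
toy_purchases["src_en"] = toy_purchases["src"].map(zh_to_en)

missing = toy_purchases.loc[toy_purchases["src_en"].isna(), "src"].tolist()
print(toy_purchases)
print("Still untranslated:", missing)
```

One reason to prefer `Series.map` over a merge for a lookup like this is that it always returns exactly one value per purchase, so the number and order of rows cannot change even if the lookup table contains duplicates.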

In [194]:
# Let's add the missing translations to df_src and re-export it to pickle, to make the changes permanent
# Manual list of the lootboxes with missing translation (the ones with more than 1000 purchases at the moment of writing this)
lootboxes_zh_en = {'反冲武器箱':'Recoil Case',
        '2022年里约热内卢锦标赛传奇组亲笔签名胶囊':'Rio 2022 Legends Autograph Capsule',
        '2022年里约热内卢锦标赛传奇组印花胶囊':'Rio 2022 Legends Sticker Capsule',
        '2022年里约热内卢锦标赛挑战组亲笔签名胶囊':'Rio 2022 Challengers Autograph Capsule',
        '2022年里约热内卢锦标赛挑战组印花胶囊':'Rio 2022 Challengers Sticker Capsule',
        '2022年里约热内卢锦标赛竞争组印花胶囊':'Rio 2022 Contenders Sticker Capsule',
        '2022年里约热内卢锦标赛炙热沙城 II 纪念包':'Rio 2022 Dust II Souvenir Package',
        '2022年里约热内卢锦标赛竞争组亲笔签名胶囊':'Rio 2022 Contenders Autograph Capsule',
        '2022年里约热内卢锦标赛死亡游乐园纪念包':'Rio 2022 Overpass Souvenir Package',
        '2022年里约热内卢锦标赛远古遗迹纪念包':'Rio 2022 Ancient Souvenir Package',
        '2022年里约热内卢锦标赛核子危机纪念包':'Rio 2022 Nuke Souvenir Package',
        '2022年里约热内卢锦标赛炼狱小镇纪念包':'Rio 2022 Inferno Souvenir Package',
        '2022年里约热内卢锦标赛殒命大厦纪念包':'Rio 2022 Vertigo Souvenir Package',
        '幻彩 3 号武器箱':'Chroma 3 Case',
        '左轮武器箱':'Revolver Case',
        'CS:GO 10周年印花胶囊':'10 Year Birthday Sticker Capsule',
        '幻彩 2 号武器箱':'Chroma 2 Case',
        '2022年里约热内卢锦标赛荒漠迷城纪念包':'Rio 2022 Mirage Souvenir Package',
        '2020 RMR 传奇组战队胶囊':'2020 RMR Legends',
        '“突围大行动”武器箱':'Operation Breakout Weapon Case',
        '2020 RMR 竞争组战队胶囊':'2020 RMR Contenders',
        '2020 RMR 挑战组战队胶囊':'2020 RMR Challengers',
        '“野火大行动”武器箱':'Operation Wildfire Case',
        '引爆器音乐盒集':'Initiators Music Kit Box',
        '“凤凰大行动”武器箱':'Operation Phoenix Weapon Case',
        '幻彩武器箱':'Chroma Case',
        '暗影武器箱':'Shadow Case',
        '《半衰期：爱莉克斯》印花胶囊':'Half-Life: Alyx Sticker Capsule',
        '点亮中国 2 号印花胶囊':'Perfect World Sticker Capsule 2',
        '海报女郎胶囊':'Pinups Capsule',
        '光环胶囊':'Halo Capsule',
        '弯曲猎手武器箱':'Falchion Case'
}

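# Same dict, opposite direction (English->Chinese)
lootboxes_en_zh = {v: k for k, v in lootboxes_zh_en.items()}

# Untranslated rows of df_src hold the English name in 'lootbox_zh', so mapping that column swaps in the Chinese name
df_src['lootbox_zh'] = df_src['lootbox_zh'].map(lootboxes_en_zh).fillna(df_src['lootbox_zh'])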

df_src.to_pickle("./df_src.pkl")

In [227]:
import numpy as np
# Let's check if we are able to find the translations for the df_purchases['src'] in df_src, and which values remain untranslated.

merged_df = df_purchases.merge(df_src, left_on='src', right_on='lootbox_zh', how='left')
df_purchases["src_en"] = merged_df["lootbox_en"]

# Apply the manual dictionary again, to df_purchases this time
df_purchases['src_en'] = df_purchases['src'].map(lootboxes_zh_en).fillna(df_purchases['src_en'])

# Add the type of lootbox (will be helpful to determine the approximate cost)
merged_df = df_purchases.merge(df_src, left_on='src_en', right_on='lootbox_en', how='left')
df_purchases["src_type"] = merged_df["Type_en"]

# Manually add some type of lootbox that wasn't done automatically
df_purchases["src_type"] = np.where((df_purchases['src_en'].str.contains('Sticker Capsule')) & (df_purchases['src_type'].isnull()), "Sticker Capsules", df_purchases["src_type"]) # e.g. 10 Year Birthday Sticker Capsule
df_purchases["src_type"] = np.where((df_purchases['src_en'].str.contains('Case')) & (df_purchases['src_type'].isnull()), "Cases", df_purchases["src_type"]) # e.g.: Falchion case

# Add the price of lootbox
## Cases: $2.5
## Music Kit boxes: check price on df_src
## Souvenir packages: $3
## Patch packs: $2
## Graffiti capsules: check value on df_src
## Pin capsules: $9.49
## Sticker capsules: $0.95
# Music kit boxes and graffiti capsules take their value from df_src; all other types get a fixed price below.
# The type is read from df_purchases: merged_df only has a 'src_type' column if this cell has been run before.
df_purchases['src_value'] = np.where(df_purchases['src_type'].isin(['Music Kit Boxes', 'Graffiti Capsules']), merged_df['Value'], np.nan)
df_purchases['src_value'] = np.where(((df_purchases['src_type'] == 'Cases') & (df_purchases['src_value'].isnull())), 2.5, df_purchases['src_value']) # Cases
df_purchases['src_value'] = np.where(((df_purchases['src_type'] == 'Souvenir Packages') & (df_purchases['src_value'].isnull())), 3.0, df_purchases['src_value']) # Souvenir packages
df_purchases['src_value'] = np.where(((df_purchases['src_type'] == 'Patch Packs') & (df_purchases['src_value'].isnull())), 2.0, df_purchases['src_value']) # Patch packs
df_purchases['src_value'] = np.where(((df_purchases['src_type'] == 'Pins Capsules') & (df_purchases['src_value'].isnull())), 9.49, df_purchases['src_value']) # Pin capsules
df_purchases['src_value'] = np.where(((df_purchases['src_type'] == 'Sticker Capsules') & (df_purchases['src_value'].isnull())), 0.95, df_purchases['src_value']) # Sticker capsules

# Convert values like "$ 1.21" to 1.21 (float) and convert the whole column to float
df_purchases['src_value'] = df_purchases['src_value'].apply(lambda x: x.replace('$', '').strip() if isinstance(x, str) else x)
df_purchases['src_value'] = pd.to_numeric(df_purchases['src_value'], errors='coerce') # Non-numeric placeholders become NaN



# Which elements of src in df_purchases are not found in df_src
counts = df_purchases[df_purchases['src_en'].isna()]['src'].value_counts()
print(f"Missing translations for src: {counts.sum()} ({counts.sum()/df_purchases.shape[0]*100}%)")
for value, count in counts.items():
    print(value, count)

# Rearrange column in df_purchases so src_en goes after src
df_purchases = df_purchases[['datetimeUTC', 'timestamp', 'user', 'src', 'src_en', 'src_type', 'src_value', 'out', 'out_1_nopar',
       'out_1_par', 'out_2_nopar', 'out_2_par_1', 'out_2_par_2', 'out_3']]

display(df_purchases.sample(5))

del counts
del merged_df
del value, count


## Save df_purchases to pickle, so we can start from this point later.
df_purchases.to_pickle("df_purchases.pkl")

Missing translations for src: 4919 (0.8004777805984625%)
Enfu 印花胶囊 489
“先锋大行动”武器箱 424
段位印花胶囊 387
点亮中国 1 号印花胶囊 369
2021社区印花胶囊 361
猛兽胶囊 358
猎杀者武器箱 248
StatTrak™ 引爆器音乐盒集 225
2 号印花胶囊 183
2019年柏林锦标赛挑战组亲笔签名胶囊 122
糖果脸谱胶囊 109
冬季攻势武器箱 106
反恐精英20周年印花胶囊 102
动物寓言胶囊 101
1 号印花胶囊 98
反恐精英武器箱 90
2019年柏林锦标赛传奇组亲笔签名胶囊 88
2019年柏林锦标赛传奇组胶囊（全息/闪亮） 71
1 号社区印花胶囊 66
2019年柏林锦标赛竞争组亲笔签名胶囊 66
2021年斯德哥尔摩锦标赛竞争组印花胶囊 57
“英勇大行动”武器箱 55
战术大师音乐盒集 55
Slid3 胶囊 53
反恐精英 3 号武器箱 52
反恐精英 2 号武器箱 51
电竞 2014 夏季武器箱 50
战锤40k印花胶囊 46
电竞 2013 冬季武器箱 44
团队定位胶囊 44
2019年柏林锦标赛竞争组胶囊（全息/闪亮） 43
2018年社区印花胶囊 40
魔性探员胶囊 38
StatTrak™ 决策大师音乐盒集 27
2019年卡托维兹锦标赛传奇亲笔签名胶囊 27
小鸡胶囊 25
斯德哥尔摩 2021 殒命大厦纪念包 17
2018年伦敦锦标赛挑战组亲笔签名胶囊 12
2019年柏林锦标赛挑战组胶囊（全息/闪亮） 12
2019年卡托维兹锦标赛竞争组亲笔签名胶囊 10
2016年科隆锦标赛传奇（全息/闪亮） 8
2018年伦敦锦标赛竞争组亲笔签名胶囊 6
2014年 ESL One 科隆锦标赛传奇 6
2016年 MLG 哥伦布锦标赛挑战组（全息/闪亮） 5
2017年克拉科夫锦标赛挑战组（全息/闪亮） 5
2015年 ESL One 科隆锦标赛传奇（闪亮） 4
2019年卡托维兹锦标赛挑战组亲笔签名胶囊 4
2018年伦敦锦标赛传奇亲笔签名胶囊 4
2015年 ESL One 科隆锦标赛挑战组（闪亮） 3
2019年卡托维兹锦标赛竞争组（全息/闪亮） 3
2014年 ESL One 科隆锦标赛挑战组 3
2018年伦敦锦标赛传

Unnamed: 0,datetimeUTC,timestamp,user,src,src_en,src_type,src_value,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
39385,2022-12-21 03:36:26,1671593786,A2***-4JTC,2022年里约热内卢锦标赛挑战组印花胶囊,Rio 2022 Challengers Sticker Capsule,Sticker Capsules,0.95,印花 | BIG | 2022年里约热内卢锦标赛,印花,,BIG,,,2022年里约热内卢锦标赛
520487,2023-01-08 05:27:12,1673155632,AV***-BZZQ,CS:GO 10周年印花胶囊,10 Year Birthday Sticker Capsule,Sticker Capsules,0.95,印花 | K商力荐,印花,,K商力荐,,,
17086,2022-12-14 12:30:55,1671021055,AM***-EEPQ,梦魇武器箱,Dreams & Nightmares Case,Cases,2.5,MAG-7 | 先见之明,MAG-7,,先见之明,,,
4158,2022-12-14 01:40:43,1670982043,SZ***-Q5EC,2022年里约热内卢锦标赛传奇组亲笔签名胶囊,Rio 2022 Legends Autograph Capsule,Sticker Capsules,0.95,印花 | cadiaN（闪耀）| 2022年里约热内卢锦标赛,印花,,cadiaN,闪耀,,2022年里约热内卢锦标赛
508555,2023-01-07 19:21:30,1673119290,AV***-X9DE,命悬一线武器箱,Clutch Case,Cases,2.5,新星 | 狂野六号,新星,,狂野六号,,,


### Assigning a value to each purchase

#### Manually dealing with untranslated records

In [34]:
df_purchases.sample(1)

Unnamed: 0,datetimeUTC,timestamp,user,src,src_en,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
446578,2023-01-05 00:33:30,1672878810,AA***-YQGL,反冲武器箱,Recoil Case,格洛克 18 型 | 冬季战术,格洛克 18 型,,冬季战术,,,


Apparently, not all skins/stickers can be found in `df_out` by their Chinese name (for instance, 喵喵36: https://buff.163.com/goods/900562). In this case, the Chinese version of the website lists the name in English (Meow 36): https://wiki.cs.money/zh/weapons/famas/meow-36

In [41]:
display(find_in_df("喵喵36", df_out))
display(find_in_df("Meow 36", df_out))

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
792,Rifles,步枪,FAMAS,法玛斯,https://wiki.cs.money/_next/static/images/fama...,https://wiki.cs.money/weapons/famas,Meow 36,喵喵36,"[StatTrak™, Mil-Spec]","[nan, 3]",$ 0.08 - $ 0.67,$ 0.22 - $ 2.12,,https://wiki.cs.money/weapons/famas/meow-36,[The Recoil Collection],[The Recoil Collection],[https://wiki.cs.money/collections/the-recoil-...,[https://wiki.cs.money/zh/collections/the-reco...,


Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
792,Rifles,步枪,FAMAS,法玛斯,https://wiki.cs.money/_next/static/images/fama...,https://wiki.cs.money/weapons/famas,Meow 36,喵喵36,"[StatTrak™, Mil-Spec]","[nan, 3]",$ 0.08 - $ 0.67,$ 0.22 - $ 2.12,,https://wiki.cs.money/weapons/famas/meow-36,[The Recoil Collection],[The Recoil Collection],[https://wiki.cs.money/collections/the-recoil-...,[https://wiki.cs.money/zh/collections/the-reco...,


The same thing can happen with the `src` column. Apparently 反冲武器箱 cannot be found, because in the Chinese version of the website it's also called "Recoil Case": https://wiki.cs.money/zh/cases/recoil-case. We'll have to consider these cases manually.
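
One way to spot such entries is to look for "Chinese" names that contain no Chinese characters at all. This is a rough sketch: the regular expression only covers the basic CJK Unified Ideographs block, `lacks_chinese` is a hypothetical helper, and the toy column stands in for `lootbox_zh`. Names that are legitimately identical in both languages (such as 'SWAG-7' or 'Exo') will also be flagged, so the result is a list of candidates to review, not a list of errors.

```python
import re
import pandas as pd

# Basic CJK Unified Ideographs block; enough for the names seen in this notebook
CJK = re.compile(r"[\u4e00-\u9fff]")

def lacks_chinese(name):
    """True for a non-empty string that contains no Chinese characters."""
    return isinstance(name, str) and name.strip() != "" and CJK.search(name) is None

# Toy stand-in for df_src['lootbox_zh']
toy_zh = pd.Series(["Recoil Case", "梦魇武器箱", "2020 RMR 传奇组战队胶囊", ""])
candidates = toy_zh[toy_zh.map(lacks_chinese)].tolist()
print(candidates)
```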

In [37]:
display(find_in_df("反冲武器箱", df_src)) # Now it works because we added it manually to the df_src a few steps before, but originally it was 'Recoil Case' as well.
display(find_in_df("Recoil Case", df_src))

Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
0,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Recoil Case,反冲武器箱,[Default],[1],$ 0.57,https://wiki.cs.money/cases/recoil-case


Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
0,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Recoil Case,反冲武器箱,[Default],[1],$ 0.57,https://wiki.cs.money/cases/recoil-case


##### Untranslated `df_out` items
We can just check if the name of the skin is the same in English and in Chinese (and is not empty).

In [38]:
filtered_df = df_out[(df_out['Skin_Name_zh'].eq(df_out['Skin_Name'])) & (df_out['Skin_Name_zh'].ne('') & df_out['Skin_Name'].ne(''))]
untranslated_out = filtered_df['Skin_Name_zh'].unique()
del filtered_df
print(untranslated_out)

['Ice Coaled' 'Chromatic Aberration' 'Poly Mag' 'Dragon Tech' 'Destroyer'
 'Meow 36' 'Printstream' 'Winterized' 'Visions' 'Crazy 8'
 'Flora Carnivora' 'Vent Rush' 'Roadblock' 'Monkeyflage' 'Kiss♥Love'
 'SWAG-7' 'Exo' 'Drop Me' 'Downtown' 'O.S.I.P.R.' 'Rio 2022' '2020 RMR']


##### Assigning a manual translation to these items

In [39]:
skins_en_zh = {'Ice Coaled': '可燃冰',
 'Chromatic Aberration': '迷人眼',
 'Poly Mag': '透明弹匣',
 'Dragon Tech': '青龙',
 'Destroyer': '毁灭者',
 'Meow 36': '喵喵36',
 'Printstream': '印花集',
 'Winterized': '冬季战术',
 'Visions': '迷人幻象',
 'Crazy 8': '疯狂老八',
 'Flora Carnivora': '食人花',
 'Vent Rush': '给爷冲',
 'Roadblock': '路障',
 'Monkeyflage': '萌猴迷彩',
 'Kiss♥Love': '么么',
 'SWAG-7': 'SWAG-7',
 'Exo': 'Exo',
 'Drop Me': '丢把枪',
 'Downtown': '闹市区',
 'O.S.I.P.R.': 'O.S.I.P.R.',
 'Rio 2022':'2022年里约热内卢锦标赛'
 }

# Same dict, opposite direction (Chinese->English)
skins_zh_en = {v: k for k, v in skins_en_zh.items()}

In [40]:
# Apply the manual dictionary, to df_out this time
df_out['Skin_Name_zh'] = df_out['Skin_Name'].map(skins_en_zh).fillna(df_out['Skin_Name_zh'])

# Save df_out to pickle
df_out.to_pickle("./df_out.pkl")

In [42]:
find_in_df("Vent Rush", df_out)
#find_in_df("hola", df_purchases) # doesn't work, I'll fix it when I have time

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
1244,SMGs,微型冲锋枪,P90,P90,https://wiki.cs.money/_next/static/images/p90-...,https://wiki.cs.money/weapons/p90,Vent Rush,给爷冲,"[StatTrak™, Restricted]","[nan, 4]",$ 1.03 - $ 3.25,$ 1.65 - $ 8.12,,https://wiki.cs.money/weapons/p90/vent-rush,[The Recoil Collection],[The Recoil Collection],[https://wiki.cs.money/collections/the-recoil-...,[https://wiki.cs.money/zh/collections/the-reco...,


Let's look up a random purchase in `df_out` and display the matching rows.

In [43]:
skin = df_purchases.sample(1)
display(skin)
display(skin['out_2_nopar'].item())
skin = skin['out_2_nopar'].item().strip()
correspondence = find_in_df(skin, df_out)
display(correspondence)

if len(correspondence) > 0:
    # TODO: only the value of the last matching row is printed
    print(f"Estimated value obtained for that purchase was: {correspondence['Value'].iloc[-1]}")

del correspondence, skin

Unnamed: 0,datetimeUTC,timestamp,user,src,src_en,out,out_1_nopar,out_1_par,out_2_nopar,out_2_par_1,out_2_par_2,out_3
267518,2022-12-29 12:44:33,1672317873,SZ***-ZGGE,2022年里约热内卢锦标赛传奇组印花胶囊,Rio 2022 Legends Sticker Capsule,印花 | Ninjas in Pyjamas | 2022年里约热内卢锦标赛,印花,,Ninjas in Pyjamas,,,2022年里约热内卢锦标赛


' Ninjas in Pyjamas '

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
624,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Katowice 2015,2015年卡托维兹锦标赛,"[Remarkable, Exotic, Exotic]","[4, 5, 5]",$ 84.21 - $ 104.39,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[ESL One Katowice 2015 Legends (Holo/Foil), St...","[2015年 ESL One 卡托维兹锦标赛传奇（全息/闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/esl-one-katowi...,[https://wiki.cs.money/zh/capsules/esl-one-kat...,https://wiki.cs.money/zh/tournament-stickers
645,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Katowice 2014,2014年卡托维兹锦标赛,"[High Grade, Remarkable, Exotic]","[3, 4, 5]",$ 469.45,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[EMS Katowice 2014 Legends, Stickers Without C...","[2014年 EMS 卡托维兹锦标赛传奇, 无胶囊印花]",[https://wiki.cs.money/capsules/ems-katowice-2...,[https://wiki.cs.money/zh/capsules/ems-katowic...,https://wiki.cs.money/zh/tournament-stickers
713,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,DreamHack 2014,2014年 DreamHack 锦标赛,"[High Grade, Remarkable, Exotic, Exotic]","[3, 4, 5, 5]",$ 12.26,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[DreamHack 2014 Legends (Holo/Foil), Stickers ...","[2014年 DreamHack 锦标赛传奇（全息/闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/dreamhack-2014...,[https://wiki.cs.money/zh/capsules/dreamhack-2...,https://wiki.cs.money/zh/tournament-stickers
734,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Katowice 2015,2015年卡托维兹锦标赛,[High Grade],[3],$ 8.50,,,https://wiki.cs.money/stickers/sticker-ninjas-...,[Stickers Without Capsule],[无胶囊印花],[https://wiki.cs.money/capsules/stickers-witho...,[https://wiki.cs.money/zh/capsules/stickers-wi...,https://wiki.cs.money/zh/tournament-stickers
911,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Cologne 2014,2014年科隆锦标赛,"[High Grade, Remarkable, Exotic]","[3, 4, 5]",$ 3.22,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[ESL One Cologne 2014 Legends, Stickers Withou...","[2014年 ESL One 科隆锦标赛传奇, 无胶囊印花]",[https://wiki.cs.money/capsules/esl-one-cologn...,[https://wiki.cs.money/zh/capsules/esl-one-col...,https://wiki.cs.money/zh/tournament-stickers
941,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Cologne 2016,2016年科隆锦标赛,"[High Grade, Remarkable, Exotic, Exotic]","[3, 4, 5, 5]",$ 2.88,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[Cologne 2016 Legends (Holo/Foil), Stickers Wi...","[2016年科隆锦标赛传奇（全息/闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/cologne-2016-l...,[https://wiki.cs.money/zh/capsules/cologne-201...,https://wiki.cs.money/zh/tournament-stickers
1108,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,MLG Columbus 2016,2016年 MLG 哥伦布锦标赛,"[High Grade, Remarkable, Exotic, Exotic]","[3, 4, 5, 5]",$ 1.85,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[MLG Columbus 2016 Legends (Holo/Foil), Sticke...","[2016年 MLG 哥伦布锦标赛传奇（全息/闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/mlg-columbus-2...,[https://wiki.cs.money/zh/capsules/mlg-columbu...,https://wiki.cs.money/zh/tournament-stickers
1134,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Cologne 2015,2015年科隆锦标赛,"[High Grade, Exotic, Exotic]","[3, 5, 5]",$ 1.75,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[ESL One Cologne 2015 Legends (Foil), Stickers...","[2015年 ESL One 科隆锦标赛传奇（闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/esl-one-cologn...,[https://wiki.cs.money/zh/capsules/esl-one-col...,https://wiki.cs.money/zh/tournament-stickers
1222,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Cluj-Napoca 2015,2015年克卢日-纳波卡锦标赛,"[High Grade, Exotic, Exotic]","[3, 5, 5]",$ 1.47,,,https://wiki.cs.money/stickers/sticker-ninjas-...,"[DreamHack Cluj-Napoca 2015 Legends (Foil), St...","[2015年 DreamHack 克卢日-纳波卡锦标赛传奇（闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/dreamhack-cluj...,[https://wiki.cs.money/zh/capsules/dreamhack-c...,https://wiki.cs.money/zh/tournament-stickers
1350,Tournament Stickers,大赛贴纸,Ninjas in Pyjamas,Ninjas in Pyjamas,,https://wiki.cs.money/tournament-stickers,Katowice 2019,2019年卡托维兹锦标赛,"[High Grade, Remarkable, Exotic, Exotic]","[3, 4, 5, 5]",$ 1.10,,,https://wiki.cs.money/stickers/sticker-ninjas-...,[Katowice 2019 Returning Challengers (Holo/Foi...,"[2019年卡托维兹锦标赛挑战组（全息/闪亮）, 无胶囊印花]",[https://wiki.cs.money/capsules/katowice-2019-...,[https://wiki.cs.money/zh/capsules/katowice-20...,https://wiki.cs.money/zh/tournament-stickers


Estimated value obtained for that purchase was: —


Unique df_purchases outcomes

How many different skins are in the df_purchases dataframe?

In [44]:
df_purchases['out_2_nopar'].nunique()
unique_purchases = df_purchases['out_2_nopar'].unique()

Outcomes present in `df_out`

And how many of these return at least one result in the `df_out` dataframe?
(this can be quite slow...)

In [45]:
found = []
notfound = []
for purchase in unique_purchases:
    purchase = purchase.strip()  # strip once; the parsed names carry leading/trailing spaces
    if len(find_in_df(purchase, df_out)) > 0:
        found.append(purchase)
    else:
        notfound.append(purchase)

In [46]:
print(len(found)/len(unique_purchases)*100, "% of purchases found somewhere in df_out")

96.96406443618339 % of purchases found somewhere in df_out
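
The loop above calls `find_in_df` once per outcome, which is what makes it slow. A possible faster check is sketched here, assuming that exact matches against the `Weapon_zh` and `Skin_Name_zh` columns are enough (this is an assumption; `find_in_df` may search more columns or match differently). `df_out_demo` and `unique_demo` are made-up stand-ins for `df_out` and `unique_purchases`.

```python
import pandas as pd

# Toy stand-ins for df_out and unique_purchases (made-up rows, for illustration only)
df_out_demo = pd.DataFrame({
    "Weapon_zh": ["R8 左轮手枪", "Ninjas in Pyjamas"],
    "Skin_Name_zh": ["稳", "2016年科隆锦标赛"],
})
unique_demo = ["稳 ", "Ninjas in Pyjamas", "潘多拉之盒"]

# Collect every known name once, so each lookup is a set membership test
name_columns = ["Weapon_zh", "Skin_Name_zh"]
known_names = set()
for col in name_columns:
    known_names.update(df_out_demo[col].dropna().astype(str).str.strip())

found_demo = [p.strip() for p in unique_demo if p.strip() in known_names]
notfound_demo = [p.strip() for p in unique_demo if p.strip() not in known_names]
print(found_demo, notfound_demo)
```

Restricting the lookup to the name columns also keeps matches in unrelated columns (links, capsule names) from counting as found.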


Skins not found in `df_out` 

(take it with a grain of salt: this list is probably incomplete, because some purchases were counted as found only because their text matched a non-relevant column)

We'll try to find their correspondence manually.

In [47]:
print(notfound)

['一目了然', '小宝火蛇', '点头就行', '残局大师 (全息)', 'GamerLegion', 'Sullivan King - 困兽', '3kliksphilip - 追溯起源', "Humanity's Last Breath - 虚空", '战吼斑纹', '开干', 'Meechy Darko - 哥特浮华', 'GamerLegion', '00 Nation', '00 Nation', '我悟了', 'Juelz - 神枪手', 'Knock2 - 冲击星*', '闪光一抹黑', '杀戮快感 (全息)', 'Chipzel - 黄色魔法', '刺客 (全息)', '惊怖兽王', '隐秘行动处', '极地孤狼', '激光发射', '香蕉上道', '老鼠帮', '焚化 (全息)', 'Sarah Schachner - 蜂鸟', '万斯后裔', 'Jesse Harlin - 战火星空', '屠杀者', '穿墙射击', '传奇 (闪亮)', '斗鸡眼', '贵族小队 (全息)', '士官长 (全息)', '士官长 (闪亮)', '温浴 (全息)', 'Dren - 枪炮卷饼卡车', 'Freaky DNA - 征服', 'Austin Wintory - 咖啡拉花', '鸡中富豪(闪亮)', '鸡中刺客 (全息)', '箱巢 (全息)', 'Laura Shigihara - 好好干，好好活', '双架鸡友(闪亮)', '星际战士 (全息)', '潘多拉之盒']


In [62]:
# Skins not found in df_out
# 丢把枪 "Negev | throw a gun" https://buff.163.com/goods/900575
# Rio 2022 should be renamed to 2022年里约热内卢锦标赛 in Skin_Name_zh column.
# 法玛斯 | 喵喵36 --> FAMAS | Meow 36 https://buff.163.com/goods/900562

```
Weapon skins
out_1_nopar: Weapon_zh (e.g. MP9)
out_1_par: normal (empty), stattrak or souvenir
out_2_nopar: Skin_Name_zh (e.g. 黑砂, black sand)

The same skin name can exist for several weapons, so the combination of Weapon_zh and Skin_Name_zh is needed to get a unique row. Also check out_1_par to see whether the item is StatTrak or Souvenir.


Stickers
out_1_nopar: 印花 (printing / stamp)


Tournament stickers
out_1_nopar: 印花 (printing / stamp)
out_2_nopar: weapon_zh (usually name of the sticker/team e.g. MIBR or Team Spirit)
out_2_par_1: grade of the sticker (e.g. 闪耀 (glitter)). Can be empty.
out_3: name of the tournament, Skin_Name_zh (e.g. 2022年安特卫普锦标赛, Antwerp 2022)

The combination of Weapon_zh and Skin_Name_zh is needed to get a unique row.
```
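
Since a weapon skin is only identified by the pair (`Weapon_zh`, `Skin_Name_zh`), one way to attach values to many purchases at once is a left merge on both keys. The sketch below uses made-up rows and made-up prices; `df_out_demo` and `df_purchases_demo` are hypothetical stand-ins, and it ignores the StatTrak/Souvenir distinction.

```python
import pandas as pd

# Made-up reference rows: the same skin name on two weapons, plus one repeated row
df_out_demo = pd.DataFrame({
    "Weapon_zh": ["MP9", "P250", "MP9"],
    "Skin_Name_zh": ["黑砂", "黑砂", "黑砂"],
    "Value": ["$ 0.10 - $ 0.50", "$ 0.03 - $ 0.20", "$ 0.10 - $ 0.50"],
})

# Made-up purchases; the parsed columns keep stray spaces, as in df_purchases
df_purchases_demo = pd.DataFrame({
    "out_1_nopar": ["MP9 ", "P250 "],
    "out_2_nopar": [" 黑砂", " 黑砂"],
})

# Strip the keys and rename them to match the reference columns
keys = df_purchases_demo[["out_1_nopar", "out_2_nopar"]].apply(lambda s: s.str.strip())
keys.columns = ["Weapon_zh", "Skin_Name_zh"]

# Keep one row per (weapon, skin) pair so the merge cannot multiply purchases
lookup = df_out_demo.drop_duplicates(subset=["Weapon_zh", "Skin_Name_zh"])
merged = keys.merge(lookup, on=["Weapon_zh", "Skin_Name_zh"], how="left")
print(merged)
```

Matching on the skin name alone would give both purchases two candidate rows; with both keys each purchase gets exactly one value.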

##### Ok, let's try to get the value for Skins

 ```
 def get_value():
    take the skin name from out_2_nopar
    take the weapon name from out_1_nopar
    
    look up out_2_nopar in df_out['Skin_Name_zh']
        see how many results it returns.
    If there is more than one, 
        within the results, look up out_1_nopar in df_out['Weapon_zh']
    (if there is only one, it would also be good to check that the weapon matches)
    
    in df_purchases['out_1_par'], check whether it is stattrak or souvenir
        if it is, take the value from df_out['Value_Stattrak'] or df_out['Value_Souvenir']
        if it is not, take the value from df_out['Value']
    
    Parse the value.
    
    # Pending: get the Rarity level
    # This will probably only work for skins. For stickers, graffiti and the rest it will be different.
 ```

In [1]:
### YOU CAN START FROM HERE
import pandas as pd
df_out = pd.read_pickle('df_out.pkl')
df_src = pd.read_pickle('df_src.pkl')
df_purchases = pd.read_pickle('df_purchases.pkl')

In [2]:
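def get_skin_value(skin_name, weapon_name):
    """Return the 'Value' string of a weapon skin in df_out, or None if there is no match."""
    print(f"Trying to find the value for the skin {skin_name} for the weapon {weapon_name}")
    df_skinsearch = df_out[df_out['Skin_Name_zh'] == skin_name]
    if df_skinsearch.shape[0] == 0:
        print("Skin not found")
        return None
    if df_skinsearch.shape[0] == 1:
        if df_skinsearch['Weapon_zh'].iloc[0] == weapon_name:
            print("Match found!")
            return df_skinsearch['Value'].item()
        print("The skin exists, but for a different weapon")
        return None
    print("More than one weapon with that skin")
    df_skinweaponsearch = df_out.query("Skin_Name_zh == @skin_name & Weapon_zh == @weapon_name")
    if df_skinweaponsearch.empty:
        print("No row matches both the skin and the weapon")
        return None
    return df_skinweaponsearch['Value'].iloc[0]  # df_out has a few repeated rows; keep the first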


display(get_skin_value("斯康里娅", "FN57"))

Trying to find the value for the skin 斯康里娅 for the weapon FN57
More than one weapon with that skin


'$ 0.06 - $ 0.34'

In [3]:
import numpy as np

# Get a random skin name
def get_randompurchase(df):
    purchase = df.sample(1)
    #display(purchase)
    purchase = purchase.values.flatten().tolist()
    return purchase

# (this function should be rewritten so it can use column names instead of indices)
# Finds the weapon or skin in the df_out
def get_value(purchase, verbose=False):
    outcategory = ''
    stripped = [s.strip() for s in purchase[8:]]
    purchase = purchase[:8] + stripped
    if verbose: print("Purchase: ", purchase)

    # Some specific cases to deal with manually
    if purchase[8] == 'CZ75': purchase[8] = 'CZ75 自动手枪' # The pistol CZ75 appears as CZ75 自动手枪 in df_out. It's an exception that needs to be corrected manually.
    if purchase[8] == 'M4A1 消音型': purchase[8] = 'M4A1 消音版' # The weapon M4A1 消音型 appears as M4A1 消音版 in df_out.

    if purchase[8] == '印花': # If it's a sticker
        # It still fails if a sticker, patch and graffiti have the same name (finds more than 1 result, like item 580721), we have to control those cases
        if verbose: print("It's a sticker")
        outcategory = "Regular Stickers"
        if verbose: print(f"Name of sticker {purchase[10]}")
        if verbose: print(f"Grade of the sticker: {purchase[11]}")
        #if purchase[10] == "冠军": # Champion
        #    print("Champion!")
        if purchase[13] != "": # If there's something in out3, it's a tournament sticker
            outcategory = "Tournament Sticker"
            if verbose: print("It's a tournament sticker")
            if verbose: print(f"It belongs to the tournament {purchase[13]}")
            df_query = df_out.query("Weapon_zh == @purchase[10] & Skin_Name_zh == @purchase[13]")
        else:
            if verbose: print("It's a non-tournament sticker")
            df_query = df_out.query("Type_zh == '普通贴纸' & Weapon_zh == @purchase[10] & Skin_Name_zh == @purchase[13]") # to separate them from graffiti and patches

        value =  df_query['Value']
        if verbose: display(df_query)
    

    elif purchase[8] == '音乐盒':
        if verbose: print("It's a music kit")
        df_query = df_out.query("Type_en == 'Music Kits' & Weapon_zh == @purchase[8]")
        outcategory = "Music Kits"
        value =  df_query['Value']

    elif '★' in purchase[9]:
        if verbose: print("Item with a star! ★")
        searchitem = purchase[8]+'（'+purchase[9]+'）'
        if verbose: print(searchitem)
        df_query = df_out.query("Weapon_zh == @searchitem & Skin_Name_zh == @purchase[10]")
        value =  df_query['Value']
        #display(df_query)

    else:
        if verbose: print("It's likely a weapon skin") # Still fails for music boxes
        #df_weaponsearch = df_out[df_out['Weapon_zh'] == purchase[6]]
        
        df_query = df_out.query("Skin_Name_zh == @purchase[10] & Weapon_zh == @purchase[8] ") # Control if 0 cases are returned, like item with timestamp 1673442333
        outcategory = "Unknown Weapon skin"
        if purchase[9] == '纪念品':
            if verbose: print("It's a Souvenir weapon.")
            value =  df_query['Value_Souvenir']
        elif purchase[9] == 'StatTrak™':
            if verbose: print("It's a StatTrak weapon.")
            value =  df_query['Value_Stattrak']
        elif purchase[9] == '':
            if verbose: print("The weapon has the grade Normal.")
            value =  df_query['Value']

        else:
            if verbose: print("What is this?") # if anything else fails
            return 'not found', np.nan
        if verbose: display(df_query)

    # Parse value (will be in '$ 34 - $ 56' format)
    if verbose: print("value: ", value)
    
    if len(value.index) == 0: # If no results were found, set the value as np.nan
        if verbose: print("No results found")
        if outcategory == "":
            return 'unknown', np.nan 
        else:
            return outcategory, np.nan # sometimes we know the type, even if it was not found
    elif len(value.index) > 1: 
        value = value.head(1) # If more than 1 value is returned, keep the first one. Some rows are repeated in df_out, it's fine
        
    value = value.item()

    if '-' in value: value = value.split(' - ')[0] # If it's a range, get the first value (the lowest)
    if verbose: print("value: ", value)
    value = value.replace('$', '').strip() # remove the $ sign
    if verbose: print("value without $: ", value)
    return df_query.iloc[0]['Type_en'], value # Returns a 2-element tuple with the type of out and its value

In [4]:
# Value of a random purchase
randompurchase = get_randompurchase(df_purchases) # Random line of df_purchases
print(randompurchase)
print(get_value(randompurchase, verbose=True)) # the purchase row must be a list instead of a pd.Series... (sorry!)

[Timestamp('2022-12-28 00:26:22'), 1672187182, 'S3***-S5HG', '命悬一线武器箱', 'Clutch Case', 'Cases', 2.5, 'R8 左轮手枪 | 稳', 'R8 左轮手枪 ', '', ' 稳', '', '', '']
Purchase:  [Timestamp('2022-12-28 00:26:22'), 1672187182, 'S3***-S5HG', '命悬一线武器箱', 'Clutch Case', 'Cases', 2.5, 'R8 左轮手枪 | 稳', 'R8 左轮手枪', '', '稳', '', '', '']
It's likely a weapon skin
The weapon has the grade Normal.


Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Link_zh
1185,Pistols,手枪,R8 Revolver,R8 左轮手枪,https://wiki.cs.money/_next/static/images/r8-r...,https://wiki.cs.money/weapons/r8-revolver,Grip,稳,"[StatTrak™, Mil-Spec]","[nan, 3]",$ 0.05 - $ 0.24,$ 0.10 - $ 0.84,,https://wiki.cs.money/weapons/r8-revolver/grip,[The Clutch Collection],[命悬一线收藏品],[https://wiki.cs.money/collections/the-clutch-...,[https://wiki.cs.money/zh/collections/the-clut...,


value:  1185    $ 0.05 - $ 0.24
Name: Value, dtype: object
value:  $ 0.05
value without $:  0.05
('Pistols', '0.05')


In [5]:
# try to find the value of a STAR outcome
print(list(df_purchases.iloc[389710]))
get_value(['2023-01-02 22:38:47', 1672699127, 'AT***-THGN', '反冲武器箱', 'Recoil Case', 'Cases', 2.5, '运动手套（★） | 夜行衣', '运动手套 ', '★', ' 夜行衣', '', '', ''], verbose=True)

[Timestamp('2023-01-02 22:38:47'), 1672699127, 'AT***-THGN', '反冲武器箱', 'Recoil Case', 'Cases', 2.5, '运动手套（★） | 夜行衣', '运动手套 ', '★', ' 夜行衣', '', '', '']
Purchase:  ['2023-01-02 22:38:47', 1672699127, 'AT***-THGN', '反冲武器箱', 'Recoil Case', 'Cases', 2.5, '运动手套（★） | 夜行衣', '运动手套', '★', '夜行衣', '', '', '']
Item with a star! ★
运动手套（★）
value:  12    $ 215.22 - $ 5 089.54
Name: Value, dtype: object
value:  $ 215.22
value without $:  215.22


('Gloves', '215.22')
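
`get_value` returns the lower bound as a string. The upper bound printed above (`$ 5 089.54`) uses a space as thousands separator, so calling `float()` directly on such a value would fail. A minimal sketch of a parser follows, assuming every value looks like `$ a - $ b` or `$ a`; `parse_value_range` is a hypothetical helper, not part of the notebook.

```python
import math

def parse_value_range(text):
    """Hypothetical helper: turn '$ 215.22 - $ 5 089.54' into a (low, high) tuple of floats."""
    if not isinstance(text, str) or not text.strip():
        return (math.nan, math.nan)  # missing values (NaN or empty string) stay missing
    amounts = []
    for part in text.split(" - "):
        # Drop the dollar sign and any (non-breaking) spaces used as thousands separators
        cleaned = part.replace("$", "").replace("\u00a0", "").replace(" ", "")
        amounts.append(float(cleaned))
    # A single price is treated as a range whose bounds are equal
    return (amounts[0], amounts[-1])

print(parse_value_range("$ 215.22 - $ 5 089.54"))
print(parse_value_range("$ 2.88"))
```

Keeping both bounds as floats makes it possible to sum or average `out_value` later without another conversion step.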

In [6]:
# another star item
print(list(df_purchases.iloc[385923]))
get_value(['2023-01-02 19:17:34', 1672687054, 'AV***-LYSJ', '命悬一线武器箱', 'Clutch Case', 'Cases', 2.5, '裹手（★） | 森林色调', '裹手 ', '★', ' 森林色调', '', '', ''], verbose=True)

[Timestamp('2023-01-02 19:17:34'), 1672687054, 'AV***-LYSJ', '命悬一线武器箱', 'Clutch Case', 'Cases', 2.5, '裹手（★） | 森林色调', '裹手 ', '★', ' 森林色调', '', '', '']
Purchase:  ['2023-01-02 19:17:34', 1672687054, 'AV***-LYSJ', '命悬一线武器箱', 'Clutch Case', 'Cases', 2.5, '裹手（★） | 森林色调', '裹手', '★', '森林色调', '', '', '']
Item with a star! ★
裹手（★）
value:  Series([], Name: Value, dtype: object)
No results found


('unknown', nan)

Things to consider:
* The weapon 'CZ75' (in df_purchases) appears as 'CZ75-Auto' (English) or 'CZ75 自动手枪' (Chinese) in df_out. This case has to be handled explicitly.
* There are a few duplicates in df_out (like the sticker Natus Vincere for Rio 2022). We cannot directly drop duplicates because some columns contain lists.
* We don't have information about the wear level of a particular purchase. The Value range depends on the wear level. The wear levels and their float ranges are:

    * Factory New (0.00 – 0.07)
    * Minimal Wear (0.07 – 0.15)
    * Field-Tested (0.15 – 0.38)
    * Well-Worn (0.38 – 0.45)
    * Battle-Scarred (0.45 – 1.00)
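
For the duplicates mentioned in the list above, one possible workaround is to detect repeated rows on a string copy of the dataframe, since lists are not hashable but their string form is. This is only a sketch on made-up rows; `df_demo` is a hypothetical stand-in for `df_out`.

```python
import pandas as pd

# Made-up rows: the first two are identical, including the list-valued column
df_demo = pd.DataFrame({
    "Weapon_zh": ["Natus Vincere", "Natus Vincere", "MIBR"],
    "Skin_Name_zh": ["2022年里约热内卢锦标赛"] * 3,
    "Rarity": [[3, 4, 5], [3, 4, 5], [3, 4, 5]],
})

# Compare rows by their string representation, then filter the original frame
is_duplicate = df_demo.astype(str).duplicated()
df_dedup = df_demo.loc[~is_duplicate].reset_index(drop=True)
print(len(df_demo), "->", len(df_dedup))
```

The list columns in `df_dedup` are still lists, because the string copy is only used to build the boolean mask.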

In [7]:
# Calculates the out_value for a sample of 1000 purchases
df_purchases_value = df_purchases.sample(1000) # 50 samples / sec
#df_purchases_value # get a slice

df_purchases_value[['out_type', 'out_value']] = df_purchases_value.apply(lambda row: pd.Series(get_value(list(row), verbose=False)), axis=1)

In [8]:
## Take the last 100000 purchases and save them to a pickle file (this can take ~36 minutes)
df_purchases_value = df_purchases.iloc[-100000:]
df_purchases_value[['out_type', 'out_value']] = df_purchases_value.apply(lambda row: pd.Series(get_value(list(row), verbose=False)), axis=1)
df_purchases_value.to_pickle("./df_purchases_value.pkl")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_purchases_value[['out_type', 'out_value']] = df_purchases_value.apply(lambda row: pd.Series(get_value(list(row), verbose=False)), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_purchases_value[['out_type', 'out_value']] = df_purchases_value.apply(lambda row: pd.Series(get_value(list(row), verbose=False)), axis=1)
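
The warning above appears because `df_purchases.iloc[-100000:]` is a slice of `df_purchases`, and new columns are then assigned to it. Taking an explicit `.copy()` of the slice is the usual way to avoid it. A small sketch on toy data follows; `get_value_demo` is a made-up stand-in for `get_value`.

```python
import pandas as pd

def get_value_demo(row):
    """Made-up stand-in for get_value: returns an (out_type, out_value) tuple."""
    return ("Pistols", "0.05")

df_demo = pd.DataFrame({"out": ["a", "b", "c", "d"]})

# .copy() makes the slice an independent dataframe, so adding columns is unambiguous
df_tail = df_demo.iloc[-2:].copy()
df_tail[["out_type", "out_value"]] = df_tail.apply(
    lambda row: pd.Series(get_value_demo(list(row))), axis=1
)
print(df_tail)
```

The original `df_demo` is left untouched; only the copied slice gains the two new columns.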
