# Translation and Price scraping

This notebook is essentially a cleaner version of features.ipynb

From https://wiki.cs.money

Translating lootbox content (Chinese to English)

This website seems to have all the names in English and Chinese, with the estimated value for each outcome: https://wiki.cs.money/capsules

e.g.: 印花 | apEX（闪耀） | 2022年安特卫普锦标赛 --> Sticker | apEX (Glitter) | Antwerp 2022

Maybe we could find a way to scrape the data? For instance, to search: https://wiki.cs.money/search?q=2022%E5%B9%B4%E9%87%8C%E7%BA%A6%E7%83%AD%E5%86%85%E5%8D%A2%E9%94%A6%E6%A0%87%E8%B5%9B%E5%86%A0%E5%86%9B%E4%BA%B2%E7%AC%94%E7%AD%BE%E5%90%8D%E8%83%B6%E5%9B%8A
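
The search URL above is simply the Chinese item name, percent-encoded as UTF-8, appended to the `q` parameter. A minimal sketch of building such a URL follows. It assumes the `search?q=` endpoint accepts any item name in the same way (only this one URL appears above, so that is a guess), and `build_search_url` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import quote

# Endpoint taken from the example URL above
SEARCH_URL = "https://wiki.cs.money/search?q="

def build_search_url(item_name):
    # Hypothetical helper: percent-encode the name (UTF-8) and append it to the query
    return SEARCH_URL + quote(item_name, safe="")

# Chinese name of the capsule used in the example search above
print(build_search_url("2022年里约热内卢锦标赛冠军亲笔签名胶囊"))
```

The resulting page would still have to be fetched and parsed, which is what the rest of this notebook does for the category pages.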

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Scraping lootbox outcomes (`df_out`)

### Scraping weapon skins (`df_out_weaponskins`)

#### Scraping weapon types

The idea is to first scrape the weapon types in English, then in Chinese, and then merge the two dataframes (`df_weapontypes_en` and `df_weapontypes_zh`) into a single one: `df_weapontypes`. This will be the skeleton on which the next dataframes are built, since each weapon skin corresponds to one of these weapon types.

Using BeautifulSoup, scrape the weapon types in English to `df_weapontypes_en`.

In [56]:
# Make a request to the website
url = 'https://wiki.cs.money'
response = requests.get(url)

# Parse the HTML of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Initialize dataframe
df_weapontypes_en = pd.DataFrame(columns=['Type_en', 'Weapon_en', 'Image', 'Link'])

# Find the element containing all the data
data_element = soup.find('div', class_='vlcjlpdimswrcghpmhnvuiucwc')

# Find all the elements containing the weapon type
type_elements = data_element.find_all('div', class_='yubwxbvslulbauytxdhyofbyjh')

# Iterate over the weapon types
for type_element in type_elements:

    type_name_element = type_element.find('div', class_='dhpekapuyjhykzpqkkagxicxzl')
    type_ = type_name_element.get_text()
    
    # Find all the elements containing a weapon
    weapon_elements = type_element.find_all('div', class_='brzpbogxsgrlikcnlwpafrzdyt')
    
    # Iterate over the weapons
    for weapon_element in weapon_elements:
        # Find the elements containing the weapon name, link, and image
        name_element = weapon_element.find('div', class_='yuevqsquaveulqpticinqpvght')
        link_element = weapon_element.find('a', class_='itixalfiylvsvmdssbpcdzfawb')
        image_element = weapon_element.find('img', class_='djzvxzseplffnklcorsqbipwky')
        
        # Extract the data from the elements
        name = name_element.get_text()
        link = link_element['href']
        link = url+link
        image = image_element['src']
        image = url+image
        
        # Print the data
        #print(type_, name, link, image)
        df_weapontypes_en.loc[len(df_weapontypes_en)] = [type_, name, image, link]

display(df_weapontypes_en.sample(5))

Unnamed: 0,Type_en,Weapon_en,Image,Link
19,Rifles,AK-47,https://wiki.cs.money/_next/static/images/ak-4...,https://wiki.cs.money/weapons/ak-47
17,Knives,★ Navaja Knife,https://wiki.cs.money/_next/static/images/nava...,https://wiki.cs.money/weapons/navaja-knife
20,Rifles,AWP,https://wiki.cs.money/_next/static/images/awp-...,https://wiki.cs.money/weapons/awp
28,Rifles,SCAR-20,https://wiki.cs.money/_next/static/images/scar...,https://wiki.cs.money/weapons/scar-20
11,Knives,★ Survival Knife,https://wiki.cs.money/_next/static/images/surv...,https://wiki.cs.money/weapons/survival-knife


Scrape the weapon types in Chinese to `df_weapontypes_zh`.

In [57]:
# Make a request to the website
url = 'https://wiki.cs.money'
suffix = '/zh'
response = requests.get(url+suffix)

# Parse the HTML of the page
soup = BeautifulSoup(response.text, 'html.parser')


# Initialize dataframe
df_weapontypes_zh = pd.DataFrame(columns=['Type_zh', 'Weapon_zh', 'Image_zh', 'Link_zh'])

# Find the element containing all the data
data_element = soup.find('div', class_='vlcjlpdimswrcghpmhnvuiucwc')

# Find all the elements containing the weapon type
type_elements = data_element.find_all('div', class_='yubwxbvslulbauytxdhyofbyjh')

# Iterate over the weapon types
for type_element in type_elements:

    type_name_element = type_element.find('div', class_='dhpekapuyjhykzpqkkagxicxzl')
    type_ = type_name_element.get_text()
    
    # Find all the elements containing a weapon
    weapon_elements = type_element.find_all('div', class_='brzpbogxsgrlikcnlwpafrzdyt')
    
    # Iterate over the weapons
    for weapon_element in weapon_elements:
        # Find the elements containing the weapon name, link, and image
        name_element = weapon_element.find('div', class_='yuevqsquaveulqpticinqpvght')
        link_element = weapon_element.find('a', class_='itixalfiylvsvmdssbpcdzfawb')
        image_element = weapon_element.find('img', class_='djzvxzseplffnklcorsqbipwky')
        
        # Extract the data from the elements
        name = name_element.get_text()
        link = link_element['href']
        link = url+link
        image = image_element['src']
        image = url+image
        
        # Print the data
        #print(type_, name, link, image)
        df_weapontypes_zh.loc[len(df_weapontypes_zh)] = [type_, name, image, link]

display(df_weapontypes_zh.sample(5))

Unnamed: 0,Type_zh,Weapon_zh,Image_zh,Link_zh
21,步枪,M4A4,https://wiki.cs.money/_next/static/images/m4a4...,https://wiki.cs.money/zh/weapons/m4a4
0,刀,爪子刀（★）,https://wiki.cs.money/_next/static/images/kara...,https://wiki.cs.money/zh/weapons/karambit
39,手枪,双持贝瑞塔,https://wiki.cs.money/_next/static/images/dual...,https://wiki.cs.money/zh/weapons/dual-berettas
10,刀,流浪者匕首（★）,https://wiki.cs.money/_next/static/images/noma...,https://wiki.cs.money/zh/weapons/nomad-knife
34,手枪,FN57,https://wiki.cs.money/_next/static/images/five...,https://wiki.cs.money/zh/weapons/five-seven
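
The English and Chinese cells above differ only in the URL suffix and the column names, so the parsing step could be shared. The following is a sketch of that idea, not the notebook's actual code: `parse_weapon_types` is a hypothetical helper, it works on an HTML string so it can be tried without network access, and it reuses the obfuscated class names from the cells above, which may change whenever the site is rebuilt.

```python
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://wiki.cs.money"

def parse_weapon_types(html, base_url=BASE_URL):
    # Hypothetical helper: extract (type, weapon, image, link) rows from the landing page HTML
    soup = BeautifulSoup(html, "html.parser")
    columns = ["Type", "Weapon", "Image", "Link"]
    container = soup.find("div", class_="vlcjlpdimswrcghpmhnvuiucwc")
    if container is None:
        # Page layout changed or the page did not load: return an empty frame
        return pd.DataFrame(columns=columns)

    rows = []
    for type_el in container.find_all("div", class_="yubwxbvslulbauytxdhyofbyjh"):
        type_name = type_el.find("div", class_="dhpekapuyjhykzpqkkagxicxzl").get_text()
        for weapon_el in type_el.find_all("div", class_="brzpbogxsgrlikcnlwpafrzdyt"):
            name = weapon_el.find("div", class_="yuevqsquaveulqpticinqpvght").get_text()
            link = weapon_el.find("a", class_="itixalfiylvsvmdssbpcdzfawb")["href"]
            image = weapon_el.find("img", class_="djzvxzseplffnklcorsqbipwky")["src"]
            rows.append({"Type": type_name, "Weapon": name,
                         "Image": base_url + image, "Link": base_url + link})
    # Building the frame once from a list avoids growing it row by row with .loc
    return pd.DataFrame(rows, columns=columns)
```

With `requests`, the English frame would come from `parse_weapon_types(requests.get(BASE_URL).text)` and the Chinese one from the `/zh` page, followed by a column rename such as `add_suffix('_zh')`.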


Merge the two weapon type dataframes into `df_weapontypes`.

In [58]:
#df = pd.DataFrame(columns=['Type_en', 'Weapon_en', 'Image', 'Link'])
df_weapontypes  = pd.concat([df_weapontypes_en, df_weapontypes_zh], sort=False, axis=1)
df_weapontypes = df_weapontypes[['Type_en', 'Type_zh', 'Weapon_en', 'Weapon_zh', 'Image', 'Link']]
display(df_weapontypes)

# no longer needed, their contents are in df_weapontypes
del df_weapontypes_zh
del df_weapontypes_en

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link
0,Knives,刀,★ Karambit,爪子刀（★）,https://wiki.cs.money/_next/static/images/kara...,https://wiki.cs.money/weapons/karambit
1,Knives,刀,★ M9 Bayonet,M9 刺刀（★）,https://wiki.cs.money/_next/static/images/m9-b...,https://wiki.cs.money/weapons/m9-bayonet
2,Knives,刀,★ Butterfly Knife,蝴蝶刀（★）,https://wiki.cs.money/_next/static/images/butt...,https://wiki.cs.money/weapons/butterfly-knife
3,Knives,刀,★ Talon Knife,锯齿爪刀（★）,https://wiki.cs.money/_next/static/images/talo...,https://wiki.cs.money/weapons/talon-knife
4,Knives,刀,★ Skeleton Knife,骷髅匕首（★）,https://wiki.cs.money/_next/static/images/skel...,https://wiki.cs.money/weapons/skeleton-knife
5,Knives,刀,★ Classic Knife,海豹短刀（★）,https://wiki.cs.money/_next/static/images/clas...,https://wiki.cs.money/weapons/classic-knife
6,Knives,刀,★ Bayonet,刺刀（★）,https://wiki.cs.money/_next/static/images/bayo...,https://wiki.cs.money/weapons/bayonet
7,Knives,刀,★ Stiletto Knife,短剑（★）,https://wiki.cs.money/_next/static/images/stil...,https://wiki.cs.money/weapons/stiletto-knife
8,Knives,刀,★ Ursus Knife,熊刀（★）,https://wiki.cs.money/_next/static/images/ursu...,https://wiki.cs.money/weapons/ursus-knife
9,Knives,刀,★ Paracord Knife,系绳匕首（★）,https://wiki.cs.money/_next/static/images/para...,https://wiki.cs.money/weapons/paracord-knife
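
Note that `pd.concat(..., axis=1)` pairs rows purely by position, so the merge above is only correct if both language pages list the weapons in the same order. A more defensive sketch, assuming the last path component of the link (e.g. `karambit`) identifies the same weapon in both languages, joins on that slug instead. `slug_from_link` and `merge_by_slug` are hypothetical names.

```python
import pandas as pd

def slug_from_link(link):
    # Last path component, e.g. ".../zh/weapons/aug" -> "aug"
    return link.rstrip("/").rsplit("/", 1)[-1]

def merge_by_slug(df_en, df_zh):
    # Add a language-independent key to both frames
    en = df_en.assign(slug=df_en["Link"].map(slug_from_link))
    zh = df_zh.assign(slug=df_zh["Link_zh"].map(slug_from_link))
    # validate raises if a slug is duplicated on either side
    merged = en.merge(zh[["slug", "Type_zh", "Weapon_zh"]],
                      on="slug", how="left", validate="one_to_one")
    return merged[["Type_en", "Type_zh", "Weapon_en", "Weapon_zh", "Image", "Link"]]
```

Rows without a Chinese counterpart would then show up as missing values in `Type_zh` and `Weapon_zh` rather than being silently paired with the wrong weapon.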


#### Scraping weapon skins


##### Helper functions

###### Smooth scrolling `smooth_scroll()`
Some parts of the website load their content through infinite scrolling, so a few helper functions are needed to scrape that data. Google Chrome must be installed on the system for this task.

In [2]:
## Smooth scroller for webs that load with infinite scrolling
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

def smooth_scroll(url):
    chrome_options = Options()
    chrome_options.add_argument("--incognito")
    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        time.sleep(2)  # Wait for the website to load

        # Bottom edge of the viewport, in page coordinates
        last_height = driver.execute_script("return window.pageYOffset + window.innerHeight")

        time.sleep(2)
        scroll_distance = 10
        reached_page_end = False

        while not reached_page_end:
            driver.execute_script(f"window.scrollBy(0, {scroll_distance});")
            time.sleep(0.5)  # Give lazily loaded items time to appear
            new_height = driver.execute_script("return window.pageYOffset + window.innerHeight")

            if last_height == new_height:
                # The viewport did not move, so the bottom of the page was reached
                reached_page_end = True
                print("reached end")
            else:
                last_height = new_height
                scroll_distance += 5  # Scroll a little further on each step
        html = driver.page_source
    finally:
        driver.quit()  # Always release the browser, even if scraping fails
    return html
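
The stop condition in `smooth_scroll()` ends the loop the first time the scroll position does not change, which can happen too early when new items load slowly. The sketch below isolates that loop and requires several unchanged readings in a row before stopping. It is an illustration under assumptions, not a drop-in replacement: `scroll_until_stable`, `patience`, and `max_steps` are hypothetical names, and the browser is abstracted behind two callables so the logic can be tried without Selenium.

```python
import time

def scroll_until_stable(scroll_by, get_position, patience=3, pause=0.5,
                        start_step=10, step_growth=5, max_steps=10_000):
    """Scroll until the position stays unchanged `patience` times in a row.

    scroll_by(distance) scrolls the page; get_position() returns the current
    scroll position. Returns the number of scroll steps performed.
    """
    last = get_position()
    unchanged = 0
    step = start_step
    for n in range(max_steps):
        scroll_by(step)
        time.sleep(pause)  # give lazily loaded content time to appear
        current = get_position()
        if current == last:
            unchanged += 1
            if unchanged >= patience:
                return n + 1
        else:
            unchanged = 0
            last = current
            step += step_growth
    return max_steps  # safety net against a page that never stops growing
```

With Selenium, the two callables could wrap `driver.execute_script`, for example `lambda d: driver.execute_script(f"window.scrollBy(0, {d});")` and `lambda: driver.execute_script("return window.pageYOffset + window.innerHeight")`.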

###### Convert a URL from the English to the Chinese version `convert_url()`
Some parts of the website share a URL between the English and Chinese versions, differing only in a short path segment that indicates the language. This function converts the English version of a URL to the Chinese one.

For instance, https://wiki.cs.money/weapons/aug --> https://wiki.cs.money/zh/weapons/aug.

In [3]:
import re

# Function to handle the URL language conversion
def convert_url(url, language_code):
    # Split the URL into its components
    protocol, domain_and_path = url.split("://", 1)
    domain, _, path = domain_and_path.partition("/")  # path is "" for a bare domain

    # Check if the path starts with a language code.
    # Assumption: language prefixes are two lowercase letters (e.g. "zh").
    if re.match(r"[a-z]{2}(/|$)", path):
        # If it does, replace the language code with the desired language code
        modified_path = re.sub(r"^[a-z]{2}", language_code, path, count=1)
    else:
        # If it doesn't, add the desired language code at the beginning of the path
        modified_path = f"{language_code}/{path}"

    # Reassemble the URL
    modified_url = f"{protocol}://{domain}/{modified_path}"

    return modified_url

###### Scrape skin info `get_skin_info()`

Scrapes the skin information from a given URL of the website, for both the English and Chinese versions of the page. Returns a dataframe: `df_skins`

In [4]:
# Returns df with info about specific weapon, from an URL
def get_skin_info(url):

    # Make a request to the website
    baseurl = 'https://wiki.cs.money'

    
    # Parse the HTML of the page
    ## First, check if it's one of these pages that require infinite scrolling or not
    needs_infinitescroll = ['regular-stickers', 'tournament-stickers', 'patches', 'graffiti', 'pins']
    if any(string in url for string in needs_infinitescroll):
        soup = BeautifulSoup(smooth_scroll(url), 'html.parser')
    else:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')


    # Initialize dataframe
    df_skins = pd.DataFrame(columns=['Weapon_en', 'Skin_Name', 'Grade', 'Rarity', 'Value', 'Value_Stattrak', 'Value_Souvenir', 'Skin_Link', 'Found_in', 'Found_in_Link'])

    # Find the element containing all the data
    data_element = soup.find('div', class_='gasovxczmdwrpzliptyovkjrjp')

    # In some rare cases the container is missing; treat that as "no skins" instead of failing
    if data_element is not None:
        skin_elements = data_element.find_all('div', class_='kxmatkcipwonxvwweiqqdoumxg')
    else:
        skin_elements = []

    grade_rarity = {
        'Consumer Grade':1, 
        'Base Grade':1,
        'Default':1,
        'Normal':1,
        'Industrial Grade':2,
        'Mil-Spec':3,
        'High Grade':3,
        'Distinguished':3,
        'Restricted':4,
        'Remarkable':4,
        'Exceptional':4,
        'Classified':5,
        'Exotic':5,
        'Superior':5,
        'Covert':6,
        'Extraordinary':6,
        'Master':6,
        'Rare Special (★)':7,
        'Knife':7,
        'Gloves':7,
        'Contraband':8,
        '★ StatTrak™':4,
        'StatTrak™':np.nan,
        'Souvenir':np.nan
    }

    # Iterate over the skins
    for skin_element in skin_elements:
        weapon_element = skin_element.find('div', class_='szvsuisjrrqalciyqqzoxoaubw') 
        name_element = skin_element.find('div', class_='zhqwubnajobxbgkzlnptmjmgwn')
        link_element = skin_element.find('a', class_='blzuifkxmlnzwzwpwjzrrtwcse')
        grade_element = skin_element.find_all('div', class_='nwdmbwsohrhpxvdldicoixwfed') # Each grade is associated with a color. Can be a single element or (more commonly) a list of elements.

        foundin_element = skin_element.find('div', class_='vercanrflftqkxuojwkgkgsiak') # Can either be a single item (e.g. collection) or a list of cases where it can be found. Can also be empty
        foundin_item_elements = skin_element.find_all('li', class_='ovgyowuqvvytpzvkyvdijroyub') # sometimes foundin_element might contain more than one subitem
        foundin_link_elements = skin_element.find_all('a', class_='wszrcfrvpibgonenagkdwmyscg')  # There is a link for each element of "found in"

        value_element = skin_element.find('div', class_='ribvzntfjepldppjrgkwabviqq') #might be empty
        value_stattrak_element = skin_element.find('div', class_='ribvzntfjepldppjrgkwabviqq dqyspyiikxcwupdhcrxpbwlide') #might be empty
        value_souvenir_element = skin_element.find('div', class_='ribvzntfjepldppjrgkwabviqq sxupcdgqjermeuhxnaqmawnkmi') #might be empty
                
        # Extract the data from the elements
        weapon = weapon_element.get_text()
        name = name_element.get_text()
        #grade = grade_element[1]['title'] if len(grade_element) > 1 else grade_element[0]['title']
        grade = [element.get('title') for element in grade_element]
        #rarity = grade_rarity[grade]
        rarity = [grade_rarity[g] for g in grade] # will also be a list: a skin or object can appear in different grades
        if "★" in weapon: rarity[-1] +=1 #Increase the grade by 1 only to the last element of the list (the first one is the "normal" object).
        link = link_element['href']
        link = baseurl+link

        foundin = [foundin_element.get_text()] if foundin_element else []
        item_list = [item.get_text() for item in foundin_item_elements] # if there's more than one
        if item_list: foundin = item_list
        foundinlink = [baseurl+link['href'] for link in foundin_link_elements] # convert links for "Found in" items to list
        
        value = value_element.get_text() if value_element is not None else ""
        value_stattrak = value_stattrak_element.get_text() if value_stattrak_element is not None else ""
        value_souvenir = value_souvenir_element.get_text() if value_souvenir_element is not None else ""

        # Append to dataframe
        df_skins.loc[len(df_skins)] = [weapon, name, grade, rarity, value, value_stattrak, value_souvenir, link, foundin, foundinlink]



    #### Repeat the exact same procedure, but for the Chinese version of the site. We do not need to get the value again.
    # Make a request to the website
    # Extract the subdomain, domain, and path from the original URL
    lang = 'zh'
    url = convert_url(url, lang)
    baseurl = 'https://wiki.cs.money'
    print(url)
    
    # Parse the HTML of the page
    ## First, check if it's one of these pages that require infinite scrolling or not
    needs_infinitescroll = ['regular-stickers', 'tournament-stickers', 'patches', 'graffiti', 'pins']
    if any(string in url for string in needs_infinitescroll):
        soup = BeautifulSoup(smooth_scroll(url), 'html.parser')
    else:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

    # Initialize dataframe
    df_skins_zh = pd.DataFrame(columns=['Weapon_zh', 'Skin_Name_zh', 'Found_in_zh', 'Found_in_Link_zh'])

    # Find the element containing all the data
    data_element = soup.find('div', class_='gasovxczmdwrpzliptyovkjrjp')

    #skin_elements = data_element.find_all('div', class_='kxmatkcipwonxvwweiqqdoumxg') # In some rare cases, it can be empty
    if data_element is not None:
        skin_elements = data_element.find_all('div', class_='kxmatkcipwonxvwweiqqdoumxg')
    else:
        skin_elements = []
    
    # Iterate over the skins
    for skin_element in skin_elements:
        weapon_element = skin_element.find('div', class_='szvsuisjrrqalciyqqzoxoaubw') 
        name_element = skin_element.find('div', class_='zhqwubnajobxbgkzlnptmjmgwn')
        foundin_element = skin_element.find('div', class_='vercanrflftqkxuojwkgkgsiak')
        foundin_item_elements = skin_element.find_all('li', class_='ovgyowuqvvytpzvkyvdijroyub') # sometimes foundin_element might contain more than one subitem
        foundin_link_elements = skin_element.find_all('a', class_='wszrcfrvpibgonenagkdwmyscg')  # There is a link for each element of "found in"

        # Extract the data from the elements
        weapon = weapon_element.get_text()
        name = name_element.get_text()
        #if foundin_element: foundin = [foundin_element.get_text()] # In case there's just one element
        foundin = [foundin_element.get_text()] if foundin_element else []
        item_list = [item.get_text() for item in foundin_item_elements] # If there's more than one
        if item_list: foundin = item_list

        foundinlink = [baseurl+link['href'] for link in foundin_link_elements] # convert links for "Found in" items to list

        # Append to dataframe
        df_skins_zh.loc[len(df_skins_zh)] = [weapon, name, foundin, foundinlink]

    # And now concatenate the English and Chinese skin df into one
    df_skins  = pd.concat([df_skins, df_skins_zh], sort=False, axis=1)
    df_skins = df_skins[['Weapon_en', 'Weapon_zh', 'Skin_Name', 'Skin_Name_zh', 'Grade', 'Rarity', 'Value', 'Value_Stattrak', 'Value_Souvenir', 'Skin_Link', 'Found_in', 'Found_in_zh', 'Found_in_Link', 'Found_in_Link_zh']]
    

    return df_skins
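
To make the rarity rule inside `get_skin_info()` concrete, here is a small standalone sketch of just that step. It uses a subset of the `grade_rarity` mapping and `math.nan` in place of `np.nan`; `rarity_list` is a hypothetical name introduced for illustration.

```python
import math

# Subset of the grade_rarity mapping used in get_skin_info()
GRADE_RARITY = {
    'Consumer Grade': 1,
    'Covert': 6,
    '★ StatTrak™': 4,
    'StatTrak™': math.nan,
    'Souvenir': math.nan,
}

def rarity_list(weapon, grades):
    # One rarity value per grade label found on the skin card
    rarity = [GRADE_RARITY[g] for g in grades]
    # Knives (marked with ★) get +1 on the last entry only,
    # which is the grade of the item itself
    if rarity and "★" in weapon:
        rarity[-1] += 1
    return rarity

print(rarity_list("★ M9 Bayonet", ["★ StatTrak™", "Covert"]))     # [4, 7]
print(rarity_list("Five-SeveN", ["Souvenir", "Consumer Grade"]))  # [nan, 1]
```

These two results match the `Rarity` column that the scraper produces for the same skins.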

##### Actual scraping of the skins

Loop through the weapon types and scrape all the skins for each weapon. Save the results into a dataframe: `df_weaponskins`.

In [62]:
# This loop will get the skins for each one of the 53 weapons in df_weapontypes, both in English and in Chinese. Can take a while.
df_weaponskins = pd.DataFrame()
for weapon in df_weapontypes['Link']:
    print(weapon)
    df_weaponskins = pd.concat([df_weaponskins, get_skin_info(weapon)], axis=0)

display(df_weaponskins.sample(5))

https://wiki.cs.money/weapons/karambit
https://wiki.cs.money/zh/weapons/karambit
https://wiki.cs.money/weapons/m9-bayonet
https://wiki.cs.money/zh/weapons/m9-bayonet
https://wiki.cs.money/weapons/butterfly-knife
https://wiki.cs.money/zh/weapons/butterfly-knife
https://wiki.cs.money/weapons/talon-knife
https://wiki.cs.money/zh/weapons/talon-knife
https://wiki.cs.money/weapons/skeleton-knife
https://wiki.cs.money/zh/weapons/skeleton-knife
https://wiki.cs.money/weapons/classic-knife
https://wiki.cs.money/zh/weapons/classic-knife
https://wiki.cs.money/weapons/bayonet
https://wiki.cs.money/zh/weapons/bayonet
https://wiki.cs.money/weapons/stiletto-knife
https://wiki.cs.money/zh/weapons/stiletto-knife
https://wiki.cs.money/weapons/ursus-knife
https://wiki.cs.money/zh/weapons/ursus-knife
https://wiki.cs.money/weapons/paracord-knife
https://wiki.cs.money/zh/weapons/paracord-knife
https://wiki.cs.money/weapons/nomad-knife
https://wiki.cs.money/zh/weapons/nomad-knife
https://wiki.cs.money/weapons

Unnamed: 0,Weapon_en,Weapon_zh,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh
6,★ M9 Bayonet,M9 刺刀（★）,Gamma Doppler Phase 2,伽玛多普勒阶段 2,"[★ StatTrak™, Covert]","[4, 7]",$ 1 399.33 - $ 1 879.72,$ 1 719.08,,https://wiki.cs.money/weapons/m9-bayonet/gamma...,"[Gamma 2 Case, Gamma Case]","[伽玛 2 号武器箱, 伽玛武器箱]","[https://wiki.cs.money/cases/gamma-2-case, htt...","[https://wiki.cs.money/zh/cases/gamma-2-case, ..."
32,Five-SeveN,FN57,Forest Night,暮色森林,"[Souvenir, Consumer Grade]","[nan, 1]",$ 0.03 - $ 0.13,,$ 0.26 - $ 6.77,https://wiki.cs.money/weapons/five-seven/fores...,[The Train Collection],[列车停放站收藏品],[https://wiki.cs.money/collections/the-train-c...,[https://wiki.cs.money/zh/collections/the-trai...
24,SSG 08,SSG 08,Acid Fade,渐变强酸,"[Souvenir, Mil-Spec]","[nan, 3]",$ 1.62,,,https://wiki.cs.money/weapons/ssg-08/acid-fade,[The Safehouse Collection],[安全处所收藏品],[https://wiki.cs.money/collections/the-safehou...,[https://wiki.cs.money/zh/collections/the-safe...
25,Glock-18,格洛克 18 型,Steel Disruption,钢铁禁锢,"[StatTrak™, Restricted]","[nan, 4]",$ 2.13 - $ 2.92,$ 4.89 - $ 6.81,,https://wiki.cs.money/weapons/glock-18/steel-d...,[The eSports 2014 Summer Collection],[电竞 2014 夏季收藏品],[https://wiki.cs.money/collections/the-esports...,[https://wiki.cs.money/zh/collections/the-espo...
0,R8 Revolver,R8 左轮手枪,Fade,渐变之色,"[StatTrak™, Covert]","[nan, 6]",$ 4.73 - $ 14.26,$ 15.14 - $ 55.36,,https://wiki.cs.money/weapons/r8-revolver/fade,[The Revolver Case Collection],[左轮武器箱收藏品],[https://wiki.cs.money/collections/the-revolve...,[https://wiki.cs.money/zh/collections/the-revo...
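
The `Value` columns are stored as display strings such as `$ 1 399.33 - $ 1 879.72`, which cannot be compared or averaged directly. A possible way to turn them into numbers is sketched below. It assumes that spaces inside a number are thousands separators and that a range always lists the lower bound first, as in the sample rows above; `parse_price_range` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_price_range(text):
    """Return (low, high) in dollars, or (None, None) for an empty value."""
    if not isinstance(text, str) or not text.strip():
        return (None, None)
    # Drop all whitespace (including non-breaking spaces): "$1399.33-$1879.72"
    cleaned = re.sub(r"\s+", "", text)
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", cleaned)]
    if not numbers:
        return (None, None)
    # A single price is returned as a degenerate range
    return (numbers[0], numbers[-1])

print(parse_price_range("$ 1 399.33 - $ 1 879.72"))  # (1399.33, 1879.72)
print(parse_price_range("$ 1.62"))                   # (1.62, 1.62)
```

Applied column-wise, e.g. `df['Value'].map(parse_price_range)`, this yields tuples that can be split into separate minimum and maximum columns.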


#### Merge the Weapon types `df_weapontypes` and Weapon Skins `df_weaponskins` dataframes

In [63]:
df_out_weaponskins = df_weapontypes.copy()

df_out_weaponskins = df_out_weaponskins.merge(df_weaponskins, how='outer', left_on=['Weapon_en', 'Weapon_zh'], right_on=['Weapon_en', 'Weapon_zh'])
display(df_out_weaponskins.sample(7))

# no longer needed
del df_weaponskins
del df_weapontypes

# Let's save the df to a file
import os
if not os.path.exists('df_pickles'):
   os.makedirs('df_pickles')

df_out_weaponskins.to_pickle("./df_pickles/df_out_weaponskins.pkl")

Unnamed: 0,Type_en,Type_zh,Weapon_en,Weapon_zh,Image,Link,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh
971,Pistols,手枪,Desert Eagle,沙漠之鹰,https://wiki.cs.money/_next/static/images/dese...,https://wiki.cs.money/weapons/desert-eagle,Cobalt Disruption,钴蓝禁锢,"[StatTrak™, Classified]","[nan, 5]",$ 51.34 - $ 66.16,$ 57.59 - $ 108.25,,https://wiki.cs.money/weapons/desert-eagle/cob...,[eSports 2013 Winter Case],[电竞 2013 冬季武器箱],[https://wiki.cs.money/cases/esports-2013-wint...,[https://wiki.cs.money/zh/cases/esports-2013-w...
868,Rifles,步枪,G3SG1,G3SG1,https://wiki.cs.money/_next/static/images/g3sg...,https://wiki.cs.money/weapons/g3sg1,Ancient Ritual,远古仪式,"[Souvenir, Industrial Grade]","[nan, 2]",$ 0.72 - $ 1.20,,$ 0.06 - $ 0.51,https://wiki.cs.money/weapons/g3sg1/ancient-ri...,[The Ancient Collection],[远古收藏品],[https://wiki.cs.money/collections/the-ancient...,[https://wiki.cs.money/zh/collections/the-anci...
477,Knives,刀,★ Navaja Knife,折刀（★）,https://wiki.cs.money/_next/static/images/nava...,https://wiki.cs.money/weapons/navaja-knife,Stained,人工染色,"[★ StatTrak™, Covert]","[4, 7]",$ 76.42 - $ 79.29,$ 68.75 - $ 166.51,,https://wiki.cs.money/weapons/navaja-knife/sta...,"[Danger Zone Case, Horizon Case]","[“头号特训”武器箱, 地平线武器箱]","[https://wiki.cs.money/cases/danger-zone-case,...",[https://wiki.cs.money/zh/cases/danger-zone-ca...
1126,Pistols,手枪,P2000,P2000,https://wiki.cs.money/_next/static/images/p200...,https://wiki.cs.money/weapons/p2000,Turf,草皮,"[StatTrak™, Mil-Spec]","[nan, 3]",$ 0.10 - $ 0.66,$ 0.24 - $ 2.51,,https://wiki.cs.money/weapons/p2000/turf,[The Glove Collection],[手套收藏品],[https://wiki.cs.money/collections/the-glove-c...,[https://wiki.cs.money/zh/collections/the-glov...
539,Rifles,步枪,AK-47,AK-47,https://wiki.cs.money/_next/static/images/ak-4...,https://wiki.cs.money/weapons/ak-47,Frontside Misty,前线迷雾,"[StatTrak™, Classified]","[nan, 5]",$ 11.36 - $ 54.56,$ 22.37 - $ 129.04,,https://wiki.cs.money/weapons/ak-47/frontside-...,[The Shadow Collection],[暗影收藏品],[https://wiki.cs.money/collections/the-shadow-...,[https://wiki.cs.money/zh/collections/the-shad...
1588,Heavy,重型武器,Negev,内格夫,https://wiki.cs.money/_next/static/images/nege...,https://wiki.cs.money/weapons/negev,Bratatat,*哒哒哒*,"[StatTrak™, Mil-Spec]","[nan, 3]",$ 0.66 - $ 5.00,$ 1.15 - $ 17.63,,https://wiki.cs.money/weapons/negev/bratatat,[The eSports 2014 Summer Collection],[电竞 2014 夏季收藏品],[https://wiki.cs.money/collections/the-esports...,[https://wiki.cs.money/zh/collections/the-espo...
1276,SMGs,微型冲锋枪,UMP-45,UMP-45,https://wiki.cs.money/_next/static/images/ump-...,https://wiki.cs.money/weapons/ump-45,Mechanism,机械装置,"[Souvenir, Industrial Grade]","[nan, 2]",$ 1.24 - $ 1.77,,$ 0.22 - $ 0.48,https://wiki.cs.money/weapons/ump-45/mechanism,[The 2021 Vertigo Collection],[2021 殒命大厦收藏品],[https://wiki.cs.money/collections/the-2021-ve...,[https://wiki.cs.money/zh/collections/the-2021...
41,Knives,刀,★ M9 Bayonet,M9 刺刀（★）,https://wiki.cs.money/_next/static/images/m9-b...,https://wiki.cs.money/weapons/m9-bayonet,Gamma Doppler Phase 2,伽玛多普勒阶段 2,"[★ StatTrak™, Covert]","[4, 7]",$ 1 399.33 - $ 1 879.72,$ 1 719.08,,https://wiki.cs.money/weapons/m9-bayonet/gamma...,"[Gamma 2 Case, Gamma Case]","[伽玛 2 号武器箱, 伽玛武器箱]","[https://wiki.cs.money/cases/gamma-2-case, htt...","[https://wiki.cs.money/zh/cases/gamma-2-case, ..."
706,Rifles,步枪,SG 553,SG 553,https://wiki.cs.money/_next/static/images/sg-5...,https://wiki.cs.money/weapons/sg-553,Tornado,狂哮飓风,"[Normal, Consumer Grade]","[1, 1]",$ 5.49 - $ 17.11,,,https://wiki.cs.money/weapons/sg-553/tornado,[The Assault Collection],[仓库突击收藏品],[https://wiki.cs.money/collections/the-assault...,[https://wiki.cs.money/zh/collections/the-assa...
1400,SMGs,微型冲锋枪,MP9,MP9,https://wiki.cs.money/_next/static/images/mp9-...,https://wiki.cs.money/weapons/mp9,Army Sheen,军队之辉,"[Normal, Consumer Grade]","[1, 1]",$ 0.06 - $ 0.13,,,https://wiki.cs.money/weapons/mp9/army-sheen,[The Control Collection],[控制收藏品],[https://wiki.cs.money/collections/the-control...,[https://wiki.cs.money/zh/collections/the-cont...
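
Because the merge above is an outer join on both the English and the Chinese weapon name, a weapon whose names do not match exactly on both sides ends up in a row full of missing values instead of raising an error. One way to surface such rows is the `indicator` option of `DataFrame.merge`, sketched here. `merge_with_report` is a hypothetical helper name.

```python
import pandas as pd

def merge_with_report(df_types, df_skins, keys=("Weapon_en", "Weapon_zh")):
    # indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
    merged = df_types.merge(df_skins, how="outer", on=list(keys), indicator=True)
    # Rows that exist on one side only: weapons without skins, or skins without a weapon type
    unmatched = merged[merged["_merge"] != "both"]
    return merged.drop(columns="_merge"), unmatched
```

Inspecting `unmatched` before deleting `df_weaponskins` and `df_weapontypes` would show whether any weapon was lost because of a naming inconsistency.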


### Scraping "Other" (`df_out_other`) (e.g. stickers, pins, graffiti...)

Now it is time to do the same for 'Other', i.e. lootbox outcomes that are not weapon skins. Each category in `df_other_out` is comparable to a weapon (which contains many "skins"). We'll merge this dataframe with `df_out`, which will then be a comprehensive table of the possible lootbox outcomes, corresponding to the `out` column in the JSON stream in the purchases CSV (`output.csv`).

Skeleton of the `df_out_other` dataframe.

The website is not very consistent, so there are many elements here that will be entered manually.

In [32]:
# Too chaotic in the website to scrape it automatically...

# list of outcome types from src, except weapon skins, and URLs
other_out_en = {
    'Gloves':'https://wiki.cs.money/gloves',
    'Keys':'https://wiki.cs.money/keys',
    #'Rare Patterns':'https://wiki.cs.money/rare-patterns',
    'Agents':'https://wiki.cs.money/agents',
    'Regular Stickers':'https://wiki.cs.money/regular-stickers',
    'Tournament Stickers':'https://wiki.cs.money/tournament-stickers',
    'Patches':'https://wiki.cs.money/patches',
    'Graffiti':'https://wiki.cs.money/graffiti',
    'Music Kits':'https://wiki.cs.money/music-kits',
    'Pins':'https://wiki.cs.money/pins'
} 
df_other_out_en = pd.DataFrame.from_dict(other_out_en, orient='index', columns=['Link'])
df_other_out_en.reset_index(inplace=True)
df_other_out_en = df_other_out_en.rename(columns={"index": "Type_en"})

other_out_zh = {
    '手套':'https://wiki.cs.money/zh/gloves',
    '钥匙':'https://wiki.cs.money/zh/keys',
    #'罕见的图案':'https://wiki.cs.money/zh/rare-patterns',
    '探员':'https://wiki.cs.money/zh/agents',
    '普通贴纸':'https://wiki.cs.money/zh/regular-stickers',
    '大赛贴纸':'https://wiki.cs.money/zh/tournament-stickers',
    '布章':'https://wiki.cs.money/zh/patches',
    '涂鸦':'https://wiki.cs.money/zh/graffiti',
    '音乐盒':'https://wiki.cs.money/zh/music-kits',
    '引脚':'https://wiki.cs.money/zh/pins'

}
df_other_out_zh = pd.DataFrame.from_dict(other_out_zh, orient='index', columns=['Link_zh'])
df_other_out_zh.reset_index(inplace=True)
df_other_out_zh = df_other_out_zh.rename(columns={"index": "Type_zh"})

# Concat the EN and ZH versions
df_other_out = pd.concat([df_other_out_en, df_other_out_zh], axis=1)

display(df_other_out)

# no longer needed
del other_out_en
del other_out_zh
del df_other_out_zh
del df_other_out_en

# (it's possible that the "Rare Patterns" are duplicates of skins already found on each weapon skin, and can be removed from the dict)
# Regular stickers has an infinite scroll page, but beautifulsoup can only capture the first 16 elements. We would need some javascript magic. Same with Pins. Check if there are other pages with the same issue.
## function smooth_scroll() implemented for this purpose. Can still fail if website loads too slowly (modify the parameters on the website in this case).

Unnamed: 0,Type_en,Link,Type_zh,Link_zh
0,Gloves,https://wiki.cs.money/gloves,手套,https://wiki.cs.money/zh/gloves
1,Keys,https://wiki.cs.money/keys,钥匙,https://wiki.cs.money/zh/keys
2,Agents,https://wiki.cs.money/agents,探员,https://wiki.cs.money/zh/agents
3,Regular Stickers,https://wiki.cs.money/regular-stickers,普通贴纸,https://wiki.cs.money/zh/regular-stickers
4,Tournament Stickers,https://wiki.cs.money/tournament-stickers,大赛贴纸,https://wiki.cs.money/zh/tournament-stickers
5,Patches,https://wiki.cs.money/patches,布章,https://wiki.cs.money/zh/patches
6,Graffiti,https://wiki.cs.money/graffiti,涂鸦,https://wiki.cs.money/zh/graffiti
7,Music Kits,https://wiki.cs.money/music-kits,音乐盒,https://wiki.cs.money/zh/music-kits
8,Pins,https://wiki.cs.money/pins,引脚,https://wiki.cs.money/zh/pins
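
Since the English and Chinese links in this table differ only by the `/zh` prefix, the two manually maintained dictionaries could be collapsed into one list, which removes the risk of the two sides getting out of order before the column-wise concat. This is only a sketch of that alternative: `CATEGORIES` and `build_other_out` are hypothetical names, and only three of the nine categories are listed for brevity.

```python
import pandas as pd

BASE_URL = "https://wiki.cs.money"

# (English name, Chinese name, URL slug); three of the nine categories shown
CATEGORIES = [
    ("Gloves", "手套", "gloves"),
    ("Keys", "钥匙", "keys"),
    ("Agents", "探员", "agents"),
]

def build_other_out(categories, base_url=BASE_URL):
    # Each row is built from a single tuple, so EN and ZH can never be misaligned
    rows = [
        {
            "Type_en": en,
            "Link": f"{base_url}/{slug}",
            "Type_zh": zh,
            "Link_zh": f"{base_url}/zh/{slug}",
        }
        for en, zh, slug in categories
    ]
    return pd.DataFrame(rows, columns=["Type_en", "Link", "Type_zh", "Link_zh"])

print(build_other_out(CATEGORIES))
```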


Actually scraping 'other' elements

For each of the types in `df_other_out`, do what we did for the skins of each weapon type.
Since the website sometimes fails to load properly, we'll save the dataframe for each type as a pickle and join them later.

This part will open a web browser window and smooth-scroll to the end of the page to load all items. It can be quite slow, so be patient.

In [34]:
# Usually takes around 20 minutes.
for index, row in df_other_out.iterrows():
    #print(index)
    print(row['Type_en'])
    print(row['Link'])
    new_df = get_skin_info(row['Link'])
    new_df['Type_en'] = row['Type_en']
    new_df.to_pickle("./df_pickles/new_df_{}.pkl".format(row['Type_en']))
    del new_df

Gloves
https://wiki.cs.money/gloves
https://wiki.cs.money/zh/gloves
Keys
https://wiki.cs.money/keys
https://wiki.cs.money/zh/keys
Agents
https://wiki.cs.money/agents
https://wiki.cs.money/zh/agents
Regular Stickers
https://wiki.cs.money/regular-stickers
reached end
https://wiki.cs.money/zh/regular-stickers
reached end
Tournament Stickers
https://wiki.cs.money/tournament-stickers
reached end
https://wiki.cs.money/zh/tournament-stickers
reached end
Patches
https://wiki.cs.money/patches
reached end
https://wiki.cs.money/zh/patches
reached end
Graffiti
https://wiki.cs.money/graffiti
reached end
https://wiki.cs.money/zh/graffiti
reached end
Music Kits
https://wiki.cs.money/music-kits
https://wiki.cs.money/zh/music-kits
Pins
https://wiki.cs.money/pins
reached end
https://wiki.cs.money/zh/pins
reached end
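
Once every category has been written to its own pickle, the pieces can be joined back into a single dataframe. A minimal sketch, assuming all files follow the `new_df_<Type_en>.pkl` naming used in the loop above and live in `./df_pickles`; `load_other_pickles` is a hypothetical helper name.

```python
import glob
import os
import pandas as pd

def load_other_pickles(folder="./df_pickles", pattern="new_df_*.pkl"):
    # Collect the per-category pickles in a stable (alphabetical) order
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    frames = [pd.read_pickle(path) for path in paths]
    if not frames:
        return pd.DataFrame()  # nothing scraped yet
    # Stack the categories on top of each other with a fresh 0..n-1 index
    return pd.concat(frames, axis=0, ignore_index=True)
```

Each frame already carries its `Type_en` column, so the category of every row is preserved after concatenation.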


In case of error, a single category can be scraped again manually. One way to detect missing data is to check whether the number of non-null elements matches between the English and Chinese versions of the Weapon and Skin_Name columns. Beware that the website contains some inconsistencies: for one particular patch, for instance, the value that belongs in Skin_Name_zh appears in Weapon_zh instead.

In [None]:
# Just in case some of them failed (due to the smooth scrolling) and you need to rerun them manually. Example with "Patches"
new_df = get_skin_info('https://wiki.cs.money/patches')
new_df['Type_en'] = "Patches"
print(len(new_df))
# Save next to the other per-type pickles so the concatenation step picks it up
new_df.to_pickle("./df_pickles/new_df_Patches.pkl")
del new_df

In [41]:
# Checking the number of non-null elements.
df_temp = pd.read_pickle("./df_pickles/new_df_Patches.pkl")
df_temp.replace(r'^\s*$', np.nan, regex=True).info()
display(df_temp.sample(5))
del df_temp

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 0 to 106
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Weapon_en         107 non-null    object 
 1   Weapon_zh         107 non-null    object 
 2   Skin_Name         50 non-null     object 
 3   Skin_Name_zh      48 non-null     object 
 4   Grade             107 non-null    object 
 5   Rarity            107 non-null    object 
 6   Value             107 non-null    object 
 7   Value_Stattrak    0 non-null      float64
 8   Value_Souvenir    0 non-null      float64
 9   Skin_Link         107 non-null    object 
 10  Found_in          107 non-null    object 
 11  Found_in_zh       107 non-null    object 
 12  Found_in_Link     107 non-null    object 
 13  Found_in_Link_zh  107 non-null    object 
 14  Type_en           107 non-null    object 
dtypes: float64(2), object(13)
memory usage: 13.4+ KB


Unnamed: 0,Weapon_en,Weapon_zh,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Type_en
36,Sunset Wave,落日余波,,,[Remarkable],[4],$ 3.59,,,https://wiki.cs.money/patches/patch-sunset-wave,[Operation Riptide Patch Collection],[“激流大行动”布章收藏品],[https://wiki.cs.money/packs/operation-riptide...,[https://wiki.cs.money/zh/packs/operation-ript...,Patches
92,Headcrab Glyph,猎头蟹符文,,,[High Grade],[3],$ 0.64,,,https://wiki.cs.money/patches/patch-headcrab-g...,[Half-Life: Alyx Patch Pack],[《半衰期：爱莉克斯》布章包],[https://wiki.cs.money/packs/half-life-alyx-pa...,[https://wiki.cs.money/zh/packs/half-life-alyx...,Patches
71,Hydra,九头蛇,,,[High Grade],[3],$ 0.97,,,https://wiki.cs.money/patches/patch-hydra,[CS:GO Patch Pack],[CS:GO布章包],[https://wiki.cs.money/packs/csgo-patch-pack],[https://wiki.cs.money/zh/packs/csgo-patch-pack],Patches
29,Entropiq (Gold),Entropiq （金色）,Stockholm 2021,2021年斯德哥尔摩锦标赛,[Remarkable],[4],$ 4.94,,,https://wiki.cs.money/patches/patch-entropiq-g...,[Stockholm 2021 Contenders Patch Pack],[2021年斯德哥尔摩锦标赛竞争组布章包],[https://wiki.cs.money/packs/stockholm-2021-co...,[https://wiki.cs.money/zh/packs/stockholm-2021...,Patches
102,Bravo,英勇,,,[High Grade],[3],$ 0.52,,,https://wiki.cs.money/patches/patch-bravo,[CS:GO Patch Pack],[CS:GO布章包],[https://wiki.cs.money/packs/csgo-patch-pack],[https://wiki.cs.money/zh/packs/csgo-patch-pack],Patches
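
The English/Chinese comparison described above can also be done per row rather than by reading the `info()` counts by eye. This is a minimal sketch, not the notebook's own method: `count_mismatches` is a hypothetical helper, and it assumes that empty or whitespace-only strings count as missing, as in the check above.

```python
import numpy as np
import pandas as pd

# Column pairs taken from the dataframe shown above
PAIRS = [("Weapon_en", "Weapon_zh"), ("Skin_Name", "Skin_Name_zh")]


def count_mismatches(df, pairs=PAIRS):
    """Return {(en, zh): number of rows filled in only one of the two languages}."""
    result = {}
    for en, zh in pairs:
        # Treat empty or whitespace-only strings as missing values
        en_col = df[en].replace(r"^\s*$", np.nan, regex=True)
        zh_col = df[zh].replace(r"^\s*$", np.nan, regex=True)
        # XOR is True when exactly one side has a value
        one_sided = en_col.notna() ^ zh_col.notna()
        result[(en, zh)] = int(one_sided.sum())
    return result
```

A non-zero count for a pair points at rows worth inspecting, for example with a boolean mask built the same way.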


Correcting *Tournament Stickers*

The *Tournament Stickers* page is malformed. In many (but not all) items, the value that belongs in `Skin_Name_zh` appears under `Weapon_zh`, and the `Skin_Name_zh` field is empty. We'll have to correct this manually: if `Skin_Name_zh` is empty (but `Skin_Name` is not), move `Weapon_zh` to `Skin_Name_zh` and copy `Weapon_en` to `Weapon_zh`.

In [42]:
# Read the per-type pickle from the folder where the scraping loop saved it
new_df_temp = pd.read_pickle("./df_pickles/new_df_Tournament Stickers.pkl")

df_tournamentstickers = new_df_temp.copy()

# Iterate through the rows of the dataframe
for index, row in df_tournamentstickers.iterrows():
    # If the Skin_Name_zh column is empty but Skin_Name is not
    if row['Skin_Name_zh'] == "" and row['Skin_Name'] != "":
        # Move Weapon_zh to Skin_Name_zh
        df_tournamentstickers.at[index, 'Skin_Name_zh'] = row['Weapon_zh']
        # Copy the contents of Weapon_en to Weapon_zh
        df_tournamentstickers.at[index, 'Weapon_zh'] = row['Weapon_en']

display(df_tournamentstickers.sample(7))
# Overwrite the pickle with the corrected version
df_tournamentstickers.to_pickle("./df_pickles/new_df_Tournament Stickers.pkl")
del df_tournamentstickers
del new_df_temp

Unnamed: 0,Weapon_en,Weapon_zh,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh,Type_en
1403,Boombl4,Boombl4,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-boombl4...,[Antwerp 2022 Legends Autograph Capsule],[2022年安特卫普锦标赛传奇组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-l...,[https://wiki.cs.money/zh/capsules/antwerp-202...,Tournament Stickers
1255,BIG,BIG,Stockholm 2021,2021年斯德哥尔摩锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.04,,,https://wiki.cs.money/stickers/sticker-big-sto...,[Stockholm 2021 Challengers Sticker Capsule],[2021年斯德哥尔摩锦标赛挑战组印花胶囊],[https://wiki.cs.money/capsules/stockholm-2021...,[https://wiki.cs.money/zh/capsules/stockholm-2...,Tournament Stickers
1382,Jame,Jame,Stockholm 2021,2021年斯德哥尔摩锦标赛,"[High Grade, Remarkable, Extraordinary]","[3, 4, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-jame-st...,[Stockholm 2021 Finalists Autograph Capsule],[2021年斯德哥尔摩锦标赛冠军赛选手亲笔签名胶囊],[https://wiki.cs.money/capsules/stockholm-2021...,[https://wiki.cs.money/zh/capsules/stockholm-2...,Tournament Stickers
1200,Sico,Sico,Berlin 2019,2019年柏林锦标赛,"[High Grade, Remarkable, Extraordinary]","[3, 4, 6]",$ 0.08,,,https://wiki.cs.money/stickers/sticker-sico-be...,[Berlin 2019 Minor Challengers Autograph Capsule],[2019年柏林锦标赛竞争组亲笔签名胶囊],[https://wiki.cs.money/capsules/berlin-2019-mi...,[https://wiki.cs.money/zh/capsules/berlin-2019...,Tournament Stickers
308,freakazoid,freakazoid,Cluj-Napoca 2015,2015年克卢日-纳波卡锦标赛,"[High Grade, Exotic, Exotic]","[3, 5, 5]",$ 2.98,,,https://wiki.cs.money/stickers/sticker-freakaz...,[Autograph Capsule | Challengers (Foil) | Cluj...,"[亲笔签名胶囊 | 挑战组（闪亮）| 2015年克卢日-纳波卡锦标赛, 亲笔签名胶囊 | C...",[https://wiki.cs.money/capsules/autograph-caps...,[https://wiki.cs.money/zh/capsules/autograph-c...,Tournament Stickers
1452,karrigan,karrigan,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-karriga...,[Antwerp 2022 Legends Autograph Capsule],[2022年安特卫普锦标赛传奇组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-l...,[https://wiki.cs.money/zh/capsules/antwerp-202...,Tournament Stickers
264,olofmeister,olofmeister,Atlanta 2017,2017年亚特兰大锦标赛,"[High Grade, Exotic, Exotic]","[3, 5, 5]",$ 3.75,,,https://wiki.cs.money/stickers/sticker-olofmei...,[Autograph Capsule | Legends (Foil) | Atlanta ...,"[亲笔签名胶囊 | 传奇（闪亮）| 2017年亚特兰大锦标赛, 亲笔签名胶囊 | Fnati...",[https://wiki.cs.money/capsules/autograph-caps...,[https://wiki.cs.money/zh/capsules/autograph-c...,Tournament Stickers
1412,dav1d,dav1d,Antwerp 2022,2022年安特卫普锦标赛,"[High Grade, Remarkable, Exotic, Extraordinary]","[3, 4, 5, 6]",$ 0.03,,,https://wiki.cs.money/stickers/sticker-dav1d-a...,[Antwerp 2022 Contenders Autograph Capsule],[2022年安特卫普锦标赛竞争组亲笔签名胶囊],[https://wiki.cs.money/capsules/antwerp-2022-c...,[https://wiki.cs.money/zh/capsules/antwerp-202...,Tournament Stickers
558,markeloff,markeloff,Boston 2018,2018年波士顿锦标赛,"[High Grade, Remarkable, Extraordinary]","[3, 4, 6]",$ 1.59,,,https://wiki.cs.money/stickers/sticker-markelo...,[Boston 2018 Returning Challengers Autograph C...,[2018年波士顿锦标赛挑战组亲笔签名胶囊],[https://wiki.cs.money/capsules/boston-2018-re...,[https://wiki.cs.money/zh/capsules/boston-2018...,Tournament Stickers
257,markeloff,markeloff,Krakow 2017,2017年克拉科夫锦标赛,"[High Grade, Remarkable, Extraordinary]","[3, 4, 6]",$ 3.82,,,https://wiki.cs.money/stickers/sticker-markelo...,[Krakow 2017 Challengers Autograph Capsule],[2017年克拉科夫锦标赛挑战组亲笔签名胶囊],[https://wiki.cs.money/capsules/krakow-2017-ch...,[https://wiki.cs.money/zh/capsules/krakow-2017...,Tournament Stickers
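
The same correction can be sketched without a row-by-row loop by using a boolean mask. This is an alternative sketch, not the code that produced the output above: `fix_shifted_names` is a hypothetical helper, and unlike the loop it also treats `NaN` as empty, which is an assumption about how missing values may be stored.

```python
import pandas as pd


def fix_shifted_names(df):
    """Return a copy where Weapon_zh is moved to Skin_Name_zh when the latter is empty."""
    fixed = df.copy()
    # Rows where the Chinese skin name is missing but the English one is present
    empty_zh = fixed["Skin_Name_zh"].fillna("").eq("")
    has_en = fixed["Skin_Name"].fillna("").ne("")
    mask = empty_zh & has_en
    # Move Weapon_zh to Skin_Name_zh first, then copy Weapon_en into Weapon_zh
    fixed.loc[mask, "Skin_Name_zh"] = fixed.loc[mask, "Weapon_zh"]
    fixed.loc[mask, "Weapon_zh"] = fixed.loc[mask, "Weapon_en"]
    return fixed
```

The order of the two assignments matters: `Weapon_zh` has to be read before it is overwritten.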


Concatenate df_other pickle files

In [66]:
# Now load all the pickle files (their names start with "new_df_") and concatenate them
import os
df_out_other = pd.DataFrame()
files = os.listdir('./df_pickles')
for file in files:
    if file.endswith('.pkl') and file.startswith('new_df_'):
        # os.listdir returns bare file names, so prepend the folder before loading
        df = pd.read_pickle(os.path.join('./df_pickles', file))
        df_out_other = pd.concat([df_out_other, df], axis=0)
del df
del files

# and merge the concatenated data (df_out_other) with df_other_out (the "skeleton" df that contains the categories)
df_out_other = df_out_other.copy()
df_out_other = df_other_out.merge(df_out_other, how='outer', left_on=['Type_en'], right_on=['Type_en'])
display(df_out_other.sample(5))

# Show number of null values for each feature
df_out_other.replace(r'^\s*$', np.nan, regex=True).info()

# Let's save it to a pickle file
df_out_other.to_pickle("./df_pickles/df_out_other.pkl")

del df_other_out

Unnamed: 0,Type_en,Link,Type_zh,Link_zh,Weapon_en,Weapon_zh,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh
930,Tournament Stickers,https://wiki.cs.money/tournament-stickers,大赛贴纸,https://wiki.cs.money/zh/tournament-stickers,freakazoid,freakazoid,Cluj-Napoca 2015,2015年克卢日-纳波卡锦标赛,"[High Grade, Exotic, Exotic]","[3, 5, 5]",$ 2.98,,,https://wiki.cs.money/stickers/sticker-freakaz...,[Autograph Capsule | Challengers (Foil) | Cluj...,"[亲笔签名胶囊 | 挑战组（闪亮）| 2015年克卢日-纳波卡锦标赛, 亲笔签名胶囊 | C...",[https://wiki.cs.money/capsules/autograph-caps...,[https://wiki.cs.money/zh/capsules/autograph-c...
79,Keys,https://wiki.cs.money/keys,钥匙,https://wiki.cs.money/zh/keys,Operation Breakout Case Key,“突围大行动”武器箱钥匙,,,[Default],[1],$ 9.92,,,https://wiki.cs.money/keys/operation-breakout-...,[Operation Breakout Weapon Case],[“突围大行动”武器箱],[https://wiki.cs.money/cases/operation-breakou...,[https://wiki.cs.money/zh/cases/operation-brea...
391,Regular Stickers,https://wiki.cs.money/regular-stickers,普通贴纸,https://wiki.cs.money/zh/regular-stickers,Necron,死灵,,,[High Grade],[3],$ 0.46,,,https://wiki.cs.money/stickers/sticker-necron,"[Warhammer 40,000 Sticker Capsule]",[战锤40k印花胶囊],[https://wiki.cs.money/capsules/warhammer-4000...,[https://wiki.cs.money/zh/capsules/warhammer-4...
2652,Graffiti,https://wiki.cs.money/graffiti,涂鸦,https://wiki.cs.money/zh/graffiti,Mr. Teeth (Bazooka Pink),露齿笑 (丁香),,,[Base Grade],[1],$ 0.03,,,https://wiki.cs.money/graffiti/sealed-graffiti...,[],[],[],[]
450,Regular Stickers,https://wiki.cs.money/regular-stickers,普通贴纸,https://wiki.cs.money/zh/regular-stickers,Vigilance,警戒者,,,"[High Grade, Remarkable]","[3, 4]",$ 0.29,,,https://wiki.cs.money/stickers/sticker-vigilance,[Sticker Capsule],[印花胶囊],[https://wiki.cs.money/capsules/sticker-capsule],[https://wiki.cs.money/zh/capsules/sticker-cap...


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2779 entries, 0 to 2778
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Type_en           2779 non-null   object 
 1   Link              2779 non-null   object 
 2   Type_zh           2779 non-null   object 
 3   Link_zh           2779 non-null   object 
 4   Weapon_en         2779 non-null   object 
 5   Weapon_zh         2779 non-null   object 
 6   Skin_Name         1931 non-null   object 
 7   Skin_Name_zh      1924 non-null   object 
 8   Grade             2779 non-null   object 
 9   Rarity            2779 non-null   object 
 10  Value             2779 non-null   object 
 11  Value_Stattrak    0 non-null      float64
 12  Value_Souvenir    0 non-null      float64
 13  Skin_Link         2779 non-null   object 
 14  Found_in          2779 non-null   object 
 15  Found_in_zh       2779 non-null   object 
 16  Found_in_Link     2779 non-null   object 


### Concatenate (`df_out_weaponskins`) & (`df_out_other`)

The resulting dataframe, `df_out`, corresponds to the outcomes of a lootbox.

In [80]:
# Finally, concat the "other" dataframe with the main one:
# (load the weapon skins into the name that the concat below actually uses)
df_out_weaponskins = pd.read_pickle("./df_pickles/df_out_weaponskins.pkl")
df_out = pd.concat([df_out_weaponskins, df_out_other], axis=0)
df_out.reset_index(inplace=True)

# Save to pickle
df_out.to_pickle("./df_pickles/df_out.pkl")

In [81]:
# Show number of null values for each feature
df_out.replace(r'^\s*$', np.nan, regex=True).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4403 entries, 0 to 4402
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   index             4403 non-null   int64 
 1   Type_en           4402 non-null   object
 2   Type_zh           4402 non-null   object
 3   Weapon_en         4403 non-null   object
 4   Weapon_zh         4403 non-null   object
 5   Image             1623 non-null   object
 6   Link              4402 non-null   object
 7   Skin_Name         3555 non-null   object
 8   Skin_Name_zh      3548 non-null   object
 9   Grade             4403 non-null   object
 10  Rarity            4403 non-null   object
 11  Value             4403 non-null   object
 12  Value_Stattrak    1106 non-null   object
 13  Value_Souvenir    248 non-null    object
 14  Skin_Link         4403 non-null   object
 15  Found_in          4403 non-null   object
 16  Found_in_zh       4403 non-null   object
 17  Found_in_Link 

In [88]:
# Check whether anything changes if we remove duplicates
# (This odd-looking line is needed because some columns contain lists, which drop_duplicates can't handle. So we convert everything to strings first.)

df_out.loc[df_out.astype(str).drop_duplicates().index].replace(r'^\s*$', np.nan, regex=True).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4403 entries, 0 to 4402
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   index             4403 non-null   int64 
 1   Type_en           4402 non-null   object
 2   Type_zh           4402 non-null   object
 3   Weapon_en         4403 non-null   object
 4   Weapon_zh         4403 non-null   object
 5   Image             1623 non-null   object
 6   Link              4402 non-null   object
 7   Skin_Name         3555 non-null   object
 8   Skin_Name_zh      3548 non-null   object
 9   Grade             4403 non-null   object
 10  Rarity            4403 non-null   object
 11  Value             4403 non-null   object
 12  Value_Stattrak    1106 non-null   object
 13  Value_Souvenir    248 non-null    object
 14  Skin_Link         4403 non-null   object
 15  Found_in          4403 non-null   object
 16  Found_in_zh       4403 non-null   object
 17  Found_in_Link 
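
To illustrate why the string conversion is needed, here is a small self-contained sketch on toy data (the two-column frame below is made up for the example and only borrows names from the tables above). With list-valued cells, `drop_duplicates` typically raises a `TypeError` because lists are unhashable, while comparing string representations works.

```python
import pandas as pd

# Toy frame with a list-valued column, like Grade or Found_in in df_out
df = pd.DataFrame({
    "Weapon_en": ["Hydra", "Hydra", "Bravo"],
    "Grade": [["High Grade"], ["High Grade"], ["High Grade"]],
})

try:
    df.drop_duplicates()
    failed = False
except TypeError:
    # Lists are unhashable, so the rows cannot be compared directly
    failed = True

# Compare string representations instead, then select the surviving rows
keep = df.astype(str).drop_duplicates().index
deduped = df.loc[keep]
```

Selecting by index label assumes the index is unique, which holds for `df_out` after `reset_index`. The list values in `deduped` are left untouched because only the temporary string copy is used for the comparison.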

## Scraping lootbox cases (`df_src`)



### Manually select categories to scrape

Since the website is quite inconsistent and not all info is necessary, we'll define the categories manually.

In [5]:
# list of lootbox types that you can purchase, except weapon skins, and URLs
other_src = {
    'Cases': 'https://wiki.cs.money/cases', 
    'Collections': 'https://wiki.cs.money/collections', 
    'Souvenir Packages': 'https://wiki.cs.money/souvenir-packages',
    'Agent Collections':'https://wiki.cs.money/agent-collections',
    'Sticker Capsules':'https://wiki.cs.money/capsules',
    'Patch Packs':'https://wiki.cs.money/packs',
    'Graffiti Capsules':'https://wiki.cs.money/graffiti-capsules',
    'Music Kit Boxes':'https://wiki.cs.money/music-kit-boxes',
    'Pins Capsules':'https://wiki.cs.money/pins-capsules'
} 

df_other_src = pd.DataFrame.from_dict(other_src, orient='index', columns=['Link'])
df_other_src.reset_index(inplace=True)
df_other_src = df_other_src.rename(columns={"index": "Type_en"})
#df_other_out['Link_zh'] = df_other_out.apply(lambda row : convert_url(row['Link'], 'zh'), axis=1) # not necessary, will only add clutter.

other_src_zh = {
    '案件':'https://wiki.cs.money/zh/cases',
    '藏品':'https://wiki.cs.money/zh/collections',
    '纪念品包':'https://wiki.cs.money/zh/souvenir-packages',
    '探员系列':'https://wiki.cs.money/zh/agent-collections',
    '印花胶囊':'https://wiki.cs.money/zh/capsules',
    '布章包':'https://wiki.cs.money/zh/packs',
    '涂鸦胶囊':'https://wiki.cs.money/zh/graffiti-capsules',
    '音乐盒箱子':'https://wiki.cs.money/zh/music-kit-boxes',
    '勋章胶囊':'https://wiki.cs.money/zh/pins-capsules'
}

df_other_src_zh = pd.DataFrame.from_dict(other_src_zh, orient='index', columns=['Link_zh'])
df_other_src_zh.reset_index(inplace=True)
df_other_src_zh = df_other_src_zh.rename(columns={"index": "Type_zh"})

df_other_src = pd.concat([df_other_src, df_other_src_zh], axis=1)

display(df_other_src)

Unnamed: 0,Type_en,Link,Type_zh,Link_zh
0,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases
1,Collections,https://wiki.cs.money/collections,藏品,https://wiki.cs.money/zh/collections
2,Souvenir Packages,https://wiki.cs.money/souvenir-packages,纪念品包,https://wiki.cs.money/zh/souvenir-packages
3,Agent Collections,https://wiki.cs.money/agent-collections,探员系列,https://wiki.cs.money/zh/agent-collections
4,Sticker Capsules,https://wiki.cs.money/capsules,印花胶囊,https://wiki.cs.money/zh/capsules
5,Patch Packs,https://wiki.cs.money/packs,布章包,https://wiki.cs.money/zh/packs
6,Graffiti Capsules,https://wiki.cs.money/graffiti-capsules,涂鸦胶囊,https://wiki.cs.money/zh/graffiti-capsules
7,Music Kit Boxes,https://wiki.cs.money/music-kit-boxes,音乐盒箱子,https://wiki.cs.money/zh/music-kit-boxes
8,Pins Capsules,https://wiki.cs.money/pins-capsules,勋章胶囊,https://wiki.cs.money/zh/pins-capsules
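
In the table above, every Chinese link is the English link with a `/zh` prefix added to the path. As a minimal sketch, the second dictionary's URLs could be derived instead of typed by hand. `to_zh_url` is a hypothetical helper written for this example; it is not the notebook's `convert_url`, whose behaviour may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def to_zh_url(url):
    """Insert the /zh prefix into the path of a wiki.cs.money URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.path == "/zh" or parts.path.startswith("/zh/"):
        return url  # already points at the Chinese version
    return urlunsplit(parts._replace(path="/zh" + parts.path))


print(to_zh_url("https://wiki.cs.money/cases"))
```

The Chinese category names (`Type_zh`) still have to be entered manually, since they cannot be derived from the URL.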


### Scrape the lootbox cases from the website

In [6]:
# For each one of the other_src, do what we did for Skins. Can take around a minute.
df_src = pd.DataFrame()
for key, value in other_src.items():
    print(key)
    print(value)
    new_df = get_skin_info(value)
    new_df['Type_en'] = key
    df_src = pd.concat([df_src, new_df], axis=0)
del new_df

# and merge the scraped data (df_src) with df_other_src (the "skeleton" df that contains the categories)
df_src = df_src.copy()
df_src = df_other_src.merge(df_src, how='outer', left_on=['Type_en'], right_on=['Type_en'])
display(df_src)

Cases
https://wiki.cs.money/cases
https://wiki.cs.money/zh/cases
Collections
https://wiki.cs.money/collections
https://wiki.cs.money/zh/collections
Souvenir Packages
https://wiki.cs.money/souvenir-packages
https://wiki.cs.money/zh/souvenir-packages
Agent Collections
https://wiki.cs.money/agent-collections
https://wiki.cs.money/zh/agent-collections
Sticker Capsules
https://wiki.cs.money/capsules
https://wiki.cs.money/zh/capsules
Patch Packs
https://wiki.cs.money/packs
https://wiki.cs.money/zh/packs
Graffiti Capsules
https://wiki.cs.money/graffiti-capsules
reached end
https://wiki.cs.money/zh/graffiti-capsules
reached end
Music Kit Boxes
https://wiki.cs.money/music-kit-boxes
https://wiki.cs.money/zh/music-kit-boxes
Pins Capsules
https://wiki.cs.money/pins-capsules
reached end
https://wiki.cs.money/zh/pins-capsules
reached end


Unnamed: 0,Type_en,Link,Type_zh,Link_zh,Weapon_en,Weapon_zh,Skin_Name,Skin_Name_zh,Grade,Rarity,Value,Value_Stattrak,Value_Souvenir,Skin_Link,Found_in,Found_in_zh,Found_in_Link,Found_in_Link_zh
0,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Revolution Case,Revolution Case,,,[Default],[1],$ 6.30,,,https://wiki.cs.money/cases/revolution-case,[],[],[],[]
1,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Recoil Case,Recoil Case,,,[Default],[1],$ 0.57,,,https://wiki.cs.money/cases/recoil-case,[],[],[],[]
2,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Dreams & Nightmares Case,梦魇武器箱,,,[Default],[1],$ 0.50,,,https://wiki.cs.money/cases/dreams-nightmares-...,[],[],[],[]
3,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Operation Riptide Case,“激流大行动”武器箱,,,[Default],[1],$ 1.82,,,https://wiki.cs.money/cases/operation-riptide-...,[],[],[],[]
4,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Snakebite Case,蛇噬武器箱,,,[Default],[1],$ 0.12,,,https://wiki.cs.money/cases/snakebite-case,[],[],[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158,Music Kit Boxes,https://wiki.cs.money/music-kit-boxes,音乐盒箱子,https://wiki.cs.money/zh/music-kit-boxes,Masterminds Music Kit Box,决策大师音乐盒集,,,[Default],[1],$ 6.37,,,https://wiki.cs.money/music-kit-boxes/mastermi...,[],[],[],[]
159,Pins Capsules,https://wiki.cs.money/pins-capsules,勋章胶囊,https://wiki.cs.money/zh/pins-capsules,Collectible Pins Capsule Series 1,系列 1 收藏胸章胶囊,,,[Default],[1],$ 12.21,,,https://wiki.cs.money/pins-capsules/collectibl...,[],[],[],[]
160,Pins Capsules,https://wiki.cs.money/pins-capsules,勋章胶囊,https://wiki.cs.money/zh/pins-capsules,Collectible Pins Capsule Series 2,系列 2 收藏胸章胶囊,,,[Default],[1],$ 11.69,,,https://wiki.cs.money/pins-capsules/collectibl...,[],[],[],[]
161,Pins Capsules,https://wiki.cs.money/pins-capsules,勋章胶囊,https://wiki.cs.money/zh/pins-capsules,Half-Life: Alyx Collectible Pins Capsule,《半衰期：爱莉克斯》收藏胸章胶囊,,,[Default],[1],$ 11.42,,,https://wiki.cs.money/pins-capsules/half-life-...,[],[],[],[]


In [7]:
# Show number of null values for each feature
df_src.replace(r'^\s*$', np.nan, regex=True).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 163 entries, 0 to 162
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Type_en           163 non-null    object 
 1   Link              163 non-null    object 
 2   Type_zh           163 non-null    object 
 3   Link_zh           163 non-null    object 
 4   Weapon_en         163 non-null    object 
 5   Weapon_zh         163 non-null    object 
 6   Skin_Name         0 non-null      float64
 7   Skin_Name_zh      0 non-null      float64
 8   Grade             163 non-null    object 
 9   Rarity            163 non-null    object 
 10  Value             74 non-null     object 
 11  Value_Stattrak    0 non-null      float64
 12  Value_Souvenir    0 non-null      float64
 13  Skin_Link         163 non-null    object 
 14  Found_in          163 non-null    object 
 15  Found_in_zh       163 non-null    object 
 16  Found_in_Link     163 non-null    object 
 1

### Drop unused columns in `df_src` dataframe

In [8]:
df_src = df_src.rename(columns={
    'Weapon_en': 'lootbox_en',
    'Weapon_zh': 'lootbox_zh'
})

df_src = df_src.drop(columns=['Skin_Name', 'Skin_Name_zh', 'Value_Stattrak', 'Value_Souvenir', 'Found_in', 'Found_in_zh', 'Found_in_Link', 'Found_in_Link_zh'])

display(df_src.sample(5))


# Let's save it to a file
df_src.to_pickle("./df_pickles/df_src.pkl")

Unnamed: 0,Type_en,Link,Type_zh,Link_zh,lootbox_en,lootbox_zh,Grade,Rarity,Value,Skin_Link
1,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,Recoil Case,Recoil Case,[Default],[1],$ 0.57,https://wiki.cs.money/cases/recoil-case
9,Cases,https://wiki.cs.money/cases,案件,https://wiki.cs.money/zh/cases,CS20 Case,反恐精英20周年武器箱,[Default],[1],$ 0.17,https://wiki.cs.money/cases/cs20-case
44,Collections,https://wiki.cs.money/collections,藏品,https://wiki.cs.money/zh/collections,The Danger Zone Collection,头号特训收藏品,[Default],[1],,https://wiki.cs.money/collections/the-danger-z...
72,Collections,https://wiki.cs.money/collections,藏品,https://wiki.cs.money/zh/collections,The Huntsman Collection,猎杀者收藏品,[Default],[1],,https://wiki.cs.money/collections/the-huntsman...
126,Sticker Capsules,https://wiki.cs.money/capsules,印花胶囊,https://wiki.cs.money/zh/capsules,Rio 2022 Legends Sticker Capsule,Rio 2022 Legends Sticker Capsule,[Default],[1],—,https://wiki.cs.money/capsules/rio-2022-legend...


### Missing translations in `df_src`

The `src` column in the purchases dataset seems to be equivalent to the `lootbox_zh` field in the lootbox dataframe `df_src`.

However, `df_src` has some missing translations, which we'll add manually.

In [94]:
# import from json file
import json
with open('./lootbox_addiction/lootbox_db_scraping/lootboxes_zh_en.json', 'r', encoding='utf-8') as file:
    lootboxes_zh_en = json.load(file)
display(lootboxes_zh_en)

# Same dict, opposite direction (English->Chinese)
lootboxes_en_zh = {v: k for k, v in lootboxes_zh_en.items()}

df_src['lootbox_zh'] = df_src['lootbox_zh'].map(lootboxes_en_zh).fillna(df_src['lootbox_zh'])

df_src.to_pickle("./df_pickles/df_src.pkl")
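
The `map(...).fillna(...)` pattern used above can be seen on a toy example. The dictionary below is a stand-in for the JSON file, built from two name pairs that appear in the tables earlier in this notebook; the real file's contents are not shown here. The uniqueness check is an added precaution, since inverting a dictionary silently drops entries when two keys share a value.

```python
import pandas as pd

# Toy stand-in for lootboxes_zh_en.json: Chinese name -> English name
zh_en = {"蛇噬武器箱": "Snakebite Case", "梦魇武器箱": "Dreams & Nightmares Case"}

# Inverting is only safe when no English name appears twice
assert len(set(zh_en.values())) == len(zh_en)
en_zh = {en: zh for zh, en in zh_en.items()}

# A column where the first entry is still in English
lootbox_zh = pd.Series(["Snakebite Case", "“激流大行动”武器箱"])

# map() yields NaN for names missing from the dict; fillna() restores the original value
filled = lootbox_zh.map(en_zh).fillna(lootbox_zh)
print(filled.tolist())
```

Entries that are already in Chinese are not keys of the inverted dictionary, so they pass through unchanged.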

Done. In the `df_pickles` folder there should be a `df_src` with information about the lootbox cases, and `df_out` with information about the lootbox outcomes, including their value. The rest of the .pkl files at this point are redundant, but can be used to reconstruct these two dataframes from partial data if necessary.