In [1]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import wikipedia
import pickle

### Objective

The objective is to gather data from three different sources:
* **unknowncheats**: get post and thread counts for each FPS game or gaming series
* **wikipedia**: get wikipedia entry for each gaming series (text data)
* **levvvel.com**: get anti-cheat software, publisher, and developer for each gaming series

The data is then joined together by game title and exported as an output.

### Unknowncheats.me

First we scrape the website *unknowncheats.me*. This website boasts, "UnKnoWnCheaTs is the oldest game cheating forum in existence, leading the game cheating community for over 20 years. We encourage an open, free and collaborative environment and offer a vast and resourceful file database, a wiki that's packed with structured information and tutorials, access to the most intelligent programmers, and a team that protects members from malware while enforcing a diverse community." 

The website contains a table called "First-Person Shooters" which lists FPS games along with the post and thread counts for each game. This table will be used as a proxy for the incidence of cheating in video games. That is, we assume that the higher the thread count, the more prevalent cheating is for a particular game.

Note that not every post or thread is directly related to cheating. However, since it is a website called "unknown cheats" whose purpose is to share cheating software, we take this as a measure of cheating. More work may be done to sort through posts/threads to classify them as cheating vs. non-cheating using keywords.
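
As a rough illustration of the keyword idea above, the sketch below flags thread titles that contain any term from a keyword list. The keyword list and the example titles are hypothetical, and plain substring matching is crude (for example, "esp" would also match "respawn"), so this is a starting point rather than a finished classifier.

```python
import pandas as pd

# Hypothetical keyword list; a real list would be chosen by inspecting actual thread titles
CHEAT_KEYWORDS = ("aimbot", "wallhack", "esp", "hack", "cheat", "bypass", "injector")


def flag_cheat_related(titles: pd.Series, keywords=CHEAT_KEYWORDS) -> pd.Series:
    """Return a boolean Series that is True where a title contains any keyword."""
    # Join keywords into one alternation pattern; matching is case-insensitive
    pattern = "|".join(keywords)
    return titles.str.contains(pattern, case=False, regex=True, na=False)


# Made-up titles, for illustration only
example_titles = pd.Series([
    "Undetected aimbot release",
    "Server downtime this weekend",
    "ESP source code",
])
example_flags = flag_cheat_related(example_titles)
print(example_flags.tolist())
```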

In [2]:
# url for unknowncheats.me forum page which contains First Person Shooters table
uc_url = 'https://www.unknowncheats.me/forum/index.php'

def uc_get_all_games_posts_threads() -> pd.DataFrame:
    """
    Creates a dataframe from the "First Person Shooters" table of the unknowncheats page
    Resulting dataframe contains game name, unknowncheats game link for that particular game,
    the number of threads for the game, and the number of posts for the game
    """
    # use requests to access unknowncheats website
    res = requests.get(uc_url)
    # make sure status code is 200, otherwise raise error
    if res.status_code == 200:
        print('status code is 200, website is accessible')
    else:
        raise Exception('status code is not 200, error accessing website')
    
    # create soup instance (naming the parser avoids bs4 guessing one)
    soup = BeautifulSoup(res.content, 'html.parser')
    # access tbody for First-Person Shooter table
    uc_fps_games = soup.find('tbody', attrs={'id': 'collapseobj_forumbit_156'})
    # find all tr in table 
    uc_fps_games_rows = uc_fps_games.find_all('tr')
    
    # create empty list to store table contents
    uc_fps_games_rows_data = []
    # iterate through the trs (each game)
    for row in uc_fps_games_rows[1:]:
        # create empty dictionary for each entry
        row_dict = {}
        # get the first link in the row, which holds the game name and url
        a_element = row.find('a')
        # get game name
        row_dict['game_name'] = a_element.text
        # get unknowncheats url for each game
        row_dict['uc_game_link'] = a_element.get('href')
        # get thread count for game
        row_dict['uc_threads'] = row.find('td', attrs={'class': 'alt1'}).text.strip()
        # get post count for game
        row_dict['uc_posts'] = row.find_all('td', attrs={'class': 'alt2'})[-1].text.strip()
        # append entry for game to the empty list
        uc_fps_games_rows_data.append(row_dict)
    # return the table as a dataframe
    return pd.DataFrame(uc_fps_games_rows_data)


uc_games_df = uc_get_all_games_posts_threads()

status code is 200, website is accessible


There is an entry for "Other FPS games" which we will not consider as part of this project. This row will be dropped.

In [3]:
# drop entry for "Other FPS games"
uc_games_df = uc_games_df.drop(uc_games_df[uc_games_df['game_name']=='Other FPS Games'].index)
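
Because the counts are scraped with `.text.strip()`, they arrive as strings rather than numbers. A minimal sketch of converting them, assuming the strings may contain thousands separators such as "28,290" (the helper name `counts_to_int` is hypothetical):

```python
import pandas as pd


def counts_to_int(counts: pd.Series) -> pd.Series:
    """Strip thousands separators and convert count strings to nullable integers."""
    cleaned = counts.astype(str).str.replace(",", "", regex=False).str.strip()
    # Values that cannot be parsed become missing instead of raising an error
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")


# Made-up values, for illustration only
example_counts = counts_to_int(pd.Series(["28,290", "519", "n/a"]))
print(example_counts)
```

It could then be applied to both count columns, for example `uc_games_df['uc_threads'] = counts_to_int(uc_games_df['uc_threads'])`.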

### Wikipedia

We then want to gather the wikipedia articles. Since our $y$ is the number of threads for each game, we need to gather the article for each game we scraped from unknowncheats (each game in uc_games_df). 

This is to add text data, such as game genre or number of players, and use it as an explanatory variable to predict the incidence of cheating.

For the scraping, we use the module *wikipedia*.

In [4]:
# save list of wikipedia articles for source citation:
wikipedia_sources = []

In [5]:
# loop through all the game names pulled from unknowncheats
for game in uc_games_df['game_name']:
    try:
        # use wikipedia module to pull the contents of the page
        # note: auto_suggest = True by default
        p = wikipedia.page(game)
        content = p.content
        # save url for source citation
        wikipedia_sources.append(p.url)
    except Exception:
        # if the game can't be pulled using wikipedia, substitute with nan as a placeholder
        content = np.nan
    # add the entry to the dataframe column wiki_content
    uc_games_df.loc[uc_games_df['game_name']==game, 'wiki_content'] = content





Looking at the result, we see that there are some entries where an entry could not be pulled:

In [6]:
uc_games_df.loc[uc_games_df['wiki_content'].isnull()]

Unnamed: 0,game_name,uc_game_link,uc_threads,uc_posts,wiki_content
1,All Points Bulletin,https://www.unknowncheats.me/forum/all-points-...,1362,27077,
4,Apex Legends,https://www.unknowncheats.me/forum/apex-legend...,1744,51976,
10,Crysis Series,https://www.unknowncheats.me/forum/crysis-seri...,519,6062,
13,FEAR,https://www.unknowncheats.me/forum/fear/?s=e67...,60,533,
17,Halo,https://www.unknowncheats.me/forum/halo/?s=e67...,456,5392,
21,Overwatch,https://www.unknowncheats.me/forum/overwatch/?...,676,12774,
25,Playerunknown's Battlegrounds,https://www.unknowncheats.me/forum/playerunkno...,5830,109490,
28,Radical Heights,https://www.unknowncheats.me/forum/radical-hei...,32,1286,
30,Rust,https://www.unknowncheats.me/forum/rust/?s=e67...,3311,50855,
31,Sea of Thieves,https://www.unknowncheats.me/forum/sea-of-thie...,425,15827,


For some of these games we can fill the entry by setting auto_suggest to False.

source: https://github.com/goldsmith/Wikipedia/issues/192
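
The two passes (first with `auto_suggest=True`, then with `auto_suggest=False`) can be sketched as a single helper. The helper name `fetch_with_fallback` and the stand-in fetcher are hypothetical. The fetcher is passed in as an argument so the sketch runs without network access.

```python
from typing import Callable, Optional


def fetch_with_fallback(title: str, fetch: Callable[..., str]) -> Optional[str]:
    """Try auto_suggest=True first, then auto_suggest=False; return None if both fail."""
    for auto_suggest in (True, False):
        try:
            return fetch(title, auto_suggest=auto_suggest)
        except Exception:
            # This attempt failed, so move on to the next setting
            continue
    return None


# With the wikipedia module, `fetch` could be:
#   lambda title, auto_suggest: wikipedia.page(title, auto_suggest=auto_suggest).content

# Stand-in fetcher (hypothetical) that only succeeds without auto-suggestion
def fake_fetch(title: str, auto_suggest: bool = True) -> str:
    if auto_suggest:
        raise LookupError("suggestion pointed to the wrong page")
    return f"content for {title}"


example_content = fetch_with_fallback("Apex Legends", fake_fetch)
print(example_content)
```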

In [7]:
# loop through the games whose wikipedia content is still missing
for game in uc_games_df.loc[uc_games_df['wiki_content'].isnull()]['game_name']:
    try:
        # use wikipedia module to pull the contents of the page, with auto_suggest = False
        p = wikipedia.page(game, auto_suggest=False)
        content = p.content
        # save url for source citation
        wikipedia_sources.append(p.url)
    except Exception:
        # if the game can't be pulled using wikipedia, substitute with nan as a placeholder
        content = np.nan
    # add the entry to the dataframe column wiki_content
    uc_games_df.loc[uc_games_df['game_name']==game, 'wiki_content'] = content





We now check the null entries again:

In [8]:
uc_games_df.loc[uc_games_df['wiki_content'].isnull()]

Unnamed: 0,game_name,uc_game_link,uc_threads,uc_posts,wiki_content
10,Crysis Series,https://www.unknowncheats.me/forum/crysis-seri...,519,6062,
13,FEAR,https://www.unknowncheats.me/forum/fear/?s=e67...,60,533,
17,Halo,https://www.unknowncheats.me/forum/halo/?s=e67...,456,5392,
36,Swat 4,https://www.unknowncheats.me/forum/swat-4-a/?s...,133,1077,


For the remaining 4 games, the wikipedia entry could not be pulled because the game_name did not match the wikipedia article title. We fill these values individually:

In [9]:
# Crysis Series
p = wikipedia.page('Crysis')
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name']=='Crysis Series', 'wiki_content'] = content

# FEAR
p = wikipedia.page('F.E.A.R. (video game)', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name']=='FEAR', 'wiki_content'] = content

# Halo
p = wikipedia.page('Halo (franchise)')
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name']=='Halo', 'wiki_content'] = content

# Swat 4
p = wikipedia.page('SWAT 4', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name']=='Swat 4', 'wiki_content'] = content

There are additionally some games which pulled the wrong wikipedia entry. We correct for these below:

In [10]:
# all points bulletin
# APB: All Points Bulletin
p = wikipedia.page('APB: All Points Bulletin', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name'].str.lower()=='all points bulletin', 'wiki_content'] = content

# combat arms
# Combat Arms
p = wikipedia.page('Combat Arms', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name'].str.lower()=='combat arms', 'wiki_content'] = content

# paladins
# Paladins (video game)
p = wikipedia.page('Paladins (video game)', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name'].str.lower()=='paladins', 'wiki_content'] = content

# rust
# Rust (video game)
p = wikipedia.page('Rust (video game)', auto_suggest=False)
content = p.content
wikipedia_sources.append(p.url)
uc_games_df.loc[uc_games_df['game_name'].str.lower()=='rust', 'wiki_content'] = content
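
One way such mismatches could be spotted, offered here as a heuristic rather than the method used above, is to flag articles whose text never mentions a marker phrase such as "video game". The marker and the example rows are assumptions, and flagged rows would still need a manual look.

```python
import pandas as pd


def flag_suspect_articles(df: pd.DataFrame, marker: str = "video game") -> pd.DataFrame:
    """Return the game names whose article text never mentions `marker`."""
    mentions = df["wiki_content"].str.contains(marker, case=False, regex=False, na=False)
    return df.loc[~mentions, ["game_name"]]


# Made-up article snippets, for illustration only
example_df = pd.DataFrame({
    "game_name": ["Rust", "Paladins"],
    "wiki_content": [
        "Rust is a general-purpose programming language.",
        "Paladins is a free-to-play online shooter video game.",
    ],
})
example_suspects = flag_suspect_articles(example_df)
print(example_suspects)
```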

Now we have pulled the wikipedia entry for each game and added them to our data:

In [11]:
uc_games_df

Unnamed: 0,game_name,uc_game_link,uc_threads,uc_posts,wiki_content
0,ARMA Series,https://www.unknowncheats.me/forum/arma-series...,5970,91647,Arma (sometimes stylised as ARMA) is a series ...
1,All Points Bulletin,https://www.unknowncheats.me/forum/all-points-...,1362,27077,APB: All Points Bulletin is an open world mult...
2,Alliance of Valiant Arms,https://www.unknowncheats.me/forum/alliance-of...,242,6218,Alliance of Valiant Arms (abbreviated as A.V.A...
3,America's Army Operations,https://www.unknowncheats.me/forum/america-s-a...,6434,85966,America's Army was a series of first-person sh...
4,Apex Legends,https://www.unknowncheats.me/forum/apex-legend...,1744,51976,Apex Legends is a free-to-play battle royale-h...
5,Battlefield Series,https://www.unknowncheats.me/forum/battlefield...,9085,191476,Battlefield is a series of first-person shoote...
6,Call of Duty Series,https://www.unknowncheats.me/forum/call-of-dut...,7733,105461,Call of Duty is a first-person shooter video g...
7,Combat Arms,https://www.unknowncheats.me/forum/combat-arms...,843,10607,Combat Arms: Reloaded & Combat Arms: Classic i...
8,Counter Strike,https://www.unknowncheats.me/forum/counter-str...,28290,508197,Counter-Strike (CS) is a series of multiplayer...
9,CrossFire,https://www.unknowncheats.me/forum/crossfire/?...,276,2094,Crossfire is an online tactical first-person s...


### LEVVVEL

Next we get data from *levvvel.com*. They claim on the website, "ensuring no one can unfairly gain an advantage over their opponent and more...". The site documents kernel-level anti-cheat drivers and has a table of video games and the anti-cheat software each one uses. This section loads that table into a dataframe, as the anti-cheat software can be used as an explanatory variable in our model.

Scraping the website would require selenium, which is outside the scope of this project. Instead, the data was gathered manually into a csv file, which is read below:

In [12]:
lvl_path = '../data/levvvel_games_anticheat_software.csv'
lvl_df = pd.read_csv(lvl_path)

In [13]:
# make columns lower case
lvl_df.columns = [col.lower() for col in lvl_df]

### Joining the data: clean joins

Next, uc_games_df and lvl_df must be joined together to create a master dataset. There are a few problems here:

1. The game names in the two dataframes may not match up
2. Some of the entries we have from unknowncheats are for gaming series, not individual games, while the entries from levvvel are for individual games. A series often uses the same anti-cheat software throughout, but we cannot assume this. Therefore we cannot use fuzzy matching and must check each entry individually to make sure the correct data is appended from levvvel.
3. The entries for levvvel are only for games which have kernel-level anti-cheat drivers. It's possible that the game does not have this and therefore the data does not exist in levvvel.
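
Since each entry has to be checked by hand, a small lookup can help list the levvvel titles worth inspecting for a given unknowncheats name. The sketch below searches on the first word of the name. The function name `find_candidates` is hypothetical, and the results still need manual review because substring matches can include unrelated games.

```python
import re

import pandas as pd


def find_candidates(uc_name: str, lvl_games: pd.Series) -> list:
    """List levvvel titles that contain the first word of the unknowncheats name."""
    first_word = uc_name.lower().split()[0]
    # Escape the word so punctuation in a name is not treated as regex syntax
    mask = lvl_games.str.contains(re.escape(first_word), case=False, na=False)
    return lvl_games[mask].tolist()


# Small sample of levvvel-style titles, for illustration only
example_lvl_games = pd.Series(
    ["arma 2", "arma 3", "battlefleet gothic: armada 2", "crysis"]
)
example_candidates = find_candidates("arma series", example_lvl_games)
print(example_candidates)
```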

First, we join the two dataframes directly, which works for a handful of entries:

In [14]:
# make the game names lower case for join:
uc_games_df['game_name'] = uc_games_df['game_name'].str.lower()
lvl_df['game'] = lvl_df['game'].str.lower()

In [15]:
# join the two dataframes to make new dataframe called uc_games_df_joined
uc_games_df_joined = uc_games_df.join(lvl_df.set_index('game'), on='game_name', how='left')

In [16]:
print(f"Number of entries where join succeeded: {len(uc_games_df_joined.loc[~uc_games_df_joined['software'].isnull()])}")
print(f"Number of entries where join failed: {len(uc_games_df_joined.loc[uc_games_df_joined['software'].isnull()])}")

Number of entries where join succeeded: 12
Number of entries where join failed: 33


We see that 12 out of the 45 rows now have the anticheat system appended. We obtain the rest by checking the games one-by-one.
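
As an alternative way to count matches, `DataFrame.merge` accepts `indicator=True`, which adds a `_merge` column recording whether each row found a partner. The sketch below uses toy tables rather than the real data.

```python
import pandas as pd

# Toy tables, for illustration only
left = pd.DataFrame({"game_name": ["apex legends", "arma series", "rust"]})
right = pd.DataFrame({
    "game": ["apex legends", "rust", "arma 2"],
    "software": ["Easy Anti-Cheat", "Easy Anti-Cheat", "BattlEye"],
})

# indicator=True labels each row as "both" (matched) or "left_only" (unmatched)
merged = left.merge(right, left_on="game_name", right_on="game", how="left", indicator=True)
match_counts = merged["_merge"].value_counts()
unmatched = merged.loc[merged["_merge"] == "left_only", "game_name"].tolist()
print(match_counts)
print(unmatched)
```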

### Joining the data: failed joins

We create a list, anticheat_needed_games, which contains the games from unknowncheats which we need to obtain the anti-cheat information for. This is simply used as a checklist while going through the data.

In [17]:
# create a list of the games we need to obtain the anticheat information for:
anticheat_needed_games = uc_games_df_joined.loc[uc_games_df_joined['software'].isnull()]['game_name'].tolist()

In [18]:
anticheat_needed_games

['arma series',
 'all points bulletin',
 'alliance of valiant arms',
 "america's army operations",
 'battlefield series',
 'call of duty series',
 'counter strike',
 'crossfire',
 'crysis series',
 'day of defeat',
 'fear',
 'grand theft auto v',
 'h1z1',
 'halo',
 'joint operations & dfx',
 'medal of honor series',
 'operation 7',
 'overwatch',
 'payday 2',
 'quake series',
 'rainbow six siege',
 'radical heights',
 'red dead redemption 2',
 'sea of thieves',
 'star wars battlefront',
 'star wars battlefront 2',
 'sudden attack',
 'swat 4',
 'titanfall',
 'team fortress 2',
 "tom clancy's the division",
 'unreal tournament',
 'war inc']

#### arma series

In [19]:
# arma series
lvl_df.loc[lvl_df['game'].str.contains('arma')]

Unnamed: 0,game,software,developer,publisher
18,arma 2,BattlEye,Bohemia Interactive,Bohemia Interactive
19,arma 2: operation arrowhead,BattlEye,Bohemia Interactive,505 Games
20,arma 3,BattlEye,Bohemia Interactive,Bohemia Interactive
45,battlefleet gothic: armada 2,Easy Anti-Cheat,Tindalos Interactive,Focus Home Interactive


According to wikipedia:

"Arma 2: Operation Arrowhead (Arma 2: OA; stylized as ARMA II: Operation Arrowhead) is a standalone expansion pack to Bohemia Interactive's tactical shooter Arma 2."

Since *operation arrowhead* is a standalone expansion pack, we will use the entry for "arma 2" (identical to that of "arma 3") for the arma series.

In [20]:
# create a new dataframe containing the anticheat information for games where the join failed. doing this to make sure we are not overwriting our source data and to have a clean join later
# we will use this to embellish our data later
anticheat_for_nonjoins = lvl_df.loc[lvl_df['game']=='arma 2'].reset_index(drop=True)
# change the game name in this new dataframe so the join is clean
anticheat_for_nonjoins.loc[anticheat_for_nonjoins['game']=='arma 2', 'game'] = 'arma series'

In [21]:
# look at resulting df
anticheat_for_nonjoins

Unnamed: 0,game,software,developer,publisher
0,arma series,BattlEye,Bohemia Interactive,Bohemia Interactive


In [22]:
# remove the game from our checklist since we obtained the info for it
anticheat_needed_games.remove('arma series')

#### all points bulletin

All Points Bulletin was renamed to "APB: Reloaded" when it was purchased by a different company.

source: https://en.wikipedia.org/wiki/APB:_All_Points_Bulletin

We create a helper function so we can keep adding the anticheat entries to the anticheat_for_nonjoins dataframe:

In [23]:
def add_anticheat_entry(uc_game_name: str, lvl_game_name: str, df: pd.DataFrame) -> pd.DataFrame:
    """
    uc_game_name: game name as appears in unknowncheats
    lvl_game_name: game name for levvvel
    df: anticheat_for_nonjoins dataframe
    adds game anticheat information to anticheat_for_nonjoins dataframe for clean join later
    removes the game from the anticheat_needed_games checklist
    """
    # join entry for game into anticheat_for_nonjoins
    new_df = pd.concat([df, lvl_df.loc[lvl_df['game']==lvl_game_name]]).reset_index(drop=True)
    # rename lvl game name to uc game name so it corresponds to unknowncheats game name
    new_df.loc[new_df['game']==lvl_game_name, 'game'] = uc_game_name
    
    # remove the game from our checklist since we obtained the info for it
    anticheat_needed_games.remove(uc_game_name)
    
    return new_df
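
Note that if `lvl_game_name` is misspelled, the lookup returns no rows, so nothing is added while the game is still removed from the checklist. A stricter variant could fail loudly instead. This is a sketch with hypothetical names (`add_entry_checked`, with the levvvel table and checklist passed in explicitly), not a replacement for the helper above.

```python
import pandas as pd


def add_entry_checked(uc_game_name, lvl_game_name, df, lvl_df, checklist):
    """Like add_anticheat_entry, but raise unless exactly one levvvel row matches."""
    match = lvl_df.loc[lvl_df["game"] == lvl_game_name]
    if len(match) != 1:
        raise ValueError(f"expected 1 row for {lvl_game_name!r}, found {len(match)}")
    # assign() returns a renamed copy, so the source table is left untouched
    renamed = match.assign(game=uc_game_name)
    checklist.remove(uc_game_name)
    return pd.concat([df, renamed], ignore_index=True)


# Toy inputs built from rows shown in this notebook
example_lvl = pd.DataFrame({
    "game": ["apb reloaded"], "software": ["BattlEye"],
    "developer": ["Little Orbit"], "publisher": ["Little Orbit"],
})
example_base = pd.DataFrame({
    "game": ["arma series"], "software": ["BattlEye"],
    "developer": ["Bohemia Interactive"], "publisher": ["Bohemia Interactive"],
})
example_checklist = ["all points bulletin"]
example_result = add_entry_checked(
    "all points bulletin", "apb reloaded", example_base, example_lvl, example_checklist
)
print(example_result)
```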

In [24]:
# take data for "apb reloaded" for "all points bulletin"
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='all points bulletin', 
                                             lvl_game_name='apb reloaded', 
                                             df=anticheat_for_nonjoins)

In [25]:
anticheat_for_nonjoins

Unnamed: 0,game,software,developer,publisher
0,arma series,BattlEye,Bohemia Interactive,Bohemia Interactive
1,all points bulletin,BattlEye,Little Orbit,Little Orbit


#### alliance of valiant arms

Alliance of valiant arms is also known as "a.v.a". There are two versions, "a.v.a" and "a.v.a: dog tag". According to wikipedia:

"Red Duck, Inc. attempted to self-publish a modified version of the game for these regions called Alliance of Valiant Arms: DOG TAG, launching open beta on May 2, 2019. AVA: Dog Tag was shut down on May 29, 2019.[7] The servers were also shut down for Taiwanese version on July 30, 2019 and September 25, 2019 for Chinese version."

Since the "dog tag" version was shut down after a month, we select the entry for "a.v.a".

In [26]:
# take data for "a.v.a" for "alliance of valiant arms"
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='alliance of valiant arms', 
                                             lvl_game_name='a.v.a', 
                                             df=anticheat_for_nonjoins)

In [27]:
anticheat_for_nonjoins

Unnamed: 0,game,software,developer,publisher
0,arma series,BattlEye,Bohemia Interactive,Bohemia Interactive
1,all points bulletin,BattlEye,Little Orbit,Little Orbit
2,alliance of valiant arms,XIGNCODE3,Red Duck,Red Duck


In [28]:
anticheat_needed_games

["america's army operations",
 'battlefield series',
 'call of duty series',
 'counter strike',
 'crossfire',
 'crysis series',
 'day of defeat',
 'fear',
 'grand theft auto v',
 'h1z1',
 'halo',
 'joint operations & dfx',
 'medal of honor series',
 'operation 7',
 'overwatch',
 'payday 2',
 'quake series',
 'rainbow six siege',
 'radical heights',
 'red dead redemption 2',
 'sea of thieves',
 'star wars battlefront',
 'star wars battlefront 2',
 'sudden attack',
 'swat 4',
 'titanfall',
 'team fortress 2',
 "tom clancy's the division",
 'unreal tournament',
 'war inc']

#### america's army operations

"America's army operations" is a series that includes "america's army", "america's army 3", and "america's army: proving grounds". Since they all have the same software, developer, and publisher, we will use the entry for "america's army".

In [29]:
# america's army
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='america\'s army operations', 
                                             lvl_game_name='america\'s army', 
                                             df=anticheat_for_nonjoins)

#### battlefield series

Battlefield is a large series with over a dozen games. Most of the games use PunkBuster as their anti-cheat software, with the exception of "battlefield 2042", which uses Easy Anti-Cheat. We will take the most common entries as the anti-cheat information for the battlefield series, keeping in mind that "battlefield 2042", the most recent game (released in 2021), uses a different anti-cheat system.

In [30]:
lvl_df.loc[lvl_df['game'].str.contains('battlefield')]

Unnamed: 0,game,software,developer,publisher
34,battlefield 1942,PunkBuster,DICE,Electronic Arts
35,battlefield 2,PunkBuster,DICE,Electronic Arts
36,battlefield 2042,Easy Anti-Cheat,DICE,Electronic Arts
37,battlefield 2142,PunkBuster,DICE,Electronic Arts
38,battlefield 3,PunkBuster,DICE,Electronic Arts
39,battlefield 4,PunkBuster,DICE,Electronic Arts
40,battlefield hardline,PunkBuster,Visceral Games,Electronic Arts
41,battlefield heroes,PunkBuster,DICE,Electronic Arts
42,battlefield play4free,PunkBuster,DICE,Electronic Arts
43,battlefield vietnam,PunkBuster,DICE,Electronic Arts
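
The "most common entry" for a series can also be computed rather than read off the table. A minimal sketch using `Series.mode` on a subset of the rows above (the variable names are hypothetical):

```python
import pandas as pd

# Subset of the battlefield rows shown above
battlefield = pd.DataFrame({
    "game": ["battlefield 1942", "battlefield 2042", "battlefield 3", "battlefield hardline"],
    "software": ["PunkBuster", "Easy Anti-Cheat", "PunkBuster", "PunkBuster"],
    "developer": ["DICE", "DICE", "DICE", "Visceral Games"],
    "publisher": ["Electronic Arts"] * 4,
})

# mode() returns the most frequent value(s) per column; take the first one
most_common = {
    col: battlefield[col].mode().iloc[0]
    for col in ["software", "developer", "publisher"]
}
print(most_common)
```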


In [31]:
# for battlefield series, take entry for battlefield 1942
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='battlefield series', 
                                             lvl_game_name='battlefield 1942', 
                                             df=anticheat_for_nonjoins)

#### call of duty series

Call of Duty has many games in the series with two primary anti-cheat systems, which are PunkBuster (for older games) and Ricochet (for newer games). For now, we will take the entry for call of duty: modern warfare ii since it is the most recent game in the series. 

In [32]:
# for call of duty series, take entry for call of duty: modern warfare ii
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='call of duty series', 
                                             lvl_game_name='call of duty: modern warfare ii', 
                                             df=anticheat_for_nonjoins)

#### counter strike

Counter strike has an anti-cheat system called "Valve Anti-Cheat" (a.k.a. "VAC"), which is made by the game's publisher. We will use this as its anti-cheat system.

https://counterstrike.fandom.com/wiki/Valve_Anti-Cheat

In [33]:
# take row where software is ESEA and change game name and software accordingly
lvl_df.loc[(lvl_df['game'].str.contains('counter'))&(lvl_df['software']=='ESEA'), 'game'] = 'counter strike'
lvl_df.loc[(lvl_df['game'].str.contains('counter'))&(lvl_df['software']=='ESEA'), 'software'] = 'Valve Anti-Cheat'

In [34]:
# add to anticheat_for_nonjoins dataframe
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='counter strike', 
                                             lvl_game_name='counter strike', 
                                             df=anticheat_for_nonjoins)

#### crossfire

Desk research shows that crossfire switched anti-cheat systems in 2018 from XTrap to XIGNCODE3. According to wikipedia, the publisher is Smilegate Entertainment, which exists in lvl_df as "Smilegate". The developer is "Wellbia".

https://crossfirefps.fandom.com/wiki/XTrap

https://en.wikipedia.org/wiki/Crossfire_(2007_video_game)

https://crossfirefps.fandom.com/wiki/XIGNCODE3

In [35]:
# create dataframe with entry for crossfire using the values from research
crossfire_df = pd.DataFrame({'game': 'crossfire', 'software': 'XIGNCODE3', 'developer': 'Wellbia', 'publisher': 'Smilegate'}, index=[6])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, crossfire_df])
# delete crossfire_df as we do not need it anymore
del crossfire_df
# remove from checklist
anticheat_needed_games.remove('crossfire')
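
Manually researched rows like this one are added with a hand-picked index value. A small helper using `ignore_index=True` would avoid tracking the index by hand. The helper name `add_manual_entry` is hypothetical, and this is only a sketch of one possible approach.

```python
import pandas as pd


def add_manual_entry(df, game, software, developer, publisher):
    """Append one manually researched row and renumber the index."""
    row = pd.DataFrame([{
        "game": game, "software": software,
        "developer": developer, "publisher": publisher,
    }])
    # ignore_index=True renumbers rows 0..n-1, so no index needs to be chosen by hand
    return pd.concat([df, row], ignore_index=True)


# Toy starting table built from a row shown in this notebook
example_table = pd.DataFrame({
    "game": ["arma series"], "software": ["BattlEye"],
    "developer": ["Bohemia Interactive"], "publisher": ["Bohemia Interactive"],
})
example_table = add_manual_entry(example_table, "crossfire", "XIGNCODE3", "Wellbia", "Smilegate")
print(example_table)
```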

#### crysis series

"Crysis series" is a series that includes a few games. Since they all have the same software, developer, and publisher, we will use the entry for "crysis".

In [36]:
lvl_df.loc[lvl_df['game'].str.lower().str.contains('crysis')]

Unnamed: 0,game,software,developer,publisher
83,crysis,PunkBuster,Crytek,Electronic Arts
84,crysis: warhead,PunkBuster,Crytek,Electronic Arts


In [37]:
# for crysis series, take entry for crysis
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='crysis series', 
                                             lvl_game_name='crysis', 
                                             df=anticheat_for_nonjoins)

#### day of defeat

"Day of defeat" was released in 2003. It is a game by Valve, but information regarding the game's anti-cheat system is very hard to find. It will be removed from the dataset.

In [38]:
# drop day of defeat
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='day of defeat'].index)
# remove from checklist
anticheat_needed_games.remove('day of defeat')

#### fear

In [39]:
# for fear series, take entry for f.e.a.r
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='fear', 
                                             lvl_game_name='f.e.a.r.', 
                                             df=anticheat_for_nonjoins)

#### grand theft auto v

Grand Theft Auto V does have an anti-cheat system made by the game's publisher. However, there is no information available on what kind of anti-cheat system it is and whether it is kernel-level. Also, GTA V is not really a traditional FPS game: it is traditionally played in third-person mode and has strong components of an RPG (role-playing game). Therefore any cheats/hacks it has may not necessarily be related to FPS games. It will therefore be removed from the data as an anomalous game.

https://www.ginx.tv/en/gta-online/anti-cheat

https://www.thegamer.com/gta-v-reasons-its-a-secret-rpg-not/#:~:text=This%20has%20led%20to%20an,mixed%20with%20a%20massive%20sandbox.

In [40]:
# drop grand theft auto v
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='grand theft auto v'].index)
# remove from checklist
anticheat_needed_games.remove('grand theft auto v')

#### h1z1

Z1 Battle Royale was formerly known as H1Z1. Since all the entries have the same software, developer, and publisher, we will use the entry for "z1 battle royale" for h1z1.

https://en.wikipedia.org/wiki/Z1_Battle_Royale

In [41]:
lvl_df.loc[lvl_df['game'].str.lower().str.contains('z1')]

Unnamed: 0,game,software,developer,publisher
143,h1z1: just survive,BattlEye,Daybreak Game Company,Daybreak Game Company
144,h1z1: king of the hill,BattlEye,Daybreak Game Company,Daybreak Game Company
319,z1 battle royale,BattlEye,Daybreak Game Company,Daybreak Game Company


In [42]:
# for h1z1, take entry for z1 battle royale
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='h1z1', 
                                             lvl_game_name='z1 battle royale', 
                                             df=anticheat_for_nonjoins)

#### halo

There is only one entry in levvvel for halo which is for "halo: the master chief collection":

In [43]:
lvl_df.loc[lvl_df['game'].str.lower().str.contains('halo')]

Unnamed: 0,game,software,developer,publisher
145,halo: the master chief collection,Easy Anti-Cheat,343 Industries,Xbox Game Studios


This is a bundle that includes the following six games:
* Halo: Reach, Halo: Combat Evolved Anniversary, Halo 2: Anniversary, Halo 3, Halo 3: ODST Campaign, and Halo 4

This entry will therefore be used for halo.

https://store.steampowered.com/app/976730/Halo_The_Master_Chief_Collection/

In [44]:
# for halo, take entry for halo: the master chief collection
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='halo', 
                                             lvl_game_name='halo: the master chief collection', 
                                             df=anticheat_for_nonjoins)

#### joint operations & dfx

This is an entry which combines two video games: "Joint Operations: Typhoon Rising" and "Delta Force Xtreme" (dfx). Since there is no way to parse the thread counts for the games and since the games are very small anyway, we will drop this entry.

In [45]:
lvl_df.loc[lvl_df['game'].str.lower().str.contains('delta')]

Unnamed: 0,game,software,developer,publisher


In [46]:
# drop joint operations & dfx
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='joint operations & dfx'].index)
# remove from checklist
anticheat_needed_games.remove('joint operations & dfx')

#### medal of honor series

"Medal of honor series" is a series that includes "medal of honor", "medal of honor: airborne", and "medal of honor: warfighter". Since they all have the same software, developer, and publisher, we will use the entry for "medal of honor".

In [47]:
# for Medal of honor series, take entry for Medal of honor
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='medal of honor series', 
                                             lvl_game_name='medal of honor', 
                                             df=anticheat_for_nonjoins)

#### operation 7

Operation 7 has no information available on its anti-cheat system. It also has no wikipedia article for the game (the wikipedia article titled Operation 7 is about something else). It will be removed from the data.

In [48]:
# drop operation 7
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='operation 7'].index)
# remove from checklist
anticheat_needed_games.remove('operation 7')

#### overwatch

Overwatch refers to "Overwatch 2", which replaced the original "Overwatch". The anti-cheat is the same for both and created by the game publisher.

In [49]:
# for overwatch, take entry for overwatch 2
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='overwatch', 
                                             lvl_game_name='overwatch 2', 
                                             df=anticheat_for_nonjoins)

#### payday 2

According to the Steam community (Steam is a platform that sells many PC video games), Payday 2 does not have an anti-cheat system. We will therefore create a new category called "None" to denote the lack of an anti-cheat system.
According to wikipedia, the developer is Overkill Software and the publisher is 505 Games.

https://steamcommunity.com/app/218620/discussions/8/1354868867727122982/

https://en.wikipedia.org/wiki/Payday_2

In [50]:
# create dataframe with entry for payday2 using the values from research
payday2_df = pd.DataFrame({'game': 'payday 2', 'software': 'None', 'developer': 'Overkill Software', 'publisher': '505 Games'}, index=[13])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, payday2_df])
# delete payday2_df as we do not need it anymore
del payday2_df
# remove from checklist
anticheat_needed_games.remove('payday 2')
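
One caveat with using the literal string "None" as a category: if the table is later written to csv and read back, the default missing-value handling of `pd.read_csv` may turn it into NaN in some pandas versions (this is an assumption worth checking against the installed version). A minimal sketch of a round trip that keeps the string intact:

```python
import io

import pandas as pd

# Toy table, for illustration only
example_out = pd.DataFrame({"game": ["payday 2"], "software": ["None"]})

# Write to an in-memory csv and read it back
buffer = io.StringIO()
example_out.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False disables the built-in list of missing-value markers;
# only empty fields are treated as missing here
restored = pd.read_csv(buffer, keep_default_na=False, na_values=[""])
print(restored)
```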

#### quake series

Quake series includes quake 4, quake iii arena, and quake live. Its current publisher is Bethesda Softworks. We will therefore take the entry from "quake live".

https://en.wikipedia.org/wiki/Quake_(series)

In [51]:
# for quake series, take entry for quake live
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='quake series', 
                                             lvl_game_name='quake live', 
                                             df=anticheat_for_nonjoins)

#### rainbow six siege

Rainbow six siege is called "tom clancy's rainbow six siege" in levvvel so we rename it.

In [52]:
# rainbow six siege
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='rainbow six siege', 
                                             lvl_game_name='tom clancy\'s rainbow six siege', 
                                             df=anticheat_for_nonjoins)

#### radical heights

Radical heights was a game which lasted less than two months. It will be removed from the dataset.

https://gamingdatabase.fandom.com/wiki/Radical_Heights

In [53]:
# drop radical heights
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='radical heights'].index)
# remove from checklist
anticheat_needed_games.remove('radical heights')

#### red dead redemption 2

According to the Steam community, Red Dead Redemption 2 does not have an anti-cheat system. We substitute "None".
It is published and developed by Rockstar Games.

https://steamcommunity.com/app/1174180/discussions/0/1733258352678491181/

https://en.wikipedia.org/wiki/Red_Dead_Redemption_2

In [54]:
# create dataframe with entry for red dead redemption 2 using the values from research
rdd2_df = pd.DataFrame({'game': 'red dead redemption 2', 'software': 'None', 'developer': 'Rockstar Games', 'publisher': 'Rockstar Games'}, index=[16])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, rdd2_df])
# delete rdd2_df as we do not need it anymore
del rdd2_df
# remove from checklist
anticheat_needed_games.remove('red dead redemption 2')
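
The manual entries for payday 2 and red dead redemption 2 follow the same three steps: build a one-row dataframe, concatenate it, and remove the game from the checklist. A minimal sketch of how these steps could be wrapped in one helper is shown below. The function name `add_manual_entry`, the toy dataframe, and the placeholder first row are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def add_manual_entry(df, checklist, game, software, developer, publisher):
    """Append one hand-researched row to df and tick the game off the checklist."""
    new_row = pd.DataFrame([{'game': game, 'software': software,
                             'developer': developer, 'publisher': publisher}])
    # ignore_index=True renumbers the rows, so no hard-coded index is needed
    df = pd.concat([df, new_row], ignore_index=True)
    # list.remove raises ValueError if the game is not on the checklist,
    # which catches typos in the game name early
    checklist.remove(game)
    return df

# toy stand-ins for anticheat_for_nonjoins and anticheat_needed_games
demo_df = pd.DataFrame([{'game': 'example game', 'software': 'Example AC',
                         'developer': 'Example Dev', 'publisher': 'Example Pub'}])
demo_checklist = ['payday 2', 'red dead redemption 2']

demo_df = add_manual_entry(demo_df, demo_checklist, 'payday 2',
                           'None', 'Overkill Software', '505 Games')
demo_df = add_manual_entry(demo_df, demo_checklist, 'red dead redemption 2',
                           'None', 'Rockstar Games', 'Rockstar Games')
print(demo_df)
```

Using `ignore_index=True` would also make a later `reset_index` call unnecessary, at the cost of no longer preserving the hand-picked index values.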

#### sea of thieves

Sea of thieves does have an anti-cheat system, but we could not find the name of the system during our research. It is therefore removed from the dataset.

In [55]:
# drop sea of thieves
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='sea of thieves'].index)
# remove from checklist
anticheat_needed_games.remove('sea of thieves')

#### star wars battlefront and star wars battlefront 2

According to the Steam community, neither star wars battlefront nor star wars battlefront 2 has an anti-cheat system.
The developer is Pandemic Studios and the publisher is Electronic Arts.

https://steamcommunity.com/app/6060/discussions/0/1692659769948660888/

https://en.wikipedia.org/wiki/Star_Wars:_Battlefront

In [56]:
lvl_df.loc[lvl_df['developer'].str.lower().str.contains('pandemic')]

Unnamed: 0,game,software,developer,publisher
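
The lookup above returns an empty table, which means no levvvel row lists Pandemic as the developer. As a small sketch of the same kind of search, `str.contains` also accepts `case=False` and `na=False`, so the lowercase step can be skipped and missing developer values count as non-matches. The toy dataframe below is made up for illustration and does not reflect the real levvvel data.

```python
import pandas as pd

# toy stand-in for lvl_df; one developer value is deliberately missing
toy_lvl_df = pd.DataFrame({
    'game': ['game a', 'game b', 'game c'],
    'developer': ['Pandemic Studios', None, 'Other Studio'],
})

# case=False ignores capitalisation, na=False treats missing values as "no match"
mask = toy_lvl_df['developer'].str.contains('pandemic', case=False, na=False)
matches = toy_lvl_df.loc[mask]
print(matches)
```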


In [57]:
# star wars battlefront

# create dataframe with entry using the values from research
sw_df = pd.DataFrame({'game': 'star wars battlefront', 'software': 'None', 'developer': 'Pandemic Studios', 'publisher': 'Electronic Arts'}, index=[17])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, sw_df])
# delete sw_df as we do not need it anymore
del sw_df
# remove from checklist
anticheat_needed_games.remove('star wars battlefront')

# star wars battlefront 2

# create dataframe with entry using the values from research
sw_df2 = pd.DataFrame({'game': 'star wars battlefront 2', 'software': 'None', 'developer': 'Pandemic Studios', 'publisher': 'Electronic Arts'}, index=[18])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, sw_df2])
# delete sw_df2 as we do not need it anymore
del sw_df2
# remove from checklist
anticheat_needed_games.remove('star wars battlefront 2')

#### sudden attack

No information could be found about the anti-cheat system for sudden attack. It is therefore removed from the dataset.

In [58]:
# drop sudden attack
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='sudden attack'].index)
# remove from checklist
anticheat_needed_games.remove('sudden attack')

#### swat 4

No information could be found about the anti-cheat system for swat 4. It is therefore removed from the dataset.

In [59]:
# drop swat 4
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='swat 4'].index)
# remove from checklist
anticheat_needed_games.remove('swat 4')
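
Several games in this section are dropped with the same two statements. As a hedged sketch, the drops could be done in a single pass with `isin`. The toy dataframe, its thread counts, and the variable names below are invented for illustration only.

```python
import pandas as pd

# toy stand-ins for uc_games_df_joined and anticheat_needed_games
toy_games = pd.DataFrame({'game_name': ['sudden attack', 'swat 4', 'titanfall'],
                          'uc_threads': [10, 20, 30]})
toy_checklist = ['sudden attack', 'swat 4', 'titanfall']

games_to_drop = ['sudden attack', 'swat 4']

# keep only the rows whose game_name is NOT in the drop list
toy_games = toy_games[~toy_games['game_name'].isin(games_to_drop)]
# remove the same games from the checklist
toy_checklist = [g for g in toy_checklist if g not in games_to_drop]
print(toy_games)
```

Unlike `list.remove`, the list comprehension does not raise an error when a name is missing from the checklist, so a misspelled game name would pass silently.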

#### titanfall

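Titanfall uses an anti-cheat system called Fairfight, made by the third-party company Gameblocks LLC. The game was developed by Respawn Entertainment (now owned by Electronic Arts) and published by Electronic Arts. Note that the entry below records Gameblocks, the anti-cheat vendor, in the developer column.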

https://www.forbes.com/sites/danielnyegriffiths/2014/03/27/titanfalls-anti-cheat-system-activates-is-beautiful-and-hilarious/?sh=727f39bb35eb

In [60]:
# create dataframe with entry using the values from research
titanfall_df = pd.DataFrame({'game': 'titanfall', 'software': 'Fairfight', 'developer': 'Gameblocks', 'publisher': 'Electronic Arts'}, index=[19])
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, titanfall_df])
# delete titanfall_df as we do not need it anymore
del titanfall_df
# remove from checklist
anticheat_needed_games.remove('titanfall')

#### team fortress 2

Team Fortress 2 uses Valve Anti-Cheat, the same system that is used for Counter-Strike.

https://wiki.teamfortress.com/wiki/Valve_Anti-Cheat

In [61]:
# make new dataframe with the same anti-cheat info as counter strike
tf2_df = lvl_df.loc[lvl_df['software'].str.lower().str.contains('valve')].copy()
# change game name to tf2
tf2_df['game'] = 'team fortress 2'
# add the data to anticheat_for_nonjoins
anticheat_for_nonjoins = pd.concat([anticheat_for_nonjoins, tf2_df])
# delete tf2_df as we do not need it anymore
del tf2_df
# remove from checklist
anticheat_needed_games.remove('team fortress 2')

In [62]:
# reset index
anticheat_for_nonjoins.reset_index(inplace=True, drop=True)
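
The team fortress 2 entry is built by copying every levvvel row whose software contains "valve". If more than one row matches, the game appears several times, and a later left join on the game name would then duplicate that game's row. The sketch below shows one way to check for and remove such duplicates before joining. The toy data is invented and is not the notebook's actual content.

```python
import pandas as pd

# toy stand-in for anticheat_for_nonjoins with a duplicated game
toy_anticheat = pd.DataFrame({
    'game': ['team fortress 2', 'team fortress 2', 'titanfall'],
    'software': ['Valve Anti-Cheat', 'Valve Anti-Cheat', 'Fairfight'],
})

# list every game name that occurs more than once
dupes = toy_anticheat.loc[toy_anticheat['game'].duplicated(keep=False), 'game'].unique()
print(dupes)

# keep the first row per game so the join key is unique
deduped = toy_anticheat.drop_duplicates(subset='game', keep='first').reset_index(drop=True)
```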

#### tom clancy's the division

This game is listed as "tom clancy's the division 2" in levvvel, so we look it up under that name.

In [63]:
# tom clancy's the division
anticheat_for_nonjoins = add_anticheat_entry(uc_game_name='tom clancy\'s the division', 
                                             lvl_game_name='tom clancy\'s the division 2', 
                                             df=anticheat_for_nonjoins)

#### unreal tournament

The first Unreal Tournament was released in 1999, and the series consists of several games. Because the series is so old, it has used multiple anti-cheat systems and has also gone through a period without one. Since it has no single, clear anti-cheat system (and because the servers for the most recent game were shut down), it is excluded from the dataset.

https://en.wikipedia.org/wiki/Unreal_Tournament

https://en.wikipedia.org/wiki/Unreal_Tournament_3

In [64]:
# drop unreal tournament
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='unreal tournament'].index)
# remove from checklist
anticheat_needed_games.remove('unreal tournament')

#### war inc

War Inc. was published in 1997. Because the unknowncheats forum only started after 2000, the game is removed from the dataset (posts and discussions about it likely exist on other websites that we do not capture).

https://en.wikipedia.org/wiki/War_Inc.

In [65]:
# drop war inc
uc_games_df_joined = uc_games_df_joined.drop(uc_games_df_joined[uc_games_df_joined['game_name']=='war inc'].index)
# remove from checklist
anticheat_needed_games.remove('war inc')

### Joining the data: joining cleaned data for failed joins

We can now fill in the gaps in our unknowncheats dataset using the values obtained from the research and cleaning above.

In [66]:
# join uc_games_df_joined to anticheat_for_nonjoins
uc_games_df_joined = uc_games_df_joined.join(anticheat_for_nonjoins.set_index('game'), 
                                             on='game_name', 
                                             how='left', 
                                             rsuffix='_cleaned')

# replace NaNs for software, developer, and publisher with the cleaned entries
for col in ['software', 'developer', 'publisher']:
    cleaned_col = col + '_cleaned'
    uc_games_df_joined[col] = np.where(uc_games_df_joined[col].isnull(), uc_games_df_joined[cleaned_col], uc_games_df_joined[col])

In [67]:
# drop the "cleaned" columns since this data has been added to the original software, developer, and publisher columns
uc_games_df_joined.drop(columns=[col + '_cleaned' for col in ['software', 'developer', 'publisher']], 
                        inplace=True)
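
To make the join-and-fill step easier to follow, here is a minimal sketch on toy data. It uses the same `join` call with `rsuffix='_cleaned'`, but fills the gaps with `fillna` instead of `np.where`. For a dataframe with a unique index the two approaches should give the same result. The two toy dataframes are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# left: scraped data where the first join failed for payday 2
left = pd.DataFrame({'game_name': ['counter strike', 'payday 2'],
                     'software': ['Valve Anti-Cheat', np.nan]})
# right: hand-researched entries, keyed by 'game'
right = pd.DataFrame({'game': ['payday 2'], 'software': ['None']})

# overlapping column names from the right side get the '_cleaned' suffix
joined = left.join(right.set_index('game'), on='game_name',
                   how='left', rsuffix='_cleaned')

# fill only the missing values; existing values are left untouched
joined['software'] = joined['software'].fillna(joined['software_cleaned'])
joined = joined.drop(columns=['software_cleaned'])
print(joined)
```

Note that the category "None" is stored as a string, so `isnull()` does not count it as missing.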

In [68]:
# check for any nulls
uc_games_df_joined.isnull().sum()

game_name       0
uc_game_link    0
uc_threads      0
uc_posts        0
wiki_content    0
software        0
developer       0
publisher       0
dtype: int64
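
The null check above is read by eye. As a hedged sketch, the same check could be turned into a guard that stops the notebook if any column still has nulls or if any game remains on the checklist. The function name `check_complete` and the toy dataframes are hypothetical.

```python
import numpy as np
import pandas as pd

def check_complete(df, checklist):
    """Raise if any column still has nulls or any game is still on the checklist."""
    null_counts = df.isnull().sum()
    missing = null_counts[null_counts > 0]
    if not missing.empty:
        raise ValueError(f'columns with nulls: {list(missing.index)}')
    if checklist:
        raise ValueError(f'games still needing anti-cheat data: {checklist}')
    return True

# toy stand-ins for uc_games_df_joined
toy_complete = pd.DataFrame({'game_name': ['payday 2'], 'software': ['None']})
toy_incomplete = pd.DataFrame({'game_name': ['swat 4'], 'software': [np.nan]})

check_complete(toy_complete, [])  # returns True, nothing is missing
try:
    check_complete(toy_incomplete, [])
except ValueError as err:
    print(err)
```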

### Data Save

We now have a dataset of FPS games, each with its thread and post counts from unknowncheats, its Wikipedia entry, and its anti-cheat system.

We continue cleaning and processing this data in the next notebook. The data is exported below so that it can be passed to those next steps.

In [69]:
# reset index
uc_games_df_joined.reset_index(inplace=True, drop=True)
# save uc_games_df_joined to pickle
uc_games_df_joined.to_pickle('../data/uc_games_df_joined')
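
A quick way to confirm that the export worked is to read the pickle back and compare it with the original dataframe. The sketch below does this round trip on a toy dataframe in a temporary directory, so it does not touch `../data`. The file name `toy_games.pkl` is an assumption for illustration.

```python
import os
import tempfile
import pandas as pd

# toy stand-in for uc_games_df_joined
toy = pd.DataFrame({'game_name': ['payday 2'], 'software': ['None']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_games.pkl')
    toy.to_pickle(path)              # same call as in the cell above
    reloaded = pd.read_pickle(path)  # the next notebook would load it like this

# equals() checks values, dtypes, and index
print(reloaded.equals(toy))
```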

(sources for data): 
* https://levvvel.com/games-with-kernel-level-anti-cheat-software/
* https://stackoverflow.com/questions/54750165/why-is-the-gethref-returning-none-on-a-bs4-element-tag

For the Wikipedia entries, see below:

In [70]:
for source in wikipedia_sources:
    print(f'{source}')

https://en.wikipedia.org/wiki/Arma_(series)
https://en.wikipedia.org/wiki/Alliance_of_Valiant_Arms
https://en.wikipedia.org/wiki/America%27s_Army
https://en.wikipedia.org/wiki/Battlefield_(video_game_series)
https://en.wikipedia.org/wiki/Call_of_Duty
https://en.wikipedia.org/wiki/Combat_arms
https://en.wikipedia.org/wiki/Counter-Strike
https://en.wikipedia.org/wiki/Crossfire_(2007_video_game)
https://en.wikipedia.org/wiki/Day_of_Defeat
https://en.wikipedia.org/wiki/Escape_from_Tarkov
https://en.wikipedia.org/wiki/Far_Cry
https://en.wikipedia.org/wiki/Grand_Theft_Auto_V
https://en.wikipedia.org/wiki/Z1_Battle_Royale
https://en.wikipedia.org/wiki/Ford_Motor_Company
https://en.wikipedia.org/wiki/Medal_of_Honor_(video_game_series)
https://en.wikipedia.org/wiki/Operation_7
https://en.wikipedia.org/wiki/Paladin
https://en.wikipedia.org/wiki/Payday_2
https://en.wikipedia.org/wiki/PlanetSide_2
https://en.wikipedia.org/wiki/Quake_(series)
https://en.wikipedia.org/wiki/Tom_Clancy%27s_Rainbow_Six