In [1]:
import numpy as np
import pandas as pd
import requests
import re

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 40)

The goal of this notebook is to clean the data and perform exploratory data analysis.

In [3]:
# read data
cod = pd.read_csv('data/cod_data_pull.csv')
ow = pd.read_csv('data/overwatch2_data_pull.csv')

In [4]:
# drop unnecessary index columns
cod.drop(columns=['Unnamed: 0'], inplace=True)
ow.drop(columns=['Unnamed: 0'], inplace=True)

cod has 48,182 rows:
* 29,062 have null selftext
* 13,948 are deleted or removed

ow has 20,247 rows:
* 11,556 have null selftext
* 5,485 are deleted or removed

Since a majority of the rows have null selftext, we keep those posts to preserve most of our data. However, we remove posts that were deleted or removed, because (1) there are fewer of them and (2) their selftext is only a `[removed]` or `[deleted]` placeholder.
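
The counts above can be reproduced along these lines. This is a minimal sketch on a hypothetical toy frame (`posts` is not part of the notebook), assuming deleted and removed posts are marked with the `[removed]` and `[deleted]` placeholders in `selftext`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real cod / ow data pulls
posts = pd.DataFrame({
    'selftext': ['some text', None, '[removed]', '[deleted]', None],
})

placeholders = ['[removed]', '[deleted]']

n_rows = len(posts)
n_null = posts['selftext'].isna().sum()
n_deleted_or_removed = posts['selftext'].isin(placeholders).sum()

# Dropping only the placeholder rows keeps the rows with null selftext
kept = posts.loc[~posts['selftext'].isin(placeholders)]

print(n_rows, n_null, n_deleted_or_removed, len(kept))
```

Rows with null selftext survive this filter because a null value is never equal to either placeholder string.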

In [5]:
# remove "removed" and "deleted" posts
cod = cod.loc[(cod['selftext']!='[removed]')&(cod['selftext']!='[deleted]')]
ow = ow.loc[(ow['selftext']!='[removed]')&(ow['selftext']!='[deleted]')]

In [6]:
# combine two dataframes to create master dataframe we will use for modeling
ow_cod_df = pd.concat([cod, ow])

We define our target variable $y$ as 1 if the post comes from the Overwatch subreddit and 0 if it comes from the ModernWarfareII (COD) subreddit:

In [7]:
ow_cod_df.loc[:, 'subreddit_ow'] = np.where(ow_cod_df['subreddit']=='Overwatch', 1, 0)

We will keep the following columns:
* `title`: title, used for classification
* `selftext`: post text, used for classification
* `author`: author name, not used for classification
* `created_utc`: created date, not used for classification
* `utc_datetime_str`: date string, not used for classification

We will use `title` and `selftext` for classification because we are interested in using language as a classifier.

In [8]:
ow_cod_df = ow_cod_df[['subreddit_ow', 'title', 'selftext', 'author', 'created_utc', 'utc_datetime_str']].copy()

In [9]:
ow_cod_df.head(2)

Unnamed: 0,subreddit_ow,title,selftext,author,created_utc,utc_datetime_str
3,0,Unlimited double xp tokens,I’m pretty sure it’s a glitch and I’m not comp...,Wokindajuice,1671066265,2022-12-15 01:04:25
4,0,"So i just unlocked orion camo, but i kinda reg...",,Lanky-Arm-1373,1671066238,2022-12-15 01:03:58


#### Primary data cleaning

In [10]:
# we start with 51,735 posts
len(ow_cod_df)

51735

In [11]:
# drop duplicates for title and author
ow_cod_df.drop_duplicates(subset=['title', 'author'], inplace=True)

In [12]:
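# fill null values for selftext (assign back rather than using inplace on a column selection,
# which may not update the dataframe in recent pandas versions)
ow_cod_df['selftext'] = ow_cod_df['selftext'].fillna('')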

In [13]:
# we combine the title and the selftext and will perform NLP on the combined text data
ow_cod_df['title_selftext'] = ow_cod_df['title'] + ' ' + ow_cod_df['selftext']
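
Filling the null selftext values first matters for this concatenation. The sketch below uses a hypothetical two-row frame (`toy` is not part of the notebook) to show that combining a title with a null selftext gives a null result, while filling with an empty string keeps the title:

```python
import pandas as pd

# Hypothetical toy frame: the second post has no selftext
toy = pd.DataFrame({
    'title': ['Patch notes', 'Help'],
    'selftext': ['New map added', None],
})

# Without filling, the null selftext propagates to the combined column
without_fill = toy['title'] + ' ' + toy['selftext']

# With filling, the title is preserved for posts that have no selftext
with_fill = toy['title'] + ' ' + toy['selftext'].fillna('')

print(without_fill.isna().tolist())
print(with_fill.tolist())
```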

In [14]:
# remove '[View Poll]', which occurs as a string in posts with polls
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace('[View Poll]', ' '))

In [15]:
# make title_selftext lower case (we are not interested in use of capitalization)
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].str.lower()

We aim to remove text that directly references Overwatch 2 or COD MW2, since classifying posts by self-referential text is not interesting. We are more interested in the subtler similarities and differences in language across shooter subreddits.

This includes removing the names of each game, character names, and weapon names.

In [16]:
# remove overwatch2 character names
overwatch_names = ['ana', 'ashe', 'baptiste', 'bastion', 'brigitte', 'cassidy', 'd.va', 'doomfist', 'echo', 'genji', 
                   'hanzo', 'junker queen', 'junkrat', 'kiriko', 'lucio', 'mei', 'mercy', 'moira', 'orisa', 'pharah', 
                   'ramattra', 'reaper', 'reinhardt', 'roadhog', 'sigma', 'sojourn', 'soldier 76', 'sombra', 'symmetra', 
                   'torbjorn', 'tracer', 'widowmaker', 'winston', 'wrecking ball', 'zarya', 'zenyatta']
for name in overwatch_names:
    ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace(name, ''))

# source: https://overwatch.fandom.com/wiki/Heroes

In [17]:
# remove call of duty weapon names
cod_weapons = ['Chimera', 'Lachmann-556', 'STB 556', 'M4', 'M16', 'Kastov 762', 'Kastov-74u', 'Kastov 545', 'M13B', 'TAQ-56', 'TAQ-V', 
               'SO-14', 'FTAC Recon', 'Lachmann-762', 'Lachmann Sub', 'BAS-P', 'MX9', 'Vaznev-9K', 'FSS Hurricane', 'Minibak', 'PDSW 528', 
               'VEL 46', 'Fennec 45', 'Lockwood 300', 'Bryson 800', 'Bryson 890', 'Expedite 12', 'RAAL MG', 'HCR 56', '556 Icarus', 'RPK', 
               'RAPP H', 'Sakin MG38', 'LM-S', 'SP-R 208', 'EBR-14', 'SA-B 50', 'Lockwood MK2', 'TAQ-M', 'MCPR-300', 'Victus XMR', 'Signal 50', 
               'LA-B 330', 'SP-X 80', 'X12', 'X13 Auto', '.50 GS', 'P890', 'Basilisk', 'RPG-7', 'Pila', 'JOKR', 'Strela-P', 'Riot Shields', 'Riot Shield']
for weapon in cod_weapons:
    ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace(weapon.lower(), ''))
# source: https://www.gamesatlas.com/cod-modern-warfare-2/weapons/

In [18]:
# remove direct references to COD and OW2 games
cod_ow_game_references = ['modern warfare ii', 'modern warfareii', 'modernwarfareii', 'mwii', 'modernwarfare', 'mw', 'overwatch', 
                          'warzone', 'cod', 'duty', 'modern', 'warfare', 'ow', 'wz']
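# NOTE: plain substring replacement also strips short terms from inside longer words
# (ex. "ow" turns "know" into "kn"); the intended fix of matching "ow" only as a whole word is not implemented yet
# TODO: FIGURE THIS OUT (ow)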
for game_reference in cod_ow_game_references:
    ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace(game_reference, ''))
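
One possible way to address the TODO above is to match terms only as whole words using regex word boundaries. This is a sketch, not what the notebook does: `game_references`, `pattern`, and `remove_whole_words` are hypothetical names, and the term list is a short illustrative subset of the one above.

```python
import re

# Illustrative subset of the reference list above (hypothetical, not the full list)
game_references = ['modern warfare ii', 'overwatch', 'cod', 'ow', 'mw']

# \b matches only at word boundaries, so "ow" no longer matches inside "know"
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(term) for term in game_references) + r')\b'
)

def remove_whole_words(text):
    """Remove the listed terms only where they appear as whole words."""
    return pattern.sub('', text)

naive = 'i know ow is fun'.replace('ow', '')
bounded = remove_whole_words('i know ow is fun')

print(repr(naive))
print(repr(bounded))
```

Applying this with `ow_cod_df['title_selftext'].apply(remove_whole_words)` would remove fewer characters than the substring approach, so the row counts reported later in this notebook would not necessarily be reproduced.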

Looking at the most frequent words reveals other terms that directly reference either COD or Overwatch 2. Since we aim to classify posts by their underlying language, we remove these from `title_selftext` as well.
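
The notebook does not show how the most frequent words were inspected. One simple way to do it, sketched here on hypothetical toy posts with a hypothetical helper `top_words`, is to count whitespace-separated tokens per subreddit label:

```python
from collections import Counter

# Hypothetical toy posts as (label, cleaned text); 1 = Overwatch, 0 = COD
posts = [
    (0, 'best camo grind for the dmz'),
    (0, 'camo challenges are bugged'),
    (1, 'best dps for ranked'),
    (1, 'dps queue times are long'),
]

def top_words(posts, label, n=3):
    """Return the n most common whitespace-separated tokens for one label."""
    counts = Counter()
    for post_label, text in posts:
        if post_label == label:
            counts.update(text.split())
    return counts.most_common(n)

print(top_words(posts, label=0, n=1))
print(top_words(posts, label=1, n=1))
```

Terms that dominate one label but are absent from the other are the kind of game-specific vocabulary this step aims to catch.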

Words we are removing and the reason:

| Word | Description/ Reason for Excluding |
| --- | --- |
| dmz | gamemode in COD |
| blizzard | company that created Overwatch2 |
| dps | damage per second; used in Overwatch2 (in COD they use "ttk" or "time to kill" instead) |
| camo | weapon camos, only exist in COD |
| attachments | weapon attachments, only exist in COD |
| campaign | campaign mode, only exists in COD |
| blueprint | weapon blueprints, only exist in COD |
| perk | character perks, only exist in COD |
| exfil | term used in dmz gamemode in COD |
| hog | roadhog, character shorthand in Overwatch2 |
| uav | drone in COD |
| shipment | map in COD |
| hardpoint | gamemode in COD |
| zen | shorthand for zenyatta, character in Overwatch2 |
| potg | "play of the game", Overwatch2 shows best play of the game after each game |
| cdl | shorthand for "Call of Duty League" |
| polyatomic | name of a camo in COD |
| extraction | term used in dmz gamemode in COD (same as "exfil") |
| hero/heroes | the characters in Overwatch 2 are referred to as "heroes" |
| ult | Overwatch 2 features "ultimate abilities," (aka "ults") which are not in COD |

Source: Shoki Leffel

In [19]:
# removing other terms which are directly related to either just COD or just ow2
other_game_terms = ['dmzs', 'dmz', 'blizzard', 'dps', 'camos', 'camo', 'attachments', 'attachment', 'campaigns', 'campaign', 'blueprints', 
                    'blueprint', 'perks', 'perk', 'exfils', 'exfil', 'hog', 'operators', 'operator', 'uavs', 'uav', 'shipments', 'shipment', 
                    'hardpoint', 'zen', 'potg', 'cdl', 'polyatomic', 'extractions', 'extraction', 'heroes', 'hero', 'ults', 'ult']

for term in other_game_terms:
    ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace(term, ''))

In [20]:
# other data cleaning
# removing &amp; and #x200B; (escaped ampersand and zero-width space), which occur and have no meaning
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace('&amp;', ''))
# the text was lower-cased above, so we match the lower-case form '#x200b;'
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: x.replace('#x200b;', ''))

# removing urls
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: re.sub(r'http\S+', '', x))
# removing non regular characters
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: re.sub('[^a-zA-Z0-9 \n]', '', x))
# replacing the line breaks with spaces
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: re.sub('[\n]', ' ', x))

# sources:
# https://gist.github.com/MrEliptik/b3f16179aa2f530781ef8ca9a16499af
# https://stackoverflow.com/questions/23996118/replace-special-characters-in-a-string-in-python
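
The cleaning steps in this cell can also be collected into a single function, which makes them easier to test on one string. This is a sketch under the assumption that the same steps run in the same order; `clean_text` is a hypothetical helper, and the zero-width-space marker is matched case-insensitively because the text has already been lower-cased at this point.

```python
import re

def clean_text(text):
    """Apply the cell's cleaning steps to a single string, in order."""
    text = text.replace('&amp;', '')                           # escaped ampersand
    text = re.sub(r'#x200b;', '', text, flags=re.IGNORECASE)   # zero-width space marker
    text = re.sub(r'http\S+', '', text)                        # urls
    text = re.sub(r'[^a-zA-Z0-9 \n]', '', text)                # non regular characters
    text = text.replace('\n', ' ')                             # line breaks -> spaces
    return text

sample = 'check this &amp;#x200b;\nhttps://example.com/x cool!'
print(clean_text(sample).split())
```

With such a helper, the column could be cleaned in one pass with `ow_cod_df['title_selftext'].apply(clean_text)`.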

In [21]:
# removing emojis (largely a safeguard: the non regular character filter in the previous cell already strips them)
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].apply(lambda x: emoji_pattern.sub(r'', x))
# source: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python

In [22]:
# removing all digits (raw string avoids an invalid escape warning; assign back instead of inplace on a column)
ow_cod_df['title_selftext'] = ow_cod_df['title_selftext'].replace(r'\d+', '', regex=True)

In [23]:
# removing posts with only whitespaces left
ow_cod_df['len_no_whitespace'] = ow_cod_df['title_selftext'].str.strip().apply(len)
ow_cod_df = ow_cod_df.loc[ow_cod_df['len_no_whitespace']!=0].copy()

In [24]:
len(ow_cod_df)

50491

In [25]:
ow_cod_df['subreddit_ow'].value_counts(normalize=True).round(2).to_frame()

Unnamed: 0,subreddit_ow
0,0.66
1,0.34


We are now left with 50,491 posts in total: 66% COD MW2 posts and 34% Overwatch 2 posts.

In [26]:
# save cleaned data to csv
ow_cod_df.to_csv('data/ow_cod_df_clean.csv', index=False)