# Dota Dataset Notebook 2 - Preprocessing and Feature Engineering

This notebook covers:
* Additional preprocessing
* Additional features
* Word embeddings

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import langdetect as ld
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')

## Parts of Notebook 1
* Finding the language of each text and getting the English rows

In [4]:
df = pd.read_csv('dota2_chat_messages.csv', nrows=50000)
df.head()

Unnamed: 0,match,time,slot,text
0,0,1005.12122,9,ладно гг
1,0,1005.85442,9,изи
2,0,1008.65372,9,од
3,0,1010.51992,9,ебаный
4,0,1013.91912,9,мусор на войде


In [5]:
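# Labeling languages
# Messages that langdetect cannot classify (for example, text with no letters)
# keep the placeholder value '0.0'.
langs = np.zeros(len(df)).astype(str)
for i, message in enumerate(df['text'].values):
    try:
        langs[i] = ld.detect(message)
    except Exception:
        continue
df['language'] = langs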

In [6]:
# Fixing some languages due to acronyms: a message containing any of these
# English chat terms is relabeled 'en'. Matching is by substring.
english_terms = [
    'ez|Ez|EZ', 'lol|Lol|LOL', 'gg|Gg|GG', 'ty|Ty|TY', 'xD|XD',
    '[Rr]eport', 'STUPID|[Ss]tupid', '[Ff][Uu][Cc][Kk]|[Ss]hit',
    '[Nn][Oo][Oo][Bb]', 'retard|RETARD',
    'pls|stfu|omg|OMG|wtf|WTF|wp|guys|kill|KILL|god|feed|FEED|btw',
    'idiot|IDIOT|defend|dumb|end', 'good|game|nice|thx|THX',
]
is_english = df['text'].str.contains('|'.join(english_terms), na=False)
# Only the 'language' column is changed; the message text is left as it is.
df['language'] = df['language'].mask(is_english, 'en')
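
The patterns above match substrings, so `end` also matches inside "friend" and `ty` inside "party". A minimal sketch of a stricter, case-insensitive variant using word boundaries follows. The term list is only an illustrative subset, and `looks_english` is a hypothetical helper that is not part of the notebook.

```python
import re

# Illustrative subset of the chat terms used above; extend as needed.
ENGLISH_TERMS = ["ez", "lol", "gg", "ty", "xd", "report", "stupid", "noob", "end"]

# \b anchors each term to word boundaries; re.IGNORECASE covers ez/Ez/EZ.
ENGLISH_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ENGLISH_TERMS)) + r")\b",
    re.IGNORECASE,
)

def looks_english(message):
    """Return True if the message contains one of the terms as a whole word."""
    return bool(ENGLISH_RE.search(message))

print(looks_english("GG ez"))      # terms appear as whole words
print(looks_english("my friend"))  # 'end' only appears inside another word
```

The same pattern string could be passed to `Series.str.contains` with `case=False`. Whether stricter matching helps here depends on how often players attach extra letters to these terms (for example "ezzzz"), which this sketch would no longer catch.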

In [7]:
eng = df[df['language']=='en'].drop('language', axis=1)
eng.head()

Unnamed: 0,match,time,slot,text
9,1,-131.14018,0,twitch.tv/rage_channel
29,2,1563.1849,0,fast and furious
30,2,1757.5132,0,too fas
31,2,1996.3936,8,idiot drow
32,2,2006.2939,2,no idiot


______

# Start of Notebook 2 Work

# Additional Preprocessing
* Dropping links
* Dropping stop words
* Dropping words that do not appear often

In [8]:
eng.head()

Unnamed: 0,match,time,slot,text
9,1,-131.14018,0,twitch.tv/rage_channel
29,2,1563.1849,0,fast and furious
30,2,1757.5132,0,too fas
31,2,1996.3936,8,idiot drow
32,2,2006.2939,2,no idiot


In [9]:
# Current Version 1 df
eng.shape

(14649, 4)

In [10]:
# Dropping links (messages containing ".tv" or ".com")
eng = eng[~eng['text'].str.contains(r"\.tv|\.com")]
eng.shape

(14635, 4)

The full dataset has 21 million rows, so limited space and long runtimes are recurring issues. One possible way to reduce the load is to drop stop words, which are commonly used words such as "the", "a", "an", and "in". The Natural Language Toolkit's (NLTK) list of stop words will be used.

In [11]:
# Loading stop words
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

For now, the stop words "you", "you're", and "yourself" will not be taken out. In the future, it may be useful to see whether messages are directed at other teammates.

In [12]:
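# Note: this rebinds the name `stopwords` from the NLTK corpus module to a plain list,
# so the import cell must be re-run before this cell can be run again.
stopwords = stopwords.words('english')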
stopwords.remove("you")
stopwords.remove("you're")
stopwords.remove("yourself")
stopwords[:10]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 "you've",
 "you'll"]

In [13]:
print("There will be {} different stop words dropped.".format(len(stopwords)))

There will be 176 different stop words dropped.


In [14]:
# Dropping stop words
def drop_stop_words(text):
    """Drops stop words from a line of text (case-sensitive match)."""
    return " ".join(word for word in text.split(" ") if word not in stopwords)

eng['text'] = eng['text'].apply(drop_stop_words)
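
`drop_stop_words` compares tokens exactly as typed, so a capitalized stop word such as "The" is kept. Below is a hedged sketch of a case-insensitive variant that also uses a set for faster lookups. `STOP_WORDS` is a small stand-in for the NLTK list, and the function name is hypothetical.

```python
# Small stand-in for the NLTK list; in the notebook this would be set(stopwords).
STOP_WORDS = {"the", "a", "an", "and", "is", "too"}

def drop_stop_words_ci(text, stop_words=STOP_WORDS):
    """Drop stop words regardless of capitalization; kept words retain their casing."""
    return " ".join(w for w in text.split(" ") if w.lower() not in stop_words)

print(drop_stop_words_ci("The enemy is dead"))
```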

In [15]:
# Finding words that are only used once or twice
words = []
for row in eng['text'].str.split(" "):
    for word in row:
        words.append(word)
word_counts = pd.Series(words).str.lower().value_counts()
rare_words = word_counts[word_counts < 3].index
rare_words[:30]

Index(['notice', 'shop', 'weak', 'lols', 'kak', 'play,', 'ezzzzzzzz', 'dope',
       'falling', 'ld', 'cooldown', 'may', 'rs', 'question', 'hg', 'place,',
       'ignore', 'ho', 'shithead', 'lo', 'curry', 'aleatory', 'incase',
       'gg,wp', 'celeb', 'mnice', 'weather', 'iq', 'recomiendo', 'usless'],
      dtype='object')

In [16]:
print("There will be {} different rarely used words dropped.".format(len(rare_words)))

There will be 4386 different rarely used words dropped.


In [17]:
def drop_rare_words(text):
    """Drops rare words from a line of text."""
    return " ".join(word for word in text.split(" ") if word.lower() not in rare_words)

eng['text'] = eng['text'].apply(drop_rare_words)
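
The rare-word step can also be sketched with `collections.Counter`, which avoids building an intermediate list and Series. This is an illustrative alternative on toy data, assuming the same threshold of fewer than 3 occurrences; `find_rare_words` is a hypothetical name.

```python
from collections import Counter

def find_rare_words(messages, min_count=3):
    """Return the set of lowercased words that occur fewer than min_count times."""
    counts = Counter(word.lower() for msg in messages for word in msg.split(" "))
    return {word for word, n in counts.items() if n < min_count}

# Toy messages for illustration only.
toy_messages = ["gg wp", "gg ez", "GG report mid", "ez mid", "ez"]
rare = find_rare_words(toy_messages)
print(sorted(rare))
```

Returning a set keeps the later membership test in a function like `drop_rare_words` fast.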

# Additional Feature Engineering
* Explicit keywords
* Word embedding → word2vec

As before, **the goal is to explore different features, ways of deriving them, and the information** we can gain from them. No features are finalized in this notebook or in the first one.

## Use of offensive language

"**Description: A list of 1,300+ English terms that could be found offensive.** The list contains some words that many people won't find offensive, but it's a good start for anybody wanting to block offensive or profane terms on their Site."
* From Luis von Ahn's Research Group: _https://www.cs.cmu.edu/~biglou/resources/bad-words.txt_

In [18]:
# list of potentially offensive words from the link above
bad_words = pd.read_fwf("bad-words.txt", header=None)

# dropping the words 'dead', 'death', 'sniper', and 'doom' (reasons explained further below this cell)
excluded = ['dead', 'death', 'sniper', 'doom']
bad_words = bad_words[~bad_words[0].isin(excluded)].iloc[:, 0].values
bad_words[:30]

array(['abbo', 'abo', 'abortion', 'abuse', 'addict', 'addicts', 'adult',
       'africa', 'african', 'alla', 'allah', 'alligatorbait', 'amateur',
       'american', 'anal', 'analannie', 'analsex', 'angie', 'angry',
       'anus', 'arab', 'arabs', 'areola', 'argie', 'aroused', 'arse',
       'arsehole', 'asian', 'ass', 'assassin'], dtype=object)

In [19]:
print("Number of offensive words considered:", len(bad_words))

Number of offensive words considered: 1379


In [20]:
# Working on a copy of the preprocessed English messages
offense = eng.copy()
offense.head()

Unnamed: 0,match,time,slot,text
29,2,1563.1849,0,fast
30,2,1757.5132,0,
31,2,1996.3936,8,idiot drow
32,2,2006.2939,2,idiot
37,2,2263.3697,2,lol


In [21]:
def offensive_word(text):
    """Returns the words in a line of text that appear in bad_words (exact match)."""
    return [word for word in text.split(" ") if word in bad_words]

offense['offensive word'] = offense['text'].apply(offensive_word)
offense['num offensive words'] = offense['offensive word'].apply(len)
offense.sort_values('num offensive words', ascending=False).head(15)

Unnamed: 0,match,time,slot,text,offensive word,num offensive words
42160,2033,-277.59811,1,hey faggot ass show stupid fuck support,"[faggot, ass, stupid, fuck]",4
1491,76,2839.95183,7,fucking axe retard cunt,"[fucking, retard, cunt]",3
27655,1327,1942.15502,4,fucking monkey suck dick,"[fucking, suck, dick]",3
29100,1395,2422.90175,1,suck big black cock,"[suck, black, cock]",3
37669,1795,2780.8741,2,fucking shit ass lion,"[fucking, shit, ass]",3
34516,1656,2504.491,0,fuckin retard shit,"[fuckin, retard, shit]",3
17665,864,2104.47704,7,stupid fuck,"[stupid, fuck]",2
23867,1139,1380.12965,6,wtf back attack,"[wtf, attack]",2
33915,1621,2125.03777,9,you fucking suck wk,"[fucking, suck]",2
33892,1621,167.25917,0,fuck u faggot,"[fuck, faggot]",2
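
Because `bad_words` is a NumPy array, each `word in bad_words` check scans the whole array, and tokens are compared exactly, so an upper-case word or a word with trailing punctuation would not match. A minimal sketch of a normalized, set-based lookup follows. The word list is a tiny stand-in and the helper name is hypothetical.

```python
import string

# Tiny stand-in; in the notebook this would be set(bad_words).
BAD_WORDS = {"idiot", "stupid", "noob"}

def offensive_words_normalized(text, bad_words=BAD_WORDS):
    """Return the offensive tokens after lowercasing and stripping edge punctuation."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split(" "))
    return [tok for tok in tokens if tok in bad_words]

print(offensive_words_normalized("STUPID drow, idiot!"))
```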


For now, the number of offensive words **seems to be a suitable indicator of toxicity.**

Future possibility: assigning a degree of offensiveness to certain words (e.g., "ass" can be less toxic than "retard", and therefore carry less weight)
* These weights were later included through the use of TF-IDF.
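
One way the weighting idea could look, as a sketch only: each offensive word gets a severity weight, and a message's score is the sum of the weights of its words. The weights below are invented for illustration and are not derived from the data or from TF-IDF.

```python
# Hypothetical severity weights, chosen only for illustration.
SEVERITY = {"idiot": 1.0, "stupid": 1.0, "noob": 0.5}

def toxicity_score(text, weights=SEVERITY):
    """Sum the severity weights of the tokens in a message (unknown tokens count as 0)."""
    return sum(weights.get(tok, 0.0) for tok in text.lower().split(" "))

print(toxicity_score("stupid noob"))  # 1.0 + 0.5
print(toxicity_score("gg wp"))        # no weighted tokens
```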

### EDA on Offensive Language

In [22]:
print("Out of {} messages, {} messages contain offensive language."
      .format(len(offense), len(offense[offense['num offensive words'] > 0])))

Out of 14635 messages, 1261 messages contain offensive language.


In [23]:
offense.groupby('num offensive words').size()

num offensive words
0    13374
1     1160
2       95
3        5
4        1
dtype: int64

In [24]:
used_words = []
for array in offense['offensive word']:
    for word in array:
        used_words.append(word)

print("Most commonly used offensive words:")
pd.Series(used_words).value_counts().head(10)

Most commonly used offensive words:


fuck       192
fucking    173
wtf        153
shit       134
kill        97
idiot       50
retard      50
god         28
stupid      28
killed      25
dtype: int64

Some commonly used "offensive" words were "dead", "death", "sniper", and "doom". Because these messages come from a gaming context, these terms are often used without toxicity. For example, a player can tell their team that the enemy is dead, and there are two heroes called Sniper and Doom. After looking at the context of these words, shown below, they were dropped when loading the initial bad_words list.

In [25]:
offense[offense['text'].str.contains('dead')].tail(3)

Unnamed: 0,match,time,slot,text,offensive word,num offensive words
41780,2015,1721.26196,7,dead,[],0
42155,2032,2440.75247,5,gg lc dead,[],0
43389,2086,2587.0708,1,braindead children,[],0


In [26]:
offense[offense['text'].str.contains('death')].tail(3)

Unnamed: 0,match,time,slot,text,offensive word,num offensive words
48573,2353,2909.65522,7,deaths,[],0
49554,2411,2045.2572,8,many solo weaver deaths,[],0
49895,2420,2123.45143,9,3 secs you death,[],0


In [27]:
offense[offense['text'].str.contains('sniper')].tail(3)

Unnamed: 0,match,time,slot,text,offensive word,num offensive words
45125,2168,1045.0931,5,sniper mother fucker,[fucker],1
46852,2269,1448.94784,7,report sniper pls,[],0
48403,2346,3062.6275,7,sniper played better u,[],0


In [28]:
offense[offense['text'].str.contains('doom')].tail(3)

Unnamed: 0,match,time,slot,text,offensive word,num offensive words
43923,2113,873.96154,2,doom getting killed 3 ppl,[killed],1
45136,2169,966.33073,3,report doom,[],0
48850,2365,260.70302,6,report doom,[],0


**By match and by player**

In [29]:
# Number of offensive words per match
bad_word_sum = offense.groupby('match')['num offensive words'].sum().sort_values()
bad_word_sum

match
1229     0
1513     0
1512     0
1510     0
1507     0
        ..
1395    14
1795    14
1125    15
227     16
121     19
Name: num offensive words, Length: 1982, dtype: int64

In [30]:
print("On average, there is/are {} offensive word(s) said per match.".format(round(bad_word_sum.mean())))

On average, there is/are 1 offensive word(s) said per match.


In [31]:
by_player = offense.groupby(['match', 'slot'])['num offensive words'].sum().sort_values()
by_player

match  slot
2      0        0
1569   6        0
1567   4        0
       2        0
1566   1        0
               ..
1395   1        9
2033   1        9
2361   2       11
1269   1       12
1795   2       13
Name: num offensive words, Length: 6217, dtype: int64

In [32]:
print("On average, there is/are {} offensive word(s) said per player.".format(round(by_player.mean())))

On average, there is/are 0 offensive word(s) said per player.


**Time**

In [33]:
only_offense = offense[offense['num offensive words'] > 0]
print("Among the messages with at least one offensive word, the average time of usage is at the {} minute mark."
      .format(round(only_offense['time'].mean() / 60)))

Among the messages with at least one offensive word, the average time of usage is at the 23 minute mark.


In [34]:
# Negative time
avg = offense[offense['time'] < 0]['num offensive words'].mean()
print("Average number of offensive words per message (negative time):", round(avg, 3))

Average number of offensive words per message (negative time): 0.126


In [35]:
# Within 0-15 minutes
avg = offense[(offense['time'] >= 0) & (offense['time'] < 900)]['num offensive words'].mean()
print("Average number of offensive words per message (0-15 mins):", round(avg, 3))

Average number of offensive words per message (0-15 mins): 0.122


In [36]:
# Within 15-30 minutes
avg = offense[(offense['time'] >= 900) & (offense['time'] < 1800)]['num offensive words'].mean()
print("Average number of offensive words per message (15-30 mins):", round(avg, 3))

Average number of offensive words per message (15-30 mins): 0.102


In [37]:
# Over 30 minutes
avg = offense[(offense['time'] >= 1800)]['num offensive words'].mean()
print("Average number of offensive words per message (30+ mins):", round(avg, 3))

Average number of offensive words per message (30+ mins): 0.072
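
The four cells above repeat the same computation for each time window. The sketch below groups messages into the same windows and reports two separate statistics: the mean number of offensive words per message, and the share of messages containing at least one. It is plain Python on made-up rows, and the helper names are hypothetical.

```python
from collections import defaultdict

def time_bucket(seconds):
    """Map a game time in seconds to the windows used above."""
    if seconds < 0:
        return "negative"
    if seconds < 900:
        return "0-15 min"
    if seconds < 1800:
        return "15-30 min"
    return "30+ min"

def summarize(rows):
    """rows: iterable of (time_in_seconds, num_offensive_words) pairs."""
    grouped = defaultdict(list)
    for seconds, n_offensive in rows:
        grouped[time_bucket(seconds)].append(n_offensive)
    return {
        bucket: {
            "mean_count": sum(counts) / len(counts),
            "share_with_any": sum(c > 0 for c in counts) / len(counts),
        }
        for bucket, counts in grouped.items()
    }

# Made-up rows for illustration only.
toy_rows = [(-30.0, 0), (120.5, 2), (600.0, 0), (1000.0, 1), (2500.0, 0)]
for bucket, stats in summarize(toy_rows).items():
    print(bucket, stats)
```

The two statistics differ whenever a message contains more than one offensive word, so it is worth stating which one a reported number refers to.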


## Word2Vec Word Embedding
* **Word Embeddings** create "a representation for words that capture their _meanings, semantic relationships, and the different contexts_ they are used in."
* **Word2vec** learns word vectors using one of two model architectures: continuous bag of words (CBOW) or skip-gram. Both learn weights for each word, and those weights become the word's vector. CBOW predicts a word given its context. Skip-gram predicts the context given a word.
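
To make the skip-gram idea concrete, the sketch below lists the (center, context) training pairs for one tokenized message. It only illustrates how pairs are formed from a window around each word. It is not how gensim is implemented internally, and `skipgram_pairs` is a hypothetical helper.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for every token and its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# window=1 keeps the printed list short; the model below uses window=3.
pairs = skipgram_pairs(["report", "drow", "pls"], window=1)
print(pairs)
```

Chat messages are often only a few words long, so most words have very few context neighbors, which limits how much a model can learn from each message.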

In [38]:
messages = eng['text'].str.lower().str.split(" ").values

In [39]:
from gensim.models import Word2Vec

# initializing and training a skip-gram (sg=1) Word2Vec model
# (gensim 3.x API; gensim 4 renamed `size` to `vector_size`)
model = Word2Vec(sg=1, min_count=1, window=3, size=100, workers=4)
model.build_vocab(messages)
model.train(sentences=messages, total_examples=model.corpus_count, epochs=model.epochs)

(81617, 130520)

In [40]:
model.wv.most_similar('trash')

[('get', 0.9993589520454407),
 ('nice', 0.9993569254875183),
 ('the', 0.9993484020233154),
 ('4', 0.999345600605011),
 ('last', 0.9993374943733215),
 ('5', 0.9993152618408203),
 ('de', 0.9993113279342651),
 ('kill', 0.999302327632904),
 ('time', 0.9992989301681519),
 ('shit', 0.9992865324020386)]

In [41]:
list(model.wv.vocab)[:10]

['fast', '', 'idiot', 'drow', 'lol', 'commend', 'me', 'ty', 'ez', 'wtf']

In [42]:
similar = model.wv.similarity('trash', 'garbage')
print("Similarity between 'trash' and 'garbage' is {}".format(similar))

Similarity between 'trash' and 'garbage' is 0.9980788826942444
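
The similarity reported here is the cosine similarity between the two word vectors. A minimal sketch of that computation on made-up 3-dimensional vectors (the model's vectors have 100 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Scores near 0.999 for pairs as unrelated as 'trash' and 'the' suggest that the vectors have barely moved apart, which may point to too little training data or too few epochs for this sample.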


**The Word2Vec work above was a first exploration of word embeddings and their functionality. Word2Vec will be more involved in future notebooks.**