![](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4895752%2Fbf6d1be3b18ade8bd780840fd8f871c1%2FU78nzyG.jpg?generation=1597522833435019&alt=media)

# Introduction

CS:GO is a tactical shooter in which two teams (CT and Terrorist) play a best of 30 rounds, with each round lasting 1 minute and 55 seconds. There are 5 players on each team (10 in total), and the first team to win 16 rounds wins the game. At the start, one team plays as CT and the other as Terrorist. After 15 rounds, the teams swap sides. There are 7 different maps a game can be played on. You win a round as Terrorist by either planting the bomb and making sure it explodes, or by eliminating the other team. You win a round as CT by either eliminating the other team, or by disarming the bomb, should it have been planted.

The data set consists of ~700 demos from high-level tournament play in 2019 and 2020. The total number of snapshots is 122411. This notebook uses a Keras DNN with a TensorFlow backend to predict round winners, either Counter-Terrorist or Terrorist.

TensorFlow is a leading open-source deep learning framework developed and maintained by Google. Although using TensorFlow directly can be challenging, the modern tf.keras API brings the simplicity and ease of use of Keras to the TensorFlow project. Using tf.keras allows you to design, fit, evaluate, and use deep learning models to make predictions in just a few lines of code. It makes common deep learning tasks, such as classification and regression predictive modeling, accessible to average developers looking to get things done.

Key takeaways:

* Keras can be used to build DNNs that fit the problem well.
* Model tuning requires a lot of experimentation, but I'll argue for my choices.
* The data requires many epochs for a DNN to learn. I ended up at about 250 epochs for the best result.

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use("fivethirtyeight")
sns.set_style('whitegrid')
%matplotlib inline

# Load the data
df = pd.read_csv('/kaggle/input/csgo-round-winner-classification/csgo_round_snapshots.csv')

# Split X and y
y = df.round_winner
X = df.drop(['round_winner'], axis=1)

# Drop columns with grenade info
cols_grenade = 'grenade'
X = X.drop(X.columns[X.columns.str.contains(cols_grenade)], axis=1)

print(f"Total number of samples: {len(X)}")

X.head()

In [None]:
# Print a random snapshot as a sample
sample_index = 25
print(df.iloc[sample_index])

# EDA

The plots mostly speak for themselves and are fairly straightforward. We visualize some of the data to get a better understanding of it and to spot potential skewness. Plots provide an intuitive, basic understanding of the problem domain and of how we should deal with the data.

In [None]:
plt.figure(figsize=(8,6))
ax = sns.countplot(x="map", hue="round_winner", data=df)
ax.set(title='Round winners on each map')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
ax = sns.countplot(x="map", hue="bomb_planted", data=df)
ax.set(title='Maps and bomb planted')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
# Take labels and counts from the same Series so they stay aligned
wins = df['round_winner'].value_counts()
ax = sns.barplot(x=wins.index, y=wins.values)
ax.set(title='Total wins per side', xlabel='Side', ylabel='Wins')
plt.show()

We'll continue with some distribution plots. Note that the data look like fairly skewed Gaussians, which suggests that a power transform such as Yeo-Johnson or Box-Cox would be appropriate (Box-Cox requires strictly positive values, while Yeo-Johnson also handles zeros and negatives). Many ML models train more easily when their inputs are roughly Gaussian, so adjusting the data in this way is a sound preprocessing technique.
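
To make the power transform less of a black box, here is a minimal sketch of the Yeo-Johnson formula for a single value and a fixed exponent. The function name `yeo_johnson_value`, the toy `money` list and the chosen `lmbda` are illustrative assumptions; scikit-learn additionally estimates the exponent from the data and standardizes the result by default, which this sketch does not do.

```python
import math

def yeo_johnson_value(x, lmbda):
    """Apply the Yeo-Johnson transform to one number for a given lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log(x + 1)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log(-x + 1)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)

# A right-skewed toy sample (hypothetical money values)
money = [0, 100, 400, 1600, 6400]

# lambda = 0 reduces to log(x + 1), which pulls in the long right tail
compressed = [yeo_johnson_value(m, 0.0) for m in money]
```

With `lmbda=1` the transform leaves values unchanged, so the exponent controls how strongly the tail is compressed.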

In [None]:
# Plot the distribution of health
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12,5))
sns.histplot(df['ct_health'], bins=10, kde=True, stat='density', ax=ax1);
sns.histplot(df['t_health'], bins=10, kde=True, stat='density', ax=ax2);

In [None]:
# Plot the distribution of money
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12,5))
sns.histplot(df['ct_money'], bins=10, kde=True, stat='density', ax=ax1);
sns.histplot(df['t_money'], bins=10, kde=True, stat='density', ax=ax2);

In [None]:
# Plot the distribution of scores
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12,5))
sns.kdeplot(df['ct_score'], fill=True, ax=ax1)
sns.kdeplot(df['t_score'], fill=True, ax=ax2)

In [None]:
# Plot the distribution of time left
plt.figure(figsize=(8,6))
sns.kdeplot(df['time_left'], fill=True)

Interestingly, a majority of the snapshots are taken at the start of the round, and the count slowly fades as we approach the end of it. This makes sense, since the number of players is narrowed down during a round and only a few rounds last the full duration.

# Feature encoding

For feature encoding, we'll perform the usual tricks: label encode the targets and one-hot encode the categorical predictors. In this case, the targets are the round winners and the categorical predictors are the map and bomb_planted features. Encoding map gives us seven columns, and encoding bomb_planted gives just two. We also apply a Yeo-Johnson transform to the features that resemble a skewed Gaussian.
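
As a rough illustration of what the two encoders do, here is a plain-Python sketch. The helpers `label_encode` and `one_hot_encode` and the toy values are hypothetical stand-ins, not the scikit-learn classes used in this notebook.

```python
def label_encode(values):
    """Map each class name to an integer, with classes sorted alphabetically."""
    classes = sorted(set(values))
    index = {name: i for i, name in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """Turn each category into a row with a single 1.0 in its own column."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return rows, categories

winners = ["CT", "T", "T", "CT"]                # hypothetical targets
maps = ["de_dust2", "de_inferno", "de_dust2"]   # hypothetical map column

y_codes, y_classes = label_encode(winners)      # CT becomes 0, T becomes 1
map_rows, map_columns = one_hot_encode(maps)    # one column per distinct map
```

Sorting the class names is what makes CT map to 0 and T to 1, which matters later when predicted class indices are translated back to sides.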

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import power_transform

def encode_targets(y):
    encoder = LabelEncoder()
    encoder.fit(y)
    y_encoded = encoder.transform(y)
    return y_encoded

def encode_inputs(X, object_cols):
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    X_encoded = pd.DataFrame(ohe.fit_transform(X[object_cols]))
    X_encoded.columns = ohe.get_feature_names_out(object_cols)
    X_encoded.index = X.index
    return X_encoded

def yeo_johnson(series):
    # power_transform expects a 2D array; flatten the result back to 1D for column assignment
    arr = np.array(series).reshape(-1, 1)
    return power_transform(arr, method='yeo-johnson').ravel()

# Use OH encoder to encode predictors
object_cols = ['map', 'bomb_planted']
X_encoded = encode_inputs(X, object_cols)
numerical_X = X.drop(object_cols, axis=1)
X = pd.concat([numerical_X, X_encoded], axis=1)

# Use label encoder to encode targets
y = encode_targets(y)

# Make data more Gaussian-like
cols = ['time_left', 'ct_money', 't_money', 'ct_health',
 't_health', 'ct_armor', 't_armor', 'ct_helmets', 't_helmets',
  'ct_defuse_kits', 'ct_players_alive', 't_players_alive']
for col in cols:
    X[col] = yeo_johnson(X[col])

# Keras DNN model

![](https://cdn-images-1.medium.com/max/1000/1*ytBUCmhkAucJ5imsNfAyfQ.png)

Keras is one of the most popular deep learning libraries in Python for research and development because of its simplicity and ease of use. It uses the TensorFlow backend to build both shallow and deep models without much hassle. Since there's a lot of data available here, my belief was that a neural network would be suitable. We'll go through why the settings and hyperparameters are set the way they are. See more at https://keras.io/api/

**Number of layers/nodes:** This comes down to the data and what works best. I experimented with both shallow and deep nets and found that 4-8 layers with about 128-300 nodes each were suitable. My guess is that because there are so many samples, we need a big network to fit them all. <br>

**Learning rate:** We'll initialize the optimizer with the default learning rate and use a callback (ReduceLROnPlateau) to reduce the learning rate whenever the validation loss hits a plateau. You could try a learning rate schedule like 1cycle as well. <br>
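
The reduce-on-plateau idea can be sketched in a few lines of plain Python. This is a simplified model of the behavior, not the Keras implementation: the function name `plateau_schedule` and the loss values are made up, and details such as `min_delta` and cooldown are left out.

```python
def plateau_schedule(val_losses, lr=0.001, factor=0.2, patience=5, min_lr=1e-5):
    """Return the learning rate used in each epoch under reduce-on-plateau."""
    best = float("inf")
    wait = 0
    used = []
    for loss in val_losses:
        used.append(lr)
        if loss < best:
            best, wait = loss, 0               # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never drop below the floor
                wait = 0
    return used

stalled = [0.50] * 7   # hypothetical validation losses that stop improving
rates = plateau_schedule(stalled)
floored = plateau_schedule(stalled, min_lr=0.001)
```

In `rates` the learning rate drops by the factor once five epochs pass without improvement. In `floored` it never changes, because a floor equal to the starting rate leaves no room for a reduction.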

**Optimizer:** Based on some testing, the top optimizers for this task are Adam, Adamax and Nadam. They all combine momentum with adaptive learning rates; momentum helps accelerate the gradient vectors in the right direction, which leads to faster convergence. Nadam is the Adam optimizer plus the Nesterov momentum trick, more here: 
https://keras.io/api/optimizers/Nadam/ <br>
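
For intuition, here is a scalar sketch of one Adam update, with an optional Nesterov-style look-ahead in the commonly quoted simplified Nadam form. The names `adaptive_step` and `minimize`, the toy objective and the step size are assumptions for illustration; the Keras optimizers may differ in details.

```python
import math

def adaptive_step(x, grad, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, nesterov=False):
    """One Adam update on a scalar; nesterov=True adds a simplified Nadam look-ahead."""
    state["t"] += 1
    t = state["t"]
    state["m"] = b1 * state["m"] + (1 - b1) * grad        # running mean of gradients
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2   # running mean of squared gradients
    m_hat = state["m"] / (1 - b1 ** t)                    # bias correction
    v_hat = state["v"] / (1 - b2 ** t)
    if nesterov:
        # Blend the corrected momentum with the current gradient
        m_hat = b1 * m_hat + (1 - b1) * grad / (1 - b1 ** t)
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(nesterov, steps=200, start=5.0):
    """Minimize f(x) = x**2, whose gradient is 2 * x."""
    x, state = start, {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        x = adaptive_step(x, 2 * x, state, nesterov=nesterov)
    return x

x_adam = minimize(nesterov=False)
x_nadam = minimize(nesterov=True)
```

Dividing by the running root-mean-square of the gradient is what makes the step size adaptive: the very first Adam step has a length of roughly `lr`, regardless of how large the raw gradient is.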

**Batch size:** The batch size defines the number of samples that are propagated through the network before each weight update. The advantages of using a batch size smaller than the total number of samples are that it requires less memory and that networks typically train faster with smaller batches. The downside is that the smaller the batch, the less accurate the estimate of the gradient will be. On Kaggle, a batch size of 128 seems OK, since training with the full batch is very slow.<br>
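
A small sketch of how a training set is cut into mini-batches, and how many weight updates one epoch then contains. The helper `iterate_batches` and the sample count of 1000 are hypothetical; only the batch size of 128 comes from this notebook.

```python
import math

def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover all samples exactly once."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples = 1000   # hypothetical training-set size
batch_size = 128   # batch size used in this notebook

batches = list(iterate_batches(n_samples, batch_size))
updates_per_epoch = math.ceil(n_samples / batch_size)
```

The last batch is usually smaller than the others, and a full-batch run would perform only a single update per epoch, which is one reason it needs so many more epochs.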

**Activation function:** These are important to enable non-linear representations of the data. A neural network without an activation function is essentially just a linear regression model; the activation applies a non-linear transformation to each layer's output, making the network capable of learning and performing more complex tasks. I found ELU with he_normal initialization to work well here. ELU looks a lot like the popular ReLU, but tends to drive the cost toward zero faster and produce more accurate results. More here: https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html <br>
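
To see how the two activations differ, here is a minimal sketch in plain Python. The function names and the sample inputs are illustrative; `alpha=1.0` is assumed as the ELU scale.

```python
import math

def relu(x):
    """ReLU: pass positives through, clamp negatives to zero."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """ELU: identity for positives, smooth exponential curve for negatives."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

inputs = [-3.0, -1.0, 0.0, 2.0]
relu_out = [relu(x) for x in inputs]
elu_out = [elu(x) for x in inputs]
```

Both agree for positive inputs. For negative inputs ReLU outputs exactly zero, while ELU stays negative and levels off toward `-alpha`, so those units still pass a gradient.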

**Number of epochs:** With full batch size, this was quite high (> 200) for a 4-layer neural network. One approach is to set the number of epochs high and use early stopping on the validation set to stop training whenever the validation loss/accuracy stops improving (more on this next). Another is to simply monitor the loss/accuracy curves for both training and validation and find the right epoch before overfitting sets in. With Dropout enabled there was almost no overfitting on this dataset. <br>

**Early stopping:** We set a callback to stop training when some metric stops improving. The default is validation loss. Patience=5 means we'll wait five epochs with no improvement after which training will be stopped.
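
The patience rule can be sketched as a simple counter. This is a simplified stand-in for the callback, assuming any decrease counts as an improvement; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0     # new best: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical run: best loss at epoch 3, then five epochs without improvement
losses = [1.00, 0.90, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.70]
stop_epoch = early_stopping_epoch(losses)
never = early_stopping_epoch([1.0, 0.9, 0.8, 0.7])
```

Note that the run stops before it would have reached the better loss in the last entry, which is the usual trade-off of a short patience.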

**Batch normalization:** This can make neural networks faster and more stable by re-centering and re-scaling the inputs of each layer. By adding a BN layer before each activation function, we zero-center and normalize each input, then scale and shift the result using two new parameter vectors per layer: one for scaling and the other for shifting. This enables the model to learn the optimal scale and mean of each layer's input. It's a fairly novel technique that also means we can get away with not scaling our input (as you normally would with StandardScaler). <br>
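
Here is a minimal sketch of the training-time computation for a single feature. The function `batch_norm`, the sample values and the `gamma`/`beta` settings are illustrative assumptions; at inference time a real layer uses moving averages instead of the batch statistics, which is not shown.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

health = [100.0, 200.0, 300.0, 400.0, 500.0]   # hypothetical ct_health values
normalized = batch_norm(health)                # zero mean, roughly unit variance
shifted = batch_norm(health, gamma=2.0, beta=10.0)
```

`gamma` and `beta` are the two learned vectors mentioned above: they let the network undo or adjust the normalization wherever that helps.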

**Dropout:** I tested Dropout vs L1/L2 regularization and found that Dropout worked better here. At every training step, every neuron (including the input neurons but excluding the output neurons) has a probability of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step. This probability is usually set between 0.1 and 0.5. We use 0.2. <br>
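
A plain-Python sketch of so-called inverted dropout, where the surviving activations are scaled up during training so that nothing needs to change at inference. The function `dropout`, the seed and the toy activations are assumptions for illustration.

```python
import random

def dropout(values, rate=0.2, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training:
        return list(values)            # inference: pass everything through
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
activations = [1.0] * 10
train_out = dropout(activations, rate=0.2, rng=rng)
test_out = dropout(activations, training=False)
```

With a rate of 0.2, each surviving activation is multiplied by 1.25, which keeps the expected sum of the layer's outputs the same as without dropout.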

Let's make the model and train it.


In [None]:
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Make a train, validation and test set
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y,
 stratify=y, test_size=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full,
 stratify=y_train_full, test_size=0.25, random_state=0)

# Set model parameters
n_layers = 4
n_nodes = 300
regularized = False
dropout = True
epochs = 50

# Make a Keras DNN model
model = keras.models.Sequential()
model.add(keras.layers.BatchNormalization())
for n in range(n_layers):
    if regularized:
        model.add(keras.layers.Dense(n_nodes, kernel_initializer="he_normal",
         kernel_regularizer=keras.regularizers.l1(0.01), use_bias=False))
    else:
        model.add(keras.layers.Dense(n_nodes,
         kernel_initializer="he_normal", use_bias=False))
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation("elu"))
    if dropout:
        model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='Nadam', metrics=['accuracy'])

# Make a callback that reduces LR on plateau
# min_lr is the floor for the reduced rate; keep it below the optimizer's starting rate
reduce_lr_cb = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                                                 patience=5, min_lr=1e-5)

# Make a callback for early stopping
early_stopping_cb = keras.callbacks.EarlyStopping(patience=5)

# Train DNN.
history = model.fit(np.array(X_train), np.array(y_train), epochs=epochs,
     validation_data=(np.array(X_valid), np.array(y_valid)),
      callbacks=[reduce_lr_cb, early_stopping_cb], batch_size=128)
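
The two `train_test_split` calls in the cell above amount to a roughly 67.5/22.5/10 split. A quick sketch of the arithmetic, assuming the snapshot count from the introduction and assuming that each held-out share is rounded up to a whole sample:

```python
import math

n_total = 122411                          # snapshot count quoted in the introduction

n_test = math.ceil(0.10 * n_total)        # first split: test_size=0.1
n_train_full = n_total - n_test
n_valid = math.ceil(0.25 * n_train_full)  # second split: test_size=0.25
n_train = n_train_full - n_valid

# Share of the full data set that ends up in each part, in percent
shares = {name: round(100 * n / n_total, 1)
          for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]}
```

Because the second split is taken from the remaining 90%, a `test_size` of 0.25 corresponds to 22.5% of the full data set, not 25%.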

We get over 80% accuracy with just 50 epochs, which is pretty cool. Early stopping did not come into play, since the validation loss kept improving. Also note that there's no overfitting to speak of after enabling dropout. After training, we can print a summary of the model. Notice the batch normalization layers before each activation function. The first layer is a batch normalization layer applied directly to the input, so its output width simply equals the number of features in the data, here 92.

In [None]:
model.summary()
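
The parameter totals in the summary can be cross-checked by hand. A rough sketch, assuming the 92 input features mentioned above, 4 hidden layers of 300 nodes without bias, and four values per feature in every batch normalization layer (scale, shift, and the two moving statistics); the helper names are hypothetical:

```python
n_features = 92     # input width stated in the text above
n_layers = 4
n_nodes = 300

def dense_params(n_in, n_out, use_bias):
    """Weights of a fully connected layer, plus one bias per output if used."""
    return n_in * n_out + (n_out if use_bias else 0)

def batch_norm_params(width):
    """Scale, shift, moving mean and moving variance for each feature."""
    return 4 * width

total = batch_norm_params(n_features)            # normalization of the raw inputs
width = n_features
for _ in range(n_layers):
    total += dense_params(width, n_nodes, use_bias=False)
    total += batch_norm_params(n_nodes)
    width = n_nodes
total += dense_params(width, 1, use_bias=True)   # sigmoid output layer
```

Under these assumptions almost all parameters sit in the three 300-by-300 hidden layers, which is why widening the layers grows the model much faster than adding input features.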

# Evaluation

To evaluate a DNN, we'll look at the loss and accuracy scores to see how well training's progressed and check if there's any underfit/overfit. To properly evaluate the model, we'll bring in the yet unseen test set. Afterwards we'll make a few round winner predictions based on the test data. The accuracy for the test set is 80%.
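
For reference, the two numbers reported during evaluation can be computed by hand as follows. This is a sketch with made-up labels and probabilities; it assumes every probability lies strictly between 0 and 1, and the function names are hypothetical.

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean negative log-likelihood of the true labels."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_prob)]
    return sum(losses) / len(losses)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = [int(p > threshold) == y for y, p in zip(y_true, y_prob)]
    return sum(hits) / len(hits)

labels = [0, 1, 1, 0]            # hypothetical: 0 = CT, 1 = T
probs = [0.1, 0.9, 0.8, 0.6]     # hypothetical model outputs

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
```

The loss rewards confident correct predictions and punishes confident mistakes, so it can keep improving even while the accuracy stays flat.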

In [None]:
# Evaluate the test set
model.evaluate(np.array(X_test), np.array(y_test))

The loss and accuracy plots are below. Luckily, they show promise, with a diminishing loss and an increasing accuracy as training progresses. There's a lot more potential here though, and ~90% validation accuracy is not out of reach. It just takes a lot of training and a suitable batch size.

In [None]:
# Plot the loss curves for training and validation.
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values)+1)

plt.figure(figsize=(8,6))
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# Plot the accuracy curves for training and validation.
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']
epochs = range(1, len(acc_values)+1)

plt.figure(figsize=(8,6))
plt.plot(epochs, acc_values, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Lastly we'll predict the round winner from 10 samples without knowing the winner in advance.

In [None]:
# Predict the winning teams for ten rounds.
X_new = X_test[:10]
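# Threshold the sigmoid output at 0.5 to get class indices (0 = CT, 1 = T)
y_pred = (model.predict(np.array(X_new)) > 0.5).astype("int32").ravel()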
class_names = ['CT', 'T']
np.array(class_names)[y_pred]

In [None]:
# Show the predicted probabilities. Below 0.5 predicts CT, otherwise T.
y_proba = model.predict(X_new)
y_proba.round(2)