<a id="contents"></a>

# Notebook Contents

- [0. Installations](#id0)
- [1. Data Reading and Preprocessing](#id1)
- [2. Baseline Model: Three GPT-2 Models](#id2)
    - [2.1 Base Functions](#id2.1)
    - [2.2 Winners GPT 2 Model](#id2.2)
    - [2.3 Losers GPT 2 Model](#id2.3)
    - [2.4 Tie GPT 2 Model](#id2.4)
    - [2.5 Results](#id2.5)
- [3. Conditional Transformer](#id3)
    - [3.1 Idea](#id3.1)
    - [3.2 Initialization](#id3.2)
    - [3.3 Interpretation](#id3.3)
    - [3.4 General GPT 2](#id3.4)
    - [3.5 The Conditional Class and Functions](#id3.5)
    - [3.6 Example Text](#id3.6)
    - [3.7 Results](#id3.7)
- [4. Categorical GPT 2](#id4)
    - [4.1 Idea](#id4.1)
    - [4.2 Creating the Labels](#id4.2)
    - [4.3 Logistical Regression](#id4.3)
    - [4.4 Categorical GPT 2](#id4.4)
    - [4.5 Example Text](#id4.5)
    - [4.6 Results](#id4.6)
    - [4.7 Generator Class](#id4.7)
    - [4.8 Generator Class From Pre generated interviews](#id4.8)
- [5. Conclusion](#id5)

<a id="id0"></a>

# Part 0: Installations

[Return to contents](#contents)

In [1]:
# import the necessary libraries
import os 
os.environ['TF_CPP_MIN_LOG_LEVEL']='2' #Trying to reduce tensorflow warnings
import re
import math
import string
import time
import json
import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

# useful structures and functions for experiments 
from time import sleep

# specific machine learning functionality
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import transformers
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
from sklearn.linear_model import LogisticRegression

In [2]:
from dataclasses import dataclass
from typing import List, Optional, Tuple

from transformers.modeling_tf_utils import (
    TFCausalLanguageModelingLoss,
    TFConv1D,
    TFPreTrainedModel,
    TFSequenceClassificationLoss,
    TFSequenceSummary,
    TFSharedEmbeddings,
    get_initializer,
    input_processing,
    keras_serializable,
    shape_list,
)
from transformers.modeling_tf_outputs import (
    TFBaseModelOutputWithPast,
    TFCausalLMOutputWithPast,
    TFSequenceClassifierOutputWithPast,
)
import transformers.models.gpt2.modeling_tf_gpt2
from transformers.models.gpt2.modeling_tf_gpt2 import TFBlock, TFGPT2PreTrainedModel
from transformers.models.gpt2.configuration_gpt2 import GPT2Config
from tensorflow.keras import initializers

<a id="id1"></a>

# Part 1: Data Reading and Preprocessing

[Return to contents](#contents)

First, we read the data and clean it. After reading the CSV files, we want one row per interview instead of one row per game. So we split each match into its two interviews and add an "is_home_team" column that records whether the interviewed team played at home. In the away rows we also swap the team names and statistics, so the "_home" columns always describe the team being interviewed.
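To make the reshaping concrete, here is a toy sketch that uses plain dictionaries instead of a DataFrame. The helper `split_match`, the shortened column list, and the example match are all made up for illustration; the notebook's own `read_and_process` works on the full CSV files.

```python
# Toy sketch (hypothetical helper, plain dicts instead of a DataFrame).
BASE_COLUMNS = ["score", "shots"]  # small subset of the notebook's base columns


def split_match(match):
    """Turn one match record into two interview records (home view, away view)."""
    home = {k: v for k, v in match.items() if not k.startswith("interview_")}
    home["is_home_team"] = 1
    home["interview"] = match["interview_home_english"]

    away = dict(home)
    away["is_home_team"] = 0
    away["interview"] = match["interview_away_english"]
    # Swap names and statistics so "_home" always refers to the interviewed team.
    away["name_home_team"] = match["name_away_team"]
    away["name_away_team"] = match["name_home_team"]
    for col in BASE_COLUMNS:
        away[col + "_home"] = match[col + "_away"]
        away[col + "_away"] = match[col + "_home"]
    return [home, away]


match = {
    "name_home_team": "Team A", "name_away_team": "Team B",
    "score_home": 2, "score_away": 1,
    "shots_home": 15, "shots_away": 9,
    "interview_home_english": "We deserved the win.",
    "interview_away_english": "We started too slowly.",
}
rows = split_match(match)
```

After the split, a comparison such as `score_home > score_away` reads as "the interviewed team won" in both rows, which is what the later parts rely on.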

In [3]:
important_columns = ['name_home_team', 'name_away_team', 'score_home', 'score_away',
       'shots_home', 'shots_away', 'passes_home', 'passes_away',
       'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
       'pass_accuracy_away', 'distance_home', 'distance_away', 'grade',
       'interview_home_english','interview_away_english']
base_columns_names = ['score','shots', 'passes','misplaced_passes', 'pass_accuracy', 'distance']
def read_and_process(file):
    """
        Given a dataset of interviews, read the data, 
        separate home and away interviews, and return the 
        resulting dataframe
    """
    df = pd.read_csv(file)
    df = df[important_columns]

    # Prepare the first half
    df_home = df.copy()
    df_home['is_home_team'] = 1
    df_home = df_home.drop('interview_away_english', axis = 1)
    df_home['interview'] = df_home['interview_home_english']
    df_home = df_home.drop('interview_home_english', axis = 1)

    # Prepare the second half
    df_away = df.copy()
    df_away['is_home_team'] = 0
    df_away = df_away.drop('interview_home_english', axis = 1)
    df_away['interview'] = df_away['interview_away_english']
    df_away = df_away.drop('interview_away_english', axis = 1)

    # Swap the team names
    df_away['name_home_team'] = df_home['name_away_team']
    df_away['name_away_team'] = df_home['name_home_team']

    # Swap all the other base columns:
    for col in base_columns_names:
        df_away[col + '_home'] = df_home[col + '_away']
        df_away[col + '_away'] = df_home[col + '_home']
        df_away = df_away.copy()
    df = pd.concat([df_home, df_away], ignore_index = True)

    # Delete not found interviews
    df = df[df['interview'] != 'NOTFOUND']

    # Fix the grade column
    def fix_grade(g):
        """
            g is given as a string of the form digit,comma,digit (e.g. "2,5")
        """
        return float(g[0] + '.' + g[2])
    df['grade'] = df['grade'].apply(func = fix_grade)
    return df 
df1 = read_and_process('1920.csv')
df2 = read_and_process('1819.csv')
df = pd.concat([df1, df2], ignore_index=True)
display(df.head(5))

Unnamed: 0,name_home_team,name_away_team,score_home,score_away,shots_home,shots_away,passes_home,passes_away,misplaced_passes_home,misplaced_passes_away,pass_accuracy_home,pass_accuracy_away,distance_home,distance_away,grade,is_home_team,interview
0,Bayern München,Hertha BSC,2.0,2.0,17.0,6.0,661.0,282.0,79.0,81.0,88.0,71.0,114.47,119.19,2.0,1,We had the dominance and the chances. The team...
1,Borussia Dortmund,FC Augsburg,5.0,1.0,22.0,5.0,886.0,246.0,68.0,69.0,92.0,72.0,110.57,113.09,2.0,1,We were surprised very early about the 0: 1 af...
2,Bayer 04 Leverkusen,SC Paderborn 07,3.0,2.0,13.0,9.0,763.0,267.0,103.0,95.0,87.0,64.0,122.81,123.08,1.5,1,"I'm very satisfied with the result, but not ye..."
3,VfL Wolfsburg,1. FC Köln,2.0,1.0,15.0,11.0,377.0,411.0,96.0,99.0,75.0,76.0,116.85,111.96,4.0,1,We are very satisfied that we were able to win...
4,Werder Bremen,Fortuna Düsseldorf,1.0,3.0,24.0,12.0,639.0,308.0,81.0,72.0,87.0,77.0,115.24,117.77,3.0,1,We had a lot of control over the game in the f...


In [4]:
df.describe()

Unnamed: 0,score_home,score_away,shots_home,shots_away,passes_home,passes_away,misplaced_passes_home,misplaced_passes_away,pass_accuracy_home,pass_accuracy_away,distance_home,distance_away,grade,is_home_team
count,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0,1090.0
mean,1.617431,1.617431,13.329358,13.329358,456.645872,456.645872,92.523853,92.523853,78.057798,78.057798,116.313404,116.313404,3.074312,0.5
std,1.419227,1.419227,5.096026,5.096026,131.149104,131.149104,16.28065,16.28065,7.183887,7.183887,4.555203,4.555203,0.900709,0.50023
min,0.0,0.0,1.0,1.0,187.0,187.0,46.0,46.0,46.0,46.0,102.7,102.7,1.0,0.0
25%,1.0,1.0,10.0,10.0,362.0,362.0,81.0,81.0,74.0,74.0,113.15,113.15,2.5,0.0
50%,1.0,1.0,13.0,13.0,436.0,436.0,92.0,92.0,79.0,79.0,116.205,116.205,3.0,0.5
75%,2.0,2.0,16.0,16.0,529.75,529.75,102.0,102.0,83.0,83.0,119.31,119.31,4.0,1.0
max,8.0,8.0,34.0,34.0,1059.0,1059.0,156.0,156.0,94.0,94.0,129.65,129.65,5.0,1.0


<a id="id2"></a>

# Part 2: Baseline Model: Three GPT-2 Models

[Return to contents](#contents)

As our baseline, we fine-tune three separate GPT-2 models: one for winning teams, one for drawing teams, and one for losing teams. Then, given a match, we look up its result and generate an interview from the corresponding model.
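The model selection step can be sketched as follows. The function `baseline_key` and the `models` registry are hypothetical names used only for this illustration; in the notebook the registry values would be the three fine-tuned models.

```python
def baseline_key(score_own, score_opponent):
    """Return which baseline model ('winner', 'loser' or 'tie') applies."""
    if score_own > score_opponent:
        return "winner"
    if score_own < score_opponent:
        return "loser"
    return "tie"


# Hypothetical registry; strings stand in for the fine-tuned model objects.
models = {"winner": "winner_model", "loser": "loser_model", "tie": "tie_model"}
chosen = models[baseline_key(score_own=2, score_opponent=2)]
```

Nothing but the result enters this choice, so two very different 2-1 wins are served by the same model.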

<a id="id2.1"></a>

## Part 2.1: Base Functions

[Return to contents](#contents)
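The dataset builder in this part pads every tokenized interview to the length of the longest one and then shifts the token ids by one position to form inputs and targets. A minimal sketch of that idea on made-up token ids (the ids, `pad_to`, and `PAD_ID` are placeholders, not real tokenizer output):

```python
def pad_to(tokens, pad_id, target_length):
    """Right-pad a list of token ids with pad_id up to target_length."""
    return tokens + [pad_id] * (target_length - len(tokens))


PAD_ID = 0  # placeholder; the notebook pads with the tokenizer's EOS id
encoded = [[11, 12, 13, 14], [21, 22]]  # made-up token ids

max_len = max(len(seq) for seq in encoded)
padded = [pad_to(seq, PAD_ID, max_len) for seq in encoded]

# Next-token prediction: inputs drop the last token, targets drop the first.
input_ids = [seq[:-1] for seq in padded]
target_ids = [seq[1:] for seq in padded]
```

At every position the target is the token that follows the input token, which is the usual setup for fine-tuning a causal language model.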

In [5]:
def pad_list(lst, value, target_length):
    additional_pads = [value for _ in range(target_length - len(lst))]
    return lst + additional_pads

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
pad_value = tokenizer.eos_token_id

def build_dataset(df, BATCH_SIZE = 12):
    """
        For a given df, corresponding to some or all
        of the games, create a base dataset to train a gpt2 model
    """
    # Get the encoding
    blocks = [tokenizer.encode(interview) for interview in df['interview']]

    # Pad the short interviews
    max_len = max(len(x) for x in blocks)
    blocks = [pad_list(x, pad_value, max_len) for x in blocks]

    # Get the input and output ids
    input_ids = [x[:-1] for x in blocks]
    output_ids = [x[1:] for x in blocks]

    # Prepare the data sets
    TRAIN_SHUFFLE_BUFFER_SIZE = 10000

    # Create 
    train_data = tf.data.Dataset.from_tensor_slices((input_ids, output_ids))

    # Shuffle
    train_data = train_data.shuffle(buffer_size=TRAIN_SHUFFLE_BUFFER_SIZE)

    # Batch
    train_data = train_data.batch(BATCH_SIZE, drop_remainder=True)

    return train_data

In [6]:
def build_general_gpt2(
    train_data,
    learning_rate = 3e-5,
    epsilon=1e-08,
    clipnorm=1.0,
    epochs = 30,
    train_the_model = False
  ):
    """
        For a given tf dataset and training parameters, 
        fine tune a gpt2 model on this dataset and return the model
    """
    # Get the pretrained model
    model_gpt2 = TFGPT2LMHeadModel.from_pretrained("distilgpt2", pad_token_id = pad_value)

    # Optimizer, Loss function, and metrics
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon, clipnorm=clipnorm)
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    metric = keras.metrics.SparseCategoricalAccuracy('accuracy')

    # Compile
    model_gpt2.compile(loss=[loss, *[None] * model_gpt2.config.n_layer],
                        optimizer=optimizer,
                        metrics=[metric])
    # Train model
    if train_the_model:
        start_time = time.time()
        training_results = model_gpt2.fit(
                train_data,
                epochs=epochs, 
                verbose=1)
        execution_time = (time.time() - start_time)/60.0
        print("Training execution time (mins)",execution_time)
    return model_gpt2

In [7]:
def generate_from_baseline(model, input_text = "Today we"):
    """
        Given a model and an initial text, generate an interview 
        from the model starting with the given initial text.
    """
    # Tokenize Input
    input_ids = tokenizer.encode(input_text, return_tensors='tf')

    # Generate output with nucleus (top-p) sampling; top_k=0 disables top-k filtering
    outputs = model.generate(
      input_ids, 
      do_sample=True, 
      max_length=100, 
      top_p=0.80, 
      top_k=0
    )
    print("Generated text:")
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)
    return generated_text

<a id="id2.2"></a>

## Part 2.2: Winners GPT 2 Model

[Return to contents](#contents)

In [155]:
df_winner = df[df['score_home'] > df['score_away']]
winner_data = build_dataset(df_winner)

In [88]:
winner_model = build_general_gpt2(winner_data, train_the_model = False)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [None]:
winner_model.save_pretrained('model/baseline_winner_3')

In [None]:
winner_model = winner_model.from_pretrained('model/baseline_winner_3')

In [None]:
_ = generate_from_baseline(winner_model)

Generated text:
Today we're happy with the three points. Overall, it was a deserved victory. We increased the lead to eleven points with a little coaching. I'm very happy with the result and the way the team played. We are very satisfied with this win and are looking forward to the game.


In [None]:
_ = generate_from_baseline(winner_model)

Generated text:
Today we are very happy with the 3-1 win. The three points are very important to us. The guys did a great job.


In [None]:
_ = generate_from_baseline(winner_model)

Generated text:
Today we are happy that we were able to score two goals. That was important for the second half. We played well forward and pushed ourselves up against Paderborn. The 2-0 win was extremely important for the mentality, character and mentality of the team. The win was important for our development.


<a id="id2.3"></a>

## Part 2.3: Losers GPT 2 Model

[Return to contents](#contents)

In [None]:
df_loser = df[df['score_home'] < df['score_away']]
loser_data = build_dataset(df_loser)

In [None]:
loser_model = build_general_gpt2(loser_data, epochs=30, train_the_model = False)

In [None]:
loser_model.save_pretrained('model/baseline_loser_3')

In [None]:
loser_model = loser_model.from_pretrained('model/baseline_loser_3')

In [None]:
for _ in range(4):
    _ = generate_from_baseline(loser_model)

Generated text:
Today we found it very difficult. We didn't take advantage of what we got, what we get, we have to put everything that we can on the pitch. It is a deserved victory for Frankfurt.
Generated text:
Today we played a very good game, but we couldn't stop the game. In the second half we did a lot better, but we didn't have the courage to defend forward. The crucial point was to score the third goal. That game deserved to me. We will always be a favorite.
Generated text:
Today we weren't good at football, it was better at football too. The game was much more intense than last year. We got into the game better after the 1-0 win. That was a disappointment, but also a reflection of the whole season. It was also good for both teams.
Generated text:
Today we defended a very brave home game in which we allowed only one goal in two games. We did that because we liked what we saw. Then you have a completely different picture. The opponent has more access to the ball than we did in th

<a id="id2.4"></a>

## Part 2.4: Tie GPT 2 Model

[Return to contents](#contents)

In [None]:
df_tie = df[df['score_home'] == df['score_away']]
tie_data = build_dataset(df_tie)

In [None]:
tie_model = build_general_gpt2(tie_data, epochs=30, train_the_model = False)

In [None]:
tie_model.save_pretrained('model/baseline_tie_3')

In [None]:
tie_model = tie_model.from_pretrained('model/baseline_tie_1')

In [None]:
for _ in range(4):
  _ = generate_from_baseline(tie_model)

Generated text:
Today we saw a very exciting game with a very clear dominance in the first half. We saw a very good performance from us today. After the 1: 1, we also implemented some changes to the game plan. The team implemented a lot in the second half. I can't blame the guys. We showed a lot today.
Generated text:
Today we will analyze the game against Bremen. After a long and intense first half, it was okay for us to score more goals. We will not forget that today we are still unbeaten. We will not forget that today we are still unbeaten. We will not forget that today we are still unbeaten.
Generated text:
Today we see a lot of changes and need to improve. We should have won the game for a long time, but the point is worth a lot. We had a very good game with a lot of power and could have won the game for a lot of goals.
Generated text:
Today we have to play football with a clear mentality. We have to be consistent and we have to do it today. We have to get back into the game, we s

<a id="id2.5"></a>

## Part 2.5: Results

[Return to contents](#contents)

The baseline models generate interviews that are coherent and make sense, largely thanks to the strength of the pretrained GPT-2 model. However, apart from the result used to pick the model, they don't take any match details into account. Take this example from the winners model:

- Today we're happy with the three points. Overall, it was a deserved victory. We increased the lead to eleven points with a little coaching. I'm very happy with the result and the way the team played. We are very satisfied with this win and are looking forward to the game.

While this sounds like a good interview, we make two observations:
- The coach says that this is a deserved victory. But what if, in the given match, the team played badly and won through a lucky shot or two? We want our model to capture that.
- The coach mentions that the lead is now eleven points. This refers to the league standings, not just to this match. This is a concern when generating interviews, but we don't expect to solve it: it would require working with the results of full seasons rather than single matches, which is outside the scope of this project.

<a id="id3"></a>

# Part 3: Conditional Transformer
[Return to contents](#contents)

<a id="id3.1"></a>

## Part 3.1 Idea
[Return to contents](#contents)

To address the concerns mentioned earlier, we want our models to take into account all the data we have about a match. To do so, we want that data to affect the probabilities of generating words inside the GPT-2 model.

Our first approach is a conditional GPT-2 model. To build it, we first fine-tune a GPT-2 model on all the matches so that it adapts to the language of the interviews.

Then, we modify the transformer layers so that they use the match statistics. The small GPT-2 model we use (distilgpt2) has 6 transformer layers, each of which ends with a feed-forward neural network (FFNN). After each transformer layer, we add one more FFNN. This network takes the output of the transformer layer, concatenates it with the match statistics, and produces a new output of the same size as the transformer output.
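The shape flow through one added FFNN can be sketched in NumPy. This is only an illustration with toy sizes and random weights, not the notebook's Keras layer; all variable names here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, n_embd, n_stats = 2, 5, 8, 3  # toy sizes, much smaller than the real model

hidden = rng.normal(size=(batch, seq_len, n_embd))  # output of one transformer layer
stats = rng.normal(size=(batch, n_stats))           # one statistics vector per match

# Repeat each match's statistics at every token position.
stats_per_token = np.repeat(stats[:, None, :], seq_len, axis=1)  # (batch, seq_len, n_stats)
concat = np.concatenate([hidden, stats_per_token], axis=-1)      # (batch, seq_len, n_embd + n_stats)

# The added feed-forward layer maps back to the transformer width.
W = rng.normal(scale=0.02, size=(n_embd + n_stats, n_embd))
b = np.zeros(n_embd)
new_hidden = concat @ W + b                                      # (batch, seq_len, n_embd)
```

Because the output has the same shape as `hidden`, the next transformer layer can consume it without any other change to the architecture.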



<a id="id3.2"></a>

## Part 3.2 Initialization
[Return to contents](#contents)

To leverage the pretrained GPT-2 model, we want to make sure that the added layers don't break the relationships already learned by the transformer layers. So we want the new model to initially give the same results as the fine-tuned GPT-2 model and then, through further fine-tuning, to start finding new relationships with the help of the match statistics. To do so, we initialize every added FFNN layer with an identity matrix for the weights and zeros for the biases. For non-square matrices, the identity initializer fills the additional rows with zeros. Because the transformer outputs come first in the concatenated input, the zero rows line up with the match statistics, so at initialization each FFNN returns exactly the transformer output it receives and the model starts off with the same results as GPT-2. Through fine-tuning, it then starts learning the new relationships with the match statistics.
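A small NumPy check of this initialization argument, assuming the rectangular identity behaves like `np.eye(rows, cols)` (ones on the main diagonal, zeros elsewhere) and that the transformer outputs are placed before the match statistics in the concatenation. Sizes and names are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_embd, n_stats = 4, 3
hidden = rng.normal(size=(2, 5, n_embd))   # transformer outputs
stats = rng.normal(size=(2, 5, n_stats))   # match statistics, already repeated per token
concat = np.concatenate([hidden, stats], axis=-1)

# Rectangular "identity": the first n_embd rows form an identity matrix,
# the remaining n_stats rows are zero, so the statistics are ignored at first.
W = np.eye(n_embd + n_stats, n_embd)
b = np.zeros(n_embd)
out = concat @ W + b

unchanged = np.allclose(out, hidden)
```

Here `unchanged` is `True`: the statistics only start to influence the output once training moves the bottom rows of `W` away from zero.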

<a id="id3.3"></a>

## Part 3.3 Interpretation
[Return to contents](#contents)

What is the interpretation of adding the match statistics in between the transformer layers? Our intuition is as follows:
- In the first few transformer layers, the GPT-2 model tries to understand the preceding text. But the sentences in the interviews mean different things under different results. Take, as an example, the sentence "We played against a strong team". If the team won the match, this sentence shows that the coach is proud of their team and respects the opponent. But if the team lost, the same sentence could mean that the coach is looking for excuses.
- In the later transformer layers, the GPT-2 model decodes the information it has into an interview. Here, the match statistics help the model choose words appropriate for the result.

<a id="id3.4"></a>

## Part 3.4: General GPT 2
[Return to contents](#contents)

Here, we fine-tune a GPT-2 model on all the data.

In [None]:
all_data = build_dataset(df)

In [None]:
train_the_model = False
if train_the_model:
    model = build_general_gpt2(all_data, train_the_model = True, epochs=10)
    model.save_pretrained("general_1")
else:
    model = build_general_gpt2(all_data, train_the_model = False)
    model = model.from_pretrained("general_1")

<a id="id3.5"></a>

## Part 3.5: The Conditional Class and Functions
[Return to contents](#contents)

Here, we write functions to prepare the dataset, train the model, and generate interviews with it.

For the model class, we copy most of the code from transformers.models.gpt2.modeling_tf_gpt2, as we are using the exact same transformer layers, just with the added FFNN layers.

In [None]:
def pad_list(lst, value, target_length):
    additional_pads = [value for _ in range(target_length - len(lst))]
    return lst + additional_pads

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
pad_value = tokenizer.eos_token_id

num_columns = ['score_home', 'score_away','shots_home', 'shots_away', 'passes_home', 'passes_away',
              'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
              'pass_accuracy_away', 'distance_home', 'distance_away', 'grade',
              'is_home_team']

def build_dataset_for_custom_transformer(df, columns = num_columns, BATCH_SIZE = 12):
    """
        For a given df, corresponding to some or all
        of the games, create a dataset of ((input ids, match stats), output ids)
        to train the conditional gpt2 model
    """
    # Get the encoding
    blocks = [tokenizer.encode(interview) for interview in df['interview']]
    match_stats = [row[columns] for _, row in df.iterrows()]

    # Pad the short interviews
    max_len = max(len(x) for x in blocks)
    blocks = [pad_list(x, pad_value, max_len) for x in blocks]

    # Get the input and output ids
    input_ids = [x[:-1] for x in blocks]
    output_ids = [x[1:] for x in blocks]

    # Prepare the data sets
    TRAIN_SHUFFLE_BUFFER_SIZE = 10000

    # Create 
    train_data = tf.data.Dataset.from_tensor_slices(((input_ids, match_stats), output_ids))

    # Batch
    train_data = train_data.batch(BATCH_SIZE, drop_remainder=True)

    return train_data

In [None]:
data = build_dataset_for_custom_transformer(df)

In [None]:
validation_data = data.take(5)
train_data = data.skip(5)

In [None]:
class Custom_dense(tf.keras.layers.Layer):
    """
    A layer that takes match stats and transformer outputs and
    returns modified outputs of the same size as the transformer outputs.
    """

    def __init__(
        self,
        transformer_output_shape,
        soccer_stats_shape,
        name="custom",
        **kwargs
    ):
        super(Custom_dense, self).__init__(name=name, **kwargs)
        self.soccer_stats_shape = soccer_stats_shape
        # Identity weights and zero biases: the layer initially returns t_outputs unchanged
        self.dense = keras.layers.Dense(transformer_output_shape, kernel_initializer=initializers.identity(), bias_initializer=initializers.zeros())

    def call(self, t_outputs, match_stats):
        # Reshape the stats to (batch, sequence length, number of stats) and concatenate
        dims = shape_list(t_outputs)
        match_stats = tf.reshape(match_stats, (dims[0], dims[1], self.soccer_stats_shape))
        concat = keras.layers.concatenate([t_outputs, match_stats])

        # Get the output
        outputs = self.dense(concat)
        return outputs

In [None]:
"""
  Most of the code for this part has been copied from transformers.models.gpt2.modeling_tf_gpt2
"""
class Custom_transformer(keras.models.Model):
    config_class = GPT2Config

    def __init__(self, config, model, soccer_input_shape = 1, *inputs, **kwargs):
        super(Custom_transformer, self).__init__()

        self.config = config
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.use_cache = config.use_cache
        self.return_dict = config.use_return_dict

        self.num_hidden_layers = config.n_layer
        self.vocab_size = config.vocab_size
        self.n_embd = config.n_embd
        self.n_positions = config.n_positions
        self.initializer_range = config.initializer_range

        self.wte = model.transformer.wte
        self.wpe = model.transformer.wpe
        self.drop = model.transformer.drop
        self.h = model.transformer.h
        self.ln_f = model.transformer.ln_f

        # model2.transformer.h[-1].mlp.c_fc.nx
        self.custom_layers=[
          Custom_dense(self.h[i].mlp.c_fc.nx, soccer_input_shape) for i in range(config.n_layer)
        ]

    def get_input_embeddings(self):
        return self.wte

    def set_input_embeddings(self, value):
        self.wte.weight = value
        self.wte.vocab_size = shape_list(value)[0]

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
        """
        raise NotImplementedError
    @tf.function()
    def call(
        self,
        input=None,
        past=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        training=False,
        **kwargs,
    ):
        input_ids, match_stats = input
        inputs = input_processing(
            func=self.call,
            config=self.config,
            input_ids=input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            training=training,
            kwargs_call=kwargs,
        )
        inputs["inputs_embeds"] = None
        # print(inputs['inputs_embeds'])
        if input_ids is not None and inputs["inputs_embeds"] is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = shape_list(input_ids)
            inputs["input_ids"] = tf.reshape(input_ids, [-1, input_shape[-1]])
        elif inputs["inputs_embeds"] is not None:
            input_shape = shape_list(inputs["inputs_embeds"])[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if inputs["past"] is None:
            past_length = 0
            inputs["past"] = [None] * len(self.h)
        else:
            past_length = shape_list(inputs["past"][0][0])[-2]

        if inputs["position_ids"] is None:
            inputs["position_ids"] = tf.expand_dims(tf.range(past_length, input_shape[-1] + past_length), axis=0)

        if inputs["attention_mask"] is not None:
            # We create a 3D attention mask from a 2D tensor mask.
            # Sizes are [batch_size, 1, 1, to_seq_length]
            # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
            # this attention mask is more simple than the triangular masking of causal attention
            # used in OpenAI GPT, we just need to prepare the broadcast dimension here.
            attention_mask_shape = shape_list(inputs["attention_mask"])
            inputs["attention_mask"] = tf.reshape(
                inputs["attention_mask"], (attention_mask_shape[0], 1, 1, attention_mask_shape[1])
            )

            # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
            # masked positions, this operation will create a tensor which is 0.0 for
            # positions we want to attend and -10000.0 for masked positions.
            # Since we are adding it to the raw scores before the softmax, this is
            # effectively the same as removing these entirely.
            one_cst = tf.constant(1.0)
            inputs["attention_mask"] = tf.cast(inputs["attention_mask"], dtype=one_cst.dtype)
            inputs["attention_mask"] = tf.multiply(
                tf.subtract(one_cst, inputs["attention_mask"]), tf.constant(-10000.0)
            )

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        if inputs["head_mask"] is not None:
            raise NotImplementedError
        else:
            inputs["head_mask"] = [None] * self.num_hidden_layers
            # head_mask = tf.constant([0] * self.num_hidden_layers)

        inputs["position_ids"] = tf.reshape(inputs["position_ids"], [-1, shape_list(inputs["position_ids"])[-1]])

        if inputs["inputs_embeds"] is None:
            inputs["inputs_embeds"] = self.wte(inputs["input_ids"], mode="embedding")

        position_embeds = tf.gather(self.wpe, inputs["position_ids"])

        if inputs["token_type_ids"] is not None:
            inputs["token_type_ids"] = tf.reshape(
                inputs["token_type_ids"], [-1, shape_list(inputs["token_type_ids"])[-1]]
            )
            token_type_embeds = self.wte(inputs["token_type_ids"], mode="embedding")
        else:
            token_type_embeds = tf.constant(0.0)

        position_embeds = tf.cast(position_embeds, dtype=inputs["inputs_embeds"].dtype)
        token_type_embeds = tf.cast(token_type_embeds, dtype=inputs["inputs_embeds"].dtype)
        hidden_states = inputs["inputs_embeds"] + position_embeds + token_type_embeds
        hidden_states = self.drop(hidden_states, training=inputs["training"])

        output_shape = input_shape + [shape_list(hidden_states)[-1]]

        presents = () if inputs["use_cache"] else None
        all_attentions = () if inputs["output_attentions"] else None
        all_hidden_states = () if inputs["output_hidden_states"] else None

        # Copy the per-match stats to every token position:
        # (batch, n_stats) -> (batch, seq_len, n_stats), so each position sees the full stats vector
        match_stats = tf.tile(tf.expand_dims(match_stats, axis=1), [1, input_shape[1], 1])
        for i, (block, layer_past) in enumerate(zip(self.h, inputs["past"])):
            if inputs["output_hidden_states"]:
                all_hidden_states = all_hidden_states + (tf.reshape(hidden_states, output_shape),)

            outputs = block(
                hidden_states,
                layer_past,
                inputs["attention_mask"],
                inputs["head_mask"][i],
                inputs["use_cache"],
                inputs["output_attentions"],
                training=inputs["training"],
            )

            hidden_states, present = outputs[:2]
            hidden_states = self.custom_layers[i](hidden_states, match_stats)
            if inputs["use_cache"]:
                presents = presents + (present,)

            if inputs["output_attentions"]:
                all_attentions = all_attentions + (outputs[2],)

        hidden_states = self.ln_f(hidden_states)

        hidden_states = tf.reshape(hidden_states, output_shape)

        logits = self.wte(hidden_states, mode="linear")
        return (logits,) + (present,)

In [None]:
custom_model = Custom_transformer(model.config, model, len(num_columns))

In [None]:
# parameters
learning_rate = 3e-5
epsilon=1e-08
clipnorm=1.0
epochs = 20

# Optimizer, Loss function, and metrics
optimizer = keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon, clipnorm=clipnorm)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = keras.metrics.SparseCategoricalAccuracy('accuracy')

# Compile
custom_model.compile(loss=[loss, *[None] * custom_model.config.n_layer],
                  optimizer=optimizer,
                  metrics=[metric])


In [None]:
# Train model
start_time = time.time()
training_results = custom_model.fit(
        train_data,
        epochs=epochs,
        verbose=1)
execution_time = (time.time() - start_time)/60.0
print("Training execution time (mins)",execution_time)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Training execution time (mins) 520.1370963613192


In [None]:
custom_model.save('custom_model1')

In [None]:
def generate_from_custom(model, val_data, top_k = 15, max_len = 120, input_text = "I think "):
    rows = []
    for i, batch in zip(range(5), val_data):
        print(f"Batch {i}")
        # Tokenize Input (one copy of the prompt per match in the batch of 12)
        cur_input_ids = tokenizer.encode(input_text, return_tensors='tf')
        cur_outputs = [cur_input_ids[0] for _ in range(12)]

        for _ in range(max_len):
            # Generate output: logits of the last position for every sequence
            logits = model((np.asarray(cur_outputs), batch[0][1]))[0]
            logits = [x[-1] for x in logits]
            # Pick the next token uniformly among the top_k most likely ids
            possible_next_id = [np.argpartition(a, -top_k)[-top_k:] for a in logits]
            next_ids = [np.random.choice(a) for a in possible_next_id]
            cur_outputs = [list(out) for out in cur_outputs]
            for out, next_id in zip(cur_outputs, next_ids):
                out.append(next_id)
        generated_text = [tokenizer.decode(out, skip_special_tokens=True) for out in cur_outputs]
        for match_stats, interview in zip(batch[0][1], generated_text):
            cur_dict ={x:y for x, y in zip(num_columns, list(match_stats.numpy()))}
            cur_dict['interview'] = interview
            rows.append(cur_dict)
    return pd.DataFrame(rows, columns = num_columns + ["interview"])

In [None]:
df_generated = generate_from_custom(custom_model, validation_data)

Batch 0
Batch 1
Batch 2
Batch 3
Batch 4


In [None]:
df_generated.to_csv("exp7.csv")

<a id="id3.6"></a>

## Part 3.6: Example Text
[Return to contents](#contents)

Let's look at some of the generated text. The file loaded here was generated with greedy decoding (k = 1).
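To make the difference between top-k sampling and greedy decoding concrete, here is a minimal standard-library sketch. The function name `sample_top_k` and the toy logits are assumptions for illustration only; the notebook's `generate_from_custom` makes the same kind of choice with `np.argpartition` and `np.random.choice`, picking uniformly among the k best ids.

```python
import heapq
import random

def sample_top_k(logits, top_k, rng=random):
    """Pick one token id uniformly at random among the top_k highest logits."""
    # Indices of the top_k largest logits
    candidates = heapq.nlargest(top_k, range(len(logits)), key=lambda i: logits[i])
    return rng.choice(candidates)

# Hypothetical scores for a 5-token vocabulary
toy_logits = [0.1, 2.5, -1.0, 1.7, 0.3]

greedy_id = sample_top_k(toy_logits, top_k=1)    # k = 1: always the arg-max
sampled_id = sample_top_k(toy_logits, top_k=3)   # one of the three best ids
```

With k = 1 the candidate set has a single element, so the output is deterministic; larger k trades that determinism for variety.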

In [177]:
df_generated = pd.read_csv('exp4.csv')

In [183]:
print(df_generated.loc[53])
print(df_generated.loc[53]['interview'])

Unnamed: 0                                                              53
score_home                                                             2.0
score_away                                                             2.0
shots_home                                                            25.0
shots_away                                                            11.0
passes_home                                                          472.0
passes_away                                                          311.0
misplaced_passes_home                                                 99.0
misplaced_passes_away                                                 93.0
pass_accuracy_home                                                    79.0
pass_accuracy_away                                                    70.0
distance_home                                                   113.010002
distance_away                                                   112.900002
grade                    

<a id="id3.7"></a>

## Part 3.7: Results
[Return to contents](#contents)

In the previous example, the text starts with "Today we have to accept defeat", although the team did not lose the match but tied. This is common in many of the generated examples, so the model is not performing well.

We think this is because the model is overfitting to some of the match statistics. We are throwing a lot of features at the model. These features are correlated with each other and might not be correlated with the actual interview, and we don't have enough data to overcome this. As a result, the model might associate a word like "defeat" with the pass counts rather than the scores, for example.

<a id="id4"></a>

# Part 4: Categorical GPT 2
[Return to contents](#contents)

<a id="id4.1"></a>

## Part 4.1: Idea
[Return to contents](#contents)

We believe the conditional model performs poorly because we feed it many match statistics that might not be useful. Instead, let's keep only the information that helps.

The shot, pass, and misplaced-pass counts all help answer one question: was the team dominating the match (and so deserved the win, or did not deserve the loss), or was it the weaker side (and so possibly won by luck, or deserved the loss)? So we use only labels that capture this. We create the labels and feed them to the GPT 2 model as a prefix, delimited by special tokens, at the beginning of the text. Then, to generate an interview for a match, we create its labels, feed them as input ids to the GPT 2 model, and have it generate an interview for us.
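As a rough illustration of the prefix layout, the sketch below builds the conditioning prefix as a plain string. The helper name `build_prompt` and the sample interview text are made up for this example; the notebook itself concatenates token ids rather than strings, and registers the four bracketed markers as special tokens on the tokenizer.

```python
def build_prompt(actual_result, expected_result, interview=""):
    """Lay out the two labels, wrapped in start/end markers, before the interview."""
    return (
        "[s:actual_result]" + actual_result + "[e:actual_result]"
        + "[s:expected_result]" + expected_result + "[e:expected_result]"
        + interview
    )

# Training: labels followed by the real interview (placeholder text here)
training_example = build_prompt("tie", "regular win", "We had the chances.")

# Generation: labels only; the model continues from the end of the prefix
generation_prompt = build_prompt("regular win", "dominant loss")
```

Because the labels sit in front of every training interview, the model can learn to continue a given label pair with text in a matching tone.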

<a id="id4.2"></a>

## Part 4.2: Creating the Labels
[Return to contents](#contents)

From the match statistics, we create two labels and only use those for generating. 

- The expected result: We build a logistic regression model that predicts the probability of winning given the statistics of the game (shots, passes, ...). The win probability tells us which side had more control and dominated the match. We then divide it into five labels ["dominant loss", "regular loss", "tie", "regular win", "dominant win"], corresponding to the probability intervals [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1].

- The actual result: Based on the goal difference, we create the same labels ["dominant loss", "regular loss", "tie", "regular win", "dominant win"], corresponding to: loss by more than two goals, loss by one or two goals, tie, win by one or two goals, win by more than two goals.
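The two binning rules above can be sketched compactly with `bisect`. This is an alternative formulation for illustration, not the notebook's own code (which uses chained `if` statements); the function names are assumptions.

```python
from bisect import bisect_right

LABELS = ["dominant loss", "regular loss", "tie", "regular win", "dominant win"]

def label_from_win_probability(p):
    """Bins: [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1]."""
    return LABELS[bisect_right([0.2, 0.4, 0.6, 0.8], p)]

def label_from_goal_difference(diff):
    """Bins: <= -3, -2..-1, 0, 1..2, >= 3 (the list holds each bin's lower bound)."""
    return LABELS[bisect_right([-2, 0, 1, 3], diff)]

expected = label_from_win_probability(0.35)   # "regular loss"
actual = label_from_goal_difference(2)        # "regular win"
```

A match whose two labels disagree, as in this example, is exactly the case the model is meant to pick up on: a win that the statistics did not predict.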

<a id="id4.3"></a>

## Part 4.3: Logistical Regression
[Return to contents](#contents)

For this part, we build a logistic regression model to predict match results. 

In [8]:
x_columns = ['shots_home', 'shots_away', 'passes_home', 'passes_away',
        'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
        'pass_accuracy_away', 'distance_home', 'distance_away', 'grade', 'is_home_team']
x = df[x_columns]

def get_result_cat(diff):
    if diff>0:
        return "win"
    elif diff<0:
        return "lose"
    else:
        return "tie"

y = (df['score_home'] - df['score_away']).apply(get_result_cat)
#y = pd.get_dummies(y)

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 109)

In [15]:
train_the_model = False
if train_the_model:
    log_model = LogisticRegression(random_state=109, max_iter=5000).fit(x_train, y_train)
    pickle.dump(log_model, open('model/log1', 'wb'))
else:
    log_model = pickle.load(open('model/log1', 'rb'))

In [16]:
log_model.classes_

array(['lose', 'tie', 'win'], dtype=object)

In [17]:
s = log_model.score(x_test, y_test)
print(f"The accuracy of the model is {s:0.3f}")

The accuracy of the model is 0.633


In [18]:
def get_expected_result(row):
    x = row[x_columns]
    
    # Get the win probability
    y = log_model.predict_proba([x])[0][2]
    
    # Return the labels
    if y < 0.2:
        return "dominant loss"
    if y < 0.4:
        return "regular loss"
    if y < 0.6:
        return "tie"
    if y < 0.8:
        return "regular win"
    return "dominant win"

df['expected_result'] = df.apply(get_expected_result, axis = 1)


In [19]:
def get_actual_result(row):
    home_goals = int(row['score_home'])
    away_goals = int(row['score_away'])
    diff = home_goals - away_goals
    if diff < -2:
        return "dominant loss"
    if diff < 0:
        return "regular loss"
    if diff == 0:
        return "tie"
    if diff < 3:
        return "regular win"
    return "dominant win"

df['actual_result'] = df.apply(get_actual_result, axis = 1)

In [20]:
df.head(5)

Unnamed: 0,name_home_team,name_away_team,score_home,score_away,shots_home,shots_away,passes_home,passes_away,misplaced_passes_home,misplaced_passes_away,pass_accuracy_home,pass_accuracy_away,distance_home,distance_away,grade,is_home_team,interview,expected_result,actual_result
0,Bayern München,Hertha BSC,2.0,2.0,17.0,6.0,661.0,282.0,79.0,81.0,88.0,71.0,114.47,119.19,2.0,1,We had the dominance and the chances. The team...,regular loss,tie
1,Borussia Dortmund,FC Augsburg,5.0,1.0,22.0,5.0,886.0,246.0,68.0,69.0,92.0,72.0,110.57,113.09,2.0,1,We were surprised very early about the 0: 1 af...,dominant win,dominant win
2,Bayer 04 Leverkusen,SC Paderborn 07,3.0,2.0,13.0,9.0,763.0,267.0,103.0,95.0,87.0,64.0,122.81,123.08,1.5,1,"I'm very satisfied with the result, but not ye...",regular loss,regular win
3,VfL Wolfsburg,1. FC Köln,2.0,1.0,15.0,11.0,377.0,411.0,96.0,99.0,75.0,76.0,116.85,111.96,4.0,1,We are very satisfied that we were able to win...,regular win,regular win
4,Werder Bremen,Fortuna Düsseldorf,1.0,3.0,24.0,12.0,639.0,308.0,81.0,72.0,87.0,77.0,115.24,117.77,3.0,1,We had a lot of control over the game in the f...,tie,regular loss


<a id="id4.4"></a>

## Part 4.4: Categorical GPT 2
[Return to contents](#contents)

Here, we build a categorical GPT 2 model by feeding the categories as input ids to the model when training and generating. 

In [21]:
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
special_tokens_dict = {
        "additional_special_tokens": [
            '[s:actual_result]', '[e:actual_result]', 
            '[s:expected_result]', '[e:expected_result]',
        ]
    }
tokenizer.add_special_tokens(special_tokens_dict)

4

In [22]:
special_tokens = tokenizer.added_tokens_encoder
def format_interview(row):
    output =  [special_tokens['[s:actual_result]']] + tokenizer.encode(row['actual_result']) + [special_tokens['[e:actual_result]']] + \
            [special_tokens['[s:expected_result]']] + tokenizer.encode(row['expected_result']) + [special_tokens['[e:expected_result]']] + \
            tokenizer.encode(row['interview'])
    return output
formatted_interviews = df.apply(format_interview, axis = 1)

In [23]:
cat_model = TFGPT2LMHeadModel.from_pretrained("distilgpt2")
cat_model.resize_token_embeddings(len(tokenizer))

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


<transformers.modeling_tf_utils.TFSharedEmbeddings at 0x7fbf3bf3a130>

In [24]:
def pad_list(lst, value, target_length):
    additional_pads = [value for _ in range(target_length - len(lst))]
    return lst + additional_pads

pad_value = tokenizer.eos_token_id

def build_dataset_for_categorical_transformer(blocks, BATCH_SIZE = 12):
    """
        For a list of interviews, create a base dataset to train a gpt2 model
    """
    # Pad the short interviews
    max_len = max(len(x) for x in blocks)
    blocks = [pad_list(x, pad_value, max_len) for x in blocks]

    # Get the input and output ids
    input_ids = [x[:-1] for x in blocks]
    output_ids = [x[1:] for x in blocks]

    # Prepare the data sets
    TRAIN_SHUFFLE_BUFFER_SIZE = 10000

    # Create the dataset of (input, target) pairs
    train_data = tf.data.Dataset.from_tensor_slices((input_ids, output_ids))

    # Shuffle
    train_data = train_data.shuffle(buffer_size=TRAIN_SHUFFLE_BUFFER_SIZE)

    # Batch
    train_data = train_data.batch(BATCH_SIZE, drop_remainder=True)

    return train_data

In [25]:
train_data = build_dataset_for_categorical_transformer(formatted_interviews)
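The padding and shifting done inside `build_dataset_for_categorical_transformer` can be seen on a toy example. The token ids and the pad id `0` below are made up; the notebook pads with `tokenizer.eos_token_id`.

```python
def pad_list(lst, value, target_length):
    """Right-pad lst with value up to target_length."""
    return lst + [value] * (target_length - len(lst))

PAD = 0  # hypothetical pad id for this sketch

blocks = [[11, 12, 13, 14], [21, 22]]   # toy token-id sequences
max_len = max(len(b) for b in blocks)
padded = [pad_list(b, PAD, max_len) for b in blocks]

input_ids = [b[:-1] for b in padded]    # everything except the last token
output_ids = [b[1:] for b in padded]    # the same sequence shifted left by one
```

At every position the target is the token that follows the input token, which is the next-token prediction objective the model is trained on.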

In [26]:
# parameters
learning_rate = 3e-5
epsilon=1e-08
clipnorm=1.0
epochs = 30

In [27]:
# Optimizer, Loss function, and metrics
optimizer = keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon, clipnorm=clipnorm)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = keras.metrics.SparseCategoricalAccuracy('accuracy')

# Compile
cat_model.compile(loss=[loss, *[None] * cat_model.config.n_layer],
                  optimizer=optimizer,
                  metrics=[metric])

In [184]:
# Train model
train = False
if train:
    start_time = time.time()
    training_results = cat_model.fit(
            train_data,
            epochs=epochs, 
            verbose=1)
    execution_time = (time.time() - start_time)/60.0
    print("Training execution time (mins)",execution_time)
    cat_model.save_pretrained('model/cat2')
else:
    cat_model = TFGPT2LMHeadModel.from_pretrained('model/cat1')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at model/cat1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


<a id="id4.5"></a>

## Part 4.5: Example Text
[Return to contents](#contents)

Let's generate an example to check the results.

In [185]:
row = {
    'actual_result': 'regular win',
    'expected_result': 'dominant loss',
    'interview': ''
}

In [186]:
interview = format_interview(row)

In [188]:
def generate_from_categorical(model, input_ids):
    input_ids = tf.constant(input_ids)
    # Generate output
    outputs = model.generate(
      input_ids, 
      do_sample=True, 
      max_length=120, 
      top_p=0.80, 
      top_k=10
    )
    # Drop the label prefix; its length depends on how the labels tokenize
    prompt_len = input_ids.shape[1]
    generated_text = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    return generated_text
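The `top_k` and `top_p` arguments passed to `generate` restrict which tokens can be sampled at each step. The sketch below shows the idea on a toy distribution, applying top-k first and then nucleus (top-p) filtering; it is a simplified illustration with a made-up function name, not the library's implementation, which differs in details.

```python
import math

def filter_top_k_top_p(logits, top_k, top_p):
    """Return the token ids that survive top-k, then top-p (nucleus) filtering."""
    # Keep the top_k highest-scoring ids, best first
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the survivors (shifted by the max for numerical stability)
    m = max(logits[i] for i in ranked)
    weights = [math.exp(logits[i] - m) for i in ranked]
    total = sum(weights)
    kept, cumulative = [], 0.0
    for i, w in zip(ranked, weights):
        kept.append(i)
        cumulative += w / total
        if cumulative >= top_p:   # smallest prefix whose mass reaches top_p
            break
    return kept

# Toy distribution with probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
kept_ids = filter_top_k_top_p(toy_logits, top_k=3, top_p=0.8)
```

The next token is then sampled only from the surviving ids, which removes the long tail of unlikely tokens while keeping some variety.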

In [189]:
generate_from_categorical(cat_model, [interview])

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


"It was a very difficult game, we didn't have the punch, we played too slowly and too slowly. I'm very happy that we won the opening game against a very strong opponent."

<a id="id4.6"></a>

## Part 4.6: Results
[Return to contents](#contents)

Many of the generated examples make sense. In the previous example, where a team won but was expected to lose heavily, the generated interview says "It was a very difficult game" and "I'm very happy that we won the opening game against a very strong opponent", both consistent with the given labels. This is common among the generated examples. 

<a id="id4.7"></a>

## Part 4.7: Generator Class
[Return to contents](#contents)

Here, we wrap the trained model in one class to be used by the server for the dashboard. 

In [119]:
class interview_generator():
    def __init__(self):
        # GPT2
        self.cat_model = TFGPT2LMHeadModel.from_pretrained('model/cat1')
        self.tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
        special_tokens_dict = {
            "additional_special_tokens": [
            '[s:actual_result]', '[e:actual_result]', 
            '[s:expected_result]', '[e:expected_result]',]}
        self.tokenizer.add_special_tokens(special_tokens_dict)
        
        # Logistic regression model
        self.log_model = pickle.load(open('model/log1', 'rb'))
        self.x_columns = ['shots_home', 'shots_away', 'passes_home', 'passes_away',
                            'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
                            'pass_accuracy_away', 'distance_home', 'distance_away', 'grade', 'is_home_team']
        
    def get_expected_result(self, row):
        x = [row[a] for a in self.x_columns]

        # Get the win probability
        y = self.log_model.predict_proba([x])[0][2]

        # Return the labels
        if y < 0.2:
            return "dominant loss"
        if y < 0.4:
            return "regular loss"
        if y < 0.6:
            return "tie"
        if y < 0.8:
            return "regular win"
        return "dominant win"

    @staticmethod
    def get_actual_result(row):
        home_goals = int(row['score_home'])
        away_goals = int(row['score_away'])
        diff = home_goals - away_goals
        if diff < -2:
            return "dominant loss"
        if diff < 0:
            return "regular loss"
        if diff == 0:
            return "tie"
        if diff < 3:
            return "regular win"
        return "dominant win"
    
    def format_interview(self, row):
        output = ([self.tokenizer.added_tokens_encoder['[s:actual_result]']] + 
                self.tokenizer.encode(interview_generator.get_actual_result(row)) + 
                [self.tokenizer.added_tokens_encoder['[e:actual_result]']] + 
                [self.tokenizer.added_tokens_encoder['[s:expected_result]']] + 
                self.tokenizer.encode(self.get_expected_result(row)) +
                [self.tokenizer.added_tokens_encoder['[e:expected_result]']])
        return output

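    def generate_interview(self, row):
        input_ids = self.format_interview(row)
        input_ids = tf.constant([input_ids])
        # Generate output
        outputs = self.cat_model.generate(
          input_ids, 
          do_sample=True, 
          max_length=120, 
          top_p=0.80, 
          top_k=10
        )
        # Drop the label prefix; its length depends on how the labels tokenize
        prompt_len = input_ids.shape[1]
        # Use the instance tokenizer rather than the notebook-level global
        generated_text = self.tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
        return generated_text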

In [167]:
gen = interview_generator()

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at model/cat1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [168]:
d = {
    'score_home': 2, 
    'score_away': 1,
    'shots_home': 14, 
    'shots_away': 17, 
    'passes_home':200, 
    'passes_away':400,
    'misplaced_passes_home':150, 
    'misplaced_passes_away':250, 
    'pass_accuracy_home':3/4,
    'pass_accuracy_away':5/8, 
    'distance_home':140, 
    'distance_away':120, 
    'grade':4,
    'is_home_team':True
}

In [169]:
df.columns

Index(['name_home_team', 'name_away_team', 'score_home', 'score_away',
       'shots_home', 'shots_away', 'passes_home', 'passes_away',
       'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
       'pass_accuracy_away', 'distance_home', 'distance_away', 'grade',
       'is_home_team', 'interview', 'expected_result', 'actual_result'],
      dtype='object')

In [170]:
gen.generate_interview(d)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


"I'm proud of the boys. They played a really good game. We did a lot of things right, especially in the first half. I am very happy that we were able to put a smile on our face."

<a id="id4.8"></a>

## Part 4.8: Generator Class From Pre generated interviews
[Return to contents](#contents)

Because it is very complicated to set up the transformer packages on Heroku, we pregenerate some interviews and use them for the online platform.

First, we pregenerate the examples: 50 for every combination of labels. 

Then, we write the code for the generator. 
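The pregeneration scheme amounts to a lookup table keyed by the label pair. With five labels for each of the two results there are 5 × 5 = 25 pairs, so 50 examples per pair gives 1,250 interviews. The sketch below uses a stand-in generator (`fake_generate`, an assumption) so that it runs without the GPT 2 model; the notebook stores the result in a CSV file instead of a dictionary.

```python
import itertools
import random

LABELS = ["dominant loss", "regular loss", "tie", "regular win", "dominant win"]

def build_lookup(generate_fn, per_pair=50):
    """Pregenerate per_pair interviews for every (actual, expected) label pair."""
    lookup = {}
    for actual, expected in itertools.product(LABELS, repeat=2):
        lookup[(actual, expected)] = [generate_fn(actual, expected) for _ in range(per_pair)]
    return lookup

def pick_interview(lookup, actual, expected, rng=random):
    """Serve a random pregenerated interview for the given label pair."""
    return rng.choice(lookup[(actual, expected)])

def fake_generate(actual, expected):
    # Stand-in for the model call; returns a placeholder string
    return f"<interview for actual={actual}, expected={expected}>"

lookup = build_lookup(fake_generate, per_pair=2)
served = pick_interview(lookup, "regular win", "dominant loss")
```

At serving time only the lookup and the logistic regression model are needed, so the transformer packages do not have to be installed on the server.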

In [None]:
# Pre generate the examples:
generate = False
if generate:
    possible_results = ["dominant loss", "regular loss", "tie", "regular win", "dominant win"]
    rows = []
    for expected in possible_results:
        for actual in possible_results:
            for _ in range(50):
                row = {
                    'actual_result': actual,
                    'expected_result': expected,
                    'interview': ' '
                }
                row['interview'] = generate_from_categorical(cat_model, [format_interview(row)])
                rows.append(row)
    # Build the frame once; DataFrame.append was removed in pandas 2.0
    generated_df = pd.DataFrame(rows, columns = ["actual_result", "expected_result", "interview"])
    generated_df.to_csv('exp9.csv')
else:
    generated_df = pd.read_csv('exp9.csv')

In [149]:
class interview_generator_csv():
    def __init__(self):
        # Pregenerated interviews
        self.df = pd.read_csv('exp9.csv')
        
        # Logistic regression model
        self.log_model = pickle.load(open('model/log1', 'rb'))
        self.x_columns = ['shots_home', 'shots_away', 'passes_home', 'passes_away',
                            'misplaced_passes_home', 'misplaced_passes_away', 'pass_accuracy_home',
                            'pass_accuracy_away', 'distance_home', 'distance_away', 'grade', 'is_home_team']
        
    def get_expected_result(self, row):
        x = [row[a] for a in self.x_columns]

        # Get the win probability
        y = self.log_model.predict_proba([x])[0][2]

        # Return the labels
        if y < 0.2:
            return "dominant loss"
        if y < 0.4:
            return "regular loss"
        if y < 0.6:
            return "tie"
        if y < 0.8:
            return "regular win"
        return "dominant win"

    @staticmethod
    def get_actual_result(row):
        home_goals = int(row['score_home'])
        away_goals = int(row['score_away'])
        diff = home_goals - away_goals
        if diff < -2:
            return "dominant loss"
        if diff < 0:
            return "regular loss"
        if diff == 0:
            return "tie"
        if diff < 3:
            return "regular win"
        return "dominant win"
    
    def generate_interview(self, row):
        actual = interview_generator_csv.get_actual_result(row)
        expected = self.get_expected_result(row)
        # Sample from the pregenerated interviews (self.df) that share both labels
        matches = (self.df['actual_result'] == actual) & (self.df['expected_result'] == expected)
        cur_interviews = self.df[matches]['interview'].values
        return np.random.choice(cur_interviews)

In [151]:
gen = interview_generator_csv()

In [153]:
gen.generate_interview(d)

'We are determined to move our opponents today. We wanted to play football and lure the opponent in order to then shift. That worked zero point zero in the first run. Wolfsburg played very, very good counter-pressing, we have to admit that. The system change gave us access and it was a great second half. In the end we have to score more goals.'

<a id="id5"></a>

# Part 5: Conclusion
[Return to contents](#contents)

Finally, we have a working model that generates interviews from match statistics. The model still generates text that names specific teams or players, which may be wrong for the given match; we think the user or the application using the model can replace these with the teams or players relevant to that match. 
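One possible way to do this substitution in an application is a simple name replacement. The sketch below is only an assumption about how it could be done: the helper `swap_team_names` and the list of known names are hypothetical, and real text would also need handling for player names and nicknames.

```python
import re

def swap_team_names(text, known_teams, replacement):
    """Replace any mention of a known team name with replacement."""
    # Longest names first so "Bayern München" is matched before "Bayern"
    ordered = sorted(known_teams, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    return re.sub(pattern, replacement, text)

text = "Wolfsburg played very, very good counter-pressing, we have to admit that."
fixed = swap_team_names(text, ["Wolfsburg", "Bayern München", "Bayern"], "Hertha BSC")
```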