# Deep Averaging Network

In this notebook, we will use a Deep Averaging Network (DAN) to predict the decade of an NFL game from its broadcast transcript.

In [5]:
import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np
import tensorflow as tf
import keras

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

from keras.layers import Embedding, Input, Dense, Lambda
from keras.models import Model

import tensorflow_datasets as tfds
import tensorflow_text as tf_text


import sklearn as sk

import nltk
from nltk.data import find

import matplotlib.pyplot as plt

import re

import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pandas as pd

In [2]:
!pip install pydot --quiet
!pip install gensim --quiet
!pip install tensorflow-datasets --quiet
!pip install tensorflow-text --quiet

In [6]:
data = pd.read_json("../dataset/tagged_transcripts.json")

### Exploratory Data Analysis

In [7]:
data.head(5)

Unnamed: 0,1962-houston_oilers-dallas_texans.txt,1969-chicago_bears-green_bay_packers.txt,1969-cleveland_browns-minnesota_vikings-1.txt,1969-cleveland_browns-minnesota_vikings.txt,1969-new_york_jets-baltimore_colts.txt,1970-baltimore_colts-kansas_city_chiefs.txt,1970-cleveland_browns-new_york_jets.txt,1970-dallas_cowboys-detroit_lions.txt,1970-kansas_city_chiefs-baltimore_colts.txt,1970-los_angeles_rams-minnesota_vikings-1.txt,...,2018-tampa_bay_buccaneers-dallas_cowboys.txt,2018-tampa_bay_buccaneers-detroit_lions.txt,2018-tennessee_titans-green_bay_packers.txt,2018-tennessee_titans-minnesota_vikings.txt,2018-tennessee_titans-pittsburgh_steelers.txt,2018-tennessee_titans-tampa_bay_buccaneers.txt,2018-washington_redskins-new_england_patriots.txt,2018-washington_redskins-new_york_jets.txt,2018-washington_redskins-philadelphia_eagles-1.txt,2018-washington_redskins-philadelphia_eagles.txt
teams,"[houston_oilers, dallas_texans]","[chicago_bears, green_bay_packers]","[cleveland_browns, minnesota_vikings]","[cleveland_browns, minnesota_vikings]","[new_york_jets, baltimore_colts]","[baltimore_colts, kansas_city_chiefs]","[cleveland_browns, new_york_jets]","[dallas_cowboys, detroit_lions]","[kansas_city_chiefs, baltimore_colts]","[los_angeles_rams, minnesota_vikings]",...,"[tampa_bay_buccaneers, dallas_cowboys]","[tampa_bay_buccaneers, detroit_lions]","[tennessee_titans, green_bay_packers]","[tennessee_titans, minnesota_vikings]","[tennessee_titans, pittsburgh_steelers]","[tennessee_titans, tampa_bay_buccaneers]","[washington_redskins, new_england_patriots]","[washington_redskins, new_york_jets]","[washington_redskins, philadelphia_eagles]","[washington_redskins, philadelphia_eagles]"
transcript,gilson well defend the goal on your left theyl...,cbs television sports presents the national fo...,the nfl today brought to you by the foundation...,the nfl today brought to you by the foundation...,&gt;&gt; nbc sports presents the third nflafl ...,biochemistry was almost an that i doing it cam...,from municipal stadium in cleveland ohio to po...,a long time ago ford motor company had a bette...,from memorial stadium in baltimore maryland na...,from metropolitan stadium in bloomington minne...,...,you welcomes you to the following presentation...,well the rain continues to fall but we have fo...,so the first preseason game a couple weeks at ...,time is running out for some opportunity has c...,heinz field and the new head coach of the tita...,tennessee titans preseason football is brought...,the patriots take the field and football retur...,espn welcomes you to the following presentatio...,and there is nick falls the eagle fans at atte...,skins and eagles theyve been division rivals d...
year,1962,1969,1969,1969,1969,1970,1970,1970,1970,1970,...,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018


In [8]:
nltk.download('word2vec_sample')

word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

wvmodel = KeyedVectors.load_word2vec_format(datapath(word2vec_sample), binary=False)

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /Users/tommayer/nltk_data...
[nltk_data]   Package word2vec_sample is already up-to-date!


In [9]:
len(wvmodel)

43981

Note that `len(wvmodel)` is the size of the pretrained word2vec vocabulary (43,981 words), not the number of games in the dataset.

### Data Preprocessing

Preprocess text:
- lowercase
- remove punctuation (characters are deleted, not replaced with a space)
- split on whitespace into tokens

In [None]:
# Transpose so that each game is a row instead of a column
data_transposed = data.T.reset_index().rename(columns={'index': 'game_id'})

def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = text.split()
    return tokens

# Apply preprocessing to each transcript
data_transposed['tokens'] = data_transposed['transcript'].apply(preprocess_text)
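To make the effect of this step concrete, here is a small standalone sketch of the same function applied to two short strings. The sample strings are made up for illustration (the second mimics the `&gt;&gt;` residue visible in the transcripts); they are not taken from the dataset.

```python
import re

def preprocess_text(text):
    # Lowercase, delete every character that is not a word character or
    # whitespace, then split on whitespace.
    text = re.sub(r'[^\w\s]', '', text.lower())
    return text.split()

# Hypothetical sample strings, for illustration only
tokens = preprocess_text("They'll defend the goal, 3rd & 10!")
entity_tokens = preprocess_text("&gt;&gt; nbc sports")

print(tokens)         # ['theyll', 'defend', 'the', 'goal', '3rd', '10']
print(entity_tokens)  # ['gtgt', 'nbc', 'sports']
```

Because punctuation is deleted rather than replaced with a space, contractions collapse into a single token (`theyll`) and HTML entities leave behind fragments such as `gtgt`. Neither is likely to be in the word2vec vocabulary.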

In [12]:
data_transposed.head()

| | game_id | teams | transcript | year | tokens | doc_embedding |
|---|---|---|---|---|---|---|
| 0 | 1962-houston_oilers-dallas_texans.txt | [houston_oilers, dallas_texans] | gilson well defend the goal on your left theyl... | 1962 | [gilson, well, defend, the, goal, on, your, le... | [0.02728455, 0.016727475, 0.0260244, 0.0380681... |
| 1 | 1969-chicago_bears-green_bay_packers.txt | [chicago_bears, green_bay_packers] | cbs television sports presents the national fo... | 1969 | [cbs, television, sports, presents, the, natio... | [0.030220592, 0.014963325, 0.02284711, 0.03831... |
| 2 | 1969-cleveland_browns-minnesota_vikings-1.txt | [cleveland_browns, minnesota_vikings] | the nfl today brought to you by the foundation... | 1969 | [the, nfl, today, brought, to, you, by, the, f... | [0.027876755, 0.016259313, 0.022658505, 0.0399... |
| 3 | 1969-cleveland_browns-minnesota_vikings.txt | [cleveland_browns, minnesota_vikings] | the nfl today brought to you by the foundation... | 1969 | [the, nfl, today, brought, to, you, by, the, f... | [0.028167814, 0.016339412, 0.022509856, 0.0396... |
| 4 | 1969-new_york_jets-baltimore_colts.txt | [new_york_jets, baltimore_colts] | &gt;&gt; nbc sports presents the third nflafl ... | 1969 | [gtgt, nbc, sports, presents, the, third, nfla... | [0.031091398, 0.015320361, 0.02407883, 0.03909... |


Now we have a column of tokens that we can use to build the game commentary embeddings. Each game is also its own row, which makes the data easier to work with.

Let's get the game commentary embeddings. Models operate on numbers, not text, so we convert each transcript into a vector. To save compute, we reuse a pretrained word2vec model (loaded above as `wvmodel`) instead of training embeddings from scratch: each transcript's embedding is the average of the vectors of its in-vocabulary tokens.

In [11]:
def get_document_embedding(tokens, model):
    # Filter tokens to only those in the model's vocabulary
    valid_tokens = [token for token in tokens if token in model.key_to_index]
    if not valid_tokens:
        return np.zeros(model.vector_size)
    # Average the word vectors
    return np.mean([model[token] for token in valid_tokens], axis=0)

# Apply to each game transcript
data_transposed['doc_embedding'] = data_transposed['tokens'].apply(
    lambda tokens: get_document_embedding(tokens, wvmodel)
)
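The averaging logic is easier to check on a toy example. The sketch below uses a hypothetical three-dimensional dictionary of vectors (`toy_vectors`) in place of the real word2vec model, and a helper named `average_embedding` that mirrors `get_document_embedding`; both names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the word2vec model
toy_vectors = {
    "touchdown": np.array([1.0, 0.0, 2.0]),
    "pass": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, vectors, dim=3):
    # Keep only tokens that have a vector; unknown words are skipped
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words at all: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "gilson" is out of vocabulary, so only two vectors are averaged
doc = average_embedding(["touchdown", "gilson", "pass"], toy_vectors)
empty = average_embedding(["gilson"], toy_vectors)

print(doc)    # [2. 1. 1.]
print(empty)  # [0. 0. 0.]
```

Out-of-vocabulary tokens do not pull the average toward zero; they are simply left out of the mean. Only a transcript with no known words at all ends up as the zero vector.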

Now we can create the embedding matrix. This copies the vectors of the word2vec model we are already using into a NumPy matrix with one row per word, which a model can use as an embedding table. We also build a vocabulary dictionary that maps each word to its row index. Finally, we must not forget to add the unknown token, `[UNK]`, to the vocabulary dictionary; it takes the extra last row of the matrix, which stays all zeros.

In [28]:
EMBEDDING_DIM = len(wvmodel['university'])      # we know... it's 300

# initialize embedding matrix and word-to-id map:
embedding_matrix = np.zeros((len(wvmodel) + 1, EMBEDDING_DIM))
vocab_dict = {}

# build the embedding matrix and the word-to-id map:
for i, word in enumerate(wvmodel.index_to_key):
    # every word in index_to_key has a vector, so no missing-word check is needed
    embedding_matrix[i] = wvmodel[word]
    vocab_dict[word] = i

# we can use the last index at the end of the vocab for unknown tokens
vocab_dict['[UNK]'] = len(vocab_dict)
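As a sketch of how such a vocabulary dictionary is typically used, the snippet below maps tokens to row indices and falls back to the `[UNK]` index for unseen words. The three-entry `toy_vocab` and the helper `tokens_to_ids` are hypothetical stand-ins, not part of the notebook.

```python
# Toy vocabulary; in the notebook this role is played by vocab_dict
toy_vocab = {"the": 0, "goal": 1, "[UNK]": 2}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    # Look up each token, using the unknown-token id when it is missing
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

ids = tokens_to_ids(["the", "gilson", "goal"], toy_vocab)
print(ids)  # [0, 2, 1]
```

Since the `[UNK]` row of the embedding matrix is never filled in, every unknown token would look up an all-zero vector.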

In [18]:
# take a peek at the embedding matrix
embedding_matrix.shape


(43982, 300)

In [20]:
# and take a look at the first embedding vector: the first word in the word2vec vocabulary
embedding_matrix[0]

array([ 0.0891758 ,  0.121832  , -0.0671959 ,  0.0477279 , -0.013659  ,
       -0.0671959 ,  0.0640559 , -0.0331269 , -0.0364239 ,  0.00565199,
       -0.017113  , -0.10362   ,  0.0552639 , -0.00706499, -0.0643699 ,
        0.00753598, -0.0866638 ,  0.0492979 , -0.0816398 , -0.0910598 ,
        0.00416049, -0.0681379 ,  0.0568339 ,  0.0524379 ,  0.00143262,
       -0.01256   , -0.0775578 ,  0.0960838 ,  0.0555779 , -0.0734758 ,
       -0.013659  , -0.0376799 , -0.0489839 , -0.0470999 , -0.102992  ,
        0.00612299,  0.0452159 , -0.0356389 ,  0.0665679 ,  0.0747318 ,
        0.0759878 , -0.0248059 ,  0.013031  , -0.00490624,  0.00733973,
       -0.0351679 ,  0.00639774, -0.00370912,  0.0835238 ,  0.0477279 ,
       -0.0885478 , -0.0929438 ,  0.0634279 ,  0.0741038 ,  0.00561274,
       -0.0192325 ,  0.0803838 ,  0.00580899,  0.0923158 ,  0.0700219 ,
        0.0266899 ,  0.0788138 , -0.0634279 , -0.0470999 ,  0.0835238 ,
       -0.0483559 ,  0.0574619 ,  0.0411339 ,  0.00455299,  0.07

I want to group the data into 'eras' to make the target easier to predict, so I will group the games by decade.

A few things here:
- I'll split on row indices. This way, all columns of a game stay together and the relationships between the columns are maintained.
- I'll stratify by decade to ensure each era is represented proportionally in both sets.
- I'll use 20% of the data for testing.

In [32]:
# make sure the year is an int
data_transposed['year'] = data_transposed['year'].astype(int)
# now group by decade
data_transposed['decade'] = (data_transposed['year'] // 10) * 10
indices_train, indices_test = train_test_split(
    np.arange(len(data_transposed)),
    test_size=0.2,
    stratify=data_transposed['decade']  # Stratify by decade instead
)

# Create train and test DataFrames
train_df = data_transposed.iloc[indices_train].copy()
test_df = data_transposed.iloc[indices_test].copy()

# Extract features and targets
X_train = np.array(train_df['doc_embedding'].tolist())
X_test = np.array(test_df['doc_embedding'].tolist())
y_train = train_df['decade'].values  
y_test = test_df['decade'].values
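The targets here are raw decade values such as 1960 or 2000. If they are later fed to a classifier whose loss expects contiguous class ids (0, 1, 2, ...), one possible mapping is sketched below. This is an assumption about how the labels might be used downstream, and the `years` array is made up for illustration.

```python
import numpy as np

# Hypothetical years, for illustration only
years = np.array([1962, 1969, 1970, 2000, 2018])
decades = (years // 10) * 10  # same arithmetic as in the cell above

# np.unique returns the sorted distinct decades; with return_inverse=True
# it also returns each label's position in that sorted array
classes, class_ids = np.unique(decades, return_inverse=True)

print(decades)    # [1960 1960 1970 2000 2010]
print(classes)    # [1960 1970 2000 2010]
print(class_ids)  # [0 0 1 2 3]
```

`classes[class_ids]` recovers the original decades, so predictions can be mapped back to a readable era.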

In [35]:
X_train[0]

array([ 2.80200057e-02,  1.85463410e-02,  2.20466256e-02,  4.07019816e-02,
       -1.83194838e-02, -2.74180733e-02,  1.18565168e-02, -4.72977795e-02,
        3.47229578e-02,  3.66538689e-02, -1.77515280e-02, -5.40170372e-02,
       -1.39009790e-03,  7.09863938e-03, -5.18067777e-02,  1.09228622e-02,
        2.75296438e-02,  3.47514078e-02,  6.25449792e-03, -2.42791437e-02,
       -1.37908487e-02,  2.85209920e-02, -1.74424052e-03, -9.51399282e-03,
        3.26409712e-02, -2.68419436e-03, -3.48200053e-02,  2.01653540e-02,
        1.49998190e-02,  1.11482479e-02, -1.02505460e-02,  1.28404219e-02,
       -1.61986034e-02, -1.00000612e-02,  1.00300312e-02,  3.64408130e-03,
       -8.78093822e-04, -1.46641312e-02,  1.89352762e-02,  4.04345915e-02,
        4.28580418e-02, -3.64055224e-02,  5.01380898e-02, -1.28889643e-02,
       -6.42580539e-03,  4.69247764e-03, -1.17599135e-02,  3.78651032e-03,
        2.27206238e-02,  1.10230464e-02,  1.21068703e-02,  2.77537610e-02,
       -5.08523453e-03, -

In [37]:
y_test[0]

2000

## Deep Averaging Network
A simple yet powerful baseline model.
The DAN averages the word embeddings of a document to get a single document embedding, then passes that average through a feed-forward neural network. In my model, it averages the embeddings of the game commentary and predicts the decade of the game. It can capture the semantic content of the commentary, with the neural network learning more complex patterns on top of the average. The hidden layers can serve as better feature representations than the raw averages of the vectors. As a baseline model, this DAN should also give me more interpretable results than a transformer.
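The forward pass described above can be sketched in plain NumPy. This is a minimal illustration with randomly initialised weights and made-up sizes (6-word vocabulary, 4-dimensional embeddings, one hidden layer of 5 units, 3 classes), not the model trained in this notebook; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small
VOCAB, EMB_DIM, HIDDEN, N_CLASSES = 6, 4, 5, 3
embeddings = rng.normal(size=(VOCAB, EMB_DIM))
W1, b1 = rng.normal(size=(EMB_DIM, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, N_CLASSES)), np.zeros(N_CLASSES)

def dan_forward(token_ids):
    # 1) Average the embeddings of all tokens in the document
    avg = embeddings[token_ids].mean(axis=0)
    # 2) Pass the average through a hidden layer with a ReLU nonlinearity
    hidden = np.maximum(0.0, avg @ W1 + b1)
    # 3) Project to one score per class and normalise with a softmax
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = dan_forward([0, 3, 3, 5])
shuffled = dan_forward([5, 3, 0, 3])
print(probs.shape)                  # (3,)
print(np.allclose(probs, shuffled)) # True
```

The second call shows the main limitation of averaging: word order is discarded, so any reordering of the same tokens produces the same prediction.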