In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
import codecs, json
import tempfile
import requests
import base64

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Our file is under /kaggle/input. In this case we are going to use an old version of a Kaggle public dataset that has all the English Premiership matches from the 2000-2001 season to March of the 2019-2020 season.

We are going to load this file into a pandas DataFrame and look at the first few rows to see the content.

In [None]:
file_path = "/kaggle/input/epl-results-up-to-march-2020/EPLresults.csv"
my_df = pd.read_csv(file_path)

print('The shape of our dataset is ', my_df.shape)

my_df.head()

Excellent, we have 7,386 rows, each one with 22 columns and we can see a few samples below.

We can see some of the columns are numeric and some are strings, so let's print all the column types to see what we have.

In [None]:
my_df.info()

Now let's plot a few of our data dimensions to get familiar with our dataset.


In [None]:
fig, chart = plt.subplots() 
data = my_df['FTR'].value_counts() 

points = data.index 
frequency = data.values 

chart.bar(points, frequency) 

chart.set_title('Frequency of different results in the English Premiership (2001-2020)')
chart.set_xlabel('Result Type') 
chart.set_ylabel('Frequency')

We can see home team wins are way more prevalent than away team wins or draws. This makes sense as the crowd plays an important role in soccer. OK, what else can we plot?
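
To put numbers on that impression, the share of each result can be computed with `value_counts(normalize=True)`. The sketch below uses a small made-up `FTR` column rather than the real file, so the percentages are illustrative only.

```python
import pandas as pd

# Made-up results for illustration; in the notebook this would be my_df['FTR'].
toy_results = pd.Series(["H", "H", "A", "D", "H", "A", "H", "D", "H", "A"], name="FTR")

# normalize=True returns fractions of the total instead of raw counts.
shares = toy_results.value_counts(normalize=True)

for result, share in shares.items():
    print(f"{result}: {share:.0%}")
```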

How about what team has played the most home games?

In [None]:
# create a figure and axis 
fig, ax = plt.subplots() 

# count the occurrence of each class 
data = my_df['HomeTeam'].value_counts() 
# get x and y data 
points = data.index 
frequency = data.values 
# create bar chart 
ax.bar(points, frequency) 
plt.setp(ax.get_xticklabels(), rotation=90)
# set title and labels 
ax.set_title('Number of home games for all the English Premiership teams, 2001-2020') 
ax.set_xlabel('Home Teams')
ax.set_ylabel('Frequency')

It is kind of hard to see, but the teams with the most home games are Everton, Chelsea, Arsenal, Tottenham, Man United and Liverpool, presumably because they were never relegated during this period. Are they more likely to win the league then? We'll see.

There are other things we could plot but let's say that's it for now. We can move on to the next step.

Now that the file is in a dataframe, we will assign some columns to X and the label to y. In our case the label is the full-time result ("FTR"). Let's make sure we can print a label.

In [None]:
print("For row 149 the teams playing are " + str(my_df["HomeTeam"][149]) + " and " + str(my_df["AwayTeam"][149]) + " and the label is "  + str(my_df["FTR"][149]) + " and the day is " + str(my_df["Date"][149]))


Before assigning the features to X and the label to y, we need to convert strings and text to numbers, since you cannot feed strings to a neural network.

What features to use? Well, a theory is that we need the teams that are playing and their statistics throughout the match: shots, corners and so on. Date? For now we are going to use just the day of the week, so we will start with the following features in X:

* Day of the week when the match was played (e.g. Saturday)
* HomeTeam
* AwayTeam
* HTHG (Half Time Home Goals)
* HTAG (Half Time Away Goals)
* HTR (Half Time Result)
* HS (Home Team Shots)
* AS (Away Team Shots)
* HC (Home Team Corners)
* AC (Away Team Corners)
* HF (Home Team Fouls)
* AF (Away Team Fouls)
* HY (Home Team Yellow Cards)
* AY (Away Team Yellow Cards)
* HR (Home Team Red Cards)
* AR (Away Team Red Cards)

Now we need to convert the object columns to numbers. First we create a copy of the dataframe to work on, so the original stays untouched.

We also drop the referee column, since we are not going to use it, and then we print a few rows.

In [None]:
epl_df_objects = my_df.copy()
epl_df_objects.drop('Referee', axis=1, inplace=True)


epl_df_objects.head()

Now we are going to see if there are any null values

In [None]:
print(epl_df_objects.isnull().values.sum())

Hooray! No null values so we don't have to fix our data. We can move on to the next step: fixing some of the features we want to keep.

Since we only want the day the match was played we convert the date to day of the week in a new column and drop the original date column.

In [None]:
# parse the match date, then extract the day of the week

epl_df_objects["matchDate"] = pd.to_datetime(epl_df_objects["Date"], infer_datetime_format=True)
epl_df_objects['matchDay'] = epl_df_objects['matchDate'].dt.day_name()

print(epl_df_objects["matchDate"][0])
print(epl_df_objects['matchDay'][149])

epl_df_objects.drop('Date', axis=1, inplace=True)
epl_df_objects.drop('matchDate', axis=1, inplace=True)


epl_df_objects.head()
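
Day-of-week extraction depends on the date being parsed the way it was written. As a hedged illustration (the date format used in the CSV is not shown here, and the sample strings below are made up), the same text gives different weekdays depending on whether the day or the month comes first, so passing an explicit `format` is a safer habit than relying on inference:

```python
import pandas as pd

# Made-up date strings that are valid under both day-first and month-first readings.
dates = pd.Series(["11/03/2020", "01/07/2020"])

# Read as day/month/year: 11 March 2020 and 1 July 2020.
day_first = pd.to_datetime(dates, format="%d/%m/%Y").dt.day_name()

# Read as month/day/year: 3 November 2020 and 7 January 2020.
month_first = pd.to_datetime(dates, format="%m/%d/%Y").dt.day_name()

print(day_first.tolist())    # ['Wednesday', 'Wednesday']
print(month_first.tolist())  # ['Tuesday', 'Tuesday']
```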

Now we convert all the object columns to numbers because a neural network does not accept text


In [None]:
epl_df_objects = pd.get_dummies(epl_df_objects, columns=['HomeTeam'], prefix = ['HomeTeam'])
epl_df_objects = pd.get_dummies(epl_df_objects, columns=['AwayTeam'], prefix = ['AwayTeam'])
epl_df_objects = pd.get_dummies(epl_df_objects, columns=['HTR'], prefix = ['HTR'])
epl_df_objects = pd.get_dummies(epl_df_objects, columns=['matchDay'], prefix = ['matchDay'])

epl_df_objects.head()
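
To see what `pd.get_dummies` does on a small scale, here is a sketch with a toy frame (the values are made up). Each distinct team name becomes its own 0/1 column named `<prefix>_<value>`, and the encoded columns are appended after the remaining ones:

```python
import pandas as pd

# Toy frame with one text column and one numeric column (values are made up).
toy = pd.DataFrame({"HomeTeam": ["Arsenal", "Chelsea", "Arsenal"], "HS": [12, 9, 15]})

encoded = pd.get_dummies(toy, columns=["HomeTeam"], prefix=["HomeTeam"])
print(list(encoded.columns))

# Each row has exactly one active HomeTeam_* column.
team_cols = [c for c in encoded.columns if c.startswith("HomeTeam_")]
active_per_row = encoded[team_cols].astype(int).sum(axis=1).tolist()
print(active_per_row)
```

Because the column order is fixed by this step, any row sent to the model later has to use exactly the same order.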

Before assigning features to X and the label to y, we need to convert the label to numeric values. We also assign all the relevant features to an intermediate variable.


In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()


epl_df_objects['FTR']= label_encoder.fit_transform(epl_df_objects['FTR']) 
  
print('Unique values for our label are: ', epl_df_objects['FTR'].unique())
print('if the home team wins the label is ', epl_df_objects['FTR'][0])
print('if the away team wins the label is ', epl_df_objects['FTR'][2])
print('if there is a tie the label is ', epl_df_objects['FTR'][3])

label = epl_df_objects['FTR']
print('the result for the match in row 149 is ', label[149])

print(epl_df_objects.iloc[:,3:113])

features = epl_df_objects.iloc[:,3:113]
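
Rather than reading the label codes off individual rows, the mapping can be taken from the encoder itself. `LabelEncoder` sorts the classes, so for the labels `A`, `D` and `H` the codes come out as 0, 1 and 2. A minimal sketch with toy labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["H", "D", "A", "H"])  # toy labels, same symbols as the FTR column

# classes_ holds the sorted labels; the position of each label is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'A': 0, 'D': 1, 'H': 2}

# inverse_transform turns integer codes back into the original labels.
decoded = [str(label) for label in encoder.inverse_transform([2, 0, 1])]
print(decoded)  # ['H', 'A', 'D']
```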

Now we can create X and y and divide the dataset into a training set and a test set. We will use the test set to check whether we are overfitting.

In [None]:
from sklearn.model_selection import train_test_split

y=np.ravel(label)
X = features


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=False)

print("The shape of X_train is " + str(X_train.shape))
print("The size of y_train is " + str(y_train.shape))
print("The size of X_test set is " + str(X_test.shape))
print("The size of y_test is " + str(y_test.shape))
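
Because `shuffle=False` is used, the split is purely positional: the first rows go to the training set and the last rows to the test set. If the file is in chronological order (an assumption, not something checked above), the model is therefore tested on the most recent matches. A small sketch with ten stand-in rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

matches = np.arange(10)  # stand-in for 10 matches, in file order

train, test = train_test_split(matches, test_size=0.33, shuffle=False)

print(train)  # [0 1 2 3 4 5]
print(test)   # [6 7 8 9]
```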

Next we one-hot encode y_train and y_test, then print the first row of y_train to make sure the encoding worked.

In [None]:
#one hot-encoding y_train and y_test
y_train = tf.keras.utils.to_categorical(y_train, num_classes=3)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=3)

print("The size of y_train is " + str(y_train.shape))
print("The size of y_test is " + str(y_test.shape))

print(y_train[0])
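
One-hot encoding turns each integer label into a row with a single 1 at the position of that label. The same idea can be sketched in plain NumPy (toy labels, and `np.eye` used here as a stand-in for `to_categorical`):

```python
import numpy as np

labels = np.array([2, 0, 1, 2])  # toy integer labels (0 = A, 1 = D, 2 = H)

# Row i of the 3x3 identity matrix is the one-hot vector for class i.
one_hot = np.eye(3, dtype="float32")[labels]
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)  # [2 0 1 2]
```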

We now create our model. We will start with a neural network using TensorFlow and Keras.

In [None]:
model = tf.keras.models.Sequential([
      tf.keras.layers.Dense(330, input_dim=110, activation='relu'),  # 110 input features
      tf.keras.layers.Dense(10, activation='relu'),  # input size is inferred from the previous layer
      tf.keras.layers.Dense(3,activation='softmax')
])

model.summary()

model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
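
As a sanity check on the output of `model.summary()`, the number of trainable parameters of a stack of dense layers can be worked out by hand: each layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes the layer sizes used above (110 inputs, then 330, 10 and 3 units):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [110, 330, 10, 3]  # input features followed by the units of each Dense layer

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [36630, 3310, 33]
print(total)      # 39973
```

Almost all of the parameters sit in the first layer, which is worth keeping in mind given that the training set has only a few thousand rows.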

Our model is now ready to be fitted.

In [None]:
history = model.fit(X_train, y_train, epochs=65)

Before evaluating our model let's plot our loss and accuracy to see how they changed while the model was trained.

In [None]:
#accuracy history
plt.plot(history.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

#loss history
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

Great accuracy, perhaps too good? Could we be overfitting to our training set? Now we can evaluate our model on the test set to see if that is the case.

In [None]:

score = model.evaluate(X_test, y_test, verbose=1)

print("Test Loss:", score[0])
print("Test Accuracy:", score[1])

As we suspected we are overfitting to the training set. 

Now let's try to make a prediction with data from a Premiership match from the current season:

Arsenal at home vs Norwich on Wednesday, 1 July 2020 (07/01/2020)

In [None]:
Xnew = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
print('the shape of our input data is ', Xnew.shape)

# make a prediction
ynew = np.argmax(model.predict(Xnew), axis=-1)
# show the inputs and predicted outputs
print("X = %s " % Xnew)
print("Prediction = %s" % ynew[0])

if ynew[0] == 2:
  print("Home team is going to win")
elif ynew[0] == 0:
  print("Away team is going to win")
else:
  print("It is going to be a draw")
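
Typing 110 zeros and ones by hand is error-prone, because a single misplaced 1 selects a different team or day. One possible alternative is to build the row by column name and let pandas fill in the zeros. In this sketch `make_row` is a hypothetical helper and `feature_columns` is a shortened, made-up column list; in the notebook it would be `list(features.columns)`:

```python
import pandas as pd

# Hypothetical, shortened column list standing in for list(features.columns).
feature_columns = ["HS", "AS", "HomeTeam_Arsenal", "HomeTeam_Chelsea",
                   "AwayTeam_Norwich", "matchDay_Wednesday"]

def make_row(values, columns):
    """Return a one-row DataFrame with every column in `columns`, zero where unspecified."""
    unknown = set(values) - set(columns)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    # reindex puts the columns in training order and fills the missing ones with 0.
    return pd.DataFrame([values]).reindex(columns=columns, fill_value=0)

row = make_row(
    {"HomeTeam_Arsenal": 1, "AwayTeam_Norwich": 1, "matchDay_Wednesday": 1},
    feature_columns,
)
print(row.to_numpy())  # [[0 0 1 0 1 1]]
```

A misspelled column name raises an error instead of silently producing a wrong input.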

Now let's try to serve the model as an API so people can call it and get predictions for the games they want to know about.

We will use TensorFlow Serving to create our API.

First we save our model.

In [None]:
MODEL_DIR = tempfile.gettempdir()

version = 1

export_path = os.path.join(MODEL_DIR, str(version))

if os.path.isdir(export_path):
    print('\nAlready saved a model, cleaning up\n')
    !rm -r {export_path}

model.save(export_path, save_format="tf")

print('\nexport_path = {}'.format(export_path))
!ls -l {export_path}

Now we add the TensorFlow Serving package repository and its signing key to apt, so the model server can be installed.

In [None]:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
!apt update

Now we install the TensorFlow model server.

In [None]:
!apt-get install tensorflow-model-server

Now we can run our api server.

In [None]:
os.environ["MODEL_DIR"] = MODEL_DIR

In [None]:
%%bash --bg 
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=epl_predictions \
  --model_base_path="${MODEL_DIR}" >server.log 2>&1

Now we can create a JSON object to send as a request to the API. We are going to send all the remaining Arsenal games and see what the predictions are, so we can finally figure out if Arsenal is going to reach the Champions League.

In [None]:
entry = np.array([[0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1], 
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0],
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0], 
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0],
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0],
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0],
                  [0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
                ])
                  
print(type(entry))
print(entry.shape)

the_list = entry.tolist()
print(type(the_list))


data = json.dumps({"signature_name": "serving_default", "instances": the_list})
print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))
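
The call to `tolist()` matters because a NumPy array cannot be passed to `json.dumps` directly. The sketch below builds the same kind of payload from a tiny made-up batch (2 instances with 4 features instead of 8 with 110) and reads it back to confirm the structure:

```python
import json
import numpy as np

batch = np.zeros((2, 4), dtype=int)  # toy batch: 2 instances, 4 features each
batch[0, 1] = 1
batch[1, 3] = 1

# Passing the ndarray itself fails, which is why tolist() is needed.
try:
    json.dumps({"instances": batch})
except TypeError as err:
    print("ndarray cannot be serialized directly:", err)

payload = json.dumps({"signature_name": "serving_default", "instances": batch.tolist()})

# Round-trip: one list per instance, one number per feature.
decoded = json.loads(payload)
print(len(decoded["instances"]), len(decoded["instances"][0]))  # 2 4
```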

Now that we have our object, we can send it to the API and receive the predictions from our model.

Since we sent 8 games we should receive back an array of shape 8 x 3. We print the response to be sure, and then print the index of the highest probability in each row to see what our results are.

In [None]:
!pip install -q requests


headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/epl_predictions:predict', data=data, headers=headers)

response = json.loads(json_response.text)
predictions = response['predictions']

print(json_response)
print(json_response.text)
print(response['predictions'])

my_predictions = np.array(predictions)
print("The predictions are: ",np.argmax(my_predictions,axis=1))
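
The indices printed above are easier to read once they are mapped back to result names. This is a minimal sketch using made-up probabilities (not real model output) in the same shape the server returns, with the class order taken from the label encoding used earlier (0 = away win, 1 = draw, 2 = home win):

```python
import numpy as np

# Made-up probabilities, one row per match and one column per class.
# These are NOT real model outputs.
fake_predictions = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

class_names = ["Away win", "Draw", "Home win"]  # index order 0, 1, 2

predicted = fake_predictions.argmax(axis=1)
readable = [class_names[i] for i in predicted]
print(readable)  # ['Home win', 'Away win', 'Draw']

# Count how many matches fall into each class.
counts = np.bincount(predicted, minlength=len(class_names))
print(dict(zip(class_names, counts.tolist())))
```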

OK, we got all our results back. The model says that for the 8 games we sent the home team is going to win 4, the away team is going to win 1 and the rest are draws.

Do we believe it? Kinda. Usually a home team has the advantage of the crowd, but in this Covid-19 era there is no crowd anymore. Our model doesn't know that, though, so that's a good feature to add in a future iteration of the model.

For now let's assume we believe it. If this is the case then we have the following record for Arsenal:

Before playing Norwich (the first of the 8 games we sent to our API) Arsenal had 43 points. 43 + 6 points (2 wins) + 2 points (2 draws) = 51 so, according to our model, the Gunners will end the season with 51 points.
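
The same arithmetic as a small function, using the league's 3 points per win and 1 per draw (`projected_points` is a hypothetical helper name):

```python
POINTS_PER_WIN = 3
POINTS_PER_DRAW = 1

def projected_points(current, wins, draws):
    """Current points plus the points from predicted wins and draws; losses add nothing."""
    return current + wins * POINTS_PER_WIN + draws * POINTS_PER_DRAW

final_points = projected_points(current=43, wins=2, draws=2)
print(final_points)  # 51
```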