# PART 2

#### – Input: 
prepared data
#### – Output:
machine learning model, expected generalisation RMSE
#### – Features:
This system takes the prepared dataframe and builds a machine learning model
for predicting scores. Model selection, feature selection and handling missing data are
important parts of this system. You should evaluate at least 3 fundamentally different
modelling approaches before selecting the final model. We evaluate the performance of the
system by comparing the predicted scores with the known scores on a validation/test data
set. Specifically, the system should be evaluated with the root mean squared error (RMSE)
of predictions.

### imports
imports all the models and libraries that i use aswell as the data that i use

In [1]:
#imports
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import LabelBinarizer
from sklearn import linear_model
from sklearn import tree
from sklearn import svm

from joblib import dump

import numpy as np
import pandas as pd

#gets data from last part of the task
prepared_data = pd.read_csv("prepared_data.csv", encoding = "UTF-8")
prepared_data

Unnamed: 0,ID,Date,Season,Team,Opponent,Venue,Goals,Winner
0,0,2017-04-17,2017,Medkila,Sandviken,Home,1,Draw
1,0,2017-04-17,2017,Sandviken,Medkila,Away,1,Draw
2,1,2017-04-17,2017,Avaldsnes,Vålerenga,Home,2,Home
3,1,2017-04-17,2017,Vålerenga,Avaldsnes,Away,1,Home
4,2,2017-04-17,2017,Grand Bodø,Arna-Bjørnar,Home,2,Draw
...,...,...,...,...,...,...,...,...
791,156,2019-11-16,2019,Klepp,Fart,Away,6,Away
792,158,2019-11-24,2019,IF Fløya,Lyn,Home,0,Away
793,158,2019-11-24,2019,Lyn,IF Fløya,Away,5,Away
794,159,2019-12-01,2019,Lyn,IF Fløya,Home,2,Home


In [2]:
# makes a dictionary which labels the teams from 0 - 14 where 14 is the best team 
def makedict(table):
    dict = {}
    for i in range(len(table)):
        dict[table["Squad"].iloc[i]] = 15 - int(table["Rk"].iloc[i])
    return dict

### adding data
i want to add some data from previous years to make the predictions better, so i do it like this:

In [3]:
#reads the tables
tables2017 = pd.read_html('prosjekt\\2017\\table.xls', encoding="UTF-8")[0]
tables2018 = pd.read_html('prosjekt\\2018\\table.xls', encoding="UTF-8")[0]
tables2019 = pd.read_html('prosjekt\\2019\\table.xls', encoding="UTF-8")[0]

#makes a dict we can follow
dictdict = {
    2018 : makedict(tables2017),
    2019 : makedict(tables2018),
    2020 : makedict(tables2019)
}

#this is really not a good solution, but i dont know how to do this well :/ it basically just adds the data line by line in a bad way
team_last_rk = []
opponent_last_rk = []
for i in range(len(prepared_data)):
    season = prepared_data["Season"].iloc[i]
    team = prepared_data["Team"].iloc[i]
    opponent = prepared_data["Opponent"].iloc[i]
    if season not in dictdict.keys():
        team_last_rk.append(0)
        opponent_last_rk.append(0)
        continue
    if team in dictdict[season].keys():
        team_last_rk.append(dictdict[season][team])
    if opponent in dictdict[season].keys():
        opponent_last_rk.append(dictdict[season][opponent])
    elif team not in dictdict[season].keys() and opponent not in dictdict[season].keys():
        team_last_rk.append(0)
        opponent_last_rk.append(0)
    else:
        team_last_rk.append(0)
        opponent_last_rk.append(0)

# adds the new data to the leftmost side of the dataframe
prepared_data = pd.concat([pd.DataFrame(team_last_rk), prepared_data], axis=1)
prepared_data = pd.concat([pd.DataFrame(opponent_last_rk), prepared_data], axis=1)
prepared_data

Unnamed: 0,0,0.1,ID,Date,Season,Team,Opponent,Venue,Goals,Winner
0,0,0,0,2017-04-17,2017,Medkila,Sandviken,Home,1,Draw
1,0,0,0,2017-04-17,2017,Sandviken,Medkila,Away,1,Draw
2,0,0,1,2017-04-17,2017,Avaldsnes,Vålerenga,Home,2,Home
3,0,0,1,2017-04-17,2017,Vålerenga,Avaldsnes,Away,1,Home
4,0,0,2,2017-04-17,2017,Grand Bodø,Arna-Bjørnar,Home,2,Draw
...,...,...,...,...,...,...,...,...,...,...
791,0,0,156,2019-11-16,2019,Klepp,Fart,Away,6,Away
792,4,4,158,2019-11-24,2019,IF Fløya,Lyn,Home,0,Away
793,0,0,158,2019-11-24,2019,Lyn,IF Fløya,Away,5,Away
794,0,4,159,2019-12-01,2019,Lyn,IF Fløya,Home,2,Home


### transforms data
we need to change the input-data into integers or doubles to make most regression models work.
<li> i split the dataframes into two different ones where X is the input data and y is the expected output</li>
<li>i change home to 1 and away to 0 in Venue</li>
<li>since text isnt easily transformed into numbers i add a new collumn for each team which is eighter 1 or 0 depending on if that team is playing or not. the first 14 numbers are for the home team and the second 14 numbers are for the away team</li>

In [4]:
# splits the data into whats needed to guess and the answer
X = prepared_data.iloc[:, :8]
y = prepared_data["Goals"]

#splits the "team" and "opponent" rows into 14 different rows as it makes it easier for the models to understand
X_train = pd.concat([X, pd.DataFrame(LabelBinarizer().fit_transform(X["Team"]))], axis=1)
X_train = pd.concat([X_train, pd.DataFrame(LabelBinarizer().fit_transform(X["Opponent"]))], axis=1)
X_train = X_train.drop(['Team', "Opponent", "Date"], axis=1)

# changes venue to a 0 or 1 depending on if it is home or away
venue_dict = {"Home" : 1,  "Away" : 0}
X_train["Venue"] = X_train["Venue"].map(venue_dict)

X_train

Unnamed: 0,0,0.1,ID,Season,Venue,0.2,1,2,3,4,...,5,6,7,8,9,10,11,12,13,14
0,0,0,0,2017,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,2017,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,1,2017,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,1,2017,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,2,2017,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
791,0,0,156,2019,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
792,4,4,158,2019,1,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
793,0,0,158,2019,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
794,0,4,159,2019,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### splits the data
i am splitting the data such that i have 70% of the data to train the models, 15% to validate and find the best model and 15% to test

In [5]:
X_train, X_testval, y_train, y_testval = train_test_split(X_train, y, test_size=0.30, shuffle = False)
X_test, X_val, y_test, y_val = train_test_split(X_testval, y_testval, test_size=0.50, shuffle = False)

X_train

Unnamed: 0,0,0.1,ID,Season,Venue,0.2,1,2,3,4,...,5,6,7,8,9,10,11,12,13,14
0,0,0,0,2017,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,2017,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,0,0,1,2017,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,1,2017,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,2,2017,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
552,8,13,14,2019,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
553,13,8,14,2019,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
554,10,4,15,2019,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
555,4,10,15,2019,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


### adding models 
we need to check a lot of different models to find which is the best one to use, i am also putting in a lot of different parameters so that we can check what is going to give the best result

In [6]:
models = {
    "lasso Regression" : (linear_model.Lasso(), {"alpha": [1], "fit_intercept" : [True, False], "precompute": [True, False], "warm_start" : [True, False], "positive" : [True, False]}),
    "Support Vector Machines" : (svm.SVC(random_state=42), {}),
    "BaysianRidge" : (linear_model.BayesianRidge(), {"compute_score": [True, False], "fit_intercept" : [True, False], "n_iter" : [300, 400]}),
    "Decition Tree" : (tree.DecisionTreeClassifier(random_state=42), {"min_samples_leaf" : [2, 3], "min_samples_split" : [2, 3]}),
    "Naive Bayes" : (GaussianNB(), {}),
    "Support Vector Machines" : (svm.SVC(random_state=42), {"C" : [1, 2, 3], "break_ties": [True, False], "probability": [True, False]}),
    "Base line" : (DummyRegressor(strategy = "mean"), {}),
    "Multi-layer Perception" : (MLPRegressor(random_state=42), {"hidden_layer_sizes" : [10, 50], "max_iter" : [1000], "alpha" : [0.0001], "warm_start" : [False]}),
    "K Nearest Neighbors" : (KNeighborsClassifier(), {"n_neighbors" : [5, 10, 20, 40], "p": [1, 2, 3]}),
    "Linear Regression" : (LinearRegression(), {"fit_intercept": [True, False], "copy_X" : [True, False], "positive" : [True, False]}),
    "Elastic Net" : (ElasticNet(), {"alpha" : [0.1, 1, 10], "copy_X" : [True, False], "fit_intercept": [True, False]}),
    "Random Forrest Regressor" : (RandomForestRegressor(random_state=42), {"max_depth": [2, 3], "n_estimators" : [50, 100, 300], }),
    "Decision Tree Regressor" : (DecisionTreeRegressor(random_state=42), {"min_samples_split" : [2, 3], "min_samples_leaf" : [1, 2]}),
    "Random Forrest Classifier" : (RandomForestClassifier(random_state = 42), {}),
}

### using models to make a prediction
we need to use our model to find what is going to give the best result. gridsearch allows us to check the different parameters aswell. i add the data and the name to a list so that we can find what is the best algorythm later

In [7]:
datalist = []

#goes throught the keys
for i in models.keys():
    #does the gridsearch on the models
    model = GridSearchCV(models[i][0], models[i][1])
    #makes a prediction
    prediction = np.round(model.fit(X_train, y_train).predict(X_val))
    #finds out how good that prediction is
    error = np.sqrt(mean_squared_error(y_val, prediction)).round(4)
    datalist.append([i , error])



### finding the best model
i sort by RMSE to find the best possible model. the best model currently is Elastic Net with 1.4634. i am not sure why this is so small of an improvement compared to others who has done the same as me.

In [8]:
model_df = pd.DataFrame(datalist).rename(columns={0: "Model", 1: "RMSE"}).sort_values(by=['RMSE'])
model_df

Unnamed: 0,Model,RMSE
9,Elastic Net,1.4634
0,lasso Regression,1.4663
5,Base line,1.4663
8,Linear Regression,1.4972
6,Multi-layer Perception,1.5303
1,Support Vector Machines,1.5864
10,Random Forrest Regressor,1.5995
2,BaysianRidge,1.6047
7,K Nearest Neighbors,1.7958
12,Random Forrest Classifier,1.8303


### do the same with test data
now we need to do the same thing with the test data to see if the models hold up

In [9]:
X_fin = X_train.append(X_val)
y_fin = y_train.append(y_val)

datalist = []

for i in models.keys():
    model = GridSearchCV(models[i][0], models[i][1])
    prediction = np.round(model.fit(X_fin, y_fin).predict(X_test))
    error = np.sqrt(mean_squared_error(y_test, prediction)).round(4)
    datalist.append([i , error])



In [11]:
model_df = pd.DataFrame(datalist).rename(columns={0: "Model", 1: "RMSE"}).sort_values(by=['RMSE'])
model_df

Unnamed: 0,Model,RMSE
6,Multi-layer Perception,1.3093
8,Linear Regression,1.3093
2,BaysianRidge,1.341
9,Elastic Net,1.341
10,Random Forrest Regressor,1.3473
0,lasso Regression,1.5035
5,Base line,1.5035
1,Support Vector Machines,1.5063
12,Random Forrest Classifier,1.5638
7,K Nearest Neighbors,1.6398


### then i export the best model and its RSME
the result gets far better when we add the data from val, this is probably because it allows the extra data from last years results to kick in since all of 2017 is missing the positions from last year so ill export MLP regressor

In [17]:
model_df.to_csv("best_models_df.csv", index = False, encoding = "UTF-8")
dump(models[model_df["Model"].iloc[0]], "best_model.joblib")

['best_model.joblib']