This cell imports the libraries used for data analysis, machine learning, and visualization: pandas for tabular data handling, NumPy for numerical operations, Matplotlib for plots, scikit-learn (sklearn) for machine learning, and pickle for object serialization. It also creates an S3FileSystem handle from s3fs for access to Amazon S3 storage.

In [14]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression
import pickle

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
from s3fs.core import S3FileSystem

s3_file = S3FileSystem()

This code defines two functions. The first, create_rows_per_team, selects each team's columns with regular expressions, derives a per-team winner label from winningTeam, and returns both teams' rows stacked in a single DataFrame. The second, get_datasets, splits the data 70/30 into training and test sets; with validation_dataset=True, the 30% is split in half again to give an additional validation set.

In [3]:
def create_rows_per_team(df, regex_pattern_100, regex_pattern_200, players_data = False):
    stats_100_df = df.filter(regex=regex_pattern_100, axis=1).copy()
    stats_200_df = df.filter(regex=regex_pattern_200, axis=1).copy()

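    # As written, winningTeam == 0 counts as a win for the _100 team, so the flag is
    # inverted for that team and used unchanged for the _200 team.
    stats_100_df['winner'] = stats_100_df['winningTeam'].apply(lambda x: 1 if x == 0 else 0)
    stats_200_df['winner'] = stats_200_df['winningTeam']
    stats_100_df = stats_100_df.drop(columns=['winningTeam'])
    stats_200_df = stats_200_df.drop(columns=['winningTeam'])
    
    if players_data:
        # Map player slots 6-10 onto 1-5 so both teams share the same column names
        stats_200_df.columns = [col.replace('_10', '_5').replace('_9', '_4').replace('_8', '_3').replace('_7', '_2').replace('_6', '_1') for col in stats_200_df.columns]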
    else:
        stats_100_df.columns = [col.replace('_100','') for col in stats_100_df.columns]
        stats_200_df.columns = [col.replace('_200','') for col in stats_200_df.columns]
    team_performance_data = pd.concat([stats_100_df,stats_200_df])
    return team_performance_data

def get_datasets(df, key_column, validation_dataset=False):
    X = df.drop(columns=[key_column]).values
    y = df[key_column].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
    if validation_dataset:
        X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.50, random_state=40)
        return X_train, X_test, y_train, y_test, X_val, y_val
    return X_train, X_test, y_train, y_test
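
The reshaping done by create_rows_per_team is easier to follow on a toy table. The sketch below replays the players_data=False branch inline on a made-up two-game DataFrame. The column name kills and all values are hypothetical, and the 0/1 encoding of winningTeam is an assumption read off the function's lambda, not something the data files confirm.

```python
import pandas as pd

# Hypothetical two-game table; names and values are invented for illustration.
games = pd.DataFrame({
    "kills_100": [10, 3],
    "kills_200": [4, 12],
    "winningTeam": [0, 1],  # assumed: 0 = the _100 team won, 1 = the _200 team won
})

# Select each team's columns plus the shared winningTeam column.
team_100 = games.filter(regex=r".*(_100|winningTeam)", axis=1).copy()
team_200 = games.filter(regex=r".*(_200|winningTeam)", axis=1).copy()

# Turn the game-level result into a per-team label.
team_100["winner"] = (team_100["winningTeam"] == 0).astype(int)
team_200["winner"] = team_200["winningTeam"]

# Drop the game-level column and strip the team suffix so the columns line up.
team_100 = team_100.drop(columns=["winningTeam"]).rename(columns=lambda c: c.replace("_100", ""))
team_200 = team_200.drop(columns=["winningTeam"]).rename(columns=lambda c: c.replace("_200", ""))

# Two games become four rows: one per team per game.
per_team = pd.concat([team_100, team_200])
print(per_team)
```

Because pd.concat keeps each half's original index, every game index appears twice in the result, once for each team.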

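This code loads the normalized player and team statistics from CSV files, drops every column whose name contains ID or Champion, and reshapes each table into one row per team with create_rows_per_team. The resulting DataFrames are stored in the data_dfs dictionary for the training step. The PCA player statistics are loaded as well, but their entry in data_dfs is commented out.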

In [4]:
regex_pattern = r'.*(ID|Champion).*'

norm_players_stats = pd.read_csv('./norm_players_stats.csv', sep=';')
norm_players_df = norm_players_stats.filter(regex=f'^(?!{regex_pattern}).*$', axis=1)
player_id_columns = norm_players_stats[['platformGameID', 'teamOnlineID_100', 'teamOnlineID_200']]
players_performance = create_rows_per_team(norm_players_df,r'.*(_[1-5](?![0-9])|winningTeam)',r'.*(_6|_7|_8|_9|_10|winningTeam)',True)

norm_teams_stats = pd.read_csv('./norm_teams_stats.csv', sep=';')
norm_teams_df = norm_teams_stats.filter(regex=f'^(?!{regex_pattern}).*$', axis=1)
team_id_columns = norm_teams_stats[['platformGameID', 'teamOnlineID_100', 'teamOnlineID_200']]
teams_performance = create_rows_per_team(norm_teams_df,r'.*(_100|winningTeam)',r'.*(_200|winningTeam)')

pca_players_stats = pd.read_csv('./players_stats_pca.csv', sep=';')
pca_players_df = pca_players_stats.filter(regex=f'^(?!{regex_pattern}).*$', axis=1)

data_dfs = {
    'Normalized player stats':players_performance,
    'Normalized teams stats':teams_performance,
    # 'PCA player stats':pca_players_df
}
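
The column selection above relies on DataFrame.filter keeping every label for which re.search finds a match, so the patterns can be checked with the standard re module alone. The sketch below does that on a handful of hypothetical column names (none of them are taken from the CSV files) to show what each pattern keeps.

```python
import re

regex_pattern = r'.*(ID|Champion).*'

# Hypothetical column names, chosen only to exercise the patterns.
columns = ['platformGameID', 'Champion_1', 'kills_1', 'kills_10', 'gold_100', 'winningTeam']

def keep(pattern, names):
    """Return the names that re.search matches, mirroring DataFrame.filter(regex=...)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]

# Negative lookahead: keep names that do NOT contain ID or Champion.
kept = keep(f'^(?!{regex_pattern}).*$', columns)

# Without the trailing '.*' the pattern can only match an empty name.
kept_without_wildcard = keep(f'^(?!{regex_pattern})$', columns)

# Player slots 1-5: '(?![0-9])' stops '_1' from matching inside '_10' or '_100'.
slots_1_to_5 = keep(r'.*(_[1-5](?![0-9])|winningTeam)', kept)

# Player slots 6-10: there is no such guard here, so '_10' also matches '_100'.
slots_6_to_10 = keep(r'.*(_6|_7|_8|_9|_10|winningTeam)', kept)

print(kept)
print(kept_without_wildcard)
print(slots_1_to_5)
print(slots_6_to_10)
```

If the player table ever carried a column ending in _100 that survived the ID/Champion filter, the second pattern would assign it to the second team; whether such a column exists in the real data is not known from this notebook.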

The build_and_train_mlp function creates and trains a Multi-Layer Perceptron (MLP) classifier using scikit-learn. It takes the training and test data plus an optional layer_sizes tuple that sets the hidden layer sizes (three layers of 8 units by default). The function fits the classifier, predicts on both the training and test data, and prints a confusion matrix and classification report for each. It returns the trained model together with both sets of predictions.

In [5]:
def build_and_train_mlp(X_train,y_train,X_test,y_test,layer_sizes:tuple=(8,8,8)):
    mlp = MLPClassifier(hidden_layer_sizes=layer_sizes, activation='relu', solver='adam', max_iter=500)
    mlp.fit(X_train,y_train)
    predict_train = mlp.predict(X_train)
    predict_test = mlp.predict(X_test)
    
    print('Confusion matrix (training_data)')
    print(confusion_matrix(y_train,predict_train))
    print(classification_report(y_train,predict_train))
    
    print('Confusion matrix (test_data)')
    print(confusion_matrix(y_test,predict_test))
    print(classification_report(y_test,predict_test))
    return mlp,predict_train,predict_test
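
As a rough illustration of how the layer_sizes argument can be varied, the sketch below trains the same kind of classifier with two different hidden layer layouts. It runs on synthetic data from make_classification, not on the match data, and it fixes random_state on the classifier for repeatability, which the function above does not do; both choices are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the match statistics used in this notebook).
X, y = make_classification(n_samples=200, n_features=10, random_state=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=40
)

results = {}
for layer_sizes in [(8, 8, 8), (16,)]:
    # Same settings as build_and_train_mlp, plus a fixed seed (an addition here).
    mlp = MLPClassifier(hidden_layer_sizes=layer_sizes, activation='relu',
                        solver='adam', max_iter=500, random_state=40)
    mlp.fit(X_train, y_train)
    results[layer_sizes] = accuracy_score(y_test, mlp.predict(X_test))

for layer_sizes, accuracy in results.items():
    print(layer_sizes, f"test accuracy = {accuracy:.2f}")
```

Without a fixed random_state the weight initialization differs between runs, so the printed reports can change slightly each time the cell is executed.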

This code iterates over the datasets in data_dfs, splits each one into training and test sets with winner as the target, and trains an MLP classifier for binary classification on it. Each trained model is stored with its dataset name and its training and test predictions in a list called models, so that performance on the different datasets can be compared later.

In [6]:
models = []
for dataset_name,df in data_dfs.items():
    X_train, X_test, y_train, y_test = get_datasets(df, 'winner')
    print(f"Building model to predict result from {dataset_name}")
    mlp,predict_train,predict_test = build_and_train_mlp(X_train,y_train,X_test,y_test)
    model = {
        'model':mlp,
        'dataset':dataset_name,
        'train_prediction':predict_train,
        'test_prediction':predict_test
    }
    models.append(model)

Building model to predict result from Normalized player stats
Confusion matrix (training_data)
[[5458    1]
 [   0 5389]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5459
           1       1.00      1.00      1.00      5389

    accuracy                           1.00     10848
   macro avg       1.00      1.00      1.00     10848
weighted avg       1.00      1.00      1.00     10848

Confusion matrix (test_data)
[[2223   67]
 [  47 2313]]
              precision    recall  f1-score   support

           0       0.98      0.97      0.97      2290
           1       0.97      0.98      0.98      2360

    accuracy                           0.98      4650
   macro avg       0.98      0.98      0.98      4650
weighted avg       0.98      0.98      0.98      4650

Building model to predict result from Normalized teams stats
Confusion matrix (training_data)
[[5336  123]
 [  44 5345]]
              precision    recall  f1-score   s
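
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, and recall from the test-set matrix printed for the normalized player stats, using scikit-learn's layout in which rows are true labels and columns are predicted labels.

```python
# Test-set confusion matrix reported for the normalized player stats.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = [[2223, 67],
      [47, 2313]]

true_0_pred_0, true_0_pred_1 = cm[0]
true_1_pred_0, true_1_pred_1 = cm[1]
total = true_0_pred_0 + true_0_pred_1 + true_1_pred_0 + true_1_pred_1

# Accuracy: correct predictions over all predictions.
accuracy = (true_0_pred_0 + true_1_pred_1) / total

# Precision: of everything predicted as a class, the share that was right.
precision_0 = true_0_pred_0 / (true_0_pred_0 + true_1_pred_0)
precision_1 = true_1_pred_1 / (true_1_pred_1 + true_0_pred_1)

# Recall: of everything truly in a class, the share that was found.
recall_0 = true_0_pred_0 / (true_0_pred_0 + true_0_pred_1)
recall_1 = true_1_pred_1 / (true_1_pred_1 + true_1_pred_0)

print(f"accuracy    {accuracy:.2f}")
print(f"class 0     precision {precision_0:.2f}  recall {recall_0:.2f}")
print(f"class 1     precision {precision_1:.2f}  recall {recall_1:.2f}")
```

The rounded values agree with the report: 0.98 accuracy, 0.98 precision and 0.97 recall for class 0, and 0.97 precision and 0.98 recall for class 1.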

This code saves each trained MLP classifier to a file with a specific naming convention based on the dataset name. This allows you to store and later load these models for future use or analysis.

In [None]:
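for model in models:
    trained_model = model['model']
    filename = "team_independent_tm_" + model['dataset'].replace(' ','_').lower() + '.sav'
    # Use a context manager so the file is flushed and closed after writing
    with open(filename, 'wb') as model_file:
        pickle.dump(trained_model, model_file)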
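
Loading a saved model back is the mirror image of saving it. The sketch below shows the file name the naming convention produces for one dataset and a full save and load round trip. To stay self-contained it pickles a plain dictionary as a stand-in for the fitted classifier and writes to a temporary directory; both are assumptions made for illustration, and a fitted MLPClassifier would be passed to pickle.dump and returned by pickle.load in the same way.

```python
import os
import pickle
import tempfile

dataset_name = 'Normalized player stats'
filename = "team_independent_tm_" + dataset_name.replace(' ', '_').lower() + '.sav'

# Stand-in object; in the notebook this would be the trained MLPClassifier.
stand_in_model = {'hidden_layer_sizes': (8, 8, 8)}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, filename)

    # Save: binary write mode, closed automatically by the context manager.
    with open(path, 'wb') as model_file:
        pickle.dump(stand_in_model, model_file)

    # Load: binary read mode returns an equivalent object.
    with open(path, 'rb') as model_file:
        restored = pickle.load(model_file)

print(filename)
print(restored == stand_in_model)
```

pickle.load can execute arbitrary code while reading a file, so it should only be used on .sav files from a trusted source.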