# SPI Scores Data Set

I'm using a dataset sourced from https://fivethirtyeight.com/. 
It contains match data for professional soccer leagues across the globe, including the SPI score for each team involved in each match.

You can learn more about the SPI (Soccer Power Index) score on the FiveThirtyEight site. Put simply, the SPI score is the relative strength of a team at a given time, based on goals expected during a match and the team's overall financial position.

There are a lot of records in this dataset, and there are more columns than I require for my purposes, so there will be a bit of work to get the dataset organized the way I need it.

Let's get started!

## STEP ONE: Looking at the dataset and selecting columns

The summary below shows the range of values in each column: quartiles, summary statistics such as mean and standard deviation, and the number of null values.

Shape of the data set:
45494 records across 23 columns
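
For readers without `quickda` installed, roughly the same overview can be pieced together with plain pandas. This is a minimal sketch on a tiny made-up frame (the `toy` data is hypothetical, not rows from the real file), and it makes no claim about how `explore` works internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real match data.
toy = pd.DataFrame({
    "league": ["A", "A", "B", "B"],
    "spi1": [50.0, 60.0, 70.0, 80.0],
    "xg1": [1.2, np.nan, np.nan, 0.8],
})

# shape is (records, columns)
n_rows, n_cols = toy.shape

# One row per column: dtype, non-null count, null count, null share, distinct values.
overview = pd.DataFrame({
    "dtypes": toy.dtypes.astype(str),
    "count": toy.count(),
    "null_sum": toy.isna().sum(),
    "null_pct": toy.isna().mean().round(3),
    "nunique": toy.nunique(),
})

# Quartiles, mean, and std for the numeric columns only.
numeric_stats = toy.describe().T

print(n_rows, n_cols)
print(overview)
print(numeric_stats)
```

On the real data, swapping `toy` for `df` should give the record and column counts quoted above.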

I break the columns into two categories: Season/Team columns and Match-Specific columns.

Season/Team columns:
 - `team1`
 - `team2`
 - `season`
 - `league`
 - `spi1`
 - `spi2`

All other columns are match-specific. They hold game-level details that, while interesting when examining SPI scores, serve no purpose in this analysis, which generalizes data across seasons. The exceptions are `date` and `league_id`, which I keep as identifiers.

In [5]:
import pandas as pd
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_categoric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *
path = 'https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv'

df = pd.read_csv(path)
explore(df)

| column | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| adj_score1 | float64 | 23101 | 22393 | 0.492 | 480 | 0.0 | 1.05 | 1.05 | 2.1 | 9.15 | 1.523371 | 1.05 | 1.239958 | 0.85298 |
| adj_score2 | float64 | 23101 | 22393 | 0.492 | 396 | 0.0 | 0.0 | 1.05 | 2.1 | 11.05 | 1.201216 | 1.05 | 1.128121 | 0.984744 |
| date | object | 45494 | 0 | 0.0 | 1731 | 2016-07-09 | - | - | - | 2021-12-12 | - | - | - | - |
| importance1 | float64 | 38236 | 7258 | 0.16 | 998 | 0.0 | 10.9 | 26.3 | 46.2 | 100.0 | 31.781619 | 26.3 | 26.529099 | 0.899369 |
| importance2 | float64 | 38236 | 7258 | 0.16 | 999 | 0.0 | 10.4 | 25.4 | 45.5 | 100.0 | 31.095609 | 25.4 | 26.245261 | 0.927435 |
| league | object | 45494 | 0 | 0.0 | 39 | Argentina Primera Division | - | - | - | United Soccer League | - | - | - | - |
| league_id | int64 | 45494 | 0 | 0.0 | 39 | 1818 | 1854.0 | 1874.0 | 2160.0 | 9541 | 2176.866906 | 1874.0 | 893.339007 | 4.756768 |
| nsxg1 | float64 | 23101 | 22393 | 0.492 | 452 | 0.0 | 0.94 | 1.3 | 1.73 | 6.89 | 1.393037 | 1.3 | 0.651986 | 1.135472 |
| nsxg2 | float64 | 23101 | 22393 | 0.492 | 395 | 0.0 | 0.73 | 1.05 | 1.44 | 7.17 | 1.140026 | 1.05 | 0.578978 | 1.145036 |
| prob1 | float64 | 45494 | 0 | 0.0 | 7754 | 0.0271 | 0.344925 | 0.4374 | 0.536775 | 0.9775 | 0.44621 | 0.4374 | 0.158988 | 0.346192 |


***I'll drop the match-specific columns and move on with my work on this dataset.***

In [7]:
cols_to_drop = ['importance1', 'importance2', 'score1', 'score2', 
                'xg1', 'xg2', 'nsxg1', 'nsxg2', 'adj_score1', 'adj_score2',
                'prob1', 'prob2', 'probtie', 'proj_score1', 'proj_score2']

df.drop(columns=cols_to_drop, inplace=True)
df.columns

Index(['season', 'date', 'league_id', 'league', 'team1', 'team2', 'spi1',
       'spi2'],
      dtype='object')
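
An alternative to listing the columns to drop is listing the columns to keep, which stays correct even if the source file gains new columns later. The sketch below uses a made-up one-row frame (`toy` and its values are hypothetical) and is only one way to do this, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical one-row frame mixing wanted and unwanted columns.
toy = pd.DataFrame({
    "season": [2016], "date": ["2016-08-12"], "league_id": [1],
    "league": ["League A"], "team1": ["Team X"], "team2": ["Team Y"],
    "spi1": [60.0], "spi2": [52.0],
    "prob1": [0.45], "score1": [2.0],
})

cols_to_keep = ["season", "date", "league_id", "league",
                "team1", "team2", "spi1", "spi2"]

# Fail early with a clear message if an expected column is absent.
missing = [c for c in cols_to_keep if c not in toy.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# Selecting with a list returns a new frame; .copy() makes that explicit.
trimmed = toy[cols_to_keep].copy()
print(trimmed.columns.tolist())
```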

## STEP TWO: Subsetting for specific leagues

There are 39 different leagues represented in this dataset, and I require only nine of them for my analysis. I'll subset the dataset for each of my leagues.

In [11]:
df.league.value_counts()

English League Championship                 2228
Major League Soccer                         2003
United Soccer League                        1993
French Ligue 1                              1900
Barclays Premier League                     1900
Italy Serie A                               1900
Brasileiro Série A                          1900
Spanish Primera Division                    1900
Spanish Segunda Division                    1871
Italy Serie B                               1602
English League Two                          1559
German Bundesliga                           1530
French Ligue 2                              1520
English League One                          1519
Turkish Turkcell Super Lig                  1338
German 2. Bundesliga                        1224
Portuguese Liga                             1224
Dutch Eredivisie                            1224
Norwegian Tippeligaen                       1200
Swedish Allsvenskan                         1200
Japanese J League   

In [14]:
leagues = ['English League Championship', 'French Ligue 1', 'Barclays Premier League', 
           'Italy Serie A', 'German Bundesliga', 'Portuguese Liga', 
           'Dutch Eredivisie', 'Russian Premier Liga', 'Spanish Primera Division']

# Keep only the matches played in the nine leagues of interest.
# .copy() gives an independent dataframe, so later edits to `spi`
# do not trigger chained-assignment warnings.
spi = df[df['league'].isin(leagues)].copy()

explore(spi)

| column | dtypes | count | null_sum | null_pct | nunique | min | 25% | 50% | 75% | max | mean | median | std | skew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | object | 14761 | 0 | 0.0 | 1099 | 2016-08-12 | - | - | - | 2021-05-29 | - | - | - | - |
| league | object | 14761 | 0 | 0.0 | 9 | Barclays Premier League | - | - | - | Spanish Primera Division | - | - | - | - |
| league_id | int64 | 14761 | 0 | 0.0 | 9 | 1843 | 1849.0 | 1864.0 | 2411.0 | 2412 | 2010.69223 | 1864.0 | 249.911252 | 0.978132 |
| season | int64 | 14761 | 0 | 0.0 | 5 | 2016 | 2017.0 | 2018.0 | 2019.0 | 2020 | 2018.190434 | 2018.0 | 1.331495 | -0.109038 |
| spi1 | float64 | 14761 | 0 | 0.0 | 5559 | 15.93 | 48.47 | 59.91 | 70.69 | 96.57 | 59.887348 | 59.91 | 15.226667 | 0.022839 |
| spi2 | float64 | 14761 | 0 | 0.0 | 5562 | 15.9 | 48.24 | 59.87 | 70.63 | 96.69 | 59.79421 | 59.87 | 15.225719 | 0.027533 |
| team1 | object | 14761 | 0 | 0.0 | 228 | 1. FC Nürnberg | - | - | - | Zenit St Petersburg | - | - | - | - |
| team2 | object | 14761 | 0 | 0.0 | 228 | 1. FC Nürnberg | - | - | - | Zenit St Petersburg | - | - | - | - |


In [16]:
spi.league.value_counts()

English League Championship    2228
Spanish Primera Division       1900
French Ligue 1                 1900
Italy Serie A                  1900
Barclays Premier League        1900
German Bundesliga              1530
Dutch Eredivisie               1224
Portuguese Liga                1224
Russian Premier Liga            955
Name: league, dtype: int64
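
A misspelled league name would silently remove that league from the subset, so it can be worth confirming that every requested name actually matched. A minimal sketch of such a check, using made-up league names (`toy` and `wanted` are hypothetical):

```python
import pandas as pd

# Hypothetical league names; "League C" does not occur in the data.
wanted = ["League A", "League B", "League C"]
toy = pd.DataFrame({"league": ["League A", "League B", "League A", "League D"]})

# Boolean mask: True where the row's league is one of the wanted names.
subset = toy[toy["league"].isin(wanted)]

# Requested names that matched no rows at all (typos, renamed leagues, ...).
unmatched = sorted(set(wanted) - set(subset["league"]))
print(unmatched)
```

An empty `unmatched` list means every requested league is present, which is what the nine-row `value_counts()` output above indicates for the real data.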

In [20]:
spi.to_csv('spi_scores.csv', index=False)
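
Since `date` is stored as plain text (dtype `object`), it may help to parse it when the file is read back in. The sketch below round-trips a made-up frame through an in-memory buffer rather than a real file; `toy` and its values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical rows with the date held as text, as in the real dataset.
toy = pd.DataFrame({
    "season": [2016, 2017],
    "date": ["2016-08-12", "2017-08-11"],
    "spi1": [50.0, 60.5],
})

# Write to an in-memory buffer instead of disk, then rewind it.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the named column to datetime64 on load.
reloaded = pd.read_csv(buffer, parse_dates=["date"])
print(reloaded.dtypes)
```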

## What's happened so far:

 - Started with a dataset of match-level data covering 39 leagues, dating back to 2016
 - Dropped the match-specific columns, which serve no purpose in my analysis and several of which contain high percentages of null values
 - Subset the dataframe into a new dataframe `spi` that contains only the nine leagues I am using in my analysis

### NEXT STEPS:
 - Bring together this dataframe with my transfers dataframe
 - Clean up any mess created with the concat of `spi` with `transfers`
 - Create features!!
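
As a rough preview of the first two steps, one option is to reshape the match rows into one row per team per season and then join on team and season. Everything about the transfers side here is an assumption: the `transfers_toy` frame, its `team` and `net_spend` columns, and the choice of a left merge are hypothetical stand-ins, since the real `transfers` dataframe is not shown.

```python
import pandas as pd

# Hypothetical match rows in the same layout as `spi`.
spi_toy = pd.DataFrame({
    "season": [2016, 2016, 2017],
    "league": ["League A"] * 3,
    "team1": ["Team X", "Team Y", "Team X"],
    "team2": ["Team Y", "Team X", "Team Y"],
    "spi1": [60.0, 50.0, 64.0],
    "spi2": [52.0, 62.0, 48.0],
})

# Each match holds two teams; stack both sides into one long table.
home = spi_toy[["season", "league", "team1", "spi1"]].rename(
    columns={"team1": "team", "spi1": "spi"})
away = spi_toy[["season", "league", "team2", "spi2"]].rename(
    columns={"team2": "team", "spi2": "spi"})
long_form = pd.concat([home, away], ignore_index=True)

# Average SPI per team per season.
team_season = (
    long_form.groupby(["season", "league", "team"], as_index=False)["spi"]
    .mean()
    .rename(columns={"spi": "mean_spi"})
)

# Hypothetical transfers data: one row per team per season.
transfers_toy = pd.DataFrame({
    "season": [2016, 2016, 2017],
    "team": ["Team X", "Team Y", "Team X"],
    "net_spend": [10.0, -5.0, 3.0],
})

# Left merge keeps every team-season from the SPI side;
# indicator=True adds a _merge column flagging rows with no transfers match.
merged = team_season.merge(
    transfers_toy, on=["season", "team"], how="left",
    validate="one_to_one", indicator=True,
)
print(merged)
```

Rows marked `left_only` in `_merge` are the ones that would need cleanup, for example because team names are spelled differently in the two sources.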