# Parsing an MLS Roster Page

![la galaxy 2018 roster mlssoccer com](https://user-images.githubusercontent.com/872296/41433129-5871915c-6fee-11e8-9ce3-32f5dab9764b.png)



In [1]:
import pandas as pd
import numpy as np

##### Step 1: Reading the right table

In [2]:
page = pd.read_html(
    'https://www.mlssoccer.com/rosters/2018/la-galaxy',
    attrs={'class': "activethirty"})

In [3]:
len(page)

1

In [4]:
df = page[0]

A preview of the parsed table:

In [5]:
df

Unnamed: 0,30-man Active Roster (Spots 1-30),#,POS,ROSTER STATUSR.S.,PLAYER CATEGORYCAT.,NOTE
0,"Alessandrini, Romain",7.0,M,SeniorSR,"DP, INTL",
1,"Arellano, Hugo",21.0,D,ReserveRES,HG,
2,"Bingham, David",1.0,GK,SeniorSR,,
3,"Boateng, Emmanuel",24.0,M,SeniorSR,,
4,"Carrasco, Servando",14.0,M,SupplementalSUP,,
5,"Ciani, Michael",28.0,D,SeniorSR,INTL,
6,"Cole, Ashley",3.0,D,SeniorSR,INTL,
7,"dos Santos, Giovani",10.0,F,SeniorSR,"DP, INTL",
8,"dos Santos, Jonathan",8.0,M,SeniorSR,"DP, INTL",
9,"Feltscher, Rolf",25.0,D,"SeniorSR, Disabled ListDL",INTL,Disabled ListDL


The issues we can quickly spot are:

1. The index should probably be the player's name.
2. Roster Status values have their abbreviation appended to the full name (SR, SUP, etc.), for example `SeniorSR`.
3. There are all-NaN rows, followed by a trailing note that is not part of the roster: _"26 of 30 spots filled"_.
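For the first issue, one possible approach is `set_index`. The sketch below is an assumption about how it could be done, not a step this notebook performs; it uses a three-row toy frame built from rows shown above, and `raw` and `indexed` are illustrative names.

```python
import pandas as pd

# Toy frame reusing the raw header and a few rows from the preview above.
raw = pd.DataFrame({
    '30-man Active Roster (Spots 1-30)': [
        'Alessandrini, Romain', 'Arellano, Hugo', 'Bingham, David'],
    '#': [7.0, 21.0, 1.0],
    'POS': ['M', 'D', 'GK'],
})

# Move the name column into the index and give the index a shorter name.
indexed = (
    raw.set_index('30-man Active Roster (Spots 1-30)')
       .rename_axis('Player')
)

# Rows can now be looked up by player name.
print(indexed.loc['Bingham, David', 'POS'])
```

This only works cleanly once the non-player rows are gone, since those rows would otherwise become index labels too.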

##### Step 2: Renaming columns to more convenient names

In [6]:
df.columns = ['Player', 'Number', 'Position', 'Roster Status', 'Player Category', 'Notes']
df.head()

Unnamed: 0,Player,Number,Position,Roster Status,Player Category,Notes
0,"Alessandrini, Romain",7.0,M,SeniorSR,"DP, INTL",
1,"Arellano, Hugo",21.0,D,ReserveRES,HG,
2,"Bingham, David",1.0,GK,SeniorSR,,
3,"Boateng, Emmanuel",24.0,M,SeniorSR,,
4,"Carrasco, Servando",14.0,M,SupplementalSUP,,
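Assigning a list to `df.columns` relies on the columns arriving in a fixed order. As a hedged alternative, the names can be mapped explicitly with `DataFrame.rename`. The sketch below uses a toy frame with the raw headers from the preview; `raw`, `column_names`, and `renamed` are illustrative names.

```python
import pandas as pd

# Toy frame with the raw headers produced by the parse above.
raw = pd.DataFrame({
    '30-man Active Roster (Spots 1-30)': ['Alessandrini, Romain', 'Arellano, Hugo'],
    '#': [7.0, 21.0],
    'POS': ['M', 'D'],
    'ROSTER STATUSR.S.': ['SeniorSR', 'ReserveRES'],
    'PLAYER CATEGORYCAT.': ['DP, INTL', 'HG'],
    'NOTE': [None, None],
})

# Explicit old-name -> new-name mapping; column order no longer matters.
column_names = {
    '30-man Active Roster (Spots 1-30)': 'Player',
    '#': 'Number',
    'POS': 'Position',
    'ROSTER STATUSR.S.': 'Roster Status',
    'PLAYER CATEGORYCAT.': 'Player Category',
    'NOTE': 'Notes',
}
renamed = raw.rename(columns=column_names)
print(list(renamed.columns))
```

The trade-off is that the mapping breaks silently if the site changes a header: `rename` ignores keys it does not find, so it is worth checking the resulting column list.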


##### Step 3: Removing NaNs and non-player rows

In [7]:
df.dropna(how='all', inplace=True)
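The choice of `how='all'` matters here: most players have an empty `Notes` value, so the default `how='any'` would discard them as well. A minimal sketch on a toy frame (the frame and variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Row 1 is entirely empty; rows 0 and 2 are real players with no note.
toy = pd.DataFrame({
    'Player': ['Bingham, David', np.nan, 'Cole, Ashley'],
    'Number': [1.0, np.nan, 3.0],
    'Notes': [np.nan, np.nan, np.nan],
})

# how='all' removes only rows where every column is missing.
only_empty_removed = toy.dropna(how='all')

# how='any' removes every row that has at least one missing value.
any_missing_removed = toy.dropna(how='any')

print(len(only_empty_removed), len(any_missing_removed))
```

Note that `dropna` keeps the original index labels, so the labels of the surviving rows are no longer consecutive.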

In [8]:
df

Unnamed: 0,Player,Number,Position,Roster Status,Player Category,Notes
0,"Alessandrini, Romain",7.0,M,SeniorSR,"DP, INTL",
1,"Arellano, Hugo",21.0,D,ReserveRES,HG,
2,"Bingham, David",1.0,GK,SeniorSR,,
3,"Boateng, Emmanuel",24.0,M,SeniorSR,,
4,"Carrasco, Servando",14.0,M,SupplementalSUP,,
5,"Ciani, Michael",28.0,D,SeniorSR,INTL,
6,"Cole, Ashley",3.0,D,SeniorSR,INTL,
7,"dos Santos, Giovani",10.0,F,SeniorSR,"DP, INTL",
8,"dos Santos, Jonathan",8.0,M,SeniorSR,"DP, INTL",
9,"Feltscher, Rolf",25.0,D,"SeniorSR, Disabled ListDL",INTL,Disabled ListDL


In [9]:
df.loc[df['Player'].str.contains('spots filled')]

Unnamed: 0,Player,Number,Position,Roster Status,Player Category,Notes
30,26 of 30 spots filled,,,,,


In [10]:
df.index.get_indexer_for(df.loc[df['Player'].str.contains('spots filled')].index)

array([26])

In [11]:
index_col = df.index.get_indexer_for(df.loc[df['Player'].str.contains('spots filled')].index)[0]
index_col

26
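The footer row has index label 30 but integer position 26, because `dropna` removed rows without renumbering the index. `get_indexer_for` translates labels into positions. A small sketch of the distinction on a toy frame with gaps in its labels (names are illustrative):

```python
import pandas as pd

# Toy frame whose labels have a gap, as they do after dropna() removes rows.
toy = pd.DataFrame(
    {'Player': ['Alessandrini, Romain', 'Arellano, Hugo', '26 of 30 spots filled']},
    index=[0, 1, 30],
)

mask = toy['Player'].str.contains('spots filled')

# The index label of the footer row, usable with .loc
label = toy.index[mask.to_numpy()][0]

# The integer position of that label, usable with .iloc
position = toy.index.get_indexer_for([label])[0]

print(label, position)
```

Using the position with `.iloc` is what makes the "everything from here to the end" slice in the next cell work regardless of the labels.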

In [12]:
df.iloc[index_col:]

Unnamed: 0,Player,Number,Position,Roster Status,Player Category,Notes
30,26 of 30 spots filled,,,,,
31,Players that do not count against the 30-man a...,,,,,
32,"Alvarez, Efrain",,,"ReserveRES, On loanOL",Loaned to: USLOL: USL,


In [13]:
df.drop(df.iloc[index_col:].index)

Unnamed: 0,Player,Number,Position,Roster Status,Player Category,Notes
0,"Alessandrini, Romain",7.0,M,SeniorSR,"DP, INTL",
1,"Arellano, Hugo",21.0,D,ReserveRES,HG,
2,"Bingham, David",1.0,GK,SeniorSR,,
3,"Boateng, Emmanuel",24.0,M,SeniorSR,,
4,"Carrasco, Servando",14.0,M,SupplementalSUP,,
5,"Ciani, Michael",28.0,D,SeniorSR,INTL,
6,"Cole, Ashley",3.0,D,SeniorSR,INTL,
7,"dos Santos, Giovani",10.0,F,SeniorSR,"DP, INTL",
8,"dos Santos, Jonathan",8.0,M,SeniorSR,"DP, INTL",
9,"Feltscher, Rolf",25.0,D,"SeniorSR, Disabled ListDL",INTL,Disabled ListDL


In [14]:
df.drop(df.iloc[index_col:].index, inplace=True)
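The cells above can be condensed. The following is a sketch of one alternative, not the notebook's method: build a boolean mask for the footer, find the first match, and keep everything before it. It assumes the footer text contains `'spots filled'` and that every row after it is unwanted; the toy frame and variable names are illustrative.

```python
import pandas as pd

# Toy frame: two players followed by the footer and the rows after it.
toy = pd.DataFrame(
    {'Player': ['Alessandrini, Romain', 'Arellano, Hugo',
                '26 of 30 spots filled',
                'Players that do not count against the 30-man a...',
                'Alvarez, Efrain']},
    index=[0, 1, 30, 31, 32],
)

# na=False treats missing names as "not a footer" instead of propagating NaN.
is_footer = toy['Player'].str.contains('spots filled', na=False)

if is_footer.any():
    # argmax on a boolean array gives the position of the first True.
    first_footer_pos = int(is_footer.to_numpy().argmax())
    players_only = toy.iloc[:first_footer_pos]
else:
    # No footer found: keep the frame unchanged.
    players_only = toy

print(players_only)
```

The `is_footer.any()` guard matters because `argmax` returns 0 when nothing matches, which would otherwise slice away every row.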

##### Step 4: Cleaning Roster Status

In [15]:
df['Roster Status'].unique()

array(['SeniorSR', 'ReserveRES', 'SupplementalSUP',
       'SeniorSR, Disabled ListDL'], dtype=object)

In [16]:
df['Roster Status'].str.split(',', expand=True)

Unnamed: 0,0,1
0,SeniorSR,
1,ReserveRES,
2,SeniorSR,
3,SeniorSR,
4,SupplementalSUP,
5,SeniorSR,
6,SeniorSR,
7,SeniorSR,
8,SeniorSR,
9,SeniorSR,Disabled ListDL
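After the split, each piece still carries its abbreviation (`SeniorSR`, `Disabled ListDL`). One possible way to finish the cleanup, shown as a sketch under the assumption that the abbreviation is always a run of two or more capital letters at the end of the string:

```python
import pandas as pd

# The unique Roster Status values seen above.
status = pd.Series([
    'SeniorSR', 'ReserveRES', 'SupplementalSUP', 'SeniorSR, Disabled ListDL',
])

parts = status.str.split(',', expand=True)

# Strip the space left by the split, then drop the trailing capital-letter run.
cleaned = parts.apply(
    lambda col: col.str.strip().str.replace(r'[A-Z]{2,}$', '', regex=True)
)

print(cleaned)
```

The pattern would also strip a value that consists only of capitals, so it is worth checking `unique()` again after applying it to the full column.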
