# NFL Salary Scrape


## Write and Read from a file

Instead of requesting the web page every time, I saved everything to a file and worked from that.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
base_url = "http://www.spotrac.com/nfl/new-york-jets/cap/"

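def get_page(base_url):
    # Download the page and save the parsed HTML to disk.
    page = urlopen(base_url)
    # Name the parser explicitly, matching the parsing cell further down.
    soup = BeautifulSoup(page, "html.parser")
    # The with block closes the file even if the write fails.
    with open("sample_page.txt", 'w') as file:
        file.write(str(soup))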

In [3]:
# Sanity check: confirm the saved file can be read back line by line.
with open("sample_page.txt", 'rb') as file:
    for line in file:
        line = line.strip()
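
The save-once, read-many idea above can be sketched as a small helper that only fetches when no saved copy exists. This is a minimal sketch, not what the notebook actually runs: `load_page` and its `fetch` argument are hypothetical names, and `fetch` is assumed to be any zero-argument function that returns the page as bytes.

```python
from pathlib import Path


def load_page(path, fetch):
    """Return the saved page bytes, calling fetch() only when no copy exists."""
    cache = Path(path)
    if not cache.exists():
        # First run: hit the network (or whatever fetch does) and save the result.
        cache.write_bytes(fetch())
    # Every later run reads the local copy instead of the web page.
    return cache.read_bytes()


# Hypothetical usage with the names from the cell above:
# html = load_page("sample_page.txt", lambda: urlopen(base_url).read())
```

Saving the raw bytes also means the file can be handed to BeautifulSoup in binary form without any decoding step in between.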

## Parse the HTML
Import BeautifulSoup and use its find method to get the table with class "datatable".
Note that I opened sample_page.txt in binary mode ("rb"). Opening it in text mode caused the parser to fail badly.

In [4]:
from bs4 import BeautifulSoup
# Open in binary mode and close the file once it has been parsed.
with open("sample_page.txt", 'rb') as page:
    soup = BeautifulSoup(page, "html.parser")
table = str(soup.find("table", "datatable"))



## Render the table
This is an exact replica of the HTML table on the page.

In [5]:
from IPython.core.display import HTML
HTML(table)


| Active Players (53) | Pos. | Base Salary | Signing Bonus | Roster Bonus | Option Bonus | Workout Bonus | Restruc. Bonus | Misc. | Dead Cap | Cap Hit | Cap % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Muhammad Wilkerson | DE | $14,750,000 | $3,000,000 | - | - | $250,000 | - | - | ($27,000,000) | $18,000,000 | 12.17 |
| Buster Skrine | CB | $6,000,000 | $1,250,000 | - | - | - | $1,250,000 | - | ($11,000,000) | $8,500,000 | 5.75 |
| Brian Winters | G | $1,000,000 | - | $7,000,000 | - | - | - | - | ($22,000,000) | $8,000,000 | 5.41 |
| James Carpenter | G | $4,450,000 | $875,000 | $250,000 | - | - | $1,230,000 | - | ($8,660,000) | $6,805,000 | 4.6 |
| Josh McCown | QB | $3,000,000 | $3,000,000 | - | - | - | - | $500,000 | ($6,000,000) | $6,500,000 | 4.39 |
| Leonard Williams | DE | $615,000 | $2,952,430 | $1,513,716 | - | - | - | - | ($11,009,149) | $5,081,146 | 3.44 |
| Matt Forte | RB | $4,000,000 | $1,000,000 | - | - | - | - | - | ($6,000,000) | $5,000,000 | 3.38 |
| Kelvin Beachum | LT | $1,500,000 | $1,500,000 | $2,000,000 | - | - | - | - | ($12,000,000) | $5,000,000 | 3.38 |
| Morris Claiborne | CB | $2,500,000 | $2,000,000 | $218,750 | - | - | - | - | ($4,500,000) | $4,718,750 | 3.19 |
| Bilal Powell | RB | $3,750,000 | $883,333 | - | - | - | - | - | ($5,516,668) | $4,633,333 | 3.13 |


In [6]:
rows = soup.find("table", "datatable").find_all("tr")
players = []
for row in rows:
    # Keep only rows that contain some text; the old check
    # (get_text("tr") is not None) was always true.
    if row.get_text(strip=True):
        players.append(row)
players
        

[<tr>
 <th class="player">Active Players  (53)</th>
 <th class="center">Pos.</th>
 <th class="right xs-hide">Base Salary</th>
 <th class="right xs-hide"><span class="info" title="The prorated portion of the signing bonus allocated to this year's salary cap.">Signing Bonus</span></th>
 <th class="right xs-hide"><span class="info" title="Generally in the form of Actve Roster or Per Game bonuses.">Roster Bonus</span></th>
 <th class="right xs-hide"><span class="info" title="The prorated portion of the option bonus allocated to this year's salary cap.">Option Bonus</span></th>
 <th class="right xs-hide"><span title="">Workout Bonus</span></th>
 <th class="right xs-hide"><span class="info" title="The prorated portion of the restructure bonus allocated to this year's salary cap, which comes from converted base salary.">Restruc. Bonus</span></th>
 <th class="right xs-hide"><span class="info" title="Performance, Playing Time, Pro Bowl, etc bonuses.">Misc.</span></th>
 <th class="right xs-hide"

In [7]:
columns_headers = [col.get_text() for col in players[0].find_all("th") if col.get_text()]
columns_headers

['Active Players  (53)',
 'Pos.',
 'Base Salary',
 'Signing Bonus',
 'Roster Bonus',
 'Option Bonus',
 'Workout Bonus',
 'Restruc. Bonus',
 'Misc.',
 'Dead Cap',
 'Cap Hit',
 'Cap %']
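
One way to put the header list to work is to pair each header with the matching cell of a row. The sketch below does that for the Wilkerson row, with the cell values copied from the rendered table earlier in the notebook; `row_cells` and `record` are illustrative names that do not appear anywhere else here.

```python
# Header list exactly as printed above.
columns_headers = ['Active Players  (53)', 'Pos.', 'Base Salary', 'Signing Bonus',
                   'Roster Bonus', 'Option Bonus', 'Workout Bonus', 'Restruc. Bonus',
                   'Misc.', 'Dead Cap', 'Cap Hit', 'Cap %']

# Cell text for one player, copied from the rendered table (illustrative name).
row_cells = ['Muhammad Wilkerson', 'DE', '$14,750,000', '$3,000,000', '-', '-',
             '$250,000', '-', '-', '($27,000,000)', '$18,000,000', '12.17']

# zip pairs the headers and cells position by position.
record = dict(zip(columns_headers, row_cells))
print(record['Pos.'], record['Cap Hit'])
```

Keeping each row as one record keeps a player's position and cap hit together, instead of relying on separate lists staying the same length.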

## List of player names
Get the player names from the td tags that have class="player".

I sliced this to get only the first 110 records because the rest have the same labels as above. This should be a list of 53 names to reflect the 53-man roster, but each player shows up twice, first by last name only ("Wilkerson") and then by full name ("Muhammad Wilkerson"), so there are duplicates.
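
A possible way to remove those duplicates is to drop any entry that is immediately followed by a longer name ending in it. This is only a sketch under the assumption that the names always arrive as a short form followed by its full form, as in the printed output; `drop_short_names` is a hypothetical helper, not something the notebook defines.

```python
def drop_short_names(names):
    """Drop each last-name-only entry that is followed by its full-name form."""
    # Strip stray whitespace and ignore empty strings.
    cleaned = [name.strip() for name in names if name.strip()]
    full_names = []
    for i, name in enumerate(cleaned):
        next_name = cleaned[i + 1] if i + 1 < len(cleaned) else ""
        # "Wilkerson" followed by "Muhammad Wilkerson" is the short duplicate.
        if next_name.endswith(" " + name):
            continue
        full_names.append(name)
    return full_names


sample = ["Wilkerson", "Muhammad Wilkerson ", "Skrine", "Buster Skrine "]
print(drop_short_names(sample))
```

Comparing against the next entry, rather than keeping only names that contain a space, avoids dropping a full name by accident when a last name itself contains a space.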





In [8]:
td_tags = soup.find_all("td", {"class":"player"})[:110]

for td in td_tags:
    # Each td holds the text for a single player.
    player_name = td.get_text()
    print(player_name)
  
    


Wilkerson
Muhammad Wilkerson 
Skrine
Buster Skrine 
Winters
Brian Winters 
Carpenter
James Carpenter 
McCown
Josh McCown 
Williams
Leonard Williams 
Forte
Matt Forte 
Beachum
Kelvin Beachum 
Claiborne
Morris Claiborne 
Powell
Bilal Powell 
Ijalana
Benjamin Ijalana 
Adams
Jamal Adams 
McLendon
Steve McLendon 
Johnson
Wesley Johnson 
Williams
Marcus Williams 
Lee
Darron Lee 
Davis
Demario Davis 
Kearse
Jermaine Kearse 
Martin
Josh Martin 
Maye
Marcus Maye 
Hackenberg
Christian Hackenberg 
Seferian-Jenkins
Austin Seferian-Jenkins 
Pennel
Mike Pennel 
Catanzaro
Chandler Catanzaro 
Harrison
Jonotthan Harrison 
Ealy
Kony Ealy 
Carter
Bruce Carter 
Kerley
Jeremy Kerley 
Stanford
Julian Stanford 
Dozier
Dakota Dozier 
Petty
Bryce Petty 
Jenkins
Jordan Jenkins 
Brooks
Terrence Brooks 
Bass
David Bass 
Burris
Juston Burris 
Stewart
ArDarius Stewart 
Miles
Rontez Miles 
Qvale
Brent Qvale 
Roberts
Darryl Roberts 
Sterling
Neal Sterling 
Tye
Will Tye 
Shell
Brandon Shell 
Hansen
Chad Hansen 
Edward

## Player Position and Salary
Pick out the position of each player and the Cap Hit.

get_text() returns a string, so the values in player_cap_hit will need to be converted to int once they are loaded into a dataframe.
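
The string-to-int conversion could look something like the sketch below. `parse_dollars` is a hypothetical helper, and two choices in it are assumptions rather than anything the page states: a bare "-" is treated as 0, and a value in parentheses, such as the Dead Cap figures, is treated as negative.

```python
def parse_dollars(text):
    """Convert a string such as "$18,000,000" to an int number of dollars."""
    cleaned = text.strip()
    if cleaned in ("", "-"):
        return 0  # assumption: a dash means no amount
    # Assumption: parentheses mark a negative amount, as in accounting notation.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    digits = cleaned.strip("()").replace("$", "").replace(",", "")
    value = int(digits)
    return -value if negative else value


print(parse_dollars("$18,000,000"))
```

The same function can be applied to a whole list, for example `[parse_dollars(value) for value in player_cap_hit]`.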

There are 53 cap hits and 110 positions. The number of cap hits makes sense and matches what an NFL active roster should be, but why are there 110 positions?



In [27]:
player_cap_hit = []
cap_info_rows = soup.find_all("td", {"class":" right result "})
for row in cap_info_rows:
    player_cap_hit.append(row.get_text())

len(player_cap_hit)

53

In [24]:
player_position = []
position_rows = soup.find_all("td", {"class": " center small"})
for row in position_rows:
    player_position.append(row.get_text())

len(player_position)

110