In [1]:
from bs4 import BeautifulSoup
import urllib.error
import urllib.request

### Setup

The `Player` class makes the scraped data easier to sort through and de-duplicate, since we'll be collecting photos from rosters over several seasons.

In [2]:
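class Player:
    """A roster entry: player name, team, and headshot URL."""

    def __init__(self, name, team, jpg):
        self.name = name
        self.team = team
        self.jpg = jpg

    def _key(self):
        return (self.name, self.team, self.jpg)

    def __eq__(self, other):
        # Compare the same fields that __hash__ uses, so equal objects
        # always hash alike (required for sets and dict keys).
        return isinstance(other, Player) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())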
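
To illustrate why `__eq__` and `__hash__` both matter for de-duplication, here is a minimal sketch using a hypothetical `Card` class (not part of the notebook) and made-up records. Dicts and sets compare hashes first and only then call `__eq__`, so two objects are merged only when both agree.

```python
from dataclasses import dataclass

# Hypothetical stand-in for Player; a frozen dataclass generates __eq__ and
# __hash__ from the same fields, so the two always agree.
@dataclass(frozen=True)
class Card:
    name: str
    team: str
    jpg: str

records = [
    Card("A. Skater", "kings", "1.jpg"),
    Card("A. Skater", "kings", "1.jpg"),   # same player, later season
    Card("A. Skater", "sharks", "1.jpg"),  # same player, different team
]

# dict.fromkeys keeps the first occurrence of each distinct key, in order.
unique = list(dict.fromkeys(records))
print(len(unique))  # 2
```

The repeated season collapses into one entry, but the traded player still appears twice because the team is part of the key.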

The following functions scrape roster pages from the NHL website.
(The 2004-2005 season was cancelled due to the lockout, so the failed request for that year is caught and reported. #history)

In [3]:
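def get_multiseasons(base_url, seasonrange_floor, seasonrange_ceiling):
    """Collect players from base_url/<year> for each year in [floor, ceiling)."""
    allplayers = []
    team = base_url.split('/')[3]  # team name taken from the URL path
    for year in range(seasonrange_floor, seasonrange_ceiling):
        url = base_url + "/" + str(year)
        try:
            allplayers.extend(getPlayers(url, team))
            print("Team: ",team,", Season: ", year,"added \n")
        except urllib.error.URLError as err:
            # Expected for 2004, the season cancelled by the lockout.
            if year == 2004:
                print("2004-2005 season unavailable.")
            else:
                print("Team: ", team, ", Season: ", year, "failed:", err)
    return allplayers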
        
def getPlayers(url, team):
    """Scrape one roster page and return a list of Player objects."""
    player_list = []
    with urllib.request.urlopen(url) as html:
        soup = BeautifulSoup(html, 'html.parser')

    # Headshots and name parts are located by their CSS classes.
    images = soup.find_all("img", "player-photo")
    lastnames = soup.find_all("span", "name-col__item name-col__lastName")
    firstnames = soup.find_all("span", "name-col__item name-col__firstName")

    # Pair the three lists positionally; zip stops at the shortest one.
    for first, last, image in zip(firstnames, lastnames, images):
        name = first.string + ' ' + last.string
        player_list.append(Player(name=name, team=team, jpg=image['src']))

    return player_list
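
`getPlayers` pairs three independently collected lists with `zip`, which silently stops at the shortest one. A minimal sketch with made-up lists (no scraping involved) shows how a length check can surface a missing photo instead of dropping or mispairing rows. The helper name `pair_rows` is hypothetical.

```python
def pair_rows(firstnames, lastnames, images):
    """Zip the three columns, refusing to pair lists of unequal length."""
    lengths = {len(firstnames), len(lastnames), len(images)}
    if len(lengths) != 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(firstnames, lastnames, images))

# Made-up data: the second player has no photo on the page.
first = ["Ann", "Bo"]
last = ["Lee", "Roy"]
imgs = ["ann.jpg"]

print(list(zip(first, last, imgs)))  # plain zip drops Bo Roy without warning
try:
    pair_rows(first, last, imgs)
except ValueError as err:
    print("mismatch detected:", err)
```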

We'll collect photos for roster years 2000 through 2019 (the ceiling of 2020 is exclusive), using a list of roster URLs for every team in the NHL. Duplicates will be removed before images are downloaded.

In [4]:
seasonrange_floor = 2000
seasonrange_ceiling = 2020
all_players = []
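
Because Python's `range` excludes its upper bound, the floor and ceiling above yield roster years 2000 through 2019. The sketch below checks that and shows how a per-year URL and team name would be derived. The base URL is a made-up example of the pattern `get_multiseasons` appears to expect, not a line from the actual `nhlroster_urls` file.

```python
from urllib.parse import urlsplit

seasonrange_floor = 2000
seasonrange_ceiling = 2020

years = list(range(seasonrange_floor, seasonrange_ceiling))
print(years[0], years[-1], len(years))  # 2000 2019 20

# Hypothetical base URL, used only to illustrate the string handling.
base_url = "https://example.com/blackhawks/roster"

# The first path segment is the team, matching base_url.split('/')[3].
team = urlsplit(base_url).path.strip("/").split("/")[0]
urls = [f"{base_url}/{year}" for year in years]
print(team, urls[0])
```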

In [5]:
fname="nhlroster_urls"

In [6]:
with open(fname) as f:
    content = f.read().splitlines()

In [7]:
for page in content:
    all_players.extend(get_multiseasons(page, seasonrange_floor, seasonrange_ceiling))

Team:  blackhawks , Season:  2000 added 

Team:  blackhawks , Season:  2001 added 

Team:  blackhawks , Season:  2002 added 

Team:  blackhawks , Season:  2003 added 

2004-2005 season unavailable.
Team:  blackhawks , Season:  2005 added 

Team:  blackhawks , Season:  2006 added 

Team:  blackhawks , Season:  2007 added 

Team:  blackhawks , Season:  2008 added 

Team:  blackhawks , Season:  2009 added 

Team:  blackhawks , Season:  2010 added 

Team:  blackhawks , Season:  2011 added 

Team:  blackhawks , Season:  2012 added 

Team:  blackhawks , Season:  2013 added 

Team:  blackhawks , Season:  2014 added 

Team:  blackhawks , Season:  2015 added 

Team:  blackhawks , Season:  2016 added 

Team:  blackhawks , Season:  2017 added 

Team:  blackhawks , Season:  2018 added 

Team:  blackhawks , Season:  2019 added 

Team:  avalanche , Season:  2000 added 

Team:  avalanche , Season:  2001 added 

Team:  avalanche , Season:  2002 added 

Team:  avalanche , Season:  2003 added 

2004-200

Team:  oilers , Season:  2016 added 

Team:  oilers , Season:  2017 added 

Team:  oilers , Season:  2018 added 

Team:  oilers , Season:  2019 added 

Team:  kings , Season:  2000 added 

Team:  kings , Season:  2001 added 

Team:  kings , Season:  2002 added 

Team:  kings , Season:  2003 added 

2004-2005 season unavailable.
Team:  kings , Season:  2005 added 

Team:  kings , Season:  2006 added 

Team:  kings , Season:  2007 added 

Team:  kings , Season:  2008 added 

Team:  kings , Season:  2009 added 

Team:  kings , Season:  2010 added 

Team:  kings , Season:  2011 added 

Team:  kings , Season:  2012 added 

Team:  kings , Season:  2013 added 

Team:  kings , Season:  2014 added 

Team:  kings , Season:  2015 added 

Team:  kings , Season:  2016 added 

Team:  kings , Season:  2017 added 

Team:  kings , Season:  2018 added 

Team:  kings , Season:  2019 added 

Team:  sharks , Season:  2000 added 

Team:  sharks , Season:  2001 added 

Team:  sharks , Season:  2002 added 

T

Team:  mapleleafs , Season:  2005 added 

Team:  mapleleafs , Season:  2006 added 

Team:  mapleleafs , Season:  2007 added 

Team:  mapleleafs , Season:  2008 added 

Team:  mapleleafs , Season:  2009 added 

Team:  mapleleafs , Season:  2010 added 

Team:  mapleleafs , Season:  2011 added 

Team:  mapleleafs , Season:  2012 added 

Team:  mapleleafs , Season:  2013 added 

Team:  mapleleafs , Season:  2014 added 

Team:  mapleleafs , Season:  2015 added 

Team:  mapleleafs , Season:  2016 added 

Team:  mapleleafs , Season:  2017 added 

Team:  mapleleafs , Season:  2018 added 

Team:  mapleleafs , Season:  2019 added 

Team:  hurricanes , Season:  2000 added 

Team:  hurricanes , Season:  2001 added 

Team:  hurricanes , Season:  2002 added 

Team:  hurricanes , Season:  2003 added 

2004-2005 season unavailable.
Team:  hurricanes , Season:  2005 added 

Team:  hurricanes , Season:  2006 added 

Team:  hurricanes , Season:  2007 added 

Team:  hurricanes , Season:  2008 added 

Team

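After removing duplicates, we are left with 7557 unique player entries from the last 20 roster years. Because the hash includes the team, a player who appeared on more than one team's roster is counted once per team.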

In [43]:
reduced_allplayers=list(dict.fromkeys(all_players))
print("Unique players: ",len(reduced_allplayers))

Unique players:  7557


In [34]:
len(reduced_allplayers)

7557
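
The headshot URLs in this notebook end in a numeric file name (for example `8467939.jpg`). Assuming that number identifies a player, which the source does not confirm, de-duplicating on it would merge a player's entries across teams. A minimal sketch with a hypothetical helper `player_id` and made-up records:

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

def player_id(jpg_url):
    """Return the file name stem of a headshot URL, e.g. '8467939'."""
    return PurePosixPath(urlsplit(jpg_url).path).stem

# Made-up (name, team, jpg) records; the URL pattern follows the notebook output.
base = "https://nhl.bamcontent.com/images/headshots/current/60x60/"
records = [
    ("Player One", "kings", base + "1111111.jpg"),
    ("Player One", "sharks", base + "1111111.jpg"),  # same ID, new team
    ("Player Two", "kings", base + "2222222.jpg"),
]

# A dict keyed on the ID keeps the first record seen for each player.
by_id = {}
for record in records:
    by_id.setdefault(player_id(record[2]), record)

print(len(by_id), list(by_id))
```

With this key the traded player is counted once, so the total would likely come out lower than a count keyed on name, team, and URL together.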

We'll store the images in our data directory, skipping any photos that are no longer available on the site.
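
A minimal sketch of that download step, written so it can be exercised without network access. `save_photos` and its `retrieve` parameter are hypothetical names; `retrieve` defaults to `urllib.request.urlretrieve`. Failed downloads are counted rather than stopping the run.

```python
import urllib.error
import urllib.request
from pathlib import Path

def save_photos(players, data_dir, retrieve=urllib.request.urlretrieve):
    """Download each (name, jpg_url) pair into data_dir.

    Returns (available, unavailable) counts.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
    available = unavailable = 0
    for name, jpg_url in players:
        try:
            retrieve(jpg_url, str(target / f"{name}.jpg"))
            available += 1
        except urllib.error.URLError:
            # HTTPError (e.g. a 404 for a removed photo) is a URLError subclass.
            unavailable += 1
    return available, unavailable
```

With the notebook's objects this could be called as `save_photos([(p.name, p.jpg) for p in reduced_allplayers], data_dir)`.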

In [47]:
data_dir = "./project4_data/"
photos_unavailable = 0
photos_available = 0

In [52]:
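import os

os.makedirs(data_dir, exist_ok=True)  # urlretrieve will not create the folder
for i in range(100):  # first 100 players only
    player = reduced_allplayers[i]
    try:
        # Save the headshot into data_dir as '<name>.jpg'
        urllib.request.urlretrieve(player.jpg, os.path.join(data_dir, player.name + '.jpg'))
        photos_available += 1
        if photos_available % 100 == 0:
            print(i, " images collected.")
    except urllib.error.URLError:
        print("Player photograph unavailable.")
        print(player.jpg, i, player.name)
        photos_unavailable += 1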

0  images collected.
Greg McKegg
Daniel Tjarnqvist
Anthony Stolarz
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8467939.jpg 3 Dan Jancevski
Chris Tamer
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8470735.jpg 5 Greg Moore
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8469752.jpg 6 Pascal Pelletier
Christoph Bertschy
Brandon Pirri
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8460560.jpg 9 Marko Kiprusoff
Brett Bellemore
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8467478.jpg 11 Kent Huskins
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8458126.jpg 12 Cameron Stewart
Byron Ritchie
Colby Armstrong
Derek Stepan
Aaron Miller
Travis Moen
Tim Erixon
Player photgraph unavailable.
https://nhl.bamcontent.com/images/headshots/current/60x60/8475