# LeBron James Season Stats from ESPN
Most of the data that we use is already clean or has been prepared for analysis. I wanted to find some data that I was interested in myself and then use it to perform an analysis.

In Python, one of the most widely used libraries for parsing scraped HTML is `BeautifulSoup`.

I'm really interested in sports analytics, so I decided to use a web scraper to get the summary stats for LeBron James from his stats page on ESPN and then look at his results over time. Because the season is still in progress as I do this analysis, the data should not go out of date: the page is scraped again every time the analysis is run.

---

Useful articles:  
[Web Scraping using Python Article](https://www.datacamp.com/community/tutorials/web-scraping-using-python)  
[Beautiful Soup Article](https://www.datacamp.com/community/tutorials/tutorial-python-beautifulsoup-datacamp-tutorials)
<br>

_Note that this notebook is a work in progress, and I am exploring different ways of pulling this particular data from the HTML page to explore the capabilities of the package._

In [119]:
# Load packages and have a look at converting season column into date times 
import pandas as pd
season = "'03-'04"
years = [pd.to_datetime(year, format="'%y") for year in season.split('-')]
print(pd.period_range(years[0], years[1], freq='Y' ))
print(years)

PeriodIndex(['2003', '2004'], dtype='period[A-DEC]', freq='A-DEC')
[Timestamp('2003-01-01 00:00:00'), Timestamp('2004-01-01 00:00:00')]
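
If only the start and end years are needed, the same conversion can be done with the standard library. This is a minimal sketch; the helper name `season_years` is my own and not part of the notebook above.

```python
from datetime import datetime


def season_years(label):
    """Return (start_year, end_year) for a season label such as "'03-'04"."""
    start, end = label.split("-")
    # "'%y" matches the leading apostrophe followed by a two-digit year
    return tuple(datetime.strptime(part, "'%y").year for part in (start, end))


print(season_years("'03-'04"))
```

Two-digit years are resolved by `strptime`, so a label that crosses the century such as `'99-'00` should still come out in the right order.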


# Points by season
James is a prolific scorer who recently moved into 5th place on the NBA's all-time scoring list, but has he improved over time? Is he past his prime?

We can take a quick look at his total regular-season points for each year since he joined the NBA. He has been consistently good, scoring at least 1,500 points per season.

There are a couple of methods for requesting the page of interest, and then turning that output (a `BeautifulSoup` object in this example) into a coherent data frame that we can use. 

In [120]:
# Method 1 for requests for web scraping using the urllib package

# Import packages
# Useful if you have a requirement to do this with the standard library's urllib
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.google.com"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Be polite and close the response!
response.close()
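
Method 1 can also be written with a `with` block, so the response is closed automatically even if `read()` fails. This is only a sketch: the helper name `fetch_html` and the 10 second timeout are my own assumptions, and the `data:` URL is there purely so the example runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Fetch a URL and return the decoded body as a string."""
    request = Request(url)
    # The with-block closes the response for us, even if read() raises
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A data: URL keeps the demonstration offline; swap in a real http(s) URL as needed
demo_url = "data:text/html," + quote("<html><title>demo</title></html>")
print(fetch_html(demo_url))
```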

In [121]:
# Another method using the `requests` package - this is my preferred method 

# Import package
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd

# Check the available matplotlib styles 
# print(plt.style.available)
plt.style.use('fivethirtyeight')

In [122]:
# Specify the url: url
# url = "http://www.google.com"

# Packages the request, send the request and catch the response: r
# r = requests.get(url)

# Extract the response: text
# text = r.text

# Print the html
# print(text)

# Create a BeautifulSoup object from the HTML: soup
# soup = BeautifulSoup(text, "html.parser")
# print(soup.prettify())
# print(soup.title)
# print(soup.get_text())

# Get all the links
# a_tags = soup.find_all('a')

# for link in a_tags:
# 	print(link)
# 	print(link.get('href'))
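
The commented-out exploration above can be tried without a network connection by parsing a small inline HTML string instead. The HTML below is made up for illustration, and `soup_demo` is named so that it does not clash with the `soup` variable used later. Passing `"html.parser"` explicitly avoids the warning that BeautifulSoup emits when it has to guess a parser.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
sample_html = """
<html><head><title>Demo page</title></head>
<body>
  <a href="https://example.com/one">One</a>
  <a href="/two">Two</a>
  <a>No href</a>
</body></html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# link.get('href') returns None when the attribute is missing
hrefs = [link.get("href") for link in soup_demo.find_all("a")]
print(soup_demo.title.get_text())
print(hrefs)
```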

In [123]:
# Get the LeBron James stats page 
url = "http://www.espn.com/nba/player/stats/_/id/1966/lebron-james"

# Request to the URL and get the text: html
html = requests.get(url).text

# Parse that html into a Beautiful Soup object
soup = BeautifulSoup(html, features="html.parser")

# Get the text from the object (removes the tags)
# print(soup.title.get_text())

# Pull all the tables from the soup object
tables = soup.find_all('table')
table_content = []

# Separate all the tables out from the page by looping through the tables list 
tester_tables = [table.get_text() for table in tables]

# Remove the first table that is just the small summary of his stats - we don't want this 
tables.pop(0)

# Loop through the tables we are interested in
for table in tables:
    # Get all the rows for each table
    rows = table.find_all('tr')
    
    # Get the table's title 
    table_title = rows[0].get_text()
    clean_rows = [] 
    
    # Skip the title row and the bottom summary row, which is missing a column
    for row in rows[1:-1]:
        # print(row.get_text(' | '))
        # Add a separator that we can use to split the data later 
        # You can specify a string to be used to join the bits of text together ("|")
        clean_rows.append(row.get_text(' | '))

    # Iterate through the clean rows and split into columns
    cleaner_rows = [row.split(' | ') for row in clean_rows]

    # Create a dataframe 
    dataframe_version = pd.DataFrame(cleaner_rows)

    # Use the first row to create column names
    dataframe_version.columns = dataframe_version.iloc[0]

    # Drop that column name row
    dataframe_version.drop(dataframe_version.index[0], inplace = True)

    dataframe_version.set_index('SEASON', inplace = True)
    # dataframe_version['PTS'] = pd.to_numeric(dataframe_version['PTS'])
    table_content.append(dataframe_version) 

    # dataframe_version.plot.bar(y='PTS',  rot=0, figsize=(16,8))
    # plt.show()

We have looped through the tables found on the page and stored each one as a data frame in a list, `table_content`.
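
Joining a row with `' | '` and splitting it again works here, but it would break if a cell contained nested tags or the separator itself. An alternative worth exploring is to read each cell on its own. The sketch below assumes a table laid out like the ESPN ones (a title row, a header row, data rows, and a short summary row); the HTML and the helper name `table_to_frame` are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down table in the same shape as the stats tables
sample_table = """
<table>
  <tr><td colspan="3">Regular Season Averages</td></tr>
  <tr><td>SEASON</td><td>TEAM</td><td>PTS</td></tr>
  <tr><td>'03-'04</td><td>CLE</td><td>20.9</td></tr>
  <tr><td>'04-'05</td><td>CLE</td><td>27.2</td></tr>
  <tr><td>Career</td><td>27.2</td></tr>
</table>
"""


def table_to_frame(table):
    """Turn a <table> tag into (title, DataFrame) by reading each cell."""
    rows = [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in table.find_all("tr")
    ]
    title, header, body = rows[0][0], rows[1], rows[2:]
    # Keep only rows with one value per column (drops the short summary row)
    body = [row for row in body if len(row) == len(header)]
    frame = pd.DataFrame(body, columns=header).set_index("SEASON")
    return title, frame


demo_table = BeautifulSoup(sample_table, "html.parser").find("table")
title, frame = table_to_frame(demo_table)
print(title)
print(frame)
```

Filtering on row length, rather than always dropping the last row, means the function does not depend on the summary row being present.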

In [124]:
table_content

[0       TEAM  GP  GS   MIN      FGM-A   FG%    3PM-A   3P%     FTM-A   FT%  \
 SEASON                                                                       
 '03-'04  CLE  79  79  39.5   7.9-18.9  .417  0.8-2.7  .290   4.4-5.8  .754   
 '04-'05  CLE  80  80  42.4   9.9-21.1  .472  1.4-3.9  .351   6.0-8.0  .750   
 '05-'06  CLE  79  79  42.5  11.1-23.1  .480  1.6-4.8  .335  7.6-10.3  .738   
 '06-'07  CLE  78  78  40.9   9.9-20.8  .476  1.3-4.0  .319   6.3-9.0  .698   
 '07-'08  CLE  75  74  40.4  10.6-21.9  .484  1.5-4.8  .315  7.3-10.3  .712   
 '08-'09  CLE  81  81  37.7   9.7-19.9  .489  1.6-4.7  .344   7.3-9.4  .780   
 '09-'10  CLE  76  76  39.0  10.1-20.1  .503  1.7-5.1  .333  7.8-10.2  .767   
 '10-'11  MIA  79  79  38.8   9.6-18.8  .510  1.2-3.5  .330   6.4-8.4  .759   
 '11-'12  MIA  62  62  37.5  10.0-18.9  .531  0.9-2.4  .362   6.2-8.1  .771   
 '12-'13  MIA  76  76  37.9  10.1-17.8  .565  1.4-3.3  .406   5.3-7.0  .753   
 '13-'14  MIA  77  77  37.7  10.0-17.6  .567  1.5-4.

Viewing the tables one by one, we can see that each is a data frame that we can work with going forward.

After this we can have a look at different methods of generating the `pandas` data frames. 

In [125]:
table_content[0]

| SEASON | TEAM | GP | GS | MIN | FGM-A | FG% | 3PM-A | 3P% | FTM-A | FT% | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| '03-'04 | CLE | 79 | 79 | 39.5 | 7.9-18.9 | 0.417 | 0.8-2.7 | 0.29 | 4.4-5.8 | 0.754 | 1.3 | 4.2 | 5.5 | 5.9 | 0.7 | 1.6 | 1.9 | 3.5 | 20.9 |
| '04-'05 | CLE | 80 | 80 | 42.4 | 9.9-21.1 | 0.472 | 1.4-3.9 | 0.351 | 6.0-8.0 | 0.75 | 1.4 | 6.0 | 7.4 | 7.2 | 0.7 | 2.2 | 1.8 | 3.3 | 27.2 |
| '05-'06 | CLE | 79 | 79 | 42.5 | 11.1-23.1 | 0.48 | 1.6-4.8 | 0.335 | 7.6-10.3 | 0.738 | 1.0 | 6.1 | 7.0 | 6.6 | 0.8 | 1.6 | 2.3 | 3.3 | 31.4 |
| '06-'07 | CLE | 78 | 78 | 40.9 | 9.9-20.8 | 0.476 | 1.3-4.0 | 0.319 | 6.3-9.0 | 0.698 | 1.1 | 5.7 | 6.7 | 6.0 | 0.7 | 1.6 | 2.2 | 3.2 | 27.3 |
| '07-'08 | CLE | 75 | 74 | 40.4 | 10.6-21.9 | 0.484 | 1.5-4.8 | 0.315 | 7.3-10.3 | 0.712 | 1.8 | 6.1 | 7.9 | 7.2 | 1.1 | 1.8 | 2.2 | 3.4 | 30.0 |
| '08-'09 | CLE | 81 | 81 | 37.7 | 9.7-19.9 | 0.489 | 1.6-4.7 | 0.344 | 7.3-9.4 | 0.78 | 1.3 | 6.3 | 7.6 | 7.2 | 1.1 | 1.7 | 1.7 | 3.0 | 28.4 |
| '09-'10 | CLE | 76 | 76 | 39.0 | 10.1-20.1 | 0.503 | 1.7-5.1 | 0.333 | 7.8-10.2 | 0.767 | 0.9 | 6.4 | 7.3 | 8.6 | 1.0 | 1.6 | 1.6 | 3.4 | 29.7 |
| '10-'11 | MIA | 79 | 79 | 38.8 | 9.6-18.8 | 0.51 | 1.2-3.5 | 0.33 | 6.4-8.4 | 0.759 | 1.0 | 6.5 | 7.5 | 7.0 | 0.6 | 1.6 | 2.1 | 3.6 | 26.7 |
| '11-'12 | MIA | 62 | 62 | 37.5 | 10.0-18.9 | 0.531 | 0.9-2.4 | 0.362 | 6.2-8.1 | 0.771 | 1.5 | 6.4 | 7.9 | 6.2 | 0.8 | 1.9 | 1.5 | 3.4 | 27.1 |
| '12-'13 | MIA | 76 | 76 | 37.9 | 10.1-17.8 | 0.565 | 1.4-3.3 | 0.406 | 5.3-7.0 | 0.753 | 1.3 | 6.8 | 8.0 | 7.3 | 0.9 | 1.7 | 1.4 | 3.0 | 26.8 |


In [126]:
# Cast the regular season totals to ints, then add a summary row with the index label 'Career'
main_stats_totals = table_content[1][['OR', 'DR', 'REB', 'AST', 'BLK', 'STL', 'PF', 'TO', 'PTS']].astype('int')
career_totals = main_stats_totals.sum(numeric_only=True)
career_totals.name = 'Career'

# DataFrame.append was removed in pandas 2.0, so use pd.concat to add the row
pd.concat([main_stats_totals, career_totals.to_frame().T])

| SEASON | OR | DR | REB | AST | BLK | STL | PF | TO | PTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| '03-'04 | 99 | 333 | 432 | 465 | 58 | 130 | 149 | 273 | 1654 |
| '04-'05 | 111 | 477 | 588 | 577 | 52 | 177 | 146 | 262 | 2175 |
| '05-'06 | 75 | 481 | 556 | 521 | 66 | 123 | 181 | 260 | 2478 |
| '06-'07 | 83 | 443 | 526 | 470 | 55 | 125 | 171 | 250 | 2132 |
| '07-'08 | 133 | 459 | 592 | 539 | 81 | 138 | 165 | 255 | 2250 |
| '08-'09 | 106 | 507 | 613 | 587 | 93 | 137 | 139 | 241 | 2304 |
| '09-'10 | 71 | 483 | 554 | 651 | 77 | 125 | 119 | 261 | 2258 |
| '10-'11 | 80 | 510 | 590 | 554 | 50 | 124 | 163 | 284 | 2111 |
| '11-'12 | 94 | 398 | 492 | 387 | 50 | 115 | 96 | 213 | 1683 |
| '12-'13 | 97 | 513 | 610 | 551 | 67 | 129 | 110 | 226 | 2036 |
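
The `astype('int')` call above raises a `ValueError` as soon as one cell is not a clean integer string. If the page ever contains a placeholder value, `pd.to_numeric` with `errors="coerce"` is a more forgiving option. A small sketch, using a few values from the table above plus a hypothetical `"--"` cell that I have added to show the behaviour:

```python
import pandas as pd

# Hypothetical frame: three seasons of totals, with one cell that is not a number
totals = pd.DataFrame(
    {"AST": ["465", "577", "521"], "PTS": ["1654", "2175", "--"]},
    index=pd.Index(["'03-'04", "'04-'05", "'05-'06"], name="SEASON"),
)

# errors="coerce" turns unparseable cells into NaN instead of raising
numeric = totals.apply(pd.to_numeric, errors="coerce")
print(numeric)
print(numeric.sum())
```

The trade-off is that a coerced column becomes floating point and `sum()` silently skips the missing value, so it is worth checking `numeric.isna().sum()` before trusting the totals.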


In [127]:
# View the whole page 
# soup

# Get all row tags from the page 
rows = soup.find_all('tr')

# Get all the cells from each row
td = [row.find_all('td') for row in rows]

In [128]:
# import numpy for manipulating the BeautifulSoup object results
import numpy
pd.DataFrame(numpy.array(td)).head()

# Not quite the format that we are looking for 

Unnamed: 0,0
0,"[<td>28.1</td>, <td>6.6</td>, <td>7.8</td>, <t..."
1,"[<td colspan=""3""><p>Career</p></td>]"
2,"[<td>27.2</td>, <td>7.2</td>, <td>7.4</td>]"
3,"[<td colspan=""21"">Regular Season Averages</td>]"
4,"[<td width=""6%"">SEASON</td>, <td width=""8%"">TE..."


In [129]:
# Generate a clean list of td values
# This function is probably the easiest way to remove html tags from objects
BeautifulSoup(str(td), "html.parser").get_text()

"[[28.1, 6.6, 7.8, 26.97], [Career], [27.2, 7.2, 7.4], [Regular Season Averages], [SEASON, TEAM, GP, GS, MIN, FGM-A, FG%, 3PM-A, 3P%, FTM-A, FT%, OR, DR, REB, AST, BLK, STL, PF, TO, PTS], ['03-'04, CLE, 79, 79, 39.5, 7.9-18.9, .417, 0.8-2.7, .290, 4.4-5.8, .754, 1.3, 4.2, 5.5, 5.9, 0.7, 1.6, 1.9, 3.5, 20.9], ['04-'05, CLE, 80, 80, 42.4, 9.9-21.1, .472, 1.4-3.9, .351, 6.0-8.0, .750, 1.4, 6.0, 7.4, 7.2, 0.7, 2.2, 1.8, 3.3, 27.2], ['05-'06, CLE, 79, 79, 42.5, 11.1-23.1, .480, 1.6-4.8, .335, 7.6-10.3, .738, 1.0, 6.1, 7.0, 6.6, 0.8, 1.6, 2.3, 3.3, 31.4], ['06-'07, CLE, 78, 78, 40.9, 9.9-20.8, .476, 1.3-4.0, .319, 6.3-9.0, .698, 1.1, 5.7, 6.7, 6.0, 0.7, 1.6, 2.2, 3.2, 27.3], ['07-'08, CLE, 75, 74, 40.4, 10.6-21.9, .484, 1.5-4.8, .315, 7.3-10.3, .712, 1.8, 6.1, 7.9, 7.2, 1.1, 1.8, 2.2, 3.4, 30.0], ['08-'09, CLE, 81, 81, 37.7, 9.7-19.9, .489, 1.6-4.7, .344, 7.3-9.4, .780, 1.3, 6.3, 7.6, 7.2, 1.1, 1.7, 1.7, 3.0, 28.4], ['09-'10, CLE, 76, 76, 39.0, 10.1-20.1, .503, 1.7-5.1, .333, 7.8-10.2, .767,

In [130]:
# Save the variable and split the values - slightly more difficult in our case as there are multiple tables here
# Note that this creates a string
clean_rows = BeautifulSoup(str(td), "html.parser").get_text()
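
Round-tripping through `str(td)` gives one long string whose brackets and commas still need cleaning up. Another option is to call `get_text()` on each cell directly, which keeps the row structure as nested lists. A minimal sketch on a made-up fragment shaped like the rows above:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in the same shape as the rows printed above
fragment = """
<table>
  <tr><td colspan="3"><p>Career</p></td></tr>
  <tr><td>27.2</td><td>7.2</td><td>7.4</td></tr>
</table>
"""

demo_rows = BeautifulSoup(fragment, "html.parser").find_all("tr")

# One list of strings per row, with no brackets or commas to clean up afterwards
cell_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in demo_rows
]
print(cell_text)
```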

In [131]:
table_rows = []
for row in rows:
    cells = row.find_all('td')
    text = BeautifulSoup(str(cells), "html.parser").get_text()
    table_rows.append(text)

table_rows

['[28.1, 6.6, 7.8, 26.97]',
 '[Career]',
 '[27.2, 7.2, 7.4]',
 '[Regular Season Averages]',
 '[SEASON, TEAM, GP, GS, MIN, FGM-A, FG%, 3PM-A, 3P%, FTM-A, FT%, OR, DR, REB, AST, BLK, STL, PF, TO, PTS]',
 "['03-'04, CLE, 79, 79, 39.5, 7.9-18.9, .417, 0.8-2.7, .290, 4.4-5.8, .754, 1.3, 4.2, 5.5, 5.9, 0.7, 1.6, 1.9, 3.5, 20.9]",
 "['04-'05, CLE, 80, 80, 42.4, 9.9-21.1, .472, 1.4-3.9, .351, 6.0-8.0, .750, 1.4, 6.0, 7.4, 7.2, 0.7, 2.2, 1.8, 3.3, 27.2]",
 "['05-'06, CLE, 79, 79, 42.5, 11.1-23.1, .480, 1.6-4.8, .335, 7.6-10.3, .738, 1.0, 6.1, 7.0, 6.6, 0.8, 1.6, 2.3, 3.3, 31.4]",
 "['06-'07, CLE, 78, 78, 40.9, 9.9-20.8, .476, 1.3-4.0, .319, 6.3-9.0, .698, 1.1, 5.7, 6.7, 6.0, 0.7, 1.6, 2.2, 3.2, 27.3]",
 "['07-'08, CLE, 75, 74, 40.4, 10.6-21.9, .484, 1.5-4.8, .315, 7.3-10.3, .712, 1.8, 6.1, 7.9, 7.2, 1.1, 1.8, 2.2, 3.4, 30.0]",
 "['08-'09, CLE, 81, 81, 37.7, 9.7-19.9, .489, 1.6-4.7, .344, 7.3-9.4, .780, 1.3, 6.3, 7.6, 7.2, 1.1, 1.7, 1.7, 3.0, 28.4]",
 "['09-'10, CLE, 76, 76, 39.0, 10.1-20.1, .50

In [132]:
# Convert to a pandas data frame, getting better but not quite there
# We would have to do a lot more cleaning here 
pd.DataFrame([table_row.split(',') for table_row in table_rows])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,[28.1,6.6,7.8,26.97],,,,,,,,,,,,,,,,
1,[Career],,,,,,,,,,,,,,,,,,,
2,[27.2,7.2,7.4],,,,,,,,,,,,,,,,,
3,[Regular Season Averages],,,,,,,,,,,,,,,,,,,
4,[SEASON,TEAM,GP,GS,MIN,FGM-A,FG%,3PM-A,3P%,FTM-A,FT%,OR,DR,REB,AST,BLK,STL,PF,TO,PTS]
5,['03-'04,CLE,79,79,39.5,7.9-18.9,.417,0.8-2.7,.290,4.4-5.8,.754,1.3,4.2,5.5,5.9,0.7,1.6,1.9,3.5,20.9]
6,['04-'05,CLE,80,80,42.4,9.9-21.1,.472,1.4-3.9,.351,6.0-8.0,.750,1.4,6.0,7.4,7.2,0.7,2.2,1.8,3.3,27.2]
7,['05-'06,CLE,79,79,42.5,11.1-23.1,.480,1.6-4.8,.335,7.6-10.3,.738,1.0,6.1,7.0,6.6,0.8,1.6,2.3,3.3,31.4]
8,['06-'07,CLE,78,78,40.9,9.9-20.8,.476,1.3-4.0,.319,6.3-9.0,.698,1.1,5.7,6.7,6.0,0.7,1.6,2.2,3.2,27.3]
9,['07-'08,CLE,75,74,40.4,10.6-21.9,.484,1.5-4.8,.315,7.3-10.3,.712,1.8,6.1,7.9,7.2,1.1,1.8,2.2,3.4,30.0]
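
Much of the remaining cleaning comes from the page holding several tables in one flat list of rows. One possible approach, sketched below, is to start a new table whenever a header row beginning with `SEASON` appears and to keep only the rows that match that header's length. The helper name `group_rows` is my own, the rows are trimmed to a few columns, and the second table title is a placeholder rather than text taken from the page.

```python
def group_rows(rows, header_first_cell="SEASON"):
    """Group flat rows of cell text into {title: {"header": ..., "rows": ...}}."""
    grouped = {}
    current = None
    previous = []
    for row in rows:
        if row and row[0] == header_first_cell:
            # A single-cell row directly above a header is treated as the table title
            title = previous[0] if len(previous) == 1 else f"table_{len(grouped)}"
            current = {"header": row, "rows": []}
            grouped[title] = current
        elif current is not None and len(row) == len(current["header"]):
            current["rows"].append(row)
        previous = row
    return grouped


# Trimmed rows: values come from the output above, the second title is a placeholder
page_rows = [
    ["28.1", "6.6", "7.8", "26.97"],
    ["Career"],
    ["27.2", "7.2", "7.4"],
    ["Regular Season Averages"],
    ["SEASON", "TEAM", "GP", "PTS"],
    ["'03-'04", "CLE", "79", "20.9"],
    ["'04-'05", "CLE", "80", "27.2"],
    ["Regular Season Totals"],
    ["SEASON", "TEAM", "PTS"],
    ["'03-'04", "CLE", "1654"],
]

grouped = group_rows(page_rows)
for title, table in grouped.items():
    print(title, table["header"], len(table["rows"]))
```

Each group can then be passed to `pd.DataFrame(table["rows"], columns=table["header"])`. A summary row that happens to have the same number of cells as the header would slip through, so this is a starting point rather than a complete solution.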
