# LeBron James Season Stats from ESPN
Most of the data that we use is already clean or has already been prepared for analysis. I wanted to find some information that I was interested in myself and then use it to perform an analysis.

In Python, the most widely used library for parsing scraped HTML is `BeautifulSoup`.

I'm really interested in sports analytics, so I decided to scrape LeBron James's summary stats from his stats page on ESPN and then look at his results over time. The season is still in progress as I write this, but the data should not go out of date, because the notebook scrapes the latest results every time it is run.

---

Useful articles:  
[Web Scraping using Python Article](https://www.datacamp.com/community/tutorials/web-scraping-using-python)  
[Beautiful Soup Article](https://www.datacamp.com/community/tutorials/tutorial-python-beautifulsoup-datacamp-tutorials)
<br>

_Note that this notebook is a work in progress; I am trying different ways of pulling this particular data from the HTML page to explore the capabilities of the package._

In [0]:
import pandas as pd

In [0]:
season = "'03-'04"
years = [pd.to_datetime(year, format="'%y") for year in season.split('-')]
print(pd.period_range(years[0], years[1], freq='Y' ))
print(years)

PeriodIndex(['2003', '2004'], dtype='period[A-DEC]', freq='A-DEC')
[Timestamp('2003-01-01 00:00:00'), Timestamp('2004-01-01 00:00:00')]
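
The two-digit years above land in the right century because of how `%y` is interpreted. As a minimal sketch, the same parsing can be wrapped in a small helper using only the standard library; the name `season_years` is hypothetical and not part of the notebook.

```python
from datetime import datetime


def season_years(label):
    """Split a label such as "'03-'04" into (start_year, end_year) integers."""
    start, end = label.split("-")
    # Python's %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068
    return tuple(datetime.strptime(part, "'%y").year for part in (start, end))


print(season_years("'03-'04"))  # (2003, 2004)
print(season_years("'99-'00"))  # (1999, 2000)
```

Returning plain integers keeps the helper independent of pandas; the years can still be passed to `pd.period_range` afterwards if a `PeriodIndex` is needed.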


In [0]:
df = pd.DataFrame([[2, 2, 2]])
season_range = pd.period_range(years[0], years[1], freq='Y' )
df['years'] = years[0]
df.set_index('years', inplace=True)
df

Unnamed: 0_level_0,0,1,2
years,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2003-01-01,2,2,2


## Points throughout the season
James is obviously a prolific scorer, but has he improved over time? Is he past his prime? 

A quick look at his total regular-season points for each season since his debut shows that he has been consistently good, scoring at least 1,500 points per season.

In [0]:
# Method 1 for requests for web scraping
# Import packages
# If you have a requirement to do this via a URL 
from urllib.request import urlopen, Request

# Specify the url
url = "http://www.google.com"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
# print(html)

# Be polite and close the response!
response.close()

In [1]:
# Another method using the `requests` package
# Import package
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
# print(plt.style.available)
plt.style.use('fivethirtyeight')
import pandas as pd


# Specify the url: url
# url = "http://www.google.com"

# Packages the request, send the request and catch the response: r
# r = requests.get(url)

# Extract the response: text
# text = r.text

# Print the html
# print(text)

# Create a BeautifulSoup object from the HTML: soup
# soup = BeautifulSoup(text)
# print(soup.prettify())
# print(soup.title)
# print(soup.get_text())

# Get all the links
# a_tags = soup.find_all('a')

# for link in a_tags:
# 	print(link)
# 	print(link.get('href'))

url = "http://www.espn.com/nba/player/stats/_/id/1966/lebron-james"
html = requests.get(url).text
soup = BeautifulSoup(html, features="html.parser")
print(soup.title.get_text())
tables = soup.find_all('table')
table_content = []

# Separate all the tables out from the page 
tester_tables = [table.get_text() for table in tables]
print(tester_tables)

# Remove the first table that is just the small summary of his stats 
tables.pop(0)
print(tables)

# Loop through the tables
for table in tables:
    # Get all the rows
    rows = table.find_all('tr')
    # Get the table's title
    table_title = rows[0].get_text()
    clean_rows = []
    # Remove the bottom row that is missing a column for the summary stats
    for row in rows[1:-1]:
        # print(row.get_text(' | '))
        clean_rows.append(row.get_text(' | '))

    print("\n\n")
    # Iterate through the clean rows and split into columns
    cleaner_rows = [row.split(' | ') for row in clean_rows]

    # Create a dataframe
    dataframe_version = pd.DataFrame(cleaner_rows)

    # Use the first row to create column names
    dataframe_version.columns = dataframe_version.iloc[0]

    # Drop that column name row
    dataframe_version.drop(dataframe_version.index[0], inplace=True)

    dataframe_version.set_index('SEASON', inplace=True)
    # dataframe_version['PTS'] = pd.to_numeric(dataframe_version['PTS'])
    table_content.append(dataframe_version)

    # dataframe_version.plot.bar(y='PTS', rot=0, figsize=(16,8))
    # plt.show()

LeBron James Stats - Los Angeles Lakers - ESPN
['PPGAPGRPGPER27.37.27.727.39Career27.27.27.4', "Regular Season AveragesSEASONTEAMGPGSMINFGM-AFG%3PM-A3P%FTM-AFT%ORDRREBASTBLKSTLPFTOPTS'03-'04CLE797939.57.9-18.9.4170.8-2.7.2904.4-5.8.7541.34.25.55.90.71.61.93.520.9'04-'05CLE808042.49.9-21.1.4721.4-3.9.3516.0-8.0.7501.46.07.47.20.72.21.83.327.2'05-'06CLE797942.511.1-23.1.4801.6-4.8.3357.6-10.3.7381.06.17.06.60.81.62.33.331.4'06-'07CLE787840.99.9-20.8.4761.3-4.0.3196.3-9.0.6981.15.76.76.00.71.62.23.227.3'07-'08CLE757440.410.6-21.9.4841.5-4.8.3157.3-10.3.7121.86.17.97.21.11.82.23.430.0'08-'09CLE818137.79.7-19.9.4891.6-4.7.3447.3-9.4.7801.36.37.67.21.11.71.73.028.4'09-'10CLE767639.010.1-20.1.5031.7-5.1.3337.8-10.2.7670.96.47.38.61.01.61.63.429.7'10-'11MIA797938.89.6-18.8.5101.2-3.5.3306.4-8.4.7591.06.57.57.00.61.62.13.626.7'11-'12MIA626237.510.0-18.9.5310.9-2.4.3626.2-8.1.7711.56.47.96.20.81.91.53.427.1'12-'13MIA767637.910.1-17.8.5651.4-3.3.4065.3-7.0.7531.36.88.07.30.91.71.43.026.8'13-'14MI

We have looped through the tables found on the page and stored each one as a DataFrame in the list `table_content`.

In [2]:
table_content

[0       TEAM  GP  GS   MIN      FGM-A   FG%    3PM-A   3P%     FTM-A   FT%  \
 SEASON                                                                       
 '03-'04  CLE  79  79  39.5   7.9-18.9  .417  0.8-2.7  .290   4.4-5.8  .754   
 '04-'05  CLE  80  80  42.4   9.9-21.1  .472  1.4-3.9  .351   6.0-8.0  .750   
 '05-'06  CLE  79  79  42.5  11.1-23.1  .480  1.6-4.8  .335  7.6-10.3  .738   
 '06-'07  CLE  78  78  40.9   9.9-20.8  .476  1.3-4.0  .319   6.3-9.0  .698   
 '07-'08  CLE  75  74  40.4  10.6-21.9  .484  1.5-4.8  .315  7.3-10.3  .712   
 '08-'09  CLE  81  81  37.7   9.7-19.9  .489  1.6-4.7  .344   7.3-9.4  .780   
 '09-'10  CLE  76  76  39.0  10.1-20.1  .503  1.7-5.1  .333  7.8-10.2  .767   
 '10-'11  MIA  79  79  38.8   9.6-18.8  .510  1.2-3.5  .330   6.4-8.4  .759   
 '11-'12  MIA  62  62  37.5  10.0-18.9  .531  0.9-2.4  .362   6.2-8.1  .771   
 '12-'13  MIA  76  76  37.9  10.1-17.8  .565  1.4-3.3  .406   5.3-7.0  .753   
 '13-'14  MIA  77  77  37.7  10.0-17.6  .567  1.5-4.
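
Every value in these tables is still a string, so nothing can be plotted or aggregated yet. The sketch below shows one way the columns might be converted, using three rows copied from the output above. The `sample` frame and the new column names `FGM` and `FGA` are assumptions for illustration only.

```python
import pandas as pd

# Three rows copied from the scraped output above; every value arrives as text
sample = pd.DataFrame(
    {
        "FGM-A": ["7.9-18.9", "9.9-21.1", "11.1-23.1"],
        "FG%": [".417", ".472", ".480"],
        "PTS": ["20.9", "27.2", "31.4"],
    },
    index=pd.Index(["'03-'04", "'04-'05", "'05-'06"], name="SEASON"),
)

# Split "made-attempted" into two float columns
sample[["FGM", "FGA"]] = sample["FGM-A"].str.split("-", expand=True).astype(float)

# Convert the remaining text columns to numbers
for column in ["FG%", "PTS"]:
    sample[column] = pd.to_numeric(sample[column])

sample = sample.drop(columns="FGM-A")
print(sample.dtypes)
print(sample["PTS"].idxmax())  # season with the highest points per game
```

The same steps could be applied to each frame in `table_content`, although the totals tables may need different handling than the averages table.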

In [0]:
# View the whole page 
# soup

In [0]:
# Get all rows
rows = soup.find_all('tr')

In [0]:
# Get all the cells from each row
td = [row.find_all('td') for row in rows]

In [5]:
import numpy
pd.DataFrame(numpy.array(td))

Unnamed: 0,0
0,"[<td>27.3</td>, <td>7.2</td>, <td>7.7</td>, <t..."
1,"[<td colspan=""3""><p>Career</p></td>]"
2,"[<td>27.2</td>, <td>7.2</td>, <td>7.4</td>]"
3,"[<td colspan=""21"">Regular Season Averages</td>]"
4,"[<td width=""6%"">SEASON</td>, <td width=""8%"">TE..."
5,"[<td>'03-'04</td>, <td><ul class=""game-schedul..."
6,"[<td>'04-'05</td>, <td><ul class=""game-schedul..."
7,"[<td>'05-'06</td>, <td><ul class=""game-schedul..."
8,"[<td>'06-'07</td>, <td><ul class=""game-schedul..."
9,"[<td>'07-'08</td>, <td><ul class=""game-schedul..."


In [0]:
# Generate a clean list of td values
# This function is probably the easiest way to remove html tags from objects
BeautifulSoup(str(td), "html.parser").get_text()

"[[27.6, 7.2, 7.9, 27.51], [Career], [27.2, 7.2, 7.4], [Regular Season Averages], [SEASON, TEAM, GP, GS, MIN, FGM-A, FG%, 3PM-A, 3P%, FTM-A, FT%, OR, DR, REB, AST, BLK, STL, PF, TO, PTS], ['03-'04, CLE, 79, 79, 39.5, 7.9-18.9, .417, 0.8-2.7, .290, 4.4-5.8, .754, 1.3, 4.2, 5.5, 5.9, 0.7, 1.6, 1.9, 3.5, 20.9], ['04-'05, CLE, 80, 80, 42.4, 9.9-21.1, .472, 1.4-3.9, .351, 6.0-8.0, .750, 1.4, 6.0, 7.4, 7.2, 0.7, 2.2, 1.8, 3.3, 27.2], ['05-'06, CLE, 79, 79, 42.5, 11.1-23.1, .480, 1.6-4.8, .335, 7.6-10.3, .738, 1.0, 6.1, 7.0, 6.6, 0.8, 1.6, 2.3, 3.3, 31.4], ['06-'07, CLE, 78, 78, 40.9, 9.9-20.8, .476, 1.3-4.0, .319, 6.3-9.0, .698, 1.1, 5.7, 6.7, 6.0, 0.7, 1.6, 2.2, 3.2, 27.3], ['07-'08, CLE, 75, 74, 40.4, 10.6-21.9, .484, 1.5-4.8, .315, 7.3-10.3, .712, 1.8, 6.1, 7.9, 7.2, 1.1, 1.8, 2.2, 3.4, 30.0], ['08-'09, CLE, 81, 81, 37.7, 9.7-19.9, .489, 1.6-4.7, .344, 7.3-9.4, .780, 1.3, 6.3, 7.6, 7.2, 1.1, 1.7, 1.7, 3.0, 28.4], ['09-'10, CLE, 76, 76, 39.0, 10.1-20.1, .503, 1.7-5.1, .333, 7.8-10.2, .767,

In [0]:
# Save the variable and split the values - slightly more difficult in our case as there are multiple tables here
# Note that this creates a string
clean_rows = BeautifulSoup(str(td), "html.parser").get_text()

In [0]:
table_rows = []
for row in rows:
    cells = row.find_all('td')
    text = BeautifulSoup(str(cells), "html.parser").get_text()
    table_rows.append(text)
    

In [8]:
pd.DataFrame([table_row.split(',') for table_row in table_rows])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,[27.3,7.2,7.7,27.39],,,,,,,,,,,,,,,,
1,[Career],,,,,,,,,,,,,,,,,,,
2,[27.2,7.2,7.4],,,,,,,,,,,,,,,,,
3,[Regular Season Averages],,,,,,,,,,,,,,,,,,,
4,[SEASON,TEAM,GP,GS,MIN,FGM-A,FG%,3PM-A,3P%,FTM-A,FT%,OR,DR,REB,AST,BLK,STL,PF,TO,PTS]
5,['03-'04,CLE,79,79,39.5,7.9-18.9,.417,0.8-2.7,.290,4.4-5.8,.754,1.3,4.2,5.5,5.9,0.7,1.6,1.9,3.5,20.9]
6,['04-'05,CLE,80,80,42.4,9.9-21.1,.472,1.4-3.9,.351,6.0-8.0,.750,1.4,6.0,7.4,7.2,0.7,2.2,1.8,3.3,27.2]
7,['05-'06,CLE,79,79,42.5,11.1-23.1,.480,1.6-4.8,.335,7.6-10.3,.738,1.0,6.1,7.0,6.6,0.8,1.6,2.3,3.3,31.4]
8,['06-'07,CLE,78,78,40.9,9.9-20.8,.476,1.3-4.0,.319,6.3-9.0,.698,1.1,5.7,6.7,6.0,0.7,1.6,2.2,3.2,27.3]
9,['07-'08,CLE,75,74,40.4,10.6-21.9,.484,1.5-4.8,.315,7.3-10.3,.712,1.8,6.1,7.9,7.2,1.1,1.8,2.2,3.4,30.0]
