
#Assignment Header
**Ryan Masters**
*   Project 2 Part 1
*   3/10/2022
*   Week 7
*   Disclosure: web-scraping method and code adapted from [DataQuest's 'Web Scraping NBA Stats With Python: Data Project [Part 1 of 3]'](https://www.youtube.com/watch?v=JGQGd-oa0l4&t=1073s)





#First proposed dataset (with web-scraping steps)

##Mount Data and Import Libraries

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [4]:
#Scrape combine data for 2000-2021 from Pro Football Reference
years = list(range(2000, 2022))

for year in years:
  data = requests.get(f'https://www.pro-football-reference.com/draft/{year}-combine.htm')
  with open('/content/drive/MyDrive/Project Notebooks/scraped pages/NFL_Combine_{}'.format(year), 'w+') as f:
    f.write(data.text)
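
The loop above requests all 22 pages back to back and writes whatever comes back, including error pages. A minimal sketch of a more defensive variant follows. The helper name `save_combine_pages`, the injected `fetch` callable, and the 3-second default delay are assumptions for illustration, not part of the original notebook. Only the URL pattern and file naming come from the cell above.

```python
import time
from pathlib import Path

# URL pattern taken from the scraping cell above
URL_TEMPLATE = "https://www.pro-football-reference.com/draft/{year}-combine.htm"


def save_combine_pages(years, fetch, out_dir, delay_s=3.0, sleep=time.sleep):
    """Save one combine page per year, skipping files that already exist.

    fetch(url) must return the page HTML as a string and raise on failure.
    Returns the list of years that were actually downloaded.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        target = out_dir / f"NFL_Combine_{year}"
        if target.exists():
            continue  # already scraped on an earlier run
        html = fetch(URL_TEMPLATE.format(year=year))
        target.write_text(html, encoding="utf-8")
        downloaded.append(year)
        sleep(delay_s)  # pause between requests (the delay value is an assumption)
    return downloaded


# Possible wiring with requests (not executed here):
# def fetch(url):
#     resp = requests.get(url, timeout=30)
#     resp.raise_for_status()  # fail loudly instead of saving an error page
#     return resp.text
```

Passing `fetch` in as an argument keeps the download logic testable without network access, and skipping existing files means a rerun only fetches years that are still missing.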

##Web scraping and DF prep

In [14]:
#Parse scraped pages and load each year's table into a list
from io import StringIO
from bs4 import BeautifulSoup

combine_stats = []

for year in years:
  with open('/content/drive/MyDrive/Project Notebooks/scraped pages/NFL_Combine_{}'.format(year)) as f:
    page = f.read()
  soup = BeautifulSoup(page, 'html.parser')
  table = soup.find(id='combine')
  #Wrap the HTML in StringIO: newer pandas versions deprecate passing a literal HTML string to read_html
  table_read = pd.read_html(StringIO(str(table)))[0]
  table_read['Year'] = year
  combine_stats.append(table_read)


In [15]:
#Concatenate list and review DF ('% nulls' is stored as a fraction, not a percentage)
combine_table = pd.concat(combine_stats)
report = pd.DataFrame({'columns': combine_table.columns,
                       'dtype': combine_table.dtypes,
                       'uniques': combine_table.nunique(),
                       'nulls': combine_table.isna().sum(),
                       '% nulls': combine_table.isna().sum()/len(combine_table)}).reset_index(drop=True)
display(report)
combine_table.head()

Unnamed: 0,columns,dtype,uniques,nulls,% nulls
0,Player,object,7238,0,0.0
1,Pos,object,26,0,0.0
2,School,object,325,0,0.0
3,College,object,2,1415,0.188893
4,Ht,object,20,29,0.003871
5,Wt,object,208,24,0.003204
6,40yd,object,160,385,0.051395
7,Vertical,object,59,1647,0.219864
8,Bench,object,46,2333,0.31144
9,Broad Jump,object,64,1715,0.228941


Unnamed: 0,Player,Pos,School,College,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Drafted (tm/rnd/yr),Year
0,John Abraham,OLB,South Carolina,,6-4,252,4.55,,,,,,New York Jets / 1st / 13th pick / 2000,2000
1,Shaun Alexander,RB,Alabama,College Stats,6-0,218,4.58,,,,,,Seattle Seahawks / 1st / 19th pick / 2000,2000
2,Darnell Alford,OT,Boston Col.,,6-4,334,5.56,25.0,23.0,94.0,8.48,4.98,Kansas City Chiefs / 6th / 188th pick / 2000,2000
3,Kyle Allamon,TE,Texas Tech,,6-2,253,4.97,29.0,,104.0,7.29,4.49,,2000
4,Rashard Anderson,CB,Jackson State,,6-2,206,4.55,34.0,,123.0,7.18,4.15,Carolina Panthers / 1st / 23rd pick / 2000,2000


In [17]:
#Drop 'College' column, which only holds a hyperlink
combine_table.drop(columns='College', inplace=True)

In [18]:
#Drop extra header rows (keep=False removes every copy of any fully duplicated row)
combine_table.drop_duplicates(keep=False, inplace=True)
combine_table = combine_table.reset_index(drop=True)
combine_table.head(51)

Unnamed: 0,Player,Pos,School,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Drafted (tm/rnd/yr),Year
0,John Abraham,OLB,South Carolina,6-4,252,4.55,,,,,,New York Jets / 1st / 13th pick / 2000,2000
1,Shaun Alexander,RB,Alabama,6-0,218,4.58,,,,,,Seattle Seahawks / 1st / 19th pick / 2000,2000
2,Darnell Alford,OT,Boston Col.,6-4,334,5.56,25.0,23.0,94.0,8.48,4.98,Kansas City Chiefs / 6th / 188th pick / 2000,2000
3,Kyle Allamon,TE,Texas Tech,6-2,253,4.97,29.0,,104.0,7.29,4.49,,2000
4,Rashard Anderson,CB,Jackson State,6-2,206,4.55,34.0,,123.0,7.18,4.15,Carolina Panthers / 1st / 23rd pick / 2000,2000
5,Jake Arians,K,Ala-Birmingham,5-10,202,,,,,,,,2000
6,LaVar Arrington,OLB,Penn State,6-3,250,4.53,,,,,,Washington Redskins / 1st / 2nd pick / 2000,2000
7,Corey Atkins,OLB,South Carolina,6-0,237,4.72,31.0,21.0,112.0,7.96,4.39,,2000
8,Kyle Atteberry,K,Baylor,6-0,167,,,,,,,,2000
9,Reggie Austin,CB,Wake Forest,5-9,175,4.44,35.0,17.0,119.0,7.03,4.14,Chicago Bears / 4th / 125th pick / 2000,2000
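
`drop_duplicates(keep=False)` removes header rows only when they are exact duplicates of each other, and it would also remove any genuine rows that happen to be identical. A more explicit alternative is to filter on the header text itself. The sketch below uses a toy frame, not the scraped data, and the helper name `drop_repeated_headers` is hypothetical.

```python
import pandas as pd


def drop_repeated_headers(df, key_col="Player"):
    """Remove rows where the key column just repeats its own column name."""
    is_header = df[key_col] == key_col
    return df.loc[~is_header].reset_index(drop=True)


# Toy data: two repeated header rows in 2000, a single one in 2001
toy = pd.DataFrame({
    "Player": ["John Abraham", "Player", "Shaun Alexander", "Player", "Player"],
    "Pos": ["OLB", "Pos", "RB", "Pos", "Pos"],
    "Year": [2000, 2000, 2000, 2000, 2001],
})

cleaned = drop_repeated_headers(toy)

# For comparison: the lone 2001 header row has no duplicate, so it survives here
by_duplicates = toy.drop_duplicates(keep=False)
```

In the toy frame the explicit filter leaves only the two player rows, while the duplicate-based approach keeps the single 2001 header row because nothing else matches it.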


In [19]:
#Convert numeric columns to float
#'Ht' is converted later, since its feet-inches strings (e.g. '6-4') need reformatting first
numeric_cols = ['Wt', '40yd', 'Vertical', 'Bench', 'Broad Jump', '3Cone', 'Shuttle']
combine_table[numeric_cols] = combine_table[numeric_cols].astype(float)

#Note: 'report' was built in cell 15, so the output below still reflects the DF before the 'College' drop and this conversion
report

Unnamed: 0,columns,dtype,uniques,nulls,% nulls
0,Player,object,7238,0,0.0
1,Pos,object,26,0,0.0
2,School,object,325,0,0.0
3,College,object,2,1415,0.188893
4,Ht,object,20,29,0.003871
5,Wt,object,208,24,0.003204
6,40yd,object,160,385,0.051395
7,Vertical,object,59,1647,0.219864
8,Bench,object,46,2333,0.31144
9,Broad Jump,object,64,1715,0.228941
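
Because `report` is a snapshot, displaying it again does not pick up later changes to `combine_table`. One way around this is to wrap the summary in a function and call it after each cleaning step. This is a minimal sketch on toy data. The name `null_report` and the `null_fraction` column label are my own choices, not from the notebook.

```python
import pandas as pd


def null_report(df):
    """Summarise dtype, unique count, and missing values for each column."""
    return pd.DataFrame({
        "columns": list(df.columns),
        "dtype": [str(t) for t in df.dtypes],
        "uniques": df.nunique().tolist(),
        "nulls": df.isna().sum().tolist(),
        "null_fraction": df.isna().mean().tolist(),
    })


# Toy frame: 'Wt' starts out as strings, 'Bench' has missing values
toy = pd.DataFrame({
    "Wt": ["252", "218", "334"],
    "Bench": [None, 23.0, None],
})

before = null_report(toy)
toy["Wt"] = toy["Wt"].astype(float)
after = null_report(toy)  # recomputed, so it reflects the new dtype
```

Calling `null_report(combine_table)` after the drop and the `astype(float)` step would then show the current columns and dtypes instead of the earlier ones.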


In [21]:
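#Note: despite the '2000_2020' in the file name, this file holds the 2000-2021 combines scraped above
combine_table.to_csv(path_or_buf='/content/drive/MyDrive/Project Notebooks/scraped pages/Combine_2000_2020.csv')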

##Answers to Assignment Questions, DF header for First Choice

In [27]:
print(combine_table.shape)
print(combine_table.info())
combine_table.head()

(7356, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7356 entries, 0 to 7355
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Player               7356 non-null   object 
 1   Pos                  7356 non-null   object 
 2   School               7356 non-null   object 
 3   Ht                   7327 non-null   object 
 4   Wt                   7332 non-null   float64
 5   40yd                 6971 non-null   float64
 6   Vertical             5709 non-null   float64
 7   Bench                5023 non-null   float64
 8   Broad Jump           5641 non-null   float64
 9   3Cone                4700 non-null   float64
 10  Shuttle              4794 non-null   float64
 11  Drafted (tm/rnd/yr)  4714 non-null   object 
 12  Year                 7356 non-null   int64  
dtypes: float64(7), int64(1), object(5)
memory usage: 747.2+ KB
None


Unnamed: 0,Player,Pos,School,Ht,Wt,40yd,Vertical,Bench,Broad Jump,3Cone,Shuttle,Drafted (tm/rnd/yr),Year
0,John Abraham,OLB,South Carolina,6-4,252.0,4.55,,,,,,New York Jets / 1st / 13th pick / 2000,2000
1,Shaun Alexander,RB,Alabama,6-0,218.0,4.58,,,,,,Seattle Seahawks / 1st / 19th pick / 2000,2000
2,Darnell Alford,OT,Boston Col.,6-4,334.0,5.56,25.0,23.0,94.0,8.48,4.98,Kansas City Chiefs / 6th / 188th pick / 2000,2000
3,Kyle Allamon,TE,Texas Tech,6-2,253.0,4.97,29.0,,104.0,7.29,4.49,,2000
4,Rashard Anderson,CB,Jackson State,6-2,206.0,4.55,34.0,,123.0,7.18,4.15,Carolina Panthers / 1st / 23rd pick / 2000,2000


1) Source of data


**This is NFL Combine data from 2000-2021. This data was scraped from Pro Football Reference.**




2) Brief description of data

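**This data includes measurements and drill results from the NFL Combine: height, weight, 40-yard dash, vertical jump, bench press, broad jump, 3-cone drill, and shuttle, along with each player's position, school, combine year, and draft result.**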

3) What is the target?

**The target will be draft round. Currently the 'Drafted (tm/rnd/yr)' column is in string format, but I plan to convert it to either an overall draft position or a round.**
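
As a rough sketch of that conversion, the draft strings shown in the table above (for example `New York Jets / 1st / 13th pick / 2000`) can be split on `/` and the leading digits pulled out of the round and pick parts. The function name `parse_drafted` and the choice to return `None` for missing or unexpected values are assumptions, and the four-part format is inferred only from the rows displayed in this notebook.

```python
import re


def parse_drafted(text):
    """Split 'Team / 1st / 13th pick / 2000' into team, round, pick, and year.

    Returns None for missing values (e.g. NaN for undrafted players)
    or for strings that do not match the expected four-part layout.
    """
    if not isinstance(text, str) or not text.strip():
        return None
    parts = [p.strip() for p in text.split("/")]
    if len(parts) != 4:
        return None
    team, rnd, pick, year = parts
    rnd_match = re.match(r"(\d+)", rnd)    # '1st' -> '1'
    pick_match = re.match(r"(\d+)", pick)  # '13th pick' -> '13'
    if not (rnd_match and pick_match and year.isdigit()):
        return None
    return {
        "team": team,
        "round": int(rnd_match.group(1)),
        "pick": int(pick_match.group(1)),
        "year": int(year),
    }


example = parse_drafted("New York Jets / 1st / 13th pick / 2000")
```

The `round` value would serve as a classification target and the `pick` value as a regression target, so one parse supports either choice.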

4) Is this a classification or regression problem?

**If I choose to just use the draft round as a target, then this is a multi-class classification problem. If I use overall draft placement, this will be a regression problem.**

5) How many features?

**13 columns in total (including 'Drafted (tm/rnd/yr)', which holds the target), though I may drop 1 or 2**

6) How many rows of data?

**7,356**

7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?

**This dataset is full of null values, since players participate (or don't) in different drills based on position. I will have to decide which imputation method is best, especially for the columns with a high share of nulls (over 30% for 'Bench', '3Cone', and 'Shuttle'). I'll also need to transform the 'Drafted (tm/rnd/yr)' and 'Ht' columns, which may be tricky.**
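
For the 'Ht' column, the values shown above are feet-inches strings such as `6-4`. A minimal sketch of converting them to total inches follows. The function name `height_to_inches` and the decision to return `None` for anything that does not match the pattern are assumptions.

```python
def height_to_inches(ht):
    """Convert a feet-inches string such as '6-4' to total inches.

    Returns None for missing values or strings that are not 'digits-digits'.
    """
    if not isinstance(ht, str):
        return None
    feet, sep, inches = ht.strip().partition("-")
    if not sep or not feet.isdigit() or not inches.isdigit():
        return None
    return int(feet) * 12 + int(inches)


# Values taken from the rows displayed earlier in the notebook
samples = {ht: height_to_inches(ht) for ht in ["6-4", "5-10", "6-0"]}
```

This could be applied with something like `combine_table['Ht'].map(height_to_inches)`. Any leftover header text or missing heights would come back as missing rather than raising an error.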

#Second proposed dataset

In [26]:
filename = '/content/drive/MyDrive/Project Notebooks/Datasets for personal projects/Team_Stat_Table_Main.csv'
df = pd.read_csv(filename)
df = df.drop(columns='Unnamed: 0')
print(df.shape)
print(df.info())
df.head()

(1164, 27)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 27 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rk            1164 non-null   float64
 1   Team          1164 non-null   object 
 2   G             1164 non-null   int64  
 3   MP            1164 non-null   int64  
 4   FG            1164 non-null   int64  
 5   FGA           1164 non-null   int64  
 6   FG%           1164 non-null   float64
 7   3P            1164 non-null   float64
 8   3PA           1164 non-null   float64
 9   3P%           1164 non-null   float64
 10  2P            1164 non-null   int64  
 11  2PA           1164 non-null   int64  
 12  2P%           1164 non-null   float64
 13  FT            1164 non-null   int64  
 14  FTA           1164 non-null   int64  
 15  FT%           1164 non-null   float64
 16  ORB           1164 non-null   int64  
 17  DRB           1164 non-null   int64  
 18  TRB           116

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Season,playoff_team
0,1.0,San Antonio Spurs,82,19755,3856,7738,0.498,52.0,206.0,0.252,...,2515,3668,2326,771,333,1589,2103,9788,1980,True
1,2.0,Los Angeles Lakers,82,19880,3898,7368,0.529,20.0,100.0,0.2,...,2653,3738,2413,774,546,1639,1784,9438,1980,True
2,3.0,Cleveland Cavaliers,82,19930,3811,8041,0.474,36.0,187.0,0.193,...,2381,3688,2108,764,342,1370,1934,9360,1980,False
3,4.0,New York Knicks,82,19780,3802,7672,0.496,42.0,191.0,0.22,...,2303,3539,2265,881,457,1613,2168,9344,1980,False
4,5.0,Boston Celtics,82,19880,3617,7387,0.49,162.0,422.0,0.384,...,2457,3684,2198,809,308,1539,1974,9303,1980,True


##Answers to Assignment Questions, DF header for Second Choice

1) Source of data

**I scraped this data from [Basketball Reference](https://www.basketball-reference.com/)**.

2) Brief description of data

**These are overall team stats for every season since the introduction of the 3-pt line in 1980**

3) What is the target?

**The target is 'playoff_team', which is a binary (boolean) category. The goal is to predict this season's playoff teams.**

4) Is this a classification or regression problem?

**This is a binary classification problem.**

5) How many features?

**This dataset has 27 features**

6) How many rows of data?

**This dataset has 1164 rows**

7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?

**This dataset is pretty straightforward. The only thing that might be tricky in modelling is working with 27 features, which is more than I have worked with so far.**