# NCAA Men's Basketball Player / Team Stats Scraping

#### This notebook can be used to scrape men's NCAA basketball player / team stats from https://www.sports-reference.com/

## Introduction

When I started on the March Machine Learning Mania 2021 - NCAAM contest, I thought player stats might help predict game winners.

As the competition data didn't include player stats, I decided to create a notebook to get that data!

## Prep

In [None]:
## library imports

## linear algebra
import numpy as np

## file IO / processing
import pandas as pd

## system
import os 

## requests
import requests

## results storage
from collections import defaultdict

In [None]:
## scraping

! pip install beautifulsoup4

from bs4 import BeautifulSoup

In [None]:
## functions 

def get_soup(url,parser='html.parser',verify=True):
    """
        Gets the BeautifulSoup object which holds the data to be scraped
        
        Args:
            url (str): URL of the page to be scraped
            parser (str): Name of the parser BeautifulSoup should use
            verify (bool): Whether requests should verify the TLS certificate
        Returns:
            BeautifulSoup object
    
    """
    request = get_request_response(url,verify=verify)
    
    return BeautifulSoup(request.content,parser)


def get_request_response(url,verify=True):
    """
    
        Requests the content to be scraped
        
        Args:
            url (str): URL of the page to be scraped
            verify (bool): Whether requests should verify the TLS certificate
            
        Returns:
            requests.Response object
    
    """
    
    return requests.get(url,verify=verify)


def find_all(soup,**kwargs):
    """
        Returns all the PageElement tags which match the given criteria
        See BeautifulSoup find_all documentation
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            **kwargs: search criteria passed straight to soup.find_all
            
        Returns:
            list of PageElements
    
    """
    return soup.find_all(**kwargs)

def find(soup,**kwargs):
    """
        Returns the first PageElement tag which matches the given criteria
        See BeautifulSoup find documentation
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            **kwargs: search criteria passed straight to soup.find
            
        Returns:
            PageElement instance, or None if nothing matches
    
    """
    return soup.find(**kwargs)


def find_per_game_table(soup):
    """
    
        Returns the per game stats table PageElement tag
        
        See BeautifulSoup find documentation
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            
        Returns:
            per game stats table PageElement tag instance, or None if no table matches
    
    """
    return find(soup,name='table',attrs={'class':'sortable stats_table','id':'per_game'})


def find_stats_tags_from_soup(soup,name):
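    """
    
        Returns the td PageElement tags of the chosen table
        
        See BeautifulSoup find_all documentation
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            name (str): name of the table to be scraped (only 'per game' is supported)
            
        Returns:
            list of td PageElement tags
            
        Raises:
            NotImplementedError: if name is not a supported table
            ValueError: if the table is not found in the page
    
    """
    if name == 'per game':
        table_soup = find_per_game_table(soup)
    else:
        raise NotImplementedError('Invalid table name')
        
    ## fail with a clear message instead of an AttributeError on None
    if table_soup is None:
        raise ValueError('Table not found in page: ' + name)
        
    return find_all(soup=table_soup,name='td')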

def get_stats(tags):
    """
        Get stats from the table tags
        
        Args:
            tags (list(PageElement tag)): list of tag elements which contain player information
            
        Returns:
            dictionary mapping each data-stat name to the list of its cell values
    
    """
    game_stats = defaultdict(list)
    
    for tag in tags:
        game_stats[tag['data-stat']].append(tag.string)
        
    return game_stats
        
def get_stats_from_soup(soup,name):
    """
        Get player stats from BeautifulSoup object
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            name (str): name of the table to be scraped
            
        Returns:
            dictionary of player stats
            
    
    """
    tags = find_stats_tags_from_soup(soup,name=name)
    return get_stats(tags)

def get_stats_frame_from_soup(soup,name):
    """
        Get player stats DataFrame from BeautifulSoup object
        
        Args:
            soup (BeautifulSoup): BeautifulSoup object containing the web page content
            name (str): name of the table to be scraped
            
        Returns:
            pandas DataFrame of player stats
            
    
    """
    return pd.DataFrame(data=get_stats_from_soup(soup,name=name))

def scrape_per_game_stats_from_url(url,school=None,season=None):
    """
        Scrape per game player stats from a URL and label them with school and season
        
        Args:
            url (str): URL of the page to be scraped
            school (str): school name stored in the School column
            season (str): season label stored in the Season column
            
        Returns:
            pandas DataFrame of player stats
            
    
    """
    soup = get_soup(url=url)
    df = get_stats_frame_from_soup(soup=soup,name='per game')
    df['School'] = school
    df['Season'] = season
    return df
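
To see what the helpers above do without making a network request, the following is a minimal offline sketch. The HTML snippet, player names, and stat values are made up for illustration; only the table class, the table id, and the use of the data-stat attribute come from the functions above. It repeats the same find / find_all / defaultdict steps on that snippet.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical, hand-written markup that imitates the layout the scraper expects.
SAMPLE_HTML = """
<table class="sortable stats_table" id="per_game">
  <tbody>
    <tr><td data-stat="player">Player A</td><td data-stat="pts_per_g">18.5</td></tr>
    <tr><td data-stat="player">Player B</td><td data-stat="pts_per_g">11.2</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same search criteria as find_per_game_table
table = soup.find(name="table", attrs={"class": "sortable stats_table", "id": "per_game"})

# Same grouping as get_stats: one list of cell values per data-stat name
stats = defaultdict(list)
for cell in table.find_all(name="td"):
    stats[cell["data-stat"]].append(cell.string)

print(dict(stats))
```

Each key becomes a DataFrame column, so every data-stat name has to appear the same number of times for pd.DataFrame to accept the dictionary.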

## Example

In [None]:
## Gonzaga 2020-2021

url = "https://www.sports-reference.com/cbb/schools/gonzaga/2021.html#all_schools_per_game"
df = scrape_per_game_stats_from_url(url=url,school='Gonzaga',season='2020-2021')
df

In [None]:
## Duke 2018-2019

url = "https://www.sports-reference.com/cbb/schools/duke/2019.html#all_schools_per_game"
df = scrape_per_game_stats_from_url(url=url,school='Duke',season='2018-2019')
df
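
Both examples use the same URL pattern, so the inputs for several teams can be generated rather than typed by hand. The sketch below is one possible way to do that: build_school_url, season_label, and the targets list are hypothetical helpers that are not part of the notebook, and the URL slug for any school other than the two shown above is an assumption that should be checked against the site. The "#all_schools_per_game" fragment is left out because URL fragments are not sent to the server.

```python
BASE_URL = "https://www.sports-reference.com/cbb/schools"


def build_school_url(school_slug, end_year):
    """Return the season page URL for a school slug and the year the season ends."""
    return f"{BASE_URL}/{school_slug}/{end_year}.html"


def season_label(end_year):
    """Format a season like the examples above, e.g. 2021 becomes '2020-2021'."""
    return f"{end_year - 1}-{end_year}"


# (school name, URL slug, year the season ends)
targets = [
    ("Gonzaga", "gonzaga", 2021),
    ("Duke", "duke", 2019),
]

# One dictionary of keyword arguments per page to scrape
jobs = [
    {"url": build_school_url(slug, year), "school": school, "season": season_label(year)}
    for school, slug, year in targets
]

for job in jobs:
    print(job)
```

Each dictionary can then be passed on as scrape_per_game_stats_from_url(**job), and the resulting frames combined with pd.concat. Pausing between requests is a courteous default when scraping more than a couple of pages.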

## Notes 
* Every column of the scraped DataFrame has dtype object (the values are strings), so convert the numeric columns before doing any arithmetic on them.
* I implemented the find_stats_tags_from_soup function so that other tables can be added later if this notebook proves useful.
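
Because the scraped values arrive as strings, a conversion step is usually needed before modelling. The following is a small sketch of one way to do it with pd.to_numeric. The frame is a made-up stand-in for a scraped result, and the column names and the choice of which columns hold text are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for a scraped frame: every value is a string or missing
df = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "g": ["26", "24"],
    "pts_per_g": ["18.5", None],
    "School": ["Gonzaga", "Gonzaga"],
})

# Assumed label columns that should stay as text
text_columns = ["player", "School"]
numeric_columns = [c for c in df.columns if c not in text_columns]

# errors="coerce" turns anything unparseable into NaN instead of raising
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Using errors="coerce" keeps the conversion from failing on empty cells, at the cost of silently turning unexpected text into NaN, so it is worth checking the NaN counts afterwards.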

#### Enjoy! 