# NHL Draft data from NHL Records API
# Feature extraction
This notebook presents the feature extraction process for NHL Draft data collected from the NHL Records API. The previous steps were data collection from the API and data cleanup.
This workflow derives additional features and adds them to the cleaned NHL Draft dataset obtained from the NHL Records API.


### Data collection summary
The dataset was generated from the JSON response returned by the NHL Records API to a request for all draft records.

For details, see notebook `notebooks/feature_extraction/nhl_api.ipynb`.

### Data cleanup summary
* fixed capitalization of Amateur Club Names and Player Names
* fixed inconsistent team names for some Russian Amateur Club Names
    * e.g., 'Ska Leningrad', 'St. Petersburg Ska', 'Ska St. Petersburg', 'Leningrad Ska', and 'St. Petersburg Ska St. Petersburg' were replaced with 'Ska'
    * team names fixed only for Ska, Ska2, Cska, and Cska 2
* fixed 2 erroneous height values and 3 erroneous weight values (replaced with the mean)
* fixed inconsistent names for Russian leagues
    * 'Russia', 'RUS', 'RUSSIA', 'RUSSIA-1' were changed to 'Russia'
    * 'RUSSIA-JR.', 'RUSSIA JR. 2' were changed to 'Russia-Jr.'
* removed redundant positions
    * all players who can play center are assumed to be centers
        * C/RW (17 players), C/LW (30 players), F (362 players) = C
    * players who can play both wings are assumed to play right wing
        * LW/RW (13 players) = RW
    * mixed D positions are assumed to be D
        * LW/D (1 player), D/F (1 player) = D
* filtered columns and renamed them to shorten their labels

* data problems identified but not currently addressed:
    * inconsistencies in some Amateur Club Names (e.g., 'London Knights' and 'London')
    * `pickInRound` appears to have erratic values (will be addressed in this notebook)

* data problems potentially present (not verified):
    * inconsistent names for Russian teams other than CSKA and SKA (those were fixed)

For details, see notebook `notebooks/feature_extraction/nhl_draft_api_cleanup.ipynb`.

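The position consolidation and league-name normalization listed above can be sketched with exact-match replacement maps. This is only an illustration of the rules as summarized here, not the code from the cleanup notebook; the column names `position` and `league` and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows; the column names `position` and `league` are hypothetical.
draft = pd.DataFrame({
    "position": ["C/RW", "F", "LW/RW", "D/F", "G"],
    "league": ["RUS", "RUSSIA-1", "RUSSIA JR. 2", "OHL", "Russia"],
})

# Position rules from the cleanup summary.
position_map = {
    "C/RW": "C", "C/LW": "C", "F": "C",  # anyone who can play center -> C
    "LW/RW": "RW",                       # both wings -> right wing
    "LW/D": "D", "D/F": "D",             # mixed D positions -> D
}

# League rules from the cleanup summary.
league_map = {
    "RUS": "Russia", "RUSSIA": "Russia", "RUSSIA-1": "Russia",
    "RUSSIA-JR.": "Russia-Jr.", "RUSSIA JR. 2": "Russia-Jr.",
}

# Series.replace with a dict swaps whole values; unmapped values are kept.
draft["position"] = draft["position"].replace(position_map)
draft["league"] = draft["league"].replace(league_map)
```

Values that are not in a map (such as 'G' or 'OHL' in the toy data) pass through unchanged, which keeps the mapping safe to re-run.
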
## Description of features
Features to be added:
* `num_teams`: int  
    number of teams in each draft year
* `top5c`: Boolean  
    whether the prospect is a national of one of the "top 5 hockey countries": Canada, US, Sweden, Finland, or Russia.
* `age`: int  
    age when drafted
* `age_1g`: int  
    age when played the 1st NHL game
* `round`: int  
    round in which a prospect has been drafted to the NHL
* `round_ratio`: float  
    * captures how high within a round a prospect was drafted, in addition to the round itself
    * $\text{round_ratio} = \frac{\text{#overall} - 1} {\text{num_teams}}$
* `1st_round_pick`: Boolean  
    whether a prospect was drafted in the first round
* `bmi`: float  
    * [body-mass index](https://en.wikipedia.org/wiki/Body_mass_index) of a prospect at the time of draft
    * $\text{BMI} = \frac{\text{mass}} {\text{height}^2}$, with mass in kilograms and height in meters
    * BMI is a convenient rule of thumb used to broadly categorize a person as underweight, normal weight, overweight, or obese based on tissue mass (muscle, fat, and bone) and height
    * [commonly accepted](https://www.who.int/gho/ncd/risk_factors/bmi_text/en/) BMI ranges (in kg/m²) are:
        * underweight: under 18.5
        * normal weight: 18.5 to 25
        * overweight: 25 to 30
        * obese: over 30
* `pshoots`: category (string)
    * player position + shooting hand concatenated together
    * e.g., a right wing shooting left would have a `pshoots` value of `'RW-L'`
    * for goalies, shooting hand corresponds to catching hand (glove hand)
* `zod`: category (string)
    * zodiac sign of a player, based on their birth date (for fun)

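As a rough illustration of the definitions above, several of these features can be derived with a few vectorized pandas operations. This is a minimal sketch under stated assumptions, not the notebook's actual implementation: every input column name (`year`, `team`, `overall`, `round`, `country`, `height`, `weight`, `position`, `shoots`) is hypothetical, height and weight are assumed to be in inches and pounds, countries are assumed to be three-letter codes, and `num_teams` is assumed to be the number of distinct drafting teams in a draft year.

```python
import pandas as pd

# Toy data; all column names and values are hypothetical.
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015],
    "team": ["EDM", "BUF", "ARI", "EDM"],
    "overall": [1, 2, 3, 4],
    "round": [1, 1, 1, 2],
    "country": ["CAN", "USA", "CZE", "SWE"],
    "height": [73, 74, 71, 72],      # assumed to be inches
    "weight": [190, 196, 180, 185],  # assumed to be pounds
    "position": ["C", "C", "RW", "G"],
    "shoots": ["L", "R", "L", "L"],
})

# Assumed three-letter codes for the "top 5 hockey countries".
TOP5 = {"CAN", "USA", "SWE", "FIN", "RUS"}

# num_teams: distinct drafting teams per draft year (one possible definition).
df["num_teams"] = df.groupby("year")["team"].transform("nunique")

# top5c: nationality is one of the top 5 countries.
df["top5c"] = df["country"].isin(TOP5)

# round_ratio = (overall - 1) / num_teams, as defined above.
df["round_ratio"] = (df["overall"] - 1) / df["num_teams"]

# 1st_round_pick: drafted in round 1.
df["1st_round_pick"] = df["round"] == 1

# bmi = mass / height^2, after converting to kilograms and meters.
height_m = df["height"] * 0.0254
mass_kg = df["weight"] * 0.45359237
df["bmi"] = mass_kg / height_m ** 2

# pshoots: position and shooting hand joined with a dash, e.g. 'RW-L'.
df["pshoots"] = (df["position"] + "-" + df["shoots"]).astype("category")
```

In this toy draft with three teams, the fourth overall pick gets a `round_ratio` of exactly 1.0, i.e. the first pick of the second round, so the integer part tracks the round and the fractional part tracks the position within it.
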
### Basic skater-specific stats
#### Post-draft totals of the player in the NHL
* `gp`: int  
    total games played since being drafted
* `g`: int  
    total goals scored
* `a`: int  
    total assists
* `p`: int  
    total points scored
* `pm`: int  
    total +/- of the player in the NHL
* `pim`: int  
    penalty infraction minutes
#### Post-draft averages of the player in the NHL
* `agp`: float  
    average games played per season
* `apm`: float  
    average +/- of the player per season
* `apim`: float  
    average penalty infraction minutes per season
* `gpg`: float  
    average goals per game
* `apg`: float  
    average assists per game
* `ppg`: float  
    average points per game

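The averages above follow directly from the totals. The sketch below assumes a hypothetical `seasons` column holding the number of NHL seasons played (the source does not say how seasons are counted) and treats players with no NHL games as missing values rather than dividing by zero.

```python
import numpy as np
import pandas as pd

# Toy totals; the `seasons` column is a hypothetical input.
stats = pd.DataFrame({
    "gp": [164, 0, 20],
    "g": [40, 0, 1],
    "a": [60, 0, 3],
    "pm": [10, 0, -4],
    "pim": [82, 0, 10],
    "seasons": [2, 0, 1],
})

# Points are goals plus assists.
stats["p"] = stats["g"] + stats["a"]

# Replace zero denominators with NaN so the averages come out as NaN.
gp = stats["gp"].replace(0, np.nan)
seasons = stats["seasons"].replace(0, np.nan)

# Per-season averages.
stats["agp"] = stats["gp"] / seasons
stats["apm"] = stats["pm"] / seasons
stats["apim"] = stats["pim"] / seasons

# Per-game averages.
stats["gpg"] = stats["g"] / gp
stats["apg"] = stats["a"] / gp
stats["ppg"] = stats["p"] / gp
```

Leaving undrafted-to-NHL players as NaN keeps them distinguishable from players who appeared in games but averaged zero.
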
### Basic goaltender-specific stats
More goaltender statistics are defined in the [hockeygoalies.org glossary](http://hockeygoalies.org/stats/glossary.html).
* `gp1`: games played as a starting goalie
* `W`: wins
* `L`: losses
* `T/O`: _?_
* `SV%`: save percentage
* `GAA`: goals against average

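For reference, the two rate statistics in this list have standard definitions: save percentage is saves divided by shots against, and goals against average is goals against per 60 minutes played. The sketch below applies them to toy numbers; the input column names (`shots_against`, `goals_against`, `minutes`) are hypothetical and may not match the fields available in the dataset.

```python
import pandas as pd

# Toy goaltender totals; input column names are hypothetical.
goalies = pd.DataFrame({
    "shots_against": [1000, 500],
    "goals_against": [80, 50],
    "minutes": [2400.0, 1200.0],
})

# Saves are shots against that did not result in a goal.
saves = goalies["shots_against"] - goalies["goals_against"]

# SV%: fraction of shots against that were saved.
goalies["SV%"] = saves / goalies["shots_against"]

# GAA: goals against scaled to a 60-minute game.
goalies["GAA"] = goalies["goals_against"] * 60 / goalies["minutes"]
```
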
## Preparations
### Import dependencies

### Load data