From [jacklich10/nfl-draft-data](https://github.com/jacklich10/nfl-draft-data):

If you read the .csv files in R with the commonly used readr::read_csv function, pass guess_max = 13000 to avoid parsing errors; by default, readr guesses column types from only the first 1000 rows. Alternatively, use the base read.csv function, which handles this without intervention.

from https://github.com/jacklich10/nfl-draft-data: 

If you are joining the datasets together, player_id uniquely identifies each player across all of the data.
Be wary of pos_abbr, as the abbreviations sometimes (though rarely) differ across datasets. I have had no issues with school, school_name, and school_abbr, but player_id will always join the data correctly.

If you are joining in ESPN college QBR statistics from college_qbr.csv, join on guid and player_name.
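
The join advice above can be sketched in pandas as follows. This is a minimal illustration, not the repository author's code: the two frames and their values are made up, only the column names player_id, player_name, and pos_abbr come from the dataset, and validate="one_to_one" is an optional pandas check that the key is unique on both sides.

```python
import pandas as pd

# Hypothetical stand-ins for two of the dataset's tables (toy values).
prospects = pd.DataFrame({
    "player_id": [1, 2, 3],
    "player_name": ["A Player", "B Player", "C Player"],
    "pos_abbr": ["QB", "RB", "DE"],
})
profiles = pd.DataFrame({
    "player_id": [1, 2, 4],
    "pos_abbr": ["QB", "HB", "LB"],  # abbreviations may differ between files
    "grade": [90.0, 75.0, 60.0],
})

# Join on player_id only; keep both pos_abbr columns so differences stay visible.
joined = prospects.merge(
    profiles,
    on="player_id",
    how="left",
    suffixes=("", "_prof"),
    validate="one_to_one",  # raises MergeError if player_id repeats on either side
)
print(joined)
```

Joining on player_id alone, rather than on player_id plus pos_abbr, avoids silently dropping rows where the position abbreviation differs between files.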

## nfl_draft_prospects.csv

ESPN information on past NFL draft prospects, dating back to 1967 (the first year of the common draft).
If the player has NA values for pick, overall, and round, it means he went undrafted or the draft has not occurred yet (for current year prospects).
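
One way to make that rule explicit is a boolean flag column. The sketch below uses toy rows with made-up values, and the column name undrafted is a hypothetical addition that does not exist in the CSV. Note that the flag cannot distinguish an undrafted player from a current-year prospect whose draft has not happened yet.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are illustrative, not taken from the CSV.
toy = pd.DataFrame({
    "player_name": ["Drafted Player", "Undrafted Player"],
    "pick": [3.0, np.nan],
    "overall": [3.0, np.nan],
    "round": [1.0, np.nan],
})

# Flag a row only when all three draft columns are missing.
toy["undrafted"] = toy[["pick", "overall", "round"]].isna().all(axis=1)
print(toy)
```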

In [1]:
import pandas as pd
# Import CSV for Prospects

# data on when a prospect was drafted, which team drafted them.
rel_path = r"..\Sources\nfl_draft_prospects.csv"
prospects_df = pd.read_csv(rel_path)
display('draft_prospects:')
display(prospects_df.head())
# display(prospects_df.info())
display(f'{len(prospects_df)} rows in prospects_df')

'draft_prospects:'

Unnamed: 0,draft_year,player_id,player_name,position,pos_abbr,school,school_name,school_abbr,link,pick,...,team,team_abbr,team_logo_espn,guid,weight,height,pos_rk,ovr_rk,grade,player_image
0,1967,23590,Bubba Smith,Defensive End,DE,Michigan State,Spartans,MSU,http://insider.espn.com/nfl/draft/player/_/id/...,1.0,...,Baltimore Colts,IND,https://a.espncdn.com/i/teamlogos/nfl/500/scor...,,,,,,,
1,1967,23591,Clinton Jones,Running Back,RB,Michigan State,Spartans,MSU,http://insider.espn.com/nfl/draft/player/_/id/...,2.0,...,Minnesota Vikings,MIN,https://a.espncdn.com/i/teamlogos/nfl/500/scor...,,,,,,,
2,1967,23592,Steve Spurrier,Quarterback,QB,Florida,Gators,FLA,http://insider.espn.com/nfl/draft/player/_/id/...,3.0,...,San Francisco 49ers,SF,https://a.espncdn.com/i/teamlogos/nfl/500/scor...,,,,,,,
3,1967,23593,Bob Griese,Quarterback,QB,Purdue,Boilermakers,PUR,http://insider.espn.com/nfl/draft/player/_/id/...,4.0,...,Miami Dolphins,MIA,https://a.espncdn.com/i/teamlogos/nfl/500/scor...,,,,,,,
4,1967,23594,George Webster,Linebacker,LB,Michigan State,Spartans,MSU,http://insider.espn.com/nfl/draft/player/_/id/...,5.0,...,Houston Oilers,TEN,https://a.espncdn.com/i/teamlogos/nfl/500/scor...,,,,,,,


'13354 rows in prospects_df'

In [2]:
qb_prospects_df = prospects_df.query("position =='Quarterback' or pos_abbr == 'QB'").reset_index(drop =True)
display(f'{len(qb_prospects_df)} rows in qb_prospects_df')

'619 rows in qb_prospects_df'

In [3]:
# Count NaN, 0 values in each column
nan_counts = qb_prospects_df.isna().sum()
zero_counts = (qb_prospects_df == 0).sum()
print("NaN values:\n", nan_counts)
print("\n0 values:\n", zero_counts)

NaN values:
 draft_year          0
player_id           0
player_name         0
position            0
pos_abbr            0
school             10
school_name        10
school_abbr        13
link                0
pick               64
overall            64
round              64
traded             64
trade_note        397
team               64
team_abbr          64
team_logo_espn     65
guid              320
weight            338
height            339
pos_rk            341
ovr_rk            371
grade             341
player_image      549
dtype: int64

0 values:
 draft_year          0
player_id           0
player_name         0
position            0
pos_abbr            0
school              0
school_name         0
school_abbr         0
link                0
pick                0
overall             0
round               0
traded            497
trade_note          0
team                0
team_abbr           0
team_logo_espn      0
guid                0
weight              0
height          

In [4]:
display(qb_prospects_df.columns)
# Drop unnecessary/redundant columns
drop = ['position', 'pos_abbr', 'school_name', 'school_abbr', 'link', 'team_logo_espn', 'player_image',
        'guid', 'team_abbr', 'pos_rk', 'ovr_rk']

prospects_df = qb_prospects_df.drop(columns=drop)
display('prospects_df')
display(prospects_df.head())
display(prospects_df.columns)
# display(prospects_df.info())

Index(['draft_year', 'player_id', 'player_name', 'position', 'pos_abbr',
       'school', 'school_name', 'school_abbr', 'link', 'pick', 'overall',
       'round', 'traded', 'trade_note', 'team', 'team_abbr', 'team_logo_espn',
       'guid', 'weight', 'height', 'pos_rk', 'ovr_rk', 'grade',
       'player_image'],
      dtype='object')

'prospects_df'

Unnamed: 0,draft_year,player_id,player_name,school,pick,overall,round,traded,trade_note,team,weight,height,grade
0,1967,23592,Steve Spurrier,Florida,3.0,3.0,1.0,False,from Atlanta,San Francisco 49ers,,,
1,1967,23593,Bob Griese,Purdue,4.0,4.0,1.0,False,,Miami Dolphins,,,
2,1967,23614,Don Horn,San Diego State,25.0,25.0,1.0,False,,Green Bay Packers,,,
3,1967,23616,Bo Burris,Houston,1.0,27.0,2.0,False,,New Orleans Saints,,,
4,1967,23619,Bob Davis,Virginia,4.0,30.0,2.0,False,,Houston Oilers,,,


Index(['draft_year', 'player_id', 'player_name', 'school', 'pick', 'overall',
       'round', 'traded', 'trade_note', 'team', 'weight', 'height', 'grade'],
      dtype='object')

In [5]:
# Get all rows with valid values (numerical) for column 'grade'

# Convert the 'grade' column to numeric, setting non-numeric values to NaN
prospects_df['grade'] = pd.to_numeric(prospects_df['grade'], errors='coerce')

# Filter the DataFrame to include only rows with numeric 'grade' values
filtered_prospects_df = prospects_df[prospects_df['grade'].notna()].reset_index(drop=True)

display(filtered_prospects_df.head())
display(filtered_prospects_df.columns)
display(f'{len(filtered_prospects_df)} rows in filtered_prospects_df')

# filtered_df = prospects_df.reindex(sorted(prospects_df.columns), axis=1)

Unnamed: 0,draft_year,player_id,player_name,school,pick,overall,round,traded,trade_note,team,weight,height,grade
0,2004,7841,Eli Manning,Ole Miss,1.0,1.0,1.0,False,,San Diego Chargers,221.0,77.0,98.0
1,2004,7842,Philip Rivers,North Carolina State,4.0,4.0,1.0,False,,New York Giants,224.0,77.0,95.0
2,2004,7840,Ben Roethlisberger,Miami (OH),11.0,11.0,1.0,False,,Pittsburgh Steelers,241.0,77.0,99.0
3,2004,7843,J.P. Losman,Tulane,22.0,22.0,1.0,False,from Dallas,Buffalo Bills,224.0,74.0,89.0
4,2004,7844,Matt Schaub,Virginia,27.0,90.0,3.0,False,from Indianapolis,Atlanta Falcons,233.0,78.0,80.0


Index(['draft_year', 'player_id', 'player_name', 'school', 'pick', 'overall',
       'round', 'traded', 'trade_note', 'team', 'weight', 'height', 'grade'],
      dtype='object')

'278 rows in filtered_prospects_df'
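
To show what the coercion step does in isolation, here is a small sketch on a made-up Series. The sample values (including the "N/A" string) are assumptions for illustration; the real grade column may contain only numbers and NaN.

```python
import pandas as pd

# Hypothetical raw values mixing numeric strings, a non-numeric string, None, and an int.
raw = pd.Series(["98", "89.5", "N/A", None, 80])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# Keep only the rows that ended up with a numeric value.
kept = numeric.dropna().reset_index(drop=True)
print(kept)
```

Comparing the NaN count before and after the conversion is a quick way to see how many values were coerced rather than missing to begin with.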

In [6]:
#Export csv for Prospects
relative_path = "../Data_Artifacts/nfl_draft_prospects_clean.csv"
filtered_prospects_df.to_csv(relative_path, index=False)

### ALL REMAINING CELLS: NOT USED

In [7]:
import os
#  Open the file with default Windows application (Excel for csv)
if os.path.exists(rel_path):
    os.startfile(rel_path)
    print('Opened with default application.')
else:
    print(f'File not found at: {rel_path}')


Opened with default application.


In [8]:
# NOT USED
# # DataFrame .index attribute  provides an iterable Index object. 
# # You can directly loop over this object.
# for idx in qb_prospects_df.index:# .tolist():
#     if idx > 10:
#         break
#     name = qb_prospects_df.loc[idx, 'player_name']
#     link = qb_prospects_df.loc[idx, 'link']
#     print(f'{idx} {name} {link}')

    
# print('\niterrows')
# for idx, row in qb_prospects_df.iterrows():
#     if idx > 10:
#         break
#     # link = qb_prospects_df.loc[idx, 'link']
#     name = row['player_name']
#     link = row['link']
#     print(f'{idx} {name} {link}')

# # http://insider.espn.com/nfl/draft/player/id/23592

In [9]:
import os
import subprocess

# Write the file.
# NOTE: df_tail2 is not defined in this notebook; it is assumed to be a DataFrame created elsewhere.
relative_path = r'Resources\sandbox_pandas_2024.10_output.txt'
with open(relative_path, 'w') as file:
    file.write(df_tail2.to_string())
# -----------------------------

# Open the file in the notebook
with open(relative_path, 'r') as file:
    print(file.read())

# ------------------------

# Open the file with the default Windows application (Excel for csv)
if os.path.exists(relative_path):
    os.startfile(relative_path)
    print('Opened with default application.')
else:
    print(f'File not found at: {relative_path}')

# -----------------------------

# Open the file in VS Code
# NOTE: csv_path is not defined in this notebook; it is assumed to hold the path of the CSV to open.
absolute_path = os.path.abspath(csv_path)
if os.path.exists(absolute_path):
    try:
        # shell=True runs the command through the shell, which sometimes resolves path issues
        subprocess.run(f'code "{absolute_path}"', check=True, shell=True)
    except subprocess.CalledProcessError as e:
        print(f'Error: {e}')
else:
    print(f'File not found at: {absolute_path}')

In [31]:
# Merge prospects_df (player_name) with profiles_df (player_name_prof) using fuzzywuzzy,
# to allow for small differences in how a name is spelled.
# The merge indicator column (_merge) flags rows where there is NOT a match.

# Method:
# 1. Create a new column 'player_name_match' in prospects_df.
#    For each player_name, search profiles_df['player_name_prof'] with process.extractOne
#    to get the best match (or None if there is no good match).
# 2. Merge prospects_df (on 'player_name_match') with profiles_df (on 'player_name_prof').
#    Keep the original 'player_name' in prospects_df for reference.

# pip install fuzzywuzzy
# fuzzywuzzy falls back to the slow pure-Python SequenceMatcher; installing the optional
# python-Levenshtein dependency speeds things up.
# pip install python-Levenshtein

from fuzzywuzzy import process

# Find the best match for a player name among the profile names
def get_best_match(name, choices, threshold=80):
    # When choices is a pandas Series, process.extractOne returns a 3-tuple: (match, score, index).
    match, score, _ = process.extractOne(name, choices)
    return match if score > threshold else None

# Create the 'player_name_match' column
prospects_df['player_name_match'] = prospects_df['player_name'].apply(
    lambda x: get_best_match(x, profiles_df['player_name_prof'])
)

# Merge the dataframes
merged_df = pd.merge(
    prospects_df,
    profiles_df,
    left_on='player_name_match',
    right_on='player_name_prof',
    how='outer',
    indicator=True
)

display(merged_df.head())
display(merged_df.tail())


Unnamed: 0,draft_year,player_id,player_name,pos_abbr,school,pick,overall,round,traded,trade_note,...,weight_prof,height_prof,school_prof,pos_rk_prof,ovr_rk_prof,grade_prof,text1_prof,text2_prof,text3_prof,_merge
0,1967,23592,Steve Spurrier,QB,Florida,3.0,3.0,1.0,False,from Atlanta,...,,,Florida,,,,,,,both
1,1967,23593,Bob Griese,QB,Purdue,4.0,4.0,1.0,False,,...,,,Purdue,,,,,,,both
2,1967,23614,Don Horn,QB,San Diego State,25.0,25.0,1.0,False,,...,,,San Diego State,,,,,,,both
3,1967,23616,Bo Burris,QB,Houston,1.0,27.0,2.0,False,,...,,,Houston,,,,,,,both
4,1967,23619,Bob Davis,QB,Virginia,4.0,30.0,2.0,False,,...,,,Virginia,,,,,,,both


Unnamed: 0,draft_year,player_id,player_name,pos_abbr,school,pick,overall,round,traded,trade_note,...,weight_prof,height_prof,school_prof,pos_rk_prof,ovr_rk_prof,grade_prof,text1_prof,text2_prof,text3_prof,_merge
616,2021,104814,Jamie Newman,QB,Georgia,,,,,,...,234.0,74.875,Georgia,9.0,184.0,55.0,"Newman has average height, a sturdy build and ...",Newman is a developmental prospect with a good...,,both
617,2021,105211,Feleipe Franks,QB,Arkansas,,,,,,...,234.0,78.625,Arkansas,11.0,239.0,39.0,Franks has a big-time arm. He lacks ideal rele...,Franks is a developmental quarterback prospect...,,both
618,2021,105151,Shane Buechele,QB,SMU,,,,,,...,210.0,72.25,SMU,13.0,260.0,36.0,Buechele is a shorter quarterback who thrives ...,Buechele is a shorter quarterback who is very ...,,both
619,2021,105349,Zach Smith,QB,Tulsa,,,,,,...,222.0,75.375,Tulsa,14.0,311.0,31.0,,,,both
620,2021,105467,K.J. Costello,QB,Mississippi State,,,,,,...,227.0,76.625,Mississippi State,15.0,345.0,30.0,,,,both
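
The same match-then-merge idea can be tried without extra dependencies. The sketch below swaps fuzzywuzzy for the standard library's difflib.get_close_matches, so scores are ratios between 0 and 1 rather than 0 to 100, and the 0.8 cutoff is an assumption chosen to mirror the threshold of 80 above. The toy names and the helper best_match are hypothetical. It also lists the names that found no match, using the _merge indicator.

```python
import difflib
import pandas as pd

# Toy frames; names and grades are made up for illustration.
left = pd.DataFrame({"player_name": ["Jon Smith", "Alex Jones", "Pat Doe"]})
right = pd.DataFrame({
    "player_name_prof": ["John Smith", "Alex Jones"],
    "grade_prof": [55.0, 39.0],
})

def best_match(name, choices, cutoff=0.8):
    """Return the closest choice by difflib ratio, or None if nothing reaches the cutoff."""
    hits = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

choices = right["player_name_prof"].tolist()
left["player_name_match"] = left["player_name"].apply(lambda n: best_match(n, choices))

# A left join keeps every prospect; _merge records whether a profile was found.
merged = left.merge(
    right,
    left_on="player_name_match",
    right_on="player_name_prof",
    how="left",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] != "both", "player_name"].tolist()
print("No match for:", unmatched)
```

Fuzzy name matching can pair two different players with similar names, so where both tables carry player_id, joining on that key remains the safer option.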
