This file takes the combined NFL career statistics, characteristics, awards, and records of NFL quarterbacks and creates new target/y labels from them.

The ultimate goal is to determine which of these target/y labels indicate overall quarterback success (or can be investigated on their own) when predicted from the feature/X data drawn from college football.

Target/y columns (thresholds as applied in the code below):
* **hof_succ:** Hall of Fame induction (a valid induction year in `hof_yr`)
* **wins_succ:** wins >= 32 and win % > 50%
* **awards_succ:** total count of Pro Bowls, All-Pros, MVPs, etc. >= 2; e.g. `['2x Pro Bowl', '1x All-Pro']` counts as 3
* **stats_succ:** statistical success: at least one of comp >= 1,200, comp_% >= 70, pass_yds >= 15,000, TD >= 75, YPA >= 7.0, pydsPG >= 200
* **earn_succ:** earnings success: career earnings >= $1M (`earn_mils` >= 1)
* **long_succ:** longevity success: yrs_play >= 4 and games >= 32
* **metrics_succ:** wAV >= 70, AV >= 10, or passer rating >= 100
* **draft_succ:** draft success: picked in 1st round, mid round, later round (listed here, but not computed in this file)

* **nfl_agg_succ:** aggregate NFL success, based on a combination of the aforementioned success targets (stored as `nfl_success` in the code).
  * e.g.: nfl_success = 1 for any player with a 1 in three or more of the other columns (for example stats_succ = 1, long_succ = 1, and metrics_succ = 1).

All columns will be 1 or 0, so they are ready to use as a target/y. We can choose to focus on the single aggregate NFL success target, or build branched/multi-output neural network models for multiple target/y columns.
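
The aggregation rule described above can be sketched as follows. This is a minimal illustration on made-up rows: the helper name `aggregate_success`, the toy players, and the `min_hits` value are assumptions for the example, and the threshold is left as a parameter so it can be tuned.

```python
import pandas as pd


def aggregate_success(df, success_columns, min_hits):
    """Return 1 where at least `min_hits` of the binary columns are 1, else 0."""
    return (df[success_columns].sum(axis=1) >= min_hits).astype(int)


# Toy binary labels (hypothetical players, not real data).
labels = pd.DataFrame({
    "player": ["QB A", "QB B", "QB C"],
    "wins_succ": [1, 0, 0],
    "stats_succ": [1, 1, 0],
    "long_succ": [1, 1, 0],
})
success_columns = ["wins_succ", "stats_succ", "long_succ"]

# min_hits=2 is an assumed value chosen only for this illustration.
labels["nfl_agg_succ"] = aggregate_success(labels, success_columns, min_hits=2)
print(labels)
```

Keeping the threshold as an argument makes it easy to compare how many players qualify under stricter or looser definitions.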

to_do:
1) drop incorrect names (Brett Smith != Bruce Smith...)
2) check wAV values of '100' and other odd 'default' values that need to be made NaN
3) drop columns not used : 'win_record'
4) rename columns?
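
To-do item 2 could be approached along these lines. This is only a sketch: it assumes that 100.0 really is a placeholder in `wAV`, which still has to be confirmed by inspecting the data, and the `placeholder_values` list is hypothetical.

```python
import pandas as pd

# Toy wAV column; 100.0 is assumed here to be a placeholder, not a real score.
wav = pd.Series([100.0, 22.0, 100.0, 55.0], name="wAV")

# Assumption: extend this list after inspecting wav.value_counts().
placeholder_values = [100.0]

# mask() replaces entries where the condition is True with NaN.
cleaned = wav.mask(wav.isin(placeholder_values))
print(cleaned)
```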



In [178]:
# ['player', 'draft', 'rd', 'pick', 'team', 'earn_mils', 'yrs_play',
#        'games', 'wAV', 'AV', 'win_%', 'pass_rating', 'hof', 'all_star', 'SBs',
#        'att', 'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG', 'int',
#        'int_%', 'sacks', 'pick_6', 'GWD', '4QC', 'ht', 'wt', 'hand', 'record',
#        'wins', 'loss'],

# base DataFrame
#       'player', 'draft', 'team',
#       'ht', 'wt',
# **hof:** Hall of Fame induction
#       'hof'
# **wins_succ:**  wins >= 32, win % > 50%
#       'wins', 'loss', 'win_%', 
# **awards_tot, awards_succ:** Awards total number of Pro Bowls, All Pros, MVPs, etc. e.g.: ['2x Pro Bowl', '1x All-Pro']  = 3; success if total >= 2
#       'all_star',
# **stats_succ:** Statistical Success (any of: comp >= 1,200, comp_% >= 70, pass_yds >= 15,000, TD >= 75, YPA >= 7.0, pydsPG >= 200)
#       'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG',       
# **earn_succ:** Earnings Success : career earnings >= $1M
#       'earn_mils',
# **long_succ:** Longevity success (yrs_play >= 4 and games >= 32)
#       'yrs_play', 'games',
# **metrics_succ**  wAV, AV, passer rating
#       'wAV', 'AV', 'pass_rating',
# **draft_succ:** Draft success: picked in 1st round, mid round, later round.
#       'rd', 'pick', 
# * **nfl_agg_succ:** aggregate NFL success, based on combination of the aforementioned success targets.
#   * e.g.: nfl_agg_succ = 1 for any player with a 1 in three or more of the other columns.

# NOT USED: 'att', 'int', 'record', 'pick_6', 'hand', 'int_%', 'sacks', 

In [179]:
import pandas as pd
file_in = "../Data_Final/nfl_merged.csv"
df = pd.read_csv(file_in)

display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 38 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   player         637 non-null    object 
 1   draft_yr       519 non-null    float64
 2   rd             431 non-null    float64
 3   pick           431 non-null    float64
 4   teams          212 non-null    object 
 5   team_names     637 non-null    object 
 6   college        563 non-null    object 
 7   earn_mils      391 non-null    float64
 8   yrs_play       250 non-null    float64
 9   games          475 non-null    float64
 10  wAV            465 non-null    float64
 11  AV             396 non-null    object 
 12  win_%          370 non-null    float64
 13  pass_rating    214 non-null    float64
 14  hof_yes        356 non-null    float64
 15  hof_yr         40 non-null     float64
 16  all_star       212 non-null    object 
 17  SBs            212 non-null    float64
 18  att       

None

In [180]:

# Re-import, applying the nullable Int64 dtype to the integer-valued columns.
# Note: Int64 marks missing values as <NA>, whereas float64 uses NaN.
float_to_int = ['draft_yr', 'rd', 'pick','yrs_play','games', 'hof_yr','hof_yes', 'SBs',
       'att', 'comp', 'TD', 'int',
       'sacks', 'pick_6', 'GWD', '4QC','wt','wins', 'loss']

dtype_dict = {col:'Int64' for col in float_to_int}
df = pd.read_csv(file_in, dtype=dtype_dict)

col_count = len(df.columns)
print(f'cols: {col_count}')

display(df.info())
display(df.columns)
display(df.iloc[:, :15].head())

if col_count > 15:
    display(df.iloc[:, 15:].head())


cols: 38
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 38 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   player         637 non-null    object 
 1   draft_yr       519 non-null    Int64  
 2   rd             431 non-null    Int64  
 3   pick           431 non-null    Int64  
 4   teams          212 non-null    object 
 5   team_names     637 non-null    object 
 6   college        563 non-null    object 
 7   earn_mils      391 non-null    float64
 8   yrs_play       250 non-null    Int64  
 9   games          475 non-null    Int64  
 10  wAV            465 non-null    float64
 11  AV             396 non-null    object 
 12  win_%          370 non-null    float64
 13  pass_rating    214 non-null    float64
 14  hof_yes        356 non-null    Int64  
 15  hof_yr         40 non-null     Int64  
 16  all_star       212 non-null    object 
 17  SBs            212 non-null    Int64  
 18  a

None

Index(['player', 'draft_yr', 'rd', 'pick', 'teams', 'team_names', 'college',
       'earn_mils', 'yrs_play', 'games', 'wAV', 'AV', 'win_%', 'pass_rating',
       'hof_yes', 'hof_yr', 'all_star', 'SBs', 'att', 'comp', 'comp_%',
       'pass_yds', 'TD', 'YPA', 'pydsPG', 'int', 'int_%', 'sacks', 'pick_6',
       'GWD', '4QC', 'ht', 'wt', 'hand', 'record', 'wins', 'loss',
       'college_stats'],
      dtype='object')

Unnamed: 0,player,draft_yr,rd,pick,teams,team_names,college,earn_mils,yrs_play,games,wAV,AV,win_%,pass_rating,hof_yes
0,Steve Bartkowski,1975,1,1,"['Atlanta Falcons 1975-1985', 'Los Angeles Rams 1986']",[],California,,12,129,100.0,84,46.46,75.4,0
1,Steve Beuerlein,1987,4,110,"['Los Angeles Raiders 1988-1989', 'Dallas Cowboys 1991-1992', 'Phoenix Cardinals 1993-1994', 'Jacksonville Jaguars 1995', 'Carolina Panthers 1996-2000', 'Denver Broncos 2002-2003']",[],Notre Dame,,16,147,100.0,85,46.08,80.3,0
2,Archie Manning,1971,1,2,"['New Orleans Saints 1971-1982', 'Houston Oilers 1982-1983', 'Minnesota Vikings 1983-1984']",[],Mississippi,,14,151,100.0,94,25.74,67.1,0
3,Brian Sipe,1972,13,330,['Cleveland Browns 1974-1983'],[],Grossmont College (CA),,10,125,100.0,87,50.89,74.8,0
4,Andrew Luck,2012,1,1,['Indianapolis Colts 2012-2018'],['Colts'],Stanford,109.108,7,86,100.0,80,61.63,89.5,0


Unnamed: 0,hof_yr,all_star,SBs,att,comp,comp_%,pass_yds,TD,YPA,pydsPG,int,int_%,sacks,pick_6,GWD,4QC,ht,wt,hand,record,wins,loss,college_stats
0,,['2x Pro Bowl'],0,3456,1932,55.9,24124.0,156,7.0,187.0,144,4.2,356,7,21,18,6-4,216,Right,59-68-0,59,68,https://www.sports-reference.com/cfb/players/steve-bartkowski-1.html
1,,['1x Pro Bowl'],1,3328,1894,56.9,24046.0,147,7.2,163.0,112,3.4,332,6,16,13,6-3,220,Right,47-55-0,47,55,https://www.sports-reference.com/cfb/players/steve-beuerlein-1.html
2,,['2x Pro Bowl'],0,3642,2011,55.2,23911.0,125,6.6,158.0,173,4.8,396,11,12,11,6-3,212,Right,35-101-3,35,101,https://www.sports-reference.com/cfb/players/archie-manning-1.html
3,,"['1x Pro Bowl', '1x All-Pro']",0,3439,1944,56.5,23713.0,154,6.9,189.0,149,4.3,224,7,23,17,6-1,195,Right,57-55-0,57,55,
4,,['4x Pro Bowl'],0,3290,2000,60.8,23671.0,171,7.2,275.0,83,2.5,174,10,20,16,6-4,240,Right,53-33-0,53,33,https://www.sports-reference.com/cfb/players/andrew-luck-1.html
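
The effect of passing a `dtype` mapping to `read_csv` can be seen on a tiny in-memory CSV. This is a sketch with made-up rows; only the column names `rd` and `pick` are borrowed from the dataset.

```python
import io

import pandas as pd

# Two toy rows; the second has no draft round or pick.
csv_text = "player,rd,pick\nQB A,1,2\nQB B,,\n"

# The nullable Int64 dtype keeps whole numbers as integers and marks gaps as <NA>.
toy = pd.read_csv(io.StringIO(csv_text), dtype={"rd": "Int64", "pick": "Int64"})
print(toy.dtypes)
print(toy)
```

Without the mapping, pandas would read both columns as float64 because of the missing values.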


In [181]:
# Verify no lost data:
# Set display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

display(len(df))
# display(df[['player', 'team', 'all_star']].sort_values(by='player'))

637

### NEW Dataframe: cleaning_df

In [182]:
cleaning_df = df.copy()
cleaning_df['hof_yr'].value_counts()

# display(cleaning_df[['player', 'hof_yr']][(cleaning_df['hof_yr'] == 5564) | (cleaning_df['hof_yr'] == 1)])

hof_yr
0       6
2005    3
1965    3
1971    2
1985    2
2021    2
2016    2
2006    2
1983    1
1966    1
1977    1
1990    1
1967    1
1981    1
1989    1
1987    1
2000    1
2017    1
2002    1
1979    1
1993    1
1986    1
2004    1
1995    1
1963    1
1972    1
Name: count, dtype: Int64

In [184]:
# if there is a numerical value (year between 1950 and 2026) in the hof column, put a 1 in new hof_succ column in the cleaning_df dataframe

# Create the 'hof_succ' column based on the condition
cleaning_df['hof_succ'] = cleaning_df['hof_yr'].apply(lambda x: 1 if pd.notna(x) and 1950 <= x <= 2026 else 0)

# Display the first few rows to verify the changes
display(cleaning_df[['player', 'team_names', 'draft_yr','hof_succ', 'hof_yes', 'hof_yr']].head())

Unnamed: 0,player,team_names,draft_yr,hof_succ,hof_yes,hof_yr
0,Steve Bartkowski,[],1975,0,0,
1,Steve Beuerlein,[],1987,0,0,
2,Archie Manning,[],1971,0,0,
3,Brian Sipe,[],1972,0,0,
4,Andrew Luck,['Colts'],2012,0,0,


In [185]:
filtered_df = cleaning_df[['player', 'hof_succ','hof_yes', 'hof_yr' ]][cleaning_df['hof_succ'] != cleaning_df['hof_yes']]
filtered_df

Unnamed: 0,player,hof_succ,hof_yes,hof_yr
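
As a possible alternative to the row-wise `apply`, the same year-range check can be written in vectorized form. This is a sketch on toy values, assuming the same 1950 to 2026 window used above.

```python
import pandas as pd

# Toy induction years: missing, the placeholder 0, and two valid years.
hof_yr = pd.Series([pd.NA, 0, 2005, 1985], dtype="Int64", name="hof_yr")

# between() yields <NA> for missing years; treat those as "not inducted".
hof_succ = hof_yr.between(1950, 2026).fillna(False).astype(int)
print(hof_succ.tolist())
```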


In [186]:
# **wins_succ** column based on the conditions:
    # >= 32 career wins
    # > 50.0% career win %

cleaning_df['wins_succ'] = ((cleaning_df['wins'] >= 32) & (cleaning_df['win_%'] > 50.0)).astype(int)

print(cleaning_df['wins_succ'].value_counts())
cleaning_df[['player','wins', 'win_%', 'wins_succ']].head()   

wins_succ
0    544
1     93
Name: count, dtype: int64


Unnamed: 0,player,wins,win_%,wins_succ
0,Steve Bartkowski,59,46.46,0
1,Steve Beuerlein,47,46.08,0
2,Archie Manning,35,25.74,0
3,Brian Sipe,57,50.89,1
4,Andrew Luck,53,61.63,1
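
One thing to watch with nullable `Int64` columns is that a comparison on a missing value yields `<NA>`, and `astype(int)` fails if any `<NA>` survives the `&`. A hedged sketch on toy values, treating missing wins as "not successful" (that choice is an assumption):

```python
import pandas as pd

# Toy data: the last player has a win % but no recorded win total.
wins = pd.Series([59, 57, 53, pd.NA], dtype="Int64")
win_pct = pd.Series([46.46, 50.89, 61.63, 55.0])

mask = (wins >= 32) & (win_pct > 50.0)  # last entry is <NA>, not False

# Assumption: a missing win total counts as 0 for the label.
wins_succ = mask.fillna(False).astype(int)
print(wins_succ.tolist())
```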


In [187]:
# **awards_count, awards_succ:** total number of Pro Bowls, All-Pros, MVPs, etc.
#       'all_star'
#  Return the summed award count, e.g. ['2x Pro Bowl', '1x All-Pro'] -> 3

import numpy as np
import ast


# Function to count and sum digits in front of 'x'
def count_all_star(all_star_str):
    try:
        all_star_str = ast.literal_eval(all_star_str)
    except (ValueError, SyntaxError):
        return 0
    
    total = 0
    for item in all_star_str:
        if 'x' in item:
            try:
                count = int(item.split('x')[0])
                # print(f"Extracted count: {count} from item: {item}")
                total += count
            except ValueError:
                print(f"Skipping item: {item} due to ValueError")
                continue
    return total

cleaning_df['awards_count'] = cleaning_df['all_star'].apply(count_all_star)
cleaning_df['awards_succ'] = cleaning_df['awards_count'].apply(lambda x: 1 if x >= 2 else 0)


In [188]:
# verify all values still present
display(len(cleaning_df))
# display(df[['player', 'team', 'all_star']].sort_values(by='player').reset_index(drop=True))

637

In [190]:
display(cleaning_df[['player',  'awards_succ', 'awards_count', 'all_star']].head())

Unnamed: 0,player,awards_succ,awards_count,all_star
0,Steve Bartkowski,1,2,['2x Pro Bowl']
1,Steve Beuerlein,0,1,['1x Pro Bowl']
2,Archie Manning,1,2,['2x Pro Bowl']
3,Brian Sipe,1,2,"['1x Pro Bowl', '1x All-Pro']"
4,Andrew Luck,1,4,['4x Pro Bowl']
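
A slightly stricter variant of the parser is sketched below. It is not the function used above: the name `count_awards` and the regular expression are assumptions, chosen so that only a leading `Nx ` prefix is counted and any other `x` in an award name is ignored.

```python
import ast
import re

# Matches a leading multiplier such as "2x " in "2x Pro Bowl".
_COUNT_PREFIX = re.compile(r"^\s*(\d+)x\s")


def count_awards(raw):
    """Sum the 'Nx' multipliers in a stringified list of awards."""
    if not isinstance(raw, str):
        return 0  # NaN or any other non-string input
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return 0  # not a valid Python literal
    if not isinstance(items, (list, tuple)):
        return 0
    total = 0
    for item in items:
        match = _COUNT_PREFIX.match(str(item))
        if match:
            total += int(match.group(1))
    return total


print(count_awards("['2x Pro Bowl', '1x All-Pro']"))
```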


In [191]:
#  view smaller DataFrame for statistical success; look at averages, mean, etc.

# **stats_succ:** Statistical Success (passing_yards > 15,000, touchdown_passes > 50....)
#       'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG',       

stats_cols = ['player' , 'att' ,'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG']
stats_df = cleaning_df[stats_cols]
display(stats_df.head())

num_cols = stats_df.select_dtypes(include='number').columns.tolist() #['comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG']
display(stats_df[num_cols].describe().style.format("{:,.2f}"))


Unnamed: 0,player,att,comp,comp_%,pass_yds,TD,YPA,pydsPG
0,Steve Bartkowski,3456,1932,55.9,24124.0,156,7.0,187.0
1,Steve Beuerlein,3328,1894,56.9,24046.0,147,7.2,163.0
2,Archie Manning,3642,2011,55.2,23911.0,125,6.6,158.0
3,Brian Sipe,3439,1944,56.5,23713.0,154,6.9,189.0
4,Andrew Luck,3290,2000,60.8,23671.0,171,7.2,275.0


Unnamed: 0,att,comp,comp_%,pass_yds,TD,YPA,pydsPG
count,251.0,250.0,414.0,427.0,416.0,399.0,214.0
mean,3159.55,1847.3,56.88,13134.34,85.82,6.68,179.72
std,1876.07,1198.76,8.62,14243.82,95.52,1.61,45.73
min,1110.0,591.0,0.0,5.0,1.0,0.0,71.0
25%,1736.5,969.75,52.42,2281.0,16.0,6.2,148.25
50%,2634.0,1535.5,56.75,7975.0,49.5,6.7,176.0
75%,4006.0,2303.25,60.58,20506.5,128.25,7.1,210.0
max,12050.0,7753.0,100.0,89214.0,649.0,24.0,293.0


In [193]:
# View a dataframe with the determined criteria.

criteria = {
    'comp': 1200,       # or more career passing completions
    'comp_%': 70,       # or more career completion percent : comp %
    'pass_yds': 15000,  # or more passing yards
    'TD': 75,           # or more passing touchdowns
    'YPA': 7.0,         # or more Yards Per pass Attempt
    'pydsPG': 200       # or more pass Yards Per Game
}

# Build every condition from stats_df itself so the mask and the frame share one index.
stats_df = stats_df[
    (stats_df['comp'] > criteria['comp']) |
    (stats_df['comp_%'] >= criteria['comp_%']) |
    (stats_df['pass_yds'] >= criteria['pass_yds']) |
    (stats_df['TD'] >= criteria['TD']) |
    (stats_df['YPA'] >= criteria['YPA']) |
    (stats_df['pydsPG'] >= criteria['pydsPG'])
    ]

display(len(stats_df))
# display(stats_df)


227
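
Because the filter is an OR over one threshold per column, it can also be built by looping over the `criteria` dictionary. A minimal sketch on toy rows, with a shortened `criteria` dict that is assumed for the example:

```python
import pandas as pd

# Shortened criteria for illustration (minimum value per column).
criteria = {"pass_yds": 15000, "TD": 75}

# Toy rows, not real players.
stats = pd.DataFrame({
    "player": ["QB A", "QB B", "QB C"],
    "pass_yds": [24124.0, 5.0, None],
    "TD": [156, 80, 10],
})

# Start with "no criterion met" and OR in one condition per column.
mask = pd.Series(False, index=stats.index)
for col, threshold in criteria.items():
    mask |= stats[col].fillna(0) >= threshold

print(stats[mask])
```

Adding or removing a criterion then only requires editing the dictionary.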

In [194]:
# Quick data clean for comp_%: any value of 90% or higher is treated as an error and set to NaN.

import numpy as np
cleaning_df['comp_%'] = cleaning_df['comp_%'].apply(lambda x: np.nan if x >= 90 else x)
cleaning_df[cleaning_df['comp_%'] >= 90 ]

Unnamed: 0,player,draft_yr,rd,pick,teams,team_names,college,earn_mils,yrs_play,games,wAV,AV,win_%,pass_rating,hof_yes,hof_yr,all_star,SBs,att,comp,comp_%,pass_yds,TD,YPA,pydsPG,int,int_%,sacks,pick_6,GWD,4QC,ht,wt,hand,record,wins,loss,college_stats,hof_succ,wins_succ,awards_count,awards_succ
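
The same clean-up can be expressed without a row-wise lambda. A sketch on toy values, assuming the same 90% cut-off:

```python
import numpy as np
import pandas as pd

comp_pct = pd.Series([55.9, 100.0, np.nan, 60.8], name="comp_%")

# where() keeps values that satisfy the condition and sets the rest to NaN;
# entries that are already NaN fail the comparison and stay NaN.
cleaned = comp_pct.where(comp_pct < 90)
print(cleaned)
```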


In [195]:
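# Apply the criteria above to build the stats_succ column and add it to cleaning_df.
import numpy as np

# Note: this fills NaN with 0 in every column of cleaning_df, not only the stat columns,
# which is why 0 shows up in columns such as 'teams' and 'draft_yr' in later outputs.
cleaning_df.fillna(0, inplace=True)

# A player qualifies if any single criterion is met.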
cleaning_df['stats_succ'] = (
    (cleaning_df['comp'].fillna(0) >= criteria['comp']) |
    (cleaning_df['comp_%'].fillna(0) >= criteria['comp_%']) |
    (cleaning_df['pass_yds'].fillna(0) >= criteria['pass_yds']) |
    (cleaning_df['TD'].fillna(0) >= criteria['TD']) |
    (cleaning_df['YPA'].fillna(0) >= criteria['YPA']) |
    (cleaning_df['pydsPG'].fillna(0) >= criteria['pydsPG'])
).astype(np.int64)

display(len(cleaning_df))
stats_cols_updated = ['player', 'stats_succ', 'att' ,'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG']
display(cleaning_df[stats_cols_updated].head())
display(len(cleaning_df['stats_succ'][cleaning_df['stats_succ'] == 1]))
cleaning_df['stats_succ'].value_counts()

637

Unnamed: 0,player,stats_succ,att,comp,comp_%,pass_yds,TD,YPA,pydsPG
0,Steve Bartkowski,1,3456,1932,55.9,24124.0,156,7.0,187.0
1,Steve Beuerlein,1,3328,1894,56.9,24046.0,147,7.2,163.0
2,Archie Manning,1,3642,2011,55.2,23911.0,125,6.6,158.0
3,Brian Sipe,1,3439,1944,56.5,23713.0,154,6.9,189.0
4,Andrew Luck,1,3290,2000,60.8,23671.0,171,7.2,275.0


228

stats_succ
0    409
1    228
Name: count, dtype: int64
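
If zeros in text columns such as `teams` are not wanted, one option is to limit the fill to the stat columns. This is a sketch on a toy frame, not what the notebook does; the `stat_cols` list is an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "player": ["QB A", "QB B"],
    "teams": ["['Team X 2001-2005']", np.nan],
    "pass_yds": [24124.0, np.nan],
    "TD": [156.0, np.nan],
})

# Fill only the numeric stat columns; 'teams' keeps its NaN.
stat_cols = ["pass_yds", "TD"]
toy[stat_cols] = toy[stat_cols].fillna(0)
print(toy)
```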

In [196]:
# **earn_succ:** Earnings Success : career earnings >= $1M (earn_mils is in millions of dollars)
#       'earn_mils',

# create new column in main dataframe cleaning_df
cleaning_df['earn_succ'] = (cleaning_df['earn_mils'] >= 1).astype(int)

# view the sub-Dataframe...
earn_df = cleaning_df[['player', 'earn_succ','earn_mils', 'draft_yr', 'yrs_play', 'teams']]
print(earn_df['earn_succ'].value_counts())

# view the cleaning_df
display(cleaning_df[['player', 'earn_succ','earn_mils', 'draft_yr', 'yrs_play', 'teams']].sort_values(by='earn_mils', ascending=False).head(10))
display(cleaning_df[['player', 'earn_succ','earn_mils', 'draft_yr', 'yrs_play', 'teams']].sort_values(by='earn_mils', ascending=False).tail(10))


earn_succ
0    410
1    227
Name: count, dtype: int64


Unnamed: 0,player,earn_succ,earn_mils,draft_yr,yrs_play,teams
505,Aaron Rodgers,1,343.531,2005,20,"['Green Bay Packers 2005-2022', 'New York Jets 2023-2024']"
253,Matt Stafford,1,328.0,2009,0,"['Detroit Lions 2009-2020', 'Los Angeles Rams 2021-2024']"
497,Tom Brady,1,317.62,2000,23,"['New England Patriots 2000-2019', 'Tampa Bay Buccaneers 2020-2022']"
503,Matt Ryan,1,306.206,2008,15,"['Atlanta Falcons 2008-2019', 'Atlanta Falcons 2020-2021', 'Indianapolis Colts 2022']"
515,Russell Wilson,1,305.34,2012,12,"['Seattle Seahawks 2012-2021', 'Denver Broncos 2022-2023', 'Pittsburgh Steelers 2024']"
518,Kirk Cousins,1,281.469,2012,13,"['Washington Redskins 2012-2013', 'Washington Redskins 2014-2017', 'Minnesota Vikings 2018-2023', 'Atlanta Falcons 2024']"
498,Drew Brees,1,273.933,2001,20,"['San Diego Chargers 2001-2005', 'New Orleans Saints 2006-2020']"
501,Ben Roethlisberger,1,266.724,2004,18,['Pittsburgh Steelers 2004-2021']
499,Peyton Manning,1,247.714,1998,18,"['Indianapolis Colts 1998-2010', 'Denver Broncos 2012-2015']"
502,Philip Rivers,1,242.15,2004,17,"['San Diego Chargers 2004-2018', 'Los Angeles Chargers 2019', 'Indianapolis Colts 2020']"


Unnamed: 0,player,earn_succ,earn_mils,draft_yr,yrs_play,teams
172,Jim Hardy,0,0.0,1945,0,"['Los Angeles Rams 1946-1948', 'Chicago Cardinals 1949', 'Chicago Cardinals 1950', 'Chicago Cardinals 1951', 'Detroit Lions 1952']"
181,Mike Taliaferro,0,0.0,1963,0,"['New York Jets 1964-1967', 'Boston Patriots 1968-1970', 'Buffalo Bills 1972']"
180,George Shaw,0,0.0,1955,0,"['Baltimore Colts 1955', 'Baltimore Colts 1956-1958', 'New York Giants 1959-1960', 'Minnesota Vikings 1961', 'Denver Broncos 1962']"
179,Joe Kapp,0,0.0,1959,0,"['Minnesota Vikings 1967-1969', 'Boston Patriots 1970']"
178,Jim Ninowski,0,0.0,1958,0,"['Cleveland Browns 1958-1963', 'Detroit Lions 1960-1961', 'Cleveland Browns 1964-1966', 'Washington Redskins 1967-1968', 'New Orleans Saints 1969']"
177,Dennis Shaw,0,0.0,1970,0,"['Buffalo Bills 1970-1973', 'St. Louis Cardinals 1974-1975']"
176,King Hill,0,0.0,1958,0,0
175,Don Strock,0,0.0,1973,0,0
174,Red Dunn,0,0.0,0,0,0
636,Broc Rutter,0,0.0,0,0,0


In [197]:
# **long_succ:** Longevity success (yrs_play >= 4 and games >= 32)
#       'yrs_play', 'games',

# long_df = cleaning_df['long_succ']
long_df = cleaning_df[['player', 'draft_yr', 'games','yrs_play', 'team_names' , 'teams']]
long_df.describe()

# apply filters to cleaning_df, then have a look.
cleaning_df['long_succ'] = ((cleaning_df['yrs_play'] >= 4) & (cleaning_df['games'] >= 32)).astype(int)

print(cleaning_df['long_succ'].value_counts())
display(cleaning_df[['player', 'draft_yr', 'games','yrs_play', 'team_names' , 'teams']].head())

long_succ
0    420
1    217
Name: count, dtype: int64


Unnamed: 0,player,draft_yr,games,yrs_play,team_names,teams
0,Steve Bartkowski,1975,129,12,[],"['Atlanta Falcons 1975-1985', 'Los Angeles Rams 1986']"
1,Steve Beuerlein,1987,147,16,[],"['Los Angeles Raiders 1988-1989', 'Dallas Cowboys 1991-1992', 'Phoenix Cardinals 1993-1994', 'Jacksonville Jaguars 1995', 'Carolina Panthers 1996-2000', 'Denver Broncos 2002-2003']"
2,Archie Manning,1971,151,14,[],"['New Orleans Saints 1971-1982', 'Houston Oilers 1982-1983', 'Minnesota Vikings 1983-1984']"
3,Brian Sipe,1972,125,10,[],['Cleveland Browns 1974-1983']
4,Andrew Luck,2012,86,7,['Colts'],['Indianapolis Colts 2012-2018']


In [198]:
# **metrics_succ**  wAV, AV, passer rating
#       'wAV', 'AV', 'pass_rating',

# AV is read in as object: coerce it to numeric (unparseable values become NaN),
# then fill NaN with 0. Assigning the result back avoids chained in-place fillna.
cleaning_df['AV'] = pd.to_numeric(cleaning_df['AV'], errors='coerce').fillna(0).astype(float)

# view smaller dataframe: 
metrics_df = cleaning_df[['player', 'draft_yr', 'wAV','AV', 'pass_rating', 'team_names' , 'teams' ]]
display(metrics_df.describe())


# apply filters to the main dataframe, cleaning_df
cleaning_df['metrics_succ'] = ((cleaning_df['wAV'] >= 70) | 
                            (cleaning_df['AV'] >= 10) |
                            (cleaning_df['pass_rating'] >= 100)       
                               ).astype(int)


#  view cleaning_df
display(cleaning_df['metrics_succ'].value_counts())
display(cleaning_df[['player', 'draft_yr', 'wAV','AV', 'pass_rating', 'team_names' , 'teams' ]].head())


# sample AV:  Colt McCoy: 7 (in 2011), Joe Flacco 13 (in 2014), Jake Delhomme: 12 (2004)
# sample wAV: Colt McCoy: 22         , Joe Flacco: 76/95         , Jake Delhomme: 55 
# The average career passer rating for an NFL starter typically falls around 85-90. Modern quarterbacks tend to have
#  higher passer ratings due to changes in the game that favor passing offenses

cleaning_df.columns


Unnamed: 0,draft_yr,wAV,AV,pass_rating
count,637.0,637.0,637.0,637.0
mean,1627.956044,72.99843,27.601256,26.514443
std,777.119614,44.431648,48.102497,37.765802
min,0.0,0.0,-4.0,0.0
25%,1959.0,0.0,0.0,0.0
50%,1995.0,100.0,1.0,0.0
75%,2015.0,100.0,35.0,72.3
max,2024.0,100.0,326.0,103.0


metrics_succ
1    465
0    172
Name: count, dtype: int64

Unnamed: 0,player,draft_yr,wAV,AV,pass_rating,team_names,teams
0,Steve Bartkowski,1975,100.0,84.0,75.4,[],"['Atlanta Falcons 1975-1985', 'Los Angeles Rams 1986']"
1,Steve Beuerlein,1987,100.0,85.0,80.3,[],"['Los Angeles Raiders 1988-1989', 'Dallas Cowboys 1991-1992', 'Phoenix Cardinals 1993-1994', 'Jacksonville Jaguars 1995', 'Carolina Panthers 1996-2000', 'Denver Broncos 2002-2003']"
2,Archie Manning,1971,100.0,94.0,67.1,[],"['New Orleans Saints 1971-1982', 'Houston Oilers 1982-1983', 'Minnesota Vikings 1983-1984']"
3,Brian Sipe,1972,100.0,87.0,74.8,[],['Cleveland Browns 1974-1983']
4,Andrew Luck,2012,100.0,80.0,89.5,['Colts'],['Indianapolis Colts 2012-2018']


Index(['player', 'draft_yr', 'rd', 'pick', 'teams', 'team_names', 'college',
       'earn_mils', 'yrs_play', 'games', 'wAV', 'AV', 'win_%', 'pass_rating',
       'hof_yes', 'hof_yr', 'all_star', 'SBs', 'att', 'comp', 'comp_%',
       'pass_yds', 'TD', 'YPA', 'pydsPG', 'int', 'int_%', 'sacks', 'pick_6',
       'GWD', '4QC', 'ht', 'wt', 'hand', 'record', 'wins', 'loss',
       'college_stats', 'hof_succ', 'wins_succ', 'awards_count', 'awards_succ',
       'stats_succ', 'earn_succ', 'long_succ', 'metrics_succ'],
      dtype='object')
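
The coercion step used for `AV` behaves as follows on a few toy strings (the values are made up):

```python
import pandas as pd

# Toy AV column as it might arrive from a CSV: numbers stored as text plus junk.
av_raw = pd.Series(["84", "85", "n/a", None], name="AV")

# errors="coerce" turns anything unparseable into NaN instead of raising.
av = pd.to_numeric(av_raw, errors="coerce")
print(av)
```

Note that filling the resulting NaN with 0 afterwards makes a missing AV indistinguishable from a real AV of 0.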

In [199]:
# * **nfl_agg_succ:** aggregate NFL success (stored as 'nfl_success'), based on a combination of the aforementioned success targets.
#   * e.g.: nfl_success = 1 for any player with a 1 in three or more of the other columns.


# Display the number of rows in cleaning_df
print(len(cleaning_df))

# Create nfl_succ_df with the specified columns.
# .copy() makes it an independent DataFrame, so adding a column does not raise SettingWithCopyWarning.
nfl_succ_df = cleaning_df[['player','draft_yr', 
                           'hof_succ', 'wins_succ', 
                           'awards_succ', 'stats_succ', 'earn_succ', 'long_succ', 'metrics_succ', 
                           'team_names' , 'teams', 'hof_yr']].copy()
success_columns = ['hof_succ', 'wins_succ', 'awards_succ', 'stats_succ', 'earn_succ', 'long_succ', 'metrics_succ']


# NFL success defined as having three or more of the various _succ categories: 
nfl_succ_df['nfl_success'] = (nfl_succ_df[success_columns].sum(axis=1) >= 3).astype(int)

# Move 'nfl_success' column right after 'draft_yr' column
cols = list(nfl_succ_df.columns)
draft_index = cols.index('draft_yr')
cols.insert(draft_index + 1, cols.pop(cols.index('nfl_success')))
nfl_succ_df = nfl_succ_df[cols]

# Sort by 'nfl_success' (successes first) and then by 'player'
nfl_succ_df = nfl_succ_df.sort_values(by=['nfl_success', 'player'], ascending=[False, True]).reset_index(drop=True)

display(nfl_succ_df['nfl_success'].value_counts())
display(nfl_succ_df.head())

637


nfl_success
0    403
1    234
Name: count, dtype: int64

Unnamed: 0,player,draft_yr,nfl_success,hof_succ,wins_succ,awards_succ,stats_succ,earn_succ,long_succ,metrics_succ,team_names,teams,hof_yr
0,Aaron Brooks,1999,1,0,0,0,1,0,1,1,[],0,0
1,Aaron Rodgers,2005,1,0,0,1,1,1,0,1,"['Packers', 'Jets']","['Green Bay Packers 2005-2022', 'New York Jets 2023-2024']",0
2,Al Dorow,1952,1,0,0,1,0,0,1,1,[],"['Washington Redskins 1954-1956', 'Philadelphia Eagles 1957', 'New York Titans 1960-1961', 'Buffalo Bills 1962']",0
3,Alex Smith,2005,1,0,0,0,1,1,0,1,"['49ers', 'Commanders', 'Chiefs']",0,0
4,Alex Tanney,2015,1,0,0,0,1,1,0,1,"['Giants', 'Titans', 'Cowboys', 'Colts', 'Bills']",0,0
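
Moving a column next to another one can also be done with `pop` and `insert`, without rebuilding the column list. A sketch on a one-row toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "player": ["QB A"],
    "draft_yr": [2012],
    "hof_succ": [0],
    "nfl_success": [1],
})

# Compute the target position first, then pop the column and re-insert it there.
target_pos = toy.columns.get_loc("draft_yr") + 1
toy.insert(target_pos, "nfl_success", toy.pop("nfl_success"))
print(list(toy.columns))
```

This works as written because `nfl_success` sits to the right of `draft_yr`, so removing it does not shift the target position.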


In [204]:


nan_rows = nfl_succ_df[nfl_succ_df['player'].isna()]

# Display the rows where 'player' column is NaN
display(nan_rows)

Unnamed: 0,player,draft_yr,nfl_success,hof_succ,wins_succ,awards_succ,stats_succ,earn_succ,long_succ,metrics_succ,team_names,teams,hof_yr


In [205]:
#Export csv of JUST the target/y labels:
file_out = "../Data_Final/nfl_success_labels.csv"
nfl_succ_df.to_csv(file_out, index=False)
print(f'saved csv to: {file_out}')

saved csv to: ../Data_Final/nfl_success_labels.csv


In [202]:
file_out = "../Data_Final/nfl_success_full_dataframe.csv"
cleaning_df.to_csv(file_out, index=False)
print(f'saved csv to: {file_out}')

saved csv to: ../Data_Final/nfl_success_full_dataframe.csv
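
After exporting, a quick round-trip check can confirm that the label columns are still strictly 0/1 when read back. This sketch uses a temporary directory and a hypothetical file name rather than the real output paths.

```python
import os
import tempfile

import pandas as pd

# Toy labels standing in for the exported frame.
labels = pd.DataFrame({"player": ["QB A", "QB B"], "nfl_success": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels_check.csv")  # hypothetical file name
    labels.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Every label column should contain only 0 or 1 after the round trip.
label_cols = ["nfl_success"]
is_binary = bool(reloaded[label_cols].isin([0, 1]).all().all())
print(is_binary, len(reloaded))
```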


# THESE CELLS NOT USED

In [176]:
# base_cols = ['player', 'team', 'draft']
# hof_cols = ['hof']
# wins_cols = ['win_succ', 'wins', 'loss', 'win_%']
# awards_cols = ['awards_succ', 'awards_count', 'all_star']
# stats_cols = ['stats_succ' , 'comp', 'comp_%', 'pass_yds', 'TD', 'YPA', 'pydsPG']
# earn_cols = ['earn_succ', 'earn_mils' ]
# long_cols= ['long_succ', 'yrs_play', 'games']
# metrics_cols = ['metrics_succ' ,'wAV', 'AV', 'pass_rating']
# draft_cols =  ['draft_succ', 'rd', 'pick']