<h1 style="color:blue; font-family:'Tahoma';text-align:center"> Predictive Defense Analysis on NFL Data </h1>

<img src="https://cdn.pixabay.com/photo/2015/03/07/21/42/stadium-663584_1280.jpg">
<hr>

<h2 style="color:green;font-family:Tahoma">1. Identifying Zone vs Man Coverage</h2>

A fundamental task in this challenge was to identify which coverage types defenses were employing. Traditionally, teams line up in either zone or man coverage. In zone coverage, the defense covers different parts of the field by assigning each player to a zone. In contrast, in man coverage, each defensive player is responsible for covering one offensive player.

There are many variations of zone and man coverages, and defenses frequently disguise their coverages to confuse opposing offenses. 

Because the data in this challenge was so detailed, we realized that we could use the defensive player data to predict whether the defense was in man or zone coverage. We used unsupervised machine learning to group the observations into two clusters, which we then interpreted as man and zone coverage.

To accomplish this, we generated (with help from some excellent public notebooks) new features that capture the variance in distances and relative speeds between defensive and offensive players (`varX`, `varY`, `varS`, `oppVar`, `oppMean`, `mateMean`, `oppDirVar`, `oppDirMean`, `meanOppMateDistRatio`, `varOppMateDistRation`), to better grasp how frequently the defense employed each coverage scheme.
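
As a rough illustration of this kind of feature engineering, the sketch below computes a few per-defender statistics from a toy tracking table. The column names `x`, `y`, `s`, and `opp_dist`, and the reading of `oppVar` and `oppMean` as the variance and mean of the distance to the nearest opponent, are assumptions made for this example; they are not taken from our actual pipeline.

```python
import pandas as pd

# Hypothetical tracking frames: one row per defender per frame.
# Column names (x, y, s, opp_dist) are assumptions for illustration only.
tracking = pd.DataFrame({
    "playId":   [1, 1, 1, 1, 1, 1],
    "nflId":    [10, 10, 10, 20, 20, 20],
    "x":        [30.0, 31.0, 32.0, 40.0, 40.5, 40.0],
    "y":        [20.0, 20.5, 21.0, 10.0, 10.0, 10.5],
    "s":        [2.0, 4.0, 6.0, 1.0, 1.0, 1.0],
    "opp_dist": [1.0, 1.2, 0.8, 5.0, 7.0, 9.0],  # distance to nearest offensive player
})

# One row of features per (play, defender): sample variances and a mean.
features = (
    tracking.groupby(["playId", "nflId"])
    .agg(
        varX=("x", "var"),
        varY=("y", "var"),
        varS=("s", "var"),
        oppVar=("opp_dist", "var"),
        oppMean=("opp_dist", "mean"),
    )
    .reset_index()
)
print(features)
```

Each defender on each play collapses to a single feature row, which is the shape a clustering model expects.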

We then used K-Means, Mini Batch K-Means, and Gaussian Mixture models to classify zone or man coverage, and used our algorithms’ predictions as outputs. As seen below, the models all say the same thing: defenses primarily employ zone coverage schemes.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff

In [2]:
plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")
coverage_preds = pd.read_csv("../input/nflcoveragepredictions/coverage_preds.csv")

all_dbs = pd.read_csv("../input/nflbdbtopplayers/all_dbs.csv")
all_lbs = pd.read_csv("../input/nflbdbtopplayers/all_lbs.csv")
all_top = pd.read_csv("../input/nflbdbtopplayers/all_top.csv")

<h3 style="color:gray">1.1 Using K-Means for Clustering</h3>

We used a simple K-Means algorithm to cluster zone vs. man coverage. First, we cleaned our data by removing duplicate and NaN values. We then standardized our data by subtracting the mean from each value and dividing by the standard deviation. Next, we fit a K-Means model with two clusters. From our model’s output, we counted 14,951 values for man coverage and 64,100 values for zone coverage.
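
The steps above (clean, standardize, cluster) can be sketched on synthetic data as follows. The two synthetic groups, the three feature columns, and the `n_init` and `random_state` settings are assumptions chosen to make the example reproducible; the full code we ran is in the Appendix.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered features: two well-separated groups.
group_a = rng.normal(loc=0.0, scale=0.5, size=(80, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(20, 3))
train = pd.DataFrame(np.vstack([group_a, group_b]), columns=["varX", "varY", "varS"])

# Clean, then z-score each column: (value - mean) / standard deviation.
train = train.dropna().drop_duplicates().reset_index(drop=True)
scaled = (train - train.mean()) / train.std()

# Fit K-Means with two clusters and count the members of each cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(pd.Series(labels).value_counts())
```

Standardizing first matters because K-Means relies on Euclidean distance, so a feature on a larger scale would otherwise dominate the clustering.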


Our output had one dataframe for man coverage and another for zone coverage. To better understand the different coverage options, we applied statistical analysis to each. Through this process, we found a key insight: the average variances in x-direction, y-direction, and speed were much higher in the zone cluster than in the man cluster. We also noted that there was more player movement in the zone cluster than in the man cluster during plays.
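
A comparison of this kind amounts to a group-wise mean over the cluster labels. The sketch below uses made-up numbers and a hypothetical `cluster` column purely to show the mechanics.

```python
import pandas as pd

# Hypothetical per-defender features with a cluster label already attached.
clustered = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "varX":    [0.2, 0.4, 0.3, 2.0, 3.0],
    "varY":    [0.1, 0.1, 0.4, 1.5, 2.5],
    "varS":    [0.5, 0.7, 0.6, 4.0, 6.0],
})

# Average each feature within each cluster to build a cluster profile.
profile = clustered.groupby("cluster")[["varX", "varY", "varS"]].mean()
print(profile)
```

In this toy table, cluster 1 has the higher mean variance on every feature, which is the pattern we read as zone coverage in our own results.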

In [3]:
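# Bar Chart for KMeans
# value_counts() sorts by count in descending order, so the larger cluster
# (interpreted as zone coverage) comes first and matches the label order below.
x = ['Zone', 'Man']
y = coverage_preds['kmeans'].value_counts().tolist()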

fig = go.Figure(data=[go.Bar(x = x, y = y, hovertext=['Zone Coverage', 'Man Coverage'])])

fig.update_traces(marker_color='rgb(255, 84, 0)', marker_line_color='rgb(20, 21, 25)', marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Zone Coverage vs Man Coverage [K-Means]')
fig.show()

<h3 style="color:gray">1.2 Using Mini Batch K-Means for Clustering</h3>

Additionally, we chose to cluster with Mini Batch K-means, knowing that this method works well on datasets with more than 10,000 samples. We cleaned the data and ran the predictions with two clusters. Similar to our findings with the simple K-means method, we saw that there were 63,800 values in our zone coverage cluster and 15,251 in our man coverage cluster. 
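
For reference, a minimal Mini Batch K-Means sketch on synthetic data is shown below. The data, `batch_size=256`, `n_init=3`, and `random_state=0` are arbitrary choices for the example rather than the settings we used.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(1)

# Synthetic data: a large group and a small group, far apart.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(800, 2)),
    rng.normal(6.0, 0.5, size=(200, 2)),
])

# Mini Batch K-Means updates the centroids from small random batches
# instead of the full dataset on every iteration.
model = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=3, random_state=0)
labels = model.fit_predict(X)
counts = np.bincount(labels)
print(counts)
```

Because each update only touches one batch, this variant trades a little accuracy for lower memory use and faster fitting on large datasets.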

In [4]:
# Bar Chart
x = ['Zone', 'Man']
y = coverage_preds['mini_kmeans'].value_counts().tolist()

fig = go.Figure(data=[go.Bar(x = x, y = y,
            hovertext=['Zone Coverage', 'Man Coverage'])])

fig.update_traces(marker_color='rgb(0, 51, 255)', marker_line_color='rgb(20, 21, 25)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Zone Coverage vs Man Coverage [Mini Batch K-Means]')
fig.show()

<h3 style="color:gray">1.3 Using Gaussian Mixture Models for Clustering</h3>

Finally, we used a Gaussian Mixture model to gain more clarity on the differences between the two subpopulations we are looking at: man and zone. 

We built a simple model to generate predictions regarding defensive coverage schemes. Our model found 62,156 plays where zone coverage was used and 16,895 plays where man coverage was used; this finding is in accord with the other modeling methods discussed above.
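
A minimal Gaussian Mixture sketch, again on synthetic data with assumed settings, is shown below. Unlike K-Means, the model also exposes a soft assignment through `predict_proba`, which gives the probability that each sample belongs to each component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic data: two well-separated subpopulations.
X = np.vstack([
    rng.normal(0.0, 0.5, size=(150, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment: most likely component
probs = gmm.predict_proba(X)   # soft assignment: one probability per component
print(np.bincount(labels))
```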

In [5]:
# Bar Chart
x = ['Zone', 'Man']
y = coverage_preds['gmm'].value_counts().tolist()

fig = go.Figure(data=[go.Bar(x = x, y = y,
            hovertext=['Zone Coverage', 'Man Coverage'])])

fig.update_traces(marker_color='rgb(191, 51, 255)', marker_line_color='rgb(20, 21, 25)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Zone Coverage vs Man Coverage [Gaussian Mixture Models]')
fig.show()

<h3 style="color:gray">1.4 Conclusion</h3>

We learned that teams primarily employ zone defensive schemes, where the players line up further off the ball, allowing more distance between the offensive and defensive players. 

Having broken down each play into subcomponents with detailed information about each player’s directional positions and speed, we were able to build models that classify differences across coverage schemes. 

All three models were in agreement that roughly 80% of plays correspond to one type of coverage scheme and the remaining 20% to another scheme. These findings, coupled with our new generated features, led us to infer that the more frequently employed scheme was zone coverage.
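
One caveat when checking agreement between clustering models is that cluster ids are arbitrary: one model may call a group `0` while another calls the same group `1`. The sketch below, using a hypothetical helper `agreement` and made-up labels, measures agreement between two 2-cluster labelings while allowing for that swap.

```python
def agreement(labels_a, labels_b):
    """Share of samples on which two 2-cluster labelings agree,
    after allowing for the arbitrary 0/1 naming of clusters."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
    # If the ids are swapped, the raw match rate is low and its complement is high.
    return max(same, 1.0 - same)


# Made-up labels: the same split with swapped ids and one disagreement.
kmeans_labels = [0, 0, 0, 1, 1, 0, 0, 1]
gmm_labels = [1, 1, 1, 0, 0, 1, 0, 0]
print(agreement(kmeans_labels, gmm_labels))  # 0.875
```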

<h2 style="color:green;font-family:Tahoma">2. Drawing Insights from Defensive Player Performance Data</h2>

<h3 style="color:gray">2.1 Parsing Descriptions Text to Evaluate Player Defensive Performance</h3>

Quantifying individual players’ defensive performance is difficult and inherently ambiguous.
With this in mind, we decided to attribute sacks, forced fumbles, and interceptions to each defensive player, and then use those counts as features to model each player’s performance.

To do this, we wrote an algorithm that parsed each play description and stored each player’s name, along with a running total for each statistic, in a dictionary.

We transformed the dictionary into a dataframe, and then used this data to uncover insights regarding the top performers. 

**Summing up the total number of interceptions, forced fumbles, and sacks, we assigned an effective total defensive score to each player.** 
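
The idea can be sketched with regular expressions and counters. The play descriptions and the patterns below are illustrative assumptions written in the general style of play-by-play text; the parser we actually ran is the token-based one in the Appendix.

```python
import re
from collections import Counter

# Illustrative descriptions; not rows from plays.csv.
descriptions = [
    "(12:05) M.Ryan pass short right intended for J.Jones INTERCEPTED by T.Johnson at PHI 20.",
    "(8:41) (Shotgun) M.Ryan sacked at ATL 15 for -7 yards (C.Jones).",
    "(3:10) D.Carr sacked at OAK 30 for -5 yards (C.Jones).",
    "(1:55) D.Carr pass deep left INTERCEPTED by T.Johnson at DEN 40.",
    "(5:00) R.Wilson sacked at SEA 25 for -8 yards (V.Miller). FUMBLES (V.Miller), RECOVERED by DEN-D.Wolfe at SEA 23.",
    "(0:30) J.Flacco pass incomplete short middle to M.Crabtree.",
]

# Assumed patterns: the defender's short name follows each event marker.
NAME = r"([A-Z]\.[A-Za-z'\-]+)"
INT_RE = re.compile(r"INTERCEPTED by " + NAME)
SACK_RE = re.compile(r"sacked .*?yards? \(" + NAME + r"\)")
FUM_RE = re.compile(r"FUMBLES \(" + NAME + r"\)")

ints, sacks, fumbles = Counter(), Counter(), Counter()
for desc in descriptions:
    ints.update(INT_RE.findall(desc))
    sacks.update(SACK_RE.findall(desc))
    fumbles.update(FUM_RE.findall(desc))

# Total defensive score = interceptions + forced fumbles + sacks.
score = ints + fumbles + sacks
print(score.most_common())
```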

We then ranked these players in descending order, as seen in the table below:

In [6]:
all_top['age'] = all_top['age'].round(1)
all_top.head(20)

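| | name | age | weight | height | ints | score |
|---|---|---|---|---|---|---|
| 0 | C.Jones | 30.9 | 265 | 77 | 2 | 25 |
| 1 | C.Jones | 33.2 | 205 | 72 | 2 | 25 |
| 2 | C.Jones | 27.1 | 200 | 70 | 2 | 25 |
| 3 | C.Jones | 29.9 | 250 | 75 | 2 | 25 |
| 4 | J.Pierre-Paul | 32.0 | 275 | 77 | 0 | 11 |
| 5 | V.Miller | 31.8 | 250 | 75 | 1 | 11 |
| 6 | K.Fackrell | 29.1 | 245 | 77 | 0 | 10 |
| 7 | T.Johnson | 24.5 | 192 | 71 | 5 | 10 |
| 8 | T.Watt | 26.3 | 252 | 76 | 0 | 10 |
| 9 | D.Leonard | 25.5 | 230 | 74 | 2 | 10 |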


Building off of the aforementioned parsing algorithm and resultant data, we had a final dataframe with these player statistics: *each player name, height, age, total interceptions, total forced fumbles, and total sacks recorded.* 

Now we could model what player attributes contributed to or correlated with certain statistical outcomes. Through this analysis, we aimed to answer "Is there anything about a player – for example, their height, weight, experience, speed, or position – that can be used to predict their performance on defense?"

Knowing that these player attributes (height, weight, and age) vary widely across positions, we decided to split the data into subsets by position type.

On closer inspection, we found that nearly all of the top players in the above table were linebackers. This bias arises because forced fumbles and sacks are far more numerous than interceptions.

As a result, we decided it would be best to separate our analysis into two distinct groups: `Defensive Backs` and `Linebackers`.

<h3 style="color:gray">2.2 Defensive Back Insights</h3>

First, we selected all defensive backs in our data: players with the positions `CB`, `SS`, `FS`, `S`, and `DB`.

After building this defensive back subset, we analyzed the data to see what correlated with strong defensive performance. 

First, we calculated correlations between each player attribute and the number of interceptions, forced fumbles, and sacks.
Since we are looking exclusively at defensive backs, we decided that it was most important to analyze the attributes most strongly associated with the number of interceptions. Our findings are detailed below.
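
The subsetting and correlation steps can be sketched as follows. The player table here is made up for illustration, and `corrwith` is one convenient way to correlate several attributes with a single outcome column.

```python
import pandas as pd

# Hypothetical player table; values are made up for illustration.
players = pd.DataFrame({
    "position": ["CB", "SS", "FS", "OLB", "CB", "S", "ILB"],
    "height":   [70, 72, 73, 76, 71, 74, 75],
    "weight":   [190, 205, 210, 250, 195, 212, 240],
    "age":      [24.5, 27.1, 29.7, 30.9, 25.2, 28.2, 26.3],
    "ints":     [5, 4, 2, 0, 4, 1, 1],
})

# Keep defensive backs only, then correlate each attribute with interceptions.
db_positions = ["CB", "SS", "FS", "DB", "S"]
dbs = players[players["position"].isin(db_positions)]
corr_with_ints = dbs[["height", "weight", "age"]].corrwith(dbs["ints"])
print(corr_with_ints)
```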

<h3 style="color:purple">Preliminary Descriptive Statistics - DEFENSIVE BACKS</h3>
<br>

* Median Age: 28.8 years
* Median Height: 71.0 inches (5’-11”)
* Median Weight: 200.0 pounds

<br>
The medians and the means were almost identical, with negligible differences between the two. To avoid redundancy, we only included the medians here.

<h4>Major Insights:</h4>

1. We found that among defensive backs who were **taller than 74 inches** (6’-2”), height was **negatively correlated with the number of interceptions** (Correlation = -0.74).

2. We also found that among defensive backs who **weighed more than 215 pounds**, age was **negatively correlated with the number of sacks** (Correlation = -0.77).

3. Finally, in defensive backs who were **younger than 25 years old**, there was **a slight positive correlation between height and interceptions** (Correlation = 0.37).

Shown below is the list of top defensive backs, ranked by interceptions:

In [7]:
all_dbs['age'] = all_dbs['age'].round(1)
all_dbs.rename(columns={'ints':'interceptions'}, inplace=True)
all_dbs.head(20)

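| | name | age | weight | height | interceptions |
|---|---|---|---|---|---|
| 0 | E.Jackson | 27.1 | 202 | 72 | 6 |
| 1 | T.Johnson | 24.5 | 192 | 71 | 5 |
| 2 | T.Johnson | 31.0 | 213 | 74 | 5 |
| 3 | J.Poyer | 29.7 | 191 | 72 | 4 |
| 4 | D.Jackson | 25.2 | 180 | 70 | 4 |
| 5 | K.Byard | 27.4 | 212 | 71 | 4 |
| 6 | B.Poole | 28.2 | 211 | 70 | 3 |
| 7 | M.Adams | 39.8 | 205 | 61 | 3 |
| 8 | M.Butler | 30.9 | 190 | 71 | 3 |
| 9 | D.James | 24.4 | 215 | 74 | 3 |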


<h3 style="color:gray">2.3 Linebacker Insights</h3>

<h3 style="color:purple">Preliminary Descriptive Statistics - LINEBACKERS</h3>
<br>

* Median Age: 28.2 years
* Median Height: 74.0 inches (6’-2”)
* Median Weight: 245.0 pounds


<br>
Like the defensive backs, the medians and means for linebackers were almost identical, with negligible differences between the two. To avoid redundancy, we only included the medians here.

<h4>Major Insights:</h4>

1. We found that among linebackers **younger than 25**, age was **negatively correlated with the number of interceptions** (Correlation = -0.87).

2. We also found that the heaviest linebackers’ (**those who weighed 268 pounds or more**) ages were nearly perfectly correlated with their number of sacks (Correlation = 0.95); likewise, their heights were considerably positively correlated with their number of forced fumbles (Correlation = 0.70).

In [8]:
all_lbs = all_lbs.drop(['ints'], axis=1)
all_lbs['age'] = all_lbs['age'].round(1)
all_lbs.head(20)

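| | name | age | weight | height | score |
|---|---|---|---|---|---|
| 0 | C.Jones | 30.9 | 265 | 77 | 25 |
| 1 | C.Jones | 29.9 | 250 | 75 | 25 |
| 2 | J.Pierre-Paul | 32.0 | 275 | 77 | 11 |
| 3 | V.Miller | 31.8 | 250 | 75 | 11 |
| 4 | T.Watt | 26.3 | 252 | 76 | 10 |
| 5 | T.Johnson | 25.9 | 232 | 72 | 10 |
| 6 | K.Fackrell | 29.1 | 245 | 77 | 10 |
| 7 | D.Leonard | 25.5 | 230 | 74 | 10 |
| 8 | R.Kerrigan | 32.4 | 265 | 76 | 9 |
| 9 | J.Brown | 26.9 | 221 | 71 | 8 |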


<h2 style="color:green;font-family:Tahoma">Conclusion and Technical Challenges</h2>

Throughout this challenge, our team faced a few technical challenges worth noting.

When answering “What are coverage schemes that the defense employs?”, we were confined to data from three weeks (1, 3, and 5). This was due to memory constraints and various errors in our debugging process. 

Nevertheless, we still strongly believe that our overall conclusion is accurate and applicable throughout the entire season.

Another issue our team faced was that the play descriptions fed into the parsing algorithm used a different naming convention (abbreviated names such as `C.Jones`) than the players.csv file, which stores full display names.
As a result, some player names failed to match when we joined the parsed results with the player attributes to create the linebacker and defensive back data frames.
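
A slightly more defensive conversion than the one in the Appendix might look like the sketch below. The helpers `short_name` and `find_collisions`, the suffix list, and the assumption that play descriptions use the first initial plus the final surname token are all hypothetical and would need checking against the real data.

```python
from collections import defaultdict

# Assumed generational suffixes to ignore when picking the surname.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def short_name(display_name):
    """Convert 'First Last' to 'F.Last', ignoring generational suffixes."""
    parts = [p for p in display_name.split() if p.strip(".").lower() not in SUFFIXES]
    if len(parts) < 2:
        return display_name  # nothing to abbreviate
    return f"{parts[0][0]}.{parts[-1]}"


def find_collisions(display_names):
    """Group full names that map to the same short name."""
    groups = defaultdict(list)
    for name in display_names:
        groups[short_name(name)].append(name)
    return {short: names for short, names in groups.items() if len(names) > 1}


print(short_name("Jason Pierre-Paul"))  # J.Pierre-Paul
print(find_collisions(["Chandler Jones", "Chris Jones", "Von Miller"]))
```

Collisions matter here because different players can share a short name, and a name-only join then attaches one player's statistics to all of them.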

Lastly, when we explored correlations between attributes and defensive performance among linebackers and defensive backs, some of our conditional filters on attributes (height, weight, age) yielded data frames with fewer than 10 observations, which could make some correlations unreliable. We felt that despite these small sample sizes, because the other attributes were relatively close in value and each data frame was inherently a subset of a subset, the correlations were worth noting.
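
To give a sense of how much uncertainty a small sample carries, the sketch below computes an approximate 95% confidence interval for a Pearson correlation using the Fisher z-transform. The helper `pearson_ci` is hypothetical, and the sample size `n=8` is an assumed value for illustration, not a count from our data.

```python
import math


def pearson_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation
    using the Fisher z-transform (requires n > 3)."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r = -0.74 with a hypothetical sample of 8 players.
low, high = pearson_ci(-0.74, n=8)
print(round(low, 2), round(high, 2))
```

With so few observations the interval is wide, so a strong-looking correlation is compatible with a much weaker true relationship.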

Ultimately, we believe that our analysis is strong, even with these various technical challenges. 
We all agree that tackling these issues head-on enabled all of us to grow as data scientists and sports analytics enthusiasts.
We hope that our findings are of use, and welcome any feedback or suggestions.

<h1 style="color:blue;font-family:'Tahoma'">Appendix</h1>

Our analysis on man vs zone coverage was inspired by Andika Rachman's [Notebook on Identifying Coverage Scheme Among Defense Backs](https://www.kaggle.com/ar2017/identifying-coverage-scheme-among-defensive-backs).

In [9]:
# ### FOR DEFENSIVE PERFORMANCE BY EVENTS
# #######################################

# from tqdm import tqdm  # progress bar used in the loops below

# plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")
# players = pd.read_csv("../input/nfl-big-data-bowl-2021/players.csv")

# # First, make a list of smaller player names
# player_names = list(players['displayName'].tolist())

# short_player_names = []

# for i, _ in enumerate(player_names):
#     short_player_names.append(player_names[i].split()[0][0] + "." + player_names[i].split()[1])

# players['displayName'] = short_player_names

# def add_dict_entry(dct, entry, other=None):
#     """
#     Check if a player is already in the dictionary; if not, add it; if so, increment its value
#     """
#     if entry in dct:
#         if not other:
#             dct[entry] += 1
#         else:
#             dct[entry] += other
#     else:
#         if not other:
#             dct[entry] = 1
#         else:
#             dct[entry] = other
#     return dct

# # Now let's parse through the description and get the important data
# player_intercepted = {}
# player_fumbles = {}
# player_incomplete = {}
# player_sacked = {}

# # Add a custom progress bar to track the progress
# prog_bar = tqdm(plays['playDescription'], total=len(plays))
# for desc in prog_bar:
#     # Deal with INTERCEPTIONS
#     if "INTERCEPTED" in desc:
#         # The Index of player name is exactly 2 indexes away from the "INTERCEPTED" marker
#         split_desc = desc.split()
#         name_idx = split_desc.index("INTERCEPTED") + 2
#         # Add the entry to the intercepted dict
#         player_intercepted = add_dict_entry(player_intercepted, str(split_desc[name_idx]))
    
#     # Deal with FUMBLES
#     elif "FUMBLES" in desc:
#         if "FUMBLES," in desc:
#             continue
#         # The Index of the player who did fumble is just the next one but it's in parenthesis (remove it)
#         split_desc = desc.split()
#         name_idx = split_desc.index("FUMBLES") + 1
#         name = str(split_desc[name_idx][1:-2])
#         # Add the entry into the fumbles dict
#         player_fumbles = add_dict_entry(player_fumbles, name)
    
#     # Deal with Sacks
#     elif "sack" in desc:
#         # Split the current description
#         split = desc.split()
#         # Set a flag; when "sack*" passes, it will be true
#         flag = False
#         # Go through all elements of current description
#         for j, _ in enumerate(split):
#             prev_item = split[j-1]
#             current_item = split[j]
            
#             # If it's sack, make the flag true
#             if "sack" in current_item:
#                 flag = True
            
#             # If the current word is "yard*" and the flag is True (sack* has passed)
#             # Then get the -ve yardage number (which should be the previous item now) and add into dict
#             if current_item == "yards" or current_item == "yard":
#                 if flag is not True:
#                     continue
#                 else:
#                     try:
#                         # numb = int(prev_item)
#                         name = str(split[j+1][1:-2])
#                         player_sacked = add_dict_entry(player_sacked, name)
#                     except IndexError:
#                         continue

# # Convert the dictionaries into dataframes, one by one
# int_players, int_scores = [], []
# for info in list(player_intercepted.items()):
#     int_players.append(info[0])
#     int_scores.append(info[1])

# fum_players, fum_scores = [], []
# for info in list(player_fumbles.items()):
#     fum_players.append(info[0])
#     fum_scores.append(info[1])

# sac_players, sac_scores = [], []
# for info in list(player_sacked.items()):
#     sac_players.append(info[0])
#     sac_scores.append(info[1])
    
# intercepted = pd.DataFrame()
# intercepted['player'] = int_players
# intercepted['score'] = int_scores

# fumbles = pd.DataFrame()
# fumbles['player'] = fum_players
# fumbles['score'] = fum_scores

# sacks = pd.DataFrame()
# sacks['player'] = sac_players
# sacks['score'] = sac_scores

# fumbles = fumbles.drop([0]).reset_index(drop=True)
# sacks = sacks.drop([2]).reset_index(drop=True)

# short_player_names = []

# for i, _ in enumerate(player_names):
#     short_player_names.append(player_names[i].split()[0][0] + "." + player_names[i].split()[1])

# players['displayName'] = short_player_names

# all_pos = ['CB', 'SS', 'FS', 'DB', 'MLB', 'LB', 'OLB', 'ILB', 'S']
# lbs = ['MLB', 'LB', 'OLB', 'ILB']
# dbs = ['CB', 'SS', 'FS', 'DB', 'S']
# all_df = players.loc[players['position'].isin(all_pos)]
# lbs_df = players.loc[players['position'].isin(lbs)]
# dbs_df = players.loc[players['position'].isin(dbs)]

# # Player matching algorithm
# # We first make a big dataframe consisting of players and their scores in fumbles, interceptions and sacks, all in one place.
# all_scores = pd.DataFrame()

# in_score, in_player = [], []
# fu_score, fu_player = [], []
# sa_score, sa_player = [], []

# prog_bar = tqdm(sacks.iterrows(), total=len(sacks))

# for x in prog_bar:
#     match_fu = False
#     match_in = False
    
#     sa_score.append(x[1]['score'])
#     sa_player.append(x[1]['player'])
    
#     for y in fumbles.iterrows():
#         if str(x[1]['player']) == str(y[1]['player']):
#             match_fu = True
#             fu_score.append(y[1]['score'])
#             fu_player.append(y[1]['player'])
    
#     if match_fu == False:
#         fu_score.append(0)
#         fu_player.append(str(x[1]['player']))
    
#     for z in intercepted.iterrows():
#         if str(x[1]['player']) == str(z[1]['player']):
#             match_in = True
#             in_score.append(z[1]['score'])
#             in_player.append(z[1]['player'])
    
#     if match_in == False:
#         in_score.append(0)
#         in_player.append(str(x[1]['player']))
        
# all_scores['player'] = sa_player
# all_scores['interception'] = in_score
# all_scores['fumbles'] = fu_score
# all_scores['sacks'] = sa_score

# def change_height(string):
#     """
#     string: '6-2', etc
#     """
#     if "-" in string:
#         # Split on the dash so two-digit inches (e.g. '5-11') are parsed correctly
#         feet, inches = string.split("-")
#         return int(feet) * 12 + int(inches)
#     else:
#         return int(string)
    
# def calc_age(delta):
#     """Convert a pandas Timedelta (age in days) to age in years."""
#     return delta.days / 365

# all_df = all_df.drop_duplicates()
# lbs_df = lbs_df.drop_duplicates()
# dbs_df = dbs_df.drop_duplicates()

# all_df['height'] = all_df['height'].apply(change_height)
# all_df['birthDate'] = pd.to_datetime(all_df['birthDate'])
# all_df['current_date'] = pd.to_datetime("2021-01-06")
# all_df['age_days'] = all_df['current_date'] - all_df['birthDate']
# all_df['age'] = all_df['age_days'].apply(calc_age)

# lbs_df['height'] = lbs_df['height'].apply(change_height)
# lbs_df['birthDate'] = pd.to_datetime(lbs_df['birthDate'])
# lbs_df['current_date'] = pd.to_datetime("2021-01-06")
# lbs_df['age_days'] = lbs_df['current_date'] - lbs_df['birthDate']
# lbs_df['age'] = lbs_df['age_days'].apply(calc_age)

# dbs_df['height'] = dbs_df['height'].apply(change_height)
# dbs_df['birthDate'] = pd.to_datetime(dbs_df['birthDate'])
# dbs_df['current_date'] = pd.to_datetime("2021-01-06")
# dbs_df['age_days'] = dbs_df['current_date'] - dbs_df['birthDate']
# dbs_df['age'] = dbs_df['age_days'].apply(calc_age)

# all_df_new = all_df.drop(['birthDate', 'collegeName', 'nflId', 'current_date', 'age_days', 'position'], axis=1)

# lbs_df_new = lbs_df.drop(['birthDate', 'collegeName', 'nflId', 'current_date', 'age_days', 'position'], axis=1)

# dbs_df_new = dbs_df.drop(['birthDate', 'collegeName', 'nflId', 'current_date', 'age_days', 'position'], axis=1)

# # Quick Algo for player matching - LBs
# players_h = []
# players_w = []
# players_a = []
# names = []
# interceptions = []
# fumbles = []
# sacks = []

# prog_bar = tqdm(all_scores.iterrows(), total=len(all_scores))
# for player_2 in prog_bar:
#     for player in lbs_df_new.iterrows():
#         if str(player[1]['displayName']) == str(player_2[1]['player']):
#             players_h.append(player[1]['height'])
#             players_w.append(player[1]['weight'])
#             players_a.append(player[1]['age'])
            
#             names.append(str(player_2[1]['player']))
#             interceptions.append(player_2[1]['interception'])
#             fumbles.append(player_2[1]['fumbles'])
#             sacks.append(player_2[1]['sacks'])

# final_df_lbs = pd.DataFrame()
# final_df_lbs['name'] = names
# final_df_lbs['age'] = players_a
# final_df_lbs['weight'] = players_w
# final_df_lbs['height'] = players_h
# final_df_lbs['ints'] = interceptions
# final_df_lbs['fums'] = fumbles
# final_df_lbs['sack'] = sacks

# # Quick Algo for player matching - DBs
# players_h = []
# players_w = []
# players_a = []
# names = []
# interceptions = []
# fumbles = []
# sacks = []

# prog_bar = tqdm(all_scores.iterrows(), total=len(all_scores))
# for player_2 in prog_bar:
#     for player in dbs_df_new.iterrows():
#         if str(player[1]['displayName']) == str(player_2[1]['player']):
#             players_h.append(player[1]['height'])
#             players_w.append(player[1]['weight'])
#             players_a.append(player[1]['age'])
            
#             names.append(str(player_2[1]['player']))
#             interceptions.append(player_2[1]['interception'])
#             fumbles.append(player_2[1]['fumbles'])
#             sacks.append(player_2[1]['sacks'])

# final_df_db = pd.DataFrame()
# final_df_db['name'] = names
# final_df_db['age'] = players_a
# final_df_db['weight'] = players_w
# final_df_db['height'] = players_h
# final_df_db['ints'] = interceptions
# final_df_db['fums'] = fumbles
# final_df_db['sack'] = sacks

# # Quick Algo for player matching - ALL positions
# players_h = []
# players_w = []
# players_a = []
# names = []
# interceptions = []
# fumbles = []
# sacks = []

# prog_bar = tqdm(all_scores.iterrows(), total=len(all_scores))
# for player_2 in prog_bar:
#     for player in all_df_new.iterrows():
#         if str(player[1]['displayName']) == str(player_2[1]['player']):
#             players_h.append(player[1]['height'])
#             players_w.append(player[1]['weight'])
#             players_a.append(player[1]['age'])
            
#             names.append(str(player_2[1]['player']))
#             interceptions.append(player_2[1]['interception'])
#             fumbles.append(player_2[1]['fumbles'])
#             sacks.append(player_2[1]['sacks'])

# final_df_all = pd.DataFrame()
# final_df_all['name'] = names
# final_df_all['age'] = players_a
# final_df_all['weight'] = players_w
# final_df_all['height'] = players_h
# final_df_all['ints'] = interceptions
# final_df_all['fums'] = fumbles
# final_df_all['sack'] = sacks

# # Ranking all 3 types (ALL, LBs, DBs) by score
# # For ALL and LBs -> Total Score; For DBs -> Interceptions only

# # For All
# final_new_all = final_df_all.copy(deep=True)
# final_new_all = final_df_all.drop(['fums', 'sack'], axis=1)
# total_score = []
# for x in final_df_all.iterrows():
#     total_score.append(x[1]['ints'] + x[1]['fums'] + x[1]['sack'])
# final_new_all['score'] = total_score

# # For LBs
# final_new_lb = final_df_lbs.copy(deep=True)
# final_new_lb = final_df_lbs.drop(['fums', 'sack'], axis=1)
# total_score = []
# for x in final_df_lbs.iterrows():
#     total_score.append(x[1]['ints'] + x[1]['fums'] + x[1]['sack'])
# final_new_lb['score'] = total_score
    
# # For DBs
# final_new_db = final_df_db.copy(deep=True)
# final_new_db = final_new_db.drop(['fums', 'sack'], axis=1)

# # Store the rankings in different csvs
# all_top_players = final_new_all.sort_values(by='score', ascending=False).head(20).reset_index(drop=True)
# lbs_top_players = final_new_lb.sort_values(by='score', ascending=False).head(20).reset_index(drop=True)
# dbs_top_players = final_new_db.sort_values(by='ints', ascending=False).head(20).reset_index(drop=True)

# all_top_players.to_csv("all_top.csv", index=None)
# lbs_top_players.to_csv("all_lbs.csv", index=None)
# dbs_top_players.to_csv("all_dbs.csv", index=None)

In [10]:
# ### FOR DEFENSIVE COVERAGE SCHEME PREDICTION
# ############################################


# from sklearn.cluster import KMeans, MiniBatchKMeans
# from sklearn.mixture import GaussianMixture

# plays = pd.read_csv("../input/nfl-big-data-bowl-2021/plays.csv")
# modified_week = pd.read_csv("../input/nfl-2021-modified-data/week_modified_processed.csv")
# players = pd.read_csv("../input/nfl-big-data-bowl-2021/players.csv")

# # Choose features for Unsupervised Learning
# # Using Secondary defensive backs
# df_backs = ['CB', 'SS', 'FS', 'DB', 'S']
# all_pos = ['CB', 'SS', 'FS', 'DB', 'MLB', 'LB', 'OLB', 'ILB', 'S']

# train_data = modified_week.loc[modified_week['position'].isin(all_pos)]
# train_data = train_data[train_data.columns[30:].tolist()]
# train_data = train_data.reset_index(drop=True)

# # Before modelling, let's drop NaN and duplicate values
# train_data = train_data.dropna()
# train_data = train_data.drop_duplicates()
# train_data = train_data.reset_index(drop=True)

# # And also do Standardization on the training data
# train_data_scaled = (train_data - train_data.mean()) / (train_data.std())
# train_data_scaled.head()

# # First Let's try KMeans Clustering on Scaled Training Data
# kmeans = KMeans(n_clusters=2)
# kmeans.fit(train_data_scaled)

# # Get the Predictions on Training Data
# kmeans_preds = kmeans.predict(train_data_scaled)
# train_data['kmean_cluster'] = kmeans_preds.tolist()

# ones = train_data[train_data['kmean_cluster'].astype(int) == 1]
# zeros = train_data[train_data['kmean_cluster'].astype(int) == 0]

# # Make a new dataframe with the predictions
# coverage_predictions = pd.DataFrame()
# coverage_predictions['kmeans'] = kmeans_preds.tolist()

# # Mini Batch KMeans on training data
# mini_kmeans = MiniBatchKMeans(n_clusters=2)
# mini_kmeans.fit(train_data_scaled)

# # Get the Predictions on the training data
# mini_kmeans_pred = mini_kmeans.predict(train_data_scaled)

# # Add these predictions into the dataframe too
# coverage_predictions['mini_kmeans'] = mini_kmeans_pred.tolist()

# # Train
# gm = GaussianMixture(n_components=2)
# gm.fit(train_data_scaled)

# # Predict
# gmm_preds = gm.predict(train_data_scaled)

# # Add these predictions into the dataframe too
# coverage_predictions['gmm'] = gmm_preds.tolist()