# Competition task:
The main goal is to characterize any differences in player movement between the playing surfaces and to identify specific variables (e.g., field surface, weather, position, play type, etc.) that may influence player movement and the risk of injury.

# Main findings:
**According to this analysis, the main factors influencing injury risk are as follows:** <br>
1.	**Turf type.** Injuries on synthetic turf happen about 1.7 times more often than on natural turf.
2.	**Play scenario.** Injury risk varies with the play scenario (a combination of roster position, position, position group and play type) and the turf type. The scenario “Wide Receiver, WR, WR, Kickoff” on synthetic turf has the highest injury risk: the chance of injury in this scenario is about 0.9%. On natural turf the highest injury risk occurs in the “Linebacker, OLB, LB, Punt” scenario, where the chance of injury is about 0.4%.
3.	**Temperature.** It is likely that injury risk on synthetic turf increases at higher temperatures (approximately above 70 °F). This suggests that some physical properties of synthetic turf start to change at these temperatures. Natural turf does not appear to be temperature-sensitive, as no such dependence was found.
4.	**Player dynamics.** Athletes who got injured ran faster on average and accelerated more rapidly than those who did not get injured. This difference does not depend on turf type.

Most of the discussion of the analysis results is in the PDF report. The notebooks contain only brief notes.

# Content:

The work is presented in five main parts:

1. Basic data exploration.
2. Influence of game scenario.
3. Influence of game conditions.
4. Athlete movement analysis.
5. Conclusions.



**Personal preface:**
As luck would have it, I myself experienced a lower limb injury at the end of November, just when the challenge started. I broke my tibia and sprained my ankle after a less-than-perfect front flip on a trampoline. So, being fully immersed and having experienced all the consequences of a lower limb injury first-hand, I tried my best to help reduce such risks.


# Imports:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')

# 1. Basic data exploration.

First things first. We start by analysing the injury record and play list datasets:

In [None]:
# Import datasets:
injury_record = pd.read_csv("/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv")
play_list = pd.read_csv("/kaggle/input/nfl-playing-surface-analytics/PlayList.csv")

game_id_with_injuryes = injury_record['GameID'].unique()
game_id_without_injuryes = play_list.query('GameID not in @game_id_with_injuryes')['GameID'].unique()

num_games_with_inj = game_id_with_injuryes.shape[0]
num_games_safe = game_id_without_injuryes.shape[0]

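# Share of games with at least one injury among all games (injured + safe)
inj_ratio = num_games_with_inj / (num_games_with_inj + num_games_safe) * 100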

print('Percent of GameID with at least one injury: {0:.2f}%'.format(inj_ratio))

In [None]:
# nunique() ignores injury records whose PlayKey is missing
play_keys_inj = injury_record['PlayKey'].nunique()
play_keys_total = play_list['PlayKey'].nunique()
play_keys_safe = play_keys_total - play_keys_inj

print(f'Available play keys for safe plays: {play_keys_safe}')
print(f'Available play keys for plays with injury: {play_keys_inj}')
print('Percent of plays with injury: {0:.2f}%'.format(play_keys_inj / play_keys_total * 100))

**The good news** is that non-contact injuries are rare: only about 0.04% of plays end with a non-contact injury. <br>
The challenge is that the data are heavily imbalanced: games and plays with an injury are vastly outnumbered by those without, so it can be quite tricky to train models and draw statistically meaningful conclusions. The statistical significance of the findings should be double-checked.
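One way to double-check a rate estimated from so few events is to attach a confidence interval to it. The sketch below computes a Wilson score interval in plain Python. The function name `wilson_interval` and the counts are illustrative assumptions and are not taken from the dataset.

```python
import math


def wilson_interval(events, trials, z=1.96):
    """Wilson score interval (low, high) for the proportion events / trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = events / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts for illustration only; they are not taken from the dataset.
example_events, example_trials = 10, 25_000
low, high = wilson_interval(example_events, example_trials)
print(f"rate = {example_events / example_trials:.4%}, 95% CI = [{low:.4%}, {high:.4%}]")
```

With rare events the interval is wide relative to the rate itself, which is a reason to treat small differences between groups with caution.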




Let’s dig further and check whether there is a difference between natural and synthetic turf:

In [None]:
inj_nat = injury_record.groupby('Surface').count()['PlayerKey']['Natural']
inj_synt = injury_record.groupby('Surface').count()['PlayerKey']['Synthetic']

games_nat = play_list.groupby('FieldType').count()['PlayerKey']['Natural']
games_synt = play_list.groupby('FieldType').count()['PlayerKey']['Synthetic']

natural_inj_rate = inj_nat / games_nat * 100
synt_inj_rate = inj_synt / games_synt * 100

sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

g1 = sns.barplot(x=['Synthetic', 'Natural'], y=[inj_synt, inj_nat], 
                 alpha=0.7, ax=ax[0])
g2 = sns.barplot(x=['Synthetic', 'Natural'], y=[synt_inj_rate, natural_inj_rate],
                 alpha=0.7, ax=ax[1])

ax[0].set_title('Absolute value of injuries', fontsize=14)
ax[1].set_title('Frequency of injury occurrence', fontsize=14)
ax[1].set_ylabel('%', fontsize=14)

sns.despine()  
fig.tight_layout()

The difference in absolute values doesn’t look significant, but one should take into account that more plays were recorded on natural turf. Comparing the number of injuries with the total number of plays on each surface is therefore more informative. Here the difference is larger, which supports the idea that turf type has an influence.
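To judge whether a difference in injury rates between two surfaces could be due to chance, one option is a risk ratio with an approximate confidence interval computed on the log scale. This is a minimal sketch and not part of the original analysis. `rate_ratio_ci` is a hypothetical helper and the counts are placeholders.

```python
import math


def rate_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Risk ratio (a vs b) with an approximate CI computed on the log scale."""
    if min(events_a, events_b) <= 0:
        raise ValueError("both groups need at least one event")
    ratio = (events_a / total_a) / (events_b / total_b)
    # Standard error of log(risk ratio)
    se = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    low = math.exp(math.log(ratio) - z * se)
    high = math.exp(math.log(ratio) + z * se)
    return ratio, low, high


# Placeholder counts for illustration only; they are not taken from the dataset.
ratio, low, high = rate_ratio_ci(events_a=30, total_a=10_000,
                                 events_b=20, total_b=10_000)
print(f"risk ratio = {ratio:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the interval includes 1, the data at hand do not rule out equal injury rates on the two surfaces.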

Let's look at injury severity on the different surfaces:

In [None]:
inj_nat = injury_record[injury_record['Surface'] == 'Natural']
nat_size = inj_nat.shape[0]
nat_42 = inj_nat['DM_M42'].sum() 
nat_28 = inj_nat['DM_M28'].sum() - nat_42
nat_7 = inj_nat['DM_M7'].sum() - nat_42 - nat_28
nat_1 = inj_nat['DM_M1'].sum() - nat_42 - nat_28 - nat_7

inj_synt = injury_record[injury_record['Surface'] == 'Synthetic']
synt_size = inj_synt.shape[0]
synt_42 = inj_synt['DM_M42'].sum()
synt_28 = inj_synt['DM_M28'].sum() - synt_42
synt_7 = inj_synt['DM_M7'].sum() - synt_42 - synt_28
synt_1 = inj_synt['DM_M1'].sum() - synt_42 - synt_28 - synt_7

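# Convert counts to percentages of all injuries on each surface
ks = 100 / synt_size
kn = 100 / nat_size

nat = [nat_1*kn, nat_7*kn, nat_28*kn, nat_42*kn] 
synt = [synt_1*ks, synt_7*ks, synt_28*ks, synt_42*ks] 
# The bins are exclusive: 1-7, 7-28, 28-42 and 42+ days missed
inj_days = ['1-7', '7-28', '28-42', '42+']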

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

g1 = sns.barplot(x=inj_days, y=nat, order=inj_days, ax=ax[0])
g2 = sns.barplot(x=inj_days, y=synt, order=inj_days, ax=ax[1])

ax[0].set_title('Natural', fontsize=14)
ax[1].set_title('Synthetic', fontsize=14)
ax[0].set_ylim([0, 50])
ax[1].set_ylim([0, 50])
ax[0].set_xlabel('Days to recover')
ax[1].set_xlabel('Days to recover')
ax[0].set_ylabel('%')
ax[1].set_ylabel('%')
sns.despine()  
fig.tight_layout()

The distribution of recovery time looks quite similar for natural and synthetic turf. Injuries with a recovery time of 28-42 days are less frequent than the others. If we split injuries into “short recovery” (1-7 days to recover) and “long recovery” (7+ days to recover), synthetic turf appears to lead to slightly more severe injuries, as “long recovery” injuries are about 10% more frequent there.
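The recovery bins can be derived from the `DM_M1`, `DM_M7`, `DM_M28` and `DM_M42` flags by differencing neighbouring totals. The sketch below assumes the flags are cumulative (an injury flagged `DM_M7` is also flagged `DM_M1`), which is how the subtraction in the notebook treats them. The function name and the toy records are illustrative.

```python
FLAGS = ("DM_M1", "DM_M7", "DM_M28", "DM_M42")


def exclusive_bins(records):
    """Count injuries per non-overlapping recovery bin from cumulative flags."""
    at_least = {flag: sum(r[flag] for r in records) for flag in FLAGS}
    return {
        "1-7": at_least["DM_M1"] - at_least["DM_M7"],
        "7-28": at_least["DM_M7"] - at_least["DM_M28"],
        "28-42": at_least["DM_M28"] - at_least["DM_M42"],
        "42+": at_least["DM_M42"],
    }


# Toy records for illustration only
toy_records = [
    {"DM_M1": 1, "DM_M7": 0, "DM_M28": 0, "DM_M42": 0},
    {"DM_M1": 1, "DM_M7": 1, "DM_M28": 0, "DM_M42": 0},
    {"DM_M1": 1, "DM_M7": 1, "DM_M28": 1, "DM_M42": 1},
    {"DM_M1": 1, "DM_M7": 1, "DM_M28": 1, "DM_M42": 0},
    {"DM_M1": 1, "DM_M7": 0, "DM_M28": 0, "DM_M42": 0},
]
bins = exclusive_bins(toy_records)
print(bins)
```

Because the bins are exclusive, their counts add up to the total number of injuries, so dividing by that total gives percentages that sum to 100.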

Let's look at the injuries in more depth:

In [None]:
natural = injury_record[injury_record['Surface'] == 'Natural']
synthetic = injury_record[injury_record['Surface'] == 'Synthetic']

natural_bp = natural.groupby('BodyPart').count()['GameID']
synt_bp = synthetic.groupby('BodyPart').count()['GameID']
synt_bp['Heel'] = 0

feet_part = pd.merge(natural_bp, synt_bp, on='BodyPart')
feet_part.sort_values('GameID_x', inplace=True, ascending=False)
feet_part.columns = ['Natural', 'Synthetic']

# Convert counts to percentages of all injuries on each surface
feet_part['Natural'] = feet_part['Natural'] / len(natural) * 100
feet_part['Synthetic'] = feet_part['Synthetic'] / len(synthetic) * 100

# Injured parts
body_parts = injury_record['BodyPart'].unique()

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.barplot(x=feet_part.index, y=feet_part['Natural'], ax=ax[0])
sns.barplot(x=feet_part.index, y=feet_part['Synthetic'], ax=ax[1])

ax[0].set_title('Natural')
ax[1].set_title('Synthetic')
ax[0].set_ylabel('%')
ax[1].set_ylabel('%')

ax[0].set_ylim([0, 60])
ax[1].set_ylim([0, 60])

sns.despine() 
fig.tight_layout()

We can see that the most frequently injured body parts are ankles and knees on both types of turf. Heel injuries are rare. The data suggest that foot injuries are more frequent on natural turf and toe injuries are more frequent on synthetic turf.

**Main takeaways:**
<li> Injuries on synthetic turf are approximately 1.7 times more frequent than on natural turf.
<li> Toe injuries are likely to be more frequent on synthetic turf.
<li> Knee and foot injuries may be more frequent on natural turf.


# 2. Influence of game scenario.

Let us consider how the risk of injury is distributed across roster positions.

In [None]:
# Import datasets:
injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
play_list = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/PlayList.csv')

pk_with_injuryes = injury_record['PlayKey'].unique()
injured = play_list.query('PlayKey in @pk_with_injuryes')

injured_rp_sorted = list(injured['RosterPosition'].value_counts().sort_values(ascending=False).index)

# roster positions with no recorded injuries (safe_rp)
safe_rp = []
for rp in play_list['RosterPosition'].unique():
  if rp not in injured_rp_sorted:
    safe_rp.append(rp)

for sp in safe_rp:
  injured_rp_sorted.append(sp)

# ratio of injuries to the total number of plays for each roster position
inj_rp = injured['RosterPosition'].value_counts()
for rp in safe_rp:
  inj_rp[rp] = 0
total_rp = play_list['RosterPosition'].value_counts()
rp_inj_ratio = inj_rp / total_rp * 100

play_list_natural = play_list[play_list['FieldType']=='Natural']
play_list_synt = play_list[play_list['FieldType']=='Synthetic']

injured_natural = play_list_natural.query('PlayKey in @pk_with_injuryes')
injured_synthetic = play_list_synt.query('PlayKey in @pk_with_injuryes')

inj_rp_natural = injured_natural['RosterPosition'].value_counts()
inj_rp_synt = injured_synthetic['RosterPosition'].value_counts()
total_rp_natural = play_list_natural['RosterPosition'].value_counts()
total_rp_synt = play_list_synt['RosterPosition'].value_counts()

rp_inj_ratio_natural = inj_rp_natural / total_rp_natural * 100
rp_inj_ratio_synt = inj_rp_synt / total_rp_synt * 100

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.set_palette("Blues_d")
sns.countplot(y=injured['RosterPosition'],
              order=rp_inj_ratio.sort_values(ascending=False).index, 
              ax=ax[0], palette="Blues_d")

sns.barplot(x=rp_inj_ratio.sort_values(ascending=False).values,
            y=rp_inj_ratio.sort_values(ascending=False).index,
              ax=ax[1], palette="Blues_d")

ax[0].set_title('Absolute values')
ax[1].set_title('Relative values')
ax[1].set_xlabel('frequency of injury, %')
sns.despine() 
fig.tight_layout()

The largest number of injuries occurs at the Linebacker position, but the highest injury rate is at the Running Back position. <br> The safest positions are quarterback and kicker. This feels quite natural, as kickers and quarterbacks are likely to run less than a running back or a linebacker.

Next, let us consider a turf type. 

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

# Positions with no injuries on a surface have a NaN rate; treat them as 0
d_rate = rp_inj_ratio_natural.fillna(0) - rp_inj_ratio_synt.fillna(0)

sns.barplot(x=d_rate.sort_values(ascending=False).values,
            y=d_rate.sort_values(ascending=False).index,
            label='Synthetic',
              ax=ax, palette="Blues_d", edgecolor='w')
ax.set_title('Difference between natural and synthetic injury frequencies.')
ax.set_xlabel('%')
sns.despine()
plt.tight_layout()

For most roster positions, injuries occur more frequently on synthetic turf. The exception is the offensive lineman, who is more likely to get injured on natural turf. The more a player moves during the game, the higher the risk of injury on synthetic turf and the stronger the influence of the turf. This seems valid, as an offensive lineman isn’t likely to move as much as other players, so it fits the proposed idea of a cleat-turf interaction.
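When rates are compared per category, a category with no injuries on one surface has no entry in that surface's rate table and should count as zero rather than drop out of the comparison. The sketch below shows this with plain dictionaries. `rate_difference` is a hypothetical helper and the rates are placeholders, not results from the dataset.

```python
def rate_difference(rates_a, rates_b):
    """Per-category difference a - b; categories missing on one side count as 0."""
    categories = sorted(set(rates_a) | set(rates_b))
    return {c: rates_a.get(c, 0.0) - rates_b.get(c, 0.0) for c in categories}


# Placeholder injury rates in percent, for illustration only
natural_rates = {"Linebacker": 0.03, "Offensive Lineman": 0.04}
synthetic_rates = {"Linebacker": 0.05, "Running Back": 0.06}

diff = rate_difference(natural_rates, synthetic_rates)
for position, value in diff.items():
    print(f"{position}: {value:+.2f}")
```

A negative value means the rate is higher on the second surface (synthetic turf in this toy example).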

**Next we perform the same analysis for play type:**

In [None]:
injured_pt_sorted = list(injured['PlayType'].value_counts().sort_values(ascending=False).index)

# play types with no recorded injuries (safe_pt)
safe_pt = []
for pt in play_list['PlayType'].unique():
  if pt not in injured_pt_sorted:
    safe_pt.append(pt)

for pt in safe_pt:
  if pt not in ['0', 0 , np.nan, 'nan']:
    injured_pt_sorted.append(pt)

# ratio of injuries to the total number of plays for each play type
inj_pt = injured['PlayType'].value_counts()
for pt in safe_pt:
  inj_pt[pt] = 0
total_pt = play_list['PlayType'].value_counts()
pt_inj_ratio = (inj_pt / total_pt * 100).drop(index='0')

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

sns.countplot(y=injured['PlayType'],
              order=pt_inj_ratio.sort_values(ascending=False).index, 
              ax=ax[0], palette="Blues_d")

sns.barplot(x=pt_inj_ratio.sort_values(ascending=False).values,
            y=pt_inj_ratio.sort_values(ascending=False).index,
              ax=ax[1], palette="Blues_d")

ax[0].set_title('Absolute values')
ax[1].set_title('Relative values')
ax[1].set_xlabel('frequency of injury, %')
sns.despine()
fig.tight_layout()

And dividing by turf type:

In [None]:
play_list_natural = play_list[play_list['FieldType']=='Natural']
play_list_synt = play_list[play_list['FieldType']=='Synthetic']

injured_natural = play_list_natural.query('PlayKey in @pk_with_injuryes')
injured_synthetic = play_list_synt.query('PlayKey in @pk_with_injuryes')

inj_pt_natural = injured_natural['PlayType'].value_counts()
inj_pt_synt = injured_synthetic['PlayType'].value_counts()
total_pt_natural = play_list_natural['PlayType'].value_counts()
total_pt_synt = play_list_synt['PlayType'].value_counts()

pt_inj_ratio_natural = inj_pt_natural / total_pt_natural * 100
pt_inj_ratio_synt = inj_pt_synt / total_pt_synt * 100

fig, ax = plt.subplots(figsize = (12, 6))

d_rate = (pt_inj_ratio_natural.fillna(0) - pt_inj_ratio_synt.fillna(0)).drop(index='0')

sns.barplot(x=d_rate.sort_values(ascending=False).values,
            y=d_rate.sort_values(ascending=False).index,
            label='Synthetic',
              ax=ax, palette="Blues_d", edgecolor='w')
ax.set_title('Difference between natural and synthetic injury frequencies.')
ax.set_xlabel('%')
sns.despine()
plt.tight_layout()

Let's break this down by position and position group:

In [None]:
injured_pos_sorted = list(injured['Position'].value_counts().sort_values(ascending=False).index)

# getting position that never injured 
safe_pos = []
for pos in play_list['Position'].unique():
  if pos not in injured_pos_sorted:
    safe_pos.append(pos)

for pos in safe_pos:
  if pos != 'Missing Data':
    injured_pos_sorted.append(pos)

# ratio of injuries to the total number of plays for each position
inj_pos = injured['Position'].value_counts()
for pos in safe_pos:
  inj_pos[pos] = 0
total_pos = play_list['Position'].value_counts()
pos_inj_ratio = (inj_pos / total_pos * 100).drop(index='Missing Data')

injured_posg_sorted = list(injured['PositionGroup'].value_counts().sort_values(ascending=False).index)

# getting position group that never injured 
safe_posg = []
for posg in play_list['PositionGroup'].unique():
  if posg not in injured_posg_sorted:
    safe_posg.append(posg)

for posg in safe_posg:
  if posg != 'Missing Data':
    injured_posg_sorted.append(posg)

# ratio of injuries to the total number of plays for each position group
inj_posg = injured['PositionGroup'].value_counts()
for posg in safe_posg:
  inj_posg[posg] = 0
total_posg = play_list['PositionGroup'].value_counts()
posg_inj_ratio = (inj_posg / total_posg * 100).drop(index='Missing Data')

fig, ax = plt.subplots(2, 2, figsize=(15, 10))

sns.countplot(y=injured['Position'],
              order=pos_inj_ratio.sort_values(ascending=False).index, 
              ax=ax[0][0], palette="Blues_d")

sns.barplot(x=pos_inj_ratio.sort_values(ascending=False).values,
            y=pos_inj_ratio.sort_values(ascending=False).index,
              ax=ax[0][1], palette="Blues_d")

sns.countplot(y=injured['PositionGroup'],
              order=posg_inj_ratio.sort_values(ascending=False).index, 
              ax=ax[1][0], palette="Blues_d")

sns.barplot(x=posg_inj_ratio.sort_values(ascending=False).values,
            y=posg_inj_ratio.sort_values(ascending=False).index,
              ax=ax[1][1], palette="Blues_d")

ax[0][0].set_title('Absolute values')
ax[0][1].set_title('Relative values')
ax[0][1].set_xlabel('frequency of injury, %')
ax[1][1].set_xlabel('frequency of injury, %')
sns.despine()
fig.tight_layout()

The largest numbers of injuries occur at the WR position and in the LB position group, but the highest injury rates are at the DB position and in the RB position group.

**Factor combinations:**
To group plays by similarity in terms of roster position, position, position group and play type, the data were one-hot encoded and clustered hierarchically using cosine distance, with the result shown as a dendrogram. Dendrograms were created for injured plays, separately for natural and synthetic turf.
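The cosine distance between two one-hot encoded scenarios depends only on how many of the four attributes they share. The sketch below works this out directly, assuming every play has exactly one value per attribute (no missing values). The function name and the second scenario are illustrative.

```python
import math

ATTRIBUTES = ("RosterPosition", "Position", "PositionGroup", "PlayType")


def cosine_distance(a, b):
    """Cosine distance between two scenarios as if they were one-hot encoded."""
    # A one-hot row has exactly one 1 per attribute, so the dot product equals
    # the number of attributes on which the two scenarios agree.
    dot = sum(1 for attr in ATTRIBUTES if a[attr] == b[attr])
    norm_a = math.sqrt(len(ATTRIBUTES))
    norm_b = math.sqrt(len(ATTRIBUTES))
    return 1 - dot / (norm_a * norm_b)


wr_kickoff = {"RosterPosition": "Wide Receiver", "Position": "WR",
              "PositionGroup": "WR", "PlayType": "Kickoff"}
wr_punt = dict(wr_kickoff, PlayType="Punt")  # differs in one attribute

same = cosine_distance(wr_kickoff, wr_kickoff)
one_off = cosine_distance(wr_kickoff, wr_punt)
print(same, one_off)
```

Identical scenarios are at distance 0, while scenarios that differ in a single attribute are already at 0.25. Under these assumptions, a cut at 0.1 only groups plays that agree on all four attributes.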

In [None]:
injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
play_list = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/PlayList.csv')

injured_natural = injured[injured['FieldType'] == 'Natural']
injured_synth = injured[injured['FieldType'] == 'Synthetic']

injured_rp_sort_nat = injured_natural['RosterPosition']
injured_pos_sort_nat = injured_natural['Position']
injured_posg_sort_nat = injured_natural['PositionGroup']
injured_pt_sort_nat = injured_natural['PlayType']

injured_rp_sort_synt = injured_synth['RosterPosition']
injured_pos_sort_synt = injured_synth['Position']
injured_posg_sort_synt = injured_synth['PositionGroup']
injured_pt_sort_synt = injured_synth['PlayType']


# One-hot encode as integers so that the dot product of two rows counts
# shared attribute values (recent pandas versions return booleans by default)
inj_groups_nat = pd.concat([injured_rp_sort_nat, injured_pos_sort_nat, 
                            injured_posg_sort_nat, injured_pt_sort_nat], axis=1)
inj_groups_nat_d = pd.get_dummies(inj_groups_nat, dtype=int)

inj_groups_synt = pd.concat([injured_rp_sort_synt, injured_pos_sort_synt, 
                             injured_posg_sort_synt, injured_pt_sort_synt], axis=1)
inj_groups_synt_d = pd.get_dummies(inj_groups_synt, dtype=int)

With a distance threshold of 0.1 we obtain clusters whose members share all four attributes (roster position, position, position group and play type):

In [None]:
from scipy.cluster.hierarchy import fcluster, dendrogram, linkage

treshold = 0.1

plt.subplots(figsize=(12, 7))

Z_nat = linkage(inj_groups_nat_d, method='average', metric='cosine')
dend_nat = dendrogram(Z_nat)

plt.axhline(y=treshold, c='grey', lw=2, linestyle='dashed')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.title('Natural')
sns.despine()

In [None]:
treshold = 0.1

plt.subplots(figsize=(12, 7))

Z_synt = linkage(inj_groups_synt_d, method='average', metric='cosine')
dend_synt = dendrogram(Z_synt)

plt.axhline(y=treshold, c='grey', lw=2, linestyle='dashed')
plt.tick_params(axis='x', which='major', labelsize=15)
plt.title('Synthetic')
sns.despine()

In [None]:
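def get_clusters(labels, groups):
  """Map each cluster label to ([dot product of its first two members], [member indices])."""
  clusters = {key: ([],[]) for key in range(1, labels.max()+1)}
  for label, player in zip(labels, groups.index):
    clusters[label][1].append(player)
  for key in clusters.keys():
    try:
      # Dot product of two one-hot rows = number of shared attribute values
      clusters[key][0].append(
          np.dot(groups.loc[clusters[key][1][0]],
                groups.loc[clusters[key][1][1]])
      )
    except IndexError:
      # Single-member cluster: there is no pair to compare
      clusters[key][0].append(0)
  return clusters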

labels_nat = fcluster(Z_nat, treshold, criterion='distance') 
labels_synt = fcluster(Z_synt, treshold, criterion='distance') 

clusters_nat = get_clusters(labels_nat, inj_groups_nat_d)
clusters_synt = get_clusters(labels_synt, inj_groups_synt_d)

In [None]:
def calc_groups_stat(clusters, groups):
  """Build a table of scenarios (clusters sharing all four attributes) with injury counts."""
  cols = ['RosterPosition', 'Position', 'PositionGroup', 'PlayType', 'sum']
  groups_stat = pd.DataFrame(columns=cols)

  idx = 0
  for key in clusters.keys():
    inj_sum = len(clusters[key][1])
    # Keep only clusters whose first two members match on all 4 attributes;
    # single-member clusters (dot product 0) are skipped
    if clusters[key][0][0] != 4:
      continue
    first = groups.loc[clusters[key][1][0]]
    col = [first['RosterPosition'], first['Position'],
           first['PositionGroup'], first['PlayType'], inj_sum]
    groups_stat.loc[idx] = col
    idx += 1
  return groups_stat

groups_stat_nat = calc_groups_stat(clusters_nat, inj_groups_nat)
groups_stat_synt = calc_groups_stat(clusters_synt, inj_groups_synt)

Groups for natural field:

In [None]:
groups_stat_nat.sort_values(by='sum', ascending=False, inplace=True)
groups_stat_nat.reset_index(drop=True, inplace=True)
groups_stat_nat

In [None]:
groups_stat_synt.sort_values(by='sum', ascending=False, inplace=True)
groups_stat_synt.reset_index(drop=True, inplace=True)
groups_stat_synt

In [None]:
play_list_nat = play_list[play_list['FieldType']=='Natural']
play_list_synt = play_list[play_list['FieldType']=='Synthetic']

count_synt = pd.DataFrame(columns=['sum_total'])

for i in groups_stat_synt.index:
  row = groups_stat_synt.loc[i]
  # Count all synthetic-turf plays that match this scenario on all four attributes
  mask = (
      (play_list_synt['RosterPosition'] == row['RosterPosition'])
      & (play_list_synt['Position'] == row['Position'])
      & (play_list_synt['PositionGroup'] == row['PositionGroup'])
      & (play_list_synt['PlayType'] == row['PlayType'])
  )
  count_synt.loc[i] = [mask.sum()]
    
count_nat = pd.DataFrame(columns=['sum_total'])

for i in groups_stat_nat.index:
  row = groups_stat_nat.loc[i]
  # Count all natural-turf plays that match this scenario on all four attributes
  mask = (
      (play_list_nat['RosterPosition'] == row['RosterPosition'])
      & (play_list_nat['Position'] == row['Position'])
      & (play_list_nat['PositionGroup'] == row['PositionGroup'])
      & (play_list_nat['PlayType'] == row['PlayType'])
  )
  count_nat.loc[i] = [mask.sum()]

For natural turf:

In [None]:
groups_stat_nat_tot = pd.merge(groups_stat_nat, count_nat, left_on=groups_stat_nat.index, right_on=count_nat.index).drop('key_0', axis=1)
groups_stat_nat_tot['Ratio'] = groups_stat_nat_tot['sum'] / groups_stat_nat_tot['sum_total'] * 100
groups_stat_nat_tot.sort_values(by='Ratio', ascending=False, inplace=True)
pd.options.display.float_format = '{:.3f}'.format
groups_stat_nat_tot

In [None]:
groups_stat_synt_tot = pd.merge(groups_stat_synt, count_synt, left_on=groups_stat_synt.index, right_on=count_synt.index).drop('key_0', axis=1)
groups_stat_synt_tot['Ratio'] = groups_stat_synt_tot['sum'] / groups_stat_synt_tot['sum_total'] * 100
groups_stat_synt_tot.sort_values(by='Ratio', ascending=False, inplace=True)
pd.options.display.float_format = '{:.3f}'.format
groups_stat_synt_tot

We can see that the scenario “Wide Receiver, WR, WR, Kickoff” on synthetic turf has the highest injury frequency among all observed scenarios. The chance of injury in this scenario is about 0.9%. 
<br> The scenario “Linebacker, OLB, LB, Punt” on natural turf has a chance of injury of about 0.4%. The remaining scenarios have much lower frequencies.
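As a cross-check on the clustering route, the same per-scenario rates can be obtained by tallying plays directly on the four attributes. This is an alternative sketch, not the method used above. The function name and the toy plays (including the `PlayKey` values) are illustrative.

```python
from collections import Counter

KEYS = ("RosterPosition", "Position", "PositionGroup", "PlayType")


def scenario_rates(plays, injured_play_keys):
    """Return {scenario: (injuries, total plays, rate in %)} for scenarios with injuries."""
    total = Counter()
    injured = Counter()
    for play in plays:
        scenario = tuple(play[k] for k in KEYS)
        total[scenario] += 1
        if play["PlayKey"] in injured_play_keys:
            injured[scenario] += 1
    return {s: (n, total[s], n / total[s] * 100) for s, n in injured.items()}


# Toy plays for illustration only
wr = {"RosterPosition": "Wide Receiver", "Position": "WR",
      "PositionGroup": "WR", "PlayType": "Kickoff"}
lb = {"RosterPosition": "Linebacker", "Position": "OLB",
      "PositionGroup": "LB", "PlayType": "Punt"}
toy_plays = (
    [dict(wr, PlayKey=f"1-1-{i}") for i in range(1, 3)]
    + [dict(lb, PlayKey=f"2-1-{i}") for i in range(1, 5)]
)
rates = scenario_rates(toy_plays, injured_play_keys={"1-1-2", "2-1-3"})
for scenario, (n_inj, n_total, rate) in rates.items():
    print(scenario, n_inj, n_total, f"{rate:.1f}%")
```

Scenarios with a single injury and few plays produce very noisy rates, so the total play count is worth reading alongside each percentage.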


**Main takeaways:**
<li>	Considering the factors separately, the Running Back roster position carries the highest injury risk: about 0.05% of plays in this position resulted in an injury. In addition, about 0.16% of punts resulted in an injury, as did about 0.08% of plays at the DB position and 0.05% of plays in the RB position group.
<li>The highest injury risk on synthetic turf occurs in the “Wide Receiver, WR, WR, Kickoff” scenario. The chance of injury in this scenario is about 0.9%. 
<li>The highest injury risk on natural turf occurs in the “Linebacker, OLB, LB, Punt” scenario. The chance of injury in this scenario is about 0.4%.
<li>	Athletes in positions that involve more movement get injured more frequently on synthetic turf.


# 3. Influence of game conditions.

In [None]:
injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
play_list = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/PlayList.csv')

First of all, the corresponding data should be cleaned, as there are typos in the “Weather” and “StadiumType” fields. 

Stadium types will be represented as:
1.   Open
2.   Closed
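A lookup-based normaliser is one way to make such a mapping explicit and to surface labels that fit neither category. The sketch below uses a few of the raw spellings found in the data. `normalize_stadium` is a hypothetical helper and the label sets are deliberately incomplete.

```python
# Subsets of the raw spellings, for illustration only
OPEN_LABELS = {"Outdoor", "Oudoor", "Outdoors", "Open", "Ourdoor",
               "Outddors", "Outdor", "Outside"}
CLOSED_LABELS = {"Indoors", "Closed Dome", "Dome", "Indoor", "Domed",
                 "Dome, closed"}


def normalize_stadium(raw):
    """Map a raw stadium label to 'Open' or 'Closed'; return None if unknown."""
    if raw is None:
        return None
    label = raw.strip()
    if label in OPEN_LABELS:
        return "Open"
    if label in CLOSED_LABELS:
        return "Closed"
    return None


raw_labels = ["Oudoor", "Dome, closed", " Outside ", "Cloudy", None]
cleaned = [normalize_stadium(label) for label in raw_labels]
print(cleaned)
```

Returning `None` for unmapped labels makes stray values such as a weather word in the stadium field easy to count and inspect before they are dropped.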


In [None]:
# Renaming open stadiums
open_stadiums = ['Outdoor', 'Oudoor', 'Outdoors', 'Open', 
       'Outdoor Retr Roof-Open', 'Ourdoor', 'Bowl', 
       'Outddors', 'Retr. Roof-Open', 'Indoor, Open Roof',
       'Domed, Open', 'Domed, open', 'Heinz Field',
       'Retr. Roof - Open', 'Outdor', 'Outside']

play_list.replace(open_stadiums, 'Open', inplace=True)

# Renaming closed stadiums
closed_stadiums = ['Indoors', 'Closed Dome', 'Domed, closed', 
                   'Dome', 'Indoor', 'Domed', 'Retr. Roof-Closed', 
                   'Retractable Roof', 'Indoor, Roof Closed', 
                   'Retr. Roof - Closed', 'Dome, closed', 'Retr. Roof Closed']

play_list.replace(closed_stadiums, 'Closed', inplace=True)

In [None]:
# dropping 'Cloudy' data (outlier)
idx_to_drop = play_list[play_list['StadiumType'] == 'Cloudy'].index
play_list.drop(idx_to_drop, inplace=True)

# Dropping nan data
play_list['StadiumType'].dropna(axis=0, inplace=True)

games_total = play_list['GameID'].unique().shape[0]

Weather will be represented as:

1. Clear
2. Cloudy
3. Rainy
4. Snowy

In [None]:
clear = ['Clear and warm', 'Sunny', 'Clear',
       'Controlled Climate', 'Sunny and warm', 'Clear and Cool',
       'Clear and cold', 'Sunny and cold', 'Closed', 'Partly Sunny',
       'Mostly Sunny', 'Clear Skies', 'Partly sunny',
       'Sunny and clear', 'Clear skies', 'Sunny Skies',
       'Fair', 'Partly clear', 'Heat Index 95', 
       'Sunny, highs to upper 80s', 'Sun & clouds',
       'Mostly sunny', 'Sunny, Windy', 'Mostly Sunny Skies',
       'Clear and Sunny', 'Clear and sunny',
       'Clear to Partly Cloudy', 'Cold', 'N/A Indoor', 'N/A (Indoors)']

cloudy = ['Mostly Cloudy', 'Cloudy',
       'Partly Cloudy', 'Mostly cloudy', 'Cloudy and cold',
       'Cloudy and Cool', 'Partly cloudy',
       'Party Cloudy', 'Partly Clouidy', 'Overcast',
       'Mostly Coudy', 'cloudy', 'Coudy']

rainy = ['Cloudy, fog started developing in 2nd quarter', 'Rain',
       'Rain Chance 40%', 'Showers', 'Scattered Showers', 'Hazy',
       'Rain likely, temps in low 40s.', 'Cloudy, 50% change of rain', 
       'Light Rain', '10% Chance of Rain', 'Cloudy, chance of rain',
       'Cloudy, Rain', 'Rainy', '30% Chance of Rain',
       'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.',
       'Rain shower']

snowy = ['Snow', 'Heavy lake effect snow', 
                'Cloudy, light snow accumulating 1-3"']

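# Assign the result back instead of calling replace(inplace=True) on a column
# selection, which may not update play_list in recent pandas versions
play_list['Weather'] = play_list['Weather'].replace(clear, 'Clear')
play_list['Weather'] = play_list['Weather'].replace(cloudy, 'Cloudy')
play_list['Weather'] = play_list['Weather'].replace(rainy, 'Rainy')
play_list['Weather'] = play_list['Weather'].replace(snowy, 'Snowy')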
play_list['Weather'].dropna(inplace=True)

# drop outliers from temperature
play_list = play_list[play_list['Temperature'] != -999]

In [None]:
# selecting all unique games with corresponding conditions
game_conditions = play_list.drop(['PlayerKey', 'PlayKey', 'RosterPosition',
                                               'PlayerDay', 'PlayerGame', 'PlayType', 'PlayerGamePlay',
                                               'Position', 'PositionGroup'], axis=1).copy()
game_conditions.drop_duplicates(inplace=True)
print(f'Total amount of GameID: {games_total}')
print(f'Total amount of GameID with full information about conditions: {game_conditions.shape[0]}')
print('Percent of data lost after cleaning: {:.2f}%'.format(100 - game_conditions.shape[0] / games_total *100))

Less than 10% of the data was lost after cleaning, which is not bad. <br>
Let's start with stadium type:

In [None]:
gi_with_injuryes = injury_record['GameID'].unique()

game_conditions_natural = game_conditions[game_conditions['FieldType']=='Natural']
game_conditions_synt = game_conditions[game_conditions['FieldType']=='Synthetic']

injured_natural = game_conditions_natural.query('GameID in @gi_with_injuryes')
injured_synthetic = game_conditions_synt.query('GameID in @gi_with_injuryes')

inj_st_natural = injured_natural['StadiumType'].value_counts()
inj_st_synt = injured_synthetic['StadiumType'].value_counts()
total_st_natural = game_conditions_natural['StadiumType'].value_counts()
total_st_synt = game_conditions_synt['StadiumType'].value_counts()

st_inj_ratio_natural = inj_st_natural / total_st_natural * 100
st_inj_ratio_synt = inj_st_synt / total_st_synt * 100

sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")

fig, ax = plt.subplots(3, 2, figsize=(15, 7))

sns.barplot(x=total_st_natural.sort_values(ascending=False).values,
            y=total_st_natural.sort_values(ascending=False).index,
              ax=ax[0][0], palette="Blues_d")
sns.barplot(x=total_st_synt.sort_values(ascending=False).values,
            y=total_st_synt.sort_values(ascending=False).index,
              ax=ax[0][1], palette="Blues_d")

sns.barplot(x=inj_st_natural.sort_values(ascending=False).values,
            y=inj_st_natural.sort_values(ascending=False).index,
              ax=ax[1][0], palette="Blues_d")
sns.barplot(x=inj_st_synt.sort_values(ascending=False).values,
            y=inj_st_synt.sort_values(ascending=False).index,
              ax=ax[1][1], palette="Blues_d")

sns.barplot(x=st_inj_ratio_natural.sort_values(ascending=True).values,
            y=st_inj_ratio_natural.sort_values(ascending=True).index,
              ax=ax[2][0], palette="Blues_d")
sns.barplot(x=st_inj_ratio_synt.sort_values(ascending=True).values,
            y=st_inj_ratio_synt.sort_values(ascending=True).index,
              ax=ax[2][1], palette="Blues_d")

ax[2][0].set_xlim([0, 4])
ax[2][1].set_xlim([0, 4])
ax[1][0].set_xlim([0, 50])
ax[1][1].set_xlim([0, 50])
ax[0][0].set_xlim([0, 3100])
ax[0][1].set_xlim([0, 3100])
ax[0][0].set_title('Natural')
ax[0][1].set_title('Synthetic')
ax[0][0].set_xlabel('Number of games')
ax[0][1].set_xlabel('Number of games')
ax[1][0].set_xlabel('Number of games with injury')
ax[1][1].set_xlabel('Number of games with injury')
ax[2][0].set_xlabel('percent of injury, %')
ax[2][1].set_xlabel('percent of injury, %')
sns.despine() 
fig.tight_layout()

Let's consider weather conditions:

In [None]:
inj_wt_natural = injured_natural['Weather'].value_counts()
inj_wt_synt = injured_synthetic['Weather'].value_counts()
total_wt_natural = game_conditions_natural['Weather'].value_counts()
total_wt_synt = game_conditions_synt['Weather'].value_counts()

wt_inj_ratio_natural = inj_wt_natural / total_wt_natural * 100
wt_inj_ratio_synt = inj_wt_synt / total_wt_synt * 100

fig, ax = plt.subplots(3, 2, figsize=(15, 7))

sns.barplot(x=total_wt_natural.sort_values(ascending=False).values,
            y=total_wt_natural.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[0][0], palette="Blues_d")
sns.barplot(x=total_wt_synt.sort_values(ascending=False).values,
            y=total_wt_synt.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[0][1], palette="Blues_d")

sns.barplot(x=inj_wt_natural.sort_values(ascending=False).values,
            y=inj_wt_natural.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[1][0], palette="Blues_d")
sns.barplot(x=inj_wt_synt.sort_values(ascending=False).values,
            y=inj_wt_synt.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[1][1], palette="Blues_d")

sns.barplot(x=wt_inj_ratio_natural.sort_values(ascending=False).values,
            y=wt_inj_ratio_natural.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[2][0], palette="Blues_d")
sns.barplot(x=wt_inj_ratio_synt.sort_values(ascending=False).values,
            y=wt_inj_ratio_synt.sort_values(ascending=False).index,
            order=total_wt_synt.sort_values(ascending=False).index,
              ax=ax[2][1], palette="Blues_d")

ax[2][0].set_xlim([0, 3])
ax[2][1].set_xlim([0, 3])
ax[1][0].set_xlim([0, 25])
ax[1][1].set_xlim([0, 25])
ax[0][0].set_xlim([0, 1500])
ax[0][1].set_xlim([0, 1500])
ax[0][0].set_title('Natural')
ax[0][1].set_title('Synthetic')
ax[0][0].set_xlabel('Amount of games')
ax[0][1].set_xlabel('Amount of games')
ax[1][0].set_xlabel('Amount of games with injury')
ax[1][1].set_xlabel('Amount of games with injury')
ax[2][0].set_xlabel('percent of injury, %')
ax[2][1].set_xlabel('percent of injury, %')
sns.despine() 
fig.tight_layout()

**Main takeaways:**


1.   Most of the games were played in clear or cloudy weather.
2.   **8%** of games were played in rainy conditions; the proportion is the same for natural and synthetic turf.
3. The chance of injury is highest in rainy weather compared with the other weather categories, and it is slightly higher on synthetic turf (**3%** vs **2.2%**).
4. There are no injuries in snowy weather, probably because very few games were played in such conditions (**0.8%**).
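
The percentages above are the ratio of two `value_counts()` results. A minimal sketch of that calculation on made-up data (the toy frames and the helper name `injury_rate_by_category` are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

def injury_rate_by_category(all_games, injured_games, column):
    """Percent of games with an injury for each value of `column`."""
    total = all_games[column].value_counts()
    injured = injured_games[column].value_counts()
    # Align on the full set of categories; categories with no injuries get 0.
    injured = injured.reindex(total.index, fill_value=0)
    return injured / total * 100

# Toy data, invented for illustration only.
toy_games = pd.DataFrame({'Weather': ['Clear'] * 4 + ['Rainy'] * 2 + ['Snowy']})
toy_injured = pd.DataFrame({'Weather': ['Clear', 'Rainy']})

toy_rates = injury_rate_by_category(toy_games, toy_injured, 'Weather')
print(toy_rates)
```

Reindexing before dividing keeps categories without injuries (such as snowy games) at 0% instead of `NaN`.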



Now let's explore the influence of temperature:

In [None]:
# Left graph (Injured split by natural and synthetic)
inj_temp_natural = injured_natural['Temperature']
inj_temp_synt = injured_synthetic['Temperature']

# Right graph (all split by injured and not)
injured_games = game_conditions.query('GameID in @gi_with_injuryes')
not_injured_games = game_conditions.query('GameID not in @gi_with_injuryes')
inj_temp = injured_games['Temperature']
not_inj_temp = not_injured_games['Temperature']

fig, ax = plt.subplots(1, 2, figsize=(15, 5))

sns.distplot(inj_temp_natural, label='Natural', color='g', bins=10,  ax=ax[0])
sns.distplot(inj_temp_synt, label='Synthetic', bins=10, ax=ax[0])

sns.distplot(not_inj_temp, label='Not injured', color='g', bins=10,ax=ax[1])
sns.distplot(inj_temp, label='Injured', color='r', bins=10,  ax=ax[1])

ax[0].set_title('Games with injury')
ax[1].set_title('All games')
ax[0].legend()
ax[1].legend()
sns.despine()

On visual inspection, one can suspect a slight difference between the distributions.
To be more precise, let us conduct a Kolmogorov-Smirnov test to compare them.

In [None]:
from scipy.stats import ks_2samp

inj_games_test = ks_2samp(inj_temp_natural, inj_temp_synt)
all_games_test = ks_2samp(not_inj_temp, inj_temp)
print('K-S test p-value, synthetic vs natural turf '
      '(games with at least one injury): {:.3f}%'.format(inj_games_test[1]*100))
print('K-S test p-value, games with at least one injury vs '
      'games without injury: {:.3f}%'.format(all_games_test[1]*100))

Distributions on the left aren’t likely to have a statistically significant difference, but distributions on the right differ. 
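
For readers less familiar with the test, here is a minimal sketch of how `ks_2samp` behaves on synthetic samples (the sample sizes, means, and seed are assumptions chosen only for illustration). A small p-value means the data would be unlikely if both samples came from the same distribution; it is not the probability that the distributions differ.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Two samples from the same distribution and one shifted sample (toy data).
same_a = rng.normal(loc=60, scale=15, size=300)
same_b = rng.normal(loc=60, scale=15, size=300)
shifted = rng.normal(loc=75, scale=15, size=300)

res_same = ks_2samp(same_a, same_b)
res_shifted = ks_2samp(same_a, shifted)

print('same distribution:    D={:.3f}, p={:.3f}'.format(res_same.statistic, res_same.pvalue))
print('shifted distribution: D={:.3f}, p={:.3g}'.format(res_shifted.statistic, res_shifted.pvalue))
```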

**Now let's consider combinations of factors:**

In [None]:
game_id_with_injuryes = injury_record['GameID'].unique()
game_id_without_injuryes = play_list.query('GameID not in @game_id_with_injuryes')['GameID'].unique()

injured_games = play_list.query('GameID in @game_id_with_injuryes')
injured_games_conditions = injured_games.drop(['PlayerKey', 'PlayKey', 'RosterPosition',
                                               'PlayerDay', 'PlayerGame', 'PlayType', 'PlayerGamePlay',
                                               'Position', 'PositionGroup'], axis=1)

injured_games_conditions.drop_duplicates(inplace=True)
injured_games_conditions.dropna(inplace=True)
injured_games_conditions.drop('GameID', axis=1, inplace=True)

max_temp = play_list['Temperature'].max()
# Normalize temperature by the same maximum that is used to rescale it later
injured_games_conditions['Temperature'] = injured_games_conditions['Temperature'] / max_temp

injured_games_conditions_one_hot = pd.get_dummies(injured_games_conditions)

In [None]:
from scipy.cluster.hierarchy import fcluster, dendrogram, linkage

# selecting most similar
treshold = 0.1

plt.subplots(figsize=(18, 6))

Z = linkage(injured_games_conditions_one_hot, method='average', metric='cosine')
dend = dendrogram(Z)

plt.axhline(y=treshold, c='grey', lw=2, linestyle='dashed')
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine()

In [None]:
labels = fcluster(Z, treshold, criterion='distance') 

clusters = {key: ([],[]) for key in range(1, labels.max() + 1)}
for label, player in zip(labels, injured_games_conditions_one_hot.index):
  clusters[label][1].append(player)

for key in clusters.keys():
  try:
    clusters[key][0].append(
        np.dot(injured_games_conditions_one_hot.loc[clusters[key][1][0]],
              injured_games_conditions_one_hot.loc[clusters[key][1][1]])
    )
  except IndexError:
    clusters[key][0].append(0)

In [None]:
cols = ['StadiumType', 'FieldType', 'Weather', 'Temperature', 'sum']
condition_stat = pd.DataFrame(columns=cols)

idx = 0
for key in clusters.keys():
  inj_sum = len(clusters[key][1])
  if clusters[key][0][0] > 3:
    st = injured_games_conditions.loc[clusters[key][1][0]]['StadiumType'] 
    ft = injured_games_conditions.loc[clusters[key][1][0]]['FieldType']
    wt = injured_games_conditions.loc[clusters[key][1][0]]['Weather']
    # tmp as average
    tmp = 0
    for cond in clusters[key][1]:
      tmp += injured_games_conditions.loc[cond]['Temperature'] 
    tmp /= len(clusters[key][1])
  else:
    continue

  col = [st, ft, wt, tmp, inj_sum]
  condition_stat.loc[idx] = col
  idx += 1

condition_stat.sort_values(by='sum', ascending=False, inplace=True)
condition_stat.reset_index(drop=True, inplace=True)
condition_stat

In [None]:
count = pd.DataFrame(columns=['mean_tmp', 'sum_total'])

for idx in condition_stat.index:
  st = condition_stat.loc[idx]['StadiumType']
  ft = condition_stat.loc[idx]['FieldType']
  wt = condition_stat.loc[idx]['Weather']
  # combine the three conditions into a single boolean mask
  subset = game_conditions[(game_conditions['StadiumType'] == st)
                           & (game_conditions['FieldType'] == ft)
                           & (game_conditions['Weather'] == wt)]

  tmp = subset['Temperature'].mean() / max_temp
  count.loc[idx] = [tmp, subset.shape[0]]
    
condition_stat_tot = pd.merge(condition_stat, count, left_on=condition_stat.index, right_on=count.index)
condition_stat_tot['Ratio'] = condition_stat_tot['sum'] / condition_stat_tot['sum_total'] * 100
condition_stat_tot.drop('key_0', axis=1, inplace=True)

condition_stat_tot['Temperature'] *= max_temp
condition_stat_tot['mean_tmp'] *= max_temp
condition_stat_tot.sort_values(by='Ratio', ascending=False, inplace=True)
condition_stat_tot.reset_index(drop=True, inplace=True)
condition_stat_tot

As one can see, the highest risk of injury is observed in closed stadiums with synthetic turf. Synthetic turf again proves to be the riskiest surface: the top 4 riskiest combinations all include synthetic turf. Note that the injury risk in a closed stadium with synthetic turf is higher than the injury risk in rainy weather in an open stadium with natural turf (3.1% vs 2.3%). <br>
It is also of great interest that synthetic turf at a relatively low temperature in rainy weather seems to carry a lower injury risk than synthetic turf in dry weather at a higher temperature. <br>
One more interesting finding is that the risk of injury on synthetic turf seems to decrease with decreasing temperature. <br> This agrees with the single-factor temperature observations in the previous section. Even rainy weather seems to be a less important factor: comparing observations #3 and #6 in the table, cloudy weather at a lower temperature results in a lower risk than rainy weather at a higher temperature, with all other factors identical. According to the table, the injury risk on synthetic turf at about 75 F is approximately twice as high as at 50 F.


**From another perspective:**

In [None]:
injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
play_list = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/PlayList.csv')

# keep only rows with a valid temperature (-999 marks a missing value)
play_list = play_list[play_list['Temperature'] != -999]
play_list = play_list.dropna(subset=['Temperature'])

# split by turf type and keep one row per game (unique GameIDs)
play_list_synt = play_list[play_list['FieldType'] == 'Synthetic'].drop_duplicates('GameID')
play_list_nat = play_list[play_list['FieldType'] == 'Natural'].drop_duplicates('GameID')

# total amount of games by temperatures
total_synt = play_list_synt.groupby('Temperature').count()['PlayKey']
total_nat = play_list_nat.groupby('Temperature').count()['PlayKey']

# select GameIDs with injuries, separately for each surface
inj_synt = injury_record[injury_record['Surface']=='Synthetic']
synt_inj_id = inj_synt['GameID']
inj_nat = injury_record[injury_record['Surface']=='Natural']
nat_inj_id = inj_nat['GameID']

# select games with injuries from each play list
play_list_synt_inj = play_list_synt.query('GameID in @synt_inj_id')
play_list_nat_inj = play_list_nat.query('GameID in @nat_inj_id')

# number of games with injuries by temperature
inj_synt = play_list_synt_inj.groupby('Temperature').count()['PlayKey']
inj_nat = play_list_nat_inj.groupby('Temperature').count()['PlayKey']

In [None]:
def group_by_t(data, dt=10):
  res = {t: 0 for t in range(10, 110, dt)}
  for t in data.index:
    if t % dt == 0:
      res[t] += data[t]
    else:
      key = (t // dt + 1) * dt
      res[key] += data[t]
  return res

inj_synt_g = group_by_t(inj_synt)
all_synt_g = group_by_t(total_synt)

inj_nat_g = group_by_t(inj_nat)
all_nat_g = group_by_t(total_nat)

res_synt = []
for key in list(inj_nat_g.keys())[:-1]:
  res_synt.append(inj_synt_g[key] / all_synt_g[key]*100)

res_nat = []
for key in list(inj_nat_g.keys())[:-1]:
  res_nat.append(inj_nat_g[key] / all_nat_g[key]*100)

t = np.arange(10, 100, 10)

synt_fit = np.polyfit(t[3:], res_synt[3:], 1)
nat_fit = np.polyfit(t[3:], res_nat[3:], 1)

sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")
fig, ax = plt.subplots(figsize=(12, 6))

sns.scatterplot(x=list(t), y=res_synt, label='Synthetic', color='r', ax=ax)
ax.plot(t[3:], synt_fit[0]*t[3:] + synt_fit[1], '--', color='r', alpha=0.8)
sns.scatterplot(x=list(t), y=res_nat, label='Natural', color='g', marker='s', ax=ax)
ax.plot(t[3:], nat_fit[0]*t[3:] + nat_fit[1], '--', color='g', alpha=0.8)

ax2 = ax.twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:gray'
ax2.set_ylabel('Total amount of games', color=color)  
play_list_synt['Temperature'].hist(bins=35, label='Synthetic', color='r', alpha=0.2, ax=ax2)
play_list_nat['Temperature'].hist(bins=35, label='Natural', color='g',  alpha=0.2, ax=ax2)
ax2.tick_params(axis='y', labelcolor=color)

ax.set_ylabel('Percent of injuries, %')
ax.set_xlabel('Temperature, F')
ax2.set_ylim([0, 350])
sns.despine()
ax.legend(loc='upper left', facecolor='white', framealpha=1)
fig.tight_layout()


There is not enough data to draw a solid conclusion, but it is likely that temperature influences injury risk on synthetic turf. Such an influence seems plausible, as a simple physical model could describe it. The physical properties of synthetic turf and their dependence on temperature should be investigated. If the dependence is confirmed, changing the properties of synthetic turf could reduce injury risk.
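
The temperature binning used above can also be expressed with `pandas.cut`. The sketch below works on invented per-game data (the column names `Temperature` and `Injured` and the helper `injury_rate_by_temperature` are assumptions for illustration). It also reports the number of games per bin, which matters because rates from sparsely populated bins are unreliable.

```python
import numpy as np
import pandas as pd

def injury_rate_by_temperature(games, bin_width=10, t_min=0, t_max=100):
    """Percent of games with an injury per temperature bin, plus bin sizes."""
    edges = np.arange(t_min, t_max + bin_width, bin_width)
    # Right-closed bins (t - width, t], matching the grouping used above.
    bins = pd.cut(games['Temperature'], bins=edges)
    grouped = games.groupby(bins, observed=True)['Injured']
    return pd.DataFrame({'games': grouped.size(),
                         'rate_pct': grouped.mean() * 100})

# Toy per-game table, invented for illustration only.
toy = pd.DataFrame({'Temperature': [45, 48, 50, 72, 75, 78, 80],
                    'Injured':     [0,  0,  1,  1,  0,  1,  1]})
rates = injury_rate_by_temperature(toy)
print(rates)
```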

**Summarizing categorical features:**

In [None]:
play_list = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/PlayList.csv')

clear = ['Clear and warm', 'Sunny', 'Clear',
       'Controlled Climate', 'Sunny and warm', 'Clear and Cool',
       'Clear and cold', 'Sunny and cold', 'Closed', 'Partly Sunny',
       'Mostly Sunny', 'Clear Skies', 'Partly sunny',
       'Sunny and clear', 'Clear skies', 'Sunny Skies',
       'Fair', 'Partly clear', 'Heat Index 95', 
       'Sunny, highs to upper 80s', 'Sun & clouds',
       'Mostly sunny', 'Sunny, Windy', 'Mostly Sunny Skies',
       'Clear and Sunny', 'Clear and sunny',
       'Clear to Partly Cloudy', 'Cold', 'N/A Indoor', 'N/A (Indoors)']

cloudy = ['Mostly Cloudy', 'Cloudy',
       'Partly Cloudy', 'Mostly cloudy', 'Cloudy and cold',
       'Cloudy and Cool', 'Partly cloudy',
       'Party Cloudy', 'Partly Clouidy', 'Overcast',
       'Mostly Coudy', 'cloudy', 'Coudy']

rainy = ['Cloudy, fog started developing in 2nd quarter', 'Rain',
       'Rain Chance 40%', 'Showers', 'Scattered Showers', 'Hazy',
       'Rain likely, temps in low 40s.', 'Cloudy, 50% change of rain', 
       'Light Rain', '10% Chance of Rain', 'Cloudy, chance of rain',
       'Cloudy, Rain', 'Rainy', '30% Chance of Rain',
       'Cloudy with periods of rain, thunder possible. Winds shifting to WNW, 10-20 mph.',
       'Rain shower']

snowy = ['Snow', 'Heavy lake effect snow', 
                'Cloudy, light snow accumulating 1-3"']

# assign the result back: in-place replace on a selected column may not modify the frame
play_list['Weather'] = play_list['Weather'].replace(clear, 'Clear')
play_list['Weather'] = play_list['Weather'].replace(cloudy, 'Cloudy')
play_list['Weather'] = play_list['Weather'].replace(rainy, 'Rainy')
play_list['Weather'] = play_list['Weather'].replace(snowy, 'Snowy')
play_list = play_list.dropna(subset=['Weather'])

# drop outliers from temperature
play_list = play_list[play_list['Temperature'] != -999]

injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
# drop rows with any missing value without modifying play_list itself
play_list_clean = play_list.dropna(axis=0)
play_list_clean = play_list_clean[play_list_clean['PlayType'] != '0']
play_list_clean = play_list_clean[play_list_clean['PositionGroup'] != 'Missing Data']

injured_game_id = injury_record['GameID'].unique()

play_list_injured = play_list_clean.query('GameID in @injured_game_id').copy()
play_list_not_injured = play_list_clean.query('GameID not in @injured_game_id').copy()

play_list_injured.drop_duplicates(subset=['GameID'], inplace=True)
play_list_not_injured.drop_duplicates(subset=['GameID'], inplace=True)

play_list_not_injured_subset = play_list_not_injured.sample(400)

play_list_injured['Injury'] = np.ones((play_list_injured.shape[0], 1))
play_list_not_injured_subset['Injury'] = np.zeros((play_list_not_injured_subset.shape[0], 1))

play_list_subset = pd.concat([play_list_injured, play_list_not_injured_subset])

play_list_subset['Temperature'] /= play_list_subset['Temperature'].max()
play_list_subset['PlayerDay'] /= play_list_subset['PlayerDay'].max() 
play_list_subset['PlayerGame'] /= play_list_subset['PlayerGame'].max() 
play_list_subset.drop(['PlayerKey', 'GameID', 'PlayKey', 'PlayerGamePlay'], axis=1, inplace=True)

play_list_X = play_list_subset[play_list_subset.columns[:-1]]
injury_y = play_list_subset['Injury']

play_list_subs_one_hot = pd.get_dummies(play_list_X)
play_list_subs_one_hot = pd.concat([play_list_subs_one_hot, injury_y], axis=1)

import seaborn as sns

#get correlations of each features in dataset
corrmat = play_list_subs_one_hot.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
sns.heatmap(play_list_subs_one_hot[top_corr_features].corr(), annot=False, cmap="RdBu_r")
#plt.tight_layout()

Most of the relations have already been observed in the previous sections, for example that natural turf correlates with open stadiums. We can also see some natural dependencies, such as a slight correlation between temperature and stadium type or weather conditions. Play type “rush” has a strong negative correlation with “pass”, as they are opposing play types. There are strong correlations between positions that reflect the most frequent player combinations. Focusing on injuries, there is a weak negative correlation with the player’s game number (`PlayerGame`), which likely reflects that injured athletes take part in fewer games. Other correlations with injuries are negligible.
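
Reading individual values off a large heat map is hard. A minimal sketch of ranking features by their correlation with the target, on an invented one-hot table (the toy values and the helper `correlation_with_target` are assumptions for illustration):

```python
import pandas as pd

def correlation_with_target(df, target):
    """Pearson correlation of every column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy table, invented for illustration only.
toy = pd.DataFrame({
    'Temperature': [0.5, 0.6, 0.9, 1.0, 0.4, 0.8],
    'FieldType_Synthetic': [0, 0, 1, 1, 0, 1],
    'PlayerGame': [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    'Injury': [0, 0, 1, 1, 0, 1],
})
ranked = correlation_with_target(toy, 'Injury')
print(ranked)
```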

An attempt to build a prediction model from the features above:

In [None]:
play_list_X = play_list_X.drop(['PlayerDay', 'PlayerGame'], axis=1)
X = pd.get_dummies(play_list_X)

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

X_train, X_test, y_train, y_test = train_test_split(X, injury_y, 
                                                    test_size=0.25, 
                                                    stratify=injury_y,
                                                    random_state=0)

In [0]:
from sklearn.metrics import roc_curve, auc

clf = GradientBoostingClassifier(min_samples_leaf=2,
                                 max_depth=5,
                                 max_features=8,
                                 learning_rate=0.2,
                                 n_estimators=100)

y_score_lr = clf.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

fig, ax = plt.subplots(figsize=(6, 15))
sns.set_context("talk", font_scale=0.8) 
ax.set_xlim([-0.01, 1.00])
ax.set_ylim([-0.01, 1.01])
ax.plot(fpr_lr, tpr_lr, lw=3, label='Gradient boosting ROC curve (area = {:0.2f})'.format(roc_auc_lr))
ax.set_xlabel('False Positive Rate', fontsize=16)
ax.set_ylabel('True Positive Rate', fontsize=16)
ax.set_title('ROC curve', fontsize=16)
ax.legend(loc='lower right', fontsize=13)
ax.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
ax.set_aspect('equal')
sns.despine()
fig.tight_layout()

Model performance is poor, so a feature importance analysis would be pointless.

# 4. Athlete movement analysis.

In [None]:
injury_record = pd.read_csv('/kaggle/input/nfl-playing-surface-analytics/InjuryRecord.csv')
play_list = pd.read_csv("../input/nfl-playing-surface-analytics/PlayList.csv")
track_data = pd.read_csv("../input/nfl-playing-surface-analytics/PlayerTrackData.csv")

In [0]:
def get_deriv(data, dt=0.1, threshold=300):
  '''
  Finite-difference derivative that handles outliers appearing in angle
  variables due to the 360-degree wrap-around.
  Outliers are replaced by the mean of their left and right neighbours.
  '''
  delta = np.array(data[1:]) - np.array(data[:-1])
  # indices where the jump is too large to be a real change
  outliers = np.where(np.abs(delta) > threshold)[0]
  for i in outliers:
    # skip outliers at the edges, which lack one of the neighbours
    if 0 < i < len(delta) - 1:
      delta[i] = (delta[i - 1] + delta[i + 1]) / 2
  return delta / dt

def smooth(x, window_len=11, window='hanning'):
    """smooth the data using a window with requested size.
    This method is based on the convolution of a scaled window with the signal.
    The signal is prepared by introducing reflected copies of the signal 
    (with the window size) in both ends so that transient parts are minimized
    in the beginning and end part of the output signal.
    input:
        x: the input signal 
        window_len: the dimension of the smoothing window; should be an odd integer
        window: the type of window from 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'
            flat window will produce a moving average smoothing.
    output:
        the smoothed signal  
    """
    if x.ndim != 1:
        raise ValueError('smooth only accepts 1 dimension arrays')
    if x.size < window_len:
        raise ValueError("Input vector needs to be bigger than window size.")
    if window_len < 3:
        return x
    if window not in ['flat', 'hanning', 'hamming', 'bartlett', 'blackman']:
        raise ValueError("Window must be one of 'flat', 'hanning', 'hamming', 'bartlett', 'blackman'")
    s = np.r_[x[window_len-1: 0: -1], x, x[-2: -window_len-1 :-1]]
    if window == 'flat': #moving average
        w = np.ones(window_len, 'd')
    else:
        w = getattr(np, window)(window_len)  # look up the window function by name
    y=np.convolve(w/w.sum(), s, mode='valid')
    return y

def eng_difference(fi1, fi2):
  fi1 = (fi1 + 90)*2*np.pi / 360
  fi2 = (fi2 + 90)*2*np.pi / 360
  dp = np.abs(np.sin(fi1) - np.sin(fi2))
  dfi = np.arccos(1 - dp)
  return dfi*360/2/np.pi

The data is represented as time series consisting of coordinates, speed, orientation and direction. The average duration of a play is about 30 s, with a sampling frequency of 10 Hz. Here is an example of one such play:
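
With a fixed sampling interval of 0.1 s, speed can be estimated from consecutive positions by finite differences. A minimal sketch on a synthetic straight-line track (the helper name `speed_from_xy` and the toy trajectory are assumptions for illustration):

```python
import numpy as np

def speed_from_xy(x, y, dt=0.1):
    """Estimate speed (yd/s) from positions sampled every `dt` seconds."""
    vx = np.diff(np.asarray(x, dtype=float)) / dt
    vy = np.diff(np.asarray(y, dtype=float)) / dt
    return np.hypot(vx, vy)

# Toy track: constant velocity of (3, 4) yd/s, i.e. a speed of 5 yd/s.
t = np.arange(0, 1.0, 0.1)
x = 3.0 * t
y = 4.0 * t
speed = speed_from_xy(x, y)
print(speed)
```

The result has one sample fewer than the input, because each speed value is computed from a pair of neighbouring positions.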

In [0]:
track = track_data[track_data['PlayKey'] == '36621-13-58']

sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")
fig, ax = plt.subplots(2, 2, figsize=(15, 7))

ax[0][0].plot(track['x'], track['y'], marker='.', ms=8,)
ax[0][1].plot(track['time'], track['x'], marker='.', ms=8, label='x')
ax[0][1].set_label('x')
ax2 = ax[0][1].twinx()  # instantiate a second axes that shares the same x-axis
color = 'tab:blue'
ax2.set_ylabel('y', color=color)  
ax2.plot(track['time'], track['y'], marker='.', ms=8, color=color, label='y')
ax2.tick_params(axis='y', labelcolor=color)

ax[1][0].plot(track['time'], track['o'], marker='.', ms=8, label='orientation')
ax[1][0].plot(track['time'], track['dir'], marker='.', ms=8, label='direction', color='tab:blue')

ax[1][1].plot(track['time'], track['s'], marker='.', ms=8, label='speed')

ax[0][0].set_title('Trajectory')
ax[0][1].set_title('x(t), y(t)')
ax[1][0].set_title('Orientation and direction')
ax[1][1].set_title('Speed')
ax[1][0].legend(fontsize=12)
ax[0][1].legend(fontsize=12)
ax2.legend(fontsize=12)
ax[0][0].grid()
ax[1][0].grid()
ax[0][1].grid()
ax[1][1].grid()
sns.despine()

ax[0][0].set_xlabel('$x, yd$')
ax[0][0].set_ylabel('$y, yd$')
ax[0][1].set_xlabel('$t, s$')
ax[0][1].set_ylabel('$x, yd$')
ax2.set_ylabel('$y, yd$')
ax[1][0].set_xlabel('$t, s$')
ax[1][0].set_ylabel('$angle, deg$')
ax[1][1].set_xlabel('$t, s$')
ax[1][1].set_ylabel('$V, yd/s$')

fig.tight_layout()

**Main ideas for feature extraction:**
<li>	It is likely that injuries happen at moments of rapid acceleration or sudden stops, i.e. during big positive and negative linear accelerations.
<li>	Sudden direction changes may lead to injury.
<li>	Running fast may increase the chance of positioning feet in such a manner that may lead to injury.
<li>	Running when having a big difference between orientation and direction may lead to injury.


Based on the above-mentioned assumptions, the following features were extracted:
1.	Linear acceleration (peak value)
2.	Linear deceleration (peak value)
3.	Linear speed (peak value)
4.	Angular speed of orientation (peak value)
5.	Angular speed of direction (peak value)
6.	Difference between direction and orientation (peak value)
7.	Difference between direction and orientation at the moment of maximum speed
8.	Difference between direction and orientation at the moment of maximum acceleration
9.	Difference between direction and orientation at the moment of maximum deceleration
10.	Speed at the moment of maximum difference between direction and orientation
11.	Acceleration at the moment of maximum difference between direction and orientation
12.	Average speed of side running (when the difference between direction and orientation is 80-100 degrees)
13.	Total distance per play
14.	Average speed of running backwards (when difference between direction and orientation is >100 degrees)


In [None]:
def get_global_peaks_extended(play_keys, min_sample_size=50):

  cols = ['a_max', 'a_min', 'v_max', 'vo_max', 'vdir_max',
          'd_max', 'd_Vmax', 'd_Amax', 'd_Amin', 
          'maxD_v', 'maxD_a',
          'siderun', 'distance', 'back_run']
  data = pd.DataFrame(columns=cols)

  win_len = 7
  col = 0
  for pk in play_keys:
    track = track_data[track_data['PlayKey'] == pk]
    vx = get_deriv(track['x'])
    vy = get_deriv(track['y'])
    v = np.sqrt(vx**2 + vy**2)
    o = track['o']
    dir_ = track['dir']
    dist = track['dis']

    if v.shape[0] + o.shape[0] + dir_.shape[0]  > 3*min_sample_size:
      vf = smooth(v, win_len, 'hanning')
      a = get_deriv(vf)
      af = smooth(a, win_len, 'hanning')
      vo = get_deriv(o)
      vof = smooth(vo, win_len-2, 'hanning')

      norm = np.array([1 if vi > 0.05*vf.max() else 0 for vi in v])
      d = np.abs(eng_difference(o[:-1], dir_[:-1])) * norm

      vdir = get_deriv(dir_) * norm
      vdir = smooth(vdir, win_len, 'hanning') 


      a_max = af.max()
      a_min = af.min()
      v_max = vf.max()
      vo_max = vof.max()
      vdir_max = vdir.max()
      d_max = d.max()

      sr = side_run(d, vf)
      dist = dist.sum()
      bck_run = back_run(vf, d)

      try:
        d_Vmax = d[np.where(vf==v_max)[0][0]-1]
        d_Amax = d[np.where(af==a_max)[0][0]-1]
        d_Amin = d[np.where(af==a_min)[0][0]-1]
        maxD_v = vf[np.where(d==d.max())[0][0]-1]
        maxD_a = af[np.where(d==d.max())[0][0]-1]
      except:
        continue
      
      values = [a_max, a_min, v_max, vo_max, vdir_max,
                d_max, d_Vmax, d_Amax, d_Amin,
                maxD_v, maxD_a,
                sr, dist, bck_run]
      data.loc[col] = values
      col += 1     
    else:
      continue

  return data

def back_run(v, d):
  d.index = range(len(d.index))
  back_r = d[d > 100]
  if back_r.shape[0] > 1:
    avg_v = v[back_r.index[:-1]].mean()
    return avg_v
  return 0

def side_run(d, v):
  d.index = range(len(d.index))
  side_r = d[(d > 80) & (d < 100)]
  if side_r.shape[0] > 1:
    v = v[side_r.index[:-1]].mean()
    return v
  return 0

In [None]:
inj_pk = injury_record['PlayKey'].dropna()
pk = play_list['PlayKey'].dropna()

healthy_pk = play_list.query('PlayKey not in @inj_pk')
healthy_pk = healthy_pk['PlayKey'].dropna()

# all pk without injury
healthy_pk = healthy_pk.unique()

subs = np.array(healthy_pk)
np.random.shuffle(subs)

subset_size = 400
healthy_pk_subset = subs[:subset_size]

In [None]:
injured_metrics = pd.read_csv("../input/metrics-for-nfl/InjMetrics_1.csv")
not_injured_metrics = pd.read_csv("../input/metrics-for-nfl/NotInjMetrics_1.csv")

injured_metrics['Injury'] = np.ones((injured_metrics.shape[0], 1))
not_injured_metrics['Injury'] = np.zeros((not_injured_metrics.shape[0], 1))

data = pd.concat([injured_metrics, not_injured_metrics])

In order to deal with class imbalance between plays where athletes got injured and plays where athletes did not get injured, plays without injuries were randomly downsampled to 400. Finally, all the defined features were extracted from the dataset of 77 plays where athletes got injured and 400 plays where athletes did not get injured. To see the overall picture, it’s convenient to look at the correlation heat map:

In [None]:
#get correlations of each features in dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,8))
#plot heat map
sns.heatmap(data[top_corr_features].corr(), annot=False, cmap="RdBu_r")
#plt.tight_layout()

In [None]:
from scipy.stats import ks_2samp


vo_ks_test = ks_2samp(injured_metrics['vo_max'], not_injured_metrics['vo_max'])[1]
vdir_ks_test = ks_2samp(injured_metrics['vdir_max'], not_injured_metrics['vdir_max'])[1]
d_ks_test = ks_2samp(injured_metrics['d_max'], not_injured_metrics['d_max'])[1]

dVmax_ks_test = ks_2samp(injured_metrics['d_Vmax'], not_injured_metrics['d_Vmax'])[1]
dAmax_ks_test = ks_2samp(injured_metrics['d_Amax'], not_injured_metrics['d_Amax'])[1]
aAmin_ks_test = ks_2samp(injured_metrics['d_Amin'], not_injured_metrics['d_Amin'])[1]
maxDv_ks_test = ks_2samp(injured_metrics['maxD_v'], not_injured_metrics['maxD_v'])[1]

distance_ks_test = ks_2samp(injured_metrics['distance'], not_injured_metrics['distance'])[1]
siderun_ks_test = ks_2samp(injured_metrics['siderun'], not_injured_metrics['siderun'])[1]
bkrun_ks_test = ks_2samp(injured_metrics['back_run'], not_injured_metrics['back_run'])[1]

a_ks_test = ks_2samp(injured_metrics['a_max'], not_injured_metrics['a_max'])[1]
d_ks_test = ks_2samp(injured_metrics['a_min'], not_injured_metrics['a_min'])[1]
v_ks_test = ks_2samp(injured_metrics['v_max'], not_injured_metrics['v_max'])[1]

Extracted features do not have a strong correlation with the risk of injury. Distributions of the metrics for injured and non-injured athletes were compared to look for differences. To quantify these differences, the Kolmogorov-Smirnov test was conducted. The p-value of each test is shown in the graphs.
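
The per-feature tests can also be run in a loop, which avoids accidentally reusing one variable name for two different features. A minimal sketch on synthetic metric tables (the toy data, the seed, and the helper `ks_by_feature` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def ks_by_feature(group_a, group_b, features):
    """Two-sample K-S p-value for each feature, smallest first."""
    pvals = {f: ks_2samp(group_a[f], group_b[f]).pvalue for f in features}
    return pd.Series(pvals, name='p_value').sort_values()

rng = np.random.default_rng(1)
# Toy metrics: 'v_max' differs between the groups, 'd_max' does not.
toy_injured = pd.DataFrame({'v_max': rng.normal(8.0, 1.0, 80),
                            'd_max': rng.normal(90.0, 20.0, 80)})
toy_healthy = pd.DataFrame({'v_max': rng.normal(6.0, 1.0, 400),
                            'd_max': rng.normal(90.0, 20.0, 400)})

p_values = ks_by_feature(toy_injured, toy_healthy, ['v_max', 'd_max'])
print(p_values)
```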

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 6))
sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")

g1 = sns.distplot(injured_metrics['vo_max'], label='Injured', ax=ax[0], color='r')
sns.distplot(not_injured_metrics['vo_max'], label='Not injured', ax=ax[0], color='g')
g1.text(500, 0.004, 'K-S test \np_val={:.1f}%'.format(vo_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)

g2 = sns.distplot(injured_metrics['vdir_max'], label='Injured', ax=ax[1], color='r')
sns.distplot(not_injured_metrics['vdir_max'], label='Not injured', ax=ax[1], color='g')
g2.text(600, 0.0024, 'K-S test \np_val={:.1f}%'.format(vdir_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)

g3 = sns.distplot(injured_metrics['d_max'], label='Injured', ax=ax[2], color='r')
sns.distplot(not_injured_metrics['d_max'], label='Not injured', ax=ax[2], color='g')
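# d_ks_test is overwritten by the a_min test above, so recompute the d_max p-value here
dmax_ks_test = ks_2samp(injured_metrics['d_max'], not_injured_metrics['d_max'])[1]
g3.text(60, 0.012, 'K-S test \np_val={:.1f}%'.format(dmax_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)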

fig.suptitle('Global peaks', fontsize=16)
ax[0].set_title('Maximum orientation speed')
ax[1].set_title('Maximum direction speed')
ax[2].set_title('Maximum difference between orientation and direction')
ax[0].set_xlabel('$deg/s$')
ax[1].set_xlabel('$deg/s$')
ax[2].set_xlabel('$deg$')
ax[0].legend()
ax[1].legend()
ax[2].legend()
ax[0].set_xlim(0, 500)
ax[1].set_xlim(0, 1000)
sns.despine()

The maximum difference between orientation and direction and maximum direction rotation speed had no significant difference between injured and non-injured samples. Formally the maximum orientation speed was different for injured and non-injured populations, but it is likely due to outliers on the tail of the distribution. There is some difference in total distance and “side run” features:

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 6))
sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")

g1 = sns.distplot(injured_metrics['distance'], label='Injured', ax=ax[0], color='r')
sns.distplot(not_injured_metrics['distance'], label='Not injured', ax=ax[0], color='g')
g1.text(100, 0.013, 'K-S test \np_val={:.1f}%'.format(distance_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)

g2 = sns.distplot(injured_metrics['siderun'], label='Injured', ax=ax[1], color='r')
sns.distplot(not_injured_metrics['siderun'], label='Not injured', ax=ax[1], color='g')
g2.text(4, 0.3, 'K-S test \np_val={:.1f}%'.format(siderun_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)

g3 = sns.distplot(injured_metrics['back_run'], label='Injured', ax=ax[2], color='r')
sns.distplot(not_injured_metrics['back_run'], label='Not injured', ax=ax[2], color='g')
g3.text(4, 0.3, 'K-S test \np_val={:.1f}%'.format(bkrun_ks_test*100), horizontalalignment='left', 
        size='medium', color='black', weight='semibold', fontsize=12)

ax[0].set_title('Distance')
ax[1].set_title('Side run')
ax[2].set_title('Back run')
ax[0].set_xlabel('$yd$')
ax[1].set_xlabel('$yd/s$')
ax[2].set_xlabel('$yd/s$')
ax[0].legend()
ax[1].legend()
ax[2].legend()
sns.despine()

Comparing metrics such as the difference between direction and orientation at the moment of maximum speed, at the moment of maximum acceleration, and at the moment of maximum deceleration, as well as the speed at the moment of maximum difference between direction and orientation, has shown no significant difference:

In [None]:
fig, ax = plt.subplots(1, 4, figsize=(20, 6))
sns.set_context("talk", font_scale=0.8) 
sns.set_palette("Blues_d")

g1 = sns.distplot(injured_metrics['d_Vmax'], label='Injured', ax=ax[0], color='r')
sns.distplot(not_injured_metrics['d_Vmax'], label='Not injured', ax=ax[0], color='g')
g1.text(125, 0.006, 'K-S test \np_val={:.1f}%'.format(dVmax_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)  # `size` is an alias of `fontsize`; pass only one

g2 = sns.distplot(injured_metrics['d_Amax'], label='Injured', ax=ax[1], color='r')
sns.distplot(not_injured_metrics['d_Amax'], label='Not injured', ax=ax[1], color='g')
g2.text(130, 0.0065, 'K-S test \np_val={:.1f}%'.format(dAmax_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)

g3 = sns.distplot(injured_metrics['d_Amin'], label='Injured', ax=ax[2], color='r')
sns.distplot(not_injured_metrics['d_Amin'], label='Not injured', ax=ax[2], color='g')
g3.text(120, 0.007, 'K-S test \np_val={:.1f}%'.format(aAmin_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)

g4 = sns.distplot(injured_metrics['maxD_v'], label='Injured', ax=ax[3], color='r')
sns.distplot(not_injured_metrics['maxD_v'], label='Not injured', ax=ax[3], color='g')
g4.text(3, 0.55, 'K-S test \np_val={:.1f}%'.format(maxDv_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)

fig.suptitle('Global peaks', fontsize=16)
ax[0].set_title('|dir - o| when v_max')
ax[1].set_title('|dir - o| when a_max')
ax[2].set_title('|dir - o| when a_min')
ax[3].set_title('Speed in moment of max |dir - o|')
ax[0].set_xlabel('$deg$')
ax[1].set_xlabel('$deg$')
ax[2].set_xlabel('$deg$')
ax[3].set_xlabel('$yd/s$')
ax[0].legend()
ax[1].legend()
ax[2].legend()
ax[3].legend()
sns.despine()
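The direction/orientation features compared above can be illustrated with a minimal sketch. It assumes a per-play tracking table with columns `dir` and `o` (degrees) and `s` (speed in yd/s), and it assumes the angular difference is wrapped into the range 0 to 180 degrees; the helper names `angle_diff_deg` and `peak_angle_features` are hypothetical and are not taken from the notebook, and the sample track is synthetic.

```python
import numpy as np
import pandas as pd


def angle_diff_deg(direction, orientation):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    diff = np.abs(np.asarray(direction, dtype=float)
                  - np.asarray(orientation, dtype=float)) % 360.0
    # Differences above 180 degrees are shorter when measured the other way round.
    return np.where(diff > 180.0, 360.0 - diff, diff)


def peak_angle_features(track):
    """Return |dir - o| at the speed peak and the speed at the |dir - o| peak.

    `track` is assumed to hold one play with columns `dir`, `o` and `s`.
    """
    diff = angle_diff_deg(track["dir"], track["o"])
    speed = track["s"].to_numpy()
    i_vmax = int(np.argmax(speed))   # row with maximum speed
    i_dmax = int(np.argmax(diff))    # row with maximum |dir - o|
    return {"d_Vmax": float(diff[i_vmax]), "maxD_v": float(speed[i_dmax])}


# Tiny synthetic track for illustration only; not competition data.
track = pd.DataFrame({
    "dir": [350.0, 10.0, 90.0, 180.0],
    "o": [10.0, 20.0, 270.0, 170.0],
    "s": [1.0, 6.0, 3.0, 2.0],
})
features = peak_angle_features(track)
print(features)
```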

The most significant difference is seen in the speed and acceleration features:

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 6))

g1 = sns.distplot(injured_metrics['a_max'], label='Injured', ax=ax[0], color='r')
sns.distplot(not_injured_metrics['a_max'], label='Not injured', ax=ax[0], color='g')
g1.text(10, 0.125, 'K-S test \np_val={:.1f}%'.format(a_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)  # `size` is an alias of `fontsize`; pass only one

g2 = sns.distplot(injured_metrics['a_min'], label='Injured', ax=ax[1], color='r')
sns.distplot(not_injured_metrics['a_min'], label='Not injured', ax=ax[1], color='g')
g2.text(-11, 0.23, 'K-S test \np_val={:.1f}%'.format(d_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)

g3 = sns.distplot(injured_metrics['v_max'], label='Injured', ax=ax[2], color='r')
sns.distplot(not_injured_metrics['v_max'], label='Not injured', ax=ax[2], color='g')
g3.text(10, 0.14, 'K-S test \np_val={:.1f}%'.format(v_ks_test*100), horizontalalignment='left',
        color='black', weight='semibold', fontsize=12)

fig.suptitle('Global peaks', fontsize=16)
ax[0].set_title('Maximum linear accelerations')
ax[1].set_title('Maximum linear decelerations')
ax[2].set_title('Maximum linear speed')
ax[0].set_xlabel('$yd/s^2$')
ax[1].set_xlabel('$yd/s^2$')
ax[2].set_xlabel('$yd/s$')
ax[0].legend()
ax[1].legend()
ax[2].legend()

ax[0].set_xlim(0, 20)
ax[1].set_xlim(-15, 0)
ax[2].set_xlim(0, 15)
sns.despine()

We can see that the distributions for injured and non-injured athletes differ, and according to the Kolmogorov-Smirnov test the difference can be treated as statistically significant. The difference seems logical: on average, injured players had more powerful accelerations and decelerations. The average maximum speed of the injured population is approximately 20% higher. That may well be the reason for the difference in distance and “side run” between the injured and non-injured populations, as both metrics correlate with speed.
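As a rough illustration of the comparison described above, the sketch below computes a two-sample Kolmogorov-Smirnov p-value and the relative difference in means for one feature. It assumes `scipy.stats.ks_2samp` as the test implementation, which may differ from how the `*_ks_test` values in this notebook were obtained; the helper name `compare_feature` is hypothetical, and the samples are synthetic draws rather than the competition data.

```python
import numpy as np
from scipy.stats import ks_2samp


def compare_feature(injured, not_injured):
    """Return the two-sample K-S p-value and the relative difference in means."""
    injured = np.asarray(injured, dtype=float)
    not_injured = np.asarray(not_injured, dtype=float)
    result = ks_2samp(injured, not_injured)
    # Relative difference: 0.2 means the injured mean is 20% higher.
    rel_diff = injured.mean() / not_injured.mean() - 1.0
    return result.pvalue, rel_diff


# Synthetic maximum-speed samples (yd/s), for illustration only.
rng = np.random.default_rng(0)
not_injured_v = rng.normal(loc=5.0, scale=1.5, size=5000)
injured_v = rng.normal(loc=6.0, scale=1.5, size=100)

p_value, rel_diff = compare_feature(injured_v, not_injured_v)
print('K-S test p_val={:.1f}%, mean difference={:+.0%}'.format(p_value * 100, rel_diff))
```

A small p-value only says that the two distributions differ; the relative difference in means gives a sense of how large the shift is.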

# General conclusions:
Injuries are stochastic in nature. Some important factors may have been out of scope, for example the health condition of athletes: some of them may be more prone to injuries (due to joint dysfunction, repetitive trauma, chronic diseases, etc.). It seems fair to suggest that different athletes may have different injury risks when exposed to the same conditions. No single strong factor, or combination of factors, whose alteration would diminish the risk of injury was found in this work. At the same time, many factors and combinations were found that slightly influence injury risk. The main factors influencing injury risk are as follows:
1. Turf type. Injuries on synthetic turf happen 1.7 times more often than on natural turf.
2. Play scenario. Injury risk varies depending on the play scenario (a combination of roster position, position, group position, and play type) and the turf type. The play scenario “Wide Receiver, WR, WR, Kickoff” on synthetic turf has the highest injury risk; the chance of injury in this scenario is about 0.9%. On natural turf, the highest injury risk occurs in the “Linebacker, OLB, LB, Punt” scenario; the chance of injury in this scenario is about 0.4%.
3. Temperature. It is likely that injury risk on synthetic turf increases at higher temperatures (approximately above 70 °F). One can suggest that some physical properties of synthetic turf start changing at these temperatures. Natural turf is seemingly not temperature-sensitive, as no indications of such a dependency were found.
4. Player dynamics. Athletes who got injured, on average, ran faster and had more rapid accelerations than those who did not get injured. This difference does not depend on turf type.
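
A ratio such as the "1.7 times" figure for turf type can be read as a ratio of injuries per play on the two surfaces. The sketch below shows that arithmetic under this assumption; the function names are hypothetical, and the counts are round illustrative numbers chosen to give a ratio of 1.7, not the actual counts from the competition data.

```python
def injury_rate(injuries, plays):
    """Injuries per play for one surface."""
    if plays <= 0:
        raise ValueError("plays must be positive")
    return injuries / plays


def rate_ratio(injuries_a, plays_a, injuries_b, plays_b):
    """How many times more often injuries occur on surface A than on surface B."""
    rate_b = injury_rate(injuries_b, plays_b)
    if rate_b == 0:
        raise ValueError("reference surface has no injuries; ratio is undefined")
    return injury_rate(injuries_a, plays_a) / rate_b


# Hypothetical round numbers for illustration only; NOT the real dataset counts.
synthetic_injuries, synthetic_plays = 17, 10_000
natural_injuries, natural_plays = 10, 10_000

ratio = rate_ratio(synthetic_injuries, synthetic_plays, natural_injuries, natural_plays)
print('Synthetic vs natural injury rate ratio: {:.1f}'.format(ratio))
```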
