# Exploratory Data Analysis

1.  Data Distribution
2.  Categorical Data
3.  Correlation Analysis
4.  Time-Based Analysis
5.  Team Performance Analysis
6.  Advanced Stats
7.  Feature Insights

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import pandas as pd

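# Add the project root (the parent of the notebook directory) to the Python path.
# __file__ is not defined inside a notebook, so use the current working directory.
notebook_dir = os.getcwd()
project_root = os.path.dirname(notebook_dir)
if project_root not in sys.path:
    sys.path.append(project_root)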

In [2]:
# Load the games data from the parquet files
df_2016_plus = pd.read_parquet('../data/02_interim/df_2016_plus.parquet')
df_all_years = pd.read_parquet('../data/02_interim/df_all_years.parquet')
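
A quick size and missing-value summary can confirm that both frames loaded as expected before any analysis. The sketch below is a minimal example, not part of the original notebook: `summarize_frame` is a hypothetical helper, and `toy` is made-up stand-in data so the block runs without the parquet files.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame, name: str) -> dict:
    """Return basic size and missing-value facts for a DataFrame."""
    return {
        "name": name,
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
    }


# Hypothetical stand-in data; in the notebook, pass df_2016_plus or df_all_years.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "season": [2016, 2016, 2017],
        "team_points": [21, None, 35],
    }
)

summary = summarize_frame(toy, "toy")
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_frame(df_all_years, "df_all_years")`.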

## 1.   Data Distribution

**Columns**

*Descriptive*
-   id -> unique identifier for each game
-   season, week, start_date, season_type -> year, week, date, and regular season or postseason
-   neutral_site, conference_game, venue_id, venue
-   team_id, team, team_name, opponent_id, opponent
-   team_division, team_conference, opponent_division, opponent_conference
-   is_home, home_away
-   matchup

*Advanced Stats (for both offense and defense)*
- drives, explosiveness, line_yards, line_yards_total
- open_field_yards, open_field_yards_total, plays
- power_success, ppa, second_level_yards
- second_level_yards_total, stuff_rate, success_rate
- total_ppa
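
Since each advanced stat exists for both sides of the ball, the full column names can be generated rather than typed out. The sketch below assumes the `offense_` prefix seen in the column listing (for example `offense_drives`) and a matching `defense_` prefix, which is an assumption; `advanced_stat_columns` is a hypothetical helper.

```python
# Stat names copied from the list above.
ADVANCED_STATS = [
    "drives", "explosiveness", "line_yards", "line_yards_total",
    "open_field_yards", "open_field_yards_total", "plays",
    "power_success", "ppa", "second_level_yards",
    "second_level_yards_total", "stuff_rate", "success_rate",
    "total_ppa",
]


def advanced_stat_columns(sides=("offense", "defense")):
    """Build '<side>_<stat>' column names for every side and stat."""
    # The 'defense_' prefix is assumed to mirror the 'offense_' prefix.
    return [f"{side}_{stat}" for side in sides for stat in ADVANCED_STATS]


columns = advanced_stat_columns()
print(len(columns))  # 28 (14 stats x 2 sides)
print(columns[:3])
```

The resulting list could then be used to select those columns, for example with `df_all_years[[c for c in columns if c in df_all_years.columns]]`, which skips any generated name that does not actually exist.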

*Situational Stats (for both offense and defense)*
- passing_downs:
  - explosiveness, ppa, success_rate
- passing_plays:
  - explosiveness, ppa, success_rate, total_ppa
- rushing_plays:
  - explosiveness, ppa, success_rate, total_ppa
- standard_downs:
  - explosiveness, ppa, success_rate

*Basic Stats*
- team_points, opponent_points, point_difference
- result, win
- totalYards, firstDowns, possessionTime
- thirdDownEff, fourthDownEff
- passingTDs, netPassingYards, completionAttempts, yardsPerPass
- rushingTDs, rushingYards, rushingAttempts, yardsPerRushAttempt
- puntReturns, puntReturnYards, puntReturnTDs
- kickingPoints
- totalPenaltiesYards
- turnovers, interceptions, interceptionYards, interceptionTDs, passesIntercepted
- totalFumbles, fumblesLost, fumblesRecovered
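
Some of these basic stats may need parsing before their distributions can be examined. The sketch below assumes that `thirdDownEff`, `fourthDownEff`, and `completionAttempts` are stored as "made-attempts" strings such as `"5-12"`, and that `possessionTime` is stored as `"MM:SS"`; neither format is confirmed by this notebook, so check the dtypes first. Both helper functions and the sample values are hypothetical.

```python
import pandas as pd


def split_made_attempts(series: pd.Series) -> pd.DataFrame:
    """Split 'made-attempts' strings into numeric made, attempts, and rate."""
    # Assumed format: "5-12" means 5 conversions on 12 attempts.
    parts = series.str.split("-", n=1, expand=True)
    parts = parts.apply(pd.to_numeric, errors="coerce")
    parts.columns = ["made", "attempts"]
    # Zero attempts yields NaN rather than raising an error.
    parts["rate"] = parts["made"] / parts["attempts"]
    return parts


def possession_seconds(series: pd.Series) -> pd.Series:
    """Convert 'MM:SS' strings to total seconds."""
    # Assumed format: "30:15" means 30 minutes and 15 seconds.
    parts = series.str.split(":", n=1, expand=True)
    parts = parts.apply(pd.to_numeric, errors="coerce")
    return parts[0] * 60 + parts[1]


# Made-up sample values for illustration only.
third_down = split_made_attempts(pd.Series(["5-12", "0-0"]))
possession = possession_seconds(pd.Series(["30:15", "29:45"]))
print(third_down)
print(possession)
```

Converting to numeric columns this way keeps the original strings untouched, so the parsed values can be compared against the raw data if the assumed format turns out to be wrong.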

*Additional Stats (df_2016_plus only)*
- attendance, excitement_index
- kickReturnYards, kickReturnTDs, kickReturns
- tacklesForLoss, defensiveTDs, tackles, sacks
- qbHurries, passesDeflected
- team_talent, opponent_talent

**Potential Features**
-   season, week, season_type
-   neutral_site, venue_id, team_id, opponent_id, is_home
-   team_points, opponent_points, point_difference

**Target**
-   win
-   team_points, opponent_points
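
Because `team_points`, `opponent_points`, and `point_difference` appear under both headings, it may help to verify how they relate to `win`. The sketch below assumes that `point_difference` equals `team_points - opponent_points` and that `win` is 1 when that difference is positive; both definitions are assumptions, and `count_inconsistent_rows` and the sample scores are hypothetical.

```python
import pandas as pd


def count_inconsistent_rows(df: pd.DataFrame) -> int:
    """Count rows where the score columns disagree with the assumed definitions."""
    expected_diff = df["team_points"] - df["opponent_points"]
    expected_win = (expected_diff > 0).astype(int)
    bad_diff = df["point_difference"] != expected_diff
    bad_win = df["win"].astype(int) != expected_win
    return int((bad_diff | bad_win).sum())


# Made-up scores for illustration only.
games = pd.DataFrame(
    {
        "team_points": [28, 17, 35],
        "opponent_points": [21, 24, 31],
    }
)
games["point_difference"] = games["team_points"] - games["opponent_points"]
games["win"] = (games["point_difference"] > 0).astype(int)

inconsistent = count_inconsistent_rows(games)
print(games)
print(inconsistent)
```

If the assumed definitions hold on the real data, same-game scores fully determine `win`, which is worth keeping in mind when choosing which of these columns serve as features and which as targets.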

In [5]:
# List all columns in df_all_years
all_years_columns = df_all_years.columns.tolist()
print("Columns in df_all_years:")
for i in range(0, len(all_years_columns), 3):
    print(", ".join(all_years_columns[i:i+3]))

# Find additional columns in df_2016_plus
additional_columns = [col for col in df_2016_plus.columns if col not in df_all_years.columns]
print("\nAdditional columns in df_2016_plus:")
for i in range(0, len(additional_columns), 3):
    print(", ".join(additional_columns[i:i+3]))


Columns in df_all_years:
id, season, week
season_type, start_date, neutral_site
conference_game, venue_id, venue
team_id, team, team_division
team_points, opponent_id, opponent
opponent_conference, opponent_division, opponent_points
is_home, team_name, team_conference
home_away, team_points_stats, fumblesRecovered
rushingTDs, puntReturnYards, puntReturnTDs
puntReturns, passingTDs, kickingPoints
firstDowns, thirdDownEff, fourthDownEff
totalYards, netPassingYards, completionAttempts
yardsPerPass, rushingYards, rushingAttempts
yardsPerRushAttempt, totalPenaltiesYards, turnovers
fumblesLost, interceptions, possessionTime
interceptionYards, interceptionTDs, passesIntercepted
totalFumbles, point_difference, result
offense_drives, offense_explosiveness, offense_line_yards
offense_line_yards_total, offense_open_field_yards, offense_open_field_yards_total
offense_plays, offense_power_success, offense_ppa
offense_second_level_yards, offense_second_level_yards_total, offense_stuff_rate
offense_su