# ScoreSight - Part 1: Data Loading and Initial EDA

**Author:** Prathamesh Fuke  
**Branch:** Prathamesh_Fuke  
**Date:** October 28, 2025

## Objective
Load the English Premier League (EPL) datasets and perform initial exploratory data analysis to understand:
- Dataset structure and dimensions
- Column names and data types
- Missing values
- Basic statistics
- Data quality issues

## Datasets
1. **Match Winner.csv** - Historical match results
2. **Goals & Assist.xlsx** - Player statistics
3. **ScoreSight_ML_Season_LeagueWinner_Champion.csv** - Season-level data

## 1. Import Libraries

In [None]:
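import warnings

import numpy as np
import pandas as pd
from IPython.display import display  # explicit import so display() also works outside Jupyter

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')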

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print("✓ Libraries imported successfully!")

## 2. Load Datasets

In [None]:
# Load Match Winner dataset
print("Loading Match Winner dataset...")
match_data = pd.read_csv('../datasets/Match Winner.csv')
print(f"✓ Loaded successfully! Shape: {match_data.shape}")
print(f"  Rows: {match_data.shape[0]:,} | Columns: {match_data.shape[1]}")

In [None]:
# Load Goals & Assist dataset (reading .xlsx with pandas requires the openpyxl package)
print("Loading Goals & Assist dataset...")
player_data = pd.read_excel('../datasets/Goals & Assist.xlsx')
print(f"✓ Loaded successfully! Shape: {player_data.shape}")
print(f"  Rows: {player_data.shape[0]:,} | Columns: {player_data.shape[1]}")

In [None]:
# Load Season League Winner dataset
print("Loading Season League Winner dataset...")
league_data = pd.read_csv('../datasets/ScoreSight_ML_Season_LeagueWinner_Champion.csv')
print(f"✓ Loaded successfully! Shape: {league_data.shape}")
print(f"  Rows: {league_data.shape[0]:,} | Columns: {league_data.shape[1]}")
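
The three load cells above follow the same pattern, so they could be folded into one helper. The sketch below is only an illustration: `load_dataset` and its `label` parameter are hypothetical names that do not appear in this notebook, and the extension-based reader choice is an assumption about how the files are stored.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path, label):
    """Load a CSV or Excel file into a DataFrame and report its shape.

    `load_dataset` and `label` are hypothetical names, not part of the notebook.
    """
    path = Path(path)
    if not path.exists():
        # Fail early with the dataset name instead of a bare pandas error
        raise FileNotFoundError(f"{label}: file not found at {path}")

    # Choose the reader from the file extension
    if path.suffix.lower() in {".xlsx", ".xls"}:
        df = pd.read_excel(path)
    else:
        df = pd.read_csv(path)

    print(f"Loaded {label}. Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
    return df
```

With this in place, a call such as `match_data = load_dataset('../datasets/Match Winner.csv', 'Match Winner')` would replace one of the cells above.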

## 3. Initial Data Inspection

### 3.1 Match Winner Dataset

In [None]:
print("="*80)
print("MATCH WINNER DATASET - OVERVIEW")
print("="*80)
print(f"\nShape: {match_data.shape}")
print(f"\nColumn Names ({len(match_data.columns)}):")
for i, col in enumerate(match_data.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
print("\nFirst 5 rows:")
display(match_data.head())

In [None]:
print("\nData Types:")
print(match_data.dtypes)

In [None]:
print("\nDataset Info:")
match_data.info()

In [None]:
print("\nMissing Values Analysis:")
missing = match_data.isnull().sum()
missing_pct = (missing / len(match_data)) * 100
missing_df = pd.DataFrame({
    'Column': match_data.columns,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_df) > 0:
    display(missing_df)
else:
    print("✓ No missing values found!")
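
The same missing-value calculation is repeated for each dataset in this notebook. A minimal sketch of a reusable version follows; `summarize_missing` is a hypothetical helper name, and the `demo` frame holds made-up values rather than EPL data.

```python
import pandas as pd


def summarize_missing(df):
    """Return per-column missing counts and percentages, largest first.

    `summarize_missing` is a hypothetical helper name.
    """
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "Column": df.columns,
        "Missing_Count": missing.values,
        # max(..., 1) avoids dividing by zero on an empty frame
        "Missing_Percentage": (missing.values / max(len(df), 1)) * 100,
    })
    # Keep only columns that actually have gaps
    summary = summary[summary["Missing_Count"] > 0]
    return summary.sort_values("Missing_Count", ascending=False).reset_index(drop=True)


# Tiny illustrative frame (made-up values, not from the EPL data)
demo = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None, 2, 3, 4]})
print(summarize_missing(demo))
```

An empty result means the dataset has no missing values, which matches the `else` branch in the cell above.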

In [None]:
print("\nStatistical Summary (Numeric Columns):")
display(match_data.describe())

### 3.2 Player Data (Goals & Assists)

In [None]:
print("="*80)
print("PLAYER DATA (GOALS & ASSISTS) - OVERVIEW")
print("="*80)
print(f"\nShape: {player_data.shape}")
print(f"\nColumn Names ({len(player_data.columns)}):")
for i, col in enumerate(player_data.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
print("\nFirst 5 rows:")
display(player_data.head())

In [None]:
print("\nData Types:")
print(player_data.dtypes)

In [None]:
print("\nDataset Info:")
player_data.info()

In [None]:
print("\nMissing Values Analysis:")
missing = player_data.isnull().sum()
missing_pct = (missing / len(player_data)) * 100
missing_df = pd.DataFrame({
    'Column': player_data.columns,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_df) > 0:
    display(missing_df)
else:
    print("✓ No missing values found!")

In [None]:
print("\nStatistical Summary (Numeric Columns):")
display(player_data.describe())

### 3.3 League Data

In [None]:
print("="*80)
print("LEAGUE DATA - OVERVIEW")
print("="*80)
print(f"\nShape: {league_data.shape}")
print(f"\nColumn Names ({len(league_data.columns)}):")
for i, col in enumerate(league_data.columns, 1):
    print(f"{i:2d}. {col}")

In [None]:
print("\nFirst 5 rows:")
display(league_data.head())

In [None]:
print("\nData Types:")
print(league_data.dtypes)

In [None]:
print("\nDataset Info:")
league_data.info()

In [None]:
print("\nMissing Values Analysis:")
missing = league_data.isnull().sum()
missing_pct = (missing / len(league_data)) * 100
missing_df = pd.DataFrame({
    'Column': league_data.columns,
    'Missing_Count': missing.values,
    'Missing_Percentage': missing_pct.values
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_df) > 0:
    display(missing_df)
else:
    print("✓ No missing values found!")

In [None]:
print("\nStatistical Summary (Numeric Columns):")
display(league_data.describe())
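
The statistical summaries in this section cover numeric columns only. If the datasets also contain text columns, a complementary summary could look like the sketch below. The helper name `describe_non_numeric` and the demo column names (`team`, `result`, `goals`) are assumptions for illustration, not columns confirmed to exist in these files.

```python
import pandas as pd


def describe_non_numeric(df):
    """Summarize non-numeric columns (count, unique, top, freq).

    Returns None when the frame has no non-numeric columns.
    `describe_non_numeric` is a hypothetical helper name.
    """
    non_numeric = df.select_dtypes(exclude="number")
    if non_numeric.shape[1] == 0:
        return None
    return non_numeric.describe()


# Made-up rows with hypothetical column names
demo = pd.DataFrame({
    "team": ["X", "Y", "X"],
    "result": ["H", "A", "H"],
    "goals": [2, 0, 1],
})
print(describe_non_numeric(demo))
```

The `unique` and `top` rows give a quick view of how many distinct values each text column has and which value dominates.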

## 4. Data Quality Checks

In [None]:
print("="*80)
print("DATA QUALITY SUMMARY")
print("="*80)

datasets = {
    'Match Winner': match_data,
    'Player Data': player_data,
    'League Data': league_data
}

for name, df in datasets.items():
    print(f"\n{name}:")
    print(f"  - Total Rows: {len(df):,}")
    print(f"  - Total Columns: {len(df.columns)}")
    print(f"  - Duplicates: {df.duplicated().sum():,}")
    print(f"  - Missing Values: {df.isnull().sum().sum():,}")
    print(f"  - Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
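
The loop above prints the metrics one dataset at a time. As a sketch of an alternative, the same numbers could be collected into a single table for side-by-side comparison. `quality_summary` is a hypothetical name, and the `demo` dictionary contains made-up frames.

```python
import pandas as pd


def quality_summary(datasets):
    """Build one row of basic quality metrics per dataset.

    `datasets` maps a display name to a DataFrame.
    `quality_summary` is a hypothetical helper name.
    """
    rows = []
    for name, df in datasets.items():
        rows.append({
            "Dataset": name,
            "Rows": len(df),
            "Columns": df.shape[1],
            "Duplicates": int(df.duplicated().sum()),
            "Missing_Values": int(df.isnull().sum().sum()),
            "Memory_MB": df.memory_usage(deep=True).sum() / 1024**2,
        })
    return pd.DataFrame(rows).set_index("Dataset")


# Made-up frames for illustration only
demo = {
    "A": pd.DataFrame({"x": [1, 1, 2], "y": ["p", "p", None]}),
    "B": pd.DataFrame({"z": [1.0, 2.0]}),
}
print(quality_summary(demo))
```

Passing the notebook's own `datasets` dictionary to this function would produce one row per dataset.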

## 5. Key Findings and Next Steps

### Summary
With all three datasets loaded and inspected, the remaining work is split across the next three notebooks:

1. **Data Cleaning** (Notebook 02)
   - Handle missing values
   - Remove duplicates
   - Fix data type issues
   - Handle outliers

2. **Feature Engineering** (Notebook 03)
   - Create team form indicators
   - Calculate goal averages
   - Generate home/away statistics
   - Build player performance metrics

3. **Encoding & Feature Selection** (Notebook 04)
   - Encode categorical variables
   - Select relevant features for modeling
   - Handle multicollinearity

### Save Current State

In [None]:
# Save raw data for reference
from pathlib import Path

raw_dir = Path('../data/raw')
raw_dir.mkdir(parents=True, exist_ok=True)  # to_csv raises an error if the folder is missing

print("Saving datasets for next stage...")
match_data.to_csv(raw_dir / 'data_raw_match.csv', index=False)
player_data.to_csv(raw_dir / 'data_raw_player.csv', index=False)
league_data.to_csv(raw_dir / 'data_raw_league.csv', index=False)
print("\n✓ All datasets saved successfully!")
print("\n" + "="*80)
print("NOTEBOOK 01 COMPLETED - Ready for Data Cleaning")
print("="*80)