# 🏆 FootyPredict Pro - Model Training Notebook

**Google Colab GPU Training for Football Match Predictions**

This notebook downloads historical football match data, trains four classification models on it, and packages the trained models for download.

### Models Trained:
- XGBoost
- LightGBM
- CatBoost
- Neural Network (PyTorch)

### Data Sources:
- Football-Data.co.uk (Historical odds and results)
- Kaggle European Soccer Database
- Open Football Data

---
**Run each cell in order. After training completes, download models from the Files panel.**

## Step 1: Setup Environment

In [None]:
# Install required packages
!pip install -q xgboost lightgbm catboost torch scikit-learn pandas numpy kaggle gdown

import os
import json
import zipfile
import requests
import numpy as np
import pandas as pd
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.filterwarnings('ignore')

print('✅ Environment ready!')
print(f'📅 Training started: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')

## Step 2: Download Football Datasets

In [None]:
# Create data directory
os.makedirs('data', exist_ok=True)
os.makedirs('models/trained', exist_ok=True)

# Football-Data.co.uk - Historical match data with odds
leagues = {
    'E0': 'Premier League',
    'D1': 'Bundesliga',
    'SP1': 'La Liga',
    'I1': 'Serie A',
    'F1': 'Ligue 1'
}

seasons = ['2324', '2223', '2122', '2021', '1920', '1819', '1718', '1617', '1516', '1415']

all_data = []

print('üì• Downloading historical match data...')
for league_code, league_name in leagues.items():
    for season in seasons:
        url = f'https://www.football-data.co.uk/mmz4281/{season}/{league_code}.csv'
        try:
            df = pd.read_csv(url, encoding='utf-8', on_bad_lines='skip')
            df['League'] = league_name
            df['Season'] = season
            all_data.append(df)
            print(f'  ‚úì {league_name} {season}: {len(df)} matches')
        except Exception as e:
            pass

# Combine all data
raw_data = pd.concat(all_data, ignore_index=True)
print(f'\nüìä Total matches downloaded: {len(raw_data):,}')

## Step 3: Feature Engineering

In [None]:
# Select relevant columns
columns_needed = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR',
                  'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR',
                  'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'League']

# Keep only columns that exist
available_cols = [c for c in columns_needed if c in raw_data.columns]
df = raw_data[available_cols].copy()

# Drop rows with missing values
df = df.dropna(subset=['HomeTeam', 'AwayTeam', 'FTR'])

print(f'üìä Matches after cleaning: {len(df):,}')

# Encode teams
team_encoder = LabelEncoder()
all_teams = pd.concat([df['HomeTeam'], df['AwayTeam']]).unique()
team_encoder.fit(all_teams)

df['HomeTeamEnc'] = team_encoder.transform(df['HomeTeam'])
df['AwayTeamEnc'] = team_encoder.transform(df['AwayTeam'])

# Encode result (H=Home Win, D=Draw, A=Away Win)
result_map = {'H': 0, 'D': 1, 'A': 2}
df['Result'] = df['FTR'].map(result_map)

# Encode league
league_encoder = LabelEncoder()
df['LeagueEnc'] = league_encoder.fit_transform(df['League'])

# Calculate derived features
df['GoalDiff'] = df['FTHG'] - df['FTAG']
df['TotalGoals'] = df['FTHG'] + df['FTAG']
df['BTTS'] = ((df['FTHG'] > 0) & (df['FTAG'] > 0)).astype(int)
df['Over25'] = (df['TotalGoals'] > 2.5).astype(int)

# Fill numeric columns with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print('✅ Feature engineering complete')
print(f'   Teams encoded: {len(all_teams)}')
print(f'   Leagues: {list(leagues.values())}')

## Step 4: Prepare Training Data

In [None]:
# Features for training
feature_cols = ['HomeTeamEnc', 'AwayTeamEnc', 'LeagueEnc']

# Add odds features if available
odds_cols = ['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA']
for col in odds_cols:
    if col in df.columns:
        feature_cols.append(col)

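# Add match stats if available
# Note: these are in-match statistics (shots, fouls, corners, cards), so they
# are only known after kick-off and are not available for pre-match prediction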
stat_cols = ['HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY']
for col in stat_cols:
    if col in df.columns:
        feature_cols.append(col)

# Filter valid features
feature_cols = [c for c in feature_cols if c in df.columns]

X = df[feature_cols].values
y_result = df['Result'].values
y_btts = df['BTTS'].values
y_over25 = df['Over25'].values

# Split once so that all three targets share the same train/test rows
(X_train_raw, X_test_raw,
 y_train, y_test,
 y_btts_train, y_btts_test,
 y_o25_train, y_o25_test) = train_test_split(
    X, y_result, y_btts, y_over25, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)

print('✅ Training data prepared')
print(f'   Training samples: {len(X_train):,}')
print(f'   Test samples: {len(X_test):,}')
print(f'   Features: {len(feature_cols)}')
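
The bookmaker odds columns are used as raw features in this step. As an aside, decimal odds can be converted into implied outcome probabilities, which are often easier to interpret. The sketch below assumes the odds are in decimal format and removes the bookmaker margin by simple proportional normalisation; both the helper name `implied_probabilities` and the normalisation choice are assumptions, not something this notebook does.

```python
def implied_probabilities(home_odds: float, draw_odds: float, away_odds: float) -> list[float]:
    """Convert decimal odds into home/draw/away probabilities that sum to 1."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in raw]


# Illustrative odds, not taken from the dataset
probs = implied_probabilities(1.9, 3.6, 4.2)
print([round(p, 3) for p in probs])
```

Applied to the `B365H`, `B365D`, and `B365A` columns, this would yield three extra features per match on a common 0 to 1 scale.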

## Step 5: Train XGBoost Model

In [None]:
import xgboost as xgb

print('üöÄ Training XGBoost model...')

xgb_model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='mlogloss'
)

xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

xgb_pred = xgb_model.predict(X_test)
xgb_acc = accuracy_score(y_test, xgb_pred)

# Save model
xgb_model.save_model('models/trained/xgb_football.json')

print(f'✅ XGBoost trained - Accuracy: {xgb_acc:.2%}')

## Step 6: Train LightGBM Model

In [None]:
import lightgbm as lgb

print('üöÄ Training LightGBM model...')

lgb_model = lgb.LGBMClassifier(
    n_estimators=500,
    max_depth=10,
    learning_rate=0.05,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,  # subsample only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    random_state=42,
    verbose=-1
)

lgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

lgb_pred = lgb_model.predict(X_test)
lgb_acc = accuracy_score(y_test, lgb_pred)

# Save model
lgb_model.booster_.save_model('models/trained/lgb_football.txt')

print(f'✅ LightGBM trained - Accuracy: {lgb_acc:.2%}')

## Step 7: Train CatBoost Model

In [None]:
from catboost import CatBoostClassifier

print('üöÄ Training CatBoost model...')

cat_model = CatBoostClassifier(
    iterations=500,
    depth=8,
    learning_rate=0.05,
    loss_function='MultiClass',
    random_seed=42,
    verbose=False
)

cat_model.fit(X_train, y_train, eval_set=(X_test, y_test))

# MultiClass predictions come back as a column vector; flatten to 1-D
cat_pred = cat_model.predict(X_test).ravel()
cat_acc = accuracy_score(y_test, cat_pred)

# Save model
cat_model.save_model('models/trained/cat_football.cbm')

print(f'✅ CatBoost trained - Accuracy: {cat_acc:.2%}')

## Step 8: Train Neural Network (PyTorch)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

print('üöÄ Training Neural Network...')

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'   Using device: {device}')

# Define model
class FootballPredictor(nn.Module):
    def __init__(self, input_dim, num_classes=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, num_classes)
        )
    
    def forward(self, x):
        return self.model(x)

# Prepare data
X_train_t = torch.FloatTensor(X_train).to(device)
y_train_t = torch.LongTensor(y_train).to(device)
X_test_t = torch.FloatTensor(X_test).to(device)
y_test_t = torch.LongTensor(y_test).to(device)

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Initialize model
nn_model = FootballPredictor(X_train.shape[1]).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(nn_model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)

# Training loop
best_acc = 0
for epoch in range(100):
    nn_model.train()
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = nn_model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
    
    # Evaluate
    nn_model.eval()
    with torch.no_grad():
        outputs = nn_model(X_test_t)
        _, predicted = torch.max(outputs, 1)
        acc = (predicted == y_test_t).sum().item() / len(y_test_t)
        scheduler.step(1 - acc)
        
        if acc > best_acc:
            best_acc = acc
            torch.save(nn_model.state_dict(), 'models/trained/nn_football.pt')

print(f'✅ Neural Network trained - Best Accuracy: {best_acc:.2%}')

## Step 9: Training Summary

In [None]:
# Summary
print('=' * 60)
print('🏆 TRAINING COMPLETE!')
print('=' * 60)
print('\n📊 Model Accuracies:')
print(f'   XGBoost:     {xgb_acc:.2%}')
print(f'   LightGBM:    {lgb_acc:.2%}')
print(f'   CatBoost:    {cat_acc:.2%}')
print(f'   Neural Net:  {best_acc:.2%}')
# This is the mean of the individual accuracies, not the accuracy of a combined model
print(f'\n   Mean accuracy: {(xgb_acc + lgb_acc + cat_acc + best_acc) / 4:.2%}')

# Save metadata
metadata = {
    'training_date': datetime.now().isoformat(),
    'total_samples': len(df),
    'features': feature_cols,
    'accuracies': {
        'xgboost': round(xgb_acc, 4),
        'lightgbm': round(lgb_acc, 4),
        'catboost': round(cat_acc, 4),
        'neural_net': round(best_acc, 4)
    }
}

with open('models/trained/training_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print('\nüíæ Models saved to: models/trained/')

## Step 10: Download Trained Models

In [None]:
import shutil
from google.colab import files

# Create zip file with all models
shutil.make_archive('footypredict_models', 'zip', 'models/trained')

print('üì¶ Models packaged!')
print('\nüì• Click below to download:')

# Auto-download the zip file
files.download('footypredict_models.zip')

print('\n✅ After download, extract to: soccer/models/trained/')

---
## 📋 Manual Model Files

If the automatic download fails, you can download the files manually from the Files panel (left sidebar):

```
models/trained/
├── xgb_football.json      # XGBoost model
├── lgb_football.txt       # LightGBM model
├── cat_football.cbm       # CatBoost model
├── nn_football.pt         # PyTorch Neural Network
└── training_metadata.json # Training info
```

### To use in FootyPredict Pro:
1. Download the `footypredict_models.zip` file
2. Extract to your project folder: `soccer/models/trained/`
3. Restart the app
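
After extracting the archive, it can be useful to confirm that all five files arrived before restarting the app. The following is a minimal sketch using only the standard library; the helper names `missing_model_files` and `load_metadata` are illustrative and are not part of FootyPredict Pro.

```python
import json
from pathlib import Path

# File names as written by this notebook
EXPECTED_FILES = [
    "xgb_football.json",
    "lgb_football.txt",
    "cat_football.cbm",
    "nn_football.pt",
    "training_metadata.json",
]


def missing_model_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_metadata(model_dir):
    """Read the training metadata saved alongside the models."""
    with open(Path(model_dir) / "training_metadata.json") as f:
        return json.load(f)


# Lists whatever is missing from the extraction folder; empty when complete
print(missing_model_files("soccer/models/trained"))
```

The `features` list in the metadata records the column order used for training, so inference code should presumably build its inputs in that same order.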

---
*FootyPredict Pro v3.0 | AI-Powered Football Predictions*