In [1]:
import pandas as pd
import os
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')



In [3]:
data = pd.read_csv('cleaned_football_data.csv')

data

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,AvgA,PSCH,PSCD,PSCA,Season,League,LBH,LBD,LBA,HomeWin
0,E0,09/08/2019,Liverpool,Norwich,4,1,H,4.0,0.0,H,...,19.830000,1.14,10.43,19.63,2019,Premier League,2.953658,4.099487,5.155421,1
1,E0,10/08/2019,West Ham,Man City,0,5,A,0.0,1.0,A,...,1.250000,11.11,6.68,1.27,2019,Premier League,2.953658,4.099487,5.155421,0
2,E0,10/08/2019,Bournemouth,Sheffield United,1,1,D,0.0,0.0,D,...,3.830000,1.98,3.67,4.06,2019,Premier League,2.953658,4.099487,5.155421,0
3,E0,10/08/2019,Burnley,Southampton,3,0,H,0.0,0.0,D,...,2.780000,2.71,3.19,2.90,2019,Premier League,2.953658,4.099487,5.155421,1
4,E0,10/08/2019,Crystal Palace,Everton,0,0,D,0.0,0.0,D,...,2.400000,3.37,3.45,2.27,2019,Premier League,2.953658,4.099487,5.155421,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6075,I1,20/05/18,Milan,Fiorentina,5,1,H,2.0,1.0,H,...,4.277143,1.66,4.31,5.23,2017,Serie A,1.800000,3.800000,4.000000,1
6076,I1,20/05/18,Napoli,Crotone,2,1,H,2.0,0.0,H,...,5.790000,1.41,5.55,7.13,2017,Serie A,1.440000,4.600000,6.500000,1
6077,I1,20/05/18,Sassuolo,Roma,0,1,A,0.0,1.0,A,...,1.850000,4.84,3.85,1.78,2017,Serie A,3.800000,3.800000,1.850000,0
6078,I1,20/05/18,Spal,Sampdoria,3,1,H,1.0,0.0,H,...,5.512857,1.85,3.78,4.56,2017,Serie A,1.610000,4.000000,5.200000,1



---

# Regression 2: Predicting **Home Team Goals**

This regression aims to estimate the **expected number of goals scored by the home team**, which is useful for:
- **Team analysis**: Identifying key drivers of offensive performance.
- **Match preparation**: Highlighting factors impacting scoring.

---

### Proposed Model Components:

1. **Likelihood**:
   - The number of goals scored by the home team is **discrete** and non-negative, so we use a **Poisson likelihood**, commonly applied for modeling count data:
     $$
     y_i \sim \text{Poisson}(\lambda_i)
     $$
     where $\lambda_i$ is the expected goals scored by the home team for match $i$.

2. **Predictor**:
   - $\lambda_i$ (the expected home team goals) depends on several predictors:
     - **Home Shots** ($x_{1i}$).
     - **Home Shots on Target** ($x_{2i}$).
     - **Away Shots** ($x_{3i}$).
     - **Away Shots on Target** ($x_{4i}$).
     - **Fouls Committed** (home team $x_{5i}$, away team $x_{6i}$).
   - A log link relates $\lambda_i$ to a linear combination of the predictors, which keeps $\lambda_i$ positive:
     $$
     \log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \beta_6 x_{6i}
     $$
     where:
     - $\beta_0$: Intercept.
     - $\beta_1, \beta_2, \dots, \beta_6$: Coefficients for the predictors.

3. **Priors**:
   - Weakly informative priors are placed on the intercept and on every coefficient:
     $$
     \beta_0, \beta_1, \dots, \beta_6 \sim \mathcal{N}(0, 5)
     $$
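
To make the three components concrete, here is a minimal standard-library sketch of the quantity a sampler would explore: the unnormalized log posterior, i.e. log prior plus Poisson log likelihood under the log link. It is an illustration, not the notebook's fitting code. It assumes the `5` in $\mathcal{N}(0, 5)$ is a standard deviation, and the two rows in `X` are made-up values in the order $x_1, \dots, x_6$, not rows from the dataset.

```python
import math

def log_normal_pdf(value, mean=0.0, sd=5.0):
    """Log density of Normal(mean, sd) at `value` (sd assumed to be a standard deviation)."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (value - mean) ** 2 / (2 * sd ** 2)

def expected_goals(beta0, betas, x):
    """Inverse log link: lambda_i = exp(beta0 + sum_j beta_j * x_ji)."""
    return math.exp(beta0 + sum(b * v for b, v in zip(betas, x)))

def poisson_log_pmf(y, lam):
    """Log Poisson probability of observing y goals given rate lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def log_posterior(beta0, betas, X, y, prior_sd=5.0):
    """Unnormalized log posterior = log prior + log likelihood."""
    log_prior = sum(log_normal_pdf(b, 0.0, prior_sd) for b in [beta0, *betas])
    log_lik = sum(
        poisson_log_pmf(y_i, expected_goals(beta0, betas, x_i))
        for x_i, y_i in zip(X, y)
    )
    return log_prior + log_lik

# Made-up example rows (NOT from the dataset): home shots, home shots on target,
# away shots, away shots on target, home fouls, away fouls.
X = [[15, 7, 9, 3, 11, 12],
     [8, 2, 14, 6, 13, 10]]
y = [3, 0]  # made-up home goal counts

# With all parameters at zero, every match has lambda_i = exp(0) = 1.
print(log_posterior(0.0, [0.0] * 6, X, y))
```

In practice a probabilistic programming library evaluates and samples from this same density; the sketch only shows how the likelihood, predictor, and priors combine.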

---

### Why Is This Useful?

1. **Insights into Offensive Performance**:
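   - Each coefficient $\beta_j$ quantifies how a performance indicator relates to goals scored: a one-unit increase in $x_j$ multiplies the expected home goals by $e^{\beta_j}$, holding the other predictors fixed.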

2. **Predictive Power**:
   - It can forecast home team goals, aiding match analysis and planning.

3. **Scalability**:
   - The linear predictor can be extended with additional predictors (e.g., weather, crowd size).
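
As a rough sketch of how fitted coefficients would be used for interpretation and forecasting, the snippet below turns coefficients into rate ratios and an expected goal count. All coefficient values, predictor names, and the example match are hypothetical placeholders chosen for illustration; they are not results from this dataset.

```python
import math

# Hypothetical coefficient values, for illustration only (not fitted results).
BETA0 = -0.3
BETAS = {
    "home_shots": 0.01,
    "home_shots_on_target": 0.12,
    "away_shots": -0.01,
    "away_shots_on_target": -0.02,
    "home_fouls": 0.0,
    "away_fouls": 0.0,
}

def predict_home_goals(match, beta0=BETA0, betas=BETAS):
    """Expected home goals: exp(beta0 + sum_j beta_j * x_j)."""
    return math.exp(beta0 + sum(betas[name] * match[name] for name in betas))

def rate_ratios(betas=BETAS):
    """exp(beta_j): multiplicative change in expected goals per one-unit increase."""
    return {name: math.exp(b) for name, b in betas.items()}

def prob_at_least_one_goal(lam):
    """Under a Poisson model, P(y >= 1) = 1 - exp(-lam)."""
    return 1.0 - math.exp(-lam)

# Made-up match statistics (not a row from the dataset).
match = {
    "home_shots": 15,
    "home_shots_on_target": 7,
    "away_shots": 9,
    "away_shots_on_target": 3,
    "home_fouls": 11,
    "away_fouls": 12,
}

lam = predict_home_goals(match)
print(f"expected home goals: {lam:.2f}")
print(f"P(at least one home goal): {prob_at_least_one_goal(lam):.2f}")
for name, ratio in rate_ratios().items():
    print(f"{name}: x{ratio:.3f} per unit")
```

A rate ratio above 1 means the predictor is associated with more expected goals, below 1 with fewer. With a fitted Bayesian model, the same calculation would be repeated over posterior draws to obtain uncertainty intervals rather than a single number.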