# ⚽ **HOL: Eredivisie Prediction** 🥇
### Notebook - Data Ingestion - 1/4

---


### What We'll Do:
1. -> **Data Ingestion**: Fetch Eredivisie data from the GitHub repository.
2. **Data Transformation**: Utilize Snowpark DataFrames for data preparation and analysis.
3. **Model Training**: Train a model and store it in the Snowflake Model Registry.
4. **Prediction**: Predict who is going to win the Eredivisie 2024/2025.

![Eredivisie](https://64.media.tumblr.com/a1d3bc556ab9711ed0b57c02aa1264a8/874fef2692632b47-f4/s540x810/48bf693b3002d223526fe26b8dba3d3db88023d1.gif)


⚠️ *If you are using a Snowflake Trial account, run `setup_trial.sql` and upload the dataset to the stage manually. Once that is done, you can move directly to the data transformation notebook, or jump to the `"Verify_Data_Loaded_1"` cell in this notebook.* Before doing so, open the **Packages** tab at the top right, add the `plotly` library, and click the **Start** button.

### Setup

Before using this notebook, ensure that you have created the following objects by running the `setup.sql` script in a worksheet:

- **Database**: `EREDIVISIE_PREDICTION`
- **Schema**: `RAW_DATA`
- **Warehouse**: `EREDIVISIE_PREDICTION_WH`
- **Network Rule**: `GITHUB_NETWORK_RULE`
- **External Access Integration**: `GITHUB_EXTERNAL_ACCESS_INTEGRATION`

The first three items are required as you will need to define the Database, Schema, and Warehouse when you import this notebook into the Snowflake UI.

For the **Network Rule** and **External Access Integration**, once created, follow these steps to make them available within this notebook:

1. **Click on Notebook Settings** (located at the top right of the worksheet screen).
2. **Select the External Access Tab**.
3. **Enable** `GITHUB_EXTERNAL_ACCESS_INTEGRATION` from the list.
4. Open the **Packages** tab at the top right and add the `plotly` library. This is unrelated to external access, but we'll need it at the end of the notebook.
5. **Reload the Notebook**. Once reloaded, you will have access to the GitHub URL directly from this notebook.

With these configurations in place, you’ll be ready to extract and work with the dataset from the external GitHub URL in the following cells.

---


## Step 1: Data Loading
---

In this notebook, we use Python functions and an External Access Integration to load and analyze Eredivisie data from 1995 to 2023. Our data source is a GitHub repository, from which we fetch historical data and store it directly in our Snowflake account. No S3 buckets or local downloads are needed: the goal is to simplify the execution of this Hands-On Lab (HOL) while showcasing the extensive capabilities of Snowflake!

The next cells will perform the following actions:

- Create a function to fetch data from an external API / GitHub
- Create tables automatically
- Verify the loaded data

Let's get it done! 

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import Session

session = get_active_session()

In [None]:
--Note: For this Hands-On Lab (HOL), we are not creating ad hoc roles and users to minimize prerequisites and simplify setup.

USE ROLE ACCOUNTADMIN;

In [None]:
import requests
import pandas as pd
from snowflake.snowpark import DataFrame
from io import StringIO

def fetch_dataset_from_github(url: str) -> DataFrame:
    # Fetch the CSV data from the URL
    response = requests.get(url)
    if response.status_code == 200:
        # Decode the content and read into a Pandas DataFrame
        csv_data = response.content.decode('utf-8')
        csv_file = StringIO(csv_data)
        pandas_df = pd.read_csv(csv_file)
        
        # Convert Pandas DataFrame to Snowpark DataFrame
        return session.create_dataframe(pandas_df)
    else:
        raise Exception(f"Failed to fetch CSV: {response.status_code} - {response.text}")
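
The decode-and-parse step inside `fetch_dataset_from_github` can be tried without network access or External Access. The sketch below is a minimal illustration, assuming the standard-library `csv` module as a stand-in for `pd.read_csv`. The helper name `parse_csv_bytes`, the header, and the sample row are hypothetical and not taken from the lab's dataset.

```python
import csv
from io import StringIO


def parse_csv_bytes(content: bytes) -> list[dict]:
    """Decode raw CSV bytes as UTF-8 and return one dict per data row."""
    text = content.decode("utf-8")
    reader = csv.DictReader(StringIO(text))
    return [dict(row) for row in reader]


# Hypothetical payload shaped like an HTTP response body (not real match data).
sample_body = b"Date,HomeTeam,AwayTeam,FTR\n01/09/2000,Team A,Team B,H\n"
rows = parse_csv_bytes(sample_body)
print(rows[0]["HomeTeam"])
```

Keeping the parsing separate from the HTTP call makes it easier to tell a malformed file apart from a blocked or failed request.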


In [None]:
# Once an updated file is in the GitHub dataset repo, it's enough to run this cell to reload the new datasets.
from snowflake.snowpark.functions import col

# Function to rename columns to uppercase
def rename_columns_to_uppercase(df):
    # Generate a list of columns with uppercase names
    new_columns = [col(c).alias(c.upper()) for c in df.columns]
    # Select columns with new names
    return df.select(*new_columns)

# Base URL and list of files
url_base = 'https://github.com/sfc-gh-mconsoli/eredivisie_prediction/raw/main/dataset/'
url_files = [
    'eredivisie_history.csv',
    'eredivisie_fixture.csv'
]

# Loop through each URL
for url in url_files:
    # Get Snowpark DataFrame from the URL
    df = fetch_dataset_from_github(url_base + url)

    # Extract table name from URL
    table_name = url.split('/')[-1].replace('.csv', '').upper()

    # Drop the table if it exists
    session.sql(f"DROP TABLE IF EXISTS {table_name}").collect()

    # Convert column names to uppercase
    df = rename_columns_to_uppercase(df)

    # Create table and insert data from Snowpark DataFrame
    df.write.save_as_table(table_name, mode='overwrite')

    print(f"Table {table_name} created and data loaded successfully.")
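
The naming rules used by the loading loop can be checked in plain Python before touching Snowflake. This is only a sketch: `table_name_from_file` and `uppercase_columns` are hypothetical helper names, and the header passed to `uppercase_columns` is an assumed example rather than the real CSV header.

```python
def table_name_from_file(file_name: str) -> str:
    """Mirror the loop's naming rule: drop any path, strip '.csv', uppercase."""
    return file_name.split("/")[-1].replace(".csv", "").upper()


def uppercase_columns(columns: list[str]) -> list[str]:
    """Plain-Python analogue of aliasing every column to its uppercase name."""
    return [c.upper() for c in columns]


for name in ["eredivisie_history.csv", "eredivisie_fixture.csv"]:
    print(name, "->", table_name_from_file(name))

# Hypothetical header, used only to illustrate the renaming.
print(uppercase_columns(["Date", "HomeTeam", "FTR"]))
```

Uppercasing matters because Snowflake resolves unquoted identifiers as uppercase, so uppercase column names can be referenced in SQL without double quotes.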


In [None]:
# Let's check loaded data
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import Session
import streamlit as st

session = get_active_session()

eredivisie_history_df = session.table('EREDIVISIE_HISTORY')

st.dataframe(eredivisie_history_df)

In [None]:
-- Verify we loaded data for all seasons (match counts per calendar year)
SELECT YEAR(TO_DATE("DATE", 'DD/MM/YYYY')) AS MATCH_YEAR, COUNT(*) AS MATCHES
FROM EREDIVISIE_HISTORY
GROUP BY YEAR(TO_DATE("DATE", 'DD/MM/YYYY'))
ORDER BY YEAR(TO_DATE("DATE", 'DD/MM/YYYY')) DESC
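
The same per-year check can be sketched in plain Python, which shows how the `DD/MM/YYYY` strings are interpreted. The dates below are hypothetical and not taken from the dataset. `%d/%m/%Y` is the `strptime` counterpart of the SQL format string.

```python
from collections import Counter
from datetime import datetime

# Hypothetical DD/MM/YYYY strings, not taken from the dataset.
sample_dates = ["18/08/2000", "10/12/2000", "04/02/2001", "20/05/2001"]

# strptime's %d/%m/%Y corresponds to the SQL format string 'DD/MM/YYYY'.
matches_per_year = Counter(
    datetime.strptime(d, "%d/%m/%Y").year for d in sample_dates
)

# Newest year first, like ORDER BY ... DESC in the query.
for year, count in sorted(matches_per_year.items(), reverse=True):
    print(year, count)
```

Since a season runs across two calendar years, grouping by year splits each season in two, so the earliest and latest years in the result may show lower counts than the years in between.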

In [None]:
import streamlit as st
import snowflake.snowpark as sp
import snowflake.snowpark.functions as F
import plotly.express as px

# Filter home wins
home_wins = (
    eredivisie_history_df.filter(F.col('FTR') == 'H')
    .group_by('HOMETEAM')
    .agg(F.count('HOMETEAM').alias('Wins'))
)

# Filter away wins
away_wins = (
    eredivisie_history_df.filter(F.col('FTR') == 'A')
    .group_by('AWAYTEAM')
    .agg(F.count('AWAYTEAM').alias('Wins'))
)

# Rename columns to match for concatenation
home_wins = home_wins.select(F.col('HOMETEAM').alias('Team'), 'Wins')
away_wins = away_wins.select(F.col('AWAYTEAM').alias('Team'), 'Wins')

# Combine home and away wins
total_wins = home_wins.union_all(away_wins).group_by('Team').agg(F.sum('Wins').alias('Total_Wins'))

# Collect the result into a Pandas DataFrame
total_wins_pd = total_wins.to_pandas()
# Sort by most wins
total_wins_pd = total_wins_pd.sort_values(by='TOTAL_WINS', ascending=False)

# Helper that renders a Plotly bar chart in the notebook
def plot_bar_chart(df, x_col, y_col, title, labels):
    fig = px.bar(df, x=x_col, y=y_col, title=title, labels=labels)
    st.plotly_chart(fig)

# Plot the teams ranked by total number of wins
plot_bar_chart(
    total_wins_pd, 
    'TEAM', 
    'TOTAL_WINS', 
    'Top Teams by Number of Wins', 
    {'TEAM': 'Team', 'TOTAL_WINS': 'Number of Wins'}
)
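
The aggregation behind the chart (home wins plus away wins per team) can be illustrated in plain Python on a handful of rows. This is a sketch only: the team names and results are hypothetical, and using `'D'` for a draw is an assumption, since the cells above only filter on `'H'` and `'A'`.

```python
from collections import Counter

# Hypothetical results: (home team, away team, full-time result).
# 'H' and 'A' follow the filters above; 'D' for a draw is an assumption.
matches = [
    ("Team A", "Team B", "H"),
    ("Team B", "Team C", "A"),
    ("Team C", "Team A", "D"),
    ("Team A", "Team C", "A"),
]

total_wins = Counter()
for home, away, ftr in matches:
    if ftr == "H":
        total_wins[home] += 1   # home side won
    elif ftr == "A":
        total_wins[away] += 1   # away side won

# most_common() orders teams by win count, highest first.
for team, wins in total_wins.most_common():
    print(team, wins)
```

As in the Snowpark version, a team that never won does not appear in the result, because only winning rows are counted.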