# ⚽ **HOL: Eredivisie Prediction** 🥇

---
## Step 1: Data Ingestion

Welcome to the **Eredivisie Prediction** Hands-On Lab (HOL)! In this notebook, we use Python functions and an External Access Integration to load and analyze Eredivisie data from 1995 to 2023. The data source is a GitHub repository, from which we fetch the historical data and store it directly in our Snowflake account. No S3 buckets or local downloads are needed: the goal is to keep this lab simple to run while showcasing the extensive capabilities of Snowflake!

### What We'll Do:
1. **Load Data**: Fetch Eredivisie data from the GitHub repository.
2. **Analyze Data**: Utilize Snowpark DataFrames for data preparation and analysis.
3. **Visualize Insights**: Build interactive dashboards with Streamlit for comprehensive analytics.

![Eredivisie logo](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Eredivisie_nieuw_logo_2017-.svg/640px-Eredivisie_nieuw_logo_2017-.svg.png)

---

Dive into the code below to start loading data!


### Setup

Before using this notebook, ensure that you have created the following objects by running the `setup.sql` script in a worksheet:

- **Database**: `EREDIVISIE_PREDICTION`
- **Schema**: `RAW_DATA`
- **Warehouse**: `EREDIVISIE_PREDICTION_WH`
- **Network Rule**: `GITHUB_NETWORK_RULE`
- **External Access Integration**: `GITHUB_EXTERNAL_ACCESS_INTEGRATION`

The first three items are required as you will need to define the Database, Schema, and Warehouse when you import this notebook into the Snowflake UI.

For the **Network Rule** and **External Access Integration**, once created, follow these steps to make them available within this notebook:

1. **Click on Notebook Settings** (located at the top right of the notebook screen).
2. **Select the External Access Tab**.
3. **Enable** `GITHUB_EXTERNAL_ACCESS_INTEGRATION` from the list.
4. **Reload the Notebook**. Once reloaded, you will have access to the GitHub URL directly from this notebook.

With these configurations in place, you’ll be ready to extract and work with the dataset from the external GitHub URL in the following cells.
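
For reference, the two access objects listed above could be created with statements along these lines. This is only a sketch: the contents of `setup.sql` are not shown here, so the host list and the helper name `build_access_setup_sql` are assumptions. Links of the form `github.com/.../raw/...` redirect to `raw.githubusercontent.com`, which is why both hosts appear in the rule.

```python
def build_access_setup_sql(
    rule_name: str = "GITHUB_NETWORK_RULE",
    integration_name: str = "GITHUB_EXTERNAL_ACCESS_INTEGRATION",
    hosts: tuple = ("github.com", "raw.githubusercontent.com"),
) -> list:
    """Return SQL for an egress network rule and its external access integration.

    Hypothetical helper: the host list is an assumption, not copied from setup.sql.
    """
    # Quote each host for the VALUE_LIST clause
    value_list = ", ".join(f"'{host}'" for host in hosts)
    network_rule = (
        f"CREATE OR REPLACE NETWORK RULE {rule_name} "
        f"MODE = EGRESS TYPE = HOST_PORT VALUE_LIST = ({value_list})"
    )
    integration = (
        f"CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION {integration_name} "
        f"ALLOWED_NETWORK_RULES = ({rule_name}) ENABLED = TRUE"
    )
    return [network_rule, integration]


# Print the statements so they can be pasted into a worksheet
for statement in build_access_setup_sql():
    print(statement + ";")
```

The printed statements can be run in a worksheet with a role that is allowed to create integrations.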

---


In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

In [None]:
--Note: For this Hands-On Lab (HOL), we are not creating ad hoc roles and users to minimize prerequisites and simplify setup.

USE ROLE ACCOUNTADMIN;

In [None]:
import requests
import pandas as pd
from snowflake.snowpark import DataFrame
from io import StringIO

def fetch_dataset_from_github(url: str) -> DataFrame:
    """Download a CSV file from `url` and return it as a Snowpark DataFrame."""
    # Fetch the CSV data from the URL; the timeout avoids hanging on a stalled connection
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        # Decode the content and read it into a Pandas DataFrame
        csv_data = response.content.decode('utf-8')
        csv_file = StringIO(csv_data)
        pandas_df = pd.read_csv(csv_file)

        # Convert Pandas DataFrame to Snowpark DataFrame
        return session.create_dataframe(pandas_df)
    else:
        raise Exception(f"Failed to fetch CSV: {response.status_code} - {response.text}")


In [None]:
# Once an updated file is in the GitHub dataset repo, it's enough to rerun this cell to reload the datasets.
from snowflake.snowpark.functions import col

# Function to rename columns to uppercase
def rename_columns_to_uppercase(df):
    # Generate a list of columns with uppercase names
    new_columns = [col(c).alias(c.upper()) for c in df.columns]
    # Select columns with new names
    return df.select(*new_columns)

# Base URL and list of files
url_base = 'https://github.com/sfc-gh-mconsoli/eredivisie_prediction/raw/main/dataset/'
url_files = [
    'eredivisie_history.csv',
    'eredivisie_fixture.csv'
]

# Loop through each file name
for file_name in url_files:
    # Get Snowpark DataFrame from the full URL
    df = fetch_dataset_from_github(url_base + file_name)

    # Derive the table name from the file name
    table_name = file_name.split('/')[-1].replace('.csv', '').upper()

    # Drop the table if it exists
    session.sql(f"DROP TABLE IF EXISTS {table_name}").collect()

    # Convert column names to uppercase
    df = rename_columns_to_uppercase(df)

    # Create the table and load the data from the Snowpark DataFrame
    df.write.save_as_table(table_name, mode='overwrite')

    print(f"Table {table_name} created and data loaded successfully.")
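
The table-naming and column-renaming logic of the cell above can be tried without a Snowflake session. A minimal sketch in plain Python: `table_name_from_file` and `uppercase_columns` are hypothetical helpers that mirror the cell's logic and are not part of Snowpark, and the column names used below are made up.

```python
def table_name_from_file(file_name: str) -> str:
    """Derive an uppercase table name from a CSV file name or path."""
    base = file_name.split("/")[-1]
    # Strip only a trailing ".csv" (slightly stricter than str.replace in the cell above)
    if base.lower().endswith(".csv"):
        base = base[: -len(".csv")]
    return base.upper()


def uppercase_columns(columns: list) -> list:
    """Return the column names in uppercase, preserving their order."""
    return [name.upper() for name in columns]


for file_name in ["eredivisie_history.csv", "eredivisie_fixture.csv"]:
    print(file_name, "->", table_name_from_file(file_name))

# Illustrative column names only; the real CSV headers may differ
print(uppercase_columns(["Date", "HomeTeam"]))
```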


In [None]:
# Let's check loaded data

session.table('EREDIVISIE_HISTORY').limit(51)

In [None]:
--NOTE: Some data may need transforming, or the CSV may contain issues. Count the rows per year to check.
-- The table name is unqualified so that it resolves to the notebook's current database and schema, where the tables were loaded.
SELECT YEAR(TO_DATE("DATE", 'DD/MM/YYYY')) AS MATCH_YEAR, COUNT(*) AS ROW_COUNT
FROM EREDIVISIE_HISTORY
GROUP BY MATCH_YEAR
ORDER BY MATCH_YEAR DESC
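
The per-year check above can also be prototyped in plain Python before querying the table. This is a sketch that assumes the `DATE` column holds strings such as `25/08/1995`. The fallback to a two-digit year (`DD/MM/YY`) is only a guess at one possible inconsistency. `parse_match_date`, `rows_per_year`, and the sample values are hypothetical.

```python
from collections import Counter
from datetime import date, datetime
from typing import Optional

# Formats tried in order; the two-digit-year fallback is an assumption
DATE_FORMATS = ("%d/%m/%Y", "%d/%m/%y")


def parse_match_date(raw: str) -> Optional[date]:
    """Return the parsed date, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def rows_per_year(raw_dates):
    """Count rows per year and collect the values that could not be parsed."""
    counts = Counter()
    unparsed = []
    for raw in raw_dates:
        parsed = parse_match_date(raw)
        if parsed is None:
            unparsed.append(raw)
        else:
            counts[parsed.year] += 1
    return counts, unparsed


# Made-up sample values, not taken from the dataset
sample = ["25/08/1995", "26/08/95", "12/05/2023", "not a date"]
counts, unparsed = rows_per_year(sample)
for year in sorted(counts, reverse=True):
    print(year, counts[year])
print("Unparsed:", unparsed)
```

Values that end up in `unparsed` are the ones that would need cleaning before `TO_DATE` can handle every row.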

In [None]:
import streamlit as st
import snowflake.snowpark as sp

# Let's get the number of matches won by each team in the last XYZ years
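
One way to compute this statistic is sketched here in plain Python so the logic can be checked without a session. It assumes each row has a `DATE` in `DD/MM/YYYY` form, `HOMETEAM` and `AWAYTEAM` columns, and a full-time result column `FTR` with the values `H`, `A`, or `D`. These column names are assumptions about the CSV, and the sample rows are made up.

```python
from collections import Counter
from datetime import datetime


def wins_per_team(rows, since_year: int) -> Counter:
    """Count wins per team for matches played in `since_year` or later."""
    wins = Counter()
    for row in rows:
        year = datetime.strptime(row["DATE"], "%d/%m/%Y").year
        if year < since_year:
            continue  # Match is older than the requested window
        if row["FTR"] == "H":
            wins[row["HOMETEAM"]] += 1  # Home win
        elif row["FTR"] == "A":
            wins[row["AWAYTEAM"]] += 1  # Away win
        # Draws ("D") count for neither team
    return wins


# Made-up rows with assumed column names, for illustration only
sample_rows = [
    {"DATE": "14/08/2021", "HOMETEAM": "Team A", "AWAYTEAM": "Team B", "FTR": "H"},
    {"DATE": "21/08/2021", "HOMETEAM": "Team B", "AWAYTEAM": "Team A", "FTR": "A"},
    {"DATE": "28/08/2021", "HOMETEAM": "Team A", "AWAYTEAM": "Team C", "FTR": "D"},
    {"DATE": "10/09/2005", "HOMETEAM": "Team C", "AWAYTEAM": "Team B", "FTR": "H"},
]

print(wins_per_team(sample_rows, since_year=2020).most_common())
```

The same aggregation could be expressed with Snowpark DataFrames once the real column names are confirmed.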


In [None]:
import plotly.graph_objects as go

# Draw a chart with statistics extracted above.