# 01 - Data Collection for Ballon d'Or PR Analysis

## Project Goal
The primary objective of this project is to analyze the extent to which public relations (PR) and general public interest influence the final voting results of the Ballon d'Or award.

## Notebook Purpose
This notebook performs the initial data acquisition, merging, and feature engineering necessary for the analysis. It brings together three main categories of data:

1.  **Ballon d'Or Results:** Scraped from Wikipedia (final rank, player, nationality, position, club, points).
2.  **Achievements & Ratings (Manual Data):** Manually collected collective trophies, individual achievements, and the average **Sofa Score** rating for each major tournament ("Other Rating" is the average across the remaining tournaments).
3.  **PR Metrics (Scraped Data):** Two external metrics that proxy player visibility and public buzz: **Google Trends** (search interest) and **Wikipedia Page Views** (article views).

## Data Sources
* **Ballon d'Or Results:** Wikipedia
* **Achievements & Ratings:** Manual CSV files, one per year (`../data/<year>/achievements_and_ratings.csv`)
* **PR Metrics:** Google Trends API (`pytrends`) and Wikimedia Pageviews API

## 1. Data Acquisition: Ballon d'Or Results & Achievements

### 1.1 Scraping Ballon d'Or Results from Wikipedia

In [1]:
import pandas as pd
import requests
from io import StringIO
from pytrends.request import TrendReq
import time
import random
import os

# Define a standard header to mimic a web browser request and avoid being blocked
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124"}

# Initialize a dictionary to store the scraped Ballon d'Or data for each year
ballondor_df = {}

# Loop through the years for which we want to collect data
for year in range(2022, 2026):
    # Construct the Wikipedia URL for the specific year's award
    url = f"https://en.wikipedia.org/wiki/{year}_Ballon_d%27Or"
    
    # Request the HTML content of the page; fail fast on HTTP errors or a hung connection
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    html = response.text
    
    # Use pandas to automatically find and parse tables from the HTML
    # [2] is typically the index for the main results table on these pages
    table = pd.read_html(StringIO(html))[2]
    
    # Clean the 'Club' column: remove bracketed citations (e.g., "Real Madrid[3]")
    table['Club'] = table['Club'].str.split('[').str[0]
    
    # Store the cleaned DataFrame in the dictionary, keyed by year
    ballondor_df[year] = table
    
print("Successfully scraped Ballon d'Or results for years 2022 through 2025.")

Successfully scraped Ballon d'Or results for years 2022 through 2025.
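
The hard-coded index `[2]` can pick the wrong table if a page's layout differs from year to year. The following is a minimal sketch of a more defensive selection, assuming the results table is the first one that contains the `Rank`, `Player`, and `Points` columns. `find_results_table` is a hypothetical helper, not part of the original notebook, and the toy DataFrames stand in for the list returned by `pd.read_html`.

```python
import pandas as pd

def find_results_table(tables, required=("Rank", "Player", "Points")):
    """Return the first DataFrame whose columns include every name in `required`."""
    for table in tables:
        # Column labels are not always plain strings, so normalise them first
        columns = {str(col) for col in table.columns}
        if set(required) <= columns:
            return table
    raise ValueError(f"No table with columns {required} found")

# Toy stand-ins for the list of tables parsed from one page
tables = [
    pd.DataFrame({"Infobox": ["Date", "Venue"]}),
    pd.DataFrame({"Rank": [1, 2], "Player": ["A", "B"], "Points": [549, 193]}),
]
results = find_results_table(tables)
print(results.columns.tolist())
```

In the scraping loop, this could replace the `[2]` lookup with `find_results_table(pd.read_html(StringIO(html)))`, so that a layout change raises an error instead of silently returning the wrong table.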


### 1.2 Preview Scraped Wikipedia Data

In [2]:
# Display the first 5 rows of the 2022 scraped data to verify columns and cleaning
ballondor_df[2022].head()

Unnamed: 0,Rank,Player,Nationality,Position,Club,Points
0,1,Karim Benzema,France,Forward,Real Madrid,549
1,2,Sadio Mané,Senegal,Forward,Liverpool,193
2,3,Kevin De Bruyne,Belgium,Midfielder,Manchester City,175
3,4,Robert Lewandowski,Poland,Forward,Bayern Munich,170
4,5,Mohamed Salah,Egypt,Forward,Liverpool,116


### 1.3 Merge Scraped Data with Achievements Data

*Note: This step incorporates the manually collected achievements, collective trophies, and the ratings from Sofa Score.*

In [3]:
def create_dataframe(year):
    """
    Merges the scraped Ballon d'Or results with the manually collected 
    achievements and ratings data for a specific year.

    Args:
        year (int): The year of the data to process.

    Returns:
        pd.DataFrame: A combined DataFrame for the specified year.
    """
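    # Retrieve the scraped Ballon d'Or table for the year.
    # Work on a copy so that re-running this cell does not try to insert 'Year' twice
    # (DataFrame.insert raises ValueError if the column already exists).
    year_data = ballondor_df[year].copy()
    
    # Add a 'Year' column to the scraped data for easy identification
    year_data.insert(0, 'Year', year)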

    # Load the manually collected achievements and ratings data
    # NOTE: This file is expected to be ordered the same as the scraped data
    try:
        achievements_and_ratings_data = pd.read_csv(f'../data/{str(year)}/achievements_and_ratings.csv')
    except FileNotFoundError:
        print(f"ERROR: Manual data file not found for year {year}. Skipping merge.")
        return year_data
        
    # Drop redundant columns that already exist in the scraped table (or are an old index)
    columns_to_drop = ['Unnamed: 0', 'Player', 'Nationality', 'Position', 'Club']
    achievements_and_ratings_data = achievements_and_ratings_data.drop(columns_to_drop, axis=1)
    
    # Concatenate the two DataFrames horizontally (axis=1). 
    # This relies on the manual data file being in the exact same player order.
    combined_df = pd.concat((year_data, achievements_and_ratings_data), axis=1)
    
    return combined_df

# Initialize a dictionary to store the combined DataFrames
combined_df = {}

# Execute the merging function for all years
for year in range(2022, 2026):
    combined_df[year] = create_dataframe(year)

print("Successfully merged scraped data with manual achievements data for all years.")

Successfully merged scraped data with manual achievements data for all years.
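
Because `create_dataframe` concatenates by position, a manual CSV in a different player order would produce misaligned rows without any error. The sketch below shows one way to check the order before concatenating, assuming the manual file still carries its `Player` column at that point (the notebook drops it afterwards). `assert_same_player_order` is a hypothetical helper, and the small DataFrames are illustrative.

```python
import pandas as pd

def assert_same_player_order(scraped, manual, column="Player"):
    """Raise if the two tables do not list the same players in the same row order."""
    left = scraped[column].reset_index(drop=True)
    right = manual[column].reset_index(drop=True)
    if len(left) != len(right):
        raise ValueError(f"Row count differs: {len(left)} vs {len(right)}")
    mismatched = left[left != right]
    if not mismatched.empty:
        raise ValueError(f"Order differs at rows {mismatched.index.tolist()}")

scraped = pd.DataFrame({"Player": ["Karim Benzema", "Sadio Mané", "Kevin De Bruyne"]})
aligned = pd.DataFrame({"Player": ["Karim Benzema", "Sadio Mané", "Kevin De Bruyne"]})
shuffled = pd.DataFrame({"Player": ["Sadio Mané", "Karim Benzema", "Kevin De Bruyne"]})

assert_same_player_order(scraped, aligned)  # passes silently
try:
    assert_same_player_order(scraped, shuffled)
except ValueError as err:
    message = str(err)
    print(message)
```

Spelling differences between the two sources (accents, abbreviated names) would also trigger the error, which is usually the desired behaviour for a manually maintained file.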


In [4]:
# Display the first 5 rows of the combined 2022 DataFrame
# Note the addition of the achievement columns from the CSV
combined_df[2022].head()

Unnamed: 0,Year,Rank,Player,Nationality,Position,Club,Points,League,League Rating,League Top Scorer,...,Continental Cup Top Assist Provider,Continental Cup Best Goalkeeper,Continental Cup Best Player,WC,WC Rating,WC Top Scorer,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player,Other Rating
0,2022,1,Karim Benzema,France,Forward,Real Madrid,549,1,7.73,1,...,,,,,,,,,,7.46
1,2022,2,Sadio Mané,Senegal,Forward,Liverpool,193,0,7.22,0,...,0.0,0.0,1.0,,,,,,,7.56
2,2022,3,Kevin De Bruyne,Belgium,Midfielder,Manchester City,175,1,7.78,0,...,,,,,,,,,,7.85
3,2022,4,Robert Lewandowski,Poland,Forward,Bayern Munich,170,1,7.47,1,...,,,,,,,,,,7.69
4,2022,5,Mohamed Salah,Egypt,Forward,Liverpool,116,0,7.14,1,...,0.0,0.0,0.0,,,,,,,7.11


### 1.4 Consolidate and Inspect Master Data

In [5]:
# Vertically stack all the yearly DataFrames from the dictionary into one master DataFrame
concatenated_df = pd.concat((x for x in combined_df.values()), axis=0).reset_index(drop=True)

print(f"Final combined DataFrame created with {len(concatenated_df)} rows.")

Final combined DataFrame created with 120 rows.


In [6]:
# Display a summary of the master DataFrame to check for missing values (Continental Cup/WC data)
# and ensure data types are correct before further processing
concatenated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 32 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Year                                 120 non-null    int64  
 1   Rank                                 120 non-null    int64  
 2   Player                               120 non-null    object 
 3   Nationality                          120 non-null    object 
 4   Position                             120 non-null    object 
 5   Club                                 120 non-null    object 
 6   Points                               120 non-null    int64  
 7   League                               120 non-null    int64  
 8   League Rating                        120 non-null    float64
 9   League Top Scorer                    120 non-null    int64  
 10  League Top Assist Provider           120 non-null    int64  
 11  League Best Goalkeeper          

### 1.5 Checkpoint: Save Intermediate Master File

In [7]:
# Save the master DataFrame containing Ballon d'Or results and achievements/ratings
# This acts as a checkpoint before adding the PR metrics
concatenated_df.to_csv('../data/ballondor_and_achievements_data.csv', index=False)

print("Intermediate data saved to '../data/ballondor_and_achievements_data.csv'")

Intermediate data saved to '../data/ballondor_and_achievements_data.csv'


## 2. PR Metric 1: Google Trends Data

This section uses Google Trends to measure public search interest, acting as a proxy for media visibility and PR success.

### 2.1 Setup: Player Lists and Ballon d'Or Periods

In [8]:
# Extract player names from the scraped data, grouped by year
players = {}
for year, dataset in ballondor_df.items():
    players[year] = [x for x in dataset['Player'].values]
    
# Define the specific date ranges (seasons) used for each Ballon d'Or cycle
ballondor_period = {
    2022 : {
        'start_date' : '2021-08-01',
        'end_date' : '2022-07-31'
    },
    2023 : {
        'start_date' : '2022-08-01',
        'end_date' : '2023-07-31'
    },
    2024 : {
        'start_date' : '2023-08-01',
        'end_date' : '2024-07-31'
    },
    # Note: the 2025 period ends on July 13, earlier than the usual July 31 cutoff
    2025 : {
        'start_date' : '2024-08-01',
        'end_date' : '2025-07-13'
    },
}

print("Player lists and Ballon d'Or periods defined.")

Player lists and Ballon d'Or periods defined.


### 2.2 Utility Functions for PyTrends API

In [9]:
def chunk_list(lst, size=5):
    """Generates chunks (batches) of a list with a specified size."""
    for i in range(0, len(lst), size):
        yield lst[i:i+size]

def safe_payload(pytrends, players, start_date, end_date):
    """
    Attempts to fetch Google Trends data with retry logic.
    Google Trends blocks requests after too many calls, so a loop is used 
    to pause and retry upon failure.
    """
    timeframe = f"{start_date} {end_date}"
    while True:
        try:
            # Build the request payload for the batch of players and time period
            pytrends.build_payload(players, timeframe=timeframe)
            # Fetch the interest over time data
            data = pytrends.interest_over_time()
            if not data.empty:
                return data
            # An empty result is treated like a failure: pause before retrying
            # instead of immediately hitting the API again
            print("Empty response. Retrying in 20 sec…")
        except Exception:
            # If blocked (most common exception), wait and retry
            print("Blocked. Retrying in 20 sec…")
        time.sleep(20)

def get_trends_for_players(players, start_date, end_date):
    """
    Main function to fetch Google Trends data in batches for a list of players.
    Applies a random delay between batches to further mitigate blocking.
    """
    # Initialize the pytrends API connection
    pytrends = TrendReq(hl="en-US", tz=360)
    final_df = pd.DataFrame()

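    # Google Trends allows a maximum of 5 keywords/players per request.
    # Caveat: each request is scaled 0-100 relative to the peak within that batch,
    # so raw values from different batches are not on a common scale.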
    for batch in chunk_list(players, 5):
        print("Processing batch:", batch)

        # Call the safe function to handle API blocking
        df = safe_payload(pytrends, batch, start_date, end_date)
        
        # Drop the 'isPartial' column which indicates if the last week is incomplete
        df = df.drop('isPartial', axis = 1)

        # Concatenate the time series results horizontally
        final_df = pd.concat([final_df, df], axis=1)

        # Delay to avoid Google Block (10 to 15 seconds)
        time.sleep(10 + random.random()*5)

    return final_df
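
`safe_payload` retries forever with a fixed 20-second pause, which can hang the notebook if the block never lifts. The sketch below shows a bounded alternative with exponential backoff and jitter. `retry_with_backoff`, its parameters, and the fake fetcher are all hypothetical names used for illustration; the `sleep` argument exists only so the demonstration runs without actually waiting.

```python
import random
import time

def retry_with_backoff(fetch, max_attempts=5, base_delay=20.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None result or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch()
            if result is not None:
                return result
        except Exception as exc:
            print(f"Attempt {attempt} failed: {type(exc).__name__}")
        if attempt < max_attempts:
            # Double the wait after each failure and add jitter to avoid a fixed rhythm
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 5)
            sleep(delay)
    raise RuntimeError(f"Gave up after {max_attempts} attempts")

# Demonstration with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return "data"

delays = []  # collects the requested delays instead of sleeping
outcome = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(outcome, len(delays))
```

To use this with pytrends, `fetch` could wrap the `build_payload` and `interest_over_time` calls and return `None` when the result is empty.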

### 2.3 Execute Data Collection

In [11]:
google_trend_df = {}

# Loop through each year/period to fetch the trends data
for year, players_list in players.items():
    print(f"[+] Processing Ballon d'Or {str(year)} period Google Trend data..")
    
    # Call the main function with the player list and date range
    google_trend_df[year] = get_trends_for_players(
        players = players_list, 
        start_date = ballondor_period[year]['start_date'],
        end_date = ballondor_period[year]['end_date']
    )
    
    print(f"[+] Finished Ballon d'Or {str(year)} period Google Trend data\n\n")

[+] Processing Ballon d'Or 2022 period Google Trend data..
Processing batch: ['Karim Benzema', 'Sadio Mané', 'Kevin De Bruyne', 'Robert Lewandowski', 'Mohamed Salah']
Blocked. Retrying in 20 sec…
Blocked. Retrying in 20 sec…
Blocked. Retrying in 20 sec…
Processing batch: ['Kylian Mbappé', 'Thibaut Courtois', 'Vinícius Júnior', 'Luka Modrić', 'Erling Haaland']
Processing batch: ['Son Heung-min', 'Riyad Mahrez', 'Sébastien Haller', 'Fabinho', 'Rafael Leão']
Processing batch: ['Virgil van Dijk', 'Casemiro', 'Luis Díaz', 'Dušan Vlahović', 'Cristiano Ronaldo']
Processing batch: ['Harry Kane', 'Trent Alexander-Arnold', 'Phil Foden', 'Bernardo Silva', 'João Cancelo']
Processing batch: ['Joshua Kimmich', 'Mike Maignan', 'Christopher Nkunku', 'Darwin Núñez', 'Antonio Rüdiger']
[+] Finished Ballon d'Or 2022 period Google Trend data


[+] Processing Ballon d'Or 2023 period Google Trend data..
Processing batch: ['Lionel Messi', 'Erling Haaland', 'Kylian Mbappé', 'Kevin De Bruyne', 'Rodri']
Blocked

### 2.4 Analyze Sample Trends Data (Descriptive Statistics)

In [12]:
# Display descriptive statistics for the 2022 Google Trends data
# This shows the mean, standard deviation, and quartiles of weekly search interest over the period
google_trend_df[2022].describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Karim Benzema,53.0,21.716981,18.964924,6.0,11.0,13.0,24.0,81.0
Sadio Mané,53.0,6.716981,6.143719,1.0,2.0,5.0,9.0,33.0
Kevin De Bruyne,53.0,11.509434,6.943793,6.0,8.0,9.0,12.0,41.0
Robert Lewandowski,53.0,22.660377,17.648211,8.0,12.0,16.0,26.0,100.0
Mohamed Salah,53.0,17.735849,8.788434,5.0,12.0,16.0,20.0,46.0
Kylian Mbappé,53.0,18.660377,12.41888,7.0,11.0,14.0,20.0,62.0
Thibaut Courtois,53.0,4.09434,5.946118,1.0,2.0,2.0,4.0,35.0
Vinícius Júnior,53.0,2.509434,1.866946,0.0,1.0,2.0,3.0,9.0
Luka Modrić,53.0,2.0,1.629063,1.0,1.0,1.0,2.0,8.0
Erling Haaland,53.0,19.641509,13.400767,9.0,14.0,16.0,21.0,100.0
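
Google Trends scales each request to a 0 to 100 range relative to the highest point within that request, which is why two players in the table above (one per batch of five) reach a maximum of 100. Values from different batches are therefore not directly comparable. One common workaround is to include the same anchor keyword in every batch and rescale each batch by it. The sketch below illustrates only the arithmetic, on made-up numbers; `rescale_to_reference` and the `Anchor` column are hypothetical, and the approach assumes each batch is re-queried with four players plus the anchor.

```python
import pandas as pd

def rescale_to_reference(reference, other, anchor):
    """Rescale `other` so its anchor column matches the anchor's mean level in `reference`."""
    # Assumes the anchor has non-zero interest in `other`; a zero mean would divide by zero
    factor = reference[anchor].mean() / other[anchor].mean()
    return other.drop(columns=anchor) * factor

# Made-up weekly values: the anchor looks 5x smaller in batch B because
# 'Player 2' dominates that request and compresses everything else
batch_a = pd.DataFrame({"Anchor": [50, 100], "Player 1": [25, 50]})
batch_b = pd.DataFrame({"Anchor": [10, 20], "Player 2": [100, 50]})

rescaled_b = rescale_to_reference(batch_a, batch_b, "Anchor")
combined = pd.concat([batch_a, rescaled_b], axis=1)
print(combined)
```

Since Google Trends reports rounded integers, an anchor with very low values in a batch would make the scaling factor imprecise, so a moderately popular anchor term tends to work better.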


### 2.5 Feature Engineering: Time-Weighted Trend Score (The Core PR Metric)

This metric assigns a **linear weight** to the weekly interest scores, so that search volume closer to the end of the period counts more. The intent is to better approximate PR influence in the run-up to the vote.
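
As a small worked example of the weighting: with weights 1, 2, ..., N, the score is the sum of weight times value divided by the sum of the weights. The sketch below uses plain Python and made-up three-week series to show that two players with the same plain average receive different scores depending on when the interest occurred. `time_weighted_average` is an illustrative name.

```python
def time_weighted_average(values):
    """Linearly weighted mean: the first value gets weight 1, the last gets weight N."""
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Both series have a plain mean of 20
rising = time_weighted_average([10, 20, 30])   # (10*1 + 20*2 + 30*3) / 6
falling = time_weighted_average([30, 20, 10])  # (30*1 + 20*2 + 10*3) / 6
print(round(rising, 3), round(falling, 3))
```

The rising series scores above 20 and the falling series below it, which is the intended effect of favouring late-period interest.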

In [16]:
def calculate_time_weighted_score(trends_df):
    """
    Calculates a single Time-Weighted Google Trend Score for each player.
    
    Weights are applied linearly, giving a weight of 1 to the first week and 
    N (total weeks) to the last week, prioritizing interest closer to the end of the period.
    
    Args:
        trends_df (pd.DataFrame): The DataFrame of weekly Google Trends interest over time.

    Returns:
        pd.DataFrame: DataFrame containing 'Player' and 'Time-Weighted Google Trend Score'.
    """
    # Exclude the 'isPartial' column
    data = trends_df.drop('isPartial', axis=1, errors='ignore')
    
    # 1. Determine the number of time periods (weeks), N
    N = len(data)
    
    # 2. Create the linear weight array (e.g., [1, 2, ..., N])
    # This Series is indexed by time to align with the trends_df rows (axis=0)
    weights = pd.Series(range(1, N + 1), index=data.index)
    
    # 3. Apply weights to the interest data: multiply each week's score by its weight.
    # With axis=0 the weights are aligned on the row (date) index, so every player's
    # column is scaled by the same weekly weight.
    weighted_data = data.multiply(weights, axis=0)
    
    # 4. Calculate the sum of the weighted interests (Numerator)
    numerator = weighted_data.sum()
    
    # 5. Calculate the sum of the weights (Denominator)
    denominator = weights.sum()
    
    # 6. Calculate the final time-weighted average score
    time_weighted_scores = (numerator / denominator).reset_index()
    time_weighted_scores.columns = ['Player', 'Time-Weighted Google Trend Score']
    
    return time_weighted_scores

# Initialize a DataFrame to store the final weighted scores
weighted_google_trend_score = pd.DataFrame()

# Calculate the time-weighted score for each year's trends data and concatenate them
for year, trends_df in google_trend_df.items():
    # Apply the new time-weighted function
    yearly_scores = calculate_time_weighted_score(trends_df)
    weighted_google_trend_score = pd.concat([weighted_google_trend_score, yearly_scores], axis=0)

# Reset the index of the final score DataFrame to prepare for merging
weighted_google_trend_score = weighted_google_trend_score.reset_index(drop=True)

print("Time-Weighted Google Trend Scores calculated and ready for merge.")

Time-Weighted Google Trend Scores calculated and ready for merge.


### 2.6 Preview Top Players by Time-Weighted Score

In [23]:
# Display the players with the highest time-weighted average public interest/buzz
weighted_google_trend_score.nlargest(10, 'Time-Weighted Google Trend Score')

Unnamed: 0,Player,Time-Weighted Google Trend Score
112,Jude Bellingham,59.535445
91,Lamine Yamal,41.08371
100,Pedri,34.626697
106,Robert Lewandowski,33.938914
3,Robert Lewandowski,28.143955
13,Fabinho,27.73515
102,Harry Kane,27.03092
62,Jude Bellingham,26.640112
0,Karim Benzema,25.183089
19,Cristiano Ronaldo,23.122292


### 2.7 Merge Trend Score to Master DataFrame

In [24]:
# Merge the newly calculated score column to the main combined DataFrame
# Note: This relies on the row order remaining consistent, as we merge by index (axis=1)
concatenated_df = pd.concat((concatenated_df, weighted_google_trend_score['Time-Weighted Google Trend Score']), axis=1)

print("Weighted Google Trend Score successfully merged.")

Weighted Google Trend Score successfully merged.
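
The positional concatenation above works only while both frames keep exactly the same row order. A key-based merge is a sturdier alternative, and since the same player can appear in several years (Robert Lewandowski appears in both the 2022 and 2025 lists), the key needs to include the year as well as the name. The sketch below assumes the yearly scores are still available per year; `master` and `scores_by_year` are illustrative stand-ins, with score values taken from the preview table.

```python
import pandas as pd

# Stand-in for the master DataFrame (only the key columns are shown)
master = pd.DataFrame({
    "Year": [2022, 2022, 2025],
    "Player": ["Karim Benzema", "Robert Lewandowski", "Robert Lewandowski"],
})

# Stand-in for the per-year score tables, deliberately in a different row order
score_col = "Time-Weighted Google Trend Score"
scores_by_year = {
    2022: pd.DataFrame({"Player": ["Robert Lewandowski", "Karim Benzema"],
                        score_col: [28.143955, 25.183089]}),
    2025: pd.DataFrame({"Player": ["Robert Lewandowski"],
                        score_col: [33.938914]}),
}

# Tag each yearly table with its year before stacking them
scores = pd.concat(
    [df.assign(Year=year) for year, df in scores_by_year.items()],
    ignore_index=True,
)

# validate="one_to_one" raises if a (Year, Player) pair is duplicated on either side
merged = master.merge(scores, on=["Year", "Player"], how="left", validate="one_to_one")
print(merged)
```

A left merge keeps the master row order, and any player without a score shows up as `NaN` rather than shifting the rows below it.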


In [25]:
# Check the head of the DataFrame to confirm the new column is present
concatenated_df.head()

Unnamed: 0,Year,Rank,Player,Nationality,Position,Club,Points,League,League Rating,League Top Scorer,...,Continental Cup Best Goalkeeper,Continental Cup Best Player,WC,WC Rating,WC Top Scorer,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player,Other Rating,Time-Weighted Google Trend Score
0,2022,1,Karim Benzema,France,Forward,Real Madrid,549,1,7.73,1,...,,,,,,,,,7.46,25.183089
1,2022,2,Sadio Mané,Senegal,Forward,Liverpool,193,0,7.22,0,...,0.0,1.0,,,,,,,7.56,8.700908
2,2022,3,Kevin De Bruyne,Belgium,Midfielder,Manchester City,175,1,7.78,0,...,,,,,,,,,7.85,12.795947
3,2022,4,Robert Lewandowski,Poland,Forward,Bayern Munich,170,1,7.47,1,...,,,,,,,,,7.69,28.143955
4,2022,5,Mohamed Salah,Egypt,Forward,Liverpool,116,0,7.14,1,...,0.0,0.0,,,,,,,7.11,18.823201


## 3. PR Metric 2: Wikipedia Page Views

Wikipedia page views act as a secondary proxy for public interest and player visibility.

### 3.1 Function to Scrape Monthly Page Views

In [38]:
def get_wiki_pageviews(player, start_date, end_date):
    """
    Fetches the total Wikipedia page views for a player within a specified period.
    This serves as a secondary, external measure of public/media interest and PR success.

    Args:
        player (str): The player's name (used as the Wikipedia article title).
        start_date (str): The start date in 'YYYY-MM-DD' format (hyphens are stripped to YYYYMMDD for the API).
        end_date (str): The end date in 'YYYY-MM-DD' format (hyphens are stripped to YYYYMMDD for the API).

    Returns:
        int: The sum of monthly views during the period, or 0 if data is unavailable.
    """
    # Replace spaces with underscores for the Wikipedia article title
    title = player.replace(" ", "_")
    
    # Construct the Wikimedia Pageviews API URL for monthly views
    # The API expects start and end dates in YYYYMMDD format, so the hyphens are removed below
    url = (f"https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           f"en.wikipedia/all-access/all-agents/{title}/monthly/"
           f"{start_date.replace('-', '')}/{end_date.replace('-', '')}")

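    try:
        # A timeout keeps one unresponsive request from stalling the whole loop
        data = requests.get(url, headers=headers, timeout=30).json()
    except Exception:
        # Handle cases where the request fails or the response is not valid JSON
        return 0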

    if "items" not in data:
        # Handle cases where the API returns no data for the player/period
        return 0
        
    df = pd.DataFrame(data["items"])
    df["views"] = df["views"].astype(int)
    
    # Sum the monthly views to get the total views for the period
    return df["views"].sum()
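
Several player names contain accented characters, and the function above inserts the title into the URL without explicit encoding. The sketch below shows one way to build the same URL with the title percent-encoded and the dates validated, using only the standard library. `build_pageviews_url` is a hypothetical helper; the endpoint path is copied from the function above.

```python
from datetime import date
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def build_pageviews_url(player, start_date, end_date):
    """Build the monthly pageviews URL for a player and a 'YYYY-MM-DD' date range."""
    # Wikipedia titles use underscores; percent-encode everything else (accents, slashes)
    title = quote(player.replace(" ", "_"), safe="")
    # date.fromisoformat rejects malformed input before it reaches the API
    start = date.fromisoformat(start_date).strftime("%Y%m%d")
    end = date.fromisoformat(end_date).strftime("%Y%m%d")
    return f"{BASE}/en.wikipedia/all-access/all-agents/{title}/monthly/{start}/{end}"

url = build_pageviews_url("Sadio Mané", "2021-08-01", "2022-07-31")
print(url)
```

This only addresses URL construction. A player whose Wikipedia article title differs from the name in the results table would still need a manual title mapping.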

### 3.2 Execute Page View Data Collection

In [39]:
wiki_page_views_df = {}

# Loop through each year and its player list
for year, players_list in players.items():
    print(f"[+] Processing Ballon d'Or {str(year)} period Wiki page views data..")
    
    start = ballondor_period[year]['start_date']
    end = ballondor_period[year]['end_date']
    
    # Create a DataFrame for the year's results
    yearly_wiki_df = pd.DataFrame({
        'Player': players_list,
        # Apply the scraping function to get the total views for each player
        'Wiki Page Views': [get_wiki_pageviews(player, start, end) for player in players_list]
    })
    
    wiki_page_views_df[year] = yearly_wiki_df
    
    print(f"[+] Finished Ballon d'Or {str(year)} period Wiki page views data\n\n")

[+] Processing Ballon d'Or 2022 period Wiki page views data..
[+] Finished Ballon d'Or 2022 period Wiki page views data


[+] Processing Ballon d'Or 2023 period Wiki page views data..
[+] Finished Ballon d'Or 2023 period Wiki page views data


[+] Processing Ballon d'Or 2024 period Wiki page views data..
[+] Finished Ballon d'Or 2024 period Wiki page views data


[+] Processing Ballon d'Or 2025 period Wiki page views data..
[+] Finished Ballon d'Or 2025 period Wiki page views data




In [40]:
# Check the page views collected for the 2022 period
wiki_page_views_df[2022].head()

Unnamed: 0,Player,Wiki Page Views
0,Karim Benzema,3635167
1,Sadio Mané,2615768
2,Kevin De Bruyne,1645548
3,Robert Lewandowski,3785756
4,Mohamed Salah,4237344


### 3.3 Consolidate and Preview Wiki Data

In [41]:
# Vertically stack all the yearly Wiki Page Views DataFrames
concatenated_wiki_page_views = pd.concat((x for x in wiki_page_views_df.values())).reset_index(drop=True)

print("Concatenated all Wikipedia Page View DataFrames.")

Concatenated all Wikipedia Page View DataFrames.


In [42]:
# Display the top players in terms of absolute public interest/viewership
concatenated_wiki_page_views.nlargest(10, 'Wiki Page Views')

Unnamed: 0,Player,Wiki Page Views
30,Lionel Messi,32366142
19,Cristiano Ronaldo,19722980
32,Kylian Mbappé,16380331
31,Erling Haaland,13223396
62,Jude Bellingham,8630027
65,Kylian Mbappé,8242715
67,Lamine Yamal,7651971
91,Lamine Yamal,7154361
69,Harry Kane,6409902
64,Erling Haaland,5895104


## 4. Final Data Consolidation

### 4.1 Final Merge and Inspection

In [43]:
# Merge the final Wiki Page Views column to the master DataFrame
final_concatenated_df = pd.concat((concatenated_df, concatenated_wiki_page_views['Wiki Page Views']), axis=1)

print("Wikipedia Page Views successfully merged into the final DataFrame.")

Wikipedia Page Views successfully merged into the final DataFrame.


In [44]:
# Show the full final DataFrame with all features (Ballon d'Or, Achievements, PR Proxies)
final_concatenated_df

Unnamed: 0,Year,Rank,Player,Nationality,Position,Club,Points,League,League Rating,League Top Scorer,...,Continental Cup Best Player,WC,WC Rating,WC Top Scorer,WC Top Assist Provider,WC Best Goalkeeper,WC Best Player,Other Rating,Time-Weighted Google Trend Score,Wiki Page Views
0,2022,1,Karim Benzema,France,Forward,Real Madrid,549,1,7.73,1,...,,,,,,,,7.46,25.183089,3635167
1,2022,2,Sadio Mané,Senegal,Forward,Liverpool,193,0,7.22,0,...,1.0,,,,,,,7.56,8.700908,2615768
2,2022,3,Kevin De Bruyne,Belgium,Midfielder,Manchester City,175,1,7.78,0,...,,,,,,,,7.85,12.795947,1645548
3,2022,4,Robert Lewandowski,Poland,Forward,Bayern Munich,170,1,7.47,1,...,,,,,,,,7.69,28.143955,3785756
4,2022,5,Mohamed Salah,Egypt,Forward,Liverpool,116,0,7.14,1,...,0.0,,,,,,,7.11,18.823201,4237344
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,2025,26,Erling Haaland,Norway,Forward,Manchester City,18,0,7.27,0,...,,,,,,,,7.43,8.459276,3355395
116,2025,27,Declan Rice,England,Midfielder,Arsenal,13,0,7.36,0,...,,,,,,,,7.56,8.588989,1449663
117,2025,28,Virgil van Dijk,Netherlands,Defender,Liverpool,7,1,7.28,0,...,,,,,,,,7.46,4.224736,1557623
118,2025,29,Florian Wirtz,Germany,Midfielder,Bayer Leverkusen,5,0,7.68,0,...,,,,,,,,7.53,12.888386,1260511


### 4.2 Final Checkpoint: Save Complete Dataset

In [45]:
# Save the final, complete dataset to be used for the next steps of the project (analysis/modelling)
final_concatenated_df.to_csv('../data/final_ballondor_pr_data.csv', index=False)

print("Final dataset saved to '../data/final_ballondor_pr_data.csv'")

Final dataset saved to '../data/final_ballondor_pr_data.csv'


In [48]:
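# Quick sanity check: pairwise Pearson correlation (the pandas default) between points and the two PR proxies
final_concatenated_df[['Points', 'Time-Weighted Google Trend Score', 'Wiki Page Views']].corr()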

Unnamed: 0,Points,Time-Weighted Google Trend Score,Wiki Page Views
Points,1.0,0.044119,0.22797
Time-Weighted Google Trend Score,0.044119,1.0,0.208332
Wiki Page Views,0.22797,0.208332,1.0
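
Pearson correlation measures linear association and can be dominated by a few extreme values, which page-view counts tend to have. As a complementary check, a rank-based (Spearman) correlation can be computed with the same pandas method. The sketch below uses made-up numbers purely to show the call and the difference between the two measures; it makes no claim about the real dataset.

```python
import pandas as pd

# Illustrative values only: views grow monotonically with points, but not linearly
toy = pd.DataFrame({
    "Points": [5, 13, 116, 549],
    "Wiki Page Views": [1_000, 10_000, 100_000, 30_000_000],
})

pearson = toy.corr(method="pearson").loc["Points", "Wiki Page Views"]
spearman = toy.corr(method="spearman").loc["Points", "Wiki Page Views"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data, the same call would be `final_concatenated_df[[...]].corr(method="spearman")` with the three columns used above.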
