<a href="https://colab.research.google.com/github/ridhoakfa/Gold-Price-Prediction/blob/main/Copy_of_Predictive_Analytics_of_Gold_Market_Trends_via_XGB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
cvergnolle_gold_price_and_relevant_metrics_path = kagglehub.dataset_download('cvergnolle/gold-price-and-relevant-metrics')

print('Data source import complete.')


Downloading from https://www.kaggle.com/api/v1/datasets/download/cvergnolle/gold-price-and-relevant-metrics?dataset_version_number=1...


100%|██████████| 36.1k/36.1k [00:00<00:00, 25.5MB/s]

Extracting files...
Data source import complete.





<div style="text-align: center; background-color: #fffdf2; padding: 30px; border: 2px solid #d4af37; border-radius: 15px;">

<h1 style="color: #b8860b; font-size: 50px; font-family: 'Georgia', serif;">Gold Price Analytics</h1>
<h3 style="color: #5d4037; font-weight: normal;">A data-driven study utilizing Exploratory Data Analysis to identify market drivers and XGBoost Regression to forecast Gold prices</h3>
</div>







**Introduction**

Gold is a key global financial asset influenced by complex economic signals. This project uses the XGBoost Regressor to analyze these macroeconomic drivers and forecast the next day's gold price.

**Objective**

To build an end-to-end machine learning pipeline that cleans financial data, performs deep exploratory analysis, and provides a reliable price prediction model.



1. Essential Imports
2. Data Loading & Pre-processing
3. Data Cleaning
4. Train-Test Split
5. Model Training
6. Visualizing Results
7. Feature Importance
8. Model Accuracy Report
9. Conclusion

In [None]:
# Core financial + ML libraries
!pip install -q yfinance fredapi xgboost plotly kaleido


<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
1- Import Libraries
</div>

<p style="font-size:16px;">
We begin by importing the libraries used for data handling, visualization, and modeling.
</p>

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
from fredapi import Fred

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error

import datetime

sns.set_style("whitegrid")
plt.rcParams['figure.facecolor'] = 'white'

print("Environment ready.")


Environment ready.


SECTION 2 - API Keys (FRED)

Create a free API key:

https://fred.stlouisfed.org/docs/api/api_key.html

In [None]:
FRED_API_KEY = "YOUR_FRED_API_KEY_HERE"

fred = Fred(api_key=FRED_API_KEY)
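
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `FRED_API_KEY`. Both the variable name and the helper `load_fred_api_key` are illustrative and not part of `fredapi`:

```python
import os

# The placeholder string used in the cell above; treated as "no key set".
PLACEHOLDER = "YOUR_FRED_API_KEY_HERE"


def load_fred_api_key(env_var: str = "FRED_API_KEY") -> str:
    """Return the FRED API key from the environment (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your FRED API key."
        )
    return key


# Usage (assumes fredapi is installed, as in the pip cell above):
# from fredapi import Fred
# fred = Fred(api_key=load_fred_api_key())
```

With this in place, the key never appears in the saved notebook or its outputs.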



<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
2- 🔎 Data Loading & Pre-processing
</div>

<p style="font-size:16px;">
Importing raw financial records and verifying data integrity for predictive modeling.</p>


SECTION 3 - Data Source Mapping (OFFICIAL SOURCES)

| Variable     | Source        | Symbol   |
| ------------ | ------------- | -------- |
| Gold Price   | Yahoo Finance | GC=F     |
| Volume       | Yahoo Finance | GC=F     |
| DXY          | Yahoo Finance | DX-Y.NYB |
| S&P500 Open  | Yahoo Finance | ^GSPC    |
| VIX          | Yahoo Finance | ^VIX     |
| Crude Oil    | Yahoo Finance | CL=F     |
| Inflation    | FRED          | CPIAUCSL |
| EFFR         | FRED          | DFF      |
| Treasury 1M  | FRED          | DGS1MO   |
| Treasury 2Y  | FRED          | DGS2     |
| Treasury 10Y | FRED          | DGS10    |
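
The mapping above can be written down as plain dictionaries and passed to the two libraries imported earlier. This is only a sketch: the helpers `fetch_yahoo_history` and `fetch_fred_series` and the dictionary labels are hypothetical, both helpers need network access, and the notebook itself loads a prebuilt Kaggle CSV rather than calling them. The library calls used are `yf.Ticker(symbol).history(...)` and `Fred.get_series(...)`.

```python
# Symbols copied from the mapping table above; the dictionary keys are illustrative labels.
YAHOO_SYMBOLS = {
    "Gold": "GC=F",  # also the source of Volume in the table
    "DXY": "DX-Y.NYB",
    "SP500": "^GSPC",
    "VIX": "^VIX",
    "Crude Oil": "CL=F",
}
FRED_SERIES = {
    "Inflation": "CPIAUCSL",
    "EFFR": "DFF",
    "Treasury 1M": "DGS1MO",
    "Treasury 2Y": "DGS2",
    "Treasury 10Y": "DGS10",
}


def fetch_yahoo_history(start, end, symbols=YAHOO_SYMBOLS):
    """Return {label: price-history DataFrame} (hypothetical helper, needs network)."""
    import yfinance as yf  # imported lazily so the mapping is usable without it

    return {
        label: yf.Ticker(symbol).history(start=start, end=end)
        for label, symbol in symbols.items()
    }


def fetch_fred_series(fred, start, end, series=FRED_SERIES):
    """Return {label: series}; `fred` is a client such as fredapi.Fred(api_key=...)."""
    return {
        label: fred.get_series(
            series_id, observation_start=start, observation_end=end
        )
        for label, series_id in series.items()
    }
```

Each Yahoo history frame carries Open, Close, and Volume columns, so the caller picks the one the table names. CPIAUCSL is published monthly, so it would need to be aligned (for example forward filled) before joining it to daily prices.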


SECTION 4 - Download Market Data (Yahoo Finance)

In [None]:
# Load dataset from the directory returned by kagglehub.dataset_download above.
# The hard-coded /kaggle/input/... path exists only on Kaggle and raises
# FileNotFoundError in Colab.
import os

csv_path = os.path.join(cvergnolle_gold_price_and_relevant_metrics_path, 'Gold Price Prediction.csv')
df = pd.read_csv(csv_path)

df.head()


In [None]:
# Dataset info
print("Shape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())

# Basic stats
df.describe(include="all")

In [None]:
print(df.columns)

In [None]:
print(df.tail())

In [None]:
# Plot 1: Correlation Heatmap
plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='YlOrBr', fmt='.2f')
plt.title('Feature Correlation Matrix', fontsize=14)
plt.show()


In [None]:
# Moving Averages trends
plt.figure(figsize=(15, 7))

# Actual Price Today
plt.plot(df['Date'].tail(200), df['Price Today'].tail(200),
         label='Actual Price Today', color='#d4af37', linewidth=2)

# Pre-calculated Moving Averages in your dataset
plt.plot(df['Date'].tail(200), df['Twenty Moving Average'].tail(200),
         label='20-Day MA (Existing)', color='#2F4F4F', linestyle='--')

plt.plot(df['Date'].tail(200), df['Fifty Day Moving Average'].tail(200),
         label='50-Day MA (Existing)', color='#E67E22', linestyle=':')

plt.title('Gold Price Trends & Existing Moving Averages', fontsize=15)
plt.legend()
plt.xticks(rotation=45)
plt.show()
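
The plot above relies on moving-average columns that ship with the dataset. If they ever need to be recomputed, a minimal sketch follows, assuming they are simple trailing means over 20 and 50 rows (the dataset's exact definition is not stated here, and `add_moving_averages` is a hypothetical helper):

```python
import pandas as pd


def add_moving_averages(prices: pd.Series, windows=(20, 50)) -> pd.DataFrame:
    """Return the price series with one simple trailing moving average per window."""
    out = pd.DataFrame({"price": prices})
    for w in windows:
        # rolling(w).mean() is NaN until w observations are available
        out[f"ma_{w}"] = prices.rolling(window=w).mean()
    return out


# Toy series 1.0, 2.0, ..., 60.0 (not real gold prices)
demo = add_moving_averages(pd.Series(range(1, 61), dtype=float))
print(demo.tail(3))
```

In the notebook this would be called as `add_moving_averages(df['Price Today'])`. The first `w - 1` rows of each average are NaN, which is one reason missing values can appear in such columns.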


<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
3- Data Cleaning
</div>



In [None]:
# 1. Rows with a missing target ('Price Tomorrow') must be dropped
df_clean = df.dropna(subset=['Price Tomorrow']).copy()

# 2. Features vs Target (Leaky features like 'Price Change Tomorrow' must be dropped)

cols_to_drop = ['Date', 'Price Tomorrow', 'Price Change Tomorrow']
X = df_clean.drop(columns=cols_to_drop)
y = df_clean['Price Tomorrow']

# 3. Fill missing feature values: forward fill first, then backward fill any gaps left at the start
X = X.ffill().bfill()

print(f"Data successfully cleaned. Total samples: {X.shape[0]}")

<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
4- Train-Test Split
</div>

<p style="font-size:16px;">
We split the data chronologically (the equivalent of shuffle=False) so that the model
learns from the past (training) and is tested on the future (testing).
This prevents data leakage, where the model would otherwise see future prices during training.
</p>

In [None]:

# Calculate the cutoff index for an 80/20 split
split_idx = int(len(df_clean) * 0.8)

# Features (X) split
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]


# Target variable (y) split
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

# Display the size of the datasets to verify the split
print(f"Training Samples: {len(X_train)} | Testing Samples: {len(X_test)}")
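
The same split can be wrapped in a small function that also checks the ordering it depends on. This is a sketch under the assumption that rows are indexed by date in ascending order; `chronological_split` and the toy data are hypothetical.

```python
import pandas as pd


def chronological_split(X: pd.DataFrame, y: pd.Series, train_frac: float = 0.8):
    """Split without shuffling: the first train_frac of rows train, the rest test."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    if not X.index.is_monotonic_increasing:
        raise ValueError("rows must be sorted in time order before splitting")
    split_idx = int(len(X) * train_frac)
    return X.iloc[:split_idx], X.iloc[split_idx:], y.iloc[:split_idx], y.iloc[split_idx:]


# Toy data: ten consecutive days (not real prices)
dates = pd.date_range("2024-01-01", periods=10, freq="D")
X_demo = pd.DataFrame({"price_today": range(10)}, index=dates)
y_demo = pd.Series(range(1, 11), index=dates)

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
print(f"Last training day: {X_tr.index.max().date()} | First test day: {X_te.index.min().date()}")
```

Every training row precedes every test row, which is the property the explanation above relies on.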


<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
5- XGBoost Model Training
</div>


In [None]:
# Initialize the XGBoost Regressor with manually chosen hyperparameters (no tuning has been run yet)
model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.01,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# Train on the training set; the test set is passed as eval_set only to
# record the evaluation metric (no early stopping is configured)
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# Predictions
predictions = model.predict(X_test)

<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
6- Visualizing Results
</div>


In [None]:

# Actual vs Predicted Plot
plt.figure(figsize=(15, 6))
plt.plot(y_test.values, label='Actual Gold Price', color='#FFD700', linewidth=2)
plt.plot(predictions, label='XGBoost Prediction', color='#2F4F4F', linestyle='--', linewidth=1.5)
plt.title('Gold Price Prediction Performance (Actual vs Predicted)', fontsize=16)
plt.ylabel('Price (USD)')
plt.legend()
plt.show()

# Evaluation Metrics
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, predictions):.2f}")
print(f"R2 Score: {r2_score(y_test, predictions):.4f}")
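
Since 'Price Today' is one of the features, these metrics are easier to judge next to a naive baseline that predicts tomorrow's price as today's price. The sketch below uses made-up numbers, not values from the dataset, and `mae` is a hypothetical helper that mirrors `mean_absolute_error`.

```python
import numpy as np


def mae(actual, predicted) -> float:
    """Mean absolute error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up prices for illustration only
price_today = np.array([2000.0, 2010.0, 2005.0, 2020.0])
price_tomorrow = np.array([2010.0, 2005.0, 2020.0, 2030.0])
model_pred = np.array([2008.0, 2009.0, 2012.0, 2026.0])

baseline_mae = mae(price_tomorrow, price_today)  # "tomorrow equals today"
model_mae = mae(price_tomorrow, model_pred)

print(f"Baseline MAE: {baseline_mae:.2f} | Model MAE: {model_mae:.2f}")
```

In the notebook, the equivalent baseline would be `mean_absolute_error(y_test, X_test['Price Today'])`. A model is adding value only to the extent that its MAE is lower than that figure.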


<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
7- Feature Importance
</div>


In [None]:
# Identifying what drives the price
importance = pd.DataFrame({'Factor': X.columns, 'Weight': model.feature_importances_})
importance = importance.sort_values(by='Weight', ascending=False)


plt.figure(figsize=(10, 8))
sns.barplot(x='Weight', y='Factor', data=importance.head(10), hue='Factor', palette='YlOrBr', legend=False)
#sns.barplot(x='Weight', y='Factor', data=importance.head(10), palette='gold')
plt.title('Top 10 Economic Drivers of Gold Price', fontsize=14)
plt.show()

<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
8- Model Accuracy Report
</div>


In [None]:
# Calculating the specific accuracy percentage for the model
from sklearn.metrics import mean_absolute_percentage_error

# 1. Calculate MAPE (Mean Absolute Percentage Error)
mape = mean_absolute_percentage_error(y_test, predictions)

# 2. Informal "accuracy" figure: 100% minus MAPE (not a classification accuracy)
accuracy = (1 - mape) * 100

print(f"--- Technical Accuracy Report ---")
print(f"Mean Absolute Percentage Error (MAPE): {mape*100:.2f}%")
print(f"Final Model Prediction Accuracy: {accuracy:.2f}%")
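
To make the arithmetic behind the report explicit, here is a small worked example of MAPE computed by hand. The numbers are made up, and `mape` is a hypothetical helper that matches `mean_absolute_percentage_error` when no actual value is zero.

```python
import numpy as np


def mape(actual, predicted) -> float:
    """Mean of |actual - predicted| / |actual|, returned as a fraction."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


# Made-up prices for illustration only
actual = [2000.0, 2500.0]
predicted = [1980.0, 2550.0]

err = mape(actual, predicted)   # (20/2000 + 50/2500) / 2 = 0.015
accuracy_pct = (1 - err) * 100  # 98.5

print(f"MAPE: {err * 100:.2f}% | 100% - MAPE: {accuracy_pct:.2f}%")
```

Because gold prices are large numbers, even errors of tens of dollars produce a small MAPE, so the "accuracy" figure will usually sit close to 100%.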

<div style="background-color:#F39C12; color:white; padding:15px; border-radius:8px; font-size:22px; font-weight:bold;">
9- 🏁 Final Insights & Conclusion
</div>



1. **Model Performance**: With an $R^{2}$ score of ~0.8965 on the held-out test period, the model explains roughly 90% of the variance in next-day prices, which suggests that gradient boosting is effective on this tabular financial dataset.
2. **Predictive Drivers**: Feature importance ranks Price Today and the US Dollar Index (DXY) as the primary drivers. The DXY ranking is consistent with the commonly cited inverse relationship between Gold and the USD, although feature importance alone does not show the direction of the effect.
3. **Precision & Error**: The MAE of ~$15.66 is the average absolute gap between predicted and actual next-day prices, indicating that the model's predictions stay close to the actual market value.
4. **Technical Outcome**: This project shows how ensemble learning can be applied to forecasting a volatile commodity price.


**Future Scope & Model Evolution**

* **Hyperparameter Optimization**: Implementing RandomizedSearchCV or Optuna to fine-tune the learning_rate, max_depth, and subsample parameters for even lower error rates.
* **Sentiment Analysis**: Adding Natural Language Processing (NLP) to analyze financial news headlines and Federal Reserve meeting minutes to account for sudden market shocks.
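
As a starting point for the hyperparameter item above, a randomized search can be combined with time-ordered cross-validation so that each fold validates on data later than its training data. This is a sketch, not a tuned setup: `build_search` is a hypothetical helper, the candidate values are arbitrary, and the demo uses scikit-learn's `GradientBoostingRegressor` on synthetic data as a stand-in because it accepts the same three parameter names as `XGBRegressor`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit


def build_search(estimator, n_iter=5, n_splits=3, random_state=42):
    """Randomized search over learning_rate, max_depth and subsample (hypothetical helper)."""
    param_distributions = {
        "learning_rate": [0.01, 0.03, 0.1],  # arbitrary candidate values
        "max_depth": [3, 4, 6],
        "subsample": [0.6, 0.8, 1.0],
    }
    return RandomizedSearchCV(
        estimator,
        param_distributions=param_distributions,
        n_iter=n_iter,
        scoring="neg_mean_absolute_error",
        cv=TimeSeriesSplit(n_splits=n_splits),  # folds respect row order
        random_state=random_state,
    )


# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = build_search(GradientBoostingRegressor(n_estimators=30, random_state=0), n_iter=3)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params, f"CV MAE: {-search.best_score_:.3f}")
```

In the notebook, the call would look like `build_search(XGBRegressor(n_estimators=1000, random_state=42)).fit(X_train, y_train)`, keeping the test set untouched until the final evaluation.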

   


### 📬 Feedback

If you have any suggestions for improving the **XGBoost** hyperparameters, feel free to leave a comment!
