# Gold Price Analysis Project
## Data Source & Attribution

This analysis is based on the **Historical Gold Prices (1995–2026)** dataset, hosted on [Kaggle](https://www.kaggle.com/).

* **Dataset Source:** https://www.kaggle.com/datasets/mr1rameez/historical-gold-prices-19952026
* **License:** [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
* **Author/Publisher:** Muhammad Rameez

> **Notice:** As required by the CC BY 4.0 license, any modifications, data cleaning, or transformations performed on the original dataset are documented in the code sections below.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

**Note:** In the previous cell, we imported the necessary libraries for our analysis:  

- `pandas` for data manipulation and handling the DataFrame.  
- `matplotlib.pyplot` for creating visualizations and plots of the gold price data.

In the next cell, we load the gold price dataset from a CSV file into a Pandas DataFrame called `df`.  
The dataset contains monthly gold prices from August 2000 to February 2026.  
Using Pandas, we can easily manipulate and analyze the data, and visualize it with Matplotlib.

In [2]:
df = pd.read_csv('Dataset/gold_prices_aug_2000_2026_feb.csv')
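
As an aside, `pd.read_csv` can also parse the dates and set the index while loading, instead of doing so in separate steps as this notebook does. The sketch below is only an illustration: it reads a three-row in-memory sample (values copied from the output shown in this notebook) rather than the real file, and the name `sample_df` is hypothetical.

```python
import io

import pandas as pd

# A few rows copied from the notebook output, standing in for the real CSV file.
sample_csv = io.StringIO(
    "Date,Gold_Price_USD_YFinance\n"
    "2000-08-01,276.099991\n"
    "2000-09-01,273.389996\n"
    "2000-10-01,269.80909\n"
)

# parse_dates converts the Date column to datetime while reading,
# and index_col makes it the index in the same call.
sample_df = pd.read_csv(sample_csv, parse_dates=["Date"], index_col="Date")

print(type(sample_df.index).__name__)
print(sample_df)
```

The step-by-step approach used in the rest of this notebook has the advantage of showing the data before and after each transformation.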

In [3]:
df.head(10)

Unnamed: 0,Date,Gold_Price_USD_YFinance
0,2000-08-01,276.099991
1,2000-09-01,273.389996
2,2000-10-01,269.80909
3,2000-11-01,265.874997
4,2000-12-01,271.515005
5,2001-01-01,265.371427
6,2001-02-01,261.805263
7,2001-03-01,262.290907
8,2001-04-01,261.079997
9,2001-05-01,272.077272


In the previous cell, we use `df.head(10)` to display the first 10 rows of the DataFrame.  
This allows us to quickly inspect the structure of the data and check the column names.

In [4]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Date                     307 non-null    str    
 1   Gold_Price_USD_YFinance  307 non-null    float64
dtypes: float64(1), str(1)
memory usage: 4.9 KB


`df.info()` reports the following about the DataFrame:

- The dataset contains 307 rows and 2 columns.
- `Date`: 307 non-null entries, dtype `str` (dates stored as text).
- `Gold_Price_USD_YFinance`: 307 non-null entries, dtype `float64` (gold prices in USD).
- The DataFrame uses approximately 4.9 KB of memory.

Both columns have 307 non-null entries, so there are no missing values and the dataset is complete.

**Note:** The `Date` column is currently stored as strings. To perform time-series analysis, it is recommended to convert it to datetime format using `pd.to_datetime()`.
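
When converting text to datetime, it can help to check that every entry actually parses. The following is a minimal sketch on a hypothetical three-element Series (`raw_dates`, with one deliberately malformed entry); it assumes the `YYYY-MM-DD` format seen in the output above.

```python
import pandas as pd

# Hypothetical sample: two valid dates in the notebook's format and one malformed entry.
raw_dates = pd.Series(["2000-08-01", "2000-09-01", "not a date"])

# errors="coerce" turns strings that do not match the format into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count the entries that failed to parse.
n_failed = int(parsed.isna().sum())
print(n_failed)
```

A count of zero on the real column would indicate that every date string matched the expected format.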

In [5]:
df['Date'] = pd.to_datetime(df['Date'])

In [6]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Date                     307 non-null    datetime64[us]
 1   Gold_Price_USD_YFinance  307 non-null    float64       
dtypes: datetime64[us](1), float64(1)
memory usage: 4.9 KB


I have now converted the `Date` column to datetime and used `df.info()` to confirm the change. In the next cell, I will sort the rows by date from oldest to newest. This ensures that any issues with date order are resolved.

In [7]:
df = df.sort_values('Date')

In [8]:
df.head()

Unnamed: 0,Date,Gold_Price_USD_YFinance
0,2000-08-01,276.099991
1,2000-09-01,273.389996
2,2000-10-01,269.80909
3,2000-11-01,265.874997
4,2000-12-01,271.515005
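
One way to confirm that sorting had the intended effect is the `is_monotonic_increasing` property. The sketch below uses a small, deliberately shuffled sample (values taken from the output above; the names `sample` and `sample_sorted` are hypothetical) rather than the full dataset.

```python
import pandas as pd

# Hypothetical out-of-order sample built from values in the notebook output.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2000-10-01", "2000-08-01", "2000-09-01"]),
    "Gold_Price_USD_YFinance": [269.80909, 276.099991, 273.389996],
})

# The dates are not in ascending order yet.
before = sample["Date"].is_monotonic_increasing

# ignore_index=True renumbers the rows 0, 1, 2, ... after sorting.
sample_sorted = sample.sort_values("Date", ignore_index=True)
after = sample_sorted["Date"].is_monotonic_increasing

print(before, after)
```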


In the next cell, I will replace the default integer index of the DataFrame (0, 1, 2, ...) with the dates from the `Date` column.

In [9]:
df.set_index('Date', inplace=True)

In [10]:
df.head(10)

Date,Gold_Price_USD_YFinance
2000-08-01,276.099991
2000-09-01,273.389996
2000-10-01,269.80909
2000-11-01,265.874997
2000-12-01,271.515005
2001-01-01,265.371427
2001-02-01,261.805263
2001-03-01,262.290907
2001-04-01,261.079997
2001-05-01,272.077272
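
A date index makes it possible to select rows by date label. The sketch below illustrates this on a four-row sample copied from the output above; the variable names are hypothetical.

```python
import pandas as pd

# Four rows copied from the notebook output, indexed by date.
sample = pd.DataFrame(
    {"Gold_Price_USD_YFinance": [271.515005, 265.371427, 261.805263, 262.290907]},
    index=pd.to_datetime(["2000-12-01", "2001-01-01", "2001-02-01", "2001-03-01"]),
)
sample.index.name = "Date"

# Partial-string indexing: every row whose date falls in 2001.
rows_2001 = sample.loc["2001"]

# Label-based slicing between two dates; both endpoints are inclusive.
jan_feb_2001 = sample.loc["2001-01-01":"2001-02-28"]

print(len(rows_2001), len(jan_feb_2001))
```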


**Note:** After setting the `Date` column as the index, we need to ensure that all dates are unique. 
Duplicated dates can cause issues in time-series analysis, such as incorrect plots, wrong calculations 
for moving averages or returns, and problems with resampling or rolling operations. 

The following cell checks for any duplicated dates in the index to make sure our data is safe for analysis.


In [11]:
df.index.duplicated().sum()

np.int64(0)

`df.index.duplicated().sum()` returns `np.int64(0)`, which means there are no duplicated dates in the DataFrame index.
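
This dataset needs no deduplication, but if the check had found repeated dates, one possible remedy is to keep only the first row for each date. The sketch below assumes that "keep the first occurrence" is the right policy, which depends on the data. The sample, including the repeated `2000-09-01` row and its price of 273.5, is made up for illustration.

```python
import pandas as pd

# Hypothetical sample with one repeated date (the real dataset has none).
sample = pd.DataFrame(
    {"Gold_Price_USD_YFinance": [276.099991, 273.389996, 273.5]},
    index=pd.to_datetime(["2000-08-01", "2000-09-01", "2000-09-01"]),
)

# True for every row whose date already appeared earlier in the index.
dup_mask = sample.index.duplicated(keep="first")
n_duplicates = int(dup_mask.sum())

# Keep only the first row for each date.
deduped = sample[~dup_mask]

print(n_duplicates, deduped.index.is_unique)
```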

## Data Ready for Analysis

The `Date` column has been converted to `datetime`, sorted from oldest to newest, and set as the DataFrame index.  
There are no missing values, and all dates are unique.  

This means the dataset is now clean and properly structured, and it is ready for time-series analysis, plotting, and calculation of returns, moving averages, or volatility.
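
As a rough sketch of the calculations mentioned above, the block below computes month-over-month returns, a moving average, and a rolling volatility on the first six prices shown in this notebook. The 3-month window and the names `returns`, `ma_3`, and `vol_3` are assumptions chosen for illustration, not choices made by this analysis.

```python
import pandas as pd

# First six monthly prices, copied from the notebook output.
prices = pd.Series(
    [276.099991, 273.389996, 269.80909, 265.874997, 271.515005, 265.371427],
    index=pd.to_datetime(
        ["2000-08-01", "2000-09-01", "2000-10-01",
         "2000-11-01", "2000-12-01", "2001-01-01"]
    ),
    name="Gold_Price_USD_YFinance",
)

# Month-over-month simple return: (P_t - P_{t-1}) / P_{t-1}.
returns = prices.pct_change()

# 3-month moving average of the price (the window length is an arbitrary choice).
ma_3 = prices.rolling(window=3).mean()

# Rolling volatility: standard deviation of returns over the same window.
vol_3 = returns.rolling(window=3).std()

summary = pd.DataFrame(
    {"price": prices, "return": returns, "ma_3": ma_3, "vol_3": vol_3}
)
print(summary)
```

The first rows of each derived column are `NaN` because a return needs a previous price and a rolling window needs a full set of observations.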
