# Introduction

Hello guys! In this kernel we will perform exploratory data analysis and build a machine learning model to predict the gold price. This is a supervised machine learning task: the model solves a regression problem, predicting gold prices based on other stock prices. You can check the YouTube link below for the same material.

https://www.youtube.com/watch?v=zrC7xE4CVIs&lc=UgzE8GMA3SI7TErWIF54AaABAg

**Data Description**

This is the gold price dataset. It gives the gold price alongside several other stock prices, listed below. The task is to analyze the gold price and build the best machine learning model to predict it.

**Data set columns**

- Date - mm/dd/yyyy
- SPX - S&P 500 index, a free-float weighted stock market index of 500 of the largest companies listed on stock exchanges in the United States.
- GLD - Gold Price
- USO - United States Oil Fund
- SLV - Silver Price
- EUR/USD - currency pair quotation of the Euro against the US dollar
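Since the Date column is stored as mm/dd/yyyy text, it can be converted to real datetime values if it is ever needed for time-based analysis. Here is a minimal sketch of that conversion. The `sample` frame and its values are made up for illustration and are not rows from the real dataset.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout (values are made up)
sample = pd.DataFrame({
    "Date": ["1/2/2008", "1/3/2008", "1/4/2008"],
    "GLD": [80.0, 81.0, 82.0],
})

# Parse the month/day/year strings into datetime values
sample["Date"] = pd.to_datetime(sample["Date"], format="%m/%d/%Y")

print(sample.dtypes)
```

This kernel does not use the parsed dates, so this step is optional.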

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# create a DataFrame by reading the dataset
df = pd.read_csv('/kaggle/input/gold-price-data/gld_price_data.csv')

In [None]:
df.head()

In [None]:
# check the df structure
df.info()

In [None]:
# find number of rows and columns
df.shape

In [None]:
# describe df numerical columns
df.describe()

**Feature**

- Date - mm/dd/yyyy
- SPX - S&P 500 index, a free-float weighted stock market index of 500 of the largest companies listed on stock exchanges in the United States.
- USO - United States Oil Fund (unit of measure not stated)
- SLV - Silver Price
- EUR/USD - currency pair quotation of the Euro against the US dollar

**Label**

- GLD - Gold Price

# Exploratory Data Analysis

**1. Find Unwanted Columns**

**Take-away**:
- for this kernel we will not consider the Date feature, so we will drop it in the feature engineering section.

**2. Find Missing Values**

In [None]:
# find features with missing values and report the percentage missing
features_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
if features_na:
    for feature in features_na:
        print(feature, np.round(df[feature].isnull().mean() * 100, 2), '% missing values')
else:
    print("No missing value found")

**Take-away**:
- No missing value found
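The same check can be written more compactly with pandas alone. Below is a small sketch on a made-up frame (`toy` is hypothetical and contains one deliberately missing value) that reports the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (not the real dataset)
toy = pd.DataFrame({
    "SPX": [1.0, np.nan, 3.0, 4.0],
    "SLV": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the fraction of missing rows; multiply by 100 for a percentage
missing_pct = toy.isnull().mean() * 100

# Show only the columns that actually have missing values
print(missing_pct[missing_pct > 0])
```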

**3. Find Features with One Value**

In [None]:
for column in df.columns:
    print(column,df[column].nunique())

**Take-away**:
- No feature with only one value
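Instead of reading the counts by eye, the constant columns can be collected automatically. This is a minimal sketch on a made-up frame (`toy` and its column names are hypothetical).

```python
import pandas as pd

# Toy frame: column "a" never changes, column "b" does
toy = pd.DataFrame({"a": [1, 1, 1], "b": [1, 2, 3]})

# A column with at most one distinct value carries no information for a model
constant_columns = [col for col in toy.columns if toy[col].nunique() <= 1]

print(constant_columns)
```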

**4. Explore the Categorical Features**

In [None]:
categorical_features=[feature for feature in df.columns if ((df[feature].dtypes=='O') & (feature not in ['GLD']))]
categorical_features

In [None]:
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(df[feature].unique())))

**Take-away**:
- there is 1 categorical feature (Date)

**5. Find Categorical Feature Distribution**

**Take-away**:
- NA

**6. Relationship between Categorical Features and Label**

**Take-away**:
- NA

**7. Explore the Numerical Features**

In [None]:
# list of numerical variables
numerical_features = [feature for feature in df.columns if ((df[feature].dtypes != 'O') & (feature not in ['GLD']))]
print('Number of numerical variables: ', len(numerical_features))

# preview the numerical variables
df[numerical_features].head()

**Take-away**:
- there are 4 numerical features
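An equivalent way to pick the numerical columns is pandas' `select_dtypes`. The sketch below uses a small made-up frame (`toy` is hypothetical, with fewer columns than the real dataset) and excludes the label the same way as above.

```python
import pandas as pd

# Toy frame mimicking the mix of text and numeric columns (values are made up)
toy = pd.DataFrame({
    "Date": ["1/2/2008", "1/3/2008"],
    "SPX": [1.0, 2.0],
    "GLD": [3.0, 4.0],
    "SLV": [5.0, 6.0],
})

# Keep numeric columns only, then leave out the label column
numerical = [col for col in toy.select_dtypes(include="number").columns if col != "GLD"]

print(numerical)
```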

**8. Find Discrete Numerical Features**

In [None]:
discrete_feature=[feature for feature in numerical_features if len(df[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

**Take-away**:
- there are no discrete variables in the given dataset

**9. Relation between Discrete numerical Features and Labels**
- NA

**10. Find Continuous Numerical Features**

In [None]:
continuous_features=[feature for feature in numerical_features if feature not in discrete_feature+['GLD']]
print("Continuous feature Count {}".format(len(continuous_features)))

**Take-away**:
- there are 4 continuous numerical features

**11. Distribution of Continuous Numerical Features**

In [None]:
# plot the univariate distribution of each continuous feature
plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for continuous_feature in continuous_features:
    ax = plt.subplot(12,3,plotnumber)
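    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
    sns.histplot(df[continuous_feature], kde=True, stat="density")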
    plt.xlabel(continuous_feature)
    plotnumber+=1
plt.show()

**Take-away**: 
- SPX, SLV and EUR/USD seem to be roughly normally distributed
- USO is heavily skewed to the right and seems to have some outliers
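Skewness can also be checked numerically rather than only by looking at the plots. Here is a small sketch using synthetic data (the `toy` frame and its column names are made up): values near 0 suggest a symmetric distribution, and clearly positive values suggest a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic data: one symmetric column and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

# Sample skewness per column
skewness = toy.skew()

print(skewness)
```

On the real data, the same call would be `df[continuous_features].skew()`.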

**12. Relation between Continuous Numerical Features and Labels**

In [None]:
plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for feature in continuous_features:
    data=df.copy()
    ax = plt.subplot(12,3,plotnumber)
    plt.scatter(data[feature],data['GLD'])
    plt.xlabel(feature)
    plt.ylabel('GLD')
    plt.title(feature)
    plotnumber+=1
plt.show()

**Take-away**:
- it seems SLV increases roughly linearly with GLD

**13. Find Outliers in numerical features**

In [None]:
#boxplot on numerical features to find outliers
plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for numerical_feature in numerical_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.boxplot(x=df[numerical_feature])
    plt.xlabel(numerical_feature)
    plotnumber+=1
plt.show()

**Take-away**:
- it seems USO and SLV have some outliers
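The points drawn outside the boxplot whiskers follow the 1.5 x IQR rule by default, so the outliers can be counted with the same rule. Here is a minimal sketch. The helper `count_iqr_outliers` and the `toy` series are hypothetical names introduced for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy series with one obvious outlier (made-up values)
toy = pd.Series([10, 11, 12, 13, 14, 15, 100])
n_outliers = count_iqr_outliers(toy)

print(n_outliers)
```

On the real data this could be applied per column, for example `count_iqr_outliers(df['USO'])`.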

**14. Explore the Correlation between numerical features**

In [None]:
## Checking for correlation (numeric columns only, since Date is stored as text)
cor_mat=df.select_dtypes(include='number').corr()
fig = plt.figure(figsize=(15,7))
sns.heatmap(cor_mat,annot=True)
plt.show()

In [None]:
print (cor_mat['GLD'].sort_values(ascending=False), '\n')

**Take-away**: 
- it seems the SLV feature is strongly correlated with GLD
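By default, pandas' `corr()` computes the Pearson correlation coefficient, which measures how strong a linear relationship is. The sketch below computes it by hand on synthetic data. The helper `pearson_r` and the arrays `x` and `y` are made up for illustration.

```python
import numpy as np

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float((a_centered * b_centered).sum() / denominator)

# Synthetic data: y depends linearly on x plus some noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

r = pearson_r(x, y)
print(r)
```

A value close to 1 or -1 means a strong linear relationship. A value near 0 only rules out a linear one, not every kind of dependence.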

# Feature Engineering

- Drop unwanted Features
- Handle Missing Values
- Handle Categorical Features
- Handle Feature Scaling
- Remove Outliers

As per the Exploratory Data Analysis (EDA):
- for this session we are not considering the Date feature, so we will drop it
- no missing values were found
- outliers were found in USO and SLV, but for this session we will ignore them
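Feature scaling is on the checklist, but this kernel does not apply it. The tree-based models used later generally do not need it, because they split on thresholds rather than distances. For models that do need it, a minimal standardization sketch could look like the one below. The `train` and `test` frames and their values are made up. The statistics come from the training rows only, so no information leaks from the test rows.

```python
import pandas as pd

# Made-up training and test rows
train = pd.DataFrame({"USO": [10.0, 20.0, 30.0, 40.0], "SLV": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"USO": [25.0], "SLV": [2.5]})

# Compute the statistics on the training rows only
mean = train.mean()
std = train.std(ddof=0)

# Apply the same transformation to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```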

In [None]:
df2=df.copy()

In [None]:
df2.head()

In [None]:
# drop Date
df2.drop(['Date'],axis=1, inplace=True)

# Split Dataset into Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split

# separate the features (X) from the label (y), then hold out 20% of the rows for testing
X = df2.drop(['GLD'],axis=1)
y = df2['GLD']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=0)

In [None]:
len(X_train)

In [None]:
len(X_test)
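Note that `train_test_split` shuffles the rows by default, so a random split mixes earlier and later dates. That is the approach this kernel takes. If you want to test the model only on dates later than the training data, a chronological split is one alternative. Below is a minimal sketch on a made-up frame (`toy` and its values are hypothetical).

```python
import pandas as pd

# Made-up daily rows standing in for the dated price data
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "GLD": range(10),
})

# Sort by date, then use the first 80% for training and the last 20% for testing
toy = toy.sort_values("Date").reset_index(drop=True)
split_at = int(len(toy) * 0.8)
train, test = toy.iloc[:split_at], toy.iloc[split_at:]

print(len(train), len(test))
```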

# Model Selection

In [None]:
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV

In [None]:
def find_best_model_using_gridsearchcv(X,y):
    # candidate regressors and the hyperparameter grid to search for each
    algos = {
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed to 'squared_error' in newer scikit-learn releases
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        },
        'RandomForestRegressor': {
            'model': RandomForestRegressor(),
            'params': {
                'n_estimators': [10, 50, 100, 130],
                'criterion': ['squared_error'],
                'max_depth': range(2, 4, 1),
                # 1.0 (use all features) replaces the removed 'auto' option
                'max_features': [1.0, 'log2']
            }
        },
        'XGBRegressor': {
            'model': XGBRegressor(),
            'params': {
                'learning_rate': [0.5, 0.1, 0.01, 0.001],
                'max_depth': [2, 3],
                'n_estimators': [10, 50, 100, 200]
            }
        }
    }
    scores = []
    # 5 random 80/20 splits, reused for every candidate model
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

In [None]:
pd.set_option('display.max_colwidth', 100)
find_best_model_using_gridsearchcv(X,y)

**Take-away**: 
- after applying model selection to DecisionTreeRegressor, RandomForestRegressor and XGBRegressor, we found that XGBRegressor gives the best result, so we will build the model using the XGBRegressor algorithm.
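The `best_score` values being compared are mean cross-validated scores. When no `scoring` argument is passed, `GridSearchCV` uses the estimator's own `score` method, which is R² for regressors. To see what the `ShuffleSplit` settings used here do, this small sketch runs them on made-up data (`X_toy` is hypothetical).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 10 made-up samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)

# Same settings as in the search: 5 independent random 80/20 splits
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X_toy)]

print(fold_sizes)
```

Unlike k-fold, the test parts of different splits may overlap, because every split is drawn independently.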

# Model Building

In [None]:
model_xgb = XGBRegressor(learning_rate=0.5, max_depth=3, n_estimators=200)

In [None]:
model_xgb.fit(X_train,y_train)

In [None]:
# R^2 score on the held-out test set
model_xgb.score(X_test,y_test)

In [None]:
y_pred= model_xgb.predict(X_test)

In [None]:
y_pred

In [None]:
y_test
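Rather than comparing the two outputs by eye, the predictions can be summarized with a few error metrics. Below is a minimal sketch using NumPy only. The helper `regression_metrics` and the numbers passed to it are made up for illustration and are not results from this model.

```python
import numpy as np

def regression_metrics(y_true, y_hat):
    """Return MAE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    errors = y_true - y_hat
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

# Made-up true values and predictions
metrics = regression_metrics([100.0, 110.0, 120.0], [102.0, 108.0, 121.0])
print(metrics)
```

With the fitted model this would be called as `regression_metrics(y_test, y_pred)`.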

If you liked this kernel, please upvote it and leave your comments. Thank you.
