## Mineral Deposit Data Analysis

This notebook loads and analyzes the mineral deposit data.

### Load Data
This section imports the required packages, downloads the dataset from the website, and lists the names of the files it contains. In the code below, data_loc is the directory where the dataset is saved. To save storage space, the zip file is not extracted to a folder; instead, each required file is read directly from the zip archive.
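
As a minimal sketch of reading a CSV straight from a zip archive without extracting it, the example below builds a tiny in-memory archive so that it runs without the real download. The archive contents, the member name `demo_package/sites.csv`, and the variable names are made up for illustration and are not part of the SARIG package.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive (hypothetical contents) so the sketch
# does not depend on the real dataset being downloaded.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo_package/sites.csv", "SITE_NO,NAME\n1,Alpha\n2,Beta\n")

# Reopen the archive for reading, list its members, and load one CSV
# directly from the archive without extracting it to disk.
with zipfile.ZipFile(buffer, "r") as archive:
    member_names = archive.namelist()
    with archive.open("demo_package/sites.csv") as member:
        demo_sites = pd.read_csv(member, encoding="latin1")

print(member_names)
print(demo_sites)
```

The same pattern should apply to the downloaded file by passing its path to `zipfile.ZipFile` instead of the in-memory buffer.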

In [None]:
# import required packages
import pandas as pd
import os
import pickle
import sys
pd.options.display.width=None
pd.options.display.max_columns=None


if sys.version_info >= (3, 6):
    from zipfile import ZipFile as zipfile
else:
    # on older interpreters, take the ZipFile class from the zipfile36 backport
    from zipfile36 import ZipFile as zipfile
    
url = "https://unearthed-exploresa.s3-ap-southeast-2.amazonaws.com/Unearthed_5_SARIG_Data_Package.zip" 
# enter the directory to save data
data_loc = './data'
file_name = 'Unearthed_5_SARIG_Data_Package.zip'

from urllib.request import urlopen

if os.path.isfile(os.path.join(data_loc, file_name)):
    print("File exists")
else:
    # download the zip file and save it under data_loc (opened in binary mode)
    os.makedirs(data_loc, exist_ok=True)
    with urlopen(url) as response, open(os.path.join(data_loc, file_name), 'wb') as output:
        output.write(response.read())
    
# list the names of the files contained in the zip archive
with zipfile(os.path.join(data_loc, file_name), 'r') as archive:
    files_in_dataset = [file.filename for file in archive.filelist]

files_in_dataset



For this part of the data cleaning, we use only the following files:
 - 'SARIG_Data_Package/sarig_md_commodity_exp.csv',
 - 'SARIG_Data_Package/sarig_md_details_exp.csv',
 - 'SARIG_Data_Package/sarig_md_mineralogy_exp.csv',
 - 'SARIG_Data_Package/sarig_md_reference_exp.csv',
 - 'SARIG_Data_Package/sarig_md_zone_hr_lith_exp.csv',
 - 'SARIG_Data_Package/sarig_md_zone_lith_exp.csv'

### Determine the Record Identifier

In [None]:
# read the reference data
sarig_md_reference_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_reference_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_reference_exp['PUBLICATION_DATE'] = pd.to_datetime(sarig_md_reference_exp['PUBLICATION_DATE'])
sarig_md_reference_exp.head(3)

In [None]:
sarig_md_reference_exp.info()

In [None]:
sarig_md_reference_exp.isnull().any()

Since the columns "MINERAL_DEPOSIT_NO", "SITE_NO", "LONGITUDE_GDA2020", and "LATITUDE_GDA2020" contain no null values, they are potential record identifiers for the following analysis.
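
A minimal sketch of how the null check can be turned into a list of candidate identifier columns, using a small made-up table (the values and the name `demo_reference` are illustrative assumptions, not the real reference data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reference table (hypothetical values).
demo_reference = pd.DataFrame({
    "MINERAL_DEPOSIT_NO": [10, 11, 12],
    "SITE_NO": [100, 101, 102],
    "PUBLICATION_DATE": [np.nan, "1999-01-01", np.nan],
})

# True for every column that contains at least one null value.
null_flags = demo_reference.isnull().any()

# Columns without any nulls are candidates for a record identifier.
candidate_columns = null_flags[~null_flags].index.tolist()
print(candidate_columns)
```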

In [None]:
# count the unique deposit numbers, site numbers, and coordinate pairs
print(len(sarig_md_reference_exp['MINERAL_DEPOSIT_NO'].unique()), len(sarig_md_reference_exp['SITE_NO'].unique()))
print(len(sarig_md_reference_exp[['LONGITUDE_GDA2020', 'LATITUDE_GDA2020']].drop_duplicates()))


Here, the numbers of unique 'MINERAL_DEPOSIT_NO' and 'SITE_NO' values are equal, but they differ from the number of unique ['LONGITUDE_GDA2020', 'LATITUDE_GDA2020'] pairs. Distinct 'MINERAL_DEPOSIT_NO' or 'SITE_NO' values might therefore correspond to the same longitude and latitude. This should be investigated.
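
To illustrate the comparison, here is a small sketch on made-up data in which two deposits share one coordinate pair. The table `demo_reference` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy reference table (hypothetical values): deposits 1 and 2 share
# coordinates, and deposit 3 appears twice (e.g. two references).
demo_reference = pd.DataFrame({
    "MINERAL_DEPOSIT_NO": [1, 2, 3, 3],
    "SITE_NO": [101, 102, 103, 103],
    "LONGITUDE_GDA2020": [138.5, 138.5, 139.0, 139.0],
    "LATITUDE_GDA2020": [-34.9, -34.9, -35.1, -35.1],
})

n_deposits = demo_reference["MINERAL_DEPOSIT_NO"].nunique()
n_sites = demo_reference["SITE_NO"].nunique()
n_coords = len(
    demo_reference[["LONGITUDE_GDA2020", "LATITUDE_GDA2020"]].drop_duplicates()
)

# Fewer coordinate pairs than deposits means some deposits share a location.
print(n_deposits, n_sites, n_coords)
```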

In [None]:
# remove the duplicates 
site_lon_lat = sarig_md_reference_exp[['MINERAL_DEPOSIT_NO','SITE_NO', 'LONGITUDE_GDA2020', 'LATITUDE_GDA2020']].drop_duplicates()

# count the records corresponding to the same longitude and latitude
count_site = site_lon_lat.groupby(by=['LONGITUDE_GDA2020', 'LATITUDE_GDA2020']).count()

# find the coordinates shared by more than one record
ifentified_lon_lat = count_site[count_site['MINERAL_DEPOSIT_NO']!=1].reset_index()[['LONGITUDE_GDA2020', 'LATITUDE_GDA2020']]

In [None]:
sarig_md_reference_exp.merge(ifentified_lon_lat, how='inner', on=['LONGITUDE_GDA2020', 'LATITUDE_GDA2020']).set_index(['LONGITUDE_GDA2020', 'LATITUDE_GDA2020'])

From the above, we can see three cases where two MINERAL_DEPOSIT_NO values share the same (LONGITUDE, LATITUDE) pair. These may be cases where two sites (with different SITE_NO and MINERAL_DEPOSIT_NO) have the same coordinates. The following data support this guess.
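
As an alternative to the groupby-and-count approach, shared coordinates can also be flagged with `DataFrame.duplicated(keep=False)`. The sketch below uses made-up data; `demo_sites` and `coord_cols` are illustrative names, not part of the notebook.

```python
import pandas as pd

coord_cols = ["LONGITUDE_GDA2020", "LATITUDE_GDA2020"]

# Toy table of distinct (deposit, site, coordinate) records (hypothetical values).
demo_sites = pd.DataFrame({
    "MINERAL_DEPOSIT_NO": [1, 2, 3],
    "SITE_NO": [101, 102, 103],
    "LONGITUDE_GDA2020": [138.5, 138.5, 139.0],
    "LATITUDE_GDA2020": [-34.9, -34.9, -35.1],
}).drop_duplicates()

# keep=False marks every row whose coordinate pair occurs more than once.
shared = demo_sites[demo_sites.duplicated(subset=coord_cols, keep=False)]
shared_deposit_numbers = sorted(shared["MINERAL_DEPOSIT_NO"].tolist())
print(shared_deposit_numbers)
```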

In [None]:
# the set of MINERAL_DEPOSIT_NO values that share coordinates
sarig_md_reference_exp.merge(
    ifentified_lon_lat, how='inner', 
    on=['LONGITUDE_GDA2020', 'LATITUDE_GDA2020']).set_index(
    ['LONGITUDE_GDA2020', 'LATITUDE_GDA2020'])['MINERAL_DEPOSIT_NO'].values

### Commodities 

This section identifies the set of commodity names and allows users of this code to select the commodities for which they want to extract related data.
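
A minimal sketch of such a selection, assuming the commodity table has a COMMODITY_NAME column and the chosen names are held in a list called `commodities_interested`. The toy table below is made up for illustration.

```python
import pandas as pd

# Toy commodity table (hypothetical values).
demo_commodity = pd.DataFrame({
    "MINERAL_DEPOSIT_NO": [1, 1, 2, 3],
    "SITE_NO": [101, 101, 102, 103],
    "COMMODITY_NAME": ["Copper", "Gold", "Iron", "Gold"],
})

# List the available commodity names, then keep only the selected ones.
available = sorted(demo_commodity["COMMODITY_NAME"].unique())
commodities_interested = ["Copper", "Gold"]
selected = demo_commodity[
    demo_commodity["COMMODITY_NAME"].isin(commodities_interested)
]

print(available)
print(selected)
```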

In [None]:
sarig_md_commodity_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_commodity_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_commodity_exp.head(5)

In [None]:
sarig_md_commodity_exp.loc[sarig_md_commodity_exp['MINERAL_DEPOSIT_NO'].isin([3612,  3612,  3717,  3717,  7104,  7105, 10516, 10517])]

These records have the same coordinates but different MINERAL_DEPOSIT_NO and DEPOSIT_NAME values. This supports the guess that some sites (SITE_NO, MINERAL_DEPOSIT_NO) actually share coordinates. It also suggests that we should use SITE_NO or MINERAL_DEPOSIT_NO, rather than the coordinates, as the record identifier.

In [None]:
sarig_md_commodity_exp.info()

In [None]:
# '''
# use this code to generate the set_commodity_name
#  sarig_md_commodity_exp['COMMODITY_NAME'].unique()
# set_commodity_name = ['Copper', 'Iron', 'Rare Earths', 'Heavy Minerals', 
#                       'Gold', 'Chrysoprase', 'Cobalt', 'Nickel','Corundum', 
#                       'Vanadium', 'Ilmenite', 'Chromium', 'Agate', 'Celestite',
#                       'Clay', 'Shale', 'Granite', 
#                       'Ironstone - construction materials', 'Opal', 'Alunite',
#                       'Micaceous Hematite', 'Kaolin', 'Dolomite', 'Limestone',
#                       'Gravel', 'Sandstone', 'Quartzite', 'Dolerite', 
#                       'Rhyolite', 'Graphite', 'Magnesite', 'Lead', 'Marble', 
#                       'Uranium', 'Thorium', 'Asbestos', 'Zinc', 'Talc', 
#                       'Manganese', 'Sand', 'Gneiss', 'Gabbro', 'Amphibolite', 
#                       'Beryl', 'Uranium Oxide', 'Iron Ore', 'Silver', 'Schist', 
#                       'Calcrete', 'Metasiltstone', 'Amazonite', 'Tungsten', 
#                       'Molybdenum', 'Gypsum', 'Lime sand', 'Phosphate', 
#                       'Diamond', 'Platinoids', 'Salt', 'Aluminium', 'Tin', 
#                       'Amethyst', 'Jade', 'Pozzolan (Volcanic Ash)', 
#                       'Silica sand', 'Sapphire', 'Slate', 'Basalt', 
#                       'Tourmaline', 'Feldspar', 'Silica', 'Barite', 
#                       'Calcite', 'Fluorite', 'Kyanite', 'Sulphur', 'Quartz', 
#                       'Sillimanite', 'Mica', 'Beryllium', 'Pegmatite', 
#                       'Andalusite', 'Bismuth','Carphosiderite', 'Chiastolite',
#                       'Radium', 'Wollastonite', 'Arsenic', 'Garnet', 'Ochre',
#                       'Coal', 'Rutile', 'Mercury','Palygorskite', 'Turquoise', 
#                       'Scholzite', 'Shell grit', 'Topaz','Vermiculite', 
#                       'Siltstone', 'Norite', 'Magnesium', 'Antimony',
#                       'Epsomite', 'Albite', 'Ruby', 'Trona', 'Potash', 'Peat',
#                       'Diatomite', 'Tantalum', 'Oil Shale', 'Nephrite', 
#                       'Allanite', 'Monazite', 'Halloysite', 'Titanium', 'Gas',
#                       'Evaporites',  'Geothermal Energy', 'Lithium']
# '''
# # select the commodities interested from the above commodity names. 
# commodities_interested = ['Copper', 'Gold']


Here, we use SITE_NO as the record identifier.
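
When tables are later joined on the identifier columns, pandas can check the join as it runs. The sketch below is an illustration on made-up tables, not the notebook's own merge: `indicator=True` records which rows found a match, and `validate="many_to_one"` raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Toy tables (hypothetical values): site 103 has no details record.
demo_commodity = pd.DataFrame({
    "SITE_NO": [101, 102, 103],
    "MINERAL_DEPOSIT_NO": [1, 2, 3],
    "COMMODITY_CODE": ["Cu", "Au", "Fe"],
})
demo_details = pd.DataFrame({
    "SITE_NO": [101, 102],
    "MINERAL_DEPOSIT_NO": [1, 2],
    "DEPOSIT_NAME": ["Alpha", "Beta"],
})

merged = demo_commodity.merge(
    demo_details,
    how="left",
    on=["SITE_NO", "MINERAL_DEPOSIT_NO"],
    indicator=True,          # adds a _merge column: both / left_only / right_only
    validate="many_to_one",  # fails if demo_details repeats a key
)

# Sites present in the commodity table but missing from the details table.
unmatched_sites = merged.loc[merged["_merge"] == "left_only", "SITE_NO"].tolist()
print(unmatched_sites)
```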

In [None]:
interested_md_commodity_exp = sarig_md_commodity_exp[
    ['MINERAL_DEPOSIT_NO', 'COMMODITY_CODE',
    'SITE_NO', 'EASTING_GDA2020', 'NORTHING_GDA2020', 'ZONE_GDA2020',
    'LONGITUDE_GDA2020', 'LATITUDE_GDA2020', 'LONGITUDE_GDA94',
    'LATITUDE_GDA94']]

### Mineral Deposit Details Data

In [None]:
# read the mineral deposit details data
sarig_md_details_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_details_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_details_exp['DISCOVERY_YEAR'] = sarig_md_details_exp['DISCOVERY_YEAR'].astype('Int64')
sarig_md_details_exp.head(5)

In [None]:
expand_commodities = sarig_md_details_exp.set_index('MINERAL_DEPOSIT_NO')['COMMODITIES'].str.split(',', expand=True).stack().reset_index().drop('level_1', axis=1)
expand_commodities.rename(columns={0: "COMMODITY_NAME"}, inplace=True)
expand_commodities.head(3)

In [None]:
expand_md_details_exp = sarig_md_details_exp.merge(expand_commodities, how='left', on='MINERAL_DEPOSIT_NO')
expand_md_details_exp.drop('COMMODITIES', axis=1, inplace=True)

In [None]:
interested_md_details_exp = expand_md_details_exp[['MINERAL_DEPOSIT_NO', 'DEPOSIT_NAME', 'DEPOSIT_SYNONYMS',
        'DEPOSIT_CLASS', 'MINEROLOGY_ORE', 'REFERENCE_FLAG', 'MAP_250000', 'MAP_100000', 'MAP_50000', 'SITE_NO',
        'ELEVATION_M', 'SURVEY_METHOD_CODE', 'COMMODITY_NAME']]

### Load Mineral Deposit Mineralogy Data

In [None]:
sarig_md_mineralogy_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_mineralogy_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_mineralogy_exp.head(5)

In [None]:
interested_md_mineralogy_exp = sarig_md_mineralogy_exp[['MINERAL_DEPOSIT_NO', 'MINERAL_CODE', 'MINERAL',
       'MINERAL_TYPE', 'RELATIVE_ABUNDANCE_CODE', 'SITE_NO']]
interested_md_mineralogy_exp.head()

### Load Mineral Deposit HR Lithology Data

In [None]:
sarig_md_zone_hr_lith_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_zone_hr_lith_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_zone_hr_lith_exp.head(5)

In [None]:
interested_md_zone_hr_lith_exp = sarig_md_zone_hr_lith_exp
interested_md_zone_hr_lith_exp.head()

### Load Mineral Deposit Lithology Data

In [None]:
sarig_md_zone_lith_exp = pd.read_csv(
    zipfile(os.path.join(data_loc, file_name),'r').open('SARIG_Data_Package/sarig_md_zone_lith_exp.csv','r'), 
    sep=',', encoding='latin1')
sarig_md_zone_lith_exp.head(5)

In [None]:
interested_md_zone_lith_exp = sarig_md_zone_lith_exp
interested_md_zone_lith_exp.head(5)

### Save the data set extracted from mineral deposit dataset for the selected commodities

In [None]:
# join the selected commodity columns to the deposit details on the record identifiers
extract_mineral_deposit = interested_md_commodity_exp.merge(
    interested_md_details_exp, 
    how='inner', 
    on=['SITE_NO', 'MINERAL_DEPOSIT_NO'],
    suffixes=('', '_details'))
extract_mineral_deposit.to_csv(
    'mineral_deposit_details.csv', 
    sep=',', 
    header=True)

interested_md_zone = interested_md_zone_hr_lith_exp.merge(
    interested_md_zone_lith_exp,  
    how='inner', 
    on=['SITE_NO', 'MINERAL_DEPOSIT_NO'],
    suffixes=('_hr', '_zone'))
interested_md_zone.to_csv(
    'interested_md_zone.csv', 
    sep=',', 
    header=True)

In [None]:
mineral_deposit_details = pd.read_csv(
    'mineral_deposit_details.csv', 
    sep=',', 
    header='infer')
interested_md_zone = pd.read_csv(
    'interested_md_zone.csv', 
    sep=',', 
    header='infer')

In [None]:
# load the required SITE_NO from the csv file extracted from the rs_data.
path = os.path.join('.', 'data')
for directory in os.listdir(path):
    if os.path.isfile(os.path.join(path, directory)):
        pass
    else:
        new_path = os.path.join(path, directory)

        if os.path.exists(os.path.join(new_path, 'rs_chem_site_sample_num.csv')):
            rs_chem_site_sample_num = pd.read_csv(
                os.path.join(new_path, 'rs_chem_site_sample_num.csv'), 
                header='infer', 
                sep=',')['SITE_NO'].drop_duplicates()
            #print('read rs_chem_site_sample_num.csv successfully.')

            extract_mineral_deposit = mineral_deposit_details.merge(
                rs_chem_site_sample_num, how='inner', on='SITE_NO')

            extract_mineral_deposit = extract_mineral_deposit.merge(
                interested_md_zone, 
                how='left', 
                on=['SITE_NO', 'MINERAL_DEPOSIT_NO'],
                suffixes=('', '_zone')
            )

            extract_mineral_deposit.to_csv(os.path.join(new_path, 'extract_mineral_deposit.csv'), sep=',', header=True)
        else:
            pass