<div class="alert alert-block alert-success">
<p style="font-size:20px"><b>Section 1: Data Preparation for S6000: </b></p> 
Load the wind farm data from sites H1 and H2 into the MongoDB database <b>Windfarm_S6000</b> (collection <b>S6000_Data</b>). The data include:
<ul>
<li>Raw data of the 31 feature variables (v31)</li>
<li>Minmax scaled v31 data.</li>
<li>Octave-band filtered data (Butterworth) passed through the IOA filter with an 819 Hz frame and A-weighting. Each band has 100 data points.</li>
</ul>

All trained models using <b>sklearn modules</b> are saved, including all <b>min_max scalers</b> and <b>KMeans clustering</b> models.
</div>

In [1]:
import sys

oneDrive_root={}
oneDrive_root[1]="C:\\Users\\Chihyang\\OneDrive for Business\\"
oneDrive_root[2]="C:\\Users\\Chihyang\\OneDrive for Business\\"
oneDrive_root[3]="C:\\Users\\tsaic\\OneDrive - State University of New York at New Paltz\\"  # laptop
WNF_lib = ".//WNF_lib//"

site=3   # the short or long business OneDrive directory name
lib_dir=oneDrive_root[site]+'\\Prudentia\\Tsaipy'
# append additional library path for this study
sys.path.append(lib_dir)
sys.path.append(WNF_lib)
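
Because each `oneDrive_root` entry already ends with a separator, the string concatenation above produces a doubled backslash, which Windows tolerates. A minimal sketch of the same path built with `pathlib` is shown here; `root` and `lib_dir_alt` are illustrative names that do not appear in the notebook, and `PureWindowsPath` is used only so that the snippet behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Root copied from oneDrive_root[3]; the trailing separator is harmless here.
root = "C:\\Users\\tsaic\\OneDrive - State University of New York at New Paltz\\"

# The / operator joins path components and collapses redundant separators.
lib_dir_alt = PureWindowsPath(root) / "Prudentia" / "Tsaipy"
print(lib_dir_alt)
```

On a Windows machine, `str(lib_dir_alt)` could then be passed to `sys.path.append` in place of `lib_dir`.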

In [2]:
import pandas as pd
from pymongo import MongoClient
import pymongo
from dbconnect import mongodb as mdb
import numpy as np
import pprint
from WFN_lib.cluster_classify import cluster_analysis

### Input parameters

In [3]:
db_name = "Windfarm_S6000"
collection_data = "S6000_Data"
infile_name1 = ".//Results//data_sets//Complete_Features_Rating_Table.xlsx"


<div class="alert alert-block alert-info">
<p style="font-size:20px"><b>Section 1.1: Connect to the database:</b></p> 
<ul>
    <li>client: the connection handle, equivalent to the <b>conn</b> object returned by a SQL connection request.</li>
    <li>Check whether a database named <b>db_name</b> exists. If it does, drop it so that a new one can be created.</li>
    <li>db: the reference to the database named in "<b>db_name</b>".</li>
    <li>The database is not actually created until data are inserted into it.</li>
</ul>
</div>        
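
The drop-then-recreate steps listed above can be sketched as one small helper. `reset_database` is a hypothetical name, not part of the notebook or of PyMongo; it assumes only that the client offers `list_database_names()`, `drop_database()`, and item access, all of which `MongoClient` provides.

```python
def reset_database(client, name):
    """Drop database `name` if it exists, then return a handle to it.

    Returns a (db, existed) tuple. MongoDB creates databases lazily, so
    `name` will not be listed again until a document has been inserted.
    """
    existed = name in client.list_database_names()
    if existed:
        client.drop_database(name)
    # Item access only returns a handle; nothing is written to disk yet.
    return client[name], existed
```

With a running server this would be called as `db, existed = reset_database(MongoClient('localhost', 27017), db_name)`.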

In [4]:
## Create a new database, db_name, in MongoDB
## If db_name exists, delete it first and then create a new one

client = MongoClient('localhost', 27017)  # connect to the db engine
dbnames = client.list_database_names()  # find all existing databases
print("Existing databases before adding the new one: ",dbnames)
# delete the database, db_name, if exists
if db_name in dbnames:   
    client.drop_database(db_name)
else:
    print(f"Database, {db_name}, does not exist!!!")

# get a handle to the new database, db_name (MongoDB creates it lazily on the first insert)
print(f"db_name: {db_name}")
db = client[db_name]

## Check whether the database exists. At this point it is not listed, since no data have been inserted yet.
dbnames = client.list_database_names()  # find all existing databases
print("Databases after adding the new one: ", dbnames)
if db_name in dbnames:   # check if db_name is created
    print(f"Database, {db_name}, successfully created!!")
else:
    print(f"Database, {db_name}, not found!!")

Existing databases before adding the new one:  ['Windfarm_6000', 'Windfarm_S6000', 'admin', 'config', 'fin_clustering2002_2019', 'fin_clustering2005_2022', 'fin_clustering2019', 'local']
db_name: Windfarm_S6000
Databases after adding the new one:  ['Windfarm_6000', 'admin', 'config', 'fin_clustering2002_2019', 'fin_clustering2005_2022', 'fin_clustering2019', 'local']
Database, Windfarm_S6000, not found!!


<div class="alert alert-block alert-info">
<p style="font-size:20px"><b>Section 1.2: Create a database collection (table) for the data:</b></p> There are three datasets in this database, all based on the same source wav files. The min-max scaler created to scale the data is saved under <b>.//trained_models</b>.
<ul>
    <li><b>v31</b>: The 31 feature variables per Nguyen (2021).</li>
    <li><b>v31_minmax</b>: The 31 variables min-max scaled into the range (0,1).</li>
    <li><b>IOA</b>: The octave-band filtered, A-weighted data.</li>
</ul>
</div> 

#### Read the data source from an Excel file with three worksheets into pandas DataFrames

#### Insert rows to collection_data collection
- This process is done row-by-row using the insert_one() function
- Each row obtains its data from the three sources, two from the dataframes and one from reading the npy file.
- The file id is stored in the built-in `_id` field, which MongoDB indexes automatically.
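
As a rough sketch of the per-row assembly described above, the helper below builds one nested document from plain Python values. `build_document` is a hypothetical name used for illustration; the two feature values come from the first row of the feature table, while the band values are made up and much shorter than the real 100-point series.

```python
def build_document(doc_id, site, features, band_values):
    """Assemble one nested document in the layout used for the collection."""
    return {
        "_id": doc_id,                      # file id doubles as the primary key
        "site": site,
        "file_name": doc_id + ".wav",
        "v31": dict(features),              # feature name -> value
        "IOA_3rdOctave": {band: list(vals) for band, vals in band_values.items()},
    }

doc = build_document(
    "H1_0001",
    "H1",
    {"spectralCentroid": 63.081251, "spectralCrest": 28.079092},
    {"025_100Hz": [0.1, 0.2], "050_200Hz": [0.3, 0.4]},  # made-up band values
)
print(sorted(doc))
```

A document built this way can be handed directly to `insert_one()`.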

In [5]:
### drop the dataset collection if exists, collection_data
try:
    db.validate_collection(collection_data)  # Try to validate a collection   
    db[collection_data].drop()
    print(f"Collection (table), {collection_data}, dropped")
except pymongo.errors.OperationFailure:  # If the collection doesn't exist
    print(f"Collection, {collection_data}, doesn't exist") 

## create a new collection, collection_data, in the database, db_name
coll_data=db[collection_data]   # collection (table) name 


Collection, S6000_Data, doesn't exist


In [6]:
### For the 31 feature variables in Nguyen 2021 ########################################
## Retrieve the source data from the Excel file, infile_name1
df_v31=pd.read_excel(infile_name1,header=0,index_col=0)
print(df_v31.shape)
print(df_v31.head(3))

(6000, 35)
        Site filename    Rating  Prob_AM_RdmFrst  spectralCentroid  \
H1_0001   H1   s1.wav  1.091934         0.174275         63.081251   
H1_0002   H1   s2.wav  1.094161         0.019092         57.650009   
H1_0003   H1   s3.wav  4.967208         0.648424         75.049887   

         spectralCrest  spectralDecrease  spectralEntropy  spectralFlatness  \
H1_0001      28.079092        -92.466205         0.143241          0.010610   
H1_0002      30.680069       -259.422066         0.067829          0.011763   
H1_0003      25.424498        -22.843544         0.217764          0.008283   

         spectralFlux  ...          PR   Fo  AMfactor       DAM  \
H1_0001  1.087143e-07  ...   69.138538  0.9  0.239249  1.339453   
H1_0002  4.547078e-05  ...    7.843309  0.8  0.111174  1.217609   
H1_0003  3.236673e-06  ...  104.108914  0.7  0.827716  7.589734   

         peakloc_unweightedSPL       L63      L125      L250      L500  \
H1_0001                    0.2  3.403435  2.6850

In [7]:
## Min-max scale the 31 feature variables: columns df_v31.iloc[:, 4:35]
## The first 4 columns are not scaled.
## Check if the first three rows of the scaled data are the same as those from df_v31minmax_veri
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_v31minmax = scaler.fit_transform(df_v31.iloc[:,4:35])
df_v31minmax = pd.DataFrame(df_v31minmax, columns=df_v31.columns[4:35], index=df_v31.index)
print(df_v31minmax.shape)
## Check the first three rows of the scaled data with those from df_v31minmax_veri
print(df_v31minmax.head(3))
## Save the scaler to be used for scaling new data
import joblib
joblib.dump(scaler, './/trained_models//scalers//Minmax_scaler_v31_S6000.joblib')


(6000, 31)
         spectralCentroid  spectralCrest  spectralDecrease  spectralEntropy  \
H1_0001          0.012519       0.871260          0.999324         0.142351   
H1_0002          0.008121       0.957792          0.998104         0.065464   
H1_0003          0.022211       0.782944          0.999833         0.218333   

         spectralFlatness  spectralFlux  spectralKurtosis  \
H1_0001          0.010966      0.001331          0.004294   
H1_0002          0.012185      0.557133          0.002559   
H1_0003          0.008507      0.039656          0.000540   

         spectralRolloffPoint  spectralSkewness  spectralSlope  ...        PR  \
H1_0001              0.021432          0.078659       0.997918  ...  0.064817   
H1_0002              0.007925          0.068858       0.622292  ...  0.007353   
H1_0003              0.040809          0.027030       0.967046  ...  0.097601   

               Fo  AMfactor       DAM  peakloc_unweightedSPL       L63  \
H1_0001  1.000000  0.047060 

['.//trained_models//scalers//Minmax_scaler_v31_S6000.joblib']
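
For reference, min-max scaling maps each column through (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The pure-Python sketch below illustrates that formula on a made-up column; `minmax_scale` is a hypothetical helper written for this illustration and is not the scikit-learn implementation.

```python
def minmax_scale(column):
    """Scale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Constant column: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / span for x in column]

scaled = minmax_scale([2.0, 4.0, 6.0, 10.0])  # made-up values
print(scaled)
```

Saving the fitted scaler matters because new data must be scaled with the same per-column minimum and maximum, not with values recomputed from the new data.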

In [8]:
## Check if two scalers created at different times and saved in two locations are the same

# Load the joblib file
file1 = joblib.load('.//trained_models//scalers//Minmax_scaler_v31_S6000.joblib')
file2 = joblib.load('.//trained_models//Minmax_scaler_v31_S6000.joblib')
# Print or inspect the content of the file
print(file1.data_max_)
print(file2.data_max_)

print(file1.data_min_)
print(file2.data_min_)


[ 1.28259413e+03  3.19487568e+01 -9.49792591e-03  9.84429088e-01
  9.46154743e-01  8.16156375e-05  3.30122735e+05  3.24719115e+03
  3.13545090e+02  7.11390135e-14  1.17464580e+03  4.00000000e+02
  8.58377337e-01  7.15270475e+01  2.84255734e+00  2.46119394e+00
  3.06684184e+01  1.05000000e+00  5.63476848e+02  1.74230837e+01
 -1.19318803e-01  1.06667625e+03  9.00000000e-01  4.41591099e+00
  4.71328706e+01  4.80000000e+00  5.51901854e+01  5.25686596e+01
  5.15162113e+01  4.76758433e+01  4.66158008e+01]
[ 1.28259413e+03  3.19487568e+01 -9.49792591e-03  9.84429088e-01
  9.46154743e-01  8.16156375e-05  3.30122735e+05  3.24719115e+03
  3.13545090e+02  7.11390135e-14  1.17464580e+03  4.00000000e+02
  8.58377337e-01  7.15270475e+01  2.84255734e+00  2.46119394e+00
  3.06684184e+01  1.05000000e+00  5.63476848e+02  1.74230837e+01
 -1.19318803e-01  1.06667625e+03  9.00000000e-01  4.41591099e+00
  4.71328706e+01  4.80000000e+00  5.51901854e+01  5.25686596e+01
  5.15162113e+01  4.76758433e+01  4.6615
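
Comparing the printed arrays by eye is error-prone with 31 values each. A minimal sketch of a programmatic check follows; `scalers_match` is a hypothetical helper, and the stand-in objects below only mimic the `data_min_` and `data_max_` attributes of a fitted scaler. The `data_max_` numbers are taken from the printed output, while the `data_min_` numbers are made up.

```python
import math
from types import SimpleNamespace

def scalers_match(a, b, attrs=("data_min_", "data_max_"), rel_tol=1e-9):
    """Return True if both objects hold numerically equal fitted bounds."""
    for name in attrs:
        va, vb = list(getattr(a, name)), list(getattr(b, name))
        if len(va) != len(vb):
            return False
        if not all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(va, vb)):
            return False
    return True

# Stand-ins for two loaded scalers (data_min_ values are illustrative only).
s1 = SimpleNamespace(data_min_=[10.0, 1.5], data_max_=[1282.59413, 31.9487568])
s2 = SimpleNamespace(data_min_=[10.0, 1.5], data_max_=[1282.59413, 31.9487568])
print(scalers_match(s1, s2))
```

In the notebook the same call would be `scalers_match(file1, file2)`.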

## Store data in the collection S6000_Data
- First three fields: _id, site, file_name
- v31: 31 fields for the original 31 feature variables
- IOA_3rdOctave: four bands (25-100 Hz, 50-200 Hz, 100-400 Hz, 200-800 Hz), each holding 100 values per sample (no threshold, 819 Hz frame version).

In [9]:
### For the 31 feature variables and the 31 scaled feature variables ########################################
v31_keys=list(df_v31minmax.columns)   # variable names (or column names in the collection), v31 and v31_minmax share the same keys
### For reading and organizing IOA data ###################################################################
dir_npy='.//Windfarm_IOA//IOA_NPY_data_S6000_819//'  # A-weighted 819 frame rate data
bands = ['025_100Hz','050_200Hz', '100_400Hz', '200_800Hz']
post_fix=['_025_100Hz.npy','_050_200Hz.npy', '_100_400Hz.npy', '_200_800Hz.npy']
IOA_keys = ['v'+str(i).zfill(3) for i in range(100)]   # for the 100 values extracted from 819 rate time series data
#############################################################################################################
#df_v31.head(3)
for i, idx in enumerate(df_v31.index):
    pymongo_data_dict=dict()
    row = df_v31.loc[idx]    # v31
    row_minmax = df_v31minmax.loc[idx]   #v31_minmax
    
    #print(f"{i}: {idx}")

    ### file information
    pymongo_data_dict["_id"]=idx
    pymongo_data_dict["site"]=row["Site"]
    pymongo_data_dict["file_name"]=idx+".wav"

    pymongo_data_dict["v31"]=dict()
    pymongo_data_dict["IOA_3rdOctave"]=dict()

    for key in v31_keys:  # collect the values of the v31 variables
        pymongo_data_dict["v31"][key]=row[key]

    ## for IOA data, collect the values of the 4 bands for each file
    for (band, pos) in zip(bands, post_fix):
        # tolist() converts the NumPy array to plain Python numbers, which BSON can encode
        pymongo_data_dict['IOA_3rdOctave'][band]=np.load(dir_npy+idx+pos).tolist()


    ## insert the document into the collection
    coll_data.insert_one(pymongo_data_dict)
    #if i>3: break

<div class="alert alert-block alert-success"><b>Section 2: Add analysis results from S6000 to collection_data</b>, which include
<ol>
<li>scores: The rating and Random Forest result from Nguyen 2021, based on the 31 summary variables.</li>
<li>cluster_v31_minmax: Clustering results based on the min-max scaled 31 variables, K=2,..., 10.</li>
<li>IOA rating: Two ratings (IOA rating, AMWG) for each band: (1) prominence, (2) modulation depth (L5-L95).</li>
<li>YAMNET classes, probs: The top five classes and corresponding probabilities from YAMNET.</li>
</ol>
</div>

### Create Result database content
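PyMongo's `update_one` takes the filter and the update document, followed by optional flags. Its third and fourth positional parameters are `upsert` and `bypass_document_validation`.

- `upsert`: if set to true, creates a new document when no document matches the query criteria.

- `bypass_document_validation`: if set to true, the write skips any validation rules defined on the collection.

`update_one` has no multi flag: it modifies at most one matching document, whereas `update_many` modifies every document that meets the query criteria.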
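
The updates in this section all follow one pattern: a filter on `_id` plus a `$set` operator, which adds or replaces the named field and leaves the rest of the document untouched. A minimal sketch of that pattern with plain dictionaries is shown below; `make_set_update` is a hypothetical helper, and the two score values are taken from the first row of the feature table.

```python
def make_set_update(doc_id, field, value):
    """Return the (filter, update) pair that sets one top-level field."""
    return {"_id": doc_id}, {"$set": {field: value}}

flt, upd = make_set_update(
    "H1_0001",
    "nguyen_scores",
    {"Rating": 1.091934, "Prob_AM_RdmFrst": 0.174275},
)
print(flt, upd)
# With a live collection the pair would be applied as:
# coll_data.update_one(flt, upd, upsert=False)
```

Passing `upsert` by keyword, as in the commented call, keeps the intent visible and avoids relying on positional order.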

In [10]:
df_v31.index[:5]

Index(['H1_0001', 'H1_0002', 'H1_0003', 'H1_0004', 'H1_0005'], dtype='object')

In [11]:
# Result 1: From Nguyen's paper
## These two scores are stored in the same collection, collection_data (S6000_Data)
scores = ["Rating","Prob_AM_RdmFrst"]
for i, idx in enumerate(df_v31.index):
    pymongo_result_dict=dict()
    #print(f"{i}: {idx}")
    row = df_v31.loc[idx]

    pymongo_result_dict["nguyen_scores"]=dict()
    for s in scores:
        pymongo_result_dict["nguyen_scores"][s]=float(row[s])   # cast to a plain float for BSON

    ## add the scores to the existing document with the same _id
    coll_data.update_one({"_id":idx },{'$set' : pymongo_result_dict}, upsert=False)

In [12]:
## Result 2: Clustering result by Kmeans with K=2, 3, .., 10 and save the clustering models
### Cluster the 6000 samples into 2-10 clusters
df_S6000_cluster, _ =cluster_analysis(df_v31minmax, [2,3,4,5,6,7,8,9,10], random_state=42, scaler_type=None,n_init=10, max_iter=300, tol=0.0001,
                                 save_model_dir='.//trained_models//clustering//', model_file_name='Kmeans_S6000_v31_MinMax_K=')
print(df_S6000_cluster.head(3))
print(df_S6000_cluster.shape)

## Double check with the saved clustering result
infile_name3 = ".//Results//cluster_allotment//Cluster_S6000_MinMax_result.xlsx"
df_cluster=pd.read_excel(infile_name3,header=0,index_col=0)
print(df_cluster.head(3))
print(df_cluster.shape)

         K=2  K=3  K=4  K=5  K=6  K=7  K=8  K=9  K=10
H1_0001    0    2    3    0    5    6    0    0     3
H1_0002    1    0    1    4    2    5    3    3     7
H1_0003    0    1    3    0    5    6    0    1     2
(6000, 9)
         K=2  K=3  K=4  K=5  K=6  K=7  K=8  K=9  K=10
H1_0001    0    2    3    0    5    6    0    0     3
H1_0002    1    0    1    4    2    5    3    3     7
H1_0003    0    1    3    0    5    6    0    1     2
(6000, 9)


In [17]:
model=joblib.load('.//trained_models//clustering//Kmeans_S6000_v31_MinMax_K=5.joblib')
print(model.cluster_centers_[:2])  # get the cluster centers of the first two clusters

[[0.01485179 0.88585969 0.9987002  0.12869654 0.01196497 0.07153502
  0.00350418 0.02499423 0.06059704 0.93350133 0.06555377 0.98953944
  0.02435714 0.44514964 0.39084459 0.2639636  0.47558703 0.58648474
  0.00952726 0.06958317 0.91977739 0.01539305 0.77185695 0.06721571
  0.04312486 0.13660059 0.11806475 0.09896513 0.06189822 0.06190411
  0.06980546]
 [0.00587739 0.92736801 0.99828461 0.08563632 0.00369299 0.01793724
  0.00892482 0.01007855 0.10272272 0.98385129 0.03304128 0.99906487
  0.06550112 0.27118861 0.56725939 0.48925307 0.65266533 0.49569536
  0.00559861 0.05198528 0.9412403  0.00942295 0.70364238 0.05355128
  0.04020534 0.12825727 0.12898097 0.08962586 0.05620599 0.05567665
  0.06124672]]


In [18]:
## Result 2: Clustering result
infile_name3 = ".//Results//cluster_allotment//Cluster_S6000_MinMax_result.xlsx"
df_cluster=pd.read_excel(infile_name3,header=0,index_col=0)
df_cluster.head(3)

pymongo_cluster_dict=dict()
for i, idx in enumerate(df_S6000_cluster.index):
    pymongo_cluster_dict[idx]=dict()
    #print(f"{i}: {idx}")
    row = df_cluster.loc[idx]

    ## MongoDB (BSON) cannot encode np.int64 values, so cast each cluster label to int
    for col in df_cluster.columns:   # "K=2", "K=3", ..., "K=10"
        pymongo_cluster_dict[idx][col]=int(row[col])

    ## Add the new result for idx to the existing document
    #print(i, idx, pymongo_cluster_dict[idx])
    coll_data.update_one({"_id":idx },{'$set' : {"cluster_v31_minmax": pymongo_cluster_dict[idx]}}, upsert=False)
    #if i>3: break
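
The explicit `int()` cast above is needed because the BSON encoder rejects NumPy integer types. A more general sketch is a recursive converter applied to a whole document before it is written. `to_native` is a hypothetical helper; it assumes that NumPy-style values expose a `tolist()` method, as NumPy scalars and arrays do.

```python
def to_native(value):
    """Recursively convert NumPy-style values to plain Python types."""
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy scalars and arrays expose tolist(), which yields native types.
        return value.tolist()
    return value

print(to_native({"K=2": 0, "values": (1, 2)}))
```

Applied as `to_native(pymongo_cluster_dict[idx])`, this would remove the need for a cast per field.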



In [19]:
## Result 3: from IOA results
infile_name4=".//Windfarm_IOA//IOA_Result_819_S6000//All6000_IOA_AW_819_NoThreshold.xlsx"
xls = pd.ExcelFile(infile_name4)
sheet_names = xls.sheet_names  # see all sheet names
print(sheet_names)
keys=None

### Get the data from each sheet
df_bands = dict()
for sht in sheet_names[1:]:
    df_bands[sht]=pd.read_excel(infile_name4,sheet_name=sht,header=0,index_col=0)
    
    #print(df_bands[sht].head(3))
    if sht==sheet_names[1]:
        indices = df_bands[sht].index  # get the index names of the first sheet, since they are the same for all sheets
    
keys = ['prominence','L5-L95']   # get the column names of the two IOA values to extract

############################################################


for i, idx in enumerate(indices):
    pymongo_sht_dict=dict()
    for sht in sheet_names[1:]:
        pymongo_sht_dict[sht]=dict()
        ## look up the row for this file id in the current band's sheet
        row = df_bands[sht].loc[idx]
        #print(f"sht = {sht}, idx = {idx}")

        for key in keys:  # extract only the two IOA ratings listed in keys
            pymongo_sht_dict[sht][key]=row[key]

    ## add the ratings for all bands to the existing document with the same _id
    #print(pymongo_sht_dict)
    coll_data.update_one({"_id":idx },{'$set' : {"IOA_rating":pymongo_sht_dict}}, upsert=False)


        

['All', '025_100Hz', '050_200Hz', '100_400Hz', '200_800Hz']


In [20]:
## Result 4: YAMNET results
infile_name5=".//Results//classification//YAMNET_cls_results.xlsx"
df_H1=pd.read_excel(infile_name5,sheet_name="H1_cls",header=0,index_col=0)
df_H2=pd.read_excel(infile_name5,sheet_name="H2_cls",header=0,index_col=0)
### concatenate the two dataframes
df_H=pd.concat([df_H1,df_H2],axis=0)
classes=df_H.columns[:5]  # the first 5 columns are the class names
probs=df_H.columns[5:]  # the rest of the columns are the probabilities
print(classes, probs)
#############################################################################
#pymongo_class_dict=dict()
#pymongo_prob_dict=dict()

for i, idx in enumerate(df_H.index):
    pymongo_YAMNET_dict=dict()
    pymongo_YAMNET_dict['class']=dict()
    pymongo_YAMNET_dict['prob']=dict()
    ## look up the row for this file id in the combined H1/H2 table
    #pymongo_YAMNET_dict['class'][idx]=dict()
    #pymongo_YAMNET_dict['prob'][idx]=dict()
    row = df_H.loc[idx]

    for s in classes:
        pymongo_YAMNET_dict['class'][s]=row[s]
    for s in probs:
        pymongo_YAMNET_dict['prob'][s]=row[s]

    #print(pymongo_class_dict[idx])
    #print(pymongo_prob_dict[idx])
    ## add the YAMNET result to the document with the same _id (upsert=True creates the document if it is missing)
    ## Collection.update() is deprecated and was removed in PyMongo 4; update_one() replaces it here
    coll_data.update_one({"_id":idx },{'$set' : {"YAMNET_class": pymongo_YAMNET_dict}}, upsert=True)



Index(['top_1', 'top_2', 'top_3', 'top_4', 'top_5'], dtype='object') Index(['prob_1', 'prob_2', 'prob_3', 'prob_4', 'prob_5'], dtype='object')




<div class="alert alert-block alert-success">
<p style="font-size:20px"><b>Section 3: What we have achieved:</b></p> 
At this point, we have created a MongoDB collection <b>S6000_Data</b>, which contains the following.

Data:
<ul>
    <li><b>_id</b>: using file id as the id for each document, i.e., H1_0001, ... H1_3000 and H2_0001,... H2_3000.</li>
    <li><b>site</b>: Either H1 or H2, representing Farm 1 or Farm 2</li>
    <li><b>file_name</b>: The corresponding wav file name, i.e., the id plus the ".wav" extension, for example, H1_0001.wav.</li>
    <li><b>v31</b>: The 31 feature variables from Nguyen et al., 2021.</li>
    <li><b>IOA_3rdOctave</b>: The filtered 3rd octave band data with a 100 millisecond frame.</li>
</ul>

Analysis Result:
<ul>
    <li><b>nguyen_scores</b>: The 1 to 5 rating scores in Nguyen and the probability of AM existence from Random Forest analysis.</li>
    <li><b>cluster_v31_minmax</b>: Cluster id for K=2, 3, ..., 10 using the minmax scaled 31 variables.</li>
    <li><b>IOA_rating</b>: The Prominence and Modulation Depth ratings on the IOA_3rdOctave data.</li>
    <li><b>YAMNET_class</b>: The top five classes picked by YAMNET and their corresponding probabilities</li>    
</ul>

Sklearn models created and saved: path = './/trained_models//model_type//', where model_type is either scalers or clustering
<ul>
    <li><b>MinMax_scaler</b>: Min-max scaler fitted on the original 31 feature variables.</li>
    <li><b>KMeans cluster on v31_minmax</b>: Nine clustering models for K=2, 3, ..., 10 using the min-max scaled 31 variables.</li>   
</ul>
</div>   
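
A quick way to confirm that a stored document carries every field listed above is to compare its top-level keys with the expected names. The sketch below is illustrative only: `EXPECTED_FIELDS` and `missing_fields` are hypothetical names, and the field list is copied from the keys written in the cells of this notebook.

```python
EXPECTED_FIELDS = (
    "_id", "site", "file_name", "v31", "IOA_3rdOctave",                   # data
    "nguyen_scores", "cluster_v31_minmax", "IOA_rating", "YAMNET_class",  # results
)

def missing_fields(doc, expected=EXPECTED_FIELDS):
    """Return the expected top-level fields that are absent from `doc`."""
    return [name for name in expected if name not in doc]

# Stand-in for a document that has only the data fields so far.
partial_doc = {"_id": "H1_0001", "site": "H1", "file_name": "H1_0001.wav",
               "v31": {}, "IOA_3rdOctave": {}}
print(missing_fields(partial_doc))
```

With a live connection, `missing_fields(coll_data.find_one({"_id": "H1_0001"}))` should return an empty list once all update cells have run.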