# Clustering a subset of the complete dataset on Min_Max scaled data

#### A note on converting Jupyter Notebook output to MS Word Documents

- The best way to convert the ipynb file (Jupyter Notebook) to a docx file is to follow the two step approach explained in: https://blog.ouseful.info/2017/06/13/using-jupyter-notebooks-for-assessment-export-as-word-docx-extension/, i.e. run the following two commands on the anaconda command line
- Step 1: $ jupyter nbconvert --no-input --to html file_name.ipynb (use --no-input to exclude code cells, i.e. convert only markdown cells)

- Step 2: $ pandoc -s file_name.html -o file_name.docx
- This approach produces, by far, the best quality docx ouput, no distortion of either the text or the graphs.  The only drawback is that it include the hidden code cells made by the "hide all" extension.  I have to manually delete those contents from the produced docx file.
#### convert ijynb to a docx file
- https://nbconvert.readthedocs.io/en/latest/index.html
- install nbconvert [pip install nbconvert]
- install pandoc: [https://github.com/jgm/pandoc/releases/tag/3.1.3]

In [1]:
import sys

oneDrive_root={}
oneDrive_root[1]="C:\\Users\\Chihyang\OneDrive for Business\\"
oneDrive_root[2]="C:\\Users\\Chihyang\OneDrive for Business\\"
oneDrive_root[3]="C:\\Users\\tsaic\\OneDrive - State University of New York at New Paltz\\"  # laptop

site=3   # the short or long business OneDrive directory name
lib_dir=oneDrive_root[site]+'\\Prudentia\\Tsaipy'
# append additional library path for this study
sys.path.append(lib_dir)

In [2]:
import pandas as pd
import numpy as np
#import sklearn as sk
#from sklearn.cluster import KMeans
import joblib

from WFN_lib.WindFarm import distance2center, write_df_sub2Excel
from WFN_lib.mongodb_util import flatten_dictionary
from WFN_lib.cluster_classify import cluster_analysis
from dbconnect.mongodb import cursor_to_dataframe3
import copy

In [3]:
from pymongo import MongoClient
import pymongo
from dbconnect import mongodb as mdb

<div class="alert alert-block alert-info"><b>MongoDb Parameters:</b>

</div>

In [4]:
## pymongo connect to mongodb database and collection
db_name = "Windfarm_S6000"
client = MongoClient('localhost', 27017)  # connect to the db engine
db = client[db_name]
coll_data=db['S6000_Data']

print(f"Rows in coll_data = {coll_data.count_documents({})}")


Rows in coll_data = 6000


<div class="alert alert-block alert-info"><b>Test access to nested keys of a collection:</b> 
<ul>
<li>Since pymongo uses "." to connect (as a separator) keys in two <b>nested levels</b>, collections should not have column names that contain "." to avoid confusion.</li>
<li>The warning about count deprecated and suggest using Collection.count_documents instead is not true.  count_documents only works for <b>collections</b> not on <b>cursor</b>.  Ignore the warning for the time being.</li>
<li>The following example, retrieve K=5 cluster allotment, distance to center, and ranking from the closest to center to the farthest to center.</li>
</ul>

</div>

<div class="alert alert-block alert-success"><b>Find clustering result and related information from a MongoDB collection and convert the resulting cursor to a dataframe</b>
<ul>
<li>information requested include: cluster id, distance to cluster center, and ranking based on distance to cluster center.</li>
<li>Note that distance to center and ranking of distance to center (from closest to farthest) are with respect to each cluster, i.e., computaion done cluster by cluster.</li>
<li>For <b>nexted MongoDB columns</b>, use a dot "." to separate column names at two differnet levels.</li>
</ul></div>

### We compare the cluster membership of three cluster analyses
- Test 1: S3000 vs S6000
- Test 2: S843 vs S3000
- For S3000 and S6000, we use K=5.   For __S843__, we can test __various K values__ available in the KMeans analysis.
- We only analyze various values of __K__ for the __S843__ set again __K=5__ for S3000.
- The comparison between S3000 and S6000 is also based on __K=5__

<div class="alert alert-block alert-info"><b>Cluster Consistency Test 1:</b> Check clustering result of the H1 3000 samples with K=5 (1) when using the 3000 samples to cluster, and (3) the 3000 sampls' clustering result when all 6000 cases from H1 and H2 are included in clustering.  If consistent, a large portion of cases should appear on the diagonal of the crosstab table.</div>


#### Retrive the H1 3000 samples with the 31 Nguyen feature variables, minmax scaled them and cluster them with K=5
- Minmax scaled the variables <b>based on the 3000 cases</b>, using the scaler saved earlier.
- Cluster the 3000 cases using the minmax scaled v31, using the clustering model saved earlker

In [16]:
## set number of clusters to be analyzed
nk=5

In [17]:
res=db['S6000_Data'].find({'site':'H1'},{'v31':1})
df_S3000 = cursor_to_dataframe3(res,"_id")
row_names = df_S3000.index
col_names = df_S3000.columns

## load the trained model: scaler to minmax scale the H1 S3000 and cluster them based on the saved clustering model
Minmax_scaler = joblib.load('.//trained_models//scalers//Minmax_scaler_v31_H1_S3000.joblib')
model_cluster = joblib.load('.//trained_models//clustering//KMeans_H1_S3000_v31_MinMax_K='+str(nk)+'.joblib')

df_S3000_scaled = Minmax_scaler.transform(df_S3000)
df_S3000_scaled = pd.DataFrame(df_S3000_scaled, index=row_names, columns=col_names)
## predict the cluster label
cluster3000 = model_cluster.predict(df_S3000_scaled)
df_S3000_scaled['cluster']=cluster3000
print(cluster3000)


[0 1 0 ... 1 2 3]


#### Retrive the clustered result of K=5 on the whole 3000 samples (clustering was done with the 31 minmax scaled variable on all 6000 cases)
- Rename the cllusters_v31_minmax.K=5 column to S6000.K=5 to avoid repetition with that from the df_fr3000
- Note that the clustering was done on all 6000 samples. We just extract the 3000 samples from H1 here for comparison.

In [19]:
### Compare it with the result from all 6000 samples using v31_minmax
## Retrieve the cluster information from the S6000_Data but only for the 3000 samples from site H1
## clustering result is based on the clustering of all 6000 samples but we only retrieve the result for the first 3000 samples.
res = db['S6000_Data'].find({'site':'H1'},{'cluster_v31_minmax.K='+str(nk):1})
df_S6000_K = cursor_to_dataframe3(res,"_id")
print(df_S6000_K.head(3))
## replace df_S6000_K column name 'clusters.K=5' with 'S6000_clusters.K=5'
df_S6000_K.rename(columns={'cluster_v31_minmax.K='+str(nk):'S6000.K='+str(nk)}, inplace=True)
print(df_S6000_K.head(3))
print(df_S6000_K.shape)
## add the clustering result based on H1 S3000
df_S6000_K['S3000.K='+str(nk)] = cluster3000

         cluster_v31_minmax.K=5
H1_0001                       0
H1_0002                       4
H1_0003                       0
         S6000.K=5
H1_0001          0
H1_0002          4
H1_0003          0
(3000, 1)


#### Crosstab comparison of the cluter membership between the two approaches.

In [21]:
crsstb=pd.crosstab(df_S6000_K['S3000.K='+str(nk)], df_S6000_K['S6000.K='+str(nk)], margins=True)
print(crsstb)
crsstb.to_excel(".\\Results_H1\\Crosstab_S3000_vs_S6000_v31_minmax_K='+str(nk)+'.xlsx")

S6000.K=5     0    1    2    3    4   All
S3000.K=5                                
0           974   17   46    8   16  1061
1             4    0  406    0  427   837
2            43    0   10    5    0    58
3            33  695   86    0    0   814
4             0    0   30  200    0   230
All        1054  712  578  213  443  3000


<div class="alert alert-block alert-info"><b>Clean Set Test 2:</b> Check clustering result of the H1 3000 K=5 clustering result against S843 clustering result. This is analogous with a false positive test and a false negative test.

From S3000 perspective
<ol>
<li>Crosstab test to see if a cluster of S483 falls totally inside a cluster on S3000 or not.</li>
<li>Use a left join of S3000's with S843.</li>
<li>For cases in S3000 but not in S843, the cluster value of S843 is NaN which is replaced by an artifical cluster 10.</li>
<li>The last column in the crosstab where cluster Id is 10 are S3000 cases not in the clean set S843,</li>
<li>We observed Cluster 2 of S3000 has no cases in the clean set and Cluster 4 only has one case in the clean set.</li>
</ol>

From S843 perspective
<ol>
<li>Cluster 0 of S843 mostly come from Cluster 3 of S3000.</li>
<li>Cluster 1 of S843 mostly come from Cluster 1 of S3000..</li>
<li>Cluster 2 of S843 mostly come from Cluster 0 of S3000..</li>
<li>Cluster 3 of S843 mostly come from Cluster 1 of S3000..</li>
<li>Cluster 4 of S843 all come from Cluster 3 of S3000..</li>
<li>The last column in the crosstab where cluster Id is 10 are S3000 cases not in the clean set S843,</li>
<li>We observed Cluster 2 of S3000 has no cases in the clean set and Cluster 4 only has one case in the clean set.</li>
</ol>
</div>


In [22]:
## target number of clusters for analyzing S843
target_K_S843=5

In [23]:
## Get the clustering results of the clean S843 set with three different methods, v31_minmax, IOA, and combined
## Retrieve cluster ID, distance to cluster center, and ranking of distance to cluster center from the shortest to the longest
## Three methods each with three columns, 3x3=9 columns in total
## Set the k value for S843;  For S3000 and #6000, K is always 5
k=target_K_S843
res=db['H1_S843_clusters'].find({},{'v31_minmax.K='+str(k):1,'v31_minmax.dist_K='+str(k):1,
        'v31_minmax.rank_K='+str(k):1,'IOA.K='+str(k):1,'IOA.dist_K='+str(k):1,
        'IOA.rank_K='+str(k):1,'combined.K='+str(k):1,'combined.dist_K='+str(k):1,'combined.rank_K='+str(k):1})
# find number of documents in res
#print(res.count())

## convert the cursor to a dataframe where the cursor is the return value of a mongodb query with keys up to three levels (nested keys)
df_mul843 = cursor_to_dataframe3(res,"_id")
print(df_mul843.shape)
print(df_mul843.head(1))

(843, 9)
         v31_minmax.K=5  v31_minmax.dist_K=5  v31_minmax.rank_K=5  IOA.K=5  \
H1_0006             2.0          1947.159431                127.0      4.0   

         IOA.dist_K=5  IOA.rank_K=5  combined.K=5  combined.dist_K=5  \
H1_0006      0.122979         293.0           4.0         387.975172   

         combined.rank_K=5  
H1_0006               35.0  


### Left join S3000 with S843
- Cluster 2 and Cluster 4 from S3000 have only one case in the clean set S843. This implies that Clusters 2 and 4 contain contaminated samples.
- However, Clusters 0, 1, and 3 have higher percentage of clean samples, close to 70%.


In [24]:
k=target_K_S843
df_excl=df_S6000_K.join(df_mul843,how='left')
print(df_excl.shape)
#print(df_excl.head(3))
### replace NaN with 10
df_excl.fillna(10,inplace=True)
print(df_excl.head(2))
## crosstab 
df_excl.rename(columns={'v31_minmax.K='+str(k):'S843.K='+str(k)}, inplace=True)
## Note that here nk and k can be different values, nk for the whole 3000 samples and k for the S843 samples
crsstb = pd.crosstab(df_excl['S3000.K='+str(nk)], df_excl['S843.K='+str(k)], margins=True)
print(crsstb)


(3000, 11)
         S6000.K=5  S3000.K=5  v31_minmax.K=5  v31_minmax.dist_K=5  \
H1_0001          0          0            10.0                 10.0   
H1_0002          4          1            10.0                 10.0   

         v31_minmax.rank_K=5  IOA.K=5  IOA.dist_K=5  IOA.rank_K=5  \
H1_0001                 10.0     10.0          10.0          10.0   
H1_0002                 10.0     10.0          10.0          10.0   

         combined.K=5  combined.dist_K=5  combined.rank_K=5  
H1_0001          10.0               10.0               10.0  
H1_0002          10.0               10.0               10.0  
S843.K=5   0.0  1.0  2.0  3.0  4.0  10.0   All
S3000.K=5                                     
0            1  127    1  199    7   726  1061
1            0    1  140    4  116   576   837
2            0    0    0    0    0    58    58
3          216   18    2   10    0   568   814
4            0    1    0    0    0   229   230
All        217  147  143  213  123  2157  3000


### Crosstab analysis between the clustering on all H1 3000 samples and on clustering result based on the 843 clean samples with K=k where k = 3, 4, 5, 6 defined by the variable target_K_S843

The crosstab result indicates that 
- Cluster 2 based on v31_minmax.K=5 does not appear in the clean set at all.
- Cluster 4  only has one case, H1_0894, appeared in the clean set.

In [32]:
## inner join df_fr3000 and df_mul on index
df_cluster = df_S6000_K.join(df_mul843, how='inner')
print(df_cluster.head(1))
### cross tabulation of cluster assignment between cluster.K=5 and v31_minmax.K=5
print("S3000_K=5 vs. S843.v31_minmax:\n",pd.crosstab(df_cluster['S3000.K='+str(nk)], df_cluster['v31_minmax.K='+str(k)],margins=True))
print("S3000_K=5 vs. S843.IOA:\n",pd.crosstab(df_cluster['S3000.K='+str(nk)], df_cluster['IOA.K='+str(k)],margins=True))
print("S3000_K=5 vs. S843_combined:\n",pd.crosstab(df_cluster['S3000.K='+str(nk)], df_cluster['combined.K='+str(k)],margins=True))



         S6000.K=5  S3000.K=5  v31_minmax.K=5  v31_minmax.dist_K=5  \
H1_0006          2          1             2.0          1947.159431   

         v31_minmax.rank_K=5  IOA.K=5  IOA.dist_K=5  IOA.rank_K=5  \
H1_0006                127.0      4.0      0.122979         293.0   

         combined.K=5  combined.dist_K=5  combined.rank_K=5  
H1_0006           4.0         387.975172               35.0  
S3000_K=5 vs. S843.v31_minmax:
 v31_minmax.K=5  0.0  1.0  2.0  3.0  4.0  All
S3000.K=5                                   
0                 1  127    1  199    7  335
1                 0    1  140    4  116  261
3               216   18    2   10    0  246
4                 0    1    0    0    0    1
All             217  147  143  213  123  843
S3000_K=5 vs. S843.IOA:
 IOA.K=5    0.0  1.0  2.0  3.0  4.0  All
S3000.K=5                              
0           48    4   21  103  159  335
1           32    1   46   73  109  261
3           83    3    5   38  117  246
4            0    0    1

### The following analysis compare each pair of result
- Crosstab between v31_minmax and combined shows the highest correlation.
- This indicates that v31_minmax is a more dominant factor than IOA in terms of differentiate cases.

In [25]:
print("v31_minmax vs. IOA:\n",pd.crosstab(df_cluster['v31_minmax.K='+str(k)], df_cluster['IOA.K='+str(k)]))
print("v31_minmax vs. combined:\n",pd.crosstab(df_cluster['v31_minmax.K='+str(k)], df_cluster['combined.K='+str(k)]))
print("IOA vs. combined:\n",pd.crosstab(df_cluster['IOA.K='+str(k)], df_cluster['combined.K='+str(k)]))

v31_minmax vs. IOA:
 IOA.K=4         0.0  1.0  2.0  3.0
v31_minmax.K=4                    
0.0             110    3   11   94
1.0              61    1   28   36
2.0             211    4   46   85
3.0              67    0   42   44
v31_minmax vs. combined:
 combined.K=4    0.0  1.0  2.0  3.0
v31_minmax.K=4                    
0.0               0  218    0    0
1.0             116    1    2    7
2.0               2    0  344    0
3.0               2    0    0  151
IOA vs. combined:
 combined.K=4  0.0  1.0  2.0  3.0
IOA.K=4                         
0.0            60  111  213   65
1.0             1    3    4    0
2.0            22   11   46   48
3.0            37   94   83   45


In [26]:
## Find the case where cluster.K=5 is in cluster 4.  
## Cluster 2 might be the noisy cluster.
df_cluster.loc[df_cluster['S3000_clusters.K=5']==4]

Unnamed: 0,S3000_clusters.K=5,v31_minmax.K=4,v31_minmax.dist_K=4,v31_minmax.rank_K=4,IOA.K=4,IOA.dist_K=4,IOA.rank_K=4,combined.K=4,combined.dist_K=4,combined.rank_K=4
H1_0894,4.0,2.0,1.301569,346.0,2.0,1.805784,127.0,2.0,2.303932,346.0


### Get the IOA_rating and YAMnet data from the complete set of 6000 for
- the S3000 subset: df_IOA_YAM_S3000 after converting to a dataframe
- the 843 clean subset: df_IOA_YAM_S843 after converting to a dataframe

In [27]:
### find the document in collection 'Results_S6000' where _id contains 'H1'
res=db['Results_S6000'].find({'_id':{'$regex':'H1'}},{'IOA_rating':1, 'YAMnet':1})
print(res.count())
df_IOA_YAM_S3000 = cursor_to_dataframe3(res,"_id")
print(df_IOA_YAM_S3000.head(3))
print(df_IOA_YAM_S3000.shape)

  print(res.count())


3000
         IOA_rating.025_100Hz.prominence  IOA_rating.025_100Hz.L5-L95  \
H1_0001                        14.307293                     3.114064   
H1_0002                         5.458717                     3.831529   
H1_0003                         4.129869                     3.878474   

         IOA_rating.050_200Hz.prominence  IOA_rating.050_200Hz.L5-L95  \
H1_0001                        25.675691                     1.621372   
H1_0002                         4.284061                     3.762765   
H1_0003                        11.068699                     3.421233   

         IOA_rating.100_400Hz.prominence  IOA_rating.100_400Hz.L5-L95  \
H1_0001                        36.235453                     1.477713   
H1_0002                         1.733793                     2.792567   
H1_0003                       104.565394                     3.803469   

         IOA_rating.200_800Hz.prominence  IOA_rating.200_800Hz.L5-L95  \
H1_0001                        50.222898   

In [28]:
## retrieve the IOA_rating and YAMnet of the 843 clean samples from the collection 'Results_S6000'
S843 = db['Subsets'].find_one({'_id':'S843'})['list']   # get the ID name list of the 843 clean samples
res=db['Results_S6000'].find({'_id':{'$in':S843}},{'IOA_rating':1, 'YAMnet':1})

df_IOA_YAM_S843 = cursor_to_dataframe3(res,"_id")
print(df_IOA_YAM_S843.head(3))
print(df_IOA_YAM_S843.shape)

         IOA_rating.025_100Hz.prominence  IOA_rating.025_100Hz.L5-L95  \
H1_0006                         1.157707                     2.038109   
H1_0007                         1.647752                     5.391628   
H1_0008                         2.281802                     3.641156   

         IOA_rating.050_200Hz.prominence  IOA_rating.050_200Hz.L5-L95  \
H1_0006                         1.098841                     1.859755   
H1_0007                         1.380059                     4.596037   
H1_0008                         2.428720                     3.727153   

         IOA_rating.100_400Hz.prominence  IOA_rating.100_400Hz.L5-L95  \
H1_0006                         3.436457                     2.442856   
H1_0007                         0.819226                     3.693764   
H1_0008                         1.454394                     3.026670   

         IOA_rating.200_800Hz.prominence  IOA_rating.200_800Hz.L5-L95  \
H1_0006                         2.449454        

### Join all information for the clean subset

In [29]:
df_all = df_cluster.join(df_IOA_YAM_S843, how='inner')
print(df_all.shape)

(843, 28)


### Organize data by methods and output to Excel files
- Each method (v31_minmax, IOA, combined) will output an Excel file
- Under each method, documents are organized by cluster using K=5, one cluster in one Worksheet
- Within each seat, documents are listed by the closest to center ranking 
- IOA result with -100 was placed by 0 when running clustering analysis.  However, in displacing the source data, we still show -100 to indicae a peak is not found in either first, second, or third harmonic


In [30]:
df_all.columns

Index(['S3000_clusters.K=5', 'v31_minmax.K=4', 'v31_minmax.dist_K=4',
       'v31_minmax.rank_K=4', 'IOA.K=4', 'IOA.dist_K=4', 'IOA.rank_K=4',
       'combined.K=4', 'combined.dist_K=4', 'combined.rank_K=4',
       'IOA_rating.025_100Hz.prominence', 'IOA_rating.025_100Hz.L5-L95',
       'IOA_rating.050_200Hz.prominence', 'IOA_rating.050_200Hz.L5-L95',
       'IOA_rating.100_400Hz.prominence', 'IOA_rating.100_400Hz.L5-L95',
       'IOA_rating.200_800Hz.prominence', 'IOA_rating.200_800Hz.L5-L95',
       'YAMnet.class.top_1', 'YAMnet.class.top_2', 'YAMnet.class.top_3',
       'YAMnet.class.top_4', 'YAMnet.class.top_5', 'YAMnet.prob.prob_1',
       'YAMnet.prob.prob_2', 'YAMnet.prob.prob_3', 'YAMnet.prob.prob_4',
       'YAMnet.prob.prob_5'],
      dtype='object')

<div class="alert alert-block alert-info"><b>Characterstics of the five clusters of S843:</b>

The v31_minmax clustering result has better cluster commonality when interpreted with YAMnet result.
<ul>
<li>Cluster 0: 151 cases; The first choice class is <b>all "Silence"</b> for all cases with mostly high confidence, but still a few low confidence cases.</li>
<li>Cluster 1: 155 cases; The first choice class has a lot of "White Noise", but the confidence level is low.</li>
<li>Cluster 2: 327 cases; The first choice class has quite a few "Vehicle", but the confidence level is low.</li>
<li>Cluster 3: 125 cases; The first choice class has a lot of "Animal", but the confidence level is low.</li>
<li>Cluster 4: 85 cases; The first choice class is <b>all "Silence"</b> for all cases with mostly high confidence, fewer low confidence cases than Cluster 0.</li>
</ul>

</div>

In [31]:
print(target_K_S843)

4


In [32]:
K=target_K_S843
all_columns = df_all.columns
for method in ['v31_minmax','IOA','combined']:  
    ## find columns that contain the method name
    columns=list()
    columns = [col for col in all_columns if method+"." in col]  # clustering result for S483
    columns.extend([col for col in all_columns if 'S3000_clusters' in col])  # clustering result for S3000
    columns.extend([col for col in all_columns if 'IOA_' in col])
    columns.extend([col for col in all_columns if 'YAMnet.' in col])
    print("\n",columns)
    df_sub = df_all[columns]
    write_df_sub2Excel(df_sub,method, K, f".//Results_H1//S843_Results//S843_{method}_K={K}.xlsx")



 ['v31_minmax.K=4', 'v31_minmax.dist_K=4', 'v31_minmax.rank_K=4', 'S3000_clusters.K=5', 'IOA_rating.025_100Hz.prominence', 'IOA_rating.025_100Hz.L5-L95', 'IOA_rating.050_200Hz.prominence', 'IOA_rating.050_200Hz.L5-L95', 'IOA_rating.100_400Hz.prominence', 'IOA_rating.100_400Hz.L5-L95', 'IOA_rating.200_800Hz.prominence', 'IOA_rating.200_800Hz.L5-L95', 'YAMnet.class.top_1', 'YAMnet.class.top_2', 'YAMnet.class.top_3', 'YAMnet.class.top_4', 'YAMnet.class.top_5', 'YAMnet.prob.prob_1', 'YAMnet.prob.prob_2', 'YAMnet.prob.prob_3', 'YAMnet.prob.prob_4', 'YAMnet.prob.prob_5']
v31_minmax.K=4
[0 1 2 3]
(218, 22)
Sort rows by:  v31_minmax.rank_K=4
cluster = 0:  Shape of dk: (218, 22)
         v31_minmax.K=4  v31_minmax.dist_K=4  v31_minmax.rank_K=4  \
H1_0508             0.0             0.115812                  1.0   
H1_0912             0.0             0.118935                  2.0   
H1_1872             0.0             0.129841                  3.0   

         S3000_clusters.K=5  IOA_rating.025

  warn("Calling close() on already closed file.")
  warn("Calling close() on already closed file.")


[0 1 2 3]
(449, 22)
Sort rows by:  IOA.rank_K=4
cluster = 0:  Shape of dk: (449, 22)
         IOA.K=4  IOA.dist_K=4  IOA.rank_K=4  S3000_clusters.K=5  \
H1_2801      0.0      0.033643           1.0                 0.0   
H1_2055      0.0      0.035155           2.0                 3.0   
H1_1060      0.0      0.036184           3.0                 0.0   

         IOA_rating.025_100Hz.prominence  IOA_rating.025_100Hz.L5-L95  \
H1_2801                         2.984913                     2.807993   
H1_2055                         2.489346                     2.703953   
H1_1060                         3.609509                     2.749440   

         IOA_rating.050_200Hz.prominence  IOA_rating.050_200Hz.L5-L95  \
H1_2801                         2.982218                     1.894821   
H1_2055                         2.507032                     1.993388   
H1_1060                         1.572941                     1.780790   

         IOA_rating.100_400Hz.prominence  IOA_rating.100

  warn("Calling close() on already closed file.")


<div class="alert alert-bloc, alert-info"><b>Analyze cases in S3000 but not in S843</b>
abc
</div>

In [33]:
K=5
## find cases in S3000 but not in S843 and get the clutering results, including distance-to-centers and their ranking of these cases
res = db['Data_Results_H1_S3000'].find({'_id':{'$nin':S843}},{'cluster_v31_minmax':1})
df_H1_NonS843 = cursor_to_dataframe3(res,"_id")
print(df_H1_NonS843.shape)
print(df_H1_NonS843.head(3))
### Keep only columns whose names contain 'K=5'
df_H1_NonS843 = df_H1_NonS843.filter(regex='K=5')
print(df_H1_NonS843.shape)

res = db['Results_S6000'].find({'_id':{'$in':list(df_H1_NonS843.index)}},{'IOA_rating':1, 'YAMnet':1})
df_IOA_YAM_NonS843 = cursor_to_dataframe3(res,"_id")
print(df_IOA_YAM_NonS843.shape)

df_H1_NonS843_join = df_H1_NonS843.join(df_IOA_YAM_NonS843, how='inner')
print(df_H1_NonS843_join.shape)

(2157, 12)
         cluster_v31_minmax.K=3  cluster_v31_minmax.K=4  \
H1_0001                     1.0                     2.0   
H1_0002                     2.0                     1.0   
H1_0003                     0.0                     2.0   

         cluster_v31_minmax.K=5  cluster_v31_minmax.K=6  \
H1_0001                     0.0                     3.0   
H1_0002                     1.0                     1.0   
H1_0003                     0.0                     3.0   

         cluster_v31_minmax.dist_K=3  cluster_v31_minmax.rank_K=3  \
H1_0001                     0.539937                       1383.0   
H1_0002                     0.557416                        807.0   
H1_0003                     0.603725                        109.0   

         cluster_v31_minmax.dist_K=4  cluster_v31_minmax.rank_K=4  \
H1_0001                     0.552750                        934.0   
H1_0002                     0.585724                        740.0   
H1_0003                     0.6

In [34]:
df_H1_NonS843.shape

(2157, 3)

In [None]:
df_H1_NonS843_join.columns

In [None]:
method = 'cluster_v31_minmax'
write_df_sub2Excel(df_H1_NonS843_join, method, K, f".//Results_H1//S843_Results//H1_nonS843_{method}_K={K}.xlsx")