<div class="alert alert-block alert-success"><b>Working on the H1 Data</b>: This file contains some analysis of the 3000 samples from H1 (Windfarm 1).
Specifically, it
<ul>
<li>Retrieves the H1 data from the MongoDB collection S6000_Data.</li>
<li>Min-max scales the 3000 samples on the 31 feature variables.</li>
<li>Clusters the 3000 samples with K=3, 4, 5, and 6 using the v31_minmax variables, and saves the clustering models.</li>
<li>Trains an SVC classification model on the clusters with K=5 and saves the classification model.</li>
</ul>

<em>!!! This program does not add, delete or update any collection in the database Windfarm_S6000.</em>

</div>

#### A note on converting Jupyter Notebook output to MS Word Documents

- The best way to convert the ipynb file (Jupyter Notebook) to a docx file is the two-step approach explained in https://blog.ouseful.info/2017/06/13/using-jupyter-notebooks-for-assessment-export-as-word-docx-extension/, i.e. run the following two commands on the Anaconda command line:
- Step 1: $ jupyter nbconvert --no-input --to html file_name.ipynb (use --no-input to exclude the code cell inputs, keeping the markdown cells and the cell outputs)
- Step 2: $ pandoc -s file_name.html -o file_name.docx
- This approach produces by far the best-quality docx output, with no distortion of either the text or the graphs.  The only drawback is that it includes the hidden code cells made by the "hide all" extension, so I have to delete that content manually from the generated docx file.

#### Convert ipynb to a docx file
- https://nbconvert.readthedocs.io/en/latest/index.html
- install nbconvert [pip install nbconvert]
- install pandoc: [https://github.com/jgm/pandoc/releases/tag/3.1.3]
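
The two conversion steps described above can also be scripted. The following is a minimal sketch, not part of the original workflow: the helper names `build_conversion_commands` and `convert_notebook_to_docx` are hypothetical, and it assumes that `jupyter` and `pandoc` are on the PATH and that nbconvert writes the HTML file next to the notebook.

```python
import shutil
import subprocess
from pathlib import Path


def build_conversion_commands(notebook, no_input=True):
    """Return the nbconvert and pandoc argument lists for one notebook."""
    nb_path = Path(notebook)
    # Assumption: nbconvert writes file_name.html next to file_name.ipynb
    html_path = nb_path.with_suffix(".html")
    docx_path = nb_path.with_suffix(".docx")
    nbconvert_cmd = ["jupyter", "nbconvert", "--to", "html", str(nb_path)]
    if no_input:
        # Step 1 option: leave the code cell inputs out of the HTML
        nbconvert_cmd.insert(2, "--no-input")
    # Step 2: pandoc turns the standalone HTML into a docx file
    pandoc_cmd = ["pandoc", "-s", str(html_path), "-o", str(docx_path)]
    return nbconvert_cmd, pandoc_cmd


def convert_notebook_to_docx(notebook):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for tool in ("jupyter", "pandoc"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} was not found on PATH")
    for cmd in build_conversion_commands(notebook):
        subprocess.run(cmd, check=True)
```

Calling `convert_notebook_to_docx("file_name.ipynb")` would then run Step 1 and Step 2 in order and stop at the first failure.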

In [1]:
import sys

oneDrive_root={}
oneDrive_root[1]="C:\\Users\\Chihyang\\OneDrive for Business\\"
oneDrive_root[2]="C:\\Users\\Chihyang\\OneDrive for Business\\"
oneDrive_root[3]="C:\\Users\\tsaic\\OneDrive - State University of New York at New Paltz\\"  # laptop

site=3   # the short or long business OneDrive directory name
lib_dir=oneDrive_root[site]+'Prudentia\\Tsaipy'  # the root already ends with a backslash
# append additional library path for this study
sys.path.append(lib_dir)

In [2]:
import pandas as pd
import numpy as np
#import sklearn as sk
#from sklearn.cluster import KMeans
from dbconnect.mongodb import cursor_to_dataframe3
from WFN_lib.WindFarm import distance2center,  write_df_sub2Excel
from WFN_lib.mongodb_util import flatten_dictionary
from WFN_lib.cluster_classify import cluster_analysis


In [3]:
from pymongo import MongoClient
import pymongo
from dbconnect import mongodb as mdb

<div class="alert alert-block alert-info"><b>MongoDb Parameters:</b>

</div>

In [4]:
## pymongo connect to mongodb database and collection
db_name = "Windfarm_S6000"
client = MongoClient('localhost', 27017)  # connect to the db engine
db = client[db_name]
coll_data=db['S6000_Data']

print(f"Rows in coll_data = {coll_data.count_documents({})}")


Rows in coll_data = 6000


<div class="alert alert-block alert-warning"><b>I. Clustering the 3000 cases in H1</b>

<b>DO NOT</b> use the <b>v31_minmax</b> values obtained from S6000, since they are scaled using all 6000 cases.  <b>For validation purposes</b> with the H2 files, we need to scale v31 using only the 3000 cases in H1.

Thus, for the S3000 database, we only store v31_minmax variables scaled on the 3000 files in H1.  If you need variable values in the original scale, retrieve them from collection <b>S6000_Data</b> using the indices <b>H1_0001</b> to <b>H1_3000</b>.
</div>
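
To illustrate why the scaling must be fitted on H1 alone, here is a minimal sketch with synthetic data. The arrays `X_h1` and `X_h2` and the helper `minmax_with_h1` are hypothetical stand-ins, not the project's data or code. The minimum and range are taken from H1 only, and the same transform is then reused for H2, so H2 never influences the scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 31 features of the H1 and H2 samples
X_h1 = rng.normal(loc=50.0, scale=10.0, size=(3000, 31))
X_h2 = rng.normal(loc=55.0, scale=12.0, size=(3000, 31))

# Fit the min-max parameters on H1 only
h1_min = X_h1.min(axis=0)
h1_range = X_h1.max(axis=0) - h1_min  # assumes no feature is constant in H1


def minmax_with_h1(X):
    """Scale X with the per-feature minimum and range learned from H1."""
    return (X - h1_min) / h1_range


X_h1_scaled = minmax_with_h1(X_h1)  # every feature spans exactly [0, 1]
X_h2_scaled = minmax_with_h1(X_h2)  # may fall outside [0, 1], which is expected

print(X_h1_scaled.min(), X_h1_scaled.max())
print(X_h2_scaled.min(), X_h2_scaled.max())
```

Values of H2 outside [0, 1] simply indicate samples beyond the range seen in H1; they are not an error in the scaling.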

### Extract data from H1

In [5]:
res = coll_data.find({'site':'H1'},{'v31':1})  # index '_id' is automatically added
df_H1 = cursor_to_dataframe3(res,"_id")
print(df_H1.shape)
print(df_H1.head(3))

(3000, 31)
         v31.spectralCentroid  v31.spectralCrest  v31.spectralDecrease  \
H1_0001             63.081251          28.079092            -92.466205   
H1_0002             57.650009          30.680069           -259.422066   
H1_0003             75.049887          25.424498            -22.843544   

         v31.spectralEntropy  v31.spectralFlatness  v31.spectralFlux  \
H1_0001             0.143241              0.010610      1.087143e-07   
H1_0002             0.067829              0.011763      4.547078e-05   
H1_0003             0.217764              0.008283      3.236673e-06   

         v31.spectralKurtosis  v31.spectralRolloffPoint  v31.spectralSkewness  \
H1_0001           1420.161272                115.968033             25.345770   
H1_0002            847.458521                 72.747519             22.280157   
H1_0003            180.938195                177.970026              9.196154   

         v31.spectralSlope  ...      v31.PR  v31.Fo  v31.AMfactor   v31.DAM  \

In [6]:
## Pass the unscaled data and let the function scale it based on the 3000 H1 files
## df_res is the dataframe with the cluster assignments; df_H1_scaled is the scaled data
## The trained KMeans models are saved to the directory save_model_dir
df_res, df_H1_scaled=cluster_analysis(df_H1, [3,4,5,6], random_state=42, scaler_type="MinMax",n_init=10, max_iter=300, tol=0.0001,
                                   save_model_dir='.//trained_models//clustering//', model_file_name='Kmeans_H1_S3000_v31_MinMax_K=', 
                                   scaler_model_dir='.//trained_models//scalers//', scaler_model_name='_scaler_v31_H1_S3000')
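
The internals of `cluster_analysis` (from the private `WFN_lib` package) are not shown in this notebook. As a rough sketch of what such a step might do, assuming it wraps scikit-learn's `MinMaxScaler` and `KMeans` and persists the models with joblib, the following runs the same idea on synthetic data. The file names, the temporary directory, and the data are all hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Small synthetic frame standing in for df_H1 (300 rows, 4 features)
X = pd.DataFrame(
    rng.random((300, 4)) * 100.0,
    index=[f"row_{i:04d}" for i in range(300)],
    columns=[f"f{j}" for j in range(4)],
)

# Scale with parameters fitted on this data set only
scaler = MinMaxScaler().fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), index=X.index, columns=X.columns)

model_dir = Path(tempfile.mkdtemp())  # hypothetical output directory
labels = pd.DataFrame(index=X.index)
for k in [3, 4, 5, 6]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10, max_iter=300, tol=1e-4)
    labels[f"K={k}"] = km.fit_predict(X_scaled)  # one label column per K
    joblib.dump(km, model_dir / f"kmeans_demo_K={k}.joblib")

# Reload one model later, as the next cell does for K=5
reloaded = joblib.load(model_dir / "kmeans_demo_K=5.joblib")
print(reloaded.cluster_centers_.shape)
```

Keeping one label column per K in a single frame mirrors the `K=3` ... `K=6` columns that appear in the output further down.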

In [7]:
import joblib
file1=joblib.load('.//trained_models//clustering//Kmeans_H1_S3000_v31_MinMax_K=5.joblib')
print("cluster center of the first two clusters:\n",file1.cluster_centers_[:2])

cluster center of the first two clusters:
 [[0.02119858 0.81810136 0.99622386 0.22347605 0.02835462 0.100438
  0.00387291 0.02865809 0.05283666 0.90645383 0.06808207 0.99802304
  0.02119692 0.49835447 0.37158984 0.24629471 0.46649812 0.56352498
  0.01013088 0.06774282 0.92194446 0.01180246 0.73023353 0.05695455
  0.03825488 0.12299717 0.12356561 0.08505314 0.05038632 0.04593872
  0.06330229]
 [0.02001092 0.8734187  0.98782653 0.17282922 0.04541925 0.26940534
  0.00797646 0.0247377  0.06823995 0.70452366 0.07394631 0.99487267
  0.01294426 0.57682307 0.31523801 0.17429152 0.36345346 0.08566308
  0.00397009 0.03529413 0.96251697 0.00525148 0.72414709 0.02542821
  0.02344513 0.08106072 0.12233221 0.09032051 0.04726544 0.03935171
  0.05459326]]


In [8]:
df_H1_clusters, cluster_size_H1, cluster_means_H1, cluster_vars_H1 = distance2center(df_H1_scaled, df_res, [3,4,5,6])
write_df_sub2Excel(df_H1_clusters, method='', n_clusters=5, filename='.//Results_H1//S3000_Results//Cluster_dist2center_H1_S3000_v31_minmax.xlsx')

K=5
[0 1 2 3 4]
(1061, 12)
Sort rows by:  rank_K=5
cluster = 0:  Shape of dk: (1061, 12)
         K=3  K=4  K=5  K=6  dist_K=3  rank_K=3  dist_K=4  rank_K=4  dist_K=5  \
H1_0307    1    2    0    3  0.321323       485  0.111703         1  0.109369   
H1_2459    1    2    0    3  0.222810       107  0.161323        28  0.121509   
H1_1010    1    2    0    3  0.280208       303  0.117239         2  0.122694   

         rank_K=5  dist_K=6  rank_K=6  
H1_0307         1  0.123694         6  
H1_2459         2  0.113775         1  
H1_1010         3  0.120491         2  
(837, 12)
Sort rows by:  rank_K=5
cluster = 1:  Shape of dk: (837, 12)
         K=3  K=4  K=5  K=6  dist_K=3  rank_K=3  dist_K=4  rank_K=4  dist_K=5  \
H1_1954    2    1    1    1  0.157574         3  0.129302         1  0.127962   
H1_2346    2    1    1    1  0.182414         6  0.145103         2  0.141174   
H1_1493    2    1    1    1  0.149650         1  0.145251         3  0.145966   

         rank_K=5  dist_K=6  r

  warn("Calling close() on already closed file.")


<div class="alert alert-block alert-info"><b>Write the K=5 clustering result with additional details into an Excel file:</b> Each cluster's data is written to its own worksheet in the file.

Within each cluster, rows are sorted from the nearest to the cluster center to the farthest.  For each row (case), the columns include
<ol>
<li>the cluster it is assigned to,</li>
<li>the distance to the cluster center,</li>
<li>the rank of that distance within the cluster,</li>
<li>two IOA ratings (prominence and modulation depth) for each of the four bands, and</li>
<li>the YAMNet top five classes with their corresponding probabilities (10 columns).</li>
</ol>

<b>Cluster Characteristics</b>:
<ul>
<li>Cluster 0: 1061 cases; low confidence even for the top YAMNet class,</li>
<li>Cluster 1: 837 cases; low confidence even for the top YAMNet class,</li>
<li>Cluster 2: 58 cases; high confidence for the top YAMNet class, which is frequently "silence",</li>
<li>Cluster 3: 814 cases; moderate confidence for the top YAMNet class, which is frequently "silence", and</li>
<li>Cluster 4: 230 cases; moderate confidence for the top YAMNet class, which is frequently "Animal".</li>
</ul>
</div>
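
The distance and rank columns described above can be sketched as follows. This is an illustration on a tiny hand-made data set, not the code of `distance2center`: it assumes Euclidean distance, and it uses each cluster's mean as a stand-in for the fitted KMeans center.

```python
import numpy as np
import pandas as pd

# Six toy cases with two features and a given cluster assignment
X = pd.DataFrame(
    [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [1.0, 0.7], [0.1, 0.0], [1.0, 0.8]],
    index=["a", "b", "c", "d", "e", "f"],
    columns=["f1", "f2"],
)
labels = pd.Series([0, 0, 1, 1, 0, 1], index=X.index, name="K=2")

# Cluster centers: here the per-cluster mean (assumption, see lead-in)
centers = X.groupby(labels).mean()

# Euclidean distance from each case to the center of its own cluster
diffs = X.to_numpy() - centers.loc[labels.to_numpy()].to_numpy()
out = pd.DataFrame(
    {"K=2": labels, "dist_K=2": np.linalg.norm(diffs, axis=1)}, index=X.index
)

# Rank 1 is the case closest to its cluster center
out["rank_K=2"] = out.groupby("K=2")["dist_K=2"].rank(method="first").astype(int)
out = out.sort_values(["K=2", "rank_K=2"])
print(out)
```

Sorting by cluster and then by rank gives the nearest-to-farthest ordering used in each worksheet.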

In [14]:
cnt=df_res['K=5'].value_counts()
cnt.sort_index(inplace=True)
print(cnt)


0    1061
1     837
2      58
3     814
4     230
Name: K=5, dtype: int64


<div class="alert alert-block alert-info"><b>Classification:</b> A three-step approach to select and train a classification model that separates the clusters.  This is just a test run; currently, no analysis is based on this classification.
<ul>
<li>Use the 3000 samples in H1.</li>
<li>Use the clustering result (K=k) on v31_minmax for the 3000 samples as the target variable y.</li>
<li>Train an SVC classifier to predict the cluster labels.</li>
</ul>
</div>
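
The three steps can be tried end to end on synthetic data before running them on H1. The sketch below is only an illustration: the data come from `make_blobs`, and the parameter grid is deliberately smaller than the one used in this notebook. Because the grid search and the cross-validation see the same samples, the cross-validated accuracy should be read as an optimistic estimate.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in: 300 samples, 4 features, 3 well-separated groups
X_raw, y = make_blobs(n_samples=300, centers=3, n_features=4,
                      cluster_std=1.0, random_state=42)
X = MinMaxScaler().fit_transform(X_raw)

# Step 1: grid search for the hyperparameters (small illustrative grid)
param_grid = {"C": [1, 10, 100], "gamma": [0.1, 1.0], "kernel": ["rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

# Step 2: ten-fold stratified cross-validation with the selected parameters
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(SVC(**grid.best_params_), X, y,
                         scoring="accuracy", cv=cv)

# Step 3: refit on all samples; this score is training accuracy only
final_model = SVC(**grid.best_params_).fit(X, y)
train_accuracy = final_model.score(X, y)

print(grid.best_params_)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Training accuracy: {train_accuracy:.3f}")
```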

In [15]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
import joblib

#### Step 1: Use GridSearch to determine the hyperparameters

In [21]:
## Set df_H1_scaled as X and df_res["K=5"] as y
X = df_H1_scaled
y = df_res["K=5"]
## Grid search for the best hyperparameters for SVC
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


{'C': 100, 'gamma': 0.1, 'kernel': 'rbf'}


#### Step 2: Use ten-fold cross-validation to evaluate the model

In [22]:
## Test the best hyperparameters using cross-validation
n_folds=10
model = SVC(**grid.best_params_)
cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(f"Average Accuracy from the {n_folds} trials: {scores.mean():.3f}  with standard deviation: {scores.std():.3f}")

Average Accuracy from the 10 trials: 0.984  with standard deviation: 0.010


#### Step 3: Fit the model with the best hyperparameters and save the trained model

In [23]:
## Use the best hyperparameters to fit SVC and save the trained model
model.fit(X, y)
## Save the trained model
joblib.dump(model, './/trained_models//classification//SVC_H1_S3000_v31_minmax_K=5.joblib')
print("Model saved")

## Accuracy on the training data itself (resubstitution), not a held-out estimate
y_pred = model.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")
print(confusion_matrix(y, y_pred))


Model saved
Accuracy: 0.996
[[1056    1    0    3    1]
 [   2  832    0    0    3]
 [   0    0   58    0    0]
 [   0    1    0  813    0]
 [   0    0    0    1  229]]


In [25]:
## Retrieve the saved model and use it to classify the data
model = joblib.load('.//trained_models//classification//SVC_H1_S3000_v31_minmax_K=5.joblib')
y_pred = model.predict(X)
print(f"Accuracy: {accuracy_score(y, y_pred):.3f}")

Accuracy: 0.996
