# KNN Component using FAISS library

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.

This KNN component is a wrapper around the FAISS Python library. It uses this [Python wrapper API](https://github.com/shankarpm/faiss_knn/blob/master/faiss_knn.py) to train models and make predictions with FAISS, and it can run on both CPU and GPU instances.

The base image of this component is [plippe/faiss-docker:1.4.0-gpu](https://hub.docker.com/r/plippe/faiss-docker/).

More information on FAISS can be found [here](https://github.com/facebookresearch/faiss).
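
To make the train-and-predict flow concrete, here is a minimal pure-Python sketch of k-nearest-neighbor classification by brute force. It does not call FAISS and is not the component's implementation; in the component, FAISS takes over the neighbor search step so that it stays fast on large data. The function name `knn_predict` and the toy data are illustrative assumptions.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training vector (brute force).
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    # Keep the k closest points and vote on their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters labeled 1 and 2.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_x, train_y, (0.3, 0.3), k=3))
print(knn_predict(train_x, train_y, (4.8, 5.0), k=3))
```

"Training" in this sketch is just storing the vectors; all the work happens at query time, which is why a fast index matters for large datasets.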

# Parameters

In [11]:
from ml_workflow import arguments
from IPython.display import HTML

args_metadata = arguments.get_args_metadata()

header = ['Name', 'Argument', 'Description']
rows = []
for arg in args_metadata:
    if arg['type'] == 'OutputPath':
        continue
    if isinstance(arg['type'], dict):
        arg['type'] = 'Enum:' + str(arg['type']['Enum'])
        
    under_desc = '\nOptional: {} | Default: {} | Type: {}'.format(
        arg['optional'], arg['default'], arg['type']
    )
    arg['description'] += under_desc
    if not arg['optional']:
        arg['name'] = "<b>{}</b>".format(arg['name'])
    row = [
        "{}".format(arg['name']),
        "{}".format(arg['argument']),
        "{}".format(arg['description'].replace('\n', '<br>'))
    ]

    row = '<td>{}</td>'.format('</td><td>'.join(row))
    rows.append(row)
    

header = '<tr><th>{}</th></tr>'.format('</th><th>'.join(header))
rows = '<tr>{}</tr>'.format('</tr><tr>'.join(rows))
table = '<table>{}{}</table>'.format(header, rows)
HTML(table)

| Name | Argument | Description |
|---|---|---|
| **Data** | --data | GCS URL path of the input data in CSV format.<br>Optional: False \| Default: None \| Type: String |
| **Output Data Dir** | --output-data-dir | GCS URL path for writing logs and output KPI measurements.<br>Optional: False \| Default: None \| Type: String |
| Training Test Split Ratio | --training-test-split-ratio | Ratio used to split the data sample into training and test sets.<br>Optional: True \| Default: 0.9 \| Type: float |
| Test Mode | --test-mode | For unit tests only: return just the accuracy and other KPI values. Values: {accuracy}.<br>Optional: True \| Default: (empty) \| Type: String |
| **K Value** | --k-value | Positive value for K in computing the K nearest neighbors.<br>Optional: False \| Default: None \| Type: Integer |
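
As a rough illustration of how these flags fit together, the sketch below declares them with the standard library's `argparse`. This is an assumption for illustration only: the component itself reads its arguments through `ml_workflow.arguments`, and the helper names `positive_int` and `build_parser` are hypothetical.

```python
import argparse


def positive_int(text):
    """Parse a strictly positive integer, as required for K."""
    value = int(text)
    if value <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="KNN component (illustrative parser)")
    # Required arguments (shown in bold in the table).
    parser.add_argument("--data", required=True, help="GCS URL of the input CSV data")
    parser.add_argument("--output-data-dir", required=True,
                        help="GCS URL for logs and KPI measurements")
    parser.add_argument("--k-value", required=True, type=positive_int,
                        help="K in K nearest neighbors")
    # Optional arguments with the defaults listed in the table.
    parser.add_argument("--training-test-split-ratio", type=float, default=0.9)
    parser.add_argument("--test-mode", default="")
    return parser


# Parse an explicit list so the example runs without a real command line.
args = build_parser().parse_args([
    "--data=gs://gs-public-test-data/covtype.data_1.gz",
    "--output-data-dir=gs://gs-public-test-data/Logs",
    "--k-value=5",
])
print(args.k_value, args.training_test_split_ratio, repr(args.test_mode))
```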


# Input

The component expects the location of the input data in a public Google Storage bucket. The dataset must be in CSV format, with the feature columns first and the label as the last column.

Sample input is available in the Google Storage bucket at gs://gs-public-test-data/covtype.data_1.gz.

The component also accepts an optional split ratio used to split the input data into training and test sets. The default is 0.9.
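
The expected layout and the split can be sketched as follows. This is a minimal illustration, not the component's code: the function name `load_and_split` is hypothetical, and it assumes that the ratio is the training fraction, that rows are shuffled before splitting, and that labels are integers.

```python
import csv
import io
import random


def load_and_split(csv_text, split_ratio=0.9, seed=0):
    """Parse CSV rows (features..., label) and split them into train/test sets."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    features = [[float(value) for value in row[:-1]] for row in rows]
    labels = [int(row[-1]) for row in rows]  # last column is the label

    # Assumption: shuffle before splitting so both sets see all classes.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * split_ratio)
    train_idx, test_idx = indices[:cut], indices[cut:]

    train = ([features[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([features[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Ten hypothetical rows with two features and a label in the last column.
sample = "\n".join("{},{},{}".format(i, i * 2, i % 2 + 1) for i in range(10))
(train_x, train_y), (test_x, test_y) = load_and_split(sample, split_ratio=0.8)
print(len(train_x), len(test_x))
```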


# Output

The component also expects a public Google Storage bucket in which to save all the checkpoints, logs and KPI metrics of the KNN-FAISS component. Checkpoints and logs are saved in a text file (for example, knn-faiss-output-log-20190305-050509.txt) and KPI metrics are saved in a separate CSV file (for example, kpi_metrics_knn-faiss-log-20190305-050509.csv).
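
The file names above look like they embed a timestamp. A minimal sketch of producing names in that shape follows; it assumes the suffix is `YYYYMMDD-HHMMSS`, which is inferred from the two example names rather than confirmed by the component's code, and the helper `output_file_names` is hypothetical.

```python
from datetime import datetime


def output_file_names(now):
    """Build log and KPI file names that share one timestamp suffix."""
    # Assumed format: date and time separated by a hyphen.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    log_name = "knn-faiss-output-log-{}.txt".format(stamp)
    kpi_name = "kpi_metrics_knn-faiss-log-{}.csv".format(stamp)
    return log_name, kpi_name


log_name, kpi_name = output_file_names(datetime(2019, 3, 5, 5, 5, 9))
print(log_name)
print(kpi_name)
```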


# Use Examples

Below are minimal examples of how to use the KNN component.

### Docker data pipeline Example

All pipeline components are Docker containers and can be tested in a local environment.

```bash 
docker run -it gcr.io/ml-workflow1/knn \
    --data=gs://gs-public-test-data/covtype.data_1.gz \
    --k-value=5 \
    --training-test-split-ratio=0.8 \
    --output-data-dir=gs://gs-public-test-data/Logs
```

### Kubeflow Pipelines Example

This code snippet creates a pipeline file. To run the pipeline, you have to upload the file to an instance of Kubeflow Pipelines using its user interface.

```python
# !pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz --upgrade
import kfp
import kfp.dsl as dsl
import kfp.gcp as gcp
from kfp.components import load_component_from_file

KnnOp = load_component_from_file('component.yaml')

@dsl.pipeline(
    name='Knn',
    description='Generated')
def knn_pipeline(
    data=dsl.PipelineParam(name="Data"),
    output_data_dir=dsl.PipelineParam(name="Output Data Dir"),
    k_value=dsl.PipelineParam(name="K Value"),
    training_test_split_ratio=dsl.PipelineParam(name="Training Test Split Ratio", value="0.9"),
    test_mode=dsl.PipelineParam(name="Test Mode", value="")):

    knn_op = KnnOp(
        data=data,
        output_data_dir=output_data_dir,
        k_value=k_value,
        training_test_split_ratio=training_test_split_ratio,
        test_mode=test_mode
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(knn_pipeline, 'pipeline.tar.gz')
```

### Example assets:

- Full featured [pipeline.py](pipeline.py)
- Full featured [pipeline.tar.gz](pipeline.tar.gz)
- [component.yaml](component.yaml)

In [47]:
from IPython.display import HTML

tab_8cpu_mac = """
    <h3>KNN Comparison with 581k datapoints on Mac PC (8-core CPU)</h3>
    <table>
    <tr><th></th>                  <th>SkLearn (K = 5)</th> <th>FAISS (K = 5)</th>  <th>SageMaker (K = 5)</th></tr>
    <tr><td>Test DataPoints</td>                <td>58k</td>   <td>58k</td>     <td>58k</td></tr>
    <tr><td>Features</td>                       <td>54</td>     <td>54</td>       <td>54</td></tr>
    <tr><td>Accuracy</td>                       <td>96.90%</td> <td>97.00%</td>   <td>93.80%</td></tr>
    <tr><td>Prediction Time (secs)</td>         <td>5.73</td>     <td>42</td>        <td>26</td></tr>
    <tr><td>Training Model Time (in secs)</td>  <td>4</td>     <td>0.08</td>      <td>186</td></tr>
</table>"""

tab_1gpu_k80 = """
    <h3>KNN Comparison with 2.1 million datapoints on Google Compute Engine (1 CPU / 1 GPU NVIDIA Tesla K80)</h3>
    <table>
    <tr><th></th>                  <th>SkLearn (K = 5)</th> <th>FAISS (K = 5)</th>  <th>SageMaker (K = 5)</th></tr>
    <tr><td>Test DataPoints</td>                <td>200k</td>   <td>200k</td>     <td>200k</td></tr>
    <tr><td>Features</td>                       <td>54</td>     <td>54</td>       <td>54</td></tr>
    <tr><td>Accuracy</td>                       <td>99.20%</td> <td>99.00%</td>   <td>94.40%</td></tr>
    <tr><td>Prediction Time (secs)</td>         <td>28</td>     <td>64</td>        <td>17.68</td></tr>
    <tr><td>Training Model Time (in secs)</td>  <td>30</td>     <td>53.99</td>      <td>217</td></tr>
</table>"""

tab_4gpu_p100 = """
    <h3>KNN Comparison with 2.1 million datapoints on Google Compute Engine (4 CPU / 4 GPU NVIDIA Tesla P100)</h3>
    <table>
    <tr><th></th>                  <th>SkLearn (K = 5)</th> <th>FAISS (K = 5)</th>  <th>SageMaker (K = 5)</th></tr>
    <tr><td>Test DataPoints</td>                <td>200k</td>   <td>200k</td>     <td>200k</td></tr>
    <tr><td>Features</td>                       <td>54</td>     <td>54</td>       <td>54</td></tr>
    <tr><td>Accuracy</td>                       <td>99.20%</td> <td>99.00%</td>   <td>90.80%</td></tr>
    <tr><td>Prediction Time (secs)</td>         <td>22</td>     <td>7</td>        <td>17</td></tr>
    <tr><td>Training Model Time (in secs)</td>  <td>30</td>     <td>51</td>      <td>170</td></tr>
</table>"""
tab_8gpu_k80 = """
    <h3>KNN Comparison with 2.1 million datapoints on Google Compute Engine (8 CPU / 8 GPU NVIDIA Tesla K80)</h3>
    <table>
    <tr><th></th>                  <th>SkLearn (K = 5)</th> <th>FAISS (K = 5)</th>  </tr>
    <tr><td>Test DataPoints</td>                <td>200k</td>   <td>200k</td>     </tr>
    <tr><td>Features</td>                       <td>54</td>     <td>54</td>       </tr>
    <tr><td>Accuracy</td>                       <td>99.20%</td> <td>99.00%</td>  </tr>
    <tr><td>Prediction Time (secs)</td>         <td>26</td>     <td>13</td>        </tr>
    <tr><td>Training Model Time (in secs)</td>  <td>28</td>     <td>64</td>      </tr>
</table>"""
tab_8gpu_v100 = """
    <h3>KNN Comparison with 2.1 million datapoints on Google Compute Engine (8 CPU / 8 GPU NVIDIA Tesla V100)</h3>
    <table>
    <tr><th></th>                  <th>SkLearn (K = 5)</th> <th>FAISS (K = 5)</th>  <th>SageMaker (K = 5)</th></tr>
    <tr><td>Test DataPoints</td>                <td>200k</td>   <td>200k</td>     <td>200k</td></tr>
    <tr><td>Features</td>                       <td>54</td>     <td>54</td>       <td>54</td></tr>
    <tr><td>Accuracy</td>                       <td>99.20%</td> <td>99.00%</td>   <td>92.60%</td></tr>
    <tr><td>Prediction Time (secs)</td>         <td>23</td>     <td>5</td>        <td>12</td></tr>
    <tr><td>Training Model Time (in secs)</td>  <td>27</td>     <td>163</td>      <td>81</td></tr>
</table>"""  

# Benchmarks
The benchmarks use the covtype dataset, a multi-class problem with 54 features and 581k datapoints. It is a labeled dataset where each entry describes a geographic area and the label is a type of forest cover. There are seven possible labels, and we aim to solve the multi-class classification problem using FAISS-kNN.

The tables below show the benchmark results for the KNN implementations using FAISS, SageMaker and SkLearn.

In [48]:
HTML(tab_8cpu_mac)

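**KNN Comparison with 581k datapoints on Mac PC (8-core CPU)**

| Metric | SkLearn (K = 5) | FAISS (K = 5) | SageMaker (K = 5) |
|---|---|---|---|
| Test DataPoints | 58k | 58k | 58k |
| Features | 54 | 54 | 54 |
| Accuracy | 96.90% | 97.00% | 93.80% |
| Prediction Time (secs) | 5.73 | 42 | 26 |
| Training Model Time (secs) | 4 | 0.08 | 186 |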


In [49]:
HTML(tab_1gpu_k80)

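**KNN Comparison with 2.1 million datapoints on Google Compute Engine (1 CPU / 1 GPU NVIDIA Tesla K80)**

| Metric | SkLearn (K = 5) | FAISS (K = 5) | SageMaker (K = 5) |
|---|---|---|---|
| Test DataPoints | 200k | 200k | 200k |
| Features | 54 | 54 | 54 |
| Accuracy | 99.20% | 99.00% | 94.40% |
| Prediction Time (secs) | 28 | 64 | 17.68 |
| Training Model Time (secs) | 30 | 53.99 | 217 |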


In [50]:
HTML(tab_4gpu_p100)

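**KNN Comparison with 2.1 million datapoints on Google Compute Engine (4 CPU / 4 GPU NVIDIA Tesla P100)**

| Metric | SkLearn (K = 5) | FAISS (K = 5) | SageMaker (K = 5) |
|---|---|---|---|
| Test DataPoints | 200k | 200k | 200k |
| Features | 54 | 54 | 54 |
| Accuracy | 99.20% | 99.00% | 90.80% |
| Prediction Time (secs) | 22 | 7 | 17 |
| Training Model Time (secs) | 30 | 51 | 170 |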


In [51]:
HTML(tab_8gpu_k80)

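**KNN Comparison with 2.1 million datapoints on Google Compute Engine (8 CPU / 8 GPU NVIDIA Tesla K80)**

| Metric | SkLearn (K = 5) | FAISS (K = 5) |
|---|---|---|
| Test DataPoints | 200k | 200k |
| Features | 54 | 54 |
| Accuracy | 99.20% | 99.00% |
| Prediction Time (secs) | 26 | 13 |
| Training Model Time (secs) | 28 | 64 |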


In [56]:
HTML(tab_8gpu_v100)

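**KNN Comparison with 2.1 million datapoints on Google Compute Engine (8 CPU / 8 GPU NVIDIA Tesla V100)**

| Metric | SkLearn (K = 5) | FAISS (K = 5) | SageMaker (K = 5) |
|---|---|---|---|
| Test DataPoints | 200k | 200k | 200k |
| Features | 54 | 54 | 54 |
| Accuracy | 99.20% | 99.00% | 92.60% |
| Prediction Time (secs) | 23 | 5 | 12 |
| Training Model Time (secs) | 27 | 163 | 81 |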


<b>Notebooks and DataSets:</b>
- Notebook for producing the SageMaker results: [KNN-SageMaker.ipynb](notebooks/KNN-SageMaker.ipynb), also available at https://github.com/shankarpm/faiss_knn/blob/master/KNN-SageMaker.ipynb
- The CovType dataset can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz

<b>Resources Used:</b>
- 1 CPU / 1 GPU NVIDIA Tesla K80
- 4 CPU / 4 GPU NVIDIA Tesla P100
- 8 CPU / 8 GPU NVIDIA Tesla K80
- 8 CPU / 8 GPU NVIDIA Tesla V100
- Mac PC with an Intel i7 8-core CPU

<h3> Analysis and Conclusion: </h3>

<b>Accuracy:</b>
<p style='text-align: justify;'>
&nbsp;&nbsp;&nbsp; Accuracy was tested on the same toy dataset with 54 features for all 3 models.
In terms of accuracy measured at inference time, FAISS and SkLearn consistently give the best results and stay very close to each other: about 97% on the 581k-datapoint run and 99% or more on the 2.1 million datapoint runs (200k test datapoints).
    <br>
  &nbsp;&nbsp;&nbsp; Accuracy remains almost the same for FAISS and SkLearn across the GPU runs. With the same value of the KNN parameter 'K', their accuracy was roughly 3 to 8 percentage points higher than SageMaker's at both low and high data volumes.<br>
&nbsp;&nbsp;&nbsp; Interestingly, the 3 models differ in how they find nearest neighbors by default. SkLearn uses the Minkowski distance. It is not clear whether SageMaker uses cosine distance (although a FAISS index can be used), and FAISS here uses an IndexIVFFlat index.
On the 2.1 million datapoint runs, accuracy for FAISS and SkLearn is the same regardless of the GPU configuration, while SageMaker's accuracy ranges from 90.8% to 94.4%.
</p>
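
For reference, the Minkowski distance of order p between two vectors a and b is (sum over i of |a_i - b_i|^p)^(1/p). With p = 2, which is scikit-learn's default for `KNeighborsClassifier`, it is the ordinary Euclidean distance, and with p = 1 it is the Manhattan distance. A minimal sketch, with the helper name `minkowski` chosen for illustration:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))  # Manhattan distance: 7.0
print(minkowski(a, b, 2))  # Euclidean distance: 5.0
```

The metric matters for the comparison: two libraries only return the same neighbors, and therefore the same predictions, when they rank points by an equivalent distance.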

<b>Model Training Time:</b>
<p style='text-align: justify;'>
CPU:<br>
 &nbsp;&nbsp;&nbsp; Based on the benchmark results from the 3 models, training time grows with the number of datapoints. SkLearn is fast on CPU.
On the 8-core CPU run with 581k datapoints, the table lists training times of 4 seconds for SkLearn, 0.08 seconds for FAISS and 186 seconds for SageMaker, and prediction times of 5.73 seconds for SkLearn, 42 seconds for FAISS and 26 seconds for SageMaker.
  <br>
    <br>
GPU:
    <br>
 &nbsp;&nbsp;&nbsp; SkLearn doesn't utilize the GPU, whatever the number of instances, unlike FAISS and SageMaker.
With 2.1 million datapoints, FAISS trained 3-4 times faster than SageMaker on the 1 GPU and 4 GPU instances (53.99 vs 217 seconds and 51 vs 170 seconds). On the 8 GPU V100 instance, the table lists 163 seconds for FAISS and 81 seconds for SageMaker.
The P100 model (4 GPU) performed best among all Tesla models for FAISS, with a model training time of 51 seconds.
</p>
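
The training-time ratios between SageMaker and FAISS can be recomputed directly from the benchmark tables. The short sketch below does that arithmetic; the dictionary layout and the helper name `speedup` are illustrative, while the numbers are the training times in seconds from the tables.

```python
# Training times in seconds, copied from the benchmark tables.
training_seconds = {
    "1 GPU (K80)": {"FAISS": 53.99, "SageMaker": 217},
    "4 GPU (P100)": {"FAISS": 51, "SageMaker": 170},
    "8 GPU (V100)": {"FAISS": 163, "SageMaker": 81},
}


def speedup(times):
    """How many times faster FAISS trains than SageMaker (below 1 means slower)."""
    return times["SageMaker"] / times["FAISS"]


for name, times in training_seconds.items():
    print("{}: {:.2f}x".format(name, speedup(times)))
```

A ratio above 1 means FAISS trained faster on that configuration, and a ratio below 1 means SageMaker did.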
    
    
<b>Inference Time:</b>
<p style='text-align: justify;'>
 &nbsp;&nbsp;&nbsp; Based on the benchmark results from the 3 models, inference time for FAISS improves significantly from CPU to GPU, from about 7 minutes to about 1 minute for 200k test datapoints.
As with training time, SkLearn doesn't utilize the GPU, whatever the number of instances, unlike FAISS and SageMaker.<br>
 &nbsp;&nbsp;&nbsp; FAISS scaled well, going from 64 seconds on 1 GPU to 7 seconds on 4 GPUs with 200k test datapoints. SageMaker didn't change significantly across multiple GPUs (17.68, 17 and 12 seconds on 1, 4 and 8 GPUs). FAISS looks like the clear winner here too.<br>
 &nbsp;&nbsp;&nbsp; For FAISS, the P100 model with 4 GPUs gave better results (7 seconds) than the K80 with 8 GPUs (13 seconds).
Among the 8 GPU setups, the V100 model performed best, cutting the K80's 13 seconds down to 5 seconds.
</p>
        
        
<b>Conclusion:</b>
<p style='text-align: justify;'>
 &nbsp;&nbsp;&nbsp; Based on the above benchmark results, FAISS looks like the clear winner across the KPIs.
SkLearn is more accurate than SageMaker but doesn't run on GPUs, so it may not be a good candidate for big datasets even though its accuracy is good. SkLearn may be the better model for small datasets on CPU.
FAISS beats SageMaker significantly in most areas.
</p>
        
        
<b>Interesting Find:</b>
<p style='text-align: justify;'>
 &nbsp;&nbsp;&nbsp; Training a FAISS model for the first time on a new dataset takes longer, but subsequent runs on the same dataset become very fast.<br>
It is not clear whether the index is cached somewhere.
For example, with 200k test datapoints on the 8-core GPU, training the model took 72 seconds the first time. Running the code again with the same parameters took about 18 seconds, almost 4 times faster.
</p>
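
One way to check this kind of first-run versus repeat-run effect is to time the same training call several times in a row. The sketch below shows the measurement pattern only; `fake_train` is a hypothetical stand-in for the real training call, and `time_runs` is an illustrative helper rather than part of the component.

```python
import time


def time_runs(train_fn, repeats=2):
    """Call `train_fn` several times and return each run's wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn()
        durations.append(time.perf_counter() - start)
    return durations


def fake_train():
    # Hypothetical placeholder workload; swap in the real training call.
    sum(i * i for i in range(100000))


durations = time_runs(fake_train, repeats=3)
for run, seconds in enumerate(durations, start=1):
    print("run {}: {:.4f} s".format(run, seconds))
```

Comparing the first duration with the later ones separates one-off costs, such as warm-up or anything cached between runs, from the steady-state training time.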