# Data Science with SiLK/ELK - Solr/Elastic Search-Logstash-Kibana stack 

In [None]:
import logstash
import logging
import pysolr
import csv

## 1. Introduction:


Searching and sorting are among the most fundamental problems in computer science, and in our day-to-day life as well. Whether we want to know when the next season of our favorite Netflix series will be released or the score of our favorite baseball team, we start by searching. For a search to work efficiently and fetch the data we want, documents, snippets, articles and web pages have to be ranked so that only the relevant ones are at the top of the hit list.
All this is the job of a search engine, and there are a lot of search engines on the market (yes, there are search engines other than our beloved Google!). One famous building block for them is Apache Lucene, a free and open-source information retrieval software library originally written completely in Java.[1]


## 1.1 Motivation:
Current technology faces a number of problems with respect to searching and sorting.
The motivation here is to address two of them, described below, using a well-designed search stack: Solr or Elasticsearch as the search engine, Logstash, and Kibana or Banana dashboards.

### 1.1.1 Faceted Search using Solr: 
The use case here is simple. Let us consider an online shopping scenario. An important requirement is to help online shoppers filter data (items in the shopping catalog) based on multiple features. For example, say I have to buy a car. I would love to search based on color, mileage, four-wheel drive, horsepower and so on. These are the features, and the search engine has to automatically give me results narrowed down by several of these features at once. This is faceted search. Apache Solr provides a nice way to integrate faceted search. I will walk through the steps of creating such an experience with Solr, using a simple dataset with details about the sales of digital wearable devices.
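To make the idea of a facet concrete, here is a minimal pure-Python sketch that is independent of Solr. The three catalog entries and their field names are made up for illustration. The first function counts how many items fall under each value of a field (the summary a faceted interface shows next to the results), and the second narrows the list down by the selected values.

```python
from collections import Counter

# Hypothetical catalog entries, invented for this illustration only.
CATALOG = [
    {"name": "Car A", "color": "red", "drive": "4WD"},
    {"name": "Car B", "color": "blue", "drive": "4WD"},
    {"name": "Car C", "color": "red", "drive": "2WD"},
]


def facet_counts(items, field):
    """Count how many items carry each value of the given field."""
    return Counter(item[field] for item in items if field in item)


def filter_by(items, **selected):
    """Keep only the items whose fields equal all the selected values."""
    return [
        item for item in items
        if all(item.get(key) == value for key, value in selected.items())
    ]


if __name__ == "__main__":
    print(dict(facet_counts(CATALOG, "color")))
    print([item["name"] for item in filter_by(CATALOG, color="red", drive="4WD")])
```

A real search engine computes these counts from its index rather than by scanning every item, but the result presented to the shopper has the same shape.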

### 1.1.2 Time Series Analysis using ELK Stack:
This is a fairly complex use case compared to faceted search. Time series analysis has been ubiquitous for a long time, and it comes in many flavors. Be it real-time stock analysis, transport data prediction, or stats telemetry, all of these involve hundreds or thousands of metrics collected every minute. The challenge is how to store, analyze and visualize them. This is exactly where data science comes into the picture. The SiLK and ELK stacks can be used to solve this problem efficiently. I will walk through a real stock analysis example to show how the ELK stack can be used as an "eye" for such complex data.

## 1.2 Ingredients

### 1.2.1 Apache Solr:
According to its Wikipedia page [2], Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. In our use case we can think of the Solr-Lucene combination as the back-end storage engine that also provides the search capability, and it is cool because it is open source!

### 1.2.2 Elastic Search:
Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.[10] Elasticsearch can be thought of as a cousin of Apache Solr, as both are enterprise search engines built on top of Lucene.

### 1.2.3 Logstash: 
Logstash is a tool to collect, process, and forward events and log messages.[3] For our use case, we can treat Logstash as a kind of middleman in the SiLK pipeline: it has input plugins to collect the logs (syslogs), provides some nice filters to modify and annotate them [3], and finally hands the annotated data to the output channels. Logstash can also collect data from other sources such as MySQL, MongoDB and WildFly.

### 1.2.4 Kibana:
Kibana is an open source data visualization plugin for Elasticsearch.[11] It is the eye for the data that we have hosted on Elasticsearch (for Solr, the same role is played by Banana [4], a fork of Kibana). It provides a rich representation of the data so that the analysis can be done efficiently.
Kibana completes the whole stack.

## 2. Installation Guidelines:

### 2.1 Apache Solr Installation: 
The official Solr tutorial [5] is the best guide for installing Solr in a simple way. (It also provides guidelines for hosting Solr on Windows machines, which I tried. I found it better and easier to host it on Linux-based machines, because building the whole stack depends on many other tools such as curl.) The installation steps are outlined in the following cells.
(I will be demonstrating the installation steps on Ubuntu 16.04 for this tutorial.)

#### 2.1.1 System Requirements:
Since Solr is a platform running on Lucene, which is built on top of Java, we need a recent JVM installed.
Further requirements are listed in [6].

#### 2.1.2 Download and Installation Steps:
1. Download Solr (version 7.2.1 is used in this tutorial) from:
http://apache.claz.org/lucene/solr/7.2.1 
2. Unpack the tar archive.



#### 2.1.3 PySolr- A Python Wrapper for Solr Installation:
PySolr can be installed directly from the conda-forge channel as follows:

VirtualBox:/tmp$ conda install -c conda-forge pysolr

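### 2.2 Logstash Python Client (python-logstash) Installation on Anaconda:
The python-logstash package is installed with pip. If pip is not installed on Anaconda, start from step 1; otherwise go directly to step 3.
1. Navigate to the Conda home path 

2. Install pip
VirtualBox:/tmp$ conda install pip

3. Install python-logstash with pip
VirtualBox:/tmp$ pip install python-logstash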


### 2.3 Installing Elasticsearch:

1. The first step is to check the version of Java. If the machine doesn't have Java, we can install it as follows (the last command prints the installed version):

sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
java -version

2. Download the Elasticsearch .deb package from the website, then install it and enable the service:

https://www.elastic.co/downloads/elasticsearch
sudo dpkg -i elasticsearch-6.2.3.deb
sudo systemctl enable elasticsearch.service

sudo service elasticsearch start


To test the installation, we can send a simple curl request as follows:

curl -XGET 'localhost:9200'


VirtualBox:~/Desktop/elk$ sudo dpkg -i elasticsearch-6.2.3.deb 
Selecting previously unselected package elasticsearch.
(Reading database ... 220574 files and directories currently installed.)
Preparing to unpack elasticsearch-6.2.3.deb ...
Creating elasticsearch group... OK
Creating elasticsearch user... OK
Unpacking elasticsearch (6.2.3) ...
Setting up elasticsearch (6.2.3) ...
Processing triggers for systemd (229-4ubuntu21.2) ...
Processing triggers for ureadahead (0.100.0-19) ...
VirtualBox:~/Desktop/elk$ 




If Elasticsearch is installed properly, the output of the curl request looks like this:

VirtualBox:~/Desktop/elk$ curl -XGET 'localhost:9200'
{
  "name" : "Fx7F0m6",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Jz7sepLTQyqi6DzCMFG_Eg",
  "version" : {
    "number" : "6.2.3",
    "build_hash" : "c59ff00",
    "build_date" : "2018-03-13T10:06:29.741383Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
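The same check can be scripted. The sketch below is only an illustration: the function names are my own, and the default URL assumes the local node used in this tutorial. The parsing step is separated from the network call so that it can be tried on a trimmed copy of the response shown above without a running server.

```python
import json
from urllib.request import urlopen


def parse_cluster_info(body):
    """Extract the cluster name and version number from the root response."""
    info = json.loads(body)
    return info["cluster_name"], info["version"]["number"]


def fetch_cluster_info(url="http://localhost:9200"):
    """Fetch the root endpoint; this needs a running Elasticsearch node."""
    with urlopen(url, timeout=5) as response:
        return parse_cluster_info(response.read().decode("utf-8"))


# Trimmed copy of the response shown above, so that the parsing can be
# tried without a running server.
SAMPLE = '{"cluster_name": "elasticsearch", "version": {"number": "6.2.3"}}'

if __name__ == "__main__":
    print(parse_cluster_info(SAMPLE))
```

Calling fetch_cluster_info() against the local node should return the same pair if the service is up; a connection error suggests the service has not started yet.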

That's it! Elasticsearch is now running on the assigned port.


### 2.4 Installing Kibana:

1. Download Kibana from the official website and install the package:
https://www.elastic.co/downloads/kibana
sudo dpkg -i kibana-6.2.3-amd64.deb

VirtualBox:~/Desktop/elk$ sudo dpkg -i kibana-6.2.3-amd64.deb 
Selecting previously unselected package kibana.
(Reading database ... 177484 files and directories currently installed.)
Preparing to unpack kibana-6.2.3-amd64.deb ...
Unpacking kibana (6.2.3) ...
Setting up kibana (6.2.3) ...
Processing triggers for systemd (229-4ubuntu21.1) ...
Processing triggers for ureadahead (0.100.0-19) ..


2. We have to take care of two things in the Kibana YAML configuration file:
  1. The URL of Elasticsearch. For the demo we are using localhost.

    

![kibana_1.png](kibana_1.png)

  2. 

## 3. The Fun Part: Actual Working of the Stack

## 3.1 Running Solr and creating Cores:
For demonstration purposes, we create a core called 'wearables' as follows:


#### 3.1.1 Creating Cores:
@-VirtualBox:~/Desktop/solr-7.2.1$ bin/solr create -c wearables

WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
         NOT RECOMMENDED for production use.

         To turn it off:
            curl http://localhost:7574/solr/wearables/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'
Created collection 'wearables' with 1 shard(s), 1 replica(s) with config-set 'wearables'

@-VirtualBox:~/Desktop/solr-7.2.1$ 

#### 3.1.2 Posting CSV Files:
Next, we post a CSV file to the newly created core 'wearables':

@-VirtualBox:~/Desktop/solr-7.2.1$ bin/post -c wearables Wearable1.csv

/usr/lib/jvm/java-8-oracle/bin/java -classpath /home//Desktop/solr-7.2.1/dist/solr-core-7.2.1.jar -Dauto=yes -Dc=wearables -Ddata=files org.apache.solr.util.SimplePostTool Wearable1.csv
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/wearables/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file Wearable1.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/wearables/update...
Time spent: 0:00:01.454

@-VirtualBox:~/Desktop/solr-7.2.1$ 
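As an alternative to bin/post, the rows of a CSV file can be turned into documents in Python first. The following is a minimal sketch under stated assumptions: the function name is my own, and the two sample rows are made up (the real Wearable1.csv has more columns and records).

```python
import csv
import io


def csv_to_docs(csv_text):
    """Turn CSV text with a header row into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Two made-up rows; the real Wearable1.csv has more columns and records.
SAMPLE_CSV = (
    "Name,Company.Name,Category\n"
    "Watch X,Casio,Fitness\n"
    "Band Y,Jawbone,Lifestyle\n"
)

if __name__ == "__main__":
    for doc in csv_to_docs(SAMPLE_CSV):
        print(doc)
```

With a pysolr connection named solr, the resulting list could then be sent with solr.add(docs, commit=True). For a one-off load of a whole file, bin/post as shown above is simpler.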



## 3.2 Solr in Action with PySolr

In [None]:

#https://pypi.python.org/pypi/pysolr/3.6.0 [8]
#http://www.confusedcoders.com/bigdata/solr/indexing-csv-data-in-solr-via-python-pysolr

#This creates a connection to the Solr server running on localhost, port 8983.
#wearables is the name of the core that we created in the step above.
#The timeout (in seconds) is an optional field
solr = pysolr.Solr('http://localhost:8983/solr/wearables', timeout=10)

#The CSV file was already indexed with bin/post, so here we only query it.
#Solr search uses Lucene query syntax; '*:*' matches every document
results = solr.search(q='*:*')
print("Saw {0} result(s).".format(len(results)))

#solr.search returns a pysolr Results object
print("results ",str(results))

#Printing the name of each returned item
for result in results:
    print("Item Name '{0}'.".format(result['Name']))
    


### 3.2.1: Pagination in Solr
In the previous example, we might wonder why only 10 results are shown (the CSV file has more than 10 records).
[9] explains the issue. Basically, Solr is NOT a database. Like a typical search engine, it returns only the top results (10 by default) rather than every matching document.
However, we can ask for more rows as follows:

In [None]:
#Pagination in Solr
results = solr.search(q='*:*',rows=20)    
for result in results:
    print("Item Name '{0}'.".format(result['Name']))
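Raising rows returns a bigger first page. To walk through the results page by page, Solr also accepts a start offset next to rows. The helpers below are a small sketch of that arithmetic; the function names are my own and not part of pysolr.

```python
def page_params(page, page_size=10):
    """Return Solr 'start' and 'rows' parameters for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"start": (page - 1) * page_size, "rows": page_size}


def page_count(total_hits, page_size=10):
    """Number of pages needed to show total_hits documents."""
    return (total_hits + page_size - 1) // page_size


if __name__ == "__main__":
    print(page_params(1))
    print(page_params(3, page_size=20))
    print(page_count(45, page_size=20))
```

With pysolr this could be used as solr.search(q='*:*', **page_params(2)), and results.hits gives the total number of matches to feed into page_count.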

### 3.2.2 More Filtering in Solr:
More filtering can be added pretty easily and queries can be launched as follows:

In [None]:
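#Find all the wearables manufactured by Casio or categorized as Medical
#https://stackoverflow.com/questions/9532395/pysolr-filter-search
#Both clauses sit in one filter query, joined by an explicit OR
filters = ['Company.Name:Casio OR Category:Medical']
results = solr.search(q="*:*",fq=filters,rows=20)    
for result in results:
    print("Item Name '{0}'.".format(result['Name']))

Note: This is a basic search where Solr retrieves up to 20 wearables manufactured by Casio OR (not AND) categorized as Medical, because both clauses are part of a single filter query. Passing them as two separate entries in the fq list would instead return only the items that match both.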
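The difference between the two ways of combining filters can be captured in two small helpers. This is only a sketch: the function names are my own, and the clauses are the ones used in the cell above.

```python
def any_of(*clauses):
    """One filter query matching documents that satisfy ANY clause."""
    return [" OR ".join("({0})".format(clause) for clause in clauses)]


def all_of(*clauses):
    """Separate filter queries; Solr keeps documents that satisfy ALL."""
    return list(clauses)


if __name__ == "__main__":
    print(any_of("Company.Name:Casio", "Category:Medical"))
    print(all_of("Company.Name:Casio", "Category:Medical"))
```

Either return value can be passed as the fq argument of solr.search. Solr intersects separate fq entries, which is why the list with two entries behaves like AND.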

### 3.2.3 Partial Match in Solr:
A partial (prefix) match requires a "*" wildcard in the query:

In [None]:

results = solr.search(q="Company.Name:Cas*",rows=20)    
for result in results:
    print("Item Name '{0}'.".format(result['Company.Name']))

### 3.2.4 Time to See Fuzzy Matches:

This is where Solr stands out. Fuzzy matches are based on edit distance, that is, the number of replacements, deletions or insertions required to turn one string into another. We can do fuzzy matches in a very simple and convenient way in Solr.
First, let us look at a scenario where we want every Company.Name within an edit distance of 1 of "Cas". This can be done as follows:



In [None]:
#https://stackoverflow.com/questions/16655933/fuzzy-search-in-solr
results = solr.search(q="Company.Name:Cas~1",rows=20)    
for result in results:
    print("Item Name '{0}'.".format(result['Company.Name']))

Solr retrieves 'Sas Safety' because 'Sas' is one edit away from 'Cas'.
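To see where the distance of 1 comes from, here is a minimal sketch of the classic Levenshtein edit distance, which counts exactly the three operations described above. It is an illustration of the concept, not Solr's code; Solr's own implementation may differ in details, for example in how it treats two swapped adjacent characters.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions, replacements."""
    # previous[j] holds the distance between the already processed prefix
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # delete source_char
                current[j - 1] + 1,      # insert target_char
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("cas", "sas"))
    print(edit_distance("watches", "watch"))
```

The comparison is done on lowercase strings here on the assumption that the indexed terms are lowercased; that depends on the field type in the schema.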

### 3.2.5 More Interesting Fuzzy Matches:
Suppose we are interested only in watches out of all the different wearables.

In [None]:
results = solr.search(q="Name:watches~1",rows=20)    
for result in results:
    print("Item Name '{0}'.".format(result['Name']))

In [144]:

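#https://stackoverflow.com/questions/18704339/how-to-use-facets-with-pysolr-cant-seem-to-get-facet-results-to-show
#Faceting does not filter the documents; it adds per-value counts to the response
params = {
  'facet': 'on',
  #Count the documents for every value of these fields
  'facet.field': ['Body.Location', 'Company.Name'],
  #Also count the documents matching these specific values
  'facet.query': ['Body.Location:Torso', 'Company.Name:Jawbone'],
  'rows': '10',
}

results = solr.search( q="*:*",**params)
#The facet counts are available in results.facets
for result in results:
    print("Item Name '{0}'.".format(result['Name']))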

Item Name '['Barska GB12166 Fitness Watch with Heart Rate Monitor']'.
Item Name '['Belkin GS5 Sport Fit Armband, Black F8M918B1C00']'.
Item Name '['Bowflex EZ Pro Strapless Heart Rate Monitor Watch, Black']'.
Item Name '['Casio G Shock Watch Solar Atom (gwm500a-1)']'.
Item Name '['Casio WS220 Solar Runner Digital Wrist Watch']'.
Item Name '['Coleman G7HD-SWIM POV 1080p 5 Megapixel Goggles Camcorder ELBG7HDSWIM']'.
Item Name '['Ekho Fit-18 Heart Rate Monitor 12-2042']'.
Item Name '["Ekho FiT-9 Women's Heart Rate Monitor 12-2041"]'.
Item Name '['Fitbit Flex Cordless Activity/Sleep Tracker - Black']'.
Item Name '['G1 Smartwatch /w Bluetooth Hands-Free Solution Large Black']'.


### Time Series Analysis With ELK Stack

For this use case, we need to install Logstash. We have already installed Elasticsearch and Kibana.
To check that they are running, we can run the following:

ps -eaf | grep kibana
ps -eaf | grep elasticsearch

![ek_working.png](ek_working.png)

#### Installing Logstash
1. Download and install the Logstash package the same way we did for Elasticsearch and Kibana.
2. Navigate to the bin directory and start Logstash, where FILE_PATH is the path of the conf file:
    cd /usr/share/logstash/bin
    sudo ./logstash -f FILE_PATH

![logstash_loading_firstdata.png](logstash_loading_firstdata.png)

#### The Logstash Conf File

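It has three components:
1. Input: Defines where the data is fetched from.
Provide the path of the input file. In our case, it is the path of the CSV file.

2. Filter: 
This is where all the fun happens with Logstash. We can parse unstructured data using Grok, a pattern matching library (a link to a good tutorial is given in the Auxiliary Tutorials section below). For the stock CSV file we use the csv filter to name the columns and mutate to convert the price columns to floats.

3. Output:
Defines where the processed events are sent. In our case, they go to the Elasticsearch index 'stock' and to stdout.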

In [None]:
#https://github.com/minsuk-heo/BigData/tree/master/ch06
#The conf file for logstash to get the stock data on to the elastic search and also provide a pipeline to Kibana
input {
  file {
    path => "/home/Desktop/gasg/elk/stock.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"    
  }
}

filter {
  csv {
      separator => ","
      columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]
  }
  mutate {convert => ["Open", "float"]}
  mutate {convert => ["High", "float"]}
  mutate {convert => ["Low", "float"]}
  mutate {convert => ["Close", "float"]}
}
output {  
    elasticsearch {
        hosts => "localhost"
        index => "stock"
    }
    stdout {}
}
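To make the filter section less abstract, the sketch below mimics in plain Python what the csv and mutate filters above do to a single line: name the columns, then convert the four price columns to floats. It is an illustration only, not how Logstash is implemented, and the sample row is made up.

```python
import csv

COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"]
FLOAT_COLUMNS = ["Open", "High", "Low", "Close"]


def parse_line(line):
    """Mimic the csv and mutate filters: name the columns, convert prices."""
    values = next(csv.reader([line], delimiter=","))
    event = dict(zip(COLUMNS, values))
    for column in FLOAT_COLUMNS:
        event[column] = float(event[column])
    return event


# Made-up row in the same column order as stock.csv.
SAMPLE_LINE = "2017-01-03,31.5,32.0,31.0,31.75,1000,31.75"

if __name__ == "__main__":
    print(parse_line(SAMPLE_LINE))
```

Columns that are not converted, such as Volume and Adj Close in the conf file above, stay strings, which matters later when they are used in numeric aggregations.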


#### Running Kibana: The last leg!

![kibana_logs.png](kibana_logs.png)

![kibana_frontend.png](kibana_frontend.png)

#### The Input CSV File for the ELK stack

![ugly lokking.png](ugly lokking.png)

![kibana_stock1.png](kibana_stock1.png)

#### All the stock data, indexed by the date field, can be viewed in both JSON and table form in Kibana

![timestamps.png](timestamps.png)

![kibana3.png](kibana3.png)

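#### We Can Also Perform Lucene Syntax Based Search in Kibana
Close:[30 TO 35] shows all the dates on which the closing value of the stock was between 30 and 35, both included (field names are case-sensitive, so the field is written as in the columns list of the conf file).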


![close-30-35.png](close-30-35.png)
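In Lucene syntax, a range written with square brackets, such as [30 TO 35], includes both end points. The sketch below shows the equivalent filter in plain Python; the function name is my own and the closing prices are made up for illustration.

```python
def in_closed_range(events, field, low, high):
    """Keep events whose field lies in [low, high], both ends included."""
    return [event for event in events if low <= event[field] <= high]


# Made-up closing prices for illustration.
EVENTS = [
    {"Date": "2017-01-03", "Close": 29.9},
    {"Date": "2017-01-04", "Close": 30.0},
    {"Date": "2017-01-05", "Close": 33.2},
    {"Date": "2017-01-06", "Close": 35.1},
]

if __name__ == "__main__":
    print([e["Date"] for e in in_closed_range(EVENTS, "Close", 30, 35)])
```

Curly braces in place of the square brackets would exclude the end points instead.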

#### Visualization Options Provided by Kibana:


![visuals.png](visuals.png)


![weekly.png](weekly.png)



![both_closr_high.png](both_closr_high.png)

#### Kibana Dashboards
Finally, we can build a neat dashboard that collects all the graphs and visualizations on a single page, as follows:

![dashboard.png](dashboard.png)

### Conclusion:

#### ELK and SiLK Face off! 

In this part, I would like to report some of the pros and cons of both stacks based on my experience installing and setting them up. 

Documentation/Time spent in bringing up the stack:
ELK is easier to set up and work with. Period! I spent a lot of time looking for support for Solr and SiLK. With all due respect to the authors of the Solr installation guide (which is beautifully written), the ecosystem lacks support, especially when it comes to SiLK. Kibana doesn't support Solr out of the box; we need a fork of Kibana called Banana [4] for that, and Banana lacks support too.
Logstash doesn't gel well with Solr either. We need a Logstash plugin called solr-http-plugin to push logs from Logstash to Solr. Again, it takes a lot of effort to integrate all three to bring up SiLK.







#### Auxiliary Tutorials and Blogs:

The following are some great guides and tutorials for building more complex and more useful applications with Solr, Elasticsearch, Logstash and Kibana.

Logstash uses Grok, a clean and nice pattern matching tool, to parse unstructured data and process it so that it can be indexed. Indexed data can be queried efficiently. Naturally, Grok is well suited to processing syslogs for the second kind of use case we looked at in this tutorial. The following link provides a good start for using Grok.

 https://logz.io/blog/logstash-grok/

Elasticsearch is an excellent distributed search engine with lots of knobs for tuning how a large dataset is spread out. Sharding can be achieved with a few configuration settings, and the setup depends on the requirements and use cases. In the GitHub repository below, the author thoroughly explains sharding, Filebeat (which helps ship data to Logstash from multiple sources) and other aspects of the ELK stack.

https://github.com/minsuk-heo/BigData/

ELK on Amazon AWS: The following set of YouTube videos is a good place to start if the requirement is to set up the ELK stack on the Amazon cloud.

https://www.youtube.com/watch?v=ge8uHdmtb1M&t=218s

# Citations and references
1. https://en.wikipedia.org/wiki/Apache_Lucene
2. https://en.wikipedia.org/wiki/Apache_Solr
3. https://wikitech.wikimedia.org/wiki/Logstash
4. https://github.com/lucidworks/banana
5. https://lucene.apache.org/solr/guide/7_2/solr-tutorial.html
6. https://lucene.apache.org/solr/7_2_1/SYSTEM_REQUIREMENTS.html
7. https://github.com/logpai/loghub/tree/master/Linux
8. https://pypi.python.org/pypi/pysolr/3.6.0
9. https://stackoverflow.com/questions/6385168/get-all-the-results-from-solr-without-10-as-limit
10. https://en.wikipedia.org/wiki/Elasticsearch
11. https://en.wikipedia.org/wiki/Kibana
12. http://solr-vs-elasticsearch.com/
13. https://github.com/minsuk-heo/BigData/tree/master/ch06