# Loading Data into Elasticsearch

### Elasticsearch can obtain data from multiple sources. The data can be delivered by Logstash (a pipeline and preprocessing engine) or sent directly through the Elasticsearch API.

### In this notebook we present a simple example of uploading a CSV file.
<img src="img/es_data_collection.png">

### Install Python PIP

##### pip is the standard package manager for installing and managing Python software packages. Since we are using Python 3, we install pip3 as follows:

In [None]:
%%bash
sudo apt install -y python3-pip

##### Now we use pip to install the following three libraries:

* elasticsearch-loader: Used for uploading data to Elasticsearch.
* elasticsearch: Used for querying Elasticsearch.
* elasticsearch_dsl: A high-level library built on top of elasticsearch-py for writing and running queries against Elasticsearch.

In [None]:
%%bash
pip3 install --user elasticsearch-loader
pip3 install --user elasticsearch
pip3 install --user elasticsearch_dsl

### Install Pandas

##### Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.

In [None]:
%%bash
pip3 install --user pandas
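
##### Before importing, it can help to confirm that the installed packages are visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name `check_modules` is hypothetical and is not part of any of the packages installed above.

```python
import importlib.util

def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Module names provided by the packages installed in the cells above
status = check_modules(["elasticsearch", "elasticsearch_dsl", "pandas"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

##### If a module is reported as missing right after installation, restarting the notebook kernel may be needed so that the user site-packages directory is picked up.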

### Import the installed libraries

In [None]:
import elasticsearch
import csv
import pandas as pd
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q

##### We use "elasticsearch_loader" to connect to our Elasticsearch cluster and upload the file "students.csv". Before running the cell, replace "gtp860219" (the part after "csv-itesm-") with your own initials followed by any date in "YYMMDD" format, and replace the "ELASTICSEARCH-IP-ADDRESS" placeholder with the address of your cluster.

In [None]:
%%bash
cd /home/ubuntu/ml_and_big_data_in_cloud_environmnets
elasticsearch_loader --es-host <ELASTICSEARCH-IP-ADDRESS>:9200 \
    --http-auth logstash_internal:elasticsiem \
    --index csv-itesm-gtp860219 \
    --type student-records \
    csv /home/ubuntu/ml_and_big_data_in_cloud_environmnets/files/students.csv

# %%bash
# cd /home/ubuntu/ml_and_big_data_in_cloud_environmnets
# elasticsearch_loader --es-host 10.1.1.18:9200 \
#     --http-auth logstash_internal:elasticsiem \
#     --index test-ml \
#     --type airline-records csv /home/ubuntu/ml_and_big_data_in_cloud_environmnets/workshop_aug_2019/files/airline-passengers.csv    
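
##### The index name passed to "--index" above follows the pattern "csv-itesm-" + initials + "YYMMDD". As a small sketch of that naming rule, the hypothetical helper `build_index_name` below (not part of elasticsearch_loader) assembles such a name and lowercases the initials, since Elasticsearch index names must be lowercase.

```python
from datetime import date

def build_index_name(initials, day, prefix="csv-itesm-"):
    """Return prefix + lowercase initials + the date formatted as YYMMDD."""
    # Elasticsearch rejects index names that contain uppercase letters
    return f"{prefix}{initials.lower()}{day.strftime('%y%m%d')}"

# Reproduces the example index name used in the upload command
print(build_index_name("GTP", date(1986, 2, 19)))  # csv-itesm-gtp860219
```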

### Make an Elasticsearch Query
##### Let us retrieve the data that we uploaded to Elasticsearch by communicating with its API. We start by connecting to the Elasticsearch node using our credentials.
##### Next, we define our query using the "Search" class.
##### Finally, we print the total number of records matching the query.
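
##### A "Search" object is essentially a builder for the JSON body that Elasticsearch receives. As a rough sketch, a match query on the "major" field corresponds to the body below. The function `match_query` is a hypothetical helper written for illustration and is not part of elasticsearch_dsl.

```python
import json

def match_query(field, value):
    """Build the request body for a simple match query on one field."""
    return {"query": {"match": {field: value}}}

# Body equivalent to .query("match", major="Engineering")
body = match_query("major", "Engineering")
print(json.dumps(body, indent=2))
```

##### With elasticsearch_dsl itself, calling `to_dict()` on a "Search" object shows the body it would send, which is a convenient way to check a query before executing it.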

In [None]:
es = elasticsearch.Elasticsearch(["elastic:elasticsiem@localhost:9200"])
#res = Search(using=es, index="csv-itesm*").query("match", username="Erin")
#res = Search(using=es, index="csv-itesm*")\
#        .query('bool', filter=Q('exists', field='name') & Q('exists', field='major'))

# Query all indices matching csv-itesm* for records whose major is Engineering
res = Search(using=es, index="csv-itesm*").query("match", major="Engineering")
response = res.execute()
print(response)

# Print the number of records that matched the query
print("Total number of records: %d \n" % response.hits.total.value)

# Query all records in the indices matching csv-itesm*
res = Search(using=es, index="csv-itesm*")
response = res.execute()
print(response)

# Print the number of records that matched the query
print("Total number of records: %d \n" % response.hits.total.value)
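
##### The expression `response.hits.total.value` assumes the response format of Elasticsearch 7 or later, where the total is an object with a "value" and a "relation". Older versions return a plain integer instead. The sketch below shows one way to read the count from a raw response dictionary in either format; `total_hits` and the sample bodies are hypothetical and only illustrate the shape of the data.

```python
def total_hits(response_body):
    """Return the hit count from a raw search response given as a dict."""
    total = response_body["hits"]["total"]
    # Elasticsearch 7+ reports {"value": N, "relation": "eq" or "gte"};
    # older versions report a plain integer
    if isinstance(total, dict):
        return total["value"]
    return total

# Hypothetical response bodies, trimmed to the fields used here
es7_style = {"hits": {"total": {"value": 3, "relation": "eq"}, "hits": []}}
es6_style = {"hits": {"total": 3, "hits": []}}
print(total_hits(es7_style), total_hits(es6_style))
```

##### When "relation" is "gte", the reported value is a lower bound rather than an exact count.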


### Dump the queried information into a DataFrame
##### First, we create a DataFrame to hold the collected data. Each document matched by the query is returned as a "hit". We scan through the hits and pick, for each one, the keys whose values we want to store in the DataFrame.

In [None]:
# The first CSV header carries a leading byte order mark (BOM), hence the "\ufeffname" key
student_df = pd.DataFrame(((hit["\ufeffname"], hit['major']) for hit in res.scan()),
                          columns=['name', 'major'])
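
##### The "\ufeff" prefix in the key above is a byte order mark (BOM) that some editors write at the start of a UTF-8 file. It ends up glued to the first column name when the file is decoded as plain "utf-8". The sketch below uses a hypothetical two-line sample standing in for "students.csv" to show the difference between decoding with "utf-8" and with "utf-8-sig", which drops the BOM.

```python
import csv
import io

# Hypothetical sample standing in for students.csv, saved with a UTF-8 BOM
raw = "\ufeffname,major\nErin,Engineering\n".encode("utf-8")

def read_header(data, encoding):
    """Return the CSV header row after decoding the bytes with the given encoding."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))

with_bom = read_header(raw, "utf-8")         # BOM stays attached to the first column
without_bom = read_header(raw, "utf-8-sig")  # BOM is removed while decoding
print(with_bom)
print(without_bom)
```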

### Print the first rows of the DataFrame

In [None]:
student_df.head()
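
##### If the documents are already indexed with the BOM in the key, the field names can also be cleaned while collecting the rows. The following is a minimal sketch, assuming each hit is available as a plain dictionary; `clean_row` and the sample documents (including the "gpa" field) are hypothetical.

```python
def clean_row(source, fields):
    """Strip a leading BOM from keys, then keep only the requested fields."""
    cleaned = {key.lstrip("\ufeff"): value for key, value in source.items()}
    # Fields missing from a document are filled with None
    return {field: cleaned.get(field) for field in fields}

# Hypothetical documents standing in for the hits returned by res.scan()
hits = [
    {"\ufeffname": "Erin", "major": "Engineering", "gpa": "3.5"},
    {"\ufeffname": "Sam", "major": "Biology"},
]
rows = [clean_row(hit, ["name", "major"]) for hit in hits]
print(rows)
```

##### With real results, `hit.to_dict()` should provide the dictionary to pass in, and `pd.DataFrame(rows)` would then build the same two-column DataFrame as above.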