# Loading Data into Elasticsearch

### Elasticsearch can receive data from multiple sources. The data can be delivered by Logstash (a pipeline and preprocessing engine) or sent directly through the Elasticsearch API.

### In this notebook we present a simple example of uploading a CSV file.
<img src="img/es_data_collection.png">

In [None]:
%%bash
sudo apt install python3-pip

In [None]:
%%bash
pip3 install --user elasticsearch-loader
pip3 install --user elasticsearch
pip3 install --user elasticsearch_dsl

In [2]:
%%bash
pip3 install --user pandas

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/1d/9a/7eb9952f4b4d73fbd75ad1d5d6112f407e695957444cb695cbb3cdab918a/pandas-0.25.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting numpy>=1.13.3 (from pandas)
  Using cached https://files.pythonhosted.org/packages/19/b9/bda9781f0a74b90ebd2e046fde1196182900bd4a8e1ea503d3ffebc50e7c/numpy-1.17.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.6.1 (from pandas)
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting pytz>=2017.2 (from pandas)
  Using cached https://files.pythonhosted.org/packages/87/76/46d697698a143e05f77bec5a526bf4e56a0be61d63425b68f4ba553b51f2/pytz-2019.2-py2.py3-none-any.whl
Collecting six>=1.5 (from python-dateutil>=2.6.1->pandas)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-n

In [3]:
import elasticsearch
import csv
import pandas as pd
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q



##### We use "elasticsearch_loader" to connect to our Elasticsearch cluster and upload the file "students.csv". Before you run the cell, replace "gtp860219" (the suffix after "csv-itesm-") with your initials followed by a date in "YYMMDD" format.

In [9]:
%%bash
cd /home/ubuntu/ml_and_big_data_in_cloud_environmnets
elasticsearch_loader --es-host localhost:9200 \
    --http-auth logstash_internal:elasticsiem \
    --index csv-itesm-gtp860219 \
    --type student-records csv /home/ubuntu/ml_and_big_data_in_cloud_environmnets/files/students.csv    
    

{'es_host': 'localhost:9200', 'http_auth': 'logstash_internal:elasticsiem', 'index': 'csv-itesm-gtp860219', 'type': 'student-records', 'bulk_size': 500, 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'delete': False, 'update': False, 'progress': False, 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'keys': [], 'es_conn': <Elasticsearch([{'host': 'localhost', 'port': 9200}])>}
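
The same upload can also be done from Python through the bulk API. The following is a minimal sketch, assuming the `elasticsearch` package installed earlier and its `elasticsearch.helpers.bulk` helper. The function names `csv_to_actions` and `upload_csv` are hypothetical, and the sample rows mirror the two students that appear later in this notebook. The mapping type (`--type`) is left out here because mapping types are deprecated in Elasticsearch 7 and later.

```python
import csv
import io


def csv_to_actions(lines, index):
    """Yield one bulk 'index' action per CSV row (hypothetical helper)."""
    for row in csv.DictReader(lines):
        yield {"_index": index, "_source": dict(row)}


def upload_csv(path, index, hosts):
    """Send every row of the CSV file at `path` to Elasticsearch in bulk."""
    # Imported here so the helper above works without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(hosts)
    with open(path, newline="", encoding="utf-8") as f:
        # helpers.bulk returns (number of successes, list of errors).
        return helpers.bulk(es, csv_to_actions(f, index))


# Dry run on in-memory data: no cluster is contacted.
sample = io.StringIO("name,major\nMike,Engineering\nErin,Computer Science\n")
actions = list(csv_to_actions(sample, "csv-itesm-gtp860219"))
print(actions[0])
```

Calling `upload_csv` with the path to "students.csv", your own index name, and the cluster address would perform the actual upload; it is not called here so that the cell can be run without a cluster.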




### Let us retrieve the data that we uploaded to Elasticsearch by querying its API.

In [14]:
#es = elasticsearch.Elasticsearch(["localhost:9200"],http_auth=('elastic', 'elasticsiem'),scheme="https",port=443)
es = elasticsearch.Elasticsearch(["elastic:elasticsiem@localhost:9200"])
#res = Search(using=es, index="csv-itesm*").query("match", username="Erin")
#res = Search(using=es, index="csv-itesm*")\
#        .query('bool', filter=Q('exists', field='name') & Q('exists', field='major'))

# Print all records in the indices matching csv-itesm* where the major is Engineering
res = Search(using=es, index="csv-itesm*").query("match", major="Engineering")
response = res.execute()
print(response)

# Let us print the number of records obtained
#print("Total number of logs: %i \n" %(response.hits.total))

# Print all records in the indices matching csv-itesm*
res = Search(using=es, index="csv-itesm*")
response = res.execute()
print(response)

# Let us print the number of records obtained
#print("Total number of logs: %i \n" %response.hits.total)

<Response: [<Hit(csv-itesm-gtp860219/JRt8bWwB5flVJT0cuCUW): {'\ufeffname': 'Mike', 'major': 'Engineering'}>]>
<Response: [<Hit(csv-itesm-gtp860219/JRt8bWwB5flVJT0cuCUW): {'\ufeffname': 'Mike', 'major': 'Engineering'}>, <Hit(csv-itesm-gtp860219/Jht8bWwB5flVJT0cuCUW): {'\ufeffname': 'Erin', 'major': 'Computer Science'}>]>
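
The field name `'\ufeffname'` in the output above suggests that "students.csv" starts with a UTF-8 byte order mark (BOM), which was read as part of the first column header because the file was decoded as plain `utf-8`. The sketch below reproduces the effect with the standard library only; the sample bytes are made up for illustration. Python's `utf-8-sig` codec strips the BOM while decoding. The loader's settings dump lists an `encoding` option, so passing `utf-8-sig` there may avoid the problem, but that is an assumption that has not been tested here.

```python
import csv
import io

# Bytes of a small CSV file saved with a UTF-8 byte order mark (BOM).
raw = "name,major\nMike,Engineering\n".encode("utf-8-sig")


def header(data, encoding):
    """Return the CSV header row after decoding `data` with `encoding`."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


with_bom = header(raw, "utf-8")         # BOM stays attached to the first field
without_bom = header(raw, "utf-8-sig")  # BOM is stripped while decoding

print(with_bom)
print(without_bom)
```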


### We will now create a DataFrame to hold the collected data.

In [15]:
# One (name, major) row per hit; the name key carries a BOM (\ufeff) from the CSV.
student_df = pd.DataFrame(((hit["\ufeffname"], hit["major"]) for hit in res.scan()),
                          columns=["name", "major"])
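
The cell above hard-codes the BOM-prefixed key `"\ufeffname"`, so it would break if the data were re-indexed with clean field names. One possible alternative is to normalize the keys of each document before building the DataFrame. This is a sketch only: `clean_keys` is a hypothetical helper, and plain dictionaries stand in for the search hits.

```python
def clean_keys(source):
    """Return a copy of `source` with any leading BOM removed from its keys."""
    return {key.lstrip("\ufeff"): value for key, value in source.items()}


# Plain dictionaries standing in for the documents returned by the search.
hits = [
    {"\ufeffname": "Mike", "major": "Engineering"},
    {"\ufeffname": "Erin", "major": "Computer Science"},
]

rows = [clean_keys(hit) for hit in hits]
print(rows)
```

With real results, something like `pd.DataFrame([clean_keys(hit.to_dict()) for hit in res.scan()])` should produce the same `name` and `major` columns.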

### Display the first rows of the DataFrame

In [16]:
student_df.head()

   name             major
0  Mike       Engineering
1  Erin  Computer Science
