# Loading Data into Elasticsearch

### Elasticsearch can obtain data from multiple sources. The data can be delivered by Logstash (a pipeline and preprocessing engine) or sent directly through the Elasticsearch API.

### In this notebook we present a simple example of uploading a CSV file.
<img src="img/es_data_collection.png">

In [None]:
!pip install --user elasticsearch-loader
!pip install --user elasticsearch
!pip install --user elasticsearch_dsl

In [6]:
import elasticsearch
import csv
import pandas as pd
from elasticsearch_dsl import Search
from elasticsearch_dsl import Q

### We use "elasticsearch_loader" to communicate with our Elasticsearch cluster and upload the file "students.csv". Before you run the cell, replace "gtp860219" (the suffix after "csv-itesm-") with your own initials followed by a date in "YYMMDD" format.

In [7]:
!elasticsearch_loader --es-host 149.165.170.209:9200 --index csv-itesm-gtp860219 --type student-records csv files/students.csv

{'es_host': '149.165.170.209:9200', 'index': 'csv-itesm-gtp860219', 'type': 'student-records', 'bulk_size': 500, 'verify_certs': False, 'use_ssl': False, 'ca_certs': None, 'http_auth': None, 'delete': False, 'update': False, 'progress': False, 'id_field': None, 'as_child': False, 'with_retry': False, 'index_settings_file': None, 'timeout': 10.0, 'encoding': 'utf-8', 'es_conn': <Elasticsearch([{'host': '149.165.170.209', 'port': 9200}])>}
[####################################]
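
The same upload can also be done from Python rather than from the command line. The sketch below is one possible approach, not a description of what elasticsearch_loader does internally. It reads the CSV with the `utf-8-sig` codec, which drops a leading byte-order mark if the file has one, and turns each row into an action dictionary in the shape accepted by `elasticsearch.helpers.bulk`. The function names `csv_to_actions` and `upload`, the index name `csv-itesm-demo`, and the inline sample data are illustrative assumptions. Clusters older than 7.x also expect a `_type` key in each action, which is omitted here, and the accepted host string format depends on the client version.

```python
import csv
import io


def csv_to_actions(csv_file, index_name):
    """Yield one bulk-index action per CSV row."""
    for row in csv.DictReader(csv_file):
        yield {"_index": index_name, "_source": dict(row)}


def upload(path, index_name, host):
    """Send every row of the CSV at `path` to Elasticsearch (needs a reachable cluster)."""
    # Imported here so csv_to_actions can be used without the client installed.
    import elasticsearch
    from elasticsearch import helpers

    es = elasticsearch.Elasticsearch([host])
    # utf-8-sig strips a leading BOM so the first column is "name", not "\ufeffname".
    with open(path, newline="", encoding="utf-8-sig") as f:
        return helpers.bulk(es, csv_to_actions(f, index_name))


# Offline demonstration with made-up sample data that starts with a BOM.
raw = "\ufeffname,major\nMike,Engineering\nErin,Computer Science\n".encode("utf-8")
sample = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
actions = list(csv_to_actions(sample, "csv-itesm-demo"))
print(actions)
```

Only `csv_to_actions` runs in the demonstration; `upload` is defined but not called, so nothing is sent to a cluster until you invoke it yourself.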


### Let us retrieve the data that we uploaded to Elasticsearch by communicating with its API.

In [8]:
es = elasticsearch.Elasticsearch(["149.165.170.209:9200"])
#res = Search(using=es, index="csv-itesm*").query("match", username="Erin")
#res = Search(using=es, index="csv-itesm*")\
#        .query('bool', filter=Q('exists', field='name') & Q('exists', field='major'))

# Print all records in indices matching csv-itesm* where the major is Engineering
res = Search(using=es, index="csv-itesm*").query("match", major="Engineering")
response = res.execute()
print(response)

# Let us print the number of records obtained
print("Total number of logs: %i \n" %response.hits.total)

# Print all records in indices matching csv-itesm*
res = Search(using=es, index="csv-itesm*")
response = res.execute()
print(response)

# Let us print the number of records obtained
print("Total number of logs: %i \n" %response.hits.total)

<Response: [<Hit(csv-itesm-gtp860219/9t5xf2sBz3JRcf9d6vmp): {'\ufeffname': 'Mike', 'major': 'Engineering'}>, <Hit(csv-itesm-gtp860219/feQHgGsBz3JRcf9d5ulA): {'\ufeffname': 'Mike', 'major': 'Engineering'}>, <Hit(csv-itesm-gtp860219/Et-Ef2sBz3JRcf9dXbQ3): {'\ufeffname': 'Mike', 'major': 'Engineering'}>]>
Total number of logs: 3 

<Response: [<Hit(csv-itesm-gtp860219/9t5xf2sBz3JRcf9d6vmp): {'\ufeffname': 'Mike', 'major': 'Engineering'}>, <Hit(csv-itesm-gtp860219/995xf2sBz3JRcf9d6vmp): {'\ufeffname': 'Erin', 'major': 'Computer Science'}>, <Hit(csv-itesm-gtp860219/feQHgGsBz3JRcf9d5ulA): {'\ufeffname': 'Mike', 'major': 'Engineering'}>, <Hit(csv-itesm-gtp860219/E9-Ef2sBz3JRcf9dXbQ3): {'\ufeffname': 'Erin', 'major': 'Computer Science'}>, <Hit(csv-itesm-gtp860219/fuQHgGsBz3JRcf9d5ulA): {'\ufeffname': 'Erin', 'major': 'Computer Science'}>, <Hit(csv-itesm-gtp860219/Et-Ef2sBz3JRcf9dXbQ3): {'\ufeffname': 'Mike', 'major': 'Engineering'}>]>
Total number of logs: 6 
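
For reference, the `Search` calls above end up sending a JSON query body to Elasticsearch. The sketch below writes such bodies by hand as plain dictionaries: a `match` query like the one on `major`, and a `bool` filter with `exists` clauses similar in intent to the commented-out example. The helper names `match_body` and `exists_body` are hypothetical, and the dictionaries are hand-written equivalents, not necessarily identical to what elasticsearch_dsl serializes.

```python
import json


def match_body(field, value):
    """Request body for a full-text match on a single field."""
    return {"query": {"match": {field: value}}}


def exists_body(*fields):
    """Request body that keeps only documents where every listed field is present."""
    return {"query": {"bool": {"filter": [{"exists": {"field": f}} for f in fields]}}}


engineering = match_body("major", "Engineering")
complete = exists_body("name", "major")

print(json.dumps(engineering))
print(json.dumps(complete))
```

A body like this can be passed to the low-level client's `search` method for an index pattern such as `csv-itesm*`; the exact keyword argument differs between client versions.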



### We will now create a DataFrame to hold the collected data.

In [9]:
# The CSV began with a byte-order mark, so the first field was indexed as "\ufeffname"
student_df = pd.DataFrame(((hit["\ufeffname"], hit["major"]) for hit in res.scan()),
                          columns=['name', 'major'])
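
The `\ufeff` prefix on the first field comes from a byte-order mark at the start of the CSV file, which was indexed as part of the column name. One way to avoid hard-coding that key is to normalize field names before building the DataFrame. This is a minimal sketch with a hypothetical helper, `strip_bom_keys`, applied to plain dictionaries that mimic the hits shown above.

```python
BOM = "\ufeff"


def strip_bom_keys(record):
    """Return a copy of `record` with any leading BOM removed from its keys."""
    return {key.lstrip(BOM): value for key, value in record.items()}


# Sample records shaped like the hits returned above (illustrative data only).
hits = [
    {"\ufeffname": "Mike", "major": "Engineering"},
    {"\ufeffname": "Erin", "major": "Computer Science"},
]

rows = [strip_bom_keys(h) for h in hits]
print(rows)
```

With real results, each hit could first be converted with `hit.to_dict()` and the cleaned rows passed to `pd.DataFrame(rows)`.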

### Print the first rows of the DataFrame

In [12]:
student_df.head()

|   | name | major |
|---|------|-------|
| 0 | Mike | Engineering |
| 1 | Erin | Computer Science |
| 2 | Mike | Engineering |
| 3 | Erin | Computer Science |
| 4 | Erin | Computer Science |
