# 👋 Welcome to the PyCon 2022 Demo of Eland!

The Elasticsearch cluster being used for this demo is [hosted on Elastic Cloud](https://cloud.elastic.co)
and this Jupyter Notebook is [hosted by Binder](https://mybinder.org).

## Resources

### ➤ [Eland Documentation (eland.readthedocs.io)](https://eland.readthedocs.io)
### ➤ [Eland Source Code on GitHub](https://github.com/elastic/eland)
### ➤ [Jupyter Notebook Source Code](https://github.com/sethmlarson/eland-binder-demo)
### ➤ [Tweets indexed by Tweepy](https://www.tweepy.org)

## Installing

Eland is available on [PyPI](https://pypi.org/project/eland) with Pip:

```
$ python -m pip install eland
```

and on [Conda Forge](https://anaconda.org/conda-forge/eland):

```
$ conda install -c conda-forge eland
```

## Getting Started

This Jupyter Notebook works like any other: you can browse the already completed results, or you can try things out yourself by modifying a code block and hitting "Run". The Elasticsearch cluster in Elastic Cloud will respond with new results.

If you're unfamiliar with Jupyter Notebooks there's a [quickstart guide available](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook).

Our Elasticsearch cluster is pre-loaded with tweets about PyCon US 2022 in the `tweets` index, which includes a mix of data types that we can use, including text, keywords, integers, and dates.

In [25]:
import pandas as pd
import matplotlib.pyplot as plt

import eland as ed
from elasticsearch import Elasticsearch

## Creating a Connection to Elasticsearch

Our Elasticsearch instance is running in [Elastic Cloud](https://www.elastic.co/cloud)
and uses an [API Key](https://www.elastic.co/guide/en/kibana/master/api-keys.html)
configured to be read-only, with access to the `tweets` index.

In [26]:
es = Elasticsearch(
  cloud_id="PyCon_US_2022_Demo:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvJDY5N2MyMjBiMzZlZDRlNmQ5YTNkYmQ2NGYzMDFiZjY5JGQ5MTk0OTgzMDQ1MzQwZDQ5ZGM3N2U0NGQ4NmRjZjZh",
  api_key=("DS18dYAB3ckAcsHiHYu7", "qBkMFMpsQ4WsxaBxvwOdfg")
)
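
The `cloud_id` value looks opaque, but as far as I know it is just a deployment name, a colon, and a base64-encoded string of `$`-separated parts, the first of which is the host. The sketch below builds a made-up Cloud ID and decodes it again. The name, host, and IDs are hypothetical placeholders, not this demo's cluster, and the layout is an assumption rather than something this notebook documents.

```python
import base64

# Hypothetical values for illustration only (not the demo cluster).
host = "region.example.io"
es_id = "abc123"
kibana_id = "def456"

# Build a Cloud ID of the assumed form "<name>:<base64(host$es_id$kibana_id)>".
encoded = base64.b64encode(f"{host}${es_id}${kibana_id}".encode()).decode()
cloud_id = f"My_Deployment:{encoded}"

# Decode it again: split off the name, then base64-decode the remainder.
name, _, payload = cloud_id.partition(":")
decoded_host, decoded_es_id, decoded_kibana_id = (
    base64.b64decode(payload).decode().split("$")
)

print(name, decoded_host, decoded_es_id, decoded_kibana_id)
```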

Now we can create an [`eland.DataFrame`](https://eland.readthedocs.io/en/latest/reference/dataframe.html) from the `tweets` index:

In [27]:
df = ed.DataFrame(
  es, es_index_pattern="tweets",
)

## Exploring the Dataset

You can explore a DataFrame in Eland the same way you would a Pandas DataFrame. If we just look at the `df` instance we receive:

In [28]:
df

Unnamed: 0,@timestamp,author,content,language,likes,replies,retweets,tweet_id
1520031058187546625,2022-04-29 13:23:39+00:00,Mohammad_ML_Eng,Guys! Anyone can just grab a speaker badge! I’...,en,0,0,0,1520031058187546625
1520030638669062144,2022-04-29 13:21:59+00:00,Rish_co59,"@driscollis @pycon Thanks @driscollis , just g...",en,0,0,0,1520030638669062144
1520030534520344578,2022-04-29 13:21:34+00:00,Mohammad_ML_Eng,@pycon @llanga Already seeing people without m...,en,0,0,0,1520030534520344578
1520029314313424896,2022-04-29 13:16:43+00:00,djdarkbeat,Heading into #PyCon2022 soon. Who is here? Ru...,en,0,0,0,1520029314313424896
1520028739064971264,2022-04-29 13:14:26+00:00,tukang_logika,"@driscollis @pycon awesome ... \nThis book, by...",en,0,0,0,1520028739064971264
...,...,...,...,...,...,...,...,...
1516844730486771720,2022-04-20 18:22:20+00:00,masonegger,@lisaironcutter @pycon You and @bphogan taught...,en,3,1,0,1516844730486771720
1516845911028416512,2022-04-20 18:27:01+00:00,pycon,Thanks a bunch to #PyConUS2022 Supporting and ...,en,3,0,1,1516845911028416512
1516849506935005188,2022-04-20 18:41:18+00:00,lisaironcutter,@masonegger @pycon @bphogan https://t.co/By9yb...,und,0,0,0,1516849506935005188
1516849958187532288,2022-04-20 18:43:06+00:00,ericholscher,Sadly decided I'm not going to Pycon. I'm stil...,en,60,4,0,1516849958187532288


Looks just like a pandas DataFrame! Let's look closer:

In [29]:
df.info()

<class 'eland.dataframe.DataFrame'>
Index: 1187 entries, 1520031058187546625 to 1516844730486771720
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   @timestamp  1187 non-null   datetime64[ns]
 1   author      1187 non-null   object        
 2   content     1187 non-null   object        
 3   language    1187 non-null   object        
 4   likes       1187 non-null   int64         
 5   replies     1187 non-null   int64         
 6   retweets    1187 non-null   int64         
 7   tweet_id    1187 non-null   object        
dtypes: datetime64[ns](1), int64(3), object(4)
memory usage: 64.000 bytes
Elasticsearch storage usage: 1.226 MB


## Window into Elasticsearch

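An Eland DataFrame doesn't hold a local copy of the data: each operation queries Elasticsearch, so when data changes in the index, results such as `df.shape` reflect the change the next time they're evaluated.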

In [30]:
df.shape

(1187, 8)
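
One way to picture this "window" behaviour is an object that stores no rows of its own and asks the backing store every time. The sketch below is a toy model in plain Python, not Eland's implementation; `LiveView` and the list standing in for an index are made up for illustration.

```python
class LiveView:
    """Toy stand-in for a lazily evaluated frame (hypothetical, not Eland code)."""

    def __init__(self, store, columns):
        self._store = store      # reference to the backing data, never copied
        self._columns = columns

    @property
    def shape(self):
        # Recomputed on every access, so it always reflects the current store.
        return (len(self._store), len(self._columns))


index = [{"author": "pycon", "likes": 3}]  # pretend this list is an index
view = LiveView(index, ["author", "likes"])
before = view.shape

index.append({"author": "someone", "likes": 0})  # a new document arrives
after = view.shape

print(before, after)
```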

## Mapping Data Types from Elasticsearch to Pandas

Eland maps [Elasticsearch Field datatypes](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html) into [datatypes understood by pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes).

We can see the mapping between the two dtypes nicely by joining the
traditional `DataFrame.dtypes` series with `DataFrame.es_dtypes` to show
what types are being used within Elasticsearch:

In [31]:
pd_dtypes = df.dtypes.rename("pandas")
es_dtypes = df.es_dtypes.rename("Elasticsearch")

pd_dtypes.to_frame().join(es_dtypes)

Unnamed: 0,pandas,Elasticsearch
@timestamp,datetime64[ns],date
author,object,keyword
content,object,text
language,object,keyword
likes,int64,long
replies,int64,long
retweets,int64,long
tweet_id,object,keyword
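
Read as a lookup table, the output above amounts to the following. This is only a restatement of the four field types that appear in the `tweets` index, written as a plain dictionary; it is not Eland's internal mapping, which handles more field types than shown here.

```python
# Elasticsearch field type -> pandas dtype, copied from the output above.
ES_TO_PANDAS = {
    "date": "datetime64[ns]",
    "keyword": "object",
    "text": "object",
    "long": "int64",
}

# Field types of the `tweets` index as reported by `df.es_dtypes`.
tweets_es_types = {
    "@timestamp": "date",
    "author": "keyword",
    "content": "text",
    "language": "keyword",
    "likes": "long",
    "replies": "long",
    "retweets": "long",
    "tweet_id": "keyword",
}

# Derive the pandas dtype for each field from its Elasticsearch type.
pandas_dtypes = {
    field: ES_TO_PANDAS[es_type] for field, es_type in tweets_es_types.items()
}

for field, dtype in pandas_dtypes.items():
    print(f"{field:12} {dtype}")
```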


## Retrieving Data from Elasticsearch

Here we ask for the first 5 entries via the `head()` method. The result is still an Eland DataFrame, so the data hasn't been pulled out of Elasticsearch into local memory yet. To do that we use `to_pandas()`, but be careful: we don't want to dump all the data from our cluster at once.

In [10]:
# Still an Eland DataFrame:
head_df = df.head(5)
print(type(head_df))

head_df

<class 'eland.dataframe.DataFrame'>


Unnamed: 0,@timestamp,author,content,language,likes,replies,retweets,tweet_id
1520031058187546625,2022-04-29 13:23:39+00:00,Mohammad_ML_Eng,Guys! Anyone can just grab a speaker badge! I’...,en,0,0,0,1520031058187546625
1520030638669062144,2022-04-29 13:21:59+00:00,Rish_co59,"@driscollis @pycon Thanks @driscollis , just g...",en,0,0,0,1520030638669062144
1520030534520344578,2022-04-29 13:21:34+00:00,Mohammad_ML_Eng,@pycon @llanga Already seeing people without m...,en,0,0,0,1520030534520344578
1520029314313424896,2022-04-29 13:16:43+00:00,djdarkbeat,Heading into #PyCon2022 soon. Who is here? Ru...,en,0,0,0,1520029314313424896
1520028739064971264,2022-04-29 13:14:26+00:00,tukang_logika,"@driscollis @pycon awesome ... \nThis book, by...",en,0,0,0,1520028739064971264


In [11]:
# Now we're a pandas DataFrame:
pd_df = head_df.to_pandas()
print(type(pd_df))

pd_df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,@timestamp,author,content,language,likes,replies,retweets,tweet_id
1520031058187546625,2022-04-29 13:23:39+00:00,Mohammad_ML_Eng,Guys! Anyone can just grab a speaker badge! I’...,en,0,0,0,1520031058187546625
1520030638669062144,2022-04-29 13:21:59+00:00,Rish_co59,"@driscollis @pycon Thanks @driscollis , just g...",en,0,0,0,1520030638669062144
1520030534520344578,2022-04-29 13:21:34+00:00,Mohammad_ML_Eng,@pycon @llanga Already seeing people without m...,en,0,0,0,1520030534520344578
1520029314313424896,2022-04-29 13:16:43+00:00,djdarkbeat,Heading into #PyCon2022 soon. Who is here? Ru...,en,0,0,0,1520029314313424896
1520028739064971264,2022-04-29 13:14:26+00:00,tukang_logika,"@driscollis @pycon awesome ... \nThis book, by...",en,0,0,0,1520028739064971264


## Filtering Rows in a Data Frame

You can apply filters to an Eland DataFrame in the same way that you can with Pandas. Here we're filtering for tweets with an `@timestamp` on or after midnight today whose `author` isn't `pycon`.
Then we keep only the `@timestamp`, `author`, and `content` columns, convert the result to pandas, sort by `@timestamp`, and show the first 10 rows:

In [20]:
from datetime import datetime

# Midnight today (local time); zero the microseconds too so it's exactly 00:00:00
today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)

df[
  (df["@timestamp"] >= today)
  & (df.author != "pycon")
].filter(
  ["@timestamp", "author", "content"]
).to_pandas().sort_values("@timestamp").head(10)

Unnamed: 0,@timestamp,author,content
1519828816855715840,2022-04-29 00:00:01+00:00,Auth0Ambassador,Going to @pycon US this week?!\n\n👋Stop by and...
1519829570811318274,2022-04-29 00:03:01+00:00,Boldstartvc,This is so great! Look for @SlimDevOps at @pyc...
1519830536855773186,2022-04-29 00:06:51+00:00,andrewgodwin,While I am missing a lot of people who couldn'...
1519830688760877056,2022-04-29 00:07:27+00:00,__biancarosa,Found my ex at @pycon https://t.co/ajoAY0T5Yl
1519831135164899330,2022-04-29 00:09:14+00:00,ThatEliGuyatOCI,Opening up at PyCon 2022. @AccordionGuy at the...
1519831372138835969,2022-04-29 00:10:10+00:00,anvil_works,It’s looking busy at @pycon! Come by to see a ...
1519832200702619648,2022-04-29 00:13:28+00:00,StoneVaughanLaw,@hjwp you gonna be at PyCon 2022?
1519832577959514114,2022-04-29 00:14:58+00:00,PandaConstantin,@JavaFXpert @pycon @qiskit Yeah !!
1519832867898937345,2022-04-29 00:16:07+00:00,herlo,@trixtur I'm at @pycon in SLC. Where are you?
1519834573076541440,2022-04-29 00:22:54+00:00,jshell,@calvinhp Holy shit. I didn't even realize PyC...
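
`datetime.now()` returns a naive timestamp in the local timezone of wherever the notebook runs, while the `@timestamp` values above carry a `+00:00` offset. If you'd rather pin "midnight today" to UTC, a minimal standard-library sketch looks like this. Whether Eland needs a timezone-aware value for the comparison is not something this notebook establishes, so treat it as a starting point.

```python
from datetime import datetime, timezone

# Current time as a timezone-aware UTC datetime.
now_utc = datetime.now(timezone.utc)

# Truncate to the start of the current UTC day.
midnight_utc = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)

print(midnight_utc.isoformat())
```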


## Aggregations with Eland

Eland supports many Pandas aggregations including min, mean, median, max, var, std, count, nunique, sum, and mad.

Aggregations are mapped to Elasticsearch aggs and then unpacked into a Pandas DataFrame or Series.

Read more about [`eland.DataFrame.agg()`](https://eland.readthedocs.io/en/latest/reference/api/eland.DataFrame.agg.html#eland.DataFrame.agg)

In [21]:
df[["likes", "retweets", "replies"]].agg(["min", "mean", "median", "max", "var", "std"])

Unnamed: 0,likes,retweets,replies
min,0.0,0.0,0.0
mean,8.811289,1.048863,0.698399
median,2.0,0.0,0.0
max,1066.0,100.0,72.0
var,1586.828399,16.695464,8.380976
std,39.860203,4.088592,2.896822
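
To make the aggregation names concrete, here's a plain-Python sketch on a handful of made-up like counts. It uses the sample variance and standard deviation (the `n - 1` denominator), which is the pandas default; this notebook doesn't show whether Eland's `var` and `std` use the same convention, so that part is an assumption.

```python
import statistics

likes = [0, 2, 2, 4, 12]  # made-up like counts, not taken from the index

summary = {
    "min": min(likes),
    "mean": statistics.mean(likes),
    "median": statistics.median(likes),
    "max": max(likes),
    "var": statistics.variance(likes),  # sample variance (n - 1 denominator)
    "std": statistics.stdev(likes),     # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name:6} {value}")
```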


In [22]:
df.language.value_counts()

en     1032
es       50
und      45
ja       27
pt       14
pl        4
ca        3
in        3
fr        2
it        2
Name: language, dtype: int64
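
Conceptually, `value_counts()` tallies how often each distinct value appears and lists the tallies from most to least common. A plain-Python sketch of the same idea, using a made-up list of language codes rather than the indexed tweets:

```python
from collections import Counter

# Made-up language codes for illustration.
languages = ["en", "es", "en", "ja", "en", "es"]

# Tally each distinct value, most common first.
counts = Counter(languages).most_common()

for language, count in counts:
    print(f"{language:4} {count}")
```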

## Machine Learning

Since this is a read-only cluster, you can't upload your own models.
However, I can show you how an existing model on the cluster was
trained.

More information on [Machine Learning in Eland](https://eland.readthedocs.io/en/latest/reference/ml.html)

For this Machine Learning demo we use the scikit-learn [`wine` classifier dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)
and a [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) model. The full list of supported model types is available in the Machine Learning docs for Eland.

### Import Scikit-Learn and Train Model Locally

```python
>>> from sklearn import datasets
>>> from sklearn.tree import DecisionTreeClassifier

>>> digits = datasets.load_wine()
>>> print("Feature Names:", digits.feature_names)
Feature Names: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

>>> print("Data example:", digits.data[0])
Data example: [1.423e+01 1.710e+00 2.430e+00 1.560e+01 1.270e+02 2.800e+00 3.060e+00
 2.800e-01 2.290e+00 5.640e+00 1.040e+00 3.920e+00 1.065e+03]

# Save 10, 80, and 140 for testing our model
>>> data = [x for i, x in enumerate(digits.data) if i not in (10, 80, 140)]
>>> target = [x for i, x in enumerate(digits.target) if i not in (10, 80, 140)]

# Fit the other data to a DecisionTreeClassifier
>>> sk_classifier = DecisionTreeClassifier()
>>> sk_classifier.fit(data, target)
```

### Test the Locally Trained Model

```python
>>> print(sk_classifier.predict(digits.data[[10, 80, 140]]))
[0 1 2]

>>> print(digits.target[[10, 80, 140]])
[0 1 2]
```

### Serialize the Scikit-Learn Model into Elasticsearch

```python
>>> from eland.ml import MLModel

>>> es_classifier = MLModel.import_model(
...     es_client=es,
...     model_id="wine-classifier",
...     model=sk_classifier,
...     feature_names=digits.feature_names,
...     es_if_exists="replace"
... )
```

### Run the Model in Elasticsearch!

```python
>>> print(es_classifier.predict(digits.data[[10, 80, 140]]))
[0 1 2]

# Tada!
```

## What's Coming Next?

Eland is open source on GitHub, and if [you're interested in helping build new features](https://github.com/elastic/eland/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) we'd love to have you!

Here's a list of features that are coming soon to Eland:

#### – Native [Full-Text Search](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html) with `DataFrame.es_match()`

#### – Native [Geo-spatial Queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html) and Integration with [Geopandas](https://geopandas.org/)

#### – Aggregate and Visualize Time-Series Data

#### – Pivoted Aggregations with `DataFrame.groupby()`