Merge pull request #290 from tableau/dev
Merging models/small bug fixes into master
sbabayan committed May 13, 2019
2 parents 702499b + 61c8c54 commit a1387ed
Showing 52 changed files with 1,099 additions and 633 deletions.
7 changes: 7 additions & 0 deletions .coveragerc
@@ -0,0 +1,7 @@
[report]
# Exclude lines that match patterns from coverage report.
exclude_lines =
if __name__ == .__main__.:

# Only show one number after decimal point in report.
precision = 1
2 changes: 2 additions & 0 deletions .vscode/settings.json
@@ -1,6 +1,8 @@
{
"git.enabled": true,
"files.exclude": {
"**/build": true,
"**/dist": true,
"**/__pycache__": true,
"**/.pytest_cache": true,
"**/*.egg-info": true,
9 changes: 8 additions & 1 deletion CHANGELOG
@@ -1,8 +1,15 @@
# TabPy Changelog

This file list notable changes for TabPy project releases.
This file lists notable changes for TabPy project releases.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## v0.5

### Improvements

- Scripts, documentation, and integration tests for models
- Small bug fixes

## v0.4.1

- Added request context logging as a feature controlled with
4 changes: 1 addition & 3 deletions CONTRIBUTING.md
@@ -46,8 +46,6 @@ be able to work on TabPy changes:
cd TabPy
```

## Setting Up Environment

Before making any code changes, run the environment setup script.
For Windows run this command from the repository root folder:

@@ -99,7 +97,7 @@ details are on
## Documentation Updates

For any process, script, or API changes, documentation needs to be updated accordingly.
Please use markdown validation tools like web-based[markdownlint](https://dlaa.me/markdownlint/)
Please use markdown validation tools like web-based [markdownlint](https://dlaa.me/markdownlint/)
or npm [markdownlint-cli](https://github.com/igorshubovych/markdownlint-cli).

The TOC for a markdown file is built with [markdown-toc](https://www.npmjs.com/package/markdown-toc):
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.4.1
0.5
2 changes: 2 additions & 0 deletions docs/security.md
@@ -2,6 +2,8 @@

The following security issues should be kept in mind as you use TabPy with Tableau:

- tabpy_tools client does not validate that the tabpy server cert is signed
by a trusted CA
- REST server and Python execution context are the same, meaning they share a
  Python session, i.e. HTTP requests are served in the same space where
  user scripts are evaluated.
23 changes: 17 additions & 6 deletions docs/server-config.md
@@ -163,21 +163,32 @@ With the feature on, additional information is logged for HTTP requests: caller IP,
URL, client information (Tableau Desktop\Server), Tableau user name (for Tableau Server),
and TabPy user name, as shown in the example below:

<!-- markdownlint-disable MD013 -->
<!-- markdownlint-disable MD040 -->

```
2019-04-17,15:20:37 [INFO] (evaluation_plane_handler.py:evaluation_plane_handler:86):
::1 calls POST http://localhost:9004/evaluate,
Client: Tableau Server 2019.2,
Tableau user: ogolovatyi,
TabPy user: user1
function to evaluate=def _user_script(tabpy, _arg1, _arg2):
2019-05-02,13:50:08 [INFO] (base_handler.py:base_handler:90): Call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4, Caller: ::1, Method: POST, Resource: http://localhost:9004/evaluate, Client: Postman for manual testing, Tableau user: ogolovatyi
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Checking if need to handle authentication, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Handling authentication, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Checking request headers for authentication data, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Validating credentials for user name "user1", <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (state.py:state:484): Collecting Access-Control-Allow-Origin from state file...
2019-05-02,13:50:08 [INFO] (base_handler.py:base_handler:120): function to evaluate=def _user_script(tabpy, _arg1, _arg2):
res = []
for i in range(len(_arg1)):
res.append(_arg1[i] * _arg2[i])
return res
, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
```

<!-- markdownlint-enable MD040 -->
<!-- markdownlint-enable MD013 -->

No passwords are logged.

NOTE: the request context details are logged at INFO level.
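
For reference, a minimal config fragment that turns the feature on might look
like the following; the `TABPY_LOG_DETAILS` option name is an assumption here,
so verify it against the settings table earlier in this file:

```
[TabPy]
# Assumed option name for request context logging; check server-config.md.
TABPY_LOG_DETAILS = true
```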
4 changes: 4 additions & 0 deletions docs/server-configurations.md
@@ -7,6 +7,9 @@ To download a specific release of TabPy find it on

TabPy release | Python | Operating System | Owner | When confirmed to work | Comments
-------------- |------- |----------------- |------ |----------------------- |----------
0.4.1 | 3.6.5 | Windows 10 | @tableau | 2019-05-02 | Win 10 x64, Python 3.7.2 x64
0.4.1 | 3.6.5 | centOS 7.6-1810 | @tableau | 2019-05-02 |
0.4.1 | 3.6.5 | macOS Sierra | @tableau | 2019-05-02 |
0.3.2 | 3.6.5 | Windows 10 | @tableau | 2019-01-29 | Win 10 x64, Python 3.5.6 x64
0.3.2 | 3.6.5 | centOS 7.6-1810 | @tableau | 2019-01-30 |
0.3.2 | 3.6.5 | macOS Sierra | @tableau | 2019-01-29 |
@@ -16,4 +19,5 @@ tested with TabPy instances, but are not guaranteed to be 100% supported.

TabPy release | Python | Operating System | Owner | When confirmed to work | Comments
-------------- |------- |----------------- |------ |----------------------- |----------
0.4.1 | 3.6.7 | Ubuntu 18.04 | @tableau | 2019-05-03 | must be run as sudo
0.3.2 | 3.6.7 | Ubuntu 18.04 | @tableau | 2019-03-26 | Ubuntu ships Python 3.6.7
108 changes: 108 additions & 0 deletions docs/tabpy-tools.md
@@ -157,6 +157,114 @@ client.remove('WillItDefault')

```

## Predeployed Functions

To set up models, download the latest version of TabPy and follow the [instructions](server-download.md)
to install and start up your server. Once your server is running, navigate to the
models directory and run setup.py. If your TabPy server is running on the default
config (default.conf), you do not need to specify a config file when launching the
script. If your server is running with a custom config, you can specify the config
on the command line like so:

```sh

python setup.py custom.conf

```

The setup file will install all of the necessary dependencies (e.g. sklearn,
nltk, textblob, pandas, and numpy) and deploy all of the prebuilt models
located in `./models/scripts`. For every model that is successfully deployed,
a message will be printed to the console:

```sh
"Successfully deployed PCA"
```

If you would like to deploy additional models using the deploy script, you can
copy any Python file to the `./models/scripts` directory and modify setup.py to
include all necessary packages when installing dependencies, or alternatively
install all the required dependencies manually.
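
As a sketch of what such a file can look like, the following hypothetical
script (the `Multiply` name and function are illustrative, not one of the
shipped models) follows the same deploy pattern as the bundled scripts:

```python
# Hypothetical custom model script for ./models/scripts; 'Multiply' is an
# illustrative example, not one of the models shipped with TabPy.
from tabpy_tools.client import Client


def Multiply(_arg1, _arg2):
    '''Returns the element-wise product of two numeric columns.'''
    return [x * y for x, y in zip(_arg1, _arg2)]


if __name__ == '__main__':
    # Assumes a local TabPy server on the default port with no authentication.
    connection = Client('http://localhost:9004/')
    connection.deploy('Multiply', Multiply,
                      'Returns the element-wise product of two columns.',
                      override=True)
    print("Successfully deployed Multiply")
```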

You can deploy models individually by navigating to models/scripts/ and running
each file in isolation like so:

```sh

python PCA.py

```

As with the setup script, if your server is running with a custom config,
you can specify the config file's path on the command line.

### Principal Component Analysis (PCA)

[Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)
is a statistical technique that extracts new, linearly uncorrelated
variables from a dataset that capture the maximum variance in the
data. In this way, `PCA` can be used to reduce the number of variables
in a high-dimensional dataset, a process called dimensionality
reduction. The first principal component captures the largest amount of
variance; the second captures the largest portion of the remaining
variance while being orthogonal to the first, and so on. This allows the
number of dimensions to be reduced while maintaining as much of the
information from the original data as possible. `PCA` is useful in
exploratory data analysis because complex linear relationships can be
visualized in a 2D scatter plot of the first few principal components.
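
To make the idea concrete, here is a small standalone sketch (not part of the
TabPy models; the toy data is made up) that projects a three-variable dataset
onto its first two principal components with `scikit-learn`:

```python
# Standalone illustration of dimensionality reduction with PCA;
# the toy data below is invented for demonstration purposes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.9],
                 [2.2, 2.9, 0.4],
                 [1.9, 2.2, 0.8],
                 [3.1, 3.0, 0.2]])

# Normalize each variable to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data)

# Keep the first two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(projected)
```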

TabPy’s implementation of `PCA` uses the scikit-learn
[decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
algorithm, which is further documented [here](https://scikit-learn.org/stable/modules/decomposition.html#pca).
In the Tableau script, after the function name `PCA`, you must specify the
principal component to return. This integer input should be > 0 and <= the
number of variables you pass in to the function. When you pass categorical
variables, we perform `scikit-learn` [One Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
to transform your non-numeric variables into a one-hot numeric array of 0s and
1s. In order for `One Hot Encoding` to be performant, we limit the number
of unique values your categorical column may contain to 25 and do not permit
any nulls or empty strings in the column. Before `PCA`
is performed, all variables are normalized to have a mean of 0 and unit
variance using the `scikit-learn` [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

A Tableau calculated field to perform PCA will look like:

```python

tabpy.query('PCA', 1, _arg1, _arg2, _arg3)['response']

```
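
The same deployed endpoint can also be queried from Python through the
`tabpy_tools` client; a minimal sketch, assuming a local server without
authentication and made-up sample data:

```python
from tabpy_tools.client import Client

client = Client('http://localhost:9004/')

# Request the first principal component of three columns; the third column
# is categorical and will be one-hot encoded by the deployed model.
result = client.query('PCA', 1,
                      [6.35, 6.40, 6.65, 8.60],
                      [1.95, 1.95, 2.05, 3.05],
                      ['a', 'b', 'b', 'a'])
print(result['response'])
```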

### Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is
a technique which uses natural language processing to extract the emotional
positivity or negativity – the sentiment – behind a piece of text and convert
that into a numeric value. Our implementation of `sentiment analysis` returns a
polarity score between -1 and 1 which rates the positivity of the string, with
1 being very positive and -1 being very negative. Calling the `Sentiment
Analysis` function from TabPy in Tableau will look like the following,
where \_arg1 is a Tableau dimension containing text:

```python

tabpy.query('Sentiment Analysis', _arg1)['response']

```

Python provides multiple packages that compute `sentiment analysis` – our implementation
defaults to using [NLTK's sentiment package](https://www.nltk.org/api/nltk.sentiment.html).
If you would like to use [TextBlob's sentiment analysis](https://textblob.readthedocs.io/en/dev/quickstart.html)
algorithm, you can do so by specifying the optional argument `library='textblob'`
when calling the `Sentiment Analysis` function through a calculated field in
Tableau:

```python

tabpy.query('Sentiment Analysis', _arg1, library='textblob')['response']

```
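
For a sense of what these scores look like outside of Tableau, here is a small
standalone sketch that scores a couple of made-up sentences with both supported
libraries (NLTK's VADER lexicon must be downloaded once):

```python
# Standalone comparison of the two supported sentiment backends;
# the sample sentences are illustrative only.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download('vader_lexicon')  # one-time download of NLTK's sentiment lexicon

texts = ['What a wonderful, helpful feature!',
         'This is the worst experience I have had.']

sid = SentimentIntensityAnalyzer()
for text in texts:
    nltk_score = sid.polarity_scores(text)['compound']
    textblob_score = TextBlob(text).sentiment.polarity
    print(f'{text!r}: nltk={nltk_score:.3f}, textblob={textblob_score:.3f}')
```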

## Providing Schema Metadata

As soon as you share your deployed functions, you also need to share metadata
84 changes: 84 additions & 0 deletions models/scripts/PCA.py
@@ -0,0 +1,84 @@
from tabpy_tools.client import Client
import pandas as pd
from numpy import array
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).resolve().parent.parent.parent / 'models'))
from utils import setup_utils


def PCA(component, _arg1, _arg2, *_argN):
'''
Principal Component Analysis is a technique that extracts the key
    distinct components from a high dimensional space while attempting
to capture as much of the variance as possible. For more information
on the function and how to use it please refer to tabpy-tools.md
'''
cols = [_arg1, _arg2] + list(_argN)
encodedCols = []
labelEncoder = LabelEncoder()
oneHotEncoder = OneHotEncoder(categories='auto', sparse=False)

for col in cols:
if isinstance(col[0], (int, float)):
encodedCols.append(col)
elif type(col[0]) is bool:
intCol = array(col)
encodedCols.append(intCol.astype(int))
else:
if len(set(col)) > 25:
print('ERROR: Non-numeric arguments cannot have more than '
'25 unique values')
raise ValueError
integerEncoded = labelEncoder.fit_transform(array(col))
integerEncoded = integerEncoded.reshape(len(col), 1)
oneHotEncoded = oneHotEncoder.fit_transform(integerEncoded)
transformedMatrix = oneHotEncoded.transpose()
encodedCols += list(transformedMatrix)

dataDict = {}
for i in range(len(encodedCols)):
dataDict[f'col{1 + i}'] = list(encodedCols[i])

if component <= 0 or component > len(dataDict):
        print('ERROR: Component specified must be > 0 and '
              '<= number of arguments')
raise ValueError

df = pd.DataFrame(data=dataDict, dtype=float)
scale = StandardScaler()
scaledData = scale.fit_transform(df)

pca = sklearnPCA()
pcaComponents = pca.fit_transform(scaledData)

return pcaComponents[:, component - 1].tolist()


if __name__ == '__main__':
# running from setup.py
if len(sys.argv) > 1:
config_file_path = sys.argv[1]
else:
config_file_path = setup_utils.get_default_config_file_path()
port, auth_on, prefix = setup_utils.parse_config(config_file_path)

connection = Client(f'{prefix}://localhost:{port}/')

if auth_on:
# credentials are passed in from setup.py
if len(sys.argv) == 4:
user, passwd = sys.argv[2], sys.argv[3]
# running PCA independently
else:
user, passwd = setup_utils.get_creds()
connection.set_credentials(user, passwd)

connection.deploy('PCA', PCA,
'Returns the specified principal component.',
override=True)
print("Successfully deployed PCA")
62 changes: 62 additions & 0 deletions models/scripts/SentimentAnalysis.py
@@ -0,0 +1,62 @@
from tabpy_tools.client import Client
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).resolve().parent.parent.parent / 'models'))
from utils import setup_utils


def SentimentAnalysis(_arg1, library='nltk'):
'''
Sentiment Analysis is a procedure that assigns a score from -1 to 1
for a piece of text with -1 being negative and 1 being positive. For
more information on the function and how to use it please refer to
tabpy-tools.md
'''
if not (isinstance(_arg1[0], str)):
raise TypeError

library = library.lower()
supportedLibraries = {'nltk', 'textblob'}

if library not in supportedLibraries:
raise ValueError

scores = []
if library == 'nltk':
sid = SentimentIntensityAnalyzer()
for text in _arg1:
sentimentResults = sid.polarity_scores(text)
score = sentimentResults['compound']
scores.append(score)
elif library == 'textblob':
for text in _arg1:
currScore = TextBlob(text)
scores.append(currScore.sentiment.polarity)
return scores


if __name__ == '__main__':
# running from setup.py
if len(sys.argv) > 1:
config_file_path = sys.argv[1]
else:
config_file_path = setup_utils.get_default_config_file_path()
port, auth_on, prefix = setup_utils.parse_config(config_file_path)

connection = Client(f'{prefix}://localhost:{port}/')

if auth_on:
# credentials are passed in from setup.py
if len(sys.argv) == 4:
user, passwd = sys.argv[2], sys.argv[3]
# running Sentiment Analysis independently
else:
user, passwd = setup_utils.get_creds()
connection.set_credentials(user, passwd)

connection.deploy('Sentiment Analysis', SentimentAnalysis,
'Returns a sentiment score between -1 and '
'1 for a given string.', override=True)
print("Successfully deployed SentimentAnalysis")
