Merge pull request #290 from tableau/dev
Merging models/small bug fixes into master
sbabayan committed May 13, 2019
2 parents 702499b + 61c8c54 commit a1387ed
Showing 52 changed files with 1,099 additions and 633 deletions.
7 changes: 7 additions & 0 deletions .coveragerc
@@ -0,0 +1,7 @@
[report]
# Exclude lines that match patterns from coverage report.
exclude_lines =
if __name__ == .__main__.:

# Only show one number after decimal point in report.
precision = 1
2 changes: 2 additions & 0 deletions .vscode/settings.json
@@ -1,6 +1,8 @@
{
"git.enabled": true,
"files.exclude": {
"**/build": true,
"**/dist": true,
"**/__pycache__": true,
"**/.pytest_cache": true,
"**/*.egg-info": true,
9 changes: 8 additions & 1 deletion CHANGELOG
@@ -1,8 +1,15 @@
# TabPy Changelog

This file list notable changes for TabPy project releases.
This file lists notable changes for TabPy project releases.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## v0.5

### Improvements

- Scripts, documentation, and integration tests for models
- Small bug fixes

## v0.4.1

- Added request context logging as a feature controlled with
4 changes: 1 addition & 3 deletions CONTRIBUTING.md
@@ -46,8 +46,6 @@ be able to work on TabPy changes:
cd TabPy
```

## Setting Up Environment

Before making any code changes, run the environment setup script.
For Windows run this command from the repository root folder:

@@ -99,7 +97,7 @@ details are on
## Documentation Updates

For any process, script, or API changes, documentation needs to be updated accordingly.
Please use markdown validation tools like web-based[markdownlint](https://dlaa.me/markdownlint/)
Please use markdown validation tools like web-based [markdownlint](https://dlaa.me/markdownlint/)
or npm [markdownlint-cli](https://github.com/igorshubovych/markdownlint-cli).

The TOC for a markdown file is built with [markdown-toc](https://www.npmjs.com/package/markdown-toc):
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.4.1
0.5
2 changes: 2 additions & 0 deletions docs/security.md
@@ -2,6 +2,8 @@

The following security issues should be kept in mind as you use TabPy with Tableau:

- tabpy_tools client does not validate that the tabpy server cert is signed
by a trusted CA
- REST server and Python execution context are the same, meaning they share a
  Python session, i.e. HTTP requests are served in the same space where
  user scripts are evaluated.
23 changes: 17 additions & 6 deletions docs/server-config.md
@@ -163,21 +163,32 @@ With the feature on, additional information is logged for HTTP requests: caller IP,
URL, client information (Tableau Desktop\Server), Tableau user name (for Tableau Server),
and TabPy user name, as shown in the example below:

<!-- markdownlint-disable MD013 -->
<!-- markdownlint-disable MD040 -->

```
2019-04-17,15:20:37 [INFO] (evaluation_plane_handler.py:evaluation_plane_handler:86):
::1 calls POST http://localhost:9004/evaluate,
Client: Tableau Server 2019.2,
Tableau user: ogolovatyi,
TabPy user: user1
function to evaluate=def _user_script(tabpy, _arg1, _arg2):
2019-05-02,13:50:08 [INFO] (base_handler.py:base_handler:90): Call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4, Caller: ::1, Method: POST, Resource: http://localhost:9004/evaluate, Client: Postman for manual testing, Tableau user: ogolovatyi
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Checking if need to handle authentication, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Handling authentication, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Checking request headers for authentication data, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (base_handler.py:base_handler:120): Validating credentials for user name "user1", <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
2019-05-02,13:50:08 [DEBUG] (state.py:state:484): Collecting Access-Control-Allow-Origin from state file...
2019-05-02,13:50:08 [INFO] (base_handler.py:base_handler:120): function to evaluate=def _user_script(tabpy, _arg1, _arg2):
res = []
for i in range(len(_arg1)):
res.append(_arg1[i] * _arg2[i])
return res
, <<call ID: 934073bd-0d29-46d3-b693-b1e4b1efa9e4>>
```

<!-- markdownlint-enable MD040 -->
<!-- markdownlint-enable MD013 -->

No passwords are logged.

NOTE: the request context details are logged at INFO level.
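
For reference, a minimal config fragment that turns the feature on might look
like the following; the `TABPY_LOG_DETAILS` option name is an assumption here,
so verify it against the settings table earlier in this file:

```
[TabPy]
# Assumed option name for request context logging; check server-config.md.
TABPY_LOG_DETAILS = true
```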
4 changes: 4 additions & 0 deletions docs/server-configurations.md
@@ -7,6 +7,9 @@ To download a specific release of TabPy find it on

TabPy release | Python | Operating System | Owner | When confirmed to work | Comments
-------------- |------- |----------------- |------ |----------------------- |----------
0.4.1 | 3.6.5 | Windows 10 | @tableau | 2019-05-02 | Win 10 x64, Python 3.7.2 x64
0.4.1 | 3.6.5 | centOS 7.6-1810 | @tableau | 2019-05-02 |
0.4.1 | 3.6.5 | macOS Sierra | @tableau | 2019-05-02 |
0.3.2 | 3.6.5 | Windows 10 | @tableau | 2019-01-29 | Win 10 x64, Python 3.5.6 x64
0.3.2 | 3.6.5 | centOS 7.6-1810 | @tableau | 2019-01-30 |
0.3.2 | 3.6.5 | macOS Sierra | @tableau | 2019-01-29 |
@@ -16,4 +19,5 @@ tested with TabPy instances, but are not guaranteed to be 100% supported.

TabPy release | Python | Operating System | Owner | When confirmed to work | Comments
-------------- |------- |----------------- |------ |----------------------- |----------
0.4.1 | 3.6.7 | Ubuntu 18.04 | @tableau | 2019-05-03 | must be run as sudo
0.3.2 | 3.6.7 | Ubuntu 18.04 | @tableau | 2019-03-26 | Ubuntu ships Python 3.6.7
108 changes: 108 additions & 0 deletions docs/tabpy-tools.md
@@ -157,6 +157,114 @@ client.remove('WillItDefault')

```

## Predeployed Functions

To set up models, download the latest version of TabPy and follow the [instructions](server-download.md)
to install and start up your server. Once your server is running, navigate to the
models directory and run setup.py. If your TabPy server is running on the default
config (default.conf), you do not need to specify a config file when launching the
script. If your server is running with a custom config, you can specify the config
on the command line like so:

```sh

python setup.py custom.conf

```

The setup file will install all of the necessary dependencies (e.g. sklearn,
nltk, textblob, pandas, and numpy) and deploy all of the prebuilt models
located in `./models/scripts`. For every model that is successfully deployed,
a message will be printed to the console:

```sh
"Successfully deployed PCA"
```

If you would like to deploy additional models using the deploy script, you can
copy any Python file to the `./models/scripts` directory and modify setup.py to
include all necessary packages when installing dependencies, or alternatively
install all the required dependencies manually.
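
As a sketch of what such a file can look like, the following hypothetical
script (the `Multiply` name and function are illustrative, not one of the
shipped models) follows the same deploy pattern as the bundled scripts:

```python
# Hypothetical custom model script for ./models/scripts; 'Multiply' is an
# illustrative example, not one of the models shipped with TabPy.
from tabpy_tools.client import Client


def Multiply(_arg1, _arg2):
    '''Returns the element-wise product of two numeric columns.'''
    return [x * y for x, y in zip(_arg1, _arg2)]


if __name__ == '__main__':
    # Assumes a local TabPy server on the default port with no authentication.
    connection = Client('http://localhost:9004/')
    connection.deploy('Multiply', Multiply,
                      'Returns the element-wise product of two columns.',
                      override=True)
    print("Successfully deployed Multiply")
```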

You can deploy models individually by navigating to models/scripts/ and running
each file in isolation like so:

```sh

python PCA.py

```

As with the setup script, if your server is running with a custom config,
you can specify the config file's path on the command line.

### Principal Component Analysis (PCA)

[Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)
is a statistical technique that extracts new, linearly uncorrelated
variables from a dataset that capture the maximum variance in the
data. In this way, `PCA` can be used to reduce the number of variables
in a high-dimensional dataset, a process called dimensionality
reduction. The first principal component captures the largest amount of
variance; the second captures the largest portion of the remaining
variance while being orthogonal to the first, and so on. This allows the
number of dimensions to be reduced while maintaining as much of the
information from the original data as possible. `PCA` is useful in
exploratory data analysis because complex linear relationships can be
visualized in a 2D scatter plot of the first few principal components.
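
To make the idea concrete, here is a small standalone sketch (not part of the
TabPy models; the toy data is made up) that projects a three-variable dataset
onto its first two principal components with `scikit-learn`:

```python
# Standalone illustration of dimensionality reduction with PCA;
# the toy data below is invented for demonstration purposes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.9],
                 [2.2, 2.9, 0.4],
                 [1.9, 2.2, 0.8],
                 [3.1, 3.0, 0.2]])

# Normalize each variable to zero mean and unit variance.
scaled = StandardScaler().fit_transform(data)

# Keep the first two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(scaled)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
print(projected)
```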

TabPy’s implementation of `PCA` uses the scikit-learn
[decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
algorithm, which is further documented [here](https://scikit-learn.org/stable/modules/decomposition.html#pca).
In the Tableau script, after the function name `PCA`, you must specify the
principal component to return. This integer input should be > 0 and <= the
number of variables you pass in to the function. When you pass categorical
variables, we perform `scikit-learn` [One Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
to transform your non-numeric variables into a one-hot numeric array of 0s and
1s. In order for `One Hot Encoding` to be performant, we limit the number
of unique values your categorical column may contain to 25 and do not permit
any nulls or empty strings in the column. Before `PCA`
is performed, all variables are normalized to have a mean of 0 and unit
variance using the `scikit-learn` [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

A Tableau calculated field to perform PCA will look like:

```python

tabpy.query('PCA', 1, _arg1, _arg2, _arg3)['response']

```
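
The same deployed endpoint can also be queried from Python through the
`tabpy_tools` client; a minimal sketch, assuming a local server without
authentication and made-up sample data:

```python
from tabpy_tools.client import Client

client = Client('http://localhost:9004/')

# Request the first principal component of three columns; the third column
# is categorical and will be one-hot encoded by the deployed model.
result = client.query('PCA', 1,
                      [6.35, 6.40, 6.65, 8.60],
                      [1.95, 1.95, 2.05, 3.05],
                      ['a', 'b', 'b', 'a'])
print(result['response'])
```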

### Sentiment Analysis

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is
a technique which uses natural language processing to extract the emotional
positivity or negativity – the sentiment – behind a piece of text and convert
that into a numeric value. Our implementation of `sentiment analysis` returns a
polarity score between -1 and 1 which rates the positivity of the string, with
1 being very positive and -1 being very negative. Calling the `Sentiment
Analysis` function from TabPy in Tableau will look like the following,
where \_arg1 is a Tableau dimension containing text:

```python

tabpy.query('Sentiment Analysis', _arg1)['response']

```

Python provides multiple packages that compute `sentiment analysis` – our implementation
defaults to using [NLTK's sentiment package](https://www.nltk.org/api/nltk.sentiment.html).
If you would like to use [TextBlob's sentiment analysis](https://textblob.readthedocs.io/en/dev/quickstart.html)
algorithm, you can do so by specifying the optional argument `library='textblob'`
when calling the `Sentiment Analysis` function through a calculated field in
Tableau:

```python

tabpy.query('Sentiment Analysis', _arg1, library='textblob')['response']

```
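
For a sense of what these scores look like outside of Tableau, here is a small
standalone sketch that scores a couple of made-up sentences with both supported
libraries (NLTK's VADER lexicon must be downloaded once):

```python
# Standalone comparison of the two supported sentiment backends;
# the sample sentences are illustrative only.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download('vader_lexicon')  # one-time download of NLTK's sentiment lexicon

texts = ['What a wonderful, helpful feature!',
         'This is the worst experience I have had.']

sid = SentimentIntensityAnalyzer()
for text in texts:
    nltk_score = sid.polarity_scores(text)['compound']
    textblob_score = TextBlob(text).sentiment.polarity
    print(f'{text!r}: nltk={nltk_score:.3f}, textblob={textblob_score:.3f}')
```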

## Providing Schema Metadata

As soon as you share your deployed functions, you also need to share metadata
84 changes: 84 additions & 0 deletions models/scripts/PCA.py
@@ -0,0 +1,84 @@
from tabpy_tools.client import Client
import pandas as pd
from numpy import array
from sklearn.decomposition import PCA as sklearnPCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).resolve().parent.parent.parent / 'models'))
from utils import setup_utils


def PCA(component, _arg1, _arg2, *_argN):
'''
Principal Component Analysis is a technique that extracts the key
    distinct components from a high dimensional space while attempting
to capture as much of the variance as possible. For more information
on the function and how to use it please refer to tabpy-tools.md
'''
cols = [_arg1, _arg2] + list(_argN)
encodedCols = []
labelEncoder = LabelEncoder()
oneHotEncoder = OneHotEncoder(categories='auto', sparse=False)

for col in cols:
if isinstance(col[0], (int, float)):
encodedCols.append(col)
elif type(col[0]) is bool:
intCol = array(col)
encodedCols.append(intCol.astype(int))
else:
if len(set(col)) > 25:
print('ERROR: Non-numeric arguments cannot have more than '
'25 unique values')
raise ValueError
integerEncoded = labelEncoder.fit_transform(array(col))
integerEncoded = integerEncoded.reshape(len(col), 1)
oneHotEncoded = oneHotEncoder.fit_transform(integerEncoded)
transformedMatrix = oneHotEncoded.transpose()
encodedCols += list(transformedMatrix)

dataDict = {}
for i in range(len(encodedCols)):
dataDict[f'col{1 + i}'] = list(encodedCols[i])

if component <= 0 or component > len(dataDict):
        print('ERROR: Component specified must be > 0 and '
              '<= number of arguments')
raise ValueError

df = pd.DataFrame(data=dataDict, dtype=float)
scale = StandardScaler()
scaledData = scale.fit_transform(df)

pca = sklearnPCA()
pcaComponents = pca.fit_transform(scaledData)

return pcaComponents[:, component - 1].tolist()


if __name__ == '__main__':
# running from setup.py
if len(sys.argv) > 1:
config_file_path = sys.argv[1]
else:
config_file_path = setup_utils.get_default_config_file_path()
port, auth_on, prefix = setup_utils.parse_config(config_file_path)

connection = Client(f'{prefix}://localhost:{port}/')

if auth_on:
# credentials are passed in from setup.py
if len(sys.argv) == 4:
user, passwd = sys.argv[2], sys.argv[3]
# running PCA independently
else:
user, passwd = setup_utils.get_creds()
connection.set_credentials(user, passwd)

connection.deploy('PCA', PCA,
'Returns the specified principal component.',
override=True)
print("Successfully deployed PCA")
62 changes: 62 additions & 0 deletions models/scripts/SentimentAnalysis.py
@@ -0,0 +1,62 @@
from tabpy_tools.client import Client
from textblob import TextBlob
from nltk.sentiment import SentimentIntensityAnalyzer
import sys
from pathlib import Path
sys.path.append(str(Path(__file__).resolve().parent.parent.parent / 'models'))
from utils import setup_utils


def SentimentAnalysis(_arg1, library='nltk'):
'''
Sentiment Analysis is a procedure that assigns a score from -1 to 1
for a piece of text with -1 being negative and 1 being positive. For
more information on the function and how to use it please refer to
tabpy-tools.md
'''
if not (isinstance(_arg1[0], str)):
raise TypeError

library = library.lower()
supportedLibraries = {'nltk', 'textblob'}

if library not in supportedLibraries:
raise ValueError

scores = []
if library == 'nltk':
sid = SentimentIntensityAnalyzer()
for text in _arg1:
sentimentResults = sid.polarity_scores(text)
score = sentimentResults['compound']
scores.append(score)
elif library == 'textblob':
for text in _arg1:
currScore = TextBlob(text)
scores.append(currScore.sentiment.polarity)
return scores


if __name__ == '__main__':
# running from setup.py
if len(sys.argv) > 1:
config_file_path = sys.argv[1]
else:
config_file_path = setup_utils.get_default_config_file_path()
port, auth_on, prefix = setup_utils.parse_config(config_file_path)

connection = Client(f'{prefix}://localhost:{port}/')

if auth_on:
# credentials are passed in from setup.py
if len(sys.argv) == 4:
user, passwd = sys.argv[2], sys.argv[3]
# running Sentiment Analysis independently
else:
user, passwd = setup_utils.get_creds()
connection.set_credentials(user, passwd)

connection.deploy('Sentiment Analysis', SentimentAnalysis,
'Returns a sentiment score between -1 and '
'1 for a given string.', override=True)
print("Successfully deployed SentimentAnalysis")
