Singer.io tap for extracting data from the Slack Web API
It is highly recommended to install tap-slack in its own isolated virtual environment. For example:
python3 -m venv ~/.venvs/tap-slack
source ~/.venvs/tap-slack/bin/activate
pip3 install tap-slack
deactivate
The tap requires a Slack API token to interact with your Slack workspace. You can obtain a token for a single workspace by creating a new Slack App in your workspace and assigning it the relevant scopes. As of right now, the minimum required scopes for this App are:
channels:history
channels:join
channels:read
files:read
groups:read
links:read
reactions:read
remote_files:read
remote_files:write
team:read
usergroups:read
users.profile:read
users:read
users:read.email
Create a config file containing the API token and a start date, e.g.:
{
"token":"xxxx",
"start_date":"2020-05-01T00:00:00"
}
Optionally, you can also specify whether you want to sync private channels or not by adding the following to the config:
"private_channels":"false"
By default, private channels will be synced.
By adding the following to your config file, you can have the tap auto-join all public channels in your organization:
"join_public_channels":"true"
If you do not elect to have the tap join all public channels you must invite the bot to all channels you wish to sync.
By default, the tap will sync all channels it has been invited to. However, you can limit the tap to sync only the channels you specify by adding their IDs to the config file, e.g.:
"channels":[
"abc123",
"def345"
]
Note that this must be the channel ID, not the channel name, as recommended by the Slack API. To get the ID for a channel, either use the Slack API or find it in the channel's URL.
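As a rough illustration of the API route, the sketch below pages through Slack's conversations.list method and builds a name-to-ID lookup. The helper names (fetch_channels, channel_ids_by_name) are hypothetical and not part of tap-slack, and error handling is kept to a minimum.

```python
import json
import urllib.parse
import urllib.request

SLACK_API = "https://slack.com/api/conversations.list"


def fetch_channels(token, types="public_channel,private_channel"):
    """Yield channel objects from conversations.list, following cursors."""
    cursor = ""
    while True:
        params = {"types": types, "limit": "200"}
        if cursor:
            params["cursor"] = cursor
        request = urllib.request.Request(
            SLACK_API + "?" + urllib.parse.urlencode(params),
            headers={"Authorization": "Bearer " + token},
        )
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
        if not payload.get("ok"):
            raise RuntimeError("Slack API error: %s" % payload.get("error"))
        yield from payload.get("channels", [])
        # An empty next_cursor means there are no further pages.
        cursor = payload.get("response_metadata", {}).get("next_cursor", "")
        if not cursor:
            break


def channel_ids_by_name(channels):
    """Map channel names to IDs for use in the "channels" config list."""
    return {channel["name"]: channel["id"] for channel in channels}
```

Calling channel_ids_by_name(fetch_channels(token)) with a valid token returns a dictionary from which the IDs for the "channels" setting can be looked up.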
You can control whether or not the tap will sync archived channels by including the following in the tap config:
"exclude_archived": "false"
It's important to note that a bot CANNOT join an archived channel, so unless the bot was added to the channel prior to it being archived it will not be able to sync the data from that channel.
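To show how the options above fit together, here is a minimal sketch that assembles them into one config and writes it as JSON. The write_config helper is hypothetical, the token is a placeholder, and the boolean settings are kept as strings because that is how the examples above write them.

```python
import json

# Values mirror the examples in this README; replace the token with a real one.
config = {
    "token": "xxxx",
    "start_date": "2020-05-01T00:00:00",
    "private_channels": "false",
    "join_public_channels": "true",
    "channels": ["abc123", "def345"],
    "exclude_archived": "false",
}


def write_config(path, settings):
    """Write the tap settings to a JSON file at the given path."""
    with open(path, "w") as handle:
        json.dump(settings, handle, indent=2)
```

Calling write_config("slack.config.json", config) would produce a file that can be passed to the tap with --config.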
Due to the potentially high volume of data when syncing certain streams (messages, files, threads) this tap implements date windowing based on a configuration parameter.
For example, including
"date_window_size": "5"
will cause the tap to sync 5 days of data per request for applicable streams. If no value is defined, the default is to window requests 7 days at a time.
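The idea behind date windowing can be sketched as follows. This is an illustration only, not the tap's actual implementation: the date_windows function and the fixed end date are assumptions made for the example.

```python
from datetime import datetime, timedelta


def date_windows(start, end, window_days=7):
    """Yield (window_start, window_end) pairs that cover start up to end."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    step = timedelta(days=window_days)
    window_start = start
    while window_start < end:
        # The last window is shortened so it never runs past the end date.
        window_end = min(window_start + step, end)
        yield window_start, window_end
        window_start = window_end


config = {"start_date": "2020-05-01T00:00:00", "date_window_size": "5"}
start = datetime.strptime(config["start_date"], "%Y-%m-%dT%H:%M:%S")
end = datetime(2020, 5, 13)  # hypothetical end of the sync range
windows = list(date_windows(start, end, int(config["date_window_size"])))
```

With these inputs the range splits into three windows: two full 5-day windows and a final 2-day window, each of which would correspond to one bounded request.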
It is recommended to follow Singer best practices when running taps either on their own or with a Singer target.
In practice, it will look something like the following:
~/.venvs/tap-slack/bin/tap-slack --config slack.config.json --catalog catalog.json | ~/.venvs/target-stitch/bin/target-stitch --config stitch.config.json
The Slack Conversations API does not natively store last-updated timestamp information about a Conversation. In addition, Conversation records are mutable. Thus, tap-slack requires a FULL_TABLE replication strategy to ensure the most up-to-date data is replicated for the following streams:
Channels (Conversations)
Channel Members (Conversation Members)
The Users stream does store information about when a User record was last updated, so tap-slack uses that timestamp as a bookmark value and prefers an INCREMENTAL replication strategy.
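The difference between the two strategies can be sketched roughly as below. This is a simplified illustration rather than the tap's code: the function names are hypothetical, and both the "updated" field name and the inclusive comparison against the bookmark are assumptions.

```python
def full_table_sync(records):
    """FULL_TABLE: emit every record on every run; no bookmark is kept."""
    return list(records)


def incremental_sync(records, bookmark, key="updated"):
    """INCREMENTAL: emit records changed since the bookmark, then advance it."""
    # Records exactly at the bookmark are re-emitted so that ties are not lost.
    changed = [record for record in records if record[key] >= bookmark]
    new_bookmark = max([bookmark] + [record[key] for record in changed])
    return changed, new_bookmark


users = [
    {"id": "U1", "updated": 100},
    {"id": "U2", "updated": 200},
    {"id": "U3", "updated": 300},
]
changed, new_bookmark = incremental_sync(users, bookmark=200)
```

Here only U2 and U3 are emitted and the bookmark advances to 300, whereas full_table_sync(users) returns all three records regardless of any earlier run.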
- Table Name: channels
- Description:
- Primary Key Column: id
- Replication Strategy: FULL_TABLE
- API Documentation: Link
- Table Name: channel_members
- Description:
- Primary Key Columns: channel_id, user_id
- Replication Strategy: FULL_TABLE
- API Documentation: Link
- Table Name: messages
- Description:
- Primary Key Columns: channel_id, ts
- Replication Strategy: INCREMENTAL
- API Documentation: Link
- Table Name: users
- Description:
- Primary Key Column: id
- Replication Strategy: INCREMENTAL
- API Documentation: Link
- Table Name: threads
- Description:
- Primary Key Columns: channel_id, ts, thread_ts
- Replication Strategy: FULL_TABLE for each parent message
- API Documentation: Link
- Table Name: user_groups
- Description:
- Primary Key Column: id
- Replication Strategy: FULL_TABLE
- API Documentation: Link
- Table Name: files
- Description:
- Primary Key Column: id
- Replication Strategy: INCREMENTAL, query filtered using date windows and lookback window
- API Documentation: Link
- Table Name: remote_files
- Description:
- Primary Key Column: id
- Replication Strategy: INCREMENTAL, query filtered using date windows and lookback window
- API Documentation: Link
While developing the Slack tap, the following utilities were run in accordance with Singer.io best practices:
Pylint to improve code quality:
> pylint tap_slack -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
Pylint test resulted in the following score:
Your code has been rated at 9.72/10
To check the tap and verify that it works:
> tap-slack --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
> tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
Check tap resulted in the following:
Checking stdin for valid Singer-formatted data
The output is valid.
It contained 3657 messages for 9 streams.
581 schema messages
2393 record messages
683 state messages
Details by stream:
+-----------------+---------+---------+
| stream | records | schemas |
+-----------------+---------+---------+
| threads | 633 | 573 |
| user_groups | 1 | 1 |
| channel_members | 1049 | 1 |
| users | 22 | 1 |
| channels | 0 | 1 |
| remote_files | 3 | 1 |
| messages | 573 | 1 |
| teams | 1 | 1 |
| files | 111 | 1 |
+-----------------+---------+---------+
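For readers unfamiliar with the message types counted above, the sketch below tallies SCHEMA, RECORD, and STATE messages from Singer-formatted lines and keeps the last STATE value, which is roughly what the tail -1 step preserves. It is not a substitute for singer-check-tap, which also validates the data. The summarize helper and the bookmark layout in the sample are hypothetical.

```python
import json
from collections import Counter


def summarize(lines):
    """Count Singer messages by type and return the last STATE value seen."""
    counts = Counter()
    last_state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        counts[message["type"]] += 1
        if message["type"] == "STATE":
            last_state = message["value"]
    return counts, last_state


# Hand-written sample lines in the Singer message format.
sample = [
    '{"type": "SCHEMA", "stream": "users", '
    '"schema": {"properties": {"id": {"type": "string"}}}, '
    '"key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "users", "record": {"id": "U1"}}',
    '{"type": "STATE", "value": {"bookmarks": {"users": {"updated": 100}}}}',
]
counts, last_state = summarize(sample)
```

The same function could be applied to the lines of a file containing the tap's output to get per-type totals like those in the summary above.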
Copyright © 2019 Stitch