diff --git a/src/_data/sidenav/main.yml b/src/_data/sidenav/main.yml index b4fd3df6e4..ce2bcbe5b9 100644 --- a/src/_data/sidenav/main.yml +++ b/src/_data/sidenav/main.yml @@ -219,6 +219,8 @@ sections: title: Redshift Cluster and Redshift Connector Limitations - path: /connections/storage/warehouses/redshift-tuning title: Speeding Up Redshift Queries + - path: /connections/storage/warehouses/redshift-useful-sql + title: Useful SQL Queries for Redshift - path: /connections/test-connections title: Testing Connections - path: /connections/data-export-options diff --git a/src/_includes/content/how-a-sync-works.md b/src/_includes/content/how-a-sync-works.md new file mode 100644 index 0000000000..cdc93ea414 --- /dev/null +++ b/src/_includes/content/how-a-sync-works.md @@ -0,0 +1,3 @@ +When Segment loads data into your warehouse, each sync goes through two steps: +1. **Ping:** Segment servers connect to your warehouse. For Redshift warehouses, Segment also runs a query to determine how many slices a cluster has. Common reasons a sync might fail at this step include a blocked VPN or IP, a warehouse that isn't set to be publicly accessible, or an issue with user permissions or credentials. +2. **Load:** Segment de-duplicates the transformed data and loads it into your warehouse. If you have queries set up in your warehouse, they run after the data is loaded into your warehouse. \ No newline at end of file diff --git a/src/_includes/content/spec-table-header.md b/src/_includes/content/spec-table-header.md index 4140ff7373..b996d513f8 100644 --- a/src/_includes/content/spec-table-header.md +++ b/src/_includes/content/spec-table-header.md @@ -1,6 +1,8 @@ + Field Type Description + \ No newline at end of file diff --git a/src/_sass/components/_markdown.scss b/src/_sass/components/_markdown.scss index cf1a135e87..dd1abc5847 100644 --- a/src/_sass/components/_markdown.scss +++ b/src/_sass/components/_markdown.scss @@ -308,10 +308,10 @@ } -a[target="_blank"]:not(.reference-button):after { - content: url("/docs/images/external-link-alt-solid.svg"); - margin-left: 4px; -} +//a[target="_blank"]:not(.reference-button):after { + //content: url("/docs/images/external-link-alt-solid.svg"); +// margin-left: 4px; +//} a.no-icon[target="_blank"]:after { content: none @@ -436,4 +436,4 @@ div.highlighter-rouge { color: black; font-weight: bold; margin-left: 3px -} \ No newline at end of file +} diff --git a/src/connections/sources/catalog/cloud-apps/sendgrid/index.md b/src/connections/sources/catalog/cloud-apps/sendgrid/index.md index 2feacd9465..b57bc8395e 100644 --- a/src/connections/sources/catalog/cloud-apps/sendgrid/index.md +++ b/src/connections/sources/catalog/cloud-apps/sendgrid/index.md @@ -1,34 +1,38 @@ --- -title: Sendgrid Source +title: SendGrid Source id: jhr8dT2yHn --- {% include content/source-region-unsupported.md %} -SendGrid is a trusted platform for transactional email and email marketing. [Visit Website](http://sendgrid.com) +[SendGrid](http://sendgrid.com) is a trusted platform for transactional email and email marketing. -Take your company's analysis to the next level by **adding Sendgrid as a Source to Segment.** Segment automatically collects events like `Click` or `Delivered` and objects such as `Recipients` or `Campaigns` and load them into your data warehouse.  
+Take your company's analysis to the next level by **adding SendGrid as a Source to Segment.** Segment automatically collects events like `Click` or `Delivered` and objects such as `Recipients` or `Campaigns` and loads them into your data warehouse. 

## Getting Started

-1. From the [Source catalog page](https://app.segment.com/goto-my-workspace/sources/catalog) in your Segment workspace, enter "Sendgrid" and select the Sendgrid source that appears.
-2. From the Sendgrid information panel that appears, click **Add source**.
+Adding SendGrid as a Source in Segment requires a SendGrid API key. If you don't yet have a SendGrid API key, first follow these steps within your SendGrid account:

-3. Give the Source a name and add any labels to help you organize and filter your sources.
-   You can give the source any name, but Segment recommends a name that reflects the source itself, as this name autopopulates the schema name. For example, the source name `Sendgrid` creates the schema `sendgrid`. You can add multiple instances if you have multiple SendGrid accounts.
+1. Log in to your SendGrid account.
+2. Navigate to **Settings > API Keys**, then click **General API Key**.
+3. Name the key and, optionally, adjust its settings.
+4. Copy the API Key, omitting all spaces.

-4. Provide your API Key. In order to pull information about your contacts, we'll make requests to SendGrid's API with our [sync component](#sync). You can create an API Key by navigating to **Settings > API Keys**, clicking **General API Key**.
+> info "SendGrid API Key Settings"
+> Segment recommends providing read permissions for **Email Activity** and **Marketing Activity**.

-   You will then be prompted to name that key and given the option to adjust the settings. We recommend providing read permissions for **Email Activity** and **Marketing Activity**.
+To finish adding the SendGrid source, return to your Segment workspace and follow these steps:

-6. Finally, copy the resulting API Key into the Segment interface, taking care to trim any errant trailing spaces from copying and pasting, and press connect.
+1. From the [Source catalog page](https://app.segment.com/goto-my-workspace/sources/catalog) in your Segment workspace, enter **SendGrid** and select the SendGrid source that appears.
+2. From the SendGrid information panel that appears, click **Add source**.
+3. Give the Source a name and add any labels to help you organize and filter your sources.
+   Segment recommends a name that reflects the source itself, as this name populates the schema name. For example, the source name `SendGrid` creates the schema `sendgrid`. You can add multiple instances if you have multiple SendGrid accounts.
+4. Paste the SendGrid API Key you copied above into the Segment interface. Click **Connect**.
+![](images/601347_Key.png)

- ![](images/601347_Key.png)

+5. Copy the auto-generated Webhook URL and paste it into SendGrid's Event Notification settings pane under **Settings > Mail Settings**.
+![](images/694785_Webhook.png)

-7. Copy the auto-generated Webhook URL and paste it into SendGrid's Event Notification settings pane under **Settings > Mail Settings**.
-
- ![](images/694785_Webhook.png)
-
-8. Once you enable the Event Notification, you're good to go! Press **Next**, and then **Finish** to wrap up the set up flow.
+6. Enable Event Notification in SendGrid. Select **Next** and then **Finish** to complete setup.

### Event URL

SendGrid has a single Event URL location. 
By using the SendGrid source, you will ### Sync -SendGrid has a sync component, which means we'll make requests to their API on your behalf on a 3 hour interval to pull the latest data into Segment. In the initial sync, we'll grab all the SendGrid objects (and their corresponding properties) according to the [Collections Table](#collections) below. **Note**: If you don't use Sendgrid's marketing campaigns features, these collections will be empty in Sendgrid and you'll see "Zero data synced" in your runs. The webhook will still be processing activity data (but only activity data) for you though! +Segment makes requests to the SendGrid API every three hours. In the initial sync, Segment pulls all SendGrid objects (and their corresponding properties) according to the [Collections Table](#collections) below. If you don't use SendGrid's marketing campaigns features, these collections will be empty in SendGrid and you'll see "Zero data synced" in your runs. The webhook still processes activity data. -Our sync component gets resources from SendGrid and forwards them to Segment using an upsert API, so the dimensional data in your warehouse loaded will reflect the latest state of the corresponding resource in SendGrid. For example, if `lists.recipient_count` goes from `100` to `200` between syncs, on its next flush to your warehouse, that tickets status will be `200`. +Segment's sync component pulls and forwards SendGrid resources to Segment using an upsert API. As a result, dimensional data loaded into your warehouse reflects the latest state of the corresponding resource in SendGrid. For example, if `lists.recipient_count` goes from `100` to `200` between syncs, its status will be `200` on its next flush to your warehouse. -The source syncs and warehouse syncs are independent processes. Source runs pull your data into the Segment Hub, and warehouse runs flush that data to your warehouse. Sources will sync with Segment every 3 hours. Depending on your Warehouses plan, we will push the Source data to your warehouse on the interval associated with your billing plan. - -At the moment, we don't support filtering which objects or properties get synced. If you're interested in this feature, [let us know](https://segment.com/help/contact/)! +The source syncs and warehouse syncs are independent processes. Source runs pull your data into the Segment Hub, and warehouse runs flush that data to your warehouse. Sources sync with Segment every three hours. Depending on your Warehouses plan, Segment pushes the Source data to your warehouse on the interval associated with your billing plan. +> info "SendGrid Syncs" +> Segment syncs all objects and properties. [Reach out to support](https://segment.com/help/contact/) if you're interested in filtering objects or properties during syncs. ### Streaming -The SendGrid source also has a streaming component which listens in real time for inbound webhooks from SendGrid's Event Notifications and batches the events to be uploaded on your next warehouse flush. **These events only append to your warehouse.** +The SendGrid source's streaming component listens in real time for inbound webhooks from SendGrid's Event Notifications. The source batches these events for upload on your next warehouse flush. **These events only append to your warehouse.** > note "" > **NOTE:** If you don't use SendGrid's marketing features, this will be the only data that Segment receives from SendGrid. 
There isn't a way to retrieve email event history from SendGrid, so you will only have access to data that Segment collected after you successfully enable this component of the source destination. @@ -57,32 +61,35 @@ The SendGrid source also has a streaming component which listens in real time fo ## Collections -Collections are the groupings of resources we pull from your source. In your warehouse, each collection gets its own table. +Collections are the groupings of resources Segment pulls from your source. In your warehouse, each collection gets its own table. **Object** collections are updated with each sync. These are pulled using Segment's sync component. -**Event** collections are append only, represent a user action or activity, and may be likened to fact tables in a traditional data warehouse. **Note:** Unlike traditional events captured by Segment, you can't forward these events to Destinations you've configured in your Segment workspace. You can only sync these events to a supported data warehouse. +**Event** collections are append only, represent a user action or activity, and may be likened to fact tables in a traditional data warehouse. Unlike traditional events captured by Segment, you can't forward these events to Destinations you've configured in your Segment workspace. You can only sync these events to a supported data warehouse. | Collection | Type | Description | | ------ | ------ | ------ | -| activity | Event | The union of all SendGrid **event** tables. Useful for creating funnels | -| _open | Event | Recipient has opened the HTML message. You need to enable Open Tracking for getting this type of event. | -| click | Event | Recipient clicked on a link within the message. You need to enable Click Tracking for getting this type of event. | +| activity | Event | The union of all SendGrid **event** tables. Useful for creating funnels. | +| _open | Event | Recipient has opened the HTML message. Enable Open Tracking to get this type of event. | +| click | Event | Recipient clicked on a link within the message. Enable Click Tracking to get this type of event. | | bounce | Event | Receiving server could not or would not accept message. | | delivered | Event | Message has been successfully delivered to the receiving server. | -| processed | Event | Triggered when the email is processed | +| processed | Event | Triggered when the email is processed. | | dropped | Event | You may see the following drop reasons: Invalid SMTPAPI header, Spam Content (if spam checker app enabled), Unsubscribed Address, Bounced Address, Spam Reporting Address, Invalid, Recipient List over Package Quota | | deferred | Event | Recipient's email server temporarily rejected message. | | unsubscribe | Event | Recipient clicked on message's subscription management link. You need to enable Subscription Tracking for getting this type of event. | | spam_report | Event | Recipient marked message as spam. | -| lists | Object | [Groups of contacts](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using Marketing Campaign features of SendGrid.** | -| segments | Object | [Slices of lists](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using Marketing Campaign features of SendGrid.** | -| recipients | Object | All contacts who have received an email, with information about their last activities and custom activities. 
[More Info](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using Marketing Campaign features of SendGrid.** |
-| campaigns | Object | All campaigns you've created in Sendgrid. [More Info](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/campaigns.html). **Will only return data if you're using Marketing Campaign features of SendGrid.** |
+| lists | Object | [Groups of contacts](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using SendGrid's Marketing Campaign features.** |
+| segments | Object | [Slices of lists](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using SendGrid's Marketing Campaign features.** |
+| recipients | Object | All contacts who have received an email, with information about their last activities and custom activities. [More Info](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/contactdb.html). **Will only return data if you're using SendGrid's Marketing Campaign features.** |
+| campaigns | Object | All campaigns you've created in SendGrid. [More Info](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/campaigns.html). **Will only return data if you're using SendGrid's Marketing Campaign features.** |
+
+> info "SendGrid and Personas"
+> SendGrid data is not available in Personas.

-## Troubleshooting
+## Troubleshooting

If you're getting an "Invalid Credentials" error when setting up the SendGrid source, send a direct ping to the [SendGrid Marketing Campaigns API](https://sendgrid.com/docs/API_Reference/Web_API_v3/Marketing_Campaigns/campaigns.html) to test if you're using the correct credentials.

-Make sure you whitelist the Segment IP addresses on Sendgrid. [Contact Segment](https://segment.com/help/contact/) for the list of IP addresses to whitelist.
+Make sure you allowlist Segment IP addresses on SendGrid. [Contact Segment](https://segment.com/help/contact/) for the list of IP addresses to allowlist.
diff --git a/src/connections/sources/catalog/libraries/server/java/index.md b/src/connections/sources/catalog/libraries/server/java/index.md
index 7bdee1de35..153dc32e69 100644
--- a/src/connections/sources/catalog/libraries/server/java/index.md
+++ b/src/connections/sources/catalog/libraries/server/java/index.md
@@ -38,7 +38,7 @@ Add to `pom.xml`:

or if you're using Gradle:

```bash
-compile 'com.segment.analytics.java:analytics:+'
+implementation 'com.segment.analytics.java:analytics:+'
```

### Initialize the SDK
@@ -74,14 +74,13 @@ We recommend calling `identify` a single time when the user's account is first c

Example `identify` call:

```java
+Map<String, String> map = new HashMap<>();
+map.put("name", "Michael Bolton");
+map.put("email", "mbolton@example.com");
+
analytics.enqueue(IdentifyMessage.builder()
-    .userId("f4ca124298")
-    .traits(ImmutableMap.builder()
-        .put("name", "Michael Bolton")
-        .put("email", "mbolton@example.com")
-        .build()
-    )
-);
+    .userId("f4ca124298")
+    .traits(map));
```

This call is identifying Michael by his unique User ID (the one you know him by in your database) and labeling him with `name` and `email` traits. 
diff --git a/src/connections/sources/catalog/libraries/server/java/quickstart.md b/src/connections/sources/catalog/libraries/server/java/quickstart.md
index 35edefb988..97666556f2 100644
--- a/src/connections/sources/catalog/libraries/server/java/quickstart.md
+++ b/src/connections/sources/catalog/libraries/server/java/quickstart.md
@@ -40,7 +40,7 @@ Here's what it would look like with Maven:

*or if you're using Gradle:*

```bash
-compile 'com.segment.analytics.java:analytics:+'
+implementation 'com.segment.analytics.java:analytics:+'
```

## Step 3: Initialize the SDK
@@ -71,14 +71,13 @@ The `identify` message is how you tell Segment who the current user is. It inclu

Here's what a basic call to `identify` a user might look like:

```java
+Map<String, String> map = new HashMap<>();
+map.put("name", "Michael Bolton");
+map.put("email", "mbolton@example.com");
+
analytics.enqueue(IdentifyMessage.builder()
-    .userId("f4ca124298")
-    .traits(ImmutableMap.builder()
-        .put("name", "Michael Bolton")
-        .put("email", "mbolton@example.com")
-        .build()
-    )
-);
+    .userId("f4ca124298")
+    .traits(map));
```

**Note:** The enqueue method takes a `MessageBuilder` instance and not a `Message` instance directly. This is to allow you to use a `MessageTransformer` that applies to all incoming messages and transform or add data.
diff --git a/src/connections/sources/catalog/libraries/server/node/index.md b/src/connections/sources/catalog/libraries/server/node/index.md
index 730e65f7d3..30b6b85f86 100644
--- a/src/connections/sources/catalog/libraries/server/node/index.md
+++ b/src/connections/sources/catalog/libraries/server/node/index.md
@@ -437,6 +437,84 @@ analytics.flush(function(err, batch){
});
```

+## Long running process
+
+When you call `client.track(...)`, events are queued and eventually sent to Segment. To prevent losing messages, capture any interruption (for example, a server restart) and call `flush` to delay the process shutdown until the queued events are delivered.
+
+```js
+import { randomUUID } from 'crypto';
+import Analytics from 'analytics-node'
+
+const WRITE_KEY = '...';
+
+const analytics = new Analytics(WRITE_KEY, { flushAt: 10 });
+
+analytics.track({
+  anonymousId: randomUUID(),
+  event: 'Test event',
+  properties: {
+    name: 'Test event',
+    timestamp: new Date()
+  }
+});
+
+const exitGracefully = async (code) => {
+  console.log('Flushing events');
+  await analytics.flush(function(err, batch) {
+    console.log('Flushed, and now this program can exit!');
+    process.exit(code);
+  });
+};
+
+[
+  'beforeExit', 'uncaughtException', 'unhandledRejection',
+  'SIGHUP', 'SIGINT', 'SIGQUIT', 'SIGILL', 'SIGTRAP',
+  'SIGABRT', 'SIGBUS', 'SIGFPE', 'SIGUSR1', 'SIGSEGV',
+  'SIGUSR2', 'SIGTERM',
+].forEach(evt => process.on(evt, exitGracefully));
+
+function logEvery2Seconds(i) {
+  setTimeout(() => {
+    console.log('Infinite Loop Test n:', i);
+    logEvery2Seconds(++i);
+  }, 2000);
+}
+
+logEvery2Seconds(0);
+```
+
+## Short lived process
+
+Short-lived functions have a predictably short and linear lifecycle, so use a queue large enough to hold all messages, and then await `flush` to complete its work. 
+ + +```js +import { randomUUID } from 'crypto'; +import Analytics from 'analytics-node' + + +async function lambda() +{ + const WRITE_KEY = '...'; + const analytics = new Analytics(WRITE_KEY, { flushAt: 20 }); + analytics.flushed = true; + + analytics.track({ + anonymousId: randomUUID(), + event: 'Test event', + properties: { + name: 'Test event', + timestamp: new Date() + } + }); + await analytics.flush(function(err, batch) { + console.log('Flushed, and now this program can exit!'); + }); +} + +lambda(); +``` + ## Multiple Clients diff --git a/src/connections/spec/common.md b/src/connections/spec/common.md index 8700ed8a20..d5ae24b031 100644 --- a/src/connections/spec/common.md +++ b/src/connections/spec/common.md @@ -122,11 +122,13 @@ Beyond this common structure, each API call adds a few specialized top-level fie Context is a dictionary of extra information that provides useful context about a datapoint, for example the user's `ip` address or `locale`. You should **only use** Context fields for their intended meaning. + - - - + + + + @@ -140,7 +142,7 @@ Context is a dictionary of extra information that provides useful context about @@ -189,7 +191,7 @@ Context is a dictionary of extra information that provides useful context about - @@ -231,58 +233,58 @@ Context is a dictionary of extra information that provides useful context about ## Context Fields Automatically Collected -Below is a chart that shows you which context variables are populated automatically by our iOS, Android and analytics.js libraries. +Below is a chart that shows you which context variables are populated automatically by the iOS, Android and analytics.js libraries. Other libraries only collect `context.library`, any other context variables must be sent manually. | Context Field | Analytics.js | Analytics-ios | Analytics-android | -|--------------------------|--------------|---------------|-------------------| -| app.name | | √ | √ | -| app.version | | √ | √ | -| app.build | | √ | √ | -| campaign.name | √ | | | -| campaign.source | √ | | | -| campaign.medium | √ | | | -| campaign.term | √ | | | -| campaign.content | √ | | | -| device.type | | √ | √ | -| device.id | | √ | √ | -| device.advertisingId | | √ | √ | -| device.adTrackingEnabled | | √ | √ | -| device.manufacturer | | √ | √ | -| device.model | | √ | √ | -| device.name | | √ | √ | -| library.name | √ | √ | √ | -| library.version | √ | √ | √ | -| ip* | √ | √ | √ | -| locale | √ | √ | √ | +| ------------------------ | ------------ | ------------- | ----------------- | +| app.name | | ✅ | ✅ | +| app.version | | ✅ | ✅ | +| app.build | | ✅ | ✅ | +| campaign.name | ✅ | | | +| campaign.source | ✅ | | | +| campaign.medium | ✅ | | | +| campaign.term | ✅ | | | +| campaign.content | ✅ | | | +| device.type | | ✅ | ✅ | +| device.id | | ✅ | ✅ | +| device.advertisingId | | ✅ | ✅ | +| device.adTrackingEnabled | | ✅ | ✅ | +| device.manufacturer | | ✅ | ✅ | +| device.model | | ✅ | ✅ | +| device.name | | ✅ | ✅ | +| library.name | ✅ | ✅ | ✅ | +| library.version | ✅ | ✅ | ✅ | +| ip* | ✅ | ✅ | ✅ | +| locale | ✅ | ✅ | ✅ | | location.latitude | | | | | location.longitude | | | | | location.speed | | | | -| network.bluetooth | | | √ | -| network.carrier | | √ | √ | -| network.cellular | | √ | √ | -| network.wifi | | √ | √ | -| os.name | | √ | √ | -| os.version | | √ | √ | -| page.path | √ | | | -| page.referrer | √ | | | -| page.search | √ | | | -| page.title | √ | | | -| page.url | √ | | | -| screen.density | | | √ | -| screen.height | | √ | √ | -| screen.width | | √ | √ | -| traits | 
| √ | √ |
-| userAgent | √ | | √ |
-| timezone | | √ | √ |
-
-- IP Address is not collected by our libraries, but instead filled in by our servers when it receives a message for **client side events only**.
-- Our Android library collects `screen.density` with [this method](/docs/connections/spec/common/#context-fields-automatically-collected).
+| network.bluetooth | | | ✅ |
+| network.carrier | | ✅ | ✅ |
+| network.cellular | | ✅ | ✅ |
+| network.wifi | | ✅ | ✅ |
+| os.name | | ✅ | ✅ |
+| os.version | | ✅ | ✅ |
+| page.path | ✅ | | |
+| page.referrer | ✅ | | |
+| page.search | ✅ | | |
+| page.title | ✅ | | |
+| page.url | ✅ | | |
+| screen.density | | | ✅ |
+| screen.height | | ✅ | ✅ |
+| screen.width | | ✅ | ✅ |
+| traits | | ✅ | ✅ |
+| userAgent | ✅ | ✅ | ✅ |
+| timezone | | ✅ | ✅ |
+
+- IP Address is not collected by Segment's libraries, but instead filled in by Segment's servers when they receive a message for **client side events only**.
+- The Android library collects `screen.density` with [this method](/docs/connections/spec/common/#context-fields-automatically-collected).

## Integrations

-A dictionary of destination names that the message should be sent to. `'All'` is a special key that applies when no key for a specific destinatio n is found.
+A dictionary of destination names that the message should be sent to. `'All'` is a special key that applies when no key for a specific destination is found.

Integrations defaults to the following:

@@ -293,9 +295,9 @@ Integrations defaults to the following:
}
```

-This is because [Salesforce](/docs/connections/destinations/catalog/salesforce/) has strict limits on API calls, and we don't want to run over your limits by accident.
+This is because [Salesforce](/docs/connections/destinations/catalog/salesforce/) has strict limits on API calls.

-Sending data to the rest of our destinations is opt-out so if you don't specify the destination as false in this object, it will be sent to rest of the destinations that can accept it.
+Sending data to the rest of Segment's destinations is opt-out, so if you don't specify the destination as false in this object, it will be sent to the rest of the destinations that can accept it.

## Timestamps

@@ -378,16 +380,16 @@ The `originalTimestamp` tells you when call was invoked on the client device or

### sentAt

-The `sentAt` timestamp specifies the clock time for the client's device when the network request was made to the Segment API. For libraries and systems that send batched requests, there can be a long gap between a datapoint's `timestamp` and `sentAt`. Combined with `receivedAt`, we can use `sentAt` to correct the original `timestamp` in situations where a user's device clock cannot be trusted (mobile phones and browsers). The `sentAt` and `receivedAt` timestamps are assumed to occur at the same time (maximum a few hundred milliseconds), and therefore the difference is the user's device clock skew, which can be applied back to correct the `timestamp`.
+The `sentAt` timestamp specifies the clock time for the client's device when the network request was made to the Segment API. For libraries and systems that send batched requests, there can be a long gap between a datapoint's `timestamp` and `sentAt`. Combined with `receivedAt`, Segment uses `sentAt` to correct the original `timestamp` in situations where a user's device clock cannot be trusted (mobile phones and browsers). 
The `sentAt` and `receivedAt` timestamps are assumed to occur at the same time (maximum a few hundred milliseconds), and therefore the difference is the user's device clock skew, which can be applied back to correct the `timestamp`. **Note:** The `sentAt` timestamp is not useful for any analysis since it's tainted by user's clock skew. ### receivedAt -The `receivedAt` timestamp is added to incoming messages as soon as they hit our API. It's used in combination with `sentAt` to correct clock skew, and also to aid with debugging libraries and systems that deliver events in batches. +The `receivedAt` timestamp is added to incoming messages as soon as they hit the API. It's used in combination with `sentAt` to correct clock skew, and also to aid with debugging libraries and systems that deliver events in batches. -The `receivedAt` timestamp is most important as the sort key in our Warehouses product. Use this for max query speed when retrieving data from your Warehouse! +The `receivedAt` timestamp is most important as the sort key in Segment's Warehouses product. Use this for max query speed when retrieving data from your Warehouse! **Note:** Chronological order of events is not ensured with `receivedAt`. diff --git a/src/connections/storage/warehouses/faq.md b/src/connections/storage/warehouses/faq.md index 6bc79eaa66..806e8a7e68 100644 --- a/src/connections/storage/warehouses/faq.md +++ b/src/connections/storage/warehouses/faq.md @@ -44,11 +44,9 @@ Your warehouse id appears in the URL when you look at the [warehouse destination ## How fresh is the data in Segment Warehouses? -Data is available in Warehouses within 24-48 hours. The underlying Redshift datastore has a subtle tradeoff between data freshness, robustness, and query speed. For the best experience, Segment needs to balance all three of these. +Data is available in Warehouses within 24-48 hours, depending on your tier's sync frequency. For more information about sync frequency by tier, see [Sync Frequency](/docs/connections/storage/warehouses/warehouse-syncs/#sync-frequency). -Real-time loading of the data into Segment Warehouses would cause significant performance degradation at query time because of the way Redshift uses large batches to optimize and compress columns. To optimize for your query speed, reliability, and robustness, Segment guarantees that your data will be available in Redshift within 24 hours. - -As Segment improves and updates the ETL processes and optimizes for SQL query performance downstream, the actual load time will vary, but Segment ensures it's always within 24 hours. +Real-time loading of the data into Segment Warehouses would cause significant performance degradation at query time. To optimize for your query speed, reliability, and robustness, Segment guarantees that your data will be available in your warehouse within 24 hours. The underlying datastore has a subtle tradeoff between data freshness, robustness, and query speed. For the best experience, Segment needs to balance all three of these. ## What if I want to add custom data to my warehouse? @@ -58,23 +56,23 @@ The only restriction when loading your own data into your connected warehouse is If you want to insert custom data into your warehouse, create new schemas that are not associated with an existing source, since these may be deleted upon a reload of the Segment data in the cluster. 
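+
+For example, a minimal sketch for Redshift, assuming a hypothetical `custom_data` schema and `ad_spend` table (substitute names that fit your own data):
+
+```sql
+-- Keep custom data in a schema Segment doesn't manage,
+-- so a reload of Segment data can't delete it
+create schema custom_data;
+
+create table custom_data.ad_spend (
+    campaign_id varchar(64),
+    spend_date date,
+    amount_usd decimal(12,2)
+);
+```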
-We highly recommend scripting any sort of additions of data you might have to warehouse, so that you aren't doing one-off tasks that can be hard to recover from in the future in the case of hardware failure.
+Segment recommends scripting any additions of data to your warehouse, so that you aren't doing one-off tasks that can be hard to recover from in the case of hardware failure.

-## Which IPs should I whitelist?
+## Which IPs should I allowlist?

-You can whitelist Segment's custom IP `52.25.130.38/32` while authorizing Segment to write in to your Redshift or Postgres port.
+You can allowlist Segment's custom IP `52.25.130.38/32` while authorizing Segment to write to your Redshift or Postgres port.

If you're in the EU region, use CIDR `3.251.148.96/29`.

> info ""
> EU workspace regions are currently in beta. If you would like to learn more about the beta, please contact your account manager.

-BigQuery does not require whitelisting an IP address. To learn how to set up BigQuery, check out our [set up guide](https://segment.com/docs/connections/storage/catalog/bigquery/#getting-started)
+BigQuery does not require allowlisting an IP address. To learn how to set up BigQuery, check out Segment's BigQuery [set up guide](/docs/connections/storage/catalog/bigquery/#getting-started).

## Will Segment sync my historical data?

-We will automatically load up to 2 months of your historical data when you connect a warehouse.
+Segment loads up to two months of your historical data when you connect a warehouse.

For full historical backfills you'll need to be a Segment Business plan customer. If you'd like to learn more about our Business plan and all the features that come with it, [check out our pricing page](https://segment.com/pricing).

@@ -92,3 +90,45 @@ When you create a new source, the source syncs to all warehouse(s) in the works

- **Config API**: Send a [PATCH Connected Warehouse request](https://reference.segmentapis.com/?version=latest#ec12dae0-1a3e-4bd0-bf1c-840f43537ee2) to update the settings for the warehouse(s) you want to prevent from syncing.

After a source is created, you can enable or disable a warehouse sync within the Warehouse Settings page.
+
+## Can I be notified when warehouse syncs fail?
+
+If you enabled activity notifications for your storage destination, you'll receive notifications in the Segment app for the fifth and 20th consecutive warehouse failures.
+
+To sign up for warehouse sync notifications:
+1. Open the Segment app.
+2. Go to **Settings** > **User Preferences**.
+3. In the Activity Notifications section, select **Storage Destinations**.
+4. Enable **Storage Destination Sync Failed**.
+
+## How is the data formatted in my warehouse?
+
+Data in your warehouse is formatted into **schemas**, which involve a detailed description of database elements (tables, views, indexes, synonyms, etc.)
+and the relationships that exist between elements. Segment's schemas use the following template:
`<source>.<collection>.<property>`, for example,
+`segment_engineering.tracks.user_id`, where source refers to the source or project name (segment_engineering), collection refers to the event (tracks),
+ and the property refers to the data being collected (user_id).
+
+> note " "
+> All schema data is always represented in snake case.
+
+For more information about Warehouse Schemas, see the [Warehouse Schemas](/docs/connections/storage/warehouses/schema) page.
+
+## If my syncs fail and get fixed, do I need to ask for a backfill?
+
+If your syncs fail, you do not need to reach out to Segment Support to request a backfill. Once a successful sync takes place,
+Segment automatically loads all of the data generated since the last successful sync occurred.
+
+## Can I change my schema names once they've been created?
+
+Segment stores the name of your schema in the **SQL Settings** page. Changing the name of your schema in the app without updating the name in your data warehouse causes a new schema to form, one that doesn't contain historical data.
+
+To change the name of your schema without disruptions:
+
+1. Open the Segment app, select your warehouse from the Sources tab, and select **Settings.**
+2. Under the "Enable Source" section, disable your warehouse and click **Save Changes.**
+3. Select the "SQL Settings" tab.
+4. Update the "Schema Name" field with the new name for your schema and click **Save Changes.**
+5. Rename the schema in your Data Warehouse to match the new name in the Segment app.
+6. Open the Segment app, select your warehouse from the Sources tab, and select **Settings.** On the source's settings page, select "Basic."
+7. Under the "Enable Source" section, enable your warehouse and click **Save Changes.**
\ No newline at end of file
diff --git a/src/connections/storage/warehouses/images/sql-redshift-table-1.jpg b/src/connections/storage/warehouses/images/sql-redshift-table-1.jpg
new file mode 100644
index 0000000000..5243c61928
Binary files /dev/null and b/src/connections/storage/warehouses/images/sql-redshift-table-1.jpg differ
diff --git a/src/connections/storage/warehouses/index.md b/src/connections/storage/warehouses/index.md
index 9864ea146a..233f741a3d 100644
--- a/src/connections/storage/warehouses/index.md
+++ b/src/connections/storage/warehouses/index.md
@@ -20,6 +20,8 @@ Relational databases are great when you know and predefine the information colle

Examples of data warehouses include Amazon Redshift, Google BigQuery, and Postgres.

+{% include content/how-a-sync-works.md %}
+
> info "Looking for the Warehouse Schemas docs?" > They've moved! Check them out [here](schema/). @@ -32,7 +34,7 @@ Examples of data warehouses include Amazon Redshift, Google BigQuery, and Postgr [How do I give users permissions to my warehouse?](/docs/connections/storage/warehouses/add-warehouse-users/) -Check out our [Frequently Asked Questions about Warehouses](/docs/connections/storage/warehouses/faq/) and [a list of helpful queries to get you started](https://help.segment.com/hc/en-us/articles/205577035-Common-Segment-SQL-Queries). +Check out the [Frequently Asked Questions about Warehouses](/docs/connections/storage/warehouses/faq/) page and [a list of helpful SQL queries to get you started with Redshift ](/docs/connections/storage/warehouses/redshift-useful-sql). ## FAQs @@ -42,7 +44,7 @@ Check out our [Frequently Asked Questions about Warehouses](/docs/connections/st [How do I give users permissions?](/docs/connections/storage/warehouses/add-warehouse-users/) -[What are the limitations of Redshift clusters and our warehouses connector?](/docs/connections/storage/warehouses/redshift-faq/) +[What are the limitations of Redshift clusters and warehouses connectors?](/docs/connections/storage/warehouses/redshift-faq/) [Where do I find my source slug?](/docs/connections/storage/warehouses/faq/#how-do-i-find-my-source-slug) @@ -50,28 +52,28 @@ Check out our [Frequently Asked Questions about Warehouses](/docs/connections/st [How do I create a user, grant usage on a schema and then grant the privileges that the user will need to interact with that schema?](/docs/connections/storage/warehouses/add-warehouse-users/) -[Which IPs should I whitelist?](/docs/connections/storage/warehouses/faq/#which-ips-should-i-whitelist) +[Which IPs should I allowlist?](/docs/connections/storage/warehouses/faq/#which-ips-should-i-whitelist) [Will Segment sync my historical data?](/docs/connections/storage/warehouses/faq/#will-segment-sync-my-historical-data) [Can I load in my own data into my warehouse?](/docs/connections/storage/warehouses/faq/#what-if-i-want-to-add-custom-data-to-my-warehouse) -[Can I control what data is sent to my warehouse?](/docs/connections/storage/warehouses/faq/) +[Can I control what data is sent to my warehouse?](/docs/connections/storage/warehouses/faq/#can-i-control-what-data-is-sent-to-my-warehouse) ### Managing a warehouse -[How fresh is the data in my warehouse?](/docs/connections/storage/warehouses/faq/) +[How fresh is the data in my warehouse?](/docs/connections/storage/warehouses/faq/#how-fresh-is-the-data-in-segment-warehouses) -[Can I add, tweak, or delete some of the tables?](/docs/connections/storage/warehouses/faq/) +[Can I add, tweak, or delete some of the tables?](/docs/connections/storage/warehouses/faq/#can-we-add-tweak-or-delete-some-of-the-tables) -[Can I transform or clean up old data to new formats or specs?](/docs/connections/storage/warehouses/faq/) +[Can I transform or clean up old data to new formats or specs?](/docs/connections/storage/warehouses/faq/#can-we-transform-or-clean-up-old-data-to-new-formats-or-specs) [What are common errors and how do I debug them?](/docs/connections/storage/warehouses/warehouse-errors/) -[How do I speed up my queries?](/docs/connections/storage/warehouses/redshift-tuning/) +[How do I speed up my Redshift queries?](/docs/connections/storage/warehouses/redshift-tuning/) ### Analyzing with SQL [How do I forecast LTV with SQL and Excel for e-commerce businesses?](/docs/guides/how-to-guides/forecast-with-sql/) -[How do I measure the 
ROI of my Marketing Campaigns?](/docs/guides/how-to-guides/measure-marketing-roi/)
\ No newline at end of file
+[How do I measure the ROI of my Marketing Campaigns?](/docs/guides/how-to-guides/measure-marketing-roi/)
\ No newline at end of file
diff --git a/src/connections/storage/warehouses/redshift-useful-sql.md b/src/connections/storage/warehouses/redshift-useful-sql.md
new file mode 100644
index 0000000000..afb7295238
--- /dev/null
+++ b/src/connections/storage/warehouses/redshift-useful-sql.md
@@ -0,0 +1,349 @@
+---
+title: Useful SQL Queries for Redshift
+---
+Below you'll find a library of some of the most useful SQL queries customers use in their Redshift warehouses. You can run these in your Redshift instance with little to no modification.
+
+> success "Ways to improve query speed"
+> If you're looking to improve the speed of your queries, check out Segment's [Speeding Up Redshift Queries](/docs/connections/storage/warehouses/redshift-tuning/) page.
+
+You can use SQL queries for the following tasks:
+- [Tracking events](#tracking-events)
+- [Defining sessions](#define-sessions)
+- [Identifying users](#identify-users)
+- [Groups to accounts](#groups-to-accounts)
+
+> note " "
+> If you're looking for SQL queries for warehouses other than Redshift, check out some of Segment's [Analyzing with SQL guides](/docs/connections/storage/warehouses/index/#analyzing-with-sql).
+
+## Tracking events
+
+The Track call allows you to record any actions your users perform. A Track call takes three parameters: the userId, the event, and any optional properties.
+
+Here's a basic Track call:
+
+```javascript
+analytics.track('Completed Order', {
+  item: 'pants',
+  color: 'blue',
+  size: '32x32',
+  payment: 'credit card'
+});
+```
+
+A completed order Track call might look like this:
+
+```javascript
+analytics.track('Completed Order', {
+  item: 'shirt',
+  color: 'green',
+  size: 'Large',
+  payment: 'paypal'
+});
+```
+
+Each Track call is stored as a distinct row in a single Redshift table called `tracks`. To get a table of your completed orders, you can run the following query:
+
+```sql
+select *
+from initech.tracks
+where event = 'completed_order'
+```
+
+That SQL query returns a table that looks like this:
+
+![](images/sql-redshift-table-1.jpg)
+
+But why are there columns in the table that weren't a part of the Track call, like `event_id`?
+This is because the Track method (for client-side libraries) includes additional properties of the event, like `event_id`, `sent_at`, and `user_id`!
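+
+These standard columns can be queried like any other column; a minimal sketch, reusing the `initech` source from the examples above:
+
+```sql
+-- Inspect the standard columns Segment adds to every row
+select event_id, user_id, sent_at
+from initech.tracks
+where event = 'completed_order'
+limit 10
+```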
+ +### Grouping events by day +If you want to know how many orders were completed over a span of time, you can use the `date()` and `count` function with the `sent_at` timestamp: + +```sql +select date(sent_at) as date, count(event) +from initech.tracks +where event = 'completed_order' +group by date +``` +That query returns a table like this: + +| date | count | +| ---------- | ----- | +| 2021-12-09 | 5 | +| 2021-12-08 | 3 | +| 2021-12-07 | 2 | + +To see the number of pants and shirts that were sold on each of those dates, you can query that using case statements: + +```sql +select date(sent_at) as date, +sum(case when item = 'shirt' then 1 else 0 end) as shirts, +sum(case when item = 'pants' then 1 else 0 end) as pants +from initech.tracks +where event = 'completed_order' +group by date +``` + +That query returns a table like this: + +| date | shirts | pants | +| ---------- | ------ | ----- | +| 2021-12-09 | 3 | 2 | +| 2021-12-08 | 1 | 2 | +| 2021-12-07 | 2 | 0 | + + +## Define sessions +Segment’s API does not impose any restrictions on your data with regard to user sessions. + +Sessions aren’t fundamental facts about the user experience. They’re stories Segment builds around the data to understand how customers actually use the product in their day-to-day lives. And since Segment’s API is about collecting raw, factual data, there's no API for collecting sessions. Segment leaves session interpretation to SQL partners, which let you design how you measure sessions based on how customers use your product. + +For more on why Segment doesn't collect session data at the API level, [check out a blog post here](https://segment.com/blog/facts-vs-stories-why-segment-has-no-sessions-api/){:target="_blank"}. + +### How to define user sessions using SQL +Each of Segment's SQL partners allow you to define sessions based on your specific business needs. With [Looker](https://looker.com){:target="_blank"}, for example, you can take advantage of their persistent derived tables and LookML modeling language to layer sessionization on top of your Segment SQL data. Segment recommends [checking out Looker's approach here](https://segment.com/blog/using-sql-to-define-measure-and-analyze-user-sessions/). + +To define sessions with raw SQL, a great query and explanation comes from [Mode Analytics](https://mode.com). + +Here’s the query to make it happen, but read Mode Analytics' [blog post](https://blog.modeanalytics.com/finding-user-sessions-sql/) for more information. Mode walks you through the reasoning behind the query, what each portion accomplishes, how you can tweak it to suit your needs, and the kinds of further analysis you can add on top of it. 
+
+```sql
+-- Finding the start of every session
+SELECT *
+  FROM (
+    SELECT *,
+           LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
+      FROM "your_source".tracks
+  ) last
+WHERE EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
+   OR last_event IS NULL
+
+-- Mapping every event to its session
+SELECT *,
+       SUM(is_new_session) OVER (ORDER BY user_id, sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_session_id,
+       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY sent_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS user_session_id
+  FROM (
+    SELECT *,
+           CASE WHEN EXTRACT('EPOCH' FROM sent_at) - EXTRACT('EPOCH' FROM last_event) >= (60 * 10)
+                  OR last_event IS NULL
+                THEN 1 ELSE 0 END AS is_new_session
+      FROM (
+        SELECT *,
+               LAG(sent_at,1) OVER (PARTITION BY user_id ORDER BY sent_at) AS last_event
+          FROM "your_source".tracks
+      ) last
+  ) final
+```
+
+## Identify users
+
+### Historical traits
+
+The Identify method ties user attributes to a `userId`.
+
+```javascript
+analytics.identify('bob123', {
+  email: 'bob@initech.com',
+  plan: 'Free'
+});
+```
+As these user traits change over time, you can continue calling the Identify method to record the changes. With this call, you can update Bob's account plan to "Premium".
+
+```javascript
+analytics.identify('bob123', {
+  email: 'bob@initech.com',
+  plan: 'Premium'
+});
+```
+
+Each Identify call is stored in a single Redshift table called `identifies`. To see how a user's plan changes over time, you can run the following query:
+
+```sql
+select user_id, email, plan, sent_at
+from initech.identifies
+where email = 'bob@initech.com'
+```
+
+This SQL query returns a table of Bob's account information, with each entry representing the state of his account at different time periods:
+
+| user_id | email | plan | sent_at |
+| ------- | --------------- | ------- | ------------------- |
+| bob123 | bob@initech.com | Premium | 2021-12-20 19:44:03 |
+| bob123 | bob@initech.com | Free | 2021-12-18 17:48:10 |
+
+If you want to see what your users looked like at a previous point in time, you can find that data in the `identifies` table. To get this table for your users, replace 'initech' in the SQL query with your source slug.
+
+If you only want the current state of the users, convert the `identifies` table into a [distinct users table](#convert-the-identifies-table-into-a-users-table) by returning the most recent Identify call for each account. 
+
+### Convert the identifies table into a users table
+
+The following query returns the `identifies` table:
+
+```sql
+select *
+from initech.identifies
+```
+That query returns a table like this:
+
+| user_id | email | plan | sent_at |
+| ------- | ---------------- | ------- | ------------------- |
+| bob123 | bob@initech.com | Premium | 2021-12-20 19:44:03 |
+| bob123 | bob@initech.com | Free | 2021-12-18 17:48:10 |
+| jeff123 | jeff@initech.com | Premium | 2021-12-20 19:44:03 |
+| jeff123 | jeff@initech.com | Free | 2021-12-18 17:48:10 |
+
+If all you want is a table of distinct users with their current traits and without duplicates, you can get one with the following query:
+
+```sql
+with identifies as (
+  select user_id,
+         email,
+         plan,
+         sent_at,
+         row_number() over (partition by user_id order by sent_at desc) as rn
+  from initech.identifies
+),
+users as (
+  select user_id,
+         email,
+         plan
+  from identifies
+  where rn = 1
+)
+
+select *
+from users
+```
+
+### Counts of user traits
+Let's say you have an `identifies` table that looks like this:
+
+| user_id | email | plan | sent_at |
+| ------- | ---------------- | ------- | ------------------- |
+| bob123 | bob@initech.com | Premium | 2021-12-20 19:44:03 |
+| bob123 | bob@initech.com | Free | 2021-12-18 17:48:10 |
+| jeff123 | jeff@initech.com | Premium | 2021-12-20 19:44:03 |
+| jeff123 | jeff@initech.com | Free | 2021-12-18 17:48:10 |
+
+If you want to query the traits of these users, you first need to [convert the identifies table into a users table](#convert-the-identifies-table-into-a-users-table). From there, run a query like this to get a count of users with each type of plan:
+
+```sql
+with identifies as (
+  select user_id,
+         email,
+         plan,
+         sent_at,
+         row_number() over (partition by user_id order by sent_at desc) as rn
+  from initech.identifies
+),
+users as (
+  select plan
+  from identifies
+  where rn = 1
+)
+
+select sum(case when plan = 'Premium' then 1 else 0 end) as premium,
+       sum(case when plan = 'Free' then 1 else 0 end) as free
+from users
+```
+
+And there you go: a count of users with each type of plan!
+
+| premium | free |
+| ------- | ---- |
+| 2 | 0 |
+
+## Groups to accounts
+
+### Historical traits
+
+The `group` method ties a user to a group. It also lets you record custom traits about the group, like the industry or number of employees.
+
+Here's what a basic `group` call looks like:
+
+```javascript
+analytics.group('0e8c78ea9d97a7b8185e8632', {
+  name: 'Initech',
+  industry: 'Technology',
+  employees: 329,
+  plan: 'Premium'
+});
+```
+As these group traits change over time, you can continue calling the group method to record the changes.
+
+```javascript
+analytics.group('0e8c78ea9d97a7b8185e8632', {
+  name: 'Initech',
+  industry: 'Technology',
+  employees: 600,
+  plan: 'Enterprise'
+});
+```
+
+Each group call is stored as a distinct row in a single Redshift table called `groups`. To see how a group changes over time, you can run the following query:
+
+```sql
+select name, industry, plan, employees, sent_at
+from initech.groups
+where name = 'Initech'
+```
+
+The previous query returns a table of Initech's group information, with each entry representing the state of the account at different times. 
+
+| name | industry | employees | plan | sent_at |
+| ------- | ---------- | --------- | ---------- | ------------------- |
+| Initech | Technology | 600 | Enterprise | 2021-12-20 19:44:03 |
+| Initech | Technology | 329 | Premium | 2021-12-18 17:18:15 |
+
+If you want to see a group's traits at a previous point in time, this query is useful. To get this table for your groups, replace 'initech' with your source slug.
+
+If you only want to see the most recent state of the group, you can convert the groups table into a distinct groups table by returning the most recent group call for each account.
+
+### Converting the groups table into an organizations table
+
+The following query returns your groups table:
+
+```sql
+select *
+from initech.groups
+```
+
+The previous query returns the following table:
+
+| name | industry | employees | plan | sent_at |
+| --------- | ------------- | --------- | ---------- | ------------------- |
+| Initech | Technology | 600 | Enterprise | 2021-12-20 19:44:03 |
+| Initech | Technology | 329 | Premium | 2021-12-18 17:18:15 |
+| Acme Corp | Entertainment | 15 | Premium | 2021-12-20 19:44:03 |
+| Acme Corp | Entertainment | 10 | Free | 2021-12-18 17:18:15 |
+
+However, if all you want is a table of distinct groups and current traits, you can get one with the following query:
+
+```sql
+with groups as (
+  select name,
+         industry,
+         employees,
+         plan,
+         sent_at,
+         row_number() over (partition by name order by sent_at desc) as rn
+  from initech.groups
+),
+organizations as (
+  select name,
+         industry,
+         employees,
+         plan
+  from groups
+  where rn = 1
+)
+
+select *
+from organizations
+```
+This query returns a table with your distinct groups, without duplicates.
+
+| name | industry | employees | plan |
+| --------- | ------------- | --------- | ---------- |
+| Initech | Technology | 600 | Enterprise |
+| Acme Corp | Entertainment | 15 | Premium |
\ No newline at end of file
diff --git a/src/connections/storage/warehouses/schema.md b/src/connections/storage/warehouses/schema.md
index a35d52248b..c0198ae4ad 100644
--- a/src/connections/storage/warehouses/schema.md
+++ b/src/connections/storage/warehouses/schema.md
@@ -2,6 +2,117 @@
title: Warehouse Schemas
---

+A **schema** describes the way that the data in a warehouse is organized. Segment stores data in relational schemas, which organize data into the following template:
+`<source>.<collection>.<property>`, for example `segment_engineering.tracks.user_id`, where source refers to the source or project name (segment_engineering), collection refers to the event (tracks), and the property refers to the data being collected (user_id). All schemas convert collection and property names from `CamelCase` to `snake_case`.
+
+> note "Warehouse column creation"
+> **Note:** Segment creates tables for each of your custom events in your warehouse, with columns for each event's custom properties. Segment does not allow unbounded `event` or `property` spaces in your data. Instead of recording events like "Ordered Product 15", use a single property of "Product Number" or similar.
+
+### How warehouse tables handle nested objects and arrays
+
+Segment's libraries pass nested objects and arrays into tracking calls as **properties**, **traits**, and **context fields**. 
To preserve the quality of your events data, Segment uses the following methods to store properties and traits in database tables: + +- The warehouse connector stringifies all **properties** that contain a nested **array/object** +- The warehouse connector stringifies all **context fields** that contain a nested **array** +- The warehouse connector stringifies all **traits** that contain a nested **array** +- The warehouse connector "flattens" all **traits** that contain a nested **object** +- The warehouse connector optionally stringifies **arrays** when they follow the [Ecommerce spec](/docs/connections/spec/ecommerce/v2/) +- The warehouse connector "flattens" all **context fields** that contain a nested **object** (for example, context.field.nestedA.nestedB becomes a column called context_field_nestedA_nestedB) + +
**Field****Type****Description**FieldTypeDescription
`active` BooleanObject dictionary of information about the current application, containing `name`, `version` and `build`.

- This is collected automatically from our mobile libraries when possible. + This is collected automatically from the mobile libraries when possible.
`page` ObjectDictionary of information about the current page in the browser, containing `path`, `referrer`, `search`, `title` and `url`. This is automatically collected by [Analytics.js](https://segment.com/docs/connections/sources/catalog/libraries/website/javascript/#context--traits). + Dictionary of information about the current page in the browser, containing `path`, `referrer`, `search`, `title` and `url`. This is automatically collected by [Analytics.js](/docs/connections/sources/catalog/libraries/website/javascript/#context--traits).
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Field Code (Example) Schema (Example)
Object (Context): Flatten + +``` json +context: { + app: { + version: "1.0.0" + } +} +``` + + Column Name:
+ context_app_version +

+ Value:
+ "1.0.0" +
Object (Traits): Flatten + +```json +traits: { + address: { + street: "6th Street" + } +} +``` + + +Column Name:
+address_street
+
+Value:
+"6th Street" +
Object (Properties): Stringify + +```json +properties: { + product_id: { + sku: "G-32" + } +} +``` + + Column Name:
+ product_id

+ Value:
+ "{sku.'G-32'}" +
Array (Any): Stringify + +```json +products: { + product_id: [ + "507f1", "505bd" + ] +} +``` + + + Column Name:
+ product_id

+ Value: + "[507f1, 505bd]" +
+
 ## Warehouse tables
 
 The table below describes the schema in Segment Warehouses:
@@ -15,7 +126,7 @@ The table below describes the schema in Segment Warehouses:
 | `<source>.groups` | A table with your `group` method calls. This table includes the `traits` you record for groups as top-level columns, for example `<source>.groups.employee_count`. |
 | `<source>.accounts` | *IN BETA* A table with unique `group` method calls. Group calls are upserted into this table (updated if an existing entry exists, appended otherwise). This table holds the latest state of a group. |
 | `<source>.identifies` | A table with your `identify` method calls. This table includes the `traits` you identify users by as top-level columns, for example `<source>.identifies.email`. |
-| `<source>.users` | A table with unique `identify` calls. `identify` calls are upserted on `user_id` into this table (updated if an existing entry exists, appended otherwise). This table holds the latest state of a user. The `id` column in the users table is the same as the `user_id` column in the identifies table. Also note that this table won't have an `anonymous_id` column since a user can have multiple anonymousIds. To retrieve a user's `anonymousId`, query the identifies table. *If you observe any duplicates in the users table [contact us](https://segment.com/help/contact/) (unless you are using BigQuery, where [this is expected](/docs/connections/storage/catalog/bigquery/#schema))*. |
+| `<source>.users` | A table with unique `identify` calls. `identify` calls are upserted on `user_id` into this table (updated if an existing entry exists, appended otherwise). This table holds the latest state of a user. The `id` column in the users table is the same as the `user_id` column in the identifies table. Also note that this table won't have an `anonymous_id` column since a user can have multiple anonymousIds. To retrieve a user's `anonymousId`, query the identifies table. *If you observe any duplicates in the users table [contact Segment support](https://segment.com/help/contact/) (unless you are using BigQuery, where [this is expected](/docs/connections/storage/catalog/bigquery/#schema))*. |
 | `<source>.pages` | A table with your `page` method calls. This table includes the `properties` you record for pages as top-level columns, for example `<source>.pages.title`. |
 | `<source>.screens` | A table with your `screen` method calls. This table includes `properties` you record for screens as top-level columns, for example `<source>.screens.title`. |
 | `<source>.tracks` | A table with your `track` method calls. This table includes standardized properties that are common to all events: `anonymous_id`, `context_*`, `event`, `event_text`, `received_at`, `sent_at`, and `user_id`. This is because every event that you send to Segment has different properties. For querying by the custom properties, use the `<source>.<event>` tables instead. |
@@ -24,7 +135,7 @@ The table below describes the schema in Segment Warehouses:
 
 ## Identifies table
 
-The `identifies` table stores the `.identify()` method calls =. Query it to find out user-level information. It has the following columns:
+The `identifies` table stores the `.identify()` method calls. Query it to find out user-level information. It has the following columns:
 
 | method | property |
 | --------------- | -------- |
@@ -229,104 +340,6 @@ AND table_name = '<event>'
 ORDER by column_name
 ```
 
-### How event tables handle nested objects and arrays
-
-To preserve the quality of your events data, Segment uses the following methods to store objects and arrays in the event tables:
-
-<table>
-  <tr>
-    <th>Field</th>
-    <th>Code (Example)</th>
-    <th>Schema (Example)</th>
-  </tr>
-  <tr>
-    <td>Object (Context): Flatten</td>
-    <td>
-
-``` json
-context: {
-  app: {
-    version: "1.0.0"
-  }
-}
-```
-
-    </td>
-    <td>Column Name:<br>
-    context_app_version<br>
-    <br>
-    Value:<br>
-    "1.0.0"</td>
-  </tr>
-  <tr>
-    <td>Object (Traits): Flatten</td>
-    <td>
-
-```json
-traits: {
-  address: {
-    street: "6th Street"
-  }
-}
-```
-
-    </td>
-    <td>Column Name:<br>
-    address_street<br>
-    <br>
-    Value:<br>
-    "6th Street"</td>
-  </tr>
-  <tr>
-    <td>Object (Properties): Stringify</td>
-    <td>
-
-```json
-properties: {
-  product_id: {
-    sku: "G-32"
-  }
-}
-```
-
-    </td>
-    <td>Column Name:<br>
-    product_id<br>
-    <br>
-    Value:<br>
-    "{sku.'G-32'}"</td>
-  </tr>
-  <tr>
-    <td>Array (Any): Stringify</td>
-    <td>
-
-```json
-products: {
-  product_id: [
-    "507f1", "505bd"
-  ]
-}
-```
-
-    </td>
-    <td>Column Name:<br>
-    product_id<br>
-    <br>
-    Value:<br>
-    "[507f1, 505bd]"</td>
-  </tr>
-</table>
-
 ## Tracks vs. Events Tables
 
 To see the tables for your organization, you can run this query:
@@ -391,24 +404,32 @@ ORDER BY day
 
 New event properties and traits create columns. Segment processes the incoming data in batches, based on either data size or an interval of time. If the table doesn't exist we lock and create the table. If the table exists but new columns need to be created, we perform a diff and alter the table to append new columns.
 
-> note "Column creation in Redshift"
-> **Note:** Segment creates tables for each of your custom events, and columns for each event's custom properties. Redshift has limits on the number of columns in a table, so Segment does not allow unbounded event or property spaces in your data. Instead of recording events like "Ordered Product 15", use a single property of "Product Number" or similar.
-
 When Segment process a new batch and discover a new column to add, we take the most recent occurrence of a column and choose its datatype.
 
-The data types that we currently support include:
-- `timestamp`
-- `integer`
-- `float`
-- `boolean`
-- `varchar`
+### Supported Data Types
+Data types are set up in your warehouse based on the first value that comes in from a source. For example, if the first value that came in from a source was a string, Segment would set the data type in the warehouse to `string`.
+
+The data types that Segment currently supports include:
+
+#### `timestamp`
+
+#### `integer`
+
+#### `float`
+
+#### `boolean`
+
+#### `varchar`
+
+> note ""
+> To change data types after they've been determined, please reach out to [Segment Support](https://segment.com/help/contact) for assistance.
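+
+To check which type was inferred for each column, you can query the information schema. A minimal sketch, assuming hypothetical `my_source` and `order_completed` names:
+
+```sql
+-- Lists each column alongside the data type the warehouse assigned to it
+SELECT column_name, data_type
+FROM information_schema.columns
+WHERE table_schema = 'my_source'
+  AND table_name = 'order_completed'
+ORDER BY column_name;
+```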
 
 ## Column Sizing
 
 After analyzing the data from dozens of customers, we set the string column length limit at 512 characters. Longer strings are truncated. We found this was the sweet spot for good performance and ignoring non-useful data.
 
-We special-case compression for some known columns, like event names and timestamps. The others default to LZO. We may add look-ahead sampling down the road, but from inspecting the datasets today this would be unnecessary complexity.
+Segment uses special-case compression for some known columns, like event names and timestamps. The others default to LZO. Segment may add look-ahead sampling down the road, but from inspecting the datasets today this would be unnecessarily complex.
 
 ## Timestamps
 
@@ -438,12 +459,14 @@ To learn more about timestamps in Segment, [read our timestamps overview](/docs/
 
 Each row in your database will have an `id` which is equivalent to the messageId which is passed through in the raw JSON events. The `id` is a unique message id associated with the row.
 
-## uuid and uuid_ts
+## uuid, uuid_ts, and loaded_at
 
 The `uuid` column is used to prevent duplicates. You can ignore this column.
 
 The `uuid_ts` column is used to keep track of when the specific event was last processed by our connector, specifically for deduping and debugging purposes. You can generally ignore this column.
 
+The `loaded_at` column contains the UTC timestamp reflecting when the data was staged by the processor.
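+
+Because these columns are maintained by Segment's loader, their most recent values give a rough idea of when a table last finished loading. A minimal sketch, assuming a hypothetical `my_source.tracks` table:
+
+```sql
+-- The maximums of uuid_ts and loaded_at approximate the end of the last sync
+SELECT MAX(uuid_ts)   AS last_processed_at,
+       MAX(loaded_at) AS last_staged_at
+FROM my_source.tracks;
+```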
+
 
 ## Sort Key
 
 All tables use `received_at` for the sort key. Amazon Redshift stores your data on disk in sorted order according to the sort key. The Redshift query optimizer uses sort order when it determines optimal query plans.
@@ -454,4 +477,6 @@ All tables use `received_at` for the sort key. Amazon Redshift stores your data
 
 [How do I give users permissions to my warehouse?](/docs/connections/storage/warehouses/add-warehouse-users/)
 
-Check out our [Frequently Asked Questions about Warehouses](/docs/connections/storage/warehouses/faq/) and [a list of helpful queries to get you started](https://help.segment.com/hc/en-us/articles/205577035-Common-Segment-SQL-Queries).
+[How frequently does data sync to my warehouse?](/docs/connections/storage/warehouses/warehouse-syncs/#sync-frequency)
+
+Check out our [Frequently Asked Questions about Warehouses](/docs/connections/storage/warehouses/faq/) and [a list of helpful Redshift queries to get you started](/docs/connections/storage/warehouses/redshift-useful-sql).
diff --git a/src/connections/storage/warehouses/warehouse-syncs.md b/src/connections/storage/warehouses/warehouse-syncs.md
index 7107498667..722bc449fd 100644
--- a/src/connections/storage/warehouses/warehouse-syncs.md
+++ b/src/connections/storage/warehouses/warehouse-syncs.md
@@ -3,13 +3,26 @@ title: Warehouse Syncs
 redirect_from: '/connections/warehouses/selective-sync/'
 ---
 
-The Warehouse Sync process prepares the raw data coming from a source and loads it into a warehouse destination. There are two phases to the sync process:
-1. **Preparation phase**: This is where Segment prepares the data coming from a source so that it's in the right format for the loading phase.
-2. **Loading phase**: This is where Segment deduplicates data and the data loads into the warehouse destination. Any sync issues that occur in this phase can be traced back to your warehouse.
-
 Instead of constantly streaming data to the warehouse destination, Segment loads data to the warehouse in bulk at regular intervals. Before the data loads, Segment inserts and updates events and objects, and automatically adjusts the schema to make sure the data in the warehouse is in line with the data in Segment.
 
-Warehouses sync with all data coming from your source and your data is available in your warehouse within 24-48 hours. If you'd like to manage the data you send to your warehouse, use [Warehouse Selective Sync](#warehouse-selective-sync).
+{% include content/how-a-sync-works.md %}
+
+Warehouses sync with all data coming from your source. However, Business plan members can manage the data that is sent to their warehouses using [Selective Sync](#warehouse-selective-sync).
+
+## Sync Frequency
+
+Your plan determines how frequently data is synced to your warehouse.
+
+| Plan      | Frequency                                                                                                      |
+| --------- | -------------------------------------------------------------------------------------------------------------- |
+| Free      | Once a day (every 86,400 seconds)                                                                              |
+| Team      | Twice a day (every 43,200 seconds)                                                                             |
+| Business* | Up to 24 times a day. Generally, these syncs are fixed to the top of the hour (:00), but these times can vary. |
+
+*If you're a Business plan member and would like to adjust your sync frequency, you can do so using the Selective Sync feature. To enable Selective Sync, please go to **Warehouse** > **Settings** > **Sync Schedule**.
+
+> note "Why can't I sync more than 24 times per day?"
+> Segment does not set syncs to happen more than once per hour (24 times per day). The warehouse product is not designed for real-time data, so more frequent syncs would not necessarily be helpful.
 
 ## Sync History
 
 You can use the Sync History page to see the status and history of data updates in your warehouse. The Sync History page is available for every source connected to each warehouse.
 This page helps you answer questions like, “Has the data from a specific source been updated recently?” “Did a sync completely fail, or only partially fail?” and “Why wasn’t this sync successful?”
diff --git a/src/partners/streams.md b/src/partners/streams.md
index 553e435124..5b638da4c3 100644
--- a/src/partners/streams.md
+++ b/src/partners/streams.md
@@ -1,11 +1,11 @@
 ---
-title: Building a Stream
+title: Build a Stream
 ---
 
 > info ""
-> The Developer Center currently only supports the [Subscription](/docs/partners/subscriptions) component in _Developer Preview_. Include [your information here](https://airtable.com/shrj3BkHMhdeaPYWt) and we'll contact you once _Streams_ are made available!
+> The Developer Center currently only supports the [Subscription](/docs/partners/subscriptions) component in _Developer Preview_. Include [your information here](https://airtable.com/shrj3BkHMhdeaPYWt) and Segment will contact you once _Streams_ are made available!
 
-Streams enable you to send data to our mutual customers from your web services in realtime.
+Streams enable you to send data to mutual customers from your web services in real time.
 
 # Building a Stream
 
@@ -19,33 +19,36 @@ Customers can find their write key in the source settings and regenerate it as n
 
 ![](images/s_8E933880F61B29168308B8A8203AE878319289A26E8E2054D0824C7A53E43DD4_1479162638952_file.png)
 
-*Important*: We are working on an OAuth solution to reduce friction for customers. Partner Streams submitted through the developer center *will* be required to support this OAuth as it comes available.
+> warning ""
+> Segment is working on an OAuth solution to reduce friction for customers. Partner Streams submitted through the developer center *will* be required to support this OAuth when it becomes available.
 
 ## The Segment Spec
 
-To learn about the semantics of the five supported API calls, and the semantic event names and properties we recognize, read the Segment [Spec](https://segment.com/docs/connections/spec).
+To learn about the semantics of the five supported API calls, and the semantic event names and properties Segment recognizes, read the Segment [Spec](/docs/connections/spec).
 
 The spec is a critical component of preserving semantics between sources and destinations. If you break the spec, you are breaking the promise of Segment, which is grounds for removal from the catalog.
 
-*Important*: If there are any events you send that match existing events from our spec that you are not adhering to (eg. sending "Purchase" instead of "Order Completed" or "install" instead of "Application Installed"), we will reject your application.
+> info ""
+> If any events you send to Segment match, but do not adhere to, existing events from the Segment Spec (for example, sending "Purchase" instead of "Order Completed" or "install" instead of "Application Installed"), Segment will reject your application.
 
-If there is something unique about your tool that requires specific data points that are not included in the spec, [get in touch](https://segment.com/help/contact/). We love partner suggestions for augmentations to the spec!
+If there is something unique about your tool that requires specific data points that are not included in the spec, [get in touch](https://segment.com/help/contact/){:target="_blank"}.
 
 ## Sending data
 
-To send events to Segment you should post events directly to the [Segment HTTP API](https://segment.com/docs/connections/sources/catalog/libraries/server/http-api/#track). You may use a Segment [library](https://segment.com/docs/connections/sources/catalog/) to do so. The HTTP API has a couple of basic requirements.
+To send events to Segment, post events directly to the [Segment HTTP API](/docs/connections/sources/catalog/libraries/server/http-api/#track). You may use a Segment [library](/docs/connections/sources/catalog/) to do so. The HTTP API has a couple of basic requirements.
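+
+For instance, a `track` call posted to the HTTP API might carry a body like the following (a rough sketch with illustrative values; see the `userId` and `writeKey` requirements below):
+
+```json
+{
+  "userId": "019mr8mf4r",
+  "event": "Email Opened",
+  "properties": {
+    "subject": "The Electric Daily"
+  },
+  "timestamp": "2021-02-02T00:23:09.544Z"
+}
+```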
 
 Beyond the Spec, there are a few additional requirements for partner Streams.
 
 ### `userId`
 
-Each call sent to Segment must contain a `userId`. The `userId` is what allows us to identify each unique user. This value should be stored by your tool when you receive an event from Segment.
+Each call sent to Segment must contain a `userId`. The `userId` is what allows Segment to identify each unique user. This value should be stored by your tool when you receive an event from Segment.
 
 For example, you might receive an `identify` call with a `userId` and `traits` about that user. If that user is sent an email and opens that email, you would want to send an `Email Opened` event back to Segment with that same `userId`.
 
 The `userId` should be part of the call body as a top-level object.
 
-> **For Customers, it's critical that the** `**userId**` **be consistent across all data flowing through Segment — this has significant implications for Segment billing (based on unique Monthly Tracked Users) and usefulness of data in integrations/warehouses. Passing back the** `**userId**` **value sent from Segment into your tool should be the default behavior of your track calls. If you're not a destination, make sure that you're using the customer's internal database ID, not your tool's ID.**
+> info ""
+> For Customers, it's critical that the `userId` be consistent across all data flowing through Segment — this has significant implications for Segment billing (based on unique Monthly Tracked Users) and usefulness of data in integrations/warehouses. Passing back the `userId` value sent from Segment into your tool should be the default behavior of your track calls. If you're not a destination, make sure that you're using the customer's internal database ID, not your tool's ID.
 
-If you have your own unique identifier you use in your tool, we recommend passing that along as a context property in the event for QA purposes. For example:
+If you have your own unique identifier you use in your tool, Segment recommends passing that along as a context property in the event for QA purposes. For example:
 
 ```json
 "type": "track",
@@ -60,7 +63,7 @@ If you have your own unique identifier you use in your tool, we recommend passin
 
 ### `integration`
 
-Each call should contain a `context.integration` object in the call body that identifies your tool (i.e., where the call is coming from). Use the slugified name for your tool, and `1.0.0` as the initial version — if you're unsure of your integration slug, contact us. Once Streams are supported in the Developer Center, this will be rendered for you and will be validated as part of the QA process.
+Each call should contain a `context.integration` object in the call body that identifies your tool (for example, where the call is coming from). Use the slugified name for your tool, and `1.0.0` as the initial version — if you're unsure of your integration slug, contact Segment support. Once Streams are supported in the Developer Center, this will be rendered for you and will be validated as part of the QA process.
 
 This should be part of the `context` top-level object and will look like:
 
@@ -77,18 +80,23 @@ This should be part of the `context` top-level object and will look like:
 
 Each call must contain a `writeKey`. Segment provides this `writeKey` to customers in the settings panel for each of their sources. As mentioned in the set up flow description above, customers will need to save their Segment write key in your UI in order to authenticate calls being made by your tool.
 
-The write key is required in the header of every call to identify the customer whose data we're receiving. See the [authentication section](https://segment.com/docs/connections/sources/catalog/libraries/server/http-api/#authentication) of the HTTP API docs for more detail. If you do not include a customer write key in the call header, we will reject track calls from your tool.
+The write key is required in the header of every call to identify the customer whose data Segment receives. See the [authentication section](/docs/connections/sources/catalog/libraries/server/http-api/#authentication) of the HTTP API docs for more detail. If you do not include a customer write key in the call header, Segment will reject track calls from your tool.
 
 **Rate limits and batching**
 
-There is no hard rate limit at which point Segment will drop your data. However, to avoid processing delays, we ask partners to send requests at a maximum rate of 50 requests per second.
+There is no hard rate limit at which point Segment will drop your data. However, to avoid processing delays, Segment asks partners to send requests at a maximum rate of 50 requests per second.
+
+If you want to batch requests to the HTTP endpoint, refer to the [batching documentation](/docs/connections/sources/catalog/libraries/server/http-api/#import). The suggested maximum rate includes any batch requests.
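+
+A minimal sketch of a batch payload with illustrative values (refer to the batching documentation above for the authoritative format):
+
+```json
+{
+  "batch": [
+    { "type": "track", "userId": "019mr8mf4r", "event": "Email Opened" },
+    { "type": "track", "userId": "971mj8mk7p", "event": "Email Clicked" }
+  ]
+}
+```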
+
+## Regional Segment
+Segment offers customers the option to lead on data residency by providing regional infrastructure in both Europe and the United States.
 
-If you want to batch requests to the HTTP endpoint, refer to the batching documentation [here](https://segment.com/docs/connections/sources/catalog/libraries/server/http-api/#import). The suggested maximum rate includes any batch requests.
+Segment recommends that you enable users to choose which Segment [endpoint](/docs/guides/regional-segment/#server-side-and-project-sources) to send data to for a given writeKey.
 
 # Process
 
 ## Plan
 
-If you have not already, contact review your timeline and resourcing plan. Include which events your source will be sending to Segment to ensure they are properly specified. We are onboarding new sources as quickly as we can, but you should only commence building once you receive approval from Segment.
+If you have not already, contact Segment to review your timeline and resourcing plan. Include which events your source will be sending to Segment to ensure they are properly specified. Segment onboards new sources as quickly as possible, but you should only commence building once you receive approval from Segment.
 
 ## Build
 
 All three of these steps should be completed before you begin testing:
 
 - Following the guidelines above, format your outbound webhook to Segment's HTTP API.
 - Add a field in your settings UI where customers can input their Segment write key.
-- Write docs for your source — you'll need to have separate docs for your source ([example](https://segment.com/docs/connections/sources/catalog/cloud-apps/drip/)) and integration.
+- Write docs for your source — you'll need to have separate docs for your source ([example](/docs/connections/sources/catalog/cloud-apps/drip/)) and integration.
 
 ## Testing
 
diff --git a/src/protocols/tracking-plan/create.md b/src/protocols/tracking-plan/create.md
index 67a9561ee3..0ef80f5f3f 100644
--- a/src/protocols/tracking-plan/create.md
+++ b/src/protocols/tracking-plan/create.md
@@ -32,12 +32,12 @@ To create a new Tracking Plan:
 
 ### Tracking Plan Columns
 
 The Tracking Plan editor is organized as a spreadsheet to help you add new events and properties, and edit the relevant fields for each. Like a spreadsheet, you can navigate across cells in a single event with your arrow keys and press enter to edit a cell.
 
-| Column Name | Details |
-| ------------ | --------- |
-| Name | Specify the name of your event or property. |
-| Description | Enter a description for your event or property. These descriptions are helpful for both engineers instrumenting Segment and consumers of the data. |
-| Status | Specify whether a property is required or optional. You can't require a `.track()` call because Segment is unable to verify when a `.track()` call should be fired. |
-| Data Type | Specify the data type of the property. Data type options include `any, array, object, boolean, integer, number, string, Date time`. Note: Date time is required to be in ISO-8601 format |
+| Column Name      | Details                                                                                                                                                                                   |
+| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Name             | Specify the name of your event or property.                                                                                                                                               |
+| Description      | Enter a description for your event or property. These descriptions are helpful for both engineers instrumenting Segment and consumers of the data.                                       |
+| Status           | Specify whether a property is required or optional. You can't require a `.track()` call because Segment is unable to verify when a `.track()` call should be fired.                      |
+| Data Type        | Specify the data type of the property. Data type options include `any, array, object, boolean, integer, number, string, Date time`. Note: Date time is required to be in ISO-8601 format. |
 | Permitted Values | Enter simple regular expressions to validate property values. This works when a property data type is set to `string`. For example, you can add pipe delimited strings to the regex column to generate violations when a property value does not match fall, winter or spring. |
 
 > info ""