harehimself/duxsoup-etl

An ETL system that uses the DuxSoup API for programmatic LinkedIn extraction. The pipeline automatically retrieves detailed LinkedIn profile data from first-degree connections for network analysis and relationship-intelligence applications.


A production-ready LinkedIn extraction pipeline. The system processes DuxSoup profile extractions in real time, normalizes the data into structured MongoDB Atlas records, and routes scans and visits to dedicated handlers. It includes health checks, validation, and logging, and is built for background processing and extensibility, which makes it useful for lead enrichment and graph-based CRM models.





Table of Contents

  • Features
  • API Endpoints
  • Data Models
  • Data Normalization
  • Tech Stack
  • Local Development
  • Project Structure
  • Deployment
  • Development
  • Debugging
  • Troubleshooting
  • Logs
  • Monitoring
  • Data Validation
  • Security Notes
  • License

Features

  • Webhook Processing: Handles DuxSoup LinkedIn data via POST /api/webhook.
  • Type-Based Routing: Automatically routes scan vs. visit data to the appropriate handler based on the type field in the payload (see the sketch after this list).
  • Data Validation: Comprehensive validation of required fields, including a custom validator that ensures id is a non-empty string.
  • Error Handling: Robust error handling with detailed logging via Winston.
  • Production Ready: Designed for deployment on platforms like Render, with health monitoring.
  • MongoDB Storage: Persists processed data in MongoDB Atlas via Mongoose.
  • Extensible: New fields and normalization rules are straightforward to add.
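
A minimal sketch of how the type-based routing might look inside the webhook controller. It assumes express.json() body parsing; the handler names and response body are illustrative, not the project's actual identifiers:

// Illustrative sketch: route an incoming DuxSoup payload by its "type" field.
async function handleWebhook(req, res) {
  const payload = req.body;

  // Custom id check: must be a non-empty string.
  if (typeof payload.id !== 'string' || payload.id.trim() === '') {
    return res.status(400).json({ error: 'id must be a non-empty string' });
  }

  switch (payload.type) {
    case 'visit':
      await saveVisit(payload); // persist as a Visit document
      break;
    case 'scan':
      await saveScan(payload); // persist as a Scan document
      break;
    default:
      return res.status(400).json({ error: `Unknown type: ${payload.type}` });
  }

  return res.status(200).json({ status: 'received', id: payload.id });
}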

API Endpoints

POST /api/webhook

Main webhook endpoint for DuxSoup data processing. It expects a JSON payload with a type field to route the data correctly.

Request Body Examples:

Visit Data:

{
  "type": "visit",
  "id": "visit_12345",
  "VisitTime": "2025-06-06T10:30:00Z",
  "Profile": "[https://www.linkedin.com/in/johndoe/](https://www.linkedin.com/in/johndoe/)",
  "Degree": "MBA",
  "First Name": "John",
  "Last Name": "Doe",
  "Headline": "Senior Software Engineer at TechCorp",
  "Location": "San Francisco, CA",
  "Connections": "500+",
  "Position-0-Company": "TechCorp",
  "Position-0-Title": "Senior Software Engineer",
  "Position-0-StartDate": "Jan 2022",
  "Position-0-EndDate": "Present",
  "Position-0-Duration": "3 years 5 months",
  "Position-0-Location": "San Francisco, CA",
  "Position-0-Description": "Leading development of cloud-native applications",
  "Position-1-Company": "StartupXYZ",
  "Position-1-Title": "Software Engineer",
  "Position-1-StartDate": "Jun 2019",
  "Position-1-EndDate": "Dec 2021",
  "Position-1-Duration": "2 years 7 months",
  "Position-1-Location": "New York, NY",
  "School-0-School": "Stanford University",
  "School-0-Degree": "Master of Business Administration",
  "School-0-Field": "Technology Management",
  "School-0-StartYear": "2017",
  "School-0-EndYear": "2019",
  "School-1-School": "UC Berkeley",
  "School-1-Degree": "Bachelor of Science",
  "School-1-Field": "Computer Science",
  "School-1-StartYear": "2013",
  "School-1-EndYear": "2017",
  "Skill-0": "JavaScript",
  "Skill-1": "Node.js",
  "Skill-2": "React",
  "Skill-3": "MongoDB",
  "Skill-4": "AWS",
  "Summary": "Experienced software engineer with expertise in full-stack development",
  "Industry": "Technology"
}

Scan Data:

{
  "type": "scan",
  "id": "scan_67890",
  "ScanTime": "2025-06-06T14:15:00Z",
  "Profile": "[https://www.linkedin.com/in/janesmith/](https://www.linkedin.com/in/janesmith/)",
  "First Name": "Jane",
  "Last Name": "Smith",
  "Headline": "Marketing Director at BigCorp",
  "Location": "Chicago, IL",
  "Connections": "1000+",
  "Position-0-Company": "BigCorp",
  "Position-0-Title": "Marketing Director",
  "Position-0-StartDate": "Mar 2021",
  "Position-0-EndDate": "Present",
  "Position-0-Duration": "4 years 3 months",
  "Position-0-Location": "Chicago, IL",
  "Position-1-Company": "MidSize Inc",
  "Position-1-Title": "Senior Marketing Manager",
  "Position-1-StartDate": "Jan 2018",
  "Position-1-EndDate": "Feb 2021",
  "Position-1-Duration": "3 years 2 months",
  "Position-1-Location": "Detroit, MI",
  "School-0-School": "Northwestern University",
  "School-0-Degree": "Master of Marketing",
  "School-0-StartYear": "2016",
  "School-0-EndYear": "2018",
  "Skill-0": "Digital Marketing",
  "Skill-1": "Brand Management",
  "Skill-2": "Analytics",
  "Summary": "Experienced marketing professional",
  "Industry": "Marketing"
}

Responses:

On success the webhook responds with a JSON acknowledgment. Payloads that fail validation (for example, a missing or empty id) are rejected with an error response.

GET /health

Health check endpoint returning server and database status.

{
  "status": "ok",
  "database": {
    "isConnected": true,
    "readyState": 1,
    "host": "your_mongodb_host",
    "name": "duxsoup-etl"
  },
  "timestamp": "2025-06-06T..."
}

GET /api/test

Test endpoint for API verification.
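
Assuming the server is running locally on the port from .env.example, it can be exercised with:

curl http://localhost:3000/api/test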


Data Models

The application uses Mongoose to define schemas for Visit and Scan data, ensuring data integrity and structure. Both models include createdAt and updatedAt timestamps.

Visit Model

  • id: String, Required, Unique, Indexed.
  • VisitTime: Date, Required, Indexed.
  • Profile: String, Required, Indexed.
  • First Name: String, Required.
  • Last Name: String (Optional).
  • Additional Data Points: SalesProfile, RecruiterProfile, Picture, Middle Name, Connections, Summary, Title, From, Company, CompanyProfile, CompanyWebsite, PersonalWebsite, Email, Phone, IM, Twitter, Location, Industry, etc.
  • rawData: Mixed type, stores the entire original webhook payload (see the schema sketch below).
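
A minimal sketch of how such a schema might be declared with Mongoose, abbreviated to the fields listed above (the real model defines many more):

const mongoose = require('mongoose');

// Abbreviated Visit schema sketch; { timestamps: true } adds createdAt / updatedAt.
const visitSchema = new mongoose.Schema(
  {
    id: {
      type: String,
      required: true,
      unique: true,
      index: true,
      // Custom validator: id must be a non-empty string.
      validate: {
        validator: (v) => typeof v === 'string' && v.trim().length > 0,
        message: 'id must be a non-empty string',
      },
    },
    VisitTime: { type: Date, required: true, index: true },
    Profile: { type: String, required: true, index: true },
    'First Name': { type: String, required: true },
    'Last Name': { type: String },
    // rawData keeps the original webhook payload untouched.
    rawData: { type: mongoose.Schema.Types.Mixed },
  },
  { timestamps: true }
);

module.exports = mongoose.model('Visit', visitSchema);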

Scan Model

  • id: String, Required, Unique, Indexed.
  • ScanTime: Date, Required, Indexed.
  • Profile: String, Required, Indexed.
  • First Name: String, Required.
  • Last Name: String, Required.
  • Additional Data Points: Company, Title, Location, Industry, ConnectionDegree, ProfileUrl, etc.
  • rawData: Mixed type, stores the entire original webhook payload.

Data Normalization

Both models use pre-save hooks to normalize the raw DuxSoup payload from its flat key format into structured arrays, as illustrated below.

Raw Format Example:

{
  "Position-0-Company": "TechCorp",
  "Position-0-Title": "Engineer",
  "School-0-School": "Stanford",
  "Skill-0": "JavaScript"
}

Normalized Format Example:

{
  "positions": [
    { "company": "TechCorp", "title": "Engineer" }
  ],
  "schools": [
    { "school": "Stanford" }
  ],
  "skills": ["JavaScript"]
}
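
A sketch of the kind of pre-save hook that can perform this conversion. The key patterns follow the examples above; the project's actual hook may differ in detail:

// Collect flat Position-N-*, School-N-* and Skill-N keys into arrays.
function normalizeRawData(raw) {
  const positions = [];
  const schools = [];
  const skills = [];

  for (const [key, value] of Object.entries(raw)) {
    let m;
    if ((m = key.match(/^Position-(\d+)-(.+)$/))) {
      const i = Number(m[1]);
      positions[i] = positions[i] || {};
      positions[i][m[2].toLowerCase()] = value;
    } else if ((m = key.match(/^School-(\d+)-(.+)$/))) {
      const i = Number(m[1]);
      schools[i] = schools[i] || {};
      schools[i][m[2].toLowerCase()] = value;
    } else if ((m = key.match(/^Skill-(\d+)$/))) {
      skills[Number(m[1])] = value;
    }
  }

  return { positions, schools, skills };
}

// Illustrative wiring as a Mongoose pre-save hook:
// visitSchema.pre('save', function (next) {
//   Object.assign(this, normalizeRawData(this.rawData || {}));
//   next();
// });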

πŸ› οΈ Tech Stack

  • Runtime: Node.js 18+
  • Framework: Express.js
  • Database: MongoDB Atlas
  • ODM: Mongoose
  • Logging: Winston
  • Environment Variables: dotenv
  • CORS: cors
  • Deployment: Render
  • Development: Nodemon

Local Development

Prerequisites

  • Node.js 18+
  • MongoDB Atlas account
  • Git

Step-by-Step Setup

Clone Project and Install Dependencies:

git clone https://github.com/harehimself/duxsoup-etl.git
cd duxsoup-etl
npm install

Set Up MongoDB Atlas:

  1. Go to MongoDB Atlas and create a free cluster.
  2. Create a database user and whitelist IP addresses.
  3. Obtain your MongoDB connection string.

Configure Environment:

cp .env.example .env

Edit the .env file:

PORT=3000
MONGODB_URI=mongodb+srv://<user>:<pass>@cluster.mongodb.net/duxsoup-etl
NODE_ENV=development

Test Locally

npm run dev

Health Check:

curl http://localhost:3000/health

Test Visit Webhook:

curl -X POST -H "Content-Type: application/json" -d @examples/visit.json http://localhost:3000/api/webhook

Test Scan Webhook:

curl -X POST -H "Content-Type: application/json" -d @examples/scan.json http://localhost:3000/api/webhook

πŸ“ Project Structure

duxsoup-etl/
├── src/
│   ├── controllers/
│   ├── models/
│   ├── routes/
│   ├── utils/
│   └── index.js
├── examples/
├── .env.example
├── package.json
├── render.yaml
└── README.md

Deployment

Option 1: Render (Recommended)

  1. Push to GitHub.
  2. Connect Render to GitHub repo.
  3. Use render.yaml for config.
  4. Set MONGODB_URI in dashboard.

Option 2: Manual

npm install --production
npm start

Development

Available Scripts:

  • npm start: run the server.
  • npm run dev: run the server with Nodemon for auto-reload during development.

Adding New Fields:

  • Update models.
  • Adjust normalization logic.
  • Update controllers.
  • Test with payloads (see the sketch below).
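
As an illustrative sketch, suppose DuxSoup starts sending a flat Language-N field (a hypothetical example, not a real payload field):

// 1. Model: add a target path for the normalized values.
visitSchema.add({ languages: [String] });

// 2. Normalization: collect the flat keys inside the pre-save hook,
//    alongside the existing Skill-N handling:
//      if ((m = key.match(/^Language-(\d+)$/))) {
//        languages[Number(m[1])] = value;
//      }

// 3. Controllers: no change needed unless the field requires extra validation.
// 4. Test: POST a payload containing "Language-0": "English" to /api/webhook.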

Debugging

  • Check console logs (Winston).
  • Monitor MongoDB Atlas.

Troubleshooting

  • MongoDB Connection: Check URI, IP whitelist, user permissions.
  • Validation Errors: Confirm schema conformity.
  • Duplicate Key Errors: Check uniqueness of id (see the query sketch below).
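
When a duplicate key error appears, a quick mongosh lookup (assuming Mongoose's default collection names visits and scans) shows whether the id already exists:

// Run in mongosh against the duxsoup-etl database.
db.visits.findOne({ id: "visit_12345" })
db.scans.findOne({ id: "scan_67890" })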

Logs

  • Dev Logs: Written to the console.
  • Prod Logs: Winston can also write to log files (see the sketch below).
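
A minimal Winston setup along these lines (log file names and levels here are assumptions, not the project's exact configuration):

const winston = require('winston');

// Console output everywhere; file transports only in production.
const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

if (process.env.NODE_ENV === 'production') {
  logger.add(new winston.transports.File({ filename: 'error.log', level: 'error' }));
  logger.add(new winston.transports.File({ filename: 'combined.log' }));
}

module.exports = logger;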

Monitoring

  • GET /health
  • Check MongoDB connectivity.

Data Validation

  • Schema + controller-level validation.
  • Unique id fields.

Security Notes

  • Validate all inputs.
  • Use environment variables.
  • Enable CORS carefully.
  • Consider webhook authentication (a sketch follows below).
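
One way to add webhook authentication, not part of the current codebase, is a shared-secret middleware; WEBHOOK_SECRET and the header name below are assumptions:

// Illustrative shared-secret check for incoming webhooks.
function requireWebhookSecret(req, res, next) {
  const expected = process.env.WEBHOOK_SECRET;
  if (!expected || req.get('x-webhook-secret') !== expected) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  next();
}

// app.post('/api/webhook', requireWebhookSecret, webhookHandler);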

License

MIT License.