harehimself/duxsoup-etl

An ETL system that uses the DuxSoup API for programmatic LinkedIn extraction. The pipeline automatically retrieves detailed LinkedIn profile data from first-degree connections for network analysis and relationship-intelligence applications.


A production-ready LinkedIn extraction pipeline. The system processes DuxSoup profile extractions in real time, normalizes the data into structured MongoDB Atlas records, and routes scans and visits to dedicated handlers. It includes health checks, validation, and logging, and is built for background processing and extensibility, which makes it useful for lead enrichment and graph-based CRM models.





Table of Contents

  • Features
  • API Endpoints
  • Data Models
  • Data Normalization
  • Tech Stack
  • Local Development
  • Project Structure
  • Deployment
  • Development
  • Debugging
  • Troubleshooting
  • Logs
  • Monitoring
  • Data Validation
  • Security Notes
  • License

Features

  • Webhook Processing: Handles DuxSoup LinkedIn data via POST /api/webhook.
  • Type-Based Routing: Automatically routes scan vs. visit data to the appropriate handler based on the type field in the payload (see the sketch after this list).
  • Data Validation: Comprehensive validation of required fields, including a custom validator that ensures id is a non-empty string.
  • Error Handling: Robust error handling with detailed logging via Winston.
  • Production Ready: Designed for deployment on platforms like Render, with health monitoring.
  • MongoDB Storage: Persists processed data in MongoDB Atlas via Mongoose.
  • Extensible: New fields and normalization rules are straightforward to add.
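
A minimal sketch of how the type-based routing might look inside the webhook controller. It assumes express.json() body parsing; the handler names and response body are illustrative, not the project's actual identifiers:

// Illustrative sketch: route an incoming DuxSoup payload by its "type" field.
async function handleWebhook(req, res) {
  const payload = req.body;

  // Custom id check: must be a non-empty string.
  if (typeof payload.id !== 'string' || payload.id.trim() === '') {
    return res.status(400).json({ error: 'id must be a non-empty string' });
  }

  switch (payload.type) {
    case 'visit':
      await saveVisit(payload); // persist as a Visit document
      break;
    case 'scan':
      await saveScan(payload); // persist as a Scan document
      break;
    default:
      return res.status(400).json({ error: `Unknown type: ${payload.type}` });
  }

  return res.status(200).json({ status: 'received', id: payload.id });
}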

API Endpoints

POST /api/webhook

Main webhook endpoint for DuxSoup data processing. It expects a JSON payload with a type field to route the data correctly.

Request Body Examples:

Visit Data:

{
  "type": "visit",
  "id": "visit_12345",
  "VisitTime": "2025-06-06T10:30:00Z",
  "Profile": "[https://www.linkedin.com/in/johndoe/](https://www.linkedin.com/in/johndoe/)",
  "Degree": "MBA",
  "First Name": "John",
  "Last Name": "Doe",
  "Headline": "Senior Software Engineer at TechCorp",
  "Location": "San Francisco, CA",
  "Connections": "500+",
  "Position-0-Company": "TechCorp",
  "Position-0-Title": "Senior Software Engineer",
  "Position-0-StartDate": "Jan 2022",
  "Position-0-EndDate": "Present",
  "Position-0-Duration": "3 years 5 months",
  "Position-0-Location": "San Francisco, CA",
  "Position-0-Description": "Leading development of cloud-native applications",
  "Position-1-Company": "StartupXYZ",
  "Position-1-Title": "Software Engineer",
  "Position-1-StartDate": "Jun 2019",
  "Position-1-EndDate": "Dec 2021",
  "Position-1-Duration": "2 years 7 months",
  "Position-1-Location": "New York, NY",
  "School-0-School": "Stanford University",
  "School-0-Degree": "Master of Business Administration",
  "School-0-Field": "Technology Management",
  "School-0-StartYear": "2017",
  "School-0-EndYear": "2019",
  "School-1-School": "UC Berkeley",
  "School-1-Degree": "Bachelor of Science",
  "School-1-Field": "Computer Science",
  "School-1-StartYear": "2013",
  "School-1-EndYear": "2017",
  "Skill-0": "JavaScript",
  "Skill-1": "Node.js",
  "Skill-2": "React",
  "Skill-3": "MongoDB",
  "Skill-4": "AWS",
  "Summary": "Experienced software engineer with expertise in full-stack development",
  "Industry": "Technology"
}

Scan Data:

{
  "type": "scan",
  "id": "scan_67890",
  "ScanTime": "2025-06-06T14:15:00Z",
  "Profile": "[https://www.linkedin.com/in/janesmith/](https://www.linkedin.com/in/janesmith/)",
  "First Name": "Jane",
  "Last Name": "Smith",
  "Headline": "Marketing Director at BigCorp",
  "Location": "Chicago, IL",
  "Connections": "1000+",
  "Position-0-Company": "BigCorp",
  "Position-0-Title": "Marketing Director",
  "Position-0-StartDate": "Mar 2021",
  "Position-0-EndDate": "Present",
  "Position-0-Duration": "4 years 3 months",
  "Position-0-Location": "Chicago, IL",
  "Position-1-Company": "MidSize Inc",
  "Position-1-Title": "Senior Marketing Manager",
  "Position-1-StartDate": "Jan 2018",
  "Position-1-EndDate": "Feb 2021",
  "Position-1-Duration": "3 years 2 months",
  "Position-1-Location": "Detroit, MI",
  "School-0-School": "Northwestern University",
  "School-0-Degree": "Master of Marketing",
  "School-0-StartYear": "2016",
  "School-0-EndYear": "2018",
  "Skill-0": "Digital Marketing",
  "Skill-1": "Brand Management",
  "Skill-2": "Analytics",
  "Summary": "Experienced marketing professional",
  "Industry": "Marketing"
}

Responses:

On success the webhook responds with a JSON acknowledgment. Payloads that fail validation (for example, a missing or empty id) are rejected with an error response.

GET /health

Health check endpoint returning server and database status.

{
  "status": "ok",
  "database": {
    "isConnected": true,
    "readyState": 1,
    "host": "your_mongodb_host",
    "name": "duxsoup-etl"
  },
  "timestamp": "2025-06-06T..."
}

GET /api/test

Test endpoint for API verification.
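
Assuming the server is running locally on the port from .env.example, it can be exercised with:

curl http://localhost:3000/api/test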


Data Models

The application uses Mongoose to define schemas for Visit and Scan data, ensuring data integrity and structure. Both models include createdAt and updatedAt timestamps.

Visit Model

  • id: String, Required, Unique, Indexed.
  • VisitTime: Date, Required, Indexed.
  • Profile: String, Required, Indexed.
  • First Name: String, Required.
  • Last Name: String (Optional).
  • Additional Data Points: SalesProfile, RecruiterProfile, Picture, Middle Name, Connections, Summary, Title, From, Company, CompanyProfile, CompanyWebsite, PersonalWebsite, Email, Phone, IM, Twitter, Location, Industry, etc.
  • rawData: Mixed type, stores the entire original webhook payload (see the schema sketch below).
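
A minimal sketch of how such a schema might be declared with Mongoose, abbreviated to the fields listed above (the real model defines many more):

const mongoose = require('mongoose');

// Abbreviated Visit schema sketch; { timestamps: true } adds createdAt / updatedAt.
const visitSchema = new mongoose.Schema(
  {
    id: {
      type: String,
      required: true,
      unique: true,
      index: true,
      // Custom validator: id must be a non-empty string.
      validate: {
        validator: (v) => typeof v === 'string' && v.trim().length > 0,
        message: 'id must be a non-empty string',
      },
    },
    VisitTime: { type: Date, required: true, index: true },
    Profile: { type: String, required: true, index: true },
    'First Name': { type: String, required: true },
    'Last Name': { type: String },
    // rawData keeps the original webhook payload untouched.
    rawData: { type: mongoose.Schema.Types.Mixed },
  },
  { timestamps: true }
);

module.exports = mongoose.model('Visit', visitSchema);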

Scan Model

  • id: String, Required, Unique, Indexed.
  • ScanTime: Date, Required, Indexed.
  • Profile: String, Required, Indexed.
  • First Name: String, Required.
  • Last Name: String, Required.
  • Additional Data Points: Company, Title, Location, Industry, ConnectionDegree, ProfileUrl, etc.
  • rawData: Mixed type, stores the entire original webhook payload.

Data Normalization

Both models use pre-save hooks to normalize the raw DuxSoup payload from its flat key format into structured arrays, as illustrated below.

Raw Format Example:

{
  "Position-0-Company": "TechCorp",
  "Position-0-Title": "Engineer",
  "School-0-School": "Stanford",
  "Skill-0": "JavaScript"
}

Normalized Format Example:

{
  "positions": [
    { "company": "TechCorp", "title": "Engineer" }
  ],
  "schools": [
    { "school": "Stanford" }
  ],
  "skills": ["JavaScript"]
}
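
A sketch of the kind of pre-save hook that can perform this conversion. The key patterns follow the examples above; the project's actual hook may differ in detail:

// Collect flat Position-N-*, School-N-* and Skill-N keys into arrays.
function normalizeRawData(raw) {
  const positions = [];
  const schools = [];
  const skills = [];

  for (const [key, value] of Object.entries(raw)) {
    let m;
    if ((m = key.match(/^Position-(\d+)-(.+)$/))) {
      const i = Number(m[1]);
      positions[i] = positions[i] || {};
      positions[i][m[2].toLowerCase()] = value;
    } else if ((m = key.match(/^School-(\d+)-(.+)$/))) {
      const i = Number(m[1]);
      schools[i] = schools[i] || {};
      schools[i][m[2].toLowerCase()] = value;
    } else if ((m = key.match(/^Skill-(\d+)$/))) {
      skills[Number(m[1])] = value;
    }
  }

  return { positions, schools, skills };
}

// Illustrative wiring as a Mongoose pre-save hook:
// visitSchema.pre('save', function (next) {
//   Object.assign(this, normalizeRawData(this.rawData || {}));
//   next();
// });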

πŸ› οΈ Tech Stack

  • Runtime: Node.js 18+
  • Framework: Express.js
  • Database: MongoDB Atlas
  • ODM: Mongoose
  • Logging: Winston
  • Environment Variables: dotenv
  • CORS: cors
  • Deployment: Render
  • Development: Nodemon

Local Development

Prerequisites

  • Node.js 18+
  • MongoDB Atlas account
  • Git

Step-by-Step Setup

Clone Project and Install Dependencies:

git clone https://github.com/harehimself/duxsoup-etl.git
cd duxsoup-etl
npm install

Set Up MongoDB Atlas:

  1. Go to MongoDB Atlas and create a free cluster.
  2. Create a database user and whitelist IP addresses.
  3. Obtain your MongoDB connection string.

Configure Environment:

cp .env.example .env

Edit the .env file:

PORT=3000
MONGODB_URI=mongodb+srv://<user>:<pass>@cluster.mongodb.net/duxsoup-etl
NODE_ENV=development

Test Locally

npm run dev

Health Check:

curl http://localhost:3000/health

Test Visit Webhook:

curl -X POST -H "Content-Type: application/json" -d @examples/visit.json http://localhost:3000/api/webhook

Test Scan Webhook:

curl -X POST -H "Content-Type: application/json" -d @examples/scan.json http://localhost:3000/api/webhook

πŸ“ Project Structure

duxsoup-etl/
├── src/
│   ├── controllers/
│   ├── models/
│   ├── routes/
│   ├── utils/
│   └── index.js
├── examples/
├── .env.example
├── package.json
├── render.yaml
└── README.md

Deployment

Option 1: Render (Recommended)

  1. Push to GitHub.
  2. Connect Render to GitHub repo.
  3. Use render.yaml for config.
  4. Set MONGODB_URI in dashboard.

Option 2: Manual

npm install --production
npm start

Development

Available Scripts:

  • npm start: run the server.
  • npm run dev: run the server with Nodemon for auto-reload during development.

Adding New Fields:

  • Update models.
  • Adjust normalization logic.
  • Update controllers.
  • Test with payloads (see the sketch below).
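
As an illustrative sketch, suppose DuxSoup starts sending a flat Language-N field (a hypothetical example, not a real payload field):

// 1. Model: add a target path for the normalized values.
visitSchema.add({ languages: [String] });

// 2. Normalization: collect the flat keys inside the pre-save hook,
//    alongside the existing Skill-N handling:
//      if ((m = key.match(/^Language-(\d+)$/))) {
//        languages[Number(m[1])] = value;
//      }

// 3. Controllers: no change needed unless the field requires extra validation.
// 4. Test: POST a payload containing "Language-0": "English" to /api/webhook.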

Debugging

  • Check console logs (Winston).
  • Monitor MongoDB Atlas.

Troubleshooting

  • MongoDB Connection: Check URI, IP whitelist, user permissions.
  • Validation Errors: Confirm schema conformity.
  • Duplicate Key Errors: Check uniqueness of id (see the query sketch below).
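
When a duplicate key error appears, a quick mongosh lookup (assuming Mongoose's default collection names visits and scans) shows whether the id already exists:

// Run in mongosh against the duxsoup-etl database.
db.visits.findOne({ id: "visit_12345" })
db.scans.findOne({ id: "scan_67890" })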

Logs

  • Dev Logs: Written to the console.
  • Prod Logs: Winston can also write to log files (see the sketch below).
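
A minimal Winston setup along these lines (log file names and levels here are assumptions, not the project's exact configuration):

const winston = require('winston');

// Console output everywhere; file transports only in production.
const logger = winston.createLogger({
  level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

if (process.env.NODE_ENV === 'production') {
  logger.add(new winston.transports.File({ filename: 'error.log', level: 'error' }));
  logger.add(new winston.transports.File({ filename: 'combined.log' }));
}

module.exports = logger;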

Monitoring

  • GET /health
  • Check MongoDB connectivity.

Data Validation

  • Schema + controller-level validation.
  • Unique id fields.

Security Notes

  • Validate all inputs.
  • Use environment variables.
  • Enable CORS carefully.
  • Consider webhook authentication (a sketch follows below).
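
One way to add webhook authentication, not part of the current codebase, is a shared-secret middleware; WEBHOOK_SECRET and the header name below are assumptions:

// Illustrative shared-secret check for incoming webhooks.
function requireWebhookSecret(req, res, next) {
  const expected = process.env.WEBHOOK_SECRET;
  if (!expected || req.get('x-webhook-secret') !== expected) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  next();
}

// app.post('/api/webhook', requireWebhookSecret, webhookHandler);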

License

MIT License.