A production-ready LinkedIn extraction pipeline. The system processes automated LinkedIn profile extractions from DuxSoup in real time, normalizes profile data into structured MongoDB Atlas records, and differentiates between scans and visits with custom routing. It includes health checks, validation, and logging, and is built for background processing and extensibility. Useful for lead enrichment and graph-based CRM models.
- Webhook Processing: Handles DuxSoup LinkedIn data via `POST /api/webhook`.
- Type-Based Routing: Automatically routes scan vs. visit data to the appropriate handler based on the `type` field in the payload.
- Data Validation: Comprehensive validation for required fields, including a custom validator for the `id` field to ensure it is a non-empty string.
- Error Handling: Robust error handling with detailed logging via Winston.
- Production Ready: Designed for deployment on platforms like Render with health monitoring.
- Extensible: Easy to extend with additional fields and normalization logic.
- MongoDB Storage: Integrates with MongoDB using Mongoose to persist processed data.
`POST /api/webhook` is the main webhook endpoint for DuxSoup data processing. It expects a JSON payload with a `type` field so the data can be routed correctly.
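A minimal sketch of what this routing might look like in Express; `handleVisit` and `handleScan` are illustrative stand-ins for the repo's actual controllers:

```javascript
const express = require('express');
const router = express.Router();

// Illustrative stand-ins for the real visit/scan controllers.
async function handleVisit(req, res) {
  // ...persist a Visit document, then acknowledge...
  res.json({ status: 'ok', type: 'visit', id: req.body.id });
}

async function handleScan(req, res) {
  res.json({ status: 'ok', type: 'scan', id: req.body.id });
}

router.post('/api/webhook', express.json(), async (req, res) => {
  const { type, id } = req.body || {};

  // Custom validator: "id" must be a non-empty string.
  if (typeof id !== 'string' || id.trim() === '') {
    return res.status(400).json({ error: '"id" must be a non-empty string' });
  }

  // Route scan vs. visit payloads to the appropriate handler.
  switch (type) {
    case 'visit':
      return handleVisit(req, res);
    case 'scan':
      return handleScan(req, res);
    default:
      return res.status(400).json({ error: `Unsupported type: ${type}` });
  }
});

module.exports = router;
```

Rejecting unknown `type` values early keeps malformed payloads out of the handlers.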
Request Body Examples:
Visit Data:
```json
{
"type": "visit",
"id": "visit_12345",
"VisitTime": "2025-06-06T10:30:00Z",
"Profile": "[https://www.linkedin.com/in/johndoe/](https://www.linkedin.com/in/johndoe/)",
"Degree": "MBA",
"First Name": "John",
"Last Name": "Doe",
"Headline": "Senior Software Engineer at TechCorp",
"Location": "San Francisco, CA",
"Connections": "500+",
"Position-0-Company": "TechCorp",
"Position-0-Title": "Senior Software Engineer",
"Position-0-StartDate": "Jan 2022",
"Position-0-EndDate": "Present",
"Position-0-Duration": "3 years 5 months",
"Position-0-Location": "San Francisco, CA",
"Position-0-Description": "Leading development of cloud-native applications",
"Position-1-Company": "StartupXYZ",
"Position-1-Title": "Software Engineer",
"Position-1-StartDate": "Jun 2019",
"Position-1-EndDate": "Dec 2021",
"Position-1-Duration": "2 years 7 months",
"Position-1-Location": "New York, NY",
"School-0-School": "Stanford University",
"School-0-Degree": "Master of Business Administration",
"School-0-Field": "Technology Management",
"School-0-StartYear": "2017",
"School-0-EndYear": "2019",
"School-1-School": "UC Berkeley",
"School-1-Degree": "Bachelor of Science",
"School-1-Field": "Computer Science",
"School-1-StartYear": "2013",
"School-1-EndYear": "2017",
"Skill-0": "JavaScript",
"Skill-1": "Node.js",
"Skill-2": "React",
"Skill-3": "MongoDB",
"Skill-4": "AWS",
"Summary": "Experienced software engineer with expertise in full-stack development",
"Industry": "Technology"
}
```
Scan Data:
```json
{
"type": "scan",
"id": "scan_67890",
"ScanTime": "2025-06-06T14:15:00Z",
"Profile": "[https://www.linkedin.com/in/janesmith/](https://www.linkedin.com/in/janesmith/)",
"First Name": "Jane",
"Last Name": "Smith",
"Headline": "Marketing Director at BigCorp",
"Location": "Chicago, IL",
"Connections": "1000+",
"Position-0-Company": "BigCorp",
"Position-0-Title": "Marketing Director",
"Position-0-StartDate": "Mar 2021",
"Position-0-EndDate": "Present",
"Position-0-Duration": "4 years 3 months",
"Position-0-Location": "Chicago, IL",
"Position-1-Company": "MidSize Inc",
"Position-1-Title": "Senior Marketing Manager",
"Position-1-StartDate": "Jan 2018",
"Position-1-EndDate": "Feb 2021",
"Position-1-Duration": "3 years 2 months",
"Position-1-Location": "Detroit, MI",
"School-0-School": "Northwestern University",
"School-0-Degree": "Master of Marketing",
"School-0-StartYear": "2016",
"School-0-EndYear": "2018",
"Skill-0": "Digital Marketing",
"Skill-1": "Brand Management",
"Skill-2": "Analytics",
"Summary": "Experienced marketing professional",
"Industry": "Marketing"
}
```
`GET /health`: Health check endpoint returning server and database status.

Response:
```json
{
"status": "ok",
"database": {
"isConnected": true,
"readyState": 1,
"host": "your_mongodb_host",
"name": "duxsoup-etl"
},
"timestamp": "2025-06-06T..."
}
```
Test endpoint for API verification.
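For reference, a minimal sketch of how the `/health` handler might assemble the response shown above from Mongoose's connection state; the repo's actual implementation may differ:

```javascript
const express = require('express');
const mongoose = require('mongoose');

const app = express(); // (server bootstrap omitted)

app.get('/health', (req, res) => {
  const conn = mongoose.connection;
  res.json({
    status: 'ok',
    database: {
      // In Mongoose, readyState 1 means "connected".
      isConnected: conn.readyState === 1,
      readyState: conn.readyState,
      host: conn.host,
      name: conn.name,
    },
    timestamp: new Date().toISOString(),
  });
});
```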
The application uses Mongoose to define schemas for Visit and Scan data, ensuring data integrity and structure. Both models include `createdAt` and `updatedAt` timestamps.
Visit Model:
- `id`: String, Required, Unique, Indexed.
- `VisitTime`: Date, Required, Indexed.
- `Profile`: String, Required, Indexed.
- `First Name`: String, Required.
- `Last Name`: String (Optional).
- Additional Data Points: `SalesProfile`, `RecruiterProfile`, `Picture`, `Middle Name`, `Connections`, `Summary`, `Title`, `From`, `Company`, `CompanyProfile`, `CompanyWebsite`, `PersonalWebsite`, `Email`, `Phone`, `IM`, `Twitter`, `Location`, `Industry`, etc.
- `rawData`: Mixed type, stores the entire original webhook payload.
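A sketch of how these fields might be declared with Mongoose; only a subset of the optional fields is shown, and the exact options are assumptions rather than the repo's actual code:

```javascript
const mongoose = require('mongoose');

const visitSchema = new mongoose.Schema(
  {
    // "unique: true" also creates the index backing the uniqueness check.
    id: { type: String, required: true, unique: true },
    VisitTime: { type: Date, required: true, index: true },
    Profile: { type: String, required: true, index: true },
    'First Name': { type: String, required: true },
    'Last Name': String,
    // A few of the optional data points; the full model carries many more.
    Company: String,
    Location: String,
    Industry: String,
    // Entire original webhook payload, kept for reprocessing.
    rawData: mongoose.Schema.Types.Mixed,
  },
  { timestamps: true } // adds createdAt / updatedAt
);

module.exports = mongoose.model('Visit', visitSchema);
```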
Scan Model:
- `id`: String, Required, Unique, Indexed.
- `ScanTime`: Date, Required, Indexed.
- `Profile`: String, Required, Indexed.
- `First Name`: String, Required.
- `Last Name`: String, Required.
- Additional Data Points: `Company`, `Title`, `Location`, `Industry`, `ConnectionDegree`, `ProfileUrl`, etc.
- `rawData`: Mixed type, stores the entire original webhook payload.
Both models use pre-save hooks to normalize raw DuxSoup data from a flat format to structured arrays; a sketch of the transformation follows the examples below.
Raw Format Example:
```json
{
"Position-0-Company": "TechCorp",
"Position-0-Title": "Engineer",
"School-0-School": "Stanford",
"Skill-0": "JavaScript"
}
```
Normalized Format Example:
```json
{
"positions": [
{ "company": "TechCorp", "title": "Engineer" }
],
"schools": [
{ "school": "Stanford" }
],
"skills": ["JavaScript"]
}
```
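A self-contained sketch of that flat-to-structured transformation; the helper names are illustrative, not the project's actual code:

```javascript
// Collect "Prefix-N-Field" keys into an array of objects, e.g.
// "Position-0-Company" -> positions[0].company.
function collectIndexed(raw, prefix) {
  const out = [];
  for (const [key, value] of Object.entries(raw)) {
    const match = key.match(new RegExp(`^${prefix}-(\\d+)-(.+)$`));
    if (!match) continue;
    const idx = Number(match[1]);
    const field = match[2].charAt(0).toLowerCase() + match[2].slice(1);
    out[idx] = out[idx] || {};
    out[idx][field] = value;
  }
  return out.filter(Boolean);
}

// Collect "Skill-N" keys into an ordered string array.
function collectSkills(raw) {
  return Object.keys(raw)
    .filter((key) => /^Skill-\d+$/.test(key))
    .sort((a, b) => Number(a.split('-')[1]) - Number(b.split('-')[1]))
    .map((key) => raw[key]);
}

function normalize(raw) {
  return {
    positions: collectIndexed(raw, 'Position'),
    schools: collectIndexed(raw, 'School'),
    skills: collectSkills(raw),
  };
}

// Reproduces the example above:
console.log(normalize({
  'Position-0-Company': 'TechCorp',
  'Position-0-Title': 'Engineer',
  'School-0-School': 'Stanford',
  'Skill-0': 'JavaScript',
}));
// -> { positions: [ { company: 'TechCorp', title: 'Engineer' } ],
//      schools: [ { school: 'Stanford' } ], skills: [ 'JavaScript' ] }
```

In the models themselves, this logic would run inside the `pre('save')` hooks so documents are normalized before they are persisted.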
- Runtime: Node.js 18+
- Framework: Express.js
- Database: MongoDB Atlas
- ODM: Mongoose
- Logging: Winston
- Environment Variables: Dotenv
- CORS: Cors
- Deployment: Render
- Development: Nodemon
- Node.js 18+
- MongoDB Atlas account
- Git
Clone Project and Install Dependencies:
```bash
git clone https://github.com/harehimself/duxsoup-etl.git
cd duxsoup-etl
npm install
```
Set Up MongoDB Atlas:
- Go to MongoDB Atlas and create a free cluster.
- Create a database user and whitelist IP addresses.
- Obtain your MongoDB connection string.
Configure Environment:
```bash
cp .env.example .env
```

Edit the `.env` file:
```
PORT=3000
MONGODB_URI=mongodb+srv://<user>:<pass>@cluster.mongodb.net/duxsoup-etl
NODE_ENV=development
```
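A minimal sketch of how the app might consume these variables with Dotenv and Mongoose; the repo's actual connection code may differ:

```javascript
require('dotenv').config();
const mongoose = require('mongoose');

async function connectToDatabase() {
  try {
    await mongoose.connect(process.env.MONGODB_URI);
    console.log('MongoDB connected:', mongoose.connection.name);
  } catch (err) {
    // Fail fast: the pipeline is useless without its datastore.
    console.error('MongoDB connection failed:', err.message);
    process.exit(1);
  }
}

connectToDatabase();
```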
Test Locally:

```bash
npm run dev
```
Health Check:
```bash
curl http://localhost:3000/health
```
Test Visit Webhook:
```bash
curl -X POST -H "Content-Type: application/json" -d @examples/visit.json http://localhost:3000/api/webhook
```
Test Scan Webhook:
```bash
curl -X POST -H "Content-Type: application/json" -d @examples/scan.json http://localhost:3000/api/webhook
```
```
duxsoup-etl/
├── src/
│   ├── controllers/
│   ├── models/
│   ├── routes/
│   ├── utils/
│   └── index.js
├── examples/
├── .env.example
├── package.json
├── render.yaml
└── README.md
```
- Push to GitHub.
- Connect Render to the GitHub repo.
- Use `render.yaml` for config.
- Set `MONGODB_URI` in the dashboard.
```bash
npm install --production
npm start
```
Available Scripts:
- `npm start`: Start the server (production).
- `npm run dev`: Start with Nodemon for development (auto-restarts on changes).
Adding New Fields (see the sketch after these steps):
- Update models.
- Adjust normalization logic.
- Update controllers.
- Test with payloads.
- Check console logs (Winston).
- Monitor MongoDB Atlas.
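To make the "Adding New Fields" steps concrete, here is a sketch of what updating the model and normalization logic might look like; `Followers` is an invented example field and every name here is an assumption:

```javascript
const mongoose = require('mongoose');

// 1. Update the model: declare the new field alongside the existing ones.
const visitSchema = new mongoose.Schema(
  {
    id: { type: String, required: true, unique: true },
    Followers: String, // the new (invented) field
    rawData: mongoose.Schema.Types.Mixed,
  },
  { timestamps: true }
);

// 2. Adjust the normalization logic: lift the value out of the raw payload.
visitSchema.pre('save', function (next) {
  if (this.rawData && this.rawData.Followers != null) {
    this.Followers = this.rawData.Followers;
  }
  next();
});
```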
- MongoDB Connection: Check URI, IP whitelist, user permissions.
- Validation Errors: Confirm schema conformity.
- Duplicate Key Errors: Check uniqueness of `id`.
- Dev Logs: Console.
- Prod Logs: Use Winston files.
- `GET /health`: Check MongoDB connectivity.
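A minimal Winston setup along those lines: console output in development, file transports in production. The filenames are assumptions:

```javascript
const winston = require('winston');

const isProduction = process.env.NODE_ENV === 'production';

const logger = winston.createLogger({
  level: isProduction ? 'info' : 'debug',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: isProduction
    ? [
        new winston.transports.File({ filename: 'error.log', level: 'error' }),
        new winston.transports.File({ filename: 'combined.log' }),
      ]
    : [new winston.transports.Console({ format: winston.format.simple() })],
});

module.exports = logger;
```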
- Schema + controller-level validation.
- Unique `id` fields.
- Validate all inputs.
- Use environment variables.
- Enable CORS carefully.
- Consider webhook authentication (one possible approach is sketched below).
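One possible shape for that last point, as a sketch: require a shared-secret header on the webhook route. The header name and environment variable are assumptions, not project code:

```javascript
// Reject webhook calls that don't carry the expected shared secret.
function requireWebhookSecret(req, res, next) {
  const provided = req.get('x-webhook-secret');
  if (!provided || provided !== process.env.WEBHOOK_SECRET) {
    return res.status(401).json({ error: 'Unauthorized' });
  }
  next();
}

// Usage: app.post('/api/webhook', requireWebhookSecret, webhookHandler);
```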