# Web Scraping

In [36]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://job-boards.greenhouse.io/gomotive/jobs/8355403002")
data = loader.load().pop().page_content
# print(data)

In [37]:
print(data)

Job Application for Data Annotator at MotiveBack to jobsData AnnotatorMexico City, MexicoApplyWho we are:
Motive empowers the people who run physical operations with tools to make their work safer, more productive, and more profitable. For the first time ever, safety, operations and finance teams can manage their drivers, vehicles, equipment, and fleet related spend in a single system. Combined with industry leading AI, the Motive platform gives you complete visibility and control, and significantly reduces manual workloads by automating and simplifying tasks.
Motive serves nearly 100,000 customers – from Fortune 500 enterprises to small businesses – across a wide range of industries, including transportation and logistics, construction, energy, field service, manufacturing, agriculture, food and beverage, retail, and the public sector.
Visit gomotive.com to learn more.About the Role:
Being customer obsessed and data driven is what drives this role. We’re looking for individuals who ca

# Retrieve Information

In [14]:
from dotenv import load_dotenv
import os
from langchain_groq import ChatGroq

load_dotenv()

groq_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(
    groq_api_key=groq_key,
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    temperature=0.1,
    max_retries=2
)
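If GROQ_API_KEY is missing from the environment, os.getenv returns None and the failure only surfaces later, when the client is built or called. A minimal sketch of an earlier, clearer check follows; the helper name require_env is hypothetical and not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Hypothetical usage in place of the bare os.getenv call above:
# groq_key = require_env("GROQ_API_KEY")
```

Calling this right after load_dotenv() would make a missing key fail at the top of the notebook rather than at the first request.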



**Building the Prompt for the LLM**

In [38]:
from langchain_core.prompts import PromptTemplate

prompt_Extract = PromptTemplate.from_template(
  """
  You are a job posting parser. Extract information from the job posting and return ONLY valid JSON with no explanations or markdown.

  **Required JSON format:**
{{
  "role": "job title",
  "company": "company name or null",
  "location": "location or null",
  "experience": {{
    "minimum_years": "number or null",
    "level": "Entry/Junior/Mid/Senior/Lead or null"
  }},
  "skills": {{
    "required": ["skill1", "skill2"],
    "preferred": ["skill1", "skill2"]
  }},
  "description": {{
    "summary": "brief overview",
    "responsibilities": ["duty1", "duty2"],
    "qualifications": ["requirement1", "requirement2"]
  }},
  "salary_range": "salary text or null",
  "job_type": "Full-time/Part-time/Contract/Internship or null"
}}

**Rules:**
- Use null for missing information, not empty strings
- Extract years from phrases like "3+ years", "5-7 years"
- Separate required vs preferred skills based on context
- List responsibilities and qualifications as separate items
- Return ONLY the JSON object

**Job posting text:**
{page_data}
    """
)

chain = prompt_Extract | llm
response = chain.invoke(input={"page_data": data})
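The doubled braces in the template above are there because single braces mark input variables such as page_data, so literal JSON braces have to be escaped. The sketch below illustrates the same convention with plain str.format from the standard library; it is an analogy for how the escaping behaves, not a reproduction of the prompt class internals, and the shortened template is made up for illustration.

```python
# Doubled braces render as literal braces; single braces are placeholders.
template = '{{"role": "job title"}}\nJob posting text:\n{page_data}'

rendered = template.format(page_data="Data Annotator at Motive")
print(rendered)
```

Forgetting to double a brace in the JSON skeleton would typically surface as a missing-variable error when the template is formatted.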

In [42]:
print(response.content)

```
{
  "role": "Data Annotator",
  "company": "Motive",
  "location": "Mexico City, Mexico",
  "experience": {
    "minimum_years": null,
    "level": null
  },
  "skills": {
    "required": [
      "detail-oriented",
      "organized",
      "good communication skills",
      "ability to work collaboratively"
    ],
    "preferred": [
      "Bachelors in Computer Science",
      "IT",
      "Management",
      "knowledge of AI/ML"
    ]
  },
  "description": {
    "summary": "Annotate live videos for risk identification and support Machine Learning models",
    "responsibilities": [
      "Review and analyze live videos for risk identification",
      "Adhere to established guidelines and protocols",
      "Ensure timely delivery of accurate data",
      "Enhance annotation procedures"
    ],
    "qualifications": [
      "Bachelors in Computer Science, IT, Management or related field",
      "Detail-oriented and organized",
      "Growth mindset",
      "Customer-centric approach"
```

In [41]:
print(type(response.content))

<class 'str'>


**The response content is a str, so it must be parsed into a Python dict**

In [43]:
from langchain_core.output_parsers import JsonOutputParser
json_parser = JsonOutputParser()
json_response = json_parser.parse(response.content)
print(type(json_response))

<class 'dict'>
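The printed response earlier was wrapped in a Markdown code fence even though the prompt asked for bare JSON, and the parser in the cell above still returned a dict. For cases where only the standard library is available, a rough equivalent might look like the sketch below. The function name parse_fenced_json is hypothetical, and the regular expression assumes at most one fenced block in the text.

```python
import json
import re


def parse_fenced_json(text: str) -> dict:
    """Strip an optional Markdown code fence, then parse the JSON inside."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


sample = '```\n{"role": "Data Annotator", "company": "Motive"}\n```'
parsed = parse_fenced_json(sample)
print(type(parsed), parsed["role"])
```

json.loads raises json.JSONDecodeError on malformed input, which is a reasonable place to catch a response that is not valid JSON.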


In [48]:
import csv

def read_csv_file(file_path):
    """Read (skills, project_link) pairs from a CSV file with a header row."""
    data = []
    # newline='' is the csv module's recommended way to open files
    with open(file_path, 'r', encoding='utf-8', newline='') as file:
        csv_reader = csv.reader(file)
        # Read the header row; an empty file raises StopIteration
        try:
            header = next(csv_reader)
            print(f"Header found: {header}")
        except StopIteration:
            print("Warning: CSV file is empty!")
            return data

        for row in csv_reader:
            if not row:
                continue  # Skip blank lines, which would break row[-1]
            # Every column except the last holds skills; the last holds the project link
            skills = tuple(row[:-1])
            project_link = row[-1]
            data.append((skills, project_link))
    return data


In [50]:
# Test the function


file_path = 'sample_links.csv' 
data = read_csv_file(file_path) 

for skills, project_link in data:
    print(f"Skills: {skills}, Project Link: {project_link}")

Header found: ['Techstack', 'Links']
Skills: ('React, Node.js, MongoDB, Express.js',), Project Link: https://github.com/johndoe/mern-ecommerce-platform
Skills: ('Python, Pandas, NumPy, Matplotlib, Jupyter',), Project Link: https://github.com/johndoe/sales-data-analysis
Skills: ('Python, Flask, SQLAlchemy, PostgreSQL, Bootstrap',), Project Link: https://github.com/johndoe/flask-blog-application
Skills: ('Python, Django, REST API, JWT, PostgreSQL',), Project Link: https://github.com/johndoe/django-task-manager-api
Skills: ('Python, Scikit-learn, Pandas, NumPy, Streamlit',), Project Link: https://github.com/johndoe/ml-house-price-predictor
Skills: ('JavaScript, React, Redux, Tailwind CSS, Firebase',), Project Link: https://github.com/johndoe/react-todo-app-firebase
Skills: ('Python, TensorFlow, Keras, OpenCV, NumPy',), Project Link: https://github.com/johndoe/image-classification-cnn
Skills: ('Node.js, Express.js, MongoDB, JWT, Bcrypt',), Project Link: https://github.com/johndoe/nodejs-au
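In the output above, each row's skills arrive as a one-element tuple holding a single comma-separated string, because the whole tech stack sits in one CSV column. If individual skills are wanted, one possible variant is sketched below. The function name read_portfolio_rows is hypothetical, the column names come from the printed header, and the quoting of the sample row is an assumption about how the file is written.

```python
import csv
import io


def read_portfolio_rows(csv_text: str) -> list[tuple[list[str], str]]:
    """Parse CSV text with Techstack and Links columns into (skills, link) pairs."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Split the single Techstack cell into individual, trimmed skill names
        skills = [s.strip() for s in record["Techstack"].split(",") if s.strip()]
        rows.append((skills, record["Links"].strip()))
    return rows


sample_csv = (
    "Techstack,Links\n"
    '"React, Node.js, MongoDB, Express.js",'
    "https://github.com/johndoe/mern-ecommerce-platform\n"
)
for skills, link in read_portfolio_rows(sample_csv):
    print(skills, link)
```

Reading by column name rather than by position also keeps the code working if more columns are added to the file later.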

In [51]:
import uuid
import chromadb

client = chromadb.PersistentClient("VectorDB")
collection = client.get_or_create_collection(name="job_postings")

# Populate the collection only on the first run, when it is still empty
if not collection.count():
    for skills, project_url in data:
        collection.add(
            documents=[str(skills)],
            metadatas=[{'portfolio_url': project_url}],
            ids=[str(uuid.uuid4())]
        )
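The loop above adds one record per call and stores the tuple's string form as the document text. A possible alternative is to build all records up front and add them in one batch, with IDs derived from the URL so that re-running the cell cannot create duplicates. The helper build_add_payload below is a hypothetical sketch; it only prepares the lists and does not touch the database.

```python
import uuid


def build_add_payload(rows):
    """Turn (skills, url) pairs into parallel lists for a single batched add call."""
    documents, metadatas, ids = [], [], []
    for skills, url in rows:
        # Join the skills into plain text instead of storing the tuple's repr
        documents.append(", ".join(skills))
        metadatas.append({"portfolio_url": url})
        # uuid5 is deterministic: the same URL always maps to the same ID
        ids.append(str(uuid.uuid5(uuid.NAMESPACE_URL, url)))
    return {"documents": documents, "metadatas": metadatas, "ids": ids}


payload = build_add_payload(
    [(("Python, Flask, SQLAlchemy",), "https://github.com/johndoe/flask-blog-application")]
)
# With the collection from the cell above, this could be passed as:
# collection.add(**payload)
```

Note that switching to plain-text documents would change the stored strings, so the matches printed later in this notebook would no longer show the tuple formatting.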

In [52]:
json_response["skills"]

{'required': ['detail-oriented',
  'organized',
  'good communication skills',
  'ability to work collaboratively'],
 'preferred': ['Bachelors in Computer Science',
  'IT',
  'Management',
  'knowledge of AI/ML']}

In [64]:
json_response["description"]

{'summary': 'Annotate live videos for risk identification and support Machine Learning models',
 'responsibilities': ['Review and analyze live videos for risk identification',
  'Adhere to established guidelines and protocols',
  'Ensure timely delivery of accurate data',
  'Enhance annotation procedures'],
 'qualifications': ['Bachelors in Computer Science, IT, Management or related field',
  'Detail-oriented and organized',
  'Growth mindset',
  'Customer-centric approach']}

In [65]:
# Extract all skills
required_skills = json_response['skills'].get('required', [])
preferred_skills = json_response['skills'].get('preferred', [])
all_skills = required_skills + preferred_skills

# Extract description components
summary = json_response['description'].get('summary', '')
responsibilities = json_response['description'].get('responsibilities', [])
qualifications = json_response['description'].get('qualifications', [])

# Combine skills (main priority)
skills_text = ", ".join(all_skills)

# Combine responsibilities and qualifications
responsibilities_text = ". ".join(responsibilities) if responsibilities else ""
qualifications_text = ". ".join(qualifications) if qualifications else ""

# Create query string with skills prioritized
query_string = f"{skills_text}. {summary}. {responsibilities_text}. {qualifications_text}"

# Query ChromaDB
portfolio_urls = collection.query(
    query_texts=[query_string],
    n_results=2
)

# Display results nicely
print("Query used for matching:")
print(query_string)
print("\n" + "="*80)
print("Top matching portfolios:\n")

if portfolio_urls['documents'][0]:
    for i in range(len(portfolio_urls['documents'][0])):
        print(f"{i+1}. Skills: {portfolio_urls['documents'][0][i]}")
        print(f"   Metadata: {portfolio_urls['metadatas'][0][i]}")
        print(f"   Distance: {portfolio_urls['distances'][0][i]:.4f}")
        print("-"*80)
else:
    print("No matching portfolios found!")

Query used for matching:
detail-oriented, organized, good communication skills, ability to work collaboratively, Bachelors in Computer Science, IT, Management, knowledge of AI/ML. Annotate live videos for risk identification and support Machine Learning models. Review and analyze live videos for risk identification. Adhere to established guidelines and protocols. Ensure timely delivery of accurate data. Enhance annotation procedures. Bachelors in Computer Science, IT, Management or related field. Detail-oriented and organized. Growth mindset. Customer-centric approach

Top matching portfolios:

1. Skills: ('Python, Scikit-learn, Pandas, NumPy, Streamlit',)
   Metadata: {'portfolio_url': 'https://github.com/johndoe/ml-house-price-predictor'}
   Distance: 1.5133
--------------------------------------------------------------------------------
2. Skills: ('Python, TensorFlow, Keras, OpenCV, NumPy',)
   Metadata: {'portfolio_url': 'https://github.com/johndoe/image-classification-cnn'}
   Di

{'ids': [['95c30c13-a783-4a25-b284-4cf45eb52df9', 'ad9036ee-4cea-4e94-abbe-ac4676d7700c']], 'embeddings': None, 'documents': [["('Python, Scikit-learn, Pandas, NumPy, Streamlit',)", "('Python, TensorFlow, Keras, OpenCV, NumPy',)"]], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'portfolio_url': 'https://github.com/johndoe/ml-house-price-predictor'}, {'portfolio_url': 'https://github.com/johndoe/image-classification-cnn'}]], 'distances': [[1.5056747198104858, 1.5315231084823608]]}
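The query result is a dict of nested lists, with one inner list per query text, which is why the code indexes with [0] everywhere. A small helper can flatten that shape into one record per match. The sketch below assumes the result layout shown in the output above and the portfolio_url metadata key used when the collection was populated; flatten_query_result is a hypothetical name.

```python
def flatten_query_result(result: dict) -> list[dict]:
    """Flatten the nested lists for the first query text into one record per match."""
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    return [
        {"skills": doc, "url": meta.get("portfolio_url"), "distance": dist}
        for doc, meta, dist in zip(documents, metadatas, distances)
    ]


sample_result = {
    "documents": [["('Python, Scikit-learn, Pandas, NumPy, Streamlit',)"]],
    "metadatas": [[{"portfolio_url": "https://github.com/johndoe/ml-house-price-predictor"}]],
    "distances": [[1.5056747198104858]],
}
for match in flatten_query_result(sample_result):
    print(f"{match['url']} (distance {match['distance']:.4f})")
```

Working with flat records keeps the indexing in one place, so the display loop and the email step can both iterate over the same list.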


In [70]:
# Extract skills and build query
required_skills = json_response['skills'].get('required', [])
preferred_skills = json_response['skills'].get('preferred', [])
all_skills = required_skills + preferred_skills

# Extract description
summary = json_response['description'].get('summary', '')
responsibilities = json_response['description'].get('responsibilities', [])
qualifications = json_response['description'].get('qualifications', [])

# Build query string
skills_text = ", ".join(all_skills)
responsibilities_text = ". ".join(responsibilities) if responsibilities else ""
qualifications_text = ". ".join(qualifications) if qualifications else ""
query_string = f"{skills_text}. {summary}. {responsibilities_text}. {qualifications_text}"

# Query ChromaDB for matching portfolios
portfolio_urls = collection.query(
    query_texts=[query_string],
    n_results=2
)

# Extract portfolio links from the query metadata
portfolio_links = []
if portfolio_urls['documents'][0]:
    for i in range(len(portfolio_urls['documents'][0])):
        skills = portfolio_urls['documents'][0][i]
        metadata = portfolio_urls['metadatas'][0][i]

        # The collection stores each link under the 'portfolio_url' key
        link = metadata.get('portfolio_url') or "Link not found"

        portfolio_links.append(f"- {skills}: {link}")

# Format portfolio URLs as a string
portfolio_urls_text = "\n".join(portfolio_links) if portfolio_links else "No relevant portfolios found"

# Create job description text
job_description = f"""
Role: {json_response.get('role', 'Not specified')}
Company: {json_response.get('company', 'Not specified')}
Location: {json_response.get('location', 'Not specified')}

Summary: {summary}

Key Responsibilities:
{chr(10).join([f"- {r}" for r in responsibilities])}

Required Skills:
{chr(10).join([f"- {s}" for s in required_skills])}

Qualifications:
{chr(10).join([f"- {q}" for q in qualifications])}
"""

# Email generation prompt
email_prompt = PromptTemplate.from_template(
    """
    I will give you a role and a task that you have to perform in that specific role.
    Your Role: Your name is Sahil. You are an incredible business development officer who knows how to get clients. You work for X Consulting firm; your firm works with all sorts of IT clients and provides solutions in the domain of Data Science and AI.
    X AI focuses on efficient, tailored solutions for all clients while keeping costs down.
    Your Job: Your job is to write cold emails to clients regarding the job openings that they have advertised. Try to pitch your clients with an email hook that opens a conversation about the possibility of working with them. Add the most relevant portfolio URLs from
    the following (shared below) to showcase that we have the right expertise to get the job done.
    I will now provide you with the job description and the portfolio URLs:
    JOB DESCRIPTION: {job_description}
    ------
    PORTFOLIO URLS: {portfolio_urls}

    End the email with a Gmail contact address that opens a conversation about the possibility of working with them.

    Do not add any suggestions or commentary after the email.
    """
)


# Generate email
chain_email = email_prompt | llm
response = chain_email.invoke({
    'job_description': job_description,
    # Pass the formatted link list, not the raw ChromaDB query result
    'portfolio_urls': portfolio_urls_text
})

# Display the email
print("="*80)
print("GENERATED COLD EMAIL")
print("="*80)
print(response.content)

GENERATED COLD EMAIL
Here is a cold email to the client:

Subject: Expert Data Science Solutions for Motive's Annotation Needs

Dear Hiring Manager,

I came across the job posting for a Data Annotator at Motive, and I was impressed by the company's focus on leveraging Machine Learning models for risk identification. As a Business Development Officer at X Consulting firm, I'd like to introduce our team of experts in Data Science and AI who can provide tailored solutions to support your annotation needs.

Our team has extensive experience in developing and implementing Machine Learning models, and we've successfully worked on various projects involving data annotation and computer vision. You can check out some of our work here:

* https://github.com/johndoe/ml-house-price-predictor (Python, Scikit-learn, Pandas, NumPy, Streamlit)
* https://github.com/johndoe/image-classification-cnn (Python, TensorFlow, Keras, OpenCV, NumPy)

We understand the importance of accurate and timely data anno
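The generated text opens with a lead-in sentence ("Here is a cold email to the client:") before the actual email. If only the email itself is wanted, one option is to trim everything before the subject line. The sketch below assumes the model always emits a line starting with "Subject:", which is an assumption about the output rather than something the prompt guarantees; strip_preamble is a hypothetical helper.

```python
def strip_preamble(text: str, marker: str = "Subject:") -> str:
    """Return the text starting at the first marker, or the text unchanged."""
    index = text.find(marker)
    return text[index:].strip() if index != -1 else text.strip()


raw = "Here is a cold email to the client:\n\nSubject: Hello\n\nDear Hiring Manager,"
print(strip_preamble(raw))
```

When the marker is absent the text is returned as-is, so the helper never discards an email that happens to lack a subject line.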