### Web Scraping Using an LLM

### Install dependencies: `pip install langchain_community langchain_groq beautifulsoup4 python-dotenv chromadb pandas`

In [1]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://timesofindia.indiatimes.com/city/vijayawada/andhra-pradesh-origin-scientists-team-opens-door-on-black-holes/articleshow/118434954.cms")

data_from_web_page = loader.load().pop().page_content

print(data_from_web_page)

USER_AGENT environment variable not set, consider setting it to identify your requests.


Raga Deepika Pucha: Andhra Pradesh origin scientist team opens door on black holes | Vijayawada News - The Times of IndiaEditionININUSSign InTOICityvijayawadamumbaidelhibengaluruHyderabadkolkatachennaiagraagartalaahmedabadajmeramaravatiamritsarbareillybhubaneswarbhopalchandigarhchhatrapati sambhajinagarcoimbatorecuttackdehradunerodefaridabadghaziabadgoagurgaonguwahatihubballiimphalindoreitanagarjaipurjammujamshedpurjodhpurkanpurkochikohimakolhapurkozhikodeludhianalucknowmaduraimangalurumeerutmumbai regionmysurunagpurnashiknavi mumbainoidapatnaprayagrajpuducherrypuneraipurrajkotranchithanesalemshillongshimlasrinagarsurattrichythiruvananthapuramudaipurvadodaravaranasivisakhapatnamphotosWeb StoriesToday's ePaperweatherandhra electionsNewsCity Newsvijayawada NewsAndhra Pradesh-origin scientist’s team opens door on black holesTrendingDK ShivakumarNoida Lift IncidentDirector S ShankarRahul GandhiSourav Ganguly Car AccidentRekha GuptaDK ShivakumarNoida Lift IncidentDirector S ShankarRahul Gan
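The `USER_AGENT environment variable not set` warning in the output above can be avoided by setting that variable before `langchain_community` is imported. The following is a minimal sketch under that assumption; the agent string is a made-up placeholder, not a value the library requires.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
DEFAULT_USER_AGENT = "web-scraping-notebook/0.1"

# setdefault keeps any value already exported in the shell or loaded from a .env file.
os.environ.setdefault("USER_AGENT", DEFAULT_USER_AGENT)

# Import the loader only after the variable is set. The import is left commented
# so that this snippet runs without langchain_community installed:
# from langchain_community.document_loaders import WebBaseLoader

print(os.environ["USER_AGENT"])
```

Setting the variable in the `.env` file and calling `load_dotenv()` before the import should have the same effect.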

In [2]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://jobs.adidas-group.com/job/Melbourne-Assistant-Store-Manager-Carrum-Downs-Factory-Outlet-VIC/1173229601/?feedId=301201&utm_source=j2w")

data_from_web_page = loader.load().pop().page_content

print(data_from_web_page)

Assistant Store Manager - Carrum Downs Factory Outlet Job Details | adidas

By continuing to use and navigate this website, you are agreeing to the use of cookies.
Accept
Close

Press Tab to Move to Skip to Content Link
Skip to main content

Prepare for interview
Search Jobs
SEE CATEGORIES

Search by Keyword
Search by Location

Prepare for interview
Search Jobs
SEE CATEGORIES

View Profile
Employee Login

Search by Keyword
Search by Location
Show More Options
Loading...

Team:
All

Location
All

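The page text returned by the loader is dominated by layout whitespace and navigation labels, all of which would be sent to the LLM as tokens. A minimal sketch of one way to trim it first, assuming that dropping empty lines and indentation loses nothing of interest; `collapse_whitespace` is a hypothetical helper, not part of LangChain.

```python
def collapse_whitespace(text: str) -> str:
    """Strip every line and drop the ones that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)


# Small stand-in for the scraped page content.
sample = "\n\n\nAssistant Store Manager\n\n\n      Accept\n\n   \nClose\n\n"
print(collapse_whitespace(sample))
```

In the notebook this could be applied as `data_from_web_page = collapse_whitespace(data_from_web_page)` before the text is passed to the chain.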
In [3]:
from dotenv import load_dotenv
load_dotenv()

True

In [4]:
import os
groq_api_key = os.getenv("GROQ_API_KEY")
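`os.getenv` returns `None` when the variable is missing, so a misconfigured `.env` file would only surface later as an authentication error. A small sketch of failing early instead; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell")
    return value


# Demonstration with a throwaway variable so the snippet runs anywhere.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook the call would be `groq_api_key = require_env("GROQ_API_KEY")`.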

In [5]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    temperature=0,
    groq_api_key=groq_api_key,
    model_name="llama-3.3-70b-versatile",
)

### Prompting the LLM to structure the data extracted from the web page

In [6]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """
        ### SCRAPED TEXT FROM WEBSITE:
        {page_data}
        ### INSTRUCTION:
        The scraped text is from the careers page of a website.
        Your job is to extract the job postings and return them in JSON format containing the 
        following keys: `role`, `experience`, `skills` and `description`.
        Only return the valid JSON.
        ### VALID JSON (NO PREAMBLE):    
        """
)

chain = prompt | llm
result = chain.invoke(input={"page_data" : data_from_web_page})
print(result.content)

```json
{
  "role": "Assistant Store Manager",
  "experience": "Solid retail sales background with a proven track record of success in a similar retail leadership or supervisory position",
  "skills": [
    "Retail leadership experience in men’s/women’s apparel or sports footwear",
    "People management",
    "Store management",
    "Customer service",
    "Visual Merchandising",
    "In-Store Communication",
    "Inventory processing"
  ],
  "description": "We are on the lookout for an amazing Assistant Store Manager to join the team at our Carrum Downs Factory Outlet. If you're an efficient, pro-active leader with a connection to sport and you're looking for a new opportunity to work in an adidas factory outlet, then we are looking for you!"
}
```


### Parsing the JSON string into a Python dict

In [7]:
from langchain_core.output_parsers import JsonOutputParser

json_parser = JsonOutputParser()
json_result = json_parser.parse(result.content)
json_result

{'role': 'Assistant Store Manager',
 'experience': 'Solid retail sales background with a proven track record of success in a similar retail leadership or supervisory position',
 'skills': ['Retail leadership experience in men’s/women’s apparel or sports footwear',
  'People management',
  'Store management',
  'Customer service',
  'Visual Merchandising',
  'In-Store Communication',
  'Inventory processing'],
 'description': "We are on the lookout for an amazing Assistant Store Manager to join the team at our Carrum Downs Factory Outlet. If you're an efficient, pro-active leader with a connection to sport and you're looking for a new opportunity to work in an adidas factory outlet, then we are looking for you!"}

In [8]:
type(json_result)

dict
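The model wrapped its answer in a Markdown `json` code fence, and `JsonOutputParser` coped with that in the run above. To make the step less of a black box, here is a hand-rolled sketch of the same idea that also checks for the four keys the prompt asks for. `parse_job_json` is a hypothetical helper and is not how the LangChain parser is implemented.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the snippet can itself sit inside a fence
REQUIRED_KEYS = {"role", "experience", "skills", "description"}


def parse_job_json(text: str) -> list[dict]:
    """Strip an optional Markdown fence, parse the JSON, and check the keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text.strip()
    parsed = json.loads(payload)
    # The model may return one object or a list of them; normalise to a list.
    jobs = parsed if isinstance(parsed, list) else [parsed]
    for job in jobs:
        missing = REQUIRED_KEYS - set(job)
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
    return jobs


# Shortened stand-in for result.content.
raw = (
    FENCE + "json\n"
    '{"role": "Assistant Store Manager", "experience": "retail", '
    '"skills": ["People management"], "description": "sample"}\n' + FENCE
)
jobs = parse_job_json(raw)
print(jobs[0]["role"])
```

Normalising to a list keeps downstream code the same whether a page holds one posting or several.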

### Storing CSV file data in Chroma DB

In [10]:
import pandas as pd

In [11]:
df = pd.read_csv("portfolio.csv")

In [12]:
df

| | Techstack | Links |
|---|---|---|
| 0 | React, Node.js, MongoDB | https://example.com/react-portfolio |
| 1 | Angular,.NET, SQL Server | https://example.com/angular-portfolio |
| 2 | Vue.js, Ruby on Rails, PostgreSQL | https://example.com/vue-portfolio |
| 3 | Python, Django, MySQL | https://example.com/python-portfolio |
| 4 | Java, Spring Boot, Oracle | https://example.com/java-portfolio |
| 5 | Flutter, Firebase, GraphQL | https://example.com/flutter-portfolio |
| 6 | WordPress, PHP, MySQL | https://example.com/wordpress-portfolio |
| 7 | Magento, PHP, MySQL | https://example.com/magento-portfolio |
| 8 | React Native, Node.js, MongoDB | https://example.com/react-native-portfolio |
| 9 | iOS, Swift, Core Data | https://example.com/ios-portfolio |


In [13]:
import chromadb

In [14]:
client = chromadb.PersistentClient('Vector_Database')

In [15]:
collection = client.get_or_create_collection(name = 'portfolio_collection')

In [16]:
collection.count()

0

In [17]:
import uuid

### Storing the DataFrame rows in the vector database

In [18]:
# Populate the collection only once so that re-running the cell adds no duplicates.
if not collection.count():
    for _, row in df.iterrows():
        collection.add(documents=[row['Techstack']],
                       metadatas=[{"links": row['Links']}],
                       ids=[str(uuid.uuid4())])
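The cell above calls `collection.add` once per row with a random `uuid4` id and relies on the `collection.count()` guard to avoid duplicates. As an alternative sketch (not what the notebook does), the rows can be gathered into parallel lists for a single call, with `uuid5` ids derived from the link so that the same row always gets the same id. `build_payload` and the two sample rows are illustrative assumptions.

```python
import uuid

# Stand-in rows mirroring two entries of portfolio.csv; with pandas these
# could come from df.to_dict("records").
rows = [
    {"Techstack": "React, Node.js, MongoDB", "Links": "https://example.com/react-portfolio"},
    {"Techstack": "Python, Django, MySQL", "Links": "https://example.com/python-portfolio"},
]


def build_payload(rows: list[dict]) -> dict:
    """Build the parallel documents / metadatas / ids lists for one batched add."""
    return {
        "documents": [row["Techstack"] for row in rows],
        "metadatas": [{"links": row["Links"]} for row in rows],
        # uuid5 is deterministic: the same link always maps to the same id.
        "ids": [str(uuid.uuid5(uuid.NAMESPACE_URL, row["Links"])) for row in rows],
    }


payload = build_payload(rows)
print(payload["ids"])

# With a real Chroma collection this would become a single call:
# collection.add(**payload)
```

Stable ids make it possible to re-run the load without first checking whether the collection is empty, provided the store is written with an upsert-style call.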

In [19]:
links = collection.query(query_texts=["Experience in Java", "Experience in ReactNative"], n_results=2)

In [20]:
links

{'ids': [['f9b17d4f-6708-4ca9-ada0-591db0c8bf42',
   '0f1538b0-2e17-411c-8f70-3cef46a86d03'],
  ['5739763d-5de3-4d22-b50b-a895a9608a13',
   '0c033ca0-fef0-4a30-a382-2df76dfa9524']],
 'embeddings': None,
 'documents': [['Java, Spring Boot, Oracle',
   'Android, Java, Room Persistence'],
  ['React, Node.js, MongoDB', 'React Native, Node.js, MongoDB']],
 'uris': None,
 'data': None,
 'metadatas': [[{'links': 'https://example.com/java-portfolio'},
   {'links': 'https://example.com/android-portfolio'}],
  [{'links': 'https://example.com/react-portfolio'},
   {'links': 'https://example.com/react-native-portfolio'}]],
 'distances': [[1.0862322998041076, 1.1617181510147703],
  [1.3036504632671142, 1.3369225610904993]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}
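The result above is column-oriented: each of `documents`, `metadatas`, and `distances` holds one inner list per query text, in the same order as `query_texts`, and a smaller distance means a closer match. The sketch below regroups a trimmed copy of that result into one row per match; `rows_per_query` is a hypothetical helper.

```python
query_texts = ["Experience in Java", "Experience in ReactNative"]

# Trimmed copy of the query result shown above.
result = {
    "documents": [
        ["Java, Spring Boot, Oracle", "Android, Java, Room Persistence"],
        ["React, Node.js, MongoDB", "React Native, Node.js, MongoDB"],
    ],
    "metadatas": [
        [{"links": "https://example.com/java-portfolio"},
         {"links": "https://example.com/android-portfolio"}],
        [{"links": "https://example.com/react-portfolio"},
         {"links": "https://example.com/react-native-portfolio"}],
    ],
    "distances": [
        [1.0862322998041076, 1.1617181510147703],
        [1.3036504632671142, 1.3369225610904993],
    ],
}


def rows_per_query(query_texts, result):
    """Yield (query, document, link, distance) tuples, one per match."""
    for i, query in enumerate(query_texts):
        matches = zip(result["documents"][i], result["metadatas"][i], result["distances"][i])
        for document, metadata, distance in matches:
            yield query, document, metadata["links"], distance


for row in rows_per_query(query_texts, result):
    print(row)
```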

### Retrieving only the link metadata from the query result

In [21]:
links = collection.query(query_texts=["Experience in Java", "Experience in ReactNative"], n_results=2).get('metadatas', [])

In [22]:
links

[[{'links': 'https://example.com/java-portfolio'},
  {'links': 'https://example.com/android-portfolio'}],
 [{'links': 'https://example.com/react-portfolio'},
  {'links': 'https://example.com/react-native-portfolio'}]]
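The metadata comes back as a list of lists, one inner list per query text. When only the URLs are needed, it can be flattened into a single de-duplicated list. A short sketch, with `unique_links` as a hypothetical helper name:

```python
# Copy of the nested metadata shown above.
metadatas = [
    [{"links": "https://example.com/java-portfolio"},
     {"links": "https://example.com/android-portfolio"}],
    [{"links": "https://example.com/react-portfolio"},
     {"links": "https://example.com/react-native-portfolio"}],
]


def unique_links(metadatas):
    """Flatten the per-query metadata and keep each link once, in first-seen order."""
    flat = (meta["links"] for per_query in metadatas for meta in per_query)
    # dict.fromkeys preserves insertion order while discarding repeats.
    return list(dict.fromkeys(flat))


print(unique_links(metadatas))
```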

### Now let's use the job requirements fetched from the adidas website to retrieve related portfolio entries from the database

In [23]:
json_result

{'role': 'Assistant Store Manager',
 'experience': 'Solid retail sales background with a proven track record of success in a similar retail leadership or supervisory position',
 'skills': ['Retail leadership experience in men’s/women’s apparel or sports footwear',
  'People management',
  'Store management',
  'Customer service',
  'Visual Merchandising',
  'In-Store Communication',
  'Inventory processing'],
 'description': "We are on the lookout for an amazing Assistant Store Manager to join the team at our Carrum Downs Factory Outlet. If you're an efficient, pro-active leader with a connection to sport and you're looking for a new opportunity to work in an adidas factory outlet, then we are looking for you!"}

In [24]:
json_result['skills']

['Retail leadership experience in men’s/women’s apparel or sports footwear',
 'People management',
 'Store management',
 'Customer service',
 'Visual Merchandising',
 'In-Store Communication',
 'Inventory processing']

In [25]:
loader = WebBaseLoader("https://jobs.adidas-group.com/job/Amsterdam-Internship-Software-Engineering-6-months-NH/1171989701/?feedId=301201&utm_source=j2w")

data_from_web_page = loader.load().pop().page_content

print(data_from_web_page)

Internship - Software Engineering [6 months] Job Details | adidas

By continuing to use and navigate this website, you are agreeing to the use of cookies.
Accept
Close

Press Tab to Move to Skip to Content Link
Skip to main content

Prepare for interview
Search Jobs
SEE CATEGORIES

Search by Keyword
Search by Location

Prepare for interview
Search Jobs
SEE CATEGORIES

View Profile
Employee Login

Search by Keyword
Search by Location
Show More Options
Loading...

Team:
All

Location
All

In [26]:
# Reuse the prompt and chain defined in In [6]; only the page text has changed.
result = chain.invoke(input={"page_data": data_from_web_page})
print(result.content)

```json
{
  "role": "Internship - Software Engineering",
  "experience": "6 months",
  "skills": [
    "TypeScript",
    "React",
    "Redux",
    "Node.js",
    "JavaScript",
    "CI/CD pipelines",
    "Containerization",
    "Cloud deployments",
    "Agile development methodologies"
  ],
  "description": "Join our Amsterdam software engineering team and develop code for adidas' global online stores, ensuring high performance and scalability. Participate in Agile ceremonies and invest in personal growth through online learning, mentorship, and hands-on development experience."
}
```


In [27]:
json_parser = JsonOutputParser()
json_result = json_parser.parse(result.content)
json_result

{'role': 'Internship - Software Engineering',
 'experience': '6 months',
 'skills': ['TypeScript',
  'React',
  'Redux',
  'Node.js',
  'JavaScript',
  'CI/CD pipelines',
  'Containerization',
  'Cloud deployments',
  'Agile development methodologies'],
 'description': "Join our Amsterdam software engineering team and develop code for adidas' global online stores, ensuring high performance and scalability. Participate in Agile ceremonies and invest in personal growth through online learning, mentorship, and hands-on development experience."}

In [28]:
json_result['skills']

['TypeScript',
 'React',
 'Redux',
 'Node.js',
 'JavaScript',
 'CI/CD pipelines',
 'Containerization',
 'Cloud deployments',
 'Agile development methodologies']

In [29]:
json_result['description']

"Join our Amsterdam software engineering team and develop code for adidas' global online stores, ensuring high performance and scalability. Participate in Agile ceremonies and invest in personal growth through online learning, mentorship, and hands-on development experience."

In [30]:
json_result['role']

'Internship - Software Engineering'

In [32]:
links = collection.query(query_texts=json_result['skills'], n_results=2).get('metadatas', [])

In [33]:
links

[[{'links': 'https://example.com/typescript-frontend-portfolio'},
  {'links': 'https://example.com/full-stack-js-portfolio'}],
 [{'links': 'https://example.com/react-portfolio'},
  {'links': 'https://example.com/react-native-portfolio'}],
 [{'links': 'https://example.com/full-stack-js-portfolio'},
  {'links': 'https://example.com/typescript-frontend-portfolio'}],
 [{'links': 'https://example.com/react-portfolio'},
  {'links': 'https://example.com/full-stack-js-portfolio'}],
 [{'links': 'https://example.com/full-stack-js-portfolio'},
  {'links': 'https://example.com/typescript-frontend-portfolio'}],
 [{'links': 'https://example.com/devops-portfolio'},
  {'links': 'https://example.com/magento-portfolio'}],
 [{'links': 'https://example.com/devops-portfolio'},
  {'links': 'https://example.com/ios-ar-portfolio'}],
 [{'links': 'https://example.com/devops-portfolio'},
  {'links': 'https://example.com/xamarin-portfolio'}],
 [{'links': 'https://example.com/ios-ar-portfolio'},
  {'links': 'https
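Querying with every skill returns the same portfolio links several times. One way to pick the "most relevant" links before building the email prompt is to count how often each link appears across skills. This is an assumed heuristic rather than something the notebook does; `top_links` is a hypothetical helper and the sample holds only the first three result groups shown above.

```python
from collections import Counter

# First three result groups from the output above.
metadatas = [
    [{"links": "https://example.com/typescript-frontend-portfolio"},
     {"links": "https://example.com/full-stack-js-portfolio"}],
    [{"links": "https://example.com/react-portfolio"},
     {"links": "https://example.com/react-native-portfolio"}],
    [{"links": "https://example.com/full-stack-js-portfolio"},
     {"links": "https://example.com/typescript-frontend-portfolio"}],
]


def top_links(metadatas, n=3):
    """Return the n links that were matched by the largest number of skills."""
    counts = Counter(meta["links"] for per_query in metadatas for meta in per_query)
    return [link for link, _ in counts.most_common(n)]


print(top_links(metadatas, n=2))
```

Passing a short flat list such as this as `link_list` also keeps the prompt smaller than the full nested result.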

### Creating a prompt and generating the cold email using the LLM

In [34]:
prompt_email = PromptTemplate.from_template(
        """
        ### JOB DESCRIPTION:
        {job_description}
        
        ### INSTRUCTION:
        You are Mohan, a business development executive at AtliQ. AtliQ is an AI & Software Consulting company dedicated to facilitating
        the seamless integration of business processes through automated tools. 
        Over our experience, we have empowered numerous enterprises with tailored solutions, fostering scalability, 
        process optimization, cost reduction, and heightened overall efficiency. 
        Your job is to write a cold email to the client regarding the job mentioned above describing the capability of AtliQ 
        in fulfilling their needs.
        Also add the most relevant ones from the following links to showcase AtliQ's portfolio: {link_list}
        Remember you are Mohan, BDE at AtliQ. 
        Do not provide a preamble.
        ### EMAIL (NO PREAMBLE):
        
        """
        )

chain_email = prompt_email | llm
res = chain_email.invoke({"job_description": str(json_result), "link_list": links})
print(res.content)

Subject: Expert Software Engineering Solutions for Adidas' Online Stores

Dear Hiring Manager,

I came across the internship opportunity for a Software Engineering role at adidas, and I'm excited to introduce AtliQ, a leading AI & Software Consulting company. We specialize in facilitating seamless integration of business processes through automated tools, and our expertise aligns perfectly with your requirements.

At AtliQ, we have a proven track record of empowering enterprises with tailored solutions, fostering scalability, process optimization, cost reduction, and heightened overall efficiency. Our team of experts is well-versed in the technologies you're looking for, including TypeScript, React, Redux, Node.js, JavaScript, CI/CD pipelines, Containerization, Cloud deployments, and Agile development methodologies.

Our portfolio showcases our capabilities in:

* TypeScript frontend development: https://example.com/typescript-frontend-portfolio
* React-based solutions: https://example