Simple scraper with cheerio
In this post we will learn how to scrape a website using cheerio, and then create an API with Node.js from the scraped data that you can later use with a frontend.
The website that we will be using for this example is pricecharting.
You can contact me by telegram if you need to hire a Full Stack developer.
You can also contact me by discord.
You can clone the repo if you want.
This example is only for learning purposes.
Open your terminal and type the following:
- mkdir node-cheerio-tut
- cd node-cheerio-tut
- npm init -y
- code .
We will use the following dependencies:
- axios
- cheerio
- express
- nodemon
To install the dependencies, go to your project folder, open a terminal and type the following (mongoose is installed here too, although the code in this post does not use it):
npm i axios cheerio express mongoose
And for the dev dependencies type:
npm i -D nodemon
node-cheerio-tut/
├── node_modules/
├── public/
├── src/
│   ├── routes/
│   ├── database.js
│   └── index.js
└── package.json
First go to your package.json and add these lines.
"scripts": {
  "start": "node ./src/index.js",
  "dev": "nodemon ./src/index.js"
},
Let's code
Let's go to index.js inside the src folder and set up our basic server with express.
const express = require('express')
const app = express()
//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})
Now let's run the command npm run dev and we should get this message:
listening on port 3000
Now in our index.js let's import axios and cheerio, then I'll explain the code below.
- We're going to add a const url with the url value, in this case https://www.pricecharting.com/search-products?q=. (When you do a search on this website, you are redirected to a new page, with a new route and a parameter whose value is the name you searched for.) So we are going to use that url. The website also has two types of search, one by price and another by market; if we don't specify the type in the url, it uses the market type by default. I leave it like this because the market search returns the cover of the game and the system (we will use them later).
- We will add this middleware, app.use(express.json()), because otherwise req.body will be undefined when we do the post request.
- We will create a route with the post method to send a body to our server. (I'm going to use the REST Client VS Code extension to test the api, but you can use Postman or whatever you want.)
test.http
POST http://localhost:3000
Content-Type: application/json

{
  "game": "final fantasy"
}
With the console.log(req.body.game) line in the route enabled, the terminal shows:
final fantasy
As you can see, we are receiving the body of the request; in this case I named the property game.
const axios = require("axios");
const cheerio = require("cheerio");
const express = require('express')
//initializations
const app = express()
const url = "https://www.pricecharting.com/search-products?q="
//middlewares
app.use(express.json())
app.post('/', async (req, res) => {
  // console.log(req.body.game)
  const game = req.body.game.trim().replace(/\s+/g, '+')
})
//server
app.listen(3000, () => {
  console.log('listening on port 3000')
})
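The route above calls req.body.game.trim() directly, so a request without a game property would throw before any scraping happens. A minimal sketch of a guard, assuming a hypothetical helper named readGameTitle (not part of the original code) that returns null for anything that is not a non-empty string:

```javascript
// Hypothetical helper (not in the original post): returns the trimmed title,
// or null when the body has no usable "game" property.
function readGameTitle(body) {
  if (!body || typeof body.game !== "string") return null;
  const title = body.game.trim();
  return title.length > 0 ? title : null;
}

console.log(readGameTitle({ game: "  final fantasy " })); // "final fantasy"
console.log(readGameTitle({}));                           // null
console.log(readGameTitle({ game: 42 }));                 // null
```

Inside the route, a null result could be answered with something like res.status(400).json({ error: "game is required" }) instead of letting the handler crash.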
- Now we are going to create a constant named game that will store the value from req.body.game, then we will use some methods to get a result like this: final+fantasy.
- First we're going to use trim() to remove the whitespace characters from the start and end of the string.
- Then we will replace the whitespace between the words with a + symbol with replace(/\s+/g, '+').
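The two steps above can be tried on their own, outside the route. This is a small sketch: toQuery repeats the trim and replace calls from the post, while toEncodedQuery is a hypothetical variant that also runs each word through encodeURIComponent, one possible way to keep characters such as & from breaking the query string (the original code does not do this).

```javascript
const url = "https://www.pricecharting.com/search-products?q=";

// Same two steps as in the post: trim the ends, join the words with "+".
function toQuery(title) {
  return title.trim().replace(/\s+/g, "+");
}

// Hypothetical variant: additionally encodes each word, so "&" becomes "%26".
function toEncodedQuery(title) {
  return title
    .trim()
    .split(/\s+/)
    .map((word) => encodeURIComponent(word))
    .join("+");
}

console.log(url + toQuery("  final   fantasy "));
// https://www.pricecharting.com/search-products?q=final+fantasy
console.log(toEncodedQuery("ratchet & clank")); // ratchet+%26+clank
```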
Finally we're going to use cheerio.
- Now that we have our game constant we're going to use axios to make a request to our url + the game title.
- We are going to use a try catch block; if we get a response, we will store its data in a constant named html, then we will use cheerio to load that data.
- We are going to create a constant named games that will store this value: $(".offer", html).
- If you open your developer tools and go to the Elements tab you will see that the .offer class belongs to a table like the image below.
- If you take a look at this image you will easily understand what's going on in the code.
- Now we are going to loop through that table to get each title, and we can do that using .find(".product_name"), then .find("a"), then we want the text() from the a tag.
.
.
.
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    // a single request, made inside the try block so a failure is caught
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      console.log(gameTitle)
    })
  } catch (error) {
    console.log(error)
  }
})
.
.
.
- If you try this with console.log(gameTitle) you will get output like this.
Final Fantasy VII
Final Fantasy III
Final Fantasy
Final Fantasy VIII
Final Fantasy II
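The .replace(/\s+/g, ' ').trim() at the end of the chain is there because the text inside a tag usually still carries the line breaks and indentation of the HTML source. A small sketch with a made-up raw string (the exact whitespace on the real page is an assumption, and cleanText is a hypothetical helper name):

```javascript
// Made-up example of text as it might come out of .text(), with the
// newlines and indentation of the HTML source still in it.
const rawTitle = "\n        Final Fantasy\n   VII\n    ";

// Collapse every run of whitespace into one space, then trim both ends.
function cleanText(text) {
  return text.replace(/\s+/g, " ").trim();
}

console.log(cleanText(rawTitle)); // "Final Fantasy VII"
```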
.
.
.
- Now let's add more fields; for this example I want an id, a cover image and a system.
.
.
.
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      // the id attribute has a prefix, slice(8) keeps what comes after it
      const id = $(el).attr('id').slice(8);
      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src");
      // the system is the text node right after the first <br> of the row
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
    })
  } catch (error) {
    console.log(error)
  }
})
.
.
.
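The slice(8) call works as long as the id attribute of each row has an 8-character prefix before the number, for example product-12345 (this format is an assumption here, so check the real markup in the Elements tab). Also, attr('id') returns undefined when the attribute is missing, and calling slice on that would throw. A sketch of a slightly more defensive version, using a hypothetical extractId helper:

```javascript
// Hypothetical helper: keeps only the trailing digits of an id attribute.
// Returns null when the attribute is missing or has no digits at the end.
function extractId(idAttr) {
  if (typeof idAttr !== "string") return null;
  const match = idAttr.match(/(\d+)$/);
  return match ? match[1] : null;
}

console.log("product-12345".slice(8));   // "12345"
console.log(extractId("product-12345")); // "12345"
console.log(extractId(undefined));       // null
```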
Let's store this data in an array. To do this, let's create an array named videoGames.
.
.
.
const url = "https://www.pricecharting.com/search-products?q=";
app.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  // new array on every request, so results from earlier searches don't pile up
  const videoGames = []
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      const id = $(el).attr('id').slice(8);
      //cover image
      const coverImage = $(el).find(".photo").find("img").attr("src");
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
      // push inside the loop, where the values of each row are in scope
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem
      })
    })
    res.json(videoGames)
  } catch (error) {
    console.log(error)
    // answer the client as well, otherwise the request would hang
    res.status(500).json({ error: 'could not scrape the games' })
  }
})
.
.
.
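When the request to the website fails, the route has to decide what to send back to the client. A minimal sketch of one way to do it: toErrorResponse is a hypothetical helper, the status codes and messages are illustrative choices, and it relies on axios attaching a response property to the error when the remote server answered with an error status.

```javascript
// Hypothetical helper: maps an error thrown by axios.get to a status code
// and a JSON body for our own API.
function toErrorResponse(error) {
  // axios sets error.response when the remote server replied with an error status
  if (error && error.response) {
    return {
      status: 502,
      body: { error: `upstream responded with ${error.response.status}` },
    };
  }
  // no response at all: network problem, timeout, DNS failure...
  return { status: 500, body: { error: "could not reach the website" } };
}

// Fake errors shaped like the ones axios produces, only for this demo
console.log(toErrorResponse({ response: { status: 404 } }).status); // 502
console.log(toErrorResponse(new Error("timeout")).status);          // 500
```

In the catch block this could be used as const { status, body } = toErrorResponse(error) followed by res.status(status).json(body).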
If you try the route again you will get a result similar to the image below.
Optionally, I made an array to keep only certain systems, because I didn't want to receive the same title with both PAL and NTSC systems, so I left the default system (NTSC).
.
.
.
// named system, because that is the name the route uses in system.includes(...)
const system = [
  "Nintendo DS",
  "Nintendo 64",
  "Nintendo NES",
  "Nintendo Switch",
  "Super Nintendo",
  "Gamecube",
  "Wii",
  "Wii U",
  "Switch",
  "GameBoy",
  "GameBoy Color",
  "GameBoy Advance",
  "Nintendo 3DS",
  "Playstation",
  "Playstation 2",
  "Playstation 3",
  "Playstation 4",
  "Playstation 5",
  "PSP",
  "Playstation Vita",
  "PC Games",
]
.
.
.
app.post('/', async (req, res) => {
  // ...
      // skip the rows whose system is not in the list
      if (!system.includes(gameSystem)) return;
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem,
      });
  // ...
})
.
.
.
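The includes check above does the filtering inside the loop. The same idea can be tried on plain objects. This sketch uses a shortened list of systems and made-up game entries, and a Set instead of an array for the lookup (a swap made here for illustration, not something the post does):

```javascript
// Shortened version of the list of systems from the post
const allowedSystems = new Set(["Playstation", "Super Nintendo", "Nintendo DS"]);

// Made-up sample entries, shaped like the objects pushed into videoGames
const scraped = [
  { id: "1", gameTitle: "Final Fantasy VII", gameSystem: "Playstation" },
  { id: "2", gameTitle: "Final Fantasy VII", gameSystem: "PAL Playstation" },
  { id: "3", gameTitle: "Final Fantasy III", gameSystem: "Super Nintendo" },
];

// Keep only the entries whose system is in the allowed set
const filtered = scraped.filter((game) => allowedSystems.has(game.gameSystem));

console.log(filtered.map((game) => game.id)); // [ '1', '3' ]
```

With a list of about twenty systems an array with includes is perfectly fine; the Set mainly makes the intent (a membership test) explicit.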
Let's organize it a little bit: let's create a folder in src named routes, then create a file named index.js inside it.
Copy and paste the code below.
const { Router } = require('express')
const cheerio = require("cheerio");
const axios = require("axios");
const router = Router()
const url = "https://www.pricecharting.com/search-products?q="
const system = [
  "Nintendo DS",
  "Nintendo 64",
  "Nintendo NES",
  "Nintendo Switch",
  "Super Nintendo",
  "Gamecube",
  "Wii",
  "Wii U",
  "Switch",
  "GameBoy",
  "GameBoy Color",
  "GameBoy Advance",
  "Nintendo 3DS",
  "Playstation",
  "Playstation 2",
  "Playstation 3",
  "Playstation 4",
  "Playstation 5",
  "PSP",
  "Playstation Vita",
  "PC Games",
]
router.post('/', async (req, res) => {
  const game = req.body.game.trim().replace(/\s+/g, '+')
  // new array on every request, so results from earlier searches don't pile up
  const videoGames = []
  try {
    const response = await axios.get(url + game)
    const html = response.data;
    const $ = cheerio.load(html)
    const games = $(".offer", html)
    games.each((i, el) => {
      const gameTitle = $(el)
        .find(".product_name")
        .find("a")
        .text()
        .replace(/\s+/g, ' ')
        .trim()
      const id = $(el).attr('id').slice(8);
      const coverImage = $(el).find(".photo").find("img").attr("src");
      const gameSystem = $(el)
        .find("br")
        .get(0)
        .nextSibling.nodeValue.replace(/\n/g, "")
        .trim();
      // skip the rows whose system is not in the list
      if (!system.includes(gameSystem)) return;
      videoGames.push({
        id,
        gameTitle,
        coverImage,
        gameSystem,
        backlog: false
      });
    })
    res.json(videoGames)
  } catch (error) {
    console.log(error)
    // answer the client as well, otherwise the request would hang
    res.status(500).json({ error: 'could not scrape the games' })
  }
})
module.exports = router
Now let's go back to our main file, src/index.js, and leave the code like this.
const express = require('express')
//routes
const main = require('./routes/index')
const app = express()
//middlewares
app.use(express.json())
//routes
app.use(main)
app.listen(3000, () => {
  console.log('Server running on port 3000')
})
If you try it you will see that it still works without any trouble.
We learned how to make a simple scraper with cheerio.
I really hope you have been able to follow the post without any trouble; otherwise I apologize. Please leave me your questions or comments.
I plan to make a next post extending this code, adding more routes, mongodb, and a front end.
Thanks for your time.