Neo-Scraper

Neo-Scraper is a simple web-scraping application built with Python, Flask, and BeautifulSoup.

Note

To use this app, you need to install the following dependencies:

  • Flask
pip install Flask
  • Flask-CORS
pip install flask-cors
  • Requests
pip install requests
  • BeautifulSoup
pip install beautifulsoup4

Alternatively, if you want to install all the modules at once, I created a requirements.txt file. Do you know how to generate a requirements.txt automatically?

pip freeze > requirements.txt

This will save all dependencies and their versions in the requirements.txt file. You can then use this file to install the same dependencies in another environment with the command:

pip install -r requirements.txt
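
For reference, a requirements.txt for this project would list at least the four packages above (a minimal sketch; pin the exact versions that pip freeze reports in your environment):

Flask
flask-cors
requests
beautifulsoup4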

Next, we need to run the local development server.

Important

These commands must be run in your terminal or command prompt. Make sure you are in a virtual environment if you are working on a specific project. You can create a virtual environment using venv as follows:

# Create virtual environment
python -m venv myenv

# Activate the virtual environment (Windows)
myenv\Scripts\activate

# Activate the virtual environment (Unix or macOS)
source myenv/bin/activate

You can then install the dependencies within the virtual environment.

After installing the dependencies, you can run your Flask script. Make sure you have the complete Flask code and adjust the port or any other settings as necessary. For example, if your file is called app.py, you could run it like this:

python app.py
This will start your Flask application on the local server, and you can access it from your browser. Open the address the console displays after running the script (by default, it's usually something like http://127.0.0.1:5000/).
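
As a quick sanity check, you can also call the scraping endpoint from Python once the server is running (a minimal sketch, assuming the default port and the /api/scrape route defined below):

import requests

# Query the local Flask endpoint and print a sample of the scraped data
data = requests.get('http://127.0.0.1:5000/api/scrape', timeout=10).json()
print(data['titles'][:3])  # first three scraped headings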
The code below is a simple Flask web application that uses Flask and BeautifulSoup to scrape data from a website. Let's break it down:
from flask import Flask, jsonify
from flask_cors import CORS
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)
CORS(app)

@app.route('/api/scrape', methods=['GET'])
def scrape():
    try:
        url = 'https://solidsnk86.netlify.app'
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')

        titles = [h.text for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]

        paragraphs = [p.text for p in soup.find_all('p')]

        list_items = [li.text for li in soup.find_all('li')]

        inputs = [inp['value'] if 'value' in inp.attrs else inp.text for inp in soup.find_all('input')]

        links = [{'text': a.text, 'href': a.get('href', '')} for a in soup.find_all('a')]

        images = [{'src': img.get('src', ''), 'alt': img.get('alt', '')} for img in soup.find_all('img')]

        table_data = [td.text for td in soup.find_all('td')]

        return jsonify({'titles': titles,
                        'paragraphs': paragraphs,
                        'list_items': list_items,
                        'inputs': inputs,
                        'links': links,
                        'images': images,
                        'table_data': table_data
                        })
    except Exception as e:
        return jsonify({'error': str(e)})

if __name__ == '__main__':
    app.run(debug=True)

# This code runs the Flask application when the script is executed directly
# (not imported as a module), with debugging enabled.

Web Scraping Logic:

  • The script specifies a target URL and exposes the scraped data through a Flask API route (/api/scrape) so that a client such as Axios can request it.
  • Inside the try block, it requests the specified URL, retrieves the page source, and parses it using BeautifulSoup.
  • The titles, paragraphs, list items, inputs, links, images, and table data are extracted from the HTML content using BeautifulSoup.
  • The results are returned as a JSON response using Flask's jsonify function.
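
In isolation, the extraction step works like this (a minimal sketch that parses inline HTML instead of a live page):

from bs4 import BeautifulSoup

# Parse a small HTML snippet with the same find_all calls
# the app uses against the live page
html = '<h1>Hello</h1><p>World</p><a href="/home">Home</a>'
soup = BeautifulSoup(html, 'html.parser')
print([h.text for h in soup.find_all(['h1', 'h2', 'h3'])])  # ['Hello']
print([{'text': a.text, 'href': a.get('href', '')} for a in soup.find_all('a')])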

Warning

Web scraping should be done ethically and in accordance with the website's terms of service. Always be aware of legal and ethical considerations when scraping data from websites.

I have a short video on YouTube that shows how this app works. In my case, I use it from my React web app and call it with Axios. Follow this link.

Here is a brief look at the JSON that we will receive from the Python app:
from flask import Flask, jsonify
return jsonify({'titles': titles, 'paragraphs': paragraphs, 'list_items': list_items})

[Screenshot: example of the JSON response]
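
For illustration, the full response has roughly this shape (all values here are made up):

{
  "titles": ["Welcome", "About"],
  "paragraphs": ["Some introductory text..."],
  "list_items": ["First item", "Second item"],
  "inputs": [""],
  "links": [{"text": "Home", "href": "/"}],
  "images": [{"src": "/logo.png", "alt": "Site logo"}],
  "table_data": []
}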

To get that JSON result into your React web app, you need a component like this:

import { useState } from 'react';
import axios from 'axios';

export default function Scraper() {
  const [disabled, setDisabled] = useState(false);
  const [scrape, setScraping] = useState({
    titles: [],
    paragraphs: [],
    list_items: [],
    inputs: [],
    images: [],
    links: [],
  });

  const handleScrape = async () => {
    try {
      const response = await axios.get(
        // Point this at your deployed Flask API; the route must match
        // the /api/scrape endpoint defined in the Python code above
        'https://your-username.pythonanywhere.com/api/scrape',
      );
      setScraping(response.data);
      setDisabled(true);
    } catch (error) {
      console.error('Error while scraping:', error);
    }
  };

  return (
    <>
      <div className="justify-center mx-auto my-6">
        <button
          onClick={handleScrape}
          disabled={disabled}
          className="border p-3 bg-zinc-300 dark:bg-zinc-800/95 dark:border-zinc-600/75 cursor-not-allowed rounded dark:border-zinc-800 border-zinc-00/10 hover:opacity-[.6] transition-all"
        >
          Scrape
        </button>
      </div>
      {scrape.titles.map((title, index) => (
        <article
          key={index}
          className="text-zinc-100 space-y-3 border-zinc-200 border-[1px] shadow-md rounded shadow-zinc-200 mt-6 p-6 dark:!shadow dark:border-zinc-800 overflow-x-auto"
        >
          <h1 className="text-[tomato] underline text-lg">
            {title}
          </h1>
          <p className="text-text-primary p-3 text-sm">
            {scrape.paragraphs[index]}
          </p>
          <ul className="text-zinc-500">
            {scrape.images[index] && (
              <li>
                <p>Alt: {scrape.images[index].alt}</p>
                <p>Src: {scrape.images[index].src}</p>
              </li>
            )}
          </ul>
          <div>
            <p>Links:</p>
            <ul>
              {scrape.links[index] && (
                <li>
                  <p>Text: {scrape.links[index].text}</p>
                  <p>Href: {scrape.links[index].href}</p>
                </li>
              )}
            </ul>
          </div>
        </article>
      ))}
    </>
  );
}
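
Note that the component pairs titles, paragraphs, images, and links by index. Since the Flask API returns independent flat lists, some positions may be undefined for a given index, which is why the markup above checks each entry before rendering it.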

Error Reporting: If you encounter any issues or errors while using our application, please don't hesitate to reach out. Provide details about the problem, and I'll work to address it promptly.

Contributions: Interested in contributing to the project? Feel free to fork the repository on GitHub and submit a pull request. I appreciate any enhancements, bug fixes, or new features you may bring!

General Inquiries: Whether you have questions about the application's functionality, suggestions for improvement, or just want to say hello, your feedback is always welcome.

For any queries or feedback, feel free to reach out to me by email:

Gmail Badge

NeoTecs Dev ©2023