<a href="https://colab.research.google.com/github/theinshort/crawler/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WEB DATA CRAWLING

In this project, our aim is to scrape a website using Python.

`This project is for learning purposes only and is not intended to commit any kind of violation.`





## Objective
We will use a web crawler to acquire data from a specific domain:

1. Choose a domain of interest (e.g., news articles, product reviews, or scientific publications).
2. Identify and use web crawling tools or libraries (such as BeautifulSoup, Scrapy, or others) to extract data from the chosen domains.
3. Collect a sufficient amount of data to ensure diversity and relevance.
4. Scrape and clean the HTML contents to generate clean text outputs (at least 2 GB of textual data; the more the better).
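
As a rough illustration of step 4, the sketch below turns an HTML string into plain text. It deliberately uses only Python's standard-library `html.parser` as a stand-in, so it is not the BeautifulSoup-based approach named in step 2; the class name `TextExtractor` and the helper `html_to_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text("<p>Hello <b>world</b></p>")` would give `"Hello world"`. A real pipeline would likely need more cleaning (navigation menus, repeated footers, and so on), which this sketch does not attempt.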

## Final Outcome


1. Colab Notebook:
- Showcases the entire process of web data crawling, including the chosen domains, code implementation, and data extraction.
- Each step in the notebook is clearly commented and documented.
2. Dataset Files:
- Extracted dataset in a separate file format (e.g., CSV, JSON) that includes a sample of the collected data.
3. Summary:
- Why specific domains were selected.
- A brief description of the web crawling tools or libraries used, and why.
- Statistics of data extracted from each domain.

# CHOOSING A DOMAIN
In this section, we will go through the steps needed to finalize our domain of interest. We will consider all ethical and legal concerns before starting our scraping process.

To choose the domain for scraping, we have to understand the complexity of the data and the structure of the website. We will focus on product-based websites such as shopping stores, because data from these websites is usually available to scrape.

Data is valuable, and scraping a website without permission may violate its terms of service or applicable laws, so before we start scraping we need to check whether the website allows its data to be scraped.

We will perform the following steps to finalize our domain:


1.   Decide what type of data we need to scrape
2.   Find related websites
3.   Analyze website content and structure
4.   Check the website's robots.txt file for restrictions
5.   Select the website if scraping is allowed



## Decide what type of data we need to scrape
We will be scraping product data, which is useful in many respects and also presents some difficulties that will help us understand the scraping procedure better.
Product data has a structure with multiple features, such as categories, sub-categories, price, and more. This type of data is useful in machine learning, for example when fine-tuning models to get the desired results.
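
To make the idea of structured product data more concrete, here is a minimal sketch of what a single scraped record might look like. The `Product` class, its field names, and the sample values are all assumptions made for illustration; the actual fields depend on what the chosen website exposes.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Product:
    """One scraped product record (hypothetical field names)."""
    name: str
    category: str
    sub_category: Optional[str] = None  # not every product has one
    price: Optional[float] = None       # None when no price could be extracted
    url: Optional[str] = None           # page the record was scraped from


# Made-up sample values, for illustration only
sample = Product(
    name="Example phone",
    category="Electronics",
    sub_category="Mobiles",
    price=49999.0,
)

# A plain dict is convenient for writing to CSV or JSON later
record = asdict(sample)
```

Keeping missing values as `None` instead of dropping the field means every record has the same keys, which tends to make later cleaning and statistics simpler.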

## Find related websites
Some of the websites with the desired data are as follows:
1. [Daraz.pk Online Shopping](https://www.daraz.pk/)
2. [Home Shopping Services](https://homeshopping.pk/)
3. [Telemart Online Shopping](https://www.telemart.pk/)


## Analyze website content and structure

After analyzing these websites, we have come to the conclusion that daraz.pk is the better option for this job, as it has a large volume of data and a moderately complex structure, which makes it useful for learning scraping. The other websites are less suitable for scraping because they offer limited data.

## Check the website's robots.txt file for restrictions

Before scraping a website, we first have to check whether the website allows developers and other users to scrape its content. To check the restrictions, we need to analyze the website's robots.txt file.


In [2]:
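# Importing required libraries
import requests as req
from urllib.parse import urlparse
from bs4 import BeautifulSoup  # for parsing HTML pages; robots.txt itself is plain text

# Creating a function that fetches the content of a robots.txt file
def get_robots_txt(url):
  """Fetch the robots.txt file content from the given site URL."""
  file_url = f"{url.rstrip('/')}/robots.txt"
  # Fetching the file with the requests library's get function (with a timeout so it cannot hang)
  response = req.get(file_url, timeout=10)
  # Check the status of the response before returning
  if response.status_code == 200:
    return response.text
  else:
    return None

def check_restrictions(url, robots_txt):
  """Check the robots.txt rules to see whether the URL is allowed for scraping.

  Simplified: only Disallow prefixes in "User-agent: *" groups are considered;
  Allow rules and wildcard patterns are ignored.
  """

  # No robots.txt content: we treat the URL as allowed
  if not robots_txt:
    return True

  path = urlparse(url).path or "/"
  in_wildcard_group = False   # True while reading a group that includes "User-agent: *"
  previous_was_agent = False  # consecutive User-agent lines share one group

  for line in robots_txt.splitlines():
    # Drop comments and whitespace, then split "Field: value"
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
      continue
    field, value = (part.strip() for part in line.split(":", 1))
    if field.lower() == "user-agent":
      # Check for the wildcard user agent in the robots.txt content
      in_wildcard_group = (previous_was_agent and in_wildcard_group) or value == "*"
      previous_was_agent = True
      continue
    previous_was_agent = False
    # Check whether the URL path falls under a restricted path
    if field.lower() == "disallow" and in_wildcard_group and value and path.startswith(value):
      return False
  # No matching restriction was found, so we consider the URL allowed for scraping
  return True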



In the code example above, we:
1. Import the `requests` library for making web requests and `BeautifulSoup` for parsing HTML content.
2. Retrieve the `robots.txt` file from a given URL using the `get_robots_txt()` function.
3. Analyze the `robots.txt` content within `check_restrictions()` to determine whether a URL may be scraped according to the website's guidelines before proceeding.
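
As a cross-check, the same decision can also be made with Python's standard-library `urllib.robotparser`, which understands the `robots.txt` format directly. The sketch below is an alternative and not the notebook's own function; the helper name `is_allowed`, the `example.com` URLs, and the sample rules are hypothetical and are not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
"""


def is_allowed(url, robots_txt, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines, not one big string
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/products/phone", SAMPLE_ROBOTS_TXT))  # True
print(is_allowed("https://example.com/cart/items", SAMPLE_ROBOTS_TXT))      # False
```

In practice, the text returned by `get_robots_txt()` could be passed in place of `SAMPLE_ROBOTS_TXT`. Note that `robots.txt` expresses the site owner's crawling preferences; it does not replace reading the website's terms of service.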