<a href="https://colab.research.google.com/github/theinshort/crawler/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WEB DATA CRAWLING

In this project, our aim is to scrape a website using Python.

`This project is for learning purposes only and is not intended to commit any kind of violation.`





## Objective
We will use a web crawler to acquire data from a specific domain

1. Choose a domain of interest (e.g., news articles, product reviews, scientific publications).
2. Identify and use web crawling tools or libraries (such as BeautifulSoup or Scrapy) to extract data from the chosen domain.
3. Collect a sufficient amount of data to ensure diversity and relevance.
4. Scrape and clean the HTML contents to generate clean text outputs (at least 2 GB of textual data; the more the better).
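
As a rough illustration of the cleaning step in item 4, here is a minimal sketch that turns an HTML fragment into plain text. It deliberately uses only the standard library's `html.parser` so that it runs without extra dependencies. The `TextExtractor` class, the `html_to_text` helper, and the sample markup are hypothetical names and data introduced for this example, not part of the project code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Collapse runs of whitespace into single spaces
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


# Hypothetical sample markup
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A library such as BeautifulSoup offers the same kind of extraction with less code; the standard-library version is used here only to keep the sketch self-contained.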

## Final Outcome


1. Colab Notebook:
- Showcases the entire process of web data crawling, including the chosen domains, code implementation, and data extraction.
- Each step in the notebook is clearly commented and documented.
2. Dataset Files:
- Extracted dataset in a separate file format (e.g., CSV, JSON) that includes a sample of the collected data.
3. Summary:
- Why specific domains were selected.
- A brief description of the web crawling tools or libraries used, and why they were chosen.
- Statistics of data extracted from each domain.
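
The per-domain statistics mentioned in the summary could be computed along the following lines. This is only a sketch: it assumes each collected record is a dictionary with a `category` key, and the `category_statistics` helper and the sample records are hypothetical.

```python
from collections import Counter


def category_statistics(records):
    """Count how many records were collected per category (hypothetical helper)."""
    return Counter(record.get("category", "unknown") for record in records)


# Hypothetical sample records standing in for scraped data
sample_records = [
    {"question": "Q1", "category": "english-mcqs"},
    {"question": "Q2", "category": "english-mcqs"},
    {"question": "Q3", "category": "physics-mcqs"},
]

stats = category_statistics(sample_records)
for category, count in stats.most_common():
    print(f"{category}: {count}")
```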

# CHOOSING A DOMAIN
In this section we go through the steps needed to finalize our domain of interest. We will consider all ethical and legal concerns before starting the scraping process.

To choose the domain for scraping, we have to understand the complexity of the data and the structure of the website. We will focus on product-based websites like shopping stores, because data from these websites is usually available to scrape.

Data is valuable, and scraping a website without permission may violate its terms of service or applicable law, so before starting to scrape we need to check whether the website allows its data to be scraped.

We will perform the following steps to finalize our domains:


1.   Decide what type of data we need to scrape.
2.   Find related websites.
3.   Analyze website content and structure.
4.   Check the website's robots.txt file for restrictions.
5.   Select the website if scraping is allowed.



## Decide what type of data we need to scrape.
We will be scraping MCQ data, which is useful in many respects and also poses some difficulties that will help us understand the scraping procedure better.
MCQ data has a structure with multiple options, a title, an answer, explanations, and more. This type of data is useful in machine learning and for fine-tuning models to get the desired results.
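
To make that structure concrete, here is a minimal sketch of how a single MCQ record might be represented. The `MCQ` class, its field names, and the sample question are assumptions made for illustration; the scraping code may store records differently.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class MCQ:
    """One multiple-choice question (hypothetical record layout)."""
    question: str
    options: List[str]
    correct_answer: str            # option letter such as "A" or "B"
    category: str = ""
    explanation: Optional[str] = None

    def is_valid(self):
        """Check that the answer letter points at one of the options."""
        if len(self.correct_answer) != 1:
            return False
        index = ord(self.correct_answer.upper()) - ord("A")
        return 0 <= index < len(self.options)


# Hypothetical sample record
sample_mcq = MCQ(
    question="Which word is a synonym of 'rapid'?",
    options=["Slow", "Quick", "Late", "Heavy"],
    correct_answer="B",
    category="english-mcqs",
)

print(sample_mcq.is_valid())
print(asdict(sample_mcq))
```

Keeping the options in a list and the answer as a letter makes it easy to validate a record before it is written out.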

## Find related websites
Some of the websites with the desired data are as follows:
1. [PakMCQs](https://pakmcqs.com/)
2. [CSSMCQs](https://cssmcqs.com/)
3. [MCQs Forum](https://mcqsforum.com/)
4. [MCQs Planet](https://mcqsplanet.com/)
5. [Top MCQs](https://topmcqs.com/)


## Analyze website content and structure

After analyzing these websites, we have concluded that their structures are different, so we cannot use a single method for all of them; each website has to be handled individually.
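
One possible way to organize per-site handling is a small registry that maps each host name to its own parser function. The sketch below is an assumption about how this could be structured: `PARSERS`, `get_parser`, and the placeholder parsers are hypothetical and do not reflect any site's real markup.

```python
from urllib.parse import urlparse


def parse_pakmcqs(html):
    """Placeholder parser for pakmcqs.com pages (hypothetical)."""
    return {"site": "pakmcqs.com", "length": len(html)}


def parse_cssmcqs(html):
    """Placeholder parser for cssmcqs.com pages (hypothetical)."""
    return {"site": "cssmcqs.com", "length": len(html)}


# Registry: one parser per website, keyed by host name
PARSERS = {
    "pakmcqs.com": parse_pakmcqs,
    "cssmcqs.com": parse_cssmcqs,
}


def get_parser(url):
    """Pick the parser registered for the URL's host."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    parser = PARSERS.get(host)
    if parser is None:
        raise ValueError(f"No parser registered for {host!r}")
    return parser


parser = get_parser("https://pakmcqs.com/category/english-mcqs")
print(parser("<html></html>"))
```

Adding support for another website then means writing one more parser function and registering it, without touching the rest of the crawler.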

## Check website robots.txt file for restrictions

To scrape a website, we first have to check whether it allows developers and other users to scrape its content. To check the restrictions, we need to analyze the website's robots.txt file.


In [1]:
# Importing required libraries
import requests as req
from bs4 import BeautifulSoup

# Creating a function that fetches the content of a robots.txt file
def get_robots_txt(url):
  """Fetch robots.txt file content from the given URL."""
  # Strip any trailing slash so the path does not become "//robots.txt"
  file_url = f"{url.rstrip('/')}/robots.txt"
  # Fetching data from the file using the requests library's get function
  response = req.get(file_url, timeout=10)
  # Check the status of the response before returning
  if response.status_code == 200:
    return response.text
  else:
    return None

from urllib.robotparser import RobotFileParser

def check_restrictions(url, robots_txt, user_agent="*"):
  """Check the robots.txt rules to see whether the URL is allowed for scraping."""

  # No robots.txt content means there are no published restrictions
  if not robots_txt:
    return True

  # robots.txt is plain text, not HTML, so it is parsed with the standard
  # library's robots.txt parser rather than an HTML parser
  parser = RobotFileParser()
  parser.parse(robots_txt.splitlines())

  # True if the rules that apply to this user agent allow fetching the URL
  return parser.can_fetch(user_agent, url)



What the code above does:
1. Imports the requests library for making web requests and BeautifulSoup for parsing HTML content.
2. Retrieves the `robots.txt` file from a given URL using the `get_robots_txt()` function.
3. Analyzes the `robots.txt` content within `check_restrictions()` to determine whether a URL is allowed for scraping under the website's guidelines before proceeding.
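
As a hedged illustration of how `robots.txt` rules are evaluated, the sketch below feeds a made-up rule set to Python's standard `urllib.robotparser` module. The rules and the `example.com` URLs are hypothetical and are not taken from any of the websites listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example
sample_rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
"""

robots_parser = RobotFileParser()
# parse() accepts the file content as a list of lines
robots_parser.parse(sample_rules.splitlines())

allowed_url = "https://example.com/category/english-mcqs"
blocked_url = "https://example.com/wp-admin/settings"

# can_fetch(user_agent, url) reports whether the rules permit the request
print(allowed_url, robots_parser.can_fetch("*", allowed_url))
print(blocked_url, robots_parser.can_fetch("*", blocked_url))
```

The first URL matches no `Disallow` path and is reported as allowed, while the second falls under `/wp-admin/` and is reported as blocked.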

In [2]:
site_url = "https://pakmcqs.com/"
robots_txt = get_robots_txt(site_url)

if robots_txt:
  # set target url to check for crawling
  target_url = "https://pakmcqs.com/category/english-mcqs"
  allowed = check_restrictions(target_url, robots_txt)
  if allowed:
    print(f"Target URL: '{target_url}' is allowed for Crawling")
  else:
    print(f"Target URL: '{target_url}' is not allowed for crawling")

else:
  print("robots.txt not found. Ask for confirmation or assume there are no restrictions")

Target URL: 'https://pakmcqs.com/category/english-mcqs' is allowed for Crawling




After verification, we can confirm that the target URL is allowed for scraping, and we can proceed to the next step.

In [3]:
import requests
from bs4 import BeautifulSoup

def get_target_urls(base_url, starts_with, html_tag, tag_id):
  """
  Fetch the list of target URLs found on the page at base_url, inside the
  element identified by html_tag and tag_id.

  Args:
      base_url (str): The URL to fetch HTML from.
      starts_with (str): The prefix a URL must start with to be collected.
      html_tag (str): HTML tag from which URLs need to be fetched.
      tag_id (str): Tag id to fetch target URLs from.

  Returns:
      list: A list of URLs starting with starts_with found within the html_tag with the given tag_id.
  """

  # Send a GET request to the provided URL
  response = requests.get(base_url)

  # Check if the request succeeded
  if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the element with provided html_tag & tag_id
    tag_content = soup.find(html_tag, id=tag_id)

    # Create an empty list to store the target URLs
    target_urls = []

    # Check if the element is found
    if tag_content:
      # Find all anchor tags (a elements) within the element
      for anchor in tag_content.find_all('a'):
        href = anchor.get('href')
        if href and href.startswith(starts_with):
          target_urls.append(href)

    return target_urls
  else:
    print(f"Error fetching URL: {base_url} - Status code: {response.status_code}")
    return []


In [4]:
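import re

def get_category_from_url(url):
  """Return the last path segment of a category URL, e.g. 'english-mcqs'."""
  # Strip a trailing slash so the last segment is never empty
  url_parts = url.rstrip("/").split("/")
  return url_parts[-1]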



In [47]:
import re
import requests
from bs4 import BeautifulSoup

def extract_mcq(url, category):
  """
  Extracts MCQ details (question, answer choices, correct answer) from a given URL.
  Args:
      url (str): The URL of the MCQ listing page.
      category (str): The category name stored with each MCQ.
  Returns:
      list: A list of dictionaries, one per MCQ found on the page.
  """
  all_mcqs = []

  response = requests.get(url)

  if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the header containing the MCQ question
    headers = soup.find_all('header', class_='entry-header entry-header-index')
    for header in headers:
      mcq = {}
      if header:
        mcq["category"] = category
        # Extract the question from the strong tag within the header
        question_element = header.find('strong')
        if question_element:
          mcq['question'] = question_element.text.strip()

        # Find the answer choices within the content section
        content = header.find('div', class_='entry-content')
        if content:
          option_elements = content.find_all('p')
          # correct_answer = content.find('strong')
          if option_elements:
            # Assuming the first paragraph contains answer choices (modify if needed)
            options = option_elements[0].text.strip().split('\n')
            options_list = [option.strip() for option in options]
            for i, option in enumerate(options_list):
              # Remove only a leading option label such as "A. " so text inside the option is untouched
              mcq[f"option {i+1}"] = re.sub(r"^[A-E]\.\s*", "", option)

          # Identify bold answer by searching for strong tags within paragraphs
          bold_answer = ""
          for option_element in option_elements:
            strong_element = option_element.find('strong')
            if strong_element:
              bold_answer = strong_element.text.strip()
              break  # Exit after finding the first bold element (assuming only one)
          if bold_answer:
            mcq['correct_answer'] = bold_answer.strip()[0]
            all_mcqs.append(mcq)

  return all_mcqs




In [48]:
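def get_max_page_number(url):
  """Return the last page number shown in the pagination bar, or 1 if it cannot be determined."""
  max_page = 1
  response = requests.get(url)
  if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    page_nav = soup.find("div", class_="wpsp-page-nav")
    # A category with a single page may have no pagination bar at all
    if page_nav:
      page_numbers = page_nav.find_all("a")
      # The last link is assumed to be "Next", so the second-to-last one holds the last page number
      if len(page_numbers) >= 2 and page_numbers[-2].text.strip().isdigit():
        max_page = int(page_numbers[-2].text.strip())

  return max_page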


In [49]:
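def get_all_page_mcqs(url, last_page):
  """Collect MCQs from pages 1..last_page of a category URL and append them to mcqs.csv."""
  all_mcqs = []
  category = get_category_from_url(url)
  for x in range(1, last_page+1):
    # Strip a trailing slash so the page URL does not contain "//page/"
    mcqs = extract_mcq(f"{url.rstrip('/')}/page/{x}", category)
    all_mcqs.extend(mcqs)

  write_mcqs_to_csv(all_mcqs, "mcqs.csv")
  return all_mcqs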


In [50]:
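import threading as td

def fetch_all_mcqs(urls):
  """Fetch MCQs for every category URL, one thread per URL, and return them all."""
  threads = []
  all_mcqs = []
  lock = td.Lock()  # Protects all_mcqs while threads append to it

  def process_url(url):
    max_page = get_max_page_number(url)
    mcqs = get_all_page_mcqs(url, max_page)
    with lock:
      all_mcqs.extend(mcqs)

  for url in urls:
    thread = td.Thread(target=process_url, args=(url,))
    threads.append(thread)
    thread.start()

  # Wait for every thread to finish before returning the combined result
  for thread in threads:
    thread.join()

  return all_mcqs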



In [51]:
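import csv
import os
import threading

csv_lock = threading.Lock()  # Keeps threads from interleaving rows in the same file

def write_mcqs_to_csv(all_mcqs, file_name):
  """Append MCQs to a CSV file, writing the header only when the file is new or empty."""
  columns = ["question", "option 1", "option 2", "option 3", "option 4","option 5", "correct_answer", "category"]
  with csv_lock:
    write_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline='', encoding="utf-8") as csv_file:
      writer = csv.DictWriter(csv_file, fieldnames=columns, extrasaction="ignore")
      if write_header:
        writer.writeheader()
      writer.writerows(all_mcqs)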

In [52]:

base_url = "https://pakmcqs.com/"  # Website URL to fetch the category URLs from

target_urls = get_target_urls(base_url,'https://pakmcqs.com/category/', 'div', 'secondary')

if target_urls:
  print(f"Extracted category URLs: {target_urls}")
else:
  print("No category URLs found in the provided URL or error fetching the content.")


Extracted category URLs: ['https://pakmcqs.com/category/english-mcqs', 'https://pakmcqs.com/category/mathematics-mcqs', 'https://pakmcqs.com/category/general_knowledge_mcqs', 'https://pakmcqs.com/category/pakistan-current-affairs-mcqs', 'https://pakmcqs.com/category/world-current-affairs-mcqs', 'https://pakmcqs.com/category/pak-study-mcqs', 'https://pakmcqs.com/category/islamic-studies-mcqs', 'https://pakmcqs.com/category/computer-mcqs', 'https://pakmcqs.com/category/everyday-science-mcqs', 'https://pakmcqs.com/category/physics-mcqs', 'https://pakmcqs.com/category/chemistry-mcqs', 'https://pakmcqs.com/category/biology-mcqs', 'https://pakmcqs.com/category/pedagogy-mcqs', 'https://pakmcqs.com/category/urdu-general-knowledge', 'https://pakmcqs.com/category/finance-mcqs', 'https://pakmcqs.com/category/hrm-mcqs', 'https://pakmcqs.com/category/marketing-mcqs', 'https://pakmcqs.com/category/accounting-mcqs', 'https://pakmcqs.com/category/auditing-mcqs', 'https://pakmcqs.com/category/electrica

In [43]:
all_mcqs = fetch_all_mcqs(target_urls)