<a href="https://colab.research.google.com/github/theinshort/crawler/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WEB DATA CRAWLING

In this project, our aim is to scrape a website using Python.

`This project is for learning purposes only and is not intended to commit any kind of violation.`





## Objective
We will use a web crawler to acquire data from a specific domain.

1. Choose domains of interest (e.g., news articles, product reviews, scientific publications, etc.).
2. Identify and use web crawling tools or libraries (such as BeautifulSoup, Scrapy, or others) to extract data from the chosen domains.
3. Collect a sufficient amount of data to ensure diversity and relevance.
4. Scrape and clean the HTML contents to generate clean text outputs (at least 2 GB of textual data; the more, the better).

## Final Outcome


1. Colab Notebook:
- Showcases the entire process of web data crawling, including the
chosen domains, code implementation, and data extraction.
- Clearly comment and document each step in the notebook.
2. Dataset Files:
- Extracted dataset in a separate file format (e.g., CSV, JSON) that includes a sample of the collected data.
3. Summary:
- Why specific domains were selected.
- A brief description of the web crawling tools or libraries used, and why.
- Statistics of data extracted from each domain.

## Choosing A Domain
In this section, we will work through the steps needed to finalize our domain of interest. We will consider all ethical and legal concerns before starting the scraping process.

To choose a domain for scraping, we have to understand the complexity of the data and the website structure. We will focus on product-based websites such as shopping stores, because data from these websites is usually available to scrape.

Data is valuable, and scraping a website without permission can be illegal, so before we start scraping we need to check whether the website allows its data to be scraped.

We will perform the following steps to finalize our domains:


1.   Decide what type of data we need to scrape.
2.   Find related websites.
3.   Analyze website content and structure.
4.   Check the website's robots.txt file for restrictions.
5.   Select the website if scraping is allowed.



## Decide what type of data we need to scrape
We will be scraping MCQ data, which is useful in many ways and also poses some difficulties that will help us understand the scraping procedure better.
MCQ data has a structure with a title, multiple options, an answer, explanations, and more. This type of data is useful in machine learning, for example for fine-tuning models to get the desired results.

## Find related websites
Some of the websites with the desired data are as follows:
1. [PakMCQs](https://pakmcqs.com/)
2. [CSSMCQs](https://cssmcqs.com/)
3. [MCQs Forum](https://mcqsforum.com/)
4. [MCQs Planet](https://mcqsplanet.com/)
5. [Top MCQs](https://topmcqs.com/)


## Analyze website content and structure

After analyzing these websites, we have concluded that their structures differ, so we cannot use a single method for all of them; each website has to be handled individually.

## Check website robots.txt file for restrictions

Before scraping a website, we first have to check whether it allows developers and other users to scrape its content. To check the restrictions, we need to analyze the website's robots.txt file.


In [None]:
# Importing required libraries
import requests as req
from urllib.robotparser import RobotFileParser

# Creating a function that fetches the content of a robots.txt file
def get_robots_txt(url):
  """Fetch robots.txt file content from the given URL"""
  # Strip any trailing slash so the path does not contain "//"
  file_url = f"{url.rstrip('/')}/robots.txt"
  # Fetching data from the file using the requests library's get function
  response = req.get(file_url)
  # Check the status of the response before returning
  if response.status_code == 200:
    return response.text
  else:
    return None

def check_restrictions(url, robots_txt):
  """Check the robots.txt rules to see whether the URL is allowed for scraping"""

  # Without any rules, we consider the URL as allowed for scraping
  if not robots_txt:
    return True

  # robots.txt is plain text, not HTML, so it is parsed with the
  # standard library's robots.txt parser instead of an HTML parser
  parser = RobotFileParser()
  parser.parse(robots_txt.splitlines())

  # Apply the rules for the wildcard user agent ("*") to the URL
  return parser.can_fetch("*", url)



The code above does the following:
1. Imports the libraries needed for making web requests and parsing the response.
2. Retrieves the `robots.txt` file from a given URL using the `get_robots_txt()` function.
3. Analyzes the `robots.txt` content in `check_restrictions()` to determine whether a URL may be scraped under the website's rules before proceeding.
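
Because `robots.txt` is a plain-text format, its rules can be tried out offline before any request is sent. The sketch below is a minimal illustration using Python's standard `urllib.robotparser` module; the helper name `is_path_allowed` and the sample rules are hypothetical and are not taken from any of the websites listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""

def is_path_allowed(robots_txt, url, user_agent="*"):
  """Return True if robots_txt permits user_agent to fetch url."""
  parser = RobotFileParser()
  # parse() expects an iterable of lines, not a single string
  parser.parse(robots_txt.splitlines())
  return parser.can_fetch(user_agent, url)

# A category page is not covered by the Disallow rule, an admin page is
print(is_path_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/category/english-mcqs"))
print(is_path_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/wp-admin/options.php"))
```

Note that this parser applies the first matching rule in file order, so results for files that mix `Allow` and `Disallow` lines may differ from other crawlers.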

In [None]:
site_url = "https://pakmcqs.com/"
robots_txt = get_robots_txt(site_url)

if robots_txt:
  # set target url to check for crawling
  target_url = "https://pakmcqs.com/category/english-mcqs"
  allowed = check_restrictions(target_url, robots_txt)
  if allowed:
    print(f"Target URL: '{target_url}' is allowed for Crawling")
  else:
    print(f"Target URL: '{target_url}' is not allowed for Crawling")

else:
  print("robots.txt not found. Ask for confirmation or assume there are no restrictions.")

Target URL: 'https://pakmcqs.com/category/english-mcqs' is allowed for Crawling




After verification, we can confirm that the target url is allowed for scraping and we can proceed to the next step.

## Get Target URLs For Scraping
Now we need to fetch the URLs before starting to scrape the content. This will make it easier to create a function that scrapes the required data.
For this, we have to find the HTML tag which contains the category URLs. Then we create a function that collects all the URLs inside that HTML tag using `href`.

In [None]:
import requests
from bs4 import BeautifulSoup

def get_target_urls(base_url, starts_with, html_tag, tag_id):
  """
  Fetch list of target urls from the given base url and match urls string provided with html tag and tag id

  Args:
      base_url (str): The URL to fetch HTML from.
      starts_with (str):The starting URL to compare with the URLs to fetch.
      html_tag (str): HTML Tag from which urls needs to be fetched.
      tag_id (str): Tag id to fetch target URLs from.

  Returns:
      list: A list of URLs starting with the starts_with Args found within the html_tag with given tag_id.
  """

  # Send a GET request to the provided URL
  user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"

  headers = {
    'User-Agent': user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5'
  }
  response = requests.get(base_url, headers=headers)


  # Check if the URL is active
  if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the element with provided html_tag & tag_id
    tag_content = soup.find(html_tag, id=tag_id)

    # Create an empty list to store the target URLs
    target_urls = []

    # Check if the div is found
    if tag_content:
      # Find all anchor tags (a elements) within the div
      for anchor in tag_content.find_all('a'):
        href = anchor.get('href')
        if href and href.startswith(starts_with):
          target_urls.append(href)

    return target_urls
  else:
    print(f"Error fetching URL: {base_url} - Status code: {response.status_code}")
    return []
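
The function above keeps only `href` values that already start with the full prefix, so relative links such as `/category/english-mcqs` would be skipped, and repeated links would be kept twice. One possible refinement, sketched here with the standard library only, resolves each link against the base URL and removes duplicates before filtering. The helper name `normalize_links` and the sample links are assumptions made for illustration.

```python
from urllib.parse import urljoin

def normalize_links(base_url, hrefs, starts_with):
  """Resolve hrefs against base_url and keep unique ones matching the prefix."""
  seen = set()
  result = []
  for href in hrefs:
    # anchor.get('href') returns None when the attribute is missing
    if not href:
      continue
    absolute = urljoin(base_url, href)
    if absolute.startswith(starts_with) and absolute not in seen:
      seen.add(absolute)
      result.append(absolute)
  return result

# Hypothetical href values as they might appear in a sidebar
links = normalize_links(
  "https://pakmcqs.com/",
  ["/category/english-mcqs", "https://pakmcqs.com/category/pedagogy-mcqs",
   "/about", None, "/category/english-mcqs"],
  "https://pakmcqs.com/category/",
)
print(links)
```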


### Fetching Category Name From URL
To improve the quality of our dataset, we add a separate column for the category. The last element of each URL identifies the MCQ category, so we extract it and store it in our MCQ dictionary.

In [None]:
import re

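def get_category_from_url(url):
  """Extract the category (last path element) from the provided URL using split"""
  # Strip any trailing slash first, otherwise the last element would be empty
  url_parts = url.rstrip("/").split("/")
  return url_parts[-1]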
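
Splitting on `/` works for clean URLs, but a query string or fragment would end up inside the category name. A slightly more defensive variant, shown here only as a sketch, parses the URL first and takes the last non-empty path segment. The function name `category_from_url` is hypothetical, chosen so that it does not clash with the function above.

```python
from urllib.parse import urlparse

def category_from_url(url):
  """Return the last non-empty path segment of url, or '' if there is none."""
  # urlparse separates the path from the query string and fragment
  path = urlparse(url).path
  segments = [segment for segment in path.split("/") if segment]
  return segments[-1] if segments else ""

print(category_from_url("https://pakmcqs.com/category/english-mcqs/?page=2"))
```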



## Writing Code To Extract Data
In this section, we will create a function that extracts all MCQs from a single page, including components such as the question, options, answer, and category. Each MCQ is stored in a dictionary, and the function returns a list of these dictionaries.

In [None]:
import re

import requests
from bs4 import BeautifulSoup

def extract_mcq(url, category):
  """
  Extracts MCQ details (question, options, correct answer) from a given URL.
  Args:
      url (str): The URL of the MCQ page.
      category (str): The category name stored with each MCQ.
  Returns:
      list: A list of dictionaries, one per MCQ found on the page.
  """
  all_mcqs = []
  user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"

  # Send browser-like headers so the request resembles a regular visit.
  request_headers = {
    'User-Agent': user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5'
  }
  response = requests.get(url, headers=request_headers)

  if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the headers containing the MCQ questions
    mcq_headers = soup.find_all('header', class_='entry-header entry-header-index')
    for header in mcq_headers:
      mcq = {"category": category}
      # Extract the question from the strong tag within the header
      question_element = header.find('strong')
      if question_element:
        mcq['question'] = question_element.text.strip()

      # Find the answer choices within the content section
      content = header.find('div', class_='entry-content')
      if content:
        option_elements = content.find_all('p')
        if option_elements:
          # Assuming the first paragraph contains answer choices (modify if needed)
          options = option_elements[0].text.strip().split('\n')
          options_list = [option.strip() for option in options]
          if len(options_list) > 6:
            # Skip this malformed entry instead of abandoning the rest of the page
            continue
          for i, option in enumerate(options_list):
            # Remove only the leading option label such as "A. "
            mcq[f"option {i+1}"] = re.sub(r"^[A-F]\.\s*", "", option)

        # Identify the bold answer by searching for strong tags within paragraphs
        bold_answer = ""
        for option_element in option_elements:
          strong_element = option_element.find('strong')
          if strong_element:
            bold_answer = strong_element.text.strip()
            break  # Exit after finding the first bold element (assuming only one)
        if bold_answer:
          # Keep only the first character of the bold answer (the option letter)
          mcq['correct_answer'] = bold_answer.strip()[0]
          all_mcqs.append(mcq)

  return all_mcqs
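
The option-splitting step can be tested on its own, without fetching a page. The following sketch assumes that options arrive as one text block with one option per line, each prefixed by a label such as `A. `; the helper name `parse_options` and the sample text are hypothetical. Anchoring the pattern at the start of the line keeps label-like text inside an option (for example `U.S.A. Today`) intact.

```python
import re

# Matches a leading label such as "A. " only at the start of an option
OPTION_LABEL = re.compile(r"^[A-F]\.\s*")

def parse_options(paragraph_text):
  """Split paragraph text into options keyed 'option 1', 'option 2', ..."""
  lines = [line.strip() for line in paragraph_text.splitlines() if line.strip()]
  return {
    f"option {i}": OPTION_LABEL.sub("", line)
    for i, line in enumerate(lines, start=1)
  }

sample_text = "A. Noun\nB. Verb\nC. U.S.A. Today\nD. None of these"
print(parse_options(sample_text))
```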


### Get Max Page Numbers
Each URL has multiple pages, and each page contains an equal number of MCQs. We need the exact number of pages so that we can loop through each page and fetch its data.

In [None]:
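def get_max_page_number(url):
  """Get the number of pages available for a single URL"""
  # Fall back to a single page when the pagination block cannot be read
  max_page = 1
  response = requests.get(url)
  if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")
    page_nav = soup.find("div", class_="wpsp-page-nav")
    # A URL with a single page may have no pagination block at all
    page_numbers = page_nav.find_all("a") if page_nav else []
    if len(page_numbers) >= 2:
      # The second-to-last link is expected to hold the final page number
      last_page_text = page_numbers[-2].text.strip().replace(",", "")
      if last_page_text.isdigit():
        max_page = int(last_page_text)

  return max_page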


### Get All MCQs For Single URL
In this function, we iterate from the first page to the last page and fetch the MCQs from each one. We extend the all_mcqs list with the MCQs fetched from each page and then write the result to a CSV file.

In [None]:
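def get_all_page_mcqs(url, last_page):
  """Fetch the MCQs from every page of a URL and append them to mcqs.csv"""
  all_mcqs = []
  category = get_category_from_url(url)
  # Strip any trailing slash so the page URL does not contain "//"
  page_base = url.rstrip("/")
  for x in range(1, last_page+1):
    mcqs = extract_mcq(f"{page_base}/page/{x}", category)
    all_mcqs.extend(mcqs)

  write_mcqs_to_csv(all_mcqs, "mcqs.csv")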
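
An alternative to reading the page count up front is to keep requesting pages until one comes back empty. The sketch below illustrates that idea with a stand-in fetcher so that it runs offline; `collect_until_empty` and `fake_pages` are hypothetical names, and `max_pages` is only a safety limit.

```python
def collect_until_empty(fetch_page, max_pages=1000):
  """Call fetch_page(1), fetch_page(2), ... until a page yields no items."""
  items = []
  for page_number in range(1, max_pages + 1):
    page_items = fetch_page(page_number)
    if not page_items:
      # An empty page is treated as the end of the listing
      break
    items.extend(page_items)
  return items

# Stand-in for a real page fetcher: three pages of fake data, then nothing
fake_pages = {1: ["q1", "q2"], 2: ["q3"], 3: ["q4", "q5"]}
result = collect_until_empty(lambda n: fake_pages.get(n, []))
print(result)
```

With the functions in this notebook, the fetcher could be something like `lambda n: extract_mcq(f"{url}/page/{n}", category)`. The trade-off is that a page that fails to load or parse would also end the loop early.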


### Adding Threading To Speed Up
For each URL, we create a separate thread. This improves performance by processing the URLs concurrently.

In [None]:
import threading as td

def fetch_all_mcqs(urls):
  threads = []
  # Function that works as the target for each thread
  def process_url(url):
    max_page = get_max_page_number(url) # To get the max number of pages a single url has
    get_all_page_mcqs(url, max_page) # Fetch all page mcqs

  for url in urls:
    # Creating threads for each url
    thread = td.Thread(target=process_url, args=(url,))
    threads.append(thread)
    thread.start()

  for thread in threads:
    thread.join()
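
Starting one thread per URL is simple, but the number of simultaneous requests then grows with the number of categories. A bounded pool is one way to cap it. The sketch below uses `concurrent.futures.ThreadPoolExecutor` from the standard library; `run_in_pool` is a hypothetical helper and the default of 8 workers is an arbitrary assumption, not a tuned value.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_pool(worker, urls, max_workers=8):
  """Run worker(url) for every URL using at most max_workers threads."""
  with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # map() keeps the input order and re-raises worker exceptions here
    return list(executor.map(worker, urls))

# str.upper stands in for a real per-URL worker so the example runs offline
results = run_in_pool(str.upper, ["a", "b", "c"], max_workers=2)
print(results)
```

In this notebook the worker would be a function like `process_url`, which returns nothing, so the result list would contain only `None` values.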



### Writing Data To CSV
After getting MCQs from each url, we will be adding MCQs to a csv file.

In [None]:
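import csv
import os
import threading

# Lock that keeps concurrent threads from interleaving their writes
csv_lock = threading.Lock()

def write_mcqs_to_csv(all_mcqs, file_name):
  columns = ["question", "option 1", "option 2", "option 3", "option 4","option 5", "option 6",  "correct_answer", "category"]
  with csv_lock:
    # Write the header only once, when the file is new or still empty
    write_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline='') as csv_file:
      writer = csv.DictWriter(csv_file, fieldnames=columns)
      if write_header:
        writer.writeheader()
      # The writer writes all MCQs at once
      writer.writerows(all_mcqs)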

### Fetching Target URLs of a Single Domain
Here, we fetch all the category URLs from the given HTML tag. These URLs are then used to fetch the MCQs of each category separately.

In [None]:
# Website URL to fetch the list of category URLs from
base_url = "https://pakmcqs.com/"

# This function takes a base URL, the prefix that target URLs must start with, the HTML tag, and the tag id
target_urls = get_target_urls(base_url,'https://pakmcqs.com/category/', 'div', 'secondary')

if target_urls:
  print("Extracted Category URLs Successfully")
else:
  print("No category URLs found in the provided URL or error fetching the content.")


### Fetching All MCQs
We will use those URLs to fetch all the MCQs and add them to a CSV file.

In [None]:
# Fetch All MCQs and write To CSV files
fetch_all_mcqs(target_urls)

## Analyzing Data
Let us check what data we have scraped so far and get some insights.

In [3]:
import csv

def analyze_mcqs(csv_file):
  """
  Analyzes MCQs data from a CSV file and provides comprehensive statistics.

  Args:
      csv_file (str): The path to the CSV file containing MCQ data.

  Returns:
      dict: A dictionary containing detailed statistics about the MCQ data.
  """

  num_mcqs = 0
  category_counts = {}
  answer_lengths = []
  # Track null values in each column
  null_values_per_column = {}

  try:
    with open(csv_file, 'r', newline='') as csvfile:
      reader = csv.DictReader(csvfile)
      headers = reader.fieldnames

      for row in reader:
        # Count number of mcqs
        num_mcqs += 1
        # Count Category for mcqs
        category = row['category']
        category_counts[category] = category_counts.get(category, 0) + 1

        # Extract and analyze answer lengths
        for option in range(1, 7):  # Assuming options are in columns 1-6
          option_key = f"option {option}"
          if option_key in row:
            answer_lengths.append(len(row[option_key]))

        # Check for null values in each column
        for header in headers:
          null_values_per_column[header] = null_values_per_column.get(header, 0)
          if row[header] == '':
            null_values_per_column[header] += 1

  except FileNotFoundError:
    print(f"Error: CSV file '{csv_file}' not found.")
    return {}

  # Calculate statistics for answer lengths
  min_length = min(answer_lengths) if answer_lengths else 0
  max_length = max(answer_lengths) if answer_lengths else 0
  avg_length = sum(answer_lengths) / len(answer_lengths) if answer_lengths else 0

  # Return statistics dictionary
  return {
      "number_of_mcqs": num_mcqs,
      "category_counts": category_counts,
      "answer_length_stats": {
          "min_length": min_length,
          "max_length": max_length,
          "avg_length": avg_length
      },
      "null_values_per_column": null_values_per_column
  }


In [5]:
statistics = analyze_mcqs('mcqs.csv')

if statistics:
  # Print informative statistics of our scraped data
  print(f"Number of MCQs: {statistics['number_of_mcqs']}")

  print("\nCategory Distribution:")
  for category, count in statistics['category_counts'].items():
    print(f"- {category}: {count}")

  print("\nAnswer Length Statistics:")
  print(f"- Minimum answer length: {statistics['answer_length_stats']['min_length']}")
  print(f"- Maximum answer length: {statistics['answer_length_stats']['max_length']}")
  print(f"- Average answer length: {statistics['answer_length_stats']['avg_length']:.2f}")

  print("\nNull Values per Column:")
  for column, count in statistics['null_values_per_column'].items():
    print(f"- {column}: {count}")
else:
  print("No statistics available. Please check if 'mcqs.csv' exists.")


Number of MCQs: 87416

Category Distribution:
- auditing-mcqs: 153
- category: 43
- election-officer-mcqs: 167
- oral-anatomy: 247
- pathology: 333
- pedagogy-mcqs: 373
- oral-histology: 402
- biochemistry: 448
- microbiology: 502
- urdu-general-knowledge: 445
- physiology-mcqs: 522
- general-anatomy-mcqs: 538
- pharmacology: 576
- dental-materials: 627
- statistics-mcqs: 737
- physics-mcqs: 867
- oral-pathology-and-medicine: 953
- hrm-mcqs: 1015
- finance-mcqs: 1050
- computer-mcqs: 1311
- chemistry-mcqs: 1371
- accounting-mcqs: 1457
- islamic-studies-mcqs: 1458
- biology-mcqs: 1512
- marketing-mcqs: 1730
- sociology-mcqs: 1620
- everyday-science-mcqs: 1997
- pak-study-mcqs: 2479
- english-mcqs: 2798
- mathematics-mcqs: 1645
- psychology-mcqs: 1855
- software-engineering-mcqs: 2041
- mechanical-engineering-mcqs: 2137
- agriculture-mcqs: 2828
- judiciary-and-law-mcqs: 2953
- general_knowledge_mcqs: 5453
- civil-engineering-mcqs: 3052
- pakistan-current-affairs-mcqs: 4016
- english-lite
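
The `category: 43` entry in the distribution above is consistent with header rows being counted as data, which can happen when a header line is written on every append to the CSV file. That reading is an assumption, since the file itself is not shown here. Under that assumption, a reader could skip such rows as sketched below; `read_rows_without_repeated_headers` is a hypothetical helper and the sample data is made up.

```python
import csv
import io

def read_rows_without_repeated_headers(csv_file):
  """Yield data rows, skipping rows that merely repeat the header line."""
  reader = csv.DictReader(csv_file)
  for row in reader:
    # A repeated header row has every value equal to its own column name
    if all(row[name] == name for name in reader.fieldnames):
      continue
    yield row

# In-memory sample with a header line repeated in the middle
sample = io.StringIO(
  "question,category\n"
  "What is a noun?,english-mcqs\n"
  "question,category\n"
  "What is 2+2?,mathematics-mcqs\n"
)
rows = list(read_rows_without_repeated_headers(sample))
print(len(rows))
```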

## Conclusion
We have successfully fetched the MCQ data from a single website. The dataset is not large in terms of file size, but it contains a large number of records (87,416 according to the analysis above). MCQ data is not readily available in large amounts, so it is difficult to find websites that contain that much data.