# Introduction

## Web Crawler Research

In [33]:
import requests
x = requests.get('https://willowbendlc.com')
print(x.text)

<!DOCTYPE html><html lang="en-CA"><head><link rel="icon" href="//img1.wsimg.com/isteam/ip/3373a061-5c71-4743-91b3-d0f1ecdee87a/favicon/3ea19da4-0533-4332-a8b2-cd2d704afe20.png/:/rs=w:16,h:16,m" sizes="16x16"/><link rel="icon" href="//img1.wsimg.com/isteam/ip/3373a061-5c71-4743-91b3-d0f1ecdee87a/favicon/3ea19da4-0533-4332-a8b2-cd2d704afe20.png/:/rs=w:24,h:24,m" sizes="24x24"/><link rel="icon" href="//img1.wsimg.com/isteam/ip/3373a061-5c71-4743-91b3-d0f1ecdee87a/favicon/3ea19da4-0533-4332-a8b2-cd2d704afe20.png/:/rs=w:32,h:32,m" sizes="32x32"/><link rel="icon" href="//img1.wsimg.com/isteam/ip/3373a061-5c71-4743-91b3-d0f1ecdee87a/favicon/3ea19da4-0533-4332-a8b2-cd2d704afe20.png/:/rs=w:48,h:48,m" sizes="48x48"/><link rel="icon" href="//img1.wsimg.com/isteam/ip/3373a061-5c71-4743-91b3-d0f1ecdee87a/favicon/3ea19da4-0533-4332-a8b2-cd2d704afe20.png/:/rs=w:64,h:64,m" sizes="64x64"/><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="viewport" content="width=d

**Using Python and HTTP Requests to Get Website Titles**

Below is a breakdown of the main points involved in running the code that fetches and prints the title of a website using Python.

**Introduction to HTTP Requests and Web Scraping**

HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. When you type a URL into your browser, your browser sends an HTTP request to the server where the website is hosted. The server then responds with the requested resources, such as HTML, images, and other data. This process allows you to view and interact with websites.

In web scraping, we automate this process using code to fetch and parse the HTML content of web pages. By doing so, we can extract specific information, such as the title of a website.


**How HTTP Requests Work in Web Scraping**

**1. Sending an HTTP Request:**

We use the **'requests'** library in Python to send an HTTP GET request to a website's server. This request asks the server to send back the HTML content of the webpage.

**2. Receiving the Response:**

The server responds to our request by sending back the HTML content of the webpage. The **'requests'** library allows us to capture this response, which we can then analyze and parse.

**3. Parsing the HTML Content:**

Once we have the HTML content, we use the **'BeautifulSoup'** library to parse it.
This library provides convenient methods to navigate and search through the HTML structure, allowing us to extract the desired information, such as the title of the webpage.
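
To make the parsing step concrete without touching the network, here is a minimal sketch that pulls the title out of an HTML string using only Python's built-in 'html.parser' module, which is the parser that BeautifulSoup's 'html.parser' option relies on. The 'TitleParser' class and the sample HTML are illustrative assumptions and are not part of the script in this notebook.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None  # Stays None when the page has no <title>

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


# Sample HTML standing in for the body of an HTTP response
sample_html = "<html><head><title>Example Domain</title></head><body><p>Hi</p></body></html>"

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)  # Example Domain
```

BeautifulSoup wraps this kind of event handling behind a tree you can query directly, which is why the rest of the notebook uses 'soup.title' instead.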


**Key Steps in Executing the Code**

**1. Setting Up the Environment**

  *   Ensure you have Python installed on your system.
  *   Install the required libraries (**'requests'** and **'beautifulsoup4'**) if they are not already installed.

**2. Importing Libraries**

*   Import the **'requests'** library to handle HTTP requests.
*   Import the **'BeautifulSoup'** class from the bs4 module to parse HTML content.

**3. Defining the Function**

*   Define a function **'get_website_title(url)'** that:
    *   Takes a URL as input.
    *   Sends an HTTP GET request to the URL.
    *   Checks if the request was successful.
    *   Parses the HTML content of the webpage.
    *   Extracts and returns the title of the webpage.
    *   Handles any potential errors gracefully.

**4. Setting the URL**

*   Define a variable **'url'** to store the URL of the website you want to analyze.
   Replace the placeholder URL with the actual URL of the target website.

**5. Fetching the Website Title**

*   Call the **'get_website_title(url)'** function with the defined URL.
*   Store the returned title in a variable **'website_title'**.

**6. Displaying the Result**

*   Print the title of the website to the console.


**Explanations with Code**

**1. Setting Up the Environment**

Ensure you have Python installed on your system. You can download it from python.org.

Install the required libraries using 'pip':

In [34]:
!pip install requests
!pip install beautifulsoup4



**2. Importing Libraries**

We start by importing the necessary libraries:

In [35]:
import requests  # Used to send HTTP requests
from bs4 import BeautifulSoup  # Used to parse HTML content


*   requests: This library allows us to send HTTP requests to fetch data from web servers.
*   BeautifulSoup: This library helps us parse and navigate through HTML content.


**3. Defining the Function**

Define a function that will handle the entire process of fetching and extracting the website title:

In [36]:
def get_website_title(url):
    try:
        # Send a GET request to the website
        response = requests.get(url)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the title from the parsed HTML
            title = soup.title.string

            # Return the title of the website
            return title
        else:
            # If the request was not successful, return a failure message with the status code
            return "Failed to retrieve the webpage. Status code: {}".format(response.status_code)
    except Exception as e:
        # If an error occurs, return an error message
        return "An error occurred: {}".format(e)



*   Function Input: Takes the URL of the website as input.
*   Send HTTP Request: Uses requests.get(url) to fetch the webpage content.
*   Check Response Status: Verifies the request was successful by checking response.status_code.
*   Parse HTML Content: Uses BeautifulSoup to parse the HTML content.
*   Extract Title: Retrieves the content of the 'title' tag.
*   Error Handling: Catches and handles any errors that occur during the process.



**4. Setting the URL**

Define the URL of the website you want to analyze:

In [37]:
url = 'http://example.com'  # Replace with the actual URL

**5. Fetching the Website Title**

Call the function with the URL and store the result:

In [38]:
website_title = get_website_title(url)

**6. Displaying the Result**

Print the title of the website to the console:

In [39]:
print(f"Website Title: {website_title}")

Website Title: Example Domain


By following the steps above, we can fetch and display the title of any website using Python. This process involves sending HTTP requests, parsing HTML content, and extracting specific elements from the HTML.


**Complete Code with Explanations**

Below is the complete code, combining all the steps listed above:

## Title Proof-of-Concept

In [None]:
import requests
from bs4 import BeautifulSoup

def get_website_title(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            title = soup.title.string
            return title
        else:
            return "Failed to retrieve the webpage. Status code: {}".format(response.status_code)
    except Exception as e:
        return "An error occurred: {}".format(e)

url = 'https://willowbendlc.com'

website_title = get_website_title(url)
print(f"Website Title: {website_title}")


## Breakdown of the Enhanced Script

1. Importing Libraries:

In [None]:
import requests
from bs4 import BeautifulSoup
import re

2. Defining the Function:

In [None]:
def get_website_info(url):

3. Sending an HTTP Request:

In [None]:
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    return "Failed to retrieve the webpage. Status code: {}".format(response.status_code)

4. Parsing the HTML:

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

5. Extracting the Title:

In [None]:
title = soup.title.string if soup.title else "No title found"

6. Extracting Meta Description and Keywords:

In [None]:
description = soup.find('meta', attrs={'name': 'description'})
description = description['content'] if description else "No meta description found"

keywords = soup.find('meta', attrs={'name': 'keywords'})
keywords = keywords['content'] if keywords else "No meta keywords found"

7. Calculating Word Counts:

In [None]:
home_page_word_count = len(soup.get_text().split())
page_word_count = len(soup.body.get_text().split()) if soup.body else 0

8. Counting Videos:

In [None]:
video_count = len(soup.find_all('video'))



9. Estimating Scroll Length:

In [None]:
scroll_length = page_word_count / 100  # Simplistic estimate, adjust as needed



10. Counting Links:

In [None]:
link_count = len(soup.find_all('a'))



11. Simulating SEO Page Rank:

In [None]:
seo_page_rank = "Simulated SEO Page Rank: 3/10"  # Placeholder value


12. Counting Images:

In [None]:
image_count = len(soup.find_all('img'))

13. Counting Headers:

In [None]:
header_count = sum([len(soup.find_all(f'h{i}')) for i in range(1, 7)])


14. Finding External Links:

In [None]:
external_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].startswith('http')]


15. Finding Internal Links:

In [None]:
internal_links = [a['href'] for a in soup.find_all('a', href=True) if not a['href'].startswith('http')]
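
The prefix check above labels every absolute URL as external, including links that point back to the same site, and it labels 'tel:' and '#' links as internal. A possible refinement, sketched here with the standard library's 'urllib.parse', compares hostnames instead. The 'classify_links' function and its three-way split are assumptions made for illustration; they are not part of the script in this notebook.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split hrefs into internal, external, and non-web links (hypothetical helper)."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external, other = [], [], []
    for href in hrefs:
        # Resolve relative paths such as '/m/account' against the page URL
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            other.append(href)          # e.g. tel: or mailto: links
        elif parsed.netloc.lower() == base_host:
            internal.append(absolute)   # same host as the page
        else:
            external.append(absolute)   # different host
    return internal, external, other


hrefs = [
    "https://willowbendlc.com/ola/services/center-tour",
    "https://www.facebook.com/wblc1/",
    "tel:9728671871",
    "/m/account",
]
internal, external, other = classify_links("https://willowbendlc.com", hrefs)
print("Internal:", internal)
print("External:", external)
print("Other:", other)
```

Note that an exact hostname match treats 'www.' variants and subdomains as external; whether that is desirable depends on the site being analyzed.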


**Extra Steps for Enhanced Functionality**

1. **Setting Up the Environment:**


*   Ensure you have the necessary libraries installed: 'requests' for sending HTTP requests and 'BeautifulSoup' for parsing HTML content.
*   Install additional libraries if needed, such as 'lxml' for XML parsing or 'pandas' for data manipulation.

2. **Sending an HTTP Request:**

*   Use the 'requests' library to send a GET request to the target URL.
*   Handle potential errors such as network issues, invalid URLs, or server errors.

3. **Parsing HTML Content:**

*   Use 'BeautifulSoup' to parse the HTML content of the webpage.
*   Choose the appropriate parser ('html.parser', 'lxml', etc.) based on the complexity of the webpage.

4. **Extracting the Title:**

*   Locate the 'title' tag and extract its content.
*   Handle cases where the title tag might be missing.

5. **Extracting Meta Tags (SEO Tags):**

*   Locate and extract meta tags such as description and keywords.
*   Handle cases where these tags might be missing or malformed.

6. **Calculating Word Counts:**

*   Calculate the total word count of the home page and specific sections like the body content.
*   Use methods to clean and split the text content accurately.
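
As one way to clean the text before counting, the sketch below uses the 're' module to count only word-like tokens, so stand-alone punctuation is not counted as a word. The 'count_words' helper and its ASCII-only pattern are assumptions for illustration; pages in other languages would need a broader pattern.

```python
import re


def count_words(text):
    """Count word-like tokens, ignoring stand-alone punctuation (hypothetical helper)."""
    # Runs of letters/digits, optionally joined by apostrophes (e.g. "child's")
    return len(re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text))


sample = "Schedule a tour today! -- child's growth & development."
naive_count = len(sample.split())   # Whitespace split also counts "--" and "&"
clean_count = count_words(sample)
print(naive_count, clean_count)  # 9 7
```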

7. **Counting Videos:**

*   Locate and count 'video' tags on the webpage.
*   Handle different video embedding methods (e.g., embedded iframes).
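
Counting only 'video' tags misses players embedded through iframes. The following sketch, built on the standard library's 'html.parser', counts both. The 'VideoCounter' class, the 'VIDEO_HOSTS' tuple, and the sample markup are all assumptions; the host list is a rough starting point and would need to be extended for other providers.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed list of hosts that serve embedded video players
VIDEO_HOSTS = ("youtube.com", "youtube-nocookie.com", "youtu.be", "vimeo.com")


class VideoCounter(HTMLParser):
    """Counts native <video> tags and iframes pointing at known video hosts."""

    def __init__(self):
        super().__init__()
        self.native = 0
        self.embedded = 0

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self.native += 1
        elif tag == "iframe":
            src = dict(attrs).get("src") or ""
            host = urlparse(src).netloc.lower()
            # Suffix match so that 'www.' and 'player.' subdomains are included
            if host.endswith(VIDEO_HOSTS):
                self.embedded += 1


# Hypothetical markup with one native video and two embedded players
sample_html = (
    '<video src="intro.mp4"></video>'
    '<iframe src="https://www.youtube.com/embed/abc123"></iframe>'
    '<iframe src="//player.vimeo.com/video/1"></iframe>'
    '<iframe src="https://example.com/map"></iframe>'
)
counter = VideoCounter()
counter.feed(sample_html)
print(counter.native, counter.embedded)  # 1 2
```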

8. **Estimating Scroll Length:**

*   Estimate the scroll length based on the total word count or other content metrics.
*   This is a heuristic approach and might need adjustments based on actual usage.
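
The heuristic can be made adjustable by turning the divisor into a parameter. This is a minimal sketch; the 'estimate_scroll_length' name and the 'words_per_screen' parameter are hypothetical, and the default of 100 simply mirrors the value used in the script.

```python
def estimate_scroll_length(word_count, words_per_screen=100):
    """Estimate how many screens of text a page holds (rough heuristic)."""
    if words_per_screen <= 0:
        raise ValueError("words_per_screen must be positive")
    return word_count / words_per_screen


# 515 is the body word count reported for the sample site in this notebook
print(estimate_scroll_length(515))       # 5.15
print(estimate_scroll_length(515, 250))  # 2.06
```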

9. **Counting Links:**

*   Locate and count all 'a' tags representing hyperlinks.
*   Consider both internal and external links for a comprehensive count.

10. **Counting Images:**

*   Count the number of 'img' tags to determine the amount of image content.

11. **Counting Scripts:**

*   Count the number of 'script' tags to get an idea of the amount of JavaScript on the page.
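
Image, script, and header counts can all come from a single pass over the HTML. The sketch below uses the standard library's 'html.parser' and 'collections.Counter'; with BeautifulSoup the equivalent would be 'len(soup.find_all('script'))'. The 'TagCounter' class and the sample markup are assumptions for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Tallies every opening tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img/>
        self.counts[tag] += 1


# Hypothetical page fragment
sample_html = (
    '<body><h1>A</h1><h2>B</h2>'
    '<img src="a.png"/><img src="b.png">'
    '<script>var x = 1;</script>'
    '<a href="/">Home</a></body>'
)
tag_counter = TagCounter()
tag_counter.feed(sample_html)

script_count = tag_counter.counts["script"]
image_count = tag_counter.counts["img"]
header_count = sum(tag_counter.counts[f"h{i}"] for i in range(1, 7))
print(script_count, image_count, header_count)  # 1 2 2
```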

12. **Extracting Structured Data:**

*   Extract JSON-LD structured data used for SEO and rich snippets.
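
JSON-LD lives in 'script' tags whose type is 'application/ld+json', so extracting it means finding those tags and decoding their contents with 'json.loads'. The sketch below does this with the standard library only. The 'JsonLdParser' class and the sample data are hypothetical, and malformed blocks are silently skipped, which is a design choice rather than a requirement.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects decoded JSON-LD blocks from a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole crawl


# Hypothetical page with one structured-data block and one ordinary script
sample_html = (
    '<head><script type="application/ld+json">'
    '{"@type": "Preschool", "name": "Sample School"}'
    '</script><script>var x = 1;</script></head>'
)
jsonld = JsonLdParser()
jsonld.feed(sample_html)
print(jsonld.items)
```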

13. **Extracting Social Media Links:**

*   Extract links to social media profiles to understand the website’s social presence.
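
One simple approach is to filter the collected hrefs by hostname. In the sketch below, the 'SOCIAL_DOMAINS' set and the 'find_social_links' function are assumptions; the domain list is illustrative and far from exhaustive. The input links are the ones reported for the sample site in this notebook.

```python
from urllib.parse import urlparse

# Assumed set of social media hosts; extend as needed
SOCIAL_DOMAINS = {
    "facebook.com", "instagram.com", "twitter.com",
    "x.com", "linkedin.com", "youtube.com",
}


def find_social_links(hrefs):
    """Return the hrefs whose host is a known social media domain."""
    found = []
    for href in hrefs:
        host = urlparse(href).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # Treat www.facebook.com like facebook.com
        if host in SOCIAL_DOMAINS:
            found.append(href)
    return found


links = [
    "https://willowbendlc.com/ola/services/center-tour",
    "https://childcare.hhs.texas.gov/Public/childcaresearch",
    "https://www.facebook.com/wblc1/",
    "tel:9728671871",
]
social_links = find_social_links(links)
print(social_links)  # ['https://www.facebook.com/wblc1/']
```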

14. **Simulating SEO Page Rank:**

*   While actual SEO page rank requires external tools or APIs, simulate a placeholder value for demonstration purposes.








# Auto Crawler

Here is the complete enhanced script, combining the steps above:


In [29]:
import requests
from bs4 import BeautifulSoup
import re


def get_website_info(url):
    try:
        # Send a GET request; the timeout keeps the call from hanging indefinitely
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            # Parse the returned HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Title and SEO meta tags, with fallbacks when a tag is missing
            title = soup.title.string if soup.title else "No title found"
            description = soup.find('meta', attrs={'name': 'description'})
            description = description['content'] if description else "No meta description found"
            keywords = soup.find('meta', attrs={'name': 'keywords'})
            keywords = keywords['content'] if keywords else "No meta keywords found"

            # Word counts: whole document vs. the <body> element only
            home_page_word_count = len(soup.get_text().split())
            page_word_count = len(soup.body.get_text().split()) if soup.body else 0

            # Media and structure counts
            video_count = len(soup.find_all('video'))
            scroll_length = page_word_count / 100  # Rough heuristic: 100 words per "screen"
            link_count = len(soup.find_all('a'))
            image_count = len(soup.find_all('img'))
            header_count = sum([len(soup.find_all(f'h{i}')) for i in range(1, 7)])

            # Links are split on whether the href starts with 'http': absolute URLs
            # (including same-site ones) vs. everything else (relative paths, tel:, #)
            external_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].startswith('http')]
            internal_links = [a['href'] for a in soup.find_all('a', href=True) if not a['href'].startswith('http')]

            # Return the metrics; link lists are truncated to the first five entries
            return {
                "Title": title,
                "Meta Description": description,
                "Meta Keywords": keywords,
                "Home Page Word Count": home_page_word_count,
                "Page Word Count": page_word_count,
                "Video Count": video_count,
                "Scroll Length": scroll_length,
                "Link Count": link_count,
                "Image Count": image_count,
                "Header Count": header_count,
                "External Links": external_links[:5],
                "Internal Links": internal_links[:5]
            }
        else:
            # Non-200 responses are reported with their status code
            return "Failed to retrieve the webpage. Status code: {}".format(response.status_code)
    except Exception as e:
        # Network errors, missing attributes, and similar failures are reported as a message
        return "An error occurred: {}".format(e)

url = 'https://willowbendlc.com'
test = get_website_info(url)
print(test)

{'Title': 'Preschool Programs at Willow Bend Learning Center', 'Meta Description': "Explore our top-rated preschool programs designed to nurture your child's growth and development. Schedule a tour today!", 'Meta Keywords': 'No meta keywords found', 'Home Page Word Count': 522, 'Page Word Count': 515, 'Video Count': 0, 'Scroll Length': 5.15, 'Link Count': 87, 'Image Count': 11, 'Header Count': 23, 'External Links': ['https://willowbendlc.com/ola/services/center-tour', 'https://childcare.hhs.texas.gov/Public/childcaresearch', 'https://www.facebook.com/wblc1/', 'https://www.privacypolicies.com/live/0366be6b-9501-43bc-b911-b3f85f20040e'], 'Internal Links': ['tel:9728671871', '/', '#', '#', '/m/account']}


## Toheeb, please now add the ability to check these same metrics for competitors, compare the numbers side by side, and pull basic insights such as "your word count is very low" or "you need more pictures", all based on a competitive analysis of 3-4 sites
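
As a starting point for that request, here is a rough sketch of the comparison step. It assumes each site's metrics are already available as a dictionary shaped like the output of 'get_website_info'. The 'compare_with_competitors' function, the 'NUMERIC_METRICS' list, and the 75% threshold are all assumptions, and the competitor numbers are made up for illustration; only the 'own' values come from the output above.

```python
from statistics import mean

# Metrics from get_website_info that can be compared numerically
NUMERIC_METRICS = ["Page Word Count", "Image Count", "Video Count", "Link Count", "Header Count"]


def compare_with_competitors(own, competitors, low_ratio=0.75):
    """Flag metrics where the site falls well below the competitor average."""
    insights = []
    for metric in NUMERIC_METRICS:
        values = [c[metric] for c in competitors if metric in c]
        if not values or metric not in own:
            continue
        average = mean(values)
        # Flag the metric only when it is below low_ratio of the average
        if average > 0 and own[metric] < low_ratio * average:
            insights.append(
                f"{metric} is low: {own[metric]} vs. competitor average {average:.1f}"
            )
    return insights


# Values reported for the sample site earlier in this notebook
own = {"Page Word Count": 515, "Image Count": 11, "Video Count": 0,
       "Link Count": 87, "Header Count": 23}

# Hypothetical competitor metrics, for illustration only
competitors = [
    {"Page Word Count": 1200, "Image Count": 10, "Video Count": 0,
     "Link Count": 90, "Header Count": 20},
    {"Page Word Count": 900, "Image Count": 14, "Video Count": 0,
     "Link Count": 70, "Header Count": 25},
]

insights = compare_with_competitors(own, competitors)
for line in insights:
    print(line)
```

In practice the competitor dictionaries would come from calling 'get_website_info' on each competitor URL and skipping any call that returns an error string instead of a dictionary.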