### **Lists vs Dictionaries**
When working on data scraping tasks, it is essential to be familiar with core Python data structures such as lists and dictionaries, because they are key to processing and organizing scraped data efficiently.

In [2]:
"""
Objective: Create a List of URLs
"""
url = "https://example.com/page-1"

# TODO: From the url, extract the main url
print(url)
# TODO: Create the list of URLs for the next 5 pages
# Expected Output: ['https://example.com/page-1', 'https://example.com/page-2', 'https://example.com/page-3', 'https://example.com/page-4', 'https://example.com/page-5']
# Extract the base URL (everything before the page number)
base_url = url.rsplit('-', 1)[0]

# Create list of URLs for pages 1-5
urls = [f"{base_url}-{i}" for i in range(1, 6)]
print(urls)

https://example.com/page-1
['https://example.com/page-1', 'https://example.com/page-2', 'https://example.com/page-3', 'https://example.com/page-4', 'https://example.com/page-5']
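
The split-and-rebuild step above can be wrapped in a small reusable helper. This is only a sketch: `build_page_urls` is a hypothetical name, and it assumes the URL always ends in `-<page number>`.

```python
def build_page_urls(first_url, count):
    """Return `count` consecutive page URLs, assuming first_url ends in '-<page number>'."""
    # Split only on the last hyphen so hyphens elsewhere in the URL are untouched
    base_url, first_page = first_url.rsplit("-", 1)
    start = int(first_page)
    return [f"{base_url}-{page}" for page in range(start, start + count)]


page_urls = build_page_urls("https://example.com/page-1", 5)
print(page_urls)
```

If the URL does not end in a number, `int()` raises a `ValueError`, which is usually preferable to silently building wrong URLs.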


In [8]:
"""
Objective: Extend the List of URLs
"""
urls = ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3", "https://example.com/page-4", "https://example.com/page-5"]
new_urls = ["https://example.com/page-6", "https://example.com/page-7", "https://example.com/page-8", "https://example.com/page-9", "https://example.com/page-10"]

# TODO: The urls have 5 elements
print(f"The urls have {len(urls)} elements")
# TODO: Add the new_urls to the urls to get 10 elements
# TODO: Print the length of urls
# Expected Output: 10
urls.extend(new_urls)
print(urls)
print(f"The urls have {len(urls)} elements")

The urls have 5 elements
['https://example.com/page-1', 'https://example.com/page-2', 'https://example.com/page-3', 'https://example.com/page-4', 'https://example.com/page-5', 'https://example.com/page-6', 'https://example.com/page-7', 'https://example.com/page-8', 'https://example.com/page-9', 'https://example.com/page-10']
The urls have 10 elements
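
It may help to compare `extend()` with the two operations it is most often confused with. A minimal sketch using shortened, made-up list values (`first`, `second`, and the other names are illustrative):

```python
first = ["page-1", "page-2"]
second = ["page-3", "page-4"]

# '+' builds a new list and leaves both originals unchanged
combined = first + second

# append() adds its argument as ONE element, producing a nested list
nested = first.copy()
nested.append(second)

# extend() adds each element of the other list individually, in place
flat = first.copy()
flat.extend(second)

print(len(combined))
print(len(nested))
print(len(flat))
```

`extend()` modifies the list it is called on and returns `None`, so the order of the result depends on which list calls it.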


In [12]:
"""
Objective: Extract Data from Nested Lists
"""
data = [["title1", "url1"], ["title2", "url2"], ["title3", "url3"]]

# TODO: Extract the title and url from the data
# Expected Output:
# title1 url1
# title2 url2
# title3 url3

for item in data:
    print(item[0], item[1])

title1 url1
title2 url2
title3 url3


In [13]:
"""
Objective: Remove duplicate elements from a List
"""
data = ["https://www.example1.com", "https://www.example1.com", "https://www.example2.com", "https://www.example2.com", "https://www.example3.com"]

# TODO: Remove duplicates from the data
# Expected result : 
# ['https://www.example1.com', 'https://www.example2.com', 'https://www.example3.com']
# Method 1: Using set() to remove duplicates (the original order is not preserved)
unique_data = list(set(data))
print(unique_data)

# Method 2: Using dict.fromkeys() to remove duplicates while keeping the original order
unique_data = list(dict.fromkeys(data))
print(unique_data)

['https://www.example2.com', 'https://www.example1.com', 'https://www.example3.com']
['https://www.example1.com', 'https://www.example2.com', 'https://www.example3.com']
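
When URLs arrive one at a time during a crawl, an order-preserving deduplication can also be done incrementally with a set of already-seen values. A minimal sketch; `dedupe_keep_order` and `links` are hypothetical names:

```python
def dedupe_keep_order(items):
    """Return the items with duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:  # set membership checks are fast on average
            seen.add(item)
            unique.append(item)
    return unique


links = [
    "https://www.example1.com",
    "https://www.example1.com",
    "https://www.example2.com",
    "https://www.example3.com",
    "https://www.example2.com",
]
unique_links = dedupe_keep_order(links)
print(unique_links)
```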


In [17]:
"""
Objective: Create a Dictionary for Scraped Data
"""
urls = ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3", "https://example.com/page-4", "https://example.com/page-5"]

# Function to scrape data
def scrape_data(url):
    # Extracted data
    data = dict()
    # Get page number from URL
    page_num = url.split('-')[-1]
    # TODO: Add the title to the dictionary with value "Example Title 1" for page 1
    # TODO: Add the url to the dictionary with value "https://example.com/page-1" for page 1
    # Add title and URL to dictionary
    data['title'] = f"Example Title {page_num}"
    data['url'] = url
    return data

# TODO: Loop through the urls and call the scrape_data function for each url
# TODO: Append the returned data to the scraped_data list
# TODO: Print the scraped_data
# Expected Output:
# [{'title': 'Example Title 1', 'url': 'https://example.com/page-1'}, {'title': 'Example Title 2', 'url': 'https://example.com/page-2'}, {'title': 'Example Title 3', 'url': 'https://example.com/page-3'}, {'title': 'Example Title 4', 'url': 'https://example.com/page-4'}, {'title': 'Example Title 5', 'url': 'https://example.com/page-5'}]
# Create list to store all scraped data
scraped_data = []
# Loop through URLs and collect data
for url in urls:
    result = scrape_data(url)
    scraped_data.append(result)
print(scraped_data)

[{'title': 'Example Title 1', 'url': 'https://example.com/page-1'}, {'title': 'Example Title 2', 'url': 'https://example.com/page-2'}, {'title': 'Example Title 3', 'url': 'https://example.com/page-3'}, {'title': 'Example Title 4', 'url': 'https://example.com/page-4'}, {'title': 'Example Title 5', 'url': 'https://example.com/page-5'}]


In [21]:
"""
Objective: Retrieve Data from a Dictionary
"""
data = [{"title": "Example Title 1", "url": "https://example.com"},
        {"title": "Example Title 2", "url": "https://example.com"},
        {"title": "Example Title 3", "url": "https://example.com"},
        {"title": "Example Title 4", "url": "https://example.com"},
        {"title": "Example Title 5", "url": "https://example.com"}]

# TODO: Use a for loop to loop through the data
# TODO: Use the item["title"] to get the title
# TODO: Use the item.get("url") to get the url
# TODO: Print the title and url
# Expected Output:
# Example Title 1 https://example.com
# Example Title 2 https://example.com
# Example Title 3 https://example.com
# Example Title 4 https://example.com
# Example Title 5 https://example.com

# Loop through each dictionary in the data list
for item in data:
    title = item["title"]
    url = item.get("url")
    print(f"{title} {url}")

Example Title 1 https://example.com
Example Title 2 https://example.com
Example Title 3 https://example.com
Example Title 4 https://example.com
Example Title 5 https://example.com


In [22]:
"""
Objective: Create a List of Dictionaries from Two Lists
"""
titles = ["Example Title 1", "Example Title 2", "Example Title 3", "Example Title 4", "Example Title 5"]
urls = ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3", "https://example.com/page-4", "https://example.com/page-5"]

# TODO: Combine the titles and urls into a list of dictionaries
# Expected Output:
# [{'title': 'Example Title 1', 'url': 'https://example.com/page-1'}, {'title': 'Example Title 2', 'url': 'https://example.com/page-2'}, {'title': 'Example Title 3', 'url': 'https://example.com/page-3'}, {'title': 'Example Title 4', 'url': 'https://example.com/page-4'}, {'title': 'Example Title 5', 'url': 'https://example.com/page-5'}] 

# Method 1: Using zip() and list comprehension
combined_data = [{'title': title, 'url': url} for title, url in zip(titles, urls)]
print(combined_data)

# Method 2: Using zip() with a for loop
combined_data = []
for title, url in zip(titles, urls):
    combined_data.append({'title': title, 'url': url})
print(combined_data)


[{'title': 'Example Title 1', 'url': 'https://example.com/page-1'}, {'title': 'Example Title 2', 'url': 'https://example.com/page-2'}, {'title': 'Example Title 3', 'url': 'https://example.com/page-3'}, {'title': 'Example Title 4', 'url': 'https://example.com/page-4'}, {'title': 'Example Title 5', 'url': 'https://example.com/page-5'}]
[{'title': 'Example Title 1', 'url': 'https://example.com/page-1'}, {'title': 'Example Title 2', 'url': 'https://example.com/page-2'}, {'title': 'Example Title 3', 'url': 'https://example.com/page-3'}, {'title': 'Example Title 4', 'url': 'https://example.com/page-4'}, {'title': 'Example Title 5', 'url': 'https://example.com/page-5'}]
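
One detail worth knowing when scraping: `zip()` stops at the shortest input, so a missing URL silently drops a title. A minimal sketch with deliberately unequal, shortened lists, using `itertools.zip_longest` to pad with `None` instead:

```python
from itertools import zip_longest

titles = ["Example Title 1", "Example Title 2", "Example Title 3"]
urls = ["https://example.com/page-1", "https://example.com/page-2"]  # one URL is missing

# zip() silently drops the unmatched third title
truncated = [{"title": title, "url": url} for title, url in zip(titles, urls)]

# zip_longest() keeps it and fills the missing URL with None
padded = [{"title": title, "url": url} for title, url in zip_longest(titles, urls)]

print(len(truncated))
print(padded[-1])
```

Comparing `len(titles)` and `len(urls)` before combining is another simple way to catch incomplete scrapes.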


In [26]:
"""
Objective: Identify all keys in a Dictionary
"""
data = {"title": "Example Title", "url": "https://example.com", "tags": ["tag1", "tag2", "tag3"], "date": "2022-01-01"}

# TODO: Use keys() method to get all keys
# TODO: Convert the keys to a list
# Expected Output: ['title', 'url', 'tags', 'date']

keys_list = list(data.keys())
print(keys_list)

['title', 'url', 'tags', 'date']


In [31]:
"""
Objective: Loop through a Dictionary
"""
scraped_data = {
                "title": "Example Title",
                "url": "https://example.com",
                "author": "John Doe",
                "tags": ["tag1", "tag2", "tag3"],
                "views": 1000
            }

# TODO: Use .items() method to loop through the dictionary

# TODO: Print each key and value
# Expected Output:
# title: Example Title
# url: https://example.com
# author: John Doe
# tags: ['tag1', 'tag2', 'tag3']
# views: 1000
# Loop through dictionary items
for key, value in scraped_data.items():
    print(f"{key}: {value}")

title: Example Title
url: https://example.com
author: John Doe
tags: ['tag1', 'tag2', 'tag3']
views: 1000


In [34]:
"""
Objective: Extract Data from a Nested Dictionary
"""
scraped_data = [
    {
        "category": "Programming",
        "articles": [
            {
                "title": "How to Learn Python",
                "url": "https://example.com/learn-python",
                "author": "John Doe",
                "tags": ["Python", "Programming", "Tutorial"],
                "views": 1200,
                "comments": [
                    {"user": "Alice", "comment": "Great article!", "likes": 5},
                    {"user": "Bob", "comment": "Very informative.", "likes": 2}
                ]
            },
            {
                "title": "Advanced Python Tips",
                "url": "https://example.com/advanced-python",
                "author": "Jane Smith",
                "tags": ["Python", "Advanced", "Tips"],
                "views": 800,
                "comments": [
                    {"user": "Charlie", "comment": "Helpful for experts.", "likes": 3}
                ]
            }
        ]
    },
    {
        "category": "Web Scraping",
        "articles": [
            {
                "title": "Top 10 Web Scraping Tools",
                "url": "https://example.com/web-scraping-tools",
                "author": "Jane Smith",
                "tags": ["Web Scraping", "Tools", "Technology"],
                "views": 1500,
                "comments": [
                    {"user": "Dave", "comment": "Awesome list!", "likes": 10}
                ]
            },
            {
                "title": "Understanding BeautifulSoup",
                "url": "https://example.com/beautifulsoup",
                "author": "Alice Johnson",
                "tags": ["Web Scraping", "BeautifulSoup", "Python"],
                "views": 1100,
                "comments": [
                    {"user": "Eve", "comment": "Great for beginners.", "likes": 4},
                    {"user": "Frank", "comment": "Clear explanation!", "likes": 6}
                ]
            }
        ]
    },
    {
        "category": "APIs",
        "articles": [
            {
                "title": "Understanding REST APIs",
                "url": "https://example.com/rest-apis",
                "author": "John Doe",
                "tags": ["APIs", "REST", "Web Development"],
                "views": 900,
                "comments": [
                    {"user": "Grace", "comment": "Very clear overview.", "likes": 7}
                ]
            },
            {
                "title": "GraphQL vs REST",
                "url": "https://example.com/graphql-vs-rest",
                "author": "Charlie Brown",
                "tags": ["APIs", "GraphQL", "Comparison"],
                "views": 1300,
                "comments": [
                    {"user": "Hannah", "comment": "Helpful comparison!", "likes": 9}
                ]
            }
        ]
    }
]

# TODO: Show the article data with the highest number of views
# Initialize variables to track the highest views and corresponding article
highest_views = 0
article_with_highest_views = None

# Loop through each category and its articles
for category in scraped_data:
    for article in category['articles']:
        if article['views'] > highest_views:
            highest_views = article['views']
            article_with_highest_views = article

# Print the article with highest views
print(f"Article with highest views ({highest_views}):")
print(f"Title: {article_with_highest_views['title']}")
print(f"Author: {article_with_highest_views['author']}")
print(f"URL: {article_with_highest_views['url']}")
# TODO: Which article has the highest number of comments
# Initialize variables for tracking
most_comments = 0
article_with_most_comments = None

# Loop through each category and its articles
for category in scraped_data:
    for article in category['articles']:
        num_comments = len(article['comments'])
        if num_comments > most_comments:
            most_comments = num_comments
            article_with_most_comments = article

# Print the article with most comments
print(f"\nArticle with most comments ({most_comments} comments):")
print(f"Title: {article_with_most_comments['title']}")
print(f"Author: {article_with_most_comments['author']}")
print("Comments:")
for comment in article_with_most_comments['comments']:
    print(f"- {comment['user']}: {comment['comment']}")
# TODO: Which comment has the highest number of likes
# Initialize variables for tracking highest likes
highest_likes = 0
comment_with_highest_likes = None
article_title = None

# Loop through all categories, articles, and comments
for category in scraped_data:
    for article in category['articles']:
        for comment in article['comments']:
            if comment['likes'] > highest_likes:
                highest_likes = comment['likes']
                comment_with_highest_likes = comment
                article_title = article['title']

# Print the comment with highest likes
print(f"\nComment with highest likes ({highest_likes} likes):")
print(f"Article: {article_title}")
print(f"User: {comment_with_highest_likes['user']}")
print(f"Comment: {comment_with_highest_likes['comment']}")

Article with highest views (1500):
Title: Top 10 Web Scraping Tools
Author: Jane Smith
URL: https://example.com/web-scraping-tools

Article with most comments (2 comments):
Title: How to Learn Python
Author: John Doe
Comments:
- Alice: Great article!
- Bob: Very informative.

Comment with highest likes (10 likes):
Article: Top 10 Web Scraping Tools
User: Dave
Comment: Awesome list!
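
The manual tracking loops above can also be written with the built-in `max()` and a `key` function. A minimal sketch on a trimmed-down copy of the data (the `sample` list and its reduced fields are illustrative, not the full `scraped_data`):

```python
sample = [
    {"category": "Programming", "articles": [
        {"title": "How to Learn Python", "views": 1200},
        {"title": "Advanced Python Tips", "views": 800},
    ]},
    {"category": "Web Scraping", "articles": [
        {"title": "Top 10 Web Scraping Tools", "views": 1500},
    ]},
]

# Flatten the nested structure into one list of articles
all_articles = [article for category in sample for article in category["articles"]]

# max() compares articles by the value returned from the key function
top_article = max(all_articles, key=lambda article: article["views"])
print(top_article["title"], top_article["views"])
```

On an empty list `max()` raises a `ValueError` unless a `default` argument is passed, so real scraped data may need that guard.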


### **Reflection**
What is the difference between using item["key"] and item.get("key")? What happens if the key doesn't exist?


item["key"] (Direct Access):

- Uses square bracket notation for direct dictionary access
- Raises a KeyError exception if the key doesn't exist

In [35]:
item = {"name": "John"}
print(item["name"])  # Output: John
print(item["age"])   # Raises KeyError: 'age'

John


KeyError: 'age'
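
When direct access is preferred but a key might be missing, the `KeyError` can be caught, or the key can be checked with `in` first. A minimal sketch reusing the `item` dictionary from the cell above:

```python
item = {"name": "John"}

# Option 1: catch the exception raised by direct access
try:
    age = item["age"]
except KeyError as error:
    age = None
    print(f"Missing key: {error}")

# Option 2: test for the key before reading it
has_age = "age" in item
print(has_age)
```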

item.get("key") (Safe Access):

- Uses the get() method for safe dictionary access
- Returns None by default if the key doesn't exist
- Can specify a default value as second argument

In [36]:
item = {"name": "John"}
print(item.get("name"))      # Output: John
print(item.get("age"))       # Output: None
print(item.get("age", 25))   # Output: 25 (using default value)

John
None
25
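
The default argument also makes it possible to chain `get()` calls on nested dictionaries without raising an error. A minimal sketch; the `article` structure and its `meta` and `stats` keys are made up for illustration:

```python
article = {"title": "How to Learn Python", "meta": {"views": 1200}}

# The empty dict default lets the second get() run even if "meta" is missing
views = article.get("meta", {}).get("views", 0)

# "stats" does not exist, so the chain falls back to the final default
likes = article.get("stats", {}).get("likes", 0)

print(views)
print(likes)
```

Note that this pattern assumes the intermediate value, when present, is itself a dictionary.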


### **Exploration**
Python's `collections` module provides specialized container datatypes beyond the built-in collection types (lists, tuples, sets, and dictionaries). These containers are designed to make certain tasks more efficient and readable.

1. Counter : Counts occurrences of elements


In [37]:
from collections import Counter

# Count elements in a list
colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
color_count = Counter(colors)
print(color_count)  # Output: Counter({'blue': 3, 'red': 2, 'green': 1})

Counter({'blue': 3, 'red': 2, 'green': 1})
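
In a scraping context, `Counter` is handy for tallying values spread across many records. A minimal sketch that counts tags across a small, made-up `articles` list:

```python
from collections import Counter

articles = [
    {"tags": ["Python", "Programming", "Tutorial"]},
    {"tags": ["Python", "Advanced", "Tips"]},
    {"tags": ["Web Scraping", "BeautifulSoup", "Python"]},
]

# Feed every tag from every article into a single Counter
tag_counts = Counter(tag for article in articles for tag in article["tags"])

# most_common(n) returns the n most frequent items as (item, count) pairs
print(tag_counts.most_common(1))
```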


2. defaultdict : Dictionary with default value for missing keys

In [38]:
from collections import defaultdict

# Create a dictionary with list as default value
grouped_data = defaultdict(list)
grouped_data['fruits'].append('apple')  
print(grouped_data)  

defaultdict(<class 'list'>, {'fruits': ['apple']})
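
A common use of `defaultdict(list)` is grouping records by a shared field without first checking whether the key exists. A minimal sketch with a made-up `articles` list:

```python
from collections import defaultdict

articles = [
    {"title": "How to Learn Python", "author": "John Doe"},
    {"title": "Advanced Python Tips", "author": "Jane Smith"},
    {"title": "Understanding REST APIs", "author": "John Doe"},
]

# Each new author automatically starts with an empty list
titles_by_author = defaultdict(list)
for article in articles:
    titles_by_author[article["author"]].append(article["title"])

print(dict(titles_by_author))
```

Keep in mind that merely reading a missing key with square brackets also creates it, so `in` or `get()` is the safer choice for lookups.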


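3. OrderedDict : Dictionary that remembers insertion order (regular dictionaries also preserve insertion order since Python 3.7; OrderedDict adds order-sensitive equality and reordering methods such as move_to_end())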

In [39]:
from collections import OrderedDict

# Create an ordered dictionary
ordered = OrderedDict()
ordered['first'] = 1
ordered['second'] = 2
print(ordered)  # Output on Python 3.12+: OrderedDict({'first': 1, 'second': 2}); older versions print OrderedDict([('first', 1), ('second', 2)])

OrderedDict({'first': 1, 'second': 2})
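
Two behaviors set `OrderedDict` apart from a regular dictionary: equality between two `OrderedDict` objects takes order into account, and entries can be reordered in place. A minimal sketch (the variable names are illustrative):

```python
from collections import OrderedDict

a = OrderedDict([("first", 1), ("second", 2)])
b = OrderedDict([("second", 2), ("first", 1)])

# Same items, different order: OrderedDicts compare as unequal
same_as_ordered = a == b

# Converted to regular dicts, order is ignored in the comparison
same_as_dict = dict(a) == dict(b)

# move_to_end() moves an existing key to the last position
a.move_to_end("first")

print(same_as_ordered, same_as_dict, list(a))
```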


4. deque : Double-ended queue with fast appends and pops

In [40]:
from collections import deque

# Create a double-ended queue
queue = deque(['a', 'b', 'c'])
queue.append('d')         # Add to right
queue.appendleft('z')     # Add to left
print(queue)  # Output: deque(['z', 'a', 'b', 'c', 'd'])

deque(['z', 'a', 'b', 'c', 'd'])
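
For scraping, a `deque` works well as a queue of URLs waiting to be visited, since `popleft()` is fast. A minimal sketch with no real network requests; the rule that page 1 links to page 3 is invented purely for illustration:

```python
from collections import deque

frontier = deque(["https://example.com/page-1", "https://example.com/page-2"])
visited = []

while frontier:
    url = frontier.popleft()  # take the oldest URL first (FIFO order)
    visited.append(url)
    # Pretend page 1 contains a link to page 3 (illustrative assumption)
    if url.endswith("page-1"):
        frontier.append("https://example.com/page-3")

# maxlen keeps only the most recent items, discarding the oldest automatically
recent = deque(maxlen=2)
for url in visited:
    recent.append(url)

print(visited)
print(list(recent))
```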


5. namedtuple : Tuple subclass with named fields

In [41]:
from collections import namedtuple

# Create a named tuple class
Person = namedtuple('Person', ['name', 'age'])
person = Person('John', 30)
print(person.name)  # Output: John
print(person.age)   # Output: 30

John
30
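
A `namedtuple` can also serve as a lightweight, immutable record for one scraped item. A minimal sketch; the `Article` type and its fields are hypothetical:

```python
from collections import namedtuple

Article = namedtuple("Article", ["title", "url", "views"])
article = Article("How to Learn Python", "https://example.com/learn-python", 1200)

# Unpacks like a regular tuple
title, url, views = article

# Fields cannot be reassigned; _replace() returns a modified copy instead
updated = article._replace(views=1300)

# _asdict() converts the record to a dictionary, e.g. for JSON export
record = article._asdict()

print(updated.views)
print(record)
```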
