These ten quizzes practice file management and web scraping in Python using real-world websites and scenarios. Each quiz tests and builds a different aspect of these skills.

* Quiz 1: Basic File Reading and Writing
* Quiz 2: Web Scraping Basic HTML Data
* Quiz 3: Web Scraping with Pagination
* Quiz 4: Advanced File Operations
* Quiz 5: Web Scraping Dynamic Content
* Quiz 6: Extracting and Analyzing Data from API
* Quiz 7: Scraping and Processing E-commerce Product Data
* Quiz 8: Automated Data Cleaning from a Text File
* Quiz 9: Parsing and Summarizing Data from a News API
* Quiz 10: Web Scraping with JavaScript-Rendered Content

### Quiz 1: Basic File Reading and Writing
**Task**: Write a Python script to read a tab-separated (TSV) file containing movie data from [IMDb](https://www.imdb.com/interfaces/), then convert and save this data into a JSON file. The script should handle basic data cleaning, such as trimming whitespace from strings.

In [1]:
#quiz_01
#data source: https://developer.imdb.com/non-commercial-datasets/
import pandas as pd
# file_path = 'https://datasets.imdbws.com/title.basics.tsv.gz'
file_path = './dataset/title.basics.tsv.gz'
df=pd.read_csv(file_path, sep='\t', low_memory=False)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10345990 entries, 0 to 10345989
Data columns (total 10 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   Unnamed: 0      int64 
 1   tconst          object
 2   titleType       object
 3   primaryTitle    object
 4   originalTitle   object
 5   isAdult         object
 6   startYear       object
 7   endYear         object
 8   runtimeMinutes  object
 9   genres          object
dtypes: int64(1), object(9)
memory usage: 789.3+ MB


|   | Unnamed: 0 | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
| 1 | 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
| 2 | 2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
| 3 | 3 | tt0000004 | short | Un bon bock | Un bon bock | 0 | 1892 | \N | 12 | Animation,Short |
| 4 | 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | 0 | 1893 | \N | 1 | Comedy,Short |
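
The cell above loads the dataset but stops before the cleaning and JSON steps the task asks for. The following is a minimal sketch of those steps using only the standard library. The helper names `rows_to_records` and `save_json` are hypothetical, the sample rows are hand-written in the shape of `title.basics.tsv`, and the sketch assumes fields are unquoted and that `\N` marks a missing value, as the output above suggests.

```python
import csv
import io
import json


def rows_to_records(lines, delimiter="\t", null_marker="\\N"):
    """Parse delimited text into dicts, trimming whitespace and mapping the null marker to None."""
    # Assumption: fields are not quoted, so quote characters are kept literally.
    reader = csv.DictReader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        cleaned = {}
        for key, value in row.items():
            if key is None:  # surplus fields on a malformed row; skipped in this sketch
                continue
            if isinstance(value, str):
                value = value.strip()
            cleaned[key.strip()] = None if value == null_marker else value
        records.append(cleaned)
    return records


def save_json(records, path):
    """Write the cleaned records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)


# Hand-written sample in the shape of title.basics.tsv (note the padded title).
sample = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\n"
    "tt0000001\tshort\t Carmencita \t1894\t\\N\n"
)
records = rows_to_records(io.StringIO(sample))
print(json.dumps(records, indent=2))
```

For the real file, the same function could be fed a handle from `gzip.open(file_path, "rt", encoding="utf-8", newline="")`. With roughly ten million rows, writing the output in chunks may be preferable to building one list in memory.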


### Quiz 2: Web Scraping Basic HTML Data
**Task**: Write a Python script using `BeautifulSoup` to scrape the current top news headlines from [BBC News](https://www.bbc.com/news). Extract the headline text and the corresponding URLs, and save them in a CSV file.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetching the webpage
response = requests.get("https://www.bbc.com/news")
soup = BeautifulSoup(response.content, 'html.parser')

# Extracting headlines and URLs (only <h3> elements wrapped in a link are kept)
articles = soup.find_all('h3')
data = []
for article in articles:
    link = article.find_parent('a')
    if link and link.get('href'):
        data.append({'headline': article.get_text(strip=True), 'url': 'https://www.bbc.com' + link['href']})

# Saving to CSV (the same path that is read back on the next line)
pd.DataFrame(data).to_csv('./dataset/bbc_news_headlines.csv', index=False)
df = pd.read_csv('./dataset/bbc_news_headlines.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   headline  39 non-null     object
 1   url       39 non-null     object
dtypes: object(2)
memory usage: 752.0+ bytes


|   | headline | url |
| --- | --- | --- |
| 0 | More hostages released despite agonising delay | https://www.bbc.com/news/world-middle-east-675... |
| 1 | 'I thought he died' - joy after Thai hostage f... | https://www.bbc.com/news/world-middle-east-675... |
| 2 | India rescuers to dig by hand after drill breaks | https://www.bbc.com/news/world-asia-india-6753... |
| 3 | Crowds cheer Palestinians released from Israel... | https://www.bbc.com/news/world-middle-east-675... |
| 4 | Indo-Chinese cuisine makes a splash in US dining | https://www.bbc.com/news/world-asia-india-6741... |
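
The script builds each URL by prepending `https://www.bbc.com` to the `href`, which produces a broken link whenever the page already supplies an absolute URL. A small sketch of a more tolerant approach with `urllib.parse.urljoin` follows. The helper name `absolute_url` and both example paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com/news"


def absolute_url(href, base=BASE_URL):
    """Resolve a link against the page URL; absolute links pass through unchanged."""
    return urljoin(base, href)


# Hypothetical hrefs, one relative and one already absolute.
relative = absolute_url("/news/world-12345")
already_absolute = absolute_url("https://www.bbc.co.uk/sport/67890")
print(relative)
print(already_absolute)
```

In the scraping cell, this would replace the string concatenation inside the loop that builds `data`.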


### Quiz 3: Web Scraping with Pagination
**Task**: Create a Python script to scrape job listings from the first three pages of [Indeed](https://www.indeed.com) for a specific job title and location. The script should extract the job title, company name, location, and summary of each listing and save it to a CSV file.

In [60]:
import pandas as pd
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from urllib.parse import urlencode

async def scrape_indeed_jobs(job_title, location):
    job_listings = []
    pw = await async_playwright().start()
    browser = await pw.chromium.launch(headless = False)
    page = await browser.new_page()
    for start in range(0, 30, 10):  # start offsets 0, 10, 20 for the first three pages
        params = {
            'q': job_title,
            'l': location,
            'start': start
        }
        url=f'https://th.indeed.com/jobs?{urlencode(params)}'
        _ = await page.goto(url)
        await page.wait_for_selector('div#mosaic-jobResults')
        selector = await page.query_selector('body')
        html = await selector.inner_html()
    
        soup = BeautifulSoup(html, 'html.parser')
        
        for job_card in soup.find_all('div', class_='cardOutline'):
            title = job_card.find('h2', class_='jobTitle').get_text(strip=True)
            company = job_card.find('span', {"data-testid" : "company-name"}).get_text(strip=True)
            # Use a separate name so the `location` search parameter is not overwritten between pages
            location_tag = job_card.find('div', {"data-testid" : "text-location"})
            job_location = location_tag.get_text(strip=True) if location_tag else 'N/A'
            # summary = job_card.find('div', class_='summary').get_text(strip=True)
            job_listings.append({'Job Title': title, 'Company': company, 'Location': job_location})
    
    await browser.close()
    await pw.stop()
    # print(job_listings)
    return job_listings

jobs = await scrape_indeed_jobs('software engineer', 'Pathum Thani')
pd.DataFrame(jobs).to_csv('./dataset/indeed_job_listings.csv', index=False)
df=pd.read_csv('./dataset/indeed_job_listings.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Job Title  30 non-null     object
 1   Company    30 non-null     object
 2   Location   30 non-null     object
dtypes: object(3)
memory usage: 848.0+ bytes


|   | Job Title | Company | Location |
| --- | --- | --- | --- |
| 0 | Test Software Development Engineer | Lumentum Operations LLC | นิคมอุตสาหกรรมนวนคร |
| 1 | Engineer, Mfg Process Sustaining | Lumentum Operations LLC | นิคมอุตสาหกรรมนวนคร |
| 2 | Analysis Engineer | Kubota Research & Development Asia | ปทุมธานี |
| 3 | Structural Engineer, AITS | Asian Institute of Technology | คลองหลวง, ปทุมธานี |
| 4 | Process Engineer | Lumentum Operations LLC | นิคมอุตสาหกรรมนวนคร |
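
The pagination logic can be checked without launching a browser by generating the page URLs on their own. This is a minimal sketch. The helper `build_page_urls` is hypothetical, and the step of 10 between `start` offsets is an assumption taken from the comment in the script, not a documented Indeed parameter.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, query, pages=3, step=10):
    """Return one search URL per results page, advancing the `start` offset by `step`."""
    urls = []
    for page in range(pages):
        params = dict(query, start=page * step)  # copy so the caller's dict is untouched
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls


urls = build_page_urls(
    "https://th.indeed.com/jobs",
    {"q": "software engineer", "l": "Pathum Thani"},
)
for url in urls:
    print(url)
```

Keeping URL construction in its own function also keeps the search parameters separate from values scraped off each job card.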


### Quiz 4: Advanced File Operations
**Task**: Write a Python script to scan a directory containing log files (text files). The script should aggregate error messages from all files, count their occurrences, and output a summary in a new text file. Assume a specific pattern in the log files denotes errors.

In [61]:
# data source https://github.com/logpai/loghub/blob/master/Linux/Linux_2k.log

In [77]:
import pandas as pd
import re

def parse_linux_log(file_path):
    log_data = []

    with open(file_path, 'r') as file:
        for line in file:
            # Regex pattern to extract datetime and the entire message
            match = re.match(r'(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}) (.*)', line)
            if match:
                datetime, message = match.groups()
                severity = "ERROR" if "failure" in message else "WARNING"  # Assuming 'failure' indicates an error
                log_data.append({'datetime': datetime, 'severity': severity, 'message': message})

    return pd.DataFrame(log_data)

# Example usage
log_file_path = './dataset/Linux_2k.log'  # Replace with the actual path
df = parse_linux_log(log_file_path)
df.to_csv('./dataset/structured_log_data.csv', index=False)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546 entries, 0 to 1545
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   datetime  1546 non-null   object
 1   severity  1546 non-null   object
 2   message   1546 non-null   object
dtypes: object(3)
memory usage: 36.4+ KB


|   | datetime | severity | message |
| --- | --- | --- | --- |
| 0 | Jun 14 15:16:01 | ERROR | combo sshd(pam_unix)[19939]: authentication fa... |
| 1 | Jun 14 15:16:02 | WARNING | combo sshd(pam_unix)[19937]: check pass; user ... |
| 2 | Jun 14 15:16:02 | ERROR | combo sshd(pam_unix)[19937]: authentication fa... |
| 3 | Jun 15 02:04:59 | ERROR | combo sshd(pam_unix)[20882]: authentication fa... |
| 4 | Jun 15 02:04:59 | ERROR | combo sshd(pam_unix)[20884]: authentication fa... |
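
The cell above structures a single log file, while the task also asks for a directory scan, occurrence counts, and a text summary. The following is one possible sketch of those parts. The function names are hypothetical, the `failure` marker follows the assumption already made in the cell, and grouping messages by "text before the first semicolon, with the process ID removed" is a simplification chosen for this example.

```python
import re
from collections import Counter
from pathlib import Path

# Same timestamp layout as the regex in the cell above; group 1 is the message.
LINE_RE = re.compile(r"\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} (.*)")


def error_key(message):
    """Collapse a raw message so repeated errors share one key."""
    without_pid = re.sub(r"\[\d+\]", "", message)  # drop the process ID, e.g. [19939]
    return without_pid.split(";", 1)[0].strip()     # keep the text before the first ';'


def summarize_errors(log_dir, marker="failure", pattern="*.log"):
    """Count error messages across every matching file in log_dir."""
    counts = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                match = LINE_RE.match(line)
                if match and marker in match.group(1):
                    counts[error_key(match.group(1))] += 1
    return counts


def write_summary(counts, out_path):
    """Write 'count<TAB>message' lines, most frequent first."""
    lines = [f"{count}\t{key}" for key, count in counts.most_common()]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

A call such as `write_summary(summarize_errors('./dataset'), './dataset/error_summary.txt')` would cover `Linux_2k.log` and any other `.log` files in that folder. The output file name here is only an example.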


### Quiz 5: Web Scraping Dynamic Content
**Task**: Use Python with Selenium to scrape the latest tech news articles from [TechCrunch](https://techcrunch.com/). The script should navigate the site, handle dynamic content loading, and extract the article titles, authors, and publication dates, saving them in a CSV file.

In [82]:
import pandas as pd
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from urllib.parse import urlencode

pw = await async_playwright().start()
browser = await pw.chromium.launch(headless = False)
page = await browser.new_page()
url = 'https://techcrunch.com/'
_ = await page.goto(url)
await page.wait_for_selector('div.content')

selector = await page.query_selector('body')
html = await selector.inner_html()
soup = BeautifulSoup(html, 'html.parser')
# print(soup)

articles =  soup.find_all('article', class_="post-block")

data = []
for article in articles:
    # if not article!=article: break
    title =  article.find('h2').get_text(strip=True) 
    author =  article.find('span', class_ = 'river-byline__authors').get_text(strip=True)
    datetime =  article.find('time', class_ = 'river-byline__full-date-time').get_text(strip=True)
    data.append({'Title': title, 'Author': author, 'Publication Date': datetime})

await browser.close()
await pw.stop()
df=pd.DataFrame(data)
df.to_csv('./dataset/techcrunch.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Title             20 non-null     object
 1   Author            20 non-null     object
 2   Publication Date  20 non-null     object
dtypes: object(3)
memory usage: 608.0+ bytes


|   | Title | Author | Publication Date |
| --- | --- | --- | --- |
| 0 | What startup founders need to know about AI he... | Alex Wilhelm | 5:00 PM GMT+7•November 26, 2023 |
| 1 | Sam Altman returns to OpenAI, Apple adopts RCS... | Kyle Wiggers | 4:16 AM GMT+7•November 26, 2023 |
| 2 | Black Friday online buying hits a record $9.8B... | Ingrid Lunden | 11:36 PM GMT+7•November 25, 2023 |
| 3 | Neuralink, Elon Musk’s brain implant startup, ... | Kyle Wiggers | 11:25 PM GMT+7•November 25, 2023 |
| 4 | Fate of US venture capital in China teeters on... | Rita Liao | 4:32 PM GMT+7•November 25, 2023 |


### Quiz 6: Extracting and Analyzing Data from API
**Task**: Write a Python script to fetch weather data from the [OpenWeatherMap API](https://openweathermap.org/api). Extract temperature, humidity, and weather conditions for a specified city, and write this data to a JSON file. Include error handling for invalid city names.
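
As a starting point, here is a minimal sketch of the fetch, extract, and save steps using only the standard library. It assumes the current-weather endpoint at `api.openweathermap.org/data/2.5/weather`, a response that carries `name`, `main.temp`, `main.humidity`, and `weather[0].description`, and an HTTP 404 for an unknown city. The function names are hypothetical and the sample payload is hand-written, not real data.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoint for current weather; check the API documentation before relying on it.
API_URL = "https://api.openweathermap.org/data/2.5/weather"


def extract_weather(payload):
    """Pick the fields the task asks for out of a decoded response."""
    return {
        "city": payload["name"],
        "temperature": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "conditions": payload["weather"][0]["description"],
    }


def fetch_weather(city, api_key, units="metric"):
    """Request current weather; raise ValueError when the city is not recognized."""
    query = urllib.parse.urlencode({"q": city, "appid": api_key, "units": units})
    try:
        with urllib.request.urlopen(f"{API_URL}?{query}", timeout=10) as response:
            return extract_weather(json.load(response))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed status for an unknown city name
            raise ValueError(f"City not found: {city!r}") from err
        raise


def save_weather(record, path):
    """Write the extracted record to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)


# Hand-written payload in the assumed response shape, for an offline check.
sample_payload = {
    "name": "Bangkok",
    "main": {"temp": 31.5, "humidity": 62},
    "weather": [{"description": "scattered clouds"}],
}
record = extract_weather(sample_payload)
print(record)
```

Separating `extract_weather` from the network call lets the parsing be checked without an API key.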

### Quiz 7: Scraping and Processing E-commerce Product Data
**Task**: Create a Python script to scrape product details from an e-commerce site like [Amazon](https://www.amazon.com). Focus on a specific category (e.g., books, electronics). Extract product names, prices, and ratings, and save them in a pandas DataFrame for further analysis.

### Quiz 8: Automated Data Cleaning from a Text File
**Task**: Write a Python script to read a text file from [Project Gutenberg](https://www.gutenberg.org/). The script should remove all the headers and footers added by Project Gutenberg, count the frequency of each word in the text, and output the top 10 most frequent words to a new file.
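
One way to approach the cleaning step is sketched below. It assumes the boilerplate is delimited by single-line markers of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***`, which may not match every file. The function names and the sample text are hypothetical.

```python
import re
from collections import Counter

# Assumed marker format; some files may use different wording.
START_RE = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers, or the whole text if none are found."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


def top_words(text, n=10):
    """Count lowercase words (allowing one inner apostrophe) and return the n most common."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common(n)


def write_top_words(pairs, path):
    """Write 'word<TAB>count' lines to a new file."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, count in pairs:
            handle.write(f"{word}\t{count}\n")


# Hand-written sample with a header and a footer around a short body.
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The cat saw the dog. The dog ran.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License text the the the\n"
)
body = strip_gutenberg_boilerplate(sample)
print(top_words(body, 2))
```

Whether common stop words such as "the" should be excluded from the top 10 is left open by the task. This sketch counts every word.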

### Quiz 9: Parsing and Summarizing Data from a News API
**Task**: Use the [News API](https://newsapi.org/) to fetch recent news articles on a specific topic (e.g., "climate change"). Write a Python script to parse this data, extracting the article title, source, and publication date, and then summarize this data in a CSV file.
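
The parsing and CSV steps can be sketched independently of the HTTP request. The example below assumes the decoded response holds an `articles` list whose items carry `title`, `source.name`, and `publishedAt`. The function names, the column names, and the sample payload are hypothetical.

```python
import csv
import io

FIELDS = ["title", "source", "published_at"]


def summarize_articles(payload):
    """Reduce each article to title, source name, and publication date."""
    rows = []
    for article in payload.get("articles", []):
        rows.append({
            "title": (article.get("title") or "").strip(),
            "source": (article.get("source") or {}).get("name", ""),
            "published_at": article.get("publishedAt", ""),
        })
    return rows


def write_csv(rows, handle):
    """Write the summary rows with a header line to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Hand-written payload in the assumed response shape, not real data.
sample_payload = {
    "status": "ok",
    "articles": [
        {
            "source": {"id": None, "name": "Example News"},
            "title": " Sample headline ",
            "publishedAt": "2023-11-26T10:00:00Z",
        }
    ],
}
rows = summarize_articles(sample_payload)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`.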

### Quiz 10: Web Scraping with JavaScript-Rendered Content
**Task**: Write a Python script using Selenium to scrape movie ratings and reviews from a site like [Rotten Tomatoes](https://www.rottentomatoes.com/). The script should navigate through a list of movies, handle the dynamically loaded content, and extract the movie title, rating, and a sample of user reviews.