A collection of three data scraping projects showcasing different scraping methods: HTML scraping with `requests` and `BeautifulSoup`, and API scraping via RapidAPI. Each project covers data extraction, preprocessing, and storage of the results in a structured format for analysis.
Objective: Scrape UHD TV product listings from Flipkart for basic price and feature comparison.
- Source: flipkart.com
- Method: HTML scraping using `requests` and `BeautifulSoup`
- Pages Scraped: 10
- Total Rows: 240 products
- Columns: 7 (`Name`, `Price`, `Rating`, `Discount`, `Launch Year`, `Operating System`, `Delivery Type`)
- Output: CSV file
Key challenges (see the sketch below):
- Handling pagination across result pages
- Dealing with inconsistent HTML tags and missing data
- Setting request headers and spoofing the user agent
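A minimal sketch of this approach, assuming a hypothetical search URL and placeholder CSS selectors (Flipkart's real class names change often, so they are not reproduced here); it shows the pagination loop, the spoofed headers, and guards for missing fields:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {
    # Spoof a desktop browser so the server returns the full HTML page
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

rows = []
for page in range(1, 11):  # 10 result pages
    # Hypothetical search URL and query parameters
    url = f"https://www.flipkart.com/search?q=uhd+tv&page={page}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for card in soup.select("div.product-card"):  # placeholder selector
        name = card.select_one("div.title")       # placeholder selector
        price = card.select_one("div.price")      # placeholder selector
        # Missing fields are common, so guard every lookup
        rows.append({
            "Name": name.get_text(strip=True) if name else None,
            "Price": price.get_text(strip=True) if price else None,
        })
    time.sleep(1)  # be polite between page requests

with open("flipkart_uhd_tvs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Price"])
    writer.writeheader()
    writer.writerows(rows)
```

Only two of the seven columns are shown; the remaining fields follow the same guarded-lookup pattern.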
Objective: Extract book details for the dystopian genre to analyze popularity and patterns.
- Source: goodreads.com
- Method: HTML scraping using `requests` and `BeautifulSoup`
- Pages Scraped: 40
- Total Rows: 3,636 books
- Columns: 6 (`Book Title`, `Author`, `Ratings`, `Avg Rating`, `Score`, `Total Votes`)
- Output: CSV file
Key challenges (see the sketch below):
- Extracting data nested inside the HTML
- Managing a long pagination run without being blocked by the server
- Parsing numeric and textual data out of strings
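A similarly hedged sketch for the Goodreads scrape, assuming a hypothetical list URL and placeholder selectors; it illustrates pulling nested rating text out of each row, parsing numbers from strings with regular expressions, and pacing requests so the server does not block the run:

```python
import csv
import re
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

books = []
for page in range(1, 41):  # 40 list pages
    # Hypothetical list URL; substitute the actual Goodreads list being scraped
    url = f"https://www.goodreads.com/shelf/show/dystopia?page={page}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for row in soup.select("tr.book-row"):  # placeholder selector
        title = row.select_one("a.bookTitle")        # placeholder selector
        author = row.select_one("a.authorName")      # placeholder selector
        rating_el = row.select_one("span.minirating")  # placeholder selector
        rating_text = rating_el.get_text(strip=True) if rating_el else ""

        # Parse numbers out of text such as "4.05 avg rating, 1,234,567 ratings"
        avg = re.search(r"([\d.]+)\s+avg", rating_text)
        count = re.search(r"([\d,]+)\s+ratings", rating_text)
        books.append({
            "Book Title": title.get_text(strip=True) if title else None,
            "Author": author.get_text(strip=True) if author else None,
            "Avg Rating": float(avg.group(1)) if avg else None,
            "Ratings": int(count.group(1).replace(",", "")) if count else None,
        })
    time.sleep(2)  # pause between pages to avoid being blocked

with open("dystopian_books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Book Title", "Author", "Avg Rating", "Ratings"])
    writer.writeheader()
    writer.writerows(books)
```

The remaining columns (`Score`, `Total Votes`) come from the same rows and are parsed with the same regex approach.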
Objective: Collect tweet and user metadata using RapidAPI to experiment with API data extraction.
- Source: RapidAPI - twitter154.p.rapidapi.com
- Method: API scraping using `requests` with API key authentication (see the sketch below)
- Total Rows: 83 tweets
- Columns: 27 (tweet_id, creation_date, text, media_url, video_url, user, language, favorite_count, retweet_count, reply_count, quote_count, retweet, views, timestamp, video_view_count, in_reply_to_status_id, quoted_status_id, binding_values, expanded_url, retweet_tweet_id, extended_entities, conversation_id, retweet_status, quoted_status, bookmark_count, source, community_note)
- Output: CSV or JSON
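A minimal sketch of the API call, assuming a hypothetical endpoint path, query parameters, and response key (the twitter154 listing on RapidAPI documents the real ones); the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers carry the authentication, and `pandas.json_normalize` is used here to flatten the nested tweet objects before writing both CSV and JSON:

```python
import json
import os

import pandas as pd
import requests

API_KEY = os.environ["RAPIDAPI_KEY"]  # never hard-code the key in the script

url = "https://twitter154.p.rapidapi.com/search/search"  # hypothetical endpoint
headers = {
    "X-RapidAPI-Key": API_KEY,
    "X-RapidAPI-Host": "twitter154.p.rapidapi.com",
}
params = {"query": "data science", "limit": 100}  # hypothetical parameters

resp = requests.get(url, headers=headers, params=params, timeout=30)
resp.raise_for_status()
payload = resp.json()

# The key holding the tweet list depends on the API's response shape
tweets = payload.get("results", [])

# Flatten nested tweet/user objects into tabular columns and persist both formats
df = pd.json_normalize(tweets)
df.to_csv("tweets.csv", index=False)
with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump(tweets, f, ensure_ascii=False, indent=2)
```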