
royyanovski/Data_Mining_Project


Data Mining Project

This is a repo for the data mining project of the ITC data science course.

Description

The aim of this project is to extract product data from eBay and analyze it. The scraper receives its input as CLI arguments.

Input:

  1. search words (str)- the words to search for, separated by whitespace (they enter the script as a list). If a search key consists of more than one word, join its words with "_".
  2. -p/--pages flag (int)- the number of result pages to scrape for each search word. Input pattern: $ python WebScraping.py search_key1 search_key2 ... -p page_num
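The CLI described above can be sketched with `argparse`. This is a minimal illustration, not the repo's actual parser; the function name `parse_args` and the default page count are assumptions.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the CLI described above.
    parser = argparse.ArgumentParser(description="Scrape eBay search results.")
    parser.add_argument("search_words", nargs="+",
                        help="search keys; join multi-word keys with '_'")
    parser.add_argument("-p", "--pages", type=int, default=1,
                        help="number of result pages to scrape per search key")
    args = parser.parse_args(argv)
    # Restore the whitespace inside multi-word search keys.
    args.search_words = [w.replace("_", " ") for w in args.search_words]
    return args

args = parse_args(["laptop_stand", "mouse", "-p", "3"])
print(args.search_words, args.pages)
```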

Output: For each search word, the program stores the following in a database: product title, category, condition, price, shipping cost, supplier's country, search page number, seller name, and seller feedback score. Prices can appear in different currencies and are converted with an up-to-date API ('ExchangeRate-API' from 'rapidapi.com').
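The conversion step can be sketched as below. The real scraper fetches live rates from ExchangeRate-API at runtime; here the rates are a hard-coded illustrative dict, and the function name `convert_currency` follows the call chain listed later in this README.

```python
def convert_currency(amount_ils, rates):
    # rates maps currency code -> units per 1 ILS.
    # In the real scraper these come from ExchangeRate-API on rapidapi.com;
    # the values below are illustrative only.
    return {code: round(amount_ils * rate, 2) for code, rate in rates.items()}

sample_rates = {"USD": 0.30, "EUR": 0.25, "GBP": 0.22}  # illustrative values
print(convert_currency(100.0, sample_rates))
```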

  • As mentioned, the program stores the data in a pre-defined database (named 'ebay_products') with the following structure: ERD (an SQL script for creating the DB can also be found in this repo)

Tables details:

  • products table- includes individual product data. Columns:

    1. product_id (int, pk): auto-incrementing serial number for each product.
    2. seller_id (int, fk): ID of seller.
    3. category_id (int, fk): ID of category.
    4. country_id (int, fk): ID of country.
    5. condition_id (int, fk): ID of condition.
    6. product_name (varchar): the title of the product as it appears on eBay.
    7. product_price (float): the price of the product, excluding shipping, in ILS.
    8. shipping_fee (float): the cost of shipping the product, in ILS.
    9. page_number (int): the number of the eBay search page the product was found on.
  • conditions table- product conditions. Columns:

    1. condition_id (int, pk): ID of condition.
    2. product_condition (varchar): the condition category of the product (new, open box, seller refurbished, certified refurbished, used, for parts or not working).
  • countries table- product origin countries. Columns:

    1. country_id (int, pk): ID of country.
    2. origin_country (varchar): the country the product is shipped from.
  • categories table- maps products to their related eBay categories. Columns:

    1. product_id (int, fk): the serial product ID.
    2. category (varchar): the product's category.
  • sellers table- seller details for each product. Columns:

    1. seller_id (int, pk): ID of seller.
    2. seller_name (varchar): the seller's eBay name.
    3. seller_feedback_score (int): the seller's feedback score, computed by eBay's formula.
  • currency table- prices of products+shipping in different currencies. Columns:

    1. product_id (int, fk): the serial product id.
    2. Israeli_Shekel_ILS: Price in Israeli Shekels.
    3. US_Dollar_USD: Price in US dollars.
    4. EU_Euro_EUR: Price in Euros.
    5. GB_Pound_GBP: Price in British Pound.
    6. China_Yoan: Price in Chinese Yuan.
    7. Russia_Ruble: Price in Russian Rubles.
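A small subset of the schema above can be sketched in code. The repo itself ships an SQL creation script for the 'ebay_products' database; the sketch below uses Python's built-in sqlite3 with an in-memory database purely so the example is self-contained, and only reproduces three of the tables.

```python
import sqlite3

# Illustrative subset of the 'ebay_products' schema (sqlite3 used here only
# for a self-contained demo; the repo's actual script targets a SQL server).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE conditions (
    condition_id INTEGER PRIMARY KEY,
    product_condition TEXT
);
CREATE TABLE sellers (
    seller_id INTEGER PRIMARY KEY,
    seller_name TEXT,
    seller_feedback_score INTEGER
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY AUTOINCREMENT,
    seller_id INTEGER REFERENCES sellers(seller_id),
    condition_id INTEGER REFERENCES conditions(condition_id),
    product_name TEXT,
    product_price REAL,   -- price excluding shipping, in ILS
    shipping_fee REAL,    -- shipping cost, in ILS
    page_number INTEGER   -- eBay search page the product appeared on
);
""")
con.execute("INSERT INTO sellers VALUES (1, 'demo_seller', 250)")
con.execute("INSERT INTO products (seller_id, product_name, product_price) "
            "VALUES (1, 'demo item', 99.9)")
row = con.execute("SELECT product_id, product_name FROM products").fetchone()
print(row)
```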

Installation

This repo includes a requirements file listing the necessary packages. In addition, a configuration file holds all constants, tags, and the URL pattern. In 'WebScraping.py', rows 9 and 12 hold the paths to the configuration file (CFG_FILE) and the password file (PASS_FILE), which contains the personal password for SQL execution and for API access from the RapidAPI site. Change these two paths to match the files' locations on your local PC. The passwords should be keyed as follows: the SQL password under 'my_password', and the RapidAPI credentials under 'api_headers' as a dictionary with the two keys received from the website ("x-rapidapi-key" and "x-rapidapi-host"). The program logs to a file named 'ebay_scraping.log'. The logging level is set to 'warning' but can be changed if needed (in line 23 of 'WebScraping.py').
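The password file can be loaded along the lines below. The README does not specify the file format, so JSON is assumed here, and `load_secrets` is a hypothetical helper; only the key names ('my_password', 'api_headers', "x-rapidapi-key", "x-rapidapi-host") come from the text above.

```python
import json
import os
import tempfile

def load_secrets(path):
    # Hypothetical loader; JSON format is an assumption of this sketch.
    with open(path) as f:
        return json.load(f)

# Demo: write a throwaway password file with the keys the README names.
demo = {
    "my_password": "sql-secret",  # placeholder SQL password
    "api_headers": {"x-rapidapi-key": "KEY", "x-rapidapi-host": "HOST"},
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)
    pass_file = f.name

secrets = load_secrets(pass_file)
print(secrets["api_headers"]["x-rapidapi-host"])
os.unlink(pass_file)
```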

Usage

WebScraping.py - Receives a list of search words and a number of pages to search for each, and returns the data of all resulting products: product description, price, condition, shipping fee, product category, seller country, seller name, and seller rating score.

The program consists of one initiation class, one main function, and several helper functions, calling each other in the following order:

  1. main (using the ScrapeIt class) => 2. ebay_access (=> 3. collect_links) => 4. concentrating_data => 5. get_item_data (=> 6. element_parsing) => 7. storing_data (=> 8. convert_currency) => 9. sql_execution
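The call order above can be traced with a minimal skeleton. All function bodies here are placeholders that only record the order of calls; none of this is the repo's actual implementation, only the names and call order come from the list above.

```python
trace = []  # records the call order for inspection

def element_parsing(raw):
    trace.append("element_parsing")
    return raw

def get_item_data(link):
    trace.append("get_item_data")
    return element_parsing(link)

def collect_links(page):
    trace.append("collect_links")
    return [f"page{page}-item"]  # placeholder links

def ebay_access(search_word, pages):
    trace.append("ebay_access")
    links = [link for p in range(pages) for link in collect_links(p)]
    return concentrating_data(links)

def concentrating_data(links):
    trace.append("concentrating_data")
    return [get_item_data(link) for link in links]

def convert_currency(price):
    trace.append("convert_currency")
    return price

def storing_data(items):
    trace.append("storing_data")
    for _ in items:
        convert_currency(0.0)
    sql_execution(items)

def sql_execution(items):
    trace.append("sql_execution")

def main():
    trace.append("main")
    items = ebay_access("laptop", 1)
    storing_data(items)

main()
print(trace)
```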

Authors and Support

Roy Yanovski - yanovskir@gmail.com

Project status

Project Completed.

Last Update: 30.03.2021.
