Data Mining Project as part of an ITC Data Science course
Our program scrapes Instagram posts from Instagram hashtag pages.
The program works in three steps:
- Connect to a hashtag page, scrape a collection of post "shortcodes", and insert them into an SQL database
- Collect the shortcodes inserted into the DB in step 1, convert them into post URLs, and scrape the post data
- (optional) Enrich the database by analysing post texts with an API for language detection, translation, and sentiment analysis, which labels each post as positive, negative, or neutral
The separation into three steps allows a modular workflow: the user can choose how many URLs to collect, how many posts to scrape, and how much (if at all) to enrich the data set with the API.
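The three-step workflow above can be sketched as independent functions. This is a hypothetical illustration: the function names and the in-memory "DB" list below do not reflect the actual coronagram.py internals.

```python
# Illustrative sketch of the three-step modular workflow.
# All names and the in-memory "db" list are hypothetical.

def collect_shortcodes(tag, url_limit, db):
    # Step 1: pretend we scraped shortcodes from the hashtag page
    # and insert them into the "DB".
    scraped = [f"{tag}_code_{i}" for i in range(url_limit)]
    db.extend(scraped)
    return scraped

def scrape_posts(db, post_limit):
    # Step 2: convert stored shortcodes into post URLs and "scrape" them.
    urls = [f"https://www.instagram.com/p/{code}/" for code in db[:post_limit]]
    return [{"url": u, "text": "post text"} for u in urls]

def enrich_posts(posts, enrich_limit):
    # Step 3 (optional): enrich at most enrich_limit posts;
    # the sentiment label stands in for the real API call.
    for post in posts[:enrich_limit]:
        post["sentiment"] = "neutral"
    return posts

db = []
posts = enrich_posts(scrape_posts(collect_shortcodes("corona", 3, db), 2), 1)
```

Because each step only depends on what the previous one stored, the limits for URLs, posts, and enrichment tasks can be chosen independently.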
Basic usage:

```shell
python coronagram.py corona misterLovaLova 12345678
```
Argument | Description |
---|---|
tag | Instagram #hashtag page for scraping |
name | Instagram user name |
password | Instagram password |
Short | Long | Description | Default | Comments |
---|---|---|---|---|
-h | --help | show this help message and exit | | |
-lu | --url_limit | maximum URLs to scrape | inf | if 0 is chosen, only the post-scraping step is performed, using shortcodes already inserted into the SQL DB |
-lp | --post_limit | maximum posts to scrape | inf | |
-b | --browser | browser used by Selenium | CHROME | supported browsers: CHROME or FIREFOX |
-e | --executable | path to the driver executable file | None | if none is given, the driver is assumed to be available through an OS environment variable |
-d | --db_batch | maximum number of records to insert and commit each time | 50 | |
-fc | --from_code | URL shortcode to start scraping from | None | |
-sc | --stop_code | URL shortcode that, when reached, stops scraping | None | |
-i | --implicit_wait | implicit wait time for the webdriver, in seconds | 50 | |
-o | --driver_options | optional browser arguments injected via the Selenium WebDriver API | ['--headless'] | can be given multiple times |
-mn | --min_scroll_wait | minimum number of seconds to wait after each scroll | | |
-mx | --max_scroll_wait | maximum number of seconds to wait after each scroll | | |
-hd | --headed_mode | run in headed mode (graphical browser) | False | |
-en | --enrich | maximum number of API enrichment tasks | 0 | enrichment includes detection of the post language, translation, and sentiment analysis (positive, negative, or neutral) |
-p | --proxy | use a proxy server for scraping | | |
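A subset of the options above could be declared with Python's argparse roughly as follows. This is an illustrative sketch mirroring the table, not the actual coronagram.py parser:

```python
# Hypothetical argparse sketch of part of the CLI described above.
import argparse
import math

parser = argparse.ArgumentParser(description="Scrape Instagram hashtag pages")
parser.add_argument("tag", help="Instagram #hashtag page for scraping")
parser.add_argument("name", help="Instagram user name")
parser.add_argument("password", help="Instagram password")
parser.add_argument("-lu", "--url_limit", type=float, default=math.inf,
                    help="maximum URLs to scrape")
parser.add_argument("-lp", "--post_limit", type=float, default=math.inf,
                    help="maximum posts to scrape")
parser.add_argument("-b", "--browser", choices=["CHROME", "FIREFOX"],
                    default="CHROME", help="browser used by Selenium")
parser.add_argument("-d", "--db_batch", type=int, default=50,
                    help="records to insert and commit per batch")
parser.add_argument("-o", "--driver_options", action="append",
                    default=["--headless"],
                    help="browser argument (can be given multiple times)")
parser.add_argument("-en", "--enrich", type=int, default=0,
                    help="maximum number of API enrichment tasks")

# Example invocation matching the "basic usage" above, with a post limit:
args = parser.parse_args(["corona", "misterLovaLova", "12345678", "-lp", "100"])
```

Using `type=float` with a default of `math.inf` is one way to express "unlimited" limits that still compare correctly against counters.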
- This program uses the Selenium library to scroll the page, which is necessary because Instagram hashtag pages load content dynamically. Selenium's headless option limits the use of graphical resources, and Selenium works with the browser installed on the computer (Chrome or Firefox).
- If no driver executable path is given with -e, the driver is assumed to be available through an OS environment variable; otherwise, you must provide a path to the driver executable.
- Database enrichment and use of a proxy server require API keys and a server IP, respectively. If these features are needed, make the necessary changes in the hidden configuration file.
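The scrolling behaviour described above (repeated scrolls with a randomized delay between `-mn` and `-mx`) could look roughly like this. The function is a hypothetical sketch, not the project's code; the commented Selenium setup assumes selenium 4.x with chromedriver available on the PATH.

```python
import random
import time

def scroll_page(driver, n_scrolls, min_wait, max_wait):
    """Scroll to the bottom n_scrolls times with a randomized delay,
    so a dynamically loading page fetches more posts each time."""
    for _ in range(n_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(min_wait, max_wait))

# With real Selenium it might be driven like this (not run here):
#   from selenium import webdriver
#   from selenium.webdriver.chrome.options import Options
#   options = Options()
#   options.add_argument("--headless")        # no graphical browser
#   driver = webdriver.Chrome(options=options)
#   driver.implicitly_wait(50)                # -i/--implicit_wait default
#   driver.get("https://www.instagram.com/explore/tags/corona/")
#   scroll_page(driver, n_scrolls=5, min_wait=1.0, max_wait=3.0)
#   driver.quit()
```

The randomized wait between scrolls gives the page time to load new posts and makes the traffic pattern less uniform.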
cffi==1.14.4
cryptography==3.3.1
numpy==1.19.4
pandas==1.1.5
pycparser==2.20
pyOpenSSL==20.0.1
python-dateutil==2.8.1
pytz==2020.4
regex==2020.11.13
six==1.15.0
- Yair Stemmer and Yoav Wigelman