Web crawler for my Android application. This script fetches movie details from the internet and stores them in an SQL database.
My Android application, Namma Mysuru, gives information about the movies being shown in theaters (along with other details about Mysore).
For this I wanted a crawler that could fetch the data from the internet by itself, instead of me updating it every week. So I went ahead and wrote a web crawler in Python that collects the data, formats it, and puts it in an SQL database.
The crawler uses the Beautiful Soup HTML parser to parse the web pages. After formatting the data, I use the Wikipedia Python package to get movie information from Wikipedia. I then search YouTube for the trailer video ID, which the Android app uses to play the trailer.
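As a rough illustration of the parsing step, here is a minimal sketch using Beautiful Soup. The markup, the class names (`theater`, `theater-name`, `movie`, `title`, `time`) and the `parse_showtimes` helper are assumptions made up for this example; the real page layout is the one in the HTML source file in the repository. The sketch uses Python's built-in `html.parser` backend so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup, invented for this example only.
SAMPLE_HTML = """
<div class="theater">
  <h3 class="theater-name">Example Theater</h3>
  <ul>
    <li class="movie"><span class="title">Example Movie</span>
        <span class="time">10:30 AM</span><span class="time">6:15 PM</span></li>
    <li class="movie"><span class="title">Another Movie</span>
        <span class="time">2:00 PM</span></li>
  </ul>
</div>
"""


def parse_showtimes(html):
    """Return a list of (theater, movie, [show times]) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for theater in soup.find_all("div", class_="theater"):
        name = theater.find("h3", class_="theater-name").get_text(strip=True)
        for movie in theater.find_all("li", class_="movie"):
            title = movie.find("span", class_="title").get_text(strip=True)
            times = [t.get_text(strip=True)
                     for t in movie.find_all("span", class_="time")]
            rows.append((name, title, times))
    return rows


if __name__ == "__main__":
    for row in parse_showtimes(SAMPLE_HTML):
        print(row)
```

Each tuple is already in a shape that can be passed on to the Wikipedia lookup and the trailer search.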
Then I use the sqlite3 Python package to store the data in the database. Since the theater details (address, GPS coordinates, etc.) don't change, I keep a table named theater_detail in my database that has all the information about the theaters. When I run the Python script, I also have a SQLite database file called 'nammaMysuru.sqlite', which already contains this table.
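The storage step could look roughly like the sketch below. Only the table name theater_detail and the file name 'nammaMysuru.sqlite' come from this README; the `movie` table, all column names, and the `store_movies` and `demo` helpers are assumptions for illustration. The demo uses an in-memory database with a stand-in theater_detail table so that it does not touch the real file.

```python
import sqlite3


def store_movies(conn, movies):
    """Replace the contents of the (hypothetical) movie table with fresh rows.

    movies is a list of (theater_id, title, trailer_id) tuples.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS movie (
               theater_id INTEGER NOT NULL,
               title      TEXT NOT NULL,
               trailer_id TEXT
           )"""
    )
    conn.execute("DELETE FROM movie")  # drop last week's listings
    conn.executemany(
        "INSERT INTO movie (theater_id, title, trailer_id) VALUES (?, ?, ?)",
        movies,
    )
    conn.commit()


def demo():
    # The crawler would open 'nammaMysuru.sqlite' here instead.
    conn = sqlite3.connect(":memory:")
    # Stand-in for the static theater table; these columns are assumed.
    conn.execute(
        "CREATE TABLE theater_detail (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
    )
    conn.execute(
        "INSERT INTO theater_detail VALUES (1, 'Example Theater', 'Example Road, Mysore')"
    )
    store_movies(conn, [(1, "Example Movie", "abc123")])
    rows = conn.execute(
        "SELECT t.name, m.title, m.trailer_id "
        "FROM movie m JOIN theater_detail t ON t.id = m.theater_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

Keeping the static theater data in its own table means each crawl only rewrites the movie rows and joins against theater_detail when the app needs addresses or coordinates.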
I have created a shell script that sets up the crawler environment. This script sets up Apache Spark along with the dependencies for the crawler. I use Apache Spark when searching for trailers on YouTube. When I search for trailers one movie after the other, it takes approximately 40 s per movie (tested on a Raspberry Pi). This is due to the delay in searching (network delay) and in parsing the result (processing delay). So I have integrated Apache Spark, which uses MapReduce and spawns multiple threads to search for the list of movies I provide as input.
To make the code easier to understand, I have included the HTML source file in the repository.
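One way the parallel trailer search could be expressed with PySpark is sketched below. `find_trailer_id` is a hypothetical placeholder that only builds a fake ID; the real crawler would query YouTube at that point. `find_trailers_parallel`, the `local[N]` master setting and the app name are likewise assumptions for this example, not necessarily what nammaMysuru.py does.

```python
def find_trailer_id(title):
    """Placeholder for the slow per-movie lookup (network + parsing).

    Hypothetical stand-in: returns a fake ID instead of querying YouTube.
    """
    return "id-for-" + title.lower().replace(" ", "-")


def find_trailers_parallel(titles, lookup=find_trailer_id, workers=4):
    """Map the lookup over all titles with Spark; return {title: trailer_id}."""
    # Imported lazily so the placeholder above can be used without Spark.
    from pyspark import SparkContext

    sc = SparkContext("local[%d]" % workers, "trailer-search")
    try:
        # One partition per title (capped at the worker count), so that
        # several lookups can wait on the network at the same time.
        slices = max(1, min(len(titles), workers))
        pairs = (
            sc.parallelize(titles, slices)
            .map(lambda title: (title, lookup(title)))
            .collect()
        )
    finally:
        sc.stop()
    return dict(pairs)
```

Calling `find_trailers_parallel(["Example Movie", "Another Movie"])` runs the lookups on up to four local workers instead of one after the other. Since most of the per-movie time is spent waiting on the network, the total time should drop roughly with the number of workers, though the actual gain on a Raspberry Pi would need to be measured.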
- Beautiful Soup
$ sudo apt-get install python-bs4 python-lxml
- Wikipedia
## Install Python pip
$ wget https://bootstrap.pypa.io/get-pip.py
$ chmod +x get-pip.py
$ python get-pip.py --user
## Install wikipedia and its dependencies using pip
$ pip install wikipedia --user
$ pip install pyopenssl ndg-httpsclient pyasn1
- SQLite 3
$ sudo apt-get install sqlite3
- Clone this repository to any folder on Linux (I downloaded it to my desktop) and execute the following:
$ cd Web-Crawler
$ chmod +x nammaMysuru.sh
$ ./nammaMysuru.sh
$ python nammaMysuru.py