web-crawler

Web crawler for my Android application. This script fetches movie details from the internet and puts them in a SQL database.

History

My Android application Namma Mysuru gives information about the movies being shown in theaters in Mysore, along with other details about the city.

For this I wanted a crawler that could fetch that data from the internet all by itself instead of me updating it every week. So I went ahead and wrote a web crawler in Python which collects the data, formats it, and puts it in a SQL database.

This crawler uses the Beautiful Soup HTML parser to parse the web pages. After formatting the data, I use the Wikipedia Python package to get movie information from Wikipedia, and search YouTube for the trailer video ID, which the Android app uses to play the trailer.
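
A minimal sketch of that flow, assuming a hypothetical listing URL and a hypothetical "movie-title" CSS class for the titles (the real page markup differs), and using the requests library, which the Wikipedia package pulls in anyway:

import requests
import wikipedia
from bs4 import BeautifulSoup

def fetch_now_showing(listing_url):
    # Scrape movie titles from a (hypothetical) theater listing page.
    html = requests.get(listing_url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    # Assumes each title sits in an element with class "movie-title".
    return [tag.get_text(strip=True) for tag in soup.select(".movie-title")]

def movie_summary(title):
    # Pull a short plot summary from Wikipedia for a given title.
    try:
        return wikipedia.summary(title, sentences=3)
    except wikipedia.exceptions.DisambiguationError as err:
        # Fall back to the first disambiguation option.
        return wikipedia.summary(err.options[0], sentences=3)
    except wikipedia.exceptions.PageError:
        return ""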

Then I use the sqlite3 Python package to store the data in the database. Since the theater details (address, GPS coordinates, etc.) don't change, I have a table in my database named theater_detail that holds all the information about the theaters. When I run this Python script, I also have a SQLite database file called 'nammaMysuru.sqlite' which contains this table.
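
As a rough illustration, assuming a simplified movie_detail table (the actual schema in nammaMysuru.sqlite may differ), the storage step looks roughly like this:

import sqlite3

def store_movies(movies, db_path="nammaMysuru.sqlite"):
    # movies is a list of (title, summary, trailer_id) tuples.
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS movie_detail (
                       title      TEXT PRIMARY KEY,
                       summary    TEXT,
                       trailer_id TEXT)""")
    cur.executemany("INSERT OR REPLACE INTO movie_detail VALUES (?, ?, ?)",
                    movies)
    conn.commit()
    conn.close()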

I have created a shell script that sets up the crawler environment. This script installs Apache Spark along with the crawler's dependencies. I use Apache Spark when searching for trailers on YouTube. When I search for trailers one movie after the other, each movie takes approximately 40 s (tested on a Raspberry Pi), due to the delay in searching (network delay) and parsing the result (processing delay). So I have integrated Apache Spark, which uses map-reduce and spawns multiple threads to search for the list of movies I provide as input.
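
A rough sketch of that idea, assuming a hypothetical helper find_trailer_id() that does the actual YouTube search and parsing (the real lookup logic lives in the crawler itself):

from pyspark import SparkContext

def find_trailer_id(title):
    # Placeholder: fetch the YouTube search results for "<title> trailer"
    # and parse out the first video ID.
    return None

def trailer_ids(titles):
    sc = SparkContext("local[*]", "trailer-search")
    try:
        # Distribute the titles so the per-movie network and parsing
        # delays overlap instead of adding up.
        return sc.parallelize(titles).map(find_trailer_id).collect()
    finally:
        sc.stop()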

To make the code easier to follow, I have included the HTML source file in the repository.

Installation

Manual Installation

Install the following dependencies

  1. Beautiful Soup
$ sudo apt-get install python-bs4 python-lxml
  2. Wikipedia
## Install Python pip
$ wget https://bootstrap.pypa.io/get-pip.py
$ chmod +x get-pip.py
$ python get-pip.py --user

## Install wikipedia and its dependencies using pip
$ pip install wikipedia --user
$ pip install pyopenssl ndg-httpsclient pyasn1
  3. SQLite 3
$ sudo apt-get install sqlite3

Automated Installation

Install using the shell script

  1. Clone this repository to any folder on Linux (I downloaded it to my desktop) and execute the following
$ cd Web-Crawler
$ chmod +x nammaMysuru.sh
$ ./nammaMysuru.sh

Usage

$ python nammaMysuru.py
