Before running, you have to build the tables in mysql and change the
host, passward and userID
(line 9 in mySQLWrapper.py and line 9 in mysql2file.py).
conn=pymysql.connect(host="IP",user="MySQL user ID",passwd="password",db="imdb")The main fuction files:
crawler.py - crawl the reviews
mysql2file.py - output the reviews in mysql database to an xml
format file
str2index.py - given a xml format file of reviews, output the word
index file for each instance(review), user file recording the user and movie
file recording the movie The Mysql database contains four main tables, imdbReview, imdbUser, imdbMovie, queueTask The descs are: imdbReview
+---------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| user | varchar(20) | YES | MUL | NULL | |
| movie | varchar(20) | YES | MUL | NULL | |
| rating | int(1) | YES | | NULL | |
| useful | varchar(10) | YES | | NULL | |
| time | varchar(20) | YES | | NULL | |
| title | varchar(100) | YES | | | |
| content | varchar(5000) | YES | | | |
+---------+---------------+------+-----+---------+-------+taskQueue
+--------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+---------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| task | varchar(20) | YES | | NULL | |
| status | tinyint(3) unsigned | NO | | 0 | |
+--------+---------------------+------+-----+---------+----------------+imdbMovie
+-------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+----------------+------+-----+---------+-------+
| id | varchar(20) | YES | MUL | NULL | |
| title | varchar(100) | YES | | NULL | |
| data | varchar(55534) | YES | | NULL | |
+-------+----------------+------+-----+---------+-------+imdbUser
+----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| id | varchar(20) | YES | MUL | NULL | |
| username | varchar(100) | YES | | NULL | |
+----------+--------------+------+-----+---------+-------+