NOTE: Due to a recent update on 9 December to the Python cryptography package, which is required by the fabric package we use, some of our scripts may not work as intended. We have already informed Prof Dinh, Prof Kenny and GTA Vishva during the presentation. See here.
- Tharun
- Nina
- Hadi
- Jun Yuan
- Akmal Hakim
- AWS EC2 Ubuntu 18.04 LTS (due to use of bash scripts)
- AWS EC2 key pair created in the region 'us-east-1'
- Install the necessary packages
sudo apt-get -y update
sudo apt-get -y install python3-pip build-essential libssl-dev libffi-dev python3-dev
pip3 install --upgrade pip setuptools
pip3 install boto3 python-dotenv fabric
- Clone the project repository using the following command
git clone https://github.com/thawalk/db-final.git
- Change directory into the `db-final/production` folder
cd db-final/production
- Run `production.py` to set up and configure the EC2 instances. This will create instances for hosting MongoDB, MySQL, and the frontend and backend servers.
python3 production.py
NOTE: Upon execution, you will be prompted for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account.
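For context, below is a minimal sketch of the kind of boto3 call that `production.py` makes; the AMI ID, key pair name and instance type are placeholders, not the script's actual values.

```python
# A minimal sketch, not the actual production.py: prompt for credentials and
# launch a single instance. AMI ID, key pair and instance type are placeholders.
import boto3

access_key = input("AWS Access Key ID: ")
secret_key = input("AWS Secret Access Key: ")
session_token = input("AWS Session Token: ")

ec2 = boto3.resource(
    "ec2",
    region_name="us-east-1",
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    aws_session_token=session_token,
)

instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",   # placeholder Ubuntu 18.04 AMI
    InstanceType="t2.micro",  # placeholder instance type
    KeyName="my-key-pair",    # the key pair created in us-east-1
    MinCount=1,
    MaxCount=1,
)

instances[0].wait_until_running()
instances[0].reload()
print("Public IP:", instances[0].public_ip_address)
```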
In the event that `production.py` fails or gets stuck in a loop, stop the script using CTRL + C. Then tear down all the instances using the instructions HERE.
- The public IP of the EC2 instance hosting the web application will be printed on the console at the end. The website can then be viewed at `<public IP 1>`; this is the link to the website.
- We are now in `/db-final/production`. For the analytics portion, we head over to `/db-final/hadoop` by running:
cd ..
cd hadoop
- Let us now run the analytics script. This execution will take approximately 20-30 minutes. Run:
python3 hadoop_spark.py
NOTE: As with `production.py`, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token, which can be found in your AWS account. You will also be asked for the number of nodes for the cluster; key in the number of nodes accordingly. In the event that `hadoop_spark.py` fails or gets stuck in a loop, click HERE.
- The results of the analytics tasks will be stored in `/db-final/analytics_files`. First, we change our directory to the `analytics_files` folder:
cd ..
cd analytics_files
- To obtain the results of the analytics, we run:
python3 get_results.py
A sample of the Pearson correlation and TF-IDF analytics output can be seen below.
(Pearson correlation sample output)
(TF-IDF sample output)
The folder with all the CSVs will be saved locally in the `analytics_files` folder after you run the script.
- If you want to commission new nodes to the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:
cd ..
cd hadoop
python3 hadoop_new_node.py
NOTE: As with `production.py`, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token, which can be found in your AWS account. You will also be asked for the number of nodes to be added to the cluster; key in the number of nodes you want to add. In the event that `hadoop_new_node.py` fails or gets stuck in a loop, press CTRL + C and tear down the cluster using the instructions HERE. Then rerun `hadoop_spark.py` and try running `hadoop_new_node.py` again.
- If you want to decommission nodes from the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:
python3 hadoop_down_node.py
NOTE: If `hadoop_new_node.py` has previously failed, please tear down the entire cluster and rerun `hadoop_spark.py` before running `hadoop_down_node.py`.
NOTE: As with `production.py`, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token, which can be found in your AWS account. You will also be asked for the number of nodes that you want to decommission from the cluster; key in the number of nodes you want to remove. In the event that `hadoop_down_node.py` fails or gets stuck in a loop, press CTRL + C and then rerun `hadoop_down_node.py`.
- First, we have to change directory into the `db-final/teardown` folder, where all our teardown scripts are located:
cd db-final/teardown
- For tearing down production side instances, run:
python3 teardown_production.py
- For tearing down Hadoop side instances, run:
python3 teardown_hadoop.py
In this project, we built a web application for Kindle book reviews, similar to Goodreads, using public datasets from Amazon. The full details of the project requirements can be found here.
We were provided with 2 public datasets from Amazon.
- Amazon Kindle reviews, available from the Kaggle website. This dataset has 982,619 entries (about 700MB).
- Amazon Kindle metadata, available from the UCSD website. This dataset has 434,702 products (about 450MB).
After cleaning the dataset, we loaded the data into the respective databases.
A front-end with at least the following functionalities:
- Add a new book
- Search for an existing book by author and by title
- Add a new review
- Sort books by reviews, genres
- User-friendliness
We chose to use Handlebars.js as a framework as it is simple, clean, and decoupled from logic-based JavaScript files. Its templating engine makes client-side templating simple without requiring complicated Single Page Application frameworks.
Some features of our frontend include:
- Search a book by title
- Search a book by author
- Give reviews on books
- Filter books based on Categories
- Filter books based on Rating
- Add new book
- Deploy at least 3 separate servers: one for the web server, one for the relational database, and one for the non-relational database.
- A web server
- A relational database
- A non-relational database
- The web server logs are recorded in a document store. Each log record must have at least the following information:
- Timestamp
- Type of request
- Response code (404, 200, etc)
Flask was chosen because of the well-maintained libraries available for the MySQL and MongoDB databases. Libraries used include PyMongo, python-sql and mysql-connector.
Schema (MongoDB)
- id: ObjectId, primary key
- asin: String
- categories:
- description?: String
- title: String
- price: Double
- imUrl: String
- related: json
- also_bought: string[]
- also_viewed: string[]
- bought_together: string[]
- buy_after_viewing: string[]
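For illustration, here is a hypothetical document matching this schema, inserted with PyMongo as the backend does; the connection string, database/collection names and values are all assumptions.

```python
# A hypothetical metadata document matching the schema above; the connection
# string and db/collection names are placeholders, not the project's values.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
metadata = client["kindle"]["metadata"]

metadata.insert_one({
    "asin": "B000EXAMPLE",
    "categories": [["Kindle Store", "Kindle eBooks", "Science Fiction"]],
    "description": "A sample description.",  # optional field
    "title": "Example Title",
    "price": 9.99,
    "imUrl": "http://example.com/cover.jpg",
    "related": {
        "also_bought": ["B000EXAMPL1"],
        "also_viewed": ["B000EXAMPL2"],
        "bought_together": ["B000EXAMPL3"],
        "buy_after_viewing": ["B000EXAMPL4"],
    },
})
```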
Schema (MySQL)
- id: Integer, primary key
- asin: varchar
- helpful: varchar
- overall: Integer
- reviewText: Text
- reviewTime: varchar
- reviewerID: varchar
- reviewerName: varchar
- summary: varchar
- unixReviewTime: Integer
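As a rough SQL equivalent of this schema, executed through mysql-connector; the column sizes and connection details are assumptions, not the project's exact DDL.

```python
# A rough equivalent of the reviews table; credentials and column sizes are
# placeholders, not the project's exact DDL.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="root", password="password", database="kindle"
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS reviews (
        id INT AUTO_INCREMENT PRIMARY KEY,
        asin VARCHAR(20),
        helpful VARCHAR(20),
        overall INT,
        reviewText TEXT,
        reviewTime VARCHAR(20),
        reviewerID VARCHAR(30),
        reviewerName VARCHAR(100),
        summary VARCHAR(255),
        unixReviewTime INT
    )
""")
conn.commit()
```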
Schema (MongoDB)
- id: Integer, primary key
- timestamp: String
- request: String
- response: String
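A minimal sketch of how such request logs could be captured in Flask and written to the document store; the hook and field values are illustrative, not the project's exact code.

```python
# A sketch of request logging to MongoDB; connection string and db/collection
# names are placeholders, not the project's values.
from datetime import datetime
from flask import Flask, request
from pymongo import MongoClient

app = Flask(__name__)
logs = MongoClient("mongodb://localhost:27017")["kindle"]["logs"]

@app.after_request
def log_request(response):
    # Record timestamp, type of request and response code for every request.
    logs.insert_one({
        "timestamp": datetime.utcnow().isoformat(),
        "request": request.method + " " + request.path,
        "response": str(response.status_code),
    })
    return response
```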
Schema (MongoDB)
- id: ObjectId, primary key
- asin: String
- book_title: String
- author_names: String
For analytics, we built a system that:
- Gets data from the production system and loads it into HDFS
- Uses Spark to perform two analyses:
  - Pearson correlation: the covariance of the two variables divided by the product of their standard deviations.
  - TF-IDF: a statistical measure of how relevant a word is to a document in a collection of documents.
When the user executes the `hadoop_spark.py` script in the `hadoop` folder with their credentials and the specified number of nodes, the script performs all the necessary configuration, such as installing Hadoop and Spark on all the nodes. It then starts the Hadoop and Spark cluster; a rough sketch of this final step is shown below.
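This is a minimal sketch assuming fabric 2.x and placeholder host details; the real script performs far more configuration than shown here.

```python
# A sketch of starting the cluster from the namenode; host, user, key file and
# paths are placeholders, not the script's actual values.
from fabric import Connection

namenode = Connection(
    host="ec2-x-x-x-x.compute-1.amazonaws.com",
    user="ubuntu",
    connect_kwargs={"key_filename": "my-key-pair.pem"},
)

namenode.run("hdfs namenode -format -force")     # format HDFS once
namenode.run("$HADOOP_HOME/sbin/start-dfs.sh")   # bring up HDFS
namenode.run("$HADOOP_HOME/sbin/start-yarn.sh")  # bring up YARN
```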
When `hadoop_new_node.py` (in the `hadoop` folder) is executed, the specified number of nodes is commissioned to the cluster. The explanation is below.
We implemented the addition of new nodes to the cluster the proper way, which is to commission new nodes into the running cluster.
When the script is executed, the specified number of new nodes is created and configured, which includes installing Hadoop and fetching the public key of the namenode that is running in the cluster.
Then all the previous nodes are informed of the new nodes joining the cluster through the hosts and workers files, along with some other configuration.
Now that all the necessary configuration has been completed, the script places an includes file on the namenode containing the private IPs of all the datanodes, both old and new. The script then stops and starts the cluster. At this point, through the new workers file and the includes file, the namenode is 'informed' of the incoming nodes, and when the cluster starts, the new nodes are commissioned. A sketch of this step is shown below.
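As a rough illustration of the includes-file step (a sketch assuming fabric 2.x; the host string, IPs and paths are placeholders):

```python
# A sketch of the includes-file step; the real script also rewrites the hosts
# and workers files on every node. All values below are placeholders.
from fabric import Connection

namenode = Connection("ubuntu@<namenode-public-ip>")
datanode_ips = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]  # old + new private IPs

# The includes file lists every datanode allowed to join the cluster.
namenode.run("printf '%s\\n' " + " ".join(datanode_ips)
             + " > $HADOOP_HOME/etc/hadoop/includes")

# Restart so the namenode picks up the new workers and includes files.
namenode.run("$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh")
```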
When the user executes the `hadoop_down_node.py` script, it decommissions the specified number of nodes from the cluster. The explanation is below.
This is done through the excludes file, which is placed on the namenode. The private IPs of the nodes to be decommissioned are written into the excludes file, and the script then refreshes the cluster. When this happens, the namenode decommissions the nodes listed in the excludes file.
Point to note: excludes takes precedence over includes, so when datanodes are decommissioned, their private IPs do not have to be removed from the includes file.
(Excludes file screenshot)
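As a rough illustration of the decommissioning step (a sketch assuming fabric 2.x; the host string, IPs and paths are placeholders):

```python
# A sketch of the excludes-file step; all values below are placeholders.
from fabric import Connection

namenode = Connection("ubuntu@<namenode-public-ip>")
nodes_to_remove = ["10.0.0.13"]  # private IPs chosen for decommissioning

# Write the excludes file on the namenode, one private IP per line.
namenode.run("printf '%s\\n' " + " ".join(nodes_to_remove)
             + " > $HADOOP_HOME/etc/hadoop/excludes")

# Refresh so HDFS and YARN re-read the includes/excludes files and begin
# decommissioning the listed datanodes.
namenode.run("hdfs dfsadmin -refreshNodes")
namenode.run("yarn rmadmin -refreshNodes")
```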
We added some special configuration to the namenode's YARN configuration file, shown in the image below:
(YARN configuration screenshot)
This configuration uses a file named 'includes', where the new nodes that are to be commissioned are listed.
After the Hadoop and Spark cluster is up, the user can run the analytics tasks by running the `get_results.py` file in the `analytics_files` folder. Below is the approach that we took for the analytics tasks.
To compute the Pearson correlation between price and average review length, we first need to retrieve the data from HDFS. Since price comes from the `Meta` dataset while the reviews come from the `Reviews` dataset, we need to fetch the two datasets and do some preprocessing.
The preprocessing includes getting the average `reviewText` length for each book (`asin`) and removing all unnecessary columns in order to clean up our data. We then joined the two datasets on their `asin` to get a table with both price and `reviewText` length.
We used an RDD to map the values of x, y, xy, x², y² and get the sum of each of these values. After getting these results, we were able to calculate the Pearson correlation using the following equation.
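For reference, the standard computational form of the Pearson correlation, in terms of the sums described above (with n the number of books, x the price and y the average review length), is:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}} $$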
The output of this is printed in the console when get_results.py is run (as shown above).
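As a rough illustration of the RDD aggregation described above (a sketch assuming a Spark DataFrame `joined` with `price` and `avg_review_length` columns; not the project's exact code):

```python
# A sketch of the Pearson computation over an assumed DataFrame `joined`.
from math import sqrt

# Map each row to (x, y, xy, x^2, y^2, 1) and sum the tuples element-wise.
sx, sy, sxy, sx2, sy2, n = (joined.rdd
    .map(lambda r: (r.price, r.avg_review_length,
                    r.price * r.avg_review_length,
                    r.price ** 2, r.avg_review_length ** 2, 1))
    .reduce(lambda a, b: tuple(i + j for i, j in zip(a, b))))

r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print("Pearson correlation:", r)
```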
In order to calculate the TF-IDF, we need to get two variables from our dataset: Term Frequency and Inverse Document Frequency.
- Term Frequency: the number of times a word appears in a document
- Inverse Document Frequency: a numerical measure of how much information a word provides
TF-IDF is calculated by multiplying the Term Frequency and the Inverse Document Frequency. A higher score means that the word is more relevant to the document.
Finding the TF-IDF score for word t in document d from a set of documents D can be done with the following:
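For reference, the standard definitions consistent with the description below (with N the total number of documents) are:

$$ \mathrm{TF}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad \mathrm{IDF}(t,D) = \log\frac{N}{|\{d \in D : t \in d\}|}, \qquad \mathrm{TFIDF}(t,d,D) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t,D) $$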
We first need to retrieve the `Reviews` data stored in HDFS.
Following that, to get the TF, we calculate the count of every word in the `reviewText` of each book and divide it by the total word count of that `reviewText`.
To get the IDF, we used the IDF(t, D) equation from above to calculate the score. We then multiplied the TF and IDF to get the TF-IDF scores for the books; a rough sketch of this computation is shown below.
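This sketch assumes an RDD `docs` of (asin, reviewText) pairs and is not the project's exact code.

```python
# A sketch of the manual TF-IDF computation described above, over an assumed
# RDD `docs` of (asin, reviewText) pairs.
from math import log

N = docs.count()  # total number of documents (books)

# Word counts per (book, word) pair.
counts = (docs
          .flatMap(lambda bt: [((bt[0], w), 1) for w in bt[1].lower().split()])
          .reduceByKey(lambda a, b: a + b))

# TF: divide each count by the book's total word count.
totals = (counts.map(lambda kv: (kv[0][0], kv[1]))
                .reduceByKey(lambda a, b: a + b))
tf = (counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
            .join(totals)
            .map(lambda kv: ((kv[0], kv[1][0][0]), kv[1][0][1] / kv[1][1])))

# IDF: log(N / number of books whose reviewText contains the word).
idf = (counts.map(lambda kv: (kv[0][1], 1))
             .reduceByKey(lambda a, b: a + b)
             .map(lambda kv: (kv[0], log(N / kv[1]))))

# TF-IDF: join TF with IDF on the word and multiply.
tfidf = (tf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))
           .join(idf)
           .map(lambda kv: ((kv[1][0][0], kv[0]), kv[1][0][1] * kv[1][1])))
```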
The results for the TF-IDF are downloaded into `db-final/analytics_files/tfidf_output.zip`, which contains an archived folder of the TF-IDF scores of the books, saved as CSVs.
Since the metadata accompanying the Amazon Kindle reviews has a lot of missing book details, we had to scrape the missing details from the web; a simplified sketch is shown below.
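This is a heavily simplified sketch of that scraping step, using requests and BeautifulSoup; the URL pattern and selector are hypothetical, and the real scraper handles many more cases.

```python
# A simplified scraping sketch; the URL pattern and selector are hypothetical.
import requests
from bs4 import BeautifulSoup

def fetch_book_details(asin):
    page = requests.get("https://www.example.com/books/" + asin)  # placeholder URL
    soup = BeautifulSoup(page.text, "html.parser")
    title = soup.find("span", id="productTitle")  # hypothetical selector
    return title.get_text(strip=True) if title else None
```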