# Introduction

This demo uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/). It is available in two sizes:

* **[Small](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip):** 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.
* **[Full](http://files.grouplens.org/datasets/movielens/ml-latest.zip):** 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Last updated 9/2018.



In [1]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -q --show-progress
!unzip ml-latest-small.zip
!rm ml-latest-small.zip

Archive:  ml-latest-small.zip
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [2]:
%env DATASET_PATH ml-latest-small

env: DATASET_PATH=ml-latest-small
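
Before defining a Hive table, it can help to look at the header and the first rows of a CSV file to decide on column names and types. The sketch below is one way to do that from the notebook's Python side. The helper name `peek_csv` is hypothetical and not part of the original notebook.

```python
import csv
from itertools import islice


def peek_csv(path, n=3):
    """Return (header, first n data rows) of a CSV file.

    Hypothetical helper for illustration; all values come back as strings.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)            # first line holds the column names
        rows = list(islice(reader, n))   # next n lines are data rows
    return header, rows
```

For example, `peek_csv("ml-latest-small/ratings.csv")` would return the header line separately from the data. The tables created later skip that same header line with `skip.header.line.count`.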


# Creating a database

We will be using Hive's [Beeline command-line shell](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93CommandLineShell).

Listing databases:

In [3]:
!beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.079 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [4]:
!beeline -u "jdbc:hive2://" -e "CREATE DATABASE IF NOT EXISTS movielens;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (1.038 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [5]:
!beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
+----------------+
| database_name  |
+----------------+
| default        |
| movielens      |
+----------------+
2 rows selected (1.06 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


## Creating a table ratings in the movielens database

In [6]:
#Verifying tables
!beeline -u "jdbc:hive2://" -e "USE movielens; SHOW tables;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.946 seconds)
OK
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.181 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [7]:
#Creating table ratings
!beeline -u "jdbc:hive2://" -e "USE movielens; CREATE TABLE ratings(                                           \
                                                                     userID  INT,                              \
                                                                     movieID INT,                              \
                                                                     rating  INT,                              \
                                                                     time    INT )                             \
                                                            ROW FORMAT DELIMITED                               \
                                                            FIELDS TERMINATED BY ','                           \
                                                            STORED AS TEXTFILE                                 \
                                                            tblproperties(\"skip.header.line.count\"=\"1\");"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.968 seconds)
OK
No rows affected (0.655 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [8]:
!beeline -u "jdbc:hive2://" -e "USE movielens; SHOW tables;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.968 seconds)
OK
+-----------+
| tab_name  |
+-----------+
| ratings   |
+-----------+
1 row selected (0.272 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


## Importing data from the local file system

In [9]:
!beeline -u "jdbc:hive2://" -e "USE movielens; SELECT * FROM ratings;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.916 seconds)
OK
+-----------------+------------------+-----------------+---------------+
| ratings.userid  | ratings.movieid  | ratings.rating  | ratings.time  |
+-----------------+------------------+-----------------+---------------+
+-----------------+------------------+-----------------+---------------+
No rows selected (1.336 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [10]:
#Loading data
!beeline -u "jdbc:hive2://" -e "USE movielens;  LOAD DATA LOCAL INPATH '$(pwd)/$DATASET_PATH/ratings.csv'\
                                                OVERWRITE INTO TABLE ratings;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.896 seconds)
Loading data to table movielens.ratings
OK
No rows affected (1.344 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [11]:
!beeline -u "jdbc:hive2://" -e "USE movielens; SELECT * FROM ratings LIMIT 10;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.908 seconds)
OK
+-----------------+------------------+-----------------+---------------+
| ratings.userid  | ratings.movieid  | ratings.rating  | ratings.time  |
+-----------------+------------------+-----------------+---------------+
| 1               | 1                | 4               | 964982703     |
| 1               | 3                | 4               | 964981247     |
| 1               | 6                | 4               | 964982224     |
| 1               | 47               | 5               | 964983815     |
| 1               | 50               | 5               | 964982931     |
| 1               | 70               | 3               | 964982400     |
| 1               | 101              | 5               | 964980868     |
| 1               | 110              | 4               | 964982176     |
| 1

## Verifying the file created by Hive

In [12]:
!hdfs dfs -ls /user/hive/warehouse/
!hdfs dfs -ls /user/hive/warehouse/movielens.db/
!hdfs dfs -ls /user/hive/warehouse/movielens.db/ratings

Found 1 items
drwxr-xr-x   - matheus supergroup          0 2019-07-18 06:26 /user/hive/warehouse/movielens.db
Found 1 items
drwxr-xr-x   - matheus supergroup          0 2019-07-18 06:27 /user/hive/warehouse/movielens.db/ratings
Found 1 items
-rwxr-xr-x   1 matheus supergroup    2483723 2019-07-18 06:27 /user/hive/warehouse/movielens.db/ratings/ratings.csv


## Finding the most popular movie

In [13]:
!beeline -u "jdbc:hive2://" -e "USE movielens; SELECT movieID, COUNT(movieID) as ratingCount  \
                                                                                FROM ratings                   \
                                                                                GROUP BY movieID               \
                                                                                ORDER BY ratingCount DESC      \
                                                                                LIMIT 10;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.935 seconds)
Query ID = matheus_20190718062725_1c5962bb-f38e-4869-97ef-e57e3fcd2c52
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1563431070808_0001, Tracking URL = http://54b67fa7c528:8088/proxy/application_1563431070808_0001/
Kill Command = /home/matheus/hadoop-2.9.2/bin/hadoop job  -kill job_1563431070808_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-18 06:27:33,835 Stage-1 map = 0%,  reduce = 0%
2
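
The aggregation logic of the query above can be tried without a Hive cluster. The following is a minimal sketch that runs the same `GROUP BY` / `COUNT` / `ORDER BY` pattern on an in-memory SQLite database. SQLite stands in for Hive here, the function name `top_movies` is hypothetical, the sample rows are invented for illustration, and the extra `movieID` tie-breaker is an addition to make the result deterministic.

```python
import sqlite3


def top_movies(rows, limit=10):
    """Count ratings per movie and return the most-rated movies first.

    rows: iterable of (userID, movieID, rating, time) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (userID INT, movieID INT, rating INT, time INT)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT movieID, COUNT(movieID) AS ratingCount
        FROM ratings
        GROUP BY movieID
        ORDER BY ratingCount DESC, movieID  -- movieID only breaks ties
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return result


# Toy data, invented for illustration (not taken from MovieLens).
sample = [
    (1, 10, 4, 0),
    (2, 10, 5, 0),
    (3, 10, 3, 0),
    (1, 20, 5, 0),
    (2, 20, 4, 0),
    (1, 30, 2, 0),
]
print(top_movies(sample, limit=2))  # [(10, 3), (20, 2)]
```

Movie 10 has three ratings and movie 20 has two, so they come first. Movie 30 has a single rating and is cut off by the limit.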

## Finding the name of the most popular movie

### Creating a new table that contains movies' title

In [14]:
#Creating a new table called movieNames
!beeline -u "jdbc:hive2://" -e "USE movielens; CREATE TABLE movieNames(                                                   \
                                                                        movieID  INT,                                     \
                                                                        title STRING)                                     \
                                                                        ROW FORMAT DELIMITED                              \
                                                                        FIELDS TERMINATED BY ','                          \
                                                                        STORED AS TEXTFILE                                \
                                                                        tblproperties(\"skip.header.line.count\"=\"1\");"                                                        

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.908 seconds)
OK
No rows affected (0.56 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


In [15]:
#Loading data into movieNames table
!beeline -u "jdbc:hive2://" -e "USE movielens;  LOAD DATA LOCAL INPATH '$DATASET_PATH/movies.csv'       \
                                                OVERWRITE INTO TABLE movieNames;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.908 seconds)
Loading data to table movielens.movienames
OK
No rows affected (0.894 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


### Creating a view to store the movies' popularity

In [16]:
!beeline -u "jdbc:hive2://" -e "USE movielens;  CREATE VIEW topMoviesIds AS                                          \
                                                SELECT movieID, COUNT(movieID) as ratingCount        \
                                                FROM ratings                                         \
                                                GROUP BY movieID                                     \
                                                ORDER BY ratingCount DESC;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.901 seconds)
OK
No rows affected (1.424 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://


### Finding the name of the most popular movie

In [17]:
!beeline -u "jdbc:hive2://" -e "USE movielens;  SELECT n.title, t.ratingCount                                        \
                                                FROM topMoviesIds t JOIN movieNames n ON t.movieID = n.movieID       \
                                                ORDER BY t.ratingCount DESC                                          \
                                                LIMIT 10;"

Connecting to jdbc:hive2://
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (0.916 seconds)
Query ID = matheus_20190718062840_553754b9-482a-4b00-8420-c3e203f26518
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1563431070808_0003, Tracking URL = http://54b67fa7c528:8088/proxy/application_1563431070808_0003/
Kill Command = /home/matheus/hadoop-2.9.2/bin/hadoop job  -kill job_1563431070808_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-18 06:28:47,862 Stage-1 map = 0%,  reduce = 0%
2
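
In SQL generally, the row order defined inside a view is not guaranteed to survive a join, so a ranking query is safer with its own `ORDER BY` on the outer `SELECT`. The sketch below illustrates the view-plus-join pattern on an in-memory SQLite database. SQLite stands in for Hive, the function name `top_titles` is hypothetical, and the ratings and titles are invented for illustration.

```python
import sqlite3


def top_titles(ratings, movies, limit=10):
    """Join per-movie rating counts with titles, most-rated first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings (userID INT, movieID INT, rating INT, time INT)"
    )
    conn.execute("CREATE TABLE movieNames (movieID INT, title TEXT)")
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", ratings)
    conn.executemany("INSERT INTO movieNames VALUES (?, ?)", movies)
    # The view only aggregates; ordering is applied by the outer query.
    conn.execute(
        """
        CREATE VIEW topMoviesIds AS
        SELECT movieID, COUNT(movieID) AS ratingCount
        FROM ratings
        GROUP BY movieID
        """
    )
    result = conn.execute(
        """
        SELECT n.title, t.ratingCount
        FROM topMoviesIds t JOIN movieNames n ON t.movieID = n.movieID
        ORDER BY t.ratingCount DESC, n.title  -- title only breaks ties
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return result


# Toy data, invented for illustration (not taken from MovieLens).
toy_ratings = [
    (1, 10, 4, 0),
    (1, 20, 5, 0),
    (2, 20, 4, 0),
    (3, 20, 3, 0),
    (1, 30, 2, 0),
    (2, 30, 5, 0),
]
toy_movies = [(10, "Movie A"), (20, "Movie B"), (30, "Movie C")]
print(top_titles(toy_ratings, toy_movies, limit=2))
# [('Movie B', 3), ('Movie C', 2)]
```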

# It's your turn

## Find the movie with the highest average rating

What do you think about this result?

Select the hidden text below to see the result:

<font color='white'>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;
</font>


## Find the movie with the highest average rating, only considering movies with more than 10 ratings

Select the hidden text below to see the result:

<font color='white'>
SOLUTION 1: <br>
SELECT r.movieID, m.title, AVG(r.rating) as ratingAVG, COUNT(r.rating) as ratingCount<br>
FROM ratings r JOIN movieNames m ON r.movieID = m.movieID<br>
GROUP BY r.movieID, m.title<br>
HAVING ratingCount > 10<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;<br><br>
SOLUTION 2:<br>
CREATE VIEW avgMoviesIds AS<br>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC;<br>
<br>
SELECT n.title as title, a.ratingAVG as average, t.ratingCount as ratingCount<br>
FROM avgMoviesIds a JOIN topMoviesIds t ON a.movieID = t.movieID<br>
    JOIN movieNames n ON t.movieID = n.movieID<br>
WHERE t.ratingCount > 10<br>
ORDER BY average DESC<br>
LIMIT 10;
</font>
