# Introduction

In this lecture, we will use the MovieLens 100k dataset, which contains 100,000 movie ratings.

MovieLens site: https://grouplens.org/datasets/movielens/

100k dataset: http://files.grouplens.org/datasets/movielens/ml-100k.zip


## Setting up Hadoop and Hive envvars

In [12]:
%env HADOOP_VERSION     2.9.2
%env HADOOP_HOME hadoop-2.9.2

env: HADOOP_VERSION=2.9.2
env: HADOOP_HOME=hadoop-2.9.2


In [13]:
%env HIVE_VERSION     hive-2.3.5
%env HIVE_HOME apache-hive-2.3.5-bin

env: HIVE_VERSION=hive-2.3.5
env: HIVE_HOME=apache-hive-2.3.5-bin


## Java Home envvar

In [14]:
%env JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
# !echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 " >> ~/.bashrc
# !echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 " >> ~/.profile

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


## Setting the Hive path envvar

In [15]:
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME" >> ~/.bashrc
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME" >> ~/.profile

In [16]:
!echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh
!echo "export HADOOP_HOME=$(pwd)/$HADOOP_HOME"             >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME"                 >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh
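
The `export` lines above turn the relative `HADOOP_HOME` and `HIVE_HOME` values into absolute paths by prefixing the working directory (`$(pwd)`). The same path arithmetic can be sketched in Python. The helper name `absolute_home` is hypothetical, and `/home/jovyan` is assumed as the working directory only because it appears in the log paths of this notebook.

```python
import os


def absolute_home(relative_home, base_dir=None):
    """Mirror the shell expression "$(pwd)/$HIVE_HOME": prefix a base directory."""
    base = base_dir if base_dir is not None else os.getcwd()
    return os.path.join(base, relative_home)


# Assumed working directory; on another machine this will differ.
hive_home = absolute_home("apache-hive-2.3.5-bin", base_dir="/home/jovyan")
hadoop_home = absolute_home("hadoop-2.9.2", base_dir="/home/jovyan")

print(hive_home)
print(hadoop_home)
```

Calling `absolute_home("apache-hive-2.3.5-bin")` without `base_dir` uses the current directory, which is what `$(pwd)` does in the cells above.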

# Creating a database

In [17]:
#Verifying created databases
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jovyan/apache-hive-2.3.5-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jovyan/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://
19/07/18 02:52:45 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
+----------------+
| database_name  |
+----------------+
| default        |
| movielens      |
+----------------+
2 rows selected (2.576 seconds)
Beeline vers

In [18]:
#Creating a database
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "CREATE DATABASE IF NOT EXISTS movielens; SHOW DATABASES;"

Connecting to jdbc:hive2://
19/07/18 02:53:38 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
19/07/18 02:53:52 [HiveServer2-Background-Pool: Thread-29]: ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database movielens already exists)

In [None]:
#!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"
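
Every cell in this notebook calls Beeline with the same pattern: `-u` for the JDBC URL and `-e` for the statements to run. Each call is its own process and session, so a `USE` statement only affects the statements passed in the same `-e` string. A minimal sketch of a helper that assembles such a call is shown below; the function name `beeline_command` is hypothetical and the Hive directory is the one used in this notebook.

```python
import os


def beeline_command(hive_home, *statements):
    """Build the argument list for one beeline call.

    All statements are placed in a single -e string so that they run
    in the same session (for example, USE followed by SHOW TABLES).
    """
    sql = " ".join(s.strip().rstrip(";") + ";" for s in statements)
    beeline = os.path.join(".", hive_home, "bin", "beeline")
    return [beeline, "-u", "jdbc:hive2://", "-e", sql]


cmd = beeline_command("apache-hive-2.3.5-bin", "USE movielens", "SHOW TABLES;")
print(cmd)

# Running it requires the Hive installation set up in this notebook:
# import subprocess
# subprocess.run(cmd, check=True)
```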

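## Creating the ratings table

In [19]:
#Verifying tables
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "USE movielens; SHOW tables;"

#Creating table ratings
#Note: every beeline call opens a new session, so the USE statement above does not
#carry over; this table is therefore created in the default database.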
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "CREATE TABLE ratings(\
        userID  INT,\
        movieID INT,\
        rating  INT,\
        time    INT)\
ROW FORMAT DELIMITED\
FIELDS TERMINATED BY '\t'\
STORED AS TEXTFILE;\
SHOW tables;"

Connecting to jdbc:hive2://
19/07/18 02:57:36 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (2.137 seconds)
OK
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.638 seconds)
Beeline version 2.3.5 by Apache H
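
The `ratings` table declares four `INT` columns and a tab as the field delimiter. To make that layout concrete, here is a small sketch of how one tab-delimited line maps onto the four columns. The `Rating` type, the `parse_rating` helper, and the sample line are made up for illustration; they are not taken from `u.data`.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of the ratings table: userID, movieID, rating, time."""
    user_id: int
    movie_id: int
    rating: int
    time: int


def parse_rating(line):
    """Split a tab-delimited line into the four INT columns."""
    user_id, movie_id, rating, time = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(time))


# Made-up sample line, not real data from the dataset.
sample = "7\t42\t5\t880000000\n"
row = parse_rating(sample)
print(row)
```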

## Import data from local file system

In [20]:
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SELECT * FROM ratings;"

Connecting to jdbc:hive2://
19/07/18 03:00:38 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
+-----------------+------------------+-----------------+---------------+
| ratings.userid  | ratings.movieid  | ratings.rating  | ratings.time  |
+-----------

In [21]:
#Loading data
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "LOAD DATA LOCAL INPATH 'resources/examples/u.data'\
                                                OVERWRITE INTO TABLE ratings;"

Connecting to jdbc:hive2://
19/07/18 03:01:00 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Loading data to table default.ratings
19/07/18 03:01:14 [HiveServer2-Background-Pool: Thread-29]: WARN conf.HiveConf: HiveConf of name hive.internal.ss.authz.set

In [22]:
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SELECT * FROM ratings LIMIT 10;"

Connecting to jdbc:hive2://
19/07/18 03:01:29 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
+-----------------+------------------+-----------------+---------------+
| ratings.userid  | ratings.movieid  | ratings.rating  | ratings.time  |
+-----------

## Verifying the file created by Hive

In [28]:
!./$HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/
!./$HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/movielens.db/
!./$HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/ratings

Found 2 items
drwxrwxr-x   - jovyan supergroup          0 2019-07-18 02:03 /user/hive/warehouse/movielens.db
drwxrwxr-x   - jovyan supergroup          0 2019-07-18 03:01 /user/hive/warehouse/ratings
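
The listing shows two kinds of entries under the warehouse root: a `movielens.db` directory for the database and a `ratings` directory directly under the root for the table in the `default` database. A rough sketch of that naming rule follows. It assumes the default warehouse location, managed tables without a custom `LOCATION`, and lowercased names; the helper `table_dir` is hypothetical.

```python
import posixpath

# Warehouse root as it appears in the listing above.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def table_dir(table, database="default"):
    """Guess the HDFS directory of a managed table (assumes default layout)."""
    parts = [WAREHOUSE_ROOT]
    if database.lower() != "default":
        # Non-default databases get their own "<name>.db" directory.
        parts.append(database.lower() + ".db")
    parts.append(table.lower())
    return posixpath.join(*parts)


print(table_dir("ratings"))
print(table_dir("ratings", database="movielens"))
```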


## Finding the most popular movie

In [29]:
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SELECT movieID, COUNT(movieID) as ratingCount\
                                                FROM ratings\
                                                GROUP BY movieID\
                                                ORDER BY ratingCount DESC\
                                                LIMIT 10;"


Connecting to jdbc:hive2://
19/07/18 03:06:03 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
19/07/18 03:06:17 [HiveServer2-Background-Pool: Thread-28]: WARN ql.Driver: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. C
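
The query above groups the ratings by `movieID`, counts the rows in each group, sorts by that count in descending order, and keeps the first ten. The same steps can be sketched in plain Python on a toy list. The tuples below are invented `(userID, movieID, rating)` values, and `most_rated` is a hypothetical helper.

```python
from collections import Counter

# Invented (userID, movieID, rating) rows, for illustration only.
ratings = [
    (1, 10, 4), (2, 10, 5), (3, 10, 3),
    (1, 20, 5), (2, 20, 2),
    (1, 30, 1),
]


def most_rated(rows, limit=10):
    """GROUP BY movieID, COUNT(*), ORDER BY count DESC, LIMIT limit."""
    counts = Counter(movie_id for _, movie_id, _ in rows)
    return counts.most_common(limit)


top = most_rated(ratings, limit=2)
print(top)
```

Note that this ranks movies by how many ratings they received, not by how high those ratings are.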

## Finding the name of the most popular movie

### Creating a new table that contains the movie titles

In [30]:
#Creating a new table called movieNames
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "CREATE TABLE movieNames(\
                                                movieID  INT,\
                                                title STRING)\
                                                ROW FORMAT DELIMITED\
                                                FIELDS TERMINATED BY '|'\
                                                STORED AS TEXTFILE;"

#Loading data into movieNames table
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "LOAD DATA LOCAL INPATH 'resources/examples/u.item'\
                                                OVERWRITE INTO TABLE movieNames;"


Connecting to jdbc:hive2://
19/07/18 03:07:45 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (2.908 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://
19/07/18 03:07:56 [shutdown-hook-0]: WARN thrift.ThriftCLIServ
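
The `movieNames` table declares only two columns, while each pipe-delimited line of `u.item` carries more fields than that. With a delimited text table, the fields beyond the declared columns are simply not read. A sketch of the equivalent parsing step is shown below; the sample line and the helper `parse_movie_name` are invented for illustration.

```python
def parse_movie_name(line):
    """Keep the first two pipe-delimited fields: movieID and title."""
    fields = line.rstrip("\n").split("|")
    return int(fields[0]), fields[1]


# Invented sample line with extra trailing fields, not real dataset content.
sample = "1|Example Movie (1995)|01-Jan-1995||http://example.org/movie|0|0|1\n"
movie_id, title = parse_movie_name(sample)
print(movie_id, title)
```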

### Creating a view of the movies' popularity

In [31]:
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "CREATE VIEW topMoviesIds AS\
                                                SELECT movieID, COUNT(movieID) as ratingCount\
                                                FROM ratings\
                                                GROUP BY movieID\
                                                ORDER BY ratingCount DESC;"

Connecting to jdbc:hive2://
19/07/18 03:10:22 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
OK
No rows affected (5.768 seconds)
Beeline version 2.3.5 by Apache Hive
Closing: 0: jdbc:hive2://
19/07/18 03:10:37 [shutdown-hook-0]: WARN thrift.ThriftCLIServ
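
A view created with `CREATE VIEW` saves the query, not its result: the counts are recomputed from `ratings` whenever the view is queried. A loose Python analogy is a function that is evaluated on each call rather than a list computed once. The data and the function name `top_movies_ids` below are invented for illustration.

```python
from collections import Counter

# Invented (userID, movieID, rating) rows.
ratings = [(1, 10, 4), (2, 10, 5), (1, 20, 3)]


def top_movies_ids():
    """Plays the role of the view: recomputed from `ratings` on every call."""
    return Counter(movie_id for _, movie_id, _ in ratings).most_common()


before = top_movies_ids()

# New ratings arrive in the underlying table.
ratings.append((3, 20, 5))
ratings.append((4, 20, 1))

after = top_movies_ids()
print(before)
print(after)
```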

### Finding the name of the most popular movie

In [34]:
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SELECT n.title, t.ratingCount\
                                                FROM topMoviesIds t JOIN movieNames n ON t.movieID = n.movieID\
                                                ORDER BY t.ratingCount DESC\
                                                LIMIT 10;"

Connecting to jdbc:hive2://
19/07/18 03:11:45 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.5)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
19/07/18 03:11:57 [11412028-0111-4660-8390-ca073161a336 main]: WARN parse.RowResolver: Duplicate column info for ratings.movieid was overwritten in RowResolver m
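
The last step joins the per-movie counts with the titles on `movieID`. A small Python sketch of that inner join is given below, with invented data and a hypothetical helper `top_titles`. The sort is applied after the join because a SQL result has no guaranteed order unless the outermost query has its own `ORDER BY`.

```python
# Invented per-movie rating counts (movieID -> count) and titles (movieID -> title).
rating_counts = {10: 3, 20: 2, 30: 1}
movie_names = {10: "Movie A", 20: "Movie B", 40: "Movie D"}


def top_titles(counts, names, limit=10):
    """Inner join on movieID, then ORDER BY count DESC and LIMIT."""
    joined = [(names[m], c) for m, c in counts.items() if m in names]
    joined.sort(key=lambda pair: pair[1], reverse=True)
    return joined[:limit]


result = top_titles(rating_counts, movie_names)
print(result)
```

Movies 30 and 40 drop out because an inner join keeps only IDs present on both sides.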

# It's your turn

## Find the movie with the highest average rating

What do you think about this result?

Select the text below to see the result:

><font color='white'>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;
</font>


## Find the movie with the highest average rating, only considering movies with more than 10 ratings

Select the text below to see the result:

><font color='white'>
SOLUTION 1: <br>
SELECT r.movieID, m.title, AVG(r.rating) as ratingAVG, COUNT(r.rating) as ratingCount<br>
FROM ratings r JOIN movieNames m ON r.movieID = m.movieID<br>
GROUP BY r.movieID, m.title<br>
HAVING ratingCount > 10<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;<br><br>
SOLUTION 2:<br>
CREATE VIEW avgMoviesIds AS<br>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC;<br>
<br>
SELECT n.title as title, a.ratingAVG as average, t.ratingCount as ratingCount<br>
FROM avgMoviesIds a JOIN topMoviesIds t ON a.movieID = t.movieID<br>
    JOIN movieNames n ON t.movieID = n.movieID<br>
WHERE t.ratingCount > 10<br>
ORDER BY average DESC<br>
LIMIT 10;
</font>
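
The effect of the minimum-count condition can also be seen on a toy example. The sketch below, with invented data and a hypothetical helper `top_average`, computes the average rating per movie and optionally keeps only movies with more than `min_count` ratings, which mirrors the `HAVING` / `WHERE` condition in the solutions.

```python
from collections import defaultdict


def top_average(rows, min_count=0, limit=10):
    """Average rating per movie, keeping movies with more than min_count ratings."""
    by_movie = defaultdict(list)
    for _, movie_id, rating in rows:
        by_movie[movie_id].append(rating)
    stats = [
        (movie_id, sum(values) / len(values), len(values))
        for movie_id, values in by_movie.items()
        if len(values) > min_count
    ]
    stats.sort(key=lambda s: s[1], reverse=True)  # ORDER BY average DESC
    return stats[:limit]


# Invented data: movie 10 has a single 5-star rating, movie 20 has twelve 4-star ratings.
ratings = [(1, 10, 5)] + [(user, 20, 4) for user in range(1, 13)]

unfiltered = top_average(ratings)
filtered = top_average(ratings, min_count=10)
print(unfiltered)
print(filtered)
```

Without the threshold, a movie with one perfect rating outranks a movie that many users rated well.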
