# Introduction

This demo uses the [MovieLens dataset](https://grouplens.org/datasets/movielens/), which is available in two sizes. The cells below download the small one.

* **[Small](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip):** 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.
* **[Full](http://files.grouplens.org/datasets/movielens/ml-latest.zip):** 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Last updated 9/2018.



In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -q --show-progress
!unzip ml-latest-small.zip
!rm ml-latest-small.zip

In [None]:
%env DATASET_PATH /home/jovyan/labs/lab3-hive/ml-latest-small

# Creating a database

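We will use [Beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline%E2%80%93CommandLineShell), Hive's command-line shell. The `jdbc:hive2://` URL with no host or port runs Beeline in embedded mode, so no separate HiveServer2 process is needed.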

Listing databases:

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "CREATE DATABASE IF NOT EXISTS movielens;"

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

## Creating the ratings table in the movielens database

In [None]:
#Verifying tables
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; SHOW tables;"

In [None]:
#Creating table ratings
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; CREATE TABLE ratings(                                           \
                                                                     userID  INT,                              \
                                                                     movieID INT,                              \
                                                                     rating  DOUBLE,                           \
                                                                     time    INT )                             \
                                                            ROW FORMAT DELIMITED                               \
                                                            FIELDS TERMINATED BY ','                           \
                                                            STORED AS TEXTFILE                                 \
                                                            tblproperties(\"skip.header.line.count\"=\"1\");"
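
The table definition has to line up with the layout of `ratings.csv`: one header line, then comma-separated `userId,movieId,rating,timestamp` values, with ratings given in half-star steps. The following is a minimal Python sketch of that mapping, not what Hive does internally. The sample rows and the `parse_ratings` helper are illustrative assumptions.

```python
import csv
import io

# Illustrative rows in the layout of ratings.csv (one header line, then data).
SAMPLE = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.5,964981247
2,318,3.0,1445714835
"""

def parse_ratings(text, skip_header_lines=1):
    """Split comma-delimited lines into typed tuples, skipping the header."""
    rows = []
    reader = csv.reader(io.StringIO(text))
    for line_no, fields in enumerate(reader):
        if line_no < skip_header_lines:
            continue  # plays the role of skip.header.line.count = 1
        user_id, movie_id, rating, ts = fields
        # Ratings can be fractional (e.g. 4.5), so they are parsed as floats.
        rows.append((int(user_id), int(movie_id), float(rating), int(ts)))
    return rows

parsed = parse_ratings(SAMPLE)
print(parsed)
```

Without the header skip, the first line would be read as data and its text values would not convert to numbers.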

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; SHOW tables;"

## Importing data from the local file system

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; SELECT * FROM ratings;"

In [None]:
#Loading data
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens;  LOAD DATA LOCAL INPATH '$DATASET_PATH/ratings.csv'\
                                                OVERWRITE INTO TABLE ratings;"

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; SELECT * FROM ratings LIMIT 10;"

## Verifying the file created by Hive

In [None]:
!hdfs dfs -ls /user/hive/warehouse/
!hdfs dfs -ls /user/hive/warehouse/movielens.db/
!hdfs dfs -ls /user/hive/warehouse/movielens.db/ratings

## Finding the most popular movie

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; SELECT movieID, COUNT(movieID) as ratingCount  \
                                                                                FROM ratings                   \
                                                                                GROUP BY movieID               \
                                                                                ORDER BY ratingCount DESC      \
                                                                                LIMIT 10;"
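
To make the query's logic concrete, here is a small in-memory Python sketch of the same steps: group by movie, count the rows per group, sort by count in descending order, and keep the first few. The rows are hypothetical and the `top_movies` helper is an assumption made for illustration; Hive runs this as a distributed job rather than a single loop.

```python
from collections import Counter

# Hypothetical (userID, movieID, rating) rows, not taken from the dataset.
ratings = [
    (1, 10, 4.0), (2, 10, 3.5), (3, 10, 5.0),
    (1, 20, 2.0), (2, 20, 4.0),
    (3, 30, 4.5),
]

def top_movies(rows, limit=10):
    """Count ratings per movie and return the most-rated ones first."""
    counts = Counter(movie_id for _, movie_id, _ in rows)  # GROUP BY + COUNT
    return counts.most_common(limit)                       # ORDER BY DESC + LIMIT

top = top_movies(ratings, limit=2)
print(top)
```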

## Finding the name of the most popular movie

### Creating a new table that contains the movie titles

In [None]:
#Creating a new table called movieNames
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens; CREATE TABLE movieNames(                                                   \
                                                                        movieID  INT,                                     \
                                                                        title STRING)                                     \
                                                                        ROW FORMAT DELIMITED                              \
                                                                        FIELDS TERMINATED BY ','                          \
                                                                        STORED AS TEXTFILE                                \
                                                                        tblproperties(\"skip.header.line.count\"=\"1\");"                                                        

In [None]:
#Loading data into movieNames table
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens;  LOAD DATA LOCAL INPATH '$DATASET_PATH/movies.csv'       \
                                                OVERWRITE INTO TABLE movieNames;"
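
One caveat with `FIELDS TERMINATED BY ','`: in `movies.csv`, titles that contain a comma are wrapped in double quotes, and a plain delimiter split does not know about quoting. The Python sketch below compares a naive split with a CSV-aware parser on one illustrative line in the `movieId,title,genres` layout. It only mimics the splitting behaviour and is not Hive's own code.

```python
import csv
import io

# Illustrative line in the movieId,title,genres layout of movies.csv.
line = '11,"American President, The (1995)",Comedy|Drama|Romance'

# Splitting on every comma cuts the quoted title in two.
naive = line.split(",")

# A CSV-aware parser keeps the quoted title as a single field.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

With the naive split, the second field holds only the first part of the title, which is what a two-column delimited table would store. If full titles matter, one option is Hive's `org.apache.hadoop.hive.serde2.OpenCSVSerde`, which understands quoted fields but reads every column as a string.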

### Creating a view to store the movies' popularity

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens;  CREATE VIEW topMoviesIds AS                                          \
                                                SELECT movieID, COUNT(movieID) as ratingCount        \
                                                FROM ratings                                         \
                                                GROUP BY movieID                                     \
                                                ORDER BY ratingCount DESC;"

### Finding the name of the most popular movie

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens;  SELECT n.title, ratingCount                                                          \
                                                FROM topMoviesIds t JOIN movieNames n ON t.movieID = n.movieID       \
                                                ORDER BY ratingCount DESC                                            \
                                                LIMIT 10;"
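
A sort applied inside a view is not necessarily kept once the view is joined to another table, so the final result is only a reliable "top 10" if it is sorted after the join. The Python sketch below shows the join-then-sort order of operations on hypothetical data. The dictionaries and the `top_titles` helper are assumptions made for illustration.

```python
# Hypothetical per-movie rating counts (the role of the topMoviesIds view).
rating_counts = {30: 1, 10: 3, 20: 2}

# Hypothetical movieID -> title lookup (the role of the movieNames table).
titles = {10: "Movie A", 20: "Movie B", 30: "Movie C", 40: "Movie D"}

def top_titles(counts, names, limit=10):
    """Inner-join counts with titles, then sort the joined rows by count."""
    joined = [(names[m], c) for m, c in counts.items() if m in names]
    joined.sort(key=lambda pair: pair[1], reverse=True)  # sort after the join
    return joined[:limit]

print(top_titles(rating_counts, titles, limit=2))
```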

# It's your turn

## Find the movie with the highest average rating

In [None]:
!cd ~/resources/local/hive-2.3.9/bin/ && ./beeline -u "jdbc:hive2://" -e "USE movielens;  CREATE VIEW avgMoviesIds AS \
    SELECT movieID, AVG(rating) as ratingAVG \
    FROM ratings \
    GROUP BY movieID \
    ORDER BY ratingAVG DESC; \
    SELECT n.title as titles, a.ratingAVG as average, t.ratingCount as rating \
    FROM avgMoviesIds a JOIN topMoviesIds t ON a.movieID = t.movieID \
    JOIN movieNames n ON t.movieID = n.movieID \
    WHERE t.ratingCount > 10 \
    ORDER BY average DESC \
    LIMIT 10;"

What do you think about this result?

#Select to see the result

<font color='white'>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;
</font>


## Find the movie with the highest average rating, only considering movies with more than 10 ratings

#Select to see the result

<font color='white'>
SOLUTION 1: <br>
SELECT r.movieID, m.title, AVG(r.rating) as ratingAVG, COUNT(r.rating) as ratingCount<br>
FROM ratings r JOIN movieNames m ON r.movieID = m.movieID<br>
GROUP BY r.movieID, m.title<br>
HAVING ratingCount > 10<br>
ORDER BY ratingAVG DESC<br>
LIMIT 10;<br><br>
SOLUTION 2:<br>
CREATE VIEW avgMoviesIds AS<br>
SELECT movieID, AVG(rating) as ratingAVG<br>
FROM ratings<br>
GROUP BY movieID<br>
ORDER BY ratingAVG DESC;<br>
<br>
SELECT n.title as titles, a.ratingAVG as average, t.ratingCount as rating<br>
FROM avgMoviesIds a JOIN topMoviesIds t ON a.movieID = t.movieID<br>
    JOIN movieNames n ON t.movieID = n.movieID<br>
WHERE t.ratingCount >10<br>
ORDER BY average DESC<br>
LIMIT 10;
</font>
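
To see why the rating-count filter changes the answer, the following Python sketch computes per-movie averages on a few hypothetical rows, with and without a minimum number of ratings. The data and the `best_rated` helper are assumptions made for illustration, and the threshold is lowered from 10 to 1 to fit the tiny sample.

```python
from collections import defaultdict

# Hypothetical (userID, movieID, rating) rows, not taken from the dataset.
ratings = [
    (1, 10, 4.0), (2, 10, 4.0), (3, 10, 5.0),
    (1, 20, 5.0),
    (2, 30, 3.0), (3, 30, 4.0),
]

def best_rated(rows, min_ratings):
    """Return (movieID, average, count) for movies with count > min_ratings,
    sorted by average in descending order."""
    totals = defaultdict(lambda: [0.0, 0])  # movieID -> [sum, count]
    for _, movie_id, rating in rows:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    result = [(m, s / n, n) for m, (s, n) in totals.items() if n > min_ratings]
    result.sort(key=lambda row: row[1], reverse=True)
    return result

print(best_rated(ratings, min_ratings=0))  # a single 5.0 rating wins
print(best_rated(ratings, min_ratings=1))  # movies with one rating are dropped
```

Without the filter, a movie with a single perfect rating tops the list. Requiring more ratings makes the average more meaningful.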
