# Introduction

In this lecture we will use the MovieLens 100k dataset, which contains 100,000 movie ratings.

MovieLens site: https://grouplens.org/datasets/movielens/

100k dataset: http://files.grouplens.org/datasets/movielens/ml-100k.zip


## Setting up Hadoop and Hive envvars

In [1]:
%env HADOOP_VERSION     2.9.2
%env HADOOP_HOME hadoop-2.9.2

env: HADOOP_VERSION=2.9.2
env: HADOOP_HOME=hadoop-2.9.2


In [2]:
%env HIVE_VERSION     hive-2.3.5
%env HIVE_HOME apache-hive-2.3.5-bin

env: HIVE_VERSION=hive-2.3.5
env: HIVE_HOME=apache-hive-2.3.5-bin


## Java Home envvar

In [3]:
%env JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
# !echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 " >> ~/.bashrc
# !echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 " >> ~/.profile

env: JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64


## Setting the Hive path envvar

In [4]:
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME" >> ~/.bashrc
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME" >> ~/.profile

In [5]:
!echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64" >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh
!echo "export HADOOP_HOME=$(pwd)/$HADOOP_HOME"             >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh
!echo "export HIVE_HOME=$(pwd)/$HIVE_HOME"                 >> $(pwd)/${HIVE_HOME}/bin/hive-env.sh

# Creating a database

In [None]:
#Verifying created databases
!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"

#Creating a database
#!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "CREATE DATABASE IF NOT EXISTS movielens;"

#!./$HIVE_HOME/bin/beeline -u "jdbc:hive2://" -e "SHOW DATABASES;"


SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jovyan/apache-hive-2.3.5-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jovyan/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://
19/07/18 02:34:09 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.


## Creating the ratings table in the movielens database

```sql
USE movielens;

-- Listing existing tables
SHOW TABLES;

-- Creating the ratings table
CREATE TABLE ratings(
        userID  INT,
        movieID INT,
        rating  INT,
        time    INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

SHOW TABLES;
```
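To make the `FIELDS TERMINATED BY '\t'` clause concrete, the following Python sketch shows how one tab-separated line maps onto the four `INT` columns of `ratings`. The `Rating` type, the `parse_rating` helper, and the sample line are illustrative assumptions; they mimic the table layout and are not part of Hive.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of the ratings table, mirroring the CREATE TABLE columns."""
    userID: int
    movieID: int
    rating: int
    time: int  # Unix timestamp in seconds


def parse_rating(line: str) -> Rating:
    """Split a tab-delimited line into the four INT columns (hypothetical helper)."""
    user_id, movie_id, rating, time = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(time))


# Illustrative sample line in the layout the table expects.
sample = "196\t242\t3\t881250949\n"
row = parse_rating(sample)
print(row)
```

Each field lands in the column at the same position, which is why the column order in `CREATE TABLE` has to match the field order in the file.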

## Importing data from the local file system

```sql
SELECT * FROM ratings;

-- Loading data
LOAD DATA LOCAL INPATH 'resources/examples/u.data'
OVERWRITE INTO TABLE ratings;

SELECT * FROM ratings LIMIT 10;
```

## Verifying the file created by Hive

In [None]:
%env HADOOP_VERSION     2.9.2
%env HADOOP_HOME hadoop-2.9.2

!./$HADOOP_HOME/bin/hdfs dfs -ls /user/hive/warehouse/movielens.db/ratings/

## Finding the most popular movie

```sql
SELECT movieID, COUNT(movieID) as ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC
LIMIT 10;
```
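For readers who want to check their intuition about the query above, here is a minimal Python sketch of the same logic: count ratings per `movieID`, sort by count in descending order, and keep the first few. The rows and the `top_movies` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical (userID, movieID, rating, time) rows, for illustration only.
rows = [
    (1, 50, 5, 100),
    (2, 50, 4, 101),
    (3, 50, 3, 102),
    (1, 20, 2, 103),
    (2, 20, 5, 104),
    (3, 30, 1, 105),
]


def top_movies(rows, limit=10):
    """Return (movieID, ratingCount) pairs, most-rated first."""
    # GROUP BY movieID + COUNT(movieID)
    counts = Counter(movie_id for _, movie_id, _, _ in rows)
    # ORDER BY ratingCount DESC + LIMIT
    return counts.most_common(limit)


print(top_movies(rows))
```

Note that "most popular" here means "most often rated", not "best rated": the `rating` column is never read.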

## Finding the name of the most popular movie

### Creating a new table that contains the movie titles

```sql
-- Creating a new table called movieNames
CREATE TABLE movieNames(
        movieID  INT,
        title STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Loading data into the movieNames table
LOAD DATA LOCAL INPATH 'resources/examples/u.item'
OVERWRITE INTO TABLE movieNames;
```
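Each line of `u.item` holds more `|`-separated fields than `movieNames` has columns. With Hive's delimited text format, the leading fields fill the declared columns and the remaining fields are ignored. The Python sketch below imitates that behaviour; the `parse_movie_name` helper and the shortened sample line are assumptions made for illustration.

```python
def parse_movie_name(line: str):
    """Keep only the first two '|'-separated fields as (movieID, title)."""
    fields = line.rstrip("\n").split("|")
    # Fields beyond the second one have no matching column and are dropped.
    return int(fields[0]), fields[1]


# Shortened, illustrative line in the u.item layout.
sample = "1|Toy Story (1995)|01-Jan-1995||0|0|1\n"
print(parse_movie_name(sample))
```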

### Creating a view to store the movies' popularity

```sql
CREATE VIEW topMoviesIds AS
SELECT movieID, COUNT(movieID) as ratingCount
FROM ratings
GROUP BY movieID
ORDER BY ratingCount DESC;
```

### Finding the name of the most popular movie

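```sql
-- A join does not have to preserve the view's ordering, so sort again here
SELECT n.title, t.ratingCount
FROM topMoviesIds t JOIN movieNames n ON t.movieID = n.movieID
ORDER BY t.ratingCount DESC
LIMIT 10;
```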
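The join pairs every `movieID` in the view with its title. A result set is only reliably ordered when the outermost query sorts it, so this Python sketch sorts explicitly after joining instead of relying on the order of its input. The dictionaries, the placeholder titles, and the `top_titles` helper are hypothetical.

```python
# Hypothetical rating counts (the shape topMoviesIds produces) and titles.
rating_counts = {50: 3, 20: 2, 30: 1}
titles = {20: "Movie B", 30: "Movie C", 50: "Movie A"}


def top_titles(rating_counts, titles, limit=10):
    """Inner-join counts with titles on movieID, then sort by count descending."""
    joined = [
        (titles[movie_id], count)
        for movie_id, count in rating_counts.items()
        if movie_id in titles  # inner join: drop IDs that have no title
    ]
    joined.sort(key=lambda pair: pair[1], reverse=True)
    return joined[:limit]


print(top_titles(rating_counts, titles))
```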

# It's your turn

## Create new tables and import u.data and u.item from HDFS

## Using your tables:

### Find the movie with the highest average rating

What do you think about this result?

### Find the movie with the highest average rating, only considering movies with more than 10 ratings