# Introduction to Big Data Modern Technologies course

## TOPIC 4: Modern Hadoop
### Part 2

### 1. Libraries

In [None]:
import os
import subprocess

### 2. Preprocessed data

In [None]:
!hdfs dfs -ls /

In [None]:
!hdfs dfs -ls /jovyan

### 3. Import data to Hive

[Apache Hive](https://hive.apache.org/) is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage __using SQL__.

__NOTE__ that `LOAD DATA INPATH` moves the source file rather than copying it: after loading, the file no longer exists at its original HDFS location and instead resides in the Hive data warehouse directory, or in the `LOCATION` specified when the table was created.

#### 3.1. Users table

In [None]:
!hdfs dfs -ls /jovyan/users

In [None]:
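def hdfs_dirs(path, filter_str=''):
    """
    Returns the paths listed by `hdfs dfs -ls path` as a list.
    Paths may be filtered by the `filter_str` parameter,
    e.g. `filter_str='csv'` keeps only entries containing `csv`.

    """
    process = subprocess.Popen(
        ['hdfs', 'dfs', '-ls', path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    out, err = process.communicate()
    lines = out.decode('utf-8').split('\n')
    # Keep only entry lines, which start with a permission string such as
    # `-rw-r--r--` or `drwxr-xr-x`; this drops blank lines and the
    # "Found N items" header
    lines = [x for x in lines if x[:1] in ('-', 'd')]
    # The path is the last space-separated field of each entry line
    dirs = [x.split(' ')[-1] for x in lines if filter_str in x]
    return dirs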

In [None]:
users_path = '/jovyan/users'

In [None]:
hdfs_dirs(
    path=users_path, 
    filter_str='csv'
)
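The parsing step inside `hdfs_dirs` can be tried without a cluster. The sketch below is a minimal illustration, not part of the course code: `parse_hdfs_ls` and `SAMPLE_LISTING` are hypothetical names, and the sample listing is made-up text in the usual `hdfs dfs -ls` layout (eight whitespace-separated fields, with the path last).

```python
# Made-up sample of what `hdfs dfs -ls /jovyan/users` might print (assumption).
SAMPLE_LISTING = (
    "Found 2 items\n"
    "-rw-r--r--   3 jovyan supergroup       1234 2024-03-01 10:15 /jovyan/users/part-00000.csv\n"
    "-rw-r--r--   3 jovyan supergroup          0 2024-03-01 10:15 /jovyan/users/_SUCCESS\n"
)


def parse_hdfs_ls(listing, filter_str=''):
    """Extract paths from the text printed by `hdfs dfs -ls`."""
    paths = []
    for line in listing.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Blank lines and the "Found N items" header have fewer than 8 fields.
        if len(fields) < 8:
            continue
        path = fields[7]
        if filter_str in path:
            paths.append(path)
    return paths


print(parse_hdfs_ls(SAMPLE_LISTING, filter_str='csv'))
```

Unlike a plain `split(' ')[-1]`, this version skips the header and empty lines even when no filter is given.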

In [None]:
users_file = hdfs_dirs(
    path=users_path, 
    filter_str='csv'
)[0].split('/')[-1]

In [None]:
users_file

In [None]:
!hdfs dfs -head {users_path}/{users_file}

Read about [Hive data types](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types) first.

In [None]:
!hive -e \
    "CREATE TABLE users ( \
        jh_email STRING, \
        jh_login STRING, \
        jh_name STRING) \
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
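For the tables left as home assignments, it may help to generate the `CREATE TABLE` statement from a list of column names and Hive types instead of typing it by hand. The following is a sketch under assumptions: `create_table_ddl` is a hypothetical helper, and it adds `IF NOT EXISTS` so that re-running the cell does not fail on an existing table.

```python
def create_table_ddl(table, columns, delimiter=','):
    """Build a Hive CREATE TABLE statement for a delimited text file.

    `columns` is a list of (column_name, hive_type) pairs.
    """
    cols = ', '.join(f'{name} {hive_type}' for name, hive_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'"
    )


# Same columns as the users table created above.
users_columns = [
    ('jh_email', 'STRING'),
    ('jh_login', 'STRING'),
    ('jh_name', 'STRING'),
]
ddl = create_table_ddl('users', users_columns)
print(ddl)
```

In a notebook cell the resulting string could then be passed to Hive, for example with `!hive -e "{ddl}"`.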

In [None]:
!touch result.txt

In [None]:
!echo ---------------------------- >> result.txt

In [None]:
!hive -S -e "SELECT * FROM users LIMIT 5" >> result.txt

In [None]:
!hive -e "LOAD DATA INPATH '{users_path}/{users_file}' OVERWRITE INTO TABLE users"

In [None]:
!echo ---------------------------- >> result.txt

In [None]:
!hive -S -e "SELECT * FROM users LIMIT 5" >> result.txt
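The Hive calls in this subsection all follow the same pattern (`hive`, an optional `-S` flag, then `-e` with a query), so they can also be driven from Python with `subprocess`, which is already imported. This is only a sketch: `hive_argv`, `load_data_query`, and `run_hive` are hypothetical helpers, and `users.csv` is a placeholder file name, not the real file in `/jovyan/users`.

```python
import subprocess


def hive_argv(query, silent=False):
    """Build the argument list for a `hive -e` call (`-S` is silent mode)."""
    argv = ['hive']
    if silent:
        argv.append('-S')
    argv += ['-e', query]
    return argv


def load_data_query(path, file_name, table, overwrite=True):
    """Build a LOAD DATA INPATH statement; the HDFS source file is moved."""
    mode = 'OVERWRITE ' if overwrite else ''
    return f"LOAD DATA INPATH '{path}/{file_name}' {mode}INTO TABLE {table}"


def run_hive(query, silent=False):
    """Run a query through the hive CLI and return its standard output."""
    result = subprocess.run(
        hive_argv(query, silent),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if hive exits with an error
    )
    return result.stdout


# Placeholder file name (assumption); nothing is executed here.
query = load_data_query('/jovyan/users', 'users.csv', 'users')
print(hive_argv(query))
```

Calling `run_hive(query)` would perform the load on a machine where the `hive` CLI is available.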

#### 3.2. Instances table

### <font color='red'>HOME ASSIGNMENT</font>

#### 3.3. Events table

### <font color='red'>HOME ASSIGNMENT</font>

#### 3.4. Logs table

In [None]:
!hdfs dfs -ls /jovyan/logs

In [None]:
logs_path = '/jovyan/logs'

In [None]:
logs_file = hdfs_dirs(
    path=logs_path, 
    filter_str='csv'
)[0].split('/')[-1]

In [None]:
logs_file

In [None]:
!hdfs dfs -head {logs_path}/{logs_file}

In [None]:
!hive -e \
    "CREATE TABLE logs ( \
        jh_timestamp TIMESTAMP, \
        jh_hub STRING, \
        jh_event_code INT, \
        jh_event_type STRING, \
        jh_log STRING, \
        jh_login STRING) \
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"

In [None]:
!hive -e "LOAD DATA INPATH '{logs_path}/{logs_file}' OVERWRITE INTO TABLE logs"

In [None]:
!echo ---------------------------- >> result.txt

In [None]:
!hive -S -e "SELECT * FROM logs LIMIT 5" >> result.txt

### 4. Test Hive SQL queries

In [None]:
!echo ---------------------------- >> result.txt

In [None]:
!hive -e \
    "SELECT \
        ls.jh_timestamp, \
        ls.jh_event_code, \
        us.jh_login, \
        us.jh_name, \
        us.jh_email, \
        ls.jh_log \
    FROM logs AS ls \
    LEFT JOIN users AS us ON ls.jh_login = us.jh_login \
    LIMIT 5" >> result.txt
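The query above uses a `LEFT JOIN`, so every log row is kept even when its `jh_login` has no match in `users`; filtering on a `users` column in a `WHERE` clause then discards the unmatched rows again. The sketch below illustrates this with Python's built-in `sqlite3` module rather than Hive, on made-up rows (`alice`, `ghost`, and the `example.com` address are invented); the join semantics shown are standard SQL.

```python
import sqlite3

# In-memory SQLite database with tiny, made-up versions of the two tables.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE users (jh_email TEXT, jh_login TEXT, jh_name TEXT);
    CREATE TABLE logs (jh_event_code INTEGER, jh_login TEXT);
    INSERT INTO users VALUES ('alice@example.com', 'alice', 'Alice');
    INSERT INTO logs VALUES (1, 'alice'), (2, 'alice'), (3, 'ghost');
""")

join_sql = "FROM logs AS ls LEFT JOIN users AS us ON ls.jh_login = us.jh_login"

# All log rows survive the LEFT JOIN, matched or not.
all_rows = conn.execute(f"SELECT COUNT(*) {join_sql}").fetchone()[0]

# Filtering on a users column keeps only the rows that found a match.
matched = conn.execute(
    f"SELECT COUNT(*) {join_sql} WHERE us.jh_email = ?",
    ('alice@example.com',),
).fetchone()[0]

# Log rows whose login is unknown have NULL in every users column.
unmatched = conn.execute(
    f"SELECT COUNT(*) {join_sql} WHERE us.jh_login IS NULL"
).fetchone()[0]

print(all_rows, matched, unmatched)  # prints: 3 2 1
conn.close()
```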

In [None]:
!echo ---------------------------- >> result.txt
!hive -e \
    "SELECT COUNT(*) FROM logs AS ls \
    LEFT JOIN users AS us ON ls.jh_login = us.jh_login \
    WHERE us.jh_email = 'vgarshin@gsom.spbu.ru'" >> result.txt

### 5. How to drop tables

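The answer is: <font color='red'>VERY CAREFULLY!</font> The tables in this notebook are managed tables (created without `EXTERNAL`), so `DROP TABLE` removes their data files as well as their metadata.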

In [None]:
!hive -e "DROP TABLE IF EXISTS logs"

### 6. Home assignment

Your home assignment for this part is:
1. Using the PySpark data processing script from part 1 and the log data file `~/__DATA/IBDT_Spring_2024/topic_1/jhub_logs.csv`, build a full pipeline script.
2. Write a few SQL queries (see below).

Run your SQL queries from HA #1 against the Hive database to answer the following questions:
- How many times was jhub restarted? (HINT: find all unique hub names; each restart creates a new instance with a new name.)
- How many users are there in Jupyter?
- Sort all event types from most frequent to least frequent.
- Find the users (name, email) with the most and the least activity in Jupyter. (HINT: more logs means more activity.)

Check that the answers are the same as for HA #1.