# Hive
We use the *PyHive* library: https://github.com/dropbox/PyHive

## Resources
* [HQL cheat sheet](http://hortonworks.com/wp-content/uploads/2016/05/Hortonworks.CheatSheet.SQLtoHive.pdf)
* [Hive reference](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)

## Install
Install *PyHive* with *conda*. On Colab, `condacolab` has to be installed first.

In [1]:
!pip install -q condacolab 

In [2]:
import condacolab 
condacolab.install() 

✨🍰✨ Everything looks OK!


In [3]:
!conda install -y pyhive sasl

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [4]:
#import libs
from pyhive import hive
from TCLIService.ttypes import TOperationState

In [5]:
# try to connect
server="ec2-54-155-223-96.eu-west-1.compute.amazonaws.com"
cursor = hive.connect(server).cursor()

In [6]:
#show our databases
cursor.execute('show databases')

In [7]:
#get data from execution
cursor.fetchall()

[('default',)]

In [8]:
#use default db
cursor.execute('use default')

In [9]:
#show tables in db
cursor.execute('show tables')

In [10]:
cursor.fetchall()

[('employee',), ('employee2',), ('salary',)]

In [11]:
# get table layout
cursor.execute('describe employee')
cursor.fetchall()

[('employee_id', 'int', ''),
 ('birthday', 'date', ''),
 ('first_name', 'string', ''),
 ('family_name', 'string', ''),
 ('gender', 'char(1)', ''),
 ('work_day', 'date', '')]
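Each row returned for `describe` is a `(column name, type, comment)` tuple. As a minimal sketch, assuming the rows look exactly like the output above, they can be turned into a name-to-type lookup. The names `describe_rows` and `schema` are only illustrative.

```python
# Rows as returned by cursor.fetchall() after 'describe employee'
# (copied from the output above).
describe_rows = [
    ('employee_id', 'int', ''),
    ('birthday', 'date', ''),
    ('first_name', 'string', ''),
    ('family_name', 'string', ''),
    ('gender', 'char(1)', ''),
    ('work_day', 'date', ''),
]

# Map column name -> column type; the third field (the comment) is ignored.
schema = {name: col_type for name, col_type, _comment in describe_rows}

print(schema['gender'])
print(list(schema))
```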

In [12]:
#select data
cursor.execute('select * from employee')

In [13]:
#get data from selection
employee = cursor.fetchall()

In [14]:
#have a look
employee[:10]

[(10001, None, "'Georgi'", "'Facello'", "'", None),
 (10002, None, "'Bezalel'", "'Simmel'", "'", None),
 (10003, None, "'Parto'", "'Bamford'", "'", None),
 (10004, None, "'Chirstian'", "'Koblick'", "'", None),
 (10005, None, "'Kyoichi'", "'Maliniak'", "'", None),
 (10006, None, "'Anneke'", "'Preusig'", "'", None),
 (10007, None, "'Tzvetan'", "'Zielinski'", "'", None),
 (10008, None, "'Saniya'", "'Kalloufi'", "'", None),
 (10009, None, "'Sumant'", "'Peac'", "'", None),
 (10010, None, "'Duangkaew'", "'Piveteau'", "'", None)]
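The string fields above still carry literal single quotes (for example `"'Georgi'"`), presumably left over from how the table was loaded. A small sketch of cleaning them on the Python side follows; the helper `strip_quotes` is hypothetical and the sample rows are copied from the output above.

```python
# Two rows copied from the output of employee[:10] above.
sample_rows = [
    (10001, None, "'Georgi'", "'Facello'", "'", None),
    (10002, None, "'Bezalel'", "'Simmel'", "'", None),
]

def strip_quotes(value):
    """Remove surrounding single quotes from strings; leave other values untouched."""
    if isinstance(value, str):
        return value.strip("'")
    return value

# Apply the helper to every field of every row.
cleaned = [tuple(strip_quotes(v) for v in row) for row in sample_rows]
print(cleaned[0])
```

Note that the `gender` field, which holds only a single quote in this data, becomes an empty string after cleaning.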

In [15]:
# get salary table layout
cursor.execute('describe salary')
cursor.fetchall()

[('employee_id', 'int', ''),
 ('salary', 'int', ''),
 ('start_date', 'date', ''),
 ('end_date', 'date', '')]

In [16]:
#select data
cursor.execute('select * from salary')

In [17]:
#get data from selection
salary = cursor.fetchall()

In [18]:
salary[:10]

[(10001, 60117, None, None),
 (10001, 62102, None, None),
 (10001, 66074, None, None),
 (10001, 66596, None, None),
 (10001, 66961, None, None),
 (10001, 71046, None, None),
 (10001, 74333, None, None),
 (10001, 75286, None, None),
 (10001, 75994, None, None),
 (10001, 76884, None, None)]

## Ex 1
Get employees sorted by ``family_name``. Return first 10 entries.

In [19]:
# note: there is no ORDER BY here, so the rows come back unsorted
cursor.execute("select family_name from employee")

In [20]:
res = cursor.fetchall()

In [21]:
res[:10]

[("'Facello'",),
 ("'Simmel'",),
 ("'Bamford'",),
 ("'Koblick'",),
 ("'Maliniak'",),
 ("'Preusig'",),
 ("'Zielinski'",),
 ("'Kalloufi'",),
 ("'Peac'",),
 ("'Piveteau'",)]

Or limit the result to 10 rows in the query itself:

In [22]:
cursor.execute("SELECT family_name FROM employee LIMIT 10") 

In [23]:
cursor.fetchall()

[("'Facello'",),
 ("'Simmel'",),
 ("'Bamford'",),
 ("'Koblick'",),
 ("'Maliniak'",),
 ("'Preusig'",),
 ("'Zielinski'",),
 ("'Kalloufi'",),
 ("'Peac'",),
 ("'Piveteau'",)]
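Neither query above actually sorts the rows. A sketch of the sorted variant uses `ORDER BY` together with `LIMIT`. To keep it runnable without the Hive server, it uses an in-memory `sqlite3` database as a stand-in with a few toy rows, which is an assumption for illustration and not part of the original setup.

```python
import sqlite3

# In-memory stand-in for the Hive 'employee' table (toy rows, reduced columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (employee_id INTEGER, family_name TEXT)")
cur.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(10001, "Facello"), (10002, "Simmel"), (10003, "Bamford"), (10004, "Koblick")],
)

# Sort by family_name and keep only the first 10 rows.
query = "SELECT family_name FROM employee ORDER BY family_name LIMIT 10"
cur.execute(query)
sorted_names = cur.fetchall()
print(sorted_names)
```

The same `query` string should work when passed to `cursor.execute(query)` on the PyHive cursor, since `ORDER BY` and `LIMIT` are both part of HiveQL.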

## Ex 2
Get ``family_name`` and ``salary`` of employees sorted by salary. Return first 100 entries. Hint: you need to join both tables...

In [24]:
cursor.execute("describe salary")
cursor.fetchall()

[('employee_id', 'int', ''),
 ('salary', 'int', ''),
 ('start_date', 'date', ''),
 ('end_date', 'date', '')]

In [25]:
cursor.execute("describe employee")
cursor.fetchall()

[('employee_id', 'int', ''),
 ('birthday', 'date', ''),
 ('first_name', 'string', ''),
 ('family_name', 'string', ''),
 ('gender', 'char(1)', ''),
 ('work_day', 'date', '')]

In [30]:
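# implicit join on employee_id, highest salary first; there is no LIMIT, so all joined rows are fetched
cursor.execute("SELECT e.family_name, s.salary FROM employee as e, salary as s WHERE e.employee_id == s.employee_id ORDER BY s.salary DESC")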
data = cursor.fetchall()

In [31]:
data

[("'Pesch'", 158220),
 ("'Pesch'", 157821),
 ("'Whitcomb'", 155709),
 ("'Alameldin'", 155377),
 ("'Alameldin'", 155190),
 ("'Alameldin'", 154888),
 ("'Alameldin'", 154885),
 ("'Baca'", 154459),
 ("'Griswold'", 153715),
 ("'Pesch'", 153458),
 ("'Pesch'", 153166),
 ("'Whitcomb'", 151929),
 ("'Baca'", 151768),
 ("'Griswold'", 151596),
 ("'Alameldin'", 151484),
 ("'Pesch'", 151115),
 ("'Junet'", 150345),
 ("'Kambil'", 150052),
 ("'Whitcomb'", 149686),
 ("'Alameldin'", 149675),
 ("'Pesch'", 149571),
 ("'Thambidurai'", 149440),
 ("'Griswold'", 149241),
 ("'Alameldin'", 149208),
 ("'Baca'", 149140),
 ("'Minakawa'", 148820),
 ("'Kambil'", 148448),
 ("'Kambil'", 147702),
 ("'Unni'", 147469),
 ("'Kambil'", 147282),
 ("'Junet'", 146968),
 ("'Teitelbaum'", 146719),
 ("'Kobara'", 146655),
 ("'Kobara'", 146546),
 ("'Alameldin'", 146531),
 ("'Worfolk'", 146507),
 ("'Kambil'", 146281),
 ("'Baca'", 146222),
 ("'Brookner'", 146100),
 ("'Whitcomb'", 145940),
 ("'Ramaiah'", 145732),
 ("'Pesch'", 145711),
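The query above returns every joined row. A sketch that restricts the result to the first 100 entries, written with an explicit `JOIN ... ON`, could look like the following. It runs against an in-memory `sqlite3` stand-in with made-up toy rows, which is an assumption for illustration only.

```python
import sqlite3

# In-memory stand-ins for the Hive tables (toy rows, reduced columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (employee_id INTEGER, family_name TEXT)")
cur.execute("CREATE TABLE salary (employee_id INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Facello"), (2, "Simmel")])
cur.executemany(
    "INSERT INTO salary VALUES (?, ?)",
    [(1, 60117), (1, 62102), (2, 65828)],
)

# Join both tables on employee_id, sort by salary, keep at most 100 rows.
query = """
SELECT e.family_name, s.salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id
ORDER BY s.salary DESC
LIMIT 100
"""
cur.execute(query)
top = cur.fetchall()
print(top)
```

Limiting in the query rather than slicing in Python means only 100 rows have to travel from the server to the notebook.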


## Ex 3
Get the average salary by gender. Hint: use `GROUP BY`.

In [27]:
# average salary per gender group; gender itself is not selected, so the result rows are unlabeled
cursor.execute("SELECT AVG(s.salary) FROM employee as e, salary as s WHERE e.employee_id == s.employee_id GROUP BY e.gender")
cursor.fetchall()

[(63759.35423,)]
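The single row above is consistent with the `gender` column holding the same value (`'`) in every row shown earlier, so there is presumably only one group. Selecting the grouping column alongside the average makes each row self-describing. The sketch below shows this on an in-memory `sqlite3` stand-in with made-up toy rows, which is an assumption for illustration only.

```python
import sqlite3

# In-memory stand-ins for the Hive tables (toy rows, reduced columns).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (employee_id INTEGER, gender TEXT)")
cur.execute("CREATE TABLE salary (employee_id INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "M"), (2, "F"), (3, "F")])
cur.executemany(
    "INSERT INTO salary VALUES (?, ?)",
    [(1, 60000), (1, 62000), (2, 50000), (3, 70000)],
)

# Return the group key together with its average salary.
query = """
SELECT e.gender, AVG(s.salary) AS avg_salary
FROM employee e
JOIN salary s ON e.employee_id = s.employee_id
GROUP BY e.gender
ORDER BY e.gender
"""
cur.execute(query)
avg_by_gender = cur.fetchall()
print(avg_by_gender)
```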