# Sqoop
![Sqoop](https://sqoop.apache.org/images/sqoop-logo.png)

- https://sqoop.apache.org/

## Setup

- Download from https://downloads.apache.org/sqoop/1.4.7
- Version: 1.4.7

In [1]:
%%bash

# Download package
cd /opt/pkgs
wget -q -c https://downloads.apache.org/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# Unpack the archive and create a version-independent symlink
tar -zxf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz -C /opt
ln -s /opt/sqoop-1.4.7.bin__hadoop-2.6.0 /opt/sqoop

# Replace the bundled commons-lang3 jar with the commons-lang 2.6 jar shipped with Hadoop
rm /opt/sqoop/lib/commons-lang3-3.4.jar
cp /opt/hadoop/share/hadoop/yarn/timelineservice/lib/commons-lang-2.6.jar /opt/sqoop/lib

# Append the Sqoop variables to envvars.sh (\$ keeps the variables unexpanded in the file)
cat >> /opt/envvars.sh << EOF
# Sqoop
export SQOOP_HOME=/opt/sqoop
export PATH=\${PATH}:\${SQOOP_HOME}/bin

EOF

cat /opt/envvars.sh

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export PDSH_RCMD_TYPE=ssh

export HADOOP_HOME=/opt/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}

export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin     

# Flume
export FLUME_HOME=/opt/flume
export PATH=${PATH}:${FLUME_HOME}/bin

# Sqoop
export SQOOP_HOME=/opt/sqoop
export PATH=${PATH}:${SQOOP_HOME}/bin
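
Because the `cat >>` in the setup cell appends unconditionally, re-running that cell would add a second Sqoop block to `/opt/envvars.sh`. A minimal sketch of a re-runnable variant follows. It writes to a temporary file as a stand-in for `/opt/envvars.sh`, and the function name `add_sqoop_env` is an illustrative assumption. Quoting the here-document delimiter (`'EOF'`) keeps `${PATH}` literal without backslashes.

```shell
# Temporary stand-in for /opt/envvars.sh (assumption for this example)
ENVFILE="$(mktemp)"

# Hypothetical helper: append the Sqoop block only if it is not already there.
# $1 = path of the environment file
add_sqoop_env() {
  if ! grep -q '^export SQOOP_HOME=' "$1"; then
    cat >> "$1" << 'EOF'
# Sqoop
export SQOOP_HOME=/opt/sqoop
export PATH=${PATH}:${SQOOP_HOME}/bin

EOF
  fi
}

add_sqoop_env "$ENVFILE"
add_sqoop_env "$ENVFILE"   # second call finds the export and does nothing

# Count the SQOOP_HOME exports that ended up in the file
grep -c '^export SQOOP_HOME=' "$ENVFILE"
```

The final `grep -c` counts how many `SQOOP_HOME` exports were written, which makes a duplicated block easy to spot.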



In [2]:
# Load environment variables
%load_ext dotenv
%dotenv -o /opt/envvars.sh
%env

{'HOSTNAME': 'hadoop',
 'OLDPWD': '/',
 'PWD': '/opt',
 'HOME': '/home/hadoop',
 'SHELL': '/bin/bash',
 'SHLVL': '1',
 'PATH': '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/hadoop/bin:/opt/hadoop/sbin:/opt/flume/bin:/opt/sqoop/bin',
 '_': '/usr/bin/nohup',
 'LANGUAGE': 'en.UTF-8',
 'LANG': 'en.UTF-8',
 'JPY_PARENT_PID': '1566',
 'TERM': 'xterm-color',
 'CLICOLOR': '1',
 'PAGER': 'cat',
 'GIT_PAGER': 'cat',
 'MPLBACKEND': 'module://ipykernel.pylab.backend_inline',
 'JAVA_HOME': '/usr/lib/jvm/java-1.8.0-openjdk-amd64',
 'PDSH_RCMD_TYPE': 'ssh',
 'HADOOP_HOME': '/opt/hadoop',
 'HADOOP_COMMON_HOME': '/opt/hadoop',
 'HADOOP_CONF_DIR': '/opt/hadoop/etc/hadoop',
 'HADOOP_HDFS_HOME': '/opt/hadoop',
 'HADOOP_MAPRED_HOME': '/opt/hadoop',
 'HADOOP_YARN_HOME': '/opt/hadoop',
 'FLUME_HOME': '/opt/flume',
 'SQOOP_HOME': '/opt/sqoop'}

### MySQL Connector/J

- https://dev.mysql.com/downloads/connector/j/

In [3]:
%%bash

# Download package
cd /opt/pkgs
wget -q -c https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java_8.0.22-1ubuntu18.04_all.deb
    
# Install the package, then copy the driver jar into Sqoop's lib directory
sudo dpkg -i mysql-connector-java_8.0.22-1ubuntu18.04_all.deb
cp /usr/share/java/mysql-connector-java-8.0.22.jar /opt/sqoop/lib

Selecting previously unselected package mysql-connector-java.
(Reading database ... 38151 files and directories currently installed.)
Preparing to unpack mysql-connector-java_8.0.22-1ubuntu18.04_all.deb ...
Unpacking mysql-connector-java (8.0.22-1ubuntu18.04) ...
Setting up mysql-connector-java (8.0.22-1ubuntu18.04) ...


## MySQL installation

In [4]:
%%bash

sudo apt install -qq -y mysql-server unzip >> /tmp/install.log 2>&1

# Enable external access (from worker nodes)
sudo sed -i "s/^bind-address/#bind-address/g" /etc/mysql/mysql.conf.d/mysqld.cnf 

sudo service mysql restart
sudo service mysql status

# Create a 'hadoop' user (no password) and grant it all privileges
sudo mysql -e "create user 'hadoop'"
sudo mysql -e "grant all privileges on *.* to 'hadoop'@'%'"
sudo mysql -e "flush privileges"

 * Stopping MySQL database server mysqld
   ...done.
 * Starting MySQL database server mysqld
   ...done.
 * /usr/bin/mysqladmin  Ver 8.42 Distrib 5.7.32, for Linux on x86_64
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Server version		5.7.32-0ubuntu0.18.04.1
Protocol version	10
Connection		Localhost via UNIX socket
UNIX socket		/var/run/mysqld/mysqld.sock
Uptime:			1 sec

Threads: 1  Questions: 8  Slow queries: 0  Opens: 105  Flush tables: 1  Open tables: 98  Queries per second avg: 8.000
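
The `sed` call in the cell above comments out the `bind-address` line, so MySQL stops binding only to the configured address. As a hedged illustration of what that substitution does, the sketch below applies it to a throwaway file. The sample contents are assumed for the example and are not copied from the real `mysqld.cnf`.

```shell
# Throwaway stand-in for /etc/mysql/mysql.conf.d/mysqld.cnf (contents are assumed)
CNF="$(mktemp)"
printf '%s\n' '[mysqld]' 'bind-address = 127.0.0.1' 'port = 3306' > "$CNF"

# Same substitution as in the cell above, written to stdout instead of in place
PATCHED="$(sed 's/^bind-address/#bind-address/g' "$CNF")"
printf '%s\n' "$PATCHED"
```

Only lines that start with `bind-address` are touched; every other setting passes through unchanged.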


## Employees database setup

In [6]:
%%bash

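# Download the Employees sample database (datacharmer/test_db)
cd /opt/pkgs
wget -q -c https://github.com/datacharmer/test_db/archive/master.zip

unzip master.zip

# Load the schema and data; employees.sql sources the .dump files by relative path,
# so run it from inside the repository directory
cd test_db-master
mysql -u hadoop < employees.sql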

Archive:  master.zip
e5f310ac7786a2a181a7fc124973725d7aa4ce7c
   creating: test_db-master/
  inflating: test_db-master/Changelog  
  inflating: test_db-master/README.md  
  inflating: test_db-master/employees.sql  
  inflating: test_db-master/employees_partitioned.sql  
  inflating: test_db-master/employees_partitioned_5.1.sql  
   creating: test_db-master/images/
  inflating: test_db-master/images/employees.gif  
  inflating: test_db-master/images/employees.jpg  
  inflating: test_db-master/images/employees.png  
  inflating: test_db-master/load_departments.dump  
  inflating: test_db-master/load_dept_emp.dump  
  inflating: test_db-master/load_dept_manager.dump  
  inflating: test_db-master/load_employees.dump  
  inflating: test_db-master/load_salaries1.dump  
  inflating: test_db-master/load_salaries2.dump  
  inflating: test_db-master/load_salaries3.dump  
  inflating: test_db-master/load_titles.dump  
  inflating: test_db-master/objects.sql  
   creating: test_db-master/sakila/
 

## Explore database

In [7]:
%%bash

mysql -u hadoop -e 'show databases'

printf "\n%40s\n\n" | tr ' ' '='

mysql -u hadoop -D employees -e 'show tables'

printf "\n%40s\n\n" | tr ' ' '='

mysql -u hadoop -D employees -e 'describe employees'

Database
information_schema
employees
mysql
performance_schema
sys


Tables_in_employees
current_dept_emp
departments
dept_emp
dept_emp_latest_date
dept_manager
employees
salaries
titles


Field	Type	Null	Key	Default	Extra
emp_no	int(11)	NO	PRI	NULL	
birth_date	date	NO		NULL	
first_name	varchar(14)	NO		NULL	
last_name	varchar(16)	NO		NULL	
gender	enum('M','F')	NO		NULL	
hire_date	date	NO		NULL	


## Using Sqoop

In [8]:
%%bash

# List the databases visible to the hadoop user
sqoop list-databases --connect jdbc:mysql://hadoop --username hadoop

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
information_schema
employees
mysql
performance_schema
sys


/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.SQOOP.SQOOP_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.SQOOP.SQOOP_OPTS: bad substitution
2021-01-28 18:58:57,924 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2021-01-28 18:58:58,506 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.


In [9]:
%%bash

# List the tables in the employees database
sqoop list-tables --connect jdbc:mysql://hadoop/employees --username hadoop

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
current_dept_emp
departments
dept_emp
dept_emp_latest_date
dept_manager
employees
salaries
titles


/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.SQOOP.SQOOP_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.SQOOP.SQOOP_OPTS: bad substitution
2021-01-28 18:59:20,246 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2021-01-28 18:59:21,481 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.


In [10]:
%%bash

# Import the employees table into HDFS as text files (output directory: employees)
sqoop import --connect jdbc:mysql://hadoop/employees --username hadoop --table employees

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.


/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.SQOOP.SQOOP_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.SQOOP.SQOOP_OPTS: bad substitution
2021-01-28 18:59:39,078 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2021-01-28 18:59:39,698 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2021-01-28 18:59:39,698 INFO tool.CodeGenTool: Beginning code generation
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2021-01-28 18:59:41,295 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2021-01-28 18:59:41,415 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
2021-01-28 18:59:41,439 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/hadoop
Note:
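
The import above relies on Sqoop's defaults: four parallel map tasks split on the table's primary key, with output written to a directory named after the table in the user's HDFS home (hence the four `part-m-*` files in the listing of the next cell). A minimal sketch of making those choices explicit with `--split-by`, `--num-mappers`, and `--target-dir` follows. The helper `build_import_cmd` and the values passed to it are illustrative assumptions, and the command is only printed, not executed.

```shell
# Hypothetical helper: print a sqoop import command with explicit settings.
# $1 = HDFS target directory, $2 = number of map tasks
build_import_cmd() {
  printf 'sqoop import --connect %s --username %s --table %s --split-by %s --num-mappers %s --target-dir %s\n' \
    'jdbc:mysql://hadoop/employees' 'hadoop' 'employees' 'emp_no' "$2" "$1"
}

# Print the command for review; pipe it to `sh` to run it on the cluster
build_import_cmd employees_explicit 2
```

Printing the command first makes it easy to confirm the split column and target directory before launching a MapReduce job.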

In [11]:
%%bash

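# List the imported files with human-readable sizes
hdfs dfs -ls -h employees

# Show the first kilobyte of the first part file
hdfs dfs -head employees/part-m-00000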

Found 5 items
-rw-r--r--   2 hadoop hadoop          0 2021-01-28 19:01 employees/_SUCCESS
-rw-r--r--   2 hadoop hadoop      4.3 M 2021-01-28 19:01 employees/part-m-00000
-rw-r--r--   2 hadoop hadoop      2.4 M 2021-01-28 19:01 employees/part-m-00001
-rw-r--r--   2 hadoop hadoop      2.0 M 2021-01-28 19:01 employees/part-m-00002
-rw-r--r--   2 hadoop hadoop      4.4 M 2021-01-28 19:01 employees/part-m-00003
10001,1953-09-02,Georgi,Facello,M,1986-06-26
10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
10003,1959-12-03,Parto,Bamford,M,1986-08-28
10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12
10006,1953-04-20,Anneke,Preusig,F,1989-06-02
10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
10008,1958-02-19,Saniya,Kalloufi,M,1994-09-15
10009,1952-04-19,Sumant,Peac,F,1985-02-18
10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
10011,1953-11-07,Mary,Sluis,F,1990-01-22
10012,1960-10-04,Patricio,Bridgland,M,1992-12-18
10013,1963-06-07,Eberhardt,Terkki,M,1985-

2021-01-28 19:02:55,257 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
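
One way to sanity-check the import is to compare the row count in MySQL with the number of lines across the part files in HDFS. The sketch below is an example under stated assumptions, not part of the original notebook: `compare_counts` is a hypothetical helper, the live check runs only when both `mysql` and `hdfs` are on the `PATH`, and counting lines assumes no field contains an embedded newline.

```shell
# Hypothetical helper: succeed only when the two counts match.
# $1 = rows reported by MySQL, $2 = lines counted in HDFS
compare_counts() {
  if [ "$1" -eq "$2" ]; then
    echo "OK: $1 rows"
  else
    echo "MISMATCH: mysql=$1 hdfs=$2"
    return 1
  fi
}

# Run the live comparison only where both clients are installed
if command -v mysql >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  mysql_rows="$(mysql -u hadoop -D employees -N -B -e 'select count(*) from employees')"
  hdfs_rows="$(hdfs dfs -cat 'employees/part-m-*' | wc -l | tr -d ' ')"
  compare_counts "$mysql_rows" "$hdfs_rows"
fi
```

The check has to run while MySQL is still up, so it belongs before the cleanup cell that stops the server.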


In [12]:
%%bash

# Remove the Java class that Sqoop generated in the working directory during the import
rm employees.java

# Stop MySQL
sudo service mysql stop

 * Stopping MySQL database server mysqld
   ...done.
