# Lesson 8 - PySpark

## Review

### Accessing a Hive table via PySpark

Create the `HiveContext`:

    from pyspark.sql import HiveContext
    contexto = HiveContext(sc)


Load the table from the database:

    banco = contexto.table("hr.jobs")
    banco.show()

Register the table in Spark so that it is available for running queries:

    banco.registerTempTable("jobs")
    contexto.sql('Select * from jobs').show()
    contexto.sql('Select *  from jobs order by salario_max DESC limit 1').show()

### Creating a DataFrame

The variable `jobs` is our DataFrame:

    jobs = contexto.sql("select * from jobs") 
    jobs.show()
    

### Some commands

  
    jobs.show()
    jobs.printSchema()
    jobs.select('job_title').show()
    jobs.select('job_title', 'salario_max').show()
    jobs.select('salario_max').distinct().show()
    print(jobs.select('salario_max').distinct().count())
   

# Starting PySpark

To install `pyspark` locally, run in a cell:

    pip install pyspark

**WARNING:** If the cell below fails, you may need to install Java on your machine and restart the computer!

[Download Java for Windows](https://www.java.com/pt-BR/download/ie_manual.jsp?locale=pt_BR)

In [7]:
!pip install pyspark



For PySpark to work, you first need to create and configure its environment: a `SparkContext` and a `SparkSession`.

In [8]:
# import what we need
from pyspark import sql, SparkContext, HiveContext

# create the SparkContext
sc = SparkContext()

# create the Spark session
spark = sql.SparkSession(sc)

In [None]:
sc

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-2-30a7a72c866c>:5 

In [11]:
spark

In [12]:
sc.getConf()

<pyspark.conf.SparkConf at 0x7f5035b755d0>

# Reading CSV files with PySpark

## Creating an RDD from the CSV

In [13]:
jobs = sc.textFile('data/jobs.csv')
jobs.collect()

['1,Public Accountant,4200.00,9000.00',
 '2,Accounting Manager,8200.00,16000.00',
 '3,Administration Assistant,3000.00,6000.00',
 '4,President,20000.00,40000.00',
 '5,Administration Vice President,15000.00,30000.00',
 '6,Accountant,4200.00,9000.00',
 '7,Finance Manager,8200.00,16000.00',
 '8,Human Resources Representative,4000.00,9000.00',
 '9,Programmer,4000.00,10000.00',
 '10,Marketing Manager,9000.00,15000.00',
 '11,Marketing Representative,4000.00,9000.00',
 '12,Public Relations Representative,4500.00,10500.00',
 '13,Purchasing Clerk,2500.00,5500.00',
 '14,Purchasing Manager,8000.00,15000.00',
 '15,Sales Manager,10000.00,20000.00',
 '16,Sales Representative,6000.00,12000.00',
 '17,Shipping Clerk,2500.00,5500.00',
 '18,Stock Clerk,2000.00,5000.00',
 '19,Stock Manager,5500.00,8500.00']

## Reading a CSV file via PySpark

In [14]:
# read the countries table
countries = spark.read.csv('data/countries.csv', header=True)
countries.show()

+----------+------------+---------+
|country_id|country_name|region_id|
+----------+------------+---------+
|        AR|   Argentina|        2|
|        AU|   Australia|        3|
|        BE|     Belgium|        1|
|        BR|      Brazil|        2|
|        CA|      Canada|        2|
|        CH| Switzerland|        1|
|        CN|       China|        3|
|        DE|     Germany|        1|
|        DK|     Denmark|        1|
|        EG|       Egypt|        4|
|        FR|      France|        1|
|        HK|    HongKong|        3|
|        IL|      Israel|        4|
|        IN|       India|        3|
|        IT|       Italy|        1|
|        JP|       Japan|        3|
|        KW|      Kuwait|        4|
|        MX|      Mexico|        2|
|        NG|     Nigeria|        4|
|        NL| Netherlands|        1|
+----------+------------+---------+
only showing top 20 rows



## Adding headers when the file has none

In [15]:
# read the jobs file
jobs = spark.read.csv('data/jobs.csv')
jobs.show()

+---+--------------------+--------+--------+
|_c0|                 _c1|     _c2|     _c3|
+---+--------------------+--------+--------+
|  1|   Public Accountant| 4200.00| 9000.00|
|  2|  Accounting Manager| 8200.00|16000.00|
|  3|Administration As...| 3000.00| 6000.00|
|  4|           President|20000.00|40000.00|
|  5|Administration Vi...|15000.00|30000.00|
|  6|          Accountant| 4200.00| 9000.00|
|  7|     Finance Manager| 8200.00|16000.00|
|  8|Human Resources R...| 4000.00| 9000.00|
|  9|          Programmer| 4000.00|10000.00|
| 10|   Marketing Manager| 9000.00|15000.00|
| 11|Marketing Represe...| 4000.00| 9000.00|
| 12|Public Relations ...| 4500.00|10500.00|
| 13|    Purchasing Clerk| 2500.00| 5500.00|
| 14|  Purchasing Manager| 8000.00|15000.00|
| 15|       Sales Manager|10000.00|20000.00|
| 16|Sales Representative| 6000.00|12000.00|
| 17|      Shipping Clerk| 2500.00| 5500.00|
| 18|         Stock Clerk| 2000.00| 5000.00|
| 19|       Stock Manager| 5500.00| 8500.00|
+---+--------------------+--------+--------+

In [16]:
# import the supporting types
from pyspark.sql.types import StructType, StringType, IntegerType, FloatType

# create the schema
schema_ = StructType() \
        .add('indice', IntegerType(), True) \
        .add('job_title', StringType(), True) \
        .add('salario_min', FloatType(), True) \
        .add('salario_max', FloatType(), True)

# read the file
df = spark.read.csv('data/jobs.csv', schema=schema_)
df.show()

+------+--------------------+-----------+-----------+
|indice|           job_title|salario_min|salario_max|
+------+--------------------+-----------+-----------+
|     1|   Public Accountant|     4200.0|     9000.0|
|     2|  Accounting Manager|     8200.0|    16000.0|
|     3|Administration As...|     3000.0|     6000.0|
|     4|           President|    20000.0|    40000.0|
|     5|Administration Vi...|    15000.0|    30000.0|
|     6|          Accountant|     4200.0|     9000.0|
|     7|     Finance Manager|     8200.0|    16000.0|
|     8|Human Resources R...|     4000.0|     9000.0|
|     9|          Programmer|     4000.0|    10000.0|
|    10|   Marketing Manager|     9000.0|    15000.0|
|    11|Marketing Represe...|     4000.0|     9000.0|
|    12|Public Relations ...|     4500.0|    10500.0|
|    13|    Purchasing Clerk|     2500.0|     5500.0|
|    14|  Purchasing Manager|     8000.0|    15000.0|
|    15|       Sales Manager|    10000.0|    20000.0|
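|    16|Sales Representative|     6000.0|    12000.0|
|    17|      Shipping Clerk|     2500.0|     5500.0|
|    18|         Stock Clerk|     2000.0|     5000.0|
|    19|       Stock Manager|     5500.0|     8500.0|
+------+--------------------+-----------+-----------+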


# Analysis with PySpark

## select

In [17]:
df.select('job_title').show()

+--------------------+
|           job_title|
+--------------------+
|   Public Accountant|
|  Accounting Manager|
|Administration As...|
|           President|
|Administration Vi...|
|          Accountant|
|     Finance Manager|
|Human Resources R...|
|          Programmer|
|   Marketing Manager|
|Marketing Represe...|
|Public Relations ...|
|    Purchasing Clerk|
|  Purchasing Manager|
|       Sales Manager|
|Sales Representative|
|      Shipping Clerk|
|         Stock Clerk|
|       Stock Manager|
+--------------------+



## filter and/or where

In [18]:
df.filter(df.salario_min>=15000).show()

+------+--------------------+-----------+-----------+
|indice|           job_title|salario_min|salario_max|
+------+--------------------+-----------+-----------+
|     4|           President|    20000.0|    40000.0|
|     5|Administration Vi...|    15000.0|    30000.0|
+------+--------------------+-----------+-----------+



In [19]:
df.where(df.salario_min<15000).show()

+------+--------------------+-----------+-----------+
|indice|           job_title|salario_min|salario_max|
+------+--------------------+-----------+-----------+
|     1|   Public Accountant|     4200.0|     9000.0|
|     2|  Accounting Manager|     8200.0|    16000.0|
|     3|Administration As...|     3000.0|     6000.0|
|     6|          Accountant|     4200.0|     9000.0|
|     7|     Finance Manager|     8200.0|    16000.0|
|     8|Human Resources R...|     4000.0|     9000.0|
|     9|          Programmer|     4000.0|    10000.0|
|    10|   Marketing Manager|     9000.0|    15000.0|
|    11|Marketing Represe...|     4000.0|     9000.0|
|    12|Public Relations ...|     4500.0|    10500.0|
|    13|    Purchasing Clerk|     2500.0|     5500.0|
|    14|  Purchasing Manager|     8000.0|    15000.0|
|    15|       Sales Manager|    10000.0|    20000.0|
|    16|Sales Representative|     6000.0|    12000.0|
|    17|      Shipping Clerk|     2500.0|     5500.0|
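|    18|         Stock Clerk|     2000.0|     5000.0|
|    19|       Stock Manager|     5500.0|     8500.0|
+------+--------------------+-----------+-----------+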

In [21]:
df_min=df.filter(df.salario_min>=15000)
df_min.show()

+------+--------------------+-----------+-----------+
|indice|           job_title|salario_min|salario_max|
+------+--------------------+-----------+-----------+
|     4|           President|    20000.0|    40000.0|
|     5|Administration Vi...|    15000.0|    30000.0|
+------+--------------------+-----------+-----------+



In [24]:
df_min.where(df_min.salario_min == 15000).select('job_title').show()

+--------------------+
|           job_title|
+--------------------+
|Administration Vi...|
+--------------------+



## sum, min, max, mean

## agg

## [join](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html)



In [None]:
# read the employees table


In [None]:
# join


## toPandas