## Spark Mini Lesson

In [3]:
import pandas as pd
import numpy as np

In [4]:
np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

## 1. Spark Dataframe Basics

### a. 
Use the starter code above to create a pandas dataframe.

In [6]:
pandas_dataframe.head()

          n group  abool
0 -0.712391     z  False
1  0.753766     x  False
2 -0.044503     z  False
3  0.451812     y  False
4  1.345102     z  False


### b. 
Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [9]:
import pyspark

In [10]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()



In [11]:
# convert the pandas DF to a Spark dataframe
df = spark.createDataFrame(pandas_dataframe)
df

DataFrame[n: double, group: string, abool: boolean]

### c. 
Show the first 3 rows of the dataframe.

In [12]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows

### d. 
Show the first 7 rows of the dataframe.

In [13]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



### e. 
What is the difference between .show and .head?

### f. 
View a summary of the data using .describe.

### g. 
Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

### h. 
Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.

### i. 
Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.

### j. 
Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

## 2. Column Manipulation

### a. 
Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

### b. 
Use .select to add 4 to the n column. Show the results.

### c. 
Subtract 5 from the n column and view the results.

### d. 
Multiply the n column by 2. View the results along with the original numbers.

### e. 
Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

### f. 
Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

### g. 
What happens when you run the code below?

df.group + df.abool

### h. 
What happens when you run the code below? What is the difference between this and the previous code sample?

df.select(df.group + df.abool)

### i. 
Try adding various other columns together. What are the results of combining the different data types?

## 3. Type casting

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Use .printSchema to view the datatypes in your dataframe.

### c. 
Use .dtypes to view the datatypes in your dataframe.

### d.
What is the difference between the two code samples below?

df.abool.cast('int')
df.select(df.abool.cast('int')).show()

### e.
Use .select and .cast to convert the abool column to an integer type. View the results.

### f. 
Convert the group column to an integer data type and view the results. What happens?

### g. 
Convert the n column to an integer data type and view the results. What happens?

### h.
Convert the abool column to a string data type and view the results. What happens?

## 4. Built-in Functions

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Import the necessary functions from pyspark.sql.functions.

### c.
Find the highest n value.

### d.
Find the lowest n value.

### e.
Find the average n value.

### f.
Use concat to change the group column to say, e.g. "Group: x" or "Group: y".

### g.
Use concat to combine the n and group columns to produce results that look like this: "x: -1.432" or "z: 2.352".

## 5. When / Otherwise

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Use when and .otherwise to create a column that contains the text "It is true" when abool is true and "It is false" when abool is false.

### c.
Create a column that contains 0 if n is less than 0, otherwise, the original n value.

## 6. Filter / Where

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Use .filter or .where to select just the rows where the group is y and view the results.

### c.
Select just the rows where the abool column is false and view the results.

### d.
Find the rows where the group column is not y.

### e.
Find the rows where n is positive.

### f.
Find the rows where abool is true and the group column is z.

### g.
Find the rows where abool is true or the group column is z.

### h.
Find the rows where abool is false and n is less than 1.

### i.
Find the rows where abool is false or n is less than 1.

## 7. Sorting

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Sort by the n value.

### c.
Sort by the group value, both ascending and descending.

### d.
Sort by the group value first, then, within each group, sort by n value.

### e.
Sort by abool, group, and n. Does it matter in what order you specify the columns when sorting?

## 8. Aggregating

### a.
What is the average n value for each group in the group column?

### b.
What is the maximum n value for each group in the group column?

### c.
What is the minimum n value by abool?

### d.
What is the average n value for each unique combination of the group and abool column?

## 9. Spark SQL

### a.
Use the starter code above to re-create a spark dataframe.

### b.
Turn your dataframe into a table that can be queried with spark SQL. Name the table my_df. Answer the rest of the questions in this section with a spark sql query (spark.sql) against my_df. After each step, view the first 7 records from the dataframe.

### c.
What happens if you make a SQL syntax error in your query?

### d.
Write a query that shows all of the columns from your dataframe.

### e.
Write a query that shows just the n and abool columns from the dataframe.

### f.
Write a query that shows just the n and group columns. Rename the group column to g.

### g.
Write a query that selects n, and creates two new columns: n2, the original n values halved, and n3: the original n values minus 1.