# Mini Exercises

### 1. Spark Dataframe Basics

- Use the starter code above to create a pandas dataframe.
- Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.
- Show the first 3 rows of the dataframe.
- Show the first 7 rows of the dataframe.
- View a summary of the data using .describe.
- Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

#### Use the starter code above to create a pandas dataframe.

In [1]:
import pandas as pd
import numpy as np
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()
np.random.seed(13)

pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})

In [2]:
pandas_dataframe

           n group  abool
0  -0.712391     z  False
1   0.753766     x  False
2  -0.044503     z  False
3   0.451812     y  False
4   1.345102     z  False
5   0.532338     y  False
6   1.350188     z  False
7   0.861211     x  False
8   1.478686     z   True
9  -1.045377     y   True


#### Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.

In [3]:
df = spark.createDataFrame(pandas_dataframe)
df

DataFrame[n: double, group: string, abool: boolean]

#### Show the first 3 rows of the dataframe.

In [4]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



#### Show the first 7 rows of the dataframe.

In [5]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



#### View a summary of the data using .describe.

In [7]:
df.describe().show()

+-------+------------------+-----+
|summary|                 n|group|
+-------+------------------+-----+
|  count|                20|   20|
|   mean|0.3664026449885217| null|
| stddev|0.8905322898155363| null|
|    min|-1.261605945319069|    x|
|    max|2.1503829673811126|    z|
+-------+------------------+-----+



#### Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.

In [8]:
df.select(df.n, df.abool).show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



#### Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.


In [9]:
df.select(df.group, df.abool).show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



#### Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.


In [11]:

df.select('group', df.abool.alias('a_boolean_value')).show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



#### Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [12]:
df.select('group', df.n.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



### 2. Column Manipulation

- Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

- Use .select to add 4 to the n column. Show the results.

- Subtract 5 from the n column and view the results.

- Multiply the n column by 2. View the results along with the original numbers.

- Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

- Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

- What happens when you run the code below?
    - df.group + df.abool
- What happens when you run the code below? What is the difference between this and the previous code sample?
    - df.select(df.group + df.abool)
- Try adding various other columns together. What are the results of combining the different data types?

#### Use .select to add 4 to the n column. Show the results.

In [13]:
df.select(df.n + 4).show()

+------------------+
|           (n + 4)|
+------------------+
|3.2876093379494122|
| 4.753766378659703|
|3.9554969216619464|
|  4.45181233874579|
|5.3451017084510095|
| 4.532337888294546|
| 5.350187899722527|
|  4.86121137416932|
| 5.478685737435897|
| 2.954622869461466|
|3.2110109750484512|
| 2.738394054680931|
| 4.562846785281032|
|3.7566737481144377|
| 4.913740704859677|
| 4.317350922736336|
| 4.127303280206981|
| 6.150382967381113|
| 4.606288656896298|
|3.9732283500135592|
+------------------+



#### Subtract 5 from the n column and view the results.

In [14]:
df.select('n', df.n - 5).show()

+--------------------+-------------------+
|                   n|            (n - 5)|
+--------------------+-------------------+
|  -0.712390662050588| -5.712390662050588|
|   0.753766378659703| -4.246233621340297|
|-0.04450307833805...| -5.044503078338053|
| 0.45181233874578974|  -4.54818766125421|
|  1.3451017084510097|-3.6548982915489905|
|  0.5323378882945463| -4.467662111705454|
|  1.3501878997225267|-3.6498121002774733|
|  0.8612113741693206|  -4.13878862583068|
|  1.4786857374358966| -3.521314262564103|
| -1.0453771305385342| -6.045377130538534|
| -0.7889890249515489| -5.788989024951549|
|  -1.261605945319069| -6.261605945319069|
|  0.5628467852810314| -4.437153214718968|
|-0.24332625188556253| -5.243326251885563|
|  0.9137407048596775| -4.086259295140323|
| 0.31735092273633597| -4.682649077263664|
| 0.12730328020698067| -4.872696719793019|
|  2.1503829673811126|-2.8496170326188874|
|  0.6062886568962988| -4.393711343103702|
|-0.02677164998644...| -5.026771649986441|
+--------------------+-------------------+

#### Multiply the n column by 2. View the results along with the original numbers.

In [15]:
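# The exercise asks for n * 2, which would be df.select('n', df.n * 2).
# This cell multiplies by 5 instead, which is what the output below shows.
df.select('n', df.n * 5).show()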

+--------------------+--------------------+
|                   n|             (n * 5)|
+--------------------+--------------------+
|  -0.712390662050588| -3.5619533102529397|
|   0.753766378659703|  3.7688318932985148|
|-0.04450307833805...|-0.22251539169026727|
| 0.45181233874578974|   2.259061693728949|
|  1.3451017084510097|  6.7255085422550485|
|  0.5323378882945463|  2.6616894414727317|
|  1.3501878997225267|   6.750939498612634|
|  0.8612113741693206|   4.306056870846603|
|  1.4786857374358966|   7.393428687179483|
| -1.0453771305385342|  -5.226885652692671|
| -0.7889890249515489| -3.9449451247577443|
|  -1.261605945319069|  -6.308029726595345|
|  0.5628467852810314|   2.814233926405157|
|-0.24332625188556253| -1.2166312594278126|
|  0.9137407048596775|   4.568703524298387|
| 0.31735092273633597|    1.58675461368168|
| 0.12730328020698067|  0.6365164010349034|
|  2.1503829673811126|  10.751914836905563|
|  0.6062886568962988|  3.0314432844814942|
|-0.02677164998644...|-0.1338582

#### Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.



In [17]:
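n2 = (df.n * -1).alias('n2')
# n2 is only defined here. To appear in the result it has to be passed to
# .select and the result kept, e.g. df = df.select('*', n2).
# As written, this cell shows the dataframe without n2.
df.select('*').show(4)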

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
+--------------------+-----+-----+
only showing top 4 rows



#### Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.



In [19]:
# n2 is absent from the output below because the previous cell never added it to df.
n3 = (df.n ** 2).alias('n3')
df = df.select('*', n3)
df.show(5)

+--------------------+-----+-----+--------------------+
|                   n|group|abool|                  n3|
+--------------------+-----+-----+--------------------+
|  -0.712390662050588|    z|false|   0.507500455376875|
|   0.753766378659703|    x|false|  0.5681637535977627|
|-0.04450307833805...|    z|false|0.001980523981562...|
| 0.45181233874578974|    y|false| 0.20413438944294027|
|  1.3451017084510097|    z|false|  1.8092986060778251|
+--------------------+-----+-----+--------------------+
only showing top 5 rows



#### What happens when you run the code below?

df.group + df.abool

In [20]:
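# Add a string column to a boolean column.
# No error is raised here: this only builds a Column expression.
# Spark checks the types once the expression is used in a dataframe operation such as .select.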
df.group + df.abool

Column<'(group + abool)'>

#### What happens when you run the code below? What is the difference between this and the previous code sample?
df.select(df.group + df.abool)


In [21]:
df.select(df.group + df.abool)

AnalysisException: cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;
'Project [(cast(group#1 as double) + abool#2) AS (group + abool)#370]
+- Project [n#0, group#1, abool#2, POWER(n#0, cast(2 as double)) AS n3#348]
   +- LogicalRDD [n#0, group#1, abool#2], false


#### Try adding various other columns together. What are the results of combining the different data types?

In [None]:
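# Numeric columns can be added to each other and to numeric literals.
# Adding a string column makes Spark cast it to double (see the CAST in the error above).
# Adding a boolean column to a string column raises an AnalysisException, as shown above.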