# Installation for PySpark




In [1]:
!apt-get -y install openjdk-8-jre-headless
!pip install pyspark

# Checkpoint 1: 0.5 points

Reading package lists... Done
Building dependency tree       
Reading state information... Done
openjdk-8-jre-headless is already the newest version (8u312-b07-0ubuntu1~18.04).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Start a simple Spark Session

In [2]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, IntegerType, StructField

spark = SparkSession.builder.appName('Warmup').getOrCreate()

Data Schema

In [3]:
data_schema = [StructField('age', IntegerType(), True),
               StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)

Load the people.json file, applying the schema defined above rather than having Spark infer the data types.

In [4]:
df = spark.read.json('people.json', schema=final_struc)
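The read above expects a `people.json` file in the working directory. If it is missing, a minimal sketch like the following could create it before running the read. The three records are an assumption reconstructed from the `df.show()` output in this notebook, not the original file.

```python
import json
from pathlib import Path

# Assumed records, reconstructed from the df.show() output in this notebook.
people = [
    {"name": "Michael"},           # no age field, so Spark reads age as null
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# spark.read.json expects JSON Lines by default: one JSON object per line.
path = Path("people.json")
path.write_text("\n".join(json.dumps(person) for person in people) + "\n")

print(path.read_text())
```

Writing one object per line matters here: a single pretty-printed JSON array would need the `multiLine` option on the reader.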

#### What are the column names?

In [5]:
df.columns

['age', 'name']

#### What is the schema?

In [6]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



Show the whole DataFrame

In [7]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Print out the first 2 rows.

In [8]:
for row in df.head(2):
    print(row)
    print("\n")

Row(age=None, name='Michael')


Row(age=30, name='Andy')




Use describe() to learn about the DataFrame

In [9]:
df.describe()

DataFrame[summary: string, age: string, name: string]

Assign the describe() result to another DataFrame and call show() on it to see the summary statistics

In [10]:
temp = df.describe()
temp.show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
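The count of 2 for age shows that the null age is left out of the statistics, and the stddev value matches the sample standard deviation (dividing by n - 1) of the two remaining ages. A small plain-Python check, using only the two non-null ages from the table above, can confirm this:

```python
import math
import statistics

# Non-null ages from the DataFrame (Michael's null age is excluded).
ages = [30, 19]

count = len(ages)
mean_age = statistics.mean(ages)
sample_std = statistics.stdev(ages)       # divides by n - 1
population_std = statistics.pstdev(ages)  # divides by n, shown for contrast

print(count, mean_age, sample_std, population_std)
```

The sample value agrees with the stddev row reported by describe(); the population value (5.5) does not.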



The describe() DataFrame shows too many decimal places for mean and stddev.   
format_number can trim them to two decimal places:

In [11]:
from pyspark.sql.functions import format_number

In [12]:
result = df.describe()
result.select(result["summary"],
              format_number(result["age"].cast("float"), 2).alias("age")
              ).show()

+-------+-----+
|summary|  age|
+-------+-----+
|  count| 2.00|
|   mean|24.50|
| stddev| 7.78|
|    min|19.00|
|    max|30.00|
+-------+-----+
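Note that format_number returns a string column, so the formatted values are meant for display rather than further arithmetic. As a rough plain-Python analogy (not the Spark implementation), the built-in `format` with a `,.2f` spec produces the same kind of output. The `large_value` below is a hypothetical number added only to show the thousands separator.

```python
# Summary values taken from the describe() output above.
stats = {
    "count": 2.0,
    "mean": 24.5,
    "stddev": 7.7781745930520225,
    "min": 19.0,
    "max": 30.0,
}

# ",.2f" keeps two decimal places and inserts thousands separators.
formatted = {name: format(value, ",.2f") for name, value in stats.items()}

# Hypothetical large value, only to illustrate the separator.
large_value = format(1234567.891, ",.2f")

for name, text in formatted.items():
    print(f"{name:>6} | {text:>5}")
print(large_value, type(large_value).__name__)
```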



Get the mean of age directly

In [13]:
from pyspark.sql.functions import mean
df.select(mean("age")).show()

+--------+
|avg(age)|
+--------+
|    24.5|
+--------+



What are the max and min of the age column?

In [14]:
from pyspark.sql.functions import min, max
df.select(max("age"), min("age")).show()

+--------+--------+
|max(age)|min(age)|
+--------+--------+
|      30|      19|
+--------+--------+



How many people are younger than 30?

In [15]:
df.filter("age < 30").count()

1

In [16]:
from pyspark.sql.functions import count
result = df.filter(df["age"] < 30)
result.select(count("age")).show()

+----------+
|count(age)|
+----------+
|         1|
+----------+



**Checkpoint 2 - 0.5 points**

How many people are older than 18?

In [17]:
result = df.filter(df["age"] > 18)
result.select(count("age")).show()

+----------+
|count(age)|
+----------+
|         2|
+----------+
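Both filters above skip Michael: his age is null, a comparison against null does not evaluate to true, and filter keeps only rows where the condition is true. The plain-Python sketch below mimics that behavior on the same three rows. The `count_where` helper is hypothetical and only illustrates the logic; it is not a Spark API.

```python
# Rows as shown by df.show(); None stands in for Spark's null.
rows = [
    {"age": None, "name": "Michael"},
    {"age": 30, "name": "Andy"},
    {"age": 19, "name": "Justin"},
]

def count_where(rows, predicate):
    """Count rows whose age is known and satisfies the predicate."""
    return sum(
        1 for row in rows
        if row["age"] is not None and predicate(row["age"])
    )

younger_than_30 = count_where(rows, lambda age: age < 30)
older_than_18 = count_where(rows, lambda age: age > 18)

print(younger_than_30, older_than_18)
```

To count the rows with a missing age explicitly in Spark, one option is `df.filter(df["age"].isNull()).count()`.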

