# Installation for PySpark




In [6]:
!apt-get -y install openjdk-8-jre-headless
!pip install pyspark

# Check Point 1: 0.5 points

E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem. 
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#### Start a simple Spark Session

In [7]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType,StructType,IntegerType,StructField
spark = SparkSession.builder.appName('Warmup').getOrCreate()


Data Schema

In [8]:
# Each StructField takes (column name, data type, nullable)
data_schema = [
    StructField('age', IntegerType(), True),
    StructField('name', StringType(), True),
]
final_struc = StructType(fields=data_schema)

Load the people.json file, applying the schema defined above instead of having Spark infer the data types.

In [9]:
df = spark.read.json('people.json', schema=final_struc)
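
By default, `spark.read.json` expects JSON Lines input: one JSON object per line, with no enclosing array. The notebook does not show the contents of people.json, so the sketch below writes a stand-in file whose records are an assumption reconstructed from the `df.show()` output in this notebook. The file name `people_sample.json` is hypothetical, chosen so the real file is not overwritten.

```python
import json
from pathlib import Path

# Assumed records, reconstructed from the df.show() output in this notebook;
# the real people.json may differ in key order or whitespace.
records = [
    {"name": "Michael"},           # no "age" key, so Spark would report null
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# JSON Lines layout: one JSON object per line, no surrounding list.
path = Path("people_sample.json")  # hypothetical file name
path.write_text("\n".join(json.dumps(r) for r in records) + "\n")

# Read the file back line by line to confirm the layout parses cleanly.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
ages = [r.get("age") for r in loaded]
print(ages)
```

A record without an `age` key is what would produce the `null` entry in the `age` column once the file is loaded with the schema above.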


#### What are the column names?

In [10]:
df.columns

['age', 'name']

#### What is the schema?

In [11]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



Show the whole DataFrame.

In [12]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Print out the first 2 rows.

In [13]:
for row in df.head(2):
  print(row)
  print('\n')

Row(age=None, name='Michael')


Row(age=30, name='Andy')




Use describe() to learn about the DataFrame

In [14]:
df.describe()

DataFrame[summary: string, age: string, name: string]

Store the describe() result in another DataFrame to work with the statistical report.

In [15]:
temp = df.describe()

The mean and stddev values in the describe() DataFrame have too many decimal places.   
How can we deal with this?

In [16]:
from pyspark.sql.functions import format_number

In [17]:
result = df.describe()
result.select(result['summary'],
              format_number(result['age'].cast('float'),2).alias('age')
              ).show()

+-------+-----+
|summary|  age|
+-------+-----+
|  count| 2.00|
|   mean|24.50|
| stddev| 7.78|
|    min|19.00|
|    max|30.00|
+-------+-----+
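
As a cross-check, the figures in this table can be reproduced by hand with the standard library. This is a sketch, not part of the original notebook: it assumes the two non-null ages shown by `df.show()` and uses the sample standard deviation (divisor n - 1), which is what `describe()` reports as `stddev`.

```python
import statistics

# Non-null ages taken from the df.show() output; Michael's null age is skipped,
# which is why the count is 2 rather than 3.
ages = [30, 19]

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "stddev": statistics.stdev(ages),  # sample standard deviation (n - 1)
    "min": min(ages),
    "max": max(ages),
}

# Two decimal places, mirroring format_number(..., 2).
formatted = {name: f"{value:,.2f}" for name, value in summary.items()}
for name, value in formatted.items():
    print(f"{name:>7} | {value}")
```

With only two values, the sample variance is (5.5² + 5.5²) / 1 = 60.5, whose square root rounds to 7.78.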



Get the max and min of age directly

In [18]:
from pyspark.sql.functions import max,min,count
df.select(max("age"),min("age")).show()

+--------+--------+
|max(age)|min(age)|
+--------+--------+
|      30|      19|
+--------+--------+



How many people are younger than 30?

In [20]:
df.filter("age<30").count()

1

In [21]:
result = df.filter(df['age'] < 30)
result.select(count('age')).show()

+----------+
|count(age)|
+----------+
|         1|
+----------+



**Checkpoint 2 - 0.5 point** 

How many people are older than 18?

In [22]:
df.filter("age>18").count()

2
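
Both counts leave out Michael, whose age is null: a comparison against null is neither true nor false, so the filter drops the row in either direction. The plain-Python sketch below imitates that behaviour without Spark. The `rows` list and the `count_where` helper are hypothetical stand-ins based on the `df.show()` output, not part of the PySpark API.

```python
# Stand-in for the DataFrame rows shown by df.show(); None plays the role of null.
rows = [("Michael", None), ("Andy", 30), ("Justin", 19)]


def count_where(predicate):
    """Count rows whose age satisfies predicate, skipping null ages."""
    return sum(1 for _, age in rows if age is not None and predicate(age))


under_30 = count_where(lambda age: age < 30)   # only Justin (19)
over_18 = count_where(lambda age: age > 18)    # Andy (30) and Justin (19)
missing = sum(1 for _, age in rows if age is None)

print(under_30, over_18, missing)  # 1 2 1
```

The two filtered counts are not complementary views of the same three rows: a row with a null age matches neither condition.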