# Installation for PySpark

In [None]:
!apt-get -y install openjdk-8-jre-headless
!pip install pyspark

# Check Point 1: 0.5 points

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  libnss-mdns fonts-dejavu-extra fonts-ipafont-gothic fonts-ipafont-mincho
  fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  openjdk-8-jre-headless
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 28.2 MB of archives.
After this operation, 104 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 openjdk-8-jre-headless amd64 8u292-b10-0ubuntu1~18.04 [28.2 MB]
Fetched 28.2 MB in 4s (6,427 kB/s)
Selecting previously unselected package openjdk-8-jre-headless:amd64.
(Reading database ... 160772 files and directories currently installed.)
Preparing to unpack .../openjdk-8-jre-headless_8u292-b10-0ubuntu1~18.04_amd64.deb ...
Unpacking openjdk-8-jre-headless:amd64 (8u292-b10-0ubuntu1~18.04) ...
Setting up openjdk-8-jre-headless:amd64 (8u292-b10-0ubun
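
PySpark launches a JVM, so a `java` executable has to be reachable once the install step has finished. A small check, assuming `java` is expected on the `PATH`; the helper name `java_on_path` is hypothetical and not part of PySpark:

```python
import shutil

def java_on_path():
    """Return the path of the java executable, or None if it is not on PATH."""
    return shutil.which("java")

java_path = java_on_path()
print("java found at:", java_path if java_path else "not found")
```

If this reports "not found", the Spark session in the next cell is unlikely to start.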

#### Start a simple Spark Session

In [None]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, IntegerType, StructField

# Reuse the active session if one exists, otherwise create a new one
spark = SparkSession.builder.appName("warmip").getOrCreate()

Data Schema

In [None]:
# Each StructField is (column name, data type, nullable)
data_schema = [
    StructField("age", IntegerType(), True),
    StructField("name", StringType(), True),
]

final_struc = StructType(fields=data_schema)

Load the people.json file, applying the schema defined above instead of having Spark infer the data types.

In [None]:
df = spark.read.json('people.json',schema = final_struc)
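
For the cell above to run, people.json has to exist in the working directory. `spark.read.json` expects JSON Lines by default, meaning one JSON object per line. A minimal sketch that writes such a file, assuming its contents match the three rows printed by `df.show()` in this notebook; the helper name `write_people_json` is hypothetical, and the original file may differ in key order or extra fields:

```python
import json
import os
import tempfile

# Rows reconstructed from the df.show() output in this notebook (an assumption;
# the original people.json is not included here).
PEOPLE = [
    {"name": "Michael"},            # no "age" key, so Spark reads age as null
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

def write_people_json(path):
    """Write PEOPLE as JSON Lines: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for person in PEOPLE:
            f.write(json.dumps(person) + "\n")
    return path

# Demonstrate on a temporary file and read it back line by line.
demo_path = write_people_json(os.path.join(tempfile.mkdtemp(), "people.json"))
with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Calling `write_people_json("people.json")` before the read cell places the file where that cell looks for it.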

#### What are the column names?

In [None]:
df.columns

['age', 'name']

#### What is the schema?

In [None]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



Show the whole DataFrame.

In [None]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



Print out the first 2 rows.

In [None]:
for row in df.head(2):
  print(row)
  print('\n')

Row(age=None, name='Michael')


Row(age=30, name='Andy')




Use describe() to learn about the DataFrame. The call returns a new DataFrame of summary statistics rather than printing them.

In [None]:
df.describe()

DataFrame[summary: string, age: string, name: string]

Because describe() returns another DataFrame, call show() on it to see the statistical report.

In [None]:
temp = df.describe()
temp.show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
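
The numbers in this table can be checked by hand. A plain-Python sketch, assuming the three rows shown earlier (ages null, 30, 19): `count`, `mean`, and `stddev` skip nulls, and `stddev` is the sample standard deviation (dividing by n - 1). The `name` column has no mean, and its min and max follow string ordering.

```python
import math
import statistics

# Values copied from the df.show() output; None stands in for Spark's null.
ages = [None, 30, 19]
names = ["Michael", "Andy", "Justin"]

# Aggregates ignore nulls, so only two ages take part.
non_null_ages = [a for a in ages if a is not None]

count_age = len(non_null_ages)
mean_age = statistics.mean(non_null_ages)
stddev_age = statistics.stdev(non_null_ages)  # sample standard deviation (n - 1)

# For strings, min and max are lexicographic.
min_name, max_name = min(names), max(names)

print(count_age, mean_age, round(stddev_age, 2), min_name, max_name)
```

This also explains why the count is 2 for age but 3 for name: only Michael's age is missing.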



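The mean and stddev values in the describe() DataFrame have too many decimal places. Use format_number to round them to two decimals. Every column returned by describe() is a string, so cast age to a numeric type first.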

In [None]:
from pyspark.sql.functions import format_number

In [None]:
result = df.describe()
result.select(result['summary'],
       format_number(result["age"].cast("float"),2).alias("age")
       ).show()

+-------+-----+
|summary|  age|
+-------+-----+
|  count| 2.00|
|   mean|24.50|
| stddev| 7.78|
|    min|19.00|
|    max|30.00|
+-------+-----+
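
Note that `format_number` produces strings (with thousands separators for large values), not numbers, so the formatted column is meant for display rather than further arithmetic. The same idea in plain Python, as a rough analogy only: Python's format spec stands in for Spark's formatter here, and the values are copied from the describe() output above.

```python
# Summary values for the age column, taken from the describe() output.
summary = {
    "count": 2.0,
    "mean": 24.5,
    "stddev": 7.7781745930520225,
    "min": 19.0,
    "max": 30.0,
}

# Format each value with two decimals and a thousands separator.
formatted = {key: f"{value:,.2f}" for key, value in summary.items()}

for key, text in formatted.items():
    print(key, text)
```

Every formatted value is a `str`, which is why count appears as 2.00 rather than 2.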



Get the mean of age directly

In [None]:
from pyspark.sql.functions import mean
df.select(mean('age')).show()

+--------+
|avg(age)|
+--------+
|    24.5|
+--------+



What are the max and min of the age column?

In [None]:
# Note: these imports shadow Python's built-in max and min for the rest of the notebook
from pyspark.sql.functions import max, min
df.select(max("age"), min("age")).show()

+--------+--------+
|max(age)|min(age)|
+--------+--------+
|      30|      19|
+--------+--------+



How many people are younger than 30?

In [None]:
df.filter("age < 30").count()

1

In [None]:
from pyspark.sql.functions import count
result = df.filter(df['age']<30)
result.select(count('age')).show()

+----------+
|count(age)|
+----------+
|         1|
+----------+
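
Michael is not counted because his age is null: in Spark SQL a comparison with null evaluates to null rather than true, and `filter` keeps only rows where the condition is true. A plain-Python sketch of that rule, assuming the three rows shown earlier; the helper `less_than` is hypothetical and only mimics the three-valued logic:

```python
# Rows copied from the df.show() output; None stands in for Spark's null.
rows = [("Michael", None), ("Andy", 30), ("Justin", 19)]

def less_than(value, limit):
    """Mimic SQL semantics: comparing null (None) yields null (None)."""
    if value is None:
        return None
    return value < limit

# Keep a row only when the condition is exactly True, as filter does.
kept = [name for name, age in rows if less_than(age, 30) is True]
print(kept, len(kept))
```

Andy is excluded as well, since 30 is not strictly less than 30, which leaves one matching row.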



**Check Point - 1 point** 

How many people are older than 18?

In [None]:
from pyspark.sql.functions import count
result = df.filter(df['age']>18)
result.select(count('age')).show()

+----------+
|count(age)|
+----------+
|         2|
+----------+

