# Spark Project Level 1

Siny P Raphel

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [0]:
spark = SparkSession.builder.master('local').appName('fifa').getOrCreate()

In [0]:
players = spark.read.csv('/FileStore/tables/wc2018_players.csv', inferSchema=True, header=True)
players.show(2)

+---------+---+----+------------------+----------+----------+--------------------+------+------+
|     Team|  #|Pos.| FIFA Popular Name|Birth Date|Shirt Name|                Club|Height|Weight|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
|Argentina|  3|  DF|TAGLIAFICO Nicolas|31.08.1992|TAGLIAFICO|      AFC Ajax (NED)|   169|    65|
|Argentina| 22|  MF|    PAVON Cristian|21.01.1996|     PAVÓN|CA Boca Juniors (...|   169|    65|
+---------+---+----+------------------+----------+----------+--------------------+------+------+
only showing top 2 rows



In [0]:
players.printSchema()

root
 |-- Team: string (nullable = true)
 |-- #: integer (nullable = true)
 |-- Pos.: string (nullable = true)
 |-- FIFA Popular Name: string (nullable = true)
 |-- Birth Date: string (nullable = true)
 |-- Shirt Name: string (nullable = true)
 |-- Club: string (nullable = true)
 |-- Height: integer (nullable = true)
 |-- Weight: integer (nullable = true)



In [0]:
players.columns

Out[14]: ['Team',
 '#',
 'Pos.',
 'FIFA Popular Name',
 'Birth Date',
 'Shirt Name',
 'Club',
 'Height',
 'Weight']

In [0]:
players = players.withColumnRenamed('FIFA Popular Name', 'Name')

1.    Show each player's name alongside the height increased by 1.

In [0]:
players.selectExpr('Name', 'Height + 1' ).show(5)

+------------------+------------+
|              Name|(Height + 1)|
+------------------+------------+
|TAGLIAFICO Nicolas|         170|
|    PAVON Cristian|         170|
|    LANZINI Manuel|         168|
|    SALVIO Eduardo|         168|
|      MESSI Lionel|         171|
+------------------+------------+
only showing top 5 rows



2.    Show each player's name together with a Boolean flag indicating whether the height exceeds 170.

In [0]:
players.selectExpr('Name', 'Height > 170').show(7)

+------------------+--------------+
|              Name|(Height > 170)|
+------------------+--------------+
|TAGLIAFICO Nicolas|         false|
|    PAVON Cristian|         false|
|    LANZINI Manuel|         false|
|    SALVIO Eduardo|         false|
|      MESSI Lionel|         false|
|  ANSALDI Cristian|          true|
|      BIGLIA Lucas|          true|
+------------------+--------------+
only showing top 7 rows



3.    Show Name (originally FIFA Popular Name) and 0 or 1 depending on whether Height > 170.

In [0]:
players.select('Name', F.expr('case when height > 170 then 1 else 0 end').alias('Height Check')).show(7)

+------------------+------------+
|              Name|Height Check|
+------------------+------------+
|TAGLIAFICO Nicolas|           0|
|    PAVON Cristian|           0|
|    LANZINI Manuel|           0|
|    SALVIO Eduardo|           0|
|      MESSI Lionel|           0|
|  ANSALDI Cristian|           1|
|      BIGLIA Lucas|           1|
+------------------+------------+
only showing top 7 rows



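4.    Find the name of the shortest player. Three players share the minimum height of 165, so the final `show(1)` variant returns only one of them.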

In [0]:
players.agg({'height':'min'}).show()

+-----------+
|min(height)|
+-----------+
|        165|
+-----------+



In [0]:
players.filter('height == 165').select('Shirt Name').show()

+----------+
|Shirt Name|
+----------+
|  QUINTERO|
|     YAHIA|
|   SHAQIRI|
+----------+



In [0]:
players.orderBy('Height').select('Shirt Name').show(1)

+----------+
|Shirt Name|
+----------+
|     YAHIA|
+----------+
only showing top 1 row



5.    Find the tallest player: first compute the maximum height, then retrieve that player's details.

In [0]:
max_ht = players.agg({'height': 'max'}).first()['max(height)']

In [0]:
type(max_ht)

Out[123]: int

In [0]:
players.filter(F.col('height') == max_ht).show()

+-------+---+----+-------------+----------+----------+--------------+------+------+
|   Team|  #|Pos.|         Name|Birth Date|Shirt Name|          Club|Height|Weight|
+-------+---+----+-------------+----------+----------+--------------+------+------+
|Croatia| 12|  GK|KALINIC Lovre|03.04.1990|L. KALINIĆ|KAA Gent (BEL)|   201|    96|
+-------+---+----+-------------+----------+----------+--------------+------+------+



6.    Compute the average height of the players on the Argentina team.

In [0]:
players.where('team == "Argentina"').groupBy('team').mean('height').show()

+---------+------------------+
|     team|       avg(height)|
+---------+------------------+
|Argentina|178.43478260869566|
+---------+------------------+



## Movies Spark Project Level 2 - Spark SQL

In [0]:
movies = spark.read.csv('/FileStore/tables/movies.csv', inferSchema=True, header=True)
ratings = spark.read.csv('/FileStore/tables/ratings.csv', inferSchema=True, header=True)

movies.show(2)
ratings.show(2)

+-------+----------------+--------------------+
|movieId|           title|              genres|
+-------+----------------+--------------------+
|      1|Toy Story (1995)|Adventure|Animati...|
|      2|  Jumanji (1995)|Adventure|Childre...|
+-------+----------------+--------------------+
only showing top 2 rows

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|     31|   2.5|1260759144|
|     1|   1029|   3.0|1260759179|
+------+-------+------+----------+
only showing top 2 rows



Find the oldest released movies.

How many movies were released each year?

How many movies are there for each rating?

How many users have rated each movie?

What is the total rating for each movie?

What is the average rating for each movie?