Note: Spark 1.5 introduces many new features. In this session we look at a few of them, namely writing a custom UDF, "withColumn", broadcast/map-side joins, etc. You will need to download Spark 1.5 from [here](http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz).

# Download dataset (optional)

In [None]:
%%bash
mkdir -p data/meetup/movielens
cd data/meetup/movielens || exit 1  # stop here so the rm below never runs in the wrong directory
rm -rf *
wget http://www.grouplens.org/system/files/ml-100k.zip

unzip -j ml-100k.zip "ml-100k/u.data" 
unzip -j ml-100k.zip "ml-100k/u.item" 
unzip -j ml-100k.zip "ml-100k/u.user" 
unzip -j ml-100k.zip "ml-100k/README"

ls -lh

# Load dataset

In [None]:
def toIntTuple(x):
    return tuple([int(y) for y in x])

data = sc.textFile("data/meetup/movielens/u.data").map(lambda x: toIntTuple(x.split("\t")))
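
To make the parsing step concrete, here is a minimal sketch that applies `toIntTuple` to a single line in plain Python, without Spark. The sample line is an assumption for illustration: four tab-separated integers standing for user, movie, rating and timestamp.

```python
def toIntTuple(x):
    # Convert every field of one split line to int
    return tuple([int(y) for y in x])

# Hypothetical sample line: user, movie, rating, timestamp separated by tabs
sample_line = "196\t242\t3\t881250949"

parsed = toIntTuple(sample_line.split("\t"))
print(parsed)
```

Each element of the resulting RDD is a tuple of this shape, which is what lets a four-field integer schema be applied to it.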


# Apply Schema and convert to DataFrame

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
            StructField("user", IntegerType(), nullable=False),
            StructField("movie", IntegerType(), nullable=False),
            StructField("rating", IntegerType(), nullable=False),
            StructField("timestamp", IntegerType(), nullable=False)
    ])

df = sqlContext.createDataFrame(data, schema)
df.printSchema()  # printSchema() prints the schema itself and returns None
df.show(5)

# Convert Rating to Binary

In [None]:
from pyspark.sql.functions import udf, col

# define function
def ratingToBinary(rating):
    if rating >= 3: return 1
    else: return 0

# register as udf
udfBinaryRating=udf(ratingToBinary, IntegerType())
df1 = df.withColumn("binaryRating1", udfBinaryRating("rating"))
df1.show(10)
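
Because `udf` wraps an ordinary Python function, the thresholding logic can be checked on plain integers before it is applied to a column. A minimal sketch, assuming ratings range from 1 to 5 (an assumption about the dataset, not something shown above):

```python
def ratingToBinary(rating):
    # Ratings of 3 and above map to 1, everything else to 0
    if rating >= 3:
        return 1
    else:
        return 0

# Assumed rating scale of 1 to 5
binary = [(r, ratingToBinary(r)) for r in range(1, 6)]
print(binary)
```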

# Using `when` to replicate case statements
Check out other functions [here](https://spark.apache.org/docs/1.4.0/api/python/_modules/pyspark/sql/functions.html).

In [None]:
from pyspark.sql.functions import when, col

df.withColumn("binaryRating2", when(col("rating") >= 3, 1).otherwise(0)).show(10)

# Broadcast/map-side join
Available since 1.5, but only when using Scala.
It will be available in PySpark 1.6; see the pull request [here](https://github.com/Jianfeng-chs/spark/blob/5bf51b8f96c1a9f1addef5d7001123b865eda0db/python/pyspark/sql/functions.py).
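
The idea behind a broadcast (map-side) join is that the small table is copied to every worker as a lookup structure, so each row of the large table can be joined locally without a shuffle. The plain-Python sketch below illustrates that idea only. `map_side_join` and the sample rows are hypothetical, it assumes the join key is unique in the small table, and it is not how Spark implements the join internally.

```python
def map_side_join(large_rows, small_rows):
    """Inner-join two lists of tuples on their first field.

    small_rows plays the role of the broadcast table: it is turned into
    a dict once, then every large row is matched with a local lookup.
    """
    lookup = dict((row[0], row[1:]) for row in small_rows)
    joined = []
    for row in large_rows:
        key = row[0]
        if key in lookup:  # inner join: drop rows without a match
            joined.append(row + lookup[key])
    return joined

# Hypothetical sample data: (user, movie, rating) and (user, age, gender)
sample_ratings = [(1, 242, 3), (2, 302, 5), (1, 51, 4), (9, 10, 2)]
sample_users = [(1, 24, "M"), (2, 53, "F")]

result = map_side_join(sample_ratings, sample_users)
print(result)
```

With RDDs, the same pattern could in principle be expressed by shipping the dict with `sc.broadcast` and reading it through `.value` inside a `map`. That is a workaround idea only and has not been run against the data here.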

In [34]:
schema = StructType([
        StructField("user", IntegerType(), nullable=False),
        StructField("age", IntegerType(), nullable=False),
        StructField("gender", StringType(), nullable=False),
        StructField("job", StringType(), nullable=False),
        StructField("zipcode", StringType(), nullable=False)
    ])

users = (sc
             .textFile("data/meetup/movielens/u.user")
             .map(lambda x: x.split("|"))
             .map(lambda x: (int(x[0]), int(x[1]), x[2], x[3], x[4]))
         )

dfUser = sqlContext.createDataFrame(users, schema)
dfUser.show(10)

user age gender job           zipcode
1    24  M      technician    85711  
2    53  F      other         94043  
3    23  M      writer        32067  
4    24  M      technician    43537  
5    33  F      other         15213  
6    42  M      executive     98101  
7    57  M      administrator 91344  
8    36  M      administrator 05201  
9    29  M      student       01002  
10   53  M      lawyer        90703  
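
The output shows zip codes such as `05201` and `01002`, which hints at why `zipcode` is declared as `StringType` rather than `IntegerType`: converting it to an integer would drop the leading zero. A minimal plain-Python sketch, where `parse_user` is a hypothetical helper mirroring the lambda above and the sample line is rebuilt from the displayed row for user 8:

```python
def parse_user(line):
    # Split on "|" and convert only the user id and age to int
    x = line.split("|")
    return (int(x[0]), int(x[1]), x[2], x[3], x[4])

# Sample line rebuilt from the displayed row for user 8
row = parse_user("8|36|M|administrator|05201")
print(row)

# Converting the zip code to int would lose the leading zero
zip_as_int = int(row[4])
print(zip_as_int)
```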


In [39]:
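from pyspark import SparkContext
from pyspark.sql.dataframe import DataFrame

# Try to reach Scala's functions.broadcast through Py4J and wrap the result as a PySpark DataFrame
def broadcast(df):
    sc = SparkContext._active_spark_context
    return DataFrame(sc._jvm.functions.broadcast(df._jdf), df.sql_ctx)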

In [40]:
df.join(broadcast(dfUser),"user").show(10)

Py4JError: org.apache.spark.sql.functionsbroadcast does not exist in the JVM