# MovieLens with Spark SQL

[MovieLens](https://movielens.org/) is a website with movie recommendations, similar to [IMDB](http://www.imdb.com/). The data is available at [this address](http://grouplens.org/datasets/movielens/). You can find the data in `data/ml-100k`. A more precise description of the data can be found in `data/ml-100k/README`.

In [1]:
import pyspark
import pyspark.sql.functions as func
import pyspark.sql.types as types
import matplotlib
%matplotlib inline 
sc = pyspark.SparkContext(appName='MovieLens')
sqlContext = pyspark.sql.SQLContext(sc)

Below we define a few utility functions.

In [2]:
def to_bool(value):
    '''
    Converts an integer-like value to a boolean: 0 becomes False, any non-zero value becomes True
    
    @param value: value to convert (an int or a string holding an int)
    @returns: the corresponding boolean
    '''
    return int(value) != 0
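
Fields read from a text file arrive as strings, which is why `to_bool` goes through `int` first. A minimal sketch of the difference, with the helper repeated in compact form so the snippet runs on its own (the sample `flags` list is made up for illustration):

```python
def to_bool(value):
    """Compact equivalent of the helper above: 0 -> False, non-zero -> True."""
    return int(value) != 0

# Genre flags are read from the file as the strings '0' and '1'.
flags = ['0', '1', '0']
print([to_bool(f) for f in flags])   # [False, True, False]

# Calling bool() directly on the string would give the wrong answer,
# because any non-empty string is truthy.
print(bool('0'))                     # True
```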

In [3]:
def data_from_csv(line):
    '''
    Converts a line of data table from CSV to DataFrame Row
    
    @param line: line of data row 
    @returns: Row of parsed values
    '''
    c = line.split('\t')
    
    row = dict()
    row['userId'] = int(c[0])
    row['itemId'] = int(c[1])
    row['rating'] = int(c[2])
    row['timestamp'] = int(c[3]) # A Unix timestamp is a long, but in Python 3 int covers both int and long from Python 2.
    
    return pyspark.Row(**row)
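
To see what `data_from_csv` works with, the sketch below parses one tab-separated line without Spark. The sample values are taken from the first row printed by `data.show()` further down; the plain `dict` result and the helper name `parse_rating_line` are illustrative assumptions, not part of the notebook. It also shows one way to turn the Unix timestamp into a readable UTC date.

```python
from datetime import datetime, timezone

def parse_rating_line(line):
    """Parse one line of u.data into a plain dict (hypothetical helper)."""
    user_id, item_id, rating, timestamp = line.split('\t')
    return {
        'userId': int(user_id),
        'itemId': int(item_id),
        'rating': int(rating),
        'timestamp': int(timestamp),
    }

# Column order: user id, item id, rating, timestamp.
sample = '196\t242\t3\t881250949'
row = parse_rating_line(sample)

# Interpret the timestamp as seconds since the Unix epoch, in UTC.
rated_at = datetime.fromtimestamp(row['timestamp'], tz=timezone.utc)

print(row)
print(rated_at.isoformat())
```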

In [4]:
def item_from_csv(line):
    '''
    Converts a line of item table from CSV to DataFrame Row
    
    @param line: line of item row 
    @returns: Row of parsed values
    '''
    c = line.split('|')
    
    row = dict()
    row['movieId'] = int(c[0])
    row['movieTitle'] = str(c[1])
    row['releaseDate'] = str(c[2])
    row['videoReleaseDate'] = str(c[3])
    row['imdbUrl'] = str(c[4])
    row['unknown'] = to_bool(c[5])
    row['action'] = to_bool(c[6])
    row['adventure'] = to_bool(c[7])
    row['animation'] = to_bool(c[8])
    row['childrens'] = to_bool(c[9])
    row['comedy'] = to_bool(c[10])
    row['crime'] = to_bool(c[11])
    row['documentary'] = to_bool(c[12])
    row['drama'] = to_bool(c[13])
    row['fantasy'] = to_bool(c[14])
    row['filmNoir'] = to_bool(c[15])
    row['horror'] = to_bool(c[16])
    row['musical'] = to_bool(c[17])
    row['mystery'] = to_bool(c[18])
    row['romance'] = to_bool(c[19])
    row['sciFi'] = to_bool(c[20])
    row['thriller'] = to_bool(c[21])
    row['war'] = to_bool(c[22])
    row['western'] = to_bool(c[23])
    
    return pyspark.Row(**row)

In [5]:
def user_from_csv(line):
    '''
    Converts a line of user table from CSV to DataFrame Row
    
    @param line: line of user row 
    @returns: Row of parsed values
    '''
    c = line.split('|')
    
    row = dict()
    row['userId'] = int(c[0])
    row['age'] = str(c[1])
    row['gender'] = str(c[2])
    row['occupation'] = str(c[3])
    row['zipCode'] = str(c[4])
        
    return pyspark.Row(**row)

We load the data into DataFrames.

In [7]:
data_rdd = sc.textFile('data/ml-100k/u.data').map(data_from_csv)
data = sqlContext.createDataFrame(data_rdd)
data.printSchema()
data.show()

root
 |-- itemId: long (nullable = true)
 |-- rating: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- userId: long (nullable = true)

+------+------+---------+------+
|itemId|rating|timestamp|userId|
+------+------+---------+------+
|   242|     3|881250949|   196|
|   302|     3|891717742|   186|
|   377|     1|878887116|    22|
|    51|     2|880606923|   244|
|   346|     1|886397596|   166|
|   474|     4|884182806|   298|
|   265|     2|881171488|   115|
|   465|     5|891628467|   253|
|   451|     3|886324817|   305|
|    86|     3|883603013|     6|
|   257|     2|879372434|    62|
|  1014|     5|879781125|   286|
|   222|     5|876042340|   200|
|    40|     3|891035994|   210|
|    29|     3|888104457|   224|
|   785|     3|879485318|   303|
|   387|     5|879270459|   122|
|   274|     2|879539794|   194|
|  1042|     4|874834944|   291|
|  1184|     2|892079237|   234|
+------+------+---------+------+
only showing top 20 rows



In [8]:
item_rdd = sc.textFile('data/ml-100k/u.item').map(item_from_csv)
item = sqlContext.createDataFrame(item_rdd)
item.printSchema()
item.show()

root
 |-- action: boolean (nullable = true)
 |-- adventure: boolean (nullable = true)
 |-- animation: boolean (nullable = true)
 |-- childrens: boolean (nullable = true)
 |-- comedy: boolean (nullable = true)
 |-- crime: boolean (nullable = true)
 |-- documentary: boolean (nullable = true)
 |-- drama: boolean (nullable = true)
 |-- fantasy: boolean (nullable = true)
 |-- filmNoir: boolean (nullable = true)
 |-- horror: boolean (nullable = true)
 |-- imdbUrl: string (nullable = true)
 |-- movieId: long (nullable = true)
 |-- movieTitle: string (nullable = true)
 |-- musical: boolean (nullable = true)
 |-- mystery: boolean (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- romance: boolean (nullable = true)
 |-- sciFi: boolean (nullable = true)
 |-- thriller: boolean (nullable = true)
 |-- unknown: boolean (nullable = true)
 |-- videoReleaseDate: string (nullable = true)
 |-- war: boolean (nullable = true)
 |-- western: boolean (nullable = true)

+------+---------+--------

In [9]:
user_rdd = sc.textFile('data/ml-100k/u.user').map(user_from_csv)
user = sqlContext.createDataFrame(user_rdd)
user.printSchema()
user.show()

root
 |-- age: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- userId: long (nullable = true)
 |-- zipCode: string (nullable = true)

+---+------+-------------+------+-------+
|age|gender|   occupation|userId|zipCode|
+---+------+-------------+------+-------+
| 24|     M|   technician|     1|  85711|
| 53|     F|        other|     2|  94043|
| 23|     M|       writer|     3|  32067|
| 24|     M|   technician|     4|  43537|
| 33|     F|        other|     5|  15213|
| 42|     M|    executive|     6|  98101|
| 57|     M|administrator|     7|  91344|
| 36|     M|administrator|     8|  05201|
| 29|     M|      student|     9|  01002|
| 53|     M|       lawyer|    10|  90703|
| 39|     F|        other|    11|  30329|
| 28|     F|        other|    12|  06405|
| 47|     M|     educator|    13|  29206|
| 45|     M|    scientist|    14|  55106|
| 49|     F|     educator|    15|  97301|
| 21|     M|entertainment|    16|  10309|
| 30| 

## Exercises

* Find all films from the year 1983; how many are there?
* Count the frequency of user occupations.
* Find the top 20 films with the highest rating.
* ★ Find the best movie (highest rating) for each of the 20 most active users.
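
As a hint for the first exercise: `releaseDate` is stored as a string, so the year has to be extracted before filtering. A minimal sketch, assuming dates look like `01-Jan-1995` (check `data/ml-100k/README` and the output of `item.show()` to confirm) and that some entries may be empty; `release_year` is a hypothetical helper name, not part of the notebook.

```python
def release_year(release_date):
    """Return the year from a 'DD-Mon-YYYY' string, or None if it cannot be parsed."""
    parts = release_date.strip().split('-')
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return int(parts[2])

print(release_year('01-Jan-1995'))  # 1995
print(release_year(''))             # None
```

A function like this can be checked on plain strings first and then wrapped with `func.udf` to apply it to the `item` DataFrame.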
