# Setup Notebook for Exercises

##### <span style="color:red">IMPORTANT: Only modify cells which have the following comment:</span>
```python
# Modify this cell
```
##### <span style="color:red">Do not add any new cells when you submit the homework</span>

## Creating the Spark Context

In [None]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext(master="local[4]")


## Importing necessary libraries

In [None]:
import os
import sys

from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
from pyspark.sql.functions import *
import Tester.SparkSQL as SparkSQL
pickleFile="Tester/SparkSQL.pkl"

## Creating the SQL Context

In [None]:
# Just like using Spark requires having a SparkContext, using SQL requires an SQLContext
sqlContext = SQLContext(sc)

# Teacher Stuff

In [None]:
import Tester.SparkSQL_Master as SparkSQL_Master
import Tester.SparkSQL as SparkSQL

In [None]:

SparkSQL_Master.gen_exercise_1(pickleFile, sqlContext)
SparkSQL_Master.gen_exercise_2(pickleFile, sqlContext)
SparkSQL_Master.gen_exercise_3(pickleFile, sqlContext)


In [None]:
SparkSQL_Master.exercise_1(sqlContext, pickleFile, SparkSQL_Master.func_ex_1)
SparkSQL_Master.exercise_2(sqlContext, pickleFile, SparkSQL_Master.func_ex_2)
SparkSQL_Master.exercise_3(sqlContext, pickleFile, SparkSQL_Master.func_ex_3)

In [None]:
SparkSQL.exercise_1(sqlContext, pickleFile, SparkSQL_Master.func_ex_1)
SparkSQL.exercise_2(sqlContext, pickleFile, SparkSQL_Master.func_ex_2)
SparkSQL.exercise_3(sqlContext, pickleFile, SparkSQL_Master.func_ex_3)

# Exercises

### Dataframes 
A DataFrame stores two-dimensional data, much like a spreadsheet. Each column can have a different type, and each row contains a `record`.

Spark DataFrames are similar to, but not the same as, `pandas` DataFrames. The important difference is that Spark DataFrames are **distributed** data structures, built on top of RDDs.

##  Exercise 1 -- Creating and transforming dataframes from JSON files

[JSON](http://www.json.org/) is a very popular human-readable file format for storing structured data.
Among its many uses are **Twitter**, JavaScript communication packets, and many others. In fact, this notebook file (with the extension `.ipynb`) is in JSON format. JSON can also be used to store tabular data and can be easily loaded into a dataframe.
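
As a rough illustration of how JSON can hold tabular data, the sketch below uses only Python's standard `json` module (not Spark) to parse records stored one per line, which is the layout Spark's JSON reader expects by default. The variable names are made up for this example.

```python
import json

# Two records in "JSON Lines" form: one complete JSON object per line.
json_lines = """\
{"make_id":"acura","make_display":"Acura","make_is_common":"0","make_country":"USA"}
{"make_id":"alpina","make_display":"Alpina","make_is_common":"1","make_country":"UK"}
"""

# Each non-empty line becomes one record (a dict).
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# The union of the keys gives the column names; each record gives one row.
columns = sorted({key for record in records for key in record})
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

Every record shares the same keys, so the keys act as column names and each line acts as a row.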

In this exercise, you will do the following:

* Read the dataset from a JSON file and store it in a dataframe
* Keep only the rows where the `make_is_common` column equals 1
* Group the rows by the `make_country` column and compute the count for each country
* Return the list of countries whose count is greater than `n`
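
To make the expected result concrete, here is a plain-Python sketch of the same steps (filter, group, count, threshold). It is not a Spark solution: `reference_country_list` is a hypothetical helper, not part of the `Tester` package, and may be useful only for checking a DataFrame-based answer on small inputs. It assumes `make_is_common` is stored as the string `"1"`, as in the sample data, and sorts the result purely to make the output deterministic.

```python
import json
from collections import Counter

def reference_country_list(json_lines, n):
    """Hypothetical plain-Python reference for the steps listed above."""
    records = [json.loads(line) for line in json_lines if line.strip()]
    # Step 2: keep rows where make_is_common is "1" (a string in the sample data).
    common = [r for r in records if r["make_is_common"] == "1"]
    # Step 3: count rows per country.
    counts = Counter(r["make_country"] for r in common)
    # Step 4: keep countries whose count is strictly greater than n.
    return sorted(country for country, count in counts.items() if count > n)

sample_lines = [
    '{"make_id":"acura","make_display":"Acura","make_is_common":"0","make_country":"USA"}',
    '{"make_id":"alpina","make_display":"Alpina","make_is_common":"1","make_country":"UK"}',
    '{"make_id":"aston-martin","make_display":"Aston Martin","make_is_common":"1","make_country":"UK"}',
]
result = reference_country_list(sample_lines, 1)
print(result)
```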

######  <span style="color:blue">Sample Input:</span>
```python
 
example.json has the following contents

        {"make_id":"acura","make_display":"Acura","make_is_common":"0","make_country":"USA"}
        {"make_id":"alpina","make_display":"Alpina","make_is_common":"1","make_country":"UK"}
        {"make_id":"aston-martin","make_display":"Aston Martin","make_is_common":"1","make_country":"UK"}
   
country_list = get_country_list("example.json", 1, sqlContext)

```
######  <span style="color:magenta">Sample Output:</span>
country_list = ['UK']


In [None]:
# Modify this cell

def get_country_list(json_filepath, n, sqlContext):
    # read json file into a dataframe
    makes_df = None
    
    # check the schema of the json file, uncomment the next line to see the schema
    #makes_df.printSchema()
    
    # The schema should look like the one below
    #root
    # |-- make_country: string (nullable = true)
    # |-- make_display: string (nullable = true)
    # |-- make_id: string (nullable = true)
    # |-- make_is_common: string (nullable = true)
    
    country_list = []
    
    # Your implementation goes here
    
    return country_list

In [None]:
import Tester.SparkSQL as SparkSQL
SparkSQL.exercise_1(sqlContext, pickleFile, get_country_list)

## Exercise 2 -- Creating and transforming dataframes from Parquet files
[Parquet](http://parquet.apache.org/) is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. 


In this exercise, you will do the following:

* Read the dataset from a parquet file and store it in a dataframe
* Write a SQL query to group the rows by `make_country` and compute the count for each country
* Sort the countries by count in descending order
* Return a list of `(country, count)` tuples for the top `n` countries
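
The grouping and ranking logic can be sketched in plain Python as follows. This is only a reference for the expected output on small inputs, not the SQL query the exercise asks for. `reference_top_n` is a hypothetical helper. The exercise does not say how ties should be ordered; `Counter.most_common` keeps tied items in first-seen order.

```python
from collections import Counter

def reference_top_n(countries, n):
    """Hypothetical reference: the n most frequent countries with their counts."""
    counts = Counter(countries)
    # most_common(n) returns (value, count) pairs sorted by count, descending.
    return counts.most_common(n)

# The make_country values from the sample input, in order.
sample_countries = ["USA", "UK", "UK", "USA", "Germany", "UK"]
top_two = reference_top_n(sample_countries, 2)
print(top_two)
```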

######  <span style="color:blue">Sample Input:</span>
```python
 
example.parquet has contents similar to the following json data

        {"make_id":"a","make_display":"A","make_is_common":"0","make_country":"USA"}
        {"make_id":"b","make_display":"B","make_is_common":"1","make_country":"UK"}
        {"make_id":"c","make_display":"C","make_is_common":"1","make_country":"UK"}
        {"make_id":"d","make_display":"D","make_is_common":"1","make_country":"USA"}
        {"make_id":"e","make_display":"E","make_is_common":"0","make_country":"Germany"}
        {"make_id":"f","make_display":"F","make_is_common":"0","make_country":"UK"}
   
top_n_country_list = get_top_n_country_list("example.parquet", 2, sqlContext)

```
######  <span style="color:magenta">Sample Output:</span>
top_n_country_list = [('UK', 3), ('USA', 2)]

In [None]:
# Modify this cell
def get_top_n_country_list(parquet_path, n, sqlContext):
    # read the parquet file
    makes_df = None
    
    # check the schema of the parquet file, uncomment the next line to see the schema
    #makes_df.printSchema()
    
    # The schema should look like the one below
    #root
    # |-- make_country: string (nullable = true)
    # |-- make_display: string (nullable = true)
    # |-- make_id: string (nullable = true)
    # |-- make_is_common: string (nullable = true)
    
    # create a temporary table or view to manipulate and query data using SQL
    makes_table = None
    
    # write the SQL query to group rows by make_country and its count
    query= None
    
    # Uncomment this line to get the dataframe by running the SQL query
    #query_result_df = sqlContext.sql(query)
    
    # Your implementation to return the list of top "n" make_country in descending order of their count
    top_n_country_list = []
    return top_n_country_list


In [None]:
import Tester.SparkSQL as SparkSQL
SparkSQL.exercise_2(sqlContext, pickleFile, get_top_n_country_list)

## Exercise 3 -- Creating and transforming dataframes from CSV files


In this exercise, you will do the following:

* Read the dataset from a csv file and store it in a dataframe
* Keep only the rows whose first column, `name`, contains the word "city", and return the count of those rows
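
For comparison, a plain-Python version of this count using the standard `csv` module might look like the sketch below. It is not a Spark solution. `reference_city_count` is a hypothetical helper, and it assumes a case-sensitive substring match on the `name` column, which the exercise does not state explicitly.

```python
import csv
import io

def reference_city_count(csv_text):
    """Hypothetical reference: count rows whose 'name' field contains 'city'."""
    # skipinitialspace tolerates the spaces after commas in the sample header.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return sum(1 for row in reader if "city" in row["name"])

sample_csv = (
    "name, country, subcountry, geonameid\n"
    "logan city,australia,queensland,7281838\n"
    "carindale,australia,queensland,7281839\n"
)
city_rows = reference_city_count(sample_csv)
print(city_rows)
```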

######  <span style="color:blue">Sample Input:</span>
```python
 
example.csv has contents similar to the following csv data 

    name, country, subcountry, geonameid
    logan city,australia,queensland,7281838
    carindale,australia,queensland,7281839
   
city_count = get_city_count("example.csv", sqlContext)

```
######  <span style="color:magenta">Sample Output:</span>
city_count = 1

In [None]:
# Modify this cell

def get_city_count(csv_filepath, sqlContext):
    # read csv file into a dataframe
    city_df = None
    
    # check the schema of the csv file, uncomment the next line to see the schema
    # city_df.printSchema()
    
    # filter the df and return the count, you can do (city_df.filter('YOUR FILTER CONDITION').count())
    count = None
    
    return count

In [None]:
import Tester.SparkSQL as SparkSQL
SparkSQL.exercise_3(sqlContext, pickleFile, get_city_count)