# Advertising Technology Sample Notebook (Part 1)
This notebook provides example code for making sense of advertising-based web logs. It does the following:
* Set up the connection to your S3 bucket to access the web logs
* Create an external table against these web logs, using regular expressions to parse the logs
* Identify the country (ISO 3166-1 three-letter country code) based on IP address by calling a REST web service API
* Identify browser and OS information based on the User Agent string within the web logs using the user-agents PyPI package
* Convert the Apache web log date information, create a userid, and join back to the browser and OS information
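
The regular-expression parsing and date conversion steps listed above can be sketched in plain Python. This is a minimal illustration, assuming the logs follow the Apache combined log format; the pattern, the `parse_log_line` helper, and the sample line are hypothetical and are not taken from the notebook's dataset.

```python
import re
from datetime import datetime

# Assumed layout (Apache combined log format):
# host ident user [timestamp] "request" status bytes "referer" "user agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Apache timestamps look like 10/Oct/2000:13:55:36 -0700
    fields["ts"] = datetime.strptime(fields["ts"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields

# Hypothetical sample line, for illustration only
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
record = parse_log_line(sample)
print(record["ip"], record["ts"].isoformat(), record["agent"])
```

Returning `None` for lines that do not match keeps malformed entries from stopping the whole parse; the caller can count or log them separately.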

## Create External Table
* Create an external table against the Ubar cars dataset
* The table definition handles the parsing, so no separate ETL logic is needed.

In [0]:
display(dbutils.fs.ls("/FileStore/tables/"))

path,name,size
dbfs:/FileStore/tables/Aggregated_Report_2018_03_25-d4a14.csv,Aggregated_Report_2018_03_25-d4a14.csv,55
dbfs:/FileStore/tables/ApartmentMaintenance.json,ApartmentMaintenance.json,733358
dbfs:/FileStore/tables/Apartment_Maintenance__1_-17a3c.csv,Apartment_Maintenance__1_-17a3c.csv,548837
dbfs:/FileStore/tables/Apartment__1_2_json-b3c24.txt,Apartment__1_2_json-b3c24.txt,421697
dbfs:/FileStore/tables/Apartment__1__2-d398f.csv,Apartment__1__2-d398f.csv,279449
dbfs:/FileStore/tables/Building.json,Building.json,193401
dbfs:/FileStore/tables/Building_Mainenance.json,Building_Mainenance.json,717938
dbfs:/FileStore/tables/Building_Maintenance__1_-c86c7.csv,Building_Maintenance__1_-c86c7.csv,562035
dbfs:/FileStore/tables/Building__1_-108aa.csv,Building__1_-108aa.csv,67170
dbfs:/FileStore/tables/Contractor_Table.json,Contractor_Table.json,180883


In [0]:
from pyspark.sql import SparkSession

# Note: despite its name, `sc` is a SparkSession, not a SparkContext
sc = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

In [0]:
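from pyspark.sql import HiveContext

# HiveContext expects a SparkContext; `sc` is a SparkSession, so pass its underlying context
hivecontext = HiveContext(sc.sparkContext)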

In [0]:
hivecontext.setConf('hive.support.concurrency', 'true')
hivecontext.setConf('hive.enforce.bucketing', 'true')
# Valid values are 'strict' and 'nonstrict'; 'nonstrict' allows every partition column to be dynamic
hivecontext.setConf('hive.exec.dynamic.partition.mode', 'nonstrict')
hivecontext.setConf('hive.compactor.initiator.on', 'true')
hivecontext.setConf('hive.compactor.worker.threads', '1')

In [0]:
hivecontext.sql('use default')
hivecontext.sql('show tables').show()
# hivecontext.sql('drop table sample_database.new_sample')
# hivecontext.sql('drop database sample_database')
# hivecontext.sql('create database sample_database')


In [0]:
hivecontext.sql('''
    CREATE TABLE new_sample (
        city STRING,
        population INT
    )
    PARTITIONED BY (country STRING)
    TBLPROPERTIES ("skip.header.line.count"="1")
''')

hivecontext.sql('show tables').show()

In [0]:
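# Dynamic-partition insert: the last SELECT column (country) supplies the partition value; the source table sample_csv must already exist
hivecontext.sql('INSERT INTO TABLE new_sample PARTITION (country) SELECT city, population, country FROM sample_csv')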
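
In a dynamic-partition insert like the one above, no partition value is named in the statement; each row's partition is taken from the last column of the `SELECT` (here `country`). As a rough illustration of that routing, and not of how Hive implements it, the following plain-Python sketch groups rows into `country=<value>` partition keys. The rows and the `route_to_partitions` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows in SELECT order: (city, population, country)
rows = [
    ("Bangalore", 100, "India"),
    ("Mumbai", 200, "India"),
    ("Madrid", 300, "Spain"),
]

def route_to_partitions(rows):
    """Group rows by their last column, mimicking dynamic partition routing."""
    partitions = defaultdict(list)
    for *data_columns, partition_value in rows:
        # The partition column goes into the partition name, not into the stored row
        partitions[f"country={partition_value}"].append(tuple(data_columns))
    return dict(partitions)

partitions = route_to_partitions(rows)
for name, data in sorted(partitions.items()):
    print(name, data)
```

This is why the column order in the `SELECT` matters: placing `country` anywhere but last would route rows by the wrong column.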


In [0]:
# new_sample was created in the default database, so it is queried without a database prefix
hivecontext.sql("select count(*) from new_sample").show()

In [0]:
hivecontext.sql("select * from new_sample limit 10").show()

In [0]:
from pyspark.sql.types import StringType, IntegerType, StructType, StructField

schema = StructType([
            StructField("city", StringType(), True),
            StructField("country", StringType(), True),
            StructField("population", IntegerType(), True)])

countries = ['India', 'USA', 'Brazil', 'Spain']
cities = ['Bangalore', 'New York', '   Sao Paulo   ', 'Madrid']
population = [422300000,134795791,12341418,6489162]


In [0]:
df = sc.createDataFrame(list(zip(cities, countries, population)), schema=schema)
df.show()
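
`createDataFrame` matches tuple positions to schema fields in order, so the argument order passed to `zip` has to follow the schema (`city`, `country`, `population`). A quick plain-Python check of the rows being built, repeating the lists from the cell above so the snippet stands alone, might look like this:

```python
# Same lists as in the notebook cell above
countries = ['India', 'USA', 'Brazil', 'Spain']
cities = ['Bangalore', 'New York', '   Sao Paulo   ', 'Madrid']
population = [422300000, 134795791, 12341418, 6489162]

# zip pairs items by position; the argument order must match the schema field order
rows = list(zip(cities, countries, population))

for city, country, pop in rows:
    # repr() makes the padding in '   Sao Paulo   ' visible
    print(repr(city), country, pop)
```

Note that `zip` stops at the shortest input, so a list with a missing entry would silently drop rows rather than raise an error.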

In [0]:
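# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df.createOrReplaceTempView('update_dataframe')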
df.printSchema()

# Keep only the rows whose population equals 6489162
df_filter = df.filter(df.population == 6489162)
print(df_filter)
df_filter.show()

In [0]:
hivecontext.sql('''
    INSERT OVERWRITE TABLE new_sample PARTITION (country)
    SELECT city, population, country
    FROM update_dataframe
''')


In [0]:
hivecontext.sql("select * from new_sample limit 10").show()