
# Data preparation using semantic discovery

In this tutorial, we use Python to discover customer data, understand the semantic type of each field, examine data profiles, and assess data quality. Each section prepares the data for further analysis. We use the semantic features of __Sparkling__ to discover the data type of each column.

### Import Extension Utilities and Create a Spark SQLContext

In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from extension_utils import ExtensionUtils
eu = ExtensionUtils(sqlContext)

### Load dataset 
To load the dataset into Data Scientist Workbench, run the following cell. It downloads the dataset and unzips it into "My Data".

In [2]:
!wget --quiet  --output-document /resources/data/sparklingdataset.zip https://ibm.box.com/shared/static/9nxnsf6xwmuczjea911xjxp8l21yyd2x.zip
!unzip -o /resources/data/sparklingdataset.zip -d /resources/data/sparklingdata/
!rm /resources/data/sparklingdataset.zip

Archive:  /resources/data/sparklingdataset.zip
  inflating: /resources/data/sparklingdata/data/sampleDataDir/customers.csv  
  inflating: /resources/data/sparklingdata/data/sampleDataDir/drugInfo2014.json  
  inflating: /resources/data/sparklingdata/data/sampleDataDir/drugInfo2015.json  
  inflating: /resources/data/sparklingdata/data/sampleDocsDir/Events.doc  
  inflating: /resources/data/sparklingdata/data/sampleDocsDir/News.pdf  


# Explore customer data and prepare for analysis

Let us explore the customer data. This data set has 5 columns and no header row, so the columns appear under the default names C0 through C4.



In [3]:
dfCustomers = sqlContext.read.format("com.ibm.spark.discover").load("/resources/data/sparklingdata/data/sampleDataDir/customers.csv")
dfCustomers.printSchema()
dfCustomers.show(5)

root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)

+-----+----------------+--------------------+--------------------+-------+
|   C0|              C1|                  C2|                  C3|     C4|
+-----+----------------+--------------------+--------------------+-------+
|t1234|       Tracy Doe|     Bank of America|69221 Newman Rd, ...|    250|
|t5566|   Lisa McDonald|         Wells Fargo|555 Bailey Ave, S...|   1000|
|t7666|Lonnie Leo Gomez|       Bank of Texas|1234 Airline Dr, ...|   2000|
|t5567|Stephen Brewster|First Bank of Ame...|425 Market Street...|3500.25|
|t1238|     Smith, Mary|         J.P. Morgan|3821 Twin Oaks Dr...|   5000|
+-----+----------------+--------------------+--------------------+-------+
only showing top 5 rows



## Discover semantic types

As mentioned, this dataset has no header. __Sparkling__ uses the _semanticTypes_ option to find the data type of each column. Additionally, it decomposes each field and finds all of its segments. For example, it discovers the city, ZIP code, and state in an address field.

In [4]:
options = {'extractFields': True, 'semanticTypes': True}
dfCustomersInferred = eu.inferTypes(dfCustomers, options)
dfCustomersInferred.printSchema()
dfCustomersInferred.show(5)

root
 |-- C0: string (nullable = true)
 |-- C1: string (nullable = true)
 |-- C2: string (nullable = true)
 |-- C3: string (nullable = true)
 |-- C4: string (nullable = true)

+-----+----------------+--------------------+--------------------+-------+
|   C0|              C1|                  C2|                  C3|     C4|
+-----+----------------+--------------------+--------------------+-------+
|t1234|       Tracy Doe|     Bank of America|69221 Newman Rd, ...|    250|
|t5566|   Lisa McDonald|         Wells Fargo|555 Bailey Ave, S...|   1000|
|t7666|Lonnie Leo Gomez|       Bank of Texas|1234 Airline Dr, ...|   2000|
|t5567|Stephen Brewster|First Bank of Ame...|425 Market Street...|3500.25|
|t1238|     Smith, Mary|         J.P. Morgan|3821 Twin Oaks Dr...|   5000|
+-----+----------------+--------------------+--------------------+-------+
only showing top 5 rows
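To make the idea of decomposing a field into segments concrete, here is a minimal sketch in plain Python. It is not how __Sparkling__ works internally; the regular expression, the function name `decompose_address`, and the sample address are all hypothetical and only illustrate splitting a US-style address into street, city, state, and ZIP code.

```python
import re

# Hypothetical pattern for addresses shaped like "<street>, <city>, <ST> <zip>".
# This is an illustration only, not the pattern used by Sparkling.
ADDRESS_RE = re.compile(
    r"^(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)

def decompose_address(address):
    """Return the address segments as a dict, or None if the pattern does not match."""
    match = ADDRESS_RE.match(address.strip())
    return match.groupdict() if match else None

# Made-up sample value, not taken from customers.csv.
segments = decompose_address("100 Example St, Springfield, IL 62701")
print(segments)
```

A real semantic-type engine has to cope with many more address layouts; the point here is only that one string column can yield several typed segments.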



## Reveal bad data for 'C2' column
We run type inference with the __revealNA__ option and show the resulting data frame. The result contains every row of the data frame that has bad values.
The __mode__ setting in the options can be "any" or "all".
- "any" shows a row if any of its fields has a bad value.
- "all" shows a row only if all of its fields have bad values.

In [5]:
options = {"semanticTypes": True, "columns": ["C2"], "revealNA": {"mode": "any", "brackets": (">[", "]<")}}
dfCustomersForAnalysis = eu.inferTypes(dfCustomers, options)
dfCustomersForAnalysis.show()

+-----+----------------+--------------------+--------------------+-----+
|   C0|              C1|                  C2|                  C3|   C4|
+-----+----------------+--------------------+--------------------+-----+
|t9954|     John Miller|>[First Farmers &...|1555 Kingston Ave...|  200|
|t8887|   Helen Taranto|       >[BankFirst]<|1800 Century Park...|  300|
|t8763|  Michael Walker|                null|1463 Braxton Stre...|890.1|
|t8667|     Shana Wiley|                null|4589 Holly Street...| 2000|
|t2225|Stephen Brewster|>[First of America]<|4075 Harley Brook...|  600|
|t2229|   Hillary Frost|>[First Farmers &...|1234 Airline Dr, ...|  599|
+-----+----------------+--------------------+--------------------+-----+
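The difference between the "any" and "all" modes can be sketched in plain Python. This is an assumption-laden illustration rather than the __revealNA__ implementation: the function `reveal_bad_rows`, the `is_bad` predicate, and the sample rows are hypothetical, and the sketch checks every column instead of a selected list such as ["C2"].

```python
def reveal_bad_rows(rows, is_bad, mode="any", brackets=(">[", "]<")):
    """Keep rows with bad values and wrap each non-null bad value in brackets.

    rows   : list of dicts (column name -> value)
    is_bad : hypothetical predicate, value -> bool
    mode   : "any" keeps a row if at least one value is bad,
             "all" keeps it only if every value is bad
    """
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    combine = any if mode == "any" else all
    opening, closing = brackets
    revealed = []
    for row in rows:
        flags = {col: is_bad(val) for col, val in row.items()}
        if combine(flags.values()):
            revealed.append({
                col: f"{opening}{val}{closing}" if flags[col] and val is not None else val
                for col, val in row.items()
            })
    return revealed

# Made-up rows and a toy validity check (anything outside VALID counts as bad).
VALID = {"A", "B", "Wells Fargo"}
rows = [
    {"name": "A", "bank": "Wells Fargo"},   # no bad values
    {"name": "B", "bank": "BankFirst"},     # one bad value
    {"name": "??", "bank": None},           # every value is bad
]
is_bad = lambda value: value not in VALID

rows_any = reveal_bad_rows(rows, is_bad, mode="any")
rows_all = reveal_bad_rows(rows, is_bad, mode="all")
print(rows_any)
print(rows_all)
```

With these sample rows, "any" keeps the second and third rows, while "all" keeps only the third.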



## Run profile on customer data and print profile information

We can run __profile__ on the input data frame to generate profile information. It helps us understand the type, range, distribution, and summary statistics of the different columns in our data.
The profile includes, among other fields:   

"StatsNames": names of the reported statistics, e.g. "count", "mean", "min", "max"  
"inferred_type": the inferred data type, e.g. String   
"Bins": bin ranges for numerical columns  
"Values": frequency of each bin  

In [6]:
options = {'extractFields': False, 'semanticTypes': True}
dfCustomersInferred_1 = eu.inferTypes(dfCustomers, options)
dfCustProfiled = eu.profile(dfCustomersInferred_1)
eu.printProfile(dfCustProfiled)

C0:{"StatsNames":["count","numberOfCategories","mode"],"inferred_type":"String","columnSpec":{"type":"String"},"inferred_occurrence":100,"Percentages":[0.06451612903225806,0.06451612903225806,0.03225806451612903,0.03225806451612903,0.03225806451612903,0.03225806451612903],"DiscoveredDataTypePercentages":[1.0],"Values":[2,2,1,1,1,1],"Labels":["t4563","t1239","t2224","t5823","t8763","t1234"],"threshold%":50,"Stats":["31","29","t4563"],"DiscoveredDataTypes":["String"]}
C1:{"StatsNames":["count","numberOfCategories","mode"],"inferred_type":"Person","columnSpec":{"type":"Person"},"inferred_occurrence":96,"Percentages":[0.0967741935483871,0.06451612903225806,0.06451612903225806,0.06451612903225806,0.06451612903225806,0.03225806451612903],"DiscoveredDataTypePercentages":[0.967741935483871,0.03225806451612903],"Values":[3,2,2,2,2,1],"Labels":["Lisa McDonald","Jen Norman","Stephen Brewster","Mary Burchfield","Lonnie Leo Gomez","Peter Frost"],"threshold%":50,"Stats":["31","25","Lisa McDonald"],"
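To help read the profile of a categorical column such as C0, the following sketch computes similar statistics in plain Python. It is only an approximation under stated assumptions: the function `profile_categorical` and the sample values are hypothetical, and the formula for "Percentages" (frequency divided by count) is inferred from the printed output, where 2/31 ≈ 0.0645.

```python
from collections import Counter

def profile_categorical(values, top=6):
    """Compute count, number of categories, mode, and top-label frequencies."""
    counts = Counter(values)
    most_common = counts.most_common(top)  # sorted by descending frequency
    total = len(values)
    return {
        "StatsNames": ["count", "numberOfCategories", "mode"],
        "Stats": [total, len(counts), most_common[0][0] if most_common else None],
        "Labels": [label for label, _ in most_common],
        "Values": [freq for _, freq in most_common],
        # Assumption: percentage = frequency / count, as the output above suggests.
        "Percentages": [freq / total for _, freq in most_common],
    }

# Made-up identifiers, not taken from customers.csv.
profile = profile_categorical(["t1", "t2", "t1", "t3"])
print(profile)
```

The keys mirror the names in the printed profile so that the two can be compared side by side; the actual profile contains additional fields that this sketch does not attempt to reproduce.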

# Summary

This tutorial showed how to discover and explore data and prepare it for further analysis. You can copy this notebook or parts of this notebook into your own notebook and adjust the code as needed.


## Want to learn more?

<a href="http://bigdatauniversity.com/courses/introduction-to-python/?utm_source=tutorial-sparkling-semantic&utm_medium=dswb&utm_campaign=bdu"><img src = "https://ibm.box.com/shared/static/l8yxiek0fg4e15lwz0ikgunj338nrrtd.png"> </a>

Created by: <a href="https://bigdatauniversity.com/?utm_source=bducreatedbylink&utm_medium=dswb&utm_campaign=bdu">The Big Data University Team</a>