## Single Customer View
Steps involved in creating the SCV:
### 1. Build base UserSCV table  
> 1.1. Cleanse the data (validate the email, format the phone number and landline number). This is done by calling a function on each row of the DataFrame.

> 1.2. An intermediate table is created to hold the validated/cleansed data before the original data is transformed.

> 1.3. Create additional fields by combining base fields (FirstName, UserName, LastName, DOB). Some of the combinations are as follows:
> * Firstname_Lastname_RegIP
> * Firstname_Lastname_LastIP
> * Firstname_Lastname_Username
> * Firstname_DOB_City
> * Firstname_Postcode
> * Firstname_Mobilephone
> * DOB_Postcode
> * Address1_Postcode
> * Firstname_Lastname_Address1_City
>
> Create the UserSCV Hive table with the base fields and the additional fields.
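
The combination step in 1.3 can be sketched in plain Python as follows. This is only an illustration under assumptions: the `COMBINATIONS` mapping, the helper names `combine` and `add_combined_fields`, the `_` separator, and the lower-casing/trimming of values are all hypothetical choices, not something the notebook specifies. The base field names are taken from the schema defined later in this notebook.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical subset of the combinations listed above: each key is a new
# column name, each value lists the base fields it is built from.
COMBINATIONS = {
    "Firstname_Lastname_RegIP": ("first_name", "last_name", "RegIP"),
    "Firstname_Lastname_Username": ("first_name", "last_name", "username"),
    "Firstname_Postcode": ("first_name", "Postcode"),
    "DOB_Postcode": ("DOB", "Postcode"),
}


def combine(record: Mapping[str, Optional[str]], fields: Sequence[str]) -> Optional[str]:
    """Join the given fields into one normalised key, or None if any is missing."""
    parts = []
    for name in fields:
        value = record.get(name)
        if value is None or not value.strip():
            return None  # an incomplete key should not be usable for matching
        parts.append(value.strip().lower())  # normalisation is an assumption
    return "_".join(parts)


def add_combined_fields(record: Mapping[str, Optional[str]]) -> dict:
    """Return a copy of the record extended with every combination key."""
    extended = dict(record)
    for column, fields in COMBINATIONS.items():
        extended[column] = combine(record, fields)
    return extended


user = {"first_name": " Ann ", "last_name": "Lee", "username": "annl",
        "RegIP": "10.0.0.1", "DOB": "1990-01-01", "Postcode": None}
row = add_combined_fields(user)
print(row["Firstname_Lastname_RegIP"])  # ann_lee_10.0.0.1
print(row["Firstname_Postcode"])        # None
```

Returning `None` when any component is missing keeps two records with empty postcodes from "matching" on an empty key; whether the real pipeline does the same is not stated in the source.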


### 2. For each data load, perform a check against the base UserSCV table. A record is considered the same if it meets any one of the criteria:

| FirstName | LastName | DOB  | Email | Postcode | Result   |
| :-------:| :-------:| :---:| :----:| :-------:| :-------:|
| X|X|X|X|X|**MATCH**|
|  |X|X|X|X|**MATCH**|
| X| |X|X|X|**MATCH**|
| X|X| |X|X|**MATCH**|
| X|X|X| |X|**MATCH**|
| X|X|X|X| |**MATCH**|

**Minimal conditions for match:**

| S.No| Criteria|
| :--:| :------:|
|1.|Firstname + IP Address|
|2.|Firstname + Username|
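
The two tables above can be read as a single predicate. The sketch below makes several assumptions that should be checked against the real rules: it interprets the first table as "at least four of the five fields agree", treats the minimal conditions as additional sufficient rules, compares values case-insensitively, and uses the `ip_address` column for "IP Address" (the schema also has `RegIP` and `LastIP`). The function names are hypothetical.

```python
MAIN_FIELDS = ("first_name", "last_name", "DOB", "email", "Postcode")


def _norm(value):
    """Trim and lower-case a value; empty strings are treated as missing."""
    if value is None:
        return None
    value = value.strip().lower()
    return value or None


def same(a, b, field):
    """True when both records carry the same non-empty value for `field`."""
    x = _norm(a.get(field))
    return x is not None and x == _norm(b.get(field))


def is_match(a, b):
    """Apply the match criteria to two records given as dicts."""
    # Rule 1: at least four of the five main fields agree (first table).
    if sum(same(a, b, f) for f in MAIN_FIELDS) >= 4:
        return True
    # Rule 2: minimal conditions, first name plus IP address or username.
    if same(a, b, "first_name"):
        return same(a, b, "ip_address") or same(a, b, "username")
    return False


master = {"first_name": "Ann", "last_name": "Lee", "DOB": "1990-01-01",
          "email": "ann@example.com", "Postcode": "AB1 2CD",
          "username": "annl", "ip_address": "10.0.0.1"}

# Same person with a new email address: four of five main fields agree.
changed_email = dict(master, email="ann.lee@example.org",
                     username="other", ip_address="10.9.9.9")
# Only first name and username agree: matches via the minimal conditions.
same_login = {"first_name": "ann", "username": "ANNL"}
# Same surname and postcode only: not enough for a match.
relative = {"first_name": "Bob", "last_name": "Lee", "Postcode": "AB1 2CD"}

print(is_match(master, changed_email))  # True
print(is_match(master, same_login))     # True
print(is_match(master, relative))       # False
```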

### 2.1. Data is given as a CSV file and converted into a table of cleansed data. A join is performed between the UserSCV table and the loaded data, and each criterion mentioned above is checked to determine whether a record matches the existing master UserSCV table.
### 3. If matching records are found, insert a new version of the user record with its Related Id into the UserSCV table.
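
A minimal in-memory sketch of this versioning step is shown below. It assumes a column named `RelatedId` (hypothetical name), that every version points at the `id` of the first record in its chain, and that unmatched records are inserted with `RelatedId` left empty; the source only describes the matched case. The matching predicate is passed in as a parameter, and the lambda used here is a stand-in rather than the real criteria.

```python
def link_record(incoming, master_rows, is_match, next_id):
    """Append `incoming` to `master_rows` as a new row, linked via RelatedId.

    `master_rows` is a list of dicts standing in for the UserSCV table.
    """
    related_id = None
    for existing in master_rows:
        if is_match(existing, incoming):
            # Point at the root of the chain so all versions share one RelatedId.
            root = existing.get("RelatedId")
            related_id = existing["id"] if root is None else root
            break
    new_row = dict(incoming, id=next_id, RelatedId=related_id)
    master_rows.append(new_row)
    return new_row


master = [{"id": 1, "RelatedId": None, "first_name": "Ann", "username": "annl"}]


def same_user(a, b):
    # Stand-in predicate for illustration only.
    return a["username"] == b["username"]


v2 = link_record({"first_name": "Ann", "username": "annl"}, master, same_user, next_id=2)
v3 = link_record({"first_name": "Anne", "username": "annl"}, master, same_user, next_id=3)
new = link_record({"first_name": "Bob", "username": "bobk"}, master, same_user, next_id=4)
print(v2["RelatedId"], v3["RelatedId"], new["RelatedId"])  # 1 1 None
```

In Spark the same idea would be expressed as a join between the loaded data and UserSCV followed by an insert, rather than a Python loop.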


####  Issues faced while building UserSCV

1. **Pre-processing data:** Pre-processing and cleansing were the main challenge when building the base UserSCV table. A function is called on each row to pre-process the data.
2. **Transforming data:** Transforming the pre-processed data before converting it into the UserSCV table involves adding many new fields by combining existing fields in different combinations and assigning each row a unique Id. This unique Id is used as the "Related Id" when a matching user record is found in the UserSCV table.

In [0]:
# standard library
import datetime
import re
import time

# third-party
import pandas as pd
import phonenumbers
from pyspark.sql import Row, SQLContext
from pyspark.sql import functions as F
from pyspark.sql.functions import (col, from_unixtime, lit,
                                   monotonically_increasing_id, unix_timestamp)
from pyspark.sql.types import (DateType, DoubleType, IntegerType, StringType,
                               StructField, StructType, TimestampType)

In [0]:
# schema for SCV User Table 
user_schema = StructType([
            StructField("id", IntegerType(), False),
            StructField("Userid", IntegerType(), True),
            StructField("SkinID", StringType(), True),
            StructField("username", StringType(), True),
            StructField("first_name", StringType(), True),
            StructField("last_name", StringType(), True),
            StructField("email", StringType(), True),
            StructField("gender", StringType(), True), 
            StructField("ip_address", StringType(), True), 
            StructField("RegDate", StringType(), True), 
            StructField("RegIP", StringType(), True), 
            StructField("LastIP", StringType(), True), 
            StructField("DOB", StringType(), True), 
            StructField("Postcode", StringType(), True), 
            StructField("MobilePhone", StringType(), True), 
            StructField("Landline", StringType(), True), 
            StructField("Address1", StringType(), True),
            StructField("City", StringType(), True),
            StructField("County", StringType(), True),
            StructField("Country", StringType(), True),
            StructField("SelfExcludedUntil", StringType(), True),
            StructField("Status", StringType(), True)])
            

In [0]:
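def fixUserRow(c):
    """Cleanse one user row: normalise phone numbers and validate the email."""
    # get the Mobile field
    number = c.MobilePhone

    # initialize variables
    is_valid_number = "N"  # computed below but not included in the returned Row
    clean_number = None
    valid_mail = None

    if number is not None:
        # Clean the mobile number first; c.Country is passed as the default region
        try:
            p = phonenumbers.parse(number, c.Country)

            # truncate_too_long_number() strips trailing digits from p in place
            # and returns True when that yields a valid number
            if phonenumbers.is_valid_number(p) or phonenumbers.truncate_too_long_number(p):
                is_valid_number = "Y"

            clean_number = "%s%s" % (p.country_code, p.national_number)

        except phonenumbers.NumberParseException:
            # numbers that cannot be parsed are stored as None
            clean_number = None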

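    # clean up the landline number: drop hyphens, keep only 10-character values
    phone_no = c.Landline
    if phone_no is not None:
        phone_no = phone_no.replace('-', '')
        if len(phone_no) != 10:
            phone_no = None

    # validate email; skip None values (re.match raises TypeError on None) and
    # require at least one letter after the final dot
    if c.email is not None and re.match(r"^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]+$", c.email):
        valid_mail = c.email
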
    
    return Row(
        id=c.id,
        Userid=c.Userid,
        SkinID=c.SkinID,
        username=c.username,
        first_name=c.first_name,
        last_name=c.last_name,
        email=valid_mail,
        gender=c.gender,
        ip_address=c.ip_address,
        RegDate=c.RegDate,
        RegIP=c.RegIP,
        LastIP=c.LastIP,
        DOB=c.DOB,
        Postcode=c.Postcode,
        MobilePhone=clean_number,
        Landline=phone_no,
        Address1=c.Address1,
        City=c.City,
        County=c.County,
        Country=c.Country,
        SelfExcludedUntil=c.SelfExcludedUntil,
        Status=c.Status
    )
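
Because `fixUserRow` needs a Spark `Row` and the `phonenumbers` package, its email and landline rules are awkward to check in isolation. One possible way to test them is to factor them into small standalone helpers, sketched below. The helper names `validate_email` and `clean_landline` are hypothetical, and the pattern is a slightly stricter variant of the notebook's regex (it requires at least one letter after the final dot).

```python
import re

# Variant of the notebook's email pattern; requires a non-empty final label.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.[A-Za-z]+$")


def validate_email(email):
    """Return the email unchanged if it looks valid, otherwise None."""
    if email is None:
        return None
    return email if EMAIL_PATTERN.match(email) else None


def clean_landline(number):
    """Strip hyphens and keep the number only if 10 characters remain."""
    if number is None:
        return None
    digits = number.replace("-", "")
    return digits if len(digits) == 10 else None


print(validate_email("a.b+c@example.co.uk"))  # a.b+c@example.co.uk
print(validate_email("user@domain."))         # None
print(clean_landline("555-123-4567"))         # 5551234567
print(clean_landline("020-7946-0958"))        # None (11 characters)
```

The 10-character rule mirrors the check in `fixUserRow`; note that it rejects formats such as 11-digit numbers, which may or may not be intended.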


In [0]:
df = spark.sql("select count(*) from userSCVTable_new")
df.show()