## Problem Statement
ANCXY is an e-commerce company. It wants to target the right customers with the right products
to increase overall revenue and conversion rate.

To do this, the company needs an ML model for marketing built on each user's past interactions with
products, such as number of views, most viewed product, number of activities, and user vintage.

The company has visitor web log data for its customers.

Task#1: Develop Input Feature from Visitor Web Log Data. (Completed)

Task#2: Create ML Model

In [None]:
import pandas as pd
import numpy as np

# Importing DataSet

In [None]:
##Importing Data (Customers of Company and their Visiting Web Log)

UserLogOrig = pd.read_csv(r"../input/targetrightcustomers/data/userTable.csv")
WebVistLogOrig = pd.read_csv(r"../input/targetrightcustomers/data/VisitorLogsData.csv")

In [None]:
##Creating Data Copy
User = UserLogOrig.copy()
WebLog= WebVistLogOrig.copy()

## Initial Data Inspection

In [None]:
#Data Sample of Input Files (User Files)
User.head()

In [None]:
#Checking NA Values in UserFile.
round((User.isnull().sum()/User.shape[0]*100),2),User.shape

In [None]:
User.shape

In [None]:
User.info()

In [None]:
User.describe()

In [None]:
User.columns

The User data set contains basic information about the company's customers.

1. UserID: ID of the registered customer.
2. Signup Date: Date and time when the customer registered on the company website/app.
3. User Segment: Category assigned to the user by the company.


The data set has no "NA" values. A row exists only once a user has registered, so UserID and Signup Date should never be missing. User Segment is assigned by the company's systems, so missing values there should be rare; however, we will still handle NA values for the User Segment column.

In [None]:
#Data Sample of Input Files WebSite Logs
WebLog.head()

In [None]:
#Checking NA Values in LogFile.
round((WebLog.isnull().sum()/WebLog.shape[0]*100),2),User.shape, WebLog.shape

In [None]:
WebLog.info()

In [None]:
WebLog.describe(include=object)

In [None]:
WebLog.shape

In [None]:
WebLog.columns

In [None]:
#User Count checking in Both Files
UserCnt = pd.Series(User['UserID'].nunique())
LogFile = pd.Series(WebLog['UserID'].nunique())
UserCount = {"MainFileUser":UserCnt,"LogFileUser":LogFile}
UniqueDf = pd.concat(UserCount,axis=1)
UniqueDf.head()

The Visitor WebLog data set contains the browsing activity of the company's customers.


1. webClientID: Unique ID of the browser on each system.
2. VisitDateTime: Date and time of the visit to the website/app to view a product.
3. ProductID: Product viewed by the customer.
4. UserID: UserID of the registered user.
5. Activity: Type of action performed on the website (click or page load).
6. Browser: Browser used by the customer.
7. OS: Operating system used by the customer.
8. City: City of the customer.
9. Country: Country of the customer.

Almost every column except webClientID contains "NA" values; we will handle them as needed during feature creation.

## Validation of Input Data Layout
Here we validate the column names and data types of the input data; any mismatch is handled accordingly.
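
A minimal sketch of this kind of layout check, written in plain Python and independent of any validation library. It tests that every record has exactly the expected column names and that each value has the expected type. The helper name `records_match_layout` and the sample rows are hypothetical, chosen only for illustration.

```python
def records_match_layout(records, layout):
    """Return True if every record has exactly the columns in `layout`
    and each value is an instance of the expected type."""
    for record in records:
        if set(record) != set(layout):
            return False
        for column, expected_type in layout.items():
            if not isinstance(record[column], expected_type):
                return False
    return True


# Hypothetical sample rows; the notebook reads the real ones from CSV files.
user_layout = {"UserID": str, "Signup Date": str}
good_rows = [{"UserID": "U1", "Signup Date": "2018-05-01 10:00:00"}]
bad_rows = [{"UserID": 42, "Signup Date": "2018-05-01 10:00:00"}]

print(records_match_layout(good_rows, user_layout))
print(records_match_layout(bad_rows, user_layout))
```

Checking a small sample keeps the test cheap, but it only catches problems that show up in the sampled rows.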


## Input Feature Creation Detail

From the customer data above, we will create input features for our ML model. The features are as follows.

1. UserID: User ID of the registered user.
2. No_of_days_Visited_7_Days: Number of days the customer was active on the website in the last 7 days.
3. No_Of_Products_Viewed_15_Days: Number of products viewed by the customer in the last 15 days.
4. User_Vintage: Age of the customer's account, in days since signup.
5. Most_Viewed_product_15_Days: Product most viewed by the customer in the last 15 days.
6. Most_Active_OS: Operating system most frequently used by the customer.
7. Recently_Viewed_Product: Last product viewed by the customer.
8. Pageloads_last_7_days: Number of page load activities by the customer on the website in the last 7 days.
9. Clicks_last_7_days: Number of click activities by the customer on the website in the last 7 days.
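
The 7-day definitions in this list can be sanity-checked on a tiny hand-made log. The sketch below uses plain Python rather than pandas; the log rows are invented for illustration, and the latest date in the log is assumed to stand in for "today".

```python
from datetime import date, timedelta

# Hypothetical log rows: (user, visit date, activity).
log = [
    ("U1", date(2018, 5, 27), "PAGELOAD"),
    ("U1", date(2018, 5, 27), "CLICK"),
    ("U1", date(2018, 5, 25), "PAGELOAD"),
    ("U1", date(2018, 5, 10), "CLICK"),  # outside the 7-day window
    ("U2", date(2018, 5, 26), "CLICK"),
]

reference_date = max(visit for _, visit, _ in log)  # latest date in the log
cutoff = reference_date - timedelta(days=7)

features = {}
for user, visit, activity in log:
    if visit <= cutoff:
        continue  # keep only rows dated strictly after the cutoff
    row = features.setdefault(user, {"days": set(), "PAGELOAD": 0, "CLICK": 0})
    row["days"].add(visit)  # a set counts each visit date once
    row[activity] += 1

summary = {
    user: {
        "No_of_days_Visited_7_Days": len(row["days"]),
        "Pageloads_last_7_days": row["PAGELOAD"],
        "Clicks_last_7_days": row["CLICK"],
    }
    for user, row in features.items()
}
print(summary)
```

Note that the number of days visited counts distinct dates, so two events on the same day contribute one day.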


*Specific Condition*
1. Most_Viewed_product_15_Days: If multiple products have the same number of page loads, take the most recently viewed one. If a
user has not viewed any product in the last 15 days, use Product101.
2. Recently_Viewed_Product: If a user has not viewed any product, use Product101.
3. We will use the last 21 days of the visitor log to create the input features.
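
The tie-breaking rule and the Product101 default can be illustrated for a single user in plain Python. This is a sketch under stated assumptions: the function name `most_viewed_product` and the product IDs are hypothetical, and each view is assumed to carry a full timestamp.

```python
from collections import Counter
from datetime import datetime

DEFAULT_PRODUCT = "Product101"


def most_viewed_product(views):
    """`views` is a list of (product, view time) pairs for one user.
    Returns the product with the most views; ties go to the product
    viewed most recently. Returns the default when there are no views."""
    if not views:
        return DEFAULT_PRODUCT
    counts = Counter(product for product, _ in views)
    last_seen = {}
    for product, seen_at in views:
        if product not in last_seen or seen_at > last_seen[product]:
            last_seen[product] = seen_at
    # Compare by view count first, then by the latest view time.
    return max(counts, key=lambda p: (counts[p], last_seen[p]))


# Hypothetical views: both products have two page loads each.
views = [
    ("Pr100", datetime(2018, 5, 20, 9, 0)),
    ("Pr100", datetime(2018, 5, 21, 9, 0)),
    ("Pr200", datetime(2018, 5, 22, 9, 0)),
    ("Pr200", datetime(2018, 5, 23, 9, 0)),
]
print(most_viewed_product(views))
print(most_viewed_product([]))
```

The key point is that views are counted per product across the whole window before the recency tie-break is applied.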

## Dropping Unnecessary Information from Input Data
After analysing the input source data, some columns are not required for our input features. We will drop these columns.

*User Data*
1. User Segment

*Visitor Log Data*
1. City
2. Country

In [None]:
##Unnecessary Column Dropping
User.drop("User Segment", axis=1, inplace=True)
WebLog.drop(["City","Country"],axis=1,inplace=True)

## Input Data Validation (Schema Validation)

In [None]:
from schema import Schema

In [None]:
##Schema definitions for the User and WebLog files.
UserSchema = Schema([{"UserID": str,"Signup Date": str}])
WebLogSchema = Schema([{"webClientID": str,"VisitDateTime": str,"ProductID":str,"UserID":str,"Activity":str,"Browser":str,"OS":str}])

In [None]:
##Since the data in each column is expected to be homogeneous, we validate the structure on the first 10 records of each input file (rows with NA values are excluded).
UserFirst10 =pd.DataFrame(User.head(10).dropna())
LogFirst10 = pd.DataFrame(WebLog.head(10).dropna())

In [None]:
##Formatting of sample records of input files in required format for validating the schema.
User10 = UserFirst10.to_dict('records')
Log10 = LogFirst10.to_dict('records')

In [None]:
#Validating the Schema
UserValid = UserSchema.is_valid(User10)
WebValid = WebLogSchema.is_valid(Log10)

In [None]:
##Email Setup For Notification
#import yagmail
#Receipient = "pand.anup@gmail.com"
#Subject = "Input Data Structure Failed"
#MailMessage = "Structure of Input Data is Not Valid, Please check and fix the Issue"
#Rmail = yagmail.SMTP("pand.anup@gmail.com")

##This section is commented because of security reason.

In [None]:
if UserValid == True:
    FinalUser = User.copy()
else:
    print("Structure of Input Data is Not Valid, Please check and fix the Issue")
    #Rmail.send(to =Receipient,subject = Subject, contents = MailMessage)

##Commenting Mail sender code for security reason.


In [None]:
if WebValid == True:
    FinalWeb = WebLog.copy()
else:
    print("Structure of Input Data is Not Valid, Please check and fix the Issue")
    #Rmail.send(to =Receipient,subject = Subject, contents = MailMessage)

##Commenting Mail sender code for security reason.


The input data structure is now validated, and feature creation starts from the following final data sets.
1. For User: FinalUser
2. For Log: FinalWeb

In [None]:
FinalUser.head()

In [None]:
FinalWeb.head()

## Data Standardization of Input Feature

In [None]:
#Renaming Columns of User Files
FinalUser=FinalUser.rename(columns={'Signup Date':'Signup_Date'})

In [None]:
##Creating Date Variable in User Table.
from datetime import datetime,date
import calendar
FinalUser["Signup_Date"] = FinalUser["Signup_Date"].apply(lambda x : pd.to_datetime(str(x)))
FinalUser["Dates"] = FinalUser["Signup_Date"].dt.date
FinalUser["Time"] = FinalUser["Signup_Date"].dt.time
FinalUser.drop('Signup_Date',axis=1,inplace=True)

In [None]:
## Dropping NA rows from WebLog for unregistered User since we are working for Registered User Only.
FinalWeb.dropna(subset=['UserID','VisitDateTime'], inplace=True)

In [None]:
FinalWeb.shape

In [None]:
##VisitDateTime mixes formatted date strings and Unix timestamps, so every value is converted to one standard date-time format.

from datetime import datetime,date
import calendar
def unix_or_dt(x):
    Date_Format = '%Y-%m-%d %H:%M:%S'
    try:
        # Value is already a formatted date string with fractional seconds.
        return datetime.strptime(str(x['VisitDateTime']), '%Y-%m-%d %H:%M:%S.%f').strftime(Date_Format)
    except ValueError:
        # Otherwise treat the first 10 digits as a Unix timestamp in seconds.
        return datetime.utcfromtimestamp(int(str(x['VisitDateTime'])[:10])).strftime(Date_Format)

FinalWeb["VisitDateTime"]= FinalWeb.apply(unix_or_dt, axis=1)
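
The same normalization can be tried on single values outside pandas. The sketch below is an illustration, not the notebook's code: the name `normalize_timestamp` and the sample values are hypothetical, and it assumes that any value not matching the date pattern is an epoch value whose first 10 digits are seconds. It uses `datetime.fromtimestamp` with an explicit UTC timezone, since `datetime.utcfromtimestamp` is deprecated as of Python 3.12.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_timestamp(value):
    """Return `value` as a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    text = str(value)
    try:
        # Formatted date string with fractional seconds.
        parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        # Assumed epoch value: the first 10 digits are seconds.
        parsed = datetime.fromtimestamp(int(text[:10]), tz=timezone.utc)
    return parsed.strftime(DATE_FORMAT)


# Hypothetical sample values, one per input style.
print(normalize_timestamp("2018-05-07 04:28:45.970"))
print(normalize_timestamp("1526366895249000000"))
```

Catching only `ValueError` keeps unrelated failures visible instead of silently routing them into the epoch branch.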

In [None]:
#Creating Date Variable User Log Data;
FinalWeb["VisitDateTime"]= pd.to_datetime(FinalWeb["VisitDateTime"])
FinalWeb["Dates"] = FinalWeb["VisitDateTime"].dt.date
FinalWeb["Time"] = FinalWeb["VisitDateTime"].dt.time
FinalWeb.drop('VisitDateTime',axis=1,inplace=True)
FinalWeb.head(5)

In [None]:
#FinalData
WeblogFinal = FinalWeb.copy()
UserFinal  = FinalUser.copy()

The log is filtered to the last 21 days. Because this data set is old, the window is anchored on the latest date present in the log. In a live setting the filter would use the current date (today) instead, so that every extract covers only the last 21 days.
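
A minimal sketch of the window logic with the standard library only. The helper `window_cutoff` and the sample dates are hypothetical. The reference date is the latest date in the data for offline work and would be `date.today()` on live data.

```python
from datetime import date, timedelta


def window_cutoff(reference_date, days):
    """Rows dated strictly after the returned date fall inside the window."""
    return reference_date - timedelta(days=days)


# Hypothetical dates standing in for the log's Dates column.
log_dates = [date(2018, 5, 1), date(2018, 5, 20), date(2018, 5, 27)]

reference_date = max(log_dates)      # offline: latest date in the data
# reference_date = date.today()      # live: the extraction date
cutoff = window_cutoff(reference_date, 21)
recent = [d for d in log_dates if d > cutoff]
print(cutoff, recent)
```

Keeping the cutoff as a `datetime.date` means it compares directly with other `date` values, with no conversion to a pandas timestamp.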

In [None]:
import datetime
#Latest date in the log stands in for "today" because the data is old; use date.today() on live data.
caldate = WeblogFinal["Dates"].max()
TimeSpan = datetime.timedelta(21)
#Keep the cutoff as a datetime.date so it compares directly with the date objects in the Dates column.
L21Days = caldate - TimeSpan

In [None]:
L21DayWeblogFinal = WeblogFinal[WeblogFinal['Dates'] > L21Days]

In [None]:
L21DayWeblogFinal.head()

In [None]:
##Join the Final Data Files using both User and Log Files for Further Processing
FinalData = UserFinal.merge(L21DayWeblogFinal, on='UserID', how='outer')

In [None]:
FinalData.head()

In [None]:
#Renaming Columns
FinalData=FinalData.rename(columns={'Dates_x':'Signup_Date','Time_x':'Signup_Time','Dates_y':'Visit_Date','Time_y':'Visit_Time'})

In [None]:
#Checking NA Values in Final Data.
round((FinalData.isnull().sum()/FinalData.shape[0]*100),2),FinalData.shape

In [None]:
##Dropping NA Records of Visit Date.
FinalData.dropna(subset=['Visit_Date'], inplace=True)

In [None]:
##Converting the core string columns to the same case to avoid quality issues in the created features.
FinalData['Browser'] = FinalData['Browser'].str.upper()
FinalData['Activity'] = FinalData['Activity'].str.upper()
FinalData['OS'] = FinalData['OS'].str.upper()

In [None]:
#Creating last 7 days and last 15 days cutoffs for input feature generation. The latest log date stands in for "today" because the data is old; use date.today() on live data.

caldate = WeblogFinal["Dates"].max()

#Last7Days (kept as datetime.date so it compares directly with Visit_Date)
L7Days = caldate - datetime.timedelta(7)

#Last15Days
L15Days = caldate - datetime.timedelta(15)


In [None]:
#Creating 7 days and 15 Days dataset
FinalData7Days = FinalData[FinalData['Visit_Date']>L7Days]
FinalData15Days = FinalData[FinalData['Visit_Date']>L15Days]

In [None]:
#Feature 1: #No_of_days_Visited_7_Days
#Count the distinct visit dates per user within the last 7 days.
Group7Days = FinalData7Days.groupby("UserID")["Visit_Date"]
Visited_7_Days = Group7Days.nunique().reset_index(name="No_of_days_Visited_7_Days")

In [None]:
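#Feature 2: No_Of_Products_Viewed_15_Days (counts log rows with a ProductID in the last 15 days; repeated views of one product are counted each time)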
Group15Days = FinalData15Days.groupby(["UserID"])
Feat15Days = pd.DataFrame(Group15Days.count().reset_index())
Feat15Days= Feat15Days[["UserID","ProductID"]].rename(columns={"ProductID":"No_Of_Products_Viewed_15_Days"})

In [None]:
#Feature 3: User_Vintage
Vintage = FinalData[['UserID','Signup_Date']]
VintageDay = Vintage.groupby(["UserID"])
VintageDays = pd.DataFrame(VintageDay.min().reset_index())
VintageDays["TodayDate"] = pd.to_datetime(date.today().strftime('%Y-%m-%d')) 
VintageDays['User_Vintage'] = (pd.to_datetime(VintageDays["TodayDate"]) - pd.to_datetime(VintageDays['Signup_Date'])).dt.days
VintageDays.drop(["Signup_Date","TodayDate"],axis=1,inplace=True)

In [None]:
#Feature 4: Most_Viewed_product_15_Days
PageLoad15Data = FinalData15Days[FinalData15Days["Activity"]=="PAGELOAD"].sort_values(by=["Visit_Date","Visit_Time"])
#Count page loads per user and product, keeping the latest view date and time for tie-breaking.
MostViewedProd = PageLoad15Data.groupby(["UserID","ProductID"])
MostViewedProd15 = MostViewedProd.agg(Views=("Activity","count"), Visit_Date=("Visit_Date","last"), Visit_Time=("Visit_Time","last")).reset_index()
#Highest view count first; ties go to the most recently viewed product.
MostViewedProd15 = MostViewedProd15.sort_values(by=["Views","Visit_Date","Visit_Time"], ascending=[False,False,False])
MostViewedProd15A = MostViewedProd15[["UserID","ProductID"]].rename(columns={"ProductID":"Most_Viewed_product_15_Days"})
MostRecentProd15 = MostViewedProd15A.drop_duplicates(subset="UserID", keep='first').copy()
MostRecentProd15["Most_Viewed_product_15_Days"] = MostRecentProd15["Most_Viewed_product_15_Days"].fillna("Product101")

In [None]:
#Feature 5: Most_Active_OS
MostOS = FinalData.groupby(["UserID","OS"])
MostActiveOS1 = pd.DataFrame(MostOS.count().reset_index()).sort_values(by = "webClientID", ascending = False)
MostActiveOS = MostActiveOS1.copy()
MostActiveOS = MostActiveOS[["UserID","OS"]].rename(columns={"OS":"Most_Active_OS"})
MostActiveOS.drop_duplicates(subset ="UserID", keep = 'first', inplace = True)

In [None]:
#Feature 6: Recently_Viewed_Product
AllPageLoad= FinalData[FinalData["Activity"]=="PAGELOAD"]
RecViewProd= AllPageLoad.groupby(["UserID","ProductID",'Visit_Date','Visit_Time'])
RecViewProdAll = pd.DataFrame(RecViewProd.count().reset_index()).sort_values(by = ["Visit_Date" ,"Visit_Time"], ascending = [False,False])
RecViewProdAll = RecViewProdAll[['UserID','ProductID']].rename(columns={"ProductID":"Recently_Viewed_Product"})
RecViewProdAllDays = RecViewProdAll.copy()
RecViewProdAllDays.drop_duplicates(subset ="UserID", keep = 'first', inplace = True)
RecViewProdAllDays["Recently_Viewed_Product"] = RecViewProdAllDays["Recently_Viewed_Product"].fillna("Product101")

In [None]:
#Feature 7: Pageloads_last_7_days
Fin7PageLoad = FinalData7Days[FinalData7Days["Activity"]=="PAGELOAD"]
Fin7PageLoadCnt = Fin7PageLoad.groupby("UserID")
Last7PageLoad = pd.DataFrame(Fin7PageLoadCnt.count().reset_index())
Last7PageLoad = Last7PageLoad[["UserID","Activity"]].rename(columns={"Activity":"Pageloads_last_7_days"})

In [None]:
#Feature 8: Clicks_last_7_days
Fin7Clicked = FinalData7Days[FinalData7Days["Activity"]=="CLICK"]
Fin7ClickedCnt = Fin7Clicked.groupby("UserID")
Last7DaysClicked = pd.DataFrame(Fin7ClickedCnt.count().reset_index())
Last7DaysClicked = Last7DaysClicked[["UserID","Activity"]].rename(columns={"Activity":"Clicks_last_7_days"})

## Final Data Creation with Input Feature

In [None]:
##Creating Final Data Set for newly created input features.
FinalFeature =pd.DataFrame()
FinalFeature['UserID'] = User['UserID']
df1 = FinalFeature.merge(Visited_7_Days, on='UserID', how='left')
df2 = df1.merge(Feat15Days, on='UserID', how='left')
df3 = df2.merge(VintageDays, on='UserID', how='left')
df4 = df3.merge(MostRecentProd15, on='UserID', how='left')
df5 = df4.merge(MostActiveOS, on='UserID', how='left')
df6 = df5.merge(RecViewProdAllDays, on='UserID', how='left')
df7 = df6.merge(Last7PageLoad, on='UserID', how='left')
df8 = df7.merge(Last7DaysClicked, on='UserID', how='left')
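
Left joins from the full user list leave NaN for any user with no rows in a feature table, so a default such as Product101 only takes effect when it is applied to the merged result. A small pandas sketch with made-up frames (the user and product IDs are hypothetical):

```python
import pandas as pd

# Hypothetical user list and per-user feature table; U2 has no page loads.
users = pd.DataFrame({"UserID": ["U1", "U2", "U3"]})
recent = pd.DataFrame({
    "UserID": ["U1", "U3"],
    "Recently_Viewed_Product": ["Pr100", "Pr200"],
})

# The left join keeps every user; U2 gets NaN in the feature column.
merged = users.merge(recent, on="UserID", how="left")

# Apply the default on the merged frame, where the missing users exist.
merged["Recently_Viewed_Product"] = merged["Recently_Viewed_Product"].fillna("Product101")
print(merged)
```

Filling the feature table before the join would not help here, because the users in question have no row in that table to fill.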

In [None]:
df8.sort_values(by="No_of_days_Visited_7_Days",ascending = False, inplace=True)
df8.drop_duplicates(subset ="UserID", keep = 'first', inplace = True)

In [None]:
df8.sort_values(by='UserID', inplace=True)

In [None]:
#df8.to_csv("Marketplace Feature Table.csv", index=False)
#Section commented

In [None]:
df8.head()

The input features have been created; ML model development starts next.
## End of Feature Creation Part