# Logistic Regression

**Odds, odds ratio, and probability**

$$odds(p) = \frac{p}{1-p}$$
$$odds\ ratio = \frac{\frac{X_A}{X_B}}{\frac{Y_A}{Y_B}}$$
$$relative\ risk = \frac{\frac{X_A}{X_A+X_B}}{\frac{Y_A}{Y_A+Y_B}}$$
where X is the treated group, Y is the control group, A is impacted, and B is not impacted.

For example, with odds = 4:
$$probability = \frac{odds}{1+odds} = \frac{4}{1+4} = 0.8$$
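
As a minimal sketch of these quantities, the snippet below converts between odds and probability and computes the odds ratio and relative risk from a 2x2 table. Relative risk is taken here as the risk in the treated group divided by the risk in the control group. The function names and the counts are hypothetical and chosen only for illustration.

```python
def odds(p):
    """Odds for a probability p, with 0 <= p < 1."""
    return p / (1 - p)

def probability(o):
    """Probability for odds o (inverse of odds())."""
    return o / (1 + o)

def odds_ratio(x_a, x_b, y_a, y_b):
    """Odds of impact in the treated group X over odds in the control group Y."""
    return (x_a / x_b) / (y_a / y_b)

def relative_risk(x_a, x_b, y_a, y_b):
    """Risk of impact in the treated group X over risk in the control group Y."""
    return (x_a / (x_a + x_b)) / (y_a / (y_a + y_b))

# Hypothetical counts: treated 20 impacted / 80 not, control 10 impacted / 90 not
x_a, x_b, y_a, y_b = 20, 80, 10, 90

print(probability(4))                    # odds of 4 map back to a probability
print(odds_ratio(x_a, x_b, y_a, y_b))
print(relative_risk(x_a, x_b, y_a, y_b))
```

With rare outcomes the odds ratio and relative risk are close to each other; with common outcomes, as in this toy table, they can differ noticeably.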

**Distribution of logistic regression predictor and outcome variables**

$$Z = logit(P) = log(odds) = log(\frac{P}{1-P}) = \theta^Tx = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
$$e^Z = \frac{P}{1-P}$$
$$P = \frac{e^Z}{1+e^Z} = \frac{1}{1+e^{-Z}}$$
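
The step from $e^Z$ to $P$ is a short rearrangement, spelled out here for completeness:

$$e^Z(1-P) = P$$
$$e^Z = P(1+e^Z)$$
$$P = \frac{e^Z}{1+e^Z}$$

Dividing the numerator and denominator by $e^Z$ then gives $\frac{1}{1+e^{-Z}}$.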

**Sigmoid function (logistic function for binary classification and a neuron activation function)**

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

**Derivative of sigmoid function (this extends to softmax)**

$$\frac{d}{dx}\sigma(x)=\frac{e^{-x}}{(1+e^{-x})^2}$$
or 
$$\sigma'(x) = \sigma(x)(1-\sigma(x)) $$ 
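
The two forms of the derivative can be checked numerically. The sketch below compares both against a central finite difference, assuming $\sigma(x) = \frac{1}{1+e^{-x}}$. The helper names and the test points are illustrative, not part of the original notebook.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid_closed(x):
    """Derivative written as e^{-x} / (1 + e^{-x})^2."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def d_sigmoid_product(x):
    """Derivative written as sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_sigmoid_numeric(x, h=1e-6):
    """Central finite-difference approximation of the derivative."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, d_sigmoid_closed(x), d_sigmoid_product(x), d_sigmoid_numeric(x))
```

The product form is usually preferred in practice because it reuses the already computed value of $\sigma(x)$.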

**Logistic Regression Definition (putting the above concepts together)**

* Hypothesis function $h_{\theta}(x)$
  Logit: $Z = \theta^Tx$
  $$h_{\theta}(x) = \frac{1}{1+e^{-Z}} = \frac{1}{1+e^{-\theta^T x}}$$

* Decision Boundary:
  $$h_{\theta}(x) \geq 0.5  \to y = 1$$
  $$h_{\theta}(x) < 0.5  \to y = 0$$
  or
  $$\theta^T x \geq 0 \to y = 1$$
  $$\theta^T x < 0 \to y = 0$$

* Cost Function (measures how well the hypothesis fits all $m$ data samples)
  $$J(\theta) = \frac{1}{m} \sum^m_{i=1}Cost(h_\theta(x^{(i)}), y^{(i)})$$
  $$J(\theta) = \frac{1}{m} \sum^m_{i=1}(-y^{(i)}log(h_\theta(x^{(i)})) - (1-y^{(i)})log(1-h_\theta(x^{(i)})) )$$
  $$J(\theta) = -\frac{1}{m} \sum^m_{i=1}(y^{(i)}log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)})) )$$
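
The three pieces of the definition (hypothesis, decision boundary, cost) can be sketched in plain Python as follows. This is a minimal illustration, not a training routine: the function names, the `eps` clipping constant, and the toy data are assumptions introduced here, and each sample is assumed to carry a leading constant 1 so that $\theta_0$ acts as the intercept.

```python
import math

def hypothesis(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x)) for a single sample x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    """Decision rule: h >= 0.5, which is equivalent to theta^T x >= 0."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0

def cost(theta, X, y, eps=1e-12):
    """Average cross-entropy J(theta) over the m samples in X, y."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Clip h away from 0 and 1 so that log() stays finite
        h = min(max(hypothesis(theta, x_i), eps), 1.0 - eps)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / len(y)

# Hypothetical toy data: the first feature is the constant 1 for theta_0
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta_zero = [0.0, 0.0]   # uninformative parameters, h = 0.5 everywhere
theta_good = [0.0, 2.0]   # separates the toy data

print(cost(theta_zero, X, y), cost(theta_good, X, y))
print([predict(theta_good, x_i) for x_i in X])
```

With all parameters at zero every prediction is 0.5, so the cost equals $log(2)$; parameters that separate the classes give a lower cost.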

In [7]:
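import numpy as np

def logit(P):
    # Log-odds of a probability P, with 0 < P < 1
    return np.log(P / (1 - P))

def sigmoid(p, x):
    # p is the parameter vector theta, x the feature vector: Z = theta^T x
    Z = p.T @ x
    return 1 / (1 + np.exp(-Z))

def d_sigmoid(p, x):
    # Derivative of the sigmoid with respect to Z, evaluated at Z = theta^T x
    s = sigmoid(p, x)
    return s * (1 - s)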

# Probabilistic Programming and Bayesian DL

* Bayesian Statistics
    * Beta Distribution
    * Binomial likelihood
* Probability Theory
* Probabilistic programming libraries (PyMC3, Stan)
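
As a small illustration of how the Beta distribution and the binomial likelihood fit together, the sketch below applies the standard conjugate update: a Beta(α, β) prior combined with k successes in n binomial trials gives a Beta(α + k, β + n − k) posterior. The function names and the numbers are hypothetical, and no probabilistic library is used.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + (trials - successes)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical example: uniform Beta(1, 1) prior, 7 successes in 20 trials
post_alpha, post_beta = beta_binomial_update(1.0, 1.0, successes=7, trials=20)
posterior_mean = beta_mean(post_alpha, post_beta)

print(post_alpha, post_beta, posterior_mean)
```

The posterior mean sits between the prior mean (0.5) and the observed rate (0.35), and moves toward the observed rate as more trials are added.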

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/qianyu/Desktop/New_query_2023_03_20.csv')
df.head()


Unnamed: 0,visitor,pages_viewed,events_interacted,event_detail_sequence,user_visit_age,visit_start_timestamp,qbo_search_keyword,ivid,bing_search_keyword,url_sequence,...,session_duration,device,search_keyword,last_visit_gap,visit_num,country,timestamp,year,month,day
0,0264d333-12b5-45e0-bf50-9ad450555621,4,2,displayed button chat displayed button chat,3,2023-03-10 21:58:42,,0264d333-12b5-45e0-bf50-9ad450555621,,online,...,106,Mobile,,1.0,4,United States,1679295774272,2023,3,20
1,0774df53-a2dc-498d-9782-ffcce1dfc616,1,1,,0,2023-02-12 11:26:20,,0774df53-a2dc-498d-9782-ffcce1dfc616,,payments,...,1,Mobile,,,3,United States,1679295774272,2023,3,20
2,08142ba2-7df8-4951-a432-945ec501aad9,1,0,,0,2023-02-07 02:40:48,,08142ba2-7df8-4951-a432-945ec501aad9,,ca,...,0,Desktop,,,1,Canada,1679295774272,2023,3,20
3,117df4e7-94d3-4b29-ba2a-121165eb95d4,1,0,,0,2023-03-03 02:08:13,,117df4e7-94d3-4b29-ba2a-121165eb95d4,,accounting,...,0,Desktop,accounting,,1,United States,1679295774272,2023,3,20
4,0f0c0fb8-a486-494c-82b4-87b57371b2f7,15,60,displayed button chat displayed proactive_invi...,7,2023-03-14 20:59:22,,0f0c0fb8-a486-494c-82b4-87b57371b2f7,,learn-support en-us help-article manage-users...,...,1478,Desktop,"quickbooks pro,quickbooks desktop pro,quickboo...",0.0,4,United States,1679295774272,2023,3,20


In [3]:
df[df['visit_num'].isna()].shape

(0, 21)

In [4]:
df[df['country'] == 'United States'].shape

(2757001, 21)

In [13]:
col_names = ["visitor", "ivid", "session_duration", "visit_num", "data_wa_link_sequence", "url_sequence", 
                     "event_detail_sequence", "search_keyword", "bing_search_keyword", "qbo_search_keyword", "device", 
                     "events_interacted", "pages_viewed", "country", "visit_start_timestamp", "user_visit_age", 
                     "last_visit_gap", "timestamp"]

In [27]:
#df = pd.read_csv('/Users/qianyu/Desktop/part-00199-9a6f5ca7-233e-42e0-b7bc-ca66033a2ec3-c000.csv')
df = pd.read_csv('/Users/qianyu/Desktop/part-00000-1b3d1eb8-71bb-41ed-b35c-c5b41216a861-c000.csv', names=col_names)
df.head(50)

Unnamed: 0,visitor,ivid,session_duration,visit_num,data_wa_link_sequence,url_sequence,event_detail_sequence,search_keyword,bing_search_keyword,qbo_search_keyword,device,events_interacted,pages_viewed,country,visit_start_timestamp,user_visit_age,last_visit_gap,timestamp
0,00003eeb-06db-48a0-b0bb-4ab85c863258,00003eeb-06db-48a0-b0bb-4ab85c863258,1176,3,standard qbo-en-ca-link-mileagetracker qbo-en-...,learn-support en-ca ca desktop pro ca ...,clicked Pricing link clicked QuickBooks for bu...,,,,Desktop,50,8,Canada,2023-03-17 01:22:17,1,1.0,1679382477351
1,000179b5-48e4-4b60-9a7f-ea793da464fb,000179b5-48e4-4b60-9a7f-ea793da464fb,0,1,,learn-support en-us help-article back-data loc...,,,,,Desktop,3,1,United States,2023-03-03 04:30:58,0,,1679382477351
2,00025320-80f4-44c3-a9eb-696c12a85101,00025320-80f4-44c3-a9eb-696c12a85101,0,1,,learn-support en-us help-article bank-connecti...,,,,,Desktop,3,1,United States,2023-02-26 01:38:12,0,,1679382477351
3,000350d1-473f-4499-bf26-b081efa5034f,000350d1-473f-4499-bf26-b081efa5034f,0,1,,ca online advanced,,,,,Desktop,0,1,Canada,2023-02-19 13:06:05,0,,1679382477351
4,0007479f-61ea-4be9-8c14-6aba367acb70,0007479f-61ea-4be9-8c14-6aba367acb70,0,1,,,,,,,Desktop,0,1,United States,2023-02-15 21:00:44,0,,1679382477351
5,0007dc20-ea1f-4f59-bb0a-2b04d6f2f9d9,0007dc20-ea1f-4f59-bb0a-2b04d6f2f9d9,0,1,,,,,,,Mobile,0,1,United States,2023-03-17 02:46:43,0,,1679382477351
6,000afdfd-3bfc-4a27-932b-fd1463607fce,000afdfd-3bfc-4a27-932b-fd1463607fce,0,1,,,,,,,Desktop,0,1,United States,2023-02-08 02:37:59,0,,1679382477351
7,000b1b90-a6db-11ed-ab27-ede817b8c4d8,000b1b90-a6db-11ed-ab27-ede817b8c4d8,115,1,,,,,,,Desktop,36,2,United States,2023-02-07 11:31:52,0,,1679382477351
8,000de1ac-575c-447d-907a-3893b36c767c,000de1ac-575c-447d-907a-3893b36c767c,0,1,,oa get-quickbooks,,,,,Mobile,0,1,United States,2023-02-23 16:42:01,0,,1679382477351
9,000f742e-7d07-42ea-ace0-8c24279ec393,000f742e-7d07-42ea-ace0-8c24279ec393,0,1,,ca resources small-business-index,,,,,Mobile,0,1,Canada,2023-03-15 19:34:45,0,,1679382477351


In [28]:
df['visit_num'].value_counts()

1      35890
2       4525
3       1902
4       1077
5        708
       ...  
84         1
270        1
371        1
243        1
277        1
Name: visit_num, Length: 235, dtype: int64

In [11]:
df[df['ivid'] == '00070705-76b4-4565-aa76-b6b5d56af7f8']

Unnamed: 0,visitor,pages_viewed,events_interacted,event_detail_sequence,user_visit_age,visit_start_timestamp,qbo_search_keyword,ivid,bing_search_keyword,url_sequence,...,session_duration,device,search_keyword,last_visit_gap,visit_num,country,timestamp,year,month,day


In [12]:
df.shape

(3168000, 21)

In [3]:
df1 = pd.read_csv('/Users/qianyu/Desktop/payroll_propensity_train.csv')
df1.head()

Unnamed: 0,ivid,visitor_type,num_days_before_signup,total_session_duration,day_of_week_latest_visit,hour_of_day_latest_visit,number_session,number_active_days,combined_text_features_cleaned_processed,payroll_signup
0,000b3349-5216-4d5c-8936-32e56a50c007,repeat,3,9,1,19,5,3,ppc US QBO US GGL Brand Top Search Mobile Rest...,0
1,003a46c7-5414-41d1-8fcc-eede7597c5ca,repeat,20,115,2,11,9,6,ppc US QBO US BNG Brand Core Search Desktop Re...,0
2,00549445-e712-410e-b86d-d0eb4b819936,new,0,38,5,5,1,1,Unknown viewed content page body viewed conten...,0
3,006005d3-1a24-4ba4-b67f-573a48168c8e,new,0,4,2,13,1,1,Unknown viewed screen viewed screen loaded pag...,0
4,006569c3-3fac-4740-94d2-429f38caa70d,new,0,619,1,14,1,1,bing seo Unknown ppc US QBDT US BNG PLA Core F...,0


In [4]:
df2 = pd.read_csv('/Users/qianyu/Desktop/payroll_propensity_validation.csv')
df2.head()

Unnamed: 0,ivid,visitor_type,num_days_before_signup,total_session_duration,day_of_week_latest_visit,hour_of_day_latest_visit,number_session,number_active_days,combined_text_features_cleaned_processed,payroll_signup
0,00575aa3-eb89-4cfc-8a35-ab90912aafd3,new,0,0,6,10,1,1,viewed screen viewed screen,0
1,0079a8d3-b5d0-43f9-9165-9cb1a6686e35,new,0,1,3,6,1,1,Unknown viewed screen viewed screen loaded page,0
2,00a2c3e6-32c7-4eca-a3f4-b24704d9414d,new,0,8,4,11,1,1,Unknown viewed screen viewed screen,0
3,00b1e3cf-5161-476d-b197-c3e25a014700,repeat,2,7,6,19,3,3,Unknown viewed screen viewed screen viewed cha...,0
4,00e758ff-9dc1-4afc-a71f-9e336030e542,new,0,39,1,13,1,1,viewed screen viewed screen loaded page intera...,0


In [5]:
df3 = pd.read_csv('/Users/qianyu/Desktop/payroll_propensity_test.csv')
df3.head()

Unnamed: 0,ivid,visitor_type,num_days_before_signup,total_session_duration,day_of_week_latest_visit,hour_of_day_latest_visit,number_session,number_active_days,combined_text_features_cleaned_processed,payroll_signup
0,001ac6a4-9b69-462e-a46b-439b1439b3b4,new,0,0,7,12,1,1,payments,0
1,008c508f-c3a7-4972-b698-b6f03ff5e966,new,0,0,6,10,1,1,learn support help article bank connectivity r...,0
2,009f6734-96a2-4a5a-9730-e39759b81b5a,new,0,4,2,9,1,1,imasdk googleapis ref displayed button chat,0
3,00ae9973-349e-498e-8d81-877ba8f0738a,new,0,1,2,22,1,1,akamai test,0
4,00b10fc7-4729-47db-bcce-cc924e821705,new,0,25,7,12,1,1,VID US ECO US DV CrossDevice INI ECO TOF ROAS ...,0


In [6]:
df1.shape

(1125871, 10)

In [7]:
df2.shape

(481737, 10)

In [8]:
df3.shape

(497940, 10)

In [9]:
t1 = df1['ivid'].to_list()
len(t1), len(set(t1))

(1125871, 1125871)

In [10]:
t2 = df2['ivid'].to_list()
len(t2), len(set(t2))

(481737, 481737)

In [11]:
t3 = df3['ivid'].to_list()
len(t3), len(set(t3))

(497940, 497940)

In [12]:
set(t2).intersection(set(t3))

set()
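
The check above compares only the validation and test IDs. A minimal sketch for checking every pair of splits at once is shown below; `pairwise_overlap` and the toy IDs are hypothetical names and values used only for illustration.

```python
from itertools import combinations

def pairwise_overlap(splits):
    """Return the number of shared IDs for every pair of named splits."""
    return {
        (a, b): len(set(splits[a]) & set(splits[b]))
        for a, b in combinations(splits, 2)
    }

# Hypothetical toy IDs; "c" deliberately leaks from train into test
splits = {
    "train": ["a", "b", "c"],
    "validation": ["d", "e"],
    "test": ["f", "c"],
}

overlap = pairwise_overlap(splits)
for pair, n_shared in overlap.items():
    print(pair, n_shared)
```

In the notebook this could be called as `pairwise_overlap({"train": t1, "validation": t2, "test": t3})`; any non-zero count would indicate leakage between splits.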

In [13]:
df1['payroll_signup'].value_counts()

0    1046896
1      78975
Name: payroll_signup, dtype: int64

In [14]:
df2['payroll_signup'].value_counts()

0    477932
1      3805
Name: payroll_signup, dtype: int64

In [15]:
df3['payroll_signup'].value_counts()

0    496394
1      1546
Name: payroll_signup, dtype: int64
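
The class balance differs a lot between the three files. The sketch below turns the counts printed above into positive rates; the counts are copied from the `value_counts()` outputs, while the dictionary layout and the helper name are illustrative.

```python
# (negatives, positives) copied from the value_counts() outputs above
counts = {
    "train": (1046896, 78975),
    "validation": (477932, 3805),
    "test": (496394, 1546),
}

def positive_rate(negatives, positives):
    """Share of rows with payroll_signup == 1."""
    return positives / (negatives + positives)

rates = {name: positive_rate(neg, pos) for name, (neg, pos) in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
```

The training file has a positive rate of roughly 7%, while validation and test are both below 1%. Predicted probabilities from a model fit on the training file may therefore need recalibration before being compared against the other two splits.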

In [16]:
df4 = df3.copy()
df4.shape

(497940, 10)

In [19]:
df1 = df1.loc[:, ['payroll_signup', 'num_days_before_signup', 'total_session_duration',
            'day_of_week_latest_visit', 'hour_of_day_latest_visit', 'number_session',
            'number_active_days', 'combined_text_features_cleaned_processed']]

In [21]:
df1[df1.duplicated()].head()

Unnamed: 0,payroll_signup,num_days_before_signup,total_session_duration,day_of_week_latest_visit,hour_of_day_latest_visit,number_session,number_active_days,combined_text_features_cleaned_processed
348,0,0,0,7,14,1,1,Unknown viewed screen viewed screen
360,0,0,2,6,0,1,1,Unknown viewed screen viewed screen loaded page
395,0,0,0,6,9,1,1,Unknown viewed screen viewed screen
494,0,0,3,5,23,1,1,DIS US QBO US TTD CrossDevice INI QBO BOF ROAS...
555,0,0,0,7,13,1,1,google seo viewed screen viewed screen


In [1]:
import pandas as pd

In [2]:
df5 = pd.read_parquet('/Users/qianyu/Desktop/part-00000-025ed404-b21a-4998-9a97-3bbbc712c74f-c000.snappy.parquet')
df5.head()

Unnamed: 0,ivid,visitor,session_duration,visit_num,data_wa_link_sequence,url_sequence,event_detail_sequence,search_keyword,bing_search_keyword,qbo_search_keyword,device,events_interacted,pages_viewed,country,visit_start_timestamp,user_visit_age,last_visit_gap,timestamp
0,00003eeb-06db-48a0-b0bb-4ab85c863258,00003eeb-06db-48a0-b0bb-4ab85c863258,1176,3,For_self-employed_expenses For_small_business_...,ca ca pricing learn-support en-ca ...,clicked See plans & pricing button clicked For...,,,,Desktop,50,8,Canada,2023-03-17 01:22:17,1,1.0,1679517465571
1,000179b5-48e4-4b60-9a7f-ea793da464fb,000179b5-48e4-4b60-9a7f-ea793da464fb,0,1,,learn-support en-us help-article back-data lo...,,,,,Desktop,3,1,United States,2023-03-03 04:30:58,0,,1679517465571
2,00025320-80f4-44c3-a9eb-696c12a85101,00025320-80f4-44c3-a9eb-696c12a85101,0,1,,learn-support en-us help-article bank-connect...,,,,,Desktop,3,1,United States,2023-02-26 01:38:12,0,,1679517465571
3,000350d1-473f-4499-bf26-b081efa5034f,000350d1-473f-4499-bf26-b081efa5034f,0,1,,ca online advanced,,,,,Desktop,0,1,Canada,2023-02-19 13:06:05,0,,1679517465571
4,0007479f-61ea-4be9-8c14-6aba367acb70,0007479f-61ea-4be9-8c14-6aba367acb70,0,1,,,,,,,Desktop,0,1,United States,2023-02-15 21:00:44,0,,1679517465571
