# Selling Laptops - Smart Marketing

Overview

You are the owner of a retail website. You're planning on running a promotion on a laptop and you want to send emails out about it. However, you only want to send it to people that may be interested in it so as not to annoy people that aren't interested. You're looking to use your data about who clicked on similar emails in 2020 to help you predict which users may be interested in the promotion.
You can decide what features to consider and how to create your classifier. Your grade will correspond to the accuracy of your predictions. 50% accuracy and below will give a grade of 0%, whereas accuracy of 75% and above will give a grade of 100%; any accuracy between 50 and 75 will be rescaled to a 0-100% grade. Some models can get better than 90% accuracy, so we encourage you to keep improving your model beyond what is necessary for full credit if you have time.

In [None]:
# NOTES:
# Create class UserPredictor and include two methods - fit and predict
# Feature engineering - start off with past_purchase_amt, time spent on website
# Use a standard scaler in the pipeline
# Do cross-validation in fit method and print some stats

## Data Preprocessing

In [382]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [330]:
# Read the files

train_users = pd.read_csv('data/train_users.csv')
train_logs =  pd.read_csv('data/train_logs.csv')
train_y = pd.read_csv('data/train_y.csv')

test_users = pd.read_csv('data/test1_users.csv')
test_logs =  pd.read_csv('data/test1_logs.csv')
test_y = pd.read_csv('data/test1_y.csv')

In [302]:
display(train_users.head())
display(train_logs.head())
display(train_y.head())

Unnamed: 0,user_id,names,age,past_purchase_amt,badge
0,0,Adriana Mcclure,26,39.344704,gold
1,1,Stacy Gilmore,67,15.840151,silver
2,2,Joanna Walsh,50,1099.420085,bronze
3,3,Eduardo Moore,65,5.880239,bronze
4,4,Angela Freeman,88,1312.296847,bronze


Unnamed: 0,user_id,date,url,seconds
0,0,10/27/2021,/keyboard.html,159
1,0,3/15/2021,/blender.html,15
2,0,7/29/2021,/keyboard.html,11
3,0,1/27/2021,/laptop.html,142
4,1,3/1/2021,/keyboard.html,78


Unnamed: 0,user_id,y
0,0,False
1,1,False
2,2,True
3,3,False
4,4,False


## Feature Engineering

In [331]:
# Total and avg time per user

train_data = train_users[['user_id', 'age', 'past_purchase_amt']].merge(train_logs[['user_id', 'seconds']], on='user_id', how='left')
train_data['seconds'] = train_data['seconds'].fillna(0)
train_data = train_data[['user_id', 'age', 'past_purchase_amt']].merge(train_data.groupby('user_id')['seconds'].sum(), on='user_id').drop_duplicates().merge(train_data.groupby('user_id')['seconds'].mean(), on='user_id').drop_duplicates()
train_data = train_data.rename({'seconds_x': 'total_time', 'seconds_y': 'avg_time'}, axis=1)
train_data = train_data.merge(train_y, on='user_id', how='right')

test_data = test_users[['user_id', 'age', 'past_purchase_amt']].merge(test_logs[['user_id', 'seconds']], on='user_id', how='left')
test_data['seconds'] = test_data['seconds'].fillna(0)
test_data = test_data[['user_id', 'age', 'past_purchase_amt']].merge(test_data.groupby('user_id')['seconds'].sum(), on='user_id').drop_duplicates().merge(test_data.groupby('user_id')['seconds'].mean(), on='user_id').drop_duplicates()
test_data = test_data.rename({'seconds': 'total_time', 'seconds_y': 'avg_time'}, axis=1)

In [332]:
train_data.head()

Unnamed: 0,user_id,age,past_purchase_amt,total_time,avg_time,y
0,0,26,39.344704,327.0,81.75,False
1,1,67,15.840151,78.0,78.0,False
2,2,50,1099.420085,432.0,108.0,True
3,3,65,5.880239,0.0,0.0,False
4,4,88,1312.296847,0.0,0.0,False


In [333]:
# url visits per user

train_data['laptop_visits'] = train_logs[train_logs['url']=='/laptop.html'].groupby('user_id')['user_id'].count()
train_data['non_laptop_visits'] = train_logs[~train_logs['url'].eq('/laptop.html')].groupby('user_id')['user_id'].count()
train_data = train_data.fillna(0)
train_data['no_visits'] = np.where(train_data['laptop_visits'] + train_data['non_laptop_visits'] == 0, 1, 0)

train_data.head()

Unnamed: 0,user_id,age,past_purchase_amt,total_time,avg_time,y,laptop_visits,non_laptop_visits,no_visits
0,0,26,39.344704,327.0,81.75,False,1.0,3.0,0
1,1,67,15.840151,78.0,78.0,False,0.0,1.0,0
2,2,50,1099.420085,432.0,108.0,True,0.0,4.0,0
3,3,65,5.880239,0.0,0.0,False,0.0,0.0,1
4,4,88,1312.296847,0.0,0.0,False,0.0,0.0,1


In [334]:
# One hot encoding for badges feature 

badge_dummies = pd.get_dummies(train_users['badge'], dtype=float)
train_users = pd.concat([train_users, badge_dummies], axis=1)
train_data['bronze'] = train_users['bronze']
train_data['silver'] = train_users['bronze']
train_data['gold'] = train_users['gold']
train_data[['bronze','silver','gold']].head()

Unnamed: 0,bronze,silver,gold
0,0.0,0.0,1.0
1,0.0,0.0,0.0
2,1.0,1.0,0.0
3,1.0,1.0,0.0
4,1.0,1.0,0.0


In [335]:
train_data['total_badges'] = train_data['gold'] + train_data['silver'] + train_data['bronze']
train_data[['bronze','silver','gold', 'total_badges']].head()

Unnamed: 0,bronze,silver,gold,total_badges
0,0.0,0.0,1.0,1.0
1,0.0,0.0,0.0,0.0
2,1.0,1.0,0.0,2.0
3,1.0,1.0,0.0,2.0
4,1.0,1.0,0.0,2.0


In [336]:
# Total time per user who visited laptop.html

train_data = train_data.merge(train_logs[train_logs['url'] == '/laptop.html'].groupby('user_id')['seconds'].sum(), on='user_id', how='left').fillna(0).merge(train_logs[train_logs['url'] == '/laptop.html'].groupby('user_id')['seconds'].mean(), on='user_id', how='left').fillna(0).rename({'seconds_x': 'total_time_laptop', 'seconds_y': 'avg_time_laptop'}, axis=1)

In [340]:
train_data[['user_id', 'past_purchase_amt', 'age', 'total_time', 'avg_time', 'total_time_laptop', 'avg_time_laptop', 'total_badges']].head()

Unnamed: 0,user_id,past_purchase_amt,age,total_time,avg_time,total_time_laptop,avg_time_laptop,total_badges
0,0,39.344704,26,327.0,81.75,142.0,142.0,1.0
1,1,15.840151,67,78.0,78.0,0.0,0.0,0.0
2,2,1099.420085,50,432.0,108.0,0.0,0.0,2.0
3,3,5.880239,65,0.0,0.0,0.0,0.0,2.0
4,4,1312.296847,88,0.0,0.0,0.0,0.0,2.0


In [384]:
# Check relationship between features and target variable

correlation_coefficients = train_data.corr()['y']

print(correlation_coefficients)

user_id              0.001949
age                 -0.144062
past_purchase_amt    0.244942
total_time           0.556370
avg_time             0.276055
y                    1.000000
laptop_visits        0.367834
non_laptop_visits    0.489705
no_visits           -0.300206
bronze              -0.141115
silver              -0.141115
gold                 0.166348
total_badges        -0.085288
total_time_laptop    0.342595
avg_time_laptop      0.311482
Name: y, dtype: float64


## Model Evaluation

In [376]:
# Accuracy score of model from running model on new data

!python3 tester.py

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
We'll use this to grade you:
    python3 tester.py main test2

test2_users.csv and test2_log.csv are secret, so you can use this to estimate your accuracy:
    python3 tester.py main test1

Grading:
    Max Seconds: 60
    Accuracy <50: grade=0%
    Accuracy >75: grade=100%

Fitting+Predicting...
Features: ['past_purchase_amt', 'age', 'total_time', 'avg_time', 'total_time_laptop', 'avg_time_laptop', 'total_badges']
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/mod

In [374]:
# Model Choice

model_comparison = {
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [82.64333333333333, 81.81, 83.87],
    'Latency': [2.754293203353882, 6.11757230758667, 53.65154814720154]
}

df = pd.DataFrame(model_comparison)
display(df)

Unnamed: 0,Model,Accuracy,Latency
0,Logistic Regression,82.643333,2.754293
1,Decision Tree,81.81,6.117572
2,Random Forest,83.87,53.651548


### I selected the Logistic Regression model because it offers a high accuracy with a low latency, making it suitable for real-time applications like email marketing. Although the Random Forest model is more accurate, its high latency would cause delays in sending out emails, which would negatively impact the campaign's effectiveness.