# Lookalike Model Development
# Introduction
The goal is to build a Lookalike Model that recommends 3 similar customers for each customer based on their profile and transaction history. The model uses both customer and product information and assigns a similarity score to each recommended customer. We will focus on the first 20 customers (CustomerID: C0001 - C0020).

# Approach
Data Preparation:

- Merge Datasets: Combine the Customers, Transactions, and Products datasets to get a complete view of each customer's profile and transaction history.
- Feature Engineering: Create features that represent customer profiles and behaviors, including demographic information and transaction patterns.

Feature Representation:

- Customer Profile: Encode the categorical Region variable using one-hot encoding. Product Category enters the model through the per-category aggregates below.
- Transaction History: Aggregate transaction data to capture purchasing behavior, such as total spend per category, total quantity per category, number of unique products, and number of transactions.

Similarity Computation:

- Vectorization: Represent each customer as a feature vector combining their profile and transaction features.
- Normalization: Scale each feature to the [0, 1] range (min-max scaling) so that no single feature dominates the similarity measure.
- Similarity Metric: Use cosine similarity to compute the similarity between customers.

Generating Recommendations:

- For each target customer (C0001 - C0020), find the top 3 most similar customers based on the similarity scores, excluding the customer itself.
- Assign Similarity Scores: Provide the similarity score for each recommended customer.

Output Preparation:

- Create a Lookalike.csv file with the required format: Map<cust_id, List<cust_id, score>>.
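
For reference, the two formulas behind the Normalization and Similarity Metric steps can be written as follows. The symbols are introduced here for illustration only: $x$ is one raw feature value, $x_{\min}$ and $x_{\max}$ are that feature's minimum and maximum over all customers, and $a$, $b$ are the scaled feature vectors of two customers.

$$
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},
\qquad
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

Because every scaled feature is non-negative, the dot product is non-negative, so the similarity scores fall between 0 and 1.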

# Implementation
Below is the step-by-step implementation using Python.

# 1. Import Necessary Libraries

In [15]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
import json
import warnings
warnings.filterwarnings('ignore')

# 2. Load the Datasets

In [16]:
# Load the datasets
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

# Convert dates to datetime
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# 3. Data Preparation and Merging

## 3.1 Merge Transactions with Products

In [17]:
# Merge Transactions with Products on 'ProductID'
transactions_products = pd.merge(transactions, products, on='ProductID', how='left')

## 3.2 Merge with Customers

In [18]:
# Merge the above with Customers on 'CustomerID'
full_data = pd.merge(transactions_products, customers, on='CustomerID', how='left')
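
A left merge silently keeps rows that have no match, so it can be worth checking the join before building features on it. The sketch below is not part of the original notebook; it uses small hypothetical frames (`toy_transactions`, `toy_products`) to show the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames standing in for Transactions.csv and Products.csv.
toy_transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "ProductID": ["P1", "P2", "P9"],  # P9 has no match in toy_products
})
toy_products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Electronics"],
})

# validate="many_to_one" raises MergeError if ProductID is duplicated on the
# right side; indicator=True adds a "_merge" column marking unmatched rows.
merged = pd.merge(
    toy_transactions,
    toy_products,
    on="ProductID",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), "rows after merge;", len(unmatched), "without a product match")
```

If `unmatched` is non-empty on the real data, those transactions would carry a missing Category and drop out of the per-category aggregates.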

# 4. Feature Engineering


## 4.1 Aggregate Transaction Data
For each customer, we aggregate transaction data to obtain features like:

- Total spend per category.
- Total quantity purchased per category.
- Number of unique products purchased.
- Total transactions.

In [19]:
# Total spend per category
spend_per_category = full_data.groupby(['CustomerID', 'Category'])['TotalValue'].sum().unstack(fill_value=0)

# Total quantity per category
quantity_per_category = full_data.groupby(['CustomerID', 'Category'])['Quantity'].sum().unstack(fill_value=0)

# Number of unique products purchased
unique_products = full_data.groupby('CustomerID')['ProductID'].nunique().rename('UniqueProducts')

# Total transactions
total_transactions = full_data.groupby('CustomerID')['TransactionID'].nunique().rename('TotalTransactions')

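# Combine all transaction features.
# Spend and quantity tables share the same category column names, so prefix
# them to keep every column label unique after concatenation.
transaction_features = pd.concat([
    spend_per_category.add_prefix('Spend_'),
    quantity_per_category.add_prefix('Quantity_'),
    unique_products,
    total_transactions,
], axis=1)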
transaction_features.fillna(0, inplace=True)
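
To make the shape of the per-category tables concrete, here is a minimal sketch on hypothetical toy data (the frame `toy` and its values are invented for illustration). It shows how `groupby(...).sum().unstack(fill_value=0)` turns long transaction rows into one row per customer and one column per category.

```python
import pandas as pd

# Hypothetical toy transactions (not taken from the real dataset).
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "Category": ["Books", "Home", "Books"],
    "TotalValue": [10.0, 5.0, 7.5],
})

# Sum spend per (customer, category), then pivot categories into columns.
# fill_value=0 records "no purchases" instead of NaN.
wide = (
    toy.groupby(["CustomerID", "Category"])["TotalValue"]
    .sum()
    .unstack(fill_value=0)
)
print(wide)
```

Customer C2 never bought from the Home category, so that cell is filled with 0 rather than left missing.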

## 4.2 Encode Customer Profile Information
Encode the categorical Region variable with one-hot encoding.

In [20]:
# OneHotEncoder is already imported in step 1.
# 'sparse_output' replaced 'sparse' in scikit-learn 1.2; older versions need sparse=False.
region_encoder = OneHotEncoder(sparse_output=False)
region_encoded = region_encoder.fit_transform(customers[['Region']])

region_encoded_df = pd.DataFrame(
    region_encoded,
    columns=region_encoder.get_feature_names_out(['Region']),
    index=customers['CustomerID']
)

# Prepare customer features
customer_features = region_encoded_df
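
As a quick illustration of what one-hot encoding produces, the sketch below uses `pd.get_dummies` on a hypothetical three-customer frame. This is an alternative to the `OneHotEncoder` call above, shown only because its output is easy to read; the region names are placeholders.

```python
import pandas as pd

# Hypothetical toy customers; the region names are placeholders.
toy_customers = pd.DataFrame(
    {"Region": ["Asia", "Europe", "Asia"]},
    index=pd.Index(["C1", "C2", "C3"], name="CustomerID"),
)

# One indicator column per distinct region, prefixed with "Region_".
region_dummies = pd.get_dummies(toy_customers["Region"], prefix="Region", dtype=int)
print(region_dummies)
```

Each row contains exactly one 1, in the column matching that customer's region.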


## 4.3 Combine Customer Profile and Transaction Features

In [21]:
# Merge customer features with transaction features
customer_data = customer_features.join(transaction_features, how='left')
customer_data.fillna(0, inplace=True)

# Reset index to have CustomerID as a column
customer_data.reset_index(inplace=True)

# 5. Feature Scaling
Scale every feature to the [0, 1] range with MinMaxScaler so that large-valued features, such as total spend, do not dominate the similarity computation.

In [22]:
# Identify feature columns
feature_cols = customer_data.columns.drop('CustomerID')

# Initialize scaler
scaler = MinMaxScaler()

# Fit and transform features
customer_data_scaled = customer_data.copy()
customer_data_scaled[feature_cols] = scaler.fit_transform(customer_data[feature_cols])

# Set CustomerID as index
customer_data_scaled.set_index('CustomerID', inplace=True)
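
The effect of min-max scaling can be seen in a few lines of plain Python. This is a minimal sketch with hypothetical values, not the scikit-learn implementation; the helper name `min_max_scale` and its handling of a constant column are choices made for this example.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


total_spend = [100.0, 2500.0, 1300.0]   # hypothetical values
total_transactions = [1, 9, 5]          # hypothetical values

print(min_max_scale(total_spend))
print(min_max_scale(total_transactions))
```

Both columns end up on the same scale even though the raw spend values are hundreds of times larger than the transaction counts.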

# 6. Similarity Computation
Compute cosine similarity between customers.

In [23]:
# Compute cosine similarity matrix
similarity_matrix = cosine_similarity(customer_data_scaled)

# Convert to DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=customer_data_scaled.index, columns=customer_data_scaled.index)
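
Cosine similarity compares the direction of two feature vectors and ignores their length. The following sketch, written in plain Python with hypothetical vectors, illustrates that property; the function `cosine` and its zero-vector fallback are assumptions of this example rather than part of the notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


light_buyer = [0.1, 0.2, 0.0]   # hypothetical scaled features
heavy_buyer = [0.5, 1.0, 0.0]   # same proportions, five times larger
other_buyer = [0.0, 0.0, 1.0]   # active in a different feature only

print(round(cosine(light_buyer, heavy_buyer), 4))
print(round(cosine(light_buyer, other_buyer), 4))
```

A light and a heavy buyer with the same purchasing mix score as near-identical, which is worth keeping in mind when reading scores close to 1 in the results.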

# 7. Generating Lookalike Recommendations
For each of the first 20 customers, find the top 3 most similar customers (excluding themselves).

In [24]:
# Get the first 20 customers by ID (sorted, so the result does not depend on file order)
first_20_customers = customers['CustomerID'].sort_values().head(20).tolist()

# Initialize dictionary to store recommendations
lookalike_dict = {}

for cust_id in first_20_customers:
    # Get similarity scores for the customer
    sim_scores = similarity_df.loc[cust_id]
    
    # Exclude self-similarity
    sim_scores = sim_scores.drop(index=cust_id)
    
    # Get top 3 similar customers
    top_3_customers = sim_scores.nlargest(3)
    
    # Store in dictionary
    lookalike_dict[cust_id] = list(zip(top_3_customers.index, top_3_customers.values))
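
The selection step in the loop above amounts to "drop the customer itself, then keep the k largest scores". A minimal standard-library sketch of the same idea follows; the function `top_k_lookalikes` and the sample scores are hypothetical.

```python
import heapq


def top_k_lookalikes(scores, target, k=3):
    """Return the k highest-scoring (customer, score) pairs, excluding the target."""
    candidates = ((cust, s) for cust, s in scores.items() if cust != target)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Hypothetical similarity row for customer "C1".
row = {"C1": 1.0, "C2": 0.91, "C3": 0.42, "C4": 0.97, "C5": 0.88}
print(top_k_lookalikes(row, "C1"))
```

Excluding the target first matters because a customer's similarity to itself is 1 and would otherwise take one of the three slots.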

# 8. Prepare Output File

Create Lookalike.csv with the map as specified.

In [25]:
# Convert the lookalike dictionary to a list of dictionaries
lookalike_list = []
for cust_id, recommendations in lookalike_dict.items():
    # Cast NumPy floats to plain floats so the CSV stores bare numbers
    recs = [{'cust_id': rec[0], 'score': float(rec[1])} for rec in recommendations]
    lookalike_list.append({'cust_id': cust_id, 'recommendations': recs})

# Convert to DataFrame
lookalike_df = pd.DataFrame(lookalike_list)

# Save to CSV
lookalike_df.to_csv('Lookalike.csv', index=False)
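
The cell above writes the recommendations column as the text form of a Python list. If the file needs to be parsed back reliably, one alternative is to store that column as a JSON string. The sketch below is an assumption-labeled variant, not the notebook's method: it uses hypothetical data and an in-memory buffer in place of the real file.

```python
import csv
import io
import json

# Hypothetical recommendations in the same shape as lookalike_dict.
toy_lookalikes = {"C1": [("C4", 0.97), ("C2", 0.91), ("C5", 0.88)]}

buffer = io.StringIO()  # stands in for open('Lookalike.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(["cust_id", "recommendations"])
for cust_id, recs in toy_lookalikes.items():
    payload = [{"cust_id": c, "score": round(float(s), 4)} for c, s in recs]
    writer.writerow([cust_id, json.dumps(payload)])

# Read the file back and decode the JSON column.
buffer.seek(0)
restored = {
    row["cust_id"]: json.loads(row["recommendations"])
    for row in csv.DictReader(buffer)
}
print(restored["C1"][0])
```

The round trip recovers the same customer IDs and scores, which gives a simple check that the output file matches the Map<cust_id, List<cust_id, score>> structure.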

# 9. Display Top 3 Lookalikes for First 20 Customers

In [26]:
for item in lookalike_list:
    cust_id = item['cust_id']
    print(f"CustomerID: {cust_id}")
    print("Top 3 Lookalike Customers with Similarity Scores:")
    for rec in item['recommendations']:
        print(f" - CustomerID: {rec['cust_id']}, Similarity Score: {rec['score']:.4f}")
    print("\n")

CustomerID: C0001
Top 3 Lookalike Customers with Similarity Scores:
 - CustomerID: C0120, Similarity Score: 0.9619
 - CustomerID: C0181, Similarity Score: 0.9616
 - CustomerID: C0091, Similarity Score: 0.9590


CustomerID: C0002
Top 3 Lookalike Customers with Similarity Scores:
 - CustomerID: C0159, Similarity Score: 0.9915
 - CustomerID: C0178, Similarity Score: 0.9844
 - CustomerID: C0134, Similarity Score: 0.9732


CustomerID: C0003
Top 3 Lookalike Customers with Similarity Scores:
 - CustomerID: C0031, Similarity Score: 0.9806
 - CustomerID: C0152, Similarity Score: 0.9623
 - CustomerID: C0085, Similarity Score: 0.9623


CustomerID: C0004
Top 3 Lookalike Customers with Similarity Scores:
 - CustomerID: C0113, Similarity Score: 0.9720
 - CustomerID: C0012, Similarity Score: 0.9538
 - CustomerID: C0148, Similarity Score: 0.9483


CustomerID: C0005
Top 3 Lookalike Customers with Similarity Scores:
 - CustomerID: C0007, Similarity Score: 0.9947
 - CustomerID: C0140, Similarity Score: 0

### Results
Here are the top 3 lookalike customers with their similarity scores for the first 20 customers (C0001 - C0020):