Task 2

Build a Lookalike Model that takes a user's information as input and recommends 3 similar customers based on their profile and transaction history. The model should:

● Use both customer and product information.

● Assign a similarity score to each recommended customer.

Deliverables:

● Give the top 3 lookalikes with their similarity scores for the first 20 customers (CustomerID: C0001 - C0020) in Customers.csv. Form a “Lookalike.csv” which has just one map: Map<cust_id, List<cust_id, score>>

● A Jupyter Notebook/Python script explaining your model development
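
One way to read the "one map" layout in the first deliverable is one row per customer, with the list of (lookalike, score) pairs serialized into a single column. The sketch below assumes that interpretation. The helper name `write_lookalike_map`, the column names, the rounding to 4 decimals, and the sample scores are all hypothetical and not taken from the task.

```python
import csv
import io


def write_lookalike_map(results, handle):
    """Write {cust_id: [(cust_id, score), ...]} as one CSV row per customer.

    Hypothetical helper: column names and rounding are assumptions.
    """
    writer = csv.writer(handle)
    writer.writerow(["CustomerID", "Lookalikes"])
    for cust_id, pairs in results.items():
        # Serialize the whole list into one cell; csv quotes it because it contains commas
        serialized = str([(other_id, round(float(score), 4)) for other_id, score in pairs])
        writer.writerow([cust_id, serialized])


# Made-up scores, for illustration only
example = {"C0001": [("C0002", 0.98), ("C0003", 0.95), ("C0004", 0.91)]}

buffer = io.StringIO()
write_lookalike_map(example, buffer)
csv_text = buffer.getvalue()
```

Passing an open file instead of the `io.StringIO` buffer writes the same rows to disk. A long format with one row per (customer, lookalike) pair is an equally plausible reading of the requirement.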

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Load the data
customers_df = pd.read_csv("Customers.csv")
products_df = pd.read_csv("Products.csv")
transactions_df = pd.read_csv("Transactions.csv")

# Merge data
merged_data = pd.merge(transactions_df, products_df[['ProductID', 'Price']], on='ProductID', how='left')
merged_data = pd.merge(merged_data, customers_df[['CustomerID', 'Region']], on='CustomerID', how='left')

# The first merge yields Price_x (from Transactions.csv) and Price_y (from Products.csv);
# drop Price_x and keep the product price as the single 'Price' column
merged_data.drop(columns=['Price_x'], inplace=True)
merged_data.rename(columns={'Price_y': 'Price'}, inplace=True)

# Feature Engineering
# Aggregate transaction data per customer
customer_profile = merged_data.groupby("CustomerID").agg({
    "TotalValue": "sum",  # Sum of transaction values for each customer
    "Quantity": "sum",    # Total quantity purchased for each customer
    "Price": "mean",      # Average price per customer
}).reset_index()

# Adding customer region info to customer profile
customer_profile = pd.merge(customer_profile, customers_df[['CustomerID', 'Region']], on='CustomerID', how='left')

# Scaling the numerical features
scaler = StandardScaler()
customer_profile[['TotalValue', 'Quantity', 'Price']] = scaler.fit_transform(customer_profile[['TotalValue', 'Quantity', 'Price']])

# Prepare the feature matrix (excluding 'CustomerID' and 'Region')
features = customer_profile[['TotalValue', 'Quantity', 'Price']]

# Fit the Nearest Neighbors model
nbrs = NearestNeighbors(n_neighbors=4, metric='cosine')  # 4 neighbors, including the customer itself
nbrs.fit(features)

# Get the top 3 lookalike customers for the first 20 customers (C0001 - C0020)
lookalike_results = {}

# Select the target IDs explicitly: customers without transactions are absent from
# customer_profile, so its first 20 rows are not guaranteed to be C0001 - C0020
target_ids = [f"C{i:04d}" for i in range(1, 21)]
target_profiles = customer_profile[customer_profile['CustomerID'].isin(target_ids)]

for customer_idx, customer_id in target_profiles['CustomerID'].items():
    # Query with a one-row DataFrame so the feature names match those seen in fit()
    distances, indices = nbrs.kneighbors(features.iloc[[customer_idx]])

    # Prepare the lookalikes and similarity scores
    lookalikes = []
    for i in range(1, 4):  # Get top 3 lookalikes (skip the first one as it's the customer itself)
        similar_customer_id = customer_profile.iloc[indices[0][i]]['CustomerID']
        similarity_score = 1 - distances[0][i]  # Cosine similarity = 1 - cosine distance
        lookalikes.append([similar_customer_id, similarity_score])

    lookalike_results[customer_id] = lookalikes

# Create an empty list to hold all the rows for the final DataFrame
lookalike_rows = []

for customer_id, lookalikes in lookalike_results.items():
    for lookalike in lookalikes:
        lookalike_rows.append({
            "CustomerID": customer_id,
            "LookalikeCustomerID": lookalike[0],
            "SimilarityScore": lookalike[1]
        })

# Convert the list of rows to a DataFrame
lookalike_df = pd.DataFrame(lookalike_rows)

# Save the results to a CSV file
output_path = "Suhani_Ghosh_Lookalike.csv"
lookalike_df.to_csv(output_path, index=False)

print(f"Lookalike recommendations saved to {output_path}")


Lookalike recommendations saved to Suhani_Ghosh_Lookalike.csv
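
The SimilarityScore computed in the cell above is 1 minus the cosine distance, which is the cosine similarity of two standardized feature vectors. The standard-library sketch below illustrates two consequences of that choice: vectors pointing in the same direction score close to 1 regardless of magnitude, and because standardized features can be negative, scores can fall anywhere in [-1, 1]. The function name `cosine_similarity` and the sample vectors are illustrative and not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: similarity is approximately 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Orthogonal vectors: similarity is 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite direction (possible after standardization): similarity is approximately -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because magnitude is ignored, two customers whose standardized spend, quantity, and price are in the same proportions score as near-identical even if one is much further from the mean than the other. A Euclidean metric would be one alternative if absolute differences should matter.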


