# **Lookalike Model**

### **Step 1: Data Preparation**

1) Load the required datasets.

2) Merge the datasets into a single dataframe that contains:

- Customer profile information (Region, etc.)
- Transaction details (TotalValue, Quantity, etc.)
- Product details (Category, Price, etc.)

In [2]:
import pandas as pd

# Load datasets
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

# Convert dates to datetime
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

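# Merge datasets (inner joins on CustomerID, then ProductID)
# Both transactions and products have a 'Price' column, so the result
# carries it twice, as Price_x (transactions) and Price_y (products)
merged = transactions.merge(customers, on='CustomerID').merge(products, on='ProductID')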

# Display merged dataset
print(merged.head())

  TransactionID CustomerID ProductID     TransactionDate  Quantity  \
0        T00001      C0199      P067 2024-08-25 12:38:23         1   
1        T00112      C0146      P067 2024-05-27 22:23:54         1   
2        T00166      C0127      P067 2024-04-25 07:38:55         1   
3        T00272      C0087      P067 2024-03-26 22:55:37         2   
4        T00363      C0070      P067 2024-03-21 15:10:10         3   

   TotalValue  Price_x     CustomerName         Region SignupDate  \
0      300.68   300.68   Andrea Jenkins         Europe 2022-12-03   
1      300.68   300.68  Brittany Harvey           Asia 2024-09-04   
2      300.68   300.68  Kathryn Stevens         Europe 2024-04-04   
3      601.36   300.68  Travis Campbell  South America 2024-04-11   
4      902.04   300.68    Timothy Perez         Europe 2022-03-15   

                       ProductName     Category  Price_y  
0  ComfortLiving Bluetooth Speaker  Electronics   300.68  
1  ComfortLiving Bluetooth Speaker  Electronic
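
The `Price_x` and `Price_y` columns in the output above come from both `transactions` and `products` carrying a `Price` column. A minimal sketch of one way to avoid the duplicate, using toy stand-in frames rather than the real CSV files (all values below are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for Transactions.csv and Products.csv (illustrative values only)
transactions = pd.DataFrame({
    'TransactionID': ['T1', 'T2', 'T3'],
    'ProductID': ['P1', 'P1', 'P2'],
    'Quantity': [1, 2, 1],
    'Price': [10.0, 10.0, 25.0],
})
products = pd.DataFrame({
    'ProductID': ['P1', 'P2'],
    'Category': ['Books', 'Electronics'],
    'Price': [10.0, 25.0],
})

# Merging as-is: both frames have 'Price', so pandas appends _x/_y suffixes
naive = transactions.merge(products, on='ProductID')
print(sorted(c for c in naive.columns if c.startswith('Price')))

# Dropping the duplicate column first keeps a single 'Price'.
# validate='many_to_one' raises MergeError if ProductID is not unique in products.
clean = transactions.merge(products.drop(columns='Price'), on='ProductID',
                           validate='many_to_one')
print(list(clean.columns))
```

Whether the two price columns always agree in the real data is not something the notebook checks, so comparing them before dropping one would be a sensible first step.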

### **Step 2: Creating Features**


In [3]:
# Aggregate customer-level transaction data
customer_features = merged.groupby('CustomerID').agg({
    'TotalValue': ['sum', 'mean'],  # Total and average spending
    'TransactionID': 'count',      # Purchase frequency
    'Category': lambda x: x.value_counts().index[0],  # Top category
    'Region': 'first',             # Region
    'SignupDate': 'first'          # Signup date
}).reset_index()

# Rename columns
customer_features.columns = ['CustomerID', 'TotalSpending', 'AvgTransactionValue',
                             'TransactionCount', 'TopCategory', 'Region', 'SignupDate']

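# Encode categorical features as integer codes
# Note: the codes are arbitrary labels (assigned in sorted order), not magnitudes
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
customer_features['TopCategory'] = encoder.fit_transform(customer_features['TopCategory'])
# The encoder is refit here, so its classes_ now describe Region only
customer_features['Region'] = encoder.fit_transform(customer_features['Region'])
# Convert the signup date to a proleptic Gregorian ordinal (an integer day count)
customer_features['SignupDate'] = customer_features['SignupDate'].apply(lambda x: x.toordinal())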

print(customer_features.head())


  CustomerID  TotalSpending  AvgTransactionValue  TransactionCount  \
0      C0001        3354.52              670.904                 5   
1      C0002        1862.74              465.685                 4   
2      C0003        2725.38              681.345                 4   
3      C0004        5354.88              669.360                 8   
4      C0005        2034.24              678.080                 3   

   TopCategory  Region  SignupDate  
0            2       3      738346  
1            3       0      738199  
2            3       3      738952  
3            0       3      738437  
4            2       0      738382  
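
`LabelEncoder` assigns each category an arbitrary integer, and a similarity measure will treat those integers as magnitudes (so code 3 looks "further" from code 0 than code 1 does). One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on a toy frame; the customer IDs and category values are made up for illustration and this is not the encoding the notebook uses:

```python
import pandas as pd

# Toy customer-level features (illustrative values only)
features = pd.DataFrame({
    'CustomerID': ['C1', 'C2', 'C3'],
    'TopCategory': ['Books', 'Electronics', 'Books'],
    'Region': ['Asia', 'Europe', 'Asia'],
})

# One 0/1 indicator column per distinct category or region value
encoded = pd.get_dummies(features, columns=['TopCategory', 'Region'], dtype=int)
print(encoded)
```

With indicator columns, two customers either share a category or they do not, and no artificial ordering between categories is introduced.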


### **Step 3: Building Cosine Similarity Based Lookalike Model**


In [4]:
from sklearn.metrics.pairwise import cosine_similarity

# Drop the 'CustomerID' column to focus on numerical features
feature_matrix = customer_features.drop('CustomerID', axis=1)

# Calculate pairwise cosine similarity
similarity_matrix = cosine_similarity(feature_matrix)

# Create a DataFrame to store similarity scores
similarity_df = pd.DataFrame(similarity_matrix, index=customer_features['CustomerID'],
                             columns=customer_features['CustomerID'])

# Function to find the top 3 most similar customers
def find_top_3_similar(customer_id, similarity_df):
    # Sort the similarity scores for the given customer (excluding self-comparison)
    similar_customers = similarity_df[customer_id].sort_values(ascending=False).drop(customer_id).head(3)
    return [(cust_id, score) for cust_id, score in similar_customers.items()]

# Generate recommendations for the first 20 customers (C0001-C0020)
# Iterate over customer_features rather than customers, so a customer with no
# transactions (and therefore no column in similarity_df) cannot raise a KeyError
lookalikes = {}
for customer_id in customer_features['CustomerID'][:20]:
    lookalikes[customer_id] = find_top_3_similar(customer_id, similarity_df)

# Convert to Lookalike.csv format
lookalike_csv = pd.DataFrame.from_dict(lookalikes, orient='index', columns=['Cust1_Score', 'Cust2_Score', 'Cust3_Score'])
lookalike_csv.to_csv('Lookalike.csv')

print("Lookalike Recommendations:")
print(lookalike_csv.head())


Lookalike Recommendations:
                       Cust1_Score                  Cust2_Score  \
C0001  (C0137, 0.9999999994072495)  (C0152, 0.9999999990396297)   
C0002  (C0029, 0.9999999942096599)  (C0157, 0.9999999923028083)   
C0003  (C0178, 0.9999999997380015)  (C0086, 0.9999999836216747)   
C0004  (C0021, 0.9999999962197693)  (C0155, 0.9999999906242747)   
C0005  (C0073, 0.9999999992913566)  (C0159, 0.9999999985125421)   

                       Cust3_Score  
C0001  (C0181, 0.9999999881388147)  
C0002  (C0199, 0.9999999869357591)  
C0003  (C0035, 0.9999999813368393)  
C0004  (C0093, 0.9999999900798923)  
C0005  (C0112, 0.9999999943170265)  
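
Every score in the output above is within about 1e-7 of 1. A likely reason is that the features are not scaled: `SignupDate` ordinals are around 738,000, while spending is in the thousands and the other columns are single digits, so the date component dominates every vector and all customers point in nearly the same direction. The sketch below illustrates the effect on toy data and shows one possible remedy, standardizing each column with `StandardScaler` before computing cosine similarity. The three rows are made-up values, and scaling is a suggested variation rather than the notebook's method:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Columns: TotalSpending, TransactionCount, SignupDate (ordinal). Toy values only.
X = np.array([
    [ 500.0,  2, 738346],
    [5000.0, 10, 738199],
    [ 520.0,  2, 738952],
])

# Unscaled: the large SignupDate ordinal dominates, so all pairs score close to 1
raw = cosine_similarity(X)

# Standardized: each column has zero mean and unit variance before comparison
scaled = cosine_similarity(StandardScaler().fit_transform(X))

print("raw similarity, customer 0 vs 1:   ", raw[0, 1])
print("scaled similarity, customer 0 vs 1:", scaled[0, 1])
```

In this toy example, customers 0 and 1 differ tenfold in spending yet look almost identical before scaling; after scaling, their similarity drops sharply.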
