
# **Building a Lookalike model**

## Importing required libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

## Loading dataset

In [2]:
customers_df = pd.read_csv("Customers.csv")
products_df = pd.read_csv("Products.csv")
transactions_df = pd.read_csv("Transactions.csv")

## Aggregating and Merging data
- This step engineers per-customer features. Data cleaning is skipped because the EDA already showed the data to be clean.

In [3]:
transactions_agg = transactions_df.groupby("CustomerID").agg(
    totalSpend = ("TotalValue", "sum"),
    totalTransactions = ("TransactionID", "count"),
    avgTransactionValue = ("TotalValue", "mean"),
    uniqueProducts = ("ProductID", "nunique")
).reset_index()

customer_profiles = customers_df.merge(transactions_agg, on='CustomerID')
customer_profiles.dropna(inplace=True)
customer_profiles.head()

| | CustomerID | CustomerName | Region | SignupDate | totalSpend | totalTransactions | avgTransactionValue | uniqueProducts |
|---|---|---|---|---|---|---|---|---|
| 0 | C0001 | Lawrence Carroll | South America | 2022-07-10 | 3354.52 | 5 | 670.904 | 5 |
| 1 | C0002 | Elizabeth Lutz | Asia | 2022-02-13 | 1862.74 | 4 | 465.685 | 4 |
| 2 | C0003 | Michael Rivera | South America | 2024-03-07 | 2725.38 | 4 | 681.345 | 4 |
| 3 | C0004 | Kathleen Rodriguez | South America | 2022-10-09 | 5354.88 | 8 | 669.36 | 8 |
| 4 | C0005 | Laura Weber | Asia | 2022-08-15 | 2034.24 | 3 | 678.08 | 3 |
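
To make the aggregation and merge concrete, here is a minimal sketch on made-up data (the three customers and four transactions below are invented for illustration and are not taken from the dataset). It also shows that `merge` performs an inner join by default, so a customer with no transactions would not appear in the profile table.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_customers = pd.DataFrame({
    "CustomerID": ["A", "B", "C"],
    "Region": ["Asia", "Europe", "Asia"],
})
toy_transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3", "T4"],
    "CustomerID": ["A", "A", "B", "A"],
    "ProductID": ["P1", "P2", "P1", "P1"],
    "TotalValue": [100.0, 50.0, 30.0, 150.0],
})

# Same named aggregations as in the notebook cell above.
toy_agg = toy_transactions.groupby("CustomerID").agg(
    totalSpend=("TotalValue", "sum"),
    totalTransactions=("TransactionID", "count"),
    avgTransactionValue=("TotalValue", "mean"),
    uniqueProducts=("ProductID", "nunique"),
).reset_index()

# Default how="inner": customer "C" has no transactions and is dropped.
toy_profiles = toy_customers.merge(toy_agg, on="CustomerID")
print(toy_profiles)
```

If customers without purchases should be kept, passing `how="left"` to `merge` and filling the missing aggregates is one option; whether that applies here depends on the data.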


## Feature Scaling and Computing similarities
- Standardize the numeric features (zero mean, unit variance) so that no single feature dominates the similarity computation.
- Use cosine similarity as the metric for finding similar customers.

In [4]:
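# Select the four numeric features by name (the same columns as iloc[:, 4:])
feature_cols = ["totalSpend", "totalTransactions", "avgTransactionValue", "uniqueProducts"]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_profiles[feature_cols])

# Pairwise cosine similarity between all customers (n_customers x n_customers)
similarity_matrix = cosine_similarity(scaled_features)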
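
The effect of standardizing before taking cosine similarity can be checked on a small example. The sketch below uses an invented 4x2 feature matrix (not the real customer data): after `StandardScaler`, each column has mean 0 and standard deviation 1, and because the features are centered, customers on opposite sides of the average receive negative similarity scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Toy feature matrix (rows = customers, columns = features), invented for illustration.
toy_features = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 6.0],
    [400.0, 7.0],
])

toy_scaled = StandardScaler().fit_transform(toy_features)
toy_sim = cosine_similarity(toy_scaled)

print(toy_scaled.mean(axis=0))  # per-column means, approximately 0
print(toy_scaled.std(axis=0))   # per-column standard deviations, approximately 1
print(np.round(toy_sim, 3))     # symmetric matrix with 1.0 on the diagonal
```

In this toy matrix the first and last rows are mirror images around the column means, so their similarity is -1 even though both are valid customers; scores should therefore be read as ranging over [-1, 1], not [0, 1].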

- Find the three most similar customers for each of the first 20 customers

In [5]:
lookalikes = {}
customer_ids = customer_profiles['CustomerID'].values

for i, cust_id in enumerate(customer_ids[:20]):  # First 20 customers
    similarity_scores = list(enumerate(similarity_matrix[i]))

    # Skip the customer itself (j != i) and cast NumPy scores to plain floats,
    # so the string form of the list shows bare numbers on NumPy 2 as well
    similar_customers = sorted(
        [(customer_ids[j], float(score)) for j, score in similarity_scores if j != i],
        key=lambda x: x[1], reverse=True
    )[:3]  # Top 3 similar customers

    lookalikes[cust_id] = similar_customers
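
The same top-3 lookup can also be written with NumPy sorting instead of a Python-level sort. The sketch below is one possible alternative, not the notebook's method: `top_k_similar` is a hypothetical helper name, and the 4x4 similarity matrix is invented for illustration.

```python
import numpy as np

def top_k_similar(similarity_matrix, ids, row, k=3):
    """Return the k (id, score) pairs most similar to ids[row], excluding itself."""
    scores = np.asarray(similarity_matrix[row], dtype=float).copy()
    scores[row] = -np.inf                # exclude the customer itself
    top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return [(ids[j], float(scores[j])) for j in top]

# Toy similarity matrix, invented for illustration.
toy_ids = ["A", "B", "C", "D"]
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.8],
    [0.5, 0.4, 0.8, 1.0],
])

result = top_k_similar(toy_sim, toy_ids, row=0, k=2)
print(result)
```

When two candidates have exactly the same score, this version may order them differently from the `sorted` call above, which keeps tied items in their original order.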

- Save the results to a CSV file

In [6]:
lookalike_df = pd.DataFrame({
    'CustomerID': list(lookalikes.keys()),
    'Lookalikes': [str(similar) for similar in lookalikes.values()]
})

lookalike_df.to_csv("Yasar_Sultan_Lookalikes.csv", index=False)

In [7]:
lookalike_df.head()

| | CustomerID | Lookalikes |
|---|---|---|
| 0 | C0001 | [('C0137', 0.9962112629754635), ('C0152', 0.98... |
| 1 | C0002 | [('C0029', 0.9995348566666874), ('C0199', 0.99... |
| 2 | C0003 | [('C0178', 0.9996894278379866), ('C0005', 0.99... |
| 3 | C0004 | [('C0021', 0.9997854801171627), ('C0075', 0.99... |
| 4 | C0005 | [('C0073', 0.9996669866239412), ('C0063', 0.99... |
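
Storing each lookalike list as a string means the CSV has to be parsed again before the scores can be used. A possible alternative, sketched here under the assumption that a "long" layout is acceptable for the deliverable, writes one row per (customer, lookalike) pair. The column names `LookalikeID` and `Score` and the toy mapping are invented for illustration.

```python
import io
import pandas as pd

# Toy mapping in the same shape as `lookalikes`, invented for illustration.
toy_lookalikes = {
    "A": [("B", 0.9), ("D", 0.5)],
    "B": [("A", 0.9), ("D", 0.4)],
}

# One row per (customer, lookalike) pair.
rows = [
    {"CustomerID": cust_id, "LookalikeID": other_id, "Score": score}
    for cust_id, similar in toy_lookalikes.items()
    for other_id, score in similar
]
long_df = pd.DataFrame(rows)

# Round-trip through an in-memory CSV to show that scores come back as numbers.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With this layout, filtering or ranking by `Score` works directly after `pd.read_csv`, at the cost of repeating each `CustomerID` once per lookalike.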
