
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import davies_bouldin_score


In [2]:
customers = pd.read_csv('/content/Customers.csv')
products = pd.read_csv('/content/Products.csv')
transactions = pd.read_csv('/content/Transactions.csv')

Merge Data

In [5]:
merged_data = transactions.merge(customers, on='CustomerID', how='inner').merge(products, on='ProductID', how='inner')


In [7]:
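# Transactions and Products both carry a Price column, so the merge yields
# Price_x and Price_y; derive a single unit Price from the transaction totals.
if 'Price' not in merged_data.columns:
    merged_data['Price'] = merged_data['TotalValue'] / merged_data['Quantity']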

In [8]:
print(merged_data.columns)
print(merged_data.head())

Index(['TransactionID', 'CustomerID', 'ProductID', 'TransactionDate',
       'Quantity', 'TotalValue', 'Price_x', 'CustomerName', 'Region',
       'SignupDate', 'ProductName', 'Category', 'Price_y', 'Price'],
      dtype='object')
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x     CustomerName         Region  SignupDate  \
0      300.68   300.68   Andrea Jenkins         Europe  2022-12-03   
1      300.68   300.68  Brittany Harvey           Asia  2024-09-04   
2      300.68   300.68  Kathryn Stevens         Europe  2024-04-04   
3      601.36   300.68  Travis Campbell  South America  2024-0

Feature Engineering


In [9]:
# Aggregate data for each customer
customer_profiles = merged_data.groupby('CustomerID').agg({
    'TotalValue': 'sum',       # Total spending
    'Quantity': 'sum',         # Total quantity purchased
    'Price': 'mean',           # Average price of purchased products
    'Region': 'first',         # Region of the customer
}).reset_index()

In [10]:
customer_profiles = pd.get_dummies(customer_profiles, columns=['Region'], drop_first=True)


In [11]:
customer_features = customer_profiles.drop(['CustomerID'], axis=1)

In [13]:
# Normalize Features for Similarity Computation
scaler = StandardScaler()
customer_features_scaled = scaler.fit_transform(customer_features)

In [15]:
similarity_matrix = cosine_similarity(customer_features_scaled)

In [16]:
# Generate Recommendations for the First 20 Customers
recommendations = {}
customer_ids = customer_profiles['CustomerID'].tolist()

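for i in range(min(20, len(customer_ids))):
    customer_id = customer_ids[i]
    similarity_scores = similarity_matrix[i]
    # Rank all customers by descending similarity and drop the customer itself
    # explicitly, rather than assuming it always sorts last.
    ranked_indices = [idx for idx in similarity_scores.argsort()[::-1] if idx != i]
    similar_indices = ranked_indices[:3]
    # Store scores as plain Python floats so they serialize cleanly to CSV.
    similar_customers = [(customer_ids[idx], float(similarity_scores[idx])) for idx in similar_indices]
    recommendations[customer_id] = similar_customers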

In [17]:
# Save Recommendations to CSV
lookalike_df = pd.DataFrame.from_dict(recommendations, orient='index', columns=['Similar_Customer1', 'Similar_Customer2', 'Similar_Customer3'])
lookalike_df = lookalike_df.reset_index().rename(columns={'index': 'CustomerID'})
lookalike_df.to_csv('FirstName_LastName_Lookalike.csv', index=False)

In [18]:
from google.colab import files


files.download('FirstName_LastName_Lookalike.csv')


Customer Similarity-based Recommendation Model
This notebook shows how to build a customer similarity-based recommendation (lookalike) model using cosine similarity. The objective is to find, for a given customer, the customers with the most similar purchasing behavior, based on aggregated transaction data. The workflow covers data preprocessing, similarity calculation, and generation of recommendations.

1. Importing Required Libraries

We start by importing the libraries needed for data manipulation and similarity computation.

pandas is used for loading, merging, and processing data.

sklearn.metrics.pairwise.cosine_similarity is used to calculate the similarity between customer profiles.

sklearn.preprocessing.StandardScaler is used for feature scaling, so that all features are on a comparable scale.

google.colab.files is used to upload and download files in the Colab environment.

2. Loading and Exploring the Data

Three input CSV files are used:
Customers.csv contains information about the customers.
Products.csv contains information about the products.
Transactions.csv links customers to products through their transactions.

In Google Colab, these files can be uploaded with the files.upload() function; the notebook then reads them from /content/. After uploading, we load the data into pandas DataFrames and inspect the columns to ensure the structure is as expected.
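
The column check mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the CSV content is made up, an in-memory buffer stands in for Customers.csv, and missing_columns is a hypothetical helper. The expected column names are taken from the merged output shown earlier.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for Customers.csv (values are made up for illustration).
customers_csv = io.StringIO(
    "CustomerID,CustomerName,Region,SignupDate\n"
    "C0001,Alice Example,Europe,2023-01-15\n"
    "C0002,Bob Example,Asia,2024-02-20\n"
)
customers = pd.read_csv(customers_csv)


def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]


expected_customer_cols = ["CustomerID", "CustomerName", "Region", "SignupDate"]
missing = missing_columns(customers, expected_customer_cols)

print("Missing columns:", missing)
print(customers.dtypes)
```

An empty list means the file has the structure the later merge and aggregation steps rely on.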

3. Merging Datasets
We merge the datasets with the merge() method on their common columns:

CustomerID: a unique identifier for each customer.
ProductID: a unique identifier for each product.

This produces a single DataFrame in which each transaction row also carries the customer's details and the product's information.

Because both Transactions.csv and Products.csv contain a Price column, the merge renames them to Price_x and Price_y, so no column named Price remains. We therefore derive a unit Price by dividing TotalValue by Quantity.
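
A small sketch of this merge step, using made-up rows rather than the real CSV files, shows how overlapping Price columns get the _x and _y suffixes and how the unit price is then derived:

```python
import pandas as pd

# Made-up miniature versions of the three tables (illustrative values only).
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C1"],
    "ProductID": ["P1", "P1", "P2"],
    "Quantity": [1, 2, 3],
    "TotalValue": [10.0, 20.0, 75.0],
    "Price": [10.0, 10.0, 25.0],
})
customers = pd.DataFrame({"CustomerID": ["C1", "C2"], "Region": ["Europe", "Asia"]})
products = pd.DataFrame({"ProductID": ["P1", "P2"], "Price": [10.0, 25.0]})

merged = (
    transactions
    .merge(customers, on="CustomerID", how="inner")
    .merge(products, on="ProductID", how="inner")
)

# Both sides of the second merge have a Price column, so pandas keeps them
# apart as Price_x (from transactions) and Price_y (from products).
if "Price" not in merged.columns:
    merged["Price"] = merged["TotalValue"] / merged["Quantity"]

print(sorted(merged.columns))
```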

4. Aggregating Customer Profiles

We aggregate the data per customer by grouping on CustomerID. The aggregated metrics are:

TotalValue: The total amount spent by the customer.
Quantity: The total quantity of products purchased.
Price: The mean unit price across the customer's transactions.
Region: The region the customer is from (a categorical variable).

To include Region in the similarity calculation, we one-hot encode it into numeric indicator columns. The notebook passes drop_first=True, so the first region acts as the baseline and gets no column of its own.
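
The aggregation and encoding can be illustrated on a toy table. The rows below are invented for the example; the calls mirror the ones used in the notebook.

```python
import pandas as pd

# Invented transaction rows, already merged with customer data.
merged = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C3"],
    "TotalValue": [10.0, 30.0, 20.0, 50.0],
    "Quantity": [1, 3, 2, 1],
    "Price": [10.0, 10.0, 10.0, 50.0],
    "Region": ["Asia", "Asia", "Europe", "North America"],
})

# One row per customer: totals, mean unit price, and the customer's region.
profiles = merged.groupby("CustomerID").agg({
    "TotalValue": "sum",
    "Quantity": "sum",
    "Price": "mean",
    "Region": "first",
}).reset_index()

# One-hot encode Region; drop_first=True drops the first category ("Asia" here).
encoded = pd.get_dummies(profiles, columns=["Region"], drop_first=True)
print(encoded)
```

With drop_first=True, a customer from the dropped region is represented by zeros (False) in all remaining Region columns.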

5. Feature Scaling

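Cosine similarity ignores the overall length of a vector, but a feature with a large numeric range (such as total spending) still dominates the direction of the vector and therefore the similarity score. We standardize the features with StandardScaler so that each one has zero mean and unit variance and contributes on a comparable scale.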
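
As a rough illustration of what standardization does, the sketch below reproduces the arithmetic in plain NumPy on made-up numbers: subtract each column's mean and divide by its population standard deviation. It is a stand-in for the StandardScaler call in the notebook, not a replacement for it.

```python
import numpy as np

# Made-up features: column 0 is total spending, column 1 is quantity.
X = np.array([
    [1000.0, 2.0],
    [1200.0, 9.0],
    [3000.0, 3.0],
    [3100.0, 10.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so spending no longer outweighs quantity simply because its raw values are larger.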

6. Cosine Similarity Calculation

We now calculate the cosine similarity matrix from the standardized customer profiles. Cosine similarity measures the angle between two vectors, independent of their lengths: a smaller angle means higher similarity, with values ranging from -1 (opposite directions) to 1 (same direction).

To compute the similarity scores between all pairs of customers, we use the cosine_similarity() function from scikit-learn. The result is a square matrix in which element (i, j) is the similarity between customer i and customer j.
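
To make the definition concrete, here is a minimal NumPy sketch of the same computation: normalize each row to unit length, then take all pairwise dot products. The function name cosine_similarity_matrix is hypothetical and the vectors are invented; the notebook itself uses scikit-learn's cosine_similarity().

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (assumes no all-zero rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms          # each row scaled to length 1
    return unit @ unit.T      # dot products of unit vectors = cosines of angles


X = np.array([
    [1.0, 0.0],   # reference vector
    [2.0, 0.0],   # same direction, twice the length
    [0.0, 3.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
])
S = cosine_similarity_matrix(X)
print(S[0])
```

The first two rows point in the same direction, so their similarity is 1 despite the different lengths; the orthogonal row scores 0 and the opposite row scores -1.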

7. Generation of Recommendations

For each of the first 20 customers, we identify the 3 most similar customers in the following steps:

Fetch the row of similarity scores for that customer.

Sort the scores in descending order and pick the top three, excluding the customer itself (whose self-similarity is 1).

These lookalike customers are the ones whose purchasing patterns most closely match the customer's.
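
The selection steps above can be sketched as a small function. The name top_k_similar and the similarity values are made up for illustration. The customer's own index is removed explicitly, which stays correct even if another customer ties with a similarity of 1.

```python
import numpy as np


def top_k_similar(similarity_matrix, customer_ids, i, k=3):
    """Return the k customers most similar to customer i as (id, score) pairs."""
    scores = similarity_matrix[i]
    order = np.argsort(scores)[::-1]                  # indices by descending score
    order = [idx for idx in order if idx != i][:k]    # drop the customer itself
    return [(customer_ids[idx], float(scores[idx])) for idx in order]


ids = ["C1", "C2", "C3", "C4", "C5"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.5, -0.3],
    [0.9, 1.0, 0.1, 0.4, 0.0],
    [0.2, 0.1, 1.0, 0.7, 0.6],
    [0.5, 0.4, 0.7, 1.0, 0.3],
    [-0.3, 0.0, 0.6, 0.3, 1.0],
])

print(top_k_similar(sim, ids, 0))
# → [('C2', 0.9), ('C4', 0.5), ('C3', 0.2)]
```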

8. Save and Download the Recommendations

The recommendations are collected in a new DataFrame in which each row holds a customer and their 3 most similar customers, each stored together with its similarity score.

Finally, we save this DataFrame to a CSV file called FirstName_LastName_Lookalike.csv. In Google Colab, we trigger the file download using files.download().
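
If a flatter file is preferred, one possible alternative layout (an assumption, not what the notebook does) is to give each lookalike ID and each score its own column. The sketch below uses invented recommendations, hypothetical Score1 to Score3 column names, and writes to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Invented recommendations in the same shape the notebook builds:
# customer ID -> list of (similar customer ID, similarity score).
recommendations = {
    "C1": [("C2", 0.9), ("C4", 0.5), ("C3", 0.2)],
    "C2": [("C1", 0.9), ("C4", 0.4), ("C3", 0.1)],
}

rows = []
for customer_id, similar in recommendations.items():
    row = {"CustomerID": customer_id}
    for rank, (other_id, score) in enumerate(similar, start=1):
        row[f"Similar_Customer{rank}"] = other_id
        row[f"Score{rank}"] = round(score, 4)   # hypothetical score columns
    rows.append(row)

lookalike_df = pd.DataFrame(rows)

# Write to an in-memory buffer here; pass a filename to to_csv() to write a file.
buffer = io.StringIO()
lookalike_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Separate columns make the file easier to filter and sort in a spreadsheet than cells that each contain an (ID, score) pair.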

9. Conclusion

This model helps a company identify customers who resemble a given customer, so that products or offers that worked for one can be recommended to the others. This is useful for targeted marketing, cross-selling, and improving the customer experience. The resulting recommendations are saved as a CSV file and can be downloaded for further analysis or integration into a larger system.