# Features to Use:

1. **Region**: The customer's geographic region; customers in the same region are treated as more similar.
2. **Category**: Product categories purchased by each customer.
3. **Quantity**: Number of products purchased by the customer.
4. **TotalValue**: Total value of the customer's purchases, a direct indicator of spending behavior.

### Preprocessing:

- **Region**: One-hot encode the Region.
- **Category**: Aggregate the frequency of product categories purchased by each customer.
- **Quantity**: Aggregate the total quantity purchased by each customer.
- **TotalValue**: Aggregate the total amount spent by each customer.

### Similarity Calculation:
We compute pairwise cosine similarity over the preprocessed feature vectors (one-hot encoded Region plus standardized numeric and category-frequency features) to measure how similar two customers are.


### Category Frequency Calculation:

The `category_counts` variable aggregates how often each customer purchased from each product category. It is built by grouping the transactions from `Transactions.csv` (merged with product details) by `CustomerID`, calling `value_counts()` on the `Category` column, and pivoting the result with `unstack(fill_value=0)`. The result is a matrix in which each row corresponds to a customer and each column to a product category. Each value is the number of transactions that customer made in that category, with 0 where the customer never bought from it.

To summarize, the **category** feature captures how many purchases each customer made in each category. These frequencies are part of the input features for the similarity model, so customers with similar category purchase patterns receive higher cosine similarity scores and rank higher as lookalikes.
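The aggregation described above can be sketched on a small, made-up table. The `toy` frame and its customer IDs are hypothetical and only illustrate the shape of the result; the real input is the merged transactions data.

```python
import pandas as pd

# Hypothetical toy transactions; the real data comes from Transactions.csv
# merged with Products.csv.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2", "C2"],
    "Category": ["Books", "Books", "Electronics", "Clothing", "Books"],
})

# Count purchases per (customer, category), then pivot categories into columns.
# Categories a customer never bought are filled with 0.
toy_counts = (
    toy.groupby("CustomerID")["Category"]
    .value_counts()
    .unstack(fill_value=0)
)
print(toy_counts)
```

Each row of `toy_counts` is one customer's category profile, which is the form the similarity step consumes.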


### Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors that is widely used in machine learning and data analysis. It is the cosine of the angle between the two vectors, so it indicates how closely they point in the same direction, irrespective of their magnitudes. It is commonly used in text analysis (comparing documents or matching search queries) and in recommendation systems, where it helps match user preferences.
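As a minimal illustration, the cosine of the angle between vectors `a` and `b` is their dot product divided by the product of their lengths. The helper name `cosine` and the example vectors below are made up for this sketch; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors with different magnitudes point in the same direction.
same_direction = cosine([1, 2, 3], [2, 4, 6])
# Perpendicular vectors share no direction.
orthogonal = cosine([1, 0], [0, 1])
# Vectors pointing in opposite directions.
opposite = cosine([1, 2], [-1, -2])

print(same_direction, orthogonal, opposite)
```

The three cases land at the extremes and midpoint of the score range: close to 1 for the same direction, 0 for perpendicular vectors, and close to -1 for opposite directions.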

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load data
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

# Data Preprocessing

# Merge transactions with product details
transactions = transactions.merge(products, on='ProductID')

# Aggregate transaction data per customer
customer_transactions = transactions.groupby('CustomerID').agg(
    total_spent=('TotalValue', 'sum'),
    transaction_count=('TransactionID', 'count'),
    total_quantity=('Quantity', 'sum'),
    unique_categories=('Category', 'nunique')
).reset_index()

# Merge aggregated transaction data with customer profile
customer_data = customers.merge(customer_transactions, on='CustomerID')

# Feature Engineering: OneHotEncode Region and calculate the frequency of categories
customer_data['Region'] = customer_data['Region'].astype(str)
category_counts = transactions.groupby('CustomerID')['Category'].value_counts().unstack(fill_value=0)

# Merge category frequencies with customer data
customer_data = customer_data.merge(category_counts, on='CustomerID')

# Define the features to use for similarity calculation
features = ['Region', 'total_spent', 'transaction_count', 'total_quantity', 'unique_categories'] + category_counts.columns.tolist()

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('region', OneHotEncoder(), ['Region']),
        ('numeric', StandardScaler(), ['total_spent', 'transaction_count', 'total_quantity', 'unique_categories'] + category_counts.columns.tolist())
    ])

# Apply preprocessing
X = preprocessor.fit_transform(customer_data[features])

# Compute pairwise cosine similarity
cos_sim = cosine_similarity(X)

# Get the top 3 lookalike customers for each of the first 20 customers
lookalikes = {}
customer_ids = customer_data['CustomerID'].to_numpy()
for i in range(20):
    similarities = cos_sim[i].copy()
    # Exclude the customer itself explicitly instead of assuming it sorts first
    # (another customer with an identical feature vector would tie at 1.0)
    similarities[i] = -np.inf
    similar_customers = np.argsort(similarities)[::-1][:3]  # Top 3 others
    lookalikes[customer_ids[i]] = [
        (customer_ids[j], similarities[j]) for j in similar_customers
    ]

# Prepare Lookalike.csv format
lookalike_df = []
for cust_id, similar in lookalikes.items():
    row = {'cust_id': cust_id}
    for idx, (sim_cust_id, score) in enumerate(similar):
        row[f'similar_cust_{idx+1}'] = sim_cust_id
        row[f'score_{idx+1}'] = score
    lookalike_df.append(row)

lookalike_df = pd.DataFrame(lookalike_df)

# Save to CSV
lookalike_df.to_csv('Jay_Wanjare_Lookalike.csv', index=False)

# Display result
lookalike_df.head()


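|   | cust_id | similar_cust_1 | score_1 | similar_cust_2 | score_2 | similar_cust_3 | score_3 |
|---|---------|----------------|---------|----------------|---------|----------------|---------|
| 0 | C0001 | C0091 | 0.761214 | C0069 | 0.740388 | C0127 | 0.698927 |
| 1 | C0002 | C0159 | 0.917833 | C0134 | 0.860843 | C0133 | 0.82118 |
| 2 | C0003 | C0031 | 0.926822 | C0158 | 0.866181 | C0129 | 0.781113 |
| 3 | C0004 | C0113 | 0.881253 | C0012 | 0.875605 | C0065 | 0.862158 |
| 4 | C0005 | C0007 | 0.992734 | C0140 | 0.933528 | C0186 | 0.886995 |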


### Disadvantages of Cosine Similarity

1. **Sensitive to Sparse Data**:  
   When vectors are very sparse and share few non-zero components, most pairwise scores collapse toward zero and become hard to rank. In such cases, other similarity measures might work better.

2. **Does Not Account for Absolute Differences**:  
   Cosine similarity considers only the angle between vectors, not their magnitude. Two customers with the same purchase proportions but very different spending levels can therefore score as nearly identical, which matters when magnitude is important.

3. **Symmetry**:  
   Cosine similarity is symmetric: the score from A to B equals the score from B to A. For tasks where the direction of comparison matters, this may not be desirable.

4. **Harder to Interpret with Negative Values**:  
   When vectors contain negative values, as standardized features do, scores range from -1 to 1 rather than 0 to 1, and a negative score means the vectors point in broadly opposite directions. Scores must be interpreted with this range in mind, or they can be misleading.

> **Note**: Despite these disadvantages, **cosine similarity** remains a simple approach that is well suited to this use case.
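Two of these points, magnitude and negative values, can be seen on a tiny example. This is a sketch under stated assumptions: the `raw` array holds hypothetical `[total_spent, total_quantity]` profiles that are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Hypothetical customer profiles: [total_spent, total_quantity]
raw = np.array([
    [100.0, 2.0],    # small spender
    [1000.0, 20.0],  # ten times larger, same proportions
    [100.0, 20.0],   # different proportions
])

# On raw values, the first two customers are indistinguishable:
# the tenfold difference in magnitude does not change the angle.
raw_sim = cosine_similarity(raw)
print(raw_sim[0, 1])

# After standardization, each feature is centered on zero, so vectors
# can point in opposite directions and scores can be negative.
scaled = StandardScaler().fit_transform(raw)
scaled_sim = cosine_similarity(scaled)
print(scaled_sim[0, 1])
```

Because the pipeline in this notebook standardizes the numeric features before computing similarity, its scores should be read on the -1 to 1 scale.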


In [12]:
import pandas as pd

# Load the data
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

# Load the lookalike data
lookalike_df = pd.read_csv('Jay_Wanjare_Lookalike.csv')

# Merge transactions with product details (if needed for additional comparison)
transactions = transactions.merge(products, on='ProductID')

# Aggregate transaction data per customer (if needed for additional comparison)
customer_transactions = transactions.groupby('CustomerID').agg(
    total_spent=('TotalValue', 'sum'),
    transaction_count=('TransactionID', 'count'),
    total_quantity=('Quantity', 'sum'),
    unique_categories=('Category', 'nunique')
).reset_index()

# Merge aggregated transaction data with customer profile
customer_data = customers.merge(customer_transactions, on='CustomerID')

# Function to compare four customers (Original + 3 Lookalikes)
def compare_customers_with_lookalikes(customer_id):
    """Compare the original customer with their top 3 lookalikes based on available data."""

    # Find the lookalikes for the given customer_id from the lookalike dataframe
    lookalike_data = lookalike_df[lookalike_df['cust_id'] == customer_id]

    # Collect the similar customers (including the original)
    similar_customers = [customer_id] + lookalike_data[['similar_cust_1', 'similar_cust_2', 'similar_cust_3']].values.flatten().tolist()

    # Fetch the customer data for the original and lookalikes
    customer_comparison = customer_data[customer_data['CustomerID'].isin(similar_customers)]

    # Select the relevant columns for comparison
    comparison_columns = ['CustomerID', 'Region', 'total_spent', 'transaction_count', 'total_quantity', 'unique_categories']
    comparison_data = customer_comparison[comparison_columns]

    # Fetch exact categories per customer
    category_data = transactions[transactions['CustomerID'].isin(similar_customers)].groupby('CustomerID')['Category'].unique().reset_index()

    # Merge category data with comparison data
    comparison_data = comparison_data.merge(category_data, on='CustomerID', how='left')
    comparison_data.rename(columns={'Category': 'purchased_categories'}, inplace=True)

    return comparison_data

# Loop through all unique customers in the lookalike dataframe
for customer_id in lookalike_df['cust_id'].unique():
    comparison = compare_customers_with_lookalikes(customer_id)
    print(f"\nComparison for Customer {customer_id} and their Lookalikes:\n")
    print(comparison)



Comparison for Customer C0001 and their Lookalikes:

  CustomerID         Region  total_spent  transaction_count  total_quantity  \
0      C0001  South America      3354.52                  5              12   
1      C0069         Europe      2878.69                  5              10   
2      C0091  South America      3137.66                  6              20   
3      C0127         Europe      3232.88                  6              11   

   unique_categories                 purchased_categories  
0                  3     [Books, Home Decor, Electronics]  
1                  2            [Electronics, Home Decor]  
2                  3  [Electronics, Clothing, Home Decor]  
3                  3     [Electronics, Home Decor, Books]  

Comparison for Customer C0002 and their Lookalikes:

  CustomerID         Region  total_spent  transaction_count  total_quantity  \
0      C0002           Asia      1862.74                  4              10   
1      C0133  South America      2884.