# Lookalike Model

## Similarity Measurement

To find the most similar customers, we compute a **similarity score** based on the customer profiles and transaction history. Common approaches to measuring similarity include:

- **Cosine Similarity**: This method measures the cosine of the angle between two vectors (in this case, customer profiles). A higher cosine similarity indicates that two customers are more similar.

   Formula for Cosine Similarity:
   \[
   \text{Similarity} = \frac{{\text{A} \cdot \text{B}}}{{\|\text{A}\| \cdot \|\text{B}\|}}
   \]
   Where:
   - \(\text{A}\) and \(\text{B}\) are the feature vectors for two customers.
   - The numerator is the dot product of the vectors.
   - The denominator is the product of the magnitudes of the vectors.

- **Euclidean Distance**: An alternative is the Euclidean distance between customers' profiles, where a smaller distance means two customers are more similar. Cosine similarity is often preferred for recommendation tasks because it depends only on the direction of the vectors, not on their magnitude.
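
To make the contrast between the two measures concrete, here is a minimal sketch using NumPy. The vectors and the helper names (`cosine_sim`, `euclidean_dist`) are made up for illustration and are not part of the notebook.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Straight-line distance between two 1-D vectors."""
    return float(np.linalg.norm(a - b))

# Hypothetical purchase quantities over three products
light_buyer = np.array([1.0, 2.0, 0.0])
heavy_buyer = 3 * light_buyer             # same product mix, three times the volume
other_buyer = np.array([0.0, 1.0, 4.0])   # a different product mix

sim_same_pattern = cosine_sim(light_buyer, heavy_buyer)
dist_same_pattern = euclidean_dist(light_buyer, heavy_buyer)
sim_other = cosine_sim(light_buyer, other_buyer)

print(sim_same_pattern, dist_same_pattern, sim_other)
```

The two buyers with the same product mix get a cosine similarity of 1 even though their Euclidean distance is well above zero, which is the magnitude-insensitivity described above.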

We compute the similarity between each customer's vector and every other customer's vector; this notebook uses cosine similarity. The result is a square, symmetric matrix in which the entry at row i, column j is the similarity between customers i and j.
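
As a rough illustration of what such a matrix looks like, the sketch below runs scikit-learn's `cosine_similarity` on three made-up customers. The IDs (`C_A`, `C_B`, `C_C`) and feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors, one row per customer
toy_features = pd.DataFrame(
    [[2.0, 0.0, 1.0],
     [4.0, 0.0, 2.0],   # same direction as C_A
     [0.0, 3.0, 0.0]],  # orthogonal to C_A and C_B
    index=["C_A", "C_B", "C_C"],
    columns=["f1", "f2", "f3"],
)

# Pairwise similarities: one row and one column per customer
toy_sim = pd.DataFrame(
    cosine_similarity(toy_features.to_numpy()),
    index=toy_features.index,
    columns=toy_features.index,
)
print(toy_sim.round(3))
```

The diagonal holds each customer's similarity to themselves, and the matrix is symmetric because the measure does not depend on the order of the two vectors.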


## Importing libraries

In [15]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
import numpy as np

## Loading Dataset

In [None]:

# Load the datasets if needed
customers_df = pd.read_csv('../Dataset/Customers.csv')
products_df = pd.read_csv('../Dataset/Products.csv')
transactions_df = pd.read_csv('../Dataset/Transactions.csv')

In [17]:
products_df['Category'] = products_df['Category'].astype('category')
transactions_df['TransactionDate'] = pd.to_datetime(transactions_df['TransactionDate'])

## Merge Dataset

In [18]:
# Merge the datasets
transactions_products = pd.merge(transactions_df, products_df, on='ProductID', how='inner')
full_data = pd.merge(transactions_products, customers_df, on='CustomerID', how='inner')
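
An inner merge silently drops transactions whose `ProductID` (or `CustomerID`) has no match. The sketch below, on made-up frames, shows one way to check for that with the `validate` and `indicator` parameters of `pd.merge`. The `TransactionID` column and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the transaction and product tables
toy_transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C1"],
    "ProductID": ["P1", "P2", "P9"],  # P9 does not exist in toy_products
    "Quantity": [1, 2, 3],
})
toy_products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Toys"],
})

# A left merge with indicator=True flags rows that found no product;
# validate="many_to_one" raises if ProductID is duplicated in toy_products
checked = pd.merge(
    toy_transactions, toy_products,
    on="ProductID", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]

# The inner merge keeps only the matched rows
inner = pd.merge(toy_transactions, toy_products, on="ProductID", how="inner")
print(len(unmatched), len(inner))
```

If `unmatched` is empty on the real data, the inner merge loses nothing; otherwise those rows are worth inspecting before building the interaction matrix.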

## Create Customer-Product Interaction Matrix

In [19]:
# Create a Customer-Product matrix (transactions data)
customer_product_matrix = full_data.pivot_table(index='CustomerID', columns='ProductID', values='Quantity', aggfunc='sum', fill_value=0)

# Scale the matrix for better similarity calculation (optional)
scaler = StandardScaler()
customer_product_matrix_scaled = scaler.fit_transform(customer_product_matrix)
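
To see what the pivot and the optional scaling step do, here is a small sketch on made-up transactions. The frame `toy_full_data` and its values are assumptions; the calls mirror the ones in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical merged data: C1 bought P1 twice and P2 once, C2 bought only P2
toy_full_data = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C2"],
    "ProductID": ["P1", "P1", "P2", "P2"],
    "Quantity": [1, 2, 4, 3],
})

# One row per customer, one column per product, summed quantities, 0 where no purchase
toy_matrix = toy_full_data.pivot_table(
    index="CustomerID", columns="ProductID",
    values="Quantity", aggfunc="sum", fill_value=0,
)

# StandardScaler standardizes each product column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_matrix)
print(toy_matrix)
print(toy_scaled)
```

Note that after scaling, a zero quantity is generally no longer zero, so "did not buy" is expressed as a below-average value for that product rather than as an absent feature.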

## Calculate Similarity (Collaborative Filtering)

In [20]:
# Calculate cosine similarity between customers
cosine_sim = cosine_similarity(customer_product_matrix_scaled)
cosine_sim_df = pd.DataFrame(cosine_sim, index=customer_product_matrix.index, columns=customer_product_matrix.index)

## Calculate Profile-Based Similarity (Content-Based)

In [21]:
# Convert categorical columns (e.g., Region) into numerical values
# .copy() makes profile_data an independent DataFrame, so the assignment below
# does not trigger pandas' SettingWithCopyWarning
profile_data = customers_df[['CustomerID', 'Region']].copy()  # Add more features as needed
profile_data['Region'] = profile_data['Region'].astype('category').cat.codes  # Encoding categorical columns

# Calculate cosine similarity for customer profile data
profile_similarity = cosine_similarity(profile_data[['Region']])

# Create a DataFrame for profile similarity
profile_sim_df = pd.DataFrame(profile_similarity, index=profile_data['CustomerID'], columns=profile_data['CustomerID'])
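
One thing worth checking in this step: with a single integer-coded column, cosine similarity only sees the sign of each code, so two customers with different nonzero codes score 1, while the category coded 0 scores 0 against everyone. One-hot encoding is a common alternative. The sketch below compares the two on made-up customers; it is an illustration under that assumption, not the notebook's method, and the region names are hypothetical.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical customers; category codes are assigned alphabetically,
# so Asia -> 0, Europe -> 1, North America -> 2
toy_customers = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4"],
    "Region": ["Asia", "Europe", "Asia", "North America"],
})

# Integer codes as a single feature column
codes = (
    toy_customers["Region"].astype("category").cat.codes
    .to_numpy(dtype=float).reshape(-1, 1)
)
code_sim = cosine_similarity(codes)

# One-hot encoding: one 0/1 column per region
one_hot = pd.get_dummies(toy_customers["Region"]).astype(float)
one_hot_sim = pd.DataFrame(
    cosine_similarity(one_hot.to_numpy()),
    index=toy_customers["CustomerID"],
    columns=toy_customers["CustomerID"],
)

print(code_sim)      # C2 (Europe) vs C4 (North America) comes out as 1
print(one_hot_sim)   # same region -> 1, different region -> 0
```

With one-hot columns, customers in the same region score 1 and customers in different regions score 0, which is usually what a region-based profile similarity is meant to express.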


## Combine Both Similarities

In [22]:
# Combine both similarity scores (you can give different weights)
combined_sim_df = (cosine_sim_df + profile_sim_df) / 2
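
The equal-weight average can be generalized to a weighted one. The sketch below is one possible way to do that; the helper `combine_similarities`, its `transaction_weight` parameter, and the toy frames are all hypothetical. It also shows that pandas aligns the two frames by label, so a customer present in only one of them ends up with `NaN` scores in the plain sum.

```python
import pandas as pd

def combine_similarities(transaction_sim, profile_sim, transaction_weight=0.5):
    """Weighted average of two customer-by-customer similarity frames."""
    if not 0.0 <= transaction_weight <= 1.0:
        raise ValueError("transaction_weight must be between 0 and 1")
    # Keep only customers present in both frames to avoid NaN from label alignment
    shared = transaction_sim.index.intersection(profile_sim.index)
    t = transaction_sim.loc[shared, shared]
    p = profile_sim.loc[shared, shared]
    return transaction_weight * t + (1.0 - transaction_weight) * p

ids_t = ["C1", "C2"]
ids_p = ["C1", "C2", "C3"]  # C3 has a profile but no transactions
toy_t = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]], index=ids_t, columns=ids_t)
toy_p = pd.DataFrame(
    [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    index=ids_p, columns=ids_p,
)

naive = (toy_t + toy_p) / 2   # C3's row and column become NaN
weighted = combine_similarities(toy_t, toy_p, transaction_weight=0.7)
print(naive)
print(weighted)
```

Whether to drop such customers or fall back to the profile score alone is a design choice that depends on how customers without transactions should be treated.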

## Generating Recommendations

Once the similarity scores are calculated for each customer, we can generate recommendations by selecting the top 3 most similar customers to the input customer.

For each customer (from `C0001` to `C0020`), we:
- Sort all other customers based on their similarity scores in descending order.
- Select the top 3 customers with the highest similarity scores.
- Exclude the input customer from the list of similar customers (because a customer is always similar to themselves).
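
The steps above can be sketched as a small helper. The function name `top_lookalikes` and the toy similarity matrix are assumptions for illustration; the helper removes the input customer by label rather than by position, so it does not rely on the self-score sorting first.

```python
import pandas as pd

def top_lookalikes(sim_df, customer_id, n=3):
    """Return the n highest-scoring other customers for customer_id."""
    scores = sim_df[customer_id].drop(labels=customer_id)  # exclude self by label
    return scores.nlargest(n)                              # sorted descending

# Hypothetical symmetric similarity matrix for five customers
ids = ["C1", "C2", "C3", "C4", "C5"]
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.5, 0.3],
     [0.9, 1.0, 0.2, 0.4, 0.6],
     [0.1, 0.2, 1.0, 0.7, 0.8],
     [0.5, 0.4, 0.7, 1.0, 0.0],
     [0.3, 0.6, 0.8, 0.0, 1.0]],
    index=ids, columns=ids,
)

top_c1 = top_lookalikes(toy_sim, "C1")
print(top_c1)
```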


## Generate Lookalike Recommendations

In [27]:
# Create a list of lookalike customers and their similarity scores
lookalike_data = []

# Loop through the first 20 customers
for customer_id in customers_df['CustomerID'][:20]:  # For customers C0001 to C0020
    # Get the similarity scores for the current customer; drop the customer's own entry by label
    # (rather than assuming it sorts first), then keep the 3 highest scores
    similar_customers = combined_sim_df[customer_id].drop(customer_id).sort_values(ascending=False)[:3]
    
    # Add the top 3 lookalikes and their similarity scores to the lookalike_data list
    for similar_customer, score in zip(similar_customers.index, similar_customers.values):
        lookalike_data.append({
            'CustomerID': customer_id,
            'Lookalike_CustomerID': similar_customer,
            'Similarity_Score': score
        })

# Create a DataFrame to save the results
lookalike_df = pd.DataFrame(lookalike_data)

# Save the results in a CSV file
lookalike_df.to_csv('Lookalike.csv', index=False)

print("Top 3 lookalikes for each of the first 20 customers saved to 'Lookalike.csv'")


Top 3 lookalikes for each of the first 20 customers saved to 'Lookalike.csv'


## Output: Top 3 Lookalikes

For each customer, we return the **top 3 most similar customers**, along with their **similarity scores**. This can be stored in a `Lookalike.csv` file, where each row contains:
- `CustomerID`: The ID of the input customer.
- `Lookalike_CustomerID`: The ID of the similar customer.
- `Similarity_Score`: The similarity score indicating how similar the two customers are.
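
A few sanity checks on a file in this format can catch mistakes such as a customer matched to themselves. The sketch below is one way to do that; it reads from an in-memory string instead of `Lookalike.csv` so that it runs on its own, and the six sample rows are copied from the output shown in this notebook.

```python
import io
import pandas as pd

# Sample rows in the Lookalike.csv layout (values taken from the notebook output)
csv_text = """CustomerID,Lookalike_CustomerID,Similarity_Score
C0001,C0194,0.702464
C0001,C0104,0.687001
C0001,C0020,0.683304
C0002,C0030,0.202308
C0002,C0091,0.191889
C0002,C0071,0.160079
"""
loaded = pd.read_csv(io.StringIO(csv_text))

# Every input customer should have exactly 3 lookalikes
has_three_each = bool((loaded.groupby("CustomerID").size() == 3).all())

# No customer should be listed as their own lookalike
no_self_matches = bool((loaded["CustomerID"] != loaded["Lookalike_CustomerID"]).all())

# Within each customer, scores should be in descending order
sorted_desc = all(
    group["Similarity_Score"].is_monotonic_decreasing
    for _, group in loaded.groupby("CustomerID")
)

print(has_three_each, no_self_matches, sorted_desc)
```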

In [29]:
# Preview the first 20 rows of the lookalike DataFrame
# (a notebook displays only the last expression in a cell, so this line shows nothing here)
lookalike_df.head(20)

# Alternatively, print lookalikes for each customer
for customer_id in lookalike_df['CustomerID'].unique():
    print(f"Lookalikes for Customer {customer_id}:")
    lookalikes_for_customer = lookalike_df[lookalike_df['CustomerID'] == customer_id]
    print(lookalikes_for_customer[['Lookalike_CustomerID', 'Similarity_Score']])
    print()

Lookalikes for Customer C0001:
  Lookalike_CustomerID  Similarity_Score
0                C0194          0.702464
1                C0104          0.687001
2                C0020          0.683304

Lookalikes for Customer C0002:
  Lookalike_CustomerID  Similarity_Score
3                C0030          0.202308
4                C0091          0.191889
5                C0071          0.160079

Lookalikes for Customer C0003:
  Lookalike_CustomerID  Similarity_Score
6                C0181          0.738786
7                C0144          0.711900
8                C0067          0.671254

Lookalikes for Customer C0004:
   Lookalike_CustomerID  Similarity_Score
9                 C0070          0.675951
10                C0132          0.639799
11                C0105          0.637891

Lookalikes for Customer C0005:
   Lookalike_CustomerID  Similarity_Score
12                C0096          0.243728
13                C0023          0.235126
14                C0055          0.191050

... (output truncated)


## Explanation of the Model Development

The model combines two sources: **customer profiles** (here, `Region`) and **transactional data** (quantities purchased per product). A similarity score is computed from each source, and the two scores are averaged.

### Chosen Similarity Measure: Cosine Similarity
- **Cosine similarity** was chosen as the measure of similarity because it is effective for high-dimensional data, such as customer profiles and transactional behavior, where we care more about the **pattern** of purchases than about their absolute quantities or values.

### Recommendation Process:
- The model computes similarity scores for each customer with respect to all other customers.
- The top 3 customers with the highest similarity scores are recommended as lookalikes.

### Exclusion of Self:
- We exclude the input customer from the recommendations, as they would always have the highest similarity to themselves.

## Final Model Output

The final output for the first 20 customers (from `C0001` to `C0020`) would look something like this:

