## **Task 2: Lookalike Model**

### Data Preparation

In [3]:
import pandas as pd

# Load datasets
customers_df = pd.read_csv(r"C:\Users\Sushila S\Zeotap_DataScience_Assignment\Datasets\Customers.csv")
products_df = pd.read_csv(r"C:\Users\Sushila S\Zeotap_DataScience_Assignment\Datasets\Products.csv")
transactions_df = pd.read_csv(r"C:\Users\Sushila S\Zeotap_DataScience_Assignment\Datasets\Transactions.csv")


##### Merge the Datasets

In [4]:

# Merge transactions with customers on CustomerID
transactions_with_customers = pd.merge(transactions_df, customers_df, on='CustomerID', how='left')

# Merge the above result with products on ProductID
full_data = pd.merge(transactions_with_customers, products_df, on='ProductID', how='left')
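
A left merge can hide mismatched keys, so it can be worth checking the result before building features on it. The following is a minimal sketch on hypothetical toy tables (the rows and the `Region` column are made up for illustration); it uses the `validate` and `indicator` options of `pd.merge` to surface duplicated customer records and transactions with no matching customer.

```python
import pandas as pd

# Hypothetical toy tables standing in for Transactions.csv and Customers.csv
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C0001", "C0002", "C0999"],  # C0999 has no customer record
    "TotalValue": [100.0, 50.0, 20.0],
})
customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "Region": ["Asia", "Europe"],
})

# validate="many_to_one" raises MergeError if CustomerID is duplicated in `customers`;
# indicator=True adds a `_merge` column recording where each row was matched
merged = pd.merge(
    transactions, customers,
    on="CustomerID", how="left",
    validate="many_to_one", indicator=True,
)

# With unique right-hand keys, a left join keeps exactly one row per transaction
assert len(merged) == len(transactions)

# Transactions whose CustomerID was not found in the customer table
unmatched = merged.loc[merged["_merge"] == "left_only", "CustomerID"].tolist()
print(unmatched)
```

The same pattern should apply to the second merge on `ProductID`.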


## Feature Engineering

- Total Value of Transactions per Customer
- Product Categories Purchased by Customer
- Frequency of Purchases
- Average Purchase Value

In [5]:
# Total value per customer
customer_total_value = full_data.groupby('CustomerID')['TotalValue'].sum()

# Frequency of transactions per customer
customer_frequency = full_data.groupby('CustomerID')['TransactionID'].count()

# Average purchase value
customer_avg_purchase = full_data.groupby('CustomerID')['TotalValue'].mean()

# Product categories bought per customer (this could be aggregated as a list)
customer_categories = full_data.groupby('CustomerID')['Category'].unique()


##### Merge Features

In [6]:
customer_features = pd.DataFrame({
    'TotalValue': customer_total_value,
    'TransactionFrequency': customer_frequency,
    'AvgPurchaseValue': customer_avg_purchase,
    'ProductCategories': customer_categories
})

# Reset index so that 'CustomerID' is a column
customer_features.reset_index(inplace=True)
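
As an alternative to four separate `groupby` calls, the same features can be computed in a single pass with named aggregation. This is only a sketch on hypothetical toy data that reuses the column names from the merged dataset; it is not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data using the same column names as the merged dataset
toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0002"],
    "TransactionID": ["T1", "T2", "T3"],
    "TotalValue": [100.0, 50.0, 30.0],
    "Category": ["Books", "Electronics", "Books"],
})

# One groupby, one output column per (source column, aggregation) pair
toy_features = (
    toy.groupby("CustomerID")
    .agg(
        TotalValue=("TotalValue", "sum"),
        TransactionFrequency=("TransactionID", "count"),
        AvgPurchaseValue=("TotalValue", "mean"),
        ProductCategories=("Category", lambda s: sorted(set(s))),
    )
    .reset_index()  # make CustomerID an ordinary column again
)
print(toy_features)
```

Computing everything in one `agg` call keeps the feature columns aligned by construction, since they all come from the same grouped object.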


## Similarity Calculation

#### Transform Product Categories to Numerical Features

In [7]:
from sklearn.preprocessing import MultiLabelBinarizer

# One-Hot Encoding of Product Categories
mlb = MultiLabelBinarizer()
categories_encoded = mlb.fit_transform(customer_features['ProductCategories'])

# Create a DataFrame from the encoded categories
categories_df = pd.DataFrame(categories_encoded, columns=mlb.classes_)

# Merge back with the customer features
customer_features = pd.concat([customer_features, categories_df], axis=1)
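
To make the encoding step concrete, the library-free sketch below reproduces by hand the kind of 0/1 matrix that a multi-label binarizer yields. The customer IDs and category lists are hypothetical, and the alphabetical column order is an assumption chosen to mirror the sorted `classes_` attribute.

```python
# Hypothetical category lists for three customers
customer_categories = {
    "C0001": ["Books", "Electronics"],
    "C0002": ["Books"],
    "C0003": ["Clothing"],
}

# Column order: every category seen, sorted alphabetically
classes = sorted({c for cats in customer_categories.values() for c in cats})

# One row per customer: 1 if the customer bought from the category, else 0
encoded = {
    cust: [1 if c in cats else 0 for c in classes]
    for cust, cats in customer_categories.items()
}

print(classes)
for cust, row in encoded.items():
    print(cust, row)
```

Each customer ends up with one column per category, so customers who bought from the same categories share the same pattern of ones.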


##### Similarity Calculation Using Cosine Similarity

Cosine similarity compares the direction of two feature vectors rather than their magnitude, which makes it a common choice for comparing customer profiles. It is the measure used below.

In [8]:
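from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Extract the features for the similarity calculation (drop the ID and the raw category lists)
features_for_similarity = customer_features.drop(['CustomerID', 'ProductCategories'], axis=1)

# Standardize the numeric columns so that TotalValue does not dominate the cosine score
numeric_cols = ['TotalValue', 'TransactionFrequency', 'AvgPurchaseValue']
features_for_similarity[numeric_cols] = StandardScaler().fit_transform(features_for_similarity[numeric_cols])

# Calculate the pairwise cosine similarity matrix (customers x customers)
similarity_matrix = cosine_similarity(features_for_similarity)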
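
Cosine similarity is sensitive to the relative scale of the columns it is given. The sketch below uses two hypothetical customers and an assumed spending ceiling of 5000 to show how one large monetary column can push the score towards 1 even when the category flags do not overlap at all.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical customers: [TotalValue, Books, Electronics]
# They buy from entirely different categories but spend similar amounts.
customer_a = [3000.0, 1, 0]
customer_b = [3500.0, 0, 1]

raw_score = cosine(customer_a, customer_b)

# Rescale TotalValue to a range comparable with the 0/1 flags (assumed maximum of 5000)
scaled_a = [customer_a[0] / 5000.0, 1, 0]
scaled_b = [customer_b[0] / 5000.0, 0, 1]
scaled_score = cosine(scaled_a, scaled_b)

print(raw_score, scaled_score)
```

On the raw values the two customers look almost identical, while after rescaling the score drops well below 0.5, which is why putting numeric features on a comparable scale before computing the matrix is usually advisable.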


## Model Development

Create a Function to Get Top 3 Similar Customers:

In [9]:
def get_top_3_lookalikes(customer_id, similarity_matrix, customer_ids):
    # Position of customer_id in the list (also its row index in the similarity matrix)
    customer_index = customer_ids.index(customer_id)

    # Similarity scores between this customer and every customer
    similarity_scores = similarity_matrix[customer_index]

    # Rank all customers by descending similarity, drop the customer itself, keep the top 3.
    # Filtering by index is safer than skipping position 0, which breaks when scores tie.
    ranked_indices = similarity_scores.argsort()[::-1]
    sorted_indices = [i for i in ranked_indices if i != customer_index][:3]

    top_3_customers = [(customer_ids[i], similarity_scores[i]) for i in sorted_indices]

    return top_3_customers
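
Skipping the first element of a descending sort only excludes the customer itself when no other customer ties with the self-similarity score. A minimal alternative sketch, using a hypothetical helper name and a made-up similarity row, masks the customer's own entry before ranking:

```python
import numpy as np

def top_k_excluding_self(scores, self_index, k=3):
    """Indices of the k largest scores, never returning self_index."""
    masked = np.array(scores, dtype=float)      # copy so the input is not modified
    masked[self_index] = -np.inf                # the customer can no longer be selected
    order = np.argsort(-masked, kind="stable")  # descending; ties keep index order
    return order[:k].tolist()

# Hypothetical similarity row for customer 2, where customer 0 ties with the self-score
row = [1.0, 0.4, 1.0, 0.9, 0.2]
print(top_k_excluding_self(row, self_index=2))  # prints [0, 3, 1]
```

With the mask in place, the result does not depend on how the sort orders tied scores.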


##### Compute Top 3 Lookalikes for Customers C0001 to C0020

In [10]:
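lookalike_recommendations = []

# Extract the list of CustomerIDs (same order as the rows of similarity_matrix)
customer_ids = customer_features['CustomerID'].tolist()

# Build the target IDs C0001 to C0020 explicitly instead of relying on row order
target_ids = [f"C{n:04d}" for n in range(1, 21)]

for customer_id in target_ids:
    # Customers without any transaction have no feature row, so skip them
    if customer_id not in customer_ids:
        continue
    top_3 = get_top_3_lookalikes(customer_id, similarity_matrix, customer_ids)
    # Flatten to one row: CustomerID, Lookalike_1, Similarity_Score_1, ...
    row = {'CustomerID': customer_id}
    for rank, (lookalike_id, score) in enumerate(top_3, start=1):
        row[f'Lookalike_{rank}'] = lookalike_id
        row[f'Similarity_Score_{rank}'] = float(score)
    lookalike_recommendations.append(row)

# Convert the recommendations to a DataFrame
lookalike_df = pd.DataFrame(lookalike_recommendations)

# Save to CSV
lookalike_df.to_csv("Sushila_Shivashimpiger_Lookalike.csv", index=False)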


## Deliverables

#### 1. Jupyter Notebook:

File Name: Sushila_Shivashimpiger_Lookalike.ipynb

### A. Data Preparation Section:
##### Goal: Merge the datasets to form a single unified dataset.
Steps:
- Load the datasets: Load Customers.csv, Products.csv, and Transactions.csv into pandas DataFrames.
- Merge the datasets:
  - Merge Transactions.csv with Customers.csv on CustomerID to get customer details along with their transactions.
  - Then merge this result with Products.csv on ProductID to get product details along with customer transactions.
- Display the merged data and confirm it’s correctly combined.
### B. Feature Engineering Section:
##### Goal: Create new features that help determine the similarity between customers.
Steps: Create features such as the following.
- Total Transaction Value per customer: Sum of TotalValue from transactions.
- Transaction Frequency: Number of transactions per customer.
- Average Purchase Value: Average of TotalValue for each customer.
- Product Categories: Unique product categories each customer has purchased.

### C. Similarity Calculation Section:
##### Goal: Calculate similarity scores between customers based on their features.
Steps:
- Encode categorical features such as product categories numerically (for example, with one-hot encoding).
- Use cosine similarity (or Euclidean distance) to calculate similarity between customers based on their features.
### D. Function to Get Top 3 Lookalikes:
##### Goal: For each customer, recommend the top 3 similar customers based on the similarity matrix.
Steps:
- Implement a function that takes a customer’s ID and returns their top 3 most similar customers based on the cosine similarity scores.
### E. Output of Recommendations Section:
##### Goal: Generate recommendations for customers C0001 to C0020 and save them in a CSV file.
Steps:
- For each customer (from C0001 to C0020), calculate their top 3 lookalikes.
- Store the results in a CSV file with the format: CustomerID, Lookalike_1, Similarity_Score_1, Lookalike_2, Similarity_Score_2, Lookalike_3, Similarity_Score_3.

## 2. CSV Output: Sushila_Shivashimpiger_Lookalike.csv
##### Contents:
- CustomerID: The ID of the customer for whom you’re generating lookalikes.
- Lookalike_1, Lookalike_2, Lookalike_3: The IDs of the top 3 most similar customers.
- Similarity_Score_1, Similarity_Score_2, Similarity_Score_3: The cosine similarity scores for each of the lookalikes.
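
A quick way to confirm that an output file follows the layout described above is to read it back and check the header and each row. The sketch below is one possible check, not part of the original notebook: the function name, the tolerance, and the sample contents (IDs and scores) are all hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "CustomerID",
    "Lookalike_1", "Similarity_Score_1",
    "Lookalike_2", "Similarity_Score_2",
    "Lookalike_3", "Similarity_Score_3",
]

def check_lookalike_csv(handle):
    """Return the number of data rows; raise ValueError on a malformed file."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    count = 0
    for row in reader:
        lookalikes = [row[f"Lookalike_{i}"] for i in (1, 2, 3)]
        if row["CustomerID"] in lookalikes:
            raise ValueError(f"{row['CustomerID']} is listed as its own lookalike")
        for i in (1, 2, 3):
            score = float(row[f"Similarity_Score_{i}"])
            # Cosine similarity lies in [-1, 1]; allow a small floating-point tolerance
            if not -1.0 - 1e-9 <= score <= 1.0 + 1e-9:
                raise ValueError(f"score out of range in row {row['CustomerID']}")
        count += 1
    return count

# Hypothetical file contents; the IDs and scores are made up for illustration
sample = io.StringIO(
    "CustomerID,Lookalike_1,Similarity_Score_1,Lookalike_2,Similarity_Score_2,"
    "Lookalike_3,Similarity_Score_3\n"
    "C0001,C0042,0.98,C0107,0.95,C0013,0.91\n"
)
print(check_lookalike_csv(sample))  # prints 1
```

To check a real file, the same function can be given an open file handle, for example `open("Sushila_Shivashimpiger_Lookalike.csv", newline="")`.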