## Task 2

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Load the customer and transaction data
customers = pd.read_csv("Customers.csv")
transactions = pd.read_csv("Transactions.csv")

# Drop rows with missing values, if any
customers = customers.dropna()
transactions = transactions.dropna()

## Data Preprocessing:

Customer and transaction data are loaded.

Missing values are removed from both datasets.
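
Before dropping rows, it can be useful to check how many would actually be lost. The following is a minimal sketch on a small made-up frame; `toy_customers` and its values are invented for illustration and are not the contents of Customers.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Customers.csv; the values are invented for illustration.
toy_customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["Asia", np.nan, "Europe"],
})

# Count missing values per column before deciding to drop anything.
missing_per_column = toy_customers.isna().sum()

# dropna() removes every row that has at least one missing value.
cleaned = toy_customers.dropna()
rows_dropped = len(toy_customers) - len(cleaned)

print(missing_per_column)
print(f"rows dropped: {rows_dropped}")
```

If the count of dropped rows is large, filling values instead of dropping may be worth considering, since a dropped customer can never appear as a lookalike.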

In [2]:
customer_transactions = transactions.groupby('CustomerID').agg({
    'Quantity': 'sum',
    'TotalValue': 'sum'
}).reset_index()
customer_profile = pd.merge(customers[['CustomerID', 'Region']], customer_transactions, on='CustomerID', how='left')


customer_profile.fillna(0, inplace=True)


customer_profile = pd.get_dummies(customer_profile, columns=['Region'], drop_first=True)


## Feature Engineering:

Transaction data is aggregated by customer, calculating total quantity and total value of purchases.

Customer profiles, including CustomerID and Region, are merged with aggregated transaction data.

Missing values in the combined profile are filled with 0 (for customers with no transactions).

One-hot encoding is applied to the Region column to convert it into binary features.
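
The four steps above can be traced on a tiny example. This is a sketch only: the `toy_*` frames and their values are invented, though the column names follow the notebook.

```python
import pandas as pd

# Toy data, invented for illustration; column names follow the notebook.
toy_customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["Asia", "Europe", "Asia"],
})
toy_transactions = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0002"],
    "Quantity": [2, 3, 1],
    "TotalValue": [20.0, 30.0, 15.0],
})

# Aggregate per customer, then left-join so C0003 (no transactions) is kept.
totals = (
    toy_transactions.groupby("CustomerID")[["Quantity", "TotalValue"]]
    .sum()
    .reset_index()
)
toy_profile = toy_customers.merge(totals, on="CustomerID", how="left")

# Fill only the transaction columns for customers with no purchases.
value_cols = ["Quantity", "TotalValue"]
toy_profile[value_cols] = toy_profile[value_cols].fillna(0)

# drop_first=True drops the first category in sorted order ("Asia" here).
toy_profile = pd.get_dummies(toy_profile, columns=["Region"], drop_first=True)
print(toy_profile)
```

With `drop_first=True`, a customer in the dropped category is represented by zeros in every remaining `Region_*` column, so no information is lost.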

## Similarity Calculation:

The customer profile features are standardized (zero mean and unit variance per feature) so that no single feature dominates because of its scale.
Cosine similarity is then calculated between every pair of customers to measure how similar they are based on their profile and transaction history.

In [3]:
scaler = StandardScaler()
profile_features = customer_profile.drop('CustomerID', axis=1)  
profile_scaled = scaler.fit_transform(profile_features)

# Compute pairwise cosine similarity between all customers
similarity_matrix = cosine_similarity(profile_scaled)
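
To see what the resulting matrix looks like, here is a small sketch on an invented feature matrix (`toy_features` is hypothetical, not taken from the dataset). Because the columns are centered first, the cosine compares how each customer deviates from the average, so two below-average customers score as similar and a below-average and an above-average customer score as dissimilar.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (rows = customers, columns = Quantity, TotalValue).
# The values are invented for illustration.
toy_features = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [9.0, 95.0],
])

# Standardize each column to zero mean and unit variance.
toy_scaled = StandardScaler().fit_transform(toy_features)

# Entry [i, j] is the cosine of the angle between rows i and j.
toy_similarity = cosine_similarity(toy_scaled)
print(np.round(toy_similarity, 3))
```

The matrix is square and symmetric, with ones on the diagonal because every customer is identical to itself.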

## Lookalike Recommendations:

For the first 20 customers (C0001 to C0020), the top 3 most similar customers are found based on the cosine similarity scores.
These recommendations are stored in a dictionary, with each customer mapped to their top 3 lookalikes and similarity scores.

In [4]:
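lookalikes = {}
customer_ids = customer_profile['CustomerID'].tolist()
for i in range(20):  # first 20 rows; assumes Customers.csv is ordered by CustomerID
    customer_id = customer_ids[i]  # use the real ID instead of rebuilding it from the index
    # Pair each row index with its similarity score, skipping the customer itself
    similarity_scores = [(j, s) for j, s in enumerate(similarity_matrix[i]) if j != i]
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)[:3]
    lookalikes[customer_id] = [(customer_ids[j], s) for j, s in similarity_scores]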
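
The same top-3 selection can also be written with NumPy. The sketch below is one possible variant, not the notebook's method: `top_k_similar` is a hypothetical helper, and the similarity matrix is invented for illustration.

```python
import numpy as np

def top_k_similar(similarity, ids, row, k=3):
    """Hypothetical helper: k most similar (id, score) pairs for one row, excluding itself."""
    scores = similarity[row].astype(float).copy()
    scores[row] = -np.inf  # make sure the customer never matches itself
    best = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return [(ids[j], float(scores[j])) for j in best]

# Toy symmetric similarity matrix; the values are invented for illustration.
toy_ids = ["C0001", "C0002", "C0003", "C0004"]
toy_similarity = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.8],
    [0.5, 0.4, 0.8, 1.0],
])

result = top_k_similar(toy_similarity, toy_ids, row=0, k=2)
print(result)  # [('C0002', 0.9), ('C0004', 0.5)]
```

Masking the diagonal entry explicitly avoids relying on the customer's own score sorting first, which may not hold when another customer has an identical profile.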

## Output Format:

Recommendations are formatted into a structured format where each row contains a customer ID and a list of the top 3 similar customers, including similarity scores.
The results are saved to a CSV file (Siddhant_Gupta_Lookalike.csv).

In [5]:
lookalike_data = []
for cust_id, recommendations in lookalikes.items():
    recommendations_list = [f'{rec[0]}:{rec[1]:.4f}' for rec in recommendations]
    lookalike_data.append([cust_id, ', '.join(recommendations_list)])


In [6]:
lookalike_df = pd.DataFrame(lookalike_data, columns=["CustomerID", "Lookalikes"])
lookalike_df.to_csv('Siddhant_Gupta_Lookalike.csv', index=False)
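
Since each row stores its lookalikes as a single `ID:score, ID:score, ...` string, reading the file back requires a small parsing step. The sketch below shows one way to do it; `parse_lookalikes` is a hypothetical helper, and the CSV row is invented and read from memory rather than from the saved file.

```python
import io
import pandas as pd

# One row in the same "ID:score, ID:score, ..." layout; the values are invented.
toy_csv = io.StringIO(
    'CustomerID,Lookalikes\n'
    'C0001,"C0002:0.9000, C0004:0.5000, C0003:0.2000"\n'
)
toy_df = pd.read_csv(toy_csv)

def parse_lookalikes(cell):
    """Hypothetical helper: split 'ID:score, ID:score' into (ID, float score) tuples."""
    pairs = []
    for item in cell.split(", "):
        cust_id, score = item.split(":")
        pairs.append((cust_id, float(score)))
    return pairs

parsed = toy_df["Lookalikes"].apply(parse_lookalikes)
print(parsed.iloc[0])
```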
