# Price Optimization Project for Company ABC

## Introduction

The purpose of this project is to analyze a pricing experiment conducted by Company ABC. The company divided its user base into two groups: Group A, constituting 66% of the users, and Group B, the remaining 34%. Group A received a lower price offer, while Group B was presented with a higher price for the same product.

The objectives of this analysis are to:

- Perform exploratory data analysis on the experimental data.
- Determine the statistical significance of the observed difference in conversion rates between the two groups via A/B testing.
- Identify an optimal price for the product to maximize revenue.
- Analyze user behavior and derive actionable insights to enhance conversion rates.
- Determine an optimal duration for running such tests to yield stable and reliable results.
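
As a preview of the significance test named in the objectives, one common choice for comparing two conversion rates is a two-proportion z-test. The sketch below is an assumption about method, not necessarily the test applied later in this project, and the counts passed in are placeholders rather than results from the experiment. `two_proportion_ztest` is a hypothetical helper; it uses only the standard library, computing the two-sided normal tail probability with `math.erfc`.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of equal rates
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Placeholder counts for illustration only (not taken from the experiment)
z, p_value = two_proportion_ztest(conv_a=200, n_a=10_000, conv_b=150, n_b=10_000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

With the real data, the four inputs would be the number of conversions and the number of users in each price group.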

This project utilizes two data sources: `test_results` and `user_table`. The `test_results` table contains information regarding user interactions on the site, the device used, the source of traffic, the price shown to the user, and the user's purchasing decision. The `user_table` includes geographical information about each user.

The sections that follow cover data cleaning and preprocessing, exploratory data analysis, hypothesis testing, and user behavior analysis, and close with insights drawn from the findings.

## Importing Packages

In [1]:
# Import Necessary Libraries

# Data handling libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis libraries
from scipy import stats

# Set the style for seaborn
sns.set(style="whitegrid")


## Loading the Data

In [28]:
user_table = pd.read_csv(r"C:\Users\ramdh\Documents\GitHub\DS_Project\Other\Price Optimization Amazon DS Challenge\data\user_table.csv")
test_results = pd.read_csv(r"C:\Users\ramdh\Documents\GitHub\DS_Project\Other\Price Optimization Amazon DS Challenge\data\test_results.csv")

## Data Preprocessing

In [13]:
def dataframe_info(df):
    """Summarize each column: dtype, unique values, and missing data."""
    rows = []
    for column in df.columns:
        missing_values = df[column].isnull().sum()
        rows.append({'Column': column,
                     'Data Type': df[column].dtype,
                     'Unique Count': df[column].nunique(),
                     'Unique Sample': df[column].unique()[:5],
                     'Missing Values': missing_values,
                     'Missing Percentage': round(missing_values / len(df) * 100, 4)})
    # Build the report once instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=['Column', 'Data Type', 'Unique Count', 'Unique Sample', 'Missing Values', 'Missing Percentage'])

In [17]:
print("\nDuplicate rows in user_table:", user_table.duplicated().sum())
print("Duplicate rows in test_results:", test_results.duplicated().sum())


Duplicate rows in user_table: 0
Duplicate rows in test_results: 0


In [18]:
dataframe_info(user_table)

Unnamed: 0,Column,Data Type,Unique Count,Unique Sample,Missing Values,Missing Percentage
0,user_id,int64,275616,"[510335, 89568, 434134, 289769, 939586]",0,0.0
1,city,object,923,"[Peabody, Reno, Rialto, Carson City, Chicago]",0,0.0
2,country,object,1,[USA],0,0.0
3,lat,float64,713,"[42.53, 39.54, 34.11, 39.15, 41.84]",0,0.0
4,long,float64,830,"[-70.97, -119.82, -117.39, -119.74, -87.68]",0,0.0


In [22]:
user_table.shape

(275616, 5)
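
Because `user_table` holds exactly one row per `user_id` (275,616 unique ids in 275,616 rows), it can be joined to `test_results` on that key without duplicating rows. Before merging, it may be worth counting how many test users have no geographic record. The sketch below does this with a left merge and the `indicator` flag of `DataFrame.merge`; the two small frames are made-up stand-ins for the real tables, and `count_unmatched_users` is a hypothetical helper name.

```python
import pandas as pd

def count_unmatched_users(left, right, key="user_id"):
    """Count rows of `left` whose key has no match in `right`."""
    right_keys = right[[key]].drop_duplicates()
    merged = left.merge(right_keys, how="left", on=key, indicator=True)
    # '_merge' is 'left_only' for rows that found no partner in `right`
    return int((merged["_merge"] == "left_only").sum())

# Made-up stand-ins for test_results and user_table (not the real data)
toy_tests = pd.DataFrame({"user_id": [1, 2, 3, 4], "converted": [0, 1, 0, 0]})
toy_users = pd.DataFrame({"user_id": [1, 3], "city": ["Reno", "Chicago"]})

n_missing = count_unmatched_users(toy_tests, toy_users)
print(f"{n_missing} of {len(toy_tests)} test users have no location record")
```

On the real tables the call would be `count_unmatched_users(test_results, user_table)`.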

In [14]:
dataframe_info(test_results)

Unnamed: 0,Column,Data Type,Unique Count,Unique Sample,Missing Values,Missing Percentage
0,user_id,int64,316800,"[604839, 624057, 317970, 685636, 820854]",0,0.0
1,timestamp,object,140931,"[2015-05-08 03:38:34, 2015-05-10 21:08:46, 201...",0,0.0
2,source,object,12,"[ads_facebook, seo-google, ads-bing, direct_tr...",0,0.0
3,device,object,2,"[mobile, web]",0,0.0
4,operative_system,object,6,"[iOS, android, mac, windows, other]",0,0.0
5,test,int64,2,"[0, 1]",0,0.0
6,price,int64,2,"[39, 59]",0,0.0
7,converted,int64,2,"[0, 1]",0,0.0


In [20]:
test_results.shape

(316800, 8)
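
With both tables profiled, a quick sanity check on the experiment design itself can be useful: each `test` group should see a single `price`, and the group shares should be close to the 66%/34% split described in the introduction. The following is a minimal sketch on a made-up frame; `toy` and its values are illustrative only, and which `test` value maps to which price in the real data should be confirmed rather than assumed.

```python
import pandas as pd

# Made-up stand-in for test_results (illustrative values only)
toy = pd.DataFrame({
    "test":  [0, 0, 0, 0, 1, 1],
    "price": [39, 39, 39, 59, 59, 59],  # the fourth row is deliberately inconsistent
})

# Rows per (test, price) combination; unexpected combinations signal mislabeled rows
combo_counts = pd.crosstab(toy["test"], toy["price"])
print(combo_counts)

# Share of rows in each test group, to compare with the intended split
group_share = toy["test"].value_counts(normalize=True).sort_index()
print(group_share.round(2))

# Flag rows whose price differs from the most common price of their group
modal_price = toy.groupby("test")["price"].transform(lambda s: s.mode().iloc[0])
mismatched = toy[toy["price"] != modal_price]
print("Rows with an unexpected price:", len(mismatched))
```

Any mismatched rows found this way would need a decision (drop or relabel) before conversion rates are compared between groups.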

### Identifying and handling invalid timestamps

Converting the `timestamp` column to datetime exposes a problem: some entries are not valid timestamps. Parsing with `errors='coerce'` turns these entries into `NaT` so that they can be counted.

In [29]:
df_processed = test_results.copy(deep=True)
df_processed['timestamp'] = pd.to_datetime(df_processed['timestamp'], errors='coerce')

# Create a new DataFrame where the 'timestamp' column is NaT (Not a Time)
invalid_timestamps = df_processed[df_processed['timestamp'].isna()]

print("Number of invalid timestamps: ", invalid_timestamps.shape[0])

Number of invalid timestamps:  10271


These entries fail to parse because a time field is out of range: the minutes or seconds value is 60, outside the valid range of 0-59 (for example, '2015-04-24 12:60:46' and '2015-04-04 02:23:60'). They account for 10,271 of the 316,800 rows, about 3.2%. Such anomalies can arise from data entry errors or issues in the data collection process.
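
To see which field is out of range in each bad timestamp, the strings can be split with a regular expression rather than inspected by eye. This sketch assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the samples; `invalid_fields` is a hypothetical helper, and the sample strings are copied from rows shown in this section.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS"
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} (\d{2}):(\d{2}):(\d{2})$")

def invalid_fields(value):
    """Return the names of the time fields that fall outside their valid range."""
    match = TIMESTAMP_RE.match(value)
    if match is None:
        return ["format"]  # string does not follow the assumed layout at all
    hour, minute, second = (int(part) for part in match.groups())
    problems = []
    if hour > 23:
        problems.append("hour")
    if minute > 59:
        problems.append("minute")
    if second > 59:
        problems.append("second")
    return problems

samples = ["2015-04-24 12:60:46", "2015-04-04 02:23:60", "2015-05-08 03:38:34"]
for ts in samples:
    print(ts, "->", invalid_fields(ts) or "valid")
```

Mapping this function over the raw `timestamp` strings would show how the invalid rows divide between bad minutes and bad seconds.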

In [33]:
# Look up the original (unparsed) timestamp strings for the rows that failed to parse
anomaly_df = test_results.loc[invalid_timestamps.index]
anomaly_df

Unnamed: 0,user_id,timestamp,source,device,operative_system,test,price,converted
54,370914,2015-04-24 12:60:46,direct_traffic,mobile,android,0,39,0
104,549807,2015-04-24 11:60:20,friend_referral,mobile,iOS,0,39,0
121,107010,2015-03-14 12:60:02,direct_traffic,web,windows,0,39,0
278,287830,2015-04-04 02:23:60,direct_traffic,web,windows,1,59,0
282,676183,2015-05-11 12:60:53,ads-google,web,windows,1,59,0
...,...,...,...,...,...,...,...,...
316566,999430,2015-03-21 16:27:60,friend_referral,web,windows,0,39,0
316606,256920,2015-05-29 14:34:60,ads_other,mobile,iOS,1,59,0
316709,177121,2015-03-20 18:33:60,ads_facebook,mobile,iOS,0,39,0
316756,546292,2015-03-06 22:33:60,seo-yahoo,mobile,iOS,1,59,0


In [34]:
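def correct_timestamp(value):
    # Return valid timestamps unchanged; clamp a minutes or seconds field of 60 to 59
    try:
        pd.to_datetime(value)
        return value
    except (ValueError, TypeError):
        return value.replace(":60", ":59")

# Apply the fix to the raw strings in test_results: df_processed['timestamp'] was
# already coerced above, so its invalid entries are NaT and can no longer be repaired
df_processed['timestamp'] = test_results['timestamp'].apply(correct_timestamp)
df_processed['timestamp'] = pd.to_datetime(df_processed['timestamp'])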
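
Replacing 60 with 59 keeps every row but alters the affected values. An alternative, sketched below under the same assumed `YYYY-MM-DD HH:MM:SS` layout, is to treat 60 as a rollover: parse the date, then add hours, minutes, and seconds as time offsets so that `12:60:46` becomes `13:00:46`. Whether rollover reflects what the logging system intended is itself an assumption, and `parse_with_rollover` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def parse_with_rollover(timestamps):
    """Parse 'YYYY-MM-DD HH:MM:SS' strings, letting out-of-range fields roll over."""
    parts = timestamps.str.extract(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})$"
    )
    date = pd.to_datetime(parts["date"], format="%Y-%m-%d", errors="coerce")
    # Adding the time fields as offsets lets 60 minutes carry into the next hour
    offset = (
        pd.to_timedelta(parts["h"].astype(float), unit="h")
        + pd.to_timedelta(parts["m"].astype(float), unit="min")
        + pd.to_timedelta(parts["s"].astype(float), unit="s")
    )
    return date + offset

raw = pd.Series(["2015-04-24 12:60:46", "2015-04-04 02:23:60", "2015-05-08 03:38:34"])
fixed = parse_with_rollover(raw)
print(fixed)
```

Strings that do not match the assumed layout come back as `NaT`, so a final `fixed.isna().sum()` would confirm whether anything is still unparsed.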


In [None]:
import dtale

# Assigning a reference to a running D-Tale process.
d = dtale.show(df_processed)

In [21]:
df = pd.merge(test_results, user_table, how='left', on='user_id')
dataframe_info(df)

Unnamed: 0,Column,Data Type,Unique Count,Unique Sample,Missing Values,Missing Percentage
0,user_id,int64,316800,"[604839, 624057, 317970, 685636, 820854]",0,0.0
1,timestamp,object,140931,"[2015-05-08 03:38:34, 2015-05-10 21:08:46, 201...",0,0.0
2,source,object,12,"[ads_facebook, seo-google, ads-bing, direct_tr...",0,0.0
3,device,object,2,"[mobile, web]",0,0.0
4,operative_system,object,6,"[iOS, android, mac, windows, other]",0,0.0
5,test,int64,2,"[0, 1]",0,0.0
6,price,int64,2,"[39, 59]",0,0.0
7,converted,int64,2,"[0, 1]",0,0.0
8,city,object,923,"[Buffalo, Lakeville, Parma, Fayetteville, Fish...",41184,13.0
9,country,object,1,"[USA, nan]",41184,13.0
