## PRICING: What should the price of the item be?
A game company gave gift coins to its users for purchasing items in a game. Using these virtual coins, users buy various items for their characters. The company did not set a price for an item and instead let users buy it at whatever price they chose. For example, for the item named shield, users pay the amount they see fit: one user may pay 30 units of the virtual money given to them, while another may pay 45 units. Users can therefore buy the item for whatever amount they are willing to pay.

## Problems to be solved:
1. Does the price of the item differ by category? Express it statistically.
1. Based on the answer to the first question, what should the item cost? Explain why.
1. It is desirable to be "flexible" in terms of price. Create a decision support system for the price strategy.
1. Simulate item purchases and income for possible price changes.

## Data Analysis

In [1]:
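# loading necessary libraries
import itertools

import matplotlib.pyplot as plt  # used by the optional boxplot in has_outliers
import pandas as pd
import scipy.stats as stats
import seaborn as sns  # used by the optional boxplot in has_outliers
import statsmodels.stats.api as sms
from scipy.stats import shapiro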

In [2]:
# reading the data
df = pd.read_csv("../input/pricing-dataset/pricing.csv", sep=";")

In [3]:
# Analyze the dataframe
def analyze_df(df):
    print("Shape of dataframe: {0}".format(df.shape), "\n")  # shape of the dataframe
    print("There are {0} observations and {1} features".format(len(df), len(df.columns)), "\n")  # number of observations and features
    print(df.head(), "\n")  # first 5 observations
    print("Number of unique categories:{0}".format(df["category_id"].nunique()), "\n")
    print("Names of categories:{0}".format(df["category_id"].unique()), "\n")
    for col in df.columns:
        print(" Number of null value in the {0} column: {1}".format(col, df[col].isnull().sum()))  # null values per column
    print(df.describe().T, "\n")  # summary statistics, useful for spotting outliers

In [4]:
analyze_df (df)

Shape of dataframe: (3448, 2) 

There are 3448 observations and 2 features 

   category_id      price
0       489756  32.117753
1       361254  30.711370
2       361254  31.572607
3       489756  34.543840
4       489756  47.205824 

Number of unique categories:6 

Names of categories:[489756 361254 874521 326584 675201 201436] 

 Number of null value in the category_id column: 0
 Number of null value in the price column: 0
              count           mean            std       min            25%  \
category_id  3448.0  542415.171984  192805.689911  201436.0  457630.500000   
price        3448.0    3254.475770   25235.799009      10.0      31.890438   

                       50%            75%            max  
category_id  489756.000000  675201.000000  874521.000000  
price            34.798544      41.536211  201436.991255   



In [5]:
# When the average price by categories is analyzed, we can make comparisons for groups, but we need to prove this statistically.
df.groupby("category_id").agg({"price":"mean"})

                   price
category_id             
201436         36.175498
326584       1424.665182
361254       1659.680663
489756       3589.808526
675201       3112.240362
874521       4605.357258


When we look at the category averages, we can make some observations, but these observations are not statistically significant results. So,
we will test the hypotheses for all pairs of categories and obtain statistical results.
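The `describe()` output above shows a maximum price far above the 75th percentile, so the group means may be dominated by a handful of extreme payments. As a minimal sketch (the toy `sample` frame below is made up for illustration and is not the pricing dataset), putting the mean next to the median per category makes that effect visible:

```python
import pandas as pd

# Hypothetical toy data: category "B" contains one extreme payment.
sample = pd.DataFrame({
    "category_id": ["A", "A", "A", "B", "B", "B"],
    "price": [30.0, 32.0, 34.0, 30.0, 32.0, 10000.0],
})

# Mean and median side by side for each category.
summary = sample.groupby("category_id")["price"].agg(["mean", "median"])
print(summary)
```

In this toy example both categories have the same median, while the mean of "B" is pulled far upward by a single value.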

# Definition of the Hypothesis

- **H0:** There is no statistically significant difference between the average prices of the two categories.

- **H1:** There is a statistically significant difference between the average prices of the two categories.

## Assumptions of the Hypothesis

1.  Normal Distribution 
1.  Homogeneity of Variances



# Checking Assumptions

## Normal Distribution
Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of parametric tests such as the t-test.

**The Shapiro-Wilk Test for Normality**

* H0: There is no statistically significant difference between the sample distribution and the theoretical normal distribution.
* H1: There is a statistically significant difference between the sample distribution and the theoretical normal distribution.

In [6]:
# We apply the Shapiro-Wilk test to each group to check whether its prices are normally distributed.
print(" Shapiro-Wilks Test Result")
for category in df["category_id"].unique():
    test_statistic, pvalue = shapiro(df.loc[df["category_id"] == category, "price"])
    # report the category in both branches so every result line can be identified
    decision = "H0 is rejected." if pvalue < 0.05 else "H0 is not rejected."
    print('\n', '{0} -> '.format(category), 'Test statistic = %.4f, p-Value = %.4f' % (test_statistic, pvalue), decision)

 Shapiro-Wilks Test Result

 489756 ->  Test statistic = 0.1095, p-Value = 0.0000 H0 is rejected.

 361254 ->  Test statistic = 0.0615, p-Value = 0.0000 H0 is rejected.

 874521 ->  Test statistic = 0.1311, p-Value = 0.0000 H0 is rejected.

 326584 ->  Test statistic = 0.0568, p-Value = 0.0000 H0 is rejected.

 675201 ->  Test statistic = 0.1011, p-Value = 0.0000 H0 is rejected.

 201436 ->  Test statistic = 0.6190, p-Value = 0.0000 H0 is rejected.


*  489756 ->   H0 is rejected.
*  361254 ->   H0 is rejected.
*  874521 ->   H0 is rejected.
*  326584 ->   H0 is rejected.
*  675201 ->   H0 is rejected.
*  201436 ->   H0 is rejected.

The normality assumption is not satisfied for any category, so we next check whether outliers are the cause.

In [7]:
# To determine the threshold value for outliers
def outlier_thresholds(dataframe, variable, low_quantile=0.05, up_quantile=0.95):
    quantile_one = dataframe[variable].quantile(low_quantile)
    quantile_three = dataframe[variable].quantile(up_quantile)
    interquantile_range = quantile_three - quantile_one
    up_limit = quantile_three + 1.5 * interquantile_range
    low_limit = quantile_one - 1.5 * interquantile_range
    return low_limit, up_limit

In [8]:
# Threshold values are determined for the price variable.
low_limit,up_limit = outlier_thresholds(df, "price")
print("Low Limit : {0}  Up Limit : {1}".format(low_limit,up_limit))

Low Limit : -64.46732638496243  Up Limit : 187.44554397493738
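The limits above come from taking the span between the 5% and 95% quantiles and extending it by 1.5 times its width on each side. The following is a minimal sketch of that arithmetic on a toy series; the helper name `quantile_fences` and the toy data are assumptions for illustration, not part of the notebook:

```python
import pandas as pd

def quantile_fences(values, low_quantile=0.05, up_quantile=0.95, k=1.5):
    """Return (low_limit, up_limit) by widening the quantile span by k times its width."""
    q_low = values.quantile(low_quantile)
    q_up = values.quantile(up_quantile)
    span = q_up - q_low
    return q_low - k * span, q_up + k * span

# Hypothetical toy series: the numbers 1..100.
toy = pd.Series(range(1, 101), dtype=float)
low_limit, up_limit = quantile_fences(toy)
print(low_limit, up_limit)
```

For the pricing data, the lower limit is negative while the smallest observed price is 10, so only the upper limit (about 187.45) can actually flag observations.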


In [9]:
# Are there any outlier observations? If so, how many?
def has_outliers(dataframe, numeric_columns, plot=False):
    for col in numeric_columns:
        low_limit, up_limit = outlier_thresholds(dataframe, col)
        # rows whose value falls outside the [low_limit, up_limit] range
        outlier_mask = (dataframe[col] > up_limit) | (dataframe[col] < low_limit)
        number_of_outliers = int(outlier_mask.sum())
        if number_of_outliers > 0:
            print(col, " : ", number_of_outliers, "outliers")
            if plot:
                # requires seaborn (sns) and matplotlib.pyplot (plt) to be imported
                sns.boxplot(x=dataframe[col])
                plt.show()

In [10]:
has_outliers(df, ["price"])

price  :  77 outliers


In [11]:
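# removing outliers
def remove_outliers(dataframe, numeric_columns):
    dataframe_without_outliers = dataframe
    for variable in numeric_columns:
        # thresholds are always computed on the original dataframe
        low_limit, up_limit = outlier_thresholds(dataframe, variable)
        values = dataframe_without_outliers[variable]
        # filter cumulatively so that every listed column is cleaned, not only the last one
        dataframe_without_outliers = dataframe_without_outliers[~((values < low_limit) | (values > up_limit))]
    return dataframe_without_outliers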

In [12]:
df = remove_outliers(df, ["price"])

In [13]:
# new shape of dataframe
df.shape

(3371, 2)

In [14]:
# We apply the Shapiro-Wilk test to each group again, this time without the outliers.
print(" Shapiro-Wilks Test Result")
for category in df["category_id"].unique():
    test_statistic, pvalue = shapiro(df.loc[df["category_id"] == category, "price"])
    # report the category in both branches so every result line can be identified
    decision = "H0 is rejected." if pvalue < 0.05 else "H0 is not rejected."
    print('\n', '{0} -> '.format(category), 'Test statistic = %.4f, p-Value = %.4f' % (test_statistic, pvalue), decision)

 Shapiro-Wilks Test Result

 489756 ->  Test statistic = 0.6328, p-Value = 0.0000 H0 is rejected.

 361254 ->  Test statistic = 0.4757, p-Value = 0.0000 H0 is rejected.

 874521 ->  Test statistic = 0.5116, p-Value = 0.0000 H0 is rejected.

 326584 ->  Test statistic = 0.5026, p-Value = 0.0000 H0 is rejected.

 675201 ->  Test statistic = 0.6382, p-Value = 0.0000 H0 is rejected.

 201436 ->  Test statistic = 0.6190, p-Value = 0.0000 H0 is rejected.


*  489756 ->   H0 is rejected.
*  361254 ->   H0 is rejected.
*  874521 ->   H0 is rejected.
*  326584 ->   H0 is rejected.
*  675201 ->   H0 is rejected.
*  201436 ->   H0 is rejected.

Normal distribution was not achieved even after the outliers were removed.

## Homogeneity of Variances

**Levene’s Test for Homogeneity of variances**

Levene’s test is a test of equal variances. It can be used to check whether our data sets fulfill the homogeneity of variance assumption before we perform a t-test or an analysis of variance (ANOVA).

* H0: The compared categories have equal variance.
* H1: The compared categories do not have equal variance.
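To illustrate how this test behaves, here is a minimal sketch on synthetic data. The two samples (`narrow`, `wide`) and the random seed are assumptions chosen only for demonstration; they share the same mean but have clearly different spreads, so H0 should be rejected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples with the same mean but very different spread.
narrow = rng.normal(loc=35.0, scale=1.0, size=300)
wide = rng.normal(loc=35.0, scale=5.0, size=300)

# center="median" is SciPy's default (the Brown-Forsythe variant),
# which is generally more robust for skewed data than center="mean".
test_statistic, pvalue = stats.levene(narrow, wide, center="median")
print("Test statistic = %.4f, p-Value = %.4f" % (test_statistic, pvalue))
```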

In [15]:
# category pairs for the hypothesis tests
pairs = list(itertools.combinations(df["category_id"].unique(), 2))
pairs

[(489756, 361254),
 (489756, 874521),
 (489756, 326584),
 (489756, 675201),
 (489756, 201436),
 (361254, 874521),
 (361254, 326584),
 (361254, 675201),
 (361254, 201436),
 (874521, 326584),
 (874521, 675201),
 (874521, 201436),
 (326584, 675201),
 (326584, 201436),
 (675201, 201436)]

In [16]:
print("  Levene Test Result")
for pair in pairs:
    test_statistic, pvalue = stats.levene(df.loc[df["category_id"] == pair[0], "price"],
                                          df.loc[df["category_id"] == pair[1], "price"])
    decision = "  H0 is rejected" if pvalue < 0.05 else "  H0 is not rejected"
    print('\n', "({0} - {1}) -> ".format(pair[0], pair[1]), 'Test statistic = %.4f, p-Value = %.4f' % (test_statistic, pvalue), decision)

  Levene Test Result

 (489756 - 361254) ->  Test statistic = 86.2415, p-Value = 0.0000   H0 is rejected

 (489756 - 874521) ->  Test statistic = 12.9032, p-Value = 0.0003   H0 is rejected

 (489756 - 326584) ->  Test statistic = 11.9908, p-Value = 0.0005   H0 is rejected

 (489756 - 675201) ->  Test statistic = 9.1123, p-Value = 0.0026   H0 is rejected

 (489756 - 201436) ->  Test statistic = 10.7554, p-Value = 0.0011   H0 is rejected

 (361254 - 874521) ->  Test statistic = 31.9953, p-Value = 0.0000   H0 is rejected

 (361254 - 326584) ->  Test statistic = 4.9865, p-Value = 0.0258   H0 is rejected

 (361254 - 675201) ->  Test statistic = 7.8657, p-Value = 0.0052   H0 is rejected

 (361254 - 201436) ->  Test statistic = 1.3037, p-Value = 0.2539   H0 is not rejected

 (874521 - 326584) ->  Test statistic = 2.8538, p-Value = 0.0915   H0 is not rejected

 (874521 - 675201) ->  Test statistic = 1.7651, p-Value = 0.1843   H0 is not rejected

 (874521 - 201436) ->  Test statistic = 3.4482, 

**Pairs that do not have equal variance (H0 is rejected):**
*  (489756 - 361254) 
*  (489756 - 874521) 
*  (489756 - 326584) 
*  (489756 - 675201) 
*  (489756 - 201436) 
*  (361254 - 874521) 
*  (361254 - 326584) 
*  (361254 - 675201) 

**Pairs that have equal variance (H0 is not rejected):**
*  (361254 - 201436) 
*  (874521 - 326584) 
*  (874521 - 675201) 
*  (874521 - 201436) 
*  (326584 - 675201) 
*  (326584 - 201436) 
*  (675201 - 201436) 


# Implementing Hypothesis Test

We decide which test to apply according to the assumptions of normality and variance.
Normal distribution hypotheses of the groups were rejected. Therefore, we need to apply a non-parametric method.

## Non-Parametric Independent Two-Sample Test
**Mann-Whitney U test:** A non-parametric method used to compare two independent groups when the data are not normally distributed. It works on the ranks of the observations rather than on the raw values, so it is commonly read as a comparison of the groups' typical prices.

* H0: There is no statistically significant difference between the average prices of the two categories.

* H1: There is a statistically significant difference between the average prices of the two categories.
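As a minimal sketch of the test on synthetic data (the two log-normal samples and the seed below are assumptions for illustration only), note that the default behaviour of the `alternative` argument has changed across SciPy versions, so it is safer to state it explicitly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical skewed samples; the second group is shifted upward.
group_a = rng.lognormal(mean=3.5, sigma=0.3, size=200)
group_b = rng.lognormal(mean=3.8, sigma=0.3, size=200)

# Passing `alternative` explicitly avoids relying on the library default.
test_statistic, pvalue = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print("Test statistic = %.4f, p-Value = %.4f" % (test_statistic, pvalue))
```

With a one-sided default, the reported p-value can be about half of the two-sided one, which matters for pairs close to the 0.05 threshold.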

In [17]:
listofResult = []
print(" Mann-Whitney U test Result")
for pair in pairs:
    # call mannwhitneyu through the public scipy.stats namespace (not stats.stats)
    test_statistic, pvalue = stats.mannwhitneyu(df.loc[df["category_id"] == pair[0], "price"],
                                                df.loc[df["category_id"] == pair[1], "price"])
    if pvalue < 0.05:
        listofResult.append((pair[0], pair[1], "H0 is Rejected"))
        print('\n', "({0} - {1}) -> ".format(pair[0], pair[1]), 'Test statistic = %.4f, p-Value = %.4f' % (test_statistic, pvalue), "  H0 is rejected")
    else:
        listofResult.append((pair[0], pair[1], "H0 is not Rejected"))
        print('\n', "({0} - {1}) -> ".format(pair[0], pair[1]), 'Test statistic = %.4f, p-Value = %.4f' % (test_statistic, pvalue), "  H0 is not rejected")

 Mann-Whitney U test Result

 (489756 - 361254) ->  Test statistic = 371652.5000, p-Value = 0.0000   H0 is rejected

 (489756 - 874521) ->  Test statistic = 482405.0000, p-Value = 0.0000   H0 is rejected

 (489756 - 326584) ->  Test statistic = 68317.0000, p-Value = 0.0000   H0 is rejected

 (489756 - 675201) ->  Test statistic = 83360.5000, p-Value = 0.0000   H0 is rejected

 (489756 - 201436) ->  Test statistic = 60158.0000, p-Value = 0.0000   H0 is rejected

 (361254 - 874521) ->  Test statistic = 214411.0000, p-Value = 0.0909   H0 is not rejected

 (361254 - 326584) ->  Test statistic = 32541.0000, p-Value = 0.0000   H0 is rejected

 (361254 - 675201) ->  Test statistic = 38936.0000, p-Value = 0.3708   H0 is not rejected

 (361254 - 201436) ->  Test statistic = 29521.0000, p-Value = 0.4354   H0 is not rejected

 (874521 - 326584) ->  Test statistic = 38009.0000, p-Value = 0.0000   H0 is rejected

 (874521 - 675201) ->  Test statistic = 46044.0000, p-Value = 0.3623   H0 is not rejec

In [18]:
# one row per category pair: (category 1, category 2, test decision)
result_df = pd.DataFrame(listofResult, columns=["Category 1", "Category 2", "H0"])

In [19]:
result_df

   Category 1  Category 2                  H0
0      489756      361254      H0 is Rejected
1      489756      874521      H0 is Rejected
2      489756      326584      H0 is Rejected
3      489756      675201      H0 is Rejected
4      489756      201436      H0 is Rejected
5      361254      874521  H0 is not Rejected
6      361254      326584      H0 is Rejected
7      361254      675201  H0 is not Rejected
8      361254      201436  H0 is not Rejected
9      874521      326584      H0 is Rejected


## Does the price of the item differ by category? 
When we examine the results above, there is no statistically significant difference in average price for 6 of the 15 category pairs, while there is a statistically significant difference for the remaining 9 pairs.

## What should the item cost?


In [20]:
result_df[result_df["H0"] == "H0 is not Rejected"]

    Category 1  Category 2                  H0
5       361254      874521  H0 is not Rejected
7       361254      675201  H0 is not Rejected
8       361254      201436  H0 is not Rejected
10      874521      675201  H0 is not Rejected
11      874521      201436  H0 is not Rejected
14      675201      201436  H0 is not Rejected


Categories with no statistically significant difference between them:
* 361254
* 874521
* 675201
* 201436
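Treating these four categories as one group is only consistent if every pair inside the group appears in the "H0 is not Rejected" table above. A minimal sketch of that check, with the pairs copied by hand from the table (the variable names are illustrative):

```python
import itertools

# Pairs for which H0 was not rejected, copied from the table above.
not_rejected = {
    (361254, 874521), (361254, 675201), (361254, 201436),
    (874521, 675201), (874521, 201436), (675201, 201436),
}
candidate_group = [361254, 874521, 675201, 201436]

# Collect every pair inside the group that is NOT in the not-rejected set.
missing = [
    pair for pair in itertools.combinations(candidate_group, 2)
    if pair not in not_rejected and pair[::-1] not in not_rejected
]
print("Pairs with a significant difference inside the group:", missing)
```

An empty list means all six pairwise comparisons within the group failed to reject H0.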

Since the prices of these groups do not differ statistically, we can set the same price for them. We may also apply the same price to the remaining two groups, so let's examine their averages.

In [21]:
df.groupby("category_id").agg({"price":"mean"})

                 price
category_id           
201436       36.175498
326584       35.693170
361254       35.477261
489756       43.603983
675201       37.443592
874521       39.273175


Category 326584 is very close to the average price of the other 4 groups for which we are considering a common price. Category 489756 differs, but we will not set a separate price for it; we will continue with a common price scenario for all categories.

The average of the 4 statistically similar categories will be the price we set. Since the average price paid in the other two categories is either higher or close to it, leaving them out of the calculation should not affect purchases negatively.

In [22]:
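signif_cat = [361254, 874521, 675201, 201436]
# mean price of each selected category (avoids shadowing the built-in sum)
category_means = [df.loc[df["category_id"] == i, "price"].mean() for i in signif_cat]
# unweighted average of the category means
PRICE = sum(category_means) / len(category_means)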

In [23]:
print("PRICE :{%.4f}"%PRICE)

PRICE :{37.0924}
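The value can be checked by hand from the category means in the table above. The short sketch below simply repeats that arithmetic with the four means typed in as constants:

```python
# Category means after outlier removal, copied from the table above.
category_means = {
    361254: 35.477261,
    874521: 39.273175,
    675201: 37.443592,
    201436: 36.175498,
}

# Unweighted mean of the four category means.
price = sum(category_means.values()) / len(category_means)
print("PRICE: %.4f" % price)
```

Each category counts equally in this average regardless of how many observations it contains, so the result is not necessarily equal to the mean of all pooled prices.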


## Confidence Intervals: It is desirable to be "flexible" in terms of price. 

In [24]:
# We list the prices of the 4 categories that were selected for pricing
prices = df.loc[df["category_id"].isin(signif_cat), "price"].tolist()

In [25]:
print("Flexible Price Range: ", sms.DescrStatsW(prices).tconfint_mean())

Flexible Price Range:  (36.7109597897918, 38.17576299427283)
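`tconfint_mean()` returns a t-based confidence interval for the mean, at the 95% level unless another `alpha` is given. As a rough sketch of what that involves, the helper below (`t_confidence_interval`, a name introduced here for illustration) computes the same kind of interval with NumPy and SciPy on made-up prices:

```python
import numpy as np
from scipy import stats

def t_confidence_interval(values, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the mean, using the t distribution."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    standard_error = values.std(ddof=1) / np.sqrt(n)
    margin = stats.t.ppf(1 - alpha / 2, df=n - 1) * standard_error
    return mean - margin, mean + margin

# Hypothetical toy prices, not the pricing dataset.
toy_prices = [30.0, 32.0, 35.0, 36.0, 38.0, 41.0, 45.0]
low, high = t_confidence_interval(toy_prices)
print(low, high)
```

Such an interval is centred on the mean of the pooled observations. The midpoint of the interval reported above is about 37.44, slightly higher than PRICE, because PRICE averages the four category means rather than the pooled prices.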


# Simulation For Item Purchases

We will calculate the incomes that can be obtained from the minimum, maximum values of the confidence interval and the prices we set.

## Assumption 1: Price(36.71096)

In [26]:
#For minimum price in confidence interval
freq = len(df[df["price"]>=36.7109597897918]) #number of sales equal to or greater than this price
income = freq * 36.71096 #income
print("Income: ", income)

Income:  38436.37512


## Assumption 2: Price(37.09238)

In [27]:
# For decided price
freq = len(df[df["price"]>=37.09238177238653]) #number of sales equal to or greater than this price
income = freq * 37.09238177238653 #income
print("Income: ", income)

Income:  37611.67511719994


## Assumption 3: Price(38.17576)

In [28]:
# For maximum price in confidence interval
freq = len(df[df["price"]>=38.17576299427283])
income = freq * 38.17576299427283
print("Income: ",income)

Income:  35388.93229569092


## SUMMARY

- A statistical test was applied to see whether prices differ by category.
   - The assumptions of the test were checked.
   - All categories rejected the normal distribution hypothesis, so a non-parametric independent two-sample test was applied.
- We checked whether there were statistically significant differences between the categories and set a price for the item accordingly.
- A confidence interval was computed because flexibility was desired in terms of price.
- Item purchases and income were simulated for possible price changes within the confidence interval.