# Case Study on Testing of Hypothesis
# A company has started investing in digital marketing as a new way to promote its products. It collected sales data and decided to carry out a study on it.

# ● The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
# ● The company needs to check whether there is any dependency between the features “Region” and “Manager”.

# Help the company carry out the study using the data provided.

In [None]:
# Import the required libraries and load the dataset 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
Sales_ad_data = pd.read_csv('Sales_add.csv')

# First, analyse the data using simple python functions

In [3]:
#Check the column names  
Sales_ad_data.columns

Index(['Month', 'Region', 'Manager', 'Sales_before_digital_add(in $)',
       'Sales_After_digital_add(in $)'],
      dtype='object')

In [4]:
#Check the sales data head 
Sales_ad_data.head()

,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [8]:
# Check the data types of each column
Sales_ad_data.dtypes

Month                             object
Region                            object
Manager                           object
Sales_before_digital_add(in $)     int64
Sales_After_digital_add(in $)      int64
dtype: object

In [9]:
# Check if any of the columns have null values 
Sales_ad_data.isna().sum()

Month                             0
Region                            0
Manager                           0
Sales_before_digital_add(in $)    0
Sales_After_digital_add(in $)     0
dtype: int64

# ● The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

In [11]:
# Conduct descriptive analysis on the numerical columns (sales before and after the digital ads) of the dataset
Sales_ad_data.describe()

,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


# Inference 1 - The mean, standard deviation, quartiles, minimum and maximum are all higher after digital marketing than before.
# At first glance, then, sales appear to have increased after the company stepped into digital marketing.
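
# The comparison of summary statistics can be made more direct by looking at the month-by-month change. The sketch below is only illustrative: it uses just the five rows shown by head() above rather than the full file, and the short column names `before` and `after` are stand-ins, not the dataset's own.

```python
import pandas as pd

# Only the five rows displayed by head(); the full dataset has 22 rows
sample = pd.DataFrame({
    "before": [132921, 149559, 146278, 152167, 159525],
    "after": [270390, 223334, 244243, 231808, 258402],
})

# Month-by-month change in sales (after minus before)
sample["change"] = sample["after"] - sample["before"]
mean_change = sample["change"].mean()
all_increased = bool((sample["change"] > 0).all())

print(sample)
print("Mean change:", mean_change)
print("Every month increased:", all_increased)
```

# On the full dataset the same idea would apply to the two sales columns directly; the figures printed here describe only the five sample rows.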

In [14]:
# Let's conduct a t-test on the before and after sales columns to see whether sales increased
# after the move to digital marketing
Before_adv_sales = Sales_ad_data['Sales_before_digital_add(in $)']

In [15]:
After_adv_sales = Sales_ad_data['Sales_After_digital_add(in $)']

In [27]:
# The null hypothesis H0: "There is no increase in sales after the digital ads"
# The alternative hypothesis H1: "There is an increase in sales after the digital ads"

from scipy import stats
# ttest_ind runs a two-sided test that treats the two columns as independent samples;
# a positive test statistic means the "after" mean is the larger one
ttest, pvalue = stats.ttest_ind(After_adv_sales, Before_adv_sales)
print("P value is :", pvalue)
print("Test statistic is:", ttest)

if pvalue < 0.05:
    print('Null hypothesis rejected. It means that there is an increase in sales after the digital ads')
else:
    print("Failed to reject the null hypothesis")

P value is : 2.614368006904645e-16
Test statistic is: 12.995084451110877
Null hypothesis rejected. It means that there is an increase in sales after the digital ads


# Inference 2 - The test indicates a statistically significant increase in sales after stepping into digital marketing (p < 0.05).
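
# Because the before and after figures are recorded for the same months, a paired, one-sided t-test is a natural alternative that matches the stated hypothesis of an increase. The following is a minimal sketch, not the notebook's own method: it uses only the five rows shown by head() above, and it assumes SciPy 1.6 or newer for the `alternative` argument of `stats.ttest_rel`.

```python
from scipy import stats

# Only the five rows displayed by head(); the full dataset has 22 rows
before = [132921, 149559, 146278, 152167, 159525]
after = [270390, 223334, 244243, 231808, 258402]

# Paired t-test on the month-by-month differences.
# alternative="greater" tests H1: mean(after - before) > 0
# (the `alternative` argument assumes SciPy 1.6 or newer)
result = stats.ttest_rel(after, before, alternative="greater")
t_stat, p_value = result.statistic, result.pvalue

print("Test statistic:", t_stat)
print("One-sided p-value:", p_value)
if p_value < 0.05:
    print("Reject H0: sales increased in this sample")
else:
    print("Failed to reject H0")
```

# With the full data, the two sales columns could be passed in place of the `before` and `after` lists.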

# ● The company needs to check whether there is any dependency between the features “Region” and “Manager”.

# Stating the Hypothesis:

# Null Hypothesis H0: The features 'Region' and 'Manager' are independent.

# Alternative Hypothesis H1: The features 'Region' and 'Manager' are dependent.

In [30]:
Region_Manager_data = Sales_ad_data.groupby('Region')['Manager'].value_counts().unstack()
Region_Manager_data

Region,Manager - A,Manager - B,Manager - C
Region - A,4,3,3
Region - B,4,1,2
Region - C,1,3,1
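
# A contingency table like the one above can also be built with `pd.crosstab`, which fills Region/Manager combinations that never occur with 0, whereas a groupby/value_counts/unstack chain can leave NaN in such cells. This sketch is illustrative only: it uses just the five rows shown by head() earlier, so its counts differ from the full table.

```python
import pandas as pd

# Only the five rows displayed by head(); the full dataset has 22 rows
sample = pd.DataFrame({
    "Region": ["Region - A", "Region - A", "Region - B", "Region - B", "Region - C"],
    "Manager": ["Manager - A", "Manager - C", "Manager - A", "Manager - B", "Manager - B"],
})

# Cross-tabulate the two categorical columns; unseen pairs are counted as 0
table = pd.crosstab(sample["Region"], sample["Manager"])
print(table)
```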


In [31]:
# Let's use the chi square test to find the inter-dependency of 'Region' and 'Manager'
#chi2_rm: The test statistic used 
#p_rm: The p-value of the test
#dof_rm: Degrees of freedom
#expected_rm: The expected frequencies, based on the marginal sums of the table

chi2_rm,p_rm,dof_rm,expected_rm = stats.chi2_contingency(Region_Manager_data)

In [32]:
chi2_rm

3.050566893424036

In [33]:
p_rm

0.5493991051158094

In [34]:
dof_rm

4

In [35]:
expected_rm

array([[4.09090909, 3.18181818, 2.72727273],
       [2.86363636, 2.22727273, 1.90909091],
       [2.04545455, 1.59090909, 1.36363636]])
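
# The expected frequencies above follow from the marginal totals: each cell is (row total x column total) / grand total. As a cross-check, the sketch below recomputes them by hand from the observed Region x Manager counts and rebuilds the test statistic, degrees of freedom and p-value. The variable names (`observed`, `chi2_manual` and so on) are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the Region x Manager table above
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

row_totals = observed.sum(axis=1)   # totals per Region
col_totals = observed.sum(axis=0)   # totals per Manager
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value from the upper tail of the chi-square distribution
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

print(expected)
print(chi2_manual, dof_manual, p_manual)
```

# The hand-computed values should agree with `chi2_rm`, `dof_rm`, `p_rm` and `expected_rm` returned by `stats.chi2_contingency`.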

In [36]:
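# Critical value of the chi-square distribution at the 5% significance level with 4 degrees of freedom
chi2_crit_value = stats.chi2.ppf(1-0.05,4)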
chi2_crit_value

9.487729036781154

In [37]:
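# The two criteria agree: the statistic exceeds the critical value exactly when the p-value is below 0.05
if chi2_rm > chi2_crit_value and p_rm < 0.05:
    print("Null hypothesis rejected.")
else:
    print("Failed to reject the null hypothesis.")

Failed to reject the null hypothesis.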


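# Here the chi-square statistic (3.05) is below the critical value (9.49) and the p-value (0.549) is above 0.05, so we fail to reject the null hypothesis: the data show no evidence of a dependency between 'Region' and 'Manager'. Note that every expected frequency is below 5, so the chi-square approximation is rough for a sample of this size.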

In [None]:
# Submitted by viswaraj Chandran