# Week08_Case Study on Testing of Hypothesis

#### A company has started investing in digital marketing as a new way to promote its products. It collected data on this and decided to carry out a study.
#### ● The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.
#### ● The company needs to check whether there is any dependency between the features “Region” and “Manager”.

#### Help the company carry out its study using the data provided.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data=pd.read_csv('Sales_add.csv')
data.head()

,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402


In [3]:
data.shape

(22, 5)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [5]:
data.describe()

,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


# The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

#### Level of significance, alpha = 0.05 (i.e. a 95% confidence level)

#### The two hypotheses for this one-tailed two-sample t-test are as follows, where µ1 is the mean sales before digital marketing and µ2 is the mean sales after:

#### Null Hypothesis, H0 : µ1 >= µ2 (no increase in sales after stepping into digital marketing)

#### Alternative Hypothesis, HA : µ1 < µ2 (increase in sales after stepping into digital marketing)

In [6]:
d1=data['Sales_before_digital_add(in $)']
d2=data['Sales_After_digital_add(in $)']

In [7]:
d1.mean()

149239.95454545456

In [8]:
d2.mean()

231123.72727272726

#### 1 tailed 2 sample t-test

In [9]:
import scipy.stats as stats

test=stats.ttest_ind(d1, d2, equal_var=True)
display(test)

Ttest_indResult(statistic=-12.995084451110877, pvalue=2.614368006904645e-16)

In [10]:
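# Halve the two-sided p-value to get the one-tailed p-value.
# This is valid here because the statistic is negative, i.e. in the direction of HA (µ1 < µ2).
print("p-value is : ", test.pvalue/2)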

p-value is :  1.3071840034523225e-16
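The halving step above can also be delegated to SciPy. The sketch below assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument, and it uses only the five rows shown by `data.head()`, so its numbers will not match the full 22-row result. The names `before`, `after`, `two_sided` and `one_sided` are illustrative.

```python
import numpy as np
from scipy import stats

# Only the five preview rows from data.head(); the full file has 22 rows.
before = np.array([132921, 149559, 146278, 152167, 159525])
after = np.array([270390, 223334, 244243, 231808, 258402])

# Two-sided test, as in the cell above.
two_sided = stats.ttest_ind(before, after, equal_var=True)

# One-sided test of HA: mean(before) < mean(after).
# Assumes SciPy >= 1.6, where the `alternative` keyword is available.
one_sided = stats.ttest_ind(before, after, equal_var=True, alternative="less")

print("two-sided p / 2 :", two_sided.pvalue / 2)
print("one-sided p     :", one_sided.pvalue)
```

When the statistic is negative, the two printed values agree. If the statistic were positive, halving the two-sided p-value would give the wrong tail, whereas `alternative="less"` would still report the correct one-sided p-value.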


# Insights :

#### Because the one-tailed p-value of our test (1.3071840034523225e-16) is less than alpha = 0.05, we reject the null hypothesis. That is, the data indicate a statistically significant increase in sales after stepping into digital marketing.
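The test above treats the two sales columns as independent samples. If each row is instead treated as one matched before/after observation for the same month (an assumption about how the data were collected, not something the source states), a paired t-test is a possible alternative. A minimal sketch, again using only the five preview rows and assuming SciPy 1.6 or later for the `alternative` keyword:

```python
import numpy as np
from scipy import stats

# Only the five preview rows from data.head(); the full file has 22 rows.
before = np.array([132921, 149559, 146278, 152167, 159525])
after = np.array([270390, 223334, 244243, 231808, 258402])

# Paired (dependent-samples) t-test on the row-wise differences.
# HA: mean(before - after) < 0, i.e. sales increased.
paired = stats.ttest_rel(before, after, alternative="less")

print("paired statistic :", paired.statistic)
print("one-sided p      :", paired.pvalue)
```

On these five rows the paired test also rejects the null hypothesis at alpha = 0.05, so the conclusion does not depend on which of the two designs is assumed.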

# The company needs to check whether there is any dependency between the features “Region” and “Manager”.

#### Level of significance, alpha = 0.05 (i.e. a 95% confidence level)

#### The two hypotheses for this chi-square test of independence are as follows:

#### Null Hypothesis, H0 : no dependency between the features "Region" and "Manager"

#### Alternative Hypothesis, HA : there is dependency between the features "Region" and "Manager"

In [11]:
dr=pd.crosstab(data['Region'], data['Manager'])
print(dr)

Manager     Manager - A  Manager - B  Manager - C
Region                                           
Region - A            4            3            3
Region - B            4            1            2
Region - C            1            3            1


In [12]:
Observed_Values=dr.values
print("Observed Values :\n", Observed_Values)

Observed Values :
 [[4 3 3]
 [4 1 2]
 [1 3 1]]


In [13]:
val=stats.chi2_contingency(dr)

In [14]:
val

(3.050566893424036,
 0.5493991051158094,
 4,
 array([[4.09090909, 3.18181818, 2.72727273],
        [2.86363636, 2.22727273, 1.90909091],
        [2.04545455, 1.59090909, 1.36363636]]))
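The four values returned above are, in order, the chi-square statistic, the p-value, the degrees of freedom and the expected frequencies. A small sketch that unpacks them by name, with the observed counts copied from the crosstab (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

# Observed counts copied from the Region x Manager crosstab above.
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

# Tuple unpacking works with both older and newer SciPy return types.
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-square statistic :", chi2_stat)
print("p-value              :", p_value)
print("Degrees of freedom   :", dof)
print("Expected frequencies :\n", expected)
```

Unpacking by name avoids positional indexing such as `val[3]`, which is easy to misread later.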

In [15]:
Expected_Values=val[3]
Expected_Values

array([[4.09090909, 3.18181818, 2.72727273],
       [2.86363636, 2.22727273, 1.90909091],
       [2.04545455, 1.59090909, 1.36363636]])
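Under independence, each expected count is the row total times the column total divided by the grand total. The sketch below recomputes the expected values from the marginals of the observed table, as a cross-check on what `chi2_contingency` returned (the variable names are illustrative):

```python
import numpy as np

# Observed counts copied from the Region x Manager crosstab above.
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

row_totals = observed.sum(axis=1)   # totals per Region
col_totals = observed.sum(axis=0)   # totals per Manager
grand_total = observed.sum()        # 22 rows in the dataset

# E[i, j] = row_total[i] * col_total[j] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

For example, the first cell is 10 * 9 / 22, which is about 4.09 and matches the first entry of `Expected_Values`.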

In [16]:
# Finding degrees of freedom

no_of_rows, no_of_columns=dr.shape
ddof=(no_of_rows-1)*(no_of_columns-1)
print("Degrees of freedom : ", ddof)
alpha=0.05

Degrees of freedom :  4


In [17]:
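# Finding Chi-Square statistic

from scipy.stats import chi2
# Summing the per-row arrays of (O-E)^2/E element-wise gives one total per column
chi_square=sum([(o-e)**2/e for o,e in zip(Observed_Values, Expected_Values)] )
# Adding the three column totals gives the Chi-Square statistic
chi_square_statistic=chi_square[0]+chi_square[1]+chi_square[2]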

In [18]:
print("Chi-Square statistic : ", chi_square_statistic)

Chi-Square statistic :  3.0505668934240364


In [19]:
# Finding Critical Value

critical_value=chi2.ppf(q=1-alpha, df=ddof)
print("Critical Value : ", critical_value)

Critical Value :  9.487729036781154


In [20]:
# Finding p-value

p_value=1-chi2.cdf(x=chi_square_statistic, df=ddof)
print("p-value : ", p_value)
print("Significance Level : ", alpha)
print("Degree of freedom : ", ddof)

p-value :  0.5493991051158094
Significance Level :  0.05
Degree of freedom :  4
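The critical value and the p-value are two views of the same upper-tail probability, so the two decision rules always agree. The sketch below uses SciPy's survival function `chi2.sf` and its inverse `chi2.isf` in place of `1 - cdf` and `ppf(1 - alpha)`. The statistic and degrees of freedom are copied from the outputs above.

```python
from scipy.stats import chi2

chi_square_statistic = 3.0505668934240364  # value printed above
dof = 4                                    # (3 - 1) * (3 - 1)
alpha = 0.05

# Upper-tail probability of the statistic under H0.
p_value = chi2.sf(chi_square_statistic, df=dof)

# Value whose upper-tail probability equals alpha.
critical_value = chi2.isf(alpha, df=dof)

print("p-value        :", p_value)
print("Critical value :", critical_value)
print("Reject H0?     :", p_value <= alpha)
```

For very small p-values, `sf` tends to be numerically more accurate than `1 - cdf`, although here the two give essentially the same result.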


In [21]:
# Comparing chi_square_statistic vs critical_value

print("On the basis of comparison between chi_square_statistic & critical_value : ")
if chi_square_statistic>=critical_value:
    print('Reject H0 : There is a relationship between the features "Region" and "Manager"')
else:
    print('Retain H0 : There is no relationship between the features "Region" and "Manager"')

On the basis of comparison between chi_square_statistic & critical_value : 
Retain H0 : There is no relationship between the features "Region" and "Manager"


In [22]:
# Comparing p_value vs alpha

print("On the basis of comparison between p_value & alpha : ")
if p_value<=alpha:
    print('Reject H0 : There is a relationship between the features "Region" and "Manager"')
else:
    print('Retain H0 : There is no relationship between the features "Region" and "Manager"')

On the basis of comparison between p_value & alpha : 
Retain H0 : There is no relationship between the features "Region" and "Manager"


# Insights :

#### We have made comparison between the below 2 :
#### a) chi_square_statistic vs critical_value
#### b) p_value vs alpha

#### Since chi_square_statistic (3.05) is less than critical_value (9.49), and equivalently p_value (0.549) is greater than alpha (0.05), we fail to reject the null hypothesis. That is, the data provide no evidence of a relationship/dependency between the features "Region" and "Manager".
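One caveat on the chi-square result: a common rule of thumb is that the chi-square approximation is reliable when most expected counts are at least 5. With only 22 rows spread over a 3 x 3 table, that condition is worth checking. A minimal sketch of the check, where the threshold of 5 is the conventional guideline rather than a hard rule:

```python
import numpy as np

# Observed counts copied from the Region x Manager crosstab above.
observed = np.array([[4, 3, 3],
                     [4, 1, 2],
                     [1, 3, 1]])

# Expected counts under independence, from the row and column totals.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Flag cells whose expected count falls below the conventional threshold of 5.
small_cells = expected < 5
print("Cells with expected count < 5 :", small_cells.sum(), "of", expected.size)
print("Largest expected count        :", expected.max())
```

Every cell falls below the threshold here, so the conclusion is best read as "no evidence of dependency in a small sample" rather than as proof of independence.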