In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

A company has started investing in digital marketing as a new way to promote its products.
It collected sales data and decided to carry out a study on it.
● The company wants to know whether sales increased after it moved into digital marketing.
● The company wants to check whether there is any dependency between the features “Region” and “Manager”.

In [2]:
# Read the dataset into a pandas DataFrame
data = pd.read_csv('Sales_add.csv')

In [3]:
data

,Month,Region,Manager,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
0,Month-1,Region - A,Manager - A,132921,270390
1,Month-2,Region - A,Manager - C,149559,223334
2,Month-3,Region - B,Manager - A,146278,244243
3,Month-4,Region - B,Manager - B,152167,231808
4,Month-5,Region - C,Manager - B,159525,258402
5,Month-6,Region - A,Manager - B,137163,256948
6,Month-7,Region - C,Manager - C,130625,222106
7,Month-8,Region - A,Manager - A,131140,230637
8,Month-9,Region - B,Manager - C,171259,226261
9,Month-10,Region - C,Manager - B,141956,193735


In [4]:
data.shape

(22, 5)

In [5]:
data.columns

Index(['Month', 'Region', 'Manager', 'Sales_before_digital_add(in $)',
       'Sales_After_digital_add(in $)'],
      dtype='object')

The data set has 22 rows and 5 columns.

In [6]:
data.isna().sum()  # no missing values in any column

Month                             0
Region                            0
Manager                           0
Sales_before_digital_add(in $)    0
Sales_After_digital_add(in $)     0
dtype: int64

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 5 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Month                           22 non-null     object
 1   Region                          22 non-null     object
 2   Manager                         22 non-null     object
 3   Sales_before_digital_add(in $)  22 non-null     int64 
 4   Sales_After_digital_add(in $)   22 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 1008.0+ bytes


In [41]:
data.describe() ## statistical summary

,Sales_before_digital_add(in $),Sales_After_digital_add(in $)
count,22.0,22.0
mean,149239.954545,231123.727273
std,14844.042921,25556.777061
min,130263.0,187305.0
25%,138087.75,214960.75
50%,147444.0,229986.5
75%,157627.5,250909.0
max,178939.0,276279.0


Mean sales are higher after digital marketing (about 231,124 versus about 149,240 before).

# The company wishes to clarify whether there is any increase in sales after stepping into digital marketing.

*The data set has fewer than 30 entries and the population variance is unknown, so a t-test is appropriate rather than a z-test.
*The two samples are dependent: each row pairs a sales figure before digital marketing with one after it.
*So a paired t-test is the natural choice.


In [9]:
## Select the relevant columns
sales_before=data['Sales_before_digital_add(in $)']
sales_before

0     132921
1     149559
2     146278
3     152167
4     159525
5     137163
6     130625
7     131140
8     171259
9     141956
10    159339
11    178939
12    145062
13    151514
14    147463
15    177195
16    140862
17    167996
18    132135
19    152493
20    147425
21    130263
Name: Sales_before_digital_add(in $), dtype: int64

In [10]:
sales_after=data['Sales_After_digital_add(in $)']
sales_after

0     270390
1     223334
2     244243
3     231808
4     258402
5     256948
6     222106
7     230637
8     226261
9     193735
10    203175
11    276279
12    205180
13    253131
14    229336
15    187305
16    234908
17    191517
18    227040
19    212579
20    263388
21    243020
Name: Sales_After_digital_add(in $), dtype: int64

*Null hypothesis (H0): There is no increase in sales after stepping into digital marketing.
*Alternate hypothesis (H1): There is an increase in sales after stepping into digital marketing.

The alternate hypothesis is an increase in sales, so this is a one-tailed (right-tailed) t-test.

In [11]:
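# Test statistic
# Note: ttest_ind treats the two samples as independent (df = 22 + 22 - 2 = 42);
# scipy.stats.ttest_rel is the paired version of the test.
import scipy.stats
tvalue, pvalue = scipy.stats.ttest_ind(sales_after, sales_before, alternative='greater')
print('tvalue=', tvalue)
print('pvalue=', pvalue)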

tvalue= 12.995084451110877
pvalue= 1.3071840034523225e-16
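
Since each row pairs a before figure with an after figure, the paired version of the test can be sketched as below. This is a minimal sketch, not the computation that produced the output above: it assumes SciPy 1.6 or later (for the `alternative` argument of `scipy.stats.ttest_rel`), and it re-enters the 22 values printed earlier so that it runs on its own. The names `t_stat`, `p_value` and `df_paired` are illustrative.

```python
import numpy as np
from scipy import stats

# Values copied from the sales_before / sales_after outputs shown earlier
sales_before = np.array([
    132921, 149559, 146278, 152167, 159525, 137163, 130625, 131140,
    171259, 141956, 159339, 178939, 145062, 151514, 147463, 177195,
    140862, 167996, 132135, 152493, 147425, 130263,
])
sales_after = np.array([
    270390, 223334, 244243, 231808, 258402, 256948, 222106, 230637,
    226261, 193735, 203175, 276279, 205180, 253131, 229336, 187305,
    234908, 191517, 227040, 212579, 263388, 243020,
])

# Paired (dependent-samples) t-test, right-tailed: H1 is mean(after - before) > 0
t_stat, p_value = stats.ttest_rel(sales_after, sales_before, alternative='greater')

# A paired test on n pairs has n - 1 degrees of freedom
df_paired = len(sales_before) - 1

print('t statistic =', t_stat)
print('p-value     =', p_value)
print('df          =', df_paired)
```

The paired test works on the 22 row-wise differences, so its degrees of freedom are 21 rather than 42.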


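*To determine whether the result of the t-test is statistically significant, you can compare the test statistic with a t critical value.
For a right-tailed test, the result is significant if the test statistic is greater than the critical value.

**To find the t critical value in Python, you can use
scipy.stats.t.ppf(q, df)

where:

q: The lower-tail probability of the quantile to return. For a right-tailed test at significance level alpha this is 1 - alpha; because the t distribution is symmetric, q = alpha gives the same value with a negative sign.
df: The degrees of freedom (42 here, i.e. 22 + 22 - 2, which matches the independent-samples test used above)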

In [12]:
tcritical=scipy.stats.t.ppf(q=0.05,df=42)

tcritical

-1.6819523559426006
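
The negative sign above comes from `q=0.05` being a lower-tail probability. A small sketch of the right-tail critical value follows, using only `scipy.stats.t.ppf`. The names `alpha`, `t_crit_right`, `t_crit_left` and `t_crit_paired` are illustrative, and df = 21 is what a paired test on 22 rows would use.

```python
from scipy import stats

alpha = 0.05  # significance level

# Right-tailed critical value: the quantile with 1 - alpha of the mass below it
t_crit_right = stats.t.ppf(1 - alpha, df=42)

# Lower-tail quantile, as in the cell above; equal in magnitude by symmetry
t_crit_left = stats.t.ppf(alpha, df=42)

# A paired test on n = 22 pairs would use n - 1 = 21 degrees of freedom
t_crit_paired = stats.t.ppf(1 - alpha, df=21)

print(t_crit_right, t_crit_left, t_crit_paired)
```

With the right-tail value, the decision rule becomes a plain comparison of the test statistic against the critical value, with no absolute values needed.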

In [13]:
# tcritical is the negative (lower-tail) quantile, so compare magnitudes
if np.abs(tvalue)>np.abs(tcritical) and pvalue <0.05:
    print('Reject Null hypothesis(H0)')
else:
    print('Fail to reject Null hypothesis(H0)')

Reject Null hypothesis(H0)


So there is statistically significant evidence that sales increased after digital marketing.

# The company needs to check whether there is any dependency between the features “Region” and “Manager”.

A chi-square test of independence is used to check whether there is a relationship between two categorical variables.


**Null hypothesis (H0): Region and Manager are independent.
**Alternate hypothesis (H1): Region and Manager are dependent.

In [14]:
con_data=pd.crosstab(data['Region'],data['Manager'],margins=False)

In [15]:
con_data

Region,Manager - A,Manager - B,Manager - C
Region - A,4,3,3
Region - B,4,1,2
Region - C,1,3,1


In [16]:
# Row-wise proportions: share of each manager within a region
contingency_pct = pd.crosstab(data['Region'], data['Manager'], normalize='index')
contingency_pct

Region,Manager - A,Manager - B,Manager - C
Region - A,0.4,0.3,0.3
Region - B,0.571429,0.142857,0.285714
Region - C,0.2,0.6,0.2


In [17]:
from scipy.stats import chi2_contingency

In [18]:
# Returns the chi-square statistic, p-value, degrees of freedom and expected frequencies
stst,p,dof,exp=chi2_contingency(con_data)
p

0.5493991051158094

The p-value is about 0.549 (54.9%), which is far above 0.05, so we fail to reject the null hypothesis at the 5% significance level. There is no evidence that Region and Manager are dependent.

In [19]:
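# Note: this is the 5th percentile for 84 degrees of freedom; the test on this 3 x 3 table
# has dof = 4, whose upper-tail 5% critical value is chi2.ppf(0.95, dof).
chi_crit=scipy.stats.chi2.ppf(0.05, 84)
chi_crit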

63.87626144303417
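
For comparison, here is a hedged sketch of the critical value taken from the degrees of freedom that `chi2_contingency` reports, at the upper tail of the distribution. The counts are re-entered from the contingency table shown earlier so that the block runs on its own; `alpha` and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts copied from the Region x Manager table shown earlier
observed = np.array([
    [4, 3, 3],  # Region - A
    [4, 1, 2],  # Region - B
    [1, 3, 1],  # Region - C
])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
# Upper-tail critical value: reject H0 only if the statistic exceeds it
chi2_crit = chi2.ppf(1 - alpha, dof)

print('dof =', dof)
print('chi-square statistic =', chi2_stat)
print('critical value =', chi2_crit)
print('p-value =', p_value)

if chi2_stat > chi2_crit:
    print('Reject H0: Region and Manager appear dependent')
else:
    print('Fail to reject H0: no evidence of dependence')
```

With these counts the statistic falls below the critical value, which agrees with the p-value of about 0.549 reported above.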

In [55]:
# Use p from chi2_contingency here, not pvalue from the earlier t-test
if np.abs(stst)>np.abs(chi_crit) and p <0.05:
    print('Reject Null hypothesis(H0): Region and Manager are dependent')
else:
    print('Fail to reject Null hypothesis(H0): no evidence that Region and Manager are dependent')

Fail to reject Null hypothesis(H0): no evidence that Region and Manager are dependent


There is no evidence of a dependency between Region and Manager, so they can be treated as independent.