## Case Study on ANOVA

# XYZ Company has offices in four different zones. The company wishes to investigate the following:
● The mean sales generated by each zone.
● Total sales generated by all the zones for each month.
● Check whether all the zones generate the same amount of sales.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Read the dataset into the Python environment
data=pd.read_csv('Sales_data_zone_wise.csv')

In [4]:
data

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D
0,Month - 1,1483525,1748451,1523308,2267260
1,Month - 2,1238428,1707421,2212113,1994341
2,Month - 3,1860771,2091194,1282374,1241600
3,Month - 4,1871571,1759617,2290580,2252681
4,Month - 5,1244922,1606010,1818334,1326062
5,Month - 6,1534390,1573128,1751825,2292044
6,Month - 7,1820196,1992031,1786826,1688055
7,Month - 8,1625696,1665534,2161754,2363315
8,Month - 9,1652644,1873402,1755290,1422059
9,Month - 10,1852450,1913059,1754314,1608387


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


The dataset describes the monthly sales generated by four different zones. It contains 29 monthly observations.

In [6]:
data.isna().sum()  # Count missing values per column; all zeros means the dataset is clean

Month       0
Zone - A    0
Zone - B    0
Zone - C    0
Zone - D    0
dtype: int64

In [7]:
## Statistical summary
data.describe()

Unnamed: 0,Zone - A,Zone - B,Zone - C,Zone - D
count,29.0,29.0,29.0,29.0
mean,1540493.0,1755560.0,1772871.0,1842927.0
std,261940.1,168389.9,333193.7,375016.5
min,1128185.0,1527574.0,1237722.0,1234311.0
25%,1305972.0,1606010.0,1523308.0,1520406.0
50%,1534390.0,1740365.0,1767047.0,1854412.0
75%,1820196.0,1875658.0,2098463.0,2180416.0
max,2004480.0,2091194.0,2290580.0,2364132.0


The mean sales differ across the four zones: Zone A has the lowest mean (about 1.54 million) and Zone D the highest (about 1.84 million).

# The mean sales generated by each zone.

In [8]:
Zone_A=data['Zone - A'].mean()
Zone_A

1540493.1379310344

In [9]:
Zone_B=data['Zone - B'].mean()
Zone_B

1755559.5862068965

In [10]:
Zone_C=data['Zone - C'].mean()
Zone_C

1772871.0344827587

In [11]:
Zone_D=data['Zone - D'].mean()
Zone_D

1842926.7586206896

In [12]:
mean=data[["Zone - A", "Zone - B","Zone - C","Zone - D"]].mean()
mean

Zone - A    1.540493e+06
Zone - B    1.755560e+06
Zone - C    1.772871e+06
Zone - D    1.842927e+06
dtype: float64

# Total sales generated by all the zones for each month.


In [13]:
data['Total_Sales']=data['Zone - A']+data['Zone - B']+data['Zone - C']+data['Zone - D']

In [14]:
data

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D,Total_Sales
0,Month - 1,1483525,1748451,1523308,2267260,7022544
1,Month - 2,1238428,1707421,2212113,1994341,7152303
2,Month - 3,1860771,2091194,1282374,1241600,6475939
3,Month - 4,1871571,1759617,2290580,2252681,8174449
4,Month - 5,1244922,1606010,1818334,1326062,5995328
5,Month - 6,1534390,1573128,1751825,2292044,7151387
6,Month - 7,1820196,1992031,1786826,1688055,7287108
7,Month - 8,1625696,1665534,2161754,2363315,7816299
8,Month - 9,1652644,1873402,1755290,1422059,6703395
9,Month - 10,1852450,1913059,1754314,1608387,7128210
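The same monthly total can also be computed with a row-wise sum over the zone columns, which avoids listing each column by hand. The sketch below is a minimal illustration that rebuilds only the first three months shown above; the names `sample` and `zone_cols` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# First three months from the table above (the full dataset has 29 rows).
sample = pd.DataFrame({
    "Month": ["Month - 1", "Month - 2", "Month - 3"],
    "Zone - A": [1483525, 1238428, 1860771],
    "Zone - B": [1748451, 1707421, 2091194],
    "Zone - C": [1523308, 2212113, 1282374],
    "Zone - D": [2267260, 1994341, 1241600],
})

# Select every zone column by its name prefix, then sum across each row.
zone_cols = [c for c in sample.columns if c.startswith("Zone - ")]
sample["Total_Sales"] = sample[zone_cols].sum(axis=1)

print(sample["Total_Sales"].tolist())  # [7022544, 7152303, 6475939]
```

Selecting the columns by prefix means the code keeps working if another zone column is added later.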


# Check whether all the zones generate the same amount of sales.

ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

Null hypothesis (H0): All the zones generate the same mean amount of sales.
Alternate hypothesis (H1): At least one zone generates a mean amount of sales that differs from the others.
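To make the test statistic concrete, the sketch below computes the one-way ANOVA F-statistic by hand as the ratio of between-group to within-group mean squares, and compares it with `scipy.stats.f_oneway`. It assumes only the ten months visible in the table above, so its numbers need not match a run on the full 29-row dataset. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import f, f_oneway

# First ten months shown in the table above (the full dataset has 29 rows).
zones = {
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922,
                 1534390, 1820196, 1625696, 1652644, 1852450],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010,
                 1573128, 1992031, 1665534, 1873402, 1913059],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334,
                 1751825, 1786826, 2161754, 1755290, 1754314],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062,
                 2292044, 1688055, 2363315, 1422059, 1608387],
}
groups = [np.asarray(v, dtype=float) for v in zones.values()]

k = len(groups)                          # number of groups (zones)
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom.
p_manual = f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```

A large F means the zone means vary more than the month-to-month variation within each zone would explain.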

In [16]:
from scipy.stats import f_oneway
fstat,pvalue=f_oneway(data['Zone - A'],data['Zone - B'],data['Zone - C'],data['Zone - D'])
print('F-statistic=',fstat,'Pvalue=',pvalue)

F-statistic= 5.672056106843581 Pvalue= 0.0011827601694503335


In [18]:
if pvalue<=0.05:
    print("Reject Null hypothesis")
else:
    print("Fail to reject Null hypothesis")

Reject Null hypothesis


Since the p-value (about 0.0012) is below 0.05, we reject the null hypothesis: at least one zone's mean sales differs from the others.
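One-way ANOVA does not say which zones differ from each other. One possible follow-up, not part of the original analysis, is a post hoc pairwise comparison such as Tukey's HSD. The sketch below assumes SciPy 1.8 or later (which provides `scipy.stats.tukey_hsd`) and uses only the ten months visible in the table above, so its p-values are illustrative and need not match the full dataset.

```python
from itertools import combinations
from scipy.stats import tukey_hsd

# First ten months shown in the table above (the full dataset has 29 rows).
zones = {
    "Zone - A": [1483525, 1238428, 1860771, 1871571, 1244922,
                 1534390, 1820196, 1625696, 1652644, 1852450],
    "Zone - B": [1748451, 1707421, 2091194, 1759617, 1606010,
                 1573128, 1992031, 1665534, 1873402, 1913059],
    "Zone - C": [1523308, 2212113, 1282374, 2290580, 1818334,
                 1751825, 1786826, 2161754, 1755290, 1754314],
    "Zone - D": [2267260, 1994341, 1241600, 2252681, 1326062,
                 2292044, 1688055, 2363315, 1422059, 1608387],
}
names = list(zones)

# Tukey's HSD compares every pair of groups while controlling the family-wise error rate.
result = tukey_hsd(*zones.values())

# result.pvalue is a square matrix; entry [i, j] is the p-value for group i vs group j.
for i, j in combinations(range(len(names)), 2):
    print(f"{names[i]} vs {names[j]}: p = {result.pvalue[i, j]:.4f}")
```

Pairs with a p-value below the chosen significance level (0.05 in this notebook) would be the ones whose mean sales differ.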