# Case Study on ANOVA

##### XYZ Company has offices in four different zones. The company wishes to investigate the following:

##### ● The mean sales generated by each zone.

##### ● Total sales generated by all the zones for each month.

##### ● Check whether all the zones generate the same amount of sales.

##### The given dataset is 'Sales_data_zone_wise.csv'. Help the company to carry out their study with the help of data provided.

## Importing the modules, reading and analysing the dataset

In [15]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import stats

In [22]:
zone_data = pd.read_csv(r'C:\Users\Dell\Downloads\Sales_data_zone_wise.csv')
zone_data.head()

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D
0,Month - 1,1483525,1748451,1523308,2267260
1,Month - 2,1238428,1707421,2212113,1994341
2,Month - 3,1860771,2091194,1282374,1241600
3,Month - 4,1871571,1759617,2290580,2252681
4,Month - 5,1244922,1606010,1818334,1326062


In [23]:
zone_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Month     29 non-null     object
 1   Zone - A  29 non-null     int64 
 2   Zone - B  29 non-null     int64 
 3   Zone - C  29 non-null     int64 
 4   Zone - D  29 non-null     int64 
dtypes: int64(4), object(1)
memory usage: 1.3+ KB


## 1. The mean sales generated by each zone.

In [24]:
zone_data.describe()

Unnamed: 0,Zone - A,Zone - B,Zone - C,Zone - D
count,29.0,29.0,29.0,29.0
mean,1540493.0,1755560.0,1772871.0,1842927.0
std,261940.1,168389.9,333193.7,375016.5
min,1128185.0,1527574.0,1237722.0,1234311.0
25%,1305972.0,1606010.0,1523308.0,1520406.0
50%,1534390.0,1740365.0,1767047.0,1854412.0
75%,1820196.0,1875658.0,2098463.0,2180416.0
max,2004480.0,2091194.0,2290580.0,2364132.0


In [28]:
print('Average sales in Zone - A : ', zone_data['Zone - A'].mean())
print('Average sales in Zone - B : ', zone_data['Zone - B'].mean())
print('Average sales in Zone - C : ', zone_data['Zone - C'].mean())
print('Average sales in Zone - D : ', zone_data['Zone - D'].mean())

Average sales in Zone - A :  1540493.1379310344
Average sales in Zone - B :  1755559.5862068965
Average sales in Zone - C :  1772871.0344827587
Average sales in Zone - D :  1842926.7586206896


#### Insights:
    
The analysis above shows that Zone - D has the highest mean sales, whereas Zone - A has the lowest. The standard deviation is lowest for Zone - B and highest for Zone - D, which means that sales in Zone - B are the most consistent from month to month and sales in Zone - D are the least consistent.
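The comparison of means and spread described above can also be produced for all zones in a single step. The following is a minimal sketch that uses only the five rows shown by `zone_data.head()`, so its numbers differ from the full 29-row results; the names `sample`, `summary` and `cv` are illustrative and not part of the original notebook.

```python
import pandas as pd

# First five months as displayed by zone_data.head(); the full file has 29 rows.
sample = pd.DataFrame({
    'Month': ['Month - 1', 'Month - 2', 'Month - 3', 'Month - 4', 'Month - 5'],
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062],
})

zone_cols = ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D']

# One row per zone, with the mean and standard deviation as columns.
summary = sample[zone_cols].agg(['mean', 'std']).T

# Coefficient of variation: spread relative to the mean (illustrative extra column).
summary['cv'] = summary['std'] / summary['mean']

print(summary.sort_values('mean', ascending=False))
```

With the full dataset loaded, `zone_data` would take the place of `sample`.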

## 2. Total sales generated by all the zones for each month.

In [26]:
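def sum_frame_by_column(frame, list_of_cols_to_sum, new_col_name):
    # Add the row-wise sum of the listed columns as a new column (modifies frame in place)
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame

# Total monthly sales across the four zones
sum_frame_by_column(zone_data, ['Zone - A', 'Zone - B', 'Zone - C', 'Zone - D'], 'Total_sales')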

Unnamed: 0,Month,Zone - A,Zone - B,Zone - C,Zone - D,Total_sales
0,Month - 1,1483525,1748451,1523308,2267260,7022544.0
1,Month - 2,1238428,1707421,2212113,1994341,7152303.0
2,Month - 3,1860771,2091194,1282374,1241600,6475939.0
3,Month - 4,1871571,1759617,2290580,2252681,8174449.0
4,Month - 5,1244922,1606010,1818334,1326062,5995328.0
5,Month - 6,1534390,1573128,1751825,2292044,7151387.0
6,Month - 7,1820196,1992031,1786826,1688055,7287108.0
7,Month - 8,1625696,1665534,2161754,2363315,7816299.0
8,Month - 9,1652644,1873402,1755290,1422059,6703395.0
9,Month - 10,1852450,1913059,1754314,1608387,7128210.0


#### Insights:

We can see that the highest total sales were generated in Month - 4 and the lowest in Month - 13. The total monthly sales range from 5925424.0 to 8174449.0.
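The months with the highest and lowest totals can be looked up directly rather than read off the table. A minimal sketch, assuming only the ten `Total_sales` values displayed above (the `totals` series and the variable names are illustrative). Because Month - 13 is not among these ten rows, the minimum found here differs from the full-data minimum reported above.

```python
import pandas as pd

# Total_sales for the ten months displayed in the output above.
totals = pd.Series(
    [7022544.0, 7152303.0, 6475939.0, 8174449.0, 5995328.0,
     7151387.0, 7287108.0, 7816299.0, 6703395.0, 7128210.0],
    index=[f'Month - {i}' for i in range(1, 11)],
    name='Total_sales',
)

# idxmax / idxmin return the index label (the month) of the extreme value.
best_month = totals.idxmax()
worst_month = totals.idxmin()

print('Highest total:', best_month, totals.max())
print('Lowest total: ', worst_month, totals.min())
```

On the full frame, the same lookup would be `zone_data.set_index('Month')['Total_sales'].idxmax()` and the matching `idxmin()` call.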

## 3. Check whether all the zones generate the same amount of sales.

Null Hypothesis (H0) = The mean sales are the same across all four zones

Alternate Hypothesis (H1) = The mean sales of at least one zone differ from the others

Significance level (alpha) = 0.05

In [29]:
f_value, p_value = stats.f_oneway(zone_data['Zone - A'], zone_data['Zone - B'], zone_data['Zone - C'], zone_data['Zone - D'])
print('f-value is: ', f_value)
print('p-value is: ', p_value)

f-value is:  5.672056106843581
p-value is:  0.0011827601694503335
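The F statistic in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand and checks it against `stats.f_oneway`. It uses only the first five months shown by `zone_data.head()`, so its values will not match the full-data output above; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# First five months per zone, taken from the zone_data.head() output.
groups = [
    np.array([1483525, 1238428, 1860771, 1871571, 1244922], dtype=float),  # Zone - A
    np.array([1748451, 1707421, 2091194, 1759617, 1606010], dtype=float),  # Zone - B
    np.array([1523308, 2212113, 1282374, 2290580, 1818334], dtype=float),  # Zone - C
    np.array([2267260, 1994341, 1241600, 2252681, 1326062], dtype=float),  # Zone - D
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Upper-tail probability of the F distribution gives the p-value.
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print('manual:', f_manual, p_manual)
print('scipy: ', f_scipy, p_scipy)
```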


#### Insights:

Since the p-value (0.0011827601694503335) is well below the significance level (0.05), we reject the Null Hypothesis in favour of the Alternate Hypothesis. The mean sales are therefore not the same across the four zones: at least one zone's mean differs from the others. The test alone does not show which zones differ.
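A significant ANOVA result does not by itself identify which zones differ from each other. One common follow-up, which is not part of the original notebook, is to compare the zones pairwise and correct for multiple comparisons. The sketch below uses Welch's t-test with a Bonferroni correction on the ten months displayed earlier; the choice of test, the `zones` dictionary and the other names are assumptions made for illustration, and results on the full 29 months may differ.

```python
from itertools import combinations
from scipy import stats

# Ten months per zone, taken from the Total_sales table output.
zones = {
    'Zone - A': [1483525, 1238428, 1860771, 1871571, 1244922,
                 1534390, 1820196, 1625696, 1652644, 1852450],
    'Zone - B': [1748451, 1707421, 2091194, 1759617, 1606010,
                 1573128, 1992031, 1665534, 1873402, 1913059],
    'Zone - C': [1523308, 2212113, 1282374, 2290580, 1818334,
                 1751825, 1786826, 2161754, 1755290, 1754314],
    'Zone - D': [2267260, 1994341, 1241600, 2252681, 1326062,
                 2292044, 1688055, 2363315, 1422059, 1608387],
}

alpha = 0.05
pairs = list(combinations(zones, 2))  # all 6 unordered zone pairs

results = {}
for a, b in pairs:
    # Welch's t-test: does not assume equal variances between the two zones.
    t_stat, p_raw = stats.ttest_ind(zones[a], zones[b], equal_var=False)
    # Bonferroni correction: multiply by the number of comparisons, cap at 1.
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(a, b)] = p_adj
    print(f'{a} vs {b}: adjusted p = {p_adj:.4f}, significant = {p_adj < alpha}')
```

Bonferroni is conservative; it is used here only because it is simple to state and verify.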