<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Retail Sales Dataset 2018-2022**
## **Lab 3. Data Analysis with Python**

Estimated time needed: **30** minutes

### **Dataset Attributes**
*   Date: year and month
*   SKU: unique code consisting of letters and numbers that identify each product
*   Group: group of related products which share some common attributes
*   Units Pkg: package weight (kg)
*   Avg Price Pkg: average price per package
*   Sales Pkg: total package sales per month

### **Target Field**
*   Turnover per month

## **Objectives**

After completing this lab you will be able to:

*   Explore features or characteristics to predict turnover


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#import_data">Import Data from Module 2</a></li>
    <li><a href="#pattern_visualization">Analyzing Individual Feature Patterns Using Visualization</a></li>
    <li><a href="#discriptive_statistics">Descriptive Statistical Analysis</a></li>
    <li><a href="#correlation_causation">Correlation and Causation</a></li>
    <li><a href="#anova">ANOVA</a></li>
</ol>

</div>

<hr>


<h3>What are the main characteristics that have the most impact on the turnover?</h3>


<h2 id="import_data"><b>1. Import Data from Module 2</b></h2>


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
# install specific version of libraries used in lab
# ! mamba install pandas=1.3.3 -y
# ! mamba install numpy=1.21.2 -y
# ! mamba install scipy=1.7.1 -y
# ! mamba install seaborn=0.9.0 -y
! pip install dython

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0HM6EN/clean_sales_1.csv'

In [ ]:
df = pd.read_csv(path)
df.head()

<h2 id="pattern_visualization"><b>2. Analyzing Individual Feature Patterns Using Visualization</b></h2>


Don't forget about <code>%matplotlib inline</code> to plot in a Jupyter notebook.


In [ ]:
%matplotlib inline 

<h4>How to choose the right visualization method?</h4>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
# list the data types for each column
df.dtypes

We need to convert data types:


In [ ]:
df["Date"] = pd.to_datetime(df["Date"])
df[["Group"]] = df[["Group"]].astype("category")
df[["SKU"]] = df[["SKU"]].astype("category")
df.dtypes

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #1:</b>

<b>What is the data type of the column "Sales Pkg"? </b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df['Sales Pkg'].dtypes
```

</details>


For example, we can calculate the correlation between the numeric variables (type "int64" or "float64") using the method <code>corr()</code>:


In [ ]:
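# correlate only the numeric columns (text and category columns are skipped)
df.select_dtypes(include="number").corr()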

The diagonal elements are always one. We will study correlation, more precisely the Pearson correlation, in depth at the end of the notebook.


To show the relationship between categorical and numerical variables, we can calculate association coefficients using the dython library:


In [ ]:
from dython.nominal import associations

In [ ]:
from sklearn.preprocessing import OrdinalEncoder

In [ ]:
enc = OrdinalEncoder()
df1 = df.copy()  # work on an explicit copy so the original dataframe stays unchanged
df1[df1.columns] = enc.fit_transform(df1)

Visualize:


In [ ]:
fig, ax = plt.subplots(figsize=(16, 8))
r = associations(df1, ax=ax, cmap="Blues")

Now we can observe correlation among all fields.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #2:</b>
    
<p>Find the correlation between the following columns: Sales Pkg, Turnover per month and Units Pkg.</p>
<p>Hint: if you would like to select those columns, use the following syntax: df[['col1', 'col2', 'col3']]</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Sales Pkg', 'Turnover per month', 'Units Pkg']].corr()
```

</details>


<h2><b>Continuous Numerical Variables</b></h2> 

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the (linear) relationship between an individual variable and the turnover, we can use <code>regplot</code> which plots the scatterplot plus the fitted regression line for the data.</p>
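
To make the "fitted regression line" concrete, here is a minimal sketch of a straight-line least-squares fit, which is the kind of line <code>regplot</code> draws by default. The numbers are made up for illustration and are not taken from the lab's dataset.

```python
import numpy as np

# Illustrative data only (not from the lab's dataset): y is exactly 2*x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```

A steep slope with points close to the line suggests a useful predictor; a nearly flat line with widely scattered points suggests a poor one.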


Let's see several examples of different linear relationships:


<h3><b>Positive Linear Relationship</b></h3>


Let's draw the scatterplot of total package sales per month against turnover per month.


In [ ]:
sns.regplot(x="Sales Pkg", y="Turnover per month", data=df)

<p>As the total package sales goes up, the turnover goes up: this indicates a positive direct correlation between these two variables. Total package sales seems like a pretty good predictor of turnover since the regression line is almost a perfect diagonal line.</p>


We can examine the correlation between 'Sales Pkg' and 'Turnover per month' and see that it's approximately 0.95.


In [ ]:
df[["Sales Pkg", "Turnover per month"]].corr()

<h3><b>Weak Linear Relationship</b></h3>


Let's see if "Avg Price Pkg" is a good predictor of "Turnover per month" by drawing the scatterplot of these variables.


In [ ]:
sns.regplot(x="Avg Price Pkg", y="Turnover per month", data=df)

There is almost no linear correlation between these values. Average price per package does not seem like a good predictor of turnover per month, since the regression line is close to horizontal.<br>
Also, the data points are widely scattered and far from the fitted line, showing a lot of variability. Therefore, it is not a reliable predictor.


We can examine the correlation between 'Avg Price Pkg' and 'Turnover per month' and see it's 0.098444.


In [ ]:
df[['Avg Price Pkg', 'Turnover per month']].corr()

This means that there is little or no linear relationship between them. In other words, the values of one variable don't appear to be strongly influenced by the values of the other variable.<br>However, it's important to note that a correlation close to zero does not necessarily mean there is no relationship between the variables. There could still be a non-linear or more complex relationship between them that the correlation coefficient does not capture.
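
As a small illustration of that last point, the sketch below uses made-up values in which <code>y</code> is completely determined by <code>x</code>, yet the Pearson correlation is zero because the relationship is not linear.

```python
import numpy as np

# Made-up example: y depends perfectly on x, but through a parabola.
x = np.arange(-3, 4, dtype=float)  # -3, -2, ..., 3
y = x ** 2

# Pearson correlation only measures the linear part of a relationship.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x**2: {r:.3f}")
```

A scatterplot would reveal the pattern immediately, which is one reason to look at the plot as well as the coefficient.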


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
  <b style="font-size: 2em; font-weight: bold;">Question #3:</b>

<p>Find the correlation  between "Units Pkg" and "Turnover per month".</p>
<p>Given the correlation results between these variables, do you expect a linear relationship?<br>Verify your results using the function "regplot()".</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python

#The correlation is -0.129794

df[["Units Pkg", "Turnover per month"]].corr()

# There is a weak negative correlation between 'Units Pkg' and 'Turnover per month', so a linear regression will not work well. We can use "regplot" to demonstrate this.

sns.regplot(x="Units Pkg", y="Turnover per month", data=df)
```

</details>


<h2><b>Categorical Variables</b></h2>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object", "int64" or "category". A good way to visualize categorical variables is by using boxplots.</p>


In [ ]:
df.head()

Let's look at the relationship between "Turnover per month" and "Avg Price Pkg binned".


In [ ]:
sns.boxplot(x="Avg Price Pkg binned", y="Turnover per month", data=df)

<p>We see that the distributions of turnover between the different average price categories have a significant overlap, so average price would not be a good predictor of turnover. </p>


**Why is overlap bad?**<br>
Because the groups do not differ clearly in the variable being measured. For example, a turnover per month of 4000 occurs in both the low and the medium average price group. In other words, "Avg Price Pkg binned" cannot be a good predictor of turnover. Note that overlapping box plots do not necessarily mean that there is no difference between the groups; the difference may simply not be statistically significant, or may be too small to be practically meaningful.
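
One rough way to quantify overlap is to compare the boxes themselves, since each box spans the 25th to 75th percentile of its group. The sketch below uses a hypothetical toy table (the column names <code>bin</code> and <code>turnover</code> and all values are invented) and a helper <code>boxes_overlap</code> that is not part of any library.

```python
import pandas as pd

# Made-up turnover values for three hypothetical bins (not the lab's data).
toy = pd.DataFrame({
    "bin": ["Low"] * 5 + ["Medium"] * 5 + ["High"] * 5,
    "turnover": [10, 20, 30, 40, 50,
                 25, 35, 45, 55, 65,
                 100, 110, 120, 130, 140],
})

# One row per bin, with the 25% and 75% quantiles as columns.
boxes = toy.groupby("bin")["turnover"].quantile([0.25, 0.75]).unstack()


def boxes_overlap(a, b):
    """Return True when the interquartile boxes of bins a and b intersect."""
    lower = max(boxes.loc[a, 0.25], boxes.loc[b, 0.25])
    upper = min(boxes.loc[a, 0.75], boxes.loc[b, 0.75])
    return bool(lower <= upper)


print("Low vs Medium overlap:", boxes_overlap("Low", "Medium"))
print("Medium vs High overlap:", boxes_overlap("Medium", "High"))
```

This is only a visual heuristic expressed in code; a formal comparison of group means is what ANOVA provides later in the lab.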


Let's examine the relationship between "Sales Pkg binned" and "Turnover per month":


In [ ]:
sns.boxplot(x="Sales Pkg binned", y="Turnover per month", data=df, order=["Low", "Medium", "High"])

<p>Here we see that the distributions of turnover in the total package sales categories Low, Medium and High are distinct enough to consider total package sales a potentially good predictor of turnover.</p>


<h2 id="discriptive_statistics"><b>3. Descriptive Statistical Analysis</b></h2>


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <code><b>describe</b></code> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%); the interquartile range (IQR) is the difference between the 75% and 25% values</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()
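
To see where each number in the summary comes from, the following sketch recomputes the statistics one by one for a small made-up Series and compares them with the output of <code>describe()</code>. The values are illustrative only.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 6.0, 9.0])  # made-up values
summary = s.describe()

# The same statistics computed individually.
manual = {
    "count": s.count(),
    "mean": s.mean(),
    "std": s.std(),            # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.quantile(0.50),   # the median
    "75%": s.quantile(0.75),
    "max": s.max(),
}

# The interquartile range is the width of the middle 50% of the data.
iqr = manual["75%"] - manual["25%"]
print(summary)
print("IQR:", iqr)
```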

The default setting of "describe" skips variables of type object or category. We can apply the method "describe" on the variables of type 'object' (or 'category') as follows:


In [ ]:
df.describe(include=['category'])

<h3><b>Value Counts</b></h3>


<p>Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "Sales Pkg binned". Here we call <code>value_counts()</code> on a single column, that is, on a pandas Series, so we use one pair of brackets <code>df['Sales Pkg binned']</code>, not two <code>df[['Sales Pkg binned']]</code>, which would select a DataFrame.</p>


In [ ]:
df['Sales Pkg binned'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df['Sales Pkg binned'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "sales_pkg_counts" and rename the column 'Sales Pkg binned' to 'Value counts'.


In [ ]:
sales_pkg_counts = df['Sales Pkg binned'].value_counts().to_frame()
sales_pkg_counts.rename(columns={'Sales Pkg binned': 'Value counts'}, inplace=True)
sales_pkg_counts

Now let's rename the index to 'Sales type':


In [ ]:
sales_pkg_counts.index.name = 'Sales type'
sales_pkg_counts
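
The column name produced by <code>value_counts().to_frame()</code> depends on the pandas version (from pandas 2.0 on it is <code>count</code> rather than the original column name), so a rename that looks for 'Sales Pkg binned' may silently do nothing there. A sketch of a version-independent alternative, shown on a made-up toy frame, names the index and the column explicitly:

```python
import pandas as pd

# Toy data standing in for the lab's dataframe (values are invented).
toy = pd.DataFrame(
    {"Sales Pkg binned": ["Low", "Low", "Medium", "High", "Low", "Medium"]}
)

counts = (
    toy["Sales Pkg binned"]
    .value_counts()
    .rename_axis("Sales type")      # name the index
    .to_frame(name="Value counts")  # name the single column explicitly
)
print(counts)
```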

<h2 id="correlation_causation"><b>4. Correlation and Causation</b></h2>


The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the turnover?".

To get a better measure of the important characteristics, we look at the correlation of these variables with the turnover. In other words: how is the turnover dependent on this variable?


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


To better understand this, consider that when people buy more ice cream, forest fires are more likely to happen.<br>
Does it mean that buying ice cream causes forest fires? Of course not. People buy ice cream more often when it's hot, and when it's hot, forest fires are more likely to happen.
<br>So, we can conclude that ice cream sales and forest fires are correlated, but neither one **causes** the other: the heat drives both.
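
The ice cream example can be simulated. In the sketch below all numbers are synthetic and the coefficients 2.0 and 1.5 are arbitrary assumptions: heat drives both quantities, which produces a strong correlation between them, and the correlation largely disappears once the effect of heat is removed from each.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: heat is the common cause of both quantities.
heat = rng.normal(size=n)
ice_cream = 2.0 * heat + rng.normal(size=n)
fires = 1.5 * heat + rng.normal(size=n)

raw_r = np.corrcoef(ice_cream, fires)[0, 1]


def residuals(y, x):
    """Return what is left of y after subtracting a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Correlation between the parts of each variable that heat does not explain.
partial_r = np.corrcoef(residuals(ice_cream, heat), residuals(fires, heat))[0, 1]

print(f"correlation of ice cream and fires: {raw_r:.2f}")
print(f"same correlation after removing heat: {partial_r:.2f}")
```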


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables have no linear relationship (a non-linear one may still exist).</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
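
For reference, the coefficient can be computed directly from its definition: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. A minimal sketch in plain Python, with a hypothetical helper name <code>pearson_r</code>:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))  # y rises linearly with x
print(pearson_r([1, 2, 3, 4], [4, 3, 2, 1]))  # y falls linearly with x
```

The two calls correspond to the extreme cases in the list above: a perfectly increasing and a perfectly decreasing linear relationship.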


<p>Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


In [ ]:
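# correlate only the numeric columns (text and category columns are skipped)
df.select_dtypes(include="number").corr()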

Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05: when the P-value falls below it, we call the correlation statistically significant, accepting a 5% risk of a false positive when no real correlation exists.</p>

By convention, when the p-value is

<ul>
    <li>$<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>$<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>$<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>$>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
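
The convention above can be written as a small helper. The function name <code>evidence_label</code> is hypothetical, and treating a p-value of exactly 0.1 as "no evidence" is an assumption, since the list does not cover that boundary.

```python
def evidence_label(p_value):
    """Map a p-value to the wording used in the convention above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"  # assumption: exactly 0.1 is grouped with values above 0.1


for p in (0.0004, 0.03, 0.07, 0.4):
    print(p, "->", evidence_label(p), "evidence")
```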


We can obtain this information using the "stats" module in the "scipy" library.


In [ ]:
from scipy import stats

<h3>Sales Pkg vs. Turnover per month</h3>


Let's calculate the Pearson Correlation Coefficient and P-value of 'Sales Pkg' and 'Turnover per month'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df["Sales Pkg"], df["Turnover per month"])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between sales and turnover per month is statistically significant, and the linear relationship is extremely strong (~0.96,  close to 1).</p>


<h3>Avg Price Pkg vs. Turnover per month</h3>


Let's calculate the Pearson Correlation Coefficient and P-value of 'Avg Price Pkg' and 'Turnover per month'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Avg Price Pkg'], df['Turnover per month'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

<h4>Conclusion:</h4>

<p>Since the p-value is $<$ 0.001, the correlation between average price and turnover per month is statistically significant, although the linear relationship is very weak (~0.10).</p>
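
A very weak correlation can still be statistically significant because the p-value depends on the sample size as well as on the coefficient. The sketch below uses the standard t-statistic form of the test, which assumes roughly normally distributed data; the helper name <code>pearson_p_value</code> and the sample sizes 10 and 10,000 are illustrative assumptions, not properties of the lab's dataset.

```python
import math

from scipy import stats


def pearson_p_value(r, n):
    """Two-sided p-value for the hypothesis that the true correlation is zero."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * stats.t.sf(abs(t), df=n - 2)


# The same weak coefficient with a small and a large sample.
p_small_sample = pearson_p_value(0.1, 10)
p_large_sample = pearson_p_value(0.1, 10_000)

print(f"r = 0.1, n = 10:     p = {p_small_sample:.3f}")
print(f"r = 0.1, n = 10000:  p = {p_large_sample:.3g}")
```

Statistical significance therefore says that the correlation is unlikely to be exactly zero; it does not say that the variable is a useful predictor.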


<h3>Units Pkg vs. Turnover per month</h3>

Let's calculate the Pearson Correlation Coefficient and P-value of 'Units Pkg' and 'Turnover per month'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Units Pkg'], df['Turnover per month'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  

<h4>Conclusion:</h4>
<p>Since the p-value is $<$ 0.001, the correlation between units pkg (package weight in kg) and turnover per month is statistically significant, and the linear relationship is negative and quite weak (~ -0.13).</p>


<h3><b>Autocorrelation</b></h3>


As we have information about year and month, it would be nice to know if our turnover is dependent on the turnover from previous month or dependent on the turnover from two months ago, and so on.


To do such an analysis, we'll use the <code>statsmodels</code> library:


In [ ]:
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.stattools import pacf

A partial autocorrelation at lag k measures how strongly the current turnover is related to the turnover k months earlier after the influence of the months in between has been removed. A moderate value at lag 1 would therefore indicate that the current turnover is directly related to the previous month's turnover, while a weak value at lag 2 would indicate that the turnover from two months ago adds little beyond what the previous month already explains, and so on.


In our case, <code>lags</code> is the number of months. You can set the number of lags directly:


In [ ]:
plot_pacf(df["Turnover per month"], lags=10)

We see that almost all lags have a very low correlation coefficient. Even lag 1 has a value of only about 0.3, which is low. So we can conclude that this time series shows no strong dependence on its own past values.
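
To get a feeling for what a clear lag-1 dependence looks like, the sketch below builds a synthetic series in which each value is 0.7 times the previous one plus random noise, and compares it with pure noise. The coefficient 0.7 and the helper <code>lag_autocorr</code> are illustrative assumptions. At lag 1, the partial autocorrelation is the same as the ordinary autocorrelation computed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
phi = 0.7  # assumed strength of the dependence on the previous value

# Synthetic series: each value depends on the one before it.
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = phi * series[t - 1] + noise[t]


def lag_autocorr(values, lag):
    """Correlation between a series and itself shifted by `lag` steps."""
    return np.corrcoef(values[:-lag], values[lag:])[0, 1]


ar_lag1 = lag_autocorr(series, 1)
noise_lag1 = lag_autocorr(noise, 1)

print(f"lag-1 autocorrelation, dependent series: {ar_lag1:.2f}")
print(f"lag-1 autocorrelation, pure noise:       {noise_lag1:.2f}")
```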


To examine values:


In [ ]:
pacf(df["Turnover per month"])

<h2 id="anova"><b>5. ANOVA: Analysis of Variance</b></h2>


<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If turnover is strongly related to the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
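
The F-test score is the ratio of the variation between the group means to the variation within the groups. The sketch below computes it by hand for three made-up samples and compares the result with <code>scipy.stats.f_oneway</code>; the helper name <code>one_way_f</code> and the data are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up samples for three groups (not the lab's data).
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([20.0, 22.0, 19.0, 21.0]),
    np.array([11.0, 10.0, 12.0, 13.0]),
]


def one_way_f(samples):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = np.concatenate(samples)
    grand_mean = all_values.mean()
    k = len(samples)            # number of groups
    n_total = all_values.size   # total number of observations
    ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


f_manual = one_way_f(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)

print("F by hand:", f_manual)
print("F from scipy:", f_scipy, "P =", p_scipy)
```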


<h3>Group</h3>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of 'Group' impact  'Turnover per month', we group the data.</p>


In [ ]:
grouped_test2 = df[['Group', 'Turnover per month']].groupby('Group')
grouped_test2.head()

We can obtain the values of a single group using the method <code>get_group</code>.


In [ ]:
grouped_test2.get_group('C')['Turnover per month']

We can use the function <code>f_oneway</code> in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


If we analyze 'A' and 'B':


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('A')['Turnover per month'], 
                              grouped_test2.get_group('B')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val)   

We see that the F-test score is very small and the P-value is very high, so the difference is not statistically significant. You can check the mean turnover for each value of the "Group" field and see that the means of "A" and "B" do not differ much.
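
Checking the group means takes one line with <code>groupby</code>. The sketch below uses a made-up toy frame that reuses the lab's column names; on the real data the same call would be applied to <code>df</code>.

```python
import pandas as pd

# Toy frame with the lab's column names but invented values.
toy = pd.DataFrame({
    "Group": ["A", "A", "B", "B", "C", "C"],
    "Turnover per month": [100.0, 120.0, 105.0, 115.0, 300.0, 340.0],
})

# Mean turnover per group; similar means give ANOVA little to separate.
group_means = toy.groupby("Group")["Turnover per month"].mean()
print(group_means)
```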


Analyzing "C", "M", "X":


In [ ]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('C')['Turnover per month'], 
                              grouped_test2.get_group('M')['Turnover per month'], 
                              grouped_test2.get_group('X')['Turnover per month'])  
 
print( "ANOVA results: F =", f_val, ", P =", p_val)   

The large F-test score shows that the mean turnover differs strongly between at least two of these groups, and a P-value of (almost) 0 means the difference is statistically significant. But does this mean that all three groups differ from each other this strongly?

Let's examine them separately.


#### C and X


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('C')['Turnover per month'], 
                              grouped_test2.get_group('X')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )

Let's examine the other groups.


#### C and M


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('C')['Turnover per month'], 
                              grouped_test2.get_group('M')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )

#### M and X


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('M')['Turnover per month'], 
                              grouped_test2.get_group('X')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )

### Sales Pkg binned


To see if the different categories of 'Sales Pkg binned' impact 'Turnover per month', we group the data.


In [ ]:
grouped_test3 = df[['Sales Pkg binned', 'Turnover per month']].groupby('Sales Pkg binned')
grouped_test3.head()

#### Low and High


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test3.get_group('Low')['Turnover per month'], 
                              grouped_test3.get_group('High')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )

#### Low and Medium


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test3.get_group('Low')['Turnover per month'], 
                              grouped_test3.get_group('Medium')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )

#### Medium and High


In [ ]:
f_val, p_val = stats.f_oneway(grouped_test3.get_group('Medium')['Turnover per month'], 
                              grouped_test3.get_group('High')['Turnover per month'])
 
print( "ANOVA results: F =", f_val, ", P =", p_val )
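
The pairwise comparisons above repeat the same call with different group names. As a sketch, the repetition can be replaced by a loop over all pairs using <code>itertools.combinations</code>. The toy frame and its column names <code>bin</code> and <code>turnover</code> are invented for illustration.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Toy data standing in for the binned sales column (values are invented).
toy = pd.DataFrame({
    "bin": ["Low"] * 4 + ["Medium"] * 4 + ["High"] * 4,
    "turnover": [10.0, 12.0, 11.0, 13.0,
                 20.0, 22.0, 19.0, 21.0,
                 40.0, 43.0, 41.0, 44.0],
})

# One array of turnover values per category.
samples = {name: grp["turnover"].to_numpy() for name, grp in toy.groupby("bin")}

# Run the two-group ANOVA for every pair of categories.
pairwise = {}
for a, b in combinations(sorted(samples), 2):
    f_val, p_val = stats.f_oneway(samples[a], samples[b])
    pairwise[(a, b)] = (f_val, p_val)
    print(f"{a} vs {b}: F = {f_val:.2f}, P = {p_val:.3g}")
```

Keep in mind that the more pairs are tested, the more likely it becomes that one of them looks significant by chance.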

<h3><b>Conclusion: Important Variables</b></h3>


<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the turnover value. We have narrowed it down to the following variables:</p>

Continuous numerical variables:

<ul>
    <li>Sales Pkg</li>
</ul>

Categorical variables:

<ul>
    <li>Sales Pkg binned</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/rosana_klym?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0HM6EN2945-2023-01-01">Rosana Klym</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0HM6EN2945-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0HM6EN2945-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
| 2023-05-05        | 2.0     | Rosana     | Changed authors                    |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
