<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

<h1 style="font-size: 36px; font-weight: bold; margin-top: 20px; text-decoration: none; margin-bottom: 25px;">Forecasting the turnover of supermarkets</h1>
<h2 style="font-size: 32px; font-weight: 500; margin-top: 0;">Lab. 2. Data Wrangling</h2>

Estimated time needed: **30** minutes

## Context
<details><summary>Click here to learn more about the purpose of the labs</summary>

The dataset contains data on different stores of a supermarket company. The goals of our analysis are:
<ol>
    <li>Calculate:</li>
    <ul>
    <li>Average sales volume per customer;</li>
    <li>Average sales volume per 1 square meter of store area;</li>
    </ul>
    <li>Investigate how indicators such as the number of customers, the number of products, and the size of store area affect the turnover volume;</li>
    <li>Calculate the forecast value of turnover for the next period;</li>
</ol>
    
</details>

## Incoming data
<details><summary>Click here to learn more about the incoming data</summary>

<p>The dataset contains information on sample parameters from 896 supermarkets: store identifier, retail store area, number of product categories for sale, average monthly customer traffic, turnover volume.</p>
<ul>
    <li>Store ID: (Index) ID of the particular store;</li>
    <li>Store Area: Physical area of the store in square yards;</li>
    <li>Items Available: Number of different items available in the corresponding store;</li>
    <li>Daily Customer Count: Average number of customers who visited the store over a month;</li>
    <h3>Target value</h3>
    <li>Store Sales: Sales (in US $) that the store made;</li>
</ul>
    
</details>




<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#data_acquisition">Data Acquisition</a></li>
    <li><a href="#identify_missing_values">Identify missing values</a></li>
        <li><a href="#add_new_column">Adding new column</a></li>
    <li><a href="#sort_data">Sorting Data</a></li>
    <li><a href="#data_normalization">Data normalization</a></li>
    <li><a href="#binning">Binning</a></li>
    <li><a href="#groping_data">Grouping Data</a></li>
</ol>

</div>

<hr>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="data_acquisition" style="text-decoration: none;">
        <font color="black">Data Acquisition</font>
    </a>
</b>
<details><summary>Click here to learn more about the dataframe</summary>
<br>   
<p>In our case, the Store Dataset is an online source, and it is in a CSV (comma separated value) format. We use this dataset as an example to practice data reading. </p>

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data/download?datasetVersionNumber=1" target="_blank">https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a></li>
</ul>
 
<p>This dataset is released under the CC0: Public Domain license, which allows anyone to copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.</p>




    
</details>

## What is the purpose of data wrangling?


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


### Import data
<p>
You can find the "Supermarket store branches sales analysis Dataset" at the following link: <a href="https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data/code">link</a>. 
We will be using this dataset throughout this course.
</p>


### Import libraries 


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3 -y
#! mamba install numpy=1.21.2 -y
 

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

### Reading the dataset from the URL


First, we assign the URL of the dataset to "filename".


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/corsera_da0101en_notebook_bottom?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0QTOEN/Stores.csv"

Use the Pandas function <b>read_csv()</b> to load the data from the web address. Set the parameter "index_col" equal to 0 so that pandas uses the first column as the index column.


In [ ]:
df = pd.read_csv(filename, index_col=0)
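To see what `index_col=0` does without downloading anything, here is a minimal sketch that reads a few made-up rows from an in-memory string. The header names and values below are assumptions chosen to mirror the columns described in this lab, not real rows from Stores.csv.

```python
import io
import pandas as pd

# Made-up rows that mimic the layout described in this lab (not real data)
csv_text = """Store ID,Store Area,Items Available,Daily Customer Count,Store Sales
1,1500,1800,500,60000
2,1200,1450,300,45000
3,1000,1200,800,52000
"""

# index_col=0 turns the first column into the row index instead of a data column
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.index.name)     # name of the index column
print(list(sample.columns))  # the remaining data columns
```

Without `index_col=0`, pandas would keep "Store ID" as an ordinary column and add a default 0-based index.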

Before starting to perform operations on the data, we need to ensure that our dataframe is ready for further analysis.

<div>We need to check if the data contains any missing values or type mismatches, and if so, we need to address these issues.</div> 

<b>How to work with missing data?</b>

Steps for working with missing data:

<ol>
    <li>Identify missing data</li>
    <li>Deal with missing data</li>
    <li>Correct data format</li>
</ol>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="identify_missing_values" style="text-decoration: none;">
        <font color="black">Identify missing values</font>
    </a>
</b><br>
By default, pandas represents missing values as NaN when it reads the data. There are two methods to detect missing data:

<ol>
    <li><b>.isnull()</b></li>
    <li><b>.notnull()</b></li>
</ol>
The output is a dataframe of boolean values indicating whether each value in the original dataframe is missing.


In [ ]:
missing_data = df.isnull()
missing_data.head(5)

"True" means the value is a missing value while "False" means the value is not a missing value.


### Count missing values in each column
<p>
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts how many times each of "True" and "False" occurs in the column.
</p>


In [ ]:
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")
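A more compact way to get the same per-column information is to sum the boolean mask. The sketch below uses a small made-up dataframe with two deliberately missing values, since the lab dataset itself has none.

```python
import numpy as np
import pandas as pd

# Toy dataframe with two deliberately missing values (made-up data)
toy = pd.DataFrame({
    "Store Area": [1500, np.nan, 1000],
    "Store Sales": [60000, 45000, np.nan],
    "Items Available": [1800, 1450, 1200],
})

# True counts as 1 and False as 0, so summing the mask counts missing values
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```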

Based on the summary above, each column has 896 rows of data and none of the columns contains missing data:

<ol>
    <li>"Store Area": 0 missing data</li>
    <li>"Items Available": 0 missing data</li>
    <li>"Daily Customer Count": 0 missing data</li>
    <li>"Store Sales" : 0 missing data</li>
</ol>

<b>Great!</b> We have a dataset with no missing values

### Correct data format

<p>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtypes</b> to check the data types</p>
<p><b>.astype()</b> to change the data type</p>


### Let's list the data types for each column


In [ ]:
df.dtypes

<p>As all the columns in our DataFrame consist of integers, the output confirms that all the data is of the correct data type, and there is no need to make any modifications.</p> 
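Our dataset needs no conversion, but for illustration here is a minimal sketch of how `.astype()` would be used if a numeric column had been read as text. The dataframe is made up for this example.

```python
import pandas as pd

# Made-up example: sales figures that were read as text instead of numbers
toy = pd.DataFrame({"Store Sales": ["60000", "45000", "52000"]})
print(toy.dtypes)

# astype() returns a converted copy, so assign it back to the column
toy["Store Sales"] = toy["Store Sales"].astype("int64")
print(toy.dtypes)
```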



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="add_new_column" style="text-decoration: none;">
        <font color="black">Adding new column</font>
    </a>
</b><br>

<p><b>Target:</b> Add a column that represents the average amount of sales per customer by dividing the "Store Sales" column by the "Daily Customer Count" column</p>

To add a column to a Pandas dataframe, you can use the bracket notation and assign a value or a list of values to the new column label.

Let's calculate the average sales volume per customer and add the values to a new column:

In [ ]:
df['Revenue per Customer'] = df['Store Sales'] / df['Daily Customer Count']
df.head(10)

To add a new column at a specific position, you can use the <code>insert()</code> method. This method takes three arguments: the position at which to insert the column, the name of the new column, and the data to populate the new column.
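As a minimal sketch of the three arguments, the example below inserts a hypothetical "Items per Area" column at the first position of a small made-up dataframe. The column name and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data with the same kind of columns as the lab dataset
toy = pd.DataFrame({
    "Store Area": [1500, 1200, 1000],
    "Items Available": [1800, 1450, 1200],
    "Store Sales": [60000, 45000, 52000],
})

# insert(loc, column, value) modifies the dataframe in place;
# loc=0 puts the new column first
toy.insert(0, "Items per Area", toy["Items Available"] / toy["Store Area"])
print(list(toy.columns))
```

Note that `insert()` does not return a new dataframe, so there is no need to assign its result.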

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;"> Question #1: </b>

<b>Add a new column 'Sales per Store Area' for the average sales volume per 1 square meter of the sales area ('Store Sales' divided by 'Store Area') to the second-to-last position</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df.insert(len(df.columns)-1, 'Sales per Store Area', df['Store Sales'] / df['Store Area'])
df.head(10)
```

</details>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="sort_data" style="text-decoration: none;">
        <font color="black">Sorting Data</font>
    </a>
</b><br>
<b>What is data sorting?</b>
<p>
    Sorting refers to the process of arranging or reordering the rows and columns of a DataFrame based on certain criteria.
</p>
<b>Why do we use sorting?</b>
<p>
    It allows us to quickly identify the highest or lowest values, the most frequent categories, and the relationships between different variables.
</p>

<b>Example:</b>
<p>In our dataset, the rows are ordered by store index, not by "Store Area". But what if we want to find out which store has the largest or smallest area?</p>


### Sorting by a Single Column
<p>To sort the dataframe by a single column, you can use the <code>sort_values()</code> method. Let's sort our dataframe by the "Store Area" column in descending order:</p>

In [ ]:
df.sort_values(by='Store Area', ascending=False)

<p>Note that we set <code>ascending=False</code> to sort in descending order. To sort in ascending order, omit the <code>ascending</code> parameter or set it to <code>True</code>.</p>
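One detail worth illustrating: `sort_values()` returns a new, sorted dataframe and leaves the original untouched, so the result has to be assigned to a variable if you want to keep it. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up store areas indexed by a hypothetical store ID
toy = pd.DataFrame(
    {"Store Area": [1200, 1500, 1000]},
    index=pd.Index([1, 2, 3], name="Store ID"),
)

largest_first = toy.sort_values(by="Store Area", ascending=False)

print(largest_first.index.tolist())  # store IDs from largest to smallest area
print(toy.index.tolist())            # the original order is untouched
```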

### Sorting by Multiple Columns
<p>You can also sort a DataFrame by multiple columns. For example, let's sort our dataframe first by "Items Available" in ascending order, and then by "Daily Customer Count" in descending order</p>

In [ ]:
df.sort_values(by=['Items Available', 'Daily Customer Count'], ascending=[True, False])

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;"> Question #2: </b>

<b>Sort the dataframe by the "Store Area" column in descending order and the "Store Sales" column in ascending order.</b><br>
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df.sort_values(by=['Store Area', 'Store Sales'], ascending=[False, True])
```

</details>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="data_normalization" style="text-decoration: none;">
        <font color="black">Data Normalization</font>
    </a>
</b><br>
<b>Why normalization?</b>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the column "Store Sales".</p>
<p><b>Target:</b> we would like to normalize this variable so its values range from 0 to 1</p>



<b>We can use two main approaches for normalization:</b>
<ol>
    <li>replacing original value by (original value)/(maximum value)</li>
</ol>

In [ ]:
# replace (original value) by (original value)/(maximum value)
norm = df['Store Sales']/df['Store Sales'].max()
print(norm)

<ol start=2>
    <li>using the <code>preprocessing</code> module from the scikit-learn library</li>
</ol>

In [ ]:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
norm = scaler.fit_transform(df[['Store Sales']])
print(norm[0:10])

<p>
Here we use the <code>MinMaxScaler</code> class, which performs scaling using minimum and maximum values.

The <code>fit_transform()</code> method performs two actions: first, it computes the minimum and maximum values for the 'Store Sales' column in the dataframe, and then it scales the values of this column to the range of 0 to 1.
</p>
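The two approaches are not identical. Dividing by the maximum maps the largest value to 1 but does not map the smallest value to 0, whereas min-max scaling with `feature_range=(0, 1)` applies (x - min) / (max - min). The sketch below compares the two formulas in plain pandas on made-up values.

```python
import pandas as pd

# Made-up sales values for illustration
sales = pd.Series([20000, 50000, 80000], name="Store Sales")

# Approach 1: divide by the maximum (the largest value maps to 1)
by_max = sales / sales.max()

# Min-max scaling: (x - min) / (max - min), the formula MinMaxScaler applies
# for feature_range=(0, 1)
min_max = (sales - sales.min()) / (sales.max() - sales.min())

print(by_max.tolist())
print(min_max.tolist())
```

Only the min-max result spans the full range from 0 to 1.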

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;"> Question #3: </b>

<b>According to the first example above, normalize the column "Store Area".</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
norm = df['Store Area']/df['Store Area'].max()
norm
```

</details>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="binning" style="text-decoration: none;">
        <font color="black">Binning</font>
    </a>
</b><br>
<b>Why binning?</b>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p>In our dataset, "Store Area" is a numeric variable ranging from 775 to 2229. What if we only care about whether a store has a large, medium, or small area (3 types)? Can we rearrange the values into three 'bins' to simplify analysis?</p>

<p>We will use the pandas function <code>cut</code> to segment a numeric column into 3 bins. The worked example below bins 'Sales per Store Area'; 'Store Area' itself is binned in Question #4.</p>


### Example of Binning Data In Pandas


Let's plot the histogram of 'Sales per Store Area' to see the sales volume per unit area of the stores.


In [ ]:
import matplotlib.pyplot as plt

plt.hist(df["Sales per Store Area"])

# Set x/y labels and plot title
plt.xlabel("Sales per Store Area")
plt.ylabel("Amount")
plt.title("Sales per Store Area bins")

<p>We would like 3 bins of equal width, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value, we set start_value to min(df["Sales per Store Area"]), rounded down with <code>math.floor</code> so that the minimum falls inside the first bin.</p>
<p>Since we want to include the maximum value, we set end_value = max(df["Sales per Store Area"]).</p>
<p>Since we are building 3 bins of equal width, there should be 4 dividers, so numbers_generated = 4.</p>
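A minimal sketch of the "3 bins need 4 dividers" rule, using the made-up range 0 to 30 instead of the real column:

```python
import numpy as np

# 4 evenly spaced dividers between 0 and 30 define 3 bins
edges = np.linspace(0, 30, 4)
print(edges)

# The distance between neighbouring dividers is the width of each bin
widths = np.diff(edges)
print(widths)
```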


In [ ]:
import math
sales_area_bins = np.linspace(math.floor(min(df["Sales per Store Area"])), max(df["Sales per Store Area"]), 4)
sales_area_bins

We set group  names:


In [ ]:
sales_area_labels = ['Low', 'Middle', 'High']

We apply the function "cut" to determine which bin each value of `df["Sales per Store Area"]` belongs to.


In [ ]:
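# include_lowest=True keeps a value equal to the first bin edge inside the first bin
df["Sales-per-Store-Area-binned"] = pd.cut(df['Sales per Store Area'], bins=sales_area_bins, labels=sales_area_labels, include_lowest=True)
df[['Sales per Store Area',"Sales-per-Store-Area-binned"]].head(853)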
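By default, `pd.cut` treats each bin as open on the left and closed on the right, so a value exactly equal to the first edge gets no bin unless `include_lowest=True` is passed. The sketch below shows the difference on made-up values and edges.

```python
import pandas as pd

# Made-up values; 10 is exactly equal to the first bin edge
values = pd.Series([10, 15, 20, 30])
edges = [10, 20, 30]
labels = ["Low", "High"]

# Default: bins are (10, 20] and (20, 30], so 10 is left without a bin
default_cut = pd.cut(values, bins=edges, labels=labels)

# include_lowest=True makes the first bin include its left edge
inclusive_cut = pd.cut(values, bins=edges, labels=labels, include_lowest=True)

print(default_cut.isnull().sum())  # number of values left without a bin
print(inclusive_cut.tolist())
```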

Let's see the number of stores in each bin:


In [ ]:
df["Sales-per-Store-Area-binned"].value_counts()

Let's plot the number of stores in each bin:


In [ ]:
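# value_counts() orders by frequency, so reindex to match the label order
plt.bar(sales_area_labels, df["Sales-per-Store-Area-binned"].value_counts().reindex(sales_area_labels))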

# set x/y labels and plot title
plt.xlabel("Sales per Store Area")
plt.ylabel("Amount")
plt.title("Sales per Store Area bins")

<p>
Look at the dataframe above carefully. You will find that the last column provides the bins for 'Sales per Store Area' based on 3 categories ("Low", "Middle" and "High"). 
</p>



### Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Sales per Store Area" with bins = 3
plt.hist(df["Sales per Store Area"], bins = 3)

# set x/y labels and plot title
plt.xlabel("Sales per Store Area")
plt.ylabel("Amount")
plt.title("Sales per Store Area bins")

The plot above shows the binning result for the attribute 'Sales per Store Area'.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;"> Question #4: </b>

<b>Separate the column "Store Area" into 3 bins of '1-999', '1000-1499', '1500-inf'. Call the new column 'Store-Area-binned'</b><br>
<p>Hint! This problem can be divided into three tasks:</p>
<ol>
    <li>Generating a bin array;</li>
    <li>Creating an array to store labels (the same as bins);</li>
    <li>Dividing the data into bins using the <code>pd.cut()</code> function;</li>
</ol>


</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
store_area_bins = [1, 999, 1499, max(df["Store Area"])]
store_area_names = ['1-999', '1000-1499', '1500-inf']
df['Store-Area-binned'] = pd.cut(df['Store Area'], bins=store_area_bins, labels=store_area_names)


<details><summary>Click here for the solution</summary>

```python

store_area_bins = [1, 999, 1499, max(df["Store Area"])]
store_area_names = ['1-999', '1000-1499', '1500-inf']
df['Store-Area-binned'] = pd.cut(df['Store Area'], bins=store_area_bins, labels=store_area_names)

```

</details>


It's important to complete this task, as the result will be needed for creating a cross-tabulation later on.


<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="groping_data" style="text-decoration: none;">
        <font color="black">Grouping Data</font>
    </a>
</b><br>
<b>What is data grouping?</b>
<p>
Grouping refers to the process of splitting the data into groups based on some criteria and applying a function to each group.    
</p>
<b>Why do we use grouping?</b>
<p>
Grouping allows us to perform aggregate functions, such as counting, summing, or averaging, on subsets of data based on a particular attribute or combination of attributes.
</p>
<b>Example:</b>
<p>
 In our dataset, we want to find out, for example, how many stores have an area between 1000 and 1499 square meters and count of daily customers ranging from 500 to 999.</p>


Before moving on to grouping the data, we need to create another column that will categorize the number of visitors to the stores into 3 groups: 1-499, 500-999, 1000-inf. We have performed similar actions a little earlier.

In [ ]:
customers_count_bins = [1, 499, 999, max(df["Daily Customer Count"])]
customers_count_names = ['1-499', '500-999', '1000-inf']
df['Customers-Count-binned'] = pd.cut(df['Daily Customer Count'], bins=customers_count_bins, labels=customers_count_names)

To group data, we use the <code>groupby()</code> method. This method can be called on a dataframe object and passed one or several columns by which the data should be grouped.

In [ ]:
result = df.groupby(['Store-Area-binned', 'Customers-Count-binned']).size().reset_index(name='Count')
print(result)

The above code groups the data by two columns, 'Store-Area-binned' and 'Customers-Count-binned', and then calculates the number of rows in each group using the <code>size()</code> method. The <code>reset_index()</code> method resets the index of the resulting dataframe and names the new column with the count values 'Count'.
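The same `groupby().size().reset_index()` chain is easier to follow on a tiny table. The sketch below uses made-up categories ("Area bin" and "Customers bin" are hypothetical column names) rather than the lab's binned columns.

```python
import pandas as pd

# Made-up categorical data: five stores described by two labels each
toy = pd.DataFrame({
    "Area bin": ["small", "small", "large", "large", "large"],
    "Customers bin": ["few", "many", "few", "few", "many"],
})

# size() counts the rows in each (Area bin, Customers bin) group;
# reset_index(name="Count") turns the group keys back into ordinary columns
counts = toy.groupby(["Area bin", "Customers bin"]).size().reset_index(name="Count")
print(counts)
```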

After grouping, we can use <code>crosstab()</code> to obtain a summary table showing the count of observations for each combination of values within the groups.

In [ ]:
pd.crosstab(result['Store-Area-binned'], result['Customers-Count-binned'], 
                              values=result['Count'], aggfunc='sum', 
                              rownames=['Store Area'], colnames=['Daily Customer Count'])

Here the first argument represents the row variable and the second is the column variable. <code>aggfunc='sum'</code> specifies the aggregation function to be used, which is sum in this case.
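When only counts are needed, `crosstab()` can also be applied directly to two raw columns, without grouping first, because counting co-occurrences is its default behaviour. A minimal sketch on made-up data with hypothetical column names:

```python
import pandas as pd

# Made-up categorical data: five stores described by two labels each
toy = pd.DataFrame({
    "Area bin": ["small", "small", "large", "large", "large"],
    "Customers bin": ["few", "many", "few", "few", "many"],
})

# With no values/aggfunc, crosstab counts how often each pair occurs
table = pd.crosstab(toy["Area bin"], toy["Customers bin"])
print(table)
```

The `values` and `aggfunc` arguments are only needed when the table should aggregate another column, such as the precomputed 'Count'.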

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;"> Question #5: </b>

<b>Create a cross-tabulation based on the columns 'Store-Area-binned' and "Sales-per-Store-Area-binned", which was calculated in the previous section.</b><br>
    
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
result = df.groupby(['Store-Area-binned', "Sales-per-Store-Area-binned"]).size().reset_index(name='Count')
pd.crosstab(result['Store-Area-binned'], result["Sales-per-Store-Area-binned"], 
                              values=result['Count'], aggfunc='sum', 
                              rownames=['Store Area'], colnames=['Sales per Store Area'])

<details><summary>Click here for the solution</summary>

```python
result = df.groupby(['Store-Area-binned', "Sales-per-Store-Area-binned"]).size().reset_index(name='Count')
pd.crosstab(result['Store-Area-binned'], result["Sales-per-Store-Area-binned"], 
                              values=result['Count'], aggfunc='sum', 
                              rownames=['Store Area'], colnames=['Sales per Store Area'])
```

</details>


Save the new csv:

In [ ]:
df.to_csv('Stores.csv')

> Note: The csv file cannot be viewed in the JupyterLite-based SN Labs environment. However, you can click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Module%202/DA0101EN-2-Review-Data-Wrangling.ipynb?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2022-01-01">HERE</a> to download the lab notebook (.ipynb) to your local machine and view the csv file once the notebook is executed.


### Thank you for completing this lab!


## Author

<a href="https://author.skills.network/instructors/ivan_dvylyuk">Ivan Dvylyuk</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
| 2023-04-03        | 2.2     | Ivan       | Created the lab                               |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>