<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

<h1 style="font-size: 36px; font-weight: bold; margin-top: 20px; text-decoration: none; margin-bottom: 25px;">Forecasting the turnover of supermarkets</h1>
<h2 style="font-size: 32px; font-weight: 500; margin-top: 0;">Lab. 1. Loading the dataset</h2>

Estimated time needed: **10** minutes

## Context
This dataset contains data on the individual stores of a supermarket company. The goals of our analysis are to:
<ol>
    <li>Calculate:
        <ul>
            <li>Average sales volume per customer;</li>
            <li>Average sales volume per 1 square meter of store area;</li>
        </ul>
    </li>
    <li>Investigate how indicators such as the number of customers, the number of products, and the size of the store area affect the turnover volume;</li>
    <li>Calculate the forecast value of turnover for the next period.</li>
</ol>
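
As a rough illustration of the first goal, the sketch below computes both per-store ratios with pandas. The three stores and their values are made up for this example, and the underscore-style column names are an assumption about how the raw file is labelled.

```python
import pandas as pd

# Made-up values for three hypothetical stores (not taken from the real dataset).
stores = pd.DataFrame(
    {
        "Store_Area": [1500, 1200, 1800],
        "Daily_Customer_Count": [500, 400, 900],
        "Store_Sales": [60000, 36000, 90000],
    },
    index=pd.Index([1, 2, 3], name="Store ID"),
)

# Sales volume per customer, store by store.
stores["Sales_per_Customer"] = stores["Store_Sales"] / stores["Daily_Customer_Count"]

# Sales volume per unit of store area, store by store.
stores["Sales_per_Area"] = stores["Store_Sales"] / stores["Store_Area"]

print(stores[["Sales_per_Customer", "Sales_per_Area"]])
```

Taking the mean of either new column would then give a single average across all stores.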


## Incoming data
<p>The dataset contains information on sample parameters from 896 supermarkets: store identifier, retail store area, number of product categories for sale, average monthly customer traffic, turnover volume.</p>
<ul>
    <li>Store ID: (Index) ID of the particular store;</li>
    <li>Store Area: Physical area of the store in square yards;</li>
    <li>Items Available: Number of different items available in the corresponding store;</li>
    <li>Daily Customer Count: Average number of customers who visited the store per day over a month.</li>
</ul>
<h3>Target value</h3>
<ul>
    <li>Store Sales: Sales (in US $) that the store made.</li>
</ul>
 <br>

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#data_acquisition">Data Acquisition</a></li>
    <li><a href="#basic_insight">Basic Insight of Dataset</a></li>
</ol>
</div>
<hr>



<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="data_acquisition" style="text-decoration: none;">
        <font color="black">Data Acquisition</font>
    </a>
</b>
<p>
Datasets come in various formats, such as .csv, .json, and .xlsx. A dataset can be stored in different places: on your local machine or online.<br>

In this section, you will learn how to load a dataset into a Jupyter Notebook.<br>

In our case, the Store Dataset is available online in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data/download?datasetVersionNumber=1" target="_blank">https://www.kaggle.com/datasets/surajjha101/stores-area-and-sales-data</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a></li>
</ul>

This dataset is released under the CC0: Public Domain license, which allows you to copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

The Pandas library is a useful tool that enables us to read various datasets into a dataframe. Our Jupyter notebook platforms have the <b>Pandas library</b> built in, so all we need to do is import Pandas without installing it.
</p>


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3  -y
#! mamba install numpy=1.21.2 -y

In [ ]:
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

## Read Data
<p>
We use the <code>pandas.read_csv()</code> function to read the CSV file. Inside the parentheses, we put the file path in quotation marks so that pandas reads the file at that address into a dataframe. The file path can be either a URL or a local file path.<br>

Because the dataset contains its own index column (Store ID), we can pass the argument <code>index_col=0</code> to <code>read_csv()</code> so that pandas uses the first column as the index.

You can assign the resulting dataframe to any variable you create.

</p>
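
To see what <code>index_col=0</code> changes, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The two rows and the column names are made up for illustration.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for the real file; the column names are assumptions.
csv_text = """Store ID,Store_Area,Store_Sales
1,1500,60000
2,1200,36000
"""

# Without index_col, pandas adds its own 0-based index and keeps "Store ID" as a column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column ("Store ID") becomes the index.
store_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())
print(store_index.index.tolist())
```

<code>read_csv()</code> accepts a file-like object as well as a path or URL, which is why <code>io.StringIO</code> works here.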


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0G2SEN/Stores.csv"

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
# Read the online file from the URL provided above, and assign it to the variable "df"
df = pd.read_csv(path, index_col=0)

We can use the <code>dataframe.head(n)</code> method to check the top n rows of the dataframe, where n is an integer. Conversely, <code>dataframe.tail(n)</code> shows the bottom n rows of the dataframe.


In [ ]:
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe")
df.head(5)
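
As a small aside, both methods default to 5 rows when n is omitted. The sketch below checks this on a made-up 12-row dataframe rather than on the store data.

```python
import pandas as pd

# A made-up dataframe with 12 rows, used only to show how many rows come back.
toy = pd.DataFrame({"value": range(12)})

top_default = toy.head()   # n defaults to 5
bottom_three = toy.tail(3)  # the last 3 rows

print(len(top_default))
print(bottom_three["value"].tolist())
```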

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;">Question #1:</b><br>
<b>Check the bottom 10 rows of data frame "df".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
print("The last 10 rows of the dataframe\n")
df.tail(10)
```

</details>


### Update Headers
<p>
Take a look at our dataset. All columns have names that use underscores, and we need to replace the underscores with spaces.
</p>
<p>
Thus, we have to change headers manually.
</p>
<p>
First, we use the <code>str.replace()</code> method to replace every underscore in each column name with a space. Then, we use <code>dataframe.columns = headers</code> to replace the headers with the new names we created.
</p>




In [ ]:
headers = df.columns.str.replace('_', ' ')

We replace headers and check column names:

In [ ]:
df.columns = headers
print(df.columns)
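
An alternative worth knowing is <code>DataFrame.rename()</code>, which returns a new dataframe and leaves the original untouched. The sketch below uses a made-up one-row dataframe; the column names are assumptions modelled on the lab's headers.

```python
import pandas as pd

# A made-up one-row dataframe with underscore-style headers.
toy = pd.DataFrame({"Store_Area": [1500], "Items_Available": [1800]})

# rename() applies the function to every column name and returns a copy.
renamed = toy.rename(columns=lambda name: name.replace("_", " "))

print(renamed.columns.tolist())
print(toy.columns.tolist())  # the original headers are unchanged
```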

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;">Question #2:</b><br>
<b>Capitalize the column names of the dataframe and display all column names.</b><br>
<b style="font-weight: 600;">Make your changes on a temporary variable, as shown in the code cell below.</b>
</div>


In [ ]:
temp = df.copy()
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
temp.columns = df.columns.str.upper()
print(temp.columns)
```

</details>


<b style="font-size: 32px; font-weight: 500; text-decoration: none; padding-top: 20px;">
    <a name="basic_insight" style="text-decoration: none;">
        <font color="black">Basic Insight of Dataset</font>
    </a>
</b>
<p>
After reading the data into a Pandas dataframe, it is time for us to explore the dataset.<br>

There are several ways to obtain essential insights into the data to help us better understand our dataset.

</p>


## Data Types
<p>
Data has a variety of types.<br>

The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:

</p>


In [ ]:
df.dtypes

A series with the data type of each column is returned.


In [ ]:
# check the data type of data frame "df" by .dtypes
print(df.dtypes)

<p>
As shown above, every column has the data type <code>int64</code>.
</p>
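
If a column needs a different type, <code>astype()</code> can convert it. This is a minimal sketch on made-up values, not a step the lab requires.

```python
import pandas as pd

# Made-up integer data; pandas infers an integer dtype for both columns.
toy = pd.DataFrame({"Store Area": [1500, 1200], "Store Sales": [60000, 36000]})
print(toy.dtypes)

# Convert one column to floating point, for example before dividing or averaging.
toy["Store Sales"] = toy["Store Sales"].astype("float64")
print(toy.dtypes)
```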



<h2>Describe</h2>
If we would like to get a statistical summary of each column (for example, the count, mean, and standard deviation), we use the <code>describe()</code> method:


In [ ]:
df.describe()

<p>
This shows the statistical summary of all numeric-typed (int, float) columns.<br>

For example, the attribute "Daily Customer Count" has a count of 896, a mean value of 786, and a standard deviation of 265.38. Its minimum value is 10, the 25th percentile is 600, the 50th percentile is 780, the 75th percentile is 970, and the maximum value is 1560. <br>
</p>
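
To connect the rows of the summary to the underlying calculations, the sketch below runs <code>describe()</code> on a made-up five-value series and compares a few entries with the matching pandas methods.

```python
import pandas as pd

# Five made-up values, chosen so the statistics are easy to check by hand.
customers = pd.Series([10, 20, 30, 40, 50], name="Daily Customer Count")

summary = customers.describe()
print(summary)

# Each entry of the summary matches a dedicated method on the series.
print(summary["mean"] == customers.mean())
print(summary["50%"] == customers.median())
print(summary["75%"] == customers.quantile(0.75))
```

Note that the standard deviation reported by <code>describe()</code> is the sample standard deviation, the same value that <code>Series.std()</code> returns by default.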


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 32px; font-weight: bold;">Question #3:</b><br>

<p>
You can select the columns of a dataframe by indicating the name of each column. For example, you can select three columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']]</code>
</p>
<p>
Here, "column" is the name of the column. You can then apply the ".describe()" method to get the statistics of those columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']].describe()</code>
</p>

Apply the ".describe()" method to the columns 'Store Area' and 'Items Available'.

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Store Area', 'Items Available']].describe()
```

</details>


## Info
Another method you can use to check your dataset is <code>dataframe.info()</code>.

It provides a concise summary of your DataFrame.

This method prints information about a DataFrame, including the index dtype, the column dtypes, the non-null counts, and the memory usage.


In [ ]:
# look at the info of "df"
df.info()
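
The non-null counts are most useful when some values are missing. The sketch below uses a made-up dataframe with one missing value, captures the report through the <code>buf</code> parameter of <code>info()</code>, and cross-checks it with <code>isna().sum()</code>.

```python
import io

import pandas as pd

# Made-up data with one missing value in "Store Area".
toy = pd.DataFrame(
    {"Store Area": [1500, 1200, None], "Store Sales": [60000, 36000, 90000]}
)

# info() prints to stdout by default; buf redirects the report into a string buffer.
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# Count the missing values per column directly.
missing = toy.isna().sum()
print(missing)
```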

## Save Dataset
<p>
Correspondingly, Pandas enables us to save the dataset to a CSV file by using the <code>dataframe.to_csv()</code> method. Inside the parentheses, put the file path and name in quotation marks.
</p>


In [ ]:
df.to_csv("Stores.csv")
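
By default, <code>to_csv()</code> writes the index as the first column, so reading the file back with <code>index_col=0</code> restores it. The sketch below shows this round trip in memory on a made-up dataframe; when no path is given, <code>to_csv()</code> returns the CSV text as a string.

```python
import io

import pandas as pd

# A made-up dataframe with a named index, similar in shape to the lab's data.
toy = pd.DataFrame(
    {"Store Area": [1500, 1200]},
    index=pd.Index([1, 2], name="Store ID"),
)

# With no path argument, to_csv() returns the CSV content as a string.
csv_text = toy.to_csv()
print(csv_text)

# Reading it back with index_col=0 restores "Store ID" as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.equals(toy))
```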

We can also read and save other file formats. We can use similar functions like **`pd.read_csv()`** and **`df.to_csv()`** for other data formats. The functions are listed in the following table:

## Read/Save Other Data Formats

| Data Format  |        Read       |            Save |
| ------------ | :---------------: | --------------: |
| csv          |  `pd.read_csv()`  |   `df.to_csv()` |
| json         |  `pd.read_json()` |  `df.to_json()` |
| excel        | `pd.read_excel()` | `df.to_excel()` |
| hdf          |  `pd.read_hdf()`  |   `df.to_hdf()` |
| sql          |  `pd.read_sql()`  |   `df.to_sql()` |
| ...          |        ...        |             ... |
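
As one example from the table, the sketch below round-trips a made-up dataframe through JSON. The <code>orient="records"</code> layout is just one of several options and is chosen here for readability.

```python
import io

import pandas as pd

# A made-up dataframe used only to demonstrate the JSON round trip.
toy = pd.DataFrame({"Store Area": [1500, 1200], "Store Sales": [60000, 36000]})

# With no path argument, to_json() returns the JSON text as a string.
json_text = toy.to_json(orient="records")
print(json_text)

# Wrap the string in StringIO so read_json() treats it as a file-like object.
restored = pd.read_json(io.StringIO(json_text), orient="records")
print(restored)
```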


# Excellent! You have just taken your first steps in data analysis!


### Thank you for completing this lab!


## Author

<a href="https://author.skills.network/instructors/ivan_dvylyuk">Ivan Dvylyuk</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                            |
| ----------------- | ------- | ---------- | --------------------------------------------- |
| 2023-03-24        | 1.0     | Ivan       | Created the lab                               |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
