<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# **Retail Sales Dataset 2018-2022**
## **Lab 1. Loading the dataset**
Estimated time needed: **10** minutes

### **Context**
This dataset contains historical sales data extracted from a global retail company. The data was transformed to protect the company's identity.
It is derived from an initial dataset collected over five years (2018-2022) and records the date, the product and the monthly sales.
The goal of this lab is to load the dataset onto your machine and familiarize yourself with its content using Python.

### **Dataset Attributes**
*   Date: year and month
*   SKU: unique code consisting of letters and numbers that identifies each product
*   Group: group of related products which share some common attributes
*   Units Pkg: quantity per package
*   Avg Price Pkg: average price per package
*   Sales Pkg: total package sales per month

### **Target Fields**
*   Avg Price Pkg
*   Units Pkg

### **Objectives**

After completing this lab you will be able to:

*   Acquire data in various ways
*   Obtain insights from data with the Pandas library


<h2>Table of Contents</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#data_acquisition">Data Acquisition</a></li>
    <li><a href="#basic_insight">Basic Insight of Dataset</a></li>
</ol>
</div>
<hr>


<h2><b id="data_acquisition">Data Acquisition</b></h2>
<p>Datasets come in various formats, such as .csv, .json and .xlsx. A dataset can be stored in different places: on your local machine or online.<br>

In this section, you will learn how to load a dataset into a Jupyter Notebook.<br>

In our case, the Retail Sales Dataset is available online in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/tsmldata/retail-sales-dataset-2018-2022?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX03NFEN2622-2023-01-01">https://www.kaggle.com/datasets/tsmldata/retail-sales-dataset-2018-2022</a></li>
    <li>Data type: csv</li>
    <li>License: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX03NFEN2622-2023-01-01">https://creativecommons.org/publicdomain/zero/1.0/</a><br>
    You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.</li>
</ul>
The Pandas library is a useful tool that enables us to read various datasets into a dataframe. Our Jupyter notebook platform has the <b>Pandas library</b> built in, so all we need to do is import it; no installation is required.
</p>


If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:


In [ ]:
# install the specific versions of the libraries used in this lab
#! mamba install pandas==1.3.3  -y
#! mamba install numpy=1.21.2 -y

In [ ]:
# import the pandas and numpy libraries
import pandas as pd
import numpy as np

### **Read Data**
<p>We use the <code>pandas.read_csv()</code> function to read the CSV file. Inside the parentheses, we put the file path in quotation marks so that pandas reads the file at that address into a dataframe. The file path can be either a URL or a local file address.<br>

By default, pandas treats the first row of the file as the header. When a file does not include a header row, we can pass the argument <code>header=None</code> to <code>read_csv()</code> so that pandas does not use the first row as column names. Our file includes a header row, so we keep the default.<br>

You can assign the resulting dataframe to any variable name you choose.

</p>
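
To make the effect of the header argument concrete, here is a minimal sketch. It does not use the lab's file: `csv_text` is a small made-up CSV string, read from memory through `io.StringIO`, and the column names in it are assumptions chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text (not the lab dataset), used only for illustration
csv_text = "sku,units\nA1,10\nB2,20\n"

# Default behaviour: the first row becomes the column names
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first row is kept as data and columns are numbered 0, 1, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)     # (2, 2)
print(without_header.shape)  # (3, 2)
```

With the default, the frame has two data rows and named columns; with `header=None`, the header line is counted as a third data row.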


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX03NFEN/sales_1.csv"

This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cocl.us/DA101EN_object_storage?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01">HERE</a> for free storage.


In [ ]:
# Read the online file from the URL provided above and assign it to the variable "df"
df = pd.read_csv(path)

After reading the dataset, we can use the <code>dataframe.head(n)</code> method to check the top n rows of the dataframe, where n is an integer. Conversely, <code>dataframe.tail(n)</code> shows the bottom n rows of the dataframe.


In [ ]:
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
df.head(5)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1:</b>
    
<b>Check the bottom 10 rows of data frame "df".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
print("The last 10 rows of the dataframe\n")
df.tail(10)
```

</details>

Sometimes a dataset may contain null values, so it is worth checking for them. We can do it this way:


In [ ]:
print(df.isnull().sum())

As we can observe, there are no null values in our dataset. Every column is fully populated.
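
For contrast, the sketch below shows what the same check reports when a value is missing. The dataframe `demo` and its values are made up for illustration and are not part of the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price, for illustration only
demo = pd.DataFrame({"sku": ["A1", "B2", "C3"], "price": [9.5, np.nan, 7.0]})

# isnull() marks each missing cell as True; sum() counts the True values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

Here the count for `price` is 1 and the count for `sku` is 0, so a non-zero entry points directly at the column that needs attention.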


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b>
    
<b>Find the names of the columns of the dataframe.</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
print(df.columns)
```

</details>


### **Modify Headers**
<p>We can modify the column headers. Let's do this.<br>
First, we print the headers with the command <code>print(df.columns)</code>.</p>


In [ ]:
# print headers list
print(df.columns)

<p>We want to get rid of the underscores. To do this, we can use the string method <code>replace()</code>.</p>


In [ ]:
df.columns = [col.replace("_", " ") for col in df.columns]

In [ ]:
print(df.columns)

<p>That looks better, but let's capitalize every word in each header using the string method <code>title()</code>, leaving the column "SKU" as it is:</p>


In [ ]:
df.columns = [col.title() if col != "SKU" else col for col in df.columns]
        
print(df.columns)

In [ ]:
df.head(1500)

Now, we have successfully read the raw dataset and modified headers. Looks much better.


<h2><b id="basic_insight">Basic Insight of Dataset</b></h2>
<p>
After reading the data into a Pandas dataframe, it is time to explore the dataset.<br>

There are several ways to obtain essential insights into the data that help us understand it better.

</p>


### **Data Types**
<p>
Data has a variety of types.<br>

The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. To learn more about each attribute, it helps to know the data type of each column. In Pandas, we check this with the <code>dtypes</code> attribute:

</p>


In [ ]:
# check the data types of dataframe "df" with .dtypes
print(df.dtypes)

As we can see, "SKU" and "Group" are of type <code>object</code>, while "Units Pkg", "Avg Price Pkg", "Date" and "Sales Pkg" are of type <code>int64</code>.


These data types can be changed; we will learn how to accomplish this in a later module.


### **Describe**
If we would like to get a statistical summary of each column (e.g. count, mean, standard deviation), we use the <code>dataframe.describe()</code> method.


This method provides various summary statistics, excluding <code>NaN</code> (Not a Number) values.


In [ ]:
df.describe()

<p>
This shows the statistical summary of all numeric-typed (int, float) columns.<br>

For example, the attribute "Avg Price Pkg" has a count of 4138, a mean of 9.138231, a standard deviation of 3.551443, a minimum of 2, a 25th percentile of 7, a 50th percentile of 9, a 75th percentile of 11 and a maximum of 26. <br>

However, what if we would also like to check all the columns, including those of type object? <br><br>

You can add the argument <code>include = "all"</code> inside the parentheses. Let's try it again.

</p>


In [ ]:
# describe all the columns in "df" 
df.describe(include = "all")

<p>
Now it provides the statistical summary of all the columns, including object-typed attributes.<br>

For the object-typed columns, we can now see how many unique values there are, which value is the most frequent (top) and how often it occurs (freq).<br>

Some values in the table above show as "NaN". This is because those statistics do not apply to that column's type.<br>

</p>
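
The sketch below illustrates where those "NaN" entries come from. It uses a small made-up dataframe (`demo`, with hypothetical columns `group` and `units`) rather than the lab data.

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column
demo = pd.DataFrame({"group": ["a", "a", "b"], "units": [1, 2, 3]})

summary = demo.describe(include="all")

# unique/top/freq are computed only for the text column,
# mean/std/min/max only for the numeric column; the other cells are NaN
print(summary)
```

In the printed table, `unique`, `top` and `freq` are filled in for `group` and empty for `units`, while `mean` and the other numeric statistics are filled in for `units` and empty for `group`.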


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3:</b>

<p>
You can select columns of a dataframe by listing the name of each column. For example, you can select three columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']]</code>
</p>
<p>
Here, each "column" is the name of a column. You can then apply the method ".describe()" to get the statistics of those columns as follows:
</p>
<p>
    <code>dataframe[['column 1', 'column 2', 'column 3']].describe()</code>
</p>

Apply the method ".describe()" to the columns 'Date' and 'Units Pkg'.

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Date', 'Units Pkg']].describe()
```

</details>


### **Info**
Another method you can use to check your dataset is <code>dataframe.info()</code>.


It provides a concise summary of your DataFrame.

This method prints information about a DataFrame, including the index dtype and columns, non-null counts and memory usage.


In [ ]:
# look at the info of "df"
df.info()

As we can see, there are no null values. Great!


### **Save Dataset**
<p>
Correspondingly, Pandas enables us to save a dataset to CSV with the <code>dataframe.to_csv()</code> method. Inside the parentheses, put the file path and name in quotation marks.
</p>
<p>
For example, to save the dataframe <b>df</b> as <b>sales.csv</b> on your local machine, you can call <code>df.to_csv("sales.csv", index=False)</code>, where <code>index=False</code> means the row labels will not be written.
</p>
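
As a minimal sketch of saving and reloading, the example below writes a small made-up dataframe to a temporary folder and reads it back. The dataframe `demo`, its values and the file name `demo_sales.csv` are assumptions for illustration. In the lab you would call the same method on `df` with a path of your choice.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for the lab dataframe
demo = pd.DataFrame({"SKU": ["A1", "B2"], "Sales Pkg": [10, 20]})

# Write to a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo_sales.csv")
    demo.to_csv(out_path, index=False)  # index=False: do not write row labels
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

Because the row labels were not written, the reloaded dataframe has exactly the two original columns and no extra unnamed index column.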


We can also read and save other file formats using functions similar to **`pd.read_csv()`** and **`df.to_csv()`**. They are listed in the following table:


### **Read/Save Other Data Formats**

| Data Format  |        Read       |            Save |
| ------------ | :---------------: | --------------: |
| csv          |  `pd.read_csv()`  |   `df.to_csv()` |
| json         |  `pd.read_json()` |  `df.to_json()` |
| excel        | `pd.read_excel()` | `df.to_excel()` |
| hdf          |  `pd.read_hdf()`  |   `df.to_hdf()` |
| sql          |  `pd.read_sql()`  |   `df.to_sql()` |
| ...          |        ...        |             ... |
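
To show that the other formats follow the same read/save pattern, here is a small sketch of a JSON round trip done in memory. The dataframe `demo` is made up for illustration, and `orient="records"` is just one of several JSON layouts pandas supports.

```python
import io
import pandas as pd

# Hypothetical data, for illustration only
demo = pd.DataFrame({"SKU": ["A1", "B2"], "Units Pkg": [6, 12]})

# With no path given, to_json() returns the JSON text as a string
json_text = demo.to_json(orient="records")

# Read the JSON text back into a dataframe, using the same layout
reloaded = pd.read_json(io.StringIO(json_text), orient="records")

print(reloaded)
```

The same orientation is passed to both calls so that the text is parsed in the layout in which it was written.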


## **Excellent! You have just completed the Introduction Notebook!**


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/rosana_klym?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX03NFEN2622-2023-01-01">Rosana Klym</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX03NFEN2622-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX03NFEN2622-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
| 2023-05-05        | 2.0    | Rosana    | Changed style of questions and changed authors               |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
