<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Motorcycle sales analysis

# *Lab 1. Loading data from dataset*
Estimated time needed: **15 minutes**


## Objectives
<p>After completing this lab you will be able to:</p>
<ul>
    <li>Download a dataset</li>
    <li>Acquire data in various ways</li>
    <li>Obtain insights from data with the Pandas library</li>
    <li>Save a dataset</li>
</ul>



<b style="font-size: 1.5em; font-weight: bold;">About dataset</b><br><br>
<b style="font-size: 1.2em; font-weight: bold;">Content</b>
<p>You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.</p>


<b style="font-size: 1.2em; font-weight: bold;"> Dataset Glossary (Column-wise)</b>
<ul>
    <li>Date<p>The date on which the client bought the products</p></li>
    <li>Warehouse<p>The warehouse location. This dataset contains three warehouses: Central, North and West</p></li>
    <li>Client type<p>How the client bought the products. This column can only be Retail or Wholesale</p></li>
    <li>Product line<p>Name of the product (a motorcycle part)</p></li>
    <li>Quantity<p>The number of units bought</p></li>
    <li>Unit price<p>The price of one unit</p></li>
    <li>Total<p>The total purchase price</p></li>
    <li>Payment<p>The method of payment for the purchase. This dataset has three types of payment: Credit card, cash or transfer</p></li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;"> Target field</b>
<ul>
    <li>Total</li>
</ul>
<b style="font-size: 1.5em; font-weight: bold;">Data source and licence</b>
<p>
There are various formats for a dataset: .csv, .json, .xlsx  etc. The dataset can be stored in different places, on your local machine or sometimes online.<br>

In this section, you will learn how to load a dataset into our Jupyter Notebook.<br>

In our case, the Motorcycle sales analysis Dataset is an online source, and it is in a CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.

<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX05AGEN2614-2023-01-01">https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX05AGEN2614-2023-01-01">CC0: Public Domain</a></li>
</ul>
<p>
This dataset is released under the CC0: Public Domain license, which allows you to copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
</p>
The Pandas library is a useful tool that enables us to read various datasets into a dataframe. Our Jupyter notebook platform comes with the <b>Pandas library</b> built in, so all we need to do is import Pandas, without installing it.
</p>


<b style="font-size: 1.5em; font-weight: bold;">Table of Contents</b>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#id1">Read and work with data</a></li>
    <li><a href="#id2">Basic Insight of Dataset</a></li>
</ol>

</div>

<hr>


In [ ]:
#! mamba install pandas -y
#! mamba install numpy -y
import pandas as pd
import numpy as np

<b style="font-size: 1.5em; font-weight: bold"><a name="id1" style="text-decoration: none;"><font color="black">1. Read and work with data</font></a></b>
<p>
We use the <code>pandas.read_csv()</code> function to read the csv file. Inside the parentheses, we put the file path in quotation marks so that pandas will read the file at that address into a dataframe. The file path can be either a URL or a local file path.<br>

You can also assign the dataset to any variable you create.

</p>


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX05AGEN/sales_data.csv">HERE</a> to download it.


In [ ]:
# Read the dataset directly from the URL below; to use a downloaded local copy instead, set path to the local file path
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX05AGEN/sales_data.csv"
df = pd.read_csv(path)
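
As noted above, <code>pd.read_csv()</code> is not limited to URLs. The sketch below is a minimal illustration that reads CSV text from an in-memory buffer instead. The two rows are made up for the example, the column names are assumed from the glossary, and the variable is called <code>sample</code> so that it does not overwrite <code>df</code>.

```python
import io
import pandas as pd

# Hypothetical CSV content for illustration only (not rows from the real dataset)
csv_text = """date,warehouse,client_type,product_line,quantity,unit_price,total,payment
2021-06-01,North,Retail,Engine,2,50.0,100.0,Cash
2021-06-01,Central,Wholesale,Frame & body,10,20.0,200.0,Transfer
"""

# read_csv accepts a URL, a local path, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.columns)  # column names taken from the first line of the text
```

Reading a local file works the same way: pass the file's path as a string in place of the buffer.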

After reading the dataset, we can use the <code>dataframe.head(n)</code> method to check the top n rows of the dataframe, where n is an integer. Conversely, <code>dataframe.tail(n)</code> will show you the bottom n rows of the dataframe.


In [ ]:
# show the first ten rows of the dataframe using the dataframe.head() method
df.head(10)
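
To see how <code>head()</code> and <code>tail()</code> pick rows, here is a small sketch on a made-up frame of 20 numbered rows (the frame <code>sample</code> is hypothetical and independent of <code>df</code>). It also shows that calling either method without an argument returns 5 rows.

```python
import pandas as pd

# Hypothetical frame with 20 numbered rows, for illustration only
sample = pd.DataFrame({"row": range(20)})

first_three = sample.head(3)   # rows 0, 1, 2
last_three = sample.tail(3)    # rows 17, 18, 19
default_head = sample.head()   # n defaults to 5 when omitted

print(first_three)
print(last_three)
print(len(default_head))
```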

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1:</b><br>
<b style="font-size: 1.2em">Check the bottom 10 rows of data frame "df".</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute


<details><summary>Click here for the solution</summary>

```python
print("The last 10 rows of the dataframe\n")
df.tail(10)
```

</details>


<b style="font-size: 1.2em; font-weight: bold;">Modify Headers</b>
<p>Let's print the headers of this dataset. To do that, run the command <code>df.columns</code></p>


In [ ]:
df.columns

<p>We can see that some headers have underscores in their names. We can replace them with spaces using the string method <code>replace()</code></p>


In [ ]:
df.columns = [col.replace('_',' ') for col in df.columns]
df.columns

<p>Awesome. We got rid of the underscores.</p>
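
The list comprehension is one way to rename headers. As a hedged alternative, the sketch below does the same replacement on a made-up, empty frame (<code>sample</code>, with column names assumed for illustration) using the <code>.str</code> accessor of the column index and the <code>rename()</code> method.

```python
import pandas as pd

# Hypothetical empty frame; only the column names matter here
sample = pd.DataFrame(columns=["client_type", "product_line", "unit_price"])

# Option 1: vectorised string method on the column Index
renamed_a = sample.copy()
renamed_a.columns = renamed_a.columns.str.replace("_", " ")

# Option 2: rename() applies a function to every column name and returns a new frame
renamed_b = sample.rename(columns=lambda col: col.replace("_", " "))

print(list(renamed_a.columns))
print(list(renamed_b.columns))
print(list(sample.columns))  # the original frame is left unchanged
```

Note that <code>rename()</code> returns a new dataframe rather than modifying the existing one, so its result has to be assigned to a variable.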


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b><br>
<b style="font-size: 1.2em">Now make the header names start with a capital letter</b>
    <br>
    <b style="font-size: 1.2em">Hint: use the method capitalize()</b><br>
    <b style="font-size: 1.2em; font-weight: bold;">Note that if you do not complete this task, the following code will not run</b>
</div>


In [ ]:
#Write code here


<details><summary>Click here for the solution</summary>

```python
df.columns = [col.capitalize() for col in df.columns]
df.columns
```

</details>


<b style="font-size: 1.2em; font-weight: bold;">Access to data in dataset</b><br>
<p>In the previous part we worked with the headers. Now we will work with the data. For example, to print only the <code>Product line</code> column of our dataset, we can use the command <code>df["Product line"]</code>.<br> <b style="font-size: 1.2em;font-weight:bold">REMEMBER: we changed the names of our headers.</b></p>



In [ ]:
df["Product line"]

<p>Now assume that we want to print only the rows of our dataset whose product line is 'Engine'.
The code below does that.
</p>


In [ ]:
df[df['Product line'] == 'Engine']
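
The expression inside the outer brackets is a boolean Series with one True/False value per row, and the dataframe keeps only the rows marked True. A minimal sketch on a made-up frame (<code>sample</code>, with hypothetical values) shows the mask itself and how two conditions can be combined:

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "Product line": ["Engine", "Engine", "Suspension & traction", "Engine"],
    "Warehouse": ["North", "Central", "North", "North"],
    "Quantity": [5, 12, 7, 3],
})

# The comparison produces a boolean Series, one True/False per row
mask = sample["Product line"] == "Engine"

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
engine_north = sample[(sample["Product line"] == "Engine") & (sample["Warehouse"] == "North")]

print(mask.tolist())
print(engine_north)
```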

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3:</b><br>
<b style="font-size: 1.2em">Find the last ten purchases in which the quantity bought was greater than 30</b>
    <p style="font-size: 1.2em">Hint: Don't forget about the <code>df.tail()</code> method</p>
</div>


In [ ]:
#Write code here


<details><summary>Click here for the solution</summary>

```python
df[df['Quantity'] > 30].tail(10)
```

</details>


<b style="font-size: 1.2em; font-weight: bold;">Describe</b>
<p>If we would like to get a statistical summary of each numeric column, e.g. count, mean, standard deviation, etc., we use the describe method: <code>df.describe()</code>. This method provides various summary statistics, excluding NaN (Not a Number) values.</p>


In [ ]:
df.describe()

<p>If we need a specific statistic, such as the maximum or the average value of a specific column, we should specify both the column and the statistic. The command <code>df["Quantity"].describe()["max"]</code> shows the largest quantity bought in a single purchase</p>


In [ ]:
df["Quantity"].describe()["max"]
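
For a single column, <code>describe()</code> returns a Series indexed by the statistic names, which is why <code>["max"]</code> works. The sketch below, on a made-up frame (<code>sample</code>, hypothetical values), compares that lookup with calling the statistic methods directly.

```python
import pandas as pd

# Hypothetical values for illustration only
sample = pd.DataFrame({
    "Quantity": [5, 12, 7, 3],
    "Unit price": [10.0, 20.0, 30.0, 40.0],
})

# Series indexed by count, mean, std, min, 25%, 50%, 75%, max
summary = sample["Quantity"].describe()

max_from_describe = summary["max"]          # look up one statistic by name
max_direct = sample["Quantity"].max()       # same value, computed directly
mean_direct = sample["Unit price"].mean()   # average of another column

print(summary.index.tolist())
print(max_from_describe, max_direct, mean_direct)
```

Calling a method such as <code>max()</code> directly computes only that one statistic, while <code>describe()</code> computes the whole summary first.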

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #4:</b><br>
<b style="font-size: 1.2em">Find the lowest unit price</b>
</div>


In [ ]:
#Write code here


<details><summary>Click here for the solution</summary>

```python
df["Unit price"].describe()["min"]
```

</details>


<b style="font-size: 1.5em; font-weight: bold;"><a name="id2" style="text-decoration: none;"><font color="black">2. Basic Insight of Dataset</font></a></b>


<b style="font-size: 1.2em; font-weight: bold;">Data types</b>
<p>
Data has a variety of types.<br>

The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. To learn more about each attribute, it is always good to know the data type of each column. In Pandas, <code>df.dtypes</code> returns a Series with the data type of each column.
</p>


In [ ]:
# check the data type of data frame "df" by .dtypes
df.dtypes

<p>As shown above, the data type of "Quantity" is int64, while "Unit price" and "Total" are float64.
These data types can be changed; we will learn how to accomplish this in a later module.</p>
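
Knowing the data types also lets us pick columns by type. A minimal sketch, assuming a made-up frame (<code>sample</code>) with one text column and two numeric columns, uses <code>select_dtypes()</code> to keep only the numeric ones:

```python
import pandas as pd

# Hypothetical values for illustration only
sample = pd.DataFrame({
    "Product line": ["Engine", "Frame & body"],
    "Quantity": [5, 12],
    "Unit price": [10.5, 20.0],
})

# One data type per column
print(sample.dtypes)

# Keep only integer and float columns
numeric_only = sample.select_dtypes(include="number")
print(list(numeric_only.columns))
```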


<b style="font-size: 1.2em; font-weight: bold;">Info</b><br>
Another method you can use to check your dataset is <code>dataframe.info()</code>


It provides a concise summary of your DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.


In [ ]:
# look at the info of "df"
df.info()
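
The non-null counts that <code>info()</code> prints can also be obtained as values. The sketch below uses a made-up frame (<code>sample</code>) with one missing value per column; it captures the <code>info()</code> text in a buffer and gets the same counts with <code>count()</code>.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
sample = pd.DataFrame({
    "Quantity": [5, np.nan, 7],
    "Payment": ["Cash", "Transfer", None],
})

# info() prints by default; buf= redirects the summary into a text buffer
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()

# count() returns the number of non-null values per column as a Series
non_null = sample.count()

print(info_text)
print(non_null)
```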

<b style="font-size: 1.2em; font-weight: bold;">Save Dataset</b>
<p>
Correspondingly, Pandas enables us to save the dataset to csv. To do so, call the <code>dataframe.to_csv()</code> method and put the file path and name, in quotation marks, inside the parentheses.
</p>
<p>
For example, to save the dataframe <b>df</b> as <b>motorcycles.csv</b> on your local machine, you can use the syntax below, where <code>index=False</code> means the row labels will not be written.
</p>


In [ ]:
#Run this code to save your dataset
df.to_csv("motorcycles.csv",index=False)

<b style="font-size: 1.2em; font-weight: bold;">Read/Save Other Data Formats</b>

We can also read and save other file formats, using functions similar to **`pd.read_csv()`** and **`df.to_csv()`**. The functions are listed in the following table:

| Data Format  |        Read       |            Save |
| ------------ | :---------------: | --------------: |
| csv          |  `pd.read_csv()`  |   `df.to_csv()` |
| json         |  `pd.read_json()` |  `df.to_json()` |
| excel        | `pd.read_excel()` | `df.to_excel()` |
| hdf          |  `pd.read_hdf()`  |   `df.to_hdf()` |
| sql          |  `pd.read_sql()`  |   `df.to_sql()` |
| ...          |        ...        |             ... |
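
As a hedged illustration of the read/save pairs in the table, the sketch below round-trips a made-up frame (<code>sample</code>, hypothetical values) through CSV and JSON text in memory, so no files are written. When no path is given, <code>to_csv()</code> and <code>to_json()</code> return the text as a string.

```python
import io
import pandas as pd

# Hypothetical values for illustration only
sample = pd.DataFrame({
    "Product line": ["Engine", "Frame & body"],
    "Quantity": [5, 12],
})

# CSV round trip: to_csv() with no path returns the CSV text
csv_text = sample.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip: orient="records" writes one JSON object per row
json_text = sample.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(json_text)
print(from_csv)
print(from_json)
```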


<b style="font-size: 1.2em; font-weight: bold;">Excellent! You have just completed the Lab #1!</b>


### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/victor_dyrenko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX05AGEN2614-2023-01-01">Victor Dyrenko</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX05AGEN2614-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX05AGEN2614-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version |     Changed By   | Change Description                                         |
| ----------------- | ------- | ---------------- | ---------------------------------------------------------- |
| 2023-03-24        | 1.0       | Victor Dyrenko   | Finished lab                                               |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
