<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Motorcycle sales analysis

# *Lab 2. Data wrangling*
Estimated time needed: **45 minutes**

## Objectives

After completing this lab you will be able to:

*   Correct data format
*   Normalize data with two methods
*   Bin data into intervals
*   Work with indicator (dummy) variables
*   Group and sort data


<details><summary><b style="font-size: 1.5em; font-weight: bold;">Click here to see content, description of dataset, source of dataset and licence</b></summary>
<br/>
<b style="font-size: 1.2em; font-weight: bold;">Content</b>
<p>You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.</p>

<b style="font-size: 1.2em; font-weight: bold;">Dataset Glossary (Column-wise)</b>
<ul>
    <li>Date<p>The date when the client bought the products</p></li>
    <li>Warehouse<p>The warehouse location</p></li>
    <li>Client type<p>How the client bought the products. This column can only be Retail or Wholesale</p></li>
    <li>Product line<p>Name of the product (motorcycle part)</p></li>
    <li>Quantity<p>The number of products bought</p></li>
    <li>Unit price<p>Cost of one product</p></li>
    <li>Total<p>The total purchase price</p></li>
    <li>Payment<p>The method of payment for the purchase. This dataset has three types of payment: Credit card, Cash or Transfer</p></li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Target field</b>
<ul>
    <li>Total</li>
</ul>

<b style="font-size: 1.2em; font-weight: bold;">Data source and licence</b>
<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">CC0: Public Domain</a></li>
</ul>
<p>
This dataset is released under the CC0: Public Domain license, which allows you to copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
</p>
</details>


<b style="font-size: 1.5em; font-weight: bold;">Table of Contents</b>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#id1">Identify missing value and correct data format</a></li>
    <li><a href="#id2">Data normalization (centering/scaling)</a></li>
    <li><a href="#id3">Binning</a></li>
    <li><a href="#id4">Indicator variable</a></li>
    <li><a href="#id5">Working with dummy variables and grouping</a></li>
    <li><a href="#id6">Sorting</a></li>
</ol>

</div>

<hr>


<b style="font-size: 1.5em; font-weight: bold;">What is the purpose of data wrangling?</b>


Data wrangling is the process of converting data from the initial format to a format that may be better for analysis.


Import necessary libraries


In [ ]:
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2

! mamba install scikit-learn -y

In [ ]:
import matplotlib as plt
from matplotlib import pyplot
from sklearn import preprocessing
import pandas as pd
import numpy as np

<b style="font-size: 1.5em; font-weight: bold;"><a name="id1" style="text-decoration: none"><font color="black">1. Identify missing value and correct data format</font></a></b>


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX07DAEN/motorcycles.csv">HERE</a> to access the file.


In [ ]:
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX07DAEN/motorcycles.csv"

Use the Pandas method <b>read_csv()</b> to load the data from the web address.


In [ ]:
df = pd.read_csv(path)

Use the method <b>head()</b> to display the first five rows of the dataframe.


In [ ]:
# To see what the data set looks like, we'll use the head() method.
df.head()

<p>Now let's print information about the DataFrame, including the index dtype and columns, non-null counts and memory usage, using the method <code>df.info()</code>.</p>


In [ ]:
df.info()

<p>We see that our dataset has 1000 records and each column has 1000 non-null values. This means that our dataset has no missing values.</p>
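<p>A quick way to double-check the conclusion drawn from <code>df.info()</code> is to count the missing values per column directly. The sketch below uses a tiny made-up frame (the values are illustrative assumptions, not rows from the lab dataset) so that it can run on its own:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value, for illustration only
sample = pd.DataFrame({
    "Quantity": [8, 9, np.nan, 5],
    "Payment": ["Cash", "Transfer", "Cash", "Credit card"],
})

# isnull() marks missing cells; sum() counts them per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

<p>Applied to the lab data, <code>df.isnull().sum()</code> should report 0 for every column if there are indeed no missing values.</p>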


<b style="font-size: 1.2em; font-weight: bold;">Correct data format</b>
<p>The next step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</p>

In Pandas, we use:

<p><b>.dtypes</b> to check the data types</p>
<p><b>.astype()</b> to change the data type</p>


<p>Let's list the data types for each column</p>


In [ ]:
df.dtypes

<p>As we can see above, the column <code>Date</code> is not of the correct data type: it was read as text rather than as a date.</p>
<p>We can change the data type of the column <code>Date</code> using the function <code><a href="https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">pandas.to_datetime()</a></code> from the Pandas library. This function takes one required parameter: the data that we want to cast to the datetime type.</p>


In [ ]:
df['Date'] = pd.to_datetime(df['Date'])
df.dtypes

<p>Let's print unique values of columns <code>Warehouse, Client type, Product line, Payment</code></p>


In [ ]:
print('Unique values for column Warehouse:\n',df['Warehouse'].unique())
print('Unique values for column Client type:\n',df['Client type'].unique())
print('Unique values for column Product line:\n',df['Product line'].unique())
print('Unique values for column Payment:\n',df['Payment'].unique())

<p>You can see that each of these columns has fewer than 10 unique values, while our dataset has 1000 records. Such columns are better stored with the type <code>category</code>. Let's change the type of the columns 'Payment' and 'Client type'.</p>
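<p>One practical reason to prefer <code>category</code> for columns with few distinct values is memory: each distinct string is stored once and the rows hold small integer codes. The sketch below compares the two representations on a made-up column (the values are assumptions for illustration, not the lab data):</p>

```python
import pandas as pd

# Hypothetical column: 1000 rows but only three distinct values
payment = pd.Series(["Cash", "Credit card", "Transfer", "Cash"] * 250)

# deep=True also counts the memory used by the strings themselves
as_object = payment.memory_usage(deep=True)
as_category = payment.astype("category").memory_usage(deep=True)

print("original bytes:", as_object)
print("category bytes:", as_category)
print("categories:", list(payment.astype("category").cat.categories))
```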


In [ ]:
df['Payment'] = df['Payment'].astype('category')
df['Client type'] = df['Client type'].astype('category')
df.dtypes

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1:</b><br>
<b style="font-size: 1.2em">Change data type of columns 'Warehouse' and 'Product line' to 'category'</b>

</div>


In [ ]:
#Write your code here


<details><summary>Click here for the solution</summary>

```python
df['Warehouse'] = df['Warehouse'].astype('category')
df['Product line'] = df['Product line'].astype('category')
df.dtypes
```

</details>


<b>Wonderful!</b> Now we have finally obtained the cleaned dataset with all data in its proper format.


<b style="font-size: 1.5em; font-weight: bold;"><a name="id2" style="text-decoration: none"><font color="black">2. Data Normalization</font></a></b>

<p>Why normalization?</p>

<p>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
</p>
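<p>The three typical normalizations mentioned above can be sketched with plain pandas arithmetic. The numbers are made up for illustration, and the variable names are assumptions rather than part of the lab:</p>

```python
import pandas as pd

# Hypothetical totals, for illustration only
total = pd.Series([100.0, 250.0, 400.0, 1000.0])

# Simple feature scaling: divide by the maximum, so the largest value becomes 1
simple_scaled = total / total.max()

# Min-max scaling: values range from exactly 0 to exactly 1
min_max_scaled = (total - total.min()) / (total.max() - total.min())

# Z-score: the result has mean 0 and standard deviation 1
z_scored = (total - total.mean()) / total.std()

print(pd.DataFrame({"simple": simple_scaled,
                    "min_max": min_max_scaled,
                    "z_score": z_scored}))
```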

<b>Example</b>

<p>To demonstrate normalization, let's say we want to scale the columns "Total" and "Unit price".</p>
<p><b>Target:</b> we would like to normalize those variables so their values range from 0 to 1</p>
<p><b>Approach:</b> Let's do normalization with two methods:
    <li>replace original value by (original value)/(maximum value)</li>
    <li>replace original value using method <code><a href="https://scikit-learn.org/stable/modules/preprocessing.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01#normalization">normalize</a></code> from sklearn library</li>
</p>
<p>Before normalization we save original data from columns "Total" and "Unit price"</p>


In [ ]:
#save data
Total = df['Total']
UnitPrice = df['Unit price']

<b style="font-size: 1.2em; font-weight: bold">First method</b>


In [ ]:
# replace (original value) by (original value)/(maximum value)
Total = Total/Total.max()
Total

<b style="font-size: 1.2em; font-weight: bold">Second method</b>
<p>In this case you need to use the function <code>normalize()</code>. To use it, you need to import the module <code>preprocessing</code> from the sklearn library. This module contains the function <code>normalize()</code>.
The first parameter of this function is your data and the second parameter is 'norm', which can be 'l1', 'l2' or
'max'. Let's use the 'max' norm.</p>
<p>But the second method has a small problem. Our data 'Total' is a 1D array, while <code>normalize()</code> requires a 2D array. We can wrap the values of column 'Total' in a list, <code>[df['Total'].values]</code>, to get a 2D array with a single row and normalize this column.</p>
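<p>The choice of norm decides what each row is divided by: 'max' uses the largest absolute value, 'l1' the sum of absolute values, and 'l2' the Euclidean length. The sketch below reproduces these three divisions with NumPy on a made-up single-row array instead of calling sklearn, so it only illustrates the arithmetic:</p>

```python
import numpy as np

# Hypothetical one-row 2D array, shaped like [df['Total'].values]
row = np.array([[3.0, 4.0, 12.0]])

# Each row is divided by a different quantity depending on the norm
max_scaled = row / np.abs(row).max(axis=1, keepdims=True)     # like norm='max'
l1_scaled = row / np.abs(row).sum(axis=1, keepdims=True)      # like norm='l1'
l2_scaled = row / np.linalg.norm(row, axis=1, keepdims=True)  # like norm='l2'

print(max_scaled)  # largest entry becomes 1
print(l1_scaled)   # absolute values sum to 1
print(l2_scaled)   # Euclidean length becomes 1
```

<p>For non-negative data, only the 'max' norm gives the same numbers as dividing by the maximum value; the 'l1' and 'l2' results are scaled differently.</p>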


In [ ]:
Total = [df['Total'].values]
Total[0][0:5]


<p>Use the method <code>normalize</code>. Print the first five elements</p>


In [ ]:
Total = preprocessing.normalize(Total, norm='max')
Total[0][0:5]

<p>Compare this result with the result from the first method. You can see that they are equal.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b><br>
<b style="font-size: 1.2em">Normalize data from column 'Unit price' with two methods</b>
<li><b style="font-size: 1.2em">First method: replace (original value) by (original value)/(maximum value). Print the first ten normalized values</b></li>
<li><b style="font-size: 1.2em">Second method: (using sklearn library). Use norm 'l2' and print the first ten normalized values</b></li>
<b style="font-size: 1.2em; font-weight: bold;">DON'T SAVE NORMALIZED DATA IN YOUR DATASET</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 
print('FIRST METHOD')


<details><summary>Click here for the solution</summary>

```python
UnitPrice = UnitPrice/UnitPrice.max()
UnitPrice.head(10)
```
</details>


In [ ]:
# Write your code below and press Shift+Enter to execute 
print('SECOND METHOD')


<details><summary>Click here for the solution</summary>

```python
UnitPrice = [df['Unit price'].values]
UnitPrice = preprocessing.normalize(UnitPrice, norm='l2')
UnitPrice[0][0:10]

```

</details>


<b style="font-size: 1.5em; font-weight: bold;"><a name="id3" style="text-decoration:none"><font color="black">3. Binning</font></a></b>
<p>Why binning?</p>
<p>
    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
</p>

<b>Example: </b>

<p>In our dataset, "Quantity" is a numerical variable ranging from 1 to 40 with 18 unique values. What if we only want to know how many purchases fall into each of the ranges 'less than 11', 'from 11 to 20', 'from 21 to 30' and 'more than 30'? We can rearrange the values into four 'bins' to simplify the analysis.</p>

<p>We will use the pandas method 'cut' to segment the 'Quantity' column into 4 bins.</p>


<b style="font-size: 1.2em; font-weight:bold">Example of Binning Data In Pandas</b>


Let's plot the histogram of 'Quantity' to see what the distribution of quantity looks like.


In [ ]:
plt.pyplot.hist(df["Quantity"])

# set x/y labels and plot title
plt.pyplot.xlabel("quantity")
plt.pyplot.ylabel("count")
plt.pyplot.title("Quantity bins")

<p>We would like 4 bins of equal width, so we use numpy's <code>linspace(start_value, end_value, numbers_generated)</code> function.</p>
<p>Since we want to include the minimum value of Quantity, we set start_value = df["Quantity"].min().</p>
<p>Since we want to include the maximum value of Quantity, we set end_value = df["Quantity"].max().</p>
<p>Since we are building 4 bins of equal length, there should be 5 dividers, so numbers_generated = 5.</p>
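<p>To see how the dividers relate to the bins, here is a minimal sketch that assumes "Quantity" ranges from 1 to 40, as described earlier. The hard-coded range is an assumption made so the snippet runs without the dataset:</p>

```python
import numpy as np

# Assumed range 1..40 in place of df["Quantity"].min() and .max()
edges = np.linspace(1, 40, 5)

print(edges)           # the 5 dividers
print(len(edges) - 1)  # number of bins between the dividers
print(np.diff(edges))  # width of each bin
```

<p>Under this assumption the dividers are not whole numbers, so labels such as '[0-10]' describe the bins only approximately.</p>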


We build a bin array from the minimum value to the maximum value using the settings above. The values determine where one bin ends and the next begins.


In [ ]:
bins = np.linspace(df["Quantity"].min(), df["Quantity"].max(), 5)
bins

We set group names:


In [ ]:
group_names = ['[0-10]', '[11-20]', '[21-30]','[31-inf]']

We apply the function "cut" to determine which bin each value of `df['Quantity']` belongs to, and add a new column 'Quantity binned' with this information to our dataset.


In [ ]:
df['Quantity binned'] = pd.cut(df['Quantity'], bins, labels=group_names,include_lowest=True)
df[['Quantity','Quantity binned']].head(20)

Let's see the number of purchases in each bin. Note that we need to call `sort_index()` because the method `value_counts()` sorts the groups by count in descending order, which would mix up our intervals.
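<p>The effect of <code>sort_index()</code> is easiest to see on a small example. The quantities, bin edges and labels below are made up for illustration and are not the ones used in the lab:</p>

```python
import pandas as pd

# Hypothetical quantities, for illustration only
quantity = pd.Series([2, 15, 16, 18, 25, 26, 35])
binned = pd.cut(quantity, bins=[0, 10, 20, 30, 40],
                labels=["low", "medium", "high", "very high"])

by_count = binned.value_counts()                  # most frequent bin first
by_interval = binned.value_counts().sort_index()  # bins in their defined order

print(by_count)
print(by_interval)
```

<p>Only the second result lines up with a list of labels written in bin order, which matters when the counts are passed to a bar chart together with those labels.</p>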


In [ ]:
values = df['Quantity binned'].value_counts().sort_index()
values

Let's plot the distribution of each bin:


In [ ]:
pyplot.bar(group_names, values)

# set x/y labels and plot title
plt.pyplot.xlabel("quantity")
plt.pyplot.ylabel("count")
plt.pyplot.title("quantity bins")

<p>
    Look at the dataframe above carefully. You will find that the last column provides the bins for 'quantity' based on 4 categories ("[0-10]", "[11-20]", "[21-30]", "[31-inf]" ). 
</p>
<p>
    We successfully narrowed down the intervals to 4!
</p>


<b style="font-size: 1.2em; font-weight:bold">Bins Visualization</b><br>
Normally, a histogram is used to visualize the distribution of bins we created above. 


In [ ]:
# draw histogram of attribute "Quantity" with bins = 4
plt.pyplot.hist(df["Quantity"], bins = 4)
# set x/y labels and plot title
plt.pyplot.xlabel("quantity")
plt.pyplot.ylabel("count")
plt.pyplot.title("quantity bins")

The plot above shows the binning result for the attribute "Quantity".


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3:</b><br>
<b style="font-size: 1.2em">Similar to before, create a binning data for the column "Unit price"</b>
    <br>
    <b style="font-size: 1.5em; font-weight: bold;">Your tasks</b>
    <li><b style="font-size: 1.2em">Plot the histogram of Unit price to see what the distribution of price looks like.</b></li>
    <li><b style="font-size: 1.2em">Build a bin array from the minimum value to the maximum value by using the function `np.linspace()`</b></li>
    <li><b style="font-size: 1.2em">Create three groups of data: 'Cheap', 'Medium cost', 'Expensive'</b></li>
    <li><b style="font-size: 1.2em">Apply the function "cut" to determine which group each value of `df['Unit price']` belongs to.</b></li>
    <li><b style="font-size: 1.2em">Add to your dataset a new column `Binned price` with data from the previous step</b></li>
    <li><b style="font-size: 1.2em">Don't forget to use the method <code>sort_index()</code> to sort the intervals</b></li>
    <li><b style="font-size: 1.2em">Plot the data from `Binned price` using a bar chart</b></li>
    <li><b style="font-size: 1.2em">Plot the data from `Unit price` using a histogram. Don't forget to set the number of bins</b></li>
</div>


In [ ]:
#Plot the histogram of Unit price


<details><summary>Click here for the solution</summary>

```python
#Display our data
plt.pyplot.hist(df["Unit price"])
plt.pyplot.xlabel("price")
plt.pyplot.ylabel("count")
plt.pyplot.title("price bins")
```
</details>


In [ ]:
#Create a bins array, groups, cut data from column 'Unit price' in 3 groups 
#and save them in your dataset in new column 'Binned price'. Check whether groups are not mixed up


<details><summary>Click here for the solution</summary>

```python
#Create a bins array
bins = np.linspace(df["Unit price"].min(), df["Unit price"].max(), 4)
bins

#Create groups
group_names = ['Cheap', 'Medium cost', 'Expensive']

#Cutting our data in 3 groups and save it in our dataset
df['Binned price'] = pd.cut(df['Unit price'], bins, labels=group_names, include_lowest=True)
df[['Unit price','Binned price']].head(20)
df['Binned price'].value_counts().sort_index()
```
</details>


In [ ]:
#Plot each bin using bar diagram


<details><summary>Click here for the solution</summary>

```python
pyplot.bar(group_names, df["Binned price"].value_counts().sort_index())
plt.pyplot.xlabel("price")
plt.pyplot.ylabel("count")
plt.pyplot.title("price bins")
```
</details>


In [ ]:
#Plot each bin using histogram


<details><summary>Click here for the solution</summary>

```python
pyplot.hist(df['Unit price'], bins=3)
plt.pyplot.xlabel("price")
plt.pyplot.ylabel("count")
plt.pyplot.title("price bins")
```

</details>


<b style="font-size: 1.5em; font-weight: bold;"><a name="id4" style="text-decoration:none"><font color="black">4. Indicator Variable (or Dummy Variable)</font></a></b>
<p>What is an indicator variable?</p>
<p>
    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. 
</p>

<b>Why do we use indicator variables?</b>

<p>
    We use indicator variables so we can use categorical variables for regression analysis in the later modules.
</p>
<b>Example</b>
<p>
    We see the column "Payment" has three unique values: "Cash", "Credit card" and "Transfer". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "Payment" to indicator variables.
</p>

<p>
    We will use pandas' method 'get_dummies' to assign numerical values to different categories of Payment. 
</p>
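<p>As a minimal sketch of what <code>get_dummies</code> produces, the snippet below encodes a made-up payment column (the values are assumptions for illustration). Passing <code>dtype=int</code> is a choice made here to get 0/1 integers regardless of the installed pandas version:</p>

```python
import pandas as pd

# Hypothetical payments, for illustration only
payment = pd.Series(["Cash", "Transfer", "Credit card", "Cash"], name="Payment")

# dtype=int requests 0/1 integers; without it, recent pandas versions return True/False
dummies = pd.get_dummies(payment, dtype=int)
print(dummies)
```

<p>Each row has a 1 in exactly one of the new columns, the one that matches its original category.</p>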


In [ ]:
df.columns

Get the indicator variables and assign them to the data frame "dummy_variable_1":


In [ ]:
dummy_variable_1 = pd.get_dummies(df["Payment"])
dummy_variable_1.head()

In the dataframe, the values of column 'Payment' are now represented by the columns 'Cash', 'Credit card' and 'Transfer', which contain 0s and 1s.


In [ ]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "Payment" from "df"
df.drop("Payment", axis = 1, inplace=True)

In [ ]:
df.head()

The last three columns are now the indicator variable representation of the payment variable. They're all 0s and 1s now.


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #4:</b><br>
<b style="font-size: 1.2em">Similar to before, create an indicator variable for the column "Client type", merge the new dataframe with original dataframe and then drop the column "Client type"</b>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
dummy_variable_2 = pd.get_dummies(df["Client type"])
dummy_variable_2.head()

# merge data frame "df" and "dummy_variable_2" 
df = pd.concat([df, dummy_variable_2], axis=1)

# drop original column "Client type" from "df"
df.drop("Client type", axis = 1, inplace=True)

df.head()
```

</details>


<b style="font-size: 1.5em; font-weight: bold;"><a name="id5" style="text-decoration:none"><font color="black">5. Working with dummy variables. Grouping</font></a></b>


<p>In the next tasks we need to split data from column 'Total' into several intervals:</p>
<li>(0-99]</li>
<li>(99-499]</li>
<li>(499-999]</li>
<li>(999-1499]</li>
<li>(1499-inf]</li>
<p>And save this information in the dataset. Let's create the ranges and labels.</p>
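<p>With explicit edges, <code>pd.cut</code> builds intervals that are open on the left and closed on the right by default, which is what the notation (0-99] expresses. The sketch below uses made-up totals placed on and around the edges; the upper bound 2000 is an assumption standing in for the maximum of the real column:</p>

```python
import pandas as pd

# Hypothetical totals chosen to sit on and around the bin edges
totals = pd.Series([0, 50, 99, 100, 499, 1500])
edges = [0, 99, 499, 999, 1499, 2000]  # 2000 is an assumed upper bound
labels = ['(0-99]', '(99-499]', '(499-999]', '(999-1499]', '(1499-inf]']

ranged = pd.cut(totals, edges, labels=labels)
print(pd.DataFrame({"Total": totals, "Total ranged": ranged}))
```

<p>A value equal to a right edge (99 or 499) stays in the lower interval, while a value of exactly 0 falls outside the first interval and becomes NaN unless <code>include_lowest=True</code> is passed.</p>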


In [ ]:
Ranges = [0, 99, 499, 999, 1499, df['Total'].max()]
labels = ['(0-99]','(99-499]','(499-999]','(999-1499]','(1499-inf]']
df['Total ranged'] = pd.cut(df['Total'], Ranges, labels = labels)
df[['Total','Total ranged']].head(20)

<p>Good. Now we are ready for the next exercises.</p>


<p>As you remember, in the previous task we created dummy variables for the column 'Payment' and, as a result, replaced the column 'Payment' with three new columns: 'Cash', 'Credit card' and 'Transfer'. The values in these columns are 1 and 0, meaning respectively that the customer did or did not pay this way.</p>


In [ ]:
print('Not paid by cash')
df.loc[df['Cash'] == 0,['Warehouse', 'Total']]

In [ ]:
print('Paid by cash')
df.loc[df['Cash'] == 1,['Warehouse', 'Total']]

<p>Now let's print the count of purchases for each payment method. We need to group the data by the column 'Total ranged' using the method <code>groupby()</code> and print the count of each group.</p>


In [ ]:
col = ['Cash', 'Credit card', 'Transfer']
D = dict()
for c in col:
    d = df.loc[df[c] == 1]
    D[c] = d.groupby(d['Total ranged'])['Total ranged'].count()
    print('Total count payment by ', c, ': ', D[c])

<p>Now we see the intervals of column 'Total' and count of payment by each method which belongs to each interval.</p>


<p>Now we need to visualize the data</p>


In [ ]:
pyplot.bar(labels,D['Cash'])

# set x/y labels and plot title
plt.pyplot.xlabel("interval")
plt.pyplot.ylabel("count")
plt.pyplot.title("Cash")

In [ ]:
pyplot.bar(labels, D['Credit card'])

# set x/y labels and plot title
plt.pyplot.xlabel("interval")
plt.pyplot.ylabel("count")
plt.pyplot.title("Credit card")

In [ ]:
pyplot.bar(labels, D['Transfer'])

# set x/y labels and plot title
plt.pyplot.xlabel("interval")
plt.pyplot.ylabel("count")
plt.pyplot.title("Transfer")

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #5:</b><br>
<b style="font-size: 1.2em">Similar to before, work with the column 'Client type'. Remember that you replaced this column with two new columns, 'Retail' and 'Wholesale', in the previous task.</b><br>
<b style="font-size: 1.5em; font-weight: bold;">Your tasks</b>
    <li><b style="font-size: 1.2em">Print the count of records in each interval for each 'Client type'</b></li>
    <li><b style="font-size: 1.2em">Visualize the data</b></li>
</div>


In [ ]:
#Find count of data for each client type



<details><summary>Click here for the solution</summary>

```python
client = ['Retail', 'Wholesale']
D = dict()
for c in client:
    d = df.loc[df[c] == 1]
    D[c] = d.groupby(d['Total ranged'])['Total ranged'].count()
    print('Total count of client ', c, ': ', D[c])

```
</details>


In [ ]:
#Visualizing data for retail client


<details><summary>Click here for the solution</summary>

```python
pyplot.bar(labels, D['Retail'])

# set x/y labels and plot title
plt.pyplot.xlabel("interval")
plt.pyplot.ylabel("count")
plt.pyplot.title("Retail")
```
</details>


In [ ]:
#Visualizing data for wholesale client


<details><summary>Click here for the solution</summary>

```python
pyplot.bar(labels, D['Wholesale'])

# set x/y labels and plot title
plt.pyplot.xlabel("interval")
plt.pyplot.ylabel("count")
plt.pyplot.title("Wholesale")
```
</details>


<b  style="font-size:1.2em; font-weight:bold">Grouping by two fields</b>


<p>Let's work with the column 'Product line'. Print the unique values in this column</p>


In [ ]:
df['Product line'].unique()

<p>Let's print how many records belong to each of the intervals above for each product line. We can group the data by 'Product line' and 'Total ranged'.</p>


In [ ]:
product = df.groupby(['Product line','Total ranged'])[['Total']].count()
product

<p>We see each product line, intervals and count data in each interval</p>


<p>We can also use a cross table instead of grouping to find the count of records in each interval for each Product line. Use the function <code>pd.crosstab()</code>.</p>
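<p>Grouping and cross-tabulation produce the same counts in two layouts: a long list of pairs versus a wide table. The sketch below shows both on a few made-up records (the rows are assumptions for illustration, with plain strings in place of the binned column):</p>

```python
import pandas as pd

# Hypothetical records, for illustration only
sales = pd.DataFrame({
    "Product line": ["Engine", "Engine", "Frame & body", "Engine"],
    "Total ranged": ["(0-99]", "(99-499]", "(0-99]", "(0-99]"],
})

# Long format: one entry per (product line, interval) pair that occurs
grouped = sales.groupby(["Product line", "Total ranged"]).size()

# Wide format: product lines as rows, intervals as columns, 0 where nothing occurs
table = pd.crosstab(index=sales["Product line"], columns=sales["Total ranged"])

print(grouped)
print(table)
```

<p>The wide layout is convenient as input for a heat map, because rows and columns already correspond to the two axes.</p>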


In [ ]:
ProductLine = pd.crosstab(index = [df['Product line']], columns = [df['Total ranged']])
ProductLine

<p>Let's use a heat map to visualize the relationship between Product line and Total ranged</p>


In [ ]:
pyplot.pcolor(ProductLine, cmap = 'RdBu')
pyplot.colorbar()

The default labels convey no useful information to us. Let's change that:


In [ ]:
fig, ax = pyplot.subplots()
im = ax.pcolor(ProductLine, cmap='RdBu')

#label names
row_labels = labels
col_labels = ProductLine.index

#move ticks and labels to the center
ax.set_xticks(np.arange(ProductLine.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(ProductLine.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
pyplot.xticks(rotation=90)

fig.colorbar(im)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #6:</b><br>
<b style="font-size: 1.2em">Work with column 'Warehouse'</b><br>
<b style="font-size: 1.5em; font-weight: bold;">Your tasks</b>
    <li><b style="font-size: 1.2em">Print unique values from column 'Warehouse'. Use the method <code>unique()</code></b></li>
    <li><b style="font-size: 1.2em">Group data by column 'Warehouse' and 'Total ranged' and print count data in each group using method <code>groupby()</code></b></li>
    <li><b style="font-size: 1.2em">Show data in cross table using method <code>crosstab()</code></b></li>
    <li><b style="font-size: 1.2em">Visualize the relationship between Warehouse and Total ranged using heat map. Names of columns and rows must not be default</b></li>
    
</div>


In [ ]:
#Find unique values in column 'Warehouse' and print how many count of data belongs to intervals above for each warehouse


<details><summary>Click here for the solution</summary>

```python
print('Unique values for warehouse: \n',df['Warehouse'].unique())
warehouse = df.groupby(['Warehouse','Total ranged'])[['Total']].count()
warehouse
```

</details>


In [ ]:
#Display data in table


<details><summary>Click here for the solution</summary>

```python
Warehouse = pd.crosstab(index = [df['Warehouse']], columns = [df['Total ranged']])
Warehouse
```

</details>


In [ ]:
#Display data with heat map


<details><summary>Click here for the solution</summary>

```python
#Display data with heat map
fig, ax = pyplot.subplots()
im = ax.pcolor(Warehouse, cmap='RdBu')

#label names
row_labels = labels
col_labels = Warehouse.index

#move ticks and labels to the center
ax.set_xticks(np.arange(Warehouse.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(Warehouse.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
pyplot.xticks(rotation=90)

fig.colorbar(im)
```

</details>


<b style="font-size: 1.5em; font-weight: bold;"><a name="id6" style="text-decoration:none"><font color="black">6. Sorting</font></a></b>


<p>As a rule, the data in a dataset is not ordered. We may want to see the records sorted by date, count or even product name. We can do this using the method <code>sort_values()</code>.</p>


<p>Let's sort our data by the column 'Unit price'</p>


In [ ]:
df.sort_values(by='Unit price').head(10)

<p>You can see that the data is sorted by the column 'Unit price'. To sort the data in descending order, pass the parameter <code>ascending=False</code>.</p>


In [ ]:
df.sort_values(by = 'Unit price',ascending = False).head(10)

<p>If we sort by string values, the data is sorted alphabetically</p>


In [ ]:
df.sort_values(by='Warehouse')

<p>We can create a custom sorting as well. For example, we want to sort the data in column 'Product line' by a specific rule. As you remember from the previous task, we have six unique values in column 'Product line':
    <li>Miscellaneous</li>
    <li>Breaking system</li>
    <li>Suspension &amp; traction</li>
    <li>Frame &amp; body</li>
    <li>Engine</li>
    <li>Electrical system</li>
</p>
<p>We need to display the records with product 'Miscellaneous' first, then the records with product 'Breaking system' and so on.</p>
<p>The method <code>sort_values()</code> has a parameter <code>key</code> which defines a custom rule for sorting the data. Let's write the function <code>custom_sorting(product)</code>. This function takes the whole 'Product line' column as its parameter and returns a column of integer values that is used for sorting.</p>
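<p>The function passed as <code>key</code> is called once with the entire column, not once per value, and must return a column of the same length. The sketch below shows this on a few made-up records; the frame, the priority order and the name <code>by_priority</code> are assumptions for illustration:</p>

```python
import pandas as pd

# Hypothetical records, for illustration only
parts = pd.DataFrame({
    "Product line": ["Engine", "Miscellaneous", "Breaking system", "Engine"],
    "Quantity": [3, 7, 2, 9],
})

# Assumed priority order for this sketch
priority = {"Miscellaneous": 1, "Breaking system": 2, "Engine": 3}

def by_priority(column):
    # `column` is the whole 'Product line' Series, not a single value
    return column.map(priority)

ordered = parts.sort_values(by="Product line", key=by_priority)
print(ordered)
```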


In [ ]:
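def custom_sorting(product):
    # Map every product line to its position in the desired order
    d = {'Miscellaneous' : 1,'Breaking system' : 2,'Suspension & traction' : 3,'Frame & body' : 4, 'Engine' : 5, 'Electrical system' : 6}
    # astype(object) keeps the mapping value-based even if the column has the 'category' dtype
    return product.astype(object).map(d)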

In [ ]:
df.sort_values(by='Product line', key=custom_sorting)

<p>Now the records with product 'Miscellaneous' are displayed first, and the records with product 'Electrical system' last</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #7:</b><br>
<b style="font-size: 1.5em; font-weight: bold;">Your tasks</b>
    <li><b style="font-size: 1.2em">Sort data by the column 'Total' in descending order</b></li>
    <li><b style="font-size: 1.2em">Sort data by the column 'Product line'</b></li>
    <li><b style="font-size: 1.2em">Sort data by the column 'Warehouse' according to these rules:
        <ul>
            <li><b style="font-size: 1.2em">Records with the West warehouse must come first</b></li>
            <li><b style="font-size: 1.2em">Records with the Central warehouse must come second</b></li>
            <li><b style="font-size: 1.2em">Records with the North warehouse must come third</b></li>
        </ul>
    </b></li>
        
<b style="font-size: 1.2em">Hint: you can use function <code>custom_sorting</code> but you need to fix it or you can write another sorting function with sort rules above</b>
    
</div>


In [ ]:
#Sorting data by the column 'Total'


<details><summary>Click here for the solution</summary>

```python
df.sort_values(by = 'Total', ascending = False)
```

</details>


In [ ]:
#Sorting data by the column 'Product line'


<details><summary>Click here for the solution</summary>

```python
df.sort_values(by = 'Product line')
```
</details>


In [ ]:
#Sorting data by the column 'Warehouse' with given rules


<details><summary>Click here for the solution</summary>

```python
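def custom_sorting(location):
    d = {'West' : 1,'Central' : 2,'North' : 3}
    # astype(object) keeps the mapping value-based even if the column has the 'category' dtype
    return location.astype(object).map(d)
df.sort_values(by = 'Warehouse',key = custom_sorting)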
```
</details>


<b style="font-size: 1.2em; font-weight: bold;">Saving modified dataset</b>


In [ ]:
df.to_csv('clean_motorcycles.csv',index=False)

### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/victor_dyrenko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">Victor Dyrenko</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX07DAEN2685-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version |     Changed By   | Change Description                                         |
| ----------------- | ------- | ---------------- | ---------------------------------------------------------- |
| 2023-04-02        | 1       | Victor Dyrenko   | Finished lab                                               |


<hr>


## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
