<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="300" alt="cognitiveclass.ai logo">
</center>

# Motorcycle sales analysis

# *Lab 3. Data analysis with Python*
Estimated time needed: **45 minutes**

## Objectives

After completing this lab you will be able to:

*   Explore features or characteristics to predict the total cost of purchases


<details><summary><b style="font-size: 2em; font-weight: bold;">Click here to see content, description of dataset, source of dataset and licence</b></summary>
<br/>
    <b style="font-size: 2em; font-weight: bold;">Content</b>
<p>You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.</p>

<b style="font-size: 2em; font-weight: bold;">Dataset Glossary (Column-wise)</b>
<ul>
    <li>Date<p>The date on which the client bought the products.</p></li>
    <li>Warehouse<p>The warehouse location.</p></li>
    <li>Client type<p>How the client bought the products. This column can only be Retail or Wholesale.</p></li>
    <li>Product line<p>The name of the product (a motorcycle part).</p></li>
    <li>Quantity<p>The number of units bought.</p></li>
    <li>Unit price<p>The price of one unit of the product.</p></li>
    <li>Total<p>The total purchase price.</p></li>
    <li>Payment<p>The method of payment for the purchase. This dataset has three payment types: credit card, cash or transfer.</p></li>
</ul>

<b style="font-size: 2em; font-weight: bold;">Target field</b>
<ul>
    <li>Total</li>
</ul>

<b style="font-size: 2em; font-weight: bold;">Data source and licence</b>
<ul>
    <li>Data source: <a href="https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0F83EN2842-2023-01-01">https://www.kaggle.com/datasets/devijeganath/motorcycle-sales-analysis</a></li>
    <li>Data type: csv</li>
    <li>Licence: <a href="https://creativecommons.org/publicdomain/zero/1.0/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0F83EN2842-2023-01-01">CC0: Public Domain</a></li>
</ul>
<p>
This dataset is released under the CC0: Public Domain license, which allows you to copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
</p>
</details>


<b style="font-size: 1.5em; font-weight: bold;">Table of Contents</b>

<div class="alert alert-block alert-info" style="margin-top: 20px">
<ol>
    <li><a href="#id1">Download Data</a></li>
    <li><a href="#id2">Analyzing Individual Feature Patterns using Visualization</a></li>
    <li><a href="#id3">Descriptive Statistical Analysis</a></li>
    <li><a href="#id4">Correlation and Causation</a></li>
    <li><a href="#id5">ANOVA</a></li>
</ol>

</div>

<hr>


<b style="font-size: 1.5em; font-weight: bold;">What are the main characteristics that have the most impact on the total cost of purchase?</b><br>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id1"><font color="black">1. Download Data</font></a></b>


Import libraries:


We use pip, the Python package manager, to install seaborn and dython.
Install scikit-learn as well if it is missing. <br>
<b style="font-size: 1.3em; font-weight: bold;">Leave the last two lines uncommented the first time you run the cell. After that, you can comment them out if you want.</b>


In [ ]:
# If you run the lab locally using Anaconda, you can install the library versions used in this lab by uncommenting the following lines:
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
#! mamba install scipy=1.7.1 -y
#! pip install scikit-learn==0.24.2


! pip install seaborn
! pip install dython

<b style="font-size: 1.5em; font-weight: bold;">If you get an error while running the code below, please run it again.</b>


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dython.nominal import associations
from scipy import stats
from sklearn.preprocessing import OrdinalEncoder 

Load the data and store it in dataframe `df`:


This dataset is hosted on IBM Cloud Object Storage. Click <a href="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0F83EN/clean_motorcycles.csv">HERE</a> to download it.


In [ ]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0F83EN/clean_motorcycles.csv'

In [ ]:
df = pd.read_csv(path)
df.head()

<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id2"><font color="black">2. Analyzing Individual Feature Patterns Using Visualization</font></a></b>


<b style="font-size: 1.2em; font-weight: bold;">How to choose the right visualization method?</b>
<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>


In [ ]:
# list the data types for each column
df.dtypes

<p>The types of the columns 'Date', 'Warehouse', 'Product line', 'Quantity binned', 'Binned price' and 'Total ranged' aren't correct. Let's fix them.</p>


In [ ]:
df['Date'] = pd.to_datetime(df['Date'])

<p>The columns 'Warehouse', 'Product line', 'Quantity binned', 'Binned price' and 'Total ranged' have the data type 'object', but they should be 'category'. We can store the names of these columns in the variable <code>columns</code> and change their type.</p>


In [ ]:
columns = df.columns[df.dtypes == 'object']
columns

<p>Change their type using <code>astype()</code></p>


In [ ]:
df[columns] = df[columns].astype('category')
df.dtypes

<p>The columns 'Cash', 'Credit card', 'Transfer', 'Retail' and 'Wholesale' should also be 'category', not 'int64'. We can change their type in the same way, but we must exclude the field 'Quantity', which has to stay 'int64'.</p>


In [ ]:
columns = df.columns[(df.dtypes == 'int64') & (df.columns != 'Quantity')]
columns

In [ ]:
df[columns] = df[columns].astype('category')
df.dtypes

<p>Now the data types are correct.</p>
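
The conversions above can be tried on a small, made-up dataframe. This is a minimal sketch: the values are hypothetical and only mimic the layout of the lab dataset, and `astype` is given a column-to-type mapping as a variant of the assignment used in the cells above.

```python
import pandas as pd

# Hypothetical miniature of the lab dataset (values are made up)
toy = pd.DataFrame({
    "Date": ["2021-06-01", "2021-06-02", "2021-06-03"],
    # Force the plain object dtype, as in the dataframe loaded in this lab
    "Warehouse": pd.Series(["North", "Central", "West"], dtype="object"),
    "Quantity": [8, 3, 10],
    "Retail": [1, 0, 1],
    "Total": [134.8, 58.4, 201.1],
})

# Parse the date strings into datetime64 values
toy["Date"] = pd.to_datetime(toy["Date"])

# Text columns: object -> category
text_cols = toy.columns[toy.dtypes == "object"]
toy = toy.astype({col: "category" for col in text_cols})

# 0/1 indicator columns: int64 -> category, but keep 'Quantity' numeric
flag_cols = toy.columns[(toy.dtypes == "int64") & (toy.columns != "Quantity")]
toy = toy.astype({col: "category" for col in flag_cols})

print(toy.dtypes)
```

Only 'Warehouse' and 'Retail' end up as 'category'; 'Quantity' and 'Total' stay numeric so that they can still be used in correlations.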


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #1:</b>
<b style="font-size: 1.5em">Print the data types of the columns 'Total', 'Quantity' and 'Unit price'</b>

</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Total','Quantity','Unit price']].dtypes
```

</details>


For example, we can calculate the correlation between variables  of type "int64" or "float64" using the method "corr":


In [ ]:
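# corr() is defined for numeric columns only, so select them explicitly
df.select_dtypes(include='number').corr()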

<p>The diagonal elements are always one. We will study correlation, more precisely the Pearson correlation, in depth at the end of the notebook. You can also see that the method <code>corr()</code> does not calculate correlation for categorical fields.</p>


<p>We can also find correlation coefficients for categorical fields using <code>associations()</code>.</p>


<p>For categorical fields that hold text values, <code>associations()</code> does not work. In this case we can use the method <code>fit_transform()</code> of the class <code>OrdinalEncoder</code>. This method replaces every value with a number, starting from 0; identical values are replaced by the same number. Before transforming the data, we need to save a copy of it in a separate variable, using the method <code>copy()</code>.</p>


In [ ]:
#RUN THIS BLOCK ONLY ONE TIME!!!
data = df.copy()
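
To see what ordinal encoding does before it is applied to the whole dataframe, here is a minimal sketch on a made-up column. It uses pandas category codes as a stand-in for `OrdinalEncoder`, not the encoder itself: both number the sorted unique values from 0, although `OrdinalEncoder` returns floats by default. The values are hypothetical.

```python
import pandas as pd

# Hypothetical payment methods (made-up values)
payments = pd.Series(["Cash", "Transfer", "Credit card", "Cash", "Transfer"])

# Category codes number the sorted unique values:
# Cash = 0, Credit card = 1, Transfer = 2
codes = payments.astype("category").cat.codes
print(codes.tolist())  # [0, 2, 1, 0, 2]
```

Identical values receive the same number, which is the property the encoding step below relies on.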

In [ ]:
enc = OrdinalEncoder()
df[df.columns] = enc.fit_transform(df[df.columns])
fig, ax = plt.subplots(figsize=(16, 8))
r = associations(df, ax = ax, cmap = "Blues")

<b style="font-size: 1.2em; font-weight: bold;">DON'T FORGET to restore the original data</b>


In [ ]:
df = data
df.head()

<p>From this heatmap we can see that 'Quantity', 'Quantity binned', 'Transfer', 'Retail' and 'Wholesale' correlate well with the target field 'Total'.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #2:</b>
<b style="font-size: 1.5em">Find the correlation between the following columns: Quantity, Total, Unit price</b>
<p>Hint: if you forgot how to select several columns from a dataset at once, look at the previous question.</p>
</div>


In [ ]:
# Write your code below and press Shift+Enter to execute 


<details><summary>Click here for the solution</summary>

```python
df[['Quantity', 'Total', 'Unit price']].corr()
```

</details>


<b style="font-size: 1.5em; font-weight: bold;">Continuous Numerical Variables:</b> 

<p>Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.</p>

<p>In order to start understanding the (linear) relationship between an individual variable and the total, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.</p>


Let's see several examples of different linear relationships:


<b style="font-size: 1.2em; font-weight: bold;">Positive Linear Relationship</b>


Let's draw the scatterplot of "Quantity" and "Total".


In [ ]:
# Quantity of product bought as a potential predictor variable of the total cost of purchase
sns.regplot(x="Quantity", y="Total", data=df)
plt.ylim(0,)

<p>As the Quantity goes up, the Total cost goes up: this indicates a positive direct correlation between these two variables. Quantity of products seems like a pretty good predictor of Total cost.</p>


We can examine the correlation between 'Quantity' and 'Total' and see that it's approximately 0.87.


In [ ]:
df[["Quantity", "Total"]].corr()

<b style="font-size: 1.2em; font-weight: bold;">Weak Linear Relationship</b>


Let's see if "Unit price" is a predictor variable of "Total".


In [ ]:
sns.regplot(x="Unit price", y="Total", data=df)

<p>The price per unit of product does not seem like a good predictor of the total cost.</p>


We can examine the correlation between 'Unit price' and 'Total' and see it's approximately 0.37.


In [ ]:
df[['Unit price','Total']].corr()

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #3:</b>
<b style="font-size: 1.5em">Use the method <code>regplot()</code> to see the linear relationship between 'Unit price' and 'Quantity'. Find the correlation between these two fields.</b>
</div>


In [ ]:
#Plot the data of column 'Unit price' and 'Quantity' using method regplot() 


<details><summary>Click here for the solution</summary>

```python
sns.regplot(x="Unit price", y="Quantity", data=df)
```

</details>


In [ ]:
#Find correlation between 'Unit price' and 'Quantity'


<details><summary>Click here for the solution</summary>

```python
df[['Unit price','Quantity']].corr()
```

</details>


<b style="font-size: 1.5em; font-weight: bold;">Categorical Variables</b>

<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. Categorical variables can have the type "object" or "int64"; in this lab we converted them to "category". A good way to visualize categorical variables is by using boxplots.</p>


Let's look at the relationship between "Quantity binned" and "Total".


In [ ]:
sns.boxplot(x="Quantity binned", y="Total", data=df)

<p>Here we see that the distributions of 'Total' for the different 'Quantity binned' categories are distinct enough to take the quantity intervals as a potentially good predictor of 'Total'.</p>


In [ ]:
sns.boxplot(x="Warehouse", y="Total", data=df)

<p>We see that the distributions of 'Total' between the different 'Warehouse' categories have a significant overlap, so 'Warehouse' would not be a good predictor of 'Total'.</p>
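
The overlap that a boxplot shows can also be read as numbers: each box spans the 25% to 75% quantiles of its group. The sketch below computes those quantiles per group for a made-up dataframe. The column names mirror the lab, but the values and the helper name `box_edges` are hypothetical.

```python
import pandas as pd

# Hypothetical purchases (made-up values)
toy = pd.DataFrame({
    "Warehouse": ["North"] * 4 + ["West"] * 4,
    "Quantity binned": ["Low", "Low", "High", "High"] * 2,
    "Total": [50.0, 60.0, 300.0, 320.0, 55.0, 65.0, 290.0, 310.0],
})

def box_edges(frame, by):
    """Return the 25%, 50% and 75% quantiles of 'Total' for each group."""
    return frame.groupby(by)["Total"].quantile([0.25, 0.5, 0.75]).unstack()

print(box_edges(toy, "Warehouse"))        # the boxes overlap heavily
print(box_edges(toy, "Quantity binned"))  # the boxes are far apart
```

When the 25% to 75% ranges of the groups barely differ, as for 'Warehouse' here, the category tells us little about 'Total'; when they do not overlap at all, it is a promising predictor.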


In [ ]:
sns.boxplot(x="Retail", y="Total", data=df)

<p>The numbers 1 and 0 mean that the client is or is not a retail client, respectively.</p>


In [ ]:
sns.boxplot(x="Wholesale", y="Total", data=df)

<p>The numbers 1 and 0 mean that the client is or is not a wholesale client, respectively.</p>


<p>So 'Retail' and 'Wholesale' can be good predictors for 'Total'</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #4:</b>
<b style="font-size: 1.5em">Use the method <code>boxplot()</code> to see the relationship between 'Cash' and 'Total', 'Credit card' and 'Total', 'Transfer' and 'Total'.</b> <b><br></b>
<b style="font-size: 1.2em">Remember that the numbers 1 and 0 mean that the buyer paid or did not pay by the specified method, respectively.</b>
</div>


In [ ]:
#Cash vs Total


<details><summary>Click here for the solution</summary>

```python
sns.boxplot(x = 'Cash',y = 'Total',data = df)
```

</details>


In [ ]:
#Credit card vs Total


<details><summary>Click here for the solution</summary>

```python
sns.boxplot(x = 'Credit card',y = 'Total',data = df)
```

</details>


In [ ]:
#Transfer vs Total


<details><summary>Click here for the solution</summary>

```python
sns.boxplot(x = 'Transfer',y = 'Total',data = df)
```

</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id3"><font color="black">3. Descriptive Statistical Analysis</font></a></b>


<p>Let's first take a look at the variables by utilizing a description method.</p>

<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>

This will show:

<ul>
    <li>the count of that variable</li>
    <li>the mean</li>
    <li>the standard deviation (std)</li> 
    <li>the minimum value</li>
    <li>the quartiles (25%, 50% and 75%); the interquartile range (IQR) is the difference between the 75% and 25% values</li>
    <li>the maximum value</li>
</ul>


We can apply the method "describe" as follows:


In [ ]:
df.describe()
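
A quick sketch on made-up numbers shows how the rows of `describe` relate to each other. The values are hypothetical; the interquartile range is not printed by `describe`, so it is derived here from the 25% and 75% rows.

```python
import pandas as pd

# Hypothetical totals (made-up values)
totals = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="Total")

summary = totals.describe()
print(summary)

# Interquartile range = 75% quantile minus 25% quantile
iqr = summary["75%"] - summary["25%"]
print(iqr)  # 20.0
```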

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'category' as follows:


In [ ]:
df.describe(include=['category'])

<b style="font-size: 1.2em; font-weight: bold;">Value Counts</b>


<p>Value counts are a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method to the column "Product line". Here we call "value_counts" on a pandas Series, so we select the column with one pair of brackets, <code>df['Product line']</code>, not two, <code>df[['Product line']]</code>, which would return a dataframe.</p>


In [ ]:
df['Product line'].value_counts()

We can convert the series to a dataframe as follows:


In [ ]:
df['Product line'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "product_counts" and rename the column  'Product line' to 'Value counts'.


In [ ]:
product_counts = df['Product line'].value_counts().to_frame()
# Set the column name directly: the default name differs between pandas versions
product_counts.columns = ['Value counts']
product_counts

Now let's rename the index to 'Type product':


In [ ]:
product_counts.index.name = 'Type product'
product_counts
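
The same table can also be built in one chain. This is a sketch on a made-up series: `rename_axis` and the `name` argument of `to_frame` set the index and column labels explicitly, so the result does not depend on the default column name that `value_counts` produces.

```python
import pandas as pd

# Hypothetical product lines (made-up values)
products = pd.Series(
    ["Engine", "Breaking system", "Breaking system", "Frame & body", "Breaking system"],
    name="Product line",
)

product_counts_toy = (
    products.value_counts()
    .rename_axis("Type product")   # label for the index
    .to_frame("Value counts")      # label for the single column
)
print(product_counts_toy)
```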

<p>After examining the value counts of the product line, we see that the product line would not be a good predictor variable for the total. This is because the counts are skewed: we have only 61 purchases of 'Engine' but 230 purchases of 'Breaking system'.</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #5:</b>

<p>Use the method <code>value_counts()</code> to count the purchases for each warehouse.</p>
<p>Display the data using the method <code>to_frame()</code>. The name of the index must be 'Warehouse location' and the name of the column must be 'Count warehouse'.</p>
</div>


In [ ]:
#Write your code here


<details><summary>Click here for the solution</summary>

```python
warehouse_counts = df['Warehouse'].value_counts().to_frame()
# Set the column name directly: the default name differs between pandas versions
warehouse_counts.columns = ['Count warehouse']
warehouse_counts.index.name = 'Warehouse location'
warehouse_counts
```

</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id4"><font color="black">4. Correlation and Causation</font></a></b>


<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>


<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation; the two variables have no linear relationship, although they may still be related in a non-linear way.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
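
To make the definition concrete, the sketch below computes the coefficient from its formula, the covariance of the two variables divided by the product of their standard deviations, on made-up arrays. The helper name `pearson_r` is hypothetical. The last case shows that a coefficient of 0 only rules out a linear relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson coefficient: co-variation of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical data (made-up values)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

print(pearson_r(x, 3 * x + 1))   # perfect positive linear relationship
print(pearson_r(x, -2 * x))      # perfect negative linear relationship
print(pearson_r(x, x ** 2))      # y depends on x, yet there is no linear correlation
```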


<p>Pearson Correlation is the default method of the function "corr". As before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.</p>


Sometimes we would like to know the significance of the correlation estimate.


<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05: we call the correlation statistically significant when the P-value is below 0.05, which means we accept a 5% chance of reporting a correlation that is not really there.</p>

By convention, when the p-value is:

<ul>
    <li>$<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>$<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>$<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>$>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>
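
The convention above can be written as a small helper. This is only a sketch: the function name `evidence_level` is hypothetical, and values that fall exactly on a threshold, which the list leaves open, are placed in the weaker band.

```python
def evidence_level(p_value):
    """Map a p-value to the evidence wording used in the list above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"

# Made-up p-values, one for each band
for p in (0.0004, 0.03, 0.08, 0.4):
    print(p, evidence_level(p))
```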


We can obtain this information using  "stats" module in the "scipy"  library.


<b style="font-size: 1.2em; font-weight: bold;">Unit price vs. Total</b>


Let's calculate the  Pearson Correlation Coefficient and P-value of 'Unit price' and 'Total'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Unit price'], df['Total'])
print("The Pearson Correlation Coefficient is {:.2f} with a P-value of P = {:.3f}".format(pearson_coef, p_value))  

<b style="font-size: 1.2em; font-weight: bold;">Conclusion:</b>
<p>Since the p-value is $<$ 0.001, the correlation between Unit price and Total is statistically significant, although the linear relationship isn't extremely strong (~0.37).</p>


<b style="font-size: 1.2em; font-weight: bold;">Quantity vs. Total</b>


Let's calculate the  Pearson Correlation Coefficient and P-value of 'Quantity' and 'Total'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Quantity'], df['Total'])
print("The Pearson Correlation Coefficient is {:.2f} with a P-value of P = {:.3f}".format(pearson_coef, p_value))  

<b style="font-size: 1.2em; font-weight: bold;">Conclusion:</b>

<p>Since the p-value is $<$ 0.001, the correlation between Quantity and Total is statistically significant, and the linear relationship is quite strong (~0.87, close to 1).</p>


<b style="font-size: 1.2em; font-weight: bold;">Cash vs. Total</b>

Let's calculate the  Pearson Correlation Coefficient and P-value of 'Cash' and 'Total'.


In [ ]:
pearson_coef, p_value = stats.pearsonr(df['Cash'], df['Total'])
print("The Pearson Correlation Coefficient is {:.2f} with a P-value of P = {:.3f}".format(pearson_coef, p_value))  

<b style="font-size: 1.2em; font-weight: bold;">Conclusion:</b>
<p>Since the p-value is $<$ 0.001, the correlation between Cash and Total is statistically significant, but the linear relationship is weak (~-0.13).</p>


<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #6:</b>

<p>Find the Pearson Correlation Coefficient and P-value between each of the columns 'Credit card', 'Transfer', 'Retail', 'Wholesale' and the column 'Total'.</p> <p>Hint: you can list these columns in an array and loop over it to avoid duplicating the code.</p>
</div>


In [ ]:
#Write code here


<details><summary>Click here for the solution</summary>

```python
columns = ['Credit card','Transfer','Retail','Wholesale']
for c in columns:
    pearson_coef, p_value = stats.pearsonr(df[c], df['Total'])
    print("The Pearson Correlation Coefficient for column {} is {:.2f} with a P-value of P = {:.3f}".format(c,pearson_coef,p_value))
```

</details>


<b style="font-size: 2em; font-weight: bold; text-decoration: none;"><a name="id5"><font color="black">5. ANOVA</font></a></b>


<b style="font-size: 1.5em; font-weight: bold;">ANOVA: Analysis of Variance</b>
<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our total variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
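
As a sketch of what the F-test score measures, the code below computes it by hand for made-up groups: the variance between the group means divided by the variance within the groups. The helper name `f_score` and the data are hypothetical; the result is compared with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

def f_score(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)       # number of groups
    n = all_values.size   # total number of observations

    # Spread of the group means around the grand mean, weighted by group size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Spread of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical totals for three groups (made-up values)
a = [10.0, 12.0, 11.0, 13.0]
b = [20.0, 22.0, 21.0, 23.0]
c = [11.0, 13.0, 12.0, 14.0]

print(f_score(a, b, c))
print(stats.f_oneway(a, b, c).statistic)
```

The two printed values should agree up to floating-point rounding. The score grows when the group means move apart or when the values inside each group become less scattered.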


<b style="font-size: 1.2em; font-weight: bold;">Product line</b>


<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.</p>

<p>To see if different types of 'Product line' impact 'Total', we group the data.</p>


In [ ]:
grouped = df.groupby('Product line')

In [ ]:
df['Product line'].unique()

<p>As you remember, we have six unique values in the column 'Product line':</p>
<ul>
<li>'Breaking system'</li>
<li>'Electrical system'</li>
<li>'Engine'</li>
<li>'Frame & body'</li>
<li>'Miscellaneous'</li>
<li>'Suspension & traction'</li>
</ul>


Let's work with three values: 'Breaking system', 'Suspension & traction' and 'Electrical system', because these product lines have the most records.


We can obtain the values of a group using the method "get_group".


In [ ]:
grouped.get_group('Breaking system')['Total']
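
A minimal sketch of `groupby` and `get_group` on a made-up dataframe (hypothetical values). Grouping by a single column name lets `get_group` take the plain group label.

```python
import pandas as pd

# Hypothetical purchases (made-up values)
toy = pd.DataFrame({
    "Product line": ["Engine", "Breaking system", "Engine", "Breaking system"],
    "Total": [300.0, 80.0, 340.0, 120.0],
})

grouped_toy = toy.groupby("Product line")

# All 'Total' values that belong to one product line
engine_totals = grouped_toy.get_group("Engine")["Total"]
print(engine_totals.tolist())  # [300.0, 340.0]

# The group means that ANOVA compares
print(grouped_toy["Total"].mean())
```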

We can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>.


In [ ]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped.get_group('Breaking system')['Total'], grouped.get_group('Suspension & traction')['Total'], grouped.get_group('Electrical system')['Total'])  
 
print( "ANOVA results: F = {:.2f}, P = {}".format(f_val,p_val)) 

This is a reasonable result: the F-test score indicates a moderate difference between the group means, and the very small P-value implies that the difference is almost certainly statistically significant. But does this mean that every pair of groups differs this much?

Let's examine them separately.


<b style="font-size: 1.2em; font-weight: bold;"> Breaking system and Suspension & traction</b>


In [ ]:
f_val, p_val = stats.f_oneway(grouped.get_group('Breaking system')['Total'], grouped.get_group('Suspension & traction')['Total'])  
 
print( "ANOVA results: F = {:.2f}, P = {}".format(f_val,p_val)) 

<p>The F-score and P-value are better than in the previous test.</p>


Let's examine the other groups.


<b style="font-size: 1.2em; font-weight: bold;"> Breaking system and Electrical system</b>


In [ ]:
f_val, p_val = stats.f_oneway(grouped.get_group('Breaking system')['Total'], grouped.get_group('Electrical system')['Total'])  
   
print( "ANOVA results: F = {:.2f}, P = {:.3f}".format(f_val,p_val))  

<p>The F-score and P-value are worse than in the previous test.</p>


<b style="font-size: 1.2em; font-weight: bold;">Suspension & traction and Electrical system</b>


In [ ]:
f_val, p_val = stats.f_oneway(grouped.get_group('Suspension & traction')['Total'], grouped.get_group('Electrical system')['Total'])  
 
print( "ANOVA results: F = {:.2f}, P = {:.4f}".format(f_val,p_val))    

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<b style="font-size: 2em; font-weight: bold;">Question #7:</b>

<p>Work with the column 'Warehouse'. Find the F-score and P-value between Central and North, Central and West, and West and North. This column has three unique values: <code>Central, West, North</code>.</p>
</div>


In [ ]:
#Write code here


<details><summary>Click here for the solution</summary>

```python
group = df.groupby('Warehouse')
f_val, p_val = stats.f_oneway(group.get_group('Central')['Total'], group.get_group('North')['Total'])  
print("ANOVA results between Central and North: F = {:.2f}, P = {:.2f}".format(f_val,p_val))
f_val, p_val = stats.f_oneway(group.get_group('Central')['Total'], group.get_group('West')['Total'])  
print("ANOVA results between Central and West: F = {:.2f}, P = {:.2f}".format(f_val,p_val))
f_val, p_val = stats.f_oneway(group.get_group('West')['Total'], group.get_group('North')['Total'])  
print("ANOVA results between West and North: F = {:.2f}, P = {:.2f}".format(f_val,p_val))
```

</details>


<b style="font-size: 1.5em; font-weight: bold;">Conclusion: Important Variables</b>


<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the total cost of purchase. We have narrowed it down to the following variables:</p>

Continuous numerical variable:

<ul>
    <li>Quantity</li>
</ul>

Categorical variables:

<ul>
    <li>Quantity binned</li>
    <li>Retail</li>
    <li>Transfer</li>
    <li>Total ranged</li>
</ul>

<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>


<b style="font-size: 1.5em; font-weight: bold;">Saving dataset with important fields</b>


In [ ]:
df[['Product line','Quantity','Quantity binned','Retail','Transfer','Total ranged','Total']].to_csv('new_motorcycles.csv',index=False)

### Thank you for completing this lab!

## Authors

<a href="https://author.skills.network/instructors/victor_dyrenko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0F83EN2842-2023-01-01">Victor Dyrenko</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0F83EN2842-2023-01-01">Yaroslav Vyklyuk</a>

<a href="https://author.skills.network/instructors/olga_kavun?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0F83EN2842-2023-01-01">Olga Kavun</a>

## Change Log

| Date (YYYY-MM-DD) | Version |     Changed By   | Change Description                                         |
| ----------------- | ------- | ---------------- | ---------------------------------------------------------- |
| 2023-04-09        | 1       | Victor Dyrenko   | Finished lab                                               |


<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
