rohinegi548/EDA-and-Machine-Learning-Avocado-Prices-Predictions

EDA-and-Machine-Learning-Avocado-Prices-Predictions

Exploratory Data Analysis and Price Predictions for Avocado Dataset based on Machine Learning

Avocado Dataset Analysis and ML Prediction

    Table of Contents

  • Problem Statement
  • Data Loading and Description
  • Data Profiling
    • Understanding the Dataset
    • Profiling
    • Preprocessing
  • Data Visualisation and Questions answered
    • Q.1 Which type of avocado is more in demand (Conventional or Organic)?
    • Q.2 In which range does the Average price lie, and what does its distribution look like?
    • Q.3 How is the Average price distributed over the months for Conventional and Organic types?
    • Q.4 What are the TOP 5 regions where the Average price is very high?
    • Q.5 What are the TOP 5 regions where the Average consumption is very high?
    • Q.6 In which year and for which region was the Average price the highest?
    • Q.7 How is price distributed over the Date column?
    • Q.8 How are the dataset features correlated with each other?
  • Feature Engineering for Model building
  • Model selection/predictions
    • P.1 Are we good with Linear Regression? Let's find out.
    • P.2 Are we good with Decision Tree Regression? Let's find out.
    • P.3 Are we good with Random Forest Regressor? Let's find out.
  • Let's see the final Actual vs. Predicted sample.
  • Conclusions

Problem Statement

  • The notebook explores basic Pandas usage and covers the basic commands of exploratory data analysis (EDA).
  • In this study, we try to predict the avocado's average price from the other features (Total Bags, Date, Type, Year, Region, …).
  • The variables of the dataset are the following:

  • Categorical: ‘region’, ‘type’
  • Date: ‘Date’
  • Numerical: ‘Total Volume’, ‘4046’, ‘4225’, ‘4770’, ‘Total Bags’, ‘Small Bags’, ‘Large Bags’, ‘XLarge Bags’, ‘Year’
  • Target: ‘AveragePrice’
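The grouping above can be written down as plain column lists, which makes it easy to separate the features from the target later on. The following is a minimal sketch, not the notebook's own code: the helper `split_features_target` and the two placeholder rows are hypothetical, and the column spellings follow the list above (check them against your copy of the CSV, for example the case of `Year`).

```python
import pandas as pd

# Column groups copied from the list above; adjust the spelling
# (for example the case of "Year") to match your copy of the CSV.
CATEGORICAL = ["region", "type"]
DATE = ["Date"]
NUMERICAL = [
    "Total Volume", "4046", "4225", "4770",
    "Total Bags", "Small Bags", "Large Bags", "XLarge Bags", "Year",
]
TARGET = "AveragePrice"


def split_features_target(df: pd.DataFrame):
    """Return (X, y): the feature columns and the target column."""
    feature_cols = CATEGORICAL + DATE + NUMERICAL
    missing = [c for c in feature_cols + [TARGET] if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[feature_cols], df[TARGET]


# Two placeholder rows with made-up values, only to exercise the helper.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-11"]),
    "AveragePrice": [1.0, 1.5],
    "Total Volume": [100.0, 200.0],
    "4046": [10.0, 20.0],
    "4225": [30.0, 40.0],
    "4770": [5.0, 5.0],
    "Total Bags": [55.0, 135.0],
    "Small Bags": [50.0, 100.0],
    "Large Bags": [5.0, 30.0],
    "XLarge Bags": [0.0, 5.0],
    "type": ["conventional", "organic"],
    "Year": [2015, 2015],
    "region": ["Albany", "Albany"],
})

X, y = split_features_target(sample)
print(X.shape, y.name)
```

Keeping the target out of the feature list from the start helps avoid leaking ‘AveragePrice’ into the model inputs.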

Data Loading and Description

This data was downloaded from the Hass Avocado Board website in May 2018, compiled into a single CSV, and provided by INSAID. It represents weekly 2018 retail scan data for national retail volume (units) and price. The dataset comprises 18249 observations of 14 columns.
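As a rough illustration of the loading step, the sketch below reads a CSV with pandas and checks its shape. The helper name `load_avocado` is hypothetical, and the in-memory CSV is a two-row stand-in with made-up values so the snippet runs without the real file; the header spelling follows this README and may differ from your copy.

```python
import io
import pandas as pd


def load_avocado(source) -> pd.DataFrame:
    """Read the CSV from a path or file-like object, parsing Date as datetime."""
    return pd.read_csv(source, parse_dates=["Date"])


# Stand-in for the real file: header plus two placeholder rows (made-up values).
# The leading comma mimics an unnamed index column saved with the data.
demo_csv = io.StringIO(
    ",Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,"
    "Small Bags,Large Bags,XLarge Bags,type,Year,region\n"
    "0,2015-01-04,1.0,100.0,10.0,30.0,5.0,55.0,50.0,5.0,0.0,conventional,2015,Albany\n"
    "1,2015-01-11,1.5,200.0,20.0,40.0,5.0,135.0,100.0,30.0,5.0,organic,2015,Albany\n"
)

df = load_avocado(demo_csv)
print(df.shape)   # (rows, columns)
print(df.dtypes)  # Date should be a datetime column
```

On the full file, passing its path to `load_avocado` should give a shape of (18249, 14) if the description above holds.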

The less self-explanatory column names are described below:

| Feature | Description |
| --- | --- |
| ‘Unnamed: 0’ | A redundant index column that will be removed later |
| ‘Total Volume’ | Total sales volume of avocados |
| ‘4046’ | Total sales volume of Small/Medium Hass avocados |
| ‘4225’ | Total sales volume of Large Hass avocados |
| ‘4770’ | Total sales volume of Extra Large Hass avocados |
| ‘Total Bags’ | Total number of bags sold |
| ‘Small Bags’ | Total number of small bags sold |
| ‘Large Bags’ | Total number of large bags sold |
| ‘XLarge Bags’ | Total number of extra-large bags sold |
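The descriptions above suggest two small clean-up steps: dropping the leftover index column (which pandas labels `Unnamed: 0`) and, optionally, giving the numerically named columns readable names. The sketch below shows one way to do this; `tidy_columns`, `CODE_NAMES`, and the placeholder values are assumptions for illustration, not the notebook's own preprocessing.

```python
import pandas as pd

# Readable names for the numerically named columns, taken from the table above.
CODE_NAMES = {
    "4046": "Small/Medium Hass",
    "4225": "Large Hass",
    "4770": "Extra Large Hass",
}


def tidy_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the leftover index column and rename the numerically named columns."""
    # errors="ignore" keeps this safe to re-run after the column is gone.
    out = df.drop(columns=["Unnamed: 0"], errors="ignore")
    return out.rename(columns=CODE_NAMES)


# Placeholder frame with made-up values, only to exercise the helper.
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Total Volume": [100.0, 200.0],
    "4046": [10.0, 20.0],
    "4225": [30.0, 40.0],
    "4770": [5.0, 5.0],
})

tidy = tidy_columns(raw)
print(list(tidy.columns))
```

The helper returns a new frame and leaves `raw` untouched, so the original columns remain available for comparison.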
