Welcome to this guide to cleaning the "Avocado Prices" data set. As a Californian and millennial, it made me very happy to see this dataset. In this notebook, we will clean the data set to improve readability and remove confounding data.

In [None]:
#imports
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

data = pd.read_csv('../input/avocado.csv')
data.head()

At first glance, this data is in decent condition.

Here are the variables and their descriptions provided by the author:

* Date - The date of the observation
* AveragePrice - the average price of a single avocado
* type - conventional or organic
* year - the year
* region - the city or region of the observation
* Total Volume - Total number of avocados sold
* 4046 - Total number of avocados with PLU 4046 sold
* 4225 - Total number of avocados with PLU 4225 sold
* 4770 - Total number of avocados with PLU 4770 sold

We can improve upon the column names 4046, 4225, and 4770. A quick Google search for PLUs led me to http://indexfresh.com/retail-foodservice/brands/packaging/plus/, where I was able to find names for our three codes, with bonus location and size info.

* 4046: Small Hass, California, size 60 or smaller
* 4225: Large Hass, California, size 40 & 48
* 4770: Extra Large Hass, Mexico, size 36 and larger

Time to map PLUs to English.

In [None]:
# rename columns: 4046, 4225, 4770
# only the column labels need to change, so the index is left untouched
data = data.rename(columns={"4046" : "Small Hass", "4225" : "Large Hass", "4770" : "XLarge Hass"})
data.head()

Now we know precisely what to call our beloved avos. 

Next question: is any data missing?


In [None]:
data.isna().sum()

Wow! What a convenient data set to work with. No need for imputation of nulls.

Next up: validate dates in the 'Date' column. We are going to check whether they are in the format YYYY-MM-DD, which is the preferred format for working with dates.

In [None]:
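import datetime

def validate(date_text):
    """Raise ValueError unless date_text is a valid YYYY-MM-DD date."""
    try:
        datetime.datetime.strptime(date_text, '%Y-%m-%d')
    except ValueError:
        raise ValueError(f"Incorrect date format: {date_text!r}, should be YYYY-MM-DD")

# iterate over the column directly; iterrows() builds a Series per row and is much slower
for date_text in data['Date']:
    validate(date_text)

print("No errors!")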
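The loop above stops at the first bad date. As an alternative sketch (not what this notebook runs), pandas can parse the whole column at once and report every value that fails. The toy `dates` Series below is a made-up stand-in for `data['Date']`, with one deliberately malformed entry.

```python
import pandas as pd

# Toy stand-in for data['Date']; the last value is deliberately malformed.
dates = pd.Series(["2015-12-27", "2016-01-03", "03/01/2016"])

# errors="coerce" turns any value that does not match the format into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Collect the original strings that failed to parse
bad_dates = dates[parsed.isna()]
print(bad_dates.tolist())
```

An empty `bad_dates` would mean every date matches YYYY-MM-DD.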

No date errors, fantastic! Time to move on to a more complicated column, AveragePrice. This is the average price for a single avo. We need to verify we don't have any outliers which could skew model predictions.


In [None]:
data.AveragePrice.describe()

In [None]:
data.AveragePrice.hist()

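Our avos range in price from $0.44 to $3.25. The minimum is well above zero, so there are no negative or zero prices, and the histogram shows no isolated extreme values. In other words, we see no invalid inputs or obvious outliers.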
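Eyeballing `describe()` and a histogram is a judgment call. For a more explicit check, one common option is Tukey's 1.5 × IQR rule, sketched below on invented prices (the 9.99 is a planted outlier). This is a swapped-in technique rather than the method used in this notebook, and on real data such a rule can flag rows that are unusual but perfectly legitimate.

```python
import pandas as pd

# Toy prices; 9.99 is a planted outlier for illustration only.
prices = pd.Series([0.44, 0.98, 1.10, 1.37, 1.40, 1.66, 3.25, 9.99])

# Invalid inputs: a price should be strictly positive
has_invalid = bool((prices <= 0).any())

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = prices[(prices < lower) | (prices > upper)]

print(has_invalid, flagged.tolist())
```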

'Total Volume' is our next subject where we will do the same thing.

In [None]:
data['Total Volume'].describe()

Total Volume spans a very wide range of values. I'm especially concerned about the max value, since it is so much larger than both the 75th percentile and the median (50th percentile). Perhaps there are some groups of outliers. Time to visualize what we are working with.

In [None]:
data['Total Volume'].hist()

In [None]:
data.nlargest(10, 'Total Volume')

In [None]:
data.region.unique()

Let's examine the outputs of what just happened. First, the histogram supports our hypothesis about groupings of outliers. Second, the 10 rows with the largest Total Volume all have the region 'TotalUS'. More on this later. Third, the unique values of 'region' include cities, states, and multi-state regions. This means there is overlap. For example, San Francisco and Sacramento are cities which contribute to the state of California.

How will we deal with this confounding data? It depends on what you're interested in. In this notebook, we will remove all regions which are not cities, leaving the most basic units of city data.
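To make the overlap concrete, here is a tiny sketch with invented volumes. The `aggregates` list is a hypothetical subset used only for illustration. Summing every row counts the same avocados more than once, while filtering to city rows does not.

```python
import pandas as pd

# Invented volumes: two cities plus the state and national rows that contain them.
toy = pd.DataFrame({
    "region": ["SanFrancisco", "Sacramento", "California", "TotalUS"],
    "Total Volume": [100.0, 50.0, 400.0, 1000.0],
})
aggregates = ["California", "TotalUS"]  # hypothetical subset of non-city regions

# Naive sum double counts: city volume is also inside the state and national rows
naive_total = toy["Total Volume"].sum()

# Keep only rows whose region is not an aggregate
cities_only = toy[~toy["region"].isin(aggregates)]
city_total = cities_only["Total Volume"].sum()

print(naive_total, city_total)
```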

I dug around on the Hass Avocado Board's website and pulled up [a link](http://www.hassavocadoboard.com/sites/default/files/xls/hab-markets-and-regions-1-22-2018.xlsx) which lists all the markets and regions. Now to remove the regions that are not cities!

In [None]:
# remove all rows whose 'region' is an aggregate (a state, a multi-state region, or 'TotalUS')
regionsToRemove = ['California', 'GreatLakes', 'Midsouth', 'NewYork', 'Northeast', 'SouthCarolina', 
                   'Plains', 'SouthCentral', 'Southeast', 'TotalUS', 'West']
size = len(data)
data = data[~data.region.isin(regionsToRemove)]
removed = size - len(data)  # number of rows dropped by the filter
print("old size: ", size, ", removed", removed, "rows")

In [None]:
data['Total Volume'].hist()

In [None]:
data.nlargest(10, 'Total Volume')

Boom! The confounding data was removed by cleaning the region column, and our view shows that LA now has the greatest Hass avo volume. Furthermore, the histogram no longer shows a separate cluster of extreme values: the top 10 rows simply sit at the upper end of the distribution.

Enough with regions and total volume. Let's focus the spotlight back on Small Hass, Large Hass, XLarge Hass columns. Below, we will graph volumes of individual avo sizes and then inspect each one. 

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(data['Small Hass'], color="yellow", label='Small Hass', kde=False)
sns.histplot(data['Large Hass'], color="orange", label='Large Hass', kde=False)
sns.histplot(data['XLarge Hass'], color="red", label='XLarge Hass', kde=False)
plt.legend()

In [None]:
data['Small Hass'].describe()

In [None]:
data.nlargest(10, 'Small Hass')

In [None]:
data['Large Hass'].describe()

In [None]:
data.nlargest(10, 'Large Hass')

In [None]:
data['XLarge Hass'].describe()

In [None]:
data.nlargest(10, 'XLarge Hass')

These individual sizes are similar to what we found previously when working with 'Total Volume'. LA contributes many of the largest values, and the histograms have a very strong right skew.
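Right skew can also be read off numerically: the sample skewness is positive and the mean sits above the median. A small sketch on invented volumes (not values from the data set):

```python
import pandas as pd

# Invented volumes with one very large value, mimicking a long right tail.
volumes = pd.Series([10, 12, 15, 18, 20, 25, 30, 400], dtype=float)

skewness = volumes.skew()          # sample skewness; positive means right-skewed
mean, median = volumes.mean(), volumes.median()

print(skewness > 0, mean > median)
```

The same two calls could be applied to `data['Small Hass']` and the other size columns.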

Time to check out our various bag columns.

In [None]:
data['Total Bags'].describe()

In [None]:
data['Small Bags'].describe()

In [None]:
data['Large Bags'].describe()

In [None]:
data['XLarge Bags'].describe()
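Beyond reading each `describe()` output, one could cross-check the bag columns against each other. The sketch below assumes that 'Total Bags' is meant to equal the sum of the three bag sizes, which the column descriptions above do not confirm. The toy numbers are invented, and the last row is deliberately inconsistent.

```python
import numpy as np
import pandas as pd

# Invented bag counts; the last row's parts do not add up to its total.
bags = pd.DataFrame({
    "Total Bags": [100.0, 250.0, 80.0],
    "Small Bags": [60.0, 200.0, 50.0],
    "Large Bags": [30.0, 45.0, 20.0],
    "XLarge Bags": [10.0, 5.0, 0.0],
})

# Assumption: Total Bags = Small Bags + Large Bags + XLarge Bags
parts = bags[["Small Bags", "Large Bags", "XLarge Bags"]].sum(axis=1)

# isclose tolerates small rounding differences in the reported totals
mismatch = bags[~np.isclose(bags["Total Bags"], parts, atol=0.01)]

# Bag counts should never be negative
any_negative = bool((bags < 0).any().any())

print(mismatch.index.tolist(), any_negative)
```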

Nothing weird: check! Time for our 'type' and 'year' columns, where we will inspect their unique values.

In [None]:
data.type.unique()

In [None]:
data.year.unique()

Both 'type' and 'year' are fine: neither contains unexpected values.
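For categorical columns like these, the visual check can be turned into an explicit comparison against the values we expect. In the sketch below the toy frame is invented and contains one planted bad label. The expected set for 'type' comes from the column description above.

```python
import pandas as pd

# Invented rows; "Organic " (capitalised, trailing space) is a planted bad label.
toy = pd.DataFrame({
    "type": ["conventional", "organic", "Organic "],
    "year": [2015, 2016, 2017],
})

expected_types = {"conventional", "organic"}

# Any label outside the expected set signals a typo or inconsistent casing
unexpected_types = set(toy["type"].unique()) - expected_types

print(unexpected_types)
```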

**Congratulations!** You've made it to the end of the Avocado Prices data cleanup. 