I'm *very* new to this, so if I am doing something the hard way or am otherwise incorrect, please point out what might seem obvious to you.
# Import Libraries and Data

_and clean up a few inconsistently applied naming conventions to make life easier later_

In [None]:
import pandas as pd
import numpy as np
import time
from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns


avocados = pd.read_csv("../input/avocado.csv")
avocados.rename(columns = {"Unnamed: 0" : "Avocado ID", "AveragePrice" : "Average Price"}, inplace=True) #because consistent naming makes things a lot easier.
#LEARNING: also check for inconsistent casing of column names
print(avocados.columns)
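
The "LEARNING" note above could be handled in one pass. The sketch below is only one possible approach: `tidy_column_names` is a hypothetical helper (not part of pandas), and the toy frame reuses a few column names from this notebook with made-up values. The rest of this notebook keeps the names set by the `rename` call above.

```python
import pandas as pd

# Toy frame with some of the raw column names; the values are made up.
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "AveragePrice": [1.33, 1.35],
        "Date": ["2015-12-27", "2015-12-20"],
        "region": ["Albany", "Albany"],
    }
)


def tidy_column_names(df):
    """Return a copy of df with snake_case column names (hypothetical helper)."""
    tidy_names = (
        df.columns
        # Put "_" at lower-to-upper case boundaries: AveragePrice -> Average_Price
        .str.replace(r"(?<=[a-z])(?=[A-Z])", "_", regex=True)
        # Collapse spaces, colons and other punctuation into a single "_"
        .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)
        .str.strip("_")
        .str.lower()
    )
    out = df.copy()
    out.columns = tidy_names
    return out


tidy = tidy_column_names(toy)
print(list(tidy.columns))
```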

# Ambiguous Variables?

Many of the variables don't have units, so I will make the following assumption:

* the "Average Price" values are averaged by *weight* (e.g. \$ per 100g) and **not** by *number* of avocados (e.g. \$ per avocado).
  * the latter would not be a fair indicator of price as there are 3 different classes of avocado, each of a different size.
  * for example, an average price of \$1.50 for 300 regular Hass avocados is not equivalent to an average price of \$1.50 for 300 *large* Hass avocados (you would get more value out of 300 large avocados than out of 300 regular-sized ones, assuming size matters).
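
To make the example above concrete, here is a small worked calculation. The per-avocado weights (150 g and 250 g) are made-up numbers for illustration only and do not come from the dataset.

```python
# Hypothetical weight of one avocado in grams (NOT taken from the dataset).
weight_g = {"regular": 150, "large": 250}

price_per_avocado = 1.50  # the same per-avocado price for both sizes
count = 300

per_100g = {}
for size, grams in weight_g.items():
    total_cost = price_per_avocado * count  # dollars paid for the whole batch
    total_weight_g = grams * count          # grams of avocado received
    per_100g[size] = total_cost / total_weight_g * 100
    print(f"{size}: ${per_100g[size]:.2f} per 100g")
```

With these assumed weights, the same per-avocado price works out to a noticeably lower per-weight price for the large avocados, which is why a per-avocado average would be hard to compare across sizes.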


## Change "Date" column to datetime64

I'll keep the old date in the `old_date` column just in case I need to refer back to it.

In [None]:
date_conversion = pd.to_datetime(avocados["Date"], format='%Y-%m-%d')
avocados["old_date"] = avocados["Date"]
avocados["Date"] = date_conversion

avocados.describe(include="all").round(2)

# Leaderboard
There have been quite a few analyses involving time series forecasting.
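For something a little different, I thought it would be fun to show price leadership over time. A region's rank for a period comes from calling `rank()` on the average prices, which ranks in ascending order, so rank 1 goes to the region with the *lowest* average price.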
I was primarily looking for:
* consistent price leadership of a region for 
    * a given season in a year or
    * across *all* seasons in a year    

## Create the leaderboard dataframe

In [None]:
grp = avocados['Average Price'].groupby([avocados["Date"], avocados["region"]]).mean()
index_values = pd.Series(grp.index.values).apply(pd.Series)
group_by_date_and_region = pd.DataFrame(data={"date" : index_values[0], "region" : index_values[1], "average_price" : grp.values})

dates = avocados.Date.unique()
dates.sort()
region = avocados.region.unique()

region.sort() #alpha sort the regions
leaderboard = np.zeros(len(region)) #put in a row of zeros to enable the vstacking in the loop below

for date in dates:
    
    date_slice = group_by_date_and_region.loc[group_by_date_and_region["date"] == date]
    leaderboard_for_date = pd.DataFrame(data={"region" : date_slice["region"], "rank" : date_slice["average_price"].rank(), 
                                        "average_price" : date_slice["average_price"]})
    leaderboard_for_date = leaderboard_for_date.sort_values("region")
    leaderboard_for_date = leaderboard_for_date["rank"].values.reshape(-1, 1).transpose()
    leaderboard = np.vstack((leaderboard, leaderboard_for_date))

leaderboard = np.delete(leaderboard, 0, 0) #remove the row of zeros initially inserted to allow vstacking
leaderboard = pd.DataFrame(columns=region, data=leaderboard)
leaderboard["date"] = dates
leaderboard = leaderboard.set_index("date")
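
The same table can probably be built without the explicit loop. The following is a minimal sketch on a small made-up sample, assuming the column names used in this notebook after renaming. `sample`, `mean_price` and `leaderboard_sketch` are illustrative names, and I have not timed this against the full dataset.

```python
import pandas as pd

# Made-up sample: two weekly dates, two regions, two rows per (date, region).
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04"] * 4 + ["2015-01-11"] * 4),
        "region": ["Albany", "Albany", "Houston", "Houston"] * 2,
        "Average Price": [1.20, 1.80, 0.90, 1.50, 1.00, 1.60, 1.10, 1.70],
    }
)

# One row per date, one column per region, holding the mean price.
mean_price = sample.pivot_table(
    index="Date", columns="region", values="Average Price", aggfunc="mean"
)

# Rank across regions within each date; rank 1 is the lowest mean price.
leaderboard_sketch = mean_price.rank(axis=1)
print(leaderboard_sketch)
```

`pivot_table` sorts the region columns alphabetically and the dates chronologically, which matches the ordering the loop above produces by hand.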

## Plotting the leaderboard and the resulting confusion

In [None]:
dates = leaderboard.index.values
regions = avocados.region.unique()

fig, ax = plt.subplots(1, 1, figsize=(25, 12))

for region in regions:
    plt.plot(dates, leaderboard[region].values)
    
plt.title("Price leadership of regions over time", fontsize=20)
plt.show()

# A Different Tactic: Price Leadership Averaged Over Month
That plot took a good couple of minutes to generate and isn't really that informative. Maybe a slightly different grouping will help: price leadership averaged per month.

In [None]:
#split the date values into year and month. Group on (year, month)

avocados["time_tuple"] = avocados["Date"]
avocados["time_tuple"] = avocados["time_tuple"].apply(lambda x: x.timetuple())
avocados["year, month"] = avocados["time_tuple"].apply(lambda x: pd.Period((str(x[0]) + "-" + str(x[1]))))


grp = avocados['Average Price'].groupby([avocados["year, month"], avocados["region"]]).mean()
index_values = pd.Series(grp.index.values).apply(pd.Series)
group_by_date_and_region = pd.DataFrame(data={"year, month" : index_values[0], "region" : index_values[1], "average_price" : grp.values})

dates = avocados["year, month"].unique()
dates.sort()
region = avocados.region.unique()

region.sort() #alpha sort the regions
leaderboard = np.zeros(len(region)) #put in a row of zeros to enable the vstacking in the loop below

for date in dates:
    
    date_slice = group_by_date_and_region.loc[group_by_date_and_region["year, month"] == date]
    leaderboard_for_date = pd.DataFrame(data={"region" : date_slice["region"], "rank" : date_slice["average_price"].rank(), 
                                        "average_price" : date_slice["average_price"]})
    leaderboard_for_date = leaderboard_for_date.sort_values("region")
    leaderboard_for_date = leaderboard_for_date["rank"].values.reshape(-1, 1).transpose()
    leaderboard = np.vstack((leaderboard, leaderboard_for_date))

leaderboard = np.delete(leaderboard, 0, 0) #remove the row of zeros initially inserted to allow vstacking
leaderboard = pd.DataFrame(columns=region, data=leaderboard)
leaderboard["year, month"] = dates
leaderboard = leaderboard.set_index("year, month")
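
A shorter route to the monthly grouping might be the `.dt.to_period("M")` accessor, which turns each timestamp into a monthly `Period` directly. This is a sketch on made-up data; the `year_month` column name and the variable names are my own choices for the example.

```python
import pandas as pd

# Made-up sample: three dates across two months for two regions.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2015-01-04", "2015-01-25", "2015-02-01"] * 2
        ),
        "region": ["Albany"] * 3 + ["Houston"] * 3,
        "Average Price": [1.20, 1.40, 1.00, 1.00, 1.20, 1.10],
    }
)

# Convert each date to its calendar month (a pandas Period with monthly frequency).
sample["year_month"] = sample["Date"].dt.to_period("M")

# Mean price per (month, region), with regions spread out as columns.
monthly_mean = (
    sample.groupby(["year_month", "region"])["Average Price"]
    .mean()
    .unstack("region")
)

# Rank regions within each month; rank 1 is the lowest mean price.
monthly_leaderboard = monthly_mean.rank(axis=1)
print(monthly_leaderboard)
```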

## Plot a slightly different leaderboard

In [None]:
dates = leaderboard.index.values
regions = avocados.region.unique()

fig, ax = plt.subplots(1,1, figsize=(25, 12))
x = np.arange(0,len(dates),1)
ax.set_xticks(x)
ax.set_xticklabels(dates)
for region in regions:
    plt.plot(x,leaderboard[region].values)
plt.xticks(x, rotation='vertical')
plt.title("Price leadership of regions averaged per month", fontsize=20)
plt.show()

## The results...meh
The plot is a little less chaotic but the benefits of that grouping were trivial, if any. 

As a last-ditch effort to pull some value from this expedition, I'll try to narrow things down to the top performers (the top 5 is the easiest to read; even 10 becomes a tangled mess).

In [None]:

grp = leaderboard.mean(axis=0).sort_values()
top = grp.head(5).index.values

dates = leaderboard.index.values
regions = top

fig, ax = plt.subplots(1,1, figsize=(30, 20))
x = np.arange(0,len(dates),1)
ax.set_xticks(x)
ax.set_xticklabels(dates)
for region in regions:
    plt.plot(x, leaderboard[region].values, label=region)
    
plt.xlabel("month", fontsize=20)
plt.xticks(rotation='vertical', fontsize=20)
plt.ylabel("leadership rank", fontsize=20)
plt.yticks(np.arange(0, 45, 1.0))
plt.legend(fontsize=20)
plt.title("Top five price leaders averaged per month", fontsize=20)
plt.show()
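
To spell out what the selection in the cell above does, here is a toy version with made-up prices for three regions over three months. Because `rank()` is ascending, the regions that come out on top are the ones with the lowest mean rank, i.e. the ones that are cheapest most often. The numbers and the `top_two` name are illustrative only.

```python
import pandas as pd

# Made-up monthly mean prices for three regions.
prices = pd.DataFrame(
    {
        "Albany": [1.5, 1.6, 1.2],
        "Houston": [1.0, 1.1, 0.9],
        "Boston": [1.2, 1.0, 1.4],
    },
    index=pd.period_range("2015-01", periods=3, freq="M"),
)

# Rank within each month: the lowest price gets rank 1.
ranks = prices.rank(axis=1)

# Average each region's rank over time and keep the best (smallest) ones.
mean_rank = ranks.mean(axis=0).sort_values()
top_two = mean_rank.head(2).index.tolist()
print(top_two)
```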

# Observations
1. Even among the top 5, there is a lot of variability in price leadership.
2. Houston and DallasFtWorth were the *most* solid front-runners but, even between these two, there were periods where neither of them was at the top.
3. Indeed, for the periods below, the price leader and second place came from outside the top 5.
    * November 2015 - April 2016.
    * September 2017 - November 2017.
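
Rather than reading such periods off the plot, they could be checked programmatically. The sketch below uses a made-up rank table and a top 2 as a stand-in for the top 5. It flags months where the best rank held by any top region is worse than 2, meaning first and second place both went to other regions (ignoring ties).

```python
import pandas as pd

# Made-up monthly ranks for four regions (1 = lowest price that month).
ranks = pd.DataFrame(
    {
        "Houston": [1, 2, 3, 1],
        "DallasFtWorth": [2, 1, 4, 2],
        "Albany": [3, 4, 1, 3],
        "Boston": [4, 3, 2, 4],
    },
    index=pd.period_range("2015-01", periods=4, freq="M"),
)

# The overall front-runners: smallest mean rank (top 2 here, top 5 in the notebook).
top = ranks.mean(axis=0).sort_values().head(2).index

# Best rank achieved by any front-runner in each month.
best_rank_among_top = ranks[top].min(axis=1)

# Months where neither first nor second place belongs to a front-runner.
outside_months = [str(m) for m in ranks.index[best_rank_among_top > 2]]
print(outside_months)
```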

# Further Investigation
* the investigation was fairly coarse-grained in that
    * the influence of avocado type on price leadership (conventional vs organic) was not explored.
    * it might have been useful to take a look at price leadership averaged across the years.
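
Both follow-ups could start from the same pivot-and-rank pattern. The sketch below is a rough starting point on made-up data, and it assumes the avocado type is stored in a column named `type`. That name and the derived `year` column are assumptions of this example.

```python
import pandas as pd

# Made-up sample: one date per year, two regions, two avocado types.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-01-04"] * 4 + ["2016-01-03"] * 4),
        "region": ["Albany", "Albany", "Houston", "Houston"] * 2,
        "type": ["conventional", "organic"] * 4,  # assumed column name
        "Average Price": [1.20, 1.80, 0.90, 1.90, 1.00, 1.60, 1.10, 1.50],
    }
)
sample["year"] = sample["Date"].dt.year

# Mean price per (type, year) row and region column.
yearly_mean = sample.pivot_table(
    index=["type", "year"],
    columns="region",
    values="Average Price",
    aggfunc="mean",
)

# Rank regions separately for each avocado type and year; rank 1 is cheapest.
yearly_leaderboard = yearly_mean.rank(axis=1)
print(yearly_leaderboard)
```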