# Which avocado size is most popular?

Avocados are increasingly popular and delicious in guacamole and on toast. The Hass Avocado Board keeps track of avocado supply and demand across the USA, including the sales of three different sizes of avocado. In this exercise, you'll use a bar plot to figure out which size is the most popular.

Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you'll often have to manipulate your data first in order to get the numbers you need for plotting.

In [2]:
# # Import matplotlib.pyplot with alias plt
# import matplotlib.pyplot as plt

# # Look at the first few rows of data
# print(avocados.head())

# # Get the total number of avocados sold of each size
# nb_sold_by_size = avocados.groupby("size")["nb_sold"].agg('sum')

# # Create a bar plot of the number of avocados sold by size
# nb_sold_by_size.plot(kind="bar")

# # Show the plot
# plt.show()
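
To try the aggregation step without the course dataset, here is a minimal sketch on a tiny hand-made DataFrame. The rows and numbers are invented for illustration; only the column names `size` and `nb_sold` come from the cell above, and `toy_avocados` is a hypothetical stand-in for `avocados`.

```python
import pandas as pd

# Toy data: invented values, not the Hass Avocado Board figures
toy_avocados = pd.DataFrame({
    "size": ["small", "large", "extra_large", "small", "large"],
    "nb_sold": [100, 80, 5, 120, 90],
})

# Sum the number sold within each size category
toy_sold_by_size = toy_avocados.groupby("size")["nb_sold"].sum()

# The category with the largest total is the most popular size
most_popular = toy_sold_by_size.idxmax()

print(toy_sold_by_size)
print(most_popular)
```

Calling `.plot(kind="bar")` on `toy_sold_by_size` would then draw one bar per size, which is the same pattern the cell above applies to the real data.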

# Changes in sales over time

Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time, since each time point is naturally connected to the next time point.

In [3]:
# # Import matplotlib.pyplot with alias plt
# import matplotlib.pyplot as plt

# # Get the total number of avocados sold on each date
# nb_sold_by_date = avocados.groupby("date")["nb_sold"].agg('sum')

# # Create a line plot of the number of avocados sold by date
# nb_sold_by_date.plot(kind="line")

# # Show the plot
# plt.show()
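
The grouping step for a time series can be sketched on toy data as follows. The values are invented, and the sketch assumes the `date` column arrives as text, which may not be true of the course dataset. Converting it with `pd.to_datetime` lets pandas treat the x-axis as real dates.

```python
import pandas as pd

# Toy data: invented values; two rows per date to mimic several regions
toy_sales = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"],
    "nb_sold": [10, 20, 15, 30],
})

# Assumption: dates are stored as strings, so parse them into datetimes
toy_sales["date"] = pd.to_datetime(toy_sales["date"])

# Total sold on each date; groupby returns the dates in sorted order
toy_sold_by_date = toy_sales.groupby("date")["nb_sold"].sum()

print(toy_sold_by_date)
```

Because the result is indexed by date, `toy_sold_by_date.plot(kind="line")` would connect the totals in chronological order.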

# Avocado supply and demand

Scatter plots are ideal for visualizing relationships between numerical variables. If they're related, you may be able to use one number to predict the other.

In [4]:
# # Scatter plot of nb_sold vs avg_price with title
# avocados.plot(x="nb_sold", y="avg_price", kind="scatter", title="Number of avocados sold vs. average price")

# # Show the plot
# plt.show()
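
A scatter plot shows whether two variables move together. A correlation coefficient is one way to put a number on the same idea. The sketch below uses invented values in which price falls as volume rises; it is not a claim about the real avocado data.

```python
import pandas as pd

# Toy data: invented values where higher volume goes with lower price
toy_market = pd.DataFrame({
    "nb_sold": [100, 200, 300, 400],
    "avg_price": [1.8, 1.5, 1.3, 1.0],
})

# Pearson correlation between the two numeric columns
price_volume_corr = toy_market["nb_sold"].corr(toy_market["avg_price"])

print(price_volume_corr)
```

A value close to -1 or 1 suggests that one variable could help predict the other, while a value near 0 suggests little linear relationship.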

# Price of conventional vs. organic avocados

Creating multiple plots for different subsets of data allows you to compare groups.

In [5]:
# # Histogram of conventional avg_price with 20 bins
# avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

# # Histogram of organic avg_price with 20 bins
# avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# # Add a legend
# plt.legend(["conventional", "organic"])

# # Show the plot
# plt.show()
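
The overlay pattern can also be written as a loop over groups, which avoids repeating the filter for each type. This is a sketch on invented prices. The `Agg` backend line is an assumption made so the code runs without a display, and it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: invented prices for two avocado types
toy_prices = pd.DataFrame({
    "type": ["conventional"] * 4 + ["organic"] * 4,
    "avg_price": [1.0, 1.1, 1.2, 1.1, 1.6, 1.7, 1.8, 1.7],
})

fig, ax = plt.subplots()

# Draw one semi-transparent histogram per type on the same axes
for avocado_type, group in toy_prices.groupby("type"):
    group["avg_price"].hist(ax=ax, alpha=0.5, bins=20, label=avocado_type)

# The legend picks up the label given to each histogram
ax.legend()
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]

# In a notebook, plt.show() would display the figure at this point
```

Passing `label` to each histogram ties the legend entry to the data it describes, so the legend does not depend on the order in which the plots were drawn.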

# Finding missing values

Missing values are everywhere, and you don't want them interfering with your work. Some functions ignore missing data by default, but that's not always the behavior you might want. Some functions can't handle missing values at all, so these values need to be taken care of before you can use them.

In [6]:
# # Import matplotlib.pyplot with alias plt
# import matplotlib.pyplot as plt

# # Check individual values for missing values
# print(avocados_2016.isna())

# # Check each column for missing values
# print(avocados_2016.isna().any())

# # Bar plot of missing values by variable
# avocados_2016.isna().sum().plot(kind="bar")

# # Show plot
# plt.show()
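
The following sketch shows the three checks on a small DataFrame with invented values. It also shows how a summary such as `mean()` skips missing values by default. `toy_2016` is a hypothetical stand-in for `avocados_2016`.

```python
import numpy as np
import pandas as pd

# Toy data: invented values with two deliberately missing entries
toy_2016 = pd.DataFrame({
    "date": ["2016-01-03", "2016-01-10", "2016-01-17"],
    "small_sold": [100.0, np.nan, 120.0],
    "large_sold": [80.0, 85.0, np.nan],
    "xl_sold": [5.0, 6.0, 7.0],
})

# Boolean mask with True wherever a value is missing
missing_mask = toy_2016.isna()

# One boolean per column: does it contain any missing value?
has_missing = toy_2016.isna().any()

# One count per column: how many values are missing?
missing_counts = toy_2016.isna().sum()

# mean() ignores NaN by default; skipna=False makes it propagate instead
mean_default = toy_2016["small_sold"].mean()
mean_strict = toy_2016["small_sold"].mean(skipna=False)

print(missing_counts)
print(mean_default, mean_strict)
```

The bar plot in the cell above is simply `missing_counts` drawn with `.plot(kind="bar")`.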

# Removing missing values

There are a few options for dealing with missing values. One way is to remove them from the dataset completely.

In [None]:
# # Remove rows with missing values
# avocados_complete = avocados_2016.dropna()

# # Check if any columns contain missing values
# print(avocados_complete.isna().any())
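
Dropping rows is simple, but it can discard a lot of data, because one missing value removes the whole row. The sketch below, on invented values, compares the row count before and after. `toy_2016` is a hypothetical stand-in for `avocados_2016`.

```python
import numpy as np
import pandas as pd

# Toy data: invented values; two of the three rows have a missing entry
toy_2016 = pd.DataFrame({
    "small_sold": [100.0, np.nan, 120.0],
    "large_sold": [80.0, 85.0, np.nan],
    "xl_sold": [5.0, 6.0, 7.0],
})

# Remove every row that contains at least one missing value
toy_complete = toy_2016.dropna()

rows_before = len(toy_2016)
rows_after = len(toy_complete)

print(rows_before, rows_after)
print(toy_complete.isna().any())
```

Comparing `rows_before` with `rows_after` is a quick check on how much data the removal costs.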

# Replacing missing values

Another way of handling missing values is to replace them all with the same value. For numerical variables, one option is to replace missing values with 0.

In [7]:
# # List the columns with missing values
# cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

# # Create histograms showing the distributions of cols_with_missing
# avocados_2016[cols_with_missing].plot(kind="hist")

# # Show the plot
# plt.show()

In [8]:
# # From previous step
# cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
# avocados_2016[cols_with_missing].hist()
# plt.show()

# # Fill in missing values with 0
# avocados_filled = avocados_2016[cols_with_missing].fillna(0)

# # Create histograms of the filled columns
# avocados_filled[cols_with_missing].plot(kind="hist")

# # Show the plot
# plt.show()
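
Filling with 0 changes the distribution, which is why the cells above compare histograms before and after. The same effect can be seen numerically in this sketch on invented values, with `toy_2016` as a hypothetical stand-in for `avocados_2016`.

```python
import numpy as np
import pandas as pd

# Toy data: invented values with one missing entry per column
toy_2016 = pd.DataFrame({
    "small_sold": [100.0, np.nan, 120.0],
    "large_sold": [80.0, 85.0, np.nan],
})
cols_with_missing = ["small_sold", "large_sold"]

# Replace every missing value in the selected columns with 0
toy_filled = toy_2016[cols_with_missing].fillna(0)

# The mean drops because the zeros now count as real observations
mean_before = toy_2016["small_sold"].mean()
mean_after = toy_filled["small_sold"].mean()

print(mean_before, mean_after)
```

If zeros are not plausible values for the variable, the shift in the mean is a sign that a different replacement value may be a better fit.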

# List of dictionaries

You recently got some new avocado data from 2019 that you'd like to put in a DataFrame using the list of dictionaries method. 

In [9]:
# # Import pandas with alias pd
# import pandas as pd

# # Create a list of dictionaries with new data
# avocados_list = [
#     {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
#     {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
# ]

# # Convert list into DataFrame
# avocados_2019 = pd.DataFrame(avocados_list)

# # Print the new DataFrame
# print(avocados_2019)

# Dictionary of lists

Some more data just came in! This time, you'll use the dictionary of lists method, parsing the data column by column.

In [10]:
# # Create a dictionary of lists with new data
# avocados_dict = {
#   "date": ["2019-11-17", "2019-12-01"],
#   "small_sold": [10859987, 9291631],
#   "large_sold": [7674135, 6238096]
# }

# # Convert dictionary into DataFrame
# avocados_2019 = pd.DataFrame(avocados_dict)

# # Print the new DataFrame
# print(avocados_2019)
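
The two construction methods differ only in how the input is organized: row by row or column by column. As a sketch, the snippet below builds the same two rows both ways, reusing the 2019-11-03 and 2019-11-10 figures from the list-of-dictionaries cell, and checks that the results match. The variable names `by_rows` and `by_columns` are hypothetical.

```python
import pandas as pd

# Row by row: each dictionary is one row
by_rows = pd.DataFrame([
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
])

# Column by column: each key is a column, each list holds its values
by_columns = pd.DataFrame({
    "date": ["2019-11-03", "2019-11-10"],
    "small_sold": [10376832, 10717154],
    "large_sold": [7835071, 8561348],
})

# Both approaches produce identical DataFrames
same_result = by_rows.equals(by_columns)
print(same_result)
```

Which form is more convenient usually depends on how the data arrives: record by record, or as one list per variable.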

# CSV to DataFrame

To work with the airline bumping data, you'll need to get the CSV into a pandas DataFrame and do some manipulation!

In [11]:
# # Read CSV as DataFrame called airline_bumping
# airline_bumping = pd.read_csv("airline_bumping.csv")

# # Take a look at the DataFrame
# print(airline_bumping.head())

# # For each airline, select nb_bumped and total_passengers and sum
# airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# # Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
# airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

# # Print airline_totals
# print(airline_totals)
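
Since `airline_bumping.csv` is not included here, the same steps can be tried on CSV text held in memory. In this sketch the airline names and counts are invented, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# Toy CSV: invented airlines and counts, standing in for airline_bumping.csv
csv_text = """airline,nb_bumped,total_passengers
AIR_A,10,100000
AIR_A,30,100000
AIR_B,5,50000
AIR_B,5,50000
"""

# read_csv accepts a file path or any file-like object
toy_bumping = pd.read_csv(io.StringIO(csv_text))

# Sum both count columns for each airline
toy_totals = toy_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

# Bumps per 10,000 passengers for each airline
toy_totals["bumps_per_10k"] = (
    toy_totals["nb_bumped"] / toy_totals["total_passengers"] * 10000
)

print(toy_totals)
```

Dividing the summed counts, rather than averaging per-row rates, weights each row by its passenger volume.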

# DataFrame to CSV

Next, you'll need to sort the data and export it to CSV so that your colleagues can read it.

In [12]:
# # Create airline_totals_sorted
# airline_totals_sorted = airline_totals.sort_values("bumps_per_10k", ascending=False)

# # Print airline_totals_sorted
# print(airline_totals_sorted)

# # Save as airline_totals_sorted.csv
# airline_totals_sorted.to_csv("airline_totals_sorted.csv")
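
One detail worth checking before sharing the file is that the index is written too. By default `to_csv` writes the index, so the airline names survive the export. The sketch below uses invented totals and calls `to_csv()` without a path, which returns the CSV text as a string instead of writing a file.

```python
import io

import pandas as pd

# Toy totals: invented values, indexed by airline as after a groupby
toy_totals = pd.DataFrame(
    {
        "nb_bumped": [10, 40],
        "total_passengers": [100000, 200000],
        "bumps_per_10k": [1.0, 2.0],
    },
    index=pd.Index(["AIR_B", "AIR_A"], name="airline"),
)

# Highest bump rate first
toy_totals_sorted = toy_totals.sort_values("bumps_per_10k", ascending=False)

# With no path argument, to_csv returns the CSV content as a string
csv_string = toy_totals_sorted.to_csv()
header_line = csv_string.splitlines()[0]

# Reading it back with index_col restores the airline index
round_trip = pd.read_csv(io.StringIO(csv_string), index_col="airline")

print(csv_string)
```

Passing `index=False` would drop the airline names from the output, so the default is the safer choice for a grouped result like this one.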