See README.md file for more details about this notebook.

#### Import modules

In [None]:
import pandas as pd
from itertools import combinations as cmb
from collections import Counter
import os

#### Read the file(s)
Task 1: Merge all the individual month sales files into a single file

In [None]:
# identifying the files
dir_path = os.path.dirname(os.path.realpath("ess_analysis.ipynb"))
csvs = []
for root, dirs, files in os.walk(dir_path):
    csvs.extend(
        os.path.join(root, file) for file in files if file.endswith("_2019.csv")
    )
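
The file search above can also be sketched with `pathlib`. This is only an illustration under stated assumptions: the helper name `find_monthly_csvs` and the file names in the demo are hypothetical, and the demo runs in a throwaway temporary directory rather than on the real data.

```python
from pathlib import Path
import tempfile


def find_monthly_csvs(base_dir, suffix="_2019.csv"):
    """Return the sorted paths of all files under base_dir whose name ends with suffix."""
    return sorted(str(p) for p in Path(base_dir).rglob(f"*{suffix}") if p.is_file())


# demo on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    for name in ("Sales_January_2019.csv", "Sales_February_2019.csv", "notes.txt"):
        (data_dir / name).write_text("Order ID\n")
    found = find_monthly_csvs(tmp)
    found_names = [Path(p).name for p in found]

print(found_names)
```

Sorting the result keeps the merge order reproducible, since directory traversal order is not guaranteed to be the same on every system.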

In [None]:
# merging the files into a single DataFrame
# (one `pd.concat` call over all files avoids re-copying the data on every iteration)
all_months = pd.concat([pd.read_csv(csv) for csv in csvs], ignore_index=True)
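
To see what the merge does without touching the real files, here is a minimal sketch on two in-memory CSVs. The `io.StringIO` buffers stand in for the monthly files, and their contents are invented for the example.

```python
import io

import pandas as pd

# two tiny in-memory "files" with the same header (made-up rows)
jan = io.StringIO("Order ID,Product\n1,USB-C Charging Cable\n")
feb = io.StringIO("Order ID,Product\n2,iPhone\n")

# read each source, then stack all of them in a single concat call
merged = pd.concat([pd.read_csv(src) for src in (jan, feb)], ignore_index=True)
print(merged)
```

With `ignore_index=True` the merged frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.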

In [None]:
# exporting to a .csv file
all_months.to_csv("ess_analysis.csv", index=False)

Task 2: Read that file

In [None]:
df = pd.read_csv("ess_analysis.csv")
df.tail()

#### Clean up the data

Task 1: Identify and remove rows with missing values

In [None]:
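# # identifying
# null_values = pd.isnull(df['Order ID'])
# df[null_values]
# removing
df.dropna(inplace=True)
# we drop any possible duplicates
df.drop_duplicates(inplace=True)

# some rows hold the literal text "Order ID" in the 'Order ID' column:
# they are header lines that ended up among the data, not orders, so we drop them
i = df[df["Order ID"] == "Order ID"].index
df.drop(i, inplace=True)
# additionally we reassign the index (in place, otherwise the result is discarded)
df.reset_index(drop=True, inplace=True)
# quick check: this should now return an empty DataFrame
df[df["Order ID"] == "Order ID"]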
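
The cleaning steps can be tried on a toy frame first. The sketch below assumes, as the filter above suggests, that the stray rows are header lines repeated inside the data; the values are invented.

```python
import pandas as pd

# toy data: two real orders, one repeated header line, one empty row
raw = pd.DataFrame(
    {
        "Order ID": ["1001", "Order ID", "1002", None],
        "Quantity Ordered": ["1", "Quantity Ordered", "2", None],
    }
)

clean = raw.dropna()
# keep only rows that are not header lines, then rebuild the index
clean = clean[clean["Order ID"] != "Order ID"].reset_index(drop=True)
# the numeric conversion only succeeds once the header text is gone
clean["Quantity Ordered"] = clean["Quantity Ordered"].astype("int8")
print(clean)
```

Filtering with a boolean mask returns a new frame, so no `inplace=True` is needed here.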

In [None]:
# let's rename the 'Price Each' column to 'Price Unit'
df.rename(columns={"Price Each": "Price Unit"}, inplace=True)

Task 2: Make sure every column has the right data type<br />
- 'Quantity Ordered' as `int` and 'Price Unit' as a numeric type (`float`)
- 'Order Date' as `datetime` (converted later, in Question 3)

In [None]:
# with `df.dtypes` we can check the dtype of every column
# df.dtypes
# converting with the `.astype()` method or the `pd.to_numeric()` function
df["Quantity Ordered"] = df["Quantity Ordered"].astype("int8")
df["Price Unit"] = pd.to_numeric(df["Price Unit"])
df.dtypes

#### Question 1: What was the best month for sales? How much was earned that month?

To answer that question, let's first add a 'Sales' column and a 'Month' column

In [None]:
df["Sales"] = df["Quantity Ordered"] * df["Price Unit"]
df["Month"] = df["Order Date"].str[:2]
# and convert it to `int` using `.astype()` method
df["Month"] = df["Month"].astype("int8")

Now let's finally answer the question!

In [None]:
# `numeric_only=True` keeps text columns such as 'Product' out of the sum
answer1 = df.groupby("Month").sum(numeric_only=True)
answer1

A.: As we can see, **December** was the best month for sales, bringing in **$4,608,295.70** from **28,074** items sold.
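
Instead of reading the answer off the table, the best month can also be picked programmatically. This is a minimal sketch on invented numbers, not the real sales data.

```python
import pandas as pd

# made-up orders for three months
toy = pd.DataFrame(
    {
        "Month": [1, 1, 2, 12, 12],
        "Quantity Ordered": [1, 2, 1, 3, 1],
        "Price Unit": [10.0, 5.0, 20.0, 10.0, 15.0],
    }
)
toy["Sales"] = toy["Quantity Ordered"] * toy["Price Unit"]

monthly_sales = toy.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()  # index label (month) of the largest total
best_total = monthly_sales.max()     # the total itself
print(best_month, best_total)
```

On the notebook's data the same two calls could be applied to `answer1["Sales"]`.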

Let's viz!

In [None]:
import matplotlib.pyplot as plt

In [None]:
# plt.style.use('seaborn')
# plt.style.use('ggplot')

In [None]:
months = [
    "JAN",
    "FEB",
    "MAR",
    "APR",
    "MAY",
    "JUN",
    "JUL",
    "AUG",
    "SEP",
    "OCT",
    "NOV",
    "DEC",
]

plt.plot(months, answer1["Sales"])
# plt.bar(months, answer1['Sales'])
plt.xticks(months, size=8)
plt.xlabel("Months")
plt.ylabel("Sales ($US Million)")
plt.grid(axis="both")
plt.show()

In [None]:
plt.pie(
    answer1["Sales"],
    labels=months,
    autopct="%.2f %%",
    counterclock=False,
    radius=2.5,
)
plt.xlabel("Sales % by Month", labelpad=200)
plt.show()

#### Question 2: What city sold the most product?

Again, let's create another column called 'City'

Task 1: Get the cities with their respective states

In [None]:
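def get_city(address):
    """Return the city, i.e. the second comma-separated part of the address."""
    return address.split(", ")[1]


def get_state(address):
    """Return the state code, i.e. the first word of the last comma-separated part."""
    return address.split(", ")[-1].split(" ")[0]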
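
To make the string splitting concrete, here is how the two helpers behave on a single address. The sample address is hypothetical, and the "street, city, state zip" layout is an assumption inferred from the way the helpers split the string.

```python
def get_city(address):
    # second comma-separated part, e.g. "Dallas"
    return address.split(", ")[1]


def get_state(address):
    # last part is "TX 75001"; its first word is the state code
    return address.split(", ")[-1].split(" ")[0]


sample = "917 1st St, Dallas, TX 75001"  # hypothetical address
label = f"{get_city(sample)} ({get_state(sample)})"
print(label)
```

Keeping the state next to the city matters because cities in different states can share a name (Portland, for instance), and grouping on the city alone would merge them.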

Task 2: Put them in the 'City' column

In [None]:
# cities = []
# for row in df['Purchase Address']:
#     cities.append(f"{get_city(row)} ({get_state(row)})")
# df['City'] = cities

# or using the `.apply()` method we can simply do
df["City"] = df["Purchase Address"].apply(
    lambda x: f"{get_city(x)} ({get_state(x)})"
)

Show the results

In [None]:
answer2 = df.groupby("City").sum(numeric_only=True)
answer2

B.: With **50,169** units ordered, **San Francisco (CA)** is by far the city that sold the most products.<br />
Now let's take a visual look at that.

In [None]:
cities = [city for city, _ in df.groupby("City")]

# this style was called "seaborn" before matplotlib 3.6
plt.style.use("seaborn-v0_8")
plt.bar(cities, answer2["Quantity Ordered"])
plt.xticks(cities, rotation=90, size=8)
plt.xlabel("Cities", labelpad=10)
plt.ylabel("Orders (Units)", labelpad=10)
plt.show()

In [None]:
plt.pie(
    answer2["Quantity Ordered"],
    labels=cities,
    autopct="%.2f %%",
    radius=2,
    counterclock=True,
)
plt.xlabel("Orders % by City", labelpad=150)
plt.show()

#### Question 3: What time should we display advertisements to maximize the likelihood of customers buying products?

First, let's convert our **'Order Date'** column into `datetime` type

In [None]:
df["Order Date"] = pd.to_datetime(df["Order Date"])

And break it down to <i>Hour</i> and <i>Minute</i>

In [None]:
# note: `df.insert(5, 'Hour', df['Order Date'].dt.hour, True)` works in place and
# returns None, so its result must not be assigned back to df['Hour']
df["Hour"] = df["Order Date"].dt.hour
df["Minute"] = df["Order Date"].dt.minute
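
The conversion and the hour extraction can be sketched on a few made-up timestamps. The explicit `format` string is an assumption about how 'Order Date' is written (month/day/two-digit year, then hour:minute); the cell above lets pandas infer it.

```python
import pandas as pd

# invented orders, written as "MM/DD/YY HH:MM"
orders = pd.DataFrame(
    {
        "Order Date": ["04/19/19 08:46", "04/07/19 22:30", "04/12/19 08:05"],
        "Quantity Ordered": [2, 1, 1],
    }
)

# assumed format; adjust if the real column is laid out differently
orders["Order Date"] = pd.to_datetime(orders["Order Date"], format="%m/%d/%y %H:%M")
orders["Hour"] = orders["Order Date"].dt.hour

per_hour = orders.groupby("Hour")["Quantity Ordered"].sum()
busiest_hour = per_hour.idxmax()
print(per_hour)
```

Selecting the 'Quantity Ordered' column before summing avoids trying to add up the datetime column.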

So far so good. Now let's look at the results!

In [None]:
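# `numeric_only=True` is needed here: the datetime column 'Order Date' cannot be summed
answer3 = df.groupby(["Hour"]).sum(numeric_only=True)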
answer3.tail(50)

C.: Sales peak between **11:00** and **12:00**, and again around **19:00**. So, displaying ads around those hours would most likely draw more customers into buying our products.

In [None]:
hours = [hour for hour, _ in df.groupby("Hour")]


plt.plot(hours, answer3["Quantity Ordered"], "b.-")
# plt.plot(hours, answer3['Sales'], 'b-')
plt.xticks(hours)
plt.xlabel("Hours", labelpad=10)
plt.ylabel("Orders (Units)", labelpad=10)
plt.grid()
plt.show()

In [None]:
plt.pie(
    answer3["Quantity Ordered"],
    labels=hours,
    autopct="%.0f %%",
    radius=2,
    counterclock=False,
)
plt.xlabel("Orders % by Hour", labelpad=150)
plt.show()

#### Question 4: What products are most often sold together?

Well, in order to do that we'll have to look at the purchases that have the same Order ID value.

Let's create a smaller DF containing only rows with duplicated Order IDs

In [None]:
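# `.copy()` gives an independent DataFrame, so adding a column later raises no SettingWithCopyWarning
id_dup = df[df["Order ID"].duplicated(keep=False)].copy()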

Now let's create a new column that puts all the products from the same Order ID together in the same row.

In [None]:
# let's call it 'Cart'
id_dup["Cart"] = id_dup.groupby("Order ID")["Product"].transform(
    lambda x: ", ".join(x)
)

In [None]:
id_dup.head()

In [None]:
# now let's drop the duplicates as we no longer need them
id_dup = id_dup[["Order ID", "Cart"]].drop_duplicates()
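
As an alternative to `transform` followed by `drop_duplicates`, the carts can be built with one row per order directly. The sketch below is a different route to the same table, shown on invented orders.

```python
import pandas as pd

# made-up order lines: orders "1" and "3" contain two products each
toy = pd.DataFrame(
    {
        "Order ID": ["1", "1", "2", "3", "3"],
        "Product": [
            "iPhone",
            "Lightning Charging Cable",
            "AAA Batteries (4-pack)",
            "Google Phone",
            "USB-C Charging Cable",
        ],
    }
)

# one row per order: join that order's products into a single string
carts = toy.groupby("Order ID")["Product"].agg(", ".join)
# keep only the orders that hold more than one product
multi_item = carts[toy.groupby("Order ID").size() > 1]
print(multi_item)
```

`agg` collapses each group to a single value, whereas `transform` repeats that value on every row of the group, which is why the notebook needs the extra `drop_duplicates` step.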

Now we need to count which pairs of products occur most often in the same Cart.<br />
For this we use:
- `combinations` from `itertools` (imported as `cmb`)
- `Counter` from `collections`

In [None]:
count = Counter()

for items in id_dup.Cart:
    item_list = items.split(", ")
    count.update(Counter(cmb(item_list, 2)))

In [None]:
for key, value in count.most_common(10):
    print(key, value)
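
One detail worth knowing: `combinations` keeps the order of its input, so `(A, B)` and `(B, A)` are counted as two different pairs. The sketch below is a variation on the loop above, run on invented carts, that sorts each cart first so both orders land on the same key.

```python
from collections import Counter
from itertools import combinations

# made-up carts; the first two contain the same products in a different order
carts = [
    "iPhone, Lightning Charging Cable",
    "Lightning Charging Cable, iPhone",
    "iPhone, Lightning Charging Cable, Wired Headphones",
]

pair_counts = Counter()
for cart in carts:
    # sorting (and de-duplicating) makes the pair key independent of row order
    items = sorted(set(cart.split(", ")))
    pair_counts.update(combinations(items, 2))

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Whether this changes the ranking on the real data depends on how often the same pair shows up in both orders.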

D.: Well, it turns out people buy a lot of **iPhones** and **Lightning Charging Cables** together.

#### Question 5: What product sold the most? Why do you think it sold the most?

Okay, that should be a pretty easy one.

In [None]:
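products_group = df.groupby("Product")
# select the column before summing, and skip non-numeric columns in the full table
qty_ordered = products_group["Quantity Ordered"].sum()
products_group.sum(numeric_only=True)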

E.: With over **30,000** items sold, **AAA Batteries (4-pack)** is our best-selling product, followed by its sibling **AA Batteries (4-pack)**, which almost crosses the **28,000** mark.
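
One way to start probing the "why" part of the question is to put each product's unit price next to its units sold. This is only a sketch on invented numbers (the prices are not taken from the dataset), and it shows a possible approach, not a conclusion.

```python
import pandas as pd

# made-up order lines with made-up prices
toy = pd.DataFrame(
    {
        "Product": [
            "AAA Batteries (4-pack)",
            "AAA Batteries (4-pack)",
            "iPhone",
            "AAA Batteries (4-pack)",
        ],
        "Quantity Ordered": [2, 3, 1, 1],
        "Price Unit": [2.99, 2.99, 700.0, 2.99],
    }
)

# units sold and average unit price per product, best seller first
summary = (
    toy.groupby("Product")
    .agg(units=("Quantity Ordered", "sum"), unit_price=("Price Unit", "mean"))
    .sort_values("units", ascending=False)
)
top_product = summary.index[0]
print(summary)
```

On the real data, the same table would show whether the best sellers are also the cheapest items.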

In [None]:
products = [product for product, _ in df.groupby("Product")]

plt.bar(products, qty_ordered)
plt.xticks(products, rotation=90, size=8)
plt.xlabel("Products", labelpad=20)
plt.ylabel("Orders (Units)", labelpad=15)
plt.show()

In [None]:
plt.pie(
    qty_ordered,
    labels=products,
    autopct="%.2f %%",
    radius=3,
    counterclock=True,
)
# plt.xlabel('Orders % by Products', labelpad=150)
plt.show()

#### Don't mind this

In [None]:
# df1 = pd.DataFrame({'Zone': ['C', 'L', 'N', 'O', 'S'],
#                     'Total_MSP': [464245, 3764942, 1877505, 1023160, 3179477]})
# df2 = pd.DataFrame({'Zone': ['C', 'L', 'N', 'O', 'S'],
#                     'CasasFavelas_2017': [463, 4228, 851, 1802, 2060]})
# df3 = pd.merge(df2, df1, on='Zone')

In [None]:
# df3

In [None]:
# plt.style.use('ggplot')
# df3.plot.bar(x='Zone', logy=True)
# plt.xticks(rotation=0)
# plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.show()