# Exploratory Data Analysis, Time Series Analysis, and Sales Forecasting

I use this dataset to perform exploratory data analysis and gain insights, and then apply time series analysis to forecast sales 7 days ahead.

This task has been divided into three notebooks:
1. Part 1 - Exploratory Data Analysis
2. Part 2 - Time Series Analysis
3. Part 3 - Sales Forecasting

## Part 1 - Exploratory Data Analysis

In [None]:
#import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import matplotlib.ticker as ticker
import plotly.express as px

In [None]:
df = pd.read_csv('../input/sample-sales-data/sales_data_sample.csv', engine='python') #read the dataset

### Take a look at the dataset

In [None]:
df.head() #show the first five rows

In [None]:
df.info() #show the data info

In [None]:
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE']) #convert ORDERDATE to pandas datetime format

In [None]:
df.sort_values(by = ['ORDERDATE'], inplace = True) #sorting data by ORDERDATE
df.set_index('ORDERDATE', inplace = True) #setting the index to be the ORDERDATE (it will help a lot later on)
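
To illustrate why a sorted datetime index is convenient, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and only stand in for the real table. It shows two things the index makes easy: selecting a month with a partial date string, and resampling to monthly totals.

```python
import pandas as pd

# Hypothetical toy data standing in for the real sales table
toy = pd.DataFrame(
    {
        "ORDERDATE": pd.to_datetime(
            ["2003-01-06", "2003-01-20", "2003-02-11", "2003-02-25", "2003-03-03"]
        ),
        "SALES": [100.0, 250.0, 50.0, 75.0, 300.0],
    }
).set_index("ORDERDATE")

# Partial-string indexing: select every order placed in February 2003
feb_orders = toy.loc["2003-02"]

# Resampling: total sales per calendar month ("MS" = month start)
monthly_sales = toy["SALES"].resample("MS").sum()

print(feb_orders)
print(monthly_sales)
```

The same calls should work on `df` once ORDERDATE is the index, which is what makes the time series steps in the later parts simpler.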

In [None]:
print(df.isnull().sum()) #check if there is any null data or not

Since there are a lot of null values in ADDRESSLINE2, STATE, POSTALCODE, and TERRITORY, I will drop these columns. COUNTRY and CITY will represent the geographical information of each order.

In [None]:
to_drop = ['ADDRESSLINE2','STATE','POSTALCODE','TERRITORY']
df = df.drop(to_drop, axis = 1)
df.head()
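
The columns above were picked by reading the null counts by eye. As an alternative, here is a minimal sketch that picks them by the share of missing values. The `toy` frame, its values, and the `threshold` of 0.4 are assumptions chosen only for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the dataset, the values do not
toy = pd.DataFrame(
    {
        "CITY": ["NYC", "Paris", "Madrid", "Oslo"],
        "STATE": ["NY", np.nan, np.nan, np.nan],
        "ADDRESSLINE2": [np.nan, np.nan, np.nan, "Suite 2"],
        "SALES": [10.0, 20.0, 30.0, 40.0],
    }
)

null_share = toy.isnull().mean()  # fraction of missing values per column
threshold = 0.4  # assumed cutoff, chosen only for illustration
sparse_cols = null_share[null_share > threshold].index.tolist()

cleaned = toy.drop(columns=sparse_cols)  # drop the mostly-empty columns
print(sparse_cols)
print(cleaned.columns.tolist())
```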

In [None]:
print(df.isnull().sum()) #checking again if there are null values

In [None]:
#show the number of unique values in each column
for c in df.columns:
    print(f'Number of {c} unique values: {df[c].nunique()}')

In [None]:
df.describe() # describing the data

### Gain the insights

**Find out 20 Most Valuable Customers**

The Most Valuable Customers are the customers who are the most profitable for a company, that is, those with the highest total sales. These customers buy more, or buy higher-value products, than other customers.

In [None]:
top_customer = df.groupby(['CUSTOMERNAME'])[['SALES']].sum().sort_values('SALES', ascending = False).head(20) #sum only SALES per customer, then sort the customers as per the sales
top_customer = top_customer[['SALES']].round(3) #round off the sales value up to 3 decimal places
top_customer.reset_index(inplace = True) #reset the index to add the customer name into dataframe

In [None]:
plt.figure(figsize = (15,5)) #width and height of figure is defined in inches
plt.title('20 Most Valuable Customers (2003 - 2005)', fontsize = 18)
plt.bar(top_customer['CUSTOMERNAME'], top_customer['SALES'], color = '#37C6AB', edgecolor = 'black', linewidth = 1)
plt.xlabel('Customer Name', fontsize = 15) #x axis shows the customer name
plt.ylabel('Revenue', fontsize = 15) #y axis shows the revenue
plt.xticks(fontsize = 12, rotation = 90)
plt.yticks(fontsize = 12)
for k, v in top_customer['SALES'].items(): #to show the exact revenue generated on the figure
    if v > 600000:
        plt.text(k, v-270000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')
    else:
        plt.text(k, v+ 50000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')

**Find out 20 Highest Revenue by Country**

Here are the top 20 countries by revenue generated.

In [None]:
top_country = df.groupby(['COUNTRY'])[['SALES']].sum().sort_values('SALES', ascending = False).head(20) #sum only SALES per country, then sort the countries as per the sales
top_country = top_country[['SALES']].round(3) #round off the sales value up to 3 decimal places
top_country.reset_index(inplace = True) #reset the index to add the country into dataframe

In [None]:
plt.figure(figsize = (15,5)) #width and height of figure is defined in inches
plt.title('20 Highest Revenue by Country (2003 - 2005)', fontsize = 18)
plt.bar(top_country['COUNTRY'], top_country['SALES'], color = '#37C6AB', edgecolor = 'black', linewidth = 1)
plt.xlabel('Country', fontsize = 15) #x axis shows the country
plt.ylabel('Revenue', fontsize = 15) #y axis shows the revenue
plt.xticks(fontsize = 12, rotation = 90)
plt.yticks(fontsize = 12)
for k, v in top_country['SALES'].items(): #to show the exact revenue generated on the figure
    if v > 3000000:
        plt.text(k, v-1200000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')
    else:
        plt.text(k, v+100000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')

**Find out 20 Highest Revenue by City**

Here are the top 20 cities by revenue generated.

In [None]:
top_city = df.groupby(['CITY'])[['SALES']].sum().sort_values('SALES', ascending = False).head(20) #sum only SALES per city, then sort the cities as per the sales
top_city = top_city[['SALES']].round(3) #round off the sales value up to 3 decimal places
top_city.reset_index(inplace = True) #reset the index

In [None]:
plt.figure(figsize = (15,5))
plt.title('20 Highest Revenue by City (2003 - 2005)', fontsize = 18)
plt.bar(top_city['CITY'], top_city['SALES'], color = '#37C6AB', edgecolor = 'black', linewidth = 1 )
plt.xlabel('City', fontsize = 15) #x axis shows the city
plt.ylabel('Revenue', fontsize = 15) #y axis shows the revenue
plt.xticks(fontsize = 12, rotation = 90)
plt.yticks(fontsize = 12)
for k, v in top_city['SALES'].items(): #to show the exact revenue generated on the figure
    if v > 800000:
        plt.text(k, v-350000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')
    else:
        plt.text(k, v+35000, '$' + str(v), fontsize = 12, rotation = 90, color = 'black', ha = 'center')
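
The customer, country, and city rankings above all follow the same group, sum, sort, and take-the-top pattern. As a sketch of how that could be factored out, here is a small helper. The name `top_n_by_sales` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def top_n_by_sales(frame, key, n=20):
    """Return the n groups of `key` with the largest total SALES."""
    return (
        frame.groupby(key)["SALES"]
        .sum()                          # total sales per group
        .sort_values(ascending=False)   # largest first
        .head(n)
        .round(3)
        .reset_index()                  # bring the key back as a column
    )

# Hypothetical toy data, only to show the helper in action
toy = pd.DataFrame(
    {
        "CITY": ["Madrid", "Paris", "Madrid", "NYC", "Paris", "Madrid"],
        "SALES": [100.0, 40.0, 60.0, 90.0, 70.0, 10.0],
    }
)
top_city_toy = top_n_by_sales(toy, "CITY", n=2)
print(top_city_toy)
```

On the real data, a call such as `top_n_by_sales(df, 'COUNTRY')` should give the same kind of table as the cells above.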

**Which products give the highest revenue**

In [None]:
top_product = df.groupby(['PRODUCTLINE'])[['SALES']].sum().sort_values('SALES', ascending = False) #sum only SALES per product line, then sort the categories as per the sales
top_product = top_product[['SALES']] #keep only the sales column in dataframe
top_product.reset_index(inplace = True) #reset index
total_revenue_product = top_product['SALES'].sum() #find the total revenue generated as per product line
total_revenue_product = str(int(total_revenue_product)) #convert the total revenue from float to int and then to string
total_revenue_product = '$' + total_revenue_product #adding '$' sign before the value

In [None]:
plt.rcParams['figure.figsize'] = (13,7)
plt.rcParams['font.size'] = 12.0 #font size is defined
plt.rcParams['font.weight'] = 6 #font weight is defined
# we don't want to look at the percentage distribution in the pie chart. Instead, we want to look at the exact revenue generated by the product line.
def autopct_format(values):
    def my_format(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return ' ${v:d}'.format(v = val)
    return my_format
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#55B4B0','#E15D44','#009B77'] # Colors are defined for the pie chart
explode = (0.05,0.05,0.05,0.05,0.05,0.05,0.05)
fig1, ax1 = plt.subplots()
pie1 = ax1.pie(top_product['SALES'], colors = colors, labels = top_product['PRODUCTLINE'], autopct = autopct_format(top_product['SALES']), startangle = 90, explode = explode)
fraction_text_list = pie1[2]
for text in fraction_text_list:
    text.set_rotation(315)
center_circle = plt.Circle((0,0), 0.80, fc = 'white') # drawing a circle on the pie chart to make it look better 
fig = plt.gcf()
fig.gca().add_artist(center_circle)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle
# show the total revenue generated by all the categories at the center
label = ax1.annotate('Total Revenue \n' + str(total_revenue_product), color = 'red', xy = (0,0), fontsize = 12, ha  ='center')
plt.tight_layout()
plt.show()

As seen in the figure above, Classic Cars generated the highest revenue, about 3,919,616 dollars. The total revenue generated by all the product lines is 10,032,628 dollars.
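
To put these two figures in proportion, here is a quick calculation of the Classic Cars share of total revenue. It uses only the two numbers quoted above; the variable names are mine.

```python
# Figures taken from the pie chart above
classic_cars_revenue = 3919616
total_revenue = 10032628

# Fraction of total revenue that comes from Classic Cars
share = classic_cars_revenue / total_revenue
print(f"Classic Cars share of total revenue: {share:.1%}")
```

So Classic Cars alone accounts for roughly 39% of all revenue.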

### Correlation Test

**Correlation Features**

Plotting a correlation matrix to get an overview of how the features are related to one another.

In [None]:
plt.figure(figsize = (10,10))
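corr_matrix = df.corr(numeric_only = True) #numeric_only (pandas >= 1.5) skips the text columns, which would otherwise raise an error in pandas 2.x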
sns.heatmap(corr_matrix, annot = True)

**Observations**
* There is a high correlation between ORDERNUMBER and YEAR_ID, and between QTR_ID and MONTH_ID
* SALES, QUANTITYORDERED, PRICEEACH and MSRP are positively correlated with one another
* YEAR_ID is negatively correlated with QTR_ID and MONTH_ID