# Data Analysis Project: [Shopping Cart Database]

## Determine Business Problems (Questions)

1. How have the company's sales and revenue performed in recent months?
2. What are the most and least sold products?
3. What are our customer demographics?
4. When did each customer last make a transaction?
5. How often has each customer made a purchase in the last few months?
6. How much money has each customer spent in the last few months?
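
Questions 4 to 6 describe how recently, how often, and how much each customer buys. A minimal sketch of how they could be answered with one `groupby` is shown here. The toy rows and the column names `customer_id` and `order_id` are assumptions for illustration; they are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical toy transactions; rows and column names are assumed for illustration.
toy_orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_id": [10, 11, 12, 13, 14, 15],
    "order_date": pd.to_datetime([
        "2021-01-05", "2021-03-01", "2021-02-10",
        "2021-01-20", "2021-02-25", "2021-03-10",
    ]),
    "total_price": [100, 250, 80, 40, 60, 120],
})

# One row per customer: last order date, number of orders, total spend.
rfm = toy_orders.groupby("customer_id", as_index=False).agg(
    last_order_date=("order_date", "max"),
    frequency=("order_id", "nunique"),
    monetary=("total_price", "sum"),
)

# Recency: days between each customer's last order and the latest order overall.
reference_date = toy_orders["order_date"].max()
rfm["recency_days"] = (reference_date - rfm["last_order_date"]).dt.days

print(rfm)
```

Using the latest order date in the data as the reference point, rather than today's date, keeps the recency values stable no matter when the notebook is run.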

## Import All Packages/Libraries Used

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data Wrangling

### Gathering Data

In [None]:
# Loading the customers data
customers_df = pd.read_csv("https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/DicodingCollection/customers.csv")
customers_df.head()

In [None]:
# Loading the orders data
orders_df = pd.read_csv("https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/DicodingCollection/orders.csv")
orders_df.head()

In [None]:
# Loading the product data
product_df = pd.read_csv("https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/DicodingCollection/products.csv")
product_df.head()

In [None]:
# Loading the sales data
sales_df = pd.read_csv("https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/main/DicodingCollection/sales.csv")
sales_df.head()

**Insight:**
- All four datasets (customers, orders, products, and sales) have been loaded successfully. The next stage is to assess the quality of the data.

### Assessing Data

In [None]:
customers_df.info()
# print() is needed here: a bare expression in the middle of a cell is not displayed
print(customers_df.isna().sum())
print("Number of duplicates: ", customers_df.duplicated().sum())

In [None]:
customers_df.describe()

In [None]:
orders_df.info()
print("Number of duplicates: ", orders_df.duplicated().sum())

In [None]:
orders_df.describe()

In [None]:
product_df.info()
print("Number of duplicates: ", product_df.duplicated().sum())

In [None]:
product_df.describe()

In [None]:
sales_df.info()
# print() is needed here: a bare expression in the middle of a cell is not displayed
print(sales_df.isna().sum())
print("Number of duplicates: ", sales_df.duplicated().sum())

In [None]:
sales_df.describe()

**Insight:**
- **customers_df** = The maximum value in the age column is implausibly high, most likely because of an inaccurate entry. The assessment also shows duplicate rows and missing values in the gender column. All three problems are handled in the data cleaning stage.
- **orders_df** = There are no duplicates and no anomalous values. However, the order_date and delivery_date columns do not have the datetime data type, which is corrected in the data cleaning stage.
- **product_df** = The output above shows 6 duplicated rows in product_df. These duplicates are removed in the data cleaning stage.
- **sales_df** = There are no duplicates, and the summary statistics show no anomalies. However, the total_price column contains 19 missing values, which are handled in the data cleaning stage.
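
The same checks are repeated for each table above. As a minimal sketch, they could be collected in one helper. The function name `assess` and the toy table are hypothetical and only illustrate the idea.

```python
import pandas as pd

def assess(df: pd.DataFrame) -> dict:
    """Return basic quality indicators for one DataFrame (hypothetical helper)."""
    missing = df.isna().sum()
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually contain missing values.
        "missing_by_column": missing[missing > 0].to_dict(),
    }

# Hypothetical toy table with one missing value and one duplicated row.
toy_df = pd.DataFrame({
    "gender": ["Female", None, "Male", "Male"],
    "age": [30, 45, 700, 700],
})

report = assess(toy_df)
print(report)
```

In the notebook, the helper could be called once per table, for example `assess(customers_df)`.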

### Cleaning Data

**CUSTOMERS_DF**

In [None]:
# Cleaning customers_df

# The data assessment found three problems in customers_df: duplicate rows, missing values, and
# an inaccurate value. This stage resolves all three.

# Eliminating duplicate data

# The first problem is duplicate data. Duplicate rows are removed with the drop_duplicates()
# method.

customers_df.drop_duplicates(inplace=True)

# Re-run the duplicate check to confirm that no duplicates remain.

print("Number of duplicates: ", customers_df.duplicated().sum())

# If deduplication succeeded, the output reports zero duplicates in customers_df.

# Dealing with missing values

# The next problem is the missing values in the gender column. In general, there are three ways
# to handle missing values: dropping, imputation, and interpolation. To decide which one to use,
# first inspect the rows that contain the missing values with the following filter.

customers_df[customers_df.gender.isna()]

# The filter displays only the rows that satisfy customers_df.gender.isna(), that is, the rows
# with a missing value in the gender column.

# These rows still contain a lot of useful information in their other columns, so dropping them
# would discard it. Therefore, in this case, the missing values are handled with imputation.


In [None]:
# Imputation replaces each missing value with a specific value. Because gender is a categorical
# column, the dominant (most frequent) value is used as the replacement. The value_counts()
# method identifies the dominant value.

customers_df.gender.value_counts()

In [None]:
# The output above shows that the most frequent value in the gender column is
# "Prefer not to say". This value replaces the missing values. fillna() is applied to the gender
# column only, so that missing values in any other column are not filled with this string.

customers_df["gender"] = customers_df["gender"].fillna("Prefer not to say")

# Re-run the missing-value check to confirm that the imputation worked.

customers_df.isna().sum()

# If the cleanup succeeded, the gender column reports zero missing values.
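
Instead of typing the dominant value by hand, it can be read from the column. The sketch below shows this on made-up data; the toy rows are assumptions chosen so that "Prefer not to say" is the most frequent value, as in the output described above.

```python
import pandas as pd

# Hypothetical toy rows with two missing gender values.
toy_customers = pd.DataFrame({
    "gender": ["Prefer not to say", "Female", None, "Prefer not to say", "Male", None],
    "age": [25, 31, 47, 52, 38, 29],
})

# mode() returns a Series because ties are possible; take the first entry.
dominant_gender = toy_customers["gender"].mode().iloc[0]

# Fill only the gender column with the dominant value.
toy_customers["gender"] = toy_customers["gender"].fillna(dominant_gender)

print(toy_customers["gender"].value_counts())
```

Reading the value from the data avoids typos in the replacement string, but it silently picks the first value when two categories are tied, so the `value_counts()` output is still worth a look.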

In [None]:
# Handling the inaccurate value

# Next is the inaccurate value in the age column. First, inspect the row that contains it (the
# row with the maximum age value) using the following filter.

customers_df[customers_df.age == customers_df.age.max()]

# The filter displays the rows that have the maximum age value.

In [None]:
# Based on this row, we assume the inaccurate value is a data-entry error in which an extra zero
# was typed. Therefore, it is replaced with 70 using the replace() method. The result is assigned
# back to the column, because calling replace(..., inplace=True) on customers_df.age may not
# update the DataFrame in recent pandas versions.

customers_df["age"] = customers_df["age"].replace(customers_df["age"].max(), 70)

# To confirm that the replacement worked, run the filter again.

customers_df[customers_df.age == customers_df.age.max()]

# Oops, it turns out that the age column still contains another invalid value.

In [None]:
# The cause is likely the same as before: a data-entry error that added an extra zero. This
# value is replaced with 50.

customers_df["age"] = customers_df["age"].replace(customers_df["age"].max(), 50)

# To confirm that customers_df no longer contains inaccurate values, review the summary
# statistics.

customers_df.describe()
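
The two `replace()` calls above fix one maximum at a time. As an alternative sketch, all implausible ages could be flagged and corrected in one pass. The bound `MAX_PLAUSIBLE_AGE = 120`, the toy ages, and the divide-by-ten correction are assumptions that only make sense if the errors really are extra zeros.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 120  # assumed bound, not taken from the dataset

# Hypothetical toy ages; 700 and 500 stand in for entries with an extra zero.
toy_customers = pd.DataFrame({"age": [34, 700, 58, 500, 21]})

# Boolean mask of rows whose age exceeds the assumed bound.
implausible = toy_customers["age"] > MAX_PLAUSIBLE_AGE
print(toy_customers[implausible])

# Under the extra-zero assumption, integer division by 10 recovers the intended age.
toy_customers.loc[implausible, "age"] = toy_customers.loc[implausible, "age"] // 10

print(toy_customers)
```

Printing the flagged rows before changing them keeps the correction reviewable, which matters because the extra-zero explanation is a guess about how the data was entered.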

**ORDERS_DF**

In [None]:
# Cleaning orders_df

# With customers_df cleaned, the next table is orders_df. The data assessment showed that the
# order_date and delivery_date columns have the wrong data type. To fix this, convert both
# columns to datetime with the to_datetime() function provided by the pandas library.

datetime_columns = ["order_date", "delivery_date"]

for column in datetime_columns:
    orders_df[column] = pd.to_datetime(orders_df[column])

# Both columns now have the datetime data type. To confirm, check the data types again with
# the info() method.

orders_df.info()
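
The sketch below shows the same conversion on made-up data, with two additions that are assumptions rather than part of the original steps: `errors="coerce"`, which turns unparseable strings into `NaT` instead of raising an error, and a derived `delivery_days` column (a hypothetical name) that shows why the datetime type is useful.

```python
import pandas as pd

# Hypothetical toy orders; the third order_date is deliberately invalid.
toy_orders = pd.DataFrame({
    "order_date": ["2021-08-30", "2021-02-03", "not a date"],
    "delivery_date": ["2021-09-24", "2021-02-13", "2021-03-01"],
})

for column in ["order_date", "delivery_date"]:
    # errors="coerce" converts values that cannot be parsed into NaT.
    toy_orders[column] = pd.to_datetime(toy_orders[column], errors="coerce")

# Datetime columns support arithmetic, for example delivery time in days.
toy_orders["delivery_days"] = (
    toy_orders["delivery_date"] - toy_orders["order_date"]
).dt.days

print(toy_orders.dtypes)
print(toy_orders)
```

Coercing hides bad values as `NaT`, so it is worth counting them with `isna().sum()` afterwards instead of assuming every row was parsed.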

**PRODUCT_DF**

In [None]:
# Cleaning product_df

# The next table to clean is product_df. The data assessment found 6 duplicate rows in
# product_df. They are removed with the drop_duplicates() method.

product_df.drop_duplicates(inplace=True)

# To confirm that all duplicates were removed, run the duplicate check again.

print("Number of duplicates: ", product_df.duplicated().sum())
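
Before dropping duplicates, it can help to look at them. The following sketch uses made-up rows and the assumed column names `product_name` and `price` to show how `duplicated(keep=False)` lists every member of a duplicate group.

```python
import pandas as pd

# Hypothetical toy products; the first and third rows are identical.
toy_products = pd.DataFrame({
    "product_name": ["Shirt", "Jacket", "Shirt", "Trousers"],
    "price": [100, 250, 100, 120],
})

# keep=False marks every member of a duplicate group, not only the repeats.
print(toy_products[toy_products.duplicated(keep=False)])

# Drop the repeats and renumber the index so it has no gaps.
deduplicated = toy_products.drop_duplicates().reset_index(drop=True)

print("Number of duplicates: ", deduplicated.duplicated().sum())
```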

**SALES_DF**

In [None]:
# Cleaning sales_df

# The last table to clean is sales_df. The data assessment found 19 missing values in the
# total_price column. To choose the most appropriate way to handle them, first inspect the rows
# that contain the missing values.

sales_df[sales_df.total_price.isna()]

# The filter displays all rows that have a missing value in the total_price column.

In [None]:
# The rows show that total_price is the product of price_per_unit and quantity. This pattern
# can be used to fill the missing values in the total_price column.

sales_df["total_price"] = sales_df["price_per_unit"] * sales_df["quantity"]

# Because the whole column is recomputed, this fills every missing value and also makes the
# existing values consistent with the pattern. To confirm, check the number of missing values
# in sales_df again.

sales_df.isna().sum()

**Insight:**
- **customers_df** = After cleaning, the maximum value in the age column is reasonable. The mean and standard deviation also change once the inaccurate values are corrected.
- **orders_df** = If every step ran as expected, info() shows the order_date and delivery_date columns with the datetime data type.
- **product_df** = If the duplicates were removed successfully, the duplicate check reports 0 duplicates.
- **sales_df** = If the previous steps ran as expected, isna().sum() reports no missing values in the total_price column.
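
The sales_df cleaning step recomputes the whole total_price column. A more conservative variant, sketched here on made-up rows, first checks that the pattern holds where total_price is known and then fills only the missing entries. The toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy sales with two missing total_price values.
toy_sales = pd.DataFrame({
    "price_per_unit": [100, 50, 20, 75],
    "quantity": [2, 3, 5, 1],
    "total_price": [200.0, None, 100.0, None],
})

expected_total = toy_sales["price_per_unit"] * toy_sales["quantity"]

# Check the assumed pattern on the rows where total_price is present.
known = toy_sales["total_price"].notna()
pattern_holds = (toy_sales.loc[known, "total_price"] == expected_total[known]).all()
print("Pattern holds on known rows:", pattern_holds)

# Fill only the missing entries, leaving recorded values untouched.
toy_sales["total_price"] = toy_sales["total_price"].fillna(expected_total)

print(toy_sales)
```

Filling only the gaps preserves recorded totals that may legitimately differ from the product, for example because of discounts; recomputing the column is simpler when the pattern is known to hold everywhere.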

## Exploratory Data Analysis (EDA)

### Explore ...

**Insight:**
- xxx
- xxx

## Visualization & Explanatory Analysis

### Question 1:

### Question 2:

**Insight:**
- xxx
- xxx

## Advanced Analysis (Optional)

## Conclusion

- Conclusion for question 1
- Conclusion for question 2