# Overview
![](./images/4/Overview.png)

# Relationship

![](./images/4/Overview_Relationship.png)

## Aggregation Analysis
- Calculate a value across a group or dimension
- Aggregation is often done in reporting to be able to "slice and dice" information to make decisions and view performance
- For example, aggregate sales data
  - for a salesperson by month
  - by month per sales territory
- **Applications:**
  - In Dashboard Design: Data Exploration. For example, Channel vs. Region

  ![](./images/4/Dashboard_Design.png)

  - In ad-hoc Analytics: choose which category to focus
  
  ![](./images/4/Adhoc_Analytics.png)
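
Outside of Power BI, the same "slice and dice" aggregation can be sketched with pandas `groupby`. This is only an illustration: the `sales` table, its column names and its values are made up and are not part of the course data.

```python
import pandas as pd

# Hypothetical sales rows (names and amounts are made up for illustration).
sales = pd.DataFrame({
    "salesperson": ["Ann", "Ann", "Bob", "Bob", "Ann"],
    "territory":   ["North", "North", "South", "South", "North"],
    "month":       ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "amount":      [100, 150, 200, 50, 300],
})

# Aggregate sales for a salesperson by month.
by_person_month = sales.groupby(["salesperson", "month"])["amount"].sum()

# Aggregate sales by month per sales territory.
by_territory_month = sales.groupby(["territory", "month"])["amount"].sum()

print(by_person_month)
print(by_territory_month)
```

Changing the list of grouping columns is the code equivalent of choosing a different slice in a dashboard.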

## Categorical vs. Categorical

![](./images/4/Cat_Cat_Overview.png)

### Calculation
We usually use the **_count_** and **_count%_** of the **_keys_** of the most granular data (or the **_number of rows_** if there is no key)
- Note: don't use a measure for the count/sum in Cat vs. Cat. For example, for **_Channel_** vs. **_Geography_**, don't use **_Sales Amount_**

![](./images/4/Cat_Cat_Calculation_Incorrect_Example.png)

### Displaying

![](./images/4/Combined_Cat_Cat_Example.png)

#### Two-way Table and Stacked Column Chart
We can use the following methods:
- **Two-way table:** of **_count_** and **_count%_**
  - Rows: represent the categories of one variable
  - Columns: represent the categories of the other variable
- **Stacked Column Chart:** a visual form of the **_Two-way table_**

**Example:**
![](./images/4/Twoway_Table_Stacked_Column_Chart_Example.png)

##### Counts vs. Percentages
There is no single correct way to display the data in crosstabs. Ultimately, the data is always counts, but they can be shown as:
1. Raw Counts (Absolute)
2. Percentage of **_Overall_** Totals
3. Percentages of **_Column_** Totals
4. Percentage of **_Row_** Totals

![](./images/4/Count_Percentages_Exmaple.png)

Nevertheless, showing the counts as percentages of **_row_** totals or percentages of **_column_** totals makes any relationships stand out **_more clearly_**.

**Example:**

![](./images/4/Combined_Cat_Cat_Example.png)
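
The four views listed above (raw counts and the three kinds of percentages) can be reproduced with pandas `crosstab` and its `normalize` parameter. A minimal sketch, assuming a made-up `orders` table with one row per order:

```python
import pandas as pd

# Hypothetical order rows, one row per order (the most granular level).
orders = pd.DataFrame({
    "channel": ["Online", "Online", "Store", "Store", "Store", "Online", "Store", "Store"],
    "region":  ["Asia", "Europe", "Asia", "Asia", "Europe", "Asia", "Europe", "Asia"],
})

rows, cols = orders["channel"], orders["region"]

counts = pd.crosstab(rows, cols)                           # 1. raw counts
pct_overall = pd.crosstab(rows, cols, normalize="all")     # 2. share of the overall total
pct_column = pd.crosstab(rows, cols, normalize="columns")  # 3. share of each column total
pct_row = pd.crosstab(rows, cols, normalize="index")       # 4. share of each row total

print(counts)
print(pct_row.round(2))
```

All four tables come from the same counts; only the denominator changes.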

##### Example: why we need to have multiple views of count%
**Question:** should we target **_single_** or **_married_** customers who are willing to use **_home delivery_**?

**Scenario 1:** Look at **_count_** and **_count% overall_**

--> We might decide to focus on **_single_** customers (15.9%) rather than **_married_** customers (12.7%), but this seems **_incorrect_**

![](./images/4/Count_Count_Example_1.png)

**Scenario 2:** Look at **_count%_** in **_row totals_** and **_column totals_**

--> A clearer view of how marital status contributes. In this case we choose **_married_** (35%) instead of **_single_** (25%)

![](./images/4/Count_Count_Example_2.png)

#### Tornado Chart
When a category has only **_2 values_**, we can use a **Tornado chart** to compare them

**Example:**

![](./images/4/Tornado_Chart_Example.png)

#### Mekko Chart
When we want to display both **_absolute_** and **_relative_** numbers, we can use a **Mekko Chart**

**Example:**

![](./images/4/Mekko_Chart_Example.png)

### Too Many Cat-Cat Pairs in Data source
When there are many combinations of categorical variables, we can apply:
1. Main DIM category
2. Using Hierarchy
3. Using Business Domain knowledge to choose important variables
4. Combine charts
5. Dimensionality reduction techniques

![](./images/4/Too_Many_Cat_Cat_Variables_Example.png)

### Chi-square test
**Question:** can the **_relationship observed in the sample data_** be inferred to hold in the **_population_** represented by the data?

**Chi-square test:** test association
- H0: The 2 variables are not related in the population
- H1: The 2 variables are related in the population

![](./images/4/Chi_square_Test_Example.png)
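
As a rough sketch of what the test computes, the chi-square statistic compares the observed counts of a two-way table with the counts expected if H0 were true. The 2x2 table below is hypothetical, and the comparison uses the standard 5% critical value for 1 degree of freedom (3.841) instead of a p-value; no continuity correction is applied.

```python
# Hypothetical 2x2 table of observed counts
# (rows: two categories of variable A, columns: two categories of variable B).
observed = [[30, 70],
            [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_5PCT_DOF1 = 3.841  # chi-square critical value at alpha = 0.05, 1 degree of freedom
reject_h0 = chi2 > CRITICAL_5PCT_DOF1

print(round(chi2, 3), dof, reject_h0)
```

If `scipy` is available, `scipy.stats.chi2_contingency` performs the same test and also returns a p-value.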

## Categorical vs. Numerical
Categorical & Numerical: Compare the **numerical** variables across **_each of the levels_** of the **_categorical_** variable

![](./images/4/Cat_Num_Overview.png)

**Sample Data:**

![](./images/4/Sample_Data.png)

### Calculation
With numerical variables we do the same 3 calculations (similar to Descriptive Analysis for numerical variable):
1. Central Tendency
2. Dispersion
3. Shape
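
A minimal pandas sketch of these three calculations per category level, assuming a made-up table with a `channel` category and a `sales` measure (mean and median for central tendency, standard deviation for dispersion, skewness for shape):

```python
import pandas as pd

# Hypothetical data: one categorical column and one numerical column.
data = pd.DataFrame({
    "channel": ["Online"] * 4 + ["Store"] * 4,
    "sales":   [10, 12, 14, 16, 5, 5, 6, 20],
})

# One row per category level, one column per summary measure.
summary = data.groupby("channel")["sales"].agg(["mean", "median", "std", "skew"])

print(summary.round(2))
```

Here the `Store` group has a mean well above its median and a positive skew, which hints at a high outlier worth inspecting in a box plot.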

### Displaying

![](./images/4/Cat_Num_Example.png)

#### Metrics
To show the **Summary Measure**

![](./images/4/Cat_Num_Summary_Measure_Example.png)

#### Cluster Bar Chart
To show the aggregation of a measure, either static or as change over time

![](./images/4/Cat_Num_Static_ChangeOverTime_Example.png)

#### Pareto Chart
To show the 80/20 rule (a small share of the categories accounts for most of the total)

![](./images/4/Cat_Num_Pareto_Example.png)
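
The line of a Pareto chart is the cumulative share of the total after sorting categories from largest to smallest. A small sketch with made-up brand totals:

```python
import pandas as pd

# Hypothetical total sales per brand.
sales_by_brand = pd.Series({"A": 500, "B": 250, "C": 120, "D": 80, "E": 50})

# Sort descending, then accumulate the share of the grand total.
sorted_sales = sales_by_brand.sort_values(ascending=False)
cumulative_share = sorted_sales.cumsum() / sorted_sales.sum()

# Number of categories needed to reach at least 80% of the total.
n_needed = int((cumulative_share < 0.8).sum()) + 1
top_categories = list(sorted_sales.index[:n_needed])

print(cumulative_share)
print(top_categories)
```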

#### Magnitude Cuts (using Average)
We apply a **_Top N filter_** to show the Top/Bottom 5 or 10 categories
- Note: we use **AVERAGE** instead of SUM or COUNT

![](./images/4/Cat_Num_Magnitude_Cuts_Example.png)
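
One plausible reason for preferring AVERAGE (an interpretation, not stated in the notes) is that SUM favors categories that simply have more rows. The sketch below uses made-up orders to show how the two rankings can differ:

```python
import pandas as pd

# Hypothetical orders: brand A has many small orders, brand B a single large one.
orders = pd.DataFrame({
    "brand":  ["A", "A", "A", "A", "B", "C", "C"],
    "amount": [10, 10, 10, 10, 30, 20, 24],
})

avg_by_brand = orders.groupby("brand")["amount"].mean()
sum_by_brand = orders.groupby("brand")["amount"].sum()

top2_by_avg = avg_by_brand.nlargest(2)  # ranking by typical order size
top2_by_sum = sum_by_brand.nlargest(2)  # ranking influenced by the number of rows

print(top2_by_avg)
print(top2_by_sum)
```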

#### Side-by-side Box Plots
There are 2 visuals that can be used

##### Box and Whisker by MAQ Software
**Step 1:** add data

![](./images/4/Box_Whisker_MAQ_Software_Example_1.png)

**Step 2:** add outliers limit by IQR

![](./images/4/Box_Whisker_MAQ_Software_Example_2.png)

##### Box and Whisker chart
**Step 1:** add data

![](./images/4/Box_Whisker_Chart_Example_1.png)

**Step 2:** add outliers limit by IQR

![](./images/4/Box_Whisker_Chart_Example_2.png)
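
The IQR-based outlier limits used in Step 2 follow the usual rule of 1.5 x IQR beyond the quartiles. A minimal sketch with made-up values; note that quartile conventions differ between tools, so the visuals may report slightly different limits than the `inclusive` method assumed here:

```python
import statistics

# Hypothetical measurements for one category.
values = [10, 12, 13, 14, 15, 16, 18, 40]

# Quartiles (the "inclusive" method treats the data as the whole population).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(q1, q3, lower_limit, upper_limit, outliers)
```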

### T-Test
...

## Numerical vs. Numerical

![](./images/4/Among_Numerical_Overview.png)

### Discrete and Continuous Variables
If 1 numerical variable is **_discrete_** and the other is **_continuous_**, we can treat them as **Categorical vs. Numerical**
  - If the discrete variable has **_too many distinct values_**, then we can choose:
    - **_group_** them, e.g. age --> age group
    - **_top/bottom_**, e.g. top 10, bottom 5

![](./images/4/Discrete_Continuous_Example.png)
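
Grouping a discrete variable such as age into bands can be sketched with pandas `cut`. The ages, bin edges and labels below are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([19, 23, 31, 38, 45, 52, 67])

# Each bin includes its right edge: (0, 24], (24, 44], (44, 64], (64, 120].
age_group = pd.cut(
    ages,
    bins=[0, 24, 44, 64, 120],
    labels=["<25", "25-44", "45-64", "65+"],
)

print(age_group.value_counts().sort_index())
```

The resulting `age_group` column can then be used as the categorical side of a Categorical vs. Numerical analysis.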

### Continuous vs. Continuous

#### Calculation

##### Correlation (correlation coefficient)

1. **_Positive_** Correlations

2. **_Negative_** Correlations

3. **_Other_** Correlations

![](./images/4/Correlation_Score.png)
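
For reference, the Pearson correlation coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below uses made-up series; the U-shaped case is one example of a relationship that the linear coefficient does not capture, which is an interpretation of "other" correlations rather than something stated in the notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
positive = pearson(x, [2, 4, 6, 8, 10])  # y rises with x
negative = pearson(x, [10, 8, 6, 4, 2])  # y falls as x rises
u_shape = pearson(x, [4, 1, 0, 1, 4])    # clear pattern, but not a linear one

print(positive, negative, u_shape)
```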


**Causation:** after we find a correlation between 2 variables, we need business domain knowledge to clarify the causation and determine which actions to take

![](./images/4/Act_With_Correlation.png)

#### Displaying

##### Correlation Plot


Unfortunately, Microsoft has disabled this chart on AppSource, and it requires R to be installed. Hence, we do it in Python instead.

**Step 1:** install Python

**Step 2:** install libraries
```
pip install pandas
pip install matplotlib
pip install seaborn
```

**Step 3:** use Python visual

![](./images/4/Correlation_Plot_Example.png)

**Script:**
```
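import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# `dataset` is the DataFrame that Power BI passes to the Python visual
dataset = pd.get_dummies(dataset)  # one-hot encode categorical columns
correlation = dataset.corr()
sns.heatmap(correlation, annot=True, cmap='BuPu')

plt.show()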
```

**Note:** in Values, don't aggregate the fields --> choose **_Don't summarize_**. Otherwise, each measure collapses to a single data point and there is nothing to correlate

![](./images/4/Dont_Summarize.png)

##### Scatter Plot
We can either use:
1. Power BI Scatter Plot to display 1 numerical variable vs. 1 numerical variable
![](./images/4/Scatter_Plot_Example_1.png)

2. Python to display multiple numerical variables
![](./images/4/Scatter_Plot_Example_2.png)

**Script:**
```
import seaborn as sns
import matplotlib.pyplot as plt

# `dataset` is the DataFrame that Power BI passes to the Python visual
sns.pairplot(data=dataset)

plt.show()
```

# Multi-Dimensional Visualization
In this part we use other visuals to show multivariate relationships
  - **Note:** the **_Correlation Plot_** and **_Scatter Plot_** can also display multivariate relationships. They are shown in the previous part.

## 1 Categorical vs. Multiple Numerical

### Bubble Chart = Advanced Scatter Plot
- **Values:** a point is plotted for **_each value_** in this field. Other measures are grouped by this field
- **X Axis:** values to place on horizontal axis
- **Y Axis:** values to place on vertical axis
- **Legend:** the **_categorical_** field to show for color
- **Size:** the measure for relative **_bubble sizing_**
- **Play Axis:** usually a year/month field, to show progress over time

![](./images/4/Bubble_Chart_Example.png)

## Multiple Categorical vs. 1 Numerical

### Tassels Chart

**Note:** we can de-select categories by clicking their values to get a clearer, more focused view

![](./images/4/Tassels_Example.png)

### Sankey Chart
**Option 1:** use the Sankey chart from AppSource. **Note:** we only get 1 layer

![](./images/4/Sankey_Example_1.png)

**Option 2:** Python. We can add multiple layers
- Each layer has source, target and amount (aggregated). For example: 
  - (1) ChannelName + ContinentName
  - (2) ContinentName + BrandName
- Do the same for multiple layers

![](./images/4/Sankey_Example_2.png)

**Script:**
```
from sankeyflow import Sankey
import matplotlib.pyplot as plt

# `dataset` is the DataFrame that Power BI passes to the Python visual.
# Build the list of flows as (source, target, amount) tuples.
flows = []
for channel, continent, brand, amount in zip(
    dataset['ChannelName'], dataset['ContinentName'],
    dataset['BrandName'], dataset['SalesAmount']
):
    flows.append((channel, continent, amount))  # layer 1: channel -> continent
    flows.append((continent, brand, amount))    # layer 2: continent -> brand

# create Sankey chart
plt.figure(figsize=(10, 5), dpi=144)
s = Sankey(flows=flows)
s.draw()
plt.show()
```

### Decomposition Tree

![](./images/4/Decomposition_Tree_Example.png)

**Cons:**
- It cannot show multiple values of 1 layer at a time.
- It cannot show relationship between values in 1 layer to other layers

**Suggestion:**
- This chart should be used for percentage: Margin% increasing, YoY%, MoM%

### Key Influencers Chart
**Key Influencers Chart:** is used to **_quickly find_** which variables **most influence** the outcome (reasons, trends, etc.) that we are analyzing

#### Example: Attrition

##### Step 1: Analyze Each Segment

![](./images/4/Influence_Chart_Example_1.png)

**Notes:**
- (1) Add the field we want to analyze for reasons, trends, etc.
- (2) Add the other fields whose influence on the field in (1) we want to know
- (3) Select the value of (1) that we want to analyze. Select **_Yes_** for example
- (4) PBI runs ML algorithms to find which fields most influence (1). Select **_Age_** for example
    ![](./images/4/Key_Influence_ML.png)

- (5) The conclusion of PBI for **_age bins_**. Again, PBI uses ML
- (6) "Average (excluding selected): 15.05%"
    ![](./images/4/Influence_Chart_Explain_AVG.png)
    - (1) In total, we have 16.12% of Yes for Attrition
    - (3) In age bins, **_53.66%_** of the "Age is 21 or less" group has Attrition = Yes --> more than half of the people in this group left
    - (4) We exclude "Age is 21 or less" group
    - (5) Then we get **_15.05%_**
    - (2) Then we get the chart with **_53.66% / 15.05% = 3.57_**

- (7) The attrition rate of the "Age is 21 or less" group (53.66%) is **_3.57x_** the average of the other groups (excluding the "Age is 21 or less" group)
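
The arithmetic behind the 3.57x figure can be checked with a few lines. The row counts below are hypothetical, chosen only so that they reproduce the percentages quoted in the notes (16.12%, 53.66%, 15.05%); they are not taken from the screenshots.

```python
# Hypothetical counts, chosen to reproduce the percentages quoted above.
total_rows, total_yes = 1470, 237   # whole dataset, Attrition = Yes
segment_rows, segment_yes = 41, 22  # "Age is 21 or less" segment

overall_rate = total_yes / total_rows                                # ~16.12%
segment_rate = segment_yes / segment_rows                            # ~53.66%
rest_rate = (total_yes - segment_yes) / (total_rows - segment_rows)  # ~15.05%

# How many times more likely attrition is inside the segment than outside it.
lift = segment_rate / rest_rate

print(f"{overall_rate:.2%} {segment_rate:.2%} {rest_rate:.2%} {lift:.2f}x")
```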

Sizing = Count
Segment

##### Step 2: Take Closer Look at Sizing of Each Segment

![](./images/4/Influence_Chart_Sizing.png)

**Notes:**
- (1) Enable **_Counts_** option in Analysis tab
- (2) Then we will see:
  - A new **_Sort by_** option is enabled --> select **_Count_**
  - Each circle has a portion of its border highlighted, indicating the size of the data for that segment
  - Note that the "Age is 21 or less" segment now ranks lowest

##### Step 3: Choose Segment
We need to choose segments that have:
- High **_impact_** and 
- High **_sizing_** in the dataset

For example,

![](./images/4/Influence_Chart_Choose_Segment.png)

##### Step 4: Group Segments in "Top segments"

![](./images/4/Influence_Chart_Top_Segment.png)

- (1) In Top segments, PBI groups the segments into the top 5 groups and shows the impact and sizing of each group
  - Higher location = higher impact
  - Bigger size = higher count (size)

- (2) Select 1 segment, Segment 1 for example
  - Segment 1 contains: 
    - JobLevel = 1
    - OverTime = Yes
    - StockOptionLevel <= 0
  - Segment 1 has:
    - Attrition = Yes: **_66.7%_**
    - Population: **_75_**

![](./images/4/Influence_Chart_Select_Segment.png)

- (3) Click **_Learn more about this segment_**
  - **Gray chart:** the current Attrition split of No and Yes for Segment 1
  - **Blue chart:** the Attrition split of No and Yes for other fields
  - **Slicer:** for a value or groups of values of a field
  - This feature tells us whether other values of other fields behave similarly to Segment 1
![](./images/4/Influence_Chart_Learn_More_Segment_1.png)

  - **Example:** Department --> same behaviour No < Yes
    ![](./images/4/Influence_Chart_Department.png)


  - **Example:** Job Satisfaction --> different behaviour No ~= Yes
    ![](./images/4/Influence_Chart_Job_Satisfaction.png)

**Notes:** there are several notes...

# Other Methods

## Auto-Correlation
**The autocorrelation function plot (correlogram):** shows how a series **_correlates with itself_** when shifted by x time units (the lag).
- Computing the correlation for each lag and plotting the results gives the autocorrelation function (ACF) plot.
- It shows **_how correlated_** a time series is with its past values

**Seasonality (pattern):** is a characteristic of a time series in which the data **_experiences regular_** and **_predictable changes_** that recur every fixed period of time.

![](./images/4/Auto_Correlation_Example.png)

**Notes:**
- From (1) to (2): 1st cycle
- From (2) to (3): 2nd cycle
- **Lag:** how long it takes for the pattern to occur again
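
A minimal sketch of how the values in an ACF plot can be computed, assuming a made-up series that repeats every 4 steps. The lag with the highest autocorrelation then recovers the length of the cycle:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

# Hypothetical seasonal series: the pattern 1, 3, 5, 3 repeats every 4 steps.
series = [1, 3, 5, 3] * 6

# One value per lag; plotting these as bars gives the correlogram.
acf = {lag: autocorrelation(series, lag) for lag in range(1, 9)}
best_lag = max(acf, key=acf.get)

print({lag: round(value, 2) for lag, value in acf.items()})
print(best_lag)
```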

## Segmentation vs. Clustering
Both group data points which have **_similar attributes_**
- **Segmentation:** data **_analysis_** technique for creating groups from a dataset
- **Clustering:** data **_science_** technique for **_more advanced_** creation of groups

**Example:**

![](./images/4/Segmentation_Custering_Example.png)

**Notes:**
- Segmentation: income >= 65,000
- Clustering: using K-means
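
To make the contrast concrete, the sketch below applies both approaches to a made-up list of incomes: a fixed rule (income >= 65,000) for segmentation, and a deliberately tiny one-dimensional k-means with k = 2 for clustering. The k-means loop is a simplified illustration, not a production implementation; the starting centroids and the number of iterations are arbitrary assumptions.

```python
# Hypothetical incomes.
incomes = [30_000, 42_000, 48_000, 55_000, 90_000, 98_000, 105_000]

# Segmentation: the analyst chooses the rule.
segments = ["high" if x >= 65_000 else "low" for x in incomes]

# Clustering: the algorithm finds the groups (minimal 1-D k-means, k = 2).
centroids = [min(incomes), max(incomes)]  # simple deterministic starting points
for _ in range(10):
    # Assign each point to its nearest centroid.
    labels = [min(range(2), key=lambda c: abs(x - centroids[c])) for x in incomes]
    # Move each centroid to the mean of its assigned points.
    centroids = [
        sum(x for x, label in zip(incomes, labels) if label == c) / labels.count(c)
        for c in range(2)
    ]

print(segments)
print(labels, centroids)
```

With this data both approaches produce the same split, but the clustering boundary is derived from the data instead of being fixed in advance.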

### Segmentation
Sometimes there are many very small segments; we can combine them into 1 big **_Others_** segment
- We will focus on big segments first
- Then, if needed (because small segments usually have small impact), we come back to analyze those small segments

![](./images/4/Small_Segments_Others.png)
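
A small sketch of folding minor segments into **_Others_**, assuming made-up segment counts and an arbitrary 5% share threshold:

```python
from collections import Counter

# Hypothetical number of rows per segment.
segment_counts = Counter({"A": 500, "B": 300, "C": 20, "D": 15, "E": 5})

THRESHOLD = 0.05  # hypothetical cut-off: segments below 5% of the total are merged
total = sum(segment_counts.values())

grouped = Counter()
for name, count in segment_counts.items():
    key = name if count / total >= THRESHOLD else "Others"
    grouped[key] += count

print(dict(grouped))
```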

### Clustering
In Power BI, we can use **_Scatter Plot_** and **_Table_** for **_Clustering_**

**Example of Scatter Plot:**

![](./images/4/Clustering_Scatter_Plot_1.png)
- (1) Select **_Automatically find clusters_**

![](./images/4/Clustering_Scatter_Plot_2.png)
- (2) **Result:** new field will be generated and put in Legend

![](./images/4/Clustering_Box_Plot.png)
- (3) Then we can add the new field in Box Plot to see the distribution for further analysis

## Sentiment Analysis

![](./images/4/Sentiment_Analysis_Example_1.png)

## Cohort Analysis
**Cohort Analysis:** is a popular way for companies to gain a more in-depth insight into their **_customers' behavior_**. 
- It gives invaluable insight into customer behavior that we can leverage to set up successful growth strategies and improve the decision-making process.

With Cohort Analysis, we can answer questions like:
- Do customers acquired in **_one period_** behave differently than those in **_another period_**?
- Do customers who bought at **_promotions_** behave differently than those paying at **_full price_**?
- Do **_large companies_** use our services longer than **_small companies_**?

![](./images/4/Cohort_Analysis.png)

When we perform a Cohort Analysis, we **_don't look at individual users_** or the user base as a whole but instead **_split those into groups (cohorts)_**. 
- This is done based on **_similarity_** in properties.

**Example:**

![](./images/4/Cohort_Analysis_Example.png)

**Notes:**
- (1) Month name
- (2) Amount of new customers that we acquired
- (3) 1 -> 11: the duration from the first month to the 11th month that customers stay within our platform
- We are selling milk powder. Every box will be consumed in **_4 months_**.
  - We notice that for **_Jan_**, at the **_4th month_**, the % is **_ok with 5%_**
  - However, for **_other months_**, the percentages are **_very low_** at the **_4th month_** --> investigate further
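
A retention table like the one in the example can be built from a purchase log. The sketch below is an assumption about how such a table is derived (cohort = month of first purchase, period = months since then); the `purchases` data is made up and periods start at 0 here.

```python
import pandas as pd

# Hypothetical purchase log: one row per customer per active month.
purchases = pd.DataFrame({
    "customer": ["c1", "c1", "c1", "c2", "c2", "c3", "c3", "c4"],
    "month":    [1, 2, 5, 1, 5, 2, 3, 2],  # month number within the year
})

# Cohort = month of each customer's first purchase.
purchases["cohort"] = purchases.groupby("customer")["month"].transform("min")
# Period = months elapsed since that first purchase.
purchases["period"] = purchases["month"] - purchases["cohort"]

# Distinct active customers per cohort and period.
active = (
    purchases.groupby(["cohort", "period"])["customer"]
    .nunique()
    .unstack(fill_value=0)
)

# Divide by the cohort size (period 0) to get retention percentages.
retention = active.div(active[0], axis=0)

print(retention)
```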

## What-If Scenario
We use it to **_simulate_** the result for each scenario. There are 4 main functions:

Simulation:
- (1) Data Table (1 Variable): 1 variable changes
- (2) Data Table (2 Variables): 2 variables change
- (3) Scenario Manager: many variables change

Optimization:
- (4) Goal Seek: Inverse problem

### Data Table (1 Variable)

![](./images/4/Data_Table_1_Variable.png)

**Notes:**
- (1) Income increases, Spending stays the same --> Saving increases
- (2) Income stays the same, Spending increases --> Saving decreases

### Data Table (2 Variables)

![](./images/4/Data_Table_2_Variable.png)

**Notes:**
- Both Income and Spending vary
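
Both kinds of data table amount to re-evaluating one formula over a grid of inputs. A minimal sketch, assuming the model is simply Saving = Income - Spending (the formula and all numbers are assumptions for illustration):

```python
def saving(income, spending):
    """Assumed model: whatever is not spent is saved."""
    return income - spending

incomes = [1000, 1200, 1400]
spendings = [600, 800]

# 1-variable data table: Income varies, Spending is fixed at 800.
one_variable = {inc: saving(inc, 800) for inc in incomes}

# 2-variable data table: every combination of Income and Spending.
two_variable = {(inc, sp): saving(inc, sp) for inc in incomes for sp in spendings}

print(one_variable)
print(two_variable)
```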

### Scenario Manager

![](./images/4/Scenario_manager.png)

**Notes:**
- For each **_Venue_** that we might hire, the **_prices_** of the **_other elements_** differ, which causes the **_differences in profit or loss_**

### Goal Seek

![](./images/4/Goal_Seek.png)

**Notes:**
- We already have all of the variables and the results in hand
- We input a new **(1)** **_quantity of product_** to seek the **(2)** **_optimal revenue_**
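
Goal Seek solves the inverse problem: given a target result, find the input that produces it. The sketch below imitates that idea with a simple bisection search. The `revenue` model, the unit price and the target are hypothetical, and bisection is a stand-in technique, not a description of how the spreadsheet feature is implemented.

```python
def revenue(quantity, unit_price=25.0):
    """Hypothetical model: revenue grows linearly with quantity."""
    return quantity * unit_price

def goal_seek(func, target, low, high, tol=1e-6):
    """Bisection: find x in [low, high] with func(x) close to target.

    Assumes func is increasing on the interval and the target is reachable.
    """
    while high - low > tol:
        mid = (low + high) / 2
        if func(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# Which quantity yields a revenue of 10,000?
quantity_needed = goal_seek(revenue, target=10_000, low=0, high=1_000)

print(round(quantity_needed, 2))
```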