# What Is A Root Cause?
A root cause is the underlying or core cause of a problem.

# What Is Root Cause Analysis (RCA)?
- A root cause analysis is a systematic process for identifying the underlying causes of a problem or an issue.
- It is defined as a collective term that describes a wide range of approaches, tools and techniques used to uncover the causes of a problem.
- It involves breaking down the problem into smaller parts, examining each part in detail and identifying the root causes that are contributing to the problem.

The easiest way to understand the root cause of a problem is to think about common problems.

Examples:
- If a person falls ill, they go to a doctor and describe their symptoms, and the doctor in turn finds the root cause of the illness.
- If a car stops working, a mechanic finds the root cause of the problem.
- The same applies to a business: if it is underperforming, a person or a group of people is tasked with finding the root cause of the underperformance.

# Goal Of RCA
- The first goal of RCA is to discover the root cause of a problem or an event.
- The second goal is to fully understand how to fix, compensate or learn from any underlying issues within the root cause.
- The third goal is to apply whatever has been learnt from the analysis to systematically prevent future issues or to repeat successes.

Within an organization, problem solving, incident investigation and root cause analysis are all fundamentally connected by 3 basic questions:
1. What is the problem?
    - Define the issue or the problem by its impact on overall goals.
2. Why did it occur?
    - List the potential factors that could cause the problem.
    - Investigate each one of them and determine the root cause of the problem.
3. What should be done to prevent it from happening again?
    - Prevent any negative impact by selecting the best solution.

# Life Cycle Of RCA
1. Clarify:
    - Ask questions to get enough clarity on the problem.
    - Clarify terms and set up parameters for discussion.
    - Create an outline of the approach that will be followed.
2. Rule out:
    - Explore the possibilities and list some high level causes.
    - Discard issues that seem to be out of scope.
        - Check the underlying data to make sure it is accurate.
        - Make sure there are no technical issues, glitches, bugs or outliers.
    - List the observations and start diagnosing the root cause.
3. Internal factors:
    - Consider any recent features or products that were launched.
    - Look for any recent changes made by other teams.
    - Slice and dice the data into segments based on user demographics.
4. External factors:
    - Look for any bad PR or controversy related to the company.
    - Look for any changes in user behavior or customer trends.
    - Look for macroeconomic or geographical changes.
    - Conduct a competitor analysis.

Based on the analysis, a point of issue or error will eventually be found, which can then be investigated further based on its type. A conclusion can later be reached on how to address the problem.

Note that all findings should be captured and reported, irrespective of their size.

# Product Diagnostics (Analyzing A Metric Change)
### Example case
How should the investigation proceed if the percentage of users who clicked on a search result about a Facebook event increased 15% week-over-week?

General framework (CRIED)

### Clarify
Ask clarifying questions and share your thinking as you go. Below is an example of how to drive the discussion with an interviewer.

What does a search result for an event mean?
- Does it refer to when a user searches for something in the search bar on Facebook and the results produce a Facebook event?
- These search results could belong to different categories like a Facebook event, page or group and success is defined when a user clicks on an event.

The definition of the metric in the question can also be clarified:
- Success rate = $\frac{\text{Number of users who clicked on the event result after searching}}{\text{Number of users who searched for any keyword}}$, and the 15% increase refers to the change in this rate.
- 15% week-over-week: is it a 15% increase in the success rate compared to last week, or has there been a 15% increase accumulated over the past few weeks?
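As a rough illustration of why the metric definition is worth pinning down, the sketch below computes the success rate for two weeks and reports the change in two ways: as a relative change and as a percentage-point change. The function names and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any real dataset.

```python
def success_rate(users_clicked_event: int, users_searched: int) -> float:
    """Share of searching users who clicked an event result."""
    if users_searched <= 0:
        raise ValueError("users_searched must be positive")
    return users_clicked_event / users_searched


def week_over_week_change(previous: float, current: float) -> dict:
    """Return two readings of the change between two weekly rates."""
    if previous <= 0:
        raise ValueError("previous rate must be positive")
    return {
        # Change relative to last week's rate (0.15 means +15%).
        "relative_change": (current - previous) / previous,
        # Absolute difference between the two rates, in percentage points.
        "percentage_point_change": (current - previous) * 100,
    }


# Hypothetical counts, for illustration only.
last_week = success_rate(users_clicked_event=2_000, users_searched=10_000)
this_week = success_rate(users_clicked_event=2_300, users_searched=10_000)
change = week_over_week_change(last_week, this_week)
```

With these made-up numbers the rate moves from 20% to 23%, which is roughly a 15% relative increase but only about 3 percentage points. Confirming which reading is meant is part of the clarification step.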

### Rule out
Rule out any change in the metric caused by technical issues, infrastructure glitches, bugs or outliers.
- Has there been a bug in the logging code that caused event clicks to be duplicated?
- Is there a 3rd party software tracking the search result clicks? If so, is there any glitch in that software?
- Have there been any failures in the data pipeline?
- Did the metrics for the week get affected by one day's data alone or has it been a consistent increase? (Outlier analysis).
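Two of the checks above can be sketched in a few lines: dropping duplicated click events, and testing whether a single day is driving the weekly number. This is a minimal sketch under stated assumptions: the event fields (`user_id`, `result_id`, `timestamp`) are hypothetical, the daily rates are made up, and the median absolute deviation with a threshold of 3 is just one common heuristic for flagging outliers, not a prescribed method.

```python
from statistics import median


def deduplicate_clicks(events):
    """Drop repeated (user_id, result_id, timestamp) rows, keeping the first."""
    seen = set()
    unique = []
    for event in events:
        key = (event["user_id"], event["result_id"], event["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def outlier_days(daily_rates, threshold=3.0):
    """Flag days whose rate is far from the weekly median.

    Distance is measured in units of the median absolute deviation (MAD).
    The threshold is an assumption and should be tuned to the data.
    """
    values = list(daily_rates.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # No spread to measure against, so nothing is flagged.
    return [day for day, v in daily_rates.items()
            if abs(v - center) / mad > threshold]


# Hypothetical data, for illustration only.
events = [
    {"user_id": 1, "result_id": "event_a", "timestamp": 100},
    {"user_id": 1, "result_id": "event_a", "timestamp": 100},  # logged twice
    {"user_id": 2, "result_id": "event_a", "timestamp": 105},
]
unique_events = deduplicate_clicks(events)

daily_rates = {"Mon": 0.20, "Tue": 0.21, "Wed": 0.20, "Thu": 0.19,
               "Fri": 0.20, "Sat": 0.45, "Sun": 0.21}
flagged = outlier_days(daily_rates)
```

If one day is flagged, it is worth recomputing the weekly metric without that day before looking for product-level explanations.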

### Internal data
Explore the internal factors that could have affected the metric.

Acronym: TROPiCS
- Time: Is this 15% increase seasonal, sudden or gradual?
    - Sudden increase: Could mean there is a bug in the logging of a new feature, or that a recently launched update (a ranking change?) is causing problems and may need to be rolled back.
    - Gradual increase: May indicate a change in user behavior. Maybe users are starting to prefer live virtual events over physical events due to COVID restrictions.
- Region:
    - Is this change concentrated in a specific region or is it evenly distributed globally?
    - For example, after the pandemic came to an end, some cities started to reopen. In that case, the rising interest in events may be concentrated only in the cities that have reopened.
- Other related features affected: If interest in events is going up, is a similar jump seen in Instagram or Facebook stories, since users attending these events will have more content to post about?
- Platform:
    - Is this being seen across both Android and iOS devices?
    - Is this being seen across mobile and desktop devices?
    - Is this being seen across Windows and Mac?
    - If only one of them is seeing an increase, investigate a possible engineering bug on the platform where the change is seen.
- Cannibalization: If the metric for a product is decreasing, is it because another product is cannibalizing its engagement? Alternatively, if the metric in question is increasing, is the product where the increase is seen cannibalizing engagement from another product?
    - Around the time of the spike in event clicks, is there a decrease in the number of clicks on profiles, pages or groups?
    - Is there a specific category that the current product is cannibalizing from, or is it evenly distributed?
    - For instance, is it only users who previously clicked on groups (not pages) that are clicking on events now?
        - This may indicate that a change was made to the ranking of groups in the search results.
        - Were groups down-ranked, or accidentally removed completely?
- Segmentation: Slice and dice the data to identify the demographic of users this increase has affected.
    - Age: Is the increase only being noticed among teenagers, young adults, middle-aged or senior users?
    - Gender: Is this increase only among female users or across both genders?
    - Power v. casual users: Is this only observed among frequent users, casual users or both?
    - New v. existing users: Is the increase only observed among the users that have newly joined the platform or the existing users or both?
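The region, platform and segmentation checks above all follow the same pattern: split the metric by one dimension and compare the week-over-week change in each slice. Below is a minimal sketch of that pattern. The row layout (`week`, `searched`, `clicked`, plus one column per dimension) and the numbers are assumptions made for illustration; real data would typically come from a query or a dataframe.

```python
from collections import defaultdict


def rate_by_segment(rows, dimension):
    """Aggregate counts per (segment, week) and return the success rates."""
    totals = defaultdict(lambda: {"searched": 0, "clicked": 0})
    for row in rows:
        key = (row[dimension], row["week"])
        totals[key]["searched"] += row["searched"]
        totals[key]["clicked"] += row["clicked"]
    return {key: t["clicked"] / t["searched"]
            for key, t in totals.items() if t["searched"] > 0}


def relative_change_by_segment(rows, dimension, previous_week, current_week):
    """Relative week-over-week change of the success rate in each segment."""
    rates = rate_by_segment(rows, dimension)
    segments = sorted({segment for segment, _ in rates})
    changes = {}
    for segment in segments:
        prev = rates.get((segment, previous_week))
        curr = rates.get((segment, current_week))
        # Skip segments with no data in either week or a zero baseline.
        if prev and curr is not None:
            changes[segment] = (curr - prev) / prev
    return changes


# Hypothetical rows, for illustration only.
rows = [
    {"week": 1, "platform": "android", "searched": 1000, "clicked": 200},
    {"week": 2, "platform": "android", "searched": 1000, "clicked": 200},
    {"week": 1, "platform": "ios", "searched": 1000, "clicked": 200},
    {"week": 2, "platform": "ios", "searched": 1000, "clicked": 260},
]
changes = relative_change_by_segment(rows, "platform", previous_week=1, current_week=2)
```

In this made-up example the whole increase sits on one platform, which would point toward a platform-specific release or bug rather than a broad change in user behavior. Passing a different column name, such as a region or age-group column, reuses the same comparison for the other dimensions.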

### External data
- Did the number of users attending events on Twitter (X) decrease?
- Has a new competitor joined the market?
- Are the competitors changing their offering?
- Has the company received good PR recently?
- It could also be due to seasonality or a major temporary event. If it is a major temporary event, the KPIs will return to their normal state shortly.
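The seasonality and temporary-event points can be checked roughly as follows: compare the current change with what the same calendar week showed in earlier years, and watch whether the metric drifts back to its pre-spike level. This is only a sketch under assumptions. The function names, the tolerances and the sample numbers are hypothetical, and a simple average of prior years stands in for a proper seasonal model.

```python
def looks_seasonal(current_change, same_week_changes_prior_years, tolerance=0.05):
    """True if the current change is close to the typical change for this week.

    `same_week_changes_prior_years` holds the week-over-week changes observed
    in the same calendar week of earlier years (assumed to be available).
    """
    if not same_week_changes_prior_years:
        return False
    typical = sum(same_week_changes_prior_years) / len(same_week_changes_prior_years)
    return abs(current_change - typical) <= tolerance


def returned_to_baseline(baseline_rate, later_rates, tolerance=0.01):
    """True if the most recent rate is back within `tolerance` of the baseline."""
    return bool(later_rates) and abs(later_rates[-1] - baseline_rate) <= tolerance


# Hypothetical numbers, for illustration only.
seasonal = looks_seasonal(0.15, [0.14, 0.16, 0.13])
not_seasonal = looks_seasonal(0.15, [0.01, -0.02, 0.00])
reverted = returned_to_baseline(0.20, [0.23, 0.22, 0.205])
```

A change that matches prior years suggests seasonality, while a spike that fades back to the baseline within a few weeks is more consistent with a one-off event.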