# **Interview-Ready EDA Summary**

“In this project, I performed a complete end-to-end Exploratory Data Analysis on four interconnected datasets: leads, marketing-qualified leads (MQL), products, and orders. The goal was to prepare them for a funnel analysis.

I started with a full data quality assessment: checking row counts, missing values, duplicate entries, and overall schema consistency across datasets. After identifying anomalies, I applied targeted cleaning steps, such as validating categorical variables, handling invalid entries (e.g., recategorizing unexpected values in `lead_behaviour_profile`), imputing missing values using statistical methods such as the mode, and dropping non-informative or redundant columns.

I also standardized boolean fields, removed irrelevant features, and ensured consistent formatting across datasets. To confirm data reliability, I generated duplicate summaries and cross-validated row counts across all tables.

By the end of the EDA, I had a clean, structured, analysis-ready dataset that accurately reflects user behavior across the customer journey. This gives me a reliable base for conducting actionable funnel analysis and deriving business insights.”

### **1. 🧠 How to explain this `def summarize_missing_values(dataframe):` in an interview**

“First, I load all the Olist datasets using `read_csv`.
Then I create a function called `summarize_missing_values()` that checks missing values for each dataset.
It loops through each DataFrame, counts missing values, calculates their percentage, filters out columns that have no missing values, and finally returns a clean summary.
This helps me quickly understand data quality before starting analysis.”
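
The function itself is not shown in these notes, so the following is only a minimal sketch reconstructed from the description above. The `(name, df)` pair input, the column names `missing count` and `missing percent`, and the tiny sample frames are assumptions for illustration, not the original code or the real Olist tables.

```python
import pandas as pd


def summarize_missing_values(dataframes):
    """Return {name: summary} for an iterable of (name, DataFrame) pairs."""
    summary = {}
    for name, df in dataframes:
        missing_count = df.isna().sum()
        missing_df = pd.DataFrame({
            "missing count": missing_count,
            # df.shape[0] is the row count, so this is the share of rows missing
            "missing percent": (missing_count / df.shape[0] * 100).round(2),
        })
        # Keep only the columns that actually have missing values
        summary[name] = missing_df.query("`missing count` > 0")
    return summary


# Hypothetical sample data (assumed for this sketch)
leads = pd.DataFrame({
    "mql_id": ["a", "b", "c", "d"],
    "origin": ["email", None, "social", None],
})
orders = pd.DataFrame({"order_id": [1, 2], "status": ["shipped", "delivered"]})

missing_summary = summarize_missing_values([("leads", leads), ("orders", orders)])
print(missing_summary["leads"])
```

Keying the result by dataset name is what makes it possible to tell later which summary belongs to which table.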

#### ⚠️ **What happens if you change something inside the function?**

| Change                                   | What happens                                                                         |
| ---------------------------------------- | ------------------------------------------------------------------------------------ |
| Remove ``.query("`missing count` > 0")`` | It will show ALL columns, even those with zero missing values → much longer output   |
| Replace `df.shape[0]` with `df.shape[1]` | Wrong result → percentage becomes incorrect (uses number of columns instead of rows) |
| Remove the loop                          | Function stops working because it can't process multiple datasets                    |
| Change `summary[name] = missing_df`      | You won't be able to identify which missing values belong to which dataset           |

---

### **2. 🧠 How to explain this `missing_summary` in an interview**

“After creating the missing-value checker function, I pass all datasets into it.
The function returns a dictionary where each dataset has its own missing-value summary.
I then loop through this dictionary and print results for each dataset.
If the dataset has no missing values, I print a clean message.
If it has missing values, I print a table showing how many values are missing and what percentage of rows they represent.”
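
The reporting loop described above could look roughly like the sketch below. The dictionary is a hand-built stand-in for the function's return value, and the helper name `format_missing_report`, the column names, and the message wording are assumptions made for this example.

```python
import pandas as pd

# Hand-built stand-in for the dictionary returned by the missing-value function
missing_summary = {
    "leads": pd.DataFrame(
        {"missing count": [2], "missing percent": [50.0]}, index=["origin"]
    ),
    "orders": pd.DataFrame(columns=["missing count", "missing percent"]),
}


def format_missing_report(summary):
    """Build one block of report lines per dataset (hypothetical helper)."""
    lines = []
    # .items() yields (name, DataFrame) pairs; iterating the dict alone gives only keys
    for name, df in summary.items():
        if df.empty:
            lines.append(f"{name}: no missing values")
        else:
            lines.append(f"{name}: columns with missing values")
            lines.append(df.to_string())
    return lines


report = format_missing_report(missing_summary)
print("\n".join(report))
```

Collecting the lines before printing keeps the logic easy to check; printing directly inside the loop would behave the same way.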

#### ⚠️ **What happens if you change something?**

| Change                                               | What happens                                                                                                      |
| ---------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| Remove `if df.empty:`                                | You will print empty tables and it becomes confusing.                                                             |
| Remove `.items()`                                    | Iterating the dictionary directly yields only its keys, so unpacking into a name and a DataFrame raises an error. |
| Pass the list without names (just `[cld, mql, ...]`) | The function will break because it expects pairs `(name, df)`.                                                    |
| Remove one of the datasets from the list             | That dataset won't be included in missing-value analysis.                                                         |

---


### **3. 🎯 Final Simple Summary after cleaning the data (For Interviews)**

“To clean the data, I first created copies of each dataset so that my raw data stays safe.
Then I used `dropna(subset=...)` to remove rows that were missing critical information.
After cleaning, I again checked for missing values to confirm which columns still needed attention.
This ensures my analysis is based on high-quality, complete data. Changing the subset fields directly affects how many rows are removed and how clean the dataset becomes.”
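
As a rough illustration of the copy-then-`dropna` pattern described above, here is a small sketch. The table, its column names, and the choice of which fields count as critical are all hypothetical.

```python
import pandas as pd

# Hypothetical raw table; the column names are placeholders
raw_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["c1", None, "c3", "c4"],
    "comment": [None, "ok", None, "late"],
})

# Work on a copy so the raw data stays untouched
orders_clean = raw_orders.copy()

# Drop only rows missing a critical field; gaps in "comment" are tolerated
orders_clean = orders_clean.dropna(subset=["order_id", "customer_id"])

# Re-check what is still missing after cleaning
remaining_missing = orders_clean.isna().sum()

# A different subset removes a different number of rows
stricter = raw_orders.dropna(subset=["customer_id", "comment"])

print(len(raw_orders), len(orders_clean), len(stricter))
print(remaining_missing)
```

Comparing `orders_clean` with `stricter` shows how the choice of subset fields changes the number of rows that survive.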

"I cleaned the `lead_behaviour_profile` column by keeping only valid categories and converting incorrect entries to NaN. Then I used the mode (most frequent value) to fill missing values so the column becomes complete and usable.

I checked the `has_company` column and found that it had too many missing values and wasn't reliable, so I removed it along with `has_gtin`, `average_stock`, and `declared_product_catalog_size`. These columns had exceptionally high missing percentages (over 90%), meaning they would not contribute meaningful insights.

The goal of all these steps is to ensure my dataset is clean, consistent, and ready for analysis."
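
The category cleanup and column removal described above might be sketched as follows. The list of valid categories, the sample rows, and the two sparse columns shown are assumptions for illustration; the notes do not list the actual valid values.

```python
import numpy as np
import pandas as pd

# Assumed set of valid categories (not confirmed by the notes)
valid_profiles = ["cat", "eagle", "wolf", "shark"]

# Hypothetical sample rows
deals = pd.DataFrame({
    "lead_behaviour_profile": ["cat", "cat", "eagle", "cat, wolf", None, "wolf"],
    "has_company": [None, None, None, None, None, True],
    "has_gtin": [None] * 6,
})

cleaned = deals.copy()

# Anything outside the valid set (including missing entries) becomes NaN
profile = cleaned["lead_behaviour_profile"]
cleaned["lead_behaviour_profile"] = profile.where(profile.isin(valid_profiles), np.nan)

# Fill the gaps with the mode, i.e. the most frequent remaining value
mode_value = cleaned["lead_behaviour_profile"].mode()[0]
cleaned["lead_behaviour_profile"] = cleaned["lead_behaviour_profile"].fillna(mode_value)

# Drop the columns judged too sparse to be useful
cleaned = cleaned.drop(columns=["has_company", "has_gtin"])

print(cleaned)
```

Mode imputation keeps the column complete but also inflates the most common category, which is worth mentioning if an interviewer asks about trade-offs.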

---

### **4. 🧠 How to explain this `def summarize_duplicate(*datasets, names = None):` in an interview**

“I created a function that summarizes duplicates for multiple datasets at once.
It counts total rows and duplicated rows using `df.duplicated().sum()`, and returns the results as a DataFrame.
This helps me quickly assess data quality across all datasets in my pipeline.”
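
Only the signature of this function appears in the notes, so the body below is a sketch of one plausible implementation. The output column names, the fallback labels used when `names` is omitted, and the sample frames are assumptions.

```python
import pandas as pd


def summarize_duplicate(*datasets, names=None):
    """Count total and fully duplicated rows for each DataFrame passed in."""
    if names is None:
        # Fallback labels when no names are supplied (an assumption)
        names = [f"dataset_{i + 1}" for i in range(len(datasets))]
    rows = []
    for name, df in zip(names, datasets):
        rows.append({
            "dataset": name,
            "total rows": len(df),
            # duplicated() flags every repeat of an earlier identical row
            "duplicate rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows)


# Hypothetical sample data
leads = pd.DataFrame({
    "mql_id": ["a", "b", "b"],
    "origin": ["email", "social", "social"],
})
orders = pd.DataFrame({"order_id": [1, 2], "status": ["shipped", "shipped"]})

duplicate_summary = summarize_duplicate(leads, orders, names=["leads", "orders"])
print(duplicate_summary)
```

Note that `df.duplicated()` compares whole rows by default, so two orders that share a status but differ in `order_id` are not counted as duplicates.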