

## 📘 **Day 19: Understanding Your Data - Study Notes**

**Video Link**: [Watch on YouTube](https://www.youtube.com/watch?v=mJlRTUuVr04)

---

### 📌 **Overview**

This video is the first in a series focused on **Understanding Your Data**, an essential step before any Machine Learning modeling begins. It highlights the basic questions to ask when you first receive a dataset and how to extract meaningful insights using Python (especially with Pandas).

---

## 🔍 **Why Understanding Data is Important**

Before diving into machine learning models, it's critical to:

* Know what your data represents.
* Clean and prepare your dataset.
* Reduce memory usage.
* Handle missing values.
* Understand the distribution and type of data.

This helps avoid incorrect assumptions, improves model accuracy, and ensures efficient computation.

---

## 🔢 **Basic Questions to Ask When You Receive a Dataset**

### **1. How Big is the Data?**

* **Goal**: Find out the **number of rows and columns** in the dataset.
* **Why**: Knowing the size helps decide:

  * How much memory it might need.
  * Whether it can be loaded into memory all at once.
* **How**:

  ```python
  df.shape
  ```

  * `.shape` returns a tuple `(rows, columns)`.
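
A minimal sketch of the size check, using a tiny made-up DataFrame in place of the real dataset (the column names mimic Titanic fields, but the values are invented for illustration). It also uses `memory_usage(deep=True)`, which is one way to estimate the in-memory footprint mentioned above.

```python
import pandas as pd

# Toy stand-in for a real dataset; the values are invented for illustration.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# Approximate size in bytes; deep=True also counts the string contents.
total_bytes = df.memory_usage(deep=True).sum()
print(f"about {total_bytes} bytes in memory")
```

If the estimate is close to the available RAM, reading the file in pieces (for example with the `chunksize` argument of `pd.read_csv`) may be a safer option.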

---

### **2. What Does the Data Look Like?**

* **Goal**: Get a glimpse of how the data is structured.

* **Two methods**:

  **a) `df.head()`**: Shows the first 5 rows by default.

  ```python
  df.head()
  ```

  **b) `df.sample(n)`**: Shows `n` random rows.

  ```python
  df.sample(5)
  ```

* **Why Random Samples?**

  * Sometimes the first few rows are **sorted or grouped** in some way, so they give a **biased** picture of the rest of the data.
  * Random rows give a **more holistic** idea of the dataset.
  * Avoid misinterpretation caused by initial patterns.
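
The difference is easiest to see on data that is sorted. The sketch below uses invented values, deliberately ordered by passenger class, so that `head()` shows only one class. The `random_state` argument is optional and is used here only to make the random draw repeatable.

```python
import pandas as pd

# Invented data, deliberately sorted by class so the top rows are all class 1.
df = pd.DataFrame({
    "Pclass": [1] * 5 + [2] * 5 + [3] * 5,
    "Fare": [80.0, 75.5, 90.0, 60.0, 85.0,
             20.0, 18.5, 25.0, 22.0, 19.0,
             7.25, 8.05, 7.9, 7.75, 8.5],
})

top = df.head()                       # first 5 rows: only class 1 appears
rand = df.sample(5, random_state=0)   # 5 rows drawn at random, without replacement

print(top["Pclass"].unique())
print(sorted(rand["Pclass"].unique()))
```

Looking only at `top` would suggest that every passenger travelled first class, which is exactly the kind of wrong first impression a random sample helps avoid.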

---

### **3. What is the Data Type of Each Column?**

* **Why Important**:

  * Data types affect **memory usage** and **model behavior**.
  * Helps you know which columns are:

    * Text or categorical (`object`, or `category` after conversion)
    * Numerical (`int64`, `float64`)
    * Boolean (`bool`), etc.
* **How**:

  ```python
  df.info()
  ```

  * Shows:

    * Column names
    * Non-null count for each column
    * Data types
    * Memory usage
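
As a rough sketch of how the type information can be used, the example below builds a small invented DataFrame, prints its summary, and then splits the columns into numeric and non-numeric groups with `select_dtypes`. The column names and values are assumptions chosen for illustration.

```python
import pandas as pd

# Invented rows; column names loosely follow the Titanic dataset.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
    "Survived": [False, True, True],
})

df.info()          # prints column names, non-null counts, dtypes, memory usage
print(df.dtypes)   # only the dtype of each column

# "number" matches integer and float columns, but not bool or text.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Note that a numeric dtype does not always mean a numeric feature: `Pclass` is stored as an integer but behaves like a category.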

#### ✅ Optimization Tip:

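* Sometimes **whole-number columns are stored as floats** (`float64`) unnecessarily.
* If all values are whole numbers and none are missing, consider converting to a smaller integer type such as `int16` or `int32` to save memory. Converting to plain `int` (`int64`) saves nothing, because it uses the same 8 bytes per value as `float64`.
* A plain integer cast raises an error on `NaN` and silently truncates fractional values, so check for both first.

Example:

```python
# Assumes 'Age' has no missing or fractional values; int16 uses 2 bytes per value.
df['Age'] = df['Age'].astype('int16')
```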

This is especially important in large datasets.
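
The sketch below measures the effect on an invented column of whole-number ages with no missing values. It compares the original `float64` storage with `int64` and `int8`, and then lets `pd.to_numeric` choose a small integer type automatically. The saving depends on the integer width, so the data here is only an illustration.

```python
import numpy as np
import pandas as pd

# Invented whole-number ages (0..79), stored as float64, no missing values.
ages = pd.Series(np.arange(0, 80, dtype="float64").repeat(100), name="Age")

before = ages.memory_usage(index=False)                      # 8 bytes per value
as_int64 = ages.astype("int64").memory_usage(index=False)    # still 8 bytes per value
as_int8 = ages.astype("int8").memory_usage(index=False)      # 1 byte per value

print(before, as_int64, as_int8)

# downcast="integer" picks the smallest signed integer type that fits the data.
smallest = pd.to_numeric(ages, downcast="integer")
print(smallest.dtype)
```

For columns that do contain missing values, pandas also offers nullable integer types such as `"Int64"` (capital I), which can hold whole numbers alongside missing entries.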

---

### **4. Are There Any Missing Values?**

* **Why**:

  * Many ML algorithms raise errors or give unreliable results when the input contains missing values.
  * You need to decide whether to **fill** (impute) or **drop** missing values.

* **How to Detect**:

  **a) Use `df.info()`**: Gives a rough idea. Any column whose non-null count is lower than the total number of rows has missing values.

  **b) Use this code for a detailed count**:

  ```python
  df.isnull().sum()
  ```

* **This command returns**:

  * Column-wise count of missing values.
  * Helps identify:

    * Columns that can be dropped.
    * Columns that require imputation.
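
Raw counts are easier to act on when expressed as a percentage of rows. The sketch below uses invented data and a hypothetical 50% cutoff to separate drop candidates from imputation candidates. The cutoff is an assumption for illustration, not a rule from the video.

```python
import numpy as np
import pandas as pd

# Invented rows; 'Cabin' is mostly empty, 'Age' is partly empty.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

missing_count = df.isnull().sum()        # missing values per column
missing_pct = df.isnull().mean() * 100   # same, as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))

# Hypothetical rule of thumb: above 50% missing, consider dropping the column.
threshold = 50
drop_candidates = missing_pct[missing_pct > threshold].index.tolist()
impute_candidates = missing_pct[
    (missing_pct > 0) & (missing_pct <= threshold)
].index.tolist()

print(drop_candidates, impute_candidates)
```

The right cutoff depends on how informative the column is, so treat the lists as a starting point for inspection rather than a decision.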

---

## 🎯 **Dataset Used in This Video**

### **Dataset**: Titanic Dataset

* Famous introductory dataset on Kaggle.
* Includes features like passenger class, age, sex, fare, and survival status.

---

## 🧠 **Key Learnings**

| Concept                    | Description                                                            |
| -------------------------- | ---------------------------------------------------------------------- |
| `.shape`                   | Understand dataset dimensions (rows, columns).                         |
| `.head()` vs `.sample()`   | Head shows top rows; sample gives unbiased, random view.               |
| `.info()`                  | See data types, null values, and memory usage.                         |
| `.isnull().sum()`          | Count missing values in each column.                                   |
| Data Type Optimization     | Use smaller numeric types (e.g. `int16`) where the values allow it.    |
| Avoid Bias in Initial Rows | Use `.sample()` to avoid incorrect assumptions about dataset patterns. |

---

## 🛠️ **Tools and Libraries Used**

* **Pandas**: For all data handling operations.

  ```python
  import pandas as pd
  ```

---

## 🔄 **What's Next?**

* In the upcoming videos:

  * **Day 20**: Univariate Data Analysis (EDA Part 1)
  * **Day 21+**: Multivariate Analysis and Automated Profiling using tools like `pandas-profiling` (since renamed `ydata-profiling`).

---

## 📚 **Conclusion**

The first step in any Machine Learning workflow should be to **understand the data** you're working with. Without a good understanding:

* You risk training your models on poor-quality or misleading data.
* Insights and decisions may be unreliable.

These foundational techniques ensure you're building on solid ground.

---


