# Interview Questions

**23-04-2025**

# Python

**1. What are Python Keywords?**

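Keywords are reserved words with a special meaning to the interpreter, so you can't use them as variable, function, or class names. Recent Python 3 releases have 35 of them, plus a few context-dependent "soft keywords" such as `match` and `case`.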

### 🗂️ Examples:

```python
if, else, for, while, def, return, True, False, None, import, class, try, except
```
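
To check the list on your own interpreter instead of memorising it, the standard-library `keyword` module can help. A minimal sketch (the exact count depends on the Python version you run):

```python
import keyword

# Number of reserved keywords in the interpreter running this script
print(len(keyword.kwlist))
print(keyword.kwlist)

# iskeyword() tells you whether a name is reserved
print(keyword.iskeyword("for"))    # "for" is a keyword
print(keyword.iskeyword("print"))  # "print" is a built-in function, not a keyword
```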



**2. How do you handle a specific exception in your code? If you encounter a 
`FileNotFoundError`, how would you catch and handle it gracefully in your 
code?**

### 🧑‍💻 Example to catch `FileNotFoundError`:

```python
try:
    # "with" closes the file automatically, even if reading fails
    with open("myfile.txt", "r") as file:
        content = file.read()
    print(content)
except FileNotFoundError:
    print("📁 File not found, bro! Please check the name or path.")
```

**3. What is a lambda function in Python and where is it useful?**

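A lambda function is a nameless (anonymous) function written as a single expression.

It is used for short, one-line functions (no `def` needed).

It is useful as a throwaway argument to `map()`, `filter()`, `sorted()` (as the `key`), and `functools.reduce()`.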

**Syntax:** `lambda arguments: expression`
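
A quick sketch of where lambdas usually show up. The lists and variable names here are made up for illustration:

```python
from functools import reduce  # reduce() lives in functools in Python 3

numbers = [1, 2, 3, 4, 5]

# map(): apply the lambda to every item
squares = list(map(lambda n: n * n, numbers))

# filter(): keep only the items where the lambda returns True
evens = list(filter(lambda n: n % 2 == 0, numbers))

# reduce(): fold the list into one value (here, a running sum)
total = reduce(lambda acc, n: acc + n, numbers)

# sorted(): use a lambda as the sort key (sort pairs by their second item)
pairs = [("b", 2), ("a", 3), ("c", 1)]
by_count = sorted(pairs, key=lambda pair: pair[1])

print(squares, evens, total, by_count)
```

For anything longer than one expression, a normal `def` is usually easier to read and debug.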

**4. How do you handle exceptions in Python and what is the reason for using the 
exceptions?**

### ⚠️ **Why handle exceptions?**

To stop the program from **crashing** 💥 and give the user a **clean message** 📢 instead of a raw traceback


### 🛠️ How to handle?

Use a `try`/`except` block 🧱

```python
try:
    x = 10 / 0
except ZeroDivisionError:
    print("Can't divide by zero, bro! ❌")
```

✅ Helps in debugging and makes code **safe** and **smooth** 🚀
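
Beyond a single `except`, Python also offers `else` and `finally` clauses. Here is a small sketch; `safe_divide` is a hypothetical helper written for this note, not a library function:

```python
def safe_divide(a, b):
    """Return a / b, or None if the division is not possible."""
    try:
        result = a / b
    except ZeroDivisionError:
        print("Can't divide by zero ❌")
        return None
    except TypeError as err:
        # "as err" gives access to the original error message
        print(f"Bad input: {err}")
        return None
    else:
        # Runs only when no exception was raised
        return result
    finally:
        # Runs in every case, handy for cleanup
        print("Division attempted")


print(safe_divide(10, 2))
print(safe_divide(1, 0))
print(safe_divide("a", 2))
```

Catching specific exception types (instead of a bare `except:`) keeps unexpected bugs visible rather than silently swallowing them.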

**5. What distinguishes the Python '==' and 'is' operators?** 


### ✅ `==` → Checks **value**

```python
a = [1, 2]
b = [1, 2]
print(a == b)  # ✅ True (values same)
```

### 🧠 `is` → Checks **identity** (same object in memory)

```python
print(a is b)  # ❌ False (not same object)
```

👉 Use `==` for comparing values

👉 Use `is` for checking if both refer to **same object** in memory
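
A slightly longer sketch of the same idea, showing what "same object" means in practice (variable names are only for illustration):

```python
a = [1, 2]
b = [1, 2]
c = a  # c is just another name for the same list object as a

same_value = a == b   # equal contents
same_object = a is b  # two separate list objects
alias = c is a        # same object, two names

c.append(3)  # changing the list through c also changes a
print(a)

value = None
is_missing = value is None  # the idiomatic way to check for None
print(same_value, same_object, alias, is_missing)
```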


# EDA

**1. Explain the concept of correlation. Which function is used to check the correlation 
between features?**

### 📌 **Correlation** = Relationship 🔗 between two features

* Positive ➕: both go up together
* Negative ➖: one goes up, the other goes down
* 0 ➖➕: no linear link (a non-linear relationship may still exist)

### 🧪 **Function to check correlation:**

```python
df.corr()
```

👉 Returns a matrix of pairwise correlation coefficients (Pearson by default) showing how strongly each feature is related to the others (range: -1 to +1)
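
A minimal sketch of `df.corr()` on a toy DataFrame. The column names and numbers are invented for illustration; on frames that also contain text columns, recent pandas versions may need `numeric_only=True`:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
    "hours_gaming": [9, 7, 6, 4, 2],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = df.corr()
print(corr_matrix.round(2))

# Correlation of every feature with one target column, strongest first
print(corr_matrix["score"].sort_values(ascending=False))

# Rank-based alternative, less sensitive to outliers and non-linear trends
spearman = df.corr(method="spearman")
print(spearman.round(2))
```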

**2. Explain the different types of transformation?**


### 🔄 Types of Transformation in EDA:

1. **Normalization** 📏
   → Scale data between 0 and 1
   → `MinMaxScaler`

2. **Standardization** ⚖️
   → Mean = 0, Std = 1
   → `StandardScaler`

3. **Log Transformation** 🔢➡️🧮
   → Reduces right skew
   → `np.log()` (positive values only), `np.log1p()` (also handles zeros)

4. **Square Root / Cube Root** 🧪
   → Handles moderate skew
   → `np.sqrt()`, `np.cbrt()`

5. **Box-Cox / Yeo-Johnson** 📦
   → Power transforms that make data more normal-shaped
   → `scipy.stats.boxcox()` (positive data only), `scipy.stats.yeojohnson()`

6. **Encoding** 🔡➡️🔢
   → Convert categories to numbers
   → Label / One-hot encoding

7. **Binning / Discretization** 📊
   → Convert continuous to bins
   → Age → Child, Adult, Senior

8. **Handling Outliers** 🚫
   → Log, clipping, or removing

9. **Scaling** 📐
   → Bring values to same level
   → `RobustScaler` for outliers
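
A few of the transformations above, sketched with only NumPy and pandas. The data, column names, and age cut points are assumptions made up for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [0, 20_000, 35_000, 50_000, 1_000_000],
    "age": [8, 25, 40, 67, 33],
    "city": ["Pune", "Delhi", "Pune", "Chennai", "Delhi"],
})

# Log transformation: log1p(x) = log(1 + x), so a zero income is fine
df["income_log"] = np.log1p(df["income"])

# Square root: a milder fix for moderate skew
df["income_sqrt"] = np.sqrt(df["income"])

# Binning: Age -> Child, Adult, Senior (the cut points are assumptions)
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 17, 59, 120],
    labels=["Child", "Adult", "Senior"],
)

# Label encoding: each category becomes an integer code
df["city_code"] = df["city"].astype("category").cat.codes

print(df)
```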

**3. What is the formula for calculating Skewness and which python function is used to 
get the skewness value?**

<img src="../resources/skew.png" alt="image" width="700">

### 🧠 **Interpreting Skewness**:

* **Positive Skew**: Right-tailed distribution
* **Negative Skew**: Left-tailed distribution
* **Zero Skew**: Symmetric distribution
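
In pandas, `Series.skew()` / `DataFrame.skew()` returns the skewness value. The sketch below assumes the common moment-based (Fisher-Pearson) definition, third central moment divided by the second central moment to the power 1.5, which may differ slightly from the exact formula in the image. `skewness` is a hypothetical helper name. pandas applies a sample-size correction, so its number differs a little on small samples.

```python
import numpy as np
import pandas as pd


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no sample-size correction)."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 100, 200]

print(skewness(symmetric))     # symmetric data gives zero skew
print(skewness(right_tailed))  # long right tail gives positive skew

# pandas version (bias-corrected, so not identical to the value above)
print(pd.Series(right_tailed).skew())
```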

**4. What do the X-axis and Y-axis represent in a histogram?**

### 📊 **X-axis** (Horizontal axis):

* Represents the **bins** (or intervals) of the data.
* These bins group the continuous data into specific ranges, like 0-10, 10-20, etc.
* Shows the **values** or **ranges** of the variable being measured.


### 📏 **Y-axis** (Vertical axis):

* Represents the **frequency** or **count** of data points within each bin.
* Shows how many times data points fall into a specific bin or range.


### Example:

If you're plotting the ages of a group of people:

* **X-axis**: Age ranges (0-10, 10-20, 20-30, etc.)
* **Y-axis**: Number of people in each age range.
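
The age example can be sketched with `np.histogram`, which computes the same bins and counts that a histogram plot would draw. The ages below are invented for illustration:

```python
import numpy as np

ages = [3, 8, 12, 15, 19, 21, 22, 25, 27, 34]
bin_edges = [0, 10, 20, 30, 40]  # X-axis: the age ranges

# Each bin includes its left edge; only the last bin also includes its right edge
counts, edges = np.histogram(ages, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Y-axis: the bar height is the number of people in the range
    print(f"{left}-{right}: {count} people")
```

Passing the same `ages` and `bins=bin_edges` to Matplotlib's `plt.hist()` should draw these counts as bars.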

**5. Which function is used to get a horizontal bar plot?**

To create a **horizontal bar plot**, you can use the **`barh()`** function in **Matplotlib**.

### Example:

```python
import matplotlib.pyplot as plt

# Data
categories = ['A', 'B', 'C', 'D']
values = [3, 7, 2, 5]

# Horizontal bar plot
plt.barh(categories, values)

# Labels and title
plt.xlabel('Values')
plt.ylabel('Categories')
plt.title('Horizontal Bar Plot')

# Show plot
plt.show()
```

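### 👇 What happens:

* **`barh(y, width)`**: draws one **horizontal bar** per category. The categories sit on the **Y-axis** and the bar lengths (values) run along the **X-axis**, the reverse of `bar()`.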

# Data preprocessing 

**1. How do you handle skewed distributions in data preprocessing?**

### Common methods:

1. **Log Transform**
   👉 `np.log(x)`: best for **right-skewed**, strictly positive data (use `np.log1p(x)` if there are zeros)

2. **Square Root Transform**
   👉 `np.sqrt(x)` — Good for **moderate skew**

3. **Box-Cox Transform**
   👉 `scipy.stats.boxcox(x)` — Works only for **positive** data

4. **Yeo-Johnson Transform**
   👉 `PowerTransformer(method='yeo-johnson')` — Works for **all** values (+/-)

5. **Z-score or Min-Max Scaling** (after the skew fix)
   👉 These only rescale the values; they do not change the shape of the distribution

### 📌 Example in Python:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'value': [1, 2, 3, 100, 200]})
data['log_value'] = np.log(data['value'])  # Fixing skew
```
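
To confirm a transform actually helped, you can compare the skewness before and after. A small sketch on the same toy data (the extra column names are just for this example):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"value": [1, 2, 3, 100, 200]})

skew_before = data["value"].skew()

# Try two transforms and measure the skew of each result
data["log_value"] = np.log(data["value"])
data["sqrt_value"] = np.sqrt(data["value"])

skew_log = data["log_value"].skew()
skew_sqrt = data["sqrt_value"].skew()

print(f"original: {skew_before:.2f}")
print(f"sqrt:     {skew_sqrt:.2f}")
print(f"log:      {skew_log:.2f}")
```

The closer the value is to 0, the more symmetric the transformed column is.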

**2. Sometimes in data, null values play hide and seek. How will you identify null 
values?**

### 🔍 Identify null values:

```python
df.isnull()         # Shows True where value is null
df.isnull().sum()   # Total nulls in each column
```

### Example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Revanth', None, 'Kumar'],
    'age': [25, None, 30]
})

print(df.isnull())
print(df.isnull().sum())
```

This will help you catch the "hide and seek" nulls 🕵️‍♂️💥
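
Some nulls hide as placeholder text such as an empty string or `"NA"`, which `isnull()` does not flag. A hedged sketch, assuming you already know which placeholders your dataset uses (the `placeholders` list here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Revanth", None, "", "NA", "Kumar"],
    "age": [25, None, 30, 41, 30],
})

# Only real missing values are counted here
nulls_before = df.isnull().sum()
print(nulls_before)

# Treat known placeholder strings as missing too (assumed list, adjust per dataset)
placeholders = ["", "NA", "?"]
df["name"] = df["name"].mask(df["name"].isin(placeholders))

nulls_after = df.isnull().sum()
print(nulls_after)

# Show the rows that contain at least one null
print(df[df.isnull().any(axis=1)])
```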

**3. What challenges can arise during the data preprocessing phase and how do 
you overcome those challenges?**

### 🔧 Challenges in Data Preprocessing & Solutions:

| Challenge 😵‍💫          | Solution 💡                              |
| ------------------------ | ---------------------------------------- |
| **Missing Values**       | Fill with mean/median/mode or drop ❌     |
| **Outliers**             | Use z-score or IQR to remove/handle 🚫   |
| **Skewed Data**          | Apply log, sqrt, or box-cox transform 🔄 |
| **Different Scales**     | Normalize or standardize the data 📏     |
| **Categorical Data**     | Use label or one-hot encoding 🔤➡️🔢     |
| **Duplicates**           | Drop using `df.drop_duplicates()` 🧹     |
| **Inconsistent Formats** | Clean using regex, mapping, etc 🔧       |

All these make your data clean and model-ready 🧽✅
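
Several rows of the table can be sketched in one short pandas pipeline. The data is made up, and the 1.5 × IQR rule is one common convention, not the only option:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "pune ", "Delhi", "Delhi", "Chennai"],
    "salary": [30_000, 32_000, None, None, 900_000],
})

# Inconsistent formats: trim spaces and unify capitalisation
df["city"] = df["city"].str.strip().str.title()

# Missing values: fill with the median, which outliers barely affect
df["salary"] = df["salary"].fillna(df["salary"].median())

# Duplicates: the two Delhi rows are now identical, so keep one
df = df.drop_duplicates().reset_index(drop=True)

# Outliers: cap values that lie more than 1.5 * IQR outside the quartiles
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary"] = df["salary"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(df)
```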


**4. One-hot encoding sometimes returns its result as a sparse matrix. 
What is the reason for this?**

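In **One Hot Encoding**, we create a column for **every category**. If many categories ➡️ many columns ➡️ **most values are 0** ➡️ this forms a **sparse matrix** 😶‍🌫️ That is why encoders such as scikit-learn's `OneHotEncoder` return a sparse matrix by default: it stores only the non-zero entries and saves memory.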

### Example:

For colors: Red, Blue, Green
You get:

| Red | Blue | Green |
| --- | ---- | ----- |
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |

Lots of **zeros = sparse** 💥
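
The same colors example in pandas, with a quick check of how many cells are zero. This is a sketch using `pd.get_dummies`, which returns a regular (dense) DataFrame; the zero count is what makes a sparse format worthwhile:

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green"], name="color")

# One column per category (columns come out in alphabetical order)
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)

# Each row holds exactly one 1, so with k categories the share of zeros is (k - 1) / k
zero_share = (one_hot == 0).to_numpy().mean()
print(f"{zero_share:.0%} of the cells are zero")

# A sparse layout only needs to remember where the 1s are
nonzero_positions = list(zip(*one_hot.to_numpy().nonzero()))
print(nonzero_positions)
```

With hundreds of categories the share of zeros gets close to 100%, so storing only the positions of the 1s is far cheaper.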




**5. What is the purpose of data normalisation and what methods can we use to 
normalise the data?**

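📊 **Purpose of Data Normalization**:

To **scale features** to a comparable range so that no feature dominates just because of its units 💯
It often **improves accuracy, speed & convergence** for models that rely on distances or gradient descent (k-NN, SVM, neural networks) 🧠⚡; tree-based models are largely unaffected.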

### ✅ Common Methods:

1. **Min-Max Scaling (0 to 1)**

   $$
   X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
   $$

2. **Z-score / Standardization (mean = 0, std = 1)**

   $$
   X_{std} = \frac{X - \mu}{\sigma}
   $$

3. **Robust Scaler (uses median & IQR)** – Good for outliers 💪


**When to use which:**

🔹 **MinMaxScaler** – Use when data has no outliers 📈 (scales 0 to 1)

🔹 **StandardScaler** – Use when data is roughly normal (bell curve) 🔔

🔹 **RobustScaler** – Best for data **with outliers** 🚨 (uses median & IQR)
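
To see the difference, here are the three formulas applied by hand with NumPy to a toy array containing one outlier. This is only a sketch of the math; scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` apply the same idea column by column:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])  # 1000 is an outlier

# Min-Max: (x - min) / (max - min), result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std, result has mean 0 and std 1
x_std = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR, the outlier does not affect the scale
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)

print(x_minmax.round(3))
print(x_std.round(3))
print(x_robust.round(3))
```

With Min-Max, the outlier squeezes the four normal values close to 0, while the robust version keeps them evenly spread around 0.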