# SQL Questions 

## 1. What is SQL, and how is it used in Data Science?

**SQL (Structured Query Language)** is the standard language for communicating with relational databases. It lets users store, retrieve, manage, and manipulate structured data efficiently.

In data science, SQL is essential because most organizational data is stored in databases. Data scientists use SQL to extract relevant datasets, clean data, filter records, join multiple tables, and aggregate information before performing analysis or building machine learning models.

### Key Uses of SQL in Data Science:

- **Data Extraction:** Retrieve specific datasets required for analysis.
- **Data Cleaning:** Handle missing values, remove duplicates, and standardize data.
- **Data Transformation:** Use joins, subqueries, and aggregations to prepare data for modeling.
- **Exploratory Data Analysis:** Perform quick calculations such as averages, counts, and trend analysis.
- **Workflow Integration:** Commonly used alongside tools like Python, R, and BI platforms to support end-to-end data workflows.

**In summary:** SQL is a foundational skill for data scientists because effective data analysis begins with efficient data retrieval and preparation.

---

## 2. Explain the difference between INNER JOIN and LEFT JOIN

**INNER JOIN** and **LEFT JOIN** are SQL operations used to combine data from multiple tables based on a related column, but they differ in how unmatched records are handled.

**INNER JOIN:**  
Returns only the records that have matching values in both tables. If a row does not have a corresponding match, it is excluded from the result.

**LEFT JOIN (LEFT OUTER JOIN):**  
Returns all records from the left table and the matched records from the right table. If no match is found, the result will contain NULL values for columns from the right table.

**Key Difference:**  
INNER JOIN focuses on common data between tables, whereas LEFT JOIN ensures all data from the left table is preserved, regardless of matches.
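
To make the difference concrete, here is a minimal sketch using two hypothetical tables, `customers` and `orders` (names, columns, and data are invented for illustration). It assumes a database that accepts multi-row `VALUES` lists, such as SQLite, PostgreSQL, or MySQL.

```sql
-- Hypothetical sample tables, for illustration only.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name VARCHAR(50));
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);

INSERT INTO customers (id, name) VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO orders (id, customer_id, amount) VALUES (10, 1, 50), (11, 1, 20), (12, 2, 75);

-- INNER JOIN: only customers with at least one order.
-- Returns (Ana, 20), (Ana, 50), (Ben, 75). Cy is excluded.
SELECT c.name, o.amount
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id
ORDER BY c.name, o.amount;

-- LEFT JOIN: every customer is kept.
-- Returns the same three rows plus (Cy, NULL).
SELECT c.name, o.amount
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
ORDER BY c.name, o.amount;

-- A common LEFT JOIN pattern: find left-table rows with no match.
-- Returns only Cy.
SELECT c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;
```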

---

## 3. How do you find duplicate records in a SQL table?

Duplicate records in a SQL table can be identified by grouping rows based on one or more columns and filtering those groups that appear more than once.

This is commonly achieved using the **GROUP BY** clause along with the **HAVING** condition.

**Example Approach:**
```sql
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
```

### Explanation:

- GROUP BY organizes rows with identical values into groups.

- COUNT(*) calculates the number of occurrences in each group.

- HAVING COUNT(*) > 1 filters and returns only duplicate entries.

- This method helps data professionals detect data quality issues and maintain database integrity.
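
Duplicates are often defined by a combination of columns rather than a single one. The sketch below assumes a hypothetical `signups` table where a duplicate means the same `(email, signup_date)` pair. The second query uses `ROW_NUMBER()`, which requires window function support (for example PostgreSQL, SQLite 3.25+, or MySQL 8+).

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE signups (id INTEGER PRIMARY KEY, email VARCHAR(100), signup_date DATE);

INSERT INTO signups (id, email, signup_date) VALUES
    (1, 'a@example.com', '2024-01-05'),
    (2, 'b@example.com', '2024-01-06'),
    (3, 'a@example.com', '2024-01-05'),
    (4, 'a@example.com', '2024-02-01');

-- Group on every column that defines "the same record".
-- Returns one group: a@example.com, 2024-01-05, 2 occurrences.
SELECT email, signup_date, COUNT(*) AS occurrences
FROM signups
GROUP BY email, signup_date
HAVING COUNT(*) > 1;

-- List the extra copies themselves: every row after the first in each group.
-- Returns the row with id 3.
SELECT id, email, signup_date
FROM (
    SELECT id, email, signup_date,
           ROW_NUMBER() OVER (PARTITION BY email, signup_date ORDER BY id) AS rn
    FROM signups
) numbered
WHERE rn > 1;
```

The second form is useful when the goal is to review or delete the redundant rows while keeping one copy of each.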

---

## 4. Can you describe the process of normalization in a database?

Normalization is a database design technique used to organize data efficiently by reducing redundancy and improving data integrity. It involves structuring tables and defining relationships to ensure that each piece of data is stored only once.

**Objectives of Normalization:**
- Eliminate duplicate data
- Ensure logical data dependencies
- Improve data consistency
- Simplify database maintenance

**Common Normal Forms:**

- **First Normal Form (1NF):** Ensures that each column contains atomic values and each record is unique.
- **Second Normal Form (2NF):** Builds on 1NF and removes partial dependencies: every non-key attribute must depend on the whole primary key, not on just part of a composite key.
- **Third Normal Form (3NF):** Builds on 2NF and eliminates transitive dependencies: non-key attributes must depend only on the primary key, not on other non-key attributes.

Normalization results in a well-structured database that supports efficient querying and reliable data management.
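
As a rough illustration, the sketch below splits a hypothetical flat orders table into two related tables. All table and column names are invented, and it assumes that a customer's city is determined by the customer rather than by the individual order (a transitive dependency, in 3NF terms).

```sql
-- Before: customer details repeat on every order row.
CREATE TABLE orders_flat (
    order_id      INTEGER PRIMARY KEY,
    customer_name VARCHAR(50),
    customer_city VARCHAR(50),
    product       VARCHAR(50)
);

-- After: each customer is stored once and referenced by key.
CREATE TABLE customers_n (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(50) NOT NULL,
    city        VARCHAR(50)
);

CREATE TABLE orders_n (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers_n (customer_id),
    product     VARCHAR(50)
);

-- The original flat view can be reassembled with a join when needed.
SELECT o.order_id, c.name, c.city, o.product
FROM orders_n o
JOIN customers_n c ON c.customer_id = o.customer_id;
```

With this layout, changing a customer's city means updating one row in `customers_n` instead of every matching order. Whether the inline `REFERENCES` constraint is enforced depends on the database engine and its settings.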

---

## 5. What is an index in SQL, and why is it important?

An index in SQL is a database object that improves the speed of data retrieval operations by creating a structured reference to the data in a table. It functions similarly to an index in a book, allowing the database to locate records quickly without scanning the entire table.

**Importance of Indexes:**
- **Faster Query Performance:** Significantly reduces the time required to retrieve data.
- **Efficient Searching and Sorting:** Optimizes operations such as WHERE, ORDER BY, and JOIN.
- **Improved Scalability:** Helps maintain performance as the dataset grows.
- **Enhanced Data Access:** Enables quicker filtering of large volumes of data.

However, indexes should be used strategically: each index takes extra storage and must be updated whenever rows change, so excessive indexing slows down data modification operations like INSERT, UPDATE, and DELETE.
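
A minimal sketch of creating an index, assuming a hypothetical `events` table that is usually queried by user and sorted by date (the table, columns, and index name are illustrative):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    created_at DATE NOT NULL,
    payload    VARCHAR(200)
);

-- Composite index: supports filtering by user_id and, within a user,
-- ordering by created_at. It can also serve lookups on user_id alone.
CREATE INDEX idx_events_user_created ON events (user_id, created_at);

-- A query shaped like this is a candidate for the index above.
SELECT id, created_at
FROM events
WHERE user_id = 42
ORDER BY created_at DESC;
```

Whether the optimizer actually uses the index depends on the engine, table size, and statistics, so it is worth checking the execution plan rather than assuming.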

---

## 6. Explain the concept of an aggregate function. Provide examples.

An aggregate function in SQL performs a calculation on a set of rows and returns a single summarized value. These functions are commonly used in data analysis to generate insights from large datasets.

**Common Aggregate Functions:**

- **COUNT():** Returns the number of rows; `COUNT(*)` counts every row, while `COUNT(column)` counts only non-NULL values.
- **SUM():** Calculates the total of a numeric column.
- **AVG():** Computes the average value.
- **MIN():** Identifies the smallest value.
- **MAX():** Identifies the largest value.

**Example:**
```sql
SELECT COUNT(*) AS total_employees,
       AVG(salary) AS average_salary
FROM employees;
```

Aggregate functions are frequently used with the GROUP BY clause to summarize data across different categories, supporting efficient reporting and decision-making.
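
One detail worth seeing in practice is that aggregate functions other than `COUNT(*)` skip NULL values. The sketch below uses a hypothetical `staff` table with one missing salary; the names and data are invented for illustration.

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE staff (name VARCHAR(50), department VARCHAR(50), salary INTEGER);

INSERT INTO staff (name, department, salary) VALUES
    ('Ana', 'Sales', 100),
    ('Ben', 'Sales', 200),
    ('Cy',  'IT',    NULL);

SELECT COUNT(*)      AS all_rows,        -- 3: every row is counted
       COUNT(salary) AS known_salaries,  -- 2: the NULL salary is skipped
       SUM(salary)   AS total_salary,    -- 300
       AVG(salary)   AS average_salary,  -- 150: 300 / 2, not 300 / 3
       MIN(salary)   AS lowest_salary,   -- 100
       MAX(salary)   AS highest_salary   -- 200
FROM staff;
```

If missing salaries should count as zero, wrapping the column as `AVG(COALESCE(salary, 0))` changes the average to 100.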

---

## 7. How would you retrieve the top 5 records from a table in SQL?

The top records in a SQL table can be retrieved using clauses such as **LIMIT**, **TOP**, or **FETCH FIRST**, depending on the database system. These clauses restrict the number of rows returned in the result set.

**Using LIMIT (MySQL, PostgreSQL):**
```sql
SELECT *
FROM table_name
ORDER BY column_name DESC
LIMIT 5;
```

**Using TOP (SQL Server):**
```sql
SELECT TOP 5 *
FROM table_name
ORDER BY column_name DESC;
```

**Using FETCH FIRST (Oracle, DB2):**
```sql
SELECT *
FROM table_name
ORDER BY column_name DESC
FETCH FIRST 5 ROWS ONLY;
```

Including ORDER BY makes the result well defined: without it, the database may return any 5 rows, whereas sorting first yields the highest values or the most recent entries.

---

## 8. What is the purpose of the GROUP BY clause, and when would you use it?

The **GROUP BY** clause is used to organize rows with similar values into groups, allowing aggregate functions to perform calculations on each group instead of the entire dataset.

**Purpose:**
- Summarize large datasets into meaningful categories  
- Perform calculations such as totals, averages, and counts per group  
- Support data analysis and reporting  

**When to Use GROUP BY:**
- When calculating metrics by category (e.g., total sales per region)  
- When identifying patterns or trends within grouped data  
- When generating summary reports  

**Example:**
```sql
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;
```
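
The "total sales per region" case mentioned above can be sketched as follows, using a hypothetical `region_sales` table (names and data are invented). It also shows how `WHERE` and `HAVING` differ when combined with GROUP BY.

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE region_sales (region VARCHAR(20), amount INTEGER);

INSERT INTO region_sales (region, amount) VALUES
    ('North', 100), ('North', 250),
    ('South', 80),  ('South', 40),
    ('West', 300);

SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_sales
FROM region_sales
WHERE amount >= 50          -- row filter: applied before grouping
GROUP BY region
HAVING SUM(amount) > 100    -- group filter: applied after aggregation
ORDER BY total_sales DESC;
-- Returns (North, 2, 350) and (West, 1, 300).
-- South is dropped: after the WHERE filter its only remaining sale is 80.
```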

---

## 9. Describe a scenario where you would use a subquery in SQL

A **subquery** is a query nested inside another query, used to perform intermediate calculations or filter data based on the results of another query.

**Scenario Example:**

Suppose you want to find employees who earn more than the average salary in a company.

```sql
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
```
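
A related case is the correlated subquery, where the inner query refers to the current row of the outer query. The sketch below, on a hypothetical `workers` table with invented data, finds people who earn more than the average of their own department rather than of the whole company.

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE workers (name VARCHAR(50), department VARCHAR(50), salary INTEGER);

INSERT INTO workers (name, department, salary) VALUES
    ('Ana', 'Sales', 100),
    ('Ben', 'Sales', 200),
    ('Cy',  'IT',    300),
    ('Di',  'IT',    500);

-- The inner query is evaluated against the department of each outer row.
SELECT w.name, w.department, w.salary
FROM workers w
WHERE w.salary > (
    SELECT AVG(w2.salary)
    FROM workers w2
    WHERE w2.department = w.department
)
ORDER BY w.name;
-- Returns Ben (Sales average is 150) and Di (IT average is 400).
```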

---

## 10. How can you optimize a slow-performing SQL query?

Optimizing SQL queries improves performance and reduces database load. Key strategies include:

- **Use Indexes:** Create indexes on columns frequently used in WHERE, JOIN, and ORDER BY clauses.  
- **Avoid SELECT *:** Retrieve only necessary columns to reduce data transfer.  
- **Use Joins Efficiently:** Choose appropriate join types and ensure join columns are indexed.  
- **Filter Early:** Apply WHERE conditions to limit rows processed as soon as possible.  
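- **Avoid Subqueries When Possible:** Correlated subqueries can run once per outer row; rewriting them as JOINs or CTEs (Common Table Expressions) often performs better, although many optimizers already handle simple subqueries well.  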
- **Analyze Query Execution:** Use `EXPLAIN` or execution plans to identify bottlenecks.  
- **Limit Data Retrieval:** Use LIMIT or pagination when working with large datasets.  
- **Denormalization (if necessary):** For read-heavy systems, consider denormalizing tables to reduce complex joins.

Following these practices ensures queries run efficiently, especially on large datasets.
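
Several of these strategies can be combined in one rewrite. The sketch below assumes a hypothetical `orders_log` table and uses PostgreSQL/MySQL-style `LIMIT` and `EXPLAIN`; other systems expose row limits and execution plans through different syntax.

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE orders_log (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date  DATE NOT NULL,
    amount      INTEGER
);

CREATE INDEX idx_orders_log_date ON orders_log (order_date);

-- Before: every column of every row, sorted.
SELECT *
FROM orders_log
ORDER BY order_date DESC;

-- After: only the needed columns, filtered early on an indexed column,
-- and capped to the rows that will actually be used.
SELECT id, customer_id, amount
FROM orders_log
WHERE order_date >= '2024-01-01'
ORDER BY order_date DESC
LIMIT 100;

-- Ask the database how it plans to run the rewritten query.
EXPLAIN
SELECT id, customer_id, amount
FROM orders_log
WHERE order_date >= '2024-01-01'
ORDER BY order_date DESC
LIMIT 100;
```

The plan output format differs between engines, but in each case the goal is the same: check whether the index is used and where most of the work is spent.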


---

## 11. Explain the concept of a self-join with an example

A **self-join** is a join in which a table is joined with itself to compare rows within the same table. It is useful for hierarchical or relational data stored in a single table.

**Example Scenario:** Finding employees and their managers from the same `employees` table.

```sql
SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
LEFT JOIN employees e2
ON e1.manager_id = e2.id;
```
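
To see why a `LEFT JOIN` is used here, consider a small hypothetical `org` table (invented data) in which the top-level manager has no `manager_id`:

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE org (id INTEGER PRIMARY KEY, name VARCHAR(50), manager_id INTEGER);

INSERT INTO org (id, name, manager_id) VALUES
    (1, 'Ana', NULL),
    (2, 'Ben', 1),
    (3, 'Cy',  1),
    (4, 'Di',  2);

-- LEFT JOIN keeps Ana, who has no manager; an INNER JOIN would drop her.
SELECT e.name AS employee, m.name AS manager
FROM org e
LEFT JOIN org m ON e.manager_id = m.id
ORDER BY e.id;
-- Returns (Ana, NULL), (Ben, Ana), (Cy, Ana), (Di, Ben).

-- The same self-join can count direct reports per manager.
SELECT m.name AS manager, COUNT(*) AS direct_reports
FROM org e
JOIN org m ON e.manager_id = m.id
GROUP BY m.name
ORDER BY m.name;
-- Returns (Ana, 2) and (Ben, 1).
```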

---

## 12. What are window functions? Provide an example of their usage

**Window functions** perform calculations across a set of rows related to the current row without collapsing the result into a single output. They are used for ranking, running totals, moving averages, and other analytical tasks.

**Example: Ranking employees by salary within each department**

```sql
SELECT name,
       department,
       salary,
       RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employees;
```
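
The sketch below, on a hypothetical `pay` table with invented data, shows how rows are kept rather than collapsed and how `RANK()` and `DENSE_RANK()` differ when salaries tie. It assumes a database with window function support (for example PostgreSQL, SQLite 3.25+, or MySQL 8+).

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE pay (name VARCHAR(50), department VARCHAR(50), salary INTEGER);

INSERT INTO pay (name, department, salary) VALUES
    ('Ana', 'Sales', 300),
    ('Ben', 'Sales', 300),
    ('Cy',  'Sales', 150),
    ('Di',  'IT',    500);

SELECT name,
       department,
       salary,
       RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk,
       AVG(salary)  OVER (PARTITION BY department) AS dept_avg
FROM pay
ORDER BY department, salary DESC, name;
-- Di:  rnk 1, dense_rnk 1, dept_avg 500
-- Ana: rnk 1, dense_rnk 1, dept_avg 250
-- Ben: rnk 1, dense_rnk 1, dept_avg 250
-- Cy:  rnk 3, dense_rnk 2, dept_avg 250
```

All four rows are returned, each carrying its department's average. After a tie, `RANK()` skips to 3 while `DENSE_RANK()` continues with 2.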

---

## 13. How can you handle NULL values in SQL queries?

NULL values represent missing or unknown data in SQL. Handling them properly is essential to ensure accurate analysis and prevent errors.

**Common Approaches:**

- **IS NULL / IS NOT NULL:** Filter rows with or without NULL values.

```sql
SELECT * FROM employees WHERE manager_id IS NULL;
```

- **COALESCE():** Replace NULL with a default value.

```sql
SELECT name, COALESCE(salary, 0) AS salary
FROM employees;
```

- **IFNULL() / NVL():** Database-specific functions to substitute NULL values.

```sql
SELECT name, IFNULL(salary, 0) AS salary FROM employees; -- MySQL
SELECT name, NVL(salary, 0) AS salary FROM employees;   -- Oracle
```

- **NULL-safe comparisons:** Use functions or conditional logic to handle NULLs in calculations and joins.

Proper handling of NULLs ensures data integrity and accurate query results.
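
The reason NULLs need special handling is that comparisons and arithmetic involving NULL evaluate to unknown (NULL) rather than true or false. A minimal sketch on a hypothetical `people` table, with invented data:

```sql
-- Hypothetical sample table, for illustration only.
CREATE TABLE people (name VARCHAR(50), manager_id INTEGER, bonus INTEGER);

INSERT INTO people (name, manager_id, bonus) VALUES
    ('Ana', NULL, 10),
    ('Ben', 1,    NULL),
    ('Cy',  1,    5);

-- '=' never matches NULL: the comparison is unknown, so no rows come back.
SELECT name FROM people WHERE manager_id = NULL;

-- IS NULL is the correct test. Returns Ana.
SELECT name FROM people WHERE manager_id IS NULL;

-- '<>' also drops NULL rows: Ben's missing bonus is neither = 5 nor <> 5.
-- Returns Ana only.
SELECT name FROM people WHERE bonus <> 5;

-- Arithmetic with NULL yields NULL unless a default is supplied.
SELECT name,
       bonus + 1              AS raw_total,
       COALESCE(bonus, 0) + 1 AS safe_total
FROM people
ORDER BY name;
-- Returns (Ana, 11, 11), (Ben, NULL, 1), (Cy, 6, 6).
```

For comparisons that should treat two NULLs as equal, some systems offer a dedicated operator, such as `IS NOT DISTINCT FROM` in PostgreSQL or `<=>` in MySQL.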

---


## 14. What is the difference between UNION and UNION ALL? When would you use each?

**UNION** and **UNION ALL** are SQL operators used to combine results from two or more queries.

- **UNION:**  
  - Combines results and **removes duplicate rows**.  
  - Performs an implicit **DISTINCT** operation.  
  - Usually slower, because the database must sort or hash the combined rows to eliminate duplicates.  

- **UNION ALL:**  
  - Combines results **including duplicates**.  
  - Faster because no duplicate check is performed.  

**When to Use:**
- Use **UNION** when you want only unique records.  
- Use **UNION ALL** when duplicates are acceptable or when performance is critical and duplicate removal is unnecessary.

**Example:**
```sql
SELECT name FROM employees_us
UNION
SELECT name FROM employees_uk;

SELECT name FROM employees_us
UNION ALL
SELECT name FROM employees_uk;
```

---

## 15. How would you calculate the cumulative sum of a column in SQL?

A **cumulative sum** calculates the running total of a column across rows, often using window functions.

**Example: Cumulative sales by date**

```sql
SELECT date,
       sales,
       SUM(sales) OVER (ORDER BY date) AS cumulative_sales
FROM sales_data;
```