# 1. SQL notes

## Difference between Relational and NoSQL databases

- Relational databases use the relational model, which organizes data into tables with rows and columns, and uses structured query language (SQL) to access and manipulate the data. They are well suited for structured data, such as financial transactions, and are commonly used in business applications.


- NoSQL databases, on the other hand, are designed to handle large amounts of unstructured or semi-structured data, such as social media posts, log files, or user-generated content. They use a variety of data storage models, including key-value, document-based, column-based, and graph databases. NoSQL databases are designed to be horizontally scalable, allowing them to handle large amounts of data and high levels of traffic.


___In summary, relational databases are well suited for structured data, while NoSQL databases are designed to handle unstructured data and scale horizontally.___

### Types of Databases

There are several types of databases, including:



- __Relational databases: Store data in tables with rows and columns and use structured query language (SQL) to access data.__
    - MySQL
    - SQL Server
    - PostgreSQL
    - SQLite
    - MariaDB



- __Non-relational databases (NoSQL): Store data in a format other than tables, such as key-value pairs, documents, or graphs.__
    - HBase
    - MongoDB
    - Cassandra



- Centralized databases: Store data in a single, centralized location and allow multiple users to access the data from different locations.



- Distributed databases: Store data on multiple servers and allow multiple users to access the data from different locations.



- Operational databases: Store real-time data and are designed to support the day-to-day operations of an organization.



- Data warehouses: Store historical data for analysis and decision-making purposes.



- In-memory databases: Store data in RAM for faster access and processing.



- Cloud databases: Store data on remote servers and allow access over the internet.


These are the main types of databases, and different applications may use different types depending on their specific requirements.

## <span class="mark">Difference between DBMS and RDBMS?</span>

**DBMS (Database Management System):**

- Manages and stores data in a structured way.


- Provides basic data storage and retrieval capabilities.


- Doesn't enforce strong relationships between data elements.


- Can be non-relational and handle various data formats.


- Doesn't guarantee ACID properties (Atomicity, Consistency, Isolation, Durability).

**RDBMS (Relational Database Management System):**


- Organizes data into structured tables with predefined schemas.


- Enforces strong relationships between data using keys (primary, foreign).


- Uses SQL (Structured Query Language) for querying and manipulating data.


- __Guarantees ACID properties for transactions  (Atomicity, Consistency, Isolation, Durability).__


- Ensures data integrity through constraints and normalization.

In simple terms, a DBMS is a broader term that includes systems managing any type of data, while an RDBMS specifically deals with structured data using tables and SQL.

## Difference between DELETE, TRUNCATE and DROP 

__In summary:<br>"DELETE" removes one or more rows from a table based on a specified condition (or every row if no condition is given).<br>"TRUNCATE" removes all rows from a table but leaves the table structure intact.<br>"DROP" removes the entire table, both its structure and its data.__
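The three statements can be compared side by side in a small runnable sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL; the `employee` table and its rows are made up for illustration. SQLite has no `TRUNCATE`, so an unconditional `DELETE` is used in its place (an assumption of this sketch, not MySQL behaviour).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, location TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, "Bangalore"), (2, "Pune"), (3, "Pune")])

# DELETE with a condition removes only the matching rows.
conn.execute("DELETE FROM employee WHERE location = 'Pune'")
after_delete = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# Stand-in for MySQL's `TRUNCATE TABLE employee`:
# every row goes, but the table itself still exists.
conn.execute("DELETE FROM employee")
after_truncate = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# DROP removes the table definition itself.
conn.execute("DROP TABLE employee")
table_exists = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'employee'"
).fetchone()[0]

print(after_delete, after_truncate, table_exists)
```

In MySQL, `TRUNCATE TABLE` additionally resets any AUTO_INCREMENT counter and cannot be rolled back, which the unconditional `DELETE` used here does not replicate.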

## DDL vs DML

__DDL $\Longrightarrow$ Data Definition Language.__

It includes the SQL commands that can be used to ___define the database schema.___ It simply deals with descriptions of the database schema and is used to create and modify the structure of database objects in the database. Examples of DDL statements include __CREATE, ALTER, DROP__
<br></br>

__DML $\Longrightarrow$  Data Manipulation Language.__

It includes the SQL commands that can be used to ___manage data stored in the database.___ This includes querying, inserting, updating, and deleting data. Examples of DML statements include __SELECT, INSERT, UPDATE, DELETE__

____In short, DDL is used to create and modify database structure, while DML is used to manage the data stored in the database.____

### 2. Modifying an existing column: use MODIFY

#### Changing the character limit of the location column and setting it to NOT NULL with default 'Bangalore':

```sql
ALTER TABLE employee MODIFY COLUMN location VARCHAR(27) NOT NULL DEFAULT 'Bangalore';
```

### 4. Adding a new constraint:

```sql
ALTER TABLE table_name
ADD CONSTRAINT constraint_name constraint_type (column_name);
```

##### Add a primary key on the id column

e.g.:
```sql
ALTER TABLE employee ADD PRIMARY KEY (id);
```

### 5. Dropping a constraint:

##### DROP a UNIQUE Constraint :

```sql
-- MySQL implements a UNIQUE constraint as an index, so it is dropped with DROP INDEX
ALTER TABLE Persons
DROP INDEX UC_Person;
```

### NOTE: Constraints cannot be modified in place; they must be dropped and added again with the correct definition

### 6. Renaming column names

```sql
ALTER TABLE sleep
CHANGE COLUMN `Wakeup time` wakeup_time VARCHAR(255);
```

### DATA INTEGRITY:

Data integrity in databases refers to the ___accuracy, completeness, and consistency___
of the data stored in a database. 


It is a measure of the reliability and
trustworthiness of the data and ensures that the data in a database is protected
from errors, corruption, or unauthorized changes.

#### There are various methods used to ensure data integrity, including:

- __Constraints:__ Constraints in databases are rules or conditions that must be met for data to be
inserted, updated, or deleted in a database table. They are used to enforce the
integrity of the data stored in a database and to prevent data from becoming
inconsistent or corrupted.



- __Transactions:__ a sequence of database operations that are treated as a single unit
of work.




- __Normalization:__ a design technique that minimizes data redundancy and ensures
data consistency by organizing data into separate tables.

### CONSTRAINTS : 

Constraints in MySQL are rules that enforce data integrity and consistency. They specify the conditions that data must meet in order to be inserted, updated, or deleted from a table. Constraints ensure that the data in a table remains consistent and meets certain requirements.

There are several types of constraints in MySQL:

- __PRIMARY KEY:__ Enforces uniqueness and defines a column or a combination of columns as the primary key of a table.


- __FOREIGN KEY:__ Enforces referential integrity and ensures that the values in a foreign key column match the values in the referenced column of a referenced table.


- __UNIQUE:__ Enforces uniqueness and ensures that the values in a column or a combination of columns are unique within the table.


- __NOT NULL:__ Enforces non-NULL values and ensures that a value is entered in a column for every row in a table.


- __CHECK:__ Enforces conditional constraints and allows you to specify conditions that data must meet in order to be inserted, updated, or deleted from a table.


- __DEFAULT:__ Specifies a default value for a column.


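- __AUTO_INCREMENT:__ Automatically generates the next sequential integer for a column when a new row is inserted. In MySQL it is a column attribute rather than a constraint, and it is typically used on the primary key.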



___Constraints can be specified when creating a table, or they can be added or dropped later using ALTER TABLE statements. They are an important tool for maintaining the integrity and consistency of your data.___
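A minimal sketch of constraints rejecting bad data, run through Python's built-in `sqlite3` module rather than MySQL. The `employee` columns (`email`, `age`) and the sample values are hypothetical; the point is only that each violating INSERT fails while the DEFAULT fills in a missing value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        id       INTEGER PRIMARY KEY,
        email    TEXT NOT NULL UNIQUE,
        age      INTEGER CHECK (age >= 18),
        location TEXT DEFAULT 'Bangalore'
    )
""")
# A valid row; location is omitted, so the DEFAULT applies.
conn.execute("INSERT INTO employee (email, age) VALUES ('a@example.com', 30)")

# Each statement below violates exactly one constraint.
bad_rows = {
    "NOT NULL": "INSERT INTO employee (email, age) VALUES (NULL, 25)",
    "UNIQUE":   "INSERT INTO employee (email, age) VALUES ('a@example.com', 40)",
    "CHECK":    "INSERT INTO employee (email, age) VALUES ('b@example.com', 12)",
}
rejected = []
for name, statement in bad_rows.items():
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError:
        rejected.append(name)

default_location = conn.execute(
    "SELECT location FROM employee WHERE id = 1"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(rejected, default_location, row_count)
```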

### NOTE: The table with the foreign key is called the child table; the table with the primary key is called the parent table or referenced table

```sql
-- Add a foreign key constraint to link Employees table to Departments table

ALTER TABLE Employees ADD FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID);
```

### <span class="mark">CASCADE KEYS</span>

- Cascading in MySQL, specifically in the context of database operations, refers to the behavior that occurs when you perform certain operations on a parent table that has associated child tables with foreign key relationships.


- Cascade actions define **what should happen to the child records when certain operations are performed on the parent record**


- When a CASCADE action is specified for a foreign key, it means that changes made to the referenced primary key in the parent table will automatically propagate to the child table with the foreign key.

There are several types of CASCADE actions that can be applied to foreign keys:

- __ON UPDATE CASCADE:__ When the primary key value in the parent table is updated, the corresponding foreign key value in the child table will also be updated automatically.


- __ON DELETE CASCADE:__ When a row is deleted from the parent table, all related rows in the child table with matching foreign key values will also be automatically deleted.

Here's an example SQL query that demonstrates the implementation of a CASCADE DELETE foreign key constraint:

```sql
-- Create the parent table
CREATE TABLE Authors (
    AuthorID INT PRIMARY KEY,
    AuthorName VARCHAR(100)
);

-- Create the child table with a foreign key referencing the Authors table
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    BookTitle VARCHAR(200),
    AuthorID INT,
    CONSTRAINT fk_AuthorID
        FOREIGN KEY (AuthorID)
        REFERENCES Authors(AuthorID)
        ON DELETE CASCADE
);

-- Insert some data into the Authors table
INSERT INTO Authors (AuthorID, AuthorName) VALUES
(1, 'John Doe'),
(2, 'Jane Smith');

-- Insert some data into the Books table
INSERT INTO Books (BookID, BookTitle, AuthorID) VALUES
(101, 'Book 1', 1),
(102, 'Book 2', 1),
(103, 'Book 3', 2);

-- Now, let's delete the author with AuthorID 1
DELETE FROM Authors WHERE AuthorID = 1;
```

In this example, the foreign key constraint `fk_AuthorID` in the `Books` table has the `ON DELETE CASCADE` option, meaning that when an author with `AuthorID = 1` is deleted from the `Authors` table, all related books with `AuthorID = 1` will also be automatically deleted from the `Books` table. This ensures that the database remains consistent and avoids orphaned records in the child table.
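The update case can be sketched the same way. The snippet below reuses the `Authors` and `Books` tables but runs them through Python's built-in `sqlite3` module instead of MySQL, and adds `ON UPDATE CASCADE` to the foreign key. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`; the new AuthorID value 10 is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE Authors (AuthorID INTEGER PRIMARY KEY, AuthorName TEXT)")
conn.execute("""
    CREATE TABLE Books (
        BookID    INTEGER PRIMARY KEY,
        BookTitle TEXT,
        AuthorID  INTEGER,
        FOREIGN KEY (AuthorID) REFERENCES Authors(AuthorID)
            ON UPDATE CASCADE
            ON DELETE CASCADE
    )
""")
conn.executemany("INSERT INTO Authors VALUES (?, ?)",
                 [(1, "John Doe"), (2, "Jane Smith")])
conn.executemany("INSERT INTO Books VALUES (?, ?, ?)",
                 [(101, "Book 1", 1), (102, "Book 2", 1), (103, "Book 3", 2)])

# Changing the parent key rewrites the matching child keys.
conn.execute("UPDATE Authors SET AuthorID = 10 WHERE AuthorID = 1")
after_update = conn.execute(
    "SELECT BookID, AuthorID FROM Books ORDER BY BookID").fetchall()

# Deleting the parent row removes its child rows.
conn.execute("DELETE FROM Authors WHERE AuthorID = 10")
after_delete = conn.execute(
    "SELECT BookID, AuthorID FROM Books ORDER BY BookID").fetchall()

print(after_update)
print(after_delete)
```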

## MEDIAN()

- **Odd number of values:** the middle value when the values are sorted in ascending or descending order.

- **Even number of values:** the average of the two middle values.

- https://www.youtube.com/watch?v=fwPk1RXlorQ&ab_channel=AnkitBansal

### Median for even number of rows:

#### step 1:

```sql
select *, total_rows * 1.0 /2,(total_rows*1.0/2) +1
from
  (select lat,
  COUNT(*) OVER()  as total_rows,
  ROW_NUMBER() OVER(ORDER BY lat asc) as rn
  from LAT_N)as
t1
WHERE rn between total_rows * 1.0 /2 AND (total_rows*1.0/2) +1;
```

![image.png](attachment:image.png)

#### step 2 : 

```sql
select avg(lat)
from
  (select lat,
  COUNT(*) OVER()  as total_rows,
  ROW_NUMBER() OVER(ORDER BY lat asc) as rn
  from LAT_N)as
t1
WHERE rn between total_rows * 1.0 /2 AND (total_rows*1.0/2) +1;
```

![image.png](attachment:image.png)

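### The same query also works for an odd number of rows:

With 5 rows the filter becomes `rn BETWEEN 2.5 AND 3.5`, so only row 3 (the middle row) is kept. With 4 rows it becomes `rn BETWEEN 2 AND 3`, which keeps the two middle rows.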

```sql
select *, total_rows * 1.0 /2,(total_rows*1.0/2) +1
from
  (select lat,
  COUNT(*) OVER()  as total_rows,
  ROW_NUMBER() OVER(ORDER BY lat asc) as rn
  from LAT_N)as
t1
WHERE rn between total_rows * 1.0 /2 AND (total_rows*1.0/2) +1;
```

![image.png](attachment:image.png)

```sql
select avg(lat) as average_lat
from
  (select lat,
  COUNT(*) OVER()  as total_rows,
  ROW_NUMBER() OVER(ORDER BY lat asc) as rn
  from LAT_N)as
t1
WHERE rn between total_rows * 1.0 /2 AND (total_rows*1.0/2) +1;
```

![image.png](attachment:image.png)
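To check the filter on both an even and an odd row count, here is a small sketch that runs the same query through Python's built-in `sqlite3` module. It assumes a SQLite build with window-function support (3.25 or newer); the `median` helper and the sample values are made up for illustration.

```python
import sqlite3

MEDIAN_SQL = """
    SELECT AVG(lat)
    FROM (
        SELECT lat,
               COUNT(*) OVER ()                 AS total_rows,
               ROW_NUMBER() OVER (ORDER BY lat) AS rn
        FROM LAT_N
    ) AS t1
    WHERE rn BETWEEN total_rows * 1.0 / 2 AND (total_rows * 1.0 / 2) + 1
"""

def median(values):
    """Load the values into a fresh in-memory LAT_N table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE LAT_N (lat REAL)")
    conn.executemany("INSERT INTO LAT_N VALUES (?)", [(v,) for v in values])
    return conn.execute(MEDIAN_SQL).fetchone()[0]

# 4 rows: rn BETWEEN 2.0 AND 3.0 keeps rows 2 and 3, which are then averaged.
even_median = median([10, 40, 20, 30])
# 5 rows: rn BETWEEN 2.5 AND 3.5 keeps row 3 only.
odd_median = median([10, 50, 20, 40, 30])
print(even_median, odd_median)
```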

### Question : finding median company wise:

```sql
select company_name,total_rows,employee_salary,
total_rows*1.0/2,
  (total_rows*1.0/2)+1
from
  (select *,
  ROW_NUMBER() OVER(PARTITION BY company_name ORDER BY employee_salary asc) as rn,
  COUNT(*) OVER(PARTITION BY company_name)  as total_rows
  from salary)as
t1
```

![Screenshot%202023-09-08%20032132.png](attachment:Screenshot%202023-09-08%20032132.png)

#### now finding median (adding where clause):

```sql
select company_name,total_rows,employee_salary,
total_rows*1.0/2,
  (total_rows*1.0/2)+1
from
  (select *,
  ROW_NUMBER() OVER(PARTITION BY company_name ORDER BY employee_salary asc) as rn,
  COUNT(*) OVER(PARTITION BY company_name)  as total_rows
  from salary)as
t1
WHERE rn between total_rows * 1.0 /2 AND (total_rows*1.0/2) +1;

```

![image.png](attachment:image.png)

#### A company with an even number of rows still does not have a single median; it has 2 middle values.


#### So we group by company_name and calculate the average of those 2 values:

```sql
SELECT company_name, total_rows, AVG(employee_salary),
       total_rows * 1.0 / 2,
       (total_rows * 1.0 / 2) + 1
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY company_name ORDER BY employee_salary ASC) AS rn,
           COUNT(*) OVER (PARTITION BY company_name) AS total_rows
    FROM salary
) AS t1
WHERE rn BETWEEN total_rows * 1.0 / 2 AND (total_rows * 1.0 / 2) + 1
-- total_rows is constant per company; grouping by it keeps the query valid under ONLY_FULL_GROUP_BY
GROUP BY company_name, total_rows;
```

![image.png](attachment:image.png)

## ROUND OFF DECIMALS:

When creating a column, we can use DECIMAL(5,2) instead of INT: 5 digits in total, of which 2 come after the decimal point (so values up to 999.99). For computed values, ROUND(value, 2) rounds the result to 2 decimal places:

```sql
select student_company, ROUND(AVG(years_of_experience), 2) as average_exp 
from students 
GROUP BY student_company;
```

![image.png](attachment:image.png)

## CAST ()

The CAST() function converts a value (of any type) into the specified datatype.

```sql
SELECT CAST(150 AS CHAR);
```

```sql
SELECT CAST("2017-08-29" AS DATE);
```

#### Euclidean distance (the CAST to DECIMAL(9,4) keeps 4 decimal places):

```sql
SELECT 
CAST(SQRT(POWER(MAX(LAT_N)-MIN(LAT_N),2) + POWER(MAX(LONG_W)-MIN(LONG_W),2)) AS DECIMAL (9,4))
FROM STATION;
```

### 3rd and 4th Largest:

```sql
SELECT model, battery_capacity
FROM smartphones
ORDER BY battery_capacity DESC
LIMIT 2,2;
```
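`LIMIT 2,2` reads as "skip the first 2 rows, then return the next 2". A quick sketch with Python's built-in `sqlite3` module (SQLite accepts the same `offset, count` shorthand) and made-up phone data shows that it matches the more explicit `LIMIT 2 OFFSET 2`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE smartphones (model TEXT, battery_capacity INTEGER)")
conn.executemany("INSERT INTO smartphones VALUES (?, ?)",
                 [("A", 4000), ("B", 6000), ("C", 5000), ("D", 4500), ("E", 7000)])

# LIMIT 2,2: skip the 2 largest, return the 3rd and 4th largest.
short_form = conn.execute(
    "SELECT model, battery_capacity FROM smartphones "
    "ORDER BY battery_capacity DESC LIMIT 2,2"
).fetchall()

# Equivalent, more explicit spelling.
long_form = conn.execute(
    "SELECT model, battery_capacity FROM smartphones "
    "ORDER BY battery_capacity DESC LIMIT 2 OFFSET 2"
).fetchall()

print(short_form)
```

This approach counts rows, not distinct values: if two phones tie for the largest capacity, they occupy positions 1 and 2. A DENSE_RANK() filter is one way to handle ties if that matters.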

## CASE expressions:

```sql
SELECT DISTINCT customer_id,monthname(sales_date),
SUM(amount) OVER(PARTITION BY customer_id, monthname(sales_date) ORDER BY customer_id)
FROM test3;

select t.*,t2.Total 
from 
  (select customer_id,
    SUM(CASE WHEN monthname(sales_date) = 'January' THEN amount ELSE 0 END) as 'Jan-21',
    SUM(CASE WHEN monthname(sales_date) = 'February' THEN amount ELSE 0 END) as 'Feb-21',
    SUM(CASE WHEN monthname(sales_date) = 'March' THEN amount ELSE 0 END) as 'Mar-21',
    SUM(CASE WHEN monthname(sales_date) = 'April' THEN amount ELSE 0 END) as 'Apr-21',
    SUM(CASE WHEN monthname(sales_date) = 'May' THEN amount ELSE 0 END) as 'May-21',
    SUM(CASE WHEN monthname(sales_date) = 'June' THEN amount ELSE 0 END) as 'June-21',
    SUM(CASE WHEN monthname(sales_date) = 'July' THEN amount ELSE 0 END) as 'July-21',
    SUM(CASE WHEN monthname(sales_date) = 'August' THEN amount ELSE 0 END) as 'Aug-21',
    SUM(CASE WHEN monthname(sales_date) = 'September' THEN amount ELSE 0 END) as 'Sept-21',
    SUM(CASE WHEN monthname(sales_date) = 'October' THEN amount ELSE 0 END) as 'Oct-21',
    SUM(CASE WHEN monthname(sales_date) = 'November' THEN amount ELSE 0 END) as 'Nov-21',
    SUM(CASE WHEN monthname(sales_date) = 'December' THEN amount ELSE 0 END) as 'Dec-21'
  from test3
  group by customer_id 

   UNION

select 'Total' kungfu, -- for 'Total' cell
    SUM(CASE WHEN monthname(sales_date) = 'January' THEN amount ELSE 0 END) as 'Jan-21',
    SUM(CASE WHEN monthname(sales_date) = 'February' THEN amount ELSE 0 END) as 'Feb-21',
    SUM(CASE WHEN monthname(sales_date) = 'March' THEN amount ELSE 0 END) as 'Mar-21',
    SUM(CASE WHEN monthname(sales_date) = 'April' THEN amount ELSE 0 END) as 'Apr-21',
    SUM(CASE WHEN monthname(sales_date) = 'May' THEN amount ELSE 0 END) as 'May-21',
    SUM(CASE WHEN monthname(sales_date) = 'June' THEN amount ELSE 0 END) as 'June-21',
    SUM(CASE WHEN monthname(sales_date) = 'July' THEN amount ELSE 0 END) as 'July-21',
    SUM(CASE WHEN monthname(sales_date) = 'August' THEN amount ELSE 0 END) as 'Aug-21',
    SUM(CASE WHEN monthname(sales_date) = 'September' THEN amount ELSE 0 END) as 'Sept-21',
    SUM(CASE WHEN monthname(sales_date) = 'October' THEN amount ELSE 0 END) as 'Oct-21',
    SUM(CASE WHEN monthname(sales_date) = 'November' THEN amount ELSE 0 END) as 'Nov-21',
    SUM(CASE WHEN monthname(sales_date) = 'December' THEN amount ELSE 0 END) as 'Dec-21'
  from test3) as t

LEFT JOIN

(select customer_id,sum(amount) as `Total`
from test3
group by customer_id) as t2
ON t.customer_id=t2.customer_id
```

![image.png](attachment:image.png)
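The core of the query above is conditional aggregation: `SUM(CASE WHEN ... THEN amount ELSE 0 END)` turns rows into columns. A reduced sketch with two months, run through Python's built-in `sqlite3` module, is shown below. SQLite has no `monthname()`, so `strftime('%m', ...)` is used instead (an assumption of this sketch), and the rows in `test3` are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test3 (customer_id INTEGER, sales_date TEXT, amount INTEGER)")
conn.executemany("INSERT INTO test3 VALUES (?, ?, ?)", [
    (1, "2021-01-05", 100), (1, "2021-01-20", 50), (1, "2021-02-11", 70),
    (2, "2021-02-03", 200),
])

# strftime('%m', date) returns the month as '01'..'12'.
pivot = conn.execute("""
    SELECT customer_id,
           SUM(CASE WHEN strftime('%m', sales_date) = '01' THEN amount ELSE 0 END) AS jan_21,
           SUM(CASE WHEN strftime('%m', sales_date) = '02' THEN amount ELSE 0 END) AS feb_21,
           SUM(amount) AS total
    FROM test3
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

print(pivot)
```

Here the per-customer total is computed in the same grouped query with `SUM(amount)`, which is one possible alternative to joining a separate totals subquery.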

### 6. SELF JOIN :

> A SELF JOIN in SQL is a regular join, but the table is joined with itself.<br>
SELF JOIN combines rows from the same table based on a related column between two or more rows within the same table.

To perform a SELF JOIN, you must __specify an alias for the table because you're joining it with itself__. This allows you to distinguish between the two copies of the same table in the join condition.

```sql
SELECT *
FROM table_name AS t1
JOIN table_name AS t2
ON t1.column_name = t2.column_name;
```

#### e.g. fetch each child's name and age along with their parent's name and age:

##### parent_id is matched against member_id in the same table to fetch the parent's name

![image.png](attachment:image.png)

##### using self join

```sql
SELECT child.member_id AS child_id, child.name AS child_name, child.age AS child_age, parent.name AS parent_name, parent.age AS parent_age, child.parent_id AS parent_id
from relations AS child

JOIN relations AS parent

ON child.parent_id = parent.member_id;
```

![image.png](attachment:image.png)
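Since the screenshots carry the data, here is a self-contained sketch of the same self join with a few invented rows, run through Python's built-in `sqlite3` module. The names and ages are hypothetical; only the `relations` columns come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE relations (
        member_id INTEGER PRIMARY KEY,
        name      TEXT,
        age       INTEGER,
        parent_id INTEGER
    )
""")
conn.executemany("INSERT INTO relations VALUES (?, ?, ?, ?)", [
    (1, "Asha", 60, None),   # no parent recorded
    (2, "Ravi", 35, 1),      # child of Asha
    (3, "Meera", 8, 2),      # child of Ravi
])

# The two aliases let the same table play both roles.
rows = conn.execute("""
    SELECT child.name, child.age, parent.name, parent.age
    FROM relations AS child
    JOIN relations AS parent
      ON child.parent_id = parent.member_id
    ORDER BY child.member_id
""").fetchall()

print(rows)
```

Asha does not appear as a child because her `parent_id` is NULL and an inner join drops unmatched rows; a LEFT JOIN would keep her with NULL parent columns.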

### example of self join : 

![Screenshot%202023-09-08%20040413.png](attachment:Screenshot%202023-09-08%20040413.png)

##### self join now to get friend's name

```sql
WITH cte AS (
    SELECT s.*, f.friend_id
    FROM students_p AS s
    JOIN friends_p AS f
      ON s.id = f.id
)
SELECT *
FROM cte
JOIN students_p AS t2
  ON cte.friend_id = t2.id
ORDER BY cte.id ASC;
```

##### the joined table :

![Screenshot%202023-09-08%20040709.png](attachment:Screenshot%202023-09-08%20040709.png)

![Screenshot%202023-09-08%20040830.png](attachment:Screenshot%202023-09-08%20040830.png)

## SET Operations

1. __UNION:__ The UNION operator is used to combine the results of two or more SELECT
statements into a single result set. The UNION operator removes duplicate rows
between the various SELECT statements.


2. __UNION ALL:__ The UNION ALL operator is similar to the UNION operator, but it does
not remove duplicate rows from the result set.


3. __INTERSECT:__ The INTERSECT operator returns only the rows that appear in both
result sets of two SELECT statements.


4. __EXCEPT:__ ___The EXCEPT operator (called MINUS in Oracle) returns only the distinct rows that appear
in the first result set but not in the second result set of two SELECT statements.___

### <span class="mark">EXCEPT : find customers who have never ordered</span>

```sql
select user_id from users
EXCEPT 
select user_id from orders;
```

![image.png](attachment:image.png)
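The four operators can be compared on the same two tables. The sketch below uses Python's built-in `sqlite3` module with invented `user_id` values; the `run` helper is only a convenience for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER)")
conn.execute("CREATE TABLE orders (user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,), (3,)])

def run(operator):
    """Combine users and orders with the given set operator."""
    sql = (f"SELECT user_id FROM users {operator} "
           "SELECT user_id FROM orders ORDER BY user_id")
    return [row[0] for row in conn.execute(sql)]

union = run("UNION")            # duplicates removed
union_all = run("UNION ALL")    # duplicates kept
intersect = run("INTERSECT")    # users who have ordered
never_ordered = run("EXCEPT")   # users who have never ordered

print(union, union_all, intersect, never_ordered)
```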

## ROLLUP, CUBE, GROUPING SETS

- ROLLUP is a GROUP BY extension used for generating aggregated results from a set of data at several levels in one query.


- It's commonly used in data warehousing and reporting scenarios to create hierarchical summaries of data at different levels of granularity. 


- The ROLLUP operation produces a result set that includes aggregated values for various combinations of specified columns, representing different levels of summarization.

### Difference between ROLLUP and GROUP BY


The key distinction is that ROLLUP allows you to generate __multiple levels of aggregation, including subtotals and grand totals,__ making it useful for generating summary reports with varying levels of detail.

| product_category | product_type | sales_date | amount |
|------------------|--------------|------------|--------|
| Electronics     | Smartphone   | 2023-08-01 | 500    |
| Electronics     | Laptop       | 2023-08-01 | 800    |
| Clothing        | T-Shirt      | 2023-08-01 | 50     |
| Electronics     | Smartphone   | 2023-08-02 | 450    |
| Clothing        | Jeans        | 2023-08-02 | 70     |


#### Groupby query:
```sql
SELECT product_category, product_type, SUM(amount) AS total_amount
FROM sales
GROUP BY product_category, product_type;
```

##### result : 

| product_category | product_type | total_amount |
|------------------|--------------|--------------|
| Electronics     | Smartphone   | 950          |
| Electronics     | Laptop       | 800          |
| Clothing        | T-Shirt      | 50           |
| Clothing        | Jeans        | 70           |


#### rollup query : 

```sql
SELECT product_category, product_type, SUM(amount) AS total_amount
FROM sales
GROUP BY ROLLUP (product_category, product_type);
-- MySQL syntax: GROUP BY product_category, product_type WITH ROLLUP;
```

##### results:

| product_category | product_type | total_amount |
|------------------|--------------|--------------|
| Electronics      | Smartphone   | 950          |
| Electronics      | Laptop       | 800          |
| Electronics      | NULL         | 1750         |
| Clothing         | T-Shirt      | 50           |
| Clothing         | Jeans        | 70           |
| Clothing         | NULL         | 120          |
| NULL             | NULL         | 1870         |

Rows with NULL in product_type are the per-category subtotals; the row with NULL in both columns is the grand total.


In the ROLLUP result:

- We get the regular group totals for each combination of product_category and product_type, such as Electronics with Smartphone and Laptop.


- __We also get subtotals for each individual product_category, showing the total amount for all Electronics products and all Clothing products.__


- __Finally, we have the grand total of all sales.__

### CUBE and Grouping sets:

Watch this video for a walkthrough: https://www.youtube.com/watch?v=KLPULneM4mo

### <span class="mark">GROUPING SETS:</span>

The GROUPING SETS operation allows you to specify multiple grouping sets within a single query. This provides more flexibility in choosing specific combinations of columns for subtotals and totals.

```sql
SELECT product_category, product_type, SUM(amount) AS total_amount
FROM sales
GROUP BY GROUPING SETS (
    (product_category, product_type),
    (product_category),
    ()
);

```

| product_category | product_type | total_amount |
|------------------|--------------|--------------|
| Electronics      | Smartphone   | 950          |
| Electronics      | Laptop       | 800          |
| Clothing         | T-Shirt      | 50           |
| Clothing         | Jeans        | 70           |
| Electronics      | NULL         | 1750         |
| Clothing         | NULL         | 120          |
| NULL             | NULL         | 1870         |

The rows with NULL in product_type are the per-category subtotals, and the all-NULL row is the grand total.


#### Difference between Grouping sets and ROLLUP 

- The key difference is that ROLLUP generates subtotals and grand totals automatically based on the columns specified in the ROLLUP clause. In contrast, GROUPING SETS allows you to explicitly define multiple grouping sets, giving you more control over which subtotals and grand totals you want to include in the result.

- While ROLLUP provides a structured approach with hierarchical subtotals, GROUPING SETS offers more flexibility to create custom combinations of subtotals and totals in a single query.

### CUBE

The CUBE operation generates subtotals and grand totals for all possible combinations of columns specified in the CUBE clause. It provides a more comprehensive approach, creating aggregates for every possible combination of dimensions.

```sql
SELECT product_category, product_type, SUM(amount) AS total_amount
FROM sales
GROUP BY CUBE (product_category, product_type);

```

| product_category | product_type | total_amount |
|------------------|--------------|--------------|
| Electronics      | Smartphone   | 950          |
| Electronics      | Laptop       | 800          |
| Clothing         | T-Shirt      | 50           |
| Clothing         | Jeans        | 70           |
| Electronics      | NULL         | 1750         |
| Clothing         | NULL         | 120          |
| NULL             | Smartphone   | 950          |
| NULL             | Laptop       | 800          |
| NULL             | T-Shirt      | 50           |
| NULL             | Jeans        | 70           |
| NULL             | NULL         | 1870         |

CUBE over two columns produces 4 grouping sets: (product_category, product_type), (product_category), (product_type), and (). The row order may differ between databases.


#### Difference between Roll up and CUBE:

- The key difference is in the scope of aggregation. While ROLLUP generates subtotals and grand totals in a hierarchical manner for specific columns, CUBE generates subtotals and grand totals for all possible combinations of the specified columns. CUBE provides a more comprehensive overview of the data, but it can result in a larger result set.



In summary, ROLLUP is more focused on structured subtotals, while CUBE provides a broader view by including aggregates for all possible combinations of dimensions. The choice between them depends on the level of detail and insight you need from your aggregated data.
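One way to see what these operators expand to is to write the grouping sets out by hand as a `UNION ALL` of plain `GROUP BY` queries. The sketch below does that with Python's built-in `sqlite3` module; SQLite does not support ROLLUP, CUBE, or GROUPING SETS, so this is an emulation on the sample `sales` rows from above, not the operators themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        product_category TEXT, product_type TEXT, sales_date TEXT, amount INTEGER
    )
""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Electronics", "Smartphone", "2023-08-01", 500),
    ("Electronics", "Laptop",     "2023-08-01", 800),
    ("Clothing",    "T-Shirt",    "2023-08-01", 50),
    ("Electronics", "Smartphone", "2023-08-02", 450),
    ("Clothing",    "Jeans",      "2023-08-02", 70),
])

# ROLLUP (category, type) = grouping sets (category, type), (category), ()
ROLLUP_SETS = """
    SELECT product_category, product_type, SUM(amount)
    FROM sales GROUP BY product_category, product_type
    UNION ALL
    SELECT product_category, NULL, SUM(amount)
    FROM sales GROUP BY product_category
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
"""
# CUBE adds the one remaining combination: (type)
CUBE_EXTRA = """
    UNION ALL
    SELECT NULL, product_type, SUM(amount)
    FROM sales GROUP BY product_type
"""

rollup_rows = conn.execute(ROLLUP_SETS).fetchall()
cube_rows = conn.execute(ROLLUP_SETS + CUBE_EXTRA).fetchall()
print(len(rollup_rows), len(cube_rows))
```

The row counts (7 for the ROLLUP expansion, 11 for the CUBE expansion) show why CUBE result sets grow faster as columns are added.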

### <span class="mark">REPLACE() in sql:</span> 

#### Question : 

Samantha was tasked with calculating the average monthly salaries for all employees in the EMPLOYEES table, but did not realize her __keyboard's  0 key was broken__ until after completing the calculation. She wants your help finding the difference between her miscalculation (using salaries with any zeros removed), and the actual average salary.


Write a query calculating the amount of error (i.e.:  __actual - miscalculated average monthly salaries__), and round it up to the next integer.

```sql
select 
CEIL(AVG(Salary) - AVG(replace(Salary,0,'')))
from employees;
```

#### output : 2253
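A small sketch of the same idea on made-up salaries, using Python's built-in `sqlite3` module. SQLite's `REPLACE()` works on text, so the salary is cast to TEXT and back explicitly, and the rounding up is done with Python's `math.ceil` because `CEIL` is not available in every SQLite build. Both choices are assumptions of this sketch rather than part of the original MySQL query.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("A", 1000), ("B", 2050), ("C", 3300), ("D", 4005)])

# With the zeros stripped, the salaries become 1, 25, 33 and 45.
actual_avg, broken_avg = conn.execute("""
    SELECT AVG(salary),
           AVG(CAST(REPLACE(CAST(salary AS TEXT), '0', '') AS INTEGER))
    FROM employees
""").fetchone()

error = math.ceil(actual_avg - broken_avg)
print(actual_avg, broken_avg, error)
```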

## COALESCE() - handles NULL values by returning a non-NULL replacement

The COALESCE function in SQL is used to return the first non-NULL value from a list of expressions. 


__It takes a list of one or more expressions as its arguments,__ and returns the first expression that is not NULL. If all expressions are NULL, then COALESCE returns NULL. Accepting any number of arguments is what distinguishes it from two-argument functions such as MySQL's IFNULL() or SQL Server's ISNULL().

The COALESCE function works as follows:

- It evaluates the expressions in the order they are provided.
- It returns the value of the first non-null expression.
- If all expressions are null, it returns null.

#### Example 1: Using COALESCE to Replace NULL Values:

| student_id | student_name | grade |
|------------|--------------|-------|
| 1          | Alice        | 85    |
| 2          | Bob          | NULL  |
| 3          | Carol        | 92    |
| 4          | Dave         | NULL  |
| 5          | Eve          | 78    |


```sql
SELECT student_id, student_name, COALESCE(grade, 'N/A') AS final_grade
FROM Students;
```

| student_id | student_name | final_grade |
|------------|--------------|-------------|
| 1          | Alice        | 85          |
| 2          | Bob          | N/A         |
| 3          | Carol        | 92          |
| 4          | Dave         | N/A         |
| 5          | Eve          | 78          |


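#### Example 2: Using COALESCE with Multiple Columns:

This assumes the Students table also has an `extra_credit` column (not shown in the table above). The output below corresponds to `extra_credit` being NULL for Bob and Dave, so their result falls through to 0.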

```sql
SELECT student_id, student_name, COALESCE(grade, extra_credit, 0) AS final_grade
FROM Students;
```

| student_id | student_name | final_grade |
|------------|--------------|-------------|
| 1          | Alice        | 85          |
| 2          | Bob          | 0           |
| 3          | Carol        | 92          |
| 4          | Dave         | 0           |
| 5          | Eve          | 78          |
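In the output above every missing grade falls through to 0, so the middle argument never gets used. The sketch below, run with Python's built-in `sqlite3` module on invented rows, shows all three outcomes: the first argument wins, the second argument wins, and the final fallback wins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        student_id INTEGER, student_name TEXT, grade INTEGER, extra_credit INTEGER
    )
""")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", [
    (1, "Alice", 85, None),    # grade present: grade is returned
    (2, "Bob", None, 5),       # grade NULL: extra_credit is returned
    (3, "Carol", None, None),  # both NULL: the literal 0 is returned
])

rows = conn.execute("""
    SELECT student_name, COALESCE(grade, extra_credit, 0) AS final_grade
    FROM Students
    ORDER BY student_id
""").fetchall()

print(rows)
```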


#### Example 3: Using COALESCE in a Conditional Expression:

```sql
SELECT student_id, student_name,
       CASE WHEN grade >= 90 THEN 'A'
            WHEN grade >= 80 THEN 'B'
            ELSE 'C'
       END AS letter_grade,
       COALESCE(grade, 0) AS final_grade
FROM Students;
```

| student_id | student_name | letter_grade | final_grade |
|------------|--------------|--------------|-------------|
| 1          | Alice        | B            | 85          |
| 2          | Bob          | C            | 0           |
| 3          | Carol        | A            | 92          |
| 4          | Dave         | C            | 0           |
| 5          | Eve          | C            | 78          |


### Difference between IS NOT NULL and COALESCE

- Use IS NOT NULL to filter rows where a specific column or expression has a non-null value.


- Use COALESCE to handle NULL values by providing an alternative non-null value, which is particularly useful when displaying data or performing calculations.


----

# 2. Subqueries & Window functions

## SUBQUERIES in SQL - Campusx

https://youtu.be/YYq47MN3TZI

A subquery is a query within another query. It is a SELECT statement that is
nested inside another SELECT, INSERT, UPDATE, or DELETE statement. A
non-correlated subquery is conceptually executed first, and its result is then
used as a parameter or condition for the outer query. A correlated subquery is
instead evaluated for each row of the outer query.

In SQL, subqueries are queries nested within another query to perform specific tasks. There are several types of subqueries, each serving different purposes:

1. **Scalar Subquery:**
   - Returns a single value (one row and one column) to the outer query.
   - Often used in expressions, comparisons, or calculations.
   - Example: `SELECT name, (SELECT MAX(salary) FROM employees) AS max_salary FROM employees;`


2. **Single-Row Subquery:**
   - Returns exactly one row (with one or more columns) to the outer query.
   - Typically used with single-row comparison operators such as `=`, `<`, `>`, `<=`, `>=`, `<>`.
   - Example: `SELECT name FROM employees WHERE salary = (SELECT MAX(salary) FROM employees);`


3. **Multi-Row Subquery:**
   - Returns multiple rows with one or more columns to the outer query.
   - Typically used with the `IN`, `ANY`, or `ALL` operators.
   - Example: `SELECT name FROM employees WHERE department_id IN (SELECT department_id FROM departments WHERE location = 'New York');`


4. **Correlated Subquery:**
   - References columns from the outer query in the inner query.
   - Executed once for each row processed in the outer query.
   - Example: `SELECT name FROM employees e WHERE salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id);`


5. **Nested Subquery:**
   - A subquery within another subquery.
   - Provides a way to perform more complex queries by building on multiple levels of nesting.
   - Example: `SELECT name FROM employees WHERE salary > (SELECT AVG(salary) FROM employees WHERE department_id = (SELECT department_id FROM departments WHERE name = 'Sales'));`


6. **Correlated EXISTS Subquery:**
   - Checks for the existence of rows in the subquery result.
   - Used with the `EXISTS` keyword in conditions.
   - Example: `SELECT name FROM customers c WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.order_date > '2023-01-01');`

These subquery types offer various ways to manipulate and retrieve data based on specific conditions or relationships within a database.

### <span class="mark">Difference between subqueries and correlated subqueries:</span>

__In essence, the key difference is that a correlated subquery establishes a connection between the inner and outer queries by using the current row's data from the outer query (e.g. `WHERE e1.department_id = e2.department_id`), whereas a non-correlated subquery operates independently of the outer query's data.__

Subqueries and correlated subqueries are both types of SQL queries, but they serve different purposes and have distinct characteristics:

1. **Subquery (Non-Correlated Subquery):**
    - A non-correlated subquery is a subquery that can run independently of the outer query.
    - It executes once and provides a single result set that the outer query uses.
    - It doesn't reference columns from the outer query, making it self-contained.
    - It's used to retrieve a single value or a set of values that are used as constants in the outer query.
   
```sql
SELECT employee_id, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
```

2. **Correlated Subquery:**
   - A correlated subquery is a subquery that references columns from the outer query.
   - The subquery's execution depends on the data of the current row being processed in the outer query.
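   - Conceptually, it is evaluated repeatedly, once for each row processed by the outer query (the optimizer may rewrite it as a join internally).
   - Correlated subqueries can be less efficient and slower than non-correlated subqueries, especially for large datasets without a suitable index on the correlated column.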
   - They are used when comparing data between the inner and outer queries is necessary.
   
```sql
SELECT employee_id, salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e1.department_id = e2.department_id
);
```
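The two queries above can be tried side by side. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL; the four-row `employees` table is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, department_id INTEGER, salary INTEGER);
INSERT INTO employees VALUES
  (1, 10, 100), (2, 10, 300),
  (3, 20, 400), (4, 20, 600);
""")

# Non-correlated: the inner query runs on its own (company-wide average = 350).
non_correlated = conn.execute("""
    SELECT employee_id FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY employee_id
""").fetchall()

# Correlated: the inner query uses e1.department_id from the current outer row,
# so each employee is compared with the average of their own department.
correlated = conn.execute("""
    SELECT employee_id FROM employees e1
    WHERE salary > (SELECT AVG(salary) FROM employees e2
                    WHERE e2.department_id = e1.department_id)
    ORDER BY employee_id
""").fetchall()

print(non_correlated)  # [(3,), (4,)]
print(correlated)      # [(2,), (4,)]
```

Employee 2 appears only in the correlated result: a salary of 300 is below the overall average but above the department 10 average of 200.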

# WINDOW FUNCTIONS

Window functions in SQL perform a calculation over a set of rows that are related to the current row (the window). Unlike an aggregate used with GROUP BY, a window function does not collapse the rows: every input row stays in the result and gets its own computed value.


The window specification is defined using the __OVER() clause in SQL__, which specifies
the partitioning and ordering of the rows that the window function will operate
on.


Window functions allow you to perform calculations that depend on the values of multiple rows in a query, and can be useful for tasks such as calculating running totals, moving averages, percentiles, and more.


### NOTE : The OVER clause is mandatory when using a window function.

## Types of Window functions :

- __ROW_NUMBER:__ Assigns a unique number to each row in the result set, based on the order specified in the ORDER BY clause of the OVER clause.


- __RANK:__ Assigns a rank to each row in the result set, based on the order specified in the ORDER BY clause of the OVER clause. Rows with the __same values receive the same rank, and a gap is left in the ranking for the next unique value.__


- __DENSE_RANK:__ Assigns a rank to each row in the result set, based on the order specified in the ORDER BY clause of the OVER clause. __Rows with the same values receive the same rank, and there is no gap in the ranking for the next unique value.__


- __NTILE:__ Divides the result set into a specified number of groups, or tiles, and assigns a number to each row indicating which tile it belongs to.


- __PERCENT_RANK:__ Calculates the relative rank of each row within the result set as a fraction between 0 and 1.


- __CUME_DIST:__ Calculates the cumulative distribution of a value within the result set, expressed as a fraction between 0 and 1.


- __LEAD and LAG:__ Return the value of a specified column from a row at a specified offset from the current row, either ahead (LEAD) or behind (LAG) in the result set.

__You use a window function by writing the function call (a ranking or value function, or an aggregate such as SUM or AVG) in the SELECT list, followed by an OVER clause that specifies the window for the function.__
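To see how the ranking functions and LAG differ on tied values, here is a small sketch. It assumes Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in for MySQL, and a made-up `marks` table with one tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE marks (student TEXT, marks INTEGER);
INSERT INTO marks VALUES ('A', 90), ('B', 90), ('C', 80), ('D', 70);
""")

rows = conn.execute("""
    SELECT student, marks,
           ROW_NUMBER() OVER (ORDER BY marks DESC, student) AS row_num,
           RANK()       OVER (ORDER BY marks DESC)          AS rnk,
           DENSE_RANK() OVER (ORDER BY marks DESC)          AS dense_rnk,
           LAG(marks)   OVER (ORDER BY marks DESC, student) AS prev_marks
    FROM marks
    ORDER BY marks DESC, student
""").fetchall()

for row in rows:
    print(row)
# ('A', 90, 1, 1, 1, None)
# ('B', 90, 2, 1, 1, 90)
# ('C', 80, 3, 3, 2, 90)
# ('D', 70, 4, 4, 3, 80)
```

A and B tie, so RANK jumps from 1 to 3 while DENSE_RANK continues with 2. LAG returns NULL (`None`) for the first row because there is no previous row.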

## 5. Last_value() - range between unbounded preceding and unbounded following

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
last_value(years_of_experience) over (partition by source_of_joining order by years_of_experience desc) as least_experienced
from students;
```

![image.png](attachment:image.png)

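### NOTE : The result is not what we expect because of the default FRAME clause

___Default FRAME clause:___ when a window has an ORDER BY but no explicit frame, the frame ends at the current row (and its peers), so last_value simply returns the current row's own value:

over (partition by source_of_joining order by years_of_experience desc __range between unbounded preceding and current row)__ as least_experienced

##### Changing current row to unbounded following lets last_value see the whole partition: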

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
last_value(years_of_experience) 
over (partition by source_of_joining order by years_of_experience desc 
range between unbounded preceding and unbounded following)  as least_experienced
from students;
```

![image.png](attachment:image.png)

### another example

```sql
SELECT *,
LAST_VALUE(marks) OVER (ORDER BY marks DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as lowest_marks
from task_campusx.marks;
```

![image.png](attachment:image.png)
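The effect of the default frame on LAST_VALUE can be reproduced on a few rows. This is a sketch under assumptions: Python's built-in `sqlite3` module stands in for MySQL, and the `students` rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER, source_of_joining TEXT, years_of_experience INTEGER);
INSERT INTO students VALUES
  (1, 'linkedin', 8), (2, 'linkedin', 5), (3, 'linkedin', 2),
  (4, 'google', 6), (5, 'google', 3);
""")

rows = conn.execute("""
    SELECT student_id, years_of_experience,
           -- default frame: stops at the current row
           LAST_VALUE(years_of_experience) OVER (
               PARTITION BY source_of_joining
               ORDER BY years_of_experience DESC) AS default_frame,
           -- explicit frame: covers the whole partition
           LAST_VALUE(years_of_experience) OVER (
               PARTITION BY source_of_joining
               ORDER BY years_of_experience DESC
               RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
    FROM students
    ORDER BY student_id
""").fetchall()

for row in rows:
    print(row)
# (1, 8, 8, 2)
# (2, 5, 5, 2)
# (3, 2, 2, 2)
# (4, 6, 6, 3)
# (5, 3, 3, 3)
```

With the default frame the third column just echoes each row's own value; only the full frame returns the least experience in each group.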

## FRAME clause


- A frame is a subset of the current partition over which the window function is evaluated.


- It defines the scope of the calculation performed by a window function: it specifies which rows are included in the calculation, based on their position relative to the current row.


___The ROWS clause___ specifies how many rows should be included in the frame
relative to the current row. For example, ROWS 3 PRECEDING means that the
frame includes the current row and the three rows that precede it in the partition.


___The BETWEEN___ clause specifies the boundaries of the frame.


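The FRAME clause has two main components:

- __Frame unit (ROWS or RANGE) :__ Specifies whether the frame is measured in physical rows (ROWS) or in values of the ORDER BY column (RANGE).


- __Frame extent (BETWEEN ... AND ...) :__ Specifies where the frame starts and ends, either UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING to include all rows in the partition, or a range such as 1 PRECEDING and 1 FOLLOWING to include only the current row and its two neighbors. Both boundary rows are always included in the frame; there is no INCLUSIVE or EXCLUSIVE keyword. (Standard SQL defines an optional EXCLUDE clause for dropping the current row or its peers, but MySQL does not support it.)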

Examples


- __ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW__ - means that the
frame includes all rows from the beginning of the partition up to and including the
current row.


- __ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING:__ the frame includes the
current row and the row immediately before and after it.


- __ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:__ the
frame includes all rows in the partition.


- __ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING:__ the frame includes the
current row and the three rows before it and the two rows after it.

#### Difference between (range between unbounded preceding and current row) AND (rows between unbounded preceding and current row)

- `(range between unbounded preceding and current row)` works on the values of the ordering column: the frame runs from the start of the partition through the current row and all of its peers, i.e. every row with the same ORDER BY value as the current row.


- `(rows between unbounded preceding and current row)` works on physical row positions: the frame runs from the first row of the partition (unbounded preceding) up to and including the current row only, so tied rows that sort after the current row are left out.
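The difference only shows up when the ORDER BY column has ties. A minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL and a hypothetical `sales` table in which the amount 20 occurs twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (amount INTEGER);
INSERT INTO sales VALUES (10), (20), (20), (30);
""")

# The order of the two tied rows is not defined, so the result is sorted in
# Python to get a stable listing.
rows = sorted(conn.execute("""
    SELECT amount,
           SUM(amount) OVER (ORDER BY amount
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS rows_sum,
           SUM(amount) OVER (ORDER BY amount
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum
    FROM sales
""").fetchall())

for row in rows:
    print(row)
# (10, 10, 10)
# (20, 30, 50)
# (20, 50, 50)
# (30, 80, 80)
```

With ROWS the running total grows row by row (30, then 50). With RANGE both rows with amount 20 are peers, so each of them already sees the total including the other (50).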

## 6. Nth_Value

The NTH_VALUE function in MySQL is a window function that returns the nth value __(any value from a position specified by us)__ in a set of values, based on a specified order. The function returns the value of a specified expression for the nth row in the window frame, where the frame is defined using the OVER clause.

##### 2nd most experienced person from each group

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
nth_value(years_of_experience, 2) 
over (partition by source_of_joining order by years_of_experience desc 
range between unbounded preceding and unbounded following)  as 2nd_most_experienced
from students;
```

![image.png](attachment:image.png)

### NOTE : If a group has fewer rows than the n passed to nth_value, the function returns NULL for every row of that group
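The NULL behaviour for small groups can be checked quickly. This sketch assumes Python's built-in `sqlite3` module as a stand-in for MySQL; the rows, including the single-row `referral` group, are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER, source_of_joining TEXT, years_of_experience INTEGER);
INSERT INTO students VALUES
  (1, 'linkedin', 8), (2, 'linkedin', 5), (3, 'linkedin', 2),
  (4, 'referral', 4);
""")

rows = conn.execute("""
    SELECT student_id,
           NTH_VALUE(years_of_experience, 2) OVER (
               PARTITION BY source_of_joining
               ORDER BY years_of_experience DESC
               RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS second_most_experienced
    FROM students
    ORDER BY student_id
""").fetchall()

print(rows)  # [(1, 5), (2, 5), (3, 5), (4, None)]
```

The `referral` group has only one row, so there is no second value and the function yields NULL (`None` in Python).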

## 7. Ntile

- Segmentation using NTILE is a technique in SQL for **dividing a dataset into groups of (as nearly as possible) equal size** based on some ordering,


- and then performing calculations or analysis on each group separately. It returns a number representing the group or tile that each row belongs to.


The NTILE(n) function in SQL is a window function that sorts the rows by the specified order, splits them into n __buckets__, and returns the bucket number (1 to n) of each row.



```sql
SELECT
  Student,
  Score,
  NTILE(3) OVER (ORDER BY Score) AS TileNumber
FROM
  scores;
```

```
+---------+-------+------------+
| Student | Score | TileNumber |
+---------+-------+------------+
| Frank   | 70    | 1          |
| Helen   | 75    | 1          |
| Carol   | 78    | 1          |
| Jack    | 83    | 1          |
| Alice   | 85    | 2          |
| David   | 88    | 2          |
| Grace   | 89    | 2          |
| Ivan    | 91    | 3          |
| Bob     | 92    | 3          |
| Eve     | 95    | 3          |
+---------+-------+------------+
```

#### example 2:

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
NTILE(2) over (order by years_of_experience desc) as ntile_groups 
from students
where source_of_joining = 'linkedin'
```

![image-2.png](attachment:image-2.png)

### NOTE : If the rows cannot be divided evenly, the extra rows go to the first groups (one extra row each)
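The uneven split can be verified with the ten-student `scores` example. A minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (student TEXT, score INTEGER);
INSERT INTO scores VALUES
  ('Alice', 85), ('Bob', 92), ('Carol', 78), ('David', 88), ('Eve', 95),
  ('Frank', 70), ('Grace', 89), ('Helen', 75), ('Ivan', 91), ('Jack', 83);
""")

rows = conn.execute("""
    SELECT student, score, NTILE(3) OVER (ORDER BY score) AS tile_number
    FROM scores
    ORDER BY score
""").fetchall()

# Count how many rows landed in each tile.
tile_sizes = Counter(tile for _, _, tile in rows)

print(dict(tile_sizes))  # {1: 4, 2: 3, 3: 3}
print(rows[3])           # ('Jack', 83, 1)
```

Ten rows split into three tiles gives 4 + 3 + 3, and the extra row lands in tile 1.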

### Using CASE with NTILE

```sql
select *,
CASE
when x.ntile_groups = 1 then 'Experienced Student'
when x.ntile_groups = 2 then 'Inexperienced Student'
END experience_category
from(
select student_id, student_fname, location,source_of_joining, years_of_experience,
ntile(2) over (order by years_of_experience desc) as ntile_groups from students
where source_of_joining = 'linkedin') x;
```

![image.png](attachment:image.png)

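## 8. CUME_DIST() - cumulative distribution

In statistics, the cumulative distribution function describes the probability that a random variable takes a value less than or equal to a given value.

It can be used for a discrete, continuous or mixed variable, and is obtained by summing (or integrating) the probability function up to that value.

In SQL, `CUME_DIST()` applies the same idea to the rows of a window: for each row it returns (number of rows that sort at or before the current row, including ties) / (total number of rows in the partition), so the result is always greater than 0 and at most 1.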

![image.png](attachment:image.png)

#### Students whose marks are above the 90th percentile:

```sql
SELECT * FROM 
      (SELECT *, CUME_DIST() OVER (ORDER BY marks) AS percentile_score
      FROM marks) t
WHERE t.percentile_score > 0.90;
```

![image.png](attachment:image.png)

#### Another example:

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
cume_dist() over (partition by source_of_joining order by years_of_experience desc) as cume_distri
from students;
```

![image.png](attachment:image.png)

### using it to fetch first 35% from each group

```sql
select *
from (
select student_id, student_fname, location,source_of_joining, years_of_experience,
round(cume_dist() over (partition by source_of_joining order by years_of_experience desc),3) as cume_distri
from students
) x
where x.cume_distri <= 0.35;
```

![image.png](attachment:image.png)

## 9. PERCENT_RANK

percent_rank is a window function in SQL that calculates the __relative rank of a row within a set of rows,__ represented as a decimal value between 0 and 1, where 0 is the first row in the window ordering and 1 is the last.

The percent_rank function is used in a similar way to the rank function, but instead of returning the rank as an integer, it returns the rank as a decimal value. The percent_rank function is calculated as: __(rank - 1) / (total number of rows - 1).__

```sql
select student_id, student_fname, location,source_of_joining, years_of_experience,
round(percent_rank() over (partition by source_of_joining order by years_of_experience desc),3) as percentage_rank
from students;
```

![image.png](attachment:image.png)

## Difference between cume_dist() and percent_rank()

The difference between percent_rank and cume_dist lies __in the way they calculate the relative position of a row within a set of rows.__

- __percent_rank__ returns the relative rank of a row as a decimal value between 0 and 1, where 0 is the first row in the window ordering and 1 is the last.

__The percent_rank function is calculated as (rank - 1) / (total number of rows - 1).__


- __cume_dist__ returns the cumulative distribution of a value within a set of values, represented as a decimal value greater than 0 and at most 1.

The cume_dist function calculates the fraction of rows in the partition that sort at or before the current row, counting ties: (rows with an ordering value less than or equal to the current row's value) / (total number of rows). With a descending ORDER BY the comparison is reversed.


```sql
select student_id, student_fname, years_of_experience,
round(percent_rank() over (order by years_of_experience asc)*100,3) as percentage_rank,
round(cume_dist() over (order by years_of_experience asc)*100,3) as cume_distri
from students;
```

![image.png](attachment:image.png)
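Both formulas can be checked by hand on five rows. A small sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL and made-up experience values with one tie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER, years_of_experience INTEGER);
INSERT INTO students VALUES (1, 1), (2, 2), (3, 2), (4, 3), (5, 5);
""")

rows = conn.execute("""
    SELECT student_id,
           PERCENT_RANK() OVER (ORDER BY years_of_experience) AS pct_rank,
           CUME_DIST()    OVER (ORDER BY years_of_experience) AS cume
    FROM students
    ORDER BY student_id
""").fetchall()

# Round in Python only to keep the printed values short.
rounded = [(sid, round(p, 2), round(c, 2)) for sid, p, c in rows]

for row in rounded:
    print(row)
# (1, 0.0, 0.2)
# (2, 0.25, 0.6)
# (3, 0.25, 0.6)
# (4, 0.75, 0.8)
# (5, 1.0, 1.0)
```

For student 2 the rank is 2, so percent_rank is (2 - 1) / (5 - 1) = 0.25, while three of the five rows have a value of 2 or less, so cume_dist is 3 / 5 = 0.6.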

## 10. CUMULATIVE SUM

Cumulative sum is another type of calculation that can be performed using
window functions. A cumulative sum calculates the sum of a set of values up to a
given point in time, and includes all previous values in the calculation.

### Career runs of Virat Kohli after his 50th and 100th match

```sql
SELECT * FROM 
    (SELECT 
    CONCAT("Match-",CAST(ROW_NUMBER() OVER(ORDER BY ID) AS CHAR)) AS match_number,
    SUM(batsman_run) as 'runs_scored',
    -- ORDER BY inside OVER makes the running total follow match order
    SUM(SUM(batsman_run)) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as career_runs
    from ipl
    WHERE batter = 'V Kohli'
    GROUP BY ID) as t
WHERE match_number IN ('Match-50','Match-100');
```

![image.png](attachment:image.png)

## 11. CUMULATIVE AVERAGE

Cumulative average is another type of average that can be calculated using
window functions. A cumulative average calculates the average of a set of values
up to a given point in time, and includes all previous values in the calculation.

```sql
SELECT * FROM 
    (SELECT 
    CONCAT("Match-",CAST(ROW_NUMBER() OVER(ORDER BY ID) AS CHAR)) AS match_number,
    SUM(batsman_run) as 'runs_scored',
    SUM(SUM(batsman_run)) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as career_runs,
    AVG(SUM(batsman_run)) OVER (ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as career_average
    from ipl
    WHERE batter = 'V Kohli'
    GROUP BY ID) as t;
```

![image.png](attachment:image.png)

## 12. MOVING AVERAGE/ ROLLING AVERAGE

![image.png](attachment:image.png)

```sql
SELECT * FROM 
    (SELECT 
    CONCAT("Match-",CAST(ROW_NUMBER() OVER(ORDER BY ID) AS CHAR)) AS match_number,
    SUM(batsman_run) as 'runs_scored',
    SUM(SUM(batsman_run)) OVER(ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as career_runs,
    AVG(SUM(batsman_run)) OVER (ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as career_average,
    -- 9 PRECEDING + CURRENT ROW = average over the last 10 matches
    AVG(SUM(batsman_run)) OVER (ORDER BY ID ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) as moving_average
    from ipl
    WHERE batter = 'V Kohli'
    GROUP BY ID) as t;
```

![image.png](attachment:image.png)
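The three running calculations (cumulative sum, cumulative average, moving average) differ only in their frame. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for MySQL and a tiny invented `ipl` table. To keep it easy to follow, the per-match totals are computed first in a CTE (the hypothetical `per_match`) instead of nesting `SUM(SUM(...))`, and the moving average uses a window of 2 matches rather than 10.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ipl (ID INTEGER, batter TEXT, batsman_run INTEGER);
INSERT INTO ipl VALUES
  (1, 'V Kohli', 4), (1, 'V Kohli', 6),
  (2, 'V Kohli', 20),
  (3, 'V Kohli', 1), (3, 'V Kohli', 29),
  (4, 'V Kohli', 40);
""")

rows = conn.execute("""
    WITH per_match AS (
        SELECT ID, SUM(batsman_run) AS runs_scored
        FROM ipl
        WHERE batter = 'V Kohli'
        GROUP BY ID
    )
    SELECT ID, runs_scored,
           SUM(runs_scored) OVER (ORDER BY ID
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS career_runs,
           AVG(runs_scored) OVER (ORDER BY ID
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS career_average,
           AVG(runs_scored) OVER (ORDER BY ID
               ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_average
    FROM per_match
    ORDER BY ID
""").fetchall()

for row in rows:
    print(row)
# (1, 10, 10, 10.0, 10.0)
# (2, 20, 30, 15.0, 15.0)
# (3, 30, 60, 20.0, 25.0)
# (4, 40, 100, 25.0, 35.0)
```

For match 3 the career average covers all three matches (60 / 3 = 20), while the moving average covers only matches 2 and 3 ((20 + 30) / 2 = 25).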

## 13. Percentage of total

Percent of total refers to the percentage or proportion of a specific value in
relation to the total value. It is a commonly used metric to represent the relative
importance or contribution of a particular value within a larger group or
population.

```sql
SELECT
  Product,
  Revenue,
  ROUND((Revenue / SUM(Revenue) OVER ()) * 100, 2) AS Percentage_of_Total
FROM
  sales;
```

Here's the output table. The total revenue is 100 + 150 + 200 + 120 + 250 = 820, so each percentage is Revenue / 820 * 100:

|   Product |   Revenue |   Percentage_of_Total |
|:---------:|----------:|-----------------------:|
| Product A |    100.00 |                  12.20 |
| Product B |    150.00 |                  18.29 |
| Product A |    200.00 |                  24.39 |
| Product B |    120.00 |                  14.63 |
| Product A |    250.00 |                  30.49 |
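The percentage-of-total query can be run against the five revenue values used in this section. A minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL; `ORDER BY rowid` is a SQLite-specific way to keep insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, revenue REAL);
INSERT INTO sales VALUES
  ('Product A', 100), ('Product B', 150), ('Product A', 200),
  ('Product B', 120), ('Product A', 250);
""")

rows = conn.execute("""
    SELECT product, revenue,
           -- OVER () with no partition: the window is the whole result set
           ROUND(revenue * 100.0 / SUM(revenue) OVER (), 2) AS percentage_of_total
    FROM sales
    ORDER BY rowid
""").fetchall()

percentages = [pct for _, _, pct in rows]
print(percentages)  # [12.2, 18.29, 24.39, 14.63, 30.49]
```

The five percentages add up to 100, which is a quick sanity check for any percent-of-total calculation.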


## 14. Percentage of change

Percent change is a way of expressing the difference between two
values as a percentage of the original value. It is often used to measure
how much a value has increased or decreased over a given period of
time, or to compare two different values.

![image.png](attachment:image.png)

```sql
SELECT 
YEAR(Date),QUARTER(Date), SUM(views) AS 'views',
((SUM(views) - LAG(SUM(views)) OVER (ORDER BY YEAR(Date),QUARTER(Date)))/LAG(SUM(views)) 
OVER(ORDER BY YEAR(Date),QUARTER(Date)))*100 AS 'Percent_change'
FROM youtube_views
GROUP BY YEAR(Date),QUARTER(Date)
ORDER BY YEAR(Date),QUARTER(Date);
```
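The LAG-based percent change can be sketched on already-aggregated data. This example assumes Python's built-in `sqlite3` module as a stand-in for MySQL; because SQLite has no `YEAR()` or `QUARTER()` functions, it uses a hypothetical `quarterly_views` table that already holds one row per quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quarterly_views (year INTEGER, quarter INTEGER, views INTEGER);
INSERT INTO quarterly_views VALUES (2023, 1, 1000), (2023, 2, 1500), (2023, 3, 1200);
""")

rows = conn.execute("""
    SELECT year, quarter, views,
           ROUND((views - LAG(views) OVER (ORDER BY year, quarter)) * 100.0
                 / LAG(views) OVER (ORDER BY year, quarter), 2) AS percent_change
    FROM quarterly_views
    ORDER BY year, quarter
""").fetchall()

for row in rows:
    print(row)
# (2023, 1, 1000, None)
# (2023, 2, 1500, 50.0)
# (2023, 3, 1200, -20.0)
```

The first quarter has no previous value, so LAG returns NULL and the percent change is NULL as well.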

## 15. Percentiles & Quantiles

A __Quantile__ is a measure of the distribution of a dataset that divides the data into
any number of equally sized intervals. 


For example, a dataset could be divided into

- __deciles__ (ten equal parts), 


- __quartiles__ (four equal parts), 


- __percentiles__ (100 equal parts), or any other number of intervals.


Each quantile represents a value below which a certain percentage of the data
falls. For example, the 25th percentile (also known as the first quartile, or Q1)
represents the value below which 25% of the data falls. The 50th percentile (also
known as the median) represents the value below which 50% of the data falls, and
so on.

## 16. PERCENTILE_CONT and PERCENTILE_DISC

**PERCENTILE_CONT** calculates the continuous percentile value: it locates the requested position in the sorted data and linearly interpolates between the two adjacent data points. This function can therefore return a value that is not present in the original dataset.


**PERCENTILE_DISC**, on the other hand, calculates the discrete percentile value: it returns the first value in the sorted data whose cumulative distribution is greater than or equal to the requested percentile. This function always returns a value that is present in the original dataset.
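The difference between the two is easiest to see in plain code. The functions below are a Python sketch of the usual definitions (linear interpolation at position `fraction * (n - 1)` for the continuous version, first value with cumulative distribution >= `fraction` for the discrete one). They are illustrative helpers, not the implementation of any particular database.

```python
import math

def percentile_cont(values, fraction):
    """Interpolate linearly at position fraction * (n - 1) in the sorted values."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

def percentile_disc(values, fraction):
    """Return the first sorted value whose cumulative distribution is >= fraction."""
    ordered = sorted(values)
    count = len(ordered)
    for index, value in enumerate(ordered, start=1):
        if index / count >= fraction:
            return value
    return ordered[-1]

marks = [10, 20, 30, 40]
print(percentile_cont(marks, 0.5))  # 25.0
print(percentile_disc(marks, 0.5))  # 20
```

The median of four values falls between 20 and 30: the continuous version interpolates to 25.0, which is not in the data, while the discrete version returns 20, an actual data point.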

## EXISTS / NOT EXISTS

#### used in correlated nested query

```sql
SELECT column_name(s)
FROM table_name
WHERE EXISTS
(SELECT column_name FROM table_name WHERE condition);
```

- The EXISTS operator is used to test for the existence of any record in a subquery.


- The EXISTS operator returns TRUE if the subquery returns one or more records


- __Each row of Outer query table will be compared with rows of inner query table. Entire inner query will run for each row of outer query.__

```sql
SELECT *
FROM users
WHERE EXISTS (
  SELECT *
  FROM orders
  WHERE orders.user_id = users.user_id
);
```

![image.png](attachment:image.png)

In this example, we're using EXISTS with a correlated subquery to check whether the orders table has any rows whose user_id equals the user_id of the current row in the users table.


If the subquery returns at least one row for a user, the EXISTS condition is true for that user, and the outer query returns that user's row.

If the subquery returns no rows for a user, the EXISTS condition is false, and that user is left out of the result. With NOT EXISTS the logic is reversed: only users without any orders are returned.
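EXISTS and NOT EXISTS split the outer table into two complementary sets. A minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for MySQL; the user names and order rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, user_id INTEGER);
INSERT INTO users VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
""")

# Users with at least one order.
with_orders = conn.execute("""
    SELECT name FROM users
    WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.user_id)
    ORDER BY user_id
""").fetchall()

# Users with no orders at all.
without_orders = conn.execute("""
    SELECT name FROM users
    WHERE NOT EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.user_id)
    ORDER BY user_id
""").fetchall()

print(with_orders)     # [('Asha',), ('Chen',)]
print(without_orders)  # [('Ben',)]
```

Asha has two orders but appears only once: EXISTS only asks whether a matching row exists, it does not join the rows.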

## EXISTS vs ANY

#####  "EXISTS" checks whether a subquery returns any rows, whereas "ANY" compares a value with the set of values returned by a subquery.

#### "EXISTS":
- "EXISTS" is used to check whether a subquery returns any rows. It returns a Boolean value (true or false) based on whether any rows are returned by the subquery.

- It is often used in correlated subqueries, where the subquery refers to a table in the outer query.



#### "ANY"

- "ANY" is used to compare a single value to a set of values returned by a subquery.



- It is used in conjunction with comparison operators such as "=", "<", ">", etc., and the condition is true if the comparison holds for at least one value in the result set of the subquery.


- The syntax is a comparison operator followed by "ANY" and a subquery enclosed in parentheses.

```sql
SELECT *
FROM products
WHERE price > ANY (
    SELECT price
    FROM products
    WHERE category = 'Electronics'
);
```
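`price > ANY (...)` is true when the price is greater than at least one value from the subquery, which amounts to being greater than the smallest one. SQLite has no ANY operator, so this sketch (Python's built-in `sqlite3` module, invented `products` rows) shows the equivalent `MIN` rewrite next to the same rule spelled out in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (name TEXT, category TEXT, price INTEGER);
INSERT INTO products VALUES
  ('Phone', 'Electronics', 300), ('Cable', 'Electronics', 10),
  ('Desk', 'Furniture', 150), ('Pen', 'Stationery', 5);
""")

# Rewrite of "price > ANY (subquery)": compare with the smallest subquery value.
via_min = [name for (name,) in conn.execute("""
    SELECT name FROM products
    WHERE price > (SELECT MIN(price) FROM products WHERE category = 'Electronics')
    ORDER BY name
""")]

# The same rule, row by row: keep a product if its price beats at least one
# Electronics price.
electronics = [price for (price,) in conn.execute(
    "SELECT price FROM products WHERE category = 'Electronics'")]
via_any = sorted(
    name for name, price in conn.execute("SELECT name, price FROM products")
    if any(price > other for other in electronics)
)

print(via_min)  # ['Desk', 'Phone']
print(via_any)  # ['Desk', 'Phone']
```

The Cable (10) is not returned: it is not greater than itself or than the Phone, so no comparison succeeds.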

# Indexes

good video : https://www.youtube.com/watch?v=fsG1XaZEa78

- An index is a database object that makes data retrieval faster.


- It is created on columns that are frequently used in searches, joins, or sorting. An index is built on specific columns of a table and stores a copy of the data in those columns in a separate data structure. This allows the DBMS to rapidly locate data without having to scan the entire table.



- Indexes work similarly to the index of a book, helping the database locate the desired data more efficiently. Without indexes, the DBMS would need to scan the entire table, which can become inefficient for large datasets.
 

- __Indexes for primary key and unique constraints are created automatically when the constraint is defined, and dropped when the constraint or the table is dropped.__


- An index improves SELECT performance but slows down INSERT, UPDATE, and DELETE, because every index must be maintained on each write. So it is not a good idea to create an index on every column.


- An index contains redundant data that already exists in the table, so it consumes extra space.


- Each table can have only one clustered index, usually created on the primary key.


- A table can have many non-clustered indexes; the exact limit depends on the DBMS.

__Index Key :__ Column on which we create Index 

## Types of indexes:

### __1. Clustered Index__

- A clustered index is a special type of index that determines the physical order of the rows in a table. This means that the data in the table is stored in the same order as the clustered index.



- As a result, the primary key is often used as the clustered index of a table, as this provides fast access to rows based on the primary key value. __In a table there can only be 1 clustered index,__ because the rows can be physically ordered in only one way.


- A clustered index can be built on a single column (e.g. the primary key) or on multiple columns (a composite key), in which case it is known as a composite clustered key.

### 2. Non-clustered Index

- __Typically created on columns other than the primary key__


- A non-clustered index is a type of index that does not physically reorder the rows of a table. Instead, it creates a separate structure that maps the values of one or more columns in the table to their physical location. 


- When a query is executed that uses a non-clustered index, the database must first look up the index to find the physical location of the data, and then retrieve the actual data from the table.


-  A table can have multiple non-clustered indexes

## Difference between Clustered and Non-clustered Indexes

- The main difference between clustered and non-clustered indexes is that a clustered index physically reorders the rows of a table to match the index, while a non-clustered index provides a mapping of values to physical locations but does not change the physical order of the table.


-  A table can have multiple non-clustered indexes, but it can only have 1 clustered index

## SEEK and SCAN in sql

https://www.youtube.com/watch?v=gZu2ZldwrK4

"Seek" and "Scan" are two methods a database engine can use to __search for data in a table.__ The terms come from SQL Server execution plans; MySQL's `EXPLAIN` output shows the same distinction through access types such as `ref` or `range` (index lookup) versus `ALL` (full table scan).

- __Seek__ is a direct lookup method, where the engine **uses the index of a table** to navigate straight to the matching rows. This method is fast and efficient, but it needs an index whose leading column(s) appear in the search condition, typically as an equality or range predicate.


- __Scan__ on the other hand, is a method where the engine reads the entire table (or an entire index) to find the rows that match a certain condition. This method is usually slower than "Seek" on large tables, but it can be used to find all rows that match a certain condition, even if no index exists for the columns being searched. Scans are also used to return all rows in a table if no specific search condition is provided.


__In summary, "Seek" is a fast, direct lookup method that uses an index to find specific rows, while "Scan" is a slower method that reads all rows and keeps those that match a certain condition.__

### CREATE INDEX

The CREATE INDEX statement is used to create indexes in tables.

Indexes are used to retrieve data from the database more quickly than otherwise. The users cannot see the indexes, they are just used to speed up searches/queries.

```sql
CREATE INDEX index_name
ON table_name (column_name);
```

### Note: Updating a table with indexes takes more time than updating a table without (because the indexes also need an update). So, only create indexes on columns that will be frequently searched against.

#### CREATE UNIQUE INDEXES

```sql
CREATE UNIQUE INDEX index_name
ON table_name (column_name);
```

#### COMPOSITE INDEX

```sql
CREATE INDEX index_name
ON table_name (column1, column2);
```

#### DROP INDEX :

```sql
ALTER TABLE table_name
DROP INDEX index_name;
```

#### View the indexes of a particular table

```sql
SHOW INDEX FROM table_name;                                 -- MySQL
SELECT * FROM user_indexes WHERE table_name = 'TABLE_NAME'; -- Oracle
```

### Types of scans in sql:

Important types of scans in SQL, explained briefly:

1. **Table Scan**:
   - Reads the entire table, row by row.
   
   - Used when there's no suitable index or for queries requiring all table data.
   


2. **Index Scan**:
   
   - Scans an index structure to locate rows that match a condition.
   
   - Can be slow for large data ranges or if many rows match the condition.



3. **Clustered Index Scan**:
   
   - Scans the entire table based on the order of the clustered index.
   
   - Used when the query can't utilize an index seek on the clustered index.



4. **Index Seek**:
   
   - Directly finds specific rows using an index. 
   
   - Efficient for targeted searches and equality conditions.



5. **Covering Index Scan**:
   - Uses a specialized index that includes all columns needed for a query.
   
   - Reduces the need to access the actual table, improving performance.

These are the key scan types to be aware of in SQL. The choice of scan type depends on the query, indexing, and the specific data retrieval needs.
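The access method chosen for a query can be inspected with the database's plan output. The sketch below assumes Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`, not MySQL's `EXPLAIN`; the exact wording of the plan differs between SQLite versions, so only the keywords SCAN, SEARCH and COVERING INDEX are relied on. The `sales` table and the helper `plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY, customer TEXT, sales_date TEXT, amount REAL)
""")

def plan(sql):
    """Return the human-readable detail text of EXPLAIN QUERY PLAN for sql."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT amount FROM sales WHERE sales_date = '2023-01-01'"

before = plan(query)  # no usable index yet: full table scan
conn.execute("CREATE INDEX idx_sales_date ON sales(sales_date)")
after = plan(query)   # index lookup, then fetch amount from the table

# Only the indexed column is selected, so the index alone can answer the query.
covering = plan("SELECT sales_date FROM sales WHERE sales_date >= '2023-01-01'")

print(before)
print(after)
print(covering)
```

Before the index exists the plan reports a SCAN of `sales`; afterwards it reports a SEARCH using `idx_sales_date`, and the last query is answered from a COVERING INDEX without touching the table rows.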

### B-Tree index

- B-Tree index is like an organized, multi-level list of values in a database, making it quick to find specific data. 


- It's structured hierarchically, with branching nodes and leaves, helping databases efficiently search and retrieve information, especially in large datasets.


- B-tree is a self-balancing tree structure where each node can have multiple child nodes. The "B" is commonly read as "balanced", although its inventors never stated what it stands for. The structure maintains balance by splitting, merging, and redistributing nodes as data is inserted or deleted.




**Characteristics of B-Tree Index:**

1. **Sorted Order:** B-trees maintain data in a sorted order based on the indexed columns. This enables efficient range-based queries and ordered retrieval.


2. **Balanced Structure:** B-trees are self-balancing, ensuring that the height of the tree remains relatively small. This ensures efficient search operations.


3. **Multiple Levels:** B-trees can have multiple levels of nodes; each level narrows down the range of keys that still has to be searched.


4. **Branching Factor:** Each node in a B-tree can have many children; the maximum number is known as the "branching factor." A high branching factor keeps the tree shallow, so a lookup touches only a few nodes.


5. **Root and Leaf Nodes:** B-trees have a root node at the top, which branches into intermediate nodes, and ultimately into leaf nodes. In the B+tree variant used by most databases, the leaf nodes hold the actual rows (clustered index) or pointers to them (non-clustered index).


6. **Efficient Insertion and Deletion:** B-trees maintain their balance and structure during insertion and deletion operations, optimizing performance.

![Screenshot%202023-09-14%20232729.png](attachment:Screenshot%202023-09-14%20232729.png)

**How B-Tree Indexes Are Utilized:**

When you create a B-tree index on a column or set of columns in a table, the DBMS creates a separate data structure that organizes the indexed data in a B-tree format. This index structure is then used by the DBMS to quickly locate the rows that satisfy query conditions involving the indexed columns.

B-tree indexes are particularly effective for scenarios where you need to perform range-based searches, such as finding records within a specific date range, or retrieving data in ascending or descending order based on indexed columns.

Example of Creating a B-Tree Index:

```sql
CREATE INDEX idx_sales_date ON sales(sales_date);
```

In this example, an index named "idx_sales_date" is created on the "sales_date" column of the "sales" table. This B-tree index would improve the efficiency of queries involving date-based range searches.

B-tree indexes are a fundamental tool in database optimization, helping to significantly enhance the performance of various types of queries.
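Why sorted order makes range searches cheap can be shown without building a full B-tree. The following is a simplified stand-in, not a B-tree: a sorted Python list plays the role of the index's leaf level, and binary search (`bisect`) plays the role of descending from the root. The dates, row ids and the helper `range_lookup` are made up.

```python
import bisect

# (key, row_id) pairs kept in key order, like the leaf level of an index.
index = sorted([
    ("2023-01-05", 1), ("2023-03-20", 2), ("2023-02-11", 3),
    ("2023-04-02", 4), ("2023-02-28", 5),
])
keys = [key for key, _ in index]

def range_lookup(low, high):
    """Return row ids whose key lies in [low, high], in key order."""
    start = bisect.bisect_left(keys, low)    # first position with key >= low
    end = bisect.bisect_right(keys, high)    # first position with key > high
    return [row_id for _, row_id in index[start:end]]

print(range_lookup("2023-02-01", "2023-03-31"))  # [3, 5, 2]
```

Two binary searches find the start and end of the range, and everything in between is read sequentially. A real B-tree achieves the same effect with a few node visits per lookup and stays balanced as rows are inserted or deleted.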