# Data Engineer

- b tree and binary search
- different sorting: https://www.cs.cmu.edu/~adamchik/15-121/lectures/Sorting%20Algorithms/sorting.html
- index in sql
- merge strategy
- relational vs non relational vs graph database
- column oriented


- python
- sql
- elastic search

## Memory hierarchy

- Processor registers – the fastest possible access (usually 1 CPU cycle). A few thousand bytes in size
- Cache
    - Level 0 (L0) Micro operations cache – 6 KiB in size
    - Level 1 (L1) Instruction cache – 128 KiB in size
    - Level 1 (L1) Data cache – 128 KiB in size. Best access speed is around 700 GiB/second
    - Level 2 (L2) Instruction and data (shared) – 1 MiB in size. Best access speed is around 200 GiB/second
    - Level 3 (L3) Shared cache – 6 MiB in size. Best access speed is around 100 GB/second
    - Level 4 (L4) Shared cache – 128 MiB in size. Best access speed is around 40 GB/second
- Main memory (Primary storage) – Gigabytes in size. Best access speed is around 10 GB/second.
- Disk storage (Secondary storage) – Terabytes in size. As of 2017, the best access speed from a consumer solid state drive is about 2000 MB/second

## Linked list

A linked list is a linear data structure in which the elements are not stored at contiguous memory locations. Instead, the elements are linked to one another using pointers.

In simple words, a linked list consists of nodes where each node contains a data field and a reference (link) to the next node in the list.
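The node structure described above can be sketched in a few lines. This is a minimal illustration only; the names `Node` and `to_list` are made up for this example.

```python
class Node:
    """One element of a singly linked list: a data field plus a link to the next node."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node


def to_list(head):
    """Walk the chain from the head and collect each node's data."""
    items = []
    node = head
    while node is not None:
        items.append(node.data)
        node = node.next
    return items


# Build the list 1 -> 2 -> 3 by nesting the nodes.
head = Node(1, Node(2, Node(3)))

# Inserting after the head only rewires references; no element is shifted.
head.next = Node(99, head.next)
```

Reaching the n-th element still requires following `next` from the head, which is why indexed access is linear.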

##### Key Differences Between Array and Linked List:
1. An array is a collection of elements of the same type, whereas a linked list is a non-primitive data structure made of linked elements known as nodes.
2. In an array, elements are addressed by index: to get the fourth element, you write the variable name with its index in square brackets.
3. In a linked list, you have to start from the head and follow the links until you reach the fourth element.
4. Accessing an element by index takes constant time in an array, while it takes linear time in a linked list, so it is considerably slower.
5. Insertion and deletion in the middle of an array are slow because the following elements must be shifted. In a linked list these operations are fast once you hold a reference to the node at the insertion point.
6. Arrays (in the classic, static sense) have a fixed size. Linked lists are dynamic and can grow and shrink.
7. For a static array, memory is assigned at compile time, while for a linked list it is allocated at runtime.
8. Elements are stored consecutively in memory in an array, whereas the nodes of a linked list can be located anywhere in memory.
9. An array needs less memory per element because it only stores the data itself. A linked list needs more memory because each node also stores a next (and, in a doubly linked list, a previous) reference.
10. On the other hand, a fixed-size array can waste memory when it is allocated larger than needed, while a linked list only allocates nodes for the elements it actually holds.


##### Conclusion
So linked lists provide the following two advantages over arrays:
1. Dynamic size
2. Ease of insertion/deletion

Linked lists have the following drawbacks:
1. Random access is not allowed. We have to access elements sequentially starting from the first node. So we cannot do a binary search with linked lists.
2. Extra memory space for a pointer is required with each element of the list.
3. Arrays have better cache locality that can make a pretty big difference in performance.

## Index in sql

##### The advantages of indexes are as follows:

1. Their use in queries usually results in much better performance.
2. They make it possible to quickly retrieve (fetch) data.
3. They can be used for sorting. A post-fetch-sort operation can be eliminated.
4. Unique indexes guarantee uniquely identifiable records in the database.

##### The disadvantages of indexes are as follows:

1. They decrease performance on inserts, updates, and deletes.
2. They take up space (this increases with the number of fields used and the length of the fields).
3. Some databases will monocase values in fields that are indexed.


You should only create indexes when they are actually needed.
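To see an index change how a query is executed, one option is to compare query plans before and after creating it. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table `person`, the index name `idx_person_name`, and the helper `plan` are assumptions made for this example, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person (name, age) VALUES (?, ?)",
    [(f"user{i}", i % 90) for i in range(1000)],
)

query = "SELECT * FROM person WHERE name = 'user42'"


def plan(sql):
    """Return the detail text of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # without an index: a scan over the whole table
conn.execute("CREATE INDEX idx_person_name ON person (name)")
plan_after = plan(query)   # with the index: a search that names idx_person_name
```

Printing `plan_before` and `plan_after` shows the switch from a table scan to an index search. The cost is that every insert, update, and delete on `person` now has to maintain the index as well.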

## Relational vs non relational

A relational database is one where data is stored in the form of tables. Each table has a schema, which defines the columns and types a record is required to have. Each table normally has a primary key that uniquely identifies each record, so no two rows are identical. Moreover, each table can be related to other tables using foreign keys.

One important aspect of relational databases is that a change in a schema must be applied to all records. This can sometimes cause breakages and big headaches during migrations. Non-relational databases tackle things in a different way. They are inherently schema-less, which means that records can be saved with different schemas and with a different, nested structure. Records can still have primary keys, but a change in the schema is done on an entry-by-entry basis.

If you have a constantly changing schema, such as financial regulatory information, then NoSQL can modify the records and nest related information.

Databases also differ in scalability. A non-relational database may be less of a headache to distribute. That’s because a collection of related records can be easily stored on a particular node. On the other hand, relational databases require more thought and usually make use of a master-slave system.

## Python

Q45. What advantages do NumPy arrays offer over (nested) Python lists?
Ans: 
Python’s lists are efficient general-purpose containers. They support (fairly) efficient insertion, deletion, appending, and concatenation, and Python’s list comprehensions make them easy to construct and manipulate.
They have certain limitations: they don’t support “vectorized” operations like elementwise addition and multiplication, and the fact that they can contain objects of differing types means that Python must store type information for every element, and must execute type dispatching code when operating on each element.
NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
NumPy arrays are faster, and you get a lot built in with NumPy: FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc.

How do break, continue and pass work?

- `break`: terminates the loop when some condition is met; control is transferred to the first statement after the loop.
- `continue`: skips the rest of the current iteration when some condition is met; control is transferred back to the top of the loop for the next iteration.
- `pass`: used when a block of code is required syntactically but nothing should be executed. It is a null operation: nothing happens when it runs.
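A short loop that uses all three statements, as a minimal sketch (the variable name `collected` is arbitrary):

```python
collected = []
for n in range(10):
    if n == 7:
        break        # leave the loop entirely; 7, 8 and 9 are never visited
    if n % 2 == 0:
        continue     # skip the rest of this iteration for even numbers
    if n == 3:
        pass         # placeholder: does nothing, execution simply carries on
    collected.append(n)

print(collected)  # [1, 3, 5]
```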



Q49. What is the difference between deep and shallow copy?
Ans: A shallow copy creates a new container object but fills it with references to the same objects that the original contains. Because the nested objects are shared, a change made to one of them through the original is also visible through the copy. A shallow copy is faster and uses less memory, since only the outer object is duplicated.
A deep copy creates a new container and recursively copies every object found inside it, so the copy shares no mutable state with the original. Changes made to the original do not affect the copy. A deep copy is slower and uses more memory, because every nested object is duplicated.
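The difference shows up as soon as a nested object is mutated. A minimal sketch using the standard `copy` module (variable names are illustrative):

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # new outer list and new inner lists

original[0].append(99)             # mutate a nested object in place
original.append([5, 6])            # mutate only the outer container

print(shallow)  # [[1, 2, 99], [3, 4]]  (shares the inner list, not the outer one)
print(deep)     # [[1, 2], [3, 4]]      (fully independent)
```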





Q34. What is the usage of the help() and dir() functions in Python?
Ans: Both help() and dir() are built-in functions available from the Python interpreter and are used to inspect objects.
help(): displays the documentation string of an object, and also shows help for modules, keywords, attributes, etc.
dir(): lists the names (attributes and methods) defined on an object, or the names in the current scope when called without arguments.


Q58. Does Python support multiple inheritance?
Ans: Multiple inheritance means that a class can be derived from more than one parent class. Python does support multiple inheritance, unlike Java, which only allows a class to extend a single class.

Q59. What is polymorphism in Python?
Ans: Polymorphism means the ability to take multiple forms. For instance, if the parent class has a method named ABC, then the child class can also have a method with the same name ABC with its own parameters and behavior. Python allows polymorphism.

Q60. Define encapsulation in Python.
Ans: Encapsulation means binding the code and the data together. A Python class is an example of encapsulation.

Q61. How do you do data abstraction in Python?
Ans: Data abstraction means exposing only the required details and hiding the implementation from the outside world. It can be achieved in Python with abstract base classes (the `abc` module), which play the role of interfaces.

Q62. Does Python make use of access specifiers?
Ans: Python does not restrict access to an instance variable or function. Instead, it uses the convention of prefixing the name of a variable, function or method with a single or double underscore to imitate the behavior of protected and private access specifiers. A double-underscore prefix additionally triggers name mangling inside the class.
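Several of the answers above can be illustrated with one small class hierarchy. This is a sketch only; `Walker`, `Swimmer`, and `Duck` are hypothetical names chosen for the example.

```python
class Walker:
    def move(self):
        return "walk"


class Swimmer:
    def move(self):
        return "swim"

    def dive(self):
        return "dive"


class Duck(Walker, Swimmer):
    """Multiple inheritance: Duck derives from both Walker and Swimmer."""

    def __init__(self):
        self._nest = "reeds"      # single underscore: "protected" by convention only
        self.__secret = "quack"   # double underscore: name-mangled to _Duck__secret


duck = Duck()

# Polymorphism: the same call works on different types.
moves = [animal.move() for animal in (Walker(), Swimmer(), duck)]
print(moves)  # ['walk', 'swim', 'walk']
```

`duck.move()` resolves to `Walker.move` because `Walker` comes first in the method resolution order, while `duck.dive()` is still inherited from `Swimmer`. The "private" attribute remains reachable as `duck._Duck__secret`, which shows that the underscore prefixes are a convention rather than enforced access control.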

## Behavioral

- https://365datascience.com/data-engineer-interview-questions/


# Languages comparison

https://www.geeksforgeeks.org/c-vs-java-vs-python/

https://www.javatips.net/blog/c-vs-java-vs-python-a-comparison


### Interpreted (python) vs compiled languages (C/Java)

The difference between an interpreted and a compiled language lies in the result of the process of interpreting or compiling. An interpreter executes a program directly and produces its result, while a compiler translates the program into another form, typically assembly language or machine code for a target architecture. An assembler then turns assembly into binary code. Assembly language differs between architectures, so a natively compiled program (e.g. C) can only run on computers that have the same architecture as the one it was compiled for. Java sits in between: it is compiled to bytecode that runs on the JVM, and CPython likewise compiles source code to bytecode before interpreting it.

### Python: pass by assignment

Remember that arguments are passed by assignment in Python. Since assignment just creates references to objects, there’s no alias between an argument name in the caller and callee, and so no call-by-reference per se. You can achieve the desired effect in a number of ways.

If you pass a mutable object into a method, the method gets a reference to that same object and you can mutate it to your heart's delight, but if you rebind the reference in the method, the outer scope will know nothing about it, and after you're done, the outer reference will still point at the original object.

If you pass an immutable object to a method, you still can't rebind the outer reference, and you can't even mutate the object.
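A minimal sketch of the two cases with a mutable argument (function names `mutate` and `rebind` are illustrative):

```python
def mutate(items):
    items.append(4)        # changes the object the caller also refers to


def rebind(items):
    items = [0, 0, 0]      # rebinds the local name only
    return items


numbers = [1, 2, 3]
mutate(numbers)
rebound = rebind(numbers)

print(numbers)  # [1, 2, 3, 4]  (mutated in place, not replaced)
print(rebound)  # [0, 0, 0]
```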

### Python: id() 

`id()` returns the identity of an object. This identity is unique and constant for the object during its lifetime; two objects with non-overlapping lifetimes may have the same `id()` value. In CPython the identity is the object's memory address, much like a pointer in C. It is rarely needed in application code; the `is` operator compares identities directly.
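A small sketch showing that two names bound to the same object share an identity, while an equal copy does not:

```python
a = [1, 2, 3]
b = a            # second name for the same object
c = list(a)      # equal contents, but a distinct object

same_object = id(a) == id(b)        # True, equivalent to `a is b`
different_object = id(a) != id(c)   # True, even though a == c
```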

### Python: memory

In Python, memory is managed in a private heap. All objects and data structures are located in this heap, and the programmer does not access it directly; the Python interpreter manages it. The core API exposes some tools for working with this heap. The memory manager allocates heap space for Python objects, while the built-in garbage collector (in CPython, reference counting plus a cycle detector) reclaims memory that is no longer in use.

## Python: Mutable vs immutable

Common immutable types:
- numbers: int(), float(), complex()
- immutable sequences: str(), tuple(), bytes()
- immutable set type: frozenset()

Common mutable types (almost everything else):
- mutable sequences: list(), bytearray()
- set type: set()
- mapping type: dict()
- classes, class instances

Immutable built-in objects in Python are hashable. Mutable containers like lists and dictionaries are not hashable, while the immutable container tuple is hashable as long as all of its elements are hashable.

Tuples are smaller than lists holding the same elements. Tuples have structure (each position has a meaning), lists have order, and sets are unordered collections of unique elements.
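Hashability can be checked directly with the built-in `hash()`. The helper `is_hashable` below is a hypothetical name used for this sketch:

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    "tuple": is_hashable((1, 2)),
    "frozenset": is_hashable(frozenset({1, 2})),
    "list": is_hashable([1, 2]),
    "dict": is_hashable({"a": 1}),
    "tuple_with_list": is_hashable((1, [2])),  # a tuple is only as hashable as its items
}

# Hashable objects can be used as dictionary keys or set members.
lookup = {(1, 2): "tuple keys work"}
```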

## Random question

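- Is there a "switch" statement in Python?

Not in the classic sense. Since Python 3.10 there is the `match` statement (structural pattern matching); before that, an `if`/`elif` chain or a dictionary lookup is used instead.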
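Code that would use a switch in other languages is often written as a dictionary lookup in Python. A minimal sketch, where the function name `describe_status` and the mapping are assumptions made for illustration:

```python
def describe_status(code):
    """Map a status code to a label, with a default branch."""
    labels = {
        200: "ok",
        404: "not found",
        500: "server error",
    }
    return labels.get(code, "unknown")   # .get supplies the "default" case


print(describe_status(404))  # not found
print(describe_status(418))  # unknown
```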

- How do you get out of a loop?

`break`

# Complexity

$O(1)$ (constant): Determining if a binary number is even or odd; calculating $(-1)^{n}$; using a constant-size lookup table

$O(\log n)$ (logarithmic): Finding an item in a sorted array with a binary search or in a balanced search tree, as well as all operations in a binomial heap
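To make the logarithmic case concrete, here is a hand-written binary search that also counts its iterations. It is a sketch for illustration; in practice the standard `bisect` module covers the same need.

```python
def binary_search(sorted_items, target):
    """Return (index, steps); index is -1 if target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1      # discard the lower half
        else:
            hi = mid - 1      # discard the upper half
    return -1, steps


data = list(range(1_000_000))
index, steps = binary_search(data, 765_432)
```

Each iteration halves the remaining range, so a million elements need at most 20 iterations ($\lfloor \log_2 10^6 \rfloor + 1 = 20$).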

# SQL


### What is a primary key?

A primary key is a field or combination of fields that uniquely identifies a row. It is a special kind of unique key with an implicit NOT NULL constraint, which means primary key values cannot be NULL.

### What is a unique key?

A unique key constraint ensures that the values in a column or set of columns are unique across all records in the table.

A primary key constraint automatically implies a unique constraint, but a unique key is not automatically a primary key.

There can be many unique constraints defined per table, but only one primary key constraint.

### What is a foreign key?

A foreign key is a field (or set of fields) in one table that refers to the primary key of another table. The relationship between the two tables is created by referencing the primary key of the other table from the foreign key.
e.g. an Order table with foreign key person_id referencing the person table

### What is a constraint?

Constraints specify limits on the data that can be stored in a table. They can be specified when creating or altering the table. Examples of constraints are:

- NOT NULL.
- CHECK.
- DEFAULT.
- UNIQUE.
- PRIMARY KEY.
- FOREIGN KEY.
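The constraints listed above can be tried out with Python's built-in `sqlite3` module. The sketch below assumes two example tables, `person` and `orders`, and a helper `violates`; all of these names are made up for illustration. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        email     TEXT NOT NULL UNIQUE,
        age       INTEGER CHECK (age >= 0),
        country   TEXT DEFAULT 'unknown'
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL,
        FOREIGN KEY (person_id) REFERENCES person (person_id)
    )
""")
conn.execute("INSERT INTO person (person_id, email, age) VALUES (1, 'a@example.com', 30)")
conn.execute("INSERT INTO orders (order_id, person_id) VALUES (10, 1)")


def violates(sql):
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_email = violates("INSERT INTO person (person_id, email) VALUES (2, 'a@example.com')")
negative_age = violates("INSERT INTO person (person_id, email, age) VALUES (3, 'b@example.com', -1)")
missing_parent = violates("INSERT INTO orders (order_id, person_id) VALUES (11, 999)")
default_country = conn.execute("SELECT country FROM person WHERE person_id = 1").fetchone()[0]
```

Each of the three rejected inserts trips a different constraint (UNIQUE, CHECK, FOREIGN KEY), and `default_country` picks up the DEFAULT value because the first insert did not supply one.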

### Which operator is used in query for pattern matching?

The LIKE operator is used for pattern matching, with the following wildcards:

- % - Matches zero or more characters.
- _ - Matches exactly one character.


`Select * from Student where studentname like 'a%'`

`Select * from Student where studentname like 'ami_'`
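The two patterns can be checked against sample data with Python's built-in `sqlite3` module. The rows below and the helper `names` are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (studentname TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?)",
    [("amit",), ("amir",), ("amitabh",), ("anna",), ("bob",)],
)


def names(pattern):
    """Return the student names matching a LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT studentname FROM Student WHERE studentname LIKE ? ORDER BY studentname",
        (pattern,),
    )
    return [row[0] for row in rows]


starts_with_a = names("a%")    # ['amir', 'amit', 'amitabh', 'anna']
ami_plus_one = names("ami_")   # ['amir', 'amit'], exactly one character after 'ami'
```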

### What are the UNION, MINUS and INTERSECT commands?

The UNION operator combines the results of two queries and eliminates duplicate rows.

UNION ALL combines the results without eliminating duplicates.

The MINUS/EXCEPT operator returns the rows from the first query that are not returned by the second query (MINUS is the Oracle spelling; standard SQL uses EXCEPT).

The INTERSECT operator returns the rows returned by both queries.
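A quick way to compare the set operators is to run them on two tiny tables. This sketch uses Python's built-in `sqlite3` module, which supports EXCEPT rather than MINUS; the tables `a` and `b` and the helper `run` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (3,), (4,)])


def run(operator):
    """Combine the two tables with the given set operator and sort the result."""
    sql = f"SELECT x FROM a {operator} SELECT x FROM b"
    return sorted(row[0] for row in conn.execute(sql))


union = run("UNION")          # [1, 2, 3, 4]
union_all = run("UNION ALL")  # [1, 2, 2, 2, 3, 3, 4]
except_ = run("EXCEPT")       # [1]
intersect = run("INTERSECT")  # [2, 3]
```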