# Data Structures 

A data structure makes it possible to store data. Each data structure provides a set of operations that can be performed on the data. At its most basic, a data structure should allow a user to add and retrieve data.

* Lists
* Stacks
* Queues
* Dictionaries


We call these fundamental data structures because they exist in most programming languages, and it's hard to write any program without using at least one of them. There are so many data structures because each one is suited to particular kinds of data manipulation.

We implement all of these data structures using classes.

# Linked List

Internally, Python lists are implemented in the C programming language using arrays. An array is a fixed-length chunk of memory that allows you to access each position in constant time. So the list [5, 3, 8] will be stored in an array where the first value is 5, the second is 3 and the third is 8.

We're going to implement lists using a linked structure. This means that the list [5, 3, 8] will be stored using three objects. Each of these objects will store the value plus references (links) to the neighboring elements.

The following figure shows the array structure and the linked structure of list [5, 3, 8] side-by-side:

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/linked_list_1.JPG?raw=true">

When using a linked structure, we cannot access elements directly by index because, unlike in an array, the objects storing the values are not located in consecutive memory positions. For example, to reach the third element in the linked structure, we need to start at 5 and follow the links from 5 to 3 and then from 3 to 8. Because of its linked structure, we call this data structure a linked list.

To build a linked structure, we use an auxiliary class commonly called a node. Our nodes will keep track of three pieces of information:

* The data
* The previous node
* The next node

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/linked_list_2.JPG?raw=true">


For example, using this representation, we can display the linked list representation of list [5, 3, 8] as follows:


<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/linked_list_3.JPG?raw=true">
    
Note that the first node doesn't have a previous node and that the last node does not have a next node. We will use the value None in these situations.
    


In [4]:
class Node:
    def __init__(self,data):
        self.data = data
        self.prev = None
        self.next = None
    
node = Node(42)

We implemented a class to represent nodes in a linked list. You can use an individual node to store one list element via its *Node.data* attribute. Using the *Node.next* attribute, you can link nodes together to form a list.

The *Node.prev* attribute will store the predecessor of each node. We call a linked list with predecessor links a doubly linked list. Predecessor links aren't strictly necessary to implement a linked list, but we'll see in the next two missions that they are very convenient.
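As a small sketch of how these attributes work together, the list [5, 3, 8] can be wired up by hand from three nodes. The variable names `first`, `second`, and `third` are chosen only for illustration, and the `Node` class is repeated so that the snippet runs on its own.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

# Create one node per list element.
first, second, third = Node(5), Node(3), Node(8)

# Link each node to its neighbors in both directions.
first.next = second
second.prev = first
second.next = third
third.prev = second

# Reach the third element by following the links from the first node.
current = first
current = current.next  # now at the node holding 3
current = current.next  # now at the node holding 8
print(current.data)     # 8

# The two ends of the list have no neighbor on one side.
print(first.prev, third.next)  # None None
```

Reaching the third element takes two link-following steps, which is why access by position in a linked structure gets slower as the position grows.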

We will implement the linked list in a class named *LinkedList*. This class will use the *Node* class to chain the data together into a list-like structure. We commonly call the first node of a linked list the **head** and the last node of a list the **tail**.

In the list [5, 3, 8], the head is the node containing 5 and the tail is the node containing 8.

To implement list operations in constant time, the LinkedList class will keep track of three attributes:

* The length of the list
* The head node
* The tail node


Let's start by declaring the LinkedList class and its constructor (the __init__() method). A new list is initially empty. This means that the head and tail nodes don't exist yet. To represent that a node doesn't exist, we use the **None** value.

In practice, the constructor should initialize the length to 0 and both the head and tail nodes to None.

In [2]:
class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0
        
lst = LinkedList()

Let's start adding functionality to our linked list. More specifically, we are going to implement a method that appends a new value to the list.

Notice that in a list with a single element, both the head and the tail point to the same node.



To append a new element to an empty list, we do the following:
1. Create a node with the provided data
2. Set both the head and the tail to the newly created node

If the list isn't empty, then after creating the node we place it after the current tail. This implies doing the following:
1. Set the next node of the current tail to the newly created node
2. Set the previous node of the newly created node to the current tail
3. Make the newly created node the new tail

<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/Node.gif?raw=true">


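Before writing the doubly linked version, here is a simpler singly linked list for comparison. Its nodes have no *prev* attribute, and the list keeps an empty placeholder node as its head, so the methods start from the head and walk the *next* links to reach the data.

In [5]: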
class node:
    def __init__(self, data=None):
        self.data = data
        self.next = None

In [6]:
class LinkedList:
    def __init__(self):
        # The head is an empty placeholder node; the data starts at head.next.
        self.head = node()

    def append(self, data):
        # Walk to the last node and attach the new node after it: O(N).
        new_node = node(data)
        cur = self.head
        while cur.next is not None:
            cur = cur.next
        cur.next = new_node

    def length(self):
        # Count the nodes that follow the placeholder head.
        cur = self.head
        total = 0
        while cur.next is not None:
            total += 1
            cur = cur.next
        return total

    def display(self):
        # Collect the values in order and print them as a Python list.
        elems = []
        cur_node = self.head
        while cur_node.next is not None:
            cur_node = cur_node.next
            elems.append(cur_node.data)
        print(elems)

    def get(self, index):
        # Reject negative indexes too; otherwise the loop would run off the list.
        if index < 0 or index >= self.length():
            print("Error: Index out of Range")
            return None
        cur_idx = 0
        cur_node = self.head
        while True:
            cur_node = cur_node.next
            if cur_idx == index:
                return cur_node.data
            cur_idx += 1

    def erase(self, index):
        if index < 0 or index >= self.length():
            print("Error: Index out of Range")
            return None
        cur_idx = 0
        cur_node = self.head
        while True:
            last_node = cur_node
            cur_node = cur_node.next
            if cur_idx == index:
                # Unlink the node by making its predecessor skip over it.
                last_node.next = cur_node.next
                return
            cur_idx += 1
        

In [7]:
my_list = LinkedList()

In [8]:
my_list.display()

[]


In [9]:
my_list.append(1)
my_list.append(2)

In [10]:
my_list.display()

[1, 2]


In [11]:
my_list.append(3)
my_list.append(4)
my_list.append(5)

In [12]:
my_list.display()

[1, 2, 3, 4, 5]


In [13]:
my_list.erase(3)

In [14]:
my_list.display()

[1, 2, 3, 5]


In [15]:
class Node:
    def __init__(self,data):
        self.data = data
        self.prev = None
        self.next = None
    
node = Node(42)

**Implementing the append method**

In [17]:
class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0

    def append(self, data):
        new_node = Node(data)
        if self.length == 0:
            # Empty list: the new node is both the head and the tail.
            self.head = self.tail = new_node
        else:
            # Link the new node after the current tail, then make it the tail.
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1

In [18]:
lst = LinkedList()

In [19]:
lst.append(10)
print(lst.length,lst.head.data,lst.tail.data)

1 10 10


In [20]:
lst.append(12)

In [21]:
print(lst.length,lst.head.data,lst.tail.data)

2 10 12
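A quick way to check that append wires the predecessor links correctly is to walk the list backwards from the tail. This is only a sketch: the helper name `to_reversed_list` is hypothetical and not part of the class developed here, and the `Node` and `LinkedList` classes are repeated so that the snippet is self-contained.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0

    def append(self, data):
        new_node = Node(data)
        if self.length == 0:
            self.head = self.tail = new_node
        else:
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1


def to_reversed_list(linked_list):
    """Collect the values from tail to head by following the prev links."""
    values = []
    node = linked_list.tail
    while node is not None:
        values.append(node.data)
        node = node.prev
    return values


lst = LinkedList()
for value in (10, 12, 14):
    lst.append(value)

print(to_reversed_list(lst))  # [14, 12, 10]
```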


We implemented a method to append elements to a list, but this isn't useful if we can't easily access the list elements. To access them, we're going to enable for loops to iterate over all elements in a list.

When implementing a class, for loops aren't automatically available. We need to specify what it means to iterate over the class. In other words, we need to make our class an iterable.

For example, with the **Person** class defined in the first screen, we cannot write *for x in person* if *person* is a *Person* instance. In the same way, we cannot (yet) write *for x in lst* where lst is a LinkedList instance.
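The sketch below shows what happens when we loop over an object whose class defines neither iteration method. The `Person` class here is a hypothetical stand-in with a single `name` attribute, not necessarily the one defined earlier.

```python
class Person:
    # Hypothetical minimal class with no __iter__() or __next__() method.
    def __init__(self, name):
        self.name = name


person = Person("Ada")

try:
    for x in person:
        print(x)
except TypeError as error:
    # Python refuses to loop over an object that is not iterable.
    message = str(error)
    print(message)  # 'Person' object is not iterable
```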

<Img src= "https://github.com/rhnyewale/Data-Engineering/blob/main/Images/iterate_node.gif?raw=true">
    
In practice, to enable for loops, we define two methods in our class:

1. The __iter__() method: This method should set up all the necessary data to start a new iteration. When making a class iterable in this way, this method should always return self.

2. The __next__() method: This method should return the current iteration element and move on to the next one. It should also notify when the iteration is over.


Let's focus on the __iter__() method. To keep track of the current iteration node, we will assign this node to an attribute called _iter_node. We chose to use an underscore (_) at the start of the attribute name to signal to users of our class that they should not access it. It should only be used internally.

We learned that the __iter__() method is responsible for initializing the iteration. In our case, this means initializing _iter_node to the head of the linked list. Also, as we mentioned, this method should return self. This is because Python expects __iter__() to return the object whose __next__() method it will call to get the values, and in our design that object is the list itself.

In [22]:
class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0
    def append(self,data):
        new_node = Node(data)
        if self.length == 0:
            self.head = self.tail = new_node
        else:
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1
    def __iter__(self):
        self._iter_node = self.head
        return self

We've started making our linked list into an iterable. An **iterable is an object over which we can iterate using a for loop**.

<Img src= "https://github.com/rhnyewale/Data-Engineering/blob/main/Images/iterate_node.gif?raw=true">
    
We implemented the __iter__() method, which is responsible for initializing the iteration. We did it by creating an attribute named _iter_node and setting it to the head of the list.
    
To actually iterate over the list, we need to implement the __next__() method. This method is responsible for:

1. Returning the current iteration value. In our case this is the data stored in the node stored in _iter_node.
2. Moving the iteration to the next value. In our case this means moving _iter_node to the next node.
3. Notifying when the iteration is over. In our case, this happens when we run out of nodes and _iter_node becomes None.
    
In the example above, the __next__() method should return the value 5 and move _iter_node to the next node, as we see in the following diagram:
    
<Img src="https://github.com/rhnyewale/Data-Engineering/blob/main/Images/iter_next_method.JPG?raw=true">
    
Note that since the return statement is the last thing executed in a method, we can't move _iter_node after returning the value. One way to handle this is to store the data of the current _iter_node in a local variable, move _iter_node to the next node, and then return the stored data.

The __next__() method also needs to let Python know when the iteration is over. This is done by raising a **StopIteration exception**. The **iteration ends when the _iter_node becomes None**.

In [23]:

class LinkedList:
    
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0
        
    def append(self, data):
        new_node = Node(data)
        if self.length == 0:
            self.head = self.tail = new_node
        else:
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1
        
    def __iter__(self):
        self._iter_node = self.head
        return self
    
    
    def __next__(self):
        # No nodes left: tell Python that the iteration is over.
        if self._iter_node is None:
            raise StopIteration
        # Save the current value, advance to the next node, then return the value.
        ret = self._iter_node.data
        self._iter_node = self._iter_node.next
        return ret
    

In [24]:
# Testing the implementation
lst = LinkedList()
lst.append(5)
lst.append(3)
lst.append(8)
for value in lst:
    print(value)

5
3
8
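To see how the two methods cooperate, here is roughly what Python does when it runs the for loop above, written out by hand with the built-in iter() and next() functions. This is a simplified sketch of the iteration protocol, with the classes repeated so that it runs on its own.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0

    def append(self, data):
        new_node = Node(data)
        if self.length == 0:
            self.head = self.tail = new_node
        else:
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1

    def __iter__(self):
        self._iter_node = self.head
        return self

    def __next__(self):
        if self._iter_node is None:
            raise StopIteration
        ret = self._iter_node.data
        self._iter_node = self._iter_node.next
        return ret


lst = LinkedList()
for value in (5, 3, 8):
    lst.append(value)

iterator = iter(lst)  # calls lst.__iter__(), which returns lst itself
collected = []
while True:
    try:
        value = next(iterator)  # calls lst.__next__()
    except StopIteration:
        break  # __next__() signaled the end of the list
    collected.append(value)

print(collected)        # [5, 3, 8]
print(iterator is lst)  # True
```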


We've implemented our first list data structure. We can now add data to our list and iterate over all of the data it contains.

Compared to Python lists, this might not seem like much. For example, we can't access an element by its index. We could implement a method to do it, but it would need to traverse the list, making it an O(N) operation, where N is the length of the list. By contrast, Python lists can do this in constant time.
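As a sketch of what such a method could look like, the hypothetical `get()` method below walks the list from the head until it reaches the requested position. The method name and the choice to raise IndexError for invalid positions are assumptions made for this example; they aren't part of the class built above.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.length = 0

    def append(self, data):
        new_node = Node(data)
        if self.length == 0:
            self.head = self.tail = new_node
        else:
            self.tail.next = new_node
            new_node.prev = self.tail
            self.tail = new_node
        self.length += 1

    def get(self, index):
        """Hypothetical helper: return the value at position index in O(N)."""
        if not 0 <= index < self.length:
            raise IndexError("index out of range")
        node = self.head
        # Follow one next link per position, so the cost grows with index.
        for _ in range(index):
            node = node.next
        return node.data


lst = LinkedList()
for value in (5, 3, 8):
    lst.append(value)

print(lst.get(0), lst.get(2))  # 5 8
```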

However, linked lists have advantages over array-based implementations. For example, we can always append a value in constant time. In contrast, because an array has a fixed length, the array behind a Python list has to grow when it runs out of room, and when that happens all of its elements are copied into a new, larger array. This means that a single append to a Python list can take O(N) time in the worst case, although the cost averages out to O(1) per append (amortized complexity).
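One way to observe this behavior is to watch the memory footprint of a Python list while appending to it. The sketch below assumes CPython, where sys.getsizeof() reports the size of the list object including the spare room in its array. The exact sizes vary between versions, so no output is shown.

```python
import sys

values = []
last_size = sys.getsizeof(values)
resizes = 0
appends = 1000

for i in range(appends):
    values.append(i)
    size = sys.getsizeof(values)
    if size != last_size:
        # The underlying array was reallocated with extra spare capacity.
        resizes += 1
        last_size = size

# Only a small fraction of the appends should trigger a reallocation.
print(f"{resizes} resizes in {appends} appends")
```

Because each reallocation reserves more room than is immediately needed, most appends simply fill an already available slot, which is where the O(1) amortized cost comes from.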

Another thing we can do with a linked list is prepend elements in constant time, that is, add an element to the start of the list. Here's an animation describing the steps for doing this:

