### Python Data Structures and String Manipulation

1. What are the main data structures in Python?
Welcome to the course on Coding Interview Questions in Python! My name is Kirill Smirnov and I am a Data Science Consultant at Altran. We'll go through some common topics that come up during the job interview process. Let's start with a basic question: what are the main data structures in Python?

2. Data Structure
But first, what is a data structure? It is a specialized format to organize and store data. Python has four main built-in data structures: list, tuple, set, and dictionary. Let's go through each of them!

3. List
List. List represents an ordered mutable sequence of items, like numbers, strings, and other objects. Lists are created by inserting comma-separated items into square brackets.

4. List: accessing items
To retrieve an item, we specify its index in square brackets after the variable's name. Remember, Python starts indexing from zero, not from one! Negative indices are also possible; in this case, counting works backwards from the end of the list. We can also access a sub-list via slicing. Note that the item at the right-hand side index is not included. Omitting the left-hand side index starts the slice at the beginning of the list, and omitting the right-hand side index extends it to the end.
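
The indexing and slicing rules above can be sketched as follows. The list `fruits` is a made-up example and does not come from the course slides.

```python
# Hypothetical example list (not from the original slides)
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']

first = fruits[0]      # indexing starts at zero
last = fruits[-1]      # negative indices count backwards from the end
middle = fruits[1:3]   # items at indices 1 and 2; index 3 is excluded
head = fruits[:2]      # left index omitted: start at the beginning
tail = fruits[2:]      # right index omitted: run to the end

print(first, last)
print(middle, head, tail)
```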

5. List: modifying items
Modifying items is very simple as well. We can change either a single item or a slice.
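
A minimal sketch of both kinds of modification, using a made-up list `numbers`:

```python
# Hypothetical example list
numbers = [1, 2, 3, 4, 5]

numbers[0] = 10           # change a single item
numbers[1:3] = [20, 30]   # change a slice (items at indices 1 and 2)

print(numbers)  # [10, 20, 30, 4, 5]
```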

6. List: methods
Lists have some useful methods. .append() adds a new item to the end of a list. .remove() deletes the first occurrence of a specific item from a list.

7. List: methods
.pop() removes the last item from a list and returns its value. .count() counts the number of occurrences of a certain item in a list.
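
The four list methods can be tried on a small made-up list (the name `letters` is only illustrative):

```python
# Hypothetical example list
letters = ['a', 'b', 'a', 'c']

letters.append('d')         # ['a', 'b', 'a', 'c', 'd']
letters.remove('a')         # removes the first 'a': ['b', 'a', 'c', 'd']
last_item = letters.pop()   # returns 'd'; the list is now ['b', 'a', 'c']
count_a = letters.count('a')

print(letters, last_item, count_a)
```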

8. Tuple
Tuple. A tuple is an immutable sequence of items. Here, immutable means that we cannot modify it. There are two ways to create a tuple: either by using round brackets or by simply writing comma-separated values. Accessing items in a tuple works the same way as in a list.

9. Tuple: modifying values
Modifying, though, is not possible: we will get a TypeError.
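
A short sketch of tuple creation and of the error raised on modification; the variable names are made up for illustration.

```python
# Two ways to create a tuple (hypothetical example values)
point = (1, 2, 3)
also_a_tuple = 4, 5, 6

second = point[1]   # access works as with lists

try:
    point[0] = 10   # tuples do not support item assignment
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```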

10. Set
Set. Set is an unordered collection with no duplicate items. Unordered means that there is no indexing for the constituent items. Here is the code to create a set. If we have duplicates during creation, they are not included.

11. Set: methods
Sets have some useful methods. .add() inserts a new item to a set. .remove() does the contrary. .union() returns a new set with items from both sets. .intersection() returns a new set with only common items. .difference() returns a new set with items present in one set but not in another.
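
The set methods above can be sketched with two small made-up sets `x` and `y`:

```python
# Hypothetical example sets; the duplicate 3 is dropped at creation
x = {1, 2, 3, 3}
y = {3, 4}

x.add(5)      # x is now {1, 2, 3, 5}
x.remove(1)   # x is now {2, 3, 5}

union = x.union(y)                 # items from both sets
common = x.intersection(y)         # only the shared items
only_in_x = x.difference(y)        # items in x but not in y

print(union, common, only_in_x)
```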

12. Dictionary
Dictionary. A dictionary is a collection of key-value pairs where keys are unique and immutable. Each key maps to exactly one value, but the same value can be stored under several keys. There are several ways to create a dictionary: for example, using curly brackets and separating each key from its value with a colon, or using the dict() constructor together with a list of tuples.

13. Dictionary: accessing values
A value associated with a key can be accessed by specifying the key within square brackets. The operation raises KeyError if the key does not exist.

14. Dictionary: modifying values
Modifying the value for a key is very straightforward: we access the value and re-assign it. If the key does not exist, the operation creates a new key-value pair.
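
Creation, access, and modification can be sketched together. The dictionary `ages` and its contents are made up for illustration.

```python
# Hypothetical example data
ages = {'alice': 30, 'bob': 25}                  # curly brackets with colons
same_ages = dict([('alice', 30), ('bob', 25)])   # dict() with a list of tuples

alice_age = ages['alice']   # access by key

try:
    ages['carol']           # the key does not exist yet
except KeyError as error:
    missing_key = error.args[0]
    print('KeyError for key:', missing_key)

ages['bob'] = 26            # existing key: the value is re-assigned
ages['carol'] = 41          # new key: a new pair is created
print(ages)
```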

15. Dictionary: methods
Dictionaries have some useful methods. .items() returns the stored key-value pairs.

16. Dictionary: methods
We can pass the method output to the list() constructor to get the corresponding list.

17. Dictionary: methods
We can also retrieve keys and values separately with the .keys() and .values() methods

18. Dictionary: methods
and also use the trick with the list() constructor.

19. Dictionary: methods
We can remove the last inserted key-value pair with the .popitem() method. The method also returns the removed pair as a (key, value) tuple.
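
A minimal sketch of the dictionary methods from the last few slides, assuming Python 3.7 or later (where dictionaries keep insertion order). The dictionary `stock` is a made-up example.

```python
# Hypothetical example data
stock = {'apples': 3, 'pears': 5, 'plums': 7}

pairs = list(stock.items())     # list of (key, value) tuples
keys = list(stock.keys())       # list of keys
values = list(stock.values())   # list of values

removed = stock.popitem()       # removes and returns the last inserted pair

print(pairs, keys, values)
print(removed, stock)
```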

20. Operations on Lists, Tuples, Sets, and Dictionaries
We can apply some operations to all the aforementioned collections. A very practical one is len(), which returns the collection's size.

21. Operations on Lists, Tuples, Sets, and Dictionaries
Another one is the in keyword, which checks whether an item is present in a collection. For a dictionary, it checks the keys.
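
Both operations can be tried on all four collection types. The variables below are made up for illustration.

```python
# Hypothetical example collections
a_list = [1, 2, 3]
a_tuple = (1, 2, 3)
a_set = {1, 2, 3}
a_dict = {'one': 1, 'two': 2}

sizes = [len(a_list), len(a_tuple), len(a_set), len(a_dict)]
print(sizes)

print(2 in a_list, 5 in a_tuple, 3 in a_set)
print('one' in a_dict, 1 in a_dict)   # for a dictionary, `in` looks at keys
```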

22. Let's practice!
We went through some main data structures in Python. Now it's time to practice!

#### List methods
Let's practice list methods!

Let's imagine a situation: you went to the market and filled your baskets (basket1 and basket2) with fruits. You wanted to have one of each kind but realized that some fruits were put in both baskets.

Task 1. Your first task is to remove everything from basket2 that is already present in basket1.

Task 2. After the removal, it is reasonable to anticipate that one of the baskets might weigh more than the other (all fruit kinds weigh the same). Therefore, the second task is to transfer some fruits from the heavier basket to the lighter one to get approximately the same weight/amount of fruits.

In [10]:
basket1 = ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges', 'peaches', 'pears', 'lemons']
basket2 = ['grapes', 'dragonfruits', 'limes', 'papaya']

In [11]:
# Remove fruits from basket2 that are present in basket1
for item in basket1:
    if item in basket2:
        basket2.remove(item)

print('Basket 1: ' + str(basket1))
print('Basket 2: ' + str(basket2))

# Transfer fruits from basket1 to basket2
while len(basket1) > len(basket2):
    item_to_transfer = basket1.pop()
    basket2.append(item_to_transfer)

print('Basket 1: ' + str(basket1))
print('Basket 2: ' + str(basket2))

Basket 1: ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges', 'peaches', 'pears', 'lemons']
Basket 2: ['grapes', 'dragonfruits', 'limes', 'papaya']
Basket 1: ['banana', 'kiwifruits', 'grapefruits', 'apples', 'apricots', 'nectarines', 'oranges']
Basket 2: ['grapes', 'dragonfruits', 'limes', 'papaya', 'lemons', 'pears', 'peaches']


#### Operations on sets
Using mathematical notation, we can define the following operations given two sets X and Y:

 - the intersection X ∩ Y (all elements which are in both X and Y)
 - the union X ∪ Y (all elements which are in X, in Y, or in both)
 - the difference X - Y (all elements which are in X but not in Y)

You are given 5 sets of integers A, B, C, D, and E (you should see them in the console). What is the result of the following expression?

(A ∪ (B ∩ C)) - (D ∩ E)

In [3]:
A = {1, 2, 3, 4, 5, 6, 7}
B = {5, 7, 9, 11, 13, 15}
C = {1, 2, 8, 10, 11, 12, 13, 14, 15, 16, 17}
D = {1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
E = {9, 10, 11, 12, 13, 14, 15}

In [4]:
result = {1,2,3,4,5,6,7}
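
The answer can be double-checked with Python's set operators, where `|` is union, `&` is intersection, and `-` is difference. This is only a verification sketch, not part of the original exercise.

```python
A = {1, 2, 3, 4, 5, 6, 7}
B = {5, 7, 9, 11, 13, 15}
C = {1, 2, 8, 10, 11, 12, 13, 14, 15, 16, 17}
D = {1, 3, 5, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}
E = {9, 10, 11, 12, 13, 14, 15}

b_and_c = B & C   # {11, 13, 15}
d_and_e = D & E   # {9, 11, 12, 13, 14, 15}
result = (A | b_and_c) - d_and_e

print(result)
```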

The surface (a circular paraboloid, as the name circ_parab below suggests) can be described by the following equation:

 x^2 / a^2 + y^2 / a^2 = z

Let's set the coefficient a to 1. Therefore, the radius at each cut will be equal to sqrt(z).

Your task is to create a dictionary that stores the mapping from the pair of coordinates (x,y) to the z coordinate (the lists storing considered ranges for x and y are given: range_x and range_y, respectively).

In [2]:
range_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
range_y = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]

In [3]:
circ_parab = dict()

for x in range_x:
    for y in range_y:        
        # Calculate the value for z
        z = x**2 + y**2
        # Create a new key for the dictionary
        key = x,y
        # Create a new key-value pair      
        circ_parab[key] = z

What is the value of circ_parab for the key (1.8, 1.4)?

In [4]:
circ_parab[1.8,1.4]

5.2
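
As an alternative sketch, the same mapping can be built with a dictionary comprehension. Because the values are floating-point numbers, the check below uses `math.isclose` instead of exact equality; that choice is an addition here, not part of the original exercise.

```python
import math

range_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
range_y = list(range_x)   # same grid for y as for x

# Map each (x, y) pair to z = x^2 + y^2 (coefficient a set to 1)
circ_parab = {(x, y): x**2 + y**2 for x in range_x for y in range_y}

print(len(circ_parab))                              # 11 * 11 pairs
print(math.isclose(circ_parab[(1.8, 1.4)], 5.2))
```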

### What are common ways to manipulate strings?

1. What are common ways to manipulate strings?
The next topic we'll cover is string manipulation. Coding exercises in interviews often involve this subject.

2. String
Strings are created simply by enclosing a text in single or double quotes.

3. String
More generally, strings can be created using the str() constructor. We can also pass other data types to it, like floating-point numbers or lists, to convert them to strings.

4. str() constructor
Actually, we can pass any object to the str() constructor. Let's check it with our own object. Recall that first we need to create a class representing a "blueprint" of an object. We need to include the __init__() method indicating the state of our object at initialization. Here, an object is initialized with the num variable. After creating an instance, we can retrieve the value of this variable. If we pass the object to the str() constructor, we'll get quite unreadable output. How can we customize it?

5. str() constructor
We have to implement the __str__() method in the class. Here, this method will return the value of the num variable. Now, when we create an instance and pass it to the str() constructor, we get the number defined at initialization.
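
The effect of __str__() can be sketched with two small classes. The class names `PlainNumber` and `ReadableNumber` are hypothetical; the slides may use different ones.

```python
class PlainNumber:
    """Hypothetical class without a custom __str__()."""
    def __init__(self, num):
        self.num = num


class ReadableNumber:
    """Hypothetical class that customizes its string representation."""
    def __init__(self, num):
        self.num = num

    def __str__(self):
        # __str__() must return a string, so the number is converted
        return str(self.num)


plain_text = str(PlainNumber(3))         # default, hard-to-read representation
readable_text = str(ReadableNumber(3))   # uses the custom __str__()

print(plain_text)
print(readable_text)
```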

6. Accessing characters in a string
String characters are indexed. Therefore, we can access each character using square brackets with the corresponding index. Negative indices can be used as well. We can also use slicing. If left-hand or right-hand side index is omitted, all the characters in the corresponding direction are considered.

7. The .index() method
To retrieve an index of a specific character in a string, use the .index() method. Note that if a character is present in a string more than once, only the lowest index is returned.
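
A short sketch of character access, slicing, and .index() on a made-up string `word`:

```python
# Hypothetical example string
word = 'interview'

first = word[0]            # 'i'
last = word[-1]            # 'w'
part = word[1:4]           # 'nte' (index 4 is excluded)
start = word[:5]           # 'inter'
rest = word[5:]            # 'view'
e_index = word.index('e')  # 'e' occurs at indices 3 and 7; the lowest is returned

print(first, last, part, start, rest, e_index)
```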

8. Strings are immutable
Strings are immutable. We cannot modify an existing string. Doing so will raise an error. You could ask: but there are plenty of methods that we can apply on a string object implying modification! The answer is: it only looks like modification. In reality, we return a new string object. Let's look at some of these methods.

9. Modifying methods 1
First is string concatenation. It is easily done with the "+" operator. Another important operation is replacing a substring with the .replace() method. It substitutes all the occurrences of a specific character sequence in a string with another sequence.

10. Modifying methods 2
There is a set of methods to change the case of a string: for example, .upper() transforms it to upper case and .lower() to lower case. If we need to capitalize only the first letter in a string, we use the .capitalize() method, which also lowercases the remaining characters.
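
The sketch below illustrates that strings are immutable and that the "modifying" operations return new strings. The string `greeting` is a made-up example.

```python
# Hypothetical example string
greeting = 'hello world'

try:
    greeting[0] = 'H'   # in-place modification is not allowed
except TypeError as error:
    print('TypeError:', error)

combined = greeting + '!'               # concatenation
replaced = greeting.replace('o', '0')   # replaces all occurrences
upper = greeting.upper()
lower = 'HeLLo'.lower()
capitalized = greeting.capitalize()     # first letter upper, the rest lower

print(combined, replaced, upper, lower, capitalized)
print(greeting)   # the original string is unchanged
```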

11. Relation to lists
Let's look at a couple of methods dealing with lists. We will begin with creating a string from a list of strings. Assume we have the following list. To convert it to a proper sentence, we can use the .join() method. We call it on the delimiter that is inserted between the strings in the list and pass the list as an argument. The second method is about breaking a string into a list. Let's reverse the operation we just did. For this, we can use the .split() method. We call it on the string we want to split and pass the delimiter as an argument.
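
A minimal round trip with .join() and .split(); the list `words` is a made-up example.

```python
# Hypothetical example list
words = ['Python', 'is', 'fun']

sentence = ' '.join(words)          # the delimiter is the string .join() is called on
words_again = sentence.split(' ')   # the delimiter is passed as the argument

print(sentence)
print(words_again)
```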

12. String methods with DataFrames
Knowledge of string methods is very handy because we can use them with DataFrames. More specifically, we can apply them to columns containing text data. Let's create a DataFrame using a custom dictionary. As you can see, the name column contains text data. However, the names are not capitalized.

13. String methods with DataFrames
To change that, we need to modify the column of interest.

14. String methods with DataFrames
We start by accessing the column.

15. String methods with DataFrames
Then, we add the .str accessor that gives us access to the string methods.

16. String methods with DataFrames
And finally, we call the .capitalize() method we already know. Let's check the result! Great! We capitalized all the names in our DataFrame in just seconds!

17. Let's practice!
That was quite a lot of information. Let's put it all together and get some practice!

#### String indexing and concatenation
You are presented with one of the earliest known encryption techniques: the Caesar cipher. It is based on a simple shift of each letter in a message by a certain number of positions down the given alphabet. For example, given the English alphabet, encrypting 'xyz' with a shift of 1 yields 'yza', and decrypting 'yza' with the same shift yields 'xyz'. Notice that 'z' wraps around to 'a' in this case.

Thus, encryption/decryption requires two arguments: text and an integer key denoting the shift (key = 1 for the example above).

Your task is to create an encryption function given the English alphabet stored in the alphabet string.

In [3]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'

In [4]:
def encrypt(text, key):
  
    encrypted_text = ''

    # Fill in the blanks to create an encrypted text
    for char in text.lower():
        idx = (alphabet.index(char) + key) % len(alphabet)
        encrypted_text = encrypted_text + alphabet[idx]

    return encrypted_text

# Check the encryption function with a shift equal to 10
print(encrypt("datacamp", 10))

nkdkmkwz


Great! Interestingly, the decryption function differs only in the line you fixed in the for loop. What would be the corresponding change in the decrypt() function?

In [5]:
# Replaces the idx line inside the for loop of decrypt(); run on its own, char is undefined
idx = (alphabet.index(char) - key) % len(alphabet)

NameError: name 'char' is not defined
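
For completeness, here is a sketch of a full decrypt() function built from that line. It mirrors the structure of encrypt() above; the function body is an assumption about how the exercise intends it to look.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def decrypt(text, key):
    """Shift each letter of text back by key positions (assumed counterpart of encrypt())."""
    decrypted_text = ''
    for char in text.lower():
        # Subtract the key instead of adding it; % wraps around the alphabet
        idx = (alphabet.index(char) - key) % len(alphabet)
        decrypted_text = decrypted_text + alphabet[idx]
    return decrypted_text

print(decrypt('nkdkmkwz', 10))  # datacamp
```

Like encrypt(), this sketch only handles letters from `alphabet`; any other character makes .index() raise a ValueError.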

#### Operations on strings
You are given the variable text storing the following string 'StRing ObJeCts haVe mANy inTEResting pROPerTies'.

Your task is to modify this string so that the result is 'string OBJECTS have MANY interesting PROPERTIES' (the words alternate between lowercase and uppercase, starting with lowercase). You will obtain this result in three steps.

In [7]:
text = 'StRing ObJeCts haVe mANy inTEResting pROPerTies'

# Create a word list from the string stored in 'text'
word_list = text.split()

# Make every other word lowercased; otherwise - uppercased
for i in range(len(word_list)):
    if (i + 1) % 2 == 0:
        word_list[i] = word_list[i].upper()
    else:
        word_list[i] = word_list[i].lower()

print(word_list)

# Join the words back and form a new string
new_text = ' '.join(word_list)
print(new_text)

['string', 'OBJECTS', 'have', 'MANY', 'interesting', 'PROPERTIES']
string OBJECTS have MANY interesting PROPERTIES


#### Fixing string errors in a DataFrame
You are given the heroes dataset containing the information on different comic book heroes. However, you'll need to make some refinements in order to use this dataset further.

Comparing the Eye color, Hair color, and Skin color columns, you can see that the strings in the Hair color column are capitalized, whereas in the other two columns the strings are lowercased.

Moreover, some rows in the Gender column contain a spelling error (Fmale instead of Female).

Your task is to make the strings in the Hair color column lowercased and to fix the spelling error in the Gender column.

In [8]:
# Make all the values in the 'Hair color' column lowercased
heroes['Hair color'] = heroes['Hair color'].str.lower()
  
# Check the values in the 'Hair color' column
print(heroes['Hair color'].value_counts())

# Substitute 'Fmale' with 'Female' in the 'Gender' column
heroes['Gender'] = heroes['Gender'].str.replace('Fmale', 'Female')

# Check that there are no occurrences of 'Fmale'
print(heroes['Gender'].value_counts())

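NameError: name 'heroes' is not defined

(The heroes DataFrame is provided by the exercise environment; outside of it, the cell raises the NameError above.)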

### How to write regular expressions in Python?

1. How to write regular expressions in Python?
In this lesson we'll cover how to write regular expressions in Python.

2. Definition
A regular expression is a sequence of special characters or so-called metacharacters defining a pattern to search in a text. The easiest sequence is just a sequence of letters as in this example. If we provide some text and search for this sequence,

3. Definition
we will find the following matches. Note that, apart from the word "cat", the sequence also matches the beginning of the word "catches". But what should we do if we want to search for more complicated patterns?

4. Complex patterns
Let's check with this text. We want to have a sequence that matches

5. Complex patterns
all the e-mail addresses in it. A simple character sequence isn't enough. Therefore, metacharacters are needed.

6. Special characters
In a regular expression, metacharacters are mapped to real characters. Some of them are simple and map onto themselves. A dot metacharacter matches any character (except a newline, by default). But a dot prefixed with a backslash matches a literal dot character.

7. Special characters
The following metacharacters consist of a backslash followed by a letter. \w maps to any alphanumeric character or underscore. \d maps to any digit. \s maps to any whitespace character.

8. Square brackets
Several characters can be enclosed in square brackets, which are themselves a metacharacter. In this case, the pattern matches any one of the enclosed characters. There are also short versions for some frequently used ranges: [a-z] for any lowercase letter, [A-Z] for any uppercase letter, and [0-9] for any digit. We can also combine them.

9. Repetitions
Complex or simple metacharacters can be followed by symbols indicating how many times the associated character is repeated. "*" indicates zero or more repetitions. "+" indicates one or more repetitions. "?" indicates zero or one occurrence. "{m,n}" indicates the lower bound m and the upper bound n for the number of repetitions.
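
The metacharacters and repetition symbols from the last few slides can be tried with re.findall(). The sample strings below are made up for illustration.

```python
import re

any_char = re.findall(r'c.t', 'cat cot c.t')        # dot matches any character
literal_dot = re.findall(r'c\.t', 'cat cot c.t')    # escaped dot matches only '.'
digits = re.findall(r'\d+', 'room 101, floor 3')    # one or more digits
lowercase = re.findall(r'[a-z]+', 'Hello World')    # runs of lowercase letters
star = re.findall(r'ab*', 'a ab abbb')              # 'a' followed by zero or more 'b'
optional = re.findall(r'colou?r', 'color colour')   # 'u' present or absent
bounded = re.findall(r'\d{2,3}', '1 22 333 4444')   # two or three digits in a row

print(any_char, literal_dot, digits)
print(lowercase, star, optional, bounded)
```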

10. Regular expression for an e-mail
Returning to our previous example, a regular expression matching an e-mail address can look like this. Let's break it down.

[\w\.]+@[a-z]+\.[a-z]+

11. Regular expression for an e-mail
This part maps to at least one letter, digit, underscore, or dot character.

[\w\.]+

12. Regular expression for an e-mail
The '@' symbol maps to itself.

13. Regular expression for an e-mail
This part maps to at least one lowercased letter.

[a-z]+

14. Regular expression for an e-mail
Backslash and dot map simply to a dot character.

15. Regular expression for an e-mail
And again, mapping to at least one lowercased letter.

16. re package
We defined a regular expression. But how do we use it programmatically? The re package comes to help! Once we have defined an expression, we can pass it to the re.compile() function. Note that we use the "r" prefix before the expression to make it a raw string, so that Python does not treat the backslashes as escape characters. The next step is to use it against our text. We'll cover a couple of functions to do so.

import re
pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')

17. re.finditer()
The finditer() function returns a special object given a pattern and text. We can use this object in a for loop. In this case each item will represent a Match object containing the information about a single match in our text.

18. re.finditer()
To retrieve this information, we can call the following methods on our Match object. The .group() method will return the matching substring. The .start() and .end() methods return the start and end indices of the matching substring in a given text.
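
A minimal sketch of re.finditer() with the e-mail pattern from above. The sample `text` and the addresses in it are made up.

```python
import re

pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')

# Hypothetical sample text
text = 'Contact john.doe@example.com or support@site.org for help.'

found = []
for match in re.finditer(pattern, text):
    # .group() is the matched substring; .start() and .end() are its indices in text
    found.append((match.group(), match.start(), match.end()))
    print(match.group(), match.start(), match.end())
```

Slicing `text[start:end]` with the returned indices gives back the matched substring.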

19. re.findall()
If we are only interested in the matching substrings, we can use the findall() function. It simply returns a list of substrings representing the matches to our pattern.
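
The same search with re.findall(), again on a made-up sample text:

```python
import re

pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')

# Hypothetical sample text
text = 'Contact john.doe@example.com or support@site.org for help.'

emails = re.findall(pattern, text)   # a plain list of matching substrings
print(emails)
```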

20. re.split()
Another interesting function is called split(). Instead of returning matches, it splits a given string at each match of the pattern. This results in a list of strings of the following form.
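
A small sketch of re.split(); the pattern and input string are made up for illustration.

```python
import re

# Split wherever one or more digits occur (hypothetical example)
pattern = re.compile(r'\d+')
pieces = re.split(pattern, 'one1two22three')

print(pieces)  # ['one', 'two', 'three']
```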

21. Let's practice!
That was a concise reminder on regular expressions. Now, let's practice our skills!

#### Write a regular expression
Let's write some regular expressions!

Your task is to create a regular expression matching a valid temperature represented either in Celsius or Fahrenheit scale (e.g. '+23.5 C', '-4 F', '0.0 C', '73.45 F') and to extract all the appearances from the given string text. Positive temperatures can be with or without the + prefix (e.g. '5 F', '+5 F'). Negative temperatures must be prefixed with -. Zero temperature can be used with a prefix or without.

The re module is already imported.

Tips:

 - The + symbol within the square brackets [] matches the + symbol itself (e.g. the regular expression [1a+] matches '1', 'a', or '+').
 - You can also apply ? to the characters within the square brackets [] to make the set optional (e.g. [ab]? matches 'a', 'b', or '').


In [11]:
import re

text = """Let's consider the following temperatures using the Celsius scale: +23 C, 0 C, -20.0 C, -2.2 C, -5.65 C, 0.0001 C. To convert them to the Fahrenheit scale you have to multiply the number by 9/5 and add 32 to the result. Therefore, the corresponding temperatures in the Fahrenheit scale will be: +73.4 F, 32 F, -4.0 F, +28.04 F, 21.83 F, +32.00018 F."""

# Define the pattern to search for valid temperatures
pattern = re.compile(r'[+-]?\d+\.?\d* [CF]')

# Print the temperatures out
print(re.findall(pattern, text))

# Create an object storing the matches using 'finditer()'
matches_storage = re.finditer(pattern, text)

# Loop over matches_storage and print out item properties
for match in matches_storage:
    print('matching sequence = ' + match.group())
    print('start index = ' + str(match.start()))
    print('end index = ' + str(match.end()))

['+23 C', '0 C', '-20.0 C', '-2.2 C', '-5.65 C', '0.0001 C', '+73.4 F', '32 F', '-4.0 F', '+28.04 F', '21.83 F', '+32.00018 F']
matching sequence = +23 C
start index = 67
end index = 72
matching sequence = 0 C
start index = 74
end index = 77
matching sequence = -20.0 C
start index = 79
end index = 86
matching sequence = -2.2 C
start index = 88
end index = 94
matching sequence = -5.65 C
start index = 96
end index = 103
matching sequence = 0.0001 C
start index = 105
end index = 113
matching sequence = +73.4 F
start index = 295
end index = 302
matching sequence = 32 F
start index = 304
end index = 308
matching sequence = -4.0 F
start index = 310
end index = 316
matching sequence = +28.04 F
start index = 318
end index = 326
matching sequence = 21.83 F
start index = 328
end index = 335
matching sequence = +32.00018 F
start index = 337
end index = 348


#### Find the correct pattern
You visit a website and it asks you to fill in a registration form. The username section requires you to choose a name between 3 and 16 characters long. It can only include alphanumeric characters (no capital letters), hyphens, and underscores. Which of the following patterns will your input be matched against?

The symbols ^ and $ were not discussed in the video lecture: they simply indicate the beginning and end of a line (in this case, the beginning and end of the chosen username).

The module re is already imported for you.

In [None]:
^[a-z0-9_-]{3,16}$
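
The chosen pattern can be checked against a few candidate usernames. The candidates below are made up for illustration.

```python
import re

pattern = re.compile(r'^[a-z0-9_-]{3,16}$')

# Hypothetical usernames: valid, too short, contains a capital letter, too long
candidates = ['john_doe-42', 'ab', 'John', 'a_very_long_username_123']

valid = [name for name in candidates if pattern.match(name)]
print(valid)
```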

In [12]:
movies = ['1984, 1984, Michael Radford', 'The Good, the Bad and the Ugly, 1966, Sergio Leone', 'Terminator 2: Judgment Day, 1991, James Cameron', "Harry Potter and the Philosopher's Stone, 2001, Chris Columbus", 'Back to the Future, 1985, Robert Zemeckis', 'No Country for Old Men, 2007, Joel Coen, Ethan Coen']

# Compile a regular expression
pattern = re.compile(r', \d+, ')

movies_without_year = []
for movie in movies:
    # Retrieve a movie name and its director
    split_result = re.split(pattern, movie)
    # Create a new string with a movie name and its director
    movie_and_director = ', '.join(split_result)
    # Append the resulting string to movies_without_year
    movies_without_year.append(movie_and_director)
    
for movie in movies_without_year:
    print(movie)

1984, Michael Radford
The Good, the Bad and the Ugly, Sergio Leone
Terminator 2: Judgment Day, James Cameron
Harry Potter and the Philosopher's Stone, Chris Columbus
Back to the Future, Robert Zemeckis
No Country for Old Men, Joel Coen, Ethan Coen
