**This notebook is an exercise in the [Python](https://www.kaggle.com/learn/python) course.  You can reference the tutorial at [this link](https://www.kaggle.com/colinmorris/strings-and-dictionaries).**

---


You are almost done with the course. Nice job!

We have a couple more interesting problems for you before you go. 

As always, run the setup code below before working on the questions.

In [1]:
from learntools.core import binder; binder.bind(globals())
from learntools.python.ex6 import *
print('Setup complete.')

Setup complete.


Let's start with a string lightning round to warm up. What are the lengths of the strings below?

For each of the five strings below, predict what `len()` would return when passed that string. Use the variable `length` to record your answer, then run the cell to check whether you were right.  

# 0a.

In [2]:
a = ""
length = len(a)
print(length)

q0.a.check()

0


<span style="color:#33cc33">Correct:</span> 

The empty string has length zero. Note that the empty string is also the only string that Python treats as `False` when converted to a boolean.
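A quick way to see the boolean behavior is to convert a few strings with `bool()`. The sample strings below are chosen only for illustration; they are not part of the exercise.

```python
# Only the empty string converts to False; any string with at least
# one character (even a space or the text "False") converts to True.
samples = ["", " ", "0", "False"]
truthiness = {s: bool(s) for s in samples}
print(truthiness)
# {'': False, ' ': True, '0': True, 'False': True}
```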

# 0b.

In [3]:
b = "it's ok"
length = len(b)
print(length)

q0.b.check()

7


<span style="color:#33cc33">Correct:</span> 

Keep in mind Python includes spaces (and punctuation) when counting string length.

# 0c.

In [4]:
c = 'it\'s ok'
length = len(c)
print(length)

q0.c.check()

7


<span style="color:#33cc33">Correct:</span> 

Even though we use different syntax to create it, the string `c` is identical to `b`. In particular, note that the backslash is not part of the string, so it doesn't contribute to its length.
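As a small check of that claim, the two spellings can be compared directly. The variable names here are illustrative and not taken from the exercise.

```python
double_quoted = "it's ok"   # apostrophe needs no escaping inside double quotes
escaped = 'it\'s ok'        # backslash escapes the apostrophe inside single quotes

print(double_quoted == escaped)  # True
print("\\" in escaped)           # False: the backslash is not stored in the string
print(len(escaped))              # 7
```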

# 0d.

In [5]:
d = """hey"""
length = len(d)
print(length)

q0.d.check()

3


<span style="color:#33cc33">Correct:</span> 

The fact that this string was created using triple-quote syntax doesn't make any difference in terms of its content or length. This string is exactly the same as `'hey'`.
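Triple quotes only change the content when the literal actually spans several lines, in which case each line break is stored as a newline character. A minimal sketch, with made-up variable names:

```python
single_line = """hey"""
multi_line = """hey
there"""

print(single_line == 'hey')  # True
print(len(multi_line))       # 9: 'hey' + one newline + 'there'
```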

# 0e.

In [6]:
e = '\n'
length = len(e)
print(length)

q0.e.check()

1


<span style="color:#33cc33">Correct:</span> 

The newline character is just a single character! (Even though we represent it to Python using a combination of two characters.)
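To see the difference between an escape sequence and the two literal characters, it can help to compare against a raw string, where the backslash is kept as-is. The names below are illustrative only.

```python
newline = '\n'      # escape sequence: one newline character
raw = r'\n'         # raw string: a backslash followed by the letter n

print(len(newline))   # 1
print(len(raw))       # 2
print(len('a\tb'))    # 3: the tab escape also counts as a single character
```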

# 1.

There is a saying that "Data scientists spend 80% of their time cleaning data, and 20% of their time complaining about cleaning data." Let's see if you can write a function to help clean US zip code data. Given a string, it should return whether or not that string represents a valid zip code. For our purposes, a valid zip code is any string consisting of exactly 5 digits.

HINT: `str` has a method that will be useful here. Use `help(str)` to review a list of string methods.

In [7]:
def is_valid_zip(zip_code):
    """Return whether the input string is a valid (5-digit) zip code."""
    return len(zip_code) == 5 and zip_code.isdigit()

# Check your answer
q1.check()

<span style="color:#33cc33">Correct</span>

In [8]:
q1.hint()
q1.solution()

<span style="color:#3366cc">Hint:</span> Try looking up `help(str.isdigit)`

<span style="color:#33cc99">Solution:</span> 
```python
def is_valid_zip(zip_code):
    return len(zip_code) == 5 and zip_code.isdigit()
```
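The sketch below runs the solution on a few hand-picked inputs. One detail worth knowing is that `str.isdigit()` also accepts some non-ASCII digit characters, such as the superscript two. The stricter `is_valid_zip_ascii` helper is a hypothetical variant added here for comparison; it is not part of the exercise and relies on `str.isascii()`, which requires Python 3.7 or later.

```python
def is_valid_zip(zip_code):
    # Restated from the solution so this block runs on its own.
    return len(zip_code) == 5 and zip_code.isdigit()

def is_valid_zip_ascii(zip_code):
    # Hypothetical stricter variant: additionally require ASCII characters.
    return len(zip_code) == 5 and zip_code.isascii() and zip_code.isdigit()

# "\u00b2" is the superscript two, which isdigit() treats as a digit.
candidates = ["12345", "1234", "123456", "1234a", "12 45", "1234\u00b2"]
for candidate in candidates:
    print(repr(candidate), is_valid_zip(candidate), is_valid_zip_ascii(candidate))
```

For the exercise itself the original solution is sufficient; the ASCII check only matters if the input might contain unusual Unicode characters.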

# 2.

A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles.

Your function should meet the following criteria:

- Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.” 
- She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”.
- Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.

In [9]:
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword. 
    Returns list of the index values into the original list for all documents 
    containing the keyword.

    Example:
    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
    >>> word_search(doc_list, 'casino')
    [0]
    """
    # Avoid naming this variable `list`, which would shadow the built-in type.
    indices = []
    # Iterate through the indices (i) and elements (doc) of doc_list
    for i, doc in enumerate(doc_list):
        # Split the document into a list of words (according to whitespace)
        words = doc.split()
        # Normalize each word to facilitate matching: strip trailing periods
        # and commas, then convert to lowercase.
        if keyword.lower() in [word.rstrip('.,').lower() for word in words]:
            indices.append(i)
    return indices
            

# Check your answer
q2.check()

<span style="color:#33cc33">Correct</span>

In [10]:
q2.hint()
q2.solution()

<span style="color:#3366cc">Hint:</span> Some methods that may be useful here: `str.split()`, `str.strip()`, `str.lower()`.

<span style="color:#33cc99">Solution:</span> 
```python
def word_search(doc_list, keyword):
    # list to hold the indices of matching documents
    indices = [] 
    # Iterate through the indices (i) and elements (doc) of doc_list
    for i, doc in enumerate(doc_list):
        # Split the string doc into a list of words (according to whitespace)
        tokens = doc.split()
        # Make a transformed list where we 'normalize' each word to facilitate matching.
        # Periods and commas are removed from the end of each word, and it's set to all lowercase.
        normalized = [token.rstrip('.,').lower() for token in tokens]
        # Is there a match? If so, update the list of matching indices.
        if keyword.lower() in normalized:
            indices.append(i)
    return indices
```
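To check the three criteria from the problem statement, the solution can be run on a small set of documents. The `docs` list below is made up for illustration, reusing the example phrases from the criteria.

```python
def word_search(doc_list, keyword):
    # Restated from the solution so this block runs on its own.
    indices = []
    for i, doc in enumerate(doc_list):
        normalized = [token.rstrip('.,').lower() for token in doc.split()]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices

docs = [
    "Closed the case.",          # matches despite the capital letter
    "The letter was enclosed.",  # no match: keyword is only part of a larger word
    "It is closed, for now.",    # matches despite the trailing comma
]
result = word_search(docs, "closed")
print(result)  # [0, 2]
```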

# 3.

Now the researcher wants to supply multiple keywords to search for. Complete the function below to help her.

(You're encouraged to use the `word_search` function you just wrote when implementing this function. Reusing code in this way makes your programs more robust and readable - and it saves typing!)

In [11]:
def multi_word_search(doc_list, keywords):
    """
    Takes a list of documents (each document is a string) and a list of keywords.
    Returns a dictionary where each key is a keyword, and the value is a list of indices
    (from doc_list) of the documents containing that keyword.

    >>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
    >>> keywords = ['casino', 'they']
    >>> multi_word_search(doc_list, keywords)
    {'casino': [0, 1], 'they': [1]}
    """
    indices = {}
    for keyword in keywords:
        indices[keyword] = word_search(doc_list, keyword)
    return indices
   
# Check your answer
q3.check()

<span style="color:#33cc33">Correct</span>

In [12]:
q3.solution()

<span style="color:#33cc99">Solution:</span> 
```python
def multi_word_search(documents, keywords):
    keyword_to_indices = {}
    for keyword in keywords:
        keyword_to_indices[keyword] = word_search(documents, keyword)
    return keyword_to_indices
```
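The same mapping can also be built with a dictionary comprehension. This is only an alternative sketch, not the course's official solution; `word_search` is restated so the block is self-contained, and the inputs are taken from the docstring example.

```python
def word_search(doc_list, keyword):
    # Restated from the previous exercise so this block runs on its own.
    indices = []
    for i, doc in enumerate(doc_list):
        normalized = [token.rstrip('.,').lower() for token in doc.split()]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices

def multi_word_search(documents, keywords):
    # One dictionary entry per keyword, each delegating to word_search.
    return {keyword: word_search(documents, keyword) for keyword in keywords}

doc_list = ["The Learn Python Challenge Casino.", "They bought a car and a casino", "Casinoville"]
matches = multi_word_search(doc_list, ['casino', 'they'])
print(matches)  # {'casino': [0, 1], 'they': [1]}
```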

# Keep Going

You've learned a lot. But even the best programmers rely heavily on "libraries" of code from other programmers. You'll learn about that in **[the last lesson](https://www.kaggle.com/colinmorris/working-with-external-libraries)**.


---




*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161283) to chat with other Learners.*