# Now You Code 2: Information Extraction

How do we make computers seem intelligent? One approach is to use *term extraction*. Term extraction is a type of information extraction where we attempt to find relevant terms in text. The relevant terms come from a *corpus*, or set of plausible terms we want to extract.

For example, suppose we have the text:

`One day I would like to visit Syracuse`

We, as smart humans, can be fairly confident that `Syracuse` is a place, more specifically a `city`.

A rudimentary method to make the computer interpret `Syracuse` as a place is to provide a corpus of cities and have the computer look up `Syracuse` in that corpus. 

In this code exercise we will do just that. Let's first write a function to read cities from the file `NYC2-cities.txt` into a corpus of cities, which will be represented in Python as a list.

Then we will write a main program loop that inputs some text and splits it into a list of words. If any of the words match a city in the corpus list, the program will output that the word is a city.

The program should handle upper / lower case matching. A good approach is to title case the input. 
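As a small illustration of the title-casing idea, the sketch below normalizes a few spellings of the same word with Python's built-in `str.title()`. The sample words are made up for this example and are not part of the assignment files.

```python
# str.title() upper-cases the first letter of each word and lower-cases the rest,
# so differently cased spellings of a word normalize to the same form.
samples = ["syracuse", "SYRACUSE", "sYrAcUsE"]  # made-up sample inputs
normalized = [word.title() for word in samples]
print(normalized)  # ['Syracuse', 'Syracuse', 'Syracuse']
```

Because the corpus stores city names in title case, normalizing the input this way lets a plain `in` lookup succeed regardless of how the user typed the word.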

IMPORTANT: Please note that our program will ONLY work for one-word cities, like `Syracuse`, and will not work for multiple-word cities like `San Diego`. Don't worry about that now.

SAMPLE RUN

```
Enter some text (or ENTER to quit): one day I would like to visit syracuse and rochester
Syracuse is a city
Rochester is a city
Enter some text (or ENTER to quit): austin is in texas
Austin is a city
Enter some text (or ENTER to quit): 
Quitting...
```

Once again we will solve this problem using the problem simplification approach. First, we will write the `load_city_corpus` function to build our city list. Second, we will write the `is_a_city` function, which, given a word and a city list, will return `True` when the word is a city. Finally, we conclude with the main program, which finds cities in our text, as demonstrated in our sample run.
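To make the `is_a_city` description concrete, here is a minimal sketch of a function that takes one word and a city list and returns `True` or `False`. It assumes a tiny hard-coded list, `sample_cities`, as a made-up stand-in for whatever `load_city_corpus` returns; it is one possible reading of the description, not the required solution.

```python
def is_a_city(word, city_list):
    """Return True when the title-cased word appears in city_list."""
    return word.title() in city_list

# Hypothetical stand-in for the corpus that would be loaded from the file.
sample_cities = ["Syracuse", "Rochester", "Austin"]

print(is_a_city("syracuse", sample_cities))  # True
print(is_a_city("texas", sample_cities))     # False
```

Keeping the function free of `input` and `print` calls means the main program decides what to display, and the function can be checked on its own.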

## Step 1: Problem Analysis for `load_city_corpus`

Inputs: None (reads from a file)

Outputs: a Python list of cities

Algorithm (Steps in Program):
1. Make a function load_city_corpus()
2. Make a list for city_list
3. Set file name "NYC2-cities.txt"
4. Open the file as f (specify UTF-8 encoding to avoid an encoding error)
5. For each city in f, strip the line and append it to city_list
6. return list city_list

In [2]:
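## Step 2: write the definition for the load_city_corpus function
def load_city_corpus():
    city_list = []
    filename = "NYC2-cities.txt"
    # Specify the encoding explicitly to avoid a decoding error on some systems
    with open(filename, encoding='UTF8') as f:
        for city in f:
            # strip() removes the trailing newline from each line
            city_list.append(city.strip())
    return city_list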
        
load_city_corpus()

['New York',
 'Los Angeles',
 'Chicago',
 'Houston',
 'Philadelphia',
 'Phoenix',
 'San Antonio',
 'San Diego',
 'Dallas',
 'San Jose',
 'Austin',
 'Jacksonville',
 'San Francisco',
 'Indianapolis',
 'Columbus',
 'Fort Worth',
 'Charlotte',
 'Seattle',
 'El Paso',
 'Detroit',
 'Denver',
 'Washington',
 'Memphis',
 'Boston',
 'Nashville',
 'Baltimore',
 'Oklahoma City',
 'Portland',
 'Las Vegas',
 'Louisville',
 'Milwaukee',
 'Albuquerque',
 'Tucson',
 'Fresno',
 'Sacramento',
 'Long Beach',
 'Kansas City',
 'Mesa',
 'Atlanta',
 'Virginia Beach',
 'Omaha',
 'Colorado Springs',
 'Raleigh',
 'Miami',
 'Oakland',
 'Minneapolis',
 'Tulsa',
 'Cleveland',
 'Wichita',
 'New Orleans',
 'Arlington',
 'Bakersfield',
 'Tampa',
 'Aurora',
 'Honolulu',
 'Anaheim',
 'Santa Ana',
 'Corpus Christi',
 'Riverside',
 'St. Louis',
 'Lexington',
 'Pittsburgh',
 'Stockton',
 'Anchorage',
 'Cincinnati',
 'Saint Paul',
 'Greensboro',
 'Toledo',
 'Newark',
 'Plano',
 'Henderson',
 'Lincoln',
 'Orlando',
 'Jerse

## Step 3: Problem Analysis for `is_a_city`

Inputs: a string word and a Python list of cities

Outputs: True when the word is in the list of cities, otherwise False.

Algorithm (Steps in Program):

1. city_list = load_city_corpus()
2. Define the function is_a_city(cities, city_list)
3. Split the input text into words
4. Capitalize each word so that it matches the capitalization used in the corpus
5. Build matched_cities from the words that are in city_list
6. If there are any matches, loop over them and print each city name
7. Otherwise print "No cities found"

In [3]:
## Step 4: write the definition for the is_a_city function
def is_a_city(cities, city_list):
    # Title-case each word so that "boston" and "BOSTON" both match "Boston"
    words = [word.title() for word in cities.split()]
    matched_cities = [word for word in words if word in city_list]
    if matched_cities:
        for matched_city in matched_cities:
            print("%s is a city!" % (matched_city))
    else:
        print("No cities found")

city_list = load_city_corpus()
cities = input("Enter some text (or Enter to quit):")
is_a_city(cities, city_list)

Enter some text (or Enter to quit):syracuse and boston
Syracuse is a city!
Boston is a city!


## Step 5: Problem Analysis for entire program

Inputs:
1. User input

Outputs:
1. If the text contains a city, print the city name
2. If it does not, print that no cities were found

Algorithm (Steps in Program): (make sure to use the two functions we created)
1. Reuse the two functions defined above
2. Call load_city_corpus to load the cities into a list
3. Loop, asking the user for some text each time
4. Break out of the loop when the user chooses to quit
5. Otherwise call is_a_city to print any city names found in the text


In [4]:
## Step 6: Write complete program, making sure to use your two functions.

city_list = load_city_corpus()
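while True:
    cities = input("Enter some text (or Enter to quit):")
    # Quit on an empty entry (just pressing Enter), as the prompt promises;
    # typing "quit" also still works
    if cities.strip() in ("", "quit"):
        break
    is_a_city(cities, city_list)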

Enter some text (or Enter to quit):syracuse or boston
Syracuse is a city!
Boston is a city!
Enter some text (or Enter to quit):quit


## Step 7: Questions

1. Explain your approach to solving this problem for cities with 2 words like `New York` or `Los Angeles`.
   I check whether any of the words in the input match a city in the corpus, so the program prints two cities if the user input includes two cities.
2. How would you solve the problem where you enter a city name which is not in the corpus?
   We can solve this problem by printing a message that there is no city in the text, or we can write another function that appends a city name which is not in the corpus.
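
One possible way to approach the two-word case from Question 1 is to test runs of adjacent words, longest first, against the corpus. The sketch below is an illustration under assumptions: `find_cities`, its `max_words` parameter, and the `sample_cities` list are hypothetical names introduced here, not part of the assignment.

```python
def find_cities(text, city_list, max_words=3):
    """Return the cities found in text, checking runs of 1 to max_words words."""
    words = text.title().split()
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so "New York" is tried before "New".
        for size in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + size])
            if candidate in city_list:
                found.append(candidate)
                i += size  # skip past the words that formed the city name
                break
        else:
            i += 1  # no city starts at this word
    return found

# Hypothetical stand-in for the corpus loaded from the file.
sample_cities = ["New York", "Los Angeles", "Boston"]
print(find_cities("i flew from new york to los angeles via boston", sample_cities))
# ['New York', 'Los Angeles', 'Boston']
```

Returning a list rather than printing also gives the caller a natural way to handle the case in Question 2: an empty list means no word in the text matched the corpus.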


## Reminder of Evaluation Criteria

1. Was the problem attempted (analysis, code, and answered questions)?
2. Was the problem analysis thought out? (does the program match the plan?)
3. Does the code execute without syntax error?
4. Does the code solve the intended problem?
5. Is the code well written? (easy to understand, modular, and self-documenting, handles errors)
