# Project 1 

Live Project: Discovering Disease Outbreaks from News Headlines

Author: Bowden Wise (wisegb@ge.com)


In [5]:
import pandas as pd
import re
import geonamescache

# Prep

Use the [geonamescache](https://pypi.org/project/geonamescache/) library to obtain lists of countries and cities.

The lists of cities and countries are sorted by length (descending) so that, when searching headlines for matches, the longest matching name is found first.
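The effect of the sort order can be illustrated with a small sketch. The three names and the headline here are a toy stand-in for the full geonamescache list, and the variable names are illustrative only. A regular-expression alternation tries its alternatives from left to right, so a shorter name listed first can shadow a longer one that begins at the same position.

```python
import re

# Toy stand-in for the full city list (assumption: illustrative names only)
names = ["York", "New York", "New York City"]

# Alternatives are tried left to right at each position in the text
unsorted_pattern = re.compile("|".join(map(re.escape, names)))
sorted_pattern = re.compile(
    "|".join(map(re.escape, sorted(names, key=len, reverse=True)))
)

headline = "Could Zika Reach New York City?"
unsorted_match = unsorted_pattern.search(headline).group(0)
sorted_match = sorted_pattern.search(headline).group(0)

print(unsorted_match)  # the shorter "New York" wins because it is listed first
print(sorted_match)    # the longest name wins once the list is sorted by length
```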

In [6]:
gc = geonamescache.GeonamesCache()

In [7]:
city_names = []

for key, value in gc.get_cities().items(): 
    #print(value["name"])
    city_names.append(value["name"])

print("Loaded", len(city_names), " cities")
#print(city_names)
sorted_cities = sorted(city_names, key=len,reverse=True)

Loaded 24336  cities


In [8]:
country_names = []

for key, value in gc.get_countries().items(): 
    #print(value["name"])
    country_names.append(value["name"])
    
print("Loaded", len(country_names), " countries")
sorted_countries = sorted(country_names, key=len,reverse=True)
#print(sorted_countries)

Loaded 252  countries


# Regular Expression

Use Python's [re](https://docs.python.org/3/library/re.html) module to search headlines for countries and cities.

The search method scans a string for a pattern and returns None if no match is found. If a match is found, the group() method of the returned match object returns the part of the string where the match occurred.
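A minimal sketch of this behaviour, using a hand-written two-city pattern rather than the full lists (the pattern and variable names are assumptions for illustration):

```python
import re

# Hypothetical two-name pattern, just to show the return values of search()
pattern = re.compile(r"Miami|Dallas")

hit = pattern.search("Zika Outbreak Hits Miami")
miss = pattern.search("Geneve Scientists Battle to Find Cure")

# search() returns a match object on success and None otherwise,
# so group() must only be called after checking for None
hit_text = hit.group(0) if hit is not None else None
miss_text = miss.group(0) if miss is not None else None

print(hit_text, miss_text)
```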

Below we define two functions to perform a search for a city or country given a headline.

In [9]:
regexp_countries = re.compile(r'(?:{})'.format('|'.join(map(re.escape, sorted_countries))))
regexp_cities = re.compile(r'(?:{})'.format('|'.join(map(re.escape, sorted_cities))))

In [10]:
# Much faster than the naive loop below: each name list is compiled once into a single pattern
def match_cities(text):
    match = None
    m=regexp_cities.search(text)
    if m is not None:
        match = m.group(0)
    return match

def match_countries(text):
    match = None
    m=regexp_countries.search(text)
    if m is not None:
        match = m.group(0)
    return match
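One caveat: the compiled patterns above contain no word boundaries, so a short name can match inside a longer word. This is the likely cause of the "Mon" result for the "Zika Infested Monkeys in Sao Paulo" headline in the data frame output. The following is a sketch of one possible refinement, not the method used in this notebook; `toy_cities`, `build_pattern` and `first_match` are hypothetical names, and the list is a toy stand-in for `sorted_cities`.

```python
import re

# Toy stand-in for the geonamescache city list (assumption)
toy_cities = ["Miami Beach", "Miami", "Mon"]

def build_pattern(names, whole_words):
    """Compile one alternation of all names, longest first."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(map(re.escape, ordered))
    if whole_words:
        # \b on both sides requires the name to stand alone as a word
        return re.compile(r"\b(?:{})\b".format(alternation))
    return re.compile(r"(?:{})".format(alternation))

def first_match(pattern, text):
    m = pattern.search(text)
    return m.group(0) if m is not None else None

headline = "Zika Infested Monkeys in Sao Paulo"
loose = first_match(build_pattern(toy_cities, whole_words=False), headline)
strict = first_match(build_pattern(toy_cities, whole_words=True), headline)
beach = first_match(
    build_pattern(toy_cities, whole_words=True),
    "First Case of Zika in Miami Beach",
)

print(loose, strict, beach)
```

With word boundaries the substring inside "Monkeys" is no longer reported, while whole names such as "Miami Beach" are still found.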

In [11]:
# Naive version: try each name in turn with its own regular expression.
# Names are sorted longest first so that "New York" is found before "York".

def match_cities_naive(text):
    for city in sorted_cities:
        #if (city == "Miami"): print("**")
        # Escape the name so characters such as "." or "(" are matched literally
        raw_search_string = r"\b" + re.escape(city) + r"\b"
        match = re.search(raw_search_string, text, flags=re.IGNORECASE)
        #if (city == "Miami"): print("**", city, text, match)
        if (match is not None):
            return city
    return None

def match_countries_naive(text):
    for country in sorted_countries:
        # Escape the name so characters such as "." or "(" are matched literally
        raw_search_string = r"\b" + re.escape(country) + r"\b"
        match = re.search(raw_search_string, text, flags=re.IGNORECASE)
        if (match is not None):
            return country
    return None


# Build Data Frame

Here we build a data frame containing 3 columns:
- headline
- countries
- cities

## Load headlines from the headlines.txt file

In [14]:
headlines_filename = "../data/headlines.txt"

headlines = []

with open(headlines_filename, "r") as f:
    for line in f:
        #print(line.strip())
        headlines.append(line.strip())
        
print(len(headlines), " headlines read")

650  headlines read


## Create the data frame

Use the apply method to create the columns for countries and cities by calling our match functions defined above.

In [15]:
# Create data frame
df = pd.DataFrame()
df["headline"] = headlines
df["countries"] = df["headline"].apply(match_countries)
df["cities"] = df["headline"].apply(match_cities)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650 entries, 0 to 649
Data columns (total 3 columns):
headline     650 non-null object
countries    16 non-null object
cities       610 non-null object
dtypes: object(3)
memory usage: 15.4+ KB


In [17]:
df

| | headline | countries | cities |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | | Miami |
| 1 | Could Zika Reach New York City? | | New York City |
| 2 | First Case of Zika in Miami Beach | | Miami Beach |
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 4 | Dallas man comes down with case of Zika | | Dallas |
| 5 | Trinidad confirms first Zika case | | Trinidad |
| 6 | Zika Concerns are Spreading in Houston | | Houston |
| 7 | Geneve Scientists Battle to Find Cure | | |
| 8 | The CDC in Atlanta is Growing Worried | | Atlanta |
| 9 | Zika Infested Monkeys in Sao Paulo | | Mon |


# Review Results

In [18]:
# Find rows where countries is Brazil
df_country_rows = df[df["countries"] == "Brazil"]
df_country_rows

| | headline | countries | cities |
|---|---|---|---|
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 44 | Zika outbreak in Piracicaba, Brazil | Brazil | Piracicaba |
| 78 | Student sick in Campinas, Brazil | Brazil | Campinas |


In [19]:
# Find rows where countries is not null
bool_series = pd.notnull(df["countries"]) 
df[bool_series] 

| | headline | countries | cities |
|---|---|---|---|
| 3 | Mystery Virus Spreads in Recife, Brazil | Brazil | Recife |
| 25 | Zika cases in Vietnam's Ho Chi Minh City surge | Vietnam | Ho Chi Minh City |
| 30 | Thailand-Zika Virus in Bangkok | Thailand | Bangkok |
| 44 | Zika outbreak in Piracicaba, Brazil | Brazil | Piracicaba |
| 58 | Zika surfaces in Klang, Malaysia | Malaysia | Klang |
| 59 | Rumors about Meningitis spreading in Guatemala... | Guatemala | Guatemala City |
| 77 | Belize City under threat from Zika | Belize | Belize City |
| 78 | Student sick in Campinas, Brazil | Brazil | Campinas |
| 83 | Zika outbreak spreads to Mexico City | Mexico | Mexico City |
| 124 | New Zika Case in Kota Kinabalu, Malaysia | Malaysia | Kota Kinabalu |


In [20]:
# Find rows where cities is null (no city was matched)
bool_series = pd.isnull(df["cities"]) 
df[bool_series] 

| | headline | countries | cities |
|---|---|---|---|
| 7 | Geneve Scientists Battle to Find Cure | | |
| 17 | Louisiana Zika cases up to 26 | | |
| 19 | Zika infects pregnant woman in Cebu | | |
| 63 | Carnival under threat in Rio De Janeiro due to... | | |
| 73 | Zika case reported in Oton | | |
| 82 | Hepatitis B Vaccine is now Required in Quebec | | |
| 86 | Visitor to Cucuta contracts Zika | | |
| 88 | Maka City Experiences Influenza Outbreak | | |
| 115 | Spreading Zika ruining Cancun fun | | |
| 116 | Cancun hit by Outbreak of Party Fever! | | |


# Future Work

Note that geonamescache city names use local spellings with accents (e.g., Québec), so an unaccented spelling in a headline (e.g., Quebec) is not matched.

In [21]:
m = match_cities("Quebec")
print(m)

None


In [22]:
# Qu\u00e9bec
m = match_cities("Qu\u00e9bec")
print(m)

Québec
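One possible direction, sketched here under stated assumptions rather than as part of this notebook's results: strip accents from both the city names and the headline before matching, then map the match back to the original spelling. The sketch uses only the standard library `unicodedata` module. `toy_cities`, `strip_accents` and `match_city_accent_insensitive` are hypothetical names, and the toy list assumes spellings that may differ from the actual geonamescache entries.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Toy stand-in for the geonamescache city names (assumed spellings)
toy_cities = ["Québec", "Cancún", "São Paulo"]

# Map each accent-free spelling back to the original name
by_plain_name = {strip_accents(name): name for name in toy_cities}
plain_names = sorted(by_plain_name, key=len, reverse=True)
pattern = re.compile(
    r"\b(?:{})\b".format("|".join(map(re.escape, plain_names)))
)

def match_city_accent_insensitive(text):
    m = pattern.search(strip_accents(text))
    return by_plain_name[m.group(0)] if m is not None else None

print(match_city_accent_insensitive("Hepatitis B Vaccine is now Required in Quebec"))
print(match_city_accent_insensitive("Zika Infested Monkeys in Sao Paulo"))
```

Returning the original accented name keeps the result consistent with the names stored in geonamescache, which may matter if the match is later used to look up coordinates.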
