<a href="https://colab.research.google.com/github/rahiakela/data-science-research-and-practice/blob/main/data-science-bookcamp/case-study-3--disease-outbreaks/03_case_study_disease_outbreaks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Case Study: disease outbreaks

Our goal is to extract locations from disease-related headlines to uncover the largest active epidemics within and outside of the United States. 

We will proceed as follows:

1. Load the data.
2. Extract locations from the text using regular expressions and the GeoNamesCache library.
3. Check the location matches for errors.
4. Cluster the locations based on geographic distance.
5. Visualize the clusters on a map, and remove any errors.
6. Output representative locations from the largest clusters to draw interesting conclusions.

##Setup

Reference:
https://colab.research.google.com/github/astg606/py_materials/blob/master/visualization/introduction_cartopy.ipynb

In [None]:
!apt-get install libproj-dev proj-data proj-bin
!apt-get install libgeos-dev
!pip install cython
!pip install cartopy

In [None]:
!apt-get -qq install python-cartopy python3-cartopy
!pip uninstall -y shapely    # cartopy and shapely aren't friends (early 2020)
!pip install shapely --no-binary shapely
!pip install geonamescache
!pip install Unidecode

In [None]:
!wget https://github.com/rahiakela/data-science-research-and-practice/raw/main/data-science-bookcamp/case-study-3--disease-outbreaks/headlines.txt

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
from collections import defaultdict
import itertools
import re
import numpy as np
import pandas as pd
from scipy import stats
from math import cos, sin, asin
from math import pi

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

from scipy.spatial.distance import euclidean
from sklearn.datasets import make_circles

from geonamescache import GeonamesCache

from unidecode import unidecode

import cartopy
import seaborn as sns
import matplotlib.pyplot as plt
from cartopy.crs import PlateCarree
from cartopy.crs import LambertConformal

##Extracting locations from headline data

In [7]:
# Loading headline data
with open("headlines.txt", "r") as f:
  headlines = [line.strip() for line in f]

num_headlines = len(headlines)
print(f"{num_headlines} headlines have been loaded")

650 headlines have been loaded


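Next, we need a mechanism for extracting city and country names from the headline text. One approach is to convert each known location name into a regular expression that matches the name as a whole word, with or without accent marks, ignoring case.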

In [8]:
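# Converting names to regexes
def name_to_regex(name):
  # unidecode transliterates Unicode text to plain ASCII (accented letters lose their accents)
  decoded_name = unidecode(name)
  if name != decoded_name:
    # Match either the original spelling or its ASCII transliteration
    regex = fr"\b({name}|{decoded_name})\b"
  else:
    regex = fr"\b{name}\b"
  # \b anchors the match to word boundaries; matching ignores case
  return re.compile(regex, flags=re.IGNORECASE)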
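
To see what `name_to_regex` produces, here is a minimal, self-contained sketch. It assumes a stand-in helper, `strip_accents`, built on the standard library's `unicodedata` in place of `unidecode` (the two agree on simple accented Latin letters, but not in general), so that it runs without third-party packages. The function name `name_to_regex_sketch` and the sample texts are made up for illustration.

```python
import re
import unicodedata


def strip_accents(name):
    # Hypothetical stand-in for unidecode: decompose characters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_to_regex_sketch(name):
    # Same logic as name_to_regex, with strip_accents replacing unidecode.
    decoded_name = strip_accents(name)
    if name != decoded_name:
        regex = fr"\b({name}|{decoded_name})\b"
    else:
        regex = fr"\b{name}\b"
    return re.compile(regex, flags=re.IGNORECASE)


regex = name_to_regex_sketch("São Paulo")
print(regex.pattern)

# Illustrative texts: accented spelling, lowercase ASCII spelling, partial name.
samples = [
    "Zika spreads in São Paulo",
    "Zika spreads in sao paulo",
    "Zika spreads in Paulo",
]
for text in samples:
    print(text, "->", bool(regex.search(text)))
```

Note that the name is interpolated into the pattern unescaped. For names containing regex metacharacters such as `.` or `(`, wrapping the name in `re.escape` would be one way to keep the match literal.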

Let’s create two dictionaries, `country_to_name` and `city_to_name`. Their keys are compiled regular expressions, and their values are the corresponding country names and city names, respectively.

In [9]:
# Mapping names to regexes
gc = GeonamesCache()

countries = [country["name"] for country in gc.get_countries().values()]
country_to_name = {name_to_regex(name): name for name in countries}

cities = [city["name"] for city in gc.get_cities().values()]
city_to_name = {name_to_regex(name): name for name in cities}

Next, we use our mappings to define a function that looks for location names in text.

In [10]:
# Finding locations in text
def get_name_in_text(text, dictionary):
  # Check names in alphabetical order and return the first one found in the text
  for regex, name in sorted(dictionary.items(), key=lambda x: x[1]):
    if regex.search(text):
      return name
  return None
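
Because the dictionary items are sorted by name and the function returns on the first hit, a headline that mentions several known locations yields only the alphabetically first one. The sketch below illustrates this with a hand-built toy dictionary; the three names are chosen for illustration and stand in for the full GeoNamesCache list.

```python
import re


def get_name_in_text(text, dictionary):
    # Same function as above: return the alphabetically first name whose regex matches.
    for regex, name in sorted(dictionary.items(), key=lambda x: x[1]):
        if regex.search(text):
            return name
    return None


# Toy mapping from compiled regex to name (illustrative only).
toy_names = ["Miami Beach", "Miami", "Recife"]
toy_dictionary = {
    re.compile(fr"\b{name}\b", flags=re.IGNORECASE): name for name in toy_names
}

first = get_name_in_text("First Case of Zika in Miami Beach", toy_dictionary)
second = get_name_in_text("Mystery Virus Spreads in Recife, Brazil", toy_dictionary)
third = get_name_in_text("Dallas man comes down with case of Zika", toy_dictionary)
print(first, second, third)
```

In this toy setup, "Miami" sorts before "Miami Beach", so the shorter name wins even though the headline mentions the longer one. A headline with no known name returns `None`.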

In [11]:
# Finding locations in headlines
matched_countries = [get_name_in_text(headline, country_to_name) for headline in headlines]
matched_cities = [get_name_in_text(headline, city_to_name) for headline in headlines]
data = {"Headline": headlines, "City": matched_cities, "Country": matched_countries}

df = pd.DataFrame(data)
df.head()

| | Headline | City | Country |
|---|---|---|---|
| 0 | Zika Outbreak Hits Miami | Miami | |
| 1 | Could Zika Reach New York City? | New York City | |
| 2 | First Case of Zika in Miami Beach | Miami | |
| 3 | Mystery Virus Spreads in Recife, Brazil | Recife | Brazil |
| 4 | Dallas man comes down with case of Zika | Dallas | |


Let’s explore our location table by summarizing the contents.

In [12]:
# Summarizing the location data
df[["City", "Country"]].describe()

| | City | Country |
|---|---|---|
| count | 618 | 15 |
| unique | 511 | 10 |
| top | Of | Brazil |
| freq | 44 | 3 |


The most frequently mentioned city is apparently “Of,” Turkey. That doesn’t seem right!
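
A likely explanation, sketched below under the assumption that these matches come from the common English word rather than the Turkish city: the regexes are compiled with `re.IGNORECASE`, so the pattern for "Of" also matches the preposition "of". The headline is taken from the table above; the case-sensitive pattern is included only for comparison.

```python
import re

headline = "First Case of Zika in Miami Beach"

# Pattern equivalent to what name_to_regex builds for the city name "Of".
insensitive = re.compile(r"\bOf\b", flags=re.IGNORECASE)
# Case-sensitive variant, shown only for comparison.
sensitive = re.compile(r"\bOf\b")

print(bool(insensitive.search(headline)))  # matches the lowercase preposition "of"
print(bool(sensitive.search(headline)))    # no capitalized "Of" in this headline
```

Since `get_name_in_text` returns the alphabetically first match, "Of" is reported only for headlines in which no alphabetically earlier city name matches. This is why the headline above was assigned "Miami" instead.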

In [None]:
# Fetching cities named "Of"
