# Fetching breed data

 - Author: Telmo de Menezes e Silva Filho
 - email: tmfilho@gmail.com / telmo@de.ufpb.br
 - Github: @tmfilho
 - Date: 18 May 2020
 - Original data source: [American Kennel Club (AKC)](https://www.akc.org)

In [1]:
import re
import requests
import os
from datetime import datetime

from bs4 import BeautifulSoup
from bs4.element import Tag

import pandas as pd
# tqdm_notebook is deprecated in recent tqdm releases
from tqdm.notebook import tqdm

## Breed description

A breed's description appears in two fields on its page:

 1. Summarized in a `<div>` at the top of the page, under the main photo slides;
 2. Detailed description in a `<div>` in the "About" section.
 
Both fields may be missing from a breed's info page.
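
As a minimal sketch of the fallback-and-cleanup logic, assuming the two fields have already been extracted as plain strings: the helper name `combine_parts` and the sample strings below are hypothetical and only serve as illustration.

```python
def combine_parts(first_part, second_part):
    """Join two optional text fields and clean up extraction debris.

    Hypothetical helper for illustration; a missing field is passed as None.
    """
    # Treat a missing field as empty text, otherwise trim surrounding whitespace
    parts = [part.strip() if part else '' for part in (first_part, second_part)]
    description = ' '.join(parts)
    # Drop newlines and zero-width spaces, and turn non-breaking spaces
    # into regular spaces
    return (
        description
        .replace('\n', '')
        .replace('\u200b', '')
        .replace('\xa0', ' ')
    )


both = combine_parts('Loyal\u200b and calm.\n', 'A\xa0sturdy dog.')
only_second = combine_parts(None, 'A sturdy dog.')
neither = combine_parts(None, None)
```

Note that joining with a space leaves a leading space when the first field is missing, and a lone space when both are; a final `.strip()` may be worth adding if that matters downstream.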

In [2]:
def get_description(breed_soup):
    try:
        first_part = breed_soup.find(
                'div', class_='breed-info__content-wrap'
        ).get_text().strip()
    except AttributeError:  # find() returned None: field is missing
        first_part = ''
    
    try:
        second_part = breed_soup.find(
                'div', class_='breed-hero__footer'
        ).get_text().strip()
    except AttributeError:  # find() returned None: field is missing
        second_part = ''
    
    description = ' '.join([first_part, second_part])
    
    # Removing weird characters
    # Probably not exhaustive
    description = description.replace(
        '\n', '').replace('\u200b', '').replace('\xa0', ' ')
    return description

## Breed temperament

Breed temperament can be found in a `<span>` to the side of the main photo slides. This field may be missing.

In [3]:
def get_temperament(breed_soup):
    first_part = 'attribute-list__description attribute-list__text '
    second_part = 'attribute-list__text--lg mb4 bpm-mb5 pb0 d-block'
    class_ = first_part + second_part
    try:
        return breed_soup.find(
            'span', class_=class_
        ).get_text()
    except AttributeError:  # find() returned None: field is missing
        return ''

## Main breed attributes

This is a list including physical attributes and the breed's popularity rank out of the 195 most popular breeds. This information is contained in `<span>` fields under the temperament information. Breed popularity is missing for 79 of the 277 breeds.

### Breed popularity

In [4]:
def get_popularity(popularity_span):
    pop_text = popularity_span.get_text()
    return {'popularity': pop_text.split()[1]}

### Breed height, weight and life expectancy

Breed height, weight and life expectancy can show up in many forms. They can be given as ranges or as a single average number. They can also be listed separately for males and females or for different size categories. Additionally, life expectancy may be given as a minimum value, such as "12+". Finally, numbers may be missing altogether, with something like "Weight: Proportionate to height" listed instead, as in the [Cane Corso](https://www.akc.org/dog-breeds/cane-corso/) page.

Thus we used a regular expression to capture any numbers that appear in the descriptions of these attributes. We then return the minimum and the maximum values. If only a single value is found, we return the same number for min and max. If no numbers are found, we return 0 for both min and max, representing missing data. The regex function receives as arguments the `<span>` text, the variable of interest (height, weight or expectancy) and a multiplier. This multiplier is used to convert values in inches or pounds to the metric system.
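
To make the min/max rule concrete, here is a small sketch of the same idea applied to plain strings. The helper name `min_max` and the sample texts are made up for illustration; they are not taken from actual AKC pages.

```python
import re

# One or more digits, optionally followed by a decimal part
NUMBER = re.compile(r'(\d+\.?\d*)')


def min_max(text, mul=1):
    """Return (min, max) of the numbers found in text, scaled by mul."""
    numbers = [float(value) * mul for value in NUMBER.findall(text)]
    if not numbers:
        # No numbers at all: 0 marks missing data
        return 0, 0
    # With a single number, min and max are the same value
    return min(numbers), max(numbers)


interval = min_max('23-25 inches')
by_sex = min_max('25-27 inches (male), 23-25 inches (female)')
open_ended = min_max('12+ years')
missing = min_max('Proportionate to height')
metric = min_max('10 inches', mul=2.54)
```

Here `interval` is `(23.0, 25.0)`, `by_sex` spans both sexes as `(23.0, 27.0)`, `open_ended` collapses to `(12.0, 12.0)`, and `missing` is `(0, 0)`. One consequence of this design is that a breed with several size categories gets a single range covering all of them.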

In [5]:
def general_regex(text, var, mul=1):
    # Raw string avoids invalid-escape warnings for \d and \.
    reg = re.compile(r'(\d+\.?\d*)')
    results = reg.findall(text)
    numbers = [float(value) * mul for value in results]
    if len(numbers) == 1:
        numbers = numbers * 2
    elif len(numbers) == 0:
        numbers = [0, 0]
    return {
        'min_{}'.format(var): min(numbers),
        'max_{}'.format(var): max(numbers)
    }

In [6]:
def get_height(height_span):
    ht_text = height_span.get_text()
    
    # one inch corresponds to 2.54 cm
    return general_regex(ht_text, 'height', 2.54)

In [7]:
def get_weight(weight_span):
    wt_text = weight_span.get_text()
    
    # one pound corresponds to 0.45359237 kg
    return general_regex(wt_text, 'weight', 0.45359237)

In [8]:
def get_expectancy(expectancy_span):
    exp_text = expectancy_span.get_text()
    return general_regex(exp_text, 'expectancy') 

### Breed group

The AKC classifies 198 of the 277 breeds into seven main groups:

 1. Sporting;
 2. Hound;
 3. Working;
 4. Terrier;
 5. Toy;
 6. Non-Sporting;
 7. Herding.
 
The remaining 79 breeds are categorized into two extra groups:

 1. Miscellaneous Class;
 2. Foundation Stock Service.


In [9]:
def get_group(group_span):
    return {'group': group_span.get_text()}

### Returning the attributes

Two problems are addressed by the dict and function below: 
 1. Popularity may be missing;
 2. All attributes are available in id-less `<span>` tags with the same class attribute.
 
Therefore we take all `<span>` tags with class `'attribute-list__description attribute-list__text'` (which contain the attribute values) and all `<span>` tags with class `'attribute-list__term attribute-list__text'` (which contain the attribute names). We then loop over both lists in parallel, calling the function stored in the `attr_function` dict under each attribute name.
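
The dispatch-by-name idea can be sketched without any HTML, using plain `(name, value)` string pairs. The parser names, the pairs, and the assumed popularity format ("Ranks 148 of 195") are hypothetical stand-ins for what the spans might contain.

```python
def parse_popularity(text):
    # Assumes text shaped like "Ranks 148 of 195"; keeps the rank as a string
    return {'popularity': text.split()[1]}


def parse_group(text):
    return {'group': text}


# Maps an attribute name to the function that parses its value
parsers = {
    'AKC Breed Popularity': parse_popularity,
    'Group': parse_group,
}

# Hypothetical (name, value) pairs as they might come out of the spans
pairs = [
    ('AKC Breed Popularity:', 'Ranks 148 of 195'),
    ('Group:', 'Toy Group'),
]

attributes = {}
for term, value in pairs:
    # Strip the trailing colon so the name matches a key in the dict
    name = term.replace(':', '').strip()
    attributes.update(parsers[name](value))
```

When an attribute such as popularity is absent from the page, its pair simply never appears, so the corresponding key is left out of `attributes` rather than being filled with a placeholder.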

In [10]:
attr_function = {
    'AKC Breed Popularity': get_popularity,
    'Height': get_height,
    'Weight': get_weight,
    'Life Expectancy': get_expectancy,
    'Group': get_group
}

def get_main_attributes(breed_soup):
        
    breed_attr_terms = breed_soup.find_all(
        'span', class_='attribute-list__term attribute-list__text'
    )
    # When present, the first span is the temperament
    # (guarding against pages with no attribute spans at all)
    if breed_attr_terms and 'Temperament' in breed_attr_terms[0].get_text():
        breed_attr_terms = breed_attr_terms[1:]
    
    breed_attr_values = breed_soup.find_all(
        'span', class_='attribute-list__description attribute-list__text'
    )
    
    attributes = {}
    
    for term_span, value_span in zip(breed_attr_terms, breed_attr_values):
        term = term_span.get_text().replace(':', '')
        attributes.update(attr_function[term](value_span))
    
    return attributes

## Breed care

Information about the breed's care requirements is given in five "progress bars", which are actually `<div>` tags with a relative width corresponding to how much of a certain kind of care the breed needs. Each bar is accompanied by a text category for the breed.
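
As a rough sketch of how a bar becomes a number, the snippet below converts a style string to a fraction and a bar title to a column prefix. It assumes a style of the form `width: 60%;` and titles such as "Temperament/Demeanor"; both the helper names and the sample inputs are assumptions for illustration.

```python
def width_fraction(style):
    """Convert a style string such as 'width: 60%;' to a 0-1 fraction."""
    # The second whitespace-separated token is expected to look like '60%;'
    percent = style.split()[1].split('%')[0]
    return float(percent) / 100


def column_key(title):
    """Build a column prefix from a bar title, keeping the text after '/'."""
    key = title.lower().replace(' ', '_')
    # find() returns -1 when there is no '/', which keeps the whole key
    return key[key.find('/') + 1:]


grooming = width_fraction('width: 60%;')
full = width_fraction('width: 100%;')
plain_key = column_key('Grooming Frequency')
slash_key = column_key('Temperament/Demeanor')
```

With these inputs, `grooming` is `0.6`, `plain_key` is `'grooming_frequency'`, and `slash_key` is `'demeanor'`, which matches the column naming seen in the resulting table.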

In [11]:
def get_care_info(breed_soup):
    titles = breed_soup.find_all(
        'h4', class_='bar-graph__title'
    )
    
    values = breed_soup.find_all(
        'div', class_='bar-graph__section'
    )
    
    categories = breed_soup.find_all(
        'div', class_='bar-graph__text'
    )
    
    care_dict = {}
    
    for title, value, category in zip(titles, values, categories):
        t = title.get_text().lower().replace(' ', '_')
        t = t[t.find('/') + 1:]
        care_dict[t + '_value'] = float(
            value['style'].split()[1].split('%')[0]
        ) / 100
        care_dict[t + '_category'] = category.get_text()
    
    return care_dict

## The breed class

In [12]:
class Breed:
    def __init__(self, url):
        self.url = url
        breed_page = requests.get(url)
        breed_soup = BeautifulSoup(breed_page.content, 'html.parser')

        self.breed_info = {}
        self.breed_info['description'] = get_description(breed_soup)    
        self.breed_info['temperament'] = get_temperament(breed_soup)
        self.breed_info.update(get_main_attributes(breed_soup))
        self.breed_info.update(get_care_info(breed_soup))
        
    def get_info(self):  
        return self.breed_info

## Running everything

In [13]:
def get_data():
    page = requests.get('https://www.akc.org/dog-breeds/')
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # An HTML select tag with all the breeds and their urls
    breed_select = soup.find('select', id='breed-search')
    
    # Keeping only children from breed_select which are actually breeds
    breeds = [
        tag for tag in breed_select.children
        if isinstance(tag, Tag) and tag.get('value')
    ]
    
    breed_dict = {
        breed.get_text(): Breed(breed['value']).get_info(
        ) for breed in tqdm(breeds)
    }
    
    return breed_dict

In [14]:
breed_dict = get_data()

## Converting data to DataFrame

In [15]:
breed_df = pd.DataFrame.from_dict(
    breed_dict, orient='index'
)

In [16]:
breed_df

breed,description,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
Affenpinscher,The Affen’s apish look has been described many...,"Confident, Famously Funny, Fearless",148,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,2-3 Times a Week Brushing,0.6,Seasonal,0.6,Regular Exercise,0.8,Easy Training,1.0,Outgoing
Afghan Hound,"The Afghan Hound is an ancient breed, his whol...","Dignified, Profoundly Loyal, Aristocratic",113,63.50,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,Daily Brushing,0.2,Infrequent,0.8,Energetic,0.2,May be Stubborn,0.2,Aloof/Wary
Airedale Terrier,The Airedale Terrier is the largest of all ter...,"Friendly, Clever, Courageous",60,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,2-3 Times a Week Brushing,0.4,Occasional,0.6,Regular Exercise,1.0,Eager to Please,0.8,Friendly
Akita,"Akitas are burly, heavy-boned spitz-type dogs ...","Courageous, Dignified, Profoundly Loyal",47,60.96,71.12,31.751466,58.967008,10.0,13.0,Working Group,0.8,Daily Brushing,0.6,Seasonal,0.8,Energetic,1.0,Eager to Please,0.6,Alert/Responsive
Alaskan Malamute,The Alaskan Malamute stands 23 to 25 inches at...,"Affectionate, Loyal, Playful",58,58.42,63.50,34.019428,38.555351,10.0,14.0,Working Group,0.6,2-3 Times a Week Brushing,0.6,Seasonal,0.8,Energetic,0.4,Independent,0.8,Friendly
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wirehaired Vizsla,WVs are close relatives of Vizslas but a disti...,"Gentle, Loyal, Trainable",167,54.61,63.50,20.411657,29.483504,12.0,14.0,Sporting Group,0.2,Occasional Bath/Brush,0.6,Seasonal,0.8,Energetic,0.6,Agreeable,0.6,Alert/Responsive
Working Kelpie,The overall appearance of the Working Kelpie i...,"Alert, Eager, Intelligent",,48.26,63.50,12.700586,27.215542,12.0,15.0,Foundation Stock Service,0.2,Occasional Bath/Brush,0.6,Seasonal,0.8,Energetic,0.4,Independent,0.6,Alert/Responsive
Xoloitzcuintli,The Xoloitzcuintli (show-low-eats-queen-tlee) ...,"Loyal, Alert, Calm",140,25.40,58.42,4.535924,24.947580,13.0,18.0,Non-Sporting Group,0.2,Occasional Bath/Brush,0.2,Infrequent,0.8,Energetic,0.6,Agreeable,0.6,Alert/Responsive
Yakutian Laika,For centuries the Yakutian Laika was an irrepl...,"Affectionate, Intelligent, Active",,53.34,58.42,18.143695,24.947580,10.0,12.0,Foundation Stock Service,0.4,Weekly Brushing,0.6,Seasonal,0.8,Energetic,0.2,May be Stubborn,0.4,Reserved with Strangers


In [17]:
breed_df.describe(include='all')

statistic,description,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
count,277,277,198.0,277.0,277.0,275.0,275.0,274.0,274.0,277,270.0,270,257.0,257,271.0,271,253.0,253,252.0,252
unique,276,268,191.0,,,,,,,9,,5,,5,,5,,5,,5
top,Poodles come in three size varieties: Standard...,"Friendly, Smart, Willing to Please",7.0,,,,,,,Foundation Stock Service,,Weekly Brushing,,Seasonal,,Regular Exercise,,Agreeable,,Friendly
freq,2,3,3.0,,,,,,,68,,119,,125,,118,,77,,77
mean,,,,44.225801,52.720588,17.888858,27.291416,11.306569,13.832117,,0.425926,,0.529183,,0.712915,,0.624506,,0.620635,
std,,,,14.238298,15.885454,12.2906,19.061416,1.817949,2.016668,,0.198306,,0.189068,,0.168927,,0.247271,,0.201713,
min,,,,12.7,17.78,0.0,0.0,0.0,0.0,,0.2,,0.2,,0.2,,0.2,,0.2,
25%,,,,33.02,38.1,8.164663,12.927383,10.0,13.0,,0.2,,0.4,,0.6,,0.4,,0.4,
50%,,,,45.085,53.34,15.875733,24.94758,12.0,14.0,,0.4,,0.6,,0.6,,0.6,,0.6,
75%,,,,55.88,66.04,22.679619,34.019428,12.0,15.0,,0.6,,0.6,,0.8,,0.8,,0.8,


## Saving DataFrame as csv

In [18]:
# Create the output folder if it does not exist yet
os.makedirs('data/', exist_ok=True)
    
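# Date-only stamp: the default str(datetime) contains spaces and colons,
# which are awkward (or invalid) in file names on some systems
timestamp = datetime.today().strftime('%Y-%m-%d')
breed_df.to_csv('data/akc-data-{}.csv'.format(timestamp))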
breed_df.to_csv('data/akc-data-latest.csv')