### 1. Форматування

[The Associated Press Stylebook](https://www.amazon.com/Associated-Press-Stylebook-2017-Briefing/dp/0465093043/) - це посібник зі стилю, яким часто послуговуються журналісти по всьому світу. Він рекомендує такі правила форматування заголовків:
1. З великої літери потрібно писати іменники, займенники, дієслова, прикметники, прислівники та підрядні сполучники. Якщо слово написане через дефіс, велику літеру потрібно додати для кожної частинки слова (наприклад, правильно "Self-Reflection", а не "Self-reflection").
2. З великої літери потрібно писати перше і останнє слово заголовку, незалежно від частини мови.
3. З маленької літери потрібно писати всі інші частини мови: артиклі/визначники, сурядні сполучники, прийменники, частки, вигуки.

**Завдання:**
1. напишіть програму, яка форматує заголовки за вказаними правилами
2. проженіть вашу програму на [корпусі заголовків з The Examiner](examiner-headlines.txt)
3. збережіть програму та файл із відформатованими заголовками у директорії з вашим іменем
4. скільки заголовків у корпусі було відформатовано правильно? (скільки заголовків залишились незмінними?)

Зверніть увагу, що ваша програма повинна правильно розрізняти прийменники та підрядні сполучники. Наприклад, `Do as you want` => `Do As You Want` (бо "as" тут є сполучником), but `How to use a Macbook as a table` => `How to Use a Macbook as a Table` (бо "as" тут є прийменником).

In [11]:
import spacy

nlp = spacy.load("en_core_web_lg")

In [179]:
doc = nlp("'The Biggest Loser' 2012 Video: Face-off Week Brings the Competition")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

' ' PUNCT `` punct ' False False
The the DET DT det Xxx True False
Biggest big ADJ JJS compound Xxxxx True False
Loser loser PROPN NNP poss Xxxxx True False
' ' PUNCT '' punct ' False False
2012 2012 NUM CD nummod dddd False False
Video video NOUN NN dep Xxxxx True False
: : PUNCT : punct : False False
Face face NOUN NN amod Xxxx True False
- - PUNCT HYPH punct - False False
off off PART RP prt xxx True False
Week week PROPN NNP nsubj Xxxx True False
Brings bring VERB VBZ ROOT Xxxxx True False
the the DET DT det xxx True False
Competition competition NOUN NN dobj Xxxxx True False


In [182]:
UPPER_POS = {'PROPN', 'NOUN','PRON', 'VERB', 'ADV', 'ADJ', 'SCONJ', 'NUM'}

LOWER_POS = {'DET','ADP','CCONJ', 'INT', 'PART'}

SEPARATORS = {'!', '.', '?', ':', '-'}

def capitalize(s):
    
    if not s or s.startswith('eBay') or s.startswith('iPhone') or \
        s.startswith("iOS") or s.startswith("iMac") or s.startswith("iPod"):
        return s
    
    return s[0:1].capitalize() + s[1:]


def format_token(idx, doc, token):
    #print(token.text, token.pos_, token.dep_)
    
    if token.pos_ == 'ADV' and token.dep_ == 'neg' and token.text[-2:] == "'t":
        return token.text
    
    if (token.pos_ in UPPER_POS):        
        return capitalize(token.text)
    
    
    if token.pos_ in LOWER_POS:       
        if len(token.text) > 3 or \
            (token.pos_ == 'ADP' and (token.dep_ == 'mark' or \
                                      ((token.dep_ == 'nmod' and doc[idx + 1].text == '-') \
                                       or doc[idx - 1].text == '-'))) or \
            (token.pos_ == 'DET' and (token.dep_ == "appos" or token.dep_ == "advmod" or doc[idx + 1].text == '-')) or \
            (token.pos_ == "PART" and (token.dep_ == "advmod" or doc[idx - 1].text == '-')): 
            return capitalize(token.text)            
        else:
            return token.text.lower() 
            
    return capitalize(token.text)

def format_headline(headline):
    
    doc = nlp(headline)
    formatted = []
    quotes = 0
    
    for idx, token in enumerate(doc):
        
        if token.text == "'":
            quotes +=1
        
        if idx == 0 or idx == (len(doc) - 1):
            if idx == (len(doc) - 1) and token.pos_ == 'PUNCT':
                formatted[-2] = capitalize(doc[idx - 1].text)                
                formatted.append(token.text)
            else:    
                formatted.append(capitalize(token.text))                      
        elif doc[idx - 1].text in SEPARATORS:
            formatted.append(capitalize(token.text))    
        elif doc[idx - 1].text == "'" and quotes % 2 == 1:
            formatted.append(capitalize(token.text))
            quotes = True
        elif quotes and doc[idx + 1].text == "'" and quotes % 2 == 1:
            formatted.append(capitalize(token.text))            
        else:          
            formatted.append(format_token(idx, doc, token))
        
        formatted.append(token.whitespace_)
        
    #print(''.join(formatted))
    return ''.join(formatted)


assert format_headline('Do as you want') == 'Do As You Want'
assert format_headline('A guide to a banking career') == 'A Guide to a Banking Career'
assert format_headline("It's the end of the world... and I feel fine") == "It's the End of the World... and I Feel Fine"
assert format_headline("Furniture and cabinet design trends; where is it heading?") == "Furniture and Cabinet Design Trends; Where Is It Heading?"
assert format_headline("How to use a Macbook as a table") == "How to Use a Macbook as a Table"
assert format_headline("Real men do not give up") == "Real Men Do Not Give Up"
assert format_headline("Real men don't give up") == "Real Men Don't Give Up"
assert format_headline("Real men don't give up!") == "Real Men Don't Give Up!"
assert format_headline("Game #29: friday night fish-fry") == "Game #29: Friday Night Fish-Fry"
assert format_headline("Yoga as a self-reflection method") == "Yoga as a Self-Reflection Method"
assert format_headline("Cindy Crawford: Low-carb diet and yoga workouts are my anti-aging beauty secrets") == "Cindy Crawford: Low-Carb Diet and Yoga Workouts Are My Anti-Aging Beauty Secrets"
assert format_headline("Is Medical Cannabis the Solution for ALS?") == "Is Medical Cannabis the Solution for ALS?"
assert format_headline("Iberry Ice Cream, Chiang Mai: Thailand's favorite ice cream shop") == "Iberry Ice Cream, Chiang Mai: Thailand's Favorite Ice Cream Shop"
assert format_headline("Obama CNN joke: President jokes at dinner, Sarah Palin calls it a 'nerd prom'") == "Obama CNN Joke: President Jokes at Dinner, Sarah Palin Calls It a 'Nerd Prom'"
assert format_headline("Exclusive: Director Ana Lily Amirpour talks 'A Girl Walks Home Alone At Night'") == "Exclusive: Director Ana Lily Amirpour Talks 'A Girl Walks Home Alone at Night'"
assert format_headline("Google giving a shiny gift to holiday travelers: free in-flight wi-fi") == "Google Giving a Shiny Gift to Holiday Travelers: Free In-Flight Wi-Fi"
assert format_headline("Summit releases final trailer for 'The Twilight Saga: Breaking Dawn, Part 2'") == "Summit Releases Final Trailer for 'The Twilight Saga: Breaking Dawn, Part 2'"
assert format_headline("Puppy blamed for loss of infant's finger") == "Puppy Blamed for Loss of Infant's Finger"
assert format_headline("Halep Enters Rogers Cup Final in Straight Sets Win over Errani") == "Halep Enters Rogers Cup Final in Straight Sets Win Over Errani"
assert format_headline("Top ten Books in the Intelligent Design Controversy 2009 #1") == "Top Ten Books in the Intelligent Design Controversy 2009 #1"
assert format_headline("Burn those Calories! Try the Very Steep Trail.") == "Burn Those Calories! Try the Very Steep Trail."
assert format_headline("How It all Plays out on Church Street") == "How It All Plays out on Church Street"
assert format_headline("'Psych' Season Premiere Date Revealed As 'Psych: the Musical' Airing Looms") == "'Psych' Season Premiere Date Revealed as 'Psych: The Musical' Airing Looms"
assert format_headline("Radford Asks Volunteers to Help Renew the New With Saturday River Clean up") == "Radford Asks Volunteers to Help Renew the New With Saturday River Clean Up"
assert format_headline("Dicks Creek: Georgia's Go-to Trout Water") == "Dicks Creek: Georgia's Go-To Trout Water"
assert format_headline("Piroshki! a Taste of Russia From a Michigan Kitchen") == "Piroshki! A Taste of Russia From a Michigan Kitchen"
assert format_headline("'The Biggest Loser' 2012 Video: Face-off Week Brings the Competition") == "'The Biggest Loser' 2012 Video: Face-Off Week Brings the Competition"
assert format_headline("Maine Caucus Results - a Vote of 'No Confidence'") == "Maine Caucus Results - A Vote of 'No Confidence'"
assert format_headline("'Minecraft: Xbox 360 Edition' Mash-up Pack Free Trial and Suggestions Discussed") == "'Minecraft: Xbox 360 Edition' Mash-Up Pack Free Trial and Suggestions Discussed"
assert format_headline("ABS Festival Begins With all-Bach Chamber Recital") == "ABS Festival Begins With All-Bach Chamber Recital"
assert format_headline("The Importance of eBay Feedback") == "The Importance of eBay Feedback"
assert format_headline("iOS7 Release Date 2013: New iOS a Big Change for Apple Users") == "iOS7 Release Date 2013: New iOS a Big Change for Apple Users"
assert format_headline("NBC's 'Revolution' Gives More Sword Fights, Clues In 'Chained Heat'")
assert format_headline("'NCIS' 'out of the Frying Pan...' Preview: Vance Wants a Confession") == "'NCIS' 'Out of the Frying Pan...' Preview: Vance Wants a Confession"
assert format_headline("'The Inspector Chronicles' To Become a Feature Film") == "'The Inspector Chronicles' to Become a Feature Film"
assert format_headline("NBC's 'Revolution' Gives More Sword Fights, Clues In 'Chained Heat'") == "NBC's 'Revolution' Gives More Sword Fights, Clues in 'Chained Heat'"
assert format_headline("Will Women Rule all in November?") == "Will Women Rule All in November?"


In [184]:
with open('examiner-headlines.txt') as headlines:
    with open('formatted-headlines.txt', 'w') as output:
        n = 0
        m = 0
        for headline in headlines:
            n += 1
            fh = format_headline(headline)
            if fh != headline:
                m += 1
            output.write(fh)
        print(f"Headlines total: {n}")
        print("Headlines changed due to formatting: {} ({:.2f}%)".format(m, (m * 100) / n))
        

Headlines total: 5000
Headlines changed due to formatting: 4345 (86.90%)
