In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

## Introduction

I recently found out about the concept of "found" poetry, where one takes an existing piece of text and breaks it up in an interesting way so that it takes on a life of its own compared to the original. I had the idea to look for news headlines that can be broken up into the haiku 5-7-5 syllable structure. After finding the Million News Headlines dataset on Kaggle, I knew I had what I needed to automate the search process!

This notebook shows the code I used to build a dataset of potentially haiku-able headlines.

## Package installation

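To start, we install the `syllables` and `cmudict` packages. `cmudict` provides a large dictionary of words and their pronunciations, from which the number of syllables in each word can be counted. `syllables` uses heuristics to estimate the number of syllables in a word, and serves as the fallback for words that are not in `cmudict`.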
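
To make the difference between the two approaches concrete, here is a minimal sketch that does not need either package. The pronunciations in `SAMPLE_PRONUNCIATIONS` are hand-written in the ARPAbet style that CMUdict uses and are only illustrative, and `rough_estimate` is a hypothetical vowel-group heuristic, not the algorithm that `syllables` actually implements.

```python
import re

# Hand-written sample entries in the ARPAbet style used by CMUdict.
# Illustrative only: the real dictionary is loaded with cmudict.dict().
SAMPLE_PRONUNCIATIONS = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "poetry": ["P", "OW1", "AH0", "T", "R", "IY0"],
    "news": ["N", "UW1", "Z"],
}

def count_from_pronunciation(phonemes):
    """Count syllables as the number of phonemes ending in a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())

def rough_estimate(word):
    """Hypothetical heuristic: count runs of consecutive vowel letters."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

for w in SAMPLE_PRONUNCIATIONS:
    print(w, count_from_pronunciation(SAMPLE_PRONUNCIATIONS[w]), rough_estimate(w))
```

For "poetry" the pronunciation-based count is 3 while this crude heuristic gives 2, which is why it makes sense to prefer the dictionary and only estimate when a word is missing from it.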

In [None]:
!pip install syllables
!pip install cmudict

In [None]:
import syllables
import cmudict

## Syllable counting

Haiku follow a 5-7-5 syllable structure. First, we will write a function to count the syllables in a word using the packages installed above. Then, we will write a haiku detection function that will test if a piece of text follows that syllable structure.

In [None]:
import string
syl_dict = cmudict.dict()

def num_syl_cmudict(word):
    '''Count the number of syllables in a word using cmudict
    
    Args:
        - word (str): the word to be counted
        
    Returns:
        - An integer representing the number of syllables. Words missing
          from cmudict fall back to a heuristic estimate from `syllables`
    '''
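    # Remove all punctuation from the word and lowercase it,
    # since the cmudict keys are lowercase
    word = word.translate(str.maketrans('', '', string.punctuation)).lower()
    
    # Look up the word's pronunciations (an empty list if it is not in cmudict).
    # Using .get() avoids a KeyError and does not add missing words to the dict
    syls = syl_dict.get(word, [])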
    
    if len(syls) == 0:
        est = syllables.estimate(word)
        return est
    else:
        ct = 0
        # The cmudict lists every phoneme in the word. Vowel phonemes end in
        # a stress digit (0, 1 or 2), and each syllable has exactly one vowel
        # phoneme, so counting those gives the number of syllables.
        # From Data Science Stack Exchange:
        # https://datascience.stackexchange.com/questions/23376/how-to-get-the-number-of-syllables-in-a-word
        for sound in syls[0]:
            if sound[-1].isdigit():
                ct += 1
        
        return ct

def is_haiku(sent):
    '''Checks if a given piece of text can be split into the haiku 5-7-5 syllable structure
    
    Args:
        - sent (str): the sentence to be tested
    
    Returns:
        - A boolean indicating whether the text is composed of 
          words that can be separated into a haiku
    '''
    # Split on any run of whitespace so repeated spaces don't produce empty words
    words = sent.split()
    
    count = 0
    hit_5 = False
    hit_7 = False

    # Loop through all the words and check for the 
    # intermediate milestones of a haiku structure
    for word in words:
        num_syl = num_syl_cmudict(word)

        if num_syl is None:
            return False
        
        count += num_syl
        
        # We hit the first five syllables - reset the counter
        if count == 5 and not hit_5:
            count = 0
            hit_5 = True
        
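        # If we already hit five and then found 7 syllables, 
        # we have our second haiku line. The `not hit_7` check stops
        # a later run of 7 syllables from being treated as another line
        if count == 7 and hit_5 and not hit_7: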
            count = 0
            hit_7 = True
    
    # If we hit 5 and 7 and there are only 5
    # syllables left, we have a haiku-able text
    return ((count == 5) and hit_5 and hit_7)
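
The same running-count idea can also return the three lines themselves instead of a boolean. The following is a sketch under stated assumptions rather than part of the pipeline above: `split_haiku` is a hypothetical helper, and `counts` is a hand-filled word-to-syllable mapping that stands in for a real syllable counter.

```python
def split_haiku(words, counts, pattern=(5, 7, 5)):
    """Split `words` into lines whose syllable totals match `pattern`.

    `counts` maps each word to its syllable count (a hypothetical
    stand-in for a real syllable counter). Returns a list of lines,
    or None if the words cannot be split on word boundaries.
    """
    lines, current, total = [], [], 0
    for word in words:
        if len(lines) == len(pattern):
            return None  # words left over after the last line is full
        current.append(word)
        total += counts[word]
        target = pattern[len(lines)]
        if total == target:
            # This line is complete: store it and start the next one
            lines.append(" ".join(current))
            current, total = [], 0
        elif total > target:
            return None  # a word straddles a line break
    # Valid only if every line was completed and nothing is left dangling
    return lines if len(lines) == len(pattern) and not current else None


# Hand-assigned syllable counts for the example text
counts = {"an": 1, "old": 1, "silent": 2, "pond": 1, "a": 1, "frog": 1,
          "jumps": 1, "into": 2, "the": 1, "splash": 1, "silence": 2,
          "again": 2}
text = "an old silent pond a frog jumps into the pond splash silence again"
print(split_haiku(text.split(), counts))
```

Returning early as soon as a line overshoots its target avoids scanning the rest of a text that can no longer fit the pattern.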

In [None]:
# Process the dataframe and dump the output into CSV

news_df = pd.read_csv("../input/abcnews-date-text.csv")

news_df['is_haiku'] = news_df['headline_text'].map(is_haiku)
haiku = news_df[news_df['is_haiku']]
haiku.to_csv('haiku_list.csv')

In [None]:
np.sum(news_df['is_haiku'])

Out of the million news headlines, this run found 7,486 that are haiku-able! Manual inspection shows that some of them don't actually fit, because abbreviations and similar tokens get miscounted by the syllable counter, but this should be enough to get you going!
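
One way to speed up the manual inspection is to flag headlines containing tokens that are not dictionary words, since those are the ones whose syllable counts came from a heuristic. This is only a sketch: `unknown_tokens` is a hypothetical helper, the `known` set is a tiny stand-in for the keys of a freshly loaded `cmudict.dict()`, and the example headline is made up.

```python
import string

def unknown_tokens(headline, known_words):
    """Return the tokens of `headline` that are not in `known_words`."""
    strip_punct = str.maketrans('', '', string.punctuation)
    tokens = (t.translate(strip_punct).lower() for t in headline.split())
    # Skip tokens that were pure punctuation and are now empty
    return [t for t in tokens if t and t not in known_words]

# Hypothetical stand-in for the set of words in cmudict
known = {"police", "probe", "crash", "on", "highway"}

print(unknown_tokens("nsw police probe crash on m1 highway", known))
# → ['nsw', 'm1']
```

Headlines for which this list is non-empty are good candidates to check by hand first.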