# The Flat Squat
### Analyzing W&M Newspaper Articles

  - The Flat Hat, school-sponsored, has "standards" and "history"
  - The Botetourt Squat, freedom-sponsored, comedic newspaper

  - Flat Hat website &rarr; Flat Hat corpus
  - Squat archive PDFs &rarr; Squat corpus

  - 9 hypotheses

To begin, we need a **total word count** for each corpus, which we'll use later to normalize frequencies.

In [1]:
import nltk, random
from articles import flathat, squat, Article

def articleLength(article: Article):
    return len(article.content().split(' '))

def wordsInCorpus(corpus: [Article]):
    '''Counts the total number of words across all articles in a corpus.'''
    wordsInArticles = ( articleLength(article) for article in corpus )
    return sum(wordsInArticles)

flathat_words, squat_words = wordsInCorpus(flathat), wordsInCorpus(squat)
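
One caveat about `articleLength`: `str.split(' ')` splits on single space characters only, so repeated spaces produce empty strings and newlines do not separate words. The sketch below uses a made-up sample string (not taken from either corpus) to show how that differs from calling `split()` with no argument. Whether it matters here depends on how the article text was extracted.

```python
# Hypothetical sample text for illustration; not from either corpus.
sample = "Tribe  wins  big\nagain"

by_space = sample.split(' ')    # splits on every single space character
by_whitespace = sample.split()  # splits on runs of any whitespace

print(len(by_space), by_space)            # 5 ['Tribe', '', 'wins', '', 'big\nagain']
print(len(by_whitespace), by_whitespace)  # 4 ['Tribe', 'wins', 'big', 'again']
```

Because both corpora are counted the same way, any inflation affects both totals, though not necessarily equally.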

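We'll call a difference "significant" when the larger value is at least 1.25 times the smaller one. This is a simple ratio threshold, not a formal statistical test.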

In [2]:
def is_statistically_significant(greater, less):
    return (greater / less) >= 1.25

def supports(greater, less):
    neg = "SUPPORTS" if is_statistically_significant(greater, less) else "DOESN'T support"
    return "Evidence {} hypothesis, ratio is {:.2}".format(neg, greater / less)
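
To make the threshold concrete, here is a small standalone sketch of the same ratio check. The helper name `ratio_check` and the input values are made up for illustration. It also shows why some ratios print in scientific notation: `{:.2}` asks for two significant digits, not two decimal places.

```python
THRESHOLD = 1.25  # same cutoff used by is_statistically_significant

def ratio_check(greater, less):
    """Hypothetical helper: return the ratio and whether it meets the cutoff."""
    ratio = greater / less
    return ratio, ratio >= THRESHOLD

# Illustrative values, not measurements from either corpus.
print(ratio_check(500, 400))  # (1.25, True)
print(ratio_check(450, 400))  # (1.125, False)

# '{:.2}' keeps two significant digits, so values of 10 or more
# switch to scientific notation; '{:.2f}' keeps two decimal places.
print("{:.2}".format(1.4))    # 1.4
print("{:.2}".format(29.0))   # 2.9e+01
print("{:.2f}".format(29.0))  # 29.00
```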

# General Hypotheses

**Hypothesis 4:** Articles in The Flat Hat will be longer.

  - the Flat Hat tries to cover as much of a topic as possible.
  - comedy is hard

In [3]:
from statistics import median
fh_mean = flathat_words / len(flathat)
sq_mean = squat_words / len(squat)
fh_median = median(map(articleLength, flathat))
sq_median = median(map(articleLength, squat))

m = '''
    Flat Hat Mean Length: {} words
       Squat Mean Length: {} words

    Flat Hat Median Length: {} words
       Squat Median Length: {} words
'''
print(m.format(fh_mean, sq_mean, fh_median, sq_median))
print("Mean:", supports(fh_mean, sq_mean))
print("Median:", supports(fh_median, sq_median))


    Flat Hat Mean Length: 573.3441558441558 words
       Squat Mean Length: 409.52332361516034 words

    Flat Hat Median Length: 560.0 words
       Squat Median Length: 388.5 words

Mean: Evidence SUPPORTS hypothesis, ratio is 1.4
Median: Evidence SUPPORTS hypothesis, ratio is 1.4


# Word Frequency Hypotheses

The majority of our hypotheses have to do with word frequency, which we'll define as the _occurrences of a word in a corpus divided by the total word count_. We'll define some functions so as not to repeat ourselves.

In [4]:
def countWordMatchesInArticle(article: Article, matches: [str]) -> int:
    '''Count the number of occurrences of each of the given strings in a given article.'''
    article = article.content().lower()
    return sum(map(article.count, matches))

def wordFrequency(corpus: [Article], matches: [str], wordcount: int) -> float:
    '''Count the frequency of any of the given match words occurring in a given corpus.'''
    matches = [ countWordMatchesInArticle(article, matches) for article in corpus ]
    return sum(matches) / wordcount
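
Note that `str.count` counts substrings rather than whole words, so a term is also counted when it appears inside a longer word, and overlapping terms in one list can count the same word twice. The sketch below illustrates this on a made-up sentence and shows one possible alternative using word boundaries. `count_whole_words` is a hypothetical helper, not part of the analysis above, and it assumes every term starts and ends with a letter or digit.

```python
import re

# Made-up sentence for illustration; not from either corpus.
text = "the republicans met at the cafeteria; one republican left the caf early."

# Substring counting: "republican" is also found inside "republicans".
print(text.count("republican"))   # 2
print(text.count("republicans"))  # 1
print(text.count("caf"))          # 2 ("cafeteria" and "caf")

def count_whole_words(text, terms):
    """Hypothetical alternative: count terms only at word boundaries."""
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text))
    return total

print(count_whole_words(text, ["republican", "republicans"]))  # 2
print(count_whole_words(text, ["caf"]))                        # 1
```

Switching to whole-word matching would change the frequencies reported in this notebook, so the printed results apply only to the substring definition.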

**Hypothesis 1:** The frequency of the words 'basketball', 'football', 'soccer', 'golf',
'tennis', and 'baseball' will be higher in The Flat Hat than in The Botetourt Squat.

  - the Squat doesn't have a real sports section
  - the Flat Hat is more affiliated with W&M-sponsored events

In [5]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

sportsball = ["basketball", "football", "soccer", "golf", "tennis", "baseball"]
fh = wordFrequency(flathat, sportsball, flathat_words)
sq = wordFrequency(squat, sportsball, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


    Flat Hat Frequency: 0.0576%
       Squat Frequency: 0.0441%

Evidence SUPPORTS hypothesis, ratio is 1.3


**Hypothesis 6:** The Flat Hat will mention 'William & Mary', 'Alumni/us/a' more frequently
than The Squat.

  - the Flat Hat is more associated with the school and official events

In [6]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

tribe_pride = ["william & mary", "william and mary", "the college", "alumni",
               "alumna", "alumnus", "alumnae"]
fh = wordFrequency(flathat, tribe_pride, flathat_words)
sq = wordFrequency(squat, tribe_pride, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


    Flat Hat Frequency: 0.7782%
       Squat Frequency: 0.1502%

Evidence SUPPORTS hypothesis, ratio is 5.2


**Hypothesis 7:** The Squat will have more references to national and international politics;
instances of "Obama", "ISIS", "Republican/s", and "Putin" will appear more often.

  - the scope of the Flat Hat is comparatively small

In [7]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

national_news = ["obama", "isis", "republican", "republicans", "putin"]
fh = wordFrequency(flathat, national_news, flathat_words)
sq = wordFrequency(squat, national_news, squat_words)
print(m.format(fh, sq))
print(supports(sq, fh))


    Flat Hat Frequency: 0.0355%
       Squat Frequency: 0.0801%

Evidence SUPPORTS hypothesis, ratio is 2.3


**Hypothesis 8:** The Squat will mention The Flat Hat and The Squat more than The Flat Hat
mentions either.

  - parody requires more reference than news fact
  - the Botetourt Squat isn't exactly news to the Flat Hat

In [8]:
m ='''  
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

rivalry = ["the flat hat", "the fat hat", "the botetourt squat", "the squat"]
fh = wordFrequency(flathat, rivalry, flathat_words)
sq = wordFrequency(squat, rivalry, squat_words)
print(m.format(fh, sq))
print(supports(sq, fh))

  
    Flat Hat Frequency: 0.0179%
       Squat Frequency: 0.0851%

Evidence SUPPORTS hypothesis, ratio is 4.7


**Hypothesis 9:** The Flat Hat will mention official dining services more frequently
than The Squat; tested by frequency of the terms 'Commons', 'Caf', 'Sadler',
'Marketplace', 'Cosi', 'Student Exchange', '1693 Barbecue', 'Wholly Haba(n/ñ)eros'.

  - the Flat Hat is in the pocket of the school
  - the Flat Hat goes out of their way to review on-campus things

In [9]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

sodexo = ["commons", "caf", "sadler", "marketplace", "cosi", "student exchange",
          "1693 barbecue", "wholly habaneros", "wholly habañeros"]
fh = wordFrequency(flathat, sodexo, flathat_words)
sq = wordFrequency(squat, sodexo, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


    Flat Hat Frequency: 0.0374%
       Squat Frequency: 0.0459%

Evidence DOESN'T support hypothesis, ratio is 0.81


# Regular Expression Hypotheses

Some of our hypotheses concern how many words **match a given pattern**. We'll define functions to do this for us.

In [10]:
from re import findall

def countRegexMatchesInArticle(article: Article, matches: [r'regex']) -> int:
    '''Count the number of occurrences of each of the given regular expressions in a given article.'''
    article = article.content().lower()
    return sum([ len(findall(match, article)) for match in matches ])

def regexFrequency(corpus: [Article], matches: [r'regex'], wordcount: int) -> float:
    '''Count the frequency of any of the given match expressions occurring in a given corpus.'''
    matches = [ countRegexMatchesInArticle(article, matches) for article in corpus ]
    return sum(matches) / wordcount

**Hypothesis 2:** The frequency of all n-grams in the forms "was __ed" or "__ed by"
will be higher in The Flat Hat than in The Squat.

  - the Flat Hat uses the common news style
  - the Squat doesn't conform to news style

In [11]:
m = '''
Flat Hat Frequency: {:.4%}
   Squat Frequency: {:.4%}
'''

passive = [r'was \S+ed', r'\S+ed by']
fh = regexFrequency(flathat, passive, flathat_words)
sq = regexFrequency(squat, passive, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


Flat Hat Frequency: 0.2209%
   Squat Frequency: 0.2509%

Evidence DOESN'T support hypothesis, ratio is 0.88
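
One possible reason to read this result cautiously: the two patterns are rough proxies for passive voice. The sketch below runs them on made-up sentences (not from either corpus) to show that a single passive construction can be counted twice and that some non-passive phrases also match. `count_matches` is a hypothetical helper written for this illustration.

```python
from re import findall

passive = [r'was \S+ed', r'\S+ed by']

def count_matches(text):
    """Hypothetical helper: total matches of both patterns in one string."""
    return sum(len(findall(pattern, text)) for pattern in passive)

# One passive construction, matched by both patterns.
print(count_matches("the college was founded by royal charter"))  # 2

# Not passive, but "indeed" ends in "ed".
print(count_matches("he was indeed late"))  # 1

# Not passive, but "need by" fits the second pattern.
print(count_matches("supplies they need by friday"))  # 1
```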


**Hypothesis 3:** Score reports, matched by the regular expression \d+-\d+, will
appear more often in The Flat Hat. The Squat does not report on sports, and
certainly not with statistics.

In [12]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

score_reports = [r'\d+-\d+']
fh = regexFrequency(flathat, score_reports, flathat_words)
sq = regexFrequency(squat, score_reports, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


    Flat Hat Frequency: 0.3108%
       Squat Frequency: 0.0107%

Evidence SUPPORTS hypothesis, ratio is 2.9e+01
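
The score pattern matches any two digit runs joined by a hyphen, so it likely picks up more than scores. A short sketch on made-up strings (not from either corpus) shows what else it would catch:

```python
from re import findall

score = r'\d+-\d+'

# Made-up examples for illustration.
print(findall(score, "the tribe won 24-17 on saturday"))  # ['24-17']
print(findall(score, "the 2014-2015 academic year"))      # ['2014-2015']
print(findall(score, "see pages 12-15"))                  # ['12-15']
print(findall(score, "call 555-0100"))                    # ['555-0100']
```

The gap between the two papers is large, but part of the Flat Hat count may come from dates and ranges rather than sports scores.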


**Hypothesis 5:** The Flat Hat will show more graduation years, as it includes them
after every interviewee’s name. The Squat does not always include this indicator of
year. The regular expression [’']\d\d will match more often in The Flat Hat.

In [13]:
m = '''
    Flat Hat Frequency: {:.4%}
       Squat Frequency: {:.4%}
'''

class_year = [r"[’']\d\d"]
fh = regexFrequency(flathat, class_year, flathat_words)
sq = regexFrequency(squat, class_year, squat_words)
print(m.format(fh, sq))
print(supports(fh, sq))


    Flat Hat Frequency: 0.2384%
       Squat Frequency: 0.0196%

Evidence SUPPORTS hypothesis, ratio is 1.2e+01
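
The character class in the pattern accepts both the curly apostrophe (’) and the straight one ('), which matters because the two corpora came from different sources. A quick sketch on made-up strings shows both forms matching, along with one kind of false positive:

```python
from re import findall

class_year = r"[’']\d\d"

# Made-up examples for illustration.
print(findall(class_year, "jane doe ’17 said"))    # ['’17']
print(findall(class_year, "jane doe '17 said"))    # ["'17"]
print(findall(class_year, "music from the '90s"))  # ["'90"]
print(findall(class_year, "it's 2015"))            # []
```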


# Fun & Curiosity

The Flat Hat uses real names. The Botetourt Squat uses pseudonyms, and those pseudonyms don't always map one-to-one to real people. So which "authors" are the most prolific in the Flat Hat and the Squat?

In [14]:
# most and least prolific writers for the Squat and the Flat Hat
# for fun and testing

import collections
from itertools import islice

def nodupes(seq):
    '''Remove duplicates from seq, keeping the first occurrence of each item.'''
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(seq))

def prolific(articles, reverse=True):
    authors = [ x.author for x in articles ]
    count = collections.Counter(authors)
    authors = nodupes(sorted(authors, key=count.get, reverse=reverse))
    return [ (author, count[author]) for author in authors ]

def displayProlific(least=False):
    comparator = "Least" if least else "Most"
    print((" " * 15 + comparator + " Prolific Writers").upper())
    print("\n     {:10}{:30}{:10}{:30}\n".format("Articles", "Flat Hat", "Articles", "Botetourt Squat"))
    fh_prolific = prolific(flathat, reverse=not least)
    sq_prolific = prolific(squat, reverse=not least)
    for i, (fh, sq) in enumerate(islice(zip(fh_prolific, sq_prolific), 30)):
        print("#{:2}: {:>9} {:30}{:>9} {:30}".format(i+1, fh[1], fh[0][:28], sq[1], sq[0][:28]))
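
For comparison, `collections.Counter` can produce a similar ranking directly. The sketch below is an alternative under stated assumptions, not the method used above: the bylines are hypothetical, and tie order relies on `Counter` preserving first-seen order (Python 3.7+).

```python
from collections import Counter

# Hypothetical bylines for illustration only.
bylines = ["A. Writer", "B. Author", "A. Writer",
           "C. Scribe", "A. Writer", "B. Author"]

counts = Counter(bylines)

# Most prolific first; ties keep the order in which authors first appear.
most = counts.most_common(2)

# Least prolific first; sorted() is stable, so ties also keep first-seen order.
least = sorted(counts.items(), key=lambda pair: pair[1])[:2]

print(most)   # [('A. Writer', 3), ('B. Author', 2)]
print(least)  # [('C. Scribe', 1), ('B. Author', 2)]
```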

In [15]:
displayProlific()

               MOST PROLIFIC WRITERS

     Articles  Flat Hat                      Articles  Botetourt Squat               

# 1:       194 Chris Weber                          51 HANK MANGKLACE                
# 2:       185 Zach Hardy                           43 PARTICLE-MAN SKYLORD          
# 3:       173 Flat Hat Editorial Board             24 DEMOSTHENES                   
# 4:       122 Jack Powers                          21 FATHER PADRE                  
# 5:       113 Jared Foretek                        21 ANTICITIZEN ONE               
# 6:       101 Meredith Ramey                       19 DARK PALADIN                  
# 7:        88 Mick Sloan                           18 GOLDEN-HAIRED NINNY           
# 8:        86 Abby Boyle                           17 TOM BOMBADIL                  
# 9:        85 Katherine Chiglinsky                 17 PARTICLE MAN SKYLORD          
#10:        83 Ariel Cohen                          16 PUBLIUS                       
#11:        71 

In [16]:
displayProlific(least=True)

               LEAST PROLIFIC WRITERS

     Articles  Flat Hat                      Articles  Botetourt Squat               

# 1:         1 MeghanCondlin                         1 GUCCI STEVE                   
# 2:         1 Lucas Cohen                           1 EXPAND_DONG.JPEG              
# 3:         1 Ryan Corcoran                         1 THE PERSON WHO WROTE IT       
# 4:         1 William Gaskins                       1 MARY QUEEN OF HOT             
# 5:         1 Spencer Chretien                      1 THE ORAL IN “FLORAL”          
# 6:         1 Brady Meixell                         1 RONALDINHO MCDONLD            
# 7:         1 Ian Kirkwood                          1 GHOST OF JAMES MADISON        
# 8:         1 Elizabeth Jacob                       1 A PATRIOTIC HAWK              
# 9:         1 Emily McMillen                        1 ME                            
#10:         1 Amanda Triplett                       1 LAZERCUNT tasted bitterness   
#11:         1