# COMP 205: Quiz 4
## Fall 2020

Please enter your answers into this workbook in the places provided. This is an open-Internet examination. You may use any books and materials, and the entire internet -- as well as all of the class workbooks and examples -- in answering these questions. You may not communicate with _other people_ in completing this workbook. 

You will need to create extra cells in the workbook to explain your reasoning in support of your answers. Please make sure that your "final answer" for each question is in the place marked for that purpose. 

There are only 3 questions but Q3 has a number of corner cases. Your answer will be evaluated on how completely and elegantly you answer it. 

<span style="color:red">Please do not share this workbook with others until after the exam window is closed on Wednesday. Also, please sign the last cell of this notebook before submitting.</span>



# News Stream Analysis

This case study reads a news RSS feed and processes the headlines as they arrive!

To begin, uncomment the next cell and run it to install `feedparser` using pip. You need only do it once (but doing it more than once won't hurt anything).

In [None]:
# !pip install feedparser

This notebook depends on two external files: `news_feeder.py` and `moral_foundations.py`. They are in the same directory as the `.ipynb` file but your setup may be different and a bit of experimentation may be involved.

In [2]:
import news_feeder, moral_foundations

## Reading Headlines

The next cell reads headlines from the New York Times, BBC and a few other news sources. Once you start it, it will keep going! Do a kernel interrupt to stop it. Because we will refer to this cell later, let's call it the **streaming cell**.

In [None]:
# Streaming Cell
import sys
for feed in news_feeder.feeds:
    for (tt, name, title, link) in feed.getHeadline():
        print (title)

### <span style="color:blue"> 1. Clean input </span>

To clean a headline, do this:

1. Remove all punctuation, replacing each mark with a space. This means that words like O'Hara will become two words. That's OK.
2. Replace multiple spaces with a single space.
3. Change all words to lowercase.

Copy the above cell and replace the print line with `print (clean(title))` to examine the cleaned-up input. Python has a library for processing text called `nltk` (Natural Language Toolkit), but we didn't study `nltk` &#x2639;, so this is a poor person's implementation!
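
The three cleaning steps can be sketched with the standard `re` module. This is one possible `clean`, not the only acceptable answer; it assumes that "punctuation" means any character that is not a letter or digit, including the underscore.

```python
import re

def clean(title):
    """Return the headline lowercased, with punctuation replaced by single spaces."""
    # Steps 1 and 2: replace each run of non-alphanumeric characters
    # (punctuation, underscores, whitespace) with a single space.
    no_punct = re.sub(r"[\W_]+", " ", title)
    # Step 3: lowercase. strip() drops a leading or trailing space left by step 1.
    return no_punct.strip().lower()

print(clean("O'Hara wins --  again!"))  # o hara wins again
```

Collapsing whole runs in one substitution handles steps 1 and 2 together, so no second pass over the spaces is needed.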

## Moral Foundations Theory

<img align="right" style="padding-left:10px; height: 55%; width: 55%" src="figures/political-camps-moral-foundations.png" >

The second part of this case study is Moral Foundations Theory. Google the phrase or visit [MoralFoundations.org](https://moralfoundations.org/) to learn more about the theory (or watch [Jonathan Haidt's Ted Talk](https://www.ted.com/talks/jonathan_haidt_the_moral_roots_of_liberals_and_conservatives/)).

For the purpose of this exercise, you needn't learn much about the theory &mdash; we will be using the Moral Foundations team's list of words that connote different dimensions of morality. 

* **Care** connotes safety, peace, compassion, etc.
* **Harm**, Care's opposite, connotes war, fight, hurt, kill, suffer, etc.

In the coding, these two moral opposites are called 'HarmVirtue' and 'HarmVice' respectively, and the other moral dimensions are named similarly. The categories and the words that belong to each are available online in a [Moral Foundations Dictionary](https://moralfoundations.org/wp-content/uploads/files/downloads/moral%20foundations%20dictionary.dic). Dictionary words often end in an &ast;, for example `peace`&ast;, which could match peace, peaceful, peacefully, even peacenik!

In the next cell, we retrieve the groups and codes. Go ahead and print them to discern the pattern.

In [17]:
groups, codes = moral_foundations.read_dicts()

In [18]:
groups

{'01': 'HarmVirtue',
 '02': 'HarmVice',
 '03': 'FairnessVirtue',
 '04': 'FairnessVice',
 '05': 'IngroupVirtue',
 '06': 'IngroupVice',
 '07': 'AuthorityVirtue',
 '08': 'AuthorityVice',
 '09': 'PurityVirtue',
 '10': 'PurityVice',
 '11': 'MoralityGeneral'}

### Codes

Print the codes first

In [19]:
codes

{'safe*': '01',
 'peace*': '01',
 'compassion*': '01',
 'empath*': '01',
 'sympath*': '01',
 'care': '01',
 'caring': '01',
 'protect*': '01',
 'shield': '01',
 'shelter': '01',
 'amity': '01',
 'secur*': '01',
 'benefit*': '01',
 'defen*': '01',
 'guard*': '01',
 'preserve': '01,07,09',
 'harm*': '02',
 'suffer*': '02',
 'war': '02',
 'wars': '02',
 'warl*': '02',
 'warring': '02',
 'fight*': '02',
 'violen*': '02',
 'hurt*': '02',
 'kill': '02',
 'kills': '02',
 'killer*': '02',
 'killed': '02',
 'killing': '02',
 'endanger*': '02',
 'cruel*': '02',
 'brutal*': '02',
 'abuse*': '02',
 'damag*': '02',
 'ruin*': '02,10',
 'ravage': '02',
 'detriment*': '02',
 'crush*': '02',
 'attack*': '02',
 'annihilate*': '02',
 'destroy': '02',
 'stomp': '02',
 'abandon*': '02,06',
 'spurn': '02',
 'impair': '02',
 'exploit': '02,10',
 'exploits': '02,10',
 'exploited': '02,10',
 'exploiting': '02,10',
 'wound*': '02',
 'fair': '03',
 'fairly': '03',
 'fairness': '03',
 'fair-*': '03',
 'fairmind*'

### Understanding Codes

The first key-value pair in the dictionary is `'safe*': '01'`. It means that safe, safely, safety, ... are all in group 01 (HarmVirtue). 

Also notice that some words (`preserve`, for example) belong in multiple groups.
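
To make the encoding concrete, here is a small sketch that decodes a few entries copied from the output above. The helper name `group_names` is made up for illustration, and the two dictionaries are hand-copied subsets, not the full data returned by `moral_foundations.read_dicts()`.

```python
# Hand-copied subsets of the real dictionaries, for illustration only.
sample_groups = {'01': 'HarmVirtue', '02': 'HarmVice', '07': 'AuthorityVirtue',
                 '09': 'PurityVirtue', '10': 'PurityVice'}
sample_codes = {'safe*': '01', 'preserve': '01,07,09', 'ruin*': '02,10'}

def group_names(code_string, groups):
    """Turn a comma-separated code string such as '01,07,09' into group names."""
    return [groups[code] for code in code_string.split(',')]

for pattern, code_string in sample_codes.items():
    print(pattern, '->', group_names(code_string, sample_groups))
```

Running it shows that `preserve` decodes to three group names, while `safe*` decodes to one.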

### <span style="color:blue"> 2. Write a function for finding categories of a word </span>

We want a function `find_word_categories(word)` such that the **assert** statements at the end of the next cell all pass.

### Code Design

The obvious design is to create a dictionary that maps each word &rArr; category. Then we could take each word in an incoming headline and categorize it. _But the &ast;s throw a wrench into this idea!_ How would it categorize "He came peacefully" when a dictionary whose key is peace&ast; wouldn't match the word "peacefully"?

Looking up each headline word by scanning every entry, without the benefit of a dictionary, might be too slow!
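
To see the problem concretely, the sketch below uses a hand-copied subset of `codes`. A plain dictionary lookup misses `peacefully`, while a linear scan over every pattern finds it at the cost of touching every entry for every word. The function name `slow_lookup` is hypothetical and is not part of the provided modules.

```python
# Hand-copied subset of the real codes dictionary, for illustration only.
sample_codes = {'safe*': '01', 'peace*': '01', 'care': '01', 'war': '02'}

# The key is 'peace*', not 'peacefully', so direct lookup fails.
print('peacefully' in sample_codes)  # False

def slow_lookup(word, codes):
    """Scan every pattern: correct, but proportional to the dictionary size."""
    for pattern, code_string in codes.items():
        if pattern.endswith('*'):
            # Wildcard entry: compare against the text before the '*'.
            if word.startswith(pattern[:-1]):
                return code_string.split(',')
        elif word == pattern:
            # Plain entry: exact match only.
            return code_string.split(',')
    return []

print(slow_lookup('peacefully', sample_codes))  # ['01']
print(slow_lookup('warm', sample_codes))        # []
```

Note that `warm` does not match `war`, because `war` has no asterisk and must match exactly.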

Our design combines two strategies: a dictionary keyed on the first 4 letters of each entry, with each key mapping to a (relatively short) list of entries, &ast;ed or not, for quick scanning. A good example is the 4-letter key 'cont', which can map to [{'control': '07'}, {'contagio*': '10'}]. Our search will look up the first 4 letters of a word and then walk the list of dictionaries.

The Python class [`defaultdict`](https://docs.python.org/3.8/library/collections.html#defaultdict-examples) is perfect for assembling this list of dictionaries.

In [46]:
from collections import defaultdict

# Bucket every dictionary entry under the first 4 letters of its key.
fast_lookup_codes = defaultdict(list)

for k in codes:
    word_key = k[0:4]
    fast_lookup_codes[word_key].append({k: codes[k]})
    if "*" in word_key:
        # We don't want any asterisks among the keys; print and quit if this happens.
        print(k)
        break
fast_lookup_codes

defaultdict(list,
            {'safe': [{'safe*': '01'}],
             'peac': [{'peace*': '01'}],
             'comp': [{'compassion*': '01'},
              {'complian*': '07'},
              {'comply': '07'}],
             'empa': [{'empath*': '01'}],
             'symp': [{'sympath*': '01'}],
             'care': [{'care': '01'}],
             'cari': [{'caring': '01'}],
             'prot': [{'protect*': '01'}, {'protest': '08'}],
             'shie': [{'shield': '01'}],
             'shel': [{'shelter': '01'}],
             'amit': [{'amity': '01'}],
             'secu': [{'secur*': '01'}],
             'bene': [{'benefit*': '01'}],
             'defe': [{'defen*': '01'},
              {'defere*': '07'},
              {'defer': '07'},
              {'defector': '08'}],
             'guar': [{'guard*': '01'}],
             'pres': [{'preserve': '01,07,09'}],
             'harm': [{'harm*': '02'}],
             'suff': [{'suffer*': '02'}],
             'war': [{'war': '02'}],
      

In [80]:
def find_word_categories(word):
    """Return the list of group codes for word, or [] if it has none."""
    # Only entries sharing the word's first 4 letters can possibly match.
    for match in fast_lookup_codes.get(word[0:4], []):
        for mk, code_string in match.items():
            if mk.endswith('*'):
                # Wildcard entry: the word must start with the text before the '*'.
                if word.startswith(mk[:-1]):
                    return code_string.split(',')
            elif word == mk:
                # Plain entry: the word must match exactly.
                return code_string.split(',')
            # No match: keep scanning instead of returning early, so later
            # entries in the same bucket (e.g. 'defector' after 'defer') are tried.
    return []

assert (find_word_categories('hurt') == ['02'])
assert (find_word_categories('tradition') == ['07'])
assert (find_word_categories('disease') == ['10'])
assert (find_word_categories('@nytnational') == [])
assert (find_word_categories('preserve') == ['01', '07', '09'])
assert (find_word_categories('preserved') == [])
assert (find_word_categories('protective') == ['01'])

### <span style="color:blue"> 3. Classify each headline with its moral foundation group </span>

Identify each headline with its moral code group. (Some headlines won't have any at all).

A headline like "A tradition can be many things — for some, it's food. For others, it's faith. For many, family. "
should find {'AuthorityVirtue': 'tradition'} and {'IngroupVirtue': 'family'}.

Rewrite the streaming cell to print the moral code group along with the title.

**Note:** There are many corner cases in this problem. Your answer will be evaluated on how cleanly and completely you address them.

In [81]:
import re

def analyze_headline(headline):
    # Replace punctuation with spaces, lowercase, and split on whitespace;
    # split() with no argument also discards empty strings.
    hdline_words = re.sub(r'[\W_]+', ' ', headline).lower().split()
    categories = []
    for hdl_word in hdline_words:
        # A word may belong to several groups; record one entry per group.
        for cat in find_word_categories(hdl_word):
            categories.append({groups[cat]: hdl_word})
    return categories

cats = analyze_headline("A tradition can be many things — for some, it's food. For others, it's faith. For many, family. ")
print ('cats', cats)
assert ({'AuthorityVirtue': 'tradition'} in cats)
assert ({'IngroupVirtue': 'family'} in cats)
assert (2 == len(cats))

cats [{'AuthorityVirtue': 'tradition'}, {'IngroupVirtue': 'family'}]
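
If a per-group summary is more convenient than a list of one-entry dictionaries, the result can be regrouped. The sketch below is an optional variation, not part of the required answer; `regroup` is a hypothetical helper, and its input is copied from the output shown above.

```python
from collections import defaultdict

def regroup(categories):
    """Collapse [{'Group': 'word'}, ...] into {'Group': ['word', ...]}."""
    by_group = defaultdict(list)
    for entry in categories:
        for group, word in entry.items():
            by_group[group].append(word)
    return dict(by_group)

cats_example = [{'AuthorityVirtue': 'tradition'}, {'IngroupVirtue': 'family'}]
print(regroup(cats_example))
# {'AuthorityVirtue': ['tradition'], 'IngroupVirtue': ['family']}
```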


In [82]:
# Streaming Cell
import sys
for feed in news_feeder.feeds:
    for (tt, name, title, link) in feed.getHeadline():
        cats = analyze_headline(title)
        print (title, cats)

Moderna Vaccine Is Highly Protective and Prevents Severe Covid, Data Show [{'HarmVirtue': 'protective'}]
Coronavirus Vaccinations Begin, But Some Americans Are Wary []
In Canada, the First Vaccines Leave Health Workers in Tears of Relief []
With First Dibs on Vaccines, Rich Countries Have ‘Cleared the Shelves’ []
‘A Shot of Hope’: What the Vaccine Is Like for Frontline Doctors and Nurses []
If Teachers Get the Vaccine Quickly, Can Students Get Back to School? []
A Day That Settled an Election and Brought Hope for Defeating a Pandemic []
After Electoral College Votes, More Republicans Warily Accept Trump’s Loss []
Formally President-Elect, Joe Biden Looks Ahead to Stimulus []
Inside Biden’s Struggle to Manage Factions in the Democratic Party []
How Biden Can Move His Economic Agenda Without Congress []
Dysfunctional Prison and Court Pose Guantánamo Headaches for Biden []
Scope of Russian Hack Becomes Clear: Multiple U.S. Agencies Were Hit []
Inside the Right-Wing Media Bubble, Where the

KeyboardInterrupt: 

# When you have completed this exercise

1. Rerun all cells from top to bottom and check that they work. Please sign the next line.

2. <span style="color:blue">I, **(your name)**, certify that the work in this notebook is my own. </span>
    
3. <span style="color:blue">Further, I promise not to share or discuss the solutions until after the submission period has passed (at midnight on 10/18/20). </span>

You can submit a notebook by saving it as a PDF. In the cluster environment, use File | Print (Save as PDF) and then submit to Gradescope at https://www.gradescope.com/courses/182658. On other versions, it may be File | Download As (PDF), followed by submitting to Gradescope.

To submit to Gradescope, log into the [website](https://www.gradescope.com/courses/182658), add course **9W7PW3** (if not already added) and submit. The assignment name should match the name of this notebook.

# Good luck!