# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Attempt-to-Fix-SSL-Issue" data-toc-modified-id="Attempt-to-Fix-SSL-Issue-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Attempt to Fix SSL Issue</a></div><div class="lev1 toc-item"><a href="#Project-Overview" data-toc-modified-id="Project-Overview-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Project Overview</a></div><div class="lev1 toc-item"><a href="#Part-1---Scraping-The-Federalist-Papers" data-toc-modified-id="Part-1---Scraping-The-Federalist-Papers-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Part 1 - Scraping The Federalist Papers</a></div>

# Attempt to Fix SSL Issue

Received the following error when trying to scrape the Federalist Papers using the Python 3.8 conda environment:

```
SSLError: HTTPSConnectionPool(host='avalon.law.yale.edu', port=443): Max retries exceeded with url: /18th_century/fed01.asp (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))
```

[This SO answer](https://stackoverflow.com/a/45909353) possibly explains the cause. I tried the fix from [this SO answer](https://stackoverflow.com/a/55632553) by adding the Anaconda directories to the shell `PATH` variable (without modifying the system `PATH`), but no dice. I also tried passing `verify=False` to the `get` method, but the error persisted.
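
One quick way to narrow this down is to check whether the `ssl` module can be imported at all in the active environment, independent of `requests`. The sketch below is only a diagnostic; `ssl_status` is a hypothetical helper name, not part of any library.

```python
import importlib


def ssl_status():
    """Return (True, OpenSSL version) if `ssl` imports, else (False, error message)."""
    try:
        ssl = importlib.import_module("ssl")
    except ImportError as exc:
        # This is the situation the SSLError above describes
        return False, str(exc)
    return True, ssl.OPENSSL_VERSION


available, detail = ssl_status()
print(available, detail)
```

If this reports `False`, the problem lies in the Python environment itself rather than in the request or the website.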

In [None]:
# Shell command to find out where anaconda is installed
anaconda_exec = !where anaconda
anaconda_exec

In [None]:
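# Prepend the Anaconda directories to the PATH environment variable (this session only).
# Note: `sys.path` is Python's module search path, not the PATH environment variable,
# so the directories are added to `os.environ["PATH"]` instead.
import os
conda_dirs = [
    "C:\\Users\\Sonya\\Anaconda3",
    "C:\\Users\\Sonya\\Anaconda3\\Scripts",
    "C:\\Users\\Sonya\\Anaconda3\\Library\\bin",
    "C:\\Users\\Sonya\\Anaconda3\\envs\\scipybase_Jun2021",
    "C:\\Users\\Sonya\\Anaconda3\\envs\\scipybase_Jun2021\\Scripts",
    "C:\\Users\\Sonya\\Anaconda3\\envs\\scipybase_Jun2021\\Library\\bin",
]
os.environ["PATH"] = os.pathsep.join(conda_dirs + [os.environ["PATH"]])

In [None]:
# Verify path env variable
os.environ["PATH"].split(os.pathsep)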

# Project Overview

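This notebook works with the Federalist Papers in three parts: scraping the papers from the web (Part 1), generating text that mimics each of the three authors with a trigram model (Part 2), and building a data frame of token frequencies for each paper (Part 3).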

# Part 1 - Scraping The Federalist Papers

**This section pulls the Federalist Papers from a website to give an example of the Jupyter `%run` command. You don't need to execute these cells if you have already retrieved the Federalist Papers.**

First we want to retrieve the contents of all the Federalist Papers from the law school website (`avalon.law.yale.edu`). We can do this in three steps: use the `requests` package to retrieve the HTML for each paper's web page, use the `bs4` package to parse that HTML, and save the text contents of each paper to our data folder. We have a stand-alone Python script at `scripts/scrape_federalist_papers.py` which does this; as an exercise you can try writing such a script yourself.
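
As a starting point for that exercise, here is a minimal sketch of the three steps. It is not the contents of `scripts/scrape_federalist_papers.py`. The URL pattern is an assumption inferred from the `fed01.asp` address in the error message above, and the helper names and the output file name are hypothetical.

```python
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Assumption: every paper follows the fedNN.asp pattern seen in the error message
BASE_URL = "https://avalon.law.yale.edu/18th_century/fed{:02d}.asp"


def paper_url(number):
    """Build the (assumed) URL for paper `number`."""
    return BASE_URL.format(number)


def extract_text(html):
    """Parse HTML and return its visible text, one text fragment per line."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)


def scrape_paper(number, outdir):
    """Download one paper and save its text; returns the path written."""
    response = requests.get(paper_url(number), timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    outfile = outdir / f"federalist_{number:02d}.txt"  # hypothetical file name
    outfile.write_text(extract_text(response.text), encoding="utf-8")
    return outfile
```

Calling `scrape_paper(1, "../data/federalist_papers")` would fetch and save the first paper. A real script would also need to strip page navigation and record each paper's author, which this sketch leaves out.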

Jupyter provides the `%run` magic command for executing external Python scripts; once the script finishes, its variables are available in your notebook namespace. We'll run our scraping script now, which writes the Federalist Papers as individual text files to the `data/federalist_papers` folder.
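
To see what "script variables become available" means outside a notebook, the standard library's `runpy.run_path` is a rough analogue: it executes a script and returns its global variables as a dictionary. The sketch below uses a throwaway script (`demo_script.py`, a hypothetical stand-in) rather than the real scraper, and `%run` itself does more than this.

```python
import runpy
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for an external script that defines a variable
    script = Path(tmp) / "demo_script.py"
    script.write_text('authors = ["Hamilton", "Madison", "Jay"]\n')

    # Execute the script; its globals come back as a dict
    namespace = runpy.run_path(str(script))

# Pull the script's variable into our own namespace, as %run does automatically
authors = namespace["authors"]
print(authors)  # ['Hamilton', 'Madison', 'Jay']
```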

In [2]:
%run ../scripts/scrape_federalist_papers.py

ModuleNotFoundError: No module named 'requests'

In [None]:
# A list variable called `authors` that stores the paper authors is created in the external script. 
# After using the `%run` magic we now have this list variable available in our notebook as well:
authors

In [None]:
### Part 2 - Mimicking the three Federalist authors ###


import nltk
import numpy

nltk.download('punkt')

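def draw_word(distrn):
    """Draw one word at random, weighted by its frequency in `distrn`."""
    words = list(distrn)
    freqs = [distrn[w] for w in words]
    total = sum(freqs)
    probs = [freq/total for freq in freqs]
    return numpy.random.choice(words, p=probs)

def generate_with_trigrams(text, word=None, num=100):
    """Print `num` words generated from a trigram model of `text`.

    `word` is an optional string holding one or two starting words.
    """
    tokens = nltk.tokenize.word_tokenize(text)
    trigrams = nltk.trigrams(tokens)
    # condition on each pair of consecutive words to count the words that follow it
    condition_pairs = (((w0, w1), w2) for w0, w1, w2 in trigrams)
    cfdist = nltk.ConditionalFreqDist(condition_pairs)
    if word is None:
        # no starting word: draw the first from unigram counts, the second from bigram counts
        prev = draw_word(nltk.FreqDist(tokens))
        word = draw_word(nltk.ConditionalFreqDist(nltk.bigrams(tokens))[prev])
    elif len(word.split()) == 1:
        # one starting word: draw the second from bigram counts
        # (will give an error if the starting word doesn't show up in the text)
        prev = word
        word = draw_word(nltk.ConditionalFreqDist(nltk.bigrams(tokens))[prev])
    else:
        # two starting words: use them as given
        # (will give an error if this pair doesn't show up in the text)
        prev, word = word.split()[:2]
    print(prev, end=' ')
    for i in range(1, num):
        print(word, end=' ')
        prev, word = word, draw_word(cfdist[(prev, word)])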
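
The idea behind the trigram generator above can be illustrated without `nltk`: count which word follows each pair of consecutive words, then repeatedly draw the next word in proportion to those counts. The following is a simplified standard-library sketch on a toy sentence, not a replacement for the function above; all names in it are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(tokens):
    """Map each (w0, w1) pair to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for w0, w1, w2 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w0, w1)][w2] += 1
    return model


def draw(counter, rng):
    """Draw one word with probability proportional to its count."""
    words = list(counter)
    weights = list(counter.values())
    return rng.choices(words, weights=weights, k=1)[0]


def generate(model, prev, word, num, rng):
    """Return up to `num` words, stopping early if a pair has no continuation."""
    out = [prev, word]
    while len(out) < num and model.get((prev, word)):
        prev, word = word, draw(model[(prev, word)], rng)
        out.append(word)
    return out


tokens = "the people of the state and the people of the union".split()
model = build_trigram_model(tokens)
print(model[("of", "the")])  # "state" and "union" each follow this pair once
print(" ".join(generate(model, "the", "people", 6, random.Random(0))))
```

Unlike the function above, this sketch stops quietly when it reaches a pair with no recorded continuation instead of raising an error.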

In [14]:
from pathlib import Path  # This is a useful class for constructing file system paths
import os
import pandas as pd

In [15]:
# Set the data folder that holds the scraped papers
datapath = Path(r"../data/federalist_papers")

In [24]:
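# `squeeze=True` was removed from `read_csv` in pandas 2.0; squeeze the one-column frame instead
author = pd.read_csv(datapath / "authors.csv").squeeze("columns").tolist()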

In [27]:
len(author)

84

In [26]:
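# each author will have all his papers merged into a single string
hamilton = ""
madison = ""
jay = ""

docnames = sorted(datapath / f for f in os.listdir(datapath) if f.endswith(".txt"))

# NOTE: author[i] raises IndexError when there are more .txt files than entries in
# `author` (84 above, while there are 85 Federalist Papers). One possible cause is
# `read_csv` consuming the first author as a header row.
N = len(docnames)
for i in range(N):
    with open(docnames[i], 'r') as f:
        text = f.read() + " "
    if author[i] == "Hamilton":
        hamilton += text
    elif author[i] == "Madison":
        madison += text
    elif author[i] == "Jay":
        jay += text
    # papers with mixed authorship, e.g. "Hamilton and Madison", are discarded

# a cell only displays its last expression, so show all three lengths together
len(hamilton), len(madison), len(jay)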

IndexError: list index out of range

In [None]:
# generate_with_trigrams(hamilton, "The")
# generate_with_trigrams(madison, "The")
# generate_with_trigrams(jay, "The")

In [2]:
%env CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1

env: CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1


In [1]:
### Part 3 - Creating data frame of token frequencies ###

import nltk  # used for tokenizing in the next cell
import pandas as pd
import re

In [None]:
N = len(docnames)
tables = [None]*N
for i in range(N):
    with open(docnames[i], 'r') as f:
        doc = f.read()
        doc = doc.replace("To the People of the State of New York:", "")
        doc = doc.replace("PUBLIUS", "")
        doc = doc.replace("Ã¥", "")
        doc = re.sub("[0-9]+", "", doc)
        doc = doc.lower()
        tokens = nltk.tokenize.word_tokenize(doc)
        tables[i] = nltk.FreqDist(tokens)

df = pd.DataFrame(tables)

# fill in zeros
df.fillna(0, inplace=True)

# divide rows by totals (vectorized: each row is divided by its own sum)
df = df.div(df.sum(axis=1), axis=0)

df.iloc[0] # check a row to make sure it worked

# write as csv
df.to_csv("../federalist.csv")

# write authors as well
with open("../authors.csv", "w") as f:
    f.write("author\n")
    for a in author:
        f.write(a + "\n")

# and write tokens
with open("../tokens.csv", "w") as f:
    f.write("token\n")
    for t in list(df):
        f.write('"' + t + '"\n')