# Part I: Python Basics

* Shift/enter runs a command in a cell
* If this doesn't work, try Kernel/Restart then shift/enter
* Cell/All output/Clear clears the results from the notebook
* The command palette (small rectangle above) lets you specify the type of a cell

* In MyBinder, you can edit a notebook but the notebook restores to defaults upon exit

In [None]:
import sys
import os
os.getcwd() # what is your current work directory? 
os.chdir("C:\\Users\\Administrator\\Desktop") # change your working directory (enter correct path). Note backslash direction 


## Basic data types in Python

In [None]:
print(2+2) # integer 

print(8/2) # floating-point numbers
type(8/2)

In [None]:
text = "Text analysis is pretty cool!"
type(text) # string 

In [None]:
#Lists are mutable vectors []


list_example = [12, "hungry", "dogs"]
type(list_example) # list
print(list_example[2]) # the index starts from 0
print(list_example[0:2]) # slice the list 

In [None]:
#Tuples are immutable ()

tuple_example = (12, "hungry", "dogs") # note the difference between () and []
len(tuple_example)
type(tuple_example) # tuple 
print(tuple_example[2]) # index with [] as with lists; tuple_example(2) raises "TypeError: 'tuple' object is not callable"
# unlike a list, a tuple is immutable: tuple_example[2] = "cats" raises a TypeError
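
To make the mutable/immutable distinction concrete, here is a minimal sketch. The variable names and values are made up for illustration and are not part of the original notebook.

```python
# Hypothetical example values
pets_list = [12, "hungry", "dogs"]
pets_tuple = (12, "hungry", "dogs")

pets_list[2] = "cats"  # a list can be changed in place
print(pets_list)

try:
    pets_tuple[2] = "cats"  # a tuple cannot be changed after it is created
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)
```

If you need a modified version of a tuple, one option is to convert it with `list(pets_tuple)`, edit the list, and convert back with `tuple(...)`.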

In [None]:
#Dictionaries map keys to values {}

dict_example = {"a": 1, "b" : 42, "text" : "hi there"}
type(dict_example) # dict
print(dict_example["text"]) # we can use key to get what we want 
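
Beyond looking up a key, dictionaries are usually updated and looped over. The sketch below uses a made-up dictionary (`dict_demo` is a hypothetical name) to show adding a key, overwriting a value, and asking for a key that may not exist.

```python
# Hypothetical example dictionary
dict_demo = {"a": 1, "b": 42}

dict_demo["c"] = 7    # add a new key/value pair
dict_demo["a"] = 100  # overwrite the value stored under an existing key

# loop over keys and values together
for key, value in dict_demo.items():
    print(key, value)

# .get returns a default instead of raising KeyError when the key is missing
missing = dict_demo.get("z", "not found")
print(missing)
```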

## Methods

In [None]:
#A Method is a command that is specific to an object type. 
#Here, .insert and .append are list methods.

text_list = [2, 5, "yes"]
text_list.insert(0, "no") # insert "no" to the location 0
print(text_list) 

text_list.append("whatever") # append "whatever" at the end of the list
print(text_list)
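
Because methods belong to an object type, strings and lists each have their own. The following sketch (with a made-up string, `phrase`) calls two string methods and one list method, then checks that a string has no `.append`.

```python
# Hypothetical example string
phrase = "Hungry dogs eat"

shouted = phrase.upper()  # .upper is a string method; it returns a new string
words = phrase.split()    # .split is a string method; it returns a list
words.append("fast")      # .append is a list method; it changes the list in place

print(shouted)
print(words)

# strings do not have the list method .append
has_append = hasattr(phrase, "append")
print(has_append)
```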

## Loops

In [None]:
#Loops are probably familiar. They are very useful when dealing with thousands of repetitive tasks

test_list = ["Dr.Slater", "Dr.Pepinsky", "Lily", "Samantha"]

for element in test_list:
    if "Dr." in element:
        print("Hello " + element + "!")

In [None]:
#use a loop to apply a method to a list 5 times 
myList=[]
for element in range(5):
    myList.append(element) # append is a method of list objects
print(myList)
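
A loop that only builds a list can also be written as a list comprehension, a compact form that the scraping cells later in this notebook rely on. This is a small sketch with made-up names showing that the two forms give the same result.

```python
# Build a list of squares with an explicit loop
squares_loop = []
for number in range(5):
    squares_loop.append(number ** 2)

# The same list written as a list comprehension
squares_comp = [number ** 2 for number in range(5)]

print(squares_loop)
print(squares_comp)
```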

## Functions

In [None]:
#A function is a block of organized, reusable code used to perform a single, related action.
#Here, the function 'perfect' takes one parameter, 'score', and runs on whatever value is passed in for it


def perfect(score):
    print("I got a perfect " + score)
    
perfect(score='100') 

In [None]:
def pinfo(name, age):
    print("Name:", name)
    print("Age:", age)

pinfo(age=25, name="Joanne") # keyword arguments can be given in any order
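
The functions above only print. A function can instead hand a value back with `return`, so the result can be stored and reused. The sketch below is illustrative only: `percent` and its default `total=100` are hypothetical, not part of the original notebook.

```python
def percent(score, total=100):
    """Return score as a percentage of total (total defaults to 100)."""
    return 100 * score / total

half = percent(50)                 # uses the default total
quiz = percent(score=9, total=10)  # keyword arguments, as in pinfo above

print(half)
print(quiz)
```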

## Regular expressions

* A regular expression is a sequence of characters that defines a search pattern. For example, ',' matches a comma and '\s' matches any whitespace character. See https://docs.python.org/3/library/re.html

In [None]:
#A Regular expression is a special sequence of characters for finding strings in text
#They are incredibly useful for finding and extracting text
# See https://docs.python.org/3/library/re.html

#For example, let's split 'happy, go lucky' wherever there is a comma, and wherever there is a space

import re
print(re.split(r',', 'happy, go lucky'))
#split wherever there is a comma (a comma is not a special RE character, so no backslash is needed)

print(re.split(r'\s', 'happy, go lucky'))
#split wherever there is whitespace (\s is the RE for any whitespace character; the r prefix keeps the backslash literal)

#try some regular expressions here  http://www.regexr.com/
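
The two splits above can be combined into one pattern, and `re.findall` can pull out every match of a pattern instead of splitting on it. This is a minimal sketch. The sample sentence and phone numbers are invented for illustration.

```python
import re

# Split on a comma (plus any following whitespace) OR on a run of whitespace
parts = re.split(r",\s*|\s+", "happy, go lucky")
print(parts)

# Find every substring shaped like ddd-ddd-dddd in a made-up sentence
sample = "Call 206-555-0100 or 206-555-0199 for details"
phones = re.findall(r"\d{3}-\d{3}-\d{4}", sample)
print(phones)
```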

In [None]:
#You write a script to scrape addresses from thousands of records. 
#The field you are scraping does not exist in one of the records so your program crashes. 
#try/except says in effect: if it works do it, if not, handle the error and move on to the next case!
#Here's an example that is part of a function that lets you know when it happens

def divide(x, y):
    try:
        result = x / y
    except ZeroDivisionError:
        print("division by zero!")
    else:
        print("result is", result)


divide(2.0,3.0)
divide(2.0, 0)

In [None]:
#If you just want to ignore bad cases (rather than printing an error message):

def divide(x, y):
    try:
        result = x / y
    except ZeroDivisionError: # name the exception; a bare except would also hide unrelated bugs
        pass
    else:
        print("result is", result)

divide(2.0,3.0)
divide(2.0, 0)
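
The scraping scenario described above (one record is missing the field you want) can be sketched the same way. The records below are invented. In a real script they would come from your scraper.

```python
# Hypothetical records; the second one has no "address" field
records = [
    {"name": "A", "address": "1 Main St"},
    {"name": "B"},
    {"name": "C", "address": "3 Oak Ave"},
]

addresses = []
skipped = 0
for record in records:
    try:
        addresses.append(record["address"])
    except KeyError:
        # the field is missing: count it and move on to the next record
        skipped += 1

print(addresses)
print("skipped:", skipped)
```

Counting the skipped records, rather than silently passing, makes it easier to notice when far more cases fail than you expected.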

### How to remove unicode from files?

The characters you see in text are stored using an underlying encoding. 
ASCII is a common encoding, but it covers only 128 characters, and many texts contain Unicode characters outside that range (accented letters, curly quotes, and so on).
A simple way to deal with the problems they create is to remove them in advance.


In [None]:
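# This script converts the text to an ASCII-only string.
# If a character can't be converted to ascii (e.g. some unicode characters), ignore it 
# encode() returns bytes, so decode back to str (str(bytes) would give "b'...'")

sentence = "Text analysis is pretty cool!" # placeholder; use your own text
cleanwords = sentence.encode('ascii', errors='ignore').decode('ascii')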
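
To see what `errors='ignore'` actually does, the sketch below runs it on a made-up phrase with two accented letters. It also shows one possible alternative, offered here as an assumption rather than something the original notebook uses: normalizing with the standard library's `unicodedata` first, so that accented letters are reduced to their base letters instead of disappearing.

```python
import unicodedata

# Hypothetical sample text: "café naïve", written with escapes to be unambiguous
sample = "caf\u00e9 na\u00efve"

# Plain 'ignore' drops the accented letters entirely
dropped = sample.encode("ascii", errors="ignore").decode("ascii")
print(dropped)

# NFKD normalization separates each letter from its accent mark,
# so 'ignore' then removes only the accent marks
decomposed = unicodedata.normalize("NFKD", sample)
kept = decomposed.encode("ascii", errors="ignore").decode("ascii")
print(kept)
```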

In [None]:
# Here's a more deliberate way to remove unicode from a list of files.
# This worked when the above ignore command did not 
# list1 is assumed to be a list of strings, one per file

list2 = []
for file_counter, file in enumerate(list1, start=1):
    missing_words = 0
    kept_words = []
    for word in file.split():
        try:
            word.encode('ascii') # raises UnicodeEncodeError if the word has a non-ascii character
            kept_words.append(word)
        except UnicodeEncodeError:
            missing_words += 1
    list2.append(' '.join(kept_words))
    print('File: %s | Missing words: %s' % (file_counter, missing_words))



# Part II: Collecting and Pre-processing Text 

* Scraping (two examples using an API or scraping directly from a website)
* Splitting documents
* Tokenizing and cleaning

## Scrape data using API
* This [tutorial](https://dlab.berkeley.edu/blog/scraping-new-york-times-articles-python-tutorial) demonstrates how to use the New York Times Articles Search API using Python. 

## Scrape a single website

In [None]:
from urllib.request import urlopen

polisci_url = urlopen('https://www.polisci.washington.edu/people')
type(polisci_url) 
polisci_page = polisci_url.read()

print(polisci_page)
type(polisci_page) # read() returns raw bytes, which must be decoded to get a string

In [None]:
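# Let's fix that by using the character encoding from the `Content-Type` response header
# Re-open the page first: a response can only be read once, so a second read() would return nothing
polisci_url = urlopen('https://www.polisci.washington.edu/people')
charset_encoding = polisci_url.info().get_content_charset() or 'utf-8' # fall back to utf-8 if no charset is given
# apply encoding
polisci_page = polisci_url.read().decode(charset_encoding)

type(polisci_page) # Now it is a long string (str) instead of bytes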

In [None]:
# A handier way to fix possible encoding issues
polisci_page = urlopen('https://www.polisci.washington.edu/people').read().decode('utf-8')
print(polisci_page)
type(polisci_page)

In [None]:
# Create a new list, polisci_list, by splitting the polisci_page string at every run of whitespace
polisci_list = polisci_page.split()
polisci_list

## How to make the text more readable?

OK, that list is kind of a mess because it contains a lot of HTML code that we don't care about. 
Fortunately, people have written programs to remove much of it.

In [None]:
from urllib.request import urlopen
import re
from bs4 import BeautifulSoup

polisci_url = urlopen("https://www.polisci.washington.edu/people")
polisci_page = BeautifulSoup(polisci_url.read(), 'html.parser') # name the parser explicitly to avoid a warning
polisci_text = polisci_page.get_text()
print(polisci_text)
type(polisci_text) # string

In [None]:
# It is almost always the case that additional cleanup steps are required. For example...
#remove all of the blank lines 

lines = (line.strip() for line in polisci_text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
politext = '\n'.join(chunk for chunk in chunks if chunk)
print(politext)
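
The three cleanup lines are easier to follow on a small input. This sketch runs the same steps on a made-up string (`raw_text` is hypothetical) that has blank lines, padding, and two phrases separated by a run of spaces.

```python
# Hypothetical messy text, similar in shape to what get_text() returns
raw_text = "  People  \n\n\n  Faculty    Staff  \n\n Contact us \n"

# 1. trim leading and trailing whitespace from every line
lines = (line.strip() for line in raw_text.splitlines())
# 2. break lines at double spaces, so phrases separated by wide gaps become separate chunks
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# 3. drop the empty chunks and put one chunk per line
clean_text = "\n".join(chunk for chunk in chunks if chunk)

print(clean_text)
```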

## Parsing

In [None]:
# Or we might want to extract specific things from a text using a regular expression!
#Here the RE finds all of the strings corresponding to a particular pattern of characters (an email address)

polisci_emails = set()
emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", politext, re.I)) #https://docs.python.org/3/library/re.html
polisci_emails.update(emails)

print(polisci_emails)
type(polisci_emails)
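
The email pattern can be tried offline on a short string before running it on a scraped page. The addresses below are invented (using the reserved example.org domain). Note that a plain set treats addresses differing only in letter case as distinct, so this sketch lowercases them before removing duplicates. That step is an addition, not something the cell above does.

```python
import re

# Hypothetical sample text with a repeated address in different letter case
sample_text = ("Contact Jane.Doe@example.org or john_smith@example.org; "
               "JANE.DOE@example.org is a duplicate.")

pattern = r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+"
found = re.findall(pattern, sample_text, re.I)  # re.I makes the match case-insensitive
print(found)

# lowercase before deduplicating so case variants collapse into one entry
unique_emails = sorted({email.lower() for email in found})
print(unique_emails)
```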

In [None]:
# convert the set of emails above to a list 
email_list = list(polisci_emails)
print(email_list)

### Parse names and positions and save them locally

In [None]:
import requests 
from bs4 import BeautifulSoup

# collect and extract specific information from a web page using html fields.
#to see these fields, go to the website, right click and 'view page source'

r = requests.get("https://www.seattle.gov/elected-officials") 
soup = BeautifulSoup(r.text, 'html.parser')
boxes = soup.find_all('div', class_ = 'primaryContent')

name_item = [box.h3.text for box in boxes]  #looks for text that is boxed by h3
position_item = [box.span.text for box in boxes]  #looks for text that is boxed by span
    
# save the results to seattle_officials.csv in your working directory for easier viewing
import pandas
pandas.DataFrame({'position':position_item, 'name':name_item}).to_csv('seattle_officials.csv', index = False)

print(name_item)
print(position_item)

### Parse the timeline of US-Iran relations

In [None]:
import requests

r = requests.get('http://pri.org/stories/2020-01-03/history-us-iran-relations-timeline')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

header_tags = soup.find_all('h3')  
headers = [h.text.strip() for h in header_tags]  #.text drops the html tags around each h3 and .strip() trims surrounding whitespace

print(header_tags)
print(headers)

dates = [header.split(': ')[0] for header in headers]  #split what's in the h3 box by : and first item is date
events = [header.split(': ')[1] for header in headers] #second item is event
descriptions = [h.next_sibling.text.strip() for h in header_tags] 
#next_sibling says grab the next item at the same html level after each header_tag. Shows how much you can do!

print(dates)
print(events)
print(descriptions)

import pandas
pandas.DataFrame({'date':dates, 'event': events, 'description':descriptions}).to_csv('iran_history.csv', index = False, sep = ',')

## Another example

In [None]:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
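    date = result.find('strong').text[0:-1] + ', 2017'  #text inside the <strong> (bold) tag, minus its last character, plus the year
    lie = result.contents[1][1:-2]  #second child of the span, trimmed of its surrounding characters
    explanation = result.find('a').text[1:-1]  #link text without its first and last characters
    url = result.find('a')['href']  #link target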
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')