<font color=darkred>

# Soc220: Computational Text Analysis
## Lab1: A (re)introduction to Python

<br>

![philology](https://lh3.ggpht.com/-ZPt_AIFJdXk/Tzzs9tdvFRI/AAAAAAAADLA/eSVlRz4VS-g/s1600/rudimenta.jpg)

***
    v4
    1/29/2018
    Zach Wehrwein
    (Image: A High Medieval philology seminar)

### Lab syllabus

This lab meets weekly following the discussion section with Bart. We will review an implementation of the method discussed. Here's the basic plan: on Wednesday we discuss a method, then on Thursday, I show you how to do it, and then from Friday through Sunday, you work on your own to implement and interpret said method on your own text data. Come Monday, I am available all day for office hours to discuss any issues you have.

Office hours booking: [zwehrwein.youcanbook.me](zwehrwein.youcanbook.me)

-- Be on the lookout for frequent feedback forms.



### Rough sketch of the labs:

Week 1.) A (re)Introduction to Python  <br>
Week 2.) Web scraping; regular expressions<br>
Week 3.) APIs; storing data <br>
Week 4.) Preprocessing and data wrangling: stemming, cleaning, lemmatization, and corpus structure choices. Simple counting and visualization. <br>

Week 5.) Basic NLP (parts of speech tagging, named entity recognition), dictionary methods. <br>
Week 6.) Classifying text with ML: logistic regression, Naive Bayes, SVMs <br>

Week 7.) Topic Models 1 (LDA) <br>
Week 8.) Topic Models 2 (probably stm)<br>
Week 9.) Word embeddings; word2vec <br>
Week 10.) Experimental network analysis and text methods<br>

Week 11.) Sentiment analysis and the problem of finding meaning.
   
***

### 1: Outline of today
(n.b. This lab used code from labs in AC209a: Data Science 1 @Harvard & PS425: Text as Data @Stanford)
   
1. [Jupyter notebooks](#jup_notebooks)
1. [Installing and importing libraries/packages/modules](#install_libraries)
1. [Common error messages](#error_messages)
1. [Variables and data structures](#data_structures)
1. [Loops and list comprehensions](#loops_and_lists)
1. [User-defined functions and lambda functions](#functions)
1. [Control flow](#control_flow)

***

<font color=darkpurple>
    
[Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)

<a id='jup_notebooks'></a>

<font color=darkblue>

### 1.) Jupyter notebooks

What are Python notebooks good for?

* Easy to share code (as in all of your labs this semester)
* Easy to reproduce analysis (for all of your homeworks)
* Great for mixing text, code, and output.
* Code can be run in chunks.
* Great for customizing and exporting visualizations
* Excellent library management
* Pressing 'tab' in a Jupyter notebook opens a menu of available commands.
* Jupyter Lab also includes 'suggested code' while you work, which can be extremely useful.

What's not so great?

* Unlike the R IDE RStudio (or Stata), there is no built-in way to see which objects are in active memory and no GUI, and, like many aspects of working in the CS domain, notebooks can be 'social science unfriendly.' Jupyter Lab is an alternative in development. Rodeo is a Python IDE, but not a very functional one. Atom is the go-to free text editor.
* The 'guts' of various options aren't always intuitive.


**Installations**

Anaconda: [Anaconda](http://continuum.io/downloads) <br>
(This includes many useful packages, a 3.6 distribution of Python, and the Jupyter notebook program. This is all you need!)

Jupyter Lab: `conda install -c conda-forge jupyterlab` in Terminal

Atom: [Atom](https://atom.io/)

For Windows users: [Git Bash](https://git-scm.com/downloads) is a good "Terminal" to use.

Then, be sure to update by typing the following in Terminal:

```
conda update conda
conda update anaconda
```

+ To open jupyter notebook, type `jupyter notebook` in Terminal
+ To open a jupyter lab notebook, type `jupyter lab` in Terminal

#### Magic commands

Run a script from inside a notebook: `%run myscript.py`

You can also time how long code takes to run, which is useful for large computations or complicated data wrangles. Put `%%timeit` on the first line of a cell to time the whole cell, or put `%timeit` in front of a single statement to time just that line.
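
Magic commands only work inside IPython/Jupyter. Outside a notebook, the standard library's `timeit` module does a similar job. Here is a minimal sketch; the function `squares` and the run count of 200 are illustrative choices, not part of the lab:

```python
import timeit

def squares(n):
    """Return the squares of 0..n-1 as a list."""
    return [i * i for i in range(n)]

# Run squares(1000) 200 times and measure the total elapsed time in seconds.
elapsed = timeit.timeit(lambda: squares(1000), number=200)
print("200 runs took", round(elapsed, 4), "seconds")
```

The number you see will vary from machine to machine, so compare timings only when they were taken on the same computer.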

<font color=red>
#### Exercise

Once you have opened a notebook, execute the following two cells:

In [None]:
x = [10, 20, 30, 40, 50]
for ele in x:
    print("This element is",ele)

In [None]:
import sys
print(sys.version)

<a id='install_libraries'></a>

<font color=darkblue>

***
### 2.) Installing and loading libraries

<font color=black >

Let's say you are working away and wish to execute something complicated or you've found some website telling you to use a given library that you don't have installed. This is what you should do.

Libraries are families of functions and methods. You can import an entire library or you can import a specific function or method from a library. n.b. Sometimes libraries conflict with one another and loading a large package can be memory heavy.

To install libraries:

1) In Terminal, type `pip install LIBRARY_NAME`, or do the same in a notebook cell by prefixing the command with `!`. <br>
2) Adding `-U`, as in `pip install -U LIBRARY_NAME`, tells pip to upgrade the package to the newest available version (this is useful for debugging purposes).
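
Before installing anything, it can help to check whether Python can already find the library. One possible way, sketched with the standard library's `importlib` (the helper name `is_installed` is made up for this example):

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None

for name in ["json", "networkx"]:
    print(name, "installed:", is_installed(name))
```

Keep in mind that the name you import is not always the name you install: the BeautifulSoup library is imported as `bs4`, for example.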

<font color=darkblue>

**Use the following code to install any libraries that you may not have:**

In [None]:
!pip install NetworkX

In [None]:
!pip install -U NetworkX

<font color=red>
Exercise 
    
Check to make sure the following libraries are installed and up to date.

In [None]:
#System related
import IPython
print("IPython version: %6.6s (need at least 5.0.0)" % IPython.__version__)

#computationally related
import numpy as np
print("Numpy version: %6.6s (need at least 1.12.0)" % np.__version__)
import scipy as sp
print("SciPy version: %6.6s (need at least 0.19.0)" % sp.__version__)
import pandas as pd
print("Pandas version: %6.6s (need at least 0.20.0)" % pd.__version__)
import sklearn
print("Scikit-Learn version: %6.6s (need at least 0.18.1)" % sklearn.__version__)

#for visualizations
import seaborn
print("Seaborn version: %6.6s (need at least 0.7)" % seaborn.__version__)
import matplotlib
print("Matplotlib version: %6.6s (need at least 2.0.0)" % matplotlib.__version__)

#for web scraping
import bs4
print("BeautifulSoup version: %6.6s (need at least 4.4)" % bs4.__version__)
import requests
print("requests version: %6.6s (need at least 2.9.0)" % requests.__version__)

#for text analysis
import gensim
print("gensim version: %6.6s (need at least 3.2.0)" % gensim.__version__)
import nltk
print("nltk version: %6.6s (need at least 3.2.5)" % nltk.__version__)

<font color=darkgreen>
    
### Function vs Method

In [None]:
#import the following libraries
import numpy as np

#simulate 500 experiments of 500 fair coin flips each; each value is a count of heads
heads = np.random.binomial(500, .5, size=500)

#TWO WAYS TO IMPORT:
#TO IMPORT AN ENTIRE LIBRARY
#import [LIBRARY] as [SHORTCUT]
import matplotlib.pyplot as plt

In [None]:
#[LIBRARY].[FUNCTION]
plt.hist(heads)
plt.show()

In [None]:
# TO IMPORT A SPECIFIC FUNCTION
# from [LIBRARY] import [FUNCTION]
from matplotlib.pyplot import hist 

In [None]:
hist(heads)
plt.show()
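
The heading above contrasts functions with methods. Roughly: a function stands on its own and takes the object as an argument, while a method belongs to an object and is called with dot notation. A small sketch using only built-ins (the word list is made up for illustration):

```python
words = ["whale", "ship", "sea"]

# Function: stands alone and takes the object as an argument.
n_words = len(words)
ordered = sorted(words)      # returns a new sorted list; words is unchanged

# Method: belongs to the object and is called with dot notation.
shouted = words[0].upper()   # str method, returns a new string
words.append("harpoon")      # list method, changes words in place

print(n_words, ordered, shouted, words)
```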

<a id='error_messages'></a>

<font color=darkblue>
***
    
### 3.) Common error messages and what to do about them

I don't understand the error message I got, what should I do?

+ Well, perhaps don't Google. Py3.6 is very different from Py2.7 -- if you start copying code from the wrong version history, there are many ways things can fail.
    - But, then again, googling the error message is probably your best bet.
    + Stack Overflow and GitHub can be invaluable sources of code, but always be sure you know what you are doing. In this class, we ask that for *every line of code that you copy, you comment above it describing what it is doing, even if it's really obvious.*
+ [Fluent Python](https://github.com/fluentpython/example-code), [Learn Python the Hard Way](https://github.com/wzpan/Learn-Python-The-Hard-Way), and [the core Python docs on errors](https://docs.python.org/3/tutorial/errors.html) are great resources.
+ More often than not, from personal experience, lots of errors are a function of misapplying methods or functions to objects (which makes sense in an object-oriented programming language).
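
As a small illustration of that last point, here is what happens when a list method is applied to a string. This is only a sketch: the `try`/`except` is there so the cell keeps running and we can read the message.

```python
text = "Call me Ishmael."

try:
    text.append("!")          # append is a list method, not a string method
except AttributeError as err:
    message = str(err)
    print("AttributeError:", message)

# dir() lists what the object does support; strings have upper but no append.
has_upper = "upper" in dir(text)
has_append = "append" in dir(text)
print(has_upper, has_append)
```

The error message names both the type of the object and the attribute that is missing, which is usually enough to spot the mismatch.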

In [None]:
# use the dir function to see which methods apply to a given object.
dir(heads)

In [None]:
# this raises an AttributeError: numpy arrays have no hist method
heads.hist

In [None]:
# function applied to a numpy array
hist(heads)

In [None]:
# we need to call it again in order to actually show it.
plt.show()

![a](https://pbs.twimg.com/media/Cl0oWqfWEAAH8bA.jpg)

***

<a id='data_structures'></a>

<font color=darkblue>

### 4.) Types of variables and data structures

![signsnotbuckets](sticksnotboxes.png)

(from Fluent Python)

#### Variables
* A variable holds a single value
* Strings, integers, floats, and booleans
* Remember, under the covers they are all objects.
* Multiple variables can be output with a single call to the print() function.
* \t can be used to add a tab, while \n inserts a new line.
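
A short sketch of the last two points; the variable names and values are just for illustration:

```python
name = "Ishmael"
ship = "Pequod"

# print accepts several values and separates them with a space by default.
print("Narrator:", name, "Ship:", ship)

# \t inserts a tab and \n starts a new line inside a string.
table = "name\tship\n" + name + "\t" + ship
print(table)
```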

In [None]:
# Python as calculator

print('Addition:', 5 + 5)
print('Multiplication:', 5 * 5)
print('Subtraction:', 5 - 5)
print('Division:', 5 / 5 )
print('Floor division (discards the fractional part):', 53 // 5 )
#n.b. the modulo operator is useful for data wrangling and finding the 'end' of a list.
print('Modulo (returns the remainder):', 53 % 5 )
print('Exponents:', 5 ** 2 )

In [None]:
a = 1
print(a, type(a))

a = str(a)
print (a, type(a))

a = float(a)
print (a, type(a))

a = int(a)
print (a, type(a))

In [None]:
print(1 == 2) #False
print(1 != 2) #True
print(1 > 2 ) #False
print(1 < 2 ) #True
print(2 >= 2) #True
print(1 <= 2) #True

#### Strings

- Human text is stored as a string
- Strings are really sequences of characters, though; computers don't understand what constitutes a word or a sentence.

In [None]:
print("Call me Ishmael.")

In [None]:
a = "Call me"
b = 'Ishmael'

In [None]:
print(a+" "+b)
print(a*3)

In [None]:
len("Call me Ishmael.")
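
Because a string is just a sequence of characters, we can index and slice it, but we have to tell Python what counts as a "word." A minimal sketch, assuming we split on whitespace:

```python
sentence = "Call me Ishmael."

# Indexing and slicing work character by character.
first_char = sentence[0]
first_word_chars = sentence[:4]

# Python only sees characters; to get "words" we pick a rule, here splitting on whitespace.
tokens = sentence.split()

print(first_char, first_word_chars, tokens)
```

Notice that the period stays attached to the last token. Deciding what to do with punctuation is part of preprocessing.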

#### Lists
- Lists can be used to contain a sequence of values (or entities) of any type.
- Lists are ordered. Operations are not applied item by item automatically, so to transform every item you must iterate over the list.
- List indices start at 0, so the first value of a list can be printed using `list[0]`.
- Lists can be sliced using start and end indices, `list[start:end]`; the item at `end` is not included.
- Lists are mutable data structures, meaning that they can be changed (e.g. added to) after they are created.
- For vector algebra, use a numpy array.
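
A quick sketch of the mutability and "no automatic arithmetic" points, in plain Python with made-up numbers:

```python
numbers = [1, 2, 3]

# Lists are mutable: append changes the list in place.
numbers.append(4)

# * repeats a list; it does not multiply the items.
repeated = numbers * 2

# To transform each item, iterate over the list.
doubled = []
for n in numbers:
    doubled.append(n * 2)

print(numbers, repeated, doubled)
```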


#### Tuples
- Much like lists, but they are immutable -- they cannot be changed after assignment!
- Selectively useful in computational social science.
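
One place tuples earn their keep in text work is as fixed pairs, such as a word with its count, or a two-word phrase used as a dictionary key. A minimal sketch with made-up counts:

```python
# A (word, count) pair stored as a tuple.
pair = ("whale", 3)
word, count = pair          # tuples can be unpacked into variables

try:
    pair[1] = 4             # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

# Because they are immutable, tuples can be used as dictionary keys.
bigram_counts = {("white", "whale"): 2}
print(word, count, changed, bigram_counts[("white", "whale")])
```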

#### List examples

In [None]:
empty_list = []
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mixed_list = [1, 2., 3, 4., 5]
string_list = ['a','b','c','d','e']
print(empty_list)
print(int_list)
print(mixed_list, float_list)

In [None]:
print(int_list[0])
print(int_list[1])

In [None]:
#IndexError: this asks for the 11th item of a 5-item list.
print(float_list[10])

In [None]:
string_list = ['a','b','c','d','e']

print(string_list[0])  #select an individual item in a list (n.b. returns ITEM not a LIST!)
print(string_list[0:2])#select between two items, returns list
print(string_list[3:]) #select all after an item, returns list
print(string_list[:3]) #selects before an item, returns list

#### Sets

- Lists can contain duplicate values.
- A set, in contrast, contains no duplicates.
- Sets can be created from lists using the set() function.
- Alternatively, we can write a set literal using curly braces, `{` and `}`. (n.b. curly braces are also used for dictionaries, and an empty `{}` creates a dict, not a set.)
- Sets are mutable like lists (meaning we can change them)
- Duplicates are automatically removed
- Sets do not have an order.
- Therefore we cannot index or slice them.
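
In text analysis, a common use of sets is turning a list of words into a vocabulary. A small sketch with a made-up word list:

```python
words = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

vocabulary = set(words)          # duplicates are dropped
print(len(words), len(vocabulary))

# Membership tests are a typical use of a set.
print("whale" in vocabulary, "harpoon" in vocabulary)

# To get an order back, sort the set into a list.
ordered_vocabulary = sorted(vocabulary)
print(ordered_vocabulary[0])
```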


In [None]:
X = {1, 2, 3, 4}
X.add(0)
X.add(5)
print(X)
X.add(5)
print(X)

In [None]:
# n.b. sets have no index
X[0]

#### Dictionaries

- Building on sets, which are a bit useless since we cannot index them, dictionaries associate a key with a value.
- Dictionaries can be specified with `{key: value, key: value}` as well as `dict([('key', value), ('key', value)])`
- Keys are usually strings or numbers (any hashable object works); values can be anything.
- Dictionaries are mutable (they can be changed), e.g. `adict['g'] = 41`.
- Extremely flexible: they can hold anything, and they work well when written out to a JSON object (more on that next week).


In [None]:
dict1 = {'a' : 0, 'b' : 1, 'c' : 2}
dict2 = dict([(1, 'a'), (2, 'b'), (3, 'c')])

In [None]:
print(dict1,dict2, '\n', type(dict1),type(dict2), '\n',dict1['b'],dict2[2])

In [None]:
dict1.values()

In [None]:
dict1.items()

In [None]:
#dictionaries can also be looped over; .items() yields (key, value) pairs
for item in dict1.items():
    print(item)
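
Dictionaries are the natural container for word counts. Here is one way to build them by hand, sketched on a made-up sentence:

```python
text = "the whale and the sea and the ship"

word_counts = {}
for word in text.split():
    # .get returns the current count, or 0 the first time we see a word.
    word_counts[word] = word_counts.get(word, 0) + 1

print(word_counts)
print(word_counts["the"])
```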

***

<a id='loops_and_lists'></a>

<font color=darkblue>

### 5.) Loops and list comprehensions

In [None]:
# adjacent string literals inside parentheses are joined into one string
ishmael = ('Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, '
           'and nothing particular to interest me on shore, I thought I would sail about a little and see the '
           'watery part of the world.')

In [None]:
# FOR LOOP
# for [entity] in [iterable]:
#   do_something_to [entity]
for c in ishmael[:15]:
    print(c)

In [None]:
# WHILE LOOP
# start a counter
i = 0
# while the counter condition is not met, the code executes.
while i < len(ishmael[:15]):
    #print character in a string
    print(ishmael[:15][i])
    #increment up the counter
    i = i + 1

<font color=red>

#### Exercise

What would happen if we removed the line that increments the counter above?

In [None]:
#selecting the first 3 words and printing them
words_list = []
# more on these string commands next week, but here we are saying 'for each word in ishmael, split on blank spaces, and append the first three words to a blank list'
for word in ishmael.split()[:3]:
    words_list.append(word)

print(words_list)

**Loops work off of any iterable**

What is an iterable? Very simply, it's a thing that Python can 'chunk' through one element at a time: a string, a list, a dictionary, and so on. One type of error you might encounter when writing for-loops is that the computer won't know what the next step is, because you gave it an object (such as a single integer) that has no first element, second element, third, nth, etc. `enumerate` is a command that wraps an iterable and pairs each element with its index, so a loop can use both.

In [None]:
ishmael_iterator = enumerate(ishmael[:15])

for index, char in ishmael_iterator:
    print(index,' ',char)
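
The error described above typically shows up when a loop is handed something that cannot be stepped through, such as a bare integer. A minimal sketch (the `try`/`except` is only there so we can see the message without stopping the notebook):

```python
n_words = 3

try:
    for i in n_words:        # an integer is not iterable
        print(i)
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)

# range(n_words) is iterable: it yields 0, 1, 2.
indices = [i for i in range(n_words)]
print(indices)
```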

#### List comprehensions

- A compact for-loop, useful for expressing complex operations in one place. Let's say you have a for-loop marching through a list of web pages scraping data. Perhaps it's more computationally efficient to clean the data *as it comes in* as opposed to getting all the data first and then cleaning it. A list comprehension would allow you to do just that in a single line of code.
- Also great for simple filtering.

In [None]:
#list comprehensions are a shortened form of a for-loop
[word for word in ishmael.split()[:3]]

In [None]:
# start with a list of integers
int_list = [0,1,3,4,16,32,64,128,256,512,725,1024,2048,2500]

In [None]:
for number in int_list:
    print('Number:',number,'and that number squared:',number*number)

for ELE in LIST:
    if CONDITIONAL:
        operation(ELE)
        
[operation(ELE) for ELE in LIST if CONDITIONAL]

In [None]:
[number*number for number in int_list]

In [None]:
squaredlist = [number*number for number in int_list]

In [None]:
# keep only the even numbers from squaredlist
[number for number in squaredlist if number % 2 == 0]
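
To connect this back to cleaning text as it comes in, here is a sketch of a comprehension that both transforms and filters. The cleaning rule (lowercase, strip periods and commas, drop short words) is an arbitrary choice for illustration:

```python
raw_words = "Call me Ishmael. Some years ago--never mind how long precisely".split()

# Lowercase each word and strip periods/commas from its ends,
# keeping only words longer than 3 characters (measured before cleaning).
cleaned = [word.lower().strip(".,") for word in raw_words if len(word) > 3]
print(cleaned)
```

Note that 'ago--never' survives as one token, because the rule above only splits on whitespace.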

***

<a id='functions'></a>

<font color=darkblue>

### 6.) Functions

<font color=black>

- Useful to create a series of functions at the beginning of data wrangling / break down code into understandable chunks.
- Can have both default and completely user-driven input.

In [None]:
def name_of_function(arg):
    '''
    Describe what function does
    Specify input and output
    
    List data types, e.g. String --> Float   
    
    '''
    
    ...
    
    
    return(output)

In [None]:
def square(x):
    '''
    Squares inputted number
    
    Float --> Float
    '''
    x_sqr = x*x
    return(x_sqr)

def cube(x):
    '''
    Cubes inputted number
    
    Float --> Float
    '''
    x_cub = x*x*x
    return(x_cub)

square(5),cube(5)
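
The bullets above mention default input. A default argument is a value the function falls back on when the caller does not supply one. A small sketch; the function name `power` is made up for this example:

```python
def power(x, exponent=2):
    '''
    Raises x to the given exponent (squares by default)

    Float, Int --> Float
    '''
    return x ** exponent

print(power(5))               # uses the default exponent
print(power(5, exponent=3))   # caller overrides the default
```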

<font color=darkblue>

### Lambda functions

- The spiritual equivalent of list comprehensions, lambda functions are 'single use' functions defined in a single line.
- Useful for clarity in code.

Syntax:
`lambda arg, arg2,... arg_n : <inline expression using args>`

In [None]:
lambda_square = lambda n: n*n
lambda_square(5)

We can use anonymous functions in combination with other built-in functions like map (applies a function to each item in a list), filter (keeps the items of a list for which the function returns a true output), and reduce (applies a function cumulatively, combining a running result with each item of a list in turn).

In [None]:
# map: apply this function to each item in this list
map_out = map(lambda_square, [1, 2, 3, 4, 5])

In [None]:
#map and filter objects are created so you have to use list to actually get the values out.
list(map_out)

In [None]:
# return the entities in this list for which the function returns True
lambda_evens = lambda n: n%2 == 0
filter_out = filter(lambda_evens, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

In [None]:
list(filter_out)

In [None]:
# reduce is no longer a built-in in Python 3, so we have to import it.
from functools import reduce

# apply the function cumulatively along the sequence, i.e. ((((1+2)+3)+4)+5)
lambda_sum = lambda x, y: x + y
reduce(lambda_sum, [1, 2, 3, 4, 5])
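
Another place lambda functions tend to show up is as the `key` argument to `sorted`, where a one-line rule is all that's needed. A sketch with a made-up word list:

```python
words = ["whale", "sea", "harpoon", "ship"]

# key tells sorted what to compare; here, the length of each word.
by_length = sorted(words, key=lambda w: len(w))

# Sort by last letter instead, without defining a named function.
by_last_letter = sorted(words, key=lambda w: w[-1])

print(by_length)
print(by_last_letter)
```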

<a id='control_flow'></a>

<font color=darkblue>

### 7.) Control flow

<font color=black>

- 'if' : run the block when the condition is met
- 'elif' : ("else if") checked only when the previous if or elif was not met
- 'else' : do something else when none of the conditions above were met
- 'return' : exits the function and returns a value to the caller.
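
Here is a sketch that puts all four keywords together; the function name and length cutoffs are arbitrary choices for illustration:

```python
def describe_length(word):
    '''
    Labels a word by its length

    String --> String
    '''
    if len(word) <= 3:
        return "short"
    elif len(word) <= 6:
        return "medium"
    else:
        return "long"

labels = [describe_length(w) for w in ["sea", "whale", "harpoon"]]
print(labels)
```

Only the first branch whose condition is met runs, and `return` leaves the function immediately.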

***

<font color=darkred>

### Exercise: Fizz Buzz

<font color=black>
Automate this great drinking game: count up, but on every multiple of 3 say 'fizz' instead of the number, and on every multiple of 5 say 'buzz.' For multiples of both, say "fizz-buzz." Do this up to 50.

For instance, counting around in a circle: "1, 2, ~~3~~ fizz, 4, ~~5~~ buzz, ~~6~~ fizz, 7, 8, ~~9~~ fizz, ~~10~~ buzz, 11, ~~12~~ fizz, 13, 14, ~~15~~ fizz-buzz."

In [None]:
for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print('FizzBuzz')
    elif i % 3 == 0:
        print('Fizz')
    elif i % 5 == 0:
        print('Buzz')
    else:
        print(i)

***

<font color=darkred>

## Class Exercise / HW1

<font color=black>

1. Write a for-loop that counts up from 0 to 100 and prints all prime numbers.

2. Write a function called `isprime` that takes in a positive integer $n$ and determines whether or not it is prime. Return $n$ if it's prime and return nothing if it isn't.

3. Using a list-comprehension and `isprime`, create a list `primes_to_100` that contains all the prime numbers less than 100. 

***

Start with the simplest situation: how do we determine if a single number is a prime? Well the definition of a prime is a natural number greater than 1 whose only whole-number factors are 1 and itself, i.e. it has no other divisors. 2, 3, 5, 7, 11, 13, 17, 19....

<font color=darkgreen>

Hint: think of a computer as like a dumb child (or undergrad RA) who you are forcing to do repetitive tasks over and over again. What thing would you make the machine do over and over again to check if something is prime?