<center>
<h1>Python 2</h1>
</center>


Matt Jansen, Davis Library Research Hub <br>
February 6, 2018

**Abstract:**
This workshop will:
* Briefly review the basics covered in Python 1
* Show how to work with files and directories
* Introduce new data types and statements
* Survey some of the more important data-related packages

### NOTE:
This workshop assumes that you already have the <a href="https://www.anaconda.com/download/">Anaconda distribution</a> of Python 3.6 installed.  Detailed installation instructions are available in the <a href="http://gis.unc.edu/instruction/Python/Python_1_S18.html">Python 1 materials</a>. 

# Pseudocode

As you get started coding in Python, there will be many tasks and steps you aren't familiar with!  As you learn new functions and approaches, you'll become better at searching for help online and reading documentation.  Learning to write and use pseudocode where appropriate can help you organize your plan for any individual script.

Pseudocode is essentially a first draft of your code, written in English for **human consumption**, though with the tools of your programming language in mind.  For example, we might write pseudocode for extracting text from pdf files as:

1. Set Working Directory
2. Loop through each pdf in the directory:
    * open the pdf file
    * extract text
    * check length of text extracted
        * if length is zero: add to problems list
        * otherwise, add to output file
3. Write output file(s)

This process can divide a complicated task into more digestible parts.  You may not know how to open a pdf file or extract text from it, but you'll often have better luck finding existing help online on smaller tasks like these than with your overall goal or project.

**Exercise:**
* Write pseudocode to summarize what's happening in the following code:

In [19]:
random_words=["statement", "toy", "cars", "shoes", "ear", "busy", 
              "magnificent", "brainy", "healthy", "narrow", "join", 
              "decay", "dashing", "river", "gather", "stop", "satisfying", 
              "holistic", "reply", "steady", "event", "house", "amused", 
              "soak", "increase"]

vowels=["a","e","i","o","u","y"]

output=[]

for word in random_words:
    count=0
    for char in word:
        if char in vowels:
            count=count+1
    if count>=3:
        output.append(word)

* Write pseudocode to check an arbitrary list of numbers, `my_numbers`, to find all even numbers and convert them to odd numbers by adding one.  Put the resulting numbers into a new list, `my_numbers2`.  (Recall `for` loops, `if` conditions, and the modulo operator `%` from Python 1.)

# Comments

Recall that Python treats anything following a `#` on a line as a comment and ignores it when running your code.  Comments are a vital part of your code, as they leave notes about how or why you're doing something.  As you gain experience, you'll use comments in different ways.

Comments can also provide a link between pseudocode and real code.  Once you've written your pseudocode, use comments to put the major steps into your code file itself.  Then fill in the gaps with actual code as you figure it out.

Here's a possible answer to the previous exercise.

In [1]:
#1. Get or define the list my_numbers
my_numbers=list(range(100))

#2. Create an empty list for the new all-odd numbers, my_numbers2.

#3. Use a loop to iterate through the list of numbers

    #3a. For a given number check to see if it is even.
    
    #3b. If the number is even, add 1.
    
    #3c. Append the resulting number to the my_numbers2 list.

**Exercise:**

* Use your own pseudocode or the example above as an outline to fill in with Python code.  Test your code with the `my_numbers` object defined above.
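
Once you've tried it yourself, here is one possible way to fill in the outline above.  This is only a sketch; the variable name `number` is our own choice, and there are many other valid solutions.

```python
# 1. Get or define the list my_numbers
my_numbers = list(range(100))

# 2. Create an empty list for the new all-odd numbers
my_numbers2 = []

# 3. Loop through the list of numbers
for number in my_numbers:
    # 3a/3b. If the number is even, add 1
    if number % 2 == 0:
        number = number + 1
    # 3c. Append the resulting number to my_numbers2
    my_numbers2.append(number)

print(my_numbers2[:6])  # [1, 1, 3, 3, 5, 5]
```

Notice that the comments from the pseudocode stay in place, so the finished code documents itself.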

# Reading and Writing Files from Python

## Packages

Packages provide additional tools and functions not present in base Python.  Python includes a number of packages to start with, and others can be installed using `pip install <package name>` and/or `conda install <package name>` commands **in your terminal**. 

Before you can use these functions, you'll need to load them with the `import` statement.

In [4]:
import os #functions for working with your operating system
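
As a small illustration of how imported names are used, the sketch below shows three common forms of `import`.  The alias `osp` and the path `folder/subfolder/day1.txt` are made up for this example; the path is only treated as text, so no such file needs to exist.

```python
import os                      # import the whole package; call functions as os.<name>
import os.path as osp          # import a submodule under a shorter alias
from os.path import basename   # import a single function by name

example_path = "folder/subfolder/day1.txt"  # hypothetical path, used only as a string

# All three calls reach the same function and return the file name
print(os.path.basename(example_path))  # day1.txt
print(osp.basename(example_path))      # day1.txt
print(basename(example_path))          # day1.txt
```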

## Working Directories

To open a file with Python, you'll need to tell Python where the file is located on your computer.  You can specify the entire absolute filepath (starting with C:\ on PC or / on Mac), or you can set a working directory and work with relative file paths.  

If a file is located in your working directory, its relative path is just the name of the file!

**Note:** Windows filepaths use `\`, which Python treats as the start of an escape sequence inside a string.  Modify Windows file paths to use either `/` or `\\` in place of each `\`.
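
The sketch below illustrates the note above with a hypothetical Windows path (the user name `name` is made up).  It also shows two further options that the note does not mention: raw strings (prefixed with `r`) and `os.path.join`.

```python
import os

# A hypothetical Windows path written three ways
escaped = "C:\\Users\\name\\Python_Sales\\Day1.txt"   # each backslash doubled
raw = r"C:\Users\name\Python_Sales\Day1.txt"          # raw string: backslashes kept literally
forward = "C:/Users/name/Python_Sales/Day1.txt"       # forward slashes

print(escaped == raw)   # True: both spell the same text

# A single backslash can silently change the text: "\n" is one newline character
print(len("\n"), len(r"\n"))   # 1 2

# os.path.join builds a path using the separator for the current operating system
print(os.path.join("Python_Sales", "Day1.txt"))
```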

In [2]:
myfile="C:/Users/mtjansen/Dropbox/Python_Sales/day1.txt" #absolute path
os.path.isfile(myfile) #check if Python can find my file 

True

In [3]:
os.chdir("C:/Users/mtjansen/Dropbox/Python_Sales") #set working directory
myfile="day1.txt"
os.path.isfile(myfile)

True

Once we've set a working directory, we can get a list of all files with `os.listdir(".")`.

In [4]:
print(os.listdir("."))
print(os.listdir("C:/Users/mtjansen/Dropbox/Python_Sales")) #alternatively we can specify a folder

['Day1.txt', 'Day2.txt', 'Day3.txt']
['Day1.txt', 'Day2.txt', 'Day3.txt']
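
Because `os.listdir` returns an ordinary list of file names, you can loop over it and filter it like any other list.  The sketch below keeps only the `.txt` files; it builds its own temporary folder with made-up file names so that it runs on any computer.

```python
import os
import tempfile

# Build a throwaway folder with a few empty files so the example runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name in ["Day1.txt", "Day2.txt", "notes.csv"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("")

    # Keep only the names ending in .txt, sorted for a predictable order
    txt_files = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            txt_files.append(name)

print(txt_files)  # ['Day1.txt', 'Day2.txt']
```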


**Exercise:**

Download the zipped data available <a href="http://gis.unc.edu/instruction/Python/Python_Sales/Python_Sales.zip">here</a>. Unzip it somewhere on your computer.

Use `import os` and `os.chdir` to set your working directory to the unzipped folder "Python_Sales".  Use `os.listdir` to check what files are stored there.


## Reading and Writing Files

When you work with a file in Python, you need to open it and then close it when you're done.  If you forget to close a file, it can remain in use, preventing you from opening it later.

Best practices for reading and writing files use the `with` statement to make sure files are automatically closed.

In [5]:
with open('Day1.txt',"r") as txtfile: #"r" indicates that we are reading the textfile and not writing to it
    raw=txtfile.read() #read() retrieves raw text information from the file opened in txtfile
    
print(raw)

19.6
60.6
67.9
76.9
44.6
61.4
39.5
42.7
48
58.9
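
As a quick check that `with` really does close the file for you, the sketch below inspects the file object's `closed` attribute inside and after the block.  The file `example.txt` is a made-up name written to a temporary folder.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, so none of your own files are touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")

with open(path, "w") as txtfile:
    txtfile.write("19.6\n60.6")
    print(txtfile.closed)  # False: still open inside the with block

print(txtfile.closed)      # True: closed automatically once the block ends
```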


In [6]:
rawlist=raw.split("\n") #.split("\n") uses each new line, denoted by "\n" to split the string into a list
print(rawlist)

['19.6', '60.6', '67.9', '76.9', '44.6', '61.4', '39.5', '42.7', '48', '58.9']
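
One thing to watch for: if a file ends with a blank line, `.split("\n")` leaves an empty string at the end of the list, and `float("")` raises an error.  The sketch below uses a made-up string to show the problem and two possible ways around it.

```python
text = "19.6\n60.6\n67.9\n"   # sample text that ends with a newline

print(text.split("\n"))    # ['19.6', '60.6', '67.9', '']
print(text.splitlines())   # ['19.6', '60.6', '67.9']

# Alternatively, skip empty entries before converting to numbers
values = []
for item in text.split("\n"):
    if item.strip():               # an empty string counts as False
        values.append(float(item))
print(values)              # [19.6, 60.6, 67.9]
```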


In [7]:
total=0
for item in rawlist:
    n=float(item) #convert strings to decimal numbers (i.e. floats)
    total=total+n
print(total)

520.1


Let's write a new file with the total.

In [8]:
total=str(total) #we need to convert numerics to strings before writing
with open("Day1_TOTAL.txt","w") as txtfile: #like "r" above, "w" specifies that we're writing to the file
    txtfile.write(total)
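
Note that `.write()` does not add a line break on its own.  To write several values on separate lines, add `"\n"` yourself.  The sketch below uses made-up totals and a hypothetical file name, `totals.txt`, in a temporary folder.

```python
import os
import tempfile

totals = [520.1, 498.3, 611.0]   # made-up totals for illustration

path = os.path.join(tempfile.mkdtemp(), "totals.txt")  # hypothetical output file

with open(path, "w") as txtfile:
    for value in totals:
        txtfile.write(str(value) + "\n")   # write() adds no newline on its own

with open(path, "r") as txtfile:
    print(txtfile.read())
```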

**Exercise:**

* Use a loop to extend the above to get the total for each of the three files, Day1.txt, Day2.txt, and Day3.txt. Create a new file that contains the overall total.  There shouldn't be any sales over 100, so if you find any exclude them!

# Data Objects: continued

Last week, we introduced a number of important data structures in Python: string and numeric types, as well as lists.  We used indexing to specify particular parts of the sequential objects - strings and lists.  

## Dictionaries

Dictionaries provide a "mapping object"; instead of an index, they use named "keys" to organize data.  Looking up a value by its key in a dictionary is also typically faster than searching a list for an item, because dictionaries are built on <a href="https://en.wikipedia.org/wiki/Hash_table">hash tables</a>.  A dictionary is defined as follows:

In [9]:
class_dict = {"course":"Python II","location":"Davis Library","time":"4pm"}
type(class_dict)

dict

In this case, `"course"`, `"location"`, and `"time"` serve as the "keys" for this dictionary.  They index the dictionary, much like the numbers we use to index lists.  We can print a particular value by placing its key in the same square brackets `[]` used by list indices.

In [10]:
print(class_dict["location"])

Davis Library


We can also generate a list of all of the keys for a dictionary using the `.keys()` method. 

In [8]:
print(class_dict.keys())

dict_keys(['course', 'location', 'time'])
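
Dictionaries can also be changed after they are created.  The sketch below shows a few common operations on `class_dict`; the new values (`"Matt Jansen"`, `"5pm"`) and the missing key `"room"` are chosen just for illustration.

```python
class_dict = {"course": "Python II", "location": "Davis Library", "time": "4pm"}

class_dict["instructor"] = "Matt Jansen"   # assigning to a new key adds an entry
class_dict["time"] = "5pm"                 # assigning to an existing key overwrites it

# .get() returns a default instead of raising a KeyError for a missing key
print(class_dict.get("room", "not listed"))   # not listed

# .items() yields key/value pairs for looping
for key, value in class_dict.items():
    print(key, "->", value)
```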


## List and Dictionary Comprehensions

Python provides some shortcuts for generating lists and dictionaries, especially those that you might (now) generate with a loop.  For example, let's generate a list of the square of each number from 1 to 15.

In [6]:
squares=[]
for n in range(1,16):
    squares.append(n**2)
print(squares)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]


Using a "comprehension", we can shorten this to a single line, effectively bringing the loop inside the `[]` used to define the list.

In [7]:
squares=[x**2 for x in range(1,16)]
print(squares)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]


The same general format holds for defining dictionaries.

In [8]:
squaresdict={k:k**2 for k in range(1,16)}
print(squaresdict)

{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100, 11: 121, 12: 144, 13: 169, 14: 196, 15: 225}
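
Comprehensions can also include an `if` condition to keep only some items.  The two examples below are a small sketch of that idea; the word list is borrowed from the pseudocode exercise earlier.

```python
# Squares of only the even numbers from 1 to 15
even_squares = [x**2 for x in range(1, 16) if x % 2 == 0]
print(even_squares)   # [4, 16, 36, 64, 100, 144, 196]

# Map each word to its length
word_lengths = {word: len(word) for word in ["toy", "river", "holistic"]}
print(word_lengths)   # {'toy': 3, 'river': 5, 'holistic': 8}
```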


# Defining your own Functions

While Python (and its available packages) provides a wide variety of functions, sometimes it's useful to create your own.  The example below shows Python's syntax for defining a function, here one that returns the mean of a list of numbers.  (Python has no built-in `mean` function, although the standard library's `statistics` module provides one.)

In [11]:
def mean(number_list):
    s=sum(number_list)
    n=len(number_list)
    m=s/n
    return m

numbers=list(range(1,51))
print(mean(numbers))

25.5
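
A good habit when writing your own function is to check it against a trusted implementation.  As a sketch, the code below compares a compact version of `mean` with `statistics.mean` from Python's standard library; the docstring on the first line of the function is an optional description that `help(mean)` will display.

```python
import statistics

def mean(number_list):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(number_list) / len(number_list)

numbers = list(range(1, 51))
print(mean(numbers))             # 25.5
print(statistics.mean(numbers))  # 25.5
```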


**Exercise:**
Choose one of the following (or both if you're feeling ambitious!):
* Define a function, `median`, to find the median of a list.  The median is the middle number of a sorted list with an odd number of items, or the average of the middle two numbers when the list has an even number of items.  (Hint: Use `sorted(<your_list>)` to create a list sorted from low to high values.)

* Define your own function, `variance` to calculate the (population) variance, of a list of numbers:

$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i-\mu)^2 $$

Where $N$ is the length of the list, $x_1, x_2, \ldots, x_N$ are the values in the list, and $\mu$ is the mean of the list (you can re-use the `mean` function above).  


* Test your function(s) with the lists below:


In [18]:
data1 = list(range(1,100))

#Normally Distributed Data:
from numpy.random import normal
data2 = normal(loc=0,scale=2,size=100) #scale=2 defines the standard deviation as 2
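
If you attempt the `variance` exercise, a small hand-worked case can help you check your function before trying the larger lists.  The list $[1, 2, 3, 4]$ is our own example: here $N = 4$ and $\mu = 2.5$, so the population variance formula gives

$$ \sigma^2 = \frac{1}{4} \left[ (1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2 \right] = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = 1.25 $$

For the same list, the median is the average of the two middle values, $(2+3)/2 = 2.5$.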

# Try / Except - Making your code robust

Errors and warnings are a regular occurrence in coding, and an important part of the learning process.  In some cases, they can also be useful in designing an algorithm.  For example, suppose we have a stream of user entered data that is supposed to contain the user's age in years.  You might expect to get a few errors or nonsense entries.

In [10]:
user_ages=["34","27","54","19","giraffe","15","83","61","43","91","sixteen"]

It would be useful to convert these values to a numeric type to get the average age of our users, but we want to build something that can set non-numeric values aside.  We can attempt to convert to numeric and give Python instructions for errors with a `try`-`except` statement:

In [3]:
ages=[]
problems=[]

for age in user_ages:
    try:
        a=int(age)
        ages.append(a)
    except ValueError: #int() raises a ValueError for text it can't convert
        problems.append(age)
        
print(ages)
print(problems)

[34, 27, 54, 19, 15, 83, 61, 43, 91]
['giraffe', 'sixteen']
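
A `try` statement can also take an `else` block, which runs only when no error occurred.  The sketch below is a variation on the loop above that names the specific error `int()` raises (`ValueError`) and then computes the average age from the clean values.

```python
user_ages = ["34", "27", "54", "19", "giraffe", "15", "83", "61", "43", "91", "sixteen"]

ages = []
problems = []

for age in user_ages:
    try:
        a = int(age)
    except ValueError:          # raised by int() for text that isn't a whole number
        problems.append(age)
    else:                       # runs only when no exception was raised
        ages.append(a)

average_age = sum(ages) / len(ages)
print(round(average_age, 1))    # 47.4
print(problems)                 # ['giraffe', 'sixteen']
```

Naming the error type means that unrelated mistakes (a misspelled variable name, for example) still produce a visible error instead of being silently added to `problems`.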


# Useful Packages

Some of these packages may NOT be included in your Anaconda installation.  Whenever you need to install a package, use the Anaconda Prompt window, **NOT Python itself**.  On Windows, the Anaconda Prompt can be reached through the Start Menu folder for Anaconda.  On a Mac, right-click the Python 3 entry in the Environments tab of Anaconda Navigator and open a terminal.

Packages known to Anaconda can be installed with the `conda install <package name>` command in your Anaconda Prompt window.  Otherwise you may need to use a different package manager such as `pip`, with `pip install <package name>`.

<a href="https://conda.io/docs/user-guide/tasks/manage-pkgs.html">More information about managing packages in Python is available here.</a>


## Data Analysis

### Numpy and Scipy

<a href="http://www.numpy.org/">`numpy`</a> provides the mathematical functionality (e.g. large arrays, linear algebra, random numbers, etc.) for many popular statistical and machine learning tasks in Python.  This is a dependency for many of the packages we discuss below.  <a href="https://docs.scipy.org/doc/scipy/reference/">`scipy`</a> adds an array of mathematical and statistical functions that work with `numpy` objects. 

### Pandas 

The <a href="https://pandas.pydata.org/">`pandas` package</a> provides high-level data manipulation functionality, similar to that found by default in R.  That means new objects like data frames and time series, as well as new functions to manage missing values, merge, and/or reshape datasets.

Download the csv file <a href="http://gis.unc.edu/instruction/Python/CountyHealthData_2014-2015.csv"> 
CountyHealthData_2014-2015.csv</a>.

In [7]:
import pandas as pd #pd shortens the name to make it easier to call functions from pandas
os.chdir("C:/Users/mtjansen/Dropbox/Python_Spring18")

df = pd.read_csv("CountyHealthData_2014-2015.csv")

df.head()

| | State | Region | County | Year | Premature death | Poor physical health days | Poor mental health days | Adult obesity | Food environment index | Physical inactivity | Excessive drinking | Diabetes | Median household income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AK | West | Aleutians West Census Area | 1/1/2014 | | 2.1 | 2.1 | 0.3 | 7.002 | 0.234 | 0.266 | 0.067 | 69192 |
| 1 | AK | West | Aleutians West Census Area | 1/1/2015 | | 2.1 | 2.1 | 0.329 | 6.6 | 0.22 | 0.266 | 0.065 | 74088 |
| 2 | AK | West | Anchorage Borough | 1/1/2014 | 6827.0 | 3.3 | 3.0 | 0.257 | 8.185 | 0.205 | 0.185 | 0.07 | 71094 |
| 3 | AK | West | Anchorage Borough | 1/1/2015 | 6856.0 | 3.3 | 3.0 | 0.268 | 8.0 | 0.18 | 0.185 | 0.07 | 76362 |
| 4 | AK | West | Bethel Census Area | 1/1/2014 | 13345.0 | 3.3 | 2.6 | 0.315 | 3.21 | 0.283 | 0.171 | 0.067 | 41722 |


In [10]:
df.groupby("Region").mean(numeric_only=True) #average only the numeric columns within each region

| Region | Premature death | Poor physical health days | Poor mental health days | Adult obesity | Food environment index | Physical inactivity | Excessive drinking | Diabetes | Median household income |
|---|---|---|---|---|---|---|---|---|---|
| Midwest | 7183.902799 | 3.338256 | 3.165094 | 0.311058 | 7.69506 | 0.27122 | 0.190614 | 0.10246 | 47277.853288 |
| Northeast | 6233.209677 | 3.535047 | 3.514623 | 0.277198 | 8.197622 | 0.240961 | 0.176693 | 0.097512 | 53695.292627 |
| South | 9116.940754 | 4.336586 | 3.971896 | 0.322495 | 6.894375 | 0.300112 | 0.132786 | 0.122163 | 41787.915448 |
| West | 7293.75625 | 3.658853 | 3.361783 | 0.259058 | 6.927634 | 0.211495 | 0.16751 | 0.084556 | 48647.646283 |


**Learn more:**
* `pandas` provides a quick introduction <a href="https://pandas.pydata.org/pandas-docs/stable/10min.html">here</a>
* <a href="https://jakevdp.github.io/PythonDataScienceHandbook/">Python Data Science Handbook</a> provides more detail and integration with other software.

### Matplotlib and Visualization

<a href="https://matplotlib.org/">`matplotlib`</a> is a commonly used data visualization package for Python, oriented towards static, scientific plotting.  There are a number of other packages for visualization including:

* `seaborn` provides aesthetic extensions to matplotlib
* `ggplot` - a Python version of the popular ggplot2 package for R
* `Bokeh` and `Plotly` help create interactive web visualizations

## Other Areas

### BeautifulSoup (for parsing HTML or XML data)

Python's built-in `urllib.request` module makes it relatively easy to download the underlying HTML from a web page, which a package like BeautifulSoup can then parse. Note that the `from package import function` notation used here allows you to selectively import only parts of a package as needed.

In [4]:
from urllib.request import urlopen
page = urlopen("http://gis.unc.edu/instruction/Python/Python_1_S18.html")  #The Python 1 materials!
html = page.read()
print(html[:1000]) #print only the first 1000 bytes (page.read() returns bytes, not a string)

b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8" />\n<title>Python_1_S18</title><script src="https://cdnjs.cloudflare.com/ajax/libs/require.js/2.1.10/require.min.js"></script>\n<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>\n\n<style type="text/css">\n    /*!\n*\n* Twitter Bootstrap\n*\n*/\n/*!\n * Bootstrap v3.3.7 (http://getbootstrap.com)\n * Copyright 2011-2016 Twitter, Inc.\n * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE)\n */\n/*! normalize.css v3.0.3 | MIT License | github.com/necolas/normalize.css */\nhtml {\n  font-family: sans-serif;\n  -ms-text-size-adjust: 100%;\n  -webkit-text-size-adjust: 100%;\n}\nbody {\n  margin: 0;\n}\narticle,\naside,\ndetails,\nfigcaption,\nfigure,\nfooter,\nheader,\nhgroup,\nmain,\nmenu,\nnav,\nsection,\nsummary {\n  display: block;\n}\naudio,\ncanvas,\nprogress,\nvideo {\n  display: inline-block;\n  vertical-align: baseline;\n}\naudio:not([controls]) {\n  display: none;\
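
The `b'...'` prefix in the output shows that `page.read()` returns bytes rather than a string.  The sketch below decodes a short stand-in for those bytes (so it runs without an internet connection) and pulls out the page title using `html.parser` from Python's standard library.  The `TitleParser` class is a hypothetical helper written for this example; it is not part of BeautifulSoup, which offers a more convenient interface for the same kind of task.

```python
from html.parser import HTMLParser

# A short stand-in for the bytes returned by page.read() (no network needed)
html = b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8" />\n<title>Python_1_S18</title></head>\n<body></body>\n</html>'

text = html.decode("utf-8")   # convert bytes to a string
print(type(html).__name__, type(text).__name__)   # bytes str

class TitleParser(HTMLParser):
    """Hypothetical helper that collects the text inside <title>...</title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(text)
print(parser.title)   # Python_1_S18
```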

**Note:** Always check a site's Terms of Service before scraping it.  Some sites explicitly prohibit web scraping of their data.

### NLTK (text analysis)

<a href="http://www.nltk.org/">The Natural Language Toolkit (`nltk`)</a> provides a wide array of tools for processing and analyzing text.  This includes operations like splitting text into sentences or words ("tokenization"), tagging them with their part of speech, classification, and more.

## Jupyter Notebooks

Sooner or later, you'll want to share your code or projects with other people (even if only future-you!).  <a href="http://jupyter.org/">Jupyter notebooks</a>, included with Anaconda, provide integration between code, its output, images, and formatted text beyond what's possible with in-code comments.  The materials for these workshops were created in Jupyter notebooks.

## Feedback

Please fill out our feedback form available here: http://bit.ly/hubSpring2018

We'd love your input on future workshop topics and ways we could improve this workshop next time we teach it!

# Other Resources:

* <a href="https://jakevdp.github.io/PythonDataScienceHandbook/">Python Data Science Handbook</a>: This free ebook emphasizes Numpy, Scipy, Matplotlib, Pandas and other data analysis packages in Python, assuming some familiarity with the basic principles of the language.

