## IPython Bootcamp

We will start writing some code! This bootcamp is designed to be a crash-course to get you up to speed on Python programming. The goal of this bootcamp is to write your own script 'bootcamp.py'. Running your script should print the answers to each of the questions in part 4 of the tutorial, separated by a single blank line. 

If you have any difficulty navigating the IPython interface, there's some documentation [here][IPython documentation link] that provides a great platform to get started. 

[IPython documentation link]: https://ipython.org/ipython-doc/3/notebook/notebook.html

### 1. The basics: variables and data structures

Python has the basic variable types you are used to: strings, ints, floats. Unlike Java and many other languages, variables are not statically typed, so you never declare a type. You simply declare a variable by assigning a value to it. Later, you can assign a value of a different type to that same variable and Python couldn’t care less.

Use the cells below to play with variable assignment and reassignment:

In [1]:
# You can comment with pound sign. 

# Running this cell won't produce any output, because it's all comments. 
# Go ahead and try it by selecting the Cell -> Run dropdown in the above toolbar. 

For ease of use, you can comment out multiple lines by selecting all you wish to comment and pressing <kbd>Control</kbd> + <kbd>/</kbd>. 

In [2]:
# run me! You will see which version of Python you are using
import sys
print(sys.version)

3.4.0 (v3.4.0:04f714765c13, Mar 15 2014, 23:02:41) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]


In [3]:
# run me! You're expecting a 2 to print out. 
x = 2
x

2

In [4]:
y = "hello world"
y

'hello world'

In [5]:
# run me! I remember the results of cells above me that have already run, which is pretty handy. 
# The variables above me are still assigned. 
x = y #notice the lack of whining about "incompatible types"...
x

'hello world'

This also means that you can mix variable types within a data structure. There is no need to specify that L is a list of ints or that M is a map from strings to floats.

Lists are declared with square brackets and indexed using square bracket notation. They can also be treated as stacks, if you are into that sort of thing.

Create a list of ints. Then, in order to drive those Scala people insane, start appending strings to it. Play with indexing and slicing. In Python, you can use the colon notation to pull out slices of a list. E.g. lst[i:j] will give you a new list which includes the ith through the (j-1)th elements of lst.

In [6]:
l = [1, 2, 3]
for elem in l: #We'll talk about loops more in a bit
    print (elem)

1
2
3


In [7]:
l.append("i am a string. mwahahaha.") 
for elem in l: 
    print (elem)

1
2
3
i am a string. mwahahaha.


In [8]:
print (l[2]) #should be 3
print (l[1:3]) #should be [2, 3]
l += ['here is more stuff', 6, [2,3,4], 5*1367] 
print (l)
#expecting [1, 2, 3, 'i am a string. mwahahaha.', 'here is more stuff', 6, [2, 3, 4], 6835]
print (l.pop()) #should be 6835. Run me to check! 

3
[2, 3]
[1, 2, 3, 'i am a string. mwahahaha.', 'here is more stuff', 6, [2, 3, 4], 6835]
6835
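
Here is a small sketch of the slicing and stack behavior described above. The list `nums` and the other variable names are made up for this example and are not part of the tutorial's cells.

```python
# Hypothetical example list; `nums` is not from the tutorial.
nums = [10, 20, 30, 40, 50]

first_two = nums[:2]      # an omitted start defaults to 0
middle = nums[1:4]        # elements at indices 1, 2, and 3
last_two = nums[-2:]      # negative indices count from the end
copy_of_nums = nums[:]    # a shallow copy of the whole list

# Treating a list as a stack: append() pushes, pop() removes the last item.
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()

print(first_two, middle, last_two, top, stack)
```

Note that a slice always gives you a new list, so changing `copy_of_nums` leaves `nums` alone.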


Dictionaries (or maps or associative arrays) are probably Python's favorite data structure. They are a simple key/value store, again with almost no restrictions on which data types are the keys or values: values can be anything, and keys only need to be hashable (strings, numbers, and tuples work, but lists do not). You can declare dictionaries with curly braces and associate or retrieve keys and values using square bracket notation.

In [9]:
d = {"give me an A!" : "B", "give me a P!" : 7, "give me a Q!" : "no."}
print (d["give me a P!"])
# 7
d[14] = 12
print (d)
#{'give me a Q!': 'no.', 'give me an A!': 'B', 14: 12, 'give me a P!': 7} # the new k/v pair was added to d

7
{'give me a P!': 7, 'give me a Q!': 'no.', 14: 12, 'give me an A!': 'B'}
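
A few more dictionary operations you will probably want, sketched on a made-up dictionary (`ratings` and its contents are hypothetical, not from the data set):

```python
# Hypothetical dictionary; the keys and values are made up for illustration.
ratings = {"merlot": 4, "riesling": 5}

ratings["syrah"] = 3                   # add a new key/value pair
has_merlot = "merlot" in ratings       # membership tests look at the keys
missing = ratings.get("malbec", 0)     # get() returns a default instead of raising KeyError

# items() yields (key, value) pairs; sorted() puts them in key order
pairs = sorted(ratings.items())

# Keys must be hashable: a tuple works, a list does not.
ratings[("pinot", "noir")] = 4
try:
    ratings[["pinot", "grigio"]] = 2
except TypeError as err:
    error_message = str(err)

print(has_merlot, missing, pairs)
```

Using `get()` with a default is handy when you are counting things and a key may not exist yet.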


### 2. Control structures and functions
Python makes it easy to write bad code. But it makes it very hard to write ugly code. So chalk one up for superficiality. Python uses whitespace to denote control structures, like loops and if/else blocks. By convention, you should use four spaces for each level of indentation. 

In [10]:
print (l)
# expecting [1, 2, 3, 'i am a string. mwahahaha.', 'here is more stuff', 6, [2, 3, 4]]
# Here is a for loop. 
for elem in l:
    print (elem)

# Here is a while loop
i = 0
while i < 5 : 
    if i % 2 == 0 : 
        print ("even")
    else : 
        print ("odd")
    i += 1

# Let's write our very own function!
# No types are required for parameters, so commenting is SO important. So important.
# Returns the idx element of a list

# lst - the list 
# idx - the integer index of the element to return

def get_list_element(lst, idx) : 
    return lst[idx]

print (get_list_element(l, 0))
get_list_element(l, 5) #expecting a 6

[1, 2, 3, 'i am a string. mwahahaha.', 'here is more stuff', 6, [2, 3, 4]]
1
2
3
i am a string. mwahahaha.
here is more stuff
6
[2, 3, 4]
even
odd
even
odd
even
1


6
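
Since parameters carry no types, one common way to do the commenting is a docstring, which `help()` can display. Below is a minimal sketch; the function name `get_list_element_or_default` and its `default` parameter are hypothetical additions for illustration, not part of the tutorial's function.

```python
def get_list_element_or_default(lst, idx, default=None):
    """Return lst[idx], or `default` if idx is out of range.

    lst     -- the list to index into
    idx     -- the integer index of the element to return
    default -- hypothetical extra parameter, returned when idx is invalid
    """
    try:
        return lst[idx]
    except IndexError:
        return default


print(get_list_element_or_default([1, 2, 3], 0))
print(get_list_element_or_default([1, 2, 3], 10, "n/a"))
```

Running `help(get_list_element_or_default)` in a cell prints the docstring.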

### 3. File IO
You can open, read, and write files using the aptly-named open(), read(), and write() functions. read() returns the entire contents of the file as a string. readlines() returns the lines as a list, with each line keeping its trailing newline character, which is generally nicer for allowing you to iterate line-by-line. We won’t go through an example here, but we highly recommend playing with the csv module, which is incredibly useful and which we will likely use regularly throughout the semester.

In [11]:
# "with" closes the file automatically when the block ends
with open('test.txt', 'w') as f:
    for s in ['line1', 'line2', 'line3', 'line4']:
        f.write(s + '\n')
with open('test.txt') as f:
    contents = f.read()
print (contents)
#'line1\nline2\nline3\nline4\n'
with open('test.txt') as f:
    contents = f.readlines()
contents
#['line1\n', 'line2\n', 'line3\n', 'line4\n']

line1
line2
line3
line4



['line1\n', 'line2\n', 'line3\n', 'line4\n']
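
If you want a head start on the csv module mentioned above, here is a minimal sketch that writes and reads back a tab-separated file. The rows, the file name `reviews.tsv`, and the temporary directory are all made up for this example.

```python
import csv
import os
import tempfile

# Made-up rows; every field is a string, because csv reads everything back as text.
rows = [["great wine", "5"], ["too sweet", "2"]]
path = os.path.join(tempfile.mkdtemp(), "reviews.tsv")

# newline="" lets the csv module handle line endings itself
with open(path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)

with open(path, newline="") as f:
    loaded = list(csv.reader(f, delimiter="\t"))

print(loaded)
```

Each row comes back as a list of strings, so convert fields yourself if you need numbers.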

### 4. Text processing in Python
For this part, you will need to submit your code to answer the following questions.

We will be playing with a small but oh so wonderful data set of wine reviews!

The cell below will download and unpack the files. You can modify them in subsequent cells by referring to their path names, data/stopwords.txt and data/wine.txt. 

In [29]:
import urllib.request
import shutil
import zipfile
import os

url = 'http://socialmedia-class.org/assignments/python-bootcamp/data.zip'
file_name = 'data.zip'
# Download the file from `url` and save it locally under `file_name`:
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# After running the following command, you should see a new directory './data/' on your hard drive 
with zipfile.ZipFile(file_name, 'r') as z:
    z.extractall()

wine.txt is in the format of one review per line, followed by a star rating between 1 and 5 (except for 3 reviews, where the reviewer decided to go rogue and give 6 stars. Pft.) The text of the review and the star rating are separated by a single tab character. There is also a file called stopwords.txt. You will use this in question 6.
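
As a rough sketch of how you might split each line into its review text and rating, here is an example on made-up lines. The sample reviews are invented, and the real data/wine.txt may write its ratings differently than shown here, so the rating is kept as a raw string rather than converted to a number.

```python
# Made-up sample lines in the "review<TAB>rating" layout described above;
# the real data/wine.txt may write its ratings differently.
sample_lines = [
    "Lovely fruit and a long finish\t5\n",
    "Thin and a bit sour\t2\n",
]

reviews = []
for line in sample_lines:
    # rstrip("\n") drops the newline; rpartition splits on the LAST tab
    text, _, rating = line.rstrip("\n").rpartition("\t")
    reviews.append((text, rating))

print(reviews)
```

Splitting on the last tab is a defensive choice in case a review happens to contain a tab of its own.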

Write a Python script that answers each of the following questions and prints the answer to standard output. Since this is a tutorial, there are no secrets: your script should produce [this output][target output] when you are done. We will compare the output of your script directly to this answer key, so start early and come ask for help if you get stuck! We highly recommend looking into the functions available in the [Python string module][string module].

1. What is the distribution over star ratings?
2. What are the 10 most common words used across all of the reviews, and how many times is each used?
3. How many times does the word ‘a’ appear?
4. How many times does the word ‘fruit’ appear?
5. How many times does the word ‘mineral’ appear?
6. Common words (like ‘a’) are not as interesting as uncommon words (like ‘mineral’). In natural language processing, we call these common words “stop words” and often remove them before we process text. stopwords.txt gives you a list of some very common words. Remove these stopwords from your reviews. Also, try converting all the words to lower case (since we probably don’t want to count ‘fruit’ and ‘Fruit’ as two different words). Now what are the 10 most common words across all of the reviews, and how many times is each used?
7. You should continue to use the preprocessed reviews for the following questions (lower-cased, no stopwords). What are the 10 most used words among the 5 star reviews, and how many times is each used?
8. What are the 10 most used words among the 1 star reviews, and how many times is each used?
9. Gather two sets of reviews: 1) Those that use the word “red” and 2) those that use the word “white”. What are the 10 most frequent words in the “red” reviews which do NOT appear in the “white” reviews?
10. What are the 10 most frequent words in the “white” reviews which do NOT appear in the “red” reviews?

[string module]: https://docs.python.org/2/library/string.html
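
Most of the questions above boil down to counting words, so here is one possible sketch using `collections.Counter` on made-up data. The sample reviews and stop words are invented, and splitting on whitespace is an assumption; the answer key may tokenize words differently, so check your counts against it.

```python
from collections import Counter

# Made-up reviews and stop words, standing in for the real data files.
sample_reviews = ["A lovely fruit nose", "Fruit and mineral notes", "a bit of mineral"]
sample_stopwords = {"a", "and", "of"}

counts = Counter()
for review in sample_reviews:
    # lower() folds 'Fruit' and 'fruit' together; split() breaks on whitespace
    for word in review.lower().split():
        if word not in sample_stopwords:
            counts[word] += 1

# most_common(n) returns the n highest-count (word, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

For the "red" versus "white" questions, Python sets and their difference operator (`-`) are worth a look.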


In [30]:
#You can experiment with the below bodies of text and write Python code below. Running the cell will run your code. 
wine = open('data/wine.txt').read()
stopwords = open('data/stopwords.txt').read()

#print (wine)
print (stopwords)

# Write here! 

i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
's
't
can
will
just
don
should
now



That's it! Again, you can compare your answers against [our key][target output] to see if you have done things correctly.

Your code is due Friday, January 22nd, before class. Please submit it via [turnin][turnin instructions] from the eniac machines. You can do so even from home by [copying the file][scp syntax] onto an eniac machine, then [sshing into an eniac machine][ssh instructions] and running turnin from there. 

[target output]: http://crowdsourcing-class.org/assignments/downloads/python-bootcamp/bootcamp-key.txt
[turnin instructions]: https://alliance.seas.upenn.edu/~cis520/wiki/index.php?n=Resources.HomeworkSubmission
[scp syntax]: http://www.hypexr.org/linux_scp_help.php
[ssh instructions]: http://www.seas.upenn.edu/cets/answers/remote.html
