# Programming in Python

This notebook offers an introduction to programming in the Python language. It's impossible to cover it all in a single notebook (or a single class!); however, this notebook highlights core aspects of Python that are important for this class. I highly recommend the (free and online!) book <a href="https://python.swaroopch.com/"><i>A Byte of Python</i></a> if you would like to further study the ideas outlined in this notebook.

## Hello world!

As is customary when learning a new programming language, we can start with a hello world program:

In [1]:
print("hello world!")

hello world!


We can also use single quotes to specify a string:

In [2]:
print('hello world!')

hello world!


## Comments

It is absolutely essential to comment your code when writing a program in any language and this is no different for Python. You can easily add inline and multi-line comments in Python. Consider the following inline comments:

In [3]:
# You can put a comment on a newline
print('I love you Python.') # You can also put a comment here

I love you Python.


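The Python interpreter ignores everything after the hash symbol. Python has no dedicated multi-line comment syntax, but a string wrapped in 3 consecutive quotation marks (either double or single quotes) can span several lines. Such strings are often used as multi-line comments and, as we will see later, to document functions: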

In [4]:
print('''Here is one of my all time favorite Trump
tweets on climate change.''')

#print('The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.')

Here is one of my all time favorite Trump
tweets on climate change.


For more on best practices regarding commenting code, see <a href="https://google.github.io/styleguide/pyguide.html#Comments">Google's Python style guide</a>. I will illustrate (you need to hold me to this!) these best practices throughout the course. In short, as outlined in the book <i>A Byte of Python</i>, use as many useful comments as you can in your program to:
* explain assumptions
* explain important decisions
* explain important details
* explain problems you're trying to solve
* explain problems you're trying to overcome in your program, etc.

<a href="https://blog.codinghorror.com/code-tells-you-how-comments-tell-you-why/">Code tells you how, comments should tell you why</a>.

## String methods and concatenation

Strings -- sequences of characters -- are obviously important for a text as data course. We have already seen how to specify a string in Python, using both single and double quotation marks. (<b>Note</b>: I tend to use single quotes, as this is easier on my keyboard. You can use whichever you like best; however, be consistent.) Strings also have a number of "methods" that will prove useful throughout the term. For instance, say that we want to convert a string to all lowercase:

In [5]:
print('University of Exeter'.lower())

university of exeter


Or all uppercase letters,

In [6]:
'University of Exeter'.upper()

'UNIVERSITY OF EXETER'

Another method that we will rely on heavily throughout the course is the <b><span style="color:green">split()</span></b> method,

In [14]:
'University of Exeter'.split(' ')

['University', 'of', 'Exeter']

In [None]:
len('University of Exeter'.split(' '))

Here, we "split" the string based on space (i.e., the ' '). There are a bunch of other string methods (see this <a href="https://www.shortcutfoo.com/app/dojos/python-strings/cheatsheet">cheatsheet</a> for more information) and we will use several of these methods throughout the course. We will often want to combine (or concatenate) strings together.

In [15]:
# This code illustrates one way to concatenate a string
'data' + 'science'

'datascience'

In [16]:
'data' + ' ' + 'science'

'data science'

In [17]:
'data ' + 'science'

'data science'

You can also use "f-strings" (formatted string literals, available in Python 3.6 and later) to insert variables into a string.

In [18]:
term1 = 'scientist'
term2 = 'sexy'
print(f'Data {term1} is the new {term2} job.')

Data scientist is the new sexy job.
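
A few other string methods come up often when cleaning text. The following is a small sketch rather than a complete tour; the example string `raw` is made up for illustration.

```python
# Hypothetical example string with stray whitespace
raw = '  Text as Data  '

cleaned = raw.strip()                      # remove leading/trailing whitespace
tokens = cleaned.lower().split(' ')        # lowercase, then split on spaces
rejoined = '-'.join(tokens)                # glue the tokens back together
swapped = cleaned.replace('Data', 'Code')  # swap one substring for another

print(cleaned)
print(tokens)
print(rejoined)
print(swapped)
```

Note that `join` is called on the separator string (here `'-'`) and takes the list of strings as its argument.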


## Numbers

Here is the description of numbers in <i>A Byte of Python</i>:

"Numbers are mainly of two types -- integers and floats.
An example of an integer is 2 which is just a whole number.
Examples of floating point numbers (or floats for short) are 3.23 and 52.3E-4. The E notation indicates powers of 10. In this case, 52.3E-4 means 52.3 * 10^-4."

That pretty much sums it up!
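
To make the distinction concrete, here is a minimal sketch that checks the type of an integer and of a float written in E notation. The variable names `whole` and `small` are chosen only for this example.

```python
whole = 2          # an integer
small = 52.3E-4    # a float written in E notation

print(type(whole))
print(type(small))

# 52.3E-4 is just another way of writing 0.00523
print(small == 0.00523)
```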

## Variables and operators

Often, we want to store numbers and strings in variables and perform "operations" on those <b>variables</b>.

In [19]:
# Assigning variables is easy in Python
a = 'data'
b = 'science'

# And we can 'do things' with these variables
a + b

'datascience'

In [20]:
# Works the same way for numbers
c = 2
d = 4

# And we can add these variables
c + d

6

In [21]:
# We can also assign a new variable based on an operation
x = c + d
print(x)

6


Be careful, however, when trying to mix types:

In [23]:
# Try to concatenate a string and an integer
a + c

TypeError: can only concatenate str (not "int") to str

In [None]:
# Instead, we need to perform the operation using consistent types
a + str(c)

### Operators

Python includes all of the arithmetic (for integers and floats), relational, and logical operators that you will need (<a href="https://www.tutorialspoint.com/python/python_basic_operators.htm">click here for a complete list of operators</a>). Let's look at the main <b>arithmetic</b> operators.

In [None]:
3 + 5 # addition

In [None]:
3 - 5 # subtraction

In [None]:
3 * 5 # multiplication

In [None]:
3 / 5 # division

There are a number of other arithmetic operators that we could run into throughout the term, such as:

* Power: ``` 5 ** 3 ``` outputs ``` 125 ```.
* Modulo: ``` 100 % 10 ``` outputs ```0```.
* And so on and so forth (again, see <a href="https://www.tutorialspoint.com/python/python_basic_operators.htm">here</a> for more info).
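
The operators in the list above can be tried directly. This short sketch also includes floor division (`//`), which is not mentioned above but is a standard Python operator; the numbers are arbitrary.

```python
power = 5 ** 3        # power
remainder = 17 % 5    # modulo: the remainder after dividing 17 by 5
floored = 17 // 5     # floor division: divide, then round down
quotient = 17 / 5     # true division always returns a float

print(power)
print(remainder)
print(floored)
print(quotient)
```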

We will also often make use of <b>relational</b> operators. For instance, the relational "equals" operator is important for testing the equality between two objects:

In [25]:
a = 2
b = 3

# Are a and b equal?
a == b

False

In [None]:
# And if we re-assign variable a to 3?
a = 3
a == b

Here are the other relational operators that we will use:

* `!=` (not equal to)
* `<`  (less than)
* `>`  (greater than)
* `<=` (less than or equal to)
* `>=` (greater than or equal to)
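
As a quick illustration of the relational operators listed above, the sketch below compares two small numbers. The variable names and values are arbitrary.

```python
c = 2
d = 4

not_equal = c != d          # 2 is not equal to 4
less = c < d                # 2 is less than 4
greater = c > d             # 2 is not greater than 4
less_or_equal = c <= 2      # "less than or equal to" includes equality
greater_or_equal = d >= 5   # 4 is neither greater than nor equal to 5

print(not_equal, less, greater, less_or_equal, greater_or_equal)
```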

Finally, Python also provides a set of <b>logical</b> and <b>membership</b> operators:

* `and` (boolean AND)
* `or`  (boolean OR)
* `not` (boolean NOT)
* `in` (membership)

So, for instance,


In [26]:
# Are a and b both equal to 3?
a == 3 and b == 3

True

In [30]:
tokens = 'University of Exeter'.split(' ')

In [32]:
'Exeter' in tokens

True

In [34]:
'exeter' in 'University of Exeter'.lower()

True

In [None]:
'Travis' in 'University of Exeter'.split(' ')
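
One subtlety worth seeing in code is that comparisons such as `==` are evaluated before `and`, `or`, and `not`. The sketch below uses the hypothetical variables `p` and `q` to show how this can lead to a misleading test.

```python
p = 3
q = 3

both_three = p == 3 and q == 3   # each side is a full comparison
either_big = p > 5 or q > 2      # only one side needs to be True
not_equal = not p == q           # read as not (p == q)

# `p and q == 3` is read as `p and (q == 3)`, so p is never compared with 3
p = 7
misleading = bool(p and q == 3)

print(both_three, either_big, not_equal, misleading)
```

The last result is still `True` even though `p` is 7, which is why it is safer to write out each comparison in full.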

We will also occasionally use the following <b>assignment</b> operator to increment a counter (more on this when we get to "loops"),

In [38]:
# Assignment for i
i = 0
print(i)

# Increment i by 1
i += 1

print(i)

0
1


## Control flow

For simple programs -- such as those outlined in the code above -- executing code from top to bottom works just fine. However, for everything else, we will need a bit more control. This is where <b>control flow</b> statements come in handy. In this section, we will introduce Python's three control flow statements: `if`, `for`, and `while`.

### The `if` statement

The value of the various logical and relational operators outlined above really comes into focus when combined with the `if` statement in Python. Let's take a look at several examples.

<b>Example 1</b>: Simple if/else statement. Check whether our Trump tweet includes the phrase "global warming."

In [45]:
# Initialize our program
keyword = 'global warming'
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

if keyword in tweet:
    print(f'Found {keyword} in tweet.')
else:
    print(f'Could not find {keyword} in tweet')

Found global warming in tweet.


<b>Example 2</b>: Nested if/else statements. First, check if the string has 140 or fewer characters (i.e., consistent with Twitter limits). If this is true, check whether our Trump tweet includes the phrase "global warming."

In [48]:
# Initialize our program
keyword = 'global warming'
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

if len(tweet) <= 140:
    if keyword in tweet:
        print(f'Found {keyword} in tweet')
    else:
        print(f'Could not find {keyword} in tweet')
else:
    print('Not a tweet!')

Found global warming in tweet
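
When there are more than two cases to distinguish, Python's `elif` ("else if") keeps the code flatter than nesting. The sketch below is only an illustration: the length categories and the variable `size` are made up for this example.

```python
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

# Hypothetical length categories, chosen only for illustration
if len(tweet) <= 50:
    size = 'short'
elif len(tweet) <= 140:
    size = 'medium'
else:
    size = 'long'

print(size)
```

Python checks the conditions from top to bottom and runs only the first block whose condition is true.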


### The `for` loop

We often want to make repeated calculations and this is where the idea of a "loop" comes in. Let's start by taking a look at a `for` loop, which allows you to <i>iterate over a sequence of objects</i>.

In [52]:
# Split (or tokenize) the Trump tweet into words
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

# Iterate over the sequence of words and print
for word in words:
    print(word.lower())

the
concept
of
global
warming
was
created
by
and
for
the
chinese
in
order
to
make
u.s.
manufacturing
non-competitive.


In [51]:
len(tweet)

117

The variable "word" holds each object in the sequence, one at a time. Note that you can name this anything you want (e.g., 'travis' or 'token' or whatever).

In [53]:
for trav in words:
    print(trav.lower())

the
concept
of
global
warming
was
created
by
and
for
the
chinese
in
order
to
make
u.s.
manufacturing
non-competitive.


As another example, say that we wanted to iterate over a sequence of numbers, such as the position (0, 1, 2, ...) of each word in our list. How can we do this in Python?

In [58]:
# The range() function creates the "sequence of objects" to iterate over.
# By default, range() iterates from 0 in increments of 1.
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

for i in range(len(words)):
    print(i, words[i])

0 The
1 concept
2 of
3 global
4 warming
5 was
6 created
7 by
8 and
9 for
10 the
11 Chinese
12 in
13 order
14 to
15 make
16 U.S.
17 manufacturing
18 non-competitive.


In [54]:
range(10)

range(0, 10)

Iterating over lists of objects (such as our words above) or numbers is a common task. Sometimes you want to iterate over a list of objects AND keep a counter to track where you are in the list. This is where the `enumerate` function comes in handy.

<b>Example 3</b>: The `enumerate` function. Print the first 5 words in our Trump tweet.

In [None]:
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

for i, word in enumerate(words):
    print(i, word, 'trump')

To print only the first 5 words, we can test the counter on each pass and leave the loop once it reaches 5:

In [59]:
for i, word in enumerate(words):
    if i < 5:
        print(word)
    else:
        break

The
concept
of
global
warming


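The `break` statement stops the loop immediately: once `i` reaches 5, Python exits the `for` loop without visiting the remaining words.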
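
The first five words can also be obtained without `enumerate`. The sketch below shows two alternatives, looping over `range(5)` and slicing the list; the name `first_five` is introduced only for this example.

```python
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

# Option 1: loop over the first five index positions
for i in range(5):
    print(words[i])

# Option 2: slice off the first five words, then loop over the slice
first_five = words[:5]
for word in first_five:
    print(word)
```

Both options assume the list has at least five elements; the slice simply returns fewer words if it does not, while indexing with `words[i]` would raise an error.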

### The `while` loop

Whenever possible, it is good to use a `for` loop to iterate over sequences. However, there are times when you do not know the length of the sequence you are iterating over ahead of time. This is when a `while` loop is useful. Let's revisit <b>Example 3</b>, but this time using a `while` loop.

In [61]:
# We need to initialize a counter to hold our iterations
i = 0

while i < len(words):
    print(words[i])
    # Need to update the counter. Otherwise, we get trapped in an
    # "infinite loop"!
    i += 1

The
concept
of
global
warming
was
created
by
and
for
the
Chinese
in
order
to
make
U.S.
manufacturing
non-competitive.


## Exceptions

Sometimes we need to handle errors rather than let them stop our program. We do so using ``try`` and ``except`` in Python (see <a href="https://python.swaroopch.com/exceptions.html"><i>A Byte of Python</i> on exceptions</a> for more information). For instance, consider the following ``while`` loop:

In [62]:
# Initialize counter
i = 0

# This is called an infinite loop -- be careful!
while True:
    print(words[i])
    i += 1

The
concept
of
global
warming
was
created
by
and
for
the
Chinese
in
order
to
make
U.S.
manufacturing
non-competitive.


IndexError: list index out of range

Once we run out of words, the code breaks -- it errors out with an ``IndexError``. If we ran into this error in one of our programs, the program would stop executing. Instead, we can "catch" the error, using a ``try`` and ``except`` sequence:

In [63]:
# Initialize counter
i = 0

# This is called an infinite loop -- be careful!
while True:
    # Try to print a word
    try:
        print(words[i])
        i += 1
    # Handle the exception raised when we run out of words
    except IndexError:
        print('We ran out of words!')
        break

The
concept
of
global
warming
was
created
by
and
for
the
Chinese
in
order
to
make
U.S.
manufacturing
non-competitive.
We ran out of words!


This code, instead, catches the error -- our program could continue doing other things if we wanted. There are times when catching errors can be super helpful.
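
Catching errors is also handy when cleaning messy data. As a sketch, suppose some counts arrive as text and one of them is malformed; the list `raw_counts` below is made up for illustration.

```python
# Hypothetical counts read from a file, where one value is malformed
raw_counts = ['24332', '50342', 'n/a', '16703']

counts = []
for value in raw_counts:
    try:
        counts.append(int(value))
    except ValueError:
        # int() raises ValueError for text that is not a whole number
        print(f'Skipping bad value: {value}')

print(counts)
```

Naming the exception (`ValueError` here) means that only the error we expect is caught, so unrelated bugs still surface.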

## Functions

Often when writing programs and doing analysis, we want to reuse pieces (or blocks) of code. We do so by declaring a function using the `def` statement. We have already used several of Python's built-in functions earlier in this tutorial. For instance, we "called" the `len` function to get the number of characters in a string. Python, however, makes it super easy to define your own functions.

<b>Example 4</b>: Looking up words in a tweet, any tweet. We can extend our code in <b>Example 1</b> to make it reusable for any tweet by defining a function.

In [1]:
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

def lookup(tweet, keyword):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise '''
    
    #print('Is %s found in the tweet?' % keyword)
    
    if keyword in tweet:
        return True
    else:
        return False

In [67]:
print(lookup(tweet, 'global warming'))

True


In [68]:
lookup(tweet, 'covfefe')

False

Our function above illustrates features that most functions have:

* <b> Parameters </b>: `keyword` and `tweet` are parameters in our function (i.e., information sent to the function that it needs to run).
* <b> Return value(s) </b>: Most, but not all, functions return a value (or set of values).
* <b> Doc string </b>: String explaining what the function does (i.e., "documenting" the function), which appears just under the function definition.

Doc strings are helpful, as they allow a user (including yourself!) to get help on what the function does:

In [69]:
help(lookup)

Help on function lookup in module __main__:

lookup(tweet, keyword)
    This function takes a tweet and keyword, and returns True if the 
    keyword is present and False otherwise



We can also specify <b>default values for parameters</b>. For instance, we can add the following:

In [70]:
def lookup(tweet, keyword = 'covfefe'):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise '''
    
    print('Is %s found in the tweet?' % keyword)
    
    if keyword in tweet:
        return True
    else:
        return False

In [71]:
lookup(tweet)

Is covfefe found in the tweet?


False

In [72]:
lookup(tweet, keyword='global warming')

Is global warming found in the tweet?


True

Your functions can get quite complex and you can even include an <a href="https://www.geeksforgeeks.org/args-kwargs-python/">arbitrary number of arguments</a>. However, we are not going to worry about the complexities at this point. And don't worry, we will be using functions throughout this course, so you will get many (many!) opportunities to practice their use (for more on functions, click <a href="https://python.swaroopch.com/functions.html">here</a>).
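
For the curious, here is a rough sketch of what an arbitrary number of arguments can look like. The function `lookup_any` is a hypothetical variation on `lookup`, written only for this illustration; the `*keywords` parameter collects any extra arguments into a tuple.

```python
def lookup_any(tweet, *keywords):
    '''Hypothetical helper: return True if any of the keywords
       appears in the tweet and False otherwise.'''
    for keyword in keywords:
        if keyword in tweet:
            return True
    return False

tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'

found_one = lookup_any(tweet, 'covfefe', 'global warming')
found_none = lookup_any(tweet, 'covfefe', 'fake news')

print(found_one)
print(found_none)
```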

## Data structures

Python offers a number of alternatives (or "structures") for storing data. There are four built-in data structures: `list`, `dict`, `tuple`, and `set`. We will look at each of these in turn.

### Lists

A list is just that -- a list of objects. These "objects" can be numbers, strings, and even other data structures. For instance, when we "split" the Trump tweet above into separate words, Python returned a list:

In [2]:
words = tweet.split(' ')
print(words)
print(len(words))

['The', 'concept', 'of', 'global', 'warming', 'was', 'created', 'by', 'and', 'for', 'the', 'Chinese', 'in', 'order', 'to', 'make', 'U.S.', 'manufacturing', 'non-competitive.']
19


This list has 19 elements and we can look up a particular element in the list using the appropriate index. Note that Python indexes lists starting at 0 and counts from left to right. So if we wanted to look up the 5th element in this list, we would type:

In [None]:
print(words[4])

We can also count from the end of a list by using negative indices. So to get the last and second to last word in the list, we could type:

In [3]:
# Last word
print(words[-1])

# Second to last word
print(words[-2])

non-competitive.
manufacturing
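
Indices can also be combined into "slices" that return several elements at once. The sketch below assumes the same tweet as above; a slice `words[start:stop]` includes the element at `start` but stops just before `stop`.

```python
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

first_three = words[:3]   # elements 0, 1, and 2
middle = words[3:5]       # elements 3 and 4
last_two = words[-2:]     # the final two elements

print(first_three)
print(middle)
print(last_two)
```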


We can also `append` objects to the end of a list or `insert` objects into a list using an index:

In [4]:
# Add an additional word to the end of our list
words.append('crazy')
print(words)

['The', 'concept', 'of', 'global', 'warming', 'was', 'created', 'by', 'and', 'for', 'the', 'Chinese', 'in', 'order', 'to', 'make', 'U.S.', 'manufacturing', 'non-competitive.', 'crazy']


or `insert` an object based on an index:

In [5]:
# Add a word to the beginning of the list
words.insert(0, 'trump')
print(words)

['trump', 'The', 'concept', 'of', 'global', 'warming', 'was', 'created', 'by', 'and', 'for', 'the', 'Chinese', 'in', 'order', 'to', 'make', 'U.S.', 'manufacturing', 'non-competitive.', 'crazy']


Lists are super flexible and can store just about anything. For example, we will often need to work with "lists of lists".

In [6]:
# Define a list to hold two Trump tweets
tweets = ["Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!",
          "The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive."]

# Loop over the "tweets" and tokenize
tokenized_tweets = []
for tweet in tweets:
    tokenized_tweets.append(tweet.split(' '))

print(tokenized_tweets)

[["Let's", 'continue', 'to', 'destroy', 'the', 'competitiveness', 'of', 'our', 'factories', '&', 'manufacturing', 'so', 'we', 'can', 'fight', 'mythical', 'global', 'warming.', 'China', 'is', 'so', 'happy!'], ['The', 'concept', 'of', 'global', 'warming', 'was', 'created', 'by', 'and', 'for', 'the', 'Chinese', 'in', 'order', 'to', 'make', 'U.S.', 'manufacturing', 'non-competitive.']]


We can then access an individual element within our nested lists as follows:

In [None]:
# Get word 12 in tweet 2 (indices start at 0, so we use 1 and 11)
print(tokenized_tweets[1][11])

### List comprehension

While we are on the subject of lists, it is good to introduce the idea of "list comprehension" in Python. I think of list comprehension as a special type of loop. This procedure takes a list, loops over it (typically modifying it in some way), and then returns a new list. The advantage of using list comprehension rather than, say, a `for` loop is that it often leads to efficient, easy to read code. Let's take a look.

In [7]:
nums = [i for i in range(10)]
print(nums)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


We could also do the same thing using a loop, but it's a bit long-winded:

In [8]:
nums = []
for i in range(10):
    nums.append(i)

print(nums)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


Or take our tweet example above:

In [9]:
tweets = ['The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.',
          'This is also a tweet.']
tokenized_tweets = [tweet.split(' ') for tweet in tweets]
print(tokenized_tweets)

[['The', 'concept', 'of', 'global', 'warming', 'was', 'created', 'by', 'and', 'for', 'the', 'Chinese', 'in', 'order', 'to', 'make', 'U.S.', 'manufacturing', 'non-competitive.'], ['This', 'is', 'also', 'a', 'tweet.']]


We will see many more examples of using list comprehension in Python.
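
A list comprehension can also filter as it goes by adding an `if` clause at the end. As a small sketch, the cutoff of 6 characters and the name `long_words` are arbitrary choices for this example.

```python
tweet = 'The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.'
words = tweet.split(' ')

# Keep only the words longer than 6 characters, converted to lowercase
long_words = [word.lower() for word in words if len(word) > 6]
print(long_words)
```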

### Dictionaries

In addition to lists, dictionaries are one of the most often used data structures in Python programs. As aptly described in <i>Byte of Python</i>, 

> "A dictionary is like an address-book where you can find the address or contact details of a person by knowing only his/her name i.e. we associate <i>keys</i> (name) with <i>values</i> (details). Note that the key must be unique just like you cannot find out the correct information if you have two persons with the exact same name."

How do dictionaries work in practice? Let's go back to the two tweets from Donald Trump above. Say that we wanted to store the `tweets` list along with a list of tweet IDs. We could do so using a dictionary as follows:

In [11]:
# Define a list of tweet ids
ids = [1, 2]

# Combine the ids and tweets into a dictionary
tweets_dict = {'tweets': tweets, 'ids': ids}

print(tweets_dict)

{'tweets': ['The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.', 'This is also a tweet.'], 'ids': [1, 2]}


And we can now call up each list using the `tweets_dict` and the relevant key.

In [12]:
# Grab the tweets to view
print(tweets_dict['tweets'])

['The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive.', 'This is also a tweet.']


As with lists, dictionaries are super flexible. I often store data as a list of dictionaries as follows:

In [13]:
tweets = [{'id': 1, 
           'tweet': "Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!"},
          {'id': 2,
           'tweet': "The concept of global warming was created by and for the Chinese in order to make U.S. manufacturing non-competitive."}
         ]


This allows you to call up a particular tweet using the "tweet" key, instead of having to remember which index in a list holds the tweet element.

In [15]:
print(tweets[0]['id'])

1
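
A few more dictionary operations are worth knowing. The sketch below uses shortened, made-up records (`records`) rather than the real tweets, and the `retweets` key is hypothetical.

```python
# Hypothetical records, shortened for illustration
records = [{'id': 1, 'tweet': 'This is a tweet.'},
           {'id': 2, 'tweet': 'This is also a tweet.'}]

first = records[0]

# Loop over the key-value pairs of one dictionary
for key, value in first.items():
    print(key, value)

# .get() returns a default instead of raising KeyError for a missing key
missing = first.get('retweets', 0)
print(missing)

# Add a new key-value pair to the dictionary
first['retweets'] = 10

# Pull one field out of every record
record_ids = [record['id'] for record in records]
print(record_ids)
```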


### Tuples

I tend to use tuples less often than lists and dictionaries, but they are still quite useful in certain circumstances. You can think of a tuple as a stripped down version of a list, with the added feature that they are <a href="https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747">immutable</a> (don't worry about this concept too much at this point). Basically, we can use tuples when we really, really want objects to remain together and we don't want them to be changed.

You define a tuple in a very similar way to a list:

In [16]:
tweet = (1, "Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!")
print(tweet)

(1, "Let's continue to destroy the competitiveness of our factories & manufacturing so we can fight mythical global warming. China is so happy!")


We can still iterate over this tuple and call individual elements based on their index; however, there is no `append` or `insert` method for tuples. They are "hard to change" by design!
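
To see what "hard to change" means in practice, the sketch below unpacks a tuple into two variables and then tries to overwrite one of its elements. The tuple `record` is a shortened, made-up example.

```python
# Hypothetical (id, text) pair, shortened for illustration
record = (1, 'This is a tweet.')

# Unpack the tuple into two variables
tweet_id, text = record
print(tweet_id, text)

# Trying to change an element raises a TypeError
try:
    record[0] = 2
except TypeError as error:
    print('Cannot modify a tuple:', error)
```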

### Sets

A set is useful when you want a <i>unique</i>, unordered collection of Python objects. For example,

In [23]:
names = ['travis', 'travis', 'travis', 'riley', 'riley', 'dreolin']
names_set = list(set(names))
names_set.append('ranu')
print(names_set)

['travis', 'dreolin', 'riley', 'ranu']


Where the use of sets really helps us is when checking for membership in a collection of objects. For instance, if I wanted to know whether 'travis' was included in this list of names, I could use the membership operator above on the list of names directly:

In [None]:
print('travis' in names)

However, when the list of names is large or you need to check for membership many times, it becomes much more efficient to convert the list to a set once and then check against the set:

In [None]:
unique_names = set(names)
print('travis' in unique_names)
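
Sets also support operations for comparing two collections. In the sketch below, the second set `others` is made up for illustration; because sets are unordered, `sorted` is used when a predictable order matters.

```python
names = ['travis', 'travis', 'travis', 'riley', 'riley', 'dreolin']
unique_names = set(names)

# Hypothetical second collection to compare against
others = {'riley', 'ranu'}

in_both = unique_names & others     # intersection
in_either = unique_names | others   # union
only_first = unique_names - others  # difference

print(in_both)
print(sorted(in_either))
print(sorted(only_first))
```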

## Input and output 

Most of your scripts and programs will need to read and write data. Let's jump right in with an example of reading and writing a CSV file in Python. After learning how to write a CSV formatted file, we will look at some other useful file formats.

<b>Example 5</b>: Reading, processing, and then writing data. Open the trump_tweets_2017.csv file, flag tweets about "fake news", and write these tweets to disk.

We need to start by downloading the trump_tweets_2017.csv data and storing it in a location that you can find. I downloaded it to the following folder on my machine: /Users/tcoan/git_repos/notebooks/data. If you want to avoid typing the entire (absolute) path each time you read and write data, you can set the working directory using the `os` module (similar to `setwd()` in **R**).


In [25]:
import os
os.chdir('/Users/tcoan/git_repos/notebooks')

Next, we need to import the `csv` module and then use the module to read the tweet data.

In [26]:
# We need the csv module to read and write 
# CSVs in Python.
import csv

# First, we will do this the "long" way. Start by opening
# a connection to a file on disk
csvfile = open('data/trump_tweets_2017.csv', 'r')

# Initialize a csv.reader() object, pointing to the file on
# disk
csvreader = csv.reader(csvfile)

# Read the tweets
tweets = [row for row in csvreader]

# Close the connection.
csvfile.close()

In [32]:
with open('data/trump_tweets_2017.csv', 'r') as csvfile:
    # Connect to file
    csvreader = csv.reader(csvfile)
    tweets = [row for row in csvreader]

In [33]:
labels = tweets[0]
tweets = tweets[1:]

In [34]:
tweets[0]

['Twitter for iPhone',
 '9.47E+17',
 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'FALSE',
 '24332',
 '117013']

Opening and closing the file by hand is sort of a "clunky" way to read a file, however, as you need to remember to actually close the connection (and you probably should also use the `del` statement to delete the csvreader object). The preferred way to read (and write) data is using the `with` statement:

In [35]:
# Read the file within the "with" structure
def read_csv(path):
    with open(path, 'r') as csvfile:
        # Connect to file
        csvreader = csv.reader(csvfile)

        # Read the tweets
        tweets = [row for row in csvreader]
    return tweets

In [None]:
tweets = read_csv('data/trump_tweets_2017.csv')

Using the `with` statement reads the tweets and then closes the file automatically, even if an error occurs along the way. Let's take a closer look at the "tweets" data:

In [None]:
# Is the first row a header?
print(tweets[0])

# Yes, it is. Let's save the headers for future reference
# and then remove them from the data
labels = tweets[0]
tweets = tweets[1:]

# Let's take a look at the first tweet
print(tweets[0])

# How many total tweets are we working with?
print(len(tweets))

So our Trump tweets are loaded. The next step is to look for the keywords "fake news" in these tweets.

In [None]:
# Let's save the fake news tweets in a
# separate list
fake_news = []

# Lookup fake news tweets
for tweet in tweets:
    # Standardize by converting to lowercase
    if 'fake news' in tweet[2].lower():
        fake_news.append(tweet)

print('Found %s tweets about "fake news"' % len(fake_news))

In [None]:
fake_news[0]

Lastly, let's write our "fake news" tweets to disk:

In [None]:
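# Use the csv module to write a CSV file. Passing newline='' stops
# extra blank lines from appearing between rows on some platforms.
with open('data/fake_news_tweets.csv', 'w', newline='') as csvfile: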
    csvwriter = csv.writer(csvfile)
    csvwriter.writerows(fake_news)
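
The `csv` module can also return each row as a dictionary keyed by the header, which avoids having to remember that the tweet text sits at index 2. The sketch below is an illustration only: it reads a tiny made-up CSV from memory with `io.StringIO` instead of the real file on disk, and the column names `id` and `text` are assumptions.

```python
import csv
import io

# Hypothetical two-row CSV held in memory instead of on disk
csv_text = 'id,text\n1,This is a tweet.\n2,This is also a tweet.\n'

with io.StringIO(csv_text) as csvfile:
    # DictReader uses the first row as the keys for every later row
    reader = csv.DictReader(csvfile)
    rows = [row for row in reader]

print(rows[0]['text'])
print(len(rows))
```

Note that every value comes back as a string, so numeric columns still need converting with `int` or `float`.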

### JSON formatted files

Many APIs (e.g., the Twitter API) return JSON formatted files. <a href="https://en.wikipedia.org/wiki/JSON">Wikipedia</a> describes JSON files as follows:

> "In computing, JavaScript Object Notation or JSON (/ˈdʒeɪsən/ JAY-sən) is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value). It is a very common data format used for asynchronous browser–server communication, including as a replacement for XML in some AJAX-style systems."

The JSON file format looks a lot like a Python dictionary. For example,

[
   {
      "source":"Twitter for iPhone",
      "text":"Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!",
      "created_at":"Sat Dec 30 22:42:09 +0000 2017",
      "retweet_count":24332,
      "favorite_count":117013,
      "is_retweet":false,
      "id_str":"947236393184628741"
   }
]

We load JSON files in Python using the ``json`` module. As an example, we can load the JSON version of the 2017 Trump Twitter data (again, stored on my system in /Users/tcoan/git_repos/notebooks/data):

In [41]:
import json

# Read JSON formatted data
with open('data/trump_tweets_2017.json', 'r', encoding='utf-8') as jfile:
    jdata = json.load(jfile)

jdata[0]

{'source': 'Twitter for iPhone',
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'created_at': 'Sat Dec 30 22:42:09 +0000 2017',
 'retweet_count': 24332,
 'favorite_count': 117013,
 'is_retweet': False,
 'id_str': '947236393184628741'}

In [42]:
jdata[0].keys()

dict_keys(['source', 'text', 'created_at', 'retweet_count', 'favorite_count', 'is_retweet', 'id_str'])

We write (or dump) JSON files in the usual way. When writing JSON, I like to pass a handful of additional options to `json.dump`:

In [None]:
with open('data/pretty.json', 'w') as jfile:
    json.dump(jdata[0:10], jfile, indent=4, separators=(',', ': '), sort_keys=True)
    # Add trailing newline for POSIX compatibility
    jfile.write('\n')
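
The same conversion works on strings rather than files through `json.dumps` and `json.loads`, which is convenient for a quick look at how Python values map to JSON. The record below is a shortened, made-up example.

```python
import json

# Hypothetical record, shortened for illustration
tweet_record = {'source': 'Twitter for iPhone',
                'retweet_count': 24332,
                'is_retweet': False}

# Convert the dictionary to a JSON formatted string
json_text = json.dumps(tweet_record, indent=4, sort_keys=True)
print(json_text)

# Convert the JSON string back into a Python dictionary
restored = json.loads(json_text)
print(restored == tweet_record)
```

Notice that Python's `False` is written as `false` in the JSON text and converted back when loading.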

### Pickle files

The last file format that we will use is the so-called "pickle" file. Here is how the <a href="https://docs.python.org/3/library/pickle.html">Python docs describe pickling</a>:

> "“Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”; however, to avoid confusion, the terms used here are “pickling” and “unpickling”."

We read and write pickle files (surprise, surprise) using the `pickle` module. As an example, let's "serialize" our `jdata` file and write it to disk.

In [None]:
import pickle

with open('data/trump_tweets_2017.pkl', 'wb') as pfile:
    pickle.dump(jdata, pfile)

This creates a "pickled" file in the data directory. Pickle files are not human readable, but they are super useful because they preserve, exactly, the Python object that you are writing to disk. We can then, at a later point, load the same object back into Python for further analysis. For example,

In [None]:
with open('data/trump_tweets_2017.pkl', 'rb') as pfile:
    tweets = pickle.load(pfile)

In [None]:
tweets == jdata
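
To see what "preserving the object exactly" buys us, the sketch below pickles a made-up dictionary containing a tuple and a set, two types that JSON cannot store as-is. It uses `pickle.dumps` and `pickle.loads`, which work on bytes in memory rather than on a file.

```python
import pickle

# Hypothetical object holding a tuple and a set
data = {'ids': (1, 2), 'tags': {'news', 'climate'}}

# Serialize to bytes, then restore the object from those bytes
blob = pickle.dumps(data)
restored = pickle.loads(blob)

print(type(blob))
print(restored == data)
print(type(restored['ids']), type(restored['tags']))
```

One caution: unpickling can run arbitrary code, so only load pickle files that come from a source you trust.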

## Pandas

You can also use the `pandas` library to load the data. For example, this code will load our tweets JSON file into a new data type provided by `pandas`: the "data frame".

In [43]:
import pandas as pd
trump_df = pd.read_json('data/trump_tweets_2017.json')

Like in **R**, we can look at the first couple of rows by using the `head` method:

In [44]:
trump_df.head()

|   | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str |
|---|---|---|---|---|---|---|---|
| 0 | Twitter for iPhone | Jobs are kicking in and companies are coming b... | 2017-12-30 22:42:09+00:00 | 24332 | 117013 | False | 947236393184628736 |
| 1 | Twitter for iPhone | I use Social Media not because I like to, but ... | 2017-12-30 22:36:41+00:00 | 50342 | 195754 | False | 947235015343202304 |
| 2 | Twitter for iPhone | On Taxes: “This is the biggest corporate rate ... | 2017-12-30 21:12:45+00:00 | 16703 | 73325 | False | 947213895286054912 |
| 3 | Twitter for iPhone | Oppressive regimes cannot endure forever, and ... | 2017-12-30 19:02:53+00:00 | 23270 | 78932 | False | 947181212468203520 |
| 4 | Twitter for iPhone | The entire world understands that the good peo... | 2017-12-30 19:00:54+00:00 | 23532 | 77986 | False | 947180713236934656 |


We will come back to the awesomeness of `pandas` and dataframes; for now, we will focus on reading and writing data. We can read and write most forms of data with `pandas`, including the data formats described above. We can also convert a data frame back into plain Python objects:

In [45]:
trump = trump_df.to_dict(orient="records")

In [39]:
trump[0]

{'source': 'Twitter for iPhone',
 'id_str': 9.47e+17,
 'text': 'Jobs are kicking in and companies are coming back to the U.S. Unnecessary regulations and high taxes are being dramatically Cut, and it will only get better. MUCH MORE TO COME!',
 'is_retweet': False,
 'retweet_count': 24332,
 'favorite_count': 117013}
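
To see what `to_dict(orient="records")` is doing, here is a small sketch on a made-up data frame (the `small_df` values are hypothetical, loosely modeled on the columns above). Each row becomes one dictionary keyed by column name, and the resulting list can be handed straight back to `pd.DataFrame`:

```python
import pandas as pd

# Hypothetical miniature version of the tweets data frame
small_df = pd.DataFrame({
    'source': ['Twitter for iPhone', 'Twitter for Android'],
    'retweet_count': [24332, 50342],
    'is_retweet': [False, False],
})

# One dictionary per row, keyed by column name
records = small_df.to_dict(orient='records')
print(records[0])

# The list of dictionaries can be turned back into a data frame
rebuilt = pd.DataFrame(records)
print(rebuilt)
```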

In [None]:
# Read a json file
trump_df_json = pd.read_json('data/trump_tweets_2017.json')

## Our first program: pulling it all together

The final step in our whirlwind tour of Python is to pull our code together into a single "program" (i.e., a collection of functions that, when executed, perform a task). We will stick with our Twitter example.

<b>Example 6</b>: Write a Python program that reads a CSV file of Tweets, searches for a particular keyword, and returns the relevant Tweets.

In [46]:
# Import dependencies
import csv

# Let's write a function to read a CSV file
def read_csv(path):
    '''This function takes an (absolute) path to a CSV file and
       returns a utf-8 encoded list of tweets and the header of
       field "labels" associated with the CSV. Note that we assume
       that the first row of the input file is the header.
    
       Arguments:
       ----------
       path: absolute path to the CSV file.
       
      
       Returns:
       --------
       A dictionary with the header labels and the Tweets.
    '''
    
    with open(path, 'r', newline='', encoding='utf-8') as csvfile:
        # Connect to file
        csvreader = csv.reader(csvfile)

        # Read the tweets
        tweets = [row for row in csvreader]
    
    return {'header': tweets[0], 'tweets': tweets[1:]}


# See if a tweet includes the relevant keyword
def lookup(tweet, keyword):
    '''This function takes a tweet and keyword, and returns True if the 
       keyword is present and False otherwise.
       
       Arguments:
       ----------
       tweet: The text of a Tweet
       keyword: The keyword of interest to lookup
       
       Returns:
       --------
       True if the keyword is present and False otherwise
    '''
    
    # Standardize keyword and Tweet to use lowercase
    return keyword.lower() in tweet.lower()


# Main function to search a CSV of tweets
def search_tweets(keyword, path, text_idx = 2):
    '''This function takes a keyword and absolute path to
       a CSV file of Tweets and returns a new list of
       Tweets that contain the keyword.
       
       Arguments:
       ----------
       keyword:  The keyword of interest to lookup in the Tweet
       path: The absolute path to the CSV file holding the Tweets
       text_idx: Is the index for the element holding the Tweet text
                 (defaults to index = 2)
    '''
    
    # Read CSV content
    content = read_csv(path)
    
    # Search Tweets for keyword
    key_tweets = [tweet for tweet in content['tweets'] 
                  if lookup(tweet[text_idx], keyword)]
    
    print('Found %s tweets about %s' % (len(key_tweets), keyword))
    
    return key_tweets


We can now execute the `search_tweets` function to search a CSV of Tweets for a particular keyword:

In [47]:
res = search_tweets('CNN', 'data/trump_tweets_2017.csv')

Found 36 tweets about CNN


And we can inspect the individual Tweets as per usual:

In [48]:
print(res[0])

['Twitter for iPhone', '9.40E+17', 'Another false story, this time in the Failing @nytimes, that I watch 4-8 hours of television a day - Wrong!  Also, I seldom, if ever, watch CNN or MSNBC, both of which I consider Fake News. I never watch Don Lemon, who I once called the “dumbest man on television!” Bad Reporting.', 'FALSE', '34018', '138768']
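
One possible variation on the program above, sketched under a few assumptions: the CSV has a header row with a column called `text`, and we use `csv.DictReader` so that we can look fields up by name instead of by position (no `text_idx` needed). The three-row CSV and the `search_rows` helper are made up for illustration, and the data lives in memory so that the example runs without any files:

```python
import csv
import io

# Hypothetical three-row CSV held in memory instead of on disk
sample_csv = io.StringIO(
    'source,text,retweet_count\n'
    'Twitter for iPhone,"I seldom, if ever, watch CNN.",10\n'
    'Twitter for iPhone,Jobs are kicking in.,20\n'
    'Twitter for Android,cnn is on again,30\n'
)

def search_rows(rows, keyword, field='text'):
    '''Return the rows whose `field` contains `keyword`, ignoring case.'''
    return [row for row in rows if keyword.lower() in row[field].lower()]

# DictReader treats the first row as the header and yields one dict per row
matches = search_rows(csv.DictReader(sample_csv), 'CNN')

print('Found %s tweets about %s' % (len(matches), 'CNN'))
for row in matches:
    print(row['retweet_count'], row['text'])
```

Note that the `csv` module reads every field as a string, so `retweet_count` would need an `int(...)` conversion before doing any arithmetic with it.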


## A bit more on the `pandas` (and `numpy`) library

We've covered pretty much all there is to know regarding base Python (at least so far as this class is concerned). However, more and more, I'm using the `pandas` library for all of my data wrangling needs and so you should too! `pandas` is built on top of the highly efficient and flexible NumPy library (you will find that many of the libraries that you use are built on top of `numpy`).

Let's take a quick tour of `pandas` -- we will come back to `pandas` throughout the course!

### Reading and writing data

We can use `pandas` to read and write various forms of data. For instance, a CSV file:

In [None]:
# Import the pandas library using the namespace "pd" to save on typing
import pandas as pd

# Read the Trump tweets CSV into a pandas "dataframe"
trump_df = pd.read_csv('data/trump_tweets_2017.csv')

This loads our tweets CSV into a new data type unique to `pandas`: the "data frame". Like in **R**, we can look at the first few rows by using the `head` method:

In [None]:
trump_df.head()

We can do the same thing with a JSON file:

In [None]:
# Read a json file
trump_df_json = pd.read_json('data/trump_tweets_2017.json')
trump_df_json.head()
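
So far we have only read data. As a rough sketch of the writing side, the example below uses a made-up two-row data frame and writes to strings instead of files so that it runs anywhere; in a real script you would pass a filename (a path of your choosing) as the first argument to `to_csv` or `to_json`:

```python
import io
import pandas as pd

# Hypothetical two-row data frame standing in for the tweets
df = pd.DataFrame({
    'text': ['hello world!', 'I love you Python.'],
    'retweet_count': [3, 7],
})

# With no filename, to_csv returns the CSV as a string;
# index=False leaves out the row labels
csv_text = df.to_csv(index=False)
print(csv_text)

# Write to JSON as a list of records, one object per row
json_text = df.to_json(orient='records')
print(json_text)

# Reading either string back gives us the same table again
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))
print(from_csv.head())
```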

Dataframes represent tabular data organized by variable (each column is what `pandas` refers to as a "series"). You can access a variable in two ways:

In [None]:
# You can use a "."
print(trump_df.retweet_count.shape)

# Or you can call it using the name (or the "key")
print(trump_df['retweet_count'].shape)

Here, we looked at the shape (i.e., the length) of the vector "retweet_count", but `pandas` series "objects" have all sorts of "methods" attached to them. For instance:

In [None]:
print('The mean retweet_count =')
print(trump_df.retweet_count.mean())

print('Here is a frequency table for "source"')
print(trump_df.source.value_counts())
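
If you want to check these calls without the tweet file, here is a small sketch on a made-up data frame (the three rows are hypothetical). It confirms that the two ways of accessing a column give the same series and shows what `shape`, `mean`, and `value_counts` return:

```python
import pandas as pd

# Hypothetical three-row data frame
df = pd.DataFrame({
    'source': ['Twitter for iPhone', 'Twitter for iPhone', 'Twitter for Android'],
    'retweet_count': [10, 20, 60],
})

# Both access styles return the same series
same = df.retweet_count.equals(df['retweet_count'])
print(same)

# For a series, shape is a one-element tuple holding the number of rows
print(df['retweet_count'].shape)

# mean() returns a single number; value_counts() returns a new series
# with one entry per distinct value
print(df['retweet_count'].mean())
counts = df['source'].value_counts()
print(counts)
```

The bracket form is the safer habit: it also works when a column name contains spaces or clashes with an existing method name (a column called `mean`, say), where the dot form does not.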

Now, to the internet to learn more about `pandas`!

<https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/>

## Object oriented programming

In the geekier corners of the internet (or the University campus), there's an ongoing debate on the benefits and drawbacks of functional programming (FP) versus object oriented programming (OOP). You can ignore these debates! However, when using Python, you will often run into the use of "<b>classes</b>" and thus it is important to have some knowledge of what a "class" is. Providing that knowledge is the goal of this section. (Note: for an excellent introduction to classes in Python, see the <i>A Byte of Python</i> chapter on <a href="https://python.swaroopch.com/oop.html">Object Oriented Programming</a>.)

We have actually already run into classes. For instance, the `csv.reader` code that we used to import a CSV file above returns an object that belongs to a "class." The same is true of ordinary strings:

In [None]:
print(csvreader)

In [None]:
"Good bye Donald!".lower()

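This tells us that the `csvreader` that we assigned above is an object of a reader `class` (the exact class name printed depends on the CSV library in use), and that a string is an object with methods of its own, such as `lower`. Great, but what does all this actually mean?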
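
As a quick, self-contained way to poke at this idea, the sketch below uses the built-in `type` and `isinstance` functions to ask objects which class they belong to. The tiny in-memory CSV is made up so that the example does not depend on any files:

```python
import csv
import io

text = "Good bye Donald!"
reader = csv.reader(io.StringIO('a,b\n1,2\n'))  # hypothetical two-line CSV

# type() reports the class that an object belongs to
print(type(text))      # <class 'str'>
print(type(reader))

# A method is a function attached to the class; these two calls are equivalent
print(text.lower())
print(str.lower(text))

# isinstance() checks whether an object belongs to a given class
print(isinstance(text, str))
```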

### The `class` statement

OOP is a paradigm of programming built on the idea of classes of <b>objects</b>---i.e., a structure that holds data (often referred to as "attributes") and functions or procedures (often referred to as "methods"). As an example, say we were interested in defining "classes" of people walking around this university. There are different types of people and these people do different things. We can define a `professor` class as follows:

In [49]:
# Define the "professor" class. 
class professor:
    pass

In [50]:
prof = professor()

In [51]:
print(prof)

<__main__.professor object at 0x7ff38d432f10>


We now have a professor class, but they don't actually do anything. We can add a <b>method</b>, as follows:

In [53]:
# Define the "professor" class. 
class professor:
    def pontificate(self):
        print("I'm a professor. Blah, blah, blah.")
        

Now our professor does what professors do best: pontificate! We can now instantiate our class and call the `pontificate` method:

In [54]:
prof = professor()

In [55]:
prof.pontificate()

I'm a professor. Blah, blah, blah.


In [None]:
# Initialize class
travis = professor()

# Call method
travis.pontificate()

Great, but we still have a bunch of unanswered questions. What's this `self` thingy? How do I store and pass <b>attributes</b> to my professor class? Let's start with the second question. Say we wanted to add two attributes to our professor `class`: a `name` and a `subject` attribute.

In [56]:
# Define the "professor" class. 
class professor:
    def __init__(self, name, subject):
        self.name = name
        self.subject = subject
    
    def pontificate(self):
        print("My name is %s. I teach %s. Blah, blah, blah." % (self.name, self.subject))

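As shown above, we can add attributes to our class by defining an `__init__` method and attaching the attributes to the `self` object. This also answers the first question: `self` refers to the particular object (or "instance") of the class that is being created or used, which is how each professor keeps track of its own `name` and `subject`. Now if we instantiate and call `pontificate`: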

In [None]:
# Initialize class
travis = professor('jason', 'public opinion')
print(travis)

# Call method
travis.pontificate()
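
To make the role of `self` a little more concrete, here is a sketch of a slightly different class. The `Professor` class, its `introduce` method, and the two example professors are all made up for illustration; `introduce` returns a string rather than printing so that we can compare results:

```python
# Hypothetical variation on the professor class above
class Professor:
    def __init__(self, name, subject):
        # `self` is the new object being created; we attach the data to it
        self.name = name
        self.subject = subject

    def introduce(self):
        # `self` is whichever object the method was called on
        return "My name is %s. I teach %s." % (self.name, self.subject)

# Two instances of the same class, each with its own attributes
ada = Professor('Ada', 'computing')
grace = Professor('Grace', 'compilers')

print(ada.introduce())
print(grace.introduce())

# Calling a method on an instance is shorthand for passing the
# instance to the class's function as `self`
print(Professor.introduce(ada) == ada.introduce())
```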

Again, it is not super important for you to understand how classes work for this class (no pun intended!). You just need to know that they exist, you initialize them with a set of attributes, and then "use" them by calling their methods.