# Big Data Python Basics

### What is This Document?

This document is a collection of sample Python examples with detailed explanations in comment form. It is intended as a quickstart to using Python in the Big Data and Analytics class and as an assorted collection of sample code that will be useful for labs. It is *NOT* a detailed tutorial for learning Python.

Have questions? For more details about Python, try these resources:
* Python wiki, for full set of resources: https://wiki.python.org/moin/
* Long list of Python tutorials: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
* Most interesting tutorials (I've not gone through all of them so feedback welcome)
    * Free online book "A Byte of Python": https://python.swaroopch.com
    * Intro to Python for Data Scientists: https://www.datacamp.com/courses/intro-to-python-for-data-science
    * Curated list of tutorials: http://docs.python-guide.org/en/latest/intro/learning/
    
* Learn more about Markdown: https://daringfireball.net/projects/markdown/

## Printing examples

### Basic way to print a string 

##### What's a string? 
 1. A way to work with letters, special characters, spaces, and individual digits (as opposed to numbers)
 2. The characters go inside a single quote ' or double quote " (called delimiters)


In [1]:
print ('hello world')
print ("hello Zaphod Beeblebrox")

hello world
hello Zaphod Beeblebrox


#### Basic way to print just a number or a numerical calculation

In [2]:
print (42)
print (1+1*2/3)

42
1.6666666666666665


#### Basic way to mix strings and numbers


In [3]:
print ("1 + 1 = ", 1+1)
print ("Average of 1, 2, 10 is:", (1+2+10)/3.0)

1 + 1 =  2
Average of 1, 2, 10 is: 4.333333333333333


#### A more detailed example, using formatting to make the answer easier to read

In [4]:
print ("Average of 1, 2, 10 is: %.2f" % ((1+2+10)/3))

Average of 1, 2, 10 is: 4.33
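The same formatting can also be written with an f-string (available in Python 3.6 and later), which some people find easier to read than the `%` style. This is a minimal sketch; the variable names `average` and `message` are made up for illustration.

```python
# an f-string lets you put a value, and how to format it, right inside the string
average = (1 + 2 + 10) / 3

# {average:.2f} means "show average as a decimal number with 2 digits after the point"
message = f"Average of 1, 2, 10 is: {average:.2f}"
print(message)
```

Both styles produce the same text, so use whichever one reads more clearly to you.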


## Variable and types

#### A very brief introduction to Python variables
* Variables are a way to hold and store values
* A variable can only hold a single value at any time
* Python variables can be numerical, a string of characters, boolean (true or false), or other things, as we'll see soon

* To set a value for a variable, put the name on the left of an equal sign and the value on the right
   * example: cookie = "Chocolate Chip"
   * setting a new value for a variable overwrites (erases) the old one
* To use the value in a variable, put its name on the right side of the equal sign. Its current value will then be used in the calculation
    * height = 42
    * double_height = height + height
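
Here is a small sketch of the rules above in action. The second cookie value, "Oatmeal Raisin", is invented just to show overwriting.

```python
# set a value: name on the left, value on the right
cookie = "Chocolate Chip"
print(cookie)

# setting a new value overwrites (erases) the old one
cookie = "Oatmeal Raisin"   # hypothetical second value
print(cookie)

# use a value: the name on the right side is replaced by its current value
height = 42
double_height = height + height
print(double_height)
```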

In [5]:
# examples of setting numerical variables
a=10
b=30
c= a + b

# examples of using numerical values
print("Average of a, b, and c is: %.4f" % ((a+b+c)/3))

Average of a, b, and c is: 26.6667


In [6]:
# examples with different variable types: numerical, string, and boolean
nbr=1
name="Zaphod Beeblebrox"
truth=True
print(nbr, name, truth)

1 Zaphod Beeblebrox True


In [7]:
# variables have types
# types help determine the operations you can use on a variable
# operations are called methods

print("Type of the variable nbr is ", type(nbr))
print("Type of the variable name is ", type(name))
print("Type of the variable truth ", type(truth))

pi = 3.14159265358979328462643383279
print("Type of the non-integer number pi is ", type(pi))
print(pi)
print('pi')


Type of the variable nbr is  <class 'int'>
Type of the variable name is  <class 'str'>
Type of the variable truth  <class 'bool'>
Type of the non-integer number pi is  <class 'float'>
3.141592653589793
pi


In [8]:
#you can perform operations on variables
nbr = 4
print(nbr+nbr, nbr-nbr, nbr*nbr, nbr/nbr, nbr**nbr, -nbr )

print("Arthur"  + "Dent")

print (nbr/3, nbr//3, round(nbr/3), nbr%3)

8 0 16 1.0 256 -4
ArthurDent
1.3333333333333333 1 1 1
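
The last line of the cell above packs four related operations into one print. Here is a sketch that takes them one at a time, using the same value of `nbr`. The variable names on the left are chosen only for illustration.

```python
nbr = 4

true_division = nbr / 3      # regular division, the result is a float
floor_division = nbr // 3    # division rounded down to a whole number
rounded = round(nbr / 3)     # rounds to the nearest whole number
remainder = nbr % 3          # modulo: what is left over after floor division

print(true_division, floor_division, rounded, remainder)

# floor division and modulo fit together: 3 goes into 4 once, with 1 left over
print(floor_division * 3 + remainder == nbr)
```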


## Lists  

If you want to store more than one item, a list is the best way to go. 
These are somewhat similar to arrays in other languages. The list is itself a variable, and every item in the list is also a variable, but has a special name.

For example, a_list = ['a', 'b', 'c'] creates a list with three alphabetical characters in it. 

Each item in the list has its own variable name:
* a_list[0] holds the value 'a'
* a_list[1] holds the value 'b'
* a_list[2] holds the value 'c'. 

Notice that the numbers in the brackets [] start at 0 and end at one less than the length of the list.
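
A short sketch of that numbering, using the same `a_list`. The names `first_item`, `last_index`, and `last_item` are just for illustration.

```python
a_list = ['a', 'b', 'c']

# the first item is always at index 0
first_item = a_list[0]

# the last item is at index (length of the list - 1)
last_index = len(a_list) - 1
last_item = a_list[last_index]

print("First item:", first_item)
print("Last index:", last_index, "Last item:", last_item)
```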


#### Python lets you make a list of things
* other languages sometimes call these arrays (there are subtle differences but we don't care right now)

In [9]:
name = "Tracey"
favFruits = ["apple", "orange", "banana", "peach", "dragonfruit", "lychee", "jackfruit", "papaya", "blueberry"]
favVeggies = [] #this is an empty list. It's a list type, but has nothing in it

#### Print everything in the list just by using its name

In [10]:
print(favFruits)
print(favVeggies)
print (type(favFruits))
print(type(favVeggies))

['apple', 'orange', 'banana', 'peach', 'dragonfruit', 'lychee', 'jackfruit', 'papaya', 'blueberry']
[]
<class 'list'>
<class 'list'>


#### Print just one thing in a list with the name of the list, a pair of brackets, and its index (the number inside the brackets)
* like most coding languages, lists start at nbr 0, and go to n-1
* notice that the thing printed is NOT in brackets. It's just a variable and not a list.

In [11]:
print(favFruits[1])
print(favFruits[4])

orange
dragonfruit


#### A list can have different types (heads up - this is not true for other languages)

In [12]:
mixItUp = ["petunia", 42, False]
print(mixItUp)

['petunia', 42, False]


#### You can add things to lists

In [13]:
favFruits.append("strawberry")
favFruits.append("raspberry")
print(favFruits)

['apple', 'orange', 'banana', 'peach', 'dragonfruit', 'lychee', 'jackfruit', 'papaya', 'blueberry', 'strawberry', 'raspberry']


#### You can select 'slices' of lists
* notice the values of the two numbers
  * first number is the index of the first list item you want in the slice
  * the second number is the index one PAST the last item you want (who knows why? It just is)
  * the colon indicates you want a 'slice' of the list

In [14]:
print(favFruits[1:3])
print(favFruits[0:4])

['orange', 'banana']
['apple', 'orange', 'banana', 'peach']
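
One handy side effect of the "one past" rule is that the number of items in a slice is the second number minus the first. The sketch below uses a shortened, made-up `fruits` list and illustrative names `start` and `stop`.

```python
fruits = ["apple", "orange", "banana", "peach", "dragonfruit"]

start = 1   # index of the first item we want
stop = 3    # index one PAST the last item we want

fruit_slice = fruits[start:stop]
print(fruit_slice)

# the slice holds (stop - start) items
print(len(fruit_slice), stop - start)
```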


## Mathematical Calculations - Minimum, Maximum, Averages

* Jupyter notebooks let you perform calculations right in your notebook. That's what is meant by it being interactive.
* Not only can you do calculations right in this document, you can make updates to your equations and run them again.
* This means you can develop and refine your analysis right in your working lab report

Python uses functions to automate a great many calculations, just like your graphing calculator does. Unlike your graphing calculator though, you have access to hundreds of thousands of functions. You can even learn to write your own.

In this section, we'll explore some common mathematical functions and see how to use them in Python.

In [15]:
# if you have a list of numbers then you can do mathematical calculations 
someNbrs = [10,24,43,66,87,105,326,526,601,744]


### min and max

In [16]:
# finding the minimum and maximum numbers in a list is easy - you just call functions min and max
minNbr = min(someNbrs)
maxNbr = max(someNbrs)

print("Range of values is ", minNbr, " to ", maxNbr)

Range of values is  10  to  744


### average (the mean)

In [17]:
# to count how many items are in the list, use the function len
count = len(someNbrs)
print("Nbr of values is ", count)

# to add up all the items in the list, use the function sum
total = sum(someNbrs)
print("Total sum of values is ", total)

# you can then use these two values to calculate the average (also called the mean)
avg=total/count
print("Average of values is ", avg)

# or, use the function statistics.mean to do the calculations
# you must include the import statistics line first
import statistics
another_mean = statistics.mean(someNbrs)
print("Mean (average) of values is", another_mean)

Nbr of values is  10
Total sum of values is  2532
Average of values is  253.2
Mean (average) of values is 253.2


In [18]:
# doing an average "old school"
# need to *loop* through the list and add all the values
# then divide by the number of items in the list

total = 0 #variable to hold our total
nbrListItems = len(someNbrs) # len is the length of the list, or the number of things in the list

# main loop; ii is loop counter, 
# range gives the index numbers we want to loop through - in this case, every index in the list
for ii in range(nbrListItems):
    # in python, you need to indent spaces for the code you want to loop over
    total = total + someNbrs[ii] #add the current list item at ii to the running total
    print (total)
avg = total/nbrListItems
print("Total: ", total, "Nbr items: ", len(someNbrs), "Average: ", avg)

10
34
77
143
230
335
661
1187
1788
2532
Total:  2532 Nbr items:  10 Average:  253.2
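
Python can also loop over the items in a list directly, without a counter or an index. This sketch repeats the calculation that way; the loop variable name `value` is just for illustration.

```python
someNbrs = [10, 24, 43, 66, 87, 105, 326, 526, 601, 744]

total = 0
# each time through the loop, value holds the next item in the list
for value in someNbrs:
    total = total + value

avg = total / len(someNbrs)
print("Total: ", total, "Average: ", avg)
```

The index version is useful when you need to know the position of each item. When you only need the items themselves, this form is shorter.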


### Median

In [19]:
# Use the function statistics.median to do the calculations
# you must include the import statistics line first ONLY if you didn't do it in an earlier cell
med = statistics.median(someNbrs)
print("Median (middle value) of values is ", med)

Median (middle value) of values is  96.0


In [20]:
# what about median? Need a list in sorted order
# there are some subtleties to using methods to sort in python.
# the .sort method will rearrange the existing list in sorted order
# other ways to sort create a copy of the list in sorted order and preserve the original list

someNbrs.sort()  # sort the list. It's that easy

import math
halfwayPoint = math.floor(nbrListItems/2)

# to determine if a number is even or odd, take the modulo of 2 and check remainder
# 0 remainder means even
if (nbrListItems % 2) == 0: 
    # even number of items: the median is the average of the two middle numbers
    median = (someNbrs[halfwayPoint] + someNbrs[halfwayPoint-1])/2
else: # odd number of items, so the median is the middle number
    median = someNbrs[halfwayPoint]
    
print("Nbr List items", nbrListItems, "Halfway point: ", halfwayPoint, "Median: ", median)

Nbr List items 10 Halfway point:  5 Median:  96.0
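
The comments in the cell above mention that some ways of sorting make a copy instead of changing the original list. The built-in `sorted` function is one of them. Here is a sketch of the difference, using a short list of made-up numbers called `nbrs`.

```python
nbrs = [326, 10, 744, 43, 87]   # hypothetical unsorted numbers

# sorted() builds a NEW list in sorted order and leaves nbrs alone
sorted_copy = sorted(nbrs)
still_unsorted = (nbrs[0] == 326)   # the original order is preserved
print(nbrs, sorted_copy, still_unsorted)

# .sort() rearranges nbrs itself; the original order is gone afterwards
nbrs.sort()
print(nbrs)
```

Use `sorted` when you still need the data in its original order later, for example when the order matches the dates in a dataset.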


## Reading File Data

To do anything useful in these notebooks, we need data, specifically datasets. The most common (and easiest) way that datasets are created and stored is in a CSV file. We are going to explore how to get the data out of the csv file and into a form that we can then perform Python calculations on.

### CSV files
* csv stands for comma separated values.
* **csv file** - simple text document with data in text form (characters and not binary) where the fields (data columns) are separated by commas
* csv files are a common way to store datasets

A common algorithm is to open a csv file and read the data into a list structure:
1. the open command sets a 'pointer' to the beginning of the file
2. the pointer moves through the file as the values are 'read'
    * you can store each value in a list as it's read
3. keep reading the file until there is no more data (the 'pointer' reaches the end of the file)
4. close the file after you are done

Once you've read the data and put it into a list, you can then write code to process or analyze it

### Opening Files For Reading

* The trickiest part here is getting the right location for the file
* file name is *relative* to the Anaconda folder system
* for example, the string "FremontBridge/Fremont_bridge_data.csv" would mean that Fremont_bridge_data.csv is in a folder called FremontBridge in the Anaconda file system
* you can organize your data files however you wish, just make sure the string matches your folder and file organization


In [21]:
#open the file
data_file = open("Fremont_bridge_data.csv", "r")

#create an empty list to store the data from the csv file
bridge_data = []

#put all lines from the file into a list
for row in data_file:
    bridge_data.append(row)
    
#close the file. It's the courteous and clean thing to do
data_file.close()

# show the first five items in the list. Notice that they are all strings
# the \n part of the string means "newline" - it's the symbol that represents the end of a line
print(bridge_data[:5])


['Date,Fremont Bridge West Sidewalk,Fremont Bridge East Sidewalk\n', '9/1/15,1393,1483\n', '9/2/15,1745,1867\n', '9/3/15,1532,1627\n', '9/4/15,1425,1512\n']
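
Two optional refinements are sketched below: a `with` block, which closes the file for you, and `rstrip("\n")`, which removes the newline from the end of each row. So that the example runs without the real csv file, it reads from an `io.StringIO` stand-in holding the first few rows shown above. That stand-in and the name `clean_rows` are assumptions for illustration only.

```python
import io

# Stand-in for the real csv file so this example runs anywhere (an assumption
# for illustration). With a real file, replace this statement with:
#     data_file = open("Fremont_bridge_data.csv", "r")
data_file = io.StringIO(
    "Date,Fremont Bridge West Sidewalk,Fremont Bridge East Sidewalk\n"
    "9/1/15,1393,1483\n"
    "9/2/15,1745,1867\n"
)

clean_rows = []
# 'with' closes the file automatically when the indented block ends
with data_file:
    for row in data_file:
        # rstrip("\n") gives back a copy of the string without the trailing newline
        clean_rows.append(row.rstrip("\n"))

print(clean_rows)
print("File closed?", data_file.closed)
```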


## Processing File Data

**Data Processing Goals**
- Clean up and split the data into two lists, west and east
- This allows for further statistical analysis on cleaned up lists


In [22]:
#remove the first element (column header)
# the header is useful for telling us what the columns mean, 
# but you don't want to include it in your mathematical calculations
del bridge_data[0]

# create two empty lists - one for westbound and one for eastbound bike data
# we want to make a list of just the westbound data and just the eastbound data
west = []
east = []

# loop (go through every row one by one) through the original bridge_data list
for item in range(len(bridge_data)):
    
    # this line is a bit confusing, but what it is doing is splitting out each comma separated item 
    # from a row and turning it into a list 
    # for example, this line of code turns the single string '9/1/15,1393,1483\n' 
    # into three separate strings '9/1/15' '1393' and '1483\n'
    # After being split, the three separate strings are then stored in another list called bridge_row
    bridge_row = bridge_data[item].split(",")
    
    # now that we've split the row into its three parts, we can pick and choose:
    # add the second and the third parts to the specific lists for west and east
    # (int ignores the trailing \n when it converts the string to a number)
    west.append(int(bridge_row[1]))
    east.append(int(bridge_row[2]))

print("List of just the westbound data:")
print(west)
print("\nList of just the eastbound data:")
print(east)

List of just the westbound data:
[1393, 1745, 1532, 1425, 826, 466, 1107, 1733, 2070, 1961, 1863, 1178, 648, 1615, 1795, 1782, 1160, 1461, 981, 712, 1824, 2190, 2270, 2086, 1314, 899, 895, 1907, 2097, 1992]

List of just the eastbound data:
[1483, 1867, 1627, 1512, 770, 425, 898, 1932, 2243, 2069, 1898, 1013, 679, 1669, 1862, 1949, 1293, 1539, 853, 591, 1981, 2361, 2292, 2235, 1370, 869, 676, 1955, 2239, 2063]
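
Python's standard library also has a `csv` module that does the splitting for you. The sketch below is an alternative to the approach above, not a replacement for it. It reads from an `io.StringIO` stand-in holding the first few rows of the dataset, which is an assumption made so the example runs without the real file.

```python
import csv
import io

# Stand-in for the real csv file (an assumption for illustration)
sample_file = io.StringIO(
    "Date,Fremont Bridge West Sidewalk,Fremont Bridge East Sidewalk\n"
    "9/1/15,1393,1483\n"
    "9/2/15,1745,1867\n"
    "9/3/15,1532,1627\n"
)

# csv.reader hands back each row already split into a list of strings
reader = csv.reader(sample_file)

# read the header row and set it aside so it stays out of the calculations
header = next(reader)

west = []
east = []
for row in reader:
    west.append(int(row[1]))   # second column: west sidewalk count
    east.append(int(row[2]))   # third column: east sidewalk count

print(header)
print(west)
print(east)
```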


# Analyzing Bridge Data

Now you have the data in two lists - east and west. You can use the mathematical functions described in earlier sections to analyze your data.



In [23]:
# This is a code cell
# Try adding some code below to find the smallest number of bike commuters, the largest number, the average, and the median,
# using the code examples from the section above



# What's Next?

Be thinking about what sorts of questions you might ask, what other things you might need to do to prepare your datasets, and what sorts of analysis you want to do on them.

* Look for inspiration at: [http://www.seattle.gov/transportation/bikecounter_fremont.html](http://www.seattle.gov/transportation/bikecounter_fremont.htm#detail)
* More details about bike counters:
 http://www.seattle.gov/transportation/bikecounter.htm
* Dataset with daily bike counts:
 https://data.seattle.gov/Transportation/Fremont-Bridge-Daily-Bicycle-Counts/eytj-7qg9/data
* Find more city data at:  https://data.seattle.gov