# Python and Notebooks

In [1]:
student_name = "Your Name Here" 

The goal of this lab is to get you started using Python, Jupyter Notebooks, and Git, three tools that you will use throughout the course.

**Python** is our language of choice.  You may have seen it before; if not, you will need to learn basic Python coding.

You are looking at a **Jupyter Notebook**, a document that mixes text, code, and the output of that code.  A lot of your work will be creating notebooks like this to present your analysis.

**Git** is a distributed version control system (DVCS). You will use it to keep track of your work and to ensure that you have a backup copy of what you are doing.  You should have checked this notebook out of **GitHub** using Git. Your task is to complete some programming work in this worksheet and commit your changes to your own GitHub repository.

# Python Basics

Your task this week is to complete some basic programming tasks with Python in this worksheet.  There are questions below with a space for you to write code to achieve the given outcomes. Write the code, test it, and when you are done, submit your work as described at the end of the notebook. 

The tasks aren't meant to be complicated Python problems, just some simple tasks to get you started with this process.  

## String Manipulation

The next cell defines three strings that you will use in the first group of questions. Note that the first uses single quotes, the second uses double quotes and the third uses three double quotes since it includes newline characters.  These are all valid ways of writing strings in Python and all produce the same kind of string object; only the triple-quoted form can span several lines.
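
As a small illustration of the quoting styles, here is a sketch using made-up strings (`single`, `double` and `triple` are names chosen for this example only, not part of the worksheet):

```python
# Hypothetical strings, used only to compare the quoting styles.
single = 'data science'
double = "data science"
triple = """data
science"""

print(single == double)   # the quote style is not part of the value
print(len(single))        # number of characters, spaces included
print('\n' in triple)     # the line break is stored as a newline character
```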

In [8]:
title = 'My Notebook'
date = "18 June 2018"
description = """My notebook will contain examples of Python code and text that describes
what it does.  This is a Python string. Add some Chinese characters: """


Write code to print the length of these strings.

In [9]:
# write your code here

Write code to create a new string in a variable 'summary' that contains the date, title and the first 20 characters of the description, with a ':' character between each one (i.e. '18 June 2018:My Notebook:My notebook will con').
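
If slicing and joining strings are new to you, the following sketch shows both on unrelated, made-up values (`city`, `year` and `note` are assumptions for illustration only). It shows two possible ways to join the pieces, and is not the required solution:

```python
# Hypothetical values, not the strings defined earlier in the notebook.
city = "Sydney"
year = "2018"
note = "A fairly long note about the weather"

first_ten = note[:10]                             # characters at positions 0 to 9
label = city + ":" + year + ":" + first_ten       # concatenation with +
same_label = ":".join([city, year, first_ten])    # join a list with a separator
print(label)
print(label == same_label)
```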

In [11]:
# write your code here

Write code to find the number of words in the description.  Hint: this is easy in Python because strings support the [split method](https://docs.python.org/3.6/library/stdtypes.html#str.split), which returns a list of strings after splitting on whitespace (or on another separator if you wish).   Try calling split on the string, then find out how many strings are in the resulting list.
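
A minimal sketch of how `split` behaves, using made-up strings (`sentence` and `fields` are names invented for this example):

```python
sentence = "the quick  brown fox"   # note the double space
words = sentence.split()            # no argument: split on runs of whitespace
print(words)
print(len(words))

fields = "a,b,,c".split(",")        # explicit separator: empty strings are kept
print(fields)
```

With no argument, consecutive spaces count as one separator. With an explicit separator, they do not, which is why the second call produces an empty string.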

In [18]:
# write your code here

## Control Structures

Here you will explore Python control structures - conditionals and loops.  

Write a for loop over the words in the description and count how many times the word 'unit' occurs.  Your solution will have an if statement inside the for loop.

Here you will encounter Python's required indentation for the first time. This will annoy you at first but you will learn to either love it or hate it with time...
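
To show what that indentation looks like, here is a sketch of an if statement nested inside a for loop. It runs over a made-up list of numbers (`numbers` and `even_count` are hypothetical names) rather than the words in the description:

```python
numbers = [4, 7, 10, 13, 16]   # hypothetical data
even_count = 0
for n in numbers:
    # this indented block runs once per element
    if n % 2 == 0:
        # this deeper block runs only when the condition holds
        even_count += 1
print(even_count)   # back at the left margin: runs once, after the loop
```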

In [20]:
# write your for loop here

You can iterate over any sequence with a for loop, including the characters in a string.  Write a for loop over the characters in the description that prints out 'Comma!' every time it sees a comma.
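
As a rough illustration of iterating over different kinds of sequence, this sketch collects the items visited into a list (`seen` is a name made up for the example):

```python
seen = []
for letter in "abc":          # a string yields its characters
    seen.append(letter)
for item in (10, 20):         # a tuple yields its elements
    seen.append(item)
for i in range(3):            # range(3) yields 0, 1, 2
    seen.append(i)
print(seen)
```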

In [23]:
# write your code here

## Functions

Python is a dynamically typed language, so we don't need to declare the type of a variable or the return type of a function (although Python 3 introduced optional [type hints](https://stackoverflow.com/documentation/python/1766/type-hints#t=201607251908319482596)).  Apart from that, the idea of writing a function in Python is the same as in Processing or (methods in) Java.
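
For comparison, here is a sketch of the same small function written without and with type hints. The functions `shout` and `shout_hinted` are invented for this example. The hints document the intended types but Python does not enforce them at run time:

```python
def shout(text, times=2):
    """Return text in upper case, repeated; no types are declared."""
    return text.upper() * times


def shout_hinted(text: str, times: int = 2) -> str:
    """Same behaviour, with optional type hints added."""
    return text.upper() * times


print(shout("hi"))
print(shout_hinted("go", 3))
```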

Write a function that takes a single string argument and returns the number of words in the string using the code you wrote above to count words.

In [30]:
# write your code here

Use your function to find the number of words in the description string defined above.

In [33]:
# write your code here

## Lists and Dictionaries

First we look at some built-in Python data structures: lists and dictionaries. 

A list is a sequence of things. Unlike lists in statically typed languages (Java, C#), a Python list can contain a mixture of different types - there is no separate type for a list of integers or a list of lists.   Here are some lists:

In [12]:
ages = [12, 99, 51, 3, 55]
names = ['steve', 'jim', 'mary', 'carrie', 'zin']
stuff = [12, 'eighteen', 6, ['another', 'list']]

1. write code to print the first and third elements of each list
2. write code to select and print everything except the first element of each list
3. write a for loop that prints each element of the 'names' list
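
The indexing and slicing syntax these tasks rely on can be sketched on a separate, made-up list (`colours` is not one of the lists defined above):

```python
colours = ['red', 'green', 'blue', 'black']   # hypothetical list
print(colours[0])     # first element (positions start at 0)
print(colours[-1])    # last element
print(colours[1:3])   # elements at positions 1 and 2
print(colours[2:])    # from position 2 to the end
```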

In [None]:
# write code here

A dictionary is an associative array - it associates a value (any Python data type) with a key. The key is usually a string but can be any immutable type (string, number, tuple).  The code below counts the occurrences of words in a string.  It stores the count for each word in a dictionary using the word as a key. If the word is already stored in the dictionary, it adds one to the count; if not, it initialises the count to one.  

The second for loop iterates over the keys in the dictionary and prints one line per entry.

Modify this example to be a bit smarter:
- make sure that punctuation characters are not included as part of a word; be careful with hyphens - should they be included or not?
- make the count use the lowercase version of a word, so that 'The' and 'the' are counted as the same word
- **Challenge**: find the first and second most frequent words in the text
- **Challenge**: take your code and write it as a function that takes a string and returns a list of words with their counts in order

In [13]:
description = """This unit introduces students to the fundamental techniques and 
tools of data science, such as the graphical display of data, 
predictive models, evaluation methodologies, regression, 
classification and clustering. The unit provides practical 
experience applying these methods using industry-standard 
software tools to real-world data sets. Students who have 
completed this unit will be able to identify which data 
science methods are most appropriate for a real-world data 
set, apply these methods to the data set, and interpret the 
results of the analysis they have performed. """

count = dict()
for word in description.split():
    if word in count:
        count[word] += 1
    else:
        count[word] = 1
        
for word in count:
    print(word, count[word])

This 1
unit 3
introduces 1
students 1
to 4
the 5
fundamental 1
techniques 1
and 3
tools 2
of 3
data 5
science, 1
such 1
as 1
graphical 1
display 1
data, 1
predictive 1
models, 1
evaluation 1
methodologies, 1
regression, 1
classification 1
clustering. 1
The 1
provides 1
practical 1
experience 1
applying 1
these 2
methods 3
using 1
industry-standard 1
software 1
real-world 2
sets. 1
Students 1
who 1
have 2
completed 1
this 1
will 1
be 1
able 1
identify 1
which 1
science 1
are 1
most 1
appropriate 1
for 1
a 1
set, 2
apply 1
interpret 1
results 1
analysis 1
they 1
performed. 1
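
As a smaller illustration of how dictionary keys behave, here is a sketch with made-up data (`stock` and `grid` are hypothetical names). It also uses the `get` method, one possible alternative to the if/else test in the example:

```python
stock = {'apples': 3, 'pears': 0}             # hypothetical data, string keys
stock['apples'] += 2                          # update an existing key
stock['plums'] = stock.get('plums', 0) + 1    # get() supplies 0 when the key is missing

grid = {(0, 0): 'start', (2, 3): 'end'}       # tuples are immutable, so they can be keys
print(stock)
print('pears' in stock)                       # membership tests look at the keys
print(grid[(2, 3)])
```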


<hr>

## Pandas Data Frames

[Pandas](https://pandas.pydata.org) is a Python module that provides some important data structures for Data Science work and a large collection of methods for data analysis. 

The two main data structures are the [Series]() and [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).  

A Series is a one-dimensional array of data, but unlike a Python list the data is indexed - the index is like a dictionary key and can be any immutable value, such as a number or string.  You can select elements from the series by index label as well as by position.  
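
A minimal sketch of label-based and positional selection on a Series, assuming pandas is installed. The values and labels are made up for illustration:

```python
import pandas as pd

# Hypothetical data: three values with string labels as the index.
heights = pd.Series([1.2, 0.9, 1.5], index=['oak', 'elm', 'ash'])

print(heights.loc['elm'])     # select by index label
print(heights.iloc[0])        # select by position
print(list(heights.index))    # the labels themselves
```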

A DataFrame is analogous to a spreadsheet - a two-dimensional table of data with indexed rows and named columns. 
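
To make the rows-and-columns picture concrete, here is a sketch that builds a small DataFrame from a dictionary. The column names, row labels and values are all invented for this example:

```python
import pandas as pd

# Hypothetical table: two named columns, three labelled rows.
df = pd.DataFrame(
    {'name': ['x1', 'x2', 'x3'], 'score': [10, 20, 30]},
    index=['a', 'b', 'c'],
)

print(df.shape)                 # (rows, columns)
scores = df['score']            # a single column is a Series
print(type(scores).__name__)
print(df.loc['b', 'name'])      # one cell, by row label and column name
```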

You should read up on these and follow the examples in the text.  Here are a few exercises to complete with data frames.

<hr>

You are given three CSV files containing sample data.

In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ds1 = 'files/ds1.csv'
ds2 = 'files/ds2.csv'
ds3 = 'files/ds3.csv'

Write code below to read one of these data files into a pandas data frame and:
- show the first few rows: .head
- find the summary data for each column: .describe
- select just those rows where the values of both x and y are over 50
- select the column 'x' and create a series
- plot the 'x' series as a line graph
- plot the dataframe as a scatterplot

Once you have the code for this, you can change the file you use for input of the data (ds2, ds3) and re-run the following cells to see the different output that is generated.
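
If you want to try the reading and filtering steps without touching the lab files, the following sketch reads made-up CSV text from memory instead. The column names `a` and `b` and the values are assumptions for this example only; the real files may look different:

```python
import io

import pandas as pd

# Made-up CSV text with made-up column names, read from memory rather than disk.
csv_text = "a,b\n10,60\n70,80\n90,20\n"
frame = pd.read_csv(io.StringIO(csv_text))

print(frame.head(2))        # first two rows
print(frame.describe())     # summary statistics per numeric column

# Combine two conditions with &; each condition needs its own parentheses.
both_high = frame[(frame['a'] > 50) & (frame['b'] > 50)]
print(both_high)
```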