# Unit 7 - Section 1: An Introduction to Python & Jupyter

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import warnings
warnings.filterwarnings('ignore')

# Outline

__[1.1] Introduction to Python & Jupyter Notebooks__

__[1.2] Basic Data Types: *Strings*, *Integers* & *Floats*__

__[1.3] Basic Data Structures: *Lists* & *Dictionaries*__

__[1.4] Writing basic functions, *if* statements and Importing External Packages__

__References__

# [1.1] Introduction to Python & Jupyter Notebooks

## Python

**Python** is a programming language. We can write instructions (i.e. code) in Python to carry out a specific set of tasks. Python is the workhorse of this Unit, and we will use it to create visualizations & conduct data analysis.

- If you'd like to learn how to install Python & Jupyter Notebooks onto your computer, I recommend the following [tutorial](https://www.dataquest.io/blog/jupyter-notebook-tutorial/). It doesn't take too long and doesn't require many steps. To run the notebooks in this unit, all you will need to do is:

1. download the **Anaconda** distribution for **python3** from the Anaconda [website](https://www.anaconda.com/distribution/)
1. follow the installation instructions from the Anaconda website after downloading
1. follow the instructions to download 2 additional packages: (1) geopandas and (2) statsmodels for the *3rd* notebook (this takes seconds).

Once you have everything installed, you can download the notebooks from the GitHub repository & run through them yourself. Having Anaconda installed on your computer will also be incredibly handy should you choose to enter the world of data analysis and visualization!

## Jupyter Notebooks

**Jupyter Notebooks** are the interface that we will use to interact with **Python**. What you're looking at right now is a Jupyter Notebook! These notebooks are incredibly useful for testing chunks of code and analyzing the output that those chunks of code yield.

- You can write code, test & run that code and write notes in these notebooks to help construct your analysis or share your work with others!

- Jupyter Notebooks have two different types of **cells**

- **Markdown Cells**: These cells are used for taking notes. All of the headers in this notebook are written in *Markdown* cells as well as the notes that you're reading right now. You can change the types of cells with the bar at the top of each notebook.

- **Code Cells**: These cells are used for executing (i.e. running) code. These cells will either have *In [ ]:* or *In [n]:* to the left of them, where *n* is a number that records the order in which the cells were executed. The output of any code run from one of these cells is displayed below the cell.

A **Kernel** is the *Computational Engine* behind every active Jupyter Notebook. The Kernel is always running in the background and is where the execution of the code takes place. The entire notebook runs on a single *Kernel*, so variables and results created in one *code* cell remain available in every other *code* cell.
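
To make the shared-Kernel idea concrete, here is a minimal sketch that imitates two separate *code* cells. It is an illustration added to these notes: the variable names `rainfall_mm` and `rainfall_cm` and the value are made up, and in a real notebook each commented section would be its own cell.

```python
# --- first code cell ---
rainfall_mm = 12.5  # hypothetical value, kept in the Kernel's memory

# --- second code cell, run some time later ---
# rainfall_mm is still available because both cells share one Kernel
rainfall_cm = rainfall_mm / 10
print(rainfall_cm)
```

If the Kernel is restarted, that memory is cleared and the first cell has to be run again before the second one will work.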

We'll run through all of these concepts again and again, as they are central to learning how to use *Jupyter Notebooks* to analyze and visualize data. It's also important to note that Jupyter Notebooks themselves are files that can be stored & saved, like the one you're currently viewing.

##### You don't have to have Anaconda or Jupyter installed on your computer to take away the main concepts from this Unit. One of the big advantages of Jupyter Notebooks is that you can easily share ideas and the results of data analyses with others. So feel free to just scroll through the notebooks to see what types of things you can do with Python!

# [1.2] Basic Data Types: *Strings*, *Integers* & *Floats*

Python has a lot of built-in data types. The basic idea is that you'll have some string of text that you want to manipulate or store as a label, or you might have some number that represents some data, like the temperature on a given day. Perhaps you want to record the temperature for Tuesday, Wednesday & Thursday. We'll see how the different data types in Python come into play.

A **String** is a piece of text that is surrounded by quotation marks (either single or double marks work fine). When a string is the last line of a *code* cell, the notebook displays it below the cell. **Note**: you can enter comments into *code* cells by preceding the text of the comment with a #.

In [17]:
'This is a string' #string

'This is a string'

In [18]:
type( 'This is a string' )

str

**Integers** and **Floats** are both ways to store numbers. You can use *Integers* to store whole numbers and *Floats* to store decimals. I'd recommend sticking to *Floats* for data analysis since most data is usually riddled with numbers that have decimals in them.

In [19]:
4 #integer

4

In [20]:
type( 4 )

int

In [21]:
4.5 #float

4.5

In [22]:
type( 4.5 )

float

## Running Code

As a first step, we'll write some very basic code in Python that we'll execute in a *code* cell. All we want the code to do is to **output** the words *Hello World*. This can be done with a simple *print* command, a function that takes a string as input and displays it.

In [23]:
print('Hello World!')

Hello World!


All you have to do to run code in a *code* cell is have the cell highlighted and press the RUN button at the top of the notebook or hit SHIFT + ENTER.

## Python as a Calculator

We can use Python to do some basic computations. Remember that anything written in the *code* cells is executed on a Kernel.

In [24]:
5.0 + 7.0 #addition

12.0

In [25]:
10.0 / 2.0 #division

5.0

In [26]:
3.0 ** 2 #this denotes an exponent, 3 squared or 3x3 is 9

9.0

In [27]:
(9.0 + 11.0) / 1.5 #you can combine operations

13.333333333333334
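
A few more operations fit the same calculator pattern. The sketch below is an addition to the original notebook and uses only built-in Python operators; the numbers are arbitrary.

```python
print(7 - 2.5)        # subtraction: 4.5
print(4 * 2.5)        # multiplication: 10.0
print(7 // 2)         # floor division drops the remainder: 3
print(7 % 2)          # modulo gives the remainder after division: 1
print(type(4 + 4.5))  # mixing an integer and a float gives a float
print(10 / 2)         # the / operator returns a float even for two integers: 5.0
```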

## Variables

It's often helpful to have place-holders for quantities of interest. We call these place-holders variables, and they can store any data type. Let's say you wanted to store your favorite day of the week in a variable called *fav_day*. You can assign it as follows:

In [28]:
fav_day = 'Saturday' #here we're storing a string in a variable

Now at some point later in the analysis, suppose we want the data that we stored in the variable. We can print what's inside as follows:

In [29]:
print(fav_day)

Saturday


Or maybe you'd like to store your favorite number:

In [30]:
fav_num = 3.14

In [31]:
print(fav_num)

3.14


You can store a lot of different *objects* and *structures* in variables. You will see why variables are so useful below.
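
Variables can also be used inside computations and reassigned later. The sketch below is an illustrative addition: the names `temp_f` and `temp_c` and the temperatures are made up, and the conversion uses the standard Fahrenheit-to-Celsius formula.

```python
temp_f = 77.0                          # hypothetical temperature in Fahrenheit
temp_c = (temp_f - 32.0) * 5.0 / 9.0   # use the stored value in a calculation
print(temp_c)                          # 25.0

temp_f = 68.0                          # assigning again replaces the old value
print(temp_f)                          # 68.0
```

Note that `temp_c` keeps the value it was given earlier; it is not recomputed when `temp_f` changes.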

# [1.3] Basic Data Structures: *Lists* & *Dictionaries*

In data analysis (and visualization) it is incredibly useful to be able to store your data in some structure that can be called when you want to graph it or perform some computation on it. This is where **Lists** and **Dictionaries** come into play. These structures can be used to store a bunch of numbers or strings or actually a lot of other things!

## Lists

**Lists** are ordered, meaning that the things stored inside of them keep the order in which they were entered. For example, let's say we want to store the average temperature for every day of a week, and suppose that the data looked like this (where temperature is in Fahrenheit):

1. Monday = 69 degrees
1. Tuesday = 70 degrees
1. Wednesday = 73 degrees
1. Thursday = 77 degrees
1. Friday = 69 degrees
1. Saturday = 66 degrees
1. Sunday = 68 degrees

We can store the temperature data as a list (denoted with square brackets) where each entry is separated by a ',' :

In [32]:
[69.0 , 70.0 , 73.0 , 77.0 , 69.0 , 66.0 , 68.0]

[69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]

We can also store this list in a variable so we can access it later:

In [33]:
avg_temp = [69.0 , 70.0 , 73.0 , 77.0 , 69.0 , 66.0 , 68.0]

In [34]:
print(avg_temp)

[69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]


Now we have an ordering of the temperatures for each day of the week. But let's say we also wanted the weekday that each temperature corresponds to. We can make a separate list to include the weekdays as strings where the ordering (index) of each list matches up (the 3rd element of the temperature list is the temperature on the day given by the 3rd element of the weekday list).

In [35]:
weekday = ['Monday' , 'Tuesday' , 'Wednesday' , 'Thursday' , 'Friday' , 'Saturday' , 'Sunday']

In [36]:
print(weekday , avg_temp)

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] [69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]
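
One way to walk through both lists together is Python's built-in `zip`, which pairs up elements that share the same position. This is a sketch added for illustration rather than part of the original notebook, and it relies on a `for` loop, which this section does not otherwise cover. The name `pairs` is made up.

```python
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_temp = [69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]

# zip matches the 1st weekday with the 1st temperature, the 2nd with the 2nd, and so on
for day, temp in zip(weekday, avg_temp):
    print(day, temp)

# the same pairing stored as a list of (weekday, temperature) pairs
pairs = list(zip(weekday, avg_temp))
print(pairs[2])
```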


While lists are quite handy, there's actually an easier way to match up labels to data. Another *Data Structure*, called a **dictionary**, allows you to index each entry of a list-like structure with a *key* or label. Dictionaries are denoted by curly brackets; each entry is structured as *key* : *value* and is separated from the next by a ','

**Important Notes:**

- Python starts counting from 0 not 1. So the first element in the list is actually the 0th element in the list.

- You can **index** lists by using integers if you want to retrieve the ith element in the list. Monday is the first (or 0th) element in the list, so if I want to retrieve the temperature for Monday I can index the list of temperatures like so:

In [37]:
print(avg_temp[0])

69.0


And if I want the temperature for Wednesday (which has an index of 2), then I can get it as:

In [38]:
print(avg_temp[2])

73.0
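
Indexing extends in a few directions that are handy to know. The sketch below is an addition to the original notes; it copies the temperatures into a new list called `temps` (a made-up name) so the notebook's own `avg_temp` is left untouched.

```python
temps = [69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]

print(len(temps))    # number of elements in the list: 7
print(temps[-1])     # negative indices count from the end: 68.0 (Sunday)
print(temps[0:3])    # a slice from index 0 up to, but not including, index 3
```

Because counting starts at 0, the last valid index is `len(temps) - 1`; asking for `temps[7]` here would raise an `IndexError`.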


## Dictionaries

**Dictionaries** can be constructed in a few different ways. The *first* is to create one with all of the entries at once:

In [39]:
avg_temp_dict = {
'Monday' : 69.0,
'Tuesday' : 70.0,
'Wednesday' : 73.0,
'Thursday' : 77.0, 
'Friday' : 69.0, 
'Saturday' : 66.0}

In [40]:
print(avg_temp_dict)

{'Monday': 69.0, 'Tuesday': 70.0, 'Wednesday': 73.0, 'Thursday': 77.0, 'Friday': 69.0, 'Saturday': 66.0}


You can then retrieve the temperature for a given day like this:

In [41]:
print( avg_temp_dict['Tuesday'] ) #we want the temperature for Tuesday

70.0


The *second* way is to add entries to a pre-existing dictionary one at a time. Looks like we forgot Sunday, so let's add it!

In [42]:
avg_temp_dict['Sunday'] = 68.0

In [43]:
print(avg_temp_dict) #we have Sunday added to the dictionary now

{'Monday': 69.0, 'Tuesday': 70.0, 'Wednesday': 73.0, 'Thursday': 77.0, 'Friday': 69.0, 'Saturday': 66.0, 'Sunday': 68.0}


A couple of other things to note about Dictionaries:

- You can retrieve just the KEYS or just the VALUES
- The KEYS have to be unique; you can't have the same KEY correspond to 2 or more VALUES

In [44]:
avg_temp_dict.keys() #just get the keys

dict_keys(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

In [45]:
avg_temp_dict.values() #just get the values

dict_values([69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0])
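
The uniqueness of keys can be checked directly. In this sketch, added for illustration, `sample_temps` is a made-up, shortened dictionary so that the notebook's `avg_temp_dict` is not modified.

```python
sample_temps = {'Monday': 69.0, 'Tuesday': 70.0, 'Wednesday': 73.0}

print('Monday' in sample_temps)   # True: the key exists
print('Sunday' in sample_temps)   # False: the key has not been added

# assigning to an existing key replaces its value instead of adding a second entry
sample_temps['Monday'] = 71.0
print(sample_temps['Monday'])     # 71.0
print(len(sample_temps))          # still 3 entries
```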

# [1.4] Writing basic functions, *if* statements and Importing External Packages

## Functions

If there is a process or set of instructions that you'd like to repeat, then it'd be useful to have a way to call that set of instructions again without having to write out the code again. For example, let's say that I want Python to print the following message: "Happy Birthday Roger!". I could just do:

In [46]:
name = 'Roger' #variable to store my name
print('Happy Birthday ' + name) # you can join (concatenate) 2 strings

Happy Birthday Roger


But let's say that I wanted to modify this so that the message could take anyone's name, and then write the appropriate message for them. We will define a function to do this with the input being the individual's name.

In [47]:
def happy_bday(name): #this is how you define a function
    print('Happy Birthday ' + name)

Now if we want to wish Happy Birthday to someone else, we can use the function that we just defined

In [48]:
happy_bday('Scott')

Happy Birthday Scott


Obviously this is a very simple example, and in this case you could've just written out the message. But chunks of code can get unwieldy very quickly, and functions are a useful way of writing the code for a given task once, then re-using it easily for other tasks.
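
Functions can also hand a value back to the code that called them by using `return`, rather than printing it. The sketch below is an illustrative variation on the function above; the name `birthday_message` is made up for this example.

```python
def birthday_message(name):
    # build the message and return it to the caller instead of printing it
    return 'Happy Birthday ' + name

message = birthday_message('Scott')  # store the returned string in a variable
print(message)                       # Happy Birthday Scott
print(len(message))                  # the result can be reused, e.g. to count its characters
```

Returning a value is useful when the result feeds into a later step of an analysis instead of only being displayed.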

## *if* statements 

Another useful thing you can do in Python is write an *if statement*, which tells your code to run a block of instructions only when some condition is met.

Let's say you didn't like someone named Cersei very much and instead of wishing her a Happy Birthday in the function we defined above, you wanted the function to print another message when Cersei's name is input. We can use an if statement to check the name (input) and print a different message depending on whether the name is Cersei or not.

In [49]:
def happy_bday(name): 
    
    #check to see if the input name is Cersei
    if name == 'Cersei': #if this condition is True, then print message below
        print('Screw you Cersei!')
        
    else: #otherwise, for any other name, just wish that person a happy birthday
        print('Happy Birthday ' + name)

Let's see this in action:

In [50]:
happy_bday('Roger')

Happy Birthday Roger


In [51]:
happy_bday('Cersei')

Screw you Cersei!


Looks like it works!
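
When there are more than two cases, Python's `elif` (short for "else if") lets you chain conditions. This sketch is an addition to the notes: the function name `describe_temp` and the 75 and 68 degree cut-offs are arbitrary choices for illustration.

```python
def describe_temp(temp_f):
    # conditions are checked from top to bottom; the first one that is True wins
    if temp_f >= 75.0:
        return 'warm'    # return hands the label back to the caller
    elif temp_f >= 68.0:
        return 'mild'
    else:
        return 'cool'

print(describe_temp(77.0))  # warm
print(describe_temp(69.0))  # mild
print(describe_temp(66.0))  # cool
```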

## Importing External Packages

While Python comes with many of the tools you'll need in order to conduct your analysis, sometimes you may need to download **external packages**. Packages are written by other people for many different purposes. They need to be downloaded once, then **imported** in your Jupyter Notebook/Script whenever you want to use tools from them. We'll cover two of the most widely used packages here, **Numpy** and **Pandas**, and will cover another package called **Matplotlib** when we start covering data visualization.
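
The `import` statement itself works the same way for modules that ship with Python, so it can be tried without downloading anything. The sketch below is an illustrative addition that uses the standard-library modules `math` and `statistics`; the nickname `st` is an arbitrary choice.

```python
import math  # part of Python's standard library, so nothing needs to be downloaded

print(math.sqrt(9.0))   # tools inside the module are reached with a dot: 3.0
print(math.pi)

import statistics as st  # "as" gives the module a shorter nickname
print(st.mean([69.0, 70.0, 73.0]))
```

External packages are imported with exactly the same syntax once they have been installed.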

### Numpy

We've introduced **lists** at this point, which hold an ordered set of items that can be strings or floats or even other lists & dictionaries. But a lot of the time in data analysis, you'll want to deal exclusively with a list of floats, and manipulating this list gets a lot easier by turning it into a **Numpy** array. Numpy is a Python package that is widely used in the scientific computing community, and while we won't delve too far into it, it's useful to see some of the basics.

In [52]:
avg_temp = [69.0 , 70.0 , 73.0 , 77.0 , 69.0 , 66.0 , 68.0]

In [53]:
print(avg_temp)

[69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0]


In [54]:
type(avg_temp)

list

We're going to convert this list to a numpy array

In [55]:
import numpy as np #import the package

In [56]:
avg_temp = np.array(avg_temp)

In [57]:
print(avg_temp)

[69. 70. 73. 77. 69. 66. 68.]


In [58]:
type(avg_temp) #the object is now a numpy array

numpy.ndarray

This is pretty useful as numpy arrays have a lot of useful properties. Let's say I want to get the *sum* and the *mean* of all of the temperatures.

In [59]:
print(avg_temp.sum()) #add all of the temperatures together

492.0


In [60]:
print(avg_temp.mean()) #average temperature during the week

70.28571428571429


Numpy arrays are also useful for plotting data which we'll see shortly.
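
Another reason arrays are convenient is that arithmetic and comparisons apply to every element at once. This sketch is an addition to the notes; the names `temps_f` and `temps_c` are made up, and the conversion is the standard Fahrenheit-to-Celsius formula.

```python
import numpy as np

temps_f = np.array([69.0, 70.0, 73.0, 77.0, 69.0, 66.0, 68.0])

# one line converts all seven temperatures, with no need to handle them one by one
temps_c = (temps_f - 32.0) * 5.0 / 9.0
print(temps_c.round(1))

print(temps_f.max())              # warmest temperature: 77.0
print(temps_f.argmax())           # position of the warmest day: 3 (Thursday)
print(temps_f[temps_f > 69.0])    # keep only the days warmer than 69 degrees
```

Doing the same with a plain list would need extra code to visit each element, which is the main convenience arrays offer here.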

### Pandas

Another very useful package for data analysis is **Pandas**. Pandas offers a convenient way to store & manipulate real-world data. We will make use of this package extensively in the last notebook. For now let's revisit the dictionary that we constructed earlier and turn it into an object called a pandas *series*.

This is the dictionary from earlier - 

In [61]:
avg_temp_dict = {
'Monday' : 69.0,
'Tuesday' : 70.0,
'Wednesday' : 73.0,
'Thursday' : 77.0, 
'Friday' : 69.0, 
'Saturday' : 66.0,
'Sunday' : 68.0}

In [62]:
print(avg_temp_dict)

{'Monday': 69.0, 'Tuesday': 70.0, 'Wednesday': 73.0, 'Thursday': 77.0, 'Friday': 69.0, 'Saturday': 66.0, 'Sunday': 68.0}


In [63]:
type(avg_temp_dict) #we have a dictionary

dict

In [64]:
import pandas as pd #import the package

In [65]:
avg_temp_series = pd.Series(avg_temp_dict) #convert the dictionary to a series

In [66]:
print(avg_temp_series)

Monday       69.0
Tuesday      70.0
Wednesday    73.0
Thursday     77.0
Friday       69.0
Saturday     66.0
Sunday       68.0
dtype: float64


In [67]:
type(avg_temp_series) #this object is a pandas series

pandas.core.series.Series

Pandas series are analogous to dictionaries in that they have a key (called an index) and a value stored for each index. You can retrieve the index or the values in a similar way to dictionaries.

In [68]:
avg_temp_series.index #get the index (days of the week)

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')

In [69]:
avg_temp_series.values #get just the values (temperatures)

array([69., 70., 73., 77., 69., 66., 68.])

Notice that the structure of the values looks like a numpy array. Let's check the object type:

In [70]:
type(avg_temp_series.values)

numpy.ndarray

Indeed it is a numpy array! Pandas *runs on* Numpy and uses it to build other data structures that are more convenient when analyzing messy data. You can do some cool things with *pandas series* just like with *numpy arrays*:

In [71]:
avg_temp_series.sum() #returns the sum of the temperatures

492.0

In [72]:
avg_temp_series.mean() #returns the mean of the temperatures

70.28571428571429

In [73]:
avg_temp_series.median() #returns the median temperature throughout the week

69.0
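
A series also combines the label lookup of a dictionary with the element-wise comparisons of a numpy array. The sketch below is an illustrative addition; it rebuilds the same data under the made-up name `temps` so the notebook's own variables are left alone.

```python
import pandas as pd

temps = pd.Series({'Monday': 69.0, 'Tuesday': 70.0, 'Wednesday': 73.0,
                   'Thursday': 77.0, 'Friday': 69.0, 'Saturday': 66.0,
                   'Sunday': 68.0})

print(temps['Tuesday'])      # look up a value by its label, like a dictionary: 70.0
print(temps.idxmax())        # label of the largest value: Thursday
print(temps[temps > 69.0])   # keep only the days warmer than 69 degrees
```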

# References

**Setting up Jupyter & Tutorial**

- [https://www.dataquest.io/blog/jupyter-notebook-tutorial/](https://www.dataquest.io/blog/jupyter-notebook-tutorial/)

**Anaconda Distribution**

- [https://www.anaconda.com/distribution/](https://www.anaconda.com/distribution/)