<a href="https://colab.research.google.com/github/tonytan4ever/visualization-for-fun/blob/main/visualization_notebooks/PythonRefresher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Python refresher

If you're new to Python, I recommend working through section 5 of the course first for a fuller introduction. This section is meant to refresh your memory if you already know Python but haven't used it in a while. I'll mostly focus on the aspects of Python that are used in the upcoming sections; this is not a comprehensive refresher.

## Defining variables

Python is a dynamically typed language. This means that defining variables does not involve mentioning the "type" of the variable. You just assign a value to a variable and that's that.
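
As a small illustration of dynamic typing, the sketch below binds one name to values of two different types in turn. The name `value` is made up for this example.

```python
# The type belongs to the value, not to the name, so the same name
# can be bound to an int first and a str later.
value = 4
first_type = type(value).__name__

value = 'four'
second_type = type(value).__name__

print(first_type, second_type)
```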

In [None]:
a = 4

In [None]:
b = 5

In [None]:
# Note that the value of the last expression in a cell is printed automatically
a * b

20

In [None]:
# But you can suppress this behavior with a semicolon in Jupyter notebooks
a * b;

In [None]:
# You can also just explicitly print
print(a * b)

20


In [None]:
# This is useful for debugging. You can print out intermediate values
print('a is', a)
print('b is', b)

a is 4
b is 5


## Types of data in Python

We're going to quickly go over the major data types that we'll be using in the rest of the course. Reminder that this is not a comprehensive list.

In [None]:
type(66)

int

In [None]:
type(4.5)

float

In [None]:
type('Sandeep')

str

In [None]:
name = 'Sandeep'
type(name)

str

In [None]:
type(a)

int

### List

This is an extremely common data type used in Python. It's exactly what it sounds like: it's an ordered collection of items.

In [None]:
my_list = [1, 2, 3]

In [None]:
print(my_list)

[1, 2, 3]


In [None]:
type(my_list)

list

One common mistake newcomers to Python make is naming their list `list`. `list` is not a reserved keyword, so Python will let you do this, but it is the name of a built-in type. A variable called `list` shadows that built-in, so later calls such as `list(range(3))` will fail.

In [None]:
# DO NOT DO THIS. "list" is the name of a Python built-in.
# list = [1, 2, 3]
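
To see what goes wrong, here is a minimal sketch that shadows `list` inside a function, so the damage stays contained and the rest of the session is unaffected. The function name `shadowing_demo` is hypothetical.

```python
def shadowing_demo():
    # Inside this function, "list" is rebound to an ordinary list object,
    # which hides the built-in for the rest of the function body.
    list = [1, 2, 3]
    try:
        list(range(3))
    except TypeError as error:
        return str(error)

message = shadowing_demo()
print(message)

# Outside the function, the built-in is untouched.
print(list(range(3)))
```

In a notebook, the same assignment at the top level of a cell would affect every later cell until you run `del list` or restart the kernel.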

In [None]:
# You can also make a list of lists
list_of_lists = [
  [1, 2, 3],
  [4, 5, 6],
  [7, 8, 9]
]

In [None]:
print(list_of_lists)

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]


In [None]:
# Reading items from a list is easy. Indexing starts at 0
my_list[0]

1

In [None]:
# You can also index from the right side using negative numbers
my_list[-1]

3

In [None]:
# Read the second item using "1", because the first item is indexed using "0"
list_of_lists[1]

[4, 5, 6]

In [None]:
# List elements can be changed similarly too
my_list[0] = 10
my_list

[10, 2, 3]

In [None]:
# Add an item using append
my_list.append(4)
my_list

[10, 2, 3, 4]

In [None]:
# use help() to find out what you can do with something in Python
help(my_list)

Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /

In [None]:
# Describe a "range" of numbers using range (the stop value, 4, is excluded)
range(1,4)

range(1, 4)

In [None]:
# range(1, 4) returns a "range" object. Convert it to a list using the built-in "list" function
list(range(1,4))

[1, 2, 3]

In [None]:
help(range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

In [None]:
list(range(1, 4))

[1, 2, 3]

In [None]:
list(range(0, 91, 10))

[0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Dictionary

A dictionary (also known as a map in other programming languages) stores key-value pairs. In the context of data visualization, dictionaries will often be used to provide configuration options.

In [None]:
my_dict = {'key1': 'value1', 'key2': 'value2', 'key3': 100}

In [None]:
my_dict

{'key1': 'value1', 'key2': 'value2', 'key3': 100}

In [None]:
my_dict['key2']

'value2'

In [None]:
my_dict.keys()

dict_keys(['key1', 'key2', 'key3'])

In [None]:
my_dict.values()

dict_values(['value1', 'value2', 100])

In [None]:
my_dict['key3'] = 'another value'
my_dict

{'key1': 'value1', 'key2': 'value2', 'key3': 'another value'}
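
Since dictionaries often carry configuration options, here is a short sketch of a few more operations that tend to come up in that role. The variable `plot_options` and its keys are invented for illustration and do not belong to any particular plotting library.

```python
# Hypothetical plot settings; the keys are made up for this example.
plot_options = {'title': 'Sales', 'width': 800}

# Assigning to a key that does not exist yet adds it.
plot_options['height'] = 600

# "in" checks whether a key is present.
has_color = 'color' in plot_options

# get() returns a fallback value instead of raising KeyError
# when the key is missing.
color = plot_options.get('color', 'blue')

print(plot_options)
print(has_color, color)
```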

### Boolean

Python has a `bool` type for Boolean values, which are either `True` or `False` (note the Title Case).

In [None]:
type(True)

bool

In [None]:
type(False)

bool

In [None]:
# true and false are not a thing in Python. Remember the Title Case.
type(true)

NameError: name 'true' is not defined

### Other data types

Python contains other data types such as Tuples, Sets and many more. Since those will not be directly used in the rest of the course, we'll skip over them. For a well-rounded knowledge of Python, you should definitely learn about them.

The last section of the course should go over them in a bit more detail.

## Logical Operators

The operators in this section compare two expressions (strictly speaking, Python calls them comparison operators). We'll use them in future sections to filter datasets. Let's quickly go over them:

In [None]:
1 == 2

False

In [None]:
1 == 1

True

Note that the expressions above return `True` and `False`, the Boolean values we learned about earlier. Comparing two expressions in Python returns a `bool` that tells us whether the comparison holds.

In [None]:
1 != 2

True

In [None]:
1 < 2

True

In [None]:
1 >= 2

False
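
Comparisons can be combined with Python's `and`, `or` and `not`. The sketch below uses a made-up variable `age` to show how several conditions build on each other.

```python
age = 30  # hypothetical value for illustration

is_adult = age >= 18
is_working_age = age >= 18 and age < 65    # both must hold
is_outside_range = age < 18 or age >= 65   # at least one must hold
is_not_adult = not is_adult                # flips True/False

print(is_adult, is_working_age, is_outside_range, is_not_adult)
```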

## Imports

Sometimes we will need to "bring in" some functionality into our Python session. This process is called importing. There's a lot to importing, but here's the very high-level overview. For this overview, we'll be importing numpy, which is a Python library we're going to discuss very soon.

In [None]:
# This is the most basic form of the import
import numpy

In [None]:
# Now we can access stuff inside the "numpy" variable
numpy.array([1, 2, 3])

array([1, 2, 3])

In [None]:
# Some popular libraries have conventions for importing them
import numpy as np

In [None]:
# This allows us to access numpy using the variable "np".
np.array([1, 2, 3])

array([1, 2, 3])

Only rename a library with `as` when the alias is a popular and established convention. Code readability is important and conventions matter.

In [None]:
# You can also import things from inside other modules. For example, we could
# hypothetically import "array" from inside of numpy.
from numpy import array

array([1, 2, 3])

array([1, 2, 3])

Later on in this course, we'll be using this `from` syntax to import a library that provides us some datasets. It will look like:

```python
from pydataset import data
```

As I mentioned earlier, there are a lot of nuances to importing that we're skipping right over. What has been covered in this section is all you'll need for this particular course.
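
To compare the three import forms side by side, here is a sketch that uses the standard-library `math` module instead of numpy. The alias `m` is not an established convention; it appears only to show that an alias is another name for the same module.

```python
import math             # plain import: names are reached as math.<name>
import math as m        # alias: a second name for the same module
from math import sqrt   # pulls a single name into the current namespace

print(math.sqrt(16), m.sqrt(16), sqrt(16))

# All three routes lead to the same function object.
print(m is math, sqrt is math.sqrt)
```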

## Other concepts in Python

Python is a featureful language. The tiny list of concepts we've covered so far doesn't begin to cover it all. We've tried to keep this refresher short and focused on the Python knowledge you're going to need in the short term.

Note that we've not discussed loops, conditionals, classes or any of the built-in libraries. This course is about data visualization, so we've tried to avoid complexity that does not directly contribute to the topic at hand.

In the real world though, you're going to encounter novel situations that we haven't covered and it would be helpful for you to learn more about Python to be able to deal with them.

# Numpy Refresher

For this course, we'll mostly interact with data through Pandas. But Pandas is built on top of NumPy, so an overview of NumPy will be useful. Again, if you're completely new to NumPy, consider working through the last section of the course first. This section is intended to merely refresh your memory on NumPy and assumes you have some prior experience.

The primary data structure provided by NumPy is the N-dimensional array (a.k.a. `ndarray`). Think of it as a Python list built for numerical computation. Unlike a Python list, which can hold items of different types, all elements in an `ndarray` share the same type.

On top of this datastructure, NumPy provides a whole host of utilities, some of which we'll be reviewing in this section.
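
The single-type rule can be seen through an array's `dtype` attribute. In this sketch the variable names are made up; note that the exact integer dtype (for example `int64`) can vary by platform and NumPy version.

```python
import numpy as np

int_arr = np.array([1, 2, 3])
mixed_arr = np.array([1, 2.5, 3])   # the ints are converted to floats
py_list = [1, 2.5, 'three']         # a list keeps each item's own type

print(int_arr.dtype)
print(mixed_arr.dtype)
print(type(py_list[0]).__name__, type(py_list[2]).__name__)
```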

In [1]:
# Numpy is conventionally imported like this:
import numpy as np

In [None]:
np_arr = np.array([1, 2, 3])
np_arr

array([1, 2, 3])

In [None]:
type(np_arr)

numpy.ndarray

In [None]:
# NumPy has a function similar to Python's range
np_range = np.arange(0, 91, 10)
np_range

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [None]:
# NumPy also has a function to generate a linearly (evenly) spaced array
np_lin = np.linspace(0, 10, 11)
np_lin

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [None]:
# A conditional with a numpy array creates a "boolean mask". The mask is
# True everywhere the condition is met and False elsewhere
np_arr > 1

array([False,  True,  True])

In [None]:
# A subset of the rows of a Numpy Array can be selected using a boolean mask
print('np_arr is', np_arr)
mask = [False, True, True]
print('mask is', mask)
np_arr[mask]

np_arr is [1 2 3]
mask is [False, True, True]


array([2, 3])

In [None]:
# These 2 concepts can be combined to select all items in a numpy array that
# meet a certain condition.
np_arr[np_arr > 1]

array([2, 3])
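
Masks can also be combined. With NumPy arrays you use `&`, `|` and `~` rather than Python's `and`, `or` and `not`, and each comparison needs its own parentheses. A minimal sketch, with a made-up array called `values`:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# & means "and", | means "or", ~ means "not" when applied to masks.
middle = values[(values > 1) & (values < 5)]
ends = values[(values == 1) | (values == 5)]
not_big = values[~(values > 3)]

print(middle, ends, not_big)
```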

In [None]:
# Numpy arrays allow operations that Python lists don't
a = [3, 4, 5]
a * 2

[3, 4, 5, 3, 4, 5]

In [None]:
np_arr = np.array([1, 2, 3])
np_arr * 2

array([2, 4, 6])

The operation above multiplies each element of the NumPy array by 2 individually. This is called "broadcasting".

You can even operate on two NumPy arrays of the same shape:

In [None]:
# This will lead to element-wise additions
np_arr + np_arr

array([2, 4, 6])

In [None]:
np_arr * np_arr

array([1, 4, 9])

In [None]:
np_arr1 = np.array([1, 2, 3, 4])
np_arr1

array([1, 2, 3, 4])

In [None]:
# But trying to operate on arrays of different sizes won't work
np_arr * np_arr1

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
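
Shapes do not have to be identical, though. NumPy lines shapes up starting from the last dimension, and two dimensions are compatible when they are equal or one of them is 1. The sketch below, with made-up names `grid` and `row`, adds a 1-dimensional array to every row of a 2-dimensional one.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])    # shape (2, 3)
row = np.array([10, 20, 30])    # shape (3,)

# The last dimensions match (3 and 3), so "row" is applied to
# each row of "grid".
result = grid + row

print(result.shape)
print(result)
```

The failing case above, shapes `(3,)` and `(4,)`, has last dimensions that are neither equal nor 1, which is why NumPy raises a `ValueError`.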

That's enough of a refresher on NumPy. We'll revisit some of these concepts in the context of Pandas in the next section.

# Pandas Refresher

Pandas is the most common tool used in the industry for data analysis with Python. While this refresher will only focus on the parts necessary for this data visualization course, it's highly recommended that you spend some time learning Pandas.

Pandas is built on top of NumPy. In addition, it also uses Matplotlib to provide some visualization capabilities (we'll see them in future sections).

The first Pandas concept we'll encounter is called `Series`. A Pandas `Series` can be thought of as a 1-dimensional NumPy array with more features.

In [None]:
# Pandas is conventionally imported like this
import pandas as pd

In [None]:
# Let's start by creating a simple series
simple_series = pd.Series([1, 2, 3])
simple_series

0    1
1    2
2    3
dtype: int64

Note that the series does not look like an ordinary NumPy array, which for reference looks like:

In [None]:
np.array([1, 2, 3])

array([1, 2, 3])

Instead, the output of `simple_series` indicates that it's of "type" "int64" and that the numbers 1, 2, 3 have some sort of labels called 0, 1, 2.

This is because Pandas Series have an "index". Think of a Series as a cross between a 1-dimensional NumPy array and a Python dictionary. The index provides labels for each entry in the Series. We can even create a more detailed series like so:

In [None]:
detailed_series = pd.Series(data=[1, 2, 3], index=['one', 'two', 'three'],
                            name='numbers')
detailed_series

one      1
two      2
three    3
Name: numbers, dtype: int64

In [None]:
# We can access items from this series using both their label and their position
# using loc and iloc respectively
print('using label (loc):', detailed_series.loc['one'])
print('using position (iloc):', detailed_series.iloc[0])

using label (loc): 1
using position (iloc): 1
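
The dictionary side of a Series is easiest to see when one is built directly from a dictionary. This is a sketch with invented data (`prices` and its labels are hypothetical):

```python
import pandas as pd

# The dictionary keys become the index labels.
prices = pd.Series({'apple': 3, 'banana': 1, 'cherry': 7}, name='prices')

print(prices.loc['banana'])   # lookup by label, like a dictionary
print(prices.iloc[1])         # lookup by position, like an array
print('apple' in prices)      # "in" checks the index labels
print(list(prices.index))
```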


In [None]:
# We can operate on Pandas series just like we could with NumPy arrays
detailed_series * 2

one      2
two      4
three    6
Name: numbers, dtype: int64

Pandas has a more powerful data structure called **DataFrame** that builds on top of `pd.Series` to provide Excel-sheet-like functionality. In fact, "Excel-sheet on steroids" is a good way to think about DataFrames in Pandas.

Let's create a simple `pd.DataFrame`:

In [None]:
simple_df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['row1', 'row2', 'row3'],
    columns=['col1', 'col2', 'col3'],
)
simple_df

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 2    | 3    |
| row2 | 4    | 5    | 6    |
| row3 | 7    | 8    | 9    |


Note how well Jupyter notebooks can display Pandas DataFrames. Also note that the dataframe has labels for both rows and columns.

You can access columns from dataframes very easily:

In [None]:
simple_df['col1']

row1    1
row2    4
row3    7
Name: col1, dtype: int64

That should remind you of a Pandas Series, because that's exactly what it is:

In [None]:
type(simple_df['col1'])

pandas.core.series.Series

You can select rows using the `.loc` and `.iloc` trick we saw above as well. And guess what? Individual rows are series too!

In [None]:
simple_df.loc['row1']

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [None]:
simple_df.iloc[0]

col1    1
col2    2
col3    3
Name: row1, dtype: int64

In [None]:
type(simple_df.loc['row1'])

pandas.core.series.Series
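
The boolean masks from the NumPy refresher carry over to DataFrames. The sketch below rebuilds `simple_df` so that it runs on its own, then keeps only the rows where `col1` is greater than 1.

```python
import pandas as pd

simple_df = pd.DataFrame(
    data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=['row1', 'row2', 'row3'],
    columns=['col1', 'col2', 'col3'],
)

# Comparing a column produces a boolean Series (the mask).
mask = simple_df['col1'] > 1

# Indexing the DataFrame with the mask keeps the rows where it is True.
filtered = simple_df[mask]

print(mask)
print(filtered)
```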

Note that we won't be creating dataframes like we did above in this course. For the rest of this course, we're going to be using a Python library called `pydataset` to load various datasets for us as a Pandas Dataframe.

Even in your professional life, you'll rarely have to create dataframes by hand. You'll most often be importing external data.

Importing data will be discussed in the next section.