
# Built-In Data Structures, Functions, and Files

This notebook is based on [Chapter 3](https://wesmckinney.com/book/python-builtin) of Python for Data Analysis (3rd ed.) by Wes McKinney.

## Data Structures and Sequences

### Tuple

A tuple is a **fixed-length**, **immutable** sequence of Python objects which, once assigned, cannot be changed.

**Creating a tuple**

In [2]:
# Create a tuple (with parentheses)
tup1 = (4, 5, 6)

In [3]:
tup1

(4, 5, 6)

In [4]:
# Create a tuple (without parentheses)
tup2 = 7, 8, 9

In [5]:
tup2

(7, 8, 9)

**Converting any sequence or iterator to a tuple by invoking `tuple`**

In [6]:
# Convert a list into a tuple
tuple([4, 0, 2])

(4, 0, 2)

In [8]:
# Convert a string into a tuple
tup3 = tuple('string')

In [9]:
tup3

('s', 't', 'r', 'i', 'n', 'g')

**Accessing elements by `[]`**

Sequences are 0-indexed in Python.

In [10]:
# Accessing elements of tuple
tup3[0]

's'

**Complicated tuples**

When defining tuples within more complicated expressions, it's often necessary to enclose the values in parentheses.

In [11]:
# Create a tuple of tuples
nested_tup = (4, 5, 6), (7, 8)

In [12]:
nested_tup

((4, 5, 6), (7, 8))

In [13]:
nested_tup[0]

(4, 5, 6)

In [14]:
nested_tup[1]

(7, 8)

**Mutable elements in a tuple**

While the objects stored in a tuple may be mutable themselves, once the tuple is created it's not possible to modify which object is stored in each slot:

In [15]:
# Creating a tuple with different types of objects
tup4 = ('foo'), [1, 2], (True)

In [16]:
tup4

('foo', [1, 2], True)

In [17]:
# Another way to create the same tuple
tup5 =  ('foo', [1, 2], True)

In [18]:
tup5

('foo', [1, 2], True)

In [19]:
# Checking equivalence of the tuples
tup4 == tup5

True

*Elements in a tuple cannot be modified*

In [21]:
# Elements in a tuple cannot be modified
tup4[2] = False

TypeError: 'tuple' object does not support item assignment

*If an object inside a tuple is mutable, such as a list, you can modify it in place.*

In [23]:
# Modifying a mutable object in a tuple
tup4[1].append(3)

In [24]:
tup4

('foo', [1, 2, 3], True)
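
The difference between rebinding a slot and mutating the object stored in it can be sketched in one self-contained cell. The names `scores` and `record` are hypothetical and used only for illustration.

```python
# Hypothetical example: a tuple holding a mutable list
scores = [1, 2]
record = ('foo', scores, True)

# Rebinding a slot fails because the tuple itself is immutable
try:
    record[1] = [9, 9]
except TypeError as err:
    print("TypeError:", err)

# Mutating the list stored in the slot succeeds, and the tuple
# still refers to the very same list object
record[1].append(3)
print(record)
print(record[1] is scores)
```

Note that each call to `append` adds one more element, so re-running such a cell in a notebook keeps growing the list.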

**Concatenating tuples using the `+` operator**

In [25]:
# Concatenating tuples
tup1 + tup2

(4, 5, 6, 7, 8, 9)

In [27]:
# Concatenating tuples
tup6 = (4, None, 'foo') + (6, 0)

In [28]:
tup6

(4, None, 'foo', 6, 0)

*Note that a trailing `,` is needed if a tuple contains only one element, such as a single string.*

In [32]:
# Creating a tuple with only one string
# Note that the trailing , is needed if a tuple contains only one element
k = ('bar',)

In [33]:
k

('bar',)

In [34]:
type(k)

tuple

In [35]:
k[0]

'bar'

*Without the trailing `,`, the parentheses only group the expression and a plain string is created.*

In [36]:
# Without the trailing ',', only a string is created
s = ('bar')

In [37]:
s

'bar'

In [38]:
type(s)

str

In [39]:
s[0]

'b'
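
The trailing-comma rule is not specific to strings. As a minimal sketch, using hypothetical variable names, the same behaviour can be observed with an integer:

```python
# Parentheses alone only group an expression; the comma makes the tuple
not_a_tuple = (5)
one_tuple = (5,)
also_tuple = 5,      # parentheses are optional here as well

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)
print(one_tuple == also_tuple)
```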

**Multiplying a tuple by an integer**

In [40]:
# Multiplying a tuple
('foo', 'bar') * 4

('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')

**Unpacking tuples**

If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the right-hand side of the equals sign:

In [41]:
# Unpacking tuples
tup = (4, 5, 6)
a, b, c = tup

In [42]:
print("a is", a, ",", "b is", b, ",", "and c is", c, ".")

a is 4 , b is 5 , and c is 6 .


*Even sequences with nested tuples can be unpacked:*

In [43]:
# Unpacking nested tuples
tup = 4, 5, (6, 7)
a, b, (c, d) = tup

In [44]:
print("a is", a, ",", "b is", b, ",", "c is", c, ",", "and d is", d, ".")

a is 4 , b is 5 , c is 6 , and d is 7 .


**Swapping**

*In Python, **swap** can be done like this:*

In [45]:
# Initial setup
a, b = 1, 2

In [46]:
print("a is", a, "and b is", b, ".")

a is 1 and b is 2 .


In [47]:
# Swapping
b, a = a, b

In [48]:
print("Now, a is", a, "and b is", b, ".")

Now, a is 2 and b is 1 .


There are some situations where you may want to "pluck" a few elements from the beginning of a tuple. There is a special syntax that can do this, **`*rest`**.

In [49]:
# Pluck the first few elements with *rest
values = 1, 2, 3, 4, 5
a, b, *rest = values

In [50]:
a

1

In [51]:
b

2

In [52]:
rest

[3, 4, 5]

*As a matter of convention, many Python programmers use the underscore (`_`) for unwanted variables.*

In [53]:
# use '_' instead of 'rest'
a, b, *_ = values

In [54]:
a

1

In [55]:
b

2

In [56]:
_

[3, 4, 5]
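
The starred name does not have to come last. A short sketch, with the hypothetical names `first`, `middle` and `last`, shows it collecting the items in the middle of a sequence:

```python
values = 1, 2, 3, 4, 5

# The starred target absorbs whatever the other targets do not take
first, *middle, last = values

print(first)
print(middle)   # always a list, even when unpacking a tuple
print(last)
```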

**Tuple methods**

The method **`.count()`** counts the number of occurrences of a value:

In [57]:
# .count() method
a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)

4
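
Tuples have one other regular method, `.index()`, which returns the position of the first occurrence of a value. A small sketch, using a hypothetical tuple `t`:

```python
t = (1, 2, 2, 2, 3, 4, 2)

# Position of the first occurrence of each value
first_two = t.index(2)
first_three = t.index(3)
print(first_two, first_three)

# Looking up a missing value raises ValueError
try:
    t.index(99)
except ValueError as err:
    print("ValueError:", err)
```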

### List

Unlike tuples, lists are variable length and mutable: their contents can be modified in place. You can define them using square brackets **`[]`** or using the **`list`** type function:

In [27]:
# Create a list
a_list = [2, 3, 7, None]
a_list

[2, 3, 7, None]

In [28]:
# Create a list using list
tup = ("foo", "bar", "baz")    # This is a tuple
b_list = list(tup)
b_list

['foo', 'bar', 'baz']

In [29]:
# Assign a new value to a list
b_list[1] = "peekaboo"
b_list

['foo', 'peekaboo', 'baz']

The **`list`** built-in function is frequently used in data processing as a way to materialize an iterator or generator expression:

In [30]:
# Materialize an iterator using list
gen = range(10)
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
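
The same idea applies to a generator expression. The sketch below, with hypothetical names, also shows that a generator can be consumed only once:

```python
# A generator expression produces its values lazily
squares = (x ** 2 for x in range(5))

# list() pulls every value out of the generator
materialized = list(squares)
print(materialized)

# The generator is now exhausted, so a second pass yields nothing
leftover = list(squares)
print(leftover)
```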

#### List methods

In [31]:
# Append an element to the end of the list
b_list.append("dwarf")
b_list

['foo', 'peekaboo', 'baz', 'dwarf']

In [32]:
# Insert an element at a specific location in the list
b_list.insert(1, "red")
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']

In [33]:
# Removes and returns an element at a particular index
b_list.pop(2)

'peekaboo'

In [34]:
b_list

['foo', 'red', 'baz', 'dwarf']

In [35]:
# Append a second "foo" so that the list contains a duplicate
b_list.append("foo")
b_list

['foo', 'red', 'baz', 'dwarf', 'foo']

In [36]:
# Remove the first occurrence of a value from the list
b_list.remove("foo")
b_list

['red', 'baz', 'dwarf', 'foo']

In [37]:
# Check if a list contains a value using the in keyword
"dwarf" in b_list

True

In [38]:
# The keyword not can be used to negate in
"dwarf" not in b_list

False

#### Concatenating and combining lists

In [39]:
# Adding two lists together with +
[4, None, "foo"] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

In [40]:
# Append multiple elements using extend method
x = [4, None, "foo"]
x.extend([7, 8, (2, 3)])
x

[4, None, 'foo', 7, 8, (2, 3)]
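
The two approaches give equal results but differ in what they create. As a minimal sketch with hypothetical names, `+` builds a new list while `extend` changes the existing one in place:

```python
x = [4, None, "foo"]
original_id = id(x)

# + creates a new list and leaves x untouched
y = x + [7, 8]
print(y is x)

# extend modifies x in place; it is still the same object
x.extend([7, 8])
print(id(x) == original_id)
print(x == y)
```

Because no new list has to be created and copied, `extend` is usually the cheaper choice when building up a large list.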

#### Sorting

You can sort a list in place (without creating a new object) by calling its **`sort`** method:

In [41]:
# Sorting
a = [7, 2, 5, 1, 3]
a.sort()
a

[1, 2, 3, 5, 7]

In [45]:
# Sort a collection of strings by their lengths
b = ["saw", "small", "He", "foxes", "six"]
b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']
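
When the original order must be kept, the built-in `sorted()` returns a new sorted list instead of sorting in place. The sketch below uses hypothetical names and also shows the `reverse` option; because Python's sort is stable, words of equal length keep their original relative order in both cases.

```python
words = ["saw", "small", "He", "foxes", "six"]

# sorted() returns a new list and leaves words unchanged
by_length = sorted(words, key=len)
longest_first = sorted(words, key=len, reverse=True)

print(by_length)
print(longest_first)
print(words)
```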

# Basic descriptive statistics

The following demonstrates the calculation of some common descriptive statistics, including the mean, trimmed mean, weighted mean, weighted median, sample standard deviation, interquartile range (IQR) and median absolute deviation from the median (MAD).

We load the data into a pandas dataframe because the methods of a dataframe column, called as `.method()`, readily provide the mean, median, sample standard deviation and quantiles.

For the trimmed mean, we use the `trim_mean` function in `scipy.stats`. For the weighted mean, we use the `average` function in NumPy. For the weighted median, we use the specialized package `wquantiles`. For the MAD, we use the `robust` module in the `statsmodels` package.

First, we need to install the `wquantiles` package, as it is not included in the base Colab environment.

In [None]:
# Install the package "wquantiles"
!pip install wquantiles

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wquantiles
  Downloading wquantiles-0.6-py3-none-any.whl (3.3 kB)
Installing collected packages: wquantiles
Successfully installed wquantiles-0.6


We now import the necessary packages:

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import scipy.stats
import wquantiles
from statsmodels import robust

We now load the data as a pandas dataframe:

In [None]:
# Load the dataset as pandas dataframe
link = "https://raw.githubusercontent.com/stevenkhwun/P4DS/main/Data/state.csv"
state = pd.read_csv(link)
state.head()

| Unnamed: 0 | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


## Mean

In [None]:
# Mean by pandas dataframe method
state['Population'].mean()

6162876.3

## Trimmed mean

In [None]:
# Trimmed mean using the scipy.stats package
scipy.stats.trim_mean(state['Population'], 0.1)

4783697.125
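
To make the idea concrete, here is a minimal pure-Python sketch of a trimmed mean. The function `trimmed_mean` and the data in `sample` are hypothetical; the sketch assumes that the given proportion of observations is dropped from each end of the sorted data, with the count rounded down.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping a proportion of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # number cut from each end
    trimmed = ordered[k:len(ordered) - k]
    return sum(trimmed) / len(trimmed)

# Made-up data with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain = sum(sample) / len(sample)
robust_mean = trimmed_mean(sample, 0.1)      # drops 1 and 100
print(plain, robust_mean)
```

The extreme value pulls the plain mean far above the typical observation, whereas the trimmed mean is unaffected by it.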

## Median

In [None]:
# Median by pandas dataframe method
state['Population'].median()

4436369.5

## Weighted mean

In [None]:
# Weighted mean by average function in NumPy
np.average(state['Murder.Rate'], weights=state['Population'])

4.445833981123393
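
The weighted mean is the sum of each value times its weight, divided by the sum of the weights. A minimal sketch with made-up rates and populations (not taken from the dataset):

```python
# Hypothetical data: three rates with their population weights
rates = [5.0, 3.0, 4.0]
pops = [1_000, 3_000, 6_000]

# sum(w * x) / sum(w)
weighted = sum(r * p for r, p in zip(rates, pops)) / sum(pops)
unweighted = sum(rates) / len(rates)

print(weighted, unweighted)
```

The weighted result is pulled toward the rates of the larger populations.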

## Weighted median

In [None]:
# Weighted median by median function in wquantiles
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

4.4
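
One simple definition of the weighted median is the first value, in sorted order, at which the running total of weights reaches half of the total weight. The sketch below implements that definition with a hypothetical function and made-up data; it is not the `wquantiles` implementation, which may interpolate and so return a slightly different value.

```python
def weighted_median(values, weights):
    """First sorted value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value

# Hypothetical data: three rates with their population weights
result = weighted_median([5.0, 3.0, 4.0], [1_000, 3_000, 6_000])
print(result)
```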

## Standard deviation

Note that the result is a sample standard deviation: by default pandas divides by n - 1 (`ddof=1`) rather than by n.

In [None]:
# Sample standard deviation by pandas dataframe method
state['Population'].std()

6848235.347401142

In [None]:
# Sample standard deviation of a small dataset
data = [2, 9, 12, 19, 86]
datadf = pd.DataFrame(data)
datadf.std()

0    34.311806
dtype: float64
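
The same small dataset can be used to contrast the sample and population standard deviations with the standard library alone. In this sketch, `statistics.stdev` divides the sum of squared deviations by n - 1 and `statistics.pstdev` divides by n.

```python
import statistics

data = [2, 9, 12, 19, 86]

sample_sd = statistics.stdev(data)        # divides by n - 1
population_sd = statistics.pstdev(data)   # divides by n

print(sample_sd)
print(population_sd)
```

In pandas, passing `ddof=0` to `.std()` gives the population version.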

## Interquartile range (IQR)

In [None]:
# Interquartile range (IQR) by pandas dataframe method
state['Population'].quantile(0.75) - state['Population'].quantile(0.25)

4847308.0
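
The IQR can also be sketched with the standard library (Python 3.8 or later). This example uses made-up data and assumes that `method="inclusive"` corresponds to the linear interpolation pandas applies by default.

```python
import statistics

data = [2, 9, 12, 19, 86]

# Cut points that split the data into four equal parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
```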

## Median absolute deviation from the median (MAD)

In [None]:
# MAD by the mad function in the robust module of statsmodels
robust.mad(state['Population'])

3849876.1459979336
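
As a rough sketch of what this statistic measures, the code below computes the median of the absolute deviations from the median and then divides by a normal-distribution constant (about 0.6745), which is the scaling `robust.mad` applies by default. The function name and data are hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
import statistics

def scaled_mad(values):
    """Return the raw MAD and the MAD scaled for consistency with the normal."""
    center = statistics.median(values)
    raw = statistics.median([abs(v - center) for v in values])
    # 75th percentile of the standard normal, roughly 0.6745
    c = statistics.NormalDist().inv_cdf(0.75)
    return raw, raw / c

raw_mad, mad = scaled_mad([2, 9, 12, 19, 86])
print(raw_mad, mad)
```

The scaling makes the MAD comparable to the standard deviation when the data are normally distributed.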

# Built-in functions for descriptive statistics

## Maximum function `max()` and minimum function `min()`

These functions can be applied to tuples, lists and pandas dataframe columns.

In [None]:
# Maximum function max() applied to individual arguments; max((36, 27, 12)) with a tuple works the same way
max(36, 27, 12)

36

In [None]:
# Minimum function min() applied to a list
min([36, 27, 12])

12

In [None]:
# Maximum function max() applied to a pandas dataframe column
max(state['Population'])

37253956

## `sum()` and `len()`

Python's built-in function `sum()` is an efficient way to sum a list of numeric values.

Python's built-in function `len()` returns the length of an object. For example, it can return the number of items in a list. You can use the function with many different data types. However, not all data types are valid arguments for `len()`.

In [None]:
# Create a list of grades
grades = [85, 93, 45, 89, 85]

Calculate the mean grade by dividing the total by the number of grades:

In [None]:
# Mean grade
sum(grades) / len(grades)

79.4

## `mean()`, `median()` and `mode()` functions in `statistics` module

The Python Standard Library's `statistics` module provides functions for calculating the mean, median and mode. Each function's argument must be an *iterable*, such as a tuple, a list or a pandas dataframe column.

To use these capabilities, first import the `statistics` module:

In [None]:
# Import statistics module
import statistics

### `mean()`

In [None]:
# Function mean() applied to a list
statistics.mean(grades)


79.4

In [None]:
# Function mean() applied to a pandas dataframe column
statistics.mean(state['Population'])

6162876.3

### `median()`

In [None]:
# Function median() applied to a list
statistics.median(grades)

85

### `mode()`

In [None]:
# Function mode() applied to a list
statistics.mode(grades)

85

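Before Python 3.8, the `mode()` function raised a `StatisticsError` for lists like [85, 93, 45, 89, 85, 93] in which there are two or more "most frequent" values. Since Python 3.8, it returns the first such value encountered, which is 85 for this list.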
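
When all of the most frequent values are needed, `statistics.multimode()` (available since Python 3.8) returns them as a list. A minimal sketch, using the hypothetical name `tied_grades`:

```python
import statistics

# Two values, 85 and 93, are tied for most frequent
tied_grades = [85, 93, 45, 89, 85, 93]

modes = statistics.multimode(tied_grades)
print(modes)
```

The modes are listed in the order in which they first appear in the data.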

## `sorted()` function

To confirm that the median and mode are correct, you can use the built-in `sorted()` function to get a copy of `grades` with its values arranged in increasing order:

In [None]:
# Sort the object grades
sorted(grades)

[45, 85, 85, 89, 93]

## `pvariance()` and `pstdev()` in `statistics` module

In [None]:
# Create the data
die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

In [None]:
# Population variance
statistics.pvariance(die)

2.25

In [None]:
# Population standard deviation
statistics.pstdev(die)

1.5

## `sqrt()` function in `math` module

Passing the `pvariance()` function's result to the `math` module's `sqrt()` function confirms the population standard deviation is 1.5:

In [None]:
# Standard deviation by sqrt() function
import math      # Import the math module
math.sqrt(statistics.pvariance(die))

1.5

# This is the end of the document

In [None]:
# Mean of the die rolls
die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]
statistics.mean(die)

3.5

In [None]:
# Deviation of each roll from the mean of 3.5
diff = [x - 3.5 for x in die]
diff

[-2.5, -0.5, 0.5, -1.5, 2.5, 1.5, -0.5, 0.5, 1.5, -1.5]
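
These deviations are the first step of a manual population variance calculation. As a sketch that continues the idea with hypothetical names, squaring them, averaging, and taking the square root reproduces the values 2.25 and 1.5 obtained earlier:

```python
import math

die = [1, 3, 4, 2, 6, 5, 3, 4, 5, 2]

mean = sum(die) / len(die)
deviations = [x - mean for x in die]

# Population variance: the mean of the squared deviations
squared = [d ** 2 for d in deviations]
variance = sum(squared) / len(die)
std_dev = math.sqrt(variance)

print(variance, std_dev)
```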