# Python Practice Lecture 4 MATH 342W Queens College
# - More Data Structures, Pandas, and Functions
## Author: Amir ElTabakh
## Date: February 8, 2022

## Agenda:
* More data structures
* Pandas
* Functions

In this demo we will look at other ways of storing collections of elements, namely tuples and sets. We will then further explore dictionaries, introduce the Pandas DataFrame object, and finish with functions.

Before jumping into tuples or sets, let's jog our memories with lists.

In [1]:
# List of some classes in the Data Science major
DS_classes = ['Math 341', 'CS 111', 'Math 231']

In [2]:
# Output element at index 0
DS_classes[0]

In [3]:
# Output second to last element
DS_classes[-2]

In [4]:
# Change the last element to value 'Math 241'
DS_classes[-1] = 'Math 241'
DS_classes

In [5]:
# Add the value 'CS 211' to end of the list
DS_classes += ['CS 211']
DS_classes

In [6]:
# Remove the element at index 1
DS_classes.pop(1)
DS_classes

['Math 341', 'Math 241', 'CS 211']

## Tuples

Another data structure that's similar to the list is the tuple. A tuple is an *immutable* object (it cannot be changed after creation), whereas a list is a *mutable* object (it can be changed). To create a tuple, use parentheses `()`; lists, as we've seen, use square brackets `[]`.

In [7]:
# Tuple of apple cultivars
apples = ("Golden", "Red", "Fuji", "Granny")

In [8]:
# Output the data type of the apples object
type(apples)

tuple

In [9]:
# Call the first element in apples
apples[0]

'Golden'

In [10]:
# Call the first three elements in the apples
apples[0:3]

('Golden', 'Red', 'Fuji')

In [11]:
# Change the first element of apples to 'Honeycrisp'
apples[0] = "Honeycrisp"

# What went wrong?

TypeError: 'tuple' object does not support item assignment
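Because a tuple cannot be modified in place, one common workaround is to build a new tuple instead. The sketch below is only an illustration: the names `new_apples` and `error_message` are introduced here and are not part of the lecture.

```python
# Tuple of apple cultivars, as in the cells above
apples = ("Golden", "Red", "Fuji", "Granny")

# Item assignment on a tuple raises a TypeError, which we can catch
try:
    apples[0] = "Honeycrisp"
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Build a new tuple: a one-element tuple plus a slice of the old one
new_apples = ("Honeycrisp",) + apples[1:]
print(new_apples)

# The original tuple is untouched
print(apples)
```

Note the trailing comma in `("Honeycrisp",)`; without it, Python treats the parentheses as plain grouping and the value is just a string.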

We have seen casting before: it is the process of converting an object from one data type to another compatible one. You can cast a tuple to a list, and a list to a tuple.

In [12]:
# Creating a list
programming_langs = ["Python", "C++", "Java", "JavaScript"]

In [13]:
# Print the type of the programming_langs object
print(type(programming_langs))

<class 'list'>


In [14]:
# Cast the programming_langs object as a tuple
programming_langs_tuple = tuple(programming_langs)

In [15]:
# Print the type of the programming_langs_tuple object
print(type(programming_langs_tuple))

<class 'tuple'>


In [16]:
# Print the object to confirm it is a tuple
print(programming_langs_tuple)

('Python', 'C++', 'Java', 'JavaScript')


In [17]:
# Cast a tuple to a list
programming_langs_list = list(programming_langs_tuple)

In [18]:
# Print the type of the programming_langs_list object
print(type(programming_langs_list))

<class 'list'>


In [19]:
# Let's try to change the first element of programming_langs_list to 'C#'
programming_langs_list[0] = 'C#'
print(programming_langs_list)

['C#', 'C++', 'Java', 'JavaScript']


## Sets

Sets are used to store multiple elements in a single variable, similar to lists, tuples, and dictionaries. The set is the last of the four built-in data types in Python used to store collections of data. Each has different qualities and uses.

* Lists: Mutable, ordered
* Tuples: Immutable, ordered
* Dictionaries: Mutable, key-value paired, do not allow duplicate keys
* Sets: Mutable, unordered, do not allow duplicate elements

The elements of a set cannot be changed in place, but you may remove elements and add new ones.

In [20]:
# Set of cat breeds
cat_set = {"Siamese", "Bengal", "Calico", "Chartreux"}

In [21]:
# Print the length of the set
print(len(cat_set))

4
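To make the set properties above concrete, here is a small sketch that adds and removes elements and shows that duplicates are dropped. The extra breed `"Sphynx"` and the flag `supports_indexing` are made up for this illustration.

```python
# Set of cat breeds; the duplicate "Bengal" is silently dropped
cat_set = {"Siamese", "Bengal", "Calico", "Chartreux", "Bengal"}
print(len(cat_set))

# Add a new element and remove an existing one
cat_set.add("Sphynx")
cat_set.remove("Calico")

# Membership tests work with `in`
print("Sphynx" in cat_set)
print("Calico" in cat_set)

# Sets are unordered, so indexing is not supported
supports_indexing = True
try:
    cat_set[0]
except TypeError:
    supports_indexing = False
print(supports_indexing)
```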


## Back to dictionaries

If we want to talk about Pandas DataFrames, we should explore dictionaries a little more. Below is an example of a dictionary; let's practice some new operations on it.

Note: Below we use an f-string in the print statement. An f-string allows the programmer to directly add the value of a variable into the string using curly braces. Notice how we did not have to cast the non-string values into strings for this to work.

In [22]:
# Defining athlete_1 dict
athlete_1 = {'Name' : 'Max Verstappen',
             'Sport' : 'Formula 1',
             'Team' : 'Red Bull Racing',
             'WDC' : 1,
             'Age' : 24
             }

athlete_1

{'Name': 'Max Verstappen',
 'Sport': 'Formula 1',
 'Team': 'Red Bull Racing',
 'WDC': 1,
 'Age': 24}

In [23]:
# print type of athlete_1
type(athlete_1)

dict

In [24]:
# Use a for loop to iterate over the key-value pairs in the dictionary
for k, v in athlete_1.items():
    print(f'Key: {k} - Value: {v}')

Key: Name - Value: Max Verstappen
Key: Sport - Value: Formula 1
Key: Team - Value: Red Bull Racing
Key: WDC - Value: 1
Key: Age - Value: 24


We can use the `in` and `not in` operators to check whether a value exists in a list, tuple, set, etc. Applied directly to a dictionary, they check its keys; use `.values()` to check its values.

In [25]:
'Name' in athlete_1.keys()

True

In [26]:
'Name' not in athlete_1.values()

True

In [27]:
'Sport' not in athlete_1

False
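The three cells above can be summarized in one short sketch: `in` applied to the dictionary itself looks only at keys. The variable names below are introduced just for this illustration, and the dictionary is a shortened copy of `athlete_1`.

```python
# A shortened copy of the athlete_1 dictionary
athlete_1 = {'Name': 'Max Verstappen', 'Sport': 'Formula 1', 'Age': 24}

# `in` on the dictionary itself checks the keys
key_found = 'Name' in athlete_1

# A value is not found this way...
value_found_in_dict = 'Formula 1' in athlete_1

# ...unless we look in .values() explicitly
value_found_in_values = 'Formula 1' in athlete_1.values()

print(key_found, value_found_in_dict, value_found_in_values)
```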

### The `get()` method

It can become tedious to check whether a key exists in a dictionary before accessing that key's value. Fortunately, dictionaries have a `get()` method that takes two arguments:

- The key of the value to retrieve
- A fallback value to return if that key does not exist

In [28]:
# Redefining it so it's here
athlete_1 = {'Name' : 'Max Verstappen',
             'Sport' : 'Formula 1',
             'Team' : 'Red Bull Racing',
             'WDC' : 1,
             'Age' : 24
             }

# Print the athlete's name
print(f"The name of the athlete is {athlete_1.get('Name')}.")

The name of the athlete is Max Verstappen.


In [29]:
# Print the athlete's team
team = athlete_1.get('Team')
print(f"The athlete plays for {team}.")

The athlete plays for Red Bull Racing.


In [30]:
# Print the athlete's nationality
fallback = "-oh, I am not sure."
print(f"The athlete is from {athlete_1.get('Nationality', fallback)}.")

The athlete is from -oh, I am not sure..


In [31]:
# What happens if you use the get method to find the value of a key, but that key does not exist in the dictionary
print(athlete_1.get('Nationality'))

None
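For contrast, here is a sketch of what happens when we access a missing key with square brackets instead of `get()`. The fallback string `'unknown'` and the variable names are assumptions made for this example.

```python
# A shortened copy of the athlete_1 dictionary
athlete_1 = {'Name': 'Max Verstappen', 'Team': 'Red Bull Racing'}

# Square-bracket access raises a KeyError when the key is missing
try:
    athlete_1['Nationality']
except KeyError as err:
    missing_key = err.args[0]
    print(f"KeyError: {missing_key!r}")

# get() returns the fallback instead of raising
nationality = athlete_1.get('Nationality', 'unknown')
print(nationality)

# get() does not add the key to the dictionary
print('Nationality' in athlete_1)
```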


## Pandas

Pandas is a Python library used for data manipulation and analysis. Its name is a play on "Python Data Analysis", and it was released as an open-source library in 2009 by Wes McKinney.

Pandas does not come with standard Python. Python is open source, and developers create new libraries all the time and publish them as open-source packages for others to install and use. To install Pandas on our machine we will use pip, the standard package manager for Python, which lets you install and manage additional packages. The Python installer includes pip, so it should be ready for us to use. Verify that pip is installed by running the following command:

In [32]:
!pip --version

pip 21.3.1 from E:\Users\amira\anaconda\lib\site-packages\pip (python 3.8)



The cell above should return the version of your pip as well as where it is stored on your machine. Note that in a notebook, such as this one in Jupyter, we can run shell commands by starting a line with an exclamation mark `!`.

In [33]:
# Update pip
!python -m pip install --upgrade pip

Collecting pip
  Downloading pip-22.0.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.3.1
    Uninstalling pip-21.3.1:
      Successfully uninstalled pip-21.3.1
Successfully installed pip-22.0.2




In [34]:
# Install Pandas on your machine
!pip install pandas





Now that we've installed Pandas, let's import the library. Note that we only have to install a library once per machine, but we have to import it in every program in which we wish to use it.

---

Pandas is the most common Python library for data analytics and data wrangling. Thankfully, there's a lot of documentation for us to use in case we get stuck.

https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide

What is a Pandas DataFrame? Well, let's refer to the documentation.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame

A Pandas DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with rows and columns. 

Features of a DataFrame:
- Columns can be of different types
- Size is mutable (rows and columns can be added or removed)
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns
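The features listed above can be seen in a few lines. This is a minimal sketch on made-up toy data; `fruit_df` and its columns are hypothetical and not part of the lecture.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration
fruit_df = pd.DataFrame({'Fruit': ['Apple', 'Pear'],
                         'Price': [0.50, 0.75],
                         'Quantity': [4, 2]})

# Columns can hold different types
print(fruit_df.dtypes)

# Arithmetic on columns is element-wise
fruit_df['Total'] = fruit_df['Price'] * fruit_df['Quantity']

# Size is mutable: we just added a fourth column
print(fruit_df.shape)

# Both axes are labeled
print(list(fruit_df.columns))
print(list(fruit_df.index))
```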

In [35]:
# import the pandas library
import pandas as pd

# Create a dictionary of the chocolate bars
menu = {'Item Name': ['Snickers', 'Twix', 'KitKat', 'M&Ms'],
       'Price':[0.25, 0.49, 2.50, 1.00],
       'Mini':[1, 1, 0, 0],
       'Family Size':[False, False, True, False]
       }

# Output dictionary
menu

{'Item Name': ['Snickers', 'Twix', 'KitKat', 'M&Ms'],
 'Price': [0.25, 0.49, 2.5, 1.0],
 'Mini': [1, 1, 0, 0],
 'Family Size': [False, False, True, False]}

In [36]:
# Convert the dictionary above into a Pandas DataFrame
menu_df = pd.DataFrame(menu)
menu_df

  Item Name  Price  Mini  Family Size
0  Snickers   0.25     1        False
1      Twix   0.49     1        False
2    KitKat   2.50     0         True
3      M&Ms   1.00     0        False


This is our first DataFrame; let's practice some useful operations on it.

In [37]:
# Get the Mini column as a DataFrame (double square brackets)
menu_df[['Mini']]

   Mini
0     1
1     1
2     0
3     0


In [38]:
# Get the Mini column as a Series (single square brackets)
menu_df['Mini']

0    1
1    1
2    0
3    0
Name: Mini, dtype: int64

The difference between the cell above and the one before it is the single pair of square brackets. The outputs also look different. Let's check the data type of each.

We will find that selecting a column with a single pair of brackets returns a Series, whereas double brackets return a DataFrame. What is a Series?

### Series
The Pandas Series object is a one-dimensional `ndarray` with axis labels. It is analogous to an indexed one-dimensional column vector. We will explore Series more in future demos. Back to DataFrames.
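Before moving on, here is a minimal sketch of a Series built directly, with the item names as axis labels. The object `prices` is introduced only for this example; it reuses the menu prices from above.

```python
import pandas as pd

# A Series of prices, labeled by item name
prices = pd.Series([0.25, 0.49, 2.50, 1.00],
                   index=['Snickers', 'Twix', 'KitKat', 'M&Ms'],
                   name='Price')

# Elements can be looked up by label
print(prices['Twix'])

# A Series is one-dimensional
print(prices.shape)

# to_frame() turns the Series into a one-column DataFrame
prices_df = prices.to_frame()
print(prices_df.shape)
```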

In [39]:
print(type(menu_df['Item Name']))
print(type(menu_df[['Item Name']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [40]:
# Return the data type of each column in menu_df
menu_df.dtypes

Item Name       object
Price          float64
Mini             int64
Family Size       bool
dtype: object

In [41]:
# Print a concise summary of menu_df
menu_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Item Name    4 non-null      object 
 1   Price        4 non-null      float64
 2   Mini         4 non-null      int64  
 3   Family Size  4 non-null      bool   
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 228.0+ bytes


In [42]:
# Print out the size of menu_df
print(menu_df.size)

16
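The `size` of 16 is simply the number of rows times the number of columns. As a small sketch, the `shape` attribute and `len()` give the pieces separately; the menu dictionary is repeated so the example runs on its own.

```python
import pandas as pd

# Same menu data as above, repeated so this example is self-contained
menu_df = pd.DataFrame({'Item Name': ['Snickers', 'Twix', 'KitKat', 'M&Ms'],
                        'Price': [0.25, 0.49, 2.50, 1.00],
                        'Mini': [1, 1, 0, 0],
                        'Family Size': [False, False, True, False]})

# shape is a (rows, columns) tuple
n_rows, n_cols = menu_df.shape
print(menu_df.shape)

# size is the total number of cells
print(menu_df.size == n_rows * n_cols)

# len() of a DataFrame is its number of rows
print(len(menu_df))
```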


Nothing too exciting here. We just explored how to gather different characteristics of our DataFrame. Let's take a deeper dive.

Let's create a more complex DataFrame.

In [43]:
list_of_names = ["Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
                      "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
                      "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
                      "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
                      "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
                      "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
                      "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
                      "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
                      "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
                      "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
                      "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
                      "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
                      "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
                      "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
                      "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
                      "Lincoln"]

# Generating list of n many salaries
from random import gauss

n = len(list_of_names)
mu = 50000
sigma = 20000
list_of_salaries = []

for i in range(n):
    list_of_salaries += [int(gauss(mu, sigma))]
    

# Generating list of n many past_crime_severity values
# We will use the numpy library
import numpy as np
items = ["no crime", "infraction", "misdemeanor", "felony"]
probs = [.50, .40, .08, .02]
list_of_past_crime_severity = np.random.choice(items, n, p = probs) # run `help(np.random.choice)` to read documentation


# Generating list of n many has_past_unpaid_loan values

list_of_has_past_unpaid_loan = np.random.binomial(n = 1, size = n, p = 0.2)




# Initializing Pandas DataFrame
df = pd.DataFrame({'Salary' : pd.Series(list_of_salaries, index = list_of_names),
                  'past_crime_severity' : pd.Series(list_of_past_crime_severity, index = list_of_names),
                  'has_past_unpaid_loan' : pd.Series(list_of_has_past_unpaid_loan, index = list_of_names)})

df

           Salary  past_crime_severity  has_past_unpaid_loan
Sophia      48968           infraction                     0
Emma        44881             no crime                     0
Olivia      85312             no crime                     1
Ava         57685           infraction                     0
Mia         18735               felony                     0
...           ...                  ...                   ...
Christian   46384           infraction                     0
Andrew      44982           infraction                     0
Brayden     61932           infraction                     0
John        30008             no crime                     0


In [44]:
# Let's read the documentation to see what univariate distributions are available in the numpy.random module
import numpy
#help(numpy.random)

In [45]:
# Capture a snapshot of df
df.head(10) # default is 5

           Salary  past_crime_severity  has_past_unpaid_loan
Sophia      48968           infraction                     0
Emma        44881             no crime                     0
Olivia      85312             no crime                     1
Ava         57685           infraction                     0
Mia         18735               felony                     0
Isabella    62240             no crime                     0
Riley       42420             no crime                     1
Aria        12347             no crime                     1
Zoe         41652           infraction                     0
Charlotte   36816             no crime                     0


50% of people have no crime, 40% have an infraction, 8% a misdemeanor, and 2% a felony. There is a 20% chance that any individual has a past unpaid loan. Is this a reasonable way to fabricate this dataset? No, since salary and failing to pay back a loan are dependent random variables, yet we generated them independently. We will ignore this for now.

It would be nice to see a summary of values. Would median and mean be appropriate here? Not for categorical variables!

In [46]:
# You can view summary statistics of each feature with the .describe() method
df.describe()

             Salary  has_past_unpaid_loan
count         100.0                 100.0
mean       46831.11                  0.15
std    18832.121593               0.35887
min         -6722.0                   0.0
25%         36194.0                   0.0
50%         46807.0                   0.0
75%        60107.25                   0.0
max         91471.0                   1.0


In [47]:
# Get column labels
df.columns

Index(['Salary', 'past_crime_severity', 'has_past_unpaid_loan'], dtype='object')

In [48]:
# get data types of columns
df.dtypes

Salary                   int64
past_crime_severity     object
has_past_unpaid_loan     int32
dtype: object

`has_past_unpaid_loan` should not be an integer value! The difference between someone paying vs. not paying a loan is not the value 1, it is a boolean value. Let's cast the feature as a boolean instead of an integer value.

In [49]:
# Cast the has_past_unpaid_loan feature as bool
df['has_past_unpaid_loan'] = df['has_past_unpaid_loan'].astype('bool')

df.head()

        Salary  past_crime_severity  has_past_unpaid_loan
Sophia   48968           infraction                 False
Emma     44881             no crime                 False
Olivia   85312             no crime                  True
Ava      57685           infraction                 False
Mia      18735               felony                 False


In [50]:
df['past_crime_severity'].describe()

count          100
unique           4
top       no crime
freq            45
Name: past_crime_severity, dtype: object
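For a categorical feature, a frequency table is usually more informative than a mean or median. The sketch below uses `value_counts()` on a small made-up sample; the Series `severity` is hypothetical and much shorter than the column in `df`.

```python
import pandas as pd

# Hypothetical small sample of a categorical feature
severity = pd.Series(['no crime', 'infraction', 'no crime',
                      'felony', 'no crime', 'infraction'])

# Count how often each category appears (most frequent first)
counts = severity.value_counts()
print(counts)

# normalize=True gives proportions instead of counts
proportions = severity.value_counts(normalize=True)
print(proportions)
```

The first row of `counts` corresponds to the `top` and `freq` entries reported by `describe()`.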

Here are some basic summary functions to be aware of:

In [51]:
# min
print(f"Min: {df['Salary'].min()}")

Min: -6722


In [52]:
# max
print(f"Max: {df['Salary'].max()}")

Max: 91471


In [53]:
# mean
print(f"Mean: {df['Salary'].mean()}")

Mean: 46831.11


In [54]:
# median
print(f"Median: {df['Salary'].median()}")

Median: 46807.0


In [55]:
# mode
print(f"Mode: {df['Salary'].mode()}") # Can you understand what is going on here?

Mode: 0     -6722
1     -1341
2     10215
3     12347
4     13408
      ...  
95    81997
96    82273
97    83334
98    85312
99    91471
Length: 100, dtype: int64
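A hint for the question above: `mode()` returns a Series because several values can tie for most frequent. When every value is distinct, as with our 100 salaries, every value is a mode. A small sketch on made-up numbers:

```python
import pandas as pd

# All values distinct: every value ties for most frequent
all_distinct = pd.Series([30, 10, 20])
distinct_modes = all_distinct.mode().tolist()
print(distinct_modes)  # [10, 20, 30] (modes are returned sorted)

# One value repeated: a single mode
with_repeat = pd.Series([30, 10, 20, 10])
repeat_modes = with_repeat.mode().tolist()
print(repeat_modes)  # [10]
```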


In [56]:
# standard deviation
print(f"Std: {int(df['Salary'].std())}")

Std: 18832


In [57]:
# variance
print(f"Variance: {int(df['Salary'].var())}")

Variance: 354648803


In [58]:
# Quantile function
print(f"Quantile: {df['Salary'].quantile(0.2)}")

Quantile: 32186.0


In [59]:
# Number of distinct elements
print(f"Distinct Values: {df['Salary'].nunique()}")

Distinct Values: 100


In [60]:
# Calculate interquartile range
q3, q1 = np.percentile(df['Salary'], [75, 25])
iqr = q3 - q1
print(f"IQR: {iqr}")

IQR: 23913.25
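The same interquartile range can also be computed with the pandas `quantile()` method we used a few cells earlier. This sketch uses a small made-up Series, `salaries`, so the arithmetic is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, chosen so the quartiles are easy to verify
salaries = pd.Series([20000, 30000, 40000, 50000, 60000])

# IQR with pandas
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr_pandas = q3 - q1

# IQR with numpy, as in the cell above
q3_np, q1_np = np.percentile(salaries, [75, 25])
iqr_numpy = q3_np - q1_np

print(f"IQR (pandas): {iqr_pandas}")
print(f"IQR (numpy): {iqr_numpy}")
```

Both calls use linear interpolation by default, so they agree.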


Great work! We've created a Pandas DataFrame object and explored and manipulated the data. This DataFrame is our training set `D`. We are missing one final variable, the response! Let's add it and say that 90% of people are creditworthy, i.e. they paid back their loan.

In [61]:
# Creating response variable (this is your y)
df['paid_back_loan'] = np.random.binomial(n = 1, size = n, p = 0.9).astype('bool')

df.head()

        Salary  past_crime_severity  has_past_unpaid_loan  paid_back_loan
Sophia   48968           infraction                 False           False
Emma     44881             no crime                 False            True
Olivia   85312             no crime                  True            True
Ava      57685           infraction                 False            True
Mia      18735               felony                 False            True


Conceptually, why does this not make sense? `y` is independent of `X`, so what happens then? No function `f` can ever have any predictive or explanatory power! This is just a silly example to explore the Pandas library. We will work with real data soon. Don't worry.

This was only a brief glimpse of what Pandas can do. We'll be working with Pandas in almost every demo, so we'll gain more exposure to the library with time.

---

## Functions

A function is a block of code which only runs when it is called. The functions we have worked with so far come from the standard library or imported libraries, but we can also create our own! When defining a function, you declare the parameters it will accept; the values you pass in when calling the function are called arguments (nomenclature counts).

A function definition entails:
- The `def` keyword
- The name of the function followed by parentheses `()`
- Optionally, parameters defined inside the parentheses
- A colon
- Starting on a new line and indented, the body of the function to execute

In [62]:
# Our first function
def hello():
    print("Hello!")

Cool, our first function! How come nothing was output? We did not call the function; let's do that now.

In [63]:
hello()

Hello!


A function can take multiple arguments. An argument is information that is passed into a function. There is a subtle difference between an argument and a parameter.

* A parameter is the variable listed inside the parentheses in the function definition.
* An argument is the value that is sent to the function when it is called.

Let's edit the `hello()` function to include a parameter called `name`, so that we can include the person's name in the print statement. When we call the function, we'll pass our own name as an argument.

You can give a parameter a default value, which is used when the corresponding argument is not passed in the function call.

The `return` statement is exactly what you think it is. Once it is reached, the interpreter exits the function and returns the value of the expression that follows it. A `return` statement is optional; a function without one returns `None`.

In [64]:
def hello(name, age = 40):
    hello_statement = f"Hello {name}! I am {age} years old."
    print(hello_statement)
    
hello("Amir", 21)

hello("Amir")

hello()
# What is wrong with the line above?

Hello Amir! I am 21 years old.
Hello Amir! I am 40 years old.


TypeError: hello() missing 1 required positional argument: 'name'
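The `hello()` function above prints its message but does not return anything. As a sketch of the difference, compare a function that returns its result with one that only prints. The names `make_greeting` and `print_greeting` are made up for this example.

```python
def make_greeting(name, age=40):
    """Build the greeting and return it instead of printing it."""
    return f"Hello {name}! I am {age} years old."

def print_greeting(name):
    """Print a greeting; there is no return statement."""
    print(f"Hello {name}!")

# The returned string can be stored and reused
greeting = make_greeting("Amir", age=21)
print(greeting)

# The default age is used when no argument is passed for it
default_greeting = make_greeting("Amir")
print(default_greeting)

# A function without a return statement returns None
result = print_greeting("Amir")
print(result)
```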

Let's take a look at a fun function. Those of you in CS 220 might be learning about encryption and decryption. One cipher you learn about is the shift cipher, where you replace each letter with the letter `k` positions away. The encryption function for the shift cipher is generally `f(p) = (p + k) % 26`, where `p` is the position of a letter in the alphabet and `k` is a given key; decryption shifts back with `(p - k) % 26`. Here is a Python function that decrypts a shift cipher.

In [65]:
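# Shift Cipher Decryptor Function
def shift_cipher_decrypt(coded_message, key):
    # Split the message into a list of single characters
    coded_message = list(coded_message)
    decrypted_message = ''
    # Build the list of uppercase letters 'A' through 'Z'
    letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
    letters = letters.split()
    
    for letter in coded_message:
        if letter in letters:
            # Shift the letter back by `key` positions, wrapping around the alphabet
            x = letters.index(str(letter))
            char = (x - key) % 26
            decrypted_message += letters[char]
            
        else:
            # Any character that is not an uppercase letter becomes a space
            decrypted_message += ' '
            
    # Print the result; with no return statement the function returns None
    print(decrypted_message)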

Consider the following message encrypted with a simple shift cipher: "CKKZ SKNG".
Use the key `100` to decrypt the message.

In [66]:
shift_cipher_decrypt("CKKZ SKNG", 100)

GOOD WORK
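Why does a key as large as `100` work? Because of the `% 26`, only the remainder of the key matters. The sketch below walks through the arithmetic for the first letter, `C`; the variable names are introduced just for this illustration.

```python
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key = 100

# Only the remainder of the key modulo 26 matters
effective_key = key % 26
print(effective_key)  # 22

# Position of the encrypted letter 'C' in the alphabet
p = letters.index("C")
print(p)  # 2

# Shift back by the key and wrap around the alphabet
decrypted_index = (p - key) % 26
decrypted_letter = letters[decrypted_index]
print(decrypted_index, decrypted_letter)  # 6 G

# Using the effective key gives the same result
same_index = (p - effective_key) % 26
print(same_index == decrypted_index)  # True
```

Note that in Python the `%` operator returns a non-negative result when the divisor is positive, which is why `(2 - 100) % 26` lands on a valid index.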


Below is a Python function to encrypt messages with the shift cipher. To encrypt and decrypt the same message, make sure you use the same key! These two functions might look intimidating, but it's nothing we haven't already gone over. Go through each function line by line and make sense of each step. Now you're a Python programmer and a spy!

In [67]:
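# Shift Cipher Encryptor Function
def shift_cipher_encrypt(message, key):
    # Split the message into a list of single characters
    message = list(message)
    encrypted_message = ''
    # Build the list of uppercase letters 'A' through 'Z'
    letters = "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
    letters = letters.split()
    
    for letter in message:
        if letter in letters:
            # Shift the letter forward by `key` positions, wrapping around the alphabet
            x = letters.index(str(letter))
            char = (x + key) % 26
            encrypted_message += letters[char]
            
        else:
            # Any character that is not an uppercase letter becomes a space
            encrypted_message += ' '
            
    # Print the result; with no return statement the function returns None
    print(encrypted_message)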

In [68]:
shift_cipher_encrypt("GOOD WORK", 100)

CKKZ SKNG
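Since encryption and decryption differ only in the sign of the key, the two functions can be folded into one. The sketch below is one possible variant, not the lecture's version: the function `shift` is a name made up here, it returns the string instead of printing it, and it leaves non-letter characters unchanged rather than replacing them with spaces.

```python
import string

def shift(message, key):
    """Shift uppercase letters by `key` positions; leave other characters unchanged."""
    letters = string.ascii_uppercase
    shifted = []
    for char in message:
        if char in letters:
            # Move `key` positions along the alphabet, wrapping around
            shifted.append(letters[(letters.index(char) + key) % 26])
        else:
            shifted.append(char)
    return "".join(shifted)

# Encrypt with the key, decrypt with the negated key
encrypted = shift("GOOD WORK", 100)
decrypted = shift(encrypted, -100)
print(encrypted)  # CKKZ SKNG
print(decrypted)  # GOOD WORK
```

Returning the string makes the round trip easy to check, since the result of one call can be passed straight into the next.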
