# **Intro to Data Wrangling using Python**
To start learning more advanced and efficient data wrangling, we will learn the basics of Python programming, with a specific focus on the libraries and methods, both built into the language and added onto it, that are used to analyze data. This exercise will introduce you to the following basic programming and Python principles:
- Comments
- Python Libraries
- Variables and data types (int, float, boolean, string, list, array, and more) with the assignment operator "="
- Conditions and conditional statements (if/elif/else) with comparison operators (==, >, <, "in", etc.)
- Functions
- Loops (for, while, and nested loops)
- Data Visualization

We will go over these concepts, and at the end we will put them into practice with a problem: <u>StarLift Airlines Trip Logs</u>.

## **0.0 - Intro to IPYNB (file extension .ipynb)**

<u>**IPYNB**</u> stands for "Interactive Python Notebook" and is the file format originally created for Jupyter Notebook. Since then, other <u>**IDEs**</u> (Integrated Development Environments) have added support for the file type. These files can be a great way to learn and experiment with Python since they let us separate code into sections and run each section individually. They also provide a great way to visualize and display outputs, as you will see later on when we output lines of text or even a whole graph.

As you read through this notebook feel free to add in your own notes, edit text or code where helpful, and try writing code yourself by adding in blocks. You can also try creating your own IPYNB file to follow along! Hovering under a code or text block will show you two buttons, one that says "Code" and one that says "Text" or "Markdown", depending on what system you are using to open this file. Clicking either one of those buttons will create a new block for its labeled purpose.

**IMPORTANT NOTE:** run every code block as you go through this file! You may have to go back and run some code blocks for others to work if you exit the file and come back. You can run a code block by pressing the play button that appears in the top left corner of the code block, or by pressing *Ctrl + Enter*.

## **0.1 - Comments**

In [None]:
# This is a comment! We can do a single line comment by using "#" at the beginning of the line. This is the
# most common type of comment...

import pandas as pd # You can have code followed by a comment, as seen here. Anything after "#" is a comment, 
                    # while the text before is code. We'll get into what this code does next

'''
...but we can also do a multi-line comment by using three single quotes. You can see that as we continue to type in this comment, we just
need to do the three single-quotes at the beginning and end instead of starting each line with "#". Technically this is a string that is
never assigned to a variable, which is why it works as a comment. This comment style is less common since we usually put large blocks of
text in other formats, like markdown above or README documents.
'''

## **0.2 - Python Libraries and Import Statements**

We can do a lot with basic Python functionality, but importing libraries makes work easier and faster. A <u><b>Python library</b></u> is a collection of code, or modules of code, that someone else has already written and that we can use in a program for specific operations. In many parts of our code below we will use the Python libraries <u><b>Pandas</b></u>, <u><b>NumPy</b></u>, <u><b>Math</b></u>, and <u><b>Matplotlib</b></u>, which let us perform operations that would be difficult or tedious in plain Python. We will also play around with an additional library called <u><b>FPDF</b></u> for exporting our findings.

In [None]:
# Python pandas is a powerful data manipulation and analysis library. We will use it to read in our data and 
# do some of our data manipulation, specifically with tables.
import pandas as pd # We import pandas, then give it a shorter name we can reference it by. "pd" is a common abbreviation.

# NumPy is a powerful library for numerical computing. We will also use it to do some of our data manipulation.
import numpy as np # We import numpy then give it the abbreviation "np" to reference in our code

# Math is a built-in Python library that provides mathematical functions and constants.
import math

# Matplotlib is a plotting library for Python. We will use it to create visualizations of our data. Specifically we are
# importing the "pyplot" module from matplotlib since we don't need to reference/use the entire library. We specify this
# with the "." after "matplotlib" and then the name of the module we want to import.
import matplotlib.pyplot as plt # Import and give it the abbreviation "plt"

# FPDF is a Python library for creating PDFs. We will use it to create a PDF of our data at the very end of this walkthrough.
from fpdf import FPDF

Try running the code above if you haven't already. If it worked, great! You can keep going. We would expect it to work without any extra steps if you are running this IPYNB in an environment that has these libraries preloaded (like Google Colab), or if you have installed them in the past. If you got an error, we have a bit more to do. Luckily it's not too difficult, but it does require using the terminal. You can open a terminal window in VS Code by going to the "Terminal" option in the toolbar and clicking "New Terminal". If you are using a different IDE, you may need to look online to find out how to do this, or skip to the next code block. Once the terminal is open, you can run the following commands one at a time to install the libraries:

> <pre>pip install pandas</pre>

> <pre>pip install numpy</pre>

> <pre>pip install matplotlib</pre>

> <pre>pip install fpdf</pre>

Later we will demonstrate how to export to an Excel file, so you will also need to run this command:
> <pre>pip install openpyxl</pre>

While that is not a library we will explicitly reference like the rest, it is used by Pandas. If you ran the install successfully in something like VS Code, you would see something similar to this:

<img src="./Course%20Resources/terminalPipInstallExample.png" alt="An example of a pip install for openpyxl" width="500">

If you don't want to use the terminal, you can alternatively run the code block below, though it is never a bad idea to get familiar with the terminal.

In [None]:
# The following will work with most IDEs, but you may need to adjust based on your environment
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install fpdf
%pip install openpyxl

# Once you have run this, run the code block above again to actually import the libraries

An important part of coding is being able to read documentation so we know how to use other people's code and the libraries we've been provided. <u>**Code documentation**</u>, with respect to libraries, is a good way to learn which commands you can use, how to use them, and whether the library will be helpful as you work towards your end goal. Here is the documentation for the above libraries:
- <a href="https://pandas.pydata.org/docs/">Pandas</a>
- <a href="https://numpy.org/doc/1.26/index.html">NumPy</a>
- <a href="https://docs.python.org/3/library/math.html">Math</a>
- <a href="https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html">Matplotlib, specifically pyplot</a>
- <a href="https://pyfpdf.readthedocs.io/en/latest/index.html">FPDF</a>

Don't worry about fully understanding everything about these libraries, but just know this is a resource you have, in addition to using search engines like Google and AI chatbots like ChatGPT. Another resource to be aware of is every programmer's best friend, <a href="https://stackoverflow.com/">Stack Overflow</a>.
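
Python also carries documentation inside the language itself, which can be a quick first stop before searching online. The sketch below is a minimal illustration using the built-in `help()` and `dir()` functions and the `__doc__` attribute on `math.sqrt`; the variable names `sqrt_doc` and `math_names` are made up for this example.

```python
import math

# Every documented function stores its description in the __doc__ attribute
sqrt_doc = math.sqrt.__doc__
print(sqrt_doc)

# help() prints the same information in a longer, formatted form
help(math.sqrt)

# dir() lists the names a module provides, which is handy for browsing
math_names = [name for name in dir(math) if not name.startswith("_")]
print(math_names[:5])
```

The same calls work on the other libraries once they are imported, for example `help(pd.DataFrame)`.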

## **0.3 - Variables, Data Types, and Intro to Using Libraries**

<b><u>Variables</u></b> are how we store and reference data in our program. There are basic data types built into Python that are frequently used, and there are many others that are more obscure. Python isn't as strict with specifying data types as some other programming languages, but in some cases--especially when dealing with data--we'll want to know how to use them. The ones we will be dealing with include:
- <b><u>Integers</u></b> - a whole number (no decimal point).
- <b><u>Floats</u></b> - a.k.a. floating point number. It is a data type that represents real numbers (has decimal points).
- <b><u>Booleans</u></b> - a data type that can only have one of two values: True or False. We use booleans to control the flow of our code.
- <b><u>String</u></b> - a sequence of characters.
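
As a quick preview of the four basic types listed above, here is a minimal sketch. The variable names and values are hypothetical and are not part of the StarLift data; `type()` is the built-in function that reports a value's data type.

```python
# One example value for each basic data type
trip_count = 42            # int: a whole number
ticket_price = 199.99      # float: a number with a decimal point
is_delayed = False         # bool: either True or False
airport_code = "SLC"       # str: a sequence of characters

# type() reports the data type of the value a variable currently holds
print(type(trip_count))
print(type(ticket_price))
print(type(is_delayed))
print(type(airport_code))

# Python does not lock a variable to one type: the same name can hold a different type later
trip_count = "forty-two"
print(type(trip_count))
```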

The following are what we refer to as <b><u>data structures</u></b> and are ways of storing variables and data in logical and organized ways. There are many more than what are listed below, but this is mainly what we will use in data analysis:
- <b><u>Lists</u></b> - a default structure in Python that can hold multiple elements/variables of different types.
- <b><u>Arrays</u></b> - arrays are not one of Python's built-in data structures, but they are a common programming data structure. The difference between an array and a list is that an array holds only one data type.
- <b><u>Tables</u></b> - a two-dimensional array, or a list of lists.
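
The three structures above can be sketched side by side. This is only an illustration with made-up names (`mixed_list`, `number_array`, `small_table`); it assumes NumPy is installed and uses `np.array` to stand in for a true array.

```python
import numpy as np

# A list can mix data types freely
mixed_list = [1, 2.5, "three", True]
print(mixed_list)

# A NumPy array stores a single data type; mixing ints and floats
# converts every element to a float
number_array = np.array([1, 2.5, 3])
print(number_array.dtype)

# A table as a list of lists: each inner list is one row
small_table = [[1, 2, 3],
               [4, 5, 6]]
print(small_table[1][2])  # second row, third column
```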

In a way we can even create our own data types using what are called classes, though that's beyond the scope of this course. Below we will use some of the Python libraries we've imported to declare variables and explore the different data types and structures. As we do this and throughout the rest of the instruction in this file, consider opening the *Variables View* to see the different information about variables we've created.

#### **0.3.1 - Variable Names**

Variable names in Python are case-sensitive. This means that "my_variable" and "My_Variable" are two different variables. Variables can contain letters, numbers, and underscores, but they can't start with a number and they can't contain spaces or special characters like !, @, #, $, %, etc. Python has a few reserved words that we cannot use for variable names like "if", "else", "for", "while", "import", "def", "class", etc. Variable names should be descriptive and meaningful so code can be easily read and understood, not just by others but by yourself when you come back to it. There are three different "cases" for variable names that are commonly used in coding and are considered standards:
- <b><u>Snake Case (snake_case)</u></b> - names like <i>my_variable</i> or <i>num_1</i> that are all lowercase with underscores between words.
- <b><u>Camel Case (camelCase)</u></b> - names like <i>myVariable</i> or <i>num1</i> that have the first word lowercase and the first letter of each subsequent word capitalized.
- <b><u>Pascal Case (PascalCase)</u></b> - names like <i>MyVariable</i> or <i>Num1</i> that have the first letter of each word capitalized.

Teams and companies often define their own standards for which case to use. Python's official style guide, PEP 8, recommends snake_case for variable and function names and PascalCase for class names.

In addition to these cases there is also a standard across all cases: <b><u>constant variables</u></b>, or variables that are not meant to be changed in the program once they are initialized, are named using all uppercase letters and underscores between words, like CONSTANT_VARIABLE, MY_VARIABLE, or NUM_1.
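
The naming rules above can be checked in code. The sketch below is one possible way to do it, using the string method `str.isidentifier()` and the standard library's `keyword.iskeyword()`; the helper name `is_valid_name` is hypothetical and is not used elsewhere in this notebook.

```python
import keyword

def is_valid_name(name):
    """Return True if name follows Python's variable naming rules."""
    # isidentifier() checks the allowed characters; iskeyword() checks reserved words
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_name("my_variable"))    # letters and underscores are fine
print(is_valid_name("NUM_1"))          # numbers are fine after the first character
print(is_valid_name("1st_variable"))   # starts with a number
print(is_valid_name("variable-name"))  # contains a "-" character
print(is_valid_name("if"))             # reserved word
```

Functions like this one are covered later in the walkthrough; for now the point is only that the rules are mechanical enough to test.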

In [None]:
empty_variable = None # This is how we create a variable with no value. We can use "None" to represent an empty variable. "=" is the assignment operator.

In [None]:
# Here are examples of invalid variable names. Get rid of the multi-line comment to see the errors and code formatting
'''
1st_variable = None                 # This is an invalid variable name because it starts with a number
variable! = None                    # This is an invalid variable name because it contains a "!" character
variable-name = None                # This is an invalid variable name because it contains a "-" character
if = None                           # This is an invalid variable name because it is a reserved word in Python
'''

#### **0.3.2 - Integers**<a id='print'></a>

In [None]:
# Declaring a variable and creating it as an "int" or integer
my_first_variable = 5               # This is how we declare a variable and assign it a value. In this case, we are creating an integer
print(my_first_variable)            # This will print the value of my_first_variable to the console so we can see it

In [None]:
# Let's do some basic math using integers
x = 5
y = 3
z = x + y

print(z) # This will print the value of z to the console

In [None]:
# As an example of case-sensitivity in Python, let's try to print the value of a variable that doesn't exist. Get rid of the multi-line comment 
# to see the errors and code formatting
'''
print(Z) # This will cause an error because "Z" is not the same as "z"
'''

#### **0.3.3 - Floats**

In [None]:
# Declaring a variable and creating it as a "float"
my_first_float = 5.0
print(my_first_float)

In [None]:
# Let's do some basic math using floats
a = 5.1
b = 3.3
c = a + b

# Notice that we have 8.3999999 instead of 8.4. This is due to the way floating point numbers are stored in memory
print(c)

In [None]:
# We can fix those decimal points by using the round function, which is native to Python
c = round(c, 1) # This will round the number to 1 decimal place
print(c)

In [None]:
# Now let's use the math library to do some more complex math
numToGetSquare = 25                 # This is an int, but after we use the sqrt function from the math library, it will be a float
x = math.sqrt(numToGetSquare)       # Get the square root of the number
print(x)                            # This will print the value of x to the console. Notice it has a decimal point, so it is a float

In [None]:
# Here's another example by performing a power operation. Notice each time we use the variable "x" it is 
# being overwritten with a new value
x = math.pow(5, 2)                  # This is how we can use the pow function from the math library
print(x)                            # This will print the value of x to the console

#### **0.3.4 - Booleans**

In [None]:
# Declaring a variable and creating it as a "boolean"
my_first_boolean = True             # Creating a boolean variable and assigning it the value "True"
my_first_boolean = False            # Overwriting the value of the boolean variable to "False"

print(my_first_boolean)

# These become more important as we get to conditional statements and loops

#### **0.3.5 - Strings**

In [None]:
# Declaring a variable and creating it as a "string"
my_first_string = "Hello, World!" # Creating a string variable and assigning it a value
print(my_first_string)

In [None]:
# We can also use single quotes to create a string. This can be useful if we need to use double quotes within the string
my_second_string = 'Hello, "World!"'
my_second_string = "Hello, 'World!'" # Using double quotes to create the string, but single quotes within the string
print(my_second_string)

In [None]:
# Alternatively, we can use the "\" character to escape the quotes
my_third_string = "Hello, \"World!\"" # Using the "\" character to escape the quotes
print(my_third_string)

In [None]:
# String concatenation is the process of combining two strings. print() can also display several values at once
# when we separate them with commas, which looks similar but does not create a new string
string1 = "Hello"
string2 = "World"
print("1.", string1, string2, "!") # Notice that print automatically adds a space between the values

# We can optionally get rid of those spaces
print("2. ", string1, " ", string2, "!", sep="") # We use the "sep" argument to change the separator from a space to nothing

# We can also concatenate the string with the "+" operator, which is more common
string3 = string1 + " " + string2 + "!" # This is how we concatenate two strings. Notice we add the space in the string
print("3.", string3)

# There is also a useful way to format strings using f-strings. This is a newer feature of Python and is very useful
# for formatting strings. We can use f-strings to embed expressions inside string literals, using curly braces {}.
# This is a very powerful feature and can be used to create dynamic strings.
name = "John"
age = 30 # This is an int. The f-string converts its value to text, but the variable itself stays an int
greeting = f"My name is {name} and I am {age} years old." # This is an example of using f-strings
print("4.", greeting)

#### **0.3.6 - Casting**

<b><u>Casting</u></b> is the process of converting a variable from one data type to another. We can cast a variable to a different data type using the built-in functions int(), float(), str(), and bool(). We can also use type() to check a variable's data type. Below are some examples of casting.
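
Casting has a few behaviors that are easy to trip over, so here is a minimal sketch of the most common ones before the step-by-step examples. All variable names are made up for illustration, and the `try`/`except` block is only there to show the error without stopping the cell.

```python
# Text that looks like a number can be cast to a numeric type
trips = int("42")
price = float("199.99")
print(trips, price)

# int() drops the decimal part, while round() rounds to the nearest whole number
truncated = int(5.9)
rounded = round(5.9)
print(truncated, rounded)

# bool() of any non-empty string is True, even when the text says "False"
flag = bool("False")
empty_flag = bool("")
print(flag, empty_flag)

# Casting text that is not a valid number raises a ValueError
try:
    int("forty-two")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```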

In [None]:
# We can cast a variable to an int
my_float = 5.5
my_int = int(my_float)                          # This will cast the float to an int
print(f"{my_int}, type: {type(my_int)}")        # This will print the value of my_int and its type to the console

In [None]:
# We can cast a variable to a float
my_int = 5
my_float = float(my_int) # This will cast the int to a float
print(f"{my_float}, type: {type(my_float)}")

In [None]:
# We can cast a variable to a string
my_int = 5
my_string = str(my_int) # This will cast the int to a string
print(f"{my_string}, type: {type(my_string)}")

In [None]:
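# We can cast a variable to a boolean. Note that bool() does not read the text of a string:
# any non-empty string becomes True (even "False"), and only the empty string "" becomes False
my_string = "True"
my_boolean = bool(my_string) # This will cast the string to a boolean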
print(f"{my_boolean}, type: {type(my_boolean)}")

In [None]:
# Integers can be cast to booleans too. 0 becomes False, and any other integer becomes True
my_boolean = bool(0) # This will cast the int to a boolean
print(f"{my_boolean}, type: {type(my_boolean)}")

#### **0.3.7 - Arrays and Lists**

Arrays and lists are so similar that you can almost use them interchangeably when it comes to syntax in default Python.

In [None]:
# To create an array or a list in Python, we just use square brackets
my_first_list = [1, 3.2, "a", True, None, "None"] # Creating a list. Notice that the list can contain different types
print("1.", my_first_list)

my_first_array = [1, 2, 3, 4, 5] # Creating an array. In default Python this is still a list, we just call it an array
print("2.", my_first_array)

In [None]:
# We can instantiate an empty list
empty_list = [] # This is how we create an empty list
print(empty_list)

In [None]:
# Python's built-in list is powerful but limited in functionality. We can use numpy to create a true array
np_array = np.array(my_first_array) # Creating a numpy array from the list/array we created above
print("1.", np_array)

# Now that we have a numpy array, we can use numpy's powerful array functions and do other math calculations
print("2.", np_array.mean())                            # This will calculate the mean of the array. .mean() is a method that is part of numpy arrays
print("3.", sum(my_first_array) / len(my_first_array))  # This is how we would calculate the mean without numpy; it's more work

In [None]:
# Let's try multiplying everything in the array by two!
print("1.", np_array * 2)           # This will multiply every element in the array by 2.
print("2.", np_array)               # Notice that the original array is unchanged because we didn't store the result in a variable

# Let's try the same thing with the default Python list/array
print("3.", my_first_array * 2)     # This will duplicate the array, not multiply every element by 2

# Multiplying the array by 2 is a lot more complicated without numpy and requires a loop to iterate through each element

In [None]:
# Create a numpy array from scratch
my_array = np.array([1, 2, 3, 4, 5])
print(my_array)

In [None]:
# We can also pre-allocate an array with zeros
my_array = np.zeros(10) # This will create an array with 10 zeros
print(my_array)

# Notice that the array is filled with floats, not integers. That's what the "." is
# Make it an array of integers using dtype (short for "data type")
my_array = np.zeros(10, dtype=int) # This will create an array with 10 zeros, but they will be integers
print(my_array)

# We can also preallocate an array with ones
my_array = np.ones(10, dtype=int) # This will create an array with 10 ones
print(my_array)

In [None]:
# We can access individual elements in an array using their index. The index starts at 0 and we use square brackets to access the element
print(my_array[0]) # This will print the first element in the array

#### **0.3.8 - Tables/Dataframes**

In [None]:
# A table in Python is just a list of lists. We can create a table using nested brackets
my_table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] # Notice that each list represents a row in the table
print(my_table)

In [None]:
# This might be visualized better as a numpy table
np_table = np.array(my_table) # Convert it to a numpy object
print(np_table)

In [None]:
# We access table elements the same way we access array elements
print(np_table[0, 0]) # This will print the first element in the table

# We can also access entire rows or columns
print(np_table[0]) # This will print the first row in the table
print(np_table[:, 0]) # This will print the first column in the table

In [None]:
# We can also pre-allocate a table with zeros or ones
my_table = np.zeros((3, 3), dtype=int) # This will create a 3x3 table with zeros
print(my_table)

A <b><u>dataframe</u></b> is a 2-dimensional labeled data structure with columns of potentially different types. We can think of it kind of like an Excel spreadsheet. It is generally the most commonly used pandas object. Dataframes differ from arrays and plain tables in that they can have column names and row names, and they offer more built-in functionality for working with data.
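
To make "columns of potentially different types" concrete, here is a small sketch that builds a dataframe from a dictionary. The column names and values are hypothetical and are not taken from the StarLift file; it assumes pandas is installed.

```python
import pandas as pd

# Building a dataframe from a dictionary: each key becomes a column name
routes = pd.DataFrame({
    "airport": ["AAA", "BBB", "CCC"],   # strings
    "trips": [120, 85, 40],             # integers
    "is_hub": [True, False, False],     # booleans
})

print(routes.dtypes)  # each column keeps its own data type
print(routes.shape)   # (rows, columns)
```

A NumPy array built from the same values would have to convert everything to one type, which is the main practical difference between the two structures.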

In [None]:
# We can also create a table using the pandas library
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # This is how we create a table using pandas
df # We can use print(df), but just typing df will display the table in a more user-friendly way

In [None]:
# To add column names and an index to the already existing table, we can do the following
df.columns = ["A", "B", "C"] # This is how we add column names to the table
df.index = ["Row1", "Row2", "Row3"] # This is how we add an index to the table
df

In [None]:
# We can also assign the column names and index when we create the table
df = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=["a", "b", "c"], index=["row1", "row2", "row3"])
df

In [None]:
# We can use a few different methods to access the data in the table. We can use the column name, the index, or the row and column number
print(df["a"]) # This will print the entire "a" column
print(df.a) # This will also print the entire "a" column

In [None]:
# .loc is a label-based method for accessing data in the table. .iloc is an integer-based method for accessing data in the table
print(df.loc["row1"]) # This will print the entire "row1" row
print(df.iloc[0]) # This will also print the entire "row1" row

In [None]:
# We can pass another argument to the .loc and .iloc methods to access a specific element in the table
print(df.loc["row1", "a"]) # This will print the value in the "a" column and "row1" row
print(df.iloc[0, 0]) # This will also print the value in the "a" column and "row1" row

In [None]:
# We can use slicing to access multiple elements in the table. We slice by using ":" and passing the start and end values
print(df.loc["row1", "a":"b"]) # This will print the values in the "a" through "b" columns and "row1" row
print(df.iloc[0, 0:2]) # This will also print the values in the "a" through "b" columns and "row1" row

In [None]:
# We can slice rows and columns at the same time
print(df.loc["row1":"row2", "a":"b"]) # This will print the values in the "a" through "b" columns and "row1" through "row2" rows
print(df.iloc[0:2, 0:2]) # This will also print the values in the "a" through "b" columns and "row1" through "row2" rows

In [None]:
# We can create a dataframe from a regular Python table
table = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(table, columns=["A", "B", "C"], index=["Row1", "Row2", "Row3"])
df # This one is not being displayed

# We can also create a dataframe from a numpy array. Instead of putting in the data, we use the variable name
np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # Create a numpy array
df = pd.DataFrame(np_array, columns=["A", "B", "C"], index=["Row1", "Row2", "Row3"])
df # This one is displaying

# Notice that only the last dataframe was displayed. If we wanted both to display, we would need to use the print 
# function or put them in separate cells in the notebook

In [None]:
# We can create a dataframe from scratch, but we will usually read in data from a file. Here we can read 
# in our StarLift Airlines trip data
df_sla = pd.read_csv("./Data/airplaneRoutes.csv")
df_sla

# Notice that the column names (but not an index) are already included in the table. This is because the csv file we read in
# already had column names

In [None]:
# The above dataframe is a little hard to read because it is so large. We can use the head function to only display
# the first few rows of the dataframe
df_sla.head() # This will display the first 5 rows of the dataframe

In [None]:
# We can rename the columns of the dataframe to fit coding and database naming conventions
df_sla.columns = ["source_airport", "dest_airport"]
df_sla.head(10) # This will display the first 10 rows of the dataframe

#### **0.3.9 - Clear Variables**

If you ran this notebook without clearing any variables, you might notice your computer getting pretty slow. That's normal because we are reserving spots in our computer's memory when we declare a variable and are basically holding it hostage either until we close our program or free/delete it from memory. Make sure to run these lines of code as they come up throughout the file.
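
Here is a minimal sketch of what deleting a variable does, using the `del` keyword. The names `temporary_data` and `was_deleted` are made up for this example, and the `try`/`except` block is only there so the error does not stop the cell.

```python
# Create a fairly large list, then remove the name that refers to it
temporary_data = list(range(1_000_000))
print(len(temporary_data))

del temporary_data  # the name no longer exists after this line

# Referencing a deleted name raises a NameError
try:
    print(temporary_data)
    was_deleted = False
except NameError:
    was_deleted = True

print(was_deleted)
```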

In [None]:
# The '\' character is used to continue a line of code on the next line
# We will delete all the variables we made that we don't need anymore (we aren't deleting df_sla)
# If you didn't run all of the code blocks above you will get an error while running this, so make sure to run them all.
# If for some reason this gives you an error even though you have run everything, just close the file and re-open (after 
# saving if you made changes)
del a, age, b, c, df, empty_list, empty_variable, greeting, my_array, my_boolean, my_first_array, my_first_boolean,\
    my_first_float, my_first_list, my_first_string, my_first_variable, my_float, my_int, my_second_string, my_string, \
    my_table, my_third_string, name, np_array, np_table, numToGetSquare, string1, string2, string3, table, x, y, z 

#### **0.3.10 - StarLift Airlines**

Let's take what we've learned about variables and data structures to start setting things up for our StarLift Airlines problem! The goal of this analysis is to look into StarLift Airlines' network structure across multiple airports around the world. They operate out of several airports and want to find the most profitable network setup at different price points and airport space fees. Specifically, they want to know whether they should remove airports from their network and, if so, how many. Your task is to analyze the trip data provided to find the number of unique airports, count the number of trips each airport is involved in to find the most used airports, and create a model showing the relationship between potential profits and airports removed from the network.
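
To give a feel for the first two parts of the task, here is a rough sketch on a tiny made-up trip log. The airport codes are invented, the column names assume the `source_airport` and `dest_airport` names used earlier, and this is only one possible pandas approach; the rest of the walkthrough may take a different route.

```python
import pandas as pd

# Hypothetical stand-in for the trip log; the real data comes from the CSV file
sample_trips = pd.DataFrame({
    "source_airport": ["AAA", "AAA", "BBB", "CCC"],
    "dest_airport":   ["BBB", "CCC", "CCC", "AAA"],
})

# Stack both columns into one Series so every airport appearance is counted
all_airports = pd.concat([sample_trips["source_airport"], sample_trips["dest_airport"]])

n_unique_airports = all_airports.nunique()       # number of distinct airports
trips_per_airport = all_airports.value_counts()  # trips each airport is involved in

print(n_unique_airports)
print(trips_per_airport)
```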

In [None]:
# Let's create a constant variable that will represent the number of trips in our dataframe
print(df_sla.shape) # .shape displays the number of rows and columns in the dataframe. Rows are first, columns are second

# This will get the number of rows in the dataframe and assign it to our constant variable N_TRIPS. With python arrays, 
# we reference the first element at the 0 index using square brackets. We can do the same with the shape attribute
N_TRIPS = df_sla.shape[0]
print(N_TRIPS) # This will print the number of trips to the console

In [None]:
# Initial assumptions: variables we will use to change projections, but will stay the same throughout the program
PROFIT_PER_TRIP = 10000
FEE_PER_AIRPORT = 200000

## **0.4 - Conditions and Conditional Statements**

A <b><u>condition</u></b> is an expression that evaluates to either <i>True</i> or <i>False</i>. A <b><u>conditional statement</u></b>, also known as a control structure or a conditional construct, is a programming feature that allows different actions to be taken depending on whether a condition is true or false. We can use these to control the flow of our program by running specific operations only if certain conditions are met. In Python, conditional statements are created like this:<br>
<i>
<pre>
if &lt;condition&gt;:
    &lt;block of code&gt;
elif &lt;condition&gt;:
    &lt;block of code&gt;
else:
    &lt;block of code&gt;
</pre>
</i>

With these if/elif/else statements the elif (short for "else if") and else parts are totally optional, and you can have as many elif statements as you need to fit the different situations you are accounting for. We'll get into some examples below.
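
As a minimal sketch of the template above, the block below sorts a made-up trip count into a usage level. The variable names and the cutoffs of 100 and 50 are assumptions chosen only for illustration.

```python
trips = 120  # hypothetical number of trips for one airport

# Python checks the conditions from top to bottom and runs only the first block that matches
if trips >= 100:
    usage = "high"
elif trips >= 50:
    usage = "medium"
else:
    usage = "low"

print(usage)
```

Even though 120 is also greater than 50, the `elif` block is skipped because the `if` condition already matched.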

#### **0.4.1 - "if", "elif", and "else" with Comparison Operators**

Along with demonstrating if/elif/else, we will also use <b><u>comparison operators</u></b> like <b><u>&gt;</u></b> (greater than), <b><u>&lt;</u></b> (less than), <b><u>==</u></b> (equal to, NOT to be confused with the <b><u>assignment operator</u></b> <b><u>=</u></b>, which we use to assign values to variables), <b><u>&gt;=</u></b> (greater than or equal to), and <b><u>&lt;=</u></b> (less than or equal to). As demonstrated in this section, you may also want to use <b><u>np.isclose()</u></b> when comparing floats.
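
It can help to see that a comparison is just an expression that produces a boolean, which we can print or store like any other value. A minimal sketch, with variable names made up for this example:

```python
x = 5
y = 3

# Each comparison produces a boolean that we can store in a variable
is_greater = x > y      # 5 is greater than 3
is_less = x < y         # 5 is not less than 3
is_equal = x == y       # 5 is not equal to 3
is_at_least = x >= 5    # 5 is equal to 5, so this passes
is_at_most = y <= 2     # 3 is greater than 2, so this fails

print(is_greater, is_less, is_equal, is_at_least, is_at_most)
```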

In [None]:
# Let's see a simple use case of a conditional statement
if 5 > 3: # This is a simple conditional statement. If the condition is true, the code block will be executed
    print("5 is greater than 3")

In [None]:
if 5 < 3:
    print("5 is less than 3") # This statement will not print because the condition is false

In [None]:
# We can also use the "else" keyword to execute a different code block if the condition is false. Let's also use variables
# this time
x = 5
y = 3

if y > x:
    print("y is greater than x") # This code is not run because the condition is false
else:
    print("x is greater than y")

In [None]:
# Now let's see it with an "elif" statement. "elif" is short for "else if" and is used to check multiple conditions
x = 5
y = 5

if y > x:
    print("y is greater than x") # This code is not run because the condition is false
elif x == y:
    print("x is equal to y") # This code is run because the condition is true
elif x > y:
    print("x is greater than y") # This code is not run because the prior elif condition is true
else:
    print("This is just here") # This code is not run because one of the prior conditions is true

In [None]:
# This seems very simple while we know the values of x and y, but what if we don't? We can use a random number generator 
# to get different values each time we run the code. np.random.randint in the numpy library will generate a random integer.
# Try running this code block a couple of times!
x = np.random.randint(1, 10) # This will generate a random integer from 1 to 9 (the upper bound, 10, is excluded)
y = np.random.randint(1, 10)

if y > x:
    # Let's use the f-string we learned earlier to see what the random values are
    print(f"y ({y}) is greater than x ({x})") 
elif x == y:
    print(f"x ({x}) is equal to y ({y})")
else:
    print(f"x ({x}) is greater than y ({y})")

In [None]:
# BE CAREFUL! Using the "==" operator to compare floats can be dangerous because of the way floating point numbers are 
# stored in memory. We can use the "np.isclose" function from the numpy library to compare floats

x = 0.1 + 0.2
y = 0.3

if np.isclose(x, y): # This will compare the two floats and return True if they are close
    print(f"x ({x}) is equal to y ({y})")
else:
    print(f"x ({x}) is not equal to y ({y})")
    
# See how the "==" operator doesn't work as expected?
if x == y: # This will compare the two floats and return False
    print(f"x ({x}) is equal to y ({y})")
else:
    print(f"x ({x}) is not equal to y ({y})")
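
To see why the two values differ, it can help to print them with more digits than print() normally shows. The block below is a minimal sketch rather than part of the original exercise: the names `a`, `b`, `exact_match`, and `close_match` are made up for illustration, and it uses `math.isclose` from the standard library as an alternative to `np.isclose`.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Show 20 decimal places so the tiny difference becomes visible
print(f"{a:.20f}")
print(f"{b:.20f}")

exact_match = (a == b)            # False: the stored binary values differ slightly
close_match = math.isclose(a, b)  # True: equal within a small relative tolerance

print(exact_match, close_match)
```

Both `np.isclose` and `math.isclose` accept tolerance arguments, so how close counts as "close" can be adjusted when the defaults do not suit your data.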

#### **0.4.2 - "and" and "or"**

In [None]:
# We can also use the "and" keyword to combine multiple conditions
x = 5
y = 3
z = 7

if x > y and x < z: # This is a compound conditional statement. Both conditions must be true for the code block to run
    print("x is greater than y and less than z")
else:
    print("x is not greater than y and less than z")

In [None]:
# And here's the "or" keyword
x = 5
y = 3
z = 7

# x is greater than y, but not greater than z, so let's see what happens
if x > y or x > z: # Here, at least one condition must be true for the code block to run. Both can be true as well
    print("x is greater than y or greater than z")
else:
    print("x is not greater than y or greater than z")

#### **0.4.3 - "not" and "!="**

Using "not" reverses the output of a condition that has evaluated to either true or false. If the condition is false, it will return true. If the condition is true, it will return false. You could reorganize or rephrase your if/elif/else statements to get the same result without using "not", but it's ultimately up to the programmer's preference. It sometimes makes more sense when reading and can help create shorter and more efficient code.

In [None]:
x = 5
y = 3

if not x == y: # Here we are checking if x is not equal to y
    print("x is not equal to y")
else:
    print("x is equal to y")


In [None]:
# You could also use the != operator to check if two values are not equal. This is the more common way to write it
if x != y:
    print("x is not equal to y")
else:
    print("x is equal to y")

#### **0.4.4 - "in"**

In [None]:
# We can use the "in" keyword to check if a value is in a list
my_list = [1, 2, 3, 4, 5]
x = 3
y = 6

if x in my_list: # This will check if the value of x is in the list
    print(f"{x} is in the list")
else:
    print(f"{x} is not in the list")


In [None]:
# We can use the "not" keyword with "in" to check if a value is not in a list
if y not in my_list: # This will check if the value of y is not in the list
    print(f"{y} is not in the list")
else:
    print(f"{y} is in the list")

In [None]:
# We can also use the "in" keyword to check if a value is in a string
my_string = "Hello, World!"
x = "Hello"
y = "Python"

if x in my_string: # This will check if the value of x is in the string
    print(f"1. {x} is in the string")
else:
    print(f"1. {x} is not in the string")
    
# We can also use the "not" keyword with "in" to check if a value is not in a string
if y not in my_string: # This will check if the value of y is not in the string
    print(f"2. {y} is not in the string")
else:
    print(f"2. {y} is in the string")

#### **0.4.5 - Clear Variables**

In [None]:
del my_list, my_string, x, y, z

#### **0.4.6 - StarLift Airlines**

The following code walks through the steps we would want to take to make sure the unique values in the two columns of our dataframe match each other. Some of the methods will be new to you. They come from the Python Pandas library and include <b><u>drop_duplicates()</u></b>, <b><u>equals()</u></b>, <b><u>sort_values()</u></b>, <b><u>isin()</u></b>, <b><u>pd.concat</u></b>, and <b><u>reset_index()</u></b>, along with their respective <b><u>arguments</u></b>. For now, just try to understand what each line does and why we are doing it. If needed, use the internet or reference the extra resources and definitions below in 0.9.

In [None]:
# Let's compare the two columns in our dataframe to see if the unique airports are the same
# We use double brackets below to keep it as a dataframe object. With single brackets we would get a Pandas Series instead
source_airports = df_sla[['source_airport']].drop_duplicates()      # This will get the unique airports used to start a trip
dest_airports = df_sla[['dest_airport']].drop_duplicates()          # This will get the unique airports used as the destination

print(source_airports.shape) # Display the number of rows and columns in the dataframe
source_airports.head() # Display the first five rows of the dataframe
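
The difference between single and double brackets is easier to see on a tiny example. This is a minimal sketch using a made-up dataframe called `toy` (a hypothetical stand-in for df_sla, with invented airport codes).

```python
import pandas as pd

# Hypothetical toy data standing in for df_sla
toy = pd.DataFrame({"source_airport": ["AAA", "BBB", "AAA"],
                    "dest_airport": ["BBB", "CCC", "CCC"]})

single = toy["source_airport"]    # one pair of brackets gives a Pandas Series
double = toy[["source_airport"]]  # two pairs of brackets give a Pandas DataFrame

print(type(single).__name__)
print(type(double).__name__)

# drop_duplicates() keeps one row per unique value, so two rows remain here
unique_sources = double.drop_duplicates()
print(unique_sources.shape)
```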

In [None]:
# If we tried to compare the two dataframes at this point, it would return False because they are not in the same order. 
# They are organized by what value came first
print(dest_airports.shape) # There are also a different number of unique airports in the dest_airports
dest_airports.head()

In [None]:
# This is how we compare the dataframes, but it will return False right now. We need to do a few more manipulations
print(source_airports.equals(dest_airports))

In [None]:
# We can use the "sort_values" method to sort the dataframes by the airport code/id
source_airports = source_airports.sort_values(by="source_airport")      # This will sort the dataframe by the source_airport IDs
dest_airports = dest_airports.sort_values(by="dest_airport")            # This will sort the dataframe by the dest_airport IDs

source_airports.head()

In [None]:
# When it sorted the dataframes, it didn't reset the index. We need to use the "reset_index" function to reset 
# the index before comparing
source_airports = source_airports.reset_index(drop=True)      # This will reset the index and drop the old index
dest_airports = dest_airports.reset_index(drop=True)

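# ...and we confirm the column headers. Note that equals() also compares column names, so it can only return True once
# both dataframes use the same header (as we do later in 0.5.7)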
source_airports.columns = ["source_airport"]
dest_airports.columns = ["dest_airport"]

dest_airports.head()
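
If it is not obvious why sorting and resetting the index matter to equals(), the small sketch below may help. It uses two made-up dataframes (`left` and `right` are hypothetical names, not part of the StarLift data) holding the same three codes in a different order.

```python
import pandas as pd

# Hypothetical toy data: the same three airport codes in a different order
left = pd.DataFrame({"airport_id": ["BBB", "AAA", "CCC"]})
right = pd.DataFrame({"airport_id": ["AAA", "CCC", "BBB"]})

before = left.equals(right)  # False: the rows are in a different order

# Sorting alone is not enough, because the old index labels travel with the rows
sorted_only = left.sort_values(by="airport_id").equals(right.sort_values(by="airport_id"))

# Sort the values, then rebuild the index so both run 0, 1, 2
left_sorted = left.sort_values(by="airport_id").reset_index(drop=True)
right_sorted = right.sort_values(by="airport_id").reset_index(drop=True)
after = left_sorted.equals(right_sorted)  # True

# Column names are compared too, so a different header makes them unequal again
renamed = left_sorted.rename(columns={"airport_id": "other_name"})
names_differ = renamed.equals(right_sorted)

print(before, sorted_only, after, names_differ)
```

In short, equals() only returns True when the values, the index, and the column names all line up.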

In [None]:
# We're almost there, but they still aren't equal
if source_airports.equals(dest_airports):
    print("The unique source and dest airports are the same")
else:
    print("The unique source and dest airports are not the same")
    print(source_airports.shape)
    print(dest_airports.shape)
    
# See how there are more rows in the dest_airports than in the source_airports? That means there are some airports that are
# destinations but not sources, and maybe the other way around too. We can find out which ones by using the "isin" method

<b><u>isin</u></b> is a method that is used to filter dataframes. You give it a list of values, and it marks each row as True or False depending on whether that row's value appears in the list, so you can select just those rows. Remember when we used "in" earlier with our conditional statements? Same idea! We also have the <b><u>~</u></b> operator we'll want to use here, which flips True and False, similar to when we used "not" with "in" earlier.
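
Here is a minimal sketch of the idea on made-up data before we apply it to StarLift. The dataframes `dest` and `source` and the airport codes in them are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
dest = pd.DataFrame({"dest_airport": ["AAA", "BBB", "CCC", "DDD"]})
source = pd.DataFrame({"source_airport": ["AAA", "CCC"]})

# True where the destination code also appears in the source column
mask = dest["dest_airport"].isin(source["source_airport"])

# "~" flips True and False, so we keep the rows NOT found in source
only_dest = dest[~mask]

print(mask.tolist())                        # [True, False, True, False]
print(only_dest["dest_airport"].tolist())   # ['BBB', 'DDD']
```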

In [None]:
# The combination of "~" and "isin" will return the values that are not in the other dataframe. Take some time to make 
# sure you understand what this code is doing and why each part is important
diff_dest = dest_airports[~dest_airports["dest_airport"].isin(source_airports["source_airport"])] # Different destinations  
print(diff_dest.shape)
diff_dest

# When we output the dataframe below we will get all the airports that are used as destination airports but not source airports. We
# could do the same thing to get the source airports that are not destination airports, but we don't need that for our solution

<b><u>pd.concat</u></b> is a useful function we can use to concatenate two dataframes. You can specify how you want to combine them by either setting the axis=0 (for rows, so stacking them) or the axis=1 (for columns, so putting them side-by-side while matching the index).
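
The two axis options can be sketched on a pair of tiny made-up dataframes. The names `top`, `bottom`, and `counts` and their values are assumptions for illustration only.

```python
import pandas as pd

top = pd.DataFrame({"airport_id": ["AAA", "BBB"]})
bottom = pd.DataFrame({"airport_id": ["CCC"]})

# axis=0 stacks the rows. The original index labels are kept (0, 1, 0),
# so we reset the index to renumber the rows 0, 1, 2
stacked = pd.concat([top, bottom], axis=0)
stacked = stacked.reset_index(drop=True)

# axis=1 places the dataframes side by side, lining rows up by index label
counts = pd.DataFrame({"trips": [4, 7]})
side_by_side = pd.concat([top, counts], axis=1)

print(stacked)
print(side_by_side)
```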

In [None]:
# Now that we have the 16 airports that are not in the source_airports dataframe, we can add them to our unique list of airports.
# Before we use pd.concat, we need to make sure the column names for the dataframes are the same.
diff_dest.columns = ["source_airport"]                              # We need to rename the column to match the source_airports dataframe
unique_airports = pd.concat([source_airports, diff_dest], axis=0)   # We can use pd.concat to stack the dataframes on top of each other
unique_airports = unique_airports.sort_values(by="source_airport")  # This will sort the dataframe by the source_airport IDs
unique_airports = unique_airports.reset_index(drop=True)            # This will reset the index and drop the old index
unique_airports.columns = ["airport_id"]                            # This will rename the column to be a bit more accurate

# Output our new dataframe
print(unique_airports.shape) # Without print(), only the last line of a code block is displayed
unique_airports.head()

## **0.5 - Functions**

A <b><u>function</u></b> is an isolated and reusable block of code that performs a specific task. It accepts zero to many input arguments, performs operations based on those inputs, and can optionally return a result. This makes them extremely dynamic and useful if we are going to use the same operations or processes multiple times in our program, not to mention they can help with readability and reducing clutter. 

Already through this process we have used some <b><u>methods</u></b>, which are functions that belong to a class and are called on an object of that class using a dot (in our case, mostly Pandas dataframes). While all methods are functions, not all functions are methods. drop_duplicates(), equals(), sort_values(), and reset_index() are examples of methods, while print() and len() (which we used back in 0.3.6) are examples of functions that are not methods. The difference is whether they belong to a class (method and function) or not (just function).
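
A quick way to tell the two apart is to look at how they are called. The sketch below uses only built-in Python, with made-up variable names.

```python
word = "starlift"

length = len(word)       # len() is a function: the object is passed in as an argument
shouted = word.upper()   # upper() is a method: it is called on the string with a dot

numbers = [3, 1, 2]
numbers.sort()           # sort() is a list method, and it changes the list in place

print(length, shouted, numbers)
```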

#### **0.5.1 - Our First Function**

This is a good time to talk about <u>**scope**</u>, which is the area of code where a variable can be used. You may have noticed as we've gone through this notebook that we could declare a variable in one code block and then change its value in another code block. This is because the variables and anything else we've created have been declared at the top level, or in other words they are <u>**global variables**</u>. However, if we declare a variable inside a function, it will only be available inside that function. Variables that are only available in certain sections of code are local to that code, or are what we call <u>**local variables**</u>. Note that in Python, indentation alone does not create a new scope: a variable created inside an if/elif/else statement or a for or while loop (outside of any function) is still available after that block finishes.
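
The following is a minimal sketch of local and global variables in Python. All names in it (`counter`, `double_it`, `result`, `label`) are invented for this illustration.

```python
counter = 10  # global: defined at the top level

def double_it(value):
    result = value * 2   # "result" is local to double_it
    return result

doubled = double_it(counter)

# "result" does not exist out here, so asking for it raises a NameError
try:
    result
    result_is_visible = True
except NameError:
    result_is_visible = False

# In Python, a variable assigned inside a top-level if block remains available afterwards
if counter > 5:
    label = "big"

print(doubled, result_is_visible, label)
```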

In [None]:
# Let's create our first function
def add(a, b):          # This is how we create a function in Python. "def" is short for "define". We require two arguments
    return a + b        # This is the code block that will be executed when the function is called, and the value is returned

# Now we can call our function
x = 5
y = 3
z = add(x, y)           # This is how we call a function in Python
print(z)                # This will print the value of z to the console

In [None]:
# Above, we created variables a and b inside the function, but those are just placeholders to be used in the function. This can
# sometimes get hard to keep track of if we have global variables with the same name. Make sure to name your variables
# in a way that makes sense and is easy to differentiate! Uncomment the code below to see the error. Notice with this kind
# of error, there is no warning from the IDE or the console. It's a runtime error, so it will only show up when the code is run
'''
if a == 5:
    print("a is 5")     # This will cause an error because "a" is not defined in the global scope
'''

In [None]:
# If we tried to call the function without the required arguments or only one, it would return an error. Uncomment the code
# below to see the error
'''
z = add(5) # This will return an error because the function requires two arguments
'''

In [None]:
# I can also call the function down here, and because the function is written the way it is (with "+"), we can also 
# concatenate a string
new_string = add("1. Hello ", "World!")
print(new_string)

# If our numbers were string objects, it would concatenate them instead of adding them
x = "5"
y = "3"
z = add(x, y)

print("2.", z) # This will print a string of "53" to the console instead of what we might expect with 8

#### **0.5.2 - Brief Intro to Safeguarding Code**

In [None]:
# While we can use our function like we did above to concatenate strings, it doesn't make sense because of the way we named it. 
# It's important to give our functions descriptive and accurate names. Let's safeguard our code by renaming the function and 
# adding checks
def add_two_numbers(x, y):                                  # This is a better name for our function
    if type(x) is not int or type(y) is not int:            # This is a check to make sure both of our arguments are integers
        return "ERROR: Both arguments must be integers"     # This is what will be returned if the check fails
    else:
        return x + y                                        # This is what will be returned if the check passes
    
# Now we can call our function
z = add_two_numbers(5, 3)
print(z)

In [None]:
# If we call the function with strings, it returns our error message instead of concatenating them
print(add_two_numbers("Hello ", "World!"))
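
Returning an error message works, but the caller then gets back a string that looks like an ordinary result. A common alternative is to raise an exception instead. The sketch below is one possible approach and not the notebook's own method: `add_two_numbers_strict` is a hypothetical name, and accepting floats as well as integers is an assumption made for this example.

```python
def add_two_numbers_strict(x, y):  # hypothetical name for this illustration
    """Add two numbers, raising a TypeError for anything else."""
    for value in (x, y):
        if not isinstance(value, (int, float)):
            raise TypeError(f"Expected a number, got {type(value).__name__}")
    return x + y

print(add_two_numbers_strict(5, 3))

# try/except lets us catch the error instead of stopping the program
try:
    add_two_numbers_strict("Hello ", "World!")
except TypeError as err:
    message = str(err)
    print("Caught:", message)
```

One thing to be aware of: in Python, `True` and `False` count as integers, so booleans would pass this check.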

#### **0.5.3 - Functions with Multiple Outputs**

In [None]:
# We can also use the "return" keyword to return multiple values from a function
def add_and_subtract(a, b): 
    add = a + b 
    subtract = a - b 
    return add, subtract # Note that a function doesn't need to return a value, but if it does, it can return multiple values

x = 5
y = 3
z1, z2 = add_and_subtract(x, y) # This will return two values that we can store in two variables
print(f"addition: {z1}, subtraction: {z2}")

#### **0.5.4 - Setting Default Parameters**

In [None]:
# While our functions don't require much, some functions can get very long and complex, requiring many parameters to be passed
# in. We can use default parameters to make our functions more user-friendly.
def add_three_numbers(a, b, c=0): # This is how we create a function with a default parameter
    return a + b + c

x = 5
y = 3
z = add_three_numbers(x, y) # This will use the default parameter for "c", which is set to 0
print("1.", z)

# Or we can pass in a value for "c" if we want
z = add_three_numbers(x, y, 2) # This will use the value we passed in for "c"
print("2.", z)

#### **0.5.5 - For Fun: Unlimited Parameters Passed**

In [None]:
# Another thing we can do is use the "*args" parameter to pass in an unknown number of arguments
def add_unknown_number_of_numbers(*args):   # This is how we create a function with an unknown number of arguments
    return sum(args)                        # This will return the sum of all the arguments passed into the function

z = add_unknown_number_of_numbers(1, 2, 3, 4, 5) # Let's try it out with 5 arguments
print(z)

# This can get a bit complicated though as we try and safeguard our code.
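
As a rough idea of what safeguarding could look like here, the sketch below checks every argument before adding. The function name `add_checked` is hypothetical, and it uses a for loop, which is covered in 0.6.

```python
def add_checked(*args):  # hypothetical helper name
    # args arrives as a tuple, so we can check each value in turn
    for value in args:
        if not isinstance(value, (int, float)):
            return "ERROR: All arguments must be numbers"
    return sum(args)

print(add_checked(1, 2, 3))    # 6
print(add_checked(1, "2", 3))  # the error message
print(add_checked())           # 0, because the sum of nothing is 0
```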

#### **0.5.6 - Clear Variables**

In [None]:
del new_string, x, y, z, z1, z2, source_airports, diff_dest, dest_airports

#### **0.5.7 - StarLift Airlines**

Let's combine some of the basics of what we've learned so far and create a more complex function that we can use with StarLift. You remember when we compared the two columns in our dataframe to make sure the unique values for the airports matched? Let's turn it into a function! In this, we'll also demonstrate a few more things you can do with functions, including toggling different output options (like optional feedback to users) and even what kind of return value we want.

In [None]:
# In this example, we will keep it simple by hard-coding a few things specific to our situation. In a real-world
# situation, we would probably want to make it a bit more dynamic and more applicable to a wider range of situations

# We are having the user pass in a dataframe, and we also have two optional parameters. The first is "details" which will
# print a message to the console describing the result of the comparison. The second is "returnSame" which will be how the
# user specifies whether they want the combination of all the unique values in the two columns to be returned
def compare_columns(df, details=True, returnSame=False): # Have the user pass in a dataframe.
    # Now we will follow the same steps we did above to compare the two dataframes
    
    # Get the unique values of each column
    source_airports = df[['source_airport']].drop_duplicates()
    dest_airports = df[['dest_airport']].drop_duplicates()
    
    # Sort the dataframes by the airport id
    source_airports = source_airports.sort_values(by="source_airport")
    dest_airports = dest_airports.sort_values(by="dest_airport")
    
    # Reset the index
    source_airports = source_airports.reset_index(drop=True)
    dest_airports = dest_airports.reset_index(drop=True)
    
    # Rename the columns
    source_airports.columns = ["airport_id"]
    dest_airports.columns = ["airport_id"]
    
    # Compare the two dataframes
    if not source_airports.equals(dest_airports):
        # If the user wants to see the details, we will print a message to the console. Otherwise, we will return False
        if details: print("The two dataframes are not the same") # We can simplify this to one line, but it's less readable
        if returnSame: # If the user wants the combination, we'll return a new dataframe
            # Get the differences between the two dataframes
            diff_dest = dest_airports[~dest_airports["airport_id"].isin(source_airports["airport_id"])]
            
            combo_df = pd.concat([source_airports, diff_dest], axis=0)
            combo_df = combo_df.sort_values(by="airport_id")
            combo_df = combo_df.reset_index(drop=True)
            
            if details: print("The unique values have been combined into one dataframe")
            return combo_df
        if not returnSame: # If not, we'll return False (if details are turned off)
            if not details: return False
    else:
        if details: print("The two dataframes are the same")
        if not details: return True
    
# Now we can call our function. We don't need to pass in the "details" or "returnSame" parameters because they have a default value
compare_columns(df_sla) # Just compare, don't get all the unique values


In [None]:
compare_columns(df_sla, returnSame=True) # Get the dataframe of all the unique values

In [None]:
# Or we can pass in a value for "details" and have it return a boolean! In this case we are having it do the same thing,
# but we could have it do something else if we wanted. Maybe we don't want to run certain code if the dataframes are not
# the same
if compare_columns(df_sla, details=False): # Toggled to return a boolean value instead of printing a message
    print("The two dataframes are the same")
else:
    print("The two dataframes are not the same")

Below is an example of a function that is slightly more dynamic in the sense that you could pass any two dataframes in to compare them. There is definitely more that could be done to this function, but for now try to work out what it is doing with each line of code and why we need it there. Some of the code covers topics that we haven't covered yet and are slightly beyond the scope of this course. Comments have been added to help.

Also see if you can find one way you might want to improve it! You don't have to write the code for it (though you may try if you'd like), but write a comment for what you might want to add to make it more effective. Make sure to put your comment in the right place in the code!

In [None]:
def compare_dataframes(df1, df2, details=True):
    # Remember our import statements? We want to import the libraries we need again here if we are truly trying to make
    # dynamic functions for different use cases. We can't assume another user using our function will have the needed imports
    import pandas as pd 
    
    # Rename columns of df2 to match df1 if they have different column names so we can compare them
    # 1. zip(df2.columns, df1.columns): This creates an iterator that aggregates elements from df2.columns
    #    and df1.columns pairwise. So, for each iteration, it pairs up the corresponding column names from
    #    df2 and df1.
    # 2. dict(...): This converts the pairs of column names generated by zip into a dictionary where the
    #    keys are the column names of df2 and the values are the corresponding column names of df1. This
    #    effectively maps the column names of df2 to the column names of df1.
    # 3. df2.rename(...): This renames the columns of df2 using the dictionary created in the previous step.
    #    It maps the column names of df2 to the corresponding column names of df1, effectively aligning the
    #    column names of df2 with those of df1.
    # Note: "df1.columns != df2.columns" compares the names one by one and gives an array of True/False values,
    # which an if statement cannot evaluate. The equals() method gives us a single True or False instead
    if not df1.columns.equals(df2.columns): # If the column names are not the same
        df2 = df2.rename(columns=dict(zip(df2.columns, df1.columns))) # Rename the columns of df2 to match df1
    
    if df1.equals(df2):
        if details == True: # If the user wants to see the details
            print("The dataframes are identical")
        else: return True # If the user doesn't want to see the details, return boolean
    else:
        if details == True: # If the user wants to see the details
            print("The dataframes are not identical")
            if (df1.shape != df2.shape): # Check if the dataframes are the same size before we iterate
                print("The dataframes are not the same size")
                print(f'First dataframe: {df1.shape[0]} rows, {df1.shape[1]} cols')
                print(f'Second dataframe: {df2.shape[0]} rows, {df2.shape[1]} cols')
            else: # If they are the same size but still not the same, output the positions that are different
                for i in range(df1.shape[0]): # Iterate through the rows
                    for j in range(df1.shape[1]): # Iterate through the columns
                        if df1.iat[i, j] != df2.iat[i, j]: # If the values at the same index are different
                            print(f'Values at index ({i}, {j}) are different:')
                            print(f'df1: {df1.iat[i, j]}, df2: {df2.iat[i, j]}')
        else: return False # If the user doesn't want to see the details, return boolean

## **0.6 - Loops**

Loops are extremely useful coding structures that let us repeat a block of code without writing it out over and over. There are two types of loops in Python:
- <b><u>for loop</u></b> - a repeating structure that is used to iterate over a sequence of elements such as a list, string, or other iterable object. The loop iterates over each item in the sequence and executes a block of code for each item. These are useful for when we have a calculable number of times we want to perform the operation.
- <b><u>while loop</u></b> - a repeating structure that is used to repeatedly execute a block of code as long as a specified condition evaluates to True (boolean value). These are useful for when we have a condition we are trying to reach and we don't know exactly how many times we need to perform the operation to get there.

#### **0.6.1 - Intro to Loops with the For Loop**

In [None]:
# Let's try doing something repetitive and simple without a loop. We'll take x and add 1 to it 10 times.
x = 5

x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1
x = x + 1

print(x)

In [None]:
# That was a mess, right? Imagine if we had to repeat a process 100 times! Now let's do the same thing with a loop
x = 5

# Here we will use what is called a for loop. range(10) produces the numbers 0 through 9, and the loop runs once for each of them
for i in range(10): # This will iterate 10 times. "i" is a common variable name to use in a for loop
    x = x + 1
    
print(x)

In [None]:
# We have to use the "range" function to produce the sequence of numbers from 0 to 9. If we try putting in just the number we
# want it to iterate to, like 10, it will raise an error because a single integer is not something Python can loop over.
# Uncomment the code below to see the error
'''
x = 5

for i in 10:
    x = x + 1
    
print(x)
'''

In [None]:
# Let's see it iterate through a list
my_list = [1, 20, 3, "string", 5] # The list does not have to be in any particular order, and doesn't have to be numbers

for i in my_list: # This will iterate through the list and print each value to the console
    print(i)

In [None]:
# A for loop can also be used to iterate through a string
my_string = "Hello"

for i in my_string: # This will iterate through the string and print each character to the console
    print(i)

#### **0.6.2 - The While Loop**

In [None]:
# Let's look at a while loop to do something similar. It will keep running until the condition is false. To make this tie in more closely
# to its use case, let's give x a random value. That way we don't know how many iterations we need to do to reach the desired end state
x = np.random.randint(1, 10)

while x < 15: # This will keep running until x is no longer less than 15
    print(x)
    x = x + 1
    
print(x)

In [None]:
# Because it takes a condition to be false to stop, we have to be careful with while loops. If we don't write the code
# properly, it could run forever. This is called an infinite loop. Let's see an example. Uncomment the code below to see
# it happen. Notice that the program doesn't warn us about the infinite loop. Python does not treat it as an error at all,
# the code block simply never finishes running
'''
x = 5

while x < 15: # This will keep running until x is no longer less than 15
    x = x - 1 # x only gets smaller, so it is always less than 15 and the loop never stops
    
print(x)
'''

# If you run this code block, you will have to stop the kernel to stop the loop. You can do this by clicking the stop
# button in the toolbar at the top of the notebook or to the side of the cell in the notebook

In [None]:
# Since there is a condition that we check to see if we should stop the loop, we can simply set that condition to True and
# use the "break" keyword inside the loop to stop it.
x = np.random.randint(1, 10)

while True: # This will keep running until we use the "break" keyword
    print(x)
    x = x + 1
    if x >= 15: # This will stop the loop if x is no longer less than 15
        break
    elif x < 0: # This branch should never run if we have our logic right, but it's good to have it just in case
        break
    
print(x)

In [None]:
# We can also use the "continue" keyword to skip the rest of the code block and go to the next iteration
x = np.random.randint(1, 10)

while x < 15:
    x = x + 1
    
    # The "%" operator is called the modulo operator. It returns the remainder of a division operation
    if x % 2 == 0: # This will skip the rest of the code block and go to the next iteration if x is even
        continue
    print(x)
    
# continue and break can also be used in for loops, where they work the same way

#### **0.6.3 - For Fun: Loop with Else**

Funnily enough, you can create a for or while loop with an else statement tied to it, almost like you would with an if/elif/else statement. The else block runs only if the loop finishes without being stopped by the "break" keyword. The use case for this is extremely limited, though it's fun and can be useful to know nonetheless.

In [None]:
# else with a for loop
my_list = [1, 2, 3, 4, 5]

for i in my_list:
    # This will stop the loop if i is equal to 6, which according to our logic will never happen, so the "else" block will run
    if i == 6: 
        break
else:
    print("The loop completed without using the break keyword")

In [None]:
# else with a while loop
x = np.random.randint(1, 10)

while x < 15:
    x = x + 1
    if x == 16: # Again, this will never happen, so the "else" block will run
        break
else:
    print("The loop completed without using the break keyword")
    
# Contrariwise, if the loop uses the "break" keyword, the "else" block will not run
while x > 0:
    x = x - 1
    if x == 5: # This will stop the loop if x is equal to 5
        print(f"We got to this part of the loop and x equals {x}")
        break
else: # This part will never be run
    print("The loop completed without using the break keyword")
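
One place this pattern can come in handy is searching: break when the item is found, and let the else block handle the "not found" case. The sketch below uses a made-up list of airport codes, and the names `airports`, `target`, and `outcome` are hypothetical.

```python
airports = ["AAA", "BBB", "CCC"]  # hypothetical airport codes
target = "DDD"

for code in airports:
    if code == target:
        outcome = f"Found {target}"
        break
else:
    # Runs only because the loop finished without hitting break
    outcome = f"{target} was not found"

print(outcome)
```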

#### **0.6.4 - Nested Loops**

In [None]:
# Nested loops are loops within loops. They can be useful for iterating through a table or a list of lists
my_table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

for row in my_table: # This will iterate through each row in the table
    for value in row: # This will iterate through each value in the row
        print(value)

In [None]:
# We can use nested loops to make sure we do calculations on every value in a set of numbers. Here is a fun example of three nested loops
# that could be used to represent rolling three dice and the possible outcomes, which we will store in a 3d array
outcomes = np.zeros((6, 6, 6)) # Create a 3D array of zeros

for i in range(1, 7): # This will iterate through the numbers 1 to 6 since it is exclusive of the second argument
    for j in range(1, 7):
        for k in range(1, 7): 
            outcomes[i - 1, j - 1, k - 1] = i + j + k # This will store the sum of the three dice in the 3d array
            
print(outcomes)

#### **0.6.5 - Clear Variables**

In [None]:
del i, j, k, my_list, my_string, my_table, outcomes, row, value, x

#### **0.6.6 - StarLift Airlines**

So far in our StarLift Airlines problem, we have read in our data to our Python environment and checked to make sure the unique airport ID values in each column were identical. We did this by creating variables and dataframes, testing conditions with if/elif/else statements, and simplifying our code output by creating a function. The last thing we need to do before we can calculate the profits is to count the number of times each airport is used in a trip while *making sure we don't double count any of the airports if they were logged to be the source AND the destination airport*. We can do this with for loops!

Along with using our looping knowledge, we will also learn new methods like <b><u>value_counts()</u></b>, <b><u>sort_index()</u></b>, <b><u>pd.merge()</u></b>, <b><u>rename_axis()</u></b>, <b><u>iterrows()</u></b>, and <b><u>sum()</u></b>. If you would like to understand these methods more fully, make sure to look at the resources in 0.9. Let's break down some of it here though:
- <b><u>value_counts()</u></b> - a method in the Pandas library that counts the number of times each unique value appears in a column (a Pandas Series).
- <b><u>sort_index()</u></b> - a method in the Pandas library that sorts the rows by their index labels, from least to greatest by default.
- <b><u>pd.merge()</u></b> - a function in the Pandas library used to do joins on dataframes; combining two dataframes together on a common column/index.
- <b><u>rename_axis()</u></b> - a method in the Pandas library that allows us to name/rename the index column, which we do just for clarity.
- <b><u>iterrows()</u></b> - a method in the Pandas library that allows us to iterate over the rows of the dataframe, giving us each row as a pair: the row's index label and a Pandas Series holding the values of each column within that row.
- <b><u>sum()</u></b> - a built-in function in default Python that calculates the sum of all elements in an iterable object (in our case, an array).
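
To make the list above a little more concrete, here is a minimal sketch that uses each of these on a tiny made-up trip log. The dataframes `trips` and `airports` and all the airport codes are hypothetical, and this is not the StarLift solution itself.

```python
import pandas as pd

# Hypothetical trip log with made-up airport codes
trips = pd.DataFrame({"source_airport": ["BBB", "AAA", "BBB", "CCC"],
                      "dest_airport": ["AAA", "CCC", "CCC", "CCC"]})

# value_counts + sort_index: how often each source appears, ordered by code
source_counts = trips["source_airport"].value_counts().sort_index()
source_counts = source_counts.rename_axis("airport_id")   # name the index
counts_df = source_counts.to_frame(name="source")         # Series -> dataframe

# pd.merge: attach the counts to a list of airports, matching on the index.
# how="left" keeps every airport, and airports with no count get NaN
airports = pd.DataFrame({"airport_id": ["AAA", "BBB", "CCC", "DDD"]})
merged = pd.merge(airports, counts_df, left_on="airport_id", right_index=True, how="left")

# iterrows: each step gives the index label and the row as a Series
same_airport_trips = 0
for idx, row in trips.iterrows():
    if row["source_airport"] == row["dest_airport"]:
        same_airport_trips += 1

# The built-in sum works on any iterable of numbers
total_departures = sum(source_counts)

print(merged)
print(same_airport_trips, total_departures)
```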

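Another thing that is good to know is what is called a <b><u>boolean array</u></b>: an array of True/False values, one for each element, that we get by applying a condition to a regular array. We can then use it to filter the original array. For example, if we had an array of numbers 1-10 and we only wanted numbers less than 5, instead of removing individual elements we can write something like this:

> <pre>newArray = oldArray[oldArray < 5]</pre>

Here (oldArray < 5) is the boolean array, and placing it inside the square brackets keeps only the elements where it is True.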

Or if we wanted to filter a dataframe to include only the rows where the first two columns both equal a chosen value, we could do something like this:

> <pre>newDataFrame = oldDataFrame.loc[(oldDataFrame.iloc[:, 0] == determinedValue) & (oldDataFrame.iloc[:, 1] == determinedValue)]</pre>
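
Both ideas can be tried out on small made-up data. In the sketch below, `old_array`, `old_df`, and `wanted` are hypothetical names chosen to mirror the snippets above.

```python
import numpy as np
import pandas as pd

old_array = np.arange(1, 11)   # the numbers 1 through 10
mask = old_array < 5           # boolean array: True where the condition holds
new_array = old_array[mask]    # keeps only the elements where mask is True

# Hypothetical two-column dataframe
old_df = pd.DataFrame({"source": ["AAA", "BBB", "CCC"],
                       "dest":   ["AAA", "CCC", "CCC"]})
wanted = "CCC"

# "&" combines two boolean arrays, giving True only where both are True
both_match = (old_df.iloc[:, 0] == wanted) & (old_df.iloc[:, 1] == wanted)
new_df = old_df.loc[both_match]

print(mask)
print(new_array)
print(new_df)
```

Note the parentheses around each condition: they are needed because "&" would otherwise be evaluated before "==".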

We have a boolean array in the code below that has been separated out across multiple lines to hopefully help you understand the logic behind it. Having said that, don't worry about understanding how everything works, just make sure you can recognize what is happening with each line.

In [None]:
# Get the counts of each unique airport from the original dataframe and sort by airport id
source_counts = pd.DataFrame(df_sla['source_airport'].value_counts().sort_index())
dest_counts = pd.DataFrame(df_sla['dest_airport'].value_counts().sort_index())

source_counts.head()

In [None]:
# Combine the two dataframes into the dataframe we already created called unique_airports (see 0.4.6)
unique_airports = pd.merge(unique_airports, source_counts, left_on='airport_id', right_index=True, how='left')
unique_airports = pd.merge(unique_airports, dest_counts, left_on='airport_id', right_index=True, how='left')

unique_airports.head()

# In this we see that airport with the Id AAE was involved in 9 departures and 9 arrivals

In [None]:
# Rename the columns to be a bit more descriptive
unique_airports.columns = ['airportId', 'source', 'dest']
unique_airports.head()

In [None]:
# Let's fix our data a little bit. Because some of the airports are only sources and some are only destinations, we have NaN
# in our dataframe, which stands for "Not a Number". We can use the "fillna" function to fill in the NaN values with 0
print(unique_airports.isnull().sum())   # This will return the number of NaN values in each column
unique_airports.fillna(0, inplace=True) # This will fill in the NaN values with 0. The "inplace" parameter will change the dataframe
                                        # without us having to reassign it to a variable with "="
print(unique_airports.isnull().sum())   # Before and after

In [None]:
# You may have also noticed that the source and dest columns are floats. Let's cast them to integers
unique_airports['source'] = unique_airports['source'].astype(int)
unique_airports['dest'] = unique_airports['dest'].astype(int)
unique_airports.head()

In [None]:
# And the last step before we implement our double counts, let's change the index
unique_airports.set_index('airportId', inplace=True)
unique_airports.head()

In [None]:
unique_airports['double_counts'] = 0 # Create a new column called "double_counts" that we can use to count the number of times an airport
                                     # was the source and destination of a trip. Doing "= 0" will fill the entire column with zeros
unique_airports.head()

In [None]:
# Use a for loop to count double counted trips for each airport
for index, row in unique_airports.iterrows():
    airport_id = index # Get the airport id, which is the index of the row
    
    # Count double counted trips where the source airport is equal to destination airport by using .sum() on a boolean array
    # "\" allows us to continue the code on the next line and is added for clarity with the long line of code
    unique_airports.loc[airport_id, 'double_counts'] = \
        (
            (
                df_sla['source_airport'] == airport_id
            ) & (
                df_sla['dest_airport'] == airport_id
            )
        ).sum()

unique_airports.head()

# This code block might take a little bit of time because it has to go through over 3000 airports and over 60000 trips!

In [None]:
# That block took a little bit longer to run, and it doesn't seem like any of the airports have double counts. Let's check. Below
# we'll use another boolean array to check if any of the airports have double counts and output the dataframe
unique_airports[unique_airports['double_counts'] > 0]

# Looks like there's just one!

In [None]:
# Ok, hopefully you were paying attention and saw that we used "iterrows" to iterate through the dataframe above. Let's look at
# what exactly is being returned with iterrows. The data here is for the last row of the dataframe
print("Row:", row)
print("Index:", index)

unique_airports.tail() # This is like .head(), but it will display the LAST 5 rows of the dataframe instead of the FIRST 5

In [None]:
# Add one more column to the dataframe that will represent the total number of trips for each airport
unique_airports['total'] = unique_airports['source'] + unique_airports['dest'] - unique_airports['double_counts']
print("source + dest - double_counts = total")
unique_airports.head()

In [None]:
# Let's sort it based on the total, going from least to greatest, using sort_values() that we learned a few sections back
unique_airports = unique_airports.sort_values(by='total')
unique_airports.head()

Clear the variables we no longer need.

In [None]:
del index, row, airport_id

## **0.7 - Data Visualization and Finishing StarLift**

Numbers are great and everything, but even better is when we can visualize the numbers for understanding and presentation purposes. Python has different tools to help with this, including one of the libraries we imported earlier called <b><u>Matplotlib</u></b>, specifically in the <b><u>pyplot</u></b> module. This section doesn't go very deep into the topic but will go wide to give you a variety of options.

#### **0.7.1 - Plot a Line Graph**

In [None]:
# Let's get started by making a simple line graph. We can create two arrays, one for x values and one for y values
x = np.array([1, 2, 3, 4, 5]) # Doesn't have to be a numpy array, but it's good practice
y = np.array([1, 4, 9, 16, 25])

# They're two different arrays, but think of them as value pairs: (1, 1), (2, 4), (3, 9), (4, 16), (5, 25)

plt.plot(x, y) # This will create a simple line graph. x goes first, then y
plt.show() # This will display the graph. We don't need any arguments because we only have one graph


In [None]:
# We can also graph an equation. Let's graph the equation y = x^2
x = np.linspace(0, 10, 100) # This will create an array of 100 evenly spaced numbers from 0 to 10 (both included), each about 0.101 apart (10 / 99)
y = x**2 # This will create an array of the square of each value in x. We could also write it as np.power(x, 2)

plt.plot(x, y)
plt.show()

In [None]:
# Let's visualize what we generated with y
y

In [None]:
# We can also graph multiple lines on the same graph. Let's graph the equation y = x^2 and y = x^3
x = np.linspace(0, 10, 100)
y1 = x**2
y2 = np.power(x, 3) # Just to show the other way of doing it

plt.plot(x, y1, label="y = x^2") # This will create a line graph for y = x^2. The "label" argument is used to label the line
plt.plot(x, y2, label="y = x^3") # This will create a line graph for y = x^3

plt.legend() # This will display the labels on the graph
plt.show()

#### **0.7.2 - Graph Customizations**

The following are common customizations and are up to personal preference. If you want to find more ways to customize your graph, check out the documentation for Matplotlib.pyplot.

In [None]:
# We can also change the color and style of the lines. Let's graph y = x^2 again, this time alongside a log function, with different colors and styles
x = np.linspace(0, 10, 100)
y1 = x**2
y2 = np.log(x + 1) * 50 # Let's mix it up and do a log function that will show on our graph

# The "color" argument is used to change the color of the line. The "linestyle" argument is used to change the style of the line
plt.plot(x, y1, label="y = x^2", color="red", linestyle="--") # Linestyle "--" is a dashed line
plt.plot(x, y2, label="y = log(x + 1) * 50", color="green", linestyle=":") # Linestyle ":" is a dotted line

plt.legend()
plt.show()

In [None]:
# We can also add a title and labels to the x and y axes
x = np.linspace(-10, 10, 100) # Let's get a different range of x values
y1 = x**2
y2 = x**3

plt.plot(x, y1, label="y = x^2", color="red", linestyle="--")
plt.plot(x, y2, label="y = x^3", color="blue", linestyle=":")

plt.title("Graph of y = x^2 and y = x^3") # This will add a title to the graph

plt.xlabel("x") # This will add a label to the x axis
plt.ylabel("y") # This will add a label to the y axis

plt.legend()
plt.show()

In [None]:
# We can also change the size of the graph
x = np.linspace(-10, 10, 100)
y1 = np.abs(x)                  # Let's do an absolute value function
y2 = np.sqrt(x + 10) * 10       # Let's do a square root function that will show on our graph

plt.figure(figsize=(9, 3))      # This will change the size of the graph. The first argument is the width, the second is the height

plt.plot(x, y1, label="y = |x|", color="red", linestyle="--")
plt.plot(x, y2, label="y = sqrt(x + 10) * 10", color="blue", linestyle=":")

plt.title("Graph of y = |x| and y = sqrt(x + 10) * 10")
plt.xlabel("x")
plt.ylabel("y")

plt.legend()

plt.show()

#### **0.7.3 - Other Numeric-to-Numeric Graph Types and Making Functions**

In [None]:
# Let's make a function that will cut this process down to one line, with the option for customization. We can also have it accept 
# an indefinite number of y values
def plot_line_graph(x, *args, title="Graph", xlabel="x", ylabel="y", figsize=(6.75, 5)):
    plt.figure(figsize=figsize)
    
    # We can use a for loop to iterate through the indefinite number of y values
    for i in range(len(args)): # len(args) will give us the number of y values
        plt.plot(x, args[i], label=f"y{i + 1}") # args[i] will give us the ith y value
    
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.show()
    
# Let's try it out
x = np.linspace(-10, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
y3 = np.tan(x)

plot_line_graph(x, y1, y2, y3, title="Graph of sin, cos, and tan") # We don't need to pass in the other parameters because 
                                                                   # they have default values

In [None]:
# We've gone over the basic line graph, so let's look at other numeric-to-numeric graphs. Let's try a scatter plot
def plot_scatter_plot(x, *args, title="Scatter Plot", xlabel="x", ylabel="y", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    for i in range(len(args)):
        plt.scatter(x, args[i], label=f"y{i + 1}") # Colors are automatically assigned to each set of points
    
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.show()
    
x = np.random.randint(1, 10, 10) # This will create an array of 10 random integers from 1 to 9 (the upper bound, 10, is excluded)
y1 = np.random.randint(1, 10, 10)
y2 = np.random.randint(1, 10, 10)

plot_scatter_plot(x, y1, y2, title="Scatter Plot of Random Points")

In [None]:
# Another numeric-to-numeric graph is the histogram. Let's create a histogram
def plot_histogram(x, bins=10, title="Histogram", xlabel="Number", ylabel="Frequency", color="green", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    plt.hist(x, bins=bins, color=color) # The "bins" argument is used to set the number of bins or groupings
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.show()

x = np.random.randint(1, 10, 100) # This will create an array of 100 random integers from 1 to 9 (the upper bound, 10, is excluded)

plot_histogram(x, bins=5, title="Histogram of Random Numbers", color="salmon")

#### **0.7.4 - Numeric-to-Categorical Graph Types**

In [None]:
# Category-to-numeric graphs are also possible. Let's create a bar graph
def plot_bar_graph(x, y, title="Bar Graph", xlabel="Category", ylabel="Number", color="orange", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    plt.bar(x, y, color=color) # This will create a bar graph
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.show()

x = ["A", "B", "C", "D", "E"] # This will create a list of strings
y = np.random.randint(1, 10, 5)

plot_bar_graph(x, y, title="Bar Graph of Random Numbers")


In [None]:
# Box plot!
# This one we won't use *args because it's a bit more complex. We will need to pass in a list of lists
def box_plot(data, labels, title="Box Plot", xlabel="Category", ylabel="Number", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    plt.boxplot(data, labels=labels) # This will create a box plot
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.show()
    
category1 = np.random.randint(1, 10, 100)
category2 = np.random.randint(1, 10, 100)

box_plot([category1, category2], ["Category 1", "Category 2"], title="Box Plot of Two Categories")

In [None]:
# Another numeric-to-category graph is the pie chart
def pie_chart(data, labels, title="Pie Chart", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    plt.pie(data, labels=labels, autopct='%1.1f%%') # This will create a pie chart. The "autopct" argument is used to display the percentage of each slice
    plt.title(title)
    
    plt.show()
    
data = [1, 2, 3, 4, 5]
labels = ["A", "B", "C", "D", "E"]

pie_chart(data, labels, title="Pie Chart of Data")

#### **0.7.5 - Categorical-to-Categorical Graph Types**

In [None]:
# A categorical-to-categorical graph can be a heatmap. This graph is used very often in data analysis
def heatmap(data, x_labels, y_labels, title="Heatmap", xlabel="X", ylabel="Y", figsize=(9, 3)):
    plt.figure(figsize=figsize)
    
    plt.imshow(data, cmap="viridis") # This will create a heatmap. The "cmap" argument is used to set the color map
    plt.colorbar() # This will add a color bar to the graph (the bar on the right side of the graph)
    
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.xticks(ticks=np.arange(len(x_labels)), labels=x_labels) # This will set the x ticks to the x labels
    plt.yticks(ticks=np.arange(len(y_labels)), labels=y_labels) # This will set the y ticks to the y labels
    
    plt.show()
    
data = np.random.randint(1, 10, (5, 5)) # This will create a 5x5 array of random integers from 1 to 9 (the upper bound, 10, is excluded)

x_labels = ["F", "G", "H", "I", "J"]
y_labels = ["A", "B", "C", "D", "E"]

heatmap(data, x_labels, y_labels, title="Heatmap of Random Data")

#### **0.7.6 - Clear Variables**

In [None]:
del category1, category2, data, labels, x, x_labels, y, y1, y2, y3, y_labels

#### **0.7.7 - Finishing StarLift**

We have the airport counts now, so all we need to do is figure out the potential profits based on the number of airports we remove from our network and then graph the results. We will want to remove the airport with the fewest trips associated with it first, which is why we sorted the values in the last step we did. At the very beginning, back in section 0.3, we defined some of our constants like PROFIT_PER_TRIP at 10000 and FEE_PER_AIRPORT at 20000. We found N_TRIPS to be 67663 and, while we haven't set it yet, we found the unique number of airports to be 3425. We will use all these numbers to calculate the profits and get the key-value pairs we will need to make sense of the data and visualize it. We can get these numbers using a for loop, some more boolean arrays, our friend <b><u>iterrows()</u></b>, and some new methods in <b><u>max()</u></b> and <b><u>idxmax()</u></b>.

The first thing we want to do before jumping into this is thinking about what we actually need our code to do. We want to get the total profits by multiplying the total trips by the profit per trip, then subtract the number of airports multiplied by the fee per airport. Our equation:

> <pre>profits = (N_TRIPS * PROFIT_PER_TRIP) - (N_AIRPORTS * FEE_PER_AIRPORT)</pre>

So this is great, but notice how all those variables are constants. That represents our initial state, before any changes. However, each time we remove an airport, we need to get rid of any instance of it from our log of trips, thereby getting a lower value than N_TRIPS and N_AIRPORTS each time we iterate. We will need to keep track of our airports remaining and our trips remaining. Let's start by just declaring those variables below.
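
To make the idea concrete before working with the real data, here is a minimal sketch of the remove-one-airport-at-a-time calculation in plain Python. The trips, removal order, and dollar amounts are all made up for illustration, and `network_profit` is a hypothetical helper, not something defined earlier in this notebook.

```python
def network_profit(n_trips, n_airports, profit_per_trip, fee_per_airport):
    """Profit = (trips * profit per trip) - (airports * fee per airport)."""
    return n_trips * profit_per_trip - n_airports * fee_per_airport

# Hypothetical toy data: four trips between three airports
trips = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "B")]
removal_order = ["C", "B", "A"]   # Assumed to be sorted from fewest trips to most
PROFIT = 10                       # Made-up profit per trip
FEE = 12                          # Made-up fee per airport

airports_remaining = len(removal_order)
results = [network_profit(len(trips), airports_remaining, PROFIT, FEE)]  # 0 airports removed

for airport in removal_order:
    # Drop every trip that touches the removed airport, as source or destination
    trips = [(src, dst) for (src, dst) in trips if src != airport and dst != airport]
    airports_remaining -= 1
    results.append(network_profit(len(trips), airports_remaining, PROFIT, FEE))

print(results)  # One profit value per number of airports removed
```

In this toy network the best profit comes after removing one airport. The real calculation below follows the same pattern with dataframes.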

In [None]:
N_AIRPORTS = unique_airports.shape[0]   # This will get the number of airports before removing any
airports_remaining = N_AIRPORTS         # This will be used to keep track of the number of airports that are still there
trips_remaining = N_TRIPS               # This will be used to keep track of the number of trips that are still there

Perfect. Now that we have some helpful variables we can use in our loop--N_AIRPORTS as a constant we can keep track of for the total number of unique airports, airports_remaining as the number of airports still being listed in our trips that will decrement by 1 each iteration, and trips_remaining as the number of trips still in our records by association of those airports still there--we will want to do just a few more things. We don't want to overwrite our df_sla variable in case we need to access the original data again, so we will make a copy. We also want to create another dataframe so we can record the number of airports we remove, which airport was removed, and what the profit comes out to be at each step.
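
Why a copy and not just another variable name? A short sketch with a hypothetical two-row dataframe shows the difference: assigning with "=" only creates a second name for the same dataframe, while `.copy()` creates an independent one.

```python
import pandas as pd

# Hypothetical toy dataframe standing in for the trip log
original = pd.DataFrame({"source_airport": ["A", "B"], "dest_airport": ["B", "A"]})

alias = original               # Just another name for the SAME dataframe
independent = original.copy()  # A separate dataframe with the same contents

independent.loc[0, "dest_airport"] = "Z"  # Does not touch the original
alias.loc[1, "dest_airport"] = "Y"        # Changes the original too

print(original)
print(independent)
```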

In [None]:
df_sla_copy = df_sla.copy() # This will create a copy of the dataframe so we don't change the original dataframe

# Here is our dataframe to store the profits. We haven't gone over all these elements in this context, but we have looked at the
# method "rename_axis" as well as attributes similar to "index" and "columns". See if you can figure out what is happening here, 
# and why we set the size of it to N_AIRPORTS + 1. If you can't figure it out, don't worry! We will go over it in the next code block
profits = pd.DataFrame(index=range(N_AIRPORTS + 1), columns=['profit', 'airport_removed']).rename_axis('airports_removed')

Now that we have created a copy of our dataframe and initialized another dataframe to store profits, let's begin computing the profits! The <b><u>.at[]</u></b> accessor is used to access a single value for a row/column label pair. It works like .loc[], but it is faster and more concise when we need to read or set just one value in the dataframe. With .at[], we specify the row label first and the column label second.
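
As a quick illustration, here is a minimal sketch using a small, hypothetical table with the same column names as our profits dataframe. The values are made up.

```python
import pandas as pd

# An empty 3-row table, created the same way as our profits dataframe
table = pd.DataFrame(index=range(3), columns=["profit", "airport_removed"])

# Set single cells with .at[row_label, column_label]
table.at[0, "profit"] = 100
table.at[0, "airport_removed"] = "--"

# Read a single cell back; .loc[] returns the same value
value_at = table.at[0, "profit"]
value_loc = table.loc[0, "profit"]

print(table)
print(value_at, value_loc)
```

Rows 1 and 2 stay empty (NaN) until something is written into them, which is what happens to our profits dataframe as the loop fills it in.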

In [None]:
# The easiest one to calculate will be when we remove 0 airports. We will just use the original dataframe and our constant variables, just
# like we showed above. We can write it directly into the dataframe.
profits.at[0, 'profit'] = PROFIT_PER_TRIP * N_TRIPS - FEE_PER_AIRPORT * N_AIRPORTS
profits.at[0, 'airport_removed'] = "--" # No airport was removed, so we will use "--" to represent that

profits.head()

In [None]:
airports_removed = 0 # We're already keeping track of airports_remaining, but this will be helpful with inserting into the profits dataframe

# Iterate through each airport, removing one airport each iteration. Our unique_airports dataframe is already sorted by the 
# total number of trips for each airport, so we can just iterate through it and remove the airport as we go
for index, row in unique_airports.iterrows():
    airport_id = index # Get the airport id. Remember, we set the index of our unique_airports dataframe to be the airport ids
    airports_removed += 1 # Increment the number of airports removed. This is the same as typing "airports_removed = airports_removed + 1"
    airports_remaining = N_AIRPORTS - airports_removed # Calculate the number of airports remaining
    
    # Find all instances of the airport in the dataframe, as the source or dest airport, and remove them. We do this with a 
    # boolean array, overwriting df_sla_copy each iteration. df_sla_copy will get smaller and smaller as we leave out more 
    # and more airports until it finally gets to 0 rows. We'll split it out again so it's a bit more readable (this is not a 
    # necessary step, but readable code is always good)
    df_sla_copy = df_sla_copy [
                                (df_sla_copy['source_airport'] != airport_id) # Keep rows where the source airport is not the one we're removing
                                & # AND
                                (df_sla_copy['dest_airport'] != airport_id) # Keep rows where the dest airport is not the one we're removing
                            ]
    trips_remaining = df_sla_copy.shape[0] # Calculate the number of trips remaining from our new df_sla_copy dataframe
    
    # Calculate the profit for the number of airports removed and store it in a variable called airport_profit. It will be overwritten each iteration
    airport_profit = (PROFIT_PER_TRIP * trips_remaining) - (FEE_PER_AIRPORT * airports_remaining)
    
    # Store the profit in the dataframe
    profits.at[airports_removed, 'profit'] = airport_profit
    profits.at[airports_removed, 'airport_removed'] = airport_id # Store the airport that was removed
    
profits.head()

Everything above might look like a lot, and sure it isn't the most basic code in the world, but let's write it all out again without the comments to hopefully make it look a little less intimidating. Don't get me wrong, ALWAYS COMMENT YOUR CODE! But you almost certainly will not need to comment it as extensively as we felt we needed to in this notebook to make everything clear.

In [None]:
# Initialize variables to help with calculations in the loop
airports_remaining = N_AIRPORTS
trips_remaining = N_TRIPS 
df_sla_copy = df_sla.copy()

# Initial state of profits with no airports removed
profits = pd.DataFrame(index=range(N_AIRPORTS + 1), columns=['profit', 'airport_removed']).rename_axis('airports_removed')
profits.at[0, 'profit'] = PROFIT_PER_TRIP * N_TRIPS - FEE_PER_AIRPORT * N_AIRPORTS
profits.at[0, 'airport_removed'] = "--"
airports_removed = 0 

# Iterate through each airport, removing one airport each iteration
for index, row in unique_airports.iterrows():
    airport_id = index 
    airports_removed += 1 
    airports_remaining = N_AIRPORTS - airports_removed 
    
    # Remove all instances of the airport in the dataframe
    df_sla_copy = df_sla_copy [(df_sla_copy['source_airport'] != airport_id) & (df_sla_copy['dest_airport'] != airport_id)]
    trips_remaining = df_sla_copy.shape[0] 
    
    # Calculate the profit for the number of airports removed and store it in the dataframe
    airport_profit = (PROFIT_PER_TRIP * trips_remaining) - (FEE_PER_AIRPORT * airports_remaining)
    profits.at[airports_removed, 'profit'] = airport_profit
    profits.at[airports_removed, 'airport_removed'] = airport_id
    
profits.head()

And look at that, we did it! After all this time and all this code, we have finally learned enough to get the profit for each number of airports removed, along with which airport was removed at each step. Let's graph it and then find our maximum profitability!

In [None]:
# Graph
x = profits.index # The x values will be the number of airports removed
y = profits['profit'] # The y values will be the profits themselves
plot_line_graph(x, y, title="Profit vs Number of Airports Removed", xlabel="Number of Airports Removed", ylabel="Profit")

As cool as that graph may be, we still don't know what our maximum profit is or at how many airports removed we found it. Let's use our <b><u>max()</u></b> and <b><u>idxmax()</u></b> methods mentioned above to figure it out. <b><u>max()</u></b> is a method that finds the max value in an array, while <b><u>idxmax()</u></b> is a method that finds the index label of the max value of the array. We can use these to get the profits and the number of airports removed respectively.
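
Here is a minimal sketch of the two methods on a made-up Series. The values are hypothetical. In Pandas the second method is spelled `idxmax()`.

```python
import pandas as pd

# Hypothetical profits, indexed by the number of airports removed
profit = pd.Series([4, 6, -12, 0], name="profit").rename_axis("airports_removed")

best_profit = profit.max()     # The largest value in the Series
best_index = profit.idxmax()   # The index label where that largest value sits

print(f"Max profit of ${best_profit} at {best_index} airports removed")
```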

In [None]:
max_profit = profits['profit'].max() # Find the max profit
max_profit_index = profits['profit'].idxmax() # Find the index of the max profit
print(f'Max profit of ${max_profit} at {max_profit_index} airports removed')

In [None]:
# Output the dataframe from row 2570 up to (but not including) row 2580
profits[2570:2580]

And there we have our answer! We can obtain a max profit of $355,960,000 if we remove 2575 airports from our network. This is a HUGE jump from where we are right now, which is losing a whole bunch of money.

In [None]:
del df_sla_copy, dest_counts, index, max_profit, max_profit_index, row, source_counts, airport_id,\
    airport_profit, airports_remaining, airports_removed, trips_remaining, x, y

## **0.8 - Exporting Our Findings**

Wow, what a journey we have been on to learn Data Wrangling using the powerful tool that is Python. We learned about variables, data types, libraries, conditional statements, functions, loops, and data visualization. However... it's all still just contained within this file. What, you think we want you to copy it all out by hand or now go over to Excel or PowerPoint to recreate your findings? Of course not! Why did we do all this if we were just going to go back there? Python has some different options we can try out to export our information to different viewing mediums.

#### **0.8.1 - Save Just Your Graph**

We can save the graph using a function in plt called <b><u>plt.savefig()</u></b>. The first argument is the file name, including its path. The file type is inferred from the file extension, though it can also be set explicitly with the optional *format* argument. A relative path like the one below is typically resolved from the folder that your IPYNB is stored in.
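
One way to combine plotting and saving is sketched here under assumptions: `plot_and_save` is a hypothetical helper (not one of the functions defined earlier in this notebook) that saves the figure before anything displays or closes it. The output goes to a temporary folder so the sketch does not touch your course files.

```python
import os
import tempfile

import matplotlib.pyplot as plt

def plot_and_save(x, y, path=None, title="Graph", xlabel="x", ylabel="y"):
    """Draw a line graph and, if a path is given, save it to that file."""
    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if path is not None:
        fig.savefig(path)  # File type is inferred from the extension (.png here)
    return fig

# Hypothetical example data and a temporary output location
out_path = os.path.join(tempfile.mkdtemp(), "example_graph.png")
fig = plot_and_save([0, 1, 2, 3], [0, 1, 4, 9], path=out_path, title="y = x^2")
plt.close(fig)  # Release the figure once we are done with it

print("Saved to:", out_path)
```

Returning the figure leaves the caller free to display it, save it again in another format, or close it.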

In [None]:
# Create our graph again. Unfortunately, we can't use our function to do this, since it calls plt.show() and by then the graph is
# no longer available for plt.savefig to save. You can recreate the function though and have an option to save the graph!
x = profits.index
y = profits['profit']
plt.plot(x, y)
plt.title("Profit vs Number of Airports Removed")
plt.xlabel("Number of Airports Removed")
plt.ylabel("Profit")

plt.savefig("./Course Outputs/StarLift Graph.png") # This will save the graph as a .png file

# If you wanted to save it as a .pdf, .svg, .jpg, etc., you would just change the file extension in the string

#### **0.8.2 - Export Data**

The two pieces of data we may want to save and store independently would be our two dataframes: one that stores our calculated profits in relation to which airports are removed, and the other that stores the number of trips each airport was involved in. We'll go over two ways to export these, the first being exporting as a .csv (comma-separated values) file and the other as an Excel file. If we use the latter method we can export both tables into different tabs in the same file.

First (to .csv):

In [None]:
profits.to_csv("./Course Outputs/Star Lift Profits.csv") # Save it in the "Course Outputs" folder

'''
# We can also export it to an Excel file. Uncomment this block if you would like to run it, but we will also export both our
# dataframes below to an Excel file using a slightly different method
profits.to_excel("./Course Outputs/Star Lift Profits.xlsx")
'''

Above in the commented-out code you saw how we would export one table to an Excel spreadsheet file, but to do both to the same file will be a tad more involved. Below, we will use the keyword <b><u>with</u></b> to open a file, which also safely closes the file when we are done using it, and in our case it will also create the file since the filename we entered as an argument doesn't already exist. We'll assign that file to the name "writer", and inside writer we will write our profits dataframe to a new sheet called "profits" and our unique_airports dataframe to a new sheet called "unique_airports". Since neither tab exists yet, they will be created for us.

Second (to .xlsx):

In [None]:
with pd.ExcelWriter('./Course Outputs/Star Lift Profits and Counts.xlsx') as writer: # Create new excel file and open it for writing
    profits.to_excel(writer, sheet_name='profits') # Write profits dataframe to 'profits' tab
    unique_airports.to_excel(writer, sheet_name='unique_airports') # Write the unique_airports dataframe to 'unique_airports' tab

#### **0.8.3 - For Fun: Save a Custom PDF File**

Now for some really fun stuff. We are going to use the <b><u>FPDF</u></b> library we imported all the way back in 0.2 to output some information to a PDF that we can create, all ready for presenting or distributing. We can put our graph there, some text, and even output our tables. Before we get started, we'll want to format our dataframes in a way that they'll look clean and professional in a presentation setting.

In [None]:
# Right now we are keeping track of the number of airports removed (profits) and the airport id (unique_airports) with the index. 
# It won't print out when we try to iterate through and print, so let's add it to the dataframe as a column instead of the index
profits["num_removed"] = profits.index
unique_airports["airport_id"] = unique_airports.index

# Then reorganize the columns so it's the first one
profits = profits[["num_removed", "profit", "airport_removed"]]
unique_airports = unique_airports[["airport_id", "source", "dest", "double_counts", "total"]]

# And rename columns in both dataframes for printing purposes
profits.columns = ["Number of Airports Removed", "Profit", "Airport Removed"]
unique_airports.columns = ["Airport ID", "Source", "Destination", "Double Counts", "Total Trips"]

Let's do our basic setup for our PDF document. We will create an FPDF object by using <b><u>FPDF()</u></b> and assigning attributes through parameters accordingly. Then we will set our margins with <b><u>set_margins()</u></b> and add a new page with <b><u>add_page()</u></b>. We'll also declare a variable called *page_width* and set our fill color with <b><u>set_fill_color()</u></b> for when we are working with tables later. For the variable *page_width* we will use the attributes <b><u>.w</u></b>, <b><u>.r_margin</u></b> and <b><u>.l_margin</u></b> to get the working space of our pdf by subtracting the left and right margins we already declared.

In [None]:
# Set up our pdf page and page configurations
pdf = FPDF(orientation="P", 
           unit="mm", 
           format="A4")
pdf.set_margins(left= 10, 
                right= 10,
                top= 10)
pdf.add_page()

# Some variables we'll use throughout
page_width = pdf.w - pdf.r_margin - pdf.l_margin # This will get the width of the page, minus the left and right margins (2 * 10)
pdf.set_fill_color(163, 163, 163)  # Set to Gray using RGB values

In [None]:
# Add in our top header
pdf.set_font("Arial", size=22, style="B")
pdf.cell(w=185, 
         h=20,
         txt="StarLift Airlines Network Profitability Evaluation", 
         ln=True, # This will move to the next line 
         align="C") # This will center the text

# Let's import the text we want to display from another file
with open("./Course Resources/section0-8-2StarLiftText.txt", "r") as file: # This will open the file in read mode
    text = file.read() # This will read the file and store the text in a variable 
    
par_indent = " " * 10 # Add an indent of 10 spaces to the first line of the paragraph

pdf.ln(10)

# Cell vs multi_cell: cell is for a single line of text, multi_cell is for multiple lines of text
pdf.set_font("Arial", size=11)
pdf.multi_cell(w=0, 
               h=10, 
               txt=(par_indent + text),
               align="L")

pdf.image("./Course Outputs/StarLift Graph.png", 
          w = 200, 
          h = 154)

In [None]:
# Next Page: Our Profits Table
pdf.add_page()

pdf.set_font("Arial", size=25, style="B")
pdf.cell(0, 10, "Profits", align="C")
pdf.ln(15)

pdf.set_font("Arial", size=10)

# Header
for col in profits.columns:
    pdf.cell(page_width / len(profits.columns), 10, str(col), border=1, fill=True, align="C")
pdf.ln()

# Data rows
for row in profits.head(11).values:
    for item in row:
        pdf.cell(page_width / len(profits.columns), 5, str(item), border=1, align="C")
    pdf.ln()
    
# Output the profits from 2570-2580 airports removed, highlighting #2575
pdf.set_font("Arial", size=25, style="B")
pdf.cell(0, 10, "...", align="C")
pdf.set_font("Arial", size=10)
pdf.ln()

for row in profits.iloc[2570:2580].values:
    # If the row is the one with 2575 airports removed, highlight it in yellow
    if row[0] == 2575:
        pdf.set_fill_color(255, 255, 0)
    else:
        pdf.set_fill_color(255, 255, 255)
        
    for item in row:
        pdf.cell(page_width / len(profits.columns), 5, str(item), border=1, fill=True, align="C")
    pdf.ln()

In [None]:
# Next Page: Our Airport Counts Table
pdf.add_page()

pdf.set_font("Arial", size=25, style="B")
pdf.cell(0, 10, "Airport Counts", align="C")
pdf.ln(15)

pdf.set_font("Arial", size=10)

# Header
for col in unique_airports.columns:
    pdf.cell(page_width / len(unique_airports.columns), 10, str(col), border=1, fill=True, align="C")
pdf.ln()

# Data rows
for row in unique_airports.head(11).values:
    for item in row:
        pdf.cell(page_width / len(unique_airports.columns), 5, str(item), border=1, align="C")
    pdf.ln()
    
pdf.set_font("Arial", size=25, style="B")
pdf.cell(0, 10, "...", align="C")
pdf.set_font("Arial", size=10)
pdf.ln()

for row in unique_airports.tail(10).values:
    for item in row:
        pdf.cell(page_width / len(unique_airports.columns), 5, str(item), border=1, align="C")
    pdf.ln()

In [None]:
pdf.output("./Course Outputs/StarLift Airlines Final Report.pdf")

Now that we have our PDF printed, let's return our dataframes to the format we'll want when working with them in Python. The indexes never changed; we just created new columns based on those indexes, so we only need to drop the columns we added and rename the remaining columns to what we had before.

In [None]:
# Drop the columns we added
profits = profits.drop(columns="Number of Airports Removed")
unique_airports = unique_airports.drop(columns="Airport ID")

# Rename to the way it was
profits.columns = ["profit", "airports_removed"]
unique_airports.columns = ["source", "dest", "double_counts", "total"]

#### **0.8.4 - Clear Variables**

In [None]:
del col, file, item, page_width, par_indent, pdf, row, text, writer, x, y

## **0.9 - Definitions and Extra Resources**

#### **0.9.1 - Terms from the Chapter**

##### Foundational Terms
- <b><u>Data Wrangling</u></b>: the process of gathering, cleaning, transforming, and organizing raw data into a format suitable for analysis.
- <b><u>Python</u></b>: a high-level, lightweight, general-purpose programming language. It is one of the most popular programming languages overall and arguably the most popular for data science.

##### 0.0 - Intro to IPYNB (file extension .ipynb)
- <b><u>IPYNB</u></b>: stands for "Interactive Python Notebook" and is a type of file that contains different cells that contain Python code, markdown text, or other content. These cells can be executed individually, allowing users to run Python code in separate, interconnected blocks. This interactive nature makes IPYNB files a powerful tool for data analysis, scientific computing, machine learning, and teaching programming concepts.
- <b><u>IDE</u></b>: stands for "Integrated Development Environment" and is a type of application that allows users to create, compile, and execute code. IDEs are designed to streamline the process of writing, testing, and debugging code by providing a unified interface for all these tasks. They often include features such as syntax highlighting and code completion.
- <b><u>Markdown</u></b>: a lightweight markup language that allows for easy formatting of text (as compared to HTML, which can be a bit more cumbersome).
    - Everything you are reading now has been written in Markdown.

##### 0.1 - Comments
- <b><u>Comment</u></b>: a line in code that is not processed as a command by the computer. It is used to make code more readable and easier to follow and understand.
    - <b><u>Single-line Comments</u></b> are comments that extend only to the end of the line they are on. They are made using a *#* symbol.
    - <b><u>Multi-line Comments</u></b> are comments written between triple quotes: three quotes (*'''*) start the comment and three end it. They can span multiple lines. Technically, Python treats this as a string that is never used, which is why it behaves like a comment.
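
As a quick illustration of the two comment styles defined above, here is a minimal sketch. The variable name `total_flights` is made up for this example and does not come from the StarLift exercise.

```python
# This is a single-line comment: Python ignores everything after the # symbol
total_flights = 3  # A comment can also follow code on the same line

'''
This is a multi-line comment written with three quotes.
Python reads it as a string that is never stored anywhere,
so nothing happens when this block runs.
'''
total_flights = total_flights + 1

print(total_flights)  # prints 4
```

Only the two assignment lines and the `print` actually do anything; the rest is there for the reader.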

##### 0.2 - Python Libraries and Import Statements
- <b><u>Python Library</u></b>: a collection of code, or modules of code, that someone else has already made that we can use in a program for specific operations. These libraries are designed to solve common programming problems or to provide access to specific features or functionalities, such as mathematical calculations, data manipulation, and more.
    - <b><u>Pandas</u></b>: a powerful open-source library used for data manipulation and analysis. It provides data structures and functions that allow users to efficiently handle structured data by offering high-level operations and functionalities. It is widely used in fields such as data science, finance, economics, and research due to its flexibility, ease of use, and comprehensive documentation.
    - <b><u>NumPy</u></b>: a powerful open-source library used for numerical computing. It is widely used in scientific computing, data analysis, machine learning, and various fields of engineering and research due to its speed, memory efficiency, and extensive capabilities.
    - <b><u>Math</u></b>: a built-in module in Python that provides access to a wide range of mathematical functions for performing mathematical operations. It is essential for performing complex mathematical computations accurately and efficiently in programs.
    - <b><u>Matplotlib</u></b>: a comprehensive library used for creating visualizations that are highly customizable and professional in appearance. It is widely used in fields such as data analysis, scientific research, machine learning, and data visualization due to its versatility, ease of use, and extensive plotting capabilities.
        - <b><u>Pyplot</u></b>: a module within matplotlib that provides a simplified interface for creating plots. It offers a MATLAB-like way of plotting and manages the display of figures on your screen.
    - <b><u>FPDF</u></b>: a library for PDF document generation under Python, ported from <a href="http://www.fpdf.org/">PHP</a>. It is not a well-known library, since relatively few people create PDFs in Python.
- <b><u>Code Documentation</u></b>: reference materials or documents that describe the functionality, usage, and inner workings of software code. These documents often include explanations of the purpose of each function, method, or module, as well as examples of how to use them in various contexts.

##### 0.3 - Variables, Data Types, and Intro to Using Libraries
- <b><u>Variables</u></b>: a named storage location in a computer's memory that stores a value that can be referenced and redefined/manipulated. Variables can be of many different data types.
    - <b><u>Integers</u></b>: a whole number (no decimal point).
    - <b><u>Floats</u></b>: a.k.a. floating point number. It is a data type that represents real numbers (has decimal points).
    - <b><u>Booleans</u></b>: a data type that can only have one of two values: True or False. We use booleans to control the flow of our code.
    - <b><u>String</u></b>: a sequence of characters.
- <b><u>Casting</u></b>: the process of converting a variable from one data type to another.
- <b><u>Data Structures</u></b>: a specialized format or framework for organizing and storing data in a computer's memory for efficient and easy access and manipulation. They are designed in a way as to maintain relationships between variables or values.
    - <b><u>Lists</u></b>: a default structure in Python that can hold multiple elements/variables of different types.
    - <b><u>Arrays</u></b>: arrays are not a built-in Python data type, but they are a common programming data structure. They behave much like lists; the conceptual difference is that an array holds only one data type while a list can hold many.
    - <b><u>Tables</u></b>: a two-dimensional array, or a list of lists. Tables typically imply structured data.
    - <b><u>Dataframe</u></b>: a 2-dimensional labeled data structure in the Pandas library with columns of potentially different types.
- <b><u>Naming Conventions</u></b>: a set of rules and guidelines for naming identifiers such as variables, functions, constants, and other elements within a program. Naming conventions help improve code readability, maintainability, and consistency by providing a standardized way of naming elements.
    - <b><u>Snake Case</u></b>: words are separated by underscores. Examples: *my_variable*, *num_1*, *really_cool_variable*.
    - <b><u>Camel Case</u></b>: words are joined together without spaces, with each word after the first word beginning with a capital letter. Examples: *myVariable*, *num1*, *reallyCoolVariable*.
    - <b><u>Pascal Case</u></b>: words are joined together without spaces, every word beginning with a capital letter. Examples: *MyVariable*, *Num1*, *ReallyCoolVariable*.
    - <b><u>Constant Variables</u></b>: words are separated by underscores, all letters are capitalized. Designates variables whose values are not supposed to change, or are constant. Examples: *CONSTANT_VARIABLE*, *MY_VARIABLE*, or *NUM_1*.
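
The terms in this section can be seen together in a short sketch. All names and values below are made up for illustration; they are not taken from the StarLift dataset.

```python
# Snake case variable names, one for each basic data type
airport_count = 12          # Integer
ticket_price = 149.99       # Float
is_international = False    # Boolean
airport_code = "SLC"        # String

# A constant, written in all capital letters by convention
MAX_PASSENGERS = 180

# Casting: convert a value from one data type to another
price_as_int = int(ticket_price)        # Drops the decimal part, giving 149
count_as_text = str(airport_count)      # Gives the string "12"
count_as_float = float(airport_count)   # Gives 12.0

# A list can hold elements of different types
mixed_list = [airport_count, ticket_price, is_international, airport_code]

# A table as a list of lists: each inner list is one row (made-up distances)
route_table = [
    ["SLC", "LAX", 590],
    ["SLC", "JFK", 1990],
]

print(type(price_as_int), price_as_int)
print(route_table[1][2])  # Second row, third column
```

Note that `int()` cuts off the decimal part rather than rounding, and that Python counts rows and columns starting from 0.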

##### 0.4 - Conditions and Conditional Statements
- <b><u>Condition</u></b>: an expression that evaluates to either <i>True</i> or <i>False</i>.
- <b><u>Conditional Statement</u></b>: a.k.a. a control structure or a conditional construct. It is a programming feature that allows different actions to be taken depending on whether a condition is true or false.
- <b><u>Logical Operators</u></b>: symbols or terms that are used in conditions to show how something is being evaluated. Examples include:
    - == (equal to),
    - \> (greater than), 
    - < (less than), 
    - != (not equal), 
    - \>= (greater than or equal to), 
    - <= (less than or equal to), 
    - and (keyword that combines two conditions and only returns true if both conditions are true on their own),
    - or (keyword that combines two conditions and will return true if at least one condition is true on its own),
    - not (keyword that reverses the truth value of a condition),
    - in (keyword to check if a value is contained in some data structure or string).
- <b><u>Assignment Operator (=)</u></b>: used to assign values to variables. A common programming error is to use the assignment operator (*=*) where a comparison operator (*==*) was intended. In Python, doing this inside a condition raises a syntax error, so the mistake is caught right away. In some other languages it silently redefines the value of the variable and can make the condition always true.
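
The operators listed above can be tried out in a small sketch like the one below. The profit value, the airport code, and the list of hub airports are hypothetical and exist only for this example.

```python
# Made-up values for illustration only
profit = 2500
airport_code = "SLC"
hub_airports = ["SLC", "ATL", "DFW"]

# Each of these conditions evaluates to either True or False
is_slc = airport_code == "SLC"                                    # Comparison with ==
is_profitable = profit > 0                                        # Comparison with >
is_profitable_hub = profit > 0 and airport_code in hub_airports   # and, in
is_not_hub = not (airport_code in hub_airports)                   # not

# A conditional statement picks one branch based on the conditions
if profit >= 5000:
    status = "high"
elif profit > 0:
    status = "moderate"
else:
    status = "loss"

print(is_slc, is_profitable, is_profitable_hub, is_not_hub, status)
# prints: True True True False moderate
```

Only the first branch whose condition is True runs, which is why `status` ends up as "moderate" and the `else` block is skipped.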

##### 0.5 - Functions
- <b><u>Functions</u></b>: an isolated and reusable block of code that performs a specific task. It accepts zero to many input arguments, performs operations based on those inputs, and can optionally return a result.
- <b><u>Methods</u></b>: functions that belong to a specific class or library. They are typically called on a variable and perform operations on the data contained within that variable object.
- <b><u>Arguments</u></b>: values passed to a function/method that allow it to perform the operations based on the user's preferences and specifications.
- <b><u>Scope</u></b>: defines the lifetime and visibility of a piece of code, including variables, functions, and methods.
    - <b><u>Global</u></b>: a level of scope that implies visibility everywhere in the code.
    - <b><u>Local</u></b>: a level of scope that implies visibility only in a specific section of code. Most commonly this can refer to a variable that is only visible inside a function.
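
To make functions, arguments, methods, and scope more concrete, here is a minimal sketch. The function `total_profit`, the constant `PROFIT_PER_FLIGHT`, and the numbers used are all assumptions made up for this example.

```python
PROFIT_PER_FLIGHT = 100  # Global: visible everywhere in this code (made-up number)

def total_profit(num_flights, cost=0):
    """Return the profit for a number of flights, minus an optional cost."""
    flight_revenue = num_flights * PROFIT_PER_FLIGHT  # Local: exists only inside the function
    return flight_revenue - cost

# Arguments are the values passed in: 10 and cost=250
result = total_profit(10, cost=250)
print(result)  # prints 750

# A method is a function attached to an object and is called with a dot
airport_code = "slc"
upper_code = airport_code.upper()
print(upper_code)  # prints SLC

# The local variable is not visible outside the function
try:
    print(flight_revenue)
    scope_message = "flight_revenue is visible here"
except NameError:
    scope_message = "flight_revenue only exists inside total_profit"
print(scope_message)
```

The `try`/`except` at the end is only there so the sketch keeps running; normally, using a local variable outside its function simply stops the code with a `NameError`.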

##### 0.6 - Loops
- <b><u>Loop</u></b>: a structure used to repeat processes or, generally, blocks of code.
    - <b><u>For Loop</u></b>: a repeating structure that is used to iterate over a sequence of elements such as a list, string, or other iterable objects. The loop iterates over each item in the sequence and executes a block of code for each item. Helpful when the number of repeats is calculable.
    - <b><u>While Loop</u></b>: a repeating structure that is used to repeatedly execute a block of code as long as a specified condition evaluates to True (boolean value). Helpful when the number of repeats needed is not known or is apt to change.
        - <b><u>Infinite Loop</u></b>: an error encountered when a while loop never reaches an end, leading to unexpected results, over-use of computer resources, and code that never ends.
    - <b><u>Nested Loops</u></b>: simply, a loop within a loop. Helpful when repeated processes are needed within another repeated process.
- <b><u>Boolean Array</u></b>: a.k.a. a boolean mask. It is an array-like data structure containing boolean values (True or False) indicating whether each element in the array it references meets a certain condition.
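
The loop types and the boolean mask defined above can be sketched with plain Python lists. This is a simplified illustration with made-up values; in NumPy or Pandas the mask would usually be built and applied without writing the loops by hand.

```python
# Made-up profit values, for illustration only
profits = [120, -40, 300, -15]

# For loop: visit each item once and build a boolean mask
is_positive = []
for profit in profits:
    is_positive.append(profit > 0)
print(is_positive)  # prints [True, False, True, False]

# Use the mask to keep only the items marked True
kept = []
for profit, keep in zip(profits, is_positive):
    if keep:
        kept.append(profit)
print(kept)  # prints [120, 300]

# While loop: repeat for as long as the condition is True
balance = 0
flights = 0
while balance < 500:
    balance += 120  # Updating balance is what prevents an infinite loop
    flights += 1
print(flights)  # prints 5

# Nested loops: the inner loop runs fully for every pass of the outer loop
pairs = []
for source in ["SLC", "LAX"]:
    for dest in ["JFK", "ATL"]:
        pairs.append((source, dest))
print(len(pairs))  # prints 4
```

If the line that updates `balance` were removed, the `while` condition would never become False and the loop would run forever.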

##### 0.7 - Data Visualization and Finishing StarLift
- <b><u>Numeric-to-Numeric</u></b>: a relationship between two categories of data that are both numeric.
    - *Graphs: Line graph, scatter plot, histogram.*
- <b><u>Numeric-to-Categorical</u></b>: a relationship between two categories of data where one is numeric and one is categorical.
    - *Graphs: Bar graph, box plot, pie chart.*
- <b><u>Categorical-to-Categorical</u></b>: a relationship between two categories of data where both are categorical.
    - *Graphs: Heatmap.*

##### 0.8 - Exporting Our Findings (No Definitions)

#### **0.9.2 - Extra Resources**

**Library Documentation**
- <a href="https://docs.python.org/3/library/index.html">Standard Python</a>
- <a href="https://pandas.pydata.org/docs/">Pandas</a>
- <a href="https://numpy.org/doc/1.26/index.html">NumPy</a>
- <a href="https://docs.python.org/3/library/math.html">Math</a>
- <a href="https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html">Matplotlib, specifically pyplot</a>
- <a href="https://pyfpdf.readthedocs.io/en/latest/index.html">FPDF</a>

**Websites that are a Programmer's Best Friend**
- <a href="https://stackoverflow.com">Stack Overflow</a>
- <a href="https://www.w3schools.com/python/default.asp">W3Schools</a>
- <a href="https://www.python.org/about/gettingstarted/">Python.org</a>
- <a href="https://www.learnpython.org/">LearnPython.org</a>

#### **0.9.3 - Credits**

Everything you see in this file was either created or compiled by Seth Brock, a student in Brigham Young University's Information Systems program. It was made in 2024 and made available that April. The idea behind the document was to create something for Global Supply Chain Management students who are interested in data wrangling to get the very basics of Python with Data Wrangling, taught in a context that may be familiar to them. The hope is that this file is structured in a way that someone with no programming experience at all could follow along, make sense of it, and pick up some useful skills. Content and definitions are not anything official, and where the author was not able to think of a cohesive definition, help was obtained from OpenAI's ChatGPT or GitHub's Copilot. The dataset used is a modified version of <a href="https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat">this open-source dataset</a> and its use in this file may not line up with its original intended use. Numbers used in calculations are almost certainly not correct and will not align with real-world numbers (like profit/flight or the cost to rent out space) but were used to teach the principles and get close to relevant GSCM contexts. A special thanks to all of my instructors and professors in the IS program, including Professor Keith, Professor Hilton, Professor Cutler, Professor Wells, Professor Reese, Professor Anderson, and Professor Schuetzler, as well as Professor Hathaway in the GSCM program, who helped give me the idea to do this while I was working as his teaching assistant. Any feedback is greatly appreciated.